AN ARITHMETIC PROCESSOR FOR ROBOTICS

by

JOSEPH KIN-SHING POON
B.Sc.(Hon.), Queen's University, 1983

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (Department of Electrical Engineering)

We accept this thesis as conforming to the required standard.

THE UNIVERSITY OF BRITISH COLUMBIA
May, 1985
© Joseph Kin-Shing Poon, 1985

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
1956 Main Mall
Vancouver, Canada V6T 1Y3
Date: June 27, 1985

ABSTRACT

An arithmetic processor chip for use in robotics has been designed in 4 μm CMOS technology. The chip is intended to function as a slave processor to a general purpose microprocessor host and be able to perform robot arm coordinate transformation calculations for use in real-time control applications. A parallel-processing, multi-pipelined architecture has been used to produce a 45 mm² chip for which a machine cycle time of 200 ns appears possible. The general nature of the architecture of this microprogrammable processor renders it useful for a range of computational tasks in robotics in addition to coordinate transformation.

TABLE OF CONTENTS

Abstract
Acknowledgement
1. Introduction
2. Coordinate Transformations in Robotics
3. Choosing an Appropriate Architecture
   3.1 Options for High Speed Computation
      3.1.1 Systolic Array for Band Matrix Multiplication
      3.1.2 Data Flow
      3.1.3 Control Flow
4. Implementation Considerations
   4.1 Constraints of the Fabrication Process
   4.2 Floating-Point vs Fixed-Point Implementation
   4.3 On-Chip Memory vs Off-Chip Memory
   4.4 Multiplication
   4.5 32 Bit vs 24 Bit Representation
   4.6 Reduced Instruction Set
   4.7 Outline of Possible Architecture
5. The APR Architecture
   5.1 The Register File
   5.2 The Adder Section
   5.3 The Multiplier Section
   5.4 The Clock Phases
   5.5 Fixed-Point Mode
   5.6 Input/Output and Control
6. Programming and Scheduling
7. Design of Cells
   7.1 Multi-port Register Design
   7.2 Barrel Shifter
   7.3 Leading Zero Checker
   7.4 Adders
   7.5 The Clock
   7.6 Programmable Logic Array
   7.7 Stack
8. Floor Plan of APR
9. The APRjr
   9.1 Speed Considerations
   9.2 Programming
   9.3 The APRjr Subsystem
10. Testing
   10.1 Prototype Testing
   10.2 Production Testing
11. Conclusions
   11.1 Contributions of this Thesis
   11.2 Suggestions for Further Research
Bibliography
   Cited References
   General References
Appendix A: Layouts
   A.1: 3-Port Register Cell
   A.2: 2 x 2 Register File
   A.3: One Barrel Shifter Cell
   A.4: 3-out-of-5 Barrel Shifter
   A.5: 4-Bit Leading Zero Checker
   A.6: Stack Cell
   A.7: 2-Bit by 2-Level Stack
   A.8: 2-Bit Par-Add Adder
   A.9: 1-Bit Static Regenerative Manchester Carry Adder
   A.10: PLA for Stack Control
   A.11: Clock Generator
   A.12: Flags
   A.13: Barrel Shifter Controller for APRjr
Appendix B: Simulations
   B.1: Writing a Zero to a Register
   B.2: Writing a One to a Register
   B.3: Writing a One to a Register with Noise
   B.4: Writing a One to a Register with Noise, Level 2
   B.5: Register Sourcing a Zero
   B.6: Register Sourcing a One
   B.7: Leading Zero Checker
   B.8: One Par-Add Logic Block
   B.9: Carry Propagation in One Static Regenerative Manchester Carry Adder Cell
   B.10: 2-Bit by 2-Level Stack
   B.11: Clock Generator
   B.12: PLA for Stack Control
Appendix C: Instruction Set
Appendix D: Scheduling the Adder and the Multiplier Sections

LIST OF FIGURES

2.1 A Six Degree of Freedom Manipulator
2.2 Format of Transformation Matrix
3.1 Example of Transformation Matrix
3.2 A Band Matrix of Width w
3.3 Systolic Array for Matrix Multiplication
3.4 Simplistic Implementation of a Data Flow Machine
3.5 The MIT Data Flow Computer
4.1 Yield Characteristics of ISOCMOS
4.2 Number Format for 16-bit Fixed-Point Representation
4.3 Error Estimation of Number Format
5.1 Block Diagram of APR in Floating-Point Mode
5.2 Clock Phases Required in APR
5.3 Block Diagram of APR in Fixed-Point Mode
5.4 Instruction Pipeline
7.1 Communication in Adder Mantissa Section
7.2 Inefficient Routing of Adder Mantissa Section
7.3 Final Floor Plan of Adder Mantissa Section
7.4 A 3-Port Register Cell
7.5 4-out-of-7 Barrel Shifter
7.6 Barrel Shifter Connected as Up/Down Shifter
7.7 4-bit Leading Zero Checker
7.8 Dynamic Regenerative Adder
7.9 Static Regenerative Adder
7.10 4-Bit Adder with Partial Carry Look-Ahead
7.11 Circuit Diagram of a Par-Add Block
7.12 End-Around-Carry Adder with Domino CMOS
7.13 Shift Register for Multi-Phase Clock Generation
7.14 Signals from Clock Generation Shift Register
7.15 Two-Phase Clock with Feedback to Guarantee Non-Overlapping
7.16 PLA for Stack Controller
7.17 2-bit by 2-level Stack
8.1 Floor Plan of APR
9.1 Block Diagram of APRjr
9.2 Floor Plan of APRjr
9.3 Instruction Formats
9.4 Block Diagram of the APRjr Subsystem

ACKNOWLEDGEMENT

I would like to thank my supervisors, Dr. D.L. Pulfrey and Dr. P.D. Lawrence, for their invaluable guidance and encouragement throughout this project. In addition, I would like to thank Dr. P. van Halen for his help in the use of the pen-plotter. This research has been supported by the Natural Sciences and Engineering Research Council of Canada, and Robotic Systems International, Sidney, British Columbia.

1. INTRODUCTION

In order to control the motion of a robot manipulator, it is necessary to know the position of the manipulator given the angles at its joints, and vice versa. For real-time applications, such calculations cannot be performed fast enough by a general purpose microprocessor.
A mainframe computer or a minicomputer with an attached array processor may be able to accomplish the task, but it is much more desirable to have a smaller unit, preferably one that would fit on the bus of a microprocessor system, that can perform the job. This would reduce cost, space, and power consumption.

In this thesis, a special purpose processor, designed to compute coordinate transformations in robotics and implemented in LSI, is described. The processor was designed taking into consideration factors such as available technology, speed, number of supporting chips required, ease of communication with a host processor, ease of programming, and power consumption. Two versions of the processor are discussed, namely the APR chip, which supports both floating-point and fixed-point arithmetic, and the APRjr chip, which can only handle fixed-point numbers. Both these chips were designed to compute the direct transform of an industrial manipulator [6] in a time short enough for the solution to be used to obtain the inverse transform by iteration for use in real-time trajectory planning. To obtain the necessary speed of operation, the designs embody parallel processing, independent operation of the multiplier and adder sections, non-interlocking pipelines, and a reduced instruction set. Because the APR chip can handle floating-point numbers, it has a potentially wider range of applications than APRjr. However, for the specific task of fast computation of the direct transform, APRjr is adequate and, in this initial investigation, its design was carried forward to the chip fabrication stage.

A brief introduction to coordinate transformations in robotics is presented in section 2. Section 3 investigates the various ways of performing coordinate transformations at high speed. A framework for a feasible design, taking into account the various constraints and requirements, is given in section 4. In section 5, the architecture of APR is described in detail. Since scheduling multiple resources is both an interesting topic in parallel computer architectures and an integral part of the APR design, section 6 is devoted to a discussion of how the computing resources on APR can be scheduled efficiently. The detailed circuit design and layout of the more important cells is the topic of section 7. Section 8 deals with routing and floor planning of APR. The architecture and design of APRjr is presented in section 9. In section 10, the procedures that can be used to test the chip design are outlined. Finally, in section 11, the contributions of this research are summarized and suggestions for further research outlined.

2. COORDINATE TRANSFORMATIONS IN ROBOTICS

A manipulator, or robot arm, consists of a number of rigid links connected by movable joints. Each joint has one degree of freedom, i.e., it can revolve about one axis if it is a revolute joint, or slide along one axis if it is a prismatic joint. For example, Figure 2.1 illustrates a manipulator with six degrees of freedom (six revolute joints). The axes about which the links rotate are also indicated. In order to control the movement of the manipulator, it is necessary to know the positions of the various links, in particular the position of the end effector (the free end of the manipulator). It is possible to have sensors at each joint to read the angles and send them back to the controller.
The task of determining in cartesian coordinates the position and the orientation of the end effector, given the angles of the joints (or the length of a prismatic joint), is called the direct transform in robotics. Conversely, the process of determining the magnitudes of the angles given the position and orientation of the end effector is called the inverse transform.

The direct transform can be computed by multiplication of 4 by 4 matrices. Each joint can be assigned a cartesian coordinate system, which can be related to the coordinate system of the immediately previous joint by a homogeneous transformation matrix (4 by 4) [27]. To compute the transformation matrix of the end effector relative to the base, it is only necessary to multiply the transformation matrices of all the joints together. For example, for the manipulator in Figure 2.1, if A1 is the transformation matrix relating the coordinate system of joint 2 to joint 1, and A2 is that which relates the coordinate system of joint 3 to joint 2, etc., then the coordinate system of the end effector with respect to the base of the manipulator is given by

T = A1 A2 A3 A4 A5 A6

Fig. 2.1: A Six Degree of Freedom Manipulator

The format of the matrix thus obtained is as shown in Figure 2.2. The first three columns specify the orientation of the end effector and the last column specifies the position of the end effector, all in cartesian coordinates.

    [ nx  ox  ax  px ]
    [ ny  oy  ay  py ]
    [ nz  oz  az  pz ]
    [  0   0   0   1 ]

Fig. 2.2: Format of Transformation Matrix

It should be pointed out that these coordinate transformations are only meaningful if the manipulator is rigid. A rigid manipulator is one which does not deform significantly even under loading conditions.

It can be seen that although the method of obtaining the direct transform is quite straightforward, the amount of computation involved is very large. It is possible to multiply the matrices out algebraically and obtain the equations for the four required vectors, which can then be simplified by factoring. However, these equations are still quite complicated for real-time use. For example, the simplified equations of the position vector of a simple manipulator [6], which has a construction similar to the one shown in Figure 2.1, are

px = C1[a2C2 - d6(S23C5 - C23C4S5)] + S1(d3 + d6S4S5)
py = S1[d6(C23C4S5 - S23C5) - a2C2] - C1(d3 + d6S4S5)
pz = d6(S23C4S5 + C23C5) + a2S2 + d1

where, for example, C23 = cos(θ2 + θ3), S23 = sin(θ2 + θ3), and the d's and a's are related to the dimensions and geometry of the manipulator.

The inverse transform is considerably more complicated than the direct transform. Not only is the form of the inverse transform very different for different manipulators, but it is also not unique (more than one set of joint angles can give rise to the same end effector position), extremely complicated (involving inverse trigonometric functions, divisions, square roots, etc.), and very often cannot even be found in closed form. However, it is possible to obtain the inverse transform by iterative application of the direct transform. In other words, a guess can be made of what the joint angles should be, the position of the end effector can then be computed using the direct transform with the guessed angles, and the error in the actual position of the end effector should enable a better estimate to be made the next time.

The direct transform is also useful in end-point servoing. Since the direct transform can compute the end-point position and orientation, no TV camera or end-point sensor is required; only the joint sensors are needed.
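The chain product above is easy to state in code. The following Python sketch (mine, not from the thesis) composes six homogeneous link matrices in the standard Denavit-Hartenberg form of [27]; the link parameters shown are placeholders rather than those of any particular manipulator.

```python
import numpy as np

def link_matrix(theta, d, a, alpha):
    """Homogeneous transform relating one link frame to the previous one
    (standard Denavit-Hartenberg form, as in reference [27])."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [ 0,       sa,       ca,      d],
                     [ 0,        0,        0,      1]])

def direct_transform(joint_angles, link_params):
    """Multiply the six link matrices together: T = A1 A2 A3 A4 A5 A6."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, link_params):
        T = T @ link_matrix(theta, d, a, alpha)
    return T  # columns hold n, o, a (orientation) and p (position)

# Made-up (d, a, alpha) parameters for six revolute joints, for illustration.
params = [(0.3, 0.0, np.pi/2), (0.0, 0.4, 0.0), (0.1, 0.0, np.pi/2),
          (0.35, 0.0, -np.pi/2), (0.0, 0.0, np.pi/2), (0.2, 0.0, 0.0)]
T = direct_transform([0.1, -0.4, 0.7, 0.0, 0.5, -0.2], params)
print(T[:3, 3])  # position vector p of the end effector
```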
It is very desirable that the direct transform be calculated as quickly as possible. Present day manipulators do not usually move very fast; typical sampling rates are less than 100 Hz. So it may be possible to calculate the inverse transform by iterative application of the direct transform, using a high performance general purpose microprocessor with an attached special purpose arithmetic processor. When computing trajectories in real time, this approach is far more practical than trying to compute the inverse transform directly by evaluating the inverse transform equations. These equations are very difficult to obtain, and they are so complicated that there are few special properties in them that can be exploited by employing special hardware. The requirement for divisions and large numbers of transcendental function calls also makes evaluation of these equations very slow. For example, even with modern high-speed arithmetic coprocessors, a transcendental function takes on the order of hundreds of microseconds to compute. In contrast, a high-speed multiplier takes about 50 ns to complete a 16-bit multiplication.

3. CHOOSING AN APPROPRIATE ARCHITECTURE

Because the present project is to design a chip for a special application, namely computing the direct transform in robotics, it is possible, and necessary, to choose an architecture that is most suitable for this particular application. The choice of the correct architecture is very important. Since raw processing speed is very much limited by the integrated circuit technology, such as CMOS or TTL, significant improvement in throughput can only be achieved with parallel processing and careful circuit design. However, as the degree of parallelism is increased, the amount of hardware will generally be increased, since parallel processing is essentially spreading out computation in processing elements instead of in time. Also, an increase in parallelism almost invariably leads to a decrease in generality of application. So the tradeoffs must be considered very carefully.

In order to decide what architecture is best suited to the present application, the properties of the problem must be analyzed, and then the hardware and software that will take best advantage of these properties can be designed. The following observations about the direct transform in robotics can be readily made:

1. It is obtained by multiplication of 4 by 4 matrices.

2. The matrices are usually quite sparse, with the last row always containing (0, 0, 0, 1), usually about 6 or 7 other zero entries, and a 1 or -1 entry somewhere. An example of a typical transformation matrix relating 2 adjacent links is shown in Figure 3.1.

    [ C4  0  S4   0 ]
    [ S4  0 -C4   0 ]
    [  0  1   0  d4 ]
    [  0  0   0   1 ]

Fig. 3.1: Example of Transformation Matrix

3. The overall transformation matrix is of the form shown in Figure 2.2, where n, o, and a are orthogonal vectors specifying the orientation of the end effector, and p is the position vector of the end effector relative to the base of the manipulator.

4. The computation requires addition, subtraction, multiplication, and sines and cosines of angles. For a manipulator with 6 degrees of freedom, there are 6 angles and so at least 12 trigonometric function calls are required.

5. If the transformation matrices are multiplied out algebraically, equations can be obtained for n, o, a, and p. These equations typically contain a number of common terms. For example, the term (C1C2C3 - C1S2S3)C4 + S1S4 appears in the x components of n, o, a, and p in the Excalibur manipulator [6].
This is a consequence of the fact that these vectors are obtained by matrix multiplication. Unless a dedicated matrix multiplication engine is used, algebraic expansion with factoring is the best way to compute the direct transform.

3.1 Options for High Speed Computation

In this section, various options for high speed computation are described, together with their general performance advantages; then the pros and cons of applying each option to the present problem are evaluated. Although there is no strict specification as to what the computation system should look like, it is reasonable to assume that the entire system must fit on a number of (preferably one) PC boards that would fit into a card cage connected to a common microprocessor bus. The power requirements of the system must be kept to a minimum. The system will be controlled by the host(s) on the bus. Therefore, options such as using a VAX with an array processor can be ruled out. In fact, because of the general purpose nature of these large computing systems, they are not necessarily much faster than a specially designed small system in solving this particular problem.

3.1.1 Systolic Array for Band Matrix Multiplication

Because the direct transform is obtained by multiplication of 4 by 4 homogeneous matrices, a dedicated matrix multiplication structure might be suitable. Much work has been done on ways of making highly parallel computing structures for the multiplication of matrices [22]. In the most general case, a systolic array can be used to multiply two n x n band matrices with band width w = p + q - 1. A matrix that is not banded is a special case where p = q = n. Figure 3.2 shows a band matrix with p = 3 and q = 2. The multiplication of two band matrices of widths w1 and w2 can be implemented with a systolic array of w1 x w2 processing elements (Figure 3.3) [22]. The time required for multiplication is 3n + min(w1, w2) cycles. During each cycle each processing element multiplies its two input operands, adds the product to the previous accumulated value, and passes the result to the next element in the next cycle.

Fig. 3.2: A Band Matrix of Width w (w = p + q - 1)

Fig. 3.3: Systolic Array for Matrix Multiplication

The advantages of such an arrangement are that the processing elements are identical, very little control is required, the interconnection is regular, communication is local, and the multiplication can be completed in a very short time. Applying this system to the coordinate transformation problem, it can be observed that n = 4, p = 4, and q = 3, so w1 = w2 = (4 + 3 - 1) = 6. Hence a total of 6 x 6 = 36 processing elements are needed. The number of cycles required would be 3(4) + 6 = 18 for multiplying two matrices. Therefore, to obtain the overall transformation matrix for a six degree of freedom manipulator, 18 x 5 = 90 cycles will be needed.

Because of the simplicity of the processing elements, there are a number of ways that they can be implemented:

1. A custom designed chip. Required on chip are: a multiplier-accumulator (MAC), a simple control section, and a few storage registers. It is possible to put all these on a single chip. The advantage of this design is that a small number of chips is required (if 36 processors can be considered small!). The disadvantage is that the on-chip MAC will be slow (more than 200 ns multiply time).
2. Use an off-the-shelf high speed MAC and design an LSI controller, or use MSI off-the-shelf components to build the controller. This approach requires a much larger chip count (at least 2 x 36). The high speed multipliers will also dissipate a lot of power; thirty-six high-speed MACs will dissipate around 25-50 W.

3. Use an off-the-shelf signal processing chip with on-chip MAC and memory. Candidate chips include the TMS320 [21], the NEC μPD7720 [30], or the HITACHI HSP [38]. These chips do not have the proper I/O pins required for communication if they are to be used as the processors in the systolic array. Therefore operation would have to be slowed down to get the operands in and the results out.

In summary, a systolic array implementation would take a large number of chips and a large amount of PC board area. The system would be very fast, but would only work for multiplying matrices. The only property of the problem that can be made use of is that the direct transform can be obtained by matrix multiplication. Extra processors will also be necessary to control information flow to and from the host, and to look up sines and cosines.

3.1.2 Data Flow

One of the relatively new computer architectures that facilitates parallel processing is data flow [35,36], so it is also worth investigating. An instruction in a data flow computer is carried out as soon as all the operands for the instruction are available. For example, the equation

a = (b + 1) * (b - c)

can be implemented with the following data flow instructions:

i1: (+ () 1 i3/1)    ;instruction 1: read an operand (the '()' represents an operand), add the constant 1 to it, and pass the result to operand 1 of instruction 3.
i2: (- () () i3/2)   ;read 2 operands, subtract the second operand from the first, and pass the result to operand 2 of instruction 3.
i3: (* () () a)      ;read 2 operands, multiply them together, and pass the result to the variable a.

The order in which these instructions are arranged in the stored program is immaterial (as opposed to a conventional control-flow program). The information about the flow of data and control is embedded in the instructions themselves. In i1, it is specified that an add operation will be performed on the first (and only) operand and the constant 1 when the operand is available, and that after the operation the result is to be passed to the first operand of instruction i3. A 'natural' implementation of this set of instructions is with three processors (Figure 3.4). The supply of b and c and the retrieval of a will be controlled by some other processors. The three processors shown operate on their data whenever all the data is available, independently of the other processors.

Fig. 3.4: Simplistic Implementation of a Data Flow Machine

Although in this simple example data flow does not increase throughput (the '*' block is a bottleneck), in the general case a data flow computer is faster than a control flow computer with the same number of processors. It is not necessary that the processors be synchronous; in fact, the system is more efficient if they are fully asynchronous. It is not possible to have a separate processor for each instruction in a complicated program. Therefore, in an actual data flow machine, the computing resources will have to be shared.
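The firing rule behind instructions i1-i3 can be mimicked in a few lines of software. The toy interpreter below is purely illustrative (none of its names come from the thesis): each instruction fires as soon as its operand slots are full, so listing i3 first changes nothing.

```python
import operator

# Toy data flow interpreter for a = (b + 1) * (b - c).
ops = {'+': operator.add, '-': operator.sub, '*': operator.mul}

# Program order is immaterial: i3 is listed first on purpose.
# Each entry: opcode, operand slots (None = still waiting), destination.
prog = {
    'i3': {'op': '*', 'args': [None, None], 'dest': 'a'},
    'i1': {'op': '+', 'args': [None, 1],    'dest': 'i3/1'},
    'i2': {'op': '-', 'args': [None, None], 'dest': 'i3/2'},
}

def send(dest, value, env):
    if '/' in dest:                       # route a result token to another instruction
        name, slot = dest.split('/')
        prog[name]['args'][int(slot) - 1] = value
    else:
        env[dest] = value                 # final result variable

env = {}
send('i1/1', 7, env)   # b = 7 feeds i1
send('i2/1', 7, env)   # b = 7 feeds i2
send('i2/2', 3, env)   # c = 3 feeds i2

fired = set()
while len(fired) < len(prog):             # keep firing whichever instructions are ready
    for name, ins in prog.items():
        if name not in fired and all(v is not None for v in ins['args']):
            send(ins['dest'], ops[ins['op']](*ins['args']), env)
            fired.add(name)

print(env['a'])   # (7 + 1) * (7 - 3) = 32
```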
An example of such a machine, the MIT data flow computer [35], is shown in Figure 3.5. In this system, instructions that have become ready are picked up by the arbitration network and sent to the processing section, which contains a number of processing elements, to be processed. The results are then sent to the control and distribution networks, which pass the data to the specified instructions. In this way, other instructions become activated and are in turn picked up by the arbitration network, and the sequence repeats itself.

For calculating the direct transform, the matrices can be multiplied out and the equations for n, o, a, and p obtained. This is the best way to calculate these vectors; matrix multiplication is only suitable for a systolic array implementation. The equations can then be implemented by writing an appropriate data flow program. The program is easier to write than a comparable one for a conventional computer because it is more natural. But the program only has to be written once for each manipulator, so this is not a major issue. The main advantage that a multiple processor data flow machine has over a multiple processor von Neumann machine is that of scheduling. In a data flow machine, scheduling is trivial: the programmer only needs to embed the proper execution sequence of the instructions in the instructions themselves, and the data flow computer takes care of the proper sequencing itself. In a control flow machine, the programmer has to make sure that the instructions are sequenced properly by suitable ordering of the instructions.

Fig. 3.5: The MIT Data Flow Computer (instruction cells feeding an arbitration network, processing elements, and control and distribution networks)

On the other hand, implementing a data flow computer presents many difficulties. First of all, it is not possible to integrate everything on one chip, especially if multiple processors are desired. It is possible to integrate the control, distribution, and arbitration networks, but because of the bottleneck nature of these networks, communication would be too heavy for a single chip to handle. Also, the frequent memory accesses in a data flow computer may make it inferior to a control flow computer in this application. In a control flow computer, high speed on-chip registers can reduce the number of memory accesses for data.

3.1.3 Control Flow

From the previous observations, it seems that the conventional control flow architecture is probably still the best way to meet the present requirements. There are many ways to build a high speed control flow machine. A custom architecture can be built using bit slice processors, or a large number of suitably connected advanced general purpose microprocessors can be used. But since the size of the overall computation system should be kept small, the best approach is to design a custom large scale integration processor. This processor will be called APR, for Arithmetic Processor for Robotics.

4. IMPLEMENTATION CONSIDERATIONS

The idea of a special purpose arithmetic processor for robotics is not new. APAC, a powerful processor similar to and more ambitious than this work, has been described in [21]. APAC has been estimated to require the integration of over 150K transistors using an unspecified "mid-80's" technology. Even though it is now 1985, such a technology is still not widely available through foundry services. However, foundries able to produce chips with 4 μm design rules and yields in excess of 60% for 50 mm² dies do exist [8]. Such specifications should be sufficient to enable the fabrication of a meaningful special purpose chip and to allow an examination of its effectiveness in a real system.
In this section, various implementation constraints of such a processor are considered, and an architecture that would meet these constraints with present day IC technology is outlined.

4.1 Constraints of the Fabrication Process

The process that is available is a 4 μm ISOCMOS process from GTE [8]. At the present time the GTE process is capable of producing chips of area around 25 mm² with a defect density close to 5.4 defects/cm². The yield for such chips is around 80%, which is very close to that predicted by the widely used SEEDS model [31]. Figure 4.1 shows the appropriate curve for this model. The curve suggests that the yield will only decrease to about 20% on increasing the chip size to about 75 mm². However, in practice, at least as regards GTE's foundry, the curve of Figure 4.1 is not followed beyond about 50 mm², and it has been found that for chip sizes above this figure the yield rapidly tends to zero [17].

Fig. 4.1: Yield Characteristics of ISOCMOS

The GTE process is not a particularly advanced process. Processes that have a minimum feature size of 1.5 μm are not uncommon these days. Also, quite a few of the recently designed powerful microprocessors, such as RISC II [7,16,24,26,32,33,34] and MIPS [9,10,11,12], have fairly large die sizes (RISC II measures 10.3 mm x 5.8 mm [16], MIPS measures 7.5 mm x 8.4 mm [9]; both use 4 μm nMOS technology).

The GTE ISOCMOS process is a p-well process, with a single metal layer and two polysilicon layers. The second poly is used for making capacitors. Capacitors are useful in analog circuits, but are seldom used in digital circuits. Again, such a technology is not very advanced. Many processes nowadays have two metal layers. With two metal layers, not only can layouts be made a lot more compact, but the circuits can be made to run much faster. This is because polysilicon layers usually have a sheet resistance orders of magnitude larger than metal, and the RC delay caused by long polysilicon lines is usually the cause of timing bottlenecks in a digital integrated system. The second polysilicon layer is useless to the present project. Because it is deposited after the field oxide is deposited, transistors cannot be made with this layer. Moreover, poly1 cannot be crossed with poly2, since wherever poly2 runs over poly1 a fairly large capacitor is formed, greatly increasing the RC delay and causing crosstalk.

A rough estimate of the number of transistors that can be integrated on a chip of size 7 mm x 7 mm with this 4 μm CMOS process is about 20,000. This estimate assumes that the design is fairly irregular, such as a microprocessor, as opposed to a regular layout such as a memory chip.

4.2 Floating-Point vs Fixed-Point Implementation

Should the data be represented in fixed-point or floating-point format? No definite answer can be given before a thorough simulation is done. However, rough estimates can be made. If all distances are represented with a 16-bit fixed-point format as shown in Figure 4.2, then numbers from about -16 to +16 can be represented with a resolution of 2^-11 ≈ 4.8 x 10^-4. If metres are taken as the unit of measurement, then distances of up to 16 m can be represented with a precision of about 0.5 mm. By shifting the position of the binary point, larger numbers can be represented with a coarser precision, or smaller numbers with a finer precision.

Fig. 4.2: Number Format for 16-bit Fixed-Point Representation (bit 15: sign; bits 14-11: integer part; binary point; bits 10-0: fraction)
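A quick sketch of this format follows. Figure 4.2 fixes the field widths but not the coding of negative values, so sign-magnitude encoding is assumed here purely for illustration.

```python
# Sketch of the Figure 4.2 format: 1 sign bit, 4 integer bits, 11 fraction
# bits, giving a range of about +/-16 with a resolution of 2**-11
# (~0.5 mm if the unit of measurement is metres).

SCALE = 1 << 11                       # 2**11 fractional steps per unit

def to_fixed(x):
    assert -16 < x < 16, "outside the 4-integer-bit range"
    mag = min(int(round(abs(x) * SCALE)), 0x7FFF)   # clamp at max magnitude
    return (0x8000 if x < 0 else 0) | mag           # assumed sign-magnitude coding

def from_fixed(w):
    mag = (w & 0x7FFF) / SCALE
    return -mag if (w & 0x8000) else mag

x = 1.2345
print(abs(from_fixed(to_fixed(x)) - x) <= 1 / (2 * SCALE))   # True: within half an LSB
```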
The effects of error propagation must also be considered. Consider for example px of the Excalibur manipulator,

px = C1[a2C2 - d6(S23C5 - C23C4S5)] + S1(d3 + d6S4S5)

Normally the accuracies of the sines and cosines of the angles are limited by the accuracies of the sensors rather than by the data format. Typically the accuracy of a position sensor is around 0.1% to 10%, much poorer than what is offered by the above format. Because the effect of the errors introduced by the fixed-point format is to be investigated, it is assumed that the sensors are as accurate as the data format. Also, one can almost always find more accurate sensors, although they would undoubtedly cost a lot more. With this assumption, a maximum of π can be represented by the same format shown in Figure 4.2 with a resolution of π/2048, using only the lower 11 bits to represent the angle. The higher order bits can be used for memory partitioning. Such an arrangement would allow off-the-shelf sine/cosine generators such as the AM29526/27 and AM29528/29 to be used. If the host uses straight binary numbers to represent the angles, then a conversion is easily performed by multiplying the angles by a suitable conversion factor. This can be done by APR.

When sines or cosines are multiplied together, the error will only diminish, because sines and cosines of any angle always have a magnitude less than one. Only when terms containing lengths are multiplied together will the errors be magnified. Notice also that in this equation (and all the others in the direct transform of the Excalibur manipulator) it is never necessary to take the difference of two terms with magnified errors and then multiply it by a big number (larger than one). Assuming that the maximum length of each link is 4 metres (which is a more than reasonable assumption for most present day manipulators), the worst case error estimation is shown in Figure 4.3. It can be seen that there is an error of only about 1 mm in the x-component of the position vector of the end effector. Error estimation is the same for the other vector components. The actual errors will be bigger because of the errors introduced by the sensors.

Fig. 4.3: Error Estimation of Number Format (worst-case errors propagated through px with all operand magnitudes bounded by 4; the intermediate term errors range from about 0.0001 to 0.0008, and the final error in px is about 0.001, i.e. roughly 1 mm)

Since a 16-bit fixed-point number format seems to suffice, is there any reason why a floating-point processor should be preferred? Floating-point representation requires more hardware. In VLSI, increasing the amount of hardware implies increased design cost and potentially slower operation speed, in addition to increased chip size. Floating-point arithmetic also takes a longer computation time than fixed-point arithmetic, but this difference in speed is diminished with the use of pipelined systems. Also, in possible applications of APR other than computing direct transforms, such as robot dynamics and control, computer graphics, and digital filtering, floating-point arithmetic is often necessary. Floating-point arithmetic can also increase the accuracy of direct transform computations by reducing error accumulation. Therefore, floating-point arithmetic is highly desirable. Moreover, since the hardware for fixed-point arithmetic is basically a subset of the hardware for floating-point arithmetic, a processor designed for floating-point arithmetic can easily be extended to handle fixed-point arithmetic as well.
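Rounding out the angle format of this section, here is a sketch of the 11-bit angle encoding used to drive a table look-up sine/cosine generator; the exact index mapping below is my assumption, not a specification from the thesis.

```python
import math

# Lower 11 bits hold the angle with a resolution of pi/2048, so indices
# 0..2047 span 0..pi, as described in section 4.2. Suitable for feeding a
# look-up generator such as the AM29526/27.

STEP = math.pi / 2048

def encode_angle(theta):
    return int(round(theta / STEP)) & 0x7FF    # keep only the lower 11 bits

def decode_angle(bits):
    return (bits & 0x7FF) * STEP

theta = 1.0
bits = encode_angle(theta)
print(bits, abs(decode_angle(bits) - theta) <= STEP / 2)   # 652 True
```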
4.3 On-Chip Memory vs Off-Chip Memory

The program for computing the transformation equations is static, i.e., it never changes for the same manipulator. Therefore, it is possible to put the program on chip as a ROM. RAM is not desirable because the chip would then have to be bootstrapped by writing to the RAM before any computation could start. The lower packing density of RAM also places a severe limitation on program size. Since the instructions are likely to be very wide (32 bits), 32 pins can be saved with the program on chip. Instruction fetch cycles can also be faster with an on-chip ROM, but instruction fetch cycles are not the potential bottlenecks of the processor. Because of the special purpose nature of the chip, not too many will be made for any particular application. With the program stored in an off-chip PROM, program development is a lot easier, and larger programs are also more feasible. Therefore, the decision was made that the program would not be integrated on-chip.

As for data memory, it must be implemented in RAM. With the available ISOCMOS process, it is not possible to have a large on-chip RAM; a 4 Kbit RAM requires 23 mm² in this process. Therefore, it is not feasible to put all the data memory on-chip. The best that can be done is to have a relatively large register file of about 32 units. Most general purpose processors have 16 general purpose registers or fewer. A large register file can reduce the number of memory accesses. Off-chip RAM must be used for the main data memory.

4.4 Multiplication

In computing the direct transform, many multiplications are required. Multiplications can be performed in computers either by shift and add, or by the use of a parallel multiplier. Performing multiplication by shift and add requires minimal hardware, but is very slow: multiplying two 16-bit numbers together by shift and add requires about 6.4 μs with a 4 μm CMOS process. However, with suitable pipelining and parallel processing, high throughput can be attained in some applications [3]. Parallel multipliers, on the other hand, are a lot faster than their sequential counterparts. Off-the-shelf 16-bit integer multipliers today have multiply times of around 50 or 60 ns (e.g. the Weitek WTL 2010). But parallel multipliers require a lot of silicon real estate; a 16-bit parallel multiplier would require about 3 mm x 3 mm in a 4 μm CMOS process. In the present project, it is not possible to achieve high throughput by pipelining slow serial multipliers. It is also not possible to integrate a parallel multiplier on-chip. The best alternative, then, is to use an off-the-shelf high speed integer multiplier.

4.5 32 Bit vs 24 Bit Representation

Since the largest high speed integer multipliers available today are 16-bit multipliers, mantissa multiplication is fixed at 16 bits. However, it is still possible to have a full 32-bit (23-bit mantissa) IEEE floating-point standard [4] adder and full 32-bit registers. This, for example, is the approach taken by the TMS320 [21]. For the present application, a 16-bit mantissa appears sufficient. Besides, there are so many multiplications in this application (about twice the number of additions) that the extra accuracy offered by a 32-bit adder would probably not provide much better overall accuracy.
Using a sign-magnitude representation for the mantissa (as in the IEEE standard), the magnitude of the mantissa can be represented with 16 bits on-chip, with another bit representing the sign, thus effectively giving a 25-bit representation. This extra 1 bit of accuracy can only be preserved when operands are kept on chip. For communication with external memory or the host, it is better to keep the number format to 24 bits, which is more convenient because it is a multiple of 8.

4.6 Reduced Instruction Set

Reduced instruction set computers have become very popular recently, the most notable designs being the RISCs and MIPSs. A reduced instruction set implies a simpler control section, which in turn means a reduction in chip area, reduced design time, less chance of design errors, and faster execution speed if the instructions are chosen properly. Since APR is designed to be a special purpose arithmetic processor, its instruction set should support only the functions that are necessary for this purpose, namely coordinate transformation in robotics.

4.7 Outline of Possible Architecture

In summary, then, APR should handle 32-bit floating-point and 16-bit fixed-point arithmetic. A parallel multiplier is needed, which will be an off-the-shelf high speed integer multiplier. The program memory is to be stored in an off-chip PROM. Except for a large high speed register file on chip, data memory is stored in an off-chip RAM. The control section must be made as simple as possible, implementing only the instructions that are needed most often.

5. THE APR ARCHITECTURE

A block diagram of the overall architecture of APR is shown in Figure 5.1. APR can handle both 24-bit floating-point numbers and 16-bit fixed-point numbers. Figure 5.1 shows APR in its floating-point mode. This high speed floating-point unit design is a conventional design similar to that described in [1]. Both the adder section and the multiplier section are divided into a mantissa half and an exponent half. A 25-bit by 31 word 3-port register file is shared by the two sections. The internal representation of the mantissa in APR is always 17-bit sign-magnitude. The exponent is represented by 8 bits with a +127 bias. This format is close to the IEEE floating-point standard and will thus allow host processors that use this IEEE standard to send data to APR directly, albeit with a loss of 8 bits of accuracy because of the shorter mantissa used in APR.

Fig. 5.1: Block Diagram of APR in Floating-Point Mode (mantissa and exponent register files; adder section with alignment and normalization barrel shifters, leading zero checker, and exponent subtractor; off-chip multiplier with data I/O; control section with IR, PC, and program address adder)

5.1 The Register File

The register file has two read buses and one write bus. During every cycle, the register file operates twice. First it supplies the two operands to the multiplier, while at the same time storing back the current result from the multiplier section. Second, it does the same for the adder section. Because N-channel pass transistors are used for the read and write enable switches in the register cells, the read buses are precharged before every run. The write bus does not need to be precharged because it is sourced by full CMOS circuits. More will be said about the register file in section 7.
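The number format just described can be sketched as follows. The field widths (1 sign + 16 mantissa magnitude bits, 8 exponent bits with a +127 bias) are from the text; the normalization convention and scaling in this sketch are my own assumptions for illustration.

```python
# Sketch of APR's internal floating-point layout. Assumed convention:
# magnitude normalized into [0.5, 1.0), value = (m16 / 2**16) * 2**(exp - 127).

def pack(x):
    sign, mag, exp = (1 if x < 0 else 0), abs(x), 127
    if mag != 0:
        while mag >= 1.0:                  # normalize into [0.5, 1.0)
            mag /= 2.0; exp += 1
        while mag < 0.5:
            mag *= 2.0; exp -= 1
    m16 = min(int(round(mag * (1 << 16))), 0xFFFF)   # 16-bit magnitude field
    return sign, exp, m16

def unpack(sign, exp, m16):
    val = (m16 / (1 << 16)) * 2.0 ** (exp - 127)
    return -val if sign else val

print(pack(-3.25))             # (1, 129, 53248)
print(unpack(*pack(-3.25)))    # -3.25: the triple round-trips exactly
```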
5.2 The Adder Section

In floating-point mode, the adder section is a 5-stage pipeline, similar to that described for APAC [20]. Operands are fetched from the register file in the first stage and, in the next stage, the exponents are compared. The larger exponent is passed to the next stage in the exponent half, and the mantissa of the smaller number is shifted down the appropriate amount to align the binary point. The mantissas are then added or subtracted in the next stage, while the exponent goes through a one stage delay. In the fourth stage, the mantissa is normalized and the exponent updated. Overflows and underflows are checked in this stage, and the appropriate flags set. In the final stage, the result is stored back in the register file. If the hard limit flags are set, the number stored will be set to the maximum magnitude representable with the same sign, or to zero, depending on whether overflow or underflow has occurred, respectively.

The adder mantissa section consists of a fraction-alignment 16-bit barrel shifter, a fast 18-bit adder, a leading zero checker, and a normalization 16-bit barrel shifter. The exponent half consists of an exponent comparison subtractor and an exponent adjustment adder. The use of sign-magnitude format, together with the requirement that the mantissa must never overflow (overflow and underflow must be handled in the normalization stage), forces the use of an 18-bit adder to handle the 17-bit numbers. Also, the mantissa adder must be implemented as an end-around-carry adder. This is necessary because in sign-magnitude representation the magnitude of a number must always be positive. Therefore, when the difference between two mantissas is negative, the magnitude of the mantissa must be kept positive, with the sign changed to indicate a negative number. This is best done with an end-around-carry adder using 1's complement to represent negative numbers [1]. However, this in effect requires the adder to operate twice during each clock cycle, so a fast carry look-ahead scheme must be employed. The "par-add" scheme developed by Brent and Kung [2] and recently implemented in nMOS [37] was chosen for this task.

Both barrel shifters use the design described in RISC II [34]. The fraction alignment shifter is a down shifter, while the normalization shifter is an up shifter. The leading zero checker detects the number of leading zeros in the adder result and controls the normalization shifter so that the mantissa always comes out normalized in floating-point mode. It also feeds the shift amount to the exponent adjustment adder, which adjusts the exponent of the result accordingly. In order to synchronize the mantissa and exponent halves, one extra delay stage has to be added between the exponent comparison subtractor and the exponent adjustment adder. Both of these adders are regenerative Manchester carry adders [14]. The exponent comparison subtractor is an 8-bit adder, whereas the exponent update adder needs to be 9 bits wide to distinguish between overflow and underflow. When either overflow or underflow occurs, the result is hard limited to the largest magnitude representable or to zero, respectively. Because it is not known which input to the exponent comparison adder will be the larger, the difference may come out to be either positive or negative. This is handled by a PLA that maps both a positive and a negative difference to a positive shift amount, which is then used to control the first barrel shifter.
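Reusing the (sign, exponent, magnitude) triples from the packing sketch above, the five stages can be walked through behaviourally. This models the data flow only; rounding and the hard-limit flags are omitted, and the helper names are mine.

```python
# Behavioural walk through the adder pipeline: fetch, exponent compare and
# align, mantissa add/subtract, normalize, write back.

def float_add(a, b):
    (sa, ea, ma), (sb, eb, mb) = a, b
    # Stage 2: compare exponents; align the smaller number's mantissa.
    if ea < eb:
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    mb >>= min(ea - eb, 16)               # fraction-alignment barrel shifter
    # Stage 3: add or subtract magnitudes. In hardware the end-around-carry
    # adder keeps the magnitude positive; here Python integers do the same job.
    if sa == sb:
        s, m = sa, ma + mb
    elif ma >= mb:
        s, m = sa, ma - mb
    else:
        s, m = sb, mb - ma
    e = ea
    # Stage 4: normalize via the leading-zero checker; update the exponent.
    if m >= (1 << 16):                    # magnitude overflowed by one bit
        m >>= 1; e += 1
    while m and m < (1 << 15):            # shift out leading zeros
        m <<= 1; e -= 1
    return (s, e, m)                      # Stage 5 would write this back

x = float_add((0, 129, 53248), (0, 127, 40960))   # 3.25 + 0.625
print(x)   # (0, 129, 63488), i.e. 0.96875 * 2**2 = 3.875
```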
5.3 The Multiplier Section

The multiplier section is a 4-stage pipeline. In the first stage, the operands are fetched and, in the next stage, the mantissas are multiplied by the parallel multiplier. The exponents just go through a delay in this stage. In the third stage, the result mantissa is normalized by the barrel shifter. The exponent section in this stage is split into two sub-stages: in the first sub-stage, the two exponents are added together; in the second sub-stage, the exponent bias is restored by subtracting one bias amount from the result. The effect of shifting the mantissa is taken care of by feeding either a 1 or a 0 to the carry-in of the exponent adder. For the same reason presented in the adder section, this exponent update adder in the multiplier section must be 9 bits wide. The correct result is then stored back in the register file in the fourth stage. In floating-point multiplication the resulting mantissa never needs to be shifted more than one position, so there is no need for a barrel shifter and a 16-bit leading zero checker. These two blocks are needed, however, in fixed-point mode, when the numbers are not normalized.

5.4 The Clock Phases

In order to synchronize the above operations, a clock with more than two phases is required. The required clock phases are shown in Figure 5.2, and are generated as described in section 7.5.

Fig. 5.2: Clock Phases Required in APR

The operation of the multiplier section in floating-point mode is as follows. During φ1, the register file is precharged. In φ3, operands to the multiplier are read from the register file. The tri-state pads connecting APR to the external multiplier are enabled as output pads during all of φ2, so that the operands are driven out through the pads with the whole duration of φ2 available. During the next φ1, the operands are fed to the inputs of the multiplier. The multiplier operates in the ensuing φ2, and the product is sent back to APR in φ1. The product is normalized in φ2, and then stored back in the register file in φ3, after a delay in φ1. In the adder section, the operands are read from the register file in φ0. The fractions are then aligned during the next φ2, after a simple register transfer in the φ1 in between. Then the aligned fractions are added in the next φ2. The sum is normalized in another φ2, and the normalized sum is finally stored back in the register file in the next φ0. It should be noted that APR basically operates with a two-phase clock consisting of φ1 and φ2. The phases φ3, φ4, and φ0 are needed so that the register file can be operated twice every machine cycle.

5.5 Fixed-Point Mode

Figure 5.3 shows APR in fixed-point mode. In this mode, the two barrel shifters in the adder section are bypassed, so that the adder section becomes a 3-stage pipeline. The multiplier section stays as a 4-stage pipeline, but the shift amount is now a constant controlled by the control unit. The location of the binary point is controlled by the program, and the barrel shifter in the multiplier section is used to realign the binary point after a multiplication. The exponent half of the adder section is completely shut off in fixed-point mode, as are the exponent register file and the exponent half of the multiplier section. However, the adder in the multiplier section is used to check for overflows. This is achieved by subtracting the number of leading zeros in the multiplier result from a preset value and flagging any negative differences.

Fig. 5.3: Block Diagram of APR in Fixed-Point Mode
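The multiplier pipeline of section 5.3 can be modelled in the same behavioural style as the adder sketch; again this is an illustration under the format assumptions stated earlier, not the hardware algorithm itself.

```python
# Behavioural model of the multiplier pipeline: multiply 16-bit mantissa
# magnitudes, add the exponents and restore the +127 bias, then normalize.
# For normalized operands the result needs shifting by at most one position,
# which is why no barrel shifter is required in floating-point mode.

def float_mul(a, b):
    (sa, ea, ma), (sb, eb, mb) = a, b
    s = sa ^ sb                        # sign-magnitude: signs simply XOR
    prod = ma * mb                     # 16 x 16 -> 32-bit product
    e = ea + eb - 127                  # add exponents, restore the bias
    m = prod >> 16                     # top 16 bits of the product
    if m < (1 << 15):                  # at most one normalizing shift
        m = (prod >> 15) & 0xFFFF
        e -= 1
    return (s, e, m)

x = float_mul((0, 129, 53248), (0, 127, 40960))   # 3.25 * 0.625
print(x)   # (0, 129, 33280), i.e. 0.5078125 * 2**2 = 2.03125
```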
5.6 Input/Output and Control

For communication with external memory, APR can perform both direct and indirect reads and writes with 24-bit words. In order to make APR compatible with 8-bit wide system data buses, APR can also do direct reads and writes with byte-wide words. No indirect read/write instructions are available for byte-wide data. Table look-up for sines and cosines, as required in the computation of direct transforms, and DMA to the system memory are provided by appropriate memory mapping.

The condition code register has 9 flags. The overflow, zero, and negative flags are set by the data path in the usual way. Of the six remaining flags, 3 are input flags which are set by external circuits, and 3 are output flags that can be set by the APR control unit. These flags can be used for interrupt requests, bus grants, and other control signals.

The program counter is 16 bits wide and is not connected to the data path. Only absolute direct jumps are allowed. Conditional jumps are possible, using the zero flag as the jump condition. In order to allow simple subroutine calls, a two-level stack is also included in the PC section. Because of the simple instructions required, the control section comprises only a simple PLA.

Figure 5.4 shows the instruction fetch and execution pipeline. Every cycle a new instruction is fetched from the program memory and decoded in the following cycle. Execution of the instruction is then started in the cycle after that, although the instruction may take more than one cycle to finish.

Fig. 5.4: Instruction Pipeline (program address calculation and instruction fetch, decode, execution)

As in RISC and MIPS, and for reasons clearly described in [9], the instruction fetch and execution pipelines are not interlocked. In other words, when a conditional branch is encountered, the prefetched instructions are always executed, regardless of the outcome of the condition test. However, unlike the RISC and MIPS designs, there is no internal forwarding [7] in the data path. This makes the design of the control section quite simple, provided that the speed penalty due to the extra delays can be reduced by suitable rearrangement of the instructions.

6. PROGRAMMING AND SCHEDULING

The efficiency of APR depends entirely on the efficiency of the program. Because much of the processing speed of APR comes from its multi-pipelined architecture, its throughput can be widely different for programs that can keep the pipelines full most of the time and programs that cannot. There are three major pipelines in APR: the adder pipeline, the multiplier pipeline, and the instruction fetch and execution pipeline. Block diagrams of these pipelines have been shown in the previous section. An operation can be started in the adder or multiplier pipeline only when both operands are in the register file. APR does not make any effort to check whether the operands are valid or not; this task is left to the programmer. Therefore, as an example, if the following two instructions are to be executed with APR in floating-point mode:

i1: C = A + B
i2: D = C + 2

then instruction i2 can be started only at least five cycles after i1 was started. But other instructions that do not require C can be started immediately after i1. Therefore, the instructions must be arranged in such a way that the pipelines are filled most of the time.
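The cost of ignoring this rule is easy to make concrete. The earliest-issue model below is my own illustration, not the APR assembler: a result becomes usable only five cycles after its add is issued, and the hardware does not check.

```python
LATENCY = 5   # adder pipeline depth in cycles

def schedule(instrs):
    """instrs: list of (name, src1, src2, dest). One issue per cycle, but an
    instruction may not issue until its source operands are valid."""
    ready = {}                           # register -> cycle its value is valid
    cycle = 0
    for name, s1, s2, dst in instrs:
        start = max(cycle, ready.get(s1, 0), ready.get(s2, 0))
        ready[dst] = start + LATENCY
        print(f"{name} issues at cycle {start}")
        cycle = start + 1

# Dependent back-to-back adds: i2 stalls for the full pipeline latency.
schedule([("i1", "A", "B", "C"), ("i2", "C", "TWO", "D")])

# With independent work interleaved, the stall cycles do useful computation.
schedule([("i1", "A", "B", "C"), ("j1", "E", "F", "G"),
          ("j2", "H", "I", "J"), ("j3", "K", "L", "M"),
          ("j4", "N", "P", "Q"), ("i2", "C", "TWO", "D")])
```

In the first run i2 cannot issue until cycle 5; in the second, the four independent adds j1-j4 fill cycles 1-4 and i2 still issues at cycle 5, so nothing is wasted.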
The instruction fetch and execution pipeline is easier to handle. The only complication arises when there is a conditional branch in the program. APR does not provide any facility to flush the instruction pipeline when a branch occurs; this is what is called a non-interlocked pipeline in MIPS. The instructions that are prefetched into APR will always be executed, regardless of the outcome of any conditional branches. The programmer must therefore make sure that the two instructions after a conditional branch that are prefetched into APR will not cause any errors when executed. It may be possible to rearrange the instructions so that these two instructions actually do some useful work; otherwise, NOP (no operation) instructions must be used to fill the gap. As pointed out in [9], this requirement is often more of a blessing than a curse. Not only is the design of the processor easier with non-interlocking pipelines, but the two prefetched instructions which would be wasted in a pipeline that can be flushed can now be made to do useful work.

It is rather difficult to write an efficient program to schedule a number of computing resources, even though the present case has only two resources, namely the adder section and the multiplier section. The subject of scheduling multiple resources has been studied extensively in the area of micro-programming. In fact, the programs for APR are somewhere between assembly language programs and micro-programs. As an aid to the programmer, the assembler is written to take into account the possible parallel operation of the adder and multiplier sections, and the various pipeline relationships. The branch and bound algorithm with a list scheduling heuristic [19] is used in the assembler. It requires the programmer to arrange the addition and multiplication operations as a data dependency graph [19]. A close-to-optimal program is then generated automatically in a very short time. It is rather difficult to include the effects of a finite number of registers in a scheduling algorithm, so the effect of the limited number of registers available on-chip is largely ignored in the present assembler. The scheduling algorithm is described in more detail in Appendix D.

7. DESIGN OF CELLS

It is well known that in VLSI, communication is far more expensive than computation [22]. However, the notion that communication must be modular and local is rather futuristic. Classical von Neumann architecture does not lend itself easily to systems requiring little or no global communication. Also, the complexity and density of most VLSI chips today have not reached the point where multiple self-contained systems can be integrated on one chip. Therefore, it is conceivable that most, if not all, processor designs today utilize global signals that run across the span of the chip. Nevertheless, in VLSI, even local communication must be designed carefully.

Take, for example, the adder mantissa section of APR (see Figure 7.1). The register file, barrel shifters, leading zero checker, and adder are all of different heights and widths in their most compact form. If the layout is designed following exactly the arrangement of the block diagram, a lot of area is wasted (see Figure 7.2). Although Figure 7.2 is only symbolic, it is obvious from it that a lot of silicon real estate is used up by the wires. A better design is to stretch all the building blocks to the largest height so that they can be butted together snugly.

Fig. 7.1: Communication in Adder Mantissa Section

Fig. 7.2: Inefficient Routing of Adder Mantissa Section
Although this implies less compact layouts for the smaller blocks, the savings in routing area will usually lead to a much more compact overall design. Also, since the adder design adopted in APR is much more compact when the sum is generated on the same side as the inputs, the overall design is more compact if the adder is placed at the end of the data path. The final design is shown in Figure 7.3. Notice that there are no 'bare' busses in this arrangement.

Fig. 7.3: Final Floor Plan of Adder Mantissa Section

Such considerations were taken into account in the design of the basic cells. The design of the more important cells is described in this section. The following set of rules of thumb was used as a guideline in the design of the layouts:

1. Extensive layout compaction is applied only to cells that have a high repetition factor, such as the register cells and the rest of the data path. For cells that are not repeated a lot, logic gates from a standard cell library are used. This reduces design time while compromising little in terms of overall layout compactness.

2. Full CMOS is used as much as possible in order to reduce power consumption, except in places where the use of pseudo-nMOS can result in a significant reduction in real estate requirements.

3. P-channel transistors are made 2.5 times the size of their corresponding N-channel transistors, in order to have symmetrical rise and fall times.

4. Minimum size transistors (with P-channel transistors 2.5 times the size of N-channel transistors) are used everywhere except where high current drive is required. This not only keeps the area required to a minimum, but also reduces the loading on the drivers that drive the transistors, which in turn reduces the overall delay.

5. Metal wires are used for signal communication as much as possible, with polysilicon used only for short distances or crossunders.

6. Power and ground wires are always routed in metal. When signal lines cross the power lines, the signal lines must cross under the metal power lines in polysilicon. In rare cases where power must cross ground, an N+ diffusion layer inside a P-well is used as a crossunder for the ground wire. N+ diffusion is preferable to polysilicon here because it has a smaller sheet resistance than polysilicon, though a larger capacitance. N+ diffusion is therefore better than polysilicon for conducting ground lines but worse for connecting signal lines: the level on a ground line never has to change, so the increased capacitance has no serious effect.

7. Manhattan geometry is used except where significant savings in area can be realized with the use of 45 degree geometry. This makes conversion of the design to a process supporting only Manhattan geometry easier, although there are currently no plans to do that.

The designs were done on a Metheus X750 VLSI workstation and were simulated extensively wherever possible using the available simulation tools. These tools include:

1. SPICE, a circuit level simulator, for accurate analog simulations of small circuits consisting of up to 20 transistors.

2. Hg, a switch level simulator, for digital simulations using a switch model for MOS transistors. Circuits containing a few hundred transistors can be simulated conveniently with this simulator.

3. HILO, a gate level simulator, for digital simulations of large complex digital systems.
With functional simulations [13], theoretically whole chips and systems can be simulated, but because HILO cannot be used with the circuit extractor, it was only used sparingly. The layouts have been thoroughly checked for design rule violations using the DRC (Design Rule Checker) on the Metheus VLSI workstation.

7.1 Multi-port Register Design

In almost any digital processor design, a number of multi-port registers are needed. A register is basically a RAM word. The difference (if any) between what is usually called a register file and what is called a RAM is that a register usually has more than one port, there is a small number of registers in a register file, and the file is usually placed in the data path itself. This is necessary because, by design, registers are used very often, typically once every machine cycle. A multi-port register file allows the processor to run faster. It is not practical to have a very large register file, because larger files require longer access times [24]. Historically, there is a big difference between registers and RAMs in that registers are integrated on the same chip as the processor, whereas RAMs are in separate packages. With the advent of VLSI, on-chip RAMs are becoming more and more common, and the difference between registers and RAMs is fading. In fact, in RISC II [34], the terms are used interchangeably.

The register cell design used in APR is shown in Figure 7.4. This is a 3-port register cell, with two read busses and one write bus. During every cycle, two words can be read from the register file and one word can be written to it.

Fig. 7.4: A 3-Port Register Cell

A write operation proceeds as follows. For every register cell, when write is high, M5 is on and M0 is off. The write bus is then connected to the input of the inverter formed by M1 and M3, and the signal on the write bus is stored in the gate capacitance of this inverter. When the write control signal returns to low, M5 is turned off and M0 is turned on, disconnecting the write bus from the register cell. Since M0 is on, the output of the inverter formed by M2 and M4 is connected to the input of the inverter formed by M1 and M3, forming the familiar cross-coupled static memory cell, and the information stored can be retained indefinitely. This completes a write cycle.

For a read cycle, either ReadBus0 or ReadBus1 can be used. When Read0 goes high, M6 is turned on. This connects ReadBus0 to the output of the inverter formed by M2 and M4, and the logic level of ReadBus0 will follow the output logic level of this inverter. The same description applies to ReadBus1. With two read busses and one write bus, two words can be read from the register file and one word written to it every cycle. This is very desirable because both the adder and the multiplier require two operands and produce one result.

In order to save area, pass transistors must be used instead of transmission gates for the read and write enable switches (M5, M6 and M7). In CMOS, either N- or P-channel transistors can be used for pass transistors. P-channel transistors are typically 2 to 3 times slower than N-channel transistors, because of the lower mobility of holes compared with that of electrons. However, in a P-well process, N-channel transistors reside in a P-type tub of slightly higher doping density than the substrate. This worsens the body effect, and N-channel pass transistors lose more than a gate threshold in passing a high signal.
If one end of an N-type pass transistor is held at 5V, the other end only reaches about 3V. On the other hand, a P-channel transistor can pass a low signal (0V) to around 1.1V. N-channel transistors have been chosen for implementing the register cell because of their superior speed. In order to combat the threshold losses, the read busses are precharged. When the register file is not used, the read and write control lines are driven to zero volts, and the read busses are precharged to Vdd. When the register file becomes active in φ2, the capacitance of the read busses keeps the voltage of the busses close to Vdd. Therefore, if the register cell contains a 1, it will not have to pull the bus high, since the bus is already high. On the other hand, if the cell contains a 0, the bus will have to be pulled low. This presents no problems, since N-channel transistors can pass a low signal without any loss.

The SPICE simulations of the register cell are shown in sections B.1 to B.6 of Appendix B. In B.1, a 0 is written to the register. In B.2, a 1 is written. In B.5, the register is sourcing a 0. In B.6, the register is sourcing a 1 to a read bus. The case that presents the most concern is B.2, since the signal after the write enable pass transistor M5 barely reaches 3V, just slightly above the threshold voltage of 2.5V for a standard inverter. The noise margin has been increased by making the N-channel pull-down transistor M1 bigger than the P-channel pull-up M3. In Appendix B.3, the write signal is reduced to 4V, and it is shown that the register cell can still record the high signal properly. This marginal state exists only while the register is being written to. As soon as writing is complete, M5 is turned off and M0 is turned on, restoring the logic levels to much safer values.

The register file is divided into two sections, consisting of 31 x 16 and 31 x 9 units of this register cell. The bigger file is for the mantissa, while the smaller one is for the sign and the exponent. Driving the register file are three 5-to-32 decoders, one for each bus. The decoders are basically structured like a PLA, but with no OR planes, since they are not needed. From simulations, the time required to disable the read/write switches and precharge the busses is found to be about 20ns. The time to decode an address is about 10ns, and the time for a register cell to be enabled and sink a 1 from the write bus is about 30ns (there are other possible activities, such as sourcing a 1, but they are not worst cases). Therefore, the register file can be run at a minimum cycle time of 60ns. See Appendix A.1, A.2, and B.1 to B.7.
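Functionally, the file behaves as in the sketch below: two reads and one write every machine cycle, which is exactly what one adder or multiplier operation consumes and produces. This is a behavioural illustration only, assuming the 31-word mantissa organization just described; the precharging and threshold effects of the actual cell are not modelled.

```python
# Behavioural sketch of the 3-port register file: two read buses and one
# write bus are exercised once per cycle. Electrical behaviour (precharge,
# threshold losses) is not modelled.
class RegisterFile:
    def __init__(self, words=31, bits=16):
        self.mask = (1 << bits) - 1
        self.regs = [0] * words

    def cycle(self, read0, read1, write_addr=None, write_data=0):
        a = self.regs[read0]                     # ReadBus0
        b = self.regs[read1]                     # ReadBus1
        if write_addr is not None:               # WriteBus
            self.regs[write_addr] = write_data & self.mask
        return a, b

rf = RegisterFile()
rf.cycle(0, 1, write_addr=2, write_data=0x1234)
print([hex(v) for v in rf.cycle(2, 0)])          # ['0x1234', '0x0']
```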
7.2 Barrel Shifter

The design of a versatile and compact barrel shifter has been thoroughly discussed in [34]. The circuit diagram of a 4-out-of-7 barrel shifter is shown in Figure 7.5. This barrel shifter design can serve as either an up shifter or a down shifter (it can also rotate a word, but this capability is not utilized in APR). Figure 7.6a shows the barrel shifter connected as a down shifter, and Figure 7.6b shows it working as an up shifter.

Fig. 7.5: 4-out-of-7 Barrel Shifter
Fig. 7.6: Barrel Shifter Connected as Up/Down Shifter

In Figure 7.6a, the input 4-bit word is fed to ports In0 to In3, where the MSB is connected to In3. If sh3 is high and sh0 to sh2 are all low, then Out3 is connected to In6, Out0 is connected to In3, and so on. Therefore, the MSB of the input word becomes the LSB of the output, and Out1 to Out3 are filled with 0's. This has the effect of shifting the input word down by three positions. To shift the input word up by two positions, sh2 is set to 1; then Out3 is connected to In5, Out0 is connected to In2, and so on. When sh0 is high, the input word is passed directly to the output.

In Figure 7.6b, the input word is connected to In3 through In6. The operation is very similar to that of the down shifter, except that now enabling sh0 shifts the input word up by three positions and enabling sh3 passes the input word to the output directly. Since the barrel shifter is an nMOS pass transistor design, the threshold drop problem is also present. This implies that the output lines of the barrel shifter must also be precharged.

7.3 Leading Zero Checker

A 4-bit leading zero checker design is shown in Figure 7.7. This circuit is basically a PLA with no OR plane. The position of the first non-zero bit can be obtained either along the direction of the data path or at right angles to it, making it particularly suitable for use as part of a data path. Simulations show that the worst case delay for a 16-bit leading zero checker is about 6ns. The number of leading zeros in binary code can also be obtained with the addition of an OR plane.

Fig. 7.7: 4-bit Leading Zero Checker

7.4 Adders

There are two adder designs used in APR. For the small adders and the program counter adder, regenerative Manchester carry adders were used. For the fast mantissa adder, a carry look-ahead adder was used.

The regenerative Manchester carry scheme [14] is basically a Manchester carry scheme [22]. The P, G and K signals are generated as in normal Manchester carry adders and a pass transistor is also used for the carry propagation chain. The difference is that the carry signals are regenerated at every stage. In a dynamic regenerative adder (see Figure 7.8), each section of the carry chain is pulled down to GND in the precharge phase. This is equivalent to a carry kill at every bit, so no kill (K) signal needs to be generated. In the next (active) phase, if propagate (P) is high, which implies generate (G) is low, the carry-in signal is passed to carry-out. In a normal Manchester carry adder, the level of the signal at carry-out will not be the same as that at carry-in, because there is a potential drop across M0. However, in the regenerative Manchester carry, if the carry-in signal is high, M1 will pull carry-out all the way to Vdd. If the carry-in signal is low, M1 will be off, and the section of the carry chain will stay at GND, since it is initially set to GND. The logic levels can be restored at every stage in the normal Manchester carry scheme too, but the two extra gate delays introduced by the doubly inverting buffers would more than offset the speed improvement gained by the restored logic levels. When P is low, the adder behaves as follows: if G is high, a carry of 1 is generated by M1; if G is low, M0 and M1 are both off, and the carry-out stays at GND. This is the same as killing the carry.
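Logically, the regenerative chain computes the same recurrence as any Manchester carry adder; the regeneration only restores the electrical levels at each stage. The following behavioural sketch of the carry recurrence is an illustration, not a model of the circuit's timing.

```python
# Behavioural sketch of a Manchester-style ripple carry chain. The carry
# recurrence c[k+1] = G[k] + P[k]*c[k] is what the regenerative stages
# implement; the regeneration itself is purely electrical.
def manchester_add(a, b, cin=0, bits=8):
    carry, total = cin, 0
    for k in range(bits):
        ak, bk = (a >> k) & 1, (b >> k) & 1
        p, g = ak ^ bk, ak & bk          # propagate, generate (kill = neither)
        total |= (p ^ carry) << k        # sum bit for this stage
        carry = g | (p & carry)          # carry passed (and restored) per stage
    return total, carry

print(manchester_add(0x5A, 0x37))        # (0x91, 0): 90 + 55 = 145
```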
A static regenerative adder scheme is also possible (Figure 7.9). This is done by adding some circuitry to the dynamic design to implement a carry kill. Static designs are more versatile, since no clock is required; therefore, they can be operated asynchronously. Although the static design has a longer cycle time than the dynamic design, when the adder is used as part of a larger design, the overall speed of the design may be faster with a static adder. For example, in one machine cycle in APR, the number of leading zeros in a number must be determined and then added to the original exponent in the multiplier section. With a dynamic adder, the machine cycle must be split into two sections: the first must be long enough for the leading zero checker to reach a stable (correct) result, and the second must be long enough for the adder to complete the addition. If a static adder is used, it is only necessary that the total time required for the leading zero checker and the adder be less than the period of one machine cycle. The static scheme does have some drawbacks, though. The transistor M7 presents an extra load on the carry chain. Also, when G or K is low, they may have to work against the carry-in signal, so transistor sizing is very critical in a static design.

Fig. 7.9: Static Regenerative Adder

The static regenerative adder design was chosen for all the adders on APR except the high-speed 16-bit mantissa adder. Simulations show that the carry propagation time for one stage of the static regenerative adder is about 7ns (see Appendix B, section B.9). Therefore, an 8-bit regenerative adder will require about 60ns to perform an addition.

A carry look-ahead scheme that is particularly suitable to VLSI has been described in [2]. The "Par-Add" scheme, as it is called in [37], is a very regular and modular design, making it very suitable for an algorithmic design, i.e., by specifying only the size (number of bits) of an adder, its layout can be generated automatically by a program. Par-Add is essentially a partial carry look-ahead tree across groups of two bits. With TTL components, carry look-ahead is usually across groups of four bits, but MOS transistors have less current drive than TTL circuits, so even carry look-ahead across groups of four bits can significantly increase delays due to large capacitive loads on the outputs. A block diagram of a 4-bit Par-Add adder is shown in Figure 7.10, where A and B are the numbers to be added, and S is the sum.

Fig. 7.10: 4-Bit Adder with Partial Carry Look-Ahead

For the blocks labeled FA (for Full Adder),

Pout = A ⊕ B
Gout = A · B
S = A ⊕ B ⊕ Cin

For the PA (for Par-Add) blocks,

Pout = P0 · P1
Gout = G1 + (P1 · G0)
C1 = G0 + (P0 · Cin)
C0 = Cin

Notice that all the PA blocks implement the same logic. This is why the circuit is modular and regular. If N is the width of the adder in bits, then the delay of the adder grows as log2(N), as opposed to N in an adder without carry look-ahead, such as a Manchester carry adder.
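The PA recurrence can be exercised in software. The sketch below applies the same equations recursively over halves of the word, which reproduces the log2(N) depth of the look-ahead tree; it is an illustration of the equations above, not of the actual layout.

```python
# Sketch of the Par-Add recurrence: each node combines the (P, G) pairs of
# its two children with Pout = P0*P1, Gout = G1 + P1*G0, and passes carries
# down as C0 = Cin, C1 = G0 + P0*Cin. Bit lists are LSB-first.
def par_add(a, b, cin=0):
    def node(abits, bbits, c):
        if len(abits) == 1:                           # FA: P = A xor B, G = A and B
            p, g = abits[0] ^ bbits[0], abits[0] & bbits[0]
            return [p ^ c], p, g
        m = len(abits) // 2
        s0, p0, g0 = node(abits[:m], bbits[:m], c)    # low half, C0 = Cin
        c1 = g0 | (p0 & c)                            # C1 = G0 + P0*Cin
        s1, p1, g1 = node(abits[m:], bbits[m:], c1)   # high half
        return s0 + s1, p0 & p1, g1 | (p1 & g0)       # sum, Pout, Gout

    s, p, g = node(a, b, cin)
    return s, g | (p & cin)                           # sum bits, carry-out

print(par_add([0, 1, 0, 1], [1, 1, 0, 0]))   # 10 + 3 = 13 -> ([1, 0, 1, 1], 0)
```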
The original design in [37] is for an nMOS process. Since APR is to be implemented in CMOS, the design has to be modified. With CMOS, the PA blocks cannot be implemented in full complementary MOS: full CMOS requires about four times the fan-out of pseudo-nMOS or domino CMOS [18]. Domino CMOS is particularly suitable for this application, since it works with non-inverting logic (i.e., the basic building blocks are AND and OR gates, rather than NAND and NOR gates), which is exactly what the PA blocks require. Therefore, domino CMOS was chosen to implement the mantissa adder, which must be much faster than the other adders because it is 18 bits long and is an end-around-carry adder. The circuit diagram of a PA block in domino CMOS is shown in Figure 7.11.

Fig. 7.11: Circuit Diagram of a Par-Add Block

Because domino CMOS is dynamic, making an end-around-carry adder using domino CMOS presents extra problems. The carry-in of an end-around-carry adder does not become stable until about half way through the overall add time of the adder, but dynamic circuits require their inputs to be stable before they start operating, otherwise incorrect results will occur. This requirement is handled in the mantissa adder by precharging the adder during φ0, and operating the adder during φ3 and φ4. During φ3, the carry-out is established and latched. During φ4, the carry-in is stable and correct, and the sum generated is valid. Figure 7.12 shows the schematic of this arrangement.

Fig. 7.12: End-Around-Carry Adder with Domino CMOS
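Behaviourally, the end-around-carry addition amounts to two passes, mirroring the φ3/φ4 split: the carry-out of a first pass is wrapped back into the carry-in of a second. A small sketch follows; the 4-bit one's-complement example is chosen for illustration only.

```python
# Sketch of end-around-carry addition: the carry-out of the first pass
# (phi3 in the mantissa adder) re-enters as carry-in for the second pass
# (phi4), after which the sum is valid.
def end_around_add(a, b, bits=18):
    mask = (1 << bits) - 1
    t = (a & mask) + (b & mask)          # first pass: raw sum and carry-out
    return (t + (t >> bits)) & mask      # second pass: carry wrapped to the LSB

# One's-complement example in 4 bits: 0b1100 represents -3.
print(bin(end_around_add(0b0101, 0b1100, bits=4)))   # 5 + (-3) -> 0b10
```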
7.5 The Clock

The clock phases required by APR are shown in Figure 5.2, on page 30. Basically, each machine cycle is split into four phases: two relatively short phases (the φ0's) and two relatively long phases (φ3 and φ4). The reason that these phases are required has been described in section 5.4. This section is concerned with how such phases can be generated in VLSI.

In most processors, multi-phase (typically two-phase) clocks are generated from a single phase clock. The single phase clock can be generated either off-chip or on-chip. Depending on the accuracy required of the clock, or the amount of clock frequency drift that can be tolerated (due to temperature changes and aging, for example), a crystal or an RC circuit may be used to provide the oscillation. The generation of single phase clocks is discussed in detail in [22].

APR runs on its own clock, asynchronous with the system clock, if there is one. This is necessary to make full use of the speed of APR. For a prototype chip, it is not wise to use a totally internally generated clock based on a ring oscillator [22], because the timing estimates may be inaccurate. For example, at present the minimum width of φ2 is estimated to be 160ns. If a ring oscillator is used to generate the base clock, the number of stages in the ring oscillator must be adjusted so that φ2 is around but above 160ns. If the minimum phase width for φ2 is longer than estimated, i.e., longer than 160ns, then APR will not work, and there is no way to change the clock frequency after the chip is fabricated. On the other hand, if the minimum phase width is shorter than expected, then APR will run at less than its maximum speed. Therefore, an external oscillator is much more desirable for generating the base clock: with an external oscillator, using either a crystal or an RC circuit, the clock frequency can easily be changed. From here on, it is assumed that a single phase symmetrical square wave of selectable frequency is available from off-chip sources.

Given a symmetrical single phase clock, the problem now is how to generate the phases φ0 to φ4 shown in Figure 5.2. Apart from the obvious relationships shown, the following requirements should be noted:

1. φ1 is high only when φ0 is high.
2. φ1 and φ2 must not overlap.
3. φ3 and φ4 should be about two or three times the width of φ0.
4. φ3 and φ4 must not overlap φ0.

A method to generate a two phase non-overlapping clock from a single phase clock is described in [22]. To generate the clock phases required in APR, a more sophisticated scheme is necessary. The heart of the scheme used is a dynamic shift register (see Figure 7.13); the function of its N-channel pull-down transistors will be explained later in this section. The shift register is controlled by a two phase non-overlapping clock, generated from the base clock (clkin in Figure 7.13) using the method in [22]. With the base clock running, s0 = 1, and s1 to s5 = 0 on startup, the signals s0 to s5 will take the shapes shown in Figure 7.14. From these signals, φ0 to φ4 can be generated as follows:

1. φ0 = s0 + s3
2. φ1 = s0
3. φ2 = s1 + s2 + s3 + s4 + s5
4. φ3 = s1 + s2
5. φ4 = s4 + s5

Fig. 7.13: Shift Register for Multi-Phase Clock Generation
Fig. 7.14: Signals from Clock Generation Shift Register

However, there is still one problem: the non-overlapping requirements are not guaranteed by the shift register. It is possible to estimate the delay of the slowest clock path and then insert a sufficient number of idle base clock periods between the different clock phases. For example, if the signals s0 to s5 are to be used to generate a two phase non-overlapping clock, then the phases can be generated by the following logic equations:

φ1 = s1 + s2
φ2 = s4 + s5

The durations of s0 and s3 provide the necessary non-overlap periods. If a longer non-overlap period is required, the number of idle periods can be increased. A better approach is to use the feedback technique described in [22]; an example generating a two phase non-overlapping clock is shown in Figure 7.15. It should be noted that the signal that is fed back to the inputs of the generator must be from the end of the critical (slowest) path, otherwise non-overlap cannot be guaranteed. Using this technique, the signals fed back for the various generated signals are as shown in the following table:

Signal Generated    Signal Fed Back
φ0                  φ3 + φ4
φ1                  φ2 + φ0
φ2                  φ1
φ3                  φ0
φ4                  φ0

Fig. 7.15: Two-Phase Clock with Feedback to Guarantee Non-Overlapping

There is still one potential problem in this clock generator. Its proper functioning depends on the fact that at any time, one and only one of the s0 to s5 signals is high. Although it is a relatively simple task to set the s0 to s5 signals to the proper state on start-up by using the reset line, measures must still be taken to ensure that if there is ever more than one high signal, or no high signal at all, in s0 to s5 (due to noise, perhaps), the shift register can correct itself after at most one machine cycle. The N-channel pull-down transistors shown in Figure 7.13 are included for this purpose. When s0 is high, s1 to s5 are forced low; on the other hand, when all of s1 to s5 are low, s0 is forced high.
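The phase derivation itself is simple combinational logic on the one-hot ring. The sketch below reproduces the five OR-combinations listed above over the six base-clock slots of one machine cycle; the feedback gating that guarantees non-overlap is deliberately left out.

```python
# Sketch of the phase derivation from the one-hot ring counter s0..s5.
# Only the OR-combinations are modelled; the non-overlap feedback is not.
def phases(t):
    s = [int(t % 6 == k) for k in range(6)]      # one-hot s0..s5
    return {'phi0': s[0] | s[3],
            'phi1': s[0],
            'phi2': s[1] | s[2] | s[3] | s[4] | s[5],
            'phi3': s[1] | s[2],
            'phi4': s[4] | s[5]}

for t in range(6):                               # one full machine cycle
    print(t, phases(t))
```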
7.6 Programmable Logic Array

There are six programmable logic arrays (PLAs) used in APR:

1. Instruction decoder.
2. Stack controller.
3. Fraction alignment barrel shifter controller.
4. Register file decoder.
5. Leading zero checker in the adder section.
6. Leading zero checker in the multiplier section.

All these PLAs use a static pseudo-nMOS design. An example of such a design is shown in Figure 7.16, which is the stack controller used in APR. The logic equations implemented are:

out0 = in0 · in1 · in3 + in0 · in1 · in2 + in1 · in4
out1 = in0 · in1 · in4 + in0 · in4
out2 = in0 · in4
out3 = in1 · in2

Fig. 7.16: PLA for Stack Controller

Pseudo-nMOS static PLAs are preferable to dynamic PLAs or full CMOS static PLAs. Dynamic PLAs must be clocked, placing limitations on when inputs can be fed to the PLA and when outputs can be obtained from it. Full CMOS PLAs are more than twice the size of pseudo-nMOS PLAs; since the speed of a PLA decreases as its size increases, the saving in power consumption is usually not worth the sacrifice in speed and area.

7.7 Stack

The stack design described in [22] is used for the program counter stack in APR. The only modification made is that P-channel transistors are used instead of N-channel for the pass transistors. The circuit diagram of a 2-bit by 2-level stack is shown in Figure 7.17.

Fig. 7.17: 2-bit by 2-level Stack

8. FLOOR PLAN OF APR

The logic design of APR has been carried to completion, and a floor plan of APR has been investigated (Figure 8.1). With due allowance made for routing channels, the chip is estimated to measure approximately 8.5mm x 8.5mm. There are 83 pads on the periphery, including the power and ground pads. It can be seen in Figure 8.1 that a significant portion of the chip is devoted to routing. This exemplifies the statement in section 7 that communication in VLSI is more expensive than computation. The design of an efficient floor plan is thus of primary importance for a compact chip layout. With a single metal layer and a single polysilicon layer for interconnections, the routing problem is further complicated because polysilicon wires must be used as sparingly as possible, or significant deterioration in processing speed will result. Unexpected RC delays caused by long polysilicon wires are probably one of the most common causes of chip design failures. Until better simulation software becomes available, all that can be done to increase the chance of a successful chip implementation is to keep the design simple and regular, and to be extremely careful at every stage of the design.

Although chip sizes larger than 70mm2 are not uncommon these days (see section 3.1), such a large chip is not feasible with the available ISOCMOS process due to poor yield. Also, if a 2μm process were available, APR would measure only about 4.3mm x 4.3mm. Therefore, although the APR design is definitely feasible with present-day IC technology, it cannot be fabricated with a high degree of confidence using the technology available to the author. Because APR is actually more powerful than is required for coordinate transformations in robotics, it is, however, possible to fabricate a chip that contains only the subset of the functions of APR which are essential for the computation of coordinate transformations. Not only can such a chip serve as a testing vehicle for the APR, but it can also be used immediately in industrial robot
applications.

Fig. 8.1: Floor Plan of APR (legend: pad frame; support circuitry for the multiplier section; register file for the mantissa, with its decoders and drivers; register file for the exponent and sign, with its drivers; adder section; PLA for pre-shift control; pipeline registers and exponent adder for the multiplier section; exponent adjust and exponent comparison adders in the adder section; shift registers for register sinks and sources; instruction decoder; program counter; clock generator; latches for literals and other constants; I/O multiplexers for mantissa and exponent; OR-plane for the LZC in the multiplier section)

This simplified design is called APRjr.

9. THE APRJR

The essential difference between APR and APRjr is that the latter only handles fixed-point numbers. For the specific task in mind, namely the computation of the direct transform, a fixed-point representation should be adequate for most robot arms. The analysis concerning the adequacy of fixed-point representation in the computation of the direct transform has been discussed in section 3.2. The circuitry for performing the fixed-point calculations in APRjr is the same as described for APR.

Figure 9.1 shows a block diagram of APRjr. The APRjr design is a subset of the APR design: the entire exponent part, the two barrel shifters and the leading zero checker in the adder section are deleted. In addition, the width of the program counter is reduced to 10 bits, and the number of flags, and hence pins, is reduced by two. The rest of the design is essentially the same as in APR, with the numbers being represented in 16-bit 2's complement format instead of sign-magnitude. The sign-magnitude representation is used in APR because it is the IEEE standard and it also makes normalization simpler to handle. In APRjr, there is no IEEE standard to adhere to, and the numbers do not have to be normalized, so there is no reason to use sign-magnitude representation. Using 2's complement representation also removes the need for an end-around-carry adder in the adder section; a simple 2's complement adder is used instead.

Because both APR and APRjr are heavily pipelined, APRjr does not run any faster than APR in terms of the length of each machine cycle. However, the reduced number of pipeline stages in the adder section of APRjr enables it to complete a computation faster in cases where it cannot run in streamline mode (that is, when operands cannot be fed to the execution pipelines continuously). This relatively small improvement in speed at the expense of reduced versatility was the reason why the APR design was attempted originally. Nevertheless, APRjr is meant to be a testing prototype for APR that is both useful and fabricable with the available technology, not an improved or alternative design.

[Figure 9.1: Block diagram of APRjr, showing the register file for the mantissa, the adder section, the barrel shifter with its leading zero checker and controller, the off-chip multiplier, data I/O, condition code register, instruction register, program counter, and control.]

The floor plan of APRjr is shown in Figure 9.2. Its layout has been completed and has been sent for fabrication. The overall chip size is about 6.5mm x 6.9mm.

Fig. 9.2: Floor Plan of APRjr (legend: pad frame; support circuitry for the multiplier section; register file; adder section; decoders for the register file; clock generator; I/O unit; shift registers for the register file; instruction register; program counter; instruction decoder; shift registers for instructions; flags; PLA for stack control; registers for barrel shifter shift amount)
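The consequences of the 16-bit 2's complement format can be illustrated as follows: additions simply wrap modulo 2^16, which is why no end-around carry is needed, and the multiplier returns a double-length product that is rescaled. The Q15 placement of the binary point in this sketch is an assumption for illustration; the text does not fix the scaling.

```python
# Sketch of APRjr-style 16-bit 2's complement arithmetic. The Q15 scaling
# of the multiply is an assumption; everything else follows the format.
BITS = 16
MASK = (1 << BITS) - 1

def to_signed(x):
    return x - (1 << BITS) if x & (1 << (BITS - 1)) else x

def add16(a, b):
    return (a + b) & MASK                        # carry out of bit 15 is dropped

def mul16_q15(a, b):
    product = to_signed(a) * to_signed(b)        # double-length product
    return (product >> 15) & MASK                # keep the upper half (Q15)

print(to_signed(add16(5, MASK)))                 # 5 + (-1) = 4
print(to_signed(mul16_q15(0x4000, 0xC000)))      # 0.5 * -0.5 -> -8192 (-0.25)
```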
The following sections describe some of the more global aspects of APRjr, such as processing speed and the use of APRjr in a microprocessor system. Most of the description is applicable to APR also; the differences are pointed out whenever they are important.

9.1 Speed Considerations

In APRjr, the possible timing bottlenecks are the multiply stage of the multiplier section, and the register decode and fetch stage. APR has another potential timing bottleneck, namely the exponent comparison and fraction alignment stage. Depending on the speed of the off-the-shelf integer multiplier used, either of these stages may become the ultimate bottleneck. With super-fast multipliers such as the Weitek WTL 2010, the speeds of these two stages are comparable. Although the multiply time of these fast integer multipliers is very short compared with the time required to read and write the register file twice, the overall multiply time is much longer because the signals have to go off-chip. The register read and write cycle is slow mainly because it has to work twice every clock cycle, once for the adder and once for the multiplier. It may be possible to run APRjr faster, and at the same time reduce the size and complexity of the chip, by not running the adder and multiplier in parallel, as in this case the register file only has to work once every cycle. However, operating the register file twice in one cycle is preferred here, especially as it allows for the possibility of incorporating a multiplier on chip at some later date: the estimated speed of a 16-bit parallel multiplier in 4μm CMOS is about 300ns, which is much slower than the register file.

The time required to run the register file twice is about 150ns. The adder in the data path requires 50ns. Shifting a number with the barrel shifter takes 60ns. Using the Weitek WTL 2010, which has a multiply time of 65ns, the total multiply time is estimated to be 160ns. Instruction decoding and program address generation are not in the critical path.

9.2 Programming

APRjr has twenty 32-bit wide instructions, which are divided into two groups: multiply/add and ETC. Figure 9.3 shows the formats of the two types of instructions. The first bit of each instruction specifies whether the instruction is a multiply/add or an ETC instruction. In a multiply/add instruction, 1 bit determines whether the adder is to perform addition or subtraction, and 5 bits are used to represent each of the sources and sinks of the multiplier and adder. In the ETC group, 5 bits are used as the op-code (with a few exceptions, see Appendix C). These instructions include memory reads and writes, program branching, loading constants, etc.

Multiply/Add group:
  bit 31: 1;  bit 30: add/subtract;
  bits 29-25: source 1, bits 24-20: source 0, bits 19-15: sink (for the multiplier);
  bits 14-10: source 1, bits 9-5: source 0, bits 4-0: sink (for the adder).

ETC group:
  bit 31: 0;  bits 30-26: op-code;  bits 25-10: data/address;
  bits 9-5: source;  bits 4-0: sink.

Fig. 9.3: Instruction Formats

The complete set of instructions for APRjr is listed in Appendix C. The instruction set for APR is more extensive, including instructions to switch between fixed-point and floating-point modes and to convert 2's complement mantissas to sign-magnitude representation.
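As an illustration of the two formats, the sketch below packs the fields of Figure 9.3 into 32-bit words. The exact bit positions are inferred from the figure's column markers, so this should be read as a plausible reconstruction rather than a bit-exact assembler.

```python
# Hypothetical encoders for the two instruction formats of Figure 9.3.
# Field widths follow the text (5 bits per source/sink, 5-bit op-code,
# 16-bit data/address); the exact bit numbering is inferred.
def encode_muladd(subtract, m_src1, m_src0, m_sink, a_src1, a_src0, a_sink):
    word = (1 << 31) | ((subtract & 1) << 30)    # type bit, add/subtract bit
    fields = ((25, m_src1), (20, m_src0), (15, m_sink),
              (10, a_src1), (5, a_src0), (0, a_sink))
    for shift, value in fields:
        word |= (value & 0x1F) << shift
    return word

def encode_etc(opcode, data, source, sink):
    return ((opcode & 0x1F) << 26) | ((data & 0xFFFF) << 10) \
           | ((source & 0x1F) << 5) | (sink & 0x1F)

print(hex(encode_muladd(0, 1, 2, 3, 4, 5, 6)))   # 0x822190a6
```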
Part of the data address space of APRjr is mapped to the system memory. The two data busses are usually separate, however, so that APRjr can run in parallel with the host processor. When the host processor has collected (or guessed) a new set of joint coordinates, and wants the APRjr subsystem to compute the direct transform, it signals to the subsystem by writing to InFlagO of APRjr. This port can either be memory mapped or I/O mapped. Upon receipt of this signal, APRjr will request the use of the system bus from the host by sending an interrupt from its OutFlagO. When the bus-grant signal is received (using the other input port on APRjr, InFlagl), the portion of the data space in APRjr that is mapped to the system memory is used to read the joint coordinates directly from the host's memory. When all the data is read, APRjr relinquishes the system bus, and proceeds to compute the direct transform. At this time, the host is free to perform other tasks. After APRjr has completed the transform, it sends an interrupt to the host, using OutFlagl. The host acknowledges the interrupt by writing to InFlagO of APRjr and relinquishing the system bus. APRjr then writes the result of the transform in the common memory area. Upon finishing, 72 v CO OutFlagl OutFlagO c_ C C Q_ InFlagl -* InFlagO w c_ •o •o ix> ro tn CO c_ X3 •o ID JO re re CD C C ro ro CD QJ -O C L I -r—l CJD -t_> Fig. 9.4: Block Diagram of the APRjr Subsystem 73 it sends a signal from OutFlagO, and returns to the idle state, waiting for another task from the host. 74 10. TESTING A test chip comprising one register cell and one barrel shifter cell was designed, fabricated in 5M m ISOCMOS at GTE's foundry and tested at UBC. These cells were chosen for testing because they contain violations of the design rules. The violations were deliberately incurred in the interests of saving space but were not deemed to be electrically unimportant This was confirmed by the testing as the chip was found to be fully functional. Testing for timing (operation speed) is not possible because of the extra capacitive load of the pads on the chip and the probes of the testing station. Testing APRjr will be a lot more difficult when it is eventually fabricated. The APRjr design does not contain any testability enhancement circuits such as LSSD (Level Sensitive Scan Detection) [5]. This is because inclusion of such circuitry will make the design up to 30% bigger, and there is not much room to spare in the present design. Testing can be divided into 2 parts: prototype testing and production testing. 10.1 Prototype Testing When the first chips come back, .the chips may still contain design errors. Therefore, a step by step procedure is necessary to isolate any errors so that proper corrections can be made in future runs. Prototype testing must start with chip level testing. The following steps should be followed: 1. None of the pins should be shorted when no power is supplied to APRjr. 2. Power should then be connected to APRjr. With all other input pads grounded and output and tri-state pads floating, the clock input should be connected to a clock generator. A data generator such as the HP8180A can be used. The reset pad on APRjr should be connected to a debounced switch. When reset is 75 high, the internal clock should stay at <j>0. When reset goes low, the internal clock should start running. The internally generated clock phases, <j>0 to 0« and their complements should be monitored on a data analyzer such as the HP8182A. 
10. TESTING

A test chip comprising one register cell and one barrel shifter cell was designed, fabricated in 5μm ISOCMOS at GTE's foundry, and tested at UBC. These cells were chosen for testing because they contain violations of the design rules. The violations were deliberately incurred in the interests of saving space, but were deemed to be electrically unimportant. This was confirmed by the testing, as the chip was found to be fully functional. Testing for timing (operation speed) was not possible because of the extra capacitive load of the pads on the chip and the probes of the testing station.

Testing APRjr will be a lot more difficult when it is eventually fabricated. The APRjr design does not contain any testability enhancement circuits such as LSSD (Level Sensitive Scan Design) [5], because the inclusion of such circuitry would make the design up to 30% bigger, and there is not much room to spare in the present design. Testing can be divided into two parts: prototype testing and production testing.

10.1 Prototype Testing

When the first chips come back, they may still contain design errors. Therefore, a step-by-step procedure is necessary to isolate any errors so that proper corrections can be made in future runs. Prototype testing must start with chip level testing. The following steps should be followed:

1. None of the pins should be shorted when no power is supplied to APRjr.

2. Power should then be connected to APRjr. With all other input pads grounded and output and tri-state pads floating, the clock input should be connected to a clock generator; a data generator such as the HP8180A can be used. The reset pad on APRjr should be connected to a debounced switch. When reset is high, the internal clock should stay at φ0. When reset goes low, the internal clock should start running. The internally generated clock phases, φ0 to φ4 and their complements, should be monitored on a data analyzer such as the HP8182A. A high bandwidth oscilloscope should also be used to observe the exact shapes of these waveforms. If the clock phases prove to be as expected, then the next testing step can be carried out. However, if the clock phases are faulty, then measures must be taken so that other building blocks can be tested without the internally generated clock phases. This can be done by forcing the required clock phases, using the probes in a VLSI probing station to connect the output of a data generator to the appropriate clock wires. Although the tips of the probes are very tiny, the rest of the probe is rather bulky; as a result, in practice only about 6 probes can be used at any one time. Hence, if some of the probes have to be used to force signals onto the chip and some to monitor signals from the chip, then only very small sections of the chip can be tested at a time. The following steps can be applied to test any faulty building block:

a. Using the probes in a probing station, check all power supplies to the block.

b. Check all clock inputs, if any.

c. Starting from the inputs to the block (e.g., for the clock generator block, the inputs come from the input single-phase clock), check the output of each gate. The inputs and outputs of each stage can be monitored on a data analyzer. Stages that are faulty can then be identified by comparing the observed waveforms with the expected ones. This can be repeated until the final outputs are reached.

Where necessary, a stage can be isolated from its inputs by forcing signals on its inputs with the probes connected to a data generator. Because the data generator has a much larger drive than the signals generated by APRjr, the inputs will take on the values from the data generator. It is also possible to isolate a block from other circuits by physically cutting the metal wires on the chip with the probes. This is useful when some circuits connected to a multiplexed line are faulty.

3. The next function to test is address sequencing. With the clock running (assuming it works), the reset line should be held high. This forces the program counter to 0 and the rest of APRjr to a NOP state. A NOP instruction (all 0's) should be connected to the program address/data pads. The reset line should then be allowed to go low; this starts program execution in APRjr. The program counter should increment one step every machine cycle until it reaches 3FF. It should then go back to 0.

4. With the program running, instructions that affect the program counter, such as JMP and RTN, can be tested. The proper instructions should be forced onto the program address/data pads using the data generator, while the program counter is monitored on the data analyzer.

5. The next instructions that can be tested are the I/O instructions. An input datum can be connected by hard wiring to the data address/data pads (the pads for the data are multiplexed to communicate either the data address or the data themselves, hence the name). A memory read instruction should then be supplied by the data generator, followed by a memory write instruction, using any register on APRjr to temporarily store the memory content. If the instructions are executed properly, then the signals on the data address/data pads should equal the read address supplied by the read instruction during φ3 of the read instruction. In φ3 of the write instruction, the write address should be observed on the pads, with the data appearing in the following φ4.
6. The read flags and set flags instructions can be tested next.

7. Finally, the most important instructions, add and multiply, can be tested. This can be done by having the data generator produce a short program that reads in some data and then has APRjr operate on the data using either the adder section or the multiplier section. The results can then be written out for checking. The multiplier section can be tested without actually using a parallel multiplier: the multiplier 'result' can be supplied by the data generator.

If APRjr passes all these tests, then testing at the chip level is complete. APRjr must then undergo testing at the board level. The following steps are best followed with the use of a wire-wrap board:

1. With APRjr connected as in the last section, connect a WTL 2010 16-bit fixed-point parallel multiplier, or any multiplier that is pin-for-pin compatible with it, and run the same test program as in the last step above on the data generator again. This time, the multiplier result should be correct.

2. Next, a data RAM can be added to the system. A number of memory read/write instructions can then be executed to check that data can be read from and written to the data RAM.

3. The program ROM can now be added. Test programs can be burned into an EPROM and then connected to APRjr. By observing the outputs of APRjr, it should be possible to decide whether the programs are executed correctly.

4. The final step in board level testing is the addition of the necessary glue logic required to interface the APRjr subsystem to a microprocessor bus. Then a complete test program (one that will be used to compute the direct transform, for example) can be run. Again, if the results of the computation are correct, then it can be assumed that the program has been executed correctly.

10.2 Production Testing

After APRjr has been thoroughly tested in the prototype stage, volume production can be started. Because the yield of chips is never 100%, testing is required to determine which chips are faulty due to processing defects. This is best done by putting every APRjr in its board subsystem and then running a test program, as was done in the last step of prototype testing. Chips that produce the correct results can then be considered defect free.

11. CONCLUSIONS

After extensive analysis of robot manipulator applications, it is concluded that a 16-bit fixed-point number format is adequate for most present-day applications. By considering the properties of the computations involved in computing coordinate transformations in robotics, and by investigating the new computer architecture concepts, it was decided that a general purpose single processor system is still the best approach. However, the latest innovations in computer architecture are applied to this processor. These innovations include parallel processing (the adder and multiplier can run in parallel), non-interlocked pipelined instruction fetch and execution cycles, a reduced instruction set, a large register file, and Harvard architecture.

Following these decisions, an arithmetic processor has been designed. The chip, called APR, supports both floating-point and fixed-point arithmetic. Floating-point operations are included in order to broaden the potential applications of APR. A test chip called APRjr has also been designed. APRjr only supports fixed-point arithmetic, and will be useful both as a practical chip and as a test vehicle for APR.
The layout of APRjr has been carried to completion, and it was submitted for fabrication with a 4μm CMOS process in late January, 1985. Based on extensive simulations, a machine cycle time of 200ns is expected.

11.1 Contributions of this Thesis

The APR project has made contributions in both the VLSI and robotics areas. In the VLSI area, the APR project has stimulated interest in large scale processor designs here at the University of British Columbia. The experience accumulated during the design of APR and APRjr will benefit future VLSI digital designs. Through hands-on experience, fundamental digital designs in LSI, such as registers, adders, shifters, and clock generation, have been investigated in detail, and a large number of library cells have been created. These designs can be used in future designs, or as the starting point for improved designs; at the beginning of this project, no library cells were available. The attempt to design a large scale chip has also revealed a lot of deficiencies in the CAD tools currently available. The possible improvements are outlined in the next section.

In the robotics area, APR can make real-time trajectory computation possible and economical. Because of the general nature of its architecture, APR can be used for computations in robotics other than coordinate transformations, such as real-time control of manipulators [20]. In addition, APR can be used for other applications such as computer graphics, digital signal processing, and any other tasks that require high speed but only moderate precision arithmetic.

11.2 Suggestions for Further Research

VLSI is only an emerging field today, particularly at the University of British Columbia. The ambition of the present project was greatly restricted by the lack of an advanced fabrication process, the high cost and long turn-around time of fabrication, and the lack of a sizable research team working actively in VLSI. When all or most of these difficulties are overcome, the APR design can be improved in quite a few ways.

APR had to be designed in one man-year; therefore, the designer will be more than satisfied if the chip turns out to be functional. Given more time and design talent, the author believes that the APR design can be improved in terms of compactness, regularity, and timing. A more refined floor plan can significantly compact the design, and at the same time make the layout more regular. Regularity not only renders the design easier and thus less error prone, but it also facilitates any future modifications. A careful study of the delays of all the signal paths in the design will locate exactly where the bottlenecks in the design are. With this information, the design can be fine-tuned by reducing the resistance in the critical paths, by buffering to reduce the load on each stage, and by modifying the architectural design. On the system level, the computation subsystem can be made more compact, and probably faster, by designing a multiple chip set to replace the other components, such as the integer multiplier and the program and data memory. For example, the multiplier and the data memory may be integrated on one chip. When a more advanced process becomes available, the APR layout can be completed and the chip fabricated. It may even be possible to make APR more powerful in a number of ways:

• The parallel multiplier can be integrated on chip.
• A full 32-bit data path can be implemented, with full 32-bit registers and a 23-bit parallel multiplier.
• The size of the register file can be increased; 128 registers would be ideal. Still larger register files would slow down the machine cycle so much that the gain from having fewer memory accesses would not be worthwhile [24].
• A more powerful instruction set can be designed so that APR is made a lot more versatile, with little or no deterioration in processing speed.

A lot of research is also required in the CAD area. Many of the following required CAD tools already exist in some form or another; however, they are not available at the University of British Columbia. Also, most of the existing tools are still at a rudimentary stage, and improvements are not only possible but necessary.

• Simulation from layouts. Presently, only small sections of a large design (of the size of APRjr) can be simulated. The circuit simulator, SPICE, is not only extremely slow, but also has convergence problems with its level 2 MOSFET model. The switch level simulator, hg, or RNL (hg is a Metheus modification of RNL), performs reasonably well with moderate size circuits consisting of around 100 transistors, but it has problems dealing with certain circuits [29]. This means that hg cannot be run on any design without scrutiny. There is probably very little that can be done about the speed of simulation with these simulators, unless hardware accelerators are available. However, it would be extremely helpful if the simulators could be invoked interactively using a windowing facility; in other words, the designer could call the simulators to work on a certain portion of his design, without any modifications to the design. At present, in order to simulate a section of a design, the designer must make the portion that he wants simulated into a self-contained cell. A timing analyzer [15,23] would also be helpful in filling the gap between precise circuit or switch level simulation and functional simulation at the chip level.

• Hierarchical simulation system. When a digital design is formulated, it is best to simulate the design in a top-down fashion. For example, the design is specified first by its input and output characteristics. The design is then simulated at the system (or processor-memory) level. After making sure that the design works at this level, it is expanded one step down the hierarchy, namely to the register-transfer level. The design is then simulated again at this level, making sure that the inputs and outputs are as specified in the processor-memory level model. Next, the design is detailed at the gate (logic) level, then the circuit level, and finally the layout level. A design at a lower level is started only after the present level is proven to be compatible with the level above it, using the proper simulation tools. At present, only gate level (HILO) and circuit level (hg and SPICE) simulators are available.

• Auto-router. Routing a VLSI chip is extremely painstaking and error prone, simply because the designer cannot see what he is doing, even on a high-resolution graphics terminal. Multiple windows would help, but it would be much more convenient if an auto-router were available. The ideal router should be able to operate in several modes. In manual mode, the designer can direct the router interactively on the screen: the general direction that the route should take is supplied by the designer, while the router takes care of small details such as separation between layers and crossovers.
This mode is useful when the routing channels are tight. In semi-auto mode, the router should be able to route a set of ports together without aid, given the locations of the ports and the building blocks. In full-auto mode, it should be able to rearrange the building blocks slightly, such as shifting them a little to either side, so that a complete routing is obtained. The most efficient routing is then obtained by applying a suitable combination of these modes.

• Layout compactor. A layout compactor is useful for compacting small cells. The designer specifies the circuit, the topology of the connections, and the dimensions of the desired cell, and the layout compactor attempts to fit the design into the specified area.

• Floor planner. A floor planner can give the designer estimates of the overall size of the chip, given a rough floor plan of the building blocks and the routing requirements. Perhaps the floor planner can even be made smart enough to come up with close to optimum floor plans with minimal help from the designer.

• A more powerful circuit extractor. The circuit extractor that is available now only recognizes transistors and parasitic capacitors. What is more desirable is a hierarchical circuit extractor that can recognize layouts at the gate level, the register-transfer level, and even the processor-memory level. Recognizing logic gates from a layout should be easy, especially with the use of artificial intelligence. Register-transfer level and processor-memory level recognition is more difficult, requiring considerable programming effort and a sizable data base.

• A better scheduling program. The scheduling program which is in use now is still very difficult to use. The user has to break up the individual operations that are required to compute the equations, and then arrange the operations in the form of a data dependency graph. The program can be improved so that the user is only required to enter the equations in a more convenient form (reverse Polish notation, RPN, is a prime candidate). Another possibility for improvement is to include in the scheduling algorithm the effects of the finite number of registers available.

BIBLIOGRAPHY

Cited References

[1]. S.F. Anderson, J.G. Earle, R.E. Goldschmidt, and D.M. Powers, "The IBM System/360 Model 91: Floating Point Execution Unit", IBM Journal of Research and Development, vol. 11, no. 1, pp. 34-53, 1967.
[2]. R.P. Brent and H.T. Kung, "A Regular Layout for Parallel Adders", IEEE Trans. on Computers, vol. 31, no. 3, pp. 260-264, 1982.
[3]. J. Clark, "A VLSI Geometry Processor for Graphics", IEEE Computer, vol. 13, no. 6, pp. 59-68, June 1980.
[4]. J.T. Coonen, "An Implementation Guide to a Proposed Standard for Floating-Point Arithmetic", IEEE Computer, vol. 13, no. 1, pp. 68-79, 1980.
[5]. E.B. Eichelberger and T.W. Williams, "A Logic Design Structure for LSI Testability", Tutorial: VLSI Support Technologies: Computer-aided Design, Testing, and Packaging, R. Rice (ed), IEEE Computer Society Press, pp. 245-251, 1982.
[6]. Excalibur Arm (Model No. RT3), Robotic Systems International, Sidney, B.C., V8L 3S1, Canada.
[7]. D. Fitzpatrick, "VLSI Implementation of a Reduced Instruction Set Computer", VLSI Systems and Computations, H.T. Kung, B. Sproull, and G. Steele (ed), Computer Science Press, Rockville, Maryland, pp. 327-336, 1981.
[8]. GTE Microcircuits, Tempe, Arizona 85281.
[9]. J.L. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, and T. Gross, "Design of a High Performance Processor", Third Caltech Conf. on VLSI, R. Bryant (ed), Computer Science Press, Rockville, Maryland, pp. 33-54, 1983.
[10]. J.L. Hennessy, N. Jouppi, F. Baskett, and J. Gill, "MIPS: A VLSI Processor Architecture", VLSI Systems and Computations, H.T. Kung, B. Sproull, and G. Steele (ed), Computer Science Press, Rockville, Maryland, pp. 337-346, 1981.
[11]. J.L. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, and T. Gross, "Performance Issues in VLSI Processor Design", IEEE Int'l Conf. on Circuits and Computers, pp. 153-156, 1983.
[12]. J.L. Hennessy, N. Jouppi, J. Gill, F. Baskett, A. Strong, T. Gross, C. Rowen, and J. Leonard, "The MIPS Machine", IEEE Compcon Spring, pp. 2-7, 1982.
[13]. HILO-2 User Manual, GenRad Inc., 1984.
[14]. R.F. Hobson, Simon Fraser University, Burnaby, British Columbia, Canada, private communication, March 1984.
[15]. N.P. Jouppi, "TV: An nMOS Timing Analyzer", Third Caltech Conf. on VLSI, R. Bryant (ed), Computer Science Press, Rockville, Maryland, pp. 71-85, 1983.
[16]. M.G.H. Katevenis, R.W. Sherburne, D.A. Patterson, and C.H. Sequin, "The RISC II Micro-Architecture", VLSI '83, F. Anceau and E.J. Aas (ed), Elsevier Science Publishers B.V. (North-Holland), pp. 349-359, 1983.
[17]. A. Kozak, private communication, Microtel Pacific Research, Burnaby, B.C., Canada, Oct. 1984.
[18]. R.H. Krambeck, C.M. Lee, and H.-F.S. Law, "High Speed Compact Circuits with CMOS", IEEE JSSC, vol. SC-17, no. 3, pp. 614-618, 1982.
[19]. D. Landskov, S. Davidson, B. Shriver, and P.W. Mallett, "Local Microcode Compaction Techniques", Computing Surveys, vol. 12, no. 3, pp. 261-294, 1980.
[20]. C.S.G. Lee, T.N. Mudge, and J.L. Turney, "Hierarchical Control Structure using Special Purpose Processors for the Control of Robot Arms", Proceedings of the IEEE Pattern Recognition and Image Processing Conference, pp. 634-640, 1982.
[21]. K. McDonough, E. Caudel, S. Magar, and A. Leigh, "Microcomputer with 32-bit arithmetic does high-precision number crunching", Electronics, Feb. 24, pp. 105-110, 1982.
[22]. C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Menlo Park, CA, 1980.
[23]. J.K. Ousterhout, "Crystal: A Timing Analyzer for nMOS VLSI Circuits", Third Caltech Conf. on VLSI, R. Bryant (ed), Computer Science Press, Rockville, Maryland, pp. 57-69, 1983.
[24]. D.A. Patterson and C.H. Sequin, "A VLSI RISC", IEEE Computer, vol. 15, no. 9, pp. 8-22, 1982.
[25]. D.A. Patterson and C.H. Sequin, "Design Considerations for Single-Chip Computers of the Future", IEEE Trans. on Computers, vol. C-29, no. 2, pp. 108-116, 1980.
[26]. D.A. Patterson, "A RISCy Approach to Computer Design", IEEE Compcon Spring, pp. 8-14, 1982.
[27]. R.P. Paul, Robot Manipulators: Mathematics, Programming and Control, MIT Press, Cambridge, Mass., 1981.
[28]. J.K. Poon, "Parallelism in the Micro-Program of a Special Purpose Arithmetic Processor", UBC VLSI Lab. Tech. Report 001, May 1984.
[29]. RNL 4.2 User's Guide, VLSI Design Tools Reference Manual, Release 2.1, UW/NW VLSI Consortium, University of Washington, Seattle, Washington, 1984.
[30]. A. Sawai, "Programmable LSI Digital Signal Processor Development", VLSI Systems and Computations, H.T. Kung, B. Sproull, and G. Steele (ed), Computer Science Press, Rockville, Maryland, pp. 29-40, 1981.
[31]. R.B. Seeds, "Yield and Cost Analysis of Bipolar LSI", Tech. Digest, IEEE Electron Devices Meeting, Oct. 1967.
[32]. C.H. Sequin and D.A. Patterson, "Design and Implementation of RISC I", VLSI Architecture, R. Randell and P.C. Treleaven (ed), Prentice Hall, pp. 276-298, 1983.
[33]. R.W. Sherburne, M.G.H. Katevenis, D.A. Patterson, and C.H. Sequin, "Local Memory in RISCs", IEEE Int'l Conf. on Circuits and Computers, pp. 149-152, 1983.
[34]. R. Sherburne, "Data Path Design for RISC", Proc. MIT Conf. on Advanced Research in VLSI, pp. 53-62, 1982.
[35]. P.C. Treleaven, D.R. Brownbridge, and R.P. Hopkins, "Data-Driven and Demand-Driven Computer Architecture", ACM Computing Surveys, vol. 14, no. 1, pp. 93-143, 1982.
[36]. P.C. Treleaven, "Decentralized Computer Architectures for VLSI", VLSI Architecture, R. Randell and P.C. Treleaven (ed), Prentice Hall, pp. 348-380, 1983.
[37]. J. Vuillemin and L. Guibas, "On Fast Binary Addition in MOS Technologies", IEEE Int'l Conf. on Circuits and Computers, pp. 147-150, 1982.
[38]. H. Yoshimune, Y. Kita, T. Miyamoto, Y. Toba, H. Hara, and T. Akazawa, "A Single Chip Digital Signal Processor and Its Application to Real-Time Speech Analysis", IEEE JSSC, vol. SC-18, no. 1, pp. 91-98, 1983.

General References

In addition to the above cited literature, the following references will be found useful for anyone interested in VLSI or coordinate transformations in robotics:

[1]. R.J. Antinone and G.W. Brown, "The Modeling of Resistive Interconnects for Integrated Circuits", IEEE JSSC, vol. SC-18, no. 2, pp. 200-203, 1983.
[2]. L.N. Dworsky, Modern Transmission Line Theory and Applications, Wiley, New York, 1979.
[3]. K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley and Sons, New York, 1979.
[4]. H.-F.S. Law and M. Shoji, "PLA Design for the BELLMAC-32A Microprocessor", IEEE Int'l Conf. on Circuits and Computers, pp. 161-163, 1982.
[5]. C.S.G. Lee, "Robot Arm Kinematics, Dynamics, and Control", IEEE Computer, vol. 15, no. 12, pp. 62-80, 1982.
[6]. C.S.G. Lee, "A Geometric Approach in Solving the Inverse Kinematics of PUMA Robots", 13th Symposium on Industrial Robots, pp. 16-1 to 16-8, 1983.
[7]. A.D. Lopez and H.-F.S. Law, "A Dense Gate Matrix Layout Method for MOS VLSI", IEEE JSSC, vol. SC-15, no. 4, pp. 736-740, 1980.
[8]. W.K. Luk, "A Regular Layout for Parallel Multiplier of O(log2N) Time", VLSI Systems and Computations, H.T. Kung, B. Sproull, and G. Steele (ed), Computer Science Press, Rockville, Maryland, pp. 317-326, 1981.
[9]. J. Mavor, M.A. Jack, and P.B. Denyer, Introduction to MOS LSI Design, Addison-Wesley, 1983.
[10]. C. Mead and M. Rem, "Minimum Propagation Delays in VLSI", IEEE JSSC, vol. SC-17, no. 4, pp. 773-775, 1982.
[11]. S. Muroga, VLSI System Design, John Wiley and Sons, 1982.
[12]. R.Y. Pinter, "River Routing: Methodology and Analysis", VLSI Systems and Computations, H.T. Kung, B. Sproull, and G. Steele (ed), Computer Science Press, Rockville, Maryland, pp. 141-163, 1981.
[13]. J.K. Poon, D.L. Pulfrey, and P.D. Lawrence, "Arithmetic Processor Chip for Robot Control", International Symposium on VLSI Technology, Systems and Applications, pp. 337-341, 1985.
[14]. G. Rabbat (ed), Hardware and Software Concepts in VLSI, Van Nostrand Reinhold, 1983.
[15]. P. Reusens, W.H. Ku, and Y.-H. Mao, "Fixed-Point High Speed Parallel Multiplier", VLSI Systems and Computations, H.T. Kung, B. Sproull, and G. Steele (ed), Computer Science Press, Rockville, Maryland, pp. 301-310, 1981.
[16]. T. Sakurai, "Approximation of Wiring Delay in VLSI", IEEE JSSC, vol. SC-18, no. 4, pp. 418-426, 1983.
[17]. M. Shoji, "Electrical Design of BELLMAC-32A Microprocessor", IEEE Int'l Conf. on Circuits and Computers, pp. 112-115, 1982.
[18]. S. Waser, "High Speed Monolithic Multipliers", IEEE Computer, vol. 11, no. 10, pp. 19-29, 1978.
APPENDIX A: LAYOUTS

In this appendix, the layouts of some of the more important cells are shown. The design rules used are those for the GTE 4μm ISOCMOS process. The rules are confidential and so cannot be disclosed here. The following legend is used in all the figures, with each layer drawn in a distinct line style: Diffusion; Polysilicon; Metal; Contact Cut; N+ Diffusion; P+ Diffusion; P-Well.

A.1: 3-Port Register Cell. See also pp. 41-44, 92, 110.

A.2: 2 x 2 Register File. Notice how the symmetry of the register cell is taken advantage of in the register file. See also pp. 41-44, 91, 105-110.

A.3: One Barrel Shifter Cell. See also pp. 44-47, 94.

A.4: 3-out-of-5 Barrel Shifter. See also pp. 44-47, 93.

A.5: 4-Bit Leading Zero Checker. See also pp. 47, 111, 112.

A.6: Stack Cell. See also pp. 62, 97, 115.

A.7: 2-Bit by 2-Level Stack. See also pp. 62, 96, 115.

A.8: 2-Bit Par-Add Adder. See also pp. 51-54, 113.

A.9: 1-Bit Static Regenerative Manchester Carry Adder. See also pp. 48-50, 114.

A.10: PLA for Stack Control. See also pp. 60, 61, 117.

A.11: Clock Generator. The circuit for this layout is shown in Figure 7.13. See also pp. 55-60, 116.

A.12: Flags. This circuit and its corresponding layout are used to store the flags in APRjr.

A.13: Barrel Shifter Controller for APRjr. This circuit and its corresponding layout are used for controlling the barrel shifter in the multiplier section of APRjr, and for testing the number of leading zeros in the multiplier result.

APPENDIX B: SIMULATIONS

In this appendix, some of the more important simulations performed are presented. Two simulators have been used in this project, namely the circuit level simulator, SPICE, and the switch level simulator, hg. For small circuits, SPICE with level 1 MOSFET parameters is used. Level 2 is only used for very small circuits, not only because of the slow execution speed of level 2, but also because it has frequent convergence problems. Hg is used for larger circuits, consisting of over 20 transistors. Where necessary, the circuit used in the simulation is shown with the simulation output.

B.1: Writing a Zero to a Register. See also pp. 41-44, 92.

B.2: Writing a One to a Register. See also pp. 41-44, 92.

B.3: Writing a One to a Register with Noise. The signal on the write bus is set to only 4V in this simulation. Notice that the response of the register cell is almost the same as in Appendix B.2. See also pp. 41-44, 92.

B.4: Writing a One to a Register with Noise, Level 2. MOSFET model 2 is used in this simulation to investigate the reliability of model 1. Observe that the response is very close to, although not exactly the same as, that of Appendix B.3. See also pp. 41-44, 92.

B.5: Register Sourcing a Zero. See also pp. 41-44, 92.

B.6: Register Sourcing a One. See also pp. 41-44, 92.

B.7: Leading Zero Checker. See also pp. 47, 95.

B.8: One Par-Add Logic Block. One of the Par-Add logic blocks is simulated to estimate the speed of a Par-Add adder. Proper loads, corresponding to the next stages, have been added in the simulation. See also pp. 51-54, 98.

B.9: Carry Propagation in One Static Regenerative Manchester Carry Adder Cell. See also pp. 48-50, 98.

B.10: 2-Bit by 2-Level Stack. A '0' is pushed in bit 1, then a '1' is pushed in bit 1. The data are then popped back out. Bit 0 is not used. See also pp. 62, 96, 97.

B.11: Clock Generator. The clock phases that are generated can be compared with those given in Figure 5.2, except that here the complements of the phases are shown. The glitches will probably not cause any problems because of the large capacitances that the clock has to drive, although improvements can probably be made to remove them. See also pp. 55-60, 101.

B.12: PLA for Stack Control. Notice that the outputs occasionally take on a wrong value before finally settling down to the correct values. This causes no problems in a static PLA, but may be disastrous if a dynamic design is used. See also pp. 60, 61, 100.
APPENDIX C: INSTRUCTION SET

The complete set of instructions for APRjr is listed below. The instructions have been grouped according to the functions they perform. For each instruction, a three-letter mnemonic is given, followed by the 32-bit op-code. The number of machine cycles required to execute each instruction is also given.

Since APRjr uses non-interlocked pipelines, a new instruction can be started whether or not the previous instruction has finished executing. It is therefore up to the user, or the compiler, to make sure that the instructions are executed in the proper sequence. For example, in a series of Multiply/Add instructions, an instruction that uses a previous multiplication result can be started only after at least three cycles, and an instruction that uses a previous addition result can be started only after at least two cycles. A short illustration follows the Multiply/Add group below; see Appendix D for detailed examples.

The full name of the instruction, definitions of the various fields in the op-code, and the action performed by the instruction are given in the remarks section. In the op-code descriptions, the following convention is used:

1, 0:    binary digits which must be entered as is;
?-F0-?:  a sub-field, containing a certain number of bits, called F0;
x-5-x:   a field of don't-care bits, in this particular case 5 bits long.

MULTIPLY/ADD GROUP

Mnemonic: MAD
Opcode: 10 ?-F5-? ?-F4-? ?-F3-? ?-F2-? ?-F1-? ?-F0-?
No. of cycles: 4 for multiply portion, 3 for add portion
Remarks: Simultaneous Multiply/Add;
  F0: 5-bit field, sink for multiplier;
  F1: 5-bit field, source 0 for multiplier;
  F2: 5-bit field, source 1 for multiplier;
  F3: 5-bit field, sink for adder;
  F4: 5-bit field, source 0 for adder;
  F5: 5-bit field, source 1 for adder;
  The contents of the registers specified by F1 and F2 are multiplied and the result is stored in the register specified by F0. At the same time, the contents of the registers specified by F4 and F5 are added together and the result is stored in the register specified by F3;

Mnemonic: MSB
Opcode: 11 ?-F5-? ?-F4-? ?-F3-? ?-F2-? ?-F1-? ?-F0-?
No. of cycles: 4 for multiply portion, 3 for add portion
Remarks: Simultaneous Multiply and Subtract;
  Action similar to MAD, except that the contents of the register specified by F4 are subtracted from the contents of the register specified by F5;
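To make the spacing rule described in the introduction concrete, the following fragment (a hypothetical illustration; the register numbers are arbitrary, and in real code the NOPs would be replaced by useful independent instructions, as in Appendix D) consumes a product no earlier than three cycles after the multiply is started:

MAD R3,R1,R2:R0,R0,R0   ;start r3 = r1*r2; adder idle
NOP                     ;r3 not yet available
NOP                     ;r3 still not available
MAD R0,R0,R0:R5,R3,R4   ;three cycles after the multiply: r5 = r3 + r4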
ETC GROUP

Program Counter Control Sub-Group

Mnemonic: JMP
Opcode: 000001 ?-F0-? x-10-x
No. of cycles: 1
Remarks: Unconditional Jump;
  F0: 16-bit field, new program address, where bit 10 is the LSB;
  The address given in F0 is forced into the program counter;

Mnemonic: JOF
Opcode: 000010 ?-F0-? x-10-x
No. of cycles: 1
Remarks: Jump on Flag Set;
  F0: 16-bit field, new program address, where bit 10 is the LSB;
  The PC is set to the new value if the main flag (set or cleared by the TSF instruction) is set. Otherwise, the PC is incremented as usual;

Mnemonic: JSR
Opcode: 000011 ?-F0-? x-10-x
No. of cycles: 1
Remarks: Jump to Subroutine;
  F0: 16-bit field, jump address;
  The present PC is pushed onto the hardware stack (2 levels deep), and the PC is then set to the new value;

Mnemonic: RTN
Opcode: 000100 x-26-x
No. of cycles: 1
Remarks: Return from Subroutine;
  The contents of the top of the hardware stack are transferred to the PC. The original PC is lost;

I/O Sub-Group

Mnemonic: LCD
Opcode: 000101 ?-F1-? ?-F0-? x-5-x
No. of cycles: 1
Remarks: Load Constant Data;
  F0: 5-bit field, data sink;
  F1: 16-bit field, data word;
  The data word given in F1 is stored in the register specified by F0;

Mnemonic: WWD
Opcode: 000110 ?-F1-? ?-F0-? x-5-x
No. of cycles: depends on external memory speed, minimum = 1
Remarks: Write Word Direct;
  F0: 5-bit field, data source;
  F1: 16-bit field, data address;
  The contents of the register specified by F0 are written to the data memory location specified by F1;
  When 2 or more memory write instructions are executed in sequence, care must be taken that the external memory has enough time to respond. Other instructions, such as a MAD or JMP instruction, can be executed immediately after a WWD;

Mnemonic: WWI
Opcode: 000111 x-11-x ?-F1-? ?-F0-? x-5-x
No. of cycles: depends on external memory speed, minimum = 1
Remarks: Write Word Indirect;
  F0: 5-bit field, data address source;
  F1: 5-bit field, data source;
  The contents of the register specified by F1 are written to the memory location specified by the contents of the register specified by F0;
  See the remarks under WWD;

Mnemonic: RDW
Opcode: 001000 x-21-x ?-F0-?
No. of cycles: 1
Remarks: Read Data Word;
  F0: 5-bit field, sink for data;
  All 16 bits of the data bus are copied to the register specified by F0;

Mnemonic: RUB
Opcode: 001001 x-26-x
No. of cycles: 1
Remarks: Read Upper Data Byte;
  The upper 8 bits of the data bus are latched on-chip. Note that the data is not written to the register file;

Mnemonic: RLB
Opcode: 001010 x-21-x ?-F0-?
No. of cycles: 1
Remarks: Read Lower Data Byte;
  F0: 5-bit field, data sink;
  The lower 8 bits of the data bus are latched on-chip. The contents of the whole 16-bit latch are then written to the register specified by F0;

Mnemonic: WUB
Opcode: 001011 ?-F1-? ?-F0-? x-5-x
No. of cycles: 1
Remarks: Write Upper Data Byte;
  F0: 5-bit field, data source;
  F1: 16-bit field, data address;
  The upper byte of the register specified by F0 is written to the memory location specified by F1;

Mnemonic: WLB
Opcode: 001100 ?-F1-? ?-F0-? x-5-x
No. of cycles: 1
Remarks: Write Lower Data Byte;
  F0: 5-bit field, data source;
  F1: 16-bit field, data address;
  The lower byte of the contents of the register specified by F0 is written to the memory location specified by F1;

Mnemonic: SAD
Opcode: 001101 ?-F0-? x-10-x
No. of cycles: 1
Remarks: Send Memory Address, Direct;
  F0: 16-bit field, memory address;
  The address specified by F0 is driven onto the data address bus. Used to prepare for a direct read from the data memory;

Mnemonic: SAI
Opcode: 001110 x-16-x ?-F0-? x-5-x
No. of cycles: 1
Remarks: Send Data Address, Indirect;
  F0: 5-bit field, address source;
  The contents of the register specified by F0 are driven onto the data address bus. Used to prepare for an indirect read from the data memory;
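The two-instruction read protocol implied by SAD and SAI can be illustrated as follows (a hypothetical fragment in the style of the Appendix D examples; the register numbers are arbitrary, and it is assumed, as in Appendix D, that the external data memory can respond within one machine cycle):

SAI R4    ;drive the address held in r4 onto the data address bus
RDW R7    ;copy the addressed word from the data bus into r7

For slower memories, independent instructions (or NOPs) could be inserted between the address and the read.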
Miscellaneous Sub-Group

Mnemonic: LDF
Opcode: 001111 x-9-x ?-F0-? x-10-x
No. of cycles: 1
Remarks: Load Flags;
  F0: 7-bit field, new flags;
  The flags in the condition code register (CCR) are set or cleared according to the new flags supplied. The flags are: Out0 Out1 In0 In1 V N Z;

Mnemonic: TSF
Opcode: 010010 x-9-x ?-F0-? x-10-x
No. of cycles: 1
Remarks: Test Flags;
  F0: 7-bit field, test template;
  The data in the template is ANDed with the CCR, and the main flag is set or cleared according to the result. Only JOF uses the main flag;
  The V (overflow), N (negative), and Z (zero) flags reflect only the condition of the results of the adder section. There is no way to monitor the results of the multiplier section; they are hard limited when an overflow occurs. The In0 and In1 flags are external conditions which can be tested. See section 9.3 for the meaning of these 2 flags;

Mnemonic: SSH
Opcode: 01000 ?-F0-? 0 x-9-x
No. of cycles: 1
Remarks: Set Shift Amount;
  F0: 17-bit field, shift amount; only 1 bit should be non-zero in these 17 bits;
  The shift amount is written to the barrel shifter controller. All subsequent results from the multiplier will then be shifted by the specified amount;
  For example, to shift all results from the multiplier up by 2 bits, the instruction is:
  01000 00100000000000000 0 xxxxxxxxx

Mnemonic: SLZ
Opcode: 01000 ?-F0-? 1 x-9-x
No. of cycles: 1
Remarks: Set Required Number of Leading Zeros;
  F0: 17-bit field, required number of leading zeros. See the examples below;
  The required number of leading zeros is written to the barrel shifter controller. All subsequent results from the multiplier which have fewer than this number of leading zeros will be flagged as overflows;
  For example, to require all results from the multiplier to have at least 2 leading zeros, the instruction is:
  01000 00111111111111111 1 xxxxxxxxx
  and for 5 leading zeros the instruction is:
  01000 00000111111111111 1 xxxxxxxxx

Mnemonic: NOP
Opcode: 000000 x-26-x
No. of cycles: 1
Remarks: No Operation;
  No action is performed. The flags in the CCR are not affected;

APPENDIX D: SCHEDULING THE ADDER AND THE MULTIPLIER SECTIONS

Since the adder and multiplier sections can run in parallel, the speed of a program depends strongly on the sequence of the instructions. A program which keeps the adder and the multiplier running most of the time will run a lot faster than a program which does not. To illustrate this, consider programming the following simple equation:

y = a·b·c - (b·d + a + b·c)·d

A possible schedule for the adder and multiplier is shown below. For the sake of simplicity, the delays required by the various pipelines are ignored; in other words, it is pretended that there are no pipeline delays.

Multiplier              Adder
b·d                     (idle)
b·c                     (idle)
a·(b·c)                 a + (b·c)
(idle)                  (b·d) + (a + b·c)
(b·d + a + b·c)·d       (idle)
(idle)                  (a·b·c) - (b·d + a + b·c)·d

And the corresponding APRjr program is, again ignoring pipeline delays:

SAD &A                  ;send address of variable A out,
                        ;to prepare for reading A.
                        ;&A stands for "address of A",
                        ;as in the C programming language.
RDW R1                  ;read in value of A, and place it in register 1
SAD &B                  ;prepare to read B
RDW R2                  ;put B in register 2
SAD &C
RDW R3                  ;put C in r3
SAD &D
RDW R4                  ;put D in r4
MAD R5,R2,R4:R0,R0,R0   ;r5 = b*d, adder idle
MAD R6,R2,R3:R0,R0,R0   ;r6 = b*c, adder idle
MAD R7,R6,R1:R8,R1,R6   ;r7 = a*(b*c), r8 = (a) + (b*c)
MAD R0,R0,R0:R9,R5,R8   ;multiplier idle, r9 = (b*d) + (a + b*c)
MAD R5,R9,R4:R0,R0,R0   ;r5 = (b*d + a + b*c)*d, adder idle
MSB R0,R0,R0:R6,R7,R5   ;r6 = (a*b*c) - (b*d + a + b*c)*d
WWD &Y,R6               ;store result in data memory

The schedule can be rearranged to reduce the number of instructions by 1:

Multiplier              Adder
b·c                     (idle)
b·d                     a + (b·c)
a·(b·c)                 (b·d) + (a + b·c)
(b·d + a + b·c)·d       (idle)
(idle)                  (a·b·c) - (b·d + a + b·c)·d

It can be seen that even in this simple example, the execution speed of the program depends on the proper scheduling of the multiplier and the adder. One possible coding of the rearranged schedule is sketched below.
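For concreteness, the rearranged schedule could be coded as follows (the register assignments here are chosen for illustration only and are not from the hand-coded program above); the arithmetic portion now takes five instructions instead of six:

MAD R5,R2,R3:R0,R0,R0   ;r5 = b*c, adder idle
MAD R6,R2,R4:R7,R1,R5   ;r6 = b*d, r7 = (a) + (b*c)
MAD R8,R1,R5:R9,R6,R7   ;r8 = a*(b*c), r9 = (b*d) + (a + b*c)
MAD R5,R9,R4:R0,R0,R0   ;r5 = (b*d + a + b*c)*d, adder idle
MSB R0,R0,R0:R6,R8,R5   ;r6 = (a*b*c) - (b*d + a + b*c)*d

As before, pipeline delays are ignored; with them included, the remaining gaps would be filled by the scheduler described next.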
The problem of scheduling multiple processing elements efficiently has been explored quite thoroughly in the micro-programming area. For the present application, the branch and bound (BAB) algorithm with a list scheduling heuristic is used. See [19] for more information on scheduling algorithms and [28] for why BAB was chosen among the other algorithms.

In the BAB algorithm with list scheduling heuristic, each operation is assigned a weight, equal to the number of operations that depend on it. The schedule is then determined by simply picking the operation with the heaviest weight first. If two operations that can be scheduled have the same weight, an arbitrary choice is made. For the example of the previous section, b·c has five dependent operations while b·d has only three, so b·c is scheduled first, exactly as in the rearranged schedule above. The algorithm is:

Compute the weight of each operation;
From the list of operations that can be scheduled, schedule in descending order of weights;
Repeat until all operations are scheduled.

During scheduling, no attention is paid to the registers that are needed; only the sequence of execution is established. After scheduling is complete, registers are allocated to store the intermediate results of each operation. The register that is assigned to an operation is freed whenever all the sons of the operation have been assigned a register.

A simplified schedule (pipeline delays ignored) generated for the computation of p of the Excalibur Arm is given below. The program listing of the BAB algorithm is also included for completeness.

The Resulting Schedule

multiplier    register    adder         register
operation     assigned    operation     assigned
    3             1           0             0
    5             2           0             0
    7             3          10             4
    6             1           0             0
    4             2           0             0
   13             5          11             6
    8             1           0             0
   14             2           0             0
   12             4           0             0
    2             1          17             7
   15             4           0             0
   20             8          18             9
   33             4          32            10
   23             5           9             8
   22             4          26             9
   16             5          25             3
   28             1          19             4
   27             2          31             3
   24             1           0             0
   21             4          29             5
    0             0          30             1

- operation no. = 0 means the unit is idle.
- registers are reused as soon as possible.
- the sines and cosines of the angles are assumed to have been read into the registers already.
- pipeline delays are ignored.
program BABHeuristic;
{
  Branch and Bound with List Scheduling Heuristic
}

const
  MaxOp      = 50;
  MaxReg     = 50;
  Occupied   = 1;
  UnOccupied = 0;
  Null       = 0;
  Mul        = 1;
  Add        = 2;
  Infinity   = 10000;

type
  SetOfParents = array [1..10] of integer;
  SetOfSons    = array [1..10] of integer;
  SetOfDes     = array [1..100] of integer;
  OpRec = record
    Reg: integer;
    Op: integer;
    Wt: integer;
    NumParents: integer;
    RemParents: integer;
    Parents: SetOfParents;
    NumSons: integer;
    Sons: SetOfSons;
  end;
  PtrNode = ^OneNode;
  OneNode = record
    OpNum: integer;
    Next: PtrNode;
  end;
  PtrTimeSlot = ^TimeSlot;
  TimeSlot = record
    OpNum: integer;
    Prev: PtrTimeSlot;
  end;

var
  ShortestSched, SchedLen: integer;   { not used in this listing }
  RdyNull, RdyMul, RdyAdd: PtrNode;
  NextRdyNull: PtrNode;
  BestSched: PtrTimeSlot;             { not used in this listing }
  LastSlot: PtrTimeSlot;
  Op: integer;
  OpSet: array [1..MaxOp] of OpRec;
  RegSet: array [1..MaxReg] of integer;

{ Return the index of the first free register and mark it occupied. }
function FindFreeReg: integer;
var
  i: integer;
begin
  i := 1;
OpNum = O then 113 w r i t e ( O ) 114 e l s e 115 write(OpSet[CurOp@.OpNum].Reg); ( 135 116 i f Op = Add then 117 b e g i n 118 w r 1 t e l n ; 119 Op := Mul ; 120 end 121 e l s e 122 Op := Add; 123 end; {proc} 124 125 126 p r o c e d u r e P r i n t S c h e d ; 127 v ar 128 1: i n t e g e r ; 129 b e g i n 130 f o r i:=1 to MaxReg do 131 R e g S e t f i ] := Unoccupied; 132 Op := Mul; 133 A l l o c a t e R e g ( L a s t S l o t ) : 134 end; {proc} 135 136 137 p r o c e d u r e C o n s t r u c t G r a p h ; 138 v a r 139 i , OpNum, NumSons, 5on, NumParents: i n t e g e r ; 140 b e g i n 141 f o r i;=1 to MaxOp do 142 OpSet[i].NumParents := 0; 143 144 read(OpNum); 145 w h i l e OpNum <> 0 do 146 b e g i n 147 read(0pSet[0pNum].Op); 148 NumSons : = 0; 149 r e p e a t 150 r e a d ( S o n ) ; 151 NumSons := NumSons + 1; 152 0pSet[0pNum].Sons[NumSons] := Son: 153 i f Son <> 0 then 154 b e g i n 155 NumParents := OpSet[Son].NumParents + 1; 156 OpSettSon].NumParents := NumParents; 157 OpSet[Son].RemParents := NumParents; 158 0pSet[Son].Parents!NumParents] := OpNum; 159 end; { i f } 160 unt i1 Son = 0; 161 0pSet[0pNum] .NumSons := NumSons - -1; 162 read(OpNum); 163 end; {while} 164 end; {proc} 165 166 167 { T h i s p r o c e d u r e updates t h e r e a d y - n u l l , ready-mul, and 168 ready-add l i s t s . 169 } 170 p r o c e d u r e UpdateLists(0pNum: i n t e g e r ) ; 171 v a r 172 i . NumSib, CurOp, CurWt: i n t e g e r ; 173 S i b l i n g s : SetOfSons; 136 174 Node. RefNodel, RefNode2: PtrNode; 175 b e g i n 176 NumSib:= OpSet[OpNum].NumSons; 177 S i b l i n g s := OpSet[OpNum].Sons: 178 f o r 1:=1 t o NumSib do 179 b e g i n 180 CurOp := S i b l i n g s [ i ] ; 181 OpSet[CurOp].RemParents := OpSet[CurOp].RemParents - 1; 182 i f OpSet[CurOp].RemParents = 0 then 183 b e g i n 184 new(Node); 185 Node@.OpNum : = CurOp; 186 c a s e OpSet[CurOp].Op of 187 N u l l : b e g i n 188 Node<».Next := NextRdyNul 1 ; 189 NextRdyNul1 := Node; 190 end; 191 Mul : b e g i n 192 i f RdyMul = n i l then 193 b e g i n 194 RdyMul := Node; 195 Node@>.Next := n i l ; 196 end 197 e l s e _ ' 198 b e g i n 199 CurWt := OpSet[CurOp].Wt; 200 i f CurWt > OpSet [ RdyMul «>. Opnum] . Wt then 201 b e g i n 202 Nodes?. Next := RdyMul; 203 RdyMul := Node: 204 end 205 e l s e 206 beg i n 207 RefNodel := RdyMul; 208 > RefNode2 := RefNode1@.Next; 209 w h i l e ((RefNode2 <> n i l ) and (OpSet[RefNode2@>.OpNum].Wt > CurWt)) do 210 b e g i n 211 RefNodel := RefNode2; 212 RefNode2 := Ref Node2@>. Next; 213 end; (w h i l e ) 213.3 Ref Node1<» . Next := Node; 213.6 Node@.Next := RefNode2; 214 end; { i f } 215 end; {if} 216 end; {clause} 217 Add : b e g i n 218 i f RdyAdd = n i l then 219 b e g i n 220 RdyAdd := Node; 221 Node@.Next := n i 1 ; 222 end 223 e l s e 224 b e g i n 225 CurWt := OpSet[CurOp].Wt; 226 i f CurWt > OpSet[RdyAdd©.Opnum].Wt then 227 b e g i n 228 Nodee.Next :* RdyAdd; 229 RdyAdd := Node; 137 230 end 231 e l s e 232 b e g i n 233 RefNodel :«= RdyAdd; 234 RefNode2 := Ref Node 1©. Next; 235 w h i l e ((RefNode2 <> n i l ) and (OpSet[RefNode2©.OpNum].Wt > CurWt)) do 236 b e g i n 237 RefNodel := RefNode2; 238 RefNode2 := Ref Node2©. 
Next; 239 end; {while) 239.3 RefNodel©.Next := Node; 239.6 Node©.Next := RefNode2; 240 end: { i f } 241 end; { i f } 242 end; {clause} 243 end; {case} 244 end; { i f } 245 end; {for} 246 end; {proc} 247 248 249 p r o c e d u r e Schedule; 250 v a r 251 NewSlot: P t r T i m e S l o t ; 252 MulOp, AddOp: PtrNode; * 253 b e g i n 254 NextRdyNul1 := n i l : 255 new(NewSlot); 256 NewSlot©.Prev := L a s t S l o t ; 257 L a s t S l o t := NewSlot; 258 MulOp := RdyMul; 259 i f RdyMul = n i l then 260 NewS 1 ot©. OpNum : = 0 261 e l s e 262 b e g i n 263 RdyMul :« RdyMul©.Next; 264 NewSlot©.OpNum := MulOp©.OpNum; 265 end; { i f } 266 new(NewSlot); 267 NewSlot©.Prev := L a s t S l o t ; 268 L a s t S l o t := NewSlot; 269 AddOp := RdyAdd; 270 i f RdyAdd = n i l then 271 NewSlot©.OpNum := 0 272 e l s e 273 b e g i n 274 RdyAdd := RdyAdd©.Next; 275 NewSlot©.OpNum := AddOp©.OpNum; 276 end; { i f ) 277 If MulOP <> n i l then 278 UpdateL1sts(MulOp©.OpNum); 279 i f AddOp <> n i l then 280 UpdateL1sts(AddOp©.OpNum); 281 w h i l e RdyNul1 <> n i l do 282 b e g i n 283 UpdateLists(RdyNul1@.OpNum); 284 RdyNul1 := RdyNul1©.Next; 285 end; {while} 138 286 RdyNull := NextRdyNul1; 287 end; {proc} 288 289 290 procedure Union(var ParentDes: SetOfDes; 291 var NuntPDes: Integer; 292 SonDes: SetOfDes; 293 NumSDes; Integer); 294 var 295 1. j , k: Integer;, 296 found: boolean; 297 begin 298 k := NumPDes; 299 for 1:=1 to NumSDes do 300 begin 301 found := f a l s e ; 302 J : = 1 ; 303 while ( ( j <= NumPDes) and (not found)) do 304 beg 1n 305 i f SonDes[1] = ParentDes[j] then 306 found := true 307 e l s e 308 J : = j + 1 ; 309 end; {while} 310 i f not found then 311 beg i n 312 k : = k + 1 ; 313 ParentDesfk] := SonDes[i]; 314 end; {If) 315 end; {for} 316 NumPDes := k; 317 end; {proc} 318 319 320 procedure MergeDesfvar Descendents: SetOfDes; 321 var NumDes: integer; 322 OpNum: integer); 323 forward; 324 325 326 procedure AssignWeights(var Descendents: SetOfDes; 327 var NumDes: integer; 328 OpNum: Integer); 329 var 330 NumSons, MyNumDes, i : Integer; 331 Sons: SetOfSons; 332 MyDes: SetOfDes; 333 begin 334 NumSons := OpSet[OpNum].NumSons; 335 Sons := OpSet[OpNum].Sons; 336 MyNumDes := 1: 336.2 MyDes[1] := OpNum; 337 for 1:=1 to NumSons do 338 AssignWeights(MyDes, MyNumDes. Sons[l]); 339 OpSet[OpNum].Wt := MyNumDes; 340 Union(Descendents, NumDes, MyDes, MyNumDes); 341 end; {proc} 342 139 343 344 p r o c e d u r e ComputeWeights; 345 v a r 346 DummyA: SetOfDes; 347 DummyC; I n t e g e r ; 348 b e g i n 349 DummyC := 0; 350 AssignWeights(DummyA, DummyC, 1); 351 end; 352 353 354 p r o c e d u r e MergeDes; 355 v a r 356 i : i n t e g e r ; 357 found: b o o l e a n ; 358 b e g i n 359 i f OpSet[OpNum].NumSons = 0 then 360 b e g i n 361 i := 1; 362 found := f a l s e ; 363 w h i l e ( ( i <= NumDes) and (Not found)) do 364 b e g i n 365 i f D e s c e n d e n t s [ i ] = OpNum then 366 f o u n d := t r u e 367 e l s e 368 1 := i + 1; 369 end; ( w h i l e ) 370 i f not found then 371 b e g i n < 372 D e s c e n d e n t s f i ] := OpNum; 373 NumDes : = 1 ; 374 end; 375 end 376 e l s e 377 A s s i g n W e i g h t s ( D e s c e n d e n t s , NumDes, OpNum); 378 end; (p r o c ) 379 380 381 b e g i n 382 C o n s t r u c t G r a p h ; 383 ComputeWeights; 384 RdyNul1 := n i 1 ; 385 RdyMul := n i l ; 386 RdyAdd := n i l ; 387 L a s t S l o t := n i l ; 388 U p d a t e L i s t s ( 1 ) ; 389 w h i l e ((RdyNul1 <> n i l ) o r (RdyMul <> n i l ) or (RdyAdd <> n i l ) ) do 390 S c h e d u l e ; 391 P r i n t S c h e d ; 392 end. 
