UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

An arithmetic processor for robotics Poon, Joseph Kin-Shing 1985

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-UBC_1985_A7 P65.pdf [ 5.32MB ]
JSON: 831-1.0096321.json
JSON-LD: 831-1.0096321-ld.json
RDF/XML (Pretty): 831-1.0096321-rdf.xml
RDF/JSON: 831-1.0096321-rdf.json
Turtle: 831-1.0096321-turtle.txt
N-Triples: 831-1.0096321-rdf-ntriples.txt
Original Record: 831-1.0096321-source.json
Full Text

Full Text

AN  ARITHMETIC PROCESSOR FOR  ROBOTICS  by JOSEPH  K.IN-SHING POON  B.Sc.(Hon.), Queen's University,  1983  A THESIS SUBMITTED IN PARTIAL F U L F I L L M E N T T H E REQUIREMENTS FOR T H E D E G R E E O F MASTER  O F APPLIED  SCIENCE  in T H E F A C U L T Y O F G R A D U A T E STUDIES (Department of Electrical Engineering)  We accept this thesis as to the  T H E UNIVERSITY  conforming  required standard.  O F BRITISH COLUMBIA May,  1985  © Joseph Kin-Shing Poon,  1985  OF  In  p r e s e n t i n g  t h i s  r e q u i r e m e n t s o f  B r i t i s h  i t  f r e e l y  a g r e e f o r  f o r  a n  a v a i l a b l e  t h a t  I  u n d e r s t o o d  t h a t  f i n a n c i a l  b y  h i s  t h a t  o r  h e r o r  s h a l l  f u l f i l m e n t a t  t h e  g r a n t e d  The  o f  U n i v e r s i t y  Electrical o f  B r i t i s h  1956 Main M a l l V a n c o u v e r ,  Canada  V6T 1Y3 Date  DE-6  (3/81)  June  27, 1985  L i b r a r y  b y  n o t  t h e  b e  a l l o w e d  Engineering C o l u m b i a  o f  t h e  s h a l l I  o f  make  f u r t h e r t h i s  h e a d  r e p r e s e n t a t i v e s .  p u b l i c a t i o n  o f  U n i v e r s i t y  c o p y i n g  p e r m i s s i o n .  D e p a r t m e n t  t h e  a n d s t u d y .  e x t e n s i v e  may b e  c o p y i n g  g a i n  d e g r e e  r e f e r e n c e  f o r  p u r p o s e s  o r  p a r t i a l  a g r e e  f o r  p e r m i s s i o n  s c h o l a r l y  i n  a d v a n c e d  C o l u m b i a ,  d e p a r t m e n t  f o r  t h e s i s  I t  t h i s  w i t h o u t  t h e s i s  o f  my  i s t h e s i s my  w r i t t e n  ii  ABSTRACT  An CMOS purpose  arithmetic processor  chip for  use  in robotics  has  been  designed  technology. The chip is intended to function as a slave processor microprocessor  transformation  host  calculations  and for  be use  able in  to  perform  real-time  robot  control  which a machine cycle time of  architecture  of  this  to a general  arm  coordinate  applications.  parallel-processing, multi-pipelined architecture has been used to produce a for  in 4/*m  A  45mm chip 2  200ns appears possible. The general nature of the  microprogrammable  processor  renders  it  useful  computational tasks in robotics in addition to coordinate transformation.  for  a  range  of  iii  Table of Contents  Abstract  »  Acknowledgement  viii  1.  Introduction  2.  Coordinate Transformations  3.  Choosing an Appropriate Architecture 3.1  4.  5.  6.  1 in Robotics  3 8  Options for High Speed Computation  10  3.1.1  Systolic Array for Band Matrix Multiplication  10  3.1.2  Data Flow  13  3.1.3  Control Row  ,  17  Implementation Considerations  18  4.1  Constraints of the Fabrication Process  18  4.2  Floating-Point vs Fixed-Point Implementation  20  4.3  On-Chip Memory vs Off-Chip  23  4.4  Multiplication  24  4.5  32 Bit vs 24 Bit Representation  24  4.6  Reduced Instruction Set  25  4.7  Outline of Possible Architecture  25  The APR Architecture  Memory  26  5.1  The Register File  26  5.2  The Adder Section  28  5.3  The Multiplier Section  29  5.4  The Clock Phases  30  5.5  Fixed-Point Mode  31  5.6  Input/Output and Control  33  Programming and Scheduling  35  iv  7.  Design of Cells  37  7.1  Multi-port Register Design  41  7.2  Barrel Shifter  44  7.3  Leading Zero Checker  47  7.4  Adders  48  7.5  The Clock  55  7.6  Programmable Logic Array  60  7.7  Stack  62  8.  Floor Plan of APR  63  9.  The APRjr  66  10.  11.  9.1  Speed Considerations  9.2  Programming  9.3  The APRjr Subsystem  69 -  70 71  Testing  74  10.1  Prototype Testing  74  10.2  Production Testing  78  Conclusions  79  11.1  Contributions of this Thesis  11.2  Suggestions for Further Research  Bibliography  79 ~  80 85  Cited References  85  General References  88  Appendix A: Layouts  90  A.1: 3-Port Register Cell  91  A.2: 2 x 2 Register File  92  A.3: One Barrel Shiftier Cell  93  A.4: 3-out-of-5  94  Barrel Shifter  V  A.5: 4-Bit Leading Zero Checker  95  A.6: Stack Cell  96  A.7: 2-Bit by 2-Level Stack  97  A.8: 2-Bit Par-Add Adder  98  A.9: 1-Bit Static Regenerative Manchester Carry Adder  99  A. 10: PLA for Stack Control  100  A.11: Clock Generator  101  A.12: Rags  ."  A. 13: Barrel Shifter Controller for APRjr Appendix B: Simulations B. l :  Writing a Zero to a Register  102 103 104 105  B.2: Writing a One to a Register  106  B.3: Writing a One to a Register with Noise  107  B.4: Writing a One to a Register with Noise, Level 2  108  B.5: Register Sourcing a Zero  109  B.6: Register Sourcing a One  110  B.7: Leading Zero Checker  Ill  B.8: One Par-Add Logic Block  113  B.9: Carry Propagation in One Static Regenerative Manchester Carry Adder Cell  114  B.10: 2-Bit by 2-Level Stack  115  B.11: Clock Generator  116  B.12: P L A for Stack Control  117  Appendix C: Instruction Set  118  Appendix D: Scheduling the Adder and the Multiplier Sections  128  List of Figures  2.1 A Six Degree of Freedom Manipulator  3  2.2 Format of Transformation Matrix  4  3.1 Example of Transformation Matrix  8  3.2 A Band Matrix of Width w  10  3.3 Systolic Array for Matrix Multiplication  10  3.4 Simplistic Implementation of a Data Flow Machine  14  3.5 The MIT Data Flow Computer  15  4.1 Yield Characteristics of ISOCMOS  18  4.2 Number Format for 16-bit Fixed-Point Representation  20  4.3 Error Estimation of Number Format  22  5.1 Block Diagram of APR in Floating-Point Mode  26  5.2 Clock Phases Required in APR  30  5.3 Block Diagram of APR in Fixed-Point Mode  31  5.4 Instruction Pipeline  33  7.1 Communication in Adder Mantissa Section  37  7.2 Inefficient Routing of Adder Mantissa Section  37  7.3 Final Floor Plan of Adder Mantissa Section  38  7.4 A 3- Port Register Cell  41  7.5 4-out-of-7 Barrel Shifter  44  7.6 Barrel Shifter Connected as Up/Down Shifter  44  7.7 4-bit Leading Zero Checker  47  7.8 Dynamic Regenerative  48  7.9 Static Regenerative  Adder  Adder  7.10 4-Bit Adder with Partial Carry Look-Ahead  49 51  7.11 Circuit Diagram of a Par-Add Block  52  7.12 End-Around-Carry Adder with Domino C M O S  54  7.13 Shift Register for Multi-Phase Clock Generation  56  7.14 Signals from Clock Generation Shift Register  56  7.15 Two-Phase  58  Clock with Feedback to Guarantee Non-Overlapping  7.16 PLA for Stack Controller  60  7.17 2-bit by 2-level  62  Stack  8.1 Floor Plan of APR  63  9.1 Block Diagram of APRjr  66  9.2 Floor Plan of APRjr  66  9.3 Instruction Formats  70  9.4 Block Diagram of the APRjr Subsystem  71  viii  ACKNOWLEDGEMENT  I would like to thank my supervisors, Dr. D.L. Pulfrey and Dr. P.D. Lawrence for  their invaluable guidance and encouragement throughout this project  In  addition, I would like Dr. P. van Halen  for his help in the use  of the  pen-plotter.  This Research Columbia.  research  Council  of  has  been  Canada,  supported and  by  Robotic  the  Natural  Systems  Sciences  International,  and  Engineering  Sidney,  British  1. INTRODUCTION In order to control the motion of a robot manipulator, it is necessary to know what  the  position  manipulator, performed  and  of  the  vice  manipulator  versa.  For  is,  real  given  time  at  such  microprocessor.  the  joints  calculations  A mainframe  of  cannot  the be  computer or  array processor may be able to accomplish the task.  But it is much more desirable to have a smaller on  angles  applications,  fast enough by a general purpose  a minicomputer with an attached  the  the bus of a microprocessor  unit, preferably  one that would fit  system, that can perform the job. This would reduce  cost, space, and power consumption.  In  this  thesis,  transformations designed of  a  in robotics  special  purpose  and  implemented  taking into consideration  supporting  chips  required,  processor, in LSI,  factors such  ease of  designed  as  is  to  described.  available  compute  The processor  technology,  communication with  a  coordinate  host  was  speed, number  processor,  ease of  programming, and power consumption.  Two  versions  of  the  supports both floating-point only direct  handle  fixed-point  transform  of  an  trajectory  numbers. industrial  planning.  are  To  Both  these  manipulator  the inverse  APR  sections, chip  obtain  the  applications direct  non-interlocking pipelines  can than  transform,  handle APRjr. APRjr  floating-point However, for is  namely  were  [6], in a  necessary  the  APR  chip  which  designed  time  short  to  compute  enough  for  by iteration for use  speed  of  operation,  the  the the  in real designs  independent operation of the multiplier and and  numbers, the  adequate and,  chips  transformation  embody the features of parallel processing, adder  discussed,  and fixed-point arithmetic, and the APRjr chip which can  solution to be used to obtain time  processor  a  reduced it  specific in this  carried forward to the chip fabrication stage.  has task initial  instruction a  potentially of  fast  set  Because  wider  range  computation  investigation,  its  the of  of  the  design  was  2  A section  brief 2.  Section  transformations the  introduclion to coordinate  various  at  3  high  investigates speed.  constraints  and  transformations  the  various  A framework requirements  architecture of APR is described  ways  of a  is  in robotics  feasible  given  in  of  is  presented  performing  design  section  in  coordinate  taking into account  4. In  section  5,  the  in detail. Since scheduling multiple resources is both  an interesting topic in parallel computer architectures and an integral part of the APR design, section 6 is devoted to a discussion of how the computing resources on APR can  be  scheduled  important  cells  is  efficiently. the  topic  The of  detailed  section  7.  circuit  design  Section  8  and  deals  layout  with  of  the  more  routing  and  floor  planning of APR. The architecture and design of APRjr is presented in section 9. In section Finally,  10, in  the  procedures  section  11,  that the  can  be  used  contributions  suggestions for further research outlined.  of  to  test  this  the  chip  research  design  are  are outlined.  summarized,  and  3  2. COORDINATE TRANSFORMATIONS IN ROBOTICS A manipulator, or robot arm, consists of a number of rigid links connected by movable joints. Each joint has one degree of freedom, i.e., can revolve about one axis if  it is a revolute joint or can slide  example, Figure 2.1 illustrates  along one axis if it is a prismatic joint For  a manipulator with six degrees of freedom (six  revolute  joints). The axes about which the links rotate are also indicated.  In order to control the movement of the manipulator, it is necessary to know the positions of the various links of the manipulator, in particular the position of the end effector joint  to  (the free  read  determining effector,  the  in  end of the manipulator). It is possible angles  cartesian  and  send  them  coordinates  the  given the angles of the joints  back  position  to have sensors at each  to  the  controller.  and  the  orientation  The of  task  of  the  end  (or the length of a prismatic joint), is called  the direct transform in robotics. Conversely, the process of determining the  magnitudes  of  called  the  angles  given  the  position  and  orientation of  the  end  effector  is  the  inverse transform.  The  direct  transform  can  be  computed  Each joint can be assigned a cartesian coordinate  system  of  matrix  by  [27].  (4  4)  the  immediate  To  compute  by multiplication of 4 by 4  matrices.  coordinate system, which can be related to the  previous joint the  by  transformation  a  homogeneous matrix  of  transformation  the  end  effector  relative to the base, it is only necessary to multiply the transformation matrices of all the joints  together.  transformation  For  example,  for  the  manipulator  matrix relating the coordinate system  in Figure  2.1,  if  A j is  of joint 2 to joint 1, and A  the 2  is  that which relates the coordinate system of joint 3 to joint 2, etc., then the coordinate system of the end effector with respect to the base of the manipulator is given by  T  =  AIAJA3A4AJA4  Fig. 2.1: A Six Degree of Freedom Manipulator  5  The format of the matrix thus obtained is as shown in Figure 2.2. The first three columns specify the orientation of the end effector and the last column specifies the position of the end effector, all in cartesian coordinates.  n  o  x  x  n y  Oy  n  o  0  0  a  x  a  y  p  x  py  0  Fig. 2.2: Format of Transformation Matrix  It meaningful  should  be  pointed  out  that  these  coordinate  transformations  are  only  if the manipulator is rigid. A rigid manipulator is one which does not  deform significantly even under loading conditions. It can be seen that although the method of obtaining the direct transform is quite straightforward, the amount of computation involved is very large. It is possible to multiply the matrices  out  algebraically  and  obtain  the equations  for  the  four  required vectors, which can then be simplified by factoring. However, these equations are still quite complicated for real-time use. For example, the simplified equations of the position vector of a simple manipulator [6], which has a construction similar to the one shown in Figure 2.1, is  6  Px  = C,[a 2 Q - d«(S 2 3 Q - C C S J  Py  = Sxtd^CCCS, - Q 3 C 5 ) - a 2 Cj - C,(d3 + (1.S.S,)  Vz  = d,(S„CSs  + d}Ci)  where, for example, Cj 3 = cos(0  2  + aS 2  2  +  +  S,(d3  + <LS4S5)  4  + 0 ), S23  = sin(0  3  2  + 0 ), and the d's and a's 3  are related to the dimensions and geometry of the manipulator. The  inverse  transform  is  considerably  more  complicated  than  the  direct  transform. Not only is the form of the inverse transform very different for different manipulators, but it is also not unique (more than one set of joint angles can give rise to  the  same  end  effector  position),  extremely  complicated (involving inverse  trigonometric functions, divisions, square roots, etc.), and very often cannot even be found in closed form! However, it is possible to obtain the inverse transform by iterative application of the direct transform. In other words, a guess can be made of what the joint angles should be, the position of the end effector  can then be  computed using the direct transform with the guessed angles, and the error in the actual position of the end effector should enable a better estimation to be made the next time. The transforms  direct transform is also useful  in end-point servoing. Since the direct  can compute the end-point position and orientation, no TV camera or  end-point sensor is required. Only the joint sensors are required. It is very desirable that the direct transform can be calculated as quickly as possible. Present day manipulators do not usually move very fast Typical sampling rates are less than 100Hz. So it may be possible to calculate the inverse transform by iterative application of the direct transform, using a high performance general purpose microprocessor with an attached special purpose arithmetic processor. When computing trajectories in real-time, this approach is far more practical than to try to compute the inverse transform  directly by evaluating the inverse transform  equations. These  7 equations special  are very difficult to obtain, and they are so complicated that there are few  properties  in them that  can be  exploited  by employing special  hardware. The  requirement of divisions and large numbers of transcendental  function calls also makes  evaluation  of  with  arithmetic  coprocessors,  micro-seconds  these equations  to  a  compute.  very  slow.  transcendental In  complete a 16-bit multiplication.  For  example,  function  even  takes  constrast, a high-speed  the  order  modern of  high-speed  hundreds  multiplier takes about  of  50ns to  8  3.  CHOOSING  A N APPROPRIATE A R C H I T E C T U R E  Because namely  the  computing  present the  choose an architecture  project  direct  is  to  transform  design  a  in robotics,  that is most suitable  chip it  is  for  a  special  possible,  and  application,  necessary,  to  for this particular application. The choice  of the correct architecture is very important Since raw processing  speed is very much  limited  or  by  the  integrated  circuit  technology,  improvement in throughput can only circuit  design.  hardware  However, as  will  generally  be  the  be  degree  increased,  out computation in processing  such  achieved of  since  as  with  CMOS parallel  parallelism parallel  is  processing  increased,  processing  TTL,  is  significant  and  the  careful  amount  essentially  of  spreading  elements instead of in time. Also, increase in parallelism  almost invariably leads to decrease in generality  of application. So  the tradeoffs  must  be considered very carefully.  In  order to decide what architecture is  the properties that  will  best suited  to the present  application,  of the problem must be analyzed, and then the hardware and software  take  best  advantage  of  these  properties  can  be  designed.  The  following  observations about the direct transform in robotics can be readily made: 1.  It is obtained by multiplication of 4 by 4 matrices.  2.  The matrices 0,  0,  1),  are  usually quite sparse, with the last row always containing (0,  usually  somewhere.  about  An example  6  or  of a  7  other  zero  entries,  typical transformation  and matrix  a  1  or  relating  -1  entry  2 adjacent  links is shown in Figure 3.1. 3.  The overall transformation ri,  o,  effector,  and  a  and p  are  matrix  orthogonal  is of the form shown in Figure 2.2,  vectors  specifying  the  orientation  is the position vector of the end effector  of  where  the  end  relative to the base  of the manipulator. 4.  The  computation  requires  addition,  subtraction,  multiplication, and  sines  and  9  C  4  0  S  4  0  S  4  0  C  4  0  0  1  0  d  0  0  0  1  4  F i g . 3.1: Example of Transformation Matrix  cosines of angles. For a manipulator with 6 degrees of freedom, there are 6 angles and so at least 12 trigonometric functions calls are required. 5.  If the transformation matrices axe multiplied out algebraically, equations can be obtained for n, o , a ,  and p.  These equations typically contain a number  of  common terms. For example, the term (C1C3C3 -  CiSjS )C« }  +  S1S4  appears in the x component of n . o, a , and p [6].  This  matrix  is  a  consequence  multiplication. Unless  of  the  fact  that  a dedicated matrix  in the Excalibur manipulator  these  vectors are  multiplication  this is the best way to compute the direct transform.  obtained  engine  by  is used,  10  3.1  Options for High Speed Computation  In this section, various options for high speed  computation are  described with  their general performance advantages. Then the pros and cons in applying each option to the present what the  computation system  entire system would  fit  problem are  must be able  into a  requirements  of  evaluated. should  Although look  like,  system  it is  reasonable  to fit in a number of (preferably  card cage connected to a the  there is no strict specification  must  be  to one)  common microprocessor  kept  to  a  assume  minimum.  as  that the  PC boards bus.  The  to  that  The power  system  will  be  controlled by the host(s) on the bus. Therefore, options such as using a V A X with an array processor these large  can be ruled out  In fact,  computing systems, they  are  because of the general purpose  not necessarily  much  faster  nature of  than a specially  designed small system in solving this particular problem.  3.1.1  Systolic Array for Band Matrix Multiplication  Because homogeneous  the  direct  matrices,  a  transform  dedicated  is  matrix  obtained  by  multiplication  multiplication structure  of  might  4 be  by  4  suitable.  Much work has been done on ways of making highly parallel computing structures for the multiplication of matrices  [22]. In the most general case, a systolic array can be  used to multiply two n x n band matrices with band width w=p+q-l. is not banded is a special case where p=q=n. p =  3 and q  =  Figure 3.2 shows a band matrix with  2. The multiplication of two band matrices of widths wl  can be implemented with a systolic array of wl [22]. The time required for multiplication is cycle each  A matrix that  processing  and w2  x w2 processing elements (Figure 3.3)  3n +  min(wl, w2) cycles. During each  element multiplies its two input operands,  adds the product to  the previous accumulated value, and passes the result to the next element in the next cycle.  3„  91,  9„  9  9„  II  9„  v  \  0 \  9M  a..  V  9  9u  9„  II  w =  \  \  0  Fig. 3.2:  s v  A Band Matrix of Width w  results  *0"  Fig. 3.3:  Systolic Array for Matrix Multiplication  p + q -1  12  The  advantages of such an arrangement  are  that  the processing  elements  are  identical, very little control is required, the interconnection is regular, communication is local, and the multiplication can be completed in a very short time.  Applying  this  system  to  the  coordinate  transformation  problem,  it  can  be  observed that  so  n  =  4  p  =  4  wl =  elements  w2 = are  (4 +  3 -  1) =  6. Hence a total  of 6 x 6 =  needed. The number of cycles required would be  multiplying two matrices.  Therefore, to obtain the overall transformation  six degree of freedom manipulator, 18 x 5 =  Because of the  3(4) +  simplicity of the  36 processing 6 =  18 for  matrix  for a  90 cycles will be needed.  processing  elements,  there  are  a number of  ways that they can be implemented: 1.  A custom designed chip. Required on chip are: •  a multiplier- accumulator (MAC).  •  a simple control section.  •  a few storage registers.  It is possible  to put all these on a single chip. The advantage of this design  is that a small number of chips is required (if 36 processors can be considered small!)  The disadvantage is  that  the on-chip M A C will  be  slow (more  than  200ns multiply time). 2.  Use  an off-the-shelf  MSI  off-the-shelf  high speed  M A C and design  an  LSI  controller or  use  components to build the controller. This approach requires a  much larger chip count (at least 2 x 36). The high speed multipliers will also dissipate 25-50W.  a  lot  of  power.  Thirty-six high-speed  MAC's  will  dissipate around  13  3.  Use  an off-the-shelf  Candidate  chips  HITACHI  HSP [38].  for  signal  include  processing  the  These  communication if  TMS320  chip with on chip M A C and memory. [21],  the  N E C /uPD7720  [30],  or  the  chips do not have the proper I/O pins required  they are  to be  used  as  the  processors in the  systolic  array. Therefore operation will have to be slowed down to get the operands in and  the results out  In  summary,  a  systolic  array  implementation  would take  chips and a large amount of PC board area. The system only  work  made  use  for  multiplying matrices.  of is  that  Extra processors will  the  direct  The only  transform  also be necessary  property  can  be  a  large  number of  will be very fast, but will  of  the problem that  obtained  can be  by matrix multiplication.  to control information flow to and from the  host, and to look up sines and cosines.  3.1.2 Data Flow One  of  the  relatively  new  computer  architectures  that  facilitate  parallel  soon  all  processing is data flow [35,36]. So it is also worth investigating.  An  instruction in a  data  flow  computer  is  carried  out  as  as  the  operands for the instruction are available. For example, the equation  a can il:  =  (b +  1) *  (b -  c)  be implemented with the following data flow instructions: (+  () 1 i3/l)  •.instruction 1. Read an operand. The '()' represents an operand. Then add the constant 1 to it, and pass the result to operand 1 of instruction 3.  i2:  (-  () () i3/2)  ;read 2 operands, subtract the second operand from the first, and pass the result to operand 2 of instruction 3.  14  i3: (*  ;read 2 operands, multiply them together, and pass the  () () a)  result to the variable a.  the order in which these instructions are arranged (as  opposed  in the stored program is immaterial  to a conventional control-flow program).  The information of the flow of  data and control is embedded in the instructions themselves. an  add  operation  will  be  performed  In il,  it is specified  that  on the first (only) operand and the constant 1  when the operand is available, and after  the operation, the result is to be passed to  the first operand of instruction i3. A 'natural' implementation of this set of instructions is with three processors (Figure 3.4). The supply of b and c and the retrieval of a will their  be  controlled by  data  whenever  some other processors. The three processors shown operate on all  the  data  Although in this simple example  is  available,  independent  of  the  other  processors.  data flow does not increase throughput (the ' * ' block  b  C  •a Fig. 3.4: Simplistic Implementation of a Data Flow Machine  15  is a bottle-neck), flow  computer  processors  in the general  with  be  the  same  synchronous.  case a data flow computer is faster than a control number  In  fact,  the  to  have  a  of  processors.  system  is  It  is  more  not  necessary  efficient  if  that  they  the  are  fully  asynchronous.  It  is  not  complicated  possible  program.  Therefore,  in  separate  an  actual  resources will have, to be shared. An example shown in Figure  3.5.  In this system,  processor data (the  for  flow  each  instruction  machine,  the  have  become ready  a  computing  MIT data flow computer  instructions that  in  [35])  is  are picked  up by the arbitration network, and then sent to the processing section to be processed. The sent  processing to  the  section contains a number of processing control  and  distribution  networks,  instructions. In this way, other instructions will  elements. The results are then  passing  the  data  become activated  to  the  specified  and then picked up  by the arbitration network. This sequence then repeats itself.  For equations vectors;  calculating the direct transform, the matrices can be multiplied out and the for  matrix  equations  n,  o,  a,  and p  obtained.  This  multiplication is only suitable  is  the  best  way  for a systolic array  can then be implemented by writing an appropriate  to  calculate  these  implementation. The  data flow program. The  program is easier to write than a comparable one in. a conventional computer because it is more natural. But the program only has to be written once for each manipulator, so this is not a major issue. The main advantage that a multiple processor machine has over a multiple processor a  data flow  von Neumann machine is that of scheduling. In  data flow machine, scheduling is trivial: the programmer only needs to embed the  proper  execution  sequence of  data flow computer will  the  instructions  in the  instructions  take care of the proper sequencing  themselves,  and  the  itself. In a control flow  machine, the programmer has to make sure that the instructions are sequenced properly by suitable ordering of the instructions.  16  processing elements  control network  e.g. (• 0 1 13/1) (-0  distribution  0 13/2)  arbitration  Instruction cells  network  network  Fig. 3.5: The MIT Data Flow Computer  On difficulties. especially  the other hand, implementing a data flow computer will present many First  of  all, it  is  not  possible  to  integrate  everything  on one chip,  if multiple processors are desired. It is possible to integrate the control,  distribution, and arbitration networks, but because of the bottle-neck nature of these networks, communication will be too heavy for a single chip to handle. Also, the frequent memory accesses in a data flow computer may make it inferior to a control flow computer in this application. In a control flow computer, high speed on-chip registers can reduce the number of memory accesses for data.  17  3.1.3  Control Flow  From  the previous  observations, it seems that the conventional  control flow  architecture is probably still the best way to meet the present requirements. There are many ways to build a high speed control flow machine. A custom architecture can be built using general  bit slice processors, or a large number of suitably  purpose  microprocessors  computation system should  can be used. But since  the size  of the overall  be kept small, the best approach is to design a custom  large scale integration processor. This processor .Processor for .Robotics.  connected advanced  will  be called  APR, for Arithmetic  18  4. IMPLEMENTATION CONSIDERATIONS The APAC,  idea  of  a powerful processor  described in [21]. transistors such  a  using  APAC an  technology  foundries for  a special  purpose  has  been estimated  still  "mid-80's"  not  fabrication effectiveness  die  do  exist  [8].  of a meaningful in a  such a processor  real  for  Such  special  system.  Au m  In  this  is  not new.  than this work, has been  technology. Even through  design  specifications purpose  robotics  to require the integration of over 150K  widely-available  able to produce chips with  50mm2  processor  similar to and more ambitious  unspecified is  arithmetic  rules  though  foundry  it is  now  services.  1985,  However,  and yields in excess of 60%  should  be  sufficient  to  enable  the  chip and to allow an examination of its  section,  various  implementation  constraints  of  are considered, and an architecture that would meet these constraints  with present day IC technology is outlined.  4.1 Constraints of the Fabrication Process  The the  present  25mm2  process that is available time  with a  the  defect  is a 4n m  G T E process density  is  capable  close to 5.4  ISOCMOS of  process from G T E [8]. At  producing  Defects/cm2.  chips  The yield  of  area  around  for such chips is  around 80%, which is very close to that predicted by the widely used SEEDS model [31].  Figure 4.1 shows the appropriate  curve for this model. The curve suggests that  the yield will only decrease to about 20% on increasing the chip size to about 75mm2. However, in • practice, at least as not  followed beyond about  50mm2  regards GTE's and it has  foundry, the curve of Figure 4.1 been  is  found that for chip sizes above  this figure the yield rapidly tends to zero [17].  The  G T E process is not a particularly advanced process. Processes that have a  minimum feature size of 1.5jum are not uncommon these days. Also, quite a few of the recently designed powerful microprocessors, such as RISC II [7,16,24,26,32,33,34] and  19  Fig. 4.1: Yield Characteristics of ISOCMOS  MIPS [9,10,11,12], have fairly large die sizes (RISC II measures 10.3mm x 5.8mm [16], MIPS measures 7.5mm x 8.4mm [9], both use 4ym nMOS technology). The G T E ISOCMOS process is a p-well process, with a single metal layer and two polysilicon layers. The second poly is used for making capacitors. Capacitors are useful  in analog  circuits, but are  seldom  used  in digital  circuits.  Again, such a  technology is not very advanced. Many processes nowadays have two metal layers. With two metal layers, not only can layouts be made a lot more compact, but the circuits  20  can  be made  to run much  sheet resistance  faster.  This  is  orders of magnitude larger  because polysilicon  layers  usually have a  than metal, and the R C delay caused by  long polysilicon lines is usually the cause of timing bottlenecks in a digital integrated system.  The second polysilicon  deposited after  layer is  useless to the present  the field oxide is deposited, transistors  project  Because  it is  cannot be made with this layer.  Moreover, polyl cannot be crossed with poly2, since wherever poly2 runs over polyl, a fairly large capacitor is formed, greatly increasing the R C delay and causing crosstalk.  A chip  of  rough estimate size  estimate  7mm  assumes  x  of the number of  7mm  that  the  with design  this is  transistors  4ym fairly  CMOS  that can be process  irregular,  such  is as  integrated  about a  on a  20,000.  This  microprocessor,  as  opposed to a regular layout such as a memory chip.  4.2  Floating-Point vs Fixed-Point Implementation  Should definite  the  answer  data  be  represented  can be given before  in fixed point or  a  floating  thorough simulation is  point format?  No  done. However, rough  estimations can be made. If all distances are represented with a 16-bit fixed point as shown in Figure 4.2, then numbers from about -16 a resolution of 2 " 1 1  =  to +16  can be represented with  4.8 x 10" *. If metres are taken as the unit of measurement,  then distances of up to 16m can be represented with a precision of about 0.5mm. By shifting  the position of the binary point, larger  coarser  precision, or  propagation  must  smaller  also  be  numbers considered.  with  a  numbers can be finer  Consider  represented  precision. The  for  example  p  effects of  the  with a of  error  Excalibur  manipulator,  Px  Normally  =  the  QEajC,  -  accuracies  d,(S„C,  C 2 3 C 4 S 5 )]  of the sines  +  S,(dj  and cosines  +  d«S,S,)  of the angles  are  limited  by the  21  15  14  11  10  sign I binary point  Fig. 4.2: Number Format for 16-bit Fixed-Point Representation  accuracies of the sensors rather than the data format. Typically the accuracy of a position sensor is around 0.1% to 10%, much poorer than what is offered by the above format Because the effect of the errors introduced by the fixed-point format is to be investigated, it is assumed that the sensors are as accurate as the data format Also,  one  can  undoubtedly  cost  almost a  lot  always  find  more. With  more  accurate  this  assumption,  sensors, a  although  maximum of  they  would  IT can be  represented by the same format shown in Figure 4.2 with a resolution of TT/2048, using only the lower 11 bits to represent the angle. The higher order bits can be used  for  memory  partitioning.  Such  an  arrangement  would  allow  off-the-shelf  sine/cosine generators such as the AM29526/27 and AM29528/29 to be used. If the host uses straight binary numbers to represent the angles, then a conversion is easily performed by multiplying the angles by a suitable conversion factor. This can be done  by APR. When  sines or cosines are multiplied together, the error will only diminish  because sines and cosines of any angle always have a magnitude less than one. Only when terms containing lengths are multiplied together will  the errors be magnified.  22  Notice also that  in this equation (and all the others  Excalibur manipulator), it is never necessary' magnified  to take the difference  of the  of two terms with  errors and then multiply it with a big number (larger than one). Assuming  that the maximum length of each link is 4 meters assumption  for  most  present  day  manipulators),  shown in Figure 4.3. It can be seen the  in the direct transform  x-component  (which is a more than reasonable  the  worst  case  error  that there is an error of only  of the position vector of the end effector.  Error  estimation about  is  1mm in  estimation  is the  same for the other vector components. The actual errors will be bigger because of the errors introduced by the sensors.  Since a 16-bit fixed-point number format seems to suffice, is there any reason why  a floating-point processor should be preferred?  Floating-point amount  P» =  of  representation  hardware  implies  increased  max=4  max=4  4  4  C, [ a . C N  y  - d  8  '  6  requires  more  design  hardware.  cost  and  In  VLSI,  potentially  max=4  2 3  C  9  - C„C S )] 4  >  e=0.00007 >  X  - V  d S 6  V..  /  e=0.0005  y  r  >  e=0.0003  operation  4 3  f  4  y  S5I  w  v  e=0.001  4.3: Error Estimation of Number Format  /  e=0.0003 y  e=0.000B  e=0.0004  the  max=4  + S, ( d +  S  y  e=0.0001  Fig.  slower  4 (S  increasing  '  23  speed, in addition to increased chip size. Floating-point arithmetic also takes a longer computation time than fixed-point arithmetic. But this difference in speed is diminished with  the  use  computing  of pipelined systems. Also,  direct  transforms,  such  as  in possible  robot  applications  dynamics  and  of APR other than  control, computer  graphics,  and digital filtering, floating-point arithmetic is often necessary. Floating-point arithmetic can  also  increase  the  accuracy  of  direct  transform  computations  due  to reduction in  error accumulation. Therefore, floating-point arithmetic is highly desirable.  Since  the  hardware  for  fixed-point  arithmetic  hardware for floating-point arithmetic, a processor  is  basically  designed  a  subset  of  the  for floating-point arithmetic  can easily be extended to handle fixed-point arithmetic also.  4.3 On-Chip Memory vs Off-Chip Memory  The program changes chip  for  as  the  a  for computing the transformation  same manipulator. Therefore,  R O M . R A M is  bootstrapped  by  writing to  not the  desirable  it is  because  R A M before  any  equations possible then  is static, i.e., it never  to put  the  chip  computation  can  the program on will start.  have  to  be  The lower  packing density of R A M also places a severe limitation on program size.  Since the instructions are likely to be very wide (32 on  chip  on-chip  32  can be  saved.  Instruction  R O M . But instruction fetch  processor. made  pins  Because of the special  cycles  purpose  for any particular applicatioa  With  fetch are  nature  bits),  with  cycles can also be  not the  potential  the program  faster  bottlenecks  with an of  the  of the chip, not too many will  the program stored  in an off-chip  be  PROM,  program development is a lot easier. Larger programs are also more feasible this way. Therefore, the decision was made that the program would not be integrated on-chip.  As  for data memory, they must be implemented in R A M . With the available  ISOCMOS process, it is not possible  to have a large on-chip R A M . A 4 Kbit R A M  24  requires  23 mm2 with this process.  Therefore, it is not feasible  to put all the data  memory on-chip. The best that can be done is to have a relatively large register file of about or  32 units. Most general purpose processors  less.  RAM  4.4  A large  register  file can reduce  have  16 general purpose registers  the number of memory accesses.  Off-chip  must be used for the main data memory.  Multiplication  In  computing  the  direct  transform,  many  multiplications  are  required.  Multiplications can be performed in computers by either shift and add, or by the use of  a parallel  multiplier. Performing multiplication by shift  and add  requires  minimal  hardware, but is very slow. Multiplying two 16-bit numbers together by shift and add requires about 6.4MS with a 4M m C M O S process. However, with suitable pipelining and parallel  processing, high  multiplers,  on  Off-the-shelf 60ns  (e.g.  the  other  16-bit  Weitek  throughput can be attained in some applications [3]. Parallel hand,  integer  are  a  lot  faster  multipliers today  W T L 2010).  But parallel  than  have  their  multiply  sequential times  multipliers require  a  of lot  counterparts.  around of  silicon real  estate. A 16- bit parallel multiplier will require about 3mm x 3mm in a Au m process.  In  pipelining multiplier  the  present  slow  serial  on-chip.  project,  it  multipliers.  is It  The best alternative  not is  possible  also  not  then is  to  effect  possible  to use  to  high  50 or  CMOS  throughput  integrate  an off-the-shelf  a  by  parallel  high  speed  are  16-bit  integer multiplier.  4.5  32 Bit vs 24 Bit Representation  Since  the  multipliers, mantissa have a full  largest  high  speed  integer  multipliers  available  multiplication is fixed to 16 bits. However, it is still possible  32-bit (23 bit mantissa) IEEE floating-point standard  32-bit registers.  today  This,  [4]  to  adder and full  for example, is the approach taken by the TMS320  [21]. For  25  the present application, a 16-bit mantissa appears sufficient  Besides, there are so many  multiplications in this application (about twice the number of additions) that the extra accuracy  offered  by  the  32-bit  accuracy.  Using a sign-magnitude  adder  will  probably  representation  representing  1-bit  accuracy  communication  the  sign,  thus  can  only  be  with  external  giving a  preserved  when the  much  better  (the IEEE  overall  standard),  with 16 bits on-chip, with another  effectively  memory or  offer  for the mantissa  the magnitude of the mantissa can be represented bit  not  25-bit  operands  host,  it is  representation. are  kept  better  to  on  This  extra  chip.  For  keep the number  format to 24-bits, which is more convenient because it is a multiple of 8.  4.6 Reduced Instruction Set Reduced instruction set computers have become very popular recently, the most notable  designs  being  simpler  control  section,  reduced  design time, less chances of design errors, and faster  instructions  are  chosen  arithmetic processor, are necessary  the  RISCs  which,  in  properly.  and . MIPSs.  A  turn,  reduction in  Since  means  APR  is  reduced  designed  instruction chip  set  area  implies  requirement  execution speed  to  be  a  a  special  if the purpose  its instruction set should therefore support only the functions that  for this purpose, namely coordinate transformation in robotics.  4.7 Outline of Possible Architecture In fixed-point high  summary,  then,  APR  should  handle  32^bit  floating-point  and  16-bit  arithmetic. A parallel multiplier is needed, which will be an off-the-shelf  speed  integer multiplier. The program memory is to be  stored  in an off-chip  P R O M . Except for a large high speed register file on chip, data memory is stored in an  off-chip  R A M . The  control  section  must  be  made  implementing only instructions that are needed most often.  as  simple  as  possible,  26 5. THE APR ARCHITECTURE A APR  block diagram of the overall architecture of APR  is shown in Figure 5.1.  can handle both 24-bit floating-point numbers and 16-bit fixed-point numbers.  Figure 5.1 shows APR in its floating-point mode. This high speed floating-point unit design is a conventional design similar to that described in [1]. Both the adder section and  the multiplier section are divided into a mantissa half and an exponent half. A  25-bit by 31 word 3-port register file is shared by the two sections. The internal representation of the mantissa in APR is always 17 bit sign-magnitude. The exponent is represented  by 8 bits with  a  +127  bias. This format is close  to the IEEE  floating-point standard and will thus allow host processors that use this IEEE standard to send data to APR directly, albeit with a loss of 8 bits of accuracy because of the shorter mantissa used in APR.  5.1 The Register File The  register file has two read buses and one write bus. During every cycle,  the register file operates twice. Firsdy it supplies the two operands to the multiplier, while at the same time storing back the current result from the multiplier section. Secondly, it does the same for the adder section. Because N-channel pass transistors are used for the read and write enable switches in the register cells, the read buses are  precharged before every run. The write bus does not need to be precharged  because it is sourced by full CMOS circuits. More will be said about the register file in section 7.  21  Barrel Shifter  Leading Zero Checker  Multiplier (off-chip)  Barrel  Register File  Shifter  Leading Zero  Adder  Checker  Barrel Shifter  for Mantissa  W  control signals  Control  Data  7i  I/O  CCfl  Data/Address IR  3  8 e.  Instruction  3  °?  2 a  Adder  Register File for Exponent  Subtractor  Adder  PC  Program Address  28  5.2  The Adder Section  In floating-point mode, the adder section is a 5 stage pipeline, similar to that described stage  for  APAC  and, in the  passed  to  number  the  is  next  next  Operands  stage,  stage  shifted  mantissas are  [20].  down  then added  in the  are  fetched  from  the  exponents  are  the  exponent  half,  appropriate  or subtracted  register  compared. and  the  to  align  amount  in the  the  next  The  updated.  appropriate If  the  or  the  limit  flags  has  alignment  and  underflows  are  are  set,  the  number  respectively.  16- bit barrel shifter,  The a  a normalization 16- bit barrel shifter.  stored  will  adder  18-bit  adder  implemented sign-magnitude  binary  smaller  point  The  exponent goes  is normalized and the  in  fast  18-bit  this  stage,  and  the  set  to  the maximum  adder,  be  to  handle  with  an  handled the  in the  17-bit  the  a  consists  leading  adder. The use  of  a  zero checker,  numbers.  Also,  adder.  magnitude  of  a  of  sign-magnitude  must never overflow (overflow  normalization stage),  end-around-carry  representation,  section  The exponent half consists of an exponent  and an exponent adjustment  must  be  mantissa  format, together with the requirement that the mantissa underflow  the  is  with the same sign, or zero, depending on whether overflow  occurred  comparison subtractor,  and  of  stage, while the  checked  exponent  In the final stage, the result is stored back in the register file.  representable  underflow  fraction and  flags set  hard  magnitude  Overflows  in the first  larger  mantissa  through a one stage delay. In the fourth stage, the mantissa exponent  file  the  This  forces  the  mantissa is  adder  necessary  number must  use  always  of an  must  be  because  in  be  positive.  Therefore, when the difference between two mantissas is negative, the magnitude of the mantissa  must be kept positive with the sign changed to indicate a negative number.  This is best done with an end-around-carry adder using l's  complement to represent  negative  adder  numbers  [1].  However, this  during each clock cycle, so a "par-add"  scheme  developed  in  effect  requires  the  to  fast carry look-ahead scheme must be by  Brent  and  Kung  [2]  and  recently  operate  twice  employed. The implemented in  29 nMOS [37]  was chosen for this task.  Both  barrel  alignment shifter  shifters  is a  use  the  design  down shifter,  described  in RISC  II  [34].  while the normalization shifter  The fraction  is an up shifter.  The leading zero checker detects the number of leading zeros in the adder result and controls the normalization shifter so that the mantissa floating-point which  mode.  adjusts  the  It  also  feeds  exponent  of  the  the  shift  result  always comes out normalized in  amount  to  the  accordingly. In  exponent  order  to  adjust  adder,  synchronize the  mantissa and exponent halves, one extra delay stage has to be added in between the exponent comparison subtracter are  regenerative  an  8-bit  adder,  and the exponent adjustment  Manchester carry adders whereas  the  exponent  [14]. update  The exponent adder  needs  adder. Both these adders comparison subtracter  is  to  to  be  9  bits  wide  distinguish between overflow and underflow. When either overflow or underflow occurs, the result is hard limited to the largest magnitude  representable  or zero, respectively.  Because it is not known which input to the exponent comparison adder will  be the  larger, the difference may come out to be either positive or negative. This is handled by a  PLA that maps  both a positive  and a  negative  difference  to a positive  shift  amount which is then used to control the first barrel shifter.  5.3  The Multiplier Section  The multiplier section is a 4-stage pipeline. In the first stage, the operands are fetched and, in the next stage, the mantissas are multiplied by the parallel multiplier. The  exponents just go  through a  delay  in this stage. In the third stage, the result  mantissa is normalized by the barrel shifter. The exponent section in this stage is split into two sub-stages. In the first sub-stage, the two exponents are added together. In the second  sub-stage, the  from the result The effect  exponent bias is  restored  by subtracting  one  bias amount  of shifting the mantissa is taken care of by either feeding  a 1 or 0 to the carry in of the exponent adder. For the same reason presented in  30  the adder section, this exponent update adder in the multiplier section must be 9 bits wide. The correct result is then stored back in the register in the fouth stage.  In floating-point multiplications the resulting mantissa  never needs to be shifted  more than one position, so there is no need for a barrel shifter and a 16-bit leading zero checker. These  two blocks are needed, however, in fixed-point  mode, when the  numbers are not normalized.  5.4  The Clock Phases  In phases  is  order  to  required.  synchronize the The  required  above  clock  operations,  phases  are  a  clock  shown  in  00-n  0,_n  0.103  0.1— Fig. 5.2: Clock Phases Required in APR  with  more  Figure  5.2,  than two and  are  31  generated as described in section  7.5. The operation of the multiplier section in  floating-point mode is as follows: during 0 , , the register file is precharged. In 0 , 3  operands to the multiplier are read  from  the register file. The tri-state pads  connecting APR to the external multiplier are enabled as output pads during all of  <t> , so that the operands are driven out through the pads with all the duration of 2  0  2  available. During the next 0 , , the operands are fed to the inputs of the  multiplier. The multiplier operates in the ensuing 0 , and the product is sent back to 2  APR  in <f>,. The product is normalized in 0 , and then stored back in the register 2  file in 0 , after a delay in 0 , . 3  In the adder section, the operands are read from the register file in 0 „ . The fractions are then aligned during the next 0 , after a simple register transfer in the 2  0,  between. Then the aligned  fractions are added in the next 0 . 2  The sum is  normalized in another 0 , and the normalized sum finally stored back in the register 2  file in the next 0 . u  It should  be noted that APR  basically operates with a two phase clock  consisting of 0 , and 0 . The phases 0 , 0 « , and 0 2  3  O  are needed so that the  register file can be operated twice every machine cycle. 5.5 Fixed-Point Mode Figure 5.3 shows APR in fixed-point mode. In this mode, the two barrel shifters in the adder section  are bypassed, so that the adder section  becomes a  3-stage pipeline. The multiplier section stays as a 4-stage pipeline, but the shift amount is now a constant controlled by the control unit. The location of the binary point is controlled by the program, and the barrel shifter in the multiplier section is used to realign the binary point after a multiplication. The exponent half of the adder section is completely shut off in fixed point mode, as is the register file and the  2? Lo  5 B* rt  t  i a  Data/Address  33  exponent half in the multiplier section. However, the adder in the multiplier section is used  to check  for  overflows.  This is  achieved  by subtracting  the number of  zeros in the result from the multiplier from a preset value and flagging  leading  any negative  differences.  5.6  Input/Output and Control  For indirect  communication with external  read  and  write  with 24-bit  memory,  words.  APR can perform  In order  to make  both  direct  and  APR compatible with  8-bit wide system data buses, APR can also do direct read and write with byte wide words. No indirect read/write instructions are available for byte wide data.  Table look-up for sines and cosines, as transforms,  and  for  required in the computation of direct  D M A to the system memory is provided by appropriate  mapping.  The condition code  flags are  set  register  has  9  by the data path in the usual  flags.  The overflow,  zero,  and  memory negative  way. Of the six remaining flags, 3 are  input flags which are set by external circuits, and 3 are output flags that can be by  the APR control unit These  flags can be used  for interrupt requests, bus  set  grant,  and other control signals.  The program counter is 16 bits wide, and is not connected to the data path. Only absolute direct jumps are allowed. Conditional jumps are possible, flag  as  the jump  condition. In order  to  allow  simple  subroutine  using the zero  calls,  a  two level  stack is also included in the PC section.  Because of the simple instructions required, the control section comprises only a simple PLA. Figure 5.4 shows the instruction fetch and execution pipeline. Every cycle a new instruction is fetched  from the program memory and decoded in the following  cycle. Execution of the instruction is then started in the following cycle, although the instruction may take more than one cycle to finish. As in RISC and MIPS, and for  34  fVograa Address Calculation Execution  decode  and Instruction Fetch  Fig.  5.4:  Instruction Pipeline  reasons clearly described in [ 9 ] , the instruction fetch  and execution pipelines  are not  interlocked. In other words, when a conditional branch is encountered, the prefetched instructions However,  are  always  executed,  regardless  of  the  unlike the RISC and MIPS design, there  outcome  of  the  condition  is no internal forwarding  test  [7] in  the data path. This makes the design of the control section quite simple, provided that the speed penalty due to the extra of  the instructions.  delays  can be reduced by suitable  rearrangement  35  6. PROGRAMMING AND SCHEDULING The Because  efficiency  much  architecture,  of  its  of  the  APR depends processing  throughput  can  entirely  speed  be  of  APR:  execution  the  adder  pipeline,  pipeline. The  the  block  APR  widely different  pipelines full at most times and programs in  on  the  efficiency  comes  from  for  its  programs  the  program.  multi-pipelined  that  can  keep  the  that cannot There are three major pipelines  multiplier pipeline,  diagrams  of  of  and  these pipelines  the  instruction  have  been  fetch  shown  and  in the  previous section.  An both  operation  operands  are  can in  be  the  started register  in the file.  adder  or  APR does  multiplier pipeline  not  make  any  only when  effort  to check  whether the operands are valid or not This task is left to the programmer. Therefore, as  an  example,  if  the  following two instructions  are  to  be  executed  with  APR in  floating-point mode: il:  C =  ' i2: D  =  A +  B  C  2  +  then instruction i2 can be started only after at least five cycles from the time i l was started. But other instructions that do not require ' C can be started immediately after i l . . Therefore, the instructions must be arranged in such a way that the pipelines are filled handle.  most  of  The  the  only  time.  The  instruction  complication  arises  fetch  when  and  there  execution is  a  pipeline  conditional  is  easier  branch  in  to the  program. APR does not provide any facility to flush the instruction pipeline when a branch  occurs.  instructions  that  This are  is  what  prefetched  is  called  non-interlocked  into APR will  always  be  pipelines executed,  cause any  errors  a conditional branch that are prefetched  when executed.  It may be possible  to rearrange  MIPS.  The  regardless of the  outcome of any conditional branches. The programmer must therefore the two instructions after  in  make sure that  into APR will not the instructions  so  36 that  these two instructions  will  actually  do  some  useful  work. Otherwise, N O P (no  operation) instructions must be used to fill up this gap.  As pointed out in [9],  this  requirement is often more of a blessing than a curse. Not only is the design of the processor  easier  with  which are wasted  non-interlocking pipelines,  but  the  two  prefetched  instructions  in a pipeline that can be flushed can now be made to do useful  work.  It  is  rather  difficult to write an  computing resources, adder  efficient  program  to  schedule  a number of  even though the present case has only two resources, namely the  section and the multiplier section. The subject of scheduling multiple  resources  has been studied extensively in the area of micro-programming. In fact, the programs for APR are somewhere between assembly an aid to the programmer, the assembler parallel  operation  relationships.' The used  in  the  multiplication  of  the  adder  branch and  assembler. operations  It as  program is then generated include the effects effects  language programs  is written to take into account the  and the multiplier sections, bound algorithm  requires  the  a  dependency  data  and micro-programs. As  with  list  programmer graph  and the various pipeline  scheduling  to  possible  arrange  [19].  The  heuristic the close  [19]  addition to  is and  optimal  automatically in a very short time. It is rather difficult to  of a Finite number of registers in a scheduling algorithm, so the  of the limited number of registers available on-chip is largely ignored in the  present assembler. The scheduling algorithm is described in more detail in Appendix D.  37  7.  DESIGN O F C E L L S  It  is  well  computation [22]. is rather  known  that  in  VLSI,  communication  is  far  more  expensive  than  However, the notion that communication must be modular and local  futuristic.  Classical  von Neumann architecture  does not lend itself  easily  to  systems requiring little or no global communication. Also, the complexity and density of most  VLSI  chips  today  systems can be integrated all, processor  file,  barrel  reached  the  point  where  multiple  self-contained  on one chip. Therefore, it is conceivable that most,  in VLSI,  example, the adder shifters,  widths in their arrangement  not  if not  designs today utilize global signals that run across the span of the chip.  Nevertheless, Take, for  have  even  local  mantissa  communication must  section of APR (see  leading zero checker, and adder  most  compact  form. If  the  layout  are is  designed  Figure 7.1).  all of different  designed  of the block diagram, a lot of area is wasted  B.S.  be  (see  The  carefully. register  heights  and  following exactly  the  Figure 7.2). Although  B.S.  Fig. 7.1: Communication in Adder Mantissa Section  LZC  38  Reg. File  B.S.  J  -*  1  Adder  Sol  n 11 u  — ± B.S. -=*  Fig. 7.2: Inefficient Routing of Adder Mantissa Section  Figure 7.2 is only symbolic, it is obvious from it that a lot of silicon real estate is used up by the wires. A better design is to stretch all the ~ building blocks to the largest height so that they can be butted together snugly. Although this implies less compact layouts for the smaller blocks, the savings in routing area will usually lead to a much more compact overall design. Also, since the adder design adopted in APR is much more compart when the sum is generated on the same side as the inputs, the overall design will be more compact if the adder is placed at the end of the data path. The final design is shown in Figure 7.3. Notice that there are no 'bare' busses in this arrangement Such considerations were taken into account in the design of the basic cells. The design of the more important cells is described in this section. The following set of rules of thumb is used as a guideline in the design of the layouts: 1.  Extensive layout compaction is applied only to cells that have a high repetition factor, such as the register cells and the rest of the data path. For cells that are not repeated a lot, logic gates from a standard cell library are used. This  39  i> }>  « « c Reg.  File  B.S.  B.S.  LZC  *  » » »  «  Fig.  reduces  Adder  7.3: Final Floor Plan of Adder Mantissa Section  design  time  while  compromising little  in  terms  of  overall layout  in  order  to  reduce  compactness. 2.  Full  CMOS  is  used  as  much  as  possible  power  consumption, except in places where the use of pseudo-nMOS can result in significant reduction in real estate requirement. 3.  P-channel  transistor  are  made  2.5  times  the  size  of  their corresponding  N-channel transistors, in order to have symmetrical rise and fall times. 4.  Minimum  size  transistors  (with p-channel transistors  2.5  times  the size of  N-channel transistors) are used everywhere except where high current drive is required. This not only  keeps the area  required to a minimum, but also  reduces the loading on the drivers that drive the transistors, which in turn reduces the overall delay. 5.  Metal  wires are  used  for signal communication as  polysilicon used only for short distances or crossunders.  much as  possible, with  40  6.  Power and ground wires are the power lines, the polysilicon.  rare  cases  routed in metal. When signal  lines must  cross under the metal must  power lines in diffusion  layer inside a P-well is used as crossunder for the ground wire. N +  diffusion  are preferable  where power  lines cross  N+  layers  In  signal  always  cross ground, an  to polysilicon because N +  diffusion  has  a smaller sheet  resistance than polysilicon, but has a larger capacitance. Therefore N +  diffusion  is better than polysilicon for conducting ground lines but worse for connecting signal lines, since the signal level in the ground lines never has to change, so the increased capacitance has no serious effects. Manhattan geometry  7.  is  used  except  where significant  savings  realized with the use of 45 degree geometry. This makes design  to a process supporting only  Manhattan geometry  in area  can be  a conversion of the easier, although there  are currently no plans to do that The  designs  were  done  on a  Metheus X750 VLSI  simulated extensively wherever possible  work station.  The designs were  using the simulation tools available. These tools  include: 1.  SPICE:  a  circuit  level  simulator.  For  accurate  analog  simulations  of  small  circuits consisting of up to 20 transistors. 2.  Hg. a switch level simulator. For digital simulations, using a switch model for MOS  transistors.  Circuits containing a few hundred transistors  can be simulated  conveniently with this simulator. 3.  HILO:  a  gate level simulator. For digital simulations of large  systems. With  functional simulations [13],  complex digital  theoretically whole chips and systems  can be simulated. But because it cannot be used with the circuit extractor, it was only used sparingly. The  layouts have been thoroughly checked for  design  rule violations using the D R C  (Design Rule Checker) on the Metheus VLSI workstation.  41  7.1  Multi-port Register Design In  almost  any  needed. A register  digital  processor  design,  a  number of  is basically a R A M word. The difference  usually called a register  mulu'-port registers  are  (if any) between what is  file and what is called a R A M is that a register  more than one port, with a small number of registers in a register  usually has  file, and the file  is usually placed in the data path itself. This is necessary because by design, registers are  used  would  very  allow  often,  the  typically  processor  to  once  every  machine  run faster.  It  is  cycle.  A  not practical  multi-port to  have  register a  very  file large  register file, because larger files require longer access time [24]. Historically, there is a big  difference between registers and RAMs in that registers are integrated in the same  chip  as  VLSI,  the processor, on-chip  RAMs  whereas are  RAMs  becoming  are  in separate packages. With  more  and  more  common,  and  the advent of the  difference  between registers and RAMs is fading. In fact, in RISC II [34], the terms are used interchangeably.  The register can  be  register  cell, with two read busses and one write bus. During every cycle, two words read  from  operation proceeds and  cell design used in APR is shown in Figure 7.4. This is a 3-port  the  register  file  and  one  word can  as follows. For every register  be  written to  cell, when write  it  A write  is high, M5 . is on  MO is off. The write bus is then connected to the input of the inverter formed  by  M l and M3. The signal on the write bus is thus stored in the gate capacitance  of  this inverter. When the write  MO  control signal returns to low, M5 is turned off and  is turned on. The write bus is then disconnected from the register  cell. Since MO  is on, the output of the inverter formed by M2 and M4 is connected to the input of the inverter formed by M l and M3, forming the familiar cross-coupled cell, cycle.  and the  information stored  can be  retained  indefinitely. This  static memory  completes  a write  42  MriteBus  Fig. 7.4: A 3-Port Register Cell  For  a read cycle, either ReadBusO or ReadBusl can be used. When ReadO  goes high, M6 is turned on. This connects ReadBusO to the output of the inverter formed by M2 and M4, and the logic level of ReadBusO will follow the output logic level of this inverter. The same description applies for ReadBusl. With  two read  busses and one write bus, two words can be read from the register file and one word written to it every cycle. This is very desirable because both the adder and the multiplier requires two operands and produce one result  43  In  order  to save  area,  pass  transistors  must  be  used  instead  of transmission  gates for the read and write enable switches (M5, M6 and M7). In CMOS, either N or  P-channel  transistors  can  be  used  foT pass  transistors.  P-channel transistors  are  typically 2 to 3 times slower than N-channel transistors, because of the lower mobility of  holes compared with  that  of  electrons.  However, in a  P-well  process,  N-channel  transistors reside in a P-type tub of slightly higher doping density than the substrate. This  worsens  the body effect,  and N-channel  pass transistors  lose  more than a gate  threshold in passing a high signal. If one end of an N-type pass transistor is held at 5V, the other end only reaches about 3V. On the other hand, a P-channel transistor can pass a low signal (OV) to around 1.1V.  N-channel because of  transistors  their superior  have  speed.  been In  order  chosen to  for  combat  implementing the  the  register  cell  threshold losses, the  read  busses are precharged. When the register file is not used, the read and write control lines are driven to zero volts, and the read busses are precharged to Vdd. When the register file becomes active in <p . 2  the capacitance  of the read busses will keep the  voltage of the busses close to Vdd. Therefore if the register cell contains a I, it will not have to pull  the bus high, since the bus is already high. On the other hand, if  the cell contains a 0, the bus will have to be pulled low. This presents no problems since N-channel transistors can pass a low signal without any loss.  The SPICE simulations of the register cell are shown in sections B.1 to B.6 of appendix B. In B.1, a 0 is written to the register. In B.2, a 7 is written. In B.5, it is sourcing a 0. In B.6, the register is sourcing a 1 to a read bus. The case that presents the most concern is B.2, since the signal after the write enable  44  pass transistor M5 for  a  standard  N-channel  pull  barely reaches 3V, just slightly above the threshold voltage of 2.5V inverter. The down  transistor  noise Ml  margin bigger  has than  been the  increased P-channel  by  making  pull-up  the  M3.  In  appendix B.3, the write signal is reduced to 4V, and it is shown that the register cell can still record the high  signal properly. This marginal  state exists only when the  register is being written to. As soon as writing is complete, M5 and MO  will be turned off  will be turn on, restoring the logic level to much safer values. The  register file is divided into two sections, consisting of 31 x 16 and 31  x  9 units of this register cell. The bigger file is for the mantissa, while the smaller one is for the sign and the exponent Driving the register file are three 5-to-32 decoders, one for each bus. The decoders are basically structured like a PLA,  but with no  OR  planes, since they are not needed. From simulations, the time required to disable the read/write switches and precharge the busses is found to be about 20ns. The decode an address is about 10ns. And  time to  the time for a register cell to be enabled  and  sink a 1 from the write bus is about 30ns (there are other possible activities, such as sourcing a 7, but they are not worst cases). Therefore, the register file can be run at a minimum cycle time of 60ns. See Appendix A.1, 2, and B.l-7. 7.2  Barrel Shifter  The  design  of a  discussed in [34]. The  versatile and  compact barrel shifter has  been  thoroughly  circuit diagram- of a 4-out-of-7 barrel shifter is shown in  Figure 7.5. This barrel shifter design can serve as either an up  shifter or a down  shifter (it can also rotate a word, but this capability is not utilized in APR). Figure 7.6a shows the barrel shifter connected as a down shifter, and Figure 7.6b  shows it  working as a up shifter. In Figure 7.6a, the input 4-bit word is fed to ports InO to In3, where the MSB  is connected to In3. If sh3  is high and shO to sh2 are all low,  then Out3 is connected to In6, and OutO is connected to In3, etc. Therefore, the  MSB  45  in6  i  H  r  J  in5  J  Sit2  J  ^  i  —t)  in4  {  •H: in3  1 t  a sh2  sh3  Fig.  a s in? shl  a jnl  a shO  u  OutO  a inO  7.5: 4-out-of-7 Barrel Shifter  of the input word becomes the LSB of the output, and OutJ to Out3 are filled with ffs. This has the effect of shifting the input word down by three positions. To shift the input word up by two positions, sh2 is set to a 7 state. Then Oui3 is connected to In5, and OutO is connected to In2, etc. When shO is high, the input word is passed directly to the output  46 9  8h3  r  9  9  •ti2  wM  9  shO  9  9  9  9  sh3  8h2  shl  shO  in6  Out 3  in6  0nt3  in5  0ut2  in5  Out2  in4  Outl  in4  Outl  in3  OutO  ina  Outo  in2  l n l inO  a  a  a-  in2  inl  inO  a  (a)  (b)  Fig. 7.6: Barrel Shifter Conneaed as Up/Down Shifter  In Figure 7.6b, the input word is connected to Jn3 through In6. The operation is very similar to that of the down shifter. Except that now enabling shO will shift the input word up by 3 positions and enabling sh3 will pass the input word to the output directly. Since the barrel shifter is an nMOS pass transistor design, the threshold drop problem is also present This implies mat the output lines of the barrel shifter must also be precharged.  47  7.3 Leading Zero Checker  A basically  4-bit leading a  zero checker  PLA with no OR plane.  design  is shown in Figure  7.7. This circuit is  The position of the first non-zero  bit can be  obtained both along the direction of the data path or at right angles to it, making it  ]r— 3r-  JV  3h~jlr'  5  5  ^  J—<1-L.  3 2  ^3  dataO  datal  data2  data3  Olz  1 Iz  Fig.  21z 3 IZ 41Z  7.7: 4-bit Leading Zero Checker  48  particularly  suitable  for  part  of  a  data  path.  Simulations  show  that  the  worst case  delay for a 16-bit leading zero checker is about 6ns. The number of leading zeros in binary code can also be obtained with the addition of an OR plane.  7.4  Adders  There  are  two  adder  program counter adder, mantissa  designs  regenerative  used  in  APR. For  the  small  Manchester carry adders were used.  adder, a carry look-ahead adder was used. The regenerative  scheme [14] generated  adders  and  the  For the fast  Manchester carry  is basically a Manchester carry scheme [22]. The P ,G and K signals are  as in normal Manchester carry adders and a pass transistor  is also used for  the carry propagation chain. The difference is that the carry signals are regenerated every  stage.  In  a  dynamic regenerative  adder  (see  Figure  7.8),  each  section  at  of the  carry chain is pulled down to G N D in the precharge phase. This is equivalent to a carry kill at every bit. No kill (AO signal needs to be generated.  In the next (active)  phase, if propagate (P) is high, which implies generate (G) is low, the carry-in signal is passed to carry- out. In a normal Manchester carry adder, the level of the signal at carry-out  will  potential  drop  carry-in  not  be  across  signal is  the  same  as  MO. However,  high, M l will pull  that in  at the  carry-in. This regenerative  is  because  Manchester  there  carry,  is if  a the  carry-out all the way to Vdd. If the carry-in  signal is low, M l will be off, and the section of the carry chain will stay at G N D , since it is initially set  to G N D . The logic levels can be restored  at every stage in  the normal Manchester carry scheme too, but the two extra gate delays introduced by the doubly inverting buffers  will  more than offset  the speed improvement gained by  the restored logic levels.  When P is low, the adder behaves as follows. If G is high, a carry of / is generated  by M l . If G is low, MO and M l are both off, and the carry-out stays at  G N D . This is the same as killing the carry.  49  A static regenerative adder scheme is also possible (Figure 7.9). This is done by  adding some circuitry to the dynamic design to implement a carry kill.  Static  designs are more versatile, since no clock is required. Therefore they can be operated asynchronously. Although the static design has a longer cycle time than the dynamic design, when the adder is used as part of a larger design, the overall speed of the design may be faster with a static adder. For example, in one machine cycle in APR, the^ number of leading zeros in a number must be determined and then added to the original exponent in the multiplier of APR. With a dynamic adder, the machine cycle must be split into two sections. The first section must be long enough for the leading zero checker to reach a stable (correct) result And the second section must be long enough for the adder to complete the addition. If a static adder is used, then it is  50  only necessary that the total time required for the leading zero checker and the adder is  less  than  drawbacks, when  G  the  period of one  machine cycle.  though. The transistor or  K  is  low, they  M 7 presents  may  have  to  The static  scheme  extra  load on the  work  against  the  does  have  some  carry chain.  Also,  carry-in  signal,  so  transistor sizing is very critical in a static design. The static regenerative adder design was chosen for all the adders on APR except the high-speed 16-bit mantissa adder. Simulations  show  that  regenerative  adder  is  the  about  carry 7ns.  propagation See  Appendix  time B,  for section  one B.9.  stage  of  the  static  Therefore, an  8-bit  regenerative adder will require about 60ns to perform an addition.  P (k + 1)  Fig. 7.9: Static Regenerative Adder  51  A carry look-ahead scheme that is particularly suitable to VLSI has been described in [2]. The "Par-Add" scheme, as it is called in [37], is a very regular and modular design, making it very suitable for an algorithmic design, Le., by specifying only the size (number of bits) of an adder, its layout can be generated  automatically  by a  program. Par-Add is essentially a partial carry look-ahead tree across groups of two bits. With TTL components, carry look-ahead is usually across groups of four bits. But MOS  transistors have less current drive than TTL circuits. So even carry look-ahead  across groups of four bits can significantly increase delays due to large capacitive loads on the outputs. A block diagram of a 4-bit par-add adder is shown in Figure 7.10, where A  unused  unused  Gout  Carry-In  Pout  Cin  PA  61  PI  Gout  Pout  Cl  60  Cin  Gout  PA  61  PI  Cl  Gout  Pout Cin  PO  CO  Pout  Cin  PA  GO  PO CO  61  PI  Gout  Pout Cin  Gout  Pout Cin  FA  FA  Cl  60  PO CO  Gout  Pout Cin  FA  FA  rTTTTT TTTTT1  A3  B3 S3  Fig.  A2  B2  Al  Bl  Sl  AO  BO  7.10: 4-Bit Adder with Partial Carry Look-Ahead  SO  52 and B are the numbers to be added, and S is the sum. For the blocks labeled F A (for Full Adder), Pout =  A-B  Gout =  A-B  S  For  =  A +  B +  C  the PA (for Par-Add) blocks,  POUt =  ?  Gout =  G, +  0  ' ? l  C,  =  Go +  C„  =  Cin  (TV G „ ) (Po-Cin)  Notice that all the PA blocks implement the same logic. This is why the circuit is modular and regular. speed of the adder  If  N  is the width of the adder  decreases as  log 2 N as  opposed  in number of bits, then the  to N in an adder  without carry  look-ahead, such as a Manchester carry adder. The original design  in [37]  nMOS  design  process.  Since  modified. With MOS. Full  APR is  CMOS,  to  be  implemented in CMOS,  the PA blocks  cannot  be  the  implemented  in full  is for an has  to  be  complementary  CMOS  requires about four times the amount of fan-out of pseudo-nMOS  or domino C M O S  [18]. Domino C M O S is particularly suitable for this application since  it  works  gates,  with non-inverting logic  rather  than  NAND  and  require. Therefore, domino CMOS must  be  much faster  end-around-carry  (Le.  The  basic building blocks  N O R gates),  which  is  exactly  are  what  A N D and OR the  PA blocks  was chosen to implement the mantissa adder, which  than the other  adder.  the  circuit  adders because it is diagram  of  a  18  PA block  bits long, and is in  domino CMOS  an is  shown in Figure 7.11. Because domino CMOS is dynamic, making an end-around-carry adder  using  domino  CMOS  presents  extra  problems.  The  carry-in  of  end-around-carry adder does not become stable until about half way through the  an  Fig. 7.11:  Circuit Diagram of a Par-Add Block  54  overall add lime of the adder. before  they start operating,  handled in the mantissa adder during tf>«,  <j>3  and  But dynamic circuits require their inputs to be stable  otherwise  incorrect results will  occur. This  adder by precharging the adder during  <pu.  During  <p3,  <f>0,  the carry-out is established  the carry-in is stable and correct, and the sum generated  and latched. During is valid. Figure  Carry-Out fro» MSB A » IB  Par-Add  IB  Fig.  7.12:  Adder  Carry-In to LSB  End-Around-Carry Adder with Domino  is  and operating the  shows the schematic of this arrangement  IB  requirement  CMOS  7.12  55  7.5  The Clock  The  clock  phases required  by  APR are  shown  in  Figure  5.2,  on page 30.  Basically, each machine cycle is split into four phases, two relatively short phases (the d>0's), and two relatively long phases ( 0 3 required  has  been described  in section  and 0 „ ) . The reason that these phases are  5.4. This section is concerned with how such  phases can be generated in VLSI.  In most processors, multi-phase a  single  on-chip.  phase  clock. The  single  (typically two-phase) clocks are generated  phase  Depending on the accuracy  clock  can  be  generated  or an R C circuit may be  off-chip  or  required of the clock, or the amount of clock  frequency drift that can be tolerated, such as due to temperature crystal  either  from  used  changes and aging, a  to provide the oscillation. The subject about  generation of single phase clocks has been discussed in detail in [22].  APR there  is  runs  any.  on its  This  is  own clock, asynchronous  necessary  to  make  full  with that of  use  of  the  the system  speed  of  clock, if  APR. For  a  prototype chip, it is not wise to use a totally internally generated  clock using a ring  oscillator  [22].  be  example,  at present  oscillator  is  oscillator  must  The  used  work,  fabricated.  and  is  that  the  timing  the minimum width of to  generate  the  be adjusted so that  phase width for not  reason  <j>  2  there  base <f>  2  <j>  clock,  no  way  to  may  is estimated  to be  the  of  number  is around but above  is longer than estimated, is  2  estimations  change  inaccurate.  For  160ns. If a ring  stages  in  the  ring  160ns. If the minimum  i.e. longer than 160ns, then APR will the  clock  frequency  after  the  chip is  On the other hand, if the minimum phase width is shorter than expected,  then APR will run at less than its maximum speed. Therefore, an external oscillator is much more desirable  for generating  the base clock. With an external  either a crystal  or an R C circuit, the clock frequency can easily  here  that  on  it  is  assumed  a  single  phase  symmetrical  square  oscillator, using  be changed. From wave  of  selectable  56  frequency is available from off-chip sources.  Given the  phases  a symmetrical single  0  to  O  ^  shown  phase clock, the problem now is how to generate  in  Figure  5.2.  Apart  from  the  obvious  relationships  shown, the following requirements should be noted: 1.  0 i is high only when 0  2.  0 ! and 0  3.  03  4.  0 3 and <t>n  O  is high.  must not overlap.  2  and 4>n should be about two or three times the width of 0 . O  must not overlap 0 . O  A method to generate a two phase non-overlapping clock from a single phase clock is described in [22]. To generate the clock phases required in APR, a more scheme  is necessary. The heart  Figure later  7.13). The function of  in this section. The shift  clock, generated  of the scheme the  used  is a dynamic shift  from the base clock (clkin in Figure 7.13)  +  s3  si  +  s2 +  =  si  +  s2  =  s4  +  s5  00  =  sO  2.  *1  =  sO  3.  0  =  4.  *3  5.  0  However, guaranteed  explained  / , and si to s5 =  using the method in [22].  0 on startup, the signals sO O  to ^  can  as follows:  1.  a  be  (see  register is controlled by a two phase non-overlapping  to s5 will take the shapes shown in Figure 7.14. From these signals, 0  2  register  N-channel pull-down transistors will  With the base clock running, sO =  be generated  sophisticated  there  is  by the  still shift  s3  one  +  s4 +  s5  problem.  register. It  is  The  non-overlapping  possible to estimate the  requirements delay  of  are  not  the slowest  clock path and then insert enough number of idle base clock periods in between the different  clock  phases.  For  example,  if  the  signals  sO to  s5 are  to  be  generate a two phase non-overlapping clock, then the phases can be generated  used  to  by the  Fig. 7.13:  Shift Register for Multi-Phase Clock Generation  58  .o  n n  n n I  82  I I  83  .< = .  I I  n n  i  n n I I  I I  n n  I I  I  n n  Fig. 7.14: Signals from Clock Generation Shift Register  following logic equations: d>,  =  sl + s2  4>  =  s4 + s5  2  The  duration of  sO and  s3 provide  the  necessary  non-overlap  periods.  If more  overlapping period is required, the number of idle periods can be increased. A  better approach  is to use the feedback  technique described in [22]. An  example to generate a two phase non-overlapping clock is shown in Figure 7.15. It should be noted that the signal that is being fed back to the inputs of the generator must be from the end of the critical path (slowest path). Otherwise non-overlapping cannot be guaranteed. Using this technique, the corresponding signals that are fed to  59  Fig. 7.15: Two-Phase Clock with Feedback to Guarantee Non-Overlapping  the various generated signals as shown in the following table.  Signal Generated  There  is  still  one  Signal Fed Back  00  03  + 0«  01  02  + 00  02  01  03  00  0  00  4  potential  problem in this  clock generator.  The proper  functioning of the shift register clock generator depends on the fact that at any time, one and only one of the sO to s5 signals is high. Although it is a relatively simple task to set the sO to s5 signals to the proper state on start-up by using the reset line, measures must still be taken to ensure that if there is more than one high  60  signal  or no high  signal  at  all  in sO to s5, due to noise  perhaps,  then the  shift  register must be able to correct itself after at most one machine cycle. The N - channel pull-down transistors shown in Figure 7.13 are included for this purpose. When sO is high, sJ to s5 are forced to low. On the other hand, when all of si to s5 are low, sO is forced to a high.  7.6 Programmable Logic Array There are six programmable logic arrays (PLA) used in APR: 1.  Instruction decoder.  2.  Stack controller.  3.  Fraction alignment barrel shifter controller.  4.  Register file decoder.  5.  Leading zero checker in the adder section.  6.  Leading zero checker in the multiplier section.  All  these PLA's use  a static pseudo-nMOS design. An example  of such a design  is  shown in Figure 7.16, which is the stack controller used in APR. The logic equations that are implemented are: outO =  in0«inl«in3  outl  inO- inl* in4 +  =  out2 =  inO* in4  out3  inl«in2  =  Pseudo-nMOS  static  PLA's  +  in0«inl»in2  +  inl«in4  inO* in4  are  preferable  to  dynamic  PLA's  or  full  CMOS  static  PLA's. Dynamic PLA's must be clocked, placing limitations on when inputs can be fed to the PLA and when outputs can be obtained from it Full CMOS PLA's are more than twice the size of pseudo-nMOS PLA's. Since the speed of a P L A decreases as its size increases, the saving in speed and area.  in power consumption is usually not worth the  sacrifice  outO outl out2 out3 (  i '  i  •  <r-i_r  L  J  I  J  ir  LJ L J  L  i*  't  T  ir  ir  T  iT iT  ir  L  1 r  ih  ir  L j •  4-i_r  T T  ir  ^ L  T  T T  ir Si i r  inO  ;i  inl  Fig. 7.16:  i  i  : i  in2 in3 in4 in5  PLA for Stack Controller  H i r  hi.  7.7 Stack The APR.  stack  The only  N-channels  design described in [22] modification made  is used for the program counter stack in  is that P-channel  transistors  are used instead of  for the pass transistors. The circuit diagram of a 2-bit by 2-level stack is  shown in Figure 7.17.  Fig. 7.17: 2-bit by 2-level Stack  63  8.  FLOOR  P L A N O F APR  The APR  has  logic  design  been  of  APR has  investigated  (Figure  channels, the chip is estimated  been 8.1).  carried With  to  due  completion. A allowances  to measure approximately  floor  made  plan of  for  routing  8.5mm x 8.5mm. There are  83 pads on the periphery, including the power and ground pads. It can be seen in Figure 8.1 that a significant portion of the chip is devoted to routing. This exemplifies the  statement  in  section  that  7  communication  in  VLSI  is  more  expensive  than  computation. The design of an efficient floor plan is thus of primary importance for a compact  chip  layout  interconnections, must  be  will  result  of  used  as  a  single  metal  routing problem  scarcely  as  is  possible,  Unexpected RC delays  the most  becomes  the  With  further  all that can be  and  caused  a  single  complicated  or significant  polysilicon layer  because  deterioration  for  polysilicon wires  in processing  speed  by long polysilicon wires are probably one  common causes of chip design  available,  layer  failures.  done to increase  Until  better  simulation software  the chance of a successful  chip  implementation is to keep the design simple and regular, and to be extremely careful in every stage of the design.  Although section  3.1),  chip  sizes  larger  such a large chip is  than  70mm2  not feasible  are  not  uncommon these  with  the available  days  (see  ISOCMOS process  due to poor yield. Also, if a 2/um process is available, APR will measure only about 4.3mm x 4.3mm. Therefore, although the APR design is definitely feasible with present day IC technology, it cannot be fabricated with a high degree of confidence using the technology  available  to the  author.  Because APR is  required for coordinate transformations  actually  more  powerful  than is  in robotics, it is, however, possible to  fabricate  a chip that contains only the subset of the functions of APR which are essential the computation of coordinate  transformations.  for  Not only can such a chip serve as a  testing vehicle for the APR, but it can also be used immediately in industrial robots  1. Pad fraae 2. Support circuitry for eultlplier section 3. Register f i l e for asntisaa 4. Decoders and drivers for ssfttiasa register f i l e 5. Drivers for exponent register f i l e 6. Register f i l e for exponent and sign 7. Adder section 6. PLA for ore-shift control 9. Pipeline registers for s u l . sect, exponent 10. Adder for s u l . sect. sxp. 11. Exponent adjust adder in adder ssct. 12. Shift registers for register sinks 13. Shift registers for register sources 14. Instruction decoder 15. Pro oras counter IB. Clock generator 17. Latches for literals and other constants IB. Exponent coapsrlson odder i n sddsr ssct. 19. I/O sultlplsxsr for santlsaa 20. OR-plsns for LZC i n s u l . ssct. 21. I/O sultiplsxer for exponent  applications. This simplified design is called APRjr.  66  9. THE  APRJR The essential difference between APR  fixed-point  and APRjr is that the latter only handles  numbers. For the specific task in mind, namely the computation of the  direct transform, a fixed-point representation should be adequate The  for most robot arms.  analysis concerning the adequacy of fixed-point representation in the computation  of the direct transform has been discussed in section 3.2. The  circuitry for performing  the fixed-point calculations in APRjr is the same as described for APR. shows a block diagram of APRjr. The The  Figure 9.1  APRjr design is a subset of the APR  design.  entire exponent part, the two barrel shifters and the leading zero checker in the  adder section are deleted. In addition, the width of the program counter is reduced to 10 bits, and the number of flags, and hence pins, is reduced by two. The rest of the design is essentially the same as in APR, 16- bit  2's  complement  format  representation is used in APR  instead because  with of  the numbers being represented in  sign-magnitude.  it is the IEEE  The  standard and  normalization simpler to handle. In APRjr, there is no  IEEE  and  there is no  the numbers do" not have to be  sign-magnitude  representation. Using  normalized, so  it also makes  standard to adhere to, reason to use  2's complement representation also removes the  need for an end-around-carry adder for the adder section. A adder is used instead. Because both APR not run any faster then APR  sign-magnitude  simple 2's complement  and APRjr are heavily pipelined, APRjr does  in terms of the length of each machine cycle. However,  the reduced number of pipeline stages in the adder section in APRjr will enable it to complete a computation faster in cases where it cannot run in streamline mode (i.e., when operands  can  be  fed to the execution pipelines continuously). This relatively  small improvement in speed at the expense of reduced versatility was the reason  why  the APR  design was attempted originally. Nevertheless, APRjr is meant to be a testing  prototype  for APR  that is both  useful and  can  technology, not an improved or alternative design.  be  fabricated  with  the available  Barrel  Leading Zero  Shifter  Checker  Multiplier  Register F i l e  Adder  (off-chip) for Mantissa  BS and LZC  control signals  controller  Control  Data I/O  CCR  IR Instruction  0 Data/Addresa  PC  Progran Address  1. Pad fraat 2 . Support c i r c u i t r y for a u l t l p l l e r section 3 . Regieter  filt  4. Adder section 5.  Oecodore for r e g i s t e r  flit  6. Clock generator 7. I/O unit 8. 9 . S h i f t r e g i s t e r s for r e g i s t e r  file  10. Instruction r e g i s t e r 11. Progrsa counter 12. Instruction decoder 13. S h i f t r e g i s t e r s for Instructions 14. Flags 13. n.A for stack control 18. Registers for barrel s h i f t e r s h i f t Mount  69 The completed  floor  plan  of APRjr  is shown in Figure 9.2. Its layout has been  and has been sent for fabrication. The overall chip size is about 6.5mm x  6.9mm. The following sections describe some of the more global aspects of APRjr, such as processing speed and the use of APRjr  in a microprocessor system. Most of the  description is applicable to APR also. The differences are pointed out whenever they are important 9.1  Speed Considerations  In  APRjr,  the possible timing bottlenecks are the multiply stage  of  the  multiplier section, and the register decode and fetch stage. APR has another potential timing  bottleneck, namely  the exponent  comparison  and fraction  alignment  stage.  Depending on the speed of the off-the-shelf integer multiplier used, either of these two stages will become the ultimate bottleneck. With super-fast multipliers such as the Weitek WTL 2010, the speeds of these two stages are comparable. Although the multiply time of these fast integer multipliers is very short compared with the time required to read and write the register file twice, the overall multiply time is much longer because the signals have to go off-chip. The register read and write cycle is slow mainly because it has to work twice every clock cycle, once for the adder and once for the multiplier. It may be possible to run APRjr time reduce  faster, and at the same  the size and complexity of the chip, by not running the adder and  multiplier in parallel, as, in this case, the register file only has to work once every cycle. However, operating the register  file  twice  in one cycle is preferred here,  especially as it allows for the possibility of incorporating a multiplier on chip at some later date. It is to be appreciated that the estimated speed  of a 16-bit parallel  multiplier in 4M m CMOS is about 300ns, which is much slower than the register file.  70 The  time required to run the register file twice is about 150ns. The adder in  the data path requires 50ns. Shifting a number with the barrel shifter  takes 60ns.  Using the Weitek WTL 2010, which has a multiply time of 65ns, total multiply time is estimated to be 160ns. Instruction decoding and program address generation are not in the critical path. 9.2 Programming APRjr has twenty 32-bit wide instructions, which are divided into two groups: multiply/add and ETC. Figure 9.3 shows the formats of the two types of instructions. The  first bit of each instruction specifies whether the instruction is a multiply/add or  an ETC instruction. In a multiply/add instruction, 1 bit determines whether the adder is to perform addition or subtraction, and 5 bits are used to represent each of the sources and sinks of the multiplier and adder. In the ETC group, 5 bits are used as the op-code. These instructions include memory reads and writes, program branching,  31 30  25  111 ? |  source 1 |  \  20 source 0  15  I  sink  ^  [  10 source 1  I  26 op-code  |  sink  ^ for adder  Multiply/Add  31  source 0 |  / V  for a u l t i p l i e r  101  5  Group  10 data/address  ETC  |  Group  (with a few exceptions, see Appendix C)  Fig. 9.3: Instruction Formats  5 source  0 |  sink  |  71  loading constants,  etc. The complete set of instructions for APRjr is listed in Appendix  C.  The switch  instruction  from  set  for  fixed-point  to  mantissas to sign-magnitude  9.3 The APRjr  Figure  APR is  more  floating-point  extensive,  mode,  consisting  and  of  converting  instructions  to  complement  2's  representations.  Subsystem  9.4 shows  how APRjr  can be  used  in an arithmetic subsystem  for a  microprocessor system. Besides APRjr and its associated integer multiplier, the subsystem contains a program R O M , a data R A M , and a data R O M for table look-ups. Part of the data  address  space of  APRjr  is  mapped  to  the  system  memory. The two data  busses are usually separate, however, so that APRjr can run in parallel with the host processor.  When  the  host  processor  has  collected  coordinates, and wants the APRjr subsystem to the subsystem  (or  guessed)  a  new  set  of  joint  to compute the direct transform, it signals  by writing to InFlagO of APRjr.  This port can either be memory  mapped or I/O  mapped. Upon receipt of this signal, APRjr  the system  from the host by sending an interrupt from its OutFlagO. When the  bus-grant  bus signal  is  received  (using  the  other  input  port  will  on  request  APRjr,  the use of  InFlagl),  the  portion of the data space in APRjr that is mapped to the system memory is used to read the joint coordinates directly from the host's memory. When all the data is read, APRjr  relinquishes the system  this time, the host transform,  is free  bus, and proceeds  to perform other  it sends an interrupt to the host,  to compute the direct transform. At  tasks. After using  APRjr  has  OutFlagl. The host  completed the acknowledges  the interrupt by writing to InFlagO of APRjr and relinquishing the system bus.  APRjr  then writes the result of the transform in the common memory area. Upon finishing,  72  v CO  OutFlagl OutFlagO  c_ CC Q_  InFlagl  -*  InFlagO  ix>  X3  w c_ •o •o  tn CO  c_ •o ID JO  ro  CD  CC  re re  QJ -O  I  CL  — r- l  CJD -t_>  Fig.  9.4: Block Diagram of the APRjr Subsystem  ro ro CD  73  it sends  a signal  task from the host.  from  OutFlagO, and  returns  to the  idle  state, waiting  for  another  74  10.  TESTING  A designed,  test  chip  fabricated  comprising  one  register  in 5M m ISOCMOS  cell  at GTE's  and  one  barrel  shifter  foundry and tested at  were  deliberately  incurred in the  interests  of  saving  space  was  UBC. These  cells were chosen for testing because they contain violations of the design violations  cell  but  rules. The were  not  deemed to be electrically unimportant This was confirmed by the testing as the chip was found to be fully  functional. Testing for timing (operation  speed) is not possible  because of the extra capacitive load of the pads on the chip and the probes of the testing station.  Testing APRjr will be a lot more difficult when it is eventually fabricated. The APRjr  design  (Level  Sensitive  make  the design  present  does  not  Scan up  contain  Detection)  any  testability  [5].  This  is  to 30% bigger,  and  there  enhancement  circuits  such  because inclusion of such is  as  LSSD  circuitry will  not much room to spare in the  design. Testing can be divided into 2 parts: prototype testing  and production  testing.  10.1  Prototype Testing  When  the  first  chips  come  back, .the  chips  may  still  contain  design  errors.  Therefore, a step by step procedure is necessary to isolate any errors so that proper corrections can be made  in future  runs. Prototype testing  must start with chip level  testing. The following steps should be followed: 1.  None of the pins should be shorted when no power is supplied to APRjr.  2.  Power should then be connected to APRjr. With all other input pads grounded and output and tri-state pads floating, the clock input should be connected to a clock generator.  A data generator  such as  the HP8180A can be used. The  reset pad on APRjr should be connected to a debounced switch. When reset is  75  high, the internal clock should stay at  <j> . 0  When  reset  goes low, the internal  clock should start running. The internally generated clock phases, and  their complements should be monitored  <j>  0  to 0«  on a data analyzer such as the  HP8182A. A high bandwidth oscilloscope should also be used to observe the exact shapes of these waveforms. If the clock phases prove to be as expected, then the next testing step can be carried out. However, if the clock phases are faulty, then measures must  be taken  so that other  internally generated  building  blocks can be tested without the  clock phases. This can be done by forcing the required  clock phases by using the probes in a VLSI probing station to connect the output of a data generator to the appropriate clock wires. Although the tips of the probes are very tiny, the rest of the probes are rather bulky. As a result, in practice only about 6 probes can be used at any one time. Hence, if some of the probes have to be used to force signals onto the chip and some to monitor signals from the chip, then only very small sections of the chip can be tested one at a time. The following steps can be applied to test any faulty building block: a.  Using the probes in a probing station, check all power supplies to the block.  b.  Check all clock inputs, if any.  c.  Starting from  the inputs to the block (e.g. for the clock generator  block, the inputs come from  the input single-phase clock), check the  output of each gate. The inputs and outputs of each stage can be monitored  on a data analyzer. Stages  that are faulty  can then be  identified by comparing the observed waveforms with the expected ones. This can be repeated until the final outputs are reached.  76 Where forcing  necessary,  signals  generator.  on  its  a  stage  inputs  with  data generator.  next  program  it  probes  from  its  connected  inputs to  a  by data  has a much larger drive than the  It is also possible to isolate a block from other circuits wires on the chip with the probes. This  when some circuits connected to a multiplexed line are faulty.  function  (assuming  isolated  by APRjr, the inputs will take on the values from the  by physically cutting the metal is useful  be the  Because the data generator  signals generated  The  can  to  works),  counter  to  test the  is  address  reset line  0 and  the  sequencing.  should  rest  of  be  With  held  APRjr  to  the  high.  clock  This  NOP  a  running  forces  state.  A  the  NOP  instruction (all O's) should be connected to the program address/data pads. The  reset line should then be allowed to go low. This starts program execution in APRjr.  The program  counter  should  increment  one  step  every  machine  cycle  until it reaches 3FF. It should then go back to 0. With the program running, instructions  JMP  that affect the program counter, such as  and RTN, can be tested. The proper instructions should be forced to the  program  address/data pads using the data generator,  while the program counter  is monitored on the data analyzer. The  next instructions  can be  connected  multiplexed  to  that can be tested are  to the  I/O instructions.  data address/data  communicate  either  the  pads (the  data  address  pads or  An input datum for  the  the  data  data  are  themselves,  hence the term  "data data pads") by hard wiring. A memory read instruction  should  supplied  then  instruction, content  be using  any  by the  register  data  on  APRjr  If the instructions are executed  address/data  pads  should  equal  generator  the  to  followed temporarily  by a store  memory write the  memory  properly, then the signals on the data read  address  supplied  by  the  read  instruction during d)3 of the read instruction. In d>3 of the write instruction, the  77 write address should be observed on the pads, with the data appearing in the following 0 . 4  6.  The read flags and set flags instructions can be tested next  7.  Finally, the most important instructions: add and multiply, can be tested. This can be done by having the data generator generate a short program that reads in some data and then have APRjr operate on the data using either the adder section or the multiplier  section. The  results can  then  be  written out for  checking. The multiplier section can be tested without actually using a parallel multiplier. The multiplier 'result' can be supplied by the data generator.  If APRjr APRjr  passes all these tests, then testing on  must then undergo testing at the board  the chip level is complete.  level. The  following steps are best  followed with the use of a wire-wrap board: 1.  With  APRjr  connected  as in the last section, connect  a WTL  2010  16-bit  fixed-point parallel multiplier, or any multiplier that is pin-for-pin compatible with it,, and run the same test program as the last step above on the data generator again. This time, the multiplier result should be correct. 2.  Next a  data RAM  can  be  added  to the  system. A  number  of memory  read/write instructions can then be executed to check if data can be read from and written to the data 3.  The program ROM  RAM.  can now  be added. Testing programs can be burned in an  EPROM and then connected to APRjr. By observing the outputs of APRjr, it should be possible to decide whether the programs are executed correctly or not 4.  The final step in board level testing is the addition of the necessary glue logic required to interface the ARPjr  sub-system to a microprocessor bus. Then a  complete test program (one that will be used to compute the direct transform, for example) can be run. Again, if the results of the computation  are correct,  78  then it can be assumed that the program has been executed correctly.  10.2  Production Testing  After  APRjr  has  been  thoroughly  tested  in  the  prototype  stage, volume  production can be started. Because the yield of chips is never 100%, testing is required to determine which chips are faulty due to processing defects. This is best done by putting every APRjr in its board sub-system and then running a test program, as was done in the last step in prototype testing. Chips that produce the correct results can then be considered defect free.  79 11.  CONCLUSIONS  After extensive  analysis of robot manipulator applications, it is concluded that a  16-bit fixed-point number format is adequate for most present-day  By  considering  coordinate  the  transformations  architecture  concepts,  properties in  it is  of  robotics,  decided that  the  applications.  computations  involved  investigating  all  and  by  a  general  purpose  single  the  in  computing  new  processor  computer system  still the best approach. However, all the latest innovations in computer architecture applied  to  this  processor.  These  innovations  include  parallel  processing  multiplier can run in parallel), non-interlocked pipeline instruction fetch  (adder  is are and  and execution  cycles, reduced instruction set, large register file, and Harvard architecture.  Following these decisions, an arithmetic processor called  APR,  operations  supports  both  floating-point  and  has  fixed-point  been designed. arithmetic.  are included in order to broaden the potential applications  chip called APRjr  has also been designed.  It will be both useful  The chip,  Floating-point  of APR. A test  APRjr only supports fixed-point arithmetic.  as a practical chip and as a test vehicle for APR. The layout  of APRjr has been carried to completion and it was submitted for fabrication  4LI m CMOS process in late January,  1985. Based  on extensive  with a  simulations, a machine  cycle time of 200ns is expected.  11.1  Contributions of this Thesis  The APR project has contributions in both the VLSI and robotics areas. In the VLSI area, the APR project has stimulated the interest in large scale processor here  at  the  University of  British Columbia. The experience  accumulated  designs  during  the  design of APR and APRjr will benefit future VLSI digital designs. Through hands-on experience, clock  fundamental  generation  have  digital been  designs in LSI  investigated  such as  registers, adders, shifters,  in detail. Hence, a  large number  of  and  library  80  cells  have  been  created.  These  designs  starting point for improved designs.  can  be  used  in  future  designs,  or  as  the  At the beginning of this project, no library cells  were available.  The attempt to design a large scale chip has also revealed a lot of deficiencies in  the C A D tools  currently available. The possible  improvements are  outlined in the  next section.  In  the  robotics  area,  APR  economical. Because of the general  can  make  real  time  trajectory  possible  and  nature of the architecture, APR can be used for  other computations in robotics other than coordinate transformations,  such as  real-time  control of manipulators [20].  In addition, APR can be used for other applications such as computer graphics, digital  signal  processing,  and  any  other  tasks  that  require  high  speed  but  only  moderate precision arithmetic.  11.2 Suggestions for Further Research  VLSI is only an emerging field  today, particularly in the University of British  Columbia. The ambition of the present project is greatly restricted by the lack of an advanced fabrication process,  the high cost and long turn around time of fabrication,  and the lack of a sizable research team working actively in VLSI. When all or most of these difficulties  are overcome, the APR design can be improved in quite  a few  ways.  APR has had to be designed in one man-year, therefore, the designer will be more  than  satisfied  if the  chip turns  out  to  be  functional. Given  more  time and  design talents, the author believes that the APR design can be improved in terms of compactness,  regularity, and timing. A more refined floor plan can significantiy compact  81  the design, and at the same time make the layout^ more regular. Regularity not only renders the design easier and thus less error prone, but it also facilitates  any future  modifications. A careful study of the delays of all the signal paths in the design will locate  exactly  where  the  bottlenecks  in  the  design  are.  With  this  information, the  design can be fine-tuned by reducing the resistance in the critical paths, by buffering to reduce the load on each stage, and by modifying the architectural design.  On and  the system  level, the computation sub-system  can be made more compact  probably faster by designing a multiple chip set to replace the other components,  such as  the integer  multiplier and the program and data  memory. For example, the  multiplier and the data memory may be integrated on one chip.  When completed  a  and  more the  advanced  chip  process  fabricated.  It  becomes may  available,  even  be  the  possible  APR layout to  make  can  be  APR more  powerful in a number of ways: •  The parallel multiplier can be integrated on chip.  •  A full  32-bit  data path can be implemented, with full  32-bit registers and a  23-bit parallel multiplier. •  The size  of the register  file  can be  increased;  128  registers would  be ideal.  Larger register files will slow down the machine cycle so much that the gain in having fewer memory accesses will not be worthwhile [24]. •  A more powerful instruction set can be designed so that APR is made a lot more versatile, with little or no deterioration in processing speed.  A lot of research is also required in the C A D area. Many of the following required CAD the  tools already exist in some form or another, however, they are not available at University of  British Columbia.  Also,  most  of  the  existing  tools  are  still  at  a  rudimentary stage, and improvements are not only possible but necessary. •  Simulation from layouts. Presently only small sections of a large design (of the  82  size  of  APRjr)  can  be  slow,  but  it  extremely  simulated. has  The circuit simulator,  convergence  problems  with  SPICE, its  level  is  not  only  2  MOSFET  model. The switch level simulator, hg, or R N L (Bg is a Metheus modification of  RNL), performs  around  reasonably  100 transistors.  well  But it has  with  moderate  problems  size  dealing with  circuits  consisting  certain circuits  of  [29].  This means that hg cannot be run on any design without scrutiny.  There simulation  probably  with  However,  it  interactively the  is  these  will  be  very  little  simulators,  that  can  unless  extremely  be  done  hardware  helpful  if  the  about  the  accelerators  are  simulators  using a windowing facility. In other words, the  simulators  modifications  to  work  to the  on  a  design.  certain  At present,  portion  of  in order  his  to  can  speed  of  available.  be  invoked  designer  can call  design,  simulate  a  without section  any of a  design, the designer must make that portion that he wants to be simulated a self-contained cell.  A  timing analyzer  [15,23]  is  also  helpful  in  filling  the  gap  between  precise circuit or switch level simulation and functional simulation on the chip level. Hierarchical  simulation system.  When a digital design is formulated, it is best  to  the  top-down  simulate  specified  first  design by  its  in a input  simulated on the system  and  fashion.  output  For  example,  characteristics.  (or processor-memory)  The  level. After  the  design  design  is  making sure  is  then that  the designs works in this level, it is expanded one step down the hierarchy, namely the register-transfer level,  making  sure  processor-memory  that  level. The design the  inputs  and  is then simulated again  outputs  are  as  specified  in this in  the  level model. Next the design is detailed in the gate (logic)  level, then the circuit level, and finally the layout level. A design in a lower  83  level  is started  only after  the present  level is proven to be compatible with  the level above it using the proper simulation tools. At present, only gate level (HILO) and circuit level (hg and SPICE) simulators are available. Auto-Router. simply  Routing a  because  the  VLSI  designer  chip is  extremely  cannot  see  painstaking  what  he  is  and  doing  error prone, even  on  a  high-resolution graphics terminal. Multiple windows would help, but it would be much more convenient if an auto-router be able to operate in several the  router  should small  interactively  take  is  supplied  details such as  useful  when  on  the  The ideal router should  modes. In manual mode, the designer can direct  the  screen.  by the  separation  routing  is available.  The  designer, between  channels  are  general  while the layers  tight  and  In  direction router  that  will  the  route  take care of  crossovers. This mode  semi-auto  mode,  the  is  router  should be able to route a set of ports together without aid, given the location of the ports and the building blocks. In full-auto mode, it should be able to rearrange so  that  the building blocks slightly, such as shifting a little bit to either side, a  complete  routing  is  obtained.  The  most  efficient  routing  is  then  obtained by the application of a suitable combination of these modes. Layout compactor. A layout compactor is useful designer  specifies  the  circuit  the  topology  in compacting small cells. The of  the  connections,  and  the  dimensions of the desired cell, and the layout compactor will attempt to fit the design in the specified area. Floor planner. A  floor planner can give  the designer  size of the chip given the rough floor plan of the routing  requirements.  Perhaps  the  floor  planner  estimates of the overall building blocks  can  even  be  and the  made  smart  enough to come up with some close to optimum floor plans with minimal help from the designer. A  more powerful  circuit extractor. The circuit extractor  that is available now  84 only recognizes transistors and parasitic capacitors. What is more desirable is a hierarchical circuit extractor  that can recognize the layouts in the gate level,  register-transfer level, and even the processor-memory gates from  a  intelligence.  Register-transfer  more difficult,  layout  should  level. Recognizing logic  be  easy,  especially  level  and  processor-memory  requiring considerable  programming  with the  effort  use  level and a  of  artificial  recognition  is  sizable data  base. A better scheduling program. The scheduling program which is in use now is still very difficult to use. The user has to break up the individual operations that are required to compute the equations, and then arrange the operations in the form of a data dependency graph. The program can be improved so that the user is only required to enter the equations in a more convenient form (reverse Polish notation, RPN, is a prime candidate).  Another possibility  for  improvement is to include in the scheduling algorithm, the effects of the finite number of registers available.  85  BIBLIOGRAPHY Cited References [I] .  S.F. Anderson, J.G. Earle, R.E Goldschmidt, and D.M. Powers, "The  IBM  System/360 Model 91: Floating Point Execution Unit", IBM Journal of Research  and Development, vol. 11, no. 1, pp.34-53, 1967. [2].  R.P. Brent and H.T. Kung, "A Regular  Layout for Parallel Adders",  IEEE  Trans. On Computers, vol. 31, no. 3, pp.260-264, 1982.  [3].  J. Clark, "A VLSI Geometry Processor for Graphics", IEEE  Computer, vol.13,  no.6. pp.59-68, June 1980. [4].  J.T.  Coonen,  "An Implementation  Floating-Point Arithmetic". IEEE [5].  E B . Fichelberger,  to  a  Proposed  Standard  for  Computer, vol.13, no.l, pp.68-79, 1980.  T.W. Williams,  Testability", Tutorial VLSI  Guide  "A  Logic  Design  Structure  for LSI  Support Technologies: Computer-aided Design, Testing,  and Packaging, R. Rice (ed), IEEE Computer Society Press, pp.245-251, 1982. [6].  Excalibur Arm (Model No. RT3), Robotic Systems International, Sidney, B.C., V8L 3S1. Canada.  [7].  D. Fitzpatrick, "VLSI Implementation of a Reduced Instruction Set Computer",  VLSI  Systems and Computations, H.T. Kung, B. Sproull, and G. Steele (ed),  Computer Science Press, Rockville, Maryland, pp.327-336. 1981. [8].  GTE Microcircuits, Tempe. Arizona 85281.  [9].  J.L Hennessy, N. Jouppi, S. Przybylski. C. Rowen, and T. Gross. "Design of a High Performance Processor".  Third  Caltech Conf. on VLSI, R. Bryant (ed),  Computer Science Press, Rockville. Maryland, pp.33-54. 1983. [10].  J.L. Hennessy. N. Jouppi. F. Baskett, and J. Gill. "MIPS: A VLSI Processor Architecture", VLSI  Systems and Computations, H.T. Kung, B. Sproull. and G.  Steele (ed). Computer Science Press, Rockville, Maryland, pp.337-346, 1981. [II] .  J.L Hennessy, N. Jouppi. S. Przybylski. C Rowen, and T. Gross. "Performance  86  Issues in VLSI Processor Design",  IEEE  Int'l Conf. on Circuits and Computers,  pp.153-156, 1983. [12].  J.L.  Hennessy,  N . Jouppi,  J. Gill,  F. Basket!, A. Strong,  T. Gross,  C. Rowen,  and J. Leonard, "The MIPS Machine",IEEE Compcon Spring, pp.2-7, 1982. [13].  HILO-2 User Manual, GenRad Inc.,  [14].  R.F.  Hobson,  Simon  Fraser  1984  University,  Burnaby,  British  Columbia,  Canada,  private communication, March 1984. [15].  N.P. Jouppi,  " T V : An nMOS Timing Analyzer", Third Caliech  Conf. on VLSI,  R. Bryant (ed), Computer Science Press, Rockville, Maryland, pp.71—85, 1983. [16].  G . H . Katavenis,  R.W. Sherburne,  D.A. Patterson, and C.H. Sequin,  II Micro-Architecture", VLSI '83, F. Anceau and E.J. Aas (ed),  "The  RISC  Elsevier Science  Publishers B.V.(North-Holland), pp.349-359, 1983. [17].  A.  Kozak,  private  Canada, Oct [18].  [19].  D.  Landskov,  Microtel  Pacific  Research,  Burnaby,  B.C.,  1984.  R.H. Krambeck, with CMOS",  communication,  C M . Lee,  IEEE S.  and  H.-F.S.  Law,  "High  Speed  Compact  Circuits  JSSC, vol.SC-17, no.3, pp.614-618, 1982.  Davidson,  B.  Shriver,  and  P.W. Mailed.,  "Local  Microcode  Compaction Techniques", Computing Surveys, vol. 12, no. 3, pp.261-294, 1980. [20].  C.S.G. Lee, T . N . Mudge, and J.L. Turney, "Hierarchical Control Structure using Special  Purpose Processors for the Control of Robot Arms", Proceedings of the  IEEE Pattern Recognition and Image Processing Conference, pp.634-640, 1982. [21].  K.  McDonough,  32-bit  arithmetic  E.  Caudel,  S.  Magar,  does high-precision  and  number  A.  Leigh,  crunching",  "Microcomputer  with  Electronics, Feb.  24,  pp.105-110, 1982. [22].  C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Menlo Park, CA., 1980.  [23].  J.K.  Ousterhout,  "Crystal:  A Timing Analyzer for  nMOS  VLSI Circuits",  Third  87  Caltech  Conf.  on VLSI,  Maryland, pp.57-69, [24].  Bryant  (ed),  Computer  Science  D.A.  Press, Rockville,  1983.  D.A. Patterson and C . H . Sequin, " A VLSI RISC", IEEE 9, pp.8-22,  [25].  R.  Computer,  vol. 15, no.  1982.  Patterson  Computers  of  and the  C.H.  Sequin,  "Design  IEEE  Future",  Trans,  Considerations  on  Computers,  for  Single-Chip  vol.C-29(2),  no.2,  pp.108-116, 1980. [26].  [27].  D.A. Patterson,  "A  Spring,  1982.  pp.8-14,  RISCy  Approach  R.P. Paul, Robot Manipulators  -  to  Computer  Mathematics,  IEEE  Design",  Programming  Compcon  and Control",  MIT  Press, Cambirdge, Mass., 1981. [28].  [29].  J.K.  Poon, "Parallelism in the Micro-Program of a Special  Processor",  UBC VLSI  RNL  User's  4.2  UW/NW [30].  A.  Guide,  VLSI  Report  Design  001, May, 1984. Tools  Reference  Manual, Release  2.1,  VLSI Consortium, University of Washington, Seattle, Washington, 1984.  Sawai,  Systems  Lab. Tech.  Purpose Arithmetic  "Programmable  and  LSI  Computations,  Digital  H.T.  Signal  Kung,  B.  Processor Sproull,  Development", and  G.  Steele  VLSI (ed),  Computer Science Press, Rockville, Maryland, pp.29-40, 1981. [31].  R.B.  Seeds,  Electron [32].  "Yield  Devices  and  Cost  Meeting,  Oct  Analysis  of  Bipolar  LSI",  Tech.  Digest,  1967.  C . H . Sequin and D.A. Patterson, "Design and Implementation of RISC I",  Architecture,  IEEE  R.  Randell  and  P.C. Treleaven  R.W. Sherburne,  M.G.H.  Katavenis,  (ed),  Prentice  Hall,  VLSI  pp.276-298,  1983. [33].  Memory  in RISCs",  IEEE  Int'l  D.A. Patterson,  Conf. on Circuits  and  C . H . Sequin,  and Computers,  "Local  pp.149-152,  1983. [34].  R.  Sherburne,  "Data  Path Design  for  RISC", Proc.  MTT Conf. on  Advanced  88 Research [35].  P.C.  in VLSI,  Treleaven,  Demand-Driven  pp.53-62, 1982. D.R.  Brownbridge,  and  R.P.  Computer Architecture", ACM  Hopkins,  Computing  "Data-Driven  Surveys,  and  vol.14, no.l,  pp.93-143, 1982. [36].  P.C.  Treleaven,  Architecture,  R.  "Decentralized Randell  and  Computer  Architectures  P.C. Treleaven  (ed).  for  Prentice  VLSI  VLSI", Hall,  pp.348-380,  1983. [37].  J.  Vuillemin and L. Guibas,  IEEE [38].  Int'l  Conf. on Circuits  "On Fast Binary Addition in MOS Technologies",  and Computers,  pp.147-150, 1982.  H . Yoshimune, Y. Kita, T. Miyamoto, Y. Toba, H . Hara, and T. Akazawa, " A Single  Chip Digital  Analysis", IEEE  Signal  JSSC,  Processor  and  Its  Application to Real-Time  Speech  vol. SC-18, no. 1, pp.91-98, 1983.  General References In addition to the above cited literature, the following references will be found useful for anyone interested [1].  R.J.  in VLSI or coordinate transformations  Antimone and G.W. Brown, "The Modeling of Resistive Interconnects  Integrated Circuits", IEEE [2].  in robotics:  L.N. Dworsky, Modern  JSSC,  for  vol. SC-18, no. 2, pp.200-203, 1983.  Transmission  Line  Theory  and  Applications,  Wiley, New  York, 1979. [3].  K.  Hwang,  Computer  Arithmetic  Principles,  Architecture,  and  Design,  John  Wiley and Sons, New York, 1979. [4].  [5].  H.-F.S.  and  M.  Shoji,  Microprocessor", IEEE  Int'l  Conf. on Circuits  C.S.G.  Law  Lee,  "Robot  Arm  Kinetics,  "PLA  Design  for  the  and Computers,  Dynamics,  and  BELLMAC-32A  pp.161-163, 1982.  Contror,  IEEE  Computer,  vol.15, no.12, pp.62-80, 1982. [6].  C.S.G. Lee, " A Geometric Approach in Solving the Inverse Kinematics of Puma  89  13th Symposium on Industrial Robots, pp.16-1,16-8, 1983.  Robots", [7].  A . D . Lopez and H.-F.S. Law, " A Dense Gate Matrix Layout Method for MOS VLSI", IEEE  [8].  JSSC, vol.SC-15, no.4, pp.736-740, 1980.  W.K. Luk, " A Regular Layout for Parallel Multiplier of 0(Log J N) Time",  VLSI  Systems  (ed),  and  Compulations,  H.T. Kung,  B.  Sproull,  and  G.  Steele  Computer Science Press, Rockville, Maryland, pp.317-326, 1981. [9].  J.  Mavor,  M.A. Jack,  Introduction to MOS LSI  and P.B. Denyer,  Design,  Addison-Wesley, 1983. [10].  C. Mead and M . Rem, "Minimum  Propagation  Delays in VLSI",  IEEE  JSSC,  vol. SC-17, no. 4, pp.773-775, 1982. [11].  S. Muroga, VLSI  [12].  R.Y. Pinter,  System Design, John Wiley and Sons, 1982.  "River  Routing:  Methodology  VLSI  and Analysis",  Computations, H.T. Kung, B. Sproull, and G . Steele  Systems and  (ed), Computer  Science  Press, Rockville, Maryland, pp.141-163, 1981. [13].  J.K. Poon, D.L. Pulfrey, Robot  Control",  and P.D. Lawrence,  International Symposium  "Arithmetic Processor  on VLSI  Chip for  Technology, Systems and  Applications, pp.337-341, 1985. [14].  G.Rabbat  (ed),  Hardware  and  Software Concepts  in  VLSI,  Van Nostrand  Reinhold, 1983. [15].  P.  Reusens,  W . H . Ku, and Y . - H . Mao, "Fixed-Point  Multiplier", VLSI  High  Speed  Parallel  Systems and Computations, H.T. Kung,B. Sproull,and G . Steele  (ed), Computer Science Press, Rockville, Maryland, pp.301-310, 1981. [16].  T. Sakurai,  "Approximation of Wiring Delay in VLSI", IEEE  JSSC, vol. SC-18,  no. 4, pp.418-426, 1983. [17].  M . Shoji,  "Electrical  Design  of  BELLMAC-32A  Conf. on Circuits and Computers, pp.112-115, Monolithic Multipliers", IEEE  IEEE  Microprocessor",  1982. S. Wasser,  "High  Computer, vol.11, no.10, pp.19-29, 1978.  Int'l Speed  90  APPENDIX A: LAYOUTS In this appendix, the layouts of some of the more important cells are shown. The design rules used are those for the GTE 4M m ISOCMOS process. The rules are confidential and so cannot be disclosed here. The following legend is used in all the figures:  Diffusion Polysilicon  — — — — —  Metal  Contact Cut  N+  Diffusion  P+  Diffusion  P-Well  •  —  —•  91 A . l : 3-Port Register Cell See also pp. 41-44, 92, 110  92 A.2: 2 x 2 Register File Notice how  the symmetry of the register cell is taken advantage of in the register  file. See also pp. 41-44. 91. 105-110  A.3: One Barrel Shifter Cell See also pp. 44-47, 94  shl  shO  94  A.4: 3-out-of-5 Barrel Shifter See also pp. 44-47, 9 3  sh3  sh2  s h l shO  in2 i n l  inO  A.5: 4-Bit See  Leading Zero Checker  also pp. 47, 111, 112 Vdd  &V!rii4Vfriilrj?  substrate  I  1  1  I i'h  DataO II  1  '  •  "  •  • '!  :  ftjsri  !  Ici^cr+iP  Datal  Data2  Data3  IS 1 2 3 4  96 A.6: Stack Cell See also pp. 62, 97, 115  A.7: 2-Bit by 2-Level Stack See also pp. 62, 96, 115  shl  98  A.8:  2-Bit Par-Add Adder  See also pp. 51-54, 113  i r fo^r-ni i i n i r - - - f 5 h r i so LLU ±T\ f i ^ > i U r i r L u ! , _C. I M|. r - r - i «TT!, L>J4_I n i ftT7-"H.I.'hr.j-sr ,•-|U I-HI«— •*— >—" I iInil ^ir*j nzt L-' ,! r  l  1  rJ_J  .  .. :  II II  •  1  'Hi  r|  W ]  1  J  L  (  r  T i I—,JrriiP « t — V -  .JrrrJ i jPl i i  I!  -iW  ^ Ln>~H rr—H! . •  !,.....4. Jrjl • :L_ LJ- .._1.J..L, j  f-^HMla iii •j__n i i i i i i  i_  LI l!  , i iii  1  5  ,  1  T  (LI  J. — r Al .J..|.. , — — H r f i ; i l . ' i: liM •jj^Dj-r- — - n ° | l i f e i iMr-ii'.— •••H-r'LJri—t'l—i l\ff :  i 7jn'.-.7..... r..T..._j..l i|  rrJJ•»1 y  Ti  r t  TT'TriJiJ  ,  J B b - = r H — t J ^  c  4  —-»Sr : T i - U .  99  A.9: 1-Bit Static Regenerative Manchester Carry Adder See also pp. 48-50, 114  GND  Cin  100  A.10: PLA for Stack Control See also pp. 60, 61, 117  inl  inO  101  A.ll:  Clock Generator  The circuit for the following layout is shown in Figure 7.13. See also pp. 55-60, 116  J" " - V - V - V ' J II i t  Vdd  i i i i  102  A.12: Flags The  following circuit and corresponding layout is used to store the flags in APRjr.  no}o»o_p»r (2)  noloaa (2) }o»B_p»r 12)  r4  lo»0 (2)  Instruction  t l l t j l i r (2)  4-  last (2)  r 1*0  >~1  0.  •«ak Orlv«  1  rtittjitr  i  103  A.13: The in  Barrel Shifter Controller for APRjr following circuit and corresponding layout is used for controlling the barrel  shifter  the multiplier section of APRjr, and for testing the number of leading zeros in the  multiplier result. SetSh  overflow  SetSh.par  instruction  SetfeqlZ overflow  M  • 'B^lctrl  iff?  151 i < 'ii 1  r[j..ja_. o_jjL]tr1r  •  J  LZin  overflo SetSh  104  APPENDIX B: SIMULATIONS  In presented.  this  appendix,  some  of  Two simulators  have  been  simulator, SPICE,  the  more  used  in  important this  and the switch level simulator,  level 1 MOSFET parameters  simulations  project,  namely  hg. For small  performed the  are  circuit level  circuits, SPICE with  is used. Level 2 is only used for very small circuits, not  only because of the slow execution speed of level 2, but also because it has frequent convergence problems. Hg is used for larger circuits, consisting of over 20 Where output.  necessary,  the  circuit used  in  the  simulation  is  shown  with  transistors.  the simulation  105 B.1: Writing a Zero to a Register See also pp. 41-44, 92  I CIRCUIT: W G  DATE. MtW MAR  4 i l . "57: 46 1985 ( f i l e ,  reg)  [  106 B.2:  Writing a One  See  also pp. 41-44, 92 CIRCUIT: BEG  to  a Register  DATE. * W MAR  4 l i : 33. "58 1 985 ( f i l e ,  reg)  107  B.3:  Writing a One to a Register with Noise  The  signal on the write bus is set to only 4V in this simulation. Notice the response  of the register cell is almost the same as in Appendix B.2. See also pp. 41-44, 92  CIRCUIT:  -0.00-f  U.On  BEG  DATE: MW  3 . b n 6. On  MAB  4 12. 14: 46 1985 (file:  9. On  12.'On  15 .'On  "eg)  IB.On  2l!on  24~!on  27 !on  SCLOn TIME  GflOUP 1: V ( i n t l ) V(wc)  V ( i n t O ) V (wl)  108  B.4: Writing a One to a Register with Noise, Level 2 M O S F E T model 2 is used in this simulation to investigate the reliability of model Observe  that  the  response  is  very  close  to,  although  Appendix B.3. See also pp. 41-44, 92  CIRCUIT: REG—DATE: MM HAH 4 12: 22: 44 1985 ( M i r rtg)  not  exactly  the  same  1. as.  B.5:  Register Sourcing a Zero  See also pp. 41-44, 92  CIRCUIT: REE  DATE: HON HAH  4.81-  4 12: 22: 44 19B9 If I la: rag]  \  4.01-  '  3.80-  '  2.B9-  '  \ \ 1 I  \  2.46-  \  \  1.B7-  /  ;  \  1.47-  \  \  0.96-  /  0.48-  |  \  \  \  J "*'°oiOfi  2.Bn  i 8.On  7.8n  •ROUP i: V(ReadBua0) V(RaadO)  lO.'on V(lntl)  K.'fln  18,'on  17.8n  20 .'on  22 .'sn  SS.'on TINE  110  B.6:  Register Sourcing a One  See also pp. 41-44, 92  0»TE  CTAOJIT: B.FG  WE FEB 19 17.29  46 199"; ( f i l e .  '•eg)  I  I  1  1 4.56-  /  1  1  1  4.05-  |  I I I I  11 1  3 543.04-  1  2.53-  '  2.02-  1  1.52-  '  \  1  1.01-  1  i i  0.50-  1 -0.00 | O.On  i | i 2 . 5 n 5.On 7 . 5 n  GROUP 1: i/tintl)  V (rO.)  V (rbuuO)  |  10. On  | 12.5n  15. On  i i 17 .'sn  20 .'on  22 S n  25.'On TIME  B.7: Leading Zero Checker See also pp. 47, 95  <- D_r c  I_1_X ''  CIRCUIT: 17.0 DATE: SUN HAH 17 81: 50: IB 19B5 If lie: Uii  GROUP i:  V(liout)  v(detain) vpij)  113  8.8:  One Par-Add  One  of the Par-Add  to  estimate the  Logic Block  speed  logic blocks, the circuit of which is shown below, is of a  Par-Add  adder.  Proper loads, corresponding  simulated  to the next  stages, have been added in the simulation. See also pp. 51-54, 98  CIRCUIT: PAHAPPBSIH—DATE: HOW HAH 4 Zi: 34: *9 1888 I M l r PwAddBtirt  60.On TINE  114  B.9: Carry Propagation in One Static Regenerative Manchester Carry Adder Cell See also pp. 48-50, 98  115  B.10: 2 - B i t by 2-Level A  Stack  '0' is pushed in bit 1, then a T  is pushed in bit 1. The data are then popped  back out. Bit 0 is not used. See also pp. 62, 96, 97  Ifl la: alkzx2)  puahinO-j puanini-|  I  popouto-  |  ||  popouti-  |  u  1  •H L J  L_J  LJ  trr-|  1  II  L_J  «H  1  • H  L_J  h  nOn  II  1  sin  I02n  lS3n  soltn  l _ J  1  l _ l  II  2skn  306n  387n  40Bn  4SBn  8 ion tlM  116 B.ll:  Clock Generator  The clock phases that are generated can be compared with those given in Figure 5.2, except that here the complements of the phases are shown. The glitches will probably not cause any problems because of the large capacitances that the clock has to drive, although improvements can probably be made to remove them. See also pp. 55- 60, 101  If lit: clkSJ  irLnrirLTLrir  pnlOtt-  u u u LT •ii ni ni ru n i -LU HI LU UJ UL  pnilb-  pM3b-  •r~LU u  phi 46-  —IJIJUUIJIJULJL. clkln  o.dou  o.sUi  i.oWi  i.aau  a. leu  zJou  s.ieu  a.7au  4.iau  4.aeu  s.Jou tlM  117  B.12: PLA for Stack Control Notice that the outputs occasionally take on a wrong value before finally settling down to the correct values. This causes no problems in a static PLA, but may be disastrous if a dynamic design is used. See also pp. 60, 61, 100  (file: plsbSl  ^oHjirLRjuirin^ ma-  i  ma-]  1  r  in4-|_  1  J  ins-]  inrrnju  ™-UULJLJ—innr~LiLjULj ou«-iniJliJlJ_riJU  LTULJIUILJ^  UUJl  ^lnnni  innnj  mm  innru  outa-u—u—u—u—LT On  fin  142  213n  284n  SSSn  uuuu  u—u—u—u 426n  4flVn  seen  eatn  7ibn  tlM  118  APPENDIX C: INSTRUCTION SET  have  The  complete  set  been  grouped  together  instruction, number  of  APRjr  a  of  three-letter  for  according  to  mnemonic is  machine cycles  uses  instructions  required  non-interlocked pipelines,  APRjr the  given,  to  new  functions  each  previous  in  a  series  multiplication result  of  Multiply/Add  can be  by  the  instruction  perform.  is  be  also  started  The  full  name  of  the  For  after  an at  each  op-code.  The  given.  Since  whether  the  up to the user, or  instruction  least three  that  sequence. uses  a  cycles, and an  instruction that uses a previous addition result can be started only after cycles. See Appendix D for some detailed  instructions  executed in the proper  instructions,  started only  The  32-bit  It is therefore  the compiler, to make sure that the instructions are example,  below.  they  instruction can  previous instruction has finished executing or not  For  listed  followed  execute  a  is  at least two  examples.  instruction,  definitions  of  the  various  fields  in  the  op-code, and the action performed by the instruction is given in the remarks section. In the opcode descriptions, the following convention is used:  1, 0:  Binary digits which must be entered as is;  ?-F0-?:  A sub-field, containing a certain number of bits, called fd;  x-5-x:  A field of don't care bits, in this particular case 5 bits long;  119 MULTIPLY /ADD  GROUP  Mnemonic: M A D Opcode: 10 ?-F5-? ?-F4-? ?-F3-? ?-F2-? ? - F l - ? ?-F0-? No. of cycles: 4 for multiply portion, 3 for add portion Remarks:  Simultaneous Multiply/Add; FO: 5-bit field, sink for multiplier; Fl:  5-bit field, source 0 for multiplier;  F2: 5-bit field, source 1 for multiplier; F3: 5-bit field, sink for adder; F4: 5-bit field, source 0 for adder; F5: 5-bit field, source 1 for adder; The contents of the registers specified by Fl and F2 are multiplied and the result will be stored in the register specified by FO. At the same time, the contents of the registers specified by FA and F5 are added together and the result will be stored in the register specified by F3\  Mnemomic: MSB Opcode: 11 ?-F5-? ?-F4-? 7-F3-?  ?-F2-? ? - F l - ?  7-F0-?  No. of cycles: 4 for multiply portion, 3 for add portion  Remarks:  Simultaneous Multiply and Subtact; Action similar to M A D , except that the contents of the register specified by F4 are subtracted from the contents of the register specified by F5\  120  ETC  GROUP  Program Counter Control Sub-Group  Mnemonic: JMP Opcode: 000001 ?-F0-?  x-10-x  No. of cycles: 1 Remarks:  Unconditional Jump; F0: 16-bit field, new program address, where bit 10 is the LSB; The contents of the register specified by FO is forced into the program counter;  Mnemonic: JOF Opcode: 000010 7-F0-?  x-10-x  No. of cycles: 1 Remarks:  Jump on Flag Set; F0:  16-bit field, new program address, where bit 10 is the LSB;  PC is set to new value if the main flag (set or cleared by TSF instruction) is set  Mnemonic:  Otherwise, PC is incremented as usual;  JSR  Opcode: 000011 7-F0-?  x-10-x  No. of cycles: 1 Remarks:  Jump to Subroutine; F0: 16-bit field, jump address; The present PC is pushed onto the hardware stack (2 levels deep), and the PC is then set to the new value;  121  Mnemonic: R T N Opcode: 000100 x-26-x No. of cycles: 1 Remarks:  Return from Subroutine; The contents of the top of the hardware stack is tranferred to the PC. The original PC will be lost;  I/O Sub-Group  Mnemonic: L C D Opcode: 000101 7-F1-? 7-F0-?  x-5-x  No. of cycles: 1 Remarks:  Load Constant Data; F0: 5-bit field, data sink; Fl:  16-bit field, data word;  The data word given in Fl is stored in the register specified by FO;  Mnemonic: W W D Opcode: 000110 7-F1-? 7-FO-7 x-5-x No. of cycles: depends on external memory speed, minimum  =  1  122  Remarks:  Write Word Direct; FO: 5-bit field, data source; Fl:  16-bit field, data address;  The contents of the register specified by FO is written to the data memory location specified by Fl; When 2 or more memory write instructions are executed in sequence, care must be taken such that the external memory has enough time to respond. Other instructions, such as a M A D or JMP instruction, can be executed immediately after a W W D ;  Mnemonic: WWI Opcode: 000111 x - l l - x 7-F1-7 7-F0-?  x-5-x  No. of cycles: depends on external memory speed, minimum Remarks:  =  1  Write Word Indirect; FO: 5-bit field, data address source; Fl:  5-bit field, data source;  The contents of the register specified by Fl is written to the memory location specified by the contents of the register specified by FO; See remarks in W W D ;  Mnemonic: R D W Opcode: 001000 x-21-x 7-F0-? No. of cycles: 1  123  Remarks:  Read Data Word; FO: 5-bit field, sink for data; All 16 bits of the data bus is copied to the register  specified  by FO;  Mnemonic: R U B Opcode: 001001 x-26-x No. of cycles: 1 Remarks:  Read Upper Data Byte; The upper 8 bits of the data bus is latched on-chip. Note that the data is not written to the register file;  Mnemonic: R L B Opcode: 001010 x-21-x ?-F0-? No. of cycles: 1 Remarks:  Read Lower Data Byte; F0: 5-bit field, data sink; The lower 8 bits of the data bus is latched on-chip. The contents of the whole 16-bit latch is then written to the register specified by F0;  Mnemonic: W U B Opcode: 001011 7-F1-? 7-F0-? No. of cycles: 1  x-5-x  124  Remarks:  Write Upper Data Byte; FO: 5-bit field, data source; Fl:  16-bit field, data address;  The upper byte of the register specified by FO is written to the memory location specified by Fl;  Mnemonic: W L B Opcode: 001100 7-F1-? 7-F0-?  x-5-x  No. of cycles: 1 Remarks:  Write Lower Data Byte; FO: 5—bit field, data source; Fl:  16-bit field, data address;  The lower byte of the contents of the register specified by FO is written to the memory location specified by Fl;  Mnemonic: S A D Opcode: 001101 7-F0-? x-10-x No. of cycles: 1 Remarks:  Send Memory Address, Direct; F0: 16-bit field, memory address; The address specified by FO is driven onto the data address bus. Used to prepare for a direct read from the data memory;  Mnemonic: SAI Opcode: 001110 x-16-x 7-F0-? No. of cycles: 1  x-5-x  125  Remarks:  Send Data Address, Indirect; FO: 5-bit field, address source; The contents of the register specified by FO is driven onto the data address bus. Used to prepare for an indirect read from the data memory;  Miscellaneous Sub-Group  Mnemonic: L D F Opcode: 001111 x-9-x  7-F0-? x-10-x  No. of cycles: 1 Remarks:  Load Flags; FO: 7-bit field, new flags; The flags in the condition code register (CCR) are set or cleared according to the new flags supplied. The flags are:  OutO Outl  Mnemonic: TSF Opcode: 010010 x-9-x No. of cycles: 1  7-F0-? x-10-x  InO Inl V N Z  126  Remarks:  Test Flags; FO: 7-bit field, test template; The data in the template is ANDed with the C C R , and the  main flag is set or cleared according to the result Only JOF uses the main flag; The V(overflow), N(negative), and Z(zero) reflects only the condition of the results of the adder section. There is no way to monitor the results of the multiplier section, they are hard limited when an overflow occurs. The InO and Inl flags are external conditions which can be tested. See section 9.3 for the meaning of these 2 flags.  Mnemonic: SSH Opcode: 01000 7-F0-? 0  x-9-x  No. of cycles: 1 Remarks:  Set Shift Amount; FO: 17-bit field, shift amount only 1 bit should be non-zero in these 17 bits; The shift amount is written to the barrel shifter controller. All subsequent results from the multiplier will then be shifted by the specified amount; For example, to shift all results from the multiplier up by 2 bits, the instruction is: 01000 00100000000000000 xxxxxxxxx  Mnemonic: SLZ Opcode: 01000 7-F0-? 1 x-9-x  127  No. of cycles: 1 Remarks:  Set Required Number of Leading Zeros; FO: 17-bit field, required number of leading zeros. See examples below; The required number of leading zeros is written to the barrel shifter controller. All subsequent results from the multiplier which have less than this number of leading zeros will be flagged as overflows; For example, to require all results from the multilier to have at least 2 leading zeros, the instruction is: 01000 00111111111111111 xxxxxxxxx and for 5 leading zeros the instruction is: 01000 00000111111111111 xxxxxxxxx  Mnemonic: NOP Opcode: 000000 x-26-x No. of cycles: 1 Remarks:  No Operation; No action is performed. The flags in the C C R are not affected;  128  APPENDIX D: S C H E D U L I N G T H E ADDER A N D T H E M U L T I P L I E R SECTIONS Since the adder and multiplier sections can run in parallel, the speed of a program depends a lot on the sequence of the instructions. A program which  keeps  the adder and the multiplier running most of the time will run a lot faster than a program which  does not To illustrate this, consider programming the following simple  equation: y =  a-b-c - (b-d + a + b-c)  A possible schedule  for the adder and multiplier is shown below. For the sake of  simplicity, the delays required because of the various pipelines is ignored. In other words, it is pretended that there are no pipeline delays. Adder  Multiplier  b-d b-c a-(b-c) a-f(b-c) (b-d+a+b-c)-d (a-b-c)-(b-a + a + b-)-d  And  the corresponding APRjr program is, again ignoring pipeline delays:  SAD &A  -.send address of variable A out, ;to prepare for reading A. ;&A stands for address of A, ;as in the C programming language.  RDW  Rl  SAD  &B  RDW  R2  ;read in value of A, and place it in register 1 ;prepare to read B ;put B in register 2  SAD & C R D W R3  ;put C in r3  SAD & D R D W R4  ;put  MAD  ;r5=b*d, adder idle  D in r4  R5,R2,R4:R0,R0,R0 ;r6 = b*c, adder idle  MAD R6.R2.R3: RO.RO.RO MAD  ;r7 = a*(b*c), r8 = (a) + (b*c)  R7,R6.R1:R8,R1,R6 -.multiplier idle, r9 = (b*d) + (a +b*c)  MAD R0,R0,R0:R9,R5,R8  ;r5=(b*d+a + b*c)*d, adder idle  MAD R5,R9,R4:R0,R0,R0  ;r6 = (a + b*c>- (b*d + a + b*c)*d  MSB RO,R0,RO:R6,R7,R5  ; store result in data memory  W W D & Y . R6  The schedule can be rearranged to reduce the number of instructions by 1:  Multiplier  Adder  b-c b-d  a-f(b-c)  a-(b-c)  (b-d)+(a+b-c))  (b-d+a+b-c)-d (a-b-c-((b-d+a+b-c)-d)  130  It can be seen that even in this simple example, the execution speed of the program depends  on the proper  scheduling of the multiplier and the adder.  The problem of  scheduling multiple processing elements efficiently has been explored quite thoroughly in the micro-programming area. For the present application, the branch and bound (BAB) algorithm  with  list  scheduling  scheduling algorithms and [28]  In  heuristic  is  used.  See  [19]  for  more  information on  for why BAB is chosen amoung other algorithms.  the BAB algorithm with list scheduling heuristic, each operation is assigned  a weight, equal to the number of operations that is dependent on the operation. Then the  schedule  weight first  is  determined ' by  If two operations  simply  picking  the  operation  that  has  the  heaviest  that can be scheduled have the same weight  then an  arbitrary choice is taken. The algorithm is:  Compute weight of each operation; From the list of operations that can be scheduled, schedule in descending order of weights; Repeat until all operations scheduled. During  scheduling, no  attention  is  paid  sequence of execution is established.  to  the  registers  that  is  freed  whenever  all  needed.  Only  the  After scheduling is complete, registers are allocated  to store the intermediate results of each operation. The register operation  are  the  sons  of  the  operation  that is assigned to an have  been  assigned  a  register.  A simplified (pipeline delays ignored) schedule generated for the computation of p  of the Excalibur Arm is given below. The program listing of the BAB algorithm is  also included for  completeness.  132  The R e s u l t i n g Schedule multiplier operations 3 5 7 6 4 13 8 14 12 2 15 20 33 23 22 16 28 27 24 21 0  register assigned 1 2 3 1 2 5 1 2 4 1 4 8 4 5 4 5 1 2 1 4 0  adder operations 0 0 10 0 0 11 0 0 0 17 0 18 32 9 26 25 19 31 0 29 30  register assigned 0 0 4 0 0 6 0 0 0 7 0 9 10 8 9 3 4 3 0 5 1  • operation no.=0 =*>unit idle • r e g i s t e r s are reused as soon as possible. • the sine's and cosines of the angles are assumed to have been read into the r e g i s t e r s already. • pipeline delays are ignored.  133  1 2  2.2 3 4 5  6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57  program B A B H e u M s t i c ; { { { B r a n c h a n d Bound w i t h { {  Heuristic  const MaxOp = 50; MaxReg = 50; O c c u p i e d = 1; U n O c c u p i e d = O; Nu11 = 0; Mul = 1; A d d = 2; I n f i n i t y = 10000; type S e t O f P a r e n t s = a r r a y [1..10] o f i n t e g e r ; SetOfSons = a r r a y [1..10] o f i n t e g e r ; SetOfDes = a r r a y [1..100] o f Integer; OpRec = r e c o r d Reg: i n t e g e r ; Op: i n t e g e r ; Wt: i n t e g e r : NumParents: i n t e g e r ; RemParents: i n t e g e r ; Parents: SetOfParents; NumSons: i n t e g e r ; Sons: SetOfSons; end; P t r N o d e = @OneNode; OneNode = r e c o r d OpNum: i n t e g e r ; Next: PtrNode; end; PtrTimeSlot = PTimeSlot; TimeSlot = record OpNum : i n t e g e r ; Prev: PtrTimeSlot; end; var S h o r t e s t S c h e d , SchedLen: i n t e g e r ; R d y N u l l , R d y M u l , RdyAdd: P t r N o d e ; NextRdyNull: PtrNode; BestSched: PtrTimeSlot; LaStSlot: PtrTimeSlot; Op: i n t e g e r ; O p S e t : a r r a y [ 1 . . M a x 0 p ] o f OpRec; R e g S e t : a r r a y [1..MaxReg] o f i n t e g e r ;  f u n c t i o n FindFreeReg: var i : i nteger; begin 1 := 1;  integer;  134  58 59 GO 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115  while RegSetfi] = Occupied i : = i + 1 ; R e g S e t f i ] := O c c u p i e d ; F 1 n d F r e e R e g := i ; e n d ; {func}  do  procedure A11ocateReg(PrevTimeS1ots: PtrTimeSlot); var C u r P a r e n t , i , NumSons: i n t e g e r ; GreatParent: integer; Abort: boolean; CurOp: P t r T i m e S l o t ; begin C u r O p := P r e v T i m e S 1 o t s ; i f C u r O p @ . P r e v <> n i l t h e n begin A1 1 o c a t e R e g ( C u r O p @ . P r e v ) ; end; { i f } i f (Cur6p<?>. OpNum <> 0 ) t h e n begin OpSet[CurOp@>.OpNum].Reg := F i n d F r e e R e g ; f o r i:=1 t o OpSet[CurOp@>.OpNum].NumParents d o begin ' C u r P a r e n t := O p S e t [ C u r O p @ . O p N u m ] . P a r e n t s [ i ] ; NumSons := O p S e t [ C u r P a r e n t ] . N u m S o n s - 1; O p S e t [ C u r P a r e n t ] . N u m S o n s := NumSons; i f NumSons = O t h e n begin i f OpSet[CurParent].Op = N u l l then begin G r e a t P a r e n t := C u r P a r e n t ; A b o r t := f a l s e ; w h i l e ( O p S e t [ G r e a t P a r e n t ] . O p = N u l l ) and not Abort begin i f OpSetfGreatParent].NumParents <> 0 t h e n G r e a t P a r e n t := O p S e t [ G r e a t P a r e n t ] . P a r e n t s ! 1] else A b o r t := t r u e ; end; i f not Abort then begin NumSons := O p S e t [ G r e a t P a r e n t ] . N u m S o n s - 1; O p S e t [ G r e a t P a r e n t ] . N u m S o n s := NumSons; i f NumSons = 0 t h e n RegSet[OpSet[GreatParent].Reg] := U n o c c u p i e d ; end; end else RegSet[OpSet[CurParent].Reg] := U n o c c u p i e d ; end; end; end; write(CurOp@.OpNum); 1f CurOp@>. OpNum = O t h e n write(O) else write(OpSet[CurOp@.OpNum].Reg);  (  do  v  135  116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173  i f Op = A d d t h e n begin wr1teln; Op := Mul ; end else Op := A d d ; e n d ; {proc}  procedure PrintSched; var 1: i n t e g e r ; begin f o r i:=1 t o MaxReg do R e g S e t f i ] := U n o c c u p i e d ; Op := M u l ; AllocateReg(LastSlot): e n d ; {proc}  procedure ConstructGraph; var i , OpNum, NumSons, 5 o n , N u m P a r e n t s : begin f o r i;=1 t o MaxOp d o O p S e t [ i ] . N u m P a r e n t s := 0;  integer;  read(OpNum); w h i l e OpNum <> 0 d o begin read(0pSet[0pNum].Op); NumSons : = 0; repeat read(Son); NumSons := NumSons + 1; 0 p S e t [ 0 p N u m ] . S o n s [ N u m S o n s ] := S o n : i f S o n <> 0 t h e n begin N u m P a r e n t s := O p S e t [ S o n ] . N u m P a r e n t s + 1; OpSettSon].NumParents := N u m P a r e n t s ; OpSet[Son].RemParents := N u m P a r e n t s ; 0 p S e t [ S o n ] . P a r e n t s ! N u m P a r e n t s ] := OpNum; end; { i f } u n t i1 S o n = 0; 0 p S e t [ 0 p N u m ] .NumSons := NumSons - -1; read(OpNum); end; {while} e n d ; {proc}  { T h i s procedure updates the ready-null, ready-add l i s t s . } procedure UpdateLists(0pNum: integer); var i . NumSib, C u r O p , C u r W t : i n t e g e r ; S i b l i n g s : SetOfSons;  ready-mul, and  136  174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 213.3 213.6 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229  Node. R e f N o d e l , R e f N o d e 2 : P t r N o d e ; begin NumSib:= OpSet[OpNum].NumSons; S i b l i n g s := O p S e t [ O p N u m ] . S o n s : f o r 1:=1 t o NumSib do begin C u r O p := S i b l i n g s [ i ] ; O p S e t [ C u r O p ] . R e m P a r e n t s := O p S e t [ C u r O p ] . R e m P a r e n t s - 1; i f OpSet[CurOp].RemParents = 0 then begin new(Node); Node@.OpNum : = C u r O p ; case OpSet[CurOp].Op of Null : begin Node<».Next := N e x t R d y N u l 1 ; N e x t R d y N u l 1 := Node; end; Mul : begin i f RdyMul = n i l t h e n begin RdyMul := Node; Node@>.Next := n i l ; end else _ ' begin CurWt := O p S e t [ C u r O p ] . W t ; i f CurWt > O p S e t [ RdyMul «>. Opnum] . Wt t h e n begin Nodes?. N e x t = R d y M u l ; RdyMul := Node: end else beg i n R e f N o d e l := R d y M u l ; > R e f N o d e 2 := R e f N o d e 1 @ . N e x t ; w h i l e ( ( R e f N o d e 2 <> n i l ) a n d (OpSet[RefNode2@>.OpNum].Wt begin R e f N o d e l := R e f N o d e 2 ; R e f N o d e 2 := R e f Node2@>. N e x t ; end; (while) R e f Node1<» . N e x t := Node; Node@.Next := R e f N o d e 2 ; end; { i f } end; { i f } end; {clause} Add : begin i f RdyAdd = n i l t h e n begin RdyAdd := Node; Node@.Next := n i 1 ; end else begin CurWt := O p S e t [ C u r O p ] . W t ; i f CurWt > O p S e t [ R d y A d d © . O p n u m ] . W t t h e n begin N o d e e . N e x t :* RdyAdd; R d y A d d := Node; :  > C u r W t ) ) do  137  230 231 232 233 234 235 236 237 238 239 239.3 239.6 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285  end else begin R e f N o d e l :«= RdyAdd; R e f N o d e 2 := Ref Node 1©. N e x t ; w h i l e ( ( R e f N o d e 2 <> n i l ) a n d ( O p S e t [ R e f N o d e 2 © . O p N u m ] . W t begin R e f N o d e l := R e f N o d e 2 ; R e f N o d e 2 := Ref N o d e 2 © . N e x t ; end; {while) R e f N o d e l © . N e x t := Node; N o d e © . N e x t := R e f N o d e 2 ; end: { i f } end; { i f } end; {clause} e n d ; {case} end; { i f } end; {for} e n d ; {proc}  procedure Schedule; var NewSlot: P t r T i m e S l o t ; M u l O p , AddOp: P t r N o d e ; begin N e x t R d y N u l 1 := n i l : new(NewSlot); N e w S l o t © . P r e v := L a s t S l o t ; L a s t S l o t := N e w S l o t ; M u l O p := R d y M u l ; i f RdyMul = n i l t h e n NewS 1 o t © . OpNum : = 0 else begin RdyMul :« R d y M u l © . N e x t ; N e w S l o t © . O p N u m := M u l O p © . O p N u m ; end; { i f } new(NewSlot); N e w S l o t © . P r e v := L a s t S l o t ; L a s t S l o t := N e w S l o t ; AddOp := RdyAdd; i f RdyAdd = n i l t h e n N e w S l o t © . O p N u m := 0 else begin R d y A d d := R d y A d d © . N e x t ; N e w S l o t © . O p N u m := A d d O p © . O p N u m ; end; { i f ) I f MulOP <> n i l t h e n UpdateL1sts(MulOp©.OpNum); i f AddOp <> n i l t h e n UpdateL1sts(AddOp©.OpNum); w h i l e R d y N u l 1 <> n i l d o begin UpdateLists(RdyNul1@.OpNum); RdyNul1 := R d y N u l 1 © . N e x t ; end; {while}  *  > C u r W t ) ) do  138  286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 336.2 337 338 339 340 341 342  RdyNull := NextRdyNul1; end; {proc} procedure Union(var ParentDes: SetOfDes; v a r NuntPDes: Integer; SonDes: SetOfDes; NumSDes; I n t e g e r ) ; var 1. j , k: Integer;, found: boolean; begin k := NumPDes; f o r 1:=1 t o NumSDes do begin found := f a l s e ; J := 1 ; w h i l e ( ( j <= NumPDes) and (not found)) do beg 1n i f SonDes[1] = P a r e n t D e s [ j ] then found := t r u e else J := j + 1; end; {while} i f not found then beg i n k := k + 1; ParentDesfk] := S o n D e s [ i ] ; end; { I f ) end; {for} NumPDes := k; end; {proc} procedure MergeDesfvar Descendents: SetOfDes; v a r NumDes: i n t e g e r ; OpNum: i n t e g e r ) ; forward; procedure A s s i g n W e i g h t s ( v a r Descendents: SetOfDes; v a r NumDes: i n t e g e r ; OpNum: I n t e g e r ) ; var NumSons, MyNumDes, i : Integer; Sons: SetOfSons; MyDes: SetOfDes; begin NumSons := OpSet[OpNum].NumSons; Sons := OpSet[OpNum].Sons; MyNumDes := 1: MyDes[1] := OpNum; f o r 1:=1 t o NumSons do AssignWeights(MyDes, MyNumDes. S o n s [ l ] ) ; OpSet[OpNum].Wt := MyNumDes; Union(Descendents, NumDes, MyDes, MyNumDes); end; {proc}  139  343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392  p r o c e d u r e ComputeWeights; var DummyA: S e t O f D e s ; DummyC; I n t e g e r ; begin DummyC := 0; A s s i g n W e i g h t s ( D u m m y A , DummyC, 1 ) ; end;  p r o c e d u r e MergeDes; var i : i nteger; found: boolean; begin i f OpSet[OpNum].NumSons = 0 t h e n begin i := 1; f o u n d := f a l s e ; w h i l e ( ( i <= NumDes) a n d ( N o t f o u n d ) ) do begin i f D e s c e n d e n t s [ i ] = OpNum t h e n f o u n d := t r u e else 1 := i + 1; end; (while) i fnot found then begin < D e s c e n d e n t s f i ] := OpNum; NumDes : = 1 ; end; end else A s s i g n W e i g h t s ( D e s c e n d e n t s , NumDes, OpNum); end; (proc)  begin ConstructGraph; ComputeWeights; R d y N u l 1 := n i 1 ; RdyMul := n i l ; R d y A d d := n i l ; L a s t S l o t := n i l ; UpdateLists(1); w h i l e ( ( R d y N u l 1 <> n i l ) o r (RdyMul Schedule; PrintSched; end.  <> n i l ) o r ( R d y A d d <> n i l ) ) do  


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items