Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Cyclical processor and computer architectures for highly parallel applications Wong, Fut-Suan 1984

You don't seem to have a PDF reader installed, try download the pdf

Item Metadata

Download

Media
[if-you-see-this-DO-NOT-CLICK]
UBC_1984_A1 W66.pdf [ 7.17MB ]
Metadata
JSON: 1.0096640.json
JSON-LD: 1.0096640+ld.json
RDF/XML (Pretty): 1.0096640.xml
RDF/JSON: 1.0096640+rdf.json
Turtle: 1.0096640+rdf-turtle.txt
N-Triples: 1.0096640+rdf-ntriples.txt
Original Record: 1.0096640 +original-record.json
Full Text
1.0096640.txt
Citation
1.0096640.ris

Full Text

Cyclical  Processor Highly  and Computer A r c h i t e c t u r e s for Parallel Applications by  F u t - S u a n Wong B.Eng.(Hons.), U n i v e r s i t y of Singapore,1979 M.A.Sc., The S t a t e U n i v e r s i t y o f New Y o r k , 1980  / A/THESIS SUBMITTED IN PARTIAL FULFILLMENT OF ' THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY In THE  FACULTY OF GRADUATE  Department of E l e c t r i c a l  STUDIES Engineering  We a c c e p t t h i s t h e s i s a s c o n f o r m i n g to the required standard  THE  UNIVERSITY OF BRITISH COLUMBIA JANUARY,  ©  1984  1984, F . S. Wong  In p r e s e n t i n g requirements  this thesis f o r an  of  British  it  freely available  agree that for  Library  shall  for reference  and  study.  I  for extensive  that  h i s or  her  copying or  f i n a n c i a l gain  be  shall  publication  not  be  of  Columbia  make  further this  thesis  head o f  this  my  It is thesis  a l l o w e d w i t h o u t my  of  The U n i v e r s i t y o f B r i t i s h 1956 Main Mall Vancouver, Canada V6T 1Y3  the  representatives.  permission.  Department  copying of  g r a n t e d by  the  University  the  s c h o l a r l y p u r p o s e s may  understood  the  I agree that  permission by  f u l f i l m e n t of  advanced degree a t  Columbia,  department or for  in partial  written  i i  Abstract  During computing  among  several  novel  performance  have  this  respect  sorting,  their  usefulness hardware  are  of  a  the  first  (RSS),  derived.  are  This  and  of  the  repetitive  interconnections  required.  comparators  to  sorting  time  hardware  termination  the  sorter,  soon  as  the  is  so input  sort N  found  to  the  list  the  The  i s i n the  ones  --  illustrated  including and  in with  parallel  the  design  computers.  the  structure  correctness  and  systolic of  amenable  the  to  is  VLSI  control structure  number  bounded  sorting  the  arranged  near-neighbour  i s incorporated  that  are  highly  and  be  and  recirculating  simple  items  high-  o p e r a t i o n a l c o n s t r a i n t s are  sorter is  because  needed  ideas  The  general  of  a r c h i t e c t u r e s --  studies,  presented.  thesis,  designs  cyclically  single-chip,  endless  this  presented,  examples  our  powerful  several  f o r the  cyclical  These  of  implementations regular,  of  for  In  next-generation  a  parallel  the  communications  part  i s proved,  of  are  resources  of  of  search  ideas  application  class  the  community.  examined.  algorithms  algorithms  one  machines  packet-switched  In  sorter  been  scientific  specific  methodology  control  has  decades,  architectural  and  which  of  the  few  computing  practicality  use  last  machines  pursuits  the  the  type  of  and of  quadruple  N/4,  and  by  (logN)**2  the and  average N.  A  into  the  control unit  of  process  can  be  as  desired  order.  terminated  In  the  structured intended  for  to  L  part  switching  consisting With  second  --  two-by-two  to  value  N  free  of  prevail  in  simulation delay  cyclical,  results  show t h a t  close  to  relatively  low  switch  In  the  third  methodology Our  for  proposed  primarily resources  "pure"  is  EDC  has  resource higher  aimed  their the  switch  systems devices.  connect using  counts  extended  up  only  and  the  of  deadlocks  and  which  networks.  throughput  other  the  incrementally,  designs  our  studies,  a  next-generation  computers  is  as  at and  of  the  Event-Driven system  a  circular  which  Our  rate  and  despite  its  array  a  (EDC)  has  computing  its and Such a  processing  of  to  is  i t  is  combined both  computations  Compared simpler  design  described.  advantages  control-driven  of  new  Computer  pipeline,  e x t r a c t i n g the  shortcomings.  merits  range.  is  a d v a n t a g e o u s when  i t s average  part  utilization,  speed  of  control-driven activities.  data-driven  alleviating an  with  It  count.  data-driven  arranged  supplemented approach  the  i t can  receivers,  type  of  loop-  large-scale  packet-switched  that  system,  a  be  novel  presented.  two,  i s very  I t can  a  interconnected  of  terms  store-and-forward  other  are  in  network  is large. the  in  i s a power  switches;  wiring, this  is  is  t r a n s m i t t e r s and  amounts of of  (LSSN)  thousands of  where L  N = L ( l o g L ) p a i r s of  N/2  studies,  communications  hundreds  loops  our  network  packet  of  of  other  architecture, capabilities  the  while designs, better and  a  iv  As cyclical packets will  i s shown by o u r architectures  interact  occur  on  systems  because of  network  resources  deadlock  synchronous In  architectures equivalently,  the  the  than  cyclical  circular  (we h a v e , but  requests  however,  will  not  utilization  Systolic of  architectures  are  to the designs  on  Sorter.  the  cyclical  ones  —  can handle  smaller areas  of  presented occur  that of the a c y c l i c  relatively  suitable  the information  Switching  amounts o f i n f o r m a t i o n w i t h t h e r e f o r e more  the  as the L o o p - S t r u c t u r e d  as the R e c i r c u l a t i n g  resource  of  instance,  scheme),  are higher  o n how  deadlocks,  asynchronous,  such  properties  for  by t h e p a c k e t s  systems  general,  other;  such  the  avoidance  the  depend g r e a t l y  with each  Network  a  studies,  --  or  larger they  of very l a r g e - s c a l e  systems.  Key  phrases: computer  systolic  architectures,  arrays, parallel  store-and-forward computat i o n s .  sorting,  deadlocks,  next-generation packet-switched  data-driven  and  computing, networks,  control-driven  V  Table  of  Contents  page  Abstract Table  of  ii Content  List  of  Figures  List  of  Tables  v viii x  Nonemclature  xi  Acknowledgements Chapter  I.  x i i  Introduction  1.1.  Background  1.2.  Cyclical  1.3.  Objectives  Chapter  II.  A  Information  1  Architectures and  Scope  Systolic  of  Processor  4 the  Thesis  7  for Parallel  Sorting  11.1.  Introduction  11.2.  The  Recirculating Systolic  A.  Network D e s c r i p t i o n  12  B.  The  Quadruple Comparator  14  C.  The  Comparison/Exchange/Shift  The  RSS  A.  Algorithm  I  20  B.  Algorithm  II  21  C.  Examples  11.3.  11.4.  11.5.  9 Sorter  (RSS)  Operations  16  Algorithms  Operational  22 Constraints  A.  Constraints  B.  Marking  Scheme A  26  C.  Marking  Scheme B  ...26  Analysis A.  of  Analogy  the with  on  the  RSS the  Size  of  RSS  Algorithms Odd-Even  Transportation  23  vi  Sort B.  I I . 6. Chapter  27  C o r r e c t n e s s o f t h e RSS A l g o r i t h m s Schemes  C.  Correctness  D.  Timing  of the T e r m i n a t i o n  and  Marking 29  Method  Complexities  41  Discussions  43  I I I . A Novel (LSSN)  Loop-Structured  Switching  Network  111 .1 . I n t r o d u c t i o n I I I . 2. Network  48  Topology  A.  Addressing  B.  Routing  111.3. Network  40  Scheme and C o n n e c t i o n  Function....50  Scheme  51  Properties  59  A.  Network C o n f l i c t s  B.  Deadlocks  C.  Network E x t e n s i b i l i t y  and A v o i d a n c e  62 Method  65 .68  111.4. S i m u l a t i o n s and P e r f o r m a n c e A n a l y s i s  70  111.5. D i s c u s s i o n s and O u t l o o k  74  Chapter IV.1.  IV.2.  IV.3.  IV.  D e s i g n and E v a l u a t i o n o f t h e E v e n t - D r i v e n Computer (EDC)  Introduction A.  Background Information  ...81  B.  Recent Developments  83  C.  O v e r v i e w of Our A p p r o a c h  85  The EDC Hardware A r c h i t e c t u r e A.  P r o c e s s i n g Modules  91  B.  Storage  96  C.  Switches  Modules  The EDC I n f o r m a t i o n  98 Structure  vii  IV.4.  IV.5.  I V . 6. Chapter  A.  M a c h i n e Code F o r m a t s  104  B.  Packet  107  C.  Program O r g a n i z a t i o n  D.  Data  E.  Process  T h e EDC  Formats  .108  Structures  111  and R e s o u r c e Management  Programming  Language  A.  Statements  B.  Language C o n s t r u c t s  Performance  and Program  113  Structure  Blocks  114  f o r Array  P r o c e s s i n g . . . . 120  Analysis  A.  F l o w A n a l y s i s o f EDC  123  B.  Example  126  C.  Considerations  Discussions V.  for Generalized  and Outlook  Computations.131 132  Conclusions  V. 1.  Summary  of R e s u l t s  138  V.2.  General  Discussions  139  V.3.  Suggestions  Appendix  A  Appendix  B  Appendix  C  References Vita  for Further  Work  141  vi ii  List 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29.  of  Figures  page  Fig.1.1. The c y c l i c a l c o n f i g u r a t i o n 5 Fig.II.1. The R e c i r c u l a t i n g S y s t o l i c S o r t e r (RSS) 11 Fig.II.2. T h e c o n t r o l u n i t o f RSS 11 Fig.II.3. The S c h e m a t i c d i a g r a m o f a quadruple comparator 15 Fig.II.4. Symbols used f o r c o m p a r i s o n and s h i f t 16 Fig.II.5. T h e f o u r o p e r a t i o n s p e r f o r m e d by t h e quadruple comparators 19 Fig.II.6. An e x a m p l e t o i l l u s t r a t e RSS A l g o r i t h m I and M a r k i n g Scheme A 24 Fig.II.7. An e x a m p l e t o i l l u s t r a t e RSS A l g o r i t h m I I a n d M a r k i n g Scheme B 25 Fig.II.8. The o d d - e v e n s o r t e r 28 Fig.II.9. The t h r e e i n d e x e s : i , j , and J , a n d the i n i t i a l marker p o s i t i o n M(i) 30 Fig.11.10.The h o r i z o n t a l comparisons c a r r i e d out on t h e RSS a r r a y 30 F i g . 1 1 . 1 1 . The number o f c o m p a r i s o n c y c l e s v e r s u s t h e n u m b e r o f i t e m s t o be s o r t e d ( A l g o r i t h m I ) 42 F i g . I I . 1 2 . A general-purpose computer system w i t h s p e c i a l - p u r p o s e c h i p s a t t a c h e d [19] 43 F i g . I I I . l . A s s i g n m e n t o f l o o p a n d l i n k l a b e l s on a L S S N w h i c h h a s 16 l o o p s a n d 32 s w i t c h e s 53 F i g . I I I . 2 . C o n n e c t i o n of t r a n s m i t t i n g and r e c e i v i n g d e v i c e s on a L S S N w i t h 16 l o o p s ...54 F i g . I I I . 3 . The s c h e m a t i c d i a g r a m s o f a T y p e - A switch....55 F i g . I I I . 4 . A 16x16 b a s e l i n e n e t w o r k 78 F i g . I I I . 5 . E f f e c t s o f b u f f e r s i z e on t h e t h r o u g h p u t and d e l a y of a 64x64 LSSN 78 F i g . I I I . 6 . The t h r o u g h p u t r a t e s o f a 6 4 x 6 4 b a s e l i n e , a 64x64 LSSN and a 16x16 b a s e l i n e , v e r s u s t h e intera r r i v a l time 79 F i g . I I I . 7 . The d e l a y c u r v e s o f a 6 4 x 6 4 b a s e l i n e , a 6 4 x 6 4 L S S N a n d a 16x16 b a s e l i n e , v e r s u s t h e intera r r i v a l time , 80 Fig.IV.1. EDC s y s t e m b l o c k d i a g r a m 87 Fig.IV.2. T h e c o n n e c t i o n d i a g r a m o f EDC hardware architecture 90 Fig.IV.3. T h e s c h e m a t i c d i a g r a m o f a PSN s w i t c h 100 Fig.IV.4. Parameter p a s s i n g between the c a l l i n g p r o g r a m M and the c a l l e d program P 109 Fig.IV.5. The i n t e r a c t i o n s b e t w e e n c a l l i n g p r o g r a m s and a t a s k program 110 Fig.IV.6. The p h y s i c a l a n d l o g i c a l a r r a n g e m e n t s o f t h e EDC m e m o r y s y s t e m 111 Fig.IV.7. The i m p l e m e n t a t i o n of a r e s o u r c e manager using a task program 113 Fig.IV.8. A " B e g i n / E n d " b l o c k and i t s d a t a - f l o w graph.116 Fig.IV.9. An " I F " b l o c k a n d i t s d a t a - f l o w g r a p h 116  ix  30. F i g . I V . 1 0 . A "Match" b l o c k and i t s d a t a - f l o w g r a p h 117 31. F i g . I V . 1 1 . A "Loop" b l o c k and i t s d a t a - f l o w g r a p h 118 32. F i g . I V . 1 2 . The s t a t e m e n t , d a t a - f l o w g r a p h a n d m a c h i n e code of a p a r a l l e l v e c t o r o p e r a t i o n 120 33. F i g . I V . 1 3 . The s t a t e m e n t , d a t a - f l o w g r a p h a n d m a c h i n e code of a r e d u c t i o n o p e r a t i o n 121 3 4 . F i g . I V . 1 4 . T h e s t a t e m e n t s o f some a l i g n m e n t o p e r a t i o n s , and t h e d a t a - f l o w graph and machine code format of t h e "SHIFT" o p e r a t i o n 123 3 5 . F i g . I V . 1 5 . T h e M A R P f , M A T R a n d MARPc,max c u r v e s o f t h e given example 129  X  List 1. 2. 3. 4. 5.  Table Table Table Table Table  I I . 1. II.2. 111 .1 . III.2. III.3.  6.  Table  III.4.  7. 8.  Table Table  III.5. III.6.  of  Tables  page  R e q u i r e m e n t s o f t h e RSS m a r k i n g s c h e m e s . . C o m p l e x i t i e s of s o r t i n g n e t w o r k s Scalar operations Compound o p e r a t i o n s The " O p e r a n d / N e x t i n s t r u c t i o n s " f i e l d s of s c a l a r o p e r a t i o n s The " O p e r a n d / N e x t i n s t r u c t i o n s " f i e l d s of compound o p e r a t i o n s The f o r m a t s o f i n s t r u c t i o n p a c k e t s The f o r m a t s o f r e s u l t p a c k e t s  40 47 135 135 136 136 137 137  xi  Nomenclature  )  ADT ARS C CS EDC GCU i,I  : : : : : : :  J,J k,K  : :  IRS  :  L LIT LMs LSSN M MATR MARP MIMD N P PDF PSN R  : : : :  RL RPS Rr RSS SIMD SISD SM SP SUT SW T TPs Tr  : : : : : : : : : : : : :  : : : : : :  Array D e s c r i p t i o n Table Average r o u t i n g steps Number o f c o l u m n s Channel Selector E v e n t - D r i v e n Computer Global Control Unit Index ( s h o r t f o r m o f " I n s t r u c t i o n " when u s e d a s s u b s c r i p t ) I ndex I ndex Instruction Registers Number o f l o o p s u s e d i n LSSN Linkage Information Table L o c a l Memories L o o p - S t r u c t u r e d S w i t c h i n g Network Number o f L o c a l M e m o r i e s Maximum A v e r a g e T h r o u g h p u t R a t e Maximum A c c e p t a n c e R a t e o f P a c k e t s M u l t i p l e - I n s t r u c t i o n and M u l t i p l e - D a t a (computer s y s t e Number o f i n p u t x p Number o f p r o c e s s o r s P i e c e - w i s e Data Flow (computer) Packet-Switched Network Number o f rows o r R e c e i v i n g P r o c e s s o r s ( s h o r t f o r m o f " R e s u l t " when u s e d a s s u b s c r i p t ) Request L i s t Receiving Processors Receiver Recirculating Systolic Sorter S i n g l e - I n s t r u c t i o n and M u l t i p l e - D a t a (computer systems S i n g l e - I n s t r u c t i o n and Single-Data (computer systems) S y s t e m Memory Supervisory Processor Storage U t i l i z a t i o n Table Switch Number o f T r a n s m i t t i n g P r o c e s s o r s Transmitting Processors Transmitter  Acknowledgements  I for  sincerely  his  patient  graduate  program.  Dr. my  Schrack  and  supervisory I  mates,  am  help  research  my  and  Dr.  Vuong  also for  during  like  to  their  Dr. the  R.  course  thank  services  M.  Dr. as  of  Ito my  Chanson,  members  of  committee. also  f o r my  thankful stay  on  financial  assistantships  to this  Science,  and  awards p r o v i d e d  past  and  current  campus a memorable  provided  assistantships provided the  my  support,  teaching  Singapore.  supervisor,  guidance  I would  f o r m a k i n g my As  thank  by by  I  am by  the  grateful my  Lee  one. for  the  supervisor,  the  Department  the  office  of  Computer  Foundation  of  1  Chapter  1.  I.  Introduction  Background The  demand  increasing, engaged  Information  in  realtime and  processes.  While the  drawbacks  the  design  which  are  of  —  of  from  so-called  currently  such  they  large  weather  complex  s y s t e m s can  future  handle  from  certain  to  hardware  their  "fifth-generation" for  artificial  and  obesity  restrict  being planned  as  suffer  software  ever-  community  assessment,  computer  demands,  is  scientific  very  conventional  which s e v e r e l y  the  the  computation  simulations  ranging  inextensibility  computation  battlefield  current  —  speed  among  large-scale  intelligence  of  high  particularly  forecasting,  many  for  usefulness computers  very  in [1]  large-scale  applications. The  first  distinguished tubes,  four  by  g e n e r a t i o n s of  their  constituent  transistors, integrated  large-scale generation  integration concept  sometimes  referred  architecture generations  that  f o r the  structures,  has  a  break  to  technologies  circuits  (VLSI).  the  prevailed  and,  in  the Von  the  commonly —  vacuum  currently,  Central with  as  are  to  the  very fifth-  conventional, Neumann,  first  four  or  computer computer  [2].  Several proposed  is  computers  classes  of  computer a r c h i t e c t u r e s  next-generation  s q u a r e and  cube a r r a y s ,  computers, pipelines,  have  including systolic  been tree  arrays  2  [3],  data-driven  dynamic  structures  architectures basis  will  also  we  --  applications,  cyclical  packet  in this  thesis,  methodology  parallel  As o f t o d a y ,  yet evolved  of r e s e a r c h work  design  [ 4 ] , demand-driven  [55,57,61].  has  In t h i s  of  systems  incorporate  look  cyclical  architectures  and s e v e r a l will  into another  be p r o p o s e d .  c a u s e d by t h e r e c u r r e n t  structures,  systolic  implementations.  The  applications  as  such  multiplications,  their  simple,  systems  are  algorithms the  Fast  of  of  Fourier  have been p r o p o s e d f o r  the  systems.  data-flow pulsations  c o n t r a c t i o n s of the highly  very  designs systolic,  by t h e i r  arteries  concept  new  and c o n t r o l - d r i v e n  in  of  dominant  -- f o r h i g h l y  the fundamental p r i n c i p l e s  systems a r e c h a r a c t e r i z e d  Because  these  interesting  Our  r h y t h m i c d a t a movements a n a l o g o u s t o  hearts.  of  i d e a s b a s e d on t h e  pattern: the  [5] and  area.  communications, d a t a - d r i v e n  Systolic  none  t o become t h e s i n g l e ,  will  architectures  systems  repetitive  amenable many  to  VLSI  specialized  T r a n s f o r m and systolic  matrix  computation  [3].  Packet computer  systems  interconnected also tens  communications  via  which  are  local  are  geographically  networks;  been p r o p o s e d f o r m u l t i p r o c e s s o r to  thousands  of  closely  s t o r a g e m o d u l e s -.- e x a m p l e s instruction  traditionally  meant apart  but r e c e n t l y , systems  interconnected  and  they  have  consisting  of  processing  and  a r e d a t a - d r i v e n computers  executions are triggered  for  by t h e a r r i v a l s  i n which of  input  3  operands which are networks  are  encapsulated  used t o convey  i n t o the  f o r m of  packets,  t h e s e p a c k e t s among t h e  and  hardware  modules. Data-driven attentions  due  c o m p u t e r s have r e c e n t l y  to  their  simplicity  a s y n c h r o n o u s p a r a l l e l i s m ; but take  a d v a n t a g e of  array  computation,  activities driven  do  the  instruction signals  also  conform  some  hand, t h e y that  the  do  of not  exists in  inherently  n a t u r a l l y to  conventional,  executions  generated  to data-driven  by  the  systems,  computation  are  sequential  notion  because  hand,  exploitations  systems  are  more  needed t o  of  data-  in array of  difficult  s p e c i f y the  by  explicitly  operand  control contrast  more a d v a n t a g e o u s  in  handling  simple  control  of  the  computation;  but  in  b r a n c h i n g and  done i m p l i c i t l y are  the  control  merging  which  on  other  control-driven  because e x p l i c i t  packets  by in  parallelism  be  computers,  units;  t h e y make use  paths, which otherwise c o u l d systems  sequenced  they are  which e x i s t  the  control-driven  central processing  structures  are  other  explorations  simple c o n t r o l s t r u c t u r e  and  not  the  i n the  enormous  computation.  In  array  the  on  received  of in  sent  signals  execution data-driven among  the  instructions.  More d e t a i l s o f in  the  following We  these various  systems w i l l  be  provided  chapters.  believe  that  in  order  to  gain  significant  4  improvement  i n the  systems, the sequential various  new  extents.  reasons, any  In  tools  techniques,  used  designs  computation  development  not  computation  e t c . , may  we  will  have t o d e p a r t  in  both  as  not  hardware  words,  be  the  implementation.  Throughout  this  primitive  denote  a  prevalent  software  of  the  the  designs;  term  machine  operations;  "highly  other  hand,  capable  is  those  or  containing  which  can  performed  such  quoted  i n the  rationale  based  on  the  exhibited  in  the  of our  as  of  both  computation  those  examples  chapter.  oriented  programs,  executed,  are  execution  to  of  of c y c l i c a l a r c h i t e c t u r e s  program  c y c l e s of  executions.  As  i n s t r u c t i o n s as w e l l  as  in nearly a l l s c i e n t i f i c  the  basically  spontaneous  advocacy  behaviour  "DO-LOOPS" w h i c h e x i s t  very  a  Cyclical Architectures The  is  of t h i s  to  high-level  amounts  low-level  beginning  only  "next-generation"  large  in parallel,  of  refers  executing  parallel"  these  "processor"  a s y n c h r o n o u s , h i g h and  2.  for  but  s y n c h r o n o u s and be  compiler  aspects  "computer"  of  to  existing  p i e c e of p a s s i v e h a r d w a r e c a p a b l e  fullfledged  are  from the  architectural  dissertation,  o p e r a t i o n s ; on  applications  computer  components,  u s e f u l i n our  o n l y emphasize  existing  and  some  off-the-shelf  immediate  to  over  may  other  such  speed  ways cyclical  in  which  in nature.  envision a class  most  and  business-  programs  It i s  are  therefore  of a r c h i t e c t u r e s which  5  have t h e i r r e s o u r c e s a r r a n g e d  into a c y c l i c a l  configuration  as  follows:  Feedback  Input  path  Computation path ( s t o r a g e , p r o c e s s o r s and s w i t c h e s )  >  >  I  I F i g . 1 . 1 . The c y c l i c a l  configuration.  main c o m p u t a t i o n  path  i n Fig.1.1  and s w i t c h i n g e l e m e n t s ,  and e i t h e r  The processing or  memory words a r e u s e d  The  information  packets the  Output  which  of e i t h e r  for buffering goes  through  data, control  signals  consists  shift-registers  and s t o r a g e the  of b o t h  purposes.  feedback  path are  or both, depending  on  applications. Current  could  research involving  be b r o a d l y c l a s s i f i e d  n a t u r e of the feedback (1)  Fourier  three areas depending  processors  Examples  Transform  are  [ 6 , 7 ] and m a t r i x  t h i s area of a p p l i c a t i o n ,  is  to allow  further  and  computation  also  path.  attached  processors  For  elements  architectures on t h e  signals:  Special-purpose computers:  into  such c y c l i c a l  host  the  Fast  transposition [9].  the purpose  interactions  t o re-use  for  to  among  of  feedback the  data  the resources along the  6  (2)  Interconnection  networks  processor-to-processor single-staged multi-staged For  this  feedback  f o r processor-to-memory or  communications:  shuffle-exchange  are  [9,12]  and  networks  shuffle-exchange  area  Examples  networks  [20,21,22],  of a p p l i c a t i o n s , the s o l e purpose of  i s to re-use  the  resources;  there  is  no  i n t e r a c t i o n s among t h e d a t a . (3)  F u l l f l e d g e d , high-performance computers: With only a few  exceptions  [28,60],  systems a r e based For  this  as  the  cycles; the  cyclical merits In  a  or  better  and new  instruction  the  resource  resources  acyclic  packets  when c e r t a i n  be  Therefore,  i f the  a r e f e d back  of  instruction  a r e brought  result  into  packets  are  path.  implemented  with  then  e i t h e r the  the  relative  o f t h e two c o n f i g u r a t i o n s a r e a s f o l l o w s . cyclical  configuration than  the  repeatedly  size  entire  on a s i n g l e  of system  would g i v e  the a c y c l i c  rise  to  one, b e c a u s e  its  by means o f f e e d b a c k ;  this  i n c u r tremendous s a v i n g s  when  configuration  completion  configuration,  utilization  especially  fabrication  path  could  c o u l d be u s e d  f e a t u r e would  the  data-driven  configuration [4,5].  a t t h e end o f t h e f e e d b a c k  and d e m e r i t s  general,  a l l  of a p p l i c a t i o n s , packets of  system the  on t h e c y l i c a l  result  computation  received  If  area  nearly  the  in  system  system  is  to  w o u l d be a b e t t e r c h o i c e .  i s very  be  integrated-circuit  resources, large.  considered  chip,  On t h e  for  its cyclical other  hand,  a  7  the  control  difficult: a  subset  used  the  cyclical  i n some s y s t e m s , m a s k i n g b i t s  counts  ones.  asynchronously , then  forward  type  resources.  they of  important systems  interrupts  already  congested  be  interrupts  cannot  cyclical  the feedback  signals  configurations are  to  i s their  with  communications the  circular  the  of  of  packet-  l a c k of r e s p o n s i v e n e s s , computation  information packets  be p r o c e s s e d  store-and-  requests  characteristic  occur,  main o b j e c t i v e  architectures  of  path  could  such  that the  to  advocate  immediately.  sorting,  by  computer  be  —  this  the  thesis  basic  discussed,  systems.  Our  applications  which  are  and  the  of  principle  ideas  of  be  parallel  design  current  relative  of a  will  including  c o m m u n i c a t i o n s and t h e  a l l of  demerits  is  design  The a d v a n t a g e s o f o u r d e s i g n s  afore-mentioned presented.  as  specific  packet-switched  interest.  of  high-performance  demonstrated  will  others,  O b j e c t i v e s and Scope o f t h e T h e s i s The  novel  due  to disable in  I f the c y c l i c a l  deadlocks  cyclical  while  would be s u s c e p t i b l e t o  b e c a u s e when  class  [9];  (e.g., as packet-switched  Another  switched,  a r e needed  are required to separate  the incoming  networks)  3.  c o n f i g u r a t i o n i s u s u a l l y more  of the p r o c e s s i n g r e s o u r c e s  feedback from  of  of  a  research  to  others  methods t o r e s o l v e t h e v a r i o u s  cyclical  architectures  will  be  8 In  Chapter  systolic  sorter  parallel  sorting  The  sorting  merits  II, (RSS)  the  will  which  module t o be  algorithms,  of  we  RSS  is  a  designed  as a  attached  design  will  present  of  be  the  to a  communications  environments. analysis  of  resolution outline  driven  The  detailed.  LSSN  will  design a  of  work w i l l  The  General be  given  d i s c u s s e d , and be  presented.  the Event-Driven  organizations  addressed.  will  data-driven  activities.  software  be  properties  system  (LSSN)  relative III  will  intended  multiprocessing and the  performance occurrence  Chapter  IV  and will  which i s  supplemented  with  control-  of  design,  performance  V.  and  (EDC)  d i s c u s s i o n s and  i n Chapter  computer.  Computer  rationale and  single-chip,  Chapter  packet-switched,  topology,  of d e a d l o c k s  the  primarily  in  host  controller,  d e s c r i b e a l o o p - s t r u c t u r e d s w i t c h i n g network for  recirculating  of  suggestions  hardware EDC of  will  and be  further  9  Chapter  II.  A Systolic  Processor  For P a r a l l e l  Sorting  Abbreviat ions: N: Number o f i n p u t  items  C,Column#: Number o f c o l u m n s R,Row#: Number o f rows P,Comparator#: Number o f c o m p a r a t o r s i:  Loop  j:  Moving p o s i t i o n  J:  Fixed position  M(i): t:  Initial  cycle  i n loop i  time  Introduction  computer  and  novel  [9-17]; the  index  A marker  Sorting and  index  marker's p o s i t i o n  Comparison  "*":  1.  index  has  been  engineering  sorting  an i m p o r t a n t  applications  algorithms  i n time  number o f c o m p a r a t o r s u s e d w h i l e  flow,  control  designs, strategies  In t h i s  chapter,  [13].  c o u l d be f o u n d  some o f them a r e o p t i m a l  architectural  operation  w h i c h embodies t h e c o n c e p t s  of both  standard  i n the l i t e r a t u r e  lay  emphasis  on  i n t e r c o n n e c t i o n s , data  and i m p l e m e n t a t i o n  we p r e s e n t  Many  c o m p l e x i t i e s , some i n  others  i . e . , processor  i n business  technologies.  a parallel  sorting  network  the c y c l i c a l a r c h i t e c t u r e s  10  and  the  systolic  characterized from  systems  by t h e i r  data  t h e memories, they  move  within  the  .[3]. flow  and/or  system  fashion the  data  from  along  analogous  predetermined  to  are loaded  paths  element  provided  accepts  i t s neighbours  in a  to the p u l s a t i o n s i n the a r t e r i e s  caused  major  advantage  of  such  systems  lies  the  loading  results;  of  the  input  therefore, there  study  architecture  coupled  the  task  useful  interconnection,  with  amenable  to  recirculating  Section  2,  and  constraints  on RSS  Section  will  5  complexities.  demonstrate  systolic  of s o r t i n g . simple  design  analyse  time.  a  cyclical perform  and a d d r e s s i n g  s t r u c t u r e s , the  i s v e r y compact, a n d h e n c e  sorter  sorting  will  contentions  d a t a movements c a n  implementations.  systolic the  of the f i n a l  bus  that  during  Because of t h e h i g h l y r e g u l a r  control  VLSI  and u n l o a d i n g  d u r i n g the computation  will  r e q u i r e d by t h i s  highly  data  i s no d e l a y due t o  memory a c c e s s c o n f l i c t s  This  and  by  i n the fact  processor-memory communications a r e i n v o l v e d o n l y  the  and  rhythmic  that  area  will  r e c u r r e n t c o n t r a c t i o n s of the h e a r t .  A  and  are  intermediate results  and every  and  systems  p a t t e r n : once d a t a  their  among t h e p r o c e s s i n g e l e m e n t s , distributes  Systolic  be  (RSS)  algorithms, discussed  A d e s c r i p t i o n of will  be  given  i n S e c t i o n 3.  in  Section  4  t h e RSS a l g o r i t h m s a n d t h e i r  The r e l a t i v e  m e r i t s o f RSS  discussed along with other  designs  it is  will  i n Section  be 6.  in The  while timing  compared  DlHSHFuO-IHH {FL>tHHFDHLtDtFD HFlHHHFmtHHFC mm mm \mnm HHHHFC LHhHHHHHHHHHr-* I/O s w i t c h ig.II.1.  Systolic  array  The R e c i r c u l a t i n g S y s t o l i c  Sorter  To/from the s y s t o l i c 0)  To I/O  switch  "O X  c o (0 c x: o  o o (C X o o o w rH w o o *  a  Control Unit Sequencer |Clock|  I  Reset  ICounter Terminate  T  in  U  '|  t  JComparator  -' -f^Tft1 |2*Column#| Fig.II.2.  J  Register  The c o n t r o l u n i t o f RSS.  (RSS)  array  12  2.  The  2.A.  Recirculating Systolic  Sorter  (RSS)  Network D e s c r i p t i o n  A schematic diagram of given  in  Fig.IT.1.  The  the  RSS  The  loops  as  sorts  four  shown.  at  columns  of  the  t o p and  the  RSS  is  array  of  and  C  i n t o R rows  articulated  by  2*R  circular  quadruple comparators holds a comparison  b o t t o m and  array,  these  arranged  is  items d u r i n g  the  of  array  E a c h of  input  situated  portion  whole  sorter  network c o n s i s t s of an  "quadruple" comparators which are columns.  proposed  located  where o n l y  comparators  is  c y c l e , except i n the  e i t h e r the involved  those  odd-numbered  upper or  in  and  the  lower  sorting  process.  During  the  initial  opened a t the  Input/Output  lines;  items  data  serial  manner, w i t h  directions. lines  will  Before  —  the  is  that  marker,  and  of  that  the and  "head" of the  loops has  a l l the  the  s o l e p u r p o s e of  loop.  which  convention  each loop  p o s i t i o n on  to  the  will  loops  to  the  the  loops  shifted  in  loops  will  i s to place  input  the  be  array  input  has  to  in  beginning  associated s i d e of  be  a marker  of m a r k i n g a d o p t e d  right-hand  in a  closed.  i n d i c a t e the  be  are  opposite  been l o a d e d ,  comparator  each l o o p ,  The  connected  network t h r o u g h  network  disconnected  the  and  neighbouring  a certain position within end  the  s o r t i n g commences,  "marked"  and  switch  enter  After be  l o a d i n g phase, a l l the  the  here  with  a  marker  13  will  be  refer  regarded  to the  examples,  Notice  that  are to  After  that  are  used  input  data  are  quadruple  comparators.  exchanged,  then  their  first  cycle  may in  markers. the  sorting  will  be  referred  respectively.  one  of t h e  to the a r r a y . compared If  will  a  of  data  shifted,  shifted  RSS  comparison within  has  the  to  be  i f t h e r e a r e any,  where t h e y a r e .  c y c l e s and  be  a  exchanged  pair  remain  proposed  During  and  between s u c c e s s i v e c o m p a r i s o n the markers w i l l  of  they  a s s o c i a t e d markers,  move w i t h them b u t  reader  i . e . , t h e ways t o p l a c e  the marking procedure, applied  The  to represent  e x a m p l e s , and  Scheme B  be  loop.  in Section 3 for i l l u s t r a t i o n s ;  t o the  f o r the' two  algorithms w i l l  not  of  schemes —  the a r r a y p r i o r  Scheme A and  cycle,  "tail"  asterisks  the marking  different as  the  examples g i v e n  these  m a r k e r s on  as  when  do  However,  the  data  are  along with the data  with  which they a r e a s s o c i a t e d .  A  schematic  presented signals  in  diagram  Fig.II.2.  (i.e.,  of This  the  control  unit  to  comparison;  (2) H o r i z o n t a l - c o m p a r i s o n ;  be  performed  (4) S h i f t - o p e r a t i o n .  cycle, (i.e., taken  the  generates  control  by  At  unit  will  one  the comparators:  the  (3)  end  test  (1)  cycle.  is  control of  the  Vertical-  Diagonal-comparison of  each  comparison  the s t a t u s of the a r r a y  " E x c h a n g e / N o - E x c h a n g e " ) t o see whether any place during that  used  the  "Opcode" i n F i g . I I . 2 ) t o i n d i c a t e  operations  and  unit  I t a l s o has  a  exchange  cycle  has  counter  14  which  keeps  track  Exchange" c y c l e s . is  incremented  whenever the  As of  at least twice  a termination  the input  demonstrated the sorted  uppermost of  is  are  the  former  quadruple  of  number  of  i n t h a t c y c l e ; when columns  (i.e., At  into a linear  by  reset  this  list.  3, the f i r s t asterisks  items  in  i t e m s a r e on t h e r i g h t - h a n d  the side  loops.  Comparator  binary  s o r t e r s used  input/output  slightly  gives a sketch comparator.  is  be g e n e r a t e d .  accompanied  "No-  of t h e c o u n t e r and  q u a d r u p l e c o m p a r a t o r s have a h i g h e r  i s only  Fig.II.3  number  signal will  and t h e l a s t  than the c o n v e n t i o n a l , the  cycle,  i n the examples of S e c t i o n  The Q u a d r u p l e  The  new  one e x c h a n g e  the a s t e r i s k s i n the lowest  2.B.  but  a  i t e m s have been s o r t e d  lists  loops,  number o f c o n s e c u t i v e  words, t h e c o n t e n t  entering  reaches  Count=2*C), stage,  In o t h e r  upon  there  count  of t h e c u r r e n t  more  lines  than  logic  i n other  networks,  per comparator  that  of the input/output  of  the  density  of the  latter.  c o n f i g u r a t i o n of a  15  d a t a in marker in  Ioop { {  upper  lower loop { {  »•»> >  "=>  d a t a out <>= = «= marker out< A  data out > marker out  <=== data i n < marker In  A  v  0)  Ol  c  (0 C  O X 0 o1  c a  01 a  c £ O X  Fig.II.3.  In output  lines,  of  to  one  there are four  will  cycle,  the  two  two  smallest  i n an  numbered largest  column, items  in  then its  it  lines  signal,  will  upper-right  sets  used  lines  unit  one  for  to the  line  f o r the  retain and  is  opcodes.  accomplishes  i s the  odd-numbered c o l u m n ,  if  of  taken p l a c e d u r i n g  of the four data  holds to i t s upper-right neighbour;  two  loops connected  exchange has and  comparator.  i n p u t and  single-bit  what a c o m p a r a t o r  If i t i s located push  s e t s of  i s f o r the c l o c k  comparison  following.  of a quadruple  the  whether any  Essentially  it  along  line  indicate  the c u r r e n t  diagram  t o t h e two  markers  comparators: used  schematic  addition  data  shifting  The  it  is  then  items which i t in  an  the s m a l l e s t lower-left  evenand  the  positions  16  respectively. comparator, be d e s c r i b e d  2.C.  However,  when m a r k e r s a r e p r e s e n t i n s i d e t h e  t h e s i t u a t i o n becomes somewhat d i f f e r e n t and i n the next  will  subsection.  The C o m p a r i s o n / E x c h a n g e / S h i f t O p e r a t i o n s  symbols  For  the  convenience  will  be u s e d t h r o u g h t o u t t h i s c h a p t e r :  (i) Direction o  of  illustration,  the  following  of comparison  head or  T tail (ii)  Direction  Fig.II.4.  The ordering solid an  V of s h i f t  Symbols u s e d  direction  of  f o r comparison  comparison  is  of items a f t e r each comparison.  arrow  head  indicates  ascending order;  is desired, loss  of  assumed  in  this  used t o i n d i c a t e  used  to  show  the  In F i g . I I . 4 ( i ) , of the l a r g e r  the  item f o r  i f , on t h e o t h e r hand, a d e s c e n d i n g o r d e r  then t h e arrow  Without  the p o s i t i o n  and s h i f t .  head w i l l  generality, study.  the  indicate ascending  The open arrow  t h e d i r e c t i o n o f movement  t h e s m a l l e r one. order  will  be  of F i g . I I . 4 ( i i ) i s f o r both the  items  17  and  the markers  The depicted 1.  during  four  operations  Vertical-comparison:  the  in  comparison markers  a  two  items  with  the  downward.  The  the comparator:  portion the  left  left  in  parallel,  pointing  markers  otherwise  direction  of  of  items  t o t h e two  at  directions  of  presence  of  with  the  t o the l e f t ;  the  that  according  both  t h e head  portion  will  be  t o t h e two on  directions  of  c a s e ( i i ) When one o r  comparator,  a p p e a r s on t h e  the corresponding  points  to  i n the h o r i z o n t a l always p o i n t s  and t h e t a i l  the  right;  comparisons, the from  right  to  of a loop are involved i n  i . e . , when t h e marker of  the comparator,  reversed. and  right  t o t h e c o n v e n t i o n a d o p t e d , u n l e s s when  comparison,  left  the  t o the l e f t .  of comparison  left  no marker i s on  a r e compared  comparison  i t points  Note  minimum  on t h e upper  When  a r e p r e s e n t : when a marker  portion  direction  the  t h e two  of the comparator  comparison two  are  i s ignored.  Horizontal-comparison: C a s e d ) inside  comparator  a r e compared  parallel,  pointing  by  below:  The  of the comparator  bottom  operations.  performed  i n F i g . I I . 5 and d e s c r i b e d  portion  2.  the " S h i f t "  maximum  This items  a p p e a r s on  the  then the d i r e c t i o n  reversal  prevents  i n a l o o p from  the  crossing  18  over  each  other,  taken  i n Case  and  ( i i ) above.  Diagonal-comparison: portion the  i t i s a c h i e v e d by t h e a c t i o n  The  of t h e comparator  bottom  comparison  in  involving  parallel,  pointing  At  the t o p - r i g h t because  and  these  i t i s useful  portion  Furthermore,  of the  on  a r e compared with  the  directions  the  lower-left two  the v e r t i c a l  Shift:  Cased)  shifted right;  to  comparator  column,  items  then  seems  comparisons; on t h e  simultaneously.  i f the comparator  left  comparison  comparisons combination  comparisons.  i s located  i n an  i t s t o p two i t e m s w i l l  and i t s l o w e r  c a s e ( i i ) i f the comparator  odd-numbered shifted  the  each o t h e r .  when two m a r k e r s a p p e a r  top-left/bottom-right  even-numbered c o l u m n ,  i t s top  of  items a r e a l r e a d y i n  and h o r i z o n t a l  and h o r i z o n t a l  upper  t o t h e two a t  p r o v i d e s an e x c h a n g e n o t p r o v i d e d by t h e of  the  g l a n c e , the d i a g o n a l  order a f t e r the v e r t i c a l however,  items  downward and c r o s s i n g  the f i r s t  redundant,  left  two  is two  t o t h e r i g h t and i t s lower  be  two i t e m s t o t h e located items  in will  an be  items t o the l e f t .  operat1on  act ions  1.Vert1ca1_ Comparison t1me»  t  r  2.Horlzonta1_ Comparison  c a s e ( l ) no marker Is i n v o l v e d  c a s e ( H ) markers a r e Involved  t1me= t .  3.D1agonal_ Comparison t1me« t,  c a s e d ) f o r comparators c a s e ( l l ) f o r comparators i n odd columns i n even columns  4.Shift  tlme= t«  Fig.II.5.  The  comparators.  four operations  p e r f o r m e d by t h e q u a d r u p l e  20  3. 3.A.  The RSS A l g o r i t h m s Algorithm I  This comparison", but  algorithm  involves  "Horizontal-comparison"  not the "Diagonal-comparison",  following  only  program  fragment  written  and  and is  the  "Vertical-  "Shift"  operations  described  in  the  in Pascal:  Program R e c i r c u l a t i n g - S y s t o l i c - S o r t e r ; • Var Terminate : Column# : Row# : Comparator! : Exchange : Count-No-Exchange:  boolean; integer; integer; integer; boolean; integer;  (•Initialization*) • W h i l e NOT T e r m i n a t e do (*enter next c y c l e of comparison*) Begin f o r C:=1 t o C o m p a r a t o r ! do Begin Vertical-comparison; Horizontal-comparison; End; Check-Terminate; Shift; End; (Algorithm  The following 1.  procedure  global  "Check-Terminate"  be  manipulates  the  variables:  "Exchange" - T h i s to  I)  "False"  boolean v a r i a b l e before  a  new  commences, a n d i s s e t t o be " T r u e "  is  always  reset  comparison  cycle  if  any  exchange  21  takes 2.  place  during  the c y c l e .  "Count-No-Exchange" the  number  exchange, equals 3.  of and  - T h i s v a r i a b l e keeps t r a c k of  consecutive is  reset t o zero  "Terminate" - This  following  loop,  boolean and  condition  Condition(l)  is  whenever  variable set  have  no  "Exchange"  to  controls  be " T r u e "  the  i f the  is satisfied:  (for termination):  Count-No-Exchange  Algorithm  which  "True".  "WHILE-DO"  3.B.  cycles  > 2*Column#  II  This algorithm  i s similar  the  "Diagonal-comparison"  DO"  loop:  to Algorithm  operation  i s included  I  except  that  i n i t s "WHILE-  22  P r o g r a m Rec i r c u l a t i n g - S y s t o l i c - S o r t e r ; •  W h i l e NOT T e r m i n a t e do (*enter next c y c l e of comparison*) Begin F o r I : = 1 t o C o m p a r a t o r ! do Begin Vertical-comparison; Horizontal-comparison; D i a g o n a l - c o m p a r i son; End; Check-Terminate; Shift End; (Algorithm  3 .C.  Examples  Two  e x a m p l e s u s i n g t h r e e c o l u m n s , t h r e e rows and  comparators and  (i.e.,  Fig.II.7.  procedures, second  lists  of  last  right than  After  Algorithm  shown f o r t h e  input the  C=3,  R=3, the  I  and  P=8)  are  to l e f t . or equal  and  loading  The and  contents  the  last  t h e minimum of e a c h l o o p the  direction  A l l t h e numbers to those  and  of  marking first  cycles.  order. is  loop  l o o p above.  and  comparator  At  Both the  end  indicated  by  increasing values  in a given  i n the next  of t h e  two  eight  in Fig.II.6  II a r e a p p l i e d t o t h e  s o r t e d i n t o the ascending  cycle,  markers  first  are presented  initial  examples r e s p e c t i v e l y .  array are  the  II)  are  is  from  greater  23  cycle time • 47  at c y c l e time • 1  cycle time • 48  •-1  16  0  3  2  0  0  0  0  -1  •-1  0  0  0  0  0  -1  •-1  11  • 2  2  5  7  17  2  2  2  • 1  2  2  2  2  • 1  2  2  2  -1  3  2  2  • 2  5  5  3  3  e  0  •13  0  16  • 2  5  5  3  15  10  7  1  B  • 6  7  • 7  7  8  8  7  • 7  8  7  8  7  7  • 7  12  9  2  7  2  11  10  9  8  • 8  12  12  11  10  9  8  • 8  2  • 3  2  5  17  8  16  16  15  •13  17  17  16  15  •13  17  17  16  comparison  vert l e a l  vertical  |  V  vertical  comparison  |  comparison  V  V  •-1  2  0  3  2  0  0  0  0  -1  •-1  0  0  0  0  0  -1  11  •16  2  0  7  17  2  2  2  • 1  2  2  2  2  • 1  2  2  •-1 2  0 •13  S  8  . -1  • 2  6  5  3  3  2  2  • 2  5  S  3  3  10  7  1  16  • 8  7  • 7  7  8  8  7  • 7  8  7  8  7  7  • 2  3  9  2  7  2  11  10  9  8  • 8  12  12  11  10  9  8  • 8  7  •12  2  5  17  8  16  16  15  •13  17  17  16  15  •13  17  17  16  8 15  horizontal  |  comparison  horizontal  |  V  compar1son  horizontal  V  •-1  2  0  3  2  0  16  •11  2  0  IT  7  8  0  • 5  13  8  15  10  7  1  • 2  3  9  12  • 7  2  I  comparison  V  0  0  -1  •-1  0  1  0  0  0  0  -1  2  2  2  • 1  2  2  1  2  2 _• 1  2  2  2 I  -1  • 2  5  5  3  3  2  |  2  5  3  3 |  16  • 8  7  • 7  8  7  8  7  l_*  2  7  2  11  10  9  8  • 8  12  |  12  5  17  8  16  16  15  •13  17  17  1  16  15._*13  0  shift  I  »  7  «-1 |  •2  5  8.  8  7  7  7 !  11. 10  9  8  •8 [  17  17  16 |  1  V « < sorted list  Fig.II.6. An example t o i l l u s t r a t e RSS A l g o r i t h m I and M a r k i n g Scheme A.  »>  24  at c y c l e  c y c l e time - 1 •-3  32  8  0  6  11fne • 1B  2  3  2 6  0  at c y c l e 0  -2  time - 19  •-3  •-3  2  2  0  0  -2  22  5  5  11  16  •35  • 2  5  5  5  4  6  5  5  5  4  • 3  •17  0  27  2  33  -2  15  11  6  8  6  • 6  • 6  11  11  8  6  ' 6  •18  •11  17  15  16  16  15  17  16  15  16  15  •15  22  19  18  32  27  •31  31  22  15  3  16  •15  26  19  4  15  5  31  22  22  19  18  •18  •18  26  22  6  6  11  34  •IB  •26  35  34  33  32  27  35  34  33  6  |  vert leal  vertical  compar1 son  j  comparison  vertical  |  V  V  comparison  V  •-3  S  0  8  6  2  2  2  0  O  -2  •-3  •-3  2  2  0  0  -2  22  32  5  2  16  •36  • 3  6  5  5  5  4  6  5  5  5  4  • 3  •17  0  27  11  16  -2  11  11  6  8  6  • 6  • 6  11  11  B  6  6  31  22  15  3  33  •IB  •15  17  15  16  16  15  17  16  15  16  15  •15  • 6  6  4  15  S  26  22  22  19  18  •18  •18  26  22  22  19  18  •18  •31  35  34  33  32  27  35  34  33  32  27  •31  IS  1  26  9. 6  11  horizontal  |  34  horizontal  comparison  |  comparison  horizontal  |  V  V  compar1 son  V  •-3  5  0  8  6  2  2  2  0  0  -2  •-3  •-3  2  2  0  0  -2  32  22  5  2  35  •16  • 3  6  5  S  5  4  6  5  5  8  4  • 3  • 0  17  27  11  16  -2  11  11  8  6  6  • 6  • 6  11  11  8  6  6  31  22  15  3  33  •18  •15  17  16  15  16  15  17  16  16  15  15  •15  • 6  6  19  4  15  5  26  22  22  19  18  •18  •18  26  22  22  19  18  15  6  •18  •31  35  34  33  32  27  35  34  33  32  31  •27  26  diagonal  11 |  34  diagonal  compar1 son  j  •-3  S  32  22  • 0  17  31  22  4  compar1son  diagonal  6  2  2  5  2  35  •16  • 3  6  5  5  5  4  27  11  16  -2  11  11  8  6  6  • 6  3  33  •18  •IS  17  16  15  16  15  2  0  0  -2  •-3  • 6  6  19  15  15  5  26  22  22  19  18  •18  26  15  • 6  11  34  •18  •31  35  34'  33  32  27  Shift  I  Shift  I  comparison  V  8  0  j  V  V  2  L-3  1  2  0  0  -2  |  6  5  5  6  4  • 3  |  L* «  11  11  8  6  6  |  16  .15  IS  »15  |  22  _22  19  18  |  32  31  '27  L_1T_ _16 L_*18  I  35  26 34 «<  33 sorted  F i g . I I . 7 . An example t o i l l u s t r a t e RSS A l g o r i t h m I I and M a r k i n g Scheme B.  11st » >  V  25  4.  Operational  4.A.  Constraints  Most the  Constraints  size  sorter  on t h e S i z e o f RSS  sorting  networks  of the networks. [13]  requires  For  that  impose  examples,  only.  The b a s i c  Batcher's  t h e number o f i t s i n p u t  power o f two, a n d some mesh s o r t e r s arrays  c e r t a i n c o n s t r a i n t s on  [14,15]  work  bitonic  lines  be a  on  square  c o n s t r a i n t o f t h e RSS a r r a y  appears  t o be l e s s s t r i n g e n t :  Requirement(1):  C o l u m n ! >2 Row#  Further  constraints  d e p e n d i n g on t h e m a r k i n g described correct  operations  used,  but  later  on —  used.  below,  an  >1  may  or  may  be  required  schemes u s e d : In Scheme A a n d B t o be  Requirement(1)  is  sufficient  o f b o t h RSS a l g o r i t h m s  additional  on t h e s i z e  not  constraint  o f RSS w i l l  —  when  t o guarantee Scheme  which w i l l  be  A  is  given  be needed when Scheme B i s  26  4.B.  Marking  It  Scheme A  i s observed  certain  ways  results,  and one s u c h ways i s g i v e n  Marking  The  of  from our s i m u l a t i o n  marking  array  can  that  guarantee  initial  correct  below.  marker p o s i t i o n , M ( i ) , o f l o o p  i  i s :  := 4 * i - 2 + 1 - M ( i - 1 )  where i=1,2,...2*Row#-1, be any v a l u e  0< M ( i ) <=  i n t h e range  2*Column#, a n d M(0)  of M ( i ) .  Scheme A i s a p p l i e d t o t h e example o f F i g . I I . 6 , M(0)=1,  M(1)=3-1=2,  M(2)=5-2=3,  M(5)=3-1=2, and t h e p a t t e r n columns, this  only  Scheme A:  M(i)  can  the  studies  then  M(3)=9~3=6,  repeats.  I f there  M(3)=6MOD(2*Column#)=2.  scheme w i l l  be e x p l a i n e d  in Section  where  M(4)=7-6=1,  are  only  two  The r a t i o n a l e b e h i n d 5.B. i  4.C.  Marking  In two  sides  Scheme B  the second of  the  scheme, t h e m a r k e r s comparator  array,  are placed as  along the  demonstrated  in  27  Fig.II.7.  This  Input/Output retrieval A  method  lines  to  of the f i n a l  simpler  and  the  markers;  insert  sorted  list  we  may  use  and  the  a l s o , the  i s e a s i e r t h a n when  Scheme  i s used.  However,  this  scheme  c o l u m n s o f t h e RSS a r r a y higher  i n t e g e r of that  Marking  1 ,  for  i=even  •• —  2*Column#,  for  i=odd.  Analogy with  RSS  2*(An odd i n t e g e r )  •• -=  2*(An odd i n t e g e r ) + 1  design:  next  ,or  Algorithms  t h e Odd-Even T r a n s p o r t a t i o n  algorithms  Even T r a n s p o r t a t i o n the  number o f  Scheme B o n l y ) :  : ==  A n a l y s i s o f t h e RSS  The  the  an odd i n t e g e r , o r t h e  :=  Column#  5.A.  be t w i c e  that  value:  Requi rement(2) ( f o r  5.  requires  Scheme B:  M(i)  of  is  Sort  bear  some  Sort  resemblance  [11]; therefore, a brief  t o t h e Oddexplanation  Odd-Even s o r t e r w o u l d be h e l p f u l i n a n a l y s i n g  t h e RSS  28  K-l  Stage 5-0 a — n  ri  Input  sorted ouput  a.N - l  Fig.II.8.  In F i g . I I . 8 , presence  .of  position. a(j')  a  An  a t stage j'  where are  The Odd-Even  t h e a p p e a r a n c e o f an a r r o w  conventional, itema(j)  will  binary be  sorter  compared  to  another  = j + (-l)**(j+s)  value  of j  1  will  item  (II.1)  than or e q u a l  a l t e r n a t e between  s  is  incremented.  sorting  of  N items i n N c y c l e s  N*(N-1)/2 s o r t e r s ; i t i s  large.  the  s i t u a t e d at that  l e s s t h a n N, where N i s t h e number o f i n p u t  when  indicates  s if,  a l l j ' , j and s a r e g r e a t e r  The  of  Sorter.  This  sorter  to zero  lines.  (j+1) and ( j - 1 )  guarantees  correct  [ 1 1 ] , but i t r e q u i r e s  therefore  and  impractical  a  if  total N  is  29  5.B.  Correctness  In  this  algorithms  are  be  derived.  the  RSS  process  Lemma  to  the  the RSS Let  array.  As  initial 1)  then  first  the  two  (II.1),  we  will  each loop  of  the  with  The  these  prove  marking  RSS  the  schemes  array,  loops;  Schemes  that  examine t h e  RSS will  effects  of  temporarily  t h e n Theorem  ( I I . 1)  i n t e r a c t i o n s , a complete s o r t i n g  Odd-Even T r a n s p o r t a t i o n  Sort  l o o p when e i t h e r A l g o r i t h m  us  consider  demonstrated of  of  the  the  RSS  array;  p o s i t i o n of  i n d i c a t e s the  Because  the  function  of  the  markers t i m e and  + M(i)  three  i s performed  I or  II  is  indexes  i , j and  array;  on  applied  and  for loop  marker  are  the  of  i ,  the  M(i)  loop,  indicates  and  with  i s r e l a t e d to other  time,  (t*(-1)**i)MOD  the  marker.  j i s therefore  indexes as  2C]MOD 2C  fixed  j(=0,1,...2*R-  of a p o s i t i o n away f r o m t h e shifted  RSS  indexes  J(=1,2,..2*C) i n d e x e s t h e  ( 2 C + ( t * ( - 1 ) * * i ) M O D 2C)]MOD  - J +  J on  i n F i g . I I . 9 , i(=0,1,..2*R-1)  distance  j=[(M(i)+2C-J) + =[4C  will  Marking  achieved.  RSS  loops  positions  we  and  array.  Proof:  the  be  (II.1):  e a c h of  on  Algorithms  i n t e r a c t i o n s among t h e that  can  RSS  c o r r e c t , and In Lemma  the  show  the  section,  algorithms  ignoring will  of  a  follows:  2C  (II.2')  ..  .(II.2)  / 30 Moving p o s i t i o n F i x e d p o s i t i o n index 1 3 4 2 Loop  1"0 1=1  •  0  11  6  7  8  9  10  8  7  6  5  4  3  10  9  8  7  6  5  4  8  7  6  5  9  8  10  •  1  5  11  0  index J.  * 1"2  0  11  10  9  1*3  3  2  1  0  11  10  *  1«4  * 0  11  10  9  8  7  6  5  1«5  1  * 0  11  10  9  8  7  6  10  9  1  0  CLE  1«6  7  6  2  1  Tin  Tl IT  *  1-7  11  10  Fig.II.9. The t h r e e i n d e x e s : i , j , and J , a n d t h e i n i t i a l marker p o s i t i o n M ( i ) .  Cycle t-0  time t=1  {a(0.0)* <a(0,1) <a(0,2) {a(0,3) { < {a(0.2C-1)  loop O  (a(1,0)* <a(1,1) {a(1.2) (a(1.3)  loop 1  { {  t=2  t=2C-1  s o r t e d output  : :  {a(1.2C-1)  loop 2R-1  { { { {  (  a(2R-1.O)* a(2R-1,1 ) a(2R-1.2) a(2R-1.3)  {  :  { a(2R-1,2C-1). Fig.II.10. RSS a r r a y .  The h o r i z o n t a l comparisons c a r r i e d out on the  \  31  The the  effect  second t. j  r e a d e r may  will  now  items  First,  J'  marker's  i s due  position  to the e f f e c t  are used  easily  established  that  to trim  several  us  consider  a(i,J)  i s always  = J -  j'= =  of time  the  indexed  with a period  e x p r e s s i o n s to r e l a t e involved  the  in  horizontal  a  of  by and  2C.  j is related  a pair  of  comparison.  comparison.  compared t o a ( i , J ' )  In  where  (II.3)  t o other indexes as  2C] MOD  in  (II.2):  2C  a(i',j'):  [4C+M(i')-J'+(t*(-1)**i*)MOD  2C] MOD  [4C+M(i')-J+(-1)**J+(t*(-1)**i')MOD  a horizontal  and  the v a l u e s of t  (-1)**J  f o r item  j ,  expression (II.2) i s correct  = [4C+M(i)-J+(t*(-1)**i)MOD  Similarly  on  t h e r e l a t i o n s h i p among t h e i n d e x e s ,  {a(i,j),a(i',j')}  item a ( i , j ) ,  j  i n e x p r e s s i o n ( I I . 2 ' ) shows  Fig.II.9.  derive  let  Fig.II.9,  For  verify  t h e example o f  data  For  term  term  b o t h o f them a r e r e p e t i t i v e  Having we  initial  modulo f u n c t i o n s  because  from  composite  of the  composite  The  The  first  comparison,  2C 2C] MOD  i equals i ' , therefore  2C  j ' reduces  32  to: j'=j  Again  + (-1 ) * * J  (II .3' )  from e x p r e s s i o n  J=  (II.2):  - j + 4 C + M ( i ) + ( t * ( - 1 ) * * i ) M O D 2C + 2KC  where K i s any p o s i t i v e i n t e g e r Substituting  J into  where a ( i ' , j ' )  will  (11.3 * ) ,  such that  we o b t a i n  be compared  J will  be p o s i t i v e .  the expression  for  to a ( i , j ) horizontally,  j ' : = - j + ( - 1 ) * * [ j + M ( i ) + 2 C + ( t * ( - 1 ) * * i ) M O D 2C]  Within will  loop  comparing  (II.4)  comparison" perform  point  (-j-1) and (-j+1) a s  with  Odd-Even S o r t  items w i t h i n i s further  the  "Shift"  as f a r as l o o p  a loop  therefore j ' By  the " H o r i z o n t a l operations,  will  i i s concerned, and  c a n be s o r t e d  illustrated  (II.4)  t increases.  and ( I I . 1), we c a n s e e t h a t  when c o u p l e d  the  therefore This  i , M ( i ) and ( - 1 ) * * i a r e c o n s t a n t s ,  a l t e r n a t e between  j ' ,  within  2*C c y c l e s .  in Fig.II.10. Q.E.D.  Theorem with and  ( 1 1 . 1 ) : The e n t i r e RSS a r r a y  the  combination  e i t h e r Marking  is  capable  of e i t h e r A l g o r i t h m  Scheme A o r Scheme B.  of  sorting  I or Algorithm I I ,  33  Proof: the  In a d d i t i o n t o t h e Odd-Even c o m p a r i s o n s w i t h i n a  RSS  a l s o compares t h e i t e m s  o f any two a d j a c e n t  means o f t h e " V e r t i c a l - c o m p a r i s o n " the  purpose  of  these  downward.  that,  are  if  comparisons  a ( i + 1 , 0 ) ) o f one l o o p , next the  higher entire  loop,  RSS  that  It i s  provided  odd-even  of  us c o n s i d e r  smaller  easily  (i.e., sort  the proof  of t h i s  the algorithms  a(i,2C-l))  a l w a y s compared  In  o u t on  theorem  i s reduced  to the  are  provided  and t h e marking  Fig.II.9,  Combining  the  schemes.  between a p a i r note  by  of  that a ( i , j )  Let items  will  be  1  (II.5)  (11.2,4 a n d 5 ) , we c a n o b t a i n  where t h e h e a d and t a i l  =>  the  to a ( i , j ' ) i f ,  From e x p r e s s i o n s  j'=0,  of  be c a r r i e d  i ' = i - ( - 1 )**( i + r J/2 1)  , =>  (i.e.,  will  the "Vertical-comparison"  {a(i,j),a(i',j')}.  j=2C -1  items  discernible  between t h e head  the " h e a d - t a i l " comparisons  combination  t o move  array.  Therefore, proof  is  and t h e t a i l  then  l o o p s by  and. "Diagonal-comparison";  operations  upward a n d l a r g e r i t e m s  loop,  o f any two l o o p s  the p o s i t i o n  meet:  4 C + M ( i ) - J + ( t * ( - 1 ) * * i ) M O D 2C = 2KC+2C-1  4 C + M ( i ' ) - J + ( t * ( - 1 ) * * ( i + 1))MOD 2C = 2K'C  expressions  ( I I . 6 a n d 7 ) , we  J  obtain,  ...(II.6)  (II.7)  34  8C+M(i)+M(i')-2J  =  (K+K')*2C+2C-1  = > J = K " C + ( M ( i ) + M ( i ' ) + 1 )/2  where  K  either  -1, 0 or 1 because  that  a n d K' a r e i n t e g e r s  the  (i+1)  at  tail  1<=J<=2C.  halfway  J=(M(i)+M(i')+1)/2, whether  such t h a t  of l o o p i w i l l  either  or  (II.8)  0<=j<2C,  and K"  equals  Expression (II.8)  means  be compared  between  M(i)  t o t h e head and  M(i'),  J=(M(i)+M(i')+1)/2+C,  t h e r e i s any c o m p a r a t o r  situated  of l o o p i.e.,  depending  at these  on  locations.  From e x p r e s s i o n I I . 8 ,  M(i)+M(i+1)+1  = 2(J-K"C)  => M ( i ) + M ( i + 1 ) = 2 ( J - K " C ) -1  Expression marking  (II.9) gives r i s e  o f t h e RSS  (II.9)  t o another  requirement  f o r the  array:  Requirement(3):  (for  marking  schemes o t h e r t h a n Scheme A a n d B)  M(i)+M(i+1)=An odd  This of  requirement  the loops w i l l  will  integer  ensure  be c o m p a r e d by t h e  that  the t a i l s  vertical  and h e a d s  comparisons.  35  It  is  used,  automatically  satisfied  when M a r k i n g  Scheme A o r B i s  b u t h a s t o be c o n s i d e r e d f o r o t h e r m a r k i n g  schemes. Q.E.D.  Now  we w i l l  i'=i+1  d e r i v e Marking  Scheme A.  and i ' = i - ( - 1 ) ** ( i+  From  (II.5),  rj/2l)  =>i + l"J/2l=An odd i n t e g e r =>rj/2l=(An  odd i n t e g e r ) - i  C a s e d ) a t i=even,  [J/2l=odd;  therefore,  => J= 2*(An odd i n t e g e r ) , o r = 2*(An odd i n t e g e r ) -1  from  e x p r e s s i o n s ( I I . 9 and 10),  => M ( i + 1 ) =  4*(An odd i n t e g e r ) - 2 K " C - 1 - M ( i ) , o r  = 4*(An odd i n t e g e r ) - 2 K " C - 3 - M ( i )  Case(ii)  a t i = o d d , l"J/2l=even;  => J=2*(An  ...(11.11)  therefore,  even i n t e g e r ) , o r  = 2*(An even i n t e g e r ) - 1  from  (11.10)  e x p r e s s i o n s ( I I . 9 and 1 2 ) ,  (11.12)  36  =>  We  M(i+1)=4*(An  even  integer)-2K"C-1-M(i),  = 4*(An  even  integer)-2K"C-3-M(i)  c o u l d o b t a i n Scheme A by  (11.11  and  setting  K"=0  or  (11.13)  in expressions  14):  M(i+1)=4*i-2+1-M(i)  Or, e q u i v a l e n t l y ,  r  M(i)=4*i-2+1-M(i-1)  which  i s Scheme A and where  1<=M(i)<=2C, f o r i:=1,2,...2R  To d e r i v e Scheme B, l e t  M(i):=1, :=2C,  C a s e ( i ) At  i=odd,  f o r i=even f o r i=odd  from e x p r e s s i o n s  J=K"C+(1+2C+1)/2=2*(An even =2*(An e v e n =>  out  J=K"C+C+1 e{3,4,7,8  of the three p o s s i b l e v a u l e s  }  o f K",  ( I I . 8 and  12),  integer),  or  integer)-1 (11.14)  i . e . , - 1 , 0, and  +1,  37  o n l y K"=0  can s a t i s f y  both  expression  (11.14) a n d (1<=J<=2C);  therefore,  J=C+1 e { 3 , 4 , 7 8 f  }  f  =>C i {2,3,6,7,  }  =>C:=2*(An odd i n t e g e r ) , o r  :=2*(An  C a s e ( i i ) A t i=even,  odd i n t e g e r ) + 1  from  (11.15)  expressions  ( I I . 8 and 10),  J:=K"C+(2C+1+1)/2 =K"C+C+1 €{1,2,5,6 both  K"=0  (1<=J<=2C)  a n d K"=-1  can  satisfy  Ce{0,1,4,5  when K" = -1,  C= Any p o s i t i v e  a l l values of i , both  satisfied  C  (11.16)  expression  (11.16) a n d  simultaneously:  when K" = 0,  For  }  simultaneously  }  (11.17') integer  expressions  (11.17)  (11.15 and 17) c a n be  by t h e r e q u i r e m e n t  := 2*(An odd i n t e g e r ) , o r := 2*(An odd i n t e g e r ) + 1  below:  38 t  => C €{2,3,6,7,  which  i s Requirement  Now will  (2) o f Scheme B.  l e t us c o n s i d e r  show  Marking  }  that  the diagonal comparisons,  Requirement  (3.) c a n a c t u a l l y  and  be w a i v e d  when  a(i',j')  will,  Scheme B i s u s e d w i t h A l g o r i t h m I I . Again,  be compared  from F i g . I I . 9 , items a ( i , j )  and  diagonallyi f ,  J'=J-(-1)**J  ...(II.3)  i ' = i - ( - 1 ) * * ( i + r J/21)  Converting  J and J ' i n t o  (II.5)  j and j ' u s i n g e x p r e s s i o n s  ( I I . 2 and  3):  j=[4C+M(i)-J+(t*(-1)**i)MOD  2C] MOD  2C  j'=[4C+M(i')-J+(-1)**J+(t*(-1)**i')MOD  The  we  heads and t a i l s  of the loops  diagonal comparisons i f ,  j = 2C-1 j'=0 i'=i+1  will  2C] MOD  2C  be compared by t h e  39  Substituting  these values  j=2C-1, j'=0,  into  t h e above  expressions,  => 4 C + M ( i ) - J + ( t * ( - 1 ) * * i ) M O D  we g e t ,  2C=2KC+2C-1  => 4C+M(i + 1 )-J+(-1 )'**J+(t*(-1 ) * * ( i + 1 )MOD 2C=2K'C  A d d i n g up t h e two e x p r e s s i o n s ,  8C+M(i)+M(i+1)-2J+(-1)**J=(K+K'+1)*2C"1 =>  M(i)+M(i+1)=2J-2K"C-1-(-1)**J =2(j-K")C-1-(-1)**J =2(J-K")C, =An even  The  last  result  shows t h a t  or 2(J-K")C-2  integer.  Requirement  Scheme B i s u s e d w i t h A l g o r i t h m e q u a l s an odd i n t e g e r , be even  provided integer,  summarized  I I , because  then the h e a d - a n d - t a i l  by t h e " V e r t i c a l - c o m p a r i s o n " , then i t w i l l  comparison" as demonstrated  The  (3) c a n be w a i v e d when  various  be  provided  if  comparisons  will  b u t i f i t e q u a l s an by  the  "Diagonal-  above.  requirements f o r the marking  in Table.II.1.  M(i)+M(i+1)  schemes a r e  40  Table  Marking  II.1  - Requirements of  Scheme  Algorithm  Requirements(1)&(2)  Requirements(1)&(2)  R e q u i r e m e n t s ( 1 ) & ( 3 ) and  Correctness  t h e RSS  (2)  (2.C)  is sufficient  the  array  Termination  is correctly  (3) a r e d u l y met,  i s t h a t , as  exchange  meaning  5.D.  and  of  to we  i n t h e most no  Timing  The  then  guarantee can  see  sorting  I I . 5 and  marked and Condition  proper  in  the  process  be II.8.  2C  Requirements (1)  of  Section  termination.  from e x p r e s s i o n s every  to  Methods  recent consecutive  f u r t h e r exchanges  t h a t the  others  from e x p r e s s i o n s  comparison p a t t e r n repeats  be  II  B  (1),  will  Algorithm  I  schemes.  Requirement(1)  If  the  marking  Requirement(1)  derived  reason  RSS  A  others  5.C.  the  (11.2,4 and  cycles; 2C  i f there  cycles,  subsequent  must have  The  then  5),  i s no there  comparisons,  terminated.  Complexities  RSS  s i m u l a t i o n program  reference.  Input  numbers of  rows and  parameters columns —  is listed  to these  i n Appendix B f o r  the  simulator  include  the  two  numbers d e t e r m i n e  the  41  total  number  value  f o r the  simulation well  as  of  the  I to  numbers  of  indicate  that  of  s o r t on  the  sort  line  and  a p p r o a c h e s N/2  When A l g o r i t h m will  be  much s m a l l e r  comparisons, illustrate Algorithm because and  and this  II  --  i t s comparison  hence  i t s cycle  ^  of  due  time w i l l  sorted  of  each  list  as  n e e d e d by  RSS  combinations  of  simulation  a v e r a g e number  of  to  N  items  results  of  cycles  i s bound by  not  of  cycles  presence Fig.II.6 the  exceed  include be  number of  the  However,  cycle  end  seed  the  increases.  examples  may  initial  the  cycles  various  i s used, the  point. or  N  at  the  needed.  These  set as  II  the  may  with  I, the  random  and  list;  numbers  with Algorithm  to  —  produce the  columns.  needed N,  will  arrays  and  a  input  sorting cycles  show  rows  sorted  the  simulator  number of  Fig.II.11 Algorithm  t o be  generation  run,  the  items  the  longer.  of and  actual that  needed  diagonal II.7  help  speeds  of  of A l g o r i t h m  diagonal  I,  comparison  s Number o f c o m p a r i s o n  cycles  Fig.ll.11. The number o f c o m p a r i s o n c y c l e s v e r s u s the number o f i t e m s t o be s o r t e d ( A l g o r i t h m I ) .  43 6.  Discussions  Since in  sorting  computer  algorithms  applications,  design been  a common a n d n e c e s s a r y there  are  dozens  described i n the l i t e r a t u r e .  have p r e s e n t e d systolic  i s such  two s i m i l a r  idea  sorting  to a c y c l i c a l  of a s o r t e r suggested.  of  In t h i s  algorithms  such  which apply  on t h e s e  Our p r i m a r y  goal  algorithms  i s to look  into  i  the  to s c i e n t i f i c  computation  computer  usefulness  of  (a) t i m e  the design to a  the  stated  some e x i s t i n g  complexity;  goal  system w i t h  the  sorter  , i t c o u l d a l s o be  i n f o r m a t i o n s y s t e m s and r e l a t i o n a l  proposal with  also  host  Pattern Matcher  Undoubtedly,  With  has  SYSTEM BUS  Fig.II.12. A general-purpose purpose c h i p s attached [ 1 9 ] ,  office  the  a s t h e one e n v i s i o n e d by F o s t e r a n d Rung [ 1 9 ] :  Primary Memory  limited  we  a r c h i t e c t u r e , and t h e f u n c t i o n a l  (RSS) b a s e d  I  CPU  sorting  chapter,  o f a s p e c i a l - p u r p o s e V L S I c h i p t h a t c a n be a t t a c h e d computer  operation  in  data  mind,  ones u s i n g  base  special-  i s not used  in  machines.  we now compare o u r  the f o l l o w i n g c r i t e r i a :  (b) hardware c o m p l e x i t y  and  (c)  control  44  complexity. (a) Time C o m p l e x i t y : some  existing  In T a b l e . I I . 2 , t h e s o r t i n g a l g o r i t h m s c o u l d be d i v i d e d  categories,  namely, O ( l o g N ) ,  and  where  O(N),  sorted. in  N  M u l l e r and  the  faster  discouraging  are  time,  on  by  i s based  and  it  2*logN  on  needs  sorted.  items  it  to  two  they  0(N**2). shuffle  type  O(N)  belong"  to  and  O(N**0.5)  S a h n i ' s mesh s o r t i n g bitonic 14N  steps  that  the  the  merge  algorithm  routing  on  a  steps  NxN  input s u b f i l e s  Kung's  scheme  be  mesh s o r t i n g  and  N+0((N**(2/3))logN) compare-exchange  RSS  algorithms  neighour times sorting  to  the  O(N)  simpler c o n t r o l  might  be  less  than  schemes w h i c h r e q u i r e  steps.  category,  those more  actual of  steps The but near-  sorting  the  complex  pre-  scheme  s t r u c t u r e s and  t y p e o f d a t a movements, t h e i r  and  mesh;  [15] n e e d s r o u g h l y 6 N + 0 ( ( N * * ( 2 / 3 ) ) l o g N ) r o u t i n g  because of t h e i r  of  mesh w i t h a p p r o x i m a t e l y  approximately  belong  they  sort  Batcher's  Thompson  a  schemes  and  i t requires  is  requires  the p e r f e c t  mesh s o r t i n g  be  [10]  shuffle-exchange  compare-exchange  moreover,  but  [13] and  NxN  Nassimi  [14]  O(N**0.5)  comparators,  the The  a  therefore  category.  of  of four  i n t h e 0 ( ( l o g N ) * * 2 ) c a t e g o r y , and  interconnections. items  of  sorter  are c h a r a c t e r i z e d  N**2  number  into  Preparator's algorithm  number  both  the  category,  Batcher's b i t o n i c [9]  is  0((logN)**2),  times  mesh  control  45  and  data  movements.  (b) Hardware c o m p l e x i t y : type  of  and  a very  because  low  shown  by  [18]  c h i p area  s e r i o u s drawback when N  by  t h e mesh and  and  repetitive,  fabricated single  (c)  the  c o m p a r i s o n and each  control  of  unit  is  control therefore  Batcher's than  that  It  that  has at  network  large.  types  of  replicating f o r the  — On  been least out  this the  The  sorters  is a other  logic  could  the c i r c u i t s  entire  the  quadruple  of  the  can  comparable  bitonic required  to  sorters, by  the,  that and  Vertical-  be  hardwired the  broadcasts  the  mesh  comparators.  t h e RSS  algorithms  required is  a  and  o p e r a t i o n s to a l l the  s t r u c t u r e r e q u i r e d by  of  various  comparator,  shown i n F i g . I I . 1 s i m p l y  be  arrays.  Horizontal-comparison,  Diagonal-comparison)  s e q u e n c e of t h e s e The  two  unit  (i.e.,  modularity,  algorithms are h i g h l y regular  by  complexity:  operations  into  these  easily  networks  interconnection patterns required  t h e RSS  comparator  Control  is  t o VLSI  i s required to lay  shuffle-exchange  because  and  of v a r i o u s l e n g t h s .  0((N**2)/(logN)**2)  hand,  well-suited  regularity  Thompson  N-vertex  shuffle-exchange  shuffle-exchange  degree of  require wires  an  with  i n t e r c o n n e c t i o n s a r e not  implementations, have  Sorters  much sorters.  by  the  simpler The  46  l simplicity factor  Most constraints and  this  s h u f f l e network  i s concerned.  impose c e r t a i n n o n - t r i v i a l  require  the Batcher's that  the  square  arrays.  (see T a b l e . I I . 1 )  The  appear  to  contraints be  less  sorter  number  of  of  algorithms the  RSS  stringent  in  respect.  In every far  networks  important  be a power o f two, and t h e mesh s o r t i n g  on  algorithms  sorting  implementation  on t h e i r s i z e s ; f o r e x a m p l e s ,  lines  operate  when t h e V L S I  other  the p e r f e c t  input  o f t h e RSS c o n t r o l l e r i s a n o t h e r  summary, a l t h o u g h  aspect,  i t i s highly  t h e RSS d e s i g n  amenable t o V L S I  as i t s hardware and c o n t r o l  i s not optimal i n implementations  c o m p l e x i t i e s are concerned.  as  Table  II.2 - Complexities  Method  #Input  Batcher's  Bitonic  Muller  Preparator'st10]  &  Perfect  Shuffle[9]  Thompson Nasslml  & K u n g ' s Mesh & Sahnl's  Mesh  + ++  ^Comparators  Time  Interconnect ion  Control  N  0(N{logN}**2)  0( {1ogN}**2)  high  1 ow  N  0(N**2)  0(1ogN)  1 ow  high  N  0(N)  0({logN>**2)  high  1 ow  Sort[15]  NxN  NxN  mesh  0(N)++  1 ow  high  Sort[14]  NxN  NxN  mesh  0(N)++  1 ow  high  0(N)++  low  1 ow  N  RSS  Notes:  Sorter[13]  o f S o r t i n g Networks*  0(N)  I n terms o f a m e n a b i l i t y t o VLSI implementations. Please see d i s c u s s i o n s i n Section II.G.  48  Chapter  I I I . A Novel Loop-Structured  Switching  Network  (LSSN)  1.  Introduction Many l a r g e - s c a l e computer a p p l i c a t i o n s s u c h  processing, systems,  weather  require  instructions  such  execution  per  technologies,  f o r e c a s t i n g and rates  second.  s y s t e m s by  off-the-shelf processing  work  co-operative  a  n e t w o r k s can  provide  among t h e s e d e v i c e s , t o expand. and  a NxN  for  For  the  switches while them. is to  the  are  crossbar baseline  the  switch  fail  of  system  when more s w i t c h e s a r e In  switching  this  chapter,  network  The  main  only  r e q u i r e s N/2 of  feature  we  —  the  "bandwidth  and  difficult  five  crossbar  networks are  million  thousand  l a r g e number of  of  switches  more  likely  used.  introduce  a novel  LSSN i s i t s c y c l i c a l  two-by-two s w i t c h e s  t r a n s m i t t i n g and  to  O(NlogN) r e s p e c t i v e l y ;  (LSSN) w h i c h o v e r c o m e s t h e of  even  several existing  r e q u i r e more t h a n a  using  VLSI  devices,  c o u n t s o f a NxN  would need a b o u t  reliability  of  construct  communication  0(N**2) and would  billion  h u n d r e d s or  expensive to b u i l d  A n o t h e r d i s a d v a n t a g e of  that  pairs  [20]  to  storage  Although  required  they are  examples,  baseline  N=1024,  manner.  the  flexible  and  one  advent  interconnecting  image  m i s s i l e defence  than  the  more  t h o u s a n d s of in  more  With  i t i s f e a s i b l e and  large-scale  of  ballistic  as  for  loop-structured above  problems.  connections;  and  it  interconnecting  N  receiving devices;  therefore,  i t is  49  very  attractive  of many  f o r l a r g e - s c a l e , heterogeneous  devices. From t h e s t r u c t u r a l  LSSN  i s a packet-switched,  distributed its  control.  connection  In S e c t i o n revealed,  three,  multi-staged, In  several  and  performance  and  topics  Appendix  C f o r reference.  blocking  Network  and r o u t i n g  network  with  present  algorithms.  will  of our s i m u l a t i o n  will  further  deadlocks  be  be  work  presented.  are  program  provided  in  are l i s t e d in  Topology  Traditionally,  (logL)  they  stages  o f two-by-two s w i t c h e s  a r e well-known  a r e used t o connect  i . e . ,for  receivers.  view,  p r o p e r t i e s o f LSSN w i l l  evaluations  where L i s a power o f two —  terminals,  of  s e c t i o n , we w i l l  and t h e LSSN s i m u l a t i o n  Networks u s i n g  interconnecting  information  thus  forming  back  loops.  loops, could  from t h e output LSSN  but  L  transmission  and r e c e p t i o n .  a power o f two a n d a t l e a s t  equal  L  output  transmitters provided  to L route  side to the input  side,  on  the  from o t h e r s  be u s e d a s b o t h e n t r y  to  to  i s a l s o based  i t differs  --  [20,21,22,25,26].  L input  Feedback p a t h s a r e sometimes  the  data  points  four, the r e s u l t s  for  five,  switches  t h e next  important  In S e c t i o n  Section  feedback  functional  a n d t h e c a u s e s o f a n d method t o a v o i d  Discussions  2.  and  function, addressing  be d i s c u s s e d . studies  s y s t e m s made up  and  of  i n that a l l i t s  exit  With L loops to four —  concept  stations for —  where L i s  LSSN c a n  connect  50  up  t o N=LlogL p a i r s  using for  o n l y N/2  o f t r a n s m i t t e r s ( T r s ) and  switches —  l a r g e v a l u e s o f N.  illustrated  2.A.  Addressing  following  switches  with  been  is  set  defined  which  L'=logL b i t s  The  is  switches  labeled  represented The  as  Xe.ft  with  output  and  N=64  is  Function be  to as  of c o d e ,  easily  the a  understood i f in  which  path  in  this  loops i s which i s  the  i n t o L' of  of a s w i t c h a t the  closed  range  s t a g e s each of binary  digits  s-th stage  would  addresses:  -t-Lnh = Xtnh=  ,. . . ^•i^L* A  • •••  g • • • •'°'i'^n • • • • ' ^  * ^1 s +  a r e o b t a i n e d by c o n c a t e n a t i n g  t o g e t h e r , with the link  i.e.,  bits  a l l  "Straight-through"  closed  integer in  S'=riogL'1  following  Output  These addresses labels  can  are arranged  links  12-Lg.h.t Output  left  L=16  Ag,...-o.^.  output  a s s i g n e d the  loop  i t attractive  F o r a LSSN w i t h L l o o p s , e a c h o f t h e  b i n a r y r e p r e s e n t a t i o n o f an  [0,L-1],  be  Connection  description  have  configuration.  the  with  t o t h e example o f F i g . I I I . 1 ,  connection; a loop  labeled  example  Scheme and  readers refer  the  An  feature renders  (Rrs),  in Fig.III.1.  The the  this  receivers  s e t t o "0"  and  i^'^ _i* * s  the  stage  s - t h b i t of t h e a d d r e s s that  of the r i g h t  output  of  and the link  51  set  to  "1".  One  could verify  this  scheme on  t h e example  of  Fig.Ill.1.  Consider a switch s u p p o s e one then  of  i t s output  i t realizes  the  LSSN (£ ,. •«£ . • S  L  located links  following  in  i s part  s  i s the  complement o f  The  connection function  loops  with  connected  2.B.  labels  Routing In  whereas  transmitters  the  the  in their  -t^.  s - t h s t a g e , any s-th bits  two  will  be  stage.  (Rrs) are links  (Trs)  identified  to  need  by  using  which they are not  generate  be  the  connected,  identified.  a packet  To  w h i c h has  the  format: ; destination  count,  formed,  feedback  field  only  output  where f ' f " i s a 2 - b i t  newly  differing  a message, a T r w i l l  <f'f"  feedback  at the  LSSN, r e c e i v e r s  of  following  that  one's  Scheme  address  dispatch  states  by a s w i t c h a t t h a t  and  function:  &  1  stage,  loop  » where l  l  L  s-th  of the  connection  = *- ,- • - ^ - • •  s  the  and  and  ; message >  which w i l l  is initialized  incremented  paths.  would never  field  address  be  referred  i t will  r e q u i r e more t h a n  as  the  t o z e r o when a p a c k e t  whenever t h e p a c k e t  L a t e r on,  to  two  be bits  shown  goes  is  through  that  r e g a r d l e s s of  this the  52  network to  size.  The a d d r e s s  be t r a n s m i t t e d a r e a l s o Two  o f t h e Rr and t h e  message  c o n t a i n e d i n the packet.  t y p e s o f s w i t c h e s , namely Type-A and Type-B,  be c o n s i d e r e d i n our s t u d i e s , shown  actual  in Fig.III.3.  and t h e i r  schematic  will  diagrams are  o o o o  Loop //  o o o  o o  o  i.-HI  o o  Stage 01  o o o o  o o  o o o  o o  o o  o o o  o o  o  fH fH  o  o o  o o  •H iH O  o o  o o o o  o  o  o iH  o  »HoO  o o o  fH O  r-l  o o  o  o  O iH  i-l  o o o  O fH O  o  Stage 10  iH O  o  Stage 00 o o o o o o  o o o  o  X o  O  o o o  o o  >H  O tH O  o  o o  o  ><  O  o  o  o o o  o o  O  TH  L  t fH H o  o  Stage 11  Fig.III.1. A s s i g n m e n t o f l o o p and l i n k l a b e l s LSSN w h i c h h a s 16 l o o p s a n d 32 s w i t c h e s .  on  Fig.III.2. d e v i c e s on connection connection  C o n n e c t i o n of t r a n s m i t t i n g and r e c e i v i n g a LSSN w i t h 16 l o o p s . (x= a h a r d w i r e d f o r a t r a n s m i t t e r , T r ; 0= a h a r d w i r e d f o r a r e c e i v e r , Rr.)  55  a)  Input buffers  Output  ports  b) loop  (l^.-.-lj)  • •  I  7fT> C CM  Input ports ec tr. CO  c  Duffer pools  o  stape  (s^.-.Sj) Intermediate  ports  Output ports  output l i n k  »  data and control s i g n a l s status signals Indicating a v a i l a b i l i t y o f c l a s s O and c l a s s 1 buffers  F i g . I I I . 3 . The s c h e m a t i c d i a g r a m s o f a Type-A (a) a n d , a Type-B s w i t c h ( b ) .  switch  56  A  Type-A  conventional built-in Type-A into  "0",  one  main  shown  output  Class-0,  i t  is  i n the  i t has  When a p a c k e t  two  enters a  first  switched  placed  to the l e f t  address  is a  i f that b i t i s a " 1 " .  port  s t r u c t u r e than  a  has a  Type-A  c l a s s e s of f i r s t - i n - f i r s t - o u t  Class-1  intermediate  and  ports  Class-2.  which  b u f f e r s a s shown.  indicating  are  input  slightly switch.  switches  lines  g i v i n g t h e same i n f o r m a t i o n  Class-k  accomodate p a c k e t s buffer  is  Class-0  and C l a s s - 1  links),  —  with  connected  the  where  the  intermediate  status  and C l a s s - 1  (which a r e connected  from and  i t s succeeding right  output  k i s i n {0,1,2} —  output  to i t s  ports.  be e x p l a i n e d  of  k;  The  t o the  status  switches links).  i s used t o the  port d i r e c t l y  b u f f e r s a r e connected  When a p a c k e t  four  t o t h e C l a s s - 0 and  of i t s Class-0  a feedback count to  v a r i o u s mechanisms w i l l  contains  a n d two s e t s o f i n c o m i n g  to i t s left  buffer  (FIFO) b u f f e r s :  I t h a s two s e t s o f o u t g o i n g  and  (which a r e connected  also  connected  the a v a i l a b i l i t i e s  to i t s preceding right  It  left  through  that  i n F i g . I I I . 3 , a Type-B s w i t c h internal  used  f e a t u r e s a r e the " s t r u c t u r e d b u f f e r p o o l s " which a r e  made up o f t h r e e  The  those  i f the s-th b i t of i t s d e s t i n a t i o n  more c o m p l i c a t e d  buffers  buffers.  l o c a t e d i n the s - t h stage,  or to the r i g h t  lines  to  o f i t s i n p u t b u f f e r s , and then  port  Class-1  similar  networks, except  first-in-first-out  As  Its  is  packet-switched  switch  output  switch  Class-2  while the  output  functions  port  of these  l a t e r on.  e n t e r s a Type-B s w i t c h  l o c a t e d i n the s-  57  th  stage,  i t will  undergo the f o l l o w i n g operations:  (a) From an i n p u t p o r t t o t h e will its  be  buffer  pool:  The  packet  p l a c e d i n t o one of the b u f f e r s a c c o r d i n g t o  feedback  count;  (b) From t h e b u f f e r p o o l t o t h e o u t p u t p o r t : C a s e d )  From  a C l a s s - 0 and Class-1 b u f f e r s : I f the s - t h b i t of the destination address packet will then  to  of the packet  be s w i t c h e d t o t h e l e f t  the  left  output  switched to the right right output port. The  packet  connected  will  port;  else  the  be  forwarded  the  to  the output  port  s w i t c h i n g and  At  the  output  d e s t i n a t i o n address of the packet w i l l  will  receiver attached to that output removed  from  r e c e i v e r ; e l s e the next the  output  be  forwarded  transmission via  to_  be  the intermediate ports.  occurs, then a strobe s i g n a l  be  will  C a s e ( i i ) From a C l a s s - 2 b u f f e r s :  matched against that of the output  will  i t  intermediate port then  From an output p o r t t o t h e e x t e r i o r : port,  the  intermediate port  to the Class-2 buffer without  going through  (c)  i s a "0", then  link will to  be  I f a match sent  to  output  port  by  other  the  port.  that  end  be s t r o b e d a n d t h e p a c k e t input  the  l i n k , and the packet  switch at the  its  between  the feedback  the  link.  be  For  l a s t and the f i r s t  l o o p s , t h e same o p e r a t i o n w i l l  of will the  stages take  58 place,  but  packets  emerging  stage w i l l  These  addresses gates  determine  the  or  links,  because  through  i t have t o be  be  involving  (e.g.,  controller.  will  shared  network  give rise  will  According  the  switch  the l a b e l s  feedback  Type-A  by  storing  must  be  to  logic to  stage of  i t s output  of those packets p a s s i n g Similar  switch,  features  but  those  must*  hardware  c o u n t s may n o t be i n c l u d e d .  of packets  i s performed  i n the next  of c e n t r a l  among t h e p a c k e t s  section.  by  of not r e q u i r i n g  On t h e o t h e r hand, t h e l a c k to c o n f l i c t s  locally  f o r the  the e f f e c t s  on LSSN when Type-A and Type-B s w i t c h e s  be d e t a i l e d  the  able  i n the l a s t  assigned  counts  incremented. a  the  must be made  r e s o u r c e s s u c h a s p o r t s and l i n k s ;  such c o n f l i c t s  last  referred  a n d t h e r e must be some  not i t i s l o c a t e d  the routing  control  used  operations  links  o f t h e s w i t c h e s , LSSN h a s t h e a d v a n t a g e  a central  of  in  the feedback  Since each  the  present  the  to those d e p i c t e d i n F i g . I I I . 3 .  switch),  t h e network by e x a m i n i n g  also  of  be c o l l e c t i v e l y  of i t s output  the matching;  whether  port  of those  a b o v e , t h e Type-B s w i t c h must c o n t a i n  t o the matching  to perform  count  s t e p f o r t h e Type-B s w i t c h .  the addresses  inside  the output  operations w i l l  features in addition  of a l l ,  available  from  the feedback  incremented.  routing  to the d e s c r i p t i o n s  First  be  three  to as a s i n g l e  following  in addition,  are  59  3.  Network  Properties  First, of  LSSN  then  with  the  than  the  one  packet  S  , ,  will  given  be  information,  that  the  even  by  A,  packet the  The  and  only  the  L'  operation  2.B);  and  first  the  least  L'  the  operation observation  leads  Lemma 1 1 1 . 1 : which  is  behaviors  will  be  stated,  presence  proofs  a l l the  Consider  of  of  more  a l l these  algorithms  used  t into  Example: C o n s i d e r packet loop  destined (1111)  generated.  of  (S'+L')  bits  significant  bits  are  involved  Operation  (b)  of  Section  are  bits,  involved  (c) of  together  in  Section  the  2.B)  of  with  matching  only.  This  f o l l o w i n g lemma:  a LSSN w h i c h has  L loops  address  The  within  the  addresses  significant  bits,  f o r the  1  destination  (i.e., most  S'=riogL'l.  loop admission  to the  the  consists  least  Operation  destined  L ' = l o g L and  S'  significant  (i.e.,  though  a packet  switching  is  single  examined.  i n Appendix  *1^L'*^1carried  in  of a  the  base-two.  Notice A  presence  v a r i o u s p r o p e r t i e s of LSSN w i t h  theorems are are  some u s e f u l t h e o r e m s c o n c e r n i n g  a  packet  • ••a^L** • • • • ^ i , where  packet  L'  and  will  steps  of  be  routed  routing  to  the  after  its  LSSN.  a LSSN w i t h  f o r the  within  L=16,  address  4 steps  of  then  L'=4  (101111) w i l l  and be  S'=2.  routed  r o u t i n g r e g a r d l e s s of  A to  the  where i t  60  Lemma I I I . 2 : C o n s i d e r is  destined  and  S' = rlogL'. 1. 1  L along  that  Theorem  routing  f o r the address After  loop to reach  to  In t h e  with  III.2:  has  which  where L ' = l o g L  been r o u t e d  to the  (L'-1) s t e p s  of  within  of  loop  matching  packet -1)  will  be  steps  of  generated.  i s (2*4  number  packet  a  (21ogL  Fig.III.1,  average  a result  loops,  it is  routing steps  The  needed t o d e l i v e r  L  destination  example  maximum number o f  Theorem  A  packet  i t sdestination.  In a LSSN its  a  s ' * ' * 1 ^ L ' • • • • ^1,  the packet  r e g a r d l e s s o f where  Example:  A  L l o o p s and  , i t needs a t most a n o t h e r  III.1s  delivered  a LSSN w i t h  L=16; -  of  therefore  the  1)=7.  routing steps  i n a LSSN w i t h  (ARS)  L loops i s ,  ARS(L)=(3logL-1)/2+2/L-1  E x a m p l e : In t h e  example of F i g . I I I . 2 , s i n c e L=16,  A R S ( L = l 6 ) = ( 3 l o g 1 6 - l ) / 2 + 2/16  Corollary the  1 1 1 . 1 : Any  feedback  path  packet  a t most  admitted  twice.  -1  =  therefore,  4.625.  i n t o LSSN w i l l  go  through  61  E x a m p l e : I n t h e example o f F i g . I I I . 1  a n d 2, i f T r 4 9  is  a packet  attached to link  10 0000 -- s e n d s  is  attached to link  the  feedback  paths  01 1111, t h e n  t w i c e : The  (1000), and t h e second  Corollary carry why  two b i t s each  classes  the  III.1 explains  probability  t o Rr32 -- w h i c h  packet  time  will  go t h r o u g h  through  the loop  which  the  loop  (1111).  why t h e p a c k e t s o n l y have t o  i t s feedback  count  f ' f " , and  also  p o o l o f t h e Type-B s w i t c h i s made up o f t h r e e  of b u f f e r s  Theorem I I I . 3 :  first  time through  to indicate  buffer  this  —  regardless  o f t h e network  s i z e L.  I n a Type-B s w i t c h o f a LSSN w h i c h that  the  will  destination  result  packet  match t h e l a b e l  switch,  and h e n c e t h e p a c k e t  will  address  carried  o f an o u t p u t  be removed  has L l o o p s ,  from  link the  by  a  of the network  is:  P  r e m o v e d =2L/{3LlogL-L 4} +  where  the  receiving that  transmission  port  pattern  o f t h e network  is  i s such equally  that  each and e v e r y  likely  to  receive  packet.  Theorem  III.4:  The maximum a v e r a g e  LSSN w i t h L l o o p s i s :  throughput  rate  (MATR) o f a  62  MATR(L)=3/2xS  is  where S. R,SW  between two  3.A.  R,SW  t h e maximum r a t e  s w i t c h e s v i a an  Network  f o r the  p o r t s and  data  If be  two  Type-A  A1  Al  rise  - which a r e  r e q u e s t s made  buffers,  conflicts  A simple  for  packet  link.  i n LSSN,  r e s o u r c e s such as  p o r t and  f o r the  to  they  input  may  buffers,  conflicts.  i n LSSN, t h e n  there  would  the Tr  and  t h e c o n t e n t i o n s due by  s h a r i n g the  will  fairness.  i s t o honor  the  same l i n k ,  s w i t c h a t t h e end  discipline  conflicts  in  the two  t h e c o n t e n t i o n s between  round-robin ensure  packets  to  same O u t p u t p o r t ;  - which are  same i n p u t p o r t o f t h e  conflicts  Result  conflicts:  output  of  transmitting  o r more p a c k e t s  thus g i v i n g  conflicts  input  A2  two  s w i t c h e s are used  simultaneous  (b)  output  same network  links,  t y p e s of  (a)  of  Conflicts  When t h e r e a r e contend  xlogLxL**2/{3LlogL-L+4}  the  can  f o r the  the  link.  r e s o l v e both  types  A better  input buffer  of  an  alternative  w h i c h has  more  63  waiting  packets  occupied,  then  conflicts,  i n i t ; and i f b o t h an a r b i t r a r y  the  output  input b u f f e r s  buffer will  over  t h e T r ' s so t h a t t h o s e  admitted  into  t h e network c o u l d  (i.e.,  a faster  observations Appendix  obtained  following  (the  be  the  which a r e a l r e a d y destination faster  obtained). simulator  These  listed  in  reader  may  there  are three p o s s i b l e  refer  to  Fig.III.3  types  f o r the  descriptions):  B1  conflicts  requests buffer  B2 the  B3  pools,  a r e due t o t h e from  the  simultaneous  left  and  Class-2  among t h e  buffer  f o r the  port;  conflicts  - which a r e the c o n t e n t i o n s  between an  p o r t and t h e T r s h a r i n g t h e same o u t p u t  the  input  port  at  the  end o f t h e l i n k .  c o u l d be r e s o l v e d by a  discipline:  right  port;  - which a r e the c o n t e n t i o n s  i n t e r m e d i a t e p o r t s and  conflicts  the  which  f o r t h e same i n t e r m e d i a t e  conflicts  output for  -  made by p a c k e t s  same o u t p u t  (c)  could from  f o r Type-B s w i t c h e s ,  of c o n f l i c t s  (b)  time  their  As f o r A2  be g i v e n a h i g h e r  packets  reach  equally  C.  As  (a)  response  were  be c h o s e n .  ports perhaps should  priority  are  the  conflicting  intermediate port  simple  packets  alternately.  link, B1  round-robin  are switched to  64  The Our  r e s o l u t i o n o f B2 c o n f l i c t s  s i m u l a t i o n s t u d i e s showed t h a t  would g i v e packets,  rise but  throughput based  t o unbearable much  better  r a t e and d e l a y ,  policy  (an  which a s s i g n e d then  the intermediate  and  finally  buffers.  the  performance, could  priority  With t h i s  t o the  port  packets  port  becomes empty; a s c o u l d be e x p l a i n e d  they  need  will  not  Furthermore,  always remain  go  since  use of t h e output  the  size  next  is  highest  port. is  of the Class-2 empty,  will  t o be " e l i g i b l e "  incoming  status lines  next  switch  last  and f i r s t  Class-k  buffer  will  is  i t  is  "eligible"  by Lemma  1  not  in  difference  and  2,  therefore port.  priority in  access  When  port with the  to  the  to the Class-k  non-empty, the Class-k  the  and  output buffers i f the  buffer ofthe between  port connected  the  to to the  i f i t i s non-empty a n d i f t h e switch  The  when t h e o u t p u t  As f o r t h e c o n n e c t i o n s  an i n t e r m e d i a t e  buffers  n o t a c c u m u l a t e and h e n c e  port connected  C l a s s - ( k + 1 ) b u f f e r o f t h e next full.  Class-2  the highest  be g r a n t e d  i f  buffers,  to the Class-0  intermediate  indicate that  i s not f u l l . stage,  b u f f e r , and  intermediate  are assigned  the " e l i g i b l e "  priority  i n S e c t i o n 3.B)  b u f f e r s i s a l w a y s bounded.  An i n t e r m e d i a t e  said  immediately  any  p o r t , they  priority-  i n t h e same l o o p s ,  throught they  the  Class-2  port  a  Class-1  the  switched  packets  the output  with  connected in  certain  i n terms of average  be g i v e n  are  these  to  policy,  to  to the Class-2  port connected  discipline  delay  be o b t a i n e d will  intermediate  intricate.  the round-robin  propagation  explanation  the highest  i s more  i n the  the  above  first  stage  is  definitions  of  65  "eligibility" count  is discernible  of a packet  feedback  paths  status lines  i f one  i s incremented  r e a l i z e s that the  whenever i t goes through  back to the f i r s t  stage.  The  the  i s t h e r e f o r e to help prevent the output ports  and  s w i t c h e d away  priority-based policy  policy the  However,  The forward:  our  performance  the feedback  of  than those o r i g i n a t e d  at  be a s s i g n e d h i g h e r studies  i s concerned  priorities  shows t h a t  But  be  rather  straightgranted  in addition, i t is necessary  for the  full  and Avoidance  the input  port)  before i t can t r a n s m i t .  of the C l a s s - 0 b u f f e r has  conventional  as  the output port are  newly admitted packets c a r r y feedback  In  is  that the Class-0 b u f f e r (not just  Deadlock  far  (an e x p l a n a t i o n w i l l  conflicts Tr and  entry point i s not  availability  this  section).  conflicting  access alternately.  3.B.  counts  simulation  r e s o l u t i o n o f B3  the  Tr to check  be  packets  i s s u p e r i o r than the round-robin d i s c i p l i n e as  overall  the  sooner  s t a g e s , and hence w i l l  o f f e r e d a t the end of next  at  cannot  would favor those  stages, because  these packets are incremented  sooner.  which  immediately.  o r i g i n a t e d at the lower  the upper  purpose  the  of  input p o r t s from being clogged with packets  The  feedback  t o be  checked  counts of  The  because  zero.  Method  non-recirculating, packet-switched  66  networks,  the blockage  as l o n g a s LSSN  —  scheduling  policy;  uses  Type-A  the  situations  i n which c e r t a i n  no  switching  w i t h p a c k e t s and these  to data path c o n f l i c t s  there i s a f a i r  which  deadlocks  due  loops,  switches,  further  and  very  soon  the  temporary  whereas  blockage  can  is  migh  lead  loops are take  in  a to  clogged  place  along  whole network w i l l  become  impassable.  The  deadlock  store-and-forward  problem  type  r e q u e s t s o f network  output  rapidly;  and  particular first  ports,  in  attributable  movements  deadlock. similar  and  A  then  the  contend  for  be  filled  up  p o r t s along'a and  i f the  are w a i t i n g f o r these this  loop w i l l  "multiple-loop"  manner but  if  in transit,  input buffers freed,  output  the  cyclical  switch,  input buffers w i l l  with packets  to  the  i n p u t b u f f e r s always  i n p u t b u f f e r s and  p o r t s t o be  a  is  In a Type-A  the  of a l l these  "single-loop" produced  then  i f a l l the  output  data  i t s two  loop are f i l l e d  packets  occupied  of  resources.  p a c k e t s coming out of the  i n LSSN  enter a  deadlock  is  i t i n v o l v e s more t h a n  one  loop.  According of size  deadlock  t o our  c o u l d be  o f b u f f e r s and  certain entirely, scheme  level; and and  but  reduced  this  recovery  approach it  studies,  significantly  restricting  moreover, a  simulation  the  input  the by  probability  increasing  load  down  d o e s not e l i m i n a t e  requires  procedure.  a  deadlock Perhaps  it  the  to  a  deadlocks detection is  more  67  efficient  t o get  cyclical  around  requests  the  of  deadlock  the  network  s w i t c h e s a r e meant f o r s u c h a  Our is  based  on  forward  by  the  if  length a packet  then  i t may  Clearly,  Raubold  of  the  i s of be  method  packets  been p r o v e d  the  pools, counts, as  has  is  as  they w i l l  into K classes, i n the  the We  resources  are  condition  that  temporary. following  links  granted  With  to  their  where  K  is and  i t s transmitter, such that  that  k<r<K.  K must be  e l i m i n a t e t h i s drawback feedback  counts  of Type-B s w i t c h e s , t h e LSSN w i l l type  follows:  their  put  a by  which  bounded.  request  shared  pools"  network c o n c e r n e d ,  drawback  of  deadlocks.  for packets  A  request  free  simple buffer  their  feedback  on  buffers;  the  and  input/output p o r t s , these  network  to  the  on  requesting  o c c u p a t i o n s by this  be  e n t e r i n g the  buffers according to  t h e r e f o r e t h e r e i s no c i r c u l a r  f o r the  Type-B  deadlocks  According  Class-k buffer  size.  store-and-forward  explanation  [29].  according to t h e i r  t o be  buffer  s t e p s away from  p l a c e d i n t o any  W i t h t h e use of  Haenle  r routing  and  to prevent  "structured  longest path  o f t h e network  classifying  of  and  avoiding  purpose.  pools are d i v i d e d  their  function  has  concept  by  resources;  i d e a of u s i n g Type-B s w i t c h e s  method, b u f f e r the  problem  idea  packets  the packets  i n m i n d , now  we  will  always  will  state  the be the  theorem:  Theorem I I I . 5 :  The  LSSN w h i c h u s e s  Type-B s w i t c h e s  i s deadlock  r  68  free.  An e x p l a n a t o r y p r o o f o f Appendix  3.C.  III.5  is  Very  often  most,  i t i s d e s i r a b l e to expand a network but u s u a l l y such an e x p a n s i o n  i f not a l l ,  useful property that  existing designs.  i t  can  be  is  addressing  facilitate  expanded  the  must  be  complicating  Of  to expand the b a s i c s t r u c t u r e while  is  to  add  the  after the last  expanded  course,  sufficient  way  shuffle  by  to  address  f o r the added stages and d e v i c e s .  new  stages  LSSN  a n d we w a n t would  to  have  keeping i t  the bottom of i t  stage of switches  takes place.  stages originally,  shown:  algorithms.  there  difficult  incrementally  One  immediately  L'  routing  the expansion,  lines to account  perfect  and  after  LSSN has t h e v e r y  a d d i n g more s t a g e s t o t h e b a s i c s t r u c t u r e without  intact  in  Extensibility  i t has been b u i l t ;  the  given  A.  Network  with  Theorem  and  before  Suppose there are L loops t o add L" more s t a g e s , a t o t a l of (L'+L") stages  — the and  then as  69  stage  number, s 1 2  } }  •  • •  y s  L'  }  L' + 1  ]  •  J  •  J  •  the  stages  S"=riog(L'+L")1 digits  and  output  Right LSSN  stage;  and new  routed  x  In  than  from  stage  the  addressed bits  scheme  and  , _^ . . .  of b i n a r y connection  ( A g „ . . . A ^ l - d ^ , ^ . . . .-t^ )  = (7 ,  t )  L  no  x  stages  w o u l d be t r e a t e d much t h e  shuffling  and p a c k e t s to  using  )  of the b a s i c s t r u c t u r e ,  is  the  among t h e o u t p u t  which L'-th  i . e . , the L ' - t h  are bits  sent of  to  the  links them  of are  destination  packets.  is  the L ' - t h ; such  during  be  stages a r e :  t h e e x p a n d e d LSSN, t h e counts  will  (Ag„ . . . A^-^  a l l t h e new  there  of the  =  link=  t)  according  addresses  feedback  link  L  stages;  Rr's  The a d d r e s s i n g  (<e. ,  same a s t h e l a s t  these  and  1  output  In words,  stages  ]  f o r t h e newly added Left  additional  {\log(L'+L")1+L }={S"+L'}  respectively.  function  basic structure  ]  L'+L"  Now  -  1  performed  duty by t h e  of  (L'+L")th  a m i n o r c h a n g e h a s t o be  expansion.  We  incrementing stage  taken  do n o t i n t e n d t o d e r i v e  s c r a t c h f o r t h e e x p a n d e d network  the  rather  care  of  theorems  because the v a l i d i t y  of  70  Corollary  .111.1  a n d Theorem I I I . 2 a r e d i s c e r n i b l e  stages  are regarded  i.e.,  if  single, that  stage  as the s u b s i d i a r i e s of  L' through  compound  stage.  Lastly,  4.  i t s loop  Simulations  are  and Performance  the  comparisons. number  of packets  between up  rate  flowing  delay"  is  for  defined  out  way by  the  and r e c e p t i o n s .  and " p r o p a g a t i o n  and  evaluations  and  as  the  average  average time,  interval  The d e l a y  delay",  that a packet delay  rate  t h e network p e r u n i t  i s made  where  entrance  has t o wait  at the  i n c l u d e s t h e time spent  t h e v a r y i n g parameter and i s d e f i n e d as  next  point  throughput  and s w i t c h i n g w i t h i n t h e network.  between t h e l a s t the  used  i s d e f i n e d as  e n t r y p o i n t , and p r o p a g a t i o n  is  the  throught  i s the average d u r a t i o n  queueing  to  as a  Analysis  measures  generations  of "entrance  delay  two  of packets  their  like  i n t h e more e x p e n s i v e  studies,  Throughput  the delay  stage,  count.  In o u r s i m u l a t i o n delay  L'-th  (L'+L") a r e c o n s i d e r e d we w o u l d  LSSN c o u l d a l s o be e x p a n d e d  doubling  and  stage  the  i f t h e new  successful transmission  Request the  in  interval  average  time  and t h e g e n e r a t i o n of  packet.  In facilitate  order  to  obtain  the a n a l y s i s later  assumptions:  some  meaningful  o n , we have made  results the  and t o  following  71  (a) The  t r a n s m i t t i n g and  selected  (b)  The  of  the  transmission  request the  out  entire  pattern  to transmit  transmitter  receiving  pairs  address  are  space;  i s such t h a t  i s in process  affected will  randomly  i f the  or  not  current  blocked,  generate  then  the  next  request;  (c) Packets they  (d) As  are  removed  arrive  at  immediately  from t h e  their destination;  f o r t i m i n g c o n s i d e r a t i o n s , the  delay  in going estimated  t o be  path  selection  and  their  switches  five  two  switch  would  ones:  selection,  f o r data  to the  buffer pools,  intermediate selection ports  to the  delays.  ports  and  inside Class-2 shorter  the  because  more gate  from the  and  another  transfer ports — case  they  i n our delay  their do  not  Since on  those  path  input  ports  have t o go  to  for  the path  intermediate twelve  packets  switching  the  for  five  of  LSSN,  than  buffer pools  from t h e  for  studies.  delays  a total  of  switch  [30]: three  t r a n s f e r from the  two  buffers,  binary  deadlocks  included  need  switching  transfer.  to  three  data  output  In  be  delays  data  lead  not  conventional two  gate  for  would  analysis will  A Type-B  amount o f  through a conventional  was  Type-A  network when  which  gate are  delay  are  through  the  72  intermediate ports. simulator  is  individual  to  The  main  compute  packet  by  the  summing  s w i t c h i n g as w e l l as queueing  These assumptions have  also  appeared  switching networks simulator,  a  (e.g.  references  was  From time  deadlocks.  The  connected  with  up  its  to time,  type  —  which  [25,30]).  an  LSSN under study  had  would  was  confirmed  b u f f e r s was Class-2 and  b o u n d e d t o two.  b u f f e r s was  Class-1  range).  Our  variation  of  requested For  f i x e d a t two,  buffers  were  the  4 t o 14  show  more  traffic  network, its  were the type  s i z e of  the  Class-0  (an a r b i t r a r y  have a s i g n i f i c a n t  i s reduced,  The  Class-2  the s i z e s of the  used,  delay of a packet  fully  i n t h e B2  the  i s t h a t when more b u f f e r s were  entrance  was  of  Fig.III.5)  the  to  that because  reason  into  was  performance  t h i s reason, and  r e s u l t s (please see  introduced  will  receivers.  size  v a r i e d from  t h e s i z e s do n o t  entering  subjected  b u f f e r s were g i v e n the h i g h e s t p r i o r i t y maximum  LSSN  that i t  16 l o o p s a n d  first  the  packet  i n d i c a t i o n of p o t e n t i a l  network  it  packet  be  64 p a i r s o f t r a n s m i t t e r s a n d  and  they  In the  of d e l a y s  the  of c o n f l i c t s ,  and  the whole s w i t c h i n g a r r a y  is  investigated,  each  entrance,  justifiable,  e f f e c t s o f t h e b u f f e r s i z e on  Class-2  LSSN  t o t a l d e l a y s of  a s s o c i a t e d with each  i n s p e c t e d t o make s u r e t h a t no p a c k e t delay  the  delays.  are considered  the network, so as to r e c o r d each  substantial  of  in the s i m u l a t i o n s t u d i e s of other  timer  encounter.  duty  there  that  the  effect;  the  would  be  although  propagation  the delay  73  would  be  increased; as a result,  the total delay  i s n o t much  affected.  In t h e second performance  of  a  part  of  our  study,  we  The b a s e l i n e networks were  because they a r e t o p o l o g i c a l l y equivalent  not entirely  used  as  either  circuit-switched  network  [24]),  or  to  the  that  networks  at  of  seven,  packet-switched as  LSSN s w i t c h e s conventional  ones.  The  results  presented  in  Fig.III.5;  In factor The  only;  logic  gate  comparisons,  we  i n the baselines had a buffer  and Class-2  obtained other  for  the  b u f f e r s were f i x e d  a  total  this  buffer  switches,  of  buffer  sizes  16  size  would  as are  produce  s i m i l a r to these.  F i g . I I I . 5 , a l l t h e measurements were s c a l e d by t h e  f which i s the operating frequency  v a l u e o f f c o u l d b e a s h i g h a s 6 0 MHz  fabricated with TTL gates, 64x64  Star  networks  For  a n d two, r e s p e c t i v e l y —  well.  The  have a much h i g h e r  used  i t s Class-0, Class-1  results very  studies  networks, whereas LSSN i s  packet-switched  the switches  seven  existing  (e.g.,  s i z e o f 16 ( a n a r b i t r a r y n u m b e r ) ; a s f o r L S S N sizes  many  because b a s e l i n e - l i k e networks c o u l d be  used  the  density than assumed  fair  be  furthermore,  to  considered  We m u s t e m p h a s i z e t h a t o u r c o m p a r i s o n  are  intended  the  64x64 LSSN t o t h a t o f a 64x64 b a s e l i n e a n d  then a 16x16 b a s e l i n e .  networks [ 2 0 ] .  compared  baseline  would  o r 400  MHz  of  the  networks.  i f the switches are  with  ECL  gates.  n e e d a t o t a l o f 192 s w i t c h e s  A  whereas a  74  64x64 LSSN and each.  a  Not  16x16  only  the c o m p l e x i t i e s  baseline  the numbers of  of the  h a v e t o be c o n s i d e r e d  Our short, 16x16  the  but  both  the  that  of  the  that  of the  16x16  savings  be  about  the  of the LSSN w i l l  be  switch  are very  and  have  architectures;  the times  than  will  request  a  count  64x64  40/f,  approach  interval  is  reduced  to  and  hence fewer  a  wiring.  sizes  of  the  Outlook  network  schemes, and  with  large.  described  we  performance  baseline  i s e v e n more s u b s t a n t i a l when t h e  communication  very  three  i s longer  of the LSSN If  match that of  lower  Discussions  Although  delay  is  baseline.  networks considered  We  to  wiring  interval  summary, our r e s u l t s i n d i c a t e t h a t t h e  significantly  routing  of  be c l o s e t o t h a t o f  interval  baseline.  the delay  of a 64x64 LSSN can  5.  amounts  request  will  request  t h r o u g h p u t and  increased,  In  This  i f the  delay  the  64x64  switches  well.  its  when  32  would c o n t r i b u t e  the  throughput of the LSSN w i l l  then  further  as  require  switches  networks, but  r e s u l t s show t h a t  b a s e l i n e , and  higher;  would  based  have  also  several  packet-switched,  a on  novel  method  the  concept  presented  the  properties  of  cyclically  connected  to of  set  up  cyclical  addressing the  a  and  network.  structures  are,  75  susceptible  to  the  used asynchronously, scheme based  The  store-and-forward type of deadlocks we h a v e s u g g e s t e d  on some u n i q u e  topology  of  our proposed  some  attached  to  the f i r s t  stage and receive packets  would  ones.  be  reduced  If  network  only  L  LSSN  resembles  processors  from t h e l a s t  t o an i n d i r e c t  n-cube  LSSN a l s o resembles  could  i t  then  b i n a r y n-cube [ 2 1 ] .  be  the last  sorter [13];therefore,  are  through  stage,  implies that those useful algorithms  for the indirect binary  bitonic  avoidance  LSSN such t h a t they a l l t r a n s m i t p a c k e t s  This similarity  easily.  deadlock  features of our design.  that of  LSSN  existing  a  when  is  adapted  developed for  LSSN  stage of the Batcher's possible  to  perform  B a t c h e r S o r t on LSSN p r o v i d e d t h e r e e x i s t s a masking  scheme t o  disable  items a r e  some  of  c i r c u l a t e d around connected of which  and  the  attached  the network.  processors as data LSSN c o u l d a l s o be  partially  used as an a r b i t r a t o r or a d i s t r i b u t o r —  are essential  i n the designs of data-driven  both  computers  [4,5,8,52,53]. LSSN  can  permutations  —  of  information;  a  be  used  i . e . , one-to-one  side to the output presence  also  side.  central  Such  those  p a t t e r n s c o u l d be p r e - c o m p u t e d The  simpler,  Type-A  switches  perform  mappings —  permutations  controller  alternatively,  to  and  to  a p p l i c a t i o n and they w i l l not cause  from the input  would  compute  frequently retrieved  could  be  arbitrary  the used  when  used  t h e network  require the  for  routing control needed. such an  to deadlock  as  76  long  as  only  provided  one  there  permutation  are sufficient  is  buffers  performed a t a time, inside the  and  switches:  MAX{MIN(2**r0.51ogNl,N/2**[0.51ogNl),  B(L)>  MIN(2** L 0.51ogN J,N/2**[0.51ogN J)}  =  O(N**0.5)  where B ( L ) i s t h e number needed  of buffers  to avoid deadlocks,  of loops;  of  the  Type-A  switches  and N=LlogL, where L i s t h e  the worst-case delay  to perform  number  a one-to-one mapping  is:  Tmax(L) = 2**[0.51ogN]+N/(2**L0.51ogNJ =  and  )-'logL-1  O(N**0.5)  the average delay i s :  Tavg(L) = O(logN)  These r e s u l t s could B(L)  be f o u n d  in reference  [23];  because  and Tmax(L) a r e i n t o l e r a b l y l a r g e f o r l a r g e values  both o f N,  we d o n o t i n t e n d t o i n c l u d e t h e a n a l y s i s i n t h i s d i s s e r t a t i o n .  As i s shown  by o u r s i m u l a t i o n  results (Fig.III.6), the  performance of t h e LSSN i s not as good as that of the  baseline  7,7  network  for  applications  transmission  times;  which  out  send  executions  —  alternately  —  complexities  High  data  i f  technologies, often renders  as the results  then  t h e LSSN w i l l  proposed  system  (e.g.,  component  complexities but  short  of  compute  trades  off  our design.  and  counts, (i.e.,  wiring,  l o g i c gate  external  dispatch  design.  external  c a n be e a s i l y a c h i e v e d  too, many  inter-  instruction  be an a t t r a c t i v e  the system d i f f i c u l t  main motive behind  very  processors  hardware complexities  internal  have  the transmitters are processors  packets  i.e.,, i f the  Our  internal  but  which  hardware  etc.)  with  per switch). with  today's  components and w i r i n g  t o manage —  this  i s the  78  Throughput (packets/sec f : network f r e q u e n c y a: 64x64 LSSN ( b u f f e r = l 4 / l 4 / 2 ) b: 64x64 LSSN ( b u f f e r = 7 / 7 / 2 ) c: 64x64 LSSN ( b u f f e r = 4 / 4 / 2 ) Note-numbers i n b r a c k e t s r e f e r t o the o f C l a s s - 0 , 1 and 2 r e s p e c t i v e l y  2f-  sizes  Inter-arrival time 200/f(sec.) Delay(sec.) 1.5/f - c G  1.0/f  -  0.5/f  -  Inter-arrival time 40  »nrHI?'*  120/  80 ft  It  « a !5 !.LSSN. and d e l a y o f 64x64 5  Eff  t  o£  b  u  £  £  e  r  s  i  160/f  f  z  e  o  n  t h  e  200  throughput s  p  / f  (  s e c  .)  Throughput rate (packets/sec)  A  Baseline Network (64 Tr's x 64 Rr's , 192 Switches) Loop-Structured Switching Network (64 Tr's x 64 Rr's , 32 Switches) Baseline Network (16 Tr's x 16 Rr's , 32 Switches) f = Network Frequency (Hz)  Fig.III.6. The throughput rates of a 64x64 baseline, a 64x64 LSSN and a 16x16 baseline, versus the i n t e r - a r r i v a l time.  Inter-arrival time (sec)  40/f  80/f  120/f  160/f  200  /f  Delay(Sec/packet)  A' 1.4 "  1.0 f 0.6 ~  0.2 -  Inter-arrlval •time(sec)  40/f  80/f  120/f  160/f  200/  f  Fig.III.7. T h e d e l a y c u r v e s o f a 64x64 b a s e l i n e , a 64x64 LSSN a n d a 16x16 b a s e l i n e , v e r s u s t h e i n t e r a r r i v a l time.  CO  o  81  Chapter  IV.  Computer  (EDC)  1.  Design  and  Evaluation  of  The  Event-Driven  Introduction  1 .A.  Background In  cyclical  this  Information  chapter,  architectures  we w i l l could  e x a m i n e how be  the  supercomputer;  the  of the c o n v e n t i o n a l computer  respect,  and  then  the  of  applied to the design  high-performance shortcomings  concept  to start  approach  w i t h , we w i l l  of  our  of a  discuss  systems  in  this  design  will  be  identi fied. Conventional Von  mechanisms  similar refer  execution,  "control"  and "data"  data  termed  "control-driven"  sequenced  (Central passed  among  A highly  by  control  major  or  In by  are specifically  drawback environment  difficulties  programmer  Units).  refer  they a l l "Control"  t o t h e methods f o r computers  instruction  generated these  t o as  instructions  the  computers, and  assigned  t o these  compiler  has  to  reading  concurrency to  be  the —  careful  CPUs  data are memory  data.  o f u s i n g V o n Neumann c o m p u t e r s i s attributable  are  executions  by  writing  i n , specifying  the  scheduling  their  signals  and  mechanisms.  V o n Neumann  because  instructions  which  parallel  mechanisms  instructions.  Processing  locations  and  among  for  referred  machines,  and "data"  t o t h e methods  passing  are  systems a r e often  Neumann, o r s o m e t i m e s a s H a r v a r d ,  have very  for  computer  need  in a of,  either the with  the  82  generation are  not  of c o n t r o l s i g n a l s , corrupted  by  write operations. SIMD  (Single  Instruction there  instructions;  whereas  memory  Data  Another  major  distributions  are  improvement  i n t h e system  often  that  i f , and only  data  information.  and  control  mechanism  streams  of  often  mechanisms.  rise  t o uneven work  and a c c e s s  in  load  bottlenecks  the  by  in  i s always number  of  proportionate  performance.  i n " d a t a - d r i v e n " computers  of  are Von  are  For could  Neumann passed  the  directly  and t h e  [4]  much  the data producing  without  control  is  their  very  systems: from  instructions  i f , they  and b o t h  implemented  ones  storage,  t h e consuming  execution  behavior  to the d i f f i c u l t y i n  accompanied  consuming  t h r o u g h any i n t e r m e d i a t e such  control  t h e above s h o r t c o m i n g s ,  those  the  the  increments  not  i s such that data to  a  executable  system e x t e n s i b i l i t y  mechanisms  from  instructions  is  and  of r e s e a r c h  control  differently mechanism  achieve,  alleviating  and  in  streams [6])  of  instruction  gives  Moreover,  processors  data  of  which o f t e n  to  at  stream  i s attributable  difficult  aimed  t o overcome  asynchronous  the various  among t h e p r o c e s s o r s  goal  r e a d and  i n a MIMD ( M u l t i p l e I n s t r u c t i o n s t r e a m s  drawback  memory m o d u l e s .  during  M u l t i p l e Data  single  the implementation  The  i s easy  stream  i s only a  among  "load p a r t i t i o n i n g "  the  information  streams) system, t h e  accesses  complicates  wrongful  Such a drawback  system because  Multiple  t o e n s u r e t h a t memory l o c a t i o n s  going  mechanism  would be r e a d i e d f o r  have r e c e i v e d a l l t h e r e q u i r e d a  data  therefore  driven be  computer, i t s implemented  as  83  suboperations could  be  incorporated  easily  into  implemented  techniques  (e.g., data-flow  explicit  control  i t s data using  would  in specifying parallelism  a  the  multiprocessing contain  and  large  concurrency.  activities  amounts  systems  of  the  does  implicit  not conform  ready f o r execution  the  inefficient  data  mechanism  i n handling  instructions  is  very  control  of  task  of  data  of d a t a - d r i v e n because is  an the  extent.  As  appealing to which  asynchronous mechanism  to the notion  when t h e i r  computation  compilation  environments  operations,  large arrays  for  the  unstructured,  s u c h a s i n p u t and o u t p u t  necessarily Further,  approach  which  absence  to a great  multiprogramming  However,  data-driven  among  data-driven  The ease  programmmer result,  well-known  analysis).  mechanism  mechanism,  of c e r t a i n  which have  of  a r e not arrived.  systems  i s very  sending  arrays  b o t h t i m e and s p a c e  consuming.  s driven other;  From t h e above d i s c u s s i o n s , i t i s c l e a r and c o n t r o l - d r i v e n approaches therefore,  computers  which  the purpose  1.B.  which  adopt  i s very  combine t h e i r  of b e t t e r  Recent  This  i t  that  a r e complements  the^dataof  each  n a t u r a l to e n v i s i o n a c l a s s of  c o n t r o l and d a t a  mechanisms  for  performance.  Developments  section  will  t h e combined  examine  approach:  three  existing  proposals  84  (1) D e p e n d e n c e - D r i v e n (2) Combined  system  system (1982)  (3) P i e c e - w i s e D a t a - F l o w The D e p e n d e n c e - D r i v e n (Global  Control  Unit)  system  and  to  produce  a l l  c o m p u t a t i o n a n d t h e GCU This  system  heavily  is  computation  would  several processor clusters,  each  idle during their utilization The  static  perform for  executions,  up  of  The c o m p i l e r  is  information about  the  run-time  computation the  some  made  function.  however, cause  is  [34]. GCU  best-suited  vectorized;  (1983)  a  the  will  [32];  [33];  system  c a p a b l e of e x e c u t i n g a h i g h - l e v e l expected  (1981)  scheduling. w h i c h c o u l d be  presence  of  scalar  processing resources to stand thus  giving  rise  to  under-  of these resources. Combined  system  integrates  the concepts of  the  "pure" d a t a - d r i v e n computation and those of the " m u l t i - t h r e a d " control-driven computation. how  iterations,  c a r r i e d out  on  procedure this  T r e a l e v e n e t a l [33] have  c a l l s and r e s o u r c e management  system;  operations are performed.  shown  not  mentioned  is  how  If array operations are  array  decomposed  into individual packets each c o n t a i n i n g a p a r t i c i p a t i n g element,  then  associated packets,  with  there the  will  be  setting  enormous up  and  are  amounts of transmission  array overhead of  the  and a l s o to s y n c h r o n i z e the c o m p l e t i o n s of the a r r a y  operations. Requa e t a l  [34]  have  provided  a  rather  detailed  85  description  of  the  MIMD c h a r a c t e r i s t i c s  PDF s y s t e m w h i c h p o s s e s s e s b o t h -SIMD a n d —  t h e s e two c l a s s e s  p e r f o r m e d on d i f f e r e n t interchangeable.  We  operations could modules,  t y p e s o f hardware believe  be c a r r i e d  then there w i l l  s y s t e m would control.  be The  less PDF  that  system  to  design  avoids using  t h e number o f s c a l a r  eight;  therefore,  t h e speed  1.C.  Overview  Our  a r e not  and array  of  of  and  hardware  easier  to  any i n t e r c o n n e c t i o n  processors  o f t h e PDF s y s t e m  over e x i s t i n g  ones  to  about  i s expected t o  (please  refer  to  [34]).  o f Our A p p r o a c h  objective  multiprocessor (1)  Section  are  be fewer module t y p e s , a n d hence t h e  by l i m i t i n g  the I n t r o d u c t i o n  modules which  o u t on t h e same t y p e  expensive  improvement  computation  i f both s c a l a r  network  have l i m i t e d  of  system  is  to  design  a  heterogeneous  which,  i s capable  of  using  hundreds  to  thousands  of  processors; (2)  has  a  (million  projected  speed  r a n g e o f 100 t o 1,000 MOPS  operations per second);  ( 3 ) p o s s e s s e s b o t h SIMD a n d MIMD c h a r a c t e r i s t i c s is  to  be  data-driven (4)  achieved  by  combining  and c o n t r o l - d r i v e n  to  architectures.  depart  this  the p r i n c i p l e s of  computation;  i s intended f o r next-generation a p p l i c a t i o n s , expected  —  and i s  f r o m t h e p r e v a l e n t , Von Neumann  86  In o r d e r together large  and  processors  a  MIMD  speed  large  busy  large  number  of  processors  a high degree of f l e x i b i l i t y , a  most  in  s y s t e m must be a b l e performance  concurrency  mix to  to  fully  to maintain  attain  computation,  prevalent  Von  software;  therefore,  aspects  of  a  utilize  roughly  we  systems  only  design  its the  the  o f SIMD to  resources, the same  level  of  We a l s o b e l i e v e  achievement  toward  may have t o d e p a r t  from  in  and  both  emphasize rather  keep  application  of mix.  t h e new d e s i g n  to  the ratio  from  significant  Neumann  our  Since  differs  r e g a r d l e s s of the r a t i o  i n order  ultra-fast  of  of the time.  order  To  range, t h e intended a p p l i c a t i o n s  amount  instruction  application,  the  a  yet maintaining  the d e s i r e d  must p o s s e s s  that  connect  s w i t c h i n g network would be i n c l u d e d i n o u r d e s i g n .  achieve  and  to  hardware  the  than  architectural  any  immediate  implementation.  In o u r p r o p o s e d operations are  --  scheduled  suboperations execution is  operation  and  operation are  of  in  a  s t r e a m ) mode t h a n  faster SISD  sequenced  compound  sequential  when  program.  requires a fast executed  A  by many o f them  in  a  MIMD  sequential  by  ( S i n g l e I n s t r u c t i o n stream  for  alignment  computation  solely  but  operation  a r r a y o p e r a t i o n , an a r r a y  i s one w h i c h e i t h e r run  A  of  o f which  using data-driven p r i n c i p l e s ;  compound  a computational or a block  could  processor  for execution within a  types  a n d compound o p e r a t i o n s , b o t h  i n a c o n t r o l - d r i v e n manner.  either  program  scalar  s y s t e m , t h e r e a r e two b a s i c  a  time  single  S i n g l e Data  mode  (due t o  87  communications  overhead),  sequential process  such  shown  i s used to c o n t r o l  as p r i n t i n g  I/O  devices  SP 4  LMs  The  /  >l  EDC  in Fig.IV.1,  A Supervising Processor  a  number  Switching  Network  spread  the  of  the  an  EDC  executions  these  of  read/write  either  computation block  of  for  RPs  !  diagram.  c o n s i s t s of  (TPs)  and  main d u t y  data  into  on  of  the  to s t a r t  provided  (i.e.,  links  the  six basic parts:  Receiving  SP  Processors  left  a r r a y computation,  as w e l l :  i t will  when  become a  control  of  i n a SIMD manner, o r  s e q u e n t i a l p r o g r a m on  a  TP  —  of LMs array  and  i n v o l v e the receives  another  TP  the  and  and  to  using TPs.  alignments  s u b c o n t r o l l e r and  initiate  Packet  both  s e v e r a l TP-LM p a i r s to  a  i s to load  execution  the  (LMs);  bank o f LMs,  s e q u e n t i a l programs) w i l l  compound o p e r a t i o n , SP  The  and  links  Compound o p e r a t i o n s and  printer.  I n s t r u c t i o n R e g i s t e r s • ( I R s ) and  a p p r o p r i a t e TPs  read/write  line  ( S P ) ; a bank of L o c a l M e m o r i e s  (PSN).  instructions  initiate  the  system block  several Transmitting Processors (RPs);  on  an i n h e r e n t l y  A  F i g . I V . 1.-  As  or  use  of  such  a  request  for array  excution  of  i n a SISD f a s h i o n .  a  88  Information  flowing  encapsulated  into  on the  through  PSN t o RPs w h i c h w i l l  packets: from  different  into  streams  o f "MIMD" i n t h i s  case  are  or i n s t r u c t i o n switched  t h e p r o p e r LMs;  formation  of  instruction  i n s t r u c t i o n s and data  IRs t o w a i t  instruction  TPs  by T P s a n d a r e  the executable into  and  result  f o r free  Because each of t h e TPs w i l l  interpretation from  LMs  p l a c e them  f o r the  retrieve  LMs a n d b u f f e r them  execution. from  are generated  responsible  they  of  of e i t h e r  result  are also  right  form  packets:  RPs  packets  the  receive  TPs f o r  instructions  from  time  t o time, the  is  somewhat  different  t h e t r a d i t i o n a l one.  Other (a)  salient  Fewer  f e a t u r e s o f t h e EDC s y s t e m i n c l u d e :  module  types:  modules a r e u s e d , be  used  in  Only  although  large  of (b)  i s easier  module  each type  amounts.  d e s i g n c o s t s and g i v e r i s e which  a few t y p e s  to control  is  o f hardware intended  to  T h i s would r e d u c e t h e  t o a simpler than  architecture  one w h i c h u s e s a l o t  types.  Interleaved instructions  a n d skewed a r r a y s : A  of  t o s t o r e s e q u e n t i a l programs  the  LMs  which w i l l the  are  be e x e c u t e d  read/write  majority  of the  links LMs  programs which w i l l using case,  the  used  by t h e i r in  a  a s s o c i a t e d TPs SISD  a r e meant  be e x e c u t e d  packet-switched  the i n s t r u c t i o n s w i l l  subset  mode;  for  using  while the  non-sequential  i n a MIMD mode by T P s  network.  For the l a t t e r  be i n t e r l e a v e d  into  the  89  LMs  concerned,  thus  access  pattern  skewed  i n t o these  of TPs  be  referenced  reduce (c)  the  and  dependent  them.  on  Extensibility:  be  describe and  the  Section  comparison and  IRs  of  the  d e t a i l e d schematic  given  2,  s t o r e d and  nature 5 o f an  of  will EDC  the  and  some s u g g e s t e d work w i l l  the be  features  would  is  being  several  array  using  a novel the  those  s y s t e m has  and  LMs  the  way  to  alignment  instructions  is  highly  been b u i l t ,  could  advantage  be  the  increased  i s a t t r i b u t a b l e to  network.  diagram  Section  of an  the  the  EDC  i s shown i n  system  will EDC.  language  hardware  explain Section to  be  p e r f o r m a n c e o f an  three given  3  i n an  programming  with  array  architecture  processed  examine  an  out  of  f u n c t i o n a l d e s c r i p t i o n s of  in Section  is  EDC  Such an  extensibility  The  information  EDC  RPs,  incrementally.  will  The  A f t e r an  numbers of TPs,  Fig.IV.2.  signal  of  array  links,  carried  to  be  bottlenecks.  W h i l e an  be  the  techniques  of  These  completions  operations,  more  portions  read/write could  elements w i l l  network w h i c h p r o v i d e s the  extensible.  A  the  equalizing  known s t o r a g e  different  operations  synchronize  the  array  operations:  using  packet-switched  (d)  using  and  p r o b l e m o f memory a c c e s s  on  alignment  RPs;  concurrently.  Overlapped array operated  and  LMs  [36,37] w h i c h a l l o w to  randomizing  afore-mentioned in Section  6.  how  4  will used,  EDC.  A  designs  C^__^« File  Terminal  Sensors, Actuators, etc.  Abbreviations: Supervising Processor System Memory Channel S e l e c t o r L o c a l Memory Transmitting Processor Receiving Processor Instruction Register S w i t c h i n g Network Number o f T r a n s m i t t e r s Number o f R e c e i v e r s Number o f l o c a l Memories  F i g . I V . 2 . The c o n n e c t i o n diagram of EDC hardware architecture.  91  2. 2.A.  EDC H a r d w a r e A r c h i t e c t u r e Processing Modules  (1) S u p e r v i s i n g P r o c e s s o r (SP)  SP  i s the master  controller  of t h e whole system  and i t  oversees the executions of the following a c t i v i t i e s : (a)  Program  downloading  loaded  from  computer Memory for,  external  or  sources  host  bulk memories, and s t o r e d i n the  System  which  is to  fetched  When  such  Programs a r e the  (SM) i n i t i a l l y .  SP w i l l  pages  and i n i t i a l i z a t i o n :  a  as  program  is  access a storage u t i l i z a t i o n  located the  in  called  SM,  which  signal the TPs  f r e e memory  will  f r o m SM a n d l o a d e d i n t o L M s .  l o a d i n g , SP w i l l  t a b l e (SUT)  and a l l o c a t e  program  called  then  be  At the end of  concerned  to  start  execut ion. (b) I n p u t a n d o u t p u t to to  the  o p e r a t i o n s : Input data w i l l  i n p u t b u f f e r l o c a t e d i n SM, a n d t h e n  t h e a p p r o p r i a t e LMs.  array  are  to  concurrently, then LMs  using  [36,37]  Output  be  for  referenced  proceed  independently  an and  be skewed i n t o t h e f i r s t  described  conflict-free  by  Budnik  accesses;  the  et  R al  storage  then be r e c o r d e d i n an a r r a y d e s c r i p t i o n  ( A D T ) l o c a t e d i n SM data  go  I f the various parts of  i twill  techniques  pattern will table  first  will  output b u f f e r which  be  for  future  transferred  references.  from  i s a l s o l o c a t e d i n SM,  LMs t o t h e and  then  92  to )  the outside  devices.  Process  and  resource  requests  for  process  terminations, as  cannot )  creations,  procedure  well as other  maintained  management:  by  calls  to  be h o n o r e d  interactions  A request  enqueue  those  through  list  SP  and  are  Scalar  setup  I f t h e compound o p e r a t i o n  R  TPs  then  SP w i l l  to perform  request  a subset  the operation  by  it  of a block  i s an  array  of the  first  under t h e demand and  o f s e q u e n t i a l p r o g r a m , SP  i n t o LM(k) —  execute will  i t . be  w h e r e R<k<=M — In the former  specified  according  Selector  the above ) Other these  case,  the  the choice (CS) w i l l  will  TP(k) to  the choice  subcontroller  is  be s e t u p  i t has  load  of TPs TP(j)  received;  arbitrary.  The  by SP t o  realize  either  execute  connections.  operating tasks  s y s t e m t a s k s : S P may  directly,  or regard  t a s k s and a s s i g n them t o T P s . the nature  In the  and request  t o t h e compound o p e r a t i o n  in the l a t t e r case, Channel  by  TPs  h a v e t o be  c o n t r o l o f a s u b c o n t r o l l e r T P ( j ) , where j>R. case  that  operations  executed  whereas compound o p e r a t i o n s  operation,  (RL) i s  requests  autonomously; by SP.  and  immediately.  S e t t i n g up o f compound o p e r a t i o n s : need n o t go  handles  and the use of memories  resources.  SP  SP a l s o  o f t h e O.S.  tasks.  them a s a p p l i c a t i o n  The c h o i c e d e p e n d s on  93  (2) R e c e i v i n g P r o c e s s o r s  There network result LMs  a r e R RPs  PSN.  A RP  packets  The  content  of a r e s u l t  (a)  listed  If  it  into  the  If  it  array  an  as  location  a scalar  signalling  address  examine  whether  required  instruction (IR)  to  selection 2.B.(3));  the c o n t e n t s of  will  then  as  operand,  the d e s t i n a t i o n  the  A RP  the  arriving  the v a r i o u s t y p e s  a r r a y element,  update  all  the  of  of  respond  the  result to  the  follows:  will  then  side  i t will  be  specified  stored  by  its  address;  a  processor  of  remove  update  IV.6.  memory  is  or  formats  packet  is  receiving  continuously  in Table  destination (b)  to the  t h e n e t w o r k and  accordingly. are  connected  will  from  packets  (RPs)  will  wait of  the base a d d r e s s  token, the  then  instruction  that  p l a c e d i n an  for  a  free  IR-TP p a i r s  otherwise  no  and  instruction  will  it  it  has  has,  for  be  further  will  the  register  execution  given  by  received then  instruction  TP  an  receiving  word g i v e n  of the p a c k e t ,  information; i f be  the  of  in  (the  Section  action  will  scalar  and  this  group  take  place.  (3) T r a n s m i t t i n g P r o c e s s o r s { T P ( 1 )  This  group  o f TPs  will  operations.  Any  free  TP  continuously  check  i t s associated  t o TP(R)}  execute belonging  both to  instruction  registers  array will for  94  the  addresses of executable i n s t r u c t i o n s .  one,  then TP(k)  LM(k)  will  and execute  addresses  of  contains  fetch the corresponding i n s t r u c t i o n  i t .  the  If IR(k)  The  next  r e s u l t packets which  from  computed r e s u l t s together with  instructions  are then  will  forwarded  the  be p a c k a g e d  to  the  into  network  for  distribution. A  subset  o f t h e s e T P s may  undergo  under  t h e c o n t r o l o f T P ( i ) where i>R.  such  a compound o p e r a t i o n , i t w i l l  control As  s i g n a l s t o t h e s e TPs  soon  they w i l l LMs  as these TPs  When  to  TP(i)  generate and  v i a the Channel  the  receives  broadcast  Selector  have f i n i s h e d t h e i r c u r r e n t  r e s p o n d by f e t c h i n g t h e a r r a y  according  an a r r a y o p e r a t i o n  elements  broadcast • signals.  the  (CS).  activities, from  their  the  array  If  operaton i s , (a) a c o m p u t a t i o n a l a c t i v i t y , on t h e e l e m e n t s  and  then these TPs  then s t o r e the  the memories u s i n g the read/write (b)  an  alignment  the elements the  network  has  been  which  be  will  into result  sent by  results  out,  p a c k e t s and  some  back  to  package  f o r w a r d them  After the l a s t of  operate  links;  o p e r a t i o n , t h e n t h e s e TP w i l l  for alignment.  requested  will  these  TPs  to  element will  be  SP t o g e n e r a t e a s y n c h r o n i z a t i o n t o k e n forwarded  to the  network  to  indicate  the end of t r a n s m i s s i o n ( t h i s s y n c h r o n i z a t i o n p r o c e s s will  be d e s c r i b e d i n S e c t i o n 2 . C ( 2 ) ) .  95  If array  a  TP  i s not  operation, i t  mentioned  i n the  i n v o l v e d i n or has  will  beginning  resume  its  of t h i s  subsection.  (4) T r a n s m i t t i n g P r o c e s s o r s  The  main  operations; execute  of t h e s e  w i t h LMs,  s e q u e n t i a l programs as Any  f r e e TP  belongs  they  to t h i s  i t s a s s o c i a t e d IR  Unlike  the p r e v i o u s group, these  addresses not not  —  of n e x t  the addresses have  of t h e  non-sequential  computed  by  then  To will  be  any  instructions The  programs will  be  SP  i t .  scalar  by  SP  to  continuously packets. the  actual  immediate o p e r a n d s  because  to access  i n the  IRs,  these  TPs  the  stored.  network  this  and  into for  first  and but. do  R  LMs  The  result  result  packets  distribution.  Upon  g r o u p , and  the  t h e a s s o c i a t e d TP  completion,  or produce a r e s u l t  v i a the  as  of a s e q u e n t i a l program,  of  LM,  will  instruction  packaged  to the  the  requested  require that  are  the e x e c u t i o n  into  be  available  links  forwarded  to execute  signal  TPs  — b e  f r e e TP-LM p a i r  loaded  requested either  be  activities  i s to execute  group  instructions,  TPs  initiate  select  will  these  may  opcodes,  read/write  where t h e  which w i l l  the  instructions  direct  TPs  for executable  i.e.,  an  well.  check  instructions  normal  completed  (TP(R+1) t o T P ( T ) }  function  f o r those  just  packet  that  SP  program will  be  TP  will  to t r i g g e r  other  network.  number o f TPs  c o u l d be  larger  than  or  equal  to  96  that  of  LMs, d e p e n d i n g on t h e speeds o f t h e v a r i o u s  hardware  modules and the intended a p p l i c a t i o n s .  /  2.B.  Storage  Modules  (1) S y s t e m Memory  The located as  (SM)  aforementioned  input  i n SM w h i c h a l s o c o n t a i n s  system software  such  and  output  a s I/O r o u t i n e s a n d i n t e r r u p t s e r v i c e  W h i l e i n SM, a l l t h e a d d r e s s e s  remain  the relative  l o c a t a b l e ; when c o p i e d be  using  repeatedly,  for replication SM storage  table  information  program  will  relative addresses  ones by t h e TPs c o n n e c t e d  then  provided  by SP.  will  to  the  I f a program i s  a copy o f i t w i l l be kept  in  SM  purposes.  also contains utilization  (ADT),  of a  s o t h a t t h e p r o g r a m c o u l d be r e -  i n t o LMs, t h e s e  the base address  t o be c a l l e d  the  form  translated into absolute  LMs,  are  a p p l i c a t i o n programs as well  routines. in  buffers  the  table  between a c a l l i n g  those  table  request  aforementioned (SUT),  list  the  tables,  array  description  (RL), as well as a  (LIT)which provides  the linkage  program and i t s c a l l e d  namely,  linkage  information  programs.  (2) L o c a l M e m o r i e s (LMs)  LM(1)  through  LM(R) a r e used  to  contain  interleaved  97  instructions connected  are  executed  i n an  used  their  alternating  to store  t i m e s SP w i l l  loading  and  those  frequently  RPs.  by  granting  LM(R+1)  through  p r o g r a m s w h i c h a r e t o be  above  programs;  activities  such  interferences  t h e s i z e o f LMs s o t h a t  needed p r o g r a m s c o u l d  reside  for  most o f  i n them.  I n s t r u c t i o n Registers (IRs)  IRs been  t o the  are  TP.  of  be r e d u c e d by i n c r e a s i n g  ports  be r e s o l v e d  i n t e r r u p t the  unloading  left  right ports  sequential  could  (3)  Their  manner.  s o l e l y by t h e a s s o c i a t e d At  the  arrays.  between RPs a n d T P s c o u l d  requests  LM(M)  skewed  t o SP and TPs w h i l e  Contentions their  and  serve  mentioned  contain  only  IR(R+1)  as b u f f e r s  in Section  IR(T)  and  TPs.  As  has  2.A(3) and ( 4 ) , I R ( 1 ) t h r o u g h I R ( R )  the addresses of  through  therefore,  between RPs  executable  contain  the  instructions actual  the b u f f e r i n g c a p a c i t i e s of these  while  instructions;  two g r o u p s o f I R s  are d i f f e r e n t . Associated "Full/Not-Full" and mode  w i t h e a c h IR a r e two s i n g l e - b i t  flag  which  indicates  t h e "Autonomous/Slave" f l a g of  t h e c o n n e c t e d TP.  which  flags:  the status  i n d i c a t e s the  the  o f t h e IR, operating  An autonomous TP i s one w h i c h i s  ready t o accept  or i s c u r r e n t l y  IRs,  s l a v e TP i s one w h i c h i s u n d e r g o i n g a compound  while  operation  a  under  executing  the c o n t r o l of another  instructions  processor.  from  98  To examine  schedule  the  flags  an  executable  of  IR(i+n*R) i n the order  which i s a non-negative full  and  i s connected  instruction  2.C.  i n t e g e r , and  t o an autonomous  first TP  of  will  increasng  n  IR w h i c h i s n o t  will  receive  the  Switches  CS perform  enables  those  compound  (CS)  SP  to  activities  program loading,  any  of the TP-LM p a i r s  mentioned i n Section 2.A(1),  input and  output  implementation  hence w i l l  not  (2) P a c k e t  Switching  Other  activities  and  to  namely,  s e t t i n g up  of  conventional  a  (NxN)  but  for  switches;  t h e r e f o r e , PSN  (LSSN) which has  functions  are:  and  (PSN)  packet they  connection,  is a modified  is quite straight-forward  in this dissertation.  Network  switches  PSN  o f CS  be d i s c u s s e d  i n p l a c e o f PSN,  Network  select  operations. The  its  the  RP(i)  packet.  (1) C h a n n e l S e l e c t o r  used  instruction,  switching  require  networks could  at  least  w h e r e a s PSN  v e r s i o n of Loop-Structured been d e s c r i b e d  in  (N/2)logN  uses only  i s a t t r a c t i v e when N i s v e r y  Chapter  be  (N/2)  large. Switching III,  and  99  (a) t o d e l i v e r  (b)  to  result  perform  completion  It PSN the  i s the  from LSSN. same  as  synchronization  on  from  schematic  diagram  from TPs  t o RPs  and  LMs;  synchronization to s i g n a l  the  of a r r a y a l i g n m e n t s .  The  different  hardware  second  that  packets  function  above  which  distinguishes  t o p o l o g y and  a d d r e s s i n g scheme o f PSN  o f LSSN; but  in order to perform  t h e n e t w o r k , t h e PSN  t h e LSSN s w i t c h e s . o f a PSN  switch.  switches  Fig.IV.3  have  are  hardware to  be  illustrates  the  100  From Transmitting Processors  Loop  Input P o r t  i-J  u  £.  bC  w  CO  • -H KH r U  0  CM  CM •  1 •P W (0  CO 4) i H  <M  u  | W  P  ,• •<  Output P o r t  Buffer  •H  OS  r-t  o  Synchronization Stations  Intermediate ports  Output P o r t  To R e c e i v i n g Processors Output L i n k  Fig.IV.3.  pools  x: w bO CO  •X-  Stage A  Class-1  CIS  Right  <H  •P W  Left  W -P w  o1  Class-1  Input P o r t  o1  • * * * "^j  The s c h e m a t i c d i a g r a m o f a PSN  switch.  101  In g e n e r a l , a r e s u l t processor would  packet  have the packet  <Feedback Count;  s e n t o u t by a t r a n s m i t t i n g  format  Destination  as f o l l o w s :  Address;  Result  Type;  Result>  When  a  will  result  packet  be p l a c e d i n t o  according zero  to  when  the  incremented path.  its  enters the  the  feedback  packet  whenever  In Fig.IV.3,  Synchronization  Class-i  coming  count  the  through  a l l types of r e s u l t  packets  will  buffers will their  the  out of the Class-2  be  turns  bypass  from  packets coming s w i t c h e d t o an  to  be  the  buffer  directly  determined  by  to the output  intermediate port; e l s e to the right  also  latter  and  Class-1  wait  for For  of  the  similarities  that exist  and LSSN, those theorems  applicable  t o PSN.  for a network with L loops, the  maximum  be  of  switched  one. between  developed  T h e o r e m 111,1  a  switching  the s - t h b i t of the d e s t i n a t i o n address  to the l e f t  are  forwarded  port.  will  a  the  of  i f i t i s a "0"  t o p o l o g i e s o f PSN  the  For  be  intermediatp port to  the packet:  Because  is  feedback  pools.  when  , then the packet  to  and  except  out of the Class-0  transferred  i s set  the  buffer, i t will  and  switch  Synchronization  switch l o c a t e d in the s-th stage, the d i r e c t i o n is  the  generated,  goes  packets,  empty;  inside  it{0,1,2}, which  packet  to the output port immediately becomes  buffer  is . initially  S t a t i o n s when t h e y e m e r g e packet  input p o r t of a switch, i t  for  the LSSN  have shown t h a t  number  of  switches  102  that any packet would its  have t o go through  d e s t i n a t i o n , i s (21ogL is  through  t h e maximum number o f s w i t c h e s ,  buffer  at  the  Consider the case  packet  packet w i l l  admitted  -1).  last  will  in  is  the  (logL-2)th  stage  on  PSN.  Another  a s r e v e a l e d b y Lemma I I I . 2 , acquired  is  a feedback  why  F i g . I V . 3 need The  packets  that  count  n o t go t h r o u g h  purpose  achieve the effect  of  the  dispatched connected  method  packet.  a  any  routing  which  —  this  of the Class-2 buffers i n  the intermediate ports.  t o PSN, e a c h o f t h e f i r s t  result packets except  that  has  remain i n  steps  Stations  is  in  the  i.e.,  proceed.  o p e r a t i o n have  L TPs (i.e.,  s t a g e o f PSN) w i l l  token  to  o p e r a t i o n s so t h a t  i n v o l v e d i n an alignment  packets will  of  hardware  always  o n t h e s e o p e r a t i o n s may  synchronization  Synchronization  of  packet  of 2 will  of array alignment  to the first  These  i s the  p r o p e r t y o f PSN,  those  be r e q u e s t e d , by  e i t h e r SP o r t h e s u b c o n t r o l l e r o f t h e a l i g n m e n t forward  this  of this observation  Synchronization  dependent  After a l l the elements been  out  which  o f h a r d w a r e s y n c h r o n i z a t i o n on PSN —  signal the completion  other computation  TPs  coming  —  important  t h e same l o o p f o r a n y o f i t s f u r t h e r explains  then  have t o go r e g a r d l e s s  originated; the significant  synchronization  to  (2logL-1),  b e c o m e o b v i o u s w h e n we d i s c u s s t h e  already  a  s t a g e o f PSN a n d h a s t o go  f u r t h e s t d e s t i n a t i o n any packet w i l l i t  i n which  be r e m o v e d f r o m PSN when i t e m e r g e s f r o m a C l a s s - 2  located  where  i n order to arrive at  operation,  form  of a  to  result  be t r e a t e d much t h e same a s o t h e r they  will  S t a t i o n s when e m e r g i n g  be  retained  by  the  from t h e b u f f e r p o o l s ;  103  a synchronization packet station  would  have  to  synchronization packet class,  r e t a i n e d by t h e l e f t wait  in the right  then both packets w i l l  output  ports  in  a  for  the  arrival  (left)  proceed  (right) of  station  s t r a i g h t - t h r o u g h manner.  behind  array packets  arrive  at  the  (logL-2)th  the  concerned  must have been d e l i v e r e d t o t h e i r  been  explained  arrivals  of  the  Stations  t r a n s f o r m them  into  counts  (please  refer  signalling  alignment  to to  tokens  instructions  changing  and IV.6  be on  for  completion  of  occurs,  the  packet  resetting  a  result  output  packet  will  packet  their  result  types  formats);  these  to  trigger  those  the  array  of  are those  a r r i v e s a t an output p o r t o f a  link connected  will  be  matched  to the port.  to that link will  be handed o v e r  be f o r w a r d e d  of the link.  will  by t h e s y n c h r o n i z a t i o n p a c k e t s .  t h e n t h e RP c o n n e c t e d  the result  Class-2  stage  their  their  switch, i t s d e s t i n a t i o n address w i l l  that  by  retransmitted the  the  (logL-2)th  (as  Upon t h e  operation, and their d e s t i n a t i o n addresses  When  end  the  zero.,  will  elements  destinations  packets,  tokens  Table  Synchronization  previous paragraph).  of  lag  and that  a l l the array  signalling  dependent  originally carried  PSN  the  synchronization  Synchronization  feedback  in  stage,  Class-2  of  and  always  elements which they are t r a i l i n g ,  Stations  has  o f t h e same  Such a scheme  that the synchronization packets will  when t h e s e  another  to the intermediate  would ensure the  Class-i  to  against  I f a match  be s t r o b e d a n d  i t ; otherwise  the  t o the switch situated a t the other  1 0 4  More  details  o f PSN c o u l d  be found  i nChapter I I I  w h i c h a l s o e x p l a i n s how t o e x p a n d t h e n e t w o r k i n c r e m e n t a l l y a  s i g n i f i c a n t advantage  networks.  Although  s t r u c t u r e than number  such  a disadvantage  easily  3.  o f switches  conventional  h a s a much c o m p l e x i n t e r n a l  binary  switch,  t h e savings  t h es i z e o f t h enetwork  technologies,  a high  i t will  still  internal complexity  be d i f f i c u l t  i n  offset  i s large. could be  b u t i f a s y s t e m i n v o l v e s t o o many  EDC Information  3.A.  other  aswell as external wiring will  when  achieved,  components,  a PSN switch  a conventional  the  With today's  o f PSN over  —  external  t o manage.  Structure  Machine Instruction Formats  (1) Format f o rs e q u e n t i a l programs: I ti ss i m i l a r t o that conventional double-byte  computer o f opcode  double-bytes  systems, followed  a n d i s arranged by either  of  a s one  one o r  more  of operands. «*-16 b i t s — * . —  ,  Opcode Operands  (2)  Format  f o r non-sequential  scalar and those made  encapsulated  up o f eight  fields:  programs: I ti sused  (a)Opcode field  «*-16bits-»~«  compound o p e r a t i o n s ,  and  i s  which a r e d i v i d e d i n t o 4  double-bytes (b)Control Information field I6bits  t o encode  X  ( c ) Operand I' ( d ) / field [ I 6xl6bits  Next Instruction field  105  The one  "Opcode" and  double-byte  Instruction"  "Control Information" f i e l d s  each,  fields  while  share the remaining  (a) "Opcode" f i e l d : categories  (b)  divided  and  2  show  result  four  operations and  graphs.  five  fields:  It  be  is  further  subfields:  (3)#0perands (4)#Tokens Required Required  of s i n g l e  or boolean  (5)#Tokens To Go  3 b i t s — — 3 b i t s — * A—variable—'  type": S p e c i f i e s  will  numerical  "Next  the  compound  **— 4 b i t s — 3bits—— constant "Result  and  of  six double-bytes.  and  Information"  into  "Operand"  a l o n g w i t h some t y p i c a l e x a m p l e s  (2)Format type  (l)Result type  (1)  scalar  data-flow  "Control  ««-3bits V  T a b l e s IV.1  of  respectively, their  the  are  whether the  or double  value,  or  computed  precision, a  a  signalling  token. (2)  "Format  type":  Specifies  the type of  u s e d t o a c c o m o d a t e o p e r a n d s and of  those  instructions  the  d e p e n d e n t on  format  addresses the c u r r e n t  instruction. (3)  "#Operand R e q u i r e d " : o p e r a n d s n e e d e d by  (4)  Specifies  the  the  total  number  of  instruction.  "#Tokens R e q u i r e d " : E q u a l s plus  the  number  "#Operands of  Required"  signalling  tokens  needed. (5) " I T o k e n s  To  Go":  Equals  "#Tokens  Required"  106  minus t h e number  of tokens  received.  When a RP r e c e i v e s an o p e r a n d token,  it  receiving the  will  instruction;  receiving  instruction (c)  various  compound 4  bits  be  used  These simple  In number  account  formats  "False"  array  will  meet  needs; o t h e r w i s e  new  (a t o t a l  t o the i - t h operand  to the address  respectively.  of 4 which  an  of the j - t h next  are the addresses  of a  of  boolean  of the  operation  F o r m a t No.8  is  i s useful for which  do  not  operands.  T a b l e I V . 4 , "No.  i s the d i f f e r e n c e  elements  and  f o r -16 f o r m a t s ) .  when t h e r e s u l t  of a r r a y elements  "Stride"  by s c a l a r and  i f necessary  and "NextT" a n d " N e x t F "  embedded  fields:  i n T a b l e s IV.3  t h o s e o p e r a t i o n s s u c h a s " D u p l i c a t e " a n d "Wait" carry  i n t o an  Instruction"  of formats  c o u l d be added  placed  zero,  for execution,  operations are l i s t e d  and " N e x t j " r e f e r s  instructions and  types  T a b l e IV.3, " O p i " r e f e r s  instruction,  "True"  "Next  reaches  a r e a s s i g n e d t o t h e "Format T y p e " f i e l d  could  instruction  will  a l l the computational  formats  value  (IR) t o w a i t  respectively.  almost  next  instruction  register  signalling  t h e "#Tokens To Go" o f t h e  when t h i s  & (d) " O p e r a n d s " and The  In  decrement  or a  which  of elements"  involved  refers  i n t h e a r r a y o p e r a t i o n , and  i n the indexes of  take part  t o the t o t a l  two  neighboring  i n the o p e r a t i o n .  Both the  107  "No.  of  control  statements  such  DO".  In  IV.4,  to  elements"  Table  the r e s u l t i n g  the  a s "DO  would  a n d s p a c e ) t o be  vectors w i l l summation  sent  limited  actual more  each  only  their  the formats  numbers  "Duplicate"  3.B.  "Next  the  of  scalar  assigned  "Reduction"  shown "Next  Instructions"  in  ones  ( i n terms o f  every  instruction  base a d d r e s s e s  of the  o p e r a t i o n s such  r e s u l t s w o u l d be  by s c a l a r  f a n - o u t s c o u l d be e x t e n d e d of t h e i r  and  As f o r " R e d u c t i o n "  much t h e same a s t h o s e p r o d u c e d  have  those  are too expensive  to  therefore,  product,  Although  I=1to64step2  respectively.  which  be s e n t .  and  the loop  v e c t o r V1, and " ( V 2 ) " a n d " ( V 3 ) " a r e t h o s e o f  vectors  them;  from  i s t h e base a d d r e s s  compound o p e r a t i o n s e x c e p t  produce  requiring  are obtained  1=1,64,2" o r "FOR  "(V1)"  i n p u t v e c t o r s V2 and V3, All  time  and " S t r i d e "  as  treated  operations.  Tables  IV.3  Instructions" infinitely fields  and  fields,  by h a v i n g  IV.4 their  one  or  p o i n t t o a number o f  operators.  Packet  There  Formats  a r e two c l a s s e s o f p a c k e t s  that  exist  in  EDC,  namely, (a)  Instruction reside refer  packets:  They  flow  i n IRs w h i l e w a i t i n g f o r to Table  IV.5.)  from RPs t o TPs and  execution.  (Please  108  (b)  Result  packets:  They  are  produced  by TPs and a r e  f o r w a r d e d t o RPs v i a t h e network PSN.  (Please  refer  to T a b l e IV.6.)  3.C.  Program  The the  Organization  EDC  program  e x i s t i n g computer  system  software  are  organization  systems. made  i s similar  Both  up  of  the  three  to those of  application  types  of  and  program  components: (1)  Main programs: They a r e a c t i v a t e d such  as  program (2)  the  console  and  components.  program  "Call"  and  programs  use  components.  it  by e x p l i c i t  The  "Distribute"  calling  operators  "Distribute"  parameter passing. "Call"  calls  will  and  the  called  and "Return" operators f o r  As d e p i c t e d  i n F i g . I V . 4 , when t h e tokens,  be d i s p a t c h e d by a RP t o a f r e e T P w h i c h the  not  code exist  from  the  LMs, t h e n SP  will  f r e e memory p a g e s t o i t a n d l o a d memory  will  If  allocate  i t s starting  in  SP.  code  and  does  program  program  LMs,  from  programs use  i n s t r u c t i o n has gathered a l l i t sinput  then request  to  means  n o t t o be c a l l e d by o t h e r  P r o c e d u r e s : They a r e a c t i v a t e d the  v i a external  i t from  location  returned t o t h e r e q u e s t i n g TP which w i l l  then  will  SM be  proceed  109  with other is  computations.  executed,  back t o t h e assigned P M:  i  a  calling  program  results and  program w i l l  CALL  (P;a,b;M.j)  the be  be r o u t e d  memory  pages  released.  DISTRIBUTE a  c  j DISTRIBUTE  will  operator  j  \ \\  1  a l l t h e computed  to the c a l l e d  b  when t h e " R e t u r n "  1  v  RETURN  c  (M.j;c)  b M.j  ;  M. j  F i g . I V . 4 P a r a m e t e r p a s s i n g between t h e c a l l i n g p r o g r a m M and c a l l e d p r o g r a m P. " a " a n d "b" a r e t h e i n p u t p a r a m e t e r s and " c " i s t h e r e t u r n e d r e s u l t , and " j " i s t h e r e t u r n address.  (3)  Task  p r o g r a m s : They a r e u s e d t o p r o t e c t  and/or p h y s i c a l proper entry or  use. points  signals  providing various  The and  the  resources  whereby to  other  i t , and  ensure  therefore  programs  send  data  among  the  components.  between  a task  v e r y much t h e same a s t h a t o f  t h e major d i f f e r e n c e  activated  explicit  call  their  i t i s a means o f  of parameter p a s s i n g is  data  o f one o r more  programs c o u l d  procedure c a l l i n g ; an  to  communications and i n t e r a c t i o n s  implementation  by  as  A task program c o n s i s t s  t y p e s o f program  calling  so  shared  while  i s that a  a procedure i s  task  program  is  110  activated  when  the  existence; also, a results may  are  declared  to  which  declares  terminates  to the callers,  serve  statement  i t has  e.g.  procedure  returned  continue  termination  program  other is  i t comes i n t o  when  the  computed  whereas a task  callers  until  encountered,  program  an  explicit  or the program  which  terminated.  Task t ; Accept  A(x:Real)Return(y:Real);  E n d A; End t ; Callers  Task t The  execution A  F i g . I V . 5 . The i n t e r a c t i o n s between c a l l i n g task program.  The example o f F i g . I V . 5 execution threaded"  path; —  but  in  i . e . , made u p  shows a  general, of  a  task task  several  programs and a  with  a  single  c o u l d be " m u l t i -  concurrent  execution  paths. The level  advantages of using task programs  concurrency  primitives  such  as  instead of low-  semaphores  [42] in  111  handling  inter-program  Furthermore,  the  activities  implementation  principle  of  high-level  c o n s t r u c t s , a task  "monitor"  of  "task"  o f Ada  3.D.  Data  We other  data-driven  Concurrent  ease of  of  tasks  computation. is  Pascal  and  clarity.  conforms  to  Compared  quite [38]  use  different  but  very  to  the other  from  similar  the  to  the  [44].  Structure  only  more  considered  are  discuss arrays complicated  i n our  illustrated  in  design.  in  this  structures The  handling  paper [49,68] of  arrays  although may i n an  some  also  be  EDC  is  Fig.IV.6.  S y s t e m Memory-  Fig.IV.6. The p h y s i c a l and l o g i c a l memory s y s t e m . The f i r s t R LMs a r e p a g e s a s shown.  LM(1)  to  LM(R)  are  used  a r r a n g e m e n t s o f EDC l o g i c a l l y divided into  t o s t o r e d skewed a r r a y s  so  112  t h a t t h e y may TP(R).  processed  However,  constraints, either by  be  loaded entirely  k>R.  or  The  at run time. utilization  3.E.  n o t be  at compile  will  and  be  processed  arrangements  time,  t a b l e (SUT)  must always  could  o r d y n a m i c a l l y by  be u p d a t e d  and  be SP  storage  to reflect  the  patterns.  P r o c e s s and  called  Resource  and  Management  environment, in  a process i s d e f i n e d as e i t h e r  execution,  The  treatments  terminations are very similar request  Request  List  assign  an  to (RL)  until  procedures  t a b l e (SUT)  be  first  number  (ID)  calls:  p l a c e d on  which  will  the then  from  the  linkage  f r e e memory p a g e s from  the  storage  t o t h e p r o c e s s ; SP w i l l  a l l o c a t e d w i t h the program code and  When t h e p r o c e s s  regarded  of process c r e a t i o n s  i t i s r e m o v e d by SP,  i n f o r m a t i o n t a b l e (LIT) and  memories  those  to that of procedure  create a process w i l l  unused i d e n t i f i c a t i o n  utilization  and  d a t a s t r u c t u r e s owned by t h e p r o g r a m a r e  as p a r t s of the p r o c e s s .  run.  instead,  a r r a y d e s c r i p t i o n t a b l e (ADT)  a main or task program  the  through  among s e v e r a l T P ( k ) - L M ( k ) p a i r s where  statically  I n t h e EDC  and  skewed but  i n t o a l o c a l memory LM(k)  divided  The  TP(1)  by  of e f f i c i e n c y or a l g o r i t h m i c  decisions concerning these  either  storage  reasons  a n a r r a y may  TP(k),  made  for  concurrently  then  load  initialize  t e r m i n a t e s , SP w i l l a g a i n  update  the i t to SUT  and L I T a c c o r d i n g l y . As  illustrated  i n F i g . I V . 7 , the management of  hardware  113  and/or  software  resources  u s i n g a task program. particular in  type  the  critical  operators. memory to  token  and  this  a s shown.  incremented  whenever  to  content a  of  a  i s stored  from  with  of  "Release"  d e c r e m e n t e d whenever an " A c q u i r e "  accesses  a t a time  modify  i s achieved  The  accessed  malicious  o n l y one r e q u e s t region  resources  within  by t h e " S e l e c t " a n d "End S e l e c t "  t o prevent  the c r i t i c a l  conveniently  "N" o f F i g . I V . 7 )  w h i c h c a n o n l y be  In order  resources,  t h e number  region enclosed  location,  enter  The number o f u n u s e d  (e.g.,  a memory l o c a t i o n  c o u l d be i m p l e m e n t e d  the  number  memory  request  request  that  w o u l d be a l l o w e d  t h e use of a  that  to  of the  signalling  location  i s honored  i s granted. Task Program  is and Signalling token  Release  Requests from processes  Acquire SELECT  (Acquire) •  (Release) •  N:=N-1  •  N:=N+1  •  *  •»  21  END SELECT  Acknowledgement, to p r o c e s s e s  Released F i g . I V . 7 . The i m p l e m e n t a t i o n a task program.  The Fig.IV.7  "Select"  does  principles,  ZI7—1=1  Acquired  not  because  of a resource  operator  used  conform  faithfully  i t s execution  manager  i n the resource to  i s triggered  the  using  manager  of  data-driven  by t h e a r r i v a l s  114  of  the signalling  necessarily the and  a l l of  same t i m e , the  token  then  plus  at  them. they  selection  one  request  --  not  I f there are several requests at  will  policy  first-come-first-served  least  be e n q u e u e d  f o r these  when  requests  or priority-based,  they  arrive,  c o u l d be  depending  either on  the  implementation.  4.  EDC  Programming  A main  program  program,  a  Language S t r u c t u r e  that  runs  procedure  o n EDC or  a  composed o f one o r more "program of  instructions  blocks, of  except  using  objective  and  compiled  into  translated  4.A.  are  ideas concerning  language,  no  program  task  t h e form  into  of  this  the  b l o c k s " which branching  clarity  into  and  to illustrate data-flow machine  section  how  graphs  of  some  i t i s  or out of the The  advantages  that  existing  used. i s  the  to present EDC  some  programming  language c o n s t r u c t s a r e  which  code u s i n g  and  a  are collections  and endings.  design  of e i t h e r  program,  o f o p t i m i z i n g c o m p i l e r s c o u l d be  The  Section  have  at the beginnings  blocks  techniques  useful  that  takes  could  the formats  be  easily  presented i n  3.  EDC S t a t e m e n t s  and Program  (1) D e c l a r a t i o n s t a t e m e n t s :  Most  Blocks  o f them a r e  used  to  assist  1 15  the  compiler  not  translated the  will  be c o m p i l e d  (2)  setting  into  are  process  up t h e d a t a - f l o w  executable  declaration into  of  Assignment  represent  causing  "convenience"  would  concurrent  Rule  often  whenever  necessary.  variable  name  The  must  within  i t s scope;  means  that  source  of o r i g i n .  in  lead  not  compiled  Fig.IV.8.  data  to  system,  be a s s i g n e d to  p a r t s of the such  obscurities  simply  to  in  the  such  a  Single  confusions  states  more t h a n  data-flow  a  that  a  one v a l u e  graphs,  it  a r c o f t h e g r a p h s c o u l d have a t m o s t one  "Wait"  among  sequentiality  dependencies.  for  sequential  however,  SAR  and "End"  operators  The " W a i t " o p e r a t o r  dependencies desired  into  SP  repeatedly  different  to avoid  " B e g i n / E n d " b l o c k : The " B e g i n " be  used  used  when a p p l i e d  each  be  I n t h e EDC is  request  which  respectively.  much c o n f u s i o n ;  (SAR)  Exceptions  conventional  could  environment.  Assignment  a  entities  program without  that w i l l  space  In  names  different  operations.  operations  statements:  variable  graphs and a r e  t a s k p r o g r a m s and a r r a y s ,  c r e a t i o n s and memory  program,  (3)  in  program  is  blocks  not e x p l i c i t l y  a  as  statements  will  demonstrated i n  means  of  imposing  so a s t o a c h i e v e t h e expressed  by  their  e.g.  Begin  S i g n a l l i n g  O p e r a n d  token(s)  token(s)  Begin:  W a i t  End;  P r o g r a m  End:  (4)  A "Begin/End"  "IF" block: I twill number the part  of  into  "Switch"ing o p e r a t o r s which  t h e program.  have t o be "grounded" operands  block and i t s data-flow  be compiled  flow o f input operands of  g r a p h  W a i t  4  Fig.IV.8  after  116  a  boolean a r eused  plus  i n order  emanating  to discard  a  to direct  into either the "IF" or Sometimes c e r t a i n  graph.  the  "E1SE"  arcs unused  t h ec o n d i t i o n a l t e s t . B  C2  CI  ,g.IF(Cl £ C2)THEN  ELSE END;  Fig.IV.9  (5) boolean  "Match"  block:  I twill  be compiled  graph.  into a series of  a n d " S w i t c h " i n g o p e r a t o r s a s shown i n F i g . I V . 1 0 .  input operand "C1,C2,  An "IF" b l o c k a n d i t s d a t a - f l o w  ...."  "C" w i l l  be matched against a l l  i n parallel,  The  t h e comparands  and the"Switch"ing  operators  117  will direct a  the operands t othe p a r t o f the program which h a s  successful  match.  case a l l the matches  e.g.  An  "Else" part should be provided  in  fail.  MATCH (C) CASE(C1)D0 CASE(C2)D0 ELSE END; Fig.IV.10.  (6)  A "Match" block and  i t s data-flow  "Loop" block: An EDC loop i s d i f f e r e n t and  "Do-all" loops proposed  [40].  Since  parallel  from  graph.  the  "For-all"  in other data-driven  array  computations  languages  i n EDC are  encoded a s compound o p e r a t i o n s ,  i t i s not  necessary  "Loop"s for these computations;  instead,  loops  for data  iterations  and recurrence  dependencies  computations.  between  a r e used  operations which two  t o use  successive  exhibit loop  118  e.g.  LOOP WHILE(Cl-C2)DO  •  NEXT X Z • • • NEXT Cl:« ... NEXT C 2 : « ...  •  LOOP EXIT XLAST :« X; END;  Fig.IV.11.  A "LOOP" b l o c k a n d  i t s data-flow  graph.  119  As  e x e m p l i f i e d by F i g . I V . 1 1 ,  EDC l o o p c o n t a i n s  some a r c s w h i c h another  t h a t some v a r i a b l e s a r e a s s i g n e d more t h a n Assignment  i s t o p l a c e p r e f i x e s such  of  v a r i a b l e names i n q u e s t i o n .  those  treated "NEXT  differently X"  will  consecutive entirely  from  be  loop new  updated  as  computations, variable  when  A  a s "NEW"  "X" d u r i n g a  —  one  w i t h i n t h e l o o p body  Rule.  [39,65]  graph of a  sources  implying  Single  one  two  the  the  and  have  outside  violating  loop  the data-flow  once,  —  thus  common  remedy  a n d "NEXT"  infront  Thus,  "NEXT X" would be  loop  computation,  "X"  at  and  "NEXT  the  next  but  t h e b o u n d a r y o f two X"  will  be  an  loop  computation  with  a  commences.  When t h e c o n d i t o n a l t e s t satisfied, computed assigning (e.g.  (7)  the  loop  results  will  will  associated  be  be p a s s e d  exited  and  then  to the e x t e r i o r  them t o names t h a t a r e  not  used  loop  is  some o f t h e  o f t h e l o o p by  within  the  loop  "XLAST" o f F i g . I V . 1 1 ) .  Prefix certain  for  sequentiality:  activities  sequential, execute  such  a s i n p u t and o u t p u t  a n d i t i s more  their  As h a s been m e n t i o n e d  instructions  convenient  and  i n the order  p r o g r a m s ; and t h e " S E Q " u e n t i a l  prefix  is  before,  are inherently efficient  specified meant  to  by t h e  for  such  purposes.  If then  an  signalling  instruction tokens  block  i s p r e f i x e d with  w o u l d be u s e d t o enhance  its  "SEQ", data-  120  flow  graph  to achieve  the d e s i r e d s e q u e n t i a l i t y .  entire program i s p r e f i x e d with compiled  into  mentioned  in Section  4.B.  a  sequential  "SEQ", programs  Array operations are probably  This encoded will  p a r a l l e l i s m and  section and be  will  reduced  how  i n EDC;  into  using  will the  be  format  Processing  the  richest  abound i n s c i e n t i f i c  discuss  executed  it  an  3.A(1).  Language Constructs f o r Array  synchronous  then  If  source  computations.  one-dimensional  arrays with  one-dimensional  a  arrays  higher  arrays  of  are  dimension  before  their  computations. (1) P a r a l l e l V e c t o r O p e r a t i o n s : The  of  a  parallel  v e c t o r o p e r a t i o n a r e i n d i c a t e d w i t h the use of  an  "INDEX"  set, which  as  has  range  and  t o be d e c l a r e d p r i o r  stride  t o i t s use  follows: (A) ( B ) ( C ) I e.g.  DECLARE I : I N D E X 1..64 BEGIN C(I)  :=  STRIDE 2  (ADD)  A(I)+B(I);  END;  Opcode C o n t r o l #Elements S t r i d e Address Address o f O u t p u t of Input (ADD) I n f o r . « 64 • 2 t Array=(C) Array=(A)  (C)  Address of Input Array=(B)  Next2 Nextl  Fig.IV.12. The s t a t e m e n t , d a t a - f l o w g r a p h and machine co of a p a r a l l e l v e c t o r o p e r a t i o n . The m a c h i n e code format i s as d e s c r i b e d i n Table IV.4.  121  Both the are  regarded  they  as  could  determined at arrays the  (i.e.,  (c)  constants  run  time:  The  (A) and  (B)  operations,  sent  succeeding  execution  Section  operations  and  that is  execution  or  of  the  operations  This  is  "SUMmation", and  "OR".  DECLARE J : INDEX 1 . . 1024 BEGIN xsum END;  :=  and  to  the  be  input  received  from  output  array  from the  memory  operation, input  been  another  f r e q u e n t l y e n c o u n t e r e d , and  "MINimum", "AND"  operation,  the  as an  has  index  variables  obtained of  an  and  is  operand.  described  in  (3).  Operations:  namely,  e.g.  array  base a d d r e s s e s of  such o p e r a t i o n s  2.A.O) and  Reduction  them,  of  in  of Fig.IV.12) are  Fig.IV.12)  to the  The  (2)  either  of  the  indicated  input operands to the  manager p r i o r to  stride  be  (i.e.,  preceding  r a n g e and  SUM(x(J));  type  there  of  are  "PRODUCT",  array  six  of  "MAXimum",  (X)  .  (SUM)  1"  XSUM  Opcode C o n t r o l # E l e m e n t s S t r i d e A d d r e s s o f i n p u t " N e x t 4 N e x t 3 Next2 N e x t l =1024 Infor. (SUM) =1 Array=(X) Fig.IV.13. The s t a t e m e n t s , d a t a - f l o w g r a p h and machine code of a r e d u c t i o n o p e r a t i o n . The machine code f o r m a t i s as d e s c r i b e d i n T a b l e IV.4.  122  When be  array  p a r t i t i o n e d and  k>R  —  and  associated  loaded  each TP  i s t o be  part  of  it  result  f o r m of a r e s u l t  instruction  The  number  d e p e n d s on various  primitives:  way  alignment operators assumed.  the  its partial  packet,  v i a the  network  partial  for a reduction  array  and  the  the  and  to  results. operation  speeds  modules,  the  "SHIFT"  the  cyclically  while except  e l e m e n t s , and by  of  the  forward  obtain  operations:  operator,  vacated  end  by  will  software to  where  processed  the  combine a l l t h e  the  and  order  elements  similar  of  be  —  will  of has  shortest  the to  be  possible  time.  Alignment  array  size  in  computation  the  TPs  o f TP-LM p a i r s u s e d  hardware  alignment  these  which w i l l  the  optimized  array  of  at  i t s elements LM(k)'s  will  independently; each  i n the  reduced,  into several  computation,  an  (3)  an  the  could  be  zeroes  by  the  there  operator  are  two  moves  the  functions cyclic  The  arbitrarily;  a  feedback  of  positions  direction  and  after in  i n s e r t e d i n t o the  then the  are  amount s p e c i f i e d  i s no  operations.  fixed  is specified,  "ROTATE"  "ROTATE"  "SHIFT" o p e r a t o r  that  shifting  and  of  an  i f none of  the  "SHIFT" o p e r a t i o n  will  be  123  (A)<f> P  e . g . x := a ( K S H I F T - 1 ) ;  \ VI /  y := b ( I R O T A T E 2 ) ; z  -1  SHIFT  := c ( J - 1 ) ;  X) Opcode C o n t r o l #Elements SHIFT I n f o r . Fig.IV.14.  =1024  DisplaceAddress Stride Address o f O u t p u t o f I n p u t ments - 1 N e x t 2 , l =1 Array=(X) Array=(A)  T h e s t a t e m e n t s o f some a l i g n m e n t  and t h e d a t a - f l o w graph and machine  operations,  code of t h e  "SHIFT"  operation.  5.  Performance  5.A.  Analysis  F l o w A n a l y s i s o f EDC  In  order  t o simplify the performance  a n d t o a r r i v e a t some  meaningful  results,  a n a l y s i s o f EDC  some  assumptions  will  be made; l a t e r o n , t h e j u s t i f i c a t i o n o f t h e s e  will  be  (1)  assumptions  discussed.  Assumptions: (a)  The problems  t o be p r o c e s s e d by EDC c o m p r i s e a  amount o f c o n c u r r e n c y which w i l l  keep  a l l the  large TPs  busy most o f t h e time; (b)  Initially  a l l t h e c o m p u t a t i o n s a r e assumed t o be o f  124  scalar  type;  compound  operations  will  be  ignored  temporarily; (c) The s c a l a r the  o p e r a t i o n s a r e randomly d i s t r i b u t e d  first  R  approximately  LMs, equal  thus  giving  packet  flow  among  rise  to  among  the output  maximum  throughput  p o r t s o f t h e n e t w o r k PSN.  (2) C o n s t r a i n t s On C o m m u n i c a t i o n s  As rate  Loads;  s t a t e d i n S e c t i o n 2.C(2),  the  (MATR) t h a t c a n be d e l i v e r e d by a PSN w i t h L l o o p s i s :  MATR(L) = 3/2xS  xlogLxL**2/{3LlogL~L+4}  (IV.1)  R,SW (packets/second)  (3) C o n s t r a i n t s On P r o c e s s i n g  In  order  to  from o v e r f l o w i n g ,  the  prevent total  Loads:  the i n s t r u c t i o n processing  r e g i s t e r s (IRs)  speed  of  TPs  must  e x c e e d s t h a t o f RPs, i . e . , TxS T  I > T p  ><S  I > R  >  RxS  p/S  I j R p  I > T P )  xR  (IV.2)  where T a n d R a r e t h e numbers o f T P s a n d RPs r e s p e c t i v e l y , a n d S  i s the average r a t e of producing X , RP  instruction  packets  by  125  a RP, a n d S  „_ i s t h e a v e r a g e I , TP p a c k e t s by a TP. For of  T  of  consuming  a PSN w i t h L l o o p s , we c a n c o n n e c t  N=LlogL p a i r s  connected  rate  with  o f T P s a n d RPs. TPs,  then  instruction  up t o a maximum  I f a l l the input  T=N=LlogL,  and  from  ports are expression  (IV.2), R<  If packets  (S  each  from  acceptance  MARP=S  I j T p  /S  RP  the  I > R p  is  capable  network  rate of r e s u l t  R>R  )xLxlogL  per  packets  most  binary  o p e r a t i o n s , meaning t h a t  of  the s c a l a r  e v e r y two r e s u l t  readied  If  second,  then  S  R  R  result  p  the  maximum  (MARP) by RPs p e r s e c o n d i s :  instructions  will  be c o m p i l e d  on t h e a v e r a g e ,  packets w i l l  cause  into  the acceptance  one i n s t r u c t i o n  RPs  /2  are  network, then R=LlogL  is:  accepting  to  be  f o r execution, i . e . ,  *I.RP " =R,RP  result  of  pXR  Since  of  (IV.3)  (  connected and  V  -  4  >  t o a l l the output p o r t s of the  the  p a c k e t s of such a f u l l y  I  maximum  connected  acceptance configuration  rate  of  (MARPj)  126  MARP =RxS f  But  the  R  R p  value  therefore,  =(LlogL)xS ^ R  of  R  is  ( I V . 5)  R p  constrained  t h e maximum a c c e p t a n c e  rate  by  expression  subjected  to  (IV.3); such  a  constraint i s : M  A  R  P  c <  ^I,TP^I,RP  Substituting  ™  R P  S  R,RP  c,max=  Expressions d e p e n d s on  =S  2 L x l o g L x  (IV.1, the  I,RP  ) x L x l o g L x ?  R RP f  /2  into  the  Si,TP  5  and  network  size  above  *  6) L,  ?  .  expression,  .  .  show t h a t t h e EDC the  speeds of TPs,  .  .  .  (IV.6)  performance RPs  and  the  switches.  5.B.  Example: Let  the  various  us c o n s i d e r hardware  some t y p i c a l modules,  performance as a f u n c t i o n of We  will  assume  the  that  values  f o r the  then  examine  and  network s i z e , e a c h TP  and  RP  speeds of the  EDC  L. i s capable  of  3  127  MOPS ( m i l l i o n o p e r a t i o n s p e r s e c o n d )  on t h e  assumption  i s f a i r and has a l s o appeared  instance,  see  instructions operands, of  PSN  will  two  not to  trigger process  Since  result  of the r e c e i v i n g one  the half  of  token  the  as  their  their  receiving  input  such a r e s u l t packet,  one  packets  (i.e.,  need  "#Tokens  operation to  ready  for  a RP w o u l d  and another  would  out  instructions  one  a t o t a l of four operations.  result  scalar  of the r e s u l t packets coming  instruction,  —  this  most of the  packets  o p e r a t i o n t o s t o r e i t back  result  "  in other studies (for  o p e r a t i o n to r e t r i e v e the token count  Go") it,  require  [30]).  approximately half  execution; one  reference  average  To  decrement to  store  The  other  the  receiving  i n s t r u c t i o n s f o r execution; to process such a r e s u l t packet, RP  would  r e q u i r e e i g h t more o p e r a t i o n s t o t r a n s f e r an  i n s t r u c t i o n word from to  the  average  four  i t s a s s o c i a t e d LM  operations  number o f o p e r a t i o n needed  (8+4))/2=8, and hence  the  packets per second.  As  number  of  might per  execute  and  then  forward  second.  As  for  needed  and  M  t i m  M n  S  to t r a n s f e r a packet  equals  the  result  speed the  the  Sj  (3xl0 /8) 6  into  a  i s around  =(3xl0 )/40  o f t h e PSN  clocking  total  instruction result 40  (one  packets  6  T p  the  i s (4 +  l e t us assume t h a t the  therefore  is  Therefore,  ^  R  8-byte  addition  to process a result  i t t o the network,  is  in  f o r a TP t o f e t c h an  the  e g u a l s " f / t m i n " where " f "  above.  of  f o r TPs,  i t , package  try other values);  network,  value  o p e r a t i o n s needed  f r o m an IR, packet,  mentioned  t o an IR,  a  switches,  frequency  S ^ R  of  S W  the  minimum number o f c l o c k p u l s e s  from the output port of  a  switch  1 28  to  the input p o r t of the next  switch.  In t h i s example,  tmin  i s a s s u m e d t o b e 10 w h i l e f i s t a k e n t o b e 40 MHz; therefore, — 6 6 the value of s e q u a l s 40x10 /10=4xl0 packets per second. n  c w  129  The are p l o t t e d  MART,  MARPj. a n d MARPc ,max c u r v e s o f t h i s  a g a i n s t t h e network  size  L,  and  are  example  shown  in  Fig.IV.15.  Fig.IV.15. example.  The MARP , MATR a n d MARPc,max c u r v e s f  of the given  130  example,  this  In  the  maximum  throughput  rate  a t t a i n a b l e b y t h e E D C i s l i m i t e d b y t h e MARPc,max c u r v e i s t h e l o w e s t among t h e t h r e e c u r v e s . be o b t a i n e d f r o m (a)  Two o b s e r v a t i o n s c o u l d  Fig.IV.15:  T h e r e l a t i v e p o s i t i o n s o f t h e M A R P f a n d MARPc,max suggest  that i nthis  connect  example,  a l l t h e output  p o r t s o f t h ePSN w i t h RPs.  (b) T h e r e l a t i v e  I,RP  positions  that  o f MART  6  a n d MARPc,max  If  curves  f o r most o f t h etime, t h ec a p a c i t y o f PSN  probability  expected  For  (IV.2):  w i l l be higher than that r e q u i r e d , a n d hence and  to  )xT=(3xl0 /40)/(3xl0 /8)x384 6  I,TP'  curves  i t i s not necessary  L=64, t h e n T=LlogL=64x6=384; from e x p r e s s i o n  indicate  which  of  traffic  congestion  t h e extent  i n t h e PSN i s  t o be low.  there a r e always computations  t o keep a l l  theTPs  busy, then t h e raw speed a t t a i n a b l e by t h eEDC i s : Tx3xl0  From  6  = 384x3xl0  6  = 1 1 5 2 (MOPS)  F i g . I V . 1 5 , t h e maximum r a t e o f f l o w o f r e s u l t p a c k e t s i n  the system i s :  MARP  6  c,max  ( L = 6 4 ) = 5 7 . 6 X 1 0  (packets/sec.)  131  and  t h e maximum r a t e o f f l o w  RxS  6  likely  5.C.  curves  of Fig.IV.15  which p a r t of the  are useful i n  architecture  performance bottleneck  Considerations  If  the values  after  estimating  of S  T  from those  i n the.previous  throughput  curves  In both  the  and  computations  compound  in  operations  generality.  The r e s u l t  expectedly,  performance —i.e.,  of such  because  EDC w i l l which  less  which  are  justifiable large  instructions  different different  data-driven. since of and  randomly and e v e n l y  mixed  with  (b) h a s t o be removed f o r  a removal would g i v e r i s e t o the  execution  of  compound  a r r a y o p e r a t i o n s and s e q u e n t i a l programs  communications  amounts  built.  be made up o f  are  c o n t r o l - d r i v e n , and r e q u i r e s s i m p l e r c o n t r o l  incur  most  Computations:  example, t h e n  t h e r e f o r e , assumption  operations  the  S„. __ a n d S „ _,, a r e I,TP R.SW  unknown r a t i o ,  better  be  a n d c o n c l u s i o n s w o u l d be o b t a i n e d .  general,  scalar  will  t h e s y s t e m h a s been  f o rGeneralized  I.RP  is  (packets/second)  6  DD  o f t h e EDC a r c h i t e c t u r e f o r a d e s i r e d s p e e d , a n d i t a l s o  indicates  a  packets i s :  =RxS_ __/2=154x(3xl0 )/8/2=28.9xl0 R f RP I § RP  The size  instruction  overhead Assumptions  EDC i s i n t e n d e d concurrency;  than (a)  s t r u c t u r e s and  scalar and  operations (c)  are  forapplications involving  and  the  interleaving  s k e w i n g o f a r r a y s would s p r e a d among t h e TP-LM  —  pairs.  of  t h e programs  132  6.  Discussions and Outlook  A  design methodology  next-generation  except  t h e immediate  are  the  fabricated  (EDC),  Our p r o p o s a l , which  switches  with  today's  with  which,  TPs  system,  do  components  could  technologies.  be  easily  I f t h e PSN ( P a c k e t  loops,  (Transmitting Processors)  then  can  be  approximately attahced to the  a n d t h e maximum r a t e o f f l o w o f i n s t r u c t i o n a n d r e s u l t  packets will  b e a p p r o x i m a t e l y 58 a n d 30  respectively; exceed  off-the-shelf however,  S w i t c h i n g N e t w o r k ) c o n s i s t s o f 64 400  A l t h o u g h we  i m p l e m e n t a t i o n o f EDC, most o f i t s  implementable  PSN  i s named  i s p r i m a r i l y a data-driven system  with control-driven activities.  not emphasize hardware  J  supercomputers.  E v e n t - D r i v e n Computer supplemented  has been proposed f o r a c l a s s o f  and  the  1,000 MOPS.  is similar  raw  per  S i n c e t h e p r o p o s e d EDC l a n g u a g e  available  today  second,  s p e e d a t t a i n a b l e b y t h e EDC  t o t h e e x i s t i n g ones  techniques  million  [39,41,44,69],  will  structure  many  of  the  c o u l d be a p p l i e d t o i t s c o m p i l e r  design. Compared t o t h e resources  of  EDC  are  Dependence-Driven  System  better  when  utilized:  [32], the a  i n v o l v e d i n an a r r a y o p e r a t i o n , i t c o u l d always g e t instruction execute i t . distinguish  f r o m i t s a s s o c i a t e d IR ( i n s t r u c t i o n The array p r o c e s s i n g c a p a b i l i t i e s i t  from  the  Combined  System  TP i s n o t a  scalar  register) and of  [33],  EDC  The speed  r a n g e o f EDC i s e x p e c t e d t o be many t i m e s h i g h e r t h a n t h a t t h e PDF S y s t e m [ 3 4 ] .  also  of  133  Because  only  a  few  component  d e s i g n c o s t s o f EDC  are expected  the  randomly  programs  are  capabilities, subset  array  of the  provided;  first  and  u s i n g t h e PSN  the  Designing have  presented  several EDC  computation  concepts  loading)  of each  array  the  run s i m u l t a n e o u s l y ,  identification  driven  systems  schedule role  in  i s not  a  ideas, but  task;  there are  Firstly,  to  we still*  before the  the  detailed  developed,  and  and  hardware has  to to  run-time  overhead  (such  performance,  and  as  remedies  h a v e t o be p r o v i d e d independent  processors  play  (such  are  necessary  for other  an  as  effects  processes  F o u r t h l y , the p o l i c i e s used processes  program  i f the  i d e n t i f i c a t i o n method i s  tags are often proposed  the  performed  i t i s necessary  compound o p e r a t i o n s and providing  simple  immediate a t t e n t i o n s  i f several  [48]).  a  links  Secondly,  t h e n an  (unique  read/write  by  outset.  system  Thirdly,  of  processing  c a r r i e d out  h a r d w a r e h a v e t o be  i n c r e a s i n g t h e s i z e o f LMs) are severe.  thus  s y n c h r o n i z a t i o n method  of labor between the compiler the  of  problem  a r r a y s c o u l d be  some a r c h i t e c t u r a l  at  most  t h e LMs  for array be  the  alignment.  a supercomputer  the e f f e c t s of on  of  c o u l d become p r a c t i c a l .  clarified  analyse  As  R TP-LM p a i r s u s i n g the  s p e c i f i c a t i o n s o f t h e EDC  be  Since  serious  could  alignments  are used,  among  the  reduced.  issues which deserve  the d i v i s i o n  load,  which provides a novel  i n d i c a t e the end  low.  distributed  e q u a l i z i n g t h e memory a c c e s s memory b o t t l e n e c k s c o u l d be  t o be  types  data-  by SP  to  important  with enough operations  to  134  keep them busy; policies this  and  it  those  aspect.  computation  is  not  found  we  encoded  as  t h o s e o p e r a t i o n s on s m a l l scalar  operations  reduced;  the  another  area  convenient  so  of  and  on  tolerance in packet  been found EDC  suggested  compound o p e r a t i o n s , but  perhaps  could  such  supercomputer  issues  be  decomposed  have  study.  t h e n be b u i l t  into  sould  be  constitute  Sixthly,  it  would  be  structures  —  such  as  Lastly, although  communication  We  studies  architectures  [45], a specific  been  t o SP  decompositions  are provided.  i n the l i t e r a t u r e  above  such  array  in this repect i s indispensable.  these  exist  that  i f more d a t a  records fault  —  have  arrays  further  to the users lists  there  that the l o a d submitted  criteria of  whether  i n t h e l i t e r a t u r e c o u l d be u s e f u l i n  Fifthly, be  sure  study  concerning  feel that only  adequately  have  after  d e a l t with, can  along the l i n e s set f o r t h  for  a  EDC.  135  Table III. 1 - Scalar Scalar  operation  1.Arithmetic logic  and  operations. *  Typical examples ADD, MUL,OR  Data-flow  graph  \  I ADD  2.Boolean  EQUAL?  3.Data t r a n s f e r and c o n t r o l  SWITCH  DUPLICATE  f  WAIT  i  DUP  ^  CALL 1  J  i  1 t/ATT*  i i  4.Procedural calls  CALL,RETURN  T a b l e I I I . 2 - Compound Compound  t  < •  operations.  operation  T y p i c a l examples (MUL) 1 . V e c t o r a r i t h m e t i c (ADD), and l o g i c  2.Reduction  3.Vector  boolean  4.Alignment  (SUM),(PRODUCT), (MAX),(MIN)  Data-now  HI  I  (ADD)  l I  (SUM)  1  (EQUAL?)  (RIGHT S H I F T ) , (LEFT ROTATE)  ,11  l(ROT)  1  *See  S e c t i o n 4 on the use o f these  graph *  operators.  U  136  III.3  Table  - The "Operand/Next i n s t r u c t i o n s " of s c a l a r o p e r a t i o n s .  Computation Arithmetic.  1  and P r o c e d u r e  2  Scalar Logic  Format  3  Call  4 5  No.  Operands/Next Opl  Data T r a n s f e r  &  0p2  Opl  c Opl ^  0p3  Op2  Opl  instructions  Op 4  Next2 N e x t l -.  Op 2  y  Next2 N e x t l  Next4 Next 3 Next 2 N e x t l  Opl  „ Next4 Next3 Next2 N e x t l Next5 Next4 Next3 Next2 N e x t l  up 1  6  Boolean  fields  «e—Op  *  2 .... >.  Next«p N e x t  N e x t f Nextf Nextp N e x t  F  7  Opl  Op2  8  Next6  Next5 Next4 Next3 Next2 N e x t l  F  Control <s-16bit-*«-16 —x—16  Table  III.4  - The "Operand/Next i n s t r u c t i o n s " of compound o p e r a t i o n s .  Computation Vector and  Arithmetic  Format No. 9  Logic  Vector Boolean  10  16—*r-16—^-16—>  fields  Operands/Next  instructions  No. o f Stride elements  (VI)  <V2)  (V3)  Next2  Nextl  No. O f Stride elements  (VI)  (V2)  (V3)  Next  Next  (V2)  T  F  11  Stride No. o f elements  (VI)  Next2 Displacement  Nextl  12  No. O f Stride elements  (V2) Next4 Next3 Next2  Next^  Alignments  Reduction  K  ---8-bit-**-8  *-16^*-16-*-4 6 — * - 1 6 — 1 6  137  T a b l e I I I . 5 - The f o r m a t s of i n s t r u c t i o n Packet Content  packets.  P a c k e t Format  a. I n s t r u c t i o n A d d r e s s < A d d r e s 8 o f i n s t r u c t i o n ^ b. A c t u a l word  Instruction  ^Opcode;result & format types;operands; Next i n s t r u c t i o n a d d r e s s e s ^  T a b l e I I I . 6 - The f o r m a t s of r e s u l t Packet  R e s u l t Type a. S c a l a r Operand b. A r r a y Element being aligned c. Base A d d r e s s a s s i g n e d t o an array d. S i g n a l l i n g  packets.  Format  f e e d b a c k Count;Destination Address;Result T y p e j R e s u l ^  Token ^Feedback C o u n t ; D e s t l n a t i o n A d d r e s s ; R e s u l t Type^  e. S y n c h r o n i z a t i o n Token  f e e d b a c k C o u n t ; D e s t i n a t i o n A d d r e s s ; R e s u l t Type^  1 38  C h a p t e r V. 1.  Conclusions  Summary  of  Results  In Chapter systolic The  sorter  correctness  operational highly  the  (2)  the  data  suited  for  logic  devices  The  found  by  the  process desired  control  can  be  using  hardware unit  and  RSS.  and  general  This  design  the  following  type  because  average  is also  well-  storage  studies,  the  it's  is  s o r t e r , so  soon as  the  needed  to  sorting cycles, the  is  range  incorporated  that  input  and  closed-loop  within  method  and  (MBMs)  comparators  number of  type the  of  of  the  (3)  bubble memories  quadruple  of  and  is  by  near-neighour  sorting array  termination  as  to  on  the  list  sorting  is in  the  order.  switching  to  of the  work  comparators;  (CCDs),  terminated  Chapter  highly  the  simulation A  due  repetitive  magnetic  and  which  been p r o v e d  shift-register  as  re-circulating  been d e r i v e d .  The  number  our  [(logN)**2,N]. into  on  a  control structure required  among  i s N/4,  has  implementations  devices  items  algorithms  have  regular,  such  presented  algorithms  simple  fabrication  structure. N  two  movements.  charge-coupled  up  the  interconnections  sort  have  and  to VLSI  (1)  algorithms;  as  (RSS) of  amenable  systolic  we  constraints  attributes:  of  II,  network  parallel N=LlogL only  III  N/2  describes  (LSSN)  intended  applications. pairs  of  two-by-two  a  novel  for packet  With  L  switching  communications  loops,  t r a n s m i t t i n g and  loop-structured  i t  can  receiving  elements.  in  connect devices,  Therefore,  i t  1 39  is  very  cost-effective  topology  resembles  network[2l], achieved  by  but  that  u s e d as  It  the  advantage of  in  of  other  studies of  the  relatively  low  Computer  of  to  that  is  while  the  "pure"  better  EDC  resource  a higher  speed  modified  for  loops,  this  processors, million  networks.  Our  and  The  a  o p e r a t i o n s per  is  prevail  simulation  r a t e and  delay  despite  its  the  data-driven,  the  Event-Driven heterogeneous  control-driven  and  for  activities; advantages  c o n t r o l - d r i v e n systems Compared  to  other  array processing c a p a b i l i t i e s LSSN  application; can  hence  it  a d v a n t a g e s of a s i m p l e r a r c h i t e c t u r e ,  utilization,  network  and  methodology  proposal,  shortcomings.  the  this  switches  which  designs  be  stations.  Our  throughput  design  data-driven  range.  receiving  i s aimed a t e x t r a c t i n g t h e  their  has  the  of d e a d l o c k s  other  i s supplemented with  alleviating  designs,  of  primarily  a combined approach both  n-cube  count.  computers.  (EDC)  system which  type  IV d e s c r i b e s a new  next-generation  between  extensibility,  the average  component  Chapter  links  incremental  have shown t h a t  binary  Its  d e v i c e - t o - s w i t c h r a t i o can  packet-switched  close  indirect  t r a n s m i t t i n g and  store-and-forward  cyclical  LSSN a r e  such  both  i t s component c o u n t .  the  LSSN b e c a u s e a l l t h e  be  free  of  a much h i g h e r  could has  i n t e r m s of  of with  connect an  up  execution  s e c o n d can  Chapter  be  III  has  and been  a c o n f i g u r a t i o n of to  approximately  s p e e d o f more t h a n obtained  by  the  EDC.  64 400  1,000  1 40  2.  General The  Discussions main  practicality designs ideas  theme o f t h i s  and  usefulness  of high-performance have  been  application switched  processors  ideas  of  for  correctional  multiplicative  manipulations  the the  of process  allows  themselves  until  than  items  t h e whole  until  of r e s u l t of  no  of a novel,  packets  instructions  i n the  o r may  have  of  i s sorted;  i s  feedback  i n the  re-use  routed  resources;  the  to their (such  as  packets,  i n t h e EDC,  path  signify  cycles,  and as a  brought  into  the  depending  on t h e a m o u n t s o f  The  in  which  contribute  greatly  with each other  the  to  are  n o t be  gathered.  interact  compared  the information  they  properties  list  or  i n t h e RSS  further  o r more i n s t r u c t i o n may  additive  signals); be  feedback  among  information  packets  input  uses  interaction  f o r executions,  the  packet-  direct  path  to  specific  are entirely  (i.e.,  to  f o r the network  one  which  packets  computation  feedback  Our  sorting,  of feedback  the  i s  competitions  new  of  i n our proposals  of the input  data  but there  completion  result,  use  methodology  purposes  i n t h e RSS a r r a y s )  arrivals  computers.  the  control,  the sole purpose  resources  comparisons other  that  feedback  destinations,  the  architectures i n the  parallel  feedback  signals  LSSN,  demonstrate  computer.  from  network  through  including  different  network  to  and  communications and the d e s i g n  The  among  i s  of c y c l i c a l  illustrated  examples  next-generation  arrays,  thesis  cyclical  manners  architectures.  For  141  instance, type  along  the  coming  specific  case  architectures cannot  be  Another  processed  in  Transmitting  going  to  be  through  execution  lack  Suggestions  several  this  new  practicality  in  done.  memory  locations  result  t h e EDC  packets  system  because  and  should  cyclical  interrupts  the computation  path  interrupts  read/write links  connecting  the  when  be  the  reponse  Local  path;  times  cyclical  Memories manner,  allow without  therefore,  c o u l d be  architectures  i n the acyclic  fast  expected. are  better  systems.  Work  we h a v e d e v e l o p e d  architectural  --  in a control-driven  f o r Futher  concepts  and u s e f u l n e s s .  Specific  conflict;  responsiveness  t o those  thesis,  path  the  with packets  short  take  i f the programs a r e c o r r e c t l y  immediately  t h e p r o p o s a l s , b e c a u s e we be  of  this  network  property of packet-switched,  executed  hence  how  do n o t e x i s t  RSS  i s no d a t a  to await  (i.e.,  Processors  when c o m p a r e d  In  the  t h e PSN a n d t h e f e e d b a c k  and  problem  i f there are always  t h e EDC, d i r e c t  general, resources  utilized  to  and there  and loaded), then  i s their  while  programs  3.  paths  a l r e a d y be c o n g e s t e d  occur;  In  movements i n  out of the network  deadlock-free.  demonstrated  but the deadlock  o f t h e EDC,  compiled  to the store-and-forward  however,  i n the L o c a l Memories  written,  could  susceptible  (we h a v e ,  c a n be s o l v e d ) ,  available  the  i s  t h e RSS, because d a t a  place in  LSSN  of deadlocks  problem in  the  feel topics  We  and  have  some  ideas based  demonstrated  not implemented  t h a t more r e l a t e d f o r further  work  on  their any of  has  yet  r e s e a r c h have  been  142  suggested concerning tolerance the  utmost  to  make  in  the  the  previous  detailed  hardware  s t u d i e s of the three  in  particular,  specifications  designs,  should  and  perhaps  a t t e n t i o n s , s i n c e our proposed systems a r e use  of  hundreds  to  components,  processing  and  storage  components  will  paralyse  important  chapters;  i s s u e s a r e not  the  thousands  entire  of  work fault  receive designed  interconnected  failures systems,  i n c l u d e d i n our s t u d i e s .  of  single  and  these  143  Appendix  A  Lemma 1 1 1 . 1 : C o n s i d e r which  a LSSN which h a s L l o o p s  i s destined  L'=logL loop  t  admission  Proof:  f o rthe address  , where  be routed  to the  will  ....t  L'  of  steps  i s admitted  jf i_> . According L ' * s* * 1 packet w i l l be routed t o t h e l o o p  routing  into the  located  in  the  s - t h stage,  maximum v a l u e o f s required  t  to  s'  t  i s L',  route  by  where s=1,2,...L'.  only  t h e packet  L' to  steps the  of  v i a the  a  this switch  Since the routing  aforementioned  are loop  1*  a LSSN w i t h L l o o p s a n d a  is  the address  destined  L'=logL  LSSN  L' s* * 1  Lemma I I I . 2 : C o n s i d e r for  and S'= logL'l.  the loop  of  matching along  £  L  i  According  *  •  ,  has  the  routing  which ,  G  i t needs a t most a n o t h e r  that loop t o reach  to  packet  A , . . A ^ ^ ,..-c!^  A f t e r t h e packet  r  to  Proof:  after i t s  t o t h e r o u t i n g scheme, L'**  L'  L  i n t o t h e LSSN.  Suppose t h e packet  t  packet  Aj-£ ,...-£  The packet  loop  JL  a  A g . .  and S'=riogL'l. within  and  been (L'-l)  where routed steps  i t sd e s t i n a t i o n .  scheme,  the destination  144  address  A  .. A t t)  ... t  the loop  to  .  loop, i t will  the l i n k which  (L'-1) at  of t h e L' o u t p u t  (L'-1)  Theorem  be  removed by e i t h e r  In  to i t s  a  LSSN  are  with  destination  Proof: T h i s theorem  needed  III.2:  of the  In e i t h e r  L  average  case,  loops, a packet w i l l  be  (21ogL  of  -1)  steps  generated.  i s a r e s u l t o f Lemma 1 a n d  The  remainig  necessary.  within  r o u t i n g r e g a r d l e s s of where i t i s  Theorem  to  the r e c e i v e r attached  i s p a r t o f t h a t l o o p , o r one  steps of matching  111.1 :  delivered  along  A f t e r the packet has been s w i t c h e d  r e c e i v e r s a t t a c h e d t o t h e same l o o p .  most  links  X  •L*  this  i s one X  Li  X  number  to deliver a result packet  of  2.  routing steps  (ARS)  i n a LSSN w i t h L l o o p s i s ,  ARS(L)=(3logL-1)/2+2/L-1  Proof: Without  l o s s of g e n e r a l i t y and  for simplicity,  consider a t r a n s m i t t e r (Tr) l o c a t e d i n the f i r s t network. which  The  c a n be  paths,  as  tree" which the  reached shown  without  in  branches  going  Fig.III.2, out toward  paths, and  "tapering" tree.  stage of  routes from t h i s Tr to the set of output through  The  is  in  the  shall the links  feedback  i s i n the form of a "binary  the lower end of the  routes to the remaining set of output  the feedback  we  the  form  network;  l i n k s would of  an  number o f r o u t i n g s t e p s needed  include^  irregular, to  reach  145  t h e r e c e i v e r s on t h e s e two t r e e s a r e t a b u l a t e d  From  in Table  III.1.  T a b l e 111.1, t h e a v e r a g e number o f r o u t i n g  needed f o r a Tr t o r e a c h any o u t p u t l i n k  steps  i s therefore,  ARS(L)={(2x1+4x2+..+LlogL)+(L-2)(logL+1)+(L-4)(logL+2)+ +(L/2)(2logL-1)}/{(2+4+..+L)+(L-2)+(L-4)+..+(L/2)} ={(LlogL-21ogL+L)+(LlogL-41ogL+2L)+(LlogL-L/21ogL+ LdogL-1 )+LlogL}/{LlogL} ={L(logL+(logL+1)+(logL+2)+..+(logL+logL-1))logL(2+4+..L/2)}/{LlogL} ={LlogL(3logL-1)/2-21ogL(2**(logL-1))/{LlogL} =(3logL-1)/2+2/L-1 Q.E.D.  Table.III.1  - The number of r o u t i n g s t e p s needed t o r e a c h t h e r e c e i v e r s of the " b i n a r y " and " t a p e r i n g " t r e e s .  S t a g e # of Rr  Binary tree #Rrs #steps  Tapering tree #Rrs #steps  stage 1  2  1  L-2  logL+1  stage  4  2  L-4  logL+2  ••  ••  • •  stage(logL-1)  L/2  logL-1  L/2  2logL-1  stage(logL)  L  logL  0  21ogL  2  ••  Corallary  ••  111.1 : Any p a c k e t a d m i t t e d i n t o L S S N w i l l  go t h r o u g h  146  the feedback  p a t h a t most  twice.  Proof: Consider a t r a n s m i t t e r ( T r ) which s-th  stage  suppose  sends  a packet a t  t o a r e c e i v e r ( R r ) i s o f r r o u t i n g s t e p s away; a n d  there a r e L' stages i n t h e network.  feedbacks,  the  The  number  of  F, c o u l d be c a l c u l a t e d a s :  F(L)=Quotient((s+r-1)/L')  The  reader  may  verify  the  correctness  e x p r e s s i o n w i t h a s i m p l e example on F i g . I I I . 2 .  of  The  this  maximum  value of F i s therefore,  F(L)  Since  s  m  F(L)  =Quotient(( max S  a  x  max  = L* a n d r  =Quotient(  m  a  r  max ~  1  )/L*)  = 2L«-1 (from Theorem I I I . l ) ,  x  ( 3 L  +  '  "  2 )  then  /L')  = 2 Q.E.D.  Theorem  III.3:  F o r a LSSN w i t h L l o o p s , t h e p r o b a b i l i t y  the d e s t i n a t i o n a d d r e s s c a r r i e d by a r e s u l t packet w i l l the  label  of  an  output  removed from t h e network i s :  l i n k , and hence  that match  the packet w i l l  be  147  P  =2L/{3LlogL-L+4}  removed  where the t r a n s m i s s i o n p a t t e r n i s such receiving that  the  and  network i s e q u a l l y l i k e l y  S i n c e the LSSN has  switches  and  receiving ports each and  of  each  every  to  receive  logL  stages  packet.  Proof: of  port  that  stage every  RP  transmitted  L loops,  i t would have  LlogL p a i r s of t r a n s m i t t i n g p o r t s  (RPs).  Consider  the case  in which  of  the network transmits a r e s u l t  in  the  by  each  network, stage  then of  RPs  the  (TPs) a  TP  packet  number  and in  to  of  each  packets  i s as t a b u l a t e d i n  Table  III.2. In T a b l e times  III.2, "Feedback  the packets  to reach could  will  be  the  (i.e.,  verified  on  RPs  stage,  to  N  the  matched  =  a  row  the  feedback  number  paths  in  of  order  this  table  in Fig.III.3.  From  that w i l l  particular  is obtained  is  c o r r e c t n e s s of  number of p a c k e t s  connected  logL-th)  The  the  the example given  across the corresponding  and  through  their destinations.  t h i s t a b l e , the t o t a l by  go  Count"  stage,  by summing up  of the t a b l e , and  be  received  say the the  last  numbers  it is:  LlogL  t o t a l number of s w i t c h i n g o p e r a t i o n s p e r f o r m e d  by  the  148  same s t a g e i s : N  total  = N  +N  matched  unmatched  where N  i s t h e number o f p a c k e t s w h i c h w i l l n o t be unmatched removed by RPs o f t h a t s t a g e b e c a u s e o f unmatched d e s t i n a t i o n  a d d r e s s e s c a r r i e d by them. stage,  N u  nmatched  c o u  ^^  In  easily  the be  case  computed  products of the e n t r i e s and t h e i r r e p e c t i v e in Table  of  the as  (logL)-th t h e sum  "Feedback  count"s  III.2: N  ^ =1x{(L-2)+(L-4)+...+(L-L/2)+(L-L) unmatched +L+(L-2)+...+(L-L/4)+(L-L/2) +L/2+L+(L-2)+...+(L-L/8)+(L-L/4) T  ••••  + 4+8+16+  + (L-4) + (L-2) } +  2x{(L-L/2) +(L-L/4)+(L-L/2) +(L-L/8)+(L-L/4)+(L-L/2) +  + (L-4) + (L-8) +  + (L-L/2)}  Let N'=(L-2)+(L-4)+...+(L-L/2), then a f t e r re-arrangement,  N unmatched  =N' +L+N' +L/2+L+N'+(L-L/2) +L/4+L/2+L+N'+(L-L/2)+(L-L/4)  of  149  +4+8+....+L/2+L+N'+(L-L/2)+(L-L/4)+..+(L-4) =N'logL+(1+2+3+...+logL-1)*L ={L(logL-1)-(2+4+8+...+L/2)}*logL +{L(1+logL-1)(logL-1)/2} ={L(logL-1)-(L-2)}logL+{LlogL(logL-1)/2} ={3L(logL)**2}/2-{3LlogL}/2+21ogL  => N  => P  t o t a l  ={3L(logL)**2}/2-{LlogL}/2+21ogL  removed  = N  matched  / N  ^ , total  =(LlogL)/{(3L(logL)**2)/2-(LlogL)/2+21ogL} ={2L}/{(3L(logL)**2)/2-L+4} Q.E.D.  T h e o r e m I I I . 4 : T h e maximum a v e r a g e t h r o u g h p u t r a t e  (MATR) o f a  LSSN w i t h L loops i s : MATR(L) = 3/2xS ^ x l o g L x L * * 2 / { 3 L l o g L - L + 4 } R  where S  R  g w  s w  i s t h e maximum r a t e o f t r a n s m i t t i n g ^ R e s u l t  between two s w i t c h e s v i a an output  link.  packets  150  Proof:  Sincere  therefore, all  there  are LlogL  i n a L-looped  t h e maximum a v e r a g e r a t e o f d e l i v e r i n g  LSSN,  packets  to  t h e r e c e i v i n g p o r t s c o u l d be f o r m u l a t e d a s :  MATR(L)=LlogLxS ^^ x ( 1 - P R  where P not  links  conflicted  contain  concerned,  a  c  o  n  f  x i  c  t  e  >xP  d  r e m o v e d  i s t h e p r o b a b i l i t y that an output . J  c  link  will  «  packet  due  and i t c o u l d be  to  conflicts  computed  within the switch  with  the  illustrations  below:  Fig.Ill.6,  On not  receive  the  average,  any  packet  25% o f t h e time due  to  an output  conflicts  in  the  therefore,  1  and  "  P  conflicted* ^ 3  4  also, MART(L)=3/4xS =3/2xS  xlogLxL**2/{3LlogL-L+4} R t S W  xlogLxL**2/{3LlogL-L+4} Q.E.D.  link  will  switch,  151  Theorem I I I . 5 : The  LSSN which uses Type-B s w i t c h e s  i s deadlock  free.  Proof:  Type-B  switches  avoiding deadlocks  (a) The  cannot i.e.,  counts  The last  be  0 and for  switched  feedback  the next  request loop.  to the f i r s t  so  when  the  are  request buffers  formation  n o t be of  shared  clogged.  any  cyclical  f e a t u r e s , t h e p a t h t r a v e r s e d by  and  spirals  interconnected in parallel,  rather than  "cylical"  in  n e t w o r k c o u l d be c o n c e i v e d a s s e v e r a l with the  Class-0  stage as the heads of the s p i r a l s , of the l a s t  they  that links that are  counts w i l l  shape,  buffers  that  class.  feature ensures  whole  immediately,  switch  stage, they w i l l  W i t h t h e s e two  the  they  the  i n the network i s " s p i r a l "  2  stage  if  counts of the packets emerging from  feature prevents  the f i r s t  ports  them.  stage are incremented  first  of  output  not  no  by p a c k e t s w i t h v a r i o u s f e e d b a c k  any p a c k e t  1, s u c h t h a t t h e y a r e  the  to  with  has  the next higher  The  features in  to hold packets  i f the b u f f e r p o o l s of the next  fedback of  of  to contend  room t o a c c e p t  second  essential  intermediate p o r t s are used  eligible  The  two  i n LSSN:  feedback  (b)  provide  stae as the t a i l s .  and  buffers  the C l a s s -  S i n c e t h e r e i s no  152  cyclical deadlock  request free.  of  resources,  the  network  is  therefore  Appendix B  PROGRAM  RECIRCULATING-SYSTOLIC-SORT;  CONST INIT_SEED MARKER COLUMN  = 3; = 1;  RANGE  100.0;  (* LARGEST RANDOM NUMBER TO BE SORTED *) TYPE STORE_RECORD = RECORD MARKER : BOOLEAN; ITEM : INTEGER END; COMPARATOR  RECORD  DATA_ARRANGE  TYPE  RECORD INIT_ROW. INIT_C0LUMN : INTEGER; A l , A J , B I , B J , C I . CO, 01, D J : INTEGER; TEMPORARY : INTEGER END; = (RRANDOM.  SSOUENTIAL);  VAR N_C0MP, H_COMP, V_COMP : INTEGER; LIMIT_NO_SWITCH : INTEGER; ROW, COLUMN INTEGER; STORE ARRAY (. 0 .. 50, 0 .. 50 .) OF STORE_RECORD; ARRAY (. 1 .. 2 0 0 .) OF COMPARATOR RECORD; COMPARATOR • T O T A L _ C Y C L E , SWITCH_PER_CYCLE : INTEGER; : INTEGER; SEED EVEN_START : INTEGER; 01 . I , <J. K, J J , ODD_START INTEGER; TEMPC. TEMPO TEMPA, TEMPB TEMP_COL : INTEGER; MARKERA. MARKERB MARKERC, MARKERD : BOOLEAN; TERMINATE BOOLEAN; DATA_ARRANGE DATA_ARRANGE_TYPE DECR INTEGER; CONTINUOUS NO SWITCH : INTEGER; PROCEDURE SETUP_NETWORK; B E G I N (* SETUP_NETWORK *) ROW := 2 * V_COMP; O COLUMN := 2 * H_COMP; N_COMP := V_COMP * H_COMP TRUNC (H_COMP / 2) (* *NUMBER OF COMPARATORS ) L I M I T NO SWITCH := 2 » H COMP »; ( TERMINATE I F NO SWITCHING CONTINUOUSLY *) I : = O 0 := O FOR K := 1 TO N_COMP DO WITH COMPARATOR (. K .) DO BEGIN (* ** EACH COMPARATOR WILL HOLD 4 ITEMS TO BE SORTED' INIT_ROW := I ; INIT_COLUMN := J ; I := I + 2;  154  IF  END  I >= 2 * V_COMP - 1 THEN BEGIN I := I - 2 * V_COMP + 1; J := J + 2 END  END (* SETUP_NETWORK * ) ;  FUNCTION RANDOM (VAR SEED : INTEGER) : INTEGER; B E G I N (* RANDOM *) RANDOM := TRUNC ( ( S E E D / 6 5 5 3 6 - 0.1) * RANGE); SEED := ( 2 5 1 7 3 * SEED + 13849) MOD 6 5 5 3 6 END (* RANDOM * ) ; PROCEDURE CHECK_TERMINATE; B E G I N (* CHECK_TERMINATE *) IF SWITCH_PER_CYCLE = 0 THEN BEGIN INCR (CONTINUOUS_NO_SWITCH); IF CONTINUOUS_NO_SWITCH >= LIMIT_NO_SWITCH TERMINATE := TRUE END ELSE BEGIN SWITCH_PER_CYCLE := 0; C0NTINU0US_N0_SWITCH := 0 END END (* CHECK_TERMINATE * ) ;  THEN  PROCEDURE I N I T I A L I Z E ; B E G I N .(* I N I T I A L I Z E *) FOR I := 0 TO (ROW - 1) DO FOR d := 0 TO (COLUMN - 1) DO WITH STORE (. I , d .) DO BEGIN IF DATA_ARRANGE = RRANDOM THEN ITEM := RANDOM ( S E E D ) ELSE BEGIN ITEM := SEED; SEED := SEED - DECR END; (* *********"MARKING EACH LOOP* * * * * * * * * *) IF ( ( d = 2 * MARKER_COLUMN - 2) AND ( I MOD 2 = 1 ) ) 0 R ( ( d = 2 * MARKER_COLUMN - 1) AND ( I MOD 2 = 0 ) ) THEN MARKER := TRUE E L S E MARKER := F A L S E END; FOR K := 1 TO N_COMP DO WITH COMPARATOR (. K .) DO  BEGIN TEMPORARY = 0; = INIT ROW; Al = INIT""COLUMN; Ad = Al ; BI Bd  CI Cd  DI Dd  = = = = =  Ad + Al + Ad; Al + Ad +  1 ; 1 ; 1 ; 1  END; T O T A L _ C Y C L E := 0; SWITCH_PER_CYCLE := 0; C0NTINU0US_N0_SWITCH := 0; TERMINATE := F A L S E END (* I N I T I A L I Z E * ) : PROCEDURE VERTICAL_COMP; B E G I N (* VERTICAL_COMP *) FOR K := 1 TO N_COMP DO WITH COMPARATOR (. K .) DO BEGIN IF (STORE (. A l , Ad . ) . I T E M > STORE (. C I . C d BEGIN TEMPORARY := STORE (. A l , Ad . ) . I T E M : STORE (. A l , Ad . ) . I T E M := STORE (. C I , Cd STORE (. C I , Cd . ) . I T E M := TEMPORARY; INCR (SWITCH_PER_CYCLE) END; IF (STORE (. B I , Bd . ) . I T E M > STORE (. D I , Dd BEGIN TEMPORARY := STORE . ( . B I . Bd . ) . I T E M ; STORE (. B I , Bd . ) . I T E M := STORE (. D I . Dd STORE (. D I . Dd . ) . I T E M := TEMPORARY; INCR (SWITCH_PER_CYCLE) END END END (* VERTICAL_COMP * ) ;  ).ITEM) THEN  .).ITEM;  ).ITEM) THEN  .).ITEM;  PROCEDURE DIAGONAL_COMP; B E G I N (* DIAGONAL_COMP *) FOR K := 1 TO N_COMP DO WITH COMPARATOR (. K .) DO BEGIN IF (STORE (. A l , Ad . ) . I T E M > STORE (. D I . Dd . ) . I T E M ) THEN BEGIN TEMPORARY := STORE (. A l , Ad . ) . I T E M ; STORE (. A l , Ad . ) . I T E M := STORE (. D I , Dd . ) . I T E M ; STORE (. D I , Dd . ) . I T E M := TEMPORARY; INCR (SWITCH_PER_CYCLE) END; IF (STORE (. B I . Bd . ) . I T E M > STORE (. C I . Cd . ) . I T E M ) THEN BEGIN TEMPORARY := STORE (. B I . Bd . ).ITEM; STORE (. B I , Bd . ) . I T E M := STORE (. C I , Cd . ) . I T E M ; STORE (. C I . Cd . ) . I T E M := TEMPORARY; INCR (SWITCH_PER C Y C L E ) END END END (* DIAGONAL_COMP * ) ; (* DIAGONAL_COMP *) PROCEDURE HORIZONTAL_COMP; B E G I N (* HORIZONTAL_COMP *) FOR K := 1 TO N_COMP DO WITH COMPARATOR (. K .) DO BEGIN TEMPA := STORE (. A l , Ad TEMPB := STORE (. B I , B d TEMPC := STORE (. C I , Cd  . ) . ITEM . ) . ITEM . ) .ITEM  END  TEMPD := STORE (. DI , Dd . ) ITEM; . MARKERA * STORE (. A l . A J . ).MARKER = STORE (. B I . Bd . ).MARKER MARKERB = STORE (. C I . Cd . ).MARKER MARKERC = STORE (. D I , Dd . ).MARKER MARKERD TEMP COL := TRUNC ( I N I T COLUMN / 2 ) ; IF ((NOT MARKERA) AND (TEMPB > T E M P A ) ) OR ((MARKERA) AND T E M P A ) ) THEN BEGIN TEMPORARY := STORE (. B I , Bd . ) . I T E M ; STORE ( . B I , Bd . ) . I T E M : = STORE (. A l , Ad . ) . I T E M ; STORE (. A l . Ad . ) . I T E M : = TEMPORARY: INCR (SWITCH_PER_CYCLE) END; IF ((NOT MARKERC) AND (TEMPD > T E M P O ) OR ((MARKERC) AND T E M P C ) ) THEN BEGIN TEMPORARY :- STORE (. D I , Dd . ) . I T E M ; STORE (. D I , Dd . ) . I T E M : = STORE (. C I , Cd . ) . I T E M : STORE (. C I . Cd . ) . I T E M : = TEMPORARY; INCR (SWITCH_PER_CYCLE) END END (* HORIZONTAL COMP * ) ;  PROCEDURE D I S P L A Y : B E G I N (* DISPLAY *) WRITELN; WRITELN; SWITCH PER CYCLE WRITELN ('NUMBER OF SWITCHING = WRITELN ('AT C Y C L E TIME = TOTAL CYCLE 5); FOR I := 0 TO (ROW - 1) DO BEGIN IF ( I MOD 2 = 0 ) THEN BEGIN WRITELN; WRITELN END; FOR d := 0 TO (COLUMN - 1 ) DO WITH STORE (. I . d .) DO BEGIN O) THEN WRITE (' IF ( d MOD 2 1): WRITE ( I T E M 4) END; WRITELN END END (* DISPLAY *) ; PROCEDURE TRU_DISPLAY; B E G I N (* TRU_DISPLAY *) WITH COMPARATOR (. 1 .) DO BEGIN EVEN_START := Ad; ODD_START := Cd END; WRITELN; WRITELN; WRITELN ('AT C Y C L E TIME=', TOTAL CYCLE FOR I := 0 TO (ROW - 1) DO BEGIN  5);  5);  (TEMPB  (TEMPD  157  IF  END  ( I MOD 2 = 0 ) THEN BEGIN FOR d := EVEN_START TO EVEN_START + COLUMN - 1 DO BEGIN J J := J MOD COLUMN; IF STORE (. I , dd .).MARKER THEN WRITE (' *', STORE (. I , dd . ) . I T E M : 2 ) E L S E WRITE (' ', STORE (. I , dd . ) . I T E M : 2 ) END; WRITELN END ELSE BEGIN FOR d := ODD_START TO ODD_START + COLUMN - 1 DO BEGIN dd := d MOD COLUMN; IF STORE (. I . d d .).MARKER THEN WRITE (' *'. STORE (. I , d d . ) . I T E M : 2 ) E L S E WRITE (' ', STORE (. I , dd . ) . I T E M : 2) END; WRITELN END END (* TRU_DISPLAY * ) ;  PROCEDURE S H I F T ; B E G I N (* S H I F T *) FOR K := 1 TO N_COMP WITH COMPARATOR (. BEGIN IF ( A l MOD 2 = E L S E Ad := ( A d IF ( B I MOD 2 = E L S E Bd := (Bd IF ( C I MOD 2 = E L S E Cd := ( C d IF ( D I MOD 2 = E L S E Dd := (Dd END END (* S H I F T * ) ;  DO K .) DO 1 ) THEN Ad + COLUMN 1 ) THEN Bd + COLUMN 1) THEN Cd + COLUMN 1 ) THEN Dd + COLUMN -  : = (Ad 1) MOD : = (Bd 1) MOD : = (Cd 1 ) MOD : = (Dd 1) MOD  + 1 ) MOD COLUMN; + 1) MOD COLUMN; + 1) MOD COLUMN; + 1) MOD COLUMN  COLUMN COLUMN COLUMN COLUMN  B E G I N (* PARA_SORT *) H_COMP := 6; V_COMP := 3; SEED := I N I T _ S E E D ; DATA_ARRANGE := RRANDOM: DECR := 1; WHILE (H_COMP <> - 9 9 9 ) DO BEGIN WRITELN ('H_COMP/V COMP/SEED/RAND/DECR=', H_COMP : 5. V_COMP : 5, DATA ARRANGE : 5. DECR : 5 ) ; WRITELN ('ENTER NEW VALUES/ -999 FOR TERMINATION'); READLN (H_COMP. V_COMP, SEED. DATA_ARRANGE, DECR); IF (H_COMP <> - 9 9 9 ) THEN BEGIN SETUP_NETWORK; INITIALIZE; TRUJDISPLAY; WHILE (NOT TERMINATE) DO BEGIN -  : 5, SEED  158  INCR ( T O T A L _ C Y C L E ) ; VERTICAL_COMP; HORIZONTAL_COMP; DIAGONAL_COMP; CHECK_TERMINATE; SHIFT END; TRU_DISPLAY; WRITELN; WRITELN ('NUMBER OF HORIZONTAL COMPARATORS =', H_COMP : 5 ) : WRITELN ('NUMBER OF V E R T I C A L COMPARATORS ='. V_COMP : 5 ) ; WRITELN ('NUMBER OF ITEMS S O R T E D ' , ROW * COLUMN : 5 ) ; WRITELN ('NUMBER OF DOUBLE_COMPARISON/SHIFT C Y C L E S =', TOTAL_CYCLE - LIMIT_NO SWITCH : 5 ) END 3  END END (* PARA  SORT  >  «  159 Appendix C  PROGRAM LSSN; (•DEADLOCK FREE •USE CENTRAL BUFFERS FOR SIMULATION,ELSE STACK * C L A S S _ 0 , _ 1 , _ 2 BUFFERS ARE P R I O R I T I Z E D : C L A S S - 2 = HIGHEST C L A S S - 1 = MIDDLE C L A S S - 0 = LOWEST •MAY 1983 *) CONST N_SW=32; N TR=64; N~RR=64; TR_INTL=10; T SIMULAT=1000000; T~TRANSFER=2: T~DECIDE=3; T_SWITCH=2; F0_SIZE=7; F1_SIZE=7; F2_SIZE=2; F012_SIZE=FO_SIZE+F1_SIZE+F2_SIZE; F_T0TAL=2*N_SW*F012_SIZE; TYPE BUFFER_RECORD=RECORD B_EMPTY:BOOLEAN; B_TR_TIME:INTEGER; B_DEST:INTEGER: B_FEEDBACK_COUNT:INTEGER; END; FIFO_RECORD=RECORD F_START,F_STOP:INTEGER; F_TOP,F_BOTTOM:INTEGER; F_EMPTY,F_FULL:BOOLEAN; F_COUNT:INTEGER; END; INPORT_RECORD=RECORD I_TIMER:INTEGER; I_EMPTY:BOOLEAN; I_TR_TIME:INTEGER; I_DEST:INTEGER; I_FEEDBACK_COUNT:INTEGER; END;  OUTPORT_RECORD=RECORD 0_TIMER:INTEGER; 0_EMPTY,0_MATCHED:BOOLEAN; 0_TR_TIME:INTEGER; 0_DEST:INTEGER; 0_FEEDBACK_COUNT:INTEGER; 0_RR,0_NEXT_SW,0_NEXT_PT:INTEGER ; END; SWITCH RECORD=RECORD INPORT:ARRAY(. 0..1 .) OF INPORT_RECORD; F I F O : A R R A Y ( . 0..1.0..2 .) OF FIFO_RECORD; OUTPORT:ARRAY(. 0..1 .) OF OUTPORT_RECORD; END; TR_RECORD=RECORD T EMPTY,T BLOCKED:BOOLEAN;  OVERFLOW  T_DEST:INTEGER; T_NEXT_SW,T_NEXT_PT:INTEGER; T_TIMER:INTEGER;("TRANSMIT WHEN TIMER END;  REACHES  CLOCK*)  VAR SWITCH:ARRAY(. 1..N_SW .) OF SWITCH_RECORD; TR:ARRAY(. 1..N_TR .) OF TR_RECORD; BUFFER : ARRAY(. 1..F TOTAL .) OF BUFFER_RECORD; FMAX:ARRAY( . 0..2 .7 OF I N T E G E R ; ( * MAX' USAGE OF F I F O * ) CLOCK:INTEGER; SEED:INTEGER; R PACKET,T_PACKET:INTEGER; TOTAL_DELAY:INTEGER; MAX_DELAY:INTEGER; TR_DELAY:INTEGER;("BLOCKAGE AT ENTRANCE*)  FUNCTION RANDOM (VAR SEED : INTEGER) : REAL; BEGIN RANDOM:=SEED/65535; SEED := ( 2 5 1 7 3 * S E E D + 1 3 8 4 9 ) M 0 D 6 5 5 3 6 ; END;  PROCEDURE I N I T I A L I Z E ; VAR TI.SI.IPI.OPI.CI.PI.BI:INTEGER; TEMPI:INTEGER; BEGIN W R I T E L N ( ' L A S T STAGE DOES NOT MATCH LOOP NUMBER,JUST INCR FB#') WRITELN('NUMBER OF SWITCHES ',N_SW:5); WRITELN('NUMBER OF TR=',N_TR:5); W R I T E L N ( ' F I F O S I Z E OF CL-O,1,2,TOTAL PER SWITCH=',FO_SIZE:5, F1_SIZE:5,F2_SIZE:5,F012_SIZE:5); WRITELN('REQUEST R A T E = ' , 1 / T R _ I N T L : 1 0 : 5 ) ; WRITELN('TR INTL= ' , T R _ I N T L : 7 ) ; W R I T E L N ( ' S I M U L A T I O N TIME =',T_SIMULAT:5); FOR SI:=1 TO N_SW DO WITH S W I T C H [ S I ] DO B E G I N (*S*) FOR I P I : = 0 TO 1 DO WITH I N P O R T [ I P I ] DO BEGIN (*IP*) I_TIMER:=0; I_EMPTY:=TRUE; I_DEST:=-1; I_TR_TIME:=0; I_FEEDBACK_COUNT:=0; END;(*IP*) FOR P I : = 0 TO 1 DO BEGIN 3  FOR C I : = 0 TO 2 DO WITH F I F O [ P I . C I ] DO B E G I N (*BP,CL*) F_EMPTY:=TRUE; F_FULL:=FALSE ; F_COUNT:=0; END;(*BP,CL*) (••DETERMINE MEMORY LOCATIONS FOR EACH SWITCH'S TEMPI:=F012_SIZE*(2*SI-2+PI) + 1 ; FIF0[PI,O].F_START:=TEMPI; FIFO[PI,0].F_TOP:=TEMPI;  BUFFER**)  161  FIFO[PI.0].F_BOTTOM:=TEMPI; FIFOtPI,0].F_ST0P:=TEMPI+FO_SIZE-1; FIFO[PI.1].F_START:=TEMPI+FO_SIZE; FIFO[PI,1].F_TOP:=TEMPI+FO_SIZE; FIFO[PI,1].F_BOTTOM:=TEMPI+FO_SIZE: FIFO[PI,1].F_ST0P:=TEMPI+FO_SIZE+F1_SIZE-1; FIFO[PI,2].F_START:=FIF0[PI,1].F_ST0P+1 ; FIFO[PI,2].F_TOP:=FIFO[PI,1].F_ST0P+1; F I F 0 [ P I , 2 ] . F _ B O T T O M : = F I F O [ P I . 1 ] . F STOP+1; FIFO[PI,2].F_STOP:=FIF0[PI,2].F_START+F2_SIZE-1; END; FOR OPI:=0 TO 1 DO WITH O U T P O R T [ O P I ] DO BEGIN(*OP*) 0_TIMER:=0; 0_EMPTY:=TRUE; 0_TR_TIME:=0; 0_MATCHED:=FALSE; 0_DEST:=-1; 0_FEEDBACK_COUNT:=0; READLN(0_RR,0_NEXT_SW.0_NEXT_PT); END;(*0P*) END:(*S*) FOR TI:=1 TO N_TR DO WITH T R [ T I ) DO BEGIN(*T* ) T_EMPTY:=TRUE; T_BLOCKED:=FALSE; T_DEST:=-1; T_NEXT_SW:= T R U N C ( ( T I + 1 ) / 2 ) ; T_NEXT_PT:= (TI+1)MOD 2; REPEAT T_TIMER:=TRUNC(RANDOM(SEED)*2*TR_INTL); U N T I L T _ T I M E R > 0 AND T_TIMER<2*TR_INTL; END;(*T*) FOR B I : = 1 TO F_TOTAL DO WITH BUFF E R [ B I ] DO BEGIN B_EMPTY:=TRUE; B_TR_TIME:=0; B_DEST:=-1; B_FEEDBACK_COUNT:=0; END; CLOCK:=0; TOTAL_DELAY:=0; TR_DELAY:=0; R_PACKET:=0: T_PACKET:=0; MAX_DELAY:=0; FMAX[0]:=0; FMAX[1]:=0; FMAX[2J:=0; (**MAX USAGE OF EACH C L A S S OF B U F F E R S * ) END;(*INIT*)  PROCEDURE TR_GET_DEST; VAR T:INTEGER; BEGIN FOR T:= 1 TO N_TR DO WITH T R [ T ] DO BEGIN(*T*)  IF  T_TIMER <= CLOCK AND T_EMPTY AND NOT T_BLOCKED THEN BEGIN(*TIMER*) REPEAT T_DEST := TRUNC(RANDOM(SEED)*S5); UNTIL ( T DEST IN (. 1..64 .) AND T_DEST <> T ) T_EMPTY:=FALSE; END;(*TIMER*) END; END;(*TR_GET D E S T * )  FUNCTION ROUTE(SW, DT:INTEGER) : INTEGER; VAR SR.DR:INTEGER; BEGIN SR:=SW; DR:=DT; ROUTE:=1;(*INITIAL SETTING*) I F SR<=8 AND ( (DR-1) MOD 2 =0) THEN ROUTE:=0 E L S E I F SR<=16 AND SR>8 AND ( T R U N C ( ( D R - 1 ) / 2 ) MOD 2 =0) THEN ROUTE:=0 E L S E I F SR<=24 AND SR>16 AND ( T R U N C ( ( D R - 1 ) / 4 ) MOD 2=0) THEN ROUTE:=0 E L S E I F SR<=32 AND SR>24 AND ( TRUNC((DR-1)/8)MOD 2=0) THEN ROUTE:=0 END;  PROCEDURE TR_TO_INPORT; VAR TT;INTEGER; PACKET_COUNT:INTEGER; BEGIN FOR TT:=1 TO N_TR DO WITH T R [ T T ] DO BEGIN(*T* ) IF NOT T_EMPTY AND T_TIMER<=CLOCK THEN WITH SWITCH[T_NEXT_SW],INPORT[T N E X T _ P T ] DO BEGIN(*READY TO TRANSMIT*) ~ PACKET_COUNT:=FIFO[T_NEXT_PT,O].F_COUNT; IF ( I_EMPTY) AND I_TIMER<=CLOCK AND F I F 0 [ T _ N E X T _ P T , O ] . F _ C 0 U N T < F O _ S I Z E AND F I F O [ T _ N E X T _ P T , 1 ] . F _ C O U N T < F 1 _ S I Z E AND F I F 0 [ T _ N E X T _ P T , 2 ] . F _ C 0 U N T < F 2 _ S I Z E THEN P A C K E T _ C 0 U N T < ( F 0 _ S I Z E - 1 ) * ( T R U N C ( T _ N E X T _ S W / 8 . 4 + 1 ) * 8 / N _ S W ) THEN ( F I F O [ 0 , 0 ] . F _ C O U N T < ( TRUNC(T_NEXT_SW/8.4)+1)) AND' ( F I F O [ 1 . 0 ] . F _ C O U N T < ( TRUNC(T_NEXT_SW/8.4)+1)) THEN BEGIN(*GRANTED TO TRANSMIT*) I_TIMER:=CLOCK+T_TRANSFER; I_EMPTY:=FALSE; I_DEST:=T_DEST; I _ F E EDBACK_COUNT:=0; I_TR_TIME:=CLOCK; T_EMPTY:=TRUE; T_TIMER:=CLOCK+ TRUNC(TR INTL*2*(RANDOM( SEED ) ) ) ; T_DEST:=-1; T_BLOCKED:=FALSE; INCR(T_PACKET); END ELSE  163  BEGIN T_BLOCKED:=TRUE: INCR(TR_DELAY);("INCREMENT END; END;("READY TO TRANSMIT") END;("T") END;(*TR_TO_INPORT*)  TOTAL  TR_DELAY*)  PROCEDURE TRANSFER( SS,BB,CC,PPRT,OOPT,CC_NEXT:INTEGER); VAR ST,BT,CT,PTRT,OPT,CT_NEXT:INTEGER; BEGIN("GRANT C L A S S - C L B U F F E R * ) ST:=SS; BT:=BB; CT:=CC; PTRT:=PPRT; OPT:=OOPT; CT_NEXT:=CC_NEXT; WITH S W I T C H [ S T ] . F I F O [ B T . C T ] , O U T P O R T [ O P T ] DO BEGIN WITH B U F F E R [ P T R T ] DO BEGIN DECR(F_COUNT) ; 0_TIMER:=CLOCK+T_SWITCH+T_DECIDE; 0_EMPTY:=FALSE; 0_DEST:=B_DEST; I F 0_DEST NOT IN (. 1..64 .) THEN W R I T E L N ( ' ? ? ? ? ? WRONG DEST, L I N E 2 5 7 ? ? ? ? ? ? ' , 0 _ D E S T : 5 ) ; IF 0_DEST=0_RR THEN 0_MATCHED:=TRUE ELSE 0_MATCHED:=FALSE; 0_FEEDBACK_COUNT:=CT_NEXT; 0_TR_TIME:=B_TR_TIME; B_EMPTY:=TRUE; B_DEST:=-1; B_FEEDBACK_COUNT:=0; B_TR_TIME:=CLOCK; F_FULL:=FALSE; F_BOTTOM:=PTRT; IF F_TOP=F_BOTTOM THEN F_EMPTY:=TRUE E L S E F_EMPTY:=FALSE; END END ( * C L A S S _ C L * ) END;  PROCEDURE OUTPORT_TO_INPORT; VAR S,OP:INTEGER; BEGIN FOR S:=1 TO N_SW DO WITH SWITCHES] DO FOR OP:=0 TO 1 DO WITH O U T P O R T [ O P ] . S W I T C H [ 0 _ N E X T _ S W ] . I N P O R T [ 0 _ N E X T _ P T ] DO BEGIN(*S.OP*) IF NOT 0_EMPTY AND NOT 0_MATCHED AND 0_TIMER<=CLOCK THEN BEGIN(*READY TO TRANSMIT*) IF ( I_EMPTY) THEN BEGIN("GRANTED TO TRANSMIT*) I TIMER:=CLOCK+T TRANSFER;  I_EMPTY:=FALSE; I_DEST:=0_DEST; I_TR_TIME:=0_TR_TIME; I_FEEDBACK_COUNT:=0_FEEDBACK_COUNT; 0_TIMER:=CLOCK+T_TRANSFER; 0_DEST:=-1; 0_EMPTY:=TRUE; 0_FEEDBACK_COUNT:=0; 0_TR_TIME:=CL0CK; END ELSE BEGIN(*BLOCKED*) (**********OUTP0RT IS B L O C K E D ? ? ? ? ? ? ? * * * * * * * * * * * * * * * * ) WRITE('?????OUTPORT BLOCKED???? ' ) ; W R I T E C A T T.SW.NEXT SW , FB_COUNT ' ) ; WRITELN( CLOCK:3,S:3,0_NEXT_SW:3,0_FEEDBACK_COUNT:4); 0_EMPTY:=FALSE; END; END(*READY*) ELSE IF 0_MATCHED THEN('REMOVE P A C K E T S * ) BEGIN 0_TIMER:=CLOCK; 0_EMPTY:=TRUE; 0_MATCHED:=FALSE; INCR(R_PACKET); W R I T E L N ( ' R E C E I V E D AT TIME,DEST.RR,DELAY=',CLOCK:5,0_DEST : 5 , 0_RR:5,CL0CK - 0 _ T R _ T I M E : 5 ) ; IF 0_TR_TIME=0 THEN W R I T E L N ( ' ? ? ? ? ? ? ? LINE 3 4 3 ? ? ? ? ' ) ; TOTAL_DELAY:=T0TAL_DELAY+(CLOCK - 0 _ T R _ T I M E ) ; IF ( C L O C K - 0 _ T R _ T I M E ) > MAX_DELAY THEN MAX_DELAY:=(CLOCK - 0 _ T R _ T I M E ) ; 0_DEST:=-1; 0_FEEDBACK_COUNT:=0; 0_TR_TIME:=CLOCK; END;(*REMOVE P A C K E T S * ) END;(*S,OP*) END;(*OUTPORT_INPORT*)  PROCEDURE INPORT_TO_POOL; VAR SS,IP:INTEGER; BEGIN FOR SS:=1 TO N_SW DO WITH S W I T C H [ S S ] DO FOR IP:=0 TO 1 DO WITH I N P O R T [ I P ] DO BEGIN (*S.IP*) IF NOT I_EMPTY AND I_TIMER<=CLOCK THEN B E G I N (*READY TO STORE PACKETS INTO BUFFER POOL*) IF F I F O [ I P , I _ F E E D B A C K _ C O U N T ] . F _ F U L L THEN W R I T E L N ( ' ! ! ! ! ! ! ! F _ F U L L ! ! ! ! ! S,P.CL=',SS:2,IP:2,I_FEEDBACK_COUNT:3) WITH F I F O [ I P , I _ F E E D B A C K _ C O U N T ] DO IF NOT F _ F U L L THEN BEGIN F_TOP:=(F_T0P+1): IF F_TOP>F_STOP THEN F_TOP:=F_START; WITH B U F F E R [ F _ T O P ] DO BEGIN INCR(F_COUNT); IF F_COUNT>FMAX[I_FEEDBACK_COUNT] THEN  FMAX[I_FEEDBACK_COUNT]:= F COUNT; B_EMPTY:=FALSE; ~ B_DEST:=I_DEST; B_TR_TIME:=I_TR_TIME; B_FEEDBACK_COUNT:=I_FEEDBACK_COUNT; I_TIMER:=CLOCK+T_SWITCH+T_DECIDE; I_EMPTY:=TRUE; I_OEST:=-1; I_FEEDBACK_COUNT:=0; I_TR_TIME:=CLOCK; F_EMPTY:=FALSE; IF F_TOP=F_BOTTOM THEN F_FULL:=TRUE ELSE F_FULL:=FALSE; END END END END ( * S , I P * ) END;(*INPORT_BUFFER POOL*) PROCEDURE POOL_TO_OUTPORT; VAR P T R P : I N T E G E R ; ( * P O I N T E R OF STRUCTURED BUFFERS* ) TERMINATE:BOOLEAN; SP,OPP,PP,CLP,CLP_NEXT:INTEGER; CHECK_DEST:INTEGER; CHECK_BIT:INTEGER; OK_TRANSFER:BOOLEAN; NOW_SCHEDULED:INTEGER; BEGIN FOR SP:=1 TO N_SW DO WITH S W I T C H f S P ] DO FOR OPP:=0 TO 1 DO WITH OUTPORT[OPP] DO BEGIN(*S,OP*) I F 0_EMPTY AND 0_TIMER<=CLOCK THEN BEGIN(*READY TO ACCEPT PACKETS FROM C L A S S _ 0 , _ 1 , _ 2 B U F F E R S * ) NOW_SCHEDULED:=0; TERMINATE:=FALSE; WHILE (N0W_SCHEDULED<6) AND (NOT TERMINATE) DO BEGIN CASE NOW SCHEDULED OF 0: B E G I N PP =0; CLP = 2; END; 1 : B E G I N PP = 1 ; CLP = 2; END; 2 : B E G I N PP =0; CLP = 1 ; END; 3 : BEGIN PP = 1 ; CLP = 1; END; 4 : BEGIN PP =0; CLP =0; END; 5: B E G I N PP = 1; CLP =0; END; <>: B E G I N WRITELN( 'ERROR IN POOL TO OUTPORT !!!!!!!');END; END; PTRP. = ( F I F O [ P P . C L P ] . F _ B 0 T T 0 M + 1 ) ; IF P T R P > F I F O [ P P , C L P ] . F _ S T O P THEN PTRP : = F I F O [ P P , C L P ] .F_START; (•REMOVE PACKET FROM BOTTOM OF B U F F E R * ) CHECK_DEST:=BUFFER[PTRP].B_DEST; CHECK_BIT:=ROUTE(SP.CHECK_DEST): (•DETERMINE SWITCH B I T OF PACKET*) ( * I F FEEDBACK PACKET, THEN GO TO NEXT CLASS OF BUFFER*) IF 0_NEXT_SW IN (. 1..8 .) THEN BEGIN IF CLP<2 AND ( ( ( C H E C K DEST-1)M0D 1 6 ) = ( ( 0 RR-1) MOD 16))THEN CLP_NEXT:=2 ELSE IF CLP<2 THEN  166  CLP_NEXT:=CLP+1 E L S E CLP_NEXT:=CLP END ELSE CLP_NEXT:=CLP; I F 0_NEXT_SW IN (. 1..8 .) AND CLP<2 THEN CLP_NEXT:=CLP+1; ELSE CLP_NEXT:=CLP; WITH SWITCH[0 N E X T _ S W ] . F I F O [ 0 _ N E X T _ P T , C L P _ N E X T ] I F (NOT F I F O [ P P , C L P ] . F _ E M P T Y ) AND (F_COUNT< ( F _ S T O P - F _ S T A R T ) ) AND (CHECK_BIT=OPP) THEN BEGIN OK_TRANSFER:=TRUE; TERMINATE:=TRUE; 0_TIMER:=CLOCK+T_SWITCH; 0_EMPTY:=FALSE; END ELSE WITH B U F F E R [ P T R P ] DO BEGIN OK TRANSFER:=FALSE; NOW_SCHEDULED:=(NOW SCHEDULED*1 ); TERMINATE:=FALSE; END; END;(*WHILE*) IF OK_TRANSFER THEN END;("EMPTY*) END;(*S,OP*) END;(*POOL_TO_OUTPORT*)  DO  TRANSFER(SP,PP.CLP,PTRP,OPP,CLP_NEXT);  PROCEDURE GROSS_DISPLAY; BEGIN WRITELN('AT T I M E = ' , C L 0 C K : 5 ) ; WRITELN(' T_PACKET='.T_PACKET:7); WRITELN(' R_PACKET=',R_PACKET:7); IF R_PACKET > 1 THEN WRITELNC AVERAGE DELAY=',TOTAL_DELAY/R_PACKET:10:5); WRITELN(' MAX DELAY= ',MAX_DELAY:5); WRITELNC AVERAGE TR_DELAY= ', T R _ D E L A Y / T _ P A C K E T : 1 0 : 5 ) ; WRITELN(' AVERAGE THROUGHPUT=', R _ P A C K E T / T _ S I M U L A T : 1 0 : 5 ) ; WRITELNC UNDELIVERED PACKETS=', T_PACKET- R _ P A C K E T : 5 ) ; WRITELNC MAX F I F O U S A G E = ' , F M A X [ 0 ] : 3 . F M A X [ 1 ] : 3 , F M A X [ 2 ] : 3 ) ; WRITELN( ' TOTAL F I F O USAGE=' ,FMAX[O] + FMAX[1] + FMAX[2] :5) ; END; PROCEDURE D E T A I L _ D I S P L A Y ; VAR SW.C.P.B:INTEGER; BEGIN WRITELN('BUFFER DISPLAY OF SWITCH ARRAY'); W R I T E C S PT CL B SRC DST SBIT STEP FB GEN_T TR_T W R I T E L N ( ' F_TOP F_BOTTOM MAX_USED'); FOR SW:=1 TO N_SW DO WITH SWITCH[SW] DO BEGIN FOR P:=0 TO 1 DO BEGIN FOR C:=0 TO 2 DO WITH B U F F E R _ P O O L [ P , C ] DO BEGIN  BF_T DE_T  DE_?');  V  167  IF IF  F _ F U L L THEN WRITELN('BUFFER F U L L : S , P , C ',SW:3,P:3,C:3); NOT F_EMPTY THEN BEGIN FOR B:=0 TO F I F O _ S I Z E - 1 DO WITH B U F F E R [ B ] DO IF NOT B_EMPTY THEN BEGIN W R I T E ( ' ',SW:2,' ',P:2,' ',C:2,' ',B:2.' ',' ',B_DE5T:5); WRITE(B_FEEDBACK_COUNT:3,' ' ) ; WRITELN(B_GENERATE_TIME:5,B_TR_TIME:5,' ', F_T0P:6,F_B0TT0M:6); END; END; END; END; WRITELN; END; END; PROCEDURE DEBUG; VAR B:INTEGER; BEGIN W R I T E L N ( ' D I S P L A Y OF B U F F E R S ' ) ; FOR B:=1 TO F_TOTAL DO WITH B U F F E R [ B ] DO BEGIN IF NOT B_EMPTY THEN WRITELN(B:7,B_TR_TIME:5,B_DEST:5,B_FEEDBACK_COUNT:5) ; END; END; BEGIN(*LSSN*) SEED:=8476; INITIALIZE; FOR CLOCK:=1 TO T_SIMULAT DO BEGIN TR_GET_DEST; INPORT_TO_POOL; POOL_TO_OUTPORT; OUTPORT_TO_INPORT; TR_TO_INPORT; I F CLOCK> T_SIMULAT -1 THEN GROSS_DISPLAY; END; END . ("INPUT F I L E WHICH CONTAINS THE INTERCONNECTION  PATTERN OF THE  LSSN*)  >  168  References: 1. T. Moto-oka ( e d i t o r ) 1982. F i f t h Generation N o r t h - H o l l a n d P u b l i s h i n g Company. 2. I E E E S p e c t r u m .  Tomorrow's C o m p u t e r s .  Computer  Vol.20, N o . l l ,  Systems. Nov.  1983.  3. H.T. Kung £> C.E. L e i s e r s o n , " S y s t o l i c A r r a y s f o r V L S I , " D e p t . o f Computer S c . , C a r n e g i e - M e l l o n U n i v . , T e c h . R e p t . CS-79-103, A p r . 1983. 4. Computer, V o l . 15, No. 2, F e b . 1982. S p e c i a l i s s u e on d a t a - f l o w computers. 5. P. C. T r e l e a v e n , D. R. B r o w n b r i d g e and R. P. H o p k i n s , " D a t a - D r i v e n a n d Demand-Driven Computer A r c h i t e c t u r e s , " ACM Computing S u r v e y s , V o l . 14, No. 1, M a r c h 1982. 6. H.S. S t o n e , " P a r a l l e l C o m p u t e r s , " i n I n t r o d u c t i o n A r c h i t e c t u r e s , e d i t e d by H.S. S t o n e e t a l , 1975, Research Associates, Inc.  t o Computer Science  7. W.R. Cyre 6 G.J. L i p o v s k i , " O n g e n e r a t i n g M u l t i p l i e r s f o r a C e l l u l a r F a s t F o u r i e r T r a n s f o r m P r o c e s s o r , " I E E E T r a n s , on C o m p u t e r s , C-21, pp83-87, 1972. 8. D. P. M i s u n a s , " A Computer A r c h i t e c t u r e f o r D a t a F l o w C o m p u t a t i o n , " MIT/LCS/TM-100, C a m b r i d g e , MA., 1975. 9. H.S. S t o n e , " P a r a l l e l P r o c e s s i n g I E E E T r a n s , on C o m p u t e r s , C-20,  with the P e r f e c t p p l 5 3 - l 6 1 , 1971.  Shuffle,"  10. D.E. M u l l e r & E.P. P r e p a r a t a , " B o u n d s t o C o m p l e x i t i e s o f N e t w o r k s f o r S o r t i n g and S w i t c h i n g , " J . A s s . Comput. Mach., V o l . 2 2 , PP195-201, A p r . 1975. 11. D.E. K n u t h , The A r t o f Computer Programming, V o l . 3 , Searching. A d d i s o n - W e s l e y , R e a d i n g , Mass., 1973.  Sorting  and  12. T. L a n g & H.S. S t o n e , " A S h u f f l e - e x c h a n g e n e t w o r k w i t h s i m p l i f i e d C o n t r o l , " I E E E T r a n s , on C o m p u t e r s , V o l . C - 2 5 , pp.55-65, J a n . 76. 13. K.E. B a t c h e r , " S o r t i n g N e t w o r k s and t h e i r A p p l i c a t i o n s , " A F I P S 1968, S p r i n g J o i n t Comput. C o n f . , p p 3 0 7 - 3 l 4 , A p r .  Proc. 1968.  14. D. N a s s i m i & S. S a h n i , " B i t o n i c S o r t on a M e s h - C o n n e c t e d P a r a l l e l Computer," I E E E T r a n s , on Comput., V10-C28, No.1, pp2-7, J a n . 1 9 7 9 . 15. C D . Thompson & H.K. Kung, " S o r t i n g on a Mesh-rConnected P a r a l l e l C o m p u t e r , " Comm. o f t h e ACM, V o l . 2 0 , No.4,pp263-2?1, A p r . 1977. 16. H.T. K u n g , " L e t ' s D e s i g n A l g o r i t h m s f o r VLSI S y s t e m s , " D e p t . of Computer S c . , C a r n e g i e - M e l l o n U n i v . , T e c h . Rep., J a n . 1979. 17. F . S . Wong & M.R. I t o , " A S y s t o l i c S o r t e r and i t s S i m u l a t i o n R e s u l t s , " D e p t . o f E . E . , The U n i v . o f B r i t i s h C o l u m b i a , T e c h . Rep., O c t . 1982.  169  18. C D . Thompson,"A C o m p l e x i t y T h e o r y f o r V L S I , " Ph.D. T h e s i s , C a r n e g i e - M e l l o n U n i v . , D e p t . o f Computer S c . , 1979. 19. M.J. F o s t e r & H.T. K u n g , " D e s i g n o f S p e c i a l - P u r p o s e V L S I C h i p s : E x a m p l e s a n d O p i n i o n s , " D e p t . o f Computer S c . , C a r n e g i e - M e l l o n U n i v . , T e c h . Rep., S e p . 1979. 20. C. Wu & T. Feng,"On a C l a s s o f M u l t i s t a g e I n t e r c o n n e c t i o n N e t w o r k s , " I E E E T r a n s , on C o m p u t e r s , V o l . C-29, No. 8, A u g . 1980, p p . 694-702. 21. M.C. P e a s e , " T h e I n d i r e c t B i n a r y n-Cube M i c r o p r o c e s s o r A r r a y , " I E E E T r a n s , on C o m p u t e r s , V o l . C-26, No.5, May 1977, p p . 4 5 8 - 4 7 3 . 22. Computer, V o l . 1 4 , No. 12, D e c . 1981. S p e c i a l n e c t i o n Networks.  issue  on i n t e r c o n -  23. F.S. Wong & M.R. I t o , " A N o v e l P a c k e t S w i t c h i n g Network," T e c h . R e p t . , D e p t . o f E . E . , The U n i v . o f B r i t i h s C o l u m b i a , C a n a d a , J u l y 1982. 24. C. Wu, T. Feng & M.C. L i n , " S t a r : A L o c a l Network S y s t e m f o r R e a l t i m e Management o f Imagery D a t a , " I E E E T r a n s , on C o m p u t e r s , V o l . C-31, No. 10, O c t . 1982, p p . 923-933. 25. D.M. D i a s & J.R. Jump,"Packet S w i t c h i n g I n t e r c o n n e c t i o n N e t w o r k s f o r M o d u l a r S y s t e m s , " i n Computer, V o l . 14, No. 12, D e c . 1981, pp.42-53. 26. A.R. T r i p a t h i & J . L i p o v s k i , " P a c k e t S w i t c h i n g i n Banyan N e t w o r k s , " P r o c e e d i n g s o f t h e 6 t h A n n u a l Symposium on Computer A r c h i t e c t u r e s , 1979, pp.160-167. 27. K.E. B a t c h e r , " S o r t i n g N e t w o r k s a n d t h e i r A p p l i c a t i o n s , " P r o c e e d i n g s o f AFIPS 1968, S p r i n g J o i n t Computer C o n f . , pp.307 -314, 1968. 28. F . S. Wong a n d M. R. I t o , " A L a r g e - S c a l e D a t a - F l o w Computer For H i g h l y P a r a l l e l S i g n a l P r o c e s s i n g , " Proceedings of t h e 1982 I n t e r n a t i o n a l C o n f e r e n c e on C i r c u i t s a n d C o m p u t e r s , New Y o r k , O c t . 1982. 29. E . R a u b o l d & J . H a e n l e , " A Method o f D e a d l o c k - f r e e R e s o u r c e A l l o c a t i o n a n d Flow C o n t r o l i n P a c k e t N e t w o r k s , " P r o c e e d i n g ICCC 1976, T o r o n t o , C a n a d a , A u g . 1976, pp.483. 30. G.H. B a r n e s & S.F. L u n d s t r o m , " D e s i g n a n d V a l i d a t i o n o f a C o n n e c t i o n Network f o r M a n y - P r o c e s s o r M u l t i p r o c e s s o r S y s t e m s , " i n Computer, V o l . 14, No. 12, D e c . 1981, p p 3 1 - 4 l . 31. K. S. Weng,"An A b s t r a c t I m p l e m e n t a t i o n f o r a G e n e r a l i z e d D a t a F l o w L a n g u a g e , " MIT/LCS/TR-228, C a m b r i d g e , MA., 1979. 32. D. D. G a j s k i , D. J . Kuck a n d D. A. P a d u a , " D e p e n d e n c e - D r i v e n C o m p u t a t i o n , " P r o c e e d i n g s o f t h e I E E E 1981 Compcon S p r i n g , pp. 168-172. 33. P. C. T r e l e a v e n , R . P. H o p k i n s a n d P. W. R a u t e n b a c k , " C o m b i n i n g D a t a F l o w a n d C o n t r o l F l o w C o m p u t i n g , " The Computer J o u r n a l ,  Vol.  2 5 , N o . 2, 1 9 8 2 , p p .  207-271.  170  3 4 . J . E . R e q u a a n d J . R. M c G r a w , " T h e P i e c e - w i s e D a t a F l o w A r c h i t e c t u r e : A r c h i t e c t u r a l C o n c e p t s , " I E E E T r a n s a c t i o n s on C o m p u t e r s , V o l . C - 3 2 , N o . 5, 1 9 8 3 , p p . 4 2 5 - 4 3 7 . 3 5 . F . S . W o n g a n d M. R. I t o , " A L o o p - S t r u c t u r e d S w i t c h i n g N e t w o r k , " T e c h i c a l Rept., Dept. o f E.E., The Univ. o f B r i t i s h Columbia, 1982. ( A c c e p t e d by I E E E T r a n s , on C o m p u t e r s . ) 3 6 . P. B u d n i k a n d D. J . K u c k , " T h e O r g a n i z a t i o n a n d U s e o f P a r a l l e l Memories," IEEE T r a n s , on Computers, V o l . C-26, 1971, p p . 1566-1569. 3 7 . D . H . L a w r i e a n d C . R. V o r a , " T h e P r i m e M e m o r y S y s t e m f o r A r r a y A c c e s s , " I E E E T r a n s , o n C o m p u t e r s , V o l . C - 3 1 , N o . 5, 1 9 8 2 , pp. 435-442. 38. B. H a n s e n . The A r c h i t e c t u r e of Concurrent Pascal. H a l l , I n c . 1977.  Prentice-  3 9 . W. B . A c k e r m a n , " D a t a F l o w L a n g u a g e s , " P r o c . o f t h e 1 9 7 9 N a t i o n a l Computer C o n f e r e n c e , 1979, p p . 1087-1095. 4 0 . S . F . L u n d s t r o m a n d G . H . B a r n e s , " A C o n t r o l l a b l e MIMD A r c h i t e c t u r e , " P r o c . o f t h e 1980 I n t e r n a t i o n a l C o n f e r e n c e on P a r a l l e l P r o c e s s i n g , 1980, p p . 19-27. 4 1 . D. C o m t e , N. H i f d i a n d J . C. S y r e , " T h e D a t a D r i v e n L A U M u l t i p r o c e s s o r System: Results and Perspectives," Information P r o c e s s i n g 8 0 , S. H. L a v i n g t o n ( E d . ) , N o r t h - H o l l a n d P u b . C o . , 1980, p p . 175-179. 4 2 . E . W. D i s j k s t r a , " C o - o p e r a t i n g S e q u e n t i a l P r o c e s s e s , " i n Programming Languages. F. Genuys ( E d . ) Academic P r e s s , 1968. 4 3 . W. B . A c k e r m a n a n d J . B . D e n n i s , " V A L — a V a l u e - o r i e n t e d A l g o r i t h m i c Language: P r e l i m i n a r y r e f e r e n c e manual," MIT/LCS TR-218, J a n . 1979. 44. R e f e r e n c e Manual f o r t h e Ada Programming Language, P r o p o s e d S t a n d a r d D o c u m e n t . US D e p a r t m e n t o f D e f e n s e , 1 9 8 0 . 4 5 . C . K. C . L e u n g , " F a u l t T o l e r a n c e i n P a c k e t C o m m u n i c a t i o n Computer A r c h i t e c t u r e s , " MIT/LCS/TR-250, 1980. 4 6 . D. A . A d a m s , " A C o m p u t a t i o n M o d e l w i t h D a t a F l o w S e q u e n c i n g , " Computer Science Dept., S c h o o l o f Hummanities and Science, S t a n f o r d U n i v e r s i t y , TR-CS17, Dec. 1968. 4 7 . A r v i n d , K. P. G o s t e l o w a n d W. E. P l o u f f e , " A n A s y n c h r o n o u s Programming Language a n d Computing Machine," TR-114a, Dept. of I n f o r . a n d Comp. S c . , UC I r v i n e , D e c . 1 9 7 8 . 4 8 . A r v i n d , V . K a t h a i l a n d K. P i n g a l i , " A D a t a f l o w A r c h i t e c t u r e w i t h Tagged Tokens," MIT/LCS/TM-174, Cambridge, Mar. 1980. 4 9 . A r v i n d a n d R. E . T h o m a s , " I - S t r u c t u r e : A n E f f i c i e n t D a t a f o r F u n c t i o n a l Language," MIT/LCS/TM-178, S e p t . 1980.  Type  5 0 . J . D. B r o c k a n d L . B. M o n t z , " T r a n l a t i o n a n d O p t i m i z a t i o n o f D a t a Flow Programs," P r o c . 1979 I n t l . C o n f . on P a r a l l e l P r o c e s s i n g , B e l l a i r e , M i c h i g a n , Aug. 1979, pp. 46-54.  171  5 1 . A . L . D a v i s , " T h e A r c h i t e c t u r e o f DDM1: A R e c u r s i v e l y S t r u c t u r e d D a t a D r i v e n M a c h i n e , " U n i v . o f U t a h , Comp. S c . D e p t . TR-UUCS-77-113, 1977. 5 2 . J . B . D e n n i s a n d D. P. M i s u n a s , " A P r e l i m i n a r y A r c h i t e c t u r e f o r a B a s i c D a t a - F l o w P r o c e s s o r , " P r o j e c t MAC. M I T C S G Memo 1 0 2 . 5 3 . J . B . D e n n i s a n d D. P. M i s u n a s , " A C o m p u t e r A r c h i t e c t u r e f o r H i g h l y P a r a l l e l S i g n a l P r o c e s s i n g , " P r o c . o f t h e ACM 1974 N a t i o n a l C o n f e r e n c e , pp. 402-409. 5 4 . J . B . D e n n i s a n d K. S. W e n g , " A p p l i c a t i o n o f D a t a F l o w Computation t o t h e Weather Problem," P r o c . o f t h e Symposium on H i g h Speed Computer a n d A l g o r i t h m O r g a n i z a t i o n s , A p r i l 1977, pp. 143-157. 5 5 . S . I . K a r t a s h e v a n d S . P. K a r t a s h e v , " D y n a m i c A r c h i t e c t u r e s : P r o b l e m s a n d S o l u t i o n s , " i n Computer, J u l y 1978 i s s u e . 5 6 . S. P. K a r t a s h e v a n d S. I . K a r t a s h e v , " S u p e r s y s t e m s f o r t h e 8 0 ' s , " i n Computer, Nov. 1980 i s s u e . 57. G. J . L i p o v s k i , " O n a V a r i s t r u c t u r e d A r r a y o f M i c r o p r o c e s s o r s , " IEEE T r a n s , on Computers, F e b . 1977, pp. 125. 5 8 . J . R. M c G r a w , " D a t a F l o w C o m p u t i n g : TM-188, J a n . 1980.  The VAL Language,"  MIT/LCS  59. L . B. M o n t z , " S a f e t y a n d O p t i m i z a t i o n T r a n s f o r m a t i o n o f D a t a F l o w P r o g r a m s , " M I T / L C S / T R - 2 4 0 , C a m b r i d g e , Ma., J a n . 1 9 8 0 . 60. J . Rambuagh,"A D a t a F l o w M u l t i p r o c e s s o r , " Feb. 1977, pp.138-146.  I E E E T r a n s , o n Comp.,  6 1 . S. S. R e d d i a n d E . A . F e u s t e l , " A R e s t r u c t u r a b l e C o m p u t e r S y s t e m , " I E E E T r a n s , on C o m p u t e r s , J a n . 1978, p p . 1-20. 6 2 . R. M. S h a p i r o a n d e t a l . , " R e p r e s e n t a t i o n o f A l g o r i t h m s a s C y c l i c P a r t i a l Ordering," Applied Data Research, Wakerfield, Mass., Report CA-7112-2711, Dec. 1971. 6 3 . H. J . S i e g e l a n d e t a l . , " A S u r v e y o f I n t e r c o n n e c t i o n M e t h o d s for R e c o n f i g u r a b l e P a r a l l e l P r o c e s s i n g Systems," National Computer C o n f e r e n c e 1979, pp. 529-542. 6 4 . M. R. S l e e p , " A p p l i c a t i v e L a n g u a g e s , D a t a F l o w a n d P u r e C o m b i n a t o r y Code," IEEE Compcon 1980, pp.112-115. 6 5 . P. C . T r e l e a v e n , " E x p l o r a t i n g P r o g r a m C o n c u r r e n c y i n C o m p u t i n g Systems," i n Computer, J a n . 1979, pp. 42-49. 66. C. G. V i c k a n d e t a l . , " A d a p t a b l e A r c h i t e c t u r e s f o r S u p e r c o m p u t e r s , " i n Computer, Nov. 1980, pp.17-36. 67. I . Watson a n d J . Gurd,"A P r o t o t y p e Data Flow Computer Token L a b e l l i n g , " N a t i o n a l Computer Conference 1979,  with pp.623-628.  172  6 8 . D. P. M i s u n a s , " S t r u c t u r e P r o c e s s i n g i n a D a t a F l o w P r o c e s s o r , " P r o c e e d i n g s o f 1976 I n t e r n a t i o n a l P a r a l l e l P r o c e s s i n g , A u g . 1976 pp. 100-105. 6 9 . R. H. P e r r o t t , "A L a n g u a g e f o r A r r a y a n d V e c t o r P r o c e s s o r s , " A C M T r a n s , o n P r o g r a m m i n g L a n g u a g e a n d S y s t e m s , V o l . 1, N o . 2, O c t . 1979, pp. 177-195.  

Cite

Citation Scheme:

    

Usage Statistics

Country Views Downloads
United States 3 3
Japan 1 0
City Views Downloads
Ashburn 2 0
Unknown 1 4
Tokyo 1 0

{[{ mDataHeader[type] }]} {[{ month[type] }]} {[{ tData[type] }]}
Download Stats

Share

Embed

Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                        
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            src="{[{embed.src}]}"
                            data-item="{[{embed.item}]}"
                            data-collection="{[{embed.collection}]}"
                            data-metadata="{[{embed.showMetadata}]}"
                            data-width="{[{embed.width}]}"
                            async >
                            </script>
                            </div>
                        
                    
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0096640/manifest

Comment

Related Items