Open Collections

UBC Theses and Dissertations

Design of a serialized link for on-chip global communication Kedia, Amit 2006

Design of a Serialized Link for On-Chip Global Communication

by

Amit Kedia

B.Tech., Indian Institute of Technology, Kharagpur, 2003

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

October 2006

© Amit Kedia, 2006

Abstract

On-chip global communication is required for data and control transfers across various modules on the chip and determines the performance of the integrated circuit in the current technology generation. A particularly difficult challenge at the present time is the routing complexity and congestion of parallel buses that span large distances on the chip. This thesis presents the design of an on-chip serialized link for replacing a parallel bus. The serial channel uses a wave-pipelined signaling scheme which provides high data rates to compensate for the loss of parallelism. A serializer-deserializer (SERDES) transceiver is investigated to determine the required number of wires in the serial channel based on interconnect bandwidth and design overhead. Further, the robustness of the SERDES technique is studied in the presence of on-chip variations. A technique for reducing the energy consumption on the serial channel by lowering the average switching activity is also proposed. Using our design technique, a 32-bit wide parallel bus operating at a frequency of 1 GHz in 90nm CMOS technology can be replaced by a serial channel consisting of 4 interconnects. Considering overhead due to the SERDES transceiver itself, the number of interconnects required in the serial channel increases to 8. Further, the intra-die variation forces the required number of interconnects to 16.
This thesis also demonstrates that the energy of a serialized link can be reduced by up to 40% if there is prior knowledge of the data to be transmitted on the bus.

Table of Contents

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER 1 INTRODUCTION
  1.1 On-Chip Global Communication
    1.1.1 On-Chip Parallel Buses: Issues
    1.1.2 On-Chip Serial Link: Potential Solution
  1.2 Design Issues for On-Chip Serial Link: Research Objective
  1.3 Thesis Contributions
  1.4 Organization of the Thesis
CHAPTER 2 BACKGROUND
  2.1 Issues with On-Chip Global Interconnects
    2.1.1 Routing Complexity
    2.1.2 Performance (Throughput vs. Latency)
    2.1.3 Power Dissipation (Energy)
    2.1.4 On-Chip Variation
  2.2 Off-Chip Serial Link vs. On-Chip Serial Link
  2.3 High-Throughput Signaling Techniques
CHAPTER 3 ON-CHIP SERIAL CHANNEL DESIGN
  3.1 Throughput-Centric Wave-Pipelined Interconnect
  3.2 Analytical Models for Various Performance Metrics
    3.2.1 Analytical Model for Bandwidth
    3.2.2 Analytical Model for Latency
    3.2.3 Analytical Model for Average Energy per Bit
    3.2.4 Analytical Model for Area of the Channel
  3.3 Performance Metrics vs. Number and Size of Repeaters
    3.3.1 Bandwidth
    3.3.2 Other Performance Metrics
  3.4 Bandwidth vs. Interconnect Width and Spacing
    3.4.1 Interconnect Spacing and Width Variation
    3.4.2 Comparison of Spacing and Width Variation
  3.5 Designing High-Bandwidth, Minimal-Area On-Chip Serial Channel
    3.5.1 Problem Description
    3.5.2 Proposed Design Technique
CHAPTER 4 SERDES TRANSCEIVER DESIGN
  4.1 Wave-Pipelined based SERDES scheme
    4.1.1 Circuit Components
    4.1.2 Serializer
    4.1.3 Deserializer
  4.2 On-Chip Serial Link: An Example
    4.2.1 Fixed Interconnect Width
    4.2.2 Variable Interconnect Width
  4.3 Effect of Intra-die Variations
    4.3.1 Variation Model
    4.3.2 Theoretical Analysis
    4.3.3 Experimental Validation
    4.3.4 Summary
CHAPTER 5 ENERGY EFFICIENCY IN THE SERIAL CHANNEL
  5.1 Previous Work and Motivation
  5.2 N:1 Bit Ordering for Switching Activity Reduction
    5.2.1 Overview of Proposed Technique
    5.2.2 Problem Formulation and Solution
  5.3 Experimental Procedure
  5.4 Summary
CHAPTER 6 SUMMARY AND CONCLUSIONS
  6.1 Summary
  6.2 Conclusions
  6.3 Future Work
REFERENCES

List of Tables

Table 3.1: Electrical Parameters of a Repeated Interconnect Segment
Table 3.2: Analytical Model for Bandwidth of the On-Chip Serial Channel
Table 3.3: Analytical Model for Latency of the On-Chip Serial Channel
Table 3.4: Analytical Model for Average Energy per Bit of the On-Chip Serial Channel
Table 3.5: Analytical Model for Area Occupied by the On-Chip Serial Channel
Table 3.6: Summary of Results for the Two Design Points
Table 3.7: Summary of the Design Technique using the TPA metric
Table 3.8: Design Variables for Our Design Technique
Table 3.9: Summary of the Design Variables and Objectives for the Constant Width Case
Table 3.10: Summary of Design Variables and Objectives for the Variable Width Case
Table 4.1: Specs and Variables for 32Gbps Serial Channel and the Corresponding SERDES Specs
Table 4.2: Design Specs and Variables for 58Gbps Serial Channel and the Corresponding SERDES
Table 4.3: Timing Parameters for the SERDES Circuit Components
Table 4.4: Timing Parameters for the SERDES Operation
Table 4.5: Specs and Variables for 32Gbps Serial Channel and the Corresponding SERDES Specs
Table 4.6: Specs and Variables for 54Gbps Serial Channel and the Corresponding SERDES Specs

List of Figures

Figure 1.1: Impact of Technology Scaling on Gate Delay and Interconnect Delay
Figure 1.2: On-Chip Serial Link
Figure 1.3: Practical Implementation of the On-Chip Serial Link
Figure 2.1: Repeater Insertion for Delay Reduction
Figure 2.2: Generic Block Diagram of an Off-Chip Serial Link
Figure 2.3: Generic Block Diagram of the On-Chip Serial Link
Figure 2.4: Register Based Interconnect Pipelining
Figure 2.5: Wave Pipelining on an Interconnect
Figure 3.1: Block Diagram of the On-Chip Serial Link
Figure 3.2: Single Interconnect in the Serial Channel
Figure 3.3: Minimum Pulsewidth Permitted on the Wave-Pipelined Interconnect
Figure 3.4: Contour Plot of the Bandwidth of the Channel as a Function of m and n
Figure 3.5: Contour Plot of the Average Energy per Bit as a Function of m and n
Figure 3.6: Contour Plot of the Area Occupied by Repeaters as a Function of m and n
Figure 3.7: Contour Plot of the Latency of the Channel as a Function of m and n
Figure 3.8: Summary of the Choice of m and n for Desirable Performance in the Serial Channel
Figure 3.9: Bandwidth vs. Interconnect Spacing
Figure 3.10: Bandwidth vs. Interconnect Width
Figure 3.11: Bandwidth versus Area Occupied by Interconnects in Metal
Figure 3.12: Problem Description
Figure 3.13: Design Flow for the Constant Width Case
Figure 3.14: Summary of the Design Variables (m, n and n_s) for BW = 32Gbps
Figure 3.15: Design Flow for the Variable Width Case
Figure 4.1: Block Diagram for the On-Chip Serial Link (Our Design Aim)
Figure 4.2: 3-to-1 Multiplexer Circuit
Figure 4.3: Delay Element
Figure 4.4: Serializer Circuit
Figure 4.5: Waveform for the Serializer Operation from HSPICE
Figure 4.6: Deserializer Circuit
Figure 4.7: Waveform for the Deserializer Operation from HSPICE
Figure 4.8: Design Flow for the On-Chip Serial Link
Figure 4.9: Block Diagram for Example On-Chip Serial Link
Figure 4.10: Timing Waveform for the Example On-Chip Serial Link
Figure 4.11: Spatial Variation Model for Intra-Die Variation
Figure 4.12: Operation of Deserializer for Link1
Figure 4.13: Operation of the Deserializer for Link2
Figure 4.14: Maximum Bits in a Serial Byte as a Function of Variation Factor, k
Figure 4.15: Simulation Setup in HSPICE for Link1
Figure 4.16: Link 1, Timing Waveform at the Deserializer
Figure 4.17: Link 2, Timing Waveform at the Deserializer
Figure 5.1: Switching Activity Increased when Converting from Parallel to Serial
Figure 5.2: Average Switching on the Serial Interconnect Depends on Bit Ordering
Figure 5.3: Bit Ordering Depends on Statistical Characteristics of Data Traces
Figure 5.4: Graph for the Example of Figure 5.2 (a)
Figure 5.5: Percentage Reduction in Switching Activity using Bit Ordering
Figure 5.6: Comparison of the Bit Ordering Technique with SILENT (Instruction Address Bus)
Figure 5.7: Comparison of the Bit Ordering Technique with SILENT (Instruction Bus)

Acknowledgement

The series of chapters that I present here is the result of several months of labor that would have been extremely difficult without the invaluable help that came to me from several quarters. The endeavor was made possible only because of the constant support and unflinching encouragement I received from Dr. Resve Saleh, whom I am very fortunate to have as my supervisor. He was always forthcoming in his critical comments and was ever willing to delve into the intricacies of my work so that I could improve the quality of my study. His guidance throughout this task was absolutely indispensable for its success. I am also thankful to Dr. Steve Wilton and Dr. Shahriar Mirabbasi for their valuable feedback. I would also like to extend special thanks to my colleagues Dr. Partha Pande, Dipanjan, Karim, Jeff, XiongFei, Eddy and Laxminarayan, whose insights during discussions over various technical aspects were very useful. I also express my sincere gratitude to Melody and her parents, who made my stay at Vancouver very pleasant and comfortable. I am thankful to my other colleagues in the System-on-Chip research group who supported me through the project in various ways.
My other friends, Chhaya, Manisha and Shirley, have also played a vital role in keeping my spirits high while I was striving to accomplish this task well. I am indeed deeply indebted to my family, who are one of my chief enabling factors. They impart to me a strong sense of motivation and dedication to my work. Finally, I gratefully acknowledge the financial support provided by the Natural Sciences and Engineering Research Council of Canada and PMC-Sierra.

Amit Kedia
University of British Columbia
October 2006

Chapter 1
Introduction

This thesis presents the design and analysis of a serialized link for on-chip global communication. This introductory chapter begins by explaining the problems with the parallel buses used for on-chip global communication and proposes the serial link as a potential solution. The objectives of the thesis are presented and the thesis contributions are summarized. Finally, the organization of the chapters in the rest of the thesis is provided.

1.1 On-Chip Global Communication

On-chip global communication is required for data and control transfers across various modules on the chip and determines the performance of integrated circuits in current technology generations [1]. This is the opposite of the situation many years ago, when performance was completely determined by the quality of the transistors. The requirements of on-chip global communication are becoming increasingly difficult to meet as deep submicron effects continue to make global interconnects^1 slower than logic, as seen from Figure 1.1 [1]. The graph shows the relative delays of interconnects (3 cases) and logic gates. As the computational speed of a logic block increases for each introduced technology node, the communication between two such blocks, when separated by a long distance, suffers due to the degraded electrical properties of the global interconnects. Even with repeater insertion, there is a significant divergence in the logic and interconnect graphs.

^1 On-chip global interconnects are responsible for communication, clock distribution and power distribution. In this work we consider only the communication aspect of interconnects.

[Figure 1.1: Impact of Technology Scaling on Gate Delay and Interconnect Delay. Horizontal axis: Process Technology Node (nm), from 250 down to 32.]

For this reason, significant efforts are being directed towards the analysis and design of on-chip global interconnects [2]. In fact, it can be said that the design of global interconnects is now as much a circuit design problem as it is a routing problem.

1.1.1 On-Chip Parallel Buses: Issues

The physical channels for on-chip global communications are often implemented as buses. A bus consists of a set of wires, one for each bit to be transmitted in parallel. Drivers and receivers are provided at the two ends of each wire. In addition, intermediate repeaters and registers are inserted along the wires depending upon the latency and throughput requirements. With the demand for an increased number of on-chip modules in future system-on-chip (SoC) designs, the number of buses required for connecting these modules is increasing [3-7]. Further, the average bus length is also increasing, so the buses now span larger distances on the chip.
As a consequence, routing of on-chip buses is becoming complex, primarily due to increased routing congestion. Parallel buses can occupy large areas, comparable to processing elements, and thus pose challenges for wiring-constrained SoC designs. Besides wiring complexity and increased area at the interconnect level, parallel buses also occupy a large area at the substrate level [3]. This is because of the large number and size of drivers, repeaters and registers inserted along the interconnect.

Other problems with parallel buses are high energy consumption [5], bandwidth limitations and crosstalk-induced delay and noise [3]. Parallel buses require a large number of line drivers and repeaters, leading to increased energy dissipation. Skew and jitter on the parallel bus make receiver synchronization more difficult and thus lead to bandwidth limitations. Crosstalk between adjacent lines in a parallel bus causes data-dependent signal delay, further limiting transmission bandwidth. Moreover, crosstalk-induced noise on a parallel bus poses problems for reliable communication.

1.1.2 On-Chip Serial Link: Potential Solution

A promising solution to the problems of parallel buses is to replace them with an on-chip serial link [3-7], as shown in Figure 1.2.

[Figure 1.2: On-Chip Serial Link (n_p bits, Transmitter, Serial Channel with a single interconnect, Receiver, n_p bits)]

The n_p-bit parallel word to be transmitted across a link is serialized in some manner in the transmitter. This serial bit stream is driven onto the communication link and, upon reaching the receiver, is deserialized to reproduce the n_p-bit parallel word. This long-haul serial link can overcome various problems of parallel buses, especially those pertaining to wiring complexity and routability.
Further, serial links are more area efficient at the substrate level because the reduction in the number of interconnects in the communication channel also reduces the number of line drivers and repeaters.

1.2 Design Issues for On-Chip Serial Link: Research Objective

Although a serial link promises several benefits, it is more complex to design than a parallel bus. Hence, the scope of the proposed research is to address the various challenges that arise when shifting from a parallel bus to a serial link for on-chip global communication.

For simplicity, we assume that the sender and receiver are operating at the same frequency, although this is not true in general. To serve as a direct replacement, a single-line serial link must provide the same bandwidth as the parallel bus. But if a single interconnect is insufficient, then multiple interconnects will be needed. Therefore, any practical implementation of an on-chip serial link will have the architecture shown in Figure 1.3, where n_p bits are serialized and transmitted on n_s lines in the serial channel.

[Figure 1.3: Practical Implementation of the On-Chip Serial Link (a parallel link and a serialized link of channel length l, each clocked with time period T_clk)]

This thesis addresses the design of the different components of an on-chip serialized link, i.e., the serial channel and the SerDes (Serializer/DeSerializer), as shown in Figure 1.3. The key considerations in the design of each component are, in general, performance, design ease, energy consumption, silicon efficiency and robustness. The overall goal is to minimize n_s.
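The n_p-to-n_s mapping described above can be illustrated with a small behavioural sketch. It is not the SERDES circuit of this thesis: the round-robin assignment of bits to lines and the function names `serialize` and `deserialize` are assumptions made only for illustration.

```python
def serialize(word_bits, n_s):
    """Split an n_p-bit word into n_s serial streams (assumed round-robin mapping)."""
    n_p = len(word_bits)
    if n_p % n_s != 0:
        raise ValueError("n_p must be a multiple of n_s in this sketch")
    # Line k carries bits k, k + n_s, k + 2*n_s, ... one after another.
    return [word_bits[k::n_s] for k in range(n_s)]


def deserialize(lines):
    """Rebuild the parallel word from the n_s serial streams."""
    n_s = len(lines)
    bits_per_line = len(lines[0])
    word = [0] * (n_s * bits_per_line)
    for k, stream in enumerate(lines):
        word[k::n_s] = stream
    return word


word = [1, 0, 1, 1, 0, 0, 1, 0]   # n_p = 8
lines = serialize(word, 2)        # n_s = 2, so each line carries 4 bits per clock period
recovered = deserialize(lines)    # equals the original word
```

Each line has to carry n_p/n_s bits within one clock period T_clk, which is why fewer lines demand a higher per-line data rate.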
The serial channel must provide the same bandwidth as its parallel counterpart while simultaneously reducing the wiring area and the number of interconnects in the communication channel. High-throughput signaling schemes need to be investigated to compensate for the loss of data rate due to serialization. Besides wiring area minimization, various other performance metrics for the serial channel, e.g., energy consumption and repeater density^2, need to be optimized or traded off.

^2 Repeater density relates to the number of vias inserted on the channel.

The next challenge is the design of the circuits required to interface the parallel input/output to the serial channel. These circuits perform the serialization and deserialization functions. Very few SERDES architectures for on-chip serial links [6-10] have been proposed so far, and further investigation of the suitability and robustness of the various designs is required.

Another concern is the energy efficiency of the on-chip serial link. As already described in [11], serializing a parallel bus may increase the average switching activity and, hence, the energy consumption as compared to the parallel bus. Techniques for reducing the average switching activity on the serial link need to be explored.

1.3 Thesis Contributions

The contributions of this thesis are as follows:

• An analysis of interconnect with wave-pipelining is carried out to determine the number of interconnects needed to achieve a target bandwidth.
• An existing SERDES architecture is analyzed from the timing perspective to demonstrate how signaling overhead translates to higher bandwidth requirements. The circuit is also analyzed for robustness in the presence of on-chip variations.
• A novel bit ordering technique for reducing the energy consumption on the serial link by lowering the average switching activity is proposed.
  The technique is based on bit ordering and relies on complete statistical information about the data being transmitted on the serial link.

1.4 Organization of the Thesis

This thesis is structured as follows. The second chapter provides a general introduction to the field and the background material for the thesis. The design of an on-chip serialized link is divided into two parts, with Chapter 3 covering the design of the serialized channel. Chapter 4 discusses the SERDES transceiver scheme required for interfacing the serialized channel with parallel input/output. In Chapter 5, a novel technique for reducing the switching activity in the serial link is presented. Chapter 6 summarizes the thesis and describes directions for future work.

Chapter 2
Background

Advancements in System-on-Chip (SoC) methodologies have led to the evolution of on-chip global communication architectures [12]. Up until now, on-chip global communication has been addressed by ad-hoc direct interconnections (point-to-point links) or shared bus structures (single or hierarchical). But with the rising number of on-chip modules in future SoC designs and the increase in performance demands, the use of shared bus architectures is quickly reaching its limits. To address this issue, packet-switched network-on-chip (NoC) approaches have been proposed [13]. Whatever the communication architecture, the actual transportation of data at the physical layer is performed by an ensemble of on-chip global interconnects collectively known as a parallel bus.

In spite of the significant developments at the architectural level, the physical level is posing challenges, especially with technology scaling and deep-submicron effects. The main problem is the routing complexity of large buses. If this problem could be resolved, then power dissipation and process variation issues would become a priority.
Any alternative approach must satisfy the throughput and latency requirements, while providing proper synchronization of the data from the sender to the receiver. These issues are elaborated in the sections to follow.

2.1 Issues with On-Chip Global Interconnects

2.1.1 Routing Complexity

The average length of interconnections is increasing because the various modules span large distances on the chip [3, 5, 14]. As a consequence, the available wiring resources for global interconnect are decreasing and routing is becoming a major issue. Buses compete for global resources with the clock, power grid and other global signals. Moreover, global interconnect not only occupies routing area, it also consumes silicon area due to the large number and sizes of line drivers, repeaters and registers inserted on it. Further, the wider metal pitches, protective shields to prevent coupling and additional wires for differential signaling that are employed to boost the performance of global communication are area-hungry.

2.1.2 Performance (Throughput vs. Latency)

The trend in global interconnects is that, as technology scales, the resistance-capacitance (RC) delay of interconnects is getting worse. The traditional method of dealing with this problem is repeater insertion [15], as shown in Figure 2.1.

[Figure 2.1: Repeater Insertion for Delay Reduction. Each of the n repeated segments consists of a repeater of size m driving an interconnect segment of length l/n with resistance r(l/n) and capacitance c(l/n).]

The interconnect of length l is divided into n equal-length segments and each segment is driven by a repeater of size m times the minimum-sized inverter. The Bakoglu optimal repeater insertion technique [15] minimizes the overall wire delay or latency. It does not try to maximize throughput, since it assumes that only one data wave is traveling through the interconnect at a time.
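As a rough numerical illustration of latency-driven repeater insertion, the sketch below evaluates the textbook closed-form expressions commonly attributed to Bakoglu. It is a sketch under stated assumptions: a single repeater capacitance C0 is used rather than separate input and output capacitances, the function name is hypothetical, and the numeric values are invented for the example and are not taken from this thesis.

```python
import math


def bakoglu_repeaters(r, c, length, R0, C0):
    """Classical closed-form repeater count n and repeater size m for an RC wire.

    Assumption: one capacitance C0 per minimum-sized repeater (input and
    output capacitance are not modelled separately here).
    """
    R_wire = r * length   # total wire resistance
    C_wire = c * length   # total wire capacitance
    n = math.sqrt(0.4 * R_wire * C_wire / (0.7 * R0 * C0))
    m = math.sqrt(R0 * C_wire / (R_wire * C0))
    return n, m


# Hypothetical values chosen only to exercise the formulas:
# r in ohm/m, c in F/m, length in m, R0 in ohm, C0 in F.
n, m = bakoglu_repeaters(r=100e3, c=200e-12, length=10e-3, R0=10e3, C0=1e-15)
```

In this form the number of repeaters grows linearly with wire length, while the repeater size depends only on the per-unit-length wire parameters and the minimum-sized repeater.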
The optimal parameters for repeater insertion are given by the expressions in [15], where r and c are the resistance and capacitance of the interconnect per unit length. For the minimum-sized repeater, R_0 is the equivalent resistance and C_0_in and C_0_out are the input and output capacitance, respectively.

Several variations of the repeater insertion technique, analyzed from various design approaches, can be found in the literature. All of them attempt to reduce interconnect latency. Another method of reducing delay is to increase the cross-sectional area of the global interconnects by increasing their pitch, but this is undesirable because it leads to a significant increase in the area occupied by the interconnects and also to increased power dissipation due to increased capacitances.

If throughput is a primary design goal, then alternative methods must be applied to the problem. This has motivated our use of interconnect wave pipelining, and we will need to revisit the repeater insertion problem. Further details are given in Section 2.3.

2.1.3 Power Dissipation (Energy)

With a large (and growing) number of electronic systems being designed with battery considerations in mind, minimizing the energy spent on interconnects for on-chip global communication becomes crucial, which necessitates the use of energy-efficient global communication techniques [16]. The energy consumed by the interconnect for on-chip global communication accounts for a significant fraction of the total energy consumed in an integrated circuit, and this fraction is expected to grow as technology scales further. The reason can be attributed to the increased interconnect capacitance due to the lateral and fringing capacitance that results from interconnects getting closer. Further, repeaters and flip-flops inserted on the global interconnect for latency and throughput improvement also consume a lot of energy. Circuit techniques such as low-swing signaling and bit encoding can be used to address the issue of increased energy consumption.
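Dynamic energy on a wire is spent on signal transitions, so counting transitions gives a first-order view of how data patterns affect energy. The sketch below is only an illustration: the helper names and the data trace are hypothetical, and it compares the transition count of a parallel bus with that of a single wire carrying the same words bit by bit.

```python
def transitions(stream):
    """Number of 0->1 or 1->0 changes between consecutive bits on one wire."""
    return sum(a != b for a, b in zip(stream, stream[1:]))


def parallel_transitions(words):
    """Total transitions when each bit position has its own wire."""
    wires = zip(*words)  # wire i sees bit i of every word, in time order
    return sum(transitions(list(w)) for w in wires)


def serial_transitions(words):
    """Total transitions when all words are sent bit by bit on a single wire."""
    stream = [bit for word in words for bit in word]
    return transitions(stream)


# Hypothetical 4-bit data trace: the same word sent three times.
words = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 1, 0]]
parallel_count = parallel_transitions(words)  # every wire holds a constant value
serial_count = serial_transitions(words)      # the single wire toggles on every bit
```

For this trace the parallel wires never switch, whereas the serial wire switches at every bit boundary, which shows how strongly the transition count depends on how bits are placed on the wires.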
Since the switching activity determines the power dissipation, some methods attempt to reduce the number of transitions on the bus [17]. Techniques such as adaptive supply voltage links are employed at the system level for energy-efficient on-chip global communication.

2.1.4 On-Chip Variation

Interconnect reliability is another major challenge for on-chip global communication and includes both signaling and manufacturing reliability [18]. Electrical noise due to crosstalk, electromagnetic interference, and radiation-induced charge injection can produce data errors. Further, there will be perturbations in the electrical characteristics of the interconnects from on-chip variations [19] due to process, supply voltage and temperature effects.

2.2 Off-Chip Serial Link vs. On-Chip Serial Link

Off-chip serial links have been very popular for getting data on and off chips and boards. Wire speeds range from 1 to 12 Gb/s and payloads from 0.8 to 10 Gb/s. Fewer pins are needed on the chip, simultaneous switching output (SSO) problems are reduced and the cost is lower. As a result, the high-speed serial link is the clear choice for off-chip communication.

The key mechanism delivering the high data rate operation in off-chip serial links is self-clocking [20]. The problem with off-chip serial link design is design complexity [21]. The SERDES transceiver design is complicated partly because of the electrical characteristics of the off-chip channel and partly because of the self-clocking feature. A generic functional diagram of an off-chip serial link [20] is shown in Figure 2.2. Further details of the functioning of the off-chip serial link can be found in [20].
[Figure 2.2: Generic Block Diagram of an Off-Chip Serial Link. Transmit side: parallel interface, synchronizer (FIFO), encoding (8B/10B), multiplexer, de-emphasis, phase-locked loop, transmit driver. Serial channel: twisted pair or PCB trace (off-chip). Receive side: parallel interface, synchronizer (FIFO), decoding (8B/10B), demultiplexer, clock and data recovery, receive equalization, matched termination, receive driver. The figure marks the blocks that are not needed for an on-chip serial link.]

In order to incorporate the self-clocking property in the serial link, 8B/10B encoding is employed and a complicated clock and data recovery (CDR) circuit is needed. The multiplexing and demultiplexing operations are controlled using a high-speed clock generated by a complex phase-locked loop (PLL) circuit. The off-chip channel includes backplane traces (PCB lines, twisted pairs, cables, etc.) that behave as RLC transmission lines. Hence, proper impedance matching at the receiver is required to avoid interference due to reflection. Pre-emphasis and receiver equalization are needed to reduce signal distortion caused by the low-frequency response of transmission lines. FIFO synchronizers are used at both the transmitter and the receiver to synchronize the serial link with the parallel interface.

On the other hand, on-chip serial links are rather simple and do not require many of the functional blocks shown in Figure 2.2. Because of higher resistivity and relatively short length (as compared to off-chip wires), the inductances of on-chip interconnects are low enough to be safely ignored [22]. This eliminates the need for exactly matched termination (receive driver), because large reflections are absent when there is little or no inductance. Further, skin effect is not a major issue for on-chip interconnects at this point [23], and thus the need for pre-emphasis and receiver equalization for on-chip serial links is eliminated.
The self-clocking property is not used in some on-chip serial links, and hence 8B/10B encoding and CDR circuits are not required. Instead of using a complex PLL for high-speed clock generation to control the multiplexing and demultiplexing operations, locally generated high-speed clocks using simple ring oscillators are preferred [7]. Some clockless techniques for multiplexing and demultiplexing are also reported in the literature [8, 9]. Figure 2.3 shows a generic block diagram obtained after eliminating the various functions of the off-chip serial link that are not needed. FIFO synchronizers are required for on-chip serial links, but their design can be quite complicated and is beyond the scope of our work.

[Figure 2.3: Generic Block Diagram of the On-Chip Serial Link. Transmitter (SER): parallel interface, FIFO, mux, transmit driver. Serial channel (on-chip). Receiver (DES): simple receive buffer, demux, FIFO, parallel interface.]

In general, the design of a SERDES transceiver for an on-chip serial link is expected to be much simpler. But on-chip variations are important and have to be kept in mind while designing such links. Making the link robust against variations introduces complexity in the design, as shown in this thesis.

2.3 High-Throughput Signaling Techniques

Data rates are greatly reduced due to serialization, and serial links can be used only if they can provide a bandwidth equal to that of their parallel counterpart. In addition, the signaling technique should also have low latency and low energy/power consumption.

Pipelining can be used to enhance the throughput of global interconnects. The most common and traditional technique for enhancing throughput is to insert registers to break a long wire into short pipelined stages.
This technique is illustrated in Figure 2.4, where a wire is divided into smaller stages and registers are inserted in between to enable high throughput relative to the case when no pipelining is used.

[Figure 2.4: Register Based Interconnect Pipelining. Input, then n stages each consisting of a clocked register (D, Q) followed by an interconnect segment, then an output register; all registers share the clock Clk.]

But the limitation of register-based pipelining is that the speed-up obtained from register insertion is not linear, because registers have their own delay [24]. Further, registers are large because not only do they have to store data, they must also drive interconnect sections, and thus they consume a lot of power and occupy a large area. The most important problem with register-based pipelining is that an additional clock signal has to be routed along the length of the interconnect. This creates more routing problems and further increases power consumption. There are also the usual synchronization issues between the clock and the data.

However, register-based pipelining is preferable when throughput is a priority over latency, power and area. For designs where high throughput, low latency, low power and low area are given similar importance, traditional pipelining based on register insertion is not favorable. Wave pipelining [25] provides a potential alternative for handling the above problems because, ideally, it has no registers and hence the associated overhead can be avoided. It significantly reduces clock loading (no clock is needed) while retaining the external functionality and timing of a synchronous circuit. Although we are interested in wave pipelining for throughput enhancement over global interconnects, it is important to first understand wave pipelining in general and then extend the concept to global interconnects.

Wave pipelining was originally developed as a logic circuit technique in the 1960s.
It advocates the application of a new signal to a combinational block before the previous input has reached its intended destination storage elements. Multiple waveforms corresponding to successive evaluations co-exist within the same combinational logic block, providing multiple computational waves. Data are pipelined into circuits without using registers, because wires and transistors not only transmit and compute data but also hold them for a period of time. Wave pipelining has been successfully adopted in regular digital systems, such as DRAM, SRAM and digital multipliers, to achieve high-speed performance.

Although it appears simple, circuits with wave pipelining are difficult to design because of the multi-path problem: usually, multiple paths exist between input and output in a circuit. For wave pipelining to work, the delay difference among all the paths must be small to avoid signal racing. Moreover, multiple data paths also exist from the input to some internal nodes, and these paths must also be balanced. Large and complex circuits make balancing multi-path delay difficult, imposing structural and physical constraints that are too stringent to be practical [25].

For interconnect, wave pipelining is relatively simple because on-chip global interconnects are fairly simple circuits. They usually consist of a chain of wires and repeaters. Since there is only one path on each interconnect, no delay balancing is needed for internal nodes. The use of such an approach has been proposed in [21, 24, 26-28]. As shown in Figure 2.5, wave pipelining extends the signaling throughput by allowing multiple data pulses to travel along the interconnect at the same time.
[Figure 2.5: Wave Pipelining on an Interconnect. Multiple waves (Wave 1 to Wave 7) travel simultaneously across the length of the interconnect.]

Although wave pipelining on interconnects is relatively simple to design compared to logic circuits, the inevitable process variation, supply voltage noise and thermal uncertainty impose certain timing constraints. Another problem is that, since no storage elements separate the multiple waves, the waves are highly susceptible to noise injection and hence prone to soft errors. All of these problems can be mitigated by careful design, and thus the shortcomings are not limiting in comparison to the benefits that can be derived from this approach.

Chapter 3

On-Chip Serial Channel Design

In this chapter, the key considerations when designing a high-bandwidth, minimal-area on-chip serial channel are presented. Figure 3.1 shows the practical implementation of the serial link replacing a parallel bus, where $n_p$ bits are converted to $n_s$ bits and transmitted on $n_s$ interconnects in the channel.

[Figure 3.1: Block Diagram of the On-Chip Serial Link. A serializer clocked with period $T_{clk}$, drivers, a channel of length l with $n_s$ lines, buffers, and a deserializer clocked with period $T_{clk}$. The channel is the design aim of this chapter.]

The serial channel must provide a bandwidth equal to or greater than that of its parallel counterpart, i.e., $BW_s \ge BW_p$. The bandwidth provided by the parallel bus is given by

$$BW_p = n_p \times (1 / T_{clk}) \qquad (3.1)$$

where $T_{clk}$ is the clock period.

The signaling scheme used in the channel is the throughput-centric wave-pipelined interconnect based on repeater insertion proposed in [28]. The serial channel is designed in the Metal 5 layer of a 90 nm CMOS technology and is assumed to span a length of 10 mm across the chip.
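As a minimal sketch of the bandwidth requirement above, the following Python fragment evaluates Eq. (3.1) and the smallest number of serial lines that satisfies $BW_s \ge BW_p$ for a given per-wire throughput. The function names are hypothetical, and the 8 Gbps per-wire throughput is only an illustrative input (it is the rate implied by carrying 32 Gbps on 4 wires), not a result of the analytical model.

```python
import math


def parallel_bus_bandwidth_bps(n_p: int, t_clk_s: float) -> float:
    """Eq. (3.1): BW_p = n_p * (1 / T_clk), in bits per second."""
    if n_p <= 0 or t_clk_s <= 0:
        raise ValueError("n_p and t_clk_s must be positive")
    return n_p / t_clk_s


def min_serial_lines(bw_required_bps: float, t_max_bps: float) -> int:
    """Smallest n_s such that n_s * T_max >= BW_p (requirement BW_s >= BW_p)."""
    if t_max_bps <= 0:
        raise ValueError("t_max_bps must be positive")
    # Small tolerance so floating-point round-off does not add a spurious line.
    return max(1, math.ceil(bw_required_bps / t_max_bps - 1e-9))


if __name__ == "__main__":
    bw_p = parallel_bus_bandwidth_bps(n_p=32, t_clk_s=1e-9)  # 32-bit bus at 1 GHz
    print(f"BW_p = {bw_p / 1e9:.1f} Gbps")
    # 8 Gbps per wire is an illustrative assumption, not a model output.
    print("n_s =", min_serial_lines(bw_p, t_max_bps=8e9))
```

The per-wire throughput itself has to come from the wave-pipelining model developed in Section 3.2; here it is simply passed in as a number.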
Analytical models for the performance metrics of the on-chip serial channel based on this signaling technique are developed. The primary performance requirement of the channel (i.e., bandwidth) is a function of the interconnect geometry as well as the number of repeaters and the size of each repeater inserted on the interconnect. These factors are characterized and provide a framework for our modified design technique for the serial channel.

3.1 Throughput-Centric Wave-Pipelined Interconnect

When an on-chip parallel bus is replaced by a serial channel, wave pipelining can be used to compensate for the effective loss of parallelism. In wave pipelining, the maximum throughput is extended by allowing multiple data waves to travel along the interconnect at the same time. This signaling scheme can provide high bandwidth while reducing the wiring area and thus routing congestion, which is the central theme of this research. The advantage of this scheme over the more traditional register-based pipelining technique has already been discussed in Chapter 2.

The main idea is that traditional optimal repeater insertion [15] is not optimal from the throughput perspective. In [28], the author carries out a detailed analysis of the proposed scheme. That work includes the derivation of a closed-form analytical expression for the maximum throughput of wave-pipelined RC interconnects. It further studies the effect of various design parameters on the throughput and other performance metrics of the scheme, and proposes design techniques for optimizing different performance objectives targeting a number of applications. Our goal in this chapter is to leverage this scheme to design a high-bandwidth, minimal-area serial channel for on-chip global communication.

3.2 Analytical Models for Various Performance Metrics

To facilitate design and analysis, analytical models for the performance metrics of the wave-pipelined scheme are developed in this section. The metrics are the bandwidth, the latency, the average energy per bit and the area occupied by the channel.

Figure 3.2 shows one interconnect in the channel, of length l and width W. As shown in the figure, each interconnect is shielded from its neighbor by placing it between two co-planar supply rails of minimum width ($W_{min}$). Shielding provides excellent return paths for the high-frequency current and eliminates delay variations due to coupling. The wires in the channel (including the interconnect and the shield wires) are symmetrically spaced, with the spacing given by S. Orthogonal routing is assumed in the metal layers above and below the serial channel. The resistance and capacitance per unit length of each interconnect in the channel are given by r and c, respectively. The total capacitance includes the area, lateral and fringing components. We will assume that the rise time of the drivers and the length of the interconnects are such that the inductance can be safely ignored [22].

[Figure 3.2: Single Interconnect in the Serial Channel. An interconnect of length l, shielded on both sides, divided into n repeated segments of length l/n. Each repeated segment consists of a repeater of size m ($R_t$, $C_{t\_in}$, $C_{t\_out}$) driving an interconnect segment ($R_{seg}$, $C_{seg}$).]

As shown in Figure 3.2, each interconnect is divided into n equal-sized segments, each of which is driven by a repeater of size m times a minimum-sized inverter. The parameters of a minimum-sized inverter used for timing analysis are its on-resistance ($R_0$), input capacitance ($C_{0\_in}$) and output capacitance ($C_{0\_out}$).
The number of repeaters inserted on the interconnect is the same as the number of interconnect segments (n), and hence the two terms are used interchangeably. The electrical parameters of each repeated interconnect segment are given in Table 3.1.

Table 3.1: Electrical Parameters of a Repeated Interconnect Segment
Parameter | Expression
On-resistance of the repeater | $R_t = R_0 / m$
Repeater input capacitance | $C_{t\_in} = m\,C_{0\_in}$
Repeater output capacitance | $C_{t\_out} = m\,C_{0\_out}$
Interconnect segment resistance | $R_{seg} = r\,l / n$
Interconnect segment capacitance | $C_{seg} = c\,l / n$

All the interconnects in the channel are symmetric and have similar performance characteristics.

3.2.1 Analytical Model for Bandwidth

The bandwidth of a channel is the product of the maximum throughput of one interconnect and the number of interconnects in the channel. The maximum throughput is determined by the minimum pulsewidth allowed on the interconnect, $PW_{min}$. $PW_{min}$ depends on the degree of wave pipelining permitted on the interconnect and is limited to ensure high-quality binary transmission. As shown in Figure 3.3, the minimum width of the input pulses must be such that the output makes at least a 10% to 90% transition of the full swing [28].

[Figure 3.3: Minimum Pulsewidth Permitted on the Wave-Pipelined Interconnect. An input pulse of width $PW_{min}$ applied to an interconnect with repeaters inserted, and the corresponding worst-case output between V(Low) and V(High).]

The analytical model for the bandwidth of the on-chip serial channel is summarized in Table 3.2, based on the detailed derivation of the analytical throughput model in [28]. A distributed RC model of the interconnect segment is used to estimate the wire transients.
Table 3.2: Analytical Model for Bandwidth of the On-Chip Serial Channel
Parameter | Expression
Bandwidth | $BW_s = n_s \times T_{max}$
Maximum throughput (one interconnect only) | $T_{max} = 1 / PW_{min}$
Minimum pulsewidth | $PW_{min} = \tau_{RCseg} \ln\left( k_1 / (1 - v_1) \right)$
Sakurai time constant [29] | $\tau_{RCseg} = R_t C_t + R_t C_{seg} + R_{seg} C_t + 0.4\,R_{seg} C_{seg}$
Sakurai coefficient [29] | $k_1 = 1.01 \times$ (a ratio set by the driver resistance and load capacitance normalized to the wire; see [29])
Voltage swing $v_1$ at the output of the first repeated segment | $v_n = 0.9$; for $i = n, n-1, \ldots, 2$: $v_{i-1} = 1 / (2 - v_i)$

3.2.2 Analytical Model for Latency

Latency is the time between a signal entering the channel and exiting it. In this case, the signal delay is due to n segments of wires and repeaters. The analytical model for the latency of an on-chip serial channel, shown in Table 3.3, is based on the traditional Elmore delay model [22]. Instead of a distributed RC model for the wire segments, a lumped RC pi-model is assumed for each repeated segment.

Table 3.3: Analytical Model for Latency of the On-Chip Serial Channel
Parameter | Expression
Latency | $Latency = n \left[ R_t (C_{t\_in} + C_{t\_out}) + R_t C_{seg} + R_{seg} C_{t\_out} + 0.5\,R_{seg} C_{seg} \right]$

The accuracy of this model has been raised as an issue in [30, 31] and better models have been proposed. For our purposes, however, the model is suitable because of its simplicity and acceptable accuracy.

3.2.3 Analytical Model for Average Energy per Bit

Energy per bit is the energy consumed in transmitting a single bit from the source to the destination on the interconnect. A toggle forces each repeater to switch once, but each power event requires two toggles. Therefore, the power is pattern-dependent. The average energy per bit is the total energy on an interconnect divided by the number of bits transmitted on the channel. This is multiplied by the number of interconnects $n_s$. The details are given in Table 3.4. The expressions for the components of the power consumption (i.e., switching, leakage and short-circuit) are also presented, based on [32].
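The voltage-swing recursion listed in Table 3.2 can be sketched in a few lines of Python. This is an illustration under one assumption: the loop is read as running backwards from the last segment ($v_n = 0.9$) down to the first, which is how the flattened table entry appears to read. The function name is hypothetical.

```python
def first_segment_swing(n: int, v_out: float = 0.9) -> float:
    """Swing v_1 required at the output of the first repeated segment.

    Starts from v_n = v_out at the last segment and applies
    v_{i-1} = 1 / (2 - v_i) for i = n, n-1, ..., 2 (assumed loop direction).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    v = v_out
    for _ in range(n - 1):
        v = 1.0 / (2.0 - v)
    return v


if __name__ == "__main__":
    for n in (1, 2, 10, 30):
        print(n, round(first_segment_swing(n), 4))
```

Writing $u_i = 1/(1 - v_i)$ turns the recursion into $u_{i-1} = u_i + 1$, so for $v_n = 0.9$ the result is $1 - v_1 = 1/(n + 9)$. The swing required at the first segment therefore approaches the full swing as more repeaters are inserted, which makes the logarithmic factor in $PW_{min}$ grow slowly with n.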
A number of key design parameters can be tuned to reduce $E_b$. Reducing the size of each repeater and the number of repeaters reduces all three components of power. However, this also reduces bandwidth and increases latency. Therefore, better techniques are required to reduce $E_b$ without compromising bandwidth or latency (or both). The switching activity factor can also be minimized to reduce the switching power and the short-circuit power. This is usually done at the architectural level by encoding the data to be transmitted on the channel. A novel technique for reducing the switching activity on the on-chip serial channel is described in Chapter 5.

Table 3.4: Analytical Model for Average Energy per Bit of the On-Chip Serial Channel
Parameter | Expression
Average energy per bit | $E_b = P_{total} / (f_{clk}\, n_s)$
Average power consumption on the channel | $P_{total} = P_{switching} + P_{short\text{-}circuit} + P_{leakage}$
Average switching power | $P_{switching} = \alpha \left( n_s\, m\, n\, (C_{0\_in} + C_{0\_out}) + l\,c \right) V_{DD}^2 f_{clk}$
Average short-circuit power | $P_{short\text{-}circuit} = \alpha \left( n_s\, m\, n\, I_{short\text{-}circuit} \right) V_{DD} f_{clk}$
Average leakage power | $P_{leakage} = (1 - \alpha) \left( n_s\, m\, n\, I_{leakage} V_{DD} \right)$
Short-circuit current of a minimum-sized inverter (footnote 3) | $I_{short\text{-}circuit}$
Leakage current of a minimum-sized inverter | $I_{leakage}$
Clock frequency | $f_{clk}$
Switching factor (fraction of repeaters that are switched during an average clock cycle) | $\alpha$

3.2.4 Analytical Model for Area of the Channel

Most traditional latency-centric repeater insertion techniques that aim to reduce area [33] model only the area occupied by the repeaters on the silicon substrate. The central theme of our work is reducing the wiring area (the area occupied by the channel in the metal layer) in order to reduce routing congestion. Hence, we formulate expressions for the area occupied both by the metal and by the repeaters on the silicon substrate, and summarize them in Table 3.5.
It should be noted that the expression for the interconnect area is specific to our serial channel as shown in Figure 3.2. For channel area minimization, $A_{repeaters}$ and $A_{wire}$ must be considered separately.

Footnote 3: The short-circuit current actually depends on the shape of the input and output waveforms, which in turn depend on the design parameters, i.e., the interconnect parasitics and the size and number of repeaters. For simplicity, we assume that it is constant across design parameters.

Table 3.5: Analytical Model for Area Occupied by the On-Chip Serial Channel
Parameter | Expression
Area occupied on silicon by the repeaters | $A_{repeaters} = n_s\, m\, n\, A_{inverter\_min}$
Area of an inverter of minimum size | $A_{inverter\_min}$
Area occupied by the channel in metal | $A_{wire} = \left[ n_s W + 2 n_s S + (n_s + 1) W_{min} \right] l$

3.3 Performance Metrics vs. Number and Size of Repeaters

In this section, we characterize the performance metrics of the on-chip serial channel as a function of the number of repeaters and the size of each repeater. Initially, bandwidth is characterized by itself. Then, the other performance metrics are calibrated for repeater insertion for a given bandwidth requirement. The analytical models from Section 3.2 are used for the characterization. Since the interconnects in the channel are symmetric, we characterize a single interconnect only. The interconnect is assumed to be 10 mm long and its other dimensions are set to their respective minimums (i.e., $W_{min}$ and $S_{min}$).

3.3.1 Bandwidth

The equation for the serial bandwidth ($BW_s$) of a single-interconnect channel, from Table 3.2, can be written in expanded form with $n_s = 1$ as

$$BW_s = \frac{1}{\left( R_0 C_0 + \dfrac{R_0\, c}{m\,(n/l)} + \dfrac{r\, C_0\, m}{(n/l)} + \dfrac{0.4\, r c}{(n/l)^2} \right) \ln\left( \dfrac{k_1}{1 - v_1} \right)} \qquad (3.1)$$

where $R_t = R_0/m$ and $C_t = m\,C_0$ have been substituted into the time constant of Table 3.2, with $C_0$ the capacitance of a minimum-sized repeater. The ratio n/l is the number of repeaters per unit length of the interconnect, i.e., the repeater density. Using Eqn. (3.1), we summarize below the dependence of the bandwidth on m and n/l.

A. Number of repeaters: $BW_s$ increases monotonically with n/l, independent of the value of m. Moreover, $BW_s$ saturates if n/l is increased beyond a certain value [28]. Since the length of the wire is fixed in our case, we henceforth write n/l simply as n.

B. Repeater size: As m increases, $BW_s$ increases, reaches a maximum and then decreases, independent of n/l. Thus, there exists an optimum repeater size, $m_{opt} = \sqrt{R_0\, c / (r\, C_0)}$, which maximizes $BW_s$. It is obtained by differentiating Eqn. (3.1) with respect to m.

The contour plot in Figure 3.4 was obtained using Eqn. (3.1). Interestingly, the same value of $BW_s$ can be obtained for different combinations of m and n. Thus, for a given $BW_s$ requirement, the designer can choose either a large number of small repeaters or a small number of large repeaters. The choice has to be made by evaluating the other performance metrics of the serial channel and their variation with respect to m and n.

[Figure 3.4: Contour Plot of the Bandwidth of the Channel as a Function of m and n. Horizontal axis: Repeater Size (m). Contours: Bandwidth (Gbps).]

3.3.2 Other Performance Metrics

Figure 3.5 shows the contour plot of the average energy per bit ($E_b$) with respect to m and n. Similarly, the area occupied by the repeaters in silicon ($A_{repeaters}$) and the latency are plotted in Figure 3.6 and Figure 3.7, respectively. The area occupied by the interconnects in metal ($A_{wire}$) is not shown since it is independent of m and n, as seen in Table 3.5. To analyze the proper choice of m and n, two design points that give the same $BW_s$ of 3 Gbps are considered. The two design points, A and B, are marked in the contour plot of each performance metric. Point A corresponds to a small number of large repeaters and point B to a large number of small repeaters. The observations for the two design points are summarized in Table 3.6.
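The optimum repeater size quoted in Section 3.3.1 can be checked numerically. The sketch below assumes the segment time constant has the Sakurai form of Table 3.2 with $R_t = R_0/m$ and $C_t = m\,C_0$, and it ignores the logarithmic factor, which does not depend on m. The device and wire values are purely hypothetical placeholders, not parameters from the thesis.

```python
import math


def tau_segment(m: float, d: float, R0: float, C0: float, r: float, c: float) -> float:
    """Time constant of one repeated segment; d = n/l is the repeater density.

    Assumed form: R_t*C_t + R_t*C_seg + R_seg*C_t + 0.4*R_seg*C_seg with
    R_t = R0/m, C_t = m*C0, R_seg = r/d, C_seg = c/d.
    """
    return R0 * C0 + R0 * c / (m * d) + r * C0 * m / d + 0.4 * r * c / d**2


def m_opt(R0: float, C0: float, r: float, c: float) -> float:
    """Repeater size minimizing tau_segment: sqrt(R0*c / (r*C0))."""
    return math.sqrt(R0 * c / (r * C0))


if __name__ == "__main__":
    # Hypothetical values in SI units: ohm, farad, ohm/m, farad/m.
    params = dict(R0=1e4, C0=2e-15, r=2e5, c=2e-10)
    density = 3000.0  # 30 repeaters per cm, expressed per metre
    best = min(range(1, 300), key=lambda m: tau_segment(m, density, **params))
    print("analytic m_opt:", round(m_opt(**params), 2), "grid search:", best)
```

Only the two middle terms depend on m, and both scale as 1/d, which is why the optimum is the same for every repeater density.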
[Figure 3.5: Contour Plot of the Average Energy per Bit as a Function of m and n]

[Figure 3.6: Contour Plot of the Area Occupied by Repeaters as a Function of m and n]

[Figure 3.7: Contour Plot of the Latency of the Channel as a Function of m and n. Horizontal axis: Repeater Size (m).]

Table 3.6: Summary of Results for the Two Design Points
Performance Metric | Design point A (m = 57, n = 10) | Design point B (m = 7, n = 30) | Figure
$BW_s$ (Gbps) | equal for A and B | equal for A and B | n/a
Latency (nsec) | 0.86 | 2.06 | Figure 3.7
$E_b$ (pJ) | 1.45 | 1.2 | Figure 3.5
$A_{repeaters} / A_{inverter\_min}$ | 570 | 210 | Figure 3.6

From the results in Table 3.6, it can be concluded that, for a given $BW_s$ requirement, choosing a large number of small repeaters is suitable for reducing $E_b$ and $A_{repeaters}$, because both depend on the product of m and n, as seen from the analytical models. Figure 3.8 provides general design guidelines based on these observations. The proper choice of operating point depends on the desired design tradeoffs.

[Figure 3.8: Summary of the Choice of m and n for Desirable Performance in the Serial Channel. Horizontal axis: Repeater Size (m). Contours: Bandwidth (Gbps).]

3.4 Bandwidth vs. Interconnect Width and Spacing

In this section, we vary the interconnect width (W) and spacing (S) and study the impact on the bandwidth. Since the interconnects in the channel are symmetric, we analyze a single interconnect only. Further, it is assumed that the inserted repeaters are of the optimal size ($m_{opt}$) for the given interconnect dimensions (footnote 4). The analysis for fixed repeater sizes across varying interconnect dimensions has been carried out in [28]. This study facilitates the efficient design of the high-bandwidth, minimal-area on-chip serial channel described later in Section 3.5. The effects of W and S are analyzed separately and the observations are summarized with brief explanations. Further, a comparison of increasing W versus increasing S to raise $BW_s$ is provided from a minimal wiring-area perspective.

Footnote 4: By optimal repeater size, we mean the repeater size for which the maximum throughput of the throughput-centric signaling scheme is highest, independent of the number of repeaters inserted on the interconnect.

3.4.1 Interconnect Spacing and Width Variation

A. Interconnect Spacing (S): Let us first consider changing the spacing S between the wires in the channel while keeping W = $W_{min}$. One would expect $BW_s$ to increase as S increases. From the plot of $BW_s$ vs. S in Figure 3.9, it can be seen that $BW_s$ does increase initially but saturates for large values of S. The reason is an initial decrease in the total capacitance per unit length (c) of the wire as S increases (footnote 5). Eventually, the capacitance reaches a fixed value. It is also worth mentioning that $m_{opt}$ decreases as S increases, which implies a reduced $E_b$ and $A_{repeaters}$.

Footnote 5: For an interconnect of fixed width, the total capacitance decreases because the coupling capacitance decreases as the spacing to the neighboring wire increases. The resistance of the wire is unaffected by spacing variations.

[Figure 3.9: Bandwidth vs. Interconnect Spacing. Horizontal axis: Spacing (factor × $S_{min}$).]

B. Interconnect Width (W): Next, we study the change in $BW_s$ with W while holding S = $S_{min}$. The wire resistance and capacitance per unit length change in opposite directions as W varies. Hence, one might expect $BW_s$ to remain roughly constant, because a decrease in r is offset by an increase in c. However, from the plot of $BW_s$ versus W in Figure 3.10, it can be seen that $BW_s$ increases with W.
The reason is that the relative increase in c is smaller (footnote 6), so the decrease in r outweighs the increase in c, allowing $BW_s$ to increase. Further, as W increases, the value of $m_{opt}$ increases, which results in increased $E_b$ and $A_{repeaters}$.

Footnote 6: When the width varies, the area component of c changes with W, while the lateral and fringing components remain approximately constant.

[Figure 3.10: Bandwidth vs. Interconnect Width. Horizontal axis: Width (factor × $W_{min}$).]

3.4.2 Comparison of Spacing and Width Variation

After analyzing the individual effects of W and S on $BW_s$, a comparison is necessary to understand the relative benefit of each for increasing $BW_s$. One metric that can be used to evaluate the relative benefit is the area occupied by the channel in metal. Figure 3.11 shows $BW_s$ as a function of $A_{wire}$. The solid lines represent the case in which S is varied with W = $W_{min}$, and the dotted lines represent the case in which W is varied with S = $S_{min}$. From the plot, it can be seen that $BW_s$ increases with wire area in both cases, as explained above. However, the incremental change in $BW_s$ is much larger when W is increased than when S is increased. It can be inferred that spacing the wires apart is not a very good choice for increasing $BW_s$, because the cost in terms of increased wiring area is large. Increasing W, on the other hand, is a good option for increasing $BW_s$ without significantly increasing the wiring area. In conclusion, both W and S can be increased to increase $BW_s$. For minimal wiring-area applications, increasing W (keeping S = $S_{min}$) provides a good solution, although it is more expensive in terms of $E_b$ and $A_{repeaters}$.

[Figure 3.11: Bandwidth versus Area Occupied by Interconnects in Metal. Curves for n = 4, 8, 12 and 24, each for width variation and for spacing variation. Horizontal axis: $A_{wire} / (W_{min} \times l)$.]

3.5 Designing a High-Bandwidth, Minimal-Area On-Chip Serial Channel

In this section, we propose a technique for designing the channel using the throughput-centric wave-pipelined signaling scheme. The primary objective of our design technique is to minimize the area occupied by the interconnect in the metal layer ($A_{wire}$) for a required bandwidth ($BW_s$). The reason for minimizing $A_{wire}$ is that it reduces routing congestion, which is the original motivation of our research. In addition, for small $n_s$ the overhead due to the SERDES is reduced. The secondary design objectives for the serial channel are to minimize the area occupied by the repeaters on the silicon substrate ($A_{repeaters}$) and the average energy per bit ($E_b$). The latency of the channel is not included because the design is assumed to be latency-insensitive. Note that our modified design technique has been evaluated using the analytical models presented in Section 3.2, but similar results are expected with HSPICE simulations.

3.5.1 Problem Description

A description of the design problem is shown in Figure 3.12. In general, the aim is to design an on-chip serial channel for a required bandwidth with optimal efficiency in terms of the design objectives (performance metrics), while satisfying the design constraints.

[Figure 3.12: Problem Description. Design requirements and constraints: bandwidth of the interface ($BW_s$); repeater density per unit length. Design objectives: low average energy per bit ($E_b$); low area occupied by repeaters on silicon ($A_{repeaters}$) and by wires in metal ($A_{wire}$). Design variables: number of repeaters (n), repeater size (m), supply voltage ($V_{DD}$), interconnect width (W), spacing (S) and thickness (T), and number of channels ($n_s$).]

An initial requirement is the formulation of a combined objective function (footnote 7) from the individual design objectives. The author of [28] has developed such a function, which can be used for designing a high-bandwidth, minimal-area channel.
The combined objective function is called Throughput-Per-Area (TPA):

$$TPA = \frac{Bandwidth}{Area} = \frac{n_s \times T_{max}}{A_{repeaters} + A_{wire}} \qquad (3.2)$$

Table 3.7 summarizes the objectives using the TPA metric. Although the TPA metric aims at area reduction, it minimizes the total area of the channel, which includes both $A_{repeaters}$ and $A_{wire}$. Since $A_{wire}$ is more important, especially in the present era of routing congestion, it should be given priority over $A_{repeaters}$, which is relatively less expensive. Besides, the TPA metric assumes fixed repeater sizes for any $BW_s$, which might not provide a good solution in terms of $E_b$ and $A_{repeaters}$. Moreover, the design solution obtained using the TPA metric may require a large repeater density per unit length of the wire, thus causing via blockage problems (footnote 8).

Footnote 7: Typical combined objective functions used in VLSI design include the power-delay product and the energy-delay product.

Table 3.7: Summary of the Design Technique Using the TPA Metric
Application | High-bandwidth, minimal-area channel
Design requirement | Meet or exceed target bandwidth
Design objective | Minimize total area of the channel ($A_{repeaters} + A_{wire}$)
Combined objective function | Throughput-per-area (TPA)
Design variables |

In summary, the TPA metric is somewhat ineffective because it does not optimize the objective of primary importance, i.e., $A_{wire}$, and does not provide better tradeoffs for the other design objectives of the on-chip serial channel. Hence, instead of formulating a combined objective function, we present a different design technique that is based on prioritizing the design objectives.

3.5.2 Proposed Design Technique

Our design approach is based on multi-level programming [34], a technique for solving multi-objective optimization problems. Multi-level programming is a one-shot optimization technique intended to find just one optimal point. The first step involves ordering the objectives in terms of importance. Next, the values of the variables that minimize the primary objective within the given constraints, and that satisfy the requirements, are found. Within this set of variable values, the points that minimize the second most important objective are found next. The method proceeds recursively until all objectives have been optimized on successively smaller sets of variable values.

Footnote 8: Vias are used to link interconnects with the silicon surface. In our case, vias are used to connect the interconnect to the inserted repeaters. A large number of vias causes routing difficulties, a problem referred to as via blockage. To avoid via blockage, the repeater density per unit length of the interconnect must be bounded by a maximum.

For our target application, the primary objective is to minimize $A_{wire}$. Next, $A_{repeaters}$ and $E_b$ are optimized. We consider two cases: fixed-width design and variable-width design. The variables for the two cases are listed in Table 3.8. The interconnect thickness (T) (footnote 9) and spacing (S) (footnote 10) are assumed to be fixed, and a single supply voltage ($V_{DD}$) is assumed for both design styles.

Table 3.8: Design Variables for Our Design Technique
Design style | Design variables
Fixed interconnect width | $n_s$, m and n
Variable interconnect width | W, $n_s$, m and n

The maximum allowable repeater density $(n/l)_{max}$ is fixed at 30 repeaters per cm for each interconnect in the channel, in both cases.

Footnote 9: The wire thickness is assumed to be fixed because designers usually do not have the liberty of changing the wire thickness in a particular metal layer.

Footnote 10: The reasons for omitting the interconnect spacing have been explained in Section 3.4.

A. Design Objectives Formulation

Primary Objective: A general description of the primary objective is given below. Mathematically,

$$BW_s = f_1(n_s, n, m, W)$$

$$\text{Minimize } A_{wire} \text{ for the required } BW_s \text{ and } n/l \le (n/l)_{max}$$

Secondary Objective: The secondary objectives are to reduce $A_{repeaters}$ and $E_b$.
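The prioritized (multi-level) ordering of objectives described above can be sketched as a lexicographic selection over candidate designs. The example below is only an illustration: it takes three designs reported for the 32 Gbps case in this chapter as given (it does not re-run the bandwidth model), uses the area expressions of Table 3.5, and assumes $S = S_{min} = W_{min}$ so that the wire area can be normalized to $W_{min} \times l$. The class and function names are hypothetical, and m = 42 for the proposed design is taken from the fixed-width column of Table 3.10.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Design:
    name: str
    n_s: int         # number of interconnects in the channel
    m: int           # repeater size (multiples of a minimum inverter)
    n: int           # repeaters per interconnect
    w: float = 1.0   # interconnect width, normalized to W_min


def wire_area(d: Design, s: float = 1.0) -> float:
    """A_wire / (W_min * l) from Table 3.5; s = S / W_min (assumed 1)."""
    return d.n_s * d.w + 2 * d.n_s * s + (d.n_s + 1)


def repeater_area(d: Design) -> int:
    """A_repeaters / A_inverter_min from Table 3.5."""
    return d.n_s * d.m * d.n


def multilevel_pick(candidates: List[Design]) -> Design:
    """Level 1: keep designs with minimum wire area. Level 2: minimize repeater area."""
    best_wire = min(wire_area(d) for d in candidates)
    level1 = [d for d in candidates if math.isclose(wire_area(d), best_wire)]
    return min(level1, key=repeater_area)


if __name__ == "__main__":
    # All three are assumed to meet the bandwidth requirement already.
    designs = [
        Design("proposed", n_s=4, m=42, n=30),
        Design("tpa", n_s=4, m=57, n=29),
        Design("bakoglu", n_s=7, m=57, n=14),
    ]
    print(multilevel_pick(designs))
```

Because the second level only sees designs that tie on wire area, a small repeater area can never be traded against extra wires, which is the difference from a single combined metric such as TPA.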
In our problem, the design variables for this step are restricted to m and n only. That is, W and $n_s$ cannot be changed after the primary optimization, and hence the secondary optimization is the same for the two design options. Interestingly, both $A_{repeaters}$ and $E_b$ can be reduced simultaneously, because both are directly proportional to the product of m and n for fixed W and $n_s$. Mathematically,

$$BW_s = f_2(n, m)$$

$$E_b \propto (m \times n), \qquad A_{repeaters} \propto (m \times n)$$

$$\text{Minimize } (m \times n) \text{ for the required } BW_s \text{ and } n/l \le (n/l)_{max}$$

B. CASE 1: Constant Width

Design Steps: The flowchart in Figure 3.13 illustrates the design steps for both the primary and the secondary objectives. The goal of the primary objective is to compute $n_s$ for a required bandwidth. The design technique for the secondary objective is based on repeater insertion and has been described in detail in Section 3.3.

[Figure 3.13: Design Flow for the Constant Width Case. Flowchart boxes: Start; initialize $n_s$ = 1, W = $W_{min}$, n = 1; calculate $m_{opt}$ and set m = $m_{opt}$; calculate $BW_s$; increment n (n = n + 1) or $n_s$ ($n_s$ = $n_s$ + 1, n = 1); for the secondary objective, increase n and decrease m subject to n/l ≤ $(n/l)_{max}$; Stop.]

Experimental Results: We evaluate our technique by designing a serial channel to replace a 32-bit parallel bus operating at a frequency of 1.0 GHz ($T_{clk}$ = 1000 ps). The link is 10 mm long and routed in the Metal 5 layer of a 90 nm CMOS technology. Using Eqn. (3.1), the bandwidth requirement for the serial channel is computed to be 32 Gbps. A summary of the design variables and the performance metrics of the channel obtained using our approach is shown in Table 3.9. We also compare it with the design obtained using the TPA metric. Further, the design variables and performance metrics of a channel using optimal repeater insertion parameters based on the traditional Bakoglu method are presented.

Table 3.9: Summary of the Design Variables and Objectives for the Constant Width Case
Parameter | Our Approach | Using TPA Metric | Bakoglu Technique
Requirement: $BW_s$ | 32 Gbps | 32 Gbps | 32 Gbps
Constraint: $(n/l)_{max}$ | 30 repeaters/cm | 30 repeaters/cm | 30 repeaters/cm
Design variable: $n_s$ | 4 | 4 | 7
Design variable: $W / W_{min}$ | 1 | 1 | 1
Design variable: m | 42 | 57 | 57
Design variable: n | 30 | 29 | 14
Primary design objective: $A_{wire} / (W_{min} \times l)$ | 17 | 17 | 27
Secondary design objective: $A_{repeaters} / A_{inverter\_min}$ | 5040 | 6612 | 6384
Secondary design objective: $E_b$ | 2.95 pJ | 3.52 pJ | 2.67 pJ
Latency | 1.045 nsec | 0.975 nsec | 0.992 nsec

From the results in Table 3.9, it can be seen that the number of interconnects in the channel is 4, both with our design technique and with the TPA metric. Since W is constant, this results in the same $A_{wire}$ for the serial channel designed using the two techniques. Using optimal repeater insertion based on Bakoglu's method, the channel consists of 7 interconnects and hence has a much larger $A_{wire}$. With our design technique, $A_{repeaters}$ is smaller than with either the TPA metric or the Bakoglu method. The reduction is a result of the secondary optimization in our design technique. The TPA metric gives a higher $A_{repeaters}$ because it assumes the repeater size to be fixed. Similarly, $E_b$ for our design technique is smaller, for exactly the same reasons as for $A_{repeaters}$. The latency of the channel using our technique is greater than that of both the design obtained using the TPA metric and the one using the Bakoglu technique, because our approach chooses to insert a large number of small repeaters.

[Figure 3.14: Summary of the Design Variables (m, n and $n_s$) for $BW_s$ = 32 Gbps. Horizontal axis: Repeater Size (m).]

A summary of the choice of design variables using our technique, the TPA metric and the Bakoglu technique is given in Figure 3.14.
From the figure, it can be seen that our technique chooses design variables in Region 1, which corresponds to low average energy per bit, small repeater area and large latency. The TPA metric and the Bakoglu technique choose design variables that fall in Region 2, which corresponds to relatively high average energy per bit and large repeater area but smaller latency.

B. CASE 2: Variable Width

Design technique: After analyzing the effectiveness of our technique for the constant width case, we extend it to the case of variable width. The flowchart for the design flow is shown in Figure 3.15. The steps for the secondary objective are the same as for the constant width case.

Figure 3.15: Design Flow for the Variable Width Case

Experimental Results: The channel is designed for the 32 Gbps bandwidth requirement as before. Table 3.10 summarizes the design variables and the performance metrics for the channel for the variable width case.

Table 3.10: Summary of Design Variables and Objectives for the Variable Width Case

Parameter | Fixed Width | Variable Width
Requirement: BW_s | 32 Gbps | 32 Gbps
Constraint: (n/l)_max | 30 repeaters/cm | 30 repeaters/cm
Design variable: n_s | 4 | 3
Design variable: W/W_min | 1 | 2.02
Design variable: m | 42 | 44
Design variable: n | 30 | 30
Primary design objective: A_wire (normalization illegible in source) | 17 | 16.34
Secondary design objective: A_repeaters/A_inverter_min | 5040 | 3960
Secondary design objective: E_b | 2.95 pJ | 3.14 pJ
Latency: t_latency | 1.045 ns | 0.782 ns

From the results in Table 3.10, it can be seen that the solution obtained for the variable width case requires 3 interconnects in the channel. This is one less than the number of interconnects required for the constant width case. The value of E_b for the channel is slightly greater because larger repeaters are needed for the variable width case.

Chapter 4

SERDES Transceiver Design

This chapter provides the details of a serializer-deserializer (SERDES) transceiver that interfaces the on-chip serial channel with the parallel input/output, as shown in Figure 4.1. The serializer takes parallel data and converts it into a serial bit stream. The deserializer works in reverse: it takes the serial data and converts it back to parallel data. The objective is to determine what additional bandwidth requirements are placed on the channel due to the SERDES design.

Figure 4.1: Block Diagram for the On-Chip Serial Link (Our Design Aim)

As already described in Chapter 3, the practical implementation of an on-chip serial link will require n_p bits to be serialized and transmitted on n_s lines.
This implies that groups of N bits out of the n_p parallel bits, as shown in Figure 4.1, must be serialized on each interconnect, where

$$N = \frac{n_p}{n_s} \qquad (4.1)$$

To comply with the bandwidth requirement ($BW_s \ge BW_p$) and to satisfy the timing specification (T_clk), the following constraints are imposed on the SERDES transceiver design:

$$BW_s \ge BW_p \;\Rightarrow\; n_s \times \frac{1}{PW_{min}} \ge n_p \times \frac{1}{T_{clk}}$$

Substituting $N = n_p/n_s$ in the above equation, we obtain

$$T_{clk} \ge N \times PW_{min} \qquad (4.2)$$

Since T_clk is a design specification and cannot be changed, the serializer will have to convert an N-bit parallel word into a serial bit stream of N bits, each of pulsewidth PW_min, within the specified clock period (T_clk). Similarly, the deserializer will have to convert the serial bit stream back to a parallel word within the specified clock period T_clk. Hence, the goal of this chapter is to design each N:1 serializer and 1:N deserializer of Figure 4.1 so that it satisfies the timing constraint given by Eqn. (4.2).

We have considered a recent serialization-deserialization circuit proposed in [8] and analyze it in detail in this chapter. An example serial-link design is shown incorporating the chosen SERDES technique. We also analyze the SERDES technique for robustness in the presence of on-chip variations.

4.1 Wave-Pipelined based SERDES scheme

The SERDES scheme under consideration here is referred to as a wave-front train based SERDES scheme in [8], but it is the same as a wave-pipelined SERDES. This scheme looks simple and promising for several reasons, as explained below. The scheme uses the physical delay values of the delay element and multiplexer as timing references, instead of a high-speed clock, for the purposes of serialization and deserialization. This avoids the overhead and complexity of generating and distributing a synchronized high-speed clock. However, without an explicit clock signal, the receiver risks gradually losing synchronization with the transmitter.
For this reason, a pilot bit is added to each serial byte to synchronize the transmitter and receiver. An additional advantage of the pilot bit is that it makes the serial link more immune to synchronization issues related to crosstalk between different interconnects in the serial channel. However, it does add overhead to the transmission, which eventually translates into additional bandwidth requirements.

In [8], the authors have presented a qualitative description of the serialization-deserialization scheme without much information about the timing constraints. We extend their work by analyzing the serializer-deserializer from a timing perspective. We have also made some architectural changes to the design presented in [8]. A modified multiplexer circuit is presented which compensates for the polarity-dependent delay difference. We have also added a flushing mechanism for resetting the deserializer circuit for each serial byte. Further details about each modification are described in this chapter.

4.1.1 Circuit Components

The main circuit components of the SERDES are 3-to-1 multiplexers and delay elements.

A. Multiplexer

The multiplexer is implemented using CMOS transmission gates. The circuit has polarity-dependent delays, i.e., the delay varies depending on whether a signal transitions from low to high or from high to low. This can be compensated either by proper transistor sizing or by modifying the multiplexer architecture. The latter technique has been chosen for our purpose, and the circuit diagram for the multiplexer is shown in Figure 4.2. The architectural modification ensures equal delay, independent of the polarity of signal transitions. The reference name for the propagation delay of each input/output combination is also shown in the figure.
It should be noted that the multiplexer inverts the signal that propagates from node in1/in2 to node out, and hence care needs to be taken to ensure the correct polarity of the signals being transmitted along this path.

Figure 4.2: 3-to-1 Multiplexer Circuit (propagation delays: in1 to out = d_mux_in1, in2 to out = d_mux_in2, in3 to out = d_mux_in3)

B. Delay Elements

The delay element is implemented using a simple chain of inverters, as shown in Figure 4.3. The delay provided by the element is denoted d_DE.

Figure 4.3: Delay Element (chain of an odd number of inverters)

The chain consists of an odd number of inverters, and hence the signal passing through the delay element will also be inverted. This has been done intentionally to compensate for the signal inversion due to the multiplexer. As we will see later in this section, both the serializer and the deserializer use a combination of multiplexer and delay elements as timing references, and hence the signal passing through the combination will have no polarity inversion.

4.1.2 Serializer

The serialization technique is demonstrated using the 4:1 serializer circuitry shown in Figure 4.4. The concepts can be easily extended to the case of an N:1 serializer. The waveforms for the operation of the serializer are shown in Figure 4.5. The waveforms were obtained by simulation of the serializer circuit in HSPICE using 90nm CMOS technology. The parallel word to be serialized is available at the input node of the serializer, din<0:3>, with serialization triggered by the rising clock edge (Clk).
The rising edge detector generates an enable low pulse on the rising clock edge that connects in1 of each mux to out, and this causes din<0:3> to be loaded to nodes trint<0:3>. At the same time, the Vdd input of mux MP, called the pilot signal, is also loaded to trintp. The receiver circuit will use this pilot signal to detect the arrival of the serial data stream and to convert the serial data back to a parallel word. The Gnd input of mux MG pre-discharges the serializer output (srout) to Gnd when the enable is low. The enable pulsewidth (PW_enable) is determined by the time taken for the loading operation and is given by:

$$PW_{enable} \ge d_{mux\_in1} + d_{DE} \qquad (4.3)$$

When the enable goes high, in2 of each mux is connected to out. This creates a serial data stream on the nodes trint<0:4> that propagates in a wave-pipelined fashion to srout, as shown in the HSPICE results in Figure 4.5.

Figure 4.4: Serializer Circuit (M = multiplexer, DE = delay element)

Figure 4.5: Waveform for the Serializer Operation from HSPICE

The pulse width of each bit (PW_bit) in the serial data stream, including the pilot bit, is given by:

$$PW_{bit} = d_{mux\_in2} + d_{DE} \qquad (4.4)$$

PW_bit must be equal to PW_min for reasons explained earlier in Chapter 3 and, hence, we will use PW_min instead of PW_bit in all the equations.
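As a quick numerical illustration of Eqns. (4.3) and (4.4), the sketch below evaluates both expressions with the component delays that the example design reports in Table 4.3 later in this chapter. The function and constant names are hypothetical and chosen only for this illustration.

```python
# Component delays in picoseconds, taken from the example design (Table 4.3).
D_MUX_IN1_PS = 125  # multiplexer delay, in1 -> out (load path)
D_MUX_IN2_PS = 125  # multiplexer delay, in2 -> out (shift path)
D_DE_PS = 15        # delay element


def min_enable_pulsewidth_ps(d_mux_in1, d_de):
    """Eqn. (4.3): shortest enable pulse that still completes the load."""
    return d_mux_in1 + d_de


def bit_pulsewidth_ps(d_mux_in2, d_de):
    """Eqn. (4.4): width of each serial bit, pilot bit included."""
    return d_mux_in2 + d_de


pw_enable_min = min_enable_pulsewidth_ps(D_MUX_IN1_PS, D_DE_PS)
pw_bit = bit_pulsewidth_ps(D_MUX_IN2_PS, D_DE_PS)
print(pw_enable_min, pw_bit)  # 140 140
```

Both values come out at 140 ps, which agrees with the PW_min of the 58 Gbps example and with the minimum PW_enable listed in Table 4.4.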
For an N:1 serializer based on the technique described above, the time period (T_Ser) required to load the parallel word and convert it into a serial data stream at the output of the serializer is given by:

$$T_{Ser} \ge PW_{enable} + (N + 1)\,PW_{min} \qquad (4.5)$$

The (N+1) term is due to the fact that a pilot signal is appended in front of the N serial data bits. If we make $PW_{enable} < PW_{min}$, then Eqn. (4.5) can be approximately written as:

$$T_{Ser} \ge (N + 2)\,PW_{min} \qquad (4.6)$$

4.1.3 Deserializer

The 1:4 deserializer circuitry is shown in Figure 4.6. The waveforms for the operation of the deserializer are shown in Figure 4.7. The waveforms were obtained by simulation in HSPICE using 90nm CMOS technology.

We have added a flushing mechanism to our deserializer circuit, which is not present in the deserializer circuit in [5]. The flushing mechanism is needed to re-initialize the deserializer so that it can accept each new N-bit serial data word and convert it to a parallel word for storage in the FIFO and latched output at the rising edge of the Clk cycle. The flushing operation is initiated by the pilot signal of a new serial data stream. The detailed operation is described below.

Assume that the stop control signal is high initially and, hence, the out node of each multiplexer is connected to its respective in1 node via an inverter. An inverter is used to ensure the correct polarity of the signal being fed back to the input of the multiplexer, since the multiplexer inverts the signal passing through it. When the pilot bit of the serial bit stream arrives at Intc_out, a flush signal is generated, and this causes node in3 of each multiplexer in the deserializer to be connected to its respective out node. Thus, the output nodes dout<0:3> and node dpilot are initialized to Gnd. Since the stop signal is a delayed dpilot signal, it is also set to Gnd after a delay of d_stop. The flushing operation must finish (and the flush signal must go low) before the pilot signal reaches the dsrin node from Intc_out.
Hence, the DE Flush block present between nodes Intc_out and dsrin should have a delay (d_DEF) equal to or greater than the time required for the flushing operation:

$$d_{DEF} \ge d_{mux\_in3} + d_{stop} + 2\,d_{AND} \qquad (4.7)$$

where d_AND is the delay of the AND gate that generates the other control signal for the multiplexers.

Figure 4.6: Deserializer circuit (caption not recoverable from the source; M = multiplexer, DE = delay element)

Figure 4.7: Waveform for the Deserializer Operation from HSPICE

At the completion of the flushing operation, the stop signal is low and hence node in2 of each multiplexer is connected to its respective out node. Thus the serial data stream propagates through the deserializer in a wave-pipelined fashion until the pilot bit arrives at the end, i.e., at the output of mux MS, or dpilot. The 4 serial bits following the pilot signal also arrive at the nodes dout<0:3>. The pilot bit asserts a stop low signal after a delay of d_stop provided by DE stop, and this causes node out of each multiplexer to be connected to the in3 node via an inverter. That is, the multiplexer feeds back the output to the input to hold the data value while maintaining correct polarity. The delay d_stop should be approximately equal to 0.5 × PW_min so that each data bit in the serial stream is sampled near its center. The parallel data word is now available at nodes dout<0:3>, ready to be latched out at the output node dout_final<0:3> using a FIFO. The FIFO synchronizes the parallel data with the receiver clock.

In general, the time required to convert the serial byte consisting of N bits and a pilot bit into a parallel word for storing at the FIFO is given by:

$$T_{Des} \ge d_{DEF} + (N + 1)\,PW_{min} + d_{stop} \qquad (4.8)$$

If the deserializer is designed such that $d_{DEF} + 0.5\,PW_{min} < 2\,PW_{min}$, then Eqn. (4.8) can be estimated as:

$$T_{Des} \ge (N + 3)\,PW_{min} \qquad (4.9)$$

By examining Eqns. (4.6) and (4.9), we see that the time required for the functioning of the N:1 serializer is smaller than the time required for the functioning of the 1:N deserializer. Of course, the clock that loads N parallel bits into the serializer and the clock that latches parallel words out of the deserializer must have a common clock period (T_clk) greater than T_Des. It can be written as:

$$T_{clk} \ge (N + 3)\,PW_{min} \qquad (4.10)$$

From Eqn. (4.10), we note that the required clock period is greater than the specified clock period given by Eqn. (4.2) because of the redundancy equivalent to 3 bits. Hence, the required clock period needs to be reduced to match the specified clock period fixed by the design specification. In order to accommodate the 3-bit equivalent redundancy in the specified clock period, the serial channel has to be re-designed with a higher bandwidth requirement based on the techniques described earlier in Chapter 3. The overall design flow for the serial link, which includes the serial channel and the SERDES transceiver, is given in Figure 4.8.

Figure 4.8: Design Flow for the On-Chip Serial Link

4.2 On-Chip Serial Link: An Example

Based on the design flow of Figure 4.8, we design an example serial link that replaces a 32-bit parallel bus operating at a frequency of 1 GHz (T_clk = 1000 ps). The link is 10 mm long and routed in the Metal 5 layer of a 90nm CMOS technology. The bandwidth requirement for the serial link was computed to be 32 Gbps using Eqn. (3.1).
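Before the channel is sized, the timing bounds of Section 4.1 can be sanity-checked numerically. The sketch below evaluates Eqns. (4.5) and (4.8) and compares the result with the simplified bound of Eqn. (4.10), using the values that Tables 4.2 to 4.4 report for the 58 Gbps example (N = 4, PW_min = 140 ps, PW_enable = 150 ps, d_DEF = 200 ps, d_stop = 70 ps). The function names are hypothetical.

```python
def serializer_time_ps(n_bits, pw_min, pw_enable):
    """Eqn. (4.5): load time plus the pilot bit and N data bits."""
    return pw_enable + (n_bits + 1) * pw_min


def deserializer_time_ps(n_bits, pw_min, d_def, d_stop):
    """Eqn. (4.8): flush delay, pilot plus N data bits, then the stop delay."""
    return d_def + (n_bits + 1) * pw_min + d_stop


def clock_period_bound_ps(n_bits, pw_min):
    """Eqn. (4.10): simplified bound with a 3-bit equivalent redundancy."""
    return (n_bits + 3) * pw_min


# Values reported for the 58 Gbps example (Tables 4.2, 4.3 and 4.4).
N, PW_MIN, PW_ENABLE, D_DEF, D_STOP = 4, 140, 150, 200, 70

t_ser = serializer_time_ps(N, PW_MIN, PW_ENABLE)
t_des = deserializer_time_ps(N, PW_MIN, D_DEF, D_STOP)
t_bound = clock_period_bound_ps(N, PW_MIN)
print(t_ser, t_des, t_bound)  # 850 970 980
```

The detailed deserializer time (970 ps) stays below the simplified bound (980 ps) here because d_DEF + d_stop = 270 ps does not exceed 2 × PW_min = 280 ps, which is the condition under which Eqn. (4.9) is stated.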
As illustrated in Figure 4.8, we start by designing the serial channel for a bandwidth BW_s equal to 32 Gbps using the technique of Chapter 3, and then evaluate whether the corresponding SERDES transceiver fits within the specified clock period of 1000 ps.

4.2.1 Fixed Interconnect Width

The width of the interconnects in the serial channel is fixed and equal to W_min. The rule of thumb for the maximum allowable repeater density (n/l)_max was set to 30 repeaters/cm for each interconnect in the channel. A summary of the design specifications and variables for the serial channel, along with the design specifications for the corresponding SERDES transceiver, is given in Table 4.1.

Table 4.1: Specs. and Variables for 32 Gbps Serial Channel and the Corresponding SERDES Specs.

Group | Parameter | Value
Design specifications for the serial channel | BW_s | 32 Gbps
Design specifications for the serial channel | Repeater density (n/l)_max | 30 repeaters/cm
Design variables for the serial channel | Number of interconnects (n_s) | 4
Design variables for the serial channel | Interconnect width (W) | W_min
Design variables for the serial channel | Number of repeaters per interconnect (n) | 30
Design variables for the serial channel | Size of repeater (m) | 42
Design specifications for the corresponding SERDES transceiver | PW_min | 125 ps
Design specifications for the corresponding SERDES transceiver | N | 8
Design specifications for the corresponding SERDES transceiver | T_clk (specified) | 1000 ps
Design specifications for the corresponding SERDES transceiver | T_clk (required) using Eqn. (4.10) | 1375 ps

Interestingly, the SERDES transceiver is not realizable because the required clock period for proper functioning is greater than the specified clock period, as seen from Table 4.1. The required clock period for the SERDES is greater because of the redundancy equivalent to 3 bits, as explained earlier.

In order to accommodate the needed redundancy, we redesign the serial channel with a higher bandwidth. For the 32 Gbps serial link with a specified clock period of 1000 ps, it was found that an increase in the bandwidth of the serial channel from 32 Gbps to 58 Gbps is required. The new design specifications are given in Table 4.2.
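The realizability check behind Table 4.1 can be reproduced with a few lines of arithmetic. The sketch below applies Eqn. (4.1) and Eqn. (4.10) to the 32 Gbps fixed-width channel. The helper names are hypothetical, and rounding N up to the next integer when n_p is not a multiple of n_s is an assumption made for this sketch.

```python
import math


def bits_per_wire(n_parallel, n_serial):
    """Eqn. (4.1): N = n_p / n_s, rounded up (rounding is an assumption)."""
    return math.ceil(n_parallel / n_serial)


def required_clock_period_ps(n_bits, pw_min_ps):
    """Eqn. (4.10): T_clk must be at least (N + 3) * PW_min."""
    return (n_bits + 3) * pw_min_ps


T_CLK_SPECIFIED_PS = 1000  # 1 GHz parallel bus

# 32 Gbps fixed-width channel from Table 4.1: n_p = 32, n_s = 4, PW_min = 125 ps.
n_bits = bits_per_wire(32, 4)
t_required = required_clock_period_ps(n_bits, 125)
realizable = t_required <= T_CLK_SPECIFIED_PS
print(n_bits, t_required, realizable)  # 8 1375 False
```

The required period of 1375 ps exceeds the specified 1000 ps, which is why the channel has to be redesigned for a higher bandwidth.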
Table 4.2: Design Specs and Variables for 58 Gbps Serial Channel and the Corresponding SERDES

Group | Parameter | Value
Design specifications for the serial channel | BW_s | 58 Gbps
Design specifications for the serial channel | Repeater density (n/l)_max | 30 repeaters/cm
Design variables for the serial channel | Number of interconnects (n_s) | 8
Design variables for the serial channel | Interconnect width (W) | W_min
Design variables for the serial channel | Number of repeaters per interconnect (n) | 30
Design variables for the serial channel | Size of repeater (m) | 29
Design specifications for the corresponding SERDES transceiver | PW_min | 140 ps
Design specifications for the corresponding SERDES transceiver | N | 4
Design specifications for the corresponding SERDES transceiver | T_clk (specified) | 1000 ps
Design specifications for the corresponding SERDES transceiver | T_clk (required) using Eqn. (4.10) | 980 ps

Comparing Tables 4.1 and 4.2, we see that the channel now requires 8 interconnects as opposed to 4 interconnects, and the repeater size is reduced from 42 to 29. For the specification given in Table 4.2, the SERDES transceiver was designed and simulated in HSPICE. The timing parameters for the various circuit components are summarized in Table 4.3.

Table 4.3: Timing Parameters for the SERDES Circuit Components

Component | Parameter | Designed Value
Multiplexer | d_mux_in1 | 125 ps
Multiplexer | d_mux_in2 | 125 ps
Multiplexer | d_mux_in3 | 50 ps
Delay Element | d_DE | 15 ps
AND Gate | d_AND | 40 ps
Delay Element | d_stop | 70 ps

Other timing parameters for the SERDES operation are summarized in Table 4.4.

Table 4.4: Timing Parameters for the SERDES Operation

Block | Parameter (Design) | Value (Design) | Parameter (Minimum Required) | Value (Minimum Required) | Equation
Serializer | PW_enable | 150 ps | d_mux_in1 + d_DE | 140 ps | (4.3)
Serializer | T_Ser | 1000 ps | PW_enable + (N+1) PW_min | 850 ps | (4.5)
Deserializer | d_DEF | 200 ps | d_mux_in3 + d_stop + 2 d_AND | 200 ps | (4.7)
Deserializer | T_Des | 1000 ps | d_DEF + (N+1) PW_min + d_stop | 970 ps | (4.8)

The block diagram for the example serial link is shown in Figure 4.9. Figure 4.10 shows the timing waveforms obtained using HSPICE for a single interconnect in the serial link (marked in Figure 4.9). From the waveforms, it can be seen that at two consecutive rising clock edges, 4 parallel bits are loaded into the transmitter.
They are converted into a serial byte in the serializer, with a pilot bit appended at the start of the byte, and then transmitted on the interconnect in the channel. On reaching the receiver, they are converted back into a 4-bit parallel word and are available at the FIFO input to be synchronized with the receiver clock.

Figure 4.9: Block diagram for Example On-Chip Serial Link (l = 10 mm, 8 interconnects, n = 30, m = 29, simulated in HSPICE)

Figure 4.10: Timing Waveform for the Example On-Chip Serial Link (signals shown: CLK, din0 to din3, enable, interconnect input and output, stop, flush, dout0 to dout3, dpilot)

4.2.2 Variable Interconnect Width

If we design the serial channel considering the interconnect width to be variable, the resulting channel design and the specifications for the corresponding SERDES transceiver are as given in Table 4.5.

Table 4.5: Specs. and Variables for 32 Gbps Serial Channel and the Corresponding SERDES Specs.

Group | Parameter | Value
Design specifications for the serial channel | BW_s | 32 Gbps
Design specifications for the serial channel | Repeater density (n/l)_max | 30 repeaters/cm
Design variables for the serial channel | Number of interconnects (n_s) | 3
Design variables for the serial channel | Interconnect width (W) | 2.02 × W_min
Design variables for the serial channel | Number of repeaters per interconnect (n) | 30
Design variables for the serial channel | Size of repeater (m) | 44
Design specifications for the corresponding SERDES transceiver | PW_min | 94 ps
Design specifications for the corresponding SERDES transceiver | N | 11
Design specifications for the corresponding SERDES transceiver | T_clk (specified) | 1000 ps
Design specifications for the corresponding SERDES transceiver | T_clk (required) using Eqn. (4.10) | 1316 ps

This SERDES is not realizable because it does not comply with the specified clock period of 1000 ps: the required clock period is greater than the specified clock period. Hence the serial channel needs to be redesigned with a higher bandwidth requirement.
It was found that by increasing the bandwidth to 54 Gbps, the 3-bit redundancy could be accommodated in the specified clock period of 1000 ps. The design specifications and variables for the 54 Gbps channel are given in Table 4.6.

Table 4.6: Specs. and Variables for 54 Gbps Serial Channel and the Corresponding SERDES Specs.

Group | Parameter | Value
Design specifications for the serial channel | BW_s | 54 Gbps
Design specifications for the serial channel | Repeater density (n/l)_max | 30 repeaters/cm
Design variables for the serial channel | Number of interconnects (n_s) | 6
Design variables for the serial channel | Interconnect width (W) | 1.2 × W_min (digits partly illegible in source)
Design variables for the serial channel | Number of repeaters per interconnect (n) | 30
Design variables for the serial channel | Size of repeater (m) | 33
Design specifications for the corresponding SERDES transceiver | PW_min | 111 ps
Design specifications for the corresponding SERDES transceiver | N | 6
Design specifications for the corresponding SERDES transceiver | T_clk (specified) | 1000 ps
Design specifications for the corresponding SERDES transceiver | T_clk (required) using Eqn. (4.10) | 977 ps

Comparing Tables 4.5 and 4.6, we see that the results have changed in order to incorporate the overheads due to the SERDES transceiver. The serial channel now requires 6 interconnects as opposed to 3 interconnects. Further changes in the repeater insertion parameters and the width of the interconnects in the serial channel are required for this design.

4.3 Effect of Intra-die Variations

Additional considerations for the SERDES transceiver operation in the presence of intra-chip variations are analyzed in this section. The sources of intra-die variations can be categorized into physical and environmental factors [19]. Physical factors result in a change in the circuit performance due to layout effects and process "tilt" during manufacture. Environmental factors arise during the operation of a circuit and include variation in power supply voltage and on-chip temperature gradients.

4.3.1 Variation Model

Intra-chip variations can be modeled both as random variations and as spatial variations [19]. For our purpose, we focus on spatial variations and leave random variations as the subject of future work. Consider the spatial variation shown in Figure 4.11.
For modeling purposes, the circuit parameters change as a function f(x, y) of the distance from a given position, (x_0, y_0). For example, process gradients can be modeled in this way [35]. For certain idealized cases, such a function can also be used for temperature and supply voltage variations [35].

In the figure, we see that the variations in process parameters occur from region A to region B. Assume that region A is faster than region B. The variation factor from a timing perspective between the two regions is represented by k, where k > 1.0. Thus, region A is k times faster than region B. With the above variation, we consider two separate serial links, Link1 and Link2, as shown in Figure 4.11, and analyze them in the following subsection.

Figure 4.11: Spatial Variation Model for Intra-Die Variation

4.3.2 Theoretical Analysis

The serializer and the deserializer are designed to transmit and receive serial bits of pulsewidth PW_bit. Due to intra-die variations, the serializer will actually send serial bits of pulsewidth PW_bitSer that differ from the expected value of PW_bit. Similarly, the deserializer expects to receive serial bits of pulsewidth PW_bitDes that differ from the PW_bit it was designed to receive. Although the properties of the serial channel are affected by variation, the channel will not greatly modify the pulsewidth of the serial bits being transmitted on it as long as the bits have a pulsewidth greater than the minimum allowable value before inter-symbol interference (ISI) occurs.
We assume that the minimum allowable pulsewidth in the channel is much smaller than the pulsewidth of the bits being transmitted.

A. Link 1

The serializer lies in the faster region (region A) and the deserializer lies in the slower region (region B), and hence PW_bitSer < PW_bitDes. Further, since region A is k times faster than region B, we can write:

$$PW_{bitSer} = \frac{PW_{bitDes}}{k} \qquad (4.11)$$

Figure 4.12 shows the functional block diagram for the operation of the deserializer for Link1. The serial byte propagates into the deserializer and, after it reaches the far end, the pilot bit asserts a sampling signal to the output latches. The three cases shown in the figure are discussed below.

Figure 4.12: Operation of Deserializer for Link1

Ideal Case (k = 1): The serial bits are sampled exactly at their centre, as intended in this design.

Non-ideal Case (1 < k <= k_max): Similar to the ideal case, except that the serial bits are sampled off-centre, as seen from Figure 4.12.

Limiting Case (k > k_max): PW_bitSer is sufficiently small compared to PW_bitDes and, hence, the bit b(N) is sampled incorrectly at node nd(N).
The limiting value of k can be obtained by equating the length of the serial byte with the distance from node ndstop to node nd(N), as follows:

$$(N + 1)\,PW_{bitSer} = \left(N + \tfrac{1}{2}\right) PW_{bitDes} \qquad (4.12)$$

Substituting (4.11) in the above equation, we obtain:

$$k_{max} = \frac{N + 1}{N + 1/2} \qquad (4.13)$$

For the deserializer to sample the N bits of the serial byte correctly, k must be less than k_max:

$$k < \frac{N + 1}{N + 1/2} \qquad (4.14)$$

Based on (4.14), the maximum number of bits allowed in a serial byte such that the entire stream is sampled correctly at the deserializer in Link1 is given by:

$$N_{max}(\text{Link1}) = \mathrm{floor}\!\left(\frac{2 - k}{2(k - 1)}\right) \qquad (4.15)$$

B. Link 2

The serializer lies in the slower region (region B) and the deserializer lies in the faster region (region A) and, hence, PW_bitSer > PW_bitDes. Further, since region A is k times faster than region B, we can write:

$$PW_{bitSer} = k\,PW_{bitDes} \qquad (4.16)$$

Figure 4.13 shows the functional block diagram for the operation of the deserializer for Link2. The serial byte propagates into the deserializer and, after it reaches the far end, the pilot bit asserts a sampling signal to the output latches.

Figure 4.13: Operation of the Deserializer for Link2

Ideal Case (k = 1): The serial bits are sampled exactly at their centre, as desired in this circuit.

Non-ideal Case (1 < k <= k_max): Similar to the above ideal case, except that the serial bits are sampled off-centre, as seen from Figure 4.13.

Limiting Case (k > k_max): PW_bitSer is sufficiently large compared to PW_bitDes and, hence, the bit b(N) is not sampled correctly.
The limiting value of k can be obtained by equating the length of the N-1 bits in the serial byte with the distance from node ndstop to node nd(N), as follows:

$$N \cdot PW_{bitSer} = \left(N + \tfrac{1}{2}\right) PW_{bitDes} \qquad (4.17)$$

Substituting (4.16) in the above equation, we get:

$$k_{max} = \frac{N + 1/2}{N} \qquad (4.18)$$

For proper operation of the deserializer, k must be less than k_max:

$$k < \frac{N + 1/2}{N} \qquad (4.19)$$

Based on (4.19), the maximum number of bits at the deserializer in Link2 is given by:

$$N_{max}(\text{Link2}) = \mathrm{floor}\!\left(\frac{1}{2(k - 1)}\right) \qquad (4.20)$$

Equations (4.15) and (4.20) are plotted in Figure 4.14 for different values of k. With an increase in the variation factor k between locations A and B, the maximum number of bits allowed in a serial byte, N_max, decreases for both Link1 and Link2. Further, we observe that N_max for Link1 is greater than N_max for Link2 for specific ranges of k. However, it is difficult to predict whether the serial link designed will end up being Link1 or Link2 as a result of on-chip variation. Hence, we must consider the worst case and design our serial link with the N_max of Link2.

Figure 4.14: Maximum Bits in a Serial Byte as a Function of Variation Factor, k (curves for Link 1 and Link 2 over k = 1.1 to 1.5, with our experimental point marked)

4.3.3 Experimental Validation

In order to validate the analysis for this SERDES transceiver in the presence of intra-chip variations, simulations were carried out in HSPICE on the circuit. Although HSPICE provides simulation capability for variations at the inter-chip level, it does not directly simulate variations at the intra-chip level. The reason is that, in a single HSPICE simulation, we do not usually include device models from different corners that represent variations, because the models have common nomenclature. If we include the device models for several process corners in one model file, it is possible to carry out a meaningful intra-die HSPICE simulation.
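Equations (4.15) and (4.20) are easy to evaluate for a given variation factor. The sketch below does so with exact rational arithmetic, which avoids floating-point rounding near the points where the floor argument is an integer. The function names are hypothetical, and the k values are illustrative points within the range plotted in Figure 4.14.

```python
from fractions import Fraction
from math import floor


def _as_fraction(k):
    """Convert k to an exact rational and check the model's assumption k > 1."""
    k = Fraction(str(k))
    if k <= 1:
        raise ValueError("variation factor k must be greater than 1")
    return k


def n_max_link1(k):
    """Eqn. (4.15): serializer in the fast region, deserializer in the slow one."""
    k = _as_fraction(k)
    return floor((2 - k) / (2 * (k - 1)))


def n_max_link2(k):
    """Eqn. (4.20): serializer in the slow region, deserializer in the fast one."""
    k = _as_fraction(k)
    return floor(1 / (2 * (k - 1)))


def k_max_link2(n_bits):
    """Eqn. (4.18): largest tolerable k for an N-bit serial byte in Link2."""
    return Fraction(2 * n_bits + 1, 2 * n_bits)


print(n_max_link1(1.3), n_max_link2(1.3))  # 1 1
print(float(k_max_link2(4)))               # 1.125
```

For k = 1.3 both expressions evaluate to a single bit per serial byte, and for a 4-bit serial byte Eqn. (4.18) gives a limit of 1.125.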
In the results to be shown, models were constructed from the Typical, Fast and Slow process corners to emulate variations at the intra-die level. The degree of variation between device models from the Typical and Fast corners was found to be equal to k = 1.3.

For the simulation of Link1, we designed the serializer with device models that correspond to the Fast process corner. The deserializer was designed with device models that correspond to the Typical process corner. For the serial channel, we chose the same device models as for the serializer. The simulation setup in HSPICE is shown in Figure 4.15.

Figure 4.15: Simulation Setup in HSPICE for Link1 (serializer and single-interconnect serial channel at the Fast process corner, receiver (DES) at the Typical process corner)

The waveforms for the operation of the receiver are shown in Figure 4.16. It can be seen that only the first bit of the received serial byte was sampled properly, while the second bit is sampled incorrectly. This result is in agreement with that predicted by the analysis, and the experimental point is marked in the plot in Figure 4.14.

Figure 4.16: Link 1, Timing Waveform at the Deserializer

Similarly, Link2 was simulated with the serializer designed using device models that correspond to the Typical process corner and the deserializer using device models that correspond to the Fast process corner.
The corresponding waveforms are shown in Figure 4.17. It can be seen that only the first bit of the received serial byte is sampled properly. The second bit is sampled incorrectly, as shown in the figure. Again, the result is in agreement with the expected results.

Figure 4.17: Link2, Timing Waveform at the Deserializer

4.3.4 Summary

From both the analysis and the experiments conducted, it was found that variations at the intra-die level will limit the number of bits in a serial byte if a wave-pipelined SERDES transceiver is used. This is an important result since it is independent of the bandwidth or the overhead of the SERDES technique described in earlier sections. In fact, it is clearly the most important factor in determining the required number of lines. The degree of intra-die variation that can be tolerated by the original design with 4:1 multiplexers is rather low. For the example on-chip serial link design of Section 4., the maximum allowable value of the variation factor is k=1.12. This is equivalent to intra-die variations of 12%. Clearly, more robust methods are needed.

Chapter 5

Energy Efficiency in the Serial Channel

Although serialized links mitigate some problems of the parallel bus, such as routing complexity and wiring area, they do not guarantee energy efficiency on the interconnect. A key concern is that serializing a parallel bus might lead to an increase in the switching activity (SA) on the serial wire [11]. In the example shown in Figure 5.1, a 4-bit parallel bus has an SA of 8/20; however, when the same data words are transmitted on a serial wire, the SA increases to 12/20.

      W6 W5 W4 W3 W2 W1
b1     1  1  1  1  1  1
b2     1  0  1  0  0  1
b3     1  1  0  1  1  1
b4     1  1  0  0  0  1
Average Switching Activity on parallel bus = 8/20

Figure 5.1: Switching Activity Increased when Converting from Parallel to Serial

Hence, activity reduction or minimization over the serial channel is necessary to make the serialized link energy efficient.
In this chapter, we propose a novel technique that minimizes the switching activity on the serial interconnects. In Section 5.1, we describe some previously reported work on energy-efficient communication on serialized links along with its shortcomings. Our energy reduction technique is presented in Section 5.2. The experimental procedure and results are given in Section 5.3, and a summary is provided in Section 5.4.

Figure 5.1 (serial wire; bits per word in the order b4 b3 b2 b1, words W6 to W1): 1111 1101 0011 0101 0101 1111
Average Switching Activity on serial wire = 12/20

5.1 Previous Work and Motivation

There exist various techniques in the literature that reduce the average energy per bit on a parallel bus by employing either circuit-level or system-level techniques. However, these techniques cannot be used directly for a serialized link. The current literature contains very few techniques for reducing the average energy per bit in a serialized link. The technique described in [5] is based on decreasing the capacitance of the interconnects by increasing the spacing between them. Although the technique is quite effective, it is not desirable because the gain has to be traded off against the routing area consumed by the increased spacing. The techniques presented in [11, 36] are more desirable because they target a reduction of the average SA on the interconnects and do not require a trade-off with routing area. The technique presented in [11] is referred to as SILENT (Serialized Low Energy Transmission Coding for On-Chip Interconnection Networks). This technique is based on differential coding, and average SA reduction is achieved by XORing the data to be serialized with the previous data, so that only the differences between successive words cause transitions on the serial interconnects. The technique presented in [36] is also based on differential coding, and two schemes based on the location of the encoder/decoder on the link have been reported.
In the first scheme, encoding is performed before serialization (called ES) and decoding is carried out after de-serialization. In the second scheme, encoding is done after serialization (called SE) and decoding is done before de-serialization. The ES scheme is essentially the same as the SILENT scheme. However, the SE scheme provides a different SA reduction compared to the ES scheme, depending upon the statistical characteristics of the data words on the parallel bus to be serially transmitted. Since differential encoding is used in [11, 36], these techniques are effective if the data traces on the parallel bus to be serially transmitted have a reasonable degree of spatial or temporal correlation, i.e., uniform statistics. If the data traces on the parallel bus have non-uniform temporal or spatial correlation, then these techniques will not be effective for average SA reduction. Therefore, a technique for average SA reduction for the serialized link is needed that is applicable to data traces on the parallel bus with both uniform and non-uniform statistics.

5.2 N:1 Bit Ordering for Switching Activity Reduction

Our technique is based on bit ordering of the N-bit parallel data word on a serial interconnect for reducing the average SA. It is most effective when it has complete statistical knowledge of the traces on the parallel bus, but such knowledge is not necessary for some types of buses. The complete statistical knowledge is obtained by profiling the system and extracting bus data traces. Our approach is very similar to the approach in [37, 38], the difference being that we reduce the SA based on bit ordering whereas, in [37, 38], the mutual coupling activity is reduced on a parallel bus by wire permutation.

5.2.1 Overview of Proposed Technique

The following two observations, illustrated by the examples below, form the basis of our proposed technique.

A.
Example 1: The average SA on the serial interconnect depends on how the bits are multiplexed (ordered) from the N-bit parallel bus onto the serial interconnect. This is because the ordering affects the number of times a 1 is followed by a 0, or vice versa, on the serial interconnect. Consider the example shown in Figure 5.2.

Parallel data words (the same in both cases):
      W6 W5 W4 W3 W2 W1
b1     1  1  1  1  1  1
b2     1  0  1  0  0  1
b3     1  1  0  1  1  1
b4     1  1  0  0  0  1

(a) Bit Ordering is {b1,b2,b3,b4}
Serial wire (b4 b3 b2 b1 per word, W6 to W1): 1111 1101 0011 0101 0101 1111
Average Switching Activity on serial wire = 12/20

(b) Bit Ordering is {b1,b2,b4,b3}
Serial wire (b3 b4 b2 b1 per word, W6 to W1): 1111 1101 0011 1001 1001 1111
Average Switching Activity on serial wire = 8/20

Figure 5.2: Average Switching on the Serial Interconnect Depends on Bit Ordering

From Figure 5.2(a), the SA is 12/20 when the bits are ordered as {b1,b2,b3,b4} on the serial interconnect. If the bits are ordered as {b1,b2,b4,b3}, as shown in Figure 5.2(b), the SA reduces to 8/20. This example illustrates that bit ordering on the serial interconnect affects the SA factor and hence can be utilized for the purpose of average SA reduction.

B. Example 2: From the first observation we know that different bit orderings may provide different SA on the serial interconnect. Hence, there must exist a particular bit ordering that will minimize the SA, and that particular bit ordering will depend on the statistical characteristics of the data pattern on the parallel bus that is to be transmitted serially.
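The transition counts quoted in Example 1 can be reproduced with a short script. The Python sketch below is illustrative only: the helper names `serial_transitions` and `parallel_transitions` are hypothetical, the data words are those of Figure 5.2 listed in transmission order (W1 first, each word as (b1, b2, b3, b4)), and the functions return raw transition counts, which Figure 5.2 reports over a denominator of 20.

```python
from typing import Sequence, Tuple

Word = Tuple[int, ...]

# Data words of Figure 5.2 in transmission order W1..W6, each as (b1, b2, b3, b4).
WORDS: Sequence[Word] = [
    (1, 1, 1, 1),  # W1
    (1, 0, 1, 0),  # W2
    (1, 0, 1, 0),  # W3
    (1, 1, 0, 0),  # W4
    (1, 0, 1, 1),  # W5
    (1, 1, 1, 1),  # W6
]


def serial_transitions(words: Sequence[Word], order: Sequence[int]) -> int:
    """Count 0->1 and 1->0 transitions on one serial wire.

    `order` lists bit indices (0 = b1) in the order they are sent
    within each word; word boundaries are included in the count.
    """
    stream = [word[i] for word in words for i in order]
    return sum(a != b for a, b in zip(stream, stream[1:]))


def parallel_transitions(words: Sequence[Word]) -> int:
    """Count transitions summed over all wires of the parallel bus."""
    return sum(
        prev[i] != cur[i]
        for prev, cur in zip(words, words[1:])
        for i in range(len(prev))
    )


print("parallel bus:          ", parallel_transitions(WORDS))
print("serial {b1,b2,b3,b4}:", serial_transitions(WORDS, (0, 1, 2, 3)))
print("serial {b1,b2,b4,b3}:", serial_transitions(WORDS, (0, 1, 3, 2)))
```

Swapping only the last two bit positions removes four transitions on the serial wire for this data, which is the effect the bit ordering technique exploits.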
(a) Parallel data words:
      W6 W5 W4 W3 W2 W1
b1     1  1  1  1  1  1
b2     1  0  1  0  0  1
b3     1  1  0  1  1  1
b4     1  1  0  0  0  1
Bit Ordering is {b1,b2,b4,b3}
Serial wire (b3 b4 b2 b1 per word, W6 to W1): 1111 1101 0011 1001 1001 1111
Average Switching Activity on serial wire = 8/20

(b) Parallel data words:
      W6 W5 W4 W3 W2 W1
b1     1  1  0  1  1  1
b2     1  0  1  0  0  1
b3     1  1  0  0  1  1
b4     1  0  1  0  0  1
Bit Ordering is {b1,b3,b2,b4}
Serial wire (b4 b2 b3 b1 per word, W6 to W1): 1111 0011 1100 0001 0011 1111
Average Switching Activity on serial wire = 6/20

Figure 5.3: Bit Ordering Depends on Statistical Characteristics of Data Traces

Consider a different example in Figure 5.3. In this case the data on the parallel bus have different statistical characteristics in Figure 5.3(a) and (b). The bit orderings for the two cases are obtained by analysis so as to minimize the SA on the corresponding serial interconnect. The bit orderings for the two cases were found to be different because of the different statistical characteristics. From Figure 5.3(a) it can be seen that the bit ordering {b1,b2,b4,b3} gives the minimum SA of 8/20, whereas in Figure 5.3(b) the bit ordering {b1,b3,b2,b4} gives the minimum SA of 6/20. This example illustrates that, in order to determine the best bit ordering for SA minimization, we need to use the statistical information of the parallel bus traces.

5.2.2 Problem Formulation and Solution

The problem of finding the best bit ordering on the serial interconnect for SA can be formulated as a graph problem. A completely connected undirected graph G(V,E,EW) is constructed, where $V = \{v_i\}$ is the set of vertices, $E = \{e_{ij}\}$ is the set of edges and $EW = \{ew_{ij}\}$ is the set of edge weights. The vertices correspond to the bit positions in each word. The weight denotes the probability that two bits of the parallel word are complementary to each other.
More precisely, the weight of a particular edge denotes the probability that the values on the two nodes connected by that edge are complementary to each other:

$ew_{ij} = \Pr\{[(v_i = 0)\,\&\,(v_j = 1)] \text{ or } [(v_i = 1)\,\&\,(v_j = 0)]\}$ (5.1)

A particular bit ordering on the serial wire can be denoted by a path in the graph that starts and ends at the same vertex and visits every vertex exactly once. The cost of a path, which is equivalent to the SA on the serial wire, is simply the sum of the costs of the edges along that path. To illustrate this, a graph is constructed for the example of Figure 5.2(a), with the edge weights determined by statistical analysis of the data words on the parallel bus. The resulting graph is shown in Figure 5.4. The path with the minimum cost for the graph of the example of Figure 5.2 is shown by dotted lines in Figure 5.4. The bit ordering obtained from our solution technique is {b1,b2,b4,b3}, which is the same as that obtained earlier in Figure 5.2(b) by simple inspection.

Figure 5.4: Graph for the Example of Figure 5.2(a) (the minimum-cost Hamilton cycle is marked)

In order to minimize the average SA, we have to find an optimal path in the graph, and this is equivalent to finding the minimum-cost Hamilton cycle in the graph. The problem is the same as the well-known Traveling Salesman Problem (TSP). Since the size of the TSP to be solved is relatively small (the number of vertices is the same as the number of bits to be ordered on the serial interconnect), an exact TSP algorithm can be used. The TSP solver chosen for our purpose is based on Branch and Bound [14].

5.3 Experimental Procedure

In order to study the effectiveness of the N:1 bit ordering scheme, we considered traces obtained from a SimpleScalar PISA processor simulator [39] executing SPEC2000 integer programs. We extracted traces for the instruction cache address bus (N=32) and the instruction bus (N=32) of the processor.
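To make the graph formulation of Section 5.2.2 concrete before turning to the benchmark traces, the following Python sketch applies it to the four-bit example of Figure 5.2. It is a minimal illustration under stated assumptions: exhaustive enumeration of cycles is swapped in for the Branch and Bound solver used in the thesis (practical only for very small N), the function names `complement_counts` and `best_cycle` are hypothetical, and edge weights are kept as integer counts of complementary occurrences, so dividing by the number of words gives the probabilities of Equation (5.1).

```python
from itertools import combinations, permutations
from typing import List, Sequence, Tuple

Word = Tuple[int, ...]

# Data words of Figure 5.2 in transmission order W1..W6, each as (b1, b2, b3, b4).
WORDS: Sequence[Word] = [
    (1, 1, 1, 1),  # W1
    (1, 0, 1, 0),  # W2
    (1, 0, 1, 0),  # W3
    (1, 1, 0, 0),  # W4
    (1, 0, 1, 1),  # W5
    (1, 1, 1, 1),  # W6
]


def complement_counts(words: Sequence[Word]) -> List[List[int]]:
    """Edge weights as counts: in how many words bits i and j differ."""
    n = len(words[0])
    counts = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        c = sum(w[i] != w[j] for w in words)
        counts[i][j] = counts[j][i] = c
    return counts


def best_cycle(counts: List[List[int]]) -> Tuple[Tuple[int, ...], int]:
    """Exhaustively find a minimum-cost Hamilton cycle starting at vertex 0.

    Ties are resolved in favour of the first cycle enumerated.
    """
    n = len(counts)
    best_order, best_cost = None, None
    for rest in permutations(range(1, n)):
        order = (0,) + rest
        cost = sum(counts[order[p]][order[(p + 1) % n]] for p in range(n))
        if best_cost is None or cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


counts = complement_counts(WORDS)
order, cost = best_cycle(counts)
print("ordering:", [f"b{i + 1}" for i in order])
print("cycle cost:", cost, "complementary pairs over", len(WORDS), "words")
```

For this data the sketch selects the ordering {b1,b2,b4,b3}, in agreement with the ordering reported for Figure 5.4. The factorial growth of the enumeration is the reason an exact but pruned method such as Branch and Bound is preferable for N=32.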
The traces obtained from the instruction address bus represent data with uniform statistics because consecutive addresses are sequential, whereas the traces obtained from the instruction bus have non-uniform statistics. For a chosen benchmark program, a sample of 10 million traces for the instruction address bus was considered for analysis. The complete statistical characteristics of the sample traces were obtained by customized pre-profiling of the traces. The statistical characteristics were used to determine the edge weights of the graph. The bit ordering that minimized the SA on the serial interconnect was obtained using the TSP algorithm, and the average SA was calculated for the sample traces. The percentage reduction in SA was obtained by comparing it with the case when random bit ordering was used (i.e., the bits of the parallel word were serialized based on their location). This process was repeated for the instruction bus for the chosen benchmark. In the same way, we obtained the average percentage SA reduction for all the benchmarks for the instruction address bus and the instruction bus. The results are plotted in Figure 5.5. Bit ordering on the serial bus using our approach gives a 55% reduction in average SA for the instruction cache address bus and a 30% reduction in average SA for the instruction bus, on average across the various integer benchmarks. Based on the results obtained, our technique is quite effective in reducing the average SA on the serial interconnect for traces with uniform or non-uniform statistics, provided that complete statistical knowledge is available in advance.

Figure 5.5: Percentage Reduction in Switching Activity using Bit Ordering (Instruction Address and Instruction buses, integer benchmarks)

Further, we compare our technique with the SILENT scheme of [11]. From the results in Figure 5.6, it can be seen that SILENT is slightly better than our technique for the instruction address bus.
Our approach obtains a 50% reduction on average, but the SILENT approach is in the range of 85%. This is because SILENT exploits the sequentiality of the address traces, which provides a better reduction in the average SA than our bit ordering technique. For the instruction bus, our technique is much better than SILENT, as seen from the results in Figure 5.7. Our technique consistently reduced the average SA across the various benchmarks because a customized bit ordering for each individual benchmark was obtained by preliminary profiling of the bus traces. On average, a 40% reduction is obtained. Since the traces are non-uniform, the SILENT scheme based on differential coding is not effective and, in fact, SA increased by 15%, and by 50% in the worst case.

Figure 5.6: Comparison of the Bit Ordering Technique with SILENT (Instruction Address Bus)

Figure 5.7: Comparison of the Bit Ordering Technique with SILENT (Instruction Bus)

5.4 Summary

On the basis of the above observations, we find that the SILENT scheme is preferable to our technique for address buses because there is sequentiality in the traces associated with such buses. But for the instruction bus, our bit ordering technique is a better choice for SA reduction. In fact, the ordering can be accomplished without the bus traces since the instruction set is finite. While complete statistical information would provide a better result, we believe that knowing the bit patterns in the instruction set is sufficient to obtain good results. As described earlier in the thesis, the practical implementation of the serialized link requires $n_p$ bits to be serialized and transmitted on $n_s$ lines. This implies that the $n_p$ bits have to be divided into $n_s$ groups of N bits each, where N is given by Eqn. (4.1).
The average SA reduction technique would then be applied to each group of N bits before they are serially transmitted on the interconnect. The choice of the SA reduction technique depends upon the statistical characteristics of the traces or the parallel words to be serialized. SILENT is better suited to buses with more uniform traces. Our technique based on bit ordering would be beneficial for data patterns with both uniform and non-uniform statistics.

Chapter 6

Summary and Conclusions

The goal of this thesis was to investigate the relatively new topic of on-chip serialized links. The thesis provides key insights and develops techniques for designing energy-efficient, on-chip serialized links for replacing parallel buses in order to reduce routing complexity. The findings are summarized below along with directions for future work.

6.1 Summary

In the first part of the thesis, we presented design considerations for the serial channel using wave pipelining. Bandwidth contours were developed as a function of the number and size of repeaters. Our design technique minimizes the number of interconnects in the serial channel while providing the required bandwidth. In addition, the area occupied by the repeaters and the average energy per bit consumed in the serial channel were also reduced.

In the second part, we analyzed a wave-pipeline-based SERDES transceiver for interfacing the serial channel with the parallel input/output. A detailed description of the operation of the SERDES transceiver from a timing perspective was provided. The SERDES scheme under consideration required a pilot signal, which translated to additional bandwidth requirements for the serial channel. This additional bandwidth was satisfied by increasing the number of interconnects in the serial channel and by modifying other design parameters. Further, we studied the robustness of the wave-pipelined SERDES scheme.
It was observed that on-chip variation was the most important factor in determining the number of interconnects in the serialized channel. Using the techniques presented in the first and second parts of the thesis, an example serial link was designed to replace a 10 mm long, 32-bit wide parallel bus operating at a frequency of 1 GHz in 90 nm CMOS technology. The serial channel by itself required only 4 interconnects for the given bandwidth. After considering the overhead of the wave-pipelined SERDES transceiver, the number of interconnects required in the serial channel increased to 8. Further, assuming an intra-die variation of 15% forces the required number of interconnects in the link to 16.

Finally, we developed a new technique for energy reduction in the serialized link. The proposed technique is based on bit ordering for switching activity reduction on the serial interconnect. The technique relies on knowledge of the data to be transmitted serially in the link. A comparison was also made of our bit ordering technique with an encoding-based energy reduction technique called SILENT. From simulation results, it was found that our bit ordering technique reduced the average switching activity on the serial interconnect by about 40% for data patterns with both non-uniform and uniform statistics, provided that the bus statistics are known a priori. Based on a number of benchmark circuit traces, our scheme was better on the instruction bus while the SILENT scheme was more effective on the address bus.

6.2 Conclusions

In general, this study found that the design technique for the serialized link using wave-pipelined interconnect and a SERDES transceiver does not significantly reduce the number of interconnects for a given bandwidth requirement.
The reason is that the number of interconnects in the link is determined by the degree of intra-die variation that the design can tolerate, which is independent of the bandwidth requirement for the link. However, we still believe that the design approach is useful for two reasons. First, even if the number of interconnects in the channel is controlled by the variations, the proposed channel design technique will optimize the area occupied by the repeaters and the energy per bit based on the bandwidth requirement for the link. Second, if we can find a replacement for the SERDES transceiver that is more robust to variations, then our technique would result in a more efficient serialized link in terms of the number of interconnects.

6.3 Future Work

PVT variations can significantly limit the efficiency of the serial channel design. Further analysis of the impact of PVT variation on the wave-pipelined signaling scheme would be useful. The wave-pipelined SERDES scheme described in this thesis is not robust enough to tolerate much variation. Hence, modification of the transceiver to make it more robust is another direction for future work. Additional SERDES techniques for on-chip links that incorporate the PLL and CDR of off-chip schemes should be investigated.

In this study, we also proposed a bit ordering technique to minimize the energy of the serialized link. Although we showed that this technique would be useful for bus traces with both non-uniform and uniform statistics, we have not evaluated the effectiveness of the technique considering the practical implementation of the link, where $n_p$-bit parallel data is serialized onto $n_s$ interconnects. This needs to be pursued as future work. In addition, the total power overhead of the complete wave-pipeline-based SERDES transceiver under consideration has to be evaluated, rather than that of the interconnect alone as described here.
Overall, the work presented in this thesis provides many opportunities for additional research and some ideas that are useful for practical implementation.

References

[1] "International Technology Roadmap for Semiconductors (ITRS) 2004," http://public.itrs.net.
[2] I. Saastamoinen, T. Suutari, J. Isoaho, and J. Nurmi, "Interconnect IP for Gigascale System-on-Chip," ECCTD.
[3] A. Morgenshtein, I. Cidon, A. Kolodny, and R. Ginosar, "Comparative analysis of serial vs parallel links in NoC," in International Symposium on System-on-Chip, 2004, pp. 185-188.
[4] R. Dobkin, I. Cidon, R. Ginosar, A. Kolodny, and A. Morgenshtein, "Fast Asynchronous Bit-Serial Interconnects for Network-on-Chip," Technion-Israel Institute of Technology, 2004.
[5] M. Ghoneima, Y. Ismail, M. Khellah, J. Tschanz, and V. De, "Serial-link bus: a low-power on-chip bus architecture," in IEEE/ACM International Conference on Computer-Aided Design, 2005, pp. 541-546.
[6] K. Lee, S.-J. Lee, S.-E. Kim, H.-M. Choi, D. Kim, S. Kim, M.-W. Lee, and H.-J. Yoo, "A 51mW 1.6GHz on-chip network for low-power heterogeneous SoC platform," in IEEE International Solid-State Circuits Conference, 2004, pp. 152-518.
[7] S. Kimura, T. Hayakawa, T. Horiyama, M. Nakanishi, and K. Watanabe, "An on-chip high speed serial communication method based on independent ring oscillators," IEEE International Solid-State Circuits Conference, vol. 1, pp. 390-391, 2003.
[8] S.-J. Lee, K. Kim, H. Kim, N. Cho, and H.-J. Yoo, "Adaptive network-on-chip with wave front train serialization scheme," in Symposium on VLSI Circuits, 2005, pp. 104-107.
[9] R. Dobkin, R. Ginosar, and A. Kolodny, "Fast Asynchronous Shift Register for Bit Serial Communication," in 12th IEEE International Symposium on Asynchronous Circuits and Systems, 2006, pp. 117-127.
[10] I.-C. Wey, L.-H. Chang, Y.-G. Chen, S.-H. Chang, and A.-Y. Wu, "A 2Gb/s high-speed scalable shift-register based on-chip serial communication design for SoC applications," in IEEE International Symposium on Circuits and Systems, 2005, pp. 1074-1077.
[11] K. Lee, S.-J. Lee, and H.-J. Yoo, "SILENT: Serialized Low Energy Transmission Coding for On-Chip Interconnection Networks," in IEEE/ACM International Conference on Computer-Aided Design, 2004, pp. 448-451.
[12] S. Furber, "Future Trends in SoC Interconnect," white paper, Silistix, Inc. and Silistix, Ltd., 2005.
[13] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm," IEEE Computer, vol. 35, pp. 70-78, Jan. 2002.
[14] D. Applegate, R. E. Bixby, V. Chvatal, and W. Cook, "On the solution of traveling salesman problems," Documenta Mathematica, vol. ICM III, pp. 645-656, 1998.
[15] H. Bakoglu and J. Meindl, "Optimal Interconnect circuits for VLSI," IEEE Trans. Electron Devices, vol. ED-32, pp. 903-909, May 1985.
[16] V. Raghunathan, M. B. Srivastava, and R. K. Gupta, "A survey of techniques for energy efficient on-chip communication," in Design Automation Conference, 2003, pp. 900-905.
[17] S. Ramprasad, N. R. Shanbhag, and I. N. Hajj, "A coding framework for low-power address and data busses," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, pp. 212-221, June 1999.
[18] S. V. Morton, "Techniques for Driving Interconnects," in Design of High-Performance Microprocessor Circuits, A. Chandrakasan, W. Bowhill, and F. Fox, Eds. IEEE Press, 2000.
[19] D. Boning and S. Nassif, "Models of Process Variations in Devices and Interconnects," in Design of High-Performance Microprocessor Circuits, A. Chandrakasan, W. Bowhill, and F. Fox, Eds. IEEE Press, 2000.
[20] T. Granberg, Handbook of Digital Techniques for High-Speed Design. Prentice Hall PTR, 2004.
[21] J. C. Chen, "Multi-Gigabit SerDes: The Cornerstone of High Speed Serial Interconnects," white paper, Genesys Logic America, Inc.
[22] D. A. Hodges, H. G. Jackson, and R. A. Saleh, Analysis and Design of Digital Integrated Circuits in Deep Submicron Technology, 3rd ed. New York: McGraw-Hill, 2004.
[23] A. Deutsch, P. W. Coteus, G. V. Kopcsay, H. H. Smith, C. W. Surovic, B. L. Krauter, D. C. Edelstein, and P. L. Restle, "On-chip wiring design challenges for gigahertz operation," Proceedings of the IEEE, vol. 89, pp. 529-555, April 2001.
[24] L. Zhang, Y. Hu, and C.-P. Chen, "Wave-pipelined on-chip global interconnect," in Asia and South Pacific Design Automation Conference, 2005, pp. 127-132.
[25] W. P. Burleson, M. Ciesielski, F. Klass, and W. Liu, "Wave-pipelining: A tutorial and research survey," IEEE Trans. VLSI Systems, vol. 6, pp. 464-474, 1998.
[26] J. Xu and W. Wolf, "A wave-pipelined on-chip interconnect structure for networks-on-chips," in 11th Symposium on High Performance Interconnects, 2003, pp. 10-14.
[27] M. Hashimoto, A. Tsuchiya, and H. Onodera, "On-chip global signaling by wave pipelining," in IEEE 13th Topical Meeting on Electrical Performance of Electronic Packaging, 2004, pp. 311-314.
[28] V. V. Deodhar, "Throughput-Centric Wave-Pipelined Interconnect Circuits for Gigascale Integration," PhD thesis, Georgia Institute of Technology, 2005.
[29] T. Sakurai, "Closed-form expressions for interconnection delay, coupling, and crosstalk in VLSI's," IEEE Trans. Electron Devices, vol. 40, pp. 118-124, Jan. 1993.
[30] A. Nalamalpu and W. Burleson, "A practical approach to DSM repeater insertion: satisfying delay constraints while minimizing area," in 14th Annual IEEE ASIC/SOC Conference, 2001.
[31] V. Adler and E. Friedman, "Repeater design to reduce delay and power in resistive interconnect," IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, pp. 607-616, May 1998.
[32] K. Banerjee and A. Mehrotra, "A power-optimal repeater insertion methodology for global interconnects in nanometer designs," IEEE Trans. Electron Devices, vol. 49, pp. 2001-2007, Nov. 2002.
[33] N. M. G. Garcea and R. Otten, "Simultaneous analytic area and power optimization for repeater insertion," in IEEE/ACM International Conference on Computer Aided Design, 2003, pp. 568-573.
[34] L. N. Vicente and P. H. Calamai, "Bi-level and Multi-level Programming: A Bibliography Review," Journal of Global Optimization, 1994.
[35] M. Hashimoto, T. Yamamoto, and H. Onodera, "Statistical analysis of clock skew variation in H-tree structure," in Sixth International Symposium on Quality of Electronic Design, 2005, pp. 402-407.
[36] M. Ghoneima, Y. Ismail, M. Khellah, and V. De, "Reducing the Data Switching Activity on Serial Link Buses," in 7th International Symposium on Quality Electronic Design, 2006, pp. 425-432.
[37] E. Naroska, S.-J. Ruan, and U. Schwiegelshohn, "An Efficient Algorithm for Simultaneous Wire Permutation, Inversion, and Spacing," in International Symposium on Circuits and Systems, 2005, pp. 109-112.
[38] L. Macchiarulo, E. Macii, and M. Poncino, "Low Energy Encoding for Deep-Submicron Address Buses," in International Symposium on Low Power Electronic Design, 2001, pp. 176-181.
[39] D. Burger and T. M. Austin, "The simplescalar tool set, version 2.0," www.simplescalar.com.
