UBC Theses and Dissertations

Implementation and analysis of surfing pipelines Yang, Suwen 2005

Implementation and Analysis of Surfing Pipelines

by

Suwen Yang

B.E., Huazhong University of Science and Technology, 1995
M.S., University of Washington, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES

(Computer Science)

The University of British Columbia
August 2005

© Suwen Yang, 2005

Abstract

High performance digital systems make extensive use of pipelines. Three years ago, "surfing" pipelines were proposed. A timing pulse propagates through the surfing pipeline, and the individual logic elements of the pipeline are modified so that their delays are smaller in the presence of the pulse than in its absence. This creates an "event attractor" where events in the data path occur at the rising edge of the timing pulse. These attractors reduce timing uncertainties and improve the performance of the pipeline. A circuit technique called "preswitching" was proposed to implement the delay variation required for surfing. In this thesis, we demonstrate a working surfing chip and address issues of power consumption and robustness.

We demonstrate surfing by the design, fabrication and test of a chip using preswitching surfing circuits. The surfing ring in this chip supports two independent waves of computation separated only by the surfing effect; no latches or other storage elements are used. We operated the ring for over 48 hours and $2 \times 10^{15}$ surfing events and never observed an error.

The preswitching circuits in this chip exhibit unacceptable power consumption, motivating our work on energy-efficient designs. We introduce a new family of surfing circuits based on charge-sharing, dynamic gates. The design and simulation of a carry-lookahead adder show that this technique offers very competitive performance by standard metrics.
This design also demonstrates surfing in a design with a non-uniform circuit structure.

Finally, we develop a new method for robustness analysis. We formulate noise-margin analysis as a numerical optimization problem that takes the time-varying behavior of surfing and other dynamic logic designs into account. With this approach we compare the robustness of several high-performance logic families, quantify the timing stability of surfing circuits, and demonstrate trade-offs between performance and robustness.

Our demonstration of a physical surfing chip, design of novel, low-power surfing circuits, and new noise-margin analysis method together bring surfing closer to a practical reality. These tools and techniques will aid the development of novel logic designs to face the challenges of deep-submicron integrated circuit design.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
  1.1 Surfing
  1.2 Problem Statement
  1.3 Thesis Statement
  1.4 Thesis Organization
2 Background
  2.1 Pipelining
  2.2 High Speed Circuits
  2.3 Circuit Evaluation
  2.4 Noise Analysis
3 The Test Chip
  3.1 Structure of the Test Chip
  3.2 Design of the Test Chip
  3.3 Test Results
  3.4 Energy Overhead
  3.5 Observations
  3.6 Summary
4 Lower Power Surfing
  4.1 Surfing with Charge Sharing
  4.2 Simulation Results
  4.3 Summary
5 Noise Analysis for Surfing Logic
  5.1 Noise-Margin Analysis
    5.1.1 Calculating the Sensitivity Matrix
  5.2 Circuits
  5.3 Results
    5.3.1 Small Signal Stability
    5.3.2 Noise Margin as Design Aid
    5.3.3 Large Signal Stability
  5.4 Summary
6 Conclusions and Future Work
Bibliography

List of Tables

3.1 Commands Launched by the Serial Control Register
3.2 Comparison of Preswitching with Non-surfing, Dual-rail XOR Gates
4.1 Energy Comparison of Surfing and Non-surfing Dual-rail XOR Gates
4.2 Structure of the Backbone in Figure 4.7
4.3 Structure of the Spines in Figure 4.7
5.1 Parameters Used in Noise-Margin Analysis
5.2 Robustness of Different Logics

List of Figures

1.1 Timing Requirement of Surfing
2.1 A Simple Synchronous Circuit
2.2 Wave Pipelining with Two Waves
2.3 Timing Uncertainty of Wave Pipelining
2.4 Surfing Pipelining
2.5 Timing Requirement of Surfing Pipeline
2.6 A Domino Buffer
2.7 A Self-Resetting Domino Buffer
2.8 A Self-Resetting Preswitching Buffer
2.9 Operation of a Surfing Gate
2.10 Body Bias Voltage Generator Circuit
2.11 Operation of a Variable Threshold Voltage Keeper Circuit
2.12 Two Output Prediction Logic Inverters
2.13 A MOS Current Mode Logic Inverter
2.14 A DyCML Logic Inverter
3.1 The Surfing Ring
3.2 The Surfing XOR-Gate (true half)
3.3 The Output Cell
3.4 The Pseudolatch
3.5 The Two-Input Latch
3.6 The Input Cell
3.7 A Stage of the GasP Timing Ring
3.8 The GasP Cell (G-cell)
3.9 The Serial Control Register
3.10 The select Signal Generator
3.11 nlatch Used in the Generator
3.12 platch Used in the Generator
3.13 dblatch Used in the Generator
3.14 Pseudolatch Used in the Generator
3.15 The Synchronized select Signals
3.16 Frequency of f/8 versus Number of Tokens in the GasP Ring
3.17 The Test Chip
3.18 Relationship between Supply Voltage and Frequency of f/8
4.1 An Energy Efficient Surfing Circuit
4.2 Simulation of an XOR Gate with Charge-Sharing Surfing
4.3 The Brent-Kung Adder
4.4 A Domino Implementation of the PKG Block from [16]
4.5 The Surfing Circuit for Propagate
4.6 The Mixed Brent-Kung Adder
4.7 The Timing Chain
5.1 Noise-Margin Measurement Circuit
5.2 A Static CMOS Inverter
5.3 A First-Order Transistor Model [19]
5.4 A Chain of Surfing Buffers
5.5 A Self-Resetting Domino Buffer
5.6 Two Output Prediction Logic Inverters
5.7 Eigenvectors for a Static Buffer
5.8 Eigenvectors for a Self-Resetting Domino Gate
5.9 Eigenvector for an OPL Buffer with Input = 0
5.10 Eigenvector for an OPL Buffer with Input = 1
5.11 Eigenvectors of the Sensitivity Matrix for the Surfing Chain
5.12 Largest Eigenvalue versus Stage Delay of OPL Buffer with Input = 1
5.13 The Effects of Varying the Stage Delay
5.14 The Effects of Varying the Width of the N-channel Pull-Up
5.15 Noise-Margin Estimates with Input = 0
5.16 Noise-Margin Estimates with Input = 1
5.17 The Delay of an OPL Inverter
5.18 The Delay with Respect to Arrival Time
5.19 Computation Time for Different Logics
6.1 A Self-Resetting Buffer with Variable Threshold Voltage Keeper
6.2 Variable Threshold Voltage Keeper with fast Signals of Different Pulse Width

Acknowledgments

Without extensive support, discussion and endless encouragement from Dr. Mark Greenstreet, this work would not have been possible. I would not have completed the chip design in six months without the solid and clean start on the design done by Brian Winters.
I owe the students and staff in the UBC SOC lab a debt of gratitude, especially Roozbeh Mehrabadi and Roberto Rosales for helping me with the chip simulation and testing. Special thanks go to my husband Zhenyu Zhang who supports my every reasonable decision. Finally, I would like to thank my friends, Jihong Ren, KangKang Yin, Qiang Kong, Suling Yang, and many others, for such a joyful time in beautiful Vancouver.

SUWEN YANG

The University of British Columbia
August 2005

To my husband, Zhenyu Zhang, for his endless support and patience.

Chapter 1

Introduction

Advances in microelectronics systems in the past three decades have led to desktop computers that are far more powerful than large mainframes from a generation ago, graphics processors that offer stunning images and animation, and high-speed networks that provide a new medium for communication including the web, e-businesses, etc. The technological factors that drive high-end chips such as microprocessors, graphics processors, and network routers are primarily: advances in fabrication technology (smaller transistors), advances in digital circuits (faster logic gates), and advances in architecture (fine grained pipelining). This thesis makes contributions in the areas of circuits and pipelining.

A digital pipeline is an assembly line for processing information. The time to complete a pipeline step is the period of the CPU's clock (i.e. the inverse of the clock frequency). Architects improve performance by finding ways to break the tasks of the chip into smaller pieces such that each step of the pipeline can be performed in less time. This is known as "fine-grain" pipelining. For example, a typical CPU today may have a clock period corresponding to 8 or 12 basic logic operations (AND, OR, NOT, etc.) compared with a microprocessor from 20 years ago where typical clock periods were 50-100 gate delays [18].
Each pipeline stage is composed of logic gates. As described in chapter 2, the performance of gates can be improved by precharging or using voltage swings less than the full power-supply. In a precharged gate, one or more signals within the gate are periodically set to a default (typically high) value; then, the logic circuitry only has to handle the case of switching to the other (typically low) value, resulting in a faster overall design. In a partial voltage swing gate, the gate output does not traverse the entire range of the power supply. This allows a partial voltage swing gate to operate with smaller delays than its full swing counterpart.

Aggressive pipelining and circuit techniques create challenges for managing timing and noise issues. Fine grain pipelining reduces the clock period. Producing and distributing such high-frequency clock signals with adequate precision and uniformity is a serious design challenge. Precharged logic techniques can exacerbate these problems by requiring even finer resolution of the timing signals, with multiple clock phases needed to control the precharge operations of the gates within a pipeline stage. Furthermore, low voltage swing designs are inherently more susceptible to noise because the same disturbance is a larger percentage of the separation between high and low values than for a full-swing gate. These problems are exacerbated in deep-submicron technologies where transistors have very low switching times, wire delays are large, and power supply voltages can be as low as one volt or less [4, 10].

1.1 Surfing

Advanced, fine-grained pipelining techniques face a bottleneck: the latches. In a digital pipeline stage, data paths use different gates to achieve different functions, and gates work in varying environments due to noise, temperature, data dependent delays, and other factors. Hence it is very difficult to run all the data paths at the same speed. Latches, controlled by a global clock, are introduced into digital pipelines. When the latches are transparent, logic events are free to move to the next stage. Otherwise, logic events are blocked from propagating to the next stage. To synchronize logic events for the next stage, the period of the clock has to be greater than the delay of the slowest data path. Latches slow down fast data paths because logic events cannot propagate when the clock is low. Additionally, latches add extra delay to every data path to propagate data from one stage to the next.

Overcoming the disadvantages of latches in a pipeline becomes especially important at very high operating frequencies where the overhead from latches can be a large fraction of the total clock period. Surfing provides a chance to tackle these problems. In a surfing pipeline, latches and a global clock are no longer necessary. Instead, a timing reference pulse is introduced for every logic gate in the data path and propagated parallel to the data path. This timing pulse is used to modulate the delay of each logic gate as a smooth function of the time between the data event's arrival and that of the timing reference signal. Figure 1.1 illustrates this relationship. In this figure, the upper curve is the delay of events in the data path. The dashed line indicates the delay of the timing reference pulse from one stage to the next. Surfing occurs if the following conditions are satisfied:

• If a logic event arrives earlier than the rising edge of the timing reference pulse, then the delay of the logic event must be greater than that for the timing reference.
• Conversely, if a logic event arrives later than the rising edge of the timing reference pulse, then the delay of the logic event should be less than that for the timing reference.

[Figure 1.1: Timing Requirement of Surfing (curves: logic delay, control delay)]

Under these conditions, logic events are attracted to the rising edge of the timing reference pulse. If a logic event arrives earlier than the reference pulse, then in the next logic gate, the logic event and timing event will be closer to each other because the logic delay is greater than the delay of the reference pulse. In the reverse case, the distance between the data event and the timing event will decrease because the logic delay for this scenario is less than the delay of the reference pulse. In a chain, the logic events will eventually be attracted to the rising edge of the timing reference pulse such that the propagation delay of logic events matches the delay of the reference pulse. With surfing, all of the data paths have the same delay. Hence, latches are no longer necessary to synchronize the logic events. This is the charm of surfing, which makes latchless pipelining possible.

1.2 Problem Statement

In 2002, Winters and Greenstreet proposed preswitching surfing logic [38]. With extensive simulations, they demonstrated that surfing works. However, they did not address the following issues:

• A real chip may work in a different environment from the simulated chip, due to uncontrollable variations in the power supply voltage, temperature, process parameters, noise, and so on. Given the unusual nature of surfing, a physical demonstration is needed to make sure that nothing has been overlooked in the model.

• Power is now a dominant design concern [32, 28, 20, 29, 19] for most integrated circuit designs.
However, the preswitching technique greatly increases the power consumption of the chip because of a short circuit current used to create the surfing effect. To make surfing practical, new circuits are needed that overcome these power problems.

• The preswitching surfing technique uses partial voltage swing to produce the delay variation required for surfing. The impact on noise margin needs to be analyzed and quantified. However, as shown in figure 1.1, the delay of surfing gates is a time varying function and the behavior of surfing gates varies over time. Thus, the noise-margin analysis must consider the time domain properties of the disturbances. Existing analysis techniques either ignore this, or they make unrealistic, simplifying assumptions.

1.3 Thesis Statement

This thesis brings surfing closer to practice by implementing a surfing chip; developing novel, low-power surfing circuits; and introducing new methods for noise-margin analysis.

1.4 Thesis Organization

The thesis is organized as follows:

• Chapter 2 introduces related work for digital pipelining, high speed circuits and noise-margin analysis.

• Chapter 3 gives a detailed description of preswitching surfing logic and a chip designed to verify the theory.

• Chapter 4 introduces a novel, low power family of surfing circuits.

• Chapter 5 presents a new technique for analyzing the noise margins of dynamic circuits. This technique takes the time dependent behavior of dynamic circuits into account. We use this technique to compare the robustness of surfing circuits with other design styles.

• Chapter 6 concludes the thesis.

Chapter 2

Background

In this chapter, we provide an overview of related work for digital pipelining, high-speed circuits and noise-margin analysis.

2.1 Pipelining

In traditional synchronous design, as shown in figure 2.1, delays in data paths may vary over a wide range.
As mentioned in chapter 1, inserting latches allows proper operation of a circuit. However, latches add extra delays to all paths, including the slowest one. The clock's period must be longer than the delay of the slowest data path. The slowest data path determines the working frequency and throughput of a synchronous circuit.

[Figure 2.1: A Simple Synchronous Circuit]

Wave pipelining has been used in various designs to mitigate the overhead of latches. One of the earliest examples was the floating point unit for the IBM 360 model 91 [3]. Currently, wave pipelining is used in the L1 caches of most high performance microprocessors [41, 42, 40]. Burleson et al. provide an excellent survey and tutorial on wave pipelining [6]. The key idea in wave pipelining is that a pipeline can support k waves in flight between latches if the clock's period P satisfies inequality 2.1:

$$\delta_{max}/k < P < \delta_{min}/(k-1) \qquad (2.1)$$

where $\delta_{min}$ and $\delta_{max}$ are the minimum and maximum delay of data paths respectively. For example, if $\delta_{max}/2 < P < \delta_{min}$ (i.e. k = 2), the circuit shown in figure 2.2 can support two waves. The constraint $\delta_{max} < 2P$ means that a wave needs at most two clock periods to reach the right side of the combinational logic. Likewise, $\delta_{min} > P$ says that a wave will not arrive at the register on the right until after a full clock cycle, ensuring that it will not interfere with the previous wave of computation. When the clock signal, $\phi$, in figure 2.2 goes to 1, wave C is launched between latches 1 and 2. Since wave A has already arrived at register 2 before $\phi$ goes high, wave A is transferred to the right side of register 2. Wave B has been in the combinational logic for only one clock period; it will continue between registers 1 and 2 to complete its calculation during the next clock cycle. Since $\delta_{min} > \delta_{max}/2$, wave C cannot overtake wave B. So this pipeline can work correctly with two waves in flight. A few other technical conditions are required for wave pipelining as described in [6].

[Figure 2.2: Wave Pipelining with Two Waves]

Wave pipelining allows the circuit to work with k times the throughput of the traditional synchronous design and with the same latency. In practice, timing uncertainty hinders the wide-spread usage of wave pipelining. Along the data path, the timing uncertainty grows monotonically because every level contributes some amount. This uncertainty constrains the working frequency of the pipeline: inequality 2.1 implies that P must be greater than $\delta_{max} - \delta_{min}$. To minimize timing uncertainty, designers typically arrange logic gates in levels and introduce extra buffers as shown in figure 2.3 to match delays. Currently, wave pipelining is mainly used to pipeline caches.

[Figure 2.3: Timing Uncertainty of Wave Pipelining]

In 1991, Williams and Horowitz [36] introduced the concept of "zero-overhead" pipelines. A pipeline has zero overhead if no latency is introduced by control or latching. This can be achieved if a pipeline has a total latency equal to the sum of the latencies of its stages. Williams and Horowitz presented an implementation of a zero-overhead pipeline using domino circuits with a self-timed control circuit that generated the timing signals to achieve zero-overhead operation.

In 2002, Winters and Greenstreet proposed the surfing approach for pipelining [38, 39], which achieves negative overhead. In a surfing pipeline as shown in figure 2.4, a timing pulse is propagated parallel to the data paths. Every logic element in the data path is augmented with this timing pulse, which modulates the element's propagation delay. The surfing theory states two requirements:

1. When the timing pulse is 1, the maximum delay of the gate, $\delta_{1,max}$, is less than the minimum propagation delay of the timing pulse, $\delta_{f,min}$, where $\delta_1$ is the delay of the logic gate with fast = 1, and $\delta_f$ is the stage-to-stage delay of the fast pulse.

2. If the timing pulse is 0, the minimum delay of the gate, $\delta_{0,min}$, is greater than the maximum delay of the timing pulse, $\delta_{f,max}$, where $\delta_0$ is the delay of the logic gate with fast = 0.

Winters and Greenstreet summarized these two requirements with the following inequality:

$$\delta_{1,max} < \delta_{f,min} \le \delta_{f,max} < \delta_{0,min} \qquad (2.2)$$

[Figure 2.4: Surfing Pipelining]

If inequality 2.2 is satisfied, the arrival time of logic events in the data paths will be attracted to that of the timing pulse. As shown in figure 2.5, the period of the timing pulse is divided into four intervals. The interval $[t_1, t_4]$ is the capture region. In interval $[t_1, t_2]$, the input event arrives earlier than the timing pulse, and the logic delay is greater than the propagation delay of the timing pulse. Conversely, in interval $[t_3, t_4]$, the input event arrives after the timing pulse, and the logic delay is less than the delay of the timing pulse. Whenever the input event comes in interval $[t_1, t_2]$ or $[t_3, t_4]$, at the next stage, the input event will be closer to the timing pulse. Hence, after several stages, the input events will converge to arrive in the steady-state surfing interval $[t_2, t_3]$. The interval $[t_4, t_5]$ is the metastability interval. Events in this interval will eventually exit to surf with either the preceding wave or the following wave. Surfing creates an event attractor such that the delay spread along the data paths is kept small regardless of the pipeline length. A transparent latch is an extreme case of a surfing gate: the latch's delay goes to infinity when the latch is not enabled and drops to a bounded value when the enable signal is asserted. Surfing logic can be seen as "soft" latching. Unlike a
Unlike a  traditional latch which slows down early arrivals, surfing designs accelerate  late  arrivals to achieve higher performance than purely combinational designs. Surfing increases throughput and decreases latency simultaneously,  11  because surfing logic  elements have less delay than their non-surfing counterparts.  2.2  High Speed Circuits  We are interested in revising high-speed circuits to meet the timing requirements of surfing pipelining. M a n y C M O S logic styles such as domino [21, 17, 7], output prediction logic [27], variable threshold voltage keepers [23], have timing properties that vary in response to externally applied precharge or clock signals and could potentially be adapted for use with surfing. We describe each of the logic families here and examine their potential for surfing in chapter 6. A domino gate [21, 17] consists of a dynamic gate, a static inverter lout and a keeper m3. A dynamic gate replaces the pull-up P M O S stack in the corresponding static gate with a precharge transistor, controlled by the clock, as shown in figure 2.6. W h e n the clock is low, this transistor pulls node x high. T h e gate begins to evaluate its logic function when the clock goes high. If in goes high, node x is pulled low, and the output inverter lout drives node out high. Transistor m3 is a keeper ensuring that node x remains high during the evaluation phase unless a high value on in  12  clock  time(ns)  Figure 2.6: A Domino Buffer  causes transistors m l and m2 to pull it low.  Domino circuits can operate much  faster than their static counterparts because they do not present large p-channel devices in their input loads.  Transistor m2 is called a footer and prevents short-  circuit currents that lead to excessive power consumption if clock goes low before the input. 
Like transparent latches, footed domino gates are an extreme case of a surfing gate: the domino gate's delay goes to infinity when the clock is low and drops to a bounded value when the clock is high. It is possible to eliminate the footer to obtain an even faster gate. This is called a "footless" gate. However, power consumption can become a severe issue because of the short circuit current path formed by transistors m1 and m4 if the clock goes low before the other inputs.

NORA [13] is a variation of domino that cascades dynamic stages by alternating N- and P- stages. However, the lack of a restoring inverter makes NORA very sensitive to noise: an input disturbance slightly greater than the transistor threshold voltage can propagate through a chain of NORA gates. Furthermore, removing the static inverter of the domino design often results in little improvement in performance because the inverter provides useful gain for driving the loads of the domino gate [34]. For these reasons, NORA has found little use in practical designs.

[Figure 2.7: A Self-Resetting Domino Buffer (waveforms versus time in ns)]

Self-resetting domino logic removes the clock in domino gates. Figure 2.7 shows a self-resetting domino buffer [7, 8]. As with the original domino gate, node x is pulled low when the input goes high, and the output inverter drives node out high. This triggers the self-reset. After node out goes high, node p goes low, taking the place of the clock signal from the original domino gate. This charges x back to a high value and $I_{out}$ then drives out low again. Self-resetting CMOS is a pulse logic: logic values are represented by pulses rather than steady voltage levels. The self-reset mechanism allows operation without footers: with careful design, the reset of each gate starts slightly after the resets of its predecessors complete. While offering high performance, self-resetting domino presents several design challenges.
The sizing of the keeper presents a trade-off between speed, power consumption, and noise margin. When implementing functions more complicated than a buffer, the n-channel device that pulls down on node x is replaced by an appropriate network of n-channel transistors. The design must ensure sufficient overlap of pulses on different inputs to fully trigger the gate. To avoid short-circuit currents during precharge, the precharge control signal, p, must drop low after the input(s) have been reset but early enough to complete the reset of out in time for the next gate's precharge. The lack of design tools to automate the synthesis and validation of self-resetting circuits has limited their application. In particular, designers need tools that enable maximizing performance while satisfying system robustness requirements.

Winters and Greenstreet [38, 39] revised the self-resetting domino gate to obtain a surfing gate as shown in figure 2.8. The input labeled fast is the timing pulse that modulates the gate's delay. When fast is low, no input pulses are expected, and transistor m3 keeps node x high, taking the role of the keeper from the domino designs presented earlier. During the high portion of a fast pulse, node x floats at its high level until appropriate input pulses arrive. Simultaneously, current flowing through transistor m5 pulls node out slightly above ground. In particular, transistor m5 and the pull-down device in inverter $I_{out}$ form a voltage divider. As these are both n-channel devices, their properties track closely over variations of fabrication parameters. Winters and Greenstreet found that by making m5 about 80% of the width of the pull-down in $I_{out}$, it pulls out to about 0.2 $V_{DD}$ when fast is asserted. This provides the speed-up required for surfing. Figure 2.9 shows the operation of the surfing buffer under various conditions.
Curves A, B, and C show, respectively, the propagation of a pulse through the surfing buffer, the propagation of the same input pulse with fast connected to ground, and the response of the gate when no input pulse is received. Comparing curves A and B reveals the speed-up that occurs when fast is asserted. Curve C shows the voltage shift on the output due to preswitching independent of the input pulse. Curve in is the input for gates with outputs A and B, and x is the precharged node for the gate with output A. In chapter 3, we describe the implementation and test of a chip that demonstrates surfing with preswitching circuits.

[Figure 2.9: Operation of a Surfing Gate (waveforms versus time)]

The added power consumption, due to the fighting between transistor m5 and the pull-down transistor in inverter $I_{out}$, is quite disadvantageous. Surfing is achieved with partial voltage swing. Thus, noise margin is a concern for this kind of design.

The keepers in dynamic gates, such as those in figures 2.6 and 2.7, play an important role in providing the gates' noise immunity. As the technology scales down, keeper sizing becomes more important as exponentially increasing subthreshold leakage currents threaten the reliable operation of deep submicron dynamic circuits. A weak (small) keeper will provide an unacceptably small noise margin for the circuit. However, a strong (big) keeper will increase the power consumption and increase the delay of the circuit due to the fighting of the keeper and the circuit's pull-down stack [2, 22]. Kursan and Friedman [23] proposed using two levels of supply voltages to the n-well for the keeper transistor of a domino gate. Figure 2.10 describes a circuit to generate the variable voltage supplies to the n-well; the logic circuit is traditional domino as depicted in figure 2.6. As shown in figure 2.11, different voltage levels are applied in different operation phases.
When the clock is low, the gate is precharged. At the end of the precharge phase, the n-well is raised to a high-supply voltage ($V_{DD2}$) to increase the threshold voltage of the keeper transistor. The resulting weak keeper allows high-speed operation at the beginning of the evaluation phase. Shortly into the evaluation phase, the gate should have completed its operation and now needs to retain its result for use by subsequent logic gates. The n-well voltage is lowered to strengthen the keeper, thereby improving the noise margin of the gate and enabling it to robustly hold its output value until the next precharge phase. This allows the fast operation of a weak keeper during evaluation and the robustness of a strong keeper after evaluation has completed. However, the capacitance between the well and the substrate and the well resistance are large enough for typical processes that it is not clear if this approach can be used at high clock frequencies.

[Figure 2.10: Body Bias Voltage Generator Circuit]

Output prediction logic [27] (OPL) combines properties of static and domino designs. A static gate can be converted to an OPL version by adding a footer transistor to the pull-down network and a precharge pull-up to the output. Both of these added transistors have their gates connected to a clock signal. For example, figure 2.12 shows two cascaded OPL inverters. OPL uses many finely spaced clock phases. When the clock for an OPL gate is low, the gate output is pulled high. The clock for the gate should go high slightly before the preceding gates complete evaluation. This causes the output of the gate to start to fall while waiting for the inputs to settle. If the inputs remain high, the output will drop all the way to ground. On the other hand, if the inputs transition such that the output should be high, then the output will recover to the high level.
With proper timing, the output is at an intermediate level when the inputs arrive, reducing the effective propagation delay of the gate because the transition does not start from one of the rails. Like domino circuits, OPL offers significant speed-up compared with static CMOS. However, this speed-up depends on maintaining the intended relationship between the clock phases and the output transitions of the OPL gates. Noise margin is another issue. If a small noise occurs while the output is at an intermediate level, it may corrupt the intended timing relationship between the clock and input.

Figure 2.11: Operation of a Variable Threshold Voltage Keeper Circuit

The circuits described above are voltage mode logic circuits. For this kind of logic, transistors are used as switches that turn on or off to create a path to power or ground according to the voltage of the inputs of the gate. However, for current mode logic, some transistors can be in partially ON states. Figure 2.13 is a MOS current mode logic (MCML) inverter. Transistor m2' is a DC current source controlled by Vref. R1 and R2 are two pull-up resistors (typically implemented with small PMOS devices whose gates are connected to ground). The logic function is implemented by the transistors in the dashed box as shown in figure 2.13. An MCML gate operates differentially, with each input pair being the outputs of a differential circuit. The value of the output node of a gate depends on the difference between the currents passing through the branches of the circuit. For example, if Vgs(m1) is higher than Vgs(m2), the current through R1 exceeds the current through R2, and the voltage on N1 drops.
Eventually it reaches a steady state and the current through m1 is equal to ICS. This causes the voltage on node x to increase and bring transistor m2 to cut-off. Then the voltage on N2 goes to VDD. For an MCML gate, the logic part can be implemented with NMOS transistors only, which switch faster than PMOS transistors. The reduced voltage swing also reduces the dynamic power dissipation. However, the static power consumption increases because of the DC current source ICS.

Figure 2.13: A MOS Current Mode Logic Inverter

Dynamic Current Mode Logic (DyCML) [1] uses a virtual ground to reduce the voltage swing and avoid the static power consumption of MCML. Transistor m8, as shown in figure 2.14, is used as a capacitor to form the virtual ground. When clock is low, nodes out.T and out.F are precharged to VDD. Transistor m2' is on and node w is discharged to GND. Once clock goes high, transistors m2', m4T and m4F are off. Transistor m2 switches on the current paths from out.T and out.F to w through x. Then, transistors m1T and m1F determine which of out.T or out.F will be pulled down according to the inputs in.T and in.F. The path with the higher input will have the larger discharging current. Transistors m7T and m7F speed up the evaluation and serve to maintain the logic values on out.T and out.F. With a reduced voltage swing on x and the use of differential logic, DyCML achieves less delay and consumes less power than a corresponding MCML circuit. Because of the reduced voltage swing, noise margin is a critical issue with DyCML.

2.3  Circuit Evaluation

In battery powered systems, such as cell phones and laptops, circuit designers need to pay extra attention to lowering energy consumption to extend battery life. Practical limitations of heat dissipation have made power consumption a primary design concern for desktop computers and servers as well.
Performance is no longer the single most important feature of a circuit. With multiple design objectives, comparing different circuits is challenging. It is not fair to compare circuits by energy consumption (E) or delay (t) only. Circuit designers can often find ways to trade off energy consumption for performance. Power dissipation is not a sufficient metric to evaluate circuits. By simply increasing the period of the clock, the power dissipation of the circuit will drop. Energy is proportional to the square of the supply voltage. Though lowering the supply voltage reduces energy dissipation, delay increases at the same time. $Et$ and $Et^2$ are two commonly used metrics to reflect the trade-offs of power and performance.

Gonzalez and Horowitz proposed the $Et$ metric in [15] because the energy dissipation of a circuit is dependent on its performance. However, it is difficult to reduce energy dissipation and improve performance at the same time. This metric favors a design that stresses both energy efficiency and performance. Gonzalez and Horowitz in [14] calculated supply and threshold voltages for optimal $Et$. With the first-order model of energy and delay of a CMOS circuit, lowering supply and threshold voltage is advantageous to increase energy efficiency, especially when transistors are velocity saturated [19, p. 57-66]. Due to process variations and variability in operating conditions, these advantages are limited by the need to provide adequate noise margin.

If velocity saturation of transistors is not considered and the power supply voltage is not close to the threshold voltage, $Et^2$ is independent of the power supply [26]. However, these assumptions are inaccurate for low power supply voltage and deep submicron processes where velocity saturated operation dominates. $Et^2$ strongly emphasizes performance. It favors a circuit design which trades off energy consumption for a small reduction of delay.
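As a small worked illustration of the two metrics, the following sketch computes $Et$ and $Et^2$ for two gates and reports their ratios. The delay and energy values are the ones reported later in table 3.2; the function and variable names are illustrative only and do not come from the thesis.

```python
def energy_delay_metrics(energy_pj, delay_ps):
    """Return (Et, Et^2) for one gate, in pJ*ps and pJ*ps^2."""
    et = energy_pj * delay_ps
    et2 = energy_pj * delay_ps ** 2
    return et, et2

# Energy per operation (pJ) and delay (ps) as listed in table 3.2.
baseline = energy_delay_metrics(0.1630, 92.0)      # non-surfing domino XOR
preswitching = energy_delay_metrics(0.4016, 75.0)  # preswitching surfing XOR

# A ratio above 1 means the preswitching gate is worse by that metric.
et_ratio = preswitching[0] / baseline[0]
et2_ratio = preswitching[1] / baseline[1]

print(f"Et ratio:   {et_ratio:.2f}")
print(f"Et^2 ratio: {et2_ratio:.2f}")
```

With these inputs the $Et^2$ ratio comes out near 1.64 and the $Et$ ratio near 2.0, which shows how much more weight $Et^2$ gives to a delay reduction than $Et$ does.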
In this thesis, we compute both $Et$ and $Et^2$ to characterize the trade-off between power and performance in our designs and to compare our circuits with other approaches.

2.4  Noise Analysis

Noise margin is another important measure of circuit quality that can be traded with power and performance. For example, increasing the size of transistor m5 in figure 2.8 increases the voltage shift on node out. This increases the difference between the fast and slow delays, thus increasing timing robustness while compromising the voltage noise margin. In this case, surfing trades voltage robustness for timing robustness.

In real designs, signals are disturbed from their ideal values due to capacitive coupling from nearby wires (crosstalk), variation in the power supply voltage due to the resistance and inductance of the power and ground networks combined with switching currents, substrate noise, thermal noise, etc. Designers use noise margin measures to quantify the robustness of logic circuits in the presence of such disturbances. For static CMOS gates, the disturbance is typically modeled as an offset of the input voltage away from the ideal values of power or ground. The static noise margin is defined as the input offset that brings the gate to a point where the small signal gain has a magnitude of one [19]. Clearly, this ensures that the change in the output voltage will be less than the change to the input voltage. Thus, disturbances smaller than this static noise margin diminish in a long chain of gates.

As the technology moves to very deep submicron feature sizes, the static noise margin, which assumes DC noise, becomes more and more conservative. This is because digital noise often exhibits narrow pulse widths and gates behave like low pass filters. In addition, dynamic gates exhibit different responses to noise because their dynamics change over their multiple phases of evaluation.
Take a footed domino buffer as shown in figure 2.6 as an example. When the clock input is low, x will go high and out will go low regardless of the value of in. In other words, the output is independent of the input during this phase and the noise margin is unbounded. On the other hand, when clock is high, x will be pulled down if in moves above the n-channel threshold voltage. The degree to which nodes x and out are disturbed depends on the relative size of the keeper pull-up and the time and shape of the disturbance on in.

The dynamic noise margin depends strongly on the waveform of the disturbance and, in many cases, on the output load. Various researchers have proposed techniques for measuring the noise margin of dynamic circuits. Zolotov and Blaauw [43] proposed a latch transition failure criterion: a failure is said to occur if the noise changes the state of a memory element. While this equates noise margin with system failure, determining the exact space of disturbances for which the circuit will operate properly is impractical or impossible for most circuits. This motivates using approximations that provide bounds on the actual noise margin. One of the earliest approaches is that proposed by Larsson and Svensson [24], who model the disturbance as a pulse parameterized by its width and height. Others have noted that common disturbances often have other shapes. For example, triangular and exponential pulses have been considered [11, 30]. Shepard has proposed a mixed criterion, namely finding the smallest dynamic disturbance that brings the output to the unity-gain point of static analysis [30, 31]. As described in [43], Shepard's criterion provides neither an upper nor a lower bound for the actual noise margin.

Because of the timing pulse introduced in surfing circuits, the noise margin of the surfing circuits is strongly dependent on the arrival time of the noise, which is ignored in previous noise analysis techniques.
Such time-varying dynamics were not considered in these earlier papers. Thus, we present a new technique for analyzing the noise margin of dynamic circuits in chapter 5.

Chapter 3

The Test Chip

In this chapter, we describe the chip we designed and tested to verify the surfing theory.

3.1  Structure of the Test Chip

We implemented a proof-of-concept test chip to demonstrate surfing. The key structure on the chip is a twelve-stage ring that calculates a pseudo-random sequence on 11-bit, parallel words. To keep the focus of the chip on surfing, we avoided long wires for data and implemented a ring that only requires nearest neighbor communication. The sequence that we calculate is based on the linear-feedback shift register (LFSR [12]) sequence:

$$w(i,j) = w(i-1,\,j-1) \;\text{XOR}\; w(i-1,\,j) \tag{3.1}$$

where w(i,j) is the j-th bit of the i-th word in the sequence and j-1 is calculated modulo 11. The actual chip implements a slight variation on this sequence as described in section 3.2.

Figure 3.1: The Surfing Ring

We implement the recurrence of equation 3.1 with an array of surfing XOR gates. To set and observe the values of the waves in the surfing ring, we embed chains of "input" and "output" cells in the array that form a serial scan chain.
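The recurrence of equation 3.1 can be sketched in software as a rotate-and-XOR on an 11-bit word. This is a minimal model of the ideal recurrence only, not of the chip (which implements a variation of the sequence); the word is held in a Python integer with bit j at position j, and all names are illustrative assumptions.

```python
WIDTH = 11                 # bits per word, as in the ring
MASK = (1 << WIDTH) - 1

def next_word(word):
    """Apply equation 3.1 once: new bit j = old bit (j-1 mod 11) XOR old bit j."""
    # Rotating left by one places old bit j-1 at position j (bit 10 wraps to bit 0).
    rotated = ((word << 1) | (word >> (WIDTH - 1))) & MASK
    return word ^ rotated

def advance(word, steps):
    """Apply the recurrence `steps` times."""
    for _ in range(steps):
        word = next_word(word)
    return word

def cycle_length(start):
    """Steps until `start` recurs; raises if `start` is not on a cycle."""
    word = next_word(start)
    steps = 1
    while word != start:
        if steps > (1 << WIDTH):
            raise ValueError("start word never recurs")
        word = next_word(word)
        steps += 1
    return steps

if __name__ == "__main__":
    print(bin(next_word(0b00000000001)))  # one step from a single set bit
    print(cycle_length(0b00000000011))    # cycle length of an even-parity word
```

One property visible in this model is that every word produced by the recurrence has an even number of ones, so a word with odd parity can only appear as an initial value and never recurs.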
Figure 3.1 depicts the ring: cells labeled 'X' are surfing XOR gates, cells labeled 'I' are input multiplexors with input serial registers and surfing pseudolatches, cells labeled 'O' combine a surfing XOR gate with the output serial register and a surfing pseudolatch, and cells labeled 'G' are the GasP [33] handshaking cells that generate the fast signals. The "Control" cell handles the interfaces to the serial scan chain as well as synchronizing the control signals for transferring data between the scan chain and the surfing array. Section 3.2 describes the design of these components in greater detail.

To describe the ring, we write w(i,j) to denote the j-th bit of a word output by column i. Each X and O cell in the main array receives surfing data inputs from its west and southwest neighbors for w(i-1, j) and w(i-1, j-1) respectively from equation 3.1. The east and northeast directed connections between cells in figure 3.1 depict these connections. The I cells receive a surfing data input from their west neighbor. The I cells perform no computation and under normal operation simply propagate values to the east and northeast neighbors. Hence the I cells have no input from their southeast neighbors. The fast signals from the GasP chain propagate downward in each column. The south directed connections between cells indicate these connections. Each fast signal is one long wire that spans the height of the array. This is the only non-local communication in the ring. The I and O cells have connections to propagate the select and unload control signals from the "Control" block. These are southeast directed connections in the figure. In addition, the I and O cells also have connections to propagate the serial data inputs and output.
These connections also are southeast directed.

Figure 3.2: The Surfing XOR-Gate (true half)

The test chip includes a simple serial control interface that allows us to set the initial values of the computations, take snapshots of the ring's state, start and stop the GasP timing chain, and set the number of tokens in the timing chain. The ring can operate with one or two surfing waves. By operating the ring with two waves, we demonstrate that it can support independent waves of computation without interference for arbitrarily long periods of time. The two chains of I cells in the figure provide data for initializing the two waves, and the two chains of O cells take snapshots of the two waves. To verify correct operation of the ring, we initialize the ring with two waves with a known separation in the pseudo-random sequence. After the ring has run for a while, we take a snapshot of the ring's state and verify that the two waves have values with the same separation as the original values. The ring continues to operate during a snapshot, allowing subsequent snapshots to be taken without reinitialization.

Figure 3.3: The Output Cell

3.2  Design of the Test Chip

Given the novelty of surfing pipelines, this section provides descriptions of each cell that we used in the ring. We used dual-rail domino with "preswitching" as described in [38] to effect the required speed-up when fast is asserted. Figure 3.2 shows the true half of the X-cell. The number beside each transistor is its shape factor. The n-channel pull-up controlled by fast has a shape factor that is about 80% of that for the pull-down in the inverter for the domino output.
These transistors fight with each other during preswitching. Using n-channel devices for both provides good tracking over variations in fabrication parameters and operating conditions. N-channel transistors are weak pull-ups, and this arrangement moves the gate outputs to about 0.2 VDD when fast is asserted and provides a peak reduction in gate delay of about 30%. To provide some margin when surfing, we set the stage-to-stage delay for the surfing ring to be about 80% of the propagation delay for non-surfing domino.

Figure 3.4: The Pseudolatch

As shown in figure 3.3, the output cell extends the surfing XOR gate by adding a serial scan chain with a surfing control signal for transferring data from the surfing path into the serial chain. Keeping with our approach of using only nearest neighbor communication, the unload signals surf along the O-cell chains using "pseudolatches" shown in figure 3.4. Since the delay of the pseudolatch should be the same as the stage delay of the fast signal, for simplicity, we add the fast gated NMOS to the pull-down stack to resemble the design of the XOR gate. We keep the preswitching surfing design in the pseudolatch to maintain its timing robustness. This maintains the proper timing relationship between the fast signal and the control signals propagated by the pseudolatches. Like other surfing components, the pseudolatch provides no static state holding capability.

We used a two-input latch shown in figure 3.5 to transfer data from the surfing ring to the serial scan chain. The overlap of d1 and en1 must be large enough to acquire valid data from the XOR gate. To achieve this, we used a design for our two-input latch that has a small set-up and hold window and a negative hold time.
By carefully sizing these transistors, we achieved a set-up and hold window (the set-up time is 61ps and the hold time is -37ps) that is less than one FO4 delay¹ across all five process corners.

¹An FO4 delay is the delay for one static inverter driving four inverters of the same size as itself [34]. In the TSMC 0.18μ process, one FO4 delay is about 90ps.

Figure 3.5: The Two-Input Latch

Figure 3.6: The Input Cell

Figure 3.6 shows the input cell. It replaces the dual-rail XOR gate with a multiplexor. We implemented the multiplexor simply by changing the inputs to the transistors of the XOR circuit. Thus, the two gates have very similar delay characteristics, which simplifies surfing.

To avoid broadcasting data and control signals, we placed the I-cells along a diagonal. While it might seem natural to place them parallel to the diagonal from w(i-1, j-1) to w(i,j), this would preclude proper initialization of the ring. Suppose that the I-cells were placed in this "natural" way. The X-cells above the I-cells would not be properly initialized, as an X-cell receives inputs from its west and southwest neighbors. Then the X-cells below the I-cells would receive their southwest inputs from each other and would not be initialized correctly. Thus, we placed the I-cells on the other diagonal, which explains a handful of peculiarities of the ring. First, to retain the regularity of the layout, the X-cells and O-cells are placed on diagonals parallel to the I-cells.
To set both the w(i-1, j-1) and w(i-1, j) inputs, we used two parallel chains of I-cells for each wave. Consider an X-cell that is immediately to the left of an I-cell. The next XOR calculation for the wave is performed by the O-cells to the right. As shown in figure 3.1, the horizontal path from such an X-cell to the corresponding O-cell goes through two I-cells, whereas the "north-east" path goes through only one I-cell. This disrupts the LFSR calculation from equation 3.1. We determined through simulation that the resulting sequence has a period of 1533, which is quite sufficient for our tests. As an added benefit, this arrangement allows the snapshot from a single diagonal chain of O-cells to completely describe the state of a wave. Thus, we chose to use this approach to generate the pseudo-random sequence for our design.

Having examined the construction of the ring, we now look at the generation of timing and control signals. Our timing generator uses a variation of the "GasP backwards" control described in [38]. That design had two gate delays in the forward direction, which nicely matches the domino structure of our surfing gates, and four delays in the reverse direction. We added two more inverters to the backward path as shown in figure 3.7 to ensure adequate delay for the self-reset of the buffer to generate the fast signals. With this extra delay, the ring operates correctly.

From an asynchronous design perspective, this extra delay serves an unusual purpose. For proper surfing, the GasP chain must be token limited [35]; in other words, the acknowledge event from the right must arrive at the NAND gate before the request event from the left. As described in [9, 37], the time separation of the two inputs affects the delay of the stage. In particular, when the two input events are nearly simultaneous, the delay from the last input is greater than when the events are more widely separated.
This is known as the "Charlie effect", and it is this phenomenon that allows our GasP ring to maintain adequate separation of the pulses while operating well inside the token-limited regime.

In figure 3.7, the NAND gate in each stage has an extra input to enable the NAND gate. In one stage, this extra input is connected to start and the others are connected to the power supply. If the start signal is 0, the GasP timing chain halts. The GasP timing chain starts to run once the start signal goes high.

Figure 3.7: A Stage of the GasP Timing Ring

The G-cell in figure 3.1 extends each GasP stage with a multiplexer and a two-phase static latch. Figure 3.8 shows the schematic of a G-cell. The input3 input to the NAND gate is either connected to start or the power supply. The G-cell uses the serial data input to set the number of tokens in the timing chain. When reset is high and start low, p(i) in the GasP timing chain is set to the inverse of the value held in the two-phase latch in the corresponding stage. The multiplexer also functions as a keeper for the shared stage wire of the GasP circuit. Once reset goes low and start high, the value of the p(i) signal is overtaken by the value set by the free running of the GasP chain. Thus the multiplexer's driving capability is designed to be very small. We may load m tokens in the GasP ring such that, on the left side of this special NAND gate at stage i, p(i-1), p(i-2), ..., p(i-m) are set to VDD and the others to 0.

We have a control circuit to input and synchronize the data. It is composed of three parts:

1. the serial control register
2. the circuit to generate the select signals
3. the circuit to generate the unload signals

The serial control register, shown in figure 3.9, uses two-phase static latches controlled by the external clock to transfer the data. Each box in the figure represents a two-phase latch. The external clock's frequency is quite low; normally we set the frequency to be less than 300 kHz. For proper operation of the serial control register, a valid frame of serial input data takes the following coding format, where Di represents a valid data bit: 11111 )D1 D2 D3 D4 )D5 D6 D7 D8. Five consecutive 1s set the enable signal to a high voltage level. The enable signal then enables the parallel registers and triggers the corresponding commands. Table 3.1 lists all the commands that can be launched by the serial control register. Note that it is never necessary to set five consecutive ones in D1 through D8.

Figure 3.9: The Serial Control Register

Table 3.1: Commands Launched by the Serial Control Register

Data  Symbol  Command
D1    reset   load the data in the latches to the GasP timing chain
D2    start   start to run the GasP timing chain
D3    LFSR    start to run the ring
D4    output  take a snapshot of the ring
D5    load    load the serial data into the ring and GasP timing chain
D6    data    the data to the LFSR ring and GasP timing chain
D7    φ1      the internal clock φ1
D8    φ2      the internal clock φ2

For proper operation of the chip, the first step is to load in data for the I-cells and to initialize the GasP ring. Thus we set D5 to 1 to enable the loading. Each G-cell, I-cell and O-cell has a two-phase latch inside controlled by φ1 and φ2. These latches are connected together as in figure 3.9. The input of the first latch in the chain is the serData signal. The output of the chain is connected to a probe pad to allow the testing of the long serial chain. For correct operation, D7 and D8 cannot be 1 at the same time.
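The command bits of table 3.1 can be modeled with a small helper that assembles the eight data bits D1 through D8 and enforces the rule that D7 and D8 are never 1 together. This is only a sketch under stated assumptions: it covers the data bits alone and deliberately leaves out the framing around them and the clocking of the register, and the function and flag names are illustrative, not taken from the thesis.

```python
# D1..D8 in the order given by table 3.1; the Python-side names are illustrative.
COMMAND_BITS = ["reset", "start", "lfsr", "output", "load", "data", "phi1", "phi2"]

def command_word(**flags):
    """Return the data bits [D1, ..., D8] for the requested command flags."""
    unknown = set(flags) - set(COMMAND_BITS)
    if unknown:
        raise ValueError(f"unknown command bits: {sorted(unknown)}")
    bits = [1 if flags.get(name) else 0 for name in COMMAND_BITS]
    # The two internal clock phases must never be asserted together.
    if bits[6] and bits[7]:
        raise ValueError("D7 and D8 must not both be 1")
    return bits

load_word = command_word(load=True, phi1=True)  # D5 and D7 set
reset_word = command_word(reset=True)           # D1 set, D2 clear
print(load_word, reset_word)
```

Keeping the symbolic names next to the bit positions makes it harder to set the wrong bit when a test sequence is written out by hand.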
To perform a test, we first reset the GasP timing chain by setting D1 high and D2 low. We can then load multiple tokens into the GasP timing chain. The ring inside the chip can run with one or two tokens in the GasP timing chain. Then we start the GasP timing chain by setting D2 high and D1 low. We then start the ring by setting D3 to generate the select signal used in the ring for proper initialization. We set D4 to generate the unload signal to take a snapshot of the ring and observe the values from a probe pad.

Figure 3.10: The select Signal Generator

The circuits used to generate the select and unload signals are almost the same. We will describe the circuit to generate the select signal. Setting D3 makes the input to the synchronizer go high. We use the three-bit ripple counter synchronizer from [25] to synchronize the rising edge of D3 with a fast signal fast(i). The input to the synchronizer is the LFSR signal. We label the output of the synchronizer as syn_pre. Using the circuit shown in figure 3.10, we produce pulses on the select signals. Figures 3.11, 3.12, 3.13 and 3.14 describe the boxes used in figure 3.10. Figure 3.15 shows the output of the select signal generator. select1_1 and select1_0 are the dual-rail pair inputs to the first chain of I-cells. select2_1 and select2_0 are the pair for the second chain.
The s1_pre, s1_0_pre, s2_pre and s2_0_pre signals go through a pseudolatch shown in figure 3.14 to synchronize with the next fast signal. This pseudolatch is similar to the pseudolatch in figure 3.4 except that the NMOS gated by fast(i) in the pull-down stack is removed. From simulation, we noticed that s1_pre, s1_0_pre, s2_pre and s2_0_pre arrive later than fast(i+2). Removing this NMOS device further reduces the minimum delay of the data events and increases the timing robustness. Then select1_1, select1_0, select2_1, and select2_0 go through several pseudolatches to synchronize with the fast signals before they are used as inputs to the I-cells.

Generating the unload signal is the same as generating the select1_1 signal. Thus we use the same circuit as in figure 3.10 to generate the unload signal, except that the circuits used to generate select1_0, select2_1 and select2_0 are removed and the input to the three-bit ripple counter synchronizer is the output signal from the serial control register.

Figure 3.11: nlatch Used in the Generator

Figure 3.12: platch Used in the Generator

Figure 3.13: dblatch Used in the Generator

Figure 3.14: Pseudolatch Used in the Generator

In addition to the serial chain, we have probe pads that allow us to observe
the fast pulse for one of the ring stages and the same after scaling with a divide-by-eight counter. While the ring will only surf properly with one or two tokens in the GasP timing chain, we can initialize the chain with any number of tokens. This allows us to measure the timing properties of the GasP chain. We also have probe points for observing the synchronized control signals from the I-cell and O-cell chains.

Figure 3.15: The Synchronized select Signals

3.3  Test Results

We fabricated our test chip using the TSMC 0.18μ CMOS process. Figure 3.17 shows the fabricated chip. To keep our academic fabrication service (CMC, the Canadian Microelectronics Corporation) happy, we kept the chip's pin count low. Thus, the chip is designed to be tested on a probe station. The row of eight probe pads at the top provides power, ground, and all data inputs to the chip. The clock signal is for the serial scan and control path. Each column of three pads at the bottom provides a ground-signal-ground arrangement that we use to probe an output signal. The signal fast8 is the fast signal from stage 8, and the signal f/8 is the output of an on-chip divide-by-eight counter driven by this fast signal. The signals s0_1, s1_0, s1_1, and U1_1 are select and unload signals from the I-cell and O-cell chains. We can simultaneously probe one signal from the top and one from the bottom.

Observing f/8, we determined the speed of the GasP pipeline. Figure 3.16 shows the relationship between the frequency of f/8 and the number of tokens in the GasP timing ring. As the token number increases from 0 to 3, the frequency of f/8 increases linearly with a slope equal to 80.3 MHz per token. From 4 tokens to 8 tokens, the frequency of f/8 decreases with a slope equal to -26.3 MHz per token.
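The two slopes convert directly into per-stage delays of the GasP ring. The sketch below assumes that f/8 is the fast frequency divided by eight and that a token (or, in the falling region, a bubble) crosses each of the 12 stages once per trip around the ring; the constant and function names are illustrative only.

```python
STAGES = 12   # stages in the GasP timing ring
DIVIDER = 8   # on-chip divide-by-eight counter that drives the f/8 pad

def stage_delay_ps(slope_mhz_per_token):
    """Per-stage delay in ps implied by the slope of f/8 versus token count."""
    slope_hz = abs(slope_mhz_per_token) * 1e6
    return 1e12 / (slope_hz * DIVIDER * STAGES)

forward_ps = stage_delay_ps(80.3)    # rising part of the curve in figure 3.16
reverse_ps = stage_delay_ps(-26.3)   # falling part of the curve in figure 3.16

print(f"forward delay: {forward_ps:.1f} ps")
print(f"reverse delay: {reverse_ps:.1f} ps")
```

The forward value rounds to 130ps and the reverse value to 396ps, so the reverse path of a stage is roughly three times slower than the forward path.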
The frequency of f/8 reaches its peak value with 3 tokens in the ring. With a 1.8V supply, the forward delay $\delta_F$ of a GasP stage with two tokens is 130ps, calculated by equation 3.2 where left_slope is the slope of the solid line in figure 3.16.

Figure 3.16: Frequency of f/8 versus Number of Tokens in the GasP Ring

$$\delta_F = \frac{1}{\text{left\_slope} \times 8 \times 12} = \frac{1}{80.3\,\text{MHz} \times 8 \times 12} = 130\,\text{ps} \tag{3.2}$$

The reverse delay $\delta_R$ of a GasP stage with two tokens is 396ps, calculated by equation 3.3 where right_slope is the slope of the dotted line in figure 3.16.

$$\delta_R = \frac{-1}{\text{right\_slope} \times 8 \times 12} = \frac{-1}{-26.3\,\text{MHz} \times 8 \times 12} = 396\,\text{ps} \tag{3.3}$$

The forward delay $\delta_F$ is somewhat slower than predicted by HSPICE simulations, which yielded 95ps for typical parameters (roughly 1 FO4 delay) and 115ps at the slow-slow corner². This discrepancy can be explained by the voltage drop and heating caused by the chip's power consumption. With a 1.8 volt power supply, the chip consumes 120mA. From the serial-data out pad, we conclude that the on-chip VDD is 1.7V, so there is a moderate voltage drop across the power probes. Furthermore, the temperature on the edge of the chip is 34C, and we expect that the temperature in the core would be higher, whereas the HSPICE simulations assumed a die temperature of 25C. The combined effects of the lowered power supply voltage and elevated die temperature suffice to explain the lower than expected speed. These also motivate our work on more energy efficient surfing described in chapter 4.

²Typical design methods for high-performance digital circuits make extensive use of circuit simulations. Because it is not possible to perform simulations for all possible values of the physical parameters, they are often divided into five representative points according to the performance of the n-channel and p-channel transistors. For example, the fast-slow "corner" has fast n-channel devices and slow p-channel ones. The other four standard corners for the parameters are typical-typical, fast-fast, slow-slow, and slow-fast.

We then verified proper surfing behavior. We loaded two waves into the ring, allowed it to run, and then took snapshots and verified the waves. With the power supply voltage equal to 1.8V, the frequency of the f/8 signal is 160 MHz. Running the ring for various intervals of up to 48 hours, we observed no errors. This means that the ring supported two independent waves of computation separated only by the surfing effect for $2.6 \times 10^{15}$ moves between pipeline stages without a single error. We believe that this compellingly demonstrates robust surfing on a real chip.

We then varied the power supply voltage. Figure 3.18 describes the relationship between power supply voltage and the frequency of f/8. When the supply voltage is low, f/8's frequency increases linearly with the power supply voltage as predicted by a first-order, long channel transistor model [19]. We use the line $f = 130.94 \times V_{DD} - 73.54$ (with f in MHz and $V_{DD}$ in volts) to fit the testing data from 1V to 1.9V. When the supply voltage is greater than 1.9V, the curve bends because of velocity saturation [19]. We observed correct surfing when the external power supply voltage ranged from 1.44V to 2.67V. At the lower voltage, the stage-to-stage propagation time was 175ps, and at the upper voltage, the propagation delay was 96ps. Thus, our surfing circuit works over a range of nearly 2:1 in the power supply voltage and speed. As the speed is determined by the GasP timing ring, it is not surprising that speed is roughly linear in voltage as predicted by simple CMOS scaling models.

3.4  Energy Overhead

Our testing demonstrated that surfing works correctly.
However, preswitching surfing consumes an unacceptable amount of power. Using HSPICE simulations, we obtained a detailed breakdown of the power consumption of the surfing XOR gates from our test chip (e.g. see figure 3.2). We compared a chain of 16 preswitching, surfing, dual-rail XOR gates with a chain of 16 non-surfing, self-resetting domino dual-rail XOR gates. The fast signals are generated by a series of chains of two small inverters. Furthermore, we define the side in a dual-rail XOR gate with an output event as the active side, and the other side as the inactive side. Table 3.2 summarizes the comparison when the width of the fast pulse is equal to 250ps. This pulse width corresponds to that obtained by simulating the chip. The preswitching XOR gate uses about 2.46 times as much energy per operation as the non-surfing, self-resetting domino equivalent. The surfing gate has a propagation delay that is 18.5% less than that of the non-surfing gate (75ps vs. 92ps). Thus, the surfing gate is 63% worse by the $Et^2$ [26] metric than an ordinary self-resetting domino gate, since $2.46 \times (75/92)^2 \approx 1.63$.

Table 3.2: Comparison of Preswitching with Non-surfing, Dual-rail XOR Gates

Item                                                Non-surfing   Preswitching XOR
delay (ps)                                          92            75
energy per operation (pJ)                           0.1630        0.4016
fraction of energy consumed by the active side      0.9994        0.5190
fraction of energy consumed by the inactive side    0.0006        0.4466
fraction of energy used to generate fast signals    0             0.0344

3.5 Observations

An interesting phenomenon that we discovered in HSPICE simulations is that not only the forward path is surfing, but also the reset path, which surfs on the falling edge of the self-reset signal, pre(i,j). For the surfing XOR gate shown in figure 3.2, if the falling edge of the input signal comes earlier than pre(i,j), then because pre(i,j) is still high, the voltage on node x remains low until pre(i,j) goes low.
If the falling edge of the input signal comes later than pre(i,j), then because of the fighting between the precharge PMOS and the pull-down stack, the voltage on x has already risen a small amount and node x will charge to $V_{DD}$ sooner. Thus the delay of the reset path depends on the arrival time of the falling edge of the input signal. If this delay variation meets the timing requirement of surfing, a chain of such XOR gates can operate properly, and the pulse width of the preswitching surfing gate is less than that of the non-surfing, self-resetting domino gate because the surfing effect on the reset path makes the reset happen earlier. Otherwise, if the reset path is slower than the forward path, the input pulse will become wider and wider and finally the pulse will disappear.

In HSPICE simulations, we also observed that to make the GasP timing chain work in all five corners, we need to increase the reverse delay in the GasP chain, using the Charlie effect to maintain proper separation between pulses. The separation must be greater than the reset time of the self-reset buffer used to generate the fast signals.

3.6 Summary

We demonstrated a working surfing chip. This chip uses the preswitching surfing technique. This approach shows excellent tracking over variations in process parameters, power supply voltage, and temperature. We validated the robustness to process parameters using five-corner HSPICE simulation (see footnote 2 in section 3.3), but could not test this in the lab because we had chips from a single fabrication run. The robustness to voltage and temperature variation is clear both in simulation and from laboratory measurements. Without latches or other storage elements, this chip can support 2 waves. We observed no errors while running it for over 48 hours. This chip can operate correctly with the supply voltage ranging from 1.44V to 2.67V. However, preswitching surfing consumes an unacceptable amount of power.
Our testing demonstrates that the average current is 120mA with supply voltage equal to 1.8V.

Chapter 4

Lower Power Surfing

Our test chip demonstrates that surfing can work in a real chip. However, the extra power consumption caused by preswitching is unacceptable for most real applications. As mentioned in Chapter 3, in a non-surfing dual-rail XOR gate, the side of the gate that generates the output pulse accounts for more than 99% of the total energy consumption. In contrast, the two sides of the surfing gate are nearly equal in energy consumption regardless of which generates the output pulse. Nearly all of the energy consumption of the inactive side of the surfing gate is due to current through the n-channel pull-up transistor that effects the preswitching. The inactive side of the gate consumes this short-circuit current throughout the duration of the fast pulse, whereas the active side only consumes the current on the leading edge until it actually generates the active pulse. Thus, we see nearly four times the short-circuit current through the preswitching pull-up on the inactive side as on the active side.

The active side of the surfing gate still consumes 34% more energy than the non-surfing gate to charge the internal dynamic node. This is due to the self-reset starting before the inputs of the gate return to ground. Initially, we were surprised to observe that this short-circuit current through the precharge path could not be eliminated by increasing the delay in the self-reset path. In our surfing gate, the fast signal accelerates the evaluation phase of the gate but not the reset phase: evaluation propagates faster than reset through our gate. Thus, the self-reset signal (pre(i,j) in figure 3.2) is asserted before the inputs fall regardless of the delay in the self-reset loop.
In particular, the self-reset signal occurs just enough before the falling of the gate's inputs to allow reset events to surf on the self-reset signal, consuming short-circuit current reminiscent of that in the evaluation phase.

In this chapter, we present a novel family of surfing circuits aiming to reduce power consumption and demonstrate them using the design of a Brent-Kung [5] carry-lookahead adder.

4.1 Surfing with Charge Sharing

From the observations described in section 3.4, we concluded that to reduce power we needed to avoid the short-circuit currents associated with preswitching. The key idea behind our new design is to transfer charge from the internal node of the surfing gate to its output while waiting for input events to arrive. This lowers the voltage on the internal node and raises the voltage on the output, both of which lead to faster operation. We assume dual-rail gates, or their generalization, one-hot gates, which can have more than two outputs but in which exactly one output generates an event during any evaluation cycle. When one side of the gate generates an output event, we disable the charge transfer paths and restore the node voltages for the inactive side(s). As this approach involves no short circuits from power to ground, it uses much less energy than preswitching circuits.

Figure 4.1 shows our new surfing circuit for a dual-rail XOR gate with inputs a and b and output y. Figure 4.2 describes its operation. When fast is low, the gate is reset: x.T and x.F are pulled high, and y.T and y.F are pulled low. When fast goes high, all four of these nodes head toward an intermediate voltage. Unlike the preswitching based design from chapter 3, the voltage shift for this design is primarily through charge sharing between the capacitors at nodes x.T and y.T (resp. x.F and y.F). This eliminates the short-circuit current that caused the large energy consumption in the preswitching design.
This intermediate voltage situation is resolved by the pull-down network, which brings one of x.T or x.F lower than the other, which in turn brings the corresponding output high. Once one output goes high, the cross-coupled devices act as keepers for the inactive side of the gate, restoring the x signal to a high level and y to low. The three n-channel devices in series from y.T to x.T (resp. y.F to x.F) include inputs to terminate the charge sharing once an output goes high. Thus, the surfing effect is only applied until an output is generated, further reducing energy consumption.

Optimizing power consumption and speed involves a trade-off between adding transistors to shut off short-circuit paths early and the extra capacitive loads presented by these extra transistors. In our simulations, we found that we could make the transistors for the feedback path and the keepers quite small, thus minimizing their speed and power penalty. For the preswitching technique, the minimum propagation delay is dependent on the size of the pull-up NMOS transistor. However, for the charge-sharing technique, the minimum delay is not dependent on the size of transistors on the feedback path. Instead, it is related to the ratio of the capacitances on the output node and the internal node.

Figure 4.1: An Energy Efficient Surfing Circuit

Figure 4.2: Simulation of an XOR Gate with Charge-Sharing Surfing (traces: fast, x.T, x.F; horizontal axis: time (ns))

In the new surfing circuit design, we abandoned the self-reset style. This is for two reasons: first, as mentioned earlier, surfing on the reset path causes extra power consumption; second, self-resetting circuits require aligning the input pulses to assure enough overlap. The chip we used to demonstrate surfing uses similar circuits in the ring to guarantee this.
However, circuits with more complex structures need extra attention to ensure this. We use the domino structure here to simplify the design for more complex circuits.

We simulated chains of dual-rail XOR gates implemented with non-surfing, self-resetting domino, the preswitching technique described in chapter 3, and the new charge-sharing approach described here. The new charge-sharing gates are 7% faster than their preswitching counterparts (65ps vs. 70ps forward latency). In table 4.1.A, we summarize the energy per operation consumed by a dual-rail XOR gate with different design styles. Tables 4.1.B and 4.1.C list the energy breakdown for the active and inactive sides respectively. In all three tables, the energy is scaled to 0.1630pJ, the energy of the non-surfing, self-resetting XOR. We set the propagation delay of the fast signals for the chains of preswitching and charge-sharing surfing XOR gates to be 75ps, 18% faster than the non-surfing XOR chain.

Per operation, charge-sharing surfing XOR gates consume only 34% more energy than non-surfing XOR gates, while preswitching surfing consumes roughly 146% more. For preswitching surfing XOR gates, the active and inactive sides consume 0.28 units and 1.1 units of energy in the pull-up NMOS gated by the fast signals, which explains most of the extra energy consumption. By using the internal capacitance to charge the load capacitance, charge-sharing surfing XOR gates consume less power than a preswitching gate. The speed-up available by charge sharing depends on the ratio of the internal and load capacitances of the gate. For the XOR chain, each gate is loaded with the input capacitance of two more XOR gates of the same type (i.e. non-surfing, preswitching or charge sharing).

Compared to the non-surfing XOR gates, precharging nodes x.T and x.F uses more power in the charge-sharing design.
For the active side, this is because when fast goes low, the inputs to the pull-down stack are still high. The fighting between the pull-down stack and the precharge PMOS consumes extra power. On the inactive side, the increased energy consumption is used to restore the internal node x.F or x.T to $V_{DD}$. This is mostly done by the keeper. As we expected, the inactive side of the charge-sharing XOR gate consumes much less power than that of the preswitching gate because the feedback path is cut off as soon as an output event is generated. The active side of the preswitching gate consumes 0.5360 units (0.4481 + 0.0879) of energy to charge the internal dynamic node. However, the active side of the charge-sharing gate spends 0.4530 units of energy to do the same job. This further explains why we abandoned the self-reset design.

To apply our charge-sharing based approach to a realistic function block, we designed a Brent-Kung, carry-lookahead adder using charge-sharing surfing. Figure 4.3 shows the 16-bit version of the adder. Black squares represent PKG (carry propagate, kill, and generate) blocks. The lowest square in each column is grey; grey blocks produce G and K outputs but omit P. Triangles represent buffers. The bold wires depict the critical delay path through the adder.

Our design starts from the domino implementation described by Harris and Sutherland in [16]. Figure 4.4 shows the implementation of the PKG block. Noting that all blocks have a fan-out of two, Harris and Sutherland propose an implementation where all black and grey cells are sized to have the drive capability of a "unit" inverter. Likewise, buffers on the critical path have unit drive capability, and buffers off the critical path have half of this drive.
Furthermore, they assume that horizontal wires have a capacitance of one sixth the gate capacitance of a unit inverter per column crossed. HSPICE simulations show delays from 522ps (for 0 + 0) to 650ps (for -1 + 1) to complete a 32-bit add. The energy per addition is roughly 50pJ.

Table 4.1: Energy Comparison of Surfing and Non-surfing Dual-rail XOR Gates

A: Total Energy (footnote 3)

Item                                                Non-surfing   Preswitching Surfing   Charge Sharing Surfing
delay (ps)                                          92            75                     75
energy per operation                                1.0000        2.4632                 1.3440
fraction of energy consumed by the active side      0.9994        0.5190                 0.8265
fraction of energy consumed by the inactive side    0.0006        0.4466                 0.0911
fraction of energy used to generate fast signals    0.0000        0.0344                 0.0824

B: Energy Consumption on the Active Side of Dual-rail XOR Gates (footnote 3)

Item                             Non-surfing   Preswitching Surfing   Charge Sharing Surfing
precharge                        0.3334        0.4481                 0.4530
keeper                           0.0209        0.0081                 0.0433
inverter to the precharge PMOS   0.0900        0.0879                 NA
inverter to the output node      0.5551        0.4541                 0.6145
pull-up NMOS                     0.0000        0.2802                 NA

C: Energy Consumption on the Inactive Side of Dual-rail XOR Gates (footnote 3)

Item                             Non-surfing   Preswitching Surfing   Charge Sharing Surfing
precharge                        0.0000        0.0000                 0.0209
keeper                           0.0005        0.0007                 0.1012
inverter to the precharge PMOS   0.0000        0.0000                 NA
inverter to the output node      0.0001        0.0001                 0.0002
pull-up NMOS                     0.0000        1.0994                 NA

Footnote 3: All energy values are normalized to 0.1630pJ, the energy of a single operation of a non-surfing, dual-rail, self-resetting XOR gate.

We focus our attention on the design and operation of the PKG (black) cells. The application of our approach to the GK (grey) cells and buffers is similar and straightforward. The propagate, generate, and kill signals are mutually exclusive. Rather than implementing each of them with dual-rail circuits, we implement a triple-rail gate (a.k.a. "one-hot") instead, just as in the domino implementation from [16] (see figure 4.4). Figure 4.5 shows our implementation of the P portion of the PKG block; the implementations of G and K are similar.

We noted that not all cells in the adder are on the timing critical path. In particular, the cells in the initial PKG calculation for the most-significant half of the word, and the cells in the final sum calculation for the least-significant half of the word, are relatively non-critical. We used standard domino circuits for these cells. Figure 4.6 shows a 16-bit version of such an adder. Compared to figure 4.3, in figure 4.6, to allow for the increased delay spread due to the domino gates, cells in dashed box A are moved down one level and cells in dashed box C are moved up one level. Because the number of levels of the timing chain remains the same in figure 4.6, cells in dashed box B are moved down two levels. Compared with the pure surfing design, the mixed adder design yielded a power savings of roughly 6% with no degradation of performance.

Figure 4.3: The Brent-Kung Adder

Figure 4.4: A Domino Implementation of the PKG Block from [16]

Figure 4.5: The Surfing Circuit for Propagate

Figure 4.6: The Mixed Brent-Kung Adder

Figure 4.7: The Timing Chain

We now consider the design of the timing chain. Three factors motivate a different design than the one we used in the ring. First, the adder is a linear pipeline - our goal is to minimize the propagation delay, not to support arbitrarily large numbers of iterative calculations.
Second, stages in our new surfing design have very low propagation delays, as small as 45ps. It is impractical to design a GasP stage with stage-to-stage delays this small. Third, the adder has long wires, which we avoided in the ring. This requires either resizing the transistors in the PKG cells at each level, or having different gate delays. We followed the example of [16] and chose to keep all PKG cells the same for simplicity. However, this requires having different delays at different stages of the timing chain.

Figure 4.7 sketches our timing chain. We assume a design where the adder is part of a larger pipeline as depicted in figure 2.4. Thus, we have a clock input to the chain, which produces the signals fast_0 ... fast_10. To obtain a delay less than one buffer delay, we constructed a "backbone" chain, with a separate output chain for each fast signal. By carefully sizing the inverters in each chain, we obtain the fine spacing desired for the fast signals. Tables 4.2 and 4.3 list the sizes of inverters in the timing chain for the adder with domino and surfing cells inside.

Table 4.2: Structure of the Backbone in Figure 4.7

Backbone Inverter   Shape Factor   Connection to Chain
1                   33.3/16.7      fast_0, fast_0
2                   16.7/23.3      fast_1
3                   10.7/4         fast_2
4                   36.7/18.3      NA
5                   36/21.7        fast_3
6                   31.3/15.7      NA
7                   26.7/9.3       fast_4
8                   24/12          fast_5
9                   9.3/4          fast_6
10                  30/13.7        NA
11                  27.3/13.7      fast_7
12                  25.3/12.7      fast_8
13                  3.1/2.7        fast_9
14                  8/6            NA
15                  10/2.7         fast_10
Table 4.3: Structure of the Spines in Figure 4.7

Columns: Spine; Level 1; Level 2; Level 3; Level 4; Level 5; Delay from fast_{i-1} to fast_i (ps)

Spines (two rows per signal): fast_0, fast_1, fast_2, fast_3, fast_4, fast_5, fast_6, fast_7, fast_8, fast_9, fast_10

Entries, in source order: 0, 10/3, 36/66.7, 20/10, 49.3/26.7, 178/66.7, 26.7/18, 22.7/11.3, 8.3/5.3, 16.7/15.3, 82/26.7, 7.7/2.3, 8/8, 27.3/10, 14.7/10.7, 48, 12.7/2, 12/7.3, 46/20, 13.3/10, 52, 10/3.7, 13.3/10, 57.3/23.3, 9.3/6.7, 56, 10/3.3, 8/2.7, 16/8, 48/20, 8/4, 20/14, 67.3/37.3, 21.3/12.7, 47, 8/4, 16.7/11.3, 65.3/33.3, 13.3/10.7, 48, 9.3/6.7, 8/2, 14.7/7.3, 70/26.7, 7.7/4, 24/12, 96.7/40, 26.7/15.3, 6/2.7, 13.3/8.7, 74.7/20, NA, NA, 13.3/12.7, 48, 10/8, 62, 50/25, 45, 46, 62, 45

The charge-sharing surfing adder does not track process parameter, voltage and temperature variations as tightly as the surfing ring described in chapter 3. Three factors contribute to this:

1. The preswitching surfing circuits in the ring use two NMOS devices that fight with each other to create the surfing effect. However, the charge-sharing surfing circuit depends on the ratio of the capacitors at the internal dynamic node and output node.

2. The surfing ring has only XOR and MUX gates. They have the same structure. In the ring, the wires connecting the XOR and MUX gates have almost the same length. However, in the adder, each layer involves PKG cells, KG cells and buffers. Furthermore, as shown in figure 4.3 and figure 4.6, the wire lengths vary considerably from one layer to another. Even in the same layer, gates drive different capacitive loads.

3. In the surfing ring, gates with similar structure drive nearly identical loads. Thus, we designed the timing ring to have a similar structure to the XOR ring and therefore to have closely tracking delays. The charge-sharing surfing adder loses these similarities.
Delays in the data paths have significant components due to gate and wire capacitances. On the other hand, delays in the timing path are determined primarily by gate capacitances. Thus, the timing chain fails to track variations in the relative weight of gate and wire capacitances in the data path. It might be possible to improve this tracking by deliberately inserting long wires in the timing chain. Due to a lack of design tools to support such wire matching, we have not attempted this.

4.2 Simulation Results

We simulated our design using HSPICE and compared it with the domino design from [16] in terms of energy consumption and delay. We report the results for a 32-bit adder, compared with the domino adder, assuming typical process parameters. To show that our circuits are robust to process parameter variations, we verified the surfing circuits and timing chain using standard "five-corner" simulations.

The surfing adder achieves roughly a 19% reduction in delay compared with the domino adder: the surfing adder has a propagation delay of 524ps compared with the worst case delay of 650ps for the domino design. As we have emphasized, surfing dramatically reduces the delay spread in pipelines. For the domino adder, the data-dependent variation in delay until the last output bit settled is 128ps. For the surfing adder, the corresponding delay spread is 33ps. The tighter timing bounds of the surfing adder arise from the surfing effect as well as the uniformity of these circuits in the final two stages. We note that larger spreads occur within the adder due to variations in the loading of different gates and the use of inputs that skip over surfing levels. This illustrates how surfing can be used to achieve tight timing bounds, and how the methodology can be relaxed to achieve lower delay or power consumption as desired.
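The delay figures above, together with the per-addition energies reported in the next paragraph (44.5pJ for the surfing adder and 50.4pJ for the domino adder), determine the Et and Et² metrics used in this section. The following is a minimal sketch of that arithmetic; the function and variable names are hypothetical and chosen only for illustration.

```python
# Sketch: energy-delay metrics for the two 32-bit adders, using the values
# quoted in section 4.2 (energies in joules, delays in seconds).

def energy_delay_metrics(energy_j: float, delay_s: float) -> tuple[float, float]:
    """Return (Et, Et^2) for one operation."""
    return energy_j * delay_s, energy_j * delay_s ** 2


SURFING_ENERGY_J, SURFING_DELAY_S = 44.5e-12, 524e-12
DOMINO_ENERGY_J, DOMINO_DELAY_S = 50.4e-12, 650e-12

surfing_et, surfing_et2 = energy_delay_metrics(SURFING_ENERGY_J, SURFING_DELAY_S)
domino_et, domino_et2 = energy_delay_metrics(DOMINO_ENERGY_J, DOMINO_DELAY_S)

# Relative reductions of the surfing adder with respect to the domino adder.
delay_reduction = 1 - SURFING_DELAY_S / DOMINO_DELAY_S
energy_reduction = 1 - SURFING_ENERGY_J / DOMINO_ENERGY_J
et_reduction = 1 - surfing_et / domino_et
et2_reduction = 1 - surfing_et2 / domino_et2

print(f"Et:   surfing {surfing_et:.3g} Js,    domino {domino_et:.3g} Js")
print(f"Et^2: surfing {surfing_et2:.3g} Js^2, domino {domino_et2:.3g} Js^2")
print(f"reductions: delay {delay_reduction:.0%}, energy {energy_reduction:.0%}, "
      f"Et {et_reduction:.0%}, Et^2 {et2_reduction:.0%}")
```

These come out at roughly 19%, 12%, 29% and 43% for delay, energy, Et and Et², matching the reductions quoted in the text.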
The energy consumption of the surfing adder is actually slightly lower than that of the domino design: the surfing adder consumes 44.5pJ per add compared with 50.4pJ for the domino adder, an 11.7% decrease. We were initially surprised that the surfing design used less energy per operation than its domino counterpart. Then we realized that the delayed fast signals reduce the power consumption caused by the fighting between the precharge PMOS and the pull-down stack. Thus, the surfing adder has $Et$ and $Et^2$ metrics of $2.33 \times 10^{-20}$ Js and $1.22 \times 10^{-29}$ Js², while the domino adder has metrics of $3.28 \times 10^{-20}$ Js and $2.13 \times 10^{-29}$ Js². Assuming a long-channel model for voltage scaling, as to $Et$, our design can achieve the same performance as the domino adder but with a 29% reduction in energy consumption (battery operated applications would run 41% longer, i.e. $\frac{1}{1 - 0.29} \approx 1.41$). With HSPICE simulation, we adjusted the power supply voltage to 1.575V to run the mixed adder at roughly 650ps. Per operation, the average energy consumption is 32.3pJ, a 36% reduction compared with the domino adder. As to $Et^2$, based on the same assumption, our design can achieve the same performance as the domino adder but with a 43% reduction in energy consumption (battery powered applications would run 75% longer).

To optimize the speed of the domino design, we used footless circuits like those in the surfing version. This results in short-circuit current during precharge. The same clock controls precharging of all of the domino gates. While all of the precharge transistors turn on at the same time, the internal nodes of each PKG gate remain fairly low until the inputs to the gate go low. Thus, the precharge propagates across the adder from one level to the next while the gates consume significant short-circuit current.
In contrast, the minimal timing uncertainty in the surfing design ensures that very little short-circuit current occurs during precharge. The short-circuit current of the domino design could be reduced by adding footers, but doing so would slow down the adder significantly. For example, an additional delay of 21% is reported in [16]. While this would result in a lower energy design than the surfing circuit, it would certainly be unfavorable by $Et^2$ and probably unfavorable by the $Et$ metric as well. Alternatively, we could use a timing chain as for the surfing design to stagger the precharge signals and reduce the short-circuit current. Of course, the energy for such a timing chain would have to be included in the energy budget for the domino adder. At this point, we have not performed a detailed comparison with alternative implementations of the domino adder.

Of course, with any comparison, there are both caveats and opportunities for improvement. The most critical question that we faced was how to account for the power for clock generation for the domino adder and for creating the fast signals for the surfing one. For the numbers reported above, we used the total power consumption of the surfing adder including its timing chain. For the domino design, we used the power consumption of the adder and of a three-inverter chain with a step-up of four to model the clock tree. We believe that this underestimates the power consumption of the domino design.

While our design shows significant $Et^2$ advantages, we see many opportunities for improvement. We could design the timing chain to propagate the falling edge of the fast signal slower than the rising edge. This should eliminate the remaining short-circuit current during the reset phase.

4.3 Summary

We proposed a novel surfing technique named charge-sharing surfing.
Compared with preswitching surfing, charge-sharing surfing can achieve slightly smaller delay, but with much less power consumption, by reducing the power consumed on the side with no output events.

We used the charge-sharing surfing technique to implement a Brent-Kung adder. Compared with the domino counterpart, the surfing adder has 19%, 12%, 29% and 43% reductions in delay and in the $E$, $Et$ and $Et^2$ metrics. We observed that the surfing adder bounds the delay spread within a 33ps interval. However, for the domino adder, the delay spread is 128ps. Unlike the surfing ring described in chapter 3, which had similar cells and stage-to-stage wires of almost the same length, the Brent-Kung adder design contains different surfing cells and non-uniform stage-to-stage wire lengths. This extends the usage of surfing to circuits with more practical functions.

Chapter 5

Noise Analysis for Surfing Logic

This chapter presents a noise analysis approach that considers arbitrary disturbance waveforms. This approach is more general than traditional methods that model noise with DC offsets or fixed shape waveforms. In particular, we divide a cycle of operation of a gate into n intervals, and consider disturbances that are stair-step functions over these intervals. By making n suitably large, we obtain good approximations of arbitrary waveforms. We use a metric to measure the input disturbance and the resulting output disturbance. For example, the $l_2$ metric (RMS) on the voltage corresponds roughly to the energy of the disturbance. We define the noise margin to be the smallest, non-zero, input disturbance that results in an output disturbance that is at least as large. This leads to a formulation of noise-margin analysis as a non-linear optimization problem.
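The stair-step representation and the RMS metric described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the disturbance values, and the 1 ns cycle are hypothetical, and equal-width intervals are assumed.

```python
import math

# Sketch: a disturbance represented by n voltage levels, one per equal-width
# interval of a cycle, plus its l2 (RMS) magnitude. All values are hypothetical.


def stair_step(levels: list[float], period: float, t: float) -> float:
    """Value at time t (0 <= t <= period) of the stair-step waveform."""
    n = len(levels)
    index = min(int(t / period * n), n - 1)  # t == period falls in the last interval
    return levels[index]


def rms(levels: list[float]) -> float:
    """RMS magnitude of a stair-step waveform with equal-width intervals."""
    return math.sqrt(sum(v * v for v in levels) / len(levels))


def is_amplified(input_levels: list[float], output_levels: list[float]) -> bool:
    """True if the output disturbance is at least as large as the input disturbance."""
    return rms(output_levels) >= rms(input_levels)


disturbance = [0.0, 0.1, -0.1, 0.2]  # volts, n = 4 intervals
period = 1.0e-9                      # one cycle of operation, assumed 1 ns

sample = stair_step(disturbance, period, 0.9e-9)  # falls in the fourth interval
magnitude = rms(disturbance)
```

In this representation, the noise margin is the smallest non-zero input magnitude for which `is_amplified` holds, where the output levels would come from a circuit simulation that is not modeled here.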
Our approach subsumes the rectangular, triangular, and exponential pulse models [24, 11, 30] as special cases.

To solve these optimization problems, we need an efficient way to compute the gradient of the magnitude of the output disturbance with respect to the components of the input disturbance vector. In section 5.1 we introduce a "sensitivity matrix" that we calculate by augmenting the ODE model for the circuit with a matrix for calculating its small-signal response. In addition to enabling the non-linear optimization, the sensitivity matrix allows us to identify propagating modes of small signal disturbances, which provides insight into the robustness properties of each of the circuits that we analyze.

Figure 5.1: Noise-Margin Measurement Circuit

5.1 Noise-Margin Analysis

Figure 5.1 shows our configuration for measuring noise margins. We apply an input waveform, $V_{in}$, to a reference chain (the lower chain in the figure), and we apply a disturbed version to the upper chain. The input disturbance, $\Delta V_{in}$, models noise. Both signals propagate through one or more gates. The difference between the output of the reference chain and the disturbed chain is the output disturbance, $\Delta V_{out}$. We include one more buffer or inverter at the output of each chain to account for output loads in real circuits. Likewise, we obtain a realistic input waveform for $V_{in}$ by propagating a pulse through a chain of buffers or inverters until the pulse shape reaches an equilibrium.

We formulate noise-margin analysis as the optimization problem:

$$\min_{\Delta V_{in}} \|\Delta V_{in}\| \quad \text{subject to} \quad \|\Delta V_{out}\| \geq \|\Delta V_{in}\|, \quad \|\Delta V_{in}\| > 0 \qquad (5.1)$$

We divide the time interval over which $\|\Delta V_{in}\|$ and $\|\Delta V_{out}\|$ are calculated into n intervals and model $\Delta V_{in}$ as a stair step function on these intervals. This allows us to represent $\Delta V_{in}$ with a vector when using numerical optimization techniques.
By choosing $n$ to be large enough, these stair-step functions can closely approximate arbitrary functions. We note that CMOS logic gates act like low-pass filters and are thus relatively insensitive to the sharp edges of the steps. For simplicity, we use equal-size time intervals.

We note that the framework from equation 5.1 is quite general. Different metrics can be used to reflect different noise models, and constraints can be added to reflect bounds on the maximum instantaneous magnitude of the noise, etc. In the results presented in this thesis, we use an $l_2$ (i.e. RMS) metric to calculate the magnitude of disturbances, and we consider noises such that the instantaneous value of the disturbed input is between the power supply voltage and ground.

We use standard numerical optimization techniques to solve the system from equation 5.1. This presents us with two challenges:

1. How do we know that the optimization procedure finds the global optimum?

2. How can we calculate the gradient of $\|\Delta V_{out}\|$ with respect to $\Delta V_{in}$ efficiently and with sufficient accuracy?

Presently, we address the first issue by starting the optimizer from several different initial conditions. This does not provide a guarantee that we have found the global optimum; however, if the optimizer consistently finds the same optimum, it suggests that the search space may be reasonably smooth.

The second issue is the motivation for calculating a sensitivity matrix. At each step, the numerical optimizer estimates the gradient of $\|\Delta V_{out}\|$ with respect to the components of $\Delta V_{in}$. The optimizer uses this gradient to guide its search for the optimum. Estimating the gradient naively by making a small change to $\Delta V_{in}$ and observing the change to $\Delta V_{out}$ is unacceptable in practice because such a calculation takes the small difference of values that already have significant errors from the integration.
Thus, the direct method is slow and extremely inaccurate. Instead, we define a small-signal sensitivity matrix $S$ with

$$S(j,i) = \frac{\partial \Delta V_{out,j}}{\partial \Delta V_{in,i}} \tag{5.2}$$

where $i$ and $j$ are indices over the time steps of the disturbance vectors. The gradient of $\|\Delta V_{out}\|$ is easily obtained from $\Delta V_{out}$ and $S$:

$$\nabla \|\Delta V_{out}\| = 2 S^T \Delta V_{out} \tag{5.3}$$

where $\|\Delta V_{out}\| = \Delta V_{out}^T \Delta V_{out}$.

We obtain $S$ by calculating the small-signal response of the circuit. The sensitivity matrix, $S$, captures the response of the circuit to small disturbances. If all of the eigenvalues of $S$ have magnitudes that are less than one, then the circuit's behavior is stable in the presence of small disturbances. The eigenvectors of $S$ are the propagating modes for small disturbances. If the corresponding eigenvalue has a magnitude less than one, then this mode dies out in a long chain. We note that most logic gates have at least one eigenvalue that is precisely one: the corresponding eigenvector corresponds to the time derivative of an input event, so the disturbance causes a time shift of the event. In other words, small noises can effect time shifts that must be accounted for in the timing margins for the circuit. For designs such as self-resetting domino, such disturbances can reduce the time overlap of input pulses and cause the circuit to fail. For other designs this can lead to a failure of set-up or hold requirements for registers. The sensitivity matrix also enables the use of numerical optimization to analyze the large-signal behavior and robustness of the circuit.

5.1.1 Calculating the Sensitivity Matrix

The remainder of this section presents our procedure for calculating the sensitivity matrix, $S$. Let $V(t)$ be the voltage vector giving the state of the circuit at time $t$. We define the vector $\gamma_{j,i}$ as the derivative of $V(t_j)$ with respect to the value of the stair step for $\Delta V_{in}$ at time $t_i$.
We then have

$$S(j,i) = \gamma_{j,i}(out) \tag{5.4}$$

where $out$ is the index of the output node of the chain. To calculate $\gamma$ we use the small-signal response of the circuit. Let

$$A_{t_j,t_i}(q,p) = \frac{\partial V_{t_j}(q)}{\partial V_{t_i}(p)} \tag{5.5}$$

where $q$ and $p$ are nodes of the circuit. The matrix $A_{t_j,t_i}$ describes how the state of the circuit is altered at time $t_j$ in response to a small perturbation at time $t_i$.

We calculate $A$ by augmenting the differential equation model for the circuit. If $\dot{V} = f(V, V_{in})$, we integrate the system

$$\begin{aligned} V(t) &= V_0 + \int_{t_0}^{t} f(V(u), V_{in}(u))\,du \\ A_{t,t_i} &= I + \int_{t_i}^{t} \mathrm{Jac}(f, V(u))\, A_{u,t_i}\,du \end{aligned} \tag{5.6}$$

where we write $\mathrm{Jac}(f, V(u))$ for the Jacobian of $f$ at $V(u)$. If the circuit has $m$ nodes, adding the calculation of $A$ changes an $m$-variable ODE into one with $m(m+1)$ variables.

Perturbing $V_{in}$ from time $t_i$ to time $t_{i+1}$ affects the voltages on nodes other than $in$. Because we are considering each step of the stair step separately, the calculation of $\gamma_{j,i}$ resets the voltage on node $V_{in}$ to its undisturbed value at time $t_{i+1}$. These observations yield:

$$\begin{aligned} \gamma_{j,i}(p) &= 0, && \text{if } j \le i \\ \gamma_{i+1,i}(p) &= A_{t_{i+1},t_i}(p, in), && \text{if } p \ne in \\ \gamma_{i+1,i}(in) &= 0 \\ \gamma_{j,i} &= A_{t_j,t_{i+1}}\, \gamma_{i+1,i}, && \text{if } j > i+1 \end{aligned} \tag{5.7}$$

A brute-force implementation of equation 5.7 requires $O(n^2)$ different $A$ matrices, where $n$ is the number of time points in the analysis. Noting that $A_{t_k,t_i} = A_{t_k,t_j} A_{t_j,t_i}$, we rewrite the last line of equation 5.7 to get

$$\gamma_{j,i} = A_{t_j,t_{j-1}}\, \gamma_{j-1,i}, \quad \text{if } j > i+1 \tag{5.8}$$

Using this formulation, the entries of $S$ can be calculated with a single integration of the augmented model and $n - 1$ matrix-vector multiplications. The $A$ matrices are $m \times m$, where $m$ is the number of nodes in the circuit. For noise-margin analysis, we typically use models with one or a few gates. Thus, $m$ is small and the time for calculating $S$ is acceptable.
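The recurrence of equation 5.8 and the gradient formula of equation 5.3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Matlab implementation used for the results: the per-step matrices $A_{t_j,t_{j-1}}$ and the initial vectors $\gamma_{i+1,i}$ are assumed to be already available from the integration of the augmented model, and the names `sensitivity_matrix`, `gradient`, `step_matrices` and `injections` are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply an m x m matrix (list of rows) by a length-m vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def sensitivity_matrix(step_matrices, injections, out):
    """Assemble S with the recurrence gamma_{j,i} = A_{t_j,t_{j-1}} gamma_{j-1,i}.

    step_matrices[j] stands for A_{t_j, t_{j-1}} (index 0 is unused),
    injections[i] stands for gamma_{i+1, i}, and out is the output node index.
    Row j-1 of the result holds S(j, i) for j = 1..n; columns are i = 0..n-1.
    Entries with j <= i stay zero, as in the first line of equation 5.7.
    """
    n = len(injections)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        gamma = list(injections[i])            # gamma_{i+1, i}
        S[i][i] = gamma[out]
        for j in range(i + 2, n + 1):          # advance one time step at a time
            gamma = mat_vec(step_matrices[j], gamma)
            S[j - 1][i] = gamma[out]
    return S


def gradient(S, dv_out):
    """Gradient of dv_out^T dv_out with respect to dv_in, i.e. 2 S^T dv_out."""
    n = len(S)
    return [2.0 * sum(S[j][i] * dv_out[j] for j in range(n)) for i in range(n)]
```

Only the single-step matrices are stored, so the memory required grows linearly with the number of time points rather than quadratically.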
5.2 Circuits

We demonstrate our method for noise-margin analysis by applying it to four circuit design styles: static CMOS, self-resetting domino, output prediction logic and preswitching surfing gates. In the future, we will also apply this method to charge-sharing surfing gates. We use an inverter or buffer as our example in each case. Like other methods for noise-margin analysis, our approach extends directly to more complex gates simply by considering the case when the input under analysis is the enabling input for an output event.

Figure 5.2: A Static CMOS Inverter

Figure 5.2 shows a static CMOS inverter. We include static designs to provide a baseline for comparison. The results that we present in section 5.3 are based on the TSMC 0.18µm bulk CMOS process. All transistors in our designs have a gate length of 0.18µm. Gate widths are 0.18µm times the multiplier factor indicated in the schematic. The other process-related parameters are summarized in table 5.1. To make the comparison as fair as possible, we use equivalent transistor sizing for each of the gates. When gates in two different logic styles have transistors that perform the same function, we make those transistors the same size. Of course, each design has some devices which are unique to that style, such as the pull-up NMOS in the preswitching surfing gate. We size those transistors to the best of our ability to be reasonable given the rest of the circuit design. In a chain, we use a gate to drive a copy of itself. For simplicity, we use a simple, first-order transistor model as presented in figure 5.3 to calculate all drain-to-source currents. As observed in section 3.3, this is a reasonable approximation when $V_{DD} = 1.8$V.

$V_{th}$: threshold voltage
$k$: proportionality factor
$S$: shape factor, $W/L$
$v_{ge} = v_{gs} - V_{th}$

$$i_{ds} = \begin{cases} 0, & \text{if } v_{ge} \le 0 \\ \frac{kS}{2} v_{ge}^2, & \text{if } 0 < v_{ge} < v_{ds} \\ kS\, v_{ds} \left(v_{ge} - \frac{1}{2} v_{ds}\right), & \text{if } v_{ds} < v_{ge} \end{cases}$$
Figure 5.3: A First-Order Transistor Model [19]

Table 5.1: Parameters Used in Noise-Margin Analysis

Name                                               Value
NMOS process transconductance parameter $k_n$      270 µA/V$^2$
PMOS process transconductance parameter $k_p$      -90 µA/V$^2$
supply voltage $V_{DD}$                            1.8 V
NMOS threshold voltage $V_{th,n}$                  0.4 V
PMOS threshold voltage $V_{th,p}$                  -0.4 V
gate capacitance coefficient $C_g$                 0.97 F/m$^2$
drain capacitance coefficient $C_d$                0.003 F/m

Likewise, we assume that all capacitances are fixed capacitances to ground and use the relationship $i_c = C\dot{v}$. Satisfying Kirchhoff's current law at each node yields:

$$\dot{v} = -C^{-1} i_{ds}(v) \tag{5.9}$$

This gives us the ODE model for our circuit. In principle, our approach could be extended to more realistic, industrial-strength models for semiconductor circuits (e.g. BSIM models [19]). Explicit calculation of the Jacobian operator would be much more tedious due to the extra complexity of the model, but we see no fundamental reason why our approach could not be applied to more realistic models.

We applied our noise-margin approach to a preswitching buffer, as shown in figure 5.4. For comparison, we also applied this approach to a self-resetting domino buffer, as described in figure 5.5. OPL gates, as introduced in chapter 2, also demonstrate the timing stability of surfing. We include OPL in the noise-margin analysis to compare the strength of its event attractor with surfing gates. We use an OPL inverter as shown in figure 5.6.

Figure 5.4: A Chain of Surfing Buffers

Figure 5.5: A Self-Resetting Domino Buffer

Figure 5.6: Two Output Prediction Logic Inverters

5.3 Results

We implemented the noise-margin calculation described in the previous section using Matlab and applied it to the circuits described in section 5.2.
We present results for both small- and large-signal analysis.

5.3.1 Small Signal Stability

The sensitivity matrix, $S$, characterizes the behavior of the circuits under small disturbances. As described in section 5.1, we determine the equilibrium shape for pulses by using a long initial chain and calculate the sensitivity matrix for a pulse of this shape. The eigenvalues and eigenvectors of this matrix characterize the small-signal sensitivity and stability of the circuit. In particular, the eigenvectors are the small-signal, propagating modes for the circuit, and the corresponding eigenvalues tell whether the mode grows or dies out in a chain. If the eigenvalue has a magnitude greater than one, then the mode grows as the disturbance propagates through a chain. Conversely, if the eigenvalue has a magnitude less than one, the mode dies out. If all eigenvalues have magnitudes less than one, then the circuit is stable in the presence of small-signal disturbances. In particular, the largest-magnitude eigenvalue provides a quantitative measure of the small-signal stability of the circuit.

Figure 5.7: Eigenvectors for a Static Buffer

We took chains of static, OPL, self-resetting domino, and preswitching surfing buffers as our examples. The stage-to-stage delays are set to 20ps and 63ps for the OPL and surfing buffer chains respectively. We choose an interval for calculation of the $S$ matrix which contains all of the input and output transitions. We set the time step to 2ps and computed the $S$ matrix as described in section 5.1.1.

Figure 5.8: Eigenvectors for a Self-Resetting Domino Gate

First, we applied our method to a buffer chain built from static inverters.
Figure 5.7 shows the eigenvectors corresponding to the two largest eigenvalues and their corresponding disturbed inputs. The eigenvectors match the time derivative of the rising and falling edges of the input within 2%. The two largest eigenvalues are 0.9972 and 0.9934. We attribute the difference between these values and the predicted value of 1 to artifacts of our time quantization. The next four eigenvalues are 0.0026, 0.0025, $1.25 \times 10^{-5}$, and $1.24 \times 10^{-5}$. These are all real and positive as expected. We note that due to the limited precision of the device models and the integrator, the smallest two of these values are probably not very significant.

For the self-resetting domino gate, the maximum eigenvalue, 0.9907, is also very close to 1 (see figure 5.8). Again, this corresponds to a timing shift of the input pulse. Because the gate is self-resetting, advancing the rising edge also advances the falling edge, and the corresponding eigenvector tracks the time derivative of both edges. The second largest eigenvalue is 0.0404, and its mode affects only the falling edge. We note that the arrival time of the falling edge of the input has a small influence on the falling edge of the output due to short-circuit current while precharging node p.

Figures 5.9 and 5.10 show the largest eigenvalues and their eigenvectors of the sensitivity matrix for an OPL inverter with stage delay equal to 20ps. In both cases, the largest eigenvalue is less than one, showing that the timing of output events is determined by the clock signal to the gate as well as the arrival time of the inputs. In fact, if an input arrives a little before or after its nominal time, the output will be closer to the target value. Thus, small timing disturbances will disappear as the event propagates through several stages.
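To make the interpretation of the eigenvalues concrete, the following Python sketch counts how many stages a propagating mode needs before its amplitude falls below a chosen fraction of its initial value. It assumes the linear, small-signal picture in which a mode with eigenvalue $\lambda$ is scaled by $|\lambda|$ at every stage; the function name `stages_to_attenuate` is hypothetical and the sketch is not part of the analysis tool described in this chapter.

```python
def stages_to_attenuate(eigenvalue, fraction):
    """Stages needed for a mode to shrink to at most `fraction` of its start size.

    Returns None when the eigenvalue magnitude is one or more, because such a
    mode does not die out in the small-signal model.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    magnitude = abs(eigenvalue)
    if magnitude >= 1.0:
        return None
    amplitude, stages = 1.0, 0
    while amplitude > fraction:
        amplitude *= magnitude   # one more stage of the chain
        stages += 1
    return stages
```

For example, comparing `stages_to_attenuate(0.9972, 0.01)` with `stages_to_attenuate(0.9907, 0.01)` shows that eigenvalues this close to one leave a time shift essentially intact over any chain of practical length, whereas an eigenvalue well below one removes it within a few stages.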
Figure 5.10: Eigenvector for an OPL Buffer with Input = 1

Analyzing the surfing chain from figure 5.4, the five largest-magnitude eigenvalues are: 0.5096, 0.2384, 0.0030, and $(1.5324 \pm 7.9406i) \times 10^{-5}$. Again, we do not regard the last pair of eigenvalues as being significant given the precision of the integrator. In fact, our circuit models are monotonic, and we expect the sensitivity matrix, $S$, to be positive definite. We believe that the complex values for the small eigenvalues are most likely a side-effect of our approximation of continuous time with a set of discrete samples or of integration errors.

Figure 5.11 shows the eigenvectors for the three largest eigenvalues for the surfing chain. The eigenvector corresponding to the largest eigenvalue, 0.5096, corresponds to a time shift of the pulse. The left, negative peak shifts the rising edge of the pulse later, and the right, positive peak shifts the falling edge later. Because the gate is self-resetting, delaying the rising edge also delays the falling edge. Thus, the positive peak of the eigenvector has a bigger magnitude than the negative peak. The eigenvector for the second largest eigenvalue, 0.2384, shifts the falling edge without disturbing the rising edge. The eigenvector for the third largest eigenvalue, 0.0030, shifts just the rising edge. All of the eigenvalues for the surfing gate have magnitudes that are significantly less than one. This shows the stability of the surfing gate. Intuitively, any small disturbance will decrease by at least a factor of nearly two from one stage to the next in the chain.

Figure 5.11: Eigenvectors of the Sensitivity Matrix for the Surfing Chain

Table 5.2 shows the largest eigenvalue for each type of gate.
The main disturbance modes for all of these gates can be interpreted as time shifts of the input events. The static logic and self-resetting logic will propagate the time shift along the gates. However, OPL and surfing gates show strong timing stability, with all the eigenvalues significantly less than 1.

Table 5.2: Robustness of Different Logics

Gate Style                                     Largest Eigenvalue
static inverter                                0.9972
self-resetting domino buffer                   0.9907
OPL buffer with input = 0                      0.2413
OPL buffer with input = 1                      0.2391
preswitching surfing buffer with input = 1     0.507

5.3.2 Noise Margin as Design Aid

Next, we explored the use of our method as a design aid. Figure 5.12 shows how OPL allows a designer to trade timing stability for performance: as the target stage delay decreases, the largest eigenvalue of the sensitivity matrix approaches one and the timing stability decreases. For target delays where this eigenvalue exceeds one, the circuit fails. When the stage delay increases beyond around 30ps, the OPL chain enters the clock blocking mode, where the throughput of the chain is limited by the clocks. Early arrival of the input does not affect the time of the output event. In clock blocking mode, the OPL gate can reduce the delay for a late input arrival. The timing stability increases as the target stage delay increases. That is why the maximum eigenvalue approaches zero with increasing stage delay.

For surfing pipelines, performance is determined by the delay of the timing chain. The delay per stage, $\delta_{fast}$, can be any value between $\delta_{f,\min,i}$ and $\delta_{f,\max,i}$ (see equation 2.2); however, robustness is lost for extreme values in this range. First, we studied the trade-off between stage delay and the robustness of the surfing design as measured by the largest eigenvalue of the sensitivity matrix, $S$.
Figure 5.13 shows this trade-off. Prior to this work, we used the midpoint of the minimum and maximum delay as the target value for the stage delay. This plot shows that the maximum small-signal robustness actually occurs at a slightly larger delay.

Figure 5.12: Largest Eigenvalue versus Stage Delay of OPL Buffer with Input = 1

Figure 5.14 shows the effect of varying the width of the n-channel pull-up transistor in the surfing gate. The plot shows the maximum delay of the gate (with fast low), the minimum delay of the gate (with fast high), and the value of the largest eigenvalue of the sensitivity matrix. Not surprisingly, the minimum delay decreases as the pull-up becomes stronger. There is a slight growth in the maximum delay due to the extra drain capacitance of the larger transistor. When the pull-up is eliminated (width = 0), there is still a small surfing effect contributed by the keeper transistor controlled by fast. Accordingly, the largest eigenvalue is slightly less than 1, namely 0.913. As the pull-up is made stronger, this eigenvalue decreases, quantifying the increase in the strength of the surfing effect.

Figure 5.13: The Effects of Varying the Stage Delay

5.3.3 Large Signal Stability

The previous results were based on a small-signal analysis for different logic styles. However, digital circuits are highly non-linear and the main concern of practicing designers is robustness in the non-linear, large-signal domain. We now apply our non-linear optimization formulation for noise-margin analysis to the four classes of circuits from section 5.2. In particular, we look for the smallest input disturbance waveform that creates an output disturbance which is at least as large.
As noted in section 5.1, when the circuit is analyzed with an input transition, this condition is satisfied by disturbances that are proportional to the time derivative of the input signal; in other words, they shift the time of input events. In this section, we focus on the other case: the response to noise in the absence of an input transition. In this case, the noise causes a spurious output event. Figure 5.15 and figure 5.16 show the noise margin for the four different logic styles as a function of the length of the chain.

Figure 5.14: The Effects of Varying the Width of the N-channel Pull-Up

For the domino gate, we apply the disturbance at the input node as for the other gates; however, we compare the voltages at the internal nodes of the second domino gate and the final gate. We did the same for the surfing gates with input equal to 0. We found this necessary to obtain convergence of the numerical optimization with the domino gate model. For a domino gate, the falling edge of the output node is controlled by the arrival of the precharge signal, not by an input disturbance. Thus it is very difficult to satisfy the first constraint of the optimization problem from equation 5.1.

For the static buffer, the strength of the p-channel device is 2/3 that of the n-channel device. This explains why the dynamic noise margin is lower when the input is 0 than when the input is 1: the gate is more susceptible to input disturbances when the output is driven by the weaker p-channel device. The noise margin of the self-resetting domino logic is 36% lower than that of the static design, showing the trade-off of robustness and performance between these two logic families.

Figure 5.15: Noise-Margin Estimates with Input = 0

When there is no input event, the noise margin of the surfing buffer is lower than that of the self-resetting buffer chain. We stopped the analysis with 2 surfing gates in the chain. With more than 3 gates, the optimization fails to converge because it meets a very steep cliff. We multiplied the disturbance obtained for 2 gates by 1.01 and noted that this noise causes a functional failure of the chain. However, when we multiplied this disturbance by 0.99, the chain operated correctly. Thus we use the noise margin for 2 gates as an approximate noise margin for longer chains. The smaller noise margin for surfing gates when compared with static or self-resetting domino designs has two causes:

• the keeper does not perform as a keeper when fast is 1.

• the pull-up NMOS device facilitates noise propagation when fast is 1.

However, these are necessary for surfing to occur. The noise margin with an input event clearly demonstrates the timing stability of surfing, which increases as the number of gates in the chain increases.

Figure 5.16: Noise-Margin Estimates with Input = 1

The OPL chain has the smallest noise margin of all the gates, though it is much faster than the others. It demonstrates the same type of timing stability as exhibited by surfing gates.
However, the capture interval for an OPL gate is quite small.

Figure 5.17 plots the delay of an OPL inverter with the assumption that the clock's period is infinite. The dotted curve is the delay with an input equal to 0. If the input goes to 0 too early, the OPL inverter works like a domino gate such that the delay decreases with a slope equal to -1. The later an input transition from 1 to 0 arrives, the deeper the voltage dip on the output side is. This causes the delay to increase to a constant value. The dashed curve plots the delay for an input equal to 1. If there is no voltage dip on the input side, the delay is a constant determined by the sizes of the MOS devices in the OPL gates. If the voltage dip is small enough, this dip only affects the arrival time of the input but not the time of the output. Thus the delay from data input to data output decreases. However, if the voltage dip is deep enough, the delay increases to a constant value. The solid curve is the delay for an OPL buffer. The buffer delay is the sum of the previous two cases.

Figure 5.18 shows the curves of delay versus arrival time of inputs for a preswitching buffer and an OPL buffer. In the interval $[t_3, t_5]$, the OPL chain will not work correctly because the input events cannot catch up with the clock signals. Eventually, the input event will fall into the interval $[t_4, t_5]$ where an input event cannot trigger a corresponding output event and the delay goes to infinity. This also happens in the interval $[t_0, t_1]$. The capture interval for the OPL buffer is much smaller than the capture interval $[t_1, t_4]$ of the preswitching buffer. This explains why the OPL chain has a smaller noise margin. Recall that the sensitivity matrix for OPL has smaller eigenvalues than those for preswitching surfing. This example shows that small-signal stability does not necessarily translate into large-signal robustness.
For the static buffer chains, OPL, and the surfing buffer chain, the noise margin increases with the length of the chain. This is because finding minimum input and output disturbances that have the same magnitude does not imply that they have the same shape. Thus, a disturbance that will pass through one stage will not necessarily propagate through a long chain. This effect is especially pronounced for the surfing chain. In fact, we note that, with an input event, the surfing chain appears to have almost no noise margin when a single stage is considered, but is much more robust when a chain of more stages is analyzed.

Figure 5.17: The Delay of an OPL Inverter

We ignore the noise-margin curve for the self-resetting domino gate with input = 1. Because self-resetting domino represents a "1" value with a pulse, the optimizer finds the disturbance that shifts the time of the rising input event. This affects the time of both the rising and falling edges of the output and trivially satisfies the optimization condition. The designer would like to know "What is the smallest disturbance that suppresses an output event that should have occurred?" As just noted, our analysis does not provide an answer to this question. In future work, we plan to examine modifications to our approach that will address this limitation. When the input is equal to 0, the noise margin grows very slowly with the number of gates in the chain. This is also a consequence of the pulse-signalling of the self-resetting design. The optimizer finds a small input pulse that triggers an output pulse. The output pulse propagates through the remaining stages. Of course, this is exactly what the designer wants to know: "What is the smallest disturbance that triggers a spurious output event?"

Figure 5.18: The Delay with Respect to Arrival Time
We note that the Matlab optimization function fmincon failed for all four design styles when we attempted the optimization without the sensitivity matrix: the gradient estimates that it obtained by subtracting integration runs with nearby inputs had too much error to allow a productive search. By using the sensitivity matrix $S$ in the optimization, we obtained successful convergence in relatively short times (see figure 5.19). We ran the optimization on a 3 GHz Pentium IV processor with 2 GB of main memory. For the static CMOS and OPL designs we started with a chain consisting of a single gate, and then used the solution from a chain of length $k$ as the initial point for a chain of length $k + 1$. For self-resetting domino, we started with a chain of length 15 and then progressively removed stages. For the surfing design, we started with a chain of length 5 and worked in both directions. In all cases, the initial solution took a few hours. Although the computation time varies significantly depending on the number of gates in the chain, the incremental computation time for adding or deleting a gate is quite similar for the OPL, static, and surfing logic gates. For the OPL gate, the computation time increases dramatically for chains with more than 8 gates. This is because the optimization reaches the edge of the capture interval for surfing and a small change of input signal will cause a huge change to the output. This causes the optimizer to spend more time satisfying the constraints. For more than 12 gates, the computation time decreases. Currently, we do not know why this happens. For the self-resetting logic, the incremental computation time is relatively small because it approaches the asymptotic scenario very quickly.

5.4 Summary

We formulated noise-margin analysis as a non-linear optimization problem.
By calculating a small-signal sensitivity matrix as a part of integrating the circuit model, we obtain accurate gradient estimates that enable these optimization problems to be solved efficiently. The small-signal model also shows the linkage of timing margins and noise margins in all four logic styles that we considered.

We implemented a proof-of-concept tool and used it to analyze static CMOS, self-resetting domino, preswitching self-resetting domino and output prediction logic (OPL). Our implementation uses a simple, first-order transistor model. We expect that using a more sophisticated model will affect the quantitative details of the analysis, but that the basic approach and qualitative results should remain the same. This is a possible topic for future work.

Figure 5.19: Computation Time for Different Logics

As expected, static designs have the highest noise margin. OPL running at a moderate speed has the lowest dynamic noise margin, but has a form of small-signal timing stability not present in the static and self-resetting domino designs. OPL and preswitching logic demonstrate the timing stability of surfing, with the eigenvalues of their sensitivity matrices less than 1. When the input is low, the preswitching logic shows a lower noise margin than self-resetting domino logic because of the output shift introduced by the fast signal. Compared to OPL, the preswitching logic has a greater dynamic noise margin because of the larger capture interval with preswitching. As we mentioned in chapter 1, designers can trade noise margin against power consumption and delay. For a fair comparison between circuits of different styles, all three of these figures of merit should be taken into account.
Though OPL has the lowest noise margin, it has the smallest delay. It would be more reasonable to compare the noise margins of OPL and preswitching surfing gates running at the same speed. In the future, we will continue working on this. Currently, our approach cannot be applied to domino gates where the falling edge of the output voltage is controlled by the clock. In future work, we plan to examine modifications to our approach that will address this problem. We also plan to include charge-sharing surfing gates and other dynamic logic circuits such as DyCML in our analysis.

Chapter 6

Conclusions and Future Work

We have presented a working surfing chip, new surfing circuits that greatly reduce the power penalties associated with surfing, and a novel noise-margin analysis approach. These contributions bring surfing closer to being a practical design approach.

Our chip implements a simple, pseudo-random sequence generator. It supports two independent waves of computation in a 12-stage ring without any latches or other storage elements. We have operated the ring for over 48 hours and $2.6 \times 10^{15}$ surfing transfers of data between stages without error. The chip operates correctly over a wide range of power supply voltages. This chip demonstrates that surfing pipelines are possible in the real world.

The chip also demonstrates the main downside of the "preswitching" approach to surfing: power consumption. We addressed power consumption by introducing a new family of surfing circuits that avoids the short-circuit currents of preswitching. This new technique uses charge sharing to move the output and internal node of a dynamic gate away from the power supply rails to accelerate the generation of an eventual output event. We call this "charge-sharing surfing".
Unlike preswitching surfing, whose delay depends on the ratio of the pull-up NMOS and pull-down NMOS in the inverter, the delay of a charge-sharing surfing gate depends on the ratio of the internal and load capacitances. We will continue our research in this area in the future.

We presented a simulation study wherein we implemented a Brent-Kung carry lookahead adder with domino logic and with our new surfing approach. The surfing adder is 19% faster and uses 11.7% less energy than the domino design. The surfing version outperforms the domino design by a factor of 1.75 according to the Et^2 metric. We anticipate that focusing on accelerating critical paths will enable further improvements in energy and speed. This should allow designers to use surfing as a general design method without the restriction to regular structures that applies to other wave pipelining approaches. With the event attractors created by surfing circuits, we can accurately predict the interval in which input events will arrive. Thus we can enable high-speed operation of a gate only in this interval. With this knowledge, we expect to employ surfing to reduce leakage current, which becomes more and more important as technology scales down [20].

Other techniques are also available to generate the surfing effect. As shown in chapter 5, OPL gates have surfing capability. However, as also described in chapter 5, the small capture interval of OPL gates results in a small noise margin. OPL treats rising and falling edges differently. Falling output events are accelerated by the surfing effect arising from the initial output dip of the gate. This same dip retards rising output events. For OPL, surfing arises from the sum of the delays of a rising and a falling event from two consecutive stages. The narrowness of the output dip results in a narrow capture interval. At this point, we have not identified any solution to overcome this limitation.
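As a rough check of the adder comparison quoted earlier in this chapter, the following sketch recomputes the improvement from the two rounded percentages. It assumes that the figure of merit is the energy-delay-squared product (Et^2) of [26] and that "19% faster" means a 19% shorter delay; both are assumptions made for illustration, and the function name `et2` is hypothetical.

```python
def et2(energy, delay):
    """Energy times delay squared; a smaller value is better."""
    return energy * delay ** 2

# Normalize the domino adder to energy = 1 and delay = 1.
domino = et2(1.0, 1.0)

# Assumptions: 11.7% less energy, and "19% faster" read as 19% shorter delay.
surfing = et2(1.0 - 0.117, 1.0 - 0.19)

# Ratio of the two figures of merit (how many times better surfing is).
improvement = domino / surfing
print(f"Et^2 improvement: {improvement:.2f}x")
```

Under these assumptions the ratio comes out at roughly 1.73, which is close to the reported factor of 1.75; the small gap may simply reflect rounding in the quoted percentages.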
Kursan and Friedman's variable threshold voltage keeper technique [23] also has surfing potential. The delay of a domino gate is affected by the strength of the keeper [22]; thus, it may be possible to modulate the n-well potential for the keeper to induce surfing. Figure 6.1 shows our proposed adaptation of Kursan and Friedman's technique for a surfing buffer. We use the same body bias voltage generator circuit as in figure 2.10 to generate the body bias voltage. The inverter connected to the body bias voltage generator circuit is placed there to be consistent with the previous surfing circuits in chapters 3 and 4, such that the gate's delay is smaller when fast is high than when fast is low. A wide fast pulse, as in scheme (a) of figure 6.2, produces a weak keeper and low noise margin most of the time. A wide fast pulse also makes the capture interval when the fast signal is low smaller than that when the fast signal is high. As mentioned in chapter 5, an asymmetric capture interval may decrease the dynamic noise margin. Thus, a wide fast pulse is not preferred. A narrow fast signal will make the keeper strong and the noise margin high most of the time. Whether a narrow fast signal or a symmetric fast signal is preferable depends on which kind of noise the circuit is more sensitive to: noise causing a timing shift or noise causing a voltage shift. Furthermore, the capacitance between the well and the substrate and the well resistance are large enough in typical processes that it is not clear whether this approach can be used at high operating frequencies.

Dynamically adjusting the power rails is another way to obtain varying transition delay. It may be possible to achieve surfing by using a virtual ground as proposed in Dynamic Current Mode Logic (DyCML) [1]. In the following, we take the DyCML circuit shown in figure 2.14 as an example.
[Figure 6.1: A Self-Resetting Buffer with Variable Threshold Voltage Keeper]

[Figure 6.2: Variable Threshold Voltage Keeper with fast Signals of Different Pulse Width. Schemes (a) and (b) show the fast signal, the body bias, and the resulting keeper strength: a high Vth gives a weak keeper and a low Vth gives a strong keeper.]

When an input event comes early, the outputs remain unchanged because the clock signal is still low. Conversely, if input events arrive late, then out.T and out.F converge towards an intermediate voltage. This situation continues until the inputs arrive. The later the inputs arrive, the lower the intermediate voltage is. At first, the data delay decreases with a lower intermediate voltage because the high-to-low transition (normally the slower one) is accelerated. If the input is sufficiently late, then the low-to-high transition becomes the critical one, and further delay of the inputs increases the gate delay. Thus, we expect a delay vs. input arrival profile similar to that of OPL. Furthermore, it should be noted that delaying the input events increases the amount of charge transferred to node w before evaluation completes. This results in a higher final level for the low-going output. This effect can be reduced by increasing the capacitance on node w by using a large transistor for m8. Extensive simulations would be needed to assess the suitability of DyCML or similar designs for use as a surfing logic family.

We presented a method for analyzing the robustness of surfing circuits.
Surfing creates event attractors whereby events in the data path are attracted to a fixed relationship to events in the timing path. We analyze these attractors by constructing the small-signal response of the circuit and by using numerical optimization techniques to study the large-signal stability. This analysis shows that surfing circuits are very robust (i.e. highly damped) with respect to disturbances of the input. We presented preliminary results showing how this approach can be applied to the design of surfing pipelines and to noise-margin analysis. Compared with other approaches, our noise-margin analysis technique can verify the timing stability of surfing circuits and demonstrate trade-offs between performance and noise margin.

Our noise-margin analysis is based on finding the smallest input disturbance that produces an output of the same magnitude by the l_2 metric. Assuming that the optimizer finds the global optimum, this is a condition that ensures that any smaller disturbance will not propagate through a chain of logic gates. As noted in chapter 5, for many logic families, this condition is satisfied by a disturbance that is proportional to the time derivative of the input signal; in other words, the disturbance effects a time shift of the input. This is both a feature and a limitation of our approach. On the one hand, this correspondence of critical disturbances and time shifts shows the connection between noise margins and timing margins. In the future, we hope to further develop this connection to find a physically motivated quality metric that combines energy, delay, and robustness. On the other hand, this correspondence has prevented us from making meaningful measures of the robustness of some circuits. If a small input disturbance can cause a time shift of the output that has the same magnitude, then we conclude that the noise margin is very small.
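The noise-margin criterion described above (the smallest input disturbance whose output disturbance is at least as large) can be illustrated with a toy sketch. The gain function below is a made-up scalar model chosen only so that small disturbances are attenuated and large ones are amplified. It is not the circuit model or the optimizer used in this work, and the names `toy_gate_gain`, `noise_margin` and `propagate` are hypothetical.

```python
def toy_gate_gain(d, margin=0.3):
    """Hypothetical gate: output disturbance magnitude for input magnitude d.

    Attenuates disturbances below `margin`, amplifies those above it,
    and saturates at 1.0 (a full-swing error).
    """
    return min(1.0, d * d / margin)

def noise_margin(gain, lo=1e-6, hi=1.0, iters=60):
    """Bisect for the smallest d with gain(d) >= d.

    Assumes gain(lo) < lo, gain(hi) >= hi, and a single crossing between.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gain(mid) >= mid:
            hi = mid
        else:
            lo = mid
    return hi

def propagate(gain, d0, stages):
    """Push a disturbance of magnitude d0 through a chain of identical gates."""
    d = d0
    for _ in range(stages):
        d = gain(d)
    return d

margin = noise_margin(toy_gate_gain)
below = propagate(toy_gate_gain, 0.2, 5)   # starts below the margin
above = propagate(toy_gate_gain, 0.4, 5)   # starts above the margin
print(margin, below, above)
```

In this toy model, a disturbance that starts below the computed margin dies out along the chain, while one that starts above it grows to a full-swing error. A gate whose gain is exactly 1 for small time shifts would have no such decaying region, which corresponds to the very small noise margin discussed above.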
In future work, we would like to find ways of identifying other, large-signal noise modes even for circuits where time shifts propagate unattenuated through chains of gates.

By providing a physical implementation, novel circuit designs, and a new approach to noise-margin analysis, we have moved surfing pipelines much closer to being a viable approach for practical designs. Further deployment of surfing will depend mainly on developing adequate CAD tool support, especially in the areas of noise-margin and timing analysis, automated transistor sizing, and timing path synthesis. Our research has clearly identified many of the requirements for such tools and has suggested possible directions for future research to address these challenges.

Bibliography

[1] Mohamed W. Allam and Mohamed I. Elmasry. Dynamic current mode logic (DyCML): A new low-power high performance logic style. IEEE Journal of Solid-State Circuits, 36(3):550-558, March 2001.
[2] A. Alvandpour, P. Larsson-Edefors, and C. Svensson. A leakage-tolerant multiphase keeper for wide domino circuits. In Proceedings of the IEEE International Conference on Electronics, Circuits and Systems, pages 209-212, September 1999.
[3] S. F. Anderson, J. G. Earle, et al. The IBM system/360 model 91 floating point execution unit. In IBM J. Research and Development, pages 34-53, January 1967.
[4] M. Anis, M. Allam, and M. Elmasry. Impact of technology scaling on CMOS logic styles. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 49(8):577-588, August 2002.
[5] R. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Transactions on Computers, C-31(3):260-264, March 1982.
[6] Wayne P. Burleson, Maciej Ciesielski, et al. Wave-pipelining: a tutorial and research survey. IEEE Transactions on VLSI Systems, 6(3):464-474, September 1998.
[7] Terry I. Chappell, Barbara A. Chappell, et al.
A 2-ns cycle, 3.8-ns access 512kb CMOS ECL SRAM with a fully pipelined architecture. IEEE Journal of Solid-State Circuits, 26(11):1577-1585, November 1991.
[8] Ayoob E. Dooply and Kenneth Y. Yun. Optimal clocking and enhanced testability for high-performance self-resetting domino pipelines. In Proceedings of the Twentieth Anniversary Conference on Advanced Research in VLSI, pages 220-214, March 1999.
[9] Jo C. Ebergen, Scott Fairbanks, and Ivan E. Sutherland. Predicting performance of micropipelines using Charlie Diagrams. In Proceedings of the Fourth International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 238-246, April 1998.
[10] M. J. Flynn, P. Hung, and K. W. Rudd. Deep submicron microprocessor design issues. IEEE Micro, 19(4):11-22, August 1999.
[11] T. Gemmeke and T. G. Noll. A physically oriented model to quantify the dynamic noise margin. In Proceedings of the 30th European Conference on Solid-State Circuits, pages 467-470, September 2004.
[12] Solomon W. Golomb. Shift Register Sequences. Holden-Day, 1967.
[13] N. F. Goncalves and H. De Man. NORA: a racefree dynamic CMOS technique for pipelined logic structures. IEEE Journal of Solid-State Circuits, 18:261-266, June 1983.
[14] Ricardo Gonzalez, Benjamin M. Gordon, and Mark Horowitz. Supply and threshold voltage scaling for low power CMOS. IEEE Journal of Solid-State Circuits, 32(8):1210-1216, August 1997.
[15] Ricardo Gonzalez and Mark Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, 31(9):1277-1284, September 1996.
[16] David Harris and Ivan Sutherland. Logical effort of carry propagate adders. In Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 873-878. IEEE, November 2003.
[17] L. Heller, W. Griffin, J. Davis, and N. Thoma.
Cascode voltage switch logic: A differential CMOS logic family. In 1984 IEEE Int'l. Solid State Circuits Conference, pages 16-17, 1984.
[18] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz. The future of wires. In Proceedings of the IEEE, volume 89, pages 490-504, April 2001.
[19] David A. Hodges, Horace G. Jackson, et al. Analysis and design of digital integrated circuits. McGraw Hill, 2003.
[20] N. S. Kim, T. Austin, et al. Leakage current: Moore's law meets static power. Computer, 36(12):68-75, December 2003.
[21] R. H. Krambeck, C. M. Lee, and H. S. Law. High-speed compact circuits with CMOS. IEEE Journal of Solid-State Circuits, SC-17:614-619, June 1982.
[22] R. K. Krishnamurthy, A. Alvandpour, et al. A 130-nm 6-GHz 256x32 bit leakage-tolerant register file. IEEE Journal of Solid-State Circuits, 37:624-632, May 2002.
[23] Volkan Kursan and Eby G. Friedman. Domino logic with variable threshold voltage keeper. IEEE Transactions on Very Large Scale Integration Systems, 11(6):1080-1093, December 2003.
[24] P. Larsson and C. Svensson. Noise in digital dynamic CMOS circuits. IEEE Journal of Solid-State Circuits, 29(6):655-662, June 1994.
[25] Trevor W. S. Lee, Mark R. Greenstreet, and Carl-Johan Seger. Automatic verification of refinement. In Proceedings of the 1994 International Conference on Computer Design, Boston, October 1994.
[26] Alain J. Martin, Mika Nystrom, and Paul I. Penzes. Et^2: A metric for time and energy efficiency of computation. In Rami Melhem and Robert Graybill, editors, Power Aware Computing. Kluwer, 2002.
[27] Larry McMurchie, Su Kio, et al. Output prediction logic: A high-performance CMOS design technique. In Proceedings of the 2000 International Conference on Computer Design, pages 247-254, 2000.
[28] B. Moyer. Low-power design for embedded processors. In Proceedings of the IEEE, volume 89, pages 1576-1587, November 2001.
[29] M. Mudge. Power: a first-class architectural design constraint. Computer, 34(4):52-58, April 2001.
[30] K. L. Shepard and K. Chou. Cell characterization for noise stability. IEEE 2000 Custom Integrated Circuits Conference, pages 91-94, 2000.
[31] K. L. Shepard and V. Narayanan. Noise in deep submicron digital design. In Proceedings of the 1996 International Conference on Computer Aided Design, pages 406-411, 1996.
[32] D. Singh, J. M. Rabaey, et al. Power conscious CAD tools and methodologies: a perspective. In Proceedings of the IEEE, volume 83, pages 570-594, April 1995.
[33] Ivan Sutherland and Scott Fairbanks. GasP: A minimal FIFO control. In Proceedings of the Seventh International Symposium on Asynchronous Circuits and Systems, pages 46-53, April 2001.
[34] Ivan E. Sutherland, Robert F. Sproull, and David Harris. Logical Effort: Designing Fast CMOS Circuits. Morgan Kaufmann, 1999.
[35] Ted E. Williams. Analyzing and improving latency and throughput in self-timed pipelines and rings. In TAU 1992 ACM International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, Princeton, NJ, March 1992.
[36] Ted E. Williams and Mark A. Horowitz. A zero-overhead self-timed 160-ns 15-b CMOS divider. IEEE Journal of Solid-State Circuits, 26(11):1651-1661, November 1991.
[37] Anthony J. Winstanley, Aurelien Garivier, and Mark R. Greenstreet. An event spacing experiment. In Proceedings of the Eighth International Symposium on Asynchronous Circuits and Systems, pages 42-51, Manchester, UK, April 2002.
[38] Brian D. Winters and Mark R. Greenstreet. A negative-overhead, self-timed pipeline. In Proceedings of the Eighth International Symposium on Asynchronous Circuits and Systems, pages 32-41, Manchester, UK, April 2002.
[39] Brian D.
Winters and Mark R. Greenstreet. Surfing: A robust form of wave pipelining using self-timed circuit techniques. Microprocessors and Microsystems, 27(9):409-419, October 2003.
[40] Hoi-Jun Yoo. A study of pipeline architectures for high-speed synchronous DRAM's. IEEE Journal of Solid-State Circuits, 32:1597-1603, October 1997.
[41] Hoi-Jun Yoo, Kee-Woo Park, and Chang-Ho Chung. A 150MHz 8-banks 256M synchronous DRAM with wave pipelining methods. IEEE International Conference on Solid-State Circuits, pages 250-251, February 1995.
[42] Hongil Yoon, Gi-Won Cha, and Changsik Yoo. A 2.5-V, 333-Mb/s/pin, 1-Gbit, double-data-rate synchronous DRAM. IEEE Journal of Solid-State Circuits, 34:1589-1599, November 1999.
[43] V. Zolotov, D. Blaauw, et al. Noise propagation and failure criteria for VLSI designs. In IEEE/ACM International Conference on Computer Aided Design, pages 587-594, November 2002.

