UBC Theses and Dissertations


Implementation and analysis of surfing pipelines Yang, Suwen 2005


Implementation and Analysis of Surfing Pipelines
by Suwen Yang
B.E., Huazhong University of Science and Technology, 1995
M.S., University of Washington, 2001
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science)
The University of British Columbia
August 2005
© Suwen Yang, 2005

Abstract

High performance digital systems make extensive use of pipelines. Three years ago, "surfing" pipelines were proposed. A timing pulse propagates through the surfing pipeline, and the individual logic elements of the pipeline are modified so that their delays are smaller in the presence of the pulse than in its absence. This creates an "event attractor" where events in the data path occur at the rising edge of the timing pulse. These attractors reduce timing uncertainties and improve the performance of the pipeline. A circuit technique called "preswitching" was proposed to implement the delay variation required for surfing. In this thesis, we demonstrate a working surfing chip and address issues of power consumption and robustness.

We demonstrate surfing by the design, fabrication and test of a chip using preswitching surfing circuits. The surfing ring in this chip supports two independent waves of computation separated only by the surfing effect - no latches or other storage elements are used. We operated the ring for over 48 hours and $2 \times 10^{15}$ surfing events and never observed an error. The preswitching circuits in this chip exhibit unacceptable power consumption, motivating our work on energy-efficient designs.

We introduce a new family of surfing circuits based on charge-sharing, dynamic gates. The design and simulation of a carry-lookahead adder show that this technique offers very competitive performance by standard metrics.
This design also demonstrates surfing in a design with a non-uniform circuit structure.

Finally, we develop a new method for robustness analysis. We formulate noise-margin analysis as a numerical optimization problem that takes the time-varying behavior of surfing and other dynamic logic designs into account. With this approach we compare the robustness of several high-performance logic families, quantify the timing stability of surfing circuits, and demonstrate trade-offs between performance and robustness.

Our demonstration of a physical surfing chip, design of novel, low-power surfing circuits, and a new noise-margin analysis method together bring surfing closer to a practical reality. These tools and techniques will aid the development of novel logic designs to face the challenges of deep-submicron integrated circuit design.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
Dedication
1 Introduction
1.1 Surfing
1.2 Problem Statement
1.3 Thesis Statement
1.4 Thesis Organization
2 Background
2.1 Pipelining
2.2 High Speed Circuits
2.3 Circuit Evaluation
2.4 Noise Analysis
3 The Test Chip
3.1 Structure of the Test Chip
3.2 Design of the Test Chip
3.3 Test Results
3.4 Energy Overhead
3.5 Observations
3.6 Summary
4 Lower Power Surfing
4.1 Surfing with Charge Sharing
4.2 Simulation Results
4.3 Summary
5 Noise Analysis for Surfing Logic
5.1 Noise-Margin Analysis
5.1.1 Calculating the Sensitivity Matrix
5.2 Circuits
5.3 Results
5.3.1 Small Signal Stability
5.3.2 Noise Margin as Design Aid
5.3.3 Large Signal Stability
5.4 Summary
6 Conclusions and Future Work
Bibliography

List of Tables

3.1 Commands Launched by the Serial Control Register
3.2 Comparison of Preswitching with Non-surfing, Dual-rail XOR Gates
4.1 Energy Comparison of Surfing and Non-surfing Dual-rail XOR Gates
4.2 Structure of the Backbone in Figure 4.7
4.3 Structure of the Spines in Figure 4.7
5.1 Parameters Used in Noise-Margin Analysis
5.2 Robustness of Different Logics

List of Figures

1.1 Timing Requirement of Surfing
2.1 A Simple Synchronous Circuit
2.2 Wave Pipelining with Two Waves
2.3 Timing Uncertainty of Wave Pipelining
2.4 Surfing Pipelining
2.5 Timing Requirement of Surfing Pipeline
2.6 A Domino Buffer
2.7 A Self-Resetting Domino Buffer
2.8 A Self-Resetting Preswitching Buffer
2.9 Operation of a Surfing Gate
2.10 Body Bias Voltage Generator Circuit
2.11 Operation of a Variable Threshold Voltage Keeper Circuit
2.12 Two Output Prediction Logic Inverters
2.13 A MOS Current Mode Logic Inverter
2.14 A DyCML Logic Inverters
3.1 The Surfing Ring
3.2 The Surfing XOR-Gate (true half)
3.3 The Output Cell
3.4 The Pseudolatch
3.5 The Two-Input Latch
3.6 The Input Cell
3.7 A Stage of the GasP Timing Ring
3.8 The GasP Cell (G-cell)
3.9 The Serial Control Register
3.10 The select Signal Generator
3.11 nlatch Used in the Generator
3.12 platch Used in the Generator
3.13 dblatch Used in the Generator
3.14 Pseudolatch Used in the Generator
3.15 The Synchronized select Signals
3.16 Frequency of f/8 versus Number of Tokens in the GasP Ring
3.17 The Test Chip
3.18 Relationship between Supply Voltage and Frequency of f/8
4.1 An Energy Efficient Surfing Circuit
4.2 Simulation of an XOR Gate with Charge-Sharing Surfing
4.3 The Brent-Kung Adder
4.4 A Domino Implementation of the PKG Block from [16]
4.5 The Surfing Circuit for Propagate
4.6 The Mixed Brent-Kung Adder
4.7 The Timing Chain
5.1 Noise-Margin Measurement Circuit
5.2 A Static CMOS Inverter
5.3 A First-Order Transistor Model [19]
5.4 A Chain of Surfing Buffers
5.5 A Self-Resetting Domino Buffer
5.6 Two Output Prediction Logic Inverters
5.7 Eigenvectors for a Static Buffer
5.8 Eigenvectors for a Self-Resetting Domino Gate
5.9 Eigenvector for an OPL Buffer with Input = 0
5.10 Eigenvector for an OPL Buffer with Input = 1
5.11 Eigenvectors of the Sensitivity Matrix for the Surfing Chain
5.12 Largest Eigenvalue versus Stage Delay of OPL Buffer with Input = 1
5.13 The Effects of Varying the Stage Delay
5.14 The Effects of Varying the Width of the N-channel Pull-Up
5.15 Noise-Margin Estimates with Input = 0
5.16 Noise-Margin Estimates with Input = 1
5.17 The Delay of an OPL Inverter
5.18 The Delay with Respect to Arrival Time
5.19 Computation Time for Different Logics
6.1 A Self-Resetting Buffer with Variable Threshold Voltage Keeper
6.2 Variable Threshold Voltage Keeper with fast Signals of Different Pulse Width

Acknowledgments

Without extensive support, discussion and endless encouragement from Dr. Mark Greenstreet, this work would not have been possible. I would not have completed the chip design in six months without the solid and clean start on the design done by Brian Winters. I owe the students and staff in the UBC SOC lab a debt of gratitude, especially Roozbeh Mehrabadi and Roberto Rosales for helping me with the chip simulation and testing.
Special thanks go to my husband Zhenyu Zhang who supports my every reasonable decision. Finally, I would like to thank my friends, Jihong Ren, KangKang Yin, Qiang Kong, Suling Yang, and many others, for such a joyful time in beautiful Vancouver.

SUWEN YANG
The University of British Columbia
August 2005

To my husband, Zhenyu Zhang, for his endless support and patience.

Chapter 1 Introduction

Advances in microelectronics systems in the past three decades have led to desktop computers that are far more powerful than large mainframes from a generation ago, graphics processors that offer stunning images and animation, and high-speed networks that provide a new medium for communication including the web, e-businesses, etc. The technological factors that drive high-end chips such as microprocessors, graphics processors, and network routers are primarily: advances in fabrication technology (smaller transistors), advances in digital circuits (faster logic gates), and advances in architecture (fine grained pipelining). This thesis makes contributions in the areas of circuits and pipelining.

A digital pipeline is an assembly line for processing information. The time to complete a pipeline step is the period of the CPU's clock (i.e. the inverse of the clock frequency). Architects improve performance by finding ways to break the tasks of the chip into smaller pieces such that each step of the pipeline can be performed in less time. This is known as "fine-grain" pipelining. For example, a typical CPU today may have a clock period corresponding to 8 or 12 basic logic operations (AND, OR, NOT, etc.) compared with a microprocessor from 20 years ago where typical clock periods were 50-100 gate delays [18].

Each pipeline stage is composed of logic gates. As described in chapter 2, the performance of gates can be improved by precharging or using voltage swings less than the full power-supply.
In a precharged gate, one or more signals within the gate are periodically set to a default (typically high) value; then, the logic circuitry only has to handle the case of switching to the other (typically low) value, resulting in a faster overall design. In a partial voltage swing gate, the gate output does not traverse the entire range of the power supply. This allows a partial voltage swing gate to operate with smaller delays than its full swing counterpart.

Aggressive pipelining and circuit techniques create challenges for managing timing and noise issues. Fine grain pipelining reduces the clock period. Producing and distributing such high-frequency clock signals with adequate precision and uniformity is a serious design challenge. Precharged logic techniques can exacerbate these problems by requiring even finer resolution of the timing signals with multiple clock phases needed to control the precharge operations of the gates within a pipeline stage. Furthermore, low voltage swing designs are inherently more susceptible to noise because the same disturbance is a larger percentage of the separation between high and low values than for a full-swing gate. These problems are exacerbated in deep-submicron technologies where transistors have very low switching times, wire delays are large, and power supply voltages can be as low as one volt or less [4, 10].

1.1 Surfing

Advanced, fine-grained pipelining techniques face a bottleneck: the latches. In a digital pipeline stage, data paths use different gates to achieve different functions and gates work in varying environments, due to noise, temperature, data dependent delays, and other factors. Hence it is very difficult to run all the data paths at the same speed. Latches, controlled by a global clock, are introduced into digital pipelines. When the latches are transparent, logic events are free to move to the next stage. Otherwise, logic events are blocked from propagating to the next stage.
To synchronize logic events for the next stage, the period of the clock has to be greater than the delay of the slowest data path. Latches slow down fast data paths because logic events cannot propagate when the clock is low. Additionally, latches add extra delay to every data path to propagate data from one stage to the next. Overcoming the disadvantages of latches in a pipeline becomes especially important at very high operating frequencies where the overhead from latches can be a large fraction of the total clock period.

Surfing provides a chance to tackle these problems. In a surfing pipeline, latches and a global clock are no longer necessary. Instead, a timing reference pulse is introduced for every logic gate in the data path and propagated parallel to the data path. This timing pulse is used to modulate the delay of each logic gate as a smooth function of the time between the data event's arrival and that of the timing reference signal. Figure 1.1 illustrates this relationship. In this figure, the upper curve is the delay of events in the data path. The dashed line indicates the delay of the timing reference pulse from one stage to the next. Surfing occurs if the following conditions are satisfied:

• If a logic event arrives earlier than the rising edge of the timing reference pulse, then the delay of the logic event must be greater than that for the timing reference.
• Conversely, if a logic event arrives later than the rising edge of the timing reference pulse, then the delay of the logic event should be less than that for the timing reference.

Figure 1.1: Timing Requirement of Surfing (curves: logic delay, control delay)

Under these conditions, logic events are attracted to the rising edge of the timing reference pulse.
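The attraction described by these two conditions can be illustrated with a toy numerical model. The sketch below is not taken from the thesis: the `tanh`-shaped delay curve, the parameter names (`max_shift`, `slope`) and all numeric values are assumptions chosen only so that early events are delayed more than the timing pulse and late events are delayed less.

```python
import math

def data_delay(offset, timing_delay=1.0, max_shift=0.25, slope=0.3):
    """Hypothetical stage delay for a data event (illustrative model only).

    offset = (data arrival time) - (timing pulse arrival time).
    Early events (offset < 0) get a delay above timing_delay and late
    events (offset > 0) get a delay below it, as the two conditions require.
    """
    return timing_delay - max_shift * math.tanh(slope * offset / max_shift)

def propagate(offset, stages, timing_delay=1.0):
    """Return the offset observed at the input of each successive stage."""
    history = [offset]
    for _ in range(stages):
        # Data and timing events both advance one stage; only the
        # difference between their delays changes the offset.
        offset += data_delay(offset, timing_delay) - timing_delay
        history.append(offset)
    return history

if __name__ == "__main__":
    for start in (-0.5, 0.5):
        trace = propagate(start, stages=10)
        print(start, [round(x, 3) for x in trace])
```

In this model the offset shrinks at every stage whether the event starts early or late, which is the event-attractor behavior the conditions are meant to guarantee. A real surfing gate would have a delay curve determined by its circuit, not by a closed-form function.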
If a logic event arrives earlier than the reference pulse, then in the next logic gate, the logic event and timing event will be closer to each other because the logic delay is greater than the delay of the reference pulse. In the reverse case, the distance between the data event and the timing event will decrease because the logic delay for this scenario is less than the delay of the reference pulse. In a chain, the logic events will eventually be attracted to the rising edge of the timing reference pulse such that the propagation delay of logic events matches the delay of the reference pulse. With surfing, all of the data paths have the same delay. Hence, latches are no longer necessary to synchronize the logic events. This is the charm of surfing, which makes latchless pipelining possible.

1.2 Problem Statement

In 2002, Winters and Greenstreet proposed preswitching surfing logic [38]. With extensive simulations, they demonstrated that surfing works. However, they did not address the following issues:

• A real chip may work in a different environment from the simulated chip, due to uncontrollable variations in the power supply voltage, temperature, process parameters, noise, and so on. Given the unusual nature of surfing, a physical demonstration is needed to make sure that nothing has been overlooked in the model.
• Power is now a dominant design concern [32, 28, 20, 29, 19] for most integrated circuit designs. However, the preswitching technique greatly increases the power consumption of the chip because of a short circuit current used to create the surfing effect. To make surfing practical, new circuits are needed that overcome these power problems.
• The preswitching surfing technique uses partial voltage swing to produce the delay variation required for surfing. The impact on noise margin needs to be analyzed and quantified. However, as shown in figure 1.1, the delay of surfing gates is a time varying function and the behavior of surfing gates varies over time. Thus, the noise-margin analysis must consider the time domain properties of the disturbances. Existing analysis techniques either ignore this, or they make unrealistic, simplifying assumptions.

1.3 Thesis Statement

This thesis brings surfing closer to practice by implementing a surfing chip; developing novel, low-power surfing circuits; and introducing new methods for noise-margin analysis.

1.4 Thesis Organization

The thesis is organized as follows:

• Chapter 2 introduces related work for digital pipelining, high speed circuits and noise-margin analysis.
• Chapter 3 gives a detailed description of preswitching surfing logic and a chip designed to verify the theory.
• Chapter 4 introduces a novel, low power family of surfing circuits.
• Chapter 5 presents a new technique for analyzing the noise margins of dynamic circuits. This technique takes the time dependent behavior of dynamic circuits into account. We use this technique to compare the robustness of surfing circuits with other design styles.
• Chapter 6 concludes the thesis.

Chapter 2 Background

In this chapter, we provide an overview of related work for digital pipelining, high-speed circuits and noise-margin analysis.

2.1 Pipelining

In traditional synchronous design, as shown in figure 2.1, delays in data paths may vary over a wide range. As mentioned in chapter 1, inserting latches allows proper operation of a circuit. However, latches add extra delays to all paths, including the slowest one. The clock's period must be longer than the delay of the slowest data path. The slowest data path determines the working frequency and throughput of a synchronous circuit.

Wave pipelining has been used in various designs to mitigate the overhead of latches. One of the earliest examples was the floating point unit for the IBM 360 model 91 [3].
Currently, wave pipelining is used in the L1 caches of most high performance microprocessors [41, 42, 40]. Burleson et al. provide an excellent survey and tutorial on wave pipelining [6]. The key idea in wave pipelining is that a pipeline can support k waves in flight between latches if the clock's period P satisfies inequality 2.1:

$\delta_{max}/k < P < \delta_{min}/(k - 1)$ (2.1)

where $\delta_{min}$ and $\delta_{max}$ are the minimum and maximum delay of data paths respectively.

Figure 2.1: A Simple Synchronous Circuit

For example, if $\delta_{max}/2 < P < \delta_{min}$ (i.e. k = 2), the circuit shown in figure 2.2 can support two waves. The constraint $\delta_{max} < 2P$ means that a wave needs at most two clock periods to reach the right side of the combinational logic. Likewise, $\delta_{min} > P$ says that a wave will not arrive at the register on the right until after a full clock cycle, ensuring that it will not interfere with the previous wave of computation. When the clock signal, $\phi$, in figure 2.2 goes to 1, wave C is launched between latches 1 and 2. Since wave A has already arrived at register 2 before $\phi$ goes high, wave A is transferred to the right side of register 2. Wave B has been in the combinational logic for only one clock period; it will continue between registers 1 and 2 to complete its calculation during the next clock cycle. Since $\delta_{min} > \delta_{max}/2$, wave C cannot overtake wave B. So this pipeline can work correctly with two waves in flight. A few other technical conditions are required for wave pipelining as described in [6].

Figure 2.2: Wave Pipelining with Two Waves

Wave pipelining allows the circuit to work with k times the throughput of the traditional synchronous design and with the same latency. In practice, timing uncertainty hinders the wide-spread usage of wave pipelining. Along the data path, the timing uncertainty grows monotonically because every level contributes some amount.
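As a small worked illustration of inequality 2.1, the sketch below computes the window of clock periods, max-delay/k < P < min-delay/(k - 1), that admits k waves, and the largest k for which that window is non-empty. The function names and the treatment of k = 1 (ordinary clocking, taken here to have no upper bound on P) are assumptions made for this example; the other technical conditions mentioned in [6] are not modeled.

```python
def clock_period_window(delta_min, delta_max, k):
    """Open interval (lower, upper) of clock periods P allowing k waves.

    Implements delta_max / k < P < delta_min / (k - 1).
    Returns None when no clock period satisfies the inequality.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    lower = delta_max / k
    # Assumption: k = 1 is ordinary clocking, with no upper bound on P.
    upper = float("inf") if k == 1 else delta_min / (k - 1)
    return (lower, upper) if lower < upper else None

def max_waves(delta_min, delta_max):
    """Largest k for which the window above is non-empty."""
    if not 0 < delta_min < delta_max:
        raise ValueError("expected 0 < delta_min < delta_max")
    k = 1
    while clock_period_window(delta_min, delta_max, k + 1) is not None:
        k += 1
    return k

if __name__ == "__main__":
    # Hypothetical path delays in arbitrary time units.
    print(clock_period_window(8, 10, 2))
    print(max_waves(8, 10))
```

With hypothetical path delays of 8 and 10 time units, two waves require 5 < P < 8. The window narrows as k grows and disappears once k reaches max-delay/(max-delay - min-delay), which shows directly how the spread between the fastest and slowest paths caps the number of waves.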
This uncertainty constrains the working frequency of the pipeline: inequality 2.1 implies that P must be greater than $\delta_{max} - \delta_{min}$. To minimize timing uncertainty, designers typically arrange logic gates in levels and introduce extra buffers as shown in figure 2.3 to match delays. Currently, wave pipelining is mainly used to pipeline caches.

Figure 2.3: Timing Uncertainty of Wave Pipelining

In 1991, Williams and Horowitz [36] introduced the concept of "zero-overhead" pipelines. A pipeline has zero overhead if no latency is introduced by control or latching. This can be achieved if a pipeline has a total latency equal to the sum of the latencies of its stages. Williams and Horowitz presented an implementation of a zero-overhead pipeline using domino circuits with a self-timed control circuit that generated the timing signals to achieve zero-overhead operation.

In 2002, Winters and Greenstreet proposed the surfing approach for pipelining [38, 39], which achieves negative overhead. In a surfing pipeline as shown in figure 2.4, a timing pulse is propagated parallel to the data paths. Every logic element in the data path is augmented with this timing pulse, which modulates the element's propagation delay. The surfing theory states two requirements:

1. When the timing pulse is 1, the maximum delay of the gate $\delta_{1,max}$ is less than the minimum propagation delay of the timing pulse $\delta_{f,min}$, where $\delta_1$ is the delay of the logic gate with fast = 1, and $\delta_f$ is the stage-to-stage delay of the fast pulse.
2. If the timing pulse is 0, the minimum delay of the gate $\delta_{0,min}$ is greater than the maximum delay of the timing pulse $\delta_{f,max}$, where $\delta_0$ is the delay of the logic gate with fast = 0.

Winters and Greenstreet summarized these two requirements with the following inequality:

$\delta_{1,max} < \delta_{f,min} \le \delta_{f,max} < \delta_{0,min}$ (2.2)

Figure 2.4: Surfing Pipelining

If inequality 2.2 is satisfied, the arrival time of logic events in the data paths will be attracted to that of the timing pulse. As shown in figure 2.5, the period of the timing pulse is divided into four intervals. The interval $[t_1, t_4]$ is the capture region. In interval $[t_1, t_2]$, the input event arrives earlier than the timing pulse, and the logic delay is greater than the propagation delay of the timing pulse. Conversely, in interval $[t_3, t_4]$, the input event arrives after the timing pulse, and the logic delay is less than the delay of the timing pulse. Whenever the input event comes in interval $[t_1, t_2]$ or $[t_3, t_4]$, at the next stage, the input event will be closer to the timing pulse. Hence, after several stages, the input events will converge to arrive in the steady-state surfing interval $[t_2, t_3]$. The interval $[t_4, t_5]$ is the metastability interval. Events in this interval will eventually exit to surf with either the preceding wave or the following wave. Surfing creates an event attractor such that the delay spread along the data paths is kept small regardless of the pipeline length.

A transparent latch is an extreme case of a surfing gate - the latch's delay goes to infinity when the latch is not enabled and drops to a bounded value when the enable signal is asserted. Surfing logic can be seen as "soft" latching. Unlike a traditional latch which slows down early arrivals, surfing designs accelerate late arrivals to achieve higher performance than purely combinational designs. Surfing increases throughput and decreases latency simultaneously, because surfing logic elements have less delay than their non-surfing counterparts.

2.2 High Speed Circuits

We are interested in revising high-speed circuits to meet the timing requirements of surfing pipelining. Many CMOS logic styles, such as domino [21, 17, 7], output prediction logic [27], and variable threshold voltage keepers [23], have timing properties that vary in response to externally applied precharge or clock signals and could potentially be adapted for use with surfing.
We describe each of the logic families here and examine their potential for surfing in chapter 6.

A domino gate [21, 17] consists of a dynamic gate, a static inverter Iout and a keeper m3. A dynamic gate replaces the pull-up PMOS stack in the corresponding static gate with a precharge transistor, controlled by the clock, as shown in figure 2.6. When the clock is low, this transistor pulls node x high. The gate begins to evaluate its logic function when the clock goes high. If in goes high, node x is pulled low, and the output inverter Iout drives node out high. Transistor m3 is a keeper ensuring that node x remains high during the evaluation phase unless a high value on in causes transistors m1 and m2 to pull it low. Domino circuits can operate much faster than their static counterparts because they do not present large p-channel devices in their input loads. Transistor m2 is called a footer and prevents short-circuit currents that lead to excessive power consumption if clock goes low before the input. Like transparent latches, footed domino gates are an extreme case of a surfing gate - the domino gate's delay goes to infinity when the clock is low and drops to a bounded value when the clock is high. It is possible to eliminate the footer to obtain an even faster gate. This is called a "footless" gate. However, power consumption can become a severe issue because of the short circuit current path formed by transistors m1 and m4 if the clock goes low before the other inputs.

Figure 2.6: A Domino Buffer (waveforms versus time in ns)

NORA [13] is a variation of domino that cascades dynamic stages by alternating N- and P-stages. However, the lack of a restoring inverter makes NORA very sensitive to noise: an input disturbance slightly greater than the transistor threshold voltage can propagate through a chain of NORA gates.
Furthermore, removing the static inverter of the domino design often results in little improvement in performance because the inverter provides useful gain for driving the loads of the domino gate [34]. For these reasons, NORA has found little use in practical designs.

Self-resetting domino logic removes the clock in domino gates. Figure 2.7 shows a self-resetting domino buffer [7, 8]. As with the original domino gate, node x is pulled low when the input goes high, and the output inverter drives node out high. This triggers the self-reset. After node out goes high, node p goes low, taking the place of the clock signal from the original domino gate. This charges x back to a high value and Iout then drives out low again. Self-resetting CMOS is a pulse logic: logic values are represented by pulses rather than steady voltage levels. The self-reset mechanism allows operation without footers - with careful design, the reset of each gate starts slightly after the resets of its predecessors complete.

Figure 2.7: A Self-Resetting Domino Buffer (waveforms versus time in ns)

While offering high performance, self-resetting domino presents several design challenges. The sizing of the keeper presents a trade-off between speed, power consumption, and noise margin. When implementing functions more complicated than a buffer, the n-channel device that pulls down on node x is replaced by an appropriate network of n-channel transistors. The design must ensure sufficient overlap of pulses on different inputs to fully trigger the gate. To avoid short-circuit currents during precharge, the precharge control signal, p, must drop low after the input(s) have been reset but early enough to complete the reset of out in time for the next gate's precharge. The lack of design tools to automate the synthesis and validation of self-resetting circuits has limited their application. In particular, designers need tools that enable maximizing performance while satisfying system robustness requirements.
Winters and Greenstreet [38, 39] revised the self-resetting domino gate to obtain a surfing gate as shown in figure 2.8. The input labeled fast is the timing pulse that modulates the gate's delay. When fast is low, no input pulses are expected, and transistor m3 keeps node x high, taking the role of the keeper from the domino designs presented earlier. During the high portion of a fast pulse, node x floats at its high level until appropriate input pulses arrive. Simultaneously, current flowing through transistor m5 pulls node out slightly above ground. In particular, transistor m5 and the pull-down device in inverter Iout form a voltage divider. As these are both n-channel devices, their properties track closely over variations of fabrication parameters. Winters and Greenstreet found that by making m5 about 80% of the width of the pull-down in Iout, it pulls out to about $0.2\,V_{DD}$ when fast is asserted. This provides the speed-up required for surfing.

Figure 2.9 shows the operation of the surfing buffer under various conditions. Curves A, B, and C show the propagation of a pulse through the surfing buffer, the propagation of the same input pulse with fast connected to ground, and the response of the gate when no input pulse is received. Comparing curves A and B reveals the speed-up that occurs when fast is asserted. Curve C shows the voltage shift on the output due to preswitching independent of the input pulse. Curve in is the input for gates with outputs A and B, and x is the precharged node for the gate with output A.

In chapter 3, we describe the implementation and test of a chip that demonstrates surfing with preswitching circuits. The added power consumption, due to the fighting between transistor m5 and the pull-down transistor in inverter Iout, is quite disadvantageous. Surfing is achieved with partial voltage swing. Thus, noise margin is a concern for this kind of design.
The keepers in dynamic gates, such as those in figures 2.6 and 2.7, play an important role in providing the gates' noise immunity. As the technology scales down, keeper sizing becomes more important as exponentially increasing subthreshold leakage currents threaten the reliable operation of deep submicron dynamic circuits. A weak (small) keeper will provide an unacceptably small noise margin for the circuit. However, a strong (big) keeper will increase the power consumption and increase the delay of the circuit due to the fighting of the keeper and the circuit's pull-down stack [2, 22].

Figure 2.9: Operation of a Surfing Gate (waveforms versus time)

Kursun and Friedman [23] proposed using two levels of supply voltages to the n-well for the keeper transistor of a domino gate. Figure 2.10 describes a circuit to generate the variable voltage supplies to the n-well - the logic circuit is traditional domino as depicted in figure 2.6. As shown in figure 2.11, different voltage levels are applied in different operation phases. When the clock is low, the gate is precharged. At the end of the precharge phase, the n-well is raised to a high supply voltage ($V_{DD2}$) to increase the threshold voltage of the keeper transistor. The resulting weak keeper allows high-speed operation at the beginning of the evaluation phase. Shortly into the evaluation phase, the gate should have completed its operation and now needs to retain its result for use by subsequent logic gates. The n-well voltage is lowered to strengthen the keeper, thereby improving the noise margin of the gate and enabling it to robustly hold its output value until the next precharge phase. This allows the fast operation of a weak keeper during evaluation and the robustness of a strong keeper after evaluation has completed.
However, the capacitance between the well and the substrate and the well resistance are large enough for typical processes that it is not clear if this approach can be used at high clock frequencies.

Figure 2.10: Body Bias Voltage Generator Circuit

Output prediction logic [27] (OPL) combines properties of static and domino designs. A static gate can be converted to an OPL version by adding a footer transistor to the pull-down network and a precharge pull-up to the output. Both of these added transistors have their gates connected to a clock signal. For example, figure 2.12 shows two cascaded OPL inverters. OPL uses many finely spaced clock phases. When the clock for an OPL gate is low, the gate output is pulled high. The clock for the gate should go high slightly before the preceding gates complete evaluation. This causes the output of the gate to start to fall while waiting for the inputs to settle. If the inputs remain high, the output will drop all the way to ground. On the other hand, if the inputs transition such that the output should be high, then the output will recover to the high level. With proper timing, the output is at an intermediate level when the inputs arrive, reducing the effective propagation delay of the gate because the transition does not start from one of the rails.

Figure 2.11: Operation of a Variable Threshold Voltage Keeper Circuit (traces: body bias, clock, input; phases: precharge, evaluation; keeper state: weak, strong)

Like domino circuits, OPL offers significant speed-up compared with static CMOS. However, this speed-up depends on maintaining the intended relationship between the clock phases and the output transitions of the OPL gates. Noise margin is another issue.
If a small noise occurs while the output is at an intermediate level, it may corrupt the intended timing relationship between the clock and input.

The circuits described above are voltage mode logic circuits. In this kind of logic, transistors are used as switches that turn on or off to create a path to power or ground according to the voltages on the inputs of the gate. In current mode logic, however, some transistors can be in partially ON states. Figure 2.13 shows a MOS current mode logic (MCML) inverter. Transistor m2' is a DC current source controlled by Vref. R1 and R2 are two pull-up resistors (typically implemented with small PMOS devices whose gates are connected to ground). The logic function is implemented by the transistors in the dashed box in figure 2.13. An MCML gate operates differentially, with each input pair being the outputs of a differential circuit. The value of the output node of a gate depends on the difference between the currents passing through the branches of the circuit. For example, if Vgs(m1) is higher than Vgs(m2), the current through R1 exceeds the current through R2, and the voltage on N1 drops. Eventually it reaches a steady state in which the current through m1 is equal to ICS. This causes the voltage on node x to increase and brings transistor m2 to cut-off. Then the voltage on N2 goes to VDD. For an MCML gate, the logic part can be implemented with NMOS transistors only, which switch faster than PMOS transistors. The reduced voltage swing also reduces the dynamic power dissipation. However, the static power consumption increases because of the DC current source ICS.

Dynamic Current Mode Logic (DyCML) [1] uses a virtual ground to reduce the voltage swing and avoid the static power consumption of MCML. Transistor m8, as shown in figure 2.14, is used as a capacitor to form the virtual ground.
When clock is low, nodes out.T and out.F are precharged to VDD. Transistor m2' is on and node w is discharged to GND. Once clock goes high, transistors m2', m4T and m4F are off. Transistor m2 switches on the current paths from out.T and out.F to w through x. Then, transistors m1T and m1F determine which of out.T or out.F will be pulled down according to the inputs in.T and in.F. The path with the higher input will have a larger discharging current. Transistors m7T and m7F speed up the evaluation and serve to maintain the logic values on out.T and out.F. With a reduced voltage swing on x and the use of differential logic, DyCML achieves less delay and consumes less power than a corresponding MCML circuit. Because of the reduced voltage swing, noise margin is a critical issue with DyCML.

Figure 2.13: A MOS Current Mode Logic Inverter

2.3 Circuit Evaluation

In battery powered systems, such as cell phones and laptops, circuit designers need to pay extra attention to lowering energy consumption to extend battery life. Practical limitations of heat dissipation have made power consumption a primary design concern for desktop computers and servers as well. Performance is no longer the single most important feature of a circuit. With multiple design objectives, comparing different circuits is challenging. It is not fair to compare circuits by energy consumption (E) or delay (t) alone, because circuit designers can often find ways to trade off energy consumption for performance. Power dissipation is not a sufficient metric either: simply increasing the period of the clock lowers the power dissipation of the circuit. Energy is proportional to the square of the supply voltage, so lowering the supply voltage reduces energy dissipation, but delay increases at the same time. Et and Et² are two commonly used metrics that reflect the trade-offs of power and performance.
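As a rough illustration of how these two metrics weigh energy against delay, the following Python sketch computes Et and Et² for two gates. The function names are ours, and the numbers are the delay and energy-per-operation values that table 3.2 in chapter 3 reports for the non-surfing and preswitching XOR gates. It is a worked example under those figures, not a general evaluation procedure.

```python
def et(energy_pj, delay_ps):
    """Energy-delay product, in pJ*ps."""
    return energy_pj * delay_ps


def et2(energy_pj, delay_ps):
    """Energy-delay-squared product, in pJ*ps^2; weights delay more heavily."""
    return energy_pj * delay_ps ** 2


# (energy per operation in pJ, delay in ps), taken from table 3.2
non_surfing = (0.1630, 92.0)
preswitching = (0.4016, 75.0)

et_ratio = et(*preswitching) / et(*non_surfing)
et2_ratio = et2(*preswitching) / et2(*non_surfing)

print(f"Et ratio  (preswitching / non-surfing): {et_ratio:.2f}")
print(f"Et2 ratio (preswitching / non-surfing): {et2_ratio:.2f}")
```

Because Et² squares the delay, the faster preswitching gate recovers more of its energy penalty under Et² than under Et, but it remains worse on both metrics.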
Gonzalez and Horowitz proposed the Et metric in [15] because the energy dissipation of a circuit is dependent on its performance. However, it is difficult to reduce energy dissipation and improve performance at the same time. This metric favors a design that stresses both energy efficiency and performance. Gonzalez and Horowitz in [14] calculated supply and threshold voltages for optimal Et. With the first-order model of the energy and delay of a CMOS circuit, lowering the supply and threshold voltages is advantageous for increasing energy efficiency, especially when transistors are velocity saturated [19, p. 57-66]. Due to process variations and variability in operating conditions, these advantages are limited by the need to provide adequate noise margin.

If velocity saturation of transistors is not considered and the power supply voltage is not close to the threshold voltage, Et² is independent of the power supply [26]. However, these assumptions are inaccurate for low power supply voltages and deep submicron processes where velocity saturated operation dominates. Et² strongly emphasizes performance. It favors a circuit design which trades off energy consumption for a small reduction of delay. In this thesis, we compute both Et and Et² to characterize the trade-off between power and performance in our designs and to compare our circuits with other approaches.

2.4 Noise Analysis

Noise margin is another important measure of circuit quality that can be traded against power and performance. For example, increasing the size of transistor m5 in figure 2.8 increases the voltage shift on node out. This increases the difference between the fast and slow delays, thus increasing timing robustness while compromising the voltage noise margin. In this case, surfing trades voltage robustness for timing robustness.
In real designs, signals are disturbed from their ideal values by capacitive coupling from nearby wires (crosstalk), variation in the power supply voltage due to the resistance and inductance of the power and ground networks combined with switching currents, substrate noise, thermal noise, etc. Designers use noise margin measures to quantify the robustness of logic circuits in the presence of such disturbances. For static CMOS gates, the disturbance is typically modeled as an offset of the input voltage away from the ideal values of power or ground. The static noise margin is defined as the input offset that brings the gate to a point where the small signal gain has a magnitude of one [19]. This ensures that the change in the output voltage will be less than the change in the input voltage. Thus, disturbances smaller than this static noise margin diminish in a long chain of gates.

As the technology moves to very deep submicron feature sizes, the static noise margin, which assumes DC noise, becomes more and more conservative. This is because digital noise often exhibits narrow pulse widths and gates behave like low pass filters. In addition, dynamic gates exhibit different responses to noise because their dynamics change over their multiple phases of evaluation. Take the footed domino buffer shown in figure 2.6 as an example. When the clock input is low, x will go high and out will go low regardless of the value of in. In other words, the output is independent of the input during this phase and the noise margin is unbounded. On the other hand, when clock is high, x will be pulled down if in moves above the n-channel threshold voltage. The degree to which nodes x and out are disturbed depends on the relative size of the keeper pull-up and on the time and shape of the disturbance on in. The dynamic noise margin depends strongly on the waveform of the disturbance and, in many cases, on the output load.
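To make the static noise margin definition given earlier in this section concrete, the Python sketch below locates the unity-gain point of an inverter transfer curve by bisection. The tanh-shaped curve, the peak gain of 10, and all function names are hypothetical choices for illustration; they are not a device model and are not taken from the circuits in this thesis.

```python
import math

VDD = 1.8    # supply voltage in volts
GAIN = 10.0  # assumed peak small-signal gain magnitude at Vin = VDD / 2


def vout(vin):
    """Hypothetical smooth inverter transfer curve (not a device model)."""
    k = 2.0 * GAIN / VDD
    return 0.5 * VDD * (1.0 - math.tanh(k * (vin - 0.5 * VDD)))


def gain(vin, h=1e-6):
    """Small-signal gain dVout/dVin, estimated by central difference."""
    return (vout(vin + h) - vout(vin - h)) / (2.0 * h)


def low_unity_gain_input(lo=0.0, hi=0.5 * VDD, iters=60):
    """Bisect for the input below VDD/2 where |gain| reaches one."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if abs(gain(mid)) < 1.0:
            lo = mid   # still in the low-gain region; move up
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Offset from ground at which the gain magnitude reaches one
static_margin = low_unity_gain_input()
print(f"static noise margin for a low input: {static_margin:.3f} V")
```

The bisection relies on the gain magnitude of this assumed curve rising monotonically between ground and VDD/2. A measured or simulated transfer curve would need the same check before the search is applied.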
Various researchers have proposed techniques for measuring the noise margin of dynamic circuits. Zolotov and Blaauw [43] proposed a latch transition failure criterion: a failure is said to occur if the noise changes the state of a memory element. While this equates noise margin with system failure, determining the exact space of disturbances for which the circuit will operate properly is impractical or impossible for most circuits. This motivates using approximations that provide bounds on the actual noise margin. One of the earliest approaches is that of Larsson and Svensson [24], who model the disturbance as a pulse parameterized by its width and height. Others have noted that common disturbances often have other shapes. For example, triangular and exponential pulses have been considered [11, 30]. Shepard has proposed a mixed criterion, namely finding the smallest dynamic disturbance that brings the output to the unity-gain point of static analysis [30, 31]. As described in [43], Shepard's criterion provides neither an upper nor a lower bound for the actual noise margin.

Because of the timing pulse introduced in surfing circuits, the noise margin of surfing circuits depends strongly on the arrival time of the noise. Such time varying dynamics were not considered in these earlier papers. Thus, we present a new technique for analyzing the noise margin of dynamic circuits in chapter 5.

Chapter 3 The Test Chip

In this chapter, we describe the chip we designed and tested to verify the surfing theory.

3.1 Structure of the Test Chip

We implemented a proof-of-concept test chip to demonstrate surfing. The key structure on the chip is a twelve-stage ring that calculates a pseudo-random sequence on 11-bit, parallel words. To keep the focus of the chip on surfing, we avoided long wires for data and implemented a ring that only requires nearest neighbor communication.
The sequence that we calculate is based on the linear-feedback shift register (LFSR [12]) sequence:

\[ w(i,j) = w(i-1,\,j-1) \;\mathrm{XOR}\; w(i-1,\,j) \tag{3.1} \]

where w(i,j) is the jth bit of the ith word in the sequence and j - 1 is calculated modulo 11. The actual chip implements a slight variation on this sequence as described in section 3.2.

Figure 3.1: The Surfing Ring

We implement the recurrence of equation 3.1 with an array of surfing XOR gates. To set and observe the values of the waves in the surfing ring, we embed chains of "input" and "output" cells in the array that form a serial scan chain. Figure 3.1 depicts the ring: cells labeled 'X' are surfing XOR gates, cells labeled 'I' are input multiplexors with input serial registers and surfing pseudolatches, cells labeled 'O' combine a surfing XOR gate with the output serial register and a surfing pseudolatch, and cells labeled 'G' are the GasP [33] handshaking cells that generate the fast signals. The "Control" cell handles the interfaces to the serial scan chain as well as synchronizing the control signals for transferring data between the scan chain and the surfing array. Section 3.2 describes the design of these components in greater detail.
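The recurrence of equation 3.1 can be sketched in a few lines of Python. This models only the idealized sequence, not the variation that the chip actually implements. The function names, the list-of-bits representation, and the seed word are assumptions made for illustration.

```python
WORD_BITS = 11  # width of each parallel word in the ring


def next_word(word):
    """Apply equation 3.1: bit j of the new word is the XOR of
    bits j-1 (modulo 11) and j of the previous word."""
    return [word[(j - 1) % WORD_BITS] ^ word[j] for j in range(WORD_BITS)]


def period(seed):
    """Number of steps after which the sequence starting at
    next_word(seed) first returns to its starting word."""
    start = next_word(seed)  # one step places the word on the periodic part
    word, steps = next_word(start), 1
    while word != start:
        word = next_word(word)
        steps += 1
    return steps


seed = [1] + [0] * (WORD_BITS - 1)  # hypothetical initial word
print("first words:", seed, next_word(seed))
print("period of the idealized sequence:", period(seed))
```

Every word after the first step has even parity, because each input bit contributes to exactly two output bits. This is why the sketch measures the period from the second word rather than from the seed.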
To describe the ring, we write w(i,j) to denote the jth bit of a word output by column i. Each X and O cell in the main array receives surfing data inputs from its west and southwest neighbors for w(i-1, j) and w(i-1, j-1) respectively from equation 3.1. The east and northeast directed connections between cells in figure 3.1 depict these connections. The I cells receive a surfing data input from their west neighbor. The I cells perform no computation and under normal operation simply propagate values to the east and northeast neighbors. Hence the I cells have no input from their southeast neighbors. The fast signals from the GasP chain propagate downward in each column. The south directed connections between cells indicate these connections. Each fast signal is one long wire that spans the height of the array. This is the only non-local communication in the ring. The I and O cells have connections to propagate the select and unload control signals from the "Control" block. These are southeast directed connections in the figure. In addition, the I and O cells also have connections to propagate the serial data inputs and outputs. These connections are also southeast directed.

Figure 3.2: The Surfing XOR-Gate (true half)

The test chip includes a simple serial control interface that allows us to set the initial values of the computations, take snapshots of the ring's state, start and stop the GasP timing chain, and set the number of tokens in the timing chain. The ring can operate with one or two surfing waves. By operating the ring with two waves, we demonstrate that it can support independent waves of computation without interference for arbitrarily long periods of time.
The two chains of I cells in the figure provide data for initializing the two waves, and the two chains of O cells take snapshots of the two waves. To verify correct operation of the ring, we initialize the ring with two waves with a known separation in the pseudo-random sequence. After the ring has run for a while, we take a snapshot of the ring's state and verify that the two waves have values with the same separation as the original values. The ring continues to operate during a snapshot, allowing subsequent snapshots to be taken without reinitialization.

Figure 3.3: The Output Cell

3.2 Design of the Test Chip

Given the novelty of surfing pipelines, this section provides descriptions of each cell that we used in the ring. We used dual-rail domino with "preswitching" as described in [38] to effect the required speed-up when fast is asserted. Figure 3.2 shows the true half of the X-cell. The number beside each transistor is its shape factor. The n-channel pull-up controlled by fast has a shape factor that is about 80% of that for the pull-down in the inverter for the domino output. These transistors fight with each other during preswitching. Using n-channel devices for both provides good tracking over variations in fabrication parameters and operating conditions. N-channel transistors are weak pull-ups, and this arrangement moves the gate outputs to about 0.2 VDD when fast is asserted and provides a peak reduction in gate delay of about 30%. To provide some margin when surfing, we set the stage-to-stage delay for the surfing ring to be about 80% of the propagation delay for non-surfing domino.
Figure 3.4: The Pseudolatch

As shown in figure 3.3, the output cell extends the surfing XOR gate by adding a serial scan chain with a surfing control signal for transferring data from the surfing path into the serial chain. In keeping with our approach of using only nearest neighbor communication, the unload signals surf along the O-cell chains using the "pseudolatches" shown in figure 3.4. Since the delay of the pseudolatch should be the same as the stage delay of the fast signal, for simplicity we add the fast-gated NMOS to the pull-down stack to resemble the design of the XOR gate. We keep the preswitching surfing design in the pseudolatch to maintain its timing robustness. This maintains the proper timing relationship between the fast signal and the control signals propagated by the pseudolatches. Like other surfing components, the pseudolatch provides no static state holding capability.

We used the two-input latch shown in figure 3.5 to transfer data from the surfing ring to the serial scan chain. The overlap of d1 and en1 must be large enough to acquire valid data from the XOR gate. To achieve this, we used a design for our two-input latch that has a small set-up and hold window and a negative hold time. By carefully sizing these transistors, we achieved a set-up and hold window (the set-up time is 61 ps and the hold time is -37 ps) that is less than one FO4 delay¹ across all five process corners.

Figure 3.5: The Two-Input Latch

¹ An FO4 delay is the delay of one static inverter driving four inverters of the same size as itself [34]. In the TSMC 0.18 µm process, one FO4 delay is about 90 ps.
Figure 3.6: The Input Cell

Figure 3.6 shows the input cell. It replaces the dual-rail XOR gate with a multiplexor. We implemented the multiplexor simply by changing the inputs to the transistors of the XOR circuit. Thus, the two gates have very similar delay characteristics, which simplifies surfing.

To avoid broadcasting data and control signals, we placed the I-cells along a diagonal. While it might seem natural to place them parallel to the diagonal from w(i-1, j-1) to w(i,j), this would preclude proper initialization of the ring. Suppose the I-cells were placed in this "natural" way. The X-cells above the I-cells would not be properly initialized, because an X-cell receives inputs from its west and southwest neighbors. The X-cells below the I-cells would receive their southwest inputs from each other and would not be initialized correctly either. Thus, we placed the I-cells on the other diagonal, which explains a handful of peculiarities of the ring. First, to retain the regularity of the layout, the X-cells and O-cells are placed on diagonals parallel to the I-cells. To set both the w(i-1, j-1) and w(i-1, j) inputs, we used two parallel chains of I-cells for each wave. Consider an X-cell that is immediately to the left of an I-cell. The next XOR calculation for the wave is performed by the O-cells to the right. As shown in figure 3.1, the horizontal path from such an X-cell to the corresponding O-cell goes through two I-cells, whereas the "north-east" path goes through only one I-cell. This disrupts the LFSR calculation from equation 3.1. We determined through simulation that the resulting sequence has a period of 1533, which is quite sufficient for our tests.
As an added benefit, this arrangement allows the snapshot from a single diagonal chain of O-cells to completely describe the state of a wave. Thus, we chose this approach to generate the pseudo-random sequence for our design.

Having examined the construction of the ring, we now look at the generation of timing and control signals. Our timing generator uses a variation of the "GasP backwards" control described in [38]. That design had two gate delays in the forward direction, which nicely matches the domino structure of our surfing gates, and four delays in the reverse direction. We added two more inverters to the backward path, as shown in figure 3.7, to ensure adequate delay for the self-reset of the buffer that generates the fast signals. With this extra delay, the ring operates correctly.

From an asynchronous design perspective, this extra delay serves an unusual purpose. For proper surfing, the GasP chain must be token limited [35]; in other words, the acknowledge event from the right must arrive at the NAND gate before the request event from the left. As described in [9, 37], the time separation of the two inputs affects the delay of the stage. In particular, when the two input events are nearly simultaneous, the delay from the last input is greater than when the events are more widely separated. This is known as the "Charlie effect", and it is this phenomenon that allows our GasP ring to maintain adequate separation of the pulses while operating well inside the token-limited regime.

In figure 3.7, the NAND gate in each stage has an extra input to enable the NAND gate. In one stage, this extra input is connected to start; in the others it is connected to the power supply. If the start signal is 0, the GasP timing chain halts. The GasP timing chain starts to run once the start signal goes high.

Figure 3.7: A Stage of the GasP Timing Ring

The G-cell in figure 3.1 extends each GasP stage with a multiplexer and a two-phase static latch.
Figure 3.8 shows the schematic of a G-cell. The input3 input to the NAND gate is connected either to start or to the power supply. The G-cell uses the serial data input to set the number of tokens in the timing chain. When reset is high and start is low, p_i in the GasP timing chain is set to the inverse of the value held in the two-phase latch in the corresponding stage. The multiplexer also functions as a keeper for the shared stage wire of the GasP circuit. Once reset goes low and start goes high, the value of the p_i signal is overridden by the value set by the free running of the GasP chain. Thus the multiplexer's driving capability is designed to be very small. We may load m tokens into the GasP ring such that, on the left side of the special NAND gate at stage i, p_{i-1}, p_{i-2}, ..., p_{i-m} are set to VDD and the others to 0.

We have a control circuit to input and synchronize the data. It is composed of three parts:

1. the serial control register
2. the circuit to generate the select signals
3. the circuit to generate the unload signals

Figure 3.9: The Serial Control Register

The serial control register shown in figure 3.9 uses two-phase static latches controlled by the external clock to transfer the data. In the figure, each box represents a two-phase latch. The external clock's frequency is quite low; normally we set it to less than 300 kHz.
For proper operation of the serial control register, a valid frame of serial input data takes the following coding format, where each D_i represents a valid data bit: 1 1 1 1 1 0 D1 D2 D3 D4 0 D5 D6 D7 D8. Five consecutive 1s set the enable signal to a high voltage level. The enable signal then enables the parallel registers and triggers the corresponding commands. Table 3.1 lists all the commands that can be launched by the serial control register. Note that it is never necessary to set five consecutive ones in D1 through D8.

Table 3.1: Commands Launched by the Serial Control Register

Data  Symbol  Command
----  ------  -------
D1    reset   load the data in the latches to the GasP timing chain
D2    start   start to run the GasP timing chain
D3    LFSR    start to run the ring
D4    output  take a snapshot of the ring
D5    load    load the serial data into the ring and GasP timing chain
D6    data    the data to the LFSR ring and GasP timing chain
D7    φ1      the internal clock φ1
D8    φ2      the internal clock φ2

For proper operation of the chip, the first step is to load in data for the I-cells and to initialize the GasP ring. Thus we set D5 to 1 to enable the loading. Each G-cell, I-cell and O-cell has a two-phase latch inside controlled by φ1 and φ2. These latches are connected together as in figure 3.9. The input of the first latch in the chain is the serData signal. The output of the chain is connected to a probe pad to allow the testing of the long serial chain. For correct operation, D7 and D8 cannot be 1 at the same time.

To perform a test, we first reset the GasP timing chain by setting D1 high and D2 low. We can then load multiple tokens into the GasP timing chain. The ring inside the chip can run with one or two tokens in the GasP timing chain. Then we start the GasP timing chain by setting D2 high and D1 low.
Figure 3.10: The select Signal Generator

We then start the ring by setting D3 to generate the select signal used in the ring for proper initialization. We set D4 to generate the unload signal to take a snapshot of the ring and observe the values from a probe pad. The circuits used to generate the select and unload signals are almost the same, so we describe only the circuit that generates the select signal. Setting D3 makes the input to the synchronizer go high. We use the three-bit ripple counter synchronizer from [25] to synchronize the rising edge of D3 with a fast signal, fast_i. The input to the synchronizer is the LFSR signal. We label the output of the synchronizer syn_pre. Using the circuit shown in figure 3.10, we produce pulses on the select signals. Figures 3.11, 3.12, 3.13 and 3.14 describe the boxes used in figure 3.10. Figure 3.15 shows the output of the select signal generator. The select1_1 and select1_0 signals are the dual-rail input pair for the first chain of I-cells; select2_1 and select2_0 are the pair for the second chain. The s1_pre, s1_0_pre, s2_pre and s2_0_pre signals go through the pseudolatch shown in figure 3.14 to synchronize with the next fast signal. This pseudolatch is similar to the
This pseudolatch is similar to the 38 phi -5.33/5.33 T = i 1-33 5.33/2.67 2.67 phi—cj[3.33 3.33 1.6711— phi—I [1.67 F i g u r e 3.11: n l a t c h U s e d i n the G e n e r a t o r phi phi 2.67 .67 8/2.67 phi —4 [3.33 3.33] h—4 1.67 phi—I [1.67 F i g u r e 3.12: p l a t en U s e d i n the G e n e r a t o r 39 phi <3 5.33 5.33 5.33 phi Figure 3.13: dblatch Used in the Generator -r 1.67/2.67 16 ]p -0< ]— 1.67]p HL8 9.33 out out 26.67/12 fast logic symbol Figure 3.14: Pseudolatch Used in the Generator pseudolatch in figure 3.4 except that the N M O S gated by fast; in the pull-down stack is removed. From simulation, we noticed that s l p 7 . e , s l _0 p r e , s 2 p r e and s2_0 p r e arrive later than fastj+2. Removing this N M O S device further reduces the minimum delay of the data events and increases the timing robustness. Then selectl.l , selectl.O, select2_l, and selectl_0 go through several pseudolatches to synchronize with the fast signals before they are used as inputs to the I-cells. Generating the unload signal is the same as generating the selectl.l signal. Thus we use the same circuit as in figure 3.10 to generate the unload signal except that the circuits used to generate selectl_0, select2_l and select2_0 are removed and the input to the three-bit ripple counter synchronizer is the output signal from the serial control register. In addition to the serial chain, we have probe pads that allow us to observe 40 A A fasti h A A A A A syn_pre^ syn \ A A fasti+1 A A A A A A . S lpre A S2 pre A S l _ O p r e A A A A s 2 _ 0 p r e A A A A A A A fast i + 2 A A A A A A select 1 1 A select2 1 A selectl 0 A A A A setect2 0 A A A A A . Figure 3.15: The Synchronized select Signals 41 the fast pulse for one of the ring stages and the same after scaling with a divide-by-eight counter. While the ring will only surf properly with one or two tokens in the GasP timing chain, we can initialize the chain with any number of tokens. 
This allows us to measure the timing properties of the GasP chain. We also have probe points for observing the synchronized control signals from the I-cell and O-cell chains.

3.3 Test Results

We fabricated our test chip using the TSMC 0.18 µm CMOS process. Figure 3.17 shows the fabricated chip. To keep our academic fabrication service (CMC, the Canadian Microelectronics Corporation) happy, we kept the chip's pin count low. Thus, the chip is designed to be tested on a probe station. The row of eight probe pads at the top provides power, ground, and all data inputs to the chip. The clock signal is for the serial scan and control path. Each column of three pads at the bottom provides a ground-signal-ground arrangement that we use to probe an output signal. The signal fast8 is the fast signal from stage 8, and the signal f/8 is the output of an on-chip divide-by-eight counter driven by this fast signal. The signals s0_1, s1_0, s1_1, and u1_1 are select and unload signals from the I-cell and O-cell chains. We can simultaneously probe one signal from the top and one from the bottom.

Observing f/8, we determined the speed of the GasP pipeline. Figure 3.16 shows the relationship between the frequency of f/8 and the number of tokens in the GasP timing ring. As the token number increases from 0 to 3, the frequency of f/8 increases linearly with a slope of 80.3 MHz per token. From 4 tokens to 8 tokens, the frequency of f/8 decreases with a slope of -26.3 MHz per token. The frequency of f/8 reaches its peak value with 3 tokens in the ring.

Figure 3.16: Frequency of f/8 versus Number of Tokens in the GasP Ring

With a 1.8 V supply, the forward delay $\delta_F$ of a GasP stage with two tokens is 130 ps, calculated by equation 3.2, where left_slope is the slope of the solid line in figure 3.16.
\[ \delta_F = \frac{1}{\mathrm{left\_slope} \times 8 \times 12} = \frac{1}{80.3\,\mathrm{MHz} \times 8 \times 12} \approx 130\,\mathrm{ps} \tag{3.2} \]

The reverse delay $\delta_R$ of a GasP stage with two tokens is 396 ps, calculated by equation 3.3, where right_slope is the slope of the dotted line in figure 3.16.

\[ \delta_R = \frac{-1}{\mathrm{right\_slope} \times 8 \times 12} = \frac{-1}{-26.3\,\mathrm{MHz} \times 8 \times 12} \approx 396\,\mathrm{ps} \tag{3.3} \]

The forward delay $\delta_F$ is somewhat longer than predicted by HSPICE simulations, which yielded 95 ps for typical parameters (roughly 1 FO4 delay) and 115 ps at the slow-slow corner². This discrepancy can be explained by the voltage drop and heating caused by the chip's power consumption. With a 1.8 V power supply, the chip draws 120 mA. From the serial-data out pad, we conclude that the on-chip VDD is 1.7 V, so there is a moderate voltage drop across the power probes. Furthermore, the temperature on the edge of the chip is 34 °C, and we expect that the temperature in the core would be higher, whereas the HSPICE simulations assumed a die temperature of 25 °C. The combined effects of the lowered power supply voltage and elevated die temperature suffice to explain the lower than expected speed. These also motivate our work on more energy efficient surfing described in chapter 4.

² Typical design methods for high-performance digital circuits make extensive use of circuit simulations. Because it is not possible to perform simulations for all possible values of the physical parameters, they are often divided into five representative points according to the performance of the n-channel and p-channel transistors. For example, the fast-slow "corner" has fast n-channel devices and slow p-channel ones. The other four standard corners for the parameters are typical-typical, fast-fast, slow-slow, and slow-fast.

We then verified proper surfing behavior. We loaded two waves into the ring, allowed it to run, and then took snapshots and verified the waves. With the power supply voltage equal to 1.8 V, the frequency of the f/8 signal is 160 MHz.
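The arithmetic behind equations 3.2 and 3.3 can be checked with a short Python sketch. It also estimates how many stage-to-stage moves occur in a 48-hour run at the measured f/8 frequency. The variable names are ours, and the assumption that each fast pulse at a stage corresponds to one wave advancing through that stage is our reading of the ring's operation rather than a measured quantity.

```python
STAGES = 12      # stages in the GasP timing ring
DIVIDER = 8      # on-chip divide-by-eight counter that produces f/8

left_slope_hz = 80.3e6     # rise of f/8 per token (solid line, figure 3.16)
right_slope_hz = -26.3e6   # fall of f/8 per token (dotted line, figure 3.16)

# Equations 3.2 and 3.3: per-stage forward and reverse delays in seconds
delta_f = 1.0 / (left_slope_hz * DIVIDER * STAGES)
delta_r = -1.0 / (right_slope_hz * DIVIDER * STAGES)

# Assumed event count: every fast pulse at a stage is one move of a wave
f_div8_hz = 160e6            # measured f/8 frequency at 1.8 V
run_seconds = 48 * 3600      # a 48-hour test interval
moves = f_div8_hz * DIVIDER * STAGES * run_seconds

print(f"forward delay: {delta_f * 1e12:.0f} ps")
print(f"reverse delay: {delta_r * 1e12:.0f} ps")
print(f"stage-to-stage moves in 48 hours: {moves:.2e}")
```

The factor of 8 undoes the on-chip divider and the factor of 12 spreads one trip around the ring over its stages, which is why both appear in the delay and in the event count.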
Running the ring for various intervals of up to 48 hours, we observed no errors. This means that the ring supported two independent waves of computation separated only by the surfing effect for 2.6 × 10^15 moves between pipeline stages without a single error. We believe that this compellingly demonstrates robust surfing on a real chip.

We then varied the power supply voltage. Figure 3.18 shows the relationship between the power supply voltage and the frequency of f/8. When the supply voltage is low, the frequency of f/8 increases linearly with the power supply voltage, as predicted by a first-order, long channel transistor model [19]. We use the line f = 130.94 · VDD - 73.54 (with f in MHz and VDD in volts) to fit the test data from 1 V to 1.9 V. When the supply voltage is greater than 1.9 V, the curve bends because of velocity saturation [19]. We observed correct surfing when the external power supply voltage ranged from 1.44 V to 2.67 V. At the lower voltage, the stage-to-stage propagation delay was 175 ps, and at the upper voltage it was 96 ps. Thus, our surfing circuit works over a range of nearly 2:1 in the power supply voltage and speed. As the speed is determined by the GasP timing ring, it is not surprising that speed is roughly linear in voltage as predicted by simple CMOS scaling models.

3.4 Energy Overhead

Our testing demonstrated that surfing works correctly. However, preswitching surfing consumes an unacceptable amount of power. Using HSPICE simulations, we obtained a detailed breakdown of the power consumption of the surfing XOR gates from our test chip (e.g. see figure 3.2). We compared a chain of 16 preswitching, surfing, dual-rail XOR gates with a chain of 16 non-surfing, self-resetting domino dual-rail XOR gates. The fast signals are generated by a series of chains of two small inverters. Furthermore, we define the side of a dual-rail XOR gate with an output event as the active side and the other side as the inactive side.
Table 3.2 summarizes the comparison when the width of the fast pulse is equal to 250ps. This pulse width corresponds to that obtained by simulating the chip. The preswitching XOR gate uses about 2.46 times as much energy per operation as the non-surfing, self-resetting domino equivalent. The surfing gate has a propagation delay that is 18.5% less than that of the non-surfing gate (75ps vs. 92ps). Thus, the surfing gate is 63% worse by the $Et^2$ metric [26] than an ordinary self-resetting domino gate.

Table 3.2: Comparison of Preswitching with Non-surfing, Dual-rail XOR Gates

Item                                                Non-surfing   Preswitching
XOR delay (ps)                                      92            75
energy per operation (pJ)                           0.1630        0.4016
fraction of energy consumed by the active side      0.9994        0.5190
fraction of energy consumed by the inactive side    0.0006        0.4466
fraction of energy used to generate fast signals    0             0.0344

3.5 Observations

An interesting phenomenon that we discovered in HSPICE simulations is that not only the forward path surfs, but also the reset path, which surfs on the falling edge of the self-reset signal, pre(i,j). For the surfing XOR gate shown in figure 3.2, if the falling edge of the input signal arrives earlier than that of pre(i,j), then, because pre(i,j) is still high, the voltage on node x remains low until pre(i,j) goes low. If the falling edge of the input signal arrives later than that of pre(i,j), then, because of the fight between the precharge PMOS and the pull-down stack, the voltage on x has already risen a small amount and node x will charge to $V_{DD}$ sooner. Thus the delay of the reset path depends on the arrival time of the falling edge of the input signal. If this delay variation meets the timing requirement of surfing, a chain of such XOR gates can operate properly, and the pulse width of the preswitching surfing gate is less than that of the non-surfing, self-resetting domino gate because the surfing effect on the reset path makes the reset happen earlier.
Otherwise, if the reset path is slower than the forward path, the input pulse will become wider and wider and finally the pulse will disappear.

In HSPICE simulations, we also observed that to make the GasP timing chain work in all five corners, we need to increase the reverse delay in the GasP chain, using the Charlie effect to maintain proper separation between pulses. The separation must be greater than the reset time of the self-reset buffer used to generate the fast signals.

3.6 Summary

We demonstrated a working surfing chip. This chip uses the preswitching surfing technique. This approach shows excellent tracking over variations in process parameters, power supply voltage, and temperature. We validated the robustness to process parameters using five-corner HSPICE simulation (see footnote 2), but could not test this in the lab because we had chips from a single fabrication run. The robustness to voltage and temperature variation is clear both in simulation and from laboratory measurements. Without latches or other storage elements, this chip can support two waves. We observed no errors while running it for over 48 hours. This chip operates correctly with the supply voltage ranging from 1.44V to 2.67V. However, preswitching surfing consumes an unacceptable amount of power. Our testing shows that the average current is 120mA with the supply voltage equal to 1.8V.

Chapter 4 Lower Power Surfing

Our test chip demonstrates that surfing can work in a real chip. However, the extra power consumption caused by preswitching is unacceptable for most real applications. As mentioned in Chapter 3, in a non-surfing dual-rail XOR gate, the side of the gate that generates the output pulse accounts for more than 99% of the total energy consumption. In contrast, the two sides of the surfing gate are nearly equal in energy consumption regardless of which generates the output pulse.
Nearly all of the energy consumption of the inactive side of the surfing gate is due to current through the n-channel pull-up transistor that effects the preswitching. The inactive side of the gate consumes this short-circuit current throughout the duration of the fast pulse, whereas the active side only consumes the current on the leading edge until it actually generates the active pulse. Thus, we see nearly four times the short-circuit current through the preswitching pull-up on the inactive side as on the active side.

The active side of the surfing gate still consumes 34% more energy than the non-surfing gate to charge the internal dynamic node. This is due to the self-reset starting before the inputs of the gate return to ground. Initially, we were surprised to observe that this short-circuit current through the precharge path could not be eliminated by increasing the delay in the self-reset path. In our surfing gate, the fast signal accelerates the evaluation phase of the gate but not the reset phase: evaluation propagates faster than reset through our gate. Thus, the self-reset signal (pre(i,j) in figure 3.2) is asserted before the inputs fall regardless of the delay in the self-reset loop. In particular, the self-reset signal occurs just enough before the falling of the gate's inputs to allow reset events to surf on the self-reset signal, consuming short-circuit current reminiscent of that in the evaluation phase.

In this chapter, we present a novel family of surfing circuits aiming to reduce power consumption and demonstrate them using the design of a Brent-Kung [5] carry-lookahead adder.

4.1 Surfing with Charge Sharing

From the observations described in section 3.4, we concluded that to reduce power we needed to avoid the short-circuit currents associated with preswitching. The key idea behind our new design is to transfer charge from the internal node of the surfing gate to its output while waiting for input events to arrive.
This lowers the voltage on the internal node and raises the voltage on the output, both of which lead to faster operation. We assume dual-rail gates, or their generalization, one-hot gates that can have more than two outputs, but exactly one generates an output event during any evaluation cycle. When one side of the gate generates an output event, we disable the charge transfer paths, and restore the node voltages for the inactive side(s). As this approach involves no short circuits from power to ground, it uses much less energy than preswitching circuits.

Figure 4.1 shows our new surfing circuit for a dual-rail XOR gate with inputs a and b and output y. Figure 4.2 describes its operation. When fast is low, the gate is reset: x.T and x.F are pulled high, and y.T and y.F are pulled low. When fast goes high, all four of these nodes head toward an intermediate voltage. Unlike the preswitching based design from chapter 3, the voltage shift for this design is primarily through charge sharing between the capacitors at nodes x.T and y.T (resp. x.F and y.F). This eliminates the short-circuit current that caused the large energy consumption in the preswitching design. This intermediate voltage situation is resolved by the pull-down network, which brings one of x.T or x.F lower than the other, which in turn brings the corresponding output high. Once one output goes high, the cross-coupled devices act as keepers for the inactive side of the gate, restoring the x signal to a high level and y to low. The three n-channel devices in series from y.T to x.T (resp. y.F to x.F) include inputs to terminate the charge sharing once an output goes high. Thus, the surfing effect is only applied until an output is generated, further reducing energy consumption.

Optimizing power consumption and speed involves a trade-off between adding transistors to shut off short-circuit paths early, and the extra capacitive loads presented by these extra transistors.
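The charge-sharing step can be illustrated with an idealized two-capacitor calculation. This is a sketch under simplifying assumptions, not a model of the actual gate: the internal node x and the output node y are treated as ideal capacitors joined by an ideal switch that stays closed until the voltages equalize, the capacitance values are hypothetical, and the 1.8V supply is assumed from the test-chip discussion.

```python
def shared_voltage(c_x, v_x, c_y, v_y):
    """Common voltage after two ideal capacitors are connected (charge conservation)."""
    return (c_x * v_x + c_y * v_y) / (c_x + c_y)


VDD = 1.8  # assumed supply voltage in volts

# After reset, the internal node x is high and the output node y is low.
# Capacitances are hypothetical and in arbitrary units.
for c_x, c_y in [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]:
    v = shared_voltage(c_x, VDD, c_y, 0.0)
    print(f"C_x={c_x}, C_y={c_y}: both nodes settle at {v:.2f} V")
```

In this idealized picture the intermediate voltage depends only on the ratio of the two capacitances, which is why the ratio of internal to output capacitance matters for the speed-up. In the real gate the pull-down network and the series n-channel devices end the sharing before full equalization.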
In our simulations, we found that we could make the transistors for the feedback path and the keepers quite small, thus minimizing their speed and power penalty. For the preswitching technique, the minimum propagation delay depends on the size of the pull-up NMOS transistor. However, for the charge-sharing technique, the minimum delay does not depend on the size of the transistors on the feedback path. Instead, it is related to the ratio of the capacitances on the output node and the internal node.

[Figure 4.1: An Energy Efficient Surfing Circuit]

[Figure 4.2: Simulation of an XOR Gate with Charge-Sharing Surfing (signals fast, x.T and x.F versus time in ns)]

In the new surfing circuit design, we abandoned the self-reset style. This is for two reasons: first, as mentioned earlier, surfing on the reset path causes extra power consumption; second, self-resetting circuits require aligning the input pulses to assure enough overlap. The chip we used to demonstrate surfing uses similar circuits in the ring to guarantee this. However, circuits with more complex structures need extra attention to ensure this overlap. We use the domino structure here to simplify the design for more complex circuits.

We simulated chains of dual-rail XOR gates implemented with non-surfing, self-resetting domino, the preswitching technique described in chapter 3, and the new charge-sharing approach described here. The new charge-sharing gates are 7% faster than their preswitching counterparts (65ps vs. 70ps forward latency). In table 4.1.A, we summarize the energy per operation consumed by a dual-rail XOR gate with different design styles. Tables 4.1.B and 4.1.C list the energy breakdown for the active and inactive sides respectively. In all three tables, the energy is normalized to 0.1630pJ, the energy of the non-surfing, self-resetting XOR.
We set the propagation delay of the fast signals for the chains of preswitching and charge-sharing surfing XOR gates to be 75ps, 18% faster than the non-surfing XOR chain. Per operation, charge-sharing surfing XOR gates consume only 34% more energy than non-surfing XOR gates, while preswitching surfing consumes roughly 146% more. For preswitching surfing XOR gates, the active and inactive sides consume 0.28 units and 1.1 units of energy, respectively, in the pull-up NMOS gated by the fast signals, which explains most of the extra energy consumption. By using the internal capacitance to charge the load capacitance, charge-sharing surfing XOR gates consume less power than a preswitching gate. The speed-up available by charge sharing depends on the ratio of the internal and load capacitances of the gate. For the XOR chain, each gate is loaded with the input capacitance of two more XOR gates of the same type (i.e. non-surfing, preswitching or charge sharing).

Compared to the non-surfing XOR gates, precharging nodes x.T and x.F uses more power in the charge-sharing design. For the active side, this is because the inputs to the pull-down stack are still high when fast goes low. The fight between the pull-down stack and the precharge PMOS consumes extra power. On the inactive side, the increased energy consumption is used to restore the internal node x.F or x.T to $V_{DD}$. This is mostly done by the keeper. As we expected, the inactive side of the charge-sharing XOR gate consumes much less power than that of the preswitching gate because the feedback path is cut off as soon as an output event is generated. The active side of the preswitching gate consumes 0.5360 units (0.4481+0.0879) of energy to charge the internal dynamic node. However, the active side of the charge-sharing gate spends 0.4530 units of energy to do the same job. This further explains why we abandoned the self-reset design.
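The relative figures above can be combined into Et and Et² comparisons. The sketch below uses only the normalized energies and delays listed in Table 4.1.A for the three XOR styles; the dictionary layout and function name are illustrative, and the charge-sharing ratios are derived here rather than reported in the text.

```python
# (normalized energy per operation, delay in ps), values from Table 4.1.A
GATES = {
    "non-surfing": (1.0000, 92.0),
    "preswitching": (2.4632, 75.0),
    "charge-sharing": (1.3440, 75.0),
}


def relative_metrics(name, baseline="non-surfing"):
    """Return (E, Et, Et^2) of one gate style relative to the baseline style."""
    energy, delay = GATES[name]
    base_energy, base_delay = GATES[baseline]
    e_rel = energy / base_energy
    t_rel = delay / base_delay
    return e_rel, e_rel * t_rel, e_rel * t_rel ** 2


for style in GATES:
    e_rel, et_rel, et2_rel = relative_metrics(style)
    print(f"{style:15s} E={e_rel:.3f}  Et={et_rel:.3f}  Et2={et2_rel:.3f}")
```

With these inputs the preswitching gate comes out at about 1.64 times the baseline Et², in line with the 63% penalty quoted in section 3.4, while the charge-sharing gate falls slightly below the baseline on Et².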
Table 4.1: Energy Comparison of Surfing and Non-surfing Dual-rail XOR Gates

A: Total Energy (see footnote 3)

Item                                                Non-surfing   Preswitching Surfing   Charge-Sharing Surfing
delay (ps)                                          92            75                     75
energy per operation                                1.0000        2.4632                 1.3440
fraction of energy consumed by the active side      0.9994        0.5190                 0.8265
fraction of energy consumed by the inactive side    0.0006        0.4466                 0.0911
fraction of energy used to generate fast signals    0.0000        0.0344                 0.0824

B: Energy Consumption on the Active Side of Dual-rail XOR Gates (see footnote 3)

Item                                Non-surfing   Preswitching Surfing   Charge-Sharing Surfing
precharge                           0.3334        0.4481                 0.4530
keeper                              0.0209        0.0081                 0.0433
inverter to the precharge PMOS      0.0900        0.0879                 NA
inverter to the output node         0.5551        0.4541                 0.6145
pull-up NMOS                        0.0000        0.2802                 NA

C: Energy Consumption on the Inactive Side of Dual-rail XOR Gates (see footnote 3)

Item                                Non-surfing   Preswitching Surfing   Charge-Sharing Surfing
precharge                           0.0000        0.0000                 0.0209
keeper                              0.0005        0.0007                 0.1012
inverter to the precharge PMOS      0.0000        0.0000                 NA
inverter to the output node         0.0001        0.0001                 0.0002
pull-up NMOS                        0.0000        1.0994                 NA

Footnote 3: All energy values are normalized to 0.1630pJ, the energy of a single operation of a non-surfing, dual-rail, self-resetting XOR gate.

To apply our charge-sharing based approach to a realistic function block, we designed a Brent-Kung, carry-lookahead adder using charge-sharing surfing. Figure 4.3 shows the 16-bit version of the adder. Black squares represent PKG (carry propagate, kill, and generate) blocks. The lowest square in each column is grey; grey blocks produce G and K outputs but omit P. Triangles represent buffers. The bold wires depict the critical delay path through the adder.

Our design starts from the domino implementation described by Harris and Sutherland in [16]. Figure 4.4 shows the implementation of the PKG block. Noting that all blocks have a fan-out of two, Harris and Sutherland propose an implementation where all black and grey cells are sized to have the drive capability of a "unit" inverter. Likewise, buffers on the critical path have unit drive capability, and buffers off the critical path have half of this drive. Furthermore, they assume that horizontal wires have a capacitance of one sixth the gate capacitance of a unit inverter per column crossed. HSPICE simulations show delays from 522ps (for 0 + 0) to 650ps (for -1 + 1) to complete a 32-bit add. The energy per addition is roughly 50pJ.

We focus our attention on the design and operation of the PKG (black) cells. The application of our approach to the GK (grey) cells and buffers is similar and straightforward. The propagate, generate, and kill signals are mutually exclusive. Rather than implementing each of them with dual-rail circuits, we implement a triple-rail (a.k.a. "one-hot") gate instead, just as in the domino implementation from [16] (see figure 4.4). Figure 4.5 shows our implementation of the P portion of the PKG block; the implementations of G and K are similar.

We noted that not all cells in the adder are on the timing critical path. In particular, the cells in the initial PKG calculation for the most-significant half of the word, and the cells in the final sum calculation for the least-significant half of the word, are relatively non-critical. We used standard domino circuits for these cells. Figure 4.6 shows the 16-bit version of such a mixed adder. Compared to figure 4.3, to allow for the increased delay spread due to the domino gates, the cells in dashed box A of figure 4.6 are moved down one level and the cells in dashed box C are moved up one level. Because the number of levels of the timing chain remains the same in figure 4.6, the cells in dashed box B are moved down two levels. Compared with the pure surfing design, the mixed adder design yielded a power savings of roughly 6% with no degradation of performance.
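The logic performed by the PKG cells can be sketched at the bit level. The code below is only a functional illustration under stated assumptions: it uses the textbook definitions of propagate, kill and generate and a standard Brent-Kung up-sweep and down-sweep for a power-of-two width with carry-in zero. It does not model the level placement, buffers, or timing of figures 4.3 to 4.6, and all function names are illustrative.

```python
def pkg_bit(a, b):
    """One-hot (P, K, G) code for one bit position of the operands."""
    return (a ^ b, (a | b) ^ 1, a & b)


def pkg_combine(hi, lo):
    """Merge the code of a more-significant group with a less-significant one."""
    p_hi, k_hi, g_hi = hi
    p_lo, k_lo, g_lo = lo
    return (p_hi & p_lo, k_hi | (p_hi & k_lo), g_hi | (p_hi & g_lo))


def brent_kung_prefix(cells):
    """Return prefix codes; entry i covers bits i..0. Width must be a power of two."""
    cells = list(cells)
    n = len(cells)
    d = 1
    while d < n:                                  # up-sweep: groups of 2, 4, 8, ...
        for i in range(2 * d - 1, n, 2 * d):
            cells[i] = pkg_combine(cells[i], cells[i - d])
        d *= 2
    d = n // 4
    while d >= 1:                                 # down-sweep: remaining prefixes
        for i in range(3 * d - 1, n, 2 * d):
            cells[i] = pkg_combine(cells[i], cells[i - d])
        d //= 2
    return cells


def add(a, b, width=16):
    """Add two unsigned integers modulo 2**width using the prefix network."""
    bits = [pkg_bit((a >> i) & 1, (b >> i) & 1) for i in range(width)]
    prefix = brent_kung_prefix(bits)
    assert all(sum(code) == 1 for code in prefix)  # every code stays one-hot
    total = 0
    for i in range(width):
        carry_in = prefix[i - 1][2] if i > 0 else 0  # G of bits i-1..0
        total |= (bits[i][0] ^ carry_in) << i
    return total


print(add(0, 0), add(0xFFFF, 1), add(12345, 23456))
```

The one-hot check inside `add` reflects the observation that P, K and G are mutually exclusive, which is what makes a triple-rail gate a natural fit. The inputs 0 + 0 and 0xFFFF + 1 correspond to the 0 + 0 and -1 + 1 cases mentioned above for a 16-bit word.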
[Figure 4.3: The Brent-Kung Adder (16-bit version; input columns 15 to 0, outputs 15:0 to 0:0)]

[Figure 4.4: A Domino Implementation of the PKG Block from [16]]

[Figure 4.5: The Surfing Circuit for Propagate]

[Figure 4.6: The Mixed Brent-Kung Adder]

[Figure 4.7: The Timing Chain (backbone inverters 1 to 7 shown)]

We now consider the design of the timing chain. Three factors motivate a different design than the one we used in the ring. First, the adder is a linear pipeline: our goal is to minimize the propagation delay, not to support arbitrarily large numbers of iterative calculations. Second, stages in our new surfing design have very low propagation delays, as small as 45ps. It is impractical to design a GasP stage with stage-to-stage delays this small. Third, the adder has long wires, which we avoided in the ring. This requires either resizing the transistors in the PKG cells at each level, or having different gate delays. We followed the example of [16] and chose to keep all PKG cells the same for simplicity. However, this requires having different delays at different stages of the timing chain.

Figure 4.7 sketches our timing chain. We assume a design where the adder is part of a larger pipeline as depicted in figure 2.4. Thus, we have a clock input to the chain, which produces the signals fast_0 ... fast_10.

Table 4.2: Structure of the Backbone in Figure 4.7

Backbone Inverter   Shape Factor   Connection to Chain
1                   33.3/16.7      fast_0, fast_0
2                   16.7/23.3      fast_1
3                   10.7/4         fast_2
4                   36.7/18.3      NA
5                   36/21.7        fast_3
6                   31.3/15.7      NA
7                   26.7/9.3       fast_4
8                   24/12          fast_5
9                   9.3/4          fast_6
10                  30/13.7        NA
11                  27.3/13.7      fast_7
12                  25.3/12.7      fast_8
13                  3.1/2.7        fast_9
14                  8/6            NA
15                  10/2.7         fast_10

To obtain a delay less than
one buffer delay, we constructed a "backbone" chain, with a separate output chain for each fast signal. By carefully sizing the inverters in each chain, we obtain the fine spacing desired for the fast signals. Tables 4.2 and 4.3 list the sizes of inverters in the timing chain for the adder with domino and surfing cells inside.

Table 4.3: Structure of the Spines in Figure 4.7
Columns: Spine; Level 1; Level 2; Level 3; Level 4; Level 5; Delay from fast_{i-1} to fast_i (ps)

fast_0           10/3       36/66.7                                         NA
fast_0 fast_1    20/10      49.3/26.7   178/66.7    26.7/18                 NA
fast_1 fast_2    22.7/11.3  8.3/5.3     16.7/15.3   82/26.7     13.3/12.7   48
fast_2 fast_3    7.7/2.3    8/8         27.3/10     14.7/10.7               48
fast_3 fast_4    12.7/2     12/7.3      46/20       13.3/10                 52
fast_4 fast_5    10/3.7     13.3/10     57.3/23.3   9.3/6.7                 56
fast_5 fast_6    10/3.3     8/2.7       16/8        48/20       10/8        62
fast_6 fast_7    8/4        20/14       67.3/37.3   21.3/12.7               47
fast_7 fast_8    8/4        16.7/11.3   65.3/33.3   13.3/10.7               48
fast_8 fast_9    9.3/6.7    8/2         14.7/7.3    70/26.7     50/25       45
fast_9 fast_10   7.7/4      24/12       96.7/40     26.7/15.3               46
fast_10          6/2.7      13.3/8.7    74.7/20                             45

The charge-sharing surfing adder does not have as tight process parameter, voltage and temperature tracking as the surfing ring described in chapter 3. Three factors contribute to this:

1. The preswitching surfing circuits in the ring use two NMOS devices that fight with each other to create the surfing effect. However, the charge-sharing surfing circuit depends on the ratio of the capacitors at the internal dynamic node and output node.
2. The surfing ring has only XOR and MUX gates. They have the same structure. In the ring, the wires connecting the XOR and MUX gates have almost the same length. However, in the adder, each layer involves PKG cells, KG cells and buffers. Furthermore, as shown in figure 4.3 and figure 4.6, the wire length varies considerably from one layer to another.
Even in the same layer, gates drive different amounts of capacitive load.
3. In the surfing ring, gates with similar structure drive nearly identical loads. Thus, we designed the timing ring to have a similar structure to the XOR ring and therefore to have closely tracking delays. The charge-sharing surfing adder loses these similarities. Delays in the data paths have significant components due to gate and wire capacitances. On the other hand, delays in the timing path are determined primarily by gate capacitances. Thus, the timing chain fails to track variations in the relative weight of gate and wire capacitances in the data path. It might be possible to improve this tracking by deliberately inserting long wires in the timing chain. Due to a lack of design tools to support such wire matching, we have not attempted this.

4.2 Simulation Results

We simulated our design using HSPICE and compared it with the domino design from [16] in terms of energy consumption and delay. We report the results for a 32-bit adder, compared with the domino adder, assuming typical process parameters. To show that our circuits are robust to process parameter variations, we verified the surfing circuits and timing chain using standard "five-corner" simulations.

The surfing adder achieves roughly a 19% reduction in delay compared with the domino adder: the surfing adder has a propagation delay of 524ps compared with the worst-case delay of 650ps for the domino design. As we have emphasized, surfing dramatically reduces the delay spread in pipelines. For the domino adder, the data-dependent variation in delay until the last output bit settles is 128ps. For the surfing adder, the corresponding delay spread is 33ps. The tighter timing bounds of the surfing adder arise from the surfing effect as well as the uniformity of these circuits in the final two stages.
We note that larger spreads occur within the adder due to variations in loading of different gates and the use of inputs that skip over surfing levels. This illustrates how surfing can be used to achieve tight timing bounds, and how the methodology can be relaxed to achieve lower delay or power consumption as desired.

The energy consumption of the surfing adder is actually slightly lower than that of the domino design: the surfing adder consumes 44.5pJ per add compared with 50.4pJ for the domino adder, an 11.7% decrease. We were initially surprised that the surfing design used less energy per operation than its domino counterpart. Then we realized that the delayed fast signals reduce the power consumption caused by the fight between the precharge PMOS and the pull-down stack. Thus, the surfing adder has $Et$ and $Et^2$ metrics of $2.33 \times 10^{-20}\,\mathrm{Js}$ and $1.22 \times 10^{-29}\,\mathrm{Js^2}$, while the domino adder has metrics of $3.28 \times 10^{-20}\,\mathrm{Js}$ and $2.13 \times 10^{-29}\,\mathrm{Js^2}$. Assuming a long-channel model for voltage scaling and comparing by $Et$, our design can achieve the same performance as the domino adder but with a 29% reduction in energy consumption (battery-operated applications would run 41% longer, i.e. $\frac{1}{1-0.29} \approx 1.41$). With HSPICE simulation, we adjusted the power supply voltage to 1.575V to run the mixed adder at roughly 650ps. Per operation, the average energy consumption is 32.3pJ, a 36% reduction compared with the domino adder. Comparing by $Et^2$ under the same assumption, our design can achieve the same performance as the domino adder but with a 43% reduction in energy consumption (battery-powered applications would run 75% longer).

To optimize the speed of the domino design, we used footless circuits like those in the surfing version. This results in short-circuit current during precharge. The same clock controls precharging of all of the domino gates.
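The Et and Et² figures follow from the reported energies and delays, and the battery-life estimates follow from the energy reductions. The sketch below recomputes them; the function and variable names are illustrative, and the run-time estimate simply takes the reciprocal of the remaining energy fraction, an assumption that mirrors the 41% and 75% figures quoted above.

```python
def metrics(energy_j, delay_s):
    """Return (Et, Et^2) in J*s and J*s^2."""
    return energy_j * delay_s, energy_j * delay_s ** 2


def reduction(new, old):
    """Fractional reduction of new relative to old."""
    return 1.0 - new / old


def runtime_gain(frac_reduction):
    """Extra battery run time if energy per operation drops by frac_reduction."""
    return 1.0 / (1.0 - frac_reduction) - 1.0


surf_et, surf_et2 = metrics(44.5e-12, 524e-12)   # surfing adder
dom_et, dom_et2 = metrics(50.4e-12, 650e-12)     # domino adder

print(f"surfing: Et={surf_et:.3e} Js, Et2={surf_et2:.3e} Js^2")
print(f"domino:  Et={dom_et:.3e} Js, Et2={dom_et2:.3e} Js^2")
print(f"delay reduction:  {reduction(524.0, 650.0):.1%}")
print(f"energy reduction: {reduction(44.5, 50.4):.1%}")
print(f"Et reduction:     {reduction(surf_et, dom_et):.1%}")
print(f"Et2 reduction:    {reduction(surf_et2, dom_et2):.1%}")
print(f"run-time gain at 29% and 43%: {runtime_gain(0.29):.0%}, {runtime_gain(0.43):.0%}")
```

The recomputed values round to the percentages given in the text, which is a useful consistency check on the reported metrics.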
While all of the precharge transistors turn on at the same time, the internal nodes of each PKG gate remain fairly low until the inputs to the gate go low. Thus, the precharge propagates across the adder from one level to the next while the gates consume significant short-circuit current. In contrast, the minimal timing uncertainty in the surfing design ensures that very little short-circuit current occurs during precharge. The short-circuit current of the domino design could be reduced by adding footers, but doing so would slow down the adder significantly. For example, an additional delay of 21% is reported in [16]. While this would result in a lower energy design than the surfing circuit, it would certainly be unfavorable by $Et^2$ and probably unfavorable by the $Et$ metric as well. Alternatively, we could use a timing chain as for the surfing design to stagger the precharge signals and reduce the short-circuit current. Of course, the energy for such a timing chain would have to be included in the energy budget for the domino adder. At this point, we have not performed a detailed comparison with alternative implementations of the domino adder.

Of course, with any comparison, there are both caveats and opportunities for improvement. The most critical question that we faced was how to account for the power for clock generation for the domino adder and for creating the fast signals for the surfing one. For the numbers reported above, we used the total power consumption of the surfing adder including its timing chain. For the domino design, we used the power consumption of the adder and of a three-inverter chain with a step-up of four to model the clock tree. We believe that this underestimates the power consumption of the domino design.

While our design shows significant $Et^2$ advantages, we see many opportunities for improvement. We could design the timing chain to propagate the falling edge of the fast signal slower than the rising edge.
This should eliminate the remaining short-circuit current during the reset phase.

4.3 Summary

We proposed a novel surfing technique named charge-sharing surfing. Compared with preswitching surfing, charge-sharing surfing can achieve slightly smaller delay, but with much less power consumption, by reducing the power consumed on the side with no output events. We used the charge-sharing surfing technique to implement a Brent-Kung adder. Compared with the domino counterpart, the surfing adder has 19%, 12%, 29% and 43% reductions in delay, $E$, $Et$ and $Et^2$ respectively. We observed that the surfing adder bounds the delay spread in a 33ps interval. However, for the domino adder, the delay spread is 128ps. Unlike the surfing ring described in chapter 3, which had similar cells and stage-to-stage wires of almost the same length, the Brent-Kung adder design contains different surfing cells and non-uniform stage-to-stage wire lengths. This extends the usage of surfing to circuits with more practical functions.

Chapter 5 Noise Analysis for Surfing Logic

This chapter presents a noise analysis approach that considers arbitrary disturbance waveforms. This approach is more general than traditional methods that model noise with DC offsets or fixed-shape waveforms. In particular, we divide a cycle of operation of a gate into $n$ intervals, and consider disturbances that are stair-step functions over these intervals. By making $n$ suitably large, we obtain good approximations of arbitrary waveforms. We use a metric to measure the input disturbance and the resulting output disturbance. For example, the $l_2$ metric (RMS) on the voltage corresponds roughly to the energy of the disturbance. We define the noise margin to be the smallest, non-zero, input disturbance that results in an output disturbance that is at least as large.
This leads to a formulation of noise-margin analysis as a non-linear optimization problem. Our approach subsumes the rectangular, triangular, and exponential pulse models [24, 11, 30] as special cases.

To solve these optimization problems, we need an efficient way to compute the gradient of the magnitude of the output disturbance with respect to the components of the input disturbance vector. In section 5.1 we introduce a "sensitivity matrix" that we calculate by augmenting the ODE model for the circuit with a matrix for calculating its small-signal response. In addition to enabling the non-linear optimization, the sensitivity matrix allows us to identify propagating modes of small-signal disturbances, which provides insight into the robustness properties of each of the circuits that we analyze.

5.1 Noise-Margin Analysis

[Figure 5.1: Noise-Margin Measurement Circuit]

Figure 5.1 shows our configuration for measuring noise margins. We apply an input waveform, $V_{in}$, to a reference chain (the lower chain in the figure), and we apply a disturbed version to the upper chain. The input disturbance, $\Delta V_{in}$, models noise. Both signals propagate through one or more gates. The difference between the output of the reference chain and that of the disturbed chain is the output disturbance, $\Delta V_{out}$. We include one more buffer or inverter at the output of each chain to account for output loads in real circuits. Likewise, we obtain a realistic input waveform for $V_{in}$ by propagating a pulse through a chain of buffers or inverters until the pulse shape reaches an equilibrium.

We formulate noise-margin analysis as the optimization problem:

\[ \min_{\Delta V_{in}} \|\Delta V_{in}\| \quad \text{s.t.} \quad \|\Delta V_{out}\| = \|\Delta V_{in}\|, \quad \|\Delta V_{in}\| > 0 \qquad (5.1) \]

We divide the time interval over which $\|\Delta V_{in}\|$ and $\|\Delta V_{out}\|$ are calculated into $n$ intervals and model $\Delta V_{in}$ as a stair-step function on these intervals. This allows us to represent $\Delta V_{in}$ with a vector when using numerical optimization techniques.
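The stair-step representation can be sketched as follows. This is an illustrative fragment under stated assumptions: the function names, the example waveform and the time interval are hypothetical, and the magnitude shown is the RMS value of the step amplitudes for equal-width intervals.

```python
import math


def stair_step(amplitudes, t_start, t_end):
    """Return a function of time that is piecewise constant on equal-width intervals."""
    n = len(amplitudes)
    width = (t_end - t_start) / n

    def disturbance(t):
        if t < t_start or t >= t_end:
            return 0.0                      # no disturbance outside the analysis window
        index = min(int((t - t_start) / width), n - 1)
        return amplitudes[index]

    return disturbance


def rms(amplitudes):
    """RMS magnitude of a stair-step disturbance with equal-width steps."""
    return math.sqrt(sum(a * a for a in amplitudes) / len(amplitudes))


def sample_waveform(f, t_start, t_end, n):
    """Approximate an arbitrary waveform f by its values at the n interval midpoints."""
    width = (t_end - t_start) / n
    return [f(t_start + (k + 0.5) * width) for k in range(n)]


# Hypothetical example: a triangular pulse of height 0.2 V on a 0..1 ns window.
def triangle(t):
    return 0.2 * max(0.0, 1.0 - abs(t - 0.5e-9) / 0.25e-9)


steps = sample_waveform(triangle, 0.0, 1e-9, 8)
print(steps, rms(steps))
```

The vector `steps` is the kind of object a numerical optimizer would manipulate: each entry is one component of the input disturbance, and increasing `n` refines the approximation of the underlying waveform.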
By choosing $n$ to be large enough, these stair-step functions can closely approximate arbitrary functions. We note that CMOS logic gates act like low-pass filters and are thus relatively insensitive to the sharp edges of the steps. For simplicity, we use equal-size time intervals.

We note that the framework from equation 5.1 is quite general. Different metrics can be used to reflect different noise models, and constraints can be added to reflect bounds on the maximum instantaneous magnitude of the noise, etc. In the results presented in this thesis, we use an $l_2$ (i.e. RMS) metric to calculate the magnitude of disturbances, and we consider noises such that the instantaneous value of the disturbed input is between the power supply voltage and ground.

We use standard numerical optimization techniques to solve the system from equation 5.1. This presents us with two challenges:

1. How do we know that the optimization procedure finds the global optimum?
2. How can we calculate the gradient of $\|\Delta V_{out}\|$ with respect to $\Delta V_{in}$ efficiently and with sufficient accuracy?

Presently, we address the first issue by starting the optimizer from several different initial conditions. This does not provide a guarantee that we have found the global optimum; however, if the optimizer consistently finds the same optimum, it suggests that the search space may be reasonably smooth.

The second issue is the motivation for calculating a sensitivity matrix. At each step, the numerical optimizer estimates the gradient of $\|\Delta V_{out}\|$ with respect to the components of $\Delta V_{in}$. The optimizer uses this gradient to guide its search for the optimum. Estimating the gradient naively by making a small change to $\Delta V_{in}$ and observing the change to $\Delta V_{out}$ is unacceptable in practice because such a calculation takes the small difference of values that already have significant errors from the integration. Thus, the direct method is slow and extremely inaccurate.
Instead, we define a small-signal sensitivity matrix $S$ with

$$S(j,i) = \frac{\partial \Delta V_{out,j}}{\partial \Delta V_{in,i}}$$

where $i$ and $j$ are indices over the time steps of the disturbance vectors. The gradient of $\|\Delta V_{out}\|$ is easily obtained from $\Delta V_{out}$ and $S$:

$$\nabla \|\Delta V_{out}\| = 2 S^{T} \Delta V_{out} \qquad (5.3)$$

where $\|\Delta V_{out}\| = \Delta V_{out}^{T} \Delta V_{out}$. We obtain $S$ by calculating the small-signal response of the circuit.

The sensitivity matrix, $S$, captures the response of the circuit to small disturbances. If all of the eigenvalues of $S$ have magnitudes that are less than one, then the circuit's behavior is stable in the presence of small disturbances. The eigenvectors of $S$ are the propagating modes for small disturbances. If the corresponding eigenvalue has a magnitude less than one, then this mode dies out in a long chain. We note that most logic gates have at least one eigenvalue that is precisely one: the corresponding eigenvector corresponds to the time derivative of an input event - the disturbance causes a time shift of the event. In other words, small noises can effect time shifts that must be accounted for in the timing margins for the circuit. For designs such as self-resetting domino, such disturbances can reduce the time overlap of input pulses and cause the circuit to fail. For other designs this can lead to a failure of set-up or hold requirements for registers. The sensitivity matrix also enables the use of numerical optimization to analyze the large-signal behavior and robustness of the circuit.

5.1.1 Calculating the Sensitivity Matrix

The remainder of this section presents our procedure for calculating the sensitivity matrix, $S$. Let $V(t)$ be the voltage vector giving the state of the circuit at time $t$. We define the vector $\gamma_{j,i}$ as the derivative of $V(t_j)$ with respect to the value of the stair step for $\Delta V_{in}$ at time $t_i$. We then have

$$S(j,i) = \gamma_{j,i}(out) \qquad (5.4)$$

where $out$ is the index of the output node of the chain. To calculate $\gamma$ we use the small-signal response of the circuit. Let

$$A_{t_j,t_i}(q,p) = \frac{\partial V_{t_j}(q)}{\partial V_{t_i}(p)} \qquad (5.5)$$

where $q$ and $p$ are nodes of the circuit. The matrix $A_{t_j,t_i}$ describes how the state of the circuit is altered at time $t_j$ in response to a small perturbation at time $t_i$. We calculate $A$ by augmenting the differential equation model for the circuit. If $\dot{V} = f(V, V_{in})$, we integrate the system

$$\begin{aligned}
V(t) &= V_0 + \int_0^{t} f(V(u), V_{in}(u))\,du \\
A_{t_j,t_i} &= I + \int_{t_i}^{t_j} \mathrm{Jac}(f, V(u))\,A_{u,t_i}\,du
\end{aligned} \qquad (5.6)$$

where we write $\mathrm{Jac}(f, V(u))$ for the Jacobian of $f$ at $V(u)$. If the circuit has $m$ nodes, adding the calculation of $A$ changes an $m$-variable ODE to one with $m(m+1)$ variables.

Perturbing $V_{in}$ from time $t_i$ to time $t_{i+1}$ affects the voltages on nodes other than $in$. Because we are considering each step of the stair step separately, the calculation of $\gamma_{j,i}$ resets the voltage on node $V_{in}$ to its undisturbed value at time $t_{i+1}$. These observations yield:

$$\begin{aligned}
\gamma_{j,i}(\cdot) &= 0, && \text{if } j \le i \\
\gamma_{i+1,i}(p) &= A_{t_{i+1},t_i}(p, in), && \text{if } p \ne in \\
\gamma_{i+1,i}(in) &= 0 \\
\gamma_{j,i}(\cdot) &= A_{t_j,t_{i+1}}\,\gamma_{i+1,i}, && \text{if } j > i+1
\end{aligned} \qquad (5.7)$$

A brute-force implementation of equation 5.7 requires $O(n^2)$ different $A$ matrices, where $n$ is the number of time points in the analysis. Noting that $A_{t_k,t_i} = A_{t_k,t_j} A_{t_j,t_i}$, we rewrite the last line of equation 5.7 to get

$$\gamma_{j,i} = A_{t_j,t_{j-1}}\,\gamma_{j-1,i}, \quad \text{if } j > i+1 \qquad (5.8)$$

Using this formulation, the entries of $S$ can be calculated with a single integration of the augmented model and $n - 1$ matrix-vector multiplications. The $A$ matrices are $m \times m$ where $m$ is the number of nodes in the circuit. For noise-margin analysis, we typically use models with one or a few gates. Thus, $m$ is small and the time for calculating $S$ is acceptable.

5.2 Circuits

We demonstrate our method for noise-margin analysis by applying it to four circuit design styles: static CMOS, self-resetting domino, output prediction logic and preswitching surfing gates. In the future, we will also apply this method to charge-sharing surfing gates. We use an inverter or buffer as our example in each case.
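The recurrence of equations 5.7 and 5.8 in section 5.1.1, together with the gradient of equation 5.3, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Matlab implementation used for the results: the function names, the list-of-lists matrix layout, and the `step_mats` argument (one precomputed transition matrix $A_{t_{j+1},t_j}$ per time step, which in the thesis comes from integrating the augmented model) are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def sensitivity_matrix(step_mats: List[Matrix], in_idx: int, out_idx: int) -> Matrix:
    """Build S(j, i) = gamma_{j,i}(out) from per-step transition matrices.

    Assumption: step_mats[k] holds A_{t_{k+1}, t_k} as an m x m list of lists.
    """
    n = len(step_mats) + 1  # time points t_0 .. t_{n-1}
    S = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        A = step_mats[i]
        m = len(A)
        # Equation 5.7, second and third lines: seed gamma_{i+1,i} from the
        # input-node column of A_{t_{i+1}, t_i}, with the input node reset to 0.
        gamma = [A[p][in_idx] if p != in_idx else 0.0 for p in range(m)]
        S[i + 1][i] = gamma[out_idx]
        # Equation 5.8: advance one time step per matrix-vector product.
        for j in range(i + 2, n):
            A = step_mats[j - 1]  # A_{t_j, t_{j-1}}
            gamma = [sum(A[p][q] * gamma[q] for q in range(m)) for p in range(m)]
            S[j][i] = gamma[out_idx]
    # Entries with j <= i are left at 0 (equation 5.7, first line).
    return S


def grad_norm(S: Matrix, dv_out: List[float]) -> List[float]:
    """Equation 5.3: the gradient 2 * S^T * dV_out with respect to dV_in."""
    n = len(S)
    return [2.0 * sum(S[j][i] * dv_out[j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    # Toy two-node circuit (made-up numbers): node 0 is the input, node 1 the output.
    A = [[1.0, 0.0], [0.5, 0.5]]
    S = sensitivity_matrix([A, A, A], in_idx=0, out_idx=1)
    for row in S:
        print(row)
    print(grad_norm(S, [0.0, 1.0, 0.0, 0.0]))
```

The sketch keeps only the current $\gamma$ vector for each input step $i$, so it never stores the $O(n^2)$ transition matrices of the brute-force formulation.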
Like other methods for noise-margin analysis, our approach extends directly to more complex gates simply by considering the case when the input under analysis is the enabling input for an output event.

[Figure 5.2: A Static CMOS Inverter]

Figure 5.2 shows a static CMOS inverter. We include static designs to provide a baseline for comparison. The results that we present in section 5.3 are based on the TSMC 0.18 μm bulk CMOS process. All transistors in our designs have a gate length of 0.18 μm. Gate widths are 0.18 μm times the multiplier factor indicated in the schematic. The other process-related parameters are summarized in table 5.1. To make the comparison as fair as possible, we use equivalent transistor sizing for each of the gates. When gates in two different logic styles have transistors that perform the same function, we make those transistors the same size. Of course, each design has some devices which are unique to that style, such as the pull-up NMOS in the preswitching surfing gate. We size those transistors to the best of our ability to be reasonable given the rest of the circuit design. In a chain, we use a gate to drive a copy of itself.

For simplicity, we use a simple, first-order transistor model as presented in figure 5.3 to calculate all drain-to-source currents. As observed in section 3.3, this is a reasonable approximation when $V_{DD} = 1.8$ V.

Figure 5.3: A First-Order Transistor Model [19]

  $v_{th}$   threshold voltage
  $k$        proportionality factor
  $S$        shape factor, $W/L$
  $v_{ds}$   $= v_d - v_s$

$$i_{ds} = \begin{cases}
0, & \text{if } v_{ge} \le 0 \\
\frac{kS}{2}\,v_{ge}^{2}, & \text{if } 0 < v_{ge} < v_{ds} \\
kS\,v_{ds}\left(v_{ge} - \frac{1}{2} v_{ds}\right), & \text{if } v_{ds} < v_{ge}
\end{cases}$$

Table 5.1: Parameters Used in Noise-Margin Analysis

  Name                                               Value
  NMOS process transconductance parameter $k_n$      270 μA/V²
  PMOS process transconductance parameter $k_p$      -90 μA/V²
  supply voltage $V_{DD}$                            1.8 V
  NMOS threshold voltage $V_{th,n}$                  0.4 V
  PMOS threshold voltage $V_{th,p}$                  -0.4 V
  gate capacitance coefficient $C_g$                 0.97 F/m²
  drain capacitance coefficient $C_d$                0.003 F/m

Likewise, we assume that all capacitances are fixed capacitances to ground and use the relationship that $i_c = C\dot{v}$. Satisfying Kirchhoff's current law at each node yields:

$$\dot{v} = -C^{-1} i_{ds}(v) \qquad (5.9)$$

This gives us the ODE model for our circuit. In principle, our approach could be extended to more realistic, industrial-strength models for semiconductor circuits (e.g. BSIM models [19]). Explicit calculation of the Jacobian operator would be much more tedious due to the extra complexity of the model, but we see no fundamental reason why our approach could not be applied to more realistic models.

We applied our noise-margin approach to a preswitching buffer, as shown in figure 5.4. For comparison, we also applied this approach to a self-resetting domino buffer, as described in figure 5.5. OPL gates, as introduced in chapter 2, also demonstrate the timing stability of surfing. We include OPL in the noise-margin analysis to compare the strength of its event attractor with surfing gates. We use an OPL inverter as shown in figure 5.6.

[Figure 5.4: A Chain of Surfing Buffers]

[Figure 5.5: A Self-Resetting Domino Buffer]

[Figure 5.6: Two Output Prediction Logic Inverters]

5.3 Results

We implemented the noise-margin calculation described in the previous section using Matlab and applied it to the circuits described in section 5.2. We present results for both small- and large-signal analysis.

5.3.1 Small Signal Stability

The sensitivity matrix, $S$, characterizes the behavior of the circuits under small disturbances. As described in section 5.1, we determine the equilibrium shape for pulses by using a long initial chain and calculate the sensitivity matrix for a pulse of this shape. The eigenvalues and eigenvectors of this matrix characterize the small-signal sensitivity and stability of the circuit.
In particular, the eigenvectors are the small-signal, propagating modes for the circuit, and the corresponding eigenvalues tell whether the mode grows or dies out in a chain. If the eigenvalue has a magnitude greater than one, then the mode grows as the disturbance propagates through a chain. Conversely, if the eigenvalue has a magnitude less than one, the mode dies out. If all eigenvalues have magnitudes less than one, then the circuit is stable in the presence of small-signal disturbances. In particular, the largest magnitude eigenvalue provides a quantitative measure of the small-signal stability of the circuit.

[Figure 5.7: Eigenvectors for a Static Buffer]

We took chains of static, OPL, self-resetting domino, and preswitching surfing buffers as our examples. The stage-to-stage delays are set to be 20 ps and 63 ps for the OPL and surfing buffer chains respectively. We choose an interval for calculation of the $S$ matrix which contains all of the input and output transitions. We set the time step to be 2 ps and computed the $S$ matrix as described in section 5.1.1.

[Figure 5.8: Eigenvectors for a Self-Resetting Domino Gate]

First, we applied our method to a buffer chain built from static inverters. Figure 5.7 shows the eigenvectors corresponding to the two largest eigenvalues and their corresponding disturbed inputs. The eigenvectors match the time derivative of the rising and falling edges of the input within 2%. The two largest eigenvalues are 0.9972 and 0.9934. We attribute the difference between these values and the predicted value of 1 to artifacts of our time quantization. The next four eigenvalues are 0.0026, 0.0025, $1.25 \times 10^{-5}$, and $1.24 \times 10^{-5}$. These are all real and positive as expected.
We note that due to the limited precision of the device models and the integrator, the smallest two of these values are probably not very significant.

For the self-resetting domino gate the maximum eigenvalue, 0.9907, is also very close to 1 (see figure 5.8). Again, this corresponds to a timing shift of the input pulse. Because the gate is self-resetting, advancing the rising edge also advances the falling edge, and the corresponding eigenvector tracks the time derivative of both edges. The second largest eigenvalue is 0.0404, which affects only the falling edge. We note that the arrival time of the falling edge of the input has a small influence on the falling edge of the output due to short-circuit current while precharging node p.

Figures 5.9 and 5.10 show the largest eigenvalues and their eigenvectors of the sensitivity matrix for an OPL inverter with stage delay equal to 20 ps. In both cases, the largest eigenvalue is less than one, showing that the timing of output events is determined by the clock signal to the gate as well as the arrival time of the inputs. In fact, if an input arrives a little before or after its nominal time, the output will be closer to the target value. Thus, small timing disturbances will disappear as the event propagates through several stages.

[Figure 5.10: Eigenvector for an OPL Buffer with Input = 1]

Analyzing the surfing chain from figure 5.4, the five largest magnitude eigenvalues are: 0.5096, 0.2384, 0.0030, and $(1.5324 \pm 7.9406i) \times 10^{-5}$. Again, we do not regard the last pair of eigenvalues as being significant given the precision of the integrator. In fact, our circuit models are monotonic, and we expect the sensitivity matrix, $S$, to be positive definite.
We believe that the complex values for the small eigenvalues are most likely a side-effect of our approximation of continuous time with a set of discrete samples or of integration errors.

Figure 5.11 shows the eigenvectors for the three largest eigenvalues for the surfing chain. The eigenvector corresponding to the largest eigenvalue, 0.5096, corresponds to a time shift of the pulse. The left, negative peak shifts the rising edge of the pulse later, and the right, positive peak shifts the falling edge later. Because the gate is self-resetting, delaying the rising edge also delays the falling edge. Thus, the positive peak of the eigenvector has a bigger magnitude than the negative peak. The eigenvector for the second largest eigenvalue, 0.2384, shifts the falling edge without disturbing the rising edge. The eigenvector for the third largest eigenvalue, 0.0030, shifts just the rising edge. All of the eigenvalues for the surfing gate have magnitudes that are significantly less than one. This shows the stability of the surfing gate. Intuitively, any small disturbance will decrease by at least a factor of nearly two from one stage to the next in the chain.

[Figure 5.11: Eigenvectors of the Sensitivity Matrix for the Surfing Chain]

Table 5.2 shows the largest eigenvalue for each type of gate. The main disturbance modes for all of these gates can be interpreted as time shifts of the input events. The static logic and self-resetting logic will propagate the time shift along the gates. However, OPL and surfing gates show strong timing stability behavior with all the eigenvalues significantly less than 1.

Table 5.2: Robustness of Different Logics

  Gate Style                                    Largest Eigenvalue
  static inverter                               0.9972
  self-resetting domino buffer                  0.9907
  OPL buffer with input = 0                     0.2413
  OPL buffer with input = 1                     0.2391
  preswitching surfing buffer with input = 1    0.507
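To make the eigenvalues of table 5.2 more concrete, the following sketch estimates how many stages a small disturbance needs before it has decayed by a given factor. It is a back-of-the-envelope illustration, not part of the original analysis: it assumes the disturbance lies entirely along the slowest-decaying mode and that the linear, small-signal model holds at every stage, and the function and dictionary names are hypothetical.

```python
import math

# Largest eigenvalues as listed in table 5.2.
LARGEST_EIGENVALUE = {
    "static inverter": 0.9972,
    "self-resetting domino buffer": 0.9907,
    "OPL buffer, input = 0": 0.2413,
    "OPL buffer, input = 1": 0.2391,
    "preswitching surfing buffer, input = 1": 0.507,
}


def residual_fraction(eigenvalue: float, stages: int) -> float:
    """Fraction of a mode's amplitude left after `stages` gates: |lambda|**stages."""
    return abs(eigenvalue) ** stages


def stages_to_attenuate(eigenvalue: float, factor: float) -> int:
    """Smallest n such that |lambda|**n <= 1/factor (linear estimate only)."""
    lam = abs(eigenvalue)
    if not 0.0 < lam < 1.0:
        raise ValueError("the mode does not decay, so no finite stage count exists")
    if factor <= 1.0:
        return 0
    return math.ceil(math.log(factor) / -math.log(lam))


if __name__ == "__main__":
    for style, lam in LARGEST_EIGENVALUE.items():
        print(f"{style}: {stages_to_attenuate(lam, 100.0)} stages for 100x attenuation")
```

Under these assumptions the surfing and OPL buffers damp a timing disturbance a hundredfold within a handful of stages, whereas the static and self-resetting domino chains need on the order of hundreds to thousands of stages, which is why their time shifts must be absorbed by timing margins instead.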
5.3.2 Noise Margin as Design Aid

Next, we explored the use of our method as a design aid. Figure 5.12 shows how OPL allows a designer to trade timing stability for performance - as the target stage delay decreases, the largest eigenvalue of the sensitivity matrix approaches one and the timing stability decreases. For target delays where this eigenvalue exceeds one, the circuit fails. When the stage delay increases beyond around 30 ps, the OPL chain enters the clock blocking mode, where the throughput of the chain is limited by the clocks. Early arrival of the input does not affect the time of the output event. In clock blocking mode, the OPL gate can reduce the delay for a late input arrival. The timing stability increases as the target stage delay increases. That is why the maximum eigenvalue approaches zero with increasing stage delay.

[Figure 5.12: Largest Eigenvalue versus Stage Delay of OPL Buffer with Input = 1]

For surfing pipelines, performance is determined by the delay of the timing chain. The delay per stage, $\delta_{fast}$, can be any value between $\delta_{f,min,i}$ and $\delta_{f,max,i}$ (see equation 2.2); however, robustness is lost for extreme values in this range. First, we studied the trade-off between stage delay and the robustness of the surfing design as measured by the largest eigenvalue of the sensitivity matrix, $S$. Figure 5.13 shows this trade-off. Prior to this work, we used the midpoint of the minimum and maximum delay as the target value for the stage delay. This plot shows that the maximum small-signal robustness actually occurs at a slightly larger delay.

Figure 5.14 shows the effect of varying the width of the n-channel pull-up transistor in the surfing gate. The plot shows the maximum delay of the gate (with fast low), the minimum delay of the gate (with fast high), and the value of the largest eigenvalue of the sensitivity matrix. Not surprisingly, the minimum delay decreases as the pull-up becomes stronger.
There is a slight growth in the maximum delay due to the extra drain capacitance of the larger transistor. When the pull-up is eliminated (width = 0), there is still a small surfing effect contributed by the keeper transistor controlled by fast. Accordingly, the largest eigenvalue is slightly less than 1, namely 0.913. As the pull-up is made stronger, this eigenvalue decreases, quantifying the increase in the strength of the surfing effect.

[Figure 5.13: The Effects of Varying the Stage Delay]

[Figure 5.14: The Effects of Varying the Width of the N-channel Pull-Up]

5.3.3 Large Signal Stability

The previous results were based on a small-signal analysis for different logic styles. However, digital circuits are highly non-linear and the main concern of practicing designers is robustness in the non-linear, large-signal domain. We now apply our non-linear optimization formulation for noise-margin analysis to the four classes of circuits from section 5.2. In particular, we look for the smallest input disturbance waveform that creates an output disturbance which is at least as large. As noted in section 5.1, when the circuit is analyzed with an input transition, this condition is satisfied by disturbances that are proportional to the time derivative of the input signal; in other words, they shift the time of input events. In this section, we focus on the other case, response to noise in the absence of an input transition. In this case, the noise causes a spurious output event. Figure 5.15 and figure 5.16 show the noise margin for the four different logic styles as a function of the length of the chain.

For the domino gate, we apply the disturbance at the input node as for the other gates; however, we compare the voltages at the internal nodes of the second domino gate and the final gate.
We also did this for the surfing gates with input equal to 0. We found this necessary to obtain convergence of the numerical optimization with the domino gate model. For a domino gate, the falling edge of the output node is controlled by the arrival of the precharge signal, but not by an input disturbance. Thus it is very difficult to satisfy the first constraint for the optimization problem from equation 5.1.

For the static buffer, the strength of the p-channel device is 2/3 that of the n-channel device. This explains why the dynamic noise margin is lower when the input is 0 than when the input is 1: the gate is more susceptible to input disturbances when the output is driven by the weaker p-channel device. The noise margin of the self-resetting domino logic is 36% lower than that of the static design, showing the trade-off of robustness and performance between these two logic families.

[Figure 5.15: Noise-Margin Estimates with Input = 0]

When there is no input event, the noise margin of the surfing buffer is lower than that of the self-resetting buffer chain. We stopped the analysis with 2 surfing gates in the chain. With a gate count greater than 3, the optimization fails to converge because it meets a very steep cliff. We multiplied the disturbance obtained for 2 gates by 1.01 and noted that this noise causes a functional failure of the chain. However, if we multiplied this disturbance by 0.99, the chain operates correctly. Thus we use the noise margin for 2 gates as an approximate noise margin for longer chains. The smaller noise margin for surfing gates when compared with static or self-resetting domino designs has two causes:

• the keeper does not perform as a keeper when fast is 1.
• the pull-up NMOS device facilitates noise propagation when fast is 1.

However, these are necessary for surfing to occur. The noise margin with an input event clearly demonstrates the timing stability of surfing, which increases as the number of gates in the chain increases.

[Figure 5.16: Noise-Margin Estimates with Input = 1]

The OPL chain has the smallest noise margin of all the gates, though it is much faster than the others. It demonstrates the same type of timing stability as exhibited by surfing gates. However, the capture interval for an OPL gate is quite small.

Figure 5.17 plots the delay of an OPL inverter with the assumption that the clock's period is infinite. The dotted curve is the delay with an input equal to 0. If the input goes to 0 too early, the OPL inverter works like a domino gate such that the delay decreases with a slope equal to -1. The later an input transition from 1 to 0 arrives, the deeper the voltage dip on the output side is. This causes the delay to increase to a constant value. The dashed curve plots the delay for an input equal to 1. If there is no voltage dip on the input side, the delay is a constant determined by the sizes of the MOS devices in the OPL gates. If the voltage dip is small enough, this dip only affects the arrival time of the input but not the time of the output. Thus the delay from data input to data output decreases. However, if the voltage dip is deep enough, the delay increases to a constant value. The solid curve is the delay for an OPL buffer. The buffer delay is the sum of the previous two cases. Figure 5.18 shows the curves of delay versus arrival time of inputs for a preswitching buffer and an OPL buffer.
In the interval $[t_3, t_5]$, the OPL chain will not work correctly because the input events cannot catch up with the clock signals. Eventually, the input event will fall into the interval $[t_4, t_5]$, where an input event cannot trigger a corresponding output event and the delay goes to infinity. This also happens in the interval $[t_0, t_1]$. The capture interval $[t_1, t_3]$ for the OPL buffer is much smaller than the capture interval $[t_1, t_4]$ of the preswitching buffer. This explains why the OPL chain has a smaller noise margin. Recall that the sensitivity matrix for OPL has smaller eigenvalues than those for preswitching surfing. This is an example showing that small-signal stability does not necessarily translate into large-signal robustness.

[Figure 5.17: The Delay of an OPL Inverter]

For the static buffer chains, OPL, and the surfing buffer chain, the noise margin increases with the length of the chain. This is because finding minimum input and output disturbances that have the same magnitude does not imply that they have the same shape. Thus, a disturbance that will pass through one stage will not necessarily propagate through a long chain. This effect is especially pronounced for the surfing chain. In fact, we note that, with an input event, the surfing chain appears to have almost no noise margin when a single stage is considered, but is much more robust when a chain of more stages is analyzed.

We ignore the noise-margin curve for the self-resetting domino gate with input = 1. Because self-resetting domino represents a "1" value with a pulse, the optimizer finds the disturbance that shifts the time of the rising input event. This affects the time of both the rising and falling edges of the output and trivially satisfies the optimization condition. The designer would like to know "What is the smallest disturbance that suppresses an output event that should have occurred?"
As just noted, our analysis does not provide an answer to this question. In future work, we plan to examine modifications to our approach that will address this limitation. When the input is equal to 0, the noise margin grows very slowly with the number of gates in the chain. This is also a consequence of the pulse-signalling of the self-resetting design. The optimizer finds a small input pulse that triggers an output pulse. The output pulse propagates through the remaining stages. Of course, this is exactly what the designer wants to know: "What is the smallest disturbance that triggers a spurious output event?"

[Figure 5.18: The Delay with Respect to Arrival Time]

We note that the Matlab optimization function fmincon failed for all four design styles when we attempted the optimization without the sensitivity matrix: the gradient estimates that it obtained by subtracting integration runs with nearby inputs had too much error to allow a productive search. By using the sensitivity matrix $S$ in the optimization, we obtained successful convergence in relatively short times (see figure 5.19). We ran the optimization on a 3 GHz Pentium IV processor with 2 GB of main memory. For the static CMOS and OPL designs we started with a chain consisting of a single gate, and then used the solution from a chain of length $k$ as the initial point for a chain of length $k + 1$. For self-resetting domino, we started with a chain of length 15 and then progressively removed stages. For the surfing design, we started with a chain of length 5 and worked in both directions. In all cases, the initial solution took a few hours. Although the computation time varies significantly depending on the number of gates in the chain, the incremental computation time for adding or deleting a gate is quite similar for the OPL, static, and surfing logic gates. For the OPL gate, the computation time increases dramatically for chains with more than 8 gates.
This is because the optimization reaches the edge of the capture interval for surfing and a small change of input signal will cause a huge change to the output. This causes the optimizer to spend more time satisfying the constraints. For more than 12 gates, the computation time decreases. Currently, we do not know why this happens. For the self-resetting logic, the incremental computation time is relatively small because it approaches the asymptotic scenario very quickly.

[Figure 5.19: Computation Time for Different Logics]

5.4 Summary

We formulated noise-margin analysis as a non-linear optimization problem. By calculating a small-signal sensitivity matrix as a part of integrating the circuit model, we obtain accurate gradient estimates that enable these optimization problems to be solved efficiently. The small-signal model also shows the linkage of timing margins and noise margins in all four logic styles that we considered.

We implemented a proof-of-concept tool and used it to analyze static CMOS, self-resetting domino, preswitching self-resetting domino and output prediction logic (OPL). Our implementation uses a simple, first-order transistor model. We expect that using a more sophisticated model will affect the quantitative details of the analysis, but the basic approach and qualitative results should remain the same. This is a possible topic for future work.

As expected, static designs have the highest noise margin. OPL running at a moderate speed has the lowest dynamic noise margin, but has a form of small-signal timing stability not present in the static and self-resetting domino designs. OPL and preswitching logic demonstrate the timing stability of surfing with the eigenvalues of their sensitivity matrices less than 1.
When the input is low, the preswitching logic shows a lower noise margin than self-resetting domino logic because of the output shift introduced by the fast signal. Compared to OPL, the preswitching logic has a greater dynamic noise margin because of the larger capture interval with preswitching. As we have mentioned in chapter 1, designers can trade noise margin against power consumption and delay. For a fair comparison between circuits of different styles, all three of these figures of merit should be taken into account. Though OPL has the lowest noise margin, it has the smallest delay. It would be more reasonable to compare the noise margins of OPL and preswitching surfing gates running at the same speed. In the future, we will continue working on this.

Currently, our approach cannot be applied to domino gates where the falling edge of the output voltage is controlled by the clock. In future work, we plan to examine modifications to our approach that will address this problem. We also plan to include charge-sharing surfing gates and other dynamic logic circuits such as DyCML in our analysis.

Chapter 6 Conclusions and Future Work

We have presented a working surfing chip, new surfing circuits that greatly reduce the power penalties associated with surfing, and a novel noise-margin analysis approach. These contributions bring surfing closer to being a practical design approach.

Our chip implements a simple, pseudo-random sequence generator. It supports two independent waves of computation in a 12-stage ring without any latches or other storage elements. We have operated the ring for over 48 hours and $2.6 \times 10^{15}$ surfing transfers of data between stages without error. The chip operates correctly over a wide range of power supply voltages. This chip demonstrates that surfing pipelines are possible in the real world.

The chip also demonstrates the main downside of the "preswitching" approach to surfing: power consumption.
We addressed power consumption by introducing a new family of surfing circuits that avoids the short-circuit currents of preswitching. This new technique uses charge sharing to move the output and internal node of a dynamic gate away from the power supply rails to accelerate the generation of an eventual output event. We call this "charge-sharing surfing". Unlike preswitching surfing, whose delay depends on the ratio of the pull-up NMOS and pull-down NMOS in the inverter, the delay of a charge-sharing surfing gate depends on the ratio of the internal and load capacitances. In the future, we will continue our research in this area.

We presented a simulation study wherein we implemented a Brent-Kung carry lookahead adder with domino logic and with our new surfing approach. The surfing adder is 19% faster and uses 11.7% less energy than the domino design. The surfing version outperforms the domino design by a factor of 1.75 according to the metric. We anticipate that focusing on accelerating critical paths will enable further improvements in energy and speed. This should allow designers to use surfing as a general design method without the restriction to regular structures that applies to other wave pipelining approaches.

With the event attractors created by surfing circuits, we can accurately predict the interval in which input events will arrive. Thus we can enable high-speed operation of a gate only in this interval. With this knowledge, we expect to employ surfing to reduce the leakage current, which becomes more and more important as the technology scales down [20].

Other techniques are also available to generate the surfing effect. As shown in chapter 5, OPL gates have surfing capability. However, as described in chapter 5, the small capture interval for OPL gates results in a small noise margin. OPL treats rising and falling edges differently.
Falling output events are accelerated by the surfing effect arising from the initial output dip of the gate. This same dip retards rising output events. For OPL, surfing arises from the sum of the delays of a rising and a falling event from two consecutive stages. The narrowness of the output dip results in a narrow capture interval. At this point, we have not identified any solution to overcome this limitation.

Kursun and Friedman's variable threshold voltage keeper technique [23] also has surfing potential. The delay of a domino gate is affected by the strength of the keeper [22]; thus, it may be possible to modulate the n-well potential for the keeper to induce surfing. Figure 6.1 shows our proposed adaptation of Kursun and Friedman's technique for a surfing buffer. We use the same body bias voltage generator circuit as in figure 2.10 to generate the body bias voltage. The inverter connected to the body bias voltage generator circuit is placed there to be consistent with the previous surfing circuits in chapters 3 and 4, such that the gate's delay is smaller when fast is high than it is when fast is low. A wide fast pulse as in scheme (a) of figure 6.2 effects a weak keeper and low noise margin most of the time. A wide fast pulse also makes the capture interval when the fast signal is low smaller than that when the fast signal is high. As we have mentioned in chapter 5, an asymmetric capture interval may decrease the dynamic noise margin. Thus, a wide fast pulse is not preferred. A narrow fast signal will make the keeper strong and the noise margin high most of the time. Whether a narrow fast signal or a symmetric fast signal is preferable depends on which kind of noise the circuit is more sensitive to: a noise causing a timing shift or a noise causing a voltage shift.
Furthermore, the capacitance between the well and the substrate and the well resistance are large enough in typical processes that it is not clear if this approach can be used at high operating frequencies.

[Figure 6.1: A Self-Resetting Buffer with Variable Threshold Voltage Keeper]

[Figure 6.2: Variable Threshold Voltage Keeper with fast Signals of Different Pulse Width, schemes (a) and (b)]

Dynamically adjusting the power rails is another way to obtain varying transition delay. This may be possible by using a virtual ground as proposed in Dynamic Current Mode Logic (DyCML) [1] to achieve surfing. In the following, we take the DyCML circuit as shown in figure 2.14 as an example. When an input event comes early, the outputs remain unchanged because the clock signal is still low. Conversely, if input events arrive late, then out.T and out.F converge towards an intermediate voltage. This situation continues until the inputs arrive. The later the inputs arrive, the lower the intermediate voltage is. At first, the data delay decreases with a lower intermediate voltage because the high-to-low transition (normally the slower one) is accelerated. If the input is sufficiently late, then the low-to-high transition becomes the critical one, and further delay of the inputs increases the gate delay. Thus, we expect a delay vs. input arrival profile similar to that of OPL. Furthermore, it should be noted that delaying the input events increases the amount of charge transferred to node w before evaluation completes.
This results in a higher final level for the low-going output. This effect can be reduced by increasing the capacitance on node w by using a large transistor for m8. Extensive simulations would be needed to assess the suitability of DyCML or similar designs for use as a surfing logic family.

We presented a method for analyzing the robustness of surfing circuits. Surfing creates event attractors whereby events in the data path are attracted to a fixed relationship to events in the timing path. We analyze these attractors by constructing the small-signal response of the circuit and by using numerical optimization techniques to study the large-signal stability. This analysis shows that surfing circuits are very robust (i.e. highly damped) with respect to disturbances of the input. We presented preliminary results showing how this approach can be applied to the design of surfing pipelines and to noise-margin analysis. Compared with other approaches, our noise-margin analysis technique can verify the timing stability of surfing circuits and demonstrate trade-offs between performance and noise margin.

Our noise-margin analysis is based on finding the smallest input disturbance that produces an output disturbance of the same magnitude by the ℓ2 metric. Assuming that the optimizer finds the global optimum, this condition ensures that any smaller disturbance will not propagate through a chain of logic gates. As noted in chapter 5, for many logic families, this condition is satisfied by a disturbance that is proportional to the time derivative of the input signal; in other words, the disturbance produces a time shift of the input. This is both a feature and a limitation of our approach. On the one hand, this correspondence of critical disturbances and time shifts shows the connection between noise margins and timing margins. In the future, we hope to further develop this connection to find a physically motivated quality metric that combines energy, delay, and robustness.
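The correspondence between a disturbance proportional to the time derivative of the input and a time shift follows from the first-order Taylor expansion x(t + δ) ≈ x(t) + δ·dx/dt. The sketch below checks this numerically for one example. It does not reproduce the optimizer-based analysis of chapter 5; the tanh-shaped edge, the sampling window, and all function names are assumptions chosen only for illustration.

```python
import math


def edge(t, tau=1.0):
    """Assumed smooth input transition (hypothetical waveform)."""
    return math.tanh(t / tau)


def edge_slope(t, tau=1.0):
    """Exact time derivative of edge(t)."""
    return (1.0 - math.tanh(t / tau) ** 2) / tau


def l2_norm(samples, dt):
    """Discrete approximation of the l2 norm of a sampled signal."""
    return math.sqrt(sum(s * s for s in samples) * dt)


def shift_mismatch(delta, t_min=-10.0, t_max=10.0, n=4001):
    """Relative l2 error of modelling a time shift as a derivative disturbance.

    Compares the shifted edge x(t + delta) with x(t) + delta * dx/dt and
    returns ||residual|| / ||disturbance||.
    """
    dt = (t_max - t_min) / (n - 1)
    times = [t_min + i * dt for i in range(n)]
    disturbance = [delta * edge_slope(t) for t in times]
    residual = [edge(t + delta) - (edge(t) + d)
                for t, d in zip(times, disturbance)]
    return l2_norm(residual, dt) / l2_norm(disturbance, dt)


if __name__ == "__main__":
    for delta in (0.02, 0.2):
        print(delta, shift_mismatch(delta))
```

For this assumed edge, the mismatch shrinks roughly in proportion to δ, so a sufficiently small disturbance of this shape is nearly indistinguishable from a pure shift of the edge in time.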
On the other hand, this correspondence has prevented us from making meaningful measures of the robustness of some circuits. If a small input disturbance can cause a time shift of the output that has the same magnitude, then we conclude that the noise margin is very small. In future work, we would like to find ways of identifying other, large-signal noise modes even for circuits where time shifts propagate unattenuated through chains of gates.

By providing a physical implementation, novel circuit designs, and a new approach to noise-margin analysis, we have moved surfing pipelines much closer to being a viable approach for practical designs. Further deployment of surfing will depend mainly on developing adequate CAD tool support, especially in the areas of noise-margin and timing analysis, automated transistor sizing, and timing path synthesis. Our research has identified many of the requirements for such tools and has suggested possible directions for future research to address these challenges.

Bibliography

[1] Mohamed W. Allam and Mohamed I. Elmasry. Dynamic current mode logic (DyCML): A new low-power high performance logic style. IEEE Journal of Solid-State Circuits, 36(3):550-558, March 2001.
[2] A. Alvandpour, P. Larsson-Edefors, and C. Svensson. A leakage-tolerant multi-phase keeper for wide domino circuits. In Proceedings of the IEEE International Conference on Electronics, Circuits and Systems, pages 209-212, September 1999.
[3] S. F. Anderson, J. G. Earle, et al. The IBM system/360 model 91 floating point execution unit. In IBM J. Research and Development, pages 34-53, January 1967.
[4] M. Anis, M. Allam, and M. Elmasry. Impact of technology scaling on CMOS logic styles. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 49(8):577-588, August 2002.
[5] R. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Transactions on Computers, C-31(3):260-264, March 1982.
[6] Wayne P. Burleson, Maciej Ciesielski, et al. Wave-pipelining: a tutorial and research survey. IEEE Transactions on VLSI Systems, 6(3):464-474, September 1998.
[7] Terry I. Chappell, Barbara A. Chappell, et al. A 2-ns cycle, 3.8-ns access 512-kb CMOS ECL SRAM with a fully pipelined architecture. IEEE Journal of Solid-State Circuits, 26(11):1577-1585, November 1991.
[8] Ayoob E. Dooply and Kenneth Y. Yun. Optimal clocking and enhanced testability for high-performance self-resetting domino pipelines. In Proceedings of the Twentieth Anniversary Conference on Advanced Research in VLSI, pages 220-214, March 1999.
[9] Jo C. Ebergen, Scott Fairbanks, and Ivan E. Sutherland. Predicting performance of micropipelines using Charlie Diagrams. In Proceedings of the Fourth International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 238-246, April 1998.
[10] M. J. Flynn, P. Hung, and K. W. Rudd. Deep submicron microprocessor design issues. IEEE Micro, 19(4):11-22, August 1999.
[11] T. Gemmeke and T. G. Noll. A physically oriented model to quantify the dynamic noise margin. In Proceeding of the 30th European Conference on Solid-State Circuits, pages 467-470, September 2004.
[12] Solomon W. Golomb. Shift Register Sequences. Holden-Day, 1967.
[13] N. F. Goncalves and H. De Man. NORA: a racefree dynamic CMOS technique for pipelined logic structures. IEEE Journal of Solid-State Circuits, 18:261-266, June 1983.
[14] Ricardo Gonzalez, Benjamin M. Gordon, and Mark Horowitz. Supply and threshold voltage scaling for low power CMOS. IEEE Journal of Solid-State Circuits, 32(8):1210-1216, August 1997.
[15] Ricardo Gonzalez and Mark Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, 31(9):1277-1284, September 1996.
[16] David Harris and Ivan Sutherland. Logical effort of carry propagate adders. In Proceedings of the 37th Asilomar Conference on Signals, Systems and Computers, volume 1, pages 873-878. IEEE, November 2003.
[17] L. Heller, W. Griffin, J. Davis, and N. Thoma. Cascode voltage switch logic: A differential CMOS logic family. In 1984 IEEE Int'l. Solid State Circuits Conference, pages 16-17, 1984.
[18] Ron Ho, Kenneth W. Mai, and Mark A. Horowitz. The future of wires. In Proceedings of the IEEE, volume 89, pages 490-504, April 2001.
[19] David A. Hodges, Horace G. Jackson, et al. Analysis and design of digital integrated circuits. McGraw Hill, 2003.
[20] N. S. Kim, T. Austin, et al. Leakage current: Moore's law meets static power. Computer, 36(12):68-75, December 2003.
[21] R. H. Krambeck, C. M. Lee, and H. S. Law. High-speed compact circuits with CMOS. IEEE Journal of Solid-State Circuits, SC-17:614-619, June 1982.
[22] R. K. Krishnamurthy, A. Alvandpour, et al. A 130-nm 6-GHz 256x32 bit leakage-tolerant register file. IEEE Journal of Solid-State Circuits, 37:624-632, May 2002.
[23] Volkan Kursan and Eby G. Friedman. Domino logic with variable threshold voltage keeper. IEEE Transactions on Very Large Scale Integration Systems, 11(6):1080-1093, December 2003.
[24] P. Larsson and C. Svensson. Noise in digital dynamic CMOS circuits. IEEE Journal of Solid-State Circuits, 29(6):655-662, June 1994.
[25] Trevor W. S. Lee, Mark R. Greenstreet, and Carl-Johan Seger. Automatic verification of refinement. In Proceedings of the 1994 International Conference on Computer Design, Boston, October 1994.
[26] Alain J. Martin, Mika Nystrom, and Paul I. Penzes. Et2: A metric for time and energy efficiency of computation. In Rami Melhem and Robert Graybill, editors, Power Aware Computing. Kluwer, 2002.
[27] Larry McMurchie, Su Kio, et al. Output prediction logic: A high-performance CMOS design technique. In Proceedings of the 2000 International Conference on Computer Design, pages 247-254, 2000.
[28] B. Moyer. Low-power design for embedded processors. In Proceedings of the IEEE, volume 89, pages 1576-1587, November 2001.
[29] M. Mudge. Power: a first-class architectural design constraint. Computer, 34(4):52-58, April 2001.
[30] K. L. Shepard and K. Chou. Cell characterization for noise stability. IEEE 2000 Custom Integrated Circuits Conference, pages 91-94, 2000.
[31] K. L. Shepard and V. Narayanan. Noise in deep submicron digital design. In Proceedings of the 1996 International Conference on Computer Aided Design, pages 406-411, 1996.
[32] D. Singh, J. M. Rabaey, et al. Power conscious CAD tools and methodologies: a perspective. In Proceedings of the IEEE, volume 83, pages 570-594, April 1995.
[33] Ivan Sutherland and Scott Fairbanks. GasP: A minimal FIFO control. In Proceedings of the Seventh International Symposium on Asynchronous Circuits and Systems, pages 46-53, April 2001.
[34] Ivan E. Sutherland, Robert F. Sproull, and David Harris. Logical Effort: Designing Fast CMOS Circuits. Morgan Kaufmann, 1999.
[35] Ted E. Williams. Analyzing and improving latency and throughput in self-timed pipelines and rings. In TAU 1992 ACM International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, Princeton, NJ, March 1992.
[36] Ted E. Williams and Mark A. Horowitz. A zero-overhead self-timed 160-ns 15-b CMOS divider. IEEE Journal of Solid-State Circuits, 26(11):1651-1661, November 1991.
[37] Anthony J. Winstanley, Aurelien Garivier, and Mark R. Greenstreet. An event spacing experiment. In Proceedings of the Eighth International Symposium on Asynchronous Circuits and Systems, pages 42-51, Manchester, UK, April 2002.
[38] Brian D. Winters and Mark R. Greenstreet. A negative-overhead, self-timed pipeline. In Proceedings of the Eighth International Symposium on Asynchronous Circuits and Systems, pages 32-41, Manchester, UK, April 2002.
[39] Brian D. Winters and Mark R. Greenstreet. Surfing: A robust form of wave pipelining using self-timed circuit techniques. Microprocessors and Microsystems, 27(9):409-419, October 2003.
[40] Hoi-Jun Yoo. A study of pipeline architectures for high-speed synchronous DRAM's. IEEE Journal of Solid-State Circuits, 32:1597-1603, October 1997.
[41] Hoi-Jun Yoo, Kee-Woo Park, and Chang-Ho Chung. A 150MHz 8-banks 256M synchronous DRAM with wave pipelining methods. IEEE International Conference on Solid-State Circuits, pages 250-251, February 1995.
[42] Hongil Yoon, Gi-Won Cha, and Changsik Yoo. A 2.5-V, 333-Mb/s/pin, 1-Gbit, double-data-rate synchronous DRAM. IEEE Journal of Solid-State Circuits, 34:1589-1599, November 1999.
[43] V. Zolotov, D. Blaauw, et al. Noise propagation and failure criteria for VLSI designs. In IEEE/ACM International Conference on Computer Aided Design, pages 587-594, November 2002.

