UBC Theses and Dissertations

Efficient self-timed interfaces for crossing clock domains Chakraborty, Ajanta 2003

Efficient Self-Timed Interfaces for Crossing Clock Domains

by Ajanta Chakraborty
B.Eng., Bhopal Engineering College, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
August 2003
© Ajanta Chakraborty, 2003

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia, Vancouver, Canada

Abstract

With increasing integration densities, large chip designs are commonly partitioned into multiple clock domains. While the computation within each individual domain may be synchronous, the interfaces between these domains often use asynchronous methods. One such approach is the STARI technique [Gre93, Gre95], where a self-timed FIFO compensates for clock skew between the sender and receiver. This dissertation presents implementations of STARI where the FIFO consists of a single handshaking stage. I start with the simplest case, where the sender and receiver operate at exactly the same frequency with an unknown skew. I then generalize this design for links with clocks whose frequencies are rational multiples of each other, clocks whose frequencies are closely matched, and arbitrary clocks.
In each of these cases, the STARI interface can exploit the stability of typical clocks to achieve low latencies and negligible probabilities of synchronization failure using very simple hardware. I have designed and tested a proof-of-concept chip fabricated with the TSMC 0.18 µm CMOS process for the scenario where clocks of different domains are exactly matched in frequency. The tests have demonstrated our claims about the skew tolerance of the design, and I am now in the process of designing the interface for further generalizations.

Contents

Abstract
Contents
List of Figures
Acknowledgements
1 Introduction
  1.1 Multiple Clock Domain Scenarios
  1.2 Contributions
  1.3 Overview
2 Related Work
  2.1 Skew and Jitter
  2.2 Generation and Distribution of Clocks
    2.2.1 Clock Generation
    2.2.2 Clock Distribution
  2.3 Skew Compensation Techniques
    2.3.1 GALS
    2.3.2 Synchronizing Buffers
    2.3.3 Mesochronous Designs
    2.3.4 Plesiochronous Designs
  2.4 STARI
3 MinSTARI: A Single Stage FIFO Interface
  3.1 Description
  3.2 Skew Tolerance
  3.3 Initialization
    3.3.1 Maximum Robustness
    3.3.2 Minimum Latency
4 Implementation and Test Results
  4.1 Implementation
    4.1.1 Design Overview
    4.1.2 Implementation Details
  4.2 Test Results
5 Generalizations: Rational, Close and Arbitrary Clocks
  5.1 Rational Clock Frequency Multiples
  5.2 Plesiochronous Interfaces
  5.3 Arbitrary Clock Frequencies
  5.4 A FIFO Interface
6 Conclusion
Bibliography

List of Figures

1.1 Exactly Matched Clock scenario: A Chip Multiprocessor
1.2 Rationally Related Clock scenario: A "typical" wireless SOC application
1.3 Nearly Matched Clock scenario
2.1 Clock Generation Techniques
2.2 A GALS System
2.3 A GALDS System
2.4 Communication Scheme with Resampling
2.5 Mesochronous Timing
2.6 FIFO with Local Clock Control
2.7 Globally Updated Mesochronous Method
2.8 Plesiochronous Retiming
2.9 Source Synchronous Communication
3.1 Interface as Latch with 2 clock inputs
3.2 The Single Stage FIFO
3.3 Clock Timing For The FIFO
3.4 Latch Controller State Diagram
3.5 A traditional C-element
3.6 The Latch Controller
3.7 Drifting Skew
3.8 Five timing Scenarios
4.1 Design Setup
4.2 LFSR
4.3 Shift Register
4.4 DAC
4.5 Modified Yuan-Svensson Latch
4.6 en generation
4.7 Receiver Shift Register
4.8 Error Detection Circuit
4.9 Test Setup
4.10 Skew Tolerance Window
4.11 Phase Modulation Tolerance at 20MHz
4.12 Phase Modulation Tolerance at 30MHz
5.1 An Interface with Rational Clocks
5.2 Exploiting Periodic jitter
5.3 A Miss Detector
5.4 Receiver Frequency vs. Cycle Time constraint
5.5 Interface for Nearly Matched Clocks
5.6 Interface for Arbitrary Clocks
5.7 Implementing a FIFO interface
5.8 Symmetry in FIFO interface
5.9 Timing scenarios with nearly full FIFO-R

Acknowledgements

This work has been possible through the direct and indirect support of a variety of people. I would like to sincerely thank my supervisor, Dr. Mark Greenstreet, for his unparalleled guidance, encouragement and enthusiasm throughout my stay at UBC as a graduate student. As I was a total beginner in the field of VLSI design, the help of colleagues and friends was invaluable. Among them are Brian Winters, who answered my unending questions and helped with design; Roberto Rosales, who taught me the basics of chip testing and relentlessly helped me with testing; Roozbeh, who helped with various CAD tools; and the entire System-on-Chip lab of the ECE department at UBC, which gave me the opportunity to use their test lab and equipment.
I have no words to express my gratitude to my loving and doting parents, who have taught me to take on new challenges with vigor, and also to my brother, Alexy, and my sister, Jhinuk, without whom I would not have been here today. Thanks, Rinkesh, for always being there for me and for the constant support and encouragement I have received from you.

AJANTA CHAKRABORTY
The University of British Columbia
August 2003

Chapter 1 Introduction

1.1 Multiple Clock Domain Scenarios

As we move into very deep submicron technology, increasing integration densities and clock frequencies drive designers to implement increasing numbers of on-chip clock domains. This keeps the skew in clock and data to a small amount within a domain to ensure reliable transfer of data. As tight timing tolerances cannot be guaranteed between timing domains, communication between domains often takes place at a rate slower than the system clock (e.g., one transfer for every two cycles of the clock) or using some kind of mixed synchronous/asynchronous design. Various multiple clock domain scenarios can be categorized as:

1. Exactly matched clock frequencies
2. Rationally related clock frequencies
3. Nearly matched clock frequencies
4. Arbitrary clock frequencies

Figure 1.1: Exactly Matched Clock scenario: A Chip Multiprocessor
Figure 1.2: Rationally Related Clock scenario: A "typical" wireless SOC application
Figure 1.3: Nearly Matched Clock scenario

1. Exactly matched clock frequencies

In the scenario with exactly matched clocks, as shown in Figure 1.1, all the domains receive their clocks from the same source and are thus operating at exactly the same frequency. This is a typical situation in high-performance designs such as microprocessors for general purpose computers. In these designs, clock and data skew [HN01] arise from a variety of sources [BDM02]. First, scaling trends with decreasing feature sizes decrease gate delays, causing a corresponding increase in clock frequencies. For high performance designs, clock frequencies are further increased by architectural trends favoring deeper pipelines with fewer gates per pipeline stage. Wire delays within a domain with a fixed number of transistors remain relatively constant with scaling. For long wires, the performance gap relative to gates becomes very severe with shrinking feature size.

Thus, although within each domain the clock skews are relatively small, allowing operation at high clock rates, between clock domains skews may be much larger. For example, two domains that are at separate leaves of a clock tree distribution network might communicate with each other. Although circuits in these two domains may be physically adjacent, there may be large, unpredictable phase differences between their clock signals.

Traditionally, designers target clock skews of about 10% of the clock period [KB+01, RM+01, KA+01, IM02, KN+02]. Long wire delays and variations in buffer delay make these targets challenging. Accordingly, designers resort to careful layout [BC+99] and active skew compensation [TR+00]. Likewise, fabrication engineers reduce wiring delays by deploying copper wires [Dav99] to lower resistance and low-k dielectrics [GA+02] to reduce capacitance. These approaches come at a cost: circuit and layout approaches to lowering clock skew often do so at an increase in circuit complexity and power consumption; improvements in materials are limited by physical constants.
An alternate approach is to devise measures which compensate for skew when data is transferred between domains. Source-synchronous designs can be of great use in such situations. The designs presented here are based on one such source-synchronous communication technique, namely the STARI technique. Here, a self-timed FIFO is placed between the communicating domains. The self-timed FIFO is used to compensate for skew between two synchronous systems operating with a common clock. Section 2.4 describes the STARI method in greater detail.

2. Rationally related clock frequencies

In the scenario with rationally related clock frequencies, the clocks of the different domains operate at frequencies that are rational multiples of each other. Figure 1.2 depicts a design where different portions of the chip operate at different frequencies, and the various clocks are derived from a common source. This commonly occurs in system-on-chip designs where different IP blocks may be designed with different target clock frequencies, or in multi-rate digital-signal-processing designs. Although multiple clock frequencies are used, these frequencies are exact rational multiples of each other, and these ratios are typically known in advance or are determined by pre-designed operating modes of the design. Knowing the exact relationship between the various clock frequencies enables the design of an interface that operates with low latency and without synchronization.

3. Nearly matched clock frequencies

As shown in Figure 1.3, in the scenario with nearly matched clocks, the different domains operate with independent sources that are closely matched in frequency. This occurs, for example, in the design of network routers [KP+99] where each line card receives a bit stream with an embedded clock from a different source. Although each stream comes with its own clock, typically these clocks are very closely matched in frequency.
For example, ATM standards specify that bit rates be within one part per million of their nominal values.

4. Arbitrary clock frequencies

Finally, in the scenario with arbitrary clock frequencies, the clocks are derived from independent sources and can have any arbitrary frequencies. While the frequencies may be arbitrary, typical synchronous designs use clocks that are very stable. Thus, the relationship between the clock frequencies changes very slowly over time. This enables the design of an interface where synchronization, while necessary, is not critical to the latency of data transfers.

In this thesis, I present interfaces ensuring reliable communication in all of these multiple clock domain scenarios. The designs described use a self-timed FIFO which has a single handshaking stage and thus is a minimalist version of the original design, namely STARI. The essential observation behind my designs is that clocks for synchronous systems are designed to be extremely stable. Thus, I can design STARI-style interfaces that provide moderate amounts of skew tolerance and dynamically compensate for any long-term drift in skew or frequency. As described above, I present designs where the sender and receiver operate at exactly the same frequency; at frequencies that are rational multiples of each other; at closely matched frequencies; and at arbitrary, relatively stable frequencies. In the remainder of the thesis, I show that these designs are small and can operate at high clock frequencies with low latencies.

1.2 Contributions

In this thesis, I show that multiple clock domains with exactly matched, rationally related, nearly matched and arbitrary frequencies can communicate reliably and efficiently using a single stage FIFO that offers nearly two clock periods of skew tolerance. The contributions of this dissertation can be summarized as:

• Detailed study of the minimalist version of the STARI design along with an analysis of its skew tolerance.
• A novel initialization mechanism that achieves maximum robustness.
• Design of a proof-of-concept chip implementing the minSTARI design (described in Chapter 3) and sufficient test circuitry to verify its functionality.
• Detailed analysis and design of extensions of the basic design to apply to the more general multiple clock domain scenarios.

1.3 Overview

The document is organized as follows. I begin with a brief description of clock skew and jitter in digital designs and why they pose a problem with shrinking die sizes and increasing transistor density. Next, a variety of techniques are described which have been used or suggested to handle clock skew, including the STARI method, which this research extends. In Chapter 3, I describe a minimalist version of the original STARI method, minSTARI, which can achieve a skew tolerance of almost two clock periods. Thus, minSTARI forms the solution for the exactly matched clock scenario, where clocks of different domains are operating at the same frequency. Chapter 4 describes a proof-of-concept chip demonstrating the operation of the design, along with the test results obtained. In the further generalizations of the basic idea, additional circuitry is used to reduce each scenario to the exactly matched clock frequencies scenario, and then minSTARI is used to handle the skew. For example, the design for the rationally related clock frequencies scenario generates a rational approximation of clocks to reduce the design to the matched frequency scenario. Similarly, the designs for the closely matched and arbitrary clock frequencies scenarios use combinations of various other techniques to achieve the same objective. The generalizations are explained in Chapter 5.

Chapter 2 Related Work

2.1 Skew and Jitter

Skew can be viewed conceptually as the uncertainty in the timing of clock or data.
More specifically, clock skew can be defined as the difference in time between simultaneous transitions of the clock within a system [Kat98], which is introduced by the clock distribution system. Clock jitter can be defined as short-term variations of the significant instants of the clock signal from their ideal positions in time. Various factors can contribute to clock skew [BC+99], such as:

1. Process variation between transistors: each buffer stage introduces uncertainties due to process variations. To reduce skew, designers can reduce the number of buffer stages in the clock distribution network.
2. Variation in parameters of the wires used: long wires introduce uncertainties and thus should be avoided.
3. Different sizing of each buffering stage according to the load it has to drive.
4. Presence of adjacent wires and the amount of switching activity between them.
5. Inductive reactances of the wires.
6. And finally, variations in factors such as temperature, power supply voltage, etc.

Skew is becoming a major limiting factor in increasing the global clock frequency. Compensating for skew generally involves introducing complicated architectures or faster logic.

2.2 Generation and Distribution of Clocks

2.2.1 Clock Generation

Three standard approaches to clock generation are Phase Locked Loops (PLLs), off-chip oscillators and Delay Locked Loops (DLLs).

A. PLL based design

In PLL based designs, a low speed reference clock is distributed throughout the chip and PLLs are used to obtain different multiples of the base frequency as required by various sub-components. Clock jitter in this case is typically 5% of the clock cycle [BC+99].

B. Off-Chip Oscillator

In this approach, as the name suggests, an off-chip oscillator is used to generate very stable clock signals.
Synchronizing the clock signal with the system becomes a difficult task, but this approach can achieve very low clock jitter, from as little as 1% of the clock cycle to a maximum of 5% [BC+99].

C. DLL based design

In DLL based designs, the local clock is delayed sufficiently to line up with the edge of the system clock, and thus a DLL can be used for latency correction or for introducing a desired phase difference if needed. DLL based designs form an attractive alternative to PLL based designs due to their better jitter performance, inherent stability and simpler design, but they are difficult to use for frequency synthesis [SH97].

Figure 2.1: Clock Generation Techniques (a. PLL based design; b. Off-Chip Oscillator; c. DLL based design)

2.2.2 Clock Distribution

The topology and technique for clock distribution play a very important part in determining skew and jitter and in the overall performance of the system. Thus, much effort is spent in designing and optimizing clock networks which can balance factors such as skew and clock tree delays. Table 2.1 gives a comparison of some of the standard clock distribution techniques.

Example Implementations of Clock Networks

1. DEC Alpha series: The design is primarily based on mesh or grid techniques, where wires are cross-connected with vertical and horizontal straps in a mesh pattern which keeps the clocks in phase across the whole chip. The Alpha microprocessors require that a substantial capacitive load be driven at high speed while maintaining a fast edge rate. Thus, in the earlier designs, five levels of buffering configured as a tree were used. For example, the AlphaServer 4100 clock distribution system uses a combination of both a balanced H-Tree and a shared output tree to distribute the clock signal [Dam97].
The balanced H-Tree takes care of the fixed load blocks, whereas the shared output tree is used where various module configurations could alter clock loading. Later, for a 600 MHz Alpha processor, a hierarchy of clocks was used where a gridded global clock with a windowpane arrangement of final distributed drivers was used to lower the skew [BB98]. More clocks were derived from these to provide more flexibility. Thus, a very complicated structure was adopted to enhance performance and save power. Similarly, for a processor running at 1.2 GHz, the difficulties in distributing clocks over large areas using low resistance grids are avoided by moving away from single chip-wide clock distribution to multiple phase locked clocks controlling different components of the chip for better skew and jitter control. Networks of various kinds, ranging from P-shaped grids and rectangular X-trees to partial H-trees, were used [XB+01].

Table 2.1: Comparison of Different Clock Distribution Techniques

1. Distributed Buffers
   Technique: Buffers are distributed throughout the chip.
   Advantages: Flexibility, wireability, low power.
   Disadvantages: High sensitivity to process skew.

2. Water Main
   Technique: All the buffering is done at the main clock signal before it is distributed using a wide wire.
   Advantages: Low skew if horizontal flow in the design exists.
   Disadvantages: Not suitable if the interconnect resistance of the buffer is large compared to the buffer output resistance.

3. H-Tree
   Technique: Wires between buffer stages are configured in a balanced, hierarchical "H" wiring pattern. The primary clock driver is connected to the center of the main "H" structure and the clock signal is distributed through the corners following a recursive design.
   Advantages: Good wireability, zero skew for identical loads.
   Disadvantages: Physical layout gets constrained, interconnect capacitance increases due to longer wires, difficult to balance, poor automatic clock routing.

4. Mesh
   Technique: Rooted tree structure of clock buffers with shunt paths in the later stages.
   Advantages: Minimizes interconnect resistance, places branch resistance in parallel, reducing skew.
   Disadvantages: May have high device and wiring process variations, poor wireability.

2. IBM S/390 Microprocessor: This is a 400 MHz CMOS microprocessor which primarily uses tree-like structures. A single clock is distributed from a centrally located on-chip PLL through a single buffer to 580 distribution points [RJDC98]. The distribution is achieved in two levels of balanced H-like trees. The first level tree distributes the central clock to nine buffers, whose outputs are then further distributed in the second level. Using a small number of large buffers reduces skew and jitter from on-chip process variations but results in more complicated wiring networks.

3. Intel "Itanium" Microprocessor: This is another microprocessor designed for running at GHz-level frequencies. It primarily uses programmable deskew circuits while supporting local optimization of the clock distribution network [RT00]. The architecture consists of three components: a balanced tree for distributing the global clock, multiple deskew buffers with balanced tree structures driving the regional clock grids, and multiple local clock buffers tapping these regional grids. A reference clock is also distributed throughout the chip for phase correction.

4. Other techniques: The signal integrity problems due to clock jitter, clock skew and signal reflection have motivated researchers to look into alternative methods of interconnection. Thus, apart from digital interconnect techniques, optical interconnect and RF and microwave interconnect have also appeared in the picture. Optical interconnects have lower power consumption at very high frequencies and good signal integrity properties but are bulky, expensive and difficult to fabricate [RWW+02].
Two major kinds of optical interconnects are based on either free-space technology or guided-wave technology. The intermediate technology between metal-based and optical interconnect is RF and microwave interconnect. My work is focused on the existing approach, i.e. digital interconnects.

In summary, various methods have been devised for proper clock generation and distribution which can lead to stable clock signals and reduced skew. However, in most large VLSI designs, long wires and large numbers of components create large interconnect impedances. Moreover, factors including variation in temperature, power supply voltage, etc., introduce an arbitrary amount of skew in the signals along both the clock and data paths. Thus, it becomes necessary to adopt some kind of clock skew compensation technique.

2.3 Skew Compensation Techniques

Given the challenges of transmitting signals between clock domains, researchers have explored a variety of asynchronous solutions. These range from building completely asynchronous chips [MB+89, ML+97, FEG00, RB+01] to various combinations of synchronous and asynchronous modules in the same design. Here, I focus on the latter approach. The various methods for combining asynchronous and synchronous modules vary according to the requirements that are placed on the clock. At one extreme, GALS (i.e. "Globally Asynchronous, Locally Synchronous") designs make very minimal assumptions about clock timing, effectively turning clocks into bundled completion signals and adding handshaking to clock generation [Cha84]. At the other extreme, "mesochronous" and "plesiochronous" methods rely on exact or nearly exact frequency matching of the clocks [Mes90]. Between these two extremes, synchronizing buffers allow each domain to operate with its own clock, but make minimal assumptions about the relationships between these clocks. My approaches fall squarely in the mesochronous and plesiochronous camps.
I summarize these various approaches below.

2.3.1 GALS

As originally proposed by Chapiro [Cha84], GALS (i.e. "Globally Asynchronous, Locally Synchronous") designs use stoppable clocks to allow synchronous modules to communicate using asynchronous protocols. Each synchronous domain has its own clock generator that consists of a ring oscillator with a handshaking stage. When domain X has a value to send to domain Y, X outputs the value, sends a request to Y and stalls its clock until it receives an acknowledgement. Likewise, when domain Y is prepared to receive a value from domain X, it stalls its clock, waits for a request from X, latches the value, sends an acknowledgement to X and restarts its own clock.

Figure 2.2: A GALS System

Yun and Donohue [YD96] extended Chapiro's approach by adding a mutual-exclusion element to the ring oscillator. This allows each locally synchronous block to continue operating while polling for input from its neighbours. The mutual exclusion element delays the next clock event, if needed, to allow metastability [CM73] arising from the polling to resolve. Yun and Donohue's approach allows GALS designs to be very flexible, and their methods have been extended by several research groups, e.g. [MVF00, SM00, MT+02, SPL02]. Their design is not metastability free, but no clock events occur while metastability is being resolved, which prevents the communicating domains from accepting incorrect data. However, pausing or stretching the clock increases latency in the system.

Chattopadhyay and Zilic [CZ02] extended this work further by designing a GALDS system, which stands for Globally Asynchronous Locally Dynamic System, as shown in Figure 2.3. GALDS is based on the observation that dynamically switching clock frequencies is an effective method of saving power.
Thus, instead of generating a fixed frequency clock signal for the local domains, it uses a local clock controller which dynamically varies the clock frequency according to the requirement. The design uses a unidirectional synchronizer, which is essentially a single-stage asynchronous FIFO, for communication between adjacent clock domains, and a bidirectional synchronizer, which is a combination of two FIFO stages in parallel, for non-adjacent domains. Async/Sync converters, which are essentially single stage synchronizers, are used to interface the synchronous domains with the asynchronous wrapper; they shift asynchronous signals into local clock domains, thereby introducing latency into the system. This design provides less than a clock period to resolve metastability for the control signals, which might not be adequate at higher clock rates. Although the design ensures that data is successfully latched into the receiving domain even when metastable control signals are obtained, this does not ensure proper operation of the system, since this uncertainty in the validity of data introduces inconsistency in the overall operation.

Figure 2.3: A GALDS System

Another communication scheme for mixed timing designs has been proposed by [HN03], where a handshaking protocol is followed between two independent clock domains for asynchronous signals and the asP* protocol [MJ+97] is used for generating signals synchronous to either clock. Figure 2.4 shows the setup of this scheme. The sequence of signals generated is shown by the numbers corresponding to each signal. The metastability caused in receiving external signals, En or Val, is resolved by data resampling. In this method, a fixed time interval is allotted for resolving the metastability and, upon unsuccessful resolution, the process is repeated.
On the receiver end, data corresponding to metastable signals is overwritten by data of a successful attempt.

Figure 2.4: Communication Scheme with Resampling

GALS makes minimal assumptions about clock stability; in fact, GALS discards the stability and low jitter of clocks that are the hallmarks of synchronous design. The frequency stability of traditional clocks allows us to determine the relative phase of two independent clocks thousands or more of cycles in advance. As I describe in Sections 5.2 and 5.3, this predictability enables moving metastability off the latency-critical paths in my designs. Also, in GALS designs, a logic signal with low drive controls the clock, which has a very high fan-out. Thus, a large amount of amplification is required before the signal is fed into the clock control circuit, which introduces latency. Jitter is the variation in the time between successive clock events. This variation directly degrades the performance of synchronous designs, and clock pausing exacerbates jitter. After pausing a clock, the first edge through the ring oscillator and clock buffer will propagate slower than subsequent events [WGG02].

The loss of long-term timing predictability and the increase of jitter are consequences of the GALS approach of converting synchronous designs into asynchronous ones. In this thesis, I show that more efficient designs are achieved by letting synchronous modules be synchronous and using simple asynchronous interfaces to compensate for clock skew and other timing uncertainties.

2.3.2 Synchronizing Buffers

The next step in our taxonomy allows independent, free-running clocks in each domain and makes minimal assumptions about the timing relationships between them.
A common rule-of-thumb for design specifies the use of two or three synchronizing latches whenever a clock domain is crossed [JG93, Chapter 3.11.4]. [JG93] also expresses the clock-to-output delay of a latch as a function of the difference of the setup and the critical switching time of the latch. This motivates increasing the number of stages of the synchronizer with increasing clock rates. Thus, for high-performance designs with a small number of gate-delays per clock period, even longer chains may be needed to achieve acceptably low probabilities of failure. Seizovic [Sei94] recognized that these synchronizations can be pipelined, allowing high throughput even when the time for reliable synchronization is many clock periods. Chelcea and Nowick [CN01] further optimized this approach by noting that synchronizations are only needed for the receiver when the buffer is close to empty and only needed for the sender when the buffer is close to full. All of these approaches still incur worst-case latency due either to buffering latency when the buffer is not nearly empty or to synchronization latency when the buffers are nearly empty. Iyer and Marculescu [IM02] evaluated the performance of a superscalar microprocessor design decomposed using Chelcea and Nowick's FIFOs. Superscalars are particularly sensitive to latency, and Iyer and Marculescu found that the performance penalties arising from the added latency outweighed the power savings for the design that they considered.

2.3.3 Mesochronous Designs

A mesochronous design is marked by multiple clock domains which run at the same frequency but have unknown phase relations between them. Figure 2.5 shows a generic mesochronous design.

Figure 2.5: Mesochronous Timing

Here, the clocks of the different domains are derived from the same clock source and are exactly matched in frequency.
Due to delays in the clock path, however, the phase relation between the clocks is uncertain. Thus, the output data $D_T$ is not synchronized with the receiver's clock $\Phi_R$. Moreover, the delay in the data path exacerbates the uncertainty.

Various techniques have been suggested to handle mesochronous timing. One of them, which reduces or eliminates the synchronization latency of the designs described above by taking advantage of the stability of the clocks, is the STARI method. The stability of the clocks enables prediction of the timing relationship of clocks in different domains well into the future. STARI interfaces [Gre93, Gre95] (or "source-synchronous" interfaces [YH00]) have a common clock source for the sender and receiver, guaranteeing that both operate at the same frequency although the phase difference between the two may be unknown. A FIFO at the receiver is initialized to be roughly half full. During each clock period, the transmitter inserts one item into the FIFO and the receiver removes one item from the FIFO. The FIFO occupancy remains within one of half-full; in particular, overflows and underflows are excluded. This removes the need for testing full and empty conditions and thereby removes the need for synchronization and synchronizers. STARI interfaces are described in more detail in Section 2.4.

A variation of this design has been implemented by Kim and Sridhar [KS96]. Their design, shown in Figure 2.6, uses a self-timed FIFO for insertion and removal of data, but the clock at the receiver end is regulated by a local clock control (LCC) circuit. This LCC waits for a request signal from the FIFO and then starts the receiver clock for removal of data. Synchronization is required for the first datum, and subsequent data are removed every clock cycle.

Figure 2.6: FIFO with Local Clock Control
Thus, instead of using phase detectors, this method relies on adjusting the data arrival time so that it is synchronized at the receiver end. The LCC is implemented through a series of three C-elements (described in Section 3.1), and metastability is resolved through a comparator. This method could suffer from the disadvantages of pausing clocks, similar to GALS, and could have restrictions in high-performance designs [MS01]. Also, the modifications required when multiple domains communicate with the receiver are unclear.

A second variation, called Globally Updated Mesochronous (GUM) design, is provided by [S03]. Here, instead of using a FIFO, clocks with adjustable delays are used in each synchronous domain. A calibration process determines the ideal phase offset between the clocks by measuring the round-trip latency of the data path between the communicating domains. In particular, it starts from an arbitrary initial operating point and slowly increases the delay of a clock signal until it enters a failure zone with respect to the other clock. Once the window of operation is determined, the clock pulse is positioned in the center of the window. This calibration process resembles the dynamic initialization process described in Sections 3.3.1 and 3.3.2. The difference is that the GUM method cannot dynamically account for skew, and thus for factors like variations in temperature and power supply noise, whereas our method can. Furthermore, the initialization process has to be repeated many times to account for any dynamic skew. Moreover, the additional complexity of maintaining an adjustable delay and repeating the careful measurement process several times with each clock can be quite tedious.

Figure 2.7: Globally Updated Mesochronous Method

There has also been an effort to apply mesochronous techniques to on-chip networks [Wik03].
In this technique, the transmitter sends its data and a strobe signal which is kept at half the frequency of the transmitter's clock. The receiver multiplies the strobe signal to generate the clock signal and latches the incoming data using this generated clock. A phase comparator compares the incoming data at every cycle with the receiver's clock and then selects either the receiver clock signal or the receiver clock delayed by half a clock period to trigger a second latch, which generates the final data for the receiver's domain. The comparator requires synchronization and may take an arbitrarily long time to resolve if both selection signals are equally good. My solution for mesochronous designs is a STARI-based technique and is described in more detail in the subsequent chapters.

2.3.4 Plesiochronous Designs

In plesiochronous designs, the sender's and receiver's clocks are generated separately but are closely matched in frequency [Mes90, DDX95]. Accordingly, the relative phase between the sender and receiver changes very slowly.

Figure 2.8: Plesiochronous Retiming

Rather than detecting FIFO-full and FIFO-empty conditions, a plesiochronous interface can include circuitry to detect FIFO-nearly-full or FIFO-nearly-empty conditions. These conditions can be synchronized to the appropriate clock domain with extremely reliable, high-latency synchronizers. [DDX95] is based on data retiming, where two versions of the transmitter's data, original and delayed by half a clock period, are maintained. Based on the timing of the receiver's signal, dynamic mode switching occurs, choosing the version that has more tolerance for frequency mismatch. This method is similar to my work in that it also separates synchronization from the latency-critical path.
In [DDX95], the transmitter is always kept at a slower pace by introducing non-data items, which constitute a fixed percentage of the total bandwidth, to avoid overflow of data; this leads to a performance degradation of 25%. This method requires real-time switching between different input modes and hence leaves a smaller interval in which to complete switching without duplicating or missing data. It also requires prior knowledge of data and non-data elements for successful switching. In my method, real-time switching is avoided by using "near miss" detectors, described in Section 5.2, that enable us to predict overflow and underflow well before they would actually occur.

As an example, consider the case where the sender and receiver clocks are guaranteed to be matched to within 1 ppm (a typical requirement for high-speed networks). Let a "near-empty" detector report the condition that data from the transmitter arrived such that it was available for removal from the FIFO less than 10% of a clock period before the actual removal. With the close clock matching, at least 100,000 clock cycles will elapse before an underflow occurs! Thus, a synchronizer with ten or more stages can be used to report this "near-empty" condition to the receiver without risk of underflow. The receiver can then skip removing a datum in a subsequent clock cycle, and it will be at least one million clock cycles until the next near-empty condition occurs.

Figure 2.9: Source Synchronous Communication

While the synchronization path for flow control signals can have very high latency in a plesiochronous design, the latency of the data path can be kept very low. In Section 5.2 we show how our designs can be used to implement plesiochronous interfaces. Section 5.3 generalizes this to interfaces with arbitrary, stable clocks.
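The margin arithmetic in the 1 ppm example above can be checked with a short calculation. The following is only an illustrative sketch: the function name and the use of exact fractions are assumptions of this example, and it models nothing more than the worst-case phase drift implied by a bound on the frequency mismatch.

```python
from fractions import Fraction

def cycles_until_drift(margin_periods, mismatch_ppm):
    """Minimum number of clock cycles for the relative phase to drift by
    `margin_periods` (measured in clock periods) when the two clock
    frequencies differ by at most `mismatch_ppm` parts per million.
    Hypothetical helper, not part of the thesis."""
    drift_per_cycle = Fraction(mismatch_ppm, 1_000_000)  # periods per cycle
    return Fraction(margin_periods) / drift_per_cycle

# Near-empty threshold of 10% of a clock period, clocks matched within 1 ppm.
to_underflow = cycles_until_drift(Fraction(1, 10), 1)
# Skipping one removal restores one full clock period of margin.
to_next_warning = cycles_until_drift(1, 1)

print(to_underflow)     # 100000
print(to_next_warning)  # 1000000
```

Both numbers are far larger than the ten or so cycles a deep synchronizer needs, which is why the flow-control path can tolerate high latency.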
2.4 STARI

In this section, I describe a particular implementation of a mesochronous interface, STARI, which stands for Self-Timed At Receiver's Input [Gre93]. The n-stage FIFO, placed between clock domains operating at the same frequency but with an unknown phase relation between them, achieves extremely high skew tolerance but with some added latency. My interface, minSTARI, is a minimalist version of this design with reduced latency and is described in the next chapter.

To handle clock skew, self-timed methods have proven to be very effective, but at the additional cost of communication delay. Thus, a technique combining both synchronous and asynchronous approaches can be more effective than either. STARI is such a combination, where an n-stage FIFO is placed between the communicating domains. Figure 2.9 describes a STARI interface. Both the transmitter and receiver derive their clocks, $\Phi_T$ and $\Phi_R$ respectively, from the same clock generator. The delays in the path from the clock generator to any clock signal are assumed to be arbitrary. The transmitter's domain forwards both its data and its clock to the FIFO. Because the forwarded transmitter clock $\Phi_{T'}$ and the receiver clock $\Phi_R$ are exactly matched in frequency, the insertion rate of the FIFO is the same as its removal rate. If the FIFO is initialized to be roughly half full, then throughout operation the occupancy of the FIFO remains roughly half full. Thus, the need to check for overflow and underflow is avoided. The FIFO can be implemented with individual C-elements or using simple latches acting as buffers. Thus, STARI offers clear advantages over purely synchronous or purely asynchronous systems, because it requires neither the absolute synchronization of purely synchronous methods nor the explicit flow control mechanism of purely asynchronous ones.

In this thesis, I present both specializations and generalizations of the original STARI work.
In Chapter 3, I describe a specialization of STARI, focusing on the case where the FIFO consists of a single stage. Such an implementation provides nearly two clock periods of skew tolerance. By optimizing for the single-stage case, I obtained very simple interfaces between the edge-triggered conventions common in synchronous design and the handshaking communication that is characteristic of self-timed designs. I then generalize STARI to relax the requirement of exactly matched clocks at the sender and receiver. I present interfaces where the sender and receiver clock frequencies are rational multiples of each other (Section 5.1), closely matched (Section 5.2), and arbitrary (Section 5.3). All of these designs exploit the long-term stability of clocks to obtain simple interfaces with small latencies.

Chapter 3

MinSTARI: A Single Stage FIFO Interface

This chapter describes a simple implementation of STARI communication where the FIFO has a single stage. As shown in Figure 3.2, the FIFO consists of a single latch and a latch controller that generates a clock for this latch based on the clocks from the transmitter and receiver. To the user, this FIFO appears as a latch with two clock inputs (Figure 3.1). In this chapter and the next, it is assumed that the transmitter and receiver operate at exactly the same frequency; only the relative phase difference is unknown. This is easily achieved if both of their clocks are derived from a common source.

Figure 3.1: Interface as Latch with 2 clock inputs

Figure 3.2: The Single Stage FIFO

Figure 3.3: Clock Timing For The FIFO

3.1 Description

Figure 3.3 depicts the timing for the single-stage FIFO. For simplicity, I assume that the latches are positive-edge-triggered.
My design easily generalizes to other latching styles. For proper operation, the latch controller must generate $\Phi_X$ so as to satisfy the set-up and hold requirements of latch-X and latch-R. To satisfy the requirements of latch-X, the rising edge of $\Phi_X$ must occur at least $t_{\text{set-up}} + t_{\text{prop}}$ (abbreviated $t_s$ in the figure) after the previous $\Phi_T$ event, and at least $t_{\text{hold}} - t_{\text{prop}}$ (abbreviated $t_h$ in the figure) before the next $\Phi_T$ event, where $t_{\text{set-up}}$, $t_{\text{hold}}$, and $t_{\text{prop}}$ denote the set-up time, hold time, and propagation delay of the latches respectively. To satisfy the requirements of latch-R, the rising edge of $\Phi_X$ must occur at least $t_{\text{hold}} - t_{\text{prop}}$ after the previous $\Phi_R$ event, and at least $t_{\text{set-up}} + t_{\text{prop}}$ before the next $\Phi_R$ event. The "exclusion" regions corresponding to these requirements are indicated by cross-hatched regions for $\Phi_X$ in Figure 3.3.

There are two windows of opportunity for generating $\Phi_X$: a rising edge of $\Phi_X$ may occur between a rising edge of $\Phi_T$ and the subsequent rising edge of $\Phi_R$, or between a rising edge of $\Phi_R$ and the subsequent rising edge of $\Phi_T$. I refer to these scenarios according to the last event ($\Phi_T$ or $\Phi_R$) that occurs prior to each $\Phi_X$ event. Thus, if $\Phi_X$ occurs after a $\Phi_T$ event but before the next $\Phi_R$ event, I refer to this situation as "transmitter-last". Likewise, I use "receiver-last" to refer to the other case.

In Figure 3.3, $\delta_{TR}$ denotes the time from the rising edge of $\Phi_T$ to the next rising edge of $\Phi_R$. Likewise, $\delta_{RT}$ denotes the time from the rising edge of $\Phi_R$ to the next rising edge of $\Phi_T$. Let $P$ denote the clock period. Now, let $\gamma_{TR}$ denote the width of the window of opportunity for the transmitter-last scenario, and $\gamma_{RT}$ denote the width of the window of opportunity for the receiver-last case.
We have:

$$
\begin{aligned}
\gamma_{TR} &= \delta_{TR} - 2(t_{\text{set-up}} + t_{\text{prop}}) \\
\gamma_{RT} &= \delta_{RT} - 2(t_{\text{hold}} - t_{\text{prop}}) \\
\Rightarrow \quad \gamma_{TR} + \gamma_{RT} &= \delta_{TR} + \delta_{RT} - 2(t_{\text{set-up}} + t_{\text{hold}}) = P - 2(t_{\text{set-up}} + t_{\text{hold}}) \\
\Rightarrow \quad \max(\gamma_{TR}, \gamma_{RT}) &\geq P/2 - (t_{\text{set-up}} + t_{\text{hold}})
\end{aligned}
$$

In other words, if the clock period is greater than $2(t_{\text{set-up}} + t_{\text{hold}})$, then the window of opportunity for at least one of the transmitter-last or the receiver-last case is non-empty, and the latch controller can generate a clock that ensures proper operation of the interface. In particular, if $\gamma_{TR} > 0$, then the latch controller can safely generate a rising edge $t_{\text{set-up}} + t_{\text{prop}}$ after the rising edge of $\Phi_T$; otherwise, $\gamma_{RT}$ must be positive, and the latch controller can safely generate a rising edge $t_{\text{hold}} - t_{\text{prop}}$ after the rising edge of $\Phi_R$. Section 3.3 shows how the latch controller can be initialized to operate in one of these two scenarios. The remainder of this section considers steady-state operation.

Figure 3.4 shows a finite state machine that implements the operations of the latch controller. One event is output on $\Phi_X$ each time the controller has received an event on $\Phi_T$ and an event on $\Phi_R$. For $\Delta_{TR} > 2(t_{\text{set-up}} + t_{\text{prop,max}})$, the controller starts in state 0. Upon receiving a $\Phi_R$ event, it moves to state R. When the controller receives a $\Phi_T$ event, it moves to state TR. After a delay of $t_{\text{set-up}} + t_{\text{prop}}$, the controller outputs a $\Phi_X$ event and returns to state 0. Likewise, for the case with $P - \Delta_{TR} > 2(t_{\text{hold}} - t_{\text{prop}})$, the controller starts in state 0, moves to state T upon receiving a $\Phi_T$ event, moves to state TR upon receiving a $\Phi_R$ event, and, after a delay of $t_{\text{hold}} - t_{\text{prop}}$, outputs a $\Phi_X$ event and returns to state 0.

The latch controller performs the function of a C-element. A traditional C-element (Figure 3.5) drives its output to the value of its inputs when they agree. When the inputs differ, the output retains its old value [Sei79].
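The state machine of Figure 3.4 can be sketched as an event-level model. This is a minimal sketch under stated assumptions, not the circuit itself: the class name, the event timestamps, and the instantaneous return to state 0 are all hypothetical simplifications, and the output delays and the controller's reset time are ignored.

```python
class LatchController:
    """Event-level model of the latch controller with states 0, T, R, TR.
    Hypothetical illustration; output delays and reset time are ignored."""

    def __init__(self):
        self.state = "0"
        self.outputs = []  # (time, last_input) for every Phi_X event

    def event(self, time, source):
        """Apply one rising clock edge; `source` is 'T' or 'R'."""
        if self.state == "0":
            self.state = source              # first edge of the pair
        elif self.state != source:
            self.state = "TR"                # both inputs have been seen
            self.outputs.append((time, source))
            self.state = "0"                 # idealized: immediate reset
        # a repeated edge on the same input is ignored in this model


# Made-up edge times: the transmitter edges arrive progressively earlier
# relative to the receiver edges (the drift is exaggerated for brevity).
edges = [(0, "R"), (5, "T"), (10, "R"), (12, "T"),
         (19, "T"), (20, "R"), (26, "T"), (30, "R")]

controller = LatchController()
for time, source in edges:
    controller.event(time, source)

print(controller.outputs)
# [(5, 'T'), (12, 'T'), (20, 'R'), (30, 'R')]
```

The first two outputs are transmitter-last and the last two are receiver-last, yet exactly one output is produced per pair of input edges, so no value is dropped or duplicated in this idealized model.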
My designs use an edge-triggered C-element: after detecting rising edges on each input, it generates a pulse on its output. Now, first consider operation in a transmitter-last scenario with $\gamma_{TR} > 0$. Following each rising edge of $\Phi_T$, the latch controller outputs a corresponding rising edge for $\Phi_X$. Then, there will be the next rising event for $\Phi_R$ followed by a rising event for $\Phi_T$ before the controller outputs the next rising edge of $\Phi_X$. Conversely, for the receiver-last case, following each rising edge of $\Phi_X$, the controller sees a rising event for $\Phi_T$ followed by a $\Phi_R$ event before generating the next rising edge for $\Phi_X$. In either case, between producing consecutive rising edges of $\Phi_X$, the latch controller receives rising edges from both $\Phi_T$ and $\Phi_R$.

Figure 3.4: Latch Controller State Diagram

Figure 3.5: A traditional C-element (inputs a and b, output c)

a b | c
0 0 | 0
0 1 | unchanged
1 0 | unchanged
1 1 | 1

In my design, timing is determined by the rising edges of the clocks. Accordingly, I use an edge-triggered, self-resetting [CC+91, SF01] implementation, as shown in Figure 3.6. On a rising edge of $\Phi_T$, transistors m1 and m2 pull node $a_T$ low. The three-inverter chain to the gate of m1 disables the pull-down path shortly after $\Phi_T$ has gone high, to make the circuit edge-sensitive rather than level-sensitive. Likewise, node $a_R$ drops on a rising edge of $\Phi_R$. When both have dropped, node c goes low, which generates a pulse on $\Phi_X$. The low value of node c forms a pull-up path on nodes $a_T$ and $a_R$, which in turn resets node c back to its initial high value. At this point, one cycle of operation is complete and the interface is ready to accept the next set of inputs.

Figure 3.6: The Latch Controller

Delay $\delta_T$ ensures that the delay from a rising edge of $\Phi_T$ to a rising edge of $\Phi_X$ is greater than $t_{\text{set-up}} + t_{\text{prop}}$.
Likewise, delay $\delta_R$ ensures that the delay from $\Phi_R$ to $\Phi_X$ is greater than $t_{\text{hold}} - t_{\text{prop}}$. The keeper inverters on nodes $a_T$ and $a_R$ help in resolving metastability that may occur during initialization and ensure correct operation at arbitrarily low clock frequencies. As promised, the design is extremely simple and requires very little layout area.

3.2 Skew Tolerance

To analyze the skew tolerance of my design, we start with a transmitter-last scenario; proper operation requires $\gamma_{TR} > 0$, which is equivalent to $\delta_{TR} > 2(t_{\text{set-up}} + t_{\text{prop}})$. If the initial time difference, $\delta_{TR,0}$, is greater than this value, then the transmitter may be further delayed by up to $\delta_{TR,0} - 2(t_{\text{set-up}} + t_{\text{prop}})$ without malfunction of the interface.

Figure 3.7 shows what happens starting from a transmitter-last scenario where transmitter events occur progressively earlier due to drift in the skew. In this figure, the transmitter outputs the sequence of values [-1, 0, A, B, C, ...] on node $Q_T$. The values shown for $Q_X$ and $Q_R$ show how transmitter data propagates to the other two latches. For each $\Phi_X$ event, the figure shows a vertical dotted line labeled with the value loaded into latch-X by that event, with arrows from the $\Phi_T$ and $\Phi_R$ events showing the two events that triggered the latch controller. The rising edges of $\Phi_X$ for values B and C are transmitter-last events; for value D, the $\Phi_T$ and $\Phi_R$ events are coincident; and for values E and F, $\Phi_X$ is generated by receiver-last events. In all cases, the latch controller waits until it has received events on both inputs. The relative order of arrival of the rising edges of $\Phi_T$ and $\Phi_R$ does not matter; thus, no synchronization is necessary.

Figure 3.7: Drifting Skew
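The feasibility condition stated at the start of this section follows directly from the window widths of Section 3.1, and a few lines of arithmetic make the delay margin concrete. The sketch below is illustrative only: the function names are hypothetical, the 150 ps and 90 ps set-up and hold times are the simulated values reported in Chapter 4, and the 1000 ps period and 100 ps propagation delay are assumed values chosen purely for the example.

```python
def window_widths(period, delta_tr, t_setup, t_hold, t_prop):
    """Return (gamma_tr, gamma_rt): the widths of the transmitter-last and
    receiver-last windows of opportunity. All times share one unit."""
    delta_rt = period - delta_tr
    gamma_tr = delta_tr - 2 * (t_setup + t_prop)
    gamma_rt = delta_rt - 2 * (t_hold - t_prop)
    return gamma_tr, gamma_rt


def max_transmitter_delay(delta_tr0, t_setup, t_prop):
    """Extra transmitter delay tolerated when starting in transmitter-last
    mode with initial time difference delta_tr0."""
    return delta_tr0 - 2 * (t_setup + t_prop)


# Picoseconds. P and T_PROP are assumed values for illustration only.
P, T_SETUP, T_HOLD, T_PROP = 1000, 150, 90, 100

for delta_tr0 in (300, 600, 900):
    g_tr, g_rt = window_widths(P, delta_tr0, T_SETUP, T_HOLD, T_PROP)
    # The two windows always sum to P - 2 (t_setup + t_hold).
    assert g_tr + g_rt == P - 2 * (T_SETUP + T_HOLD)
    if g_tr > 0:
        margin = max_transmitter_delay(delta_tr0, T_SETUP, T_PROP)
        print(delta_tr0, "transmitter-last feasible, delay margin", margin)
    else:
        print(delta_tr0, "transmitter-last infeasible")
```

With these assumed numbers, an initial difference of 300 ps leaves no transmitter-last window, while 600 ps and 900 ps leave delay margins of 100 ps and 400 ps respectively.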
When the $\Phi_T$ event precedes the $\Phi_R$ event, the interface operates in the receiver-last mode, starting with $\delta_{RT} = P$. The interface continues to operate without dropping a value as long as $\delta_{RT} > 2(t_{\text{hold}} - t_{\text{prop}})$. Starting with an initial time difference of $\delta_{TR,0}$, transmitter events can occur up to $P - \delta_{TR,0}$ time units earlier before the receiver-last scenario occurs. At this point $\delta_{RT} = P$, and the transmitter can occur up to $P - 2(t_{\text{hold}} - t_{\text{prop}})$ time units earlier without malfunction.

To summarize, if the interface starts in a transmitter-last scenario with $\delta_{TR} = \delta_{TR,0}$, then $\Phi_T$ can be delayed with respect to $\Phi_R$ by up to $\delta_{TR,0} - 2(t_{\text{set-up}} + t_{\text{prop}})$ time units, and it can be advanced by up to $2P - \delta_{TR,0} - 2(t_{\text{hold}} - t_{\text{prop}})$ time units without malfunction of the interface. The total width of the interval of relative delays for which the interface operates correctly is $2(P - t_{\text{set-up}} - t_{\text{hold}})$. Equivalent arguments hold starting from a receiver-last scenario. Thus, if the latch set-up and hold window is small relative to the clock period, then my design offers nearly two clock periods of skew tolerance.

In addition to the set-up and hold requirements, node c in the latch controller must return high following the generation of a $\Phi_X$ pulse before the arrival of the next rising edge on $\Phi_T$ or $\Phi_R$. Let $\eta$ be the time between triggering the latch controller and the subsequent return of node c to a high value. Let $\delta_{T'R'}$ and $\delta_{R'T'}$ denote the time from a rising edge of $\Phi_{T'}$ to the next rising edge of $\Phi_{R'}$ and vice versa. As shown in scenario 1 in Figure 3.8, proper operation with transmitter-last requires that $\delta_{T'R'} > \eta$ be satisfied. Similarly, scenario 5 requires that $\delta_{R'T'} > \eta$ be satisfied. At least one of the two modes is feasible if

$$P > 2\eta \qquad (3.2)$$

For my proof-of-concept test chip [CG02], our latch set-up and hold windows were significantly smaller than the latch controller's cycle time.
Thus, equation 3.2 is the critical constraint for my design.

3.3 Initialization

Under many circumstances, $\gamma_{TR}$ and $\gamma_{RT}$ are both positive. When this occurs, the interface can operate in either transmitter-last or receiver-last mode. This section describes two criteria for selecting the "better" mode and initialization procedures to achieve it. We assume that it is acceptable for the interface to drop values, duplicate values, and/or exhibit metastable behavior during initialization. Of course, it must deliver data without error after completion of initialization.

3.3.1 Maximum Robustness

Clock jitter, temperature drift, and other fluctuations cause the skew on physical chips to vary while the chip is operating. Typically, this variation is just as likely to make the transmitter earlier as it is to make it later. To maximize robustness to skew variation, initialization should select the mode that tolerates the largest skew change in either direction. This corresponds to starting in the transmitter-last mode if $\delta_{TR} > \delta_{RT}$ and in the receiver-last mode if $\delta_{TR} < \delta_{RT}$. For example, in Figure 3.8, scenarios 1 and 3 have the same value of $\Delta_{TR}$; while scenario 3 can tolerate substantial changes of the skew in either direction, scenario 1 will fail if $\Phi_R$ arrives any earlier relative to $\Phi_T$. Thus, scenario 3 is the preferred initialization. Similarly, scenarios 2 and 4 have the same $\Delta_{TR}$. Scenario 2 is slightly more robust to later skew variations and will be the preferred initialization for many designs.

An easy way to achieve this is to insert an adjustable delay into the self-reset cycle of the latch controller. If this delay is initially very large, then neither mode is feasible and the latch controller will generate ill-timed clock signals. By gradually decreasing this delay, the circuit will reach a point where exactly one of the two modes is feasible, and after one or two cycles the latch controller will operate stably in that mode.
As the delay is further decreased, the latch controller will remain in the first mode that became feasible. This is the mode with the larger skew margin. Thus, the analog dynamics of our circuit provide a very simple mechanism for initialization.

We employ a training period during which the internal delays of the latch controller are greater than those of normal operation. We do this in our implementation by using a separate ground signal for the latch controller, connected to an internal voltage reference. This voltage sweeps from 1.8 V (equal to Vdd) down to 0 V (normal operation). The controller speeds up during this sweep according to the linear relationship between power supply voltage and speed. When the controller is sufficiently slow, it cannot cycle as fast as the clocks.

Figure 3.8: Five timing Scenarios

Under these conditions, nodes $a_T$ and $a_R$ in Figure 3.6 will still go low in response to their respective clock inputs, and when both go low, the controller will generate a $\Phi_X$ event and return to state 0. However, the controller may miss incoming clock events that occur before the reset is complete. Assume that $\Delta_{TR} < \Delta_{RT}$ as in scenarios 1 and 3, and consider operation at a time during the initialization when the controller takes time $\Delta_{TR}$ to traverse a path from state TR to state 0, as in scenario 1. If the latch controller reaches state TR (Figure 3.4) in response to a $\Phi_R$ event, then it will return to state 0 in time for the next $\Phi_T$ event and will continue to cycle correctly. On the other hand, if the controller reaches state TR in response to a $\Phi_T$ event, then it will return to state 0 after the next $\Phi_R$ event.
It will remain in state 0 until the next $\Phi_T$ event and then transition to state T. With the next $\Phi_R$ event, the controller will move to state TR and continue to cycle properly from there. This corresponds to scenario 3, the more robust initialization, as noted above. Having reached this cycle, the controller will continue to complete all transitions on time with further reductions of its internal delays. Thus, it will remain in the preferred cycle.

Metastable behaviour [CM73] is possible if $\Delta_{TR} \approx P/2$. In this case, the controller can settle to either of two scenarios that are nearly equally robust to future variations in the skew. As with other metastable situations, the probability of remaining in an indeterminate state decays exponentially with time. Accordingly, my circuit can be initialized very reliably, and no metastability occurs after successful initialization. If the interface has a skew tolerance greater than one clock period (i.e., the clock period is large enough), then our initialization method can find a robust operating point for any initial phase difference between the transmitter and receiver. To ensure robust operation, an implementation should either use a "strong keeper" circuit for these inverters or allow extra time for initialization. This is because the metastability that results in intermediate voltage levels on nodes $a_T$ and $a_R$ can settle down to stable values when the keepers provide strong feedback.

3.3.2 Minimum Latency

When both transmitter-last and receiver-last modes are feasible, the transmitter-last mode has a latency that is one clock period less than that of the receiver-last mode. For designs where latency is critical for performance (e.g. [DDX95, IM02]), it may be desirable to select transmitter-last mode whenever possible. The following initialization procedure achieves this behavior:

1. Start the interface running at full speed (no need for the speed adjustment used in Section 3.3.1).
The latch controller will settle into one of its two modes.
2. Wait long enough to ensure that the probability of metastability failures is insignificant.
3. Suppress one transmitter clock event. If the latch controller had been in the transmitter-last mode, it will now see two receiver events before the next transmitter event and continue in transmitter-last mode. On the other hand, had it been in the receiver-last mode, the latch controller will see one receiver event before the next transmitter event and switch to transmitter-last. If $\delta_{T'R'} > \eta$, then the controller will remain in transmitter-last; otherwise, it will miss a receiver event when the controller's internal reset completes after the arrival of a rising edge of $\Phi_{R'}$, and then resume operation in receiver-last.
4. Allow adequate time for the resolution of metastability that can occur if $\delta_{T'R'} \approx \eta$.

As described, this procedure makes no guarantees of robustness when forcing the transmitter-last mode. To provide some robustness against skew drift and clock jitter, the latch controller can be operated with a slight slow-down during this initialization and brought to full speed under normal operation. Alternatively, Section 5.1 describes a near-miss detector circuit that can detect when the controller is close to its limits, in which case the controller can be returned to the receiver-last mode by suppressing a $\Phi_R$ event.

In this chapter, I presented a single-stage FIFO design which can be used to interface two clock domains running at the same frequency but with unknown phase relations between them. The design can work with an arbitrary amount of initial clock skew and can dynamically account for almost two clock periods of skew. It uses very simple hardware and is very robust. Provisions also exist for operating the interface at minimum latency for latency-critical applications.
This design appears to the user as a latch with two clock inputs and can be made part of a standard cell library and used in ASIC design flows.

Chapter 4

Implementation and Test Results

4.1 Implementation

We have designed a proof-of-concept chip for our interface, which we have fabricated using the TSMC 0.18 µm process through CMC, the Canadian Microelectronics Corporation.

4.1.1 Design Overview

The design of the chip is shown in Figure 4.1. The transmitter's domain consists of a Linear Feedback Shift Register (LFSR) which generates a pseudo-random sequence of numbers for data transmission. The transmitter's latch, T, takes the output of the LFSR and forwards it to the intermediate latch X. Both the LFSR and latch T are triggered by the transmitter's clock signal $\Phi_T$. Latch X takes data from latch T and forwards it to latch R. The latch controller takes both $\Phi_T$ and $\Phi_R$ as its inputs and generates en, which triggers latch X. On the receiving end, the shift register SR takes the output from the receiver's latch R and generates the same sequence of data as generated by the transmitter LFSR. More specifically, SR predicts the next expected bit based on the previous eleven bits. The Error Detection circuit takes the output from latch R (obtained data) and SR (expected data), compares them, and generates an error signal if there is a discrepancy. The shift register and DAC modulate the ground signal of the latch controller and thus implement the dynamic initialization process.

Figure 4.1: Design Setup
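The generate-and-predict scheme described above can be sketched in software. This is a behavioural sketch only, not the chip's circuit: the feedback tap position, the all-ones seed, and every function name are assumptions made for illustration. An 11-bit register is assumed because the receiver predicts from the previous eleven bits, and the chosen recurrence corresponds to the widely used polynomial x^11 + x^9 + 1, whose maximal-length sequence has period 2^11 - 1 = 2047.

```python
from collections import deque
from itertools import islice

N_BITS = 11  # register length (the receiver uses the previous eleven bits)
TAP = 2      # assumed tap: s[k] = s[k-11] XOR s[k-2]


def predict_next(last_bits):
    """Predict the next bit from the previous N_BITS bits (oldest first)."""
    return last_bits[0] ^ last_bits[N_BITS - TAP]


def lfsr_stream(seed_bits):
    """Transmitter side: endless pseudo-random bit stream from a nonzero seed."""
    window = deque(seed_bits, maxlen=N_BITS)
    while True:
        bit = predict_next(window)
        window.append(bit)  # the oldest bit falls off the left end
        yield bit


def find_errors(received):
    """Receiver side: indices where a received bit differs from the
    prediction made from the previous N_BITS received bits."""
    return [k for k in range(N_BITS, len(received))
            if received[k] != predict_next(received[k - N_BITS:k])]


def sequence_period(seed_bits):
    """Number of steps until the register contents first repeat."""
    start = tuple(seed_bits)
    window = deque(seed_bits, maxlen=N_BITS)
    for steps, bit in enumerate(lfsr_stream(seed_bits), start=1):
        window.append(bit)
        if tuple(window) == start:
            return steps


seed = [1] * N_BITS                # assumed reset state: all ones
sent = list(islice(lfsr_stream(seed), 100))
received = sent.copy()
received[50] ^= 1                  # inject a single-bit transmission error

print(sequence_period(seed))       # 2047
print(find_errors(sent))           # []
print(find_errors(received))       # [50, 52, 61]
```

In this model a single corrupted bit is flagged three times: once when it arrives and again each time it passes a position that feeds the predictor. That behaviour is the reason a receiver built this way benefits from holding or masking the error signal after the first report.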
4.1.2 Implementation Details

This section describes each component of the design in greater detail.

Transmitter stage

The LFSR consists of a series of Yuan-Svensson latches [YS89] which generate a pattern of length 2047. The choice of latch was primarily based on its ability to operate at high speeds, which enabled the interface to operate at the maximum data rate. As shown in Figure 4.2, this LFSR has a tap at the 3rd cell, which generates the pseudo-random sequence. The reset signal is an external signal which forces a logic high into the LFSR and triggers the pattern generation. A synchronizer is used to synchronize the reset signal with $\Phi_T$.

Figure 4.2: LFSR

Intermediate stage

This consists of the shift register, DAC, latch controller, intermediate latch X, and toggle circuit. The shift register, as shown in Figure 4.3, takes two clock inputs, $\Phi_D$ and $\Phi_U$, and one data input, dr. One bit is shifted into the shift register on every rising edge of $\Phi_D$. Once the shift register has a set of six data values, $\Phi_U$ goes high, generating the outputs C0 through C4 and the gndReset signal. With a proper sequence of data, this circuit can act as a shift register.

Figure 4.3: Shift Register

The DAC (Figure 4.4) takes the output from the shift register and generates a ground signal of the corresponding voltage for the latch controller. This is achieved as follows. The inverter blocks B0 through B4 are of varying gate width, with B0 having the widest gate and B4 the narrowest. This leads to B0 providing the weakest pull-up on the p-channel transistor of the voltage follower. Similarly, B4 provides the strongest pull-up.

Figure 4.4: DAC
This provides the first 24 steps of the gradient on the signal falseGround, which forms the ground signal for the latch controller. The n-channel transistors, which start conducting only when both C3 and C4 are high, provide the next 8 steps of the gradient. The reset signal indicates the end of dynamic initialization; it brings falseGround completely low and allows the latch controller to run at its highest speed.

Using the DAC circuits, the ground signal of the latch controller can be lowered gradually. The resetGnd signal goes high at the end of the sequence, which brings the falseGround signal to its low value. Thus, the latch controller can be initialized at a very slow speed, and it slowly speeds up following the gradient of the ground signal.

The Yuan-Svensson latch, due to its sensitivity to a high data input when the clock input is high, has a rather large set-up time requirement for a low data input and is thus not suitable for latches T, X and R of the design. The modified latch shown in Figure 4.5 avoids the problem by attaching the data input to the pull-down path of the output of both the p-stage and the n-stage of the latch. Thus, if the data goes low and then high while the clock input is high, the output of the latch remains stable, which reduces the large set-up time requirement for low data inputs.

Figure 4.6: en generation
Figure 4.7: Receiver Shift Register

The set-up and hold times for our latches were found to be roughly 150ps and 90ps respectively in simulation studies. These times are much shorter than the delays through the latch control circuitry. I widen the skew tolerance window by delaying the clocks for latch T and latch R (δ_1 and δ_2). With this padding, the skew tolerance of the interface is determined by the minimum cycle time of the latch controller. This cycle time is 340ps. Thus, the skew tolerance window has width 2P − 680ps.
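As a quick numerical check of this expression, the sketch below evaluates the window width at a few clock frequencies. It reads the width as 2P − 2η with η = 340 ps, the simulated controller cycle time quoted above; the function names and the sample frequencies are illustrative assumptions. Under that reading the window equals the clock period at P = 680 ps, i.e. at roughly 1.47 GHz.

```python
def skew_window_ps(period_ps, eta_ps=340.0):
    """Width of the skew tolerance window, 2P - 2*eta, in picoseconds."""
    return 2.0 * period_ps - 2.0 * eta_ps


def period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1.0e6 / freq_mhz


for f in (20.0, 1400.0, 2000.0):
    p = period_ps(f)
    print(f"{f:7.0f} MHz: period {p:9.1f} ps, window {skew_window_ps(p):9.1f} ps")
```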
The skew window is wider than the clock period for a clock frequency of 1400MHz or lower. Under these conditions, the interface can operate with an arbitrary fixed skew.

Initially, the self-resetting latch controller generated pulses on en that were of marginal width for triggering the latches. I did not want to modify the latch controller, as this would increase its cycle time and decrease its skew tolerance. Instead, I used a self-resetting buffer to generate en, as shown in Figure 4.6, and widened the pulse by including sufficient delay in the reset path for the buffer. This buffer takes the c output from the latch controller (see Figure 3.6) and generates the en signal which triggers latch X. Finally, the toggle circuit converts the narrow pulse on en into one transition per pulse to facilitate off-chip observation.

Receiver stage

The receiver stage consists of the receiver's latch R, the shift register SR and an error detection circuit. The shift register SR takes the output of latch R, recvIP, as its input and recreates an ideal pattern of data, which is sent as input to the error circuit. The error detection circuit consists of a series of Yuan-Svensson latches along with an RS latch. As seen in Figure 4.8, the XOR gate generates a high signal whenever the obtained data differs from the expected data. The rest of the latches keep the error signal high for one out of six receiver clock cycles. Thus, an error propagating through the receiver shift register does not produce multiple error reports.

Figure 4.8: Error Detection Circuit

4.2 Test Results

Figure 4.9 shows the structure of the chip. It has a set of six input signals as follows:

1. Φ_T: transmitter's clock
2. Φ_R: receiver's clock
3. reset: resets the LFSR of the transmitter's domain
4. Φ_D, Φ_U and dr: inputs to the ground modulator of the latch controller

The four output signals are:

1. transOP: output data of the transmitter's domain
2. recvIP: input data to the receiver's domain
3. enToggle: toggled version of the signal en
4. error: shows an error when data from the two domains do not agree

Figure 4.9: Test Setup

I tested the chip in the System-on-Chip Lab of UBC. The signal Φ_T was generated by a Rohde & Schwarz SMT03 signal generator with phase modulation ability. The signal Φ_R, generated by an Agilent 8373B synthesized signal generator, also formed the external clock reference for the Agilent 81200. To achieve complete synchronization between Φ_T and Φ_R, a synthesized function generator (SRS DS345) provided a common time base. The parallel BERT, an Agilent 81200, generated the other programmed input signals, namely reset, Φ_D, Φ_U and dr.

With the reset signal going high, I observed the correct pseudo-random sequence on transOP using an oscilloscope. As soon as the initialization process was completed using Φ_U and dr, the recvIP signal exhibited the same pattern at a fixed offset to transOP. This verified the functioning of the latch controller. To verify the functionality of the LFSR, I programmed the Agilent 81200 to generate a pulse with a period of 2047 units, to match the pattern length of the LFSR, and used this pulse to trigger the oscilloscope. Stable data patterns were then observed on transOP and recvIP. Due to a top-level wiring error, the output signal error was shorted, and thus I devised additional methods of detecting errors, as explained later.

To measure the skew tolerance of the design, I phase modulated the transmitter gradually until the interface no longer operated successfully, as indicated by the error detection methods described later.
The observed error-free zone is based on the following principle: at any arbitrary phase offset between Φ_T and Φ_R, the interface can handle a skew of almost two clock periods; but the amount of skew tolerated in each direction, that is, with the transmitter earlier or later than the receiver, is non-uniform. As an example, Figure 4.10 shows two scenarios with different displacements of the skew tolerance window. In the first scenario, Φ_T and Φ_R are exactly matched in phase, and thus Φ_T can happen a clock period earlier or later than the receiver. Thus, an error-free zone of almost two clock periods can be observed in this case. In the second scenario, Φ_T and Φ_R are 180 degrees out of phase. In this case, Φ_T can happen half a clock period later and one and a half clock periods earlier than the receiver. Because Φ_T moves equally in either direction during phase modulation, I observed errors as soon as it hit the closer boundary. Thus, a phase modulation of more than half a clock period resulted in errors. Any phase offset intermediate between these two values resulted in an error-free zone of some intermediate displacement. Thus, the ideal plot would depict a minimum error-free zone at zero phase offset between Φ_T and Φ_R, which would gradually increase with the phase difference until it reaches 180 degrees. Further increase in phase offset would result in a gradual decrease in the error-free zone until it reaches the minimum again.

Figure 4.10: Skew Tolerance Window
Figure 4.11: Phase Modulation Tolerance at 20MHz

I started with Φ_T and Φ_R operating at about 20MHz. The plot obtained is shown in Figure 4.11. As expected, the phase modulation tolerance first increased and then decreased with increasing initial phase offset between transmitter and receiver.
The maximum phase modulation obtained, about 6.2 radians, suggests a skew tolerance of almost two clock periods and implies a cycle time for the latch controller of 662 picoseconds. Similarly, at 30MHz, as shown in Figure 4.12, a maximum phase tolerance of 6.17 radians is obtained, which is once again almost two clock periods.

To detect the boundaries of the error-free zone, and thus determine the skew tolerance of the design, I devised two methods:

Method 1: In this method, the transmitter and receiver are initialized at some phase offset from each other. The outputs transOP and recvIP are observed on an oscilloscope. The transmitter is then phase modulated, with the amount of phase modulation increased gradually. Initially the output from recvIP is stable and does not drift with time. This is possible due to operation of the latch controller in the receiver-last mode. But when the amount of phase modulation is such that it hits one of the boundaries of the error-free zone, the recvIP output drifts by a clock period at every cycle, due to shifts in the mode of operation from receiver-last to transmitter-last as the controller crosses the boundaries of the feasible window of operation.

Method 2: In this method, the Agilent 81200 was controlled programmatically. The output recvIP was captured and observed. A lookup table was constructed internally, consisting of the same pattern of data as produced by the transmitter LFSR. The first 11 bits of any block of captured data indicated its position in the lookup table; the subsequent bits were then compared against the lookup table and any inconsistencies were reported as "errors". Once again the transmitter was phase modulated, with the amount of phase modulation gradually increased. The output of recvIP was attached to the Agilent 81200, and when the amount of phase modulation crossed the feasible range, "errors" were reported by the software, which determined the position of the error-free zone.

This chapter described the implementation details of a proof-of-concept chip demonstrating the functionality of minSTARI. The test data is generated and verified on the chip itself using additional structures. Subsequent sections described the test setup and the various observations and results obtained from testing the silicon. The design has been shown to possess very high skew tolerance at lower frequencies. I am in the process of further testing the interface at higher frequencies, ranging up to 2 GHz. Once the skew tolerance limits are confirmed at higher frequencies, the robustness of the interface will be tested using dynamic initialization.

Chapter 5
Generalizations: Rational, Close and Arbitrary Clocks

This chapter describes extensions of our basic interface which can handle multiple clock domain designs with rational, nearly matched, and arbitrary stable clocks. As the frequencies of the clocks of different domains can differ, data is not transmitted or received at every cycle of the faster clock. Thus, proper flow control is necessary to ensure reliable communication between domains. I start with the scenario where clock frequencies are rational multiples of each other. Sections 5.2 and 5.3 extend the design further to scenarios where clocks are nearly matched in frequency or have arbitrary, stable frequencies.

5.1 Rational Clock Frequency Multiples

Consider the situation depicted in Figure 1.2: the frequencies of the sender's and receiver's clocks are pre-determined rational multiples of each other. Let P_T be the period of the transmitter's clock and P_R be the period of the receiver's clock. Let N_R and N_T be positive and mutually prime with N_R/N_T = P_T/P_R (N_T and N_R correspond to the frequencies of the respective clocks).
I developed my designs assuming N_R > N_T and describe the N_T > N_R case at the end of this section.

Figure 5.1: An Interface with Rational Clocks

Figure 5.1 shows the design for N_R > N_T. By the assumption that the receiver operates at a higher rate than the transmitter, there will be receiver cycles for which no new transmitted data is available. The rate multiplier outputs N_T pulses on node Φ_U to the latch controller for every N_R cycles of the receiver's clock, as shown below:

    sum := 0;
    for each cycle of Φ_R do
        if sum ≥ 0 then
            output a pulse on Φ_U;
            sum := sum + N_T − N_R;
        else
            sum := sum + N_T;
        end if
    od

By analogy with Φ_R and Φ_R', let Φ_U' be the internally delayed version of Φ_U in the latch controller, and let δ_{T'U'} and δ_{U'T'} be defined as δ_{T'R'} and δ_{R'T'}, respectively. As noted above, the rate multiplier introduces periodic jitter into Φ_U with a period of N_R P_R = N_T P_T. Let δ_{U'T',0} denote the time from the rising edge of Φ_U' produced by the (kN_R)th rising edge of Φ_R to the next rising edge of Φ_T', for any integer k. It is straightforward to show:

\[
\begin{aligned}
\min(\delta_{U'T'}) &= \delta_{U'T',0} \\
\max(\delta_{U'T'}) &= \delta_{U'T',0} + P_R - \frac{P_R}{N_T} \\
\min(\delta_{T'U'}) &= P_T - \delta_{U'T',0} - P_R + \frac{P_R}{N_T} \\
\max(\delta_{T'U'}) &= P_T - \delta_{U'T',0}
\end{aligned}
\tag{5.1}
\]

The cycle time constraints of the latch controller can be satisfied if:

\[
\max\left(\delta_{U'T',0},\; P_T - \delta_{U'T',0} - P_R + \frac{P_R}{N_T}\right) > \eta
\tag{5.2}
\]

which holds for any value of δ_{U'T',0} if:

\[
P_T - \left(1 - \frac{1}{N_T}\right) P_R > 2\eta
\tag{5.3}
\]

Figure 5.2: Exploiting Periodic Jitter

For designs where the latch set-up and hold requirements are the dominant constraints, similar bounds for P_T and P_R can be derived. Thus, the one-stage FIFO can interface between synchronous domains operating at different, rationally related frequencies.
Furthermore, the initialization methods described in Section 3.3.1 apply directly to the rational clocks case. Comparing equation 5.3 with equation 3.2 shows that the minimum period required for the slower clock has been increased by the jitter of Φ_U created by the rate multiplier.

The rate multiplier introduces periodic jitter. We can exploit this predictability to increase the robustness of the interface. For every N_R consecutive cycles of the receiver's clock, the variable sum takes on each value in {N_T − N_R, ..., N_T − 1} exactly once. The initial value of sum is arbitrary, and we can use this freedom to increase the skew tolerance of our design.

Figure 5.2 shows the operation of our interface where the transmitter clock frequency is 3/5 that of the receiver. The traces for sum_1 and Φ_{U,1} show the worst-case sequence for sum: with this choice, η < P_R/2 must hold for proper operation. In particular, if the Φ_U event generated when sum_1 transitions from 2 to 0 triggers the latch controller to produce a Φ_X event, then the self-reset cycle must complete in time for the rising edge of Φ_T that occurs P_R/2 later (indicated by the arrow labelled A in the diagram). On the other hand, if this Φ_U event does not trigger the latch controller, then the subsequent Φ_T event must, and the resulting self-reset cycle must complete in time for the next Φ_U event, again P_R/2 later (indicated by the arrow labelled B in the diagram). The traces for sum_2 and Φ_{U,2} show the optimal sequence for sum for the same transmitter and receiver clocks. For this scenario, the critical timing occurs when the rising edge of Φ_U that is produced when sum_2 goes from 2 to 0 triggers a Φ_X pulse. The self-reset of the latch controller must complete prior to the next Φ_U pulse, P_R time units later. Thus, with this choice of the sum sequence, the latch controller can operate at half the rate required by the worst-case sequence.
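The rate-multiplier pseudocode given earlier in this section can be exercised with a short Python sketch. One assumption is made explicit here: the comparison is taken as `sum >= 0`, since that is what reproduces the value set {N_T − N_R, ..., N_T − 1} quoted above and a step from 0 to −2 for N_T = 3, N_R = 5. The function name, the `start` parameter and the return format are illustrative choices, not part of the design.

```python
def rate_multiplier(n_t, n_r, cycles, start=0):
    """Simulate the rate multiplier for `cycles` receiver clock cycles.

    Returns (sums, pulses): the value of `sum` at the start of each cycle and
    whether a pulse is emitted on Phi_U in that cycle.
    """
    total = start
    sums, pulses = [], []
    for _ in range(cycles):
        sums.append(total)
        if total >= 0:          # assumed comparison, see the note above
            pulses.append(True)
            total += n_t - n_r
        else:
            pulses.append(False)
            total += n_t
    return sums, pulses


# Transmitter at 3/5 of the receiver frequency, as in Figure 5.2.
for start in (0, -2):
    sums, pulses = rate_multiplier(3, 5, 10, start)
    print(start, sums, "".join("|" if p else "." for p in pulses))
```

Different starting values of sum give the same periodic pulse pattern shifted in time, which is the freedom the text proposes to exploit.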
I first derive the constraints on P_T and P_R that ensure proper operation for any phase difference between the two clocks assuming the optimal sum sequence. I then describe how our initialization technique from Section 3.3.1 can be adapted to find this sequence.

Regardless of how the sum sequence is chosen, the Φ_U clock has a jitter of (1 − 1/N_T)P_R with respect to an evenly spaced clock with period P_T. The maximally robust sequence for sum centers the Φ_U jitter interval as closely as possible on the Φ_T clock. The interval may be off center by as much as gcd(P_R, P_T) = P_R/N_T due to the discrete set of choices for the sum sequence. From these observations, the smaller of the time from a rising edge of Φ_T' that triggers the latch controller to the next rising edge of Φ_U', or vice-versa, is P_T − P_R/2. It is also possible that a rising edge of Φ_U' triggers the latch controller and that the next input event for the controller is the next rising edge of Φ_U'. The minimum time between two such rising edges is ⌊N_R/N_T⌋P_R. Combining these two constraints yields that there is a feasible sequence for sum such that the cycle time constraints of the latch controller are satisfied as long as:

\[
P_T - \max\left(\frac{1}{2},\; \frac{N_R \bmod N_T}{N_T}\right) P_R > \eta
\tag{5.4}
\]

Figure 5.3: A Miss Detector
Figure 5.4: Receiver Frequency vs. Cycle Time Constraint

Comparing with equation 5.3, we see that choosing the optimal sum sequence can greatly relax the cycle time requirement for the latch controller or, equivalently, greatly increase the robustness of the interface. For example, with P_R = 1ns and P_T = 1.2ns, equation 5.3 (fixed choice for the sum sequence) requires η < 0.1ns. With the optimal choice for the sum sequence, equation 5.4 requires η < 0.7ns, a reduction in the speed required by a factor of 7 for this example.
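The limit for the optimal sequence can be evaluated with a few lines of Python. This is a sketch of the two constraints described above, not a quotation of equation 5.4: the first is P_T − P_R/2, and the second is read as ⌊N_R/N_T⌋ receiver periods, on the assumption that Φ_U pulses always fall on receiver clock edges. The function name is hypothetical.

```python
from fractions import Fraction


def eta_limit_optimal(n_r, n_t, p_r):
    """Upper limit on the controller cycle time with the optimal sum sequence.

    n_r, n_t: mutually prime integers with n_r / n_t = P_T / P_R and n_r > n_t.
    p_r: receiver clock period (any unit); the result is in the same unit.
    """
    p_r = Fraction(p_r)
    p_t = p_r * n_r / n_t
    # Constraint 1: between a Phi_T' edge and the neighbouring Phi_U' edge.
    between_t_and_u = p_t - p_r / 2
    # Constraint 2: between two consecutive Phi_U' edges, taken here as a
    # whole number of receiver periods (an interpretation, see above).
    between_u_and_u = (n_r // n_t) * p_r
    return min(between_t_and_u, between_u_and_u)


# The example from the text: P_R = 1 ns, P_T = 1.2 ns, so N_R/N_T = 6/5.
limit = eta_limit_optimal(6, 5, 1)
print(float(limit))
```

For the 6/5 example the first constraint is the binding one; for ratios such as 5/3 the spacing between consecutive Φ_U pulses becomes the limit instead.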
As another example, Figure 5.4 shows the improved margin obtained by choosing the optimal sum sequence with a fixed transmitter frequency, which in this case is 7 units.

The sequence for sum that maximizes robustness can be selected as part of the initialization of the interface. This approach is based on two observations. First, the optimal sequence works with a larger value of η than any other sequence. Second, we can shift from one sequence to another by adding N_T + 1 (resp. N_T − N_R + 1) to sum instead of N_T (resp. N_T − N_R). Generalizing the initialization technique described in Section 3.3.1, we can start with a large value for η and gradually decrease it. Each time the latch controller fails to reset in time for the next Φ_T' or Φ_U' event, we shift to the next sum sequence. When η is small enough that the latch controller can operate with the optimal sum sequence, but not the others, the rate multiplier will switch from one sequence to the next until it reaches the optimal one. At this point, the latch controller will successfully reset after each cycle in time for the next Φ_T' and Φ_U' events, and the rate multiplier will remain with the optimal sequence.

Figure 5.3 shows my circuit that reports when a rising edge of Φ_T or Φ_U arrives at the latch controller prior to the completion of the controller's internal reset. Such an event is called a "miss", and I call the circuit a "miss detector". A miss occurs if a rising edge is received while the c signal of the latch controller (see Figure 3.6) is still low. Noting that Φ_X is an inverted version of c, we can use the Φ_X signal in the series stacks of transistors that detect such events. The delay of the inverter that produces Φ_X gives our circuit a little extra margin: it also reports "near misses". When a (near) miss occurs, node x goes low and node y goes high. These transitions occur asynchronously with respect to Φ_R.
It is assumed that the pulse width of Φ_X is less than the period of Φ_R, to ensure that the "miss" signal is asserted for only one period of Φ_R. The synchronizer provides a delayed version of y in the receiver's clock domain, and the receiver switches to the "next" sequence on the reported "miss". The synchronizer is only active during initialization and does not contribute to the latency of data transfers under steady-state operation. Accordingly, the synchronizer can have a large latency and a correspondingly minuscule probability of failure.

This section has shown how minSTARI can be extended for use in designs with multiple, rationally related clock frequencies. I have focused on the case where the receiver clock frequency is greater than that of the transmitter. If the transmitter has the higher clock frequency, equivalent designs can be used with the rate multiplier in the transmitter's clock domain.

5.2 Plesiochronous Interfaces

Now consider designs with multiple clock domains with independent clocks that are closely matched in frequency; these are called "plesiochronous" interfaces (see [Mes90, DDX95]). Such designs occur, for example, when the sender and receiver are physically separated (e.g. networks), or when separate clock generators are used to avoid introducing a single point of failure into the design. Typically, the clock frequencies will be matched to within a few parts per million, a tolerance that is easily achieved with crystal oscillators.

Figure 5.5: Interface for Nearly Matched Clocks

With close frequency matching, the relative timing of clock edges at the latch interface changes very slowly. In particular, critical synchronization events occur at a rate corresponding to the difference between the clock frequencies.
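To put a number on how slowly the relative timing drifts, the sketch below computes how many clock cycles pass before the phase between the two clocks moves by a given fraction of a period. It uses the first-order approximation that a frequency mismatch of ε shifts the relative phase by about ε periods per cycle. The margin of one tenth of a period and the mismatch values are illustrative assumptions, and the function name is hypothetical.

```python
from fractions import Fraction


def cycles_to_consume(margin_fraction, mismatch_ppm):
    """Clock cycles for the relative phase to drift by `margin_fraction` of a
    period when the two frequencies differ by `mismatch_ppm` parts per million."""
    drift_per_cycle = Fraction(mismatch_ppm, 1_000_000)  # fraction of a period
    return Fraction(margin_fraction) / drift_per_cycle


for ppm in (1, 10, 100):
    print(ppm, "ppm:", int(cycles_to_consume("0.1", ppm)), "cycles")
```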
We can modify the miss detector circuit from Figure 5.3 to provide one output indicating when a rising edge of Φ_T occurs shortly after the latch controller completes its reset, and another output indicating when the timing of Φ_R is close to the margin. Furthermore, we can use a delayed version of Φ_X so that near-misses will be reported when a significant margin still remains. If these signals indicate, for example, when only 0.1P of margin remains, thousands of cycles remain before an error could actually occur. Thus, we can synchronize these signals to the transmitter and receiver clocks with extremely high reliability, and use the synchronized versions to take appropriate corrective action. For example, if a rising edge of Φ_R occurs less than 0.1P after the latch controller completes its reset, then the receiver can skip clocking the latch controller on a subsequent cycle. This will switch the interface from operating with the rising edge of the transmitter's clock arriving much after the corresponding edge of the receiver's clock to operation where the receiver's clock edge arrives slightly after the transmitter's. Likewise, if a rising edge of Φ_T occurs less than 0.1P after the latch controller completes its reset, then the transmitter can skip sending data and clock on a subsequent cycle. Such protocols are commonly implemented using "stuff bytes" [DDX95], where extra bytes are padded at the end of a data packet, with padding added if the transmitter lags behind or deleted if the receiver is slow. Such protocols are easily implemented in the framework of our latch controller and miss detector.

Although synchronizations are required during operation, the latency of these synchronizations is not critical for the latency of the data path. The data latency for our interface is always less than 2P.
By adding an arbiter to detect when the latch controller is in receiver-last mode with enough margin to be able to safely switch to transmitter-last, the worst-case data latency can be reduced to slightly greater than P, with an average latency slightly greater than P/2.

Figure 5.6: Interface for Arbitrary Clocks

5.3 Arbitrary Clock Frequencies

We now consider the case where the transmitter and receiver operate with independent clocks at arbitrary frequencies. Initially, it might seem that such a design requires the overhead of synchronizing buffers as described in Section 2.3.2. However, clock frequencies are extremely stable in nearly all synchronous designs. We can exploit this stability even if the frequencies are not known in advance.

We combine our designs from the previous two sections to support communication with arbitrary clock frequencies. Firstly, the transmitter and receiver forward their clocks to each other. Each uses a counter to produce an initial estimate of the clock frequency of the other. These estimates provide a rational approximation of the ratio of the two clock frequencies. If the nominal clock frequencies are known in advance, this step can be skipped. Secondly, if the receiver's clock frequency is higher than that of the transmitter, it uses a rate multiplier to create an approximation of the transmitter's clock. Likewise, the transmitter uses a rate multiplier if it has the higher clock frequency. The latch controller operates with the (possibly rate-multiplied) clocks provided by the transmitter and receiver. Because the frequency values that we have for the two clocks are only approximations, albeit very accurate ones, the FIFO will be prone to occasional underflow or overflow.
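The first step above, turning counter values into a rational approximation of the frequency ratio, could be sketched as follows. This is only an illustration: the window length, the bound on the denominator, and the use of Python's `Fraction.limit_denominator` to pick the nearest small fraction are assumptions made for the example, and the function name is hypothetical. The text does not specify how the approximation is chosen.

```python
from fractions import Fraction


def estimate_ratio(own_cycles, other_edges, max_denominator=16):
    """Approximate f_other / f_own by a fraction with a small denominator.

    own_cycles: length of the counting window, in cycles of the local clock.
    other_edges: rising edges of the forwarded clock seen in that window.
    """
    measured = Fraction(other_edges, own_cycles)
    return measured.limit_denominator(max_denominator)


# A receiver counting a transmitter running at roughly 3/5 of its frequency;
# the count may be off by one because the clocks are not aligned.
for edges in (599, 600, 601):
    print(edges, estimate_ratio(1000, edges))
```

The numerator and denominator of the result play the roles of N_T and N_R for the rate multiplier; any residual error in the estimate is what the near-miss mechanism has to correct.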
Separate "near-miss" signals for Φ_T and Φ_R forward near-miss events to the client with the faster clock, i.e. the one using the rate multiplier. This client updates its estimate of the other client's clock frequency, thus changing the rate of events output by its rate multiplier. This is a second-order control system, and a little care is needed to ensure stability. A simple approach is that the client with the faster clock uses a counter to measure the time between near-miss events and uses this information to update its estimate of the other client's clock frequency. This process is quadratically convergent and stable. At the same time as updating the frequency estimate, a first-order correction can be applied by adding an offset to sum to bring the latch controller back to a point near the center of its safe operating region. If the near miss was for the clock generated by the rate multiplier, then this offset should be negative (to retard the rate multiplier); otherwise it should be positive.

As for the plesiochronous interface described in the previous section, synchronizations are required during operation. Again, the latency of these synchronizations is not critical for the latency of the data path. These synchronizations are infrequent; their rate is determined by the resolution of the rate multiplier and the drift rate of the clock frequencies.

5.4 A FIFO Interface

The interfaces presented in Sections 5.1 through 5.3 do not provide data transfers on every cycle of the transmitter and receiver clocks. The cycles on which transfers occur are determined by the interface. In many designs, the sender and receiver need more control of when transfers occur. Having solved the problems of mismatched clocks, flow control is straightforward. For example, Figure 5.7 shows an implementation that presents a FIFO interface to both the transmitter and receiver.
The equations for Empty, gr, and pr are:

\[
\begin{aligned}
\mathit{Empty} &= er \wedge \neg pt \\
gr &= \mathit{Get} \wedge \neg er \\
pr &= pt \wedge \neg(er \wedge \mathit{Get})
\end{aligned}
\tag{5.5}
\]

Figure 5.7: Implementing a FIFO interface

FIFO-R is a purely synchronous FIFO clocked by Φ_R. FIFO-1 is our single-stage FIFO design. The two rate multipliers compensate for frequency differences between Φ_R and Φ_T, and the box labelled "sync" is a synchronizer. When the transmitter performs a put, it must assert Put and the data at D_in until the next Φ_T' event. If the transmitter's clock frequency is no greater than the receiver's, then there can be a Φ_T' event for every event of Φ_T and the interface can accept data from the transmitter on every cycle. If the transmitter does not perform a Put for some Φ_T' event, then the resulting "empty" value will be noted by a false value of pt at the output of FIFO-1. If FIFO-R is non-empty, Get requests from the receiver are forwarded to FIFO-R on gr and data from FIFO-R is output on D_out. If FIFO-R is empty, then the receiver is signalled that data is available if and only if the transmitter performed a Put for the last transfer of FIFO-1, i.e., pt is true. In this case, the value at the output of FIFO-1 bypasses FIFO-R and goes directly through the multiplexer to D_out. The only increase in latency for our design is the latency of this multiplexer.

Figure 5.8: Symmetry in FIFO interface

If many cycles elapse without the receiver performing a Get, then FIFO-R may become full. The near_full output of this FIFO indicates that FIFO-R has little remaining capacity. This signal must be conveyed to the Φ_T domain to indicate to the transmitter that the interface is full. A basic symmetry underlying our design helps us in achieving this transfer without much effort.
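The control equations (5.5) given earlier in this section can be checked against the described behaviour with a small truth-table sketch. The reading of the signals is inferred from the prose and should be treated as an assumption: er means FIFO-R is empty, pt means the last transfer through FIFO-1 carried a Put, gr is the Get forwarded to FIFO-R, and pr is the Put into FIFO-R. The function name is hypothetical.

```python
def fifo_control(er, pt, get):
    """Evaluate equations (5.5).

    er:  FIFO-R is empty
    pt:  the last transfer through FIFO-1 carried a Put (valid data)
    get: the receiver is requesting data this cycle
    Returns (empty, gr, pr).
    """
    empty = er and not pt            # nothing in FIFO-R and nothing arriving
    gr = get and not er              # Get is served from FIFO-R
    pr = pt and not (er and get)     # store new data unless it is bypassed
    return empty, gr, pr


for er in (False, True):
    for pt in (False, True):
        for get in (False, True):
            print(er, pt, get, fifo_control(er, pt, get))
```

The row with er, pt and get all true is the bypass case: the receiver is not told the interface is empty, and the arriving item is neither read from nor written to FIFO-R.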
The latches which generate D_in and Full_T belong to the transmitter's domain. Let Full_T be the signal which stops further generation of data. The timing diagram of Figure 5.9 shows the scenarios where generation of data "A" leads to a nearly full FIFO-R. As seen, regardless of the mode of operation of the latch controller, the return path latency of the interface is always two clock periods. Thus, having an additional capacity of one data item in FIFO-R when near_full goes high is sufficient.

Figure 5.9: Timing scenarios with nearly full FIFO-R (a. transmitter-last scenario; b. receiver-last scenario)

In this chapter, I have presented interfaces for various multiple clock domain scenarios by adding functionality to the basic single-stage FIFO interface. The designs either do not need synchronization, or the required synchronization is not part of the latency-critical path and thus does not introduce extra latency. I also introduced a novel "miss detector" that can be used for detecting "misses" or "near-misses" in communication. Finally, a generic FIFO interface was presented which exploits the existing symmetry in our design and can be applied to all of the timing scenarios discussed earlier.

Chapter 6
Conclusion

I have presented a very simple design for source-synchronous communication. It is based on a self-timed, ripple FIFO with a single stage. Whereas a single-stage pointer FIFO provides no skew compensation (such a pointer FIFO is simply a latch clocked by the transmitter), the single-stage ripple FIFO provides nearly two clock periods of skew tolerance and can operate correctly for any initial phase offset between the transmitter and the receiver of the channel. The simplicity of the single-stage FIFO enables simplifications and optimization, thus taking good advantage of self-timed design.
I presented a design consisting of a self-resetting, edge-triggered C-element that generates a clock intermediate to the clocks of the transmitter and receiver. This intermediate clock strobes a latch that conveys data from the transmitter to the receiver. The timing of this clock signal ensures that the set-up and hold requirements of the receiver and the intermediate latch are all satisfied. My design can be initialized to provide maximal robustness against clock jitter and skew drift by adjusting the speed of the self-resetting C-element during its initial operation. Alternatively, the interface can be initialized for minimum latency by deliberately suppressing a transmitter clock event to the latch during initialization.

Chapter 5 showed how this design can be adapted for more generalized clocking scenarios including clocks with rationally related frequencies, closely matched clocks, and arbitrary clocks. In all of these designs, any synchronization is carried out on a path whose latency does not impact the data path. Thus, my designs can achieve latencies that are at most slightly more than one clock period and typically about half the clock period.

To achieve this performance, we exploit the frequency stability of clocks in synchronous designs. In general, the overheads of synchronization and handshaking are only needed to address timing issues that cannot be resolved statically. When the clocks of the transmitter and receiver are identical in frequency, then only the relative phase needs to be resolved, and this can be done by a simple handshaking circuit such as the latch controller shown in Figure 3.6. When the clocks are rationally related, the client with the faster clock can use a rate multiplier to construct an approximation of the other client's clock. There are numerous isomorphic sequences of events that can be generated by the rate multiplier, and synchronization occurs during initialization to determine the optimal sequence.
When the clocks are closely matched in frequency, only the long-term drift needs to be identified. The synchronizer that detects this drift can have high latency without impacting the data path latency. Finally, when arbitrary clock frequencies are used, the frequency stability of these clocks enables accurate approximation of the frequency ratio. Again, synchronization is only needed to detect long-term drift, and this does not impact the data path latency.

I designed and fabricated a proof-of-concept chip in the TSMC 0.18µ CMOS process for my design for clients operating at identical clock frequencies using a single-stage FIFO interface. Initial test results have demonstrated my claims about the skew tolerance of the design, and I am in the process of more extensive testing. I am also currently designing a chip to demonstrate interfaces based on miss-detectors for rational, closely matched, and arbitrary clock frequency designs and will fabricate it in the near future.

Bibliography

[BB98] Daniel W. Bailey and Bradley J. Benschneider. Clocking design and analysis for a 600MHz Alpha microprocessor. IEEE Journal of Solid-State Circuits, 33(11):1627-1633, November 1998.
[BC+99] Kerry Bernstein, Keith M. Carrig, et al. High Speed CMOS Design Styles. Kluwer, 1999.
[BDM02] Keith A. Bowman, Steven G. Duvall, and James D. Meindl. Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. IEEE Journal of Solid-State Circuits, 37(2):183-190, February 2002.
[CC+91] Terry I. Chappell, Barbara A. Chappell, et al. A 2-ns cycle, 3.8-ns access 512-KB CMOS ECL SRAM with a fully pipelined architecture. IEEE Journal of Solid-State Circuits, 26(11):1577-1585, November 1991.
[CG02] Ajanta Chakraborty and Mark R. Greenstreet. A minimalist source-synchronous interface. In Proceedings of the 15th IEEE ASIC/SOC Conference, pages 443-447, September 2002.
[Cha84] Daniel M. Chapiro. Globally-Asynchronous, Locally-Synchronous Systems. PhD thesis, Department of Computer Science, Stanford University, October 1984. Tech. Report STAN-CS-84-1026.
[CM73] T.J. Chaney and C.E. Molnar. Anomalous behavior of synchronizer and arbiter circuits. IEEE Transactions on Computers, C-22(4):421-422, April 1973.
[CN01] Tiberiu Chelcea and Steven M. Nowick. Robust interfaces for mixed-timing systems with application to latency-insensitive protocols. In Proceedings of the 38th ACM/IEEE Design Automation Conference, pages 21-26, June 2001.
[CZ02] Atanu Chattopadhyay and Zeljko Zilic. High speed asynchronous structures for inter-clock domain communication. In Proceedings of the 2002 International Symposium on Circuits and Systems, pages 517-520, September 2002.
[Dam97] Roger A. Dame. The AlphaServer 4100 low-cost clock distribution system. Compaq Digital Technical Journal, 8(4):38-47, April 1997.
[Dav99] Bijan Davari. CMOS technology: Present and future. In Proceedings of 1999 Symposium on VLSI Circuits, pages 5-10. IEEE, June 1999.
[DDX95] Larry R. Dennison, William J. Dally, and Duke Xanthopoulos. Low-latency plesiochronous data retiming. In Proceedings of the Sixteenth Anniversary Conference on Advanced Research in VLSI, pages 304-315, 1995.
[FEG00] S.F. Furber, D.A. Edwards, and J.D. Garside. AMULET3: a 100 MIPS asynchronous embedded processor. In Proceedings of the 2000 International Conference on Computer Design, pages 329-334, September 2000.
[GA+02] Stephen Geissler, David Appenzeller, et al. A low-power RISC microprocessor using dual PLLs in a 0.13µ SOI technology with copper interconnect and low-k BEOL dielectric. In Proceedings of the 2002 International Solid-State Circuits Conference, pages 148-149, February 2002.
[Gre93] Mark R. Greenstreet. STARI: A Technique for High-Bandwidth Communication. PhD thesis, Department of Computer Science, Princeton University, January 1993.
[Gre95] Mark R. Greenstreet. Implementing a STARI chip. In Proceedings of the 1995 International Conference on Computer Design, pages 38-43, Austin, Texas, October 1995.
[HN01] David Harris and Sam Naffziger. Statistical clock skew modeling with data delay variations. IEEE Transactions on VLSI Systems, 9(1):888-898, December 2001.
[HN03] Shaomei Huang and Radu Negulescu. High-performance mixed-clock communication with resampling, 2003.
[IM02] Anoop Iyer and Diana Marculescu. Power-performance evaluation of Globally Asynchronous, Locally Synchronous processors. In Proceedings of the 29th International Symposium on Computer Architecture, pages 158-168, June 2002.
[JG93] Howard Johnson and Martin Graham. High-Speed Digital Design: A Handbook of Black Magic. Prentice Hall, 1993.
[KA+01] Andre Kowalczyk, Victor Adler, et al. The first MAJC microprocessor: A dual CPU system-on-a-chip. IEEE Journal of Solid-State Circuits, 36(11):1609-1616, November 2001.
[Kat98] Cameron Katrai. Managing clock distribution and optimizing clock skew in networking applications. Application Notes #14, Pericom Semiconductor Corporation, 1998.
[KB+01] Nasser A. Kurd, Javed S. Barkatullah, et al. Multi-GHz clocking scheme for Intel® Pentium® 4 microprocessor. In Proceedings of the 2001 International Solid-State Circuits Conference, pages 404-405, February 2001.
[KN+02] Georgios K. Konstadinidis, Kevin Normoyle, et al. Implementation of a third-generation 1.1-GHz 64-bit microprocessor. IEEE Journal of Solid-State Circuits, 37(11):1461-1469, November 2002.
[KP+99] Georgios Kornaros, Dionisios Pnevmatikatos, et al. ATLAS I: Implementing a single-chip ATM switch with backpressure. IEEE Micro, 19(1):30-41, Jan/Feb 1999.
[KS96] S. Kim and R. Sridhar. Self-timed mesochronous interconnection for high-speed VLSI systems. In Sixth Great Lakes Symposium on VLSI, pages 122-125, March 1996.
[MB+89] Alain J. Martin, Steven M. Burns, et al. The design of an asynchronous microprocessor. In Proceedings of the Conference on Advanced Research in VLSI, Caltech, 1989.
[Mes90] David G. Messerschmitt. Synchronization in digital system design. IEEE Journal on Selected Areas in Communications, 8(8):1404-1419, October 1990.
[MJ+97] Charles E. Molnar, Ian W. Jones, et al. A FIFO ring oscillator performance experiment. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 279-289. IEEE Computer Society Press, April 1997.
[ML+97] Alain J. Martin, Andrew Lines, et al. The design of an asynchronous MIPS R3000 microprocessor. In Proceedings of the 17th Conference on Advanced Research in VLSI, pages 164-181, September 1997.
[MS01] Fenghao Mu and Christer Svensson. Self-tested self-synchronization circuit for mesochronous clocking. In IEEE Transactions on Circuits and Systems, pages 129-140, 2001.
[MT+02] S.W. Moore, George Taylor, et al. Point to point GALS interconnect. In Proceedings of the Eighth International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 62-68, April 2002.
[MVF00] Jens Mutterbach, Tomas Villiger, and Wolfgang Fichtner. Practical design of globally-asynchronous, locally-synchronous systems. In Proceedings of the Sixth International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 52-59, April 2000.
[RB+01] P.A. Riocreux, L.E.M. Brackenbury, et al. A low-power self-timed Viterbi decoder. In Proceedings of the Seventh International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 15-24, 2001.
[RJDC98] Phillip J. Restle, K.A. Jenkins, A. Deutsch, and P.W. Cook. Measurement and modelling of on-chip transmission line effects in a 400 MHz microprocessor. IEEE Journal of Solid-State Circuits, 33(4):662-665, April 1998.
[RM+01] Phillip J. Restle, Timothy G. McNamara, et al. A clock distribution network for microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792-799, May 2001.
[RT00] Stefan Rusu and Simon Tam. Clock generation and distribution for the first IA-64 microprocessor. In ISSCC00, pages 176-177, 2000.
[RWW+02] Woonghwan Ryu, Albert Lu Chee Wai, Fan Wei, Wai Lai Lai, and Joungho Kim. Over GHz low-power RF clock distribution for a multiprocessor digital system. IEEE Transactions on Advanced Packaging, 25(1):18-27, February 2002.
[S03] Ingemar Soderquist. Globally updated mesochronous design style. IJSSC, 38(7):1242-1249, 2003.
[Sei79] Charles L. Seitz. System timing. In Introduction to VLSI Systems (Carver Mead and Lynn Conway), chapter 7, pages 218-262. Addison Wesley, 1979.
[Sei94] Jakov N. Seizovic. Pipeline synchronization. In Proceedings of the First International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 87-96. IEEE Computer Society Press, 1994.
[SF01] Ivan Sutherland and Scott Fairbanks. GasP: A minimal FIFO control. In Proceedings of the Seventh International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 46-53, April 2001.
[SH97] S. Sidiropoulos and M. Horowitz. A semi-digital DLL with unlimited phase shift capability and 0.08-400 MHz operating range. In Proceedings of the 1997 International Solid-State Circuits Conference, pages 332-333, February 1997.
[SM00] Allen E. Sjogren and Chris J. Myers. Interfacing synchronous and asynchronous modules within a high-speed pipeline. IEEE Transactions on VLSI Systems, 8(5):573-583, October 2000.
[SPL02] Tiberiu Seceleanu, Juha Plosila, and Pasi Liljeberg. On-chip segmented bus: A self-timed approach. In Proceedings of the 15th IEEE ASIC/SOC Conference, pages 216-220, September 2002.
[TR+00] Simon Tam, Stefan Rusu, et al. Clock generation and distribution for the first IA-64 microprocessor. IEEE Journal of Solid-State Circuits, 35(11):1545-1552, November 2000.
[WGG02] Anthony J. Winstanley, Aurelien Garivier, and Mark R. Greenstreet. An event spacing experiment. In Proceedings of the Eighth International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 42-51, Manchester, UK, April 2002.
[Wik03] Daniel Wiklund. Mesochronous clocking and communication in on-chip networks. In Proceedings of Swedish System-on-chip conference, April 2003.
[XB+01] Thucydides Xanthopoulos, Daniel W. Bailey, et al. The design and analysis of the clock distribution network for a 1.2 GHz Alpha microprocessor. In ISSCC01, pages 402-403, 2001.
[YD96] Ken Y. Yun and Ryan P. Donohue. Pausible clocking: A first step toward heterogeneous systems. In Proceedings of the 1996 International Conference on Computer Design, pages 118-123, October 1996.
[YH00] Evelina Yeung and Mark A. Horowitz. A 2.4 Gb/s/pin simultaneous bidirectional parallel link with per-pin skew compensation. IEEE Journal of Solid-State Circuits, 35(11):1619-1628, November 2000.
[YS89] Jiren Yuan and Christer Svensson. High-speed CMOS circuit technique. IEEE Journal of Solid-State Circuits, 24(1):62-70, February 1989.
