UBC Theses and Dissertations

Defect tolerance for yield enhancement of FPGA interconnect using fine-grain and coarse-grain redundancy Yu, Anthony J. 2005

Defect Tolerance for Yield Enhancement of FPGA Interconnect Using Fine-grain and Coarse-grain Redundancy

by Anthony J. Yu
B.A.Sc., The University of British Columbia, 2001

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in THE FACULTY OF GRADUATE STUDIES, Electrical and Computer Engineering

The University of British Columbia
August 2005
© Anthony J. Yu, 2005

Abstract

Field programmable gate arrays (FPGAs) are integrated circuits (ICs) designed to implement, or be programmed with, any user circuit. This unique ability makes FPGAs extremely popular; however, it also introduces a significant amount of area and delay overhead to the circuit. Fortunately, FPGAs are typically manufactured in a process that is two to three generations ahead of the one used by application-specific ICs. This allows some of the area and delay lost to programmability to be reclaimed. However, the problem with being this far ahead is that manufacturing defects appear in immature technologies. The aggressive scaling of feature sizes and the migration to new technologies make the manufacturing of perfect FPGAs increasingly unlikely. The use of defect-tolerant techniques is one method of alleviating this growing problem. Defect tolerance enables defective FPGAs to appear as "perfect." This thesis presents and compares two new approaches to FPGA defect tolerance: fine-grain redundancy (FGR) and coarse-grain redundancy (CGR). FGR has an array-size-independent overhead of up to 50%, and is capable of tolerating an increasing number of defects as array size grows. In contrast, CGR, at low defect levels, demonstrates a diminishing amount of area overhead as array size increases.
At low defect levels, CGR requires less area overhead than FGR; however, in situations where more than 2-3 defects are expected, FGR requires less overhead.

Contents

Abstract
Contents
List of Tables
List of Figures
1 Introduction
  1.1 Motivation and Objectives
  1.2 Contributions
  1.3 Outline
2 Background
  2.1 FPGA Architecture
  2.2 Definitions
    2.2.1 Defects vs. Faults
    2.2.2 Testing vs. Diagnosis
  2.3 Previous Redundancy Approaches
    2.3.1 Software Redundancy
    2.3.2 Hardware Redundancy
    2.3.3 Run-time Redundancy
  2.4 Summary
3 Fine-grain Redundancy (FGR)
  3.1 Architectural and Implementation Details
    3.1.1 Switch Block Changes
    3.1.2 Connection Block Changes
    3.1.3 Supported Defects
    3.1.4 Modes of Operation
    3.1.5 Detailed Transistor-level Design
    3.1.6 Software Aspect of FGR Defect Avoidance
  3.2 Limitations
  3.3 Area and Delay Results
    3.3.1 Area
    3.3.2 Delay
    3.3.3 Area and Delay Recovery
    3.3.4 Area and Delay Product
4 Coarse-grain Redundancy (CGR)
  4.1 Architectural and Implementation Details
    4.1.1 Switch Block Changes
    4.1.2 Connection Block Changes
    4.1.3 Multiple Spare Rows and Columns
    4.1.4 Supported Defects
    4.1.5 Detailed Transistor-level Design
  4.2 Limitations
  4.3 Estimated Results
    4.3.1 Area
    4.3.2 Delay
    4.3.3 Scaling Factors
5 Yield Comparison
  5.1 Yield Model
    5.1.1 Coarse-grain Model
    5.1.2 Fine-grain Model
  5.2 Architectural Considerations
    5.2.1 Switch Implementation Impact on Yield
    5.2.2 Flexibility Impact on Yield
    5.2.3 Array Size Impact on Yield
    5.2.4 Wire Length Impact on Yield
  5.3 Limitations
  5.4 Results
    5.4.1 Switch Implementation
    5.4.2 Switch Flexibility
    5.4.3 Fixed Array Size
    5.4.4 Increasing Array Size
    5.4.5 Wire Length
6 Conclusion
  6.1 Area and Delay
  6.2 Yield
  6.3 Future Work
Bibliography

List of Tables

3.1 Defect-tolerant switch implementations
6.1 Summary ranking of FGR defect-tolerant schemes w/ E3M1

List of Figures

1.1 Overview of Coarse-Grain Hardware Redundancy
1.2 Overview of Fine-Grain Hardware Redundancy
2.1 Overview of Fine-Grain Hardware Redundancy
2.2 Island-style Architecture
2.3 Directional wire (L = 3)
2.4 Directional switch block
2.5 CLB input connection
2.6 Design shifting for defect correction
2.7 Switch block with spare connections
2.8 Triple-modular Redundancy
3.1 High-level defect-tolerant switch block
3.2 Connection block design
3.3 Single and Double-length defects
3.4 Embedding imux to avoid contention
3.5 Defect correction example (L = 2)
3.6 HSPICE schematic for delay characterization
3.7 Flexibility exploration for non-fault tolerant architectures
3.8 Area of defect-tolerant implementations
3.9 Delay of defect-tolerant implementations
3.10 Area-delay product comparison
4.1 Connection block changes for CGR
4.2 Multiple spare row and column architectures
4.3 Comparison between CGR implementations
4.4 Comparison between FGR implementations
4.5 CGR-G1 for increasing array sizes (spare row/column overhead only)
5.1 Switch block with spare connections
5.2 Imux implementation (L = 4, M = 32)
5.3 Shifting abilities (L = 4, M = 32)
5.4 Flexibility (L = 4, M = 32)
5.5 Increasing number of global spares (M = 32)
5.6 Increasing number of local spares (M = 32)
5.7 Increasing number of global spares (M = 256)
5.8 Increasing number of local spares (M = 256)
5.9 Increasing array size for FGR (L = 4)
5.10 Area comparison between FGR and CGR at equal number of defects (L = 4)
5.11 FGR yield for different wire lengths (M = 32)
5.12 Area/delay overhead for clma
5.13 Area breakdown of clma for different wire lengths at a very wide channel width of 224 tracks
6.1 Summary of area/delay overhead vs defect tolerance of FGR

Chapter 1
Introduction

Field-programmable gate arrays (FPGAs) are large integrated circuits comprised of programmable logic blocks and programmable routing. Their size, density requirements and regular layout make them attractive for aggressive tuning in the latest technology processes. As such, they are also prone to manufacturing defects [5, 32, 33, 34].

The number of manufacturing defects is expected to increase as the density of FPGAs increases, or as the programmable logic paradigm migrates to new technologies such as nano-technology [12]. This increase in defect rates severely impacts the viability of programmable devices. It also highlights the importance of incorporating defect tolerance strategies into FPGAs.

Modern FPGAs are predominantly programmable routing. Defects are thus more likely to appear in the routing resources as opposed to the programmable logic blocks. This makes the ability to tolerate defects in the interconnect extremely important. In this thesis, two different approaches to interconnect defect tolerance are presented and compared.
The interconnect encompasses the physical wiring, the switch elements, and the configuration bits in both the switch block and the connection block.

Traditional methods of defect tolerance involve the use of a spare row and column in the FPGA architecture [18]. As shown in Figure 1.1, defects are tolerated by bypassing the defective row/column, and shifting part of the design to the spare row/column. This coarse-grain approach is capable of tolerating defects in routing and logic blocks. However, the consolidation of spare resources into a single row or column limits its ability to efficiently tolerate multiple distributed defects.

Figure 1.1: Overview of Coarse-Grain Hardware Redundancy (a: original; b: corrected)

This thesis presents a new architecture that embodies a fine-grain approach to defect tolerance. Spare resources are distributed across the FPGA fabric. This allows the architecture to tolerate multiple distributed defects in the interconnection of an FPGA. In the proposed architecture, defect tolerance is based on shifting individual connections. As shown in Figure 1.2, shifting allows signals to route around a defect. This is called fine-grain redundancy.

Figure 1.2: Overview of Fine-Grain Hardware Redundancy (a: original; b: corrected)

A comparison between traditional coarse-grain redundancy (CGR) and fine-grain redundancy (FGR) is also presented. The comparison will show that FGR provides a scalable solution that is better at tolerating more defects as larger FPGAs are made. CGR, with its smaller area overhead at current FPGA dimensions, is adequate for now.

1.1 Motivation and Objectives

A marketable FPGA is one that can be programmed with any user design. Defects inhibit this ability, and defective devices cannot typically be sold. Thus, yield lost due to defects represents a loss of revenue for FPGA vendors.
The desire to maximize revenue suggests the need to minimize yield lost due to defects. This thesis explores one such way to improve yield; it investigates the use of defect-tolerant architectures for yield enhancement.

A viable defect-tolerant architecture should be capable of a) efficiently tolerating multiple random distributed defects in the interconnection, b) preserving the signal timing characteristics of the original circuit, and c) computing defect corrections quickly. These attributes are desirable for the following reasons. The first is important because random distributed defects are expected to contribute to the highest yield loss [33], and the number of defects is expected to increase with the scaling of technology [5]. Next, to guarantee the correctness of a defect-corrected circuit, drastic changes to the routing solution cannot be allowed, as the changes can lead to unanticipated timing violations, race conditions and skew. Lastly, the ability to apply defect correction in a timely manner is essential in a manufacturing environment, since multiple circuit boards must be quickly programmed with the FPGA bitstream.

Unfortunately, a survey of current defect-tolerant approaches revealed that no existing architecture provides all of the abovementioned features. This sets the stage for this thesis. The objective of this research is the development and comparison of defect-tolerant architectures that are capable of tolerating multiple defects, capable of rapid defect correction and capable of not altering signal timing.

1.2 Contributions

This thesis proposes two defect-tolerant architectures that are capable of tolerating multiple random distributed defects. The first architecture, fine-grain redundancy (FGR), is scalable and can tolerate an increasing number of defects as chip density or area increases. Additionally, defect correction can be applied quickly and does not affect signal timing.
Although not shown in this thesis, it is recognized that the architecture can also be used to repair crosstalk-type faults.

In contrast, the second proposed architecture embodies coarse-grain redundancy (CGR) and utilizes multiple spare rows and columns to attain defect tolerance. Although CGR has been published before [18], the framework for handling multiple spares is new. The CGR redundancy scheme does not scale as well as FGR. However, it will be shown later that the use of multiple spare rows and columns is better suited for current FPGA dimensions.

FGR and the multiple spare rows and columns scheme are the first architectures to address all of the aforementioned desired features of defect tolerance. Namely, both are capable of a) efficiently tolerating multiple random distributed defects in the interconnection, b) preserving the signal timing characteristics of the original circuit, and c) computing defect corrections quickly.

Yield models for FGR, multiple spare rows and columns CGR, and traditional CGR are presented and compared. The comparison is based on four factors that influence yield: switch implementation, switch flexibility, array size and wire length.

In summary, the major contributions of this thesis include the following:

• Presentation of a new fine-grain defect-tolerant architecture and its yield model,
• Presentation of a new coarse-grain multiple spare row and column architecture and its yield model,
• A detailed study of the area and delay overhead required for hardware-based defect-tolerant interconnect schemes in FPGAs, and
• A comparison between fine-grain and coarse-grain redundancy.

This work has been published in two FPGA conferences [36, 37] and is currently being evaluated by the University-Industry Liaison Office of the University of British Columbia for patent opportunities.

1.3 Outline

This thesis is organized as follows.
Chapter 2 presents an overview of modern FPGA architectures and describes previous approaches to FPGA defect redundancy. Chapter 3 describes the new fine-grain redundancy architecture (FGR). The traditional coarse-grain redundancy architecture (CGR) and a new multiple-spare rows and columns architecture are presented in Chapter 4. A comparison between the yield of FGR and CGR is shown in Chapter 5. Finally, conclusions are given in Chapter 6.

Chapter 2
Background

This chapter presents the architectural detail and terminology needed in the discussion of FPGA defect tolerance. First, a brief overview of modern FPGA architecture is presented. This is followed by definitions of defect, fault, test and diagnosis. Finally, a summary of past FPGA defect-tolerance approaches is presented.

2.1 FPGA Architecture

An FPGA is an integrated circuit composed of programmable routing resources and programmable logic resources. The programmable logic blocks, called configurable logic blocks (CLBs), are composed of k-input lookup tables and flip-flops. Each lookup table can implement any k-input logic function, and connects with a designated flip-flop. Together, the lookup table and flip-flop pair form a basic logic element (BLE). Figure 2.1 shows a CLB with I inputs and N BLEs.

Figure 2.1: Overview of Fine-Grain Hardware Redundancy (a: a configurable logic block (CLB); b: a basic logic element (BLE))

Programmable routing can be further divided into three parts: the wires, the switch blocks (S blocks) and the connection blocks (C blocks). Together, the logic blocks, wires, S blocks and C blocks for modern commercial FPGAs are organized into an island-style architecture as shown in Figure 2.2. This architecture has proven to be very successful in modern FPGA architectures [10, 22]. As such, the island-style architecture was also used as the foundation for the new defect-tolerant architecture.

Figure 2.2: Island-style Architecture

Wires within an FPGA reside in routing channels and are indexed by track numbers. A channel, as shown in Figure 2.2, spans the width or height of the FPGA and has its boundaries defined by the CLBs. A wire's track number is based on its position relative to the width of the channel. The convention of this thesis is that the wire at the bottom-most/left-most position in each channel is assigned a track number of 0. Similarly, the wire at the top-most/right-most position is assigned a track number of channel-width - 1.

Modern FPGAs utilize single-driver wiring [25], where each wire is driven by a single tapered buffer connected to an input multiplexer. Figure 2.3 presents a wire that spans 3 switch blocks and uses single-driver wiring. Note that this wire can only be driven at its start point; hence, the wire acquires a "directional" attribute. To minimize the effects of the single-driver wiring, wires are added in pairs, one for each direction. The adopted convention in this thesis and [25] is that even tracks contain signals that move left/down, and odd tracks contain signals that move right/up.

Figure 2.3: Directional wire (L = 3)

S blocks are formed at the intersection of horizontal and vertical channels. An S block allows nets to turn corners or extend further down the channel. S blocks also allow for net fanouts. Figure 2.4 shows both a detailed and a high-level representation of the directional switch for length 1 wires. In the high-level representation, a group of wires and buffers is replaced by a single arrow.

Figure 2.4: Directional switch block (a: detailed representation; b: high-level representation)

The C block provides the interface between the CLB and the wires. Since a wire can only be driven from its start point, the outputs of the CLB must connect directly to the input multiplexers of the wires which start nearby. The inputs of the CLB are also selected from wires in the routing channel at designated taps. These taps connect to input multiplexers of the CLB. Figure 2.5 shows an example of a C block for length 1 wiring. For clarity, only a few S block connections are shown in the figure.

Figure 2.5: CLB input connection

2.2 Definitions

This section introduces a few definitions of terms used throughout this thesis.

2.2.1 Defects vs. Faults

Processing and mask imperfections can create physical imperfections in silicon. These imperfections can lead to variations in the electrical behaviour of wires and transistors [29]. In an extreme case, the imperfections can result in functional failure and unexpected circuit behaviour. This thesis defines a defect as an imperfection that causes a functional failure.

Faults are not physical imperfections, but rather models that encapsulate the behaviour of defects. They model defects at the highest level of abstraction [16]. Several different types of defects can be modeled by a single type of fault, such as stuck-at-1 faults. Abstraction simplifies test design by reducing the number of individual defects that must be considered.

Unlike defects, faults can also be transient in nature. That is, faults can be used to describe errors that occur during the operation of the FPGA. This includes errors caused by single-event upsets (SEUs) [17] and crosstalk [15]. This thesis is not concerned with transient faults, although FGR can be used to avoid some crosstalk problems.

2.2.2 Testing vs. Diagnosis

Testing is the process of determining if a circuit is operating as designed.
This is typically accomplished by applying some stimuli to the circuit input and observing its output. A properly tested chip will produce a clear "yes, the circuit is working" or "no, the circuit is not working" answer. Testing requires time and money. To facilitate testing and reduce costs, specialized test structures can be incorporated into the circuit. An example of this is built-in self test (BIST) structures.

Diagnosis involves the identification (i.e., determining the type) and location of defects [8]. When compared to test, diagnosis is a more complex and time-consuming process. This is because diagnosis demands the identification and location of defects in addition to a pass/fail result. Diagnosis for FPGAs is typically performed by exhaustively exercising all resources and observing their outputs [28, 30]. This thesis will assume that some type of diagnosis can be done in a cost-efficient manner to identify all defects in a device. Hence, each device is assumed to have a defect map. This map can be rather simple.

2.3 Previous Redundancy Approaches

Defect-tolerant approaches for FPGAs can be loosely classified into 3 groups. These are software redundancy, hardware redundancy and run-time redundancy. Each approach has its advantages and typically trades off between time (critical path delay and processing/application time) and resources (silicon area, external storage, etc.). In the following subsections, examples from each redundancy approach will be presented, and their advantages and disadvantages outlined.

2.3.1 Software Redundancy

In the software redundancy approach, CAD tools are used to map around defective resources. This method typically has no hardware overhead; however, the application of defect correction may take a long time. Furthermore, the effectiveness and efficiency of defect correction are heavily dependent on the abilities of the CAD tools.
In general, software redundancy is impractical in a production environment for two reasons. First, generating a unique placement and routing solution for each FPGA is a time-consuming process. Second, it is impractical to fully verify timing for each solution because the timing characteristics of the new placement and routing solutions will be different. For some high-performance circuits, it may be impossible to meet very stringent timing requirements with every new placement and routing solution. Despite these disadvantages, software redundancy does have some advantages, notably minimal hardware overhead and the ability to efficiently tolerate more defects than most other approaches.

Swapping and Incremental Rerouting

The approach proposed by [24] addresses defects in two ways. For defects in CLBs, defective resources are swapped with unused resources. This approach is based on the premise that the resources within a CLB are never fully utilized, and that the logic within the CLB can be permuted such that the defective resource is avoided. For defects in routing wires, an incremental congestion-aware router is used to reroute the signals affected by the defects. By limiting the number of nets that are rerouted, the impact on signal timing is minimized.

For high-performance designs, it may not always be possible to find a new routing solution that meets the stringent timing requirements of the design. A similar limitation applies to defects in logic blocks: for dense designs, it may not be possible to find the unused resource necessary for swapping. Despite these shortcomings, the approach is attractive because it requires no area overhead and can potentially tolerate a large number of defects.

Design Shifting

Another method for defect avoidance requires the reservation of spare resources. By carefully avoiding the use of certain resources, it is possible to avoid defects by "shifting" the entire design by one row or column in the array [14]. Figure 2.6 shows an example of defect correction using this approach.

Figure 2.6: Design shifting for defect correction (a: fault free; b: faulty)

Design shifting can be applied in a short amount of time since it only requires bitstream manipulation. However, without hardware support, the shifting results in a slight variance in I/O timing. It can also be complicated by heterogeneous (memory or DSP) blocks in the array. Furthermore, to support multiple defects, the defects must be perfectly aligned with the spare locations.

Precomputing Designs

To reduce the time needed to correct defects, a number of placement and routing solutions for the same design can be precomputed. Each solution differs in resource usage. When programming a defective device, defect correction involves selecting the appropriate solution, that is, one that does not use the defective resource(s) [20, 23].

The clear advantage of this approach is that it does not have any on-chip hardware overhead and requires a relatively short defect correction time. The disadvantage is that there are many possible design permutations. This approach requires a lot of computing time (even if it is done beforehand) and a lot of storage space for the different design permutations.

EasyPath

Xilinx, a major vendor of FPGA integrated circuits, has a unique approach for dealing with defective parts. Customers are asked for their bitstream. In return, Xilinx provides the customer with a set of FPGAs that are guaranteed to work with the provided bitstream. The vendor can make this guarantee because it has ensured that the customer design is not using any defective resources on the device sold.
In essence, rather than forcing the configuration bitstream to avoid the defective resources, defects are forced to avoid the bitstream [35].

This approach is advantageous in the sense that customers can purchase discounted FPGAs that correctly implement their design. For mature designs, this translates to a significant cost reduction. The same cannot be said about new designs. Since new designs are subject to design revisions, a changed place and route solution is unlikely to work on the faulty chip.

2.3.2 Hardware Redundancy

Hardware redundancy requires the addition of extra or spare resources. These spare resources facilitate defect correction by allowing the use of a defective resource to be shifted over to a spare one. This shifting reduces correction time since the time needed for shifting is typically less than the time needed to generate a new placement and routing solution.

The disadvantage of this approach is the need to incorporate redundancy at an early design stage of the FPGA itself. Hardware redundancy also costs area overhead and typically tolerates fewer defects than its software counterpart. Despite this, the approach is effectively used in industry [2, 3, 11]. This suggests that the advantages outweigh the cost in area overhead. It also suggests that tolerating a large number of defects is not yet required in today's technology.

Spare Rows and Columns

The spare row and column architecture is one of the first published FPGA defect-tolerant architectures [18]. One spare row and one spare column are incorporated into the FPGA. If a defect exists in one row or column, the defective row/column is bypassed, and the following rows/columns are shifted until they utilize the spare. This architecture can naturally tolerate clusters of defects in the same row/column, but not if the cluster spans two rows/columns. It has also been successfully applied in industry [2, 3, 11].
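The bypass-and-shift behaviour of a single spare row can be illustrated with a short sketch. This is only a minimal model under stated assumptions, not the circuit from [18]: the spare is assumed to be the last physical row, and the names `remap_rows` and `repairable_with_one_spare` are hypothetical.

```python
def remap_rows(num_logical_rows, defective_row=None):
    """Map each logical row to a physical row.

    Assumption (not from the source): the array has num_logical_rows + 1
    physical rows, and the spare is the last one. Rows at or after the
    defective physical row are shifted by one so the defect is bypassed.
    """
    mapping = []
    for logical in range(num_logical_rows):
        if defective_row is not None and logical >= defective_row:
            mapping.append(logical + 1)  # shifted toward the spare
        else:
            mapping.append(logical)      # unaffected row stays in place
    return mapping


def repairable_with_one_spare(defect_rows):
    """A single spare row can only absorb defects confined to one row."""
    return len(set(defect_rows)) <= 1


# Physical row 1 is defective: logical rows 1..3 move to physical rows 2..4.
print(remap_rows(4, defective_row=1))
# A cluster in one row is repairable; defects in two rows are not.
print(repairable_with_one_spare([1, 1, 1]), repairable_with_one_spare([1, 2]))
```

The same mapping applies to columns. The second helper mirrors the limitation stated above: a cluster of defects is tolerated only when it stays within one row.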
The weakness of this architecture lies in its coarse-grain nature. In the event of any defect, an entire spare row/column is utilized for correction. Thus, this architecture cannot tolerate multiple distributed defects. The advantage of this architecture is that it requires very little correction time. Industrial designs typically implement the row/column decoder such that rows/columns can be permanently enabled and disabled (e.g., fuses can be blown). This reconfiguration can be made transparent to the user so the original bitstream can still be used on the defective FPGA. A more in-depth analysis of this architecture will be presented in Chapter 4. In particular, an extension will be shown to generalize this approach for multiple spares.

Spare Connections

The architecture proposed in [13] incorporates spare wires and switches into the switch block design. As shown in Figure 2.7, the spare resources allow any defective transistor to be bypassed. This architecture can tolerate one defective pass-transistor per switch block.

Figure 2.7: Switch block with spare connections (a: S block spare connections, fault free; b: S block spare connections, faulty). The spare is a long wire that connects, via pass transistors, with all S block I/O.

One problem with this architecture is its impact on signal timing. Because of their length and heavy loading, the spare connections are slower than regular connections. As a result, defect correction introduces a significant timing variance in the routing solution. Another problem is the inability to tolerate bridging-type faults between wires. The advantage of this architecture is its fine-grain approach to defect correction. By repairing defects at the transistor level, this architecture can tolerate multiple defects in the routing network.

2.3.3 Run-time Redundancy

Run-time redundancy deals with errors that occur when the FPGA is in operation.
These errors include transient errors as well as permanent errors such as the ones resulting from electromigration. Transient errors, also called SEUs, are not permanent defects. They result when a radiation particle strikes a memory element with sufficient energy to cause a change in the memory's stored value [17]. SEUs can be corrected by reprogramming the memory element. In contrast, the errors caused by electromigration cannot be corrected through reprogramming alone. It was observed in [19] that the stresses of carrying electrical signals can cause breakages in interconnect. This phenomenon is called electromigration and results in permanent open circuits in the interconnect.

Run-time redundancy has the disadvantage of needing both hardware and software overhead. Additional diagnosis circuitry is incorporated into the design to detect inconsistencies arising from the errors. If an error is detected, CAD tools are used to correct the circuit (either by mapping around the defect or by reprogramming the defective component). The advantage of this approach is that defect diagnosis occurs during the operation of the design. Defect correction can also occur during run-time as long as the underlying FPGA architecture supports partial reconfiguration. This reduces the impact of defect correction and eliminates "down time".

Figure 2.8: Triple-modular Redundancy (a: normal circuit; b: TMR-protected circuit)

n-modular Redundancy

In n-modular redundancy, the circuit that requires defect protection is replicated n times (see footnote 1) and the outputs of the replicas are redirected to a voter circuit [6, 21]. The voter circuit performs a bit-wise majority vote on the outputs of the replicated circuits.
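The bit-wise majority vote just described can be sketched in software. This is a minimal illustration, assuming each replica's output is available as an integer of a known bit width and that the number of replicas is odd; the function names are hypothetical, and a real voter would be implemented in logic on the FPGA.

```python
def majority_vote(outputs, width):
    """Bit-wise majority vote over the outputs of n replicated circuits.

    Assumptions (not from the source): each output is an integer whose low
    `width` bits are the replica's output word, and n is odd so that no
    bit position can tie.
    """
    n = len(outputs)
    if n % 2 == 0:
        raise ValueError("use an odd number of replicas to avoid ties")
    voted = 0
    for bit in range(width):
        ones = sum((word >> bit) & 1 for word in outputs)
        if ones > n // 2:          # more than half of the replicas drive a 1
            voted |= 1 << bit
    return voted


def disagreeing_replicas(outputs, voted):
    """Indices of replicas whose output differs from the voted word."""
    return [i for i, word in enumerate(outputs) if word != voted]


# Triple-modular redundancy: the third replica has two corrupted bits.
replica_outputs = [0b1010, 0b1010, 0b0110]
result = majority_vote(replica_outputs, width=4)
print(bin(result), disagreeing_replicas(replica_outputs, result))
```

The second helper corresponds to identifying which replica exhibited an inconsistency, for example so that it can be reprogrammed.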
In a defect-free environment, the outputs of the n circuits are identical, thus the output of the voter is identical to its input. However, if the voter circuit detects an inconsistency in the replicated output values, for example if the circuit experienced a transient error, the output will be that of the majority input value. The use of odd values of n ensures that ties are not possible. The replication(s) that exhibited the inconsistency may be taken offline, reprogrammed, and then reenabled [7]. Figure 2.8 shows an example of triple-modular redundancy.

Footnote 1: n is usually odd, i.e., n = 3, 5, 7, ...

The advantage of this approach is that defect detection and correction can occur while the circuit is still in operation. This approach can also be scaled. A higher level of protection can be attained simply by increasing the value of n. The clear disadvantage of this architecture is its high area overhead. Defect correction also requires that the FPGA being used has the ability to perform partial reconfiguration. If not, defect correction requires the reprogramming of the entire FPGA. Furthermore, it is unclear whether this approach can be used to tolerate multiple permanent defects.

Roving STAR

Roving STAR addresses defects that occur during the lifetime of the FPGA [1]. This approach divides the FPGA into two areas, a spare area (STAR) and the working area. The STAR is used for tests and diagnostics, while the working area contains the design. Each area contains a number of spare unused resources. These spare resources are used for defect correction.

During the operation of the FPGA, the STAR shifts across the FPGA by copying the configuration and state of the prospective STAR location (pSTAR) into the current STAR location (cSTAR), and enabling and disabling cSTAR and pSTAR respectively.
The new STAR location is then reprogrammed to run tests and diagnosis on itself. Defects are catalogued and saved in memory. When the next shift is about to take place, the configuration of the next prospective STAR location is manipulated so that spare unused resources are used in place of the defective ones. The shifting process continues until the entire FPGA is tested.

Like n-modular redundancy, Roving STAR requires that the underlying FPGA supports partial reconfiguration. It also requires the reservation of spare resources, and the incorporation of a reconfiguration manager. The shifting of design blocks can also affect signal timing for inter-block communication. Nevertheless, this approach is advantageous because testing and diagnosis are performed during run-time, and because the reconfiguration manager can dynamically change the configuration to avoid defects. The latter allows the design to continue operating in the presence of unanticipated permanent defects.

2.4 Summary

FPGAs are composed of programmable logic and programmable routing. These structures are arranged in a regular pattern and are susceptible to defects. FPGA manufacturers must perform tests on FPGAs to ensure that the chips they sell are "defect-free." Additionally, diagnosis is sometimes required to apply defect correction (e.g., spare row and column).

Defect-tolerant techniques have been developed to increase the number of usable FPGAs. Each technique varies in both its ability to tolerate defects and the amount of overhead needed. At one extreme, the software redundancy approach can tolerate a large number of defects with little or no area overhead, but costs a significant amount of computing time or storage space. At the other extreme, the hardware redundancy approach can perform defect correction quickly, but costs area overhead.
Further, hardware redundancy approaches typically tolerate fewer defects than software redundancy approaches. The existence of the two extremes suggests the presence of an architecture that can both tolerate a number of defects and perform defect correction quickly. The subsequent chapter presents one such example.

Chapter 3 Fine-grain Redundancy (FGR)

Rather than consolidating the spare resources into rows and columns, the proposed fine-grain redundant architecture (FGR) contains spare resources that are distributed evenly across the FPGA. This approach to defect tolerance allows the architecture to tolerate multiple randomly distributed defects.

3.1 Architectural and Implementation Details

FGR builds upon the island-style directional wiring architecture described in [26]. The original architecture is not defect-tolerant. This section will present the changes needed to make it defect-tolerant.

3.1.1 Switch Block Changes

To add defect-tolerance to the original directional switch block, two layers of multiplexers are wrapped around the switch block. This is shown as the two outer layers in Figure 3.1. The outer-most layer represents the shift-avoid layer of multiplexers (omux), and the middle layer represents the shift-restore layer of multiplexers (imux).

Figure 3.1: High-level defect-tolerant switch block

The omux allows signals to "steer" away from a downstream defect. By means of these multiplexers, signals routed on track t can be shifted up to tracks t+1 or t+2. When there is a defect on track t, the defect is avoided by shifting up all signals routed on tracks ≥ t. Signals on tracks < t remain in place. Clearly, the shifting requires that there be spare routing tracks.
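The shift-avoid action can be sketched as a simple track remapping. This is an illustrative model only, assuming a trackgroup is represented as a mapping from track number to signal name, that the signal on the defective track itself is moved along with those above it, and that `num_tracks` counts the spare tracks; `shift_avoid` is a hypothetical name.

```python
def shift_avoid(signals, defect_track, num_tracks, shift=1):
    """Remap signals so that none remains on the defective track.

    signals:      dict mapping track number -> signal name
    defect_track: track t that must be avoided
    num_tracks:   tracks in the trackgroup, spares included (assumption)
    shift:        1 for a single defective track, 2 to skip two tracks
    """
    shifted = {}
    for track, signal in signals.items():
        # Tracks below the defect stay put; the rest move up by `shift`.
        new_track = track + shift if track >= defect_track else track
        if new_track >= num_tracks:
            raise ValueError("not enough spare tracks for this shift")
        shifted[new_track] = signal
    return shifted
```

For example, with signals on tracks 0 to 2 and one spare (four tracks in total), a defect on track 1 leaves track 0 untouched and moves the other two signals to tracks 2 and 3. Without the spare, the call fails, which mirrors the requirement for spare routing tracks.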
As will be shown later, these spares incur about 10% area overhead for each spare set. Because this is a fixed overhead amount, the percentage tends to decrease as FPGA dimensions (i.e., channel widths) are scaled up.

The imux is used to reverse or restore the shift-avoid action taken by an upstream switch block. These multiplexers allow a signal on track t+1 or t+2 to shift down to track t, thereby nullifying any upstream shifting action. To keep the effects of track shifting localized, the switch block was designed such that any signal leaving a perturbed neighbourhood can be restored to the original track number. This localization allows our architecture to tolerate multiple distributed defects.

To reduce the delay of long nets, a bypass path similar to [27] is introduced into the switch block. This bypass path connects a straight-through wire endpoint directly with the corresponding omux on the opposite side of the switch block. This reduces the multiplexer depth per wire from 3 down to 1. Note that the bypass path is optional: it creates an alternate path for signals travelling across the channel by skipping the imux and normal directional switch.

In an attempt to reduce area and delay overhead, a reduction in switch block flexibility, Fs, was considered. Fs is the number of other wires connected to a wire at a given switch block [31]. By decreasing Fs, the number of potential connections available to a signal is reduced. This in turn reduces the number of inputs on the input multiplexer and thus improves both area and delay. With long wires, the flexibility at the end switch blocks or endpoints can be different than at the middle switch blocks or midpoints. We considered the following switch flexibilities:

1. The E3M2 switch is the directional switch described in [26]. It has Fs = 3 for endpoints and Fs = 2 for midpoints.
This allows endpoints to form connections with interconnects on the left (left turn), right (right turn) and opposite side (straight-through) of the switch block, and midpoints to form connections with interconnects on the left and right side of the switch block respectively.

2. The E3M1 switch also uses Fs = 3 at endpoints. However, midpoints are reduced to Fs = 1, meaning they can only turn either left or right (not both). The turn direction alternates along the length of a wire.

3. The E2M1 switch has Fs = 2 for endpoints and Fs = 1 for midpoints. Endpoints include the straight-through connections and a left or right turn, while midpoints can only turn left or right. Turns are handled in the same manner as E3M1.

3.1.2 Connection Block Changes

As a consequence of track shifting, signals that were once routed on track t can now reside on tracks t+1 or t+2. To accommodate this variability, the connection block must also be modified. In FGR, the CLB outputs do not need to be modified because they are already fully connected to all of the tracks. However, the CLB input connectivity must be increased by adding the connections required to the shifted tracks. This modification is shown in Figures 3.2a and 3.2b.

Figure 3.2: Connection block design. a) Original C block; b) Fault-redundant C block; c) C block optimization (Unoptimized, Optimized)

Initially, the CLB inputs are connected to half of the routing tracks. This amount of connectivity was shown to provide a suitable trade-off between routability, area and delay [4, 26]. To adjust for the track shifting, for every track t that is originally connected to a CLB input, tracks t+1 and t+2 must also be connected to the input (assuming these connections don't exist already). Thus, if a signal gets shifted up by 1 or 2, the CLB can still extract the correct signal.
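The rule for widening the CLB input connectivity can be written down directly. The sketch below is a hedged illustration, assuming tracks are numbered from 0 and that `num_tracks` includes the spare tracks; `expand_cblock` is a hypothetical helper, not part of any tool described here.

```python
def expand_cblock(connected, num_tracks, max_shift=2):
    """Tracks a CLB input must tap once track shifting is allowed.

    connected:  tracks originally connected to the CLB input
    num_tracks: channel width including spares (assumption)
    max_shift:  largest upward shift supported (2 for +1 and +2)
    """
    expanded = set(connected)
    for track in connected:
        for shift in range(1, max_shift + 1):
            # Add t+1 and t+2 unless they fall outside the channel.
            if track + shift < num_tracks:
                expanded.add(track + shift)
    return sorted(expanded)
```

In an eight-track channel, the scattered tracks {0, 4} grow to six connections, whereas the four consecutive tracks {0, 1, 2, 3} also grow to only six, because consecutive tracks share most of their shifted connections.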
Clearly, area overhead can be reduced by maximizing the number of consecutive tracks that are connected to a particular CLB input, as shown in Figure 3.2c. However, this optimization is left for future work. Ultimately, in FGR, the CLB inputs are connected to slightly more than half of the routing tracks.

3.1.3 Supported Defects

The proposed scheme categorizes non-bridging interconnect defects into three disjoint classes: single-length, double-length and intolerable defects. Depending on the underlying FGR architecture, the number of defect classes for bridging defects can potentially increase to five: single-length, double-length, triple-length, quadruple-length and intolerable defects.

Non-bridging Defects

If an open or stuck-at fault occurs on the wire, or there is a stuck-at fault in the wire driver or the output of the omux, the defect is a single-length defect. In this case, one switch block avoids the defect and all adjacent "downstream" switch blocks do the restore. This kind of defect is isolated to one wire length. Figure 3.3a illustrates how a single-length defect is corrected. With single-length defects, the change is purely localized in the channel to a group of wires with common start and ending points in the array. Such a group of wires is called a trackgroup. To accommodate shift, each trackgroup has one spare wire for each direction.

If a defect is found in any of the multiplexers (aside from the output of the omux), the defect is categorized as a double-length defect. Due to their location, these defects actually impair the defect-correcting ability of the current switch block. To fix this, the switch block of the adjacent "upstream" trackgroup is used to avoid the defect, and the downstream switch blocks do the restore. Hence, this kind of defect requires two wire lengths to correct.
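The two tolerable non-bridging classes can be summarized as a lookup from defect site to class. This is a small sketch under stated assumptions: the site labels (`"wire"`, `"wire_driver"`, `"omux_output"`, `"omux"`, `"imux"`, `"directional_mux"`) are hypothetical names chosen for illustration, with `"omux"` standing for an omux defect other than at its output.

```python
from enum import Enum


class DefectClass(Enum):
    SINGLE_LENGTH = 1  # avoided and restored within one wire length
    DOUBLE_LENGTH = 2  # avoidance starts in the upstream trackgroup


# Hypothetical site labels; the grouping follows the text above.
_SINGLE_SITES = {"wire", "wire_driver", "omux_output"}
_DOUBLE_SITES = {"omux", "imux", "directional_mux"}


def classify(site):
    """Map a non-bridging defect site to its defect class."""
    if site in _SINGLE_SITES:
        return DefectClass.SINGLE_LENGTH
    if site in _DOUBLE_SITES:
        return DefectClass.DOUBLE_LENGTH
    raise ValueError(f"unknown or intolerable defect site: {site}")


def wire_lengths_needed(site):
    """Number of wire lengths involved in correcting the defect."""
    return classify(site).value
```

Intolerable defects are deliberately left out of the lookup and raise an error, since no shift pattern repairs them.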
Figure 3.3b indicates how a double-length defect spans two adjacent trackgroups: the upstream trackgroup on the left, and the defective one on the right. In fact, for this example there are additional upstream switch blocks (above and below) that reside in the vertical channel. As shown in Figure 3.4a, this introduces contention when a straight-through signal is shifted up onto a track that is expected to be available for turning signals. To avoid contention, signals on tracks > t in the vertical channel must be shifted before arriving.

The upstream pre-shifting just described is only one way to solve the conflict problem with double-length defects. A more robust solution is shown in Figure 3.4b. Here, the imux is embedded within the switch block and the internal switch block multiplexers are duplicated. This shrinks the requisite defect-free area to just the two adjacent trackgroups.

Figure 3.5 highlights the neighbourhood that must be defect-free to correct for a double-length fault in an architecture with length 2 wires. The defect in question spans two trackgroups and affects the same track number. The defective track is avoided by up-shifting all the signals in the faulty trackgroup (highlighted in yellow). To localize the defect, all the signals leaving the faulty trackgroups are restored to the original track number immediately. The trackgroups containing these "fanout" signals include endpoint and midpoint connections, and are highlighted in blue. Highlighted in green are all the trackgroups with signals that "fanin" to the defective trackgroups. As mentioned before, signals in these trackgroups are pre-shifted by the same amount as the defective trackgroup to eliminate resource contention. Pre-shifting guarantees that the arriving signals are on the correct and unoccupied track.

The maximum number of trackgroups that can be affected by defect correction is called the minimum fault-free radius or MFFR.
The MFFR of a defect encompasses all the trackgroups that need (or potentially need) to be up-shifted and down-shifted. In general, trackgroups that contain shifted signals, or trackgroups that have had their signals recently restored, are included in a defect's MFFR. The MFFR of the previous example encompasses all the yellow, green and blue trackgroups.

Figure 3.4: Embedding imux to avoid contention. a) Initial imux design with contention; b) Embedded imux design avoids contention

Bridging Defects

Wire bridges and certain source-drain shorts have the potential to render two adjacent tracks unusable. To avoid such a defect, the upstream switch block(s) must shift tracks up by 2 and the downstream switch blocks must shift down by 2. If this is implemented using +/-2 shifts, the defect is classified as a double-length defect. If this is implemented with a combination of +/-2 and +/-1 shifts (i.e., one +2 shift followed by two -1 shifts), the surrounding neighbourhood that must be defect-free can potentially be larger than that of a double-length defect. An architecture that only supports shifts by +/-1 requires two +1 shifts followed by two -1 shifts to avoid bridging defects or source-drain shorts. In this worst case scenario, the defect is considered a quadruple-length defect.

Intolerable Defects

There exists a class of intolerable defects that has not been considered, which includes power/ground shorts and clusters of defects. The first type of defect cannot be tolerated. However, it may be possible to tolerate the latter by complementing our architecture with a spare row/column technique, e.g., [18].

3.1.4 Modes of Operation

FGR allows defect redundancy to be an option to the user. This means an FPGA can operate in two modes: normal defect-tolerant mode and recovery mode.
The normal mode assumes the customer will buy imperfect, low-cost devices, and utilize the underlying defect-tolerant architecture to avoid defects. In this mode, the routing software reserves a spare routing track in each trackgroup. (Two spares are needed for devices with bridging defects, which may be sold at even lower cost.) This reduces the number of routing tracks available to the application, but the spares are needed for defect correction. For many applications which do not stress the routing network but require inexpensive devices, this is an easy way to lower device cost.

The recovery mode assumes the customer will buy perfect devices at a price premium. In this mode, the routing software uses the additional imux and omux routing multiplexers to increase the flexibility of the interconnect. In essence, the router is using the redundant resources to recover some area/delay efficiency that was sacrificed when they were added. This mode is used for those few applications that have high interconnect demands where the resulting increase in interconnect flexibility is even more helpful. However, in this mode, there is no natural ability to tolerate defects.

Recovery mode is the true overhead cost of FGR. It is the additional cost imposed on aggressive applications with high interconnect demands. When normal mode is used, the overhead appears to be higher because spare tracks are also counted as overhead. This is misleading! Applications with low interconnect demands already have an abundance of unused routing tracks, so this extra capacity is already built-in. The customer has already paid for these unused routing tracks, so there is no real end-user cost to supplying them. The proposed redundancy scheme merely finds a use for these free tracks by calling them spare tracks.
3.1.5 Detailed Transistor-level Design

The transistor circuit schematic used for HSPICE simulations is shown in Figure 3.6. The components in the circuit (from left to right) are: input buffer, directional multiplexer, strengthening buffer, shift-restore multiplexer (imux), shift-avoid multiplexer (omux), tapered driver and the wire model with loads.

For area considerations, the directional multiplexer is implemented using a tree of minimum-sized transistors. This allows the use of encoded control lines and reduced SRAM usage. We also assumed both true and complemented outputs are available from a 6-transistor SRAM cell. The omux is implemented using a decoded multiplexer with a single level of minimum-width pass transistors. Each pass transistor is controlled by an independent SRAM cell. The motivation for this is to reduce delay.

Figure 3.6: HSPICE schematic for delay characterization (labels: directional multiplexer, bypass path).

Three different implementations of the imux were considered: decoded, encoded and embedded. The decoded multiplexer is identical in implementation to the omux. This kind of multiplexer trades area for delay. The encoded imux is built like the directional multiplexer. It trades delay for area. As mentioned earlier in Section 3.1.3, it is also possible to embed the imux into the directional multiplexer. This enhanced multiplexer is built by duplicating the inputs of the directional multiplexer for track t+1 and t+2, and connecting them to the directional multiplexer for track t. An embedded imux allows signals to turn and shift at the same time. As will be shown later, this improves yield with double-length defects at the expense of some area.

Seven different defect-tolerant implementations were considered in this study.
These implementations vary in the implementation of the imux and the shifting ability of the multiplexers. The +/-2 shifts use additional area to improve the yield of bridging defects. The attributes and differences between the switch implementations are summarized in Table 3.1.

Arch.   imux impl.   imux "-2" shift   omux "+2" shift   imux/omux ±1 shift
EM22    embedded     Y                 Y                 Y
EM12    embedded     N                 Y                 Y
EM11    embedded     N                 N                 Y
FL22    decoded      Y                 Y                 Y
EN22    encoded      Y                 Y                 Y
EN12    encoded      N                 Y                 Y
EN11    encoded      N                 N                 Y

Table 3.1: Defect-tolerant switch implementations.

The area and delay performance of the implementations are also sensitive to the precise transistor-level circuit design of the multiplexers and buffers. The procedure used for transistor sizing is the same as the one described in [25, 26], and is as follows:

1. Select a parameter to be optimized
2. Sweep the selected parameter across a range of values while holding all other circuit parameters fixed
3. Determine the best parameter value based on the minimization of area-delay product for the entire circuit
4. Reiterate steps 1 to 3 for remaining circuit parameters
5. Repeat steps 1 to 4 until the area-delay product stabilizes

The stabilization of circuit parameters and delay profiles was observed to take approximately 3 complete iterations. Delay results are computed from HSPICE simulations of TSMC's 180nm technology.

3.1.6 Software Aspect of FGR Defect Avoidance

To successfully correct a defect, the location of the defect must be known. One way to provide this information is through the use of a relatively unique list of defective resources, or defect map. These can be stored on-chip in non-volatile storage, or in an off-chip database indexed using a unique on-chip serial number.

The aforementioned hardware changes provide an infrastructure for defect correction, but not the means to apply it. Defect correction for FGR is applied through bitstream manipulation.
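A defect map of this kind could be modelled as follows. This is a minimal sketch and not the actual storage format: the record fields follow the description given in Section 3.2 (x, y, track number and single- or double-length type), while the serial number `"SN-0001"`, the in-memory `DEFECT_DB` dictionary and the helper names are hypothetical. The bitstream manipulation itself is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    x: int       # wire segment column in the array
    y: int       # wire segment row in the array
    track: int   # track number within the channel
    kind: str    # "single" or "double" (length class)


# Hypothetical off-chip database indexed by on-chip serial number.
DEFECT_DB = {
    "SN-0001": [
        Defect(3, 5, 2, "single"),
        Defect(3, 5, 3, "single"),
        Defect(7, 1, 0, "double"),
    ],
}


def lookup_defects(serial, db=DEFECT_DB):
    """Return the defect map for a device; empty if none is recorded."""
    return db.get(serial, [])


def bridging_pairs(defects):
    """Pairs of defects on adjacent tracks at the same (x, y) location."""
    by_site = {(d.x, d.y, d.track): d for d in defects}
    pairs = []
    for (x, y, track), defect in sorted(by_site.items()):
        neighbour = by_site.get((x, y, track + 1))
        if neighbour is not None:
            pairs.append((defect, neighbour))
    return pairs
```

A configuration tool would call `lookup_defects` with the serial number read from the device and then decide, per defect or per bridging pair, which shift settings to write. For brevity the record omits the channel direction, which a real map would likely need.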
When an FPGA is being programmed, the defect map specific to that FPGA is called up and the appropriate modifications are made. The correction can be applied during programming or bitstream generation. It can even be applied by means of an embedded configuration processor within the FPGA or configuration memory or subsystem. With this latter method, defect correction can be completely hidden from the user.

3.2 Limitations

The proposed architecture has a number of limitations. First, FPGA and VLSI testing strategies capable of locating defects were assumed to be available. The defect information, in the form of a defect map, may be provided by the vendor or generated by the user. The defect map does not need to be overly detailed. For each defect, it must identify the wire segment location in the array (x, y and track numbers) and type (single- or double-length). Bridges are identified as adjacent defect pairs.

The method of dealing with bridging defects assumes that routing tracks within the same channel are laid out beside one another. This may not be a realistic assumption since there are many factors that influence the layout of an FPGA. This solution should only be viewed as a general approach. To fully protect an FPGA from bridging defects, the final FPGA layout must be considered.

As described earlier, defects must be surrounded by some defect-free resources for successful repair. As a result, this approach cannot tolerate clusters or closely-spaced defects. To reconcile this shortcoming, it is possible to complement FGR with a spare row/column technique [18].

Finally, defects in the logic block have been ignored in this architecture. This issue has been addressed in the past [20, 23, 24]. In these techniques, logic blocks are intentionally left under-utilized (thereby creating the impression of having spare resources).
In the event of a defect, resource assignment within the logic block is manipulated so that the logic on a defective resource is shifted to an unused resource. To achieve logic block defect tolerance, these techniques, or a spare row/column technique, can be used to complement FGR.

3.3 Area and Delay Results

The new architectural features were incorporated into an enhanced version of the VPR place and route tool, VPRx [26], which now supports directional wires [25]. VPRx was then used to map the 20 largest MCNC benchmark circuits [9] into an island-style FPGA consisting of directional length 4 wires and CLBs containing eight 4-input LUTs. The area and critical path delay results reported by VPRx for each circuit are normalized; then a geometric average across the circuits is computed.

3.3.1 Area

Routing experiments with non-defect-tolerant switch blocks indicated that the directional switch E3M1 used 2% less area than E3M2 and E2M1. The average critical path delay for E3M1 was also 3% lower than the other two architectures. In comparison, the average channel width increased by 2% and 8% for E3M1 and E2M1, respectively. These results are shown in Figure 3.7. Hence, the non-defect-tolerant E3M1 was selected to be the basis for all area and delay normalization (=1.0).

Figure 3.7: Flexibility exploration for non-fault tolerant architectures (channel width, area and delay)

Figure 3.8 presents the average area overhead for the defect-tolerant switch blocks. The results have been normalized to the non-defect-tolerant E3M1, the best alternative without defect tolerance. When routing the design in normal mode, two spare sets of wires were added in the channel for the 7 architectures that tolerate bridging defects. Only one spare set of wires is inserted for the architectures that do not tolerate bridging defects (the 2 architectures with -NB). These spare wires were not used during routing. The EN11-E3M1 architecture was the most area-efficient, having an area overhead of 24% for non-bridging defects and 34% for bridging defects. The difference in area cost (10%) is one set of spare wires. Notice that the second-best area architecture, EM11-E3M1, needs +4% to embed the imux, but it will be shown later that it tolerates more defects.

Figure 3.8: Area of defect-tolerant implementations (legend, Implementation + Architecture: E3M2, E3M1 and E2M1, each as -NRST, -Normal and -Recovery).

3.3.2 Delay

The average critical path delay for each architecture is shown in Figure 3.9. These numbers were obtained by rerouting the 20 benchmark circuits using a channel width equal to the minimum channel width obtained from the defect-tolerant area investigation plus one additional set of wires. Unlike the spare wires that are held in reserve, the router was allowed to use this new set of wires to relieve delay increases caused by congestion. The experiment indicated that the EM11-E3M2 architecture gave the lowest average critical path delay overhead of 15%. Overhead for the non-embedded version, EN11-E3M1, was 24%.

Figure 3.9 also shows the importance of the bypass path for delay reduction. The -NRST results (no route on straight through) show significantly higher delay when the router is forced to avoid using the bypass path.

3.3.3 Area and Delay Recovery

To explore the true area and delay overhead, the routing tool was put into recovery mode. In general, it was observed that the router needs lower channel widths for defect-tolerant architectures in recovery mode than non-defect-tolerant architectures. Hence, the additional multiplexers do help improve interconnect flexibility.

Figure 3.8 shows the true area overhead for the defect-tolerant implementations in recovery mode.
The EN11-E3M1 architecture demonstrated the lowest area overhead of 11%. Overall, recovery mode saves a significant amount of area.

The critical path delay overhead in recovery mode is shown in Figure 3.9. Here, the EM11-E3M2 architecture demonstrated the lowest delay overhead of 5%.

Figure 3.9: Delay of defect-tolerant implementations (legend, Implementation + Architecture: E3M2, E3M1 and E2M1, each as -NRST, -Normal and -Recovery).

3.3.4 Area and Delay Product

Using the delay and area results obtained from the previous two experiments, the area-delay product for each architecture in recovery mode was computed. Figure 3.10 shows that the EM11-E3M1 architecture produced the lowest area-delay product.

When comparing yield later in this thesis, EM22-E3M1 will be selected as the best FGR variation. Although it has the highest area overhead, it tolerates the largest number of defects and has among the lowest delay.

Figure 3.10: Area-delay product comparison (legend, Implementation + Architecture: E3M2, E3M1 and E2M1, each as -NRST, -Normal and -Recovery).

Chapter 4 Coarse-grain Redundancy (CGR)

In a traditional coarse-grain redundancy (CGR) scheme, one spare row and one spare column are used to correct defects. This approach can tolerate clusters of defects within the same channel. However, the consolidation of spare resources into a single spare row/column severely restricts this architecture's ability to tolerate randomly distributed defects. In this chapter, two schemes for adding multiple spare rows and columns to tolerate these types of defects are considered.
4.1 Architectural and Implementation Details

Traditional CGR adds one spare row and one spare column to the existing FPGA layout. This architecture is limited to defect correction for one row and one column, but it can naturally tolerate clusters of defects within the same channel. In the event of such defect clusters, the row or column containing the defects is bypassed and the spare row or column is used. This architecture can in fact tolerate multiple defects within the same channel. However, as array size grows, it becomes increasingly unlikely that multiple defects will lie in the same row/column. To increase yield, a scheme is needed to add additional spare rows and columns to the architecture. However, traditional CGR does not clearly indicate how multiple spare rows/columns can be added. How to support multiple spare rows/columns in an island-style FPGA architecture is considered below.

4.1.1 Switch Block Changes

CGR requires modification to the existing detailed switch block design. To allow one row (and column) to be bypassed, all interconnect wires are extended in length by one row/column. Figure 1.1a highlights these track extensions in green. These track extensions are not used or needed in a defect-free FPGA. In the presence of one or more defects, the extensions allow signals that would have terminated at the defective row/column to reach the subsequent defect-free row/column. An example of row bypassing is shown in Figure 1.1b. Notice that the track extensions are enabled only for the wires that traverse the defective row/column or for wires that start at a shifted row/column. The impact of these extensions on the S block is shown in Figure 4.1. These detailed changes have been documented by previous papers in the area.

4.1.2 Connection Block Changes

The bypassing of a row/column also necessitates changes in the connection block design. These changes are highlighted in Figure 4.1.
In addition to the original connections, notice that the CLB inputs now accept connections from the wire extensions, and that the CLB outputs connect to wire starts in the original switch block and the switch blocks one channel over. The duplication of input and output connections is required to make the row/column bypassing transparent to the CAD tools.

Figure 4.1: Connection block changes for CGR. a) S and C block input modifications; b) S block input and C block output modifications

4.1.3 Multiple Spare Rows and Columns

There are two ways to construct a multiple spare rows and columns architecture. The first method adds global spare rows and columns to the array. These global spares can be used to correct defects anywhere in the FPGA. This architecture can tolerate as many defective rows/columns as there are spares. The added cost of this approach is the increased length of the routing wires, the spare rows/columns, plus the extra multiplexing needed within the S and C block to accomplish the bypass. The additional switching adds significant area overhead, and the wire extensions add significant capacitance which increases both delay and power.

Figure 4.2: Multiple spare row and column architectures. a) CGR-G2; b) CGR-L1-S2

The second method of implementing a multiple spare rows and columns architecture is by distributing spare rows/columns evenly among subdivisions of the chip. The FPGA is divided into smaller subdivisions. Each subdivision has dedicated local spare resources. Defect correction is handled locally within each subdivision. This approach has a smaller area overhead for a given total number of spares because the spare wire extensions are shorter and switch overhead is reduced.
For conciseness, the multiple global spare rows and columns architecture and the multiple local spare rows and columns architecture will be denoted as CGR-Gn and CGR-Ln-Sp, respectively. CGR-Gn has exactly n global spare rows/columns that can be used to repair any defective row/column. CGR-Ln-Sp is divided into 2p subdivisions, p subdivisions for rows and p subdivisions for columns. Each subdivision has n spares. Overall, this architecture has 2pn spare rows/columns. Figure 4.2 presents CGR-G2 and CGR-L1-S2.

4.1.4 Supported Defects

In CGR, all defects are essentially treated equally. A channel containing defects is always replaced with an entire spare row or column. This simplifies the correction process, but has the drawback of being inefficient with resource usage.

Since there is only one spare row and one spare column, traditional CGR can only tolerate defects in one row and one column. Within that row/column, multiple defects can be tolerated. This gives the architecture the natural ability to tolerate clusters of defects within the same channel. CGR can also tolerate defects in both routing resources and logic blocks; but like FGR, it cannot tolerate power/ground shorts. Also, some types of defects may render both a row and a column unusable because they occur at a turning switch (e.g., a horizontal wire is shorted with a vertical wire).

4.1.5 Detailed Transistor-level Design

Unfortunately, published research does not present the delicate circuit details needed to perform the bypass. Altera patents provide some insight [3] and indicate that additional circuitry is required for bypassing. Without a detailed design, it will be demonstrated in Section 4.3 that the switch overhead alone for CGR-G1 and CGR-L1-Sp is similar to FGR.
4.2 Limitations

Although the area overhead of spare rows/columns appears to be very clear, the additional circuitry required within each S block and C block to bypass a faulty row/column is non-trivial and is not reported in previous work. To account for this additional overhead, estimations are used in place of detailed transistor models.

An underlying assumption of traditional CGR is that defects are isolated to one row/column. This assumption allows defect-avoidance through row/column bypasses. However, some types of defects (e.g., shorts between wires in two different rows/columns) may render two rows or two columns unusable. Such defects cannot be tolerated in traditional CGR since there is only one spare row/column.

4.3 Estimated Results

The following area and delay estimates are based upon the CGR model shown in Figure 4.3 in comparison with the FGR model in Figure 4.4. Figure 4.3 highlights the architectural changes needed to convert a non-defect tolerant switch block into a CGR-G1 and CGR-G2 switch block. The architectural changes needed for a CGR-Ln-Sp architecture are comparable to those of a CGR-Gn architecture for the same n, but more spares are needed.

4.3.1 Area

Figure 4.5 shows the spare row/column area increases needed for CGR-G1. These results do not include the additional switch area, only the overhead resulting from the addition of one spare row and column. Notice that the area overhead decreases as the FPGA array size grows. In a 32x32 FPGA, the area overhead for one spare row and column is approximately 6% (2 x 1/32). The area overhead reduces to approximately 1% for a 256x256 FPGA.

Figure 4.3a presents a non-defect tolerant switch block. To make this a CGR-G1 switch block, wire extensions are needed for CLB inputs and outputs and routing tracks. Figure 4.3b highlights the necessary extensions in green.
Notice that the number of inputs for the CGR-G1 directional multiplexer roughly doubles. This actually makes it similar in size to the EM11 embedded imux shown in Figure 4.4a. In addition, EM11 requires an additional 2:1 multiplexer (omux).

The necessary changes for CGR-G2 are shown in Figure 4.3c. This architecture has two spare rows/columns, thus wire extensions span two switch blocks. The first extension is highlighted in green, while the second is in red. This switch block is comparable to the EM22 switch block shown in Figure 4.4b.

The similarity between the switch blocks allows for a rough approximation of area overhead. For CGR-G1, the area overhead is approximately that of EM11, which is about 40%. CGR-G2, which is similar to EM22, has an area overhead of approximately 50%. The switch block area overhead for CGR-G1 and CGR-G2 will be slightly less than that of EM11 and EM22 respectively since the FGR approach requires the addition of omux multiplexers. However, for architectures where there are more than 2 global spares, the CGR area overhead will be significantly greater than that of FGR.

Figure 4.3: Comparison between CGR implementations

Figure 4.4: Comparison between FGR implementations. a) FGR - EM11; b) FGR - EM22

4.3.2 Delay

Again, the similarity between switch block implementations allows for a gross approximation of delay overhead. EM11 and EM22 have delay overheads of approximately 20%. The delay overhead for the CGR-G1 and CGR-G2 architectures is likely to be slightly less than that of the FGR architectures since CGR has one fewer multiplexer level. For architectures with more than 3 spare rows/columns, the delay overhead will increase beyond FGR as there are significantly more inputs (and thus levels) to the directional multiplexer itself.

Figure 4.5: CGR-G1 for increasing array sizes (spare row/column overhead only). X-axis: Array size (M)

4.3.3 Scaling Factors

To approximate area for CGR, the area of an N x N FPGA (footnote 1) is multiplied by a constant scaling factor of 1.5. The value of 1.5 is used because the S block for CGR-G2 is similar to EM22, which has an area overhead of 50%. Similarly, switch area overhead for CGR-G1 and CGR-L1-Sp is less than this, being similar to EM11. In this case, a scaling factor of 1.3 is used. Since architectures with more than 2 spare rows/columns will likely be significantly larger than EM22, this approximated area represents the lower bound area overhead for CGR with more than 2 spare rows/columns.

Similarly, a scaling factor of 1.2 can be used for the approximation of delay overhead. However, delay results for CGR are not presented in this thesis.

[Footnote 1: N = M + n, where M is the base array size and n is the number of required spare rows/columns.]

Chapter 5

Yield Comparison

This chapter presents a yield comparison between FGR and CGR. The comparison is based on four factors that influence yield: switch implementation, switch flexibility, array size and wire length. First, however, a yield model is presented for both CGR and FGR.

5.1 Yield Model

For the subsequent yield analysis, all faults are assumed to be bridging defects (worst case). Logic faults, which are intolerable in the present FGR architecture, are not considered.

5.1.1 Coarse-grain Model

A number of simplifications and assumptions are made for the CGR yield model.
The model assumes the following:

• All channels have identical routing resources and thus have an equal probability of being defective;
• The vertical and horizontal channels are disjoint routing networks (defects are assumed to be isolated to just a row or just a column);
• M x M FPGAs are perfectly symmetrical; and
• A spare row and column are added in tandem to retain the square shape.

To inject a random defect, a random row or column is selected, and the defect count for that row/column is incremented. For CGR-G1, a failure occurs when there are defects in two different rows or two different columns. A failure is represented as a non-zero defect count for two different rows or two different columns.

Architectures with multiple global spare rows and columns are evaluated in a similar manner. A failure occurs when there are more defective rows or columns than there are spare rows/columns. In CGR-Ln-Sp, each subdivision has exactly n designated spare rows/columns. This architecture can tolerate at most n defective rows/columns per subdivision. Figure 5.1 highlights the differences in terms of defect correction between CGR-G2 and CGR-L1-S2.

5.1.2 Fine-grain Model

To model the behaviour of defect correction for FGR, state variables are assigned to every trackgroup within the FPGA. A trackgroup can have one of three states: perfect, faulty, or must be perfect. The faulty state indicates the presence of a defect in that particular trackgroup. The must be perfect state is used to mark the MFFR of a defect. As mentioned before, the MFFR of a defect defines the region needed for shifting to avoid and restore around the defect. To guarantee that a defect is correctable, the MFFR of a defect must be defect-free. The must be perfect state facilitates the enforcement of this requirement.

Defects are injected into the model by randomly selecting a trackgroup and setting its state to faulty.
The neighbouring trackgroups as defined by the MFFR are marked as must be perfect. The MFFR will vary depending on the defect type and the underlying routing architecture. Chip failure occurs when any of the following conditions arises:

• A faulty trackgroup overlaps with another faulty trackgroup - only one defect per trackgroup can be tolerated; or
• A faulty trackgroup overlaps with a must be perfect trackgroup - the defect may inhibit the ability to correct the other defect; or
• A must be perfect trackgroup overlaps with another must be perfect trackgroup - the defect corrections may interfere with one another.

Figure 5.1: Switch block with spare connections. a) CGR-G2 - Correctable; b) CGR-L1-S2 - Correctable; c) CGR-G2 - Correctable; d) CGR-L1-S2 - Uncorrectable

The yield approximation for FGR is pessimistic in two ways. First, the approximation only considers the injection of bridging defects. Second, MFFR overlap is not allowed. In reality, not all faults are of the bridging category, and non-bridging faults have a significantly smaller MFFR. Moreover, the avoidance and correction of certain defects can in fact be overlapped. However, without real manufacturing defect information, the worst case position is assumed. It is suspected that accounting for these two factors would appreciably improve the yield for FGR.

5.2 Architectural Considerations

Several factors can affect yield. This section discusses a few important ones. Results will follow in the next section.

5.2.1 Switch Implementation Impact on Yield

As mentioned in Chapter 3, the switch block in fine-grain redundancy can have either an embedded or an extracted imux. Of the two, the embedded imux has the higher area overhead but also the greater connectivity. The additional connections in an embedded imux allow signals to turn and shift at the same time.
As will be shown in the next section, the ability to turn and shift improves yield by reducing the number of trackgroups that must be pre-shifted.

It was also noted that the shifting ability of the switch block can be varied. If the ability to shift by two is eliminated, the size of the switch will decrease. Bridging defects and source-drain shorts can still be tolerated, but avoiding them will require two "+1" shifts followed by two "-1" shifts. In fact, any combination of shifts, such as a "+2" followed by two "-1" shifts or two "+1" shifts followed by a "-2", is acceptable. As noted in Section 3.1.3, changing the shifting ability of the multiplexers can potentially increase both the number of defect categories and the MFFR.

5.2.2 Flexibility Impact on Yield

The number of wires connected to a given switch block wire is defined as its flexibility, Fs [31]. Fs can be used to describe both end switch blocks, or endpoints, and middle switch blocks, or midpoints. With long wires, the flexibility at the endpoints and midpoints can differ. Lower Fs values equate to fewer connections, and thus smaller MFFRs. Thus, lower Fs values actually help improve yield.

5.2.3 Array Size Impact on Yield

It was shown in Chapter 4 that CGR-G1 demonstrates a decreasing amount of area overhead as array size increases. Unfortunately, as array size grows, it also becomes increasingly unlikely that multiple defects will lie in the same row/column. Thus, to maintain a fixed yield for growing array sizes, it is necessary to increase the number of spare rows and columns.

In FGR, spare resources are distributed across the FPGA. Increasing the array size increases the number of trackgroups in the FPGA, and thus the amount of available spare resources. Since the amount of spare resources naturally grows with size, the architecture tolerates more and more defects.
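The diminishing spare row/column overhead of CGR-G1 mentioned above follows directly from the array geometry. The sketch below (the function name spare_area_overhead is hypothetical) counts only the added rows and columns of an M x M array and ignores the extra switch and bypass circuitry, which is an assumption of this example rather than a full area model.

```python
def spare_area_overhead(m: int, spare_rows: int = 1, spare_cols: int = 1) -> float:
    """Fractional area increase from adding spare rows/columns to an m x m array.

    Assumption: every tile has the same area, so area is proportional to the
    number of tiles; bypass multiplexers and wire extensions are not counted.
    """
    base_tiles = m * m
    total_tiles = (m + spare_rows) * (m + spare_cols)
    return total_tiles / base_tiles - 1.0


if __name__ == "__main__":
    for m in (32, 64, 128, 256):
        print(f"M = {m:3d}: spare row/column overhead = "
              f"{100 * spare_area_overhead(m):.2f}%")
```

With one spare row and one spare column, this gives roughly 6% for M = 32 and under 1% for M = 256, consistent with the approximation 2 x 1/M.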
5.2.4 Wire Length Impact on Yield

In CGR, the routing networks within the rows and columns are assumed to be identical and disjoint. Since all rows and columns contain the same routing resources, the ability to replace a defective row/column containing wires of any length with a spare row/column is guaranteed. This eliminates the dependency of defect correction on wire length for the spare row and column architecture.

For FGR, increasing wire length increases both the fanout and fanin of all routing wires. These increase because long wires naturally have a greater number of midpoint locations. This increases the MFFR and negatively impacts yield.

Current FPGA routing architectures utilize multiple wire lengths within their routing architecture. In FGR, wires of different lengths are modelled as disjoint routing networks (footnote 1). Each routing network will have its own unique set of spare resources. Defect correction is restricted to the individual routing networks. Of course, these disjoint networks "join together" at the connection blocks, which have been modified to correct for defects in any of them.

5.3 Limitations

The yield model does not account for switch area or total die area changes. The area of the switch is an important consideration because larger circuits have a greater probability of being defective than smaller ones. For example, CGR-Gn and CGR-Ln-Sp have significantly larger switch areas because of the necessary bypass circuitry. This is not modelled, so the presented yield for CGR is over-estimated. This is also true for the various FGR implementations.

Power/ground shorts have been ignored in the fault simulation. These defects cannot be tolerated in either architecture. Defects in the logic blocks have also been ignored because of lack of real-life manufacturing data and because this thesis is only concerned with interconnect faults, which are more difficult to tolerate.
Routing tracks within the same channel are assumed to be laid out beside one another. When bridging defects are injected into the model, two adjacent tracks are made unusable. This assumption allows FGR to bypass bridging defects by performing shifts by 2. Larger faults (e.g., 3 wires) would not be tolerable in FGR.

The FGR model assumes that there can be at most one defect per trackgroup. However, certain types of single and double-length faults can in fact be overlapped with one another. For example, two defects can be overlapped if the underlying fault redundant architecture supports bridging defects, the faults themselves are not bridging defects, and the defects do not reside on adjacent tracks. Provided that these conditions are met, the defect on the lower track can be avoided using a "+1" shift while the one on the higher track can be avoided with a "+2" shift. Also, defects in the same tracks should be tolerable.

[Footnote 1: This is mostly a limitation of academic routing architectures, which treat the routing networks of different wire lengths as disjoint networks.]

When computing yield, the delay bypass path of the omux is assumed to be unused. If it is utilized, either signal timing must be perturbed when correcting a defect or device defects must be limited to one per channel.

Defects in CGR are assumed to require either a spare row or a spare column to be tolerated. Some defects require both, hence the yield for this architecture is over-estimated.

Lastly, when computing the MFFR for FGR, defects are assumed to be injected into the middle of an infinitely sized FPGA (footnote 2). This results in worst-case MFFRs for the defects. Since the trackgroups near the edge of the FPGA have lower connectivity than those in the center, the use of the worst-case MFFR value for all defects is overly pessimistic.
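The coarse-grain yield model of Section 5.1.1, together with the Monte Carlo procedure used for the results that follow, can be sketched in a few lines. This is a minimal illustration, not the simulator used in the thesis: the names cgr_survives and cgr_yield are hypothetical, each defect is assumed to hit a row or a column with equal probability, and the local-spare case assumes that p divides M evenly.

```python
import random


def cgr_survives(m, defects, n, p=None, rng=random):
    """Inject `defects` random defects and report whether the die is repairable.

    p is None -> CGR-Gn  (n global spare rows and n global spare columns)
    p given   -> CGR-Ln-Sp (p row and p column subdivisions, n spares each)
    """
    bad_rows, bad_cols = set(), set()
    for _ in range(defects):
        # Assumption: a defect lands in a row or a column with equal probability.
        target = bad_rows if rng.random() < 0.5 else bad_cols
        target.add(rng.randrange(m))

    if p is None:
        # Global spares: any n defective rows and any n defective columns.
        return len(bad_rows) <= n and len(bad_cols) <= n

    assert m % p == 0, "sketch assumes equal-sized subdivisions"
    size = m // p
    for bad in (bad_rows, bad_cols):
        per_subdivision = [0] * p
        for channel in bad:
            per_subdivision[channel // size] += 1
        # Local spares: at most n defective channels inside each subdivision.
        if max(per_subdivision) > n:
            return False
    return True


def cgr_yield(m, defects, n, p=None, dies=10_000, seed=0):
    """Fraction of simulated dies that remain repairable."""
    rng = random.Random(seed)
    good = sum(cgr_survives(m, defects, n, p, rng) for _ in range(dies))
    return good / dies


if __name__ == "__main__":
    for d in (1, 2, 4, 8):
        print(d, cgr_yield(32, d, n=2), cgr_yield(32, d, n=1, p=2))
```

In this sketch a CGR-Gn die can never fail while the defect count is at most n, because the number of distinct defective rows or columns cannot exceed the number of injected defects.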
5.4 Results

The yield estimates for CGR and FGR were obtained through Monte Carlo simulations. For a given number of defects, randomly located faults were injected into the interconnect for 100,000 different FPGA dies. The presented results do not account for intolerable defects such as power/ground shorts.

5.4.1 Switch Implementation

An embedded imux allows certain signals to shift and turn at the same time. This attribute relieves the need to pre-shift certain signals. Reducing the number of wires affected by defect correction results in a lower MFFR and improved yield.

Figure 5.2 shows the comparison between the EM11 and the EN11 architecture for both single-length and bridging fault correction. These architectures differ by imux implementation. The EM11 architecture uses an embedded imux, while the EN11 uses an extracted imux. The embedded imux demonstrates better yield for multiple defects.

[Footnote 2: Actually, the MFFR computation assumes a torus where the edges wrap around.]

Figure 5.2: Imux implementation (L=4, M = 32). X-axis: Number of Defects (log scale)

The shifting ability of the switches also affects the length of the repair region. With a shorter repair length, fewer trackgroups need to be pre-shifted and restored. This reduces the MFFR and improves yield. Figure 5.3 shows that the architectures with the greatest degree of shifting ability (EM22, FL22 and EN22) have the highest yield. Note also that the EN11-E3M1 architecture has slightly lower yield than the spare row and column technique. The reason for this is that bridging faults require an increased repair length due to the architecture's restrictive shifting abilities, producing a large MFFR. The resulting MFFR is sufficiently large that it makes tolerating more than 2 defects within a 32x32 FPGA (with L=4) very difficult.
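The link between MFFR size and yield discussed above can be illustrated with a toy version of the three-state trackgroup model from Section 5.1.2. Everything architectural here is an assumption made for illustration: trackgroups are placed on a one-dimensional ring, and the MFFR of a defect is taken to be a hypothetical `radius` neighbours on either side, whereas the real MFFR depends on defect type and routing architecture. Only the three states and the overlap rules come from the text.

```python
import random
from enum import Enum


class State(Enum):
    PERFECT = 0
    FAULTY = 1
    MUST_BE_PERFECT = 2


def fgr_survives(num_groups, defects, radius, rng):
    """Toy model: True if every injected defect has a clean, private MFFR."""
    state = [State.PERFECT] * num_groups
    for _ in range(defects):
        g = rng.randrange(num_groups)
        # faulty on faulty, or faulty on must-be-perfect -> chip failure
        if state[g] is not State.PERFECT:
            return False
        state[g] = State.FAULTY
        for offset in range(1, radius + 1):
            for nb in ((g - offset) % num_groups, (g + offset) % num_groups):
                # must-be-perfect on faulty or on must-be-perfect -> failure
                if state[nb] is not State.PERFECT:
                    return False
                state[nb] = State.MUST_BE_PERFECT
    return True


def fgr_yield(num_groups, defects, radius, dies=5_000, seed=0):
    """Fraction of simulated dies on which all defects are correctable."""
    rng = random.Random(seed)
    good = sum(fgr_survives(num_groups, defects, radius, rng) for _ in range(dies))
    return good / dies


if __name__ == "__main__":
    for radius in (1, 3, 10):
        print(f"MFFR radius {radius:2d}: yield = {fgr_yield(1000, 10, radius):.3f}")
```

Even in this simplified setting, enlarging the region that must stay defect-free lowers the yield at a fixed defect count, which is the qualitative trend reported for the architectures with more restrictive shifting.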
Figure 5.3: Shifting abilities (L=4, M = 32). X-axis: Number of Defects (log scale)

5.4.2 Switch Flexibility

The flexibility of the switches also has a significant impact on yield. When the flexibility is increased, the number of fanins and fanouts increases. As mentioned before, this increases the MFFR and reduces yield. Figure 5.4 shows that the architecture with the lowest Fs, E2M1, demonstrates the best yield for both single length faults (SLF) and bridging faults (BF). The reduction in midpoint flexibility improves yield more than the reduction in endpoint flexibility because there are more midpoint than endpoint connections for length 4 wires.

Figure 5.4: Flexibility (L=4, M = 32). X-axis: Number of Defects (log scale)

Baseline Architecture

Taking the above results and the area/delay results from Chapter 3, the EM22-E3M1 architecture was selected to be the baseline fine-grain redundancy architecture. This architecture is capable of shifts by +/-2 and +/-1, and has endpoint and midpoint flexibilities of 3 and 1 respectively. In all subsequent results, EM22-E3M1 will simply be denoted as FGR.

5.4.3 Fixed Array Size

FGR versus Global Spares

Figure 5.5 presents the yield for a 32x32 FPGA with a number of additional global rows/columns. The yield for CGR-Gn remained at 100% until the defect count became greater than n, the number of spare rows/columns in the architecture (footnote 3). After this threshold, the yield decreases dramatically. Yield for this particular architecture is especially sensitive to the number of spare rows/columns in the system.

Figure 5.5: Increasing number of global spares (M = 32). X-axis: Number of Defects (log scale)
The figure also shows that there is a significant yield improvement when the number of global spares increases from one to two, and that FGR produces similar yields to CGR-G4. This can be observed by noting that both architectures fall below the 80% yield threshold at approximately the same number of defects.

[Footnote 3: In the worst case, all defects are located in different rows/columns. Since there are n spare rows/columns, CGR-Gn can always tolerate n defects.]

FGR versus Local Spares

The CGR-Ln-Sp architecture demands that defects are spaced far apart from one another. If too many defects reside in the same subdivision, chip failure occurs. The impact of this restriction is shown in Figure 5.6. The figure shows the yield of a 32x32 FPGA with the indicated number of local subdivisions. Each subdivision contains one local spare row/column. Notice that the yield decreases almost immediately and is significantly less than that of the global approach. CGR-L1-S16 produces similar yields to FGR and CGR-G4. It should be noted that CGR-L1-S16, with 1 spare in each subdivision (16 spare rows and 16 spare columns in total), is more practical to implement than CGR-G4 with its 4 global spare rows and columns, given the amount of additional multiplexing the latter needs. Although easier to implement, CGR-L1-S16 more than doubles the area of the FPGA (from 32x32 to (32+16)x(32+16), or 48x48), making it significantly larger than FGR.

Figure 5.6: Increasing number of local spares (M = 32). X-axis: Number of Defects (log scale); Y-axis: Yield

Figures 5.7 and 5.8 present the yield curves for CGR-Gn and CGR-Ln-Sp, but a larger array size of 256x256 is used. At this array size, the number of defects tolerated by FGR increases and that of CGR slightly decreases (both local and global).
The yield of FGR is now approximately equivalent to that of CGR-G16 and higher than that of all implementations of CGR-L1-Sp. Note that more than two global spare rows/columns is impractical (and potentially infeasible) because the necessary wire extensions significantly increase switch area and signal timing. Although the local spare approach avoids this by having only one spare per subdivision, the CGR-L1-S256 architecture has 300% overhead in spare rows and columns and cannot tolerate as many defects as FGR. In comparison, the FGR area overhead is only 50%.

Figure 5.7: Increasing number of global spares (M = 256). X-axis: Number of Defects (log scale)

Figure 5.8: Increasing number of local spares (M = 256). X-axis: Number of Defects (log scale)

Figure 5.9: Increasing array size for FGR (L = 4). X-axis: Number of Defects (log scale); Y-axis: Yield

5.4.4 Increasing Array Size

CGR can tolerate multiple defects in the same channel. However, as array size grows, it becomes increasingly unlikely that randomly occurring defects will lie in the same channel. Hence, the yield for CGR with a fixed number of spare rows and columns was observed to be largely independent of array size. The only way to increase yield is through the addition of spare resources. This can be observed in Figure 5.9, where the yield is shown to be similar for CGR-G4 at M=32 and M=256.

For the FGR architecture, the amount of spare resources increases naturally as array size grows.
Since the amount of resources needed for defect correction is constant, increasing the number of spare resources translates into the ability to tolerate more defects. This is demonstrated in Figure 5.9, where FGR is shown to tolerate an increasing number of defects as array size grows. To reach a similar level of defect tolerance as FGR at M=256, CGR requires 16 global spares, which is completely infeasible!

Figure 5.10 presents a rough area comparison between FGR and CGR for different values of M. For FGR, the reported area includes all necessary shifting multiplexers and spare wires for the given value of M. For CGR-Gn and CGR-Ln-Sp, the "n" and "p" values were chosen to tolerate the same number of defects as FGR at the 80% yield. The reported CGR-Ln-Sp area includes the base M x M FPGA plus the additional spare rows/columns. To account for the additional bypass circuitry, the scaling factor noted in Section 4.3.3 is used to adjust the area results for both CGR approaches. The figure shows that CGR-Gn, CGR-L1-Sp and CGR-L2-Sp all require more area overhead to tolerate the same number of defects as FGR can tolerate at 80% yield.

Figure 5.10: Area comparison between FGR and CGR at equal number of defects (L = 4). X-axis: Array size (M): 32 (5 defects), 64 (10 defects), 128 (19 defects), 256 (39 defects)

The figure also shows the area overhead for CGR-G1. However, CGR-G1 can only tolerate 1 defect for the different array sizes. A scaling factor of 1.3 is used to approximate CGR-G1 area. Note that the area overheads of CGR-G1 and FGR are similar, and that at large values of M, CGR-G1 has a lower overhead.

5.4.5 Wire Length

Long wires have a greater number of fanins and fanouts.
This results in a larger MFFR and, consequently, a yield reduction. Figure 5.11 shows how much the yield for FGR decreases as wire length increases. Also note that the yield for mixed wire length is lower than the yield of the individual wire lengths it is composed of. This is largely the consequence of how mixed wires are implemented and modelled. In the fine-grain architecture, the routing networks for different length wires are disjoint.

To put this yield into perspective, area and delay results were computed for the largest MCNC benchmark circuit, clma. The results for separate FGR architectures with length 4, 8 and 16 wires are presented in Figure 5.12. The reported numbers have been normalized to an architecture without redundancy at L=4. The channel width was fixed at 224 for all wire lengths because 224 is the minimum channel width needed to route clma using length 16 wires. The figure shows that delay increases and area decreases as wire length increases.

The use of the E3M1 architecture is a potential cause of the alarmingly large delay overhead for length 16 wiring. It is possible that the reduced midpoint flexibility forces nets to take longer (thus slower) and less direct paths. Also, the wires are much longer than needed, slowing nets down. Lastly, routing with L=16 wires at 224 tracks is highly congested, and this increases delay.

In Figure 5.13, the area overhead for clma is broken down into the CLB (logic) area, the C block area, the S block area, the spare resources area (includes the spare wires and associated imux and omux) and the shifting multiplexers area (imux and omux for non-spare wires). The CLB and C block have fixed area overhead because array size and channel width, respectively, are constant. The C block represents a large part of circuit area because a wide channel width is used.
The S block and shifting multiplexers areas shrink as the length of wire increases because longer wires have fewer switching elements at fixed channel widths. Lastly, the spare resources area grows because longer wires have fewer wires per trackgroup, hence the addition of a spare wire accounts for an increasing fraction of the overall routing area.

Figure 5.13: Area breakdown of clma for different wire lengths at a very wide channel width of 224 tracks. X-axis: Wire Length (4, 8, 16). Legend: Shifting Muxes, Spare Routing Resources, Switch Block, Connection Block, Configurable Logic Block

Chapter 6

Conclusion

This thesis presents a new defect-tolerant switch block and connection block architecture, called Fine-grain Redundancy (FGR), that can tolerate an increasing number of permanent manufacturing defects as the FPGA array size scales up. FGR is capable of handling tens of distributed random defects. FGR is compared to a more traditional approach employing coarse-grain redundancy (CGR). The results indicate that CGR does not scale well beyond 2 distributed defects unless significant area overhead is employed.

6.1 Area and Delay

The proposed FGR approach has a true area overhead of approximately 11% and delay overhead of 4% on aggressive applications that do not wish to be defect-tolerant. When defect-tolerance is desired, it is conventional to include the cost of reserving a spare track. This increases area overhead to 35-50% and delay overhead to 5-20%. However, it should be noted that less aggressive applications will already have these spare (unused) routing tracks available for free, so the actual cost is much closer to the true area overhead.

A range of FGR implementation options with a range of area and delay costs was also presented. Of these options, EN11-E3M1 has the lowest area, EM11-E3M2 has the lowest delay, and EM22-E3M1 has the highest yield.
More detailed rankings of E3M1, the best flexibility option, are shown in Table 6.1 and presented visually in Figure 6.1.

6.2 Yield

This thesis also presents a comparison between CGR and FGR. Both approaches embody the idea of replacing a defective resource with a spare unused one; however, the investigation indicates that the choice of defect tolerant architecture has a significant impact on yield and area overhead.

At low defect levels, CGR has a lower area overhead than FGR. Further, for sufficiently low defect levels, the area overhead for CGR diminishes as array size increases. This is not the case for FGR, where the area overhead is fixed at up to 50% for all array sizes.

Despite the fixed cost of redundancy, FGR demonstrated the ability to tolerate an increasing number of defects as array size grows. This is extremely important as the expected number of defects increases as die size grows and technology feature size shrinks. When comparing CGR and FGR at equal defect levels, CGR actually requires more area overhead to tolerate the same number of defects as FGR.

Other factors that influenced yield are wire length and switch implementation. This study showed that the yield for the FGR approach decreases as wire length or flexibility increases, and if the switch's shifting ability is reduced. These factors were found to increase the MFFR of defects, and thus reduce the number of tolerable defects. This is not so for CGR. Using spare rows and columns for defect correction is independent of wire length and switch implementation.

6.3 Future Work

In terms of architectural improvements, future work includes the optimization of C block design for FGR and the incorporation of the extra circuitry needed for row/column bypasses for CGR. The first is important as careful design of the C block can lead to a significant
The first is important as careful design of the C b lock can lead to a significant 69 Arch Area Delay Area Recovery Delay Recovery Yie ld E M 2 2 7 3 7 3 1 E N 2 2 4 7 4 7 2 F L 2 2 6 6 5 6 2 E M 1 2 5 2 6 1 4 E M U 2 1 2 2 5 EN12 3 5 3 5 6 EN11 1 4 1 4 7 Table 6.1: Summary ranking of F G R defect-tolerant schemes w / E 3 M 1 1 • C G R - G 1 i • EM12 i i i • EM22 • FL22 EN22 - • EN12 - H EM11 • EN22 • EN11 a EN12 • EN11 • FL22 • EM11 • EM12 • EM22 i i • area overhead • delay overhead i i i 0 2 4 6 8 10 Tolerable Defects with Yield >50% (M = 32, BF) 12 Figure 6.1: Summary of area/delay overhead vs defect tolerance of F G R 70 reduction in area overhead. The accounting of the extra bypass circui try is useful as it provides a more accurate area and delay comparison between F G R and C G R . Future, work in the software domain can be further d iv ided into two categories: support software and y ie ld model improvements. To provide a complete defect tolerant so- lut ion, it is necessary to develop a suite o f support tools. The functionality of these tools includes strategies for defect diagnosis, defect map generation, defect maps management and the application of defect correction. To better estimate y ie ld , chip area should be incor- porated into the y ie ld model . Addi t iona l ly , different kinds o f defects should be injected into the model (as oppose to the current implementation where only worst-case br idging defects are used). Ideally, the model should be supplemented wi th manufacturing data to produce the most accurate estimates. 71 Bibliography [1] M i r o n A b r a m o v i c i , John M . Emmert , and Charles E . Stroud. R o v i n g S T A R s : A n integrated approach to on-line testing, diagnosis, and fault tolerance for F P G A s in adaptive computing systems. In Proc. ofthe The 3rd NASA/DoD Workshop on Evolv- able Hardware, pages 73-92 . I E E E Computer Society, 2001. [2] A l t e ra Corp . 
Al tera ' s patented redundancy technology dramatically increases yields on high-density A P E X 2 0 K E devices. In Press Release, Nov . 27, 2000. [3] A l t e ra Corp . In United states patents #6,034,536, #6,166,559, #6,337,578, #6,344,755, #6,600,337and#6,759,871, 2000-2004. [4] Vaughn Betz , Jonathan Rose, and Alexander Marquardt . Architecture and CAD for Deep-Submicron FPGAs. K l u w e r A c a d e m i c Publishers, Boston, 1999. [5] N i c o l a Campregher, Peter Y . K . Cheung, George A . Constantinides, and M i l a n Vas i lko . Ana lys i s o f y ie ld loss due to random photolithographic defects i n the in - terconnect structure o f F P G A s . In Int'l. Symp. FPGA, pages 138-148, February 2005. [6] C . Carmichael . Tr ip le module redundancy design techniques for Vir tex F P G A s . In Xilinx Application Notes, XAPP197 (vl.0), 2001. [7] C . Carmichae l , M . Caffrey, and A . Salazar. Correct ing single-event upsets through Vir tex partial configuration. In Xilinx Application Notes, XAPP216 (vl.0), 2000. [8] Wu-Tung Cheng . S i l i c o n diagnosis. In International Test Conference, 2003. [9] Collaborat ive Benchmarking Laboratory. Lgsynth93 benchmark set: Version 4.0. Technical report, Nor th Caro l ina State Universi ty, 1993. [10] Al t e ra Corp . Stratix II device handbook, vo l . 1. 2005. [11] Crosspoint Solutions Inc. F P G A redundancy. In United states patents #5,777,887, 1998. 72 [12] Andre D e H o n and M i c h a e l J . W i l s o n . Nanowire-based sublithographic programmable logic arrays. In FPGA '04: Proceedings of the 2004 ACM/SIGDA 12th International symposium on Field Programmable Gate Arrays, pages 123-132, N e w York , N Y , U S A , 2004. A C M Press. [13] Abder rah im Doumar and Hideo Ito. Des ign o f switching blocks tolerating de- fects/faults i n F P G A interconnection resources. In Proc. 15th IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems, pages 134-142. I E E E Computer Society, 2000. 
[14] Abder rah im Doumar, Satoshi Kaneko , and Hideo Ito. Defect and fault tolerance F P - G A s by shifting the configuration data. In Int'I Symp. on Defect and Fault-Tolerance, pages 377-385 . I E E E Computer Society, 1999. [15] M o h a m e d A . E lgame l , Kannan S. Tharmal ingam, and M a g d y A . B a y o u m i . Crosstalk noise analysis i n ultra deep submicrometer technologies. In ISVLSI '03: Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI'03), page 189, Washington, D C , U S A , 2003. I E E E Computer Society. [16] R u d y Garc ia . Rethink fault models for submicron-IC test. Test & Measurement World, October 2001. [17] Scott Hareland, Jose M a i z , M o h s e n A l a v i , K a i z a d Mis t ry , Steve Walstra, and Changhong D a i . Impact o f C M O S process scaling and S O I on the soft error rates of logic processes. In Proc. of the IEEE Nuclear and Space Radiation Effects Confer- ence, pages 73-74 , 2001. [18] F. Hator i , T. Sakurai , K . N o g a m i , K . Sawada, M . Takahashi, M . Ichida, M . Uch ida , I. Y o s h i i , Y . Kawahara , T. H i b i , Y . Saeki , A . M u r o g a , A . Tanaka, and K . K a n z a k i . Introducing redundancy i n field programmable gate arrays. In Proc. IEEE Custom Integrated Circuits Conference, pages 7.1.1-7.1.4, 1993. [19] C K . H u and J . M . E . Harper. Copper interconnect: Fabrication and reliabili ty. In International Symposium on VLSI Technology, Systems, and Applications, 1997. [20] W . - J . Huang and E . J . M c C l u s k e y . Column-based precompiled configuration tech- nique for F P G A fault tolerance. In Proc. IEEE Symp. Field Programmable Custom Computing Machines, 2001. [21] X i l i n x Inc. Quintuple modular redundancy for h igh rel iabi l i ty circuits implemented i n programmable logic devices. In United states patents #6812731, 2004. 73 [22] X i l i n x Inc. Vir tex-II Pro and Vir tex-II Pro X platform F P G A s : Complete data sheet, version 4.3. 2005. 
[23] John L a c h , W i l l i a m H . Mangione-Smi th , and M i o d r a g Potkonjak. Efficiently sup- porting fault-tolerance i n F P G A s . In Int'I Symp. on Field Programmable Gate Arrays, pages 105-115. A C M Press, 1998. [24] Vi j ay Lakamraju and Russe l l Tessier. Tolerating operational faults in cluster-based F P G A s . In Int'l Symp. on FPGAs, pages 187-194, 2000. [25] G u y Lemieux , E d m u n d Lee , M a r v i n T o m , and Anthony Y u . Direct ional and single- driver wires i n F P G A interconnect. In Int'l Conf on Field-Programmable Technology, 2004. [26] G u y L e m i e u x and D a v i d L e w i s . Design of Interconnection Networks for Pro- grammable Logic. K l u w e r A c a d e m i c Publishers, Boston, 2004. [27] D a v i d L e w i s , E l i as A h m e d , Gregg Baeckler , Vaughn Betz , M a r k Bourgeault , D a v i d Cashman, D a v i d Gal loway, M i k e Hutton, Chr is Lane, A n d y Lee , Paul Leventis , Sandy Marquardt, Cameron M c C l i n t o c k , Ketan Padalia, Bruce Pedersen, Gi l e s P o w e l l , Bor i s Ratchev, Srinivas Reddy, Jay Schleicher, K e v i n Stevens, R ichard Yuan , R ichard Cl i f f , and Jonathan Rose. The Stratix II logic and routing architecture. In Int'l. Symp. FPGA, pages 14-20, February 2005. [28] J . L i u and S.J. S immons . BIST-diagnos is o f interconnect fault locations i n fpga's. In Canadian Conference on Electrical and Computer Engineering, 2003. [29] Sani R . Nassif . Wi th in-ch ip variabil i ty analysis. IEEE Int. Electron Devices Meeting, December 1998. [30] M . Y Niamat , R . Nambiar , and M . M . Jamal i . A B I S T scheme for testing the inter- connects o f S R A M - b a s e d F P G A s . In Midwest Symposium of Circuits and Systems, 2002. [31] Jonathan Rose and Stephen B r o w n . F lex ib i l i ty o f interconnection structures i n field- programmable gate arrays. Journal of Solid State Circuits, 26(3):277-282, 1991. [32] R . S ingh, V . Parihar, K . F . Poole, and K . Rajkanan. 
Semiconductor manufacturing in the 21st century. Semiconductor Fabtech 9th Edition, pages 223-232 , 1999. [33] C . H . Stapper. M o d e l i n g of integrated circuit defect sensitivities. IBM Journal of Research and Development, vol. 27, pages 549-557, 1983. 74 [34] J . H . Stathis. Phys ica l and predictive models o f ultrathin oxide rel iabil i ty i n C M O S de- vices and circuits. In Proceedings ofthe 2001 IEEE International Reliability Physics Symposium, pages 132-149, 2001. [35] X i l i n x , San Jose, C A . EasyPath Solutions, 2005. [36] A . J . Y u and G . G . F . Lemieux . Defect-tolerant F P G A switch block and connection b lock wi th fine-grain redundancy for y ie ld enhancement. In to appear in Int'l. Conf. on Field Programmable Logic and Applications, 2005. [37] A . J . Y u and G . G . F . Lemieux . F P G A defect tolerance: Impact o f granularity. In to appear in Int'l. Conf. on Field-Programmable Technology, 2005. 75
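The area-overhead trend summarized in Section 6.2 can be illustrated with a small calculation. The sketch below is not the thesis's area model. It assumes a square array of identical tiles, one spare row and one spare column for CGR, and it ignores the row/column bypass circuitry that Section 6.3 notes is not yet accounted for. The function name and the constant are hypothetical; only the "up to 50%" bound for FGR comes from the text.

```python
def cgr_area_overhead(n, spare_rows=1, spare_cols=1):
    """Fractional area overhead of adding spare rows/columns to an n x n array.

    Assumption: every tile has the same area and bypass circuitry is free.
    """
    total_tiles = (n + spare_rows) * (n + spare_cols)
    return total_tiles / (n * n) - 1.0


# Upper bound quoted in the text for FGR, independent of array size.
FGR_OVERHEAD_UPPER_BOUND = 0.50

if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        cgr = cgr_area_overhead(n)
        print(f"N={n:3d}  CGR overhead={cgr:6.1%}  "
              f"FGR overhead<= {FGR_OVERHEAD_UPPER_BOUND:.0%}")
```

Under these assumptions the CGR overhead shrinks roughly as 2/N, while the FGR bound stays constant. The sketch says nothing about how many defects each scheme tolerates, which is where the comparison at equal defect levels favours FGR.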

