Open Collections
UBC Theses and Dissertations
Structured logic arrays as an alternative to standard cell ASIC. Mehrabadi, Roozbeh. 2006.

Full Text

S T R U C T U R E D L O G I C A R R A Y S A S A N A L T E R N A T I V E T O S T A N D A R D C E L L A S I C by Roozbeh Mehrabadi B . A . S c , the University of British Columbia, 1995 A thesis submitted in partial fulfillment of the requirements for the degree of Master of Appl ied Science in The Faculty of Graduate Studies (Electrical and Computer Engineering) The University of British Columbia December 2005 © Roozbeh Mehrabadi 2005 ABSTRACT In the deep submicron (DSM) era, design rules have become increasingly more stringent and have favoured the more structured architectures. The design methods using standard cell A S I C s (SC-ASIC) produce randomly placed gates and interconnects. Beside reduced yield, they also suffer from high testing cost, even with the most advanced built-in self-test methods. These shortfalls motivate us to search for an alternative architecture in the structured logic arrays. First, we w i l l explore the available structured logic arrays and their potentials as alternatives to S C - A S I C architecture. Then we w i l l focus on programmable logic arrays to explore their potential when competing for speed and area with S C - A S I C . We have investigated the critical path delay for clock-delayed P L A and suggested equations for quick calculation of its capacitive loads and delay. We have also introduced equations to calculate their area using technology-independent parameters. This would help the front-end C A D tools in partitioning and architecture decision-making before committing to a specific technology. We found that circuits with higher than 200 product terms have slower P L A implementations than S C - A S I C . They also tend to take more than 10 times the area. Furthermore, we have introduced logical effort as a simple method for gate sizing and optimization of the P L A ' s critical path delay. Finally, we have introduced methods to subdivide the slower P L A s in order to improve the overall circuit timing. We also found that by dividing a circuit to two P L A s we can cut its delay by half and keep the increase in area minimal. i i TABLE OF CONTENTS Abstract i i Table o f Contents i i i List o f Tables v List o f Figures v i Acronyms v i i Acknowledgements v i i i Chapter 1 Introduction 1 1.1 Research Motivation 2 1.2 Research Goals 4 1.3 Thesis Outline 5 Chapter 2 Background 6 2.1 Read Only Memory ( R O M ) 6 2.2 Programmable Logic Array ( P L A ) 7 2.3 Field Programmable Gate Array ( F P G A ) 10 2.4 Structured A S I C (SA) 11 2.4.1 Choices in S A basic tile 12 2.4.2 S A configuration methods 14 2.4.3 More on Via-Configurable S A 15 2.4.4 S A advantages and disadvantages 17 2.5 Why choose P L A 17 Chapter 3 P L A vs. A S I C Comparisons 19 3.1 P L A structure used 20 3.2 P L A critical path calculations 21 3.3 Line delay (RC effects) 26 3.4 P L A area calculations 27 3.5 Standard Ce l l flow and data collection 29 3.6 P L A vs. 
S C - A S I C 30 3.6.1 SPICE verification of Delay calculations 30 3.6.2 Delay Comparison 31 3.6.3 Verification of area calculations for the P L A structure used 34 3.6.4 Area Comparison 36 3.7 Computing Power Factor for P L A 37 Chapter 4 Techniques for P L A Area-Delay-Power Improvement 39 4.1 Using Logical Effort for P L A delay improvement 39 4.1.1 Basic Logical Effort Optimization 40 4.1.2 Modified Logical Effort Optimization 42 4.1.3 Delay estimation with Logical Effort 43 4.1.4 Considering Power with Logical Effort 44 4.2 Using Partitioning for P L A delay Improvement 47 4.2.1 Min imum Delay P L A (single output) 48 4.2.2 Optimizing with Multiple Output P L A 50 Chapter 5 Conclusions and Future Directions 53 5.1 Future Work 55 i i i 5.2 Contributions R E F E R E N C E S LIST OF TABLES Table 2.1: Product Terms used in the Sample Clocked P L A 9 Table 3.1 Min imum size tile calculations based on X 27 Table 3.2: Correlation relation between IPO and Capacitive loads 34 Table 3.3: Sample P L A layout size scaled from A,=lum to X=0.1um 35 Table 3.4: Total capacitance and calculated a for a 5-8-4 P L A 38 Table 4.1: Five stage gate optimization with L E 42 Table 4.2: Five stage gate optimization with modified L E 43 Table 4.3: Timing comparison between fixed and optimized gate P L A s 43 Table 4.4: Table of partitioned P L A s delay/area and their %difference with their full P L A 52 LIST OF FIGURES Figure 1.1: Area overhead vs. Power-Delay Comparison Graph with respect to S C - A S I C 3 Figure 2.1: A 2 input 4 output R O M with transistor based fabric 7 Figure 2.2: A n example 4-7-2 (IPO) clocked C M O S P L A 8 Figure 2.3: General structure of an F P G A 11 Figure 2.4: The generic S A structure 12 Figure 2.5: S A tiles (a) fine-grain, (b) L U T based medium-grain, (c) master tile with 4x4 base tiles 13 Figure 2.6: A via configurable 3-input L U T with full complementary input 15 Figure 2.7: A n example o f a coarse-grained tile using 3-input L U T 16 Figure 2.8: S A interconnect routing example 16 Figure 3.1: The scatter plot of the benchmark circuits' I/O numbers 19 Figure 3.2: The 4-7-2 (IPO) clocked C M O S P L A with critical path marked 20 Figure 3.3: Schematic of the Critical path for our Sample P L A 22 Figure 3.4: The Critical Path's signal wave forms 23 Figure 3.5: Basic N M O S pass-transistor tile dimensions 27 Figure 3.6: The S C - A S I C design flow and C A D tools used 29 Figure 3.7: Plot of Spice simulation over calculated delays with respect to their no. of products 31 Figure 3.8: P L A circuit delay data sorted with respect to their number o f product terms..32 Figure 3.9: Graph of P L A to S C - A S I C delay ratios sorted with respect to the P L A product terms 33 Figure 3.10: The layout view of the sample P L A 35 Figure 3.11: P L A area / S C - A S I C area vs. number of product terms 36 Figure 3.12: The percentage of area ratios in 4 different ranges 37 Figure 4.1: Pre-charge and Evaluation Critical Paths and their side loads 40 Figure 4.2: Graph of normalized P D P and Delay vs. 
F O factor 45 Figure 4.3: Sizing for a load of 64 46 Figure 4.4: Lower size gates with the original Cout 46 Figure 4.5: Circuit Partitioning (a) based on Input to Output path (b) based on output's input and product term requirements 48 Figure 4.6: Percentage difference of delay between full and single-output P L A implementations 49 Figure 4.7: The percentage of circuits in the three ranges 50 Figure 4.8: The Delay vs Area change with respect to the full P L A 51 v i ACRONYMS A S I C Application Specific Integrated Circuit A T E Automated Test Equipment B E Branching Effort BIST Buil t- in Self-Test C A D Computer Aided Design C M O S Complementary Metal-Oxide-Silicon Core The central part of the design, usually surrounded with I/O pads D F T Design for Test D S M Deep Sub-Micron F F Flip-Flop F 0 4 Fan-out of 4 (ratio of Cioad/Qn = 4) F P G A Field Programmable Gate Array IC Integrated Circuit I E E E Institute of Electrical and Electronics Engineers 170 Input/Output IPO Input Output Product terms, a 3-tple number (I,P,0) ITRS International Technology Roadmap of Semiconductors L E Logical Effort L F S R Linear Feedback Shift Register M U X Multiplexer PI Primary input P L A Programmable Logic Array P O Primary output R O M Read-Only Memory R T L Register Transfer Level S A Structured A S I C SE Stage Effort (with a ' * ' indicates the optimized SE) S L A Storage/Logic Array S C - A S I C A S I C design synthesis using Standard Ce l l logic library SoC System on a Chip V H D L V H S I C (Very High Speed Integrated Circuits) Hardware Description Language V L S I Very Large Scale Integration v i i ACKNOWLEDGEMENTS It has been a great pleasure to work with Dr Resve Saleh on this thesis. He has always been ready with advice and direction and I have appreciated his willingness to help, with graciousness and careful attention to details. His dedication to research and diligent work ethics have been a great source of inspiration for me. Thank you. I am also thankful to Dr Steve Wil ton and Dr Shahriar Mirabbasi for serving on my thesis committee. The S O C faculty have always been ready to answer questions, and been supportive teachers in the courses I have taken with them. I would also like to thank my colleagues at U B C S O C lab, Dr Roberto Rosales and Sandy Scott, for being kind and supportive. A n d last, but certainly not the least, I am grateful to my family. M y beloved wife Farah, who has been supporting my efforts over an extended time, and has been my source of inspiration for time management and hard work; and our boys, Arman and Zubin, who have been my cheerleaders and promised me a prize when I finish with my thesis. They have taught me what is important in life. Finally, I would like to thank PMC-Sierra , N S E R C and Canadian Microelectronics Corporation for funding this research and the supporting C A D tools. v i i i CHAPTER 1 INTRODUCTION Before the deep submicron era, C M O S fabrication processes could handle integrated circuit (IC) designs with complex circuits and complex geometric patterns to be produced on a chip with an acceptable yield. Today, in fabrication lines with feature sizes below lOOnm, the manufacturing process is so complicated and the resolution of layout masks is so fine that design for manufacturability is a primary concern. A s a result, for these technologies, foundries have imposed more stringent design rules to keep the yield at acceptable levels. 
These design rules limit the randomness o f the designs and favour the more structured layouts for manufacturing. For example, the first IC designs which are accepted for fabrication on 65nm fabrication runs are field-programmable gate arrays (FPGAs) . The highly-structured design patterns of F P G A s allow the foundry to fine-tune their process to these regular structures in order to improve yield. B y moving into deep submicron technologies, it is possible to integrate hundreds of millions of transistors onto a chip. Transistors are considered to be almost "free". Therefore, designers have more transistors than they need to implement the required functions. To make use of the extra transistors or unused silicon area, very often extra memory blocks are integrated into the chips [1;]. Memories are also structured arrays, which can be produced using C A D tools known as memory compilers [2]. Their development is much easier than random logic and they generally are lowe-power circuits. In contrast, any increase in complexity of logic circuits leads to a sharp increase in their design and verification time, as Well as increases in power and cost. These extra transistors can also be used to address the at-speed testing problem in modern chips. According to the International Technology Roadmap for Semiconductor (ITRS)[3], the 1 manufacturing yield loss associated with at-speed functional test methodology is directly related to the slow improvement of automatic test equipment ( A T E ) performance and the eVer increasing device I/O speed. B y adding design-for-testability (DFT) and built-in self-test (BIST) features onto the chips, designers attempt to reduce the reliance on high-cost, full-feature testers. The benefit o f this approach for random logic, however, is limited because of the low fault coverage with logic BIST due to the use of pseudo-random patterns rather than deterministic patterns [4]. 1 . 1 R e s e a r c h M o t i v a t i o n The issues described above suggest the use of some type of structured array for logic design in the future. For example, F P G A s are highly structured with programmable logic in the form of lookup tables and programmable interconnect which are configured after fabrication. These flexibilities, however, come at the cost of much slower speed, higher power consumption and more silicon area compared to the non-programmable instances implemented using the S C - A S I C flow [5]. A more recent approach is structured ASIC which places itself between F P G A s and A S I C design using standard cells [6]. Structured A S I C fabrics have programmable lookup tables that can be modified post fabrication and programmable interconnect that can be modified in the last step before final fabrication. Although structured A S I C ' s performance is better than F P G A , its area, delay and power is more than standard cell ASICs . In Figure 1.1, the positions of F P G A , structured A S I C and custom design is shown in a graph relative to S C - A S I C in terms of area, power and delay [1,6]. 2 I 0.9 - 80% Relative Power - Delay • Figure 1.1: Area overhead vs. Power-Delay Comparison Graph with respect to SC-ASIC In order to reduce the area and delay overhead of structured A S I C , we could view it as an improvement over F P G A s . Then, i f we conceptually remove their metal layer programmability feature to reduce some of the fabric's overhead, it would better compete with SC-ASICs . 
In other words, logic synthesis using standard cells would be replaced with automatic synthesis of structured logic fabrics. They can be highly-customized functional blocks with potential improvements over S C - A S I C design. To accomplish this, we must revisit a number of fixed-function structured logic arrays to investigate the possibility of finding alternatives to S C - A S I C . This possibility has already been recognized by others in the field [7]. Commonly known structured logic arrays are Read-Only Memories ( R O M ) , Programmable Logic Arrays ( P L A ) and Storage-based Logic Arrays ( S L A ) . In R O M s the output is stored in the memory locations associated with their addresses as the corresponding 3 n input. For an n-input and m-output R O M 2 x m memory locations are needed. The core array (storing the bits) is usually formed as fabrics with horizontally and vertically crossing wires. A n address decoder, based on the given input, selects the proper word lines and bit lines that lead to the outputs. Devices based on a R O M architecture tend to be area and delay intensive and P L A s were introduced to improve on the R O M efficiency by using only the needed product terms. In P L A s , the output is formed as the sum-of-product terms ( N A N D - N A N D ) or product-of-sum ( N O R - N O R ) from its inputs. This makes P L A s suitable for implementing combinational logic; however, to implement state machines and sequential logic, S L A s were introduced [8]. S L A s follow the P L A layout of A N D and O R planes, except that they are folded together, and memory elements such as flip-flops and latches are placed in various locations on the layout. The benefit o f S L A diminished with more sophisticated C A D tools and availability of larger computing powers. In this research, we concentrate our efforts on P L A s because they have a long history [9,10] and substantial amount of work has been done on improving their power and delay factors [11,12]. Furthermore, because of its regularity, a single P L A structure has well known timing and power characteristics and does not require the technology-mapping step o f A S I C design flow. Although it is an older design style, it is a good starting point for this research direction. In fact, there has been resurgence in interest in this topic due to the inherent structured characteristics of P L A s [11,12,23, 36]. 1.2 R e s e a r c h G o a l s The overall goal of this study is to investigate structured logic fabrics such as P L A s as a possible alternative to A S I C design using standard cells. We w i l l explore the area-delay tradeoffs for P L A and standard cell A S I C designs using some benchmark circuits. We w i l l 4 ajso develop methods to partition a given circuit to a number of P L A s to minimize their delay and area in order to compete with standard cell A S I C . Using a set of benchmark circuits, the objectives are to: 1- Produce layouts from a commercial grade standard cell library and compare their area and delay with those of their P L A implementation. ' 2- Explore options to improve delay in P L A s using logic effort. 3- Find parameters or heuristics to partition slow P L A s to achieve an acceptable delay (based on the delay achievable by standard cell ASIC) . 4- Recommend future improvements to include P L A s in SoC design flow. 
The ultimate goal is to find P L A architectures which w i l l help to close the gap, as shown ih Figure 1.1, between structured A S I C and the traditional A S I C design using standard cells. 1.3 T h e s i s O u t l i n e Chapter 2 provides some background on different structured logic devices and offers reasons for choosing the P L A structure for this study. The method of implementing the circuits as P L A s , and the C A D flow used for creating cell library counterparts of the same circuits are presented in Chapter 3. Also presented are the equations used for P L A delay/area estimates as well as S P I C E simulation data and comparisons with standard cell synthesis in this chapter. We also estimate P L A activity factor and power at the end of this chapter. In Chapter 4, we w i l l present our methods to create P L A s of moderate size that are small and fast. Further discussion on circuit partitioning and heuristics to speed up the process are presented in this chapter. Conclusions and future directions are presented in Chapter 5. 5 CHAPTER 2 BACKGROUND In this chapter, we review the available structured logic arrays, and discuss their merits for use in place of S C - A S I C . First, we review R O M and P L A s , then C P L D s , and finish with F P G A s and structured ASICs . A t the end of this chapter, we w i l l discuss the reasons for targeting P L A s for our research. 2 . 1 R e a d O n l y M e m o r y ( R O M ) One way to represent complex logic functions is to implement lookup tables as R O M s . In these devices, the outputs are stored in the memory locations pointed to by the corresponding input. For an w-input and m-output device, we would need a R O M with a capacity of l" x m bits. In this way, all possible logical combinations of the input could be stored, and made available by storing the output. Figure 2.1 shows an example of a small R O M core with an address decoder on the left and its outputs present at the bottom. In this example, all output lines (bit lines) are initially pulled up high. For each input combination, the decoder selects one horizontal line (word line) and pulls it up. The output lines associated with a logic zero are connected to the ground via a pass-transistor, whose gate is connected to the associated horizontal line. This method requires enough pull-up capacity for the decoder to charge up the capacitances of the word line and the transistor gates attached to it. The R O M input to output timing depends on the decoder's pull-up and the pass-transistors' pull-down delay. A s a result the R O M device becomes rather slow as it grows in size and consumes increasingly more power. To improve the pull-up/down delay in larger R O M s , sense amplifiers are used to pull down the output line as soon as the current flow is detected [13]. A s a result, smaller 6 Bit Line(1) Bit Line! [2) BitLine(3) BitLine(4) Word Line(1) ~7 Word Line(2) Word Line(3) Word Line(4) Figure 2.1: A 2 input 4 output ROM with transistor based fabric pass-transistors could be used to reduce the overall size of the R O M and yet achieve higher speeds [14]. Because R O M s implement all possible combinations of the input, their use for logic devices which have "don't-care" terms, (some of their outputs are not dependent on all of their inputs) is not efficient. They are also wasteful in terms of area and power [7]. A s a result, P L A s were introduced to implement only the needed product terms for every output. 2 . 
2 P r o g r a m m a b l e L o g i c A r r a y ( P L A ) The P L A designs, introduced in the last 20 years, fall into 2 main categories: static designs using n M O S technology and clocked designs using pre-charged gates in C M O S [15]. The static C M O S P L A s use a N A N D - N A N D or N O R - N O R structure and tend to occupy a larger area and are slower than the clocked version. Since we are targeting C M O S technology and most recent advances in P L A delay reduction have been done in clocked P L A s [11] [12], we have focused our efforts on the clocked P L A s . The static P L A s could still play a role in reconfigurable fabrics which use very small size P L A s , as noted in [16] and [17]. 7 AND plane i n i nrn i2 4 ^ i3 4>-Inter-plane buffers -> Product \ i n i i n i i n i i n i i n i n n i i n i i n i i n i Terms OO 01 P1 P2 P3 P4 : r i n i 1 i r a i n i Uni P5 P6 i n i — i -P7 J J ~ L OR plane Figure 2.2: An example 4-7-2 (IPO) clocked CMOS PLA A n example of the clocked P L A is shown in Figure 2.2. It has 4 inputs, 7 product terms and 2 outputs, i.e., IPO= 4x7x2. It produces the following two outputs: oO = PI + P2 + P3 + P4 o l = P 5 + P6 + P7 Table 2.1 shows the seven product terms in this P L A as dot products of the inputs. The trailing indicates an inverted input. In this P L A , the AND-plane is located at the top and the OR-plane at the bottom separated by the inter-plane buffers. The AND-plane implements the products of the inputs and passes them, via inter-plane" buffers, to the OR-plane to sum them up and pass the result to the output buffers. 8 PI P2 P3 P4 P5 P6 P7 ~ i O . i l . ~i2.i3 ~ i0 .~ i l .~ i2 .~ i3 i0 .~ i l . i3 i l . i 2 .~ i3 i0 .~i l . i2 . i3 i 0 . i l . ~i2.~i3 ~ i0 . i l . i 2 Table 2.1: Product Terms used in the Sample Clocked PLA. In order to compute the delay through the P L A , it is useful to understand its operation. In this P L A structure, vertical lines connected to the pull-up transistors and inter-plane buffers act as the product lines. The horizontal lines in the AND-plane are connected to the inputs or their complements. Only the input or its complement is part of a product term. When an input line is part of a product term, it is connected to the gate of a pass transistor, which could discharge the product line to the pull-down line. When the clock signal " P h i " is low, the product lines are charged up to V D D - This is called the pre-charge period. The evaluation period happens when Phi goes high. A t this time the pre-charged product lines would stay high i f they are not connected to the pull-down lines via a pass transistor (i.e., their inputs are not high). Otherwise, they w i l l be discharged low. The clock signal controlling the pre-charge and evaluation in the OR-plane is delayed so that the buffered product terms are ready for evaluation before the clock arrives. In the OR-plane the buffered product lines associated with each output are connected to its horizontal output line via pass-transistors. During the pre-charge period, the output lines, which are connected to the output inverters, are charged high. During the evaluation period, i f one or more product lines connected to the output line are high, the output line w i l l be discharged low and the resulting output w i l l be high. In the next chapter, we w i l l present the critical path for our P L A example and discuss its important characteristics. The idea of folding P L A s and adding storage elements was first introduced by S. 
Patil [18], and later in more detail by Patil and Welch [19]. In this method, called storage/logic array ( S L A ) , different memory elements are placed on the folded P L A fabric with column 9 and row breaks to form independent state machines and logic blocks which make up the final integrated circuit (IC). The S L A program enables the logic designer to have a physical view of the final layout without the need to interact with a layout designer. However, folding P L A s and introducing column/row breaks takes away the uniformity of interconnects and transistors present in P L A s . A n y design change would require unfolding and remapping of the S L A and would lead to major changes in a layout plan which might have fixed dimensions for the device. Minor design changes in a P L A would be matter of reallocating the pass-transistor tiles. Furthermore, with access to current computing power and C A D tool support, the visual simplicity of the S L A s is not as attractive as it was 20 years ago. 2 . 3 F i e l d P r o g r a m m a b l e G a t e A r r a y ( F P G A ) In an attempt to give full post-manufacturing re-configurability to the user, gate array vendors introduced F P G A s . A n F P G A is comprised of many uncommitted basic logic blocks surrounded by programmable interconnect fabric and input/output (I/O) blocks (see Figure 2.3). Each logic block provides combinational logic equivalent of a few gates and one or more registers (flip-flops). Many studies have been done on partitioning and technology mapping for F P G A s . In typical industrial F P G A s , a 4 to 5 input look-up table ( L U T ) and a flip-flop is used as the basic logic block with multiplexers to control signal gating [5]. A t this time, F P G A s provide the highest capacity general-purpose programmable logic chips. They could contain 1 to 6 mil l ion equivalent gates and include R A M , micro-controllers and other pre-designed modules [6]. Their programmability makes them suitable for most communication and multi-media applications where the technology and protocols are constantly changing or some hardware changes need to be applied while in use. In some products, such as network switches and routers, F P G A s are the only choice to reconfigure the 10 hardware while in service to adapt to network changes. While F P G A chips are expensive, their non-recurring engineering (NRE) cost is close to zero since they are ready made chips and no more fabrication process is needed. Finally, their highly-structured design makes them suitable for current and future D S M fabrication technologies. F P G A programmability, however, comes at a very high cost of area and performance penalty when compared to standard cell (SC) A S I C design (see Figure 1.1). They also consume much more power and have higher unit costs [6]. This makes them suitable primarily for prototyping or low volume production runs. 2 . 4 S t r u c t u r e d A S I C ( S A ) To bridge the gap between F P G A s and A S I C design using standard cells, gate array vendors [20] have introduced structured A S I C . Similar to gate arrays, a structured A S I C contains an array of highly-structured and optimized elements (tiles) to form a logic fabric. Each tile contains a small amount of generic logic that might be implemented with a Interconnect Fabric i "°—ST"1 Blocks T _ J • • • • • • • • •• •• •• •• Figure 2.3: General structure of an FPGA "^Block • • • • • • • • 11 combination of gates, L U T s and/or multiplexers. 
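A k-input LUT is simply a 2^k-entry truth table indexed by the input bits; the short sketch below (an illustration only, with a made-up example function) shows the evaluation a LUT-based logic block or tile performs.

```python
# Illustrative sketch: a k-input LUT is a 2^k-entry truth table indexed by
# the input bits.  Here the table for f(a, b, c) = a OR (b AND c) is built
# and then evaluated the way a LUT-based tile would evaluate it.
def make_lut(func, k):
    return [func(*((i >> j) & 1 for j in range(k))) for i in range(2 ** k)]

def eval_lut(table, inputs):
    index = sum(bit << j for j, bit in enumerate(inputs))
    return table[index]

lut = make_lut(lambda a, b, c: a | (b & c), 3)
assert eval_lut(lut, (1, 0, 0)) == 1
assert eval_lut(lut, (0, 1, 1)) == 1
assert eval_lut(lut, (0, 1, 0)) == 0
```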
The tiles may also contain one or more registers and small amount of local R A M . The prefabricated S A contains an array o f tiles and may also contain configurable I/O, R A M blocks and IP cores. In Figure 2.4, a general S A structure is shown with embedded R A M and I/O or IP cores around it. In the description so far, the S A device resembles a modern gate array structure. Their main difference is that, in S A structures, all non-metal layers and some metallization layers have already been fabricated and the wafers are stored for later completion. The user is only left with a few (may be one or two) metallization layers to specify. Therefore, the turn-around time from final specification to a completed chip is only 1 to 2 weeks. 2.4.1 Choices in SA basic tile Most S A vendors offer a variety of IP cores and I/O blocks that to their customers. The main difference between their devices is in the architecture of their basic tiles. They range from very fine-grained, to medium and course-grained tiles. I/O Blocks Embedded Sea of tiles Figure 2.4: The generic SA structure 12 The fine-grained tile usually contains a number of unconnected basic devices such as transistors and resistors (see Figure 2.5a). These tiles are pre-connected in various configurations. The logic engineer can use the available metal layers to make the necessary connections and jo in the tiles to the local and global routing structure. Other vendors have chosen medium-grain tiles which contain generic logic in the form of L U T s or multiplexers and one or more registers (Figure 2.5b). In each of these tiles, the global connections and polarity of the register clocks (positive or negative edge-triggered) can be set by the remaining metal layers. The denser tiles are formed hierarchically by combining some base tiles which contain generic logic, multiplexers and memory elements. Figure 2.5c shows a sample master tile containing a 4x4 array of base tiles. The master tiles could have 8x8 arrays of base tiles or larger. Then a sea of master tiles is prefabricated on the chip. The choice o f fine vs. medium vs. coarse grained tiles sets the density of interconnects going into and out of the tiles. The fine-grained tiles require a lot of connections compared to the amount of functions they provide. The more coarse-grained tiles tend to require fewer connections. On the other hand, the fine-grained tiles provide more design flexibility as compared to the more coarse-grained tiles. Evaluation of tile sizes and architectures is a Figure 2.5: SA tiles (a) fine-grain, (b) LUT based medium-grain, (c) master tile with 4x4 base tiles 13 critical subject for research and development and needs to mature like the F P G A architectures [6]. 2.4.2 SA configuration methods A l l architectures using the above tiles require one or two metal layers and possibly a via layer for final routing and configurations. A t least one vendor has introduced a device which only requires the configuration of one via layer. They all require C A D tools to place the design onto the S A devices and route the final interconnects. In all tile based architectures, there is trade-off of functionality vs. area efficiency. For example, the use of extra registers in the tiles may improve their functionality for some applications, while i f not used, they tend to waste area. In [21], they have introduced a number of via-configurable functional blocks and a via-decomposable flip-flop. 
B y decomposing the pre-fabricated flip-flops, they can reuse their parts for other functions and save the usually wasted space. They show that their via-configured blocks could cover all logic functions and claim that it has a comparable delay to other architectures, including P L A s . Furthermore, they report the highest transistor utilization as compared to other fabrics. 14 2.4.3 More on Via-Configurable SA One type of reconfigurable fabric uses via programmable lookup table ( L U T ) , as shown in Figure 2.6. This example shows a 3-input L U T using three pairs o f pass-gates as 2-to-l multiplexers. It is very similar to a F P G A L U T , except that it does not need the S R A M © = Possible via site • = Used via site out = ( A + B « C ) B B . c. C -V D D -G N D -Figure 2.6: A via configurable 3-input LUT with full complementary input. memory cells to store the programmable bits. Hence it has 1/8* of the transistor density of a F P G A L U T . In this example, the three signals A , B , and C and their complements are applied to the L U T , and empty via locations are filled to make the necessary connections. To implement the given function, out, first we need the signal A . The two via connections are made from the V D D line to the left-most multiplexer. Then the third v ia is placed on the G N D line to the first connection of the second multiplexer to pass the complement of signal B . The fourth v ia connects signal C to the second multiplexer's second connection to generate the AND function between B and C . 15 J4 xbD—f> xcl xc2 mO y t O -L U T 3 LUT 3 Scan 1 / _yn_ — I — CLK K DFF | — O 0 -JI J2 o-o-Memory:' . • / -So " D - O 1213 i / 4 . i - o — — - O l i o - O R n Figure 2.7: An example of a coarse-grained tile using 3-input LUT. This structure is used in commercial reconfigurable fabric as a medium to coarse-grained tile, an instance of which is shown in Figure 2.7 [22]. This example also contains a scanable flip-flop and some memory elements to increase its functionality. The via-configurable fabrics typically have the first two metal layers routed and even vias at these two levels placed. This is to ensure maximum uniformity in printed masks for the lower metal layers which are usually the most difficult ones to print. The configurable interconnect and via layers are usually at higher layers. A n example of interconnect routing and via placement at i • t -i — — - A - -. ..... - -• J 1, i" i ~ • tr - - •J 1 % ... - ... - ZI "-) i i i m f 1 T" -"• - .... if: „. -4 | i 1 m m i«< #: n ... ..... ... "-I 1 • ... | t rr .L -i :~ _ •-1 i i 1 i I • "•' .... "t •+ •i i .... .... ..... .... .....J L .... - - - 1 i |. j. 1 4th metal; . 5thm«al .6th metal Vi* Figure 2.8: SA interconnect routing example. 16 higher metal layers is shown in Figure 2.8. Notice the uniform routing of metal 4 and 5 and allowance for non-uniform routing of metal 6 layer. 2.4.4 SA advantages and disadvantages Because in S A all devices have already been placed and most routing except for a few upper metals and via layers have been done, there is much less work in final mask generation. A s a result, the non-recurring engineering cost (NRE) for the customer is dramatically reduced, and the turn-around time for bug fixes and re-runs are much lower as compared to the S C - A S I C flow. However, the S A devices are still around 3 times slower than S C - A S I C designs and take as much as 3 times more area. They also consume more power [6, 21]. 
Furthermore, due to ttieir recent introduction in the field, there is a lack of optimized E D A tools for their technology mapping and evaluation [6]. Currently, the vendors provide the traditional A S I C tools with their devices, which are not well-suited to a cell-based design flow. The F P G A C A D tools are generally "cell-aware" and are more suited to tile placement and block-based routing. Nevertheless, a long-term investigation of S A architectures is required to bring them up to the level o f maturity that F P G A structures have achieved. 2 . 5 W h y c h o o s e P L A Considering the gap that exists between S A and S C - A S I C and existence of room for improvement with custom design (see Figure 1.1), we investigate P L A s as an alternative to S,C-ASIC to determine its potential as a future logic fabric. P L A s are well-understood and have been subject of some very recent investigations [5,11,12,16,17,23,25]. They have also been mentioned by some experts as poised for a come back [7]. Furthermore, a recent 17 investigation [23] concludes that dynamic P L A s use less power and should be considered for further investigation. Before moving onto the other structures, it is important to investigate P L A characteristics and decide i f it is a useful structure to replace S C - A S I C . 18 CHAPTER 3 PLA VS. ASIC COMPARISONS In this chapter, we present the P L A structure we used for evaluation and comparison with standard cell A S I C (SC-ASIC) . Our goal is to set determine the advantages and disadvantages of P L A structures with respect to S C - A S I C . Moreover, we w i l l gain insight as to how to improve the P L A architectures. Another reason for this analysis is that we w i l l need C A D tools to partition the logic functions and make architectural decisions. The tools require models and simple expressions for estimating area, delay and power at a reasonable speed. We have developed these models for clock-delayed P L A s , and in the next chapter, we w i l l investigate improving P L A structures. We used about 150 combinational benchmark circuits in the Berkeley ".pla" format to compare their implementation using S C - A S I C as well as P L A layout. They were obtained from "Berkeley P L A Test Set" which is included with Espresso as part of the SIS package [24]. They are combinational circuits implementing various arithmetic and industrial control functions such as: A L U , control, adders and multipliers. The scatter plot of their I/O numbers Output 120 100 80 60 40 20 • • • 20 40 60 Input 80 100 120 140 Figure 3.1: The scatter plot of the benchmark circuits' I/O numbers 19 is shown in Figure 3.1. Over 70% of the benchmark circuits have I/O numbers less than or equal to 20. We w i l l discuss the critical path for the P L A structure used and our method of data collection for the P L A and S C - A S I C implementations of the benchmarks in the following sections. 3 . 1 P L A s t r u c t u r e u s e d We have used the generic clock-delayed P L A structure (as shown again in Figure 3.2) as the base P L A in our study. The clock-delayed P L A s have been the subject of many recent studies for power and delay improvement, which makes them good candidates for replacing S C - A S I C [25,11,12]. 
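The comparisons that follow are driven by the IPO numbers of each benchmark, which can be read directly from the standard Espresso ".pla" header keywords .i, .o and .p. The sketch below is an illustrative Python version of that step (our flow used a Perl script for this); the file name in the usage comment is made up.

```python
# Minimal sketch: extract the IPO numbers (inputs, product terms, outputs)
# from a Berkeley/Espresso ".pla" file.  Only the standard header keywords
# (.i, .o, .p) are read; other directives and comments are ignored.
def read_ipo(path):
    inputs = outputs = products = None
    cube_rows = 0
    with open(path) as f:
        for line in f:
            tok = line.split()
            if not tok or tok[0].startswith('#'):
                continue
            if tok[0] == '.i':
                inputs = int(tok[1])
            elif tok[0] == '.o':
                outputs = int(tok[1])
            elif tok[0] == '.p':
                products = int(tok[1])
            elif tok[0] == '.e':
                break
            elif not tok[0].startswith('.'):
                cube_rows += 1          # a cube (product-term) row
    # fall back to counting cube rows if the optional .p line is absent
    return inputs, products if products is not None else cube_rows, outputs

# Example (illustrative file name): i, p, o = read_ipo("alu4.pla")
```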
Figure 3.2: The 4-7-2 (IPO) clocked CMOS PLA with critical path marked.

There may be room for further area and/or power-delay improvement; however, this was not the goal of our comparison. We focused on the area and delay measurement of PLAs as compared to SC-ASIC, the development of equations for fast comparison, and the development of methods for partitioning circuits to achieve area, delay and power efficiency in their PLA implementation. It is important for partitioning tools to quickly estimate area, delay and power for a PLA implementation of a given circuit. For delay, we calculate the total capacitive loads and the equivalent resistance (Req) of the gates on the critical path. For area, we extract the basic tile sizes (pass-transistor and pull-up/down transistors) and sum them up with the buffer areas based on the IPO number of the PLA (here, IPO stands for input-product term-output; these three numbers are used as multipliers in the area calculations). For power, we take the total capacitance and derive a value for the activity factor α, since the average power is calculated as P = αCVDD²f. The delay, area and power calculation results are presented in the next sections.

3.2 PLA critical path calculations

In the PLA example shown in Figure 3.2, the critical path starts from the clock signal Phi, follows the P2 vertical line and ends with output o0. It is marked with a dotted line. An extracted schematic of the PLA's critical path is shown in Figure 3.3. To compute the worst-case delay using this circuit, only one of the transistors attached to the product line and one attached to the output line are assumed to be on at any given time. Next, we will discuss the equations that we developed for quick calculation of the PLA capacitive loads.

Figure 3.3: Schematic of the critical path for our sample PLA.

The delay of a path containing logic gates and capacitive loads is computed with the equation D = Σ RiCi, where Ri is the effective resistance (Reff) of the i-th gate and Ci is the capacitive load that gate has to drive [29,30]. To calculate the critical path of a clocked PLA, we need to sum up the total delays that set the minimum clock period of the PLA. The PLA clock cycle has two parts: pre-charge and evaluation time. During the pre-charge time the clock signal is low and the vertical product lines or the horizontal OR lines are charging up. At evaluation time, the clock signal goes high and the pull-down lines of the product lines or the OR lines are pulled low. At this time, the lines attached to the activated pass-transistors will be evaluated low. The pre-charge timing depends on the sizes of the clock buffers (R1/R2), pull-up/down transistors (Rup/dwn) and inter-plane buffers (R3/R4), as well as the loads they need to service: CCLK, CAND, CINTR, CCLK_delay and COR (see Figure 3.3). To find the maximum operating frequency, we need to compute the pre-charge time tP and the evaluation time tE. The pre-charge time tP is the delay from the clock signal Phi going low until node n5 (the final stage of the product line) has gone high. The evaluation time tE is the delay from Phi going high until node o0 has switched from low to high.
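A minimal sketch of this path-delay model, with helper names of our own choosing:

```python
# A minimal sketch of the path-delay model above: each gate contributes its
# intrinsic delay plus Reff times the load it drives, and the critical-path
# delay is the sum over the stages (the D = sum(Ri * Ci) form of [29,30]).
def stage_delay(t_int, r_eff, c_load):
    return t_int + r_eff * c_load

def path_delay(stages):
    """stages: sequence of (t_int, r_eff, c_load) tuples along the path."""
    return sum(stage_delay(t, r, c) for (t, r, c) in stages)
```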
The signal waveforms associated with each PLA internal node and their relative occurrences are shown in Figure 3.4. The minimum tP runs from the Phi high-to-low edge until the node n5 low-to-high edge. The minimum tE is the delay between Phi going high and output o0 switching. Also, the clock signal applied to the OR-plane cannot be the same as the clock applied to the AND-plane. The delay between the Phi1 and Phi1_delay signals should allow the evaluation of the product terms in the AND-plane to happen before the evaluation in the OR-plane. In other words, the pre-charge period of node n5 must not coincide with the evaluation time at node n6. Otherwise, a race condition will occur and the voltage level at node n6 will be pulled down in error (see Figure 3.4). Since the OR-plane pre-charge time ends right after node n5 has stabilized, the pre-charge time only needs a small increase over tP. Notice that the input delay (nodes n1 and n2) does not affect the timing directly. However, for the PLA to function properly, it is required that the input arrives at node n2 before the evaluation time starts at Phi1. It should also remain constant for the duration of the evaluation time (see Figure 3.4).

Figure 3.4: The critical path's signal waveforms.

The sum of the pre-charge and evaluation times plus a safety factor of 10% gives us the minimum clock period for our PLA to function properly. In order to compute the PLA timing we need to obtain the capacitive loads and calculate the buffer and pull-up/down transistor timings. To do this we need the gate's intrinsic delay Tint and effective resistance Reff. The intrinsic delay of a device is measured with a self-capacitive load, while Reff gives the device delay time per unit capacitance of the load. For our calculations and simulations, the Artisan cell libraries for TSMC 0.18um technology, at worst-case environment conditions, were used as the reference. The generic gate delay Tdelay (based on logical effort [30]) is calculated with equation 3.1:

Tdelay = Reff x (Cself + Cload) = Tint + Reff x Cload   (3.1)

In our PLA critical path we have a chain of gates whose total delay is:

PLA_clock_period = tP + margin + tE   (3.2)

where margin is the margin-of-error (10% in our case) added for safety to account for timing changes due to process and environment variations. Assuming uniform inverter and Mup/dwn sizes, we can use the following equations to compute tP and tE:

tP = 4 x Tint + R1 x Cinv + R2 x CCLK + Tup/dwn + Rup/dwn x CAND + R3 x Cinv + R4 x CINTR   (3.3)

tE = 5 x Tint + R1 x Cinv + R2 x CCLK + R3 x Cinv + R4 x CCLK_delay + Tup/dwn + Rup/dwn x COR + R5 x Cout   (3.4)

The capacitive loads are directly related to the IPO numbers of the PLA. The capacitances of the clock lines (CCLK in the AND-plane and CCLK_delay in the OR-plane) are due to interconnect and Mpull-up/Mpull-down gate capacitances. Each product line has one pull-up and one pull-down transistor, so to calculate the clock line capacitances for the AND and OR planes we have developed the following two equations:

CCLK = numProd x (CinUpDwn + CapPerLine x CellWidth)   (3.5)

CCLK_delay = 2 x Inputs x CapPerLine x CellHeight + Outputs x (CinUpDwn + CellHeight x CapPerLine)   (3.6)

We have assumed that the clock line spans all product columns and that the clock line for the OR-plane has to cross the AND-plane height. The number of horizontal rows in the AND-plane is twice the number of inputs, to include the inverted and non-inverted forms.
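Equations (3.2) through (3.6) translate directly into a small estimator. The sketch below is an illustration with placeholder argument names of our own; the remaining loads CAND, CINTR and COR are obtained from equations (3.7) to (3.9) given next.

```python
# Sketch of the quick tP/tE estimate of equations (3.2)-(3.6).  Every device
# parameter (R1..R5, Rup/dwn, Tint, Tup/dwn, Cinv, the per-unit-length
# capacitances and the tile dimensions) is a technology number taken from the
# cell library; the argument names here are placeholders of our own.
def clock_line_caps(inputs, products, outputs,
                    c_in_updwn, cap_per_line, cell_width, cell_height):
    c_clk = products * (c_in_updwn + cap_per_line * cell_width)            # (3.5)
    c_clk_delay = (2 * inputs * cap_per_line * cell_height
                   + outputs * (c_in_updwn + cell_height * cap_per_line))  # (3.6)
    return c_clk, c_clk_delay

def pla_min_period(t_int, t_updwn, r, r_updwn, c_inv,
                   c_clk, c_and, c_intr, c_clk_delay, c_or, c_out,
                   margin=0.10):
    # r = [R1, R2, R3, R4, R5]
    t_p = (4 * t_int + r[0] * c_inv + r[1] * c_clk + t_updwn
           + r_updwn * c_and + r[2] * c_inv + r[3] * c_intr)               # (3.3)
    t_e = (5 * t_int + r[0] * c_inv + r[1] * c_clk + r[2] * c_inv
           + r[3] * c_clk_delay + t_updwn + r_updwn * c_or
           + r[4] * c_out)                                                 # (3.4)
    return (1.0 + margin) * (t_p + t_e)          # (3.2), margin = 10% of (tP + tE)
```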
CAND is the line capacitance of a product line plus the junction capacitance of the pass-transistors present on its column; it should also include the input capacitance of the inter-plane buffer (Cinv). It is computed with equation (3.7):

CAND = numInputs x 2 x CapPerLength x CellHeight + attached_transistors x Cds + CINV   (3.7)

CINTR is the capacitance on the rest of the product line in the OR-plane plus the gate capacitances of the gates connecting the product line to the outputs. It is summed up in equation (3.8):

CINTR = Outputs x CapPerLength x CellHeight + attached_gates x Cgate   (3.8)

COR is the line capacitance of the output line plus the junction capacitance of the pass gates attached to it. It is calculated using equation (3.9):

COR = Products x CellWidth x CapPerLength + CINV   (3.9)

(In the Artisan documents, Reff is referred to as the K-load of a gate: its pull-up/down strength expressed as delay time per unit load capacitance, in ns/pF.)

In the next section we will discuss the effect of line resistance on our calculations. The PLA timing data using SPICE simulations as well as fast manual calculations are presented in the following sections.

3.3 Line delay (RC effects)

The RC characteristics of polysilicon and metal lines were computed from the information in TSMC's documents [26]. We found that the line resistance of the poly lines is on average 24 ohm/micron (24 to 250 ohm for a range of 1 to 10 microns) at the nominal width (0.23 microns). The capacitive contribution of these poly lines was in the range of 0.14 to 1.4 fF for the same range of lengths. The dominant time constant τ of a signal going through such a line is τ = RCL²/2, where R and C are the resistance and capacitance per unit length and L is the line length. For the range of 1 to 10 microns the time constant ranges from 1.7e-27 s to 1.8e-23 s. Even at 100 micron length the time constant is in the order of 5e-20 s, which is much less than our period of operation (1 ns at a 1 GHz clock cycle). The resistance of the metal lines was found to be two orders of magnitude less than that of the poly lines. Hence, as long as the poly lines are kept short and metal lines are used for the PLA interconnects we may ignore their resistive effects. This conclusion was also confirmed by simulating a simple circuit consisting of a pull-up/down transistor cell connected to the equivalent RC load of a 1000 um metal line. The signal delay difference between including and not including the resistive effect was less than 5%. Hence, we have excluded the line resistance from our calculations. The line resistance effect becomes significant in large or high-speed PLA designs, where placement of the product lines with a larger number of input connections near the input lines would improve timing.

3.4 PLA area calculations

The PLA area calculation is based on its IPO numbers and the size of its buffers and the pass-transistor tiles which cover the AND and OR planes. In order to make the initial calculation for our PLA sizing independent of technology, we need to use a technology-independent parameter such as lambda (λ). It has long been used to express the feature sizes in a given technology without specifying an exact measurement. Then, by assigning a specific size to λ, it can be used to get the exact size of the device.
• -PQ-4 Figure 3.5: Basic NMOS pass-transistor tile dimensions A typical transistor tile is shown in Figure 3.5 with minimum dimensions based on the design rules which are shown with capital letters A , B , C and E . These values are dictated by fabrication lines' design rules for each technology. They are listed in Table 3.1 along with their corresponding X values in the 4 t h and 5 t h columns. The minimum width and height of the tile is calculated in the bottom 2 rows and the values from rounded up figures only increase the minimum size by 17% for the 0.35um and 5% for 0.18um Technology Dimension 0.35um 0.18um 0.35 (X) 0.18 (X) A 0.4 0.22 2 2 B (contact space) 0.4 0.25 2 3 C 0.3 0.16 2 2 E 0.15 0.1 1 1 Min-width (um) 2.05 1.14 2.4 1.2 Min-height (um) 1.05 0.6 1.2 0.6 Table 3.1 Minimum size tile calculations based on X. 27 technology. This is acceptable since designing with minimum size gates would run into many difficulties and is less scaleable. The minimum size tiles could also be used, once the initial design calculations and technology mapping is done with the chosen X values. In this case the minimum widths for 0.35 and 0.18 micron technology correspond to 12X. We use the following equations to calculate the minimum tile width and height: Tile-Width = 2xA + 2xC + 2xE + PO Tile-height = A+ 2xE (3.10) The PO is the minimum width of the polysilicon line, which sets the length of the n M O S device. The minimum height should be set to be the same as the width to have square tiles. This would allow them to be used in both planes and also simplify the buffer placement along the 2 axis. Furthermore, this would allow slight flexibility in the pass-transistor gate size selection without the need to adjust the tile size. For our experiments, the cell dimension of 1.6x1.6 micron was used to also match the cell width of the inverters used from the cell library. B y selecting the buffer and pull-up/down cells to be of the same or some multiple of the tile size, we ensure the efficient placement of the P L A parts. The cell sizes can also be expressed as some multiple of X to keep the initial design calculations simple. This would simplify the task of the C A D tool which makes the initial technology mapping to decide between selecting a P L A or synthesized S C library solution based on their area, delay, and power cost functions. The total area of a given P L A based on its IPO numbers is computed with the following equation: P L A _ A r e a = area _ Pupdown x (products + Outputs) + (2 x Inputs + Outputs + 2x products) x Inv _ area + (2 x Inputs + Outputs) x products x Tile _ area + Over _ head (3.11) 28 The first part of the equation calculates the area of pull-up/down cells, the second part sums up the buffer area and the third part adds the pass-transistor tiles used in A N D / O R planes. The final term is the over-head associated with the placement of the three parts together. In the next section we w i l l discuss the procedure used for generating the S C - A S I C for the benchmark circuits, to compare with their P L A implementations. 3 . 5 S t a n d a r d C e l l f l o w a n d d a t a c o l l e c t i o n The design flow used in generating the S C - A S I C solution for the benchmark circuits is outlined in Figure 3.6. First the benchmark circuits are optimized with Espresso [27] to use minimum terms and ensure they have correct syntax. This step is common to both P L A and S C - A S I C flow, to ensure both start from a common R T L description. 
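Equation (3.11) is cheap enough to evaluate inside a partitioning tool. A small illustrative sketch of that calculation follows (the function and argument names are our own, and the cell areas would come from the lambda-based tile sizing above, e.g. the 1.6 um x 1.6 um tile used in this chapter).

```python
# Sketch of the area estimate of equation (3.11): pull-up/down cells, the
# input/output/inter-plane buffers, the AND/OR-plane pass-transistor tiles,
# and a placement overhead term.
def pla_area(inputs, products, outputs,
             pull_updown_area, inv_area, tile_area, overhead=0.0):
    rows = 2 * inputs + outputs          # true/complement input rows plus output rows
    return (pull_updown_area * (products + outputs)              # pull-up/down cells
            + (2 * inputs + outputs + 2 * products) * inv_area   # buffers
            + rows * products * tile_area                        # AND/OR-plane tiles
            + overhead)                                          # placement overhead
```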
Each benchmark circuit was then compiled by Synopsys d c s h e l l ™ using the information from Artisan Ce l l library for T S M C 0.18 technology (single poly 6 metal layers) [28]. A timing constraint o f 2ns was placed on all input to output paths and timing evaluations were done based on the worst-case timing information. The timing constraint was Benchmark circuits A Cadence SOC Encounter™: < Optimize with > Auto P & R from Verilog Gate-Level and Generate Layout and extract line R C Layout Area per Circuit loads. espresso Synopsys dc_shell™: Convert to Verilog R T L , Compile and Optimize and generate Verilog Synopsys Prime-Time™: From Verilog Gate-Level and Critical Path delay per Circuit line R C loads generate Critical Path delays. Gate-Level netlist Figure 3.6: The SC-ASIC design flow and CAD tools used 29 placed to ensure the tool does not use minimum size gates and yet it stays within the timing capacity of this library. After the gate synthesis, a Veri log gate-level netlist was produced and was passed to Cadence S O C Encounter™ for placement and routing. Because the circuits are all combinational, routing congestion was not expected and the placement was setup as a sea-of-gates (i.e., no extra space for routing channels was left between the cell rows) with all 6 metal layers available for routing. This was done to ensure that the S C - A S I C flow could reach its maximum density potential. Following placement and routing, interconnect parasitic resistances and capacitances (RC) were extracted. The gate-level netlist and the R C parasitic information were passed to Synopsys Pr ime-Time™ for static timing analysis to find the critical path among all input to output paths. A l l synthesized benchmark circuits met or exceeded the 2ns timing constraint. We did not consider any overhead due to power stripes and/or I/O planning and as a result the area reported for the S C - A S I C layouts are the same as their basic gate area. The area/delay information for the S C - A S I C and P L A solutions is discussed in the next section. 3 . 6 P L A v s . S C - A S I C For the P L A implementations, inverter intrinsic delay and Reff were taken from the cell library reference document [28], and computed for worst case environment conditions (Temperature = 125°C and V d d = 1.62V). This is to ensure equivalent environment conditions to the delay measurements done in the S C - A S I C flow. 3.6.1 SPICE verification of Delay calculations To verify the accuracy of the calculation method used for estimating P L A delays, the capacitive loads were generated for all benchmarks and SPICE simulation results were 30 tabulated for pre-charge and evaluation times for each P L A . The total delay of the P L A was considered the sum of the two values. To compare the results, the ratio of delays measured by SPICE simulation over the quick calculation method is shown in Figure 3.7. The data is sorted for the increasing number of the P L A s ' product terms. For all circuits with less than 200 product terms, our calculation method underestimates the delay by about 10% or less. For circuits with more than 200 product terms, our method overestimates the delay by less than 10%. This is due to the nonlinear nature o f inverter buffers whose delay we are trying to estimate with linear functions such as R C delay calculation. Since we are interested in matching or exceeding S C - A S I C delays, we w i l l be limiting the number of P L A product terms to less than 300. 
Furthermore, increasing our calculated delays by 10% w i l l give us the upper bound necessary to estimate our P L A timing without the need to run a full SPICE simulation. 3.6.2 Delay Comparison The P L A critical path delay was calculated by summing up the delays contributing to pre-Ratio 1.20 -t 1.10 1.00 0.90 0.80 0.70 0.60 no. of Products 350 i • Simulation/Calculation - » - no. of Product 300 250 200 150 100 50 -si-Figure 3.7: Plot of Spice simulation over calculated delays with respect to their no. of products 31 charge and evaluation times as was shown in Figure 3.4 waveforms. A Perl script was used to read in the netlist in ".pla" format, calculate the capacitive loads due to P L A ' s IPO numbers and add the gate or junction capacitances for the number of connections necessary for each product line or output line. Finally, the R C delays for each input to output path were calculated and reported. The path with longest delay (Totaldelay = tp + margin+ t£) was chosen as the critical path. Figure 3.8 shows the graph of P L A delays vs. their number of product terms 3. It is a good indicator of how closely the P L A delay follows the number of product terms. A correlation factor of 95% was found between P L A critical path delay and its number of product terms. In other words, optimizing a netlist for P L A timing should reduce its total number of product terms. Optimizing for a S C - A S I C involves reducing the number of slow Delay (ns) no. Products Figure 3.8: PLA circuit delay data sorted with respect to their number of product terms. 3 Note that every 5-6* circuit name is printed. 32 gates in its critical path that could involve logic conversions on faster gates. The relation between P L A delay and its number o f product terms is also confirmed in the graph of ratios between P L A delays and S C - A S I C delays shown in Figure 3.9. A s the graph indicates, for about 30% of the circuits, their P L A implementation was as fast or faster than their S C - A S I C one (i.e., the ratio was equal to or less than 1.0). Furthermore, the circuits With significantly longer P L A delays tend to have more than 100 product terms. A few of the circuits with small product terms did not follow this trend since their S C - A S I C implementations were much faster (2 to 3 times). We could conclude that by keeping the number of product terms less than 100, we could at least match the delay time of the S C -A S I C circuits with a conventional clock-delayed P L A structure. If we were to use faster P L A structures, as reported by recent papers [12], we could pack up to 200 or more product terms in a P L A and still match or exceed the S C - A S I C timing. Table 3.2 reports the correlation coefficients between the P L A critical path's capacitive Delay Ratio no. Products Figure 3.9: Graph of PLA to SC-ASIC delay ratios sorted with respect to the PLA product terms 33 loads and the P L A ' s IPO numbers. A s the number of product terms increase, the length of the input lines and possible number of gates attached to them increase, which gives rise to C,„2 which is seen by the input inverter. The number of product terms also sets the length of the clock line and the number of pull-up/down gates attached to it, which contribute to Cclk. It also sets the length of the output line in the O R plane which is the main contributor to C OR. The number of inputs sets the height of the AND-plane which is also the length of the product line. 
CAND includes this line capacitance and the junction capacitance of the pass-transistors attached to it. The number of outputs sets the number o f pull-up/down gates in the OR-plane, which are the main contributors to Cclkdelay. It also sets the height of the OR-plane, which sets the length of the product line in the OR-plane. This line and the gate capacitances attached to it are the main contributors to Cintr. Cin2 C AND C intr C OR C elk C_clk_delay Product 98% 14% 15% 93% 100% 17% Input 9% 96% 15% 4% 9% 39% Output 19% 47% 81% 14% 17% 100% Table 3.2: Correlation relation between IPO and Capacitive loads For most circuits, the number of product terms is dominant (much larger than their I/O numbers). A s a result the clock capacitance is the largest and contributes the most of the critical path delay. 3.6.3 Verification of area calculations for the PLA structure used In order to verify our P L A area estimates, we used Berkley's M P L A which produces a P L A layout from a given netlist in ".pla" format. The tool uses M O S I S scalable S C M O S technology with X,=lum and produces static P L A structures with N M O S pass gates. B y scaling it for X=0.1um we could obtain approximate layout sizes for 0.18um technology. The 34 Technology (micron) 2.0 0.35 0.18 IPCN5-8-4 A,=lum 1=0.2um A,=0.1um Dimensions (A,) 8 6 x 7 6 Area (um 2) 6536 261 65 Our 1.6x1.6 tile (um 2) - - 205 M i n . Size: 1.2x0.6 tile (um 2) - - 70 Table 3.3: Sample PLA layout size scaled from A=1um to A=0.1um area of the P L A fabric produced by M P L A for a 5-8-4 P L A , excluding the I/O buffers, is shown in Table 3.3. M P L A uses tile sizes of about 16.5A,X10A,. This is close to 1.6um x l u m iri 0.18um technology. It also uses the diffusion layer for ground line, which reduces the dimensions of the AND/OR-planes . A s shown in Table 3.3, our P L A fabric with minimum size would be close to the M P L A ' s fabric. Furthermore, with minimum size tiles, we would be able to reduce our AND/OR-plane sizes by a factor of 3 (i.e. going from 205um 2 to 70um ). The sample layout used for this measurement is shown in Figure 3.10. Input Buf iers Output Buffers A 1 ) A " . V i . . .Vi • - - i i n *<ir*rr<f \i>l*4fl . S3 | J i l ' 3B4@f^p | ^ Figure 3.10: The layout view of the sample PLA. 35 3.6.4 Area Comparison The P L A area was estimated by using equation (3.10), with the same tile and buffer sizes used for the delay computation. The P L A area was found to have a strong correlation of 94% Ratio 9 0 . 0 -no. of Products 4 5 0 ryy6 ii¥n4- 0 5 o § ? ? S 2 ^ 3 <u o Q. Figure 3.11: PLA area / SC-ASIC area vs. number of product terms with the number o f product terms. However, it is not trivial to predict the size o f S C - A S I C circuit from the netlist or the IPO number. The ratio between P L A area and S C - A S I C layouts was found to range from 1.5 to more than 70, with 74% lying below 10 times (see Figure 3.11). The relative spread of P L A over A S I C area ratios is shown in Figure 3.12. About 80% of the circuits with less than 100 product terms have P L A / S C - A S I C area ratio o f less than 5. Clearly, the P L A implementation of a circuit consumes more silicon area and it must be considered during design partitioning. However, our tiles were not minimum size and P L A area improvement was not the goal of this project. If we use minimum size tiles (see Table 3.1), the size of the main P L A fabric could be cut-down by a factor o f 3. 
As a result, the percentage of PLAs with an area ratio of less than 4 will rise to 74% (see Figure 3.12).

Figure 3.12: The percentage of area ratios in 4 different ranges

3.7 Computing Power Factor for PLA

The dynamic power of a circuit is calculated with the equation P = α·C·V_DD²·f [35]. If we consider the supply voltage V_DD and the frequency f constant, then the dynamic power depends only on the sum of the circuit's total capacitances. In most circuits, however, not all parts of the circuit are activated at each clock cycle; in other words, not all of its capacitive loads are charged and discharged. To account for this, a unit-less factor α (the activity or power factor) is added to the equation. In order to compute α, we measure the average power for a large random set of inputs to the circuit over a long period of time, and then divide that power by the product of the total capacitive load, the square of the supply voltage, and the frequency:

α = P_avg / ((ΣC) · V_DD² · f)

We generated a layout for a PLA example with a 5-8-4 IPO number (see Figure 3.10) and extracted all its capacitive loads. The resulting netlist was simulated with SPICE for a set of 350 randomly generated input vectors. Its capacitive loads and average power consumption at a 400 MHz clock frequency are tabulated in Table 3.4, and an α factor of 86% was calculated. The result for larger PLAs will be lower, since their product lines could be randomly populated by pass-transistor tiles; as a result, the portion of the capacitive loads that is switching will be smaller.

            ΣCaps (fF)   Avg. Power (mW)   V_DD (V)   Frequency (MHz)     α
Total          1009           1.12            1.8           400          86%
Non-Clock       645            -               -             -           80%
Clock           363            -               -             -          100%

Table 3.4: Total capacitance and calculated α for a 5-8-4 PLA.

The activity factor along the clock lines is one (i.e., they are always switching). For the rest of the circuit, the activity factor is close to 0.8. Close to 50% of the capacitive loads are due to buffers, which are mostly off the clock lines, and we found that only 36% of the loads on the fabric are connected to the clock lines. Based on the variation of IPO numbers in our benchmark circuits, a range of 80% to 90% is expected for α. This result also matches other reports of the activity factor for clock-delayed PLAs [12]. Although the power measurement will differ for other PLA architectures and/or fabrication technologies, the same method can be applied to calculate the α factor. Hence, the PLA dynamic power can be estimated without the need for a full SPICE simulation.
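To make the procedure concrete, the following Python sketch reproduces the α calculation from the measured quantities in Table 3.4. The numbers are the ones reported above; the function itself is only an illustration of the method, not part of the thesis flow.

def activity_factor(avg_power_w, total_cap_f, vdd_v, freq_hz):
    """alpha = P_avg / (sum(C) * Vdd^2 * f); all arguments in SI units."""
    return avg_power_w / (total_cap_f * vdd_v ** 2 * freq_hz)

# Values from Table 3.4: 1009 fF total load, 1.12 mW average power,
# 1.8 V supply, 400 MHz clock.
alpha = activity_factor(1.12e-3, 1009e-15, 1.8, 400e6)
print(f"alpha = {alpha:.2f}")   # ~0.86, matching the 86% reported above

With α characterized once for a given technology and PLA architecture, the dynamic power of other PLA instances can then be estimated directly from their summed capacitive loads.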
CHAPTER 4
TECHNIQUES FOR PLA AREA-DELAY-POWER IMPROVEMENT

In this chapter, we will present a few techniques to improve PLA area, delay and power. First, we will apply the concept of logical effort (LE) to our PLA structure to improve its delay. Then we show a heuristic approach to LE gate sizing to optimize the power-delay product (PDP). Next, we will investigate partitioning a circuit to improve its PLA delay, and the effect of partitioning on area and power.

4.1 Using Logical Effort for PLA delay improvement

So far in working with our PLA structure, we have considered our tile and gate sizes fixed. In order to improve the PLA critical path, we still need to consider gate sizing where the timing bottlenecks occur. Normally, this would require many iterations of gate sizing and SPICE simulation to arrive at an optimum size/delay combination. The PLA critical paths, which contribute to the pre-charge and evaluation phase timing, are shown in Figure 4.1. Each is essentially a chain of inverters with the main capacitive loads as their side loads. We applied the concept of logical effort (LE) [29,30] to the PLA's critical path to explore a fast and yet accurate way of gate sizing for this structure. Since the method of LE is independent of technology, it has been used for the optimization of complex CMOS circuits [31,32,33]; however, it has not yet been applied to PLAs. This method will determine the fastest possible speed of the PLA. In using this method, there is an assumption that a library of near-continuous gate sizes exists. This is achievable since there are already numerous gate sizes in current cell libraries, and for a flexible PLA design we will need access to a range of inverter and pull-up/down gate sizes that match our PLA tile sizes.

Figure 4.1: Pre-charge and Evaluation Critical Paths and their side loads.

4.1.1 Basic Logical Effort Optimization

The two main critical paths are shown in Figure 4.1. The pre-charge path has 7 stages, 2 of which are the pull-up/down gates in the AND- and OR-planes. During the pre-charge stage, only the pull-up transistor is on and we are only concerned with its rise time. During the evaluation, we can assume that the input signal has already arrived (the input timing requirement) and we are only concerned with the pull-down transistors' fall time. As a result, for our timing evaluations, the pull-up/down gates behave like inverters. Furthermore, the pre-charge path is interrupted after node n5; the rest of this path's timing depends on the arrival of the OR-plane's delayed clock (Phi1 delay). The timing of the delayed clock is set along the evaluation path. Notice that during the evaluation period, if the input is low, nodes n3 and n5 will remain high and will not delay the path. Hence, we can concentrate on the 5 stages in the pre-charge path and the 4 stages in the evaluation path. The logical effort path optimization starts with the following equations:

Path_Effort = ∏ (LE_i × BE_i × FO_i) = ∏ (LE_i × BE_i) × C_load / C_in    (4.1)

SE* = (Path_Effort)^(1/N)    (4.2)

C_in = LE × C_out / SE*    (4.3)

A gate's LE is its ability to drive current as compared to an inverter with the same input capacitance as the gate; hence, the LE of a balanced inverter is 1. The branching effort (BE) is the path effort of side branches connected to the main path we are analyzing. In our case, BE is also equal to 1. The electrical effort or fan-out (FO) is the ratio of the load capacitance to the input capacitance (Cout/Cin). Equations (4.1) to (4.3) help to find the optimum gate size for each stage. The path effort is computed from the ratio of the output capacitance of the last stage (Cout) over the input capacitance of the initial stage (Cin). The optimized stage effort SE* is computed as the Nth root of the path effort (equation 4.2), from which the Cin of the current stage is calculated with equation (4.3). The following experiment was carried out to evaluate the effect of LE optimization as compared to using fixed-size gates along the PLA critical paths. First, the capacitive loads for the PLA critical path of Figure 3.3 were calculated for a few benchmark circuits with X4 inverters and fixed tiles. They were used as parameters to a SPICE netlist, which was simulated to obtain the pre-charge and evaluation times for each circuit. Then, the same line loads were used as side loads to compute the optimized gate sizes using LE, as illustrated in the sketch below.
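The core of the sizing step is the back-propagation of equations (4.1)-(4.3) from the final load toward the first stage. The following Python sketch shows the textbook form of this calculation for a chain with no side loads; the thesis spreadsheet additionally folds each node's side load (C_AND, C_clk, C_intr) into the Cout of the stage that drives it and re-derives the effort once C_clk is fixed, which is why the SE* column of Table 4.1 below varies from stage to stage. The example numbers reproduce the inverter chain driving a 64x load that is revisited in Section 4.1.4 (Figure 4.3).

def size_inverter_chain(c_in_first, c_load, num_stages, le=None, p_inv=0.5):
    """Textbook logical-effort sizing of an N-stage gate chain.

    le[i] is the logical effort of stage i (defaults to all inverters, LE = 1).
    Returns (stage effort, per-stage input capacitances, path delay in
    units of the technology's minimum-inverter delay tau_inv).
    """
    if le is None:
        le = [1.0] * num_stages
    path_le = 1.0
    for g in le:
        path_le *= g
    path_effort = path_le * c_load / c_in_first          # eq. (4.1), BE = 1
    se_opt = path_effort ** (1.0 / num_stages)            # eq. (4.2)

    c_in = [0.0] * num_stages
    c_out = c_load
    for i in reversed(range(num_stages)):
        c_in[i] = le[i] * c_out / se_opt                   # eq. (4.3)
        c_out = c_in[i]                                    # load on the stage above
    # Each stage adds its stage effort plus its parasitic delay.
    delay = num_stages * (se_opt + p_inv)
    return se_opt, c_in, delay

# Example: a 3-inverter chain with unit input capacitance driving a 64x load.
se, caps, d = size_inverter_chain(1.0, 64.0, 3)
print(round(se, 2), [round(c, 2) for c in caps], d)   # 4.0, [1, 4, 16], 13.5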
A sample spreadsheet result for the pre-charge path gate sizes is shown in Table 4.1. The Cout at stage 5 (node n5) is C_intr plus the gate capacitance of the pass-transistor in the OR-plane (see Figure 4.1). The optimized stage effort SE* is calculated in the 3rd column, and the Cin for each stage is calculated as Cout/SE* and shown in the 4th column. In other words, at every stage, after calculating the new Cin, the load on the previous stage becomes the sum of the previous stage's side load plus the Cin of the current stage. The device size at stage 3 (the pull-up/down gates) also sets the C_clk side load. It is calculated in the 2nd row of the 6th column based on the circuit's number of product terms and the input capacitance of the stage-3 gate. The C_clk side load is then used as the Cout for the remaining 2 stage inverters.

Circuit   Input   Output   Product   C_clk (fF)
Z5xp1       7       10       65        1326.8

Stage   SE* (FO)   Cin (fF)   Cout (fF)   Wp (um)   Wn (um)
  5       1.07       14.25       15.20       5.03      2.10
  4       1.07       13.36       14.25       4.71      1.96
  3       1.35       20.09       27.16       7.09      2.95
  2      10.98      120.81     1326.82      42.64     17.77
  1      10.98       11.00      120.81       3.88      1.62

Table 4.1: Five stage gate optimization with LE

Based on the gate's input capacitance, its P/N device widths, Wp and Wn, are calculated with a ratio of 2.4 to 1 for balanced rise and fall times. The input-capacitance to device-width conversion is technology dependent; for 0.18um technology a 2 fF/um factor is used (a small worked conversion is sketched after Table 4.3). Using the above gate sizing, an updated SPICE netlist for the critical path was used to obtain the new critical path delay. The results are shown in Table 4.3 (under basic LE), and they indicate a significant timing improvement over the fixed-size-gate PLA, at the cost of area.

4.1.2 Modified Logical Effort Optimization

The basic LE method requires the fan-out (Cout/Cin) ratio to be larger than 1. When this ratio is 1, it sets all the gates to the same size. In our path, this causes the inverters in stage 4 to have a larger input capacitance than they should have. This increases the side load C_clk in stage 3 to 2.5 times its original value and causes the optimized gate size for the 2nd-stage inverter to be unrealistically large (i.e., an FO of 11).

Circuit   Input   Output   Product   C_clk (fF)
Z5xp1       7       10       65         990

Stage   SE*    Cin (fF)   Cout (fF)   Wp (um)   Wn (um)
  5     1.07     14.25       15.20       5.03      2.10
  4     4.00      3.56       14.25       1.26      0.52
  3     1.16     14.91       17.36       5.26      2.19
  2     9.49    104.36      990.07      36.83     15.35
  1     9.49     11.00      104.36       3.88      1.62

Table 4.2: Five stage gate optimization with modified LE

To prevent this, a variation of LE must be used that ensures, when the SE* is close to 1, an optimized ratio such as FO4 (4/1) is used between the stage-4 and stage-5 inverters. This result was used to calculate the gate sizes reported in Table 4.2. The second method reduces C_clk's capacitive load by 25% and the overall gate sizes by 18%, which is a significant reduction in power and area. The SPICE simulation results are also shown in column 4 of Table 4.3, and they are very close to the basic LE.

Circuit   Timing       x4 with no LE (ns)   basic LE (ns)   2nd-method LE (ns)   % difference (basic / 2nd)
Z5xp1     Pre-charge          0.83               0.46              0.47                 45% / 43%
          Evaluation          0.88               0.40              0.39                 55% / 56%

Table 4.3: Timing comparison between fixed and optimized gate PLAs
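As a sanity check on the table entries, the capacitance-to-width conversion described above can be reproduced directly. The 2 fF/um factor and the 2.4:1 P-to-N ratio are the values quoted in the text; the stage-5 entry of Table 4.1 comes out as expected.

def widths_from_cin(c_in_ff, cap_per_um=2.0, p_to_n=2.4):
    """Split a gate's input capacitance into P and N device widths (um).

    cap_per_um -- gate capacitance per micron of width (2 fF/um in 0.18um)
    p_to_n     -- Wp/Wn ratio used for balanced rise/fall times
    """
    total_width = c_in_ff / cap_per_um            # total microns of gate width
    wn = total_width / (1.0 + p_to_n)
    wp = p_to_n * wn
    return wp, wn

print(widths_from_cin(14.25))   # ~(5.03, 2.10), matching stage 5 of Table 4.1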
4.1.3 Delay estimation with Logical Effort

The total delay of the path is computed by equation 4.4, which is the sum of each gate's stage effort plus its parasitic (self-loading) delay. For an inverter, P = 0.5.

D = Σ (LE_i × FO_i + P_i)    (4.4)

The delay in this case is technology independent and has to be multiplied by the delay of the minimum-size inverter (τ_inv) to give the real delay for a given technology. For 0.18um technology, τ_inv is close to 15 ps. The delay calculated with logical effort for our two examples differs by 9-10% from the SPICE simulation results. This is expected, as LE does not account for slow rise/fall times and, as reported in [34], could differ by as much as 30% from SPICE due to variation in input transition times. This will adversely affect size and partitioning decisions based on LE when the input transition times vary from one node to the next (especially when some nodes have larger loads). Fortunately, it is possible to extend LE to include the input transition-time effect. In [34] the authors suggest an extension to LE (XLE) with the delay calculated as

D = t_0 + R·C_load + K_tran·t_tran

where t_0 is the gate's intrinsic delay, t_tran is the 10-90% input transition time and K_tran is a model-dependent dimensionless coefficient. Another equation is used to model the output transition time:

t_tran = R_t·C_load + t_tran0

Here, R_t and t_tran0 are model-dependent coefficients. Similar to LE, an initial SPICE simulation on a set of unit-size gates and a linear regression are required to derive the model coefficients. With this method, the delay calculations will be closer to static timing analysis; they will be much faster than SPICE simulation while maintaining close accuracy.

4.1.4 Considering Power with Logical Effort

LE has only been used for delay estimation and for gate sizing that optimizes delay. In recent years, however, it is power reduction that is the focus of designers. We suggest a method that improves the power-delay product (PDP) based on LE, by introducing a smaller loading (i.e., reducing FO). The dynamic power consumed by CMOS gates is mostly attributed to the charging and discharging of their capacitive loads. This energy is computed by the equation E = (1/2)·C_L·V_DD². At every clock cycle the capacitive loads are both charged and discharged, so the total energy consumed is E = C_L·V_DD² [35]. Power is the energy consumed per unit of time; hence, the frequency of switching (charging/discharging) is added to the equation: P = C_L·V_DD²·f [35]. The power-delay product (PDP) is used as a measure to optimize circuit performance based on energy used and delay. Since the delay of a circuit normally sets its frequency of operation, f = 1/Delay, it can be shown with equation 4.5 that

PDP = C·V_DD²    (4.5)

Considering a constant power supply voltage V_DD, the PDP is only a function of the capacitive loads in the circuit. For optimization in LE, the FO is selected based on the graph of delay vs. FO, as shown in Figure 4.2.

Figure 4.2: Graph of normalized PDP and Delay vs. FO factor.

Plotting PDP on the same graph, we notice that by reducing our FO factor from 2.2 to 1.8, we could save 21% in PDP (energy consumption) while spending 8% in delay (see Figure 4.2). In other words, we could trade a minor loss in delay for a more energy-efficient circuit.
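The trade-off behind Figure 4.2 can be explored numerically. The sketch below sweeps the per-stage fan-out for an inverter chain driving a fixed load, computing the LE delay of equation (4.4) and the total switched gate capacitance, with C × D used as the PDP measure of the worked example that follows. The load value, sweep range and the rounding of the stage count are illustrative assumptions, not data from the thesis.

import math

def chain_metrics(fo, c_load, c_in=1.0, p_inv=0.5):
    """Delay (in tau_inv units) and switched gate capacitance of an inverter
    chain sized with a target per-stage fan-out 'fo' that drives c_load."""
    n = max(1, round(math.log(c_load / c_in) / math.log(fo)))  # stages needed
    stage_fo = (c_load / c_in) ** (1.0 / n)     # actual per-stage effort with n stages
    delay = n * (stage_fo + p_inv)              # equation (4.4) with LE = 1
    cap = sum(c_in * stage_fo ** i for i in range(n))   # total gate capacitance
    return delay, cap, delay * cap              # (D, C, C x D)

for fo in (1.4, 1.8, 2.2, 2.7, 3.6, 4.5):
    d, c, pdp = chain_metrics(fo, c_load=256.0)
    print(f"FO={fo:<4} delay={d:6.1f}  cap={c:7.1f}  C*D={pdp:8.1f}")

Sweeping the fan-out in this way makes the shape of the two curves in Figure 4.2 visible: delay rises slowly as FO drops below its delay-optimal value, while the switched capacitance falls off quickly.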
Since the optimal LE solution for delay and the optimal solution for the power-delay product (PDP) are different, we seek a new method to directly obtain the optimal sizing for PDP. We first remove the actual load and replace it with a smaller load. Then, if we compute the LE solution, the sizes will be smaller and the delays will be equal in all stages except the last one. The approach is best illustrated using an example, and we use an inverter chain for this purpose. Consider choosing the gate sizes for the following logic circuit to produce the minimum delay. Assuming that the output load is 64 and we optimize for delay, then SE* = 4 and the sizes increase as shown below:

Figure 4.3: Sizing for a load of 64

The delay, total capacitance and PDP are given by:

D = 3(SE*) + 3(P) = 3(4) + 3(1/2) = 13.5
C = 1 + SE* + (SE*)² = 1 + 4 + 16 = 21
PDP = C × D = 21 × 13.5 = 283.5

However, if we try to reduce the capacitance, and as a result the PDP, by using SE* = 2, then we get:

Figure 4.4: Lower size gates with the original Cout

The delay in this case is different. The first two stages have the same delay as before, but the third stage has a longer delay because of the weaker inverter, as follows:

D = 2(2) + 2(1/2) + (Cout/Cin)/(SE*)^(N-1) + 1/2 = 2(2) + 2(1/2) + 64/(2)² + 1/2 = 21.5
C = 1 + 2 + 4 = 7
PDP = C × D = 150.5

While this result demonstrates that a lower PDP can be obtained, the question remains as to the optimal solution for the PDP. The general solution would proceed as follows:

D = (N−1)(SE* + P) + (C_out/C_in)/(SE*)^(N−1) + P
C = 1 + SE*/LE_1 + (SE*)²/(LE_1·LE_2) + … + (SE*)^(N−1)/(LE_1·LE_2·…·LE_(N−1))
PDP_LE = C × D
∂(PDP_LE)/∂SE* = 0

The solution to the last equation would produce the optimal SE* for PDP. While this is only a preliminary result, it points the way to an LE solution that optimizes energy rather than delay. Many extensions are needed for gates other than inverters, as well as the inclusion of side loads and branching factors.

4.2 Using Partitioning for PLA Delay Improvement

Following the basic area/delay comparisons between PLA and SC-ASIC, we need to investigate methods to improve PLA solutions when applied to large combinational blocks. As we discussed in Chapter 3, for most well-populated PLAs, the number of product terms sets the minimum timing requirement. Hence, reducing the number of product terms in a PLA is the first step in improving its speed. It is possible to partition a given circuit to separate its input/output/product terms. However, as shown in Figure 4.5a, this creates feedback loops between the resulting PLAs, which is undesirable. One way to remove such problems is to start from each required output and include all of its fanins back to the primary inputs. Conceptually, this is shown in Figure 4.5b: the right-to-left funnel starts from the output and includes all of its product and input terms. Our goal is to reduce the PLA delay by partitioning a circuit's outputs and generating a number of independent PLAs; a sketch of this output-cone extraction is given after Figure 4.5.

Figure 4.5: Circuit Partitioning (a) based on Input to Output path (b) based on output's input and product term requirements.
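To make the funnel of Figure 4.5b concrete, the following Python sketch extracts, for one chosen output of a ".pla" description, the product terms that drive it and the inputs those terms actually test. The reader here is deliberately minimal (plain two-column cube lines, with directives and comments skipped), so it illustrates only the cone-extraction step rather than a complete ".pla" parser.

def output_cone(pla_path, out_index):
    """Return (input_indices, product_rows) feeding output 'out_index'."""
    products = []                       # (input_cube, output_cube) pairs
    with open(pla_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith((".", "#")):
                continue                # skip directives and comments
            in_cube, out_cube = line.split()
            products.append((in_cube, out_cube))

    # Product terms ORed into this output ('1' in the output cube).
    cone_products = [(i_c, o_c) for i_c, o_c in products
                     if o_c[out_index] == "1"]
    # Inputs actually tested by those terms ('-' literals are don't-cares).
    used_inputs = {pos for i_c, _ in cone_products
                   for pos, lit in enumerate(i_c) if lit in "01"}
    return sorted(used_inputs), cone_products

Applying this to every output, or to groups of outputs as in the packing heuristic of Section 4.2.2, gives the input and product-term counts from which each partition's delay and area can be estimated with the models of Chapter 3.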
Hence, these are the circuits with less than 20% difference between their slowest single output P L A and their full P L A implementation. On the other hand, for the circuits in which the outputs are dependent on a balanced number of product terms, their single output P L A implementation has only 20% delay o f the full P L A . Therefore, by partitioning this circuit, we could run it at 5 times its full P L A speed. On average, the slowest single output P L A has 44% delay margin compared to tlie full P L A implementation of the benchmark circuit. Figure 4.7 shows the percentage of the circuits that fall under each category. In 8% of the circuits, the difference between their full P L A delay and the slowest single-output P L A was less than 20%. Other 33% had a difference of 20% to 40%. For the rest, about 59%, the difference ranges from 40% to 81%. Single output partitioning, however, is very area intensive and for outputs with only a few product terms, it is very inefficient. Hence, the outputs should be combined when they share input and product terms, until we reach our timing ceiling. 49 Figure 4.7: The percentage of circuits in the three ranges. 4.2.2 Optimizing with Multiple Output PLA A s it was shown in section 3.6.4, there is a strong correlation between the number of product terms in a P L A and its critical path delay. To reduce the total size of the P L A s generated to implement a circuit but still meet the timing constraint, we need to distribute the product terms among the partitioned circuits. We used the following heuristic for the slowest outputs: 1. list the outputs which add the least number of new product terms 2. then the outputs which add the least number of new inputs Notice that we are not considering the common product terms, as two outputs could have common product terms and yet their number of uncommon product terms could be even larger. The outputs at the bottom of the initial list normally are the 2 n d slowest, since they have the 2 n d largest number of product terms. Furthermore, they do not have much in common with the first one. These 2 slow outputs (with the largest product terms,) provide us with the starting seeds. B y assigning the outputs on each list to the corresponding slow output, we can form P L A s with a balanced number of product terms. We plotted the area and 50 0.70 0.68 0.66 0.64 0.62 0.60 0.58 0.56 0.54 0.52 -1 y3 - -a— Delay Change / v /u » » i c o I I I V * I c a o c • ^ • • I l l l l KB! 180%| 170% 160% 150% 140% 130% 120% 110% 100% 90% 1 Output 2 Ouput 3 Output 5 Output Figure 4.8: The Delay vs Area change with respect to the full PLA. delay change for P L A partitions of a sample circuit in Figure 4.8. The values are normalized with respect to the full P L A implementation of the circuit. It is clear that in general by packing more outputs, we tend to reduce the overall area while increasing the delay. Since capacitive loads also increase with area, we need to make a choice between area/power vs. delay. The effect o f area increase, however, is reduced for the circuits with a much larger number of product terms than input/output. From the benchmark circuits, we chose three examples for partitioning, which had large product terms and long delays. We also included the result of partitioning a circuit with long delay, but significantly smaller number of product terms (blO). The results are tabulated in Table 4.4. The total delay, input/product term numbers and area for the full P L A are also listed after each two partitions. 
The percentage difference between the delay of the partitioned P L A s and their full version is listed in the 6th column. The comparison shows that the delays follow the partitions' number of product terms. 51 Circuit Inputs Products Area(um 2) Delay Total (ns) Delay %decrease Area EPartion/Full Product %increase intb (3 output) 15 321 30036 7.063 41% intb (4 output) 14 310 30684 7.186 40% 98% 0% intb (7 output) 15 631 61720 11.89 100% rd84 (2 output) 8 129 9876 3.597 37% rd84 (2 output) 8 163 12380 4.005 30% 108% 15% rd84 (4 output) 8 255 20517 5.701 100% tial (3 output) 14 314 29395 7.034 36% tial (5 output) 14 324 32026 7.191 35% 100% 10% tial (8 output) 14 581 61421 11.02 100% b10(5 output) 15 69 7587 1.99 31% b10(5 output) 14 63 5999 1.78 38% 110% 32% b10 (11 Output) 15 100 12368 2.884 100% Table 4.4: Table of partitioned PLAs delay/area and their %difference with their full PLA The area differences listed in the 7 column indicate that for the 1 and 3 circuits, the partitioned P L A s take close to the same area as the original P L A s . The partitions for the 2 n d and last circuit {rd.84 and blO), however, take about 8-10% more since their combined numbers of products are 15-32% larger than the original circuit. It should be noted that the area calculation does not take into account the overhead of P L A placement. Assuming that best P L A layout practices are used, this overhead could be much less than 10% of the P L A sizes [36]. In conclusion, by partitioning a circuit to smaller P L A s , we gain in timing and do not pay much in area. Furthermore, since the area has not increased significantly (close to 10%), the total capacitive loads w i l l also remain the same and as a result the total dynamic power w i l l not increase significantly. The output packing to partition a circuit to more than 2 P L A s , however, requires a more rigorous algorithm and it is a partitioning problem, which gets more difficult to solve as the number of inputs and outputs increase. It is an NP-hard problem [37] and we need to use heuristics to simplify the search in the solution space for the optimal solution. 52 CHAPTER 5 CONCLUSIONS AND FUTURE DIRECTIONS The trend in D S M fabrication technology favours structured architectures. However, standard A S I C generates random gate and interconnect patterns that makes their fabrication with the new D S M technologies very difficult. In order to consider structured logic arrays as an alternative to S C - A S I C , they have to be competitive in timing, power, and area. We considered a number of structured architectures and found structured A S I C (SA) and P L A to meet the uniformity requirement and have the potential to compete with S C - A S I C . Since S A leaves a few top metal layer for user configuration, it reduces the level of interconnect uniformity. The S A designs are on average 20% slower and use 3 times more power than the S C - A S I C designs. Hence, we focused our research on P L A s as a structured architecture that could f i l l this gap or improve on S C - A S I C designs. In Chapter 3, we presented the methods used to compare P L A and S C - A S I C architectures. First, for a set of benchmark circuits, we generated their S C - A S I C layouts and measured their gate area and delay. Then, based on the clocked-delayed P L A structure, we developed some formulas to calculate P L A delay and area for the same benchmark circuits. The delay and area results for both sets of experiments were tabulated. The results indicated that: 1. 
The P L A delay has a close correlation to its number of product terms 2. The S C - A S I C flow was able to achieve 2ns timing constraint or reduce it for a number of smaller circuits 3. For about 30% of the benchmark circuits their P L A delay was equal or lower than their S C - A S I C implementation 4. The circuits with slower P L A s tend to have more than 100 product terms 53 It was also observed that there are faster P L A designs than the conventional clocked-delayed P L A , which could allow circuits with up to 200 product terms to match their S C -A S I C implementation in P L A . The correlation between P L A ' s IPO numbers and the side loads in its critical path was also presented. In order to verify the accuracy of our P L A delay estimations, SPICE simulations were done on the netlist o f the P L A critical paths with the capacitive loads associated with the benchmark circuits. The results indicated that our calculation methods' results are about 10% lower than the SPICE Results for most o f the circuits. For circuits with more than 300 product terms, the results are closer or somewhat higher. This is because of the linearity of our calculations and nonlinearity of the buffers, whose response changes with the slower rise/fall times. The area comparison showed that the P L A implementations with similar delay take a lot more area than S C - A S I C . A strong correlation of 94% was found between P L A area and its number of product terms. A n d for 80% o f the circuits with less than 100 product terms, their P L A / S C - A S I C area ratio was found to be less than 5. The area estimation was also verified with the result of P L A layout generator tool " M P L A " and scaling the result to match 0.18 micron C M O S technology. We found that the P L A implementation of circuits with less than 200 product terms could be faster than their S C - A S I C counterparts. However, P L A s tend to take much larger area. Finally, we showed that the dynamic power consumption in a P L A circuit depends on its total capacitive loads. We also found that the activity factor is relatively high ( a = 0.8-0.9). B y calculating the a factor for each technology and P L A architecture, we could measure its power consumption without the need for a full SPICE simulation. 54 In Chapter 4, we argued that logical effort (LE) could be used for delay calculation and gate optimization of the P L A critical path. We showed that, using L E gate sizing, a delay improvement of 45% to 55% is achievable. Furthermore, the power-delay product (PDP) of an inverter chain was examined to show that the gate optimization could be adjusted in order to reduce overall energy consumption of the circuit. We showed in one example that at the cost of 8% delay in gate sizing, we could improve P D P factor by 21%. We also presented methods to partition a slow circuit into 2 or more P L A s , such that each would be independent and meet the timing constraint. It was shown on sample circuits that outputs could be packed based on their number of product terms. It was concluded that this is only useful i f one output does not depend on most of the product terms. In case of the balanced partitions, it was shown that we could divide the P L A delay by 2 when we cut its number of product terms by half. It was also shown that the area overhead of partitioning the P L A s with large number of product terms is less than 10%. 5.1 Future Work P L A s look promising as a future structured logic fabric. 
The next step is to study P L A area reduction techniques. B y reducing area, hence total capacitance, we w i l l also reduce the P L A ' s total power consumption. Reducing total power consumption of P L A s is important to make them a viable replacement for S C - A S I C . Furthermore, with access to a P L A layout in a specific technology, power analysis could be performed to investigate its power performance as compared to other architectures. Although P L A s have been studied as L U T s for configurable fabrics [17] and as tiles for structured A S I C [21], they have not been studied as stand alone fabric in parallel with structured A S I C . They are designed to be uniform and may only need the first 2 metal layers. 55 However, we did not look into ways to ensure uniformity at global routing between P L A blocks. It is a challenge that requires further study. Also , we did not look into P L A design improvement. There has been some recent work for improving power and delay in clocked P L A s [11,38]. They require implementation to use for comparison with other architectures. 5.2 Contributions In this work we were able to: • Created a flow to generate capacitive loads and calculate critical path delays for clock-delayed P L A structures. The flow is supported by Perl scripts and well documented for adjustments to match different technologies. • Generated area delay information for the S C - A S I C implementation of the benchmark circuits and created a flow that could be used for future comparative studies. • Investigated power analysis for P L A structure and a method to calculate the activity factor for a given technology and P L A architecture. We found it to range between 0.8 to 0.9 for our example The summary of this research's contributions is as follows: 1. Presented a method to conduct L E analysis on P L A critical path and to do gate sizing with side-load optimization. Gate sizing could improve P L A critical path delay by as much as 55% at expense of extra area and power. 2. Discussed a method to improve energy efficiency when using L E gate sizing. While gate sizing, we can limit our gain in delay in order to achieve energy efficiency in our circuit. 56 Investigated partitioning methods to subdivide slow P L A s to smaller and faster ones and showed their relative delay/area cost. Partitioning a circuit with a large number of product terms according to its outputs could improve delay with little or no area and power penalty. 57 REFERENCES [I] R. Saleh, G . L i m , T. Kadowaki, K . Uchiyama, "Trends in low power digital system-on-chip designs," Proceedings. International Symposium on Quality Electronic Design, 2002., vol . , no.pp. 373- 378, 2002. [2] X . Wang, M . Ottavi, F. Meyer, F . Lombardi, "On the yield of compiler-based e S R A M s , " Proceedings. 19th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 2004. DFT 2004. , vol . , no.pp. 11- 19, 10-13 Oct. 2004. [3] International Technology Roadmap for Semiconductors (ITRS), 2001. [4] M X . Bushnell and V . D . Agrawal, Essentials of Electronics Testing for Digital, Memory and Mixed-Signal VLSI Circuit, Kluwer Academic Publishers, 2000. [5] S.J.E. Wilton, N . Kafafi, J. W u , K . A . Bozman, V . Aken 'Ova , R. Saleh, "Design Considerations for Soft Embedded Programmable Logic Cores", IEEE Journal of Solid-State Circuits, vol . 40, no. 2, Feb 2005, pp. 485-497. [6] B . Zahiri, "Structured ASICs : Opportunities and Challenges," Proceedings 21s' International Conference on Computer Design, Oct. 
2003, pp. 404-409. [7] J. Rabaey, A . Chandrakasan, B . Niko l ic , Digital Integrated Circuits: A Design Perspective, Prentice-Hall, 2 n d Edition, 2003, p. 388. [8] K . F . Smith, T . M . Carter and C E . Hunt, "Structured logic design of integrated circuits using the Storage/Logic Array ( S L A ) " , IEEE Transactions on Electron Devices, vol . ED-29, Apr. 1982, p. 765-776. [9] R. Brayton, G . Hachtel, C . M c M u l l e n and A . Sangiovanni-Vincentelli, Logic minimization algorithms for VLSI synthesis, Kluwer Academic Publishers, 1984. [10] C . Mead and L . Conway, Introduction to VLSI Systems. Reading, M A : Addison-Wesley, 1980. [II] Kwang-Il Oh; Lee-Sup K i m , " A high performance low power dynamic P L A with conditional evaluation scheme," Proceedings of the 2004 International Symposium on Circuits and Systems, 2004. ISCAS '04. , vol.2, no.pp. II- 881-4 Vol .2 , 23-26 M a y 2004. [12] J.S. Wang, C . R. Chang, C . Yeh, "Analysis and Design of High-speed and Low-Power C M O S P L A " , IEEE J. Solid-State Circuits,, vo l . 36, pp. 1250-1262, Aug . 2001. [13] J. Wong, P. Siu, M . Ebel, " A 45ns fully-static 16K M O S R O M , " IEEE International Solid-State Circuits Conference, 1981, v o l . X X I V , no.pp. 150- 151, Feb 1981. 58 [14] O. Takahashi, N . A o k i , J. Silberman, S. Dhong, " A 1-GHz logic circuit family with sense amplifiers," Solid-State Circuits, IEEE Journal of, vol.34, no.5pp.616-622, M a y 1999. [15] Masakazu Shoji, CMOS Digital Circuit Technology, Prentice Ha l l , 1988. [16] J. L . Kouloheris and A . E l Gamal, "PLA-based F P G A area versus cell granularity," IEEE Custom Integrated Circuits Conference, M a y 1992. [17] A . Yan , S.J.E. Wilton, "Product Term Embedded Synthesizable Logic Cores", IEEE International Conference on Field-Programmable Technology, Tokyo, Japan, Dec. 2003, pp. 162-169. [18] S. S. Patil, " A n asynchronous logic array," Project M A C Tech. Memo T M - 6 2 , Massachusetts Institute of Technology, Cambridge, May 1975. [19] S. S. Patil and T. A . Welch, " A programmable logic approach for V L S I , " IEEE Transactions on Computing, vol . C-28, pp. 594-601, Sept. 1979. [20] http://www.altera.com, http://www.xilinx.com, http://www.easic.com. [21]Y. Ran, M . Marek-Sadowska, "On designing via-configurable cell blocks for regular fabrics," Design Automation Conference, Proceedings. 41st, vol . , no.pp. 198- 203, 2004. [22] K . Morris, "Redefining Structured A S I C : eASIC's Better Idea," F P G A and Programmable Logic Journal, May 24, 2005. [23] Fan M o ; R . K . Brayton, "Whirlpool P L A s : a regular logic structure and their synthesis," IEEE/ACM International Conference on Computer Aided Design, vol . , no.pp. 543- 550, 10-14 Nov . 2002. [24] Berkeley P L A Test Set, January 30,1988; included under espresso-examples directory with the SIS package. [25] Dhong, Y . B . ; Tsang, C P . , "High speed C M O S POS P L A using predischarged O R array and charge sharing A N D array," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol.39, no.8pp.557-564, A u g 1992. [26] T S M C 0 .18UM L O G I C 1P6M S A L I C I D E 1.8V/3.3V SPICE M O D E L S . [27] Espresso: Berekly logic level optimizer included with SIS package. [28] Artisan Co. " T S M C 0.18mm Process 1.8-Volt SAGE-X™ Standard Ce l l Library Data book", September 2003, Release 4.1. 59 [29] I. Sutherland, B . Sproull, D . Harris, Logical Effort Designing Fast CMOS Circuits, Morgan Kaufmann Publisher, c 1999. [30] D . A . Hodges, H . G . Jackson, R. A . 
Saleh, Analysis and Design of Digital Integrated Circuits: In Deep Submicron Technology, M c G r a w H i l l , N e w York 2004, 3 r d Ed . [31] X . Y . Y u , V . G . Oklobdzija, W . W . Walker, "Performance Comparison o f V L S I Adders Using Logical Effort," 35th Asilomar Conference on Signals, Systems and Computers, N o v 2001. [32] H . Q . Dao, V . G . Oklobdzija, "Application of logical effort on delay analysis of 64-bit static carry-lookahead adder," 35th Asilomar Conference on Signals, Systems and Computers, N o v 2001. [33] P. Rezvani, M . Pedram, " A fanout optimization algorithm based on the effort delay model," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on , vol.22, no. 12pp. 1671- 1678, Dec. 2003. [34] X . Y . Y u , V . G . Oklobdzija, W . W . Walker, "Application of Logical Effort on Design of Arithmetic Blocks," 35th Asilomar Conference on Signals, Systems and Computers, N o v 2001. [35] W . Wolf, Modern VLSI Design, Perntice Hal l , P T R 1998, pg. 131. [36] M . Fan, R . K . Brayton, "River P L A s : a regular circuit structure," Design Automation Conference, 2002. Proceedings. 39th , vol . , no.pp. 201- 206, 2002. [37] M . R . Garey and D.S. Johnson, Computers and Intractability: A guide to the theory of NF'-completeness. W . H . Freeman and Co. , 1979. [38] R. Molav i , S. Mirabbasi, and R. Saleh, " A High-Speed Low-Energy Dynamic P L A Using Input-Isolation Scheme," submitted to ISCAS 2006. 60 
