UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Energy-time complexity of algorithms : modelling the trade-offs of CMOS VLSI Bingham, Brad D 2007

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Notice for Google Chrome users:
If you are having trouble viewing or searching the PDF with Google Chrome, please download it here instead.

Item Metadata


831-ubc_2007-0328.pdf [ 4.51MB ]
JSON: 831-1.0302116.json
JSON-LD: 831-1.0302116-ld.json
RDF/XML (Pretty): 831-1.0302116-rdf.xml
RDF/JSON: 831-1.0302116-rdf.json
Turtle: 831-1.0302116-turtle.txt
N-Triples: 831-1.0302116-rdf-ntriples.txt
Original Record: 831-1.0302116-source.json
Full Text

Full Text

Energy-Time Complexity of Algorithms: Modelling the Trade-offs of CMOS VLSI by Brad D. Bingham B.Sc, The University of Victoria, 2005 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University of British Columbia August 2007 © Brad D. Bingham, 2007 Abstract Power consumption has become one of the most critical concerns for processor design. Parallelism offers a pathway to increased performance under power constraints — many slow processors can complete a parallel implementation of a task using less time and less energy than a fast uniprocessor. This relies on the energy-time trade-offs present in CMOS circuits, including voltage scaling. Understanding these trade-offs and their connection with algorithms will be a key for extracting performance in future multicore processor designs. I propose simple models for analysing algorithms and deriving lower bounds that reflect the energy-time trade-offs and parallelism of CMOS circuits. For example, the mod-els constrain computational elements to lie in a two-dimensional topology. These elements, called processing elements (PEs), compute arbitrary functions of a constant number of in-put bits and store a constant-bounded memory. PEs are used to implement wires; thus subsuming and accounting for communication costs. Each operation of a PE takes time t and consumes energy e, where eta remains invariant for some fixed a > 0. Not only may different PEs independently trade time for energy in this way, but the same PE may vary the trade-off on an operation by operation basis. Using these models, I derive lower bounds for the ETa costs of sorting, addition and multiplication, where E and T are the total energy and time, and present algorithms that meet these bounds asymptotically. Clearly there exist many algorithms to solve each of these problems, and furthermore there are many choices of how to implement them with processing elements. Fortunately, the tight asymptotic bounds collapse the hierarchy of algorithms, implementations and schedules. This demonstrates that choosing other algo-rithms or layout schemes may only improve the energy-time "cost" by constant factors. In addition to analysing energy-time optimal algorithms for these problems, I also determine the complexity of many other well-established algorithms. This sheds light on the relative energy-time efficiency of these algorithms, revealing that some "fast" algo-rithms exploit free communication of traditional computation models. I show that energy-time minimal algorithms are not the same as those that minimize operation count or the computation depth. ii Table of Contents Abstract " Table of Contents iii List of Tables vii List of Figures viii Acknowledgments ix Dedication x 1 Introduction 1 1.1 Motivation 2 1.2 Energy-Time Trade-offs 4 1.2.1 Voltage Scaling 4 1.2.2 Other Energy-Time Trade-offs 6 1.3 Problem Statement 6 1.4 Thesis Statement and Contributions 8 1.4.1 Model Description 8 1.4.2 Matching Upper and Lower Bounds 9 1.4.3 Comparison of Algorithms and Other Results 9 iii 2 Related Work 11 2.1 Area-Time Models 11 2.1.1 Multiplication and Addition 12 2.1.2 Thompson's Thesis 13 2.1.3 An Area-Time Optimal Multiplier 17 2.2 Energy-Time Models 18 2.2.1 The ET2 Model of Martin 19 2.2.2 Other Energy-Time Work 20 2.3 Energy-Aware Scheduling 21 2.3.1 The Model of Yao et al 22 2.3.2 The Model of Albers and Fujiwara 23 2.3.3 Other Scheduling Models 24 2.4 Other Related Work 25 2.5 Summary 27 3 The Model 30 3.1 Metricless Model 31 3.2 Grid Model 35 3.3 Gridless Model 37 3.4 Proof Framework 39 3.5 An Example: AMnput AND Gate 42 3.6 Model Limitations 43 4 Sorting 45 4.1 ID Sorting 46 4.1.1 Bubble Sort . 46 4.1.2 Odd-Even Merge Sort 53 4.1.3 Discounted Communication 55 4.2 2D Sorting 56 iv 4.2.1 Shear Sort 58 4.2.2 SSSort 59 4.2.3 Other Mesh Sorting Algorithms 59 4.3 Lower Bounds 60 4.4 The Impact of Word Size 62 4.4.1 SS Sort Details 63 4.4.2 Increasing the Block Size 64 4.4.3 Bit-wise SS Sort 66 4.4.4 Upper Bound and Proof of Correctness 69 4.5 Summary 73 5 Binary Addition 76 5.1 ID Binary Addition 76 5.1.1 Brent-Kung Addition 77 5.1.2 Kogge-Stone Addition 79 5.2 2D Binary Addition 80 5.2.1 H-Tree Adder 81 5.3 Lower Bounds 83 5.4 Summary 84 6 Binary Multiplication 86 6.1 ID Binary Multiplication 86 6.1.1 Discrete Fourier Transform 87 6.2 2D Binary Multiplication 88 6.2.1 Preparata's Implementation 89 6.3 Lower Bounds 90 6.4 Summary 91 7 Conclusions and Future Work 92 7.1 Summary of Results 92 v 7.2 Other Problems 7 95 7.3 Other Models 95 Bibliography 97 Appendix A Proof of Theorem 4.1.6 103 vi List of Tables Summary of ETa complexities 93 vii List of Figures 3.1 Hierarchy of entities 40 3.2 8-input AND gate 42 4.1 Sorting network for bubble sort 47 4.2 Direct implementation of the sorting network for bubble sort 47 4.3 Sorting network for odd-even merge sort on 8 inputs 53 4.4 Row-wise snake order on a mesh 58 4.5 The w x w tile of bits, for w — 8 67 4.6 Tile configuration for the end of snake-rows 68 4.7 Tile transpose at the start of Epoch 5' 73 5.1 Brent-Kung addition conceptual diagram 77 5.2 Kogge-Stone addition conceptual diagram 79 5.3 H-Tree adder conceptual diagram 82 viii Acknowledgments First and foremost, I thank Mark Greenstreet for his supervision. His vast knowledge and excellent ideas gave this work direction and momentum. I wholeheartedly appreciate his patience and encouragement, and his confidence in me has been truly inspiring. I also extend my gratitude to David Kirkpatrick, Will Evans, Jack Snoeyink, Anne Condon and Joel Friedman, who all gave their time and energy to help this research along. Igor Benko and Michael Holdenderski deserve recognition for discussions that led to this investigation. None of this thesis, or any of my academic endeavors would have been possible without the support of my family. To my parents, thank-you for your limitless tolerance and support. To my brothers, thanks for being such good friends to me and for keeping me sane. BRAD D . BINGHAM The University of British Columbia August 2007 ix For KAS, who is a very special lady. x Chapter 1 Introduction Computers have improved dramatically in speed, memory capacity and cost over the more than forty years where operation counting has been the mainstay of algorithm analysis. These improvements have been driven largely by advancements in integrated circuit fabri-cation technology. In 1965, the largest integrated circuit had 64 transistors [40]. Today, a single chip can have well over one billion transistors [42]. These advances have produced a dramatic increase in performance and memory capacity. Remarkably, the sequential pro-gramming model has remained dominant throughout these technological advances. While parallel processors are used for large scale numerical computation, databases and, more re-cently, web servers, most computers have been based on a single processor CPU. With the dominance of the sequential programming model, operation counting has remained preva-lent in algorithm analysis. The technological advances that have enabled exponential increases in processor performance and memory capacity have resulted in large increases in power consumption. Now, most high performance designs are power limited [27, 41]; thus, improving perfor-mance and managing power consumption are directly linked. We appear to be at a techno-logical cusp. While computer technology has encountered thermal limits before, first with vacuum tubes and then with discrete transistors and later with bipolar integrated circuits, this is the first time that a digital technology has encountered thermal limits without an 1 alternative that offers a better power performance trade-off. Parallel computing offers a way to overcome the thermal limits of sequential com-putation. As described below, the amount of energy required to perform an operation using CMOS logic circuits decreases when more time is allotted for the operation. This allows a parallel algorithm to complete a task using both less time and less energy than its se-quential counterpart. To take advantage of parallelism with power constrained processors, programmers must understand the connection between algorithms and power consumption. Operation counting is insufficient for modelling this connection. The remainder of this chapter describes the role of energy and power consumption in current integrated circuit design and its relationship with parallel algorithms, followed by the challenges associated with modelling this connection and finally my approach to solving this problem. 1.1 Motivation Design for low-power has traditionally been motivated by the ubiquitous portable devices of modern computing. Cell phones, PDAs and laptop computers are are prime examples where battery life is at a premium. Typical laptop battery capacity ranges from 24 to 72 Watt-Hours [21]; longer lasting battery life comes at the cost of increased weight. While the energy consumption of laptops is typically dominated by the display or disk drives as opposed to the processor, smaller devices with small screens and no disk drive see a larger percentage of total power consumption attributed to the processor. It is expected that the complex logic of future processors and the simplicity of small, increasingly power-efficient devices in which they are embedded (including laptop computers) will lead to an increased percentage of processor power consumption [21]. Perhaps less obvious are the power concerns of general purpose desktop processors and servers. The recent trend towards multi-core chips instead of increasing the clock rate of superscalar uniprocessors is a direct consequence of the difficulties associated with dissi-pating the heat produced by these processors. Overall energy consumption may be reduced by using several simpler cores that are not superscalar and have a single execution pipeline. 2 One such architecture design is the Cell Broadband Engine [24] which makes use of many RISC-like cores. Such processors rely on the presence of thread level parallelism, as op-posed to superscalar processors that focus on the extraction of instruction level parallelism. Another important issue is the energy cost associated with large server farms. One study indicates that an Internet data center with about 8000 servers consumes two million Watts, accounting for 25% of the total operating cost [52]. Google has recently predicted that the power cost of their servers will soon overtake the initial hardware costs, under the expected four-year server lifecycle [6]. Thus an important metric to measure performance in such computing environments has become MIPS per Watt. In Google's case, there is an expectation that chip multiprocessors, as opposed to conventional commodity computers, will likely offer increased performance in this respect [6]. Among the many techniques for reducing energy consumption are changes at the algorithmic level. As a simple example, consider a line of high-level language code, y = x * 8, where the value of integer variable x is multiplied by 8. Optimizing compilers will not generate an assembly-level multiply instruction, but rather a bit shift instruction that shifts the register containing a; by 3 to the left, or y = x « 3 in high level language code. Clearly this has the same effect, and yet on most architectures the shift instruction will take less cycles to complete and consume less energy, as less capacitance associated with transistors and wires are charged or discharged. We see that a simple algorithmic change may be of benefit, even to a single instruction. This idea applies to general algorithms. While two algorithms solving the same problem may be compared by their respective operation or instruction count, the nature of the instructions may affect both energy consumption and delay. On a larger scale, one might compare the relative energy associated with running a parallel algorithm on many simple, single execution machines as opposed to a sequential algorithm running on a superscalar machine with a higher clock rate. The connection between algorithms and energy with complete freedom of choice over architecture and system design are a key component in addressing the power concerns of the previous subsection. This connection is more com-3 pelling when we consider the inherent trade-off between energy consumption and delay in modern computing. 1.2 Energy-Time Trade-offs CMOS logic circuits offer the designer fungible energy and time: if more time is allotted for an operation, the operation can be performed using less energy. For example, simple scaling models for CMOS logic circuits [25] model the delay of a logic gate, t, as inversely proportional to the operating voltage, whereas the energy per transition, e, is proportional to the square of the voltage. This leads to a trade-off where et2 is invariant under changes in the operating voltage [38]. Similar trade-offs occur when choosing the sizes of transistors in a CMOS design [16] and microarchitectural alternatives [22]. Combined, these trade-offs allow designers a latitude of several orders of magnitude of energy consumption, motivating the emphasis on asymptotic results that I present. The opportunity to exchange time for energy creates a compelling incentive to ex-ploit parallelism: the individual operations in a parallel algorithm may be performed more slowly, thus saving energy, while completing the entire task in the same or less time than the corresponding sequential computation. In fact, a parallel algorithm may perform more oper-ations than its sequential counterpart while using less time and less energy. While energy-time trade-offs have been understood by the VLSI design community for many decades, their implications for algorithm design have only been scarcely explored. The remainder of this section provides a brief overview of some of these methods of trading energy and time. 1.2.1 Voltage Scaling One of the most promising methods for reducing power consumption is voltage scaling. The basic idea behind voltage scaling is very simple: when the power supply voltage for a CMOS logic circuit is lowered, the time per operation increases, but the energy per op-4 eration decreases. The precise trade-off function is complicated, and the most appropriate abstraction for it is debated. Nevertheless, a simple, first-order model is derived as follows. This simplified description of CMOS logic begins with a review of basic definitions for electrical circuits, explained with analogy to water. Charge refers to the quantity of electrical carriers. Electrical charge corresponds to the volume of water; in particular, one coulomb is the amount of charge equivalent to 6.24 x 10 1 8 electrons. Voltage is a measure of energy per unit of charge; it corresponds to water pressure. One volt is one joule per coulomb. Current is the rate of flow of charge or water: one ampere is one coulomb per second. The basic operation of a CMOS logic circuit is to charge or discharge a capacitor. These charging actions are controlled by transistors that can be thought of as switches, and the capacitors arise from the transistors themselves and the wires that connect them. A capacitor can be thought of as a water storage tank. Capacitance is measured in farads, where one farad is one coulomb per volt. If a capacitor of capacitance C is charged to a voltage V it will hold a charge of CV. Now consider a circuit with an operating voltage Vdd- If a capacitor of capacitance C is charged to Vdd and then discharged to ground1, a total of charge of CVdd coulombs moves from voltage Vdd to ground, expending a total energy of CV^d joules. Thus, the en-ergy consumption of a CMOS circuit is proportional to the square of the operating voltage. Transistors operate as voltage controlled switches that charge and discharge the wiring and other capacitances of CMOS circuits. Transistors have three terminals: the gate, the source and the drain. The voltage on the gate determines the conductivity of the path from the source to the drain. Simplistically, the magnitude of the current that flows from the source to the drain is proportional to the difference between the source and drain voltages and proportional to the voltage applied to the gate. Both of these factors are proportional to Vdd', thus, the current that flows through a conducting transistor is roughly proportional to V£d. The time to charge or discharge a capacitor is proportional to the amount of transferred charge, divided by the current. Noting that the charge is proportional to Vdd and the current 1 Ground is the zero-reference for voltages. 5 proportional to Vjd, we conclude that the delay of a CMOS logic circuit is proportional to l/Vdd. To summarize, a simple model for CMOS transistors shows that the energy per operation goes as Vdd and the time goes as l/Vdd, where Vdd is the operating voltage. This yields that the energy e and time t for an operation have et2 invariant under changes in Vdd-For example, if Vdd is reduced by a factor of two, the time per operation will double, but the energy per operation will decrease by a factor of four. Noting that power is energy per unit time, this reduces power by a factor of eight. This technique is used in modern processors, where a small set of voltage levels may be used [12, 21]. 1.2.2 Other Energy-Time Trade-offs In addition to voltage scaling, there are other methods of trading time for energy. The energy-time trade-off of voltage scaling may be exploited at a high level by partitioning regions of the chip into voltage islands with different fixed operating voltages. At the tran-sistor level, the size of transistors also presents a trade-off opportunity. A wider CMOS transistor has lower resistance and thus decreased delay, but an increased source, drain and gate capacitance that requires more energy to charge. Transistor sizing may be exploited, for example, to decrease cycle time by widening transistors on critical paths. This trade-off and various associated optimization problems have been studied at length [15, 47, 48]. At the microarchitectural level, there are many features and techniques affecting energy and delay. Gonzalez and Horowitz investigate the implications of some of these, including superscalar issue, pipelining, cache organization and speculative execution [22]. 1.3 Problem Statement Power and energy consumption have become the critical factors determining computer sys-tem performance. Circuit designers and computer architects have attempted to address these challenges by using methods that trade energy and delay. Parallel computing offers the most promising direction for improving performance while reducing power consump-tion. However, this requires departing from the dominant, sequential model of computing. To make effective use of parallelism, programmers need models that faithfully abstract the energy-time trade-offs offered by parallelism. Operation counting does not model these trade-offs. A successful model should be tractable in its parameters while taking into account both energy and time. This suggests that the model should be independent of particular process technologies, but still specific to CMOS and its trade-off properties, as CMOS is and will continue to be the prevalent computing technology for at least another decade. As pointed out above, it is not clear exactly how to best model voltage scaling trade-offs, and modelling the more general trade-offs available through various methods is even more unclear. Therefore, it is important that the energy-time trade-off be generalized in some sense, for robustness to change in the modelling assumptions. Chapter 2 highlights the sparsity of work towards such a model. Communication costs must also be considered in the design of algorithms and mod-elled accordingly. It is currently possible to fabricate chips with a handful of traditional CPUs [32,44] or thousands of simple CPUs [53], each with its own local memory, possibly architected as a cache. As feature sizes on chips decrease, logic gates operate relatively faster and wires operate relatively slower. Thus, wires that cross a chip (global wires) can have delays that are several times the clock period. Communication between separate chips is even slower. Traditional models of computation (operation counting, PRAM [17], etc.) assume that communication is free, and only count operations. Some work has been done with the opposite assumptions: operations are free and communication is expensive [58]. A realistic model must reflect the fact that there are costs associated with both computation and communication so that the algorithm designer may consider trade-offs between the two. 7 1.4 Thesis Statement and Contributions In this dissertation, I investigate questions regarding the critical connection between energy and algorithms. We are not aware of any prior research in this important area. The central thesis of this dissertation is: The simple computational model that I present is sufficient to evaluate algo-rithms in CMOS logic circuits reflecting the associated energy-time trade-offs, communication costs and constraints. This model is applied to some common problems and tight asymptotic bounds are proven. Specific contributions are covered in greater detail in the following. 1.4.1 Model Description Perhaps the most important contribution in stimulating further research is the precise defi-nition of models. In Chapter 3,1 propose simple new models for parallel computation that incorporate energy-time trade-offs and communication costs with unlimited parallelism. Furthermore, I allow a general trade-off of eta for any a > 0. These models ignore all constant factors which makes them independent of many fabrication technology details. This also means that analysis in the models is most significant when viewed in terms of the asymptotic growth of energy and time as functions of the input size, which inherently hides all constants. The two models given, the grid model (Section 3.2) and the gridless model (Sec-tion 3.3) correspond to upper- and lower-bound models, respectively. The grid model re-stricts computing elements and wire segments to squares in a Cartesian grid, where embed-dings for algorithms are easily drawn and conceptualized. This has the desirable property that it is relatively straightforward to adapt well known algorithms to implementations in this model. Conversely, the gridless model is unrestrictive in the shape and size of comput-ing elements, and furthermore ignores all wiring geometry. The simplicity of the gridless model is harnessed to generalize over all implementations, regardless of wire placement 8 and other details, to give lower bounds on time and energy consumption. The experience of studying the three problems herein suggests that other interesting computational problems can likely be viewed within these models with relative ease, and reveal their associated energy-time bounds. 1.4.2 Matching Upper and Lower Bounds Three problems are analyzed in these models: the classical sorting problem, and typical VLSI problems of binary addition and binary multiplication. In each of these cases, match-ing upper and lower bounds are found by using the grid and gridless models. The upper bounds are derived through straightforward adaptations of known mesh-based algorithms to the grid model. Matching lower bounds in the gridless model are shown through sim-ple communication arguments. Initially, inputs for sorting are words of unspecified size and processing elements operate on entire words. The sorting problem is also investigated when the bits composing each input word are considered as individual inputs. Under some reasonable assumptions, matching bounds are proven for this bit-wise sorting problem. I also consider the energy-time complexity of these problems when the inputs and outputs necessarily lie in a linear arrangement, as is common and desirable in VLSI chips, and find matching bounds for all problems under this restriction. The upper- and lower-bound nature of these models allows for strong conclusions when matching bounds are found. Even once an algorithm is selected to operate in the grid model, there is much choice associated with placement of computing elements and wires, and the allotted time for these. If we find even one set of choices that yields an upper bound that matches a proven lower bound, we conclude that there is no better set of choices to improve asymptotic complexity. This framework is covered in greater detail in Section 3.4. 1.4.3 Comparison of Algorithms and Other Results I also present some surprising results: accounting for communication costs reveals why some fast algorithms perform well in traditional models. For example, if the inputs of a 9 sorting network are required to lie along a line, then "slow" algorithms such as bubble-sort and "fast" algorithms such as odd-even merge-sort have the same asymptotic energy-time costs to within constant factors. This is because the fast algorithms obtain their speed by ex-ploiting the free communication of traditional models of computation, and their advantage is reduced to a constant factor when the energy-time costs of computation are considered. A similar result is seen in the comparison of a carry-ripple adder to the asymptotically worse Kogge-Stone adder, which is traditionally seen as much faster. Our construction for an optimal adder shows that broadcasting a bit to all 0(d2) mesh locations within distance d of a source has the same cost in the models, to within a constant factor, as sending the bit to a single location distance d away. The presented algo-rithms for sorting and multiplication both perform more operations than the best sequential algorithms, yet the parallel implementations use less time and energy. In fact, for each of sorting, addition and multiplication, there are parallel algorithms whose time and energy requirements both grow slower than ./V where N is the number of bits in the input. 10 Chapter 2 Related Work Research into energy-time trade-offs in algorithm design touches on many different areas. These include computer architecture, CMOS chip design, algorithm analysis, parallel algo-rithms, theoretical models of computation, thermodynamics and information theory. Ac-cordingly, the papers covered in this section draw upon all of these topics. I divide this related work into four sections, roughly ordered by relevance: Section 2.1 covers work on area-time minimal computations; Section 2.2 covers work on previous energy-time mod-els and other energy considerations; Section 2.3 covers various models of energy-aware scheduling; Section 2.4 summarizes other related work. The uniqueness of my research in light of this related work concludes the chapter in Section 2.5. 2.1 Area-Time Models The techniques for algorithm analysis with an ETa cost metric resemble earlier work for deriving AT2 bounds for algorithms where A denotes the area required for a VLSI imple-mentation that completes the computation using time T [8, 10, 49, 51, 55]. While some researchers included energy costs within the AT2 model, they assumed fixed operating voltages and thus equated energy with capacitance (wire length). The AT2 models do not include the possibility of trading energy and time through voltage scaling [25, 38], tran-sistor sizing [16] and other techniques [22, 41] that are subsumed by the models of this 11 thesis. Like the AT2 work, my arguments are often based on communication. However, the proofs in this thesis tend to be based on the total distance that data must travel, whereas AT2 results are typically based on arguments of cross-sectional bandwidth. The remain-der of this section surveys groundbreaking work with area-time models and papers that are particularly relevant to my research. 2.1.1 Multiplication and Addition Brent and Kung presented an area-time model with many constraints meant to accurately model VLSI chips [10]. The region in which the computation takes place is assumed con-vex, with I/O ports of constant area located at the periphery. Arbitrary two-input "gates" operate with constant area and time. Wires connecting I/O ports and gates have constant width and can be arbitrarily long. Only a constant number of wires may intersect at a given point, modelling the fixed number of wiring layers of VLSI chips. Wires may propagate a single bit in constant time regardless of length. Bits are stored in constant area mem-ory elements and may not be stored at input ports. Finally, the computation is "oblivious" in the sense that the timing and location all operations is independent of the input values. Algorithms are evaluated in terms of AT2a for 0 < a < 1, where A is the area of the region and T is the time for the computation. Although all constants are parameterized and their relative values speculated, the proven bounds are asymptotic and independent of these constants. Brent and Kung's paper also examines the problem of binary multiplication [10]. First, a lower bound for shifting circuits is given based on cross-sectional bandwidth, yield-ing bounds of AT2 > K\n2 and AT > K^Ln, where K\ and K2 are constants, L is the chip perimeter and n is the number of input bits. Then, the lower bound for area A > Aon is given for constant AQ, proven with a information theoretical argument using the num-ber of possible outputs. This is quantified as /x(n), the number of products that can be formed by multiplying two numbers both less than n, bounded below by <r(n), equal to the sum of all primes less than n which in turn has known lower bounds. Algebraically com-12 bining these two bounds gives the main lower bound of (A/A0)(T/T0)2a > n a + l where To is constant. A complicated algorithm that uses convolution gives an upper bound of AT2a G 0(na+1 l o g 2 a + 1 n). The details of this paper were later reproduced in another Brent and Kung paper that also considers binary addition [9]. The algorithm they use for an upper bound is of Brent and Kung's design, usually referred to as the Brent-Kung adder [8]. After repro-ducing the algorithm details, it is analyzed in a straightforward manner to give the upper bound AT2a G 0(n l o g 2 a + 1 n). They conclude that because the asymptotic ratio of the lower bound of multiplication over the upper bound on addition is f^-y/n), multiplication is inherently harder than addition in an area-time sense. 2.1.2 Thompson's Thesis Thompson's PhD thesis [55] defines a different area-time model than that of Brent and Kung. This model has 8 assumptions that attempt to accurately reflect the properties of CMOS integrated circuits. Each of these assumptions is stated in terms of an upper and a lower bound model for ease of proof. In both cases, computation is simulated with a communication graph, a directed acyclic graph where of vertices represent computation and edges represent communication. In the lower bound model, each unit square in the plane may contain one vertex or one cross-over of at most four wires. The corresponding upper bound model assumption attempts to capture the effects of wire delay by assuming that regions of a minimal size communicate with one another through long driver wires. A driver wire of length 0(k) requires O(k) area, and can transmit a bit in constant time, although the driver itself has a charging delay of 0(log k). Other lower bound assumptions include: • Area of a communication graph is the number of unit squares occupied by vertices or wires. • Wires have a bandwidth of a single bit per unit time in both directions. 13 • Each input is streamed from a single source vertex, one bit at a time. • Each output is streamed to a single sink vertex, one bit at a time. The corresponding upper bound assumptions are: • Area of a communication graph is the area of the smallest bounded rectangle of all occupied unit squares. • Wires have a bandwidth of a single bit per unit time in one direction. • Inputs are initialized in a string of vertices, or a "register" of at least [log M] bits, where M is the number of possible inputs. • Outputs are stored in a string of vertices, or a "register" of at least [log M] bits. Algorithms, or in Thompson's terminology, admissible communication graphs, are evalu-ated in terms of their AT2a asymptotic complexity1, where 0 < a < 1 and a = 1 is the special case of interest. These models at their core are similar to that of Brent and Kung. However, one key difference between these models is that Brent and Kung insist that in-puts/outputs lie at the edge of the chip, whereas Thompson does not make this restriction. The two problems studied in Thompson's thesis are sorting and discrete Fourier transform (DFT). The lower bounds for both of these problems are proven using graph theoretic bi-section arguments. The general structure of the upper bound constructions is similar for both problems. I will sketch the main ideas behind these upper and lower bounds here. Thompson's lower bound proofs rely on the notion of minimum bisection width which is the minimum graph cut that partitions the graph into two disconnected graphs of nearly equal size (in the number of vertices). This definition is further generalized as the minimum graph cut that partitions the graph into two disconnected graphs containing a nearly equal number of vertices of some specified vertex subset. As the lower bound model assumes all vertices are embedded into unit squares, the corresponding communication graph is a planar grid where each vertex has degree of at most 4. A relation is established 'Thompson uses 2x as the exponent for T; I use 2a for consistency. 14 between the area and the minimum bisection width dividing only the source vertices, de-noted henceforth by cut^ where the cut size is u>. The area is related as bounded below by w2/4; this is proven by examining the number of such cuts that zig-zag through the grid (there are [u>/2\ such cuts), which is leveraged to bound the number of vertices that must exist. Now, u> is related to a lower bound on the time for a computation by associating with it the entropic information complexity that must cross the bisection between the di-vided source vertex locations to effect required interactions of the computation. Using some information theory, Thompson bounds the information complexity as the choices of input values on one side of the cut that maximizes the information complexity (the min-imum number of bits) that must cross the edges of cutu, minimized over all choices of cutw. This value is divided by u>, the bandwidth of the cut, to give the worst-case time for a computation. Finally, the information complexity bound is determined for both discrete Fourier transform (DFT) and sorting. For sorting, an auxiliary problem called RSORT is defined which is the same as the sorting problem but it only requires the smallest half of the outputs. The the information complexity of RSORT is maximized when all input values on one side of cut^ are equal to the largest input value. This bounds the time to solve RSORT and sorting by VL((N\ogN)/u); combined with the area bound gives the main result of AT2a G Q(Na+1 log2a N) for sorting. The DFT problem is approached similarly, noting that a DFT on half the inputs is necessarily solved when the entire DFT is computed. This observation leads to the same AT2a lower bound as for sorting. The upper bounds for sorting and DFT are both established through implementa-tions onto two different interconnects, a mesh and a shuffle-exchange network. An N-element fast Fourier transform (FFT) algorithm operates on a square mesh of multiply-add cells, indexed in row-major order. It is shown that the FFT algorithm needs at most y/N routing steps between phases to get the data to the right place. However, most routing of the log N phases requires fewer routing steps than this; the entire FFT algorithm takes exactly 4(\/N - 1) routing steps. 15 Each multiply-add cell has area that is O (log TV) x O (log AT) and is capable of routing words of log ./V bits between any of its four grid neighbors. The time associated with routing the log iV bits a distance of 0(log N) between adjacent cells is 0(log log N), a consequence of driver delay (see above). Once the proper data has arrived at a cell, it performs the fundamental multiply-add operation of FFTs, YQ = -^ o + otKX\ and Y\ = XQ — aKX\, where XQ,X\ are inputs, Yo,Y\ are outputs, a is an Nth root of unity, and k is the appropriate power of a to use, depending on the FFT stage. The 0(l6g 2 N) bits of storage per cell are enough to store a "program" that handles the timing and opera-tions, as well as each value of ak for all log TV values of k used by the cell. A circuit diagram is given for the multiply-add cell displaying the exact wiring and implementation details. The multiply-add function is implemented through a constant number of carry-save adders; the entire operation takes (9(logAr) time. Overall, the implementation has area A e 0(N log2 N) and time T e 0(VN loglogN) for the routing which asymptot-ically dominates the time for add-multiply operations. This establishes the main result of AT2a € 0{Na+1 log2 AT(log log N)2a) for the iV-element DFT. The mesh sorting implementation is based on bitonic sort, and requires a "shuffled" indexing scheme. This shuffling is done by taking the row-major n-bit index of a cell and perfectly shuffling the high n/2 bits with the low n/2 bits. Here, the cells are capable of routing inputs and performing compare-and-swap operations and have 0(log 2 N) area, where N is the number of input elements. Like the DFT mesh, the total number of routing steps is bounded by 0(y/~N), the time for a routing between adjacent cells is 0(log log N), and the total routing time asymptotically dominates the time for compare-and-swap opera-tions. Interestingly, the AT2a complexity is the same as that for DFT on a mesh. Sorting and DFT are also implemented on a shuffle-exchange interconnect, resulting in AT2a com-plexities of 0{N2 l o g 6 a ~ 1 / 2 N) and 0{N2 l o g 4 a _ 1 / 2 N), respectively. Both of these are a poly-log factor greater than the upper bounds established by the mesh interconnect when a = 1, and a fractional power of iV greater for all a < 1. 16 Thompson also makes assertions about energy dissipation of sorting and DFT. En-ergy E is defined as equal to AT, making the assumption that all capacitors are charged or discharged each unit of time. The lower bound proofs based on the number of bits crossing a bisection of the chip inherently bound the wire activeness across this bisection, and thus the lower bounds of AT may be used for energy, giving E = AT € n(N3/2 log AT). The connection of this result to our models is discussed in Section 2.5. 2.1.3 An Area-Time Optimal Multiplier Using a straightforward area-time model, the multiplication algorithm of Preparata [49] achieves the lower bound of AT2a £ Q(Na+1) on A -^bit factors established by Brent and Kung [9]. This algorithm relies on a specific DFT mesh arrangement of n cells where the row-major indices undergo a bit reversal and perfect shuffle permutations. The resulting mesh has the property that all multiply-add operations use a root of unity power that is the square and possibly the negation of the root of unity power of the previous multiply-add operation. Multiply-add cells are implemented in a pipelined fashion of shifting and adding using only O(s) area, where s is the number of bits to represent DFT operands, i.e. s = [logp], where arithmetic is over the field of modulus p (where p is prime). These cells multiply in 0(s2) time and exchange operands with neighboring cells in 0(y/s) time. This mesh arrangement only requires 0(y/n) exchange steps which dominates the time complexity when s is small compared to n. Thus, the total time is 0{\/ns) and the total area is 0(ns) for the DFT. This DFT computation is harnessed for multiplication of A -^bit factors by convolution, where N = s'n, i.e. they are divided into n words of s' bits. It is established that a suitable p always exists that is large enough and has a primitive root of ordern [14], and furthermore that s = [logp] £ O(logn) if s' is chosen as log n [35]. This assures that s is sufficiently small, and the convolution proceeds with the same time com-plexity as the DFT. The final multiplication step of adding up carry bits of the overlapping output words of the convolution is done by treating the mesh as a carry-lookahead adder; this has the same time complexity as the DFT and the main result follows. 17 This implementation is of special interest in our energy-aware application. In Chap-ter 6, we demonstrate that this arrangement is easily described in the grid model, and that it is optimal in an energy-time sense. 2.2 Energy-Time Models To the best of our knowledge, no one has examined the implications of energy-time trade-offs for algorithm design considered here. Martin's work [38] comes closest when he ex-amines an "energy complexity of computation." His treatment considers the trade-offs for various forms of parallel and sequential decompositions of operations. While Martin's work provides a clear motivation for considering the energy optimal implementations, it does not examine the fundamental energy-time limits of particular computational tasks (such as sorting or addition). This section gives a general overview of the study of energy-time performance, and then focus on some specific work. The trade-offs between voltage, energy and delay have been understood at least since Hoeneisen and Mead's paper in 1972 [25]. More recently, various researchers have promoted the use of the ET [22] and ET2 [38, 41] metrics for modelling energy-time trade-offs in CMOS logic circuits. Dynamic voltage and frequency scaling is standard practice in the design of CPUs for portable [21] and embedded [12] computers and is now being applied to processors for desktop and server processors as well [28, 36]. Recently, Williams et al. [57] examined energy-performance trade-offs by comparing the IBM Cell processor [24] with several other processors. They reported that the architectural features of the Cell, especially its support for on-chip, parallel computation, provided advantages of factors of 20-50 when compared with existing microprocessors and supercomputers by a measure of GFlops/Joule. 18 2.2.1 The ET2 Model of Martin Martin posed questions regarding the connection between energy-time trade-offs and algo-rithms [38]2. However, his treatment is quite general and is not applied to specific algo-rithms or problems. Algorithms are modelled with directed acyclic graphs, similar to the task graphs considered in Section 3.1. However, unlike the task graphs, there is no trade-off available at individual tasks. Instead, energy is defined as the total number of vertices and time as the longest path. The trade-off is realized by altering the graph to introduce greater concurrency and redundancy, which shortens the longest path but increases the number of vertices. He notes as I have that a first order approximation of the proportionality of energy and time to voltage ignoring other factors such as leakage currents demonstrates that ET2 is an appropriate function to evaluate, as it is invariant to changes in voltage. Some empirical data supporting this from specific processors is presented. This trade-off view is applied to the problem of determining the optimal pipeline cycle time with a communication overhead between stages. Martin and coauthors expand on these ideas in a more comprehensive paper [39]. The ET2 trade-off is optimized for algorithms that run m independent and parallel tasks, and for algorithms splitting the computation into two parallel tasks with split and merge costs, in both cases under uniform voltage scaling. This analysis offers some insight into the benefit of parallelism with respect to energy-time efficiency with estimated split and merge costs. A study of transistor sizing is done to optimize ET2. A simplistic model for transistors is used where energy and time are traded per transistor by choice of gate capacitance; parasitic wire capacitance is also a contributing factor but remains fixed. They conclude that under uniform transistor sizing, transistors lying on a critical cycle have gate capacitance that is three times the parasitic wire capacitance for minimal ET2. This result is generalized to optimize ETn for arbitrary n. The analysis gives way to theoretical lower bounds each of energy and time independently by way of aggressive transistor sizing; these values are compared with the time and energy quantities optimizing ET2. Further analysis 2Martin uses t to denote time; I use T for consistency 19 extends transistor sizing to the case of many circuits operating in parallel with the same deadline. Finally, optimal ET2 is computed when composing two systems A and B under independent voltage scaling, with @A = EAT\ and 6 S = EBTB, respectively, into se-quential or parallel computation. They conclude that optimal ET2 of the parallel composed system is QA + <3>B, and that of the sequentially composed system is ( v / © A + s/Qs) 3- Both of these results are consistent with compositions in our models. This is especially clear in the optimal schedule assignment of the bubble sort algorithm presented in Subsection 4 .1 .1. 2.2.2 Other Energy-Time Work Gonzalez and Horowitz investigated the effects of architecture level decisions on the energy-time product [22]. This metric however, is chosen somewhat arbitrarily as simply a function monotonic in both energy and time. They note that the product is roughly equal across a set of six processors. They conclude that in an ideal model, pipelining instruction execution has a large effect in improving the energy-time product, but other features (or lack thereof) have little effect. This includes superscalar execution. Furthermore, they suggest that superscalar execution is not a desirable design to minimize the energy-time product in a more realistic (non-ideal) model. They point out that a large percentage of energy is dissipated through the clock distribution and on-chip memory, and thus optimizations of combinational logic will only yield small returns. Horowitz recently co-authored a paper concerning the energy efficiency of adder designs in an attempt to characterize the trade-off inherent to addition of 32-bit integers [46]. Given a target delay, they seek to minimize the energy by accurately modelling for a specific process technology (90nm) the power supply voltage, threshold voltages, leakage currents, capacitance, transistor sizing, process variation and other con-tributing factors. Topologies are classified according radix, logic depth, fanout and wiring tracks, which are expected to be the main parameters affecting the trade-off. The designs of Brent-Kung, Kogge-Stone and Sklansky are determined to effectively cover this parameter space, with Sklansky having the lowest curve for the energy-delay product. Horowitz also recently co-authored a paper [37] further discussing the energy-time trade-offs available at 20 different levels of chip abstraction. This involves optimizing an energy-delay product at the system architecture level, the micro-architecture level and the circuit level. 2.3 Energy-Aware Scheduling Although the work on energy-time trade-offs for algorithms is scarce, the opportunity to ex-ploit such trade-offs for scheduling has attracted significant attention starting with Weiser's 1994 paper [56]. Recent work in energy-aware scheduling includes [2, 4, 5, 11]. Much of this work assumes power = speed13 for some (3 > 1 which is equivalent to assuming that eta is constant with a = (3—1 > 0 as in our models. Prior work in energy-aware scheduling has generally assumed a single processor and ignored the energy costs of communication. In contrast, our models account for the energy required for communication, and it is these costs that provide a limit on the optimal degree of parallelism for a given problem. Weiser's paper, motivated by conservation of laptop battery power, asks how to re-duce the clock speed via voltage scaling the execution of tasks to save power while limiting the performance degradation. Two key simplifying assumptions are made: first, that energy per clock cycle is proportional to voltage squared, and second, that the clock speed can be adjusted without interrupting the processor operation at fixed intervals based on the pre-vious interval's workload. With this scaling assumption, energy can be saved by running at half the speed for twice the time. A heuristic algorithm PAST is proposed that adjusts the clock frequency at the end of each interval such that any left over work of the current interval plus the predicted work (estimated to be the same as the workload of the current interval) will be completed exactly by the end of the next interval. The effectiveness is largely dependent on the choice of interval length calibrated for "typical" workloads, which is experimentally determined. Govil et al. [23] expand on Weiser's work by proposing a variety of heuristic algo-rithms for the same problem. As they point out, PAST depends on the interval length to decide how often the speed can be changed, the acceptable amount of delay, and how many past intervals are considered. They give many algorithms that break this dependence, many 21 with parameterized values for these decisions, but some also depend on empirically decided fixed values. For example, the algorithm AGED_AVERAGES predicts the coming work-load from a weighted average of previous workloads. This depends on a parameter that is the geometric weighting of workloads viewed backwards in time, where the workloads are sampled every 0.01 seconds, an empirical fixed value. While these algorithms do indeed expand on Weiser's work and are interesting to compare, they remain based upon heuristics and are empirically validated. The remainder of this section covers work that treats the scheduling problem for-mally while proposing a variety of models. The majority of these models assume single-processor scheduling, and attempt to minimize some function of energy and time over all workloads. Here, a workload is a set of jobs, each of which is identified by release time, the number of cycles needed and possibly a deadline. In many cases, scheduling is performed online, where jobs are not known to the algorithm until they are released. Online algorithms are often measured by comparing them against the optimal offline algorithm which has the benefit of perfect knowledge. In several models the optimal offline algorithm is demon-strated to be NP-hard to compute. I characterize these formal models by four properties: 1. What is the energy-time trade-off function and constraints thereof? 2. Is preemption of jobs allowed? 3. Do jobs have deadlines? 4. Is the scheduling goal an optimization problem or a decision problem (or both)? I do not include online versus offline as a property of the model, because I view this as a problem addressed within the model. Often both cases are considered as explained above. I begin with the earliest such model. 2.3.1 The Model of Yao et al. Yao et al. were the first to formalize an energy-aware scheduling model [59]. This model is formulated using the properties above as follows: 22 1. The power consumption3 is P(s) = sa where s is the speed and a > 2. 2. Preemption of jobs is allowed. 3. Jobs have deadlines. 4. Minimize E(S), the total energy consumption over all deadline-feasible schedules S. An offline algorithm is given that is provably optimal through repeated identifica-tion of "critical intervals", that is, intervals where a group of jobs must be executed at high speed in any optimal schedule. These "critical groups" are identified by finding the time interval containing the most job intervals (that is, the time between a job release and dead-line) that maximizes the total workload of the jobs with contained intervals divided by the interval length. Critical intervals are iteratively found and scheduled to construct the opti-mal schedule. A heuristic online algorithm is described and proven to be within a factor of n of optimal where n is at least aa but not greater than 2a~laa. 2.3.2 The Model of Albers and Fujiwara Albers and Fujiwara [1] consider a much different model than that of Yao et al. The main difference is that the function to minimize depends on the flow time: the time between job release and completion, summed over all jobs. The model is summarized as: 1. The power consumption is P{s) = sa where s is the speed and a > 1. 2. No preemption of jobs is allowed. 3. Jobs have no deadlines. 4. Minimize E(S) + F(S), the sum of the total energy consumption the flow time. Note that although the function E(S) + F(S) is monotonic in time and energy and allows for a trade-off between them, there is no justification for considering this particular sum, 3 Yao et al. use p as the exponent; I use a for consistency. 23 and furthermore seems to be susceptible to abuse by choice of the magnitude of the units for energy and time. In general, they prove that any online scheduling algorithm is r2(n1_1/a)-competitive with the optimal schedule, where n is the number of jobs. Therefore, they consider the spe-cial case where all jobs take unit amount of work. Under this simplifying restriction of job sets, they propose an online algorithm that iteratively schedules in "balanced" phases called Phasebal, which operates as follows: run each of the n\ jobs released at time 0 at speeds (m/c)1/", ((m - l)/c)lla, ((m - 2 ) / c ) 1 / ° , ( l / c ) 1 / 0 . Then, the n2 jobs released dur-ing execution of the first phase of n\ jobs execute at speeds in a similar manner, and so forth. Here, c is a constant depending on a: The main theorem proves that this algorithm is a factor /(a) within optimal. This function /(a) is upper bounded by (8.22)e(l + <5)a where $ = (1 + \/5)/2 is the golden ratio. They also show that through dynamic programming, the optimal offline algorithm can be computed in polynomial time for unit jobs. 2.3.3 Other Scheduling Models There are a number of other papers addressing this general problem, each with specific models and assumptions. I cover a few more in brief. Bansal et al. [4] prove that the proposed online algorithm of Yao et al. is in fact within a factor of aa of optimal, where P(s) = sa is the power-speed trade-off function. They furthermore propose a new online algorithm that is at most a 2(aj(a — l))aea factor within optimal under the same model. Then, they adapt the model of Yao et al. to account for temperature using Fourier's law of heat conduction. They study both the problems of minimizing the maximum temperature during execution, and scheduling with a fixed maximum temperature constraint. 24 Chen et al. give another model where jobs may have a precedent constraint that enforces a partial ordering on job execution: 1. Speeds are chosen from a finite set R. The power consumption is <j>(r) forrER where <j>(r)/r is strictly increasing. 2. No preemption of jobs is allowed. 3. Jobs have deadlines. 4. Minimize total energy consumption over all deadline- and precedence-feasible sched-ules. Even the very specific case of all jobs having the same release time and deadline, no precedence constraints and \R\ — 2, optimal offline scheduling is NP-complete by a reduction from the SUBSET SUM problem; an approximation algorithm is given that runs in polynomial time for this case. Bunde also proposes a very general model where quality schedules are those that minimize flow time (as defined above) or makespan, the time of the final job's comple-tion [11]. He studies the problems of either finding the best quality schedule with a fixed energy budget (coined as the "laptop problem") or to minimize energy consumption with predetermined schedule quality (the "server problem"). Baptiste addresses the problem of scheduling where an idle low power sleep state is available, but switching to or from this state consumes energy [5]. The multiprocessor scheduling problem is modeled by Albers et al. [2] which essentially adapts the model of Yao et al. to a parallel processor setting. This paper is uniquely relevant to modern computing on multicore processors in contrast to the majority of other work in this area which assumes a single processor. 2.4 Other Related Work There are a few other papers worth noting that do not fit in the categories above. I describe two well known alternative models of computation, an asymptotically optimal mesh sorting 25 algorithm and finally some points regarding theoretical limits of energy dissipation for any method of computation. The PRAM model of Fortune and Wyllie [17] is an often cited theoretical model of parallel computation. A generalization of Turing machines, an unlimited number proces-sors (tapes) may execute in parallel, each with unbounded local memory, an accumulator register, a finite program, and unbounded global (shared) memory. Computation begins with a single processor, but others are spawned with a "fork" command. The correspon-dence between the PRAM model and regular Turing machines is covered in their paper. The widely used logP model [13] tries to account for the engineering practicalities of building parallel processors and includes parameters to model the bandwidth, latency and overhead of communication as well as limits due to a finite number of processors. Each of these parameters is a fixed constant, and determining their values is very much an experimental problem. Schnorr and Shamir give a time optimal sorting algorithm on a mesh of "proces-sors", each storing an input and capable of comparing and swapping their contents in single time unit [51]. Their algorithm is quite involved, and operates in seven distinct phases that complete in 3\/N + 0(./V 3/ 8) time units with the N mesh elements in sorted "snake-order", which is proven optimal in the high order term in this model. The algorithm is central to our work with sorting and is adapted and analyzed in Sections 4.2 and 4.4. A related question to our consideration of energy-time trade-offs asks what are the inherent energy costs associated with any method of computation? This and related questions are the topic of reversible and adiabatic computing. This literature shows that theoretically, no energy is necessarily consumed during a computation, as energy is only necessarily consumed upon the erasure of information [18]. Logic gates that are sufficient the compute any function have been proposed that conserve all information in the form of extra outputs of "garbage bits", such as the model of Fredkin and Toffoli [19]. However, the practicality of zero-energy computation has yet to be realized. Often described incarna-tions of reversible computing are mechanical so-called billiard ball computers, composed 26 a series of walls that deflect fired hard spheres or balls that represent inputs, the collisions of which model logic gates. While the notion of such computation is possible in principle, the nature of such collisions quickly propagates even tiny timing or positioning errors, to the point of the gravity of nearby stars rendering the computation meaningless after tens of collisions [7]. There are proposals to design CMOS gates that preserve reversibility and attempt to conserve and reuse charged capacitances [3, 18]. However, Indermaur and Horowitz have given evidence towards the impracticality of such attempts, as better energy performance is achieved through simple voltage scaling [26]. 2.5 Summary Here, I point out the distinct nature of out work compared with the closest related research, covered in this chapter. First and foremost, I generally compare the models of Chapter 3 to the work of Section 2.1 and then specifically discuss the energy bounds of Thompson [55]. The related work covered in Section 2.1 mostly took place during the early 1980s. The concern of aggressively utilizing chip area is indicative of this time period where LSI gave way to VLSI and chip area was still at a premium. This is in contrast to modern VLSI of today which has several orders of magnitude greater density. As energy dissipation is now the major constraint in chip design, we regard area as a secondary concern and focus on energy and delay. Although our energy-time trade-off models differ significantly from the area-time models of Section 2.1, the algorithmic and layout ideas are indeed a useful and significant part of this thesis. Algorithms analyzed in area-time models have directly applicable results to the grid model I present, in the form of establishing upper bounds (see Lemma 3.2.1). However, the models of this section are clearly distinct from ours in that these models favor computational density, or keeping all components busy at all times. Good implementations in our models do not necessarily focus on this, as computational elements may be "turned off" at any time. The fact that our models do not charge for area implies that parallelism is likely to play a greater role. Finally, the dynamic nature of the energy-time trade-off is a clear departure from the fixed speed operations of area-time 27 models. Thompson's energy bounds for sorting or DFT are not applicable in our models for a number of reasons. Most obviously, his energy bound assumes that no energy-time trade-off is available as our models do. There are differences in our models beyond this, however, including the energy and delay charges of wires. Also, the problem definitions for sorting differ: Thompson's lower bounds rely on the assumption that M = N1+e, where M is the number of possible input values. Initially, I make no assumptions on the number of input values for sorting and treat them as words of unspecified size. Then the word size is parameterized with w where logiV <.w < N1"^ where e' > 0, i.e. the number of input values is N < 2W < 2N, but in this version of the sorting problem the input size is considered to be Nw instead of N. Thus, the input words for Thompson's sorting problem are represented with 0(log N) bits, but our bit-wise sorting bounds allow a much greater range of possible input values. Literature that considers the energy costs of computation differs significantly from our approach. The energy-time performance considerations of Martin [38, 39] do not study specific problems. Other work of Section 2.2 is overly specific [22, 46] as compared with our approach in that they consider the detailed factors associated circuit implementations in particularly technologies. The various energy-aware scheduling algorithms of Section 2.3 essentially model a problem at the operating systems level. The online problem is quali-tatively different from the problems we address in that the input is dynamic in time. Fur-thermore, little of this research considers the parallel multiprocessor case. The scheduling research that did examine multiprocessors [2] assumes jobs may be assigned to any proces-sor, which is in some sense free communication. The other referenced work covered is quite loosely related to our models. In the logP model, the cost parameters are fixed constants; there is no sense of "distance" for communication nor does the logP model energy-time trade-offs. The PRAM model also does not include communication costs with memory. It does in some very loose sense model interprocessor communication in that a single time unit on an executing processor 28 must be used to spawn another processor's execution, via the fork instruction. Repeated use of fork can then double the number of executing processors per time unit. Simplistically, the PRAM and logP models take the processors as a given; in contrast, our model reflects the key, physical trade-offs that are present in the silicon chip and thus in the algorithms that such chips execute. Finally, adiabatic reversible computing is an interesting generalization of the energy associated with computing, but asks very general questions that have eluded computer scientists and physicists for some time. Our work, on the other hand, deals with more specific questions but remains applicable to present computing methods. 29 Chapter 3 The Model This chapter presents models for energy-time trade-offs in algorithm design and analysis. The goal is to capture the essential trade-offs and constraints of CMOS technology with a simplicity that enables simple analysis and is independent of technology details that are likely to change between successive technology generations. Processors with more than one-billion transistors are now in production [42], and even denser designs are expected in future fabrication generations. We model this large device count by assuming the availabil-ity of unlimited parallelism. However, communication overhead limits the exploitation of parallelism; in fact, wire delays and energy increase relative to gate delays under uniform shrinking of all dimensions. Integrated circuits are essentially planar, and the energy and time for communication reflects this two-dimensional topology. Thus, our models include costs for communication that increase with distance assuming a planar embedding of the computation. Finally, computation or communication operations can be performed faster by using more energy, or slower by using less. Sections 3.1, 3.2 and 3.3 flesh-out the details of our models. Section 3.4 explains the methods of proof, with Section 3.5 giving a short example. Section 3.6 concludes the chapter by discussing limitations of the models. This thesis emphasizes asymptotic (i.e. big-O) results. This avoids introducing constants that at-tempt to characterize current fabrication technologies and leads to simpler derivations and conclusions that should continue to apply in future technologies. On the other hand, con-30 stant factors do matter in practice, and we plan to refine the models to explore these issues in future research. The following sections present the models: a simplified model for deriving lower bounds, and a detailed model for analysing specific algorithms to establish upper bounds. These differ in how they reflect the two-dimensional nature of integrated circuits. The assumptions and definitions that are common to both models are presented first. As this common model does not include a metric for communication distance, it is referred to it as a "metricless" model. 3.1 Metricless Model Computations are modelled as being performed by an ensemble of communicating process-ing elements (PEs). A PE has 0(1) input bits, output bits and internal state bits. PEs can represent common hardware components such as flip-flops, constant size combinational circuits and wires. We distinguish between PEs that have state bits and/or perform some meaningful computation, called computation PEs, and those which simply pass input values to outputs in a fixed manner, called wire PEs. For simplicity, we assume that the computation being performed can be represented by an acyclic task graph [20, p. 50], (V, .4.) where vertices represent operations and arcs represent communication. Vertices u £ V with deg+(v) = 0 are called inputs; those with deg~(v) = 0 are called outputs, where deg+(u) and deg~(t>) are the in-degree and out-degree of v, respectively. Let function P map each vertex v e V to PE P(v), understood to be a PE implementing the operation v. If v% and Vj are vertices of (V, .4), we write vi < Vj iff there is a directed path from Vi to Vj. The same PE can perform many different tasks as long as these operations are performed at different times. We impose a simple requirement to ensure this temporal separation of operations by the same PE: if Vi and Vj map to the same PE (i.e. P(v{) = P(VJ)), then either Vi -< Vj or Vj -< Vi must hold. Thus, P can be many-to-one. The mapping, P, provides individual PEs with the next task to execute in the case 31 where multiple vertices are mapped to one PE. This "magic" knowledge implies that in cases where one task is repeated a number of times before performing a different task, the PE is able to switch tasks at the appropriate moment without counting tasks. This is an unrealistic assumption when a task is to be repeated a number of times depending on N. However, we believe that if this model were altered so that the next task were an input to PEs and not given for free, the cost of operating counters and control signals would be subsumed by other costs in the implementations of algorithms presented in this thesis. I will point out some particular implementations that are simplified by this assumption. Let r : V —> E + map vertices v to their duration T(V), the amount of time that elapses from when the last predecessor of v completes its operation until v completes. Note that a particular PE, p, may have different durations for distinct operations in P~1(p). Again, "magic" knowledge is provided to PEs through r as the duration for each operation is assumed to be automatically known. All input vertices vin are assumed to have an identical "start time" with respect to r, and each P(vin) initially holds the input value associated with Vin. Thus, all such P(u;n), called input PEs, are distinct from each other. Likewise, each P(vout) where vout is an output vertex stores the value associated with vout when all tasks are complete, and all P(vout), called output PEs, are distinct. We refer to the function r as a schedule for the computation. We extend r to apply to paths: if w = ViltVi2,..., Vik is a path in (V, A), then we define k t(«>) = X^ rK)-i=i We can now define the total time for a computation: T(V,A,r) = max T(W) . (3.1) zugpaths of (V,A) In other words, T(V, A, r) is the last time at which an operation completes given schedule T. Energy-time trade-offs are incorporated by assuming that there are positive con-stants, a, c > 0, such that the energy to perform operation v is given by e(v) = cT(v)~a, 32 i.e., e(v)T(v)a is constant. For simplicity, a is the same for all operations, and in keeping emphasis on big-O analysis, I assume throughout that c = 1. Subsection 4.1.3 examines variations of the model that allow for discounted communication, in particular where wire PEs trade off more favorably than computation PEs. It is important to point out that r must respect the bounded memory of PEs. We cannot allow schedules that queue up more than a constant number of outputs that are waiting to be consumed by another PE, which could occur under certain choices of P and r. In order to remedy this difficulty, we propose an augmented task graph with extra arcs that prohibit such violations. Define Vi -<p Vj iff -< Vj and P(i»i) = P(VJ); in other words, Vi -<p Vj iff vi and Vj are tasks that are performed with the same PE and Vi is performed before Vj. By the requirement that all tasks assigned to a particular PE must be ordered, -<p is a partial order on V that totally orders tasks mapped to the same PE. Thus, define nextp(vi,Vj) = (vi -<P VJ)A /Bvk G V : (vi -<P vk) A (vk -<p Vj). Given a task graph, (V, A), we create an augmented task graph G' — (V, A'), where A' = AU{(vj,vk) G V 2 : (3uj G V : G A) AnextP(vi,vk))} That is, if tasks Vi and vk are performed consecutively on the same PE, then we add an arc to the task graph from each task that is an immediate successor of Vi to vk. Now, define the total time to execute a computation on an ensemble of PEs as the duration of its augmented task graph. All task graphs that I use in this thesis will remain acyclic after this augmentation, and it is my view that most task graphs of interest also have this property. For suppose there exists (vi, Vj) G A such that 3vk G V : (vi -<p vk) A (vk -< Vj). In this case, Vj cannot be performed until both Vi and vk are complete, and therefore these vertices may be merged into one task, as they are both mapped to the same PE. 33 Formally, the total time for computation actually depends on the augmented task graph V', and is analogous to (3.1): r) = max loGpaths of (V,A') T(W) . Fortunately, none of the schedules we consider throughout yield a different total computation time on these two task graphs, i.e. T(V,A, r) = T(V,A',T) for all (V,A) and associated r. Thus from now on there is no mention of this augmentation, but I note of its importance to ensure feasible hardware implementations. The total energy for a computation is Given a task graph, (V,^4), we find a schedule r that minimizes £(V, T ) T ( V , A, r ) a (or simply ETa). Such schedules are referred to as optimal; the big-0 complexity of the resulting ETa is referred to as the optimal ETa complexity. As is outlined in Section 2.5, the best function for modelling the trade-off between energy and time is debated, and thus we avoid choosing the "right" exponent by using variable a > 0. The same exponent is used for eta and ETa to ensure that total algorithm performance ETa is invariant under uniform energy-time scaling of all components with trade-off eta. The energy cost, time cost, or simply "cost" refers to the total energy E, total time T, and ETa, respectively, for PE(s) with a schedule clear from context. I sometimes make a distinction between the cost incurred by computation PEs versus that of wire PEs for analysis purposes. Observe that the minimum ETa for any given task graph can be achieved for any time, T allotted to complete the computation, as long as T > 0: Lemma 3.1.1 [Scaling] Let TQ be a schedule for a task graph, (V, A), and let To = T(V,A,Tq). LetTi > 0 and let TI(V) = ^T0(V),VV G V. Then, (3.2) S(V,n)T(V,A,n) a = £(V,To)T(V,A,T0)A, and thus TQ is optimal iffr\ is optimal. 34 Proof: The claims follow directly from the definitions of £ and T . • Lemma 3.1.1 provides a convenient way to generalize from a specific schedule to establishing a bound for an algorithm. Given some task graph, let be a schedule for that task graph that achieves ETa e 0(f(N)), where N is the size of the input to the computation. Then for any T" and function / : Z —> K + , TJV can be scaled to achieve an energy E' with E'T'a e 0(f(N)). Thus, a general conclusion is drawn from a specific schedule. Without loss of generality, we may assume that the number of input vertices is equal to TV. Let irn (outi) to denote the ith input (output) vertex v or P(v). This notation is overloaded to refer to inputs and outputs of PEs. 3.2 Grid Model The grid model refines the metricless model by requiring specific placements for PEs. This model is more restrictive than the gridless model that we present in Section 3.3. The grid model is used to establish upper bounds for the complexities of problems and algorithms. In the grid model, each PE occupies a unit square in the plane. No two PEs can occupy the same square. Thus, a PE can be identified by the coordinates of its lower-left corner, e I?. Ifpi is a PE at location (ij, ji), mdp2 isaPEat (12,32), thenpi andp2 can connect inputs to outputs of each other iff | |(ii, j i) — (*2, J/2) || 1 = 1- Communication over longer distances then can be achieved with "wires" which we model with chains of wire PEs. Note that a wire crossing can be modeled by a wire PE with two inputs, in\ and in 2 , and two outputs out\ and out2 where the PE computes the identity function to generate outi from in\, out2 from m 2- The specification of PEs in the grid model corresponding to a task graph is called a grid model implementation, or simply implementation. Each operation of each PE is performed using energy e and time t with eta = 1. Define transmission of information from PE p\ to PE P2 as the change of some input(s) to p2 as a result of an earlier change on some output(s) of p\. Note that this always refers to information described by 0(1) bits. We will simply use the term "transmission" 35 (or "transmit") when the information is clear from context. Let p\ and P2 be distinct PEs at locations j\) and (12,32)- Let d = ||(n, j i) - (22, J2)||i- If Pi transmits information to P2, then this information traverses at least d - 1 intermediate PEs. If pi and each of these intermediate PEs take unit time to transmit the information to the next PE, then the transmission uses d units of time and d units of energy, for a total1 ETa of da+1. This is optimal, as shown by the following lemma: Lemma 3.2.1 [Wires] Let p\ and P2 be two PEs in an implementation where p\ and P2 are separated by Manhattan distance d\2- Let E\2 and T\2 be the total energy and time to transmit 0(1) bits of information from p\ to p2- Then E12T12 > df^1-Proof: Consider the case where 0(1) bits of information are sent from p\ to P2 via m other PEs, si ...sm. Because po and p\ are separated by Manhattan distance d\2, m > d\2 — 1. Let to be the time to transmit from p\ to s\ \ for 1 < j < m — 1, let t, be the time to transmit from s, to Sj+i, and let tm be the time to transmit from sm to P2- We now have: m m E\2 = Y^tJ01' T u = 3=0 j=0 Taking partial derivatives with respect to U for fixed i < m yields —E12T& = 0 ^ U = a+VT12/E12 • Thus, E12T12 is minimized when all of the tfs are the same. It is straightforward to show that this is a minimum by taking the second partial derivative wrt. U with tj = a+yjTi2/Ei2 for all 0 < j < m. At this minimum, E12 = (m + l)t~a and T12 = (m + l)tj, thus Ei2T?2 = (m + l ) a + 1 . The claim Ei2Tf2 > d^1 is now established because m > d\2 — 1. • Although this present work and the earlier work on AT2 models have different optimization goals, it is possible in some cases to convert an ATa+1 upper bound to an 'in this context, E and T respectively denote the total energy and time for only this transmission. 36 upper bound for ETa in the grid model as shown by the following lemma. An algorithm uses nearest neighbor communication if its organization can clearly be abstracted into PEs in the grid model. In particular, PEs in the abstraction only communicate a constant distance using energy e and time t satisfying eta — 1. Lemma 3.2.2 [Area-Time Complexity] If an algorithm uses only nearest neighbor com-munication and has an area-time complexity ATa+1 6 0(f(N)) then ETa e 0(f(N)). Furthermore, the schedule T achieving this ETa complexity is T(V) — e(v) = 1 for all vertices v in the associated task graph. Proof: Allot unit time per operation. Because the algorithm only uses nearest neighbor communication, the total time T for completion of the algorithm in the ATa+1 model is equal to the completion time of the corresponding grid model implementation. Fur-thermore, each PE performs at most one operation per unit time; thus, E is at most AT. Therefore, ETa < ATa+1 e 0{f(N)). • Lemma 3.2.2 shows that known AT3 results may be used to establish ET2 bounds. Here we note the absence of explicit AT3 results in the literature. Fortunately, many AT2 arguments are robust to changes in the time exponent and establish the AT3 bounds that we need. 3.3 Gridless Model While the geometric detail of the grid model ensures the existence of a planar implemen-tation, it is overly restrictive for lower-bound arguments. In particular, it requires PEs to be square and for PEs and wires to be arranged on a rectilinear grid. The gridless model removes these restrictions. Computation PEs and wire PEs are distinguished in the gridless model. The task graphs for algorithms in the gridless model are bipartite where all arcs are between X = {v : P(v) is a computation PE} and W — {v : P(v) is a wire PE}. 37 Computation PEs may assume arbitrary, convex shapes, with the restriction of a constant area circumcircle. We assume a distance metric (e.g. Euclidean distance or Man-hattan distance) for computation PEs: let p,(pi,p2) be the Euclidean distance between the center points of some fixed inscribed circle of pi and of p2. Because p is a distance met-ric, the triangle inequality applies. We assume that computations are implemented in the plane, and model this assumption by requiring that any PE has at most d2 other PEs within distance d of itself. A wire PE has one input and one output. Let P(w) be a wire PE and let P(v\) and P(y2) be computation PEs such that (yi, w) and (w, v2) are arcs of the task graph. Let d = n(P(v\),P(v2)). PE P(w) may be thought of as conceptually equivalent to a chain of d wire PEs of length 1. By analogy with Lemma 3.2.1, each operation of w requires time t and energy e with eta = da+1. Formalizing these observations yields the axioms of our gridless model: 1. The computation is expressed by a bipartite task graph, ((X UW),A), where X D W = 0. Operations in X are implemented by computation PEs, and operations in W are implemented by wire PEs. 2. Each operation of a computation PE takes energy e and time t with eta = 1. 3. Let P(X) be the set of all computation PEs. There is a distance metric, p : P(X) x P(X) -> R, such that for all computation PEs, p 6 P{X), {q&P(X) : {q?p)A»(q,p)<d} <d2. 4. Each wire PE has one input and one output. 5. If vi,v2 G X, and w e W with (v\,w),(w,v2) G A, and p(P(vi), P(v2)) = d, then an operation of w takes energy e and time t with eta = da+l. Because the gridless model ignores issues of wiring congestion, it admits descrip-tions of computations that cannot be implemented in the plane. It is straightforward to show 38 that the axioms for our gridless model are strictly less restrictive than the constraints of the grid model. Thus, lower-bounds from the gridless model hold (to within constant factors) in the grid model, and upper-bounds from the grid model hold (to within constant factors) in the gridless model as well. 3.4 Proof Framework With a clear definition of the models stated, what can be proven with them? Stated con-cisely, the grid model is used to derive upper bounds for the costs of computations and the gridless model to derive lower bounds. However, it is important to understand what exactly we are finding bounds on, en route to finding tight asymptotic bounds. In this thesis, I consider solving specific problems, such as sorting. A problem is simply defined by its input to output mapping, and the type of the inputs and outputs, such as bits or words. We identify an algorithm that solves a problem by its associated equiv-alence class of task graphs. The equivalence class arises from task graphs with identical computation vertices, but flexibility of the length of all chains of communication vertices. This can be viewed as a many-to-one relation from the input size N to some task graph with N input vertices that produces the appropriate partial ordering of operations associ-ated with the algorithm. Finally, we use the function P to define PEs in the grid model to implement the task graph. Note that some task graphs containing high degree vertices may not be realizable under the mapping P to a valid grid model implementation. Thus we see a hierarchy of entities, where problems are the most general followed by algorithms, task graphs and then the grid model implementation, as shown in Figure 3.1. To understand the relationships within this hierarchy, one might ask how to find: 1. the "optimal" algorithm to solve a specific problem; 2. the "optimal" grid model implementation for a particular task graph; 3. the optimal schedule for a particular grid model embedding. 39 Problem Algorithm, i.e. Task Graph Equivalence Class i Specific Task Graph Grid Model Implementation * Schedule Figure 3.1: Hierarchy of entities Quotations are used for "optimal" in 1 and 2 since we have not yet defined these (optimal schedule is defined in Section 3.1). An optimal grid model implementation is an implementation that has a schedule that yields the optimal ETa complexity; an optimal algorithm is one such that there exists an optimal grid model implementation for it. With these definitions in place, it is easy to see that given a task graph and mapping P, any ETa calculated from some choice of T implicitly upper bounds both the optimal grid model implementation and the optimal algorithm for the problem. Furthermore, we will see that determining the optimal schedule is often relatively straightforward by employing partial derivatives to find the schedule that minimizes ETa. Directly demonstrating optimality for the first two problems described above seems to be difficult and is not explicitly addressed in this thesis. Instead, lower bounds for all three problems are established by applying the gridless model to the problem in question. This is accomplished by arguing on communication costs in the gridless model between the input and output PEs, regardless of placement and without assumptions about the al-to gorithm. If we can find an upper bound to 1, 2 and 3 through an explicit algorithm, grid model implementation and schedule that matches an established lower bound, then match-ing asymptotic bounds on all levels of the problem's hierarchy follow. Since this proof framework relies on the placement of input and output PEs, their placement may be arbitrarily restricted and the same framework applies. The only differ-ence is the explicit grid model implementation must follow the restriction, as with the lower bound communication cost calculation. The restriction considered for all problems is for all input PEs to lie along a line and all output PEs to also lie along a line. In both the grid and gridless models, the line can be viewed as one which passes through the centerpoints of a fixed inscribed circle of the region each PE occupies. This is a natural arrangement for input/output ports that lie on the periphery of chips or functional blocks, and it also lends itself to simple interconnection and pipelining. Based on this one-dimensional arrangement of the data, I refer to these as ID implementations. Lower bound proofs in the gridless model for these ID implementations assume only that a straight line can be drawn through the center points of the fixed inscribed circles of all input PEs and likewise with the out-put PEs; clearly this is less restrictive than ID grid model implementations. Sometimes implementations without the ID restriction are called 2D implementations. A useful fact in proving lower bounds in the ID case is given here. Lemma 3.4.1 The minimum value of n(a, b) where a and b are distinct PEs in a ID gridless model implementation is at least 1. Proof: Follows directly from Axiom 3. • In addition to proving tight bounds on problems, we also present analysis of some common algorithms for the sake of comparison. Rather than face the vexing task of deter-mining the optimal grid model implementation for an algorithm, I simply choose one that closely resembles the placement of hardware components in a diagram. This is called the direct grid model implementation, or simply direct implementation. We will see that this 41 reveals the advantage that traditionally "fast" algorithms have is largely attributed to free communication. 3.5 An Example: TV-input AND Gate I illustrate a simple application of the grid model to the problem of an AMnput AND Gate. For simplicity, consider only the ID restriction case. The problem is as follows. Given N input bits, ao, a i , a j v - i , compute their log-ical AND, Ai^Q 1 ai- F ° r simplicity of illustration I assume that each input PE contains Figure 3.2: 8-input AND gate exactly one input bit en and that N — 2n. Form a direct implementation of a binary tree arrangement of 2-input AND gates, as seen in Figure 3.2, and compute the ETa complexity when e = t = 1 for all PEs. One might initially conclude that because there are 0(log N) levels in the tree and 0(N) total 2-input AND gates, that these are estimates for T and E respectively and ETa € 0(N log" N). However, this is not only false given the implemen-tation and assumptions, but it is provable that any ID implementation solving this problem has ETa £ Q(Na+1), and thus no implementation exists with ETa e 0{N \oga N). The main reason why the T e 0(logA r) and E € O(N) analysis is faulty is that it ignores the cost associated with wires between the AND gates. A reasonable direct implementation 42 has each input bits at level i of the tree (where the inputs are at level 0) traversing about 2l wire PEs before reaching adjacency to a PE for the 2-input AND gate (modulo a small constant number arising from the placement choice of the 2-input AND gate PE). Ignor-ing the cost associated with such computation PEs, there are 2l wire PEs for each of the 2 n _ t inputs at level i for a total of 2~27=o The total time is computed as the sum of the wire length over all levels, i.e. Yn=o2i- Therefore, E G O(NlogN), T G 0{N) and ETa G 0(Na+1 log TV). It may seem surprising that we can do better by simply chaining together N — 1, two-input AND gates. Call the AND gates A\,A2,Aw-i, where the the inputs of A\ are ao and a 1 ; the inputs to Ai are a$ and the output of A4-1 and the output of AN-I is A i l o 1 ai- Under the assumption that e = t = 1 for all PEs, an obvious direct implementation has ETa G 0(Na+1). This result is contrary to general intuition where communication is ignored; an abstract diagram, analogous to Figure 3.2 for this arrange-ment would have a "time complexity" of O(N), as opposed to 0(log N) for Figure 3.2. 3.6 Model Limitations The simplicity and independence of all constant factors of the model limits its realism. Here I point out the main deficiencies in this regard. The models ignore the effect of threshold voltages. If V is the power supply voltage, simple arguments show that energy, e oc V2 (this ignores non-linearities in semiconductor capacitors) and time, t oc (y_v"tft)2 where Vth is the "threshold voltage" which is a constant depending on the semiconductor manufacturing process. In the limit that Vth —> 0, this yields that et2 is invariant under changes in V as in the models when a — 2. In practice, Vth is 20-30% of the power supply voltage V. By ignoring threshold voltage considera-tions, the models favor excessively fine grained parallelism at a very low operating voltage. Consideration of threshold voltages and leakage currents would discourage such low values of V. The grid model treats computation PEs and wire PEs the same. However, wire speeds do not scale with V in the same manner as logic gates. In typical designs, long 43 wires on chips are broken into shorter segments with inverters or buffers between the seg-ments; this is to limit wire delay which grows quadratically in its length. The delay of the buffers/inverters is inversely related to V as described above, but the wire delay is indepen-dent of V. Thus, wire delay is less sensitive to V than logic gate delay, while both have energy consumption that is quadratic in the V. Therefore, a should be lower for wires than for logic gates in an eta model. This is exactly the case in the 7-model of Subsection 4.1.3, briefly considered in sorting. The simple scaling of wire delay and energy with length neglects the technologi-cal discontinuities that occur at the boundaries of chips, circuit boards, or racks. At each boundary, there is a large increase in the relative cost of communication. Our models make sense for computations that are implemented on a single chip, but the results would need to be reconsidered for application to larger systems. Both gates and wires cannot be made ar-bitrarily fast by using more power (the speed of light, c, is an obvious hard limit). However, several orders of magnitude of tradeoffs are available through scaling opportunities from gate sizing and architectural choices [22, 37]. The continuous fashion of scaling allowed in our models do not match reality for most modern processors, where V is chosen from a small, discrete set of values. Chapter 7 discusses these issues in more detail and describes possible ares for fur-ther research. There I consider the challenges they present and the possibility of more realistic models. 44 Chapter 4 Sorting A sorting problem consists of N input values assumed to be unsigned integers. These values are initially placed in an arbitrary permutation in ./V predetermined input PEs. There are N PEs that are designated as the outputs. At completion, the input values must be stored in the output PEs in ascending order of the values for some predetermined ordering of the output locations. For the purposes of sorting, we begin by relaxing the restriction of PEs to having a constant number of inputs, outputs and storage bits to allow operations on entire words of unspecified size. This allows us to first focus on insight into the sorting problem, and avoid complicating the analysis with technical details. In Section 4.4 the problem is revisited in the "strict" model where PEs only operate on a constant number of bits. In Section 4.1, I analyse sorting algorithms based on sorting networks [29, chap. 5.3.4]. These ID implementations place PEs according to the locations of the comparators in typical drawings of these sorting networks and the input and output locations are each arranged along linear arrays of PEs at the left and right edges of the implementation region. Section 4.2 removes this restriction and allows the input and output PEs to be placed at arbitrary locations in the plane. I show that these 2D sorting implementations achieve lower energy-time costs than their ID counterparts. For simplicity, the analysis in Sections 4.1 and 4.2 assumes PEs that can operate on words and that a single operation on a word can be 45 performed with energy e and time t with eta = 1. Section 4.3 presents lower bounds based on the gridless model, both for PEs that operate on words on for the more restrictive model where PEs operate on bits. Section 4.4 continues with the strict definition of PEs by only allowing PEs that operate on 0(1) bits, and considering a word as a binary representation of an integer using w bits. There, I examine the impact of word size on the energy-time complexity of sorting. 4.1 ID Sorting A sorting network has AT inputs that are words of unspecified width that traverse a series of comparators leading to N outputs, such that any permutation of input values will be output in sorted order. As an example, Figure 4.1 shows a sorting network for bubble sort [29, Fig. 47, p. 224]. Each vertical line segment corresponds to a comparator, and each horizontal segment corresponds to a connection between comparators. Data is input at the left and is output at the right. A comparator receives a data value on each of its inputs and outputs the larger of the two values on its top output and the smaller value on its bottom output. The length of a comparator is the number of wires it spans minus 1. This manner of data flow implies that in direct implementations, the mapping from task graph vertices to PEs P is a one-to-one function, i.e. = 1 for every PEp. A sorting network exactly specifies a family of task graph equivalence classes, one for each value of N. Corresponding task graphs have d e g + (v) = d e g - (v) = 2 for all vertices v such that v is a comparator. 4.1.1 Bubble Sort Figure 4.1 [29, Fig. 47, p. 224] is a sorting network for bubble sort on 6 inputs. Because each comparator only spans 2 wires, a ID implementation is formed by replacing these comparators with comparator PEs with the same functionality, assuming that wire PEs are spaced unit length apart. The number of comparators in the sorting network for bubble sort is J^^1 * = 46 in(0)-in(1)-in(2)-in(3)-in(4)-in(5)-4 — T — Figure 4.1: Sorting netwoi OUt(O) •Out(1) OUt(2) OUt(3) OUt(4) OUt(5) k for bubble sort on 6 inputs Figure 4.2: The direct implementation of the sorting network for bubble sort into a grid of PEs. (N — l)N/2 as can be seen by considering the number of comparators in each row of Fig-ure 4.1. Note that the computation proceeds from left to right across Figure 4.2, and it is straightforward to show that all comparators in a vertical column, which necessarily operate on disjoint data, perform their operations at the same time. Analogously, the direct imple-mentation into PEs seen in Figure 4.2 have vertically aligned PEs operate with the same delay in an optimal schedule. Thus, I refer to the set of comparators (or the corresponding comparator PEs) in a vertical column as a phase. The sorting network in Figure 4.1 has 9 47 phases; in general, there are 2N — 3 phases. Figure 4.2 shows the resulting direct imple-mentation of the bubble sort algorithm considered. Dark boxes labeled C are comparator PEs, and the other boxes are wire PEs. As with the sorting network, there are O(N) wire PEs per input and 0(N2) in total. Because there are only a linear number of wire PEs per input, the time and energy cost of these wires may be included with the cost for the comparator PEs. This simplifies the analysis so the results above for the computation cost hold for the total cost. Recall from Section 3.1 that a schedule is called optimal if it minimizes the complexity of ETa, called the optimal ETa complexity. I now find the asymptotic behavior of delay and energy consumption for computation (the cost of comparator PEs), as the communication cost (the cost of wire PEs) is included in this. Let a schedule be called minimal if it minimizes ETa for fixed a. This is not to be confused with optimal schedules, defined to asymptotically minimize the function ETa. Lemma 4.1.1 [Phase Times] Let r be a minimal schedule for the ID direct implementation of the bubble sort algorithm. Then every comparator PE p in the same phase has the same value ofr(P~l(p)). Proof: Suppose not. Then let r(vij) and r(vik) be unequal durations of comparator task graph vertices Vij and Vik where j and k are different indices for comparator PEs in the same phase i where i is maximal. Phase % cannot be the final phase, for it only contains a single comparator PE. By the task graph for the sorting network for bubble sort, there is a vertex ug in minimal phase £ such that both v^j -< u? and Vik -< ug. Because every phase greater than i has every comparator PE within it having equal duration, the two inputs to P(ui) necessarily arrive at different times. Furthermore, the completion time T remains the same if the schedule is altered so that T'(vij) = T'(vik) = max {r(vij), riv^)}. Now one of P{vij) and P{vik) consumes less energy and the minimality of r is contradicted. • Let ti be the time cost for all comparator PEs in phase i, indexed from left to right starting 48 at 1. Also define = t~a and rrii as the energy cost per comparator PE and the number of comparators in phase i, respectively. Thus phase i takes time U and consumes m ^ i energy, where e i^f = 1. If ii = 1 and = 1 for all i, then T and £ grow as O(N) and 0{N2), respec-tively, and ETa e 0(A'' a + 2 ) . I will soon show that this is the optimal complexity for this implementation of bubble sort. First, consider the case where U is a fixed power of the number of comparators in phase i. Specifically, let U = mf for any real 6. Intuition may suggest that a value of 5 close to 1 (say, for the a — 2 case), taking time near linear in the number of comparator PEs would be a good schedule. However, Corollary 4.1.3 below will show that 5 > 1 is not asymptotically minimal. The analysis is simplified by assuming that there are only n — \N/2] phases, where phase i has i comparator PEs (this reduces the time and energy costs by roughly a factor of 4). This is justified by the observation that each rrii has a value in {1,2,..., [AT/2]}, and that at most 4 such rrii share the same value. It is reasonable to assume that an optimal schedule will have the property that phases i and j with m, = rrij will also have U = tj. Therefore, £ i = f 3 U < 4 £ " = 1 N o w > m * e * = a n d T a n d E are given to within constant factors by n T = 5 > (4-1) E = ^i/tf. i=l Theorem 4.1.2 The ID direct implementation of the bubble sort algorithm on N inputs with atf = 1 and U = is has ETa complexity as 49 0(Na^), 2/a < 5; 0(Na+2 log N), 8 = 2/a; 0(Na+2), - l < 5 < 2/a; 0{Na+2\ogaN), S = -l; 0{N 2-a5\ 5 < - 1 . Proof: From (4.1), T < 3 i T ( ( n + l ) * + 1 - l ) > E < ^ j ( ( n + l ) 2 - ^ - l ) , 6^-1; aS ^ 2, and a5 ^ 2, (4.2) proving the claim for 5 ^ -1 and 5 ^ 2/a. When 5 = — 1, E is bounded as in (4.2) and T < log (n + 1), thus ETa < 2 r ^ ( n 2 " a , 5 - l ) l o g a ( n + l ) . When 5 = 2/a, T is bounded as in (4.2) and (4.3) E < log (n + 1), thus ETa < w^log(n + l)((n + l)5+1 - 1) (4.4) • Note that the optimal asymptotic complexity of ETa G 0(Na+2) is achieved by choosing any 5 in — 1 < 5 < 2/a. In particular, the no voltage scaling case of 6 = 0 lies within this range. For illustration, Theorem 4.1.2 in the a = 2 case is stated in the following corollary. Corollary 4.1.3 For a = 2, the ID direct implementation of the bubble sort algorithm on N inputs with erf2 = 1 and tj = is has ET2 complexity as 50 0 ( i V 4 ) , \6\ < 1; , .0(NHog2N), 5 = -l; ET2 e < 0{N4logN), 5 = 1; 0{N2+2\S\), \5\ > 1. Although this follows directly from Theorem 4.1.2, it is instructive to consider (4.2) and (4.3) when a = 2 to see how the various terms arise. The bounds on T remain the same, and the bounds on E are: E < i ( ( A r + 1 ) 2 - 2 * 2-25 log (TV+ 1) 5 = 1. Evaluation of these bounds in the 4 cases of 5 values above gives Corollary 4.1.3. I now find the optimal schedule for the general case where ti can assume arbitrary values. Theorem 4.1.4 The ID direct implementation of the bubble sort algorithm on N inputs has an optimal schedule with ETa e 0(Na+2). Proof: By Lemma 4.1.1, each PE within the same phase has equal duration when the schedule is minimal (i.e., a particular optimal schedule). Let f be such a schedule, with associated total time T(f) = T and total energy £(f) =' E. Totals T and E can be expressed as in (4.1) where each U associated with f are denoted by ii. Differentiate ETa (where E and T depend on variable schedule r) with respect to tk for some 1 < k < n and evaluate at r = f: AETa i=l ST ) Ta + aETa~l = {j^jfa + aETa-1. 51 Thus, = 0 lr=f n a ~ 1 = Q & -Mrf = aE 4 v Therefore, ETa is minimized under schedule f with tj = a +^A|j f ° r all By Lemma 3.1.1 with scaling constant a+\p~,, an assignment of U = a+-v/i is optimal. Appealing to Theorem 4.1.2 with 5 — gives the result. • Theorem 4.1.4 relies on Theorem 4.1.2 to find the ETa complexity when U — a+yfi. Although the proof of Theorem 4.1.2 uses inequalities to upper bound (4.2), (4.3) and (4.4), the bound referenced by Theorem 4.1.4 is in fact tight. I avoid the tedium of explicitly proving this as it is implied in Section 4.3 where a matching lower bound of Q.{Na+2) for the ID sorting problem is proven. Moreover, there is no need to seek more efficient implementations for bubble sort beyond the simple direct one. It is worth noting that an assignment of U = 1 for all i (5 = 0 in Theorem 4.1.2) also gives the same optimal complexity. Recall that this optimality implies not only that improvement cannot be gained by considering other arrangements of PEs to perform bubble sort, but also that improvement cannot be gained by means of ID implementations of other algorithms, such as bitonic sort [29, p. 232]. It may seem surprising that the bubble sort algorithm is optimal for the ID sorting problem. For the sake of comparison, Subsection 4.1.2 analyses sorting networks for odd-even merge and shows that it also achieves ETa £ 0(Na+2). While odd-even merge performs fewer comparisons than bubble sort, it requires long chains of wire PEs in stages where the comparators span many wires. The total wire length in odd-even merge is a constant factor larger than that of bubble sort, and the energy and time for driving these long wires dominates the cost. More generally, fast sorting networks, such as odd-even merge, are "fast" because the traditional models for analyzing such networks do not take into account the cost of communication. These algorithms exploit this "free" 52 communication, but the advantage of the fast algorithms is reduced to only a constant factor in our models. 4.1.2 Odd-Even Merge Sort Knuth gives a construction for a general sorting network for odd-even merge sort [29, p. 224], which has comparators that span more than 2 wires, unlike the sorting network for bubble sort (see Figure 4.3). The sorting network resembles a standard software merge sort algorithm; pairs of single inputs are merged, then sorted lists of size 2, 4, etc. Merging two sorted lists of size 2 m , called a merge of size 2 m , requires 2 m _ 1 comparators of length 2m~i for i — 1 , 2 , m , where comparators of the same length belong in the same phase. This leads to a total of 0(N log2 N) comparators to sort N items. in(0)-in(1)-r ~ r r * in(2)-in(3)-in(4)-in(5)-in(6) in(7)-out(O) out(1). out(2) out(3) out(4) out(5) out(6) out(7) Figure 4.3: Sorting network for odd-even merge sort on 8 inputs. To form a direct implementation based on this sorting network, note that a compara-tor of length i can be implemented with a comparator PE and 2 wires of length i]. In this context, wire PEs are called horizontal if they correspond to a wire in the sorting network, 'These 2 wires can be implemented with a chain of i wire PEs with 2 inputs and 2 outputs. If any of these are placed in the same grid location as another wire, the 2 PEs can be merged into 1. 53 and called vertical if they are used to implement a comparator. To simplify asymptotic analysis, first consider the case when £ energy and £ time is charged for a comparator of length £, i.e. e = t = 1. We will also use the following lemma to simplify the cost analysis of wires. Lemma 4.1.5 In the ID direct implementation merge sort algorithm, the cost associated with horizontal wires during any phase is subsumed by the cost of vertical wires. Proof: Let m be the number of comparators in a given phase, all of which are disjoint by definition. It is sufficient for comparators that span the same wires to be implemented with unit horizontal separation to avoid crossing more than a constant number of wires in a single PE. Thus the total horizontal wire length h is at most m. Each comparator implementation attributes a vertical chain of wire PEs of length of at least 1, thus the total number of resulting wire PEs v is at least m. Therefore, the total number of wire PEs is in 0(v + h) = 0(v). • Lemma 4.1.5 conceptually shrinks the width of the area of PEs within the same phase to unit length for the sake of analysis. Now a merge of size 2m requires a total horizontal wire length of 0(m2m) (2m horizontal wires and m — 1 phases). Because both the horizontal and vertical wires must be traversed to complete a phase, we combine their cost by noting that the total number of vertical wire PEs is also proportional to 0(m2m). Furthermore, we combine the costof the comparators and the vertical wires, yielding an energy cost of m - 2 (2™- 1 ) 2 * = (2 m - 1 ) (2 m - 1 - 1) < 2 2 m , (4.5) i=0 and a time cost of m—1 Y^2i = 2m-l < 2m (4.6) t=0 for a merge of size 2 m . The sorting network for merge sort with N = 2n inputs performs 2n~m~l merges of size 2 m in parallel, for m = 0 , 1 , n — 1. Compute the total time and 54 energy cost by considering every merge. Multiplying (4.5) by 2 n m 1 and summing over all values of m gives n n— 1 E = 2 n + m _ 1 = 2 n ^ 2m = 2 n(2 n - 1) m=l m=0 energy cost. Summing (4.6) over all values of m gives T = ^ 2 m = 2 " + 1 _ 2 time cost. Thus if unit time and energy is charged for every PE, then the total time cost grows linearly with the input size and the total energy cost grows quadratically. Note that this is the same time and energy growth exhibited by bubble sort when charging unit time and energy for each PE, which is an optimal schedule. 4.1.3 Discounted Communication Before finding the complexity of the direct implementation for the merge sort algorithm with an optimal schedule, the grid model is generalized in this subsection to allow cheaper communication versus computation. Let the 7-model be the same as the grid model of Section 3.2 except that the energy and time associated with sending bits through a chain of £ wire PEs is ef and tf, respectively (where it remains that eta = 1), and 0 < 7 < 1. In other words, eta — p ( a + 1 \ so the 7-model is the same as the grid model for 7 = 1, but charges less for long wires for values of 7 less than 1. This will help to quantify the computation advantage that merge sort has over bubble sort. The complexity of the merge sort implementation in the 7-model is stated here and the proof which uses a similar partial derivative argument to the proof of Theorem 4.1.4 appears in the Appendix. Theorem 4.1.6 The ID direct implementation of the merge sort algorithm on N = 2n inputs has optimal communication cost of ETa G 0(An'(a+1)+1) under the j-model. Note that the ETa complexity of the final merge is equal to the complexity of the entire algorithm (see (A.l) with m = n — 1). 55 Corollary 4.1.7 The ID direct implementation of the merge sort algorithm on N inputs has optimal communication cost ofETa G 0(Na+2) under the grid model. Proof: Follows directly from Theorem 4.1.6 with 7 = 1. • It is easy to verify that ETa £ 0(N(\og . /V) 2 ( a + 1 ) ) is optimal for merge sort com-putation2. Considering also communication, the total ETa complexity for merge sort is 0{N'(log Ar)2(a+i) + N1(a+i)+ij i n t n e 7 _ m o d e i b y Theorem 4.1.6. With this, we can find a value for 7, call it 70, such that all 7 < 70 do not improve the complexity of the algorithm: jy7o(a+l)+l _ A/-(log N)2(a+l*) 4=> (7o(a + l ) - f - l ) l o g i V = log A^ + 2(a + 1) log log N 7 o ( a + l ) = 2^Zt°SN 70 = Thus we see that 70 is independent of a and approaches 0 as iV becomes large. Recall from the above analysis of the bubble sort implementation that all communication cost is included in the cost of comparator PEs. Therefore, the ETa complexity of the direct implementation of bubble sort remains 0(Na+2) in the 7-model for any value of 7 and the direct implementation of odd-even merge sort performs better for any 7 < 1. Furthermore, observe that for fixed 7, the communication cost of odd-even merge sort dominates the total cost for sufficiently large N. 4.2 2D Sorting I now consider implementations of sorting algorithms where the input and output PEs may be placed at arbitrary grid locations. The largest distance between any pair of input and output PEs can now be reduced to 0(VN). This is leveraged to find lower complex-ity implementations beyond the 0(Na+2) of ID implementations seen so far. Of the 2D 2There are C(log 2 N) phases, following from the total delay of sorting networks for merge sort, given by Knuth [29, p. 230]. 56 implementations, there is a clear trade-off between simplicity and ETa complexity. The most efficient and complicated algorithm, the sorting algorithm of Schnorr and Shamir [51], achieves a low enough upper bound as to match the lower bound from a simple argument presented in the next section. Each of the algorithms in this section are easily mapped into direct implementations of a 2D mesh of computation PEs. In all cases, these PEs are similar to comparator PEs from Section 4.1 but include memory to store an input value. The task knowledge given by the mapping P provides each PE with the next task to execute; thus no internal state logic is needed to decide when to compare and swap with which of the 4 adjacent PEs. These mesh PEs also serve as the output PEs. It is clear that such PEs are more complicated than PEs required in direct implementations of sorting networks, where none of them have internal memory and each PE always performs the same operation. For simplicity, it is assumed unless otherwise stated that the number of inputs TV is a perfect square, i.e. N = n 2 , and the mesh of PEs lie in an n x n grid. These algorithms extensively rely on epochs in which contiguous subsets of PEs forming a partition of the grid are sorted by transposition sort. A step of transposition sort of N objects in ordered locations 1 , 2 , N compares and swaps between all adjacent odd-even indexed pairs in parallel, followed by a compare and swap between all adjacent even-odd indexed pairs in parallel. It is clear that at most O(N) steps are needed to sort the objects. Transposition sorting is often performed on each of the columns or rows of the PE grid; these are called column transposition sorts and row transposition sorts. Finally, transposition sorting is also applied to PEs where adjacency is determined by row-wise snake order (or just snake order). Snake order of the PE grid, or some rectangular subset, indexes the grid across the first row, (say, from left to right), and the next adjacent PE is the rightmost PE of the row below. Adjacency proceeds from right to left across this second row, and the rightmost PEs of the second and third row are adjacent, with the third row proceeding from right to left and so forth (see Figure 4.4). As in Section 4.1, assume that PEs can operate on words. 57 I I I I I I I I I Figure 4.4: Row-wise snake order on a mesh. When sorted, the inputs are ordered along this path. 4.2.1 Shear Sort Scherson, Sen and Shamir discovered an efficient but simple algorithm for sorting in this model [50]. This algorithm operates on the n x n mesh of PEs described above. Each epoch of shear sort consists of a row transposition sort, followed by a column transposition sort. They showed that [log n] + 1 epochs are needed to sort the elements into snake order [50]. Each epoch takes at most 0(n) time, for a total T in 0(nlog n). It is clear that this direct implementation of shear short has area A of 0(n2). T h u s A T a + 1 € 0(na+3 l o g Q + 1 n), and it follows from Lemma 3.2.2 that ETa e 0{na+z l o g a + 1 n) = 0(yv(«+3)/2 l o g *+i N ^ & better trade-off than the ID sorting algorithm implementations presented. Furthermore, this schedule of e = t = 1 for all PEs can be shown optimal by noting the similarity between transposition sort and ID bubble sort, and that e — t = 1 is an optimal schedule for ID bubble sort. Theorem 4.2.1 A direct implementation of the shear sort algorithm on ann x n PE mesh with e = t = 1 for all PEs has is an optimal schedule with ETa € 0(na+3 l o g a + 1 n) — G ( i V ( a + 3 ) / 2 l o g a + l N y 58 4.2.2 SS Sort Schnorr and Shamir give a mesh sorting model where each memory cell can compare and swap with a horizontally or vertically adjacent neighbor each clock tick [51]. Clearly, this directly maps into the PE mesh implementation above. They give an algorithm that I call SS sort, that sorts the mesh in 0(n) clock ticks [51, Theorem 4]. Once again, the area A G 0 (n 2 ) , ATa+l e 0(na+3) and ETa G 0(na+3) = 0(/V>+ 3)/ 2) by Lemma 3.2.2. Although this algorithm is more complicated than in shear sort (as we will see when SS sort is thoroughly explained in Subsection 4.4.1), there is a l o g a + 1 N factor improvement in the complexity. Theorem 4.2.2 A direct implementation of the SS sort algorithm onannxn PEmesh with e = t = I for ail PEs results in ETa G 0(na+3) = 0(/V(Q+3)/2). 4.2.3 Other Mesh Sorting Algorithms There are many other sorting algorithms that operate on a mesh of PE-like proces-sors [33, 43, 54]. In particular, rev sort [51] uses cyclic row transposition sorts and column transposition sorts during each epoch, and takes O(loglogn) epochs to sort on a n x n mesh. A direct implementation can avoid long wires for the cyclic row transposition sort by "folding" the mesh. Each PE now has wires of reaching past the new left and right neighbors to its former neighbors and the connection to complete the cycle is likewise short. In a similar application of Lemma 3.2.2 to that of shear sort and SS sort, we find that ETa G 0(na+3(loglogn)a+1). However, only 0(v /loglogn) epochs are needed if the algorithm runs on a rectangular mesh with an aspect ratio of 1/ log log n [51, The-orem 11]. This rectangular rev sort algorithm has ETa G 0 (n a + 3 ( loglogn)( Q + 1 ) / 2 ) = 0(iV( a + 3 )/ 2 (log log A?")(a+1)/2) on an nu x n/u mesh, where u = \/log log n. I state these results together. Theorem 4.2.3 A direct implementation of the rev sort algorithm onanxn PE mesh with e = t = I for all PEs has ETa G 0 ( A r ( a + 3 ) / 2 ( l o g log N)a+1). A direct implementation of 59 the rectangular rev sort algorithm on a nuxn/u PE mesh (u = i/log log n) withe = t = l for all PEs has ETa £ 0(yV(Q + 3)/ 2(loglog J V ) ( Q + 1 ) / 2 ) . Again, we see that more complicated grid arrangements allow better complexities. Rev sort requires the grid to be horizontally folded for a log log n/ log n factor improve-ment over shear sort, and choosing a specific aspect ratio gives a further \/log log n factor improvement. 4.3 Lower Bounds I now make use of the gridless model to prove lower bounds in the ID and 2D cases. These arguments hinge on the cost of communication, and have an information theoretic flavor. We begin with the ID case, finding a lower bound of Q,(Na+2) and thus proving that both bubble sort and merge sort are optimal algorithms for the ID sorting problem, and that the direct implementations are also optimal (see Section 3.4). Then I prove a lower bound of fi(JV"(a+3)/2), matching the upper bound of SS sort and similarly showing its optimality for the 2D sorting problem. Finally, lower bounds are considered for the sorting problem in the "strict" model, where the number of word bits w is a parameter. Section 4.4 modifies SS sort to deal with this parameter in an attempt to match this lower bound. Theorem 4.3.1 [ID Sorting Lower Bound (word-wise)] Any ID implementation for the sorting problem on N inputs has ETa G Q,(Na+2). Proof: Each of the set of input PEs, A, and the set of output PEs, B, have inscribed circles whose centerpoints lie on a straight line. Let d be the shortest distance between the centerpoint of the inscribed circle of any pair of different PEs; d > 1 by Lemma 3.4.1. For ai G A, there is some b\ 6 B such that p,{a\,b\) > dN/2, because of the linear arrangement of PEs in B. Likewise, for a2 G A/{ci\}, there is some b2 G B/{b\) such that p(a2,b2) > d(N — l)/2, and aj,6j pairs are formed so forth so that p,(a,i,bi) > d(N — i + l)/2 for 1 < i < N. We choose a permutation of the 60 input values that maps the value held in input a, to output bi for all i. Let ej be the energy expended to send the input value of a* to bi if the computation is completed in T time units. By Axiom 5, > (d(N - i + l ) / 2 ) a + 1 T _ a . Summing over all i yields ETa > /d}°>+i Y1?=1(N - i + l ) a + 1 G (Na+2) as claimed. • Corollary 4.3.2 [ID Sorting Tight Bound (word-wise)] Any optimal ID implementation for the sorting problem on N inputs has ETa G Q(Na+2). Proof: Follows directly from Theorems 4.1.4 and 4.3.1. • Theorem 4.3.3 [2D Sorting Lower Bound (word-wise) ] Any implementation for the sorting problem on N inputs has ETa G ft(JV(a+3)/2). Proof: Let A be the set of inputs PEs and B be the set of output PEs. For a\ 6 A, there is some b\ 6 B such that p,(a\,b\) > V~N, by Axiom 3. Likewise, for a2 G A/{ai], there is some b2 £ B/{b\} such that p(a2,b2) > y/N — 1, and the remaining ai}bi pairs are formed in the same manner so that /z(aj, bi) > ^JN — i + 1 for 1 < i < N. Choose a permutation of the input values that maps the value held in input a» to output bi for all i. Let e, be the energy expended to send the input value of Oj to bi if the computation is completed in T time units. By Axiom 5, e; > (N — i + l ) ( a + 1 ) / 2 T _ a . Summing over alH yields ETa > E , ^ = i ( A 7 - i 4 - l ) ( a ! + 1 ) / 2 G n(Af<a+3)/2) as claimed. • Corollary 4.3.4 [2D Sorting Tight Bound (word-wise)] Any optimal implementation for the sorting problem on N inputs has ETa G Q(N^a+3^2). Proof: Follows directly from Theorems 4.2.2 and 4.3.3. • 61 With the main results for sorting on words of unspecified size in place, I turn atten-tion to sorting in the "strict" model where PEs operate on bits instead of words. Sorting problems now have additional parameter w, the word size of the inputs in bits. While this complicates the analysis, a similar idea to Theorems 4.3.1 and 4.3.3 is employed to give lower bounds, under the reasonable assumption that all w input bits of a word are initially stored in input PEs that are contiguous, and finally stored in output PEs that are contiguous. In order for the adversarial argument to go through, it is also assumed that w > log N to ensure enough distinct input values exist. I state these lower bounds together. Theorem 4.3.5 [Sorting Lower Bound (bit-wise)] Any ID implementation for the sorting problem on N inputs of w bits when the bits of any given word are initially stored con-tiguously, likewise with the bit locations of output words and w > log N has ETa € Q((Nw)a+2). Under the same conditions, any 2D implementation for the sorting problem has ETa G n((Nw)(a+3V2). Proof: The proof for the first claim is analogous to the proof for Theorem 4.3.1, now noting that each bit of input a* can be sent to bits of output 6; that is distance at least dw(N — i + l)/2 away. Likewise, the proof for the second claim is a simple variant of the proof for Theorem 4.3.3. It is required that w > log N to ensure that there are enough distinct input values to allow construction of the required input permutation. • 4.4 The Impact of Word Size We previously admitted PEs that operate on words in Sections 4.1 and 4.2. However, the grid model requires PEs to operate on and store 0(1) bits. Now consider the case where each word consists of w bits that are the binary representation of a positive integer value. Assume that the w bits for each input word of a sorting problem are stored in con-tiguous PEs and likewise for the output words. We note that whether or not a scattering of the bits of a word can help in the grid model remains an open question. 62 It is straightforward to adapt the ID implementation of bubble sort to operate on entire words, where comparisons take unit time and 0(w) energy, which meets the lower bound of £l((Nw)a+2). As this adaptation is trivial, I do not provide details and focus on the 2D case. Schnorr and Shamir's sorting algorithm can be modified to meet the 2D bound described above. Several technical issues arise because their algorithm operates o n a n x n array of words (n 2 = N) and proceeds in seven epochs: two operate on rows, one one columns and four on various snakes. It is straightforward to perform the row operations if individual words are arranged as columns of w bits, and it is easy to operate on columns if words are arranged as rows of bits. We modify SS sort by storing the data in an array of 0(y/Nw) x 0(s/Nw) bits. There are a few extra columns in our arrangement to handle the snaking configurations. We conceptually partition this array into \/N/w x ^jN/w tiles oftuxw bits each. When per-forming row operations, the columns of these tiles represent words, and when performing column operations, words are represented by the tile rows. The bits of the tiles are trans-posed between epochs for row operations and those for column operations. This use of tiles alters the order of comparisons of the sorting algorithm; thus I have reformulated Schnorr and Shamir's 0-1 argument for correctness. This section is completed with a counting ar-gument for complexity to show that this modification of their algorithm sorts in 0(s/Nw) phases (i.e. "clock ticks"). I begin by explaining SS sort and proceed with how the algo-rithm is adapted. 4.4.1 SS Sort Details Schnorr and Shamir's sorting algorithm consists of seven epochs in which the y/~N x \/~N mesh of processors is decomposed in various ways. These decompositions are described in terms of blocks and vertical/horizontal slices. A block is a square TV3/8 x TV3/8 subset of the mesh, referring to one of the TV 1/8 x TV1/8 blocks divided along every TV 3/ 8" 1 row and column. A slice is a subset of the mesh spanning JV 3/ 8 columns (rows), referring to one of the first /V 3 / / 8 columns (rows), the second iV 3 / / 8 columns (rows), . . . . Block size refers to 63 the length on a side of the block. Epoch 1: The mesh is treated as A 7 ' 1 / 8 x N1'6 independent blocks of size N3/8 x N3/8. The rows within each block are processed in a concatenated fashion that creates a snake of length N3/4. Each block is sorted into snake order. Epoch 2: The mesh is treated as y/~N independent rows. The N1/8 segments of length N3/8 in each row are "unshuffled." The unshuffle is perhaps best explained by Schnorr and Shamir: "The unshuffle mimics the process of distributing a deck of y/~N cards (the columns) among N1/8 players (the vertical slices) in a round-robin way, with each player receiving A 7 " 3 / 8 evenly spaced cards." Epoch 3: The same as Epoch 1. Epoch 4: The mesh is treated as y/~N independent columns. Each column is sorted. Epoch 5: The mesh is treated as N1/8 independent vertical slices of N1/2 rows and N3I8 columns. Each vertical slice is sorted into snake order. Epoch 6: The mesh is treated as y/~N independent rows (as in Epoch 2). The rows are sorted with even numbered rows sorted left-to-right and odd numbered rows sorted right-to-left. Epoch 7: The mesh is treated as y/~N rows. The entire resulting mesh is now sorted into snake order, but only for a limited number of steps. This algorithm sorts in 3y/~N + 0 ( A r 3 / / 8 ) clock ticks, which is optimal up to low order terms [51]. 4.4.2 Increasing the Block Size The block size and slice width in Schnorr and Shamir's presentation, both N3/8, are chosen with the intent to minimize the factor in front of the y/N high order term of the running time. However, the block size/slice width may be increased, and the running time remains 64 bounded by 0(V~N). This is an important point in allowing a larger range of w for which the bit-wise adaptation of SS sort meets the bit-wise sorting ETa lower bound, as becomes clear in the next subsection. Here, I sketch out the proof, which is analogous to Schnorr and Shamir's proof of correctness and running time of SS sort [51, Theorems 3 and 4] and can easily be verified by comparison. Let the block size and slice width of SS sort be N1/2~e for any e 6 (0,1/8); e = 1/8 gives the block size of SS sort of Subsection 4.4.1. The description will verify that any e in this range suitably preserves the time complexity. There are now N€ slices and N£ x Ne total blocks. Each epoch is described in terms of the state of the mesh following the operations of the epoch. The key differences in terms of Schnorr and Shamir's description are highlighted here. Epoch 1 (sort blocks): Let d\ denote the difference between the number of 0s of any two columns of the same block. At the end of this epoch, d\ is at most 1. The running time is 0(Nll2~€) with the choice of any sorting algorithm operating on a mesh running in linear time in the number of rows or columns of the mesh [33, 43], However, we will sort each block with shear sort for simplicity, which runs in 0(N1^2~e log N) time, which is also in o(\/N). Epoch 2 (unshuffle): Let d2 denote the difference between the number of 0s of any two blocks in the same horizontal slice. At the end of this epoch, d2 is at most N€. The unshuffle can be done in 0(y/N) time, as it is a fixed permutation of the columns. Epoch 3 (sort blocks): Let cfo denote the difference between the number of 0s of any two columns within the same horizontal slice. At the end of this epoch, 03 is at most 2. This relies on the fact that d2 is less that the width of each block, i.e. N€ < Nll2~e or e < 1/4. This is the same sequence of operations as Epoch 1, and runs in o{\/N) time. Epoch 4 (sort columns): Let d'4 and d![ denote the difference between the number of 0s of any two columns of the mesh and the difference between the number of 0s of any two 65 vertical slices, respectively. At the end of this epoch, d'A is bounded by 2Ne and d\ is bounded by TV 2 6 . Furthermore, the fact that Epoch 4 sorts all columns implies that at most d'4 rows contain both Os and Is, and these rows are contiguous. This epoch runs in 0(s/N) time using transposition sort on each column. Epoch 5 (sort vertical slices): Let d$ denote the difference between the number of Os of any two columns of the mesh. At the end of this epoch, d 5 is bounded by 2. This relies on the fact that d'4' is less than the width of a vertical slice, i.e. N 2e < T V 1 / 2 - 6 or e < 1/4. Because d'4 < 2Ne, this epoch is done in 0(VN) time with transposition sort on overlapping regions of 27Ve rows, similar to Schnorr and Shamir's method [51, Theorem 4]. Epoch 6 (sort rows): At the end of this epoch, the mesh is sorted except for a segment of length cfe bounded by 27V 3 e, spanning at most 2 rows. This relies on the fact that this segment length is less that the width of a row, i.e. 2N3t < VN or e < 1/6. This epoch takes 0(y/~N) time for transposition sort of the rows. Epoch 7 (sort mesh): The mesh is now sorted. Only 0(N 3e) steps of transposition sort of the entire snake is needed. Thus the mesh is sorted into snake order in 0(V~N) time. 4.4.3 Bit-wise SS Sort In the bit-wise adaptation of SS sort, each word is held by a chain of w PEs that operate on individual bits. These chains are arranged vertically for epochs that operate on rows and horizontally for epochs that operate on columns. Some care is needed to handle the transitions of configurations between epochs; these special cases are described below. As shown in Figure 4.5, the bits of the w words processed by a w x w tile are ar-ranged vertically when performing row operations, and successive words of the row are horizontal neighbors in this arrangement. Conversely, when performing operations on 66 columns, the bits of the words within a tile are arranged horizontally, and successive words of a column are vertical neighbors in this arrangement. b7-b6-b5~ b4-63-b2-bl-b0~ wO wl w2 w3 w4 w5 w6 w7 bl b6 b5 b4 b3 b2 bl bO Configuration for row-wise operations Configuration for column-wise operations Figure 4.5: The w x w tile of bits, for w — 8. Two adjacent vertical "words" of PEs can perform a compare-and-swap in a pipelined fashion by starting with the PEs for the most-significant bit. The most-significant bits of are compared, they swap bits if necessary, and the comparison outcome ("greater", "less", or "equal so far") is sent to PEs for the next most significant bits for use on the next time unit. This allows compare-and-swap operations to be performed with a throughput of once per time unit with the PEs for the least-significant bit lagging behind the PEs for the most-significant bit by w — 1 time units throughout the computation. This adds at most w time units to the time for a epoch which does not affect the big-O complexity of the algorithm for w e o(N). Operations on horizontal "words" of PEs are organized in the analogous manner. To handle the snake epochs of the algorithm, tiles are used of vertical words for most of the mesh, with a special tile at the row ends to connect to the adjacent row in the snake, as seen in Figure 4.6. The basic idea is to spread out the PEs at the end of the row and use the extra space to "tilt" the words into a horizontal arrangement. The adjacent row then connects to this using a vertically mirrored version of the tile from Figure 4.6. PEs that process different bits of the same word are connected by solid lines in Figure 4.6, and PEs that process the same bit of different words are connected by dotted 67 lines. The greatest distance between communicating PEs in this arrangement is two units. Furthermore, this tile has 3w — 2 columns to hold 2w — 1 words; in other words, it adds w — 1 extra columns to a row. This is rounded up to w to simplify the description below. These artifacts contribute to the low-order terms in the cost of the sorting algorithm, but they do not affect the big-0 complexity. This variation of Schnorr and Shamir's algorithm uses a y/Nw X (\/NW + 2N£w1+€) mesh of PEs. It is required that w < N1_e' for any e' > 0 to hide the costs associated with these extra columns (i.e. \/Nw > 2New1+e). In the various epochs of the algorithm, this mesh is configured and processed as follows. Epoch 1': The mesh is treated as (NwY x (Nw)e independent blocks of / V 1 / 2 " ^ - ^ / ^ ) x (Nw)ll2~e vertical words. Each block is sorted into snake order. Epoch 2': The mesh is treated as yjN/w independent rows of -JNw vertical words each. The (Nw)e groups of (Nw)ll2~e words (grouped by blocks), each within the same horizontal slice are "unshuffled." Epoch 3': The same as Epoch 1'. Epoch 4': The mesh is treated as yNJw independent columns of VNw horizontal words each. Each column is sorted. 68 Epoch 5': The mesh is treated as (Nw)e independent vertical slices of \J'N/wx(Nw)1/2 6 vertical words. Each slice is sorted into snake order. Epoch 6': The mesh is treated as \fN/w independent rows of y/Nw vertical words each (as in Epoch 2'). The rows are sorted with even numbered rows sorted left-to-right and odd numbered rows sorted right-to-left. Epoch 7': The mesh is treated as \/N/w rows of y/Nw vertical words each. The entire array is sorted into snake order. To ensure that each column of a block in Epochs 1' and 3' and that each row of a vertical 1/2-e slice in Epoch 4' contains at least one word, it is required that w < Nl/2+i which follows from w < (Nw)1^2~e. This is where increasing the block size pays off; the analogous block size of (Nw)3/8 to the block size of N3/8 of Subsection 4.4.1 would constrain the word size as w < TV 3 / 5 . 4.4.4 Upper Bound and Proof of Correctness In the above adaptation, the order of comparisons is changed from the algorithm described in Subsection 4.4.2 because the blocks are no longer square. Thus I recreate the 0-1 argu-ment to show that the bit-wise SS sort algorithm properly sorts. This is done by considering the case when all input words are either entirely Os (0-words) or entirely Is (1-words). Assume that tiles do not cross the boundary of blocks. Each epoch is described in terms of the state of the mesh following the operations of the epoch. Let word-rows be a group of horizontally adjacent words. Let word-columns be a group of vertically adjacent words. In addition to proving correctness, the cost of each phase is also analysed. The following will verify that the word-wise adaptation of SS sort meets the lower bound from Theorem 4.3.5. First, note that each w x w tile can be transposed using 0(w) time and 0(w3) energy. The mesh has N/w tiles, so all tiles may be transposed using 0(w) time and 0(Nw2) energy, for ETa £ 0(Nwa+2). Compare this with the lower bound complexity 69 of £l((Nw)(a+3^2) to find the values of w for which the transpose cost is dominated: Nwa+2 < {Nw)(a+3y2 <^ > y / a + l ) / 2 < N(a+l)/2 W < N. Thus, the overhead of changing from vertical words to horizontal words between epochs does not affect the big-O complexity. Likewise, the cost of converting between "snaked" and "unsnaked" configurations only affects the low-order terms of the complexity. Initially, the data is arranged with the tiles set to implement snakes in each (Nw)i/2~e x (Nw)1/2~e block of PEs. Such a block holds ( T V ™ ) 1 / 2 - 6 words per row and has A 7 " 1 / 2 " ~ew (li2+<i) word-rows. The analysis is similar to Schnorr and Shamir's except l-e various factors depending on w that occur in the description below. Because w < N i+s let e' = so we can write w < A T 1 - 6 ' = AT i+* and e' —> 0 as e —» 0. Epoch 1': Sort each block into snake order. This can be done in 0((Nw)1^2~e \og(Nw)) transposition sort steps using a shear sort algorithm. Tile transpositions are required to switch between sorting rows and columns, which occurs 0(log (Nw)) times. In a similar analysis to the tile transposes above, with an extra l o g a + 1 (Nw) factor for the time and energy for 0(log (Nw)) epochs, I check the values of w for which the ETa cost of these transposes are dominated by the total algorithm cost: Nwa+2 l o g Q + 1 {Nw) < (Nw)(Q+3V2 w(a+1V2loga+1(Nw) < Ma+1V2 w log2 (Nw) < N. The expression w log2 (Nw) is increasing in w, and evaluates to N1_e' log2 (N2~~e') when w = Nl~e'. For sufficiently large N, the expression A ^ ^ ' l o g 2 (N2~£') is less than A^ and thus the shear sort transpose operations do not dominate. There are T = (Nw)ll2~e log (Nw) < y/Nw transposition sort steps needed to shear sort the blocks independent of the transpose operations, and thus Lemma 3.2.2 implies the ETa G 0((Nw)(a+3S>/2) for this epoch. Let d\ denote the difference between the 70 number of 0-words in any two word-columns of the same block. After this epoch, d\ is bounded by 1. This follows immediately from a snake sorted block. Epoch 2': Perform an (Nw)e-way unshuffle along the word-rows of the mesh. This un-shuffle can be done in O(VNw) time and 0((Nw)3'2) energy by encoding the fixed permutation in the task graph and P. Thus ETa G 0((Nw)(a+3^2) for this epoch. Let d2 denote the difference between the number of 0-words in any two blocks in the same horizontal slice. After this epoch, d2 is bounded by (Nw)e. Because the unshuffle distributes the (Nw)ll2~e word-columns from the same block among the (Nw)e vertical slices, d2 is at most d\ times the number of vertical slices. Epoch 3': Sort each block into snake order. This has the same time and energy costs as Epoch 1'. Let ofa denote the difference between the number of 0-words in any two word-columns of the same horizontal slice. After this epoch, cfo is bounded by 2. The number of words per word-row when arranged as row-tiles is (Nw)1/2-* > d2, so snake sorting each block nearly equalizes the height of the 0-word/l-word boundary in all blocks in the same horizontal slice. Epoch 4': Sort all of the columns of the mesh downwards. The columns can be sorted using transposition sort. As each column holds y/Nw words, this requires 0(VNw) time and 0 ( ( A % ) 3 / 2 ) energy. Thus ETa G OiiNw)^3^2) for this epoch. Let d'A and d4 denote the difference between the number of 0-words in any two columns and the difference between the number of 0-words in any two vertical slices, respectively. After this epoch, d'4 is bounded by 2New1+i and d4 is bounded by {Nw)2e. Because the tiles are transposed before Epoch 4', the new difference be-tween the number of 0-words in any two columns of the same horizontal slice is bounded by d^w = 2w, as a result of the transpose. Thus, d'4 is this value times the number of horizontal slices (Nw)e. The bound for d4 is computed as d2 times the 71 number of blocks in a vertical slice (Nw)e, as the number of 0-words in each vertical slices had not changed since Epoch 2'. Epoch 5': Sort the vertical slices into snake order. The value of d'4 is reduced by roughly a factor of w as a result of the transpose (see Figure 4.7), thus the new bound is 2(Nw)e. This means that the unsorted segment is of at most 2y/Nw words in snake order, so 0(\/Nw) iterations of transposition sort along the vertical slice snake using overlapping blocks as in SS sort Epoch 4 completes this epoch. Let d§ denote the difference between the number of 0-words in any two word-columns. After this epoch, d$ is bounded by 2. The tile transpose before Epoch 5' the subse-quent operations preserve the bound on d4. Similar to Epoch 3', the word-width of a vertical slice (Nw)l/2~e greater than d4, and thus Epoch 5' nearly equalizes the height of the 0-word/l-word boundary across all vertical slices. Epoch 6': Sort the rows of the mesh into alternating left-right order. This is similar to Epoch 4' and can be done with 0(\fNw) time and 0((Nw)3/2) energy. Let de denote the number of misplaced 0-words and 1-words. After this epoch, the mesh is sorted if d$ < 1; if d 5 = 2, then de is bounded by 2y/Nw. In the d§ < 1 case there is at most one word-row containing both 0-words and 1-words and is thus sorted by the operations of Epoch 6'. In the d$ = 2 case, there are at most two such word-rows that are consecutive, and the bound for de follows. Epoch 7': Perform 2y/Nw steps of odd-even transposition sort along the entire mesh snake. Only this many steps are needed as implied by the bounds on d$ and d§. The entire mesh is in snake order. Follows from the fact that the misplaced words spanning the d§ consecutive words requires only a number of transposition sort steps along the snake proportional to d§. Thus the lower bound is met, and the following Theorem summarizes the main result of this section. 72 -©--©--o--o--©--©--©--o--©--©-0000 0 - © -- © -- O --o--©--©--o--o-0001 1 01111 00000 11111 00000 1111 11111 11111 Figure 4.7: Tile transpose at the start of Epoch 5'. Here, w = 5 and the number of word-rows containing both 0-words and 1-words is reduced from 9 to 2. Theorem 4.4.1 [2D Sorting Upper Bound (bit-wise)] Any 2D implementation for the sort-ing problem on N input values of w bits, each stored contiguously, likewise with the bit locations of the output bits and log A 7 < uu < N1-6' for any e' > 0 has ETa G 0{{Nw)(a+3V2). Corollary 4.4.2 [2D Sorting Tight Bound (bit-wise)] Any optimal 2D implementation for the sorting problem on N input values of w bits each are stored contiguously, likewise with the bit locations of the output bits and logN < w < A ' ' 1 - 6 for any e' > 0 has ETa G 9 ( (A%)( Q + 3 ) / 2 ) . Proof: Follows from Theorems 4.3.5 and 4.4.1. • 4.5 Summary In this chapter, we investigated energy-time costs associated with a classical problem in computer science: sorting. First, the technical issues arising from the size of each input were circumvented by relaxing the PE definition to operate on entire words and not just a constant number of bits. ID implementations for sorting were formed by directly adapting 73 sorting networks as described by Knuth [29, chap. 5.3.4]. Both of the direct ID imple-mentations for sorting networks based on bubble sort and merge sort were shown to have the same ETa complexity with an optimal schedule, in spite of the view that merge sort is better by many common measurements of performance, including operation count. These implementations have matching ETa complexity, i.e. that match the asymptotic lower bound for all ID implementations, through a simple, adversarial communication argument. Conclusively, regardless of the chosen algorithm, the arrangement of the PEs, the chosen schedule or other details of the ID implementation, the presented ID implementations are optimal for the sorting problem. It is clear that merge sort is indeed less costly than bubble sort when communication costs are reduced, as demonstrated by results in the 7-model in Subsection 4.1.3. For comparison, note that the canonical sequential R A M machine used in tradi-tional algorithm analysis has a proven lower bound of £l(N log N) comparisons for sorting. If unit energy and time is used per comparison (and communication cost is ignored), even an asymptotically optimal time sorting algorithm fails to meet the optimal ETa bounds of the 2D direct implementations of bubble sort and merge sort. Next, the input and output PE placement restrictions of ID implementations were lifted and implementations with arbi-trary placements were considered, specifically a mesh of input/output PEs. Of the various algorithms presented, there is a clear trade-off seen between (abstracted) PE complexity and decreased ETa asymptotic upper bounds. The best of these, SS Sort, has cost con-taining large hidden constants due to the many epochs of the algorithm. This upper bound is met with a matching lower bound, again through a simple, adversarial communication argument. Similar to the ID case, this proves a tight bound over all algorithms, implemen-tations and schedules of the sorting problem. In particular, the direct implementation for SS sort is optimal. Finally, the sorting problem was analysed within the "strict" model, where PEs operate on a constant number of bits. This adds a word-size parameter w and changes the number of inputs from to Nw. Lower bounds for the ID and 2D cases were proven, with 74 the added assumption that w > log N and that all w bits of the same word are initially stored contiguously. It is straightforward to adapt the ID implementation of bubble sort to operate on entire words, where comparisons take unit time and 0(w) energy, which meets the lower bound of Q((Nw)a+2). This general lower bound can be met when w < N1~t , for arbitrarily small e' > 0. This was realized by noting that the size of blocks (square subsets of the grid) on which SS sort operates can be increased without increasing the asymptotic time. Then, a detailed word-wise adaptation of this algorithm was presented, which specifies the configuration of words during each epoch of the algorithm. This adaptation breaks the order of compare-and-swap operations of the original algorithm, so a 0-1 argument similar to one given by Schnorr and Shamir verifies correctness. The upper bound proof followed from a simple relation of area-time and energy-time complexity. We also note that the w G Q(N) case easily meets the lower bound through a O(N) x O(N) grid of PEs with word arranged vertically and applying O(N) steps of transposition sort. This uses 0(N3) energy, and thus has ETa e 0{Na+3) = 0{(Nw)(a+3V2). In summary, the ID sorting problem is optimally solved with direct implementa-tions of sorting networks, the 2D sorting problem is optimally solved with a complicated mesh-based algorithm and the word-wise sorting problem (with contiguously stored words) is solved with a very technical adaptation of the optimal 2D algorithm. Determining the complexity of word-wise sorting with possible scattered bits remains an open problem. 75 Chapter 5 Binary Addition The binary addition problem consists of two iV-bit input values, a and b, and one (N + 1)-bit output value, s. We write a* to denote that ith bit of a, and likewise for b and s. The PEs storing the bits of a, 6, and s are predetermined. At the completion of the algorithm, the bits of s are set to represent the sum of the values initially stored in a and b. As with sorting, we consider ID adders where the bits of the words a, b, and s are each arranged as one-dimensional arrays and 2D adders where the placement of the input and output PEs is unconstrained. 5.1 ID Binary Addition I begin with a direct implementation of a simple adder, a carry-ripple adder. Consider ./V PEs with the functionality of full adders arranged in a linear array. The PE at array position i takes inputs a; and bi and produces the ith bit result and the carry-in to the next full adder, C j , using constant time and energy. While there is much choice of how to embed the inputs, outputs and adders in the plane, there are obvious arrangements that avoid long wires and have E,T £ O(N). The optimality of this schedule is clear and stated without proof. Theorem 5.1.1 [ID Addition Carry-Ripple] The ID direct implementation of the carry-76 ripple addition algorithm on N-bit inputs has optimal schedule with ETa £ 0(Na+l). x 16 X15 X 14 x 13 X 12 X l l X 10 X 9 X 8 x 7 X 6 X 5 X 4 x 3 X 2 x l s 1 6 S 15 S 14 s 1 3 s 1 2 S l l S 1 0 S 9 S 8 S 7 s 6 s 5 S 4 s 3 S 2 S l Figure 5.1: Brent-Kung addition conceptual diagram [45, Fig. 6.9, p. 101]. The white circles are prefix processors with inputs Xj„ = (gin,Pin) and Xin = (gin,Pin) on the upper wires and output xout = (gout,Pout) of fanout 1 or 2 on the lower wire(s). 5.1.1 Brent-Kung Addition We now look at Brent-Kung addition [8] to see if we can improve the ETa complexity beyond that of the carry-ripple adder. This algorithm requires a preprocessing step where generate and propagate signals, gi = en A bi and pi = en ® h, are computed. With a number of PEs linear in N, the signals x% — (gi,Pi) for all i are computed with 0(1) time and 0(N) energy. Let a prefix processor be a PE with inputs gin, Pin, gin, Pin and outputs g0ut = gin V (pin A gin) and pout = pin A pin. Consider a direct implementation of the Brent-Kung parallel prefix graph (see Figure 5.1). I show that this implementation reaches 77 the lower bound when voltage scaling is used. Theorem 5.1.2 The ID direct implementation of the Brent-Kung addition algorithm on N-bit inputs has optimal schedule with ETa G 0(Na+1). Proof: I simplify the analysis by only considering the "binary tree" part of the embedding (ii to £4 in Figure 5.1). The remaining cost of the algorithm is included in the binary tree and is ignored henceforth. There are O(N) prefix processors across 0(logA r) parallel levels for a total ETa £ 0(N\ogaN) for computation. There are 0(N) vertical wire segments with length 0(log N), e.g. one from each Xi with i even, for a total ETa £ 0(N l o g a + 1 TV). Now consider the diagonal wires. Let tj and &i denote the time and energy for a single wire at level i for 1 < i < log m and assume N = 2 m . Since the wires from level i have Manhattan length 2l +1, we have e,ff = (21 + by Lemma 3.2.1. There are 2m~l~1 wires at level i, so 2 m _ l _ 1 e j gives the total energy at level i: ™ 2 m - i - 1 (2 i + l ) a + 1 = 2^  ^ ' i= l i m i= l The minimum £ T a is realized with U = 2a+1 as shown by a partial derivative argument similar to that in the proof of Theorem 4.1.4, and ™ 2 m - i " 1 (2 i + 1 ) Q + 1 n l m T = ^ 2 ^ + i e O ( 2 m ( S r ) ) . i=i The diagonal wire ETa cost of 0(2m^ a + 1^) dominates and the result follows. • It is easily seen that voltage scaling is a requirement in order to reach an opti-mal schedule for Brent-Kung addition; choosing t = e — 1 for all PEs results in T G O(iVlogA0 and£cO(A0 . 78 5.1.2 Kogge-Stone Addition There are several other addition algorithms which require many long wires when inputs are given in a one-dimensional array. One example is Kogge-Stone addition [31], where an optimal assignment for the direct implementation does not meet the optimal complexity of carry-ripple and Brent-Kung. The Kogge-Stone adder is attractive because of its small depth (roughly half that of Brent-Kung designs), and cells of constant fan-out. However, the many long wires result in a non-optimal ETa complexity. Consider a direct implementation of a Kogge-Stone adder (see Figure 5.2). I show that this implementation has ETa £ Q(Na+2). X 16 X 15 X 14 X 13 X 12 X l l X 10 X 9 X 8 X 7 X 6 X 5 X 4 X 3 X 2 X l Figure 5.2: Kogge-Stone addition conceptual diagram [45, Fig. 6.10, p. 102]. The white circles are prefix processors with inputs X{n = (gin,Pm) and %in = {gin,Pin) on the upper wires and output xout = (gout,Pout) of fanout 1 or 2 on the lower wire(s). Theorem 5.1.3 The ID direct implementation of the Kogge-Stone addition algorithm on N-bit inputs has optimal schedule with ETa G Q(Na+2). 79 Proof: With N = 2m, at level i there are 2 m - 2% 1 "diagonal" wires implemented with 2i+l PEs, to within a constant factor arising from the grid placement restriction. Thus we have ettf = ( 2 i + 1 ) a + 1 , and the total energy at level i is (2i+1)a+1(2m - 2 i- 1)e i. m T = $ > , i=i ^2»+i)o:+i(2m - 2 i _ 1 ) =1 * The minimum i ? T a is realized with ti = (2l+1)2a+1 as shown by a partial derivative argument similar to the one in the proof of Theorem 4.1.4, and m T = ^ ( 2 i + 1 ) 2 ^ r G 9(2m(^+r)), i=i m £ _ E y ) " ' ( 2 " m - 2 - ) 6 9 ( S S ) «2'+')2=TT)« It is straightforward to show that this diagonal wire ETa cost of 9(2 m ( a + 2 ) ) domi-nates the computation and vertical wire cost as in the proof of Theorem 5.1.2 and the result follows. • Observe that the ETa complexity for this Kogge-Stone implementation is a factor of N greater than that of carry-ripple or Brent-Kung. The Kogge-Stone design seems to be a good candidate to minimize delay, but clearly it does not minimize ETa. Energy-time trade-offs of various adder designs on 32-bit inputs, including Kogge-Stone, are studied by Patil era/. [46]. 5.2 2D Binary Addition Once again, the ETa complexity is improved through algorithms that operate on a mesh of PEs. A carry-select type algorithm improves the complexity; the idea is sketched below. Consider an \fN x y/~N mesh PEs, each initialized with a pair of inputs a» and bi which eventually stores Let the PE at contain inputs a^+., biy^j+-. Each PE performs 80 binary additions with a carry-in of 0 or 1 and stores both results. The adder implementation is constructed as follows. For each row, insert a wire PE between horizontally neighboring PEs and between and vertically neighboring PEs in the rightmost column. Perform carry-ripple addition across all rows in parallel twice, once with a carry-in on the leftmost adder of 0 and once with a carry-in on the leftmost adder of 1. Following this, the topmost row has computed the output bits corresponding to their indices, as the top row always has carry-in of 0. Then compute the carry-out of each row by sending the carry-out from row i to row i + 1 down the rightmost column. Now, the carry-in for each row is known by the rightmost PE. Broadcast this carry-in value in parallel across the each row to the left. Now every output PE has computed the output bit for its index. The number of steps needed is T £ O(VN) and the area is linear in N. Thus by Lemma 3.2.2, ETa £ 0(/V>+3>/2). Theorem 5.2.1 The mesh implementation of the carry-save addition algorithm described above on N-bit inputs has optimal schedule with ETa £ 0(./v~(Q+3)/2). Proof: Follows from the arrangement sketched above. • 5.2.1 H-Tree Adder Consider a complete binary tree of nodes, implemented with computation PEs, embedded on a H-tree structure as shown in Figure 5.3. For simplicity, we assume that N is a power of two, i.e. TV7 = 2 m . If the root is at level 0 and the leaves are at level m — 1, then edges I m~i I at level i correspond in the implementation to wires of length 2L 2 J (implemented with this many wire PEs). Here I measure length of wires from the midpoint of the two PEs it connects for ease of expression. It is straightforward to verify that this implementation has 2m leaves, and that this embedding never has wire or computation PEs overlap. The inputs and outputs are the PEs at the leaves of the embedding, where the bits are ordered as visited by a depth first search starting with the root node, where the left child is visited before the right and the upper child is visited before the lower. The leaf i computes 81 • i # • - • - © ©-•-© Figure 5.3: H-Tree adder conceptual diagram a carry-propagate bit, pi = a* © 6j, and a carry-generate bit, gi = a-i Abi, and sends them to its parent. Each internal node combines the pin and bits from its lower child and j>;n and gin from its upper child to compute pout - pin A pin and 5 0 u t = gin V (p i n A g i n ) and sends pout,g0ut to its parent. The root node simply forwards the generate signal from its lower child as the carry input to its upper child, and sends a carry of zero to its lower child. The root outputs gin V (p,„ A gi„) as bit SJV+I of the sum. When an internal node receives a carry from its parent, it sends this as the carry input to its lower child. The internal node also sends (p;n A cparent) V g, n as the carry signal to its upper child, where cparent is the carry signal from its parent. Each leaf node, i, outputs a, © b{ © cparent as bit s* of the sum. Theorem 5.2.2 The direct implementation of the H-tree addition algorithm on N-bit inputs has an optimal schedule with O(N), 0 < a < l ; O(./Vlog 2A0, a = l ; 0(jV( a+ 1)/ 2), a > 1. Proof: Let the root of the tree be at level 0, and the leaves at level m — 1, where N = 2m. For k > 0, an edge between levels k and k -1 is implemented by a chain of 2L 2 J - 1 wire PEs. Observe that each PE, wire or computation, operates at most twice. The 82 allotted time for each PE at tree level i is U; there are 2 L wires of length 2L^ 'J at level i. This gives, m— 1 E < J2A^)/tf, 1=0 m—1 m V — ^ m — i T < 2 ^ 2 — ^ . i=0 Using a partial derivative minimization argument as in the proof of Theorem 4.1.4, an i assignment of U — 2 ^ + i gives , — v m 1 »(l-a) £ , T < y j 2 2 , i=0 and thus a+1 (a+l)m / C '(!-») \ ETA < 2^r~ 2 5 ^+ T > • (5.1) The result follows from this expression. In the 0 < a < 1 case, note that the right hand side of (5.1) is in o ( 2 ( ° + 1 ) m / 2 2 m ( 1 - ° ) / 2 ) = 0(2 m ) = O(N). • 5.3 Lower Bounds Lower bounds for the ETA complexity of addition are obtained by noting that the carry produced by the least significant bit can affect all bits of s. Let pa>i denote the PE initially storing af, let ps^ denote the PE that stores output Sj. I first prove a lower bound for ID addition, which matches the upper bound for optimal schedules of direct implementations of the carry-ripple and Brent-Kung adders. Theorem 5.3.1 [ID Addition Lower Bound] Any ID implementation for the problem of binary addition on N-bit inputs has ETA £ Q(Na+1). Proof: For some i > 0, at least one of p.{pa,o,Ps,i) or p.(pa,i,Ps,i) is £l(N), as each PE is separated by at least unit distance by Lemma 3.4.1. The result follows by Axiom 5 as both the value of ao and the value of an can determine the value of Sj. 83 • Theorem 5.3.2 [ID Addition Tight Bound] Any optimal ID implementation for the prob-lem of binary addition on N-bit inputs has ETa £ Q(Na+1). Proof: Follows from Theorems 5.1.2 and 5.3.1. • The lower bound for 2D implementations matches the upper bound for the H-tree adder implementation for a > 1. Theorem 5.3.3 [2D Addition Lower Bound] Any implementation for the problem of binary addition on N-bit inputs has ETa € n(Ma+1^2) by Axiom 3. Proof: For some i > 0, at least one of u-(pafl,Ps,i) a n d P-(Pa,i,Ps,i) is tt(V~N). The result follows by Axiom 5 as both the value of ao and the value of Oj can determine the value of Sj. • Theorem 5.3.4 [2D Addition Tight Bound] Any optimal implementation for the problem of binary addition on N-bit inputs has 0 (MQ+1)/2) , K a ; fi ( A K a + D / 2 ) ) 0 < a ; 0 ( J V l o g 2 J V ) , Q = 1; O(N), 0<a<l. Proof: Follows from Theorems 5.2.2 and 5.3.3. • 5.4 Summary In this chapter, we studied the problem of binary addition. In the ID case, both a simple carry-ripple implementation and a voltage scaled schedule of a Brent-Kung implementation 84 ETa e I were shown to be optimal. In the 2D case, there is an improvement over ID implementa-tions with a mesh-based adder that makes use of carry-select addition, but this adder does not meet the lower bound. In the case of a > 1, a carry-lookahead adder based on a binary tree that is embedded as an H-tree structure achieves the lower bound in the 2D case. I conclude this treatment of addition with three observations. First, the lower bounds for addition are lower bounds for a broadcast. The surprising result is that an H-tree arrange-ment allows a bit to be transmitted to all PEs within distance d of a source for a constant factor times the cost of transmitting to a single destination at distance d. This implies that the binary addition upper and lower bounds could be applied to other problems with similar communication patterns, such as the TV-input AND gate example from Section 3.5. Sec-ond, we note that unlike sorting, the lower bound for addition requires operating PEs at different speeds. In particular, those close to the root of the tree must operate faster than those at the leaves. It is easy to show that broadcasting a bit to N destinations requires ETa G f2( iV( a / 2 ) + 1 ) if all PEs operate at the same speed. More generally, we note that the ability to perform the small number of operations near the root of the tree in less time (but at higher energy per operation) than those at the leaves allows the energy-scaled algorithm to overcome the limitations of the sequential bottleneck in this reduction computation. 85 Chapter 6 Binary Multiplication The binary multiplication problem consists of two iV-bit input values, x and y, and one 2N-bit output value, p. Let x% denote the ith bit of x, and likewise for y and p. At the completion of the algorithm, the bits of p are set to represent the product of the values initially stored in x and y. As with the other problems, we consider ID multiplication where the bits of the words x, y and p are each arranged as one-dimensional arrays and 2D adders where the placement of the input and output PEs are unconstrained. 6.1 ID Binary Multiplication We can view multiplication as N — 1 binary additions of size N (with a carry bit), given by S^ Lo* 2lrcj/i. If multiplication is implemented with these additions performed sequentially, the lower bound of ETa e Q(Na+1) for ID addition leads a ID multiplication cost of at least ETa e fl(Na+s). We can achieve better complexities if we the intermediate sums are stored in a redundant form. Instead of using adders for each addition, we use carry-save adders of three iV-bit inputs a, b, C{n and iV-bit outputs s and cout. Define s; = Oi®bi®Cin,i andcout^+i = aibiV dicing bian,i. Clearly a + b + C j „ = s + cout. Carry-save addition of A -^bit inputs takes constant time and 0(N) energy. We can implement multiplication using N - 2 carry-save adders wired for appropriate input-output shifting, costing O(N) time and 0(N2) energy, and a single ripple-carry adder to produce the final output. The cost of the 86 ripple-carry adder is absorbed into the cost of the carry-save adders of ETa G 0(Na+2). This is a tight bound under the choice of schedule, and as all intermediate sums are stored in linear arrays, as are the inputs and outputs, we state the following result without proof. Theorem 6.1.1 [ID Multiplication (Carry-Save)] The ID direct implementation of multi-plication with the carry-save addition algorithm on N-bit inputs has optimal schedule with ETa G 9(/VQ + 2). 6.1.1 Discrete Fourier Transform We now implement multiplication on an array using the discrete Fourier transform (DFT). Consider computing a brute-force DFT of N elements on a linear array (similar to the systolic array of Kung [34]), i.e., compute W - l h=0 where U>N = . The brute force implementation has N cells in a ring. Each cell consists of 0(m) PEs where m G 9(log N). Cell k holds two values, s(k) and u(k). Initially (step 0), s(k) — 0 for all cells and u(k) — x(k). At step j, i - i s(k) = J2x((k + h) mod N)tokNh. This is accomplished by performing the following operations. On step j, cell k multiplies u(k) by u>kJ and adds this to s(k), storing the result in the s(k) "register". Fur-thermore, cell k sends u(k) to cell (k - 1) mod N, and likewise receives a value from cell (k + 1) mod N. The value that it receives is stored in "register" u(k) for step j + 1. If the words have m bits, then each cell can perform a step in 0(m) time and 0(m) en-ergy, the cost for multiplication where cells are implemented with carry-save adders as in Theorem 6.1.1. However, unlike the multiplication in Theorem 6.1.1, there are many multi-plications to perform by each cell instead of just one. Therefore, the multiplication, addition and communication operations may be pipelined (via the carry-save adders) so that each ad-dition step takes unit time when m is much smaller than iV and unit energy is charged for ' 87 each PE operation. The energy cost for the final ripple-carry addition of each multiplication takes 0 ( m a + 1 ) energy to perform in unit time, but this is subsumed by the ETa cost of the multiplication. Each multiplication has T £ 0(m) and E £ 0(m 2 ) for a total cost of ETa £ 0(ma+2). The N cells operate in parallel; thus, the total time for a step is 0(m) and the total energy is 0(Nm2). The brute-force DFT has N steps, thus T £ 0(Nm) and £ £ 0(N2m2). This method has E T a complexity of 0({N log N)a+2) to perform N log AT-hit multiplication. It is well known that DFTs can be used to multiply large inte-gers [30, chap. 4.3.3] and this method has a better time complexity on sequential machines than more naive methods. I do not provide the details of how this implementation computes the final 2AT-bit multiplication result, but addition of each final pair of overlapping words may be performed by a constant number of carry-save adders into carry-save form of two 2AM)it results. The final addition of these results may be done with a ripple-carry adder whose cost is subsumed by the DFT cost. Theorem 6.1.2 The ID implementation of multiplication with the DFT algorithm with schedule described above on N-bit inputs has ETa £ Q(Na+2). Note that this DFT implementation offers no advantages in terms of its ETa complexity when compared with the straightforward carry-save implementation. 6.2 2D Binary Multiplication A better upper bound is reached by arranging the DFT cells as a mesh, and using a radix y/N fast Fourier transform (FFT) algorithm. First, compute y/N DFTs of length y/N by using each row to compute a DFT as described above. This requires 0(y/Nm) time and 0(Nl-bm2) total energy, for word-size of m. Then, multiply these DFT results by the "twiddle-factor" of the FFT algorithm, using an additional 0(m) time and 0(Nm2) energy; this cost is subsumed by that of the row-wise DFTs. Now compute column-wise DFTs, using an additional 0(y/Nm) time and 0(N1-5m2) energy. Thus, an AT-point DFT is computed with m-bit words using a mesh arrangement with ETa £ 0(N^a+3^2ma+2). 88 This FFT approach can be used for multiplication. To compute the product of two TV-bit numbers, treat each multiplicand as N/ log N digit numbers in base N. First, compute the FFT of the digit-sequences for the multiplicands. Second, compute the element-wise products of these frequency domain sequences. Third, compute the inverse FFT of the result from the second step. This produces a sequence of digits for the product; however, some digits may have values larger than N. The standard binary representation of the product can be obtained by computing a small, constant number of carry-save adds and one carry-ripple addition. The FFT operations dominate the time and energy costs for this algorithm. This yields ETa £ 0((/V/log/V)(Q + 3)/ 2(log/V")Q + 2 = 0(/V( a + 3 ) / 2 ( logiV)( Q + 1 ) / 2 ) . Theorem 6.2.1 The mesh based FFT implementation for multiplication described above on N-bit inputs has ETa G 0(/V( a + 3)/ 2(log /V)(Q + 1)/ 2). 6.2.1 Preparata's Implementation In the 2D DFT approach described above, the time and energy are dominated by the mul-tiplications to perform the y/~N point DFTs. Preparata noted that that an optimal AT2 multiplier can be obtained by using a radix-2, decimation-in-frequency algorithm to per-form each of the \/7V-point transforms [49]. This reduces the number of multiplications performed while leaving the total amount of communication unchanged, and this commu-nication then determines the cost of the algorithm with ETa G 0(Ma+3^2). Preparata used the decimation-in-frequency formulation of the FFT as it enables a simple way to compute the required twiddle factors while only requiring 0(\) bits of storage per PE. Preparata's algorithm requires 0(N) area and takes 0(y/N) time steps to compute an TV-bit product using only nearest neighbor communication. Applying Lemma 3.2.2 yields the following result. Theorem 6.2.2 [Preparata's Implementation] Preparata's implementation for multiplica-tion on N-bit inputs when decomposed into PEs has ETa G 0(N^a+3^2). 89 6.3 Lower Bounds Following the classical arguments for the AT2 complexity of multiplication [10], observe that shifting can be reduced to multiplication. For any ID arrangement of x, y and p, there is a shift amount such that at least N/2 of the bits of x must move a distance of at least N/2 to reach their output PEs. Applying Axiom 5 yields a lower bound of fi ( i V a + 2 ) for ID multiplication. An add-pass multiplier using carry-save arithmetic [45, Chap. 10.3] achieves this bound. This yields: Theorem 6.3.1 [ID Multiplication Lower Bound] Any implementation for the ID multipli-cation problem on N-bit inputs has ETa £ £l(Na+2). Proof: Let the set of PEs designated as either inputs or outputs be denoted by A, noting that each of these have inscribed circles whose centerpoints lie on a straight line. Let d be the shortest distance between the centerpoint of the inscribed circle of any pair of PEs in A; d > 1 by Lemma 3.4.1. We conclude that for each input bit of x there are at least 3iV/4 output bits of p that are at least distance dN/4 away under p,. A pigeon-hole argument shows that there must be a shift for which at least 37V/4 input bits must be sent a distance of at least dN/4 under p. Applying Axiom 5 to each of these O(N) inputs yields a lower bound of Cl(Na+2). • Theorem 6.3.2 [ID Multiplication Tight Bound] Any optimal ID implementation for the multiplication problem on N-bit inputs has ETa £ Q(Na+2). Proof: Follows from Theorems 6.1.1 and 6.3.1. • The reduction from shifting can be applied to 2D multiplication as well. Theorem 6.3.3 [2D Multiplication Lower Bound] Any optimal implementation for the mul-tiplication problem on N-bit inputs has ETa £ ft(./V(a+3)/2). 90 Proof: In this case, using Axiom 3 we conclude that for each input PE of x there are at least 3 Ay 4 output PEs of p that are at least distance y/N/2 away under p,. A pigeon-hole argument shows that there must be a shift for which at least 3AT/4 input PEs of x must be sent a distance of at least y/N/2 under p. Applying Axiom 5 to each of these O(N) inputs yields a lower bound of Q(iV( a + 3 ) / 2 ) . • Theorem 6.3.4 [2D Multiplication Tight Bound] Any implementation for the 2D multipli-cation problem on N-bit inputs has ETa <E Q(N^a+3^2). Proof: Follows from Theorems 6.2.2 and 6.3.3. • 6.4 Summary In this chapter, we demonstrated matching upper and lower bounds for both the ID and 2D implementations of multiplication. The lower bounds are shown through a simple commu-nication argument and the observation that multiplication necessarily implements shifting. The ID matching upper bound implementation is a simple multiplier that uses carry-save adders to sum the partial products. The 2D matching upper bound implementation is a direct implementation of Preparata's clever, area-optimal multiplier that uses FFT. The optimal implementation in the ID case is very simple, and yet the optimal implementation in the 2D case is much more involved. A straightforward adaptation of DFT in the ID case offers no advantage for ID implementations, but in the 2D case it is an improvement over the simple (and optimal) ID implementation of carry-save adders. This implementation uses the optimal ID carry-save scheme for the word multiplications required by the DFT. Similar algorithms to the 2D FFT approach could be explored by changing how these word multiplications are performed. For example, the adder imple-mentation might be done in-place, or the word multiplications might be also implemented with an FFT algorithm, perhaps recursively. 91 Chapter 7 Conclusions and Future Work 7.1 Summary of Results Power consumption has become the most critical bottleneck to further increases in com-puter performance. VLSI circuits allow operations to be performed using less energy when they are performed using more time. This creates a large incentive for exploiting paral-lel computation: a parallel algorithm can be faster and use less energy than its sequential counterpart even in cases where the parallel algorithm performs more operations. We are aware of no prior work that analyses the complexity of computation in a model that takes energy-time trade-offs into account. This thesis presented simple models based on trade-offs commonly used by chip designers. For each primitive operation, the energy e and time t required for the operation can be traded with eta constant for some a > 0. I applied these models to the problems of sorting, binary addition and binary multi-plication. For all three of these examples, I derived asymptotic lower bounds and presented algorithm implementations that achieve these bounds to within constant factors. The results of all implementations are summarized in Table 7.1. In the case of the often-used trade-off of ET2, the optimal implementations for all three of these problems have the remarkable property that both time and energy can grow slower than N. Multiplication and sorting can have E and T both grow as N5/6, and addition can have E and T grows as s/N. 92 Problem Algorithm/ ID/ ETa Theorem/ Implementation 2D Corollary word-wise bubble-sort ID 0(Na+2) 4.1.4 sorting odd-even merge-sort ID 0(Na+2) 4.1.7 Shear sort 2D 0(N(a+3)/2 l o g a + l ^ 4.2.1 SS sort 2D 0 ( / V ( a + 3 ) / 2 ) 4.2.2 rev sort 2D 0 ( 7 V ( « + 3 ) / 2 ( l o g l o g J V ) a + 1 ) 4.2.3 rectangular rev sort 2D O ^ ^ + ^ / ^ l o g l o g / V ) 2 * 1 ) 4.2.3 bit-wise bit-wise SS sort 2D 0{{Nw)^a+^2) 4.4.1 sorting binary carry-ripple ID 0(Na^) 5.1.1 addition Brent-Kung ID olNa+1) 5.1.2 Kogge-Stone ID 0(Na+2) 5.1.3 carry-save 2D 0 ( A r ( a + 3 ) / 2 ) 5.2.1 H-tree 2D 0{N log2 N + / V > + 1 ) / 2 ) 5.2.2 binary carry-save ID 0{Na*2) 6.1.1 multiplication DFT ID 0{Na+2) 6.1.2 FFT 2D 0 ( 7 V ( « + 3 ) / 2 ( l o g / V ) ( Q + 1 ) / 2 ) 6.2.1 Preparata's 2D 0 ( A r ( a + 3 ) / 2 ) 6.2.2 Table 7.1: Summary of ETa complexities for the presented implementations. For sorting, N is the number of input words, and w is the number of bits per word for bit-wise sorting. For binary addition and multiplication N is the number of bits in each of the 2 operands. The algorithms for multiplication and sorting are direct adaptations of systolic al-gorithms that were derived much earlier to achieve AT2 optimal computation. For mul-tiplication and sorting, all operations can be performed at the same eta point — in other words, no voltage scaling is needed to achieve optimality for these problems. The optimal implementation for addition is a simple, carry-lookahead binary tree adder embedded in a 2D mesh. Here, a different energy-time trade-off is used for each level of the tree. This scaling achieves the somewhat surprising result that a bit can be broadcast to all d2 process-ing elements within distance d of a source for a constant multiple of the cost to send the bit to a single destination at distance d. It is straightforward to show that any implementation of addition where the same amount of time is used for all operations performed by all PEs has ETa G ft(/V>+2)/2). 93 It is interesting to point out the relationship between optimal ETa complexities of problems and their input/output dependences. Both sorting and binary multiplication have the same optimal ETa for both the ID and 2D cases. Sorting clearly has the property that any input may be mapped to any output. Similarly, multiplication has the property that the value of xi can affect any one of the N output bits P/V,PJV+I, • • • ,P2JV-i for any i if the value of y is chosen appropriately (say, by shifting). In contrast, binary addition does not have such a strong input/output dependence. For any fixed value of b, the value of ai can only influence bits Sj for j > i. This suggests that addition should be inherently cheaper; indeed, addition achieves a lower optimal ETa complexity for both the ID and 2D cases than sorting and binary multiplication. While this view abstracts the exact problem specification function of inputs that gives each output, it appears to be consistent with our results. The results presented in this thesis are just a beginning of an exploration of what can be done with energy-time trade-offs. The models used here are simple, and the arguments used to derive lower bounds and analyse algorithms are mathematically straightforward. These models are sufficient to analyse fundamental problems despite their simplicity. We regard this as a feature of these models: the ideas and approach should be accessible to practicing programmers and computer designers. Likewise, we were able to use the models to compare existing algorithms for the problems we considered and find ones that met the lower bounds we derived. Thus, our approach does not require programmers to start over from scratch, but gives them a basis to evaluate existing algorithms and design new ones where needed. The models allow each task to be performed at a different energy-time trade-off point. This trade-off is robust to changes associated with different technologies or specific implementation techniques by virtue of the variable a exponent. It is natural to ask whether or not this flexibility can be realized on a real chip. First, processors are being built with large numbers of timing domains [24, 36, 42], and the trends are to increase the number of such domains and to allow separate domains to operate at varying frequencies. As mentioned above, the optimal algorithms presented for 94 sorting and multiplication do not exploit this flexibility. The matching lower bounds show that implementations for these two problems where all PEs operate at the same voltage and frequency is optimal to within constant factors. Finally, the optimal adder that we presented does employ voltage scaling, and we showed that this scaling is necessary to achieve optimality. This suggests that heterogeneous parallel architectures with a few fast, power intensive processors, may offer substantial advantages over homogeneous designs for problems with communication patterns like those of addition, for example, where there is a sequential bottleneck at the root of a reduction tree. 7.2 Other Problems In light of the simplicity of analysing the three chosen problems, we plan to analyse other algorithms such as matrix multiplication, graph algorithms and dynamic programming. The algorithms that we have studied so far have lower bounds determined by simple, commu-nication arguments. Are there problems that are inherently computation bound rather than communication bound, and how would one show this? Addition, sorting and multiplication all have rather low-order ETa complexity. Consider the case where a = 2 [38]. As noted above, addition, sorting and multiplica-tion all have the property that both their energy and time requirements can scale sublinearly with the problem size in an ET2 model. In fact, any problem with an ET2 complexity in o(Nz) has this property. Clearly, there are problems with ET2 complexities greater than U(N3) as can be shown from the classical space-hierarchy theorems and a simple reduction argument. We hope to identify "natural" problems that have higher ETa complexities, and examine the implications for their parallel implementations. 7.3 Other Models We are also interested in refining the models for energy-time trade-offs. In this current work, constant factors have been ignored and it is assumed that the energy-time trade-offs 95 for computation and communication are the same. In reality, constant factors do matter, and while both computation and communication offer energy-time trade-offs, these trade-offs arise from different physical considerations. Thus, a more detailed analysis could motivate the use of different trade-offs for computation and communication, which could refine the results presented here beyond the brief consideration of Subsection 4.1.3. The present models also neglect the growing impact of threshold voltages and leak-age currents. Furthermore, trade-offs from transistor sizing are limited by the smallest man-ufacturable transistors. For example, all of these are considered in the analysis of adders presented by Patil et al. [46]. They conclude that a Sklansky adder is optimal for a 32-bit adder implemented in a particular 90-nanometer CMOS fabrication technology. We would like to include suitable abstractions of these effects in a technology independent model. When these limits are considered, the asymptotes of the energy-time trade-off curves occur at non-zero values for E and T. We expect that with such models (e.g. (E — EQ)(T — To)a) the analysis will become somewhat more involved, but it should also be more illuminating of the trade-offs for real hardware and software. Finally, the throughput of a computation is more critical than its latency for some applications. Thus, we can consider EPa complexities, where P is the "period" of the com-putation, the time between starting (or finishing) successive problems. Here we observe that energy-performance trade-offs differ significantly from prior work on area-time complex-ity. Most ATa lower bounds were based on operation count (for AT1) or cross-sectional bandwidth (for AT2) and thus yield the same complexity for both latency or period. In contrast, EPa bounds can be lower than their ETa counterparts. For example, a simple, pipelined, add-pass multiplier achieves an EPa complexity of 0(N2), which is less than the lower bound for ETa of Q(N^a+3^2) for any a > 1. Techniques for proving lower bounds in an EPa model and finding a suite of EPa optimal algorithms remain topics for future work. 96 Bibliography [1] S. Albers and H. Fujiwara. Energy-efficient algorithms for flow time minimization. In Proc. 23rd Intern. Symp. on Theoretical Aspects of Computer Science, pages 621-633, 2006. [2] Susanne Albers, Fabian Miiller, and Swen Schmelzer. Speed scaling on parallel processors. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 289-298, New York, NY, USA, 2007. A C M Press. [3] W. Athas, J. Koller, and L. Svensson. An energy-efficient CMOS line driver using adiabatic switching. In Proc. 4th Great Lakes Symp. VLSI Design, pages 159-164. IEEE Press, 1994. [4] Nikhil Bansal, Tracy Kimbrel, and Kirk Pruhs. Dynamic speed scaling to manage energy and temperature. In Proceedings of the 45th Annual IEEE Symposium on the Foundatations of Computer Science (FOCS'04), pages 520-529, 2004. [5] Philippe Baptiste. Scheduling unit tasks to minimize the number of idle periods: a polynomial time algorithm for offline dynamic power management. In SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 364-367, New York, NY, USA, 2006. A C M Press. [6] Luiz Andre Barroso. The price of performance. Queue, 3(7):48-53, 2005. [7] C. H. Bennett. The Thermodynamics of Computation-a Review. . International Journal of Theoretical Physics, 21:905-+, December 1982. [8] R. Brent and H.T. Kung. A regular layout for parallel adders. IEEE Transactions on Comput-ers, C-31(3):260-264, March 1982. 97 [9] R. P. Brent and H. T. Kung. The chip complexity of binary arithmetic. In STOC '80: Pro-ceedings of the twelfth annual ACM symposium on Theory of computing, pages 190-200, New York, NY, USA, 1980. A C M Press. [10] R. P. Brent and H. T. Kung. The area-time complexity of binary multiplication. J. ACM, 28(3):521-534, 1981. [11] David P. Bunde. Power-aware scheduling for makespan and flow. In Proceedings of the 18th ACM Symposium on Parallel Algorithms and Architectures (SPAA'06), pages 190-196, 2006. [12] Lawrence T. Clark, Eric J. Hoffman, et al. An embedded 32-b microprocessor core for low-power and high-performance applications. IEEE Journal of Solid-State Circuits, 36(11): 1599— 1608, November 2001. [13] David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: Towards a realistic model of parallel computation. In Principles Practice of Parallel Programming, pages 1-12, 1993. [14] L. E. Dickson. Introduction to the Theory of Numbers. Dover, New York, 1957. [15] J.C. Ebergen and A.M.G. Peeters. Design and analysis of delay insensitive modulo-n counters. Formal Methods in System Design, 3(3):211-232, December 1993. [16] Jo C. Ebergen, Jonathan Gainsley, and Paul Cunningham. Transistor sizing: How to control the speed and energy consumption of a circuit. In Proceedings of the Tenth International Symposium on Asynchronous Circuits and Systems (ASYNC'04), pages 7-16, April 2004. [17] Steve Fortune and James Wyllie. Parallelism in random access machines. In Proceedings of the 10th Annual ACM symposium on the Theory of Computing (STOC'78), pages 114—118, 1978. [18] M. Frank. Common mistakes in adiabatic logic design and how to avoid them. In Proceedings of the Workshop on Methodologies in Low-Power Design, pages 216-222, June 2003. [19] Edward Fredkin and Tommaso Toffoli. Conservative logic. International Journal of Theoreti-cal Physics, 21:219-253, 1982. 98 [20] Erol Gelenbe. Multiprocessor Performance. John Wiley and Sons, 1989. [21] Simcha Gochman, Ronny Ronen, et al. The Intel Pentium M processor: Microarchitecture and performance. Intel Technology Journal, 7(2):21-36, May 2003. [22] Ricardo Gonzalez and Mark A. Horowitz. Energy dissipation in general purpose microproces-sors. IEEE Journal of Solid-State Circuits, 31:1277-1284, September 1995. [23] Kinshuk Govil, Edwin Chan, and Hal Wasserman. Comparing algorithm for dynamic speed-setting of a low-power CPU. In Mobile Computing and Networking, pages 13-25, 1995. [24] Michael Gschwind, H. Peter Hofstee, et al. Synergistic processing in Cell's multicore archi-tecture. IEEE Micro, 26(2): 10-24, Mar/Apr 2006. [25] Bruce Hoeneisen and Carver A. Mead. Fundamental limits in microelectronics I: MOS tech-nology. Solid-State Electronics, 15:819-829, 1972. [26] T. Indermaur and M . Horowitz. Evaluation of charge recovery circuits and adiabatic switching for low power CMOS design. In Symposium on Low-Power Electronics, pages 102-103, 1994. [27] Canturk Isci, Alper Buyuktosunoglu, et al. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In Proceeding of the 39"1 Annual IEEE/ACM Workshop on Microarchitecture (MICR039), pages 347-358, December 2006. [28] Anoop Iyer and Diana Marculescu. Power-performance evaluation of globally asynchronous, locally synchronous processors. In Proceedings of the 29th International Symposium on Com-puter Architecture, pages 158-168, June 2002. [29] Donald E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, 1972. [30] Donald E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms. Addison-Wesley, 1972. [31] Peter M . Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, 22:786-793, 1973. U R L : h t t p : / / c r . y p . t o / b i b / e n t r i e s . h t m l # 1 9 7 3 / k o g g e . 99 [32] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A 32-way multi-threaded Sparc processor. IEEE Micro, 25(2):21-29, March/April 2005. [33] Manoj Kumar and Daniel S. Hirschberg. An efficient implementation of batcher's odd-even merge algorithm and its application in parallel sorting schemes. IEEE Trans. Computers, 32(3):254-264, 1983. [34] H. T. Kung. Special-purpose devices for signal and image processing: An opportunity in VLSI. SPIE, Real Time Signal Processing III, 241:74-84, 1980. [35] U. V. Linnik. On the least prime in an arithmetic progression i: The basic theorem. Rec. Math., 15:139-178, 1944. [36] Grigorios Magklis, Michael L. Scott, et al. Profile-based dynamic voltage and frequency scal-ing for a multiple clock domain microprocessor. In Proc. 30"1 Int'l. Symp. Computer Archi-tecture (ISCA), pages 14-25, 2003. [37] D. Markovic, V. Stojanovic, et al. Methods for true energy-performance optimization. Solid-State Circuits, IEEE Journal of, 39:1282-1293, 2004. [38] Alain J. Martin. Towards an energy complexity of computation. Information Processing Let-ters, 77:181-187, 2001. [39] Alain J. Martin, Mika Nystrbm, and Paul I. Penzes. Et2: A metric for time and energy effi-ciency of computation. In Rami Melhem and Robert Graybill, editors, Power Aware Comput-ing. Kluwer, 2002. [40] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics Maga-zine, 38(8), April 1965. [41] Trevor Mudge. Power: a first class architectural design constraint. IEEE Computer, 34(4):52-58, April 2001. [42] Samuel Naffziger, Blaine Stackhouse, et al. The implementation of a 2-core, multi-threaded Itanium family processor. IEEE Journal of Solid State Circuits, 41(1): 197-209, January 2006. [43] David Nassimi and Sartaj Sahni. Bitonic sort on a mesh-connected parallel computer. IEEE Trans. Computers, 28(l):2-7, 1979. 100 [44] Kunle Olukotun, Basem A. Nayfeh, et al. The case for a single-chip multiprocessor. In Pro-ceedings of the Seventh International Symposium on Architectural Support for Parallel Lan-guages and Operating Systems, pages 121-132, October 1996. [45] Behrooz Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford Univer-sity Press, 2000. [46] Dinesh Patil, Mark Horowitz, et al. Robust energy efficient adder topologies. In \%th Computer Arithmetic Symposium (ARITH18), pages 16-28, June 2007. [47] P. Penzes, M . Nystrom, and A. Martin. Transistor sizing of energy-delay-efficient circuits. Technical Report 2002003, Department of Computer Science, California Institute of Technol-ogy, 2002. [48] Paul I Penzes and Alain J. Martin. Energy-delay efficiency of VLSI computations. In GLSVLSI '02: Proceedings of the 12th ACM Great Lakes symposium on VLSI, pages 104-111, New York, NY, USA, 2002. A C M Press. [49] Franco P. Preparata. A mesh-connected area-time optimal VLSI multiplier of large integers. IEEE Trans. Computers, 32(2): 194-198, 1983. [50] Isaac D. Scherson, Sandeep Sen, and Adi Shamir. Shear sort - a true two-dimensional sorting technique for VLSI networks. In 1986 International Conference on Parallel Processing, pages 903-908, August 1986. [51] CP. Schnorr and A. Shamir. An optimal sorting algorithm for mesh connected computers. In Proceedings of the l%th ACM Symposium on the Theory of Computing (STOC18), pages 255-263, May 1986. [52] Deo Singh and Vivek Tiwari. Power challenges in the internet world. In In Cool Chips Tuto-rial, held in conjunction with the 32nd Annual International Symposium on Microarchitecture, pages 8-15, November 1999. [53] Michael B. Taylor, Walter Lee, et al. Evaluation of the raw microprocessor: an exposed-wire-delay architecture for ILP and streams. In Proceedings of 31st International Symposium on Computer Architecture, pages 2-13, June 2004. 101 [54] C. D. Thompson and H. T. Kung. Sorting on a mesh-connected parallel computer. In STOC '76: Proceedings of the eighth annual ACM symposium on Theory of computing, pages 58-64, New York, NY, USA, 1976. A C M Press. [55] Clark David Thompson. A complexity theory for VLSI. PhD thesis, Carnegie Mellon Univer-sity, 1980. [56] Mark Weiser, Brent Welch, et al. Scheduling for reduced CPU energy. In Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI'94), pages 13-23, 1994. [57] Samuel Williams, John Shalf, et al. The potential of the Cell processor for scientific computing. In Proceedings of the 3rd conference on Computing Frontiers (CF'2006), pages 9-20, New York, NY, USA, 2006. A C M Press. [58] Andrew Chi-Chih Yao. Some complexity questions related to distributive comput-ing(preliminary report). In STOC '79: Proceedings of the eleventh annual ACM symposium on Theory of computing, pages 209-213, New York, NY, USA, 1979. A C M Press. [59] F. Yao, A. Demers, and S. Shenker. A scheduling model for reduced CPU energy. In FOCS '95: Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS'95), pages 374-382, Washington, DC, USA, 1995. IEEE Computer Society. 102 Appendix A Proof of Theorem 4.1.6 Theorem 4.1.6 The ID direct implementation of the merge sort algorithm on N = 2 inputs has optimal communication ETa £ 0(JV 7 ( a + 1 ) + 1 ) under the /y-model. Proof: Recall a merge of size 2m requires 2 m _ 1 comparators of each length 2l, i — 1,2,...,m, and all comparators of the same length execute in parallel. The merge sort algorithm on 2" inputs performs 2 n _ m ~ 1 merges of size 2 m in parallel, for m — 0,1, ...,n - 1. We first find the optimal ETa complexity for a single merge of size 2 m . The energy and time cost at phase i is 2 m _ 1 2 1 e m ) i and 2ltm^, respectively, where em,i — 1 /tm i- Define the total time and energy of a merge of size 2m as Tm and Em, respectively (see Figure 4.3). These are given by m Tm = ^ ^ 2 ~^tm,i, i=l m Em = 2— 1^2 iVC , i-1=1 Note that similar to the direct implementation of bubble sort, a minimal schedule has all comparator PEs in the same phase with the same duration (see Lemma 4.1.1). De-fine Em, Tm, TM, fm and im,i in the same manner as in the proof of Theorem 4.1.4. Now differentiate EmT^ with respect to tm^ for some 1 < k < m and evaluate at 103 \ 1=1 2=1 f2— 1(-a)^ r >) f - + a ^ m f ( 2 ^ ) . \ m,fc / Thus, 5*; = 0 2 ™ - 1 ( - a ) ^ ) f « + a ^ m f « - 1 ( 2 ^ ) = 0 2^ 71 — 1 73+T-a +A1 / 2 m _ 1 S a - . As with the proof of Theorem 4.1.4, by Lemma 3.1.1 with scaling constant an assignment of tm<i = a+\/2™ is also minimal. This gives: m i=l m Em = 2 m - x 2 ^ / 2 ^ G 0 ( 2 m ( ^ T T + 7 ) ) and thus i=l EmTZ G Q ( 2 m ( T ( « + i ) + i ) ) . (A. l ) A merge of size 2 m has an optimal EmT^ complexity of 0 ( 2 m ( T ( a + 1 ) + 1 ) ) . We achieve this by taking tm^ = a*y/2™, but through Lemma 3.1.1 we can find values for tm,i for any choice of Em and Tm such that EmT£ G C>(2 m W a + 1 ) + 1 )) . Recalling that there are 2 n _ m _ 1 parallel merges of size 2 m , defined to belong to phase Tm, consider the energy and time cost of the entire algorithm on 2" inputs, E and T, respectively, 104 given by n - l m=0 E = 2n~l 2 m 7 ( a + 1 ) / T ^ . n - l m=0 -Again note that a minimal schedule has all comparator PEs in the same phase with the same duration (see Lemma 4.1.1). Define E, T, r, f and Tm in the same manner as in the proof of Theorem 4.1.4. Now differentiate ETa with respect to Tk for some 0 < k < n — 1 and evaluate at r = f: d r?TC = (4-kE)T" + aET n - l \ -2™ _ 1 Y2 2 m 7 ( a + 1 ) /T a J Ta m=Q / •a-1 ( d_T a dTk' n - l = 2 ) 7 l — 1 9r f c \ m=0 2^7(0+1) T" + cxETa~x (1) Thus, = 0 l k 2n-1(-a)^^)fa + aEf--1 = 0 2 n — l 2 ^ 7 ( a + l ) lk T = E Tk = a+^2n-12k^a+1^ T \ "+1 n - l / r Tk = 2 ^ = + T . ( j Again, similar to the proof of Theorem 4.1.4, by Lemma 3.1.1 with scaling constant 105 ° + \ / ? ' a n a s s ' § n m e n t °f Tm = 2 m 7 + a + 1 is also minimal. This gives: n - l T = 2 m i + ^ e 0 ( 2 " ( 7 + ^ ) ) , m=0 n - l • E = 2n~1^2 2m^a+^/2m~'a+^ E 0 ( 2 " ( 7 + ^ T ) ) and thus m—0 ETa e o ( 2 n ( 7 ( Q + 1 ' + 1 ) ) . Therefore, the optimal ETa complexity for communication of the ID merge sort im-plementation on N = 2" inputs is 0(N~r(a+1)+l). • 106 


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items