On the VLSI Complexity and Architecture of the Arithmetic Fourier Transform (AFT)

By Oswaldo Antezana
B.S. Comp. Eng., University of Washington, Seattle WA, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (ELECTRICAL ENGINEERING)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
July, 1994
© Oswaldo Antezana, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

This thesis presents a novel logarithmic-depth VLSI network for implementing a fundamental computation of the Arithmetic Fourier Transform (AFT). The VLSI optimality, in terms of area-time trade-off, of this Orthogonal Flat Tree (OFT) network is proved. Namely, it is proven that the area of the network is O(N²) and the corresponding total time delay is O(log N). In addition, a number of layout rules for the construction of this network are proposed. The topology of this network is fairly simple and is based on an orthogonal interconnection of binary trees and linear arrays, or flat trees, where the nodes of the trees are bit-serial devices (adders and distributed shift registers). This OFT network uses only a small number of bit-serial adders to implement a matrix-vector multiplication.
Moreover, two architectures for the computational area are considered and their respective bounds are proved. From this discussion it can be concluded that the interconnection area is dominant, as the area for computation is relatively small, of the order N log² N. In addition, an upper bound on the area of the butterfly network proposed in [31] for the first stage of the AFT is introduced. This network computes the Bruns averages vector, which constitutes one of the inputs to the OFT network. These two networks can lead to a practical implementation of the AFT. Finally, the oversampling problem of the AFT algorithm is also studied. When using the butterfly network, it is proved that obtaining all the required samples takes an area slightly larger than O(N²), which can result in a non-optimal AT² bound. This is due to the large number of values that need to be averaged, up to 2N numbers, for a single multiple-input AFT butterfly. However, for practical sizes (N=16, 32) of a butterfly network the AT² bound is very close to the optimal bound.

Table of Contents

Abstract ii
List of Figures vi
Acknowledgment ix
1 Introduction 1
  1.1 Background 1
  1.2 VLSI model of computation 3
  1.3 Theoretical AT² complexity 4
  1.4 Motivation for the AFT 4
  1.5 Contributions 7
  1.6 Outline of Thesis 8
2 The Discrete and Arithmetic Fourier Transform 9
  2.1 Conventional Fourier techniques for DSP 9
  2.2 The computation of the DFT 10
  2.3 The FFT algorithm 11
    2.3.1 Generalization of FFT algorithms for N a composite number 15
    2.3.2 Area-time Complexity of the FFT algorithm 16
  2.4 Logarithmic-time DFT Networks 17
  2.5 Fourier Analysis 20
  2.6 The Bruns Arithmetic Fourier Transform 21
3 Survey of AFT architectures 23
  3.1 Systolic Arrays for the AFT 23
  3.2 A Modular Linear Array Architecture 26
  3.3 Extended AFT VLSI Architecture 27
  3.4 An iterative realization of the AFT algorithm 28
4 A VLSI architecture for the AFT 31
  4.1 A systolic array with butterfly structure 31
  4.2 An Optimal layout for the Fourier coefficient computation 32
    4.2.1 Introduction 32
    4.2.2 Architecture for a_n and b_n computation 33
5 Area-Time Complexity for the AFT 40
  5.1 Area complexity for the general tree of butterfly structure 40
  5.2 Number of alternating averages l_j for a given coefficient a_j 45
  5.3 Upper bound on the total number of leaves of computational trees 47
  5.4 A regular binary tree based architecture and an upper bound for the computational area 51
  5.5 A two level tree architecture and an improved upper bound for the computational area 56
  5.6 Communication Problem 59
  5.7 Upper bound on the OFT-Network 60
6 Simulation 65
  6.1 Introduction 65
  6.2 Design Tools 66
  6.3 The Bruns Network 66
  6.4 The OFT Network and its design 71
  6.5 Simulation 74
7 Conclusion 81
  7.1 Summary 81
  7.2 Further Work 83
Bibliography 87
Appendix A B_k distribution 91
Appendix B VHDL simulation 92

List of Figures

Figure 1 Area-Time performance of the Fourier Transform-Solving Circuits (from [34]) 5
Figure 2 Diagram of initial two-point DFT's for decimation-in-time FFT 11
Figure 3 Flowchart for the FFT decimation-in-time 12
Figure 4 Eight point decimation-in-time FFT 13
Figure 5 A 2 point FFT butterfly 14
Figure 6 A 6 point FFT algorithm using a combination of radices 16
Figure 7 4-point DFT on an OTN (4x4) 18
Figure 8 A logarithmic-time DFT(N',N'') network 19
Figure 9 Systolic I AFT 24
Figure 10 A signal flow-graph representation for matrix multiplication 24
Figure 11 Systolic II AFT processor 25
Figure 12 The linear array for computing the AFT 27
Figure 13 Block diagram of the computation of the iterative AFT 29
Figure 14 VLSI Architecture for the AFT (from ) 30
Figure 15 The AFT butterfly 31
Figure 16 Example of butterfly tree for calculating Brun's alternating averages B(2,0), B(4,0), B(8,0) 32
Figure 17 Basic structure of an OFT network of size N=38 34
Figure 18 Structure of the computational tree for a_1 when N=16 35
Figure 19 Layout showing the different computational trees 36
Figure 20 Flat tree layouts: a) Flat binary tree, b) Flat binary broadcast tree, c) A two-level flat broadcast tree 37
Figure 21 Layout showing the different broadcasting trees 39
Figure 22 An AFT showing two addition nodes (of weight=n) and two scaling operations 42
Figure 23 A butterfly tree network with a leftmost stage of weight=3 42
Figure 24 Area required by the tree configuration 48
Figure 25 l_aj compared with the earlier approximations 54
Figure 26 Basic process describing the behavior of the AFT butterfly 67
Figure 27 Two-Level Addition-Tree with Carry saved 68
Figure 28 Description of the generation of the AFT-butterfly network 69
Figure 29 A Bruns network for N=8 70
Figure 30 Description of the generation of all binary trees in the OFT network 72
Figure 31 Description of the generation of all flat trees in the OFT network 73
Figure 32 A linear tree showing difference between boundary and inner nodes 74
Figure 33 Waveforms showing the computation of W and B coefficients for two positive numbers 75
Figure 34 Bruns averaging with zero order interpolation 76
Figure 35 AFTN network for N=2 77
Figure 36 Waveforms showing two stages of butterfly computation 77
Figure 37 Test bench circuit used for module-testing 79
Figure 38 Some patterns generated by the TG when the DUT is an AFT module 79
Figure 39 Outline of the steps needed in the description of a test bench circuit 80
Figure 40 Distribution of the broadcasting trees for relatively small sizes (N) 83
Figure 41 Data for the distribution of the broadcasting trees for small N's 91
Figure 42 Maximum number of nodes for linear trees for a given N 91

Acknowledgment

I want to take the opportunity to thank my supervisors, Dr. Hussein Alnuweiri and Dr. Andre Ivanov, for the constant encouragement and valuable guidance I was fortunate to receive in the past two years. Their helpful comments and critical questions made this work possible. I would also like to thank the Canadian Microelectronics Corporation for providing me with both hardware and software tools. I am extremely grateful to my parents and my sister Ana Maria, who always encouraged me and supported me both financially and spiritually. I am also grateful to my girlfriend Kaoru, who encouraged me and stood by me in good and bad times. Without them this work would not have been possible.

Chapter 1
Introduction

1.1 Background

The Discrete Fourier Transform (DFT) is an essential part of the analysis and design of most signal processing applications. The DFT method is popular because the transform can be computed quickly by the Fast Fourier Transform (FFT), an algorithm that has regular data-dependency. But the approach using the FFT does have a few drawbacks, slight nuisances that are nonetheless well worth the saving over a direct method [29]. Here are a few of the drawbacks:

• Most of the multiplications needed in computing the FFT are not by even numbers or real numbers.
• The FFT uses computations involving irrational numbers. There is an inherent need to tolerate approximate results and to estimate the loss in accuracy due to round-off.
• The FFT is formulated for complex sequences.

It would be nice if a transform existed that had both the speed and simplicity of the FFT but worked with real numbers. There exist some Number-Theoretic Transforms (NTT) that have these properties [29], [26]. NTTs can be used to compute certain problems, like convolutions, using a different field. The quantities input to a computation are put in one-to-one correspondence with elements in a finite field or ring, the computation is carried out in the arithmetic of the finite ring, and the results of the finite ring computations are put into one-to-one correspondence with the results in the desired system [29]. These transforms have, of course, a few drawbacks of their own, but there are cases in which they would clearly be of value. An example is the Fermat number transform [29]. The Fermat number transform can be carried out by an algorithm like the FFT. This transform uses Fermat numbers, defined by F_µ = 2^(2^µ) + 1 for a nonnegative integer µ. In place of multiplications of complex data by W_N^(nk) = e^(−j2π nk/N), we multiply integer-valued data by 2^(nk). Another example of such transforms is the Mersenne transform [29]. It is used to compute circular convolutions of integer sequences. The length of the sequences that can be convolved must be the same as the word length used to represent the data. The additions in this transform use one's-complement arithmetic, and most of its computations use Mersenne numbers, defined by M_p = 2^p − 1, where p is a prime number. These number-theoretic transforms do represent an alternative way of computing certain problems like convolutions. However, unlike the DFT, the Fermat and the Mersenne number transforms do not reveal very much about the spectral composition of the sequence that was transformed. Therefore, the use of the NTT is limited and by itself cannot be put into correspondence with a DFT.
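As an illustration of how such finite-ring transforms work, the following sketch (not from the thesis; all names are illustrative) implements a small number-theoretic transform modulo the Fermat number F_3 = 257, where 2 is a 16th root of unity (2^8 = 256 ≡ −1 mod 257), and uses it to compute an integer circular convolution exactly, with no rounding error:

```python
# Sketch: a length-16 number-theoretic transform modulo F_3 = 2^(2^3) + 1 = 257.
# All arithmetic stays integral, so the convolution result is exact.
F3 = 257          # Fermat number F_3 (prime)
N = 16            # transform length
ROOT = 2          # 2 has multiplicative order 16 modulo 257

def ntt(a, root=ROOT, mod=F3):
    """Direct O(N^2) number-theoretic transform."""
    return [sum(a[n] * pow(root, n * k, mod) for n in range(N)) % mod
            for k in range(N)]

def intt(A, root=ROOT, mod=F3):
    """Inverse transform: invert the root and divide by N (both mod F_3)."""
    inv_root = pow(root, mod - 2, mod)   # modular inverse via Fermat's little theorem
    inv_N = pow(N, mod - 2, mod)
    return [(inv_N * sum(A[k] * pow(inv_root, n * k, mod) for k in range(N))) % mod
            for n in range(N)]

def circular_convolution(a, b, mod=F3):
    """Convolve by pointwise multiplication in the transform domain."""
    A, B = ntt(a), ntt(b)
    return intt([x * y % mod for x, y in zip(A, B)])

a = [1, 2, 3, 4] + [0] * 12
b = [5, 6, 7, 8] + [0] * 12
conv = circular_convolution(a, b)
# Direct check of the circular convolution definition
direct = [sum(a[m] * b[(n - m) % N] for m in range(N)) % F3 for n in range(N)]
assert conv == direct
```

Because the sequences are zero-padded, the circular convolution equals the linear one here; note how the whole computation reveals nothing spectral about the input, which is the limitation discussed above.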
Nevertheless, there is a different type of number-theoretic transform that uses a completely different algorithm to compute the Discrete Fourier Transform of periodic functions through the use of a formula called the Möbius inversion formula [32]. This algorithm was developed by a mathematician named H. Bruns early in this century. Tufts and Sadasiv rediscovered this algorithm a few years ago and renamed it the Arithmetic Fourier Transform, or AFT. This algorithm requires mainly additions and a small number of real multiplications in only one stage. Except for the non-uniform data dependency of the algorithm, it is possible to implement the AFT in near-optimal area and time without complex arithmetic. The application of this algorithm to the computation of the Fourier transform (forward DFT) in signal processing is evident [32], [31]. Algorithms that need to perform many forward DFTs during a fixed amount of time can be simplified by using the AFT. Recently, some architectures have been proposed that implement the AFT. Reed et al. [32] extended the basic AFT algorithm to compute the Fourier coefficients of both the even and odd components of a periodic function. The algorithm was extended to obtain real and imaginary parts of the Fourier coefficients [31] by employing symmetrical computing sequences. In this thesis, the butterfly tree network used in [31] for computing the Bruns averages will be analyzed and used in conjunction with a novel network called the Orthogonal Flat Tree network that computes the Fourier coefficients in optimal area. Also, a formal proof of the bound on the area of the network is presented. Likewise, other bounds on area and time for the butterfly tree network are introduced.
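The Möbius inversion formula on which the Bruns algorithm rests can be illustrated with a short numeric check. This sketch is illustrative only and not taken from the thesis; `mobius` and `divisors` are hypothetical helper names:

```python
# Sketch: the Möbius function and a check of the Möbius inversion formula.
def mobius(n):
    """mu(n): 0 if n has a squared prime factor, else (-1)^(number of prime factors)."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:       # squared prime factor
                return 0
            result = -result
        p += 1
    if n > 1:                    # one leftover prime factor
        result = -result
    return result

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

# Möbius inversion: if g(n) = sum over d | n of f(d),
# then f(n) = sum over d | n of mu(n/d) * g(d).
f = {n: n * n + 1 for n in range(1, 31)}          # arbitrary test sequence
g = {n: sum(f[d] for d in divisors(n)) for n in f}
f_recovered = {n: sum(mobius(n // d) * g[d] for d in divisors(n)) for n in f}
assert f_recovered == f
```

The AFT exploits exactly this kind of inversion: the Bruns averages play the role of g, and the Fourier coefficients are recovered from them with Möbius-weighted sums, i.e., with additions and a few scalings only.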
1.2 VLSI model of computation

The AFT computations in this work assume a suitable VLSI model [4], [38], [34] in which the following assumptions hold: the chip is synchronous (propagation delay is independent of wire length), the chip input/output schedule is semilective-unilocal (each input is received only once, at a prespecified input port) and when- and where-determinate (the time and location of receiving an input or producing an output is prespecified, so as not to depend on the data or the running of the algorithm). Furthermore, it will be assumed that operations are performed under the bit-model of parallel computation. In this model, each processing node holds a constant number of bits, which is independent of the number of elements in the AFT, and is capable of performing only a constant number of bit-size operations (such as additions or scaling) in a single step. Under this model it is not possible to store the entire O(log N) bits representing one element in one node, nor is it possible to add or multiply two O(log N)-bit elements in one step [38]. The range of computation times (T) is restricted to [Ω(log N), O(√N log N)]. This range is determined from simple fan-in arguments, which show that T = Ω(log N), and from storage requirements under the semilective assumption, which show that the chip area is A = Ω(N log N) [19]. Thus, providing computation time greater than O(√N log N) does not result in a smaller area.

1.3 Theoretical AT² complexity

The performance of a VLSI model of computation can be measured by two parameters, chip area (A) and total execution time (T). A method of comparing VLSI designs is AT² analysis [35], [34]. For a certain type of transitive problems [40], a measure of the efficiency of a VLSI design can be given by its AT² value. In [34], Thompson related graph bisection width to minimum graph area and maximum graph data rate to prove the existence of minimum AT² values for the DFT and other classes of problems.
In [40] Vuillemin proved that transitive functions of degree I have a minimum AT² value of Ω(I²). The parameter I can be thought of as representing the extent of information transfer between the input and output. Classes of transitive problems include cyclic shifting, cyclic convolution, modulo multiplication, and linear transforms. If an AT² lower bound exists, it can serve as an absolute measure of comparison for all graphs describing an equivalent solution. Let s = N^d be the size of a problem instance with dimension d, and define (log s) to be the number of bits for problem representation [34]. The time parameter Tp will represent the throughput or minimum block pipeline period. The AT² lower bound for a linear transform of size s and word size (log s) (I = s log s) is Ω(s² log² s) [35]. An AT² lower bound can be derived by determining the transitivity value I [35] or by determining bounds from graph theory and information theory [34]. An AT² measure is computed by direct application of area and time values of the various functional units and interconnection stages for a particular design.

1.4 Motivation for the AFT

Efficient algorithms like the Fast Fourier Transform (FFT) that compute the DFT are able to reduce the sequential computation time. Taking the number of multiplications as an estimate of the computing effort, the FFT is more efficient than other methods. The table in Figure 1 shows a comparison of different designs in terms of Area, Time, Area-Time trade-offs and Delay [35].

DESIGN           AREA        TIME       AT²        DELAY
1-cell           N log N     N² log N   N⁵ log³ N  N² log N
N-cell           N log N     N log N    N³ log³ N  N² log N
N²-cell          N² log N    log N      N² log³ N  N² log N
Basic FFT        N log N     N log² N   N³ log⁵ N  N log² N
FFT Network      N²          log N      N² log² N  log² N
Perfect Shuffle  N²/log² N   log² N     N² log² N  log² N
Mesh             N log² N    √N         N² log² N  √N

Figure 1. Area-Time performance of the Fourier Transform-Solving Circuits (from [34]).
In this table the time performance (TIME) measures the circuit's throughput and the delay performance (DELAY) measures the time required for a complete computation of a stage. For instance, assuming that each addition takes O(1) time, a multiply-add computation can be done in O(log N) time. These bounds rely heavily on the I/O assumption that successive problem inputs and outputs must enter and leave the circuit on the same pins. This is the implicit convention used to prove Vuillemin's optimality result, i.e., AT² = Ω(N² log² N) [40]. With these assumptions, only the last 3 designs in this table are AT² optimal. The N²-cell DFT, the FFT network and the perfect-shuffle network do not have complicated routing steps, while the mesh implementation does have a routing scheme more complicated than the rest of the networks. The direct evaluation of the discrete Fourier transform requires N complex multiplications for each sample. Therefore, a total of N² multiplications are needed using this method. In contrast, the FFT requires N/2 multiplications for each stage. When N is a power of 2 the number of stages is log₂ N and the total number of complex multiplications is (N/2) log₂ N. So, for a given problem of size N the computational complexity is reduced from O(N²) to O(N log₂ N). The savings that result from decimating the sequence (interpreting the time sequence of length N as the sum of two interleaved sequences of half the length, over and over again) are enormous when N is large. For instance, for N=1024, 10 butterfly stages are required and a total of 5120 multiplications, compared to N² = 1024² multiplications for the direct evaluation, a ratio of approximately 200. The basic problem is that in the field of real-time signal processing the continuous flow of data, together with the complexity of the required computations, makes it difficult to design a general-purpose system that will perform a specific task efficiently.
On the other hand, if an efficient algorithm for a specific task is found, there must be a suitable match between the algorithm and its hardware implementation (VLSI layout) [37]. The computation of the FFT remains complicated and time-consuming. Even after the speed-up obtained, the number of complex multiplications needed, which is proportional to N log N, remains large and area-consuming. In terms of computational speed the benefits of the FFT are great, but for VLSI realization the fact that multiply operations are involved in the FFT algorithm makes the implementation difficult [22], [2], [9]. The Arithmetic Fourier Transform (AFT) algorithm and its variations appear to provide a good match between efficiency of the algorithm and realization in VLSI. That is why the AFT has lately received considerable attention. The AFT is an alternative way of computing the Discrete Fourier Transform (DFT) of a sequence of length N. The AFT has some properties that make it attractive for VLSI implementation and also others that are less desirable. The main advantage of this algorithm over the FFT is that the AFT does not require complex multiplications, as it uses only additions and a small number of real multiplications for scaling purposes. This reduction in the number of multiplications results in a reduction of the area required to compute the Fourier coefficients. Also, because the AFT does not involve multiplications by trigonometric functions, its accuracy is limited only by the A/D conversion of the input data and not by accumulation of rounding and coefficient errors, as it is in the FFT. However, the data dependencies in the algorithm are non-uniform. Therefore, non-equidistant inputs and assignments of Möbius function values are needed in the computation. In a straightforward implementation, this lack of uniformity in the data generally results in an increase of the total area.
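The multiplication counts quoted in this section can be checked with a few lines of arithmetic (illustrative only, not from the thesis):

```python
# Checking the operation counts for N = 1024: direct DFT needs N^2 complex
# multiplications, while the radix-2 FFT needs (N/2) * log2(N).
import math

N = 1024
direct = N * N                            # 1_048_576 multiplications
fft = (N // 2) * int(math.log2(N))        # 512 * 10 = 5_120 multiplications
assert direct == 1_048_576
assert fft == 5_120
print(f"ratio = {direct / fft:.1f}")      # roughly 200, as stated above
```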
1.5 Contributions

The contributions made in this thesis can be found mainly in Chapters Four and Five. This thesis proposes a class of efficient highly-parallel networks for the AFT computation. The major contribution is related to the development of a novel network called the Orthogonal Flat Tree (OFT) Network and its corresponding upper bound on the VLSI area. This network is constructed using layout rules stated in Chapter Four. The upper bound on the area of the OFT network is expressed through a theorem that states the AT² = O(N² log² N) figure. This bound is necessary to maintain the optimality of the AFT network. The interconnections of the AFT network are bit-serial, as bit-serial designs possess many outstanding advantages for high-speed VLSI signal processing architectures. Bit-serial designs are communication-efficient, are easily adapted to pipelined structures at the bit level, and have lower area requirements than bit-parallel designs [7]. Moreover, the simple designs of the OFT and butterfly networks are attractive when compared to other systolic arrays that achieve the minimum AT² value for a 1-D DFT. The second major contribution of this thesis deals with a number of upper bounds found in Chapter Five. To the best of the author's knowledge, no previous work has been reported concerning upper bounds on either the area or the time for this type of network. First, an upper bound on the area of the butterfly network that computes the Bruns averages is presented. This bound is important because it shows that a butterfly network can be used to compute the Bruns averages with an area that is optimal in terms of AT² trade-offs. Next, two bounds on the computational area of the two architectures considered are introduced. Finally, the sampling problem using multiple-input AFT butterflies is discussed and an upper bound on the area is given.
This other bound is important because it shows that the bottleneck of this algorithm takes place at the sampling stage, when up to 2N (N represents the size of the network) numbers need to be averaged for a single multiple-input AFT butterfly.

1.6 Outline of Thesis

Chapter Two introduces the reader to the basics of Fourier transforms used in Digital Signal Processing. It introduces the concept of the Discrete Fourier Transform and describes the Fast Fourier Transform algorithm. Finally, the Arithmetic Fourier Transform (AFT) and the Möbius formula are introduced. Chapter Three is dedicated to a survey of several implementations of the AFT, including the extended AFT VLSI architecture used throughout this work. Chapter Four presents a new architecture for Fourier coefficient computation, an architecture for the even and odd coefficients a_n and b_n, that is used in combination with the butterfly network of [31]. Also, the AFT butterfly is reviewed in detail and the foundations of the systolic array for Bruns computation are explored. Chapter Five deals with area-time complexity issues. It starts by giving a theoretical lower bound for the area-time complexity of the DFT. This will serve as a reference for the two designs considered in this work. Also, the area complexity for the general tree of butterfly structure is determined. Finally, the communication problem is seen in more detail and, as a solution, a formal proof of the upper bound of the Orthogonal Flat Tree network (OFT) is given. Chapter Six presents a VHDL simulation of the main stages involved in the computation of the AFT: the butterfly network that computes the Bruns averages and the Orthogonal Flat Tree that computes the Fourier coefficients. The behavior level of these two stages is described and the structure of some essential components is also shown.
Chapter 2
The Discrete and Arithmetic Fourier Transform

2.1 Conventional Fourier techniques for DSP

The Fourier Transform and the Z transform are of fundamental importance in signal processing. In particular, in digital image processing these linear transformations become indispensable. The one-dimensional Fourier transform of a complex function f(x) is defined as

F{f(x)} = ∫ f(x) e^(−j2πux) dx,

and the two-dimensional Fourier transform is defined analogously by

F{f(x,y)} = ∫∫ f(x,y) e^(−j2π(ux+vy)) dx dy,

with the integrals taken from −∞ to ∞. Much of the practice of digital signal processing is done in computers, where it is not possible to evaluate a continuum of frequencies w, nor can we input and store an infinite-duration sequence a(n). So computing Fourier Transforms analytically is not convenient. A new transform that uses only a finite number of the samples a(n) and computes a finite number of frequencies is defined as the Discrete Fourier Transform [29]. Given a sequence a(n) for all n, its Fourier transform is defined to be

A'(w) = A(e^(jwT)) = Σ_{n=−∞}^{∞} a(n) e^(−jwnT).

However, taking only N samples of a(n) and computing A'(w) at only these frequencies, the DFT can be defined by

A_N(k w₀) = Σ_{n=0}^{N−1} a(n) e^(−j n k w₀ T),

where w = k w₀, k = 0, 1, ..., N−1, and w₀ = 2π/(NT); or

A(k) = Σ_{n=0}^{N−1} a(n) W^(kn),  k = 0, 1, ..., N−1,

where W = e^(−j2π/N). The last expression shows clearly that the DFT is only a function of k and is simply an N-point sequence computed from an infinite sequence.

2.2 The computation of the DFT

The Fourier Transform has been analyzed in terms of computation and area complexity. Vuillemin [40] showed that an N-element Fourier Transform can be implemented in a circuit that has no better AT² performance than Ω(N² log² N). This theoretical limit suggests that a design must be evaluated in terms of how closely it approaches this optimal performance [38], [40].
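The DFT in the last expression above can be evaluated directly from its definition. The following minimal sketch (illustrative, not from the thesis) does so and checks two textbook identities:

```python
# Direct O(N^2) evaluation of the DFT as defined above:
# A(k) = sum over n of a(n) * W^(k n), with W = exp(-j 2 pi / N).
import cmath

def dft(a):
    N = len(a)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(a[n] * W ** (k * n) for n in range(N)) for k in range(N)]

# The DFT of an impulse is flat; the DFT of a constant is an impulse of height N.
impulse = dft([1, 0, 0, 0])
assert all(abs(x - 1) < 1e-9 for x in impulse)
constant = dft([1, 1, 1, 1])
assert abs(constant[0] - 4) < 1e-9
assert all(abs(x) < 1e-9 for x in constant[1:])
```

The nested sum makes the N² multiplication count of the direct method explicit, which is the cost the FFT (and, differently, the AFT) sets out to avoid.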
The direct computation of the DFT requires a total of N² multiplications when an N-element input vector is multiplied by an N×N matrix of constants C, resulting in an N-element output vector. Computing these multiplications in a single multiply-add cell takes O(N² log N) time. The area required is O(log N) for the multiply-add cell, O(log² N) for the processor supplying the constants and O(N log N) for the random-access memory containing the input and output registers; so the area is given by this last constraint, O(N log N). Its AT² performance is thus O(N⁵ log³ N). The performance is not as impressive as expected from the small amount of parallelism involved in the computation. Performing this computation on N cells instead of using a single cell results in a better AT² performance, namely O(N³ log³ N). Using N² cells this figure can be improved to O(N² log³ N), which is close to the optimal figure Ω(N² log² N) [34].

2.3 The FFT algorithm

The Fast Fourier Transform is an algorithm to compute DFT's efficiently. It was discovered by Cooley and Tukey in the mid-1960s and since its discovery many variations have been proposed. One of the approaches is called decimation in time and it is based on separating the sequence a(n) into two sequences of length N/2 comprised of the even- and odd-indexed samples, as shown below:

A(k) = Σ_{n=0}^{N−1} a(n) W^(kn)
     = Σ_{n even} a(n) W^(kn) + Σ_{n odd} a(n) W^(kn)
     = Σ_{m=0}^{N/2−1} a(2m) (W²)^(km) + W^k Σ_{m=0}^{N/2−1} a(2m+1) (W²)^(km),

or A(k) = A_e(k) + W^k A_o(k). The even and odd parts are periodic in k with period N/2, thus their values for k ≥ N/2 need not be recomputed, given those for 0 ≤ k < N/2. This idea is then applied repeatedly until only 2-point DFT's need to be computed. The initial 2-point DFT's require coefficients of only ±1, as shown in Figure 2.

Figure 2. Diagram of initial two-point DFT's for decimation-in-time FFT.
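The decimation-in-time recursion A(k) = A_e(k) + W^k A_o(k) derived above translates directly into a recursive program. This is an illustrative sketch (not code from the thesis), assuming N is a power of 2:

```python
# Minimal recursive decimation-in-time FFT following the derivation above.
import cmath

def fft(a):
    N = len(a)
    if N == 1:
        return a[:]
    Ae = fft(a[0::2])                # DFT of even-indexed samples
    Ao = fft(a[1::2])                # DFT of odd-indexed samples
    A = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * Ao[k]   # W^k * Ao(k)
        A[k] = Ae[k] + t
        A[k + N // 2] = Ae[k] - t    # periodicity: W^(k + N/2) = -W^k
    return A

# Check against the direct definition for an 8-point sequence.
a = [1, 2, 3, 4, 0, -1, -2, -3]
W = cmath.exp(-2j * cmath.pi / 8)
direct = [sum(a[n] * W ** (k * n) for n in range(8)) for k in range(8)]
assert all(abs(x - y) < 1e-9 for x, y in zip(fft(a), direct))
```

Each level of the recursion does N/2 complex multiplications over log₂ N levels, which is exactly the (N/2) log₂ N count discussed in Chapter One.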
Figure 3 shows the steps required in the computation of the decimation-in-time FFT.

[Flowchart: for b = log N downto 0: p = 2^b, q = N/p; for i = 0 to N−1: j = i mod q, k = reverse(i), ...]

Figure 3. Flowchart for the FFT decimation-in-time.

The decimation-in-time implementation of an 8-point FFT can be obtained with this flowchart, as shown in Figure 4.

Figure 4. Eight point decimation-in-time FFT.

The output DFT A(k) is in natural order, but the input array a(n) is not. The sequence a(n) is in bit-reversed order because the index sequence n in binary form is bitwise reversed. The FFT computed in this way requires log N stages, each of which requires up to N complex multiplications and N additions. So the order of complexity (time) of the FFT is O(N log N). This translates into an important saving in resources, because a direct DFT requires at most N·N complex multiplications and additions, so its order of computation is O(N²). Figure 5 shows the flowchart for the general FFT butterfly.

Figure 5. A 2 point FFT butterfly.

An important aspect of the FFT is that its computation can be performed in place in memory, because A_{i−1}(k) and A_{i−1}(l) can be replaced by A_i(k) and A_i(l), and the same will be valid for the next stage i+1. So stage i needs the contents of stage i−1 and not the contents of stage i−2 that were used to compute stage i−1. We have seen that the input indices are not in natural order. When the transpose of Figure 4 is taken, a decimation-in-frequency type of algorithm is obtained, and by substituting the twiddle coefficient W_N^r by (1/N) W_N^(−r) we can use a decimation-in-frequency algorithm to compute the inverse DFT (IDFT).
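The bit-reversed input ordering described above can be sketched concretely. This is illustrative code (not from the thesis); `bit_reverse` is a hypothetical helper name:

```python
# Bit-reversed ordering: index n is rewritten with its log2(N) bits reversed.
def bit_reverse(n, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (n & 1)   # shift the lowest bit of n into r
        n >>= 1
    return r

# Input ordering for the 8-point decimation-in-time FFT of Figure 4:
N, bits = 8, 3
order = [bit_reverse(n, bits) for n in range(N)]
assert order == [0, 4, 2, 6, 1, 5, 3, 7]
```

Note that the permutation is its own inverse (reversing the bits twice recovers n), which is what makes the in-place computation described above possible with a single reordering pass.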
It is possible to have quite a few different arrangements for both algorithms. For example, a slightly different topology for Figure 4 can be obtained with the input vector in order. This can also be done for an N-point decimation-in-frequency algorithm, and it is also possible to obtain an algorithm with both input and output vectors in order, but such an algorithm cannot be computed in place and therefore its computation is not as fast. A special rearrangement that is sometimes useful has to do with rearranging the topology of the network so as to have all stages with the same geometry, thus permitting sequential data accessing and storage. In other topologies, where the geometry of the second or third stage is different from the geometry of the first stage, a random type of access is required [30], [29].

2.3.1 Generalization of FFT algorithms for N a composite number

Working with time sequences that have length equal to a power of 2 is convenient because the entire FFT computation can be done using 2-point butterflies, as seen previously. But when the length of the sequence is not a power of 2, it is sometimes possible to pad the sequence with zeros to get a power of 2. However, many times this might not be convenient, so a combination of radices can be used.
If N denotes the length of the sequence, then N is written N = p₁·p₂···p_v, and with q₁ = p₂···p_v, N = p₁·q₁. Dividing the sequence into p₁ subsequences of q₁ samples each, by associating every p₁-th sample with a given subsequence, X(k) is written [29]:

X(k) = Σ_{n=0}^{N−1} x(n) W_N^(kn)
     = Σ_{r=0}^{q₁−1} x(p₁r) W_N^(p₁rk) + Σ_{r=0}^{q₁−1} x(p₁r + 1) W_N^k W_N^(p₁rk) + ... + Σ_{r=0}^{q₁−1} x(p₁r + p₁−1) W_N^((p₁−1)k) W_N^(p₁rk),

or

X(k) = Σ_{l=0}^{p₁−1} W_N^(lk) Σ_{r=0}^{q₁−1} x(p₁r + l) W_N^(p₁rk).

For instance, consider a 6-point DFT mixed-radix decimation in frequency resulting in 3 sequences of length 2: N = 6 = 3·2 = p₁·q₁, and

X(k) = Σ_{l=0}^{2} W₆^(lk) Σ_{r=0}^{1} x(3r + l) W₆^(3rk)
     = [x(0) + x(3) W₆^(3k)] + [x(1) + x(4) W₆^(3k)] W₆^k + [x(2) + x(5) W₆^(3k)] W₆^(2k).

The butterfly terms (in frequency) can then be drawn as in Figure 6. A total of N(p₁ − 1) + p₁q₁² complex multiplications and additions is required to compute X(k) when decomposing N into two factors p₁ and q₁. By decomposing q₁ further, the total number of multiplications is written as N(p₁ − 1) + N(p₂ − 1) + p₁p₂q₂², and when the sequence has been decomposed as much as possible, the number of complex additions and multiplications is given by N(p₁ + p₂ + ... + p_v − v). This can be expressed as a number of operations: since a complex multiplication takes 6 operations and a complex addition takes 4 operations, the number of operations for a fully decomposed sequence would be at most equal to 6N(p₁ + p₂ + ... + p_v − v).

2.3.2 Area-time Complexity of the FFT algorithm

The Fast Fourier Transform algorithm provides a great increase in efficiency even when a conventional uniprocessor is used. The AT² performance for a uniprocessor is O(N³ log⁵ N), which is not close to the optimal value, but it is a great improvement over the direct computation using a single processor. However, as seen previously, the computation of the FFT is more efficient when using multiple multiply-add cells in the decimation-in-time or decimation-in-frequency algorithms.
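The 6-point mixed-radix decomposition above can be verified numerically against the direct DFT definition (an illustrative sketch, not code from the thesis):

```python
# Check of the mixed-radix identity for N = 6 = p1*q1 = 3*2:
# X(k) = sum over l in 0..2 of W6^(l k) * sum over r in 0..1 of x(3r + l) W6^(3 r k).
import cmath

def dft(x):
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]

def mixed_radix_6(x):
    W6 = cmath.exp(-2j * cmath.pi / 6)
    return [sum(W6 ** (l * k) *
                sum(x[3 * r + l] * W6 ** (3 * r * k) for r in range(2))
                for l in range(3))
            for k in range(6)]

x = [1, -2, 3, 0, 5, 4]
assert all(abs(a - b) < 1e-9 for a, b in zip(mixed_radix_6(x), dft(x)))
```

The identity holds because every index n factors uniquely as n = 3r + l, so the double sum visits exactly the same terms as the single sum, only grouped by residue class.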
The bit-reversed order seems to minimize the amount of area required for interconnecting the rows or stages in the flow diagrams for either the decimation-in-time or the decimation-in-frequency algorithm. Laying out the connections between stages can be done in N/2 horizontal tracks between the first and second stages and in N/4 horizontal tracks between the second and third stages. In general, the connections emerging from the kth stage occupy N/2^k tracks. Straight vertical wires are used to connect cell i in the kth stage with cell i in the (k+1)th stage. The horizontal tracks are divided into 2^k equally sized pieces, then individually assigned to the long connection from each cell. The total number of horizontal tracks is then N-1. The number of vertical tracks depends on the width of the multiply-add cells, but it is possible to make them O(1) units tall and O(log N) units wide, so that the whole FFT network fits into a rectangular region O(N) units wide and O(N) units tall. The area complexity is then O(N^2), and because a new problem instance can enter the network as soon as the previous one has left the first stage, the pipelined time performance is O(log N), since there are log N stages in the network. Therefore, the AT^2 is O(N^2 log^2 N), which is optimal.

2.4 Logarithmic-time DFT Networks

Another approach for designing AT^2-optimal DFT computation networks, using shuffle permutation networks as well as orthogonal-tree networks, was reported in [1]. Going back to the definition of the DFT of a sequence x(i),

X(j) = \sum_{i=0}^{N-1} x(i) W_N^{ij} ,

it is possible to use a simple change of radix representation in order to express the one-dimensional DFT as a 2-dimensional or d-dimensional DFT, as shown below. Let N = N' N'' with i = i'' N' + i' and j = j'' + j' N''. With this notation the DFT can be expressed as a two-dimensional DFT, i.e.,

X(j', j'') = \sum_{i'=0}^{N'-1} W_{N'}^{i'j'} \left( W_N^{i'j''} \sum_{i''=0}^{N''-1} x(i'', i') W_{N''}^{i''j''} \right) .

The input can now be viewed as a two-dimensional N' \times N'' array of data. The computation of this DFT can be divided into 3 stages: the first stage performs an N''-point DFT on each column of the array, the second stage performs an element-by-element (twiddle-factor) multiplication, and the third performs an N'-point DFT on each row of the array. Likewise, by letting N = \prod_{i=0}^{d-1} N_i it is possible to treat the DFT as a multidimensional DFT that can be computed in d phases [1]. Figure 8 shows a network for the computation of the DFT using this technique, where N = N' \times N''. The orthogonal-tree networks are responsible for computing the N''-point DFT on each column of the array and the N'-point DFT on each row. Figure 7 shows a 4-point orthogonal-tree network used in the computation of a DFT; the size of the OTN depends on the number of points of the DFT.

Figure 7. 4-point DFT on an OTN (4x4).

In between computations, a rotation of indices is needed. The shuffle networks shown in Figure 8 are used to implement the rotation of indices (i'' and j'') that is needed between OTN stages.

Figure 8. A logarithmic-time DFT(N',N'') network.

The logarithmic-time DFT network shown in Figure 8 can be laid out in O(N^2) area and has O(log N) delay, so the network is optimal. Similar results are obtained when dealing with a multidimensional DFT; the AT^2 performance is larger only by a factor of d^3 (where d is the dimension of the DFT). The area of the DFT network can be reduced by reducing the area of both the row- and column-DFT networks and the shuffle networks.
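The three-stage decomposition (column DFTs, twiddle-factor multiplication, row DFTs) can be verified with a small sketch; the helper names below are illustrative, not from [1]:

```python
import cmath

def dft(x):
    """Direct N-point DFT for reference."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]

def dft_row_column(x, Np, Npp):
    """Three-stage DFT with N = Np*Npp, i = ipp*Np + ip, j = jpp + jp*Npp."""
    N = Np * Npp
    WN = cmath.exp(-2j * cmath.pi / N)
    # Stage 1: N''-point DFT over i'' for each fixed i' (column DFTs).
    cols = [[sum(x[ipp * Np + ip] * cmath.exp(-2j * cmath.pi * ipp * jpp / Npp)
                 for ipp in range(Npp)) for jpp in range(Npp)] for ip in range(Np)]
    # Stage 2: element-by-element twiddle multiplication by W_N^{i'j''}.
    tw = [[WN ** (ip * jpp) * cols[ip][jpp] for jpp in range(Npp)]
          for ip in range(Np)]
    # Stage 3: N'-point DFT over i' for each fixed j'' (row DFTs).
    X = [0j] * N
    for jpp in range(Npp):
        for jp in range(Np):
            X[jpp + jp * Npp] = sum(
                tw[ip][jpp] * cmath.exp(-2j * cmath.pi * ip * jp / Np)
                for ip in range(Np))
    return X

x = [float(v) for v in [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]]
Xd = dft(x)
Xrc = dft_row_column(x, Np=3, Npp=4)
assert all(abs(a - b) < 1e-9 for a, b in zip(Xd, Xrc))
```

The agreement confirms that the index mapping i = i''N' + i', j = j'' + j'N'' turns the 1-D DFT into the 2-D row-column computation with the twiddle stage in between.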
A technique described in [1] allows the implementation of a d-dimensional DFT on N points in O(Q log N) time and O(N^2/Q^2) area, where 1 \le Q \le \sqrt{N/\log N}. In [1] the index-mapping scheme used to reduce the area of the shuffle networks to O(N^2/Q^2) is explored in more detail. Basically, the N/Q input elements contained in a column are mapped into blocks that form rows. Networks of this type are called folding networks, and they are part of the permuter networks used to shuffle the data before and after the DFT arithmetic performed by the orthogonal-tree networks. So, for a constant d, the resulting network can compute the d-dimensional DFT using area O((N^2 \log^2 N)/T^2) in time T \in [\Omega(\log N), O(\sqrt{N \log N})], which is optimal [1].

2.5 Fourier Analysis

Let a real function A(t) be a finite Fourier series of period T. A(t) has the form

A(t) = a_0 + \sum_{n=1}^{N} a_n \cos(n \omega_0 t) + \sum_{n=1}^{N} b_n \sin(n \omega_0 t) ,

where a_0 is the zeroth harmonic defined by

a_0 = \frac{1}{T} \int_0^T A(t)\,dt , \qquad \text{and} \qquad \bar{A}(t) = A(t) - a_0 .

Shifting \bar{A}(t) by an amount \alpha T gives

\bar{A}(t + \alpha T) = \sum_{n=1}^{N} c_n(\alpha) \cos(n \omega_0 t) + \sum_{n=1}^{N} d_n(\alpha) \sin(n \omega_0 t) ,

where

c_n(\alpha) = a_n \cos 2\pi n\alpha + b_n \sin 2\pi n\alpha , \qquad d_n(\alpha) = -a_n \sin 2\pi n\alpha + b_n \cos 2\pi n\alpha .

The Mobius inversion formula is used in the computation of the Arithmetic Fourier Transform. This formula uses the last expression of the function \bar{A}(t) to invert the series form of \bar{A}(t + \alpha T). The Mobius function \mu(n) is defined by

\mu(n) = \begin{cases} 1 & \text{if } n = 1 , \\ (-1)^r & \text{if } n = p_1 p_2 \cdots p_r \text{ where the } p_i \text{'s are distinct primes} , \\ 0 & \text{if } p^2 \mid n \text{ (divides) for some prime } p . \end{cases}

Now assume that n is a positive integer and f(n) is different from zero in the interval 1 \le n \le N, with f(n) = 0 for n > N. Let

g(n) = \sum_{m=1}^{\lfloor N/n \rfloor} f(mn) , \qquad \text{then} \qquad f(n) = \sum_{k=1}^{\lfloor N/n \rfloor} \mu(k)\, g(kn) .

2.6 The Bruns Arithmetic Fourier Transform

The function A(t) of period T can be expressed in terms of c_n(\alpha).
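The Mobius function and the inversion formula can be exercised directly; the following sketch (illustrative helper names, arbitrary test values for f) checks that the inversion recovers f exactly:

```python
def mobius(n):
    """mu(1)=1; mu(n)=(-1)^r for n a product of r distinct primes; 0 if p^2 | n."""
    if n == 1:
        return 1
    r, m, p = 0, n, 2
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0          # square factor divides n
            r += 1
        p += 1
    if m > 1:
        r += 1                    # one remaining prime factor
    return (-1) ** r

# Mobius inversion check: with f(n) nonzero only for 1 <= n <= N and
# g(n) = sum_{m=1}^{floor(N/n)} f(m*n), we recover f(n) = sum_k mu(k)*g(k*n).
N = 24
f = {n: n * n - 3 * n + 7 for n in range(1, N + 1)}   # arbitrary integer test values
g = {n: sum(f[m * n] for m in range(1, N // n + 1)) for n in range(1, N + 1)}
for n in range(1, N + 1):
    recovered = sum(mobius(k) * g[k * n] for k in range(1, N // n + 1))
    assert recovered == f[n]
```

The check is exact in integer arithmetic, mirroring the finite-series form of the inversion used by the AFT.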
The coefficients c_n(\alpha) are computed using the Mobius inversion formula for a finite series (see Theorem 4 of [32]):

c_n(\alpha) = \sum_{i=1}^{\lfloor N/n \rfloor} \mu(i)\, S(ni, \alpha) ,

where \mu(i) is the Mobius function and S(n,\alpha) is the nth average defined by

S(n, \alpha) = \frac{1}{n} \sum_{m=0}^{n-1} \bar{A}(mT/n + \alpha T) .

The Fourier coefficients a_n and b_n of the Fourier series A(t) are computed by (see Theorem 5 of [32]):

a_n = c_n(0) \qquad \text{and} \qquad b_n = (-1)^m c_n\!\left(1/2^{k+2}\right) , \quad \text{where } n = 2^k(2m+1) .

In this form, a_n and b_n require different computational complexity. However, using Bruns' original form of the AFT makes it possible to compute the even and odd coefficients a_n and b_n each with a single matrix of similar complexity. Reed et al. [31] showed that, using Bruns' original form, the even and odd Fourier coefficients can be computed by equations (2.2) and (2.3) below (see Theorem 4 of [31]). A new average, the 2nth Bruns alternating average, is defined by the following equation:

B(2n, \alpha) = \frac{1}{2n} \sum_{m=0}^{2n-1} (-1)^m A(mT/2n + \alpha T) . \qquad (2.1)

Therefore,

a_0 = \frac{1}{T} \int_0^T A(t)\,dt ,

a_n = \sum_{i=1,3,5,\ldots}^{\lfloor N/n \rfloor} \mu(i)\, B(2ni, 0) , \qquad (2.2)

and

b_n = \sum_{i=1,3,5,\ldots}^{\lfloor N/n \rfloor} \mu(i)\, (-1)^{(i-1)/2}\, B\!\left(2ni, \frac{1}{4ni}\right) , \quad \text{for } n = 1, 2, \ldots, N . \qquad (2.3)

Using equations (2.2) and (2.3) it is possible to develop a new extended AFT algorithm (equivalent to Bruns' original method) whose architecture is suitable for VLSI implementation. This extended AFT can be used to compute the convolution of any two finite sequences of real numbers and, as a consequence, to perform digital filtering.

Chapter 3 Survey of AFT architectures

3.1 Systolic Arrays for the AFT

In Figure 9, a front-end systolic AFT accumulator is shown. A single multiplier is used in the design, since the outputs of the systolic array emerge at different times from each of the nodes. An additional arithmetic unit is used to compute the mean; this mean can be calculated by summing the input data and shifting the final result one bit to the right (right-shifting).
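Equations (2.2) and (2.3) can be checked end-to-end on a test signal whose coefficients are known. The sketch below (T = 1 assumed; the test signal and all function names are illustrative, not from the thesis) samples A(t), forms the alternating averages of (2.1), and applies the Mobius sums:

```python
import math

def mobius(n):
    if n == 1:
        return 1
    r, m, p = 0, n, 2
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:
                return 0
            r += 1
        p += 1
    return (-1) ** (r + (1 if m > 1 else 0))

def bruns_average(A, two_n, alpha):
    """Eq. (2.1): B(2n,alpha) = (1/2n) sum_{m<2n} (-1)^m A(m*T/2n + alpha*T), T=1."""
    return sum((-1) ** m * A(m / two_n + alpha) for m in range(two_n)) / two_n

def aft_coefficients(A, N):
    """a_n and b_n for n = 1..N via eqs. (2.2) and (2.3)."""
    a, b = {}, {}
    for n in range(1, N + 1):
        odd_i = range(1, N // n + 1, 2)
        a[n] = sum(mobius(i) * bruns_average(A, 2 * n * i, 0.0) for i in odd_i)
        b[n] = sum(mobius(i) * (-1) ** ((i - 1) // 2)
                   * bruns_average(A, 2 * n * i, 1.0 / (4 * n * i)) for i in odd_i)
    return a, b

# Test signal with known coefficients (period T = 1); a_0 = 3 is cancelled by
# the alternating sum, so no zero-mean correction is needed.
coeffs = {1: (2.0, -1.0), 2: (-1.5, 0.5), 3: (0.5, 2.0)}
A = lambda t: 3.0 + sum(an * math.cos(2 * math.pi * n * t)
                        + bn * math.sin(2 * math.pi * n * t)
                        for n, (an, bn) in coeffs.items())
a, b = aft_coefficients(A, 3)
for n in range(1, 4):
    assert abs(a[n] - coeffs[n][0]) < 1e-9 and abs(b[n] - coeffs[n][1]) < 1e-9
```

Note that no multiplications beyond the trivial signs of \mu(i) and the 1/2n scaling appear, which is the property the VLSI architectures in the following chapters exploit.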
In [16], it is shown that the area of this systolic design is

A_{mult} + L\{A_{add} + L(\log L + 2) A_{placell}\} + A_{mux} + A_{demux} ,

where A_{mult} is the area required for multiplication, A_{add} the area required for addition, A_{placell} the cell area, and A_{mux}, A_{demux} the multiplexer and demultiplexer areas. The throughput of this AFT design, defined by its minimum block pipeline period, is

T_p = \max(\text{input accumulation, scaling, output accumulation}) + T_{comm} = \max(T_{add\gamma}, T_{mult\gamma}, T_{add}) + T_{comm} ,

where T_{add\gamma} is the time required for input accumulation, T_{add} the time required for output accumulation, and T_{mult\gamma} the time required for scaling.

Figure 9. Systolic I AFT [16].

Figure 10. A signal flow-graph representation for matrix multiplication.

By orthogonal projection of the signal flow-graph representation of Figure 10, the second systolic array for AFT processing, shown in Figure 11, is obtained. The data samples are input to each of the different systolic nodes, and the output of the signal flow graph can appear serially from a single end node. The respective area of this design is

A_{mult} + R(2L + 1) A_{add\gamma} + L\{A_{add} + L(\log L + 2) A_{placell}\} + A_{mux} + A_{demux} .

The corresponding throughput of this design is

R(2L + 1) \cdot \max\{T_{add\gamma}, T_{mult\gamma}, T_{add}\} + T_{comm} .

Figure 11. Systolic II AFT processor [16].

3.2 A Modular Linear Array Architecture

A modular linear array has been proposed in [28]. This linear array uses the algorithm of Reed et al. [32], and the design is suitable for VLSI implementation. One of the advantages of this array is the way in which the non-equidistant input requirements are handled. Proper activation of PEs and a suitable I/O schedule are fundamental to the design. To compute 2N+1 Fourier coefficients, two sets of N PEs and a multiplier are employed.
All the data and control signals move from left to right in the array. The design requires fixed I/O bandwidth with the host. In addition, the time for computing the 2N+1 Fourier coefficients is 5N + \lceil N/2 \rceil. Therefore, the design achieves an O(N) speed-up over the O(N^2) figure that is characteristic of this type of architecture [18]. This is one of the important contributions of this design, as is its modularity (all PEs are either of Type I or Type II). The authors claim that the area occupied by the design is reduced significantly compared to other conventional FFT/DFT architectures. The architecture is shown in Figure 12. Group 1 contains PEs of Type I, all cascaded. In between the two sets of PEs, a multiplier is used to scale the Bruns averages that are computed in group 1. This scaling requires scaling factors stored in an external memory Ms. After scaling, the Bruns averages are passed to group 2 along with the Mobius function values stored in an external memory Mm. Finally, group 2 contains PEs of Type II, also cascaded. This group of PEs takes the scaled Bruns averages and the Mobius coefficients and computes the Fourier coefficients a_n and b_n according to equations (2.2) and (2.3). The linear array works with data channels of different speeds (slow and fast channels are denoted by S* and F* respectively) to transport data between adjacent PEs. In the slow channels, data takes two or three time units to pass through each PE. Fast channels do not have any delay elements and take only one time unit to traverse a PE.

Figure 12. The linear array for computing the AFT of [28].

The PEs of group 1 contain numerous registers, a comparator, an adder and a couple of XORs.
Group 2 PEs contain an adder, four XORs and a similar number of registers, plus additional registers to introduce the delay into the slow channels. The structure of the PEs of Type I and II, as well as their operation, is described in detail in [28].

3.3 Extended AFT VLSI Architecture

The extended AFT algorithm [31] uses the Bruns alternating averages, defined previously by equation (2.1). This makes it possible to compute the Fourier coefficients a_n and b_n each with a single matrix of similar complexity. Therefore, the computational complexity of the even and odd coefficients is identical, unlike in earlier algorithms [32]. This algorithm does not require complex multiplications, which is one of the obstacles in developing faster Fourier transform algorithms.

The extended AFT VLSI architecture proposed by Reed et al. consists of three stages, as seen in Figure 14: the first performs the sampling using zero-order interpolation; the second computes the Bruns alternating averages using butterfly networks that resemble the FFT butterflies used in constructing an FFT network; finally, the third stage computes the even and odd Fourier coefficients a_n and b_n using equations (2.2) and (2.3). In this architecture the Bruns averages B(2n,\alpha) are used instead of the S(n,\alpha) averages used in the two systolic arrays I and II. The Bruns average B(2n,\alpha) differs from S(n,\alpha) in that it uses only an even number of samples. This overcomes the requirement for using a zero-mean function [31], which is a time-consuming process. As a consequence, this method has a better throughput than the method in [32], which is important for designing real-time parallel signal processors. The number of additions needed to calculate the alternating averages (in the second stage) is O(N^2). This results in an approximately 25% reduction in the number of additions [32],[31].
3.4 An iterative realization of the AFT algorithm

An iterative algorithm for the computation of the AFT has been proposed by Tufts and Chen [36]. This algorithm overcomes the difficulty of the dense, Farey-fraction sampling inherent in the original AFT algorithm presented by Tufts and Sadasiv. This iterative AFT is intended for cases in which the function to be analyzed can only be sampled uniformly and at a rate close to the Nyquist rate, or in which dense frequency-domain samples are needed. The steps involved in the computation are outlined below, and a block diagram of the computation is shown in Figure 13:

1. Specify the maximum tolerance E_m in the squared norm of the error vector e_k, or specify a maximum number of iterations.
2. Calculate the initial frequency-domain vector X_0 = Px, where x is the signal vector.
3. Synthesize the time-domain signal vector \hat{x}_k = AX_k using the AFT filter.
4. Calculate the error signal vector e_k = x - \hat{x}_k and its squared norm E_k.
5. Update the frequency-domain vector X_{k+1} = X_k + 2\alpha A^T e_k, where \alpha is the step factor of the update.
6. Repeat steps 3-5, incrementing the iteration index k, until satisfactory convergence (E_k \le E_m) has been achieved or the maximum number of iterations has been completed.

Figure 13. Block diagram of the computation of the iterative AFT [36].

Based on the properties of the AFT matrix A and the difference matrix P, the computations of A^T e_k and Px can be implemented with only a few multiplications by using a permuted difference coefficient (PDC) structure (see [36]). PDC is equivalent to the mathematical formulation known as summation by parts, which is the finite-difference analog of the integration-by-parts reformulation of an integral [5].
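Steps 2-6 above amount to a gradient-style loop on the squared error. The sketch below uses toy stand-ins, not the matrices of [36]: a small invertible symmetric matrix for the AFT filter A, and P = A^T/4 as a coarse initializer in place of the actual difference matrix; the step factor alpha is chosen small enough for convergence.

```python
def mat_vec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def iterative_aft(A, P, x, alpha, E_max, max_iter):
    """Steps 2-6: X0 = P x; repeat x_hat = A X_k, e_k = x - x_hat,
    X_{k+1} = X_k + 2*alpha*A^T e_k until ||e_k||^2 <= E_max."""
    At = transpose(A)
    X = mat_vec(P, x)                                  # step 2
    E = float("inf")
    for _ in range(max_iter):
        x_hat = mat_vec(A, X)                          # step 3: synthesis
        e = [xi - xh for xi, xh in zip(x, x_hat)]      # step 4: error vector
        E = sum(ei * ei for ei in e)
        if E <= E_max:                                 # step 6: convergence test
            break
        corr = mat_vec(At, e)                          # step 5: update
        X = [Xi + 2 * alpha * ci for Xi, ci in zip(X, corr)]
    return X, E

# Toy invertible "AFT filter" (hypothetical; chosen only so the loop converges).
A = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
P = [[0.25 * A[j][i] for j in range(4)] for i in range(4)]   # coarse initializer
x = [2.0, -1.0, 0.5, 3.0]
X, E = iterative_aft(A, P, x, alpha=0.05, E_max=1e-12, max_iter=5000)
assert E <= 1e-12      # A X reproduces x to tolerance
```

The iteration is the steepest-descent minimization of ||x - AX||^2, which is why it converges whenever 2*alpha is below 2 divided by the largest eigenvalue of A^T A.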
This reduced number of multiplications, the high degree of parallelism and the resulting regular architecture allow the VLSI implementation of this algorithm.

Figure 14. VLSI Architecture for the AFT (from [31]).

Chapter 4 A VLSI architecture for the AFT

4.1 A systolic array with butterfly structure

The three stages of the extended AFT architecture were studied in Chapter 3. In this section we examine the butterfly networks proposed by Reed et al. [31] to compute the Bruns 2nth-point alternating average, which uses only an even number of samples. The computation of B(2n,\alpha) can be decomposed as follows:

B(2n, \alpha) = W(n, \alpha) - W(2n, \alpha) \qquad \text{and} \qquad W(2n, \alpha) = \frac{1}{2}\left[ W(n, \alpha) + W\!\left(n, \alpha + \frac{1}{2n}\right) \right] ,

where W(n,\alpha) is the nth-point average of the n values A(mT/n + \alpha T) of A(t), defined in the same way as the average S(n,\alpha) in Chapter 3, i.e.,

W(n, \alpha) = \frac{1}{n} \sum_{m=0}^{n-1} A(mT/n + \alpha T) , \qquad -1 < \alpha < 1 .

This decomposition allows computing the alternating averages with a butterfly-type structure, as shown in Figure 15.

Figure 15. The AFT butterfly.

A systolic array based on this butterfly can be constructed, where the averages W(2n,\alpha) and B(2n,\alpha) are computed at the same butterfly level. This results in a reduction of the number of additions [31]. Figure 16 shows the computation of a few alternating averages.

Figure 16. Example of a butterfly tree for calculating the Bruns alternating averages B(2,0), B(4,0), B(8,0).

In the next section a VLSI architecture for computing equations (2.2) and (2.3) is developed. This architecture can be used together with the first and second stages of the extended AFT.
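The butterfly decomposition can be checked numerically against the direct alternating average of equation (2.1). A minimal sketch, assuming T = 1 and an arbitrary test signal (both illustrative, not from the thesis):

```python
import math

def W_direct(A, n, alpha):
    """n-point average W(n,alpha) = (1/n) sum_{m<n} A(m*T/n + alpha*T), T = 1."""
    return sum(A(m / n + alpha) for m in range(n)) / n

def butterfly(A, n, alpha):
    """One AFT butterfly: from W(n,.) produce W(2n,alpha) and B(2n,alpha)."""
    w_lo = W_direct(A, n, alpha)
    w_hi = W_direct(A, n, alpha + 1 / (2 * n))
    W2n = 0.5 * (w_lo + w_hi)     # W(2n,a) = [W(n,a) + W(n, a + 1/2n)] / 2
    B2n = w_lo - W2n              # B(2n,a) = W(n,a) - W(2n,a)
    return W2n, B2n

def B_direct(A, two_n, alpha):
    """Alternating average, eq. (2.1)."""
    return sum((-1) ** m * A(m / two_n + alpha) for m in range(two_n)) / two_n

A = lambda t: 1.7 + math.cos(2 * math.pi * t) - 0.3 * math.sin(6 * math.pi * t)
for n in (1, 2, 4, 8):
    W2n, B2n = butterfly(A, n, 0.0)
    assert abs(W2n - W_direct(A, 2 * n, 0.0)) < 1e-12
    assert abs(B2n - B_direct(A, 2 * n, 0.0)) < 1e-12
```

Each level doubles n, so a chain of these butterflies yields B(2,0), B(4,0), B(8,0), ... as in Figure 16, reusing the W values instead of recomputing the sums.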
4.2 An Optimal layout for the Fourier coefficient computation

4.2.1 Introduction

In this section the problems of implementing the Fourier coefficients a_n and b_n are described. As introduced in Chapter 2, a_n and b_n are defined by

a_n = \sum_{i=1,3,5,\ldots}^{\lfloor N/n \rfloor} \mu(i)\, B(2ni, 0)

and

b_n = \sum_{i=1,3,5,\ldots}^{\lfloor N/n \rfloor} \mu(i)\, (-1)^{(i-1)/2}\, B\!\left(2ni, \frac{1}{4ni}\right) \quad \text{for } n = 1, 2, \ldots, N ,

which can be computed by vector-matrix multiplications. However, due to the nature of \mu(i), these coefficients can be computed as sums and subtractions by binary trees. The problem with this regular binary-tree architecture is the large area required for communication (broadcasting of the Bruns averages). As discussed in Chapter 5, the area complexity of the computation using a binary-tree architecture is of the order of O(N log^2 N), which is acceptable, but the area for broadcasting is large and results in a non-optimal area (larger than N^2). As an alternative, the following section presents a different way of combining trees of different sizes and heights. This modification results in a reduction of the computational area and, especially, in a reduction of the area for communication, allowing a near-optimal figure for the total area.

4.2.2 Architecture for a_n and b_n computation

In this section an efficient way of computing the even and odd coefficients is proposed. We will analyze the computation of the even harmonics, since the odd harmonics have the same complexity. A systolic structure able to compute the alternating averages with a reduced number of additions was described in section 4.1; now an architecture that takes these averages as inputs and outputs the Fourier coefficients will be studied. The basic structure of the proposed network is illustrated in Figure 17, and will be referred to as the orthogonal flat trees (OFT) network, for reasons that will become apparent in the following discussion. Briefly, the term 'flat' refers to the fact that most of the trees in the network have a constant height.
The so-called OFT network is formed by two types of trees:

Computational Trees: a collection of N tree-based networks, labeled T_1, T_2, ..., T_N, such that network T_j is responsible for computing the coefficient a_j according to the equation

a_j = \sum_{i=1,3,5,\ldots}^{\lfloor N/j \rfloor} \mu(i)\, B(2ij, 0) .

Broadcast Trees: a collection of N tree-based networks, labeled B_1, B_2, ..., B_N, such that network B_k is responsible for broadcasting the Bruns average B(2k,\alpha) to a subset of the computational trees.

Figure 17. Basic structure of an OFT network of size N=38.

The OFT network is constructed by laying out all computational trees horizontally and all broadcast trees vertically, such that each vertical tree B_k and horizontal tree T_j share at most one node, as shown in Figure 17. Note that tree T_j has \lfloor N/j \rfloor leaves, where leaf i, 1 \le i \le \lfloor N/j \rfloor, multiplies the incoming Bruns average B(2ij,\alpha) by \mu(i). Since \mu(i) \in \{1,-1,0\}, the leaf passes either B(2ij,\alpha) or its complement to the internal nodes, or sends a 0. All internal nodes of the tree are bit-serial adders.

In Chapter 2, the even coefficients a_j were defined as above, where N is the size of the input/output sequence, the B(2ij,0) are the alternating averages, and \mu(i) is the Mobius function defined before as

\mu(n) = \begin{cases} 1 & \text{if } n = 1 , \\ (-1)^r & \text{if } n = p_1 p_2 \cdots p_r \text{ with distinct primes } p_i , \\ 0 & \text{if } p^2 \mid n \text{ (divides) for some prime } p . \end{cases}

Because the Mobius function takes only the values 0, +1 and -1, the coefficients a_j are only sums/subtractions of alternating averages, which can be computed by addition trees as shown in Figure 18. The plus or minus sign of each average is determined by \mu(i); in some cases an alternating average is replaced by a zero (e.g., B(18) in Figure 18 is replaced by a zero, since \mu(9) = 0).
A computational tree T_j computes a_j and has \lfloor N/j \rfloor leaves, each leaf corresponding to a Bruns average B(2ij,\alpha) that needs to be multiplied by \mu(i). Figure 18 shows the structure of the computational tree for a_1 when N=16.

Figure 18. Structure of the computational tree for a_1 when N=16.

Figure 19. Layout showing the different computational trees.

The layout of the computational trees obeys the following rules:

X1: The first N/log N trees are laid out as binary trees with the usual height; therefore, each T_j for j = 1, 2, ..., N/log N can be laid out in O((N/j) log(N/j)) area.

X2: The remaining trees T_j, j > N/log N, are laid out as linear trees with at most log N leaves, as shown in Figure 20. Since most of these trees consist of only a few leaves, it would not be practical to lay them out as binary trees.

X3: In either case, the ith leaf of T_j is placed at vertical track 2ij. Note that all leaves (from distinct computational trees) placed on vertical track 2ij will receive the Bruns average B(2ij,\alpha).

Figure 20. Flat tree layouts: a) flat binary tree, b) flat binary broadcast tree, c) a two-level flat broadcast tree.

The layout of the broadcast trees is more complicated, because the number of leaves of each tree is given directly by the number of divisors, which is a difficult problem in the field of number theory. However, some important results concerning the number of divisors are known: for most values of k, 1 \le k \le N, the number of odd divisors of k is of the order O(log^c N), where c = 1 + \epsilon (\epsilon > 0 but very small) [11]. It follows from this result that the number of leaves of most broadcast trees B_k (leaves on a single vertical track k) is also O(log^c N).
Each of these broadcast trees would need a height of O(c log log N) if implemented as a binary tree. Therefore, a total of O(N log log N) tracks would be needed, and a corresponding area of order O(N^2 log log N), which is not optimal. However, it is possible to keep the height of these trees bounded by a constant if a more efficient layout is used. Next, another set of layout rules is derived:

Y1: If the number of leaves of a tree B_k is less than or equal to log N, then lay out B_k as a flat tree whose nodes are connected to all leaves (of computational trees) on vertical track k. In this case, B_k is basically a linear connection of bit-register cells (i.e., a distributed shift register).

Y2: If the number of leaves of the tree B_k is between (log N)^{c-1} and (log N)^c, then lay out B_k as a flat tree of c levels (thus, such a tree requires c vertical tracks). The top level (level 1) consists of a flat tree (i.e., a linear connection) of log N register cells, each connected to a flat tree of size log N at the level below. The construction is applied iteratively down to level c, where there are (log N)^{c-1} flat trees, each of size at most log N. The nodes of the flat trees at level c are connected to all leaves (of computational trees) on vertical track k.

Y3: A number of broadcast trees may have a number of leaves which is not bounded by any constant power of log N, and may even be close to N^{1/c}, where c > 1 is a constant. However, it will be shown in the next chapter that the number of such trees is O(N/log N). Lay out such networks as binary trees of height O(log N); thus, each such tree requires O(log N) vertical tracks.

Figure 21. Layout showing the different broadcast trees.
Chapter 5 Area-Time Complexity for the AFT

5.1 Area complexity for the general tree of butterfly structure

In Chapter 4, a VLSI architecture for the AFT with butterfly structure was introduced. This architecture is composed of three main stages: sampling, Bruns averages computation, and Fourier coefficient computation. In this section, the area complexity of the second stage of the systolic array is analyzed, namely, the order of the area required to compute the Bruns averages using the butterfly network covered in Chapter 4. First, the order of the required number of AFT butterflies is found. The total number of AFT butterflies required in a butterfly tree network of size N is given by

\sum_{i=1,3,5,\ldots} \; \sum_{j=0}^{Card(\alpha_i)-1} 2^j ,

where Card(\alpha_i) is defined as the number of elements in the set \alpha_i, and the set \alpha_i is the group i that contains all the Bruns averages to be computed in the same butterfly tree. For example, for a given N all the powers of 2 need to be computed in one group, and all the averages that are multiples of 6 need to be computed in another group. In general, Card(\alpha_i) can be computed as follows:

Card(\alpha_i) = \begin{cases} 1 + \log N & \text{for } i = 1 , \\ \lfloor \log(N/i) \rfloor & \text{for } i = 3, 5, 7, \ldots \end{cases}

For instance, when N=32, the first two groups \alpha_1 and \alpha_3 (for multiples of 2 and 6 respectively) have cardinalities

Card(\alpha_1) = Card\{2, 4, 8, \ldots, 64\} = 1 + \log 32 = 6 , \qquad Card(\alpha_3) = Card\{12, 24, 48\} = \lfloor \log(32/3) \rfloor = 3 .

The above expression for the number of AFT butterflies can be developed further so as to find the area complexity of the network:

\text{Number of AFT butterflies} = \sum_{j=0}^{\log N} 2^j + \sum_{i=3,5,7,\ldots} \; \sum_{j=0}^{\lfloor \log(N/i) \rfloor - 1} 2^j
= (2N - 1) + \sum_{i=3,5,7,\ldots} \left( 2^{\lfloor \log(N/i) \rfloor} - 1 \right)
\le (2N - 1) + O(N \log N) = O(N \log N) .
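The cardinality formula and the resulting butterfly count can be checked directly; `card` and `num_butterflies` below are illustrative names for a sketch of the counting argument:

```python
from math import floor, log2

def card(i, N):
    """Card(alpha_i): 1 + log2(N) for i = 1, floor(log2(N/i)) for odd i >= 3."""
    return 1 + int(log2(N)) if i == 1 else floor(log2(N / i))

def num_butterflies(N):
    """Total AFT butterflies: sum over odd i of sum_{j=0}^{Card(alpha_i)-1} 2^j."""
    total = 0
    for i in range(1, N + 1, 2):
        total += sum(2 ** j for j in range(card(i, N)))   # geometric sum = 2^Card - 1
    return total

# The N = 32 example from the text:
assert card(1, 32) == 6     # Card{2,4,8,...,64} = 1 + log2(32)
assert card(3, 32) == 3     # Card{12,24,48} = floor(log2(32/3))
# Growth stays within the O(N log N) bound (constant 4 chosen for illustration):
for N in (16, 32, 64, 128):
    assert num_butterflies(N) <= 4 * N * log2(N)
```

For N = 32 the count works out to 80 butterflies, comfortably below the 2N - 1 + O(N log N) bound derived above.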
The area required for an AFT butterfly is constant, or O(c), because the number of operations required to compute the butterfly is also constant, namely, two additions and two scaling operations if the scaling is done at each stage, and otherwise only two additions. Therefore, the area taken by all the AFT butterflies is of the order O(cN log N). So far only AFT butterflies of order 2 have been examined; but because the first stage of each group requires a different number of samples to be added, these cases need to be taken into consideration separately. The area complexity O(cN log N) above does not take into account the area of the butterflies in the first stages of each group. This type of butterfly will be referred to as an n-AFT butterfly, to distinguish it from the normal AFT butterfly. Figure 22 shows the structure of the n-AFT butterfly; note that the only differences between this butterfly and the normal butterfly are the number of samples that need to be added at the left side of the butterfly and the scaling by a factor of 1/n at each of the left-most nodes of the n-AFT.

Figure 22. An n-AFT butterfly showing two addition nodes (of weight n) and two scaling operations.

Figure 23 shows an example of a butterfly tree where the leftmost stage is composed of n-AFT butterflies of weight 3. This means that each butterfly in this stage has three inputs (rather than a single one, like the regular AFT butterfly) and the scaling factor is 1/3.

Figure 23. A butterfly tree network with a leftmost stage of weight 3.

Next, the complexity of these butterflies is examined. In the following analysis, the scaling complexity of the left-most nodes in the n-AFT butterfly is not considered; later, it is shown how to add an extra stage after the Bruns averages computation to take care of the scaling steps.
Therefore, the area of this first level of n-AFTs can be expressed as

\text{Area} = \sum_{i=1,3,5,\ldots}^{N} \left[ 2b(i) \log 2b(i) + 2 \text{ additions} \right] = N \text{ additions} + 2S(N) , \qquad S(N) = \sum_{i=1,3,5,\ldots}^{N} b(i) \log 2b(i) .

The value of b(i) in the expression for S(N) is assumed to be a power of 2. This assumption was made in order to explore the area required if these additions were implemented with binary trees; later, it is shown that taking b(i) = i does not change the total area significantly. Now S(N) is decomposed as

S(N) = 2^2 \log_2 2^2 + 2^3 \log_2 2^3 + 2^3 \log_2 2^3 + 2^4 \log_2 2^4 + \cdots
     = \sum_{j=0}^{\log(N/8)} 2^j \left[ 2^k \log_2 2^k \right] , \quad k = j + 2 ,
     = \sum_{j=0}^{\log(N/8)} \left( j\,2^{2j+2} + 2^{2j+3} \right) = 4 \sum_{j=0}^{\log(N/8)} j\,4^j + 8 \sum_{j=0}^{\log(N/8)} 4^j = 4Y + 8X .

The second series, X, is a geometric series of computational complexity O(N^2), as shown below:

X = \sum_{j=0}^{\log(N/8)} 4^j = \frac{4^{\log(N/8)+1} - 1}{3} = \frac{4(N/8)^2 - 1}{3} = O(N^2) .

However, the series Y is not O(N log N) or O(N^2), as one might expect from its similarity to the series \sum j 2^j. In fact, Y has a computational complexity equal to O(N^2 log N), as shown below; note, though, that given the closed form of Y, for relatively small values of N its contribution is less than or comparable to O(N^2). The series Y can be computed using the integration-by-parts method of finite calculus [10], namely

\sum u \Delta v = uv - \sum (v+1) \Delta u ,

where u, v are two discrete variables and the delta operator is the difference operator defined as \Delta u(x) = u(x+1) - u(x). Therefore, using this method, the antidifference is

\sum x\,4^x \, \delta x = \frac{x}{3} 4^x - \frac{4}{9} 4^x + C .

Finally, adding the limits to this series yields

Y = \sum_{k=0}^{n} k\,4^k = \frac{n}{3}\,4^{n+1} - \frac{4}{9}\left(4^n - 1\right) , \qquad n = \log\frac{N}{8} ,

and since 4^n = (N/8)^2, Y = O(N^2 \log N) .
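The closed form produced by the summation-by-parts step can be cross-checked against direct evaluation. In integer form, (n/3)4^{n+1} - (4/9)(4^n - 1) equals ((3n-1)4^{n+1} + 4)/9; the following sketch verifies both the identity and the O(N^2 log N) growth bound (the constant 1 in the bound is illustrative):

```python
from math import log2

def Y_direct(n):
    """Direct evaluation of Y = sum_{k=0}^{n} k * 4^k."""
    return sum(k * 4 ** k for k in range(n + 1))

def Y_closed(n):
    """Closed form from summation by parts: ((3n - 1) * 4^(n+1) + 4) / 9."""
    return ((3 * n - 1) * 4 ** (n + 1) + 4) // 9

# The two expressions agree exactly for all small n.
for n in range(0, 12):
    assert Y_direct(n) == Y_closed(n)

# With n = log2(N/8), Y grows like N^2 log N.
for N in (16, 32, 64, 128, 256):
    n = int(log2(N // 8))
    assert Y_closed(n) <= N * N * log2(N)
```

The exact agreement for every n confirms the antidifference computed above, and the final loop shows the claimed N^2 log N scaling once n = log(N/8) is substituted.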
Substituting these expressions for X and Y back into S(N) yields S(N) = O(N^2 \log N). It is clearly seen that S(N) is not of order N^2 for all values of N. Next, the area required for the other N additions has to be determined. If O(log log N) time is used for addition, a corresponding O(log N log log N) area is required. Therefore, the final area is given by

\text{Area} = N \cdot O(\log N \log\log N) + O(N^2 \log N) = O(N \log N \log\log N) + O(N^2 \log N) = O(N^2 \log N) .

Note that even reducing the time for addition down to O(1), with a parallel structure, the final area complexity is not changed as long as the area complexity for the partial additions is of the order O(N log N).

5.2 Number of alternating averages I_{a_j} for a given coefficient a_j

The number of alternating averages required for a given j determines the size of the required structure; therefore, it is essential to have an expression for this parameter. Let I_{a_j} be the number of alternating averages required for a_j. For instance,

I_{a_1} = \text{number of } B(2i,0)\text{'s} = \sum_{i=1,3,5,\ldots}^{N} 1 = \frac{N}{2} , \qquad I_{a_2} = \text{number of } B(4i,0)\text{'s} = \sum_{i=1,3,5,\ldots}^{N/2} 1 = \frac{N}{4} .

In general, the number of alternating averages required for a given j can be written as

I_{a_j} = \begin{cases} \frac{1}{2}\lfloor N/j \rfloor & \text{when } \lfloor N/j \rfloor \text{ is even} , \\ \frac{1}{2}\left(\lfloor N/j \rfloor + 1\right) & \text{when } \lfloor N/j \rfloor \text{ is odd} , \end{cases}

because when m = \lfloor N/j \rfloor is odd, we can split the set \{1, 2, 3, \ldots, m\} into two sets A and B with A = \{1, 3, 5, \ldots, m-2, m\} and B = \{2, 4, 6, \ldots, m-1\}; therefore Card(A) = Card(B) + 1 = \frac{m-1}{2} + 1 = \frac{m+1}{2}. This can also be written as

I_{a_j} = \left\lceil \frac{\lfloor N/j \rfloor}{2} \right\rceil ,

which is the final expression for the number of required alternating averages for a given coefficient a_j. Because carry-save-adder trees are used to implement the computations of the even and odd Fourier coefficients, the area of the tree computing a_j is

\text{Area}_j = I_{a_j} \log_2 I_{a_j} .
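The case analysis above amounts to I_{a_j} = ceil(floor(N/j)/2), which can be checked by direct counting of the odd indices in the sum for a_j (an illustrative sketch):

```python
def I_a(j, N):
    """Number of alternating averages for coefficient a_j:
    the count of odd i with 1 <= i <= floor(N/j), i.e. ceil(floor(N/j)/2)."""
    m = N // j
    return (m + 1) // 2          # m/2 when m is even, (m+1)/2 when m is odd

N = 16
for j in range(1, N + 1):
    direct = sum(1 for i in range(1, N // j + 1) if i % 2 == 1)
    assert I_a(j, N) == direct
assert I_a(1, N) == N // 2       # I_{a_1} = N/2
assert I_a(2, N) == N // 4       # I_{a_2} = N/4
```

The direct count confirms the closed form for every j, including the odd-floor cases handled by the set-splitting argument.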
Then, the total area is equal to the sum over all the trees for j \in [1..N], or

    Area = \underbrace{I_{a_1}\log_2 I_{a_1}}_{\text{first tree}} + \cdots + \underbrace{I_{a_j}\log_2 I_{a_j}}_{j\text{th tree}} + \cdots + \underbrace{I_{a_N}\log_2 I_{a_N}}_{N\text{th tree}} = \sum_{j=1}^{N} I_{a_j}\log_2 I_{a_j}.

It was shown that I_{a_1} = N/2 and I_{a_2} = N/4. At the other end, \lfloor N/j \rfloor \le 2 for every j > N/3, so I_{a_j} = \lceil \lfloor N/j \rfloor / 2 \rceil = 1 and the corresponding trees contribute nothing to the sum. Only the first third of the trees is needed, and the above expression for the area can be written as

    Area = \sum_{j=1}^{N/3} I_{a_j}\log_2 I_{a_j}.

5.3 Upper bound on the total number of leaves of computational trees

In this section we find an upper bound on the total number of leaves of the computational trees. First, a lower bound for the area of the computational trees is determined. In Figure 24 we can see how fast I_{a_j} (the number of leaves of the jth tree) decreases. It can easily be proved that A_{f_0} \le A_f and A_{f_0} = \Omega(N\log_2 N):

    A_{f_0} = \frac{N}{2}\log_2\frac{N}{2} + \frac{N}{4}\log_2\frac{N}{4} + \frac{N}{8}\log_2\frac{N}{8} + \cdots = N\log_2 N - 2N = N(\log_2 N - 2) = \Omega(N\log_2 N),

and therefore A_f = \Omega(N\log_2 N). In the same manner, an upper bound for the area can be obtained:

    A_f \le N\cdot\frac{N}{2}\log_2\frac{N}{2} = O(N^2\log_2 N).

However, this upper bound is not tight and will not be useful for our comparison in terms of AT^2, so a better upper bound needs to be found. Next, the actual area for N = 16, 32 and 64 is examined to get an idea of the true complexity of the area.

Figure 24. Area required by the tree configuration. (The figure lists the per-tree areas A_1 = 0.5N\log(N/2), A_2 = A_3 = 0.25N\log(N/4), A_4 = A_5 = 0.125N\log(N/8), \ldots, down to single-node trees of area 1, with A_f = A_1 + A_2 + \cdots + A_N.)

For N = 16:

    A_f = \frac{N}{2}\log_2\frac{N}{2} + 2\cdot\frac{N}{4}\log_2\frac{N}{4} + 2\cdot\frac{N}{8}\log_2\frac{N}{8} + \underbrace{1 + \cdots + 1}_{11} = 11 + \frac{5}{4}N\log_2 N - \frac{9}{4}N = 55.

Carrying out the same evaluation gives A_f = 162 for N = 32 and A_f = 486 for N = 64, the constant terms (the single-leaf trees) being 22 and 44, respectively.

Note that the expression for A_f for N = 16, 32, 64 has the form aN\log_2 N - bN + K, where the coefficients a and b are almost constant. The coefficient K roughly doubles as N is doubled, but this is less important, as in terms of complexity this coefficient disappears. From these few cases it can be concluded that A_f is O(aN\log N - bN + K), where K = N - \lfloor N/3 \rfloor \le \lceil 2N/3 \rceil is the number of single-leaf trees, with the values of the coefficients a and b always less than \log N.

Therefore, if a < \log N and b < \log N for all N:

    A_f = aN\log_2 N - bN + K < N\log_2^2 N - N\log_2 N + \frac{2N}{3} = O(N\log_2^2 N).

Before this upper bound can be used, the assumption that the coefficients a and b are indeed less than \log N would need to hold for all N, and this is not easy to prove. Instead, an approximation of A_f can be found using some relationships derived from the floor and ceiling functions that appear in the expression of I_{a_j}. The sum of the tree bases is

    \sum_{j=1}^{N} I_{a_j} = \sum_{j:\ \lfloor N/j \rfloor\ \text{even}} \frac{1}{2}\left\lfloor\frac{N}{j}\right\rfloor + \sum_{j:\ \lfloor N/j \rfloor\ \text{odd}} \frac{1}{2}\left(\left\lfloor\frac{N}{j}\right\rfloor + 1\right).

Therefore, in order to approximate this sum, it can be assumed that half of the \lfloor N/j \rfloor are even and the other half odd. While this is not exactly the case for most N, the number of odd values in the first half of the set \{\lfloor N/j \rfloor : j = 1,\ldots,N/2\} is very small, as shown below for N = 16 and N = 32 (note the few odd values, all near the front of each list):
    L_{32} = \{32, 16, 10, 8, 6, 5, 4, 4, 3, 3, 2, 2, 2, 2, 2, 2, 1, \ldots, 1\},
    L_{16} = \{16, 8, 5, 4, 3, 2, 2, 2, 1, \ldots, 1\}.

It is known from the definition of the floor function that

    \sum_{j=1}^{N} I'_{a_j} = \sum_{j=1}^{N} \frac{1}{2}\left\lfloor\frac{N}{j}\right\rfloor \le \sum_{j=1}^{N} \frac{N}{2j} = \frac{N}{2}H_N = \frac{N}{2}\left(\ln N + \gamma + O(1)\right) = O(N\log_2 N).

In the above derivation, Euler's approximation of H_N [33] is used, where H_N represents the harmonic number H_N = \sum_{i=1}^{N} \frac{1}{i} = \ln N + \gamma + O(1/N). This shows that the total number of leaves is O(N\log N), which was expected since the total computational area will be proven later to be of order O(N\log^2 N).

5.4 A regular binary tree based architecture and an upper bound for the computational area

In this section it will be shown why an architecture based on regular binary trees results in a computational area of the order O(N\log^2 N), and why the broadcasting area is large with this type of implementation. First, to get an approximation of the area, consider

    Area = \sum_j I'_{a_j}\log_2 I'_{a_j}.

Using the approximation I'_{a_j} = N/2j from section 5.3, the area can be written as

    Area = \sum_{1\le j\le N/3} \frac{N}{2j}\log_2\frac{N}{2j} = \frac{N}{2}(\log_2 N - 1)\sum_{1\le j\le N/3}\frac{1}{j} - \frac{N}{2}S_{N-1}, \qquad S_{N-1} = \sum_{1\le j\le N/3}\frac{\log_2 j}{j}.

The series S_{N-1} can be computed using the Euler summation formula [33]. This formula estimates a series by its principal term plus as many correction terms as desired, followed by a remainder R_m, which has to converge for the chosen m:

    \sum_{1\le j<n} f(j) = \int_1^n f(x)\,dx + \sum_{k=1}^{m}\frac{B_k}{k!}\,f^{(k-1)}(x)\Big|_1^n + R_m.

The coefficients B_k are the Bernoulli numbers and the f^{(k-1)} are the (k-1)th derivatives of f(x). With f(x) = \frac{\log_2 x}{x} and m = 4, S_{N-1} can be approximated using the Bernoulli numbers B_1 = -1/2, B_2 = 1/6, B_3 = 0, B_4 = -1/30.
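The harmonic-number bound used in this section is easy to verify numerically. The sketch below (an illustration, not thesis code) checks both the inequality and the accuracy of Euler's approximation.

```python
import math

# Check sum_j floor(N/j)/2 <= (N/2)*H_N, with H_N = ln N + gamma + O(1/N).
EULER_GAMMA = 0.5772156649015329

def H(n):
    return sum(1.0 / i for i in range(1, n + 1))

for N in (100, 1000, 10000):
    lhs = sum((N // j) / 2 for j in range(1, N + 1))
    assert lhs <= (N / 2) * H(N)
    # Euler's approximation is accurate to O(1/N):
    assert abs(H(N) - (math.log(N) + EULER_GAMMA)) < 1.0 / N
```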
With further manipulation of S_{N-1}, the following result is obtained:

    S_{N-1} \approx \frac{\ln 2}{2}\log_2^2\left(\frac{N}{3} - 1\right) + R_4 \quad\text{as } N \to \infty,

which leads to S_{N-1} = O(\log_2^2 N). This last result holds because R_4(n) approaches a definite limit for very large n:

    \lim_{n\to\infty} R_4(n) = -\int B_4(\{x\})\,\frac{f^{(4)}(x)}{4!}\,dx = R_4(\infty).

Now the area can be expressed in O-notation as follows:

    Area = O\!\left(\frac{N}{2}H_{N/3}(\log_2 N - 1)\right) + \frac{N}{2}\,O(\log_2^2 N) = O(N\log_2^2 N).

Therefore, it has been shown that the area required for the computation of the Fourier coefficients using a regular binary tree architecture is O(N\log_2^2 N).

So far the area has been approximated by \sum I'_{a_j}\log_2 I'_{a_j} with I'_{a_j} = \frac{N}{2j}, but the actual base of the jth tree is I_{a_j} = \lceil \lfloor N/j \rfloor / 2 \rceil \ge I'_{a_j} for all j \in [1..N]. Next, the complexity of \sum_{1\le j\le N} I_{a_j} is compared with that of \sum_{1\le j\le N} I'_{a_j}. From the definitions of the floor and ceiling functions, the following relations can be derived. Let p = \lfloor N/j \rfloor; then

    p \le \frac{N}{j} < p + 1, \qquad (5.1)
    \frac{p}{2} \le I_{a_j} = \left\lceil\frac{p}{2}\right\rceil < \frac{p}{2} + 1, \qquad (5.2)
    \frac{p}{2} \le \frac{N}{2j}. \qquad (5.3)

From (5.1) and (5.3),

    I_{a_j} < \frac{N}{2j} + 1 = I'_{a_j} + 1. \qquad (5.4)

Figure 25. I_{a_j} compared with the earlier approximations \frac{1}{2}(N/j) and p/2 + 3/2.

Finally, using relation (5.4) and adding terms:

    \sum_{1\le j\le N} I_{a_j} \le \sum_{1\le j\le N}\left(\frac{N}{2j} + 1\right) = \frac{N}{2}H_N + N = O(N\log_2 N + N) = O(N\log_2 N).

This last result is consistent with the earlier approximation of the area, i.e., Area = O(N\log_2^2 N), and therefore it is now possible to find the complexity of the more accurate expression for the area using the closed expression for the number of alternating averages I_{a_j}.
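The Euler-summation estimate for the series of \log_2 j / j terms can be spot-checked numerically. The sketch below (illustrative Python, not thesis code) compares partial sums with the principal term (\ln 2 / 2)\log_2^2 M.

```python
import math

# Principal-term check: sum_{2<=j<=M} log2(j)/j ~ (ln 2 / 2) * (log2 M)^2.
def S(M):
    return sum(math.log2(j) / j for j in range(2, M + 1))

for M in (10**3, 10**5):
    principal = (math.log(2) / 2) * math.log2(M) ** 2
    # the relative error shrinks as M grows; a modest tolerance suffices here
    assert abs(S(M) - principal) / principal < 0.1
```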
Using (5.4) on Area = \sum_{1\le j\le N/3} I_{a_j}\log_2 I_{a_j}:

    Area \le \sum_{1\le j\le N/3}\left(\frac{N}{2j} + 1\right)\log_2\left(\frac{N}{2j} + 1\right)
          \le \sum_{1\le j\le N/3}\left(\frac{N}{2j} + 1\right)\log_2\frac{N}{j}
          \le \frac{N}{2}H_{N/3}\log_2 N + \frac{N}{3}\log_2 N,

where the middle inequality uses \frac{N}{2j} + 1 \le \frac{N}{j} for j \le N/2, and the last step drops the negative -\log_2 j terms. Therefore, the area in O-notation is:

    Area = O(N\log_2^2 N + N\log_2 N) = O(N\log_2^2 N).

Finally, it has been shown that the area needed to compute the even Fourier coefficients with precomputed alternating averages B(2ij,0) and external Möbius coefficients \mu(i) has a complexity of O(N\log_2^2 N). This demonstrates that most of the area required by this circuit is used by the communication part of the network.

5.5 A two-level tree architecture and an improved upper bound for the computational area

In section 5.4, it was shown that the area required for computation is relatively small when a straightforward regular binary tree architecture is used. In this section, the combined architecture of the OFT network of Chapter Three, which uses trees of different heights and sizes, is revisited. A new upper bound for this combined network of computational and broadcasting trees is also introduced. This new bound will show that the broadcasting area is the dominating factor: the computational area in the OFT is small, yet the total area remains unchanged. The proof that the total area of this network is O(N^2) involves the use of non-trivial number-theory results related to the distribution of divisors and prime factors. This will be seen in detail in section 5.7, along with the problem of communication.
A complexity analysis similar to the one done in section 5.4 is necessary to prove that the computational area required by this two-level tree network is smaller than the computational area required by the regular binary tree network of section 5.4. In Chapter Three, we saw that the first computational trees take most of the area and the remaining trees take a smaller area. So first the complexity of the area of the first N/\log N trees will be found. As in section 5.4, an upper bound on the area can be obtained using the closed expression for the number of alternating averages I_{a_j}:

    Area \le \sum_{j=1}^{N/\log N} \left(\frac{N}{2j} + 1\right)\log_2\left(\frac{N}{2j} + 1\right)
          \le O(N\log^2 N - N\log N\log\log N) + O(N\log N) + O(N)
          \le O\!\left(N\log N\left(\log\frac{N}{\log N} + 1\right)\right)
          = O\!\left(N\log N\log\frac{N}{\log N}\right).

Now the N - \frac{N}{\log N} trees of height O(1), as shown in Figure 21, require an area given by

    Area \le \left(N - \frac{N}{\log N}\right)\Delta_N, \qquad \Delta_N = \frac{N}{\frac{N}{\log N} + 1} < \log N + 1 = O(\log N),

so that

    Area \le O(N)\,O(\log N) = O(N\log N).

Using this scheme for the layout of the trees, the area of the first N/\log N trees is larger than the area of the other trees, which in comparison is actually very small. In conclusion, using the above layout rules to construct the network results in a reduction of the area of the computational trees. This reduction, if not great, is important, as it illustrates the relatively small area needed for actual computation versus the much larger area needed for communication.
With the regular binary tree based architecture we obtained an upper bound equal to N\log_2^2 N; the improved upper bound with the new layout is

    Area of computational trees \le O\!\left(N\log N\log\frac{N}{\log N} + N\log N\right).

This bound is smaller than the N\log^2 N figure, but still larger than N\log N. Therefore, upper and lower bounds for the computational area can be stated:

    N\log N \le Area \le N\log^2 N.

5.6 Communication Problem

In this chapter it was shown that the area required for computing the 2N Fourier coefficients is not larger than N\log_2^2 N. However, it was assumed that every binary tree had its inputs ready and that none of the inputs were shared among the different trees computing the a_j. Actually, most of the alternating averages B(2ij,0) are shared among different trees. Without global communication we are forced to connect these averages to the inputs of each tree. These broadcasting trees occupy an additional area that results in the O(N^2) figure of the OFT network considered in Chapter Three.

First, consider the sum of the number of divisors of each j in the interval [1..N], where N represents the number of coefficients, i.e.,

    \sum_{i=2,4,6,\ldots}^{2N}\ \sum_{\substack{j=1,3,5,\ldots \\ j \mid i/2}} 1.

This is necessary because the number of leaves connected to a given tree depends directly on the number of divisors of each j. The inner sum in the last expression is complicated, but the whole expression can be bounded simply by the number of divisors of each integer k:

    \sum_{i=2,4,6,\ldots}^{2N}\ \sum_{\substack{j=1,3,5,\ldots \\ j \mid i/2}} 1 \le \sum_{k=1}^{N} \tau(k).

The function \tau(k) gives the number of divisors of the integer k (not to be confused with Euler's totient \varphi(k)). Several approximations to this function can be used to estimate the sum, starting with the crude estimate \sum_{k=1}^{n}\tau(k) = O(n^2), of the same order as the asymptotic approximation of the sum of the totients, \sum_{k\le n}\varphi(k) \sim \frac{3}{\pi^2}n^2 [14].
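Dirichlet's estimate for the divisor sum, quoted in this section, can be verified numerically. The sketch below is illustrative (the constant 4 in the remainder bound is a generous assumption) and computes the divisor sum via the identity \sum_{k\le n}\tau(k) = \sum_{d\le n}\lfloor n/d \rfloor.

```python
import math

# Dirichlet: sum_{k<=n} tau(k) = n(ln n + 2*gamma - 1) + R(n), |R(n)| <= A*sqrt(n).
EULER_GAMMA = 0.5772156649015329

def divisor_sum(n):
    # hyperbola identity: count pairs (d, m) with d*m <= n
    return sum(n // d for d in range(1, n + 1))

for n in (10**3, 10**4, 10**5):
    estimate = n * (math.log(n) + 2 * EULER_GAMMA - 1)
    assert abs(divisor_sum(n) - estimate) < 4 * math.sqrt(n)
```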
Fortunately, better estimates of the function \sum_{k=1}^{n}\tau(k) exist. Dirichlet [8] gave an asymptotic approximation for this sum,

    \sum_{k=1}^{n}\tau(k) = n(\ln n + 2\gamma - 1) + R(n), \quad\text{with } |R(n)| \le A\sqrt{n},

and Voronoi [11] improved the remainder, showing that |R(n)| \le c\,n^{1/3}\log n. Although \sum_{k=1}^{n}\tau(k) has been proven to be O(N\log N), this only gives an idea of the total number of leaves required. In section 5.7, the architecture of Chapter Three is proven to have optimal area, taking into account both the broadcasting and the computational area.

5.7 Upper bound on the OFT network

The main objective of this section is to establish the O(N^2) upper bound on the area of an OFT of size N constructed according to layout rules X1-X3 and Y1-Y3 of Chapter Three. In computing the total area of the OFT network, it should be noted (see Figure 17) that the length of the vertical dimension L is determined by the total height of the computational trees, and the width of the horizontal dimension W is determined by the total height of the broadcast trees. Let h(T_j) denote the height of tree T_j and h(B_j) the height of tree B_j; then, from layout rules (X1) and (Y1) of the previous section, an upper bound can easily be computed as follows:

    L \le \sum_{j=1}^{N} \left[h(T_j) + 1\right] = O(N).

However, the proof that

    W \le \sum_{j=1}^{N} \left[h(B_j) + 1\right] = O(N) \qquad (5.5)

is more complicated, and in this section the details behind this bound on the horizontal dimension W of the network are examined. For the time being, (5.5) is assumed to be true, so the fact that L = O(N) and W = O(N) leads to the following result.

Theorem 1: The area A of an OFT network of size N is O(N^2), and its total time delay T is O(\log N + w), where w is the word length of each input. Thus, when w = O(\log N) bits, the OFT network has AT^2 = O(N^2\log^2 N), which is optimal.

Proof: The upper bound on the area follows from the results of this section, particularly from Theorem 3.
The O(\log N) bound on time follows from the Chapter Three layout rules X1-X3 and Y1-Y3, where the distance from a leaf to the root of any tree is always O(\log N). []

Note that the evaluation of the function h(B_j) in (5.5) depends on the number of leaves \lambda(B_j) of the tree B_j, which determines which layout rule to use in constructing the tree. Recall from layout rules Y1 and Y2 for broadcast trees that all trees with O((\log N)^c) leaves can be laid out as flat trees of constant height, i.e., h(B_j) = c. On the other hand, layout rule Y3 is applied to trees with N^{1/c'} leaves, where c' is a constant, and thus h(B_j) = (1/c')\log N in this case. Now, let

    F_1(N,c) = \{k : 1 \le k \le N,\ \lambda(B_k) \le (\log N)^c\},
    F_2(N,c) = \{k : 1 \le k \le N,\ \lambda(B_k) > (\log N)^c\},

where c is a constant. From layout rules Y1 and Y2 it is obvious that \sum_{k\in F_1} h(B_k) = O(N). The central aim of the following discussion is to prove that the cardinality of the set F_2 is relatively small. In particular, it will be shown that F_2 contains only O(N/\log N) indices, so that the total height contributed by all broadcast trees B_k with k \in F_2 is O(N), i.e.,

    \sum_{k\in F_2} h(B_k) = O(N). \qquad (5.6)

Lemma 1: The number of leaves \lambda(B_k) of tree B_k, 1 \le k \le N, is equal to the number of odd divisors of the integer k.

Proof: Recall that each tree B_k broadcasts the Bruns average B(2k,a) to all trees T_j such that ij = k and i = 1, 3, 5, \ldots. The subset of trees T_j whose index j satisfies this condition constitutes the set D_o(k) = \{j : j \in Z^+,\ j = k/i,\ i \text{ odd}\}. Since j takes integer values only when i \mid k (i divides k evenly), the cardinality (number of elements) of D_o(k) is equal to the number of odd divisors of k, and this completes the proof. []

Next, define D(N,c) as

    D(N,c) = \text{card}\,\{k : 1 \le k \le N,\ d(k) > (\ln N)^c\},

where c > 1 is a constant and d(k) denotes the number of divisors of k. Note that |F_2| \le D(N,c), and thus an upper bound on D(N,c) will also be an upper bound on |F_2|.
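Lemma 1 can be illustrated in a few lines of Python (an illustration, not thesis code): enumerating the trees T_j fed by B_k reproduces the odd-divisor count.

```python
# lambda(B_k): trees T_j with j = k/i for odd i, i.e. one leaf per odd divisor of k.

def odd_divisors(k):
    return [d for d in range(1, k + 1, 2) if k % d == 0]

def leaves(k):
    return {k // i for i in range(1, k + 1, 2) if k % i == 0}

for k in range(1, 500):
    assert len(leaves(k)) == len(odd_divisors(k))

# e.g. k = 12 has odd divisors 1 and 3, so B_12 feeds exactly two trees.
```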
The fundamental theorem of arithmetic [13] states that any positive integer k can be uniquely represented as a product of primes, i.e., k = p_1^{r_1} p_2^{r_2} \cdots p_q^{r_q} (where the p_i are distinct primes). Then \omega(k) denotes the number of distinct primes that divide k, and \Omega(k) denotes the total number of primes that divide k, counted according to multiplicity. With the above notation, \omega(k) = q and \Omega(k) = r_1 + r_2 + \cdots + r_q. In addition, among other useful results from number theory, it is well known that the number of divisors is given by [11], [12], [17], [23]

    d(k) = (1 + r_1)(1 + r_2)\cdots(1 + r_q), \quad\text{and}\quad 2^{\omega(k)} \le d(k) \le 2^{\Omega(k)}.

Also, the function

    E(N) = \sum_{p \le N,\ p \in E} \frac{1}{p},

where p is a prime number, plays a role in deriving upper bounds on \omega(k,E) and \Omega(k,E), as we will see in the following lemma by R. R. Hall [13] and theorem by Norton [27]. Although a considerable number of authors have investigated the distribution of the function \Omega(k) (see [13], [27]), it remains true that the local distribution of \Omega(N) is approximately Poisson [13], [14].

Lemma 2 ([13],[27]): Let E be an arbitrary non-empty subset of primes with p_0 = \min\{E\}; then, uniformly for 0 \le m \le (p_0 - \varepsilon)E(N), we have

    R_m = \text{card}\,\{k : 1 \le k \le N,\ \Omega(k,E) = m\} \le \frac{c(p_0,\varepsilon)\,N\,e^{-E(N)}E(N)^m}{m!},

where c(p_0,\varepsilon) is a positive constant depending on \varepsilon and p_0 but independent of N.

This bound is important, as an upper bound on d(k) can be obtained by establishing an upper bound on \Omega(k). Norton generalized this result in the following theorem.

Theorem 2 ([27], Theorem 1.21): Let E be a non-empty subset of primes and define S(N,y;E,\Omega) = \text{card}\,\{k : 1 \le k \le N,\ \Omega(k,E) \ge y\}. If N \ge e^e and 0 \le y \le (\ln N)/(\ln 2) - 1, then

    S(N,y;E,\Omega) \le c\,2^{-y}N(\ln N)(\ln\ln N)^{1/2},

where c is a constant.

Theorem 3: If in Theorem 2, y = c_1\ln\ln N, where c_1 > 2/\ln 2, then S(N,y;E,d) \le c_3 N/\log N, where c_3 is a constant.
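The divisor-count bounds 2^{\omega(k)} \le d(k) \le 2^{\Omega(k)} quoted above are easy to confirm by brute force (an illustrative sketch, not thesis code):

```python
# omega(k): distinct prime factors; Omega(k): prime factors with multiplicity;
# d(k) = prod(1 + r_i).  Then 2^omega <= d <= 2^Omega, since 2 <= 1 + r <= 2^r.

def factor(k):
    exps, p = {}, 2
    while p * p <= k:
        while k % p == 0:
            exps[p] = exps.get(p, 0) + 1
            k //= p
        p += 1
    if k > 1:
        exps[k] = exps.get(k, 0) + 1
    return exps

for k in range(2, 3000):
    exps = factor(k)
    d = 1
    for r in exps.values():
        d *= 1 + r
    assert 2 ** len(exps) <= d <= 2 ** sum(exps.values())
```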
Proof: If in Theorem 2, y = c_1\ln\ln N with c_1 > 2/\ln 2, then

    S(N,y;E,\Omega) \le c\,2^{-c_1\ln\ln N}N(\ln N)(\ln\ln N)^{1/2} = c\,N(\ln N)^{1-c_1\ln 2}(\ln\ln N)^{1/2}.

Since c_1 > 2/\ln 2 implies 1 - c_1\ln 2 < -1, the factor (\ln N)^{1-c_1\ln 2}(\ln\ln N)^{1/2} is O(1/\ln N), and hence

    S(N,y;E,\Omega) \le c_2 N/\log N,

where c_2 is also a constant. Now, since d(k) \le 2^{\Omega(k,E)}, it can be proved that S(N,y;E,d) \le S(N,y;E,\Omega), where S(N,y;E,d) = \text{card}\,\{k : 1 \le k \le N,\ d(k) \ge 2^y\}.

Implications: Note that when y = c_1\ln\ln N, then 2^y = (\ln N)^{c_1\ln 2}, so that

    S(N, c_1\ln\ln N; E, d) = \text{card}\,\{k : 1 \le k \le N,\ d(k) \ge (\ln N)^{c_1\ln 2}\} = D(N, c_1\ln 2),

and the above theorem states the fundamental bound we were seeking on the number of integers in \{1, 2, \ldots, N\} that have a large number of divisors. Also, since |F_2(N,c_1)| \le D(N,c_1), then

    \sum_{k\in F_2} h(B_k) = O(N).

This proves (5.5), and the O(N^2) upper bound on the area (Theorem 1) follows immediately.

Chapter 6
Simulation

6.1 Introduction

The VLSI architecture for the AFT described in Chapter Four is analyzed, and HDL (hardware description language) models of the components are implemented. The purpose behind this VHDL [21] simulation is to demonstrate the feasibility of the extended architecture proposed in Chapter Four. Behavioral HDL models are written using concurrent statements, which are used in architecture bodies to describe regular structures. VHDL is a comprehensive language that allows a user to deal with design complexity. A design, and the data representing it, are complex by the very nature of a modern digital system constructed from VLSI chips [21]. This standard hardware description language was born as a solution to problems such as the reusability of older designs and the upgrading of existing designs. VHDL is also an IEEE standard and a leading-edge design language, one that may be used with many types of design tools such as simulators, design synthesizers, silicon compilers, test generators and timing analyzers [3], [21].
The scope of VHDL ranges from architectural-level description down to gate-level description. The language is hierarchical, and mixed-level simulation is supported. The concepts embodied in the timing model of the language mirror real hardware, so VHDL models of designs behave like real hardware. At high levels of abstraction, the language makes an excellent specification medium for future designs to be created in new technologies or with alternative architectures. At lower levels of abstraction, the language serves well as a specification of what is to be fabricated.

6.2 Design Tools

The Model Technology V-System/Workstation VHDL simulation system used supports the complete IEEE 1076-1987 standard [24]. VHDL's ability to model the concurrency inherent in hardware operations is a major advantage over the programming languages previously used for behavioral modeling. Also desirable is the ability to partition a design into high-level subsystems and then analyze these blocks; the V-System simulation system supports this partitioning. While this requires describing system behavior in more detail, it does not lock the system into a specific implementation. Simulations of the AFT and OFT networks introduced in Chapter Four were carried out with the VSIM [24] simulator. Design and simulations were done on SUN SPARC 2 and SUN SPARC 10 workstations provided by the Canadian Microelectronics Corporation.

6.3 The Bruns Network

Chapter Four introduced the AFT butterfly as the basis of the butterfly tree network. Furthermore, it was shown that by combining these butterflies in an orderly manner it is possible to build a regular network. The systolic architecture obtained for a given N (as shown in Figure 26) contains neighboring processors laid out in a regular pattern in order to perform the computation of the different Bruns averages. In VHDL the primary unit of behavioral description is the process.
A process is a sequential body of code which can be activated in response to changes in state. When more than one process is activated at the same time, they execute concurrently. A process statement is a concurrent statement which can be used in an architecture body or block. A process is activated during the initialization phase of simulation: it executes all of its sequential statements and then repeats, starting again with the first statement. A sensitivity list can also be included in the header of each process statement; the process is then assumed to have an implicit wait statement at the end of its statement part. Processes execute wait statements to suspend themselves when an event occurs on any of the signals in the sensitivity list [3].

The behavior of a single AFT butterfly can be captured in a single process, as shown by the pseudo-code in Figure 26. In this process the n-th point averages W(n,a) and the Bruns averages B(2ni,0) are computed simultaneously. Not shown in the process is the code that keeps track of the averaging and the number of right shifts needed to perform this averaging. The signal lev, which is included in the sensitivity list of the process, gives information about the position of the butterfly in the network and therefore about the number of right shifts to be performed. Note that the signal assignments take place after an arbitrary delay given by tpd_clk_to_add, which is a constant in the AFT entity.

    butterfly : process(clk, a1, a2)
    begin
      if (clk'event) and (clk = '1') then
        W <= find_sum(qq(1), a1, a2) after tpd_clk_to_add;
        qq(1) := find_carry(qq(1), a1, a2);
        B <= find_sum(qq(2), a1, not(a2)) after tpd_clk_to_add;
        qq(2) := find_carry(qq(2), a1, not(a2));
      end if;
    end process butterfly;

Figure 26. Basic process describing the behavior of the AFT butterfly.
The function find_sum computes the bit-serial addition and the function find_carry computes the carry-out to be added to the next pair of bits (a1, a2). The alternating averages are added bit-serially using a 1-bit register cell (R) and a full adder (FA). Figure 27 shows how the left child is obtained using the CSA trees. Each register is associated with an input and an output connection. Within each time step, data is shifted into the R registers and the FA via their respective input buses and computed according to the RTL formulas:

    S <= (a xor b) xor Cin,
    Cout <= (a and b) or (Cin and (a xor b)).

The S and Cout values are then shifted out of the registers onto their respective output lines, as shown in Figure 27.

Figure 27. Two-level addition tree with carry saved (register cells R feeding full adders FA over the inputs a0-a3).

The behavior of the entire AFT network is captured in a collection of processes that can be constructed using the generate statement. The generate statement is used in architecture bodies to describe regular structures, such as arrays of blocks, component instances or processes [3]. The regular components can be described using the for-generate scheme, and irregularities in the structure can be handled by if-generate statements, as shown in Figure 28.

    begin
      For_all_stages : for stage_num in 1 to (1 + logN) generate
        Inside_stage : for I in 1 to (N / 2**(stage_num - 1)) generate
          Stage_1 : if (stage_num = 1) generate
            stage : aft port map(clk, ...);
            -- signal assignments here
          end generate Stage_1;
          Other : if (stage_num /= 1) generate
            other_stage : aft port map(clk, ...);
            -- signal assignments here
          end generate Other;
        end generate Inside_stage;
      end generate For_all_stages;

Figure 28. Description of the generation of the AFT-butterfly network.

Figure 29 shows a small network (of size N=8) constructed after this generation model for computing the Bruns averages.
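The full-adder equations can be exercised outside VHDL. The Python sketch below mirrors the roles of find_sum and find_carry; the function names and the LSB-first convention are assumptions for illustration, not the thesis model itself.

```python
# Bit-serial addition, LSB first: the carry variable plays the role of qq().

def find_sum(carry, a, b):
    return (a ^ b) ^ carry                 # S = (a xor b) xor Cin

def find_carry(carry, a, b):
    return (a & b) | (carry & (a ^ b))     # Cout = ab or Cin*(a xor b)

def serial_add(x_bits, y_bits):
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):       # one bit pair per clock cycle
        out.append(find_sum(carry, a, b))
        carry = find_carry(carry, a, b)
    return out

# 9 + 3 = 12 with 4-bit LSB-first streams:
assert serial_add([1, 0, 0, 1], [1, 1, 0, 0]) == [0, 0, 1, 1]
```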
The number of stages generated is (1 + log 8), as expected, and the regular structure is built uniquely from AFT-butterfly processes. The total number of concurrent processes is given by

    \sum_{i=1}^{1+\log_2 N} \frac{N}{2^{i-1}} = 2N - 1.

These processes do not contain wait statements, but they are suspended until an event occurs on one of the incoming input signals. The internal signals connecting the input ports and output ports of the AFT processes can only change at the rising edge of the clock. Not shown in Figure 29 is the common clock that all AFT processes share.

Figure 29. A Bruns network for N=8, built in four stages from AFT-butterfly processes; the inputs a(1)-a(16) enter stage 1, and the stages produce the averages W(n) and the Bruns averages B(n).

6.4 The OFT Network and its design

As mentioned in Chapter Four, the orthogonal flat trees (OFT) network contains N tree-based networks, labeled T1, T2, ..., TN, such that network Tj is responsible for computing the Fourier coefficient aj or bj. The governing equations are the Möbius-inversion formulas of the AFT, of the form

    a_n = \sum_{i=1,3,5,\ldots}^{\lfloor N/n \rfloor} \mu(i)\,B(2ni, 0), \qquad n = 1, 2, \ldots, N,

with an analogous expression for the b_n in terms of shifted Bruns averages. In Chapter Five it was shown that the first N/\log N computational trees can be realized efficiently by binary trees, while the remaining N - N/\log N trees can be realized by linear trees, or flat trees. As explained earlier, the term flat refers to the fact that most trees in the network have constant height. A table showing the almost constant number of nodes required by these linear trees can be found in Appendix B.

The generation of the first N/\log N binary trees follows the pattern outlined in Figure 30. In the generation of a binary tree, the depth, or number of levels, of that tree has to be known.
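The process count can be confirmed in a few lines of Python (illustrative, for N a power of two):

```python
import math

# One butterfly process per node of the binary reduction: N + N/2 + ... + 1 = 2N - 1.
for N in (2, 4, 8, 16, 64):
    stages = 1 + int(math.log2(N))
    total = sum(N // 2 ** (i - 1) for i in range(1, stages + 1))
    assert total == 2 * N - 1
```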
For a given computational tree, the number of levels is given by \log_2(N / (2\cdot\text{tree\_num})), where tree_num gives the position of the tree in the network. This is true because any tree has at most N/2 elements (this is T1, or tree_num = 1). Also, the total number of subtrees generated at a given level by a computational tree Tn has to be known, and this is given by

    \text{Number of subtrees} = 2^{\text{depth\_of\_tree}} \cdot \frac{1}{2^{\text{level}}}.

At least three different if-generate statements are needed to handle the connection of all the internal components: generation of the base trees, of the intermediate trees, and of the root. An additional case was added for the second level in order to simplify the connection of internal signals. Note that in each of these cases the component tree is used in the generate statement. This entity was defined earlier in 'binary.vhd' and it essentially computes the sum of two numbers bit-serially, in a similar fashion to the way W(n,a) and B(2ni,0) computed the sums and differences in the AFT network.

    begin
      For_all_bin_trees : for tree_num in 1 to size generate
        All_levels : for level in (depth_of_tree) range generate
          A_subtree : for I in (Number_of_subtrees) range generate
            Base : if (level = 1) generate
              level_1 : tree port map(clk2, ...);
            end generate Base;
            Level_2 : if (level = 2) generate
              second : tree port map(clk2, ...);
            end generate Level_2;
            Other : if (level other than first, second or last) generate
              other_level : tree port map(clk2, ...);
            end generate Other;
            Root : if (last level) generate
              last_level : tree port map(clk2, ...);
            end generate Root;
          end generate A_subtree;
        end generate All_levels;
      end generate For_all_bin_trees;

Figure 30. Description of the generation of all binary trees in the OFT network.

The generation of flat trees is outlined in Figure 32. The N - N/\log N linear trees are generated using a similar approach to the one used when generating the binary trees.
Two nested generate loop statements are needed: the inner one creates the nodes of a given tree and the outer one creates the different trees. Likewise, three different if-generate statements are needed to distinguish between boundary-node and inner-node generation. This is necessary because at the boundaries the input and output signals do not follow the same pattern as the signals of internal nodes, as shown in Figure 31.

Figure 31. A linear tree showing the difference between boundary and inner nodes.

The number of nodes required for a given linear tree is I_{a_j} - 1, where I_{a_j} is, as defined in Chapter Five, the number of alternating averages for the jth tree. Because not all Bruns averages take the same number of clock cycles, they need to be buffered as they are computed in the AFTN network. Both binary and linear trees in the OFT network use a second clock (CLK2) in the generation of the concurrent processes.

    begin
      For_all_lin_trees : for tree_num in (1 + N/logN) to N generate
        A_lin_tree : for node in (I_aj - 1) range generate
          Node : if (first node) generate
            first_node : tree port map(clk2, ...);
          end generate Node;
          Inner : if (not first node) and (not last node) generate
            other_node : tree port map(clk2, ...);
          end generate Inner;
          Last : if (last node) generate
            last_node : tree port map(clk2, ...);
          end generate Last;
        end generate A_lin_tree;
      end generate For_all_lin_trees;

Figure 32. Description of the generation of all flat trees in the OFT network.

6.5 Simulation

A single clock of period equal to 100 ns was used in all the simulations. An arbitrary time delay of 5 ns was used for input settling times as well as for propagation time delays. To begin with the simplest simulation of all, consider the computation of the W and B averages for two positive numbers of word_size = 4. The corresponding waveforms of the computation of a single module computing an AFT butterfly (an addition and a subtraction),
with positive input vectors aa1() and aa2(), are shown in Figure 33:

    aa1(1) -> 1001
    aa2(1) -> 0011
    wa(1)  -> 1100
    ba(1)  -> 0110

Figure 33. Waveforms showing the computation of the W and B coefficients for two positive numbers.

The transition delays at the behavioural level are arbitrary and can be preset to any value. The inputs /a1 and /a2 to the aft module have to settle before the rising edge of the clock, but this settling time is also arbitrary. Moreover, the length of the cycle can be arbitrarily shaped, and rising and falling edges can both be used, as each represents an event recognized by a VHDL simulator. The vectors aa1() and aa2() contain the serial input to the systolic network. The serial output stream coming from the different aft modules is buffered in registers whose length is determined by the word size of the input vectors aa1() and aa2(). Figure 34 shows the number of cycles involved in the computation and averaging of ba(1) and wa(1) for aa1(1) and aa2(1):

    aa1(1) -> 01011
    aa2(1) -> 00010
    wa(1)  -> 01101
    ba(1)  -> 01001

Figure 34. Bruns averaging with zero order interpolation.

When more than one butterfly stage is used to construct the AFTN network, the number of clock cycles needed to compute all Bruns averages is given by (word_size + log2 N). As seen before, the Bruns averages need to be buffered as the computation progresses from stage to stage. Stages that end their computation must not keep filling the queues where the Bruns averages are stored, as this would distort the results. This is notably important because the number of stages increases with N.
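The bit-serial addition and subtraction performed by one aft butterfly can be modelled in a few lines of Python (a behavioural sketch of mine, LSB first, with the subtractor seeded with a carry-in of 1 as in two's-complement arithmetic):

```python
def serial_butterfly(a_bits, b_bits):
    """Bit-serial sum and difference of two MSB-first binary strings."""
    a = [int(c) for c in reversed(a_bits)]   # feed the LSB first
    b = [int(c) for c in reversed(b_bits)]
    carry_w, carry_b = 0, 1                  # subtraction: a + not(b) + 1
    w, d = [], []
    for x, y in zip(a, b):
        yn = 1 - y
        w.append(x ^ y ^ carry_w)            # X1 XOR X2 XOR X3
        carry_w = (x & y) | (carry_w & (x | y))
        d.append(x ^ yn ^ carry_b)
        carry_b = (x & yn) | (carry_b & (x | yn))
    msb_first = lambda bits: ''.join(map(str, reversed(bits)))
    return msb_first(w), msb_first(d)

print(serial_butterfly("01011", "00010"))    # ('01101', '01001'), as in Figure 34
```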
The module AFTN contains code that checks the number of cycles spent filling the buffers and compares it with the stage number and word size to decide whether the buffers have been filled for that particular stage. As an example, assume a two-stage computation with serial inputs aa1(1)=01011, aa2(1)=00010, aa1(2)=01011 and aa2(2)=00111. These inputs directly generate the averages ba(1) and ba(2), as shown in Figure 35. The internal signals produced, wa(1) and wa(2), generate the third average ba(3) as shown below:

    wa(1) -> 01101
    wa(2) -> 10010
    wa(3) -> 11111
    ba(3) -> 11011

Figure 35. AFTN network for N=2.

Figure 36. Waveforms showing two stages of butterfly computation.

The corresponding waveforms of the two-stage computation appear in Figure 36. As can be seen, signal ba(3) requires an extra clock cycle to fill the buffer, because during the first cycle stage 2 did not compute a proper output (an undefined state, shown in Figure 36 as a dotted line between the one and zero levels). In this case (5 + log2 2) cycles are required, because the word size is five and the size of the network is N=2. Testing the modules by observing the waveforms can be useful when dealing with a small number of active processes. But when the number of processes is large, testing by mere inspection of waveforms becomes impossible; test bench circuits are used in such cases.
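Under the same 5-bit two's-complement arithmetic, the two-stage example can be replayed in Python (a sketch; the stage wiring is simplified to the N=2 case of Figure 35):

```python
MASK = 0b11111                      # 5-bit two's-complement arithmetic

def add5(a, b):
    return (a + b) & MASK

def sub5(a, b):
    return (a - b) & MASK

aa1 = [0b01011, 0b01011]
aa2 = [0b00010, 0b00111]
wa = [add5(x, y) for x, y in zip(aa1, aa2)]   # stage 1 sums: 01101, 10010
ba = [sub5(x, y) for x, y in zip(aa1, aa2)]   # stage 1 differences
ba3 = sub5(wa[0], wa[1])                      # stage 2 difference
print(format(ba3, '05b'))                     # 11011, i.e. -5 in 5 bits
```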
The idea of testing with a circuit resembles the idea of Built-In Self-Test (BIST) used in hardware. A top-level entity for a design under test (DUT) can be used as a component in a test bench circuit together with another entity (TG) whose purpose is to generate test values. The values on signals can be traced using a simulation monitor or checked directly by the test generator. Unlike in the BIST approach, no compaction of data is needed here, as the output is checked as soon as the TG receives it, by comparing expected values to the actual output. Depending on this output, assertion flags (boolean variables such as found_error) are raised. No external connections from the test bench are needed, hence it has no ports, as shown in Figure 37. Such an entity is defined in VHDL as follows:

    entity test_bench is
    end test_bench;

Though this might at first seem a pointless entity definition, as neither generic constants nor ports are defined, it allows an architecture to be tested by simply providing stimulus, checking the results, and comparing them to the expected values to determine whether an error exists.

Figure 37. Test bench circuit used for module-testing.

Due to the serial nature of the aft module, the TG, when checking output from the DUT in a test bench circuit such as the one in Figure 37, has to remember the previous test pattern generated, as the current output also depends on the previous pattern given to the DUT. The entry in the third row and fourth column of the table below should be a 1 if the pattern in the second row did not result in a 'borrow' (0 - 1 = 1).

    /a2  /w  /b
     0    0   0
     1    1   1
     0    1   0
     1    0   0

Figure 38. Some patterns generated by the TG when the DUT is an aft module.

A similar number of patterns can be generated and fed to the aft module being tested. The patterns are stored in a constant array, and after checking the results a report can be issued in an assert statement inside the process in charge of the testing.
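The state the TG has to carry between patterns is just the pair of running carries. A minimal Python sketch of such a checker (my own illustration, with the DUT stubbed out by the expected logic itself) is:

```python
def check_aft(patterns):
    """patterns: (a1, a2) bit pairs, LSB first. Returns True on any mismatch."""
    carry_w, carry_b = 0, 1          # carries remembered across patterns
    found_error = False
    for a1, a2 in patterns:
        exp_w = a1 ^ a2 ^ carry_w                # expected /w bit
        na2 = 1 - a2
        exp_b = a1 ^ na2 ^ carry_b               # expected /b bit
        dut_w, dut_b = exp_w, exp_b              # stub standing in for the DUT
        if (dut_w, dut_b) != (exp_w, exp_b):
            found_error = True                   # assertion flag, as in Figure 39
        carry_w = (a1 & a2) | (carry_w & (a1 | a2))
        carry_b = (a1 & na2) | (carry_b & (a1 | na2))
    return found_error

print(check_aft([(1, 0), (0, 1), (1, 1), (0, 0)]))   # False: no errors found
```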
A typical structure of the VHDL code used in these test benches is outlined in Figure 39.

    test : process
    begin
      for i in test_patterns'range loop
        -- apply stimulus
        -- wait for outputs to settle
        if outputs /= expected value then
          assert false report "expected value different";
          found_error := true;
        end if;
      end loop;
      assert not found_error report "There were ERRORS in the test" severity note;
      assert found_error report "Test completed with no errors" severity note;
    end process;

Figure 39. Outline of the steps needed in the description of a test bench circuit.

The test of the OFT module was performed in a similar fashion. Since the computations in this module are also serial, the steps involved in the description of the proper test bench circuit are almost identical. Appendix B contains the details of the VHDL code.

Chapter 7
Conclusion

7.1 Summary

This work described a novel network for the computation of the Arithmetic Fourier Transform, as well as new bounds on the area and time of this network. Formal proofs of the bounds on the computational and broadcasting area required for the computation of this transform were given. A proof of an upper bound on the area of the butterfly network that computes the Bruns averages was also presented. Finally, an implementation in VHDL, at the behavioural level, of this AFT network was presented to demonstrate the feasibility (in terms of ease of modelling) of the extended architecture proposed in this work.

In Chapter Four the novel network, called the OFT or Orthogonal Flat Tree network, was developed. This network computes the Fourier coefficients. It was also proved that the area of this network is O(N^2), which makes it optimal in terms of the AT^2 trade-off. In Chapter Five the area-time complexity aspects of the AFT were presented. Bounds on the area and time of the two designs considered were proved, as well as a formal proof of the upper bound of the OFT network.
The OFT upper bound was derived using non-trivial number-theoretic analysis based on the theory of propinquity of divisors [11]. The other bounds in Chapter Five were derived using the analysis of series. Chapter Six presented a VHDL simulation of the main stages involved in the computation of the AFT. The butterfly network, which computes the Bruns averages, and the OFT network were simulated. The behavioural level of these two stages was described, and the structure of some essential components was shown.

For nearly all network sizes, half of the B(k) trees require only two leaves, and a large number of the other B(k) trees require a small number of leaves. In fact, for N = 16, 32, ..., 512 more than three fourths of the B(k) trees require four or fewer leaves, as shown in Figure 40. The layout of the OFT network is such that the first N/log N computational trees are implemented as binary trees and the remaining trees are laid out as linear trees. This is an essential step to achieve the optimal area of the network, but when most computational trees require only a few leaves this distinction becomes less relevant, as most trees could be implemented as either binary or flat trees with little change in the total area of the network. However, as the size of the network increases, more trees require a larger number of leaves (see Figure 40), so the implementation of an AFT network with an OFT network is optimal, i.e., the area of this network is of order O(N^2) and AT^2 = O(N^2 log^2 N).

Figure 40. Distribution of the broadcasting trees for relatively small sizes (number of B(k) trees versus number of Bruns averages, for N = 128, 256 and 512).

7.2 Further Work

We have seen in detail that the Arithmetic Fourier Transform is a good alternative for Fourier analysis due to its computational complexity, accuracy and speed.
An interesting application of the AFT is found in image processing, especially in image data compression. Image data compression is concerned with minimizing the number of bits required to represent an image; applications of data compression are primarily in the transmission and storage of information. Image data compression methods fall into two common categories. In the first category, called predictive coding, are methods that exploit redundancy in the data. In the second category, called transform coding or block quantization, compression is achieved by transforming the given image into another array such that a large fraction of its total energy is packed into relatively few transform coefficients, which are quantized independently. Transform coding achieves relatively larger compression than predictive methods. Any distortion due to quantization and channel errors gets distributed, during the inverse transformation, over the entire image. Visually, this is less objectionable than predictive coding errors, which appear locally at the source. The optimum transform coder is defined as the one that minimizes the mean square distortion of the reproduced data for a given number of total bits. This turns out to be the Karhunen-Loeve transform (KLT); but, the KLT not being a fast transform in general, it can be replaced by an orthogonal transform like the Discrete Cosine Transform (DCT). The performance of the DCT is superior to the other fast transforms and is almost indistinguishable from the KLT for highly correlated data (rho > 0.5); it gives the best compression ratio for a given amount of mean-square error. Orthogonal transforms are also known to be less sensitive to noise introduced in the channel than other methods of data compression. These two properties, its excellent energy compaction and its virtually noise-free computation, make the DCT attractive for image and speech coding. However, the DCT requires massive and complicated data manipulations, much as the DFT does.
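The DCT-from-DFT identity derived next can be checked numerically. Here is a Python sketch (mine, for illustration) comparing a direct DCT against the reordered-sequence DFT route:

```python
import cmath
import math

def dct(x):
    """Direct DCT with the alpha(k) normalization used in the text."""
    N = len(x)
    alpha = lambda k: math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return [alpha(k) * sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                           for n in range(N))
            for k in range(N)]

def dct_via_dft(x):
    """Even samples first, odd samples reversed last, then take
    Re[alpha(k) * exp(-j*pi*k/(2N)) * DFT{x~}(k)]."""
    N = len(x)
    xt = x[0::2] + x[1::2][::-1]                  # the reordered sequence x~
    alpha = lambda k: math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = []
    for k in range(N):
        F = sum(xt[n] * cmath.exp(-2j * math.pi * n * k / N) for n in range(N))
        out.append((alpha(k) * cmath.exp(-1j * math.pi * k / (2 * N)) * F).real)
    return out
```

Replacing the inner DFT evaluation by an AFT evaluation yields the same coefficients, which is precisely the proposed application.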
There exist some fast algorithms [20] that can be improved by implementing some of the stages using techniques such as distributed arithmetic [6]. Nonetheless, the similarities with other orthogonal transforms have led to algorithms based on other trigonometric transforms, such as the DFT (Discrete Fourier Transform) and the DHT (Discrete Hartley Transform). Several architectures have been proposed for the computation of the DFT and the DHT that can be modified to compute the DCT [25]. Most of these architectures are based on a systolic array that realizes the complex multiplications required by the DFT. Therefore, the complexity of the DCT is only resolved by using more computational power, not by using a more efficient algorithm. An application of the AFT can be found here, in the computation of the DCT. Among the advantages of using the AFT over the DFT, the reduced computational complexity, the gain in accuracy and the comparable speed are some of the more important ones.

The DCT is a transform which is real and orthogonal ($C = C^{*}$ and $C^{-1} = C^{T}$), and it is defined as follows:

$$X(k) = \alpha(k)\sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{\pi k(2n+1)}{2N}\right), \quad 0 \le k \le N-1,$$

$$\alpha(0) = \sqrt{1/N}; \qquad \alpha(k) = \sqrt{2/N}, \quad 1 \le k \le N-1.$$

Also, the discrete cosine transform is a fast transform: the cosine transform of a vector of N elements can be calculated in O(N log N) operations via an N-point FFT [39]. To show this property, a new sequence $\tilde{x}(n)$ is defined by reordering the even and odd elements of x(n) as follows:

$$\tilde{x}(n) = x(2n), \qquad \tilde{x}(N-n-1) = x(2n+1), \qquad 0 \le n \le \frac{N}{2}-1.$$

Now the summation term is separated into an even and an odd sequence to obtain:

$$X(k) = \alpha(k)\left[\sum_{n=0}^{N/2-1} x(2n)\cos\!\left(\frac{\pi(4n+1)k}{2N}\right) + \sum_{n=0}^{N/2-1} x(2n+1)\cos\!\left(\frac{\pi(4n+3)k}{2N}\right)\right]$$

$$= \alpha(k)\left[\sum_{n=0}^{N/2-1} \tilde{x}(n)\cos\!\left(\frac{\pi(4n+1)k}{2N}\right) + \sum_{n=0}^{N/2-1} \tilde{x}(N-n-1)\cos\!\left(\frac{\pi(4n+3)k}{2N}\right)\right].$$

Changing the index of summation in the second term to n' = N-n-1 and combining terms, the
transform is written as:

$$X(k) = \alpha(k)\sum_{n=0}^{N-1} \tilde{x}(n)\cos\!\left(\frac{\pi(4n+1)k}{2N}\right)
= \mathrm{Re}\!\left[\alpha(k)\, e^{-j\pi k/2N} \sum_{n=0}^{N-1} \tilde{x}(n)\, e^{-j2\pi nk/N}\right]$$

$$= \mathrm{Re}\!\left[\alpha(k)\, W_{2N}^{k/2}\, \mathrm{DFT}\{\tilde{x}(n)\}\right]
= \mathrm{Re}\!\left[\alpha(k)\, W_{2N}^{k/2}\, \mathrm{AFT}\{\tilde{x}(n)\}\right],
\qquad W_{2N}^{k/2} = e^{-j\pi k/2N},$$

which proves the previously stated result [15]. Therefore, if we calculate the N-point inverse FFT of the sequence $\alpha(k)X(k)\exp(j\pi k/2N)$ we can also obtain the inverse DCT in O(N log N) operations. This application of the AFT to image data compression is interesting, as the excellent energy compaction properties of the DCT [15] are exploited by computing an N-point AFT in O(N log N) operations, with reduced computational complexity and a gain in accuracy.

Bibliography

[1] H.M. Alnuweiri. Optimal VLSI Networks for Multi-dimensional Transforms. IEEE Trans. on Parallel and Distributed Systems, Vol. 5, No. 7, July 1994.
[2] Frank M. Ardito. VLSI Building Block Arithmetic Units for FFT Processing. IEEE NAECON, 1981.
[3] Peter J. Ashenden. The VHDL Cookbook. Dept. of Computer Science, University of Adelaide, South Australia, 1990.
[4] G. Bilardi and M. Sarrafzadeh. Optimal VLSI Circuits for Discrete Fourier Transform in VLSI. Advances in Computing Research, JAI Press, 1987.
[5] G. Boudreaux-Bartels and T.W. Parks. Discrete Fourier Transform using summation by parts. April 1987.
[6] W. Burleson. Efficient Computation in VLSI with Distributed Arithmetic. PhD thesis, University of Colorado, 1989.
[7] P. Denyer and D. Renshaw. VLSI Signal Processing - A Bit-Serial Approach. Addison-Wesley, 1985.
[8] Dirichlet. Über die Bestimmung mittlerer Werthe in der Zahlentheorie. Ges. Werke, Bd. II. Arkiv för matematik, astronomi och fysik, 1907.
[9] S.L. Garverick. A single wafer 16 point 16 MHz FFT processor. IEEE Custom Integrated Circuits Conference, May 1983.
[10] R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1989.
[11] R.R. Hall. The divisors of integers I. Acta Arith. 26, 41-46, 1974.
[12] R.R. Hall. The divisor density of integer sequences. J. London Math. Soc. (2) 24, 41-53, 1981.
[13] R.R. Hall and G. Tenenbaum. Divisors. Cambridge University Press, Cambridge, 1988.
[14] G.H. Hardy and E.M. Wright. An Introduction to the Theory of Numbers. Oxford University Press (fifth edition), 1979.
[15] A.K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall International Editions, 1989.
[16] Brian T. Kelley and Vijay K. Madisetti. Efficient VLSI Architectures for the Arithmetic Fourier Transform (AFT). IEEE Transactions on Signal Processing, Vol. 41, No. 1, pp. 365-384, January 1993.
[17] J. Kubilius. Probabilistic Methods in the Theory of Numbers. Translations of Math. Monographs, vol. 11, Amer. Math. Soc., Providence, Rhode Island (third printing), 1978.
[18] S.Y. Kung. VLSI Array Processors. Prentice-Hall, 1988.
[19] F.T. Leighton. New Lower Bound Techniques for VLSI. Mathematical Systems Theory, #17, 1984.
[20] Weiping Li. A new algorithm to compute the DCT and its inverse. IEEE Transactions on Signal Processing, Vol. 39, No. 6, pp. 1305-1313, June 1991.
[21] R. Lipsett, Carl Schaefer, and Cary Ussery. VHDL: Hardware Description and Design. Kluwer Academic Publishers, 1991.
[22] Ross Mactaggart. A Single Chip Radix 2 FFT Butterfly Architecture Using Parallel Data Distributed Arithmetic. IEEE ISSC, June 1984.
[23] H. Maier and G. Tenenbaum. On the set of divisors of an integer. Invent. Math. 76, 121-128, 1984.
[24] ModelTech. V-System/Workstation User's Manual: VHDL Simulation for Workstations.
[25] N.R. Murthy and M.N.S. Swamy. On the real-time computation of DFT and DCT through systolic architectures. Technical report, 1992.
[26] H. Nagpal, G. Jullien, and W. Miller. Processor Architectures for Two-Dimensional Convolvers Using a Single Multiplexed Computational Element with Finite Field Arithmetic. IEEE Transactions on Computers, Vol. C-32, No. 11, pp. 989-1001, November 1983.
[27] K.K. Norton.
On the number of restricted prime factors of an integer II. Acta Math. 143, 9-38, 1979.
[28] H. Park and V.K. Prasanna. Modular VLSI Architectures for Computing the Arithmetic Fourier Transform. Technical report, Institute for Robotics and Intelligent Systems and Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, California, 1991.
[29] J.G. Proakis and Dimitris G. Manolakis. Digital Signal Processing: Principles, Algorithms and Applications. Macmillan Publishing Company, 1992.
[30] L.R. Rabiner and B. Gold. Theory and Applications of Digital Signal Processing. Prentice-Hall, 1975.
[31] I.S. Reed, Ming-Tang Shih, T.K. Truong, E. Hendon, and Donald W. Tufts. A VLSI Architecture for Simplified Arithmetic Fourier Transform Algorithm. IEEE Transactions on Signal Processing, Vol. 40, No. 5, May 1992.
[32] I.S. Reed, Donald W. Tufts, Xiaoli Yu, T.K. Truong, Ming-Tang Shih, and Xiaowei Yin. Fourier Analysis and Signal Processing by Use of the Möbius Inversion Formula. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 38, No. 3, pp. 458-470, March 1990.
[33] Francis Scheid. Numerical Analysis. McGraw-Hill, 1972.
[34] C.D. Thompson. Fourier Transforms in VLSI. IEEE Trans. Computers, Vol. C-32, No. 11, pp. 1047-1057, 1983.
[35] C.D. Thompson. A Complexity Theory for VLSI. PhD thesis, Carnegie-Mellon University, CMU-CS-80-140, Aug. 1980.
[36] D.W. Tufts and H. Chen. Iterative Realization of the Arithmetic Fourier Transform. IEEE Transactions on Signal Processing, Vol. 41, No. 1, pp. 1202-1211, January 1993.
[37] D.W. Tufts and G. Sadasiv. The Arithmetic Fourier Transform. IEEE Acoust., Speech, Signal Processing Mag., vol. 5, pp. 13-17, January 1988.
[38] J.D. Ullman. Computational Aspects of VLSI. Computer Science Press, 1984.
[39] M. Vetterli and H.J. Nussbaumer. Simple FFT and DCT algorithms with reduced number of operations. Signal Processing, Vol. 6, No. 4, Aug. 1984.
[40] J. Vuillemin.
A combinatorial limit to the computing power of VLSI circuits. 21st Symp. on Foundations of Computer Science, IEEE, pp. 294-300, Oct. 1980.

Appendix A: Bk distribution

Figure 41. Data for the distribution of the broadcasting trees for small N's: the number of B(k) trees per number of leaves (1 to 12), for N = 16, 32, 64, 128, 256 and 512, with the corresponding ratios N/log N = 4, 6.4, 10.7, 18.3, 32 and 56.9.

    N                         16  32  64  128  256  512  1024
    Number of linear trees     1   4  11   24   53  114   239
    Maximum number of nodes    1   1   2    2    3    3     4

Figure 42. Maximum number of nodes for linear trees for a given N.

Appendix B: VHDL simulation

--The Orthogonal Flat Tree takes as inputs the Bruns coefficients computed
--by the AFTN network and also the Mobius numbers that are stored in memory.
--These numbers are either -1, 0, +1.
--The Bruns coefficients are integers computed with zero order interpolation.
library IEEE;
use IEEE.std_logic_1164.all;

entity tree is
  generic(word_size : integer := 5);
  port (clk2 : in  bit;
        B1   : in  std_logic;
        B2   : in  std_logic;
        sum  : out std_logic);
end tree;

architecture only of tree is

  constant tpd_clk_to_add : time := 5 ns;

  function find_sum(val : bit; bruns1 : std_logic; bruns2 : std_logic)
      return std_logic is
    alias input1 : std_logic is bruns1;
    alias input2 : std_logic is bruns2;
  begin
    --if (mu1='01' and mu2='01') then
    return ( input1 xor input2 xor To_StdULogic(val) );
    --elsif (mu1='01' and mu2='11') then
    --  return ( input1 xor not(input2) xor To_StdULogic(not(val)) );
    --elsif (mu1='11' and mu2='01') then
    --  return ( not(input1) xor input2 xor To_StdULogic(not(val)) );
    --elsif (mu1='11' and mu2='11') then
    --  return ( not(input1) xor not(input2) xor To_StdULogic(val) );
    --end if;
  end find_sum;

  function find_cout(val : bit; x : std_logic; y : std_logic)
      return std_logic is
    alias in1 : std_logic is x;
    alias in2 : std_logic is y;
    variable in3 : std_logic := To_StdULogic(val);
  begin
    return ( (in1 and in2) or (in3 and in1) or (in3 and in2) );
  end find_cout;

begin
  binary_tree: process(clk2, B1, B2)
    variable xmap  : bit;
    variable carry : bit_vector(2 downto 1) := ('1','0');
  begin
    if clk2'event and (clk2 = '1') then
      sum <= find_sum(carry(1), B1, B2) after tpd_clk_to_add;
      carry(1) := To_bit(find_cout(carry(1), B1, B2), xmap);
      if (B1 = 'U' and B2 = 'U') then
        carry(2) := carry(2);   -- No change
      else
        carry(2) := To_bit(find_cout(carry(2), B1, not(B2)), xmap);
      end if;
    end if;
  end process binary_tree;
end only;
--This module generates and computes all the binary trees needed in the
--Orthogonal Flat Tree.
library IEEE;
use IEEE.std_logic_1164.all;

entity all_binary is
  generic(N : integer := 32; size : integer := 32/5);   --size=N/log2(N)
  port (clk2 : in  bit;
        B    : in  std_logic_vector(N*size downto 1);
        T    : out std_logic_vector(size downto 1));
end all_binary;

architecture only of all_binary is

  --This function log2 simulates the binary logarithm.
  --The input determines the size of the AFT network.
  --The maximum size of the network is N=1024.
  function log2(size : in integer) return integer is
  begin
    case size is
      when 1            => return 0;
      when 2 | 3        => return 1;
      when 4 to 6       => return 2;
      when 7 to 11      => return 3;
      when 12 to 23     => return 4;
      when 24 to 47     => return 5;
      when 48 to 95     => return 6;
      when 96 to 191    => return 7;
      when 192 to 383   => return 8;
      when 384 to 768   => return 9;
      when 769 to 1024  => return 10;
      when others       => NULL;
    end case;
  end log2;

  --This function returns the 2's complement of a bit vector.
  function Twos_complement(s : std_logic_vector) return std_logic_vector is
    variable number : std_logic_vector(1 to s'LENGTH);
    variable one    : std_logic := '1';
    variable carry  : std_logic_vector(1 to s'LENGTH);
  begin
    for i in s'RANGE loop
      number(i) := not(s(i));   --one's complement
    end loop;
    carry(1)  := number(1) and one;
    number(1) := number(1) xor one;
    for j in 2 to s'LENGTH loop
      carry(j)  := number(j) and carry(j-1);
      number(j) := number(j) xor carry(j-1);
    end loop;
    return number;
  end Twos_complement;

  component tree
    port(clk2 : in  bit;
         B1   : in  std_logic;
         B2   : in  std_logic;
         sum  : out std_logic);
  end component;

  signal SB : std_logic_vector(N*size downto 1);

begin
  For_all_bin_trees: for tree_num in 1 to size generate   -- size=N/log2(N)
    B1: block
    begin
      All_levels: for level in 1 to log2((N/tree_num)/2) generate
        A_subtree: for I in 1 to ((N/tree_num)/(2**(level+1))) generate

          Base: if (level = 1) generate
            level_1: tree port map(clk2,
                B(2*tree_num*(4*I-3)), B(2*tree_num*(4*I-1)), SB(I));
            --if (level = log2((N/tree_num)/2)) then T(tree_num) <= SB(I); end if;
          end generate Base;

          Level_2: if (level = 2 and level /= log2((N/tree_num)/2)) generate
            second: tree port map(clk2, SB(2*I-1), SB(2*I),
                SB(I+(N/tree_num)/(4*(level-1))));
          end generate Level_2;

          Other: if (level > 2 and level /= log2((N/tree_num)/2)) generate
            other_level: tree port map(clk2,
                SB(2*I-1+(N/(4*tree_num))*(2-1/2**(level-3))),
                SB(2*I  +(N/(4*tree_num))*(2-1/2**(level-3))),
                SB(2*I-1+(N/(4*tree_num))*(2-1/2**(level-2))));
          end generate Other;

          Root: if (level = log2((N/tree_num)/2) and level /= 1) generate
            last_level: tree port map(clk2,
                SB(2*I-1+(N/(4*tree_num))*(2-1/2**(level-3))),
                SB(2*I  +(N/(4*tree_num))*(2-1/2**(level-3))),
                SB(2*I-1+(N/(4*tree_num))*(2-1/2**(level-2))));
            T(tree_num) <= SB(2*I-1+(N/(4*tree_num))*(2-1/2**(level-2)));
          end generate Root;

        end generate A_subtree;
      end generate All_levels;
    end block;
  end generate For_all_bin_trees;
end only;

--This module generates and computes all the linear trees needed in the
--Orthogonal Flat Tree.
library IEEE;
use IEEE.std_logic_1164.all;

entity all_linear is
  generic(N : integer := 32; size : integer := 32/3 - 32/5);
  --The number of linear trees needed is given by size=(N/3)-N/log2(N)
  port (clk2 : in  bit;
        B    : in  std_logic_vector(2*N downto 1);
        T    : out std_logic_vector(N/3 downto 1+N/5));   --1+N/log2(N)
end all_linear;

architecture only of all_linear is

  --(log2 is the same binary-logarithm function as in all_binary.)
  function log2(size : in integer) return integer is
  begin
    case size is
      when 1            => return 0;
      when 2 | 3        => return 1;
      when 4 to 6       => return 2;
      when 7 to 11      => return 3;
      when 12 to 23     => return 4;
      when 24 to 47     => return 5;
      when 48 to 95     => return 6;
      when 96 to 191    => return 7;
      when 192 to 383   => return 8;
      when 384 to 768   => return 9;
      when 769 to 1024  => return 10;
      when others       => NULL;
    end case;
  end log2;

  component tree
    port(clk2 : in  bit;
         B1   : in  std_logic;
         B2   : in  std_logic;
         sum  : out std_logic);
  end component;

  signal Psum : std_logic_vector(4 downto 1);

begin
  For_all_lin_trees: for tree_num in (1+N/log2(N)) to N/3 generate
    B1: block
    begin
      --num_aver := (1+(N/tree_num))/2 is the max number of Bruns averages required.
      A_lin_tree: for num_node in 1 to ((1+N/tree_num)/2 - 1) generate

        Node: if (num_node = 1) generate
          first_node: tree port map(clk2,
              B(2*tree_num*(4*num_node-3)), B(2*tree_num*(4*num_node-1)),
              Psum(num_node));
          T(tree_num) <= Psum(num_node);
        end generate Node;

        Inner: if (num_node /= 1 and num_node /= (1+(N/tree_num))/2-1) generate
          other_node: tree port map(clk2,
              B(2*tree_num*(2*num_node+1)), Psum(num_node-1), Psum(num_node));
        end generate Inner;

        Last: if (num_node /= 1 and num_node = (1+(N/tree_num))/2-1) generate
          last_node: tree port map(clk2,
              B(2*tree_num*(2*num_node+1)), Psum(num_node-1), Psum(num_node));
          T(tree_num) <= Psum(num_node);
        end generate Last;

      end generate A_lin_tree;
    end block;
  end generate For_all_lin_trees;
end only;

--Serial Arithmetic Fourier Transform (AFT) butterfly with averaging
--included. This cell represents the basic unit of the AFTN network.
--It computes and stores all the Bruns coefficients needed for the
--later calculation of the Fourier coefficients.
--The cell contains a register where the Bruns coefficient is stored
--and scaled.
--All the computations are serial, so the addition and subtraction
--are serial and use the CSA technique to propagate the sums/subtractions.
library IEEE;
use IEEE.std_logic_1164.all;

entity aft is
  generic(word_size : integer := 5);
  port (clk : in  bit;
        a1  : in  std_logic;
        a2  : in  std_logic;
        lev : in  bit_vector(10 downto 1);
        W   : out std_logic;
        B   : out std_logic);
end aft;

architecture only of aft is

  constant tpd_clk_to_add : time := 5 ns;

  --This function converts the bit representation (0's and 1's) of a number
  --into its integer representation (decimal representation).
  function bits_to_int(bits : in bit_vector) return integer is
    variable temp   : bit_vector(bits'range);
    variable result : integer := 0;
  begin
    if bits(bits'left) = '1' then
      temp := not bits;
    else
      temp := bits;
    end if;
    for index in bits'range loop
      result := result*2 + bit'pos(temp(index));
    end loop;
    if bits(bits'left) = '1' then
      result := (-result) - 1;
    end if;
    return result;
  end bits_to_int;

  --This function finds the sum of two bits and a third bit that is either
  --0 for addition (W weights) or 1 for subtraction (B averages).
  --It realizes : X1 XOR X2 XOR X3.
  function find_sum(val : bit; x : std_logic; y : std_logic)
      return std_logic is
    alias input1 : std_logic is x;
    alias input2 : std_logic is y;
  begin
    return ( (input1 xor input2) xor To_StdULogic(val) );
  end find_sum;

  --This function finds the carry out of two bits and a stored bit that is
  --0 for addition (W weights) or 1 for subtraction (B averages).
  --It realizes the Carry Out : (X1 AND X2) OR (X3 AND X1) OR (X3 AND X2).
  function find_cout(val : bit; x : std_logic; y : std_logic)
      return std_logic is
    alias in1 : std_logic is x;
    alias in2 : std_logic is y;
    variable in3 : std_logic := To_StdULogic(val);
  begin
    return ( (in1 and in2) or (in3 and in1) or (in3 and in2) );
  end find_cout;

  --Update_queue takes the value of the Bruns average and the new value and
  --performs the averaging. Finally the new Bruns value is returned.
  function Update_queue(word_size : integer; val : std_logic;
                        bruns : std_logic_vector) return std_logic_vector is
    alias size    : integer   is word_size;
    alias new_val : std_logic is val;
    variable average : std_logic_vector(size downto 1) := bruns;
  begin
    for I in 1 to (size-1) loop
      average(I) := bruns(I+1);
    end loop;
    average(size) := new_val;
    return average;
  end Update_queue;

begin
  butterfly: process(clk, a1, a2, lev)
    variable xmap       : bit;
    variable qq         : bit_vector(2 downto 1) := ('1','0');
    variable bruns      : std_logic_vector(word_size downto 1);
    variable rs_count   : integer := bits_to_int(lev);
    variable numb_of_RS : integer := word_size + bits_to_int(lev);
  begin
    if clk'event and (clk = '1') then
      W <= find_sum(qq(1), a1, a2) after tpd_clk_to_add;
      qq(1) := To_bit(find_cout(qq(1), a1, a2), xmap);

      --Find the Bruns average. The number of RS (right shifts) required
      --includes the averaging or division by 2.
      B <= find_sum(qq(2), a1, not(a2)) after tpd_clk_to_add;
      numb_of_RS := word_size + bits_to_int(lev);
      if ( (rs_count+1) <= numb_of_RS ) then
        bruns := Update_queue(word_size, find_sum(qq(2), a1, not(a2)), bruns);
        rs_count := rs_count + 1;
      end if;

      if (a1 = 'U' and a2 = 'U') then
        qq(2) := qq(2);   -- No change
      else
        qq(2) := To_bit(find_cout(qq(2), a1, not(a2)), xmap);
      end if;
    end if;
  end process butterfly;
end only;

--log2(N)+1 stages of aft butterflies:
--It accepts serially N numbers of width equal to the word_size defined
--in the AFT module.
library IEEE;
use IEEE.std_logic_1164.all;

entity aftN is
  generic(N : integer := 2);
  port (clk : in  bit;
        aa1 : in  std_logic_vector(N downto 1);
        aa2 : in  std_logic_vector(N downto 1);
        Wa  : out std_logic_vector((2*N-1) downto 1);
        Ba  : out std_logic_vector((2*N-1) downto 1));
end aftN;

architecture only of aftN is
elk : al : a2 : lev : W : B : end component; in b i t ; in std_logic; in std_logic; in bit_vector; out std_logic; out Std_logic) ; constant tpd_clX_to_add : time : = 5 ns; signal signal -This function computes the odd indices : -- odd(l)=l, odd{2)=3, odd(3)=5 odd(n)=(2n-l) function oddfnum : in integer) return integer is begin if (num - 1) then return num; else return (2*num-l); end if; end odd; SW : std_logic_vector((2*N-1) downto 1); SB : std_logic_vector((2*N-1) downto 1); subtype bit„10 i s bit_vector(10 downto 1); type bit_10_array i s array{integer range <>} of bit_10; signal dummy : bit_10; --bit_vector(10 downto 1); signal cycles : bitj.0_array(l to (l+log2(N))) ; begin For_all_stages: for stage_num in 1 to (l+log2(N)) generate -This function log2 simulates the binary logarithm -The input determines the size of the AFT network -The maximum size of the network is N=1024 function log2(size : in integer) return integer begin case size is when 1 when 2 when 4 when 8 when 16 when 32 when 64 when 128 when 256 when 512 when 1024 when othe rs end case; end log 2; = > return 0 1 =>return = >return 32 = > return 4 =>return return 5 => return 6 = > return 7 =>return 8 =>return 9 =>=> return 10; ); -This function converts an integer number into the b i t representation -in 0's and 1's -The integer number is converted and returned as the value of the -function- signal cycles (stage_rtum} <= int_to_bits(stage_num,dummy) ; Inside_stage: for I in 1 to (H/(2**(stage_num-l))) generate Stage_l: if (stage_num = 1) generate stage: aft port mapfclk, aal(I), aa2(I), cycles(stage_num) , SW{I),SB(I) Wa(I) <= SW(I); Ba(I) <= SB(I); end generate Stage_l; => NULL; function int_to_bits ( Bl:block begin int : in integer; b i t s : in bit_vector) return bit_vector variable temp : integer; variable result : bit_vector(bits'range); begin if int < 0 then temp :=-(int+l); else temp :=int; end if; for index in bits'reverse range loop Other: if (stage_num /= 1) generate other_stage: aft port 
map(clk, SW(2*N-t-2*I-l-(8*N)/{2**stags_num)) , SW(2*N+2*I-(8*N)/(2**stage_num)), cycles(stage_num), SW(2*H+I-(4*B}/(2**stage_num)), SB(2*N+I-(4*N)/(2**stage_num)) >; Wa(2*N+I-{4*H)/(2**stage_num)) <= SW(2*H+I-(4*N)/(2**stage_num)) ; Ba(2*H+I-(4*N)/(2**stage_num)) -:= SB{2*H+I-(4*N}/(2**stage_num)); end generate Other; end generate Inside„stage; end block; end generate For_all_stages; end only; 97 Appendix B. VHDL simulation ._ --This implements the Orthogonal Flat Tree Network. The OFTN is composed of --binary trees and linear trees which compute the Fourier coefficients a(n) -- & b(n) serially. library IEEE; use IEEE.std_logic_1164.all; entity oftn is generic(N:integer:=32; size:integer:=32/5); port(clk2 : in bit; B : in std_logic_vector(N*size downto 1 ) ; Tc : out std_logic_vector(N/3 downto 1) --size=N/log(N) ) i end oftn; architecture only of oftn is component all_binary port(clk2 : in bit; B : in std_logic_vector(N*size downto 1 ) ; T : out std_logic_vector(size downto 1) \ ) ;. end component ; component all_linear port(clk2 : in bit; B : in std_logic_vector(2*N downto 1 ) ; T : out std_logic_vector(N/3 downto 1+size) \. ~l+N/log(N) ) i end component; begin --Generation of the binary trees using the component "all_binary". --The architecture of this component is defined in all_binary.vhd Bin: if (true) generate binary_gen: all_binary port map(clk2, B(N*size downto 1), Tc(size downto 1) t ; end generate Bin; --Generation of the linear trees using the component "all_linear". --The architecture of this component is defined in all_linear.vhd Lin: if (true) generate lin_gen: all_linear port map(clk2, B(2*N downto 1), Tc(N/3 downto 1+size) \. end generate Lin; end only;
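The find_sum/find_cout pair above is simply a full adder unrolled over an LSB-first serial bit stream: qq(1), initialized to '0', carries the addition (W weights), while qq(2), initialized to '1', together with the inverted second input realizes two's-complement subtraction (B averages). The following Python sketch (not part of the thesis code; function names are reused only for illustration) models one bit-serial add and subtract using the same sum and carry expressions:

```python
# Behavioral model of the bit-serial adder cell realized by
# find_sum/find_cout in the VHDL listing above. Bits arrive LSB first;
# the running carry plays the role of qq(1) (add) or qq(2) (subtract).

def find_sum(carry, x, y):
    # Sum = (X1 XOR X2) XOR X3
    return (x ^ y) ^ carry

def find_cout(carry, x, y):
    # Carry Out = (X1 AND X2) OR (X3 AND X1) OR (X3 AND X2)
    return (x & y) | (carry & x) | (carry & y)

def serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit streams (W weights path)."""
    carry = 0  # qq(1) is initialized to '0' for addition
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(find_sum(carry, a, b))
        carry = find_cout(carry, a, b)
    out.append(carry)  # final carry-out bit
    return out

def serial_sub(a_bits, b_bits):
    """Subtract via inverted b and initial carry '1' (B averages path)."""
    carry = 1  # qq(2) is initialized to '1' for subtraction
    out = []
    for a, b in zip(a_bits, b_bits):
        nb = b ^ 1  # not(a2) in the VHDL process
        out.append(find_sum(carry, a, nb))
        carry = find_cout(carry, a, nb)
    return out  # two's-complement difference, same width

def to_bits(n, width):
    return [(n >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

# One bit per "clock": 11 + 6 = 17 and 11 - 6 = 5
assert from_bits(serial_add(to_bits(11, 5), to_bits(6, 5))) == 17
assert from_bits(serial_sub(to_bits(11, 5), to_bits(6, 5))) == 5
```

This mirrors how each AFT butterfly produces both the sum output W and the difference output B from the same pair of serial inputs, one bit per clock cycle.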
Item Metadata

Title | On the VLSI complexity and architecture of the Arithmetic Fourier Transform (AFT)
Creator | Antezana, Oswaldo
Date Issued | 1994
Genre | Thesis/Dissertation
Type | Text
Language | eng
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0065199
URI | http://hdl.handle.net/2429/5177
Degree | Master of Applied Science - MASc
Program | Electrical and Computer Engineering
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor | University of British Columbia
Graduation Date | 1994-11
Campus | UBCV
Scholarly Level | Graduate