SOME S O F T W A R E A N D H A R D W A R E IMPLEMENTATIONS OF T H E FAST H A R T L E Y T R A N S F O R M By FU, Y A N KIT B.Sc.Eng., University of New Brunswick, 1986 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF M A S T E R OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES D E P A R T M E N T OF E L E C T R I C A L ENGINEERING We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y O F BRITISH C O L U M B I A August 1990 © F U Yan K i t , 1990 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Electrical Engineering The University of British Columbia 2356 Main Mall Vancouver, Canada Date: Abstract The fast Hartley transform (FHT) is a new tool for converting data between time and frequency domains. In this thesis, some speed-optimized software implementations of the radix-2 and split-radix FHT algorithms are presented initially, and then applied to the problems of convolution, computation of power spectra, image degradation, and image restoration. Subsequent work involved the development of a new bit-reversal algorithm. This algorithm is fast and efficient, and can be used to increase the throughput of the FHT. Finally, several hardware implementations are presented for the discrete Hartley transform (DHT) and the FHT with architectures using a single butterfly unit, pipelining and superparallelism. The advantages of each implementation are stressed. The data processing rates of these hardware implementations are analyzed. i Table o f C o n t e n t s Abstract i L i s t of Tables v L i s t of F i g u r e s vii Acknowledgement 1 2 viii Introduction 1 1.1 Digital signal processing 1 1.2 Overview of the following chapters 2 T h e fast H a r t l e y t r a n s f o r m ( F H T ) software i m p l e m e n t a t i o n 4 2.1 The Hartley transform 4 2.2 The discrete Hartley transform (DHT) 4 2.3 The F H T 5 2.3.1 The radix-2 fast algorithm 5 2.3.2 The split-radix fast algorithm 10 2.4 Comparison with the complex F F T 14 2.5 Comparison with the real F F T 18 2.6 To get the F F T from the F H T 19 2.7 F H T convolution and power spectrum 21 2.8 A n image processing example of the application of the F H T 22 2.9 Summary 26 n 3 4 5 The bit-reversal algorithms 27 3.1 Some existing algorithms 27 3.2 A new bit-reversal algorithm 31 3.3 Comparison of the bit-reversal algorithms 34 3.4 Summary 35 The F H T hardware implementation 37 4.1 Introduction 37 4.2 Hardware implementation of the D H T 37 4.3 Hardware implementation of the F H T 40 4.3.1 Single butterfly unit F H T architecture 40 4.3.2 Pipeline F H T architecture 45 4.3.3 Superparallel F H T architecture 49 4.4 Data unscrambling in hardware implementation 51 4.5 Comparison of the hardware implementations 52 4.6 Summary 53 Conclusion 54 5.1 The work done in this thesis 54 5.2 Future work 56 References 57 Appendices 61 A The radix-2 decimation-in-frequency F H T programs listing 61 A.l Radix-2 F H T implemented with the "three-loop method" A.2 Radix-2 F H T implemented with the "table-lookup method" iii 62 64 B The split-radix decimation-in-frequency F H T programs listing B.l 66 Split-radix F H T implemented with the "three-loop method" 67 B . 2 Split-radix F H T implemented with the "table-lookup method" 70 C The radix-2 decimation-in-frequency F F T programs listing 73 C. l Radix-2 forward F F T implemented with the "three-loop method" . . . . 74 C. 2 Radix-2 forward F F T implemented with the "table-lookup method" . . . 76 D The split-radix decimation-in-frequency F F T programs listing E D. l Split-radix forward F F T implemented with the "three-loop method" . . . 79 D. 2 Split-radix forward F F T implemented with the "table-lookup method" 82 . Bit-reversal programs listing 85 E. l 86 The implementation of Evans' bit-reversal algorithm E.2 The implementation of Gold and Rader's bit-reversal algorithm 88 E.3 The implementation of Rodriguez's bit-reversal algorithm 89 E.4 The implementation of Buneman's bit-reversal algorithm 90 E.5 The implementation of the new bit-reversal algorithm 91 E. 6 The implementation of the new algorithm using table-lookup method F 78 . . Image processing example using the F H T F. l Using the F H T in the processing of image degradation and restoration iv 92 93 . 94 L i s t of T a b l e s 2.1 Number of real arithmetic operations for the fast algorithms 2.2 Comparison of the fast algorithms in CPU execution time (in seconds) on 15 a Sun 3/60 workstation, where r-2 stands for the radix-2 algorithm and s-r stands for the split-radix algorithm 15 3.3 Bit-reversing of a 16-point array 28 3.4 Exclusive-oring a bit-reversed index in column three and the related operand in column four produces the next bit-reversed index in the data sequence. 3.5 CPU execution time (in seconds) of the bit-reversal algorithms running on a Sun 3/60 workstation 4.6 32 34 Comparison of the three FHT hardware implementations for N — 1024 and memory access time of 60 ns 53 v List of Figures 2.1 Radix-2 decimation-in-frequency and decimation-in-time butterflies. . . . 2.2 Flow diagram of radix-2 decimation-in-frequency decomposition from a 7 16-point FHT into two 8-point FHT blocks 8 2.3 Flow diagram of an 8-point radix-2 decimation-in-frequency FHT 8 2.4 Decomposition pattern of a 64-point radix-2 FHT 9 2.5 Flow diagram of split-radix decimation-in-frequency decomposition from a 16-point FHT into one 8-point FHT block and two 4-point FHT blocks. 12 2.6 Flow diagram of an 8-point split-radix decimation-in-frequency FHT. 12 2.7 Decomposition pattern of a 64-point split-radix FHT 2.8 FHT requires less CPU time than the complex F F T in the radix-2 soft- . . 13 ware implementation in both the "three-loop (lp)" and "table-lookup (tb)" methods 2.9 16 F H T requires less CPU time than the complex F F T in the split-radix software implementation in both the "three-loop (lp)" and "table-lookup (tb)" methods 16 2.10 Image degradation and restoration processes 23 2.11 Original image 24 2.12 Degraded image 24 2.13 Restored image 25 3.14 Gold-Rader's bit-reversal algorithm (after Rabiner and Gold) 28 3.15 The table-lookup bit-reversal algorithm (after Buneman) 30 vi 3.16 Improved Gold-Rader's algorithm (after Rodriguez) 30 3.17 The new bit-reversal algorithm 33 3.18 Plotting of Table 3.5 35 4.19 Hardware structure of a 4-point DHT 38 4.20 Architecture of a single butterfly unit FHT processor 41 4.21 Radix-2 FHT butterfly unit hardware structure 41 4.22 Single butterfly unit FHT hardware structure 42 4.23 Data sequence entering the radix-2 FHT butterfly unit 43 4.24 Timing diagram for the single butterfly unit processor 44 4.25 Pipeline architecture of a 16-point FHT 46 4.26 Pipeline hardware structure of a 16-point FHT 46 4.27 Hardware structure of the butterfly unit in the combined last two stages. 47 4.28 Data sequence for each butterfly unit 48 4.29 Timing diagram of the pipeline FHT processor 48 4.30 Superparallel hardware structure of a 16-point FHT 50 4.31 Hardware structure of butterfly unit BUO 50 4.32 Pre-permutation 16-point radix-2 FHT data flow diagram 51 4.33 Data sequence entering the pre-permutation algorithm 52 vii Acknowledgement I would like to express my gratitude to my supervisor Dr. Malcome Wvong for his valuable suggestion and encouragement. I would like to thank Mr. Robert Ross for his technical advice in the use of the departmental computer facilities. Thanks also to Dr. M. Robert Ito for pointing out some interesting properties of the fast Fourier transform, that led to an improvement in this thesis. viii Chapter 1 Introduction 1.1 Digital signal processing Along with the progress of technology, the rapid development of the digital computer has made it a very powerful tool for computation and analysis. Digital signal processing technology takes advantage of the tremendous power of the digital computer to provide much faster and more accurate solutions to scientists and engineers than in the past. Much tedious work in the fields of engineering, biomedicine, physics, astronomy, etc. previously based on the analogue signal processing technology, now can be easily handled with digital signal processing techniques. Digital signal processing is usually carried out for a mixture of data in both the time and frequency domains. The time domain allows processing in real-time systems, while the frequency domain facilitates filtering and spectrum analysis. During regular digital signal processing, data are usually transformed between these two domains frequently. In fact, the transformation is very time consuming, the throughput of most processes depends heavily on the efficiency of the way that the transformation is implemented. The fast Fourier transform (FFT) and the fast Hartley transform (FHT) are solutions to this bottleneck. The F F T is a widely used tool in digital signal processing, and provides a fast numerical method to convert data from time domain to frequency domain, and vice versa. The classical Fourier transform is designed for the general case when the data sequence 1 Chapter 1. Introduction 2 is complex, that is, it has real and imaginary parts. For the case when the data sequence is real, its Fourier transform is of complex conjugate nature, this redundancy has been taken advantage of, and several F F T algorithms which make use of this redundancy have been devised [l]-[5], resulting in computer time savings of approximately fifty percent. The FHT is an alternative to the F F T for real-valued sequences, and is well-known for the characteristics that, it has only real arithmetic, and the forms of its forward and inverse transforms are identical. The FHT has a speedup factor of two over the complex F F T [6]-[17], in the case when the complex F F T processes a real sequence as complex by adding an redundant zero-valued imaginary part. The FHT is useful in real data processing in different areas, for example, in one-dimension: spectral analysis, in twodimension: image processing, and also in multi-dimensional digital signal processsing. O v e r v i e w o f t h e following chapters 1.2 This thesis presents some software and hardware implementations of the one-dimensional FHT. Chapter 2 gives an introduction to the FHT. The radix-2 and split-radix FHT algorithms are implemented in software using several approaches. The FHT software implementaions are evaluated and compared with the complex and real FFT's. Application of the FHT is shown through an image processing example, where an image is first degraded and then restored. Some properties of the FHT are discussed. Chapter 3 discusses the scrambled output problem inherent in the fast radix algorithms. A new bit-reversal algorithm is proposed and its efficiency is compared with several other bit-reversal algorithms. Chapter 4 shows some hardware implementations of the DHT and the FHT. Architectures such as single butterfly unit, pipelining and superparallelism are compared. The Chapter 1. Introduction data processing rates are analyzed. Chapter 5 draws conclusions on the implementation results. 3 Chapter 2 The fast Hartley transform ( F H T ) software implementation 2.1 The Hartley transform In 1942 R.V.L. Hartley introduced a pair of real integral transforms [18] as an alternative to the Fourier transform pair. This pair of formulas have strict reciprocal characteristics for converting signals between time and frequency domains, and are known as the Hartley transform and its inverse. If t is a time variable and / is a frequency variable, the Hartley transform [6] for a 1 real function h(t) is CO / h(t) ens (2w ft) dt (2.1) -co and the inverse Hartley transform is CO / H(f)cas(2nft)df (2.2) -co where cas 0 = cos 9 + sin 6. 2.2 The discrete Hartley transform ( D H T ) In 1983 R.N. Bracewell introduced the discrete Hartley transform (DHT) [19]. The DHT uses discrete time and frequency variables instead of continuous variables as in the Hartley transform, and thus allows numerical analysis on digital computers. The original l/y/2n factors of the integrals in Hartley's definition are omitted, when frequency / is used as the variable instead of angular frequency u. 1 4 Chapter 2. The fast Hartley transform (FHT) software implementation 5 The DHT of a real data sequence h(t) of size iV is defined as N-l H{f) = £ t=o h{t) cas (2TTft/N) (2.3) where / = 0,..., N - 1. The inverse DHT is defined as MO 4 ^ W ) c a s W W ( - ) 2 4 where * = 0,... ,iV - 1. 2.3 The F H T The DHT requires N multiplications and N(N — 1) additions for an iV-element data 2 sequence, then the total arithmetic operations equals 2N — N or of order 0(N ). The 2 2 number of arithmetic operations which grows at a rate of N is highly undesirable for the 2 digital computer, and led to the development of the FHT. The FHT is a set of techniques (algorithms) used on digital computers to improve the efficiency of the DHT. In 1984 Bracewell introduced the decimation-in-time radix-2 FHT [7] of order 0(N \og JV), and soon after that, many other FHT's of the same order have been pre2 sented, for example, Cooley-Tukey radix-2(decimation-in-frequency), radix-4, radix-8, split-radix, prime factor, Winograd, Walsh-Hadamard, etc. [9], [10], [20]-[24]. In fact, most of the fast algorithms applied to the discrete Fourier transform (DFT) basically can also be applied to the DHT because of their similarities. 2.3.1 The radix-2 fast algorithm One of the earliest and most popularly used fast algorithm is the Cooley-Tukey radix-2 algorithm. The radix-2 algorithm has a simple flow structure that can be easily implemented in software. Chapter 2. The fast Hartley transform (FHT) software implementation 6 The radix-2 FHT is listed in the following, with the general notations n and k instead of t and / to eliminate the specific reference to time and frequency, since the same FHT algorithm can be applied to both forward and inverse transforms, with only a scaling factor difference between them. The radix-2 decimation-in-time FHT [7] is H(k) = H (k) + H (k) 2n cos 2n+1 where k = 0,..., N — 1, and H 2n (^fc) + H 2n+1 (N - k) sin (-^fc) (2.5) is the DHT of the even-numbered data and H +\ is 2n the DHT of the odd-numbered data. The radix-2 decimation-in-frequency FHT [10] is 2 H(2k) = H(2k + 1) = £ M 'N r [*(") + x ( y + )] n c E {[*(») " ( Y + )] x + n /27T, a s {^ c o s Kf " ) " " ] n x(iV n) sin 2 k n ) ( - ) 2 6 (^ ) n (% )} n cas (ji ) 2kn (2-7) where k = 0,..., -y — 1. The radix-2 decimation-in-time and decimation-in-frequency approaches are distinguished according to the way they process the data sequence, either by doing the multiplication of the trigonometric coefficients firstly or lastly in the butterfly as shown in 2 Figure 2.1. However, these approaches are symmetric inflowstructures and require the same number of arithmetic operations [25] [26], where the number of nontrivial real additions, A, and the number of nontrivial real multiplicatons, M, required for an TV-point radix-2 FHT [10] are: A = M (Z/2)N\og N2 = N\og N 2 -W (3/2)JV + 2 +± (2.8) (2.9) A butterfly is a basic computation unit in the FHT or the FFTflowstructure, such implementation of the fast algorithm can be simplified by using the butterfly repetitively. 2 t h a t the Chapter 2. The fast Hartley transform (FHT) software implementation 7 cos n9 — cos n0— decimation-in-frequency decimation-in-time 9 = 2Jt/N Figure 2.1: Radix-2 decimation-in-frequency and decimation-in-time butterflies. For simplicity, only the decimation-in-frequency approach is discussed in the following. The operation of the radix-2 decimation-in-frequency FHT is first to split the input data into two FHT blocks, where one contains the even-numbered points and another contains the odd-numbered points. Similarly, these two FHT blocks are then split into four FHT blocks, and repeatedly until one-point FHT blocks are obtained. A decimation-in-frequency radix-2 fast Hartley transform has the flow diagram as shown in Figure 2.2 and Figure 2.3. Figure 2.2 shows the first stage of a 16-point FHT decomposed into two 8-point FHT blocks. Figure 2.3 shows the second, third and fourth stages of decomposition in one of the 8-point FHT block. The total number of stages required for the radix-2 FHT equals (log N), such that, a 16-point FHT requires 2 4 stages, a 1024-point FHT requires 10 stages, etc. This decomposition property of the radix-2 FHT is shown clearly in Figure 2.4. Software implementation of the FHT in a general purpose computer has an efficiency closely related to the fast algorithm and implementation technique. The CPU time of a FHT processing is mainly spent on the multiplications and additions of floating-point numbers, and also the computation of the trigonometric functions, the former is chiefly determined by the fast algorithm used, for instant, the split-radix FHT (will be discussed Chapter 2. The fast Hartley transform (FHT) software implementation 8 0 = 2K/ 16 Figure 2.2: Flow diagram of radix-2 decimation-in-frequency decomposition from a 16-point FHT into two 8-point FHT blocks. 6 = 271/8 Figure 2.3: Flow diagram of an 8-point radix-2 decimation-in-frequency FHT. Chapter 2. The fast Hartley transform (FHT) software implementation 0 1 2 3 4 5 2 2 2 2 8 1 6 7, 2 8 ? 2 32 7, 2 7, 2 8 1 6 2 2 2 8 2 ?, 64 2 8 16 8 4 32 2 2 ? 2 7 2 2 2 8 7, 7, 1 6 2 2 8 4 2 2 Figure 2.4: Decomposition pattern of a 64-point radix-2 FHT. in the next subsection) requires lessfloating-pointmultiplications and additions than the radix-2 FHT, the latter, is determined by the implementation technique. Some efficient and interesting methods for implementing the radix-2 decimation-in-frequency FHT are presented in the following. The first method (as shown in Appendix A) uses three loops to implement the recursive property of the radix-2 algorithm, and is thus termed the "three-loop method". The first loop jumps by stage, and it loops through only (log N — 2) stages, because 2 the last two stages are not involved in the multiplication of the trigonometric functions, it is more efficient to treat them separately with special coding. The second loop jumps by data point, and the trigonometric function values are computed in this loop. Since within the same stage, the trigonometric function values required in different blocks are identical, so that, when the third loop jumps by block, the same set of trigonometric function values can be reused in each block, and thus increases 9 Chapter 2. The fast Hartley transform (FHT) software implementation 10 the processing speed. Inside the third loop, a four-point butterfly is used to process four data points at a time. A separate part for sequence with size less than four is added at the beginning of the program to complement the four-point butterfly. The second method (also shown in Appendix A) is an improved version of the first method, and the repetitive computation of the identical trigonometric functions in different stages is avoided. This is done by first creating a trigonometric function table to store all the precalulated sine and cosine function values, so that later on, when the program requires a trigonometric function value, it will look up the table instead of calculating it repeatedly in different stages. This method is termed the "table-lookup method". One drawback of this method is that a memory space of (N/2) words is required for the trigonometric function table. However, for a process that does fixed length FHT many times, the table needs to be created only once, and the overhead to create the table will become negligible. 2.3.2 The split-radix fast algorithm Among the radix algorithms, the split-radix algorithm is the most efficient one. The splitradix algorithm was first designed for the DFT, which combines the advantages of both the radix-2 and radix-4 fast algorithms, and has its even-numbered part performed in radix-2 and odd-numbered part performed in radix-4. The resulting split-radix algorithm has fewer nontrivial real multiplications and additions than the radix-2, radix-4, and even radix-8 algorithms [27] [4]. The split-radix FHT [9] [20] possesses the same advantages of the split-radix FFT. The even-numbered part of the split-radix FHT algorithm is (2-10) Chapter 2. The fast Hartley transform (FHT) software implementation 11 where k — 0,..., y — 1, and the odd-numbered part is H(Ak + 1) = £ {[(*(») - * [(* ( T + » ) ) + (* ( j - ) - (ir )) x + N n )- ( 1 x ~ ) ) l n - (* (T " ) " N +N N ) )] c o s S I N ( I " ) " (TF ) 1 1 cas (2.11) #(4*:+ 3) = S {[( s ( n ) [(* ( f " n ~ ) " x x (Y ( + N x + )) " n ) ) + (T " ( x ( f " n n ) " ( x ) ~ x { x " N ~ n ) n ))] )] c o s ( l 3 n ) (77 )} 3 n s i n •cas where + (2.12) = 0,..., — 1. The split-radix FHT has a total number of nontrivial real additions, A, and a total number of nontrivial real multiplicatons, M , for an iV-point FHT [9], such that: A = (2/3)N\og N-(19/9)N+ 2 M 3-r(-l) * lo = (4/3)A^log ^-(14/9)A^ + 3 + (-l) 2 2N log27V -(1/9) (2.13) -(5/9) (2.14) Figures 2.5 and 2.6 show the flow diagrams of the split-radix FHT. The implementation of the split-radix algorithm is more difficult than the radix-2 algorithm. Since the split-radix algorithm is a combination of the radix-2 and the radix-4 algorithms, in a radix-2 stage, a FHT block is decomposed into two half-sized FHT blocks, but in a radix-4 stage, a FHT block is decomposed into four quarter-sized FHT blocks, that means one radix-4 stage is equivalent to two radix-2 stages. This is the reason why the radix-4 is faster than the radix-2 and why it can only accept data sequence with size equal to multiple of four. Chapter 2. The fast Hartley transform (FHT) software implementation 12 e = 27t/16 Figure 2.5: Flow diagram of split-radix decimation-in-frequency decomposition from a 16-point FHT into one 8-point FHT block and two 4-point FHT blocks. Figure 2.6: Flow diagram of an 8-point split-radix decimation-in-frequency FHT. Chapter 2. The fast Hartley transform (FHT) software implementation 1 2 3 5 4 ! 4 L2_ 8 1 6 2 2 2 4 4 ~ L 32 8 8 64 4 l_2_ 2 2 2 4 2 2 4 2 4 2 2 2 8 1 6 2 4 L2_ 4 8 1 6 La2 4 2 2 4 l_2- Figure 2.7: Decomposition pattern of a 64-point split-radix FHT. In Figure 2.7, we can see that the 64-point FHT has its even part decomposed b y the radix-2 into a 32-point FHT block and its odd part decomposed by the radix-4 into two 16-point FHT blocks. This combination of different radices makes the proceeding of the even and odd part decomposition unsymmetrical. So that the "three-loop method" applied to the radix-2 requires some modification in order to fit into the split-radix. The first method (as shown in Appendix B) uses the similar approach as in the "threeloop method" in Appendix A. A special part is used for the last three stages which are not involved in the multiplication of the trigonomertic function, in order to optimize the processing speed. The first loop jumps by stage for a total (log N — 3) stages. The second loop jumps 2 by point, and the trigonometric function values are calculated in this loop, such that when the third loop jumps by the FHT blocks, each block in the same stage can reuse the trigonometric function values obtained in the second loop and so reduces the C P U 13 Chapter 2. The fast Hartley transform (FHT) software implementation 14 time. The third loop is more complicated than that in the radix-2 implementation, because of the decomposition of the unsymmetrical even-numbered and odd-numbered parts. An elegant method in [28] is utilized here to achieve the task. Inside the third loop, an eightpoint butterfly is used to process the data, therefore, a pre-examination part is presented at the beginning of the program to process the FHT with sequence size less than eight. This method avoids the trigonometric functions being calculated repeatedly in each block in the same stage. A further speed-up can be obtained by using a lookup table (as shown in Appendix B). This is similar to that of the radix-2, where a trigonometric function table is created to store all the precalculated sine and cosine values, and thus avoid the repeated calculation of the identical trigonometric functions in different stages. This "table-lookup method" requires a table size of (N/2) words. 2.4 Comparison with the complex F F T For comparison of the FHT with the complex FFT, the radix-2 complex F F T and the split-radix complex F F T [28] [29] based on the three-loop and table-lookup methods are implemented and as shown in Appendix C and Appendix D , respectively. 3 Table 2.1 lists the number of nontrivial real multiplications and additions required for each algorithm, where for the radix-2 complex F F T , A = M = 3N\og N -2N + 2 (2.15) 2N\og N -47V + 4 (2.16) 2 2 and for the split-radix complex F F T [29] [30], A = (8/3)^log 7Y-(16/9)^ + 2 - ( - l ) 2 l o g 2 .(2/9) ; v (2.17) Only the forward transform programs are listed. The inverse transform programs can be obtained easily from the forward transform programs by taking complex conjugate on the twiddle factors. A method to use a single forward FFT for both forward and inverse transforms is shown in [31] [32]. 3 Chapter 2. The fast Hartley transform (FHT) software implementation 15 complex FFT Split-radix Radix-2 Split-radix adds mults adds mults adds mults 4 2 0 4 0 0 8 0 18 4 16 0 22 2 58 52 20 8 64 12 162 144 32 68 42 104 166 418 372 196 124 416 1026 912 516 288 330 2434 998 1284 2164 744 5634 3076 2336 828 1824 5008 1994 12802 7172 5350 11380 4328 12064 4668 28674 16388 25488 10016 26854 10698 63490 36868 56436 22760 59168 24124 139266 81924 123792 50976 FHT N 2 4 8 16 32 64 128 256 512 1024 2048 4096 Radix-2 adds mults 2 0 0 8 4 26 74 20 194 68 482 196 1154 516 1284 2690 6146 3076 13824 7172 30722 16388 67586 36868 Table 2.1: Number of real arithmetic operations for the fast algorithms. Power of two 8 9 10 11 12 13 14 15 16 N 256 512 1024 2048 4096 8192 16384 32768 65536 FHT complex FFT three-loop table-lookup three-loop table-lookup r-2 s-r r-2 s-r r-2 s-r r-2 s-r 0.10 0.08 0.08 0.22 0.07 0.20 0.17 0.15 0.22 0.20 0.17 0.48 0.42 0.32 0.15 0.37 0.42 0.37 1.02 0.82 0.48 0.33 0.88 0.68 1.03 0.88 0.83 0.70 2.20 1.88 1.77 1.48 2.17 1.88 1.78 3.97 3.82 1.53 4.65 3.15 4.57 3.95 3.87 3.22 8.30 8.02 9.73 6.68 9.65 8.30 8.15 6.82 20.40 17.45 17.02 14.18 20.25 17.42 17.35 14.48 42.65 36.43 36.00 29.93 42.43 36.43 36.90 30.53 89.23 75.97 75.85 63.17 Table 2.2: Comparison of the fast algorithms in CPU execution time (in seconds) on a Sun 3/60 workstation, where r-2 stands for the radix-2 algorithm and s-r stands for the split-radix algorithm. Chapter 2. The fast Hartley transform (FHT) software implementation o.oiH 8 1 10 1 12 1 14 16 1 16 Power of two Figure 2.8: FHT requires less CPU time than the complex F F T in the radix-2 software implementation in both the "three-loop (lp)" and "table-lookup (tb)" methods. Power of two Figure 2.9: FHT requires less CPU time than the complex F F T in the split-radix software implementation in both the "three-loop (lp)" and "table-lookup (tb)" methods. Chapter 2. The fast Hartley transform (FHT) software implementation M = (4/3)A^log 7V-(32/9)iV-f4-(-l) 2 log2iV -(4/9) 17 (2.18) The easiest method to apply the complex F F T on a real sequence, is to expand the real sequence to complex by adding redundant zero-valued imaginary part. Based on this method, the execution time of the FHT and the complex F F T of a real data sequence 4 is computed and compared in Table 2.2, the graphs in Figures 2.8 and 2.9 based on Table 2.2 further clarify how the execution time increases with N. One important point not shown in Table 2.2 is that in the table-lookup method, the FHT requires a table size of (N/2), where the complex F F T requires a table size of N, double that of the F H T . It is seen that, under the same fast algorithm and implementation method, the F H T is about twice as fast as the complex F F T . It is obvious that, the complex F F T based on the above method wastes one-half of the CPU time, on computing the redundant zero-valued imaginary data, and is then very inefficient. There are two other commonly used methods [31] to deal with real sequence in the complex F F T , and without the use of the redundant zero-valued imaginary part. The first method is to transform an N-point real sequence by means of an N/2point complex F F T . First of all, an N-point real sequence is divided into two N/2-point sequences, where one contains the even-numbered data, and another contains the oddnumbered data of the original sequence. These two N/2-point sequences are input as the real and imaginary parts to an N/2-point complex F F T . The transform of the N-point sequence can be obtained by combining the real and imaginary parts of the complex F F T output with some algebraic computation. The second method allows the transform of two N-point real sequences simultaneously by using one N-point complex FFT. In this case, thefirstreal sequence is input as the real part and the second sequence is input as the imaginary part of an N-point complex F F T . Does not include the input, output and bit-reversing time. 4 Chapter 2. The fast Hartley transform (FHT) software implementation 18 The transforms of these two N-point sequences can be decomposed from the N-point complex F F T output with some additional computation. These two methods have the similar speedup factor as the FHT. However, as indicated in [13] [16], these methods cannot be used to retransform their own output, since they take only real data. Some modified approaches should be used for the inverse transform to convert the complex conjugate symmetric sequence back to a real sequence. So that the FHT is more convenient here for real sequences, since the identical forward transform program can be used for the inverse transform with only a scaling, and does not require extra computation for combining or sorting out the result. 2.5 Comparison with the real FFT Some specially designed FFT's for real sequence [l]-[5], utilize the redundancy and symmetry of the complex FFT's to reduce the number of multiplications by exactly one-half, and the number of additions by more than one-half of that of the complex FFT's. So that, these real FFT's should run at around the same speed as the FHT under the same fast algorithm implementation. Also, the real F F T requires the same size N storage space as the FHT, and because the real F F T and the FHT are very similar, the other overhead costs should be almost the same. The real F F T transforms a real sequence into a complex conjugate symmetric sequence, the inverse of the real F F T transforms a complex conjugate symmetric sequence back into a real sequence, such that, two algorithms are required for the application which needs both forward and inverse transforms. The FHT is then more convenient than the real F F T , since only one algorithm is sufficient for both forward and inverse transforms. Chapter 2. The fast Hartley transform (FHT) software implementation 2.6 19 To get the F F T from the F H T The Hartley transform, H(f), Hodd(f) can be split into its even part, H (f) and odd part, even [6], such that H(f) = H (f) (2.19) + H (f) even odd The even part is defined as HeUf) = \[H(f) (2-20) + H(-f)} CO / x(t)cos(2irft)dt (2.21) -oo and the odd part is defined as H {f) odd = \\HU)-H(-f)] CO ' x(t) sin(2n ft) dt (2.22) (2.23) / -OO The two integrals in Equation (2.21) and (2.23) are known as the Fourier cosine transform and the Fourier sine transform respectively. It is apparent that oo / / -co oo x(t)[cx>s(2vft) _ j sin(27r/0] dt x(t) t-^ dt }t (2.24) (2.25) -co is in fact the Fourier transform, F(f). That is F(f) In other words, H (f) even H (f) odd 1 S = H (f) even - JH (f) odd is equal to the real part, F i(f) rea equal to the sign-reversed imaginary part, Fi U) mag (2.26) of the Fourier transform and of the Fourier transform, then the Hartley transform can be written as H{f) = FrealU) ' ^ U ) (2-27) Chapter 2. The fast Hartley transform (FHT) software implementation 20 Equations (2.26) and (2.27) show that from either the Fourier transform or the Hartley transform, one can easily obtain the other transform. Similarly, in the discrete case, for a data sequence of N elements, the F F T of a real function can be obtained from the even and odd part of the FHT. Since H(f) is considered periodic with period N, then H (f) even HouU) = ±[H(f) + H(N - f)] = \[H{f)-H(N-f)} (2.28) (2.29) and from 2.26 F(f) = H (f)-JH (f) even odd - f) + H(f) 2 H(N = , + H(N J - /) 2 H(f) ( 2 - 3 0 ) where / = 0,..., N - 1, and H(N) = H(0). However, Equation (2.30) is only appropriate for real-input data, since neither H(N — f) nor H(f) can deal with complex numbers, so that, a method to obtain the complex F F T by using the FHT is developed in the following. Suppose y(t) is a complex function such that y(t) = a(i) + j b(t), where a(t) and b(t) are real functions, then the FFT of y(t) is Yp(f). F F T of a(t) + j b(t) is Ap(f) and B (f) F + j B (f), F By using the linearity of the FFT, the where A (f) F and B (f) F are complex. If A (f) F are expressed in their real and imaginary parts, then y>(/) = A (f) = [A (f) = [AF (f)-B (f)}+J[A (f) F Fr r + j B (f) F -f ; A ,(f)} F Fi Fi + j [B (f) Fr + j B (f)\ + B (f)} Fr Fi (2.31) Chapter 2. The fast Hartley transform (FHT) software implementation 21 since A {f) = A (f) = Fr Fi - f) + A (f)} (2.32) \\AH{N - f) - A (f)\ (2.33) H H where A (f) is the FHT of a(t), and H B (f) = \[B„(N - f) + B (f)} (2.34) B (f) = \[BH(N - f) - B (f)} (2.35) Fr Fi H H where B (f) is the FHT of b(t), Equation (2.31) becomes H Y (f) F = \[A (N - f) + A„(f) - B (N - f) + BHU)\ + H H j -[A (N - f) - A (f) l H H + BJJ(N -f) + Bfj(f)] (2.36) With Equation (2.36), one can obtain the F F T of complex-input data through the FHT. 2.7 F H T convolution and power spectrum Two of the most common applications of the FHT is for convolution and to obtain power spectrum for filtering. The FHT convolution theorem [7] states that if h(t) is the convolution of two real functions, hi(t) and h2(t), that is h(t) = hi(t) * h (t) = ^ h^t') h (t - t') 2 (2.37) 2 t'=o then H(f) = H (f)H (f) x 2e (2.38) + H,(N - f)H (f) 2o where H(f), H\(f) and H (f) are the FHT's of h(t), hi(t) and h (t), respectively, and 2 H (f) = H (f) 2 2e + if (/), the sum of its even and odd parts. 2o 2 Chapter 2. The fast Hartley transform (FHT) software 22 implementation From Equations (2.28) and (2.29), Equation (2.38) becomes H(f) = \[Hx{f)H*U) + H (N- + H {f)H*{N x f)H (f) t 2 - Hi(N - f) - f)H (N 2 - /)] (2.39) However, when either h\(t) or h (t) is symmetrical (even function), that is, either /ii(£) = 2 hi(N — t) or h (t) = h (N — t), then the convolution theorem in Equation (2.39) can be 2 2 simplified to a single real multiplication of the two FHT's: H(f) The power spectrum, P (f), a Ps(f) (2.40) = H (f)H (f) 1 2 of a real function h(t), calculated with the FHT = \H(f)\ = \{[H(f)Y is (2.41) 2 + [H(N - /)] } (2.42) 2 where H(f) is the FHT of h(t). 2.8 A n image processing example of the application of the F H T An application of the F H T is presented in this section through an image processing example. In this example, an image is first degraded, and then restored. The theorems for addition, convolution, power spectrum, etc. on the FHT are applied. Since the image is a two-dimensional signal, in order to use the fast one-dimensional algorithm in Appendix B, the input image is transformed into one dimension by joining each row of the two-dimensional image matrix serially with the previous row . This is 5 similar to T V image processing. This method suffices only in this example. The general methods to obtain the two-dimensional F H T with the one-dimensional F H T algorithm are derived in [13]. A two-dimensional F H T algorithm is developed in [33]. 5 Chapter 2. The fast Hartley transform (FHT) software implementation Degradation 23 Restoration g g r w H n Figure 2.10: Image degradation and restoration processes. The image processing is carried out by passing the image signal through a linearmotion blur operator and with the addition of some white noise, the degraded image signal is then restored by a Wiener filter. The processes are as shown in Figure 2.10. If an image x(t), passes through a linear-motion blur operator m(i) and interfered by a white noise n(t), the degraded image g(t) can be expressed as: g(t) = x(t) * m(t) + n(t) (2.43) where "*" is the convolution operator. A simple method to restore the image is to pass the degraded image through a Wiener filter. The restored image r(t) can be expressed as: r(t) = w(t)*g(t) (2.44) where w(t) is the Wiener filter function in time domain. The Wiener filter function in frequency domain, W(f) is W(f) = M(N - f) \M(f)\* + NSR (2.45) where M(f) is the linear-motion blur operator in frequency domain, and NSR is the noise to signal ratio, which is defined as the ratio of the power spectrum of noise to the power spectrum of signal. Chapter 2. The fast Hartley transform (FHT) soft ware implementation Figure 2.12: Degraded image. 24 Chapter 2. The fast Hartley transform (FHT) software implementation 25 Figure 2.13: Restored image. The degradation starts with a portrait image as shown in Figure 2.11. Each pixel in the image has a gray level between 0 to 255. A pixel in gray level 0 is black, in gray level 255 is white, and in a gray level between 0 and 255 has different gray scale. The image is first input to a linear-motion blur operator, which stretches each pixel in the image to a distance of nine pixels to the right, then the blurred image is further degraded with white noise of random gray level from 0 to 10. The degraded image is shown in Figure 2.12. The degraded image is restored with a Wiener filter, which does the inverse of the blur operator and compensates the noise level. The restored images as shown in Figure 2.13 has a root-mean-square error of 8.4 gray levels with respect to the original image. Chapter 2. The fast Hartley transform (FHT) software implementation 26 2.0 Summary The radix-2 and split-radix FHT's are implemented in the "three-loop method" and "table-lookup method", and compared with the F F T on real sequence. It is found that although the F F T can be as fast as the FHT, the FHT is more convenient in performing the inverse transform on the complex conjugate symmetric data sequence back to a real sequence. The methods to obtain the F F T from the FHT and vice versa are discussed. The applications of the FHT in convolution and computation of power spectra are shown through an image processing example. Chapter 3 The bit-reversal algorithms In the radix FHT algorithms, the data after processing are stored back in the original place where the data were obtained from, in order to economize on data memory usage. However, this "in-place" storing method generates a transform with output data sequence in bit-reversed order. For the applications which require the transform sequence in the correct order, an extra permutation stage should be used to unscramble the data. For the decimation-in-time approach, the permutation stage is placed in front of the transform, and is called the "pre-permutation". For the decimation-in-frequency approach, the permutation stage is placed after the transform, and is called the "post-permuation". However, the permutation stages in both approaches are exactly the same and can be performed with the bit-reversal algorithm. 3.1 Some existing algorithms One of the most widely used bit-reversal algorithm is introduced by Gold and Rader. An implementation of this algorithm [32] is repeated in pseudo-code form in Figure 3.14. This algorithm uses a "bit-reversed counter" to generate bit-reversed index sequence which contains the reversed bits of the index sequence from zero to (TV—2). A comparsion is made between each index and the corresponding bit-reversed index generated, and a swap of the values in these two indices is required whenever the bit-reversed index is greater than the original index in value. The swapping required for a 16-point data sequence is shown in Table 3.3. 27 Chapter 3. The bit-reversal algorithms subroutine reverse(x[ ] , N) reversed.index = 0 N2 = N / 2 f o r index = 0 to N - 2 i f (index < reversed.index) then swap(x[index], k • N2 while (k <= reversed.index) reversed.index = reversed.index - k k =k / 2 end while reversed.index = reversed.index + k next index end subroutine x[reversed.index]) Figure 3.14: Gold-Rader's bit-reversal algorithm (after Rabiner and Gold). Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 in binary 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111 reversed 0000 1000 0100 1100 0010 1010 0110 1110 0001 1001 0101 1101 0011 1011 0111 1111 swap YES YES YES YES YES YES Table 3.3: Bit-reversing of a 16-point array. Chapter 3. The bit-reversal algorithms 29 The operation of the bit-reversed counter starts from bit-reversed index zero, the counter generates the next bit-reversed index by incrementing one in the most significant bit of the present bit-reversed index, and a rightward carry is carried out if necessary. This procedure is repeated in the counter until the second last bit-reversed index is obtained, where the last bit-reversed index is not required to be generated simply because no swapping is involved . To increment by one in the most significant bit, means adding 1 a value equivalent to (N/2), where carry is added rightwardly. A rightward carry can be obtained by checking for the first zero bit starting from the most significant bit. If a "one" is found then change it to zero, and check the next bit on the right hand side until a zero is found, then change the zero to one. To change a one to a zero can be done by substracting (N/2), or (N/4), or (N/8), etc. for each one bit that is found. Similarly, a zero can be changed to a one by adding (N/2), or (N/4), or (N/8), etc. Since the last index which has all one-bits need not be checked, there must be at least one zero bit i n each index, so that overflow will not occur. Buneman [8] shows an equivalent bit-reversal algorithm but with a lookup table to store the (N/2), (N/4), (N/8), etc. values to reduce the CPU time spent on the repeated integer divisions. Since in some high-level computer languages, the integer division cannot be replaced by simple right-bitwise shifting as in the assembly language, instead, the integer division is performed in time consuming long division. The Buneman's algorithm is shown in pseudo-code form in Figure 3.15. Rodriguez [34] presents an improvement to the Gold and Rader's algorithm by setting an upper bound on the index sequence so as to eliminate the CPU time on checking the indices beyond the upper bound, where the upper bound means the last index of swapping, which has a value equal to (TV — 1 — s/R * N), where R is "one" if the power In fact, if swapping is determined under the condition that the bit-reversed index is greater than the original index, then the second last bit-reversed index needs not be generated too, because the second last index must have a value greater than its reversed bits. 1 Chapter 3. The bit-reversal algorithms subroutine reverse(x[ ] , power, N) P = i; for index = 0 to power table[index] = p p = + next index p p reversed.index = 0 for index = 1 to N - 2 p = power do { p =p - 1 reversed.index = reversed.index - table[p] } while (reversed.index >= 0) reversed.index = reversed.index + table[p + 1] i f (index > reversed.index) then swap(x[index], x[reversed.index]) next index end subroutine Figure 3.15: The table-lookup bit-reversal algorithm (after Buneman). subroutine reverse(x[ ] , power, N) upper.bound = N - 1 - 2 ** floor{(power + 1) / 2} /* floor{s} = the greatest integer smaller or equal to s */ reversed.index = 0 N2 = N / 2 for index = 1 to upper.bound k = N2 while (k <= reversed.index) reversed.index = reversed.index - k k = k /2 end while reversed.index = reversed.index + k i f (index < reversed.index) swap(x[index], x[reversed.index]) next index end subroutine Figure 3.16: Improved Gold-Rader's algorithm (after Rodriguez). Chapter 3. The bit-reversal algorithms 31 of two is even or "two" if the power of two is odd. The Rodriguez's improved algorithm is shown in pseudo-code form in Figure 3.16. Evans [35] introduces a more efficient but also more complex bit-reversal algorithm than all the algorithms shown above. This algorithm uses a precalculated seed table to enhance the speed in generating the bit-reversed sequence, and employs some sophisticated rules to govern the swapping. An implementation of this algorithm is shown in Appendix E . l . Additional work by Evans in [36] claims that with the addition of an extra rule, this algorithm can run slightly faster in most computers. 3.2 A new bit-reversal algorithm Gold and Rader designed the bit-reversal algorithm based on the bit-reversed counter some two decades ago. The Gold-Rader's bit-reversed counter utilizes integer division on the operations of incrementation and rightward carry, since probably at that time, no suitable high-level language could be used to handle the bitwise operations, and the assembly language programming is tedious. By taking advantage of the bitwise operators of the C language, a new bit-reversed counter is developed. Table 3.4 shows the operation of a bit-reversed counter of N — 16 based on the use of the bitwise "exclusive-or" operator. By inspection, one can see that the reversed bits of the odd-numbered indices can be generated by exclusive-oring the reversed bits of the previous even-numbered indices with the operand "1000", whereas the reversed bits of the even-numbered indices can be obtained by exclusive-oring the reversed-bits of the previous odd-numbered indices with the operand "1100", "1110" and "1111", respectively. The rule for choosing the appropriate "exclusive-or" operand is that, always choose Chapter 3. The bit-reversal algorithms Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 in binary reversed xor operand xor result 0000 0000 1000 1000 1000 1100 0100 0001 0100 1000 1100 0010 0011 1100 1110 0010 0010 1000 0100 1010 1100 0101 1010 0110 0110 0110 1000 1110 0111 1110 1111 0001 0001 1000 1100 1000 1100 1001 1001 0101 0101 1000 1010 1101 1101 1110 1011 0011 1100 0011 1000 1011 1011 1100 1101 0111 1000 1111 1110 0111 1111 1111 Table 3.4: Exclusive-oring a bit-reversed index in column three and the related operand in column four produces the next bit-reversed index in the data sequence. the operand with width equals log TV, and with one or contiguous "1" bits in the most 2 significant bits and some contiguous or no "0" bit in the least significant bits, such that the operand has a value just greater than the value of the bit-reversed index. For instant, in the 16-point sequence example, the operands to be chosen are in the format: "1000", "1100", "1110" and "1111", where the operand to be used for "1001", the reversed bits of index 9, in order to generate the reversed bits of index 10 is "1100". The operand required for "1101", the reversed bits of index 11, in order to gernerate the reversed bits of index 12 is "1110". An alternative rule governing the use of the operands is to use the operand "1000", starting from index (2° — 1) and every (2 )-index times, the operand "1100" is used every 1 (2 )-index times starting from index (2 — 1), the operand "1110" is used every (2 )-index 2 1 3 times starting from index (2 — 1), and the operand "1111" is used only once starting 2 Chapter 3. The bit-reversal algorithms subroutine reverse(x[ ], N) lst.operand = N / 2 2nd_operand = (lst.operand » reversed.index = N / 2 for index = 1 to N - 2 33 1) ft (N / 2) / * swapping for the odd-numbered index */ if (index < reversed.index) then swap(x[index], x[reversed_index]) / * generate reversed bits for the next even-numbered index */ k = 2nd_operand while (k <= reversed_index) k = (k » 1) $ (N / 2) / * next higher operand */ end while reversed.index = reversed.index " k index = index + 1 / * swapping for the even-numbered index */ i f (index < reversed.index) then swap(x[index], x[reversed.index]) / * generate reversed bits for the next odd-numbered index */ reversed.index = reversed.index " lst.operand next index end subroutine Figure 3.17: The new bit-reversal algorithm. from index (2 — 1). In general, the operand with width equals log N and n "1" bits in 3 2 the most significant bits, where n is from 1 to log N, and "0" in the other bits, is used 2 firstly at index position ( 2 n-1 — 1) and repeated every (2")-index times. The operands to be used for data sequence in other length can be predicted in a similar way by using either one of the rules above. Since the implementation of the second rule may employ more complicated looping technique and hence more CPU time, the first rule is chosen for implementation, and is presented in pseudo-code form in Figure 3.17. The technique to separately process the even-numbered indices and the odd-numbered indices has a significant speedup effect. Chapter 3. The bit-reversal algorithms N lk 2k 4k 8k 16k 32k A 0.008 0.015 0.032 0.063 0.128 0.258 A: B: C: D: E: B 0.012 0.025 0.050 0.102 0.203 0.407 34 C 0.013 0.027 0.052 0.103 0.210 0.418 D 0.017 0.032 0.063 0.128 0.253 0.507 E 0.010 0.020 0.040 0.082 0.163 0.328 Evans' algorithm. Gold and Rader's algorithm. Rodriguez's algorithm. Buneman's algorithm. The new algorithm. Table 3.5: CPU execution time (in seconds) of the bit-reversal algorithms running on a Sun 3/60 workstation. A C-language implementation based on this pseudo code is shown in Appendix E.5. A second implementation using a lookup table to store the exclusive-or operands is shown in Appendix E.6. A test by using the UNIX-C compiler and a Sun 3/60 workstation shows that the implementation in Appendix E.5 can run faster, however, in other machines or Ccompilers, where the indexing is comparatively quick, the implemenation in Appendix E.6 may be more efficient. 3.3 C o m p a r i s o n o f the bit-reversal algorithms Table 3.5 and Figure 3.18 show the time required for different sizes of data for the above five order O(N) algorithms. The result is obtained by implementing the algorithms in the C language (as shown in Appendix E.l-5), and running on a Sun 3/60 workstation equipped with afloatingpoint processor. The new algorithm is one of the fastest, and is only second to Evans' bit-reversal algorithm. However, Evans' algorithm requires a complex implementation and the use of a seed table, so that the new algorithm is very Chapter 3. The bit-reversal algorithms 35 Figure 3.18: Plotting of Table 3.5. attractive. The Gold and Rader's algorithm is found to be simple and efficient, it is suitable for the implementation in the high-level languages lacking the bitwise operators. The algorithms by Buneman and Rodriguez do not show the expected effect, it is because the integer division in the Sun workstation with the UNIX-C compiler can run as fast as 2 a right-bitwise shifting operation, the indexing and other overheads in the improvement unintentionally slow down the algorithms. 3.4 Summary A new bit-reversal algorithm is presented. This new algorithm is found to be very efficient, since it is fast, has minimum memory demands and can be implemented easily in the C language. Several other bit-reversal algorithms are introduced and compared with the new algorithm. A fast bit-reversal algorithm is desirable and important in the FHT, because it accounts for a few percent of the total run time of a complete radix The integer division is replaced by right-bitwise shifting in the programs in Appendix E, just for a consistent comparison of the execution speed between different algorithms 2 Chapter 3. The bit-reversal algorithms 36 transform. The usage of the bit-reversal algorithm in the hardware implementation of the FHT will be discussed in the next chapter. Chapter 4 The F H T hardware implementation 4.1 Introduction The software implementation of the fast algorithms on a general purpose computer or on a general purpose computer equipped with a programmable digital signal processor (DSP) chip is generally speed-limited for data processing from low to medium sampling rate, say, for a maximum of less than a million samples per second . For some very high speed 1 processing environment, for example, radar signal and real-time digital image processing, custom-designed hardware implementation of the transform is highly preferred. In this chapter, the hardware implementation of the DHT is proposed, and the FHT hardware implementation based on several advanced architectures is developed and evaluated. 4.2 Hardware implementation of the D H T A four-point DHT based on the expression in Equation 2.3 is implemented and is shown in Figure 4.19. This hardware structure can be easily modified for data sequence of other size by appending or reducing devices in a similar layout pattern. The operation of this hardware implementation is synchronized with clock signals, namely CLK-A, CLKB, CLK-C and CLK-D. The control clock signals of the system are generated with a ring counter, where, each of the four outputs of the ring counter sends a clock signal As published in [37], a benchmark of 2.36ms for a 1024-point complex radix-2 FFT was recorded with the TMS320C30 DSP; this is equivalent to a processing rate of 0.434 million samples per second. The split-radix FHT implemented with the "table-lookup method" in Appendix B, requires 39.4ms for a 1024-point real data sequence on a Sun SPARC station 1 workstation. 1 37 Chapter 4. The FHT hardware implementation 38 INPUT CLK-A CLK-B CLK-C CLK-D 1' CLK-A X r*X 3: CLK-B 3t L I r-i ttl ^ x - ^ - ' CLK-C L , ^ x CLK-D '4xV4'''' 3-i7| 3-L I ROM 2 L: latch I ROM 4 CLK-A OUTPUT 3-L: tri-state latch Figure 4.19: Hardware structure of a 4-point DHT. sequentially to trigger a group of devices. The devices to be triggered synchronously, including latches and multipliers, but excluding the adders, are grouped together by the dotted-diagonal lines in Figure 4.19. The input line is a multi-bit data bus. Data are entered through the input line to the system serially at each clock signal. Four input latches controlled by clock signals are connected to the input line to select, store and pass the corresponding data points. In each clock signal, only one input latch can pass in one data point. This data point is retained in the input latch until the next time it is triggered and another new data point is sampled. The multipliers sample data from the input latches and trigonometric Chapter 4. The FHT hardware implementation 39 coefficients from the read-only memory (ROM). The adders are driven by the data from the multipliers, and since the two multipliers feeding data to the adder are tiggered in two consecutive clock signals, a latch is required to retain the output of the first multiplier and let the two data points arrive at the adder simultaneously. After three additions in each column, the transform output is passed to the output line in each clock signal. The system operation starts at CLK-A. The input latch at the upper-left corner in Figure 4.19 passes in the first data point and R0M1 is addressed and provides the cas coefficient; these two data are passed to the top multiplier in the first column. At CLK-B, the multiplier isfiredand the product is passed to a latch. At CLK-C the latch is selected and the datum inside is passed to an adder; meanwhile, the adder accepts another datum from the second multiplier in the first column. Since the adder is data driven, the adder outputs the sum to the second latch in column one without waiting for another clock signal. The second latch is then selected by CLK-D and the datum stored inside is passed to the next adder. Similarly, the final sum from the first column is passed to the output line through a tri-state output latch which will be triggered by the corresponding clock signal. Just in front of the last column of multipliers, there are some other latches connected to the input latches. These are used to hold the input data, because when one of the multipliers in the last column is triggered, the input latch connected to that multiplier is being triggered simultaneously and the datum in the input latch will be lost, so that an intermediate-stage latch is required to hold the datum. This DHT implementation requires six clock pulses of overhead, and it can convert one data point to the desired domain in one clock cycle. It has a simple structure and control, and is flexible in expansibility. However, for an TV-point transform, TV word2 Chapter 4. The FHT hardware implementation 40 sized ROM, N(N — 1) adders and N multipliers are required. For a large N, the device 2 2 count is very high, and turns out to be a high hardware implementation cost. In addition, each group of devices are triggered only in the corresponding clock signal, which means a low efficiency in the use of the devices. As an alternative, hardware implementation using the fast algorithm is then desired. Several FHT hardware implementation approaches are shown in the following section. 4.3 Hardware implementation of the F H T Unlike the F F T algorithms, most of the FHT algorithms, including the radix algorithms, have comparatively irregular geometry in their data-flow structure. Figures 2.2 and 2.5 are good examples to illustrate this point. The conventional F F T hardware implementation methods [38]-[40] utilize the regular geometry of the FFT; when these methods are applied to the FHT, more complex control and data sequence rearranging are required. This is probably the reason why the FHT hardware implementation is still not so popular today. In this section, the radix-2 algorithm is used as an example of implementation because of its popularity and simplicity. Higher radix algorithm implementation is possible with similar technique, but a more complicated control circuit is required. 4.3.1 Single butterfly unit F H T architecture Figure 4.20 shows a simple single butterfly unit architecture of the radix-2 FHT processor. The architecture shown can be configured with simple hardware. There is only one memory unit, one control unit and one butterfly unit. In the actual implementation, the multipliers in the first column and the first row are not required, since a cas value equivalent to one is to be multiplied in these multipliers, data can be passed directly to the latches or adders instead. 2 Chapter 4. The FHT hardware implementation 41 •4 Memory Unit Control i Unit < input • output (FHT) i Butterfly Unit Figure 4.20: Architecture of a single butterfly unit FHT processor. output registers input registers cos(nG) tin(nO) Figure 4.21: Radix-2 FHT butterfly unit hardware structure. The memory unit stores the input data, the intermediate results, the transform output and the trigonometric coefficients. The control unit generates memory addresses, memory clock cycles, butterfly unit clock cycles and other control signals. The butterfly unit performs real additions, substractions and multiplication. Figure 4.21 shows the hardware implementation of the 16-point radix-2 decimationin-frequency FHT butterfly unit based on the butterfly flow structure in Figure 2.1. The radix-2 FHT butterfly unit can process four real data in each computation cycle, Chapter 4. The FHT hardware implementation memory array 1 42 memory array 3 input output memory array 2 cosine and sine coefficients Figure 4.22: Single butterfly unit FHT hardware structure. Figure 4.22 shows the hardware inplementation of the single butterfly unit architecture. The operation of this implemenation can be explained with the 16-point radix-2 FHT in Figure 2.2 as an example. The 16-point transform has three sets of data points in its first stage, namely, the first set: points 1, 7, 9 and 15, the second set: points 2, 6, 10 and 14, and the third set: points 3, 5, 11 and 13, which are ready to be input to the buffer registers, a, b, c and d of the butterfly unit as in Figure 4.21. The remaining points, 0, 4, 8 and 12 do not require a complete operation of the butterfly unit, that is, no multiplication is required. However, it is not economical in this architecture to provide a separate butterfly unit for the processing of this special case. A solution to this is by entering these points to the butterfly unit also as a set, and inputting a one or cos(0), and a zero or sin(0) from the ROM to the multipliers, the multiplications can virtually be eliminated. Similarly, the data points in the last two stages that do not require complete operation of the butterfly unit, can be processed with the same method to eliminate the multiplication. Figure 4.23 shows the detail of the data point sequence entering the butterfly unit. The corresponding trigonometric coefficients used in the multiplications are listed along with the data sequence, where the data sets without trigonometric coefficients shown will be multiplied by cos(0) and sin(0). Chapter 4. The FHT hardware implementation stage 1 a: b c: d: cos 6 sin 6 0 4 8 12 12 3 7 6 5 9 1011 15 14 13 cos 28 sin 26 cos 36 sin 36 stage 2 0 2 4 6 18 3 10 5 12 7 14 cos 8 sin 8 9 11 13 15 43 stage 3 stage 4 0 4 8 12 1 5 9 13 2 6 10 14 0 4 8 12 2 6 10 14 1 5 9 13 3 7 11 15 3 7 11 15 cos 9 sin 6 8 = 27C/N Figure 4.23: Data sequence entering the radix-2 FHT butterfly unit. The data sequence in each stage can be fetched with the addresses generated by a pre-defined sequence counter. Therefore, for the 16-point FHT example, four counters are required to work sequentially for each transform. The memory unit contains three memory arrays. Memory array 1 and 2 are input buffer memory, while memory array 3 is an output buffer memory. The transform process starts with serial data input to memory array 1. In the example, sixteen data points for the first transform will be buffered. If there is more than one transform, an input switching to memory array 2 will occur, and the next sixteen data points for the second transform will enter to memory array 2. When memory array 2 is busy in receiving the serial input data, the data already stored in memory array 1 are read into the buffer registers of the butterfly unit. The computation output of the butterfly unit is written back to memory array 1 in-place, except that the final transform result from the last stage is written to memory array 3. The output buffer memory is required to provide enough time for a slower data transferring to the next processing stage, and to allow the serial input data of the next transform to enter memory array 1 immediately in order to keep the butterfly unit running uninterruptedly. The butterfly unit requires (^log N) 2 butterfly cycles for an TV-point transform. A Chapter 4. The FHT hardware implementation Butterfly Unit: | B1 | B2 | B3 | B4 | B5 , B6 | B7 | B8 i IR IW, R , W, R , W ,R , W ,R , W ,R , W ,R , W ,R , 44 • • • • Memory Array 1: iRl Memory Array 2: | IN | IN | IN i IN | IN i IN | IN | IN | IN . . . . Memory Array 3: * * * ' Butterfly Unit: . . . . " ' ' I B 12 | B13 | B 14 | B 15 | B 16 | B 1 | B2 | B3 | B4 , BS Memory Array 1: ' ' ' | W | R | W | R | Memory Array 2: |R i |R | | IN | IN | IN , IN | IN , IN • . . • i IN , IN , IN , IN , R , Memory Array 3: ' U l l R: 4 read cycles, W: 4 write cycles, w l l w . . . . l l w ,R , W , R , W,R , W,R , W,R l l w B: butterfly cycle, • • • • . . . . IN: 1 input cycle. Figure 4.24: Timing diagram for the single butterfly unit processor. . 16-point transform which has four stages then requires sixteen passes in the butterfly unit. Figure 4.24 shows the timing cycles starting from memory array 2 begins to receive the second transform data. One can see that, the use of the overlapped memory and butterfly cycle, and the use of the output memory buffer increase the processing throughput and enhance the butterfly unit to run at full efficiency. The control unit generates read and write memory cycles, butterfly cycles, device select control signals, read memory addresses, etc. Since the read memory addresses are not in sequential order, custom-designed counters are required. Figure 4.24 shows that, when a datum is read into the buffer register a in Figure 4.21 in a read memory cycle, it will wait for the equivalence of twelve memory cycles, and at the thirteenth memory cycle, the output at buffer register a! is written back to the input buffer memory in-place. That is, two sets of data points are read into the butterfly unit before one set of output is written back to the input buffer memory. Therefore, in order Chapter 4. The FHT hardware implementation 45 to avoid the extra hardware for generating the write addresses, eight address registers are used in the control unit to retain the read addresses for twelve memory cycles. The processing rate of this hardware structure depends on the memory access time and the processing time of the butterfly unit. In this design, the butterfly unit processing time matches the time for four read and four write cycles to optimize the efficiency. Suppose, if memory with access time of 60ns is used, then the butterfly unit can be designed to a maximum procesing time of 480ns or vice versa. The procesing time of a transform with length N, can be expressed as 8r(-j) log N, where r is the memory 2 access time. Therefore, if memory of 60ns is used, a 16-point transform will take 7.68/JS, or a processing rate of 2.08 million samples per second. If N = 1024, the processing time for one transform will take 1.23ms, or a processing rate of 0.83 million samples per second. The memory sometimes is the most expensive device in a circuit. If a 16-bit number representation is used, a 16-point transform will need only 96 bytes of buffer memory, a 1024-point transform will need only 6 Kbytes of buffer memory, so that, this architecture is very economical. 4.3.2 Pipeline F H T architecture Figure 4.25 shows the FHT pipeline architecture, which is similar to the single butterfly unit architecture, but instead has one butterfly unit for each stage of the FHT. Figure 4.26 shows the pipeline hardware structure of a 16-point FHT with two butterfly units in the first two stages and one in the last two stages. The input buffer, output buffer and the butterfly units (BU's) in the first two stages are the same as that in the single butterfly unit structure in the last subsection, except that the outputs of the butterfly units are not stored back to the input buffer, but instead are passed to the next butterfly unit through intermediate buffer memory. Since the last two stages of Chapter 4. The FHT hardware implementation 46 input output 1 Memory 1 Unit Figure 4.25: Pipeline architecture of a 16-point F H T . intermediate buffer intermediate buffer output buffer BU2 Figure 4.26: Pipeline hardware structure of a 16-point F H T . output Chapter 4. The FHT hardware implementation 47 Figure 4.27: Hardware structure of the butterfly unit in the combined last two stages. the radix-2 FHT do not require multiplication of trigonometric coefficients, and they can process the same four data points continuously, it is more economical to combine these two stages into one with a different butterfly unit BU2 as shown in Figure 4.27. The operation of the pipeline FHT structure is similar to the single butterfly unit structure. Each butterfly unit accepts four data points sequentially from the input buffer or the intermediate buffer, and writes the result to the next intermediate buffer or the output buffer. The input buffer and the intermediate buffer both have two memory arrays, one array being used to receive data, and the other to pass data to the next butterfly unit. A switching of input and output between these two arrays will occur when a new transform starts. Figure 4.28 shows the data sequence entering each butterfly unit, with the trigonometric coefficients listed along. The sequence in each stage can be generated by a read address counter. Since read and write are in subsequent steps, a write address counter is avoided by using eight address registers in each stage to retain the read addresses for eight memory cycles (eight read cycles before the first write cycle), and the addresses in these address registers will be used as the write addresses. Figure 4.29 shows the timing diagram of the pipeline structure operation. The data Chapter 4. The FHT hardware implementation BU BU 0 12 3 4 7 6 5 8 9 1011 12 15 14 13 1 1 cosO sinO 48 t t BU2 0 2 4 6 18 9 3 1011 5 1213 7 1415 t t 0 1 2 3 4 8 12 5 9 13 6 1014 7 11 15 cos 0 cos 0 sinO sinO cosB cos8 sin 6 sin6 cos 6 sinO cos 28 sin 26 cos 36 sin 36 6 = 27t/N Figure 4.28: Data sequence for each butterfly unit. input A: buffer B: R BU: intermediate A: buffer 1 B: R| R| R INPUT R | R | RR INPUT B| B|B B|B |W|W B B R R R R INPUT B B B B • • • INPUT R R I N P U T • • R R R R B B B B B B W| W R R R R w w w w R R R R W W W W R R R R W W W W BU: B intermediate A: buffer 2 B: B B B B B B B B B • « • • • • B W W W W R R R R w W W W W W R R 1 1 B BU2: B B B B • • • • • » 1 |W W W W 1 ' •* output buffer R: 4 read cycles, W: 4 write cycles, B: 1 butterfly cycle Figure 4.29: Timing diagram of the pipeline FHT processor. Chapter 4. The FHT hardware implementation 49 points are input to or output from the butterfly unit sequentially as in the single butterfly unit structure, and the processing time of a transform can be expressed as 4 T ( ^ ) . If memory with access time of 60ns is used, a 16-point transform will take 0.96/xs, or a processing rate of 16.7 million samples per second, while a 1024-point tranform will take 61.4/zs, or a processing rate also of 16.7 million samples per second, since the processing rate of the pipeline structure depends only on the memory read and write access time. In this hardware structure, some very high speed buffer registers are required for the butterfly units, to allow a fast transition of data in the registers before receiving the next four data points. For example, if 60ns memory is used for the buffer memory, 20ns or less access time buffer register should be used. Otherwise, a wait cycle should be inserted between each four memory cycles. 4.3.3 Superparallel F H T architecture In order to obtain an extremely high speed of FHT processing, a superparallel FHT processor is required. Figure 4.30 shows such a system structure for a 16-point FHT. It is in fact the expanded structure of the pipeline FHT in Figure 4.26, enhanced with four butterfly units for each stage. A butterfly unit BUO as shown in Figure 4.31 is used to replace butterfly unit BU in the part without the need of multiplication. In this superparallel structure, the intermediate buffer memory can be eliminated, since it is possible to have hard-wired connections between stages. However, the superparallel structure requires a large number of butterfly units for a large N, for example, a long FHT transform of 1024 points, requires 2304 butterfly units. This results in a high hardware cost, and because the interconnections between stages are irregular, the input control and inter-stage connections become very complex. By using the superparallel structure, a transform in each memory cycle is possible to obtain. That is, for any N, the processing time will be 60ns if buffer memory of 60ns Chapter 4. The FHT hardware implementation 50 b' Figure 4.31: Hardware structure of butterfly unit BUO. Chapter 4. The FHT hardware implementation 51 e = 2 7t/N Figure 4.32: Pre-permutation 16-point radix-2 FHT data flow diagram. access time is used. This means, theoretically, a processing rate of 267 million samples per second can be obtained for TV = 16, or 17 billion samples per second for TV = 1024. However, this depends on the data input rate as well as the butterfly processing rate. 4.4 Data unscrambling in hardware implementation The transform results obtained by using the methods shown above are scrambled in bitreversed order; reordering is then required. Since TV is fixed in the hardware structures, the bit-reversing can be done by copying data in the output memory buffer in a bitreversed order to the output port or to the next processing stage. The bit-reversed copying sequence can be generated by a bit-reversed counter. Another method is to employ pre-permutation algorithm. Figure 4.32 shows the prepermutation 16-point radix-2 FHT algorithm. If data are entered into the butterfly unit in the reversed-order as shown in Figure 4.33, using the decimation-in-time butterfly as in Chapter 4. The FHT hardware implementation stage 1 0 4 8 12 2 13 6 5 7 10 9 11 1413 15 stage 2 0 2 13 8 10 9 11 4 6 5 7 1214 13 15 0 = 27C/N 52 stage 3 8 12 10 14 1 5 3 7 9 13 11 15 stage 4 0 2 1 3 8 14 9 15 4 6 5 7 12 10 13 11 1.1 ttt cos 8 cosG sin 6 sin 9 cos 6 sinO cos 29 sin 29 cos 39 sin 39 Figure 4.33: Data sequence entering the pre-permutation algorithm. Figure 2.1 in the butterfly unit, and the hardware stages in the pipeline and superparallel structures are going in the reversed direction, the transform result will be in the correct order. 4.5 Comparison of the hardware implementations The DHT hardware implementation has a simple control logic and does not require the bit-reversal processing, because the data flow is very regular. Even though it involved more arithmetic than the FHT, it may be optimum for a relatively small N. The DHT hardware implementation can transform a point in each clock signal, therefore, it can run as fast as the FHT pipeline implementation. A comparison of the FHT architectures in the processing rates, the number of butterfly units and the amounts of memory space required is shown in Table 4.6. Chapter 4. The FHT hardware implementation Number of butterfly units Speed (M samples/sec.) 1 0.83 Pipelining 9 16.7 Superparallelism 2304 17,067 Architecture Single butterfly unit 53 Memory (16-bit word) R A M : 6 Kbytes R O M : 1 Kbytes R A M : 38 Kbytes R O M : 2 Kbytes R A M : 4 Kbytes ROM: 7.2 Kbytes Table 4.6: Comparison of the three FHT hardware implementations for TV = 1024 and memory access time of 60 ns. 4.6 Summary A hardware implementation of the DHT is presented in this chapter. It has a simple control logic and flexible expansibility. However, because the number of arithmetic devices required is proportional to TV , it is only suitable for small TV. Solution to this is sought 2 by the FHT implementation. Several implementation methods of the FHT are discussed, in which, the most economical method, namely, the single butterfly unit architecture, requires the fewest devices but the method is slow, whereas, the fastest method, namely, the superparallel architecture, requires the most devices. A compromise is the pipeline architecture. Chapter 5 Conclusion 5.1 The work done in this thesis Some properties of the fast Hartley transform (FHT) and its application in convolution and power spectrum were reviewed. The method to obtain a complex F F T through the FHT was derived. In the software implementation, some speed-optimized FHT programs have been developed. The radix-2 and split-radix FHT programs were optimized to reduce the number of trigonometric coefficient calculations, the technique of computing the last few stages separately further improved the program execution speed. Moreover, a table-lookup version of both the radix-2 and split-radix FHT was implemented. The table-lookup version was proved to have even higher efficiency. These programs were compared with some complex F F T programs, which were implemented in an identical way. Tests of these programs indicated that the FHT is about twice as fast as the complex F F T assigned with redundant zero-valued imaginary part. However, when some special methods are used for the complex F F T to handle the real sequence, or when the real FFT algorithm is used, the FHT does not have the speedup effect. However, the FHT is more convenient in doing the inverse transform, because only one algorithm is required for both forward and inverse transforms. On the other hand, the real sequence on the complex F F T handled with the special methods and on the real FFT, both require a separate algorithm to convert the complex conjugate symmetric transform sequence back to a real 54 Chapter 5. Conclusion 55 sequence. The application of the FHT was shown with an image processing example, where the FHT was used in the convolution and computation of power spectrum. The example used a linear-motion blur operator and white noise to degrade a protrait image, and used a Wiener filter to restore the image. The expected result was obtained as if it was done with the F F T . A new bit-reversal algorithm was proposed based on the C-language bitwise operators. This new algorithm and several other bit-reversal algorithms were implemented and evalulated. The new algorithm is fast and simple in implementation. In the hardware implementation, several new hardware structures were constructed based on some advanced architectures. The hardware implementation is new to the DHT and the FHT, since, from the best of the author's knowledge, no other hardware implementation of the DHT or the FHT has previously been published. The DHT implementation requires a number of arithmetic devices proportional to N , but it has a simple control circuit, and a processing speed of 16.7 million samples 2 per second for a 60ns clock cycle, so that, the DHT hardware implementation is attractive for small TV. The single butterfly unit implementation is the simplest, requires the least hardware, and has a fairly high speed, namely, a data processing speed of 0.83 million samples per second for TV = 1024 and a 60ns buffer memory access time. The pipelining implementation is the most efficient one for either large or small TV, which has a very high processing speed, namely, 16.7 million samples per second for TV = 1024 and a 60ns buffer memory access time, and only requires (log TV — 1) butterfly 2 units. The superparallelism implementation is the fastest, theoretically can perform a transform in one memory cycle. It requires a hardware count of ^f(log TV — 1) butterfly units, 2 Chapter 5. Conclusion 56 which turns out to be a large number if N is large, but it is acceptable when N is small or when a really high processing speed is necessary. 5.2 Future work A regular geometry FHT algorithm is sought to ease the software and hardware implementations. To implement the FHT in software with a regular geometry algorithm, will reduce the coding effort and allow a simplier and more efficient program. To implement the FHT in hardware with a regular geometry, will greatly simplify the control circuit, and reduce the effort in designing the address generation circuit. The pipeline architecture can then be improved to accept two or even four input data points in each memory cycle, and hence increase the processing speed to two or four times. The superparallel architecture will also benefit since regular geometry allows constant geometry and repeatable interconnection design between stages. In the hardware implementation, as N increases, the hardware size will increase correspondingly, such that, for a large N, the chip count will be very high. It is more economical to have a custom-designed VLSI chip in case of mass production. In fact, a VLSI implementation project is currently being scheduled at UBC. In the near future, the FHT processor will very likely become a great competitor of the F F T processors. Two dimensional and multi-dimensional FHT's have not been implemented or evaluated in this thesis; further development in this area will surely benefit two-dimensional and multi-dimensional digital signal processing. References [1] G.D. Bergland, "A fast Fourier transform algorithm for real-valued series," Comm. ACM, vol. 11, no. 10, pp. 703-710, October 1968. [2] J.-B. Martens, "Discrete Fourier transform algorithms for real valued sequences," IEEE Tran. ASSP, vol. 32, no. 2, pp. 390-396, April 1984. [3] R. Kumaresan and P.K. Gupta, "A prime factor F F T algorithm with real valued arithmetic," Proc. IEEE, vol. 73, no. 7, pp. 1241-1243, July 1985. [4] P. Duhamel, "Implementation of 'split-radix' F F T algorithms for complex, real, and real-symmetric data," IEEE Trans. ASSP, vol. 34, no. 2, pp. 285-295, April 1986. [5] H.V. Sorensen, S.L. Jones, M.T. Heideman, and S. Burrus, "Real-valued fast Fourier transform algorithms," IEEE Trans. ASSP, vol. 35, no. 6, pp. 849-863, June 1987. [6] R.N. Bracewell, The Hartley Transform. NY: Oxford Universtiy Press, March 1986. [7] R.N. Bracewell, "The fast Hartley transform," Proc. IEEE, vol. 72, no. 8, pp. 10101018, August 1984. [8] 0. Buneman, "Conversion of FFT's to fast Hartley transform," SIAM J. Sci. Stat. Comput., vol. 7, no. 2, pp. 624-638, April 1986. [9] H.V. Sorensen, D.L. Jones, C.S. Burrus and M.T. Heideman, "On computing the discrete Hartley transform," IEEE Trans. ASSP, vol. 33, no. 4, pp. 1231-1238, October 1985. 57 References 58 [10] H.-J. Meckleburg and P.K. Gupta, "Fast Hartley transform algorithm," Electronics Letters, vol. 21, pp. 341-343, April 1985. [11] J. Prado, "Comments on 'The fast Hartley transform'," Proc. IEEE, vol. 73, no. 12, pp. 1862-1863, December 1985. [12] G.E.J. Bold, "A comparison of the time involved in computing fast Hartley and fast Fourier transforms," Proc. IEEE, vol. 73, no. 12, pp. 1863-1864, December 1985. [13] R.N. Bracewell, 0. Buneman, H. Hao, and J. Villasenor, "Fast two-dimensional Hartley transform," Proc. IEEE, vol. 74, no. 9, pp. 1282-1283, September 1986. [14] H. Hao and R.N. Bracewell, "A three-dimensional DFT algorithm using the fast Hartley transform," Proc. IEEE, vol. 75, no. 2, pp. 264-266, February 1987. [15] M.A. O'Neill, "Faster than fast Fourier," BYTE, pp. 293-300, April 1988. [16] R.N. Bracewell and J. Villasenor, "Hartley vs. Fourier," BYTE, vol. 13, no. 12, pp. 24-30, November 1988. [17] T. Le-Ngoc and M.T. Vo, "Implementation and performance of the fast Hartley transform," IEEE MICRO, pp. 20-27, October 1989. [18] R.V.L. Hartley, "A more symmetrical Fourier analysis applied to transmission problems," Proc. IRE, vol. 30, pp. 144-150, March 1942. [19] R.N. Bracewell, "The discrete Hartley transform," J. Opt. Soc. Am., vol. 73, no. 12, pp. 1832-1835, December 1983. [20] S.-C. Pei and J.-L. Wu, "Split-radix fast Hartley transform," Electronics Letters, vol. 22, no. 1, pp. 26-27, January 2, 1986. References 59 [21] R.N. Bracewell, "Alternative to split-radix Hartley transform," Electronics Letters, vol. 23, no. 21, pp. 1148-1149, October 1987. [22] H.S. Hou, "The fast Hartley transform algorithm," IEEE Trans. Computers, vol. 36, no. 2, pp. 147-156, February 1987. [23] C.-Y. Hsu and J.-L. Wu, "Fast computation of discrete Hartley transform via WalshHadamard transform," Electronic Letters, vol. 23, no. 9, pp. 466-468, Aprl 1987. [24] C.-Y. Hsu and J.-L. Wu, "The Walsh-Hadamard/discrete Hartley transform," Int. J. Electronics, vol. 62, no. 5, pp. 746-755, 1987. [25] C P . Kwong and K.P. Shiu, "Structured fast Hartley transform algorithms," IEEE Trans. ASSP, vol. 34, no. 4, pp. 1000-1002, August 1986. [26] H. Hao, "On fast Hartley transform algorithms," Proc. IEEE, vol. 75, no. 7, pp. 961-962, July 1987. [27] P. Duhamel and H. Hollmann, "Split radix F F T algorithm," Electronics Letters, vol. 20, no. 1, pp. 14-16, January 5, 1984. [28] C.S. Burrus and T.W. Parks. DFT/FFT and Convolution Algorithms. NY: Wiley, 1985. [29] H.V. Sorensen, M.T. Heideman and C.S. Burrus, "On computing the split-radix FFT," IEEE Trans. ASSP, vol. 34, no. 1, pp. 152-156, February 1986. [30] P. Duhamel and H. Hollmann, "Existence of a 2" F F T algorithm with a number of multiplications lower than 2 August 1984. n+1 ," Electronics Letteres, vol. 20, no. 17, pp. 690-692, References 60 [31] E.O. Brigham, The Fast Fourier Transform. Englewood Cliff, NJ: Prentice-Hall, 1974. [32] L.R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975. [33] R. Kumaresan and P.K. Gupta, "Vector-radix algorithm for a 2-D discrete Hartley transform," Proc. IEEE, vol. 74, no. 5, pp. 755-757, May 1986. [34] J.J. Rodriguez, "An improved F F T digit-reversal algorithm," IEEE Trans. ASSP, vol. 37, no. 8, pp. 1298-1300, August 1989. [35] D.M.W. Evans, "An improved digit-reversal permutation algorithm for the fast Fourier and Hartley transforms," IEEE Trans. ASSP, vol. 35, no. 8, pp. 1120-1125, August 1987. [36] D.M.W. Evans, "A second improved digit-reversal permutation algorithm for fast transforms," IEEE Trans. ASSP, vol. 37, no. 8, pp. 1288-1291, August 1989. [37] Texas Instruments, "New 'C30 F F T benchmark," Details on Signal Processing, Issue 20, pp. 5, fall 1989. [38] Z.M. Ali, "A high-speed F F T processor," IEEE Trans. Communications, vol. 26, no. 5, pp. 690-696, May 1978. [39] E.E. Swartzlander, Jr. and K.W. Young, "A radix-4 delay communtator for fast Fourier transform processor," IEEE J. Solid-State Circuits, vol. 19, no. 5, pp. 702709, October 1984. [40] K. Yamashita, et al, "A wafer-scale 170000-gate F F T processor with built-in test circuits," IEEE J. Solid-State Circuits, vol. 23, no. 2, pp. 336-342, April 1988. Appendix A T h e radix-2 decimation-in-frequency F H T 61 programs l i s t i n g Appendix A. The radix-2 decimation-in-frequency FHT programs listing A.l Radix-2 F H T implemented w i t h the "three-loop m e t h o d " #include <math.h> #define BUTTERFLY(a, b) {double temp = a; a = temp + b; b = temp - b;> /****************************************************** * Radix-2 decimation-in-frequency FHT * * (three-loop method) * * filename: r2.1.c * * * * Written by Y.K. Fu, Dec. 1989 * * * * Input: * * x -- data a r r r a y with s i z e equals power of two. * * power -- power of two. * * s i z e -- number of input p o i n t s . * * Output: * * scrambled d i s c r e t e Hartley transform stored i n x. * ******************************************************/ void FHT(double x [ ] , unsigned power, unsigned s i z e ) i unsigned n, block, stage, t l , t 2 , t 3 , t 4 , h a l f , q r t r , step; double Red, c, s, A, x1, x2, x3, x4; s t a t i c double TuoPi=6.28318530717958647692; /* check data s i z e < 4 */ i f (power < 2) { i f (power == 1) { BUTTERFLY(x[0], x [ 1 ] ) ; return; > else return; > /* f i r s t (power - 2) stages */ half = s i z e ; for (stage = 2; stage < power; stage++) { step = h a l f ; h a l f » = 1; q r t r = h a l f » 1; Rad = TwoPi / step; f o r (n A = c = s = = 1; n < q r t r ; n++) < Rad * n; cos(A); sin(A); f o r (block = n; block < s i z e ; block += step) { i f (n == 1) < t l = block - 1; t2 = t1 + q r t r ; t3 = t2 + q r t r ; t4 = t 3 + q r t r ; BUTTERFLY(x[t1], x [ t 3 ] ) ; BUTTERFLY(x[t2], x [ t 4 ] ) ; > t3 = block + h a l f ; t2 = t 3 - n - n; t4 = t2 • h a l f ; x1 = x[block] - x [ t 3 ] ; x2 = x[t2] - x [ t 4 ) ; x [block] += x [ t 3 ] ; x[t2) += x [ t 4 ] ; x[t3] = x1 * c + x2 * s; x[t4] = x1 * s - x2 * c; > /* jump by stage /* jump by point /* jump by block Appendix A. The radix-2 decimation-in-frequency FHT programs listing / * Last two stages * / for (block a 0; block < size; block += 4) { t1 = block • 1; t2 = block + 2; t3 = block + 3; x1 = x [block] • xCt2]; x2 = x[t1] + xlt3]; x3 = x[block] - x[t2]; x4 = x[t1] - x[t3); x[block] = x1 + x2; x[t1] = x1 - x2; x[t2] = x3 • x4; x[t3] = x3 - x4; > > 63 Appendix A. A.2 The radix-2 decimation-in-frequency FHT programs listing Radix-2 F H T implemented with the "table-lookup method #include <math.h> #include <stdio.h> #define BUTTERFLY(a, b) {double temp = a; a = temp + b; b = temp - b;> /****************************************************** * Radix-2 decimation-in-frequency FHT * * (table-lookup method) * * filename: r2.2.c * • * * Written by Y.K. Fu, Dec. 1989 * * * * Input: * * x data a r r r a y with s i z e equals power of two. * * power -- power of two. * * s i z e -- number of input p o i n t s . * * Output: * * scrambled d i s c r e t e Hartley transform stored i n x. * ************************************************** void FHT(double x [ ] , unsigned power, unsigned s i z e ) < unsigned n, block, stage, m, t , t 1 , t 2 , t 3 , t 4 , h a l f , q r t r , step; double Rad, * c , * s , A, x l , x2, x3, x4; s t a t i c double TwoPi=6.28318530717958647692; /* check data s i z e < 4 */ i f (power < 2) { i f (power == 1) { BUTTERFLY(x[0], x [ 1 ] ) ; return; > e l s e return; /* Create lookup tables */ i f ( s i z e > 4) { Rad = TwoPi / s i z e ; q r t r = s i z e » 2; n = sizeof(double); c = (double * ) c a l l o c ( q r t r , n ) ; s = (double * ) c a l l o c ( q r t r , n ) ; i f (s == NULL) { putsC'** run out of memory i n c r e a t i n g lookup tables * * " ) ; exit(1); > f o r (m = 1; m < q r t r ; m++) { A = Rad * m; c[m] = cos(A); s[m] = s i n ( A ) ; > /* f i r s t (power - 2) stages */ half = size; f o r (stage = 2; stage < power; stage++) < step = h a l f ; h a l f » = 1; q r t r « h a l f » 1; t = stage - 2; /* jump by stage Appendix A. The radix-2 decimation-in-frequency FHT programs listing f o r (n = 1; n < q r t r ; n++) { m = n « t; f o r (block = n; block < s i z e ; block += step) { i f (n == 1) < t1 = block - 1; t2 = t1 + q r t r ; t3 = t 2 + q r t r ; t4 = t 3 + q r t r ; BUTTERFLY(x[t1], x [ t 3 ] ) ; BUTTERFLY(x[t2], x [ t 4 ] ) ; > t3 = block + h a l f ; t2 = t 3 - n - n; t4 = t2 + h a l f ; x1 = x[block] - x [ t 3 ] ; x2 = x [ t 2 ] - x t t 4 ] ; x [block] += x [ t 3 ] ; x[t2] +="x[t4];" x[t3] = x1 * c[m] + x2 * s[m]; x[t4] = x1 * s[m] - x2 * c[m]; > > > /* Last two stages */ for (block = 0; block < s i z e ; block •= 4) { t1 = block • 1; t2 = block + 2; t3 = block + 3; x1 = x [block] + x [ t 2 ] ; x2 = x t t l ] + x [ t 3 ] ; x3 = x[block] - x [ t 2 ] ; x4 = x [ t 1 ] - x [ t 3 ] ; x[block] = x1 + x2; x[t1] = x l - x2; x[t23 = x3 + x4; x[t3] = x3 - x4; > > /* jump by point */ /* jump by block */ 65 Appendix B T h e split-radix decimation-in-frequency F H T programs listing 66 Appendix B. The split-radix decimation-in-frequency FHT programs listing B.l Split-radix F H T implemented with the "three-loop method" #include <math.h> #define BUTTERFLYCa, b) {double temp = a; a = temp + b; b = temp - b;> ^****************************************************** * S p l i t - r a d i x decimation-in-frequency FHT * * (three-loop method) * * filename: s r . l . c * * * * Written by Y.K. Fu, Dec. 1989 * * . * * The looping technique i s adopted from the FORTRAN * * subroutine FFT w r i t t e n by C.S. Burrus, D^ec. 1984 i n * * "DFT/FFT and Convolution Algorithms", NY: Wiley, * * pp.141-142, 1985 * * * * Input: * * x -- data a r r r a y with s i z e equals power of two. * * power -- power of two. * * s i z e -- number of input p o i n t s . * * Output: * * scrambled d i s c r e t e Hartley transform stored i n x. * ******************************************************^ void FHT(double x [ ] , unsigned power, unsigned s i z e ) { unsigned q r t r , o c t a , stage, n, block, b l o c k s i z e , s t a r t , step, s t e p s i z e , t 1 1 , t12, t13, t14, t 2 1 , t22, t23, redouble Rad, c, s, c 3 , s 3 . A, A3, xO, x l , x2, x3, x4; s t a t i c double TwoPi=6.28318530717958647692, Root2=1.41421356237309514547; /* Check data s i z e < 8 V i f (power < 3) { i f (power == 2) € x1 = x[0J + x [ 2 ] ; x2 = x t l ) • x [ 3 ] ; x3 = x[0] - x [ 2 ] ; x4 = x t l l - x [ 3 ] ; x[0] = x1 + x2; xt1] = x l - x2; x[2] = x3 + x4; x[3] = x3 - x4; return; > i f (power == 1) { BuTTERFLY(x[0], x t l ] ) ; return; > else return; /* f i r s t (power - 3) stages */ b l o c k s i z e = s i z e « 1; for (stage = 3; stage < power; stage**) { stepsize = blocksize; b l o c k s i z e » = 1; q r t r - b l o c k s i z e » 2; octa = q r t r » 1; Rad = TwoPi / b l o c k s i z e ; /* jump by stage */ Appendix B. The split-radix decimation-in-frequency FHT programs listing for (n = 1; n < o c t a ; n++) ( A = Rad * n; A3 = A * 3; c = cos(A); s = sin(A); c3 = cos(A3); s3 = s i n ( A 3 ) ; s t a r t = n; step = s t e p s i z e ; /* jump by point */ do { /* jump by block */ for (block = s t a r t ; block < s i z e ; block •= step) { i f (n == 1) i t l 1 = block - 1; t12 » t11 + q r t r ; t13 = t12 + q r t r ; t14 = t13 + q r t r ; x3 = x[t11] - x [ t 1 3 ) ; x4 = x[t!2] - x [ t 1 4 ] ; x[t11] += x [ t 1 3 ] ; x[t12] += x [ t 1 4 ] ; x[t13] ~ x3 + x4; xtt14] = x3 - x4; t21 = t11 + octa; t22 = t21 + q r t r ; t23 = t22 + q r t r ; t24 = t23 + q r t r ; x1 = x [ t 2 1 J ; x2 = x [ t 2 2 ] ; x[t21] = x1 • x [ t 2 3 ] ; x[t22] = x2 • x[t241; x[t23) = (x1 - x t t 2 3 J ) * Root2; x l t 2 4 ] = (x2 - x [ t 2 4 ) ) * Root2; t12 = block + q r t r ; t13 = t12 + q r t r ; t14 = t13 + q r t r ; t21 = t12 - n - n; t22 - t21 + q r t r ; t23 = t22 • q r t r ; t24 = t23 + q r t r ; x2 = x[block] - x [ t 1 3 ] ; xO = x[t21] - x [ t 2 3 ] ; x1 = x2 + xO; x2 -- xO; xO = x[t12] - x [ t 1 4 ) ; x4 = x[t22] - x [ t 2 4 ] ; x3 = x4 - xO; x4 += xO; x [block] += x t t 1 3 J ; x[t21] += x [ t 2 3 ] ; xtt12] •= x [ t 1 4 ] ; x[t22] += x [ t 2 4 ] ; x[t13] = x1 * c + x3 * s; x[t23] = x1 * s - x3 * c; x[t14] = x2 * c3 + x4 * s3; x[t24] = x2 * s3 - x4 * c3; > s t a r t = (step « 1) - b l o c k s i z e + n; step « = 2; > while ( s t a r t < s i z e ) ; } > 68 Appendix B. The split-radix decimation-in-frequency FHT /* Last 3 stages */ /* 8-point FHT blocks */ s t a r t = 0; step = 16; do { f o r (block = s t a r t ; block < s i z e ; block += step) { t12 = block + 2; t13 = t12 + 2; t14 = t13 + 2; x3 = x[block] - x [ t 1 3 ] ; x4 = x[t12] - x [ t 1 4 ] ; x [block] = x [block] + x [ t 1 3 ] ; x[t12] = x[t12] • x [ t 1 4 ] ; x[t13] = x3 + x4; x[t14] = x3 - x4; t21 = block + 1; t22 = t21 + 2; t23 = t22 + 2; t24 = t23 + 2; x1 = x [ t 2 1 ] ; x2 = x [ t 2 2 ] ; x[t21] = x1 + x [ t 2 3 ] ; x[t22] = x2 + x [ t 2 4 ] ; x[t23] = (x1 - x [ t 2 3 ] ) * Root2; x[t24] = (x2 - x [ t 2 4 ) ) * Root2; BUTTERFLY(x[t13], x [ t 2 3 ] ) ; BUTTERFLY(x[t14], x [ t 2 4 ] ) ; > s t a r t = (step « 1) - 8; step « = 2; > while ( s t a r t < s i z e ) ; /* 4-point FHT blocks */ s t a r t = 0; step = 8; do < for (block = s t a r t ; block < s i z e ; block += step) t t12 = block + 1; t13 = t12 + 1; t14 = t13 + 1; x1 = x [block] + x [ t 1 3 ) ; x2 = x[t12] + x [ t 1 4 ] ; x3 - x[block] - x [ t 1 3 ] ; x4 = x[t12] - x [ t 1 4 ] ; x[block] = x l + x2; x[t12] = x1 - x2; x[t13] = x3 + x4; x[t14] = x3 - x4; > s t a r t = (step « 1) - 4; step « = 2; > while ( s t a r t < s i z e ) ; programs listing Appendix B. B.2 The split-radix decimation-in-frequency FHT programs listing Split-radix F H T implemented with the "table-lookup method' #include <math.h> #include <stdio.h> #define BUTTERFLY(a, b) {double temp = a; a = temp + b; b = temp - b;> /*************************»**************************** * S p l i t - r a d i x decimation-in-frequency FHT * * (table-lookup method) * * filename: sr.2.c * * * * Written by Y.K. Fu, Dec. 1989 * * • * The looping technique i s adopted from the FORTRAN * * subroutine FFT w r i t t e n by C.S. Burrus, Dec. 1984 i n * * "DFT/FFT and Convolution Algorithms", NY: Wiley * * pp.141-142, 1985 * * * * Input: * * x -- data a r r r a y with s i z e equals power of two. * * power -- power of two. * * s i z e -- number of input p o i n t s . * * Output: * * scrambled d i s c r e t e Hartley transform stored i n x. * ******************************************************/ void FHT(double x [ ] , unsigned power, unsigned s i z e ) { unsigned q r t r , octa, stage, n, m, t , block, b l o c k s i z e , s t a r t , step, s t e p s i z e , t i l , t12, t13, t14, t 2 1 , t22, t23, t24; double Rad, * c , * s , *c3, *s3. A, A3, xO, x1, x2, x3, x4; s t a t i c double TwoPi=6.28318530717958647692, Root2=1.41421356237309514547; /* Check data s i z e < 8 */ i f (power < 3) { i f (power == 2) C x1 = x[0] + x [ 2 ] ; x2 = x[1] + x [ 3 ] ; x3 = x[0] - x [ 2 ) ; x4 = x t D - x [ 3 ] ; xtO] a x1 + x2; x[1] = x1 - x2; x[2] = x3 + x4; x[3] = x3 - x4; return; > i f (power == 1) ( BUTTERFLY(x[0], x[1]>; return; > e l s e return; /* Create lookup t a b l e s */ i f ( s i z e > 8) { Rad = TwoPi / s i z e ; octa = s i z e » 3; n = sizeof(double); c a (double * ) c a l l o c ( o c t a , n ) ; s a (double * ) c a l l o c ( o c t a , n ) ; c3 a (double * ) c a l l o c ( o c t a , n ) ; s3 = (double * ) c a l l o c ( o c t a , n ) ; i f (s3 == NULL) < putsC'** run out of memory i n c r e a t i n g lookup tables * * " ) ; exitd); > Appendix B. The split-radix decimation-in-frequency FHT programs listing for (m = 1; m < octa; m++) { A = Rad * m; A3 = A * 3; cCm] = cos(A); s[m] = s i n ( A ) ; c3[m] = cos(A3); s3[m] = s i n ( A 3 ) ; > > /* f i r s t (power - 3) stages */ b l o c k s i z e = s i z e « 1; for (stage = 3; stage < power; stage++) { stepsize = blocksize; b l o c k s i z e » = 1; q r t r = b l o c k s i z e » 2; octa = q r t r » 1; t = stage - 3; for (n = 1; n < octa; n++) { s t a r t = n; step = s t e p s i z e ; m = n « t; /* jump by stage */ /* jump by point */ do { /* jump by block */ f o r (block * s t B r t ; block < s i z e ; block += step) { i f (n==1) I t11 s block - 1; t12 » t11 + q r t r ; t13 = t12 + q r t r ; t H = t13 + q r t r ; x3 = x[t11] - x [ t 1 3 ] ; x4 = x[t12] - x [ t H ] ; x[t11) •= x [ t 1 3 ) ; xtt12] += x t t H ] ; xtt13] = x3 + x4; x [ t H ] = x3 - x4; > t21 = t11 + octa; t22 e t21 + q r t r ; t23 = t22 + q r t r ; t24 s t23 + q r t r ; x1 = x [ t 2 1 ] ; x2 = x[t223; xtt21] = x l + x [ t 2 3 ] ; x(t22] = x2 + xtt241; x[t23] = (x1 - x t t 2 3 ] ) * Root2; x[t24] = (x2 - x[t24]) * Root2; t12 = block + q r t r ; t13 = t12 + q r t r ; t H = t13 • q r t r ; t21 = t12 - n - n; t22 = t21 + q r t r ; t23 = t22 + q r t r ; t24 = t23 + q r t r ; x2 = x[block] - x [ t 1 3 ) ; xO = x[t21] - x [ t 2 3 ) ; x l = x2 + xO; x2 -= xO; xO = x[t12] - x [ t H ) ; x4 = xtt22] - x [ t 2 4 ] ; x3 = x4 - xO; x4 •= xO; x [block] += x [ t 1 3 ) ; x[t21] •= x [ t 2 3 ] ; x[t12] += x [ t H ] ; x[t22] += x [ t 2 4 ] ; x[t13] = x1 * c[m] + x3 * s[m]; x[t23] = x1 * s[m] - x3 * c[m]; Appendix B. The split-radix decimation-in-frequency FHT programs listing x[t14] = x2 * c3[m] + x4 * s3[m]; x[t24] = x2 * s3tm] - x4 * c3[m]; > > s t a r t - (step « 1) - b t o c k s i z e + n; step « = 2; > while ( s t a r t < s i z e ) ; > /* Last 3 stages */ /* 8-point FHT blocks */ s t a r t = 0; step = 16; do { f o r (block « s t a r t ; block < s i z e ; block += step) { t12 = block + 2; t13 = t12 + 2; t14 = t13 + 2; x3 = x[block] - x [ t 1 3 ] ; x4 = x[t12] - x [ t 1 4 ] ; x [block] = x [block] + x [ t 1 3 ] ; x[t12] = x[t12] + x [ t 1 4 ] ; x[t13] = x3 + x4; x[t14] = x3 - x4; t21 = block • 1; t22 = t21 + 2; t23 = t22 + 2; t24 = t23 + 2; x1 = x [ t 2 1 ] ; x2 = x [ t 2 2 J ; x[t21] = x1 + x [ t 2 3 ] ; x[t22] = x2 + x [ t 2 4 ] ; x[t23] = ( x l - x [ t 2 3 ] ) * Root2; x[t24] = (x2 - x [ t 2 4 ] ) * Root2; BUTTERFLY(x[t13], x [ t 2 3 ] ) ; BUTTERFLY(xtt14], x [ t 2 4 ] ) ; > s t a r t = (step « 1) - 8; step « = 2; > while ( s t a r t < s i z e ) ; /* 4-point FHT blocks */ s t a r t = 0; step = 8; do < for (block = s t a r t ; block < s i z e ; block += step) { t12 = block + 1; t13 = t12 + 1; t14 = t13 + 1; x1 = x[block] • x [ t 1 3 ] ; x2 = x[t12] + x [ t 1 4 ] ; x3 = x [block] - x [ t 1 3 ] ; x4 = x[t12] - x [ t 1 4 ] ; x[block] = x l + x2; x[t12] = x1 - x2; x[t13] = x3 + x4; x[t14] = x3 - x4; > s t a r t = (step « 1) - 4; step « = 2; > while ( s t a r t < s i z e ) ; Appendix C The radix-2 decimation-in-frequency F F T programs listing 73 Appendix C. The radix-2 decimation-in-frequency FFT programs Usting Cl Radix-2 forward F F T implemented with the "three-loop method #include <math.h> #define BUTTERFLY(a, b) {double temp = a; a = temp + b; b = temp • b;> /****************************************************** * Radix-2 decimation-in-frequency FFT * * (three-loop method) * * filename: r 2 f f t . c * * . ..... * * Written by Y.K. Fu, Dec. 1989 * * Input: * x, real part and y, imaginary part of a data * sequence with s i z e equals power of two. * power -- power of two. * s i z e -- number of input p o i n t s . * * * * * * Output: * * x, real part and y, imaginary part of the * * scrambled d i s c r e t e Hartley transform. * ************************************************* void FFT(double x [ ] , double y [ ] , unsigned power, unsigned s i z e ) { unsigned stage, block, n, h a l f , step, t 1 , t 2 ; double A, Rad, c, s, x_tmp, y_tmp; s t a t i c double TwoPi = 6.28318530717958647692; half = s i z e ; for (stage = 1; stage < power; stage++) { step = h a l f ; half » = 1; Rad = TwoPi / step; f o r (n A = c = s = = 1; n < h a l f ; n++) { Rad * n; cos(A); sin(A); f o r (block = n; block < s i z e ; block += step) { i f (n == 1) { t1 = block - 1; t2 = t1 + h a l f ; BUTTERFLY(x[t1], x [ t 2 ] ) ; BUTTERFLY(y[t1], y [ t 2 ] ) ; > t2 = block + h a l f ; /* jump by stage */ /* jump by point */ /* jump by block */ /* (x1 + j y l ) - (x2 + jy2) = (x1 - x2) + j(y1 - y2) */ x_tmp = x[block] - x [ t 2 ] ; y~tmp = y[block] - y [ t 2 ] ; /* (x1 + jy1) + (x2 + jy2) = (x1 + x2) • j(y1 + y2) */ x[block) += x [ t 2 ] ; y [block] += y [ t 2 ] ; > /* (x_tmp + jy_tmp) * (cosA - j s i n A ) */ /* = (x_tmp * cosA • y_tmp * sinA) + j(y_tmp * cosA - x_tmp * sinA) */ x[t2] = x_tmp * c + y_tmp * s; y[t2] = y~tmp * c - x tmp * s; Appendix C. The radix-2 decimation-in-frequency FFT programs listing I* l a s t stage V for (block = 0; block < s i z e ; block += 2) t t1 = block + 1; BUTTERFLY(x[block], x[t1]); BUTTERFLY(y[block], y[t1]); Appendix C. The radix-2 decimation-in-frequency FFT programs listing C.2 Radix-2 forward F F T implemented with the "table-lookup #include <stdio.h> #include <math.h> #define BUTTERFLY(a, b> {double temp - a; a = temp + b; b = temp - b;> /****************************************************** * Radix-2 decimation-in-frequency FFT * * (table-lookup method) * * filename: r 2 f f t . 2 . c * * * * Written by Y.K. Fu, Dec. 1989 * * * * Input: * * x, real part and y, imaginary part of a data * * sequence with s i z e equals power of two. * * power -- power of two. * * s i z e -- number of input p o i n t s . * * Output: * * x, r e a l part and y, imaginary part of the * * scrambled d i s c r e t e Hartley transform. * ******************************************************/ void FFT(double x [ ] , double y [ ] , unsigned power, unsigned s i z e ) { unsigned stage, block, n, m, h a l f , step, t , t l , t 2 ; double A, Rad, * c , * s , x_tmp, y tmp; s t a t i c double TwoPi = 6.28318530717958647692; /* Create lookup t a b l e s */ Rad = TwoPi / s i z e ; half = s i z e » 1; n = sizeof(double); c = (double * ) c a l l o c ( h a l f , n ) ; s = (double * ) c a l l o c ( h a l f , n ) ; i f (s == NULL) { puts("** run out of memory i n c r e a t i n g lookup tables * * " ) ; exit(1); > f o r (m = 1; m < h a l f ; m*+) { A = Rad * m; c[m] = cos(A); slm] = s i n ( A ) ; > half = s i z e ; f o r (stage = 1 ; stage < power; stage++) { step = h a l f ; half » = 1; t = stage - 1; f o r (n = 1; n < h a l f ; n++) { m = n « t; f o r (block = n; block < s i z e ; block *- step) { i f (n == 1) { t1 = block - 1; t2 = t1 • h a l f ; BUTTERFLY(x[t1], x [ t 2 ] ) ; BUTTERFLY(y[t1], y [ t 2 ] ) ; > t2 = block + h a l f ; /* jump by stage */ /* jump by point */ /* jump by block */ method" 76 Appendix C. The radix-2 decimation-in-frequency FFT programs listing /* <x1 + jy1) - (x2 + jy2) = <x1 - x2) + j(y1 - y2) */ x_tmp = x[block] - x [ t 2 ] ; y~tmp = y[block] - y [ t 2 ] ; /* (x1 + jy1) + <x2 + jy2) = (x1 + x2) + j(y1 + y2) */ x [block] •= x [ t 2 ] ; y [block] += y [ t 2 ] ; > /* (x_tmp + jy_tmp) * (cosA - j s i n A ) */ /* = (x_tmp * cosA + y_tmp * sinA) + j(y_tmp * cosA - x_tmp * sinA) */ x[t2] = x_tmp * c[m] + y_tmp * s[m]; y[t2] = y_tmp * c[m] - x tmp * s[m]; > /* l a s t stage */ for (block »167^b^^Fizey"&l^k += 2'>"l: t1 = block + 1; BUTTERFLY(X[block] , X [ t 1 ] ) ; BUTTERFLY(y[block), y [ t 1 ] ) ; > - > _ 77 Appendix D The split-radix decimation-in-frequency F F T programs listing 78 Appendix D. The split-radix decimation-in-frequency FFT programs listing D.l Split-radix forward F F T implemented with the "three-loop method #include <math.h> #define BUTTERFLY(a, b) {double temp = a; a = temp + b; b = temp - b;> /****************************************************** * S p l i t - r a d i x decimation-in-frequency FFT * * (three-loop method) * * filename: s r f f t . c * * . . ._* * This program i s modified from the FORTRAN * * subroutine FFT w r i t t e n by C.S. Burrus, Dec. 1984 i n * * "DFT/FFT and Convolution Algorithms", NY: Wiley, * * pp.141-142, 1985 * * * * Rewritten i n C with extensive m o d i f i c a t i o n * * by Y.K. Fu, Dec. 1989 * * * * Input: * * x, real part and y, imaginary part of a data * * sequence with s i z e equals power of two. * * power -- power of two. * * s i z e -- number of input p o i n t s . * * Output: * * x, real part and y, imaginary part of the * * scrambled d i s c r e t e Hartley transform. * *************************************************** / void FFT(double x [ ] , double y [ ] , unsigned power, unsigned s i z e ) { unsigned q r t r , stage, n, s t a r t , block, b l o c k s i z e , step, s t e p s i z e , tO, t 1 , t 2 , t 3 ; double c, c3, s, s3, A, A3, x1, x2, y1, y2, Re1, Re2, Im1, Im2, Rad; s t a t i c double TwoPi=6.28318530717958647692; /* check data s i z e < 4 */ i f (power < 2) { i f (power == 1) { BUTTERFLY(xtO], x [ 1 ] ) ; BUTTERFlY(y[0], y [ 1 ] ) ; return; > else return; > /* f i r s t (power - 2) stages */ b l o c k s i z e - s i z e « 1; f o r (stage = 2; stage < power; stage++) { stepsize = blocksize; b l o c k s i z e » = 1; q r t r = b l o c k s i z e » 2; Rad = TwoPi / b l o c k s i z e ; f o r (n = 1; n < q r t r ; n++) { A = n * Rad; A3 = A * 3; c = cos(A); s = sin(A); c3 = cos(A3); s3 = s i n ( A 3 ) ; s t a r t = n; step = s t e p s i z e ; /*jump by stage */ /* jump by point */ Appendix D. The split-radix decimation-in-frequency FFT programs listing do { f o r (block = s t a r t ; block < s i z e ; block += step) { i f (n == 1) < tO « block - 1; t l = tO + q r t r ; t2 = t l + q r t r ; t3 = t2 • q r t r ; /* jump by block */ x1 n [ t 0 ] - x [ t 2 ] ; x2 « x t t l ) - x [ t 3 ] ; x[t0] += x [ t 2 ] ; x t t l ] •= x [ t 3 ] ; X y l = yttO] - y l t 2 ] ; y2 = y[t1] - y [ t 3 ] ; y[t0] += y [ t 2 ] ; y[t1] += y [ t 3 ] ; x[t2] = x1 + y2; y[t2] = y1 - x2; x[t3] = x1 - y2; y[t3] = y1 + x2; > t1 = block • q r t r ; t2 = t l + q r t r ; t3 = t2 • q r t r ; x1 = x [block] - x [ t 2 ] ; x [block] += x [ t 2 ] ; x2 = x[t1] - x [ t 3 ] ; x[t1] •= x [ t 3 ] ; y1 = y[block] - y [ t 2 ] ; y [block] += y t t 2 ) ; y2 = y[t1) - y [ t 3 ] ; y[t1] •= y [ t 3 ) ; /* ( x l + j y l ) - j ( x 2 + jy2) = (x1 + y2) + j(y1 - x2) Re1 = x1 + y2; Im1 = y1 - x2; V /* (x1 + j y D + j ( x 2 + jy2) = (x1 - y2) + j(y1 + x2) */ Re2 = x l - y2; Im2 = y1 • x2; /* (Re1 + j l m D * (cosA - j s i n A ) */ /* • (Re1 * cosA + Im1 * sinA) + j(Im1 * cosA - Re1 * sinA) */ x[t2] = Rel * c + Im1 * s; y[t2] - Im1 * c - Re1 * s; /* (Re2 • jlm2) * (cos3A - jsin3A) */ /* = (Re2 * cos3A + Im2 * sin3A) + j(Im2 * cos3A - Re2 * sin3A) */ x[t3] = Re2 * c3 + Im2 * s3; y[t3] = Im2 * c3 - Re2 * s3; > > > /* new block s t a r t address */ s t a r t = (step « 1) - b l o c k s i z e + n; step « = 2; > while ( s t a r t < s i z e ) ; 80 Appendix D. The split-radix decimation-in-frequency FFT programs listing /* l a s t 2 stages */ /* 4-point FFT blocks */ s t a r t = 0; step = 8; do i f o r (block = s t a r t ; block < s i z e ; block += step) t t1 = block + 1; t2 = t l + 1; t3 = t2 + 1; x1 = xCblock] - x [ t 2 1 ; x2 = x[t1] - x [ t 3 ] ; x[block] += xEtZJ; x[t1] += x [ t 3 ] ; y1 = y[block) - y [ t 2 ] ; y2 = y[t1] - y [ t 3 ] ; y [block] += y [ t 2 ] ; y[t1] += y [ t 3 ] ; x[t2] y[t2] x[t3] y[t3] = = = = x1 y1 x1 yi + • y2; x2; y2; x2; > s t a r t = (step « 1) - 4; step « = 2; > while ( s t a r t < s i z e ) ; /* 2-point FFT blocks */ s t a r t = 0; step = 4; do { ' ' ~ f o r (block = s t a r t ; block < s i z e ; block •= step) t t1 = block + 1; BUTTERFLY(x[block], x [ t 1 ] ) ; BUTTERFLY(y[block], y [ t 1 ] ) ; > s t a r t = (step « 1) - 2; step « = 2; > while ( s t a r t < s i z e ) ; 81 Appendix D. The split-radix decimation-in-frequency FFT programs listing D.2 Split-radix forward F F T implemented with the "table-lookup method" #include <stdio.h> #include <math.h> #define BUTTERFLY(a, b) {double temp = a; a = temp + b; b = temp • b;> ^****************************************************** * S p l i t - r a d i x decimation-in-frequency FFT * * (table-lookup method) * * filename: s r f f t . 2 . c * * * * This program i s modified from the FORTRAN * * subroutine FFT w r i t t e n by C.S. Burrus, Dec. 1984 i n * * "DFT/FFT and Convolution Algorithms", NY: Wiley, * * pp.141-142, 1985 * • . • * Rewritten i n C with extensive m o d i f i c a t i o n * * by Y.K. Fu, Dec. 1989 * * . * * Input: * * x, r e a l part and y, imaginary part of a data * * sequence with s i z e equals power of two. * * power -- power of two. * * s i z e -- number of input p o i n t s . * * Output: * * x, real part and y, imaginary part of the * * scrambled d i s c r e t e Hartley transform. * ******************************************************/ void FFT(double x [ J , double y [ ] , unsigned power, unsigned s i z e ) { unsigned q r t r , stage, n, m, s t a r t , block, b l o c k s i z e , step, s t e p s i z e , t , tO, t 1 , t 2 , t 3 ; double * c , *c3, * s , *s3. A, A3, x1, x2, y1, y2, Re1, Re2, Im1, Im2, Rad; s t a t i c double TwoPi=6.28318530717958647692; /* check data s i z e < 4 */ i f (power < 2) { i f (power == 1) { BUTTERFLY(xCO), x [ 1 ] ) ; BUTTERFLY(y[0j, y [ 1 ] ) ; return; > else return; /* Create lookup tables V Rad = TwoPi / s i z e ; q r t r = s i z e » 2; n = sizeof(double); c = (double * ) c a l l o c ( q r t r , n ) ; s = (double * ) c a l l o c ( q r t r , n ) ; c3 = (double * ) c a l l o c ( q r t r , n ) ; s3 = (double * ) c a l l o c ( q r t r , n ) ; i f (s3 == NULL) { putsC'** run out of memory i n c r e a t i n g lookup tables * * " ) ; exit(1); > f o r (m = 1; m < q r t r ; m++) { A = Rad * m; A3 = A * 3; c[m] = cos(A); s[m] = s i n ( A ) ; c3[m] = cos(A3); s3[m] = s i n ( A 3 ) ; > 82 Appendix D. The split-radix decimation-in-frequency FFT programs listing /* F i r s t (power - 2) stages */ b l o c k s i z e = s i z e « 1; for (stage = 2; stage < power; stage++) ( stepsize = blocksize; b l o c k s i z e » = 1; q r t r = b l o c k s i z e » 2; t = stage - 2; f o r (n = 1; n < q r t r ; n++) < s t a r t = n; step = s t e p s i z e ; m = n « t; do { f o r (block = s t a r t ; block < s i z e ; block += step) < i f (n == 1) { to = block - 1; t1 = tO + q r t r ; t2 = t1 + q r t r ; t3 = t2 + q r t r ; /•jump by stage */ /* jump by point */ /* jump by block */ x1 = x[tO) - x ( t 2 ) ; x2 = x t t l ] - x [ t 3 ] ; xttO] += x [ t 2 ] ; x t t l ] += x [ t 3 ] ; y1 = y[tO] - y [ t 2 ] ; y2 = y(t1] - y£t3]; yltO] += y [ t 2 ] ; y[t1] += y [ t 3 ] ; x[t2) y[t2] x[t3] y[t3] = = = = xl y1 x1 y1 + + y2; x2; y2; x2; > t1 = block + q r t r ; t2 = t1 + q r t r ; t3 = t2 + q r t r ; x1 = x[block] - x [ t 2 ] ; x [block] += x [ t 2 ] ; x2 • x[t1] - x [ t 3 ] ; x[t1] += x [ t 3 ] ; y1 = y[block] - y [ t 2 ] ; y l b l o c k ] += y [ t 2 ] ; y2 = y[t1] - y t t 3 ) ; y[t1] += y [ t 3 ] ; /* (x1 + j y D - j ( x 2 + jy2) = (x1 + y2) + j(y1 - x2) */ Rel = x1 + y2; Im1 = y1 - x2; /* (x1 + jyD + j ( x 2 * jy2) = (x1 - y2) + j(y1 + x2) */ Re2 = x1 - y2; Im2 = y1 + x2; /* (Re1 + jlmD * (cosA - jsinA) V /* = (Re1 * cosA + Im1 * sinA) + j(Im1 * cosA - Re1 * sinA) */ x[t2] = Re1 * c[m] • Im1 * s[m]; y[t2] = Im1 * c[m] - Re1 * s[m]; 83 Appendix D. The split-radix decimation-in-frequency FFT programs listing I* (Re2 + jlm2) * (cos3A - jsin3A) */ /* = (Re2 * cos3A «• Im2 * sin3A) + j(Im2 * cos3A - Re2 * sin3A) */ x[t33 = Re2 * c3tm] + Im2 * s3[m]; y[t3] = Im2 * c3[m] - Re2 * s3[m]; > > /* new block s t a r t address */ s t a r t = (step « 1) - b l o c k s i z e + n; step « = 2; > while ( s t a r t < s i z e ) ; > /* l a s t 2 stages */ /* 4-point FFT blocks */ s t a r t = 0; step = 8; do { f o r (block = s t a r t ; block < s i z e ; block += step) C t1 = block + 1; t2 = t1 + 1; t3 = t2 + 1; x1 = x[block] - x [ t 2 ] ; x2 = x[t1] - x [ t 3 ] ; x[block] •= x [ t 2 ] ; x[t1] += x [ t 3 ] ; y1 = y[block] - y [ t 2 ] ; y2 = y[t1] - y [ t 3 ) ; y[block] •= y [ t 2 ] ; ytt1] •= y [ t 3 ) ; x[t2] y[t2] x[t3] y[t3] = = = = x1 yi x1 y1 + + y2; x2; y2; x2; > s t a r t = (step « 1) - 4; step « = 2; > while ( s t a r t < s i z e ) ; /* 2-point FFT blocks */ s t a r t = 0; step = 4; do < f o r (block = s t a r t ; block < s i z e ; block += step) { t1 = block + 1; BUTTERFLY(x[block], x [ t 1 ] ) ; BUTTERFLY(y[block], y [ t 1 ] ) ; > s t a r t = (step « 1) - 2; step « = 2; > while ( s t a r t < s i z e ) ; Appendix E Bit-reversal programs listing 85 endix E. Bit-reversal programs listing The implementation of Evans' bit-reversal algorithm #include <stdio.h> #define BASE 2 #define SUAP(a, b) (double temp = a; a = b; b = temp;} ^**************************************************************** * * * * * * * * * * * * * * * A filename: f _ b i t r . c Evans b i t - r e v e r s a l permutation ( f o r FHT) * * * Written by Y.K. Fu, Dec. 1989 * * Ref: D.M.W. Evans, "An improved d i g i t - r e v e r s a l permutation * algorithm f o r the f a s t Fourier and Hartley transforms," IEEE * Trans. ASSP, vol.ASSP-35, no.8, pp.1120-1125, August 1987. * * Input: * x -- b i t - r e v e r s e d data sequence. * power power of two, where 2"power=size. * s i z e -- length of data swquence. * Output: * x -- reordered data sequence. * 1 * * * * * * * * * * * * * void reverse(double x [ ] , unsigned power, unsigned s i z e ) <. unsigned SeedWidth, OffsetWidth, o f f s e t , GpNum, GpSize, OStemp, RevOffset, RevOffsetO, width, *seed, TableSize; /************************* * define seed table s i z e * *************************/ /* get o f f s e t width and seed table width */ SeedWidth = OffsetWidth = power » 1; /* i f power i s odd, seed i s one-bit wider */ if (power & 1) { s i z e » = 1; SeedWidth += 1; > GpSize = s i z e » OffsetWidth; /******************************************** * create seed table according t o seed width * /* a l l o c a t e table space */ seed = (unsigned * ) c a l l o c ( 1 « SeedWidth, s i z e o f ( u n s i g n e d ) ) ; if (seed == NULL) < puts("** run out of memory i n c r e a t i n g seed table * * " ) ; exitd); > /* i n i t i a l i z e the two basic elements */ TableSize = BASE; seedlO] = 0; seed[1] = 1; /* expand seed table */ for (width = 1; width < SeedWidth; width++) { /* append zeros t o the e x i s t i n g seeds (become f i r s t h a l f then) */ /* and create the second-half seeds */ for ( o f f s e t = 0; o f f s e t < TableSize; offset++) { /* append a zero t o the end of the seed */ seed [offset] « = 1; /* the second-half t a b l e equals the f i r s t - h a l f t a b l e plus one * seed[offset + TableSize] = seedtoffset] + 1; > /* f o r each e x t r a b i t i n width, the t a b l e s i z e w i l l be double */ TableSize « = 1; * * * * * * * Appendix E. Bit-reversal programs listing /******************************* * switch b i t - r e v e r s e d elements * *******************************/ for ( o f f s e t = 1; o f f s e t < GpSize; offset++) < /* group zero */ RevOffsetO = seed[offset] « OffsetWidth; SWAP(x[offset], x[RevOffsetO]>; /* other groups */ OStemp = o f f s e t ; f o r (GpNum = 1; GpNum < s e e d [ o f f s e t ] ; GpNum++) { RevOffset = RevOffsetO + seed[GpNum]; OStemp += GpSize; SWAP(x[OStemp), x [ R e v O f f s e t ] ) ; > > > 87 /* Theorem 1 */ /* Theorem 3 */ /* Theorem 2 */ Appendix E. Bit-reversal programs listing E.2 88 The implementation of Gold and Rader's bit-reversal algorithm tfdefine SWAPCa, b) {double temp = a; a = b; b = temp;} * * * * * * * * GoId-Rader's b i t - r e v e r s a l a l g o r i t h m filename: f_r2.c * * * Written by: Y.K. Fu, March 1990 * * Ref: L.R. Rabiner and B. Gold, Theory and A p p l i c a t i o n of * D i g i t a l Signal Processing, NJ: P r e n t i c e - H a l l , pp.367, 1975. * * * Input: * x -- b i t - r e v e r s e d data sequence. * power -- power of two, not used here, j u s t t o match the * argument format. * N -- length of data sequence. * Output: * x -- reordered data sequence. ***************************************************** v o i d reverseCdouble x [ ] , unsigned power, unsigned N) { i n t i , j , k, M, N2; j =^0; M = N - 1; N2 = N » 1; for <i = 0; i < M; i++) { i f ( i < j ) SWAP(xti], x t j l ) ; /* generate b i t - r e v e r s e d index j */ k = N2; while (k <= j ) i j -= k; k » = 1; > > > j |= k; * * * * * * * Appendix E. E.3 Bit-reversal programs listing The implementation of Rodriguez's bit-reversal algorithm #define SUAP(a, b) {double temp = a; a = b; b = temp;> ^*************************************************************^ * Rodriguez's improved (Gold and Rader's) b i t - r e v e r s a l algorithm * * filename: f r3.c * • T — * * Written by: Y.K. Fu, March 1990 * • * * Ref: J . J . Rodriguez, "An improved FFT d i g i t - r e v e r s a l algorithm," * * IEEE Trans. ASSP, vol.ASSP-37, no.8, pp1298-1300, August 1989. * * --* * Input: * * x -- b i t - r e v e r s e d data sequence. * * power -- power of two, where 2"power=N. * * N -- length of data sequence. * * Output: * * x -- reordered data sequence. * *************************************************************** void reverse(double x [ ] , unsigned power, unsigned N) { unsigned i , j , k, N2, l a s t ; l a s t = N - (1 « (power + 1 » 1 ) ) ; j = 0; N2 = N » 1; for ( i = 1; i < l a s t ; i++) { /* generate b i t - r e v e r s e d index j */ k = N2; while (k <= j ) { j "= k; k » = 1; > j |= k; i f ( i < j ) SWAP(xti], x t j l ) ; > 89 Appendix E. Bit-reversal programs listing E.4 The implementation of Buneman's bit-reversal algorithm #define SUAP(a, b) {double temp = a; a = b; b = temp;} /********************************************************** * Buneman's b i t - r e v e r s a l algorithm * * filename: f _ r 4 . c * * .* * Written by Y.K. Fu, March 1990 * * * * Ref: 0. Buneman, "Conversion of FFT's to f a s t Hartley * * transforms," SIAM J . S c i . S t a t . Comput., v o l . 7 , no.2, * * pp.624-638, A p r i l 1986. * * * * Input: * * x -- b i t - r e v e r s e d data sequence. * * power -- power of two, not used here, j u s t * * t o match the argument format. * * N -- length of data sequence. * * Output: * * x -- reordered data sequence. * **********************************************************/ void reverse(double x [ ] , unsigned power, unsigned N) I int i , j , k, M, R[16]; j « 1; f o r (k = 0; k <= power; k++) { R[k] = j ; j += j ; > = N - 1; M j = 0; for ( i = 1; i < M; i++) t /* generate b i t - r e v e r s e d index j V k = power; do { k -= 1; j -= RCk]; } while <j >= 0 ) ; j += R[k+1]; /* i f check f o r ( i < j ) , i i s only from 1 t o N - 3. */ i f ( i > j ) SWAP(x[i], x [ j ] ) ; } > 90 Appendix E. Bit-reversal programs listing E.5 The implementation of the new bit-reversal algorithm #define SUAP(a, b) {double temp = a; a = b; b = temp;} /**************************************************************** * A new b i t - r e v e r s e I algorithm based on the use of the C * * language b i t w i s e operators. * * filename: f_r5.c * * • * Written by: Y.K. Fu, March 1990 * * * * This program generates the b i t - r e v e r s e d indices with a b i t - * * reversal counter, and swap the elements x[index] and * * x [ b i t - r e v e r s e d index] only when (index < b i t - r e v e r s e d index). * * • * Input: * * x -- b i t - r e v e r s e d data sequence. * * power -- power of two, not used here, just t o match the * * argument format. * * N -- length of data sequence. * * Output: * * x -- reordered data sequence. * *************************************************************** void reverse(double x [ ] , unsigned power, unsigned N) { int i , j , k, M, N2, S; M = N - 2; j = N2 = N » 1; S = (N2 » 1) " N2; /* unscramble array x */ for ( i = 1; i < M; i++) { i f ( i < j ) SWAP(x[i], x [ j ] ) ; k = S' while'(k <= j ) k = (k » 1) * N2; j •= k; i += 1; i f ( i < j ) SWAP(x[i], x [ j j ) ; j "= N2; > > 91 Appendix E. Bit-reversal programs listing E.6 92 The implementation of the new algorithm using table-lookup method #define SWAP(a, b) {double temp = a; a = b; b = temp;} /**************************************************************** * A new b i t - r e v e r s a l algorithm based on the use of the C * * language b i t w i s e operators. * * filename: f r5.2.c * * 7 * * Written by: Y.K. Fu, March 1990 * * * * This program generates the b i t - r e v e r s e d indices with a b i t - * * reversal counter, and swap the elements x[index] and * * x [ b i t - r e v e r s e d index] only when (index < b i t - r e v e r s e d index). * • * * Input: * * x -- b i t - r e v e r s e d data sequence. * * power -- power of two. * * N -- length of data sequence. * * Output: * * x -- reordered data sequence. * ************************************************************ v o i d reverse(double x [ ] , unsigned power, unsigned N) { i n t h, i , j , M, N2, S[16]; M = N - 2; SCO] = j = N2 = N » 1; /* generate t a b l e f o r "xor" operation */ for ( i = 1; i < power; i++) S [ i ] = (S[i-1) » > /* unscramble array x */ for ( i = 1; i < M; i++) { i f ( i < j ) SWAP(xCi), x [ j ] ) ; h = 1; while (S[h] <= j ) ++h; j "= S[h]; i •= 1; i f ( i < j ) SWAP(x[i], x [ j ] ) ; j •= N2; > 1) " N2; Appendix F Image processing example using the F H T 93 Appendix F. Image processing example using the FHT F.l 94 Using the F H T in the processing of image degradation and restoration /****************************************************************** * • * Image processing example using FHT * * Filename: image.fht.c Written by: Y.K. Fu, Feb. 1990 * * * * * * * ******************************************************************* * * * * * * * b f --->| h |--->(+)---> g g --->| W |---> r * * Restoration * * * n Degradataion ******************************************************************* * * * Notations * * * * F = FHT(f) • f: o r i g i n a l image h: b l u r operator * H = FHT(h) • b: b l u r r e d image * B = FHT(b) * n: white noise * N = FHT(n) * * G = F *H+ N * g: b l u r r e d and noisy image * g = IFHT(G) • • W = Wiener f i l t e r f u n c t i o n • • R =G* W * * r = IFHT(R) * r: restored image * • ******************************************************************* * • • * * * INPUT: binary image f i l e with number of points equals power of two. OUTPUT: degraded binary image f i l e and restored binary image file. * • * * • * ******************************************************************/ ^include <math.h> /* Define s i z e of data and work arrays */ #define DataSize 16384 /* number of points */ tfdefine Worksize (DataSize « 1) /* Program constants */ #define H OPERATOR 0.1111111111 #define ZERO 0.0 /* White noise generator constants */ tfdefine LONG 2147483648.0 /* the largest random number */ #define LEVEL 10.0 /* maximum noise gray l e v e l */ #define seed 654321 /* random number seed */ /* Macro d e f i n i t i o n s */ #define S0R(y> ( ( y ) * ( y ) ) iKdefine ABS(y) ( y < ZERO) ? -(y) : (y) /* square */ /* absolute value */ Ux F. Image processing example using the FHT I* Function d e c l a r a t i o n s */ void FHT(double x [ ] , unsigned power, unsigned s i z e ) ; void input (double x [ ] , double f [ ] , unsigned s i z e ) ; v o i d output(double x [ ] , unsigned s i z e ) ; void white(double n [ ] , double MaxLvl, unsigned s i z e ) ; void convolution(double a t ] , double b [ ] , unsigned s i z e ) ; void rmse(double x t l , double f [ ] , unsigned s i z e ) ; void f i l t e r ( d o u b l e x [ ] , double X 2 [ ] , double h [ ] , double n [ ] , unsigned s i z e ) ; void tables(unsigned s i z e ) ; mainO t unsigned power, i ; double fCDataSize), x[WorkSize], X2[Worksize), h[WorkSize], n[WorkSize]; /* /* /* /* /* a copy of the o r i g i n a l image */ image(work) array */ a copy of the FHT of f */ b l u r operator array */ white noise array */ / A * * * * * * * * * * * * * * * * * * * * * * * * * i n i t i a l i z a t i o n *******************************/ /* Compute power of two */ i = Worksize; for (power=0; i>1; power++) i » = 1; /* I n i t i a l i z e arrays */ for (i=DataSize; i<WorkSize; i++) x [ i ] = h [ i ] = n [ i ] = ZERO; for (i=0; i O a t a S i z e ; i++) h [ i ] = ZERO; /* Read i n image and store i t i n a l i n e a r array, x */ /* image gray l e v e l : 0 ( b l a c k ) - 255 (white) */ input(x, f , OataSize); /* Create uniform l i n e a r motion b l u r operator, h */ /********************************************* 1 I —> | | | | | | | | | V9 * * I * *********************************************/ h[0] = h[1] = h[2] = h[3] = h[4] = h[5] = h[6] = h[7] = h[8] = H_OPERATOR; /* Create lookup t a b l e s f o r the FHT */ tables(WorkSize); /*************************** Qegradation *********************************/ /* Get F, FHT of image array f and H, FHT of the b l u r operator h */ FHT(x, power, WorkSize); FHT(h, power, WorkSize); /* Keep a copy of the transformed image f o r the NSR c a l c u l a t i o n for (i=0; i<WorkSize; i++) X 2 [ i ] = x [ i ] ; /* Get b l u r r e d image, B = F * H */ c o n v o l u t i o n ^ , h, WorkSize); /* Generate white noise, n */ white(n, LEVEL, DataSize); /* Get N, FHT of n */ FHT(n, power, WorkSize); /* Add noise t o the b l u r r e d image, G = B + N */ for (i=0; i<WorkSize; i++) x [ i ) •= n [ i ] ; */ Appendix F. Image processing example using the FHT I* Scale r e s u l t "/ for (i=0; i<UorkSize; i++) x [ i ] /= WorkSize; /* Show b l u r r e d and n o i s y image g, IFHT of G */ FHT(x, power, WorkSize); for (i=DataSize; i<WorkSize; i++) x [ i ] = ZERO; output(x, DataSize); /***************************** R e s t o r a t i o n ******************************/ /* Get G, FHT of g V FHT(x, power, WorkSize); /* Restore image by f i l t e r i n g , R = G * W */ f i l t e r ( x , X2, h, n, WorkSize); /* Get r , I FHT of R */ FHT(x, power, WorkSize); /* Show restored image */ output(x, DataSize); /* Calculate root-mean-square rmse(x, f , DataSize); e r r o r i n r e s t o r a t i o n */ > /******••••**********•*•••**•*• Subrout i nes ********************************/ /******************************************************** * Wiener f i l t e r : W = reverse(H) / ( sqr(|H|) + NSR ) * * Where, * * r e v e r s e ( H ( i ) ) = H(Worksize - i ) * * s q r ( | H ( i ) | ) = Power(H(i)) * * = (sqr(H(i>) + sqr(H(WorkSize - i ) ) / 2 * * NSR = PowerNoise / PowersignaI * * = Power(n) / Power(X2) * ********************************************************/ void f i l t e r ( d o u b l e x [ ] , double X 2 [ ] , double h [ ] , double n [ ] , unsigned s i z e ) < unsigned i , j , h s i z e ; double temp, t p ; /* Compute W */ hsize = s i z e » 1; h[0] /= (SQR(ht03) + SQR(n(0]> / SQR(X2[0])); for (i=1; i<hsize; i++) { j = size - i ; /* temp = s q r ( | H ( i ) | ) + NSR */ temp = 0.5 * (SQR(h[i]) + SQR(h[j])) • (SQR(n[i]) • SOR(n[j])) / (SQR(X2ti]) + SQR(X2[j))); tp = h [ i ) ; h t i ) = h [ j ] / temp; h t j ] = t p / temp; > /* F * W */ c o n v o l u t i o n ^ , h, s i z e ) ; > /* s c a l e the r e s u l t */ for (i=0; i < s i z e ; i++) x t i ) /= s i z e ; Appendix F. Image processing example using the /*********************************************** * Root-mean-square e r r o r of the restored image * *******************************•***************/ void rmse(double x [ ] , double f t ] , unsigned s i z e ) { unsigned i ; double sum; sum - ZERO; for (i=0; i < s i z e ; i++) sum += SQR(x[i] - f [ i ] ) ; p r i n t f ("root mean square e r r o r = % f \ n " , sqrt(sum/size)); /************************ * White noise generator * ************************ j void white(double n [ ] , double MaxLvl, unsigned s i z e ) unsigned i ; srand(seed); for (i=0; i < s i z e ; i++) n t i ) = rand() / LONG * MaxLvl; > /*************************************************** * FHT convolution * * I f e i t h e r a or b i s symmetrical, the convolution * * of a and b can be reduced t o a * b. * • • * INPUT: arrays a and b i n frequency domain, * * s i z e of arrays. * * OUTPUT: convolution stored i n array a. * ***************************************************/ void convolution(double a [ ] , double b [ ] , unsigned s i z e ) i unsigned i , j , h s i z e ; double temp, t 1 , t 2 ; hsize = s i z e » 1; a CO] *= b [ 0 J ; for (i=1; i<hsize; i++) < j = size - i ; t1 = b [ i ] + b C j ] ; t2 = bCi] - b [ j ) ; temp = 0.5 * ( a t i l * t1 + a [ j ] * t 2 ) ; a t j ] = 0.5 * (a[j] * t1 + aCi] * - t 2 ) ; a t i ] = temp;
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Some software and hardware implementations of the fast...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Some software and hardware implementations of the fast Hartley transform Fu, Yan Kit 1990
pdf
Page Metadata
Item Metadata
Title | Some software and hardware implementations of the fast Hartley transform |
Creator |
Fu, Yan Kit |
Publisher | University of British Columbia |
Date Issued | 1990 |
Description | The fast Hartley transform (FHT) is a new tool for converting data between time and frequency domains. In this thesis, some speed-optimized software implementations of the radix-2 and split-radix FHT algorithms are presented initially, and then applied to the problems of convolution, computation of power spectra, image degradation, and image restoration. Subsequent work involved the development of a new bit-reversal algorithm. This algorithm is fast and efficient, and can be used to increase the throughput of the FHT. Finally, several hardware implementations are presented for the discrete Hartley transform (DHT) and the FHT with architectures using a single butterfly unit, pipelining and superparallelism. The advantages of each implementation are stressed. The data processing rates of these hardware implementations are analyzed. |
Subject |
Hartley transforms -- Computer programs Hartley transforms Computer software Computers |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-11-12 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065455 |
URI | http://hdl.handle.net/2429/29940 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-UBC_1991_A7 F82.pdf [ 5.83MB ]
- Metadata
- JSON: 831-1.0065455.json
- JSON-LD: 831-1.0065455-ld.json
- RDF/XML (Pretty): 831-1.0065455-rdf.xml
- RDF/JSON: 831-1.0065455-rdf.json
- Turtle: 831-1.0065455-turtle.txt
- N-Triples: 831-1.0065455-rdf-ntriples.txt
- Original Record: 831-1.0065455-source.json
- Full Text
- 831-1.0065455-fulltext.txt
- Citation
- 831-1.0065455.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065455/manifest