Scalable Parallel VLSI Architectures and Algorithms for Digital Signal and Video Processing

Ayman Ibrahim Elnaggar
B.Sc. Electrical Engineering, Cairo University, 1984.

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (ELECTRICAL ENGINEERING)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1997
© Ayman Ibrahim Elnaggar, 1997

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

Over the past few years, the demand for high-speed Digital Signal Processing (DSP) has increased dramatically. New applications in real-time multimedia communications, video processing, satellite broadcasting, and radar signal processing demand major performance improvements at several levels: algorithmic, architectural, and implementation. Although a basis of comparison of the various DSP algorithms is the number of multiplications and additions they require, for VLSI implementations other factors such as area of interconnect, I/O bandwidth, complexity of control, power dissipation, design modularity, and layout regularity are also important.

This thesis proposes efficient and cost-effective techniques for mapping highly parallel DSP and video computations onto VLSI architectures. The main focus of this research is on developing new recursive formulations for a class of multidimensional DSP applications that allow the generation of larger parallel computations by combining the results of smaller-size computations of the same dimension. This is useful for both hardware and software solutions, in which a very efficient smaller-size core has been developed and a larger computation is required. The proposed design methodology can be used to derive regular and modular architectures for DSP. Regularity and modularity are essential factors for facilitating design automation and implementation of parallel organizations in maturing application-specific integrated circuit (ASIC) and emerging programmable logic environments.

The proposed methodology targets multi-dimensional convolution and multi-dimensional transforms. A key result of this thesis is a number of novel algorithms that can implement a large multi-dimensional transform (or convolution) from a single parallel stage of smaller-size multi-dimensional transforms (or convolutions).

Our approach is based on extensive parallelization of several classes of DSP computations and on mapping these computations onto parallel hardware. Our methodology employs tensor product (Kronecker product) formulations and permutation matrices as the main tools for expressing DSP algorithms in a parallel form. The effort in modularizing the resulting architectures involves both the computational sections and the interconnection sections.
Mapping tools then manipulate such formulations into suitable recursive expressions which can be mapped efficiently onto modular parallel architectures.

Table of Contents

Abstract
List of Tables
List of Figures
Acknowledgment
CHAPTER 1 Introduction
1.1 Motivation for the Thesis Topic
1.1.1 The Need for Dedicated Hardware
1.1.2 The Need for Parallel Processing
1.2 Research Contributions
1.3 Previous Work
1.4 Thesis Overview
CHAPTER 2 Preliminaries and Background
2.1 VLSI Model of Computation
2.2 Tensor Product
2.3 Permutation Matrices
2.3.1 The Replicated Dilated Shuffle (RDS) Permutations
2.3.2 Network Folding for Limited I/O
2.4 Important Properties of the Tensor Product and Permutation Matrices
CHAPTER 3 One-Dimensional Convolution
3.1 Toom's Linear Convolution Algorithm
3.2 The Proposed Recursive Algorithm
3.2.1 A Recursive Formulation for A
3.2.2 A Recursive Formulation for B
3.2.3 A Parallel Recursive Form for Toom's Algorithm
3.2.4 The Computational Complexity of the Proposed Architectures
3.3 Chapter Summary
CHAPTER 4 Multi-Dimensional Convolution
4.1 Overview
4.2 The One-Dimensional Shuffle-Free Algorithm
4.3 The Two-Dimensional Convolution Algorithm
4.3.1 Reducing the Complexity of the 2-d Convolution Algorithm
4.3.2 The Computational Complexity of the 2-d Convolution Algorithm
4.4 Multi-Dimensional Convolution
4.5 Chapter Summary
CHAPTER 5 Multi-Dimensional DCT
5.1 DCT for Video Processing
5.2 Previous Related Work
5.3 Overview of the DCT
5.4 The One-Dimensional DCT in a Tensor-Product Form
5.5 Truly Two-Dimensional Recursive Architectures for the DCT
5.6 Computation Complexity of the 2-d DCT
5.7 The Multi-Dimensional Recursive Architecture of the DCT
5.8 Chapter Summary
CHAPTER 6 A General Methodology for Separable Transforms
6.1 The Discrete Fourier Transform (DFT)
6.1.1 The 1-d DFT
6.1.2 The 2-d DFT
6.2 The Walsh-Hadamard Transform (WHT)
6.2.1 The 1-d WHT
6.2.2 The 2-d WHT
6.3 A General Framework for Multidimensional Recursive DSP Transforms
6.3.1 The One-Dimensional Case
6.3.2 The Multi-Dimensional Case
CHAPTER 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
Bibliography
Appendix A A Complete Characterization of the Matrix R(a)
Appendix B The Verilog HDL Files and Schematics of the 16-Point Convolution
B.1 The Verilog Files of A(3)
B.1.1 Half Adder Verilog File
B.1.2 Full Adder Verilog File
B.1.3 Flip Flop Verilog File
B.1.4 Serial Adder Verilog File
B.1.5 Serial-Parallel Multiplier Verilog File
B.1.6 A(1) Verilog File
B.1.7 A(2) Verilog File
B.1.8 A(3) Verilog File
B.2 The Verilog Files of B(3)
B.2.1 B(1) Verilog File
B.2.2 B(2) Verilog File
B.2.3 B(3) Verilog File
B.3 The Verilog Files Combining A(3), D(3), and B(3)
B.3.1 D(3) with A(3) Verilog File
B.3.2 A(3), D(3), and B(3) Verilog File
B.4 The Verilog Files to Generate the 16-Point Convolution from the 8-Point
B.4.1 Q(4) Verilog File
B.4.2 RS(4) Verilog File
B.4.3 The 16-Point Convolution Verilog File
Appendix C The Schematics of the DCT Architecture
C.1 The Schematic of the Full Adder
C.2 The Schematic of the Half Adder
C.3 The Schematic of F2
C.4 The Schematic of Q4
C.5 The Schematic of Q4
C.6 The Schematic of R4
C.7 The Schematic of R4
C.8 The Schematic of the Serial Adder
C.9 The Schematic of the Serial-Parallel Multiplier
C.10 The Schematic of the Serial Subtractor
C.11 The Schematic of the 2-d 2 x 2 DCT
C.12 The Schematic of the 1-d 2-Point DCT
C.13 The Schematic of the 2-d 4 x 4 DCT
Appendix D List of Acronyms

List of Tables

Table 1 The number of MOPS in the H.261 compression
Table 2 Comparison of the number of additions required for performing convolution
Table 3 Comparison of the number of multiplications required for performing convolution
Table 4 Comparison of the number of multiplications required for the 2-d convolution (kernel of dimension 8 x 8)
Table 5 The distribution of the computational load in MPEG-1 decoding
Table 6 Comparison of the number of multiplications required for the 2-d DCT
Table 7 The summary of pre- and post-processing formulae

List of Figures

Figure 1 The permutation $P_{4,2}X_4$
Figure 2 The permutation $P^2_{4,2}X_8$
Figure 3 The RDS permutation of $P_{16,8}$
Figure 4 Example of $S_8(64,4)$
Figure 5 Bit-serial and 2-bit-at-a-time (2-BAAT) block-transpose network for Q = 4 and w = 6
Figure 6 The recursive decomposition of the proposed convolution algorithm
Figure 7 The direct realization of A
Figure 8 The modification of the A matrix (numbers above arrows denote the number of operands passed from one stage to another)
Figure 9 The recursive realization of A(16)
Figure 10 The recursive realization of the matrix B(16)
Figure 11 The modified recursive architecture of Toom's algorithm
Figure 12 The 2-d recursive convolution architecture
Figure 13 The permutation-free recursive realization of the 8-point convolution
Figure 14 The permutation-free multiplexed recursive realization of the 8-point convolution
Figure 15 The 1-d permutation-free recursive convolution architecture
Figure 16 The complete realization of $\bar{Q}$
Figure 17 The multiplexed realization of $\bar{Q}$
Figure 18 The proposed recursive architecture for computing the 2-d DCT
Figure 19 The 1-d DCT recursive architecture
Figure 20 The realization of $R_4$
Figure 21 The realization of $Q_4$, where $C_0 = 2\cos(\pi/8)$ and $C_1 = 2\cos(5\pi/8)$
Figure 22 The direct realization of the 2-d DCT for a 4 x 4 input image ($n_1 = n_2 = 4$)
Figure 23 The realization of $R_4$ for a 4 x 4 input image (n = 4)
Figure 24 The realization of $Q_4$ for a 4 x 4 input image (n = 4)
Figure 25 The complete recursive realization of the 2-d DCT for a 4 x 4 input image (n = 4)
Figure 26 The direct realization of equation 133 for the case $n_1 = n_2 = n_3 = 2$
Figure 27 The realization of $F_4$
Figure 28 The adder circuit used in the WHT
Figure 29 The 16-point WHT
Figure 30 The realization of R(2)
Figure 31 The complete realization of $R_4$

Acknowledgment

I would like to thank my supervisors, Dr. Hussein Alnuweiri and Dr. Mabo Ito, who have been a continuing source of guidance, support, and research interaction. I would like also to thank Dr. Peter Lawrence for encouraging us to patent our work. We have filed two patents so far. This work was supported in part by the GREAT award from the British Columbia Science Council and Seanix Technology Inc.
I would like to thank Mr. Ken Ho, the R&D manager at Seanix Technology Inc., for sharing his knowledge with me. My thanks go to my mother; without her blessings this thesis would not have been possible. To the memory of my late father, who first taught me how to read and write. May his soul rest in peace in Heaven. To my role model, my brother Ashraf. To my colleagues and friends, for their support and encouragement. To my wife, Maha, whose love and understanding made me able to accomplish this work. To my lovely kids, Sara and Moustafa, for the joy they brought to my life and for their low-key noise while I was studying at home. Finally, I thank God for everything.

1 Introduction

Single-processor cores are the computational engines for applications such as disk controllers, cellular telephones, and consumer devices. Even greater speed and more challenging applications can be addressed with the power of highly parallel multiprocessor cores. Moreover, significant performance improvements can be achieved with algorithms that are tailored to multiprocessor implementations. These performance requirements can be achieved by employing parallel processing at all levels. Very Large Scale Integration (VLSI) technology supports parallelism and provides a good avenue for it.

1.1 Motivation for the Thesis Topic

Digital signal and video processing techniques are increasingly finding application in almost all areas of electrical and computer engineering. With the reduction in the cost of integrated circuit technology, and the increasing sophistication of design and prototyping environments, the time is now opportune to study the design synthesis of VLSI architectures for digital signal and video processing. Moreover, with the recent development of video standards (JPEG, MPEG, H.261, etc.) and the introduction of new applications such as video phone and conferencing systems, multimedia, and satellite broadcasting, there is an increasing demand for efficient hardware designs for such applications. In the following, we briefly discuss the basic needs of DSP and video applications.

1.1.1 The Need for Dedicated Hardware

While microprocessors and general-purpose digital signal processors (DSPs) share a number of common features, important differences exist. DSPs are primarily designed for real-time number-crunching applications, with special features such as:

1. Hardwired arithmetic circuits to perform multiply-accumulate operations, the fundamental operation for most DSP algorithms.
2. Multiple-access memory, which enables the processor to load multiple operands simultaneously within an instruction.
3. Special on-chip peripherals and I/O interfaces that enable the processor to interface efficiently with other system components, such as analog-to-digital converters and local memory.

Although these features contribute a great deal to the high performance of DSPs, many video and image processing systems require a throughput that cannot be sustained by DSPs. As an example, the number of million operations per second (MOPS) required for video compression using an encoder with the H.261 compression/decompression standard on a sequence of 30 frames per second (fps) is shown in Table 1 [37].
Compression function                                   MOPS
RGB to YCbCr conversion                                  27
Motion estimation (25 searches in a 16 x 16 region)     608
Inter-/intraframe coding                                 40
Loop filtering                                           55
Pixel prediction                                         18
2-dimensional Discrete Cosine Transform                  60
Quantization, zig-zag scanning                           44
Entropy coding                                           17
Frame reconstruction                                     99
Total                                                   968

Table 1 The number of MOPS in the H.261 compression.

Table 1 shows that the encoder requires almost 1,000 MOPS of processing power, which is outside the capabilities of current general-purpose or DSP processors.

ASICs are often faster and smaller than DSPs. Speed, size, and power advantages are gained when area is optimized as a result of a single-chip implementation that eliminates some of the external interconnects. Although DSPs allow flexibility in implementing a variety of algorithms, they incur higher design and manufacturing costs, since additional hardware is required for program control. In addition, DSPs require software development tools for targeted applications. Without such tools, writing application software can be difficult, no matter how strong the processor's performance is. Although several vendors provide development tools, including high-level language compilers, the expense of software development cannot be neglected and has to be considered when deciding which alternative, ASICs or DSPs, is to be used for a specific application.

It is worth noting that studies by market research organizations [17] show a growth in DSP core-based ASICs, also called application-specific standard products (ASSPs), and a decline in the sales of general-purpose DSPs. Core-based ASICs constituted 35% of the nearly $2 billion DSP market in 1995, and it is projected that their share will grow to 47% of the $10 billion market in the year 2000. Thus core-based ASICs will compete with custom ASICs and general-purpose DSPs in the near future. General-purpose DSPs are expected to see their market share drop from 50% in 1995 to 26% in 2000. The remainder of the market will be taken by custom ASIC solutions, which are projected to hold 27% of the market in 2000.

1.1.2 The Need for Parallel Processing

As the gain in device performance from advances in processing technology diminishes, incorporating higher concurrency within a chip will play an increasingly important role in extracting higher VLSI performance (through a more efficient use of multiple chip resources). Recent advances in VLSI technology and computer-aided design (CAD) methodologies have made the implementation of highly concurrent computing systems on a single chip a reality. Indeed, several forms of parallelism are incorporated at various levels of most contemporary computing systems.

For example, the MPEG-1 standard for motion video requires the two-dimensional Discrete Cosine Transform (DCT) to be computed over each 8 x 8 block within a frame size of 352 x 240 in a video stream processed at a real-time rate of 30 fps. Clearly, such a computationally intensive task must be implemented in pipelined or multiprocessor architectures to meet the real-time requirements. However, the key to exploiting the power of a parallel architecture is the proper selection and mapping of algorithms. Specifically, significant performance gains can be achieved with algorithms that are tailored to multiprocessor implementations. For example, on a highly parallel array processor, an n-point Fourier transform can be evaluated in $O(\log n)$ time [64], as opposed to a single-processor implementation that takes $O(n \log n)$ time at best. Fully exploiting the power of parallel processing requires the proper mapping of algorithms onto the target architecture. For example, the original matrix-vector multiply formulation of the Fourier transform fits a VLSI parallel processing system more efficiently than does the Fast Fourier Transform (FFT) formulation, which was designed for standard single processors [67]. The common goal of the often difficult algorithm-mapping process is to minimize the inter-processor dependencies while maximizing the use of the computational resources to achieve a given speedup.

1.2 Research Contributions

The above considerations motivated us to develop high-performance platforms for video and signal processing, with particular emphasis on mapping algorithms onto highly concurrent VLSI architectures and programmable logic devices. An efficient algorithm-architecture mapping is essential for achieving high performance. In general, direct implementation of parallel algorithms demands a tighter coupling between algorithm specifications and logic synthesis. This thesis proposes efficient and cost-effective techniques for mapping highly concurrent digital signal and video computations onto parallel VLSI architectures. The main objective of the research is to develop design methodologies for deriving architectures that can be embedded in regular and modular structures. Regularity and modularity are essential factors for facilitating design automation and implementation of parallel organizations in maturing and emerging programmable logic and ASIC environments.

A specific contribution of this research is the derivation of new recursive formulations for several classes of multidimensional DSP computations (convolution, Discrete Cosine Transform (DCT), Discrete Fourier Transform (DFT), Walsh-Hadamard Transform (WHT)) that allow the generation of a larger VLSI architecture by combining the computation of a specific number of identical smaller-size computations of the same dimension. This is useful for both hardware and software solutions, in which a very efficient smaller-size core has been developed and a larger computation is required. Our methodology employs tensor product (Kronecker product) formulations and permutation matrices as the main tools for expressing DSP algorithms. Mapping tools then manipulate such formulations into suitable recursive expressions which can be mapped efficiently onto parallel VLSI architectures. Although tensor product techniques have been investigated before [46], [45], our methodology is the first (as far as we know) to derive truly recursive formulae for multidimensional convolution and the DCT. It should also be emphasized that our partitioning and combining method does not make any assumption about how the core computations are computed. Indeed, any suitable computation method can be used, which makes the proposed method very flexible and realizable over a wide range of hardware and software platforms. Our techniques can contribute to the design of a core of high-performance circuits for a class of DSP and video computations.
On the one hand, customizability is improved over DSPs, and on the other hand, the drawbacks of full ASIC designs, such as long design turnaround time, limited design flexibility, and restricted usability, are avoided. For a particular targeted design, the VLSI designer will have the flexibility of choosing from a library of pre-designed building blocks according to certain limiting specifications (e.g., available silicon area, data rate, power consumption, I/O, memory, throughput). An entire design can be defined as a macro-block which can be manipulated by macro-cell placement and routing routines in semi-custom ASIC or full-custom environments. Alternatively, the design can be manipulated to fit on programmable logic devices, such as FPGAs. In this case, another set of limitations must be taken into account, including a limited number of global wiring channels among logic blocks and the degree of utilization of available logic and I/O blocks. Another benefit is that, by using a technology-independent definition, the same design can be synthesized using a number of different ASIC technologies. This would allow the creation of a vendor-independent DSP function library (e.g., a macro-block library), which can be implemented on the best available silicon.

Although originally conceived as a hardware solution, our method is potentially very useful for RISC processors, superpipelines, and commercial DSP chips. Because the partitioning strategy results in a core of data-independent computations, such computations can be overlapped in superpipelines, or they can be executed concurrently on multiple functional units in a DSP chip. The method is also well suited for optimizing software pipelines in RISC processors.

1.3 Previous Work

This section provides a brief survey of related previous work. More detailed comparisons between our approach and other previously proposed approaches will be given in the appropriate places throughout the thesis. Most of the previous work focused on speeding up DSP computations by reducing the number of multiplications and additions [2], [4], [5], [31], [43], [49], [50], [53], [58], [77]. Although these algorithms achieve a substantial reduction in complexity, they do not address practical design issues, such as regularity, modularity, simple control, and communication complexity. Consequently, these algorithms produce complex structures which are not suitable for direct VLSI implementation [3], [48], [61]. From an architectural point of view, some of the common approaches are:

1. Distributed arithmetic architectures with memory lookup tables [43], [77]. This approach has some drawbacks:
a. Many input buffers are required to preserve the input data.
b. The latency can be too long.
c. The pipelined performance depends on the word length.
2. Nested schemes (overlap-and-add) [4], [53], where a composite-size n-point DSP algorithm (n = rs) is decomposed into smaller algorithms of sizes r and s. Although this approach is very computationally efficient, it trades off performance with irregular architectures that compromise the modularity and scalability of VLSI implementations, especially for larger problem sizes.
3. Pipelined architectures [2], [50], and systolic arrays [5], [31], [49], [58]. These approaches speed up the computations by employing direct strategies for parallelizing serial DSP algorithms. Such techniques offer limited speedup and are not efficient for achieving a high degree of parallelism. Moreover, these approaches are not practical for VLSI implementation due to their high control and computation-scheduling overhead.

Tensor products have been previously used for matrix calculus [33], [52], and for the design and implementation of block recursive numerical algorithms such as the FFT [6] and convolution [13]. The tensor product formulations of these algorithms have been used to generate parallel and vector forms for shared-memory and, recently, for distributed-memory multiprocessors [63], [69].

1.4 Thesis Overview

The rest of the thesis is organized as follows. Chapter 2 reviews the basic concepts of the tensor product and permutation matrices. In Chapter 3, we introduce the 1-d recursive formulation of Toom's convolution algorithm. Also in this chapter, we compare the computational complexity of our proposed algorithm with other approaches. In Chapter 4, we introduce alternative constructions of the 1-d convolution algorithm that optimize the number of arithmetic units through multiplexing. Also in this chapter, we extend the derivation to the 2-d and the multi-dimensional case. In Chapter 5, we introduce the 1-d DCT algorithm in a tensor-product form. Also in this chapter, we derive the truly recursive 2-d DCT algorithm and extend the derivation to the multidimensional DCT. The hardware realizations of the different modules are also given in this chapter. Chapter 6 applies our methodology to the DFT and WHT. Also in this chapter, we propose a general framework to derive recursive formulations for multi-dimensional separable transforms. Finally, Chapter 7 gives conclusions and possible extensions of the research.

2 Preliminaries and Background

2.1 VLSI Model of Computation

This thesis assumes a basic VLSI model of computation in which computational nodes have fixed fan-in and perform a single primitive arithmetic/logic operation on their inputs in one time unit. Modularity refers to a property of VLSI networks in which the architecture is decomposed into distinct building blocks (modules) which can be combined through composition to construct larger architectures. The communication among such blocks is done strictly through prespecified input/output ports. Regularity is a topological property of VLSI networks which implies that a given primitive interconnection pattern (among nodes) is repeated according to a fixed pattern. Recursivity refers to the property of defining an architecture for a problem of size n from lower-order (smaller-size) architectures (of size n/2 in our case). Scalability refers to a resource-growth property of VLSI networks in which the hardware (memory plus interconnect) resources allocated to each input remain constant or increase slowly as the number of ports increases.

2.2 Tensor Product

Tensor products (or Kronecker products) have proven to be useful in providing unified decomposable matrix formulations for multidimensional transforms, convolutions, matrix multiplication, and other fundamental computations [6], [13], [24], [25], [27], [62], [69]. Recently, Tolimieri et al. [46] and Van Loan [45] have shown that, when coupled with permutation matrices, tensor products provide a unifying framework for describing and programming a wide range of fast algorithms for various transforms. They have also described techniques for manipulating tensor-product formulations of multidimensional transforms into forms suited to parallel processing machines and vector processing machines. The tensor product decompositions given in [6], [13], [62], [69], although useful for extracting parallelism, are not necessarily suitable for VLSI implementation. In VLSI, issues such as communication complexity and I/O dominate the cost of implementation. These issues are the main motivation behind our work.

The tensor product is a binary matrix operator which provides a mechanism for multiplying two matrices to form a single larger matrix. Once the tensor-product factorization of a large matrix has been derived, the symmetries identified in the factorization can be used to derive efficient algorithms for a large number of matrix operations. It should be noted that direct implementations of parallel algorithms in silicon demand a tighter coupling between algorithm specifications and logic synthesis. Such coupling can be achieved using tensor product decompositions with a few additional enhancements.

Let $A_{r,s}$ and $B_{m,n}$ be two arbitrary matrices of dimension $r \times s$ and $m \times n$, respectively, such that

$$A_{r,s} = \begin{bmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,s-1} \\ \vdots & & & \vdots \\ a_{r-1,0} & \cdots & & a_{r-1,s-1} \end{bmatrix}, \qquad B_{m,n} = \begin{bmatrix} b_{0,0} & \cdots & b_{0,n-1} \\ \vdots & & \vdots \\ b_{m-1,0} & \cdots & b_{m-1,n-1} \end{bmatrix} \tag{1}$$

The tensor product of $A$ and $B$ is a matrix $C = A \otimes B$ of dimension $rm \times sn$ defined by

$$C = \begin{bmatrix} a_{0,0}B & a_{0,1}B & \cdots & a_{0,s-1}B \\ \vdots & & & \vdots \\ a_{r-1,0}B & \cdots & & a_{r-1,s-1}B \end{bmatrix} \tag{2}$$

Throughout the thesis, a square $r \times r$ matrix $A$ will be denoted by $A_r$, while a rectangular $m \times n$ matrix $B$ will be written as $B_{m,n}$. It is possible to think of the tensor product as a matrix decomposition. Given a large matrix $C$, it can be expressed in terms of two smaller matrices $A$ and $B$, called the tensor matrices, in the form

$$C = A \otimes B \tag{3}$$

Two particular forms of tensor products have been identified in [45], [46] for use with parallel and pipelined (vector) machines. Our focus in this research will be on the parallel form described by the following tensor product,

$$C_n = (I_s \otimes B_r) = \begin{bmatrix} B_r & & & \\ & B_r & & \\ & & \ddots & \\ & & & B_r \end{bmatrix} \tag{4}$$

where $I_s$ is the identity matrix of order $s$. The resulting matrix $C_n$ is block-diagonal of order $n = rs$. Suppose we would like to apply the matrix $C_n$ to an input vector $x_n$ of size $n = 100$. Direct implementation of the matrix-vector multiply would require $n^2 = 10{,}000$ multiplications, while using the parallel form of $C_n$ as defined in equation 4 (assume $s = r = 10$) would require only $s \cdot r^2 = 1{,}000$ multiplications, for a computational saving of 90%.

Another interesting property of the tensor product is the inversion property. If $A$ and $B$ are non-singular matrices, then so is $C = A \otimes B$, and

$$C^{-1} = (A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \tag{5}$$

The dimensions of $A$ and $B$ are usually much smaller than those of $C$, and thus inverting these smaller matrices results in a large computational saving as compared to inverting $C$ directly.
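The saving claimed above is easy to reproduce. The following NumPy sketch (ours, for illustration; the thesis works at the architectural level) builds the tensor product of equation 2 and exploits the block-diagonal parallel form of equation 4:

```python
import numpy as np

# Tensor (Kronecker) product of a 2x3 A and a 2x2 B: C = A (x) B,
# of dimension (2*2) x (3*2) = 4 x 6, as in equation 2.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[0, 1],
              [1, 0]])
C = np.kron(A, B)

# Parallel form of equation 4: C_n = I_s (x) B_r is block-diagonal,
# so y = C_n x costs s*r^2 multiplications instead of n^2 (n = r*s).
r, s = 10, 10
n = r * s
Br = np.random.rand(r, r)
x = np.random.rand(n)

# Direct: materialize the n x n matrix (n^2 = 10,000 multiplications).
y_direct = np.kron(np.eye(s), Br) @ x

# Exploiting the block structure: s independent r-point products
# (s * r^2 = 1,000 multiplications, the 90% saving quoted above).
y_blocks = (x.reshape(s, r) @ Br.T).reshape(n)

assert np.allclose(y_direct, y_blocks)
```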
2.3 Permutation Matrices

When combined with permutation matrices, the tensor product technique has many more degrees of freedom than most other techniques. This extra freedom is used not just to describe the required operation, but also to specify data partitioning and data flow efficiently. A permutation matrix $P$ is a special square matrix whose entries are zeroes and ones, such that each row or column of $P$ has a single 1 entry. If $n = rs$ then $P_{n,s}$ is an $n \times n$ binary matrix specifying an $n/s$-shuffle (or $s$-stride) permutation. The effect of the permutation matrix $P_{n,s}$ on an input vector $X_n$ of length $n$ is to shuffle the elements of $X_n$ by grouping together the $r$ elements separated by distance $s$. Thus, the first $r$ elements will be $x_0, x_s, x_{2s}, \ldots, x_{(r-1)s}$; the next $r$ elements are $x_1, x_{1+s}, x_{1+2s}, \ldots, x_{1+(r-1)s}$; and so on. The shuffling operation of stride permutations can be easily implemented as an addressing operation on many computers. In this case, stride permutations can be realized at virtually no computational cost. In parallel architectures, such permutations demand additional wiring area.

2.3.1 The Replicated Dilated Shuffle (RDS) Permutations

For the purposes of this thesis, the class of Replicated Dilated Shuffle (RDS) permutations needs to be introduced. This class of shuffle permutations is very effective for network partitioning and for deriving modular VLSI interconnection networks for convolution, as will be shown in the next chapter. A shuffle permutation with dilation $d$ operates on subvectors of size $d$ rather than single elements. For example, the permutation $P^2_{4,2}$ operates on subvectors of size two. To illustrate this effect, consider the two operations $P_{4,2}X_4$ and $P^2_{4,2}X_8$, shown in Fig. 1 and Fig. 2, respectively:

$$P_{4,2}X_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} x_0 \\ x_2 \\ x_1 \\ x_3 \end{bmatrix} \tag{6}$$

$$P^2_{4,2}X_8 = \begin{bmatrix} I_2 & 0 & 0 & 0 \\ 0 & 0 & I_2 & 0 \\ 0 & I_2 & 0 & 0 \\ 0 & 0 & 0 & I_2 \end{bmatrix} \begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix} = \begin{bmatrix} X_0 \\ X_2 \\ X_1 \\ X_3 \end{bmatrix} \tag{7}$$

where $I_2$ is the identity matrix of size 2, and $X_i = \begin{bmatrix} x_{2i} \\ x_{2i+1} \end{bmatrix}$.

In general, $P^d_{n,s}$ can be obtained by replacing each 1 entry in $P_{n,s}$ by the $d \times d$ identity matrix $I_d$ and each 0 in $P_{n,s}$ by the $d \times d$ zero matrix. The main power of RDS permutations stems from the fact that standard shuffle permutations can be constructed from smaller RDS permutations in a simple recursive fashion. This is a very useful property for enabling the modularization of VLSI networks that employ shuffle networks. Indeed, we can rewrite equation 7 in the form

$$P^2_{4,2} = \begin{bmatrix} I_2 & 0 & 0 & 0 \\ 0 & 0 & I_2 & 0 \\ 0 & I_2 & 0 & 0 \\ 0 & 0 & 0 & I_2 \end{bmatrix} = P_{4,2} \otimes I_2 \tag{8}$$

For example, the standard perfect shuffle permutation on 16 elements, $P_{16,8}$, can be specified in terms of RDS permutations as follows (see Fig. 3 also):

$$P_{16,8} = (I_4 \otimes (P_{4,2} \otimes I_1))(I_2 \otimes (P_{4,2} \otimes I_2))(I_1 \otimes (P_{4,2} \otimes I_4)) = \prod_{k=0}^{2} \left( I_{2^{2-k}} \otimes P_{4,2} \otimes I_{2^k} \right) \tag{9}$$

An RDS permutation has the general form $(I_r \otimes P_{n,s} \otimes I_d)$. If $r = 1$, the permutation represents a dilated shuffle permutation only (i.e., $P_{n,s} \otimes I_d$), and if $d = 1$, it represents a replicated shuffle permutation only (i.e., $I_r \otimes P_{n,s}$). The total size of the permutation is $r \times d \times n$, where $r$ is called the replication factor and $d$ is called the dilation factor.

Fig. 3 The RDS permutation of $P_{16,8}$
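The stride and RDS permutations can be exercised numerically. In the sketch below (ours), stride_perm is an illustrative helper implementing the $P_{n,s}$ convention of equation 6, and the final assertion checks the RDS factorization of equation 9:

```python
import numpy as np

def stride_perm(n, s):
    """P_{n,s}: group the r = n//s elements separated by distance s
    (an s-stride / n/s-shuffle), as a 0/1 permutation matrix."""
    r = n // s
    P = np.zeros((n, n), dtype=int)
    for j in range(s):
        for i in range(r):
            P[j * r + i, i * s + j] = 1
    return P

# P_{4,2} swaps the two middle elements: (x0,x1,x2,x3) -> (x0,x2,x1,x3).
print(stride_perm(4, 2) @ np.arange(4))     # [0 2 1 3]

# RDS permutation with dilation d = 2 (equation 8): P^2_{4,2} = P_{4,2} (x) I_2
# moves subvectors of size two instead of single elements.
P422 = np.kron(stride_perm(4, 2), np.eye(2, dtype=int))
print(P422 @ np.arange(8))                  # [0 1 4 5 2 3 6 7]

# Equation 9: the 16-point perfect shuffle built from three RDS stages.
P = np.eye(16, dtype=int)
for k in range(3):                          # stages k = 0, 1, 2, left to right
    P = P @ np.kron(np.eye(2 ** (2 - k), dtype=int),
                    np.kron(stride_perm(4, 2), np.eye(2 ** k, dtype=int)))
assert np.array_equal(P, stride_perm(16, 8))
```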
2.3.2 Network Folding for Limited I/O

Network folding is a very effective technique for designing parametrized permutation networks that are adaptable to I/O limitations [64]. This section reviews an index-mapping scheme that results in simple folded networks for permutation networks [24], [64]. The main goal is to map a stride permutation network of size $N = 2^n$ (i.e., with $N$ I/O terminals) into an equivalent network with $N/Q$ I/O terminals. This mapping will be called folding, and $Q$ will be called the I/O reduction factor. A folded stride permutation network has $N/Q$ I/O terminals. Thus, bits from at most $N/Q$ elements can be input or output at a time through the folded network.

The index-mapping procedure views the $N$-element input or output vectors as a matrix of $Q$ columns with $N/Q$ elements per column. Specifically, let $M = N/Q$; then an $N$-element vector $v \in F^N$ can be arranged into an $M \times Q$ matrix $V_{M \times Q} \in F^{M \times Q}$ as follows:

$$V_{M \times Q} = \begin{bmatrix} v_0 & v_M & \cdots & v_{N-M} \\ v_1 & v_{M+1} & \cdots & v_{N-M+1} \\ \vdots & \vdots & & \vdots \\ v_{M-1} & v_{2M-1} & \cdots & v_{N-1} \end{bmatrix} \tag{10}$$

In other words, column $i$, $0 \le i \le Q-1$, of the above matrix contains the elements with indices $iM, iM+1, \ldots, (i+1)M-1$. The mapping (permutation) of elements from the input vector to the output vector can now be expressed as a mapping of elements from the input matrix to the output matrix. For a stride permutation, each index in the output matrix is obtained by performing the specified bit-permute operations on the bit representation of the corresponding index (i.e., the index at the same position) in the input matrix. An example of an $S_8(64,4)$ network derived by our method is shown in Fig. 4, where $S_{N/N_1}(N,Q)$ denotes a folded $N/N_1$-shuffle network of size $N$ (with reduction factor $Q$). Note that the folded network of Fig. 4 consists of $N/Q^2$ sub-networks, each of which is a $Q \times Q$ transpose permutation circuit. Fig. 5 shows bit-serial and 2-bit-at-a-time (2-BAAT) versions of the transpose circuit with a word length of $w = 6$ bits.

2.4 Important Properties of the Tensor Product and Permutation Matrices

Some of the properties of tensor products and permutation matrices that will be used throughout this thesis are [45], [46]:

1. Distributive property:
$$AB \otimes CD = (A \otimes C)(B \otimes D) \tag{11}$$

2. Associative property:
$$(A \otimes B) \otimes C = A \otimes (B \otimes C) \tag{12}$$

3. Parallel operations: if $n = n_1 n_2$ then
$$A_{n_1} \otimes B_{n_2} = P_{n,n_1}\,(I_{n_2} \otimes A_{n_1})\,P_{n,n_2}\,(I_{n_1} \otimes B_{n_2}) \tag{13}$$

4. If $w = w_1 w_2 w_3$, then
$$I_{w_1} \otimes A_{w_2} \otimes I_{w_3} = P_{w,\,w_1 w_2}\,(I_{w_1 w_3} \otimes A_{w_2})\,P_{w,\,w_3} \tag{14}$$

5. If $n = n_1 n_2 n_3$ then
$$P_{n,n_1}\,P_{n,n_2} = P_{n,\,n_1 n_2} \tag{15}$$

6. If $n = n_1 n_2$ then
$$P_{n,n_1}\,P_{n,n_2} = I_n \tag{16}$$

7. For a non-square matrix $A_{m,n}$, we have
$$(A_{m,n} \otimes I_c) = P_{mc,\,m}\,(I_c \otimes A_{m,n})\,P_{nc,\,c} \tag{17}$$
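Properties 13 and 17 are the identities leaned on most heavily in the derivations of Chapters 3 and 4; they can be checked numerically under the $P_{n,s}$ convention of equation 6 (again an illustrative sketch of ours):

```python
import numpy as np

def stride_perm(n, s):
    # P_{n,s} as a 0/1 permutation matrix (see Section 2.3).
    r = n // s
    P = np.zeros((n, n), dtype=int)
    for j in range(s):
        for i in range(r):
            P[j * r + i, i * s + j] = 1
    return P

rng = np.random.default_rng(0)

# Property 13: A (x) B = P_{n,n1} (I_{n2} (x) A) P_{n,n2} (I_{n1} (x) B).
n1, n2 = 3, 4
n = n1 * n2
A = rng.random((n1, n1))
B = rng.random((n2, n2))
lhs = np.kron(A, B)
rhs = (stride_perm(n, n1) @ np.kron(np.eye(n2), A)
       @ stride_perm(n, n2) @ np.kron(np.eye(n1), B))
assert np.allclose(lhs, rhs)

# Property 17 (non-square A_{m,n}): A (x) I_c = P_{mc,m} (I_c (x) A) P_{nc,c}.
m, nn, c = 3, 2, 5
Amn = rng.random((m, nn))
lhs = np.kron(Amn, np.eye(c))
rhs = stride_perm(m * c, m) @ np.kron(np.eye(c), Amn) @ stride_perm(nn * c, c)
assert np.allclose(lhs, rhs)
```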
3 One-Dimensional Convolution

In this chapter we present a new recursive formulation of Toom's convolution algorithm. Our formulation allows the generation of higher-order convolution computations from three lower-order convolution computations [22], [28]. Efficient parallel architectures for implementing this method are also presented.

Convolution is a very important operation in most DSP and image processing applications. Thus, an abundance of approaches have been suggested to achieve high-speed processing of convolution algorithms and to design efficient convolution architectures, including pipelined architectures [2], [50], systolic arrays [5], [49], [58], distributed arithmetic [77], and nesting schemes [4], [53]. Most previous methods speed up the computation of linear convolution by employing direct strategies for parallelizing serial convolution. Such techniques offer limited speedup and are not efficient for achieving a high degree of parallelism. Moreover, for large problem sizes, these algorithms are not practical for VLSI implementation due to their high control and computation-scheduling overhead. Previous work presented in [7], [53] reduces the complexity of the convolution by decreasing the number of multiplications and additions. However, the resulting architectures are irregular and may not be suited to VLSI implementation. Some efficient VLSI implementations for computing convolution have been reported in [11], [56].

Our linear convolution algorithm decomposes the convolution computation recursively into three cascaded stages, as shown in Fig. 6. The second stage is constructed recursively from three parallel blocks, each realizing a lower-order convolution. The first and the third stages serve as "glue" computations that combine the three lower-order convolution blocks to construct the higher-order convolution. In Fig. 6, $C(n)$ denotes a convolution on $n$ points while $C(n/2)$ denotes a convolution on $n/2$ points. This formulation is useful for both hardware and software solutions, in which a very efficient smaller-size 1-d linear convolution core has been developed [4], [53] and a larger convolution computation is required. In the following, we start by reviewing some fundamentals, then proceed to describe our algorithm and the resulting architectures.

(Fig. 6 The recursive decomposition of the proposed convolution algorithm: the $n$ inputs ($n = 2^a$) feed stage 1 (pre-additions $Q$), stage 2 consists of three parallel $C(n/2)$ blocks, and stage 3 (post-additions) produces the $2n-1$ outputs.)

3.1 Toom's Linear Convolution Algorithm

Since our approach is based on a parallelized version of Toom's algorithm, we will briefly review it [69]. Let $x_n$ and $h_n$ be two sequences of length $n$. The linear convolution $y_{2n-1} = h_n * x_n$ of $x_n$ and $h_n$ is given by $y_i = \sum_k h_k x_{i-k}$, which can also be expressed in the matrix form $Y_{2n-1} = C_n X_n$, or

$$\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{2n-2} \end{bmatrix} = \begin{bmatrix} h_0 & 0 & \cdots & 0 \\ h_1 & h_0 & & \vdots \\ \vdots & h_1 & \ddots & 0 \\ h_{n-1} & \vdots & \ddots & h_0 \\ 0 & h_{n-1} & & h_1 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & h_{n-1} \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{bmatrix} \tag{18}$$

For $n = 2^a$, where $a$ is an integer, Granata et al. [69] presented an efficient tensor product formulation of Toom's radix-2 convolution algorithm, which has the following "parallel" form:

$$C_{2^a} = C(a) = B\,D\,A, \qquad D = \operatorname{diag}(A\,h_n) \tag{19}$$

where

$$A = \prod_{k=0}^{a-1} P_{3^{a-k}2^k,\,3^{a-k}}\,\bigl(I_{3^{a-1-k}2^k} \otimes A\bigr)\,P_{3^{a-k-1}2^{k+1},\,2^k} \tag{20}$$

$$B = \prod_{k=1}^{a} \Bigl(I_{3^{k-1}} \otimes R_{2^{a-k}}\bigl(P_{3(2^{a-k+1}-1),\,3}\,(I_{2^{a-k+1}-1} \otimes B)\,P_{3(2^{a-k+1}-1),\,(2^{a-k+1}-1)}\bigr)\Bigr) \tag{21}$$

$$A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix} \tag{22}$$

and $R_{2^{a-k}}$ is a special matrix of zeros and ones. Since in practical applications the elements of the vector $h_n$ (used in equation 19 to calculate the matrix $D$) are known a priori, this matrix-vector multiplication can be pre-computed. Moreover, since $D$ is a diagonal matrix of order $3^a$, applying the $D$ matrix amounts to performing independent element-wise multiplications, all of which can be done in parallel. Consequently, the focus in this chapter will be on deriving modular algorithms and architectures for the two matrices $A$ and $B$.
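For $a = 1$ (a 2-point convolution), equation 19 reduces to a three-multiplication scheme built from the core matrices of equation 22. A minimal NumPy check (ours, not from the thesis):

```python
import numpy as np

# Core matrices of equation 22 (the a = 1, n = 2 case of Toom's algorithm).
A = np.array([[1, 0],
              [1, 1],
              [0, 1]])                 # pre-additions: x0, x0 + x1, x1
B = np.array([[ 1, 0,  0],
              [-1, 1, -1],
              [ 0, 0,  1]])            # post-additions

h = np.array([3.0, 5.0])               # filter taps, known a priori
D = np.diag(A @ h)                     # equation 19: D = diag(A h)

x = np.array([2.0, 7.0])
y = B @ D @ A @ x                      # C(1) = B D A: three multiplications

# y = (h0*x0, h0*x1 + h1*x0, h1*x1) = (6, 31, 35)
assert np.allclose(y, np.convolve(h, x))
```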
3.2 The Proposed Recursive Algorithm

Our main goal is to compute the vector $Y_{2n-1} = C(a)X_n$, where $n = 2^a$ and $C(a)$ was defined by equation 19. The next subsections present new recursive formulae for $A$ and $B$ that directly lead to modular recursive structures for computing convolution.

3.2.1 A Recursive Formulation for A

The matrix $A$ is defined as

$$A = \prod_{k=0}^{a-1} P_{3^{a-k}2^k,\,3^{a-k}}\,\bigl(I_{3^{a-1-k}2^k} \otimes A\bigr)\,P_{3^{a-k-1}2^{k+1},\,2^k} \tag{23}$$

The direct realization of $A$ is shown in Fig. 7. Applying property 15 to the matrix $A$, and noting that the permutations in two adjacent stages can be grouped together into a single permutation, equation 20 can be simplified to

$$A(n) = \prod_{k=0}^{a-1} \bigl(I_{3^{a-1-k}2^k} \otimes A\bigr)\,P_{3^{a-k-1}2^{k+1},\,3^{a-k-1}2^k} \tag{24}$$

where $A(n)$ denotes the matrix $A$ for a given $n = 2^a$. Thus $A(n)$ can be computed by a cascaded product of $a$ stages of double matrix products instead of the triple matrix products of equation 20, as shown in Fig. 8 for the case $a = 4$ ($n = 16$) [26]. (In Fig. 8, the numbers above the arrows denote the number of operands passed from one stage to another.)

Our main goal is to manipulate the expression in equation 24 into a recursive modular form. The last stage ($k = 0$) of the computation in equation 24 can be expressed as follows [22]:

$$(I_{3^{a-1}} \otimes A)\,P_{3^{a-1}2,\,3^{a-1}} = \bigl(I_3 \otimes (I_{3^{a-2}} \otimes A)\,P_{3^{a-2}2,\,3^{a-2}}\bigr)\,(P_{6,3} \otimes I_{3^{a-2}}) \tag{25}$$

Similarly, the stage before the last, $k = 1$, can be expressed as

$$(I_{3^{a-2}2} \otimes A)\,P_{3^{a-2}2^2,\,3^{a-2}2} \tag{26}$$

Combining equations 25 and 26, the last two stages can be expressed as

$$\prod_{k=0}^{1} (I_{3^{a-1-k}2^k} \otimes A)\,P_{3^{a-k-1}2^{k+1},\,3^{a-k-1}2^k} = \bigl(I_3 \otimes (I_{3^{a-2}} \otimes A)\,P_{3^{a-2}2,\,3^{a-2}}\bigr)\,(P_{6,3} \otimes I_{3^{a-2}})\,\bigl((I_{3^{a-2}2} \otimes A)\,P_{3^{a-2}2^2,\,3^{a-2}2}\bigr) \tag{27}$$

Note that

$$(P_{6,3} \otimes I_{3^{a-2}})(I_{3^{a-2}2} \otimes A) = (I_{3^{a-2}2} \otimes A)(P_{6,3} \otimes I_{3^{a-3}2}) = \bigl[I_3 \otimes (I_{3^{a-3}2} \otimes A)\bigr](P_{6,3} \otimes I_{3^{a-3}2}) \tag{28}$$

and

$$P_{6,3} \otimes I_{3^{a-3}2} = P_{3^{a-2}2^2,\,6}\,(I_{3^{a-3}2} \otimes P_{6,3})\,P_{3^{a-2}2^2,\,3^{a-3}2} = \bigl[I_3 \otimes P_{3^{a-3}2^2,\,3^{a-3}2}\bigr]\,P_{3^{a-2}2^2,\,3^{a-2}2} \tag{29}$$

By substituting equations 28 and 29 in 27, we have

$$\prod_{k=0}^{1} \bigl[(I_{3^{a-1-k}2^k} \otimes A)\,P_{3^{a-k-1}2^{k+1},\,3^{a-k-1}2^k}\bigr] = \Bigl[I_3 \otimes \bigl[(I_{3^{a-2}} \otimes A)\,P_{3^{a-2}2,\,3^{a-2}}\bigr]\bigl[(I_{3^{a-3}2} \otimes A)\,P_{3^{a-3}2^2,\,3^{a-3}2}\bigr]\Bigr]\,P_{3^{a-2}2^2,\,3^{a-2}2} \tag{30}$$

A simple induction proof shows that applying equations 28 and 29 repeatedly to the remaining stages, $k = 2, 3, \ldots, a-2$, except the first stage ($k = a-1$), results in the following formula for $A(n)$:

$$A(n) = \underbrace{\bigl[(I_3 \otimes A(n/2))\,(P_{2^{a-1}3,\,2^{a-2}3})^{a-1}\bigr]}_{k\,=\,0,1,\ldots,a-2}\;\underbrace{\bigl[(I_{2^{a-1}} \otimes A)\,P_{2^a,\,2^{a-1}}\bigr]}_{k\,=\,a-1} \tag{31}$$

where the term in the left square brackets represents the last $a-1$ stages and can be realized using three parallel blocks of $A(n/2)$ computations preceded by $a-1$ cascaded stages of the permutation $P_{2^{a-1}3,\,2^{a-2}3}$. The term inside the right square brackets represents a single computation stage that must be performed at the beginning of the computation of $A(n)$. The matrix computation $A(n)$ can be written more concisely as

$$A(n) = \bigl[(I_3 \otimes A(n/2))\,(P_{2^{a-1}3,\,2^{a-2}3})^{a-1}\bigr]\,\bigl[(I_{2^{a-1}} \otimes A)\,P_{2^a,\,2^{a-1}}\bigr] = (I_3 \otimes A(n/2))\,Q_n \tag{32}$$

where

$$Q_n = (P_{2^{a-1}3,\,2^{a-2}3})^{a-1}\,\bigl[(I_{2^{a-1}} \otimes A)\,P_{2^a,\,2^{a-1}}\bigr] \tag{33}$$

The matrix $A(n)$ can be realized recursively from smaller computational modules, as shown in Fig. 9 for the case $a = 4$ (i.e., for an $n = 2^a = 16$-point convolution). The labels on the connections in Fig. 9 indicate the bandwidth of each link, defined as the number of elements that can simultaneously traverse the link. Observe that we have drawn our networks such that data flows from right to left. We chose this convention to show the direct correspondence between the derived algorithms and the proposed VLSI networks. Therefore, the first (or rightmost) stage of the computation of $A(n)$ in Fig. 9 is the stage for which $k = a-1$, and the last (or leftmost) stage is the one for which $k = 0$.

Fig. 9 The recursive realization of A(16)
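The recursion of equation 32 can be exercised directly. The sketch below (ours) uses the permutation-free form $Q_n = A \otimes I_{n/2}$, derived later as equation 47, which is equivalent to equation 33; the output has $3^{\log_2 n}$ entries, one per multiplier input of the $D$ stage:

```python
import numpy as np

A2 = np.array([[1, 0], [1, 1], [0, 1]])   # the core matrix A of equation 22

def A_mat(n):
    """A(n) via equation 32: A(n) = (I_3 (x) A(n/2)) Q_n, with the
    shuffle-free pre-addition Q_n = A (x) I_{n/2} of equation 47."""
    if n == 1:
        return np.eye(1, dtype=int)
    Qn = np.kron(A2, np.eye(n // 2, dtype=int))
    return np.kron(np.eye(3, dtype=int), A_mat(n // 2)) @ Qn

x = np.arange(1, 5)        # n = 4: A(4) maps 4 inputs to 3^2 = 9 outputs
print(A_mat(4) @ x)        # [ 1  3  2  4 10  6  3  7  4]
```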
3.2.2 A Recursive Formulation for B

From equation 21, stage $k$ consists of the parallel tensor products

$$\Bigl(I_{3^{k-1}} \otimes R_{2^{a-k}}\bigl(P_{3(2^{a-k+1}-1),\,3}\,(I_{2^{a-k+1}-1} \otimes B)\,P_{3(2^{a-k+1}-1),\,(2^{a-k+1}-1)}\bigr)\Bigr) \tag{34}$$

which can be realized by $3^{k-1}$ parallel blocks, each computing the matrix product

$$R_{2^{a-k}}\bigl(P_{3(2^{a-k+1}-1),\,3}\,(I_{2^{a-k+1}-1} \otimes B)\,P_{3(2^{a-k+1}-1),\,(2^{a-k+1}-1)}\bigr) \tag{35}$$

(A complete characterization of the distribution of 1's and 0's in $R_{2^{a-k}} = R(a)$ is given in Appendix A.) It should be observed that the tensor decomposition

$$P_{3(2^{a-k+1}-1),\,3}\,(I_{2^{a-k+1}-1} \otimes B)\,P_{3(2^{a-k+1}-1),\,(2^{a-k+1}-1)} \tag{36}$$

of equation 35 is similar to the decomposition of the matrix $A(n)$ discussed in Section 3.2.1, with the exception that the core matrix $A$ is replaced by the matrix $B$ defined in equation 22. By applying tensor product manipulations similar to those applied to the matrix $A(n)$, the matrix $B(n)$ can be expressed in the following form:

$$B(n) = \underbrace{\bigl[R_{2^{a-1}}\bigl(P_{3(2^a-1),\,3}\,(I_{2^a-1} \otimes B)\,P_{3(2^a-1),\,(2^a-1)}\bigr)\bigr]}_{k\,=\,1}\;\prod_{k=2}^{a} \Bigl(I_3 \otimes I_{3^{k-2}} \otimes R_{2^{a-k}}\bigl(P_{3(2^{a-k+1}-1),\,3}\,(I_{2^{a-k+1}-1} \otimes B)\,P_{3(2^{a-k+1}-1),\,(2^{a-k+1}-1)}\bigr)\Bigr)$$
$$= \bigl[R_{2^{a-1}}\bigl(P_{3(2^a-1),\,3}\,(I_{2^a-1} \otimes B)\,P_{3(2^a-1),\,(2^a-1)}\bigr)\bigr]\,\bigl(I_3 \otimes B(n/2)\bigr) = R_n\,\bigl(I_3 \otimes B(n/2)\bigr) \tag{37}$$

where

$$R_n = R_{2^{a-1}}\bigl(P_{3(2^a-1),\,3}\,(I_{2^a-1} \otimes B)\,P_{3(2^a-1),\,(2^a-1)}\bigr) \tag{38}$$

Equation 37 shows that $B(n)$ can be realized by recursively combining three parallel blocks computing $B(n/2)$, followed by a single $R_n$ block, as shown in Fig. 10 for the case $n = 16$ [28].

3.2.3 A Parallel Recursive Form for Toom's Algorithm

By substituting equations 32 and 37 in 19, we can rewrite Toom's convolution algorithm in the form

$$C(n) = R_n\,\bigl(I_3 \otimes B(n/2)\bigr)\,D(n)\,\bigl(I_3 \otimes A(n/2)\bigr)\,Q_n \tag{39}$$
3.2.4 The Computational Complexity of the Proposed Architectures The number of adders to compute the matrix A(n) is equal to the coefficient of the identity matrix in equation 24 and is given by a—l a—1 3 a -* _ 1 .2 f c = 3 a _ 1 3~k.2k k=0 k=0 = 3*"1 £ (2/3)* (42) k=0 = 3 a ( l - ( 2 / 3 ) a ) = (3 a -2a) The number of adders needed to compute the matrix B{n) is the sum of the number of adders to compute the matrix R(n) plus the adders that compute the matrix ( • ^ > 3 ( 2 « - ' = + 1 - l ) , 3 ( ^ 2 Q - ' s + 1 - l ®-5)P 3 ( 2 a-*:+i_ 1 ) ) (2 Q -'=+ 1 -l)) (43) Fig. 11 The modified recursive architecture of Toom's algorithm 3 . One-Dimensional Convolution 35 Since each of the 3-input adders used in computing the matrix B can be realized by two 2-input adders, the number of adders required to compute equation 26 is given by 2 3^ _ 1 (2a~k+1 — l ) . From Appendix A, the number of adders used to realize the matrix k=i a R(a) is given by 2 ^ 3 f c _ 1 (2a~k — l ) . Therefore, the number of adders needed to compute _ k=i the matrix B(n) is given by 2 [ 3* _ 1 (2a~k+1 - l ) + 3 f c _ 1 (2a-k - l ) k=i = a " 1 £ ( § ) ' - ! £ » ' (44) *=i ~ jb=i = 3 x 2 a + 1 ( Q y - l ) - 2 ( 3 B - l ) = 4 x 3 Q , - 6 x 2 a + 2 Hence, the total number of adders used to compute the convolution of size n = 2a is given by (3 a - 2a) + (4 x 3 a - 6 x 2a + 2) = 5 x 3a - 7 x 2a + 2 = 5 x 3 l o g 2 n - 7n + 2 (45) The number of multipliers required is equal to the number of multiplications to compute the matrix D(n) and is equal to 3 l o § 2 n . Tables 2 and 3 show the number of additions and multiplications, required by the proposed method and compare these numbers to the direct method and the FFT method [32]. These tables show a significant reduction of the number of multipliers of the proposed algorithm over the direct and FFT methods. For example, although the number of adders proposed is twice that for the FFT, the number of multipliers is less than half of the FFT multipliers for the case n = 1024. 3.3 Chapter Summary In this chapter, we presented a class of highly modular parallel VLSI structures for the 1-d linear convolution based on a modified version of Toom's algorithm. The proposed 3 . One-Dimensional Convolution 36 Method Formula to compute the number of additions n = 8 n = 256 n = 1024 Direct n(n — 1) 56 65,280 1.047xl06 FFT 12ralog22n 384 27,648 0.135xl06 The proposed work based on Toom's algorithm 5 x 3 l o S 2 " - In + 2 81 31,015 0.288xl06 Table 2 Comparison of the number of additions required for performing convolution. Method Formula to compute the number pf multiplications n = 8 n = 256 n = 1024 Direct n2 64 65,536 1.048xl06 FFT 12nlog2 n + 8n 448 29,696 0.143xl06 The proposed work based on Toom s algorithm 27 6561 59,049 Table 3 Comparison of the number of multiplications required for performing convolution. algorithm allows the generation of higher order convolution architectures from three lower order convolutions in a parallel realization (with a single stage of multipliers) with the 3 . One-Dimensional Convolution 37 addition of two glue stages of pre- and post-processing (adders only). Equation 41 is the key result for this chapter, showing that an n-point convolution C(n), can be constructed from a parallel stage of three (re/2)-point convolutions C(n/2). As part of testing and verifying the functionality of our computational blocks, a Verilog HDL code has been developed to simulate our method. 
3.3 Chapter Summary

In this chapter, we presented a class of highly modular parallel VLSI structures for 1-d linear convolution based on a modified version of Toom's algorithm. The proposed algorithm allows the generation of higher-order convolution architectures from three lower-order convolutions in a parallel realization (with a single stage of multipliers), with the addition of two glue stages of pre- and post-processing (adders only). Equation 41 is the key result of this chapter, showing that an $n$-point convolution $C(n)$ can be constructed from a parallel stage of three $(n/2)$-point convolutions $C(n/2)$. As part of testing and verifying the functionality of our computational blocks, Verilog HDL code has been developed to simulate our method. The Verilog files of the different building blocks that recursively generate the 16-point convolution $C(16)$ from three identical 8-point convolutions $C(8)$, together with the pre-processing block $Q_{16}$ and the post-processing block $R_{16}$, are given in Appendix B. The circuits generated by the Verilog code were mapped and tested on an FPGA-based transformable computer. The results are reported in [1].

4 Multi-Dimensional Convolution

This chapter generalizes the results of the previous chapter by presenting a novel recursive algorithm for generating a higher-order multi-dimensional (m-d) convolution by combining the computation of $3^m$ identical lower-order (smaller-size) convolution computations of the same dimension $m$. We also show that, with careful design, the number of lower-order convolutions in the VLSI implementation can be reduced from $3^m$ to less than $\frac{2}{3}(3^m)$ [23], [30]. Our methodology is based on a non-trivial generalization of the 1-d convolution algorithm presented in the previous chapter.

4.1 Overview

Two-dimensional (2-d) convolution has found many applications, such as image processing and synthetic aperture radar (SAR) processing. In real-time image processing, 2-d convolution requires very high computation power. For example, a convolution with a 16 x 16 kernel, a frame size of 512 x 512 pixels, and a frame rate of 25 fps results in 1.7 billion multiply-accumulate operations per second. Additionally, in many applications two or more convolutions with different coefficient masks have to be applied in parallel to the same video signal, e.g., for edge detection and pattern recognition [56]. Therefore, it is necessary to perform the computationally intensive 2-d convolution in very efficient hardware.
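The workload figure quoted above follows from a one-line computation:

```python
# 2-d convolution workload of Section 4.1: a 16 x 16 kernel applied at
# every pixel of a 512 x 512 frame, 25 frames per second.
kernel_taps = 16 * 16            # multiply-accumulates per output pixel
pixels = 512 * 512
fps = 25
print(kernel_taps * pixels * fps)    # 1,677,721,600: about 1.7 billion MACs/s
```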
However, it should be mentioned that the multirate filter banks approach proposed in [7], [36], [71] and its generalization [51] implement a given convolution using shorter convolutions running at slower speeds. Although these approaches are computationally efficient, the number of small convolutions depends on the aperiodic convolution algorithm used. Moreover, these approaches are mostly suited for symmetric 2-d n x n sequences but not for sequences of length mxn (where m is different from n). In addition, the subsampling and upsampling processes proposed in [7], [51], [71] impose additional requirements on hardware speed and analog-to-digital (A/D) conversion. 4 . Multi-Dimensional Convolution 40 2-d Convolution Output Postadditionsl C ( r y 2 , r y 2 ) | (1) ^C(n1/2,n2/2)^ (2) C ( r V 2 , n 2 / 2 J (9) Stage # 3 Stage # 2 C(n , n2) Preadditions Stage # 1 n x n 1 2 Input Fig. 12 The 2-d recursive convolution architecture. 4.2 The One-Dimensional Shuffle-Free Algorithm We will show how we can remove the shuffle permutations from equations 33 and 37 representing the pre-additions and the post-additions, respectively. In this case, the effect of the shuffle permutations is implicitly embedded in the tensor expression. Even though the resulting circuits are topologically equivalent to the ones derived previously in chapter 3, removing the shuffle permutations from the tensor formulations can simplify data movement and also simplify the specification of such circuits in high-level hardware description languages. Thus, the resulting shuffle-free networks are suited for implementation using CAD tools that employ automatic partitioning, placement, and routing [26]. 4 . Multi-Dimensional Convolution 41 First, recall that the a — 1 permutations in the term ( -P2 Q - 1 3 ,2 Q - 2 3) a contained in equation 33 can be simplified by iteratively applying property 15 which yields ( P 2 a - 1 3 , 2 a - 2 3 ) = P2°—13, 3 • (46) Also, substituting I2<*-i for Bn2 in property 13 and applying it to equation 33, results in Qn = {P2a-13,2a-23) {h*-1 ® A) P2*>2a-i = P2a-1 3 3 ( - ^ 2 a _ 1 ® A)P2a 2OL — \ (47) = A (g) I2a-1 Note that the above expression for Qn does not include any (explicit) permutations. Similarly, substituting / ^ " - I for Anj in property 13 and applying it to 33, gives Rn = R(a)(P3f2"-i),3(h"-i <g> £)P 3 (2"-l) , (2 a - l ) ) (48) = R(a)(B ® I2°-i) In Appendix A, it is shown that R(a) can be represented as a combination of the identity matrices I2* only and hence, the representation of Rn has no permutations. The permutation- free recursive realization of the 8-point convolution using three 4-point convolutions is shown in Fig. 13. We assume that each addition requires one unit of time. A careful scrutiny of the realization shown in Fig. 13 reveals that the data movement through the computational stages encounters different amounts of delays. In particular, the computations involved in the Q$ matrix affect only the middle 4-point convolution in the center stage. Thus, the top and the bottom 4-point convolutions can be computed one addition cycle ahead of the middle 4-point convolution. This means that only two 4-point convolvers are needed at a time. Therefore, through the use of a multiplexers, either the top or the bottom 4-point convolvers can be removed from the architecture without affecting the speed of the computations as shown in Fig. 14. 4 . Multi-Dimensional Convolution 4 2 15 Q 8 8 \j_ 4- Point Convolution C(4) i —-rr-0 " — i ; I •vi vf | % 'ii i l . •<<<^<<<<<<>> • If ij ? 
Similarly, the realization of R_8 can be modified such that the first stage of three-input adders (realizing B \otimes I_7) is replaced with two-input adders. This is because the outputs from the second stage of the demultiplexer (realized by the seven horizontal arrows with shaded background in Fig. 13) are now moved to a second stage of three-input adders (realizing R(3)), as shown in Fig. 14, instead of the two-input adders used previously. The permutation-free 1-d recursive convolution architecture is shown in Fig. 15.

Fig. 14 The permutation-free multiplexed recursive realization of the 8-point convolution.

Fig. 15 The 1-d permutation-free recursive convolution architecture.

4.3 The Two-Dimensional Convolution Algorithm

Let n_1 = 2^{\alpha_1} and n_2 = 2^{\alpha_2}. Pratt [59] has shown that for an n_1 x n_2 input data image, the two-dimensional convolution output is given by

p = C_{n_1,n_2}\, f ,    (49)

where C_{n_1,n_2} is the 2-d convolution transform matrix, and p and f are the output and input column-scanned vectors, respectively, each of size n = n_1 n_2. Pratt has also shown that, for separable transforms, the matrix C_{n_1,n_2} can be decomposed into the tensor product

C_{n_1,n_2} = C(n_1) \otimes C(n_2) ,    (50)

where C(n_1) and C(n_2) represent the row and column 1-d convolution operators on f, respectively, as defined by equation 41. From equations 49 and 50, we can write the 2-d convolution in the form

q = (C(n_1) \otimes C(n_2))\, f .    (51)

Using equation 41, the 2-d convolution algorithm can be expressed as a function of 1-d convolutions as follows:

q = \left( [R_{n_1}(I_3 \otimes C(n_1/2))Q_{n_1}] \otimes [R_{n_2}(I_3 \otimes C(n_2/2))Q_{n_2}] \right) f .    (52)

Applying property 11 leads to

q = \left[ (R_{n_1} \otimes R_{n_2}) \left( (I_3 \otimes C(n_1/2)) \otimes (I_3 \otimes C(n_2/2)) \right) (Q_{n_1} \otimes Q_{n_2}) \right] f = \bar{R}\, \bar{C}_{n_1,n_2}\, \bar{Q}\, f ,    (53)

where

\bar{R} = R_{n_1} \otimes R_{n_2} ,
\bar{C}_{n_1,n_2} = (I_3 \otimes C(n_1/2)) \otimes (I_3 \otimes C(n_2/2)) ,    (54)
\bar{Q} = Q_{n_1} \otimes Q_{n_2} .

Note that the matrix \bar{C}_{n_1,n_2} contains the 1-d convolution matrices C(n_1/2) and C(n_2/2) in a tensor product expression. Our next task is to manipulate \bar{C}_{n_1,n_2} to derive a more efficient expression. We start by applying property 12 to \bar{C}_{n_1,n_2}, which gives

\bar{C}_{n_1,n_2} = (I_3 \otimes C(n_1/2) \otimes I_3) \otimes C(n_2/2) .    (55)

Applying property 14 yields

\bar{C}_{n_1,n_2} = \left( P_{9(2^{\alpha_1}-1),\,3(2^{\alpha_1}-1)} (I_9 \otimes C(2^{\alpha_1-1})) P_{9 \cdot 2^{\alpha_1-1},\,3} \right) \otimes C(2^{\alpha_2-1}) .    (56)

Since the convolution matrix C(n/2) is of dimension (n-1) x (n/2), we can write C(n_2/2) as

C(n_2/2) = I_{n_2-1}\, C(n_2/2)\, I_{n_2/2} .    (57)

Substituting 57 in 56 and applying property 11,

\bar{C}_{n_1,n_2} = \left( P_{9(n_1-1),\,3(n_1-1)} (I_9 \otimes C(n_1/2)) P_{9(n_1/2),\,3} \right) \otimes \left( I_{n_2-1}\, C(n_2/2)\, I_{n_2/2} \right)
 = \left( P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right) \left( I_9 \otimes (C(n_1/2) \otimes C(n_2/2)) \right) \left( P_{9(n_1/2),\,3} \otimes I_{n_2/2} \right) .    (58)

Now, substituting equation 58 in equation 53 gives

q = \hat{R} \left( I_9 \otimes C_{n_1/2,\,n_2/2} \right) \hat{Q}\, f ,    (59)

where

C_{n_1/2,\,n_2/2} = C(n_1/2) \otimes C(n_2/2)    (60)

is the lower-order 2-d convolution matrix for an n_1/2 x n_2/2 input image, and

\hat{Q} = \left( P_{9(n_1/2),\,3} \otimes I_{n_2/2} \right) (Q_{n_1} \otimes Q_{n_2})    (61)

and

\hat{R} = (R_{n_1} \otimes R_{n_2}) \left( P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right)    (62)

are the new 2-d pre- and post-additions, respectively. (The commutation identities invoked above as properties 13 and 14 can be checked numerically, as sketched below.)
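The derivations in this chapter lean repeatedly on the commutation property that exchanges tensor-product factors through stride permutations. The following sketch is our own NumPy check (stride_perm is a helper name of our choosing, and the property numbering follows the earlier chapters): it verifies the form A \otimes I_k = P_{rk,r}(I_k \otimes A)P_{sk,k} for a random 3 x 2 matrix A, the shape of Toom's pre-addition matrix.

import numpy as np

def stride_perm(n, s):
    # n x n stride-by-s permutation P_{n,s} (n divisible by s):
    # output index i*(n//s) + j takes input index j*s + i.
    m = n // s
    P = np.zeros((n, n))
    for i in range(s):
        for j in range(m):
            P[i * m + j, j * s + i] = 1
    return P

rng = np.random.default_rng(0)
r, s, k = 3, 2, 4                          # A is r x s, as for Toom's A matrix
A = rng.standard_normal((r, s))
lhs = stride_perm(r * k, r) @ np.kron(np.eye(k), A) @ stride_perm(s * k, k)
assert np.allclose(lhs, np.kron(A, np.eye(k)))

In hardware these permutations are pure wiring; the shuffle-free forms of section 4.2 fold them into the tensor structure so that no explicit routing stage remains.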
Equation 59 represents the recursive 2-d convolution algorithm. In this case we use 9 of the lower-order (n_1/2 x n_2/2) convolutions in parallel to generate the higher-order (n_1 x n_2) convolution, as was shown in Fig. 12.

4.3.1 Reducing the Complexity of the 2-d Convolution Algorithm

We will now further manipulate equations 61 and 62 so that the 2-d convolution can be computed using only 5 of the lower-order convolutions. From 47 and 58, we can rewrite \hat{Q} as

\hat{Q} = \left( P_{9(n_1/2),3} \otimes I_{n_2/2} \right) (Q_{n_1} \otimes Q_{n_2})
 = \left( P_{9(n_1/2),3} \otimes I_{n_2/2} \right) \left( (A \otimes I_{n_1/2}) \otimes (A \otimes I_{n_2/2}) \right)    (63)
 = \left( P_{9(n_1/2),3} \otimes I_{n_2/2} \right) \left[ (A \otimes I_{n_1/2} \otimes A) \otimes I_{n_2/2} \right] .

Also, from properties 11 and 12, we have

\hat{Q} = \left[ P_{9n_1/2,\,3}\,(A \otimes I_{n_1/2} \otimes A) \right] \otimes I_{n_2/2}
 = \left[ P_{9n_1/2,\,3}\,(A \otimes I_{3n_1/2})(I_{n_1} \otimes A) \right] \otimes I_{n_2/2} .    (64)

Using property 13, the parallel form (I_{n_1} \otimes A) in 64 can be modified to the vector form

(I_{n_1} \otimes A) = P_{3n_1,\,n_1}\,(A \otimes I_{n_1})\,P_{2n_1,\,2} .    (65)

Also, from property 13, we can write the vector form (A \otimes I_{3n_1/2}) in a parallel form as

(A \otimes I_{3n_1/2}) = P_{9n_1/2,\,3}\,(I_{3n_1/2} \otimes A)\,P_{3n_1,\,3n_1/2} .    (66)

Substituting 65 and 66 in 64 and applying property 16, we have

P_{9n_1/2,\,3}\, P_{9n_1/2,\,3n_1/2} = I_{9n_1/2} ,  P_{3n_1,\,3}\, P_{3n_1,\,n_1} = I_{3n_1} .    (67)

From equations 65, 66, and 67, we can write \hat{Q} in the form

\hat{Q} = \left[ (I_3 \otimes (A \otimes I_{n_1/2}))(A \otimes I_{n_1}) P_{2n_1,2} \right] \otimes I_{n_2/2}
 = \left[ (I_3 \otimes Q_{n_1})(A \otimes I_{n_1}) P_{2n_1,2} \right] \otimes I_{n_2/2} .    (68)

The vector form of 68 can be modified to the parallel form using property 13, and hence we can write \hat{Q} in the final form

\hat{Q} = P_{9(n_1/2)(n_2/2),\,9n_1/2} \left\{ I_{n_2/2} \otimes \left[ (I_3 \otimes Q_{n_1})(A \otimes I_{n_1}) P_{2n_1,2} \right] \right\} P_{n_1 n_2,\,n_2/2}    (69)
 = \left( I_3 \otimes P_{3(n_1/2)(n_2/2),\,3n_1/2} \right) \left( P_{3(n_2/2),\,3} \otimes I_{3n_1/2} \right) \left\{ I_{n_2/2} \otimes \left[ (I_3 \otimes Q_{n_1})(A \otimes I_{n_1}) P_{2n_1,2} \right] \right\} P_{n_1 n_2,\,n_2/2} .

The complete realization of \hat{Q} is shown in Fig. 16, where one can observe that the computations involved in the upper permutations (realized by the white rectangle) are one addition-time period ahead of those involved in the middle permutations (realized by the gray rectangle). Therefore, both computations can be multiplexed, reducing the number of outputs from 9(n_1/2 x n_2/2) to 6(n_1/2 x n_2/2) and the number of Q_{n_1} blocks from 3(n_2/2) to 2(n_2/2), as shown in Fig. 17. Moreover, since the computations within the lower permutations (realized by the black-and-white dotted rectangle) are similar to those of the 1-d convolution explained in section 4.2 and shown in Fig. 13, they too can be multiplexed in a way similar to that of Fig. 14, reducing their output from 3(n_1/2 x n_2/2) to 2(n_1/2 x n_2/2). Consequently, the total number of multiplexed outputs from \hat{Q} is reduced to 5(n_1/2 x n_2/2). Note that this is equivalent to reducing the number of lower-order 2-d convolutions in the core from 9 to 5, a computational saving of 44%. Note also that, in the multiplexed architecture, the original speed of the computations is not affected. Additional multiplexing stages can be inserted to further reduce the number of computations, but this would affect the speed of the original (non-multiplexed) algorithm.

A multiplexed version of the computation of the matrix \hat{R} can also be derived using a similar procedure.
Thus, from equations 48 and 62, we can write \hat{R} in the form

\hat{R} = (R_{n_1} \otimes R_{n_2}) \left( P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right)
 = \left[ (R(\alpha_1)(B \otimes I_{n_1-1})) \otimes (R(\alpha_2)(B \otimes I_{n_2-1})) \right] \left( P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right) ,    (70)

which, using property 11, can be modified to

\hat{R} = (R(\alpha_1) \otimes R(\alpha_2)) \left( (B \otimes I_{n_1-1}) \otimes (B \otimes I_{n_2-1}) \right) \left( P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right) .    (71)

Applying property 12, we have

\hat{R} = (R(\alpha_1) \otimes R(\alpha_2)) \left[ (B \otimes I_{n_1-1} \otimes B)\, P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right]
 = (R(\alpha_1) \otimes R(\alpha_2)) \left[ (B \otimes I_{3(n_1-1)})(I_{3(n_1-1)} \otimes B)\, P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1} \right] ,    (72)

Fig. 16 The complete realization of \hat{Q}.

Fig. 17 The multiplexed realization of \hat{Q}.

which, using property 13, can be reformulated in the parallel form

\hat{R} = (R(\alpha_1) \otimes R(\alpha_2))\, P_{9(n_1-1)(n_2-1),\,9(n_1-1)} \left[ I_{n_2-1} \otimes (B \otimes I_{3(n_1-1)})(I_{3(n_1-1)} \otimes B)\, P_{9(n_1-1),\,3(n_1-1)} \right] P_{9(n_1-1)(n_2-1),\,(n_2-1)} .    (73)

Using an analysis similar to that used earlier in deriving the expression for R_n, the three-input adders can be moved from the stage (B \otimes I_{3(n_1-1)}) to replace the two-input adders at the stage (R(\alpha_1) \otimes R(\alpha_2)) of the multiplexed realization, thus reducing the number of B blocks from 3(n_1-1)(n_2-1) to 2(n_1-1)(n_2-1).

4.3.2 The Computational Complexity of the 2-d Convolution Algorithm

Let MU(n x n) be the number of multiplications needed to compute the (n x n) 2-d convolution. From equation 59,

MU(n x n) = 9\, MU(n/2 x n/2) .    (74)

This number is reduced to

MU(n x n) = 5\, MU(n/2 x n/2)    (75)

using our multiplexed forms, as explained in section 4.3.1. It should be mentioned that the computational complexity of our proposed algorithm depends on the computations involved in the core (MU(n/2 x n/2)).

It is possible to use our proposed convolution algorithm for the design of digital filters by zero-padding both inputs to make them the same size and to eliminate aliasing. However, in most practical situations the image to be filtered is considerably larger (such as 512 x 512) than the support of the filter impulse response (e.g., 21 x 21). As a result, evaluating the output using the zero-padding method will be computationally intensive. Sectioning schemes may be used to overcome this problem. The topics of digital filter design and sectioning schemes for digital filters are beyond the scope of this thesis; more details can be found in [65].

The number of multiplications in our proposed algorithms, based on [53] as a core, compared to the direct method and the FFT method, is shown in Table 4 [65]. The table shows a significant reduction in the number of multipliers of the proposed multiplexed algorithm over the direct and FFT methods.

n x n     | Direct multiplication | FFT | Proposed 2-d algorithm | Proposed 2-d multiplexed algorithm
8 x 8     | 162                   | 96  | 104                    | 58
64 x 64   | 1147                  | 144 | 140                    | 78
128 x 128 | 4246                  | 156 | 148                    | 83

Table 4 Comparison of the number of multiplications required for the 2-d convolution (kernel of dimension 8 x 8).

As a concrete check of the 9-block recursion of equation 59, a small numerical sketch follows.
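The following sketch is our own NumPy model, not the thesis hardware; conv2d_direct and kara2d are hypothetical helper names. It builds an (n_1 x n_2) 2-d convolution from nine (n_1/2 x n_2/2) 2-d convolutions by applying the Karatsuba-style split of the earlier 1-d sketch along each dimension, which is exactly the 9-block core counted in equation 74.

import numpy as np

def conv2d_direct(x, h):
    # Full 2-d linear convolution by direct summation (reference).
    r = np.zeros((x.shape[0] + h.shape[0] - 1, x.shape[1] + h.shape[1] - 1))
    for i in range(h.shape[0]):
        for j in range(h.shape[1]):
            r[i:i + x.shape[0], j:j + x.shape[1]] += h[i, j] * x
    return r

def kara2d(x, h):
    # 2-d convolution of two n1 x n2 blocks from nine quarter-size
    # 2-d convolutions: pre-additions, 9 parallel cores, post-additions.
    n1, n2 = x.shape
    m1, m2 = n1 // 2, n2 // 2
    def split3(a, axis, m):
        lo, hi = np.split(a, [m], axis=axis)
        return [lo, lo + hi, hi]            # pre-additions in one dimension
    xs = [split3(u, 1, m2) for u in split3(x, 0, m1)]   # 3 x 3 quarter blocks
    hs = [split3(v, 1, m2) for v in split3(h, 0, m1)]
    cs = [[conv2d_direct(xs[i][j], hs[i][j]) for j in range(3)] for i in range(3)]
    def combine3(c0, cm, c1, axis, m):
        # post-additions: overlap-add at offsets 0, m, 2m along `axis`
        shape = list(c0.shape); shape[axis] = 4 * m - 1
        y = np.zeros(shape)
        sl = lambda off: tuple(slice(off, off + c0.shape[a]) if a == axis
                               else slice(None) for a in range(2))
        y[sl(0)] += c0; y[sl(m)] += cm - c0 - c1; y[sl(2 * m)] += c1
        return y
    rows = [combine3(cs[i][0], cs[i][1], cs[i][2], 1, m2) for i in range(3)]
    return combine3(rows[0], rows[1], rows[2], 0, m1)

x = np.random.randn(8, 8); h = np.random.randn(8, 8)
assert np.allclose(kara2d(x, h), conv2d_direct(x, h))

Nine quarter-size cores appear per level, in agreement with MU(n x n) = 9 MU(n/2 x n/2); the multiplexed hardware form shares four of them to reach the factor of 5 in equation 75.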
4.4 Multi-Dimensional Convolution

We can extend the derivation of the 2-d convolution algorithm in section 4.3 to the m-d case, for any m > 2. From equations 48, 49, and 50, we can write the m-d convolution in the form

q = (C(n_1) \otimes C(n_2) \otimes \cdots \otimes C(n_m))\, f = \left( \bigotimes_{i=1}^{m} C(n_i) \right) f .    (76)

Using properties 11-17 and following the derivation steps in section 4.3, we can write equation 76 in the form

q = \hat{R} \left( I_{3^m} \otimes \bigotimes_{i=1}^{m} C(n_i/2) \right) \hat{Q}\, f ,    (77)

where \bigotimes_{i=1}^{m} C(n_i/2) is the lower-order m-d convolution,

\hat{Q} = \left[ \prod_{i=1}^{m-1} \left( P_{u_1,u_2} \otimes I_{u_3} \right) \right] \left( \bigotimes_{i=1}^{m} Q_i \right) ,    (78)

\hat{R} = \left( \bigotimes_{k=1}^{m} R_k \right) \left[ \prod_{i=1}^{m-1} \left( P_{v_1,v_2} \otimes I_{v_3} \right) \right] ,    (79)

the stage-dependent dimensions u_1, u_2, u_3 and v_1, v_2, v_3 (equations 80-83) are fixed by the operand sizes at stage i — on the pre-additions side they are products of the factors 3 and 2^{\alpha_j - 1}, with u_2 = 3, and on the post-additions side products of the factors 3 and (2^{\alpha_k} - 1) — and Q_i and R_i are defined in equations 47 and 48, respectively.

Equation 77 represents the recursive m-d convolution computation. In this case, we use 3^m of the lower-order architectures in parallel. An analysis similar to that used for deriving \hat{Q} and \hat{R} in section 4.3 can be applied here, and multiplexed formulations can likewise be derived for both \hat{Q} and \hat{R} to reduce the number of lower-order convolution architectures to less than (5/9)(3^m).

4.5 Chapter Summary

In this chapter, we showed that a higher-order m-dimensional convolution can be generated recursively by combining the computations of 3^m identical lower-order (smaller size) convolutions of the same dimension. We also showed that, with careful design, this number can be reduced to less than (5/9)(3^m). The proposed networks have very small depth and contain only a single stage of multipliers, while all other stages contain adders only.

5 Multi-Dimensional DCT

In this chapter, we present a novel recursive algorithm for generating a higher-order m-dimensional DCT by combining the computations of 2^m identical lower-order (smaller size) DCT architectures [21]. Basically, a multi-dimensional DCT computation can be constructed from exactly one stage of smaller DCT computations of the same dimension. This is useful for both hardware and software solutions, in which a very efficient smaller-size m-d DCT core has been developed and a larger DCT computation is required.

5.1 DCT for Video Processing

The Discrete Cosine Transform (DCT) is a fundamental computation that arises in many important image and video processing applications [20]. Because it achieves high image compression ratios [37], DCT-based image coding became the basis for most contemporary image and video compression standards such as MPEG, JPEG, and H.261 [9], [10]. The basic computation of a DCT-based system is the transformation of an 8 x 8 image block from the spatial domain to the DCT domain. The choice of the DCT in the standards is motivated by the many benefits it offers [20]:

1. For highly correlated image data, the DCT compaction efficiency is close to that obtained with the optimum transform, namely the KLT (Karhunen-Loeve transform).

2. The DCT is an orthogonal transform. Thus, the computations of the forward and the inverse DCT are nearly the same, and from a hardware implementation viewpoint the same computation unit can be used for both.

3. An important property of the 2-d DCT and IDCT is separability. This implies that the 2-d DCT can be obtained by first performing 1-d DCTs on the rows followed by 1-d DCTs on the columns (a row-column sketch is given after this list). From an implementation viewpoint, this row-column approach may simplify the hardware requirements at the expense of a slight increase in the overall operation count. This property extends to the general m-d DCT.

4. The DCT computations can be performed in real arithmetic with fast algorithms that require fewer operations than the original matrix-vector computations. Fast algorithms are desirable from both hardware and software implementation viewpoints. They also tend to be parallelizable and thus can be efficiently implemented on parallel architectures, as will be shown throughout this chapter.
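The separability property of item 3 is easy to demonstrate. The sketch below is our own NumPy illustration (dct_matrix is a hypothetical helper built from the DCT-II cosine kernel, constant scale factors aside): applying the 1-d transform to every row and then every column of a block equals the 2-d transform T \otimes T acting on the scanned image vector.

import numpy as np

def dct_matrix(n):
    # Unnormalized DCT-II coefficient matrix: rows are sampled cosines,
    # with the k = 0 row scaled by 1/sqrt(2) (epsilon_0 in eq. 84).
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    T = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    T[0, :] /= np.sqrt(2)
    return T

n = 8
T = dct_matrix(n)
x = np.random.randn(n, n)                 # an n x n image block
row_col = T @ x @ T.T                     # 1-d DCT on rows, then on columns
kron    = (np.kron(T, T) @ x.flatten()).reshape(n, n)
assert np.allclose(row_col, kron)         # separability: T_{n,n} = T (x) T

The recursive construction of section 5.5 goes further: rather than m passes of 1-d transforms, it keeps the computation 2-d at every level.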
The MPEG-1 standard for motion video requires the 2-d DCT to be computed over each 8 x 8 image block within a frame size of 352 x 240 pixels in a video stream processed at a real-time rate of 30 fps. These computations consume 33% of the encoder computational load, and around 22% of the decoder load is consumed by the IDCT, as shown in Table 5 [37]. In the JPEG standard for still-image compression, the computation of the DCT demands close to 45% of the overall encoder computation time. Clearly, such a computationally intensive task must be implemented in fast hardware to meet real-time requirements.

Decoding function                          | Load (%)
Bit stream header parsing                  | 0.44
Huffman decoding and inverse quantization  | 19
8x8 IDCT                                   | 22.1
Motion compensation                        | 38.64
Color transformation and display           | 19.82

Table 5 The distribution of the computational load in MPEG-1 decoding.

5.2 Previous Related Work

Although a significant amount of work has appeared on computing the recursive 1-d DCT [18], [38], [57], [66], [69], [76], much fewer results are available on truly recursive 2-d algorithms [16], [47], [60]. The work presented in [60] is based on concatenating columns (or rows) of the 2-d input to reduce it to a 1-d array. The transform matrix is then factorized in block form, but only part of the factorization is truly recursive, which makes the algorithm less than optimal in efficiency and not modular in structure [76]. Although the algorithm presented in [16] has an improved recursive architecture, it works directly on the 2-d data and therefore has an irregular structure.

Hardware implementations of the DCT are abundant, including systolic arrays [31], [68], [75], parallel architectures based on distributed arithmetic [34], [41], bit-serial conventional row-column decompositions [8], [12], [43], [74], and full-custom VLSI architectures [44]. Other efficient approaches that work directly on the 2-d data to decrease the number of multiplications and additions are based on direct polynomial transformations [14], [15], [70], extensive manipulation of sparse matrices [42], [73], or the use of 1-d DCTs [39], [40]. It should be noted that although these approaches are very computationally efficient (for example, the algorithm presented in [73] requires only 94 multiplications for an 8 x 8 DCT and 54 multiplications for the scaled 8 x 8 DCT), they trade off this performance against irregular architectures that may not be suitable for VLSI implementation, especially for larger problem sizes. In VLSI, the complexity of data movement has a significant impact on wire area and, subsequently, on the overall layout performance and delay.

In relation to the above, this chapter presents a new recursive formulation for the 2-d DCT which results in a new class of scalable parallel architectures for computing the transform. Our main contribution is the derivation of a truly 2-d recursive formulation for the DCT which is useful for the modularization and parallelization of the resulting computation.
The main result reported in this chapter shows that a large 2-d DCT computation can be decomposed into a single parallel stage of four smaller 2-d DCTs with one additional pre-processing stage and one post-processing stage, as shown in Fig. 18. Our methodology can be generalized such that an m-dimensional DCT can be constructed using 2^m smaller-size m-d DCTs. The majority of the previous approaches focused on expressing an m-d DCT in terms of m consecutive stages of 1-d DCTs.

It should be mentioned that the 2-d DCT algorithm based on tensor product decomposition previously proposed in [55] scales the individual components constituting the 1-d DCT to obtain the 2-d DCT; thus, it is not a truly 2-d decomposition. Also, the recursive 2-d DCT algorithm presented in [47] is a generalization of the result reported in [18] to the 2-d case. Compared to the algorithm proposed in this chapter, the 2-d method of [47] requires multiplications in both the pre- and post-processing stages, while our algorithm requires multiplications in the pre-processing stage only. Moreover, while our algorithm uses smaller-size DCTs to compute the large DCT, the method in [47] employs smaller blocks that are not complete DCT computations. Thus, our algorithm is truly recursive, modular in structure, and can be generalized easily to the multidimensional case. Modularity in our case means that any suitable "small" DCT algorithm can be used in the core computations.

One immediate outcome of our results is the true "scalability" of the DCT computation. Basically, a multi-dimensional DCT computation can be constructed from exactly one stage of smaller DCT computations of the same dimension. This is useful for both hardware and software solutions, in which a very efficient smaller-size m-d DCT core has been developed [14], [40], [73], and a larger DCT computation is required. In such a situation, our approach can use these algorithms as the core stage in computing the larger-size DCT by simply adding the pre- and post-processing stages.

Fig. 18 The proposed recursive architecture for computing the 2-d DCT.

Regularity and modularity are essential factors for facilitating design automation and the implementation of parallel circuits in maturing and emerging programmable logic devices and ASIC environments, and our techniques contribute significantly to such design approaches.

5.3 Overview of the DCT

From the definition of the DCT [76], for an input vector \{x_0, x_1, x_2, \ldots, x_{n-1}\}, the DCT output vector \{X_0, X_1, X_2, \ldots, X_{n-1}\} is given by the relation

X_k = \frac{2\epsilon_k}{n} \sum_{i=0}^{n-1} x_i \cos\frac{\pi(2i+1)k}{2n} ,    (84)

where \epsilon_0 = 1/\sqrt{2} and \epsilon_k = 1 for k \geq 1. In matrix form, the DCT can be written (dropping the constant scale factor) as

X = T_n\, x ,    (85)

where X and x are the output and input vectors of length n each, and T_n is an n x n coefficient matrix. For example, T_4 has the form

T_4 = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ \cos\frac{\pi}{8} & \cos\frac{3\pi}{8} & \cos\frac{5\pi}{8} & \cos\frac{7\pi}{8} \\ \cos\frac{\pi}{4} & \cos\frac{3\pi}{4} & \cos\frac{5\pi}{4} & \cos\frac{7\pi}{4} \\ \cos\frac{3\pi}{8} & \cos\frac{9\pi}{8} & \cos\frac{15\pi}{8} & \cos\frac{21\pi}{8} \end{bmatrix} .    (86)

(A quick numerical check of the row orthogonality of T_n is given below.)
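As a numerical check of equations 84-86 (a NumPy sketch of our own; the 2\epsilon_k/n normalization is the convention reconstructed above and is an assumption), the rows of T_n are mutually orthogonal with squared norm n/2, so the inverse transform is just the transpose up to the constant scale factor — the orthogonality benefit cited in section 5.1.

import numpy as np

n = 8
k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
T = np.cos(np.pi * (2 * i + 1) * k / (2 * n))     # cosine kernel of eq. 84
T[0, :] /= np.sqrt(2)                             # epsilon_0 = 1/sqrt(2)

assert np.allclose(T @ T.T, (n / 2) * np.eye(n))  # rows orthogonal, norm n/2

x = np.random.randn(n)
X = (2 / n) * T @ x                               # forward DCT (eq. 84)
assert np.allclose(T.T @ X, x)                    # inverse = scaled transpose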
Based on the 1-d recursive algorithm proposed by Hou [18], Cvetkovic and Popovic [57] presented a modified algorithm that has fewer adders and no shift registers, and that can be represented by the following matrix-vector product:

\bar{X} = \tilde{T}_n \bar{x} = \begin{bmatrix} T_m & T_m \\ 2 L_m T_m C_m & -2 L_m T_m C_m \end{bmatrix} \begin{bmatrix} x_p \\ x_r \end{bmatrix} ,    (87)

where m = n/2,

L_m = \begin{bmatrix} 1 & 1 & 0 & \cdots & 0 \\ 0 & 1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 1 \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} ,    (88)

C_m = \mathrm{diag}\left[ \frac{1}{2\cos\phi_M} \right] ,  \phi_M = \frac{(4M+1)\pi}{2n} ,  M = 0, 1, 2, \ldots, m-1 ,    (89)

\bar{X} = [X_0\ X_2\ X_4\ \cdots\ X_{n-2}\ X_1\ X_3\ X_5\ \cdots\ X_{n-1}]^T ,    (90)

\bar{x} = [x_p\ x_r]^T = [x_0\ x_2\ \cdots\ x_{n-2}\ x_{n-1}\ x_{n-3}\ \cdots\ x_1]^T .    (91)

5.4 The One-Dimensional DCT in a Tensor-Product Form

The vector \bar{X} defined by equation 90 can be expressed as

\bar{X} = P_{n,2}\, X .    (92)

Also, the vector \bar{x} defined by equation 91 can be written as

\bar{x} = (I_{n/2} \oplus J_{n/2})\, P_{n,2}\, x ,    (93)

where I_{n/2} is the identity matrix of dimension n/2, \oplus is the direct sum operator defined in [46], and J_{n/2} is the exchange matrix (opposite diagonal matrix) of order n/2, defined as

J_{n/2} = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 1 & 0 \\ \vdots & & \iddots & & \vdots \\ 0 & 1 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \end{bmatrix} ,    (94)

and

V_n = (I_{n/2} \oplus J_{n/2})\, P_{n,2} .    (95)

From equations 85, 92, and 93, it follows that

\bar{X} = P_{n,2}\, T_n\, x ,    (96)

and from equations 87 and 93, it follows that

\bar{X} = \tilde{T}_n\, V_n\, x .    (97)

From property 16 and combining equations 96 and 97, we obtain

T_n = P_{n,n/2}\, \tilde{T}_n\, V_n ,    (98)

which, substituting the block structure of equation 87, can be decomposed into the matrix product

T_n = P_{n,n/2}\, (I_{n/2} \oplus L_{n/2}) \begin{bmatrix} T_{n/2} & 0 \\ 0 & T_{n/2} \end{bmatrix} (I_{n/2} \oplus C_{n/2}) \begin{bmatrix} I_{n/2} & I_{n/2} \\ I_{n/2} & -I_{n/2} \end{bmatrix} V_n ,    (99)

or, in tensor product form,

T_n = P_{n,n/2}\, (I_{n/2} \oplus L_{n/2})\, (I_2 \otimes \tilde{T}_{n/2})\, (I_{n/2} \oplus C_{n/2})\, (F_2 \otimes I_{n/2})\, V_n .    (100)

Applying equation 96 yields

T_n = P_{n,n/2}\, (I_{n/2} \oplus L_{n/2})\, (I_2 \otimes T_{n/2})\, (I_2 \otimes V_{n/2}^{-1})\, (I_{n/2} \oplus C_{n/2})\, (F_2 \otimes I_{n/2})\, V_n ,    (101)

which can be transformed, using property 11, into the simpler form

T_n = R_n\, (I_2 \otimes T_{n/2})\, Q_n ,    (102)

where

R_n = P_{n,n/2}\, (I_{n/2} \oplus L_{n/2}) ,    (103)

Q_n = (I_2 \otimes V_{n/2}^{-1})\, (I_{n/2} \oplus C_{n/2})\, (F_2 \otimes I_{n/2})\, V_n .    (104)

Equation 102 represents the recursive tensor product formulation for the 1-d DCT, in which the core computation (I_2 \otimes T_{n/2}) consists of a parallel stage of two identical smaller DCT computations, each of size n/2; Q_n and R_n serve as pre- and post-processing (glue) structures that combine the two lower-order DCT modules to construct the higher-order DCT of size n, as shown in Fig. 19. It should be noted that the computation of R_n involves additions only, while the computation of Q_n involves additions and multiplications. For example, the realization of R_4 for a 4-point input (n = 4) is shown in Fig. 20, where

L_2 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} .    (105)

Fig. 19 The 1-d DCT recursive architecture.

Fig. 20 The realization of R_4.

(The input reordering V_n of equation 95 is illustrated numerically below.)
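The reordering machinery of equations 90-95 is easy to check. The sketch below is our own NumPy illustration, reusing the hypothetical stride_perm helper introduced in chapter 4's sketch: it builds V_n = (I_{n/2} \oplus J_{n/2}) P_{n,2} and confirms that it produces the input order of equation 91 — even-indexed samples in natural order followed by odd-indexed samples in reverse.

import numpy as np

def stride_perm(n, s):
    # stride-by-s permutation P_{n,s}: y[i*(n//s) + j] = x[j*s + i]
    m = n // s
    P = np.zeros((n, n))
    for i in range(s):
        for j in range(m):
            P[i * m + j, j * s + i] = 1
    return P

n = 8
J = np.fliplr(np.eye(n // 2))                    # exchange matrix J_{n/2} (eq. 94)
Z = np.zeros((n // 2, n // 2))
V = np.block([[np.eye(n // 2), Z], [Z, J]]) @ stride_perm(n, 2)
x = np.arange(n)
print((V @ x).astype(int))                       # -> [0 2 4 6 7 5 3 1], as in eq. 91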
The computational structure of Q_n can be simplified by further manipulation of V_n as defined in equation 95. Since the three matrices I_{n/2}, J_{n/2}, and P_{n,2} are unitary, we can write V_{n/2}^{-1} as

V_{n/2}^{-1} = \left[ (I_{n/4} \oplus J_{n/4})\, P_{n/2,2} \right]^{-1} = \left[ (I_{n/4} \oplus J_{n/4})\, P_{n/2,2} \right]^{T} = P_{n/2,n/4}\, (I_{n/4} \oplus J_{n/4}) .    (106)

Using property 13, the term (F_2 \otimes I_{n/2}) in equation 104 can be converted to the following parallel form:

F_2 \otimes I_m = P_{n,2}\, (I_m \otimes F_2)\, P_{n,m} .    (107)

Thus, for a 4-point input vector, Q_4 has the form

Q_4 = \left[ I_2 \otimes V_2^{-1} \right] (I_2 \oplus C_2)\, P_{4,2}\, (I_2 \otimes F_2)\, P_{4,2}\, V_4 ,    (108)

where V_4 = (I_2 \oplus J_2) P_{4,2} and V_2^{-1} = I_2. Therefore,

Q_4 = (I_2 \oplus C_2)\, P_{4,2}\, (I_2 \otimes F_2)\, P_{4,2}\, (I_2 \oplus J_2)\, P_{4,2} .    (109)

The realization of Q_4 for a 4-point input (n = 4) is shown in Fig. 21, where, from equation 89, C_0 = \frac{1}{2\cos(\pi/8)} and C_1 = \frac{1}{2\cos(5\pi/8)}. Similar results, albeit in different formulations, have been reported previously for computing the 1-d DCT [18], [38], [57], [69]. However, the tensor product form proposed in equation 102 can be translated directly into a hardware realization, as shown in Figs. 19, 20, and 21.

Fig. 21 The realization of Q_4, where C_0 = \frac{1}{2\cos(\pi/8)} and C_1 = \frac{1}{2\cos(5\pi/8)}.

5.5 Truly Two-Dimensional Recursive Architectures for the DCT

This section develops two recursive methodologies for realizing 2-d DCT computations from smaller DCT computations. First, we present a direct method for realizing the 2-d DCT using the conventional row-column decomposition of the 1-d DCT in tensor product form. The second methodology provides a truly 2-d recursive realization; the resulting structures employ one stage of smaller 2-d DCTs to realize the large 2-d DCT. This method provides significant advantages over other architectures reported previously and, as far as we know, is the first truly 2-d recursive formulation of the DCT.

The 2-d DCT is defined as [20]

X(u,v) = \frac{4\,\epsilon_u \epsilon_v}{n_1 n_2} \sum_{n=0}^{n_1-1} \sum_{m=0}^{n_2-1} x(n,m)\, \cos\frac{(2n+1)u\pi}{2n_1}\, \cos\frac{(2m+1)v\pi}{2n_2} ,    (110)

for u = 0, 1, \ldots, n_1-1 and v = 0, 1, \ldots, n_2-1, where \epsilon_l = 1/\sqrt{2} for l = 0 and \epsilon_l = 1 for l > 0. Since the DCT matrix is separable [10], the 2-d DCT for an image of dimension n_1 x n_2 can be computed by a stage of n_2 parallel 1-d DCT computations on n_1 points each, followed by another stage of n_1 parallel 1-d DCT computations on n_2 points each. This can be represented by the matrix-vector form

X = T_{n_1,n_2}\, x ,    (111)

where T_{n_1,n_2} is the 2-d DCT transform matrix for an n_1 x n_2 image, and X and x are the output and input column-scanned vectors, respectively. For separable transforms, the matrix T_{n_1,n_2} can be represented by the tensor product

T_{n_1,n_2} = T_{n_1} \otimes T_{n_2} ,    (112)

where T_{n_1} and T_{n_2} are the row and column 1-d DCT operators on x, respectively. By substituting equation 112 in equation 111, we have

X = (T_{n_1} \otimes T_{n_2})\, x ,    (113)

which, by using property 13, can be expressed in a parallel form as

X = P_{n_1 n_2,\,n_1}\, (I_{n_2} \otimes T_{n_1})\, P_{n_1 n_2,\,n_2}\, (I_{n_1} \otimes T_{n_2})\, x .    (114)

The direct realization of equation 114 for a 4 x 4 (n_1 = n_2 = 4) input image is shown in Fig. 22.

Fig. 22 The direct realization of the 2-d DCT for a 4 x 4 input image (n_1 = n_2 = 4).

Now we derive a truly 2-d recursive formulation of the DCT by further manipulation. Substituting equation 102 in equation 113, we have

X = \left( [R_{n_1}(I_2 \otimes T_{n_1/2})Q_{n_1}] \otimes [R_{n_2}(I_2 \otimes T_{n_2/2})Q_{n_2}] \right) x .    (115)

From property 11, we can write X as

X = \left[ (R_{n_1} \otimes R_{n_2}) \left( (I_2 \otimes T_{n_1/2}) \otimes (I_2 \otimes T_{n_2/2}) \right) (Q_{n_1} \otimes Q_{n_2}) \right] x = \left[ \bar{R}_{n_1,n_2}\, \bar{T}_{n_1,n_2}\, \bar{Q}_{n_1,n_2} \right] x ,    (116)

where

\bar{R}_{n_1,n_2} = R_{n_1} \otimes R_{n_2} ,    (117)

\bar{T}_{n_1,n_2} = (I_2 \otimes T_{n_1/2}) \otimes (I_2 \otimes T_{n_2/2}) ,    (118)
\bar{Q}_{n_1,n_2} = Q_{n_1} \otimes Q_{n_2} .    (119)

Now, from property 12, we can write equation 118 in the form

\bar{T}_{n_1,n_2} = (I_2 \otimes T_{n_1/2} \otimes I_2) \otimes T_{n_2/2} .    (120)

Applying properties 11 and 14 to equation 120, we have

\bar{T}_{n_1,n_2} = \left( P_{2n_1,n_1}\, (I_4 \otimes T_{n_1/2})\, P_{2n_1,2} \right) \otimes \left( I_{n_2/2}\, T_{n_2/2}\, I_{n_2/2} \right)
 = \left( P_{2n_1,n_1} \otimes I_{n_2/2} \right) \left( I_4 \otimes T_{n_1/2,\,n_2/2} \right) \left( P_{2n_1,2} \otimes I_{n_2/2} \right) ,    (121)

where

T_{n_1/2,\,n_2/2} = T_{n_1/2} \otimes T_{n_2/2} .    (122)

Finally, substituting equation 121 in equation 116, we have

X = \hat{R}_{n_1,n_2}\, \left( I_4 \otimes T_{n_1/2,\,n_2/2} \right)\, \hat{Q}_{n_1,n_2}\, x ,    (123)

where

\hat{Q}_{n_1,n_2} = \left( P_{2n_1,2} \otimes I_{n_2/2} \right) \bar{Q}_{n_1,n_2} ,    (124)

and

\hat{R}_{n_1,n_2} = \bar{R}_{n_1,n_2} \left( P_{2n_1,n_1} \otimes I_{n_2/2} \right) .    (125)

Equation 123 represents the truly recursive two-dimensional DCT, in which \hat{Q}_{n_1,n_2} and \hat{R}_{n_1,n_2} are the pre- and post-processing glue structures, respectively, that combine 2^2 identical lower-order 2-d DCT modules (T_{n_1/2,\,n_2/2}), each of size n_1/2 x n_2/2, in parallel to construct the higher-order 2-d DCT of size n_1 x n_2.

In the remainder of this section, we further manipulate equations 124 and 125 to obtain parallel forms for \hat{Q}_{n_1,n_2} and \hat{R}_{n_1,n_2} that are more suitable for direct VLSI realization. For simplicity, let n_1 = n_2 = n; then we can rewrite equation 125 as

\hat{R}_{n,n} = (R_n \otimes R_n) \left( P_{2n,n} \otimes I_{n/2} \right) .    (126)

Applying property 13, we get

R_n \otimes R_n = P_{n^2,n}\, (I_n \otimes R_n)\, P_{n^2,n}\, (I_n \otimes R_n) .    (127)

Substituting equation 127 in equation 126, we have

\hat{R}_{n,n} = \left[ P_{n^2,n}\, (I_n \otimes R_n)\, P_{n^2,n}\, (I_n \otimes R_n) \right] \left( P_{2n,n} \otimes I_{n/2} \right) = A_n^2 \left( P_{2n,n} \otimes I_{n/2} \right) ,    (128)

where

A_n = P_{n^2,n}\, (I_n \otimes R_n) .    (129)

Equation 128 represents the post-additions stage, in which the term (P_{2n,n} \otimes I_{n/2}) represents the permutation P_{2n,n} dilated by n/2 [28]. The second term, A_n^2, represents a cascade of two identical stages, each realized by n parallel copies of R_n followed by the permutation P_{n^2,n}. The realization of \hat{R}_4 for a 4 x 4 input image (n = 4) is shown in Fig. 23, while the realization of R_4 was shown in Fig. 20.

Fig. 23 The realization of \hat{R}_4 for a 4 x 4 input image (n = 4).

Similarly, \hat{Q}_{n,n} can be written in the form

\hat{Q}_{n,n} = \left( P_{2n,2} \otimes I_{n/2} \right) (Q_n \otimes Q_n) = \left( P_{2n,2} \otimes I_{n/2} \right) B_n^2 ,    (130)

where

B_n = P_{n^2,n}\, (I_n \otimes Q_n) .    (131)

Equation 130 represents the pre-processing stage, in which B_n^2 represents a cascade of two identical stages, each realized by n parallel copies of Q_n followed by the permutation P_{n^2,n}. The term (P_{2n,2} \otimes I_{n/2}) represents the permutation P_{2n,2} dilated by n/2 [28]. The realization of \hat{Q}_4 for a 4 x 4 input image (n = 4) is shown in Fig. 24, while the realization of Q_4 was shown in Fig. 21. The complete recursive realization of the 2-d DCT for a 4 x 4 input image (n = 4) is shown in Fig. 25.

Fig. 24 The realization of \hat{Q}_4 for a 4 x 4 input image (n = 4).

Fig. 25 The complete recursive realization of the 2-d DCT for a 4 x 4 input image (n = 4).

5.6 Computation Complexity of the 2-d DCT

Let MU(n x n) be the number of multiplications needed to compute the (n x n) 2-d DCT. From equation 123,

MU(n x n) = 4\, MU(n/2 x n/2) + n^2 .    (132)

The n^2 term accounts for the pre-processing stage: B_n^2 consists of two cascaded stages of n parallel Q_n blocks, each containing the n/2 multipliers of C_{n/2}. (A small sketch evaluating this recurrence is given below.) It should be mentioned that the computational complexity of our proposed algorithm depends on the computations involved in the core (MU(n/2 x n/2)). The number of multiplications in our proposed algorithm, based on [40] as a core, compared to other algorithms, is shown in Table 6.
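A few lines suffice to evaluate the recurrence of equation 132. This is our own sketch; the base value 16 is the 4 x 4 multiplication count of the core algorithm [40] as quoted in Table 6, an assumption of this illustration.

def mu_2d_dct(n, mu_core=16, n_core=4):
    # Multiplications for an n x n 2-d DCT via eq. 132,
    # recursing down to an n_core x n_core core.
    if n == n_core:
        return mu_core
    return 4 * mu_2d_dct(n // 2, mu_core, n_core) + n * n

for n in (8, 16):
    print(n, mu_2d_dct(n))   # 8 -> 128, 16 -> 768

The counts for 8 x 8 and 16 x 16 reproduce the corresponding entries of the proposed-algorithm column of Table 6.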
n x n   | Row-Column [38] | C. Ma [47] | M. Vetterli [70] | P. Z. Lee [19] | Proposed algorithm based on [40] as a core
4 x 4   | 32              | 24         | 16               | 20             | 16
8 x 8   | 192             | 144        | 104              | 128            | 128
16 x 16 | 1024            | 768        | 568              | 704            | 768
32 x 32 | 5120            | 3804       | 2840             | 3584           | 3584

Table 6 Comparison of the number of multiplications required for the 2-d DCT.

In general, a basis for comparing the various DCT algorithms is the number of multiplications and additions they require. However, for a VLSI implementation other factors are also important, including the complexity of control logic, regularity, modularity, size, power, and the complexity of interconnect. As shown in Table 6, the number of multiplications in the proposed algorithm is improved over the conventional row-column algorithm [38] and is comparable to those of [19] and [47]. It should be emphasized again that our goal is a regular and modular algorithm that can be mapped efficiently onto VLSI architectures. Polynomial transforms [70] require fewer multiplications but have irregular structure and complex interconnection schemes among processing elements. Hence, such algorithms may be used for a software implementation on a general-purpose processor or a programmable DSP processor, but they are seldom the basis for dedicated DCT processors in video codecs [37].

5.7 The Multi-Dimensional Recursive Architecture of the DCT

The multi-dimensional DCT is very important for many video processing applications. The 3-d DCT, in particular, has many applications in 3-d video compression, such as the newly proposed multiple-component JPEG [35]. We can extend the 2-d DCT derivation of section 5.5 to the multi-dimensional case. From the separability property of the DCT, the m-dimensional DCT has the form

X = (T_{n_1} \otimes T_{n_2} \otimes \cdots \otimes T_{n_m})\, x ,    (133)

where T_{n_i} is the 1-d DCT coefficient matrix as defined in equation 85. The direct realization of equation 133 for the case n_1 = n_2 = n_3 = 2 (the core of the recursive 3-d DCT) is shown in Fig. 26.

Fig. 26 The direct realization of equation 133 for the case n_1 = n_2 = n_3 = 2.

Using properties 11-17 of the tensor product factorization and following the steps in section 5.5, we can write equation 133 in the form

X = \hat{R} \left( I_{2^m} \otimes \bigotimes_{i=1}^{m} T_{n_i/2} \right) \hat{Q}\, x ,    (134)

where \bigotimes_{i=1}^{m} T_{n_i/2} is the lower-order m-dimensional DCT,

\hat{Q} = \left[ \prod_{i=1}^{m-1} \left( P_{u_1,u_2} \otimes I_{u_3} \right) \right] \left( \bigotimes_{i=1}^{m} Q_i \right) ,    (135)

\hat{R} = \left( \bigotimes_{k=1}^{m} R_k \right) \left[ \prod_{i=1}^{m-1} \left( P_{v_1,v_2} \otimes I_{v_3} \right) \right] ,    (136)

with u_2 = 2 and the remaining stage dimensions u_1, u_3, v_1, v_2, and w (equations 137-142) given by products of the factor 2 and the operand sizes n_k at stage i, using the notation

\bigotimes_{i=1}^{v} Z_i = Z_1 \otimes Z_2 \otimes \cdots \otimes Z_v ,    (143)

and where R_i and Q_i are the one-dimensional post- and pre-processing operators defined in equations 103 and 104, respectively. Equation 134 extends our results by showing that a large m-d DCT can be computed from a single parallel stage of smaller m-d DCTs, together with the pre- and post-processing glue structures defined by equations 135 and 136, respectively.

5.8 Chapter Summary

In this chapter, we presented a class of highly modular parallel VLSI structures for the m-d DCT. The proposed algorithm allows the generation of higher-order DCT architectures from 2^m lower-order DCTs in a parallel realization with the addition of two glue stages of pre- and post-processing.
As part of testing and verifying the functionality of our computational blocks, a 4 x 4 DCT architecture has been developed. The schematics of the different parts of the design are given in Appendix C. The generated circuits were mapped and tested on an FPGA-based transformable computer; the results are reported in [29].

6 A General Methodology for Separable Transforms

In this chapter, we extend our methodology to other separable transforms. Two important transforms are formulated in recursive tensor product forms: the Discrete Fourier Transform (DFT) and the Walsh-Hadamard Transform (WHT). A general framework that summarizes our methodology and that can be applied to any separable transform is also given.

6.1 The Discrete Fourier Transform (DFT)

6.1.1 The 1-d DFT

The tensor product formulation of the 1-d n-point DFT is given by [46]

F_n = (F_2 \otimes I_{n/2})\, T_n\, (I_2 \otimes F_{n/2})\, P_{n,2} ,    (144)

where F_2 is the 2-point DFT,

T_n = I_{n/2} \oplus D_{n/2} ,    (145)

\oplus is the direct sum operator defined in [46], and D_n is the twiddle-factor matrix defined by

D_n = \mathrm{diag}\left( 1, \omega, \omega^2, \ldots, \omega^{n-1} \right) ,    (146)

with \omega = e^{2\pi i / n}. We can rewrite equation 144 as

F_n = R_n\, (I_2 \otimes F_{n/2})\, Q_n ,    (147)

where

R_n = (F_2 \otimes I_{n/2})\, T_n ,    (148)

Q_n = P_{n,2} .    (149)

Equation 147 represents the recursive formulation of the 1-d DFT in tensor product form, where R_n and Q_n act as glue structures that combine two of the lower-order DFTs (F_{n/2}) to construct F_n. The realization of F_4 is shown in Fig. 27.

Fig. 27 The realization of F_4.

6.1.2 The 2-d DFT

Since the DFT is a separable transform, we can write the 2-d DFT in the form

F_{n_1,n_2} = F_{n_1} \otimes F_{n_2} ,    (150)

where F_{n_1} and F_{n_2} represent the row and column 1-d DFTs, respectively. By substituting 147 in 150, we have

F_{n_1,n_2} = \left[ R_{n_1}(I_2 \otimes F_{n_1/2})Q_{n_1} \right] \otimes \left[ R_{n_2}(I_2 \otimes F_{n_2/2})Q_{n_2} \right] ,    (151)

which, using properties 11, 12, and 16, can be written as

F_{n_1,n_2} = (R_{n_1} \otimes R_{n_2}) \left( (I_2 \otimes F_{n_1/2}) \otimes (I_2 \otimes F_{n_2/2}) \right) (Q_{n_1} \otimes Q_{n_2})
 = (R_{n_1} \otimes R_{n_2}) \left( P_{2n_1,n_1} \otimes I_{n_2/2} \right) \left( I_4 \otimes F_{n_1/2,\,n_2/2} \right) \left( P_{2n_1,2} \otimes I_{n_2/2} \right) (Q_{n_1} \otimes Q_{n_2})
 = \hat{R}\, \left( I_4 \otimes F_{n_1/2,\,n_2/2} \right)\, \hat{Q} ,    (152)

where

\hat{R} = (R_{n_1} \otimes R_{n_2}) \left( P_{2n_1,n_1} \otimes I_{n_2/2} \right) ,    (153)

\hat{Q} = \left( P_{2n_1,2} \otimes I_{n_2/2} \right) (Q_{n_1} \otimes Q_{n_2}) .    (154)

Equation 152 represents the recursive formulation of the 2-d DFT in tensor product form; \hat{R} and \hat{Q} act as glue structures that combine four of the lower-order 2-d DFTs (F_{n_1/2,\,n_2/2}) to construct F_{n_1,n_2}. It should be noted from equations 153 and 154 that the pre-processing stage, \hat{Q}, involves no arithmetic and can be realized by shuffle permutations only, while the post-processing stage, \hat{R}, has two stages of additions and two stages of multiplications.

6.2 The Walsh-Hadamard Transform (WHT)

6.2.1 The 1-d WHT

For n = 2^{\alpha}, the Walsh-Hadamard transform W_n is defined as

W_n = W_{2^{\alpha}} = W_2 \otimes W_2 \otimes \cdots \otimes W_2 = W_2 \otimes W_{2^{\alpha-1}} ,    (155)

where

W_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} ,    (156)

which can be realized by the adder circuit shown in Fig. 28.

Fig. 28 The adder circuit used in the WHT.

Using property 13, we can write equation 155 in the form

W_{2^{\alpha}} = \prod_{i=1}^{\alpha} P_{2^{\alpha},2}\, (I_{2^{\alpha-1}} \otimes W_2) ,    (157)

which can be realized by \alpha similar stages of the computation P_{2^{\alpha},2}(I_{2^{\alpha-1}} \otimes W_2). Alternatively, we can realize equation 157 with one block of P_{2^{\alpha},2}(I_{2^{\alpha-1}} \otimes W_2) and take the output after \alpha iterations, as shown in Fig. 29 for the case n = 2^4 = 16. (A short sketch of this iteration follows.)

Fig. 29 The 16-point WHT.
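The iteration of equation 157 is compact enough to state directly. The sketch below is our own NumPy illustration, reusing the hypothetical stride_perm helper from the earlier sketches: it applies the single stage P_{2^\alpha,2}(I_{2^{\alpha-1}} \otimes W_2) exactly \alpha times and checks the result against the Kronecker definition of equation 155.

import numpy as np
from functools import reduce

def stride_perm(n, s):
    # stride-by-s permutation P_{n,s}: y[i*(n//s) + j] = x[j*s + i]
    m = n // s
    P = np.zeros((n, n))
    for i in range(s):
        for j in range(m):
            P[i * m + j, j * s + i] = 1
    return P

alpha = 4
n = 2 ** alpha
W2 = np.array([[1, 1], [1, -1]])
stage = stride_perm(n, 2) @ np.kron(np.eye(n // 2), W2)   # one iterated block

W_iter = reduce(lambda M, _: stage @ M, range(alpha), np.eye(n))
W_kron = reduce(np.kron, [W2] * alpha)                    # eq. 155
assert np.allclose(W_iter, W_kron)

The same block therefore serves either as \alpha pipelined stages or as a single time-multiplexed stage, which is the choice illustrated in Fig. 29.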
We can rewrite equation 155 in recursive form using property 11 as

W_n = W_2 \otimes W_{n/2} = (I_2 \otimes W_{n/2})(W_2 \otimes I_{n/2}) = (I_2 \otimes W_{n/2})\, Q_n ,    (158)

where

Q_n = W_2 \otimes I_{n/2} .    (159)

Equation 158 represents the recursive formulation of the 1-d WHT. In this case we use two of the lower-order WHTs in parallel, in addition to one pre-processing stage.

6.2.2 The 2-d WHT

Since the WHT is a separable transform, we can write the 2-d WHT in the form

W_{n_1,n_2} = W_{n_1} \otimes W_{n_2} .    (160)

Substituting equation 158 in equation 160 and applying properties 11-17, we have

W_{n_1,n_2} = \left[ (I_2 \otimes W_{n_1/2})Q_{n_1} \right] \otimes \left[ (I_2 \otimes W_{n_2/2})Q_{n_2} \right]
 = \left[ (I_2 \otimes W_{n_1/2}) \otimes (I_2 \otimes W_{n_2/2}) \right] \left[ Q_{n_1} \otimes Q_{n_2} \right]
 = \left[ \left( P_{2n_1,n_1} \otimes I_{n_2/2} \right) \left( I_4 \otimes W_{n_1/2,\,n_2/2} \right) \left( P_{2n_1,2} \otimes I_{n_2/2} \right) \right] \left[ Q_{n_1} \otimes Q_{n_2} \right]
 = \hat{R}\, \left( I_4 \otimes W_{n_1/2,\,n_2/2} \right)\, \hat{Q} ,    (161)

where

\hat{Q} = \left( P_{2n_1,2} \otimes I_{n_2/2} \right) (Q_{n_1} \otimes Q_{n_2})    (162)

and

\hat{R} = \left( P_{2n_1,n_1} \otimes I_{n_2/2} \right)    (163)

are the 2-d pre- and post-processing structures, respectively. It should be noted that equation 161 is the recursive formulation that generates the higher-order 2-d WHT from the lower-order 2-d ones. However, from the definition of the 1-d WHT given in equation 155, we can alternatively write the 2-d WHT of equation 160 as

W_{n_1,n_2} = W_{n_1} \otimes W_{n_2} = W_2 \otimes W_{n_1/2} \otimes W_{n_2} = W_2 \otimes (W_{n_1/2} \otimes W_{n_2}) = W_{n_1 n_2} ,    (164)

which means that the 2-d WHT on an n_1 x n_2 input is equivalent to the 1-d WHT on a 1-d input of size n_1 n_2.

6.3 A General Framework for Multidimensional Recursive DSP Transforms

In this section, we present a general framework for deriving recursive formulations of multidimensional DSP transforms.

6.3.1 The One-Dimensional Case

Consider a 1-d DSP algorithm in matrix-vector form,

Y_m = T_{m,n}\, X_n ,    (165)

where T_{m,n} is the transform matrix, and X_n and Y_m are the input vector of size n and the output vector of size m, respectively. Using the sparse matrix factorization approach [4], the matrix T_{m,n} can be factorized so that

T_{m,n} = T_1 T_2 \cdots T_k ,    (166)

where each of the matrices T_1, T_2, \ldots, T_k is sparse. Sparseness implies that either most of the elements of the matrix are zero or the matrix is in block diagonal form. By applying property 13 of the tensor product to the block diagonal matrices of equation 166, we have

T_{m,n} = R_l\, \left( I_j \otimes T_{m/2,\,n/2} \right)\, Q_k ,    (167)

where Q_k and R_l represent the pre- and post-processing glue structures that combine j parallel blocks of the lower-order transform T_{m/2,\,n/2}. For example, j = 3 for the linear convolution transform matrix, while j = 2 for the DCT, DFT, and WHT. The values of Q_k and R_l for the 1-d transforms derived throughout the thesis are summarized in Table 7.

Transform   | Q_k                                                                            | R_l
Convolution | A \otimes I_{2^{\alpha-1}}                                                     | R(\alpha)(B \otimes I_{2^{\alpha}-1})
DCT         | (I_2 \otimes V_{n/2}^{-1})(I_{n/2} \oplus C_{n/2})(F_2 \otimes I_{n/2})V_n     | P_{n,n/2}(I_{n/2} \oplus L_{n/2})
DFT         | P_{n,2}                                                                        | (F_2 \otimes I_{n/2})T_n
WHT         | W_2 \otimes I_{n/2}                                                            | I_n

Table 7 The summary of pre- and post-processing formulae.

(A generic numerical check of how such a 1-d factorization lifts to two dimensions is sketched below.)
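The two-dimensional step used in chapters 4-6 always has the same shape: if T = R (I_2 \otimes T_{n/2}) Q, then T \otimes T = (R \otimes R)(P_{2n,n} \otimes I_{n/2})(I_4 \otimes (T_{n/2} \otimes T_{n/2}))(P_{2n,2} \otimes I_{n/2})(Q \otimes Q). The sketch below is our own NumPy check of this generic identity on the WHT instance of Table 7, where R = I_n and Q = W_2 \otimes I_{n/2} (stride_perm is the same hypothetical helper as before).

import numpy as np

def stride_perm(n, s):
    # stride-by-s permutation P_{n,s}: y[i*(n//s) + j] = x[j*s + i]
    m = n // s
    P = np.zeros((n, n))
    for i in range(s):
        for j in range(m):
            P[i * m + j, j * s + i] = 1
    return P

n = 4
W2 = np.array([[1, 1], [1, -1]])
Wh = W2                                        # W_{n/2} with n = 4
R = np.eye(n)
Q = np.kron(W2, np.eye(n // 2))
T = np.kron(np.eye(2), Wh) @ Q                 # eq. 158: T = W_4
assert np.allclose(T, np.kron(W2, W2))

core = np.kron(np.eye(4), np.kron(Wh, Wh))     # I_4 (x) T_{n/2,n/2}
lhs = np.kron(T, T)
rhs = (np.kron(R, R)
       @ np.kron(stride_perm(2 * n, n), np.eye(n // 2))
       @ core
       @ np.kron(stride_perm(2 * n, 2), np.eye(n // 2))
       @ np.kron(Q, Q))
assert np.allclose(lhs, rhs)

Only the sizes of the permutations and the count of core blocks (j^2 = 4 here, 9 for convolution) change from transform to transform; this is precisely the pattern formalized in step 2 below.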
6.3.2 The Multi-Dimensional Case

Step 1: For an n_1 x n_2 input data image X_{n_1 n_2} and a separable 2-d transform T_{n_1,n_2}, we can write the 2-d output in the form

Y_{n_1 n_2} = T_{n_1,n_2}\, X_{n_1 n_2} ,    (168)

where X_{n_1 n_2} and Y_{n_1 n_2} are the input and output column-scanned vectors, respectively. For separable matrices, the 2-d transform T_{n_1,n_2} can be written in tensor product form as [59]

T_{n_1,n_2} = T_{n_1} \otimes T_{n_2} ,    (169)

where T_{n_1} and T_{n_2} are the row and column 1-d transforms, respectively, as defined in equation 167.

Step 2: We replace T_{n_1} and T_{n_2} by their corresponding values from equation 167 and apply the tensor product properties 11-17 to derive the 2-d recursive form

T_{n_1,n_2} = \hat{R}_{n_1,n_2}\, \left( I_{j^2} \otimes T_{n_1/2,\,n_2/2} \right)\, \hat{Q}_{n_1,n_2} ,    (170)

where \hat{Q}_{n_1,n_2} and \hat{R}_{n_1,n_2} are the 2-d pre- and post-processing glue structures, respectively, that combine j^2 of the 2-d lower-order transforms of dimension n_1/2 x n_2/2 (T_{n_1/2,\,n_2/2}), as explained in detail throughout the thesis for the convolution, DCT, DFT, and WHT.

Step 3: For multidimensional transforms, steps 1 and 2 can be extended to the general form

\bigotimes_{i=1}^{m} T_{n_i} = \hat{R}\, \left( I_{j^m} \otimes \bigotimes_{i=1}^{m} T_{n_i/2} \right)\, \hat{Q} ,    (171)

where \hat{Q} and \hat{R} are the m-d pre- and post-processing glue structures, and \bigotimes_{i=1}^{m} T_{n_i/2} is the lower-order m-d transform.

7 Conclusions and Future Work

7.1 Conclusions

This thesis proposed a general approach for developing scalable VLSI architectures for several fundamental computations in signal and video processing, such as multidimensional convolutions and discrete transforms. One advantage of this approach is the close relationship between the algorithm formulation and the structure of the underlying hardware, which simplifies the often difficult algorithm-mapping process.

The major contribution of this thesis is the set of novel formulations proposed for several parallel DSP computations, such as convolution and multi-dimensional transforms. Unlike previously proposed methods, our formulations are truly recursive in any dimension. In contrast, the widely used row-column approach for multi-dimensional transforms relies on decomposing the computation into a series of 1-d computations and then applying a recursive definition to each stage.

Chapters 3 and 4 of the thesis developed a new algorithm for multi-dimensional linear convolution based on Toom's algorithm. Although similar work has been done for the 1-dimensional case by Granata et al. [69], their method cannot derive a computational structure that employs smaller-size convolutions in the core stage, as is the case with our method. Our formulation also maintains the original feature of Toom's algorithm, namely a single stage of multiplications in the computation. For the DCT, chapter 5 provided a similar formulation that utilizes a single parallel stage of multi-dimensional DCT computational blocks to construct a larger DCT computation of the same dimension.

A potential benefit of this research is that it can contribute to a design methodology that results in parallel computations which can be embedded in regular and modular hardware structures. Regularity and modularity are essential factors for facilitating design automation and implementation of parallel organizations in VLSI. Indeed, our recursive formulations for a class of fundamental multidimensional DSP computations allow the generation of a larger VLSI architecture by combining the computations realized by a single parallel stage of identical smaller-size VLSI blocks. Such an approach leads to truly scalable performance for large computations. Our methodology emphasized a highly parallel approach to VLSI architectures. The proposed recursive approach is also useful for software-based solutions, in which a very efficient code has been developed for a smaller-size problem and a larger computation is required; however, this point is open for further investigation.

The merits of our proposed architectures can be summarized as follows:
1. A multi-dimensional (m-d) convolution of size n can be constructed from a single parallel stage of 3^m m-d convolutions, each of size n/2, with the addition of two glue stages of pre- and post-processing.

2. A multi-dimensional (m-d) DCT of size n can be constructed from a single parallel stage of 2^m m-d DCTs, each of size n/2, with the addition of two glue stages of pre- and post-processing.

3. The methodology can be extended to derive similar recursive formulations for the DFT, the WHT, and a large number of other separable transforms.

4. They provide a recursive design methodology for multidimensional DSP and video processing. The same design can be reused as a core for different-size computations: a user can freely choose a very fast core and recursively build larger architectures using only additional pre- and post-processing stages.

Fully pipelined versions of the proposed architectures allow full utilization of the arithmetic and data-movement resources. Moreover, the area can be reduced dramatically by network folding [24], [64] and by using scheduled pipelines. This approach is well suited to automated synthesis and can lead to significant savings in design and verification time.

7.2 Future Work

The work of this thesis can be extended in a number of ways, a few of which are suggested below.

1. Developing new mapping techniques. The past research has concentrated on using the tensor-product decomposition as the primary tool for developing efficient mapping techniques; however, other types of matrix decomposition are known.

2. Developing our mapping methodology into an automatic tool for high-level synthesis of parallel circuits for DSP.

3. Applying the mapping techniques to solve other classes of problems in DSP and related areas.

Bibliography

[1] H. A. Chow ; A. Elnaggar ; H. M. Alnuweiri. "Highly Parallel Processing on the Virtual Computer". SPIE'95, Field Programmable Gate Arrays (FPGAs) for Fast Board Development and Reconfigurable Computing, 1995.

[2] T. Lu ; M. R. Azimi-Sadjadi. "Interleaved Pipeline Structures for Two-Dimensional Recursive Filtering". IEEE Trans. on Circuits and Systems for Video Technology, 3(1), 1993.

[3] G. A. Jullien ; M. A. Bayoumi. "A Review of VLSI Technologies in Digital Signal Processing". Proc. of IEEE Midwest Symp. on Circuits and Systems, 1992.

[4] R. E. Blahut. "Fast Algorithms for Digital Signal Processing". Addison-Wesley Publishing Company, 1987.

[5] X. Liu ; L. T. Bruton. "High-Speed Systolic Ladder Structures for MD Recursive Digital Filters". IEEE Trans. on Signal Processing, 44(4), 1996.

[6] H. V. Sorensen ; C. A. Katz ; C. S. Burrus. "Efficient FFT Algorithms for DSP Processing Using Tensor Product Decompositions". Proc. of ICASSP'90, 1990.

[7] Z. J. Mou ; P. Duhamel. "A Unified Approach to the Fast FIR Filtering Algorithms". Proc. of ICASSP'88, 1988.

[8] S. Cucchi ; M. Fratti. "A Novel Architecture for VLSI Implementation of the 2-D DCT/IDCT". Proc. of ICASSP'92, 1992.

[9] D. Le Gall. "MPEG: A Video Compression Standard for Multimedia Applications". Communications of the ACM, 34(4), 1991.

[10] P. Pirsch ; N. Demassieux ; W. Gehrke. "VLSI Architectures for Video Compression - A Survey". Proceedings of the IEEE, 83(2), 1995.

[11] A. Giri ; V. Visvanathan ; S. K. Nandy ; S. K. Ghoshal. "High Speed Digital Filtering on SRAM-Based FPGAs". Proc. of the 7th Int. Conf. on VLSI Design, 1994.
"VLSI Implementation of a 16x16 Discrete Cosine Transform". IEEE Trans, on Circuits and Systems, 36(4), 1989. [13] D. Soudris ; V. Paliouraas ; T. Stouraitis ; A. Skavantzos ; C. Goutis. "Systematic Design of Full Adder-Based Architectures for Convolution". Proc. ofICASSP'93, 1993. [14] M. Vetterli ; P. Duhamel ; C. Guillemot. "Trade-offs in the computation of Mono- and Multi-Dimensional DCT's". Proc. ofICASSP'88, 1988. [15] P. Duhamel ; C. Guillemot. "Polynomial Transform Computation of the 2-D DCT". Proc. ofACASSP'90, 1990. [16] M. A. Haque. "A Two-Dimensional Fast Cosine Transfom". IEEE Trans, on ASSP, 33(6), 1985. [17] M. S. B. Romdhane ; V. K. Madisetti ; J. W. Hines. "Quick-Turnaround ASIC Design in VHDL: Core-Based Behavioral Synthesis". Kluwer Academic Publishers, 1996. [18] H. S. Hou. "A Fast Recursive Algorithm for Computing the Discrete Cosine Transfom". IEEE Trans, on ASSP, 35(10), 1987. [19] P. Z. Lee ; F.-Y. Huang. "Restructured Recursive DCT and DST Algorithms". IEEE Trans, on Signal Processing, 42(7), 1994. [20] K. R. Rao ; J. J. Hwang. "Techniques and Standards for Image, Video and Audio Coding". Prentice Hall PTR, 1996. [21] A. Elnaggar ; H. M. Alnuweiri ; M. R. Ito. "A New Multi-Dimensional Recursive algorithm for Computing the Discrete Cosine Transform". Accepted for publication in The IEEE Trans, on Circuits and Systems for Video Technology. Bibliography 89 [22] A. Elnaggar ; H. M. Alnuweiri ; M. R. Ito. "A New Tensor Product Formulation for Toom's Convolution Algorithm". Accepted for publication in The IEEE Trans, on Signal Processing. [23] A. Elnaggar ; H. M. Alnuweiri; M. R. Ito. "Highly Parallel Recursive Multi-Dimensional Algorithms and Architectures for Digital Filtering and Convolution". In review, IEEE Trans, on Circuits and Systems-Part II: Analog and Digital Signal Processing. [24] A. Elnaggar ; H. M. Alnuweiri ; M. R. Ito. "Mapping Tensor Products Onto VLSI Networks with Reduced I/O". Proc. of IEEE Fourth Great Lakes Symposium on VLSI, 1994. [25] A. Elnaggar ; H. M. Alnuweiri ; M. R. Ito. "Scalable ParaUel VLSI Networks for Digital Signal Processing". Proc. of ICECS '94, IEEE First Int. Conf. on Electronics, Circuits and Systems, 1994. [26] A. Elnaggar ; H. M. Alnuweiri; M. R. Ito. "Efficient Modular Interconnection Networks for Digital Filtering and Convolution". Proc. of IEEE Pacific Rim Conference On Communications, Computers, Visualization, and Signal Processing, 1995. [27] A. Elnaggar ; H. M. Alnuweiri ; M. R. Ito. "Highly Parallel VLSI Architectures for Linear Convolution". Proc. of IS CAS'95, 1995. [28] A. Elnaggar ; H. M . Alnuweiri; M. R. Ito. "New Tensor Product Factorization for Toom's Algorithm". Proc. of IEEE International Conference on Digital Signal Processing, 1995. [29] A. Elnaggar ; H. M. Alnuweiri ; M. R. Ito. "Embedding Large Multidimensional DSP Computations in Reconfigurable Logic". Proc. of SPIE East '96 Symposium, High Speed Computing, Digital Signal Processing, and Filtering for FPGAs, 1996. [30] A. Elnaggar ; H. M. Alnuweiri ; M. R. Ito. "A Novel Parallel Algorithm for Bibliography 90 Multi-Dimensional Digital Filtering and Convolution". Accepted for publication in The CCECE'97- The 1997 Canadain Conference on Electrical and Computer Engineering, May 25-28, St. John's, Canada, 1997. [31] C. Chakrabarti ; J. Jaja. "Optimal systolic Designs for Computing DHT and DCT". VLSI Signal Processing III, 1988. [32] E. C. Ifeachor ; B. W. Jervis. "Digital Signal Processing - A Practical Approach". Addison-Wesley Publishing Company, 1993. 
[33] R. A. Horn ; C. R. Johnson. "Topics in Matrix Analysis". Cambridge University Press, 1991.

[34] K. Shin ; H. Jeon ; Y. Kang. "An Efficient VLSI Implementation of Vector-Radix 2-D DCT Using Mesh-Connected 2-D Array". Proc. of ISCAS'94, 1994.

[35] J. H. Kasner. "Multiple-Component JPEG". ISO/IEC JTC 1/SC 29/WG1 - Technical Proposal, 1996.

[36] S. Muramatsu ; H. Kiya. "Multidimensional Parallel Processing Methods for Rational Lattice Alteration". Proc. of ISCAS'95, 1995.

[37] V. Bhaskaran ; K. Konstantinides. "Image and Video Compression Standards: Algorithms and Architectures". Kluwer Academic Publishers, 1996.

[38] B. G. Lee. "A New Algorithm to Compute the Discrete Cosine Transform". IEEE Trans. on ASSP, 32(6), 1984.

[39] N. I. Cho ; S. U. Lee. "Fast Algorithm and Implementation of 2-D DCT". IEEE Trans. on Circuits and Systems, 38(3), 1991.

[40] N. I. Cho ; S. U. Lee. "A Fast 4x4 DCT Algorithm for the Recursive 2-D DCT". IEEE Trans. on Signal Processing, 40(9), 1992.

[41] W. Liebsch. "Parallel Architecture for VLSI Implementation of a 2-Dimensional Discrete Cosine Transform for Image Coding". Proc. of IEE Conf. on Image Processing and Its Applications, 1989.

[42] E. Feig ; E. Linzer. "Scaled DCT's on Input Sizes That Are Composite". IEEE Trans. on Signal Processing, 43(1), 1995.

[43] M. H. Sheu ; J. Y. Lee ; J. F. Wang ; A. N. Suen ; L. Y. Liu. "A High Throughput-Rate Architecture for 8x8 2-D DCT". Proc. of ISCAS'93, 1993.

[44] V. Srinivasan ; K. J. R. Liu. "Full Custom VLSI Implementation of High-Speed 2-D DCT/IDCT Chip". Proc. of IEEE Int'l Conf. on Image Processing, 1994.

[45] C. Van Loan. "Computational Frameworks for the Fast Fourier Transform". SIAM, 1992.

[46] R. Tolimieri ; M. An ; C. Lu. "Algorithms for Discrete Fourier Transform and Convolution". Springer-Verlag, 1989.

[47] C. Ma. "A Fast Recursive Two-Dimensional Cosine Transform". Proc. of SPIE'88, The Int'l Society of Optical Engineering, 1988.

[48] V. K. Madisetti. "VLSI DSP: An Introduction to Rapid Prototyping and Design Synthesis". Butterworth-Heinemann Publishers, 1995.

[49] F. Marino. "A Fixed-Point Parallel Convolver without Precision Loss for the Real-Time Processing of Long Numerical Sequences". Proc. of the 7th IEEE Symposium on Parallel and Distributed Processing, 1995.

[50] K. K. Parhi ; D. G. Messerschmitt. "Pipeline Interleaving and Parallelism in Recursive Digital Filters - Part I: Pipelining Using Scattered Look-Ahead and Decomposition". IEEE Trans. on ASSP, 37(7), 1989.

[51] I.-S. Lin ; S. K. Mitra. "Fast FIR Filtering Algorithms Based on Overlapped Block Structure". Proc. of ISCAS'93, 1993.

[52] P. Regalia ; S. Mitra. "Kronecker Products, Unitary Matrices, and Signal Processing Applications". SIAM Review, 31(4), 1989.

[53] P. C. Balla ; A. Antoniou ; S. D. Morgera. "Higher Radix Aperiodic-Convolution Algorithms". IEEE Trans. on ASSP, 34(1), 1986.

[54] M. S. Mohammed ; S. L. Matilla ; L. Nozal. "Fast 2D Convolution Filter Based on Look-up Table FFT". Proc. of the Int'l Symposium on Industrial Electronics, 1992.

[55] H. R. Wu ; F. J. Paoloni. "A Two-Dimensional Fast Cosine Transform Algorithm Based on Hou's Approach". IEEE Trans. on Signal Processing, 39(2), 1991.

[56] V. Hecht ; K. Ronner ; P. Pirsch. "An Advanced Programmable 2D-Convolution Chip for Real Time Image Processing". Proc. of ISCAS'91, 1991.

[57] Z. Cvetkovic ; M. V. Popovic. "New Fast Recursive Algorithms for the Computation of Discrete Cosine and Sine Transforms". IEEE Trans. on Signal Processing, 40(8), 1992.
[58] J. L. Aravena ; W. A. Porter. "Array Based Design of Digital Filters". IEEE Trans. on ASSP, 38(9), 1990.

[59] W. K. Pratt. "Digital Image Processing". John Wiley & Sons, Inc., 1991.

[60] F. A. Kamangar ; K. R. Rao. "Fast Algorithms for the 2-D Discrete Cosine Transform". IEEE Trans. on Computers, 31(9), 1982.

[61] A. B. Kahng ; G. Robins. "On Optimal Interconnections for VLSI". Kluwer Academic Publishers, 1995.

[62] D. Rodriguez. "Tensor Product Algebra as a Tool for VLSI Implementation of the Discrete Cosine Transform". Proc. of ICASSP'91, 1991.

[63] S. D. Kaushik ; S. Sharma ; C.-H. Huang ; J. R. Johnson ; R. W. Johnson ; P. Sadayappan. "An Algebraic Theory for Modeling Direct Interconnection Networks". Proc. of ICASSP'92, 1992.

[64] H. M. Alnuweiri ; S. M. Sait. "Efficient Network Folding Techniques for Routing Permutations in VLSI". IEEE Trans. on VLSI, 3(2), 1995.

[65] R. King ; M. Ahmadi ; R. Gorgui-Naguib ; A. Kwabwe ; M. Azimi-Sadjadi. "Digital Filtering in One and Two Dimensions". Plenum Press, 1989.

[66] L.-P. Chau ; W.-C. Siu. "Recursive Algorithm for the Discrete Cosine Transform with General Lengths". Electronics Letters, 30(9), 1994.

[67] T. Skinner. "Harness Multiprocessing Power for DSP Systems". Electronic Design Magazine, February 1994.

[68] H. Lim ; E. E. Swartzlander. "A Systolic Array for 2-D DFT and 2-D DCT". Proc. of the Int'l Conf. on Application Specific Array Processors, 1994.

[69] J. Granata ; M. Conner ; R. Tolimieri. "Recursive Fast Algorithms and the Role of the Tensor Product". IEEE Trans. on Signal Processing, 40(12), 1992.

[70] M. Vetterli. "Fast 2-D Discrete Cosine Transform". Proc. of ICASSP'85, 1985.

[71] M. Vetterli. "Running FIR and IIR Filtering Using Multirate Filter Banks". IEEE Trans. on ASSP, 36(5), 1988.

[72] T. Wang ; C. L. Wang. "A New Two-Dimensional Block Adaptive FIR Filtering Algorithm". Proc. of ICASSP'94, 1994.

[73] E. Feig ; S. Winograd. "Fast Algorithms for the Discrete Cosine Transform". IEEE Trans. on Signal Processing, 40(9), 1992.

[74] F. A. McGovern ; R. F. Woods ; M. Yan. "Novel VLSI Implementation of 8x8 Point 2-D DCT". Electronics Letters, 30(8), 1994.

[75] M. H. Lee ; Y. Yasuda. "New 2-D Systolic Array Algorithm for DCT/DST". Electronics Letters, 25(25), 1989.

[76] K. R. Rao ; P. Yip. "Discrete Cosine Transform: Algorithms, Advantages, and Applications". Academic Press, 1990.

[77] S. Zohar. "A VLSI Implementation of a Correlator/Digital Filter Based on Distributed Arithmetic". IEEE Trans. on ASSP, 37(1), 1989.

Appendix A. A Complete Characterization of the Matrix R(\alpha)

The matrix R(\alpha) has the following properties:

1. The number of 1-entries per row is either 1 or 2; all other entries are zeros.

2. The number of 1-entries per column is exactly 1; all other entries are zeros.

3. The number of rows that have two 1-entries is 2(2^{\alpha-1} - 1). It should be noted that this number is the same as the number of adders required to realize the effect of the matrix R(\alpha) on its input vector.

Using these properties, we can derive modular realizations for the matrix R(\alpha) at different stages. For a convolution of size n = 2^{\alpha}, the coordinates of the 1-entries are

\left[ i \cdot 2^{\alpha-1} + j,\; i(2^{\alpha} - 1) + j \right] ,  i = 0, 1, 2 ;  j = 0, 1, \ldots, 2^{\alpha} - 2 .    (172)

Let \alpha - 1 = \beta. Then the coordinates of the 1-entries become

\left[ i \cdot 2^{\beta} + j,\; i(2^{\beta+1} - 1) + j \right] ,  i = 0, 1, 2 ;  j = 0, 1, \ldots, 2^{\beta+1} - 2 .    (173)
Now, observe that if i = 0 while j varies over its entire range, the set of coordinates for the 1-entries is given by

\[
[\,j,\; j\,] = I^{(1)}_{2^{\beta+1}-1} \tag{174}
\]

which describes an identity matrix that occupies rows 0 to 2^{β+1} - 2 and columns 0 to 2^{β+1} - 2 of the matrix R(α). Similarly, when i = 1 and j varies over its entire range, the set of coordinates for the 1-entries is given by

\[
[\,2^{\beta} + j,\; (2^{\beta+1} - 1) + j\,] = I^{(2)}_{2^{\beta+1}-1} \tag{175}
\]

which describes another identity matrix I^{(2)} placed in rows 2^β to (3·2^β - 2) and columns (2^{β+1} - 1) to (2^{β+2} - 3). Finally, when i = 2 and j varies over its entire range, the set of coordinates for the 1-entries is given by

\[
[\,2^{\beta+1} + j,\; 2(2^{\beta+1} - 1) + j\,] = I^{(3)}_{2^{\beta+1}-1} \tag{176}
\]

which describes the third identity matrix I^{(3)} placed in rows 2^{β+1} to (2^{β+2} - 2) and columns (2^{β+2} - 2) to (3·2^{β+1} - 4).

Observe that while the three identity matrices occupy distinct columns of the matrix R(α), they overlap in their row occupancy. Specifically, the last (2^β - 1) rows of I^{(1)} and the first (2^β - 1) rows of I^{(2)} both occupy rows 2^β to (2^{β+1} - 2) of R(α), and the last (2^β - 1) rows of I^{(2)} and the first (2^β - 1) rows of I^{(3)} both occupy rows 2^{β+1} to (3·2^β - 2). There is no overlap between the row coordinates of I^{(1)} and I^{(3)}. Consequently, R(α) contains two 1-entries in each of the overlap rows identified above, and exactly one 1-entry in every remaining row.

Using these properties, we can derive modular realizations of R(α) for different convolution sizes. For a convolution of size 2 (α = 1), R(1) has dimension (2^2 - 1) x 3(2 - 1) = 3 x 3 and the form

\[
R(1) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{177}
\]

with 2(2^0 - 1) = 0 rows containing two 1-entries; R(1) is simply the identity matrix I_3, requires no adders, and has no effect on the input vector. Similarly, for a convolution of size 4 (α = 2), R(2) is the 7 x 9 matrix

\[
R(2) = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{178}
\]

with 2(2^1 - 1) = 2 rows containing two 1-entries, so two adders suffice. The adder network induced by this matrix is shown in Fig. 30, and the complete realization of R4 represented in (14) is given in Fig. 31.

Fig. 30 The realization of R(2).
[Fig. 31: the complete realization of R4 (schematic figure).]

Appendix B. The Verilog HDL Files and Schematics of the 16-point Convolution

B.1 The Verilog Files of A(3)

B.1.1 Half Adder Verilog File

module halfadder(sum,carry,a,b);
  input a,b;
  output sum,carry;
  and ad1(carry,a,b);   // carry = a & b
  xor xo1(sum,a,b);     // sum = a ^ b
endmodule

B.1.2 Full Adder Verilog File

// Full adder built from two half adders and an OR of the two carries.
module fulladder(sum,carry,a,b,c);
  input a,b,c;
  output sum,carry;
  halfadder hf1(sum1,carry1,a,b),
            hf2(sum,carry2,sum1,c);
  or or1(carry,carry1,carry2);
endmodule

B.1.3 Flip-Flop Verilog File

// Gate-level master-slave D flip-flop with an active-low reset.
module flop(d,clk,reset,q);
  input d,clk,reset;
  output q;
  nand nd1(a,d,clk,reset),
       nd2(b,nd,clk),
       nd3(dd,c,b,reset),
       nd4(e,c,nclk),
       nd5(f,dd,nclk),
       nd6(qb,q,f,reset);
  nand nd7(c,a,dd),
       nd8(q,e,qb);
  not inv1(nd,d),
      inv2(nclk,clk);
endmodule
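A small testbench makes the flip-flop's behavior visible in simulation. The sketch below is our own illustrative addition (the module name tb_flop, the clock period, and the stimulus are assumptions, not part of the thesis files):

module tb_flop;
  reg d, clk, reset;
  wire q;

  // device under test: the gate-level flip-flop above
  flop uut(d, clk, reset, q);

  always #5 clk = ~clk;        // free-running clock

  initial begin
    $monitor($time, ": clk=%b reset=%b d=%b q=%b", clk, reset, d, q);
    clk = 0; d = 0;
    reset = 0;                 // active-low reset forces q to 0
    #12 reset = 1;             // release reset
    #10 d = 1;                 // q should pick up d through the master-slave path
    #20 d = 0;
    #20 $finish;
  end
endmodule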
B.1.4 Serial Adder Verilog File

// Bit-serial adder: one full adder, a flip-flop feeding the carry back
// for the next bit time, and a flip-flop registering the sum bit.
module serialadder(sum,a,b,clk,reset);
  input a,b,clk,reset;
  output sum;
  fulladder fd1(s,c,a,b,cc);
  flop fp1(c,clk,reset,cc),
       fp2(s,clk,reset,sum);
endmodule

B.1.5 Serial-Parallel Multiplier Verilog File

// Serial-parallel multiplier: the serial input is gated by the parallel
// coefficient bits and accumulated through a chain of serial adders,
// producing the product bit-serially on serout.
module Serparmmtp(serout,b0,serin,parin1,parin2,parin3,parin4,parin5,parin6,parin7,parin0,clk,reset);
  output serout;
  input b0,serin,parin1,parin2,parin3,parin4,parin5,parin6,parin7,parin0,clk,reset;
  serialadder se1(s1,a1,b0,clk,reset),
              se2(s2,a2,s1,clk,reset),
              se3(s3,a3,s2,clk,reset),
              se4(s4,a4,s3,clk,reset),
              se5(s5,a5,s4,clk,reset),
              se6(s6,a6,s5,clk,reset),
              se7(s7,a7,s6,clk,reset),
              se8(serout,a8,s7,clk,reset);
  and an1(a1,parin7,serin),
      an2(a2,parin6,serin),
      an3(a3,parin5,serin),
      an4(a4,parin4,serin),
      an5(a5,parin3,serin),
      an6(a6,parin2,serin),
      an7(a7,parin1,serin),
      an8(a8,parin0,serin);
endmodule

B.1.6 A(1) Verilog File

// A(1) maps the serial pair (x0, x1) to (x0, x0+x1, x1); the two flops
// delay x0 and x1 so all three outputs stay aligned with the registered
// serial-adder output.
module a_one(a0,a1,a2,x0,x1,clk,reset);
  output a0,a1,a2;
  input x0,x1,clk,reset;
  serialadder sa1(a1,x0,x1,clk,reset);
  flop fl1(x0,clk,reset,a0),
       fl2(x1,clk,reset,a2);
endmodule

B.1.7 A(2) Verilog File

// A(2) built recursively from five a_one blocks.
module a_two(a,x,clk,reset);
  output [8:0] a;
  input clk,reset;
  input [3:0] x;
  a_one aa1(a00,a01,a02,x[0],x[2],clk,reset),
        aa2(a10,a11,a12,x[1],x[3],clk,reset),
        aa3(a[0],a[1],a[2],a00,a10,clk,reset),
        aa4(a[3],a[4],a[5],a01,a11,clk,reset),
        aa5(a[6],a[7],a[8],a02,a12,clk,reset);
endmodule

B.1.8 A(3) Verilog File

// A(3) built from four a_one blocks followed by three a_two blocks.
module a_three(a,x,clk,reset);
  output [26:0] a;
  input clk,reset;
  input [7:0] x;
  wire [3:0] w0,w1,w2;
  a_one aa1(w0[0],w1[0],w2[0],x[0],x[4],clk,reset),
        aa2(w0[1],w1[1],w2[1],x[1],x[5],clk,reset),
        aa3(w0[2],w1[2],w2[2],x[2],x[6],clk,reset),
        aa4(w0[3],w1[3],w2[3],x[3],x[7],clk,reset);
  a_two ab1(a[8:0],w0,clk,reset),
        ab2(a[17:9],w1,clk,reset),
        ab3(a[26:18],w2,clk,reset);
endmodule
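The bit-serial building blocks of B.1 can be sanity-checked with similar testbenches. The sketch below, again our own illustrative addition (the module name tb_serialadder and the stimulus are assumptions), drives the serial adder with the LSB-first streams for 5 and 3; since 5 + 3 = 8 = 1000 in binary, the sum stream should read 0,0,0,1 LSB first, delayed by the registered output:

module tb_serialadder;
  reg a, b, clk, reset;
  wire sum;

  serialadder uut(sum, a, b, clk, reset);

  always #5 clk = ~clk;

  initial begin
    $monitor($time, ": a=%b b=%b sum=%b", a, b, sum);
    clk = 0; reset = 0; a = 0; b = 0;  // reset clears the carry flip-flop
    #12 reset = 1;
    #3  a = 1; b = 1;                  // bit 0: 1+1 -> sum 0, carry 1
    #10 a = 0; b = 1;                  // bit 1: 0+1+carry -> sum 0, carry 1
    #10 a = 1; b = 0;                  // bit 2: 1+0+carry -> sum 0, carry 1
    #10 a = 0; b = 0;                  // bit 3: carry flushes out -> sum 1
    #20 $finish;
  end
endmodule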
B.2 The Verilog Files of B(3)

B.2.1 B(1) Verilog File

// B(1) post-addition stage: b0 and b2 are pipeline-delayed copies of x0
// and x2, while b1 realizes x1 - (x0 + x2): the sum x0 + x2 is inverted
// and the 'one' input supplies the +1 of the two's-complement negation.
module b_one(b0,b1,b2,x0,x1,x2,clk,rest,one);
  output b0,b1,b2;
  input x0,x1,x2,clk,rest,one;
  flop fp1(x1,clk,rest,x11),
       fp2(x11,clk,rest,x111),
       fp3(one,clk,rest,ones),
       fp4(x0,clk,rest,x01),
       fp5(x01,clk,rest,x02),
       fp6(x02,clk,rest,b0),
       fp7(x2,clk,rest,x21),
       fp8(x21,clk,rest,x22),
       fp9(x22,clk,rest,b2);
  serialadder sad1(s1,x0,x2,clk,rest),
              sad2(s2,invs1,ones,clk,rest),
              sad3(b1,x111,s2,clk,rest);
  not inv1(invs1,s1);
endmodule

B.2.2 B(2) Verilog File

module b_two(b,x,clk,rest,one);
  output [6:0] b;
  input [8:0] x;
  input clk,rest,one;
  b_one b1(b10,b11,b12,x[0],x[1],x[2],clk,rest,one),
        b2(b20,b21,b22,x[3],x[4],x[5],clk,rest,one),
        b3(b30,b31,b32,x[6],x[7],x[8],clk,rest,one),
        b4(out0,bb0,bb6,b10,b20,b30,clk,rest,one),
        b5(out1,out3,out5,b11,b21,b31,clk,rest,one),
        b6(ba0,ba1,out6,b12,b22,b32,clk,rest,one);
  serialadder sad1(b[2],ba0,bb0,clk,rest),
              sad2(b[4],ba1,bb6,clk,rest);
  flop fp1(out0,clk,rest,b[0]),
       fp2(out1,clk,rest,b[1]),
       fp3(out3,clk,rest,b[3]),
       fp4(out5,clk,rest,b[5]),
       fp5(out6,clk,rest,b[6]);
endmodule

B.2.3 B(3) Verilog File

module b_three(b,x,clk,rest,one);
  output [14:0] b;
  input [26:0] x;
  input clk,rest,one;
  wire [6:0] b21,b22,b23;
  b_two b221(b21[6:0],x[8:0],clk,rest,one),
        b222(b22[6:0],x[17:9],clk,rest,one),
        b223(b23[6:0],x[26:18],clk,rest,one);
  b_one bb1(bb10,bb11,bb12,b21[0],b22[0],b23[0],clk,rest,one),
        bb2(bb20,bb21,bb22,b21[1],b22[1],b23[1],clk,rest,one),
        bb3(bb30,bb31,bb32,b21[2],b22[2],b23[2],clk,rest,one),
        bb4(bb40,bb41,bb42,b21[3],b22[3],b23[3],clk,rest,one),
        bb5(bb50,bb51,bb52,b21[4],b22[4],b23[4],clk,rest,one),
        bb6(bb60,bb61,bb62,b21[5],b22[5],b23[5],clk,rest,one),
        bb7(bb70,bb71,bb72,b21[6],b22[6],b23[6],clk,rest,one);
  serialadder sad1(b[4],bb50,bb11,clk,rest),
              sad2(b[5],bb60,bb21,clk,rest),
              sad3(b[6],bb70,bb31,clk,rest),
              sad4(b[8],bb51,bb12,clk,rest),
              sad5(b[9],bb61,bb22,clk,rest),
              sad6(b[10],bb71,bb32,clk,rest);
  flop fp1(bb10,clk,rest,b[0]),
       fp2(bb20,clk,rest,b[1]),
       fp3(bb30,clk,rest,b[2]),
       fp4(bb40,clk,rest,b[3]),
       fp5(bb41,clk,rest,b[7]),
       fp6(bb42,clk,rest,b[11]),
       fp7(bb52,clk,rest,b[12]),
       fp8(bb62,clk,rest,b[13]),
       fp9(bb72,clk,rest,b[14]);
endmodule

B.3 The Verilog Files Combining A(3), D(3), and B(3)

B.3.1 D(3) with A(3) Verilog File

// The 27 coefficient inputs h0..h26 realize the diagonal matrix D(3);
// each parsermult (the serial-parallel multiplier stage, cf. B.1.5)
// multiplies one pre-addition output a[i] by its 11-bit coefficient.
module a_d_three(d,h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
                 h19,h20,h21,h22,h23,h24,h25,h26,x,b0,clk,reset);
  output [26:0] d;
  input clk,reset,b0;
  input [7:0] x;
  input [10:0] h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
               h19,h20,h21,h22,h23,h24,h25,h26;
  wire [26:0] a;
  a_three aa(a,x,clk,reset);
  parsermult p0(d[0],b0,a[0],h0,clk,reset),
             p1(d[1],b0,a[1],h1,clk,reset),
             p2(d[2],b0,a[2],h2,clk,reset),
             p3(d[3],b0,a[3],h3,clk,reset),
             p4(d[4],b0,a[4],h4,clk,reset),
             p5(d[5],b0,a[5],h5,clk,reset),
             p6(d[6],b0,a[6],h6,clk,reset),
             p7(d[7],b0,a[7],h7,clk,reset),
             p8(d[8],b0,a[8],h8,clk,reset),
             p9(d[9],b0,a[9],h9,clk,reset),
             p10(d[10],b0,a[10],h10,clk,reset),
             p11(d[11],b0,a[11],h11,clk,reset),
             p12(d[12],b0,a[12],h12,clk,reset),
             p13(d[13],b0,a[13],h13,clk,reset),
             p14(d[14],b0,a[14],h14,clk,reset),
             p15(d[15],b0,a[15],h15,clk,reset),
             p16(d[16],b0,a[16],h16,clk,reset),
             p17(d[17],b0,a[17],h17,clk,reset),
             p18(d[18],b0,a[18],h18,clk,reset),
             p19(d[19],b0,a[19],h19,clk,reset),
             p20(d[20],b0,a[20],h20,clk,reset),
             p21(d[21],b0,a[21],h21,clk,reset),
             p22(d[22],b0,a[22],h22,clk,reset),
             p23(d[23],b0,a[23],h23,clk,reset),
             p24(d[24],b0,a[24],h24,clk,reset),
             p25(d[25],b0,a[25],h25,clk,reset),
             p26(d[26],b0,a[26],h26,clk,reset);
endmodule
B.3.2 A(3), D(3), and B(3) Verilog File

module conv_eight(c,h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
                  h19,h20,h21,h22,h23,h24,h25,h26,x,b0,clk,reset,one);
  input [7:0] x;
  input [10:0] h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
               h19,h20,h21,h22,h23,h24,h25,h26;
  input clk,reset,b0,one;
  output [14:0] c;
  wire [26:0] d;
  a_d_three ad(d,h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
               h19,h20,h21,h22,h23,h24,h25,h26,x,b0,clk,reset);
  b_three b(c,d,clk,reset,one);
endmodule
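Since conv_eight produces its 15 outputs bit-serially, a word-level reference is handy when checking simulations. The following self-contained behavioral sketch is our own addition (the module name conv8_reference and the test vectors are illustrative); it prints the expected linear convolution y = x * h:

module conv8_reference;
  integer x [0:7];
  integer h [0:7];
  integer y [0:14];
  integer i, k;

  initial begin
    // example data; substitute the words actually driven into conv_eight
    for (i = 0; i < 8; i = i + 1) begin
      x[i] = i + 1;
      h[i] = 8 - i;
    end
    // direct 15-sample linear convolution: y[k] = sum over i of x[i]*h[k-i]
    for (k = 0; k < 15; k = k + 1) begin
      y[k] = 0;
      for (i = 0; i < 8; i = i + 1)
        if ((k - i) >= 0 && (k - i) <= 7)
          y[k] = y[k] + x[i] * h[k - i];
      $display("y[%0d] = %0d", k, y[k]);
    end
  end
endmodule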
B.4 The Verilog Files to Generate the 16-point Convolution from the 8-point

B.4.1 Q(4) Verilog File

// Q(4) pairs the serial inputs (x[i], x[i+8]) and forms the three 8-point
// sequences: a[7:0] = delayed x[7:0], a[15:8] = x[i]+x[i+8], and
// a[23:16] = delayed x[15:8].
module q_four(a,x,clk,reset);
  output [23:0] a;
  input clk,reset;
  input [15:0] x;
  a_one a0(a[0],a[8],a[16],x[0],x[8],clk,reset),
        a1(a[1],a[9],a[17],x[1],x[9],clk,reset),
        a2(a[2],a[10],a[18],x[2],x[10],clk,reset),
        a3(a[3],a[11],a[19],x[3],x[11],clk,reset),
        a4(a[4],a[12],a[20],x[4],x[12],clk,reset),
        a5(a[5],a[13],a[21],x[5],x[13],clk,reset),
        a6(a[6],a[14],a[22],x[6],x[14],clk,reset),
        a7(a[7],a[15],a[23],x[7],x[15],clk,reset);
endmodule

B.4.2 RS(4) Verilog File

// Combines the three 15-sample sub-results into the 31-sample output.
module r_s_four_one(conv16,x,clk,reset,one);
  output [30:0] conv16;
  input [44:0] x;
  input clk,reset,one;
  // uses 15 b_one blocks, (8+8+1=17) ffs, and (7+7=14) serialadders
  b_one b0(b00,b01,b02,x[0],x[15],x[30],clk,reset,one),
        b1(b10,b11,b12,x[1],x[16],x[31],clk,reset,one),
        b2(b20,b21,b22,x[2],x[17],x[32],clk,reset,one),
        b3(b30,b31,b32,x[3],x[18],x[33],clk,reset,one),
        b4(b40,b41,b42,x[4],x[19],x[34],clk,reset,one),
        b5(b50,b51,b52,x[5],x[20],x[35],clk,reset,one),
        b6(b60,b61,b62,x[6],x[21],x[36],clk,reset,one),
        b7(b70,b71,b72,x[7],x[22],x[37],clk,reset,one),
        b8(b80,b81,b82,x[8],x[23],x[38],clk,reset,one),
        b9(b90,b91,b92,x[9],x[24],x[39],clk,reset,one),
        b1_0(b100,b101,b102,x[10],x[25],x[40],clk,reset,one),
        b1_1(b110,b111,b112,x[11],x[26],x[41],clk,reset,one),
        b1_2(b120,b121,b122,x[12],x[27],x[42],clk,reset,one),
        b1_3(b130,b131,b132,x[13],x[28],x[43],clk,reset,one),
        b1_4(b140,b141,b142,x[14],x[29],x[44],clk,reset,one);
  flop f0(b00,clk,reset,conv16[0]),
       f1(b10,clk,reset,conv16[1]),
       f2(b20,clk,reset,conv16[2]),
       f3(b30,clk,reset,conv16[3]),
       f4(b40,clk,reset,conv16[4]),
       f5(b50,clk,reset,conv16[5]),
       f6(b60,clk,reset,conv16[6]),
       f7(b70,clk,reset,conv16[7]),
       f8(b71,clk,reset,conv16[15]),
       f9(b72,clk,reset,conv16[23]),
       f10(b82,clk,reset,conv16[24]),
       f11(b92,clk,reset,conv16[25]),
       f12(b102,clk,reset,conv16[26]),
       f13(b112,clk,reset,conv16[27]),
       f14(b122,clk,reset,conv16[28]),
       f15(b132,clk,reset,conv16[29]),
       f16(b142,clk,reset,conv16[30]);
  serialadder s0(conv16[8],b01,b80,clk,reset),
              s1(conv16[9],b11,b90,clk,reset),
              s2(conv16[10],b21,b100,clk,reset),
              s3(conv16[11],b31,b110,clk,reset),
              s4(conv16[12],b41,b120,clk,reset),
              s5(conv16[13],b51,b130,clk,reset),
              s6(conv16[14],b61,b140,clk,reset),
              s7(conv16[16],b81,b02,clk,reset),
              s8(conv16[17],b91,b12,clk,reset),
              s9(conv16[18],b101,b22,clk,reset),
              s10(conv16[19],b111,b32,clk,reset),
              s11(conv16[20],b121,b42,clk,reset),
              s12(conv16[21],b131,b52,clk,reset),
              s13(conv16[22],b141,b62,clk,reset);
endmodule

B.4.3 The 16-point Convolution Verilog File

module Conv_sixteen(c,h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
                    h19,h20,h21,h22,h23,h24,h25,h26,h27,h28,h29,h30,h31,h32,h33,h34,
                    h35,h36,h37,h38,h39,h40,h41,h42,h43,h44,h45,h46,h47,h48,h49,h50,h51,h52,h53,
                    h54,h55,h56,h57,h58,h59,h60,h61,h62,h63,h64,h65,h66,h67,h68,h69,h70,h71,h72,h73,
                    h74,h75,h76,h77,h78,h79,h80,x,b0,clk,reset,one);
  input [15:0] x;
  input [10:0] h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
               h19,h20,h21,h22,h23,h24,h25,h26,h27,h28,h29,h30,h31,
               h32,h33,h34,h35,h36,h37,h38,h39,h40,h41,h42,h43,h44,h45,h46,
               h47,h48,h49,h50,h51,h52,h53,h54,h55,h56,h57,h58,h59,h60,h61,
               h62,h63,h64,h65,h66,h67,h68,h69,h70,h71,h72,h73,
               h74,h75,h76,h77,h78,h79,h80;
  input clk,reset,b0,one;
  output [30:0] c;
  wire [23:0] a;
  wire [44:0] cc;
  conv_eight c1(cc[14:0],h0,h1,h2,h3,h4,h5,h6,h7,h8,h9,h10,h11,h12,h13,h14,h15,h16,h17,h18,
                h19,h20,h21,h22,h23,h24,h25,h26,a[7:0],b0,clk,reset,one),
             c2(cc[29:15],h27,h28,h29,h30,h31,h32,h33,h34,h35,h36,h37,h38,h39,h40,h41,
                h42,h43,h44,h45,h46,h47,h48,h49,h50,h51,h52,h53,a[15:8],b0,clk,reset,one),
             c3(cc[44:30],h54,h55,h56,h57,h58,h59,h60,h61,h62,h63,h64,h65,h66,h67,
                h68,h69,h70,h71,h72,h73,h74,h75,h76,h77,h78,h79,h80,a[23:16],b0,clk,reset,one);
  q_four q(a,x,clk,reset);
  r_s_four_one rs(c,cc,clk,reset,one);
endmodule
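The bus widths in Conv_sixteen can be cross-checked directly: each conv_eight core produces a 15-sample result, the three cores together fill the intermediate bus cc, and the combining stage emits the full 16-point convolution of length 2·16 - 1:

\[
3 \times (2 \cdot 8 - 1) = 45 = \dim(cc), \qquad 2 \cdot 16 - 1 = 31 = \dim(conv16).
\]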
Appendix C. The Schematics of the DCT Architecture

[The subsections below are schematic figures in the original; only their captions are retained here.]

C.1 The Schematic of the Full Adder
C.2 The Schematic of the Half Adder
C.3 The Schematic of F2
C.4 The Schematic of Q4
C.5 The Schematic of C̄4
C.6 The Schematic of R4
C.7 The Schematic of R4
C.8 The Schematic of the Serial Adder
C.9 The Schematic of the Serial-Parallel Multiplier
C.10 The Schematic of the Serial Subtractor
C.11 The Schematic of the 2-d 2 x 2 DCT
C.12 The Schematic of the 1-d 2-Point DCT
C.13 The Schematic of the 2-d 4 x 4 DCT

Appendix D. List of Acronyms

A/D Analog-to-Digital
ASIC Application Specific Integrated Circuits
ASSP Application Specific Standard Products
BAAT Bit-At-A-Time
CAD Computer Aided Design
CD Compact Disc
CLB Configurable Logic Block
Codec Encoder/Decoder
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DSP Digital Signal Processing
DVD Digital Video Disc
FFT Fast Fourier Transform
FPGA Field Programmable Gate Arrays
fps Frames Per Second
HDL Hardware Description Language
IDCT Inverse Discrete Cosine Transform
I/O Inputs/Outputs
JPEG Joint Photographic Experts Group
KLT Karhunen-Loève Transform
LAN Local Area Networks
MOPS Million Operations Per Second
MPEG Moving Picture Experts Group
m-d Multi-Dimensional
1-d One-Dimensional
PC Personal Computers
RDS Replicated Dilated Shuffle Permutations
RGB Red-Green-Blue
RISC Reduced Instruction Set Computers
ROM Read Only Memory
SAR Synthetic Aperture Radar
2-d Two-Dimensional
VHDL Very-high-speed-integrated-circuit Hardware Description Language
VLSI Very Large Scale Integration
WAN Wide Area Networks
WHT Walsh-Hadamard Transform
