UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

Transformable computing for MPEG video coding Chow, Hoi Au Stephen 1996

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-ubc_1997-0018.pdf [ 5.03MB ]
JSON: 831-1.0065323.json
JSON-LD: 831-1.0065323-ld.json
RDF/XML (Pretty): 831-1.0065323-rdf.xml
RDF/JSON: 831-1.0065323-rdf.json
Turtle: 831-1.0065323-turtle.txt
N-Triples: 831-1.0065323-rdf-ntriples.txt
Original Record: 831-1.0065323-source.json
Full Text

Full Text

TRANSFORMABLE COMPUTING FOR MPEG VIDEO CODING by HOI A U S T E P H E N C H O W B. A. Sc. The University of British Columbia, 1994 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLffiD SCffiNCE in THE FACULTY OF GRADUATE STUDffiS DEPARTMENT OF ELECTRICAL ENGINEERING We accept this thesis as conforming to the required standard T H E UNIVERSITY O F BRITISH C O L U M B I A November 1996 © HOI A U S T E P H E N C H O W , 1996 In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for refer-ence and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not allowed without my written permission. Department of Electrical Engineering The University of British Columbia Vancouver, BC Canada November 13, 1996 Abstract The dynamical reconfigurability of FPGAs permits real-time modification of application-specific integrated circuits without complicated fabrication process. The power of reconfigurable hardware has been recently utilized to achieve a new type of high performance computer architec-ture known as Transformable Computing. The real-time reconfigurability and highly customiz-able hardware of transformable computing systems are especially suited to multimedia applications which often involve computations with widely varying processing and communica-tion requirements. In this thesis, an FPGA-based transformable computing system has been developed for digital video (MPEG-1) processing. The thesis proposes a systematic approach for mapping complex applications (such as video processing) onto a transformable coprocessor attached to a standard workstation. Various algorithms in the MPEG-1 video processing standard, such as DCT/IDCT, Huffman coding, quantization and motion estimation, are used as implemen-tation targets. The mapping approach uses a number of performance criteria to develop an efficient transformable coprocessor solution to video processing under hardware, software, and interface constraints. This transformable solution requires careful analysis of data movement frequency and program partitioning strategies. The transformable MPEG-1 coprocessor has been fully developed, interfaced, and tested with MPEG-1 compatible data. The results show that the transformable coprocessor can handle complex digital video coding tasks adequately. Among the different limiting factors to the coprocessor system performance, data transfer between the host system and the coprocessor, and F P G A gate density pose the most significant constraints. Although the performance enhancement achieved varies for different algorithms, the results clearly indicate the promising future of the transformable computing paradigm as a viable and powerful tool for high performance multimedia computing. ii Table of Contents Abstract n List of Tables vii List of Figures viii Acknowledgment x Chapter 1 Introduction 1 1.1 Digital Video Coding 1 1.2 MPEG-1 Video Processing 4 1.3 Transformable Computing for MPEG-1 Video Coding 6 1.4 Thesis Scope and Objectives 7 1.5 Thesis Outline 9 Chapter 2 MPEG Video Coding Standard 10 2.1 Basic MPEG-1 Video Compression Scheme 10 2.1.1 MPEG-1 Frame Sequences 11 2.1.2 Layered MPEG-1 Bit Streams 12 2.1.3 MPEG-1 Constrained Parameter Bit Streams 14 2.2 MPEG-1 Video Coding Algorithms 15 2.2.1 Predictive Coding 16 2.2.2 Transform Coding 17 2.2.3 Subsampling and Quantization 19 2.2.4 Entropy Coding and Run-Length Coding 21 2.3 MPEG-1 Video Coding System Structures 22 2.3.1 MPEG-1 Encoding System 23 iii iv 2.3.2 MPEG-1 Decoding System 24 Chapter 3 FPGA-Based Transformable Computing 26 3.1 Transformable System Design Methodology 27 3.1.1 MPEG-1 Algorithm Mapping 29 3.1.2 FPGA Compilation 29 3.1.3 Transformable Implementation Testing 30 3.2 Transformable System Platforms 31 3.2.1 Hybrid CM-2/Xilinx Prototype 32 3.2.2 The Virtual Computer 33 3.2.3 The Reconfigurable Computer EVCl-s 35 Chapter 4 Implementation of A Transformable MPEG Video Coprocessor 38 4.1 Implementation Platform Parameters 38 4.1.1 FPGA Gate Density 39 4.1.2 Software and Hardware Interfacing 40 4.1.3 Hardware Control Scheme 41 4.1.4 Configuration Downloading Overhead Time < 42 4.1.5 Memory Restriction 43 4.1.6 MPEG-1 Video Decoding Software Structure 44 4.1.7 MPEG-1 Video Encoding Software Structure 44 4.2 System Algorithm Implementations 47 4.2.1 Discrete Cosine Transform and Inverse Discrete Cosine Transform 47 DCT/IDCT Design Approach for Transformable Coprocessor Imple-mentation 49 DCT/IDCT Architecture Optimization 50 V 4.2.2 Motion Vector Reconstruction 54 4.2.3 Motion Estimation 56 4.2.4 Quantization 59 4.2.5 Huffman Coding 63 4.2.6 Frame Reconstruction 64 Chapter 5 Simulation & Design Analysis 67 5.1 System Simulation Analysis 67 5.1.1 Simulation Tools and Procedures 68 5.1.2 System Simulation Results 69 5.2 System Real-Time Testing Analysis 74 5.2.1 Real-Time Testing Procedures and Tools 75 5.2.2 System Testing Results 76 5.3 Remarks on Performance Enhancement 79 5.4 Transformable System Design Considerations 80 5.4.1 Data Transfer Frequency Analysis 80 5.4.2 Algorithm Partition Strategy 83 5.4.3 Pipelining 85 5.4.4 Scalability 86 Chapter 6 Conclusions 88 6.1 Thesis Summary 88 6.2 Future Work 90 References 91 Appendix G. List of Acronyms 96 vi Appendix H. Software MPEG-1 Coding Test 98 Appendix I. Transformable Coprocessor EVCl-s System Specifications 105 Appendix J. List of System Design Schematics 109 List of Tables Table 1.1 Major video image coding standards [3][4] 2 Table 2.1 Six layers in MPEG-1 video bit stream [13] 13 Table 2.2 Frame sizes other than Constrained Parameter Set [13] 15 Table 4.1 Available hardware resources in Xilinx XC4013 FPGA [37] 40 Table 4.2 Configuration data downloading overhead parameter averages 43 Table 5.1 Data transfer ratio and speedups of various algorithms in the transformable MPEG-1 program using a 150-frame video sequence 82 vii List of Figures Figure 1.1 Decoding rate of Berkeley MPEG-1 decoding program in a Sun SPARC10 workstation 5 Figure 2.1 A typical MPEG-1 video frame sequence with all three types of frame 12 Figure 2.2 Layered structure of the MPEG-1 video bit stream 14 Figure 2.3 Motion Estimation by block matching algorithm 17 Figure 2.4 Variance distribution of DCT coefficients on average over a large number of image blocks. Most of the total variance is concentrated around dc D C T coefficient where u = 0, v = 0 19 Figure 2.5 The position of luminance and chrominance samples in MPEG-1 image block...20 Figure 2.6 Symbol merging of the Huffman coding procedure and the resulting code assignment 22 Figure 2.7 A typical structural framework of MPEG-1 video encoding system 24 Figure 2.8 A typical structural framework of MPEG-1 video decoding system 25 Figure 3.1 An example of four-phase computation in a transformable computer 28 Figure 3.2 Transformable MPEG-1 system design flow diagram 28 Figure 3.3 FPGA compilation process for MPEG-1 transformable coprocessor 31 Figure 3.4 CM-2X prototyping system architecture [27] 33 Figure 3.5 The Virtual Computer system architecture [30] 35 Figure 3.6 Block diagram of the EVCl-s communication interface 36 Figure 4.1 EVCl - s SBus slave interface structure [31] 41 Figure 4.2 Memory mapping hardware control technique 42 Figure 4.3 MPEG-1 video decoding program operation model 45 Figure 4.4 MPEG-1 video encoding program operation model 46 Figure 4.5 Row-Column approach for 2-D DCT/IDCT 50 viii Figure 4.6 1 -D DCT/IDCT implementation block diagram 53 Figure 4.7 2-D DCT/IDCT implementation block diagram 53 Figure 4.8 Flow chart of the motion vector reconstruction process 55 Figure 4.9 Motion vector reconstruction process implementation structure 56 Figure 4.10 Illustrating Block-Matching algorithm in motion estimation process 57 Figure 4.11 M A D algorithm implementation structure in motion estimation process 59 Figure 4.12 Inverse quantization process flow [6] 61 Figure 4.13 Inverse quantization process implementation structure 62 Figure 4.14 Huffman decoder implementation structure 64 Figure 4.15 Implementation architecture of frame reconstruction process 66 Figure 5.1 System simulation process flow 69 Figure 5.2 DCT and IDCT implementation time delay among different stage 71 Figure 5.3 Parallel DCT and IDCT circuits timing delays 72 Figure 5.4 System delays for various algorithm blocks 73 Figure 5.5 FPGA (XC4013) resource consumption for various algorithm blocks 74 Figure 5.6 System testing process flow 76 Figure 5.7 Speedup factors for various implemented MPEG algorithms with regular SBus interface and a faster version of SBus interface 78 Figure 5.8 Data movement between the coprocessor and host memory 81 Figure 5.9 Data transfer ratio vs. speedup factor for MPEG-1 video algorithms 83 Figure 5.10 Algorithm partition problem in transformable coprocessor implementation 84 Figure 5.11 Pipelining structure in transformable coprocessor designs 86 Figure 5.12 Scalable design enables transformable coprocessor to form a MIMD computing machine 87 ix Acknowledgment I would like to thank my supervisor, Dr. Alnuweiri for his continuous guidance and support throughout the thesis research process. His valuable advises and patience are sincerely appreciated. Also, I would like to thank Mr. Steve Casselman from Virtual Computer Corporation for his technical support and feedback on using the E V C l - s system. I want to thank my collegeaus and system administrators of the VLSI Lab for their constructive comments and suggestions. Finally, I want to dedicate this thesis to my parents, Jocelyn and Jonathan who have provided me with enormous encouragement and support. Chapter 1 Introduction Digital video has the potential to become a very important component of an integrated multimedia computing environment. Efforts for full digitization of video communication systems, including High Definition Television (HDTV) broadcasting, have recently intensified, and it seems inevitable that all storage and transport mechanisms to television receivers and computer displays will become dominated by digital technology. However, for digital video to be widely adopted, digital images must be available at a sufficiently high rate, with acceptable resolution and in real-time. So as to not underestimate these requirements, it is worth mentioning that a low resolution (288 pixel x 352 pixel) color digital video sequence running at 30 frame per second requires the video processing system to be able to handle 73 mega bits of uncompressed data per second. Therefore, digital video compression techniques must be employed to provide the possibility of storing and transmitting the vast amount of data necessary to represent digital images and video in an efficient and robust way. 1.1 Digital Video Coding Digital video compression techniques can effectively reduce the data rate by exploiting the inherent redundancy of video signals, and by balancing the quality of the output signals to emphasize what is important to the human visual system. Any information which can be extracted using the statistical dependencies between the picture elements is redundant and can be eliminated. Any information below a specific picture quality threshold is irrelevant to the receiver and need not be transmitted. Redundancy reduction is performed by transformation of images to other representations with reduced statistical dependencies. This can be performed by decorrela-tion with transform methods such as the discrete cosine transform (DCT) or predictive coding. In l Chapter 1 Introduction 2 most cases, irrelevancy reduction is achieved by adapting quantization to visual properties. Specific video coding schemes are dependent on the applications. The requirements on picture quality, characteristics of communication channels and storage media have strong influence on the applicable scheme. For example, T V distribution has preference for high picture quality where as video phone has preference for world wide communications with standardized low bit rate channels. According to the requirements of the applications, different image formats and coding schemes have been specified. Table 1.1 shows some major coding standards and their characteristics. Table 1.1 Major video image coding standards [3] [4]. Coding Scheme Typical Application Coded Hit Kate JPEG Continuous-Tone Still Image Coding (Photo CD) 0.25 - 2.25 b/pel H.261 ISDN Visual Telephony Applications px64kb/s, l</><30 MPEG-1 Moving Pictures and Associated Digital Audio (CD-ROM, CD-I) 1.5 Mb/s MPEG-2 Moving Pictures and Associated Digital Audio (Video Distribution and Contribution) 4-15Mb/s CC1R 723 Digital Coding of Component Television Signals for Contribution-Quality Applications 34 - 45 Mb/s CCIR 721 Transmission of Component-Coded Digital Television Signals for Contribution-Quality Applications 140 Mb/s H. 120 Codecs for Videoconferencing Using Primary Digital Group Transmission 1.5-2 Mb/s MPEG-4 Coding of Video at Very Low Bit Rates 5 - 64 kb/s The coding standards have a large variety of parameters in order to cover the requirements of a wide range of applications. In particular the M P E G standard is a generic approach with Chapter 1 Introduction 3 different profiles and levels. Generic means that MPEG standard is independent of any one partic-ular application, however it does not mean that it ignores the requirements of the applications. M P E G possesses features that make it universally acceptable, such as its incremental implementa-tion approach, which implies that not all the features need to be used all the time for all applica-tions, as this would result in dramatic inefficiency. Currently, M P E G is the most widely accepted video coding standard, and is produced by the International Standard Organization (ISO). There are over 50 industrial manufacturers worldwide who are selling M P E G standard compatible products. The first phase of the M P E G standard is called MPEG-1 and it employs video compression techniques for symmetric applica-tions (e.g. video conferencing), asymmetric applications (e.g. video-on-demand) and digital storage media (e.g. CD-ROM). The second phase of the MPEG standard is called MPEG-2 which is a superset of the MPEG-1 standard, including many new features such as interlaced video and scalable video. Some of the basic functions that are shared by both standards are random access of video frames, fast forward/backward searches, reverse playback and audio-visual synchroniza-tion. The quality of video compressed with the MPEG-1 algorithm at a rate of about 1.2 Mbits/ second is comparable to the quality of VHS recording [5]. The spatial resolution is limited to 360 samples per video line and the video signal at the input of the source coder has 30 frame/second non-interlaced. In order to guarantee interoperability of equipment using M P E G without forcing the equipment manufacturers to build over-designed systems, a special set of constrained parame-ters are defined, including horizontal size (h < 720pixels), vertical size (v < 576 pixels), picture rate (fps < 30 frames/second), bit rate (rate < 1.86Mbits/second), etc. In this thesis, the MPEG-1 video standard will be the primary implementation target. Therefore, in the rest of the document, all technical details refer to the MPEG-1 video standard. Chapter 1 Introduction 4 1.2 MPEG-1 Video Processing The MPEG-1 video compression process must be performed by powerful digital video signal processors in order to achieve high image quality and frame rate. A typical video compres-sion algorithm requires the video processor to have an operation at speed of around 1 GOPS (giga-operations per second) [1]. Currently, there are two approaches to handle MPEG-1 video coding process: software coding programs that can be executed in different computer platforms and dedicated VLSI hardware processing systems. The software MPEG-1 coding package consists of an encoder'program and a decoder program. Due to the asymmetric structure of M P E G video standard, the encoder program is more complex than decoder program and hence the encoding process requires much more resources compared to the decoding process. A number of different MPEG-1 coding program packages are available from commercial software companies and university research laboratories. Among these, the package from the Plateau Multimedia Research Group at the University of California, Berkeley, was chosen for perfor-mance tests on a Sun SPARC 10 workstation. Figure 1.1 shows the decoding rate of the Berkeley MPEG-1 decoder program when different number of copies of the decoder program are running simultaneously on the workstation. From Figure 1.1, it is obvious that the decoder is far slower than the normal movie frame rate (24 - 30 frames/second). A similar encoder test shows that the average encoding rate is 0.48 frame/second for a single processor Sun SPARC 10 workstation (refer to Appendix H for the detailed testing results). In [7], a non-multimedia-enhanced HP PA715/50 processor system gives 9.9 frame/second decoding rate. Therefore, although software MPEG-1 programs are versatile and available at a very low cost, they cannot provide a real-time coding process without additional hardware support. Chapter 1 Introduction 5 Decoding Speed (Frame/sec) of Berkeley M P E G Player x : • : Frame Size: 3 5 2 x 2 4 0 Frame Size: 1 6 0 x 1 2 0 21 I I i I I J 1 1.5 2 2 . 5 3 3 . 5 4 No. of M P E G Player > Figure 1.1 Decoding rate of Berkeley MPEG-1 decoding program running in a Sun SPARC10 workstation. The enormous computational requirements of the M P E G coding process makes dedicated VLSI video signal processor a more plausible choice for real-time MPEG video processing. VLSI video processing circuits typically employ parallel circuits where multiple arithmetic units are activated simultaneously. Numerous VLSI circuit manufacturers have announced their latest MPEG-1 decoding and encoding products and some of which can achieve real-time (30 frames/ second) video decoding and encoding [8] [12]. Although VLSI circuits offer higher operational speeds, the associated development time and cost are very high as well. A single MPEG-1 encoder chip can be priced at over $1,200 while the price of a decoder chip is much lower, at around $100 each [9]. A complete MPEG-1 decoding and encoding system can cost from $20,000 to $65,000 [10][11]. Dedicated VLSI circuits also lack the flexibility to accommodate the contin-uous evolution of video compression algorithms and standards, in addition to requiring a long Chapter I Introduction 6 design turn-around time. Since the video coding process needs a flexible system that allows the incremental expansion of the standard, in our view some type of user programmable VLSI hardware can provide a practical, cost-effective solution for video processing applications. Such solution can take advantage of state-of-art reconfigurable logic and rapid prototyping C A D environments. 1.3 Transformable Computing for MPEG-1 Video Coding Currently, a new type of programmable integrated circuit called Field-Programmable Gate Arrays (FPGAs) is gradually establishing its effectiveness in many computing fields. Current FPGAs can integrate over 25,000 available gates on a single programmable chip which provides a versatile platform for embedding applications. For example, image and video processing require the use of VLSI to implement transformation and decoding algorithms to achieve significant gains in processing speeds. FPGA reconfigurability can be utilized for image and video processing in a way that provides efficient hardware adaptation to the different image and video processing stages. Thus, the FPGA platform significantly reduces hardware development costs and turn-around time and at the same time provides sufficient processing speed gains. The power of F P G A architecture is pushed to higher levels through the concept of Transformable Computing. The reconfigurability of an FPGA allows engineers to design many different VLSI circuits and dynamically transfer the circuits into the FPGA in sequence and under software control. This technique is also known as Hardware Object Programming [2]. Also the dynamical reconfigurability of the system permits different portions of the algorithm to be downloaded in the same program. The transformable coprocessor approach for MPEG-1 coding process provides a trade-off between the slow but flexible performance of software coding system Chapter 1 Introduction 7 and the fast but inflexible hardware coding system. Since the transformable video coding system requires a set of software programs to form the basis of the hardware driver, modification to existing software MPEG-1 programs will greatly reduce the system development complexity and time. The freely distributed MPEG-1 software lowers the cost of the system. Although MPEG-2 is a newer phase of the M P E G standard, there is no freely available MPEG-2 software source code at the beginning of the implementation and MPEG-1 software is in a much more mature stage. It is also justified that the complexity of MPEG-1 video standard is sufficient enough to demonstrate the functionality of the transformable coprocessor. 1.4 Thesis Scope and Objectives In this thesis, only MPEG-1 digital video coding standard is considered for implementa-tion on the FPGA-based transformable coprocessor mainly because MPEG-1 is widely used in world of digital video and a software MPEG-1 video coding package is readily available in the form of freeware [36]. An FPGA-based transformable coprocessor is employed for MPEG-1 video processing. MPEG-1 video coding involves a number of key operations: Discrete Cosine Transform (DCT), Inverse Discrete Cosine Transform (IDCT), quantization, Huffman coding, motion compensation and frame reconstruction. Each operation uses a particular algorithm which can be mapped onto the transformable coprocessor. The goal of the implementation is to move the computationally-intensive portion of the algorithm to the task specific coprocessor which can greatly accelerate the processing rate. This thesis will present a fully functional design of such transformable coprocessor system with detailed performance evaluation of the system. The whole MPEG-1 coding standards is composed of many different algorithm blocks. However, not all algorithm blocks in the MPEG-1 coding standard are implemented in the Chapter 1 Introduction 8 transformable coprocessor with enhanced performance. Indeed, some performance degradation can result from algorithms that require extensive data transfers or algorithms which cannot be fitted to the FPGA. Hence, a major goal of this thesis is to identify which algorithm blocks can be effectively implemented and what factors cause the unfavorable conditions in the transformable coprocessor performance. The design evaluation portion of the thesis includes investigation of speedup factors, hardware complexity, and efficiency of the implemented transformable MPEG-1 coprocessor. The implementation is based on a single FPGA transformable coprocessor with an approximate gate density of 13,000. The main objective of this thesis is to implement and evaluate the FPGA-based transform-able MPEG-1 video coding system. This objective is achieved in the following manner. The thesis first studies different algorithms that can be used in MPEG-1 video coding and determines the possibility of transformable implementation for those algorithms. Then a feasible architecture for each suitable MPEG-1 algorithm block is proposed for the implementation in transformable coprocessor. The algorithm block architectures accommodate different parameters that affect the performance of the implemented system. The proposed architectures are then translated into logic circuits using the FPGA compilation technology, with other supporting integrated circuit design tools. The system performance modelling was done at two levels: system logic simulation using Verilog X L ™ logic simulator and actual coding tests using different MPEG-1 files. A number of different system parameters are analyzed to evaluate the effectiveness of implementing complex video processing algorithms in an FPGA-based transformable coprocessor system. Finally, the thesis will identify what can be done in the future to extend the current design and FPGA-based transformable system architectures. Chapter 1 Introduction 9 1.5 Thesis Outline The thesis is organized as follows. Chapter 2 provides an overview of digital video standards and algorithms with emphasis on MPEG-1 video coding process, and a review of the structures of the MPEG-1 decoder and encoder systems. Chapter 3 introduces the concept of transformable computing. A survey of current experimental transformable platforms is described in this chapter. The transformable computing system design methodology is also outlined in Chapter 3, and a brief description of the transformable coprocessor system used in this thesis is included as well. Chapter 4 records the implementation process of the transformable MPEG-1 video coding system. The hardware and software implementation platforms of the MPEG-1 system are outlined. This chapter describes some key factors that are considered in the selection of an MPEG-1 algorithm for implementation. Chapter 4 also highlights some design details for each implemented MPEG-1 algorithm block, such as algorithm modifications, the algorithm block structure diagrams and special design features. Chapter 5 provides the analysis of implemented transformable MPEG-1 coding system. The simulation tools and testing procedures are specified. This chapter presents the results from testing and simulations of the implemented system with different parameter settings. Chapter 5 also explains the effect of system parameters on the transformable MPEG-1 coding system design. Chapter 6 concludes the thesis with a brief summary and some suggestions for future works. Chapter 2 MPEG Video Coding Standard The Moving Pictures Experts Group (MPEG) is a group of video experts that meet under the International Standards Organization (ISO) to generate standards for digital video and audio compression. The official name of the MPEG-1 standard is ISO/IEC JTC1 SC29WG11 [6]. M P E G activities cover more than video compression, since the compression of the associated audio and the issue of audio-visual synchronization cannot be worked out independent of the video compression. MPEG-Video activities address the compression of the video signal, whereas MPEG-Audio activities address the compression of a digital audio signal at the rate of 64, 128 and 192 Kbits/second per channel, and MPEG-System activities address the issue of synchroniza-tion and multiplexing of multiple compressed audio and video bit streams [6]. In this thesis, only the MPEG-Video portion of the standard is investigated in detail. This chapter provides a compre-hensive overview of MPEG-1 video encoding and decoding systems, with the purpose of introducing the reader to the basic requirements of a complete M P E G decoding and encoding system. 2.1 Basic MPEG-1 Video Compression Scheme The basic scheme is to predict motion from frame to frame in the temporal domain, and' then use the Discrete Cosine Transform (DCT) to reduce the redundancy in the spatial domain. The DCTs are computed over 8x8 image blocks, while the motion prediction is performed in the luminance (Y) channel on 16x16 image blocks. In other words, given a 16x16 block in the current frame, a close match to that block in a previous or a future frame is located. The D C T coefficients are quantified, and the quantization can change for every "macroblock" (a macroblock consists of one 16x16 of luminance block and two 8x8 chrominance blocks). The resulting D C T coefficients, 10 Chapter 2 MPEG Video Coding Standard 11 motion vectors, and quantization parameters are Huffman coded using fixed tables. M P E G offers a syntax that can satisfy most applications. The syntax was designed to obtain an optimal balance between cost and quality, or in other words, between computational complexity (VLSI area, memory size, and bandwidth) and compaction (compression) efficiency. 2.1.1 MPEG-1 Frame Sequences There are three types of coded frames [6] which include Intra (I) frames, Predicted (P) frames and Bi-directional (B) frames: • Intra Frames: are frames that are simply coded as still images. Intra frames provide en-try points into the MPEG frame sequence for random access. Therefore, they can only be moderately compressed. • Predicted Frames: are frames that are predicted from the most recently reconstructed I or P frame. Each macroblock in a P frame comes with a vector and difference DCT co-efficients for a close match in the last I or P frame. P frames encoded with reference to a past frame can be used as a reference for future P frames. P frames can be subjected to a fairly high degree of compression. • Bi-directional Frames: are predicted from the closest two Intra frames or Predicted frames, one in the past and one in the future. A matching block is searched among those frames and the forward vector, the backward vector, or their average can be used as the motion vector for the block. Bi-directional pictures provide the highest degree of compression but require both a past and a future reference in order to be encoded. Bi-directional frames are never used for references. Figure 2.1 shows a typical MPEG-1 video frame sequence. There are eight P or B frames in Chapter 2 MPEG Video Coding Standard 12 between two I frames, therefore the random access point of the frame sequence in Figure 2.1 is 0.33 seconds if the frame sequence is played at 30 frames/second rate. Forward Prediction J'-Frame Bidirectional Prediction Figure 2.1 A typical MPEG-1 video frame sequence with all three types of frames. 2.1.2 Layered MPEG-1 Bit Streams The goal of a layered structure is to separate entities in a MPEG-1 bit stream that are logically distinct, prevent ambiguity and facilitate the decoding process. The separation of layers gives the following properties to the MPEG-1 bit stream: • Genericity — The syntax allows for provision of many application-specific features without penalizing applications that do not need those features. • Flexibility — MPEG-1 syntax includes a large number of parameters defined in the Video Sequence Header. So while MPEG-1 is typically focused on a bit rate of about 1.5 Mbits/second and a resolution of 360 pels/line, higher resolution and higher bit rates are not excluded. Chapter 2 MPEG Video Coding Standard 13 • Efficiency — A layered structure provides an efficient management of the overhead in-formation such as displacement field, quantizer step size and type of predictor or inter-polator. The MPEG-1 video bit stream contains six layers. Each layer supports a specific function: either a signal processing function (DCT, Motion Compression) or a logical function (Resynchro-nization, Random Access Point). Table 2.1 shows the six layers in MPEG-1 video bit stream. Table 2.1 Six Layers in MPEG-1 video bit stream [13]. S j ntax Layer function Sequence Layer Random Access Unit: Context Group ol' Picture (GOP) Layer Random Access Unit: Video Coding Picture Layer Primary Coding Unit Slice Layer Resynchronization Unit Microbloek Layer Motion Compensation Unit Block Layer DCT Unit Any sequence of binary digits consistent with the MPEG-1 syntax defines an MPEG-1 bit stream. In addition, the bit stream must be decodable with a buffer of an appropriate size. MPEG-1 standard only defines the bit stream syntax and the decoding process, and therefore a designer is entirely free to implement any quality level of decoders or encoders, as long as these units comply to the MPEG-1 standard. A valid MPEG-1 encoder should generate a MPEG-1 file with the structure shown at Figure 2.2, and a valid MPEG-1 decoder should be able to retrieve data from this MPEG-1 file. Chapter 2 MPEG Video Coding Standard 14 MPEG-1 Video Bit Stream Sequence Layer Header Group of Picture Header Picture Header Slice Header Macro Block Macro Slice Macro Block Header Block Macro Block Slice Header Picture Header Slice Header Macro Block Macro Block Macro Block Slice Header Macro Block Sequence Header Start-of- Horizontal Vertical Aspect Picture Bit Marker Buffer sequcDce Size Size Ratio Rate Rate Bit Size code Constrained, Parameter Rag Intra QuanUzatio|i Matrix Flag Quan User i Bits Matrix 32 bits 12 bits 12 bits 4 bits IS bits i bit 1 bit MxSbits 1 bit Picture Header 64x8bite User Defined GOP.sti^odHourj Code Franr Sjlftcoiib bit Ptct. GOP fcStartl U s c t code] Flag 32bits ibit 5bite 6bite 1 bit 6bite 6bite2bite32bite UBerdefined Slice Start Code Slice Vertical Position Slice Quantizati< Scale „ User BiU Pict Start Code Tempt KcL cca] I I 'Type ™*- Pkel Valu Flag Vect+r Flag Back; «tra bit info 32b 10b 3b 16b l b 3b 3 b fifnea12 f" 8 pels 8 pels user Bits Define 8 pels 8 pels Cr Block CbBlodk 4 Luminance Blocks MPEG-1 Video Bit Stream Syntax Figure 2.2 Layered structure of the MPEG-1 video bit stream. 2.1.3 MPEG-1 Constrained Parameter Bit Streams Constrained Parameter Bit-streams (CPB) are MPEG-1 bit streams with a limited set of sampling and bit rate parameters designed to normalize computational complexity, buffer size, and memory bandwidth while still address the widest possible range of applications. CPB limits Chapter 2 MPEG Video Coding Standard 15 video to 396 macroblocks (101,376 pixels) per frame if the frame rate is less than or equal to 25 frames per second (fps), and 330 macroblocks (84,480 pixels) per frame if the frame rate is less or equal to 30 fps. Therefore, MPEG video is typically coded at the dimensions of (352 x 240 x 30) fps or (352 x 288 x 25) fps. A l l MPEG-1 compatible decoders are expected to be capable of decoding at least a CPB bit stream, bit streams beyond the CPB parameters are entirely optional. Table 2.2 shows other frame parameter sets. Table 2.2 Frames sizes other than Constrained Parameter Set [13]. Format Video Sequence Parameters Compressed Kit Kate QCIF 172 x 144 @30Hz 0.2 - 0.8 Mbits/s CPB 352x240@30Hz 1.2-3Mbits/s CCIR 601 720x486 @30Hz 5-10 Mbits/s EDTV 960x486 @30Hz 7-15 Mbits/s HDTV 1920x1080 @30Hz 20 - 40 Mbits/s 2.2 MPEG-1 Video Coding Algorithms M P E G compression requires a delicate balance between intraframe and interframe coding, and between recursive and nonrecursive temporal redundancy reduction in order to achieve very high compression and random access. The M P E G algorithm uses two techniques: block-based motion compensation for the reduction of the temporal redundancy and transform domain (DCT-based) based compression for reduction of spatial redundancy. Motion-compensation techniques are applied with both causal (pure predictive coding) and noncausal predictive coding (interpola-tive coding). The remaining signal (prediction error) is further compressed with spatial redundancy reduction (DCT). The information relative to motion is based on 16x16 blocks and is transmitted together with the spatial information. Chapter 2 MPEG Video Coding Standard 16 In order to further compress the signal, M P E G uses block-based transform coding techniques with a combination of visually weighted scalar quantization and run-length coding. The technique to perform intraframe compression with D C T essentially involves three stages: computation of the transform coefficients, quantization of the transform coefficients, and conver-sion of the transform coefficients into {run, amplitude} pairs after reorganization of the data in zigzag scanning order. Discrete Cosine Transform has inputs in the range [-255, 255] and output signals in the range [-2048, 2047], providing enough accuracy even for the finest quantizer. Since M P E G has both intracoded pictures and differentially coded pictures, two different quantizers are required to quantize the D C T coefficients to achieve the best result. The quantizer can also be modified on a block-by-block basis if the image content makes it necessary. Entropy coding can further increases the compression ratio. A Huffman-like table for the D C T coefficients is used to code events corresponding to a {run, amplitude} pair. Only those codes with relatively high probabilities of occurrence are coded with a variable-length code. The less-likely events are coded with an escape symbol followed by fixed length codes to avoid extremely long code words and reduce the cost of implementation. 2.2.1 Predictive Coding Among the various techniques that exploit the temporal redundancy of video signals, the most widely used one is motion-compensated prediction. Motion-compensated prediction assumes that the current picture can be modeled as a translation of the picture at some previous time, while the amplitude and the direction of the displacement need not to be the same everywhere in the picture. Motion-compensated prediction is a key feature of MPEG. The signal to be reconstructed by prediction is obtained by adding a correction term to a combination of a past and a future frame reference. Motion Estimation is used to extract motion information from a Chapter 2 MPEG Video Coding Standard 17 video sequence. The M P E G syntax specifies the motion information that can be represented by one or two motion vectors per 16 x 16 subblock of the picture depending on the type of motion compensation used: forward-predicted, backward-predicted, or average. Block-matching techniques are likely to be used, in which the motion vector is obtained by minimizing a cost function measuring the mismatch between a block and each predictor candidate. Figure 2.3 shows the motion estimation concept in the predictive coding. Frame t -1 p t-i c.y) •^^^ Motion Vector Search Window Motion Estimation by Block Matching Frame t Figure 2.3 Motion estimation by block matching algorithm. 2.2.2 Transform Coding The purpose of transform coding is to decorrelate the picture content and to encode the transform coefficients rather than the original pixels of the image. The input images are split into disjoint blocks (denoted b) each of size N x 7V pixels. The transform can be represented as a matrix Chapter 2 MPEG Video Coding Standard 18 operation using an /V x /Vtransform coefficients matrix c based on a linear, separable, and unitary T T forward transform c = AbA , where A denotes the transpose of the transformation matrix A. The transformation is reversible since the original NxN block of pixel b can be reconstructed T using a linear and separable inverse transform b = A cA. Among many possible alternatives, the Discrete Cosine Transform (DCT) has become the most successful transform for still image and video coding. In fact, D C T based implementations are used in most image and video coding standards due to their high decorrelation performance and the availability of fast DCT algorithms suitable for real-time implementation. The major objective of transform coding is to make as many transform coefficients as possible small enough so as to render them insignificant in terms of statistical and subjective measures. Such small coefficients need not be coded for transmission which results in more compression. At the same time, it is desirable to minimize statistical dependencies between coefficients with the aim to reduce the amount of bits needed to encode the remaining coefficients. Figure 2.4 presents the variance (energy) of intraframe DCT coefficients based on a simple statis-tical model assumption [3]. The variance for each coefficient represents the variability of each particular coefficient averaged over a large number of frames. Coefficients with small variances are less significant for the reconstruction of image blocks than coefficients with large variances. It should be clear that from Figure 2.4, on average, only a small number of D C T coefficients need to be transmitted to the receiver to obtain a valuable approximate reconstruction of image blocks. Moreover, the most significant DCT coefficients are concentrated around the upper left corner and the significance of the coefficients decays with increased distance, which implies that higher DCT coefficients are less important for reconstruction than lower ones. Chapter 2 MPEG Video Coding Standard 19 Variance Distribution of DCT Coefficients Figure 2.4 Variance distribution of DCT coefficients on average over a large number of image blocks. Most of the total variance is concentrated around dc DCT coefficient where u = 0, v = 0. 2.2.3 Subsampling and Quantization The basic premise of subsampling is to reduce the dimension of the input video in the horizontal and/or the vertical dimension so that the number of pixels to be coded can be reduced. Some video applications also subsample pictures in the temporal direction to reduce frame rate prior to coding. At the receiver end, the decoded images are interpolated for display. This technique may be considered as one of the most elementary compression techniques which also utilizes specific physiological characteristics of the human eye to removes subjective redundan-cies contained in the video data. This concept is also used to explore subjective redundancies contained in chrominance data, i.e., the human eye is more sensitive to changes in brightness than in chromaticity changes. Therefore, MPEG-1 standard first divides the image into Y U V components (one luminance (Y) and two chrominance (U) and (V) components) [6]. The chromi-Chapter 2 MPEG Video Coding Standard 20 nance components are then subsampled relative to the luminance component with a Y:U: V ratio of 4:1:1. Figure 2.5 shows the position of luminance and chrominance samples. x x x x x x x x OIO OIO OIO OIO x x x x x x x x X X OIO X x X X OIO X X X X X X X x OIO OIO OIO X X x X X X X ' X X \ X 1 OIO i OIO i OIO X Luminance Sample OIO Chrominance Cr & Cb Samples X X X X X X OIO OIO OIO X X X X X X X X OIO X X Y & Cr & Cb sample ratio is 4:1:1 Luminance & Chrominance Sample Locations Figure 2.5 The position of luminance and chrominance samples in MPEG-1 image block. Quantization provides additional compression by increasing distortion. This can be achieved in a balanced manner by exploring the properties of subjective perception of the human visual system. MPEG-1 uses a quantization matrix to provide visually weighted quantization. Also the signal from intra-coded blocks is quantized differently from predicted blocks [6]. Furthermore, each marcoblock can have all of its associated DCT coefficients quantized more or less coarsely, depending on the importance of the local area of the image. The actual rule for selecting the adaptive quantization parameter is important for the performance of the overall algorithm. Chapter! MPEG Video Coding Standard 21 2.2 A Entropy Coding and Run-Length Coding The pixel color values of digital video frames are usually prequantized to fixed-length words with typically 8 or 10 bits of accuracy per color component. However, for most video material it can be assumed that not all quantized color values occur equally likely within the video scenes. We can reduce the average number of bits per word if color values having lower probabil-ities are assigned with longer code words, whereas values having higher probability are assigned shorter code words. This method is called variable word-length coding or entropy coding, and it forms one of the most basic elements of today's video coding standards, especially in combina-tion with transform domain or predictive coding techniques [15]. If the resulting code words are concatenated to form a stream of binary bits, then correct decoding by a receiver is possible if the code words are uniquely decipherable. Among the several possibilities of constructing sets of uniquely decipherable codewords, Huffman coding has found widespread applications and is used in all standard coding schemes. Figure 2.6 shows an example of how Huffman coding works. The concept of entropy coding can be combined with a run-length coding to achieve further data reduction [16]. This method is useful if consecutive pixels along a scan line are highly correlated. With run-length coding, one codeword is allocated to a pair of input values (run, length), such that the number (run) of consecutive pixels along a scan line with the same color value (length) can be encoded and transmitted as one code word only. Chapter 2 MPEG Video Coding Standard 22 Source Symbols: {a, b, c, d, e, f, g} Corresponding Probability: {0.5, 0.2, 0.1, 0.1, 0.07, 0.02, 0.01} Figure 2.6 Symbol merging of the Huffman coding procedure and the resulting code assignment. 2.3 MPEG-1 Video Coding System Structures In earlier M P E G system developments, each of the basic M P E G functions was implemented on a different chip and a chipset was needed for creating a complete system for M P E G video decoding and encoding. Recently, several single-chip M P E G decoder became available from a number of VLSI manufacturers [17][18][19][20]. Meanwhile, programmable processors for an M P E G coding system were also proposed [21]. The advantage of programmable architecture is increased flexibility, since changes in operational requirements, e.g., due to changes in algorithms or an extension of the target application field, can be handled by software Chapter 2 MPEG Video Coding Standard 23 modifications only. Thus, a generally cost-intensive redesign of the hardware can be avoided. This type of programmable system normally employs general-purpose Digital Signal Processing (DSP) chips. On the other end, completely software-based MPEG decoders are also available as a low cost solution. Regardless the different implementation platforms, most of the M P E G video coding system have similar functional units including a Variable Length Coding (VLC) unit, Discrete Cosine Transform (DCT/IDCT) unit, a Motion Compensation (MC) unit, a Motion Estimation (ME) unit, a Quantization (Q) unit, an M P E G Header Processing (HP) unit and a Reference Frame Storage unit. 2.3.1 MPEG-1 Encoding System The asymmetry of the MPEG-1 video processing system makes the encoding system much more complex than the decoding system. Most of the encoding time is spent in the motion estimation and compensation processes. Depending on the sophistication of the motion estimation algorithm, these two processes can take up to 50% of the total encoding time. The D C T is the next computationally intensive task which takes 14% of the encoding time. Other function blocks such as Quantization, Variable Length Coding and MPEG Header Generation are relatively less signif-icant in the encoding time distribution. Figure 2.7 shows the basic structure of an MPEG-1 video encoder system. Another important element in the encoder is the Frame Storage Memory Bank. It stores all previous and future reference frames that are needed for predictive mode encoding. Chapter 2 MPEG Video Coding Standard 24 Some Fran is MPEG Header Generation Motion Estimation Frame Memory Frames Quantization Compensated Frame Motion Compensation Variable Length Coding MPEG Frame Pacidng MPEG Video Frames MPEG Video Encoder Figure 2.7 A typical structural framework of the MPEG-1 video encoding system. 2.3.2 MPEG-1 Decoding System MPEG-1 video decoding system has some of the same functional blocks that can be found in the encoding system. However, these functional blocks perform exactly the reverse functions of their counterparts in the encoder. For example, the DCT block in encoding system becomes IDCT block in decoding system or the Quantization block in the encoder becomes an Inverse Quantiza-tion block in the decoder. Figure 2.8 shows the basic structural framework of the MPEG-1 video decoding system. The most computationally demanding task in the decoding system is the IDCT process. The computations required for motion vector reconstruction and frame reconstruction processes are much less demanding in the decoding system than in the encoding system, mainly because there is no block matching procedure in the decoding system. The IDCT process alone takes up about 35% of the total decoding time while the frame displaying process takes up another Chapter 2 MPEG Video Coding Standard 25 30% of the decoding time. From a previous study (refer to Appendix H), the time required to decode one MPEG-1 video frame is about 1/10 the time required for encoding the same frame. It is very advantageous to keep the decoding system as simple as possible because it is much more widely used than the encoding system, which is mostly found in broadcasting studio or profes-sional video image laboratories. However, this trend may change as the encoding system cost and simplicity of usage are enhanced. Header Information MPEG file -Header Processing Memory Bank MacroBlock Data Recontructed Motion Vector Motion Vector Motion Vector Reconstruction Encoded Motion Vector Huffman Decoding Decoding Table Inv-Q Inv-Q Scales Frame Reconstruction I Reconstructed DCT coefficients IDCT Dithering and Displaying MPEG Decoder System Figure 2.8 A typical structural framework of the MPEG-1 video decoding system. The chapter provided a tutorial overview of M P E G video processing system structure and implementational requirements. The following chapters will present a particular implementation of MPEG-1 using emerging transformable computing platform. Chapter 3 FPGA-Based Transformable Computing Transformable computing refers to the process of dynamically reconfiguring field-programmable custom-computing machines to adapt optimally to varying algorithms and operat-ing conditions [22]. The main elements of this exciting new computing platform are reconfig-urable FPGAs. An FPGA normally contains circuit configurations for a particular algorithm as well as the host-bus system interface. The transformable system also includes the configuration memory which can be pre-loaded with a library of optimized circuit configurations. These config-urations can be downloaded to the FPGAs during program executions. Al l transformable computer designs have two goals in common: high performance and dynamic reconfigurability. These designs are seen to vary widely in form, function, and program-ming methodology. But almost all transformable computers currently rely on mature system technologies, i.e. workstations or PCs to provide a front end. Basic programming requires hardware to be specified through a schematic or hardware description language, then synthesized into the target architecture. Most high-performance applications use extensive pipelining and exploit the concurrency of the underlying logic circuit. This process can achieve dramatic perfor-mance gains for the algorithms they handle. Many transformable computer systems are based on fined-grained parallelism and use the systolic array approach [23] [24] [25]. Modification of the processing elements are relatively simple compared to conventional hardware design methods, because the systolic elements can be redefined as required for the specific physical system. A particularly useful, although high simplified, view of a transformable computer is depicted in Figure 3.1 [22]. The transformable computer can be viewed as a coprocessor that takes on different configurations, each optimized for a specific task. Two essential features of 26 Chapter 3 FPGA-Based Transformable Computing 27 transformable computers are that they can be reconfigured dynamically, and that the reconfigura-tion process involves logic as well as routing functions. The example in Figure 3.1 shows a four-phase computation that involves the execution of a number of sub-tasks (A, C, D and E). The scheduling of tasks is guided mainly by data dependence, resource availability, task priority, and a number of other factors. From Figure 3.1, we also notice that sub-tasks A and D can execute concurrently, so do sub-tasks C and D. Being able to support concurrency at the task level provides transformable computers with multiprocessing capability not normally supported by standard microprocessors. Transformable computer can also provide each task with customized hardware configurations optimized for the particular requirements of the task. Additionally, because the hardware is completely reprogrammable, transformable computer can support contin-uously evolving designs for various tasks. These factors combined make transformable comput-ing a very powerful processing concept that needs further exploitation. 3.1 Transformable System Design Methodology Transformable computing implementations are characterized by strong interaction between software and hardware blocks. The design methodology requires tighter binding between both VLSI implementation and application software. The VLSI circuitry are integrated into software M P E G program by modifying original C program, so that the time consuming tasks are executed by fast hardware. Figure 3.2 shows the transformable computing system implementation process flow. er 3 FPGA-Based Transformable Computing Config A Config B Config C Config D_ Config E Configuratio Memory Load Config. A Configurations for 5 sub-tasks in an algorithm Load Config. E Phase 1 Phase 2 Phase 3 Phase 4 Multi-phase Computation in Transformable Computers Figure 3.1 An example of four-phase computation in a transformable computer. MPEG Function Block Algorithm Analysis Schematic/ Design Netlist Generation Verilog Simulation FPGA Compilation Software Driver Program Functional Unit Testing Modified MPEG Software Integrated Testing Transformable MPEG Function Block Transformable M P E G Coder Design H o w Figure 3.2 Transformable MPEG-1 system design flow diagram. Chapter 3 FPGA-Based Transformable Computing 29 3.1.1 MPEG-1 Algorithm Mapping The process of mapping MPEG-1 video processing algorithms onto transformable hardware begins with detailed program analysis. The purpose of this step is to extract significant information from the original C program pertaining to data transfer frequency, computation frequency, and data dependencies. The accumulation process of each function is gathered in order to estimate the effective performance gain when those functions are mapped onto a transformable computing engine. The second step in the mapping process is developing schematic designs of the sub-algorithms determined suitable for hardware implementation. Alternatively, a hardware description language (HDL) can be used depending the structure of the algorithms. In this thesis, most of the designs were developed in schematic format which is stored in the netlist files since there no available compiler to translate H D L specifications to FPGA netlist at the beginning of the thesis. The relatively small size of each design and the regularity of the involved structures give an advantage to the schematic approach. Currently, schematic capturing is more stable than H D L synthesis and can produce more optimized circuit netlists. 3.1.2 FPGA Compilation FPGA compilation is a process of translating the user system design to FPGA configura-tion data that can be used to program the FPGA. A number of C A D tool programs are needed to complete the process as shown in Figure 3.3. The whole process starts with the user design schematics. HDLs can be used as design input as well, but a corresponding translator program (such as Xilinx Netlist Format (XNF) translator) is needed. Most of the designs in this thesis have been created in schematic format, and so a schematic to X N F file translation is performed. After the design schemes are created, a first layer simulation can be carried out by using Verilog X L Chapter 3 FPGA-Based Transformable Computing 30 simulator program or other simulation program, then the design X N F files undergo the mapping process as shown in Figure 3.3. The mapping process will partition the entire circuit schematic into several Configurable Logic Blocks (CLBs) which are the building blocks of Xilinx FPGA. The following step is to merge all sub-designs into the top-level design, that is, the hierarchical structure of the design is flattened. Thereafter, an initial placement of the CLBs is carried out without any optimization of C L B locations. A more complex place and route program then optimizes the locations of the CLBs and the associated routing channels. This is the most time-consuming process. The final placed and routed design can then be compiled into an F P G A configuration bit file which can be downloaded directly into the FPGA. This entire F P G A compilation process is controlled by the automated C A D programs which can be manually adjusted to satisfy different compilation requirements. 3.1.3 Transformable Implementation Testing Following schematic capture, a first layer testing is done in the Veri log™ simulator. The simulator provides a functional and timing behavior of the design which is necessary to prove the correctness. However, due to the place-and-route step during compilation, a working design in the simulator does not necessarily guarantee correct operation in the F P G A mainly because of underestimated routing channel delays. Hence a good design must anticipate the possible delays in routing channels. Certain liming delays may be introduced in this step and a second testing is needed. The mapped designs are downloaded into the FPGA and tested by the modified MPEG-1 programs. The first part of the testing is the unit test in which each block is isolated from the M P E G program and individually tested. Then the correct function block is integrated into the M P E G decoding and encoding programs. This process is iterative as shown in Figure 3.2. Chapter 3 FPGA-Based Transformable Computing 31 User Designs Xilinx Library Schematic Editor Design Simulation O Schematic File Schematic t o X N F | Translator (WIR2XNF) Xilinx Netlist Format (.XNF) File FPGA Compilation MAP File o _ J _ Xilinx Netlist Merge Program (XNEMERGE) Merged MAP File T o Xilinx LCA Translator (MAP2LCA) Unrouted LCA File 9 Automatic Place <& Route Program (PPR) Placed & Routed LCA File o Pull-Timing | Design Simulation Configuration |3itstream Compiler (MAKEBITS) o-Configuration Bitstream File Figure 3.3 FPGA compilation process for MPEG-1 transformable coprocessor. 3.2 Transformable System Platforms Currently, a number of different transformable computing platforms exist mainly for experimental purposes [26][27][28]. Most current platform are based on exploiting concurrency in the embedded applications. Earlier transformable computer architectures were designed with fine-grained parallel (systolic) processing in mind. More recent systems support multiple process-Chapter 3 FPGA-Based Transformable Computing 32 ing styles including parallel, pipelined, and vector processing. For example, the Virtual Computer [30] was designed as a vector style numeric processor. Other transformable computers utilize FPGAs as high speed communications agents. In this case, FPGAs are used to configure high speed systolic communication agents among processors resulting in about two orders of magnitude improvement in data movement over conventional software-based methods [24]. 3.2.1 Hybrid CM-2/Xilinx Prototype The CM-2X system is the result of a joint effort by Supercomputing Research Center (VA, USA) and Thinking Machines Corporation to examine the suitability of a hybrid combination of the highly parallel Connection Machine (model CM-2) and the Xilinx FPGA technology. The CM-2X architecture is a standard CM-2 parallel processor with Xilinx FPGAs substituted for the standard Floating Point Units (FPUs). The CM-2X can be programmed using all the standard CM-2 languages and utilities. Figure 3.5 [27] shows the architectural structure of the CM-2X. The prototype consists of 512 bit serial processors in 16 SPRINT nodes. The SPRINT node is the basic building block for CM-2X systems. Figure 3.5 shows that a Xilinx XC4005 replaces the original FPU in the SPRINT node. A SPRINT node contains 32 bit serial processors, each of which connects to 1 MB of dynamic R A M . The processor is also connected to the SPRINT chip which forms an interface between the FPGA and the bit serial processor. A number of control signals are applied to the FPGA from the CM-2X sequencer. The static and dynamic instructions define the operation to be performed by the coprocessors. In this system, the interpretation of both instruction and address signals are dependent upon the functionality embedded in the FPGA by the programmer. Chapter 3 FPGA-Based Transformable Computing 33 Sequeocer L-|snodcl5| [ [ CM-2X System Interconnection PB Instruction Router Router 3 pit! \n Memory Addres^ lMbxl6 Memory Bypass Register T 32 bits Memory Bus T Sequencer Dyw Float Bus f 32 Bits S P R I N T Chip XilinxXOiOOS Programmable Logic Unit Figure 3.4 CM-2X prototyping system architecture [27]. Programming the CM-2X prototype requires the construction of three different programs: front-end C code, sequencer code, and FPGA configuration. The prototype system requires the designer to first program the design in C and then re-code the program in C M assembly language which can directly utilize the capabilities of the sequencer and processor logic. The recoding is necessary since the Xilinx FPGA is only drivable directly via assembly level codes. This multiple step programming procedure makes CM-2X a much more complicated computer to develop an application or to modify an existing design than the conventional computer system. 3.2.2 The Virtual Computer The Virtual Computer is a novel computing concept that involves the analysis of an algorithm to identify its computational intensive inner loops and then implement those inner loops Chapter 3 FPGA-Based Transformable Computing 34 in completely reconfigurable hardware. The Virtual Computer features fifty-two Xilinx XC4010 FPGAs, twenty-four ICUBE Field Programmable Interconnect Devices, three 64-bit I/O ports (one for hardware configuration/read back (CBus) and two for general purpose I/O (VBus)), 8 Megabytes of 25ns SRAM and sixteen 8K by 16-bit 25ns dual port R A M , as shown at Figure 3.6 [30]. The general purpose I/O ports are completely symmetrical with respect to usage and can offer balanced I/O to two completely different system. With 520,000 gates of reconfigurable logic per board, the Virtual computer can be completely reconfigured in under 25 millisecond. The Virtual Computer is bus interface independent, which means that a specific bus interface can be developed for any computer. The large number of available reconfigurable logic gates allows the Virtual Computer to execute many multi-step functions in parallel, instead of the sequential operation in conventional computing machines. Also shown in Figure 3.6 is the Virtual Pipeline structure in the Virtual Computer board. Each of the four pipelines has 100,000 gates of logic in between the dual port R A M and 20,000 gates that allow for smart memory and local interface. Although the Virtual Computer can accommodate large design with its 52 FPGAs, the design partition remains to be a major challenge. Finding an efficient way to utilize those FPGAs and programming the interconnection devices (ICUBE) are still under studying and some mapping and routing tools must be developed before the designer can fully exploit the computing power of this massive reconfigurable machine. Chapter 3 FPGA-Based Transformable Computing 35 64-Bits VBus I/O 64-Bits CBus I/O 64-Bits VBus I/O VO Buffers Clock Distribution Buffers I 256k X 32 I I 256k X 32 | I 2 5 6 k x 3 2 I I 2 5 6 k x 3 2 | Xilinx XC401 Xilinx XC401I Xilinx XC40H Xilinx XC401' Xilinx XC401 Xilinx XC40H Xilinx XOWll 1 Xilinx XC401' Xilinx XC401' Xilinx XC401 Xilinx XC401 Xilinx XC401 I 2 5 6 k x 3 2 I I 2561CX32 I I 256X1132 | 16x8k Dual Pod SRAM Xilinx XC401 I-Cube IQ160 Xilinx XC401' I-Cube IQ160 Xilinx XC401 I-Cube IQ160 Xilinx XC401 Xilinx XC401' 16x8k Dual Pod SRAM 16X81 Dual Pod SRAM Xilinx XC401 Xilinx XC401' I-Cube IQ160 16x8k Dual Pod SRAM Xilinx XC401 I-Cube IQ160 Xilinx XC40H 16x8k DualPoii SRAM 16x8k Dual Poit SRAM Xilinx XC401 Xilinx XC401 I-Cube IQ160 Xilinx XC401 I-Cube IQ160 Xilinx XC401 I-Cube IQ160 I-Cube IQI60 Xilinx XO401 I-Cube IQ160 Xilinx XC401 Xilinx XC401 I-Cube IQ160 Xilinx XC401' I-Cube IQ160 Xiliax XC401< 16x8k Dud Pod SRAM Xilinx XC40H 16x8k Dual Pol SRAM Xilinx XC40H 16x8k Dual Pod SRAM Xilinx XC401 I-Cube IQ160 Xilinx XC40H I-Cube IQ160 Xilinx XC40H I-Cube IQ160 Xilinx XC401 Xilinx XC40H 16x8k Dual Poll SRAM 16x8k Dual Poit SRAM Xilinx XC40H Xilinx XC401 I-Cube IQ160 Xilinx I-Cube XC401 IQ160 Xilinx XC401' I-Cube IQ160 Xilinx XO!01< 16x8k Dual Pod SRAM 16x8k DualPoii SRAM 16x8k Dual Pod SRAM Xilinx XC401 I-Cube IQ160 Xilinx XC401 I-Cube IQ160 Xilinx XC401 I-Cube IQ160 Xilinx XC401 Xilinx XC401 16x8k Dual Pol SRAfc Xilinx XC401' Xilinx XC401 I-Cube IQ160 Xilinx XC401 I-Cube IQ160 Xilinx XC401 I-Cube IQ160 Xilinx XO401 16x8k DualPoi SRAft 3—E VMEPl VMEP2 One Virtual Pipeline VMEP3 Virtual Computer System Figure 3.5 The Virtual Computer system architecture [30]. 3.2.3 The Reconfigurable Computer EVCl - s This is the main transformable computing system used in this thesis. The E V C l - s is an SBus-board based transformable computing system. It has one Xilinx XC4013 FPGA, 2MB R A M , and is hosted by a 40MHz Sun SPARC10 workstation [31]. An integrated software driver and SBus interface circuitry provide the necessary communication channel between the circuitry in the FPGA and the software program. The use of C programming language routines for control-ling the data flow allows dynamic downloading of a series of digital designs for rapid debugging Chapter 3 FPGA-Based Transformable Computing and development of large complex systems, interface structure. 36 Figure 3.7 shows the E V C l - s communication control EVCl-s Communication Interface Data Figure 3.6 B lock diagram of the EVC1 -s communication interface. The XC4013 FPGA in the EVCl- s provides approximately 13,000 usable logic gates. The system is also equipped with a set of software programs and functions which are listed as below [31]: • r2h — this program handles the conversion of Xilinx FPGA configuration bit files to a C programming language header file, which can then be included in the C language source file to form an executable file; • evcreset — this program resets the EVCl-s; Chapter 3 FPGA-Based Transformable Computing 37 • EVCdownload — this function downloads the FPGA configuration file into the EVC1-s; • EVCreset — this function resets the EVC1 -s; • EVCrbtrig - this function toggles RDBK.TRIG signal pin on the XC4013 FPGA de-vice; • EVCreadback - this function returns a single bit of the readback data stream from the XC4013 FPGA device. The E V C l - s system also includes an SBus slave interface macro which provides a communication interface as shown in Figure 3.7. The simplicity of the E V C l - s coprocessor and associated C A D tools (FPGA compilation programs) provide the designer with efficient transformable system development environment, however, the single FPGA setup in E V C l - s limits the size of the algorithm that can be implemented. Nevertheless, the E V C l - s provides an excellent initial platform for exploring the power of low-cost transformable computing solution for various fundamental applications. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor FPGA-based transformable system provides a cost-efficient way of implementing VLSI circuits, yet there is a slight difference in the implementation procedures and design parameters as compared to the conventional ASIC (Application-Specific Integrated Circuits) designs. For example, the algorithm design is bonded by the size of FPGA, hardware and software interfacing scheme must be created, algorithm partition optimizaion is needed. In this chapter, the E V C l - s transformable system design parameters and limitations are investigated. Moreover, the transformable system design and implementation strategies are also discussed. 4.1 Implementation Platform Parameters The transformable E V C l - s computer employed in this thesis has several unique design parameters which shape the development strategy of an MPEG-1 video coding processor. Each of the design parameters can affect the way that an algorithm is mapped onto the transformable coprocessor, and also affect implementation performance and efficiency. On the other hand, the transformable system software which drivers the hardware circuitry also affects high-level system behavior. Due to the complexity of the MPEG-1 video coding scheme, this thesis employs a modified version of a standard MPEG-1 video coding software package, rather than developing a completely new system software. This can be viewed as an added advantage of the transformable coprocessor approach, since using a standard software package as the starting point reduces development time drastically. The software package from the Plateau Multimedia Research Group at University of California, Berkeley [36] is used as the basis of the video coprocessor system software. The package consists of an encoding program and a decoding program, as well 38 Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 39 as other statistical programs. This software package exists in freeware format and is the first widely used software MPEG-1 video coding program. The single FPGA structure of EVCl - s provides for a simple mapping procedure without invoking additional design placement complications as those encountered with multiple-FPGA transformable coprocessor. However, the trade-off is a more limited number of usable gate and memory elements. Therefore, one of our design goals is to obtain optimize performance under such restricted parameter and resource limitations. In this section, various design parameters are reviewed, and their effect on the MPEG-1 video system implementation is outlined. 4.1.1 F P G A Gate Density Since FPGAs are prefabricated programmable general purpose integrated circuits, the number of available gates they host is quite small as compared to ASICs. The FPGA used in this project is a Xilinx XC4013 series Logic Cell Array. This FPGA provides 160 I/O pins, 1152 function generators and 1152 registers as shown at Table 4.1 [37]. There are approximately 13,000 gates available to implement the design. So the size of the design is limited to these numbers. In this thesis, the area limiting factor poses the biggest constraints to the IDCT and D C T designs. Most typical DCT or IDCT design require 50,000 to 75,000 gates [38][39]. To accommo-date the design into the XC4013, the number of parallel DCT/IDCT units had to be reduced, resulting in more clock cycles and slower DCT circuits. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 40 Table 4.1 Available hardware resources in Xilinx XC4013 FPGA [37]. Approximate Gate Count 13,000 CLB Matrix 24x24 Number of CLBs 576 Number of Flip-Flops 1536 Max. Decoder Outputs (per side) 72 Max. RAM Bits 18,432 Number oflOBs 192 Function Generator per CLB 3 Number of Logic Input per CLB 9 Number oi Logic Output per CLB 4 Numbei of Low-skew Global Nets 8 4.1.2 Software-Hardware Interfacing The SBus interface provides a seamless link between the hardware circuit and the software program. It allows the software program to directly control the hardware circuit through the SBus in a Sun SPARC workstation. Although this interface facilitates the data and control signal exchange, it also poses certain limitations on the hardware design. Basically the software can transfer the data to the FPGA through a 32-bit data bus and control the FPGA through a 16-bit control signal bus. The data bus width is limited to 32 bits, which is not wide enough for signal processing applications with high data rate. The interface also sets the format of data exchange. The software program can write and read data through memory mapping technique, i.e., if some data is to be sent to the FPGA, we have to write it to a pointer address. The read operation uses the same technique. Figure 4.1 shows the SBus slave interface macro which facilitates the exchange of the data between host processor and EVCl- s coprocessor. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 41 CFtf Data Address A-B ^ C K <2:0> Mm PA <16:5:2> B_SEE <2:0> B_DATA <31:0> B_AS B_SEL VSB JDTDIR B_CLK EVCl-s SBus Slave Interface Macro From SBus SDATI<31:0> EVCADR<15:0> To SBus SADTO<31:0> SCLK FCLK OkcttH J From EVCl-s Board Figure 4.1 EVCl-s SBus slave interface structure [31]. 4.1.3 Hardware Control Scheme In the SBus interface data transfer scheme, only the write operation generates a control signal to the circuit but not the read operation. For example, if the designer tries to send 32 bits of data to the circuit, then a command will put the data into a pointer address. Figure 4.2 shows this memory mapping technique in EVCl- s . The pointer addressing process will generate a one bit control signal to the circuit to enable a register or a multiplexer. In the read operation, there is no control signal and the data is available through any of 16 assigned memory address. This type of scheme creates a synchronization problem when transferring data from the FPGA to the software program memory. In order to read eight 32-bit data, a dummy write statement must be inserted to Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 42 generate a control enable signal to the registers. Such operations increase the data transfer latency and slow down the computation. C Pjpgf am tfinclude'desigiiji" mainO t int *write_to_evc *read_from_evc; writc_to_evc = EVCdownload(design,l,l); rcad_fron\_evc = writc_to_evc + 2; . 3™te_to_evc) = 0xl2345«7;_ «(wrile_to_evc * 1) = 0x2345«78; A C result = *rcad_lrom_evc; Load the data into pointer address Write Signal to EVCADR<0> Interface Macro SDATCX31:0> - SDATI<31:0> EVCADR<0> SCLK EVCADR<0> enables Input Reg. Memory Mapping Hardware Control DowtilojdeiJ Design l> Logic Circuits Input Registers elk Output Registers J Figure 4.2 Memory mapping hardware control technique. 4.1.4 Configuration Downloading Overhead Time The downloading process includes transferring the configuration data the from host memory, through the SBus interface, to the F P G A configuration memory in the E V C l - s transformable coprocessor. The configuration downloading time is accrued each time a new circuitry is transferred to the FPGA in E V C l - s . Equation 4.1 shows the factors affecting the overhead time, Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 43 Overhead (T0H) =Ttmn + T c o n f + T r e s e t (4.1) where Ttran = SBus transfer time, Tconf = FPGA configuration time and Treset = E V C l - s reset time. The first parameter, Ttran, depends on the SBus interfacing protocol and the interface clock frequency. The second parameter, Tconp is an inherent quality of a particular F P G A and it remains constant for all applications on the same FPGA. Treset is the time to reset the E V C l - s device before the next download and is a very small value as compared to other two parameters. Typical values of those parameters are shown in Table 4.2. Table 4.2 Configuration data downloading overhead parameter averages. Parameter Average Value on Sun SI'ARCIO Workstation T conf 25 x 10"3 seconds T tran 215 x 10"3 seconds T reset < 10 "6 seconds 4.1.5 Memory Restriction FPGAs only contains function generators and registers. Although these two types of elements enable a designer to built all kinds of combinational and sequential circuits, they are very inefficient in implementing large scale memory elements such as ROMs and RAMs. A single XC4013 can store up to 1152 bits of data in its registers. But this small amount of memory is far less than the D C T or IDCT requirements. Although there is a 2MB R A M module on the same E V C l - s board, using this memory requires access logic circuit makes the whole design too large to be fitted into the FPGA. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 44 4.1.6 MPEG-1 Video Decoding Software Structure The MPEG-1 video encoding and decoding programs are large complex software entities that require a clear understanding of the program flow before any modification can be done. The MPEG-1 decoding program is made up of more than 20 different C source files responsible for different decoding and displaying functions. Figure 4.3 shows the decoding program flow chart. This program flow is maintained in the modified version of decoder program which controls the transformable video coprocessor. Most of the modifications are performed on the computational functions in the decoder program because the transformable coprocessor handles mathematical calculations more efficiently than other file management or bit string manipulation functions. The computational portions of the decoder program are highlighted in Figure 4.3. 4.1.7 MPEG-1 Video Encoding Software Structure Similar to the decoder program, the original software encoder program structure is maintained in the modified encoder program for video coprocessor. The encoding process model is shown in Figure 4.4, with the computational portion of the program highlighted. The encoding process converts the original frame sequence, which is pre-stored in other formats (e.g. yuv, jpeg, etc.), to MPEG-1 format file. The encoding program also requires a parameter file which contains all the configuration data (I,P,B frame ratio, original frame sequence format, frame rate, etc.) to control the final MPEG-1 file formats. Various compression modes and options in the encoding program are unaffected by the modification, and the original compression procedure is preserved. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor MPEG-1 Video Decoder Program Row Create Frame MPEG Bit Stream Memory Stores Parse Slice Header Huffman Decoding lof Various Headers Reconstruct Motion Vectors Parse Sequence Heade Parse GOP • Picture Header Parse Picture Header Parse MacroBlock t Parse Luminance and Chrominance Blocks f Inverse Quantization Inverse DCT Computational Tasks Frame Reconstruction Display Current Frame Finish A Picture Start Anofl Slice  Another i Finish One Slice Continue the Slice L . Figure 4.3 MPEG-1 video decoding program operation model. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 46 MPEG-1 Video Encoder Program Flow Encoding Parameter File Process Parameter File MPEG File Create Sequence and GOP Headers Frame Sequence Decide Encoding Frame Type Output MPEG Bit Stream B-frame I-frame P-frame 1 Continue Encoding Add B-frame Headers Add P-frame Headers Add I-frame Headers Motion Estimation for Forward AND Backward Vector Motion Estimation for Forward OR Backward Vector Encode MBlock Using Motion Compensation Encode MBlock with NO Motion vector Finish One Frame Figure 4.4 MPEG-1 video encoding program operation mode. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 47 4.2 System Algorithm Implementations The current FPGA gate density does not allow a transformable coprocessor to contain a complete MPEG-1 decoding or encoding algorithm. Also, from data transfer frequency analysis (reported in Chapter 5 of this thesis), we have determined that not every part of the MPEG-1 decoding or encoding scheme will be accelerated if implemented in the transformable coproces-sor. For example, file management and bit string management can be handled more efficiently by the host processor. Therefore, an implementation selection process was carried out, using the data transfer frequency as the major evaluation criterion. The process decides whether the algorithm will be implemented completely or just partially. During the algorithm implementation process, the circuit size is the primary constraint parameter since most of the significant algorithms require gate densities well above that of the XC4013 FPGA. Moreover, some algorithms can only be implemented partially or must be transformed into scaled down versions of the original algorithm. In the following sections, seven different MPEG-1 coding algorithms which has been implemented in the transformable coproces-sor are presented. The algorithm architecture selection and modification processes are based on the constraint parameters of the transformable coprocessor listed in the previous section. Although these algorithms are only part of the whole MPEG-1 coding scheme, each of them is complex enough to warrant a separate research. Therefore, the focus of the following discussion is not on how to design the fastest or the most compact circuit, but rather to design a version of the circuit that can be implemented most efficiently by the transformable coprocessor. 4.2.1 Discrete Cosine Transform and Inverse Discrete Cosine Transform NxN NxN The 2D-DCT is a mapping X - » Z , where X e 9t and Z e 9t and is denned by the Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 48 NxN T orthogonal matrix A e 9? as Z = A XA. The mathematical formulation is given by [40] N-1N-1 z(kpk2) = a(kvk2) ^ m;k2, n )x(m, n) (4.2) m = On = 0 where a (kv k2) is a normalizing factor and f(kv m;k2, n ) represents the transform kernel. N is the order of the transform. For a Type II DCT transform, given an input block x(m,ri) with {0 < m, n < N} , the forward and inverse 2-D DCT can be written as [41]: N-1N-1 rr/i IN 2 v"i x-1 / \ (2m+l)nk (2n+l)%l ON Z ( M ) = -oc(fc)a(Z) £ £ x ( m , n ) c o s - ^ — — C 0 S " ^ — 2 / V (4-3) n = Om = 0 *(m,n) = | a (k) a (I) "f Z (fc, /) cos < 2 w ^ > ^cos < 2 n ^ n l (4.4) M = 0/n = 0 where k, I, m and n range from 0 to N- 1, a (0) = 1/ (A/2) , and a (J) =1 for ; ^ 0. The forward and inverse transforms are merely mappings from the spatial domain to the transform domain and vice versa. The D C T and IDCT are separable transforms and can be expressed in matrix notation as two 1-D DCT/IDCT: Z = AXAT (4.5) X = ATZA (4.6) where AA* = IN and A is&nNxNmatrix whose basis vectors are sampled cosines: Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 49 a (k, n) = N a (k) cos (In + l)itk 2N (4.7) for n = 0, ...,N-1 and k = 0, ...,N- 1 where a(0) = ^ a n d a ( f c ) = 1 for k*0. The above decomposition to a triple matrix product results in a reduction in computational complexity DCT/IDCT Design Approach for Transformable Coprocessor Implementation Various approaches for D C T and I D C T implementation have been proposed [41] [42] [43] [44]. Due to the separability of the 2D DCT/IDCT, it can be calculated by performing successively N one-dimensional DCTs/IDCTs on the image rows and then N one-dimensional DCTs/IDCTs on the resulting columns. Computing separate horizontal and vertical one-dimensional DCTs of length N requires 2 ^ multiplications and additions which are much less than direct implementations of 2D DCT/IDCT. Distributed arithmetic is another efficient way to compute DCT/IDCT either totally or partially as scalar products [45]. However, this approach requires a significant amount of memory storage to contain the multiplication look-up tables. Since the memory storage is a major limiting factor in the transformable coprocessor, the row-column approach is adopted in this thesis. The D C T and IDCT architectures share similar features. They use the same type of multiplication and accumulation components but with different data input sequences. From Equations 4.5 and 4.6, the D C T and IDCT involves matrix multiplication. Each N x N matrix-matrix multiplication can be decomposed into N matrix-vector products. The block diagram in to 2iV3 multiplications. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 50 Figure 4.5 shows the row-column DCT/IDCT computation. The first block in the diagram T computes Y = AX and the second computes Z = YA in the D C T case. However, since an entire row of Y has to be computed prior to the evaluation of the next 1-D DCT/IDCT, the intermediate result Y must be stored in a buffer. Since columns are written into the buffer and rows are read from it, this type of buffer is commonly known as the transpose buffer. MUX 2:1 ID DCT/IDCT Unit DMUXj 1:2 Transpose Memory Row-Column DCT/IDCT Algorithm Figure 4.5 Row-Column approach for 2-D DCT/TDCT. DCT/IDCT Architecture Optimization The basic computation performed by the DCT is the evaluation of the product of an (N x N) matrix by (Nx 1) vector. The first 1-D DCT/IDCT unit operates on the row of A or A r in IDCT case and column of X, while the second 1-D DCT/IDCT unit operates on rows of Y and column of AT (or A in IDCT case). The second 1-D DCT/IDCT unit is unnecessary if we can multiplex the computation for a column of X and a row of Y on the first 1-D DCT unit. The basic building block for the D C T and IDCT can be further simplified as follows. First, the 8 x 8 D C T matrix can be Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 51 written as [41]: where A = a a a a a a a a b d e g -g -e -d -b c f -f -c -c -f f c d -g -b -e e b g -d a -a -a a a -a -a a e -b g d -d -g b -e f -c c -f -f c -c f \j> -e d -b b -d e g_ (4.8) a b c d e f L&l n cos-4 TZ C 0 S 1 6 cos 71 C O S 8 16 5K C 0 S l 6 c o s y cos -TIE 16 (4.9) Note that even rows of A are even-symmetric and odd rows of A are odd-symmetric. Hence, by exploiting this symmetry, we obtain Y(0) a a a a X(0) +X'(7) 7(2) c f -f -c X(l) +X(6) 7(4) a -a -a a X(2) +X(5) 7(6). J -c c -1 X(3) +X(4)_ (4.10) Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 52 7(1) b d e g X(0) -X(l) 7(3) = d-g-b-e X(l)-X(6) ( 4 U ) 7(5) e -b g d X(2)-X(5) 7(7)J \j> -e d -b\[x(3)-X(4)_ The number of multiplications is halved since the (N x N) by (N x 1) matrix-vector multiplication has been replaced by two ((N/2) x (N/2)) by ((N/2) x 1) matrix-vector multiplies. These two operations can be performed in parallel as well. The 1-D IDCT can now be rewritten as: Y(0) a c a f X(0) b d e 8 X(l) 7(1) a f -a -c X{2) + d -8 -b -e X(3) 7(2) a -f -a c X(A) e -b 8 d X(5) 7(3). a -c a -1 X(6)_ 3 -e d -b X(7)_ 7(7)" a c a f X(0) b d e 8 X(l) 7(6) a f -a -c X(2) d--g --b -e X(3) 7(5) a -f -a c X(4) e --b 8 d X(5) 7(4)_ a -c a -I X(6)_ $ --e d -b X(l)_ (4.12) (4.13) Since the computation complexity for both the D C T and IDCT has been halved, the 1-D DCT/ IDCT unit now needs to compute only N multiplications per input sample data. Figure 4.6 and 4.7 show the implementation structures of the ID and 2D DCT circuitry, respectively. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor Control r Bits Control Generation Input and Data Data Store Operation Control Bits Store Sign Bits and Data Selection Control Bits Input Data Fractional, 2's Complement, Constant Multiplier Even Output Module Odd Output Module Even DCT Coeff. OddDCTCoeff 1-D D C T System Figure 4.6 1-D DCT/IDCT implementation block diagram. Control Bits Input Data Control Generation and Data Store Operation Control Bits Store Sign Bits and Data Selection Control Bits Input Data Fractional, 2's Complement, Constant Multiplier Data Input Control Module Address Control Address Control Module 1-D DCT Coeff. Output Even Output Module Even DCT Coeff. Odd Output Module 12bitx64 Memory Module Memory Address J Odd DCT Coeff 1-D DCT Coeff. Input 2-D D C T System Figure 4.7 2-D DCT/TDCT implementation block diagram. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 54 Due to size limitations of FPGA, only the design structure in Figure 4.6 can be completely housed in the EVCl - s FPGA. The difference between the two designs is that the 2-D DCT/IDCT circuitry of Figure 4.7 requires less amount of data transfers. Therefore, the latter design has a better performance curve in the Verilog simulation as will be demonstrated in the next chapter. 4.2.2 Motion Vector Reconstruction In MPEG-1 standard, the motion-compensation unit handles 16x16 macroblocks of pixels. The macroblock size is a trade-off between coding gain provided by using motion information and the overhead needed to store it. Depending on the type of macroblock used, motion vector and related information and is encoded with the compressed prediction error signal in each macroblock. Motion vectors are encoded differentially with respect to the last encoded set of motion vectors using variable length codes. At the decoder side, a process of reconstructing the original motion vector value is needed. The decoder must maintain previously reconstructed motion vectors in order to recover the current motion vectors. The motion values recovered from an MPEG-1 bit stream contain the motion_code which is the differential motion vector value, and the motionjresidue which is the prediction error signal. Figure 4.8 shows the flow chart of the motion vector reconstruction process. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 55 Motion Code Adjustment (flag=l) Motion Code Motion Code Adjustment Input Control Motion Flag Motion Flag Compute Max and Min Motion Range Compute Differential Term (+ difference) Compute Differential Term (- difference) Motion Vector Ranges Motion Adjustm Vector ent Full_pixel_flag Adjustment T Motion Vector Reconstruction Process Previous Motion Vector Current Motion Vector Figure 4.8 Flow chart of the motion vector reconstruction process. The reconstruction process consists of two parts: one for computing the x-axis value and one for computing the y-axis value. The two computations have the same structure and can be executed in parallel, or they can be overlapped using a single pipelined unit. The limited FPGA size favors the pipelined sequential approach in order to fit the design in the EVCl- s . The major reconstruction steps involve conditional branching, multiplication and addition. Hence, the design consists of a number of multiplexers, and since most of the multiplicands are constant power of two, the multiplication can be realized by bit shifting only. Figure 4.9 shows the structure of the motion vector reconstruction process. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 56 Encoded Vector Value Input Module h/ec or C o l a , M a x / M i n Generator Vector Code (f) Vector M i n Fu l l Pixel Indicator Vector M a x Ve:tor l^_ Coie(rr Vector Code Generator Vector Recovery Previous Vector Vector Code Lower L imit Vector Limits Generator Upper L imi t Motion Code x f Output Module Mot ion Vector Value Motion Vector Reconstruction Figure 4.9 Motion vector reconstruction process implementation structure. 4.2.3 Motion Estimation In the motion estimation process, the successive frames of a video sequence are analyzed to estimate the motion (displacement) vectors of moving pixels or blocks of pixels. Many differ-ent algorithms such as Block-Matching Algorithms (BMA) [46], Pel-Recursive Algorithms (PRA) [47], Phase Correlation [48][49], etc., have been developed for the motion estimation process in different video coding standards. Other algorithms classified as feature-based and optic-flow-based approaches have also been proposed [50]. The MPEG-1 video standard specifies the use of B M A in the coding process. In B M A , each video frame is partitioned into non-overlap-ping blocks (called macroblocks in MPEG-1 video standard). Each block in the current frame is Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 57 compared with neighboring blocks around the block at the same position in the previous frame and motion vectors are extracted on a block-by-block basis. Figure 4.10 illustrated the basic idea of BMA. Specifically, each frame of the video sequence is partitioned into macroblocks of size M x N pixels (16 x 16 pixels in MPEG-1). Each block in the current frame is searched in a search-window in the previous frame for a best matched block based on a matching criterion. If the search-window range is ±p pixels (in both the horizontal and vertical dimensions) relative to the corresponding block-position in the previous frame, then the search window contains (M+2p)(N+2p) pixels. The position of the best matched block relative to the current block defines the motion vector. Previous Frame (-p,-p) (M+p-l.-p) (0,0) (M-1,0) Motion /i Vector/ Search] Windo* (M-l.N-l) M+2p < - P , N + £ ) M + 2 p \ (M+p-l.N+p-1) Best mached Block Block Matching Algorithm V. T i m e L i n e Current Frame (0,0) (M (0.N-1) A (M-l.N-l) Current Block Figure 4.10 Illustrating Block-Matching algorithm in motion estimation process. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 58 The B M A assumes that all pixels in a block have the same motion. Several match-criteria can be used to define the best match, including Cross-Correlation Function (CCF) [51], Mean-Squared Error (MSE), and Mean-Absolute Difference (MAD). Since M A D results in good perfor-mance and is much easier to implement than MSE or CCF, it is used in this thesis for the MPEG-1 video encoding program. The displaced block difference S(m,n) with displacement (m, n) using M A D is defined as follows: M-1N-1 S(m,n) = ]T £ \x(i,j) -y (i + m,j + n)\ (4.14) j = o j = o for p>m>-p and p>n>-p, where x is the pixel value at (i,j) position in the current block, y(i+m, j+n) is the pixel value at position (i+m, j+n) of the search window in the previous frame. Comparing the displaced block differences in the search-window, the displacement that results in the smallest displaced block difference defines the motion vector, or v(x,y) = min^mn) {S(m,n)} (4.15) for p > m > -p, p > n > -p, and where v(x,y) is the motion vector. The motion estimation process in the MPEG-1 video encoder is complex and cannot be completely implemented in the transformable coprocessor. It must be broken down into a number of smaller functional units. However, not all functions in the motion estimation process can be implemented in the E V C l - s with an outstanding performance, especially those that demand intensive data transfer operations. Therefore, only the M A D algorithm of the motion estimation process is implemented in the EVCl-s . We implemented an accumulation loop which calculates the luminance error term using M A D algorithm. This portion of the algorithm was chosen Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 59 because accumulative looping can be implemented in the E V C l - s very efficiently. Figure 4.11 shows the structure of the implementations of M A D algorithm. Error Terms Oulput Selection Input Module Lum-Motion Difference Difference Difference Accumulation Difference Absolute Difference Difference Accumulation Difference — Difference Difference Squared Accumulation Output Module MAD Error Lum-Motion Error Loop Figure 4.11 MAD algorithm implementation structure in motion estimation process. 4.2.4 Quantization Quantization is a process which maps a continuous amplitude signal to a limited number of discrete values in order to obtain an efficient digital representation or compression. Quantiza-tion often introduces distortion to digital values, since the reconstruction process cannot recover the exact original values. Therefore quantization is a lossy process which often determines the quality of the processed signals. There are several performance criteria for quantizer design such Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 60 as the minimum mean square error and the minimum absolute error. In this thesis, the inverse quantization process is implemented according to the MPEG-1 video standard. The inverse quantization process is specified to produce the reconstructed D C T coefficients from QF[v][u], where QF[v][u] (QF [v] [w] = QF [v, u] ) is the two-dimensional quantized D C T coefficients array [6]. The process essentially involves a multiplication by a quantizer step size [52]. The quantizer size is modified by two mechanisms: • A weight matrix that is used to modify the step size within a block. • A scale factor that is used to modify the step size at the cost of a few bits as compared to encoding an entire new weight matrix. The inverse quantization process is illustrated in Figure 4.12. The process starts with the appropri-ate inverse quantization arithmetic which converts QF[v][u] to F"[v][u], where F"[v][u] is an array of partially computed DCT coefficients. Then the value of F"[v][u] are saturated to yield F'[v][u] which is the D C T coefficient array with all the values lie between -2048 and 2047. A mismatch control operation is performed to give the final reconstructed D C T coefficients, F[v][u]. The DC coefficients of the intra-coded blocks are inverse quantized in a different way with respect to all other coefficients in a block. In intra-coded blocks, F"[0][0] is obtained by multiplying QF[0][0] by a constant multiplier, intra_dc_mult, or using the following formula: F"[0 ][0 ] = intra_dc_mult x QF[0 ][0 ] (4.16) Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 61 QF[v][u] 1 Inverse Quantization Arithmetic T W[w][v][u] quan_scale_code F"[v][u] F'[v][u] Saturation Inverse Quantization Process Mismatch Control F[v][u] Figure 4.12 Inverse quantization process flow [6]. The arithmetic reconstruction of F"[v][u] from QF[v][u] for coefficients other than the D C coefficient is specified in the following equation: F"[v][u] = ((2QF[v][u] + k)x W[w][v][u] x quantizer_scale)/32 (4.17) where W[w][v][u] is the inverse quantization weight matrix and k is an offset value defined as k = 0, intra - blocks Sign (QF[v] [u]), non- intra-blocks] (4.18) where Sign(QF[v][u]) = -1 if QF[v][u] is negative, or equal to 1 if QF[v][u] is positive. The coefficients are then saturated to lie in the range [-2048:+2047] according to the following formula: Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 62 F'[v] [u] = 2047, F"[v] [u] >2047 F"[v] [w],-2048<F"[v] [u] < 2047 -2048, F"[v] [u] <-2048 (4.19) Finally, the saturated coefficients are tested to determine whether they are even or odd. If the values are even, they are corrected according to the following formula: F[v] [u] = F'[v] [w] + l ,F' [v] [u] <0 F' [v] [u] - 1, else (4.20) The entire inverse quantization process involves relatively fewer data transfers than other functional blocks, although the amount of data transfer can be reduced to a minimum if the quantizer matrices can be stored in the local memory of the transformable coprocessor. Figure 4.13 shows the direct implementation of the above inverse quantization equations. Input Value Input Module Sign Detection Inverse QuantizationBlock Quantization Table Quantizer Qscale Multiplication Quantizer Bit Shifting -Multiplication Output Selection Quantizer Multiplication with offset BitShifing & Subtraction -Output Module Output Coefficient Figure 4.13 Inverse quantization process implementation structure. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 63 4.2.5 Huffman Coding Variable length coding is widely used in MEPG-1 encoders to obtain further lossless compression. The variable length coding technique employed by the MPEG-1 standard is Huffman Coding. This type of coding assigns a 0 and 1 sequence for each input symbol such that the total average length of the code is minimal based on a predetermined weight for each input symbol. Huffman coding can be represented by a binary tree where each leaf of the tree represents a symbol. A symbol is encoded by the path from the root to the corresponding leaf, such that a symbol with larger weight has shorter path length. One way of implementing a Huffman encoder is through the use of encoding and decoding tables [53][54]. The decoding look-up table contains all the symbols which can be referred to by different input bit sequences. Since the symbols carry different weights in each case, more than one decoding table is needed. In the MPEG-1 decoding section, a total of 14 decoding tables are used for D C T coeffi-cients, D C T sizes, macroblock types and motion vectors. The tables are stored in the form of arrays, and for each decoding process, an array address which contains the decoded symbol is calculated. Then the value of the array element, i.e., the retrieved symbol is sent to the next processing step. Clearly, the implementation may involves a large number array address computa-tions depending on the input bit sequence. There are more bit sequence manipulations than actual computations in this case. Hence, the performance gain from implementing the algorithm in the E V C l - s is not obvious. Figure 4.14 shows the structure of the Huffman decoder. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 64 Decoding Tables (Memory Bank) Store/Decode Address Control f \ Data Control Input Data Input Data Decoding Type Input Control Module Table Address Store/Decode 'ecodinjj Type Decoding Table Selection Table Increment Input Data Input Data Processing item Offset Table Offset Output Data Data Output Module Huffman Decoding System Figure 4.14 Huffman decoder implementation structure. 4.2.6 Frame Reconstruction The frame reconstruction step in the MPEG decoding process recreates the current frame pixels by using motion vectors and reference frames. Since there are three types of frames in an M P E G video frame sequence, each frame type has its own frame reconstruction procedures. The procedures can be categorized into four parts: • I-frame without motion vectors: this is the simplest format since there is no motion vector involved. The process simply reconstructs the current frame by moving all the decoded pixel values to the current frame array structure. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 65 • P-frame with forward motion vectors: this frame type needs the reference frame values and the motion vectors for each block. The motion vector values are then added to the reference frame values and stored in the current frame array structure. • P-frame with backward motion vector: this frame type is handled in the same manner as a P-frame with forward motion vector. The only difference is that the backward ref-erence frame is used, i.e. the past reference frame is used. These two types of P-frame share the same implementation hardware with different parameters. • B-frame with forward and backward motion vectors: this frame type requires the most computing power to reconstruct. The process involves adding both forward and back-ward motion vectors to the reference frames. An averaging technique is used and the resulting values are stored into the current frame array structure. Due the limited number of gates available in the E V C l - s , three separate circuits were developed to perform each of the four types of frame reconstruction mentioned above. The I-frame reconstruction circuit mainly performs a current frame array manipulation since no motion vectors are involved. However, the simplicity of the I-frame reconstruction process gives little performance advantage to the E V C l - s implementation. The P-frame reconstruction circuit performs motion vector addition (adding a differential pixel value to the reference frame pixel value) and places the result in a new location in the current frame by adding the reference pixel location and the corresponding motion vector value. Thus the two major computations are pixel value addition and current frame array address calculation. Figure 4.15 shows the implementation architecture of frame reconstruction process. Chapter 4 Implementation of A Transformable MPEG-1 Video Coprocessor 66 Block Type No MBlock Address Block Type Identification Block Type MBlock Location Calculation (in pixel location) Forward/Backward Motion Vector Forward & Backward Motion Vector Frame Reconstruction System Row & Column Location Adding Motion Vector values Adding Both Motion Vector values (average 2 values) Block Type + Current Frame Pointer Revised Row & Column Location Move Current IMBlockto Appropriate Location in Reconstructed Frame Partially Recon. Frame Figure 4.15 Implementation architectures of frame reconstruction process. The B-frame circuit resembles the P-frame circuit except that two sets of values are used for computing the pixel values and new pixel location. These two sets of values are obtained from the average of forward reference frame and backward reference frame. There is only a slight increase in the computational requirements over P-frame reconstruction, basically one addition and one right bit shift. Chapter 5 Simulation and Performance Analysis of Trans-formable MPEG-1 Video Coprocessor Various different versions of MPEG-1 function block designs are implemented in EVCl- s coprocessor with extensive simulations and testing to evaluate the efficiency of using transform-able coprocessor in MPEG-1 coding process. The primary evaluation parameter is MPEG-1 coding rate which needs to be improved constantly since the standard was released. The complete evaluation process consists of two stages: simulation stage and real-time testing stage. The simulation stage emphasizes on functionality verification and delay analysis of the implemented transformable function blocks using Verilog X L ™ simulator [55] and X A C T ™ compilation analysis tools [56]. The real-time testing stage employs a number of different MPEG-1 frame sequences and compares the coding rate of the original software MPEG-1 coding programs and the modified programs with transformable coprocessor. The evaluation process also analyzes the factors that affecting the performance of the transformable video coprocessor and how these factors can be optimized in order to achieve maximum performance gain. 5.1 System Simulation Analysis System simulation is the first step of verifying the functional correctness of digital system implementations. Simulation results provide valuable information on the implementation size and path delay which are essential for developing a functional digital system on FPGA-based platforms. In this thesis, system simulation is the first layer verification which detects all the logic and timing faults in the designs before the actual real-time testing. Also, due to the nature of FPGA-based system, two system simulations are carried out: one simulation before F P G A compilation and one simulation after FPGA compilation. The second simulation is to ensure that 67 Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 68 extra delays introduced in the FPGA mapping and routing process do not affect the functionality of the implemented MPEG-1 coding designs. 5.1.1 Simulation Tools and Procedures Since all of the implemented MPEG-1 function blocks are stand-alone designs, they are simulated individually with the following tools: • Verilog X L -- a digital logic simulation program; • Xilinx FPGA Compiler — consists of a design netlisting program and FPGA mapping and routing programs; • X A C T xdelay program ~ provides the path delay information for a design after the place and route procedures; • Verilog H D L test drivers ~ they are H D L programs that create the test sequences and apply them to the digital designs; • Test data sets — they are sample data sets obtained from the MPEG-1 coding software; • cWave program — this is a wave-form displaying program which allows the simulated results to be analyzed. These simulation tools are used as according to the flow chart shown in Figure 5.1. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 69 Simulation Design Translation (to Verilog HDL1 1 Create 1 Drivers TestVe in HDL rest and ctors 1 Run Verilog Simulation on the Design 1 Use cWave to Analysis the Resutls 1 FPGA Compilation Run xdelay to Obtain Path Delay 1 Check Mapped Design Still Satisfy Timing Requirement Run Verilog Simulation Again 1 1 Use cWave to Analysis the Resutls System Simulation Process Flow 2v II iwl Situuktioa Figure 5.1 System simulation process flow. 5.1.2 System Simulation Results The system simulation process unveils the initial performance of the designed MPEG-1 algorithm blocks. The evaluation parameters for the simulation are: System block delay ~ this performance parameter reveals the critical bottle-neck block in the design; Chapters Simulation and Performance Analysis ofTransformable MPEG-1 Video Coprocessor 70 • Design area — this parameter reports area utilization and the potential of duplicating a unit to achieve speedup through parallel processing; • Design memory requirement — this parameter reveals the design delay-cost trade-off resulting from the system memory constraint; • FPGA resource consumption — this parameter gives precise resource consumption fig-ures of the implemented algorithm blocks. FPGA resources include logic blocks, I/O blocks and routing area. The above parameters characterize an algorithm block implementation constraints and efficien-cies, which will eventually affect the real-time coding effectiveness of the implementation in the transformable coprocessor. Among the various algorithm blocks, the D C T and IDCT designs are more complex than the others. However, they also exhibit the largest speedup potential. Figure 5.2 shows the path delays in different stages of DCT and IDCT computations based on their implementation in the E V C l - s . The timing diagram does not include the transformable interface delay which will be accounted for during the run-time testing stage. The timing delays shown in Figure 5.2 accounts for processing only 8 coefficients. Since a D C T block size is (8x8) coefficients, and the row-column DCT method is used, the total delay for a complete processing of an (8 x 8) D C T block is: Delay(8-pointDCT)* !6 = Delay (8*8 DCT)-The multiple computation cycles in the DCT and IDCT processing can be shortened by using a parallel architecture that includes multiple 8-point DCT/IDCT processing units. Figure 5.3 shows the timing delays of the DCT and IDCT circuits with two and four parallel processing units. The parallel architecture effectively reduces the execution cycles for each D C T block, Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 71 however, the circuit size and input/output channel size also increase. Moreover, the increase may not be accommodated by the existing FPGA. A detailed resource consumption chart for each algorithm implementation is listed in the Figure 5.5. DCT and IDCT System Simulatiom: Delay for Computing 8 Coefficients 1 1 i i i t o DCT x IDCT wiuiuui inei with memon /• mouuie </ 1 1 1 450 400 350 ^ 3 0 0 CO c IT 250 CD CJ CD .§ 200 t-150 100 50 0 _D DP1 M1 A1 DP2 M2 A2 DP3 M3 A3 DP4 M4 A4 Process Stage (LD:Load Data, DP:Data Pair, M:Multiply, A:Accumulate) Figure 5.2 DCT and IDCT implementation time delay among different stage. D C T and IDCT circuits performance is also affected by the availability of local memory on the E V C l - s board. Figure 5.2 also shows the effect of local memory on system delays for the DCT and IDCT circuits. The use of local memory on the EVCl-s board reduces the frequency of data transfer operation over the host bus, a major delay bottle-neck. On the other hand, a memory access logic interface circuit is needed which further increases the implementation size of the Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 72 D C T and IDCT designs. System delays for other implemented algorithm blocks are shown at Figure 5.4. The delay patterns of these system resemble those for the D C T and IDCT. Again, utilizing parallel processing and local memory reduce the overall system delays, at the cost of increasing of implementation size. Figure 5.3 Parallel DCT and IDCT circuits timing delays. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 73 Figure 5.4 System delays for various algorithm blocks. Accompanying the various ways of improving system performance is the increase in design implementation size which is a major limiting factor in FPGA-based system. Figure 5.5 shows the FPGA resources consumption for the different algorithm blocks. All values included in Figure 5.5 are from the circuits implemented without utilizing any parallel processing nor local memory. Therefore, from the function generator consumption for the DCT and IDCT circuits, it is obvious that the XC4013 F P G A in E V C l - s is too small for implementations with parallel processing units. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 74 100 FPGA Resources Consumption for Various Algorithms IDCT MV Est. In.Quan. MV Recon. Huffman Frame Recon. Algorithm (IO:l/0 Pin,F:Function Generator,ff:Flip-Flop) Figure 5.5 FPGA (XC4013) resource consumption for various algorithm blocks. 5.2 System Real-Time Testing Analysis Although system simulation provides the functional verification of the video coding circuits, only the run-time system testing with a real MPEG-1 frame sequence can demonstrate the compatibility of the transformable video coprocessor. The system testing is divided into unit testing and integrated testing. The purpose of the unit testing is to verify functional correctness of the implemented designs with high level C language drivers, instead of using simulated test drivers and test vectors. Sampled data from the MPEG-1 frames are used as test vectors. The integrated testing includes the complete modified M P E G software program and M P E G frame Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 75 sequences in the testing process. The implemented MPEG-1 coding circuits are downloaded into E V C l - s coprocessor during program execution and the best achieved coding rate are recorded. Also the performance of individual transformable computing MPEG-1 video coding circuits is compared with the performance of the original MPEG-1 software program without the coproces-sor. 5.2.1 Real-Time Testing Procedures and Tools The real-time testing of the transformable video coprocessor employs the following testing tools: • EVCl - s — FPGA-based transformable coprocessor • Unit Test Drivers — C programs that are used for first level unit testing of implemented MPEG-1 function blocks. • MPEG Frame Sequences — files in MPEG-1 format for decoding test and a 150-frame video sequence in luminance and chrominance data format for encoding test. • Software MPEG-1 program ~ modified MPEG-1 programs are used to control the im-plemented MPEG-1 function blocks in EVCl-s coprocessor. • EVCl - s Interfacing functions — C programs that provide the interfacing and down-loading of the MPEG-1 video coding circuits. The run-time system testing procedure is shown in Figure 5.6. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 76 Vail Testing Create U n i t Test Drivers i n C code Obtain sample data from software M P E G programs Download the design into E V C l - s and test with C codef^ „ Ver i fy the correctness of the Ogtput results M P E G software M o d i f y o r i g i n a l | I M P E G software and integrate the design A d d software timing function: into M P E G codes Test original v < M P E G software with M P E G frame sequences M P E G Frames M P E G Circuitry Sample Data Test Driver i n C code X Test modif ied M P E G software / with M P E G frame sequences Sample Results from originaJj M P E G software Compare the coding efficiency System Testing Process Flow Integrated Figure 5.6 System testing process flow. 5.2.2 System Testing Results The unit testing process focuses on the functional verification where the circuits are tested by a set of unit test drivers which are written in the C programming language. For all unit testings, ten different sets of data are used and the results are compared with the original programs executed on the SPARC-10 processor without using EVCl-s as a coprocessor. Run-time analysis showed there is an approximately 3.7% deviation in the DCT/IDCT values computed by the E V C l - s when compared to the DCT/IDCT functions computed by original programs. The main Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 77 source of error is due to truncation of the fractional bits in the EVCl - s implementation. Also, the intermediate results read out from the EVCl - s are stored as integers. The accuracy problem can be easily overcome by using a larger FPGA. The unit tests for other circuits showed that there is no deviation since no truncation of fractional bits is involved. The integration tests use the modified MPEG software programs with transformed MPEG-1 algorithms downloaded into the E V C l - s during execution. Seven different M P E G frame sequences are employed in the system tests and the coding rates using both original software and modified programs with EVCl- s downloading were obtained. The average of ten runs were used to plot the graph in Figure 5.7. The results show that the modified D C T and IDCT programs experienced a slower coding rate compared to the original D C T and IDCT programs. The main reason for the slowdown is the data transfer time across the SBus interface in the EVCl- s . Even though the SBus has a 32-bit data bus, the interface circuitry requires 16 bus cycles (instead of 4) for moving 8 input operands (12-bit each) across the bus. Unfortunately, a faster 2D DCT/IDCT design is too large to be mapped into the FPGA of the EVCl-s . The result demonstrated the trade-off between the speed and design area. Local memory can be utilized to close the speed gap between the data access cycle time. In this case, blocks of data are pre-fetched and maintained in the local memory. The FPGA can then access the 8 12-bit input operands in pairs, which requires 4 access cycles only. The speed obtained by this technique can be significant. For example, So the 2D parallel DCT/IDCT architecture with local memory requires only four data transfer cycles and will achieves 200% speedup over the SPARC-10 processor. Run-time testing results show different speedups in other algorithms, although the speedup factor is not always significant. The M A D algorithm in Motion Estimation process Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 78 exhibits the highest coding rate speedup on the EVCl-s . The accumulative multiplication loop in the algorithm can be computed much faster on the FPGA hardware than the SPARC-10 processor. However, data transfer slowdowns over the SBus offset some of the speed gain. Similarly, run-time testing showed a small speedup in the coding rate of the inverse quantization process. The movement of quantization scale and inverse quantization table values from host memory to E V C l - s slows down the overall coding rate. The coding rate can be improved significantly if the inverse quantization table can be stored in the FPGA. Transformable MPEG Video Coder Speedup Factors 30 20 10 o 0 ts CO I I CD CD Q. CO " - 1 0 - 2 0 - 3 0 - 4 0 - 5 0 I I with regular SBus interface delay r - , with SBus interface delay i i reduced by 50% IDCT DCT HuffmanFrame Recon. MV Recon. In.Quan. MV Est. MPEG Function Blocks Figure 5.7 Speedup factors for various implemented MPEG algorithms with regular SBus interface and a faster version of SBus interface. The motion vector reconstruction process also has a slightly coding rate speedup at about Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 79 10% over the speed of the original software coding rate. The transformable implementation of Huffman decoding process and frame reconstruction process do not have a coding rate advantage over software programs executed on the SPARC-10 processor. The main reason is the large amount of data transfers in the table lookup process. If the lookup tables can be stored in the FPGA, then higher speedup can be achieved. 5.3 Remarks on Performance Enhancement Although the EVCl - s transformable coprocessor performs sufficiently well for computa-tion intensive tasks, the overall performance of the transformed MPEG-1 video processing is limited by inefficiencies in data transfers over the host bus. As noticed from Figure 5.7, if the host SBus interface delay is reduced by 50%, the speedup factors for D C T / I D C T and Huffman decoding implementations will become positive values. In general, most of the algorithms in M P E G video coding are communication intensive and few algorithms (DCT/IDCT, motion estimation) are also computationally intensive. Therefore, successful implementations of MPEG-1 over transformable computers require very efficient techniques for moving data between the F P G A and the driver programs. The current data read and write scheme in E V C l - s cannot provide a satisfactory result and the size of usable logic gate density in FPGA is too small for most M P E G video coding algorithms. The DCT, IDCT, Huffman Coding and Frame Reconstruc-tion algorithms also require an associated memory buffer which can store the intermediate results or coding tables. The XC4013 FPGA used in our experiment does not have an efficient way of implementing the R A M or R O M structure. Hence, the transformable coprocessor coding rates cannot match those from the ASIC M P E G coding circuits. The SBus interface system limits the performance of EVCl-s in communication intensive signal processing tasks, and also prevents the implementation of short pipeline designs on EVCl-s . In summary, implementing algorithms in a Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 80 transformable computing platform requires certain design considerations which are outlined in the next section. 5.4 Transformable System Design Considerations Although transformable coprocessor development process resembles the conventional ASIC system design process, the design of a transformable computing system require extra effort during the design and implementation processes. These efforts can be considered as the transformable system design strategies which are essential to achieve maximum performance gain. The transformable coprocessor system efficiency is also directly related to these strategies. The following subsections outline a number of crucial design parameters in the design and mapping of algorithms for a transformable computing system. 5.4.1 Data Transfer Frequency Analysis The EVCl - s is a transformable coprocessor system which relies on the host bus system to transport the data. The data movement among host processor, host memory, and the coprocessor occur over the same data bus shown in Figure 5.8. In most computer systems, data bus transports are slower than the processing speed of the host processor or the coprocessor, which is also the case for the computer system configuration in this thesis. One solution to this problem is adding extra local memory into the coprocessor system to reduce the amount of data transfer and to pre-fetched important data and reduce long waits by the coprocessor. However, certain algorithms are inherently communication intensive, and local memory cannot help in this case. Therefore, the amount of data transfer between the host memory and the coprocessor must be minimized. In other words, the data transfer frequency is inversely proportional to the coprocessor system performance gain. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 81 Control Flow -fl " \ r / Control Link \ Transformable I . W . Coprocessor 2 EVC1 -s * Data Link / Data Bus Address Bus Memory Module \ Data Transfer Mechanism in EVCl-s Fast Communication Between Coprocessor and Memory Module is Needed Figure 5.8 Data movement between the coprocessor and host memory. In order to determine whether an algorithm can be speeded up by implementing the algorithm in the coprocessor, a data transfer frequency analysis is needed. The analysis examines the data transfer ratio defined as the ratio of the number of data transfer instructions to the number of compute only instructions in an algorithm, or in a mathematical form: DataTransferRatio (DTR) = DataTransferFrequency (Fdata) ComputationalFrequency (Fcomp) (5.1) Data transfer instructions are those instructions that move a value between the host processor memory and the coprocessor algorithm, while the computational instructions are those instruc-tions that operate on data local to the host processor or to the coprocessor only. Basically, a small data transfer ratio indicates a lower dependency on the host data bus system, and therefore, a positive performance gain is possible using the coprocessor. Table 5.1 shows the data transfer Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 82 ratio and their speedups of various algorithms in the transformable MPEG-1 software coding program. Table 5.1 Data transfer ratio and speedups of various algorithms in the transformable MPEG-1 programs using a 150-frame video sequence. Algorithm Decoding/ Kncoding Data Transfer Katio EVCl-s Time (second) SPARC-10 Time (second) Speedup% DCT Encoder 6.5 459 409 -12 IDCT Decoder 7.8 30.8 21.8 -41 Hull man Decoding Decoder 5.4 22.9 22.0 -4 Frame Reconstruction Decoder 4.9 20.4 21.1 3 Inverse Quantization Decoder 4.3 20.6 23.4 12 Motion Vector Recunsuuclion Decoder 4.7 19.4 21.3 9 Mnlion Estima-tion (MAD Algorithm) Encoder 2.7 325 417 22 . From the calculation of the data transfer ratio of each algorithm, we can estimate an approximate speedup factor for the transformable implementation. Figure 5.9 shows the speedup factor and the data transfer ratio for various MPEG-1 algorithm implementations. From Figure 5.9, one can clearly notice that the speedup factor increases as the data transfer frequency decreases. The data transfer frequency analysis in Table 5.1 are performed by assuming that the original software functional partitioning is maintained. However, if the program functional partitioning is changed, the data transfer frequency will change accordingly. The effects of algorithm partitioning are examined in the next sub-section. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 83 Figure 5.9 Data transfer ratio vs speedup factor for MPEG-1 video algorithms. 5.4.2 Algorithm Partition Strategy Due to the FPGA gate density limitation, not every algorithm in MPEG-1 coding program can be implemented in the E V C l - s coprocessor. The percentage of the algorithm that can be implemented is limited. Therefore, we have to decide which part of the algorithm is to be implemented in transformable coprocessor to give the maximum performance gain. This is a typical partition optimization problem with the objective of minimizing the data transfer frequency while maximizing the size of the algorithm partition implemented in the transformable coprocessor. Figure 5.10 illustrates the essence of this optimization problem. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 84 Algorithm A Data Transfer Portion of algorithm | processed in host CPU Portion of • algorithm processed in coprocessor As the area of circle increases, the number of data transfer s between host CPU and the coprocessor decreases Figure 5.10 Algorithm partition problem in transformable coprocessor implementation. The algorithm partitioning strategy is dictated by the following four parameters: • Algorithm size (Aalg0) • Implemented algorithm size (A. j): the logic area of an FPGA that is mapped with an algorithm; • Usable FPGA size in the transformable coprocessor (AFPGA): maximum usable logic area in an FPGA; • Algorithm data transfer ratio (DTR) To optimally partition an algorithm for the transformable coprocessor implementation, A . , must be as close to A , as possible without being larger than A F p G A , and under the Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 85 constraint that DTR is minimum (or below a given value). In other words, the best partition minimizes the data transfer between the host processor and the coprocessor, and at the same time, moves as much of the algorithm as possible to the coprocessor. A mathematical description of the partitioning strategy is given by: (AimPl »Aalgo ) u (AimPl^AFPGA ) u ( D T R = minimum ) (5.2) where » denotes as close as possible and U denotes union. From the study of the partition strategy of an algorithm, we can have a better implementation plan about what portions of the algorithm should be moved into coprocessor and what portions should be executed in host proces-sor. When Equation 5.2 is satisfied (i.e., the DTR is minimum), the speedup factor is maximized. 5.4.3 Pipelining Pipelined processing is used in implementing the MPEG-1 video coding algorithm in the transformable coprocessor. The main issue to be considered here is the pipeline period. The strategy involved in pipelined design for an application is maximizing the pipeline period (Tseg), which also means minimizing the number of pipeline segments (Nseg), as shown in Figure 5.11. Maximizing pipeline period can reduce the data transfer frequency which can effectively increase the performance gain. Thus, an efficient implementation should allocate a large number of computation steps to each pipeline segment in order to maximize the ratio of pipeline processing time to the data transfer time. Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 86 Data Stores Processing Circuits Transformable Pipeline Structure Tses - Pipeline Segment Duration ^seg - Number of Pipeline Segment Figure 5.11 Pipelining structure in transformable coprocessor designs. 5.4.4 Scalability Under current FPGA technology, the programmable gate density increases exponentially. Although not used in this thesis, a single FPGA with over one hundred thousand usable logic gates is currently available commercially (the Xilinx XC40125 series) [57]. To benefit from such advance in F P G A technology, transformable coprocessor designs must be scalable. Scalability relies on a regular and separable system design approach which means that the system can be expanded without significant modification and with linear increase in cost. Parallel processing capability has a significant impact on scalability of FPGA-based processing. To quickly benefit from new F P G A technologies with expanded gate densities, one can use multiple copies of a Chapter 5 Simulation and Performance Analysis of Transformable MPEG-1 Video Coprocessor 87 small design which can operate on multiple sets of data concurrently, thus achieving single-instruction-multiple-data (SIMD) processing. Alternatively, different functional designs can be implemented on the same FPGA to realize a multiple-instruction-multiple-data (MIMD) process-ing. Figure 5.12 illustrates that the scalability of the design enables the transformable coprocessor to form an MIMD computer system. Low Gate Density FPGA Logic Circuit I Rapid Transition from SISD to MIMD for Scalable (Designs I MIMD Computing Machine SISD Computing Machine -.. FPGA Technology Advancement High Gate Density FPGA Logic Circuit A J Logic Circuit r > Logic CircuitB Logic Circuit Logic CircuitC Logic Circuit f N Logic Circuit D . „ Logic Circuit Scalability for Transformable Designs Figure 5.12 Scalable design enables transformable coprocessor to form a MIMD computing machine. Chapter 6 Conclusions This thesis reported the design and performance evaluation of an MPEG-1 video coding implementation on a novel transformable coprocessor platform. The results are intended to uncover the strengths as well as the weaknesses of a typical transformable coprocessor for video processing tasks. This was achieved by carrying out in-depth analysis of the computational and data movement requirements of MPEG-1 processing, and then proposing a systematic approach for algorithm partitioning and mapping on the coprocessor. The thesis also points out important research for future investigation of transformable computing system development. 6.1 Thesis Summary This thesis proposed a new approach of implementing an MPEG-1 video processor using an FPGA-based transformable coprocessor. This approach combines a number of sub-techniques for optimizing the performance of the MPEG-1 video processor. The sub-techniques include programmable hardware design, dynamic hardware reconfiguration, high-level software program-ming and partition, and data movement analysis. Because of physical restrictions on the size of usable logic and interconnect in the core FPGA, as well as timing restrictions on the host bus and bus interface, designing an optimized transformable computer can become quite complex and may result in inefficient implementations. Therefore, this thesis provide an in-depth investigation of the important features of the MPEG-1 video coding standard with particular emphasis on program partitioning through extensive data movement frequency analysis. Data movement frequency plays an important role in determining the efficiency of a particular algorithm mapping in the transformable coprocessor design process. The other major constraint on coprocessor efficiency is the limited logic-gate capacity of the core FPGA. Several MPEG-1 video processing 88 Chapter 6 Conclusions 89 algorithms were chosen as the implementation target on the E V C l - s transformable coprocessor. Different parallel architectures were developed and evaluated to determine the ones with the best delay-area performance for the above algorithms. The major criteria used for performance evaluation was performance speedup factor compared to the software only implementation of the MPEG-1 video coding program. Simulation and testing results showed that the algorithms implemented in the transformable coprocessor exhibited various levels of coding rate speedup. The performance of the algorithms mapped onto the transformable coprocessor is directly related to their data transfer frequency. Additional factors affecting the performance are algorithm partitioning strategy, and the availability of local memory on the transformable coprocessor. Data transfer frequency analysis unveils the potential for performance gain based on data traffic intensity over the host bus. An effective partitioning strategy must minimize the data frequency transfer for a given algorithm. This thesis has demonstrated that by applying these techniques, a transformable coprocessor provides substantial performance enhancement for complex video processing tasks. It should be noted that the FPGA (Xilinx XC4013) used in the EVCl - s is relatively slow and small in size. Therefore, although our results showed limited speedups, this should be taken as an encouraging result. More recent FPGAs have double or triple the capacity of the XC4013 and almost double the speed. With the use of such improved FPGAs, our implementations can achieve significant speedups over the host processor alone. However, it should be emphasized that in many cases, it is the host bus (SBus) interface and absence of local memory that prevents real speedup gains. The availability of faster buses and bus interfaces (such as a PCI bus) can improve performance significantly. Chapter 6 Conclusions 90 6.2 Future Work The performance of transformable coprocessor system can be further enhanced by improving the hardware and software interface architecture. Currently, host-coprocessor interface poses a severe constraint on the data movement rate which causes unavoidable performance degradation. Future studies can investigate other possible hardware and software interfacing schemes. The transformable coprocessor used in this thesis is developed around a relatively small-sized FPGA with only 13,000 available gates. A higher gate-density FPGA can definitely improve the performance curve of the M P E G algorithms implemented onto the transformable coprocessor. Another area that merits further research is algorithm partitioning and mapping as it applies to transformable computing. Automated partitioning programs can be developed based on the data transfer frequency analysis and other pertinent factors. Such tools would help the designer decide on task distribution and scheduling on the host processor and the transformable coprocessor. Current transformable system development involves both reconfigurable hardware and control software which requires the designer to have a wide range of expertise in these areas. A macro library of multimedia applications hardware can be developed to help the designer avoid building the whole system from low level gates and registers. As transformable computing system development is still in the early experimental stages (in the technology maturation process), more research must be done before it can become a widely accepted implementation platform for future multimedia computing. Bibliography [1] Kunihiko Niwa, Takashi Araseki, Takao Nishitani, "Digital Signal Processing for Video", IEEE Circuits and Devices Magazine, Vol. 6, Iss. 1, January 1990. [2] John Schewel, Mike Thornburg, Steve Casselman, "Programming Hardware Objects -Placing digital designs into software applications for use in rapid product development and software acceleration", Virtual Computer Corporation Publication, 1994. [3] Ralf Schafer, Thomas Sikora, "Digital Video Coding Standards and Their Role in Video Communications", Proceedings of the IEEE, Vol. 83, Iss. 6, June 1995. [4] Peter Pirsch, Nicolas Demassieux, "VLSI Architectures for Video Compression - A Survey", Proceedings of the IEEE, Vol. 83, Iss. 2, February 1995. [5] M . Aderson, "VCR quality video at 1.5 Mbits/s", National Communication Forum, Chicago, October 1990. [6] ISO Standard 11172, "Coding of Moving Pictures and Associated Audio", Draft Committee of Standard IS011172, ISO/MPEG 90/176, December 1990. [7] Ruby B. Lee, "Accelerating Multimedia with Enhanced Microprocessors", IEEE Micro, April 1995. [8] Richard Bruno, "PRISM's High Resolution MPEG Chip Set", COMPCON Spring '92, IEEE Computer Society International Conference, 1992. [9] IBM Corporation, "MPEG-2 Digital Video Decoder and Encoder", Microelectronics Product Catalog, 1995. [10] Jan Ozer, "Why MPEG Is Hot", PC Magazine, Vol. 14, No. 7, April 11,1995. [11] Standford Diehl, "Byte's Video Workshop", Byte, May 1995. [ 12] OptiVision Inc., "Optivision OPTIVideo MPEG-1 Encoder: Real-time MPEG-1 Encoding for the Professional", OptiVision Product Catalog, 1996. 91 Bibliography 92 [13] Didier Le Gall, "MPEG: A Video Compression Standard for Multimedia Applications", Communications of A C M , Vol. 34, No. 4, April 1991. [14] Ming-Ting Sun, "Algorithm and VLSI Architectures for Motion Estimation", VLSI Implementations for Image Communication, Elsevier Science Publishers B.V., 1993. [15] Kou-Hu Tzou, "Video Coding Techniques: An Overview", VLSI Implementations for Image Communication, Elsevier Science Publishers B.V., 1993. [16] Dimitris Anastassiou, "Digital Television", Proceedings of the IEEE, Vol. 82, No. 4, April 1994. [17] K. Kawahara, H. Yamauchi, S. Okada, "A Single Chip MPEG-1 Decoder", IEEE Transactions on Consumer Electronics, Vol. 41, Iss. 3, August, 1995. [ 18] Ichiro Tamitani, Mutsumi Ohta, et al., "An Encoder/Decoder Chipset for the MPEG Video Standard", ICASSP '92: Acoustics, Speech & Signal Processing Conference, Vol. 5,1992. [19] Richard Bruno, "PRISM's High Resolution MPEG Chip Set", COMPCON Spring '92 IEEE Computer Society International Conference, 1992. [20] D. Brinthaupt, L . Letham, et al., "A Video Decoder for H.261 Video Teleconferencing and M P E G Stored Interactive Video Applications", IEEE International Solid-State Circuits Conference, 1993. [21] T. Araki, M . Toyokura, et al, "Video DSP Architecture for MPEG2 Codec", ICASSP '94: Acoustics, Speech & Signal Processing Conference, Vol. II, 1994. [22] H. A. Chow, A. Elnaggar, H. M . Alnuweiri, "Highly Parallel Signal Processing on the Virtual Computer", SPIE's Photonic East Symposium, Proceedings of Conference on High-Speed Computing, Digital Signal, and Filtering Using FPGAs, 1995. [23] Maya Gokhale, Ron Minnich, "FPGA Computing in a Data Parallel C", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1993. [24] R. Rainbault, et al, "Fine Grain Parallelism on a MIMD Machine Using FPGAs", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1993. [25] G. Milne, et al, "Realizing Massively Concurrent Systems on the SPACE Machine", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1993. Bibliography 93 [26] D. T. Hoang, "Searching Genetic Databases on Splash 2", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1993. [27] S. A. Cuccaro, C. F. Reese, "The CM-2X: A Hybrid CM-2/Xilinx Prototype", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1993. [28] David E. Van den Bout, "The AnyBoard: Programming and Enhancements", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1993. [29] David E. Van den Bout, et al, "AnyBoard: An FPGA-Based, Reconfigurable System", IEEE Design and Test of Computers, Vol. 9, Iss. 3, September 1992. [30] Steven Casselman, "Virtual Computing and the Virtual Computer", Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1993. [31] Engineers' Virtual Computer: EVCl-s Users Guide, Virtual Computer Corporation, August 1994. [32] H. A. Chow, H. M . Alnuweiri, S. Casselman, "FPGA-Based Transformable Computers for Fast Digital Signal Processing", Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, April 1995. [33] I. S. Reed, et al, "Fourier Analysis and Signal Processing by Use of the Mobius Inversion Formula", IEEE Tran. ASSP, ASSP-38, pp. 458-470, March 1990. [34] I. S. Reed, et al, "A VLSI Architecture for Simplified Arithmetic Fourier Transform Algorithm", IEEE Tran. ASSP, ASSP-40, pp. 1122-1133, May 1992. [35] H. M . Alnuweiri, "The Arithmetic Fourier Transform in VLSI", to appear in IEEE Trans, on Signal Processing, 1997. [36] Plateau Multimedia Research Group, MPEG-1 Software Coding Package, University of California, Berkeley. [37] The Programmable Logic Data Book, Xilinx Corporation, 1994. [38] Peter A. Ruetz, Po Tong, "A 160-Mpixel/s IDCT Processor for HDTV, IEEE Micro, Vol.12, Iss. 5, October 1992. Bibliography 94 [39] Chen-Mie Wu, Andy Chiou, "A SIMD-Systolic Architecture and VLSI Chip for the Two-Dimensional DCT and IDCT', IEEE Transactions on Consumer Electronics, Vol. 39, Iss. 4, November 1993. [40] S. Wolter, D. Birreck, R. Laur, "Classification for 2D-DCTs and A New Architecture with Distributed Arithmetic", Proceedings of IEEE International Symposium on Circuit and Systems, 1991. [41] Avanindra Madisetti, Alan N. Willson, Jr., "A 100 MHz 2-D 8x8 DCT/IDCT Processor for H D T V Applications", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, Iss. 2, April 1995. [42] Darren Slawecki, Weiping Li , "DCT/IDCT Processor Design for High Data Rate Image Coding", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, Iss. 2, June 1992. [43] Yu-Tai Chang, Chin-Liang Wang, "New Systolic Array Implementation of the 2-D Discrete Cosine Transform and Its Inverse", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 5, Iss. 2, April 1995. [44] Nam Ik Cho, Ang Uk Lee, "DCT Algorithms for VLSI Parallel Implementation", IEEE Transactions on Signal Processing, Vol. 38, Iss. 1, January 1990. [45] Weiping Li , "A New Algorithm to Compute the DCT and its Inverse", IEEE Transactions on Signal Processing, Vol. 39, Iss. 6, June 1991. [46] J. R. Jain, A. K. Jain, "Displacement Measurement and Its Application in Interframe Image Coding", IEEE Transactions on Communication, Vol. COM-29, pp. 1799-1808, December 1981. [47] A. N. Netravali, J. D. Robbins, "Motion Compensated Television Coding: part I", Bell System Technical Journal, Vol. 58, pp. 631-670, March 1979. [48] M . H. Ahmad Fadzil, T. J. Dennis, "Video Subband V Q Coding at 64 kb/s Using Short Kernel Filter Banks with an Improved Motion Estimation Technique", Signal Processing: Image Communication, Vol. 3, No. 1, pp. 3-21, February 1991. [49] J. J. Pearson, et al, "Video-rate Image Correlation Processor", SPIE Applications of Digital Image Processing (IOCC 1977), Vol. 119, pp. 197-205,1977. Bibliography 95 [50] J. K. Aggarwal, N. Nandhakamur, "On the Computation of Motion from Sequences of Images", Proceeding of IEEE, Vol. 76, No. 8, pp. 917-935, August 1988. [51] H. C. Bergmann, "Displacement Estimation Based on the Correlation of Image Segments", IEEE Proceedings of International Conference on Electronic Image Processing, pp. 215-219, July 1982. [52] Andy C. Hung, Teresa H.Y. Meng, "Optimal Quantizer Step Sizes for Transform Coders", ICASSP '91: Acoustic, Speech and Signal Processing Conference, 1991. [53] Heonchul Park, Viktor K. Prasanna, "Area Efficient VLSI Architectures for Huffman Coding", IEEE Transactions on Circuit and Systems, Vol. 40, Iss. 9, September 1993. [54] Seung Bae Choi, Moon Ho Lee, "High Speed Pattern Matching for a Fast Huffman Decoder", IEEE Transactions on Consumer Electronics, Vol. 41, Iss. 1, February 1995. [55] Cadence Design Systems Inc., "Verilog-XL Simulation System", Verilog-XL Reference Manual, 1991. [56] X A C T User Manual, Xilinx Corporation, 1994. [57] XCell, Xilinx Corporation, 1996. Appendix G. List of Acronyms ASIC Application Specific Integrated Circuit BMA Block Matching Algorithm CAD Computer Aided Design CCF Cross-Correlation Function CLB Configurable Logic Block CPB Constrained Parameter Bit-stream DCT Discrete Cosine Transform DSP Digital Signal Processing DTR Data Transfer Ratio FPGA Field Programmable Gate Array GOP Group Of Picture HDL Hardware Description Language HDTV High Definition Television IDCT Inverse Discrete Cosine Transform IQ Inverse Quantization ISO International Standards Organization MAD Mean-Absolute Difference M C Motion Compensation M E Motion Estimation MIMD Multiple-Instruction-Multiple-Data MPEG Motion Picture Expert Group MSE Mean-Squared Error 96 Appendix G. List of Acronyms PRA Pel-Recursive ALgorithm SISD Single-Instruction-Single-Data VHDL VHSIC Hardware Description Language V L C Variable Length Coding VLSI Very Large Scale Integration XNF Xilinx Netlist Format Appendix H. Software MPEG-1 Coding Test In this thesis project, MPEG-1 tools from the Plateau Research Group of the University of California, Berkeley will be used as primary basis of the MPEG EVCl- s tools development. One of the reasons for choosing them is that Berkeley MPEG tools are freewares and the source codes are ready available for further modifications. The E V C l - s based M P E G tools require both software and hardware implementation, and hence Berkeley MPEG tools provide a good software foundation for the project. Software Tools (1) Berkeley M P E G tools from Plateau Research Group in University of California, Berkeley consists of 5 different software packages: one decoder and player package, one encoder package and three other statistical analysis package. • mpeg_play The decoder is implemented as a library that will take a video stream and display it in an X window on an 8, 24 or 32 bit deep display. The main routine is sup-plied to demonstrate the use of the decoder library. Several dithering algorithms are supplied based on the Floyd-Steinberg, ordered dither, and half-toning algorithms that tradeoff quality and performance. Neither the library nor the main routine handle real-time synchronization or audio streams. The decoder implements the standard de-scribed in the Committee Draft ISO/IEC CD 11172 dated December 6, 1991 which is sometimes refereed to as "Paris Format." 98 Appendix H. Software MPEG-1 Coding Test 99 • mpeg_encode The encoder implements the standard described in the ISO/IEC Interna-tional Standard 11172-2. The encoder will accept any input file format as long as a script to convert the images to PPM, YUV, JPEG, or JMOVIE format is provided. Op-tions to control input file processing and compression parameters are specified in a pa-rameter file. The encoder also include some useful utilities: programs to do PPM/YUV conversion and programs to convert Parallax XVideo JPEG files into PPM, YUV, or JPEG frames. The motion vector search window can be specified, including half-pixel block matching, in the parameter file. Several search algorithms for P-frames includ-ing: 1) exhaustive search, 2) subsampled search, and 3) logarithmic search are imple-mented. There are also several alternatives for B-frame block matching including: 1) interpolate best forward and best backward block, 2) find backward block for best for-ward or vice-versa (called CROSS2), and 3) exhaustive cross product. The search al-gorithms are controlled by options in the parameters file. The encoder can be run on one computer (i.e., sequential) or on several computers (i.e., parallel). The parallelism is done on a sequence of pictures. In other words, one or more children can be spawn to encode continuous runs pictures. The uncompressed data can be accessed either through NFS or TCP sockets. Although performance depends on the speed of individ-ual processors, the file system and network, and the P/B frame search methods, a rate of 3.75 frames/second is achieved on 8 HP Snakes running in parallel as compared with 0.6 frames/second on 1 Snake. • mpeg_bits It operates by playing a video stream in one window, while another window displays the block sizes type for each block in the current frame. Appendix H. Software MPEG-1 Coding Test 100 • mpeg_stat It is a public domain MPEG video statistics gatherer. As a sample of the ba-sic information mpeg_stat collects, here is the output of mpeg_stat -quiet flower.mpg: • mpeg_blocks It operates by playing a video stream in one window, while another win-dow displays the block type and motion vectors for each block in the current frame. (2) Various MPEG files which will be used for decoder performance studies. (3) Various JPEG, Y U V and PPM format frame sequences which will be used for encoder testing purpose. (4) M P E G program Compiler - gcc compiler (5) Matlab program - statistical analysis, graphs plotting Hardware Tools (1) Xilinx XC4013 series FPGA-based system, EVCl- s - the VLSI platform for the implementa-tion of the M P E G circuits (2) Sun SPACStation 10 - host computer which provides Sbus interface for E V C l - s and acts as a central control and processing unit for the MPEG programs Experimental Testing of Berkeley MPEG-1 Tools In order to have an understanding of the performance of the software-only decoder and encoder performance, original Berkeley MPEG-1 decoder and encoder programs are compiled and tested with several frame sequences. The tests are run in a Sun SPARCStation 10 workstation. Appendix H. Software MPEG-1 Coding Test 101 Decoder Testing In this part of the test, six different M P E G format files are played with Berkeley M P E G decoding program — mpeg_play. The first few tests play a single M P E G file in the workstation with no other tasks running in the workstation. Then two and three M P E G files are run at the same time. The parameter frame-per-second is recorded. These six M P E G files have different frame sequence lengths and different screen sizes. The following table shows the relevant information of these MPEG files: Table 8.1 MPEG frame sequences. File Name Content No of Frame Frame Si/e I.P.B Frame Ratio File Si/e (Byte) Flower, mpg flower garden 150 352 x 240 10,40,100 903636 mobile, mpg toy train 150 352 x 240 10,40,100 718959 sukhoi.mpg aircraft 764 160x120 764,0,0 1534405 waters ki. mpg speed boat 80 336 x 208 6,22,53 417792 model 1.mpg talking woman 126 160x120 126,0,0 1534405 RedsNighl-m arc. mpg anima-tion 1210 320 x 240 41,81,1088 3619896 These M P E G files are decoded and played in the X Window system. The values or average frame-per-second are obtained for single and multiple MPEG players running at the same time. The following figure shows the results of running different numbers of M P E G player with different frame sizes: Appendix H. Software MPEG-1 Coding Test 102 Decoding Speed (Frame/sec) of Berkeley MPEG Player 14 12 10h 8h Frame Size: 352 x 24C Frame Size: 160 x 12C N " N ' N _ • ' 1 . 5 2 2.5 No. of MPEG Player • 3.5 As we notice from the graph, the value for frame-per-second is far from the common video broadcasting standard of 30 frame-per-second even when there is only one M P E G player running. The frame-per-second value deteriorates further when multiple copies of MPEG players are running at the same time. The size of each frame also affects the decoding rate significantly. We can have about 13 to 14 frame-per-second when we are playing those file with 160 x 120 frame size, as compared to approximately 7 frame-per-second for the MPEG file with a 352 x 240 frame size. Encoder Testing Due to the lack of suitable format frame sequence for the MPEG encoder, only one test is performed. The frame sequence consists of 150 consecutive frames of a flower garden in Y U V format. The size of each frame is 352 x 240. The Berkeley M P E G encoder - mpeg_encode is used and the encoder produces all three types (I, P, B) of frames (there are some encoders in the market that do not use P and B frames at all). The whole encoding process takes 6 minutes and 42 Appendix H. Software MPEG-1 Coding Test 103 seconds. Clearly, the encoding process is much slower than the decoding process. A 148-frame sequence only lasts 5 seconds if we play it in 30 frame-per-second, but we need 402 seconds to encode it. This performance is far slower than the real-time requirement. In general, both software-only M P E G decoder and encoder are much slower than a reasonable normal playback and recording process. Hence, some parts of the encoder and decoder system must be implemented in hardware platform. Decoder and Encoder Work Load Distribution In order to have a better understanding of the significance of each function block in the whole M P E G video coding process, the work load percentage distribution is investigated and recorded. A timer program is embedded in M P E G video decoder and encoder programs. The execution duration in each block is used to plot the following graphs. 40 35 30 h M P E G - 1 V ideo Decoder Process ing T ime Distribution 5? 2 5 h E i -c 20 •& <D O o rx 15 10 Diplay & Others Frame: RecomstrUctioh In. Quantization Huffman Decoding M V Ffecofrstrqic Header Process ing j ion IDCT 1 3 4 5 Decoder Function Blocks Appendix H. Software MPEG-1 Coding Test 104 In decoding system, most of the processing time is spent on IDCT, frame reconstruction and video frame display. So the improvement in those parts will likely increase the decoding speed dramatically. On the other hand, the motion compensation, D C T and Huffman coding are taking most of the encoding time. 40 35 30 M P E G - 1 Video Encoder Processing T ime Distribution M V i C o m p e n sation M V Estim. ition D C T Frame Packi i g Qu antii tatio n Header P Huffms rocess ing in Enc •ding 25 C D E . 1 20 cn <s> C D O rx 15 10 3 4 5 Encoder Function Blocks Appendix I. Transformable Coprocessor EVCl-s System Specifications The Engineers' Virtual Computer, Transformable Computer EVCls Top View U8 !|8 O ° o ° o ° 5 ° ° 2 o ° O ° o ° 111 U7 U6 HI °°8 8°§ °S8° o o ggg 0S6 o2o o ° o FCT244| I I I min' I \ o o o o o o o o J7J6I5J4J3 J2J1 U12 OSC o o o o o o o o P4 O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O o o O O O O Q Q Q O O O Q Q U l l U 3 U4§ 4 F „ n Xilinx XC4013 • • ' u n t i l o o o o o o o o o o o o o FCT245 JU1Q) PAL U2 FCT24d 888888888ffggg FCT244 uy PROM u u u o o o o o o o o o u P2L2 Daughter Board Area LED2 LED1 o o o o o o o o o o o o o o o o o o o o o o o o o o o o o o ooooo oo o o ooo O O o o DONE LED ERRINIT LED User Definable LED o o Rear View The EVCl- s by Virtual Computer Corporation is a single Bus board based transformable comput-ing system utilizing Field Programming Gate Array (FPGA) technology. E V C l - s is a SPARC Station compatible system featuring: Standard Single SBus Card Xilinx XC4013 Field Programmable Gate Array Third Generation Field Programmable Logic Device 576 Configurable Logic Blocks (CLBs) 256 bytes of addressable register space 96 Bits of User I/O to Daughter Board 105 Appendix I. Transformable Coprocessor EVCl-s System Specifications 106 • Optional - 2 Megabyte SRAM Daughter Board • SBus Interface & Memory Mapped SBus Driver Memory and I/O Map 0x0000 to 0x3FFFF is the R O M space, reading or writing 0x8000 will reset the Xilinx. Reading a byte from 0x8001 will return the status. BitO is the ERRINIT pin on the Xilinx and Bitl is the D O N E pin on the Xilinx. Address 0x8003 is the configuration port. Writing at this address uses only bitO. With the E V C S L V I F the address 0x10000 - 0xl003F are decoded into EVCADR<15:0> as integer addresses. Table 9.1 General Memory Map Address Description Type Ox 1FFFFFFF Unused OxlOOFFto 0x10000 User Space R/W ().\8003 Configuration Data, Readback Data W R 0x8002 Readback Trigger W 0x8001 Status R 0x8000 Reset R Unused 0\3FFF to 0x0 Boot ROM Status Bits Appendix I. Transformable Coprocessor EVCl-s System Specifications 107 The status bits 0 and 1 are connected to the Xilinx device signal IMT* and DONE respectively. • INTT* — During configuration, a Low on this output indicates that a configuration data error has occurred. • DONE ~ is used to indicate the completion of the configuration process. The configu-ration program determines the exact timing, the clock source for the Low-To-High transition, and enable of the pull-up resistor. Reset -- The RESET port address resets the Xilinx device when read. Readback Trigger — The Readback Trigger toggles the R D B K . T R I G signal which allows configuration data to be read from the Xilinx device. 7 6 5 4 3 2 1 0 1 . T Unused ^ Configuration / Data / Slave Interface Map Table 9.2 Slave Interface Map ()xl()()3C EVCADR<15> RAV 0x10038 EVCADR<14> RAV 0x10034 EVCADR<13> RAV 0x10030 EVCADR<12> RAV 0xH)()2C EVCADR<11> RAV 0x10028 EVCADR<10> RAV 0x10024 EVCADR<9> RAV 0x10020 EVCADR<8> RAV 0x1001C EVCADR<7> RAV 0x10018 EVCADR<6> RAV 0x10014 EVCADR<5> RAV 0x10010 EVCADR<4> RAV OxlOOOC EVCADR<3> RAV 0x10008 EVCADR<2> RAV 0x10004 EVCADR<1> RAV Ox 10000 EVCADR<0> RAV Appendix I. Transformable Coprocessor EVCl-s System Specifications Slave Interface Write Timing 108 SCLK DATIN EVCADR -1 i 3 1 2 3 4 5 6 7 X X X X Programming Commands Table 9.3 EVCl-s Commands Command Descriptions evcreset Resets the EVCl-s (Usage: evcreset n where n is the Sbus Slot #) r2h Reads a rawbits file and creates a .h file of the same name. (Usage: r2h <file-name> Programming Functions Table 9.4 EVCl-s programming functions Function Description EVCdownload Download configurations to the EVCl-s. (Usage: EVCdownload requires 3 arguments: <array name>, <device identifier: 1 = XC4013>, <Sbus slot #> e.g. EVCdownload(rbtest, 1,1> EVCrcset Resets the EVCl-s. (Usage: EVCreset n where n is the Sbus Slot #) EVCrbtrig Toggles RDBK.TRIG on the XC4013 device (Usage: EVCrbtrig(n) where n = <SBus slot#> EVCreadback Returns a single bit of the readback data stream from the XC4013 device. (Usage: EVCreadback(n) where n = <SBus slot#> Appendix J. List of System Design Schematics 109 Appendix J. List of System Design Schematics 110 a "1 s t c a c u u. » X 1 1 s 3 c a LL LL. x u i l l Appendix J. List of System Design Schematics Appendix J. List of System Design Schematics 113 1 F T 1 i I . i i Ii"-iii. •Iii-T i • t r _ i • f 1 1 • I. i f i f Srt—i a J . i • sh i f i f Ttfl ITT i \ r i i t 1 i • 


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items