C o m p u t a t i o n - D i s t o r t i o n O p t i m i z e d D C T - B a s e d V i d e o C o d i n g by Ismaeil R. Ismaeil B. Sc. (Electrical Engineering) Garyounis University, Libya 1990 A. Sc. (System Design Engineering) University of Waterloo, Canada 1994 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T OF T H E R E Q U I R E M E N T S F O R T H E D E G R E E OF D o c t o r o f P h i l o s o p h y in T H E F A C U L T Y OF G R A D U A T E STUDIES Department of Electrical and Computer Engineering We accept this thesis as conforming to the required standard T h e U n i v e r s i t y of B r i t i s h C o l u m b i a May 2000 © Ismaeil R. Ismaeil, 2000 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of <= I GtD^p.L**-. The University of British Columbia Vancouver, Canada DE-6 (2/88) A b s t r a c t The rapidly expanding field of multimedia communications has fueled sig-nificant research and development work in the area of real-time video encod-ing. Dedicated hardware solutions have reached maturity and cost-efficient hardware encoders are being developed by several manufacturers. However, software solutions based on general purpose processors or programmable dig-ital signal processors (DSPs) have significant merits. Towards this objective, we have developed a flexible framework for video encoding that yields very good computation-performance tradeoffs. The proposed framework consists of a set of optimized core components: motion estimation, the Discrete Cosine Transform (DCT), quantization, and mode selection. Each of the components can be configured to achieve a desired computation-performance tradeoff. The components can be assembled to obtain encoders with varying degrees of com-putational complexity. Computation control has been implemented within the proposed framework to allow the resulting algorithms to adapt to the available computational resources. The proposed framework was applied to MPEG-2 and H.263 encoding using Intel's Pent ium/MMX desktop processor. Excellent speed-performance tradeoffs were obtained. 1 1 Contents Abstract ii Contents iii List of Tables viii List of Figures xi Acknowledgements xv 1 Introduction 1 1.1 Digital Image and Video Compression 1 1.2 Outline of The Thesis 6 2 Motion Compensated Video Compression 9 2.1 Introduction 9 2.2 Motion Compensated Video Coding 10 2.2.1 Spatial Redundancy 12 2.2.2 Temporal Redundancy 16 iii 2.2.3 Coding Control 22 2.3 Distortion Measures 23 2.4 Rate-Distortion Optimized Video Coding 24 2.5 The H.263 Video Coding Standard 26 2.5.1 Video Frame Structure 27 2.5.2 H.263 Optional Modes 27 2.6 The MPEG-2 Video Coding Standard 29 2.6.1 Profile and Level 30 2.6.2 Field, Frame and Picture 31 2.6.3 Motion Prediction 32 2.6.4 Frame D C T and Field D C T 34 2.6.5 Bit Stream Syntax and Macroblock Layer 34 2.7 The Test Video Format 37 3 Motion Estimation 43 3.1 Introduction 43 3.2 Motion Vector Prediction 44 3.2.1 Spatio-Temporal Prediction Method 45 3.3 Hierarchical Motion Estimation 50 3.3.1 Hierarchical Block Matching Strategy 51 3.3.2 Motion Search Path 52 3.3.3 Motion Search Stopping Criterion 53 3.3.4 Half-pel Accurate Motion Estimation 56 3.3.5 Interlaced Motion Estimation 57 iv 3.4 Simulation Results 59 3.4.1 Motion Vector Prediction 59 3.4.2 Hierarchical Motion Estimation 64 3.5 Summary 66 4 Mode Selection 67 4.1 Introduction 67 4.2 Rate-distortion Based Mode Selection 68 4.3 Measure-Based Mode Selection 72 4.4 Experimental Results 77 4.5 Summary 80 5 The Quantized DCT (QDCT) 81 5.1 Introduction . .' 81 5.2 Implementation of The D C T and Quantization 84 5.2.1 The Quantized D C T 84 5.2.2 Early Termination QDCT Computation 89 5.3 Applications of The QDCT 93 5.3.1 Application to MPEG-2 and H.263 Encoding 93 5.4 Experimental Results 94 5.4.1 The QDCT 95 5.4.2 Early Termination Q D C T Computation 98 5.5 Summary 102 6 Content-Based Computation-Distortion Optimization 103 v 6.1 Introduction 103 6.2 Encoding Parameters 104 6.3 Content-Based Computation-Distortion Optimization Algorithm 108 6.4 Experimental Results for C Implementation 110 6.5 Experimental Results for Pent ium/MMX Implementation . . . 113 6.5.1 M M X Implementation of The SAD 115 6.5.2 M M X Implementation of The Q D C T 115 6.5.3 M M X Implementation of The Other Components . . . 116 6.5.4 Adaptive Computation-Distortion Optimization of The M M X implementation 117 6.6 Summary 117 7 Computation Control 119 7.1 Introduction 119 7.2 The Problem of Computation Control 120 7.3 Feedback Control System 123 7.4 Simulation Results 124 7.5 Summary 125 8 Conclusion and Future Study 129 8.1 Contribution of The Thesis 129 8.1.1 Spatio-Temporal Motion Vector Prediction 130 8.1.2 Hierarchical Block Matching Strategy 130 8.1.3 Variance-Based Mode Selection 131 8.1.4 The Quantized D C T (QDCT) 131 vi 8.1.5 Adaptive Computation-Distortion Optimization Algo-rithm 132 8.1.6 Computation Control Algorithm 133 8.2 Topics for Future Study 133 8.2.1 H.263 Annexes 133 8.2.2 B-Picture 134 8.2.3 Non-Separable QDCT 134 8.2.4 Better Computation Control 134 8.2.5 Other Media Processor Implementations 135 Bibliography 136 vu L i s t of Tables 2.1 MPEG-2 Profiles and Levels 31 2.2 The available macroblock mode types for P-frame 36 3.1 Coding performance and number of computations for different motion vector predictors 60 3.2 Average number of SAD operations per macroblock and the PSNR for several temporal predictors 61 3.3 Average number of SAD operations per macroblock and the PSNR for the two methods 61 3.4 Average number of SAD operations per macroblock and the PSNR using downsampled macroblocks used to compute the SAD 61 3.5 Average number of SAD operations per macroblock and the P S N R for the two methods 63 3.6 Coding performance and number of computations for different half pel search algorithms 64 4.1 I picture macroblock coding modes 69 viii 4.2 P picture macroblock coding modes 70 4.3 B picture macroblock coding modes 71 4.4 Execution time for the RD and variance based mode selection of MEPG-2 encoding of CCIR601 sequences using a 450 MHz Pentium III processor 77 4.5 Execution time for the RD and variance based mode selection H.263 encoding of CIF sequences using a 450 MHz Pentium III processor 80 5.1 Optimum parameters for the early termination Q D C T used in MPEG-2 encoding 99 5.2 Optimum parameters for the early termination Q D C T used in H.263 encoding 100 6.1 Encoding parameters used for the adaptive computation-distortion optimization 107 6.2 Threshold statistics for MPEG-2 113 6.3 Profiling results platform 116 7.1 MPEG-2 optimum parameters combinations for different encod-ing speeds using a 450 MHz Pentium III machine 121 7.2 H.263 optimum parameters combinations for different encoding speeds of CIF sequences using a 450 MHz Pentium III machine. 122 7.3 Encoding speed-performance results for F O R E M A N sequence on a 450 MHz Pentium III processor 122 ix 7.4 Encoding speed-performance results for A K I Y O sequence on a 450 MHz Pentium III processor 123 7.5 Computation control results for MPEG-2 encoding using a 450 MHz Pentium III processor 125 7.6 Computation control results for H.263 encoding of CIF sequences using a 450 MHz Pentium III processor 127 7.7 Computation control results for H.263 encoding of QCIF se-quences using a 450 MHz Pentium III processor 127 x L i s t of F igures •1.1 Overview of the macroblock coding layer 4 2.1 Block Diagram of a typical motion-compensated D C T video coder. 11 2.2 Example of zig-zag scan pattern to re-order D C T coefficients from low to high frequencies 16 2.3 Block Diagram of a simple frame-differencing coder 17 2.4 Illustration of frames types and dependencies in motion com-pensation 19 2.5 Reordering of frames to allow for causal interpolative coding, (a) Temporal order and (b) Encoding order 20 2.6 Example of motion estimation search window for a block match-ing algorithm 20 2.7 H.263 picture structure at QCIF resolution 28 2.8 Interlaced video frame consists of two fields 32 2.9 MPEG-2 Motion vector type 33 2.10 MPEG-2 D C T type 35 2.11 CCIR 601 video sequences 40 xi 2.12 CCIR 601 video sequences 41 2.13 CIF video sequences 42 3.1 The region of support used for motion vector prediction. . . . 46 3.2 The temporal projection of a motion vector 48 3.3 Motion vector reliability 49 3.4 The hierarchical motion search 51 3.5 The diamond-shaped motion vector search area: (a) fixed cen-ter, (b) floating center 53 3.6 The half-pel motion vector candidates 57 3.7 Motion vector prediction throughout the stages of motion esti-mation 58 3.8 The performance of the two methods for the BUS 62 3.9 The performance of the two methods for the QCIF F O R E M A N sequence 63 3.10 Computation-distortion curves for the MPEG-2 M E parameters. 65 4.1 Variance based mode selection of MPEG-2 coder 74 4.2 Variance based mode selection of H.263 coder 76 4.3 RD vs variance based mode selection of MPEG-2 encoder for (a) B A L L E T , (b) B I C Y C L E and (c) BUS sequences 78 4.4 RD vs variance based mode selection of H.263 encoder for (a) A K I Y O , (b) C O A S T G U A R D , and (c) F O R E M A N (CIF) sequence. 79 5.1 The Chen fast D C T flow graph 90 xii 5.2 Early termination Q D C T computation 91 5.3 The optimized ordering 93 5.4 The rate-distortion curves for conventional D C T and quantiza-tion, Q D C T with b = 10, and QDCT with b = 8 for MPEG-2 (a) and H . 263 (b) 96 5.5 Results for intra approximation: (a) using the default intra ma-trix; (b) using the separable approximation; (c) using a constant quantization step 97 5.6 Computation-distortion tradeoffs for the early termination Q D C T computation: (a) MPEG-2 , (b) H.263 101 6.1 Computation-distortion parameters for MPEG-2 encoding. . . 105 6.2 greedy algorithm 109 6.3 performance of greedy algorithm 110 6.4 Computation-distortion curve for the C language implementa-tion of the MPEG-2 encoder (a) and H.263 encoder (CIF size test sequence) (b) 112 6.5 Computation-distortion curves for the Pen t ium/MMX imple-mentation of the MPEG-2 encoder (a) and the H.263 encoder (CIF test sequence) (b) 118 7.1 Computation control mechanism 123 xm Rate-distortion curves with computation control (solid line) and without computation control (dashed line) for (a) MPEG-2 en-coding of the sequence BUS and (b) H.263 encoding of the CIF sequence F O R E M A N x iv Acknowledgements First, my greatest debt of gratitude goes to my supervisors, Professor Faouzi Kossentini and Professor Rabab Ward, and to Dr. Alen Docef who have introduced and guided me into this exciting field of research, and provided sound advice and invaluable critical feedback at every stage of this project. Without their support, this thesis could not have been written. Second, I want to thank my wife, Naseem, for her patience and support over the past two years. I also want to thank my family for their encouragement and support. Finally, I want to thank all the members of the Signal Processing and Multimedia Group for their valuable discussions, suggestions, and moral support. I S M A E I L R. I S M A E I L The University of British Columbia May 2000 xv Chapter 1 Introduction 1.1 Digital Image and Video Compression The main purpose of image data compression is to represent an image with as few bits as possible and as little distortion as possible. Image compression is essential for many applications such as T V transmission, video conferencing, facsimile transmission of printed material, graphics images or transmission of remote sensing images. In these applications, the rate at which image infor-mation can be transmitted is limited by a maximum channel capacity. Image compression is also essential for applications where pictures are stored in a database, and also where the amount of digital image information exceeds the system's storage capacity. Essentially, image data compression algorithms can be divided into two classes: lossless and lossy. In lossless image compression, sometimes called error free image compression, the original image is encoded so that exact recovery is guaranteed. The most popular methods for lossless 1 coding are Huffman, Lempel-Ziv, arithmetic, and run length coding [1, 2, 3, 4]. These methods exploit the statistical redundancies to encode the image infor-mation. For instance, in Huffman coding, shorter codewords are assigned to the likely symbols and longer codewords are assigned to the unlikely symbols in a way that minimizes the average number of compression bits. Lossless im-age compression methods normally provide compression ratios of 2:1 to 10:1. Significant compression is usually achievable only by lossy image compression algorithms. These algorithms though do not allow the exact recovery of the original image. In most conventional lossy coding algorithms, such as trans-form coding and vector quantization [5, 6, 7, 8], the original image is divided into a large number of small blocks (sub-images). The selection of the sub-image or block size is an important factor since it affects the compression performance and the computational complexity. Much higher compression ra-tios (10:1-32:1) with acceptable fidelity can be achieved using these methods. However, this type of lossy coding may introduce unacceptable artifacts at very high compression ratios (greater than around 32:1). Recently, subband image coding methods have proven to outperform other coding methods es-pecially at low bit rates. In subband image coding, the image is decomposed into several non-overlapping frequency bands [9, 10, 11, 12]. The number of bits allocated for encoding each of the subbands depends on the fidelity re-quirements. Wavelet transform coding is closely related to subband coding in that it provides an alternative method for filter design [13]. Compression ra-tios higher than (40:1) can be achieved using subband coding with acceptable 2 fidelity. In digital video signals, prior to compression, the tremendous amount of data present requires large spaces and wide bandwidths for storage and trans-mission. For instance, the bit rate of a 30 frames/sec CCIR-601 digital video sequence is 248.83 Mbits/sec, and storing just one minute clip of such a video sequence requires 1.87 gigabyte storage space. Without video compression, the demands of digital storage and transmission can not be met by present or near-future communication infrastructures. Standard-based video coding sys-tems have been successfully employed in many audiovisual communications applications, including teleconferencing systems, digital T V broadcasting, and Digital Versatile Disks (DVDs). These coders are based on motion compen-sated prediction and D C T residual coding. Most of these encoders are costly and rigid. Such encoders are implemented either in hardware using current VLSI technologies or in software using parallel computers. Many video encod-ing systems have been designed and implemented using dedicated hardware, and cost-efficient solutions have been proposed by several manufacturers. Over the past few years, several researchers concentrated on developing techniques for video telephone and video conferencing systems that run on a computer and communicate over a regular telephone line, wireless channels or over the Internet. These projects yielded encoder and decoder software with some performance limitations. Usually the decoder is able to provide real-time decoding at 10 to 15 frames per second, and the encoder runs too slow for a realistic application. The goal of this thesis is to develop a DCT-3 Rate Control Motion Estimation Mode Selection Frame/field motion vectors and residuals Coding MB coding mode Quantized DCT coefficients Quantization step Figure 1.1: Overview of the macroblock coding layer. based video encoding framework for the design of flexible, real-time software solutions using general media processors (e.g., Pent ium/MMX). The proposed framework consists of several parameterized core components (motion estima-tion, quantized D C T , mode selection) and a method for developing encoders (based on such components) with different, controlled levels of performance and computational complexity. Motion compensation-based encoders divide video frames into mac-roblocks (MBs), each macroblock consisting of a 16 x 16 block (or four 8 x 8 blocks) of luminance values, and two 8 x 8 blocks of chrominance values. The macroblock coding procedure consists of three consecutive stages: motion esti-mation, mode selection, and coding, as shown in Figure 1.1. The rate control subsystem controls the parameters of the mode selection and coding algo-rithms. We proceed with a brief description of each of these stages. In the motion estimation (ME) stage, the best motion vectors are found, and mo-tion compensated prediction is performed. Hardware video encoders typically use a dedicated co-processor to perform full-search motion estimation. Due to its huge amount of computations, full-search M E is impractical for most software implementations. For example, the full-search M E algorithm used in the TM5 MPEG-2 encoder [14] requires 95 - 99% of the total number of com-putations. Therefore, many efficient statistical and predictive M E algorithms that reduce the computational complexity by up to two orders of magnitude have been proposed [15, 16, 17, 18, 19, 20, 21, 22]. The motion estimation algorithm used in our encoder framework employs a robust motion vector pre-dictor, a hierarchical block matching strategy, and a floating-center diamond search path to achieve the best possible computation-performance tradeoffs. The number of computations is reduced by two to three orders of magnitude at the expense of an insignificant loss in estimation accuracy. The coding mode, including specifically the motion compensation direction, the motion compen-sation type, and the D C T type, is selected based on variance measures of the original and motion-compensated MBs. This method is significantly sim-pler than the optimal rate-distortion-based approach proposed in [23, 24, 25], and it yields a performance level within 0.4 dB of the optimum one. The motion compensation residual blocks are then D C T transformed, quantized, and variable-length encoded. To reduce the number of computations associ-ated with the D C T and quantization steps, explicit quantization is avoided by embedding quantization into the D C T computation, and by predicting, and 5 avoiding the computation of, all-zero quantized D C T blocks. The most attractive feature of the components described above is that they are parameterizable. Each parameter controls the computation-distortion performance of the algorithm by changing the operation of a particular sub-component. A "greedy" design procedure is proposed that yields parameter sets corresponding to the best global computation-distortion operating points for a complete encoder. The proposed method leads to more than a 2:1 reduc-tion in the number of computations without significant loss in subjective and objective quality. The set of computation-distortion parameters can be used to design a computation control mechanism that allows the encoding algorithm to adapt itself to the available computational resources. To prove the useful-ness of the proposed encoding framework, it has been used to implement both an MPEG-2 [26] and an H.263 [27] video encoding system using Intel's Pen-t i u m / M M X desktop processor. Excellent speed-performance tradeoffs were obtained. 1.2 O u t l i n e of T h e Thes is In chapter 2, we first describe the block-based hybrid motion compensated and transform video coding method. Distortion measures employed in video coding are then reviewed. Next, rate-distortion optimizations for video coding are introduced. Finally, the H.263 and MPEG-2 video coding standards are briefly described. In chapter 3, we describe in detail a new motion estimation algorithm 6 that reduces the number of computations at the price of a small performance loss by significantly reducing the number of search points for each macroblock. We also discuss the most important issues relative to the hierarchical motion search process: the starting point, the search path, the stopping criterion, half-pel motion estimation, and interlaced motion estimation. In chapter 4, we describe the optimal rate-distortion-based encoding mode selection. We then describe in detail the variance-based mode selection algorithm employed in our encoding framework. A comparison between the two methods is also presented in the chapter. In chapter 5, we discuss the efficient joint implementation of D C T and quantization, which we call the Quantized D C T (QDCT). The Q D C T embeds the quantization step into the integer multiplications performed in the D C T . We will show how the computational load associated with the D C T and quan-tization can be reduced. We will also show how the proposed techniques offer the user flexible control over the QDCT accuracy and number of computations. In chapter 6, we introduce the content-based computation-distortion op-timization method that allows us to achieve optimum encoding performance for the available computational power. We will discuss in detail the greedy optimization procedure and compare its performance to the full search opti-mization method. In chapter 7, we discuss the computation control method that can be implemented within the encoding framework allowing the optimized algorithm to adapt to the available computational resources. Finally, conclusions and 7 future work are presented in the last chapter of this thesis. 8 C h a p t e r 2 M o t i o n Compensa t ed V i d e o Compres s ion 2.1 I n t r o d u c t i o n The goal of video coding is to remove spatial and temporal redundancies in an image sequence while keeping intact as much perceptually important informa-tion as possible. Removal of temporal redundancies in video signals accounts for a significant percentage of the achieved compression. The classical hybrid motion compensated and transform coding method is employed in all current video coding standards (e.g. H.261, H.263, M P E G - 1 , M P E G - 2 , MPEG-4) [28, 29, 30, 26, 31]. More advanced coding techniques usually provide little additional compression and the additional complexity does not justify this improvement. In this chapter, we first describe the block-based hybrid motion com-9 pensated and transform video coding method. Distortion measures employed in video coding are then reviewed. Next, rate-distortion optimizations for video coding are introduced. Finally, the H.263 and MPEG-2 video coding standards are briefly described. 2.2 M o t i o n C o m p e n s a t e d V i d e o C o d i n g In the hybrid motion compensated and transform video coder, motion com-pensated prediction is first carried out as to reduce temporal redundancies. Transform coding is then applied to the corresponding difference between two frames to reduce spatial redundancies. For highly correlated sources, such as natural images, the compaction ability of the Discrete Cosine Transform (DCT) is very close to that of the optimal transform, the Karhunen-Loeve Transform (KLT). The D C T , unlike the K L T , is data independent. This has made the D C T the most popular transform for image coding, and has thus been the basis of the J P E G still image international standard. The D C T is employed in all current video coding standards. A block diagram of a typical motion compensated prediction and D C T video encoder/decoder is presented in Figure 2.1. Because of its effectiveness and efficiency with respect to algorithm complexity and computational cost, the framework illustrated in Figure 2.1 has been adopted in video coding standards, such as H.261, H.263, MPEG-1 , and MPEG-2 . In the next sections, we describe the building blocks of this video coder. 10 Input Frame (Dotted Box Shoyvs Decoder) DCT, Quantization, Entropy Coding Encoded Residual (To Channel) Entropy Decoding, Inverse Quantize, Inverse DCT Motion Compensated Prediction 4> Motion Compensated Prediction Prior Coded Approximated Frame Motion Estimation and Mode Decision Approximated Input Frame (To Display) Frame Buffer Motion Vector and Predicted Mode Data (To Channel) gure 2.1: Block Diagram of a typical motion-compensated DCT video code 11 2.2 .1 S p a t i a l R e d u n d a n c y Redundancy exists in a video sequence in two forms: spatial and temporal. The former, also called intra-frame redundancy, refers to the redundancy that exists within a single frame of video, while the latter, also called inter-frame redundancy, refers to the redundancy that exists between consecutive frames within a video sequence. Reducing spatial redundancy has been the focus of many image compression algorithms. Since video is composed of a sequence of images, image compression techniques are directly applicable to video frames. Here, we outline some popular image coding techniques applicable to video coding. B l o c k T r a n s f o r m C o d i n g In block-transform coding, an image is divided into blocks. Each block is math-ematically transformed into a different representation, which is then quantized and coded. The mathematical transform is chosen so as to "pack" most of the useful information into a small set of coefficients. The coefficients are then selectively quantized so that after quantization most of the "unimportant" coefficients are driven to have zero values and can thus be ignored, while the "important" coefficients are retained. In the decoder, a dequantization process is followed by an inverse transformation. 12 Discrete Cosine Transform The purpose of the 8 x 8 D C T employed in all current image/video coding standards is to decorrelate the 8 x 8 blocks of original pixels or motion com-pensated difference pixels and compact their energy into as few coefficients as possible. Besides its relatively high decorrelation and energy compaction capabilities, the 8 x 8 D C T is simple, efficient, and amenable to software and hardware implementations [32]. The most common algorithm for implement-ing the 8 x 8 D C T is that which consists of 8-point D C T transformation of the rows and the columns, respectively. The 8 x 8 D C T is defined by Cm,n = a(m)B(n) 2^ 2^ BitJ cos( — ) cos( — ), »'=i i = 1 0 < n < 7 where fl fl 1 < m < 7 a(0) = /?(0) = and a(m) = P{n) = J - for " 1 < n < 7. Here, Bij denotes the (i,j)th pixel of the 8 x 8 original block and Cm,n denotes the coefficients of the 8 x 8 D C T transformed block. The original 8 x 8 block of pixels can be recovered using an 8 x 8 inverse D C T (IDCT) given by Bhi = /\, 1^ Cm>na(m) cos( — )^(n)cos( — j , m = l n = l i b 1 0 0 < j < 7 . Although exact reconstruction can be theoretically achieved, it is often not possible when finite-precision arithmetic is used. While forward D C T errors can be tolerated, inverse D C T (IDCT) errors must meet a minimum level of 13 precision in order to avoid IDCT mismatch between the reconstructed frames at the encoder and decoder. Quantization The human viewer is more sensitive to reconstruction errors related to low spatial frequencies than those related to high frequencies [33]. Slow linear changes in intensity or color (low frequency information) are important to the eye. Sharp, high frequency changes are often not seen and may be discarded. For every element position in the D C T output matrix, a corresponding quan-tization value is computed using the equation C 0 < m < 7 ^ m ' n 0 < n < 7, where C m j n is the (m,n)th D C T coefficient and Qm,n is the (m,n)th integer quantization step size. The resulting real numbers are then rounded to their nearest integer values. The net effect is usually a reduced variance between quantized coefficients as compared to the variance between the original D C T coefficients, as well as a reduction of the number of non-zero coefficients in C . m,n For low bit rate video coding, quantization is usually performed using the same step size within a block (i.e, using a uniform quantization matrix). For example, in H.263, even quantization levels in the range from 2 to 62 are allowed for inter blocks. The quantizers consist of equally spaced reconstruc-tion levels with a dead zone centered at zero. For higher bit rate, high quality 14 applications (e.g. MPEG-2) , a quantization matrix exploiting the character-istics of the HVS is usually employed. After the quantization process, the reconstructed picture is stored so that it can be later used for prediction of the future picture. E n t r o p y C o d i n g Entropy coding is usually performed by means of variable length codes (VLCs). VLCs can be generated using Huffman codes [1]. The V L C s are usually a subset of the complete Huffman tree, since some codewords are not employed in order to avoid the emulation of synchronization words by a combination of VLCs . Arithmetic coding [3, 4] may also be employed as means of entropy coding. Prior to entropy coding, the quantized D C T coefficients are arranged into a one-dimensional array by scanning them in zig-zag order. This re-arrangement places the DC coefficient first in the array and the remaining A C coefficients are ordered from low to high frequency. An example of the scan pattern for baseline H.263 and MPEG-2 is illustrated in Figure 2.2. The re-arranged array is coded using a 3-dimensional run-length V L C table, repre-senting the triplet (LAST, R U N , L E V E L ) . The symbol R U N is defined as the distance between two non-zero coefficients in the array. The symbol L E V E L is the non-zero value immediately following a sequence of zeros. The symbol L A S T is equivalent to the End-of Block flag also employed in 2-dimensional run-length coding, where "LAST = 1" means that the current code corre-sponds to the last coefficient in the coded block. This coding method produces 15 Figure 2.2: Example of zig-zag scan pattern to re-order DCT coefficients from low to high frequencies. a compact representation of the 8x8 DCT coefficients, as a large number of the coefficients are normally quantized to zero and the re-ordering results (ideally) in the grouping of long runs of consecutive zero values. Other information such as prediction types and quantizer indication is also entropy coded by means of VLCs. 2.2.2 T e m p o r a l R e d u n d a n c y Successive frames in a video sequence are typically highly correlated, espe-cially for scenes where there is little or no motion. The spatial decorrelation techniques described in the previous section only operate within a single frame and do not exploit the redundancy that exists between frames. We now review some basic techniques for reducing temporal redundancy. 16 Input _|_ Output Frame Encoder Frame Buffer Frame Decoder Figure 2.3: Block Diagram of a simple frame-differencing coder. Frame Differencing A very simple technique for exploiting temporal redundancy in a video se-quence is to code the difference between one frame and the next. This tech-nique is called frame differencing and is an extension of the basic differential pulse code modulation (DPCM) coding techniques (see, e.g. [34]). A block diagram of an encoder that uses frame differencing is shown in Figure 2.3. If there is little motion between successive frames, frame differencing yields a difference image that is mostly uniform and can be coded efficiently. However, frame differencing fails when there is appreciable motion between frames or when a scene change occurs. Motion Estimation and Compensation Frame differencing can be viewed as a predictive coding technique where the prediction is simply the previously decoded frame. By improving the pre-diction, we can potentially obtain better compression performance. Motion compensation is one such technique that uses a model of the motion of objects between frames to form a prediction. Using this model, the encoder performs 17 motion estimation to determine the motion that exists between a reference frame and the current frame. The reference frame can occur temporally before the current frame (forward prediction) or after the current frame (backward prediction). An advanced technique, called bi-directional prediction, uses two reference frames, one for forward and the other for backward prediction to in-terpolate the results. This usually gives better prediction and handles the case where an object is temporarily occluded. At some initial point, a frame must be coded without motion compensation using only spatial coding techniques. Such a frame is commonly referred to as an intra-coded frame, or I-frame for short. Because they do not take advantage of inter-frame redundancy, I-frames consume more bits than predictive frames of comparable quality. To prevent degradation in image quality caused by the accumulation of prediction errors and to allow for easy random access to frames in a video sequence, frames are periodically coded as I-frames. Frames that are coded using forward pre-diction are called P-frames, short for predictive frames. A P-frame uses as a reference a past I-frame or P-frame. Backward prediction is typically not used exclusively, but as an option for bi-directionally predicted frames, referred to as B-frames. A B-frame is coded from a past reference frame and a future reference frame, as shown in Figure 2.4. At first, this might seem to present a causality problem since there is a dependence upon a future frame. To avert any such problem, the frames are re-ordered so that all reference frames that are required by a B-frame or P-frame come before that frame in the re-ordered sequence. An example is shown in Figure 2.5. In practice, this re-ordering in-18 Figure 2.4: Illustration of frames types and dependencies in motion compen-sation. troduces some encoding and decoding delays and requires two frame buffers to hold the reference frames. For non-real-time applications, such as stored video, the additional delay is not a serious issue. For real-time applications, such as video conferencing, the distance between successive reference frames are kept small so as to reduce the delay. B-frames may be omitted all together to further reduce the delay. A motion model that is commonly used is the block-translation model developed by Jain and Jain [35]. In this model, each video frame is divided into blocks of equal size. Motion compensated prediction assumes that a block of pixels within the current picture can be modeled as a translation of a block from a previous picture [35], as shown in Figure 2.6. Each block is normally predicted from the previous frame. This implies an assumption that each pixel within the block undergoes the same amount of translational motion. 19 Frame Type: Temporal Index: I B B P B B P B B I 1 2 3 4 5 6 7 8 9 10 (a) Frame Type: I P B B P B B I B B Temporal Index: 1 4 2 3 7 5 6 10 8 9 (b) Figure 2.5: Reordering of frames to allow for causal interpolative coding, (a) Temporal order and (b) Encoding order. Motion vector ___.!___:_ ^ T * ~ / / Macroblock in current frame L 1 Search window Best match Figure 2.6: Example of motion estimation search window for a block matching algorithm. 20 This motion information is represented by two-dimensional displacement vec-tors or motion vectors. Due to the block-based picture representation, many motion estimation algorithms employ block-matching techniques, where the motion vector is obtained by minimizing a cost function measuring the mis-match between a candidate block and the current block. Although several cost measures have been introduced, the most widely used in motion estimation al-gorithms is the sum-of-absolute-differences (SAD), presented in Section 2.3. The SAD is defined as N N SAD = EE I BiAkJ) - J) I, (2.1) k=l1=1 where Bij(k, I) represents the (k, l)th pixel of a, N X N block from the current picture at the spatial location and Bi-Uij-V(k,l) represents the (k,l)th pixel of a candidate block from a reference picture at the spatial location displaced by the vector (u, v). To find the block producing the minimum mis-match error, we need to calculate the SAD at several locations within a search window. The simplest, but the most compute-intensive search method, known as the full search or exhaustive search method, evaluates the SAD at every possible pixel location in the search area. To lower the computational com-plexity, several algorithms that restrict the search to a few points have been proposed [36, 37]. One motion vector per block is usually allowed for motion compensation. Sub-pixel motion estimation algorithms can provide a sub-stantial improvement in reproduction quality [38]. Most recent video coding standards allow both horizontal and vertical components of the motion vectors to be of half pixel accuracy. The search window used in motion estimation is 21 often limited by the range of representable motion vector values. A positive value of the horizontal or vertical component of the motion vector represents a block spatially to the right or below the block being predicted, respectively. Motion estimation is usually performed on block sizes of 16x16 or 8x8. 2 . 2 . 3 C o d i n g C o n t r o l The mode decision block in Figure 2.1 represents the intra/inter mode selec-tion. Such a selection is made at the macroblock level 1 . The performance of the motion estimation process, usually measured in terms of the associated distortion values, can be used to select the coding mode. The coding mode where temporal prediction is used is called the inter mode. This mode is selected if the motion compensation process is effective, and only if the pre-diction error macroblock - the difference between original macroblock and the motion compensated predicted macroblock - need be encoded. If temporal prediction is not employed, the corresponding coding mode is called the intra mode. If a macroblock does not change significantly with respect to the ref-erence picture, an encoder can also choose not to encode it, and the decoder will simply repeat the macroblock located at the subject macroblock's spatial location in the reference picture. This coding mode is referred to as skip. More sophisticated coding mode selection algorithms based on rate-distortion ( R D ) optimization methods can also be employed, as discussed in Section 2.4 and in chapter 4. 1 A macroblock consists of four 8 pixels by 8 lines of luminance Y blocks and two 8 pixels by 8 lines of chrominance Cb and Cr blocks. 22 2.3 Distortion Measures In order to evaluate the performance of lossy video coding algorithms, and to optimize the rate-distortion tradeoffs, measures for distortion are necessary. The distortion measures should be both physically meaningful and analytically tractable. However, this is not always possible, specially in video coding, where one has to settle for a less meaningful but more tractable distortion measure. Research in the area of measurement scales for the assessment of the reproduced video quality is on-going [33, 39, 40, 41]. For our purposes, we present in this section, the most commonly used distortion measures in video coding. Distortion measures are usually performed on pixel values, where orig-inal pixel values and reconstructed pixel values are compared. The sum of absolute difference (SAD), as defined by Equation (2.1), is widely used in video coding. Another popular cost measure is the sum of squared errors (SSE), defined by SSE = J2J2 (0i>j(k, I) - ritj(k, I))2 . (2.2) k=l1=1 The average peak signal-to-noise ratio (PSNR) is usually used as a distortion measure to evaluate the reproduction quality at the decoder, and is given by 2 T) C5 2 PSNR = 10log j - ^ , where (2.3) MSE = j^SSE, (2.4) for 8 bits/pixel samples, where the M S E represents the mean-squared-error. The average PSNR of all encoded blocks and pictures is employed when com-23 paring reproduction quality of a video sequence. Even though the M S E does not always correlate well with human perception, it is the most widely ac-cepted distortion measure in image and video coding. Surprisingly, during the JPEG-2000 standardization process, image coders minimizing the M S E were reported to perform best in both perceptual and objective tests [42]. 2.4 R a t e - D i s t o r t i o n O p t i m i z e d V i d e o C o d i n g In this section, we describe the rate-distortion optimized modes of the H.263 coder. The MPEG-2 RD optimized modes will be discussed in chapter 4. A key component in high-compression lossy video coding is the operational control of the encoder. The motion estimation process, quantization step size selection, and the video coding mode selection are the main components in the operational control of the encoder. The process of selection between different possible representations with varying rate-distortion efficiencies can be optimized using Lagrangian minimization techniques [43]. These techniques are based on rate-distortion theory [44]. At the source coding level, the rate-distortion theory sets limits on the achievable output distortion for a given coder output rate, or conversely, sets limits on achievable output rate for a given output distortion. Shannon first described the rate-distortion function in [45], and a thorough discussion of rate-distortion theory is available in [44]. In video coding, the coding modes of operation are generally associ-ated with signal-dependent rate-distortion characteristics, and rate-distortion tradeoffs are inherent in the coding parameters selection process. The op-24 timization task is to choose, for each image block, the most efficient coded representation in the rate-distortion sense. This task is complicated by the fact that the various coding options show varying efficiency levels at differ-ent bit rates and with different scene contents. For example, inter coding is efficient in representing the key changing content in image sequences. How-ever, intra coding may be more efficient in a situation where the block-based translational motion model cannot accurately represent the image sequence changes. For low activity regions of the video sequence, simply using the skip mode may be preferred. By allowing multiple modes of operation, we expect improved rate-distortion performance if the mode selection method is applied judiciously to different spatio-temporal regions of a video sequence. The goal of the video communication system is to achieve the best fidelity (or the lowest distortion D) given the capacity of the transmission channel, subject to the coding rate constraint R(D). This optimization process can be solved using the Lagrangian multiplier method [43] where the distortion term is weighted against a rate term. The Lagrangian formulation of the minimization problem is such that we minimize J = D + XR, (2.5) for a particular Lagrangian parameter A. Each solution to Equation (2.5) for a given value of the Lagrangian parameter A should correspond to a locally optimal solution for a given rate and other constraints. A given value of A represents a specific operating point on the rate-distortion curve. It is possible to obtain an approximate relationship between the quantizer step size Q, which 25 controls the output bit rate, and the optimal value of A. This is particularly useful when a rate control method is employed to achieve a particular video encoder bit rate. In [46], a relationship between Q and A was obtained as where c is a constant that depends on the different coding parameters. This relationship is obtained by recording the quantizer step size Q that minimizes J for a given value of A. For the coding parameters and error free conditions considered in [46], c = 0.85. RD optimized video mode selection has been studied extensively in [24, 25, 46, 47]. 2.5 T h e H.263 V i d e o C o d i n g S t a n d a r d The most popular video coding standard for video communication at low bit rates is H.263 [27]. This is due to its superior compression performance, flexible frame and bit stream structure allowing transmission over a wide range of circuit and packet switched networks, and excellent complexity/performance tradeoffs achievable through the use of different optional modes. Therefore, H.263 is employed for comparison purposes. In this section, we first describe the video frame structure of H.263. Then, we discuss the optional modes of H.263 and H.263 Version 2 (referred to as H.263+). A very detailed discussion of all coding modes of H.263 and H.263+ can be found in [48]. 26 2 . 5 . 1 V i d e o F r a m e S t r u c t u r e H.263 supports five standardized picture formats: sub-QCIF, QCIF, CIF, 4CIF and 16CIF, the specifications of these formats are described in section 2.7. The luminance component, Y , is sampled at the original resolution while the two color difference components, Cb and Cr, are downsampled by 2 in both the horizontal and vertical directions. The picture structure is shown in Fig. 2.7 for the QCIF resolution. Each picture in the input video sequence is divided into macroblocks, consisting of four Y blocks of 8 pixels by 8 lines followed by one Cb block and one Cr block, each consisting of 8 pixels by 8 lines. A Group of Blocks (GOB) is defined as an integer number of macroblock rows, a number that is dependent on picture resolution. For example, a GOB consists of a single macroblock row at QCIF resolution. 2 . 5 . 2 H . 2 6 3 O p t i o n a l M o d e s In addition to the general encoding and decoding algorithms described above, H.263 includes four negotiable advanced coding modes: Unrestricted Motion Vectors, Advanced Prediction, PB-frames and Syntax Based Arithmetic Cod-ing. The first two modes are used to improve inter picture prediction. The PB-frames mode improves temporal resolution with little bit rate increase. When the Syntax Based Arithmetic Coding mode is enabled, arithmetic cod-ing replaces the default V L C coding. The objective of H.263+ is to broaden the range of applications and to improve compression efficiency. H.263+ offers many improvements over 27 176 pels Picture Frame Group of Blocks (GOB) 144 lines GOB 1 GOB 2 GOB 3 GOB 4 GOB 5 GOB 6 GOB 7 GOB 8 GOB 9 MB 1 MB 2 MB 3 MB 4 MB 5 MB 6 MB 7 MB 8 MB 9 MB 10 MB 11 Macroblock Y1 Y 2 Y 3 Y 4 Cb Cr Block 8 lines 57 8 pels 64 Figure 2.7: H.263 picture structure at QCIF resolution. 28 H.263. It allows the use of a wide range of custom source formats, as opposed to H.263, wherein only five video source formats defining picture size, picture shape and clock frequency can be used. This added flexibility opens H.263+ to a broader range of video scenes and applications, such as wide format pictures, resizeable computer windows and higher refresh rates. Moreover, picture size, aspect ratio and clock frequency can be specified as part of the H.263+ bit stream. Another major improvement of H.263+ over H.263 is scalability, which can improve the delivery of video information in error-prone, packet-lossy, or heterogeneous environments by allowing multiple display rates, bit rates, and resolutions to be available at the decoder. Furthermore, picture segment2 dependencies may be limited, likely reducing error propagation. A n additional 12 optional modes are also introduced. These optional modes allow developers to carry tradeoffs between compression performance and complexity. 2.6 T h e M P E G - 2 V i d e o C o d i n g S t a n d a r d MPEG-2 [26] has been a very successful audiovisual communication standard. Decoders that claim conformance to it have already been produced by the millions, and many new MPEG-2 applications have already been created. MPEG-2 is aimed at a wide range of applications such as television broad-casting, digital storage media and digital high definition T V (HDTV) . Like any other video standards, MPEG-2 is based on motion compensated pre-2 A picture segment is defined as a slice or any number of GOBs preceded by a G O B header. 29 diction and D C T residual coding. However. MPEG-2 offers a more efficient means to code interlaced video signals and supports scalable video coding. The readers should be aware that MPEG-2 is a complex standard. The description we provide in the following subsections is only an overview of its video part. 2 . 6 . 1 P r o f i l e a n d L e v e l The M P E G specification is intended to be generic in the sense that it serves a wide range of applications. For example, coded MPEG-2 bit rates of up to 400 Gigabit/sec and picture sizes of up to 16,000 by 16,000 pixels can be defined. Since the MPEG-2 syntax has such great flexibility, it is not feasible to make a decoder that decodes the full syntax of the standard. In order to cope with the great variety of video applications within one coding standard, MPEG-2 adopts a "toolkit-like" approach; that is, MPEG-2 is a collection of tools which satisfies the requirements of specific major applications. The range of coding support provided by MPEG-2 is divided into profiles and levels. For each profile/level, MPEG-2 provides the syntax for the coded bit stream and the decoding requirements. A profile is a defined subset of the entire bit stream syntax specified by MPEG-2 . The MPEG-2 profiles can be divided into two categories: non-scalable and scalable. The two non-scalable profiles are the Simple and Main profiles. The three scalable profiles are the SNR scalable, Spatially scalable, and High profiles. Within a profile, a level is defined as a set of constraints imposed on the parameter of the bit stream. For each profile, the four levels are the low, Main, High-1440, and High levels. The 30 Profile Level S I M P L E M A I N S N R S P A T I A L H I G H 4:2:0 4:2:0 4:2:2 H I G H 1920 X 1152 80 M b / s I ,P ,B 1920 x 1152 100 M b / s 1,P,B 4:2:0 4:2:0 4:2:0 4:2:2 H I G H 1440 1440 x 1152 60 M b / s I ,P ,B 1440 X 1152 60 M b / s I ,P ,B 1440 x 1152 80 M b / s 1,P,B 4:2:0 4:2:0 4:2:0 4:2:0 4:2:2 M A I N 720 X 576 720 X 576 720 X 576 720 X 576 15 M b / s 15 M b / s 15 M b / s 20 M b / s I ,P I ,P ,B I ,P ,B 1,P,B 4:2:0 4:2:0 L O W 352 x 200 4 M b / s I ,P ,B 352 x 200 4 M b / s I ,P ,B T a b l e 2.1: M P E G - 2 Prof i les a n d Leve l s . cons t ra in ts of the c o d i n g parameters for the var ious p r o f i l e / l eve l c o m b i n a t i o n s are s u m m a r i z e d i n T a b l e 2.1. Fo r ins tance , the S i m p l e profi le at the M a i n leve l suppor t s the color s a m p l i n g format 4:2:0, tha t w i l l be desc r ibed i n Sec t ion 2.7. T h i s profi le also suppor ts image resolut ions of up to 720 X 576 p ixe l s , i t does not a l low the use of B p ic tures , and i t targets a b i t ra te of 15 M b / s . 2 . 6 . 2 F i e l d , F r a m e a n d P i c t u r e M P E G - 2 suppor t s b o t h progressive and in t e r l aced v ideo (see F i g u r e 2.8). Fo r in t e r l aced v ideo , fields can be coded separate ly (field p ic tures ) or 2 fields can be in t e r l aced a n d coded as one p i c tu r e (frame p i c tu r e ) . B o t h f rame and f ie ld p ic tures m a y be used i n a single v ideo sequence. M P E G - 2 defines three types Bottom Field Figure 2.8: Interlaced video frame consists of two fields. of coded frames: I-frames, P-frames, and B-frames. A coded I-frame consists of an I-frame picture, a pair of I-field pictures, or an I-field picture followed by a P-field picture. A coded P-frame consists of a P-frame picture or a pair of P-field pictures. A coded B-frame consists of a B-frame picture or a pair of B-field pictures. If the first field of an I-frame is an I-field picture, the second field can be a P-field picture, provided that prediction is to be based on the first I-field picture. In this thesis, the MPEG-2 video sequences used in our simulation results are interlaced frame pictures. 2 . 6 . 3 M o t i o n P r e d i c t i o n In MPEG-2 , motion prediction depends mainly on the type of the picture (frame picture or field picture). For frame picture, there are three motion prediction types: frame prediction, field prediction, and dual prime prediction. The dual prime prediction is applicable only to a P-picture with no B-picture between the P-picture and its reference picture. Once the motion vector for a 32 Frame motion vector 16x16 Frame motion vector Field motion vector 16x8 Top — " " Top Bottom Top field motion vector - Bottom fiel motion vector Figure 2.9: MPEG-2 Motion vector type. macroblock in a field of given parity is known relative to a reference field of the same parity, it is extrapolated or interpolated to obtain a prediction of the motion vector for opposite parity reference field. For field pictures, there are also three motion prediction types: field prediction, 16 x 8 motion compensation, and dual prime prediction. The pre-diction type can be specified at the macroblock layer, which means the encoder has the latitude of selecting a different motion estimation/compensation type for each macroblock. In what follows, we only discuss frame prediction and field prediction in the context of frame pictures. Figure 2.9 shows the two motion prediction types. The shaded area represents the top field pixels and the clear area represents the bottom field 33 pixels. For frame prediction, only one motion vector will transmitted for a macroblock. For field prediction, the encoder and the decoder assume that the macroblock consists of two fields, and they will de-interlace both the 16 x 16 macroblock and the reference frame into top and bottom fields as shown in Figure 2.9. For each field, the encoder sends one motion vector and the corresponding side information to indicate the field type. 2 . 6 . 4 F r a m e D C T a n d F i e l d D C T To fully exploit the interlaced video properties, two D C T types, frame D C T and field D C T , are defined in MPEG-2 . For frame D C T , the encoder performs the D C T directly on each of the four 8 x 8 blocks (see Figure 2.10). For field D C T , the encoder assumes that the macroblock consists of two fields and de-interlaces the 16 x 16 luminance blocks in advance (see Figure 2.10). The D C T type can be specified at the macroblock layer. 2 . 6 . 5 B i t S t r e a m S y n t a x a n d M a c r o b l o c k L a y e r In MPEG-2 , the structure of the bit stream syntax depends on the profile adapted by the application. As we will concentrate on only the main pro-file/main level part of the standard, we will describe the the bit stream syntax for the main profile. This is the main stream MPEG-2 . The MPEG-2 bit stream of the main profile is a superset of the M P E G -1 bit stream syntax. This non-scalable profile is aimed at higher bit rates (4 Mbps to 15 Mbps for CCIR 601 video) and is able to handle interlaced 34 Frame DCT of the luminance macroblock mmmmmm mmmmmm mmmmmm mmmmtm ^^^^^^^^^^^^ mmmmm mmmmmm •mmmmm immmmm mmmmm mmmmmm mmmmmmz Field DCT of t mmmmm ^^^^^^^^^^^^ immmmm •mmmmmm mmmmmm mmmmmm mmmmmm ^^^^^^^^^^^ mmmmm Figure 2.10: M P E G - 2 D C T type. video formats. The bit stream syntax for the main profile consists of six lay-ers: sequences layer; group of pictures (GOP) layer; picture layer; slice layer; macroblock layer; and block layer. These layers have been created to satisfy many important requirements such as coding flexibility, ease of implementa-tion, channel error resilience by way of re-synchronization markers. The range and values of the layer parameters depend mainly on the level. From a coding viewpoint, the most important layer is the macroblock layer. As blocks within the macroblock cannot be encoded differently, a macroblock is the smallest coding unit. In what follow, we will briefly discuss the various modes in I-frame and P-frame for MPEG-2 ' s main profile. Here, the mode is defined by the 35 Mode Name Group Motion Type D C T Type Fr-Intra I N T R A Frame Fd-Intra I N T R A Field Skip FrMC I N T E R Frame F d M C I N T E R Field FrDCT I N T E R + D C T Frame F d D C T I N T E R + D C T Field FrMC-FrDCT I N T E R + D C T Frame Frame FrMC-FdDCT I N T E R + D C T Frame Field FdMC-FrDCT I N T E R + D C T Field Frame F d M C - F d D C T I N T E R + D C T Field Field Table 2.2: The available macroblock mode types for P-frame. coding operation applied on the macroblock. The coding operations are (1) intra/inter, (2) frame/filed motion prediction, and (a) frame/field D C T . As specified in the MPEG-2 standard, in an I-frame picture, two modes are sup-ported. They are intra frame-DCT and the intra field-DCT modes. If the quantization step of the current macroblock differs from that of the previ-ous macroblock, the encoder will send 5 bits to indicate the new value of the quantization step. In P-frame pictures, there are three principle modes of op-erations at the macroblock layer: (1) intra, (2) motion only, and (3) hybrid motion compensated prediction and D C T coding. However, as motion predic-tion and/or D C T coding can be applied in a frame or field mode of operation, at least eleven modes of operation are possible. Table 2.2 lists the eleven modes for P-frame picture. The B-frame picture modes are described in Chapter 4. 36 2.7 T h e Test V i d e o F o r m a t To promote the interchange of digital video data, several formats for repre-senting video data have been standardized. We now review some of the most popular standard representations. The CCIR-601 [49] format for video frames specifies spatial sampling of 720 x 480 pixels and temporal sampling at 30 frames/sec for NTSC (U.S. and Japan) television systems and 720 x 576 pixels at 25 frames/sec for P A L (Europe) television systems. Color is represented using three components: a luminance (Y) component and two chrominance components (Cb and Cr). The lumi-nance component encodes the brightness or intensity of each pixel and the chrominance components encode the color values. Each component is quan-tized linearly using eight bits. For NTSC (respectively, PAL) , there are 720 x 480 (720 x 576) luminance values, one for each pixel, and 360 x 480 (360 x 576) values for each chrominance component. The chrominance components are subsampled horizontally with respect to the luminance component to take advantage of reduced human sensitivity to color. This subsampling process is referred to as the 4:2:2 format. The Source Input Format (SIF) specifies spatial sampling of 360 x 240 (re-spectively, 360 x 288) and temporal sampling at 30 (25) frames/sec for NTSC (PAL) television systems. As with CCIR-601, color is represented using three components: Y , Cb, and Cr. Each component is quantized linearly using eight bits. For NTSC (respectively, PAL) , there are 360 x 240 (360 x 288) lumi-nance values, one for each pixel, and 180 x 120 (180 x 144) values for each 37 chrominance component. This subsampling format is referred to as the 4:2:0 format. One drawback with the CCIR-601 and SIF formats is that they spec-ify different spatial and temporal sampling parameters for NTSC and P A L systems. As its name suggests, the Common Intermediate Format (CIF) was proposed as a bridge between NTSC and P A L . As with CCIR-601 and SIF, color is represented using three components, each quantized linearly using eight bits. The CIF format uses 4:2:0 color subsampling with an image size of 352 x 288. Temporal sampling is set at 30 frames/sec. For use with P A L systems, the CIF format requires conversion of the frame rate to 25 frames/sec. For NTSC systems, a spatial resampling may be necessary. For video conferencing and other low-bit-rate, low-resolution applications, a scaled down version of CIF called Quarter-CIF (or QCIF) is commonly used. QCIF specifies an image with half the resolution of CIF in each spatial dimension: 176 x 144. For many low-bit-rate applications, the frame rate is reduced from 30 frames/sec to as low as five frames/sec. The MPEG-2 video sequences used in our simulation experiment con-form to the CCIR 601 NTSC format. To obtain statistically relevant results, we used six video sequences with various amounts of motion, varying from relatively slow motion (BALLET) , to fast motion (BUS), complex motion (NBA) and structured motion (TABLE TENNIS, GARDEN and BICYCLE). The target bit rates ranged from 3 Mbps to 9 Mbps. We show snap shots of the above six sequences in Figures 2.11 and 2.12. For H.263 encoding, we used four 38 Q C I F / C I F sequences: F O R E M A N (complex motion) , A K I Y O and N E W S (slow motion), and C O A S T G U A R D (fast motion), which have been encoded at the target rates of 64 kbps, 128 kbps, and 256 kbps. Snap shots of the above three sequences are shown in Figure 2.13. 39 (a) B A L L E T (c) NBA Figure 2.11: CCIR 601 video sequences. 40 (a) T A B L E TENNIS (c) B I C Y C L E Figure 2.12: CCIR 601 video sequences. 41 42 Chapter 3 Mot ion Estimation 3.1 I n t r o d u c t i o n In block based video encoders such as MPEG-2 [26] and H.263 [29], block matching algorithms (BMAs) are used for motion estimation (ME). During encoding, motion estimation is usually the most computationally intensive task. The objective of motion estimation is to find the best match for a mac-roblock in the current picture from a set of candidate macroblocks in an often restricted search area in the previously encoded frame. The sum of absolute differences (SAD) is the most widely used matching criterion. The full-search block matching algorithm (FS-BMA) guarantees that the minimum SAD value is obtained, but this method is computationally demanding. Effective and effi-cient motion estimation (ME) is an essential requirement for both good video quality and low computational complexity. Several fast algorithms have been proposed in the literature [22, 50, 51, 52]. These algorithms achieve sub-43 optimal compression performance with a significantly reduced computational complexity. Computational complexity is reduced by decreasing the number of search locations, and/or by using subsampled macroblocks to calculate the SAD for each location. In this chapter, we first propose a new hierarchical motion estimation (ME) algorithm that reduces the number of computations at the price of a small performance loss by significantly reducing the number of search points for each macroblock. Next we discuss the most important issues related to the hierarchical motion search process: the starting point, the search path, the stopping criterion, half-pel M E , and interlaced M E . 3.2 M o t i o n V e c t o r P r e d i c t i o n The performance of any fast motion estimation algorithm is directly affected by the accuracy of the predicted motion vector (PMV) , which is used as a starting point in the search. If the predicted motion vector is far from the best-match motion vector, the proposed search algorithm will either get quickly trapped into a local minimum (also impacting negatively motion vector prediction ac-curacy for the subsequent macroblocks), or the algorithm will perform too many computations to find the best-match motion vector. A popular predic-tor is the median of three previously coded motion vectors corresponding to the macroblock to the left (A), above (B) and above-right (C) of the current macroblock (X), as shown in Figure 3.1. The median predictor is very simple, yet its accuracy is quite high. Other more sophisticated predictive techniques [16, 22, 53], which exploit motion vector dependencies in the spatial and/or 44 temporal directions, can improve the prediction accuracy. Extensive research has been conducted to develop techniques that increase the prediction accuracy of the motion vector [54, 55, 56, 57]. These prediction techniques exploit mo-tion vector dependencies in the spatial and temporal domain. In this section, we present a new motion vector prediction technique based on spatio-temporal motion vector prediction. 3.2.1 S p a t i o - T e m p o r a l P r e d i c t i o n M e t h o d In moving video, the motion of an object has high correlation in the spatial and temporal domains. Many motion prediction methods that exploit this spatio-temporal correlation have been proposed in the literature [54, 55]. In this thesis, we combine both spatial and temporal prediction to obtain an estimate of the motion vector for the current macroblock. Next we discuss the spatial and temporal predictors and the predicted motion vector ( P M V ) selection algorithm. S p a t i a l P r e d i c t i o n A popular spatial predictor is the median of the three previously coded motion vectors corresponding to the macroblock to the left (A), above (B) and above right (C) as shown in Figure 3.1. The median predictor is very simple and its accuracy is high only for the sequences with a relatively uniform motion [50]. Several spatial predictors are considered in this section, using various macroblocks in the region of support depicted in Figure 3.1. The motion vec-45 I„ F 1 D B C E A X i Figure 3.1: The region of support used for motion vector prediction. tors corresponding to each of these macroblocks are considered as candidates for the spatial P M V . The block matching function is then computed for the current macroblock using each candidate motion vector. The spatial P M V is set to be equal to the candidate motion vector which yields the best match. Experimental results show that the computational overhead of computing the extra SADs is offset by the increased M V prediction accuracy. Temporal Prediction For sequences with small and/or fast moving objects, the accuracy of the spatial predictor may deteriorate. To solve this problem, temporal motion prediction methods have been proposed that attempt to exploit the temporal correlation in the motion field. Most of the temporal prediction methods proposed in the literature compute the P M V as the motion vector of the macroblock in a previously encoded frame that has the same spatial location as the current macroblock. In this thesis, we propose a temporal predictor based on a projection technique, and we compare its performance with the simple predictor described above. 46 The temporal projection tracks the motion vector of a moving object from one frame to another. As shown in Figure 3.2, the motion vector Vn-\ of macroblock Bn-\ in the previously encoded frame is projected on the macroblock Bn in the current frame In. The motion vector Vn-\ is used as a candidate vector in the relevant macroblock Bn. The motion vector K i - i will be used in the determination of the starting point for the motion estimation search algorithm. When projecting the motion vector for frame J n _i to frame /„ , the motion vector may point to a 16 x 16-pixel square that is not a mac-roblock, that is, its upper left corner does not have coordinates that are multi-ples of 16. The projected motion vector is then assigned to the macroblock in frame /„ that best overlaps the projected 16 X 16-pixel square. Note that the overlap area for the closest macroblock contains at least an 8 x 8-pixel block. Ideally, one temporal P M V is obtained for each macroblock in the current frame. For some macroblocks, no temporal prediction is available and motion vector prediction relies on the spatial prediction only. Another irregular situa-tion occurs when more than one motion vector points to the same macroblock in the current frame. In this case, we choose the motion vector based on a measure of its reliability measure, which is defined as follows. As shown in Figure 3.3, we assume that the motion vector Vn-i, predicted between the current frame In and the previous frame In-i, is reliable if Vn-\ = Vn-2, where Vn-2 is the motion vector detected between frames In-2 and In-i- To estimate the reliability R of the predicted vector, we use the following equation, R= (MSE(Vn-2,Vn-1) + - p ) , (3.1) y Overlap area J 47 I . , " %, i 1 1 i Figure 3.2: The temporal projection of a motion vector. where M5£(14_2 , 14- i ) is the mean square error between the vectors Vn-2 and Vn-i. The overlap area is measured as the number of pixels common between macroblocks 5„_i and Bn if we assume that Bn-i is displaced by the vector Vn-\. The reliability measure will increase if the motion vector belongs to a moving object with constant speed. In the case where more than one motion vector points to the same macroblock in the current frame, we choose the one with the higher reliability measure R. Predicted Motion Vector Selection An inaccurately predicted motion vector can have a significant negative im-pact on the motion estimation algorithm. To choose between the temporal 48 I o-2 n-2 n-2 i 1 1 r — i I n - l 1 i i " 1 i 1 i i i i I„ i i i i r r v r r ^ v n - l r I I i i i Figure 3.3: Motion vector reliability. 49 and spatial P M V s , we use the same procedure used to select the spatial P M V described above. If the temporal P M V exists, then the block matching func-tion is computed for the current macroblock using the the temporal P M V and the corresponding SAD is compared to the SAD obtained using the spatial P M V . The P M V that yields the smallest SAD is chosen. When a highly-optimized motion estimation algorithm is employed, the over-head of computing the SADs for P M V selection may become significant. In this downsampled SAD may be computed for the purpose of P M V se-lection. Using the downsampled SAD, the number of operations in the P M V selection is reduced by 50% for a 2:1 vertically downsampled SAD and by 75 % for a 2:1 vertically and horizontally downsampled SAD. The effect of using a downsampled SAD on the P M V accuracy is insignificant, as evaluated in section 3.4. 3.3 H i e r a r c h i c a l M o t i o n E s t i m a t i o n The disadvantage of using fast motion estimation algorithms, such as the 2D logarithmic search method [50] and the three-step search method [51], is that they often get trapped in local minima due to the implicit assumption of a monotonic matching function. In order to solve this problem and also reduce the number of computations, we implemented a hierarchical search algorithm [22]. The hierarchical search algorithm reduces the number of computations by utilizing lower resolution versions of the current and reference picture, obtained by downsampling by a factor of two in both directions. The first search step 50 Level 1 Figure 3.4: The hierarchical motion search. is performed at the lower resolution, where the image and the macroblocks are one fourth their original size. Either full-search or fast-search algorithms can be used in this step. To maintain the computational gains, we used the fast floating-center diamond search described later. Starting at the best match point found, another search is performed in the original picture with a reduced search area. The hierarchical motion search method is depicted in Figure 3.4. 3.3.1 Hierarchical Block Matching Strategy If the P M V itself provides a very accurate motion compensated predictor, i.e. if the corresponding SAD value is relatively small, motion estimation can be 51 stopped and the current M V is then set to the P M V . More precisely, if SADPMV < Tu then motion estimation is stopped for the current M B . Otherwise, motion estimation proceeds in a hierarchical manner starting with the predicted M V . The first step of our motion search algorithm is performed on a lower resolution frame. If the low-resolution minimum SAD is relatively small, that is, if SADl0W < T 2 , motion estimation ends. Otherwise, a full pel vector search is performed in the original picture, with the origin of the search located at the up-sampled best-match vector of the lower resolution. 3 . 3 . 2 M o t i o n S e a r c h P a t h From the starting point, the search proceeds according to a search pattern (path). The search path is made of candidate points located on diamond-shaped contours containing the four immediate neighbors of the previous search center. Two types of search centers are considered, the fixed-center and the floating-center search. In the fixed-center search, as illustrated in Fig-ure 3.5(a), the diamond-shaped search ares is divided into contours or layers, each of which represents a set of candidate motion vectors. The layers 0, 1,... are then searched sequentially starting with the layer closest to the center (layer 0) and moving outward. An example of a possible floating-center search 52 o o o o o o o -e-o o o o o o • o o • • • • • • o Oy • o o 0 s o o o o o o o o o o o , 0 0 0 layer 0 p o o layer 1 O O O layer 2 O O O O O O O O O O O O O O O O O O O O O O O o o o o o o o o 0 • • • o o • • • H • o • • — • o o • • o o o o o o o PMV o o—e—e—e—o o (a) o o o o o o o (b) Figure 3.5: The diamond-shaped motion vector search area: (a) fixed center, (b) floating center. is shown in Figure 3.5(b). Unlike the fixed-center search path, the floating-center search path changes dynamically, and the associated search process is more complicated than that of the above search path. The corresponding algo-rithm searches only the layer whose center is the "best" motion vector found at the previous layer. The advantage of floating-center search is that the number of search points in the first layer is equal to 4 and at most equal to 3 in the proceeding layers. Therefore, this method is inherently more efficient than the fixed-center diamond search method. The floating-center search path is used throughout the thesis. 3.3.3 Mot ion Search Stopping Criterion In a straightforward implementation of the algorithms we just described, the search stops when the current search area is outside the allowed search region or when the SAD corresponding to the best candidate in the search area of 53 the current step is larger than the SAD of the best candidate at the previous step. To reduce computations further, we employ a modified cost function, expressed in terms of the SAD biased by an explicit computational constraint. Analytically, this function is defined as the Lagrangian Jp = SAD + 8C, where C is the total number of operations performed so far for the current M B motion search. We count one operation as one addition, one subtraction, and one absolute value computation. If C is not efficiently computed it will add computation overhead. We update C after the SAD operation of the whole macroblock is computed. For each M V candidate, we initialize C as the number of operations per-formed before considering the candidate and we increment it after each block matching function computation. The search process is terminated for the cur-rent M B when the minimum Lagrangian in the current layer is larger than the minimum Lagrangian of the previous layer. We call this motion estimation algorithm the distortion-computation optimized ME. Notice that, since the number of computations C is always increasing, the term 8C effectively cuts short the search process in the case when the SAD decreases slowly towards a minimum. The Lagrangian parameter 8 allows us to trade image quality for coding speed. For 8 = 0 , we obtain the straightforward implementation described previously. For large values of 8, the motion search is very fast and a relatively crude estimate of the motion vector is produced. This stopping criterion is very effective because candidate motion vectors away from the cen-54 ter are quickly eliminated from consideration. Such a behavior is expected as motion vectors are usually localized within a small neighborhood of the pre-dicted motion vector, and the number of computations increases substantially as more layers are examined. A more sophisticated rate-computation-distortion M V selection crite-rion introduced in [37] has been tested within the proposed encoding frame-work. This criterion attempts to simultaneously optimize the number of com-putations C , the motion compensated (MC) prediction error SAD, and the number of bits R needed to encode the M V . It does this by using a stopping criterion that takes into account all three factors: Jp = SAD + 0C + XR. This approach has been deemed not beneficial for our framework for two rea-sons. First, each candidate M V has to be encoded using variable-length encod-ing to obtain the rate R. This introduces an overhead that can not be justified by the relatively small increase in performance of about 0.15 dB. Second, and perhaps more important, is that the motion estimation for the current M B would then require knowledge of the coding mode of the neighboring mac-roblocks to perform the differential encoding of the current M V . This makes impossible the de-coupling of the motion estimation step from the rest of the encoding process. Decoupling is necessary for a parallel implementation of the encoder when using different media processors for the different functions. 55 3.3.4 H a l f - p e l A c c u r a t e M o t i o n E s t i m a t i o n Both MPEG-2 and H.263 encoders support half-pel resolution motion compen-sation, where an interpolated version of the previous frame is used to provide a better prediction of MBs corresponding to objects that moved by a fractional number of pixels. Half-pel M E is not always necessary. If the full-pel M V produces a small SAD, that is, SAD futi-pel then half-pel M E can be avoided. Half-pel M E is typically implemented by testing the eight half-pel loca-tions neighboring the selected full-pel motion vector. To reduce the associated number of computations, we have implemented the half-pixel approximation method described in [59]. In this method, the integer-pixel SAD values are cal-culated for the four nearest macroblocks that surround the best integer-pixel candidate, as shown in Figure 3.6. From these values, we can estimate the corresponding half-pixel SADs. The interpolation formulas for points a and b are 28 A + B + E +I a = , and 32 3 , 29 B + E + 1 6 = . 32 2 Symmetric formulas are used for the other half-pel locations. Once the ap-proximate values have been obtained, k of the eight candidates are chosen corresponding to the k smallest estimated distortions, together with the cen-tral point. The SAD is computed for each of these points (only), and the best 56 B o a b c X X X d E e X o X f g h X X X D o Figure 3 . 6 : The half-pel motion vector candidates, candidate is chosen to be the half-pel motion vector. 3.3.5 Interlaced Mot ion Estimation Interlaced video sequences, such as broadcast T V images, can take advantage of interlaced M C to obtain improved quality for image objects that move during the time interval between two fields. Two motion estimation modes exist: frame and field M E . Frame M E combines both fields of the interlaced video sequence into one frame. Field M E considers the fields separately, two different motion vectors being generated for each of the top and bottom fields of the current M B . This leads to four possible ways of matching the top and bottom fields of the current M B with the top and bottom fields of a candidate M B of the previous frame. If the SAD obtained by the frame M E is relatively small, that is, SADfrarne < T 4 , the field M E can be avoided. Otherwise, field M E is performed, doubling the required number of computations. The field M E is similar to the frame M E , except that the starting point is the frame M V for both the top and bottom 57 Spatial and temporal prediction Fast diamond low resolution frame M E Full and half-pel fast diamond frame M E Frame M V Fast diamond low resolution field M E Full and half-pel fast diamond field M E Field M V Figure 3.7: Motion vector prediction throughout the stages of motion estima-tion. fields. A low-resolution search is performed first. If the result of this search is satisfactory, that is, SADfi eld,low then the field M E is stopped for the current M B . Otherwise, full-pel and half-pel field M E are generally performed. The half-pel search can be avoided if SADj ield,full—pel <T6. As a summary, an overview of the different stages of motion estimation is sketched in Figure 3.7. The challenging task is how to optimize all these thresholding parameters to obtain the best overall computation-performance tradeoff. In chapter 6, we will show how to obtain optimum parameter sets corresponding to optimum global computation-distortion operating points for a complete encoder. 58 3.4 Simulation Results We here discuss the results for motion estimation only, using the number of SADs as a measure of the computational complexity. We evaluated the perfor-mance of the proposed motion estimation algorithm in the context of high-bit rate MPEG-2 encoding and low-bit rate H.263 encoding. The motion estima-tion algorithm used in this work uses a floating center diamond-shaped search area and a computation-distortion stopping criterion as discussed in Sections 3.2 and 3.3. The motion estimation algorithm uses the P M V as a starting point for a hierarchical search algorithm consisting of a low-resolution M E , followed by normal resolution frame and field M E (MPEG-2 only). Finally, for both frame and field M E , a half-pel refinement is performed. Several sequences with different motion types were encoded at a target bit rate of 3 Mbps (for MPEG-2) and 64 kbps (for H.263. The sequences are: BUS, G A R D E N , N B A , T A B L E TENNIS and B A L L E T , for MPEG-2 , and F O R E M A N and N E W S for H.263. 3.4.1 M o t i o n V e c t o r P r e d i c t i o n In the first experiment, we evaluated the performance of several spatial pre-dictors with different regions of support. The results are compared with the median predictor used in [22]. Table 3.1 shows the average number of SADs versus the PSNR values for these predictors using the region of support in Figure 3.1 . We can see that the best performance is obtained by the spatial predictor that uses three previously encoded MVs. In this case the number of SADs is relatively small and the PSNR loss is insignificant. We can also 59 Sequence BUS T A B L E TENNIS spatial P M V SADs PSNR SADs PSNR A B C D E F 60.5 31.08 56.9 28.56 A B C D E 59.6 31.06 55.9 28.56 A B C D 58.7 31.06 55.0 28.54 A B C 57.4 31.05 54.1 28.54 A B .57.1 30.99 53.9 28.50 A 57.8 30.41 52.7 28.29 B 59.3 29.98 52.8 28.24 A B T 36.0 30.97 34.0 28.48 median 57.4 30.96 56.9 28.41 Table 3.1: Coding performance and number of computations for different mo-tion vector predictors. see that our best spatial predictor outperforms the median predictor by up to 0.1 dB. In the second experiment, we compare the spatio-temporal P M V with and without projection. Table 3.2 shows that using the projection method improves the coding performance by approximately 0.1 dB. More important, the spatio-temporal P M V leads to a reduction in the number of SAD com-putations of up to 40 % for a constant quality level. This is shown in Ta-ble 3.3 for all test sequences. The effect of using the downsampled SAD in the prediction mode selection is shown in Table 3.4 for both the vertically and vertically/horizontally downsampled SADs. The computational overhead due to the motion vector prediction algorithm is greatly reduced with minimal performance loss. The motion vectors corresponding to the macroblock to the left (A), above (B) and the temporal (T) are selected as candidates for the motion vector predictor. From Table 3.1, we can see that the A B T motion vector predictor features both low complexity and good performance. 60 Sequence BUS T A B L E TENNIS Temporal P M V SADs PSNR SADs PSNR projection 35.13 30.97 33.29 28.48 no projection 35.27 30.91 34.12 28.40 Table 3.2: Average number of SAD operations per macroblock and the PSNR for several temporal predictors. Spatio-Temporal Median Sequence SADs PSNR SADs PSNR BUS 36 30.97 57 30.96 B A L L E T 37 44.20 53 44.24 G A R D E N 46 29.18 51 29.16 T . TENNIS 34 28.48 57 28.42 N B A 47 28.58 61 28.59 Table 3.3: Average number of SAD operations per macroblock and the PSNR for the two methods. Sequence BUS T A B L E TENNIS M B resolution SADs PSNR SADs PSNR full resolution 40.26 31.03 36.31 28.52 subsampled by 2 36.29 31.01 36.31 28.50 subsampled by 4 35.13 30.97 33.29 28.48 Table 3.4: Average number of SAD operations per macroblock and the PSNR using downsampled macroblocks used to compute the SAD. 61 37 rr z co c 32 h 36 35 h 34 33 h 31 30 3 4 5 6 Bit Ra te (Mbps) 7 8 9 Figure 3.8: The performance of the two methods for the BUS. Figure 3.8 shows the rate-distortion curves for MPEG-2 encoding using our P M V and the median P M V for the BUS sequence for a constant number of SAD computations. We can see that the proposed spatio-temporal prediction method outperforms the median prediction method by about 0.15 dB at all bit rates. For H.263 encoding we used two QCIF size sequences: F O R E M A N and N E W S . Table 3.5 shows that the spatio-temporal P M V reduces the SAD computations by 40 % as compared to the median P M V with no loss in perfor-mance. Figure 3.9 shows the rate-distortion curves for H.263 encoding using the proposed P M V and the median P M V for the F O R E M A N sequence for a con-stant number of computations. We can see that the proposed spatio-temporal prediction method outperforms the median prediction method by about 0.25 dB at all bit rates. 62 Spatio-•Temporal Median Sequence SADs PSNR SADs PSNR F O R E M A N 23.50 29.64 38.60 29.63 NEWS 22.68 29.86 40.8 29.88 Table 3.5: Average number of SAD operations per macroblock and the PSNR for the two methods. 63 3 . 4 . 2 H i e r a r c h i c a l M o t i o n E s t i m a t i o n We here discuss results for motion estimation of the MPEG-2 encoder. Ta-ble 3.6 shows coding results for MPEG-2 encoding using approximated half-pel M E with different complexity levels. As described in Section 3.3.4, once the Sequence B A L L E T B I C Y C L E BUS N B A T A B L E T E N N I S A l g o r i t h m P S N R S A D PSNR S A D PSNR S A D P S N R S A D P S N R S A D full (k=9) 45.25 44 30.05 53 32.25 44 29.90 47 29.78 46 k=5 45.17 40 29.85 46 32.18 38 29.77 41 29.75 41 k=2 44.90 33 29.25 39 31.49 31 28.88 34 29.28 35 no half-pel 44.66 31 28.54 40 30.35 27 27.64 32 28.90 32 Table 3.6: Coding performance and number of computations for different half pel search algorithms. half-pel approximate values have been obtained, k of the eight candidates are chosen corresponding to the k smallest estimated distortions, together with the central point. The algorithm that computes k = 5 half-pel SADs and approximates the other 9 — k SADs provides the best computation-distortion tradeoff, and it was hence used for all the subsequent experiments. The effect of the rest of the M E parameters on the number of SAD operations and the image quality is illustrated in Figure 3.10. From the figure, we can see that each threshold parameter provides a computation-performance tradeoff. For instance, the threshold T 4 can be used to reduce the number of SAD compu-tations by 60 % without a significant loss in performance. 64 65 3.5 S u m m a r y In this chapter, we have presented the motion estimation algorithm used in our encoder framework. It employs a combination of techniques to achieve the best possible computation-performance tradeoffs. To obtain a good starting point for the motion vector (MV) search, a new spatio-temporal motion vector predictor is used that takes into account the reliability of the predicted vector. The motion vector of a moving object is tracked from one frame to another using a projection method. The accuracy of the temporally predicted motion vector is estimated using a reliability measure that allows the motion search algorithm to decide whether to use the temporal motion vector predictor or the spatial motion vector predictor. The experimental results show that the spatial-temporal prediction reduces the number of computations performed by the motion search algorithm by up to 25% while maintaining the same quality. Starting with the predicted M V , the search proceeds in a hierarchical manner. First, the search is performed using low-resolution images to obtain a rough M V estimate. This is used as a starting point to obtain the opti-mum full-pel M V . Finally, a refinement search centered at the full-pel M V is performed to obtain the half-pel motion vector. In each of these steps, the motion estimation algorithm can be terminated when the prediction error falls below an predetermined threshold. A set of threshold parameters have been proposed that control the computational complexity and performance of the hierarchical motion estimation algorithm. 66 Chapter 4 Mode Selection 4.1 I n t r o d u c t i o n During mode selection, the optimum macroblock (MB) coding mode is chosen. An M B coding mode is determined mainly by the direction and type of motion compensation and D C T type. An optimal mode selection algorithm was pro-posed that is based on a rate-distortion (RD) criterion [23, 24, 25, 46, 47]. The coding mode is selected that minimizes the Lagrangian D + XR, involving both the M B distortion D and the number of required bits, R. Since the coding speed is our main concern, the mode selection algorithm proposed in this work is a variance-based algorithm. It is significantly simpler than the RD-based approach, and it yields insignificant loss in performance as compared to the RD optimized mode selection algorithm. We here start by describing the (RD) based macroblock mode selection. Then we introduce our measure-based M B mode selection. Speed performance comparisons between the two methods are 67 presented in Section 4.4. 4.2 R a t e - d i s t o r t i o n B a s e d M o d e Se lec t i on For best rate-distortion performance, the coding mode selection should be based on a rate-distortion criterion. RD optimization techniques have been proposed for motion compensated prediction and D C T video encoding sys-tems such as H.263 and MPEG-2 compliant encoders [21, 24, 25, 46, 47, 60]. When an RD mode selection process is used, the coding mode for the current macroblock is selected such that the Lagrangian J(mode) — D(mode) + \R(mode) (4-1) (the distortion biased by the required bit rate) is minimized. Although apply-ing this technique to MPEG-2 appears to be straightforward, the huge number of additional computations required by the exhaustive Lagrangian minimiza-tion, which is required if optimality is to be guaranteed, makes RD optimized MPEG-2 encoding impractical. We next describe the MPEG-2 coding modes of operation. The MPEG-2 coding modes are summarized in Tables 4.1, 4.2 and 4.3. Two possible coding modes exist for the macroblocks of I pictures: frame D C T (Int-FrDCT) and field D C T (Int-FdDCT). In both modes, the M B header and the D C T coefficients are encoded. These two modes require the same number of computations. For P pictures, there are a total of nine modes. However, the three 68 Mode Description Int-FrDCT Int-FdDCT Intra, frame D C T Intra, field D C T Table 4.1: I picture macroblock coding modes. principal modes of operation are: (1) intra D C T , (2) motion only, and (3) hybrid motion compensated prediction and D C T coding. However, as motion estimation and compensation and/or D C T coding can be applied to frames or fields, nine modes of operation are possible. These modes are listed in Table 4.2. For the Mot-Skip mode, the current M B is replaced by the M B at the same spatial location in the previous reconstructed picture. Encoding is performed by skipping the current M B and incrementing the M B address. This mode is designed for areas in the picture where little or no change relative to the previously reconstructed picture is detected. For the frame motion only mode (Mot-F-FrMC), only the M B header and the motion vector are encoded. In the case of field predicted MBs (Mot-F-FdMC), two motion vectors, each representing an 8 x 16 block, are estimated and encoded. This mode is a likely candidate when the motion changes are fast. For the motion compensation and D C T modes, frame or field D C T coding is applied to the corresponding prediction error blocks. These modes are intended for cases when the video scene contains significant detail. Both the motion vector and the frame D C T coefficients are encoded. These modes are the most computationally intensive. Due to non-motion changes or complex motion activity, translational motion compensated prediction may be inadequate. The intra modes (Int-*) allow the 69 Mode Principal Mode Description Int-FrDCT Int-FdDCT Intra Intra, frame D C T Intra, field D C T Mot-Skip Mot-F-FrMC Mot-F-FdMC Motion Compensation only Skipped M B Frame M C only, forward Field M C only, forward DCT-F-FrMC-FrDCT D C T - F - F r M C - F d D C T D C T - F - F d M C - F r D C T D C T - F - F d M C - F d D C T Motion compensation and D C T Frame M C , frame D C T , forward Frame M C , field D C T , forward Field M C , frame D C T , forward Field M C , field D C T , forward Table 4.2: P picture macroblock coding modes. encoder to escape to intra coding when the translational motion model fails. Only the header and the D C T coefficients are then coded. Inter coding of B pictures depends on the two temporally adjacent pic-tures, which can both be I pictures, both P pictures, or an I and a P picture. In addition to the nine M B coding modes described for the P pictures, thirteen new macroblock coding modes exist. These correspond to backward (*-B-*) prediction (based on the future reference frame) and bidirectional (*-D-*) pre-diction (based on both the past and future reference frames). These modes can be classified into the same three principal modes of operation found in P pictures. In the bi-directional motion-only mode (Mot-D-Avg), the pre-dicted macroblock is the average of the forward and the backward predicted macroblocks. The prediction residual is not encoded. In the backward and bi-directional motion compensated prediction and D C T coding modes, the prediction residual is also frame or field D C T encoded. The M B coding modes, which are listed in Tables 4.1, 4.2 and 4.3, are 70 Mode Principal Mode Description Int-FrDCT Int-FdDCT Intra Intra, frame D C T Intra, field D C T Mot-Skip Mot-D-Avg Mot-F-FrMC Mot-F-FdMC Mot-B-FrMC Mot-B-FdMC Mot-D-FrMC Mot-D-FdMC Motion compensation only Skipped M B Bidirectional average Frame M C only, forward Field M C only, forward Frame M C only, backward Field M C only, backward Frame M C only, bidirectional Field M C only, bidirectional DCT-F-FrMC-FrDCT D C T - F - F r M C - F d D C T D C T - F - F d M C - F r D C T D C T - F - F d M C - F d D C T DCT-B-FrMC-FrDCT D C T - B - F r M C - F d D C T D C T - B - F d M C - F r D C T D C T - B - F d M C - F d D C T DCT-D-FrMC-FrDCT DCT-D-FrMC-FdDCT DCT-D-FdMC-FrDCT D C T - D - F d M C - F d D C T Motion compensation and D C T Frame M C , frame D C T , forward Frame M C , field D C T , forward Field M C , frame D C T , forward Field M C , field D C T , forward Frame M C , frame D C T , backward Frame M C , field D C T , backward Field M C , frame D C T , backward Field M C , field D C T , backward Frame M C , frame D C T , bidirectional Frame M C , field D C T , bidirectional Field M C , frame D C T , bidirectional Field M C , field D C T , bidirectional Table 4.3: B picture macroblock coding modes. 71 classified by principal operating mode and ranked in increasing order of the associated number of computations. The Mot-Skip mode of operation is, by far, the most efficient. The field hybrid modes demand the most computational load, and should be avoided if possible. Since the number of bits necessary to encode a block in the Mot-Skip mode is zero, this mode tends to be selected quite often for large values of A. This sometimes leads to the presence of too many skipped blocks, which can impact negatively the visual quality of the video reconstruction quality. Therefore, although not optimal from the rate-distortion point of view, the Mot-Skip mode is selected if it minimizes not only the Lagrangian but also the block MSE. When we introduced this modification, the visual quality was greatly improved without a noticeable difference in the PSNR. 4.3 M e a s u r e - B a s e d M o d e Se lec t ion Since the coding modes are strongly application-dependent, we will discuss the mode selection for MPEG-2 and H.263 encoding separately. For M P E G -2 encoding, the coding mode depends on the motion compensation direction (forward, backward, or bidirectional), the motion compensation type (none, frame, or field), and the D C T type (none, frame, or field). Typically, the coding mode is selected based on variance measures of the original and motion-compensated MBs. The TM5 encoder bases its mode selection on the energy of the motion compensation residual and the variance of the original M B . We here propose a 72 similar approach that is based on the variance of the original M B (VAR) and the minimum SAD values obtained during frame M E and field M E (SADFR and SADFD) . This approach is more computationally efficient, as the energy of the M C residual does not need to be computed. The proposed mode selection algorithm is depicted in Figure 4.1 Since this algorithm is applied when a low-complexity encoder is targeted, only I and P pictures are supported. First, the I N T R A coding mode is chosen if the motion compensation residual is relatively large, that is, if SAD2 > 256 • VAR, where SAD = mm(S AD FR, SADFD). Then, if the I N T E R coding mode was selected, the F R A M E motion compensation mode is selected when SADFR < 1.05 • SADFD. Finally, for both I N T R A and INTER MBs, the D C T type is chosen based on the correlation 7 between the two M B fields before the D C T is performed. If 7 > 0.25, the two fields are deemed strongly correlated and the F R A M E D C T is applied. Otherwise, the F IELD D C T is applied. The constants in the three inequalities above have been optimized for best rate-distortion performance. Although the maxima were relatively flat, the proposed algorithm surpasses TM5 slightly, while at the same time being simpler. Notice that this algorithm biases the encoder towards choosing the F R A M E motion compensation and the F R A M E D C T , in an attempt to mimic the rate-distortion selection procedure. 73 Compute MB VAR SAD = min(SADFr,SADFd) Yes INTRA coding No Yes Frame motion compensation No Field motion compensation Compute correlation y between the two MB fields Yes No Field DCT Frame DCT Figure 4.1: Variance based mode selection of MPEG-2 coder. 74 In the case of H.263 encoding, the mode selection process is simpler, as interlaced video and backward motion compensation are not supported. The only decision to be made is between I N T R A and I N T E R coding. For the same reason as in the MPEG-2 case, an optimum RD mode selection is avoided in favor of a simpler variance-based criterion that is biased towards choosing I N T E R coding and a zero motion vector. First, during motion estimation, the SAD corresponding to the vector (0, 0) is decreased by 100 to bias the M E algorithm towards choosing the zero M V . Second, after M E , a measure of the variance is computed as follows 256 A = E \original — MB-mean\, i=0 where MBjmean is the mean of the macroblock. If A < SAD - 500, I N T R A coding is chosen. Otherwise, I N T E R coding is chosen. Although the H.263 standard supports a coding mode with four separate motion vectors for the four luminance blocks, such a mode is not allowed in our implementation, so as to minimize the number of computations. Note that this mode selection algorithm, as depicted in Figure 4.2, is similar to the one in the H.263 TMN11 reference model [61]. After the M B coding mode is determined, and the chosen motion com-pensation and block transform are applied, the transform coefficients are quan-tized, and the M B header, motion vector, and quantized D C T coefficients are variable-length encoded. However, the M C residual macroblock is only D C T 75 Compute MB VAR Intra coding Inter coding Figure 4.2: Variance based mode selection of H.263 coder, encoded if its energy is significant. It is not D C T encoded if SAD < T7. For the case when the residual M B is D C T encoded, we propose an efficient joint implementation of D C T and quantization, which we call the Quantized D C T (QDCT). The Q D C T embeds the quantization step into the integer multiplications performed in the DCT. Then, a zero-block prediction method is presented that reduces significantly the number of computations. Primary work in this direction was presented in [62]. In the next chapter, we will introduce the quantized D C T (QDCT) and we will describe in detail the design of the early termination content-dependent Q D C T technique. 76 4.4 E x p e r i m e n t a l Resu l t s In this section we summarize the speed and performance tradeoffs for RD-based and variance based mode selection for both MPEG-2 and H.263 en-coding. Figures 4.3 and 4.4 show the performance results of three MPEG-2 sequences ( B A L L E T , B I C Y C L E and BUS) and three H.263 CIF sequences A K I Y O , C O A S T G U A R D , and F O R E M A N . From these figures we can see that the perfor-mance of the proposed variance based mode selection method is less than the RD-based mode selection method by about 0.6 dB for MPEG-2 encoding and about 0.4 dB for H.263 encoding. Tables 4.4 and 4.5 summarize the encoding frame rate for the two encoders. Using variance-based mode selection, the MPEG-2 encoder runs four times faster, and the H.263 encoder runs 5 times faster, than the ones that use the RD-based mode selection. Sequence Variance mode selection (frame/sec) RD mode selection (frame/sec) Ratio B A L L E T 3.68 0.95 3.9:1 B I C Y C L E 3.22 0.94 3.4:1 BUS 3.71 0.98 3.8:1 Table 4.4: Execution time for the RD and variance based mode selection of MEPG-2 encoding of CCIR601 sequences using a 450 MHz Pentium III processor. 77 Bit rate (Mbps) (c) Figure 4.3: RD vs variance based mode selection of MPEG-2 encoder for (a) B A L L E T , (b) B I C Y C L E and (c) BUS sequences. 78 , ' ' - - ' R D - b a s e d node select on {aolid line Measured— Dased mode selection (dashed line) (b) 1 R D - b a s e d mode selection (solid line) M e a s u r e d - b a s e d mode selection (dashed line) i 1 Figure 4.4: RD vs variance based mode selection of H.263 encoder for (a) A K I Y O , (b) C O A S T G U A R D , and (c) F O R E M A N (CIF) sequence. 79 Sequence Variance mode selection (frame/sec) RD mode selection (frame/sec) Ratio A K I Y O 14.40 3.01 4.8:1 C O A S T G U A R D 12.78 2.55 5.0:1 F O R E M A N 12.95 2.60 5.0:1 Table 4.5: Execution time for the RD and variance based mode selection H.263 encoding of CIF sequences using a 450 MHz Pentium III processor. 4.5 S u m m a r y In this chapter, we proposed a simple and efficient mode selection algorithm for MPEG-2 and H.263 encoding that is based on a measure of the motion compensation residual. It is significantly simpler than the RD-based approach, and it yields a performance level within 0.4-0.6 dB of that of the RD optimized mode selection algorithm. However, the encoding time is being reduced by approximately 80%. 80 Chapter 5 The Quantized DCT (QDCT) 5.1 I n t r o d u c t i o n As mentioned in chapter 2, the Discrete Cosine Transform (DCT) is at present the most widely used transform in image and video compression algorithms. Its popularity is due mainly to the fact that it achieves a good data compaction, that is, it concentrates the information content in a relatively few transform coefficients. For natural images, the data compaction performance of the D C T is close to that of the optimum Karhunen-Loeve (KLT) transform. In addition, unlike the K L T , the D C T is not data dependent and its transform matrix exhibits symmetries which can be exploited to obtain very efficient software and hardware implementations [32]. In the video coding area, all significant coders presently employ the D C T followed by coefficient quantization. In this chapter, we focus on the implementation of the D C T and quantization within the H.263 and MPEG-2 81 video coding frameworks. When a fast motion estimation algorithm is used, the D C T and quantization steps become relatively significant in terms of com-putations. Although fast algorithms for the D C T and quantization have been proposed [32], the associated computational savings are quite limited. We here propose a joint implementation of the D C T and quantization (which we call the quantized D C T , or QDCT) that substantially reduces the number of com-putations associated with these two components. The savings are achieved by eliminating the separate calculation process of quantization, and also by using an early termination procedure for the QDCT computation. For simplicity, a separable 2-D D C T is used in our work. However, most of the ideas presented here can be applied to non-separable 2-D DCTs. Explicit quantization of the D C T coefficients is avoided by embedding the quantization step inside the D C T routine. Computational efficiency is exchanged here for an increase in program memory size, since 32 Q D C T sets of coefficients are now required, corresponding to the 32 possible quantization step values allowed in the targeted DCT-based video coders. A different ap-proach is adopted in the scaled D C T technique described in [32]. The scaled D C T embeds part of the multiplications present in the D C T flow graph into the quantization step. Computational analysis will show that the Q D C T re-quires the same number of computations as the scaled D C T . However, the Q D C T coefficients have lower precision, making them more suitable for early termination techniques. For additional savings, an early termination of the Q D C T computation 82 technique is proposed that is content-dependent (i.e., it exploits the statistics of the input blocks). As many as 80% of the inter coded blocks become zero blocks after applying the QDCT. To avoid performing the Q D C T for these blocks, a zero block predictor can be used. Zero block predictors have been proposed for low bit rate video coding [59, 63, 64, 65]. These are based on comparing the norm of the block with a threshold. We propose an alternate method, where the Q D C T is computed partially and the computation process is terminated if the partially transformed coefficients are zeros. The proposed Q D C T implementations allow the choice of a complexity-performance pair that matches the available computational power. The computational load of D C T and quantization processes can be reduced using Q D C T by up to 60 % while maintaining high quality. The computational load can be as low as 20 % for only a 0.3-0.5 dB loss in objective picture quality. The rest of this chapter is organized as follows. In Section 5.2, we describe in detail the proposed implementation of the Quantized D C T . Applications of the Q D C T to M P E G -2 and H.263 encoding are presented in Section 5.3. Experimental results are presented and discussed in Section 5.4. Conclusions are summarized in the last section of this chapter. 83 5.2 Implementation of The D C T and Quanti-zation In this section, we introduce the quantized D C T (QDCT) and we describe in detail the design of the early termination content-dependent Q D C T technique. In DCT-based video coding, the process of quantizing the D C T coefficient yki is expressed as where qki denotes the kl-th element of an 8 x 8 quantization matrix Q and |_^J denotes the largest integer smaller or equal to x. The Q D C T is a nonlinear transform obtained by embedding the quantiza-tion stage into the D C T transformation stage. The main idea here is to pre-compute and store a set of coefficients for each quantizer used in the encoder. For the case of uniform quantization, one Q D C T routine is designed for each possible value of the quantization step. This effectively replaces the need for computing power with little additional memory. Uniform quantization in video coding can use the same quantization step or different quantization steps for the 64 D C T coefficients. The two cases (constant and variable quantization steps) are addressed separately, as follows. When a constant quantization step is used, the same uniform quantiza-tion scheme is applied to all D C T coefficients. This is the case in H.263 coding 5 . 2 . 1 T h e Q u a n t i z e d D C T 84 and in MPEG-2 inter coding. The quantization step value for each macroblock is obtained via the rate control mechanism. This choice of a constant quanti-zation step simplifies the quantization software. The 8-point 1-D D C T of (XQ, X \ , ..., X7) is defined by [36] 75 : k = * 1 : k ^ 0 . ck ^ (2i + l)kTT Vk = ~w y , x i c o s TF- > where k = 0 , . . . , 7 and ck 2 i^o 1 6 The quantized transform coefficients can be written as follows yl = L - ^ J = L Z ) C ^ « - J , w h e r e 4i = — — Q - ^ — , Q is the quantization step, and [J denotes the rounding operator. The corre-sponding matrix representations are y = Cx and y q = L C q xJ , (5.1) where y, y q and x are 8x1 column vectors representing the 8 values of the D C T coefficients, the quantized D C T coefficients and the signal input respectively. C and C q are 8 x 8 transform matrices. In a similar manner, we can write the necessary equations in the 2-D case, using the separability property, as follows Y = C X C T and (5.2) Y 9 = [C?XCfj . (5.3) Here, C-! and Cqc are the corresponding matrices for the row and column transforms, and they are given by Cl = and Cl = where Q R x Q C = Q . (5.4) 85 Computing the Q D C T following the above approach removes the division by Q during quantization. Later, we will show how the explicit rounding operation can be also avoided. When a variable quantization step is used, 64 different quantization steps are used for the 64 D C T coefficients. This is the case for J P E G and for MPEG-2 intra coding. When using a variable quantization step, the goal is to exploit the frequency response of the human visual system by quantizing the low-frequency coefficients more finely and the high-frequency coefficients more coarsely. This is achieved by using an 8 x 8, usually non-separable, quantization matrix Q whose elements are the quantization steps for the 64 D C T coefficients. Rate control is achieved by scaling this matrix by a global quantization step factor. In this situation, the Q D C T as described above cannot be used directly. If the quantization matrix was separable, a modified Q D C T could be used. As stated above, the visually-optimized matrix Q is usually not separable. Therefore, we approximate Q by a separable approximation matrix Q ^ q q T , where q = [qo, • • • ,a7]T is an eight-dimensional vector. In this case, q is a vector of quantization steps to be applied separately to the row and column QDCTs. Let us now define a modified Q D C T to be used in conjunction with this separable approximation. The quantized D C T matrix can be expressed by Y " = [ Y . / Q J , or Y ? = | Y . * R J , where R - l . / Q . 86 We use the notations ".*" and "./" to denote element-by-element matrix mul-tiplication and division, respectively. If we use the separable approximation of Q, Yq can be expressed by Yq = [Y. * RJ ~ Y . * (rrT), where r = l . / q . It is easy to prove that Y . * (rr r) = diag(r)Ydiag(r)T, where diag(r) is the matrix having the elements of r on the diagonal and zero elsewhere. Finally, we obtain Yq = Ldiag(r)Ydiag(r)J = Ldiag(r)CXCTdiag(r)TJ = [ C ? X C 9 T J , where Cq = diag(r)C is the matrix obtained by dividing the rows of C by the corresponding ele-ments of q. Note that this is similar to the Q D C T defined for the constant quantization step case, only with a modified transform matrix. So far, we have assumed that the coefficients have infinite precision. In all practical implementations, integer computations are employed, and the co-efficients should assume integer values. Their values are scaled up and rounded to the nearest integer. Let us assume that the coefficients are scaled up to 6-bit integers, that is C? = L C q x 26 + 0.5j. 87 After the transform is computed, the results are scaled down by the same factor and rounded to integer values. To avoid having to perform the rounding operation, we go one step further and compute an approximation of (5.3) as follows It is easy to see that a large value of b yields a good approximation, but the transform coefficients will require more complex multiplications. Therefore, the minimum value of b is sought that achieves a precision comparable to the full-precision QDCT. For MPEG-2 video encoding, empirical results show that choosing b = 10 yields virtually no loss in performance. As a result of the above rounding process, we now have a set of integer coefficients for each value of the quantization step. The number of coefficients in the set depends on the specific method used to implement the QDCT. The computation of the Q D C T is identical to the computation of the D C T , using the modified transform matrix. If the direct computation of the 1-D D C T is used, seven coefficients (a, 6, c, d, e, / , g) fully describe the matrix transform as (5.5) 88 follows 9 9 9 9 9 9 9 9 a b c d -d —c -b —a e f ~f — e — e -f f e b -d — a —c c a d -b 9 -9 ~9 9 9 -9 -9 9 c —a d b -b -d a —c f — e e -f -f e —e f d —c b —a a -b c -d Practical systems do not compute the D C T directly using using Equa-tion (5.3). Faster implementations exploiting the particular structure of the D C T matrix have been proposed that require significantly fewer computations. The proposed modifications to the D C T can be applied to any of the fast al-gorithms. In this work, we used Chen's algorithm [32] to compute the QDCT. The flow graph for Chen's D C T algorithm is sketched in Figure 5.1. Note that, in this implementation, only the coefficients in the last two stages need to be scaled to compute the QDCT. The coefficient c 7 in the second stage will be kept unmodified. Therefore, we have eight coefficients for each value of Q. Since Q can usually take 31 different values, the total number of coefficients to be stored is 32 x 7 + 1 = 225. 5.2.2 Early Termination Q D C T Computation In video inter block coding, the Q D C T input data is the motion compensated prediction error data, whose values are centered narrowly around zero. Sim-89 Figure 5.1: The Chen fast D C T flow graph. ulation results show that, for an MPEG-2 encoder, at a bit rate of 3 Mbps, 70 to 75% of the inter coded blocks become zero after quantization. For an H.263 encoder at 128 kbps, the percentage is 75 to 85%. Significant savings can be achieved if these blocks are detected in the early computation stages. For example, if the computation of the Q D C T is first applied to the columns and then to the rows of an 8 x 8 block, and if the column-transformed block contains only zero coefficients, then the row QDCTs would be clearly zeros and thus their calculations are rendered unnecessary. To further reduce the computational complexity, we introduce a new algorithm which is depicted in Figure 5.2. First, Nc column QDCTs are com-puted. If all Nc blocks of coefficients all have zero values, the computation process is stopped and the output 8 x 8 block is set to zero. If the column QDCTs are computed, then iV r row QDCTs are computed. If all Nr blocks 90 Compute first Nc column DCTs 1 YES Compute remaining 8-Nc column DCTs Compute first Nr row DCTs YES Compute remaining 8-Nr row DCTs Set output to all zeros Figure 5.2: Early termination QDCT computation. 91 of coefficients all have zero values, the computation process is stopped and the output 8 x 8 block is set to zero. This algorithm has three parameters: Nc, JV r, and K. The parameters Nr and Nc take values from one to eight. Clearly, the accuracy of the zero block prediction algorithm increases as these two parameters are increased. The parameter K = Q c Qr can be used to control the percentage of blocks that will be declared zero after computing the column QDCTs. When K increases, more coefficients will be set to zero after the column transforms, and the predictor also becomes less precise. Although the underlying comparison operations introduce a computa-tional overhead, this overhead is indeed small as compared to the number of computations required by the QDCT. It is also not necessary to perform the comparisons on a coefficient-by-coefficient basis. For instance, if each coeffi-cient is represented using a 16 bit integer, we can compare simultaneously 2 or 4 coefficients on a 32 or 64-bit C P U , respectively. Some CPUs might support even more efficient mechanisms for detecting an all-zero block of data. An important issue here is the order in which the column and row QDCTs are performed. Previously [62], the QDCTs were performed in the natural order, from left to right and from top to bottom. For the column QDCTs, a better approach is to spread the first Nc QDCTs over the entire block as uniformly as possible, as shown in Figure 5.3 for all values of Nc. Here, we can obtain a more precise zero block predictor by choosing a more 92 Figure 5.3: The optimized ordering. significant subset of the eight columns. Indeed, for Nc = 2, columns 3 and 6 carry more information about the entire block than columns 1 an 2. For the row Q D C T computation, the natural top-bottom order should be the best order since the energy is already concentrated in the top (low vertical frequency) coefficients after the column Q D C T computations. 5.3 Applications of The QDCT 5 . 3 . 1 A p p l i c a t i o n t o M P E G - 2 a n d H . 2 6 3 E n c o d i n g The Q D C T can be used to replace the D C T and the quantization steps of an MPEG-2 encoder. The quantizer used by MPEG-2 has a dead zone twice the size of the quantization step. Therefore, the negative Q D C T values obtained 93 by using Equation (5.3) have to be corrected as follows, y \ yq>0 The standard MPEG-2 intra quantization matrix is not separable. However, MPEG-2 supports custom quantization matrices. A separable custom quanti-zation matrix is a good approximation for the visually-optimized non-separable quantization matrix defined in the standard. In addition to this, the DC coefficient (Y 0o) of intra coded blocks is always quantized using a quantization step Q = 8. To obtain the correct DC coefficient, before shifting by 26 bits is applied, the coefficient has to be scaled by a correction factor equal to ((Q/8) x 2 2 6). Similar modifications must be made to employ the Q D C T in an H.263 encoder. Here, the quantization dead zone is 2Q for intra coding and 2.5Q for inter coding. The same corrections described above for negative Q D C T values and for the DC coefficient of intra coded blocks must be applied. H.263 does not support a variable quantization step, so the constant-step Q D C T is used for both intra and inter coded blocks. In this section, we present experimental results to evaluate the computation-performance tradeoffs obtained by using the Q D C T and its early termination q = [3.7825,4.1955,4.6957,5.0030,5.5850,6.4600, 7.3985,8.7774]T 5.4 E x p e r i m e n t a l R e s u l t s 94 variants. The new encoder implementations are based on and compared to the MPEG-2 encoder presented in [66] and the reference TMN9 H.263 encoder [67]. 5 . 4 . 1 T h e Q D C T To measure the accuracy of the QDCT as compared to the separate implemen-tation of the D C T and the quantization, simulation experiments were carried out using three test sequences. For MPEG-2 encoding, we used the CCIR 601 (720 x 480) test sequences BUS, B I C Y C L E , and B A L L E T at bit rates rang-ing from 3 to 9 Mbps. For H.263 encoding, we used the CIF (352 x 244) test sequences F O R E M A N , A K I Y O , and C O A S T G U A R D at bit rates from 128 to 320 kbps. Figures 5.4 (a) and (b) show the average peak SNR (PSNR) for the encoder implementations using separate D C T and quantization process-ing, Q D C T with 6 = 10, and QDCT with 6 = 8. Note that the difference in performance is not noticeable even for b = 8. The PSNR loss is less than 0.02 dB for b = 10 and less than 0.08 dB for 6 = 8. For MPEG-2 , we have to also evaluate the effect of using a custom separable quantization matrix on the visual quality. Figure 5.5 shows a recon-structed intra coded frame of the B I C Y C L E sequence encoded using the default non-separable quantization matrix (a), the separable approximation (b), and a constant matrix (c). Notice that (a) and (b) have very similar subjective qual-ity, while clearly outperforming (c). Let us try to evaluate the computational savings achieved by using the QDCT. The separate D C T and quantization using the Chen algorithm require 256 multiplications, 416 additions, 64 shift 95 96 (c) Figure 5.5: Results for intra approximation: (a) using the default intra matrix; (b) using the separable approximation; (c) using a constant quantization step. 97 operations, and 64 divisions. When the scaled Chen D C T is used, the number of operations is 128 multiplications, 416 additions, 64 shift operations, and 64 divisions. The Q D C T requires 128 multiplications, 416 additions, 64 shift operations, and no divisions, so it requires essentially the same computational complexity as the scaled D C T . However, the Q D C T coefficients require less precision, leading to the development of the early termination technique. 5.4 .2 Early Termination QDCT Computation We have tested exhaustively all the possible combinations of the three Q D C T parameters and chosen those yielding the best computation-distortion perfor-mance. The set of optimal triplets is listed in Tables 5.1 for M P E G - 2 encoding and 5.2 for H.263 encoding. The tradeoffs between computational complexity and P S N R are shown in Figure 5.6 (a) and (b) for M P E G - 2 and H.263 encoding, respectively. In these figures, the number of computations is represented as the ratio between the number of 1-D Q D C T s performed with and without zero block prediction (which is 16 times the number of inter coded blocks). Note that savings of up to 70% can be achieved in both cases with a P S N R loss of less than 0.3 dB. Also, the graphs show that the optimized ordering outperforms the natural ordering by up to 0.2 dB. 98 Natural ordering Optimized ordering K Nr Nc PSNR computations PSNR computations 1 8 8 29.99 98.4 29.99 98.4 1 5 8 30.08 85.0 30.08 85.0 2 5 5 30.07 81.8 30.08 82.2 2 5 4 30.08 80.6 30.08 81.3 2 5 3 30.07 78.8 30.08 79.9 4 4 8 30.08 77.9 30.08 77.9 1 4 1 30.07 73.8 30.09 73.7 16 4 8 30.08 70.8 30.08 70.7 4 4 2 30.07 65.0 30.08 66.1 16 4 5 30.07 60.7 30.08 61.6 16 4 4 30.07 56.7 30.08 58.2 8 4 2 30.05 55.4 30.08 56.6 16 4 3 30.05 52.1 30.08 53.7 16 5 2 30.02 48.0 30.06 49.3 16 4 2 30.02 46.5 30.06 47.8 16 3 2 30 44.9 30.05 46.0 32 3 2 29.93 37.2 30.03 38.4 64 3 2 29.81 30.3 29.98 31.2 32 3 1 29.80 29.9 29.95 29.3 64 3 1 29.64 22.8 29.82 22.2 64 2 1 29.58 22.5 29.77 21.9 Table 5.1: Optimum parameters for the early termination Q D C T used MPEG-2 encoding. 99 Natural ordering Optimized ordering K Nr Nc PSNR computations PSNR computations 1 8 8 32.96 96.2 32.96 96.2 1 8 4 32.92 91.9 32.96 92.9 1 8 3 32.93 90.1 32.98 91.2 1 5 5 32.94 77.5 32.95 77.9 16 5 8 32.95 68.0 32.95 68.0 1 3 2 32.91 62.1 32.94 62.6 4 5 2 32.85 58.4 32.94 58.9 8 3 2 32.78 43.8 32.92 44.2 8 2 2 32.68 40.7 32.86 41.1 16 3 2 32.65 36.3 32.82 36.7 16 2 2 32.62 33.9 32.79 34.4 16 4 1 32.47 30.7 32.64 29.0 16 3 1 32.4 28.8 32.6 27.3 16 2 1 32.35 26.8 32.56 25.5 16 1 1 32.19 25.3 32.34 24.4 Table 5.2: Optimum parameters for the early termination Q D C T used in H.263 encoding. 100 30.2 r 30 k 29.4 k 29.2 k x - optimum order o - natural order 291 I I i I I I I i I I 0 10 20 30 40 50 60 70 80 90 100 Number of computations (%) (a) 33.5 r 33 k cc 32.5 k 32 k x - optimum order o - natural order 31 5 I I I I 1 I I I I I I 0 10 20 30 40 50 60 70 80 90 100 Number of computations {%) (b) Figure 5.6: Computation-distortion tradeoffs for the early termination Q D C T computation: (a) MPEG-2 , (b) H.263. 101 5.5 S u m m a r y We have presented a new technique which we call Quantized D C T (QDCT), for reducing the computational load associated with calculating the D C T and quantization steps in DCT-based coding. The savings are due to embedding the quantization into the D C T computation and to exploiting the statistics of the video source. Besides significantly reducing the number of computations, the proposed techniques also offer the user flexible control on the Q D C T ac-curacy and the number of computations. 102 Chapter 6 Content-Based Computation-Distortion Optimization 6.1 I n t r o d u c t i o n This chapter discusses how to assemble the three main tools developed in last three chapters into a complete encoder. There are many different ways of assembling these different components. Typically, the encoder must satisfy certain implementation constraints, such as the maximum number of opera-tions per second and the maximum memory size. In this chapter, we propose a method for computation-distortion optimization that allows us to achieve optimum encoding performance for the available computational power. The proposed optimization procedure exploits the fact that different sequences, and 103 different macroblocks within a sequence, have different computational require-ments to achieve an acceptable encoding performance. The average number of computations can be thus reduced if we spend more computational resources where they are needed. For example, the half-pel motion estimation step may not be necessary for a specific macroblock if a good enough full-pel motion vector has already been found. A greedy algorithm is used to optimize all the encoding parameters. The greedy strategy makes a selection that seems the best choice at any point during the execution of the algorithm, and hopes that these local optimal decisions lead to global optimal results. The greedy method is used to select the local optimum. Sometimes these are heuristics, which are just informed strategies that produce reasonable results. In some cases, however, the greedy algorithm approach is guaranteed to produce an optimal result. 6.2 E n c o d i n g Pa rame te r s There are effectively seven threshold parameters (in the MPEG-2 encoder) and three threshold parameters (in the H.263 encoder) that control the com-putational complexity and performance of the hierarchical motion estimation algorithm. The early termination algorithm of the Q D C T has also three pa-rameters: K, Nc, and Nr, which can be used to control computation-distortion tradeoffs of the QDCT. The effect of these parameters on the computation flow is visualized in Figure 6.1 for MPEG-2 encoding. For H.263 encoding the Field motion estimation is not used. The first threshold Ti allows the encoder to 104 M V = P M V • I-ow res F r a m e M E (frume u n d field) ; f r a m e M E ;ure 6.1: Computation-distortion parameters for MPEG-2 enco 105 avoid motion estimation if the predicted motion vector provides a very ac-curate motion compensated predictor. T2 allows the encoder to stop motion estimation if the low-resolution motion search found a good match. The task of the thresholds T 3 , T 4 , T5 and which are used in the M P E G - 2 encoder, are explained as follows. T 3 allows the encoder to stop motion estimation if the frame full-pel motion search found a good match. T 4 disables field motion estimation if frame motion estimation found a good match. T 5 allows the en-coder to exit motion estimation if the field low-resolution motion search found a good match. T 6 permits the encoder to avoid field half-pel motion estima-tion if the full-pel motion search found a good match. The thresholds Tj allow the encoder to skip computing the D C T if motion compensation was very ef-fective (i.e. the energy of the residual block is small). In addition to that, the computation-distortion parameter 3 is used to stop motion search when the number of computations reaches some level or constraint. In our motion search algorithm, the search stops when the current search area is outside the allowed search region or when the SAD corresponding to the best candidate in the search area of the current layer is larger than the SAD of the best can-didate at the previous layer. To reduce computations further, we employed a modified cost function, expressed in terms of the SAD biased by an explicit computational constraint. This function is defined as the Lagrangian Jp = SAD + 3C, where C is the total number of operations performed so far for the current M B motion search. We consider one operation as one addition, one subtraction, 106 and one absolute value computation. The early prediction method used in the Q D C T is based on a partial computation of the Q D C T as depicted in Figure 6.1. First, the first Nc column QDCTs are computed. If all 8 x Nc computed Parameter Description P M E computation-distortion parameter Ti avoid M E if good P M V was found T2 avoid frame M E if good low-resolution M V was found T3 avoid frame half-pel M E if good full-pel M V was found T4 avoid field M E if good frame M V was found T5 avoid field full and half pel M E if good low-resolution field M V was found T6 avoid field half-pel M E if good full-pel field M V was found T7 avoid computing the Q D C T if the energy of the residual is small K, Nr, Nc early termination QDCT parameters Table 6.1: Encoding parameters used for the adaptive computation-distortion optimization. values are zero, the entire block is set to zero as the rest of the values are very likely to be zero. Else, the other 8 — Nc column QDCTs are computed. Then, Nr row DCTs are computed. If these 8 x Nr values are zero, the entire block is set to zero. Else, the rest of the row DCTs are computed. The parameters Nr and Nc take values from one to eight. Clearly, the accuracy of the zero block prediction algorithm increases as these two parameters are increased. The parameter can be used to control the percentage of blocks that will be declared zero after computing the column QDCTs. When K increases, more coefficients will be 107 set to zero after the column transforms, and the predictor also becomes less precise. A l l the different encoding parameters are summarized in Table 6.1. 6.3 C o n t e n t - B a s e d C o m p u t a t i o n - D i s t o r t i o n O p -t i m i z a t i o n A l g o r i t h m This content-based optimization is based on a set of the encoding parameters described above which can be adjusted to obtain different levels of performance and computational demand. A challenging task is the joint optimization of the parameters for a given global computational constraint. Ideally, an exhaustive optimization could be performed, where each parameter is varied within an interval and all the possible combinations of parameter values are considered. For each combination, encoding is performed at the target bit rate and the achieved performance (PSNR) and encoding speed (frames per second, or fps) are plotted. Finally, the convex hull of the set of (PSNR, fps) points is then determined. By choosing the parameters corresponding to the points on this curve, we are guaranteed to achieve the best possible performance level (for a particular video sequence) given a specific computational load. This ex-haustive method is quite computationally demanding. Therefore, we employ a simpler "greedy" optimization method, as follows. For a given bit rate, we start with the point in the distortion-computation plane which yields the best performance and the highest computational complexity. Then, we vary each of the parameters by a small amount S and obtain P new points. The value 108 of 5 for each parameter is chosen based on the computation-distortion curve shown in Figure 3.10. Among these points, we choose the one which, when connected to the initial point, yields a line with minimum slope as shown in Figure 6.2. We then proceed in a similar fashion from this new point. We con-tinue drawing the operational curve until we obtain a performance level that is no longer acceptable. Our optimization procedure is performed off line using test sequences with various amount of motion. The values of the optimized parameters are then fixed and used through out the thesis. An example of the optimum parameters for H.263 is shown in the next chapter (Table 7.2). In Encoding speed (frames/sec) Figure 6.2: greedy algorithm. order to compare the performance of the greedy algorithm to the exhaustive search algorithm, we performed an exhaustive optimization for the first four thresholds in Table 6.1 at a target bit rate of 3 Mbps. A total of 625 com-binations of parameter values was tested. The computation-distortion pairs (PSNR, number of SADs) are plotted in Figure 6.3. The curve obtained by the 109 29.8 Average number of SADs per macroblock Figure 6.3: performance of greedy algorithm. greedy algorithm is also plotted on the same figure. We can see that the greedy algorithm produces a curve that is very close to the optimal curve, which is the convex hull of the set of computation-distortion pairs. More optimization results using this method are presented in Sections 6.4. 6.4 E x p e r i m e n t a l Resu l t s for C Imp lemen ta -t i o n The adaptive optimization algorithm was applied to the MPEG-2 encoder using a test sequence that consists of 108 concatenated frames from three dif-110 ferent CCIR 601 video sequences; B A L L E T , BUS, and N B A . A C language im-plementation of the encoder was compiled and run on a Pentium III (450 MHz) computer with 128 M B of memory. The resulting distortion-speed curve, for a target bit rate of 3 Mbps, is presented in Figure 6.4 (a). Notice that the encod-ing speed has essentially been doubled (from frames/sec. to 5.8 frames/sec.) with no performance loss. For an encoding speed up of three times (from 2.9 frames/sec. to 8.7 frames/sec.) the PSNR loss is about 1.5 dB. Notice also that, for the first part of the curve, higher PSNRs are obtained with fewer computations. This can be explained by the fact that the threshold T4 dis-courages the encoder to use field motion vectors, which require more bits to be encoded. The rate control algorithm then allocates these bits for the encoding of other MBs, resulting in better encoding quality. The adaptive optimiza-tion algorithm was also, applied to the H.263 encoder using a test sequence that consists of 1000 concatenated frames from three different CIF video se-quences; N E W S , S I L E N T , and C O N T A I N E R . The computation-distortion curve for the H.263 encoder using the test sequence, is presented in Figure 6.4 (b). The statistics of the threshold decisions for MPEG-2 encoding are presented in Table 6.2 for two of the points on the MPEG-2 curve. I l l 112 Reference point # 1 Reference point #2 (30.00 dB, 5.52 fps) (29.39 dB, 7.55 fps) Threshold % YES % NO % YES % NO TI 15 85 19 81 T2 0 100 68 32 T3 0 100 2 98 T4 88 12 100 0 T5 36 64 0 100 T6 0 100 0 100 T7 0 100 1 99 Table 6.2: Threshold statistics for MPEG-2 . 6.5 E x p e r i m e n t a l Resu l t s for P e n t i u m / M M X I m p l e m e n t a t i o n The C language implementations of the MPEG-2 and H.263 encoders show how the different components of the framework interact. In a practical im-plementation though, a much faster encoder can be obtained by exploiting the architectural features of the processor. In this section, we show how to map some of the computation intensive algorithms of video coding to the Intel M M X processor. The media extensions for the Intel Architecture (IA) were designed to enhance performance of advanced media and communication ap-plications. The M M X [74] technology provides a new level of performance to computer platforms by adding new instructions and defining new 64-bit data types, while preserving compatibility with software and operating sys-tems developed for the Intel Architecture. The M M X technology introduces new general-purpose instructions. These instructions operate in parallel on 113 multiple data elements packed into 64-bit quantities. They perform arithmetic and logical operations on the different data types. These instructions accel-erate the performance of applications with compute-intensive algorithms that perform localized, recurring operations on small native data. This includes applications such as motion video, combined graphics with video, image pro-cessing, audio synthesis, speech synthesis and compression, telephony, video conferencing, 2D graphics, and 3D graphics The IA M M X instruction set has a simple and flexible software model with no new mode or operating-system visible state. The M M X instruction set is fully compatible with all Intel Ar-chitecture microprocessors. A l l existing software continues to run correctly, without modification, on microprocessors that incorporate the M M X technol-ogy, as well as in the presence of existing and new applications that incorporate this technology. The M M X technology uses the Single Instruction, Multiple Data (SIMD) technique. This technique speeds up software performance by processing mul-tiple data elements in parallel, using a single instruction. The M M X technol-ogy supports parallel operations on byte, word, and double word data elements, and the new quadword (64-bit) integer data type. Modern media, communi-cations, and graphics applications now include sophisticated algorithms that perform recurring operations on small data types. The M M X technology di-rectly addresses the need of these applications. For example, most audio data is represented in 16-bit (word) quantities. The M M X instructions can operate on four of these words simultaneously with one instruction. Video and graph-114 ics information is commonly represented as palletized 8-bit (byte) quantities; one M M X instruction can operate on eight of these bytes simultaneously. 6 . 5 . 1 M M X I m p l e m e n t a t i o n o f T h e S A D In motion estimation the block matching algorithm perform an 8-bit unsigned SAD computation on 16x16 or 16x8 blocks for MPEG-2 and on 16x16 or 8x8 blocks for H.263. In M M X implementation, using a word size of 8 bits allows for 8 data elements to be executed in a single instruction. This result on a significant speed performance improvement. To compute the the SADs, we adopt the M M X algorithm that was made publicly available by Intel [74]. The M M X SAD implementation is 4 to 5 times faster than the C implementation. 6 . 5 . 2 M M X I m p l e m e n t a t i o n o f T h e Q D C T To exploit the capability of the M M X technology without significant loss in precision, we chose a word size of 16 bits in implementing the Q D C T . This allows for four one-dimensional column QDCTs to be executed in parallel. After the column QDCTs are executed, the 8 x 8 block of data is transposed. The same parallel column QDCTs are applied to the transposed data. Finally, the block is transposed once again to produce the output samples in the correct order. The C language implementation of the QDCT uses 32-bit integers. The loss of precision when going from 32-bit to 16-bit computations is maintained at a low level by carefully choosing the gain factors at the different stages. Due to the M M X QDCT structure, the early termination configurations are 115 Reference point # 1 (30.00 dB, 5.52 fps) Reference point #2 (29.39 dB, 7.55 fps) Routine % of total % of total suitable for M M X SAD 31.0 22.0 YES Q D C T 13.5 15.9 YES interpolation 10.7 4.1 Y E S subtract prediction 8.2 11.4 YES motion compensation 8.8 9.9 YES V L C 6.3 8.0 NO other motion estimation 4.9 2.9 NO mode decision 2.7 4.0 NO inverse quantization 3.0 4.2 NO generate low resolution 2.6 3.3 NO IDCT 1.8 2.5 NO other 5.1 9.9 NO quant intra 1.4 1.9 NO Table 6.3: Profiling results platform. limited to those of the form (A",8,4), (A",4,8), and (A',4,4), and the only possible ordering is the natural ordering. 6.5.3 M M X Implementation of The Other Components Other computationally intensive components such as motion compensation and add prediction, which consists of simple subtraction/addition operations are also implemented using M M X and achieved additional speed improve-ments. For MPEG-2 encoding, the components that have been hand-optimized using the M M X instructions are listed in Table 6.3 together with their weights in the C implementation. 116 6.5.4 Adaptive Computation-Distortion Optimization of The M M X implementation Since the computational weights of the different components of the encoder change when M M X optimization is used, the adaptive computation-distortion optimization procedure must be repeated. The results of this optimization are shown in Figure 6.5. 6.6 Summary In this chapter, we assembled the different tools developed in the last three chapters to a flexible framework for video encoding. For a fixed bit rate, our encoder yields very good computation-performance tradeoffs. The de-veloped framework consists of a set of optimized core components: motion estimation, the Discrete Cosine Transform (DCT), quantization, and mode selection. Each of the components can be configured to achieve a desired computation-performance tradeoff. The components are assembled to obtain encoders with varying degrees of computational complexity. Each parameter controls the computation-distortion performance of the algorithm by chang-ing the operation of a particular sub-component. A "greedy" design proce-dure is proposed that yields parameter sets corresponding to the best global computation-distortion operating points for a complete encoder. The proposed method leads to more than a 2:1 reduction in the number of computations without significant loss in subjective and objective quality. 117 Figure 6.5: Computation-distortion curves for the Pen t ium/MMX implemen-tation of the MPEG-2 encoder (a) and the H.263 encoder (CIF test sequence) Chapter 7 Computa t ion Con t ro l 7.1 Introduction In certain cases, an adaptive algorithm is desired that is capable of dynamically modifying its computational requirements to meet a specific computational constraint. As a simple example, suppose that two people are communicat-ing with each other using video encoders that run on machines with different computing resources (e.g. P II 200 and P III 500 MHz). Let us assume that the encoders that run on the fast and slow machines are capable of encoding 50 and 30 frames/sec respectively. Since it is a two way communication, the person with the fast codec would still wait for the slow codec to encode the sequence at the other end. It would be necessary that the fast encoder runs at the same speed as the slower one. Therefore, the encoded sequence of the faster encoder will have a better quality. In the previous chapter, we obtained a set of optimized parameters that represent different encoding speeds and per-119 formance levels. For a given set of optimized parameters, the encoding speed and performance depend on the content of the encoded sequence. In this chap-ter, we propose a computation control mechanism that can be implemented within the proposed encoding framework to allow the encoding algorithm to adapt to the available computational resources. The control mechanism will allow the encoder to fix the encoding speed regardless of the content of the sequence. It will also allow the encoder to run real time on machines with dif-ferent computing levels, e.g. P II 300 MHz and P III 450 MHz, and adjusting the picture quality based on the available computational resources. In section 7.2, we define the problem of computation control. In Section 7.3, we describe in detail the proposed feedback control mechanism. Experimental results will be presented in Section 7.4. 7.2 T h e P r o b l e m of C o m p u t a t i o n C o n t r o l The function of computation control is defined as the capability of the en-coder to dynamically modify its computational requirements to meet a spe-cific computational constraint. For instance, the user should be able to set the encoding frame rate (the number of frames/second to be encoded) and the encoder should automatically adjust the encoding parameters so as to achieve this desired frame rate. As described in Chapter 6, during the computation-distortion optimization a set of optimized encoding parameters that correspond to certain speed and performance were obtained. Tables 7.1 and 7.2 show twenty and thirteen op-120 Operating point TI T2 T3 T4 T5 T6 T7 P K Nr Nc PSNR (dB) Frame rate (fps) 1 0 0 0 0 0 0 0 1 1 8 8 30.02 2.96 2 0 0 0 2000 0 0 0 1 1 8 8 30.20 3.85 3 0 0 0 2500 0 0 0 1 1 8 8 30.21 4.07 4 210 0 0 2500 0 0 0 1 5 6 6 30.25 4.36 5 210 0 0 3200 0 0 0 1 5 5 2 30.19 4.88 6 275 0 0 3200 0 0 0 1 5 5 2 30.18 4.88 7 275 0 0 3500 0 0 0 1 5 5 2 30.15 5.01 8 275 0 0 3500 420 0 0 30 5 5 2 30.06 5.44 9 275 0 0 4000 420 0 0 55 5 5 2 30.00 5.52 10 275 0 0 4700 420 0 0 55 5 5 2 29.93 5.74 11 330 0 0 4700 420 0 0 55 5 3 2 29.89 5.85 12 330 200 1350 4700 420 1100 0 55 5 3 2 29.74 6.63 13 330 200 1350 4700 420 1100 0 55 16 3 2 29.64 6.92 14 330 200 1350 4700 420 1100 0 55 64 3 2 29.39 7.11 15 330 200 1350 4700 420 1100 0 55 100 3 2 29.21 7.19 16 330 400 1350 4700 420 1100 0.35 55 100 3 2 28.96 7.55 17 330 600 1350 15000 420 1100 0.35 55 100 3 2 28.62 8.44 18 400 800 1350 15000 420 1100 0.35 55 100 3 2 28.30 9.16 19 1000 1000 1350 15000 420 1100 0.35 55 100 3 2 28.06 9.53 20 1500 1200 1350 15000 420 1100 0.35 55 100 3 2 27.89 9.77 Table 7.1: MPEG-2 optimum parameters combinations for different encoding speeds using a 450 MHz Pentium III machine. timum parameter combinations for the MPEG-2 and H.263 encoders respec-tively. The MPEG-2 test sequence consists of 108 concatenated frames from three different CCIR 601 video sequences; BUS, B A L L E T and T A B L E N B A . The H.263 test sequence consists of 1000 concatenated frames from three different CIF video sequences; N E W S , SILENT and C O N T A I N E R . Each set of optimized parameters represents certain encoding speed-performance operating points. Tables 7.3 and 7.4 show three speed-performance results for 1000 frames of the sequences F O R E M A N and A K I Y O that were obtained using the optimized set of parameters 1, 7 and 13 in Table 7.2. From these tables, we can see that the encoding speed is sequence dependent. Ideally the encoder would like to have more control on the speed regardless of the sequence content. 121 operating Ti T2 n Q K Nr Nc PSNR Frame rate point (fps) 1 0 0 0 0 1 8 8 34.60 18.79 2 512 0 0 0 1 8 8 34.60 19.36 3 512 0 64 0 1 8 8 34.57 20.23 4 1024 0 64 0 1 8 8 34.55 20.71 5 1024 512 64 0 1 8 8 34.55 21.43 6 1024 512 64 256 1 8 8 34.53 22.89 7 1536 512 90 512 1 8 8 34.25 26.75 8 1536 512 100 512 1 5 2 34.16 28.68 9 1536 512 100 512 8 2 2 34.07 29.28 10 2048 512 100 512 8 2 2 33.98 29.68 11 2048 1024 100 512 8 2 2 33.95 30.33 12 2048 1024 100 1024 8 2 2 33.93 30.89 13 2048 1024 140 1024 8 2 2 33.73 33.17 Table 7.2: H.263 optimum parameters combinations for different encoding speeds of CIF sequences using a 450 MHz Pentium III machine. Operating Average Encoding frame rate point PSNR dB (frame/sec) 1 30.80 19.36 7 30.56 33.28 13 29.92 37.86 Table 7.3: Encoding speed-performance results for FOREMAN sequence on a 450 MHz Pentium III processor. 122 Operating Average Encoding frame rate point PSNR dB (frame/sec) 1 39.77 20.90 7 38.65 38.99 13 37.51 46.12 Table 7.4: Encoding speed-performance results for A K I Y O sequence on a 450 MHz Pentium III processor. P(i-l) + P ( i ) . Encoder V v T a - (i-l)T t J V Figure 7.1: Computation control mechanism. 7.3 Feedback Control System The main objectives of the control mechanism is to automatically adjust the encoding frame rate to meet the desired encoding speed and keep the encod-ing speed as constant as possible. To achieve our goal, we propose a simple feedback control system, as illustrated in Figure 7.1, to adaptively change the encoding parameters to compensate for the mismatch between the encoding time and the target time. For each group of pictures (GOP) consisting of one I and eleven P pictures, a target frame encoding time Tt is computed, that reflects the desired frame rate. As each frame is encoded, its actual encoding time is measured. Before encoding the next frame, say, the i-th frame in the 123 GOP, the difference between Ta, the actual total encoding time for the first (i — 1) frames, and [i — l)Tt, the target total encoding time, is computed. Based on this difference, the operating point (described by the set of encoding parameters) is changed to compensate for the mismatch. The index Pi points to a set of encoding parameters in Table 7.2 for H.263, corresponding to a certain speed-performance operating point. The index is adjusted using the formula P, = Pi_x + k(Ta Tt)/Tt, where A; is a constant parameter. The index computed using this formula is rounded off to an integer value. Experimental results showed that this feedback control method works well for values of k between 5 and 15. A value of k = 10 was selected. The initial selection for the encoding parameters is made using the results of the static optimization described in the previous Chapter. For example, if the target encoding speed of H.263 encoder is 19 frames/sec, the initial index is set to four which corresponds to the fourth operating point in Table 7.2. 7.4 S i m u l a t i o n Resu l t s The proposed computation control mechanism has been applied to MPEG-2 and H.263 encoders. Several encoding speeds using different sequences were used in our simulation results. For the MPEG-2 encoder, we used three video sequences ( B A L L E T , BUS and B I C Y C L E ) . The video sequences were encoded at target bit rates ranging from 3 Mbps to 9 Mbps. For the H.263 encoder, 124 Sequence Target frame rate encoding frame rate (frame/sec) (frame/sec) 10 9.70 B A L L E T 7 6.97 5 5.10 10 9.52 B I C Y C L E 7 6.98 5 4.98 10 9.59 BUS 7 6.84 5 5.05 Table 7.5: Computation control results for MPEG-2 encoding using a 450 MHz Pentium III processor. we used both the CIF and QCIF sequences ( F O R E M A N , A K I Y O , and C O A S T -G U A R D ) . The video sequences were encoded at target bit rates ranging from 128 to 256 kbps for CIF sequences and 32 to 64 kbps for QCIF sequences. Ta-bles 7.5, 7.6 and 7.7 show the target and actual encoding frame rates for several test sequences for MPEG-2 and H.263 encoding, respectively. We can see that the target encoding frame rate has been achieved with relatively high accu-racy. The impact of the computation control method on the rate-distortion performance of the MPEG-2 and H.263 encoders are shown in Figure 7.2. The PSNR loss is less than 0.1 dB for all target bit rates for both encoders. 7.5 S u m m a r y In this chapter, we introduced a computation control mechanism for DCT-based video encoding. The proposed method is developed within the proposed 125 Bit rate (Mbps) (a) 33.51 1 1 1 1 r 301 1 1 1 1 1 1 1 120 140 160 180 200 220 240 260 Bit rate (kbps) (b) Figure 7.2: Rate-distortion curves with computation control (solid line) and without computation control (dashed line) for (a) MPEG-2 encoding of the sequence BUS and (b) H.263 encoding of the CIF sequence F O R E M A N . 126 Sequence Target frame rate encoding frame rate (frame/sec) (frame/sec) 31 31.19 A K I Y O 26 25.81 21 20.90 31 30.42 C O A S T G U A R D 26 25.60 21 20.94 31 31.16 F O R E M A N 26 25.62 21 20.92 Table 7.6: Computation control results for H.263 encoding of CIF sequences using a 450 MHz Pentium III processor. Sequence Target frame rate encoding frame rate (frame/sec) (frame/sec) 55 54.77 A K I Y O 50 50.01 45 45.20 55 54.55 C O A S T G U A R D 50 49.98 45 45.25 55 54.56 F O R E M A N 50 50.01 45 45.18 Table 7.7: Computation control results for H.263 encoding of QCIF sequences using a 450 MHz Pentium III processor. 127 flexible framework for video encoding that yields very good computation-performance tradeoffs. The proposed computation control mechanism is based on a feedback control system that adaptively change the encoding parameters to compensate for the mismatch between the measured encoding time and the target encoding time. The proposed computational control method was applied to MPEG-2 and H.263 encoding. The experimental results show that accurate speed-performance control was achieved using the proposed method. 128 Chapter 8 Conclus ion and Future Study 8.1 C o n t r i b u t i o n of T h e Thes is Video encoding standards such as the ITU-T H.261, H.273, H.263+, ISO/IEC MPEG-1 and MPEG-2 are based on motion compensated prediction and D C T residual coding. Most of these encoders are costly and rigid. Such encoders are implemented either in hardware using current VLSI technologies or in software using parallel computers. Many video encoding systems have been designed based on dedicated hardware, and cost-efficient solutions have been proposed by several manufacturers. While dedicated hardware solutions can presently meet the real time implementation of these encoders, software solutions which run on real time on existing general media processors are yet to be developed. The objective of this thesis is to develop a DCT-based video encoding framework for the design of flexible, real-time software solutions using me-dia processors (e.g., Pent ium/MMX). We achieve this goal by introducing a 129 flexible encoding framework that consists of several parameterized core com-ponents (motion estimation, quantized D C T , mode selection) and a method for developing encoders (based on such components) with different, controlled levels of performance and computational complexity. In the following section, we will list the main accomplishments of the thesis. 8 . 1 . 1 S p a t i o - T e m p o r a l M o t i o n V e c t o r P r e d i c t i o n The first contribution of the thesis is the introduction of the new motion vector tracking method for video sequences that is based on spatio-temporal predic-tion. In temporal prediction, the motion vector of a moving object is tracked from one frame to another using a projection method. The accuracy of the tem-porally predicted motion vector is estimated using a reliability measure that allows the motion search algorithm to decide whether to use the temporal mo-tion vector predictor or not. The predicted motion vector is selected based on its corresponding SAD. The predicted motion vector that yields the smallest SAD is chosen. The experimental results show that the spatial-temporal pre-diction reduces the number of computations performed by the motion search algorithm by up to 25% while maintaining the same quality. 8 . 1 . 2 H i e r a r c h i c a l B l o c k M a t c h i n g S t r a t e g y The second contribution of the thesis is the implementation of an efficient hi-erarchical block matching strategy, and a floating-center diamond search path to achieve the best possible computation-performance tradeoffs. Our approach 130 uses set of threshold parameter that control the computational complexity and performance of the hierarchical motion estimation algorithm. There are effec-tively seven threshold parameters in the MPEG-2 encoder, and three threshold parameters in the H.263 encoder that control the computational complexity and performance of the hierarchical motion estimation algorithm. This ap-proach adds more flexibility to the encoder and allow it to exit motion esti-mation when the computation constraints are met or a best match is found. For instance the encoder can stop motion estimation once the low-resolution motion estimation finds a good match. 8 . 1 . 3 V a r i a n c e - B a s e d M o d e S e l e c t i o n The third contribution of the thesis is the use of variance-based coding mode selection. The coding mode is selected based on variance measures of the original and motion-compensated MBs. This method is significantly simpler than the optimal rate-distortion-based, it is also five times faster and yields a performance level within 0.6 dB of the optimum one. 8 . 1 . 4 T h e Q u a n t i z e d D C T ( Q D C T ) The fourth contribution of the thesis is the joint implementation of the quan-tization and D C T step which we call the QDCT. The motion compensation residual blocks are D C T transformed, quantized, and variable-length encoded. To reduce the number of computations associated with the D C T and quantiza-tion steps, explicit quantization is avoided by embedding quantization into the 131 D C T computation. Further savings in computations are obtained by termi-nating the D C T whenever intermediate results indicate that the transform and quantization steps will likely result in a block of zero values. Besides signifi-cantly reducing the number of computations, the early termination techniques offer the encoder flexible control over the QDCT accuracy and number of computations. 8 . 1 . 5 A d a p t i v e C o m p u t a t i o n - D i s t o r t i o n O p t i m i z a t i o n A l -g o r i t h m Perhaps the most important contribution of the thesis is the adaptive computation-distortion optimization. The challenging task is the joint optimization of the encoding parameters for a given global computational constraint. Each en-coding parameter controls the computation-distortion performance of the al-gorithm by changing the operation of a particular sub-component. A "greedy" design procedure is proposed that yields parameter sets corresponding to the best global computation-distortion operating points for a complete encoder. The proposed method leads to more than a 2:1 reduction in the number of computations without significant loss in subjective and objective quality. Ex-perimental results show that the greedy algorithm achieves performance that is comparable to the exhaustive optimization method at a fraction of the com-plexity. 132 8 . 1 . 6 C o m p u t a t i o n C o n t r o l A l g o r i t h m The last contribution of the thesis is the computation control mechanism. The computation control mechanism allows the encoding algorithm to adapt itself to the available computational resources. The proposed control mecha-nism is based on a feedback control system that adaptively adjusts the encod-ing parameters to compensate for the mismatch between the encoding time and the target time. The experimental results show that an accurate speed-performance control is achieved. 8.2 Top ics for F u t u r e S t u d y Even though the proposed video coder presented in this thesis provides ex-cellent computation performance tradeoffs, the following subjects can further improve the scope and the quality of our research work. 8 . 2 . 1 H . 2 6 3 A n n e x e s In our simulations, we used the H.263 baseline encoder. There is a number of optional modes of operation in H.263 that can improve quality and per-formance. For instance, unrestricted motion vectors and advance prediction modes are used to improve inter picture prediction. We believe that the encod-ing performance can be improved by considering all possible modes of H.263. This will also increase the flexibility of the encoder. 133 8 . 2 . 2 B - P i c t u r e The encoding framework used in this thesis is developed for only I and P picture types. B pictures can be utilized to increase the encoder flexibility. It will, however, increase the encoding complexity. The increase in complexity is due to the dual number of motion vector estimation and the greater number of coding modes. If motion vectors can be predicted without having to perform motion estimation, then the complexity can be reduced. 8 . 2 . 3 N o n - S e p a r a b l e Q D C T In this thesis, we used fast (Chen) [36] algorithm for an eight-point 1-D Q D C T from which the 2-D Q D C T was computed using separability property of the QDCT. Like the D C T , a true 2-D QCT, such as Cho [36] approach, yields lower complexity than the separable extensions of the 1-D QDCT. Therefore, the implementation of the true 2-D QDCT will be a good direction to pursue. 8 . 2 . 4 B e t t e r C o m p u t a t i o n C o n t r o l The feedback control system used in our encoding framework is simple and efficient. It is based on simultanusly changing all the encoding parameters to compensate for the encoding speed mismach. We believe that better com-putation control can be achieved by using more advanced feedback control system. 134 8 . 2 . 5 O t h e r M e d i a P r o c e s s o r I m p l e m e n t a t i o n s Due to their high computational demand, the current solutions for real-time DCT-based video encoding and decoding are based principally on VLSI im-plementations such as the custom hardware (ASIC) systems. In this thesis we presented a highly optimized, low complexity DCT-based video encoding framework that run in real-time on general propose processor. Implementation of our encoding framework on a digital signal processor (DSP) such as Texas Instruments' (TI) C62x has significant merits and may be a good research direction. 135 Bibl iography [1] David A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proc. IRE, pp. 1098-1101, September 1952. [2] I. Ziv and A. Lempel, "A Universal Algorithm for Data Compression," IEEE Transaction on Information Theory, IT-23(3):337-343, May 1977. [3] G. Langdon, "An introduction to arithmetic coding," IBM J. Res. Dev., vol. 28, pp. 135-149, Mar. 1984. [4] I. H . Witten, R. M . Neal, and J . G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, pp. 520-540, 1987. [5] A . Gersho and R. M . Gray, Vector Quantization and Signal Compression. Boston: Kluwer Academic Publishers, 1992. [6] N . Nasrabadi and R. King, "Transform coding using vector quantization," in IEEE Int. Conf. on Information Science and Systems, (The John Hopkins University, Baltimore, Maryland, USA), pp. 23-25, Mar. 1983. 136 [7] B . Mandelbrot, The fractal geometry of mature. Freeman, 1982. [8] G. Lu, "Fractal image compression," Signal Processing: Image Commu-nication, vol. 5, pp. 327-343, 1993. [9] J. W. Woods, ed., Subband Image Coding. Norwell, M A : Kluwer Academic Publishers, 1991. [10] J. W. Woods and S. D. O'Neil, "Subband coding of images," IEEE Trans, on Acoustics, Speech, and Signal Processing, vol. 34, pp. 1278-1288, Oct. 1986. [11] H . Gharavi and A. Tabatabai, "Sub-band coding of monochrome and color images," IEEE Trans, on Circuits and Systems, vol. 35, pp. 207-214, Feb. 1988. [12] M . Kunt, "Second-generation image-coding techniques," Proc. of the IEEE, vol. 73, pp. 549-574, Apr. 1985. [13] M . Antonini, M . Barlaud, P. Mathieu, and I. Daubechies, "Image coding using wavelet transform," IEEE Trans, on Image Processing, vol. 1, pp. 205-220, Apr. 1992. [14] ISO/IEC JTC1/SC29/WG11, Test Model 5. M P E G 93/457, Apr. 1993. [15] S. Zafar, Y . Zhang, and J. Baras, "Predictive block-matching motion esti-mation for T V coding— Part I: Inter-block prediction," IEEE Trans, on Broadcasting, vol. 37, pp. 97-101, Sept. 1991. [16] Y . Zhang and S. Zafar, "Predictive block-matching motion estimation 137 for T V coding— Part II: Inter-frame prediction," IEEE Trans, on Broadcasting, vol. 37, pp. 102-105, Sept. 1991. [17] R. Armitano, R. Schafer, F. Kitson, and V . Bhaskaran, "Linear predic-tive coding of motion vectors," in Proc. of the Int. Picture Coding Symposium, pp. 233-236, Sept. 1994. [18] W. Chung, F. Kossentini, and M . Smith, "A new approach to scalable video coding," in IEEE Data Compression Conference (DCC), (Snow-bird, UT , USA), pp. 381-390, Mar. 1995. [19] W. Chung, F. Kossentini, and M . Smith, "Rate-distortion constrained statistical motion estimation for video coding," in Proc. of the Int. Conf. on Image Processing (ICIP), vol. 3, (Washington, DC), pp. 184— 187, Oct. 1995. [20] Y . W. Lee, R. K . Ward, F. Kossentini, and M . J . T. Smith, "Very low rate DCT-based video coding using dynamic V Q , " in Proc. of the Int. Conf. on Image Processing (ICIP), (Lausanne, Switzerland), pp. 669— 672, Sept. 1996. [21] Y . Lee, F. Kossentini, R. Ward, and M . Smith, "Towards MPEG-4: An improved H.263-based video coder," Signal Processing: Image Com-munication: Special Journal Issue on MPEG-4, v ° l - 10, pp. 143-148, July 1997. [22] Y . Lee, F. Kossentini, M . Smith, and R. Ward, "Predictive RD-constrained motion estimation for very low bit rate video coding," 138 Special Issue of the IEEE Trans, on Selected Areas in Communica-tions, vol. 15, pp. 1752-1763, Dec. 1997. [23] Y . Lee, F. Kossentini, and R. Ward, "Efficient MPEG-2 Coding of In-terlaced Video," Canadian Journal of Electrical and Computer Engi-neering, vol. 23, no. 1-2, pp. 61-67, 1998. [24] T. Wiegand, M . Lightstone, D. Mukherjee, T. Campbell, and S. Mitra, "Rate-Distortion Optimized Mode Selection for Very Low Bit Rate Video Coding and the Emerging H.263 Standard," IEEE Trans, on Circuits and Systems for Video Technology., vol. 6, pp. 182-190, Apr. 1996. [25] A . Schuster and A. Katsaggelos, "Fast and efficient mode and quantizer selection in the rate-distortion sense for H.263," in SPIE Proc. Visual Communications and Image Processing, vol. 2727', pp. 784-795, 1996. [26] ISO/TEC 13818-2—ITU-T Rec. H.262, Generic Coding of Moving Pic-tures and Associated Audio Information: Video. ISO/IEC, 1995. [27] ITU Telecom. Standardization Sector of ITU, "Video Coding for Low Bit rate Communication," ITU-T Recommendation H.263 Version 2, January 1998. [28] ITU-T Recommendation H.261,"Video Codec for Audiovisual Services at px64 kbits/sec", August 1990. [29] ITU Telecom. Standardization Sector Study Group 15, "Document L B C -139 95 Working Party 15/1 Question 2/15," Draft Recommendation H.263, Leidschendam, 7 April 1995. [30] ISO. Cdl l l72-2: , Coding of Moving Pictures and Associated Audio for Digital storage Media at up to About 1.5 Mbits/s, November 1991. [31] ISO/IEC Final Committee Draft 14496-2, "Coding of Audio-visual Ob-jects: Visual.", 1998. [32] K . P. Rao and P. Yip , Discrete Cosine Transforms: Algorithms, Advan-tages, Applications. New York: Academic Press, 1990. [33] N . Jayant, J . Johnston, and R. Safranek, "Signal compression based on models of human perception," Proc. of the fEEE, vol. 81, pp. 1385— 1422, Oct. 1993. [34] A . N . Netravali and B. G. Haskell, Digital Picture: Representation, Com-pression and Standards. Plenum Press, second edition, 1995. [35] J. Jain and A. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans, on Communications, vol. 29, pp. 1799-1808, Dec. 1981. [36] V . Bhaskaran and K . Konstantinides, Image and Video Compression Standards: Algorithms and Architecture. Boston: Kluwer Academic Publishers, 1995. [37] M . Gallant, G. Cote, and F. Kossentini, "A computation constrained block-based motion estimation algorithm for low bit rate video cod-140 ing," IEEE Trans, on Image Processing, vol. 1, pp. 467-471, Mar. 1999. [38] B . Girod, "Motion-compensating prediction with fractional-pel accuracy," IEEE Trans, on Communications, vol. 41, pp. 604-612, Apr. 1993. [39] M . Miyahara, "Quality assessment for visual service," IEEE Communi-cations Magazine, vol. 26, pp. 51-60, Oct. 1988. [40] P. Barten, "The squared root integral (SQRI): a new metric to describe the effect of various display parameters on perceived image quality," Proc. SPIE, vol. 1077, pp. 73-82, 1989. [41] M . Miyahara, K . Kotani, and V . R. Akgazi, "Objective picture qual-ity scale (PQS) for image coding," IEEE Trans, on Communications, vol. 46, pp. 1215-1226, Sept. 1998. [42] Testing A d Hoc Group, "Jpeg2000 testing results," ISO/IEC JTC1J/SC29/WG1/N705, Nov. 1997. [43] H. Everett, "Generalized Lagrange multiplier method for solving prob-lems of optimum allocation of resources," Operation Research, vol. 1, pp. 399-417, 1963. [44] T. Berger, Rate Distortion Theory. New Jersey: Prentice-Hall, Inc., 1971. [45] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," in IRE National Convention Record, Part 4, pp. 142-163, 1959. Also in Information and Decision Processes, R. E. Machol, Ed. New York, N Y : McGraw-Hill, 1960, pp. 93-126. 141 [46] G. Sullivan and T. Wiegand, "Rate-distortion optimization for video com-pression," IEEE Signal Processing Magazine, vol. 15, pp. 74-90, Nov. 1998. [47] A . Ortega and K . Ramchandran, "Rate-distortion methods for image and video compression," IEEE Signal Processing Magazine, vol. 15, pp. 23-50, Nov. 1998. [48] G. Cote, B. Erol, M . Gallant, and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Trans, on Circuits and Systems for Video Technology, vol. 8, pp. 849-866, Nov. 1998. [49] CCIR Recommendation 601. Encoding parameters of digital television for studios, 1982 [50] T. Zahariadis and D. Kalivas, "A spiral search algorithm for fast esti-mation of block motion vectors," In Proc. European Signal Processing Conf. (EUSIPCO-96), vol. 2, pp. 1079-1082, September 1996. [51] L. Liu and E. Feig, "A block-based gradient descent search algorithm for block motion estimation in video coding," IEEE Trans, on Circuit and Video Technology, vol. 6, no. 4, pp. 419-422, August 1996. [52] J. Mitchell, W. Pennebaker, C. Fogg, and D. LeGall, " M P E G video com-pression standard," Chapman & Hall, International Thomson Pub-lishing, 1996. [53] J. Chalidabhongse, S. K i m , and C. Kuo, "Fast motion vector estimation by using spatio-temporal correlation of motion field," in SPIE Proc. 142 Visual Communications and Image Processing, vol. 2501, pp. 810-821, 1995. [54] H . Sonehara, Y . Nojiri, K . Iguchi, Y . Sugiura, and H . Hirabayashi, "Mo-tion estimation method using the spatiotemporal characteristics of moving objects," SMPTE Journal, pp. 340-347, June, 1998. [55] M . Mattavelli, and G. Zoia, "Vector tracing techniques for motion esti-mation algorithms in video coding," European Signal Processing Conf. (EUSIPCO-98), August 1998. [56] D. K i m , J. Choi, and I. K i m , "Adaptive motion estimation based on spatio-temporal correlation," Signal Processing: Image Communica-tion,, vol. 13, pp. 161-170, 1998. [57] B. Schwartz and K . Sauer, "Efficient block motion estimation using inte-gral projection," IEEE Trans, on Circuit and Video Technology, vol. 6, no. 5, pp. 513-518, October 1996. [58] J . Chalidabhongse, S. K i m , and C. Kuo, "Fast motion estimation by using spatio-temporal correlation of motion field," in SPIE Proc. Visual Communications and Image Processing, vol. 2501, pp. 810-821, 1995. [59] B. Erol, F. Kossentini, and H. Alnuweiri, "Efficient Coding and Mapping Algorithms For Software-only Real-Time Video Coding at Low Bit Rates," IEEE Trans, on Circuits and Systems for Video Technology, 1998. To appear. [60] W. Chung and F. Kossentini and M . Smith, "An Efficient Motion Esti-143 mation Technique based on a Rate-Distortion Criterion", in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) vol. 4, Atlanta, G A , pp. 1926-1929, may 1996. [61] ITU Telecom. Standardization Sector of ITU, "Video Codec Test Model Near-Term, Version 11 (TMN11), Release 0," H.263 Ad Hoc Group, July 1999. [62] K . Nguyen-Phi, A . Docef, and F. Kossentini, "Quantized discrete co-sine transform: A combination of D C T and scalar quantization," in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3197-3200, 1999. [63] A . Yu, R. Lee, and M . Flynn, "Performance enhancement of h.263 encoder based on zero coefficient prediction," in ACM Multimedia, 1997. [64] I. Pao and M . Sun, "Modeling dct coefficients for fast video encoding," IEEE Trans, on Circuits and Systems for Video Technology, pp. 608-616, June 1999. [65] B . Girod and K . Stuhlmuller, "A content-dependent fast dct for low bit-rate video coding," in Proc. of the fnt. Conf. on Image Processing (ICIP), vol. 3, pp. 80-84, 1998. [66] I. Ismaeil, A . Docef, F. Kossentini, and R. Ward, "Computation-performance control for DCT-based video coding," in Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Turkey, 2000. 144 [67] University of British Columbia, " T M N (H.263) encoder/decoder, version 3.2," TMN (H.263) codec. Available at http://spmg.ece.ubc.ca/h263plus/. [68] ITU Telecom. Standardization Sector of ITU, "Video Codec Test Model Near-Term, Version 8 (TMN8), Release 0," H.263 Ad Hoc Group, June 1997. [69] I. Ismaeil, A . Docef, F. Kossentini, and R. Ward, "Motion estimation using long term motion vector prediction," in Int. Conf. on Image Processing (ICIP), vol. 1, pp. 70-75, Japan, Oct. 1999. [70] K . Hwang, Computer Arithmetic: Principles, Architecture and Design. New York: Wiley, 1979. [71] A . Avizienis, "Signed digit number representation for fast parallel arith-metic," IRE Trans. Inform. Theory, pp. 389-400, Sept. 1961. [72] L. Breiman, J . H . Friedman, R. A . Olshen, and C. J . Stone, Classification and Regression Trees. Belmont,California: Wadsworth, 1984. [73] I. Ismaeil, A . Docef, F. Kossentini, and R. Ward, "A computation-distortion optimized framework for efficient DCT-based video coding," submitted to the IEEE Trans, on Multimedia, May 2000. [74] Intel's M M X application notes, "http://developer.intel.com/drg/mmx/appnotes/". 145
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Computation-distortion optimized DCT-based video coding
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Computation-distortion optimized DCT-based video coding Ismaeil, Ismaeil-Ragab 2001
pdf
Page Metadata
Item Metadata
Title | Computation-distortion optimized DCT-based video coding |
Creator |
Ismaeil, Ismaeil-Ragab |
Date Issued | 2001 |
Description | The rapidly expanding field of multimedia communications has fueled significant research and development work in the area of real-time video encoding. Dedicated hardware solutions have reached maturity and cost-efficient hardware encoders are being developed by several manufacturers. However, software solutions based on general purpose processors or programmable digital signal processors (DSPs) have significant merits. Towards this objective, we have developed a flexible framework for video encoding that yields very good computation-performance tradeoffs. The proposed framework consists of a set of optimized core components: motion estimation, the Discrete Cosine Transform (DCT), quantization, and mode selection. Each of the components can be configured to achieve a desired computation-performance tradeoff. The components can be assembled to obtain encoders with varying degrees of computational complexity. Computation control has been implemented within the proposed framework to allow the resulting algorithms to adapt to the available computational resources. The proposed framework was applied to MPEG-2 and H.263 encoding using Intel's Pentium/MMX desktop processor. Excellent speed-performance tradeoffs were obtained. |
Extent | 8286540 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-10-09 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065595 |
URI | http://hdl.handle.net/2429/13870 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2001-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2001-726460.pdf [ 7.9MB ]
- Metadata
- JSON: 831-1.0065595.json
- JSON-LD: 831-1.0065595-ld.json
- RDF/XML (Pretty): 831-1.0065595-rdf.xml
- RDF/JSON: 831-1.0065595-rdf.json
- Turtle: 831-1.0065595-turtle.txt
- N-Triples: 831-1.0065595-rdf-ntriples.txt
- Original Record: 831-1.0065595-source.json
- Full Text
- 831-1.0065595-fulltext.txt
- Citation
- 831-1.0065595.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065595/manifest