RATE-COMPUTATION OPTIMIZED B L O C K BASED VIDEO CODING By Yuen-Wen Lee B . A . Sc. (Electrical Engineering) Chung-Yuen Christian University, 1986 M . A . Sc. (Electrical Engineering) Syracuse University, 1992 A THESIS SUBMITTED IN PARTIAL F U L F I L L M E N T OF T H E REQUIREMENTS FOR T H E D E G R E E OF D O C T O R OF PHILOSOPHY in T H E FACULTY OF G R A D U A T E STUDIES D E P A R T M E N T OF ELECTRICAL AND C O M P U T E R ENGINEERING We accept this thesis as conforming to the required standard T H E UNIVERSITY OF BRITISH COLUMBIA April 1998 © Yuen-Wen Lee, 1998 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Electrical and Computer Engineering The University of British Columbia 2075 Wesbrook Place Vancouver, Canada V 6 T 1W5 Date: Abstract This dissertation reports on rate-distortion optimized and computation-constrained video coding algorithms which yield effective and efficient block-based video encoders. We first propose a new cost function for motion estimation which can achieve good tradeoffs among motion vector bit rate, compensation distortion, and number of computations. Based on the new cost function, we propose a predictive motion estimation algorithm which can effectively determine the size of the search region according to the statistics of the motion field. The proposed motion estimation algorithm allows control of the motion vector bit rate and the computational cost, simultaneously. We also develop two efficient rate-distortion optimized coding mode selection algorithms, based on thresholding and finite state machine (FSM) methods, respectively. The thresholding method can eliminate consideration of the computationally expensive modes. The F S M method determines the test orders of the modes and associated parameters by exploiting the local statistics of the input video sequence. The proposed two methods can reduce substantially the computation requirements as compared to the full search mode selection method. Based on the proposed motion estimation and mode selection algorithms, we implement an H.263-based video coder and an M P E G - 2 compliant video coder for very low bit rate video applications, and high bit rate interlaced video applications, respectively. The resulting video coders achieve excellent tradeoffs among bit rate, quality, and computational complexity. In fact, experimental results show that our video coders outperform the best known video coders in terms of compression performance and encoding speed. ii Table of Contents Abstract ii List of Tables vii List of Figures viii Acknowledgments xi 1 Introduction 1 2 Background 5 2.1 Information Theory Concepts .' . 5 2.1.1 Entropy . : 6 2.1.2 Joint Entropy and Conditional Entropy 6 2.1.3 Mutual Information '. 7 2.2 - Entropy Coding . . = 8 2.2.1 Huffman Coding 8 2.2.2 Arithmetic Coding 9 2.3 Rate-Distortion Theorem Concepts 10 2.4 Hybrid Video Coding 12 2.5 Motion Estimation 15 2.6 2.5.1 Block Matching Algorithms 2.5.2 Fast Motion Estimation Algorithm Discrete Cosine. Transform (DCT) • 16 18 . iii 25 3 2.7 D C T Coefficient Quantization and Coding 26 2.8 Video Coding Standard 27 The H.263 Video Coding Standard ' 2.8.2 The M P E G - 2 Video Coding Standard 27 28 Block-Based Motion Estimation 29 3.1 Motivation 29 3.2 Motion Estimation with Rate-Computation Constraint 3.3 3.4 3.5 4 2.8.1 . . . 31 3.2.1 Rate-Distortion Optimized Motion Estimation 35 3.2.2 Computation-Distortion Optimized Motion Estimation 35 Proposed Motion Estimation Algorithm 3.3.1 Starting Point: Prediction 3.3.2 Search Path . 3.3.3 Termination Point Algorithm 36 • 38 : 41 44 Simulation Results . 46 3.4.1 Initialization 46 3.4.2 Search Path 49' 3.4.3 Termination Point Summary ' . . . .. 51 . 53 Rate-Distortion Optimized Mode Selection 54 4.1 Multi-mode Video Coding 54 4.1.1 Macroblock Coding Mode . . . " 55 4.1.2 Conventional Mode Selection Algorithms 56 4.2 4.3 Rate-Distortion Optimized Coding Mode Selection'. 56 4.2.1 58 Exhaustive Search: Performance Upper Bound Fast Mode Selection Algorithm I: Thresholding Method IV 59 4.4 4.5 5 61 4.4.1 Finite-State Machine (FSM) 62 4.4.2 The Reference Lagrangian JR 62 4.4.3 Implementation Issue 64 Summary 66 Application: H.263 Based Low Bit Rate Video Coding 67 5.1 Motivation and Background 67 5.1.1 The H.263 Standard: Overview 69 5.1.2 The H.263 Bit Stream Structure 70 5.1.3 H.263 Macroblock Layer 72 5.2 6 Fast Mode Selection Algorithm II: Finite State Machine Method Proposed Coder ' 5.2.1 I-picture M B Coding Strategy . . . 5.2.2 P-picture M B Coding Method . 5.2.3 Rate Control 5.3 Experimental Results 5.4 Summary 74 75 . 76 82 . 82 91 Application: M P E G - 2 Video Coder 92 6.1 92 6.2 M P E G - 2 Overview 6.1.1 Profile and Level : 6.1.2 Fields, Frames and Picture 6.1.3 Motion Prediction 6.1.4 Frame D C T and Field D C T 6.1.5 Bit Stream Syntax and Macroblock Layer 94 - 95 97 / Computation-Distortion Optimized Motion Estimation 6.2.1 93 Motion Vector Prediction 98 . . 99 101 v 6.3 6.4 6.5 7 6.2.2 Search Path and Search Termination ' 6.2.3 Hierarchical Structure 103 6.2.4 Implementation Issue 105 Fast Rate-Distortion Optimized Mode Selection • 102 106 6.3.1 R D Optimized Mode Selection 107 6.3.2 Encoding of I-Pictures 112 6.3.3 Encoding of P-Pictures 114 6.3.4 The Finite State Machine (FSM) . . . ., 115 6.3.5 Threshold JR and Rate Control . 118 . Simulation Results . . 119 6.4.1 CD Optimized Motion Estimation 120 6.4.2 R D Optimized M P E G - 2 Video Coder 125. Summary ' 129 Conclusion and Future Study 7.1 Contributions of the Thesis 7.2 Topics for Future Study 130 ". 130 134 Bibliography 136 VI List of Tables 2.1 The computational cost for different motion estimation algorithms . . . . 25 3.2 Entropy of motion vector x (Miss America sequence) 48 3.3 Entropy of motion vector x (Car Phone sequence) 48 3.4 Entropy of motion vector x for F S M (Miss America sequence) 49 3.5 Entropy of motion vector x for F S M (Car Phone sequence) 5.6 H.263 source picture formats 5.7 Coded symbols for R 6.8 The available macroblock mode types for P-frame. 6.9 The entropy, joint entropy, and mutual-information for ( M , Q) M . ; 49 70 and R for various coding modes. . 73 . . 98 A '. Ill 6.10 The entropy and conditional entropy for coding mode M 112 6.11 Required number of computation for full search motion estimation . . . . 122 6.12 Computational performance for the 2D-log. search and our search . . . . 123 6.13 Computational performance-for the hierarchical search and our search . . 125 vii List of Figures 2.1 Generic hybrid video coder and decoder 13 2.2 System diagram.for block and D C T based video coder 14 2.3 Full search motion estimation process 17 2.4 The.2D logarithmic search 20 2.5 The 3-step search 2.6 The parallel hierarchical one-dimensional search (PHODS) 2.7 The hierarchical search 23 3.8 Profiles of SAD and Lagrangian J= D + 0-C 37 3.9 A spiral'search path 42 '. 21 . 22 3.10 A diamond-shaped search path 43 3.11 Floating center search path 44 3.12 Correlation coefficients between motion vectors 48 3.13 P S N R vs. computations for different search paths 50 . 3.14 P S N R vs. computations for different termination methods . . .' 52 . 4.15 Compression performance for the T M 5 and the RD .optimized coder . . . 60 4.16 The J * , coding mode, and quantization step for a frame from Football . . 63 5.17 Bit rates of motion vector, D C T , and side information for T M N 5 coder . 68 5.18 H.263 video sequence layers 71 5.19 The regions of support used by H;263 and our coder. -79 5.20 The search area used during motion estimation 80 /Vlll 5.21 The P S N R improvement clue to the R D optimized coding mode strategy 83 5.22 Bit rate of motion vector, D C T , and side information for our coder . . . 84 5.23 The P S N R loss due to the new motion vector estimation/coding method. 85 5.24 The P S N R improvement due to the M-search technique 86 5.25 The rate-distortion profile for our coder (constant A) 87 5.26 The rate-distortion profile for our coder (varying A) 88 5.27 Overall comparison between our coder and Telenor's coder 90 6.28 M P E G - 2 Profile and Level 93 6.29 Interlaced video frame consists of two fields 95 6.30 M P E G - 2 Motion vector type 96 6.31 M P E G - 2 D C T type 97 6.32 The bit rate distribution for D C T , motion vector, and side information . 100 6.33 The region of support (ROS) 101 6.34 Diamond-shaped search area used in motion estimation . . . . . . . . . . 103 6.35 The search region on full resolution level and half-pel resolution level . . 104 6.36 P S N R of the T M 5 coder and the full-search R D optimized coder . . . . . 108 6.37 P S N R of the T M 5 coder and the full-search R D optimized coder 109 6.38 Joint distribution function p(M, Q) HQ 6.39 The histogram of the quantization step step prediction error 113 6.40 M B mode selection: I-pictures 114 6.41 M B mode selection: P-pictures 116 6.42 The ROS used in this work ; . . 118 6.43 C C I R 601 video sequences 121 6.44 P S N R performance as a function of computation for our algorithm . . . . 124 6.45 P S N R as a function of bit rate 126 IX 6.46 Buffer fullness'and A variation. . . ' 6.47 P S N R and bit rate as a function of time x •. . 127 128' Acknowledgments First, my greatest debt of gratitude goes to my supervisors, Professor Rabab K . Ward and Professor Faouzi Kossentini, who have introduced and guided me into this exciting field of research, and provided sound advice and invaluable critical feedback at every stage of this project. Without their support, this thesis could not have been written. Second, I want to thank my parents for their encouragement and support. I also want to thank my wife Nancy for her patience and support in the past years. I also want to thank Michael Gallant and Ian Caven in the Image Lab. for their valuable suggestions and discussions. I also wish to acknowledge Julong Du, Xiaoli L i and other colleagues in the Image group and the S A R group for their friendship and encouragement during the long years. I also want to express my special thanks to Mei X u for her assistance in simulations and implementations in the M P E G - 2 project. xi Chapter 1 Introduction Storage and transmission of data is becoming more important in modern society. Data compression is an important tool for reducing the cost of storing and transmitting data. This is especially true for the case of video signals, whose tremendous amount of data require large spaces and wide bandwidths for storage and transmission, respectively. For example, the bit rate of a 30 frames/sec CCIR-601 digital video sequence is 248.83 Mbits/sec, and storing just a 5-minute clip of such a video sequence requires a 4.66gigabyte storage space. Without video compression, the demands of digital video storage and transmission cannot be met by present or near-future communication infrastructures. Recently, several successful and effective video coding standards have emerged, namely M P E G - 1 , H.261, M P E G - 2 , and H.263 [31, 73, 34, 36]. One common property of such coding standards is that they all employ block-based motion compensation and D C T coding algorithms for removing spatio-temporal redundancies in the video sequence. For simplicity, in the following, we will call such video encoders block-based encoders. During the encoding process, generally, the block-based encoders determine and transmit the type of coding algorithm (the coding mode) and associated parameters (motion vector, quantization step, etc.) for each macroblock to the decoder. The goal of this thesis is to develop block-based video encoders that are rate-dist'ortioncomputation optimized. More specifically, the proposed video encoders select the coding mode, motion vector(s), and associated parameters (such as quantization-step sizes, etc.) that yield the minimum Lagrangian value J = D + X R, where D and R are the associated 1 Chapter 1. Introduction 2 distortion and bit rate. For simplicity, we assume that no dependencies exist between macroblocks. Therefore, the constrained optimization is performed independently for each macroblock. Theoretically, in order to achieve optimal R D performance, all combinations of various coding modes, motion vectors, and quantization step sizes should be considered to find the minimum Lagrangian J. However, this approach is clearly impractical. For example, for a H.263 video standard [36] compliant video coder, to find the minimum value of J one would need to consider 19851 possible different cases for one 16x16 macroblock. To simplify this problem, we first decouple motion vector estimation from coding. The motion vector will be determined using our proposed motion estimation algorithm, and the coding mode and the quantization step size will be determined using our proposed mode selection algorithm. This decoupling reduce greatly the video encoding computational complexity. Unfortunately, however, the computation demands of the resulting R D optimized video coder can still be very large. This is clue mainly to two computation intensive components: motion estimation and coding mode selection. To overcome the motion estimation problem, this thesis proposes a new predictive motion estimation algorithm that will yield good tradeoffs .among motion compensation distortion, motion vector bit rate, and computational cost. First, a new cost function for motion estimation which is a weighted sum of the three quantities in a Lagrangian formulation is introduced. The advantage of the proposed cost function is that via the parameters in the cost function, the motion vector bit rate and the computational cost can be carefully controlled. Second, the search area is allowed to expand or contract as a function of the local statistics of the motion field. Simulation results show the proposed predictive motion estimation algorithm can lead to a dramatic increase in motion estimation speed for only a small loss in compression performance. A unique feature of our algorithm is that the reconstruction quality of the associated video encoder is Chapter 1. Introduction 3 proportional to the processing power. To overcome the mode selection problem, we propose an efficient rate-distortion (RD) optimized coding mode selection method. The new method chooses the coding mode and quantization step size that give the minimum Lagrangian value J. A direct result of the above method is that significantly better R D performance can be obtained. Theoretically, in order to achieve the optimum R D performance, all combinations of the coding mode and the quantization step size should be exhaustively searched. However, such an approach requires a considerable amount of computations. To solve this problem, thresholding and finite state machine (FSM) algorithms are proposed. Thresholding allows the encoder to safely eliminate the computationally expensive coding options from consideration. In F S M case, the search order of the coding mode and quantization step size are determined by the conditional probabilities. The proposed motion estimation and the coding mode selection algorithms have been implemented-in the context of an H.263-based video encoder for operation at very low bit rates (lower than 16 kbps), and in the context of an M P E G - 2 compliant video encoder for operation at high bit rates. The proposed H.263-based video encoder employs: (1) a'fast median-based predictive motion searching technique, (2) an R D optimized motion estimating algorithm, and (3) a thresholding-based R D optimized coding mode selection method. The above algorithm take into account the fact that, at low bit rates, the motion vector bit rate is a significant part of the total bit rate. The proposed M P E G - 2 video encoder employs: (1) a computation-distortion (CD) optimized predictive motion estimation algorithm and (2) a FSM-based rate-distortion (RD) optimized mode selection method. The proposed C D optimized motion estimation algorithm and the FSM-based R D optimized mode selection method are intended to reduce the huge computational requirements of motion estimation and improves the compression performance in M P E G 2 video coding, respectively. Simulation results indicate that the proposed video encoders Chapter 1. Introduction •4 outperform existing ones both in terms of compression performance and computational complexity. Another important advantage of our algorithms is that the bit rate, quality, and number of computations can be simultaneously controlled. • In Chapter 2, we begin with a brief review of some fundamental concepts that will serve as background material throughput the thesis. Chapter 3 will introduce the new cost function J for motion estimation we mentioned above.' In the same chapter, we V also address motion vector prediction and search techniques. The R D optimized mode selection algorithm as well as corresponding efficient implementations will then be discussed in Chapter 4. In Chapter 5, the proposed motion estimation algorithm and mode selection algorithm will be implemented in the context of an H.263-based video encoder that is designed for very low bit rate applications. In Chapter 6, we will implement: an M P E G - 2 compliant video encoder using the two algorithms discussed in Chapter 3 and 4. In Chapter 7, conclusions and future research directions are discussed. Chapter 2 Background In this chapter, the fundamental concepts and some of the most prominent techniques for video coding are reviewed and presented. Our discussion is only introductory, and the interested reader may wish to consult the given references for more detailed discussions. In Section 2.1, relevant concepts from information theory are reviewed. In Section 2.2, two well-known lossless coding techniques, Huffman coding, and arithmetic coding, are described. Next, in Section 2.3, rate-distortion fundamentals needed to explain our lossy coding techniques are addressed. In Section 2.4, generic hybrid video coding and DCT-based block motion compensated video coding are discussed. Motion estimation is discussed in Section 2.5. In the latter section, we focus mainly on block matching algorithms, and we limit our discussion to the most popular fast motion estimation algorithms. In Section 2.6, block transforms, and especially the D C T , are addressed. D C T coefficient quantization and coding are described in Section 2.7. We will briefly introduce the two popular video coding standards, the H.263 and the M P E G - 2 , in the last section. 2.1 Information Theory Concepts The most important contribution to information defined information measure H(X), theory was made by Shannon, who called the entropy, of a random source X. This measure gives the lowest number of bits required to represent the symbol X losslessly. Based on the entropy, other information measures, such as the conditional entropy and the mutual information, can be derived. 5 Chapter 2. 2.1.1 Background 6 Entropy The entropy, which is a measure of uncertainty of a random variable, will first be introduced. Let X be a discrete random variable with alphabet X and probability mass function p(x) = Pr{X — x}, x £ X. The entropy H(X) of a discrete random variable X is defined by H(X) = -J2p(x)log p(x), (2.1) 2 where log?, is the base-2 logarithmic function and entropy is expressed in bits. Note that entropy is a function of the distribution of X. It does not depend on the actual values taken by the random variable X, but only the probabilities. From the view point of source coding, the entropy H(X) also represents the average number of bits required to represent the random variable X. 2.1.2 Joint Entropy and Conditional Entropy In this section, the definition of entropy of a single random variable, presented in the previous section, is extended to a pair of random variables. The joint entropy H(X, Y) of a pair of discrete random variables (X,Y) with a joint distribution p(x,y) is defined as . H(X,Y) = -J2J2p(x,y)log p(x,y), xex ey (2.2). 2 y where the X and y are the alphabets of X and Y respectively. The H(X, Y) then represents the average number of bits required to represent a pair of random variable. (X,Y). We also define the conditional entropy of a random variable given another as the expected value of the entropies of the conditional distributions, averaged over the conditioning random variable. The conditional entropy H(Y\X) ' H(Y\X) = J2P( ). ( \ X H Y X = ) X is defined as Chapter 2. Background 7 = -J2P( ) x^x = p(y\ ) °92 p{y\x) x X ey L y P( ,y) x xexyey 92 p(y\x), (2.3) LO where the p(y\x) represents the conditional probability of y given x. The naturalness of the definition of the joint entropy and conditional entropy is exhibited by the fact that the joint entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other, that is, H{X, Y) = H(X) + H(Y\X). (2.4) The results given above can be generalized to k random variables, where k > 2. Therefore, the joint entropy H(Xi, X , • • • , Xk) of k random variables {X±,X , 2 2 • • •, Xk} can be expressed as k H(X ,X , 1 2.1.3 2 ...,X ) k = J2 H(X \X X ,.. i u 2 .,Xi- ) X + H{X ). X (2.5) Mutual Information The mutual information is a measure of the amount of information that one random variable contains about another random variable. It is the reduction in the uncertainty of one random variable due to the knowledge of the other. Consider two random variables X and Y with a joint probability mass function p(x,y) and marginal probability mass function p(x) and.p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y), i.e., L¥/ ' ^y mY)= ' x y)l09 The mutual information I(X;Y) (2 6) is the reduction in the uncertainty of X due to the knowledge of Y (or vise versa), that is, I{X-Y) .= H(X) - H(X\Y). Chapter 2. Background 8 =. H(Y) - H{Y\X). (2.7) It is easy to derive the following relationship between entropy and mutual information from Equations (2.1), (2.2) and (2.6): I{X;Y) = H{X) + H{Y)-H(X,Y). (2.8) From the same equations, the following inequalities can also be easily proved [17, 19, 14]: I(X;Y) > 0, I(X;Y) < H(X), I(X; Y) < H(Y), and (2.11) < H(X)+-H(Y). (2.12) H(X;Y) 2.2 (2.9) (2.10) Entropy Coding Entropy coding [55, 68, 21] is the mapping of discrete data into a more compact form such that the original data can be reconstructed exactly. According to the Kraft inequality [39, 19, 14], for discrete memoryless source X with finite entropy H(X), it is possible to construct a prefix code having an average bit R that satisfies, H(X) <R= ][>(£,) 3 lj < H{X) + 1, (2.13) where p(xj) is the probability function of X and lj is the code length of the random variable Xj. In this section, two widely used entropy coding techniques,-Huffman coding and arithmetic coding, are introduced. 2.2.1 Huffman Coding Huffman coding [30] shares most of the characteristics of Shannon-Fano coding [17]. It creates variable-length codes of integer lengths. Symbols with higher probabilities are Chapter 2. Background assigned shorter codes. Huffman codes have the prefix condition property. 9 Therefore, they can be correctly decoded without knowing the individual code lengths in advance. It is shown [20] that Huffman codes can be designed such that the average bit rate is within one bit of the entropy (see Equation (2.13)). Moreover, if the symbol probabilities are powers of | , then the average rate of the Huffman code is the same as the entropy of the symbols. Huffman codes are typically generated using a tree. However, unlike Shannon-Fano codes, Huffman codes are built from the bottom up, starting with the leaves of tree and working progressively closer to the root. One drawback of Huffman coding is that all codewords have integer lengths. No matter how high the probability of one of the symbols is, that symbol still must be encoded with at least one bit. Therefore, Huffman coding becomes inefficient when it is used to encode symbols with high probabilities (e.g. 0.9). This coding efficiency can be improved if groups of symbols are encoded together. However, by grouping symbols, the size of the Huffman code tables grow exponentially. Another problem with Huffman coding is that the coder design and the statistical modeling are combined into a single process, making adaptive coding difficult. If the probabilities of the input symbols change, a relatively complex algorithm is needed to adapt the Huffman coder. 2.2.2 Arithmetic Coding Arithmetic coding is a lossless compression technique that benefits from treating multiple symbols as a single data unit but at the same time retains the incremental symbol-bysymbol coding approach of Huffman coding. Arithmetic coding separates the coding component from the modeling component. This process allows for the dynamic adaptation of the probability model without affecting the design of the coder. Theoretically, in arithmetic coding, a single codeword is assigned to each possible sequence of symbols. Each codeword can always be represented by a half-open subinterval in the interval [0,1). Chapter 2. Background 10 By assigning enough precision bits, one can distinguish one subinterval from any other subinterval, and thus uniquely decode the corresponding sequence of symbols. Like Huffman codewords, the more probable sequence of symbols correspond to larger subintervals and thus require fewer bits of precision. Because an arithmetic encoder can encode symbols with a fraction of a bit, it can achieve close-to-optimal compression performance for sources with very low entropies. Arithmetic coders usually achieve better compression performance than Huffman coders. However, arithmetic decoders require that it be informed about the number of symbols that need to be decoded. For a detailed discussion of arithmetic coding, please see [21, 45, 46]. 2.3 ' Rate-Distortion Theorem Concepts Shannon [66] also introduced the rate-distortion function R(D), which represents the lowest information rate R achievable while still maintaining a fixlelity D. He proved that it is not possible to have a source,code of rate R with distortion d smaller that the minimum distortion D(R), or the inverse of R(D). This is often called the negative source coding theorem. However, he also proved that source codes exist which can arbitrarily approach the rate-distortion bound R(D) - a result that is known as the positive source coding theorem. In what follows, some of the basic rate-distortion theory results will be reviewed. . • First, let X and X be discrete random variables with alphabet X and X respectively, and the joint probability mass function be given by p(x,x) = Pr{X = x,X = x}, where x £ X and x £ X. Next, let the distortion function d{x,x) be a measure of the cost of representing the symbol x with the symbol x. The distortion function C/(X,.T), together Chapter 2. Background 11 with the joint probability function p(x, x), can be used to compute an average distortion 4 = EEPWP(«W^.J)-' (2-14) The rate-distortion function R(D) for a source X with average distortion d m R(D)= . ..min /(*;*), p(a;|x) : d <r> is given by (2-15) m where I(X; X) is the mutual information between X and X. The minimization in Equation (2.15) is performed over all conditional distribution p(x\x) subject to the constraint that the average distortion d m is less than or equal to D. The rate R(D) is the lowest achievable rate when a distortion no greater than D is allowed. Shannon showed that no source code with R < R and distortion D exists. s It is important to emphasize that the above theoretical results apply to stationary discrete-time memoryless sources with finite output alphabets. These results are extended to sources with continuous amplitudes in [19, 4]. For a memoryless Gaussian source with finite variance a , the rate-distortion function R (D) with mean square error distortion 2 g measure d(x,x) = (x — x) is 2 R (D) = -log ^. (2.16) [ g 2 In general, explicit results of the rate-distortion functions for memoryless non-Gaussian sources are not available. However, there are useful upper and lower bounds on the ratedistortion function R(D) for discrete-time continuous-amplitude memoryless sources with zero mean and finite variance a , that is, 2 Ri(D) < R{D) < R (D), . U where Ri(D) and R (D) U (2.17) are the lower and upper bounds respectively. From [4, 66], Ri(D) and R {D) axe given by U R (D) U = R (D) = ^log ^, g 2 and (2.18) Chapter 2. Background 12 Ri(D) = H(X) - - log 2ireD, (2.19) 2 where H(X) is the.differential entropy of the continuous amplitude memoryless source X. From the above, it is clear that the- Gaussian source requires the maximum rate • among all other sources with the same variance a . 2 2.4 • Hybrid Video Coding The goal of video coding is to remove statistical and psychovisual redundancies in an image sequence while keeping intact as much perceptually important information as possible. The two major types of statistical redundancies in a video sequence are spatial and temporal redundancies. Temporal redundancies exist usually due to structured motion and camera movement, while the spatial ones are normally due to the smoothness and edge continuity. These two types of redundancies can be reduced efficiently by using a generic hybrid video coder such as the one shown in Figure 2.1. The hybrid video coder first performs temporal prediction with motion compensation. Here, motion compensation is the process of compensating for the displacement of a moving object from one frame to another. In practice, motion compensation is preceded by motion estimation, the process of finding displacement of pixels among frames. Next, the corresponding difference image, or the residual image, is processed by a waveform coder, such as a D C T or a subband coder, to further remove the spatial redundancies. To reconstruct the video sequence, the decoder in Figure 2.1 merely reverses the process of the encoder just described. Two types of coding methods, intra-frame coding and inter-frame coding, are used in the hybrid video coding scheme. Intra-frame coding employs only spatial redundancy reduction method, therefore, the frames can be encoded independently. For inter-frame coding, however, both temporal and spatial redundancy reduction methods are used Chapter 2. Background 13 MM) no Motion Estimation Motion Vector Motion —C=-| Compensation Generic Encoder Residual Coding Generic Decoder Mt) Motion Compensation Figure 2.1: Generic hybrid video coder and decoder (see Figure 2.1), and frames in the past and/or future are required to perform temporal prediction. A video frame that is coded by intra-frame coding is called an I picture. A video frame that is coded by inter-frame coding is called a P or B picture. For P pictures, only one frame in the past is used as reference for temporal prediction. For B pictures, one frame in the past and one frame in the future are used for temporal prediction. In order to reduce the implementation complexity of both the encoder and decoder, the B pictures are not referenced by any other types of picture, and only the closest I or P picture is used as a reference. Chapter 2. Background 14 DCT Quantization Entropy Encoding Inverse Quantization Inverse DCT Motion Estimation Motion Compensation Frame Memory motion vector Figure 2.2: A block motion-compensated prediction and DCT-based residual coding video coder Block Motion-Compensated Prediction and DCT-based Residual Coding The most popular hybrid video coder is the block motion-compensated prediction and DCT-based residual coding video coder shown in Figure 2.2. This type of video coder employs a block matching algorithm for motion estimation and a D C T and scalar quantization for residual coding. Because of its effectiveness and efficiency with respect to algorithm complexity and computational cost, the framework illustrated in Figure 2.2 has been adopted in video coding standards, such as H.261, M P E G - 1 , M P E G - 2 , and H.263 [73, 36, 34, 31]. For simplicity, we call this type of video coder an MPEG-like video coder in what follows. In this work, our proposed algorithms will be developed based on the MPEG-like video coding framework. Therefore, in the following sections, methods used in Figure Chapter 2. Background ' • 15 2.2, such as block matching, D C T and scalar quantization, will be discussed in more detail. 2.5 Motion Estimation Many motion estimation methods have been used to compute the apparent motion in an image sequence for use in video coding, with each algorithm relying on a different representation of the object trajectory in the three dimensional scene. In this section, three of the most popular algorithms, the block matching algorithm, pel recursive algorithm, and object based algorithm, are briefly introduced. • Block Matching Algorithms: Block matching algorithms are the most successful tools for motion estimation in video coding. They have been implemented in many video codecs and have been specified in a number of video coding standards (e.g. H.261, M P E G - 1 , M P E G - 2 , and H.263 [73, 36, 34, 31]). Typically, block matching algorithms track only translation motion using a fixed block size. The block matching algorithms are popular because of their computational efficiency and minimal information representation of the motion field. In this work, a block matching algorithm is used for motion estimation. A more detailed discussion will be given in Section 2.5:1. • Pel-Recursive Algorithms: Pel-recursive techniques use gray-scale gradients and recursion to determine object motion [29, 63, 75]. With a pel-recursive algorithm, an affine model (capable of representing both translation and rotation motion) can be easily incorporated into the video coder. Gradient-based estimation algorithms calculate motion parameters for each pixel in the frame of interest. In this respect, they are well suited for motion compensated prediction since they do not introduce blocking artifacts and are able to deal with complex motion models allowing Chapter 2. Background 16 rotational motion and object deformation, as well as translational motion. However, this flexibility comes at a cost since computing and transmitting pixel motion parameters is more expensive when compared to the block matching algorithms. In addition, large displacements in the motion field can not be easily tracked by pel-recursive algorithms. • Object Based Algorithms: Object (or region) based models are very powerful tools for motion estimation and motion representation since typical scenes can be characterized by small sets of objects moving i n time [77, 56, 44, 28, 38, 59,. 57].' These methods first segment each frame into regions according to characteristics such as motion activity, texture or color, and then find motion parameters for each region. Object based motion estimation algorithms are able to faithfully represent true object motion and avoid blocking and mosquito effects since regions are not restricted to fixed shape as in typical block matching algorithms. However, these algorithms are much more complex than the block matching and pel-recursive ones. 2.5.1 Block Matching Algorithms A block matching algorithm ( B M A ) provides a parametric representation of the motion flow in a video signal. This algorithm estimates the motion of individual pixels using a single motion vector. Within the block, it is assumed that every pixel has the same motion. In this way, each block in the.current frame is predicted from a block in a reference frame which has undergone some type of motion. The "optimal" block matching algorithm is the full search (FS) method [37]. The full search method subdivides the current frame into non-overlapping blocks that are usually square. For each block b, at (m, n) in the current frame a block h j at (m + d , n + d ) re x y in a reference frame I j, that best matches it (given a distortion criterion) within a search re Chapter 2. Background 17 Current Frame Reference Frame Search Region Figure 2.3: Full search motion estimation process region W , is found. The search region W is defined as the region over which one varies the displacement (d , d ). In Figure 2.3, the full search method is illustrated for a search x y region that is ±p pels in the x-direction and ±q pels in the y-direction. The block size is (M,N). In causal or forward motion estimation, a past frame is used to compute the displacement estimate. In non-causal and backward motion estimation, a future frame is used to compute the displacement estimate.The disadvantage of block matching algorithms lies in their simplistic assumptions. For example, block matching algorithms assume that all pixels within a block undergo identical motion, which is not always the case, especially when there are small moving objects. In addition, if pixels along the boundaries of adjacent blocks are assigned different motion vectors and an object straddles these block boundaries, the adjacent motion vectors "tear" the object'. In this case, the residual signal will contain high frequency edges, expressed using the so called "blocking artifacts". Chapter 2. Background 18 Matching Criterion The block matching algorithm uses a very simple cost function to evaluate the displacement vectors that describe motion in image sequences. In most simple pel-based absolute difference is used to determine the similarity between blocks. The two popular cost functions used to measure the similarity between blocks are the mean square error (MSE) and the mean absolute difference ( M A D ) . For a block at location (x ,y ) m block size MxN, the M S E and M A D for motion displace d = (d ,d ) x D SE(d ,d ) M x y = ^ xm+M —— J2 x=x m - | D AD(d ,d ) M x y = with are, y„+N Y.{ { ^j)- res(x I x I + d ,y + d )f (2.20) + d ,y + d )\, (2.21) x y y=y n .r,„; A/ >/„+.Y E \I(x,y)-I (x reJ x=xrn y n x y y yn = where / and I j are the pixels of the present frame and the reference frame, respectively. re The difference in subjective performance between using the M S E and M A D is quite small for most video sequences. Therefore, in general, the M A D is preferred because it required less computations than the M S E . However, even using the M A D , the computational complexity is still quite high for most video sequences, simply because hundreds of reference blocks must be considered for each candidate block. For example, for CCIR601 video (30 frames/sec, 720x480 pixels/frame) with a (±16, ±16) search region, a 30 G O P S (giga-operations per second) capable processor is required for performing full search block matching algorithm in real time. 2.5.2 Fast Motion Estimation Algorithm Full-search block matching algorithms are very computationally expensive, but they guarantee finding the motion vector with the minimum M A D value. To alleviate the computational burden, many fast block matching algorithms have been proposed since 1981, when Jain and Jain developed the 2D logarithmic search algorithm [37]. In what follows, Chapter 2. Background 19 several heuristic search methods are described. These methods achieve sub-optimum performance at significantly reduced number of computation compared to the full-search method. 2D Logarithmic Search Method The 2D logarithmic search method [37] uses a sparse binary search area, where the "best" motion vector is located. The M A D matching criterion is usually used, as in the full search method. In this case, let (dtp, dtp) be the size of the search region. Consider the grid shown in Figure 2.4, nine points within the search region are chosen about the block center. This corresponds to points on a d (d = 2 ^ 0 0fl2 0 ' ^ ' ) sampling grid, e.g. p _1 points at (-<i , d ), (0, d ), (d , d ), (-c/ , 0), (0, 0), (d , 0), (-c/ , -c/ ), (0, -d ), {d , -d ). 0 0 0 0 Q 0 0 0 0 0 0 0 The minimum distortion is found at one of these locations. The sampling period for the •ith iteration is given by di = 2 ~ , k (2.22) l where k = (\log2(p)] — 1) and i is the iteration index. The location of minimum distortion for the 0th search (i — 0 iteration) is used as the search origin for the next iteration i = 1. For this and subsequent searches, eight locations need to be searched as the search origin was computed at-the i — 1 iteration. The iterative process continue until the search sampling period is equal to one, and i — k. This corresponds to an integer pixel motion vector sampling resolution. The total number of search locations in 2D logarithmic search is far less that in the full search. In total, the method visits 8\log (p)'\ + 1 locations, and each location 2 requires a M A D computation. This is compared to (2p + l ) for the full search method. 2 If a monotonic distortion surface exists, the 2D log search algorithm will locate a global minimum. In most cases, however, the M A D surface is not monotonic, and the search Chapter 2. Background 20 (-7, -7) (0, 7) 0 0 !•- 0 —E3- j 0 (7, 7) C o 1 "l .J o r : • • •• El 1 r 1 1 c 0 : j 1 J t 1 -°l • LJ I 1 (-7, 7) , j — ! \ c (7, 7) Figure 2.4: The 2D logarithmic search converges on a local minimum. Three-step Search Method Three-step search method is iterative, and similar to the logarithmic one. As in logarithmic search, a nine-point grid is searched first with a sampling distance of d = [|]. 0 As each iteration, the nine-point sampling grid used in three-step search is identical to the nine-point sampling grid used in 2D log search. However, the sampling distance for a subsequent iteration is given by d t+1 = di - 1, (2.23) where i is the iteration index. As in 2D log search, the iteration process ends when the search distance is equal to one. By reducing the initial sampling distance d , more precise 0 motion vectors can be obtained at the cost of higher computational complexity [70]. The number of operations for the three step search method is greater than that of the 2D log 21 Chapter 2. Background (-7,-7) (0,7) 0 0 lJ 0 " i.J 0 0 0 1 1— 1 1 t j 0 (7.7) 0 L 1 1 1 r 0^ 1 1 1 1 2 3 1 r 1 L 2 1 7) 1 • p. IP: 1 i (7 Figure 2.5: The 3-step search search. Using the initial sampling distance d = [^], the total number of search location 0 is s r n + 1 . Parallel Hierarchical One-Dimensional Search (PHODS) This method is based on a one-dimensional binary search that relies on the assumption that the M A D surface is monotonically decreasing in both x and y direction. During this search process, the horizontal and vertical spatial dimensions are searched independently [7, 5]. As in the logarithmic search , a (±p, ±p) search region is here assumed. For a one-dimensional search along the x dimension illustrated in Figure 2.6, the three sampling grid for the first iteration is (-do, 0), (0, 0), (do, 0), with the sampling interval d = 2^° ' 'L The location with the minimum M A D among the three points is 52 P 0 used as the center of the next iteration. The second iteration searches for the minimum DMAD at the following grid points, (-d 0), (0, 0), (<i;, 0), centered at the location with l: Chapter 2. Background 22 (-7,-7) (0,7) (7,7) \ } 1 J °i1•I y 3 ' i1 1 2 1 L] rn LJ XQ )'o XQ r1 LJ *1 *0 1 • *2 r1 11 1I L J X, *2 \ y (-7,7) "1 LJ . (7,7) Figure 2.6: The parallel hierarchical one-dimensional search (PHODS) the minimum M A D in the first iteration, where d,- = 2 ~\ i is the iteration index, and k k = \log (p)]2 The iteration process continues until a sampling interval of d{ = 1 is obtained. The remaining y dimension can be searched at the same time and in the same manner. The number of search locations in PHODS per block is 4|7og (p)l + 1, which is 2 roughly half of the number of locations considered in 2D logarithmic search. Hierarchical Search Method To circumvent the deficiencies in fast search algorithms, which depend on a monotonic decreasing M A D surface, hierarchical search methods were developed [6]. In hierarchical search, the search process is not confined to a subset of the search region, and is less likely to be trapped in a local minimum. Using low resolution representations of the frames, the entire search region can be processed at different layers. In addition, the hierarchical structure allows the use of different sized search regions and measurement windows that Chapter 2. 23 Background Figure 2.7: The hierarchical search match the image resolution during the search. This adaptability makes the hierarchical search more robust since it is able to dynamically adjust its motion parameters to the motion structure. A hierarchical motion estimation algorithm decomposes each frame in the image sequence into multi-resolution frames. The decomposition can be made using a pyramidal structure [76, 7] or subband decomposition structure [80, 22, 71]. At the lowest resolution, the largest motion displacements are measured. This is possible since the search window size can be made large without increasing the number of required computations. Due to the low-pass nature of the lowest resolution image, an accurate motion vector is not yet possible. However, because of low-pass filtering, the displacement estimate is less sensitive to noise. Using coarse measurements made at the lowest level of the pyramid, motion estimation is then performed at the higher level. Higher levels in the pyramid Chapter 2. Background 24 require a smaller search window, which allows for more accurate measurement of small displacements. Appropriately-scaled motion vectors from the previous level are used to .initiate the search at the higher resolution level of the image pyramid. Since the search for motion vectors begins at the lowest resolution, selecting an appropriate search size at such level is very important. The search size should be large enough to estimate the magnitude of most motion variation present in the video signal. Therefore, given an L-levels dyadic decomposition (subdividing the region by two at each iteration) and a (±p, ±p) search region, the search region for the lowest level should be ( P L , P L ) , where /'/• = (2-24) For a higher resolution level, in general, the search region is set to ( ± 1 , ± 1 ) pixels (total 9 locations) to ( ± 2 , ± 2 ) pixels (total 25 locations) at the corresponding' resolution. Hierarchical search has a computational complexity roughly equivalent to half of that of 2D logarithmic search, but yielding significantly better performance. Assuming a block size of M x A" and a search region of (±p, ±p), and a three level resolutions is used, the search region in the lowest level is ( ± [ | ] , ± [ | ~ | ) , and the search region in the two higher levels is set to ( ± 1 , ± 1 ) . The total number of operations required for each block is [(2[jjl + l ) 2 ^ + 9 ^ + 9NM}. Here, one operation includes one subtraction, one absolute value, and one addition. From a computation viewpoint, the hierarchical search method is- very efficient; however, such a method requires increased storage due to the need to keep pictures at different resolution. Furthermore, this method may yield inaccurate motion vectors for regions •containing small objects. Since the search method starts at the lowest resolution in the pyramid structure, regions containing small objects may be completely eliminated and may never be tracked. Chapter 2. Background Search Method Full-search Logarithmic Three-step PHODS Hierarchical 25 Operations per Macroblock (2p + IfNM (%\log p\+l)NM (8[fl + l ) i V M (4\log p] +1)NM ((2[fl+l) + 1 8 0 ) ^ 2 2 2 Operations per second 720x480 at 30 fps ' p=7 p = 15 9963.65 342.14 673.92 176.26 169.13 MOPS MOPS MOPS MOPS MOPS 2332.80 259.20 342.14 134.78 132.84 MOPS MOPS MOPS MOPS MOPS Table 2.1: Computational complexity and operation requirements for pictui'e with image size of 720x480 and frame rate of 30 frames/sec. The macroblock size in this case is NM = 16x16. Table 2.1 summarizes the computational requirements of the five motion estimation methods we described in the section. The " M O P S " in Table 2.1 is the abbreviation of million operations per second. In all cases, we assume 16x16 macroblocks size and a (±p, ±p) search region. The picture size is 720x480 and the frame rate is 30 frames/sec. 2.6 Discrete Cosine Transform ( D C T ) Neighboring pixels in an image are usually highly correlated, implying that high levels of redundancies exist in the image data. Block transforms are very popular tools for image data decorrelation. The optimal block transform for this purpose is the KarhunenLoeve (KL) block transform. Theoretically, the K L transform can completely decorrelate the input blocks, yielding highest energy compaction. However, the drawback of the K L transform is its high computational requirements: Besides, the basis of the K L transform depends on the statistics of the specific input blocks. There are other block transform techniques which can achieve good decorrelation performance, and are much easier to implement. They are the Discrete Fourier Transform (DFT), the Hadamard Transform, the Slant Transform, and the Discrete Cosine Transform (DCT). Each of Chapter 2. Background 26 these transforms can be implemented efficiently. Among these block transform methods, the performance of the D C T is the closest to the optimal K L transform for conventional image data (such as photographic images and movie frames). In fact, D C T is today's most popular orthogonal transformation, as it is used in almost all image and video coding standards. D C T based video coding systems usually yield reasonably good image reproduction quality at moderate and high bit rates. However, such systems will produce annoying blocking and/or mosquito artifacts at the lower bit rates. This is because some or most of the A C coefficients of a block will often be quantized to zeros. 2.7 D C T Coefficient Quantization and Coding After applying the D C T , the resulting coefficients need to be quantized and then entropy coded. Since the human eye is less sensitive to energy with high spatial frequencies, the lower frequency D C T coefficients .are, in general, visually more important than others [54]. For this reason, a set of non-uniform weights called the quantization matrix is used to quantize the coefficients. Each weight should vary with the frequency, as well as the type (intra, inter) of the image. The process of quantization of the 8x8 D C T coefficients i , j = 0 , 1 , . . . , 7, can be expressed by = ronnc/(^ ^-), where qij is the M weight in the quantization matrix Q, and the q is the quantization factor. In general, two quantization matrixes, Qi t n ra and Qi t n er are required in video coding because of the different properties of the D C T coefficients in intra and inter coded pictures. The lowfrequency weights usually have smaller magnitudes than the high-frequency weights in Qintra- However, in general, uniform weights are used in Qinter- Because the quantization matrix Q will not change, at least within a frame, the rate and the quality of the coded frame are directly controlled by the quantization factor q of each block. The algorithm that determines the value of q depends on the application, the rate", and the quality Chapter 2. Background requirements. 27 In fixed bit rate coding, the q is mainly determined by a rate control mechanism that attempts to maintain a constant buffer occupancy [35, 72]. Following quantization, the coefficients are re-ordered into a one-dimensional array by reading out the entire two dimensional array in a zigzag direction. In this way, the quantized coefficients are "approximately" arranged in order of ascending frequency. This process usually increases the number of consecutive D C T zero-valued coefficients, improves significantly compression efficiency [73, 36, 34, 31]. 2.8 Video Coding Standard In this thesis, our proposed algorithms will be implemented in the context of the H.263 and M P E G - 2 video coding standard frameworks. Therefore, in the following subsections, a brief description of such video coding standards is presented. 2.8.1 The H.263 Video Coding Standard The main application of H.263 is video conferencing at a bit rates below 64 kbits/sec. The H.263 standard [36], like the M P E G ones, is based on block-based motion compensated prediction and DCT-based residual coding as shown in Figure 2.2. However, it also involves more advanced prediction techniques, more effective motion vector coding methods, and more flexible mechanisms for alternating between inter and intra coding. These new features take into account the facts that, for very low bit rate video coding applications such as video telephony, video conferencing, and video monitoring, motion changes are relatively small and motion vector side information occupies a substantial portion of the overall bit rate. Detailed description of the H.263 coding standard will be presented in Chapter 5. Chapter 2. Background 2.8.2 28 The M P E G - 2 Video Coding Standard As suggested by its title, Information technology - Generic coding of moving pictures and associated audio information, the M P E G - 2 standard is intended to be generic in the' sense that it serves a wide range of applications. M P E G - 2 differs from M P E G - 1 in that it supports many profiles and levels, which define applications and specific complexity constraints, respectively. A total of five profiles: 1) simple profile, 2) main profile, .3) SNR scalability profile, 4) spatial scalability profile, and 5) high profile, are supported by the standard. In order to encode efficiently interlaced video sequences, such as those used in T V , the M P E G - 2 video coding standard also defines interlaced motion estimation and the D C T coding algorithm. A more detailed description of the M P E G - 2 video coding standard will be presented in Chapter 6. Chapter 3 Block-Based Motion Estimation In this chapter, an efficient block-based motion estimation algorithm is proposed. Unlike in traditional block-based motion estimation algorithms, the proposed framework will yield very good tradeoffs among distortion, motion vector bit rate, and computations. Our proposed motion estimation algorithm finds the motion vector that minimizes the distortion subject to the constraints of motion vector bit rate and computations. This chapter consists of five sections. In section 3.1, the two well-known drawbacks in traditional motion estimation algorithms will be addressed. To overcome such drawbacks, a new optimization criterion for motion estimation is proposed in section 3.2. Depending on the intended video coding application, the proposed general optimization criterion can be greatly simplified. The principle and the implementation of the resulting motion estimation algorithms will then be discussed in detail in-Section 3.3 and Section 3.4. A summary is given in the last section. 3.1 Motivation Motion compensated prediction plays a vital role in video coding systems, as it is the most popular method to remove temporal redundancies between video frames. Many models have been developed to represent the apparent motion field in a video sequence. Mainly because of its computational efficiency, the most common motion model used in video coding systems and standards is that which tracks translational motion. With this model, the full-search block matching algorithms (FS-BMAs) are usually favored because 29 Chapter 3. Block-Based Motion Estimation 30 of their hardware simplicity and their generally minimal side information required for compensation. It has been implemented in many video codecs and has been specified in a number of video coding standards (e.g. H.261, M P E G - 1 , M P E G - 2 , and H.263 [73, 33, 34, 36]). While the full-search B M A (FS-BMA) is the simplest in concept and implementation, it also has two well-known drawbacks. The first drawback is a degraded video quality in low bit rate applications. This problem is the result of two limitations: (1) the encoder cannot control the motion vector bit rate, and (2) the encoder cannot look ahead. Hence, the obtained motion vectors do not necessarily yield the "best" overall encoding results. The impact of the above limitations is usually more significant at low bit rates, since the motion vector bit rate becomes relatively large (30% ~ 50% of the total bit rate) [74, 24, 49]. In order to satisfy a certain bit rate requirement, the encoder then has to use a larger quantization step during the encoding of the D C T coefficients. This, in turn, severely degrades the reproduction quality. The second drawback is F S - B M A ' s very large computation requirements. For example, the two-step F S - B M A algorithm used in the M P E G - 2 T M 5 encoder typically requires 90% ~ 95% of the total number of computations. F S - B M A s require many computations mainly because they usually follow a fixed search path whose starting point is also fixed. This implies, that, to accommodate video sequences with non-stationary motion fields, a search path traversing all points in a very large search area must be used. The F S - B M A ' s high computational requirements have fueled many research activities, which have produced several efficient algorithms such as log search, three-step search, cross search, conjugate gradient search, hierarchical search, and block subsampling [5, 81, 82, 2, 11, 12, 52]. However, most of the above algorithms may quickly get trapped in local minima, yielding a significant loss in estimation performance. In this work, we propose a new motion estimation algorithm that is both efficient and Chapter 3. Block-Based Motion Estimation 31 effective. The new algorithm is discussed in the following sections. 3.2 Motion Estimation with Rate-Computation Constraint One of the problems of traditional motion estimation algorithms is that they minimize the distortion without consideration of the motion vector bit rate and the computational cost. In fact, the performance of a motion estimation algorithm should be evaluated by not only the distortion but also the other two important measures.Our proposed motion estimation algorithm is a constrained optimization problem. Given the constraints of total motion vector bit rate R and total computational cost Gt t for a frame with NxM macroblocks, we want to find the NM motion vectors {vi,j,i — 0 , . . . , N, j = 0 , . . . , M} that minimize E^( «.i) ( - ) v 3 25 subject to the constraints T.Civij) < C u (3.26) where the D(vi j), R(vi' j) and C(vij) are, respectively, the motion compensation distort t tion, the motion vector bit rate and the computational cost for the (i,j)th macroblock. To solve the above constrained discrete optimization problem, we can use the well known Lagrangian multiplier method introduced in [16, 25, 65]. It basically means, that one adds the constraint which makes the problem hard to solve to the objective function, weighted by the Lagrangian multiplier. Then the new problem, called unconstrained or relaxed problem, is solved to find the Lagrangian multiplier that results in the optimal solution. Therefore, in our. case, the above constrained optimization problem can be Chapter B. Block-Based Motion Estimation 32 solved by minimizing the following Lagrangian cost function J J = E + A + ^ •E HV.-J. •E ',./ ' (3-27) '../ where the non-negative values of the Lagrangian multipliers A and f3 must be chosen so that the iwo constraints in Equation (3.26) are satisfied. Howeser, due to the rate and distortion dependencies manifested in the D(vij) and- R(vi j) tetms, solving Equation (3.27) with the constraints of Equation (3.26) remains t rather complicated, and almost impossible. In order to simplify this problem, we have to ignore Bate-distortion dependencies between macroblocks and minimize, instead, the Lagrangian cost function for macroblock J(v j) i = D(v ) iJ + \-R(v )+p-C{v , ). itj (3.28) i j The motion vectors {vi,j,i = 0 , . . . , N, j = 0 , . . . , M } that minimize Equation (3.28) are suboptimal, however, this approach reduces significantly the complexity of the motion estimation algorithm. According to Equation (3.28), the proposed motion estimation algorithnaseeks the motion vector V ; j for macroblock that minimizes the Lagrangian cost function J(v j). The algorithms that finds the value of A will be discussed in the t] rate control sections in Chapter 5 and 6. In Section (6.4.1) we discuss the effects of the different values of /3 on J{vi,j). In the next section, we will give a more detailed discussion about the proposed Lagrangian cost function J ( v ) . t J The Proposed Lagrangian Cost Function Let us assume that for a macroblock J(n) a pre-defined search path V that covers the search arm exists, and that V consists of the sequence of motion vectors (v , x The subscript n.in v„ indicates the search order, that is v — (x ,y ) n vector in the search path. After considering the vector v n n n V 2 , . . . , VJV). is the nth motion along the search path V, Chapter 3. Block-Based Motion Estimation 33 one can obtain the corresponding distortion D(n), motion vector bit rate R(n), and computational cost C{n). The distortion D(n) is defined as ' D(n)= $ > ( / ( r ) , M r + v„)), (3.29) t where r is the spatial index of the image pixels in the image plane, 7 (r) is the image t intensity at position r of.the candidate block in the current frame, I i(r t + v ) is the n image intensity at position (r + v ) of the matching block in the reference frame, W is n the size of the matching window, and /?(.,.) is the distortion measure. The most popular distortion measures are the mean squared error (MSE), where p(-) = | • | , and the mean 2 absolute difference ( M A D ) , where /?(•) = | • |. The bit rate R(n) represents the number of bits required to encode the motion vector v„.'- The computational cost C(n) is equal to the total number of computations for all vectors up to, and including, the nth motion vector. The "best" values of D(n), R(n), and C(n) depend on the cost function being minimized. From Equation (3.28), the proposed algorithm is to find the motion vector v n that' yields the minimum value of the Lagrangian cost J(n), where J(n) = D(n) + X • R(n) + (3 • C(n) (3.30) The parameters A and (3 control the influence of the rate R(n) and the computational cost C(n) on the value of the Lagrangian cost- J(n), respectively. For example, if the value of the parameter A increases, the proposed algorithm will favor the motion vectors requiring a lower bit rate. Similarly, if the value of the parameter (3 increases, the computation constraint becomes more important, forcing the algorithm to follow the shortest path or terminate the search process after only a few vectors have been considered. The proposed algorithm is a generalization of traditional motion estimation algorithms. In fact, when A and (3 are set to zero, the Lagrangian cost J(n) will be equal to D(n), the cost measure usually used in motion estimation. Chapter 3. Block-Based Motion Estimation 34 Minimization of the Lagrangian cost J(n) given by Equation (3.30) seems straightforward, yet there are two problems that need to be addressed. First, special values for A and (3 must be obtained, usually via rate/computation control algorithms. Second, the minimum value of J(n) can be obtained only if all candidate vectors in the search path have been considered, rendering useless the use of the computational cost as part'of J(n). However, depending on the video application, minimization of J(n) can be greatly simplified. For low bit rate video coding applications, such as video conferencing, because the motion field is often stationary, and the spatio-temporal resolution of the video sequence is-usually very small, the computational cost C(n) in Equation (3.30) becomes substantially less important than the motion vector bit rate R(?i). Therefore,in low bit rate applications, Equation (3.30)-can be simplified to J(n) = D(n) + \-R(n). (3.31) Minimizing the above equation for motion vectors along a specific path is referred to as rate-distortion (RD) optimized motion estimation. However, in high bit rate application, such as M P E G - 2 video encoding, incorporating the motion vector bit rate R(n) in Equation (3.30) will lead to an insignificant performance improvement. On the other, hand, incorporating explicitly the computational constraint will yield a large improvement in encoding speed. Therefore, for high bit rate applications, one-should minimize the cost J(n) given by J(n) = D(n) +(3-C(n). (3.32) Minimizing the above equation for motion vectors along a specific path is referred to as computation-distortion (CD) optimized motion estimation. In the following two subsections, we will discuss the specific R D and CD optimized motion estimation algorithms respectively implemented in our work. Chapter 3. Block-Based Motion Estimation 3.2.1 35 Rate-Distortion Optimized Motion Estimation According to [11, 12, 23, 41, 13, 27,'49, 43, 48], finding a motion vector v„ that minimizes the Lagrangian J(n) in Equation (3.31) is equivalent to finding a motion vector that minimizes the distortion D(n) subject to the constraint of motion vector bit rate R(n). A n important advantage of R D optimized motion estimation is that via the parameter A the encoder is able to control the bit rate of motion vectors. As we will explain in the Chapter 5, this property is very important in low bit rate video coding applications. Another advantage is the algorithm's robustness with respect to video input noise. Because a bit rate constraint is embedded in the above Lagrangian equation, for stationary areas in the video sequence (e.g., background), the R D optimized motion estimation algorithm will be insensitive to random fluctuations of block matching errors caused by various types of video input noise. 3.2.2 Computation-Distortion Optimized Motion Estimation The advantage of computation-distortion optimized motion estimation is that good tradeoffs between computation and distortion can be achieved through manipulating the value of the parameter (5 [42]. In our proposed motion estimation framework, described in detail in the next section, the motion vector search will start from the predicted motion vector, and the search will then follow the search path specified by the estimation algorithm. The Lagrangian in Equation (3.32) will be used by the algorithm to decide the termination of the search. Because the cost function J(n) takes both computational cost and distortion into account, therefore,' according* to the weighted sum of the two measures along the search path, the termination algorithm can dynamically controlthe search region. Experimental results in Section 3.4 show that the dynamic nature of the implementation leads to a dramatic increase in motion estimation speed for only a small loss Chapter 3. Block-Based Motion Estimation 36 in compression performance. Another advantage of our algorithm is that the reconstruction quality of the resulting M P E G - 2 coder is proportional to the computer's processing power, providing much needed flexibility in many M P E G - 2 software applications. In Figure 3.8.a, we show the' profile of the sum of absolute distortion (SAD) D from macroblock (8, 4) in the 1st frame in the sequence B u s . The profile of the cost function J = D + (3 •C along the diamond-shaped searching path (which will be discussed in the later section) centered at (0, 0) is shown in Figure 3.8.b. The minimum value of D in Figure 3.8.a is (5, -3) and the minimum value of J in Figure 3.8.a will be (4, -2). Notice that the D + j3C surface is smoother, indicating that the C D optimized algorithm would be less sensitive to video input noise. 3.3 Proposed Motion Estimation Algorithm Our proposed motion estimation algorithm consists of the following three important parts: (1) initialization, (2) search path, and (3) termination. Initialization involves the selection of, through prediction, the first motion vector (or the first point in the search path) to be considered. In the first subsection, four predictors, the mean predictor, the median predictor, the L M S E predictor, and the statistical mean predictor will be studied. In the following subsection, we will discuss the search path employed in this work. The search path should specify a sequence of motion vectors, where the "optimal" motion vector can be found within minimum computational cost. Three search paths, a spiral search path, a diamond-shaped search path, and a floating-center search path are addressed. In the final subsection, two effective.termination algorithms based on two probabilistic models (respectively) will be introduced and discussed. Chapter 3. Block-Based Motion Estimation 37 (a) Profile of SAD D Figure 3.8: Profile of sum of absolute distortion (SAD) (a) and Lagrangian J = D + (3C .(b) of motion estimation. The searching window is ± 1 5 . Chapter 3. Block-Based Motion Estimation 3.3.1 38 Starting Point: Prediction The first point in the search path should correspond to the motion vector with the highest probability of being the "optimum" motion vector. By employing prediction, one may be able to achieve such an objective. This is possible, mainly because the motion field in a typical video sequence is smooth' and structured, exhibiting strong spatio-temporal dependencies between its motion vectors. Memory has always been incorporated into the motion vector coding process, but a few researchers have suggested exploiting it to simplify motion estimation. Recently, predictive motion estimation [81, 82, 3, 41, 49, 52] has become an important research area and several motion vector predictors have been proposed. In this section, the mean predictor, the weighted mean (LMSE) predictor, the median predictor, and the statistical mean predictor are discussed [43],' and their complexity and performance are then be evaluated. Mean Predictor A mean predicted motion vector is given by v = jr J2k=i A-, which is the average of the V K motion vectors v/,-, evaluated for each x and y component independently. As confirmed by our experimental results, only the vectors corresponding to the closest 3 blocks (the A, B , D blocks in Figure 3.12) should be used. Weighted Mean Predictor A more accurate predicted vector is given by v = Ylt-i k k-, where the linear predica v tion coefficients a^s are computed off-line using the well-known autocorrelation method. Based on experimental results, a fourth order predictor (K = 4) seems to be a good choice. A higher order predictor improves slightly the prediction accuracy, but at the Chapter 3. Block-Based Motion Estimation 39 expense of a large increase in number of computations. While this weighted mean predictor outperforms the uniform mean predictor discussed above, further improvements in performance are expected by changing the values of the coefficients cv^'s adaptively during motion estimation. Median Predictor Like those of the mean and weighted mean predicted vectors, the two components of the median predicted vector are computed independently but using the same procedure. Two different median predicted vectors are computed: one based on the H.263 3-block and another based on a 5-block region of support (ROS). The 3-block median predictor, is better in terms of prediction performance, as will be shown later. Statistical Mean Predictor A more powerful non-linear predictor is the statistical mean vector v = (,x, y), whose two components are also computed independently and in an identical manner. Without loss of generality, we will next describe the procedure used to estimate x, the x component of v. The scalar x is the weighted average of x components of all possible motion vectors. The weighting coefficients are equal to the conditional probabilities of the x components. The conditional probabilities are determined using a finite-state machine (FSM) model that is represented by a codebook, or a set of state code vectors. Each code vector represents a relatively large set of template vectors lit. The components of the template vectors Ut are feature symbols representing the x components of previously coded motion vectors in the ROS. In other words, a template vector u is the output of a mapping function whose t inputs are the ROS motion vector x components, and is the input to a vector quantizer (VQ) whose codebook contains all state code vectors. In this work, the following two mapping functions and VQs are considered: Chapter 3. Block-Based Motion Estimation 40 1. Scalar .quantized mean: the mean of the x components is computed and scalar quantized. The ROS associated with the x components is the one used in the computation of the mean predicted vector. Each possible value of the scalar quantized mean represents a template vector u . t 2. Scalar quantized weighted mean: the. input to the scalar quantizer is the weighted mean, computed using the same ROS adopted during the computation of the weighted mean predicted vector. The state V Q codebook is generated on-line [41]. The codebook is initialized with one code vector Ug, the first template vector. The next template vector u is compared to t Ug. If the distortion <i(u ,Ug) given by t N i=l is less than a threshold Tj, then u is mapped to the state code vector u j . Otherwise, u t t is added to the codebook. Similarly, each new vector u is compared to all vectors in the t codebook. If the best matching code vector is still not a good match, as determined by the threshold T\, then it is added to the codebook. When the maximum codebook size is reached, the least popular code vector is deleted before adding the new vector. Next', suppose u is the template vector whose components are the different feature t symbols considered above. Then, an iV-size state codebook C = {uj, u f , . . . , u™} is searched, and the probability table corresponding to the state code vector u* closest to Ut is selected. Then, x is given by x = Y p{x \u* )x , l t s t ' (3.33) e-i where p(xe\ul) is the probability of the motion vector a;'component xi given state u* and L is the number of all possible motion vectors. x Chapter 3. Block-Based Motion Estimation 41- Since both the codebook and the probability tables are adaptive, this procedure is expected to yield very accurate statistical mean predicted motion vectors. Note that our implementation of the statistical mean predictor does not require that the table probabilities be stored or even directly processed. The estimates can be obtained easily by maintaining a counter for each state code vector. Thus, such a prediction technique is relatively simple. 3.3.2 Search Path Three types of search paths, a spiral search path, a diamond-shaped search path, and a floating-center search path, will be discussed in this section. The spiral search path and the diamond-shaped search path are static in the sense that the associated trajectories are fixed during the searching process. On the other hand, the floating-center search path isdynamic, as its trajectory is time-varying. In our work, all three paths consists of layers, and each layer consists of connected search points. The layers are searched in an order that depends on their geometrical distance with respect to the predicted motion vector. This is based on the assumption that the probability of finding optimal motion vector is inversely proportional to the distance between the layer and the predicted motion vector. Spiral Search Path The search area is initially divided into square-shaped contours or layers, as shown in Figure 3.9. The search starts at the predicted motion vector V, marked as "Layer 0" in the figure. We then search the layers 1,2, 3,.. . sequentially in the same order. The search order in each layer is arbitrary, since all points in the layer will be searched. The spiral search path has been used in the H.263 T M N 5 coder's motion estimation algorithm. The number of search points in each layer are given by 8 x /, where / is the layer index. Chapter 3. Block-Based Motion Estimation 42 Figure 3.9: A spiral search path Diamond-Shaped Search Path As shown in Figure 3.10, although the diamond-shaped search path also consists of a layer structure, however, its layers are diamond-shaped while those of the spiral search path are rectangular. If / represents the layer index, then the search points in layer /th are 4x1. One interesting property of the diamond-shaped search path is that its layers have the same shape as contour lines of a two dimensional Laplacian distribution function. Therefore, the diamond-shaped search path is an optimum path for motion vector fields which can be modeled by two dimensional Laplacian distributed random variables. 43 Chapter 3. Block-Based Motion Estimation V Most likely motion vector R Search region o o Layer 0, Jo v /O ' O O o o o Layer 1, J \ 0 \ 0 O O Layer 2, J2 -e—e—e—e—e—e—eo c b o o o o o o Figure 3.10: A diamond-shaped search path Floating-Center Search Path An alternative path is the floating center search path shown in Figure 3.11. Unlike the previous two search paths, the floating-center search path changes dynamically, and the associated search process is more complicated than that of the above two search paths. The corresponding algorithm searches only the layer whose center is the "best" motion vector found at the previous layer. For example, in Figure 3.11, the algorithm first searches the 4 points inside the diamond-shaped area marked as Layer 0 in Figure 3.11. After the 4 points have been tested, the "best" motion vector found in layer 0, marked as X , will then be the center of the new search region marked as layer 1. However, because of the overlap, only the 3 points in layer 1 will be tested. Next, the best motion vector found among the 3 points will be the center of the new search region marked as layer 2, Chapter 3. Block-Based Motion Estimation 44 Figure 3.11: Floating center search path and so on. The advantage of the floating-center search is that the number of the search points in each layer is equal to 4. Therefore, this method is inherently more efficient than the spiral and diamond-shaped search method. However, depending on the input video sequence, the size of the search region in each layer should be decided carefully. For example, for video with active motion, if the size of the search region is too small, then the method can be more easily trapped into local minima than the other two methods. 3.3.3 Termination Point Algorithm In RD optimized motion estimation, minimizing J(n) = D(n) + XR(n) can be very computationally demanding. In C D optimized motion estimation, to find the motion vector Chapter 3. Block-Based Motion Estimation 45 that gives the minimum J , we have to test all candidates motion vectors, rendering useless the computational cost constraint. Fortunately, however, the probability of finding the minimum value of J(n) along the search path is usually a monotonically decreasing function of the distance away from the predicted motion vector. Thus, by terminating the search process subject to rate or computation constraint, one can still obtain a nearoptimal motion vector with respect to the designated search path. However, along the search path the value of Lagrangian J(n) is not necessarily monotonically decreasing, and choosing an appropriate termination point is therefore a difficult task. One solution is to group positions on the search path into layers according to their geometrical locations as shown in Figure 3.9, 3.10 and 3.11. Let us assume that M motion vectors ( fc> Vfc+i,. . ., V/C+M) are in the Zth layer, then among the computed M Lagrangian valv ues {</(&), J(k + 1 ) , . . . , J(k + M ) } , we select the Lagrangian J with the minimum value to represent the Ith layer. Then, the Lagrangian values Jo, J i , • • • , Ji are compared to decide when to "best" terminate the search path. Intuitively, if the Lagrangian value Ji keeps increasing for several consecutive layers, then it is very unlikely that we could find a smaller J(n) in the following layers. In fact, for a smooth motion field, it is very likely that the "optimal" motion vector is located in the vicinity of the predicted motion vector. Building on the above assumption, we next propose two probabilistic models, where each model is based on a specific hypothesis. The two models are: • Model I (2-layer Hypothesis) : If J\ is larger than J / _ i , then it is unlikely that we will find a smaller Lagrangian J(n) by continuing the search outward. Thus, searching more motion vectors is not necessary. This hypothesis is often violated in areas where the motion field is not smooth. In such a case, the Lagrangian surface is likely not convex. Even when the motion field is smooth, the search as specified by this hypothesis can quickly get trapped in a local minimum. Chapter 3. Block-Based Motion Estimation 46 • Model II (3-layer Hypothesis) : Only if J ; _ < J\-\ < Ji, will we be confident that 2 searching more layers for a better motion vector is wasteful. This hypothesis places a larger convexity constraint on the search. It almost guarantees optimality, but at the expense of a much larger computational load. Because this test may never be satisfied, we must limit the number of considered vectors. By using any of the two models corresponding to the above two hypotheses, we allow the search area to expand or contract based on the local statistics of the motion field. E i ther model can be better than the other in terms of computation-performance tradeoffs, depending on the particular video sequence being coded. But Model I (2-layer Hypothesis) would be the choice when computational complexity is a major concern. As can be expected, both models fail in areas where many non-motion changes occur. Finally, note that the two models can be used interchangeably as dictated by performance and/or computation constraints. 3.4 3.4.1 Simulation Results Initialization The purpose of this experiment is to evaluate the performance of the four predictors (mean predictor, weighted mean predictor, median predictor, and statistical mean predictor) discussed in Section 3.3.1. In this experiment, the average entropy of the difference between the estimated (using the full-search method) motion vector and the predicted motion vector is used as the performance measure. For simplicity, only the x component of the difference motion vector is considered, and only integer-pel accuracy motion estimation is used. The smaller value of the entropy indicates that the corresponding motion predictor is more accurate. The video sequences used in this simulation are flie QCIF sequences (Quarter Common Intermediate Format, a video conferencing format that each Chapter 3. Block-Based Motion Estimation 47 frame containing 144 lines and 176 pixels per line) CAR P H O N E and MISS AMERICA, and the block size used for matching is 8x8. Tables 3.2 and 3.3 show the average entropy for MISS AMERICA and CAR P H O N E (respectively) of the motion vector x component in the search area centered at the mean, weighted mean, median, and statistical mean. The peak signal to noise ratio (PSNR) used in Tables 3.2 and 3.3 is defined as 25"5 = 10 logio—r- 2 PSNR (3.34) where o\ is the mean square error (MSE) of the reconstructed video sequence. There are two different mean predictors: (1) M E A N - A where the ROS in Figure 3.12 consists of blocks (A,B,D) and (2) M E A N - B where the ROS consists of blocks (A,B,C,D). As expected, M E A N - A leads to a lower entropy, as averaging many motion vectors adversely affects prediction accuracy. Since M E A N - A and M E A N - B are special cases of the weighted mean predictor (WM), the latter should yield better prediction accuracy. This is confirmed by the results shown in the tables. It is clear that the 3-block (A,B,D) median predictor (MED-A) outperforms the 5-block ( A , B , C , D , G ) median predictor (MED-B) in terms of both prediction accuracy and complexity. With the exception of its potentially higher robustness to channel errors, MED-B does not seem to be a good choice. When used to drive the F S M model, the weighted mean ( W M - S M ) also outperforms the uniform mean ( M - A - S M and M - B - S M ) in terms of prediction accuracy. Moreover, the statistical mean predictor ( W M - S M ) leads to a slightly better performance than W M . Finally, notice that M E D - A yields a lower entropy than the more complex linear and statistical mean predictors. This, coupled with its higher robustness to channel errors, is likely what made it part of the H.263 standard. It is clear from Tables 3.2 and 3.3 that the differences in average entropies are relatively small. This suggests that motion estimation/coding is not sensitive to prediction accuracy. Moreover, as shown in Tables 3.4 and 3.5 for MISS AMERICA and CAR P H O N E Chapter 3. Block-Based Motion Estimation 48 R Q F M 1 N H G J L K 0 C S E • B D A P Previous Frame Present Frame Figure 3.12: Average correlation coefficients (c ,c ) and the current one. x PSNR MEAN-A MEAN-B 35.9 36.8 37.8 0.566 0.620 0.689 0.569 0.635 0.679 1 y between the ROS motion vectors WM MED-A MED-B M-A-SM M-B-SM WM-SM 0.566 0.602 0.670 0.561 0.612 0.659 0.582 0.628 0.679 0.566 0.620 0.689 0.570 0.635 0.677 0.566 0.602 0.668 Table 3.2: Entropy for M I S S A M E R I C A of motion, vector x component offsets in the search area centered at the mean ( M E A N - A and M E A N - B ) , weighted mean ( W M ) , median ( M E D - A and M E D - B ) , mean-based statistical mean ( M - A - S M and M - B - S M ) , and weighted mean-based statistical mean ( W M - S M ) . PSNR 31.2 33.0 34.9 MEAN-A MEAN-B WM MED-A MED-B M-A-SM M-B-SM WM-SM 1.686 2.008 2.318 1.737 2.044 2.335 1.598 1.930 2.190 1.553 1.860 2.109 1.602 1.928 2.194 1.658 1.946 2.237 1.681 2.011 2.296 1.588 1.909 2.169 Table 3.3: Entropy for C A R P H O N E of motion vector x component offsets in the search area centered at the mean ( M E A N - A and M E A N - B ) , weighted mean (WM), median ( M E D - A and M E D - B ) , mean-based statistical mean ( M - A - S M and M - B - S M ) , and weighted mean-based statistical mean ( W M - S M ) . Chapter 3. Block-Based Motion Estimation 49 PSNR WM-SM FSM 35.9 36.8 37.8 0.569 0.602 0.668 0.498 0.547 0.592 Table 3.4: Comparison between the F S M (FSM) entropy of the motion vector x components and entropy of the x component offsets of those motion vectors located in the search area centered at the weighted mean-based statistical mean (WM-SM) for the sequence MISS AMERICA. PSNR 31.2 33,0 34.9 WM-SM 1.598 1.930 • 2.190 FSM 1.469 1.751 2.026 Table 3.5: Comparison between the F S M (FSM) entropy of the motion vector x .components and entropy of the x component offsets of those motion vectors located in the search area centered at the weighted mean-based statistical mean (WM-SM) for the sequence CAR PHONE. (respectively), prediction-based motion estimation/coding compares favorably with statistical FSM-based estimation/coding studied in [11, 12, 41]. In the latter case, an F S M ' model based on dynamic V Q is employed for probability-based estimation and direct coding of the motion vectors (i.e., explicit prediction is not performed). 3.4.2 Search Path In this part, the computation-distortion performance of methods based on the three search paths (spiral search path, diamond-shaped search path, and.floating-center search) discussed in Section 3.3.2 is evaluated. The computation-distortion optimized motion estimation algorithms associated with the three search paths are implemented i n the context of the M P E G - 2 T M 5 video encoder. Model I (2-layer Hypothesis) discussed Chapter 3. Block-Based Motion Estimation (a) 50 FOOTBALL PSNR vs. Computation 38 I • c 37.9 37.8 37.7 Football; 48 frames • Search Window: - 4 7 Rate: £ Mb its/sec to 47 _37.6 2, cc 37.5 z co * Diamond Q_ o Floating ce iter 37.4 37.3 37.2 37.1 37 0. 0.8 1 .1.2 1.4 Computations/MB (# of operations) 1.6 x 10 (a) B u s PSNR vs. Computation 35.1 35 34.9 Bus, 48 frames Search Window: - 4 7 t o 4 rt Rate: 8 Mbits/se 34.8 _34.7 ST 2, rr 34.6 z x Spiral: * Diamond CO o Floating center 34.5 34.4 34.3 34.2 34.1 0 0.7 0.9 1 1.1 1.2 1.3 Computations/MB (# of operations) 1.4 1.5 1.6 x 10 4 Figure 3.13: Compression performance of 3 search paths in terms of P S N R and number of computations per macroblock. Chapter 3. Block-Based Motion Estimation 51 in Section 3.3.3 is used for termination. The median predictor is used to obtain the predicted motion vector in all three video encoders. The test video sequences are the C C I R 601 (720x480 pixels, 30 frame/sec) F O O T B A L L and BUS sequences,'and the target bit rate is 8 Mbits/sec. The search window sizes for the F O O T B A L L and BUS are ± 4 7 pixels. The results of our experiment are shown in Figure 3.13. For the F O O T B A L L sequence, the results show that the method based on floating-center search path is more efficient than that of the diamond-shaped and the spiral search. However, for the BUS sequence, the performance difference is quite similar. It is due to the more active motion field in F O O T B A L L sequence, and therefore, the algorithm has to search more layers in the F O O T B A L L sequence. As explained in Section 3.3.2, the number of the search points in each .layer in the floating-center search is fixed, which on the other hand, for spiral search and the diamond-shaped search the number of the search points in each layer is proportional to the layer index. For BUS sequence, because of the smoother motion fields in the sequence, motion vector can be predicted quite accurately by the median predictor, and thus, the algorithm needs to search fewer layers. The simulation results suggest that the floating-center search path seems to be a good choice. 3.4.3 Termination Point , The proposed probabilistic models (based on the 2-layer Hypothesis and 3-layer Hypothesis) for termination of the motion search process are here evaluated and compared. The video encoder employed.a floating-center search path in this experiment. The test video sequences are the C C I R 601 (720x480 pixels, 30 frame/sec) F O O T B A L L and G A R D E N sequences, and the target bit rate is 8 Mbits/sec. Results are shown in Figure 3.14. For the F O O T B A L L sequence, the two methods are very close in performance. How- ever, for G A R D E N sequence, the method based on Model I is more efficient than the one based on Model II. For very active sequences, such as the F O O T B A L L , the proposed Chapter 3. Block-Based Motion Estimation 52 37.1 r 37 0.6 1 1.2 1.4 Computations/MB (# of-operations) 0.8 (a) 1.6 • 1.8 x 10 4 GARDEN PSNR vs. Computation 33.9 Garden, 48 fram 3S 33.85 Search \Window, -47 to 47 8 Mbits/ 33.8 33.75 DC 33.7 x 2-layer Hypoth esis 33.65 o 3-layer Hypoti- esis 33.6 Float Region • 33.55 33.5 6000 6500 7000 7500 8000 8500 9000 9500 10000 . Computations/MB (# of operations) ' 10500 11000 Figure 3.14: Compression performance of 2 models for search termination in terms of P S N R and number of computations per macroblock. Chapter 3. Block-Based Motion Estimation 53 distortion-computation optimized motion estimation algorithm yields a 200:1 encoding speed improvement with a 0.14 dB degradation. Besides its computational advantage, an important feature of our method is that computational resources are better allocated, depending on the statistics of the input video sequence. 3.5 Summary In this chapter, an efficient motion estimation algorithm that minimizes the compensation distortion under the constraints of the motion bit rate and computational cost has been proposed. Depending on the video application, the proposed algorithm can be simplified into a rate-distortion optimized motion estimation algorithm or a computation-distortion optimized motion estimation algorithm. For C D optimized motion estimation, an indepth discussion of the design and implementation of three different motion search parts: (1) initialization (motion vector prediction), (2) search path, and (3) termination was also presented. ' ' The main contribution of this chapter is that the proposed motion estimation provides excellent tradeoffs among the distortion, the motion vector bit rate, and the computational cost. The proposed algorithm overcomes the two common drawbacks of full-search block matching algorithm: low image reproduction quality and huge computation requirements. Besides the above advantages, our algorithm provides easy control-of rate, distortion and computations. Chapter 4 Rate-Distortion Optimized Mode Selection In this chapter, an efficient rate-distortion (RD) optimized mode selection algorithm is introduced and studied in the context of motion compensated prediction and DCT-based coding framework. The main advantage of our method as compared to the traditional methods is that the former chooses the mode and associated parameters that yield very good tradeoffs between the distortion and the bit rate. This chapter is divided into five sections. Section 4.1 describes the macroblock-based coding mode in a MPEG-like video coder. The traditional mode selection algorithms used in MPEG-like video coder are also briefly addressed in that section. In Section 4.2, rate-distortion (RD) optimized mode selection is discussed. In Sections 4.3 and Section 4.4, two efficient R D optimized mode selection algorithms are introduced. A summary is then given in the last section. 4.1 Multi-mode Video Coding Typical video sequences contain widely varying content and motion. Therefore, better coding performance can be achieved if different strategies are permitted to encode different image regions. Currently, the most effective block-based video coders address this problem by employing several modes of operation which are selected on a block-by-block basis. The advantage of the multi-mode approach is that its inherent adaptability lays the foundation for better coding results. In most block-based video coding standards, such as M P E G - 1 , M P E G - 2 , and H.263, . 54 Chapter 4. Rate-Distortion Optimized Mode Selection 55 the current frame is subdivided into unit regions called macroblocks, which contain a 16x16 luminance block and two or more 8x8 chrominance blocks. As blocks within the macroblock cannot be encoded differently, the macroblock is the smallest coding unit. At the macroblock level, the encoder has the latitude of choosing the coding mode. As such, a given macroblock can be intra-frame coded, inter-frame coded using motion compensated prediction, or simply replicated from the previously decoded frame. For example, inter-frame coding, or block-based motion compensation followed by D C T and quantization of the prediction error, is generally regarded as an efficient means for coding image sequences. On the other hand, intra-frame coding, or D C T coding a macroblock directly may be more efficient in situations when the block-based translational motion model fails. For static regions of the video, simply copying a portion of the previously decoded frame into the current frame may be preferred. Intuitively, by allowing multiple modes of operation, we expect improved rate-distortion performance if the modes are applied judiciously to different spatial and temporal regions of an image sequence. 4.1.1 Macroblock Coding Mode The number of allowed coding modes depends mainly on the video coding standard and the video application. For example, in P pictures, the M P E G - 2 main profile/main level standard offers 11 modes, with 7 of the 11 modes involving D C T and quantization. However, the coding modes can be classified into 3 principal modes, 1) INTRA, and 3) INTER+DCT. 2) INTER] For the I N T R A mode, the macroblock is D C T coded, and only .the D C T coefficients and the quantization step are transmitted. The I N T R A mode is normally more appropriate in the case of scene changes and occlusions. The I N T E R mode, which involves only motion compensation, is suitable for input video with a predictable motion field. In the I N T E R + D C T mode, both motion compensation and D C T residue coding techniques are used. Chapter 4. Rate-Distortion 4.1.2 Optimized Mode Selection 56 Conventional Mode Selection Algorithms The selection of the macroblock coding mode and its associated parameters (e.g. the quantization step) in conventional video coders, such as the M P E G - 2 T M 5 and the H.263 T M N 5 video coders [72, 58], is based on intuition and is not rate-distortion optimized. For example, in the H.263 T M N 5 video coder, the I N T R A mode is selected if motion compensation mean square error (MSE) is larger than the variance of the macroblock. Moreover, the quantization step is determined using a simple rate control mechanism. The drawback of the approach is that the mode and its associated parameters are selected without joint consideration of the bit rate and the video reproduction quality. 4.2 Rate-Distortion Optimized Coding Mode Selection In this section, a Lagrangian formulation is employed to address the problem of selecting the coding mode on a macroblock-by-macroblock basis within each frame of a video sequence. To simplify the problem, here, we assume that each of the frames in the sequence are optimized independently. Consider an image which is partitioned into N macroblocks by X = ( X i , . . . , Xyv). For a multi-mode video coder, each macroblock in X can be coded using only one of K possible modes given by the set X = ... Let Mi G I be the mode selected to code macroblock X,-. The modes assigned to the elements in X are then given by the Af-tuple, M — ( M i , . . . , M/v) G X . N The problem of finding the combination of modes that minimizes the distortion for a given video frame, given a rate constraint R , can be formulated by c M (4.35) Chapter 4. Rate-Distortion Optimized Mode Selection 57 Here, D(X, M.) and R(X, A r ) represent the total distortion and rate, respectively, resulting from the quantization of the X with a particular mode combination M.. Assuming an additive distortion measure, the cost function D(X, A4)~and the rate constraint can be simultaneously decomposed into sums of terms over the elements in X and re-written using an unconstrained Lagrangian formulation so that the objective function becomes minf;j(A,X,-,>W), 1=1 (4.36) where J ( A , X ; , . M ) is the Lagrangian cost function for macroblock X ; and is given by J ( A , X , A W ) = D(Xi,M) I + \R(Xi,M). (4.37) It is not difficult to show that each solution to Equation (4.36) for a given value of the Lagrangian multiplier A corresponds to an optimal solution to Equation (4.35) for a particular value of R c [67, 16]. Unfortunately, even with the simplified Lagrangian formulation, the solution to Equation (4.36) remains rather unwieldly due to the rate and distortion dependencies manifested in the D(X. , M.) and i?(X;, j\4) terms. t To simplify this problem, we can assume that the rate and distortion for a given macroblock are impacted only by the content of the subject macroblock and its corresponding mode. As a result, the rate and distortion associated with each macroblock can be computed without regard to the modes selected for the other macroblocks, resulting in a simplified Lagrangian given by J(X,Xi,M) = J(A,X;,M;). (4.38) In this case, the optimization problem of Equation (4.36) reduces to N N m i n V J ( A , X , M ) = V min J(A, X , , MA, 2 i=l 4 (4.39) i=l and, as a result, can be easily minimized by independently selecting the best mode for each macroblock. Chapter 4. Rate-Distortion Optimized Mode Selection 58 The idea of applying an R D criterion to the problem of mode selection was presented by Chung, et al. [11] in a subband video framework. Similar criteria were then applied to Telenor's H.263 coder by Wiegand et al [78, 79], by Chung et al [13], by Schuster and Katsaggelos [64], and by Lee et al. [49, 51, 50]. A n intuitive approach for coding mode decision for M P E G - 2 was proposed by Sun et al. [69]. A common important result of the above investigations is the fact that significantly better compression performance can be obtained by using an RD-optimized mode selection criterion. 4.2.1 Exhaustive Search: Performance Upper Bound Theoretically, in order to achieve the optimum performance, all the mode and associated parameters should be considered (exhaustive search). For block-based DCT-based video coders, the most important parameter associated with the I N T R A and I N T E R + D C T principal modes is the quantization step Q. In fact, an exhaustive search R D optimized mode selection algorithm will test, for the I N T R A and I N T E R + D C T principal modes, all (Mi,Qi) of a macroblock Xi to find the minimum Lagrangian J(A) given by J(A) = D(Mi,Qi) + A R(Mi,Q-). (4.40) However, the above exhaustive search method can be very inefficient for video coding standards supporting many coding modes and parameters. For example, the number of possible combinations of coding modes and associated quantization steps in M P E G - 2 main profile/main level for P pictures is 259. In this case, the resulting M P E G - 2 encoding time will be 30 - 40 times longer than that of the M P E G - 2 T M 5 encoder. However, as shown in Figure 4.15, the associated gain in compression efficiency would be 1.5 - 2.0 dB in P S N R as compared to the M P E G - 2 T M 5 encoder. To increase computational efficiency, we propose two fast mode-selection algorithms. The first algorithm is based on a thresholding method. For video coding standards with a Chapter 4. Rate-Distortion Optimized Mode Selection 59 relatively smaller number of (M,-, Qi) combinations, the thresholding method can be very effective in reducing the computational demands. The second algorithm is based on an adaptive finite state machine (FSM) method, which exploits the statistical redundancy among neighboring macroblocks. 4.3 Fast Mode Selection Algorithm I: Thresholding Method One way to reduce the total number of computations is to employ thresholding techniques that allow the encoder to safely eliminate some I N T R A and/or I N T E R coding options from consideration. For example, the Lagrangian of the "SKIP" mode ( I N T E R mode with a zero motion vector) of operation can be compared to a varying threshold T kip, and- if it is smaller than T ki a s p} the other modes are no longer considered. These techniques are similar, in principle, to those employed in conventional video encoders, but are here found to be more effective because the bit rate and distortion are more carefully traded. Although practical constraints often dictate that such methods be used, they are unfortunately still ad-hoc as there is no guarantee that the best coding control decision has been made. The threshold T ki is used to provide a good balance between performance and nums p ber of computations. A small value of Tskip increases both the computational complexity and the probability of achieving the optimal solution. A large value of T ki reduces the s p computational complexity but at the expense of possibly selecting an inferior operating mode. The conventional video coders (e.g. H.263 T M N 5 coder) employ some constant thresholds that have been optimized to incorporate the M A D as a cost measure and to favor the predicted motion vector. Although the bit rate and distortion are taken into account simultaneously in our framework, similar thresholds can still be heuristically derived. To improve our estimation of the threshold T ki , we also incorporate memory in s p Chapter 4. Rate-Distortion Optimized Mode Selection 60 Figure 4.15: The compression performance comparison between the M P E G - 2 T M 5 coder and the full-search R-D optimized mode and quantization step selection algorithm coder for Bus sequence (top), and Football sequence (bottom). Chapter 4. Rate-Distortion Optimized Mode Selection 61 the form of simple prediction models indicating the level of, activity in the video scene. The thresholding techniques are very effective for inactive video sequences such as those encountered in video-phone applications. However for video sequence with active motion the thresholding techniques would be less effective, mainly because much fewer macroblocks would be encoded following the SKIP mode of operation. 4.4 Fast Mode Selection Algorithm II: Finite State Machine Method In the proposed fast mode selection algorithm, we still need to test the Lagrangian value J(\) in Equation (4.40) in advance of determining the (Mi,Qi) J(A). However, the testing order of the (Mi,Qi) that gives, the smallest pairs depends on its probability of yielding the minimum J(A). We will next provide a detailed explanation of the algorithm. To begin, the algorithm will choose the most probable (Mi, Qi) pair as the candidate and computes the corresponding Lagrangian J(A). The Lagrangian J(\) will then be compared to a reference JR, and the comparison result is used to decide whether to terminate the search. If J(A) is smaller than JR, then we stop the search and the corresponding (M ,Qi) will be selected. If J(A) is larger than JR, then we continue the t search and we will select the second probable (M ,Qi) l as the candidate, and so on. We stop the search when 1) the Lagrangian J(X) is smaller than JR, or 2) we exceed a specific computational constraint. Intuitively, if the encoder knew the "correct" probabilities of the (Mi, Qi) pair and the "optimal" value of JR for the present macroblock, then it is very likely that the proposed algorithm locates the "best" (Mi,Qi) pair, even when aborting the search after just a few comparisons. To summarize, it is clear that the performance of the proposed fast algorithm depends on two key factors: 1) the probability of (Mi, Qi), and 2) the reference JR. These factors are addressed in the following subsections. Chapter 4. Rate-Distortion 4.4.1 Optimized Mode Selection 62 Finite-State Machine (FSM) Ideally, the probability of the (M -,Q -) pair can be determined from the conditional t t distribution function of the mode M - and the quantization step Qi given the macroblocks t f{Mi,Qi I'AW), (4.41) where X{-\ = ( X i , . . . , X ; _ i ) represents the previous i — 1 coded macroblocks. However, in practice, such a distribution function is very difficult to estimate. Fortunately, as shown in Figure 4.16 for a full-search R D optimized coder, the distributions of the mode and quantization step are very structured. In fact, the conditional distribution function f(Mi,Qi | can be approximated by f(Mi, Q IK-i) { « /(7V/,, Q | Mi-i, (4.42) t where the A4i-\ = { M i , . . . , Mj_i} and Qi-\ = {Qi, • • •, Qi-i} represent the set of the modes and the set of the quantization steps, respectively, of the previous i — 1 coded macroblocks. To estimate the conditional probabilities of f(M,Q | Mi-i, Qi-i), a finite state ma- chine (FSM) can be employed. The F S M consists of a mapping rule and a set of K probability tables {fk(Mi,Qi), k = 1,2,..., A"}, where K is the number of states in the F S M . The mapping rule will assign a probability table fk(M{,Qi). probability table fk(Mi,Qi) In general, the will be created on-line and be updated regularly in order to dynamically track the local statistics of the video sequence. 4.4.2 The Reference Lagrangian JR As shown in Figure 4.16, the distribution of the optimum Lagrangian J * is also very structured. Therefore, the reference Lagrangian JR can be derived from the optimum Chapter 4. Rate-Distortion Optimized Mode Selection (a) V I D E O F R A M E (b) O P T I M U M L A G R A N G I A N J * (c) M O D E (d) QUANTIZATION STEP 63 Figure 4.16: The J*, mode, and quantization step for a frame of the Football sequence, obtained using a full-search R D optimized encoder Chapter 4. Rate-Distortion Optimized Mode Selection 64 Lagrangian J* values of the neighboring macroblocks. One approach to estimate JR is to compute the weighted mean of the Lagrangian J* values of the neighboring macroblocks within a region of support (ROS). That is, . ,/,,= £ «•;.; • ./" ' ; (4.43) i,j<EROS where W{j are the weighting coefficients at position (i,j) in the ROS. The Lagrangian JR will directly affect the speed of the proposed algorithm. Therefore, the encoding speed can be controlled by scaling.the JR. For example, if we scale down JR, more (M,-,QJ) pairs will be considered, and the probability of finding the optimum (M;, Q ; ) increases.. The cost, however, is a substantial increase in number of computations. 4.4.3 Implementation Issue In a typical DCT-based video coder, several modes must be supported, and many quantization steps are allowed. Thus, the memory resources required to implement the F S M described in Section 4.4.1 can be prohibitive. For example, there are 259 (M{,Qi) combinations in the P pictures for the M P E G - 2 main profile/main level standard. In this case, if three neighboring macroblocks are used to compute the states in F S M , then 259 = 17,373,979 states would"be generated. Moreover, sorting a probability table 3 with many entries would definitely require a large number of computations, offsetting the reduction in computations that would be achieved using the F S M . To address the memory and computation problems, we have to decouple the mode and quantization step in the F S M , that is, we assume that the two parameters are independent. In other words, Equation (4.42) can be simplified, yielding the following equation f(Mi, Q \ Mi-u z Qi-i) = f(Mi | M^) • f(Qi | Qi'-!). . ' (4.44) Chapter 4. Rate-Distort ion Optimized Mode Selection 65 According to the above equation, the mapping mechanism and the probability table for mode and the quantization step can be designed separately, simplifying greatly the algorithm in terms of design complexity, memory requirements, and computational cost. To further simplify the mode selection algorithm, a layer structure is here adopted. In the new framework, the modes are re-organized in a two-layer hierarchical structure. The highest layer consists of three principal modes: 1) I N T R A , 2) I N T E R , and 3) I N T E R + D C T . In the lower layer the modes in each principal mode category reflect variations that are normally dependent on the video coding application. For example, in M P E G - 2 main profile/main level application, two modes belong to the I N T R A principal mode, and they are 1) I N T R A - F I E L D and 2) I N T R A - F R A M E . Locating the "best" coding mode in the above two-layer structure is a more efficient task, mainly because all the modes that belong to the same principal mode are considered before moving to the next principal mode. Another reason is that due to the hierarchical structure, the design of the F S M that determines the search order of the principal modes becomes much simpler. For example, in this case, if three neighboring macroblocks are used to compute the states in F S M , then only 3 = 27 states would be generated. The 3 smaller number of of the state will significant improve the speed of state determination and the memory requirement. Besides, clue to the smaller number of probability entries of the principal mode, the F S M can estimate the conditional distribution of the principal mode more efficiently. The test order of the modes under the principal modes, depending on the video application, can be fixed or can be dynamically arranged by another F S M . Another source of computation efficiency is that, (1) the distribution of the quantization step is unimodal, and (2) high correlation exists between neighboring macroblocks' quantization step. Simulation results, some of which are discussed in Chapter 6, show that Qi can be predicted accurately from the quantization steps of just a few neighboring macroblocks. For simplicity, we will use the following example to explain the operation Chapter 4. Rate-Distortion Optimized Mode Selection 66 of the proposed fast R D optimized mode selection algorithm. Let us assume that M , Mb, M , and Md are the 4 modes in the mode selection proa c cess. Among the 4 modes, modes M ,Mb and M are modes involving DCT.' Let the a c maximum number of quantization steps for search be 9 in this case, and let the conditional probability of the 4 modes be 0.1, 0.3, 0.4 and 0.2. The search order of the 4 modes then becomes M , Mb, Md and M , and the numbers of allocated quantization steps are c a 5 for M , 3 for M j , 0 for Md, and 1 for M . c a If Q is the quantization step predicted from neighboring macroblocks, then the 5 quantization steps for mode M will be Q-2, c Q-l, Q, Q+l, Q+2. After the 5 quantization steps of M are searched, the minimum c Lagrangian J* of mode M will be compared to the reference JR. If J* < JR, then the c search stops, and the quantization step Q* that gives the J* and mode M will be the c result. Otherwise, the 3'quantization steps"(3*-l, Q*, Q*+l of mode Mt.will be searched, and so on. 4.5 Summary In this chapter, we introduced an R D optimized mode selection algorithm which significantly improves the compression performance. Several fast implementations are also proposed and discussed. The important result is that, by employing hierarchical structures and exploiting structural dependencies in the video sequence, our mode selection algorithm can be effective and efficient. In the next two chapters, we will discuss specific implementations of the proposed method in the context of H.263's T M N 5 and MPEG-2's T M 5 video coding frameworks, respectively. Chapter 5 Application: H.263 Based Low Bit Rate Video Coding This chapter introduces an H.263-based video coder which operates at very low bit rates (lower than 16 Kbps), and is based on the algorithms developed in Chapter 3 and Chapter 4. Although the proposed coder is also based on block-based motion compensation and D C T residual coding, it is very different from conventional coders. The. proposed video coder employs: 1) a fast median-based predictive motion searching technique, 2) an R D optimized motion estimating algorithm, and 3) an R D optimized coding mode decision algorithm. This chapter consists of three major sections. In Section 1, a brief introduction of the H.263 coding standard is presented. In that section, we also explain the disadvantages of the conventional H.263 coders such as Telenor's T M N 5 [72]. In Section 2, we propose a new H.263-based coder that employs the algorithms proposed in the previous chapters. Experimental results and some conclusions are given in the last two sections. 5.1 Motivation and Background The great demand for very low bit rate video compression has motivated extensive research and standardization activities around the world. Many new compression algorithms have been developed that allow transmission or storage of QCIF resolution video with acceptable quality at bit rates as low as 16 Kbps [74, 24, 53, 1]. .Most notable are the H.263 video coders [36], which have been shown recently to perform quite well in comparison with more complex video coders. The H.263 framework was initially adopted 67 Chapter 5. Application: H.263 Based Low Bit Rate Video Coding .68 Bits/frames vs. rate 700 600 500 a) 400 E I ' 300 m 200 100 0 3000 4000 5000 6000 7000 8000 Rale (bps) 9000 10000 11000 12000 Figure 5.17: Motion vector bit rate, D C T bit rate, and side information for Telenor's coder. by the M P E G 4 [32, 60, 62] group, it is expected to provide a toolkit-based audiovisual coding standard allowing many application-driven functionalities such as high compression performance and sufficient robustness in error-prone environments by the end of the year 1998. It is widely accepted that current H.263 compression algorithms perform well for the target applications at bit rates between 20 and 64 kbps relative to coders of the earlier generation, but the compression performance tends to deteriorate rapidly at lower bit rates (such as 8 or 10 kbps). The main reason the compression performance of most H.263 video coders deteriorates at very low bit rates is that, as shown in Figure 5.17 for Telenor's H.263 video coder [72], both the motion vector bit rate and the side information become excessive at such low bit rates. This is due to the following reasons: • Inadequate macroblock (MB) coding control strategy: the mechanism used for Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 69 switching between inter and intra coding is mostly ad-hoc. • Independent estimation and coding of the motion vectors: regardless of the target bit rate or quality, estimation and coding are performed independently. • High motion vector precision: in very low bit rate applications using |-pel accuracy does not improve the compression performance much. • Inefficient motion estimation: physical motion is usually very limited, structured, and slowly-varying in very low bit rate applications. This property is not well exploited. . In this chapter, we develop an H.263-based video coder that addresses the above issues. Before we present a detailed discussion of our coder, we will first briefly introduce the H.263 coding standard. 5.1.1 The H.263 Standard: Overview The H.263 standard was developed for video coding for low bit rate audiovisual communication systems. H.263 is based on H.261 and M P E G - 1 . It employs as input the same luminance/chrominance color space (4:2:0, Y Cb Cr), and only non-interlaced pictures are coded. The macroblock, block definitions and D C T / I D C T equations are identical to those of H.261 and M P E G - 1 . In addition to the H.261 GIF and QCIF image formats (see Table 5.6), H.263 supports one smaller source image format (sub-QCIF) and two larger formats (4CIF and 16CIF). Another difference from H.261 is quantization. Unlike H.261, quantization is not. restricted to change only for every 11 macroblocks. It can be changed at the picture, group of blocks (GOB), and macroblock layers. Moreover, H.263 allows f-pel accuracy Chapter 5. Application: H.263 Based Low Bit Rate Video Coding Format sub-QCIF QCIF CIF 4CIF 16CIF luminance pels/line lines 128 ' 96 144 176 352 288 704 576 1152 1408 70 chrominance pels/line lines . 64 48 72 88 176 144 352 288 704 576 Table 5.6: H.263 source picture formats. There is no ability to specify arbitrary image size as can be done in M P E G - 2 motion vectors, and the predicted motion vector is computed based on the median filter output of three neighboring macroblock (or block) motion vectors. However, the major departure from H.261 is H.263's use of four optional coding modes as Annexes. First, the unrestricted motion vector mode allows motion vectors that point outside the picture. The edge samples on the picture boundaries are duplicated to create the "not existing pixels" for prediction. Second, the syntax-based arithmetic coding (SAC) mode is recommended as an alternative to Huffman variable length codes. The greater coding efficiency of SAC gives the same picture quality, but with approximately 5% fewer com• pression bits. Third, in the advanced prediction mode, four motion vectors representing the four luminance 8x8 blocks in a macroblock are allowed. Fourth, in the PB-frames mode, two pictures are coded as one unit. A macroblock in a P-picture is immediately followed by the corresponding macroblock in the B-picture. Clearly, the above modes improve compression performance and increase the coder's flexibility, although of course, at. the expense of additional complexity. 5.1.2 The H.263 Bit Stream Structure Figure 5.18 shows an H.263 video sequence layer structure. The H.263 bit stream syntax is arranged in a hierarchical structure with four layers. Data for each picture consists Chapter 5. Application: H.263 Based Low Bit Rate Video Coding picture header GOB GOB header GOB MB (if not first G O B ) ( i f macroblock header MB GOB MB EOS MB 71 Picture layer Group of blocks layer not skipped) block 1 block 6 block 7 block 12 Macroblock layer ^(if PB-frame mode) Intra DC run-level VLC EOB Block layer (if intra) Figure 5.18: H.263 video sequence layers. of a picture header followed by data for a Group of Blocks, eventually followed by an end-of-sequence code (EOS) and stuffing bits. The header contains the start code for re-synchronization and the information about the picture size, temporal reference, quantization step, the type information for the four optional coding modes, etc. Data for each Group of Blocks (GOB) in a G O B layer consists of a G O B header followed by data for macroblocks. Each G O B contains one or more rows of macroblocks. For the first GOB, no G O B header is transmitted. The G O B header consists of the start code for re-synchronization, the G O B number, quantization step, etc. Data in the macroblock (MB) layer consists of a macroblock header followed by data, for the six 8x8 blocks. A n M B is comprised of four luminance (Y) blocks and one of each of the two chrominance blocks (Cr, Cb). When the PB-frame mode is selected, the M B in a P-picture and the M B in a B-picture are coded as one unit, in this case an M B layer contains 12 blocks. The header contains the macroblock type, the coded block pattern, the motion vectors, etc. A detailed discussion about the M B layer will be given in the Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 72 following section. In the block layer, if an intra M B is coded, the D C coefficient in the block is transmitted first. It is followed by as many run-level variable length codes (VLC) as needed to code the nonzero coefficients. For an inter M B , the D C coefficient is coded with the A C coefficients. The last nonzero coefficient is coded as the last for the block. 5.1.3 H.263 Macroblock Layer In H.263, the bit rate representing a macroblock can be divided into three types: (1) the header (or the side information) bit rate R , (2) the motion vector(s) bit rate s and (3) the D C T bit rate R . R, m The side information will be used to indicate the mode c type of the encoded macroblock, and if necessary, it will contain some extra information such as the quantization step and the coded pattern of the 6 (8x8) D C T blocks. The motion vector bits represent the motion vector(s), and the D C T bits represent the D C T coefficients of the 6 (8x8) blocks. In Table 5.7 we show all the data elements (except the D C T ) used in the macroblock layer, where the symbol " X " in the table denotes that the element of the corresponding M B type is used. Thus, R will be the normalized sum of the bits of C O D (coded s macroblock indication), M C B P C (macroblock type and coded block pattern for chrominance), C B P Y (coded block pattern for luminance), and the D Q U A N T (quantizer information). The bit rate R m will be the normalized sum of the bits of M V D and MVD2-4 (motion vector data). The bit rate R will then be the normalized sum of the bits of c D C T coefficients of the coded 8x8 blocks. I Picture From Table 5.7, we can see that for I-pictures, two modes are available: I N T R A and I N T R A + Q . The difference between them is that in the case of the I N T R A + Q mode, the Chapter 5. Application: H.263 Based Low Bit Rate Video Coding Picture M B Type P-picture Not-Coded INTER INTER+Q INTER4V INTRA INTRA+Q I-picture Side Information R MCBPC CBPY DQUANT s COD X X X X X X INTRA INTRA+Q X X X X X X X X X X X X X X Table 5.7: Coded symbols for R m X 73 Motion Vector R MVD MVD2-4 m X X X X X X and R for various coding modes, s encoder will send 2 bits for D Q U A N T (the difference in the quantization step). P Picture • For P-picture, there are six modes of operation denoted by Not-Coded, I N T E R , INT E R + Q , I N T E R 4 V , I N T R A , and I N T R A + Q . In the Not-Coded mode, the C O D parameter is set to "1" and the current macroblock is replaced by the macroblock at the same spatial location in the previous reconstructed picture. In this case, only the C O D parameter needs to be coded. This mode is designed for areas in the picture where little or no change relative to the previously reconstructed picture is detected. In both the. I N T E R and the I N T E R + Q modes of operation, one motion vector is transmitted along with the D C T coefficients of the prediction difference blocks. The difference between the I N T E R and I N T E R + Q modes of operation is that, for the latter, the value of Q U A N T is updated. This is often required to compensate for prediction inaccuracies. The mode I N T E R 4 V is similar to the'mode I N T E R , except that four motion vectors representing four 8x8 blocks are transmitted. The mode I N T E R 4 V is found to be useful for pic- ture areas with high motion activity. Due to noise, occlusion, zoom, large illumination Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 74 changes, or complex motion activity, simple translational motion-compensated prediction' may be inadequate. In such a case, operating in the I N T R A or I N T R A + Q modes may be beneficial. In the I N T R A + Q mode, the parameter Q U A N T is changed, usually to compensate for illumination variations. 5.2 Proposed Coder To address the issues discussed in Section 5.1 we propose an H.263 video coder based on the algorithms introduced in the previous two chapters. The new H.263 video coder employs a rate-distortion (RD) based mechanism (see Chapter 4) to select amongst the H.263's M B coding types. A n R D criterion for alternating between intra and inter coding modes was first presented by Chung, et al.[ll] in a subband video framework. Similar criteria were then applied to Telenor's H.263 coder by Wiegand et al [79], by Chung et al [13], and by Schuster and Katsaggelos [64]. A common important result of the above investigations is that significantly better R D tradeoffs can be obtained at very low bit rates using an R D mode selection criterion. Unique to our RD-based selection criterion, .however, is that it is also dependent, through the use of multi-path searching,'on the performance of the D C T residual coder. Our proposed coder also employs a fast median-based predictive integer-pel motion estimation technique that minimizes the Lagrangian, i.e., the distortion biased by the 1 number of required bits. Recently, linear prediction has been proposed [3] to reduce the large computational complexity associated with block matching algorithm ( B M A ) based motion estimation. The application of several linear and nonlinear prediction techniques to BMA-based motion estimation is studied in [43]. We here study two types of median predictors: the H.263-specified 3-block median predictor and another 5-block median 1 Although other distortion measures can be used, only the popular M S E and M A D will be considered. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding predictor. 75 The Lagrangian minimization provides a more efficient cost measure that has also recently [27] been studied for motion compensation in the context of the H.261 framework. Our strategy, however, is based on the work discussed in [11, 13], where the RD-based cost measure is analyzed in the context of a subband video coding framework. Finally, the proposed algorithm employs a 3-layer diamond-shaped searching area centered at the most likely motion vector, assuming given a prediction model (e.g. a. median predictor). The proposed method is motivated by the fact that, for very low rate video coding applications, physical motion is usually very limited, structured, and slowly-varying. 5.2.1 I-picture M B Coding Strategy As specified in the H.263 standard, intra coding of I-pictures consists of 8x8 D C T coefficients calculation, uniform quantization, and then run-level and variable-length coding (or arithmetic coding). Left to the designer, however, is the problem of determining the value of the quantization of the D C T coefficients. The only two requirements are that (1) the value of Q U A N T can be changed and transmitted only at the picture and/or the group-of-blocks (GOB) layers and (2) the four possible values of D Q U A N T (2 bits) are used to adjust the value of Q U A N T at the M B layer. We follow the H.263 approach, where Q U A N T of the first macroblock of the first I-picture is initially set to the middle value (i.e., QUANT=16), and Q U A N T of each other macroblock is set during the encoding process to the value of Q U A N T of the previous macroblock. Then, given a Lagrangian parameter A that controls, the R D tradeoffs, the value of Q U A N T can be adjusted by one of the 4,possible values of D Q U A N T that minimizes */*(A) = D + X(R + R ), where S c R and D are the number of bits and the distortion (MSE or M A D ) associated with the c corresponding D C T coder, respectively, and the R is side information as shown in Table s 5.7. Next, we compare the smallest J\(A) with </|(A) = D + X(R + R ), the Lagrangian S c Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 76 obtained in the case where D Q U A N T = 0 , and select the M B type that corresponds to the smaller value. 5.2.2 ' P-picture M B Coding Method For simplicity, PB-frames are not used in this work. For H.263's P-pictures, the basic coding operation consists of: 1) motion estimation and compensation, 2) M B type determination, 3) the possible coding of the motion vector(s), and 4) the possible D G T coding of the prediction difference blocks. Assuming that the macroblocks within a P-picture are statistically independent, the optimal macroblock coding control strategy is the one that yields the best R D tradeoffs. More specifically, we should seek the motion vector d = (x,y) (if any), the quality factor ( Q U A N T ) Q, and mode M that minimize the Lagrangian Jx(d, Q,M) = D + X [R + R' + R }. m In the above equation, D is the overall distortion, R C m S (5.45) is the number of bits needed to code the motion vector(s), R is the number of bits required for D C T coefficients coding, c and R is the number of bits associated with side information. s Direct Method It is clear from Table 5.7 that six Lagrangian values must be computed in order to find deterministically the coding mode that yields the lowest Lagrangian value. However, this is generally impractical especially as the I N T E R modes of operation involve the joint optimization between the motion vector estimation/coding and the D C T coding of the corresponding prediction difference blocks. That is, for every motion vector candidate, the D C T coder must be applied to the difference block(s) in order to evaluate both R c and D. Moreover, as can be seen from equation (5.45), even the evaluation of D must be Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 77 performed for all values of D Q U A N T . The latter part ean be performed efficiently as the D C T is a unitary transformation (i.e., only one forward D C T needs'to be evaluated for each 8 x 8 block). However, although the motion vector search area can be reduced substantially'as described in the next subsection, the evaluation of the I N T E R Lagrangians can still be computationally demanding. Two Stage Method An alternative way to reduce the computational complexity is to decouple the motion vector coding and D C T coding processes so that the Lagrangian minimization is applied sequentially. More specifically, we first locate the motion vector(s) that yield(s) the minimum Lagrangian, which is formed by the estimation distortion biased by the number of bits required for the coding procedure. The corresponding difference blocks are then computed and D C T coded. Finally, the resulting average values of bit rates and distortions, along with values of R and R , are used to compute the I N T E R mode s m Lagrangians. Of course, this procedure is sub-optimal because of its sequential search structure. We found that small gains could be achieved by employing multi-path searching. In M-search, M paths are considered as good candidates for the next stage. In M-search, generally more than one motion vector can be considered as a good candidate, and D C T coding is applied to each of the corresponding difference macroblocks. R D Optimized Motion Estimation For each macroblock and its four 8x8 luminance blocks in the current picture, a motion vector d = (x,y) £ S, where S is the set of all possible vectors in the search area, is sought. Each motion vector is chosen to minimize the Lagrangian J™(d) = £ p ( / ( r , n ) - J ( r + d , n - l ) ) + A /T(d), (5.46) Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 78 where r is the (x,y) spatial index of an image pixel, n is the time, index, I(r,n) is the image intensity of the candidate macroblock/block in the current picture, I(r + d, n — 1) is the image intensity of the matching macroblock/block in the previous reconstructed picture, W is the size of the matching window, />(•) is the square operation (MSE) or the absolute operation ( M A D ) , and.R (d) is the number of bits required to encode the m motion vector. Minimizing J™(d) is equivalent to the process of full-search entropyconstrained vector quantization [18, 40, 10, 9], where the codebook contains all 16 x 16 (or 8 x 8 ) vectors in the search area. After determining .the best .motion vector(s) d* representing the macroblock or each of the four 8 x 8 blocks, either one or four motion vectors depending on what yields better R D tradeoffs, is selected. Motion Vector Accuracy Many experimental results have shown that using |-pel accuracy motion estimation in our R D framework yields an insignificant improvement or a decrease in compression performance relative to integer-pel accuracy motion estimation at very low bit rates. Thus, only integer-pel accuracy is used; this slightly reduces the computational complexity. The latter can be reduced dramatically if the size of the set S and/or the size of the matching window W is reduced. Several techniques have been proposed [24, 5] that achieve this goal, at the expense of some loss in estimation performance. We here introduce a median-based predictive searching technique that is similar in principle to the statisticalbased predictive techniques described in [11, 12, 13, 41], but is simpler in concept and implementation. Motion Prediction If compatibility with current H.263 decoders is desired, then the independent medianbased prediction and motion vector difference variable-length coding methods described Chapter 5. Application: H.263 Based Low Bit Rate Video Coding Present Frame x, y-1, t x-1,y, t Previous Frame x+1, y-1, t x. y, t H.263 x, y, t-1 79 Present Frame x-1,y-1,t x, y-1, t x-1,y, t x, y. t x+1, y-1, t 5-Block Median Figure 5.19: The regions of support used by H.263 and our coder. in the H.263 standard must be followed. However, we here propose an alternative method that is generally more robust to channel noise. Figure 5.19 compares the regions of support (ROS) used by the H.263-specified prediction model and ours. Not only does the 5block ROS, which includes 4 spatially and 1 temporally neighboring macroblocks/blocks, provide more prediction accuracy, but it also leads to a motion vector coder output that is more resilient to channel errors. In fact, it can be easily verified that, using the 5-block ROS, only three or more erroneously decoded motion vectors (as compared to two or. more of them using the 3-block ROS) can cause error propagation. Diamond Shaped Search Method Our technique exploits the fact that, in very low bit rate applications, physical motion is usually very limited, structured, and slowly-varying. As illustrated in Figure 5.20, we first determine the most likely motion vector given a prediction model. Then, only the candidate motion vectors in the diamond-shaped small search area centered at the most likely motion vector is considered. Depending on the prediction model, the bit rate of operation and the content of the video scene, we found that full-search yields only a few (2 — 5 %) motion vectors that do not belong to the diamond-shaped search area. Thus, Chapter 5. Application: H.263 Based Low Bit Rate Video Coding y V R 80 Median-predicted vector Search area X Figure 5.20: The search area used during motion estimation. such a technique can reduce substantially the number of computations at the expense of only a small loss in estimation performance. M B Mode Selection Criterion After motion estimation in the two-stage method, the encoder has to determine the mode of the M B , along with the value of the quantization of the D C T coefficients. As discussed in Chapter 4, the "best" M B mode selection is the one that yields the best R D 'tradeoffs. More specifically, we should seek the mode M , and the quality factor ( Q U A N T ) Q, given the motion v'ector(s) obtained in the motion estimation stage that minimize the Lagrangian J (M,Q) X = D + \[R m + R + R }. c s • (5.47) Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 81 The procedure of mode selection is similar to the one discussed in Section 5.2.1. To find the minimum J\(M, Q), the encoder has to test six possible modes for a macroblock (see Table 5.7), and select the mode (and associated Q U A N T value) that corresponds to the smallest Lagrangian value. Threshold T One way to reduce the total number of computations is to employ thresholding techniques that allow us to safely eliminate the I N T R A and (especially) I N T E R coding options from consideration. For example, the Lagrangian of the N O T - C O D E D mode of operation can be compared to a varying threshold T, and if it is smaller than T, the other modes are no longer considered. These techniques are similar, in principle, to those employed in Telenor's H.263 coder, but are here found to be more effective because the bit rate-and distortion are more carefully traded. Although practical constraints often dictate that such methods be used, they are unfortunately still ad-hoc as there is. no guarantee that the best coding control decision has been made. The threshold T is used to provide a good balance between performance and number of computations. A small value of T increases both the computational complexity and the probability of achieving the optimal solution. A large value of T reduces the computational complexity but at the expense of possibly selecting an inferior operating mode. The H.263 standard suggests some constant thresholds that have been optimized to incorporate the M A D as a cost measure and to favor the predicted motion vector. Although the bit rate and distortion are taken into account simultaneously in our framework, similar thresholds can still be heuristically derived. To improve our estimation of the threshold T, we also incorporate memory in the form of simple prediction models indicating the level of activity in the video scene. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 5.2.3 82 Rate Control The Lagrangian parameter A controls mainly the R D tradeoffs. One method to find a particular value for A is to employ the bisection search algorithm described in [15, 61]. This method, however, is usually computationally demanding, as a large number of iterations may be required. A n alternative method is to update A using least-meansquares (LMS) adaptation [26], as adopted in [79]. In this work, t'he parameter A is initially estimated based on computed long-term statistics. It is then varied adaptively during the encoding process depending on a bit rate and/or a quality constraint, following the simple recursive control algorithm discussed in detail in [43]. Without loss of generality, let's assume that our encoder's output bits are stored in a buffer of size S , and are then transmitted over a fixed-rate communication channel. max The system is governed by s(t + 1) = s(t) + R(t) — 7, where t denotes time, s(t) is the buffer occupancy, R(t) is the variable output bit rate of the encoder, and 7 is the channel bit rate. In our case, the rate control algorithm will be performed after coding one frame. Then, assuming a linear feedback control model and a desired buffer occupancy of s* = the Lagrangian parameter A(t) at time t can be obtained by the recursion formula [43] A(<) = A ( i - l ) ^ . . (5.48) Updating A based on the above recursion formula is found to be an efficient solution.' In fact, relative to the more complex algorithms described in [8], the resulting R D control algorithm is very simple, yet its performance is good. 5.3 Experimental Results The following experimental results illustrate the computational complexity and performance of the proposed H.263-based video coding algorithm at very low bit rates. As in f Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 83 37.5 - •• 3736.5-' 36 - •• §35.5-•• tn w Q_ 3534.5-•• 34... 33.5 3332.5'— 3000 Figure 5.21: The P S N R improvement due to the rate-distortion-based M B coding control strategy. Both the M S E and the M A D distortion measures are used. Telenor's H.263 coder, motion estimation and compensation is performed only for the Y luminance component. The estimated motion vector field is subsequently used for the motion compensation of the Cr and Cb chrominance signals. However, only integer-pel accuracy motion estimation is allowed.in our framework. We compare the coder's performance with that of Telenor's H.263 video coding implementation [72] using the QCIF test sequences MISS A M E R I C A and C A R PHONE. For fairness, the options/parameters of both coders are set to the same values. For example, neither implementation employs PB-frames. Moreover, the frame rate is set to 10 frames per second, advanced prediction is used, and the unrestricted motion vector mode is selected. Finally, the M S E is used as our coder's distortion measure in most experiments. The only exception is the first set of experiments, where both the M A D and the M S E are employed. . . • Figure 5.21 compares the average peak-signal-to-noise-ratio (PSNR) results of our Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 84 Bits/frames vs. rate 700 — 600 500 °j 400 To § 300 200 100 0 3000 4000 5000 6000 7000 8000 Rate (bps) 9000 10000 11000 12000 Figure 5.22: Motion vector bit rate, D C T bit rate, and side information for our coder. coder with those of Telenor's H.263 coder for 150 frames of the color test video sequence MISS A M E R I C A at bit rates between 4 and 10 kilobits per second (kbps). The only difference between the two coders is the M B coding control strategy. We select the best mode of operation based on an R D criterion, as described previously. As seen from Figure 5.21, our strategy yields a significant improvement in PSNR, especially at very low bit rates. This is expected because while the motion bit rate in Telenor's H.263 coder is nearly constant (see Figure 5.17), it is reduced at lower bit rates (see Figure 5.22) by using the RD-based mode selection criterion. •. Such a criterion yields a more efficient allocation of bits amongst the coder's components. It is, however, more demanding in terms of computations. Nevertheless, by applying the thresholding techniques discussed earlier, we have obtained an insignificant loss in P S N R while requiring only 20% to 30% of the number of computations needed to determine all six Lagrangian values. Figure 5.21 also shows that, when the M A D is substituted for the M S E , the P S N R gain (in dB) is reduced by 10%-25%. Since the P S N R is inversely proportional to the Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 85 PSNR vs. Rate i i 1 37.5 37 36.5 36 |35.5 tr <n a. 35 34.5 34 33.5 { 33' motion vector estimation/coding) 32 5 3000 4000 5000 6000 7000 Rate (bps) 8000 9000 10000 Figure 5.23: The P S N R loss clue to the new motion vector estimation/coding method. M S E , minimizing the M S E is equivalent to maximizing the P S N R . Thus, using a nonM S E measure such as the M A D generally yields a slightly lower P S N R . The advantage of using the M A D , however, is that the computational complexity is significantly reduced. Figure 5.23 illustrates that our fast integer-pel accuracy motion vector estimation and coding compare very favorably in terms of P S N R with the full-search motion estimation and variable-length coding used in the earlier implementation. Moreover, as demonstrated earlier and confirmed by our simulations, our technique reduces the number of computations required for motion estimation by more than one order of magnitude. This is an important feature because motion estimation usually requires the lion's share of. the computational load. Also illustrated in Figure 5.23 is the fact that the new diamond shape searching window yields an insignificant loss in PSNR. As a practical solution to the joint motion vector estimation/coding and D C T residual coding, our proposed M-search technique (M = 3) improves the P S N R performance, as is shown in Figure 5.24. However, the improvement is not substantial. The slight P S N R Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 86 PSNR vs. Rate 37.5 37 36.5 36 - f 35.5- CC M Q. 35 34.5 34 - i 33.5 -. 33 - 32.5L 300 —1 4000 1 5000 1 6000 1 7000 1 8000 1 9000 I 10000 Rate (bps) Figure 5.24: The P S N R improvement due to the M-search technique. gain suggests that, at very low bit rates, the M S E used as part of the Lagrangian cost measure is a good approximation of the 8 x'8 D C T coder's M S E . Next, we present simulation results that illustrate the effectiveness of the feedback technique used to control.the bit rate through varying the parameter A. For comparison," in Figures 5.25 we also show the bit rate and P S N R profiles of our coder when a constant value for A is used. Notice that the P S N R in Figure 5.25 is nearly constant for all frames, which indicates that it is easier to control the output quality level through the parameter A. This can be beneficial in several applications where a constant quality of service is desired. Figures 5.26 shows the bit rate and P S N R profiles of our coder when a constant bit rate is placed as a constraint on the encoding algorithm. Although a simple model (see Equation (5.48)) is used, the feedback control technique is very effective as shown in Figure 5.26. It shows that' the proposed algorithm keeps the buffer fullness around 10 Kbits through varying the parameter A. However, Figure 5.26 also indicates that it is more difficult to control the bit rate than to control the the output quality via the A. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 87 (a) BIT R A T E (BITS) Bits vs. Frames 900 800 700 600 500 3 i 400 300 200 100 50 100 Frame Index (b) P S N R ( D B ) PSNR vs. Frames 40 39 38 37 ^36 ST rr 35 z CO 0_ . .Miss America.. 34 35.5 (dB) 33 32 31 30 50 • Frame Index 100 150 Figure 5.25: The rate-distortion profile for our coder: (a) Bit rate and (b) P S N R for 150 frames of the test sequence MISS A M E R I C A . The distortion and bit rate is controlled through, constant parameter A. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding (a) BIT R A T E (BITS) Buffer Length vs. Frame ..Miss. America. 1.6 8 kbps Full Buffer length 20 kbits sz 1.2 0.6 0.4 0.2 50 Frame. Index 100 150 (b) P S N R ( D B ) PSNR vs Frame 50 Frame Index 100 150 Figure. 5.26: The rate-distortion profile for our coder: (a) Bit rate and (b) P S N R for 150 frames of the test sequence M I S S A M E R I C A . The buffer state is used to control the bit rate through varying the parameter A. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 89 To reduce the likelihood of overflow/underflow, a sufficiently large buffer should be used. However, if the buffer size is constrained, employing more sophisticated feedback control techniques or incorporating memory into the bit rate control algorithm [8] may provide better solutions. Figures 5.27(a) and 5.27(b) illustrate the overall P S N R improvement of our coder over Telenor's H.263 coder for the two test sequences MISS A M E R I C A and CAR PHONE, respectively. Our coder differs from Telenor's in that it incorporates RD-based control, thresholding, median-based predictive motion estimation, diamond shape searching window, and M-search. Not only does our coder outperform Telenor's, especially at the lower bit rates, but our coder's computational complexity is also lower. Notice that, although diamond shape searching window is employed, we still obtain a significant improvement in P S N R performance. But more important is the fact that our coder's subjective quality is superior to that of Telenor's H.263 coder. For example, when many viewers were presented with 150 frames of the decoded color sequence MISS A M E R I C A , which was encoded using the new coder at 4 kbps, they all reported that the subjective quality is either acceptable or good. When presented with another 150 frames produced by decoding Telenor's H.263 output bit stream, the viewers stated that the quality is both unacceptable and inferior to our coder's subjective quality. Finally, to' illustrate the belmvior of our coder at higher bit rates, the x-axis of Figure 5.27(a) is extended to 16 kbps and that of Figure 5.27(b) is extended to 40 kbps. As the proposed techniques are designed for the very low bit rate range of operation, it should not be surprising that the P S N R gap decreases with increasing bit rate. For example, the RD-based M B mode selection criterion makes better use of the available bits, but the improvement is significant only when the bit budget is limited. Another example is |-pel motion estimation accuracy, which becomes beneficial at higher bit rates, but is not used in this framework. However, most of the proposed techniques can be optimized so that Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 90 (a) MISS A M E R I C A P S N R vs. Rate 1 Miss America 150 frames Y. Cb, Cr x H.263 o Our coder 4000 6000' 8000 10000 Rate (bps) 12000 14000 16000 (b) C A R PHONE P S N R vs. Rate 1 :Car Phone <r 33 :101 frames y,.Cb,Cr... . x H.263 o Our code I 1 1 1 0.5 1 1.5 2 Q i 1 1i 1i 2.5 Rate (bps) 3 3.5 -. -. I 4 Figure 5.27: Overall comparison between our coder and Telenor's coder for the sequences (a) MISS A M E R I C A and (b) C A R P H O N E at bit rates between 3 and 16 kbps, and between 5 and 40 kbps, respectively. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 91 the performance improvement is large over a wide range of bit rates. This is subject for further research. .' 5.4 • Summary We have presented a new H.263-based video coder that provides an RD-based mechanism for alternating amongst H.263's modes of operation, and employs median-based predictive motion estimation and coding. For motion estimation, a diamond-shaped search window is employed, sacrificing only a small loss in compression performance for a significant reduction in computation complexity. Our coder outperforms Telenor's H.263 coder in compression performance and it also offers the user more control over the bit rate, quality, and number of computations. Although we have not addressed many other issues such as content-based scalability and temporal random access, we believe that the proposed techniques provide some efficient solutions to very low bit rate video coding application. Chapter 6 Application: M P E G - 2 Video Coder This chapter introduces a new M P E G - 2 compliant video coder. It is different than conventional coders in that it employs: 1) a computation-distortion (CD) optimized predictive motion estimation algorithm, and 2) a rate-distortion (RD) optimized mode selection algorithm. Experimental results show that the proposed motion estimation and mode selection algorithms yield substantially higher encoding speed and compression , performance than those of the MPEG-2's Test Model T M 5 coder [35]. This chapter consists of five sections. Section 1 will briefly review the M P E G - 2 video coding standard and its coded bit stream structure. In Section 2, we will discuss the implementation of the computation-distortion optimized motion estimation algorithm, described in Chapter 3, in the context of the M P E G - 2 framework. In Section 3, we will discuss the implementation of the rate-distortion (RD) optimized mode selection algorithm, described in Chapter 4, also in the context of the M P E G - 2 framework. Simulation results and summary are presented in the last two sections. 6.1 M P E G - 2 Overview M P E G - 2 [34] has been a very successful audiovisual communication standard. Decoders, that claim conformance to it have already been produced by the millions, and many new M P E G - 2 applications have already been created. M P E G - 2 is aimed at a wide range of applications such as television broadcasting, digital storage media and digital highdefinition T V ( H D T V ) . Like H.261 [73] and MPEG-1 [47], the M P E G - 2 video standard is 92 Chapter 6. Application: MPEG-2 Video Coder 93 4:2:0 HIGH 4:2:0 1920 X 1152 80 Mb/s 100 M b / s l,P,B l,P,B H I G H 1440 MAIN 4:2:2 1 9 2 0 x 1152 4:2:0 4:2:0 1 4 4 0 x 1152 1440 x 1152 1440x1152 60 Mb/s 60 M b / s 80 Mb/s l,P,B l,P,B l,P,B 4:2:0 4:2:2 4:2:0 4:2:0 4:2:0 720 x 576 720 x 576 720 x 576 720 x 576 15 M b / s 15 M b / s 15 Mb/s 20 M b / s l,P l,P,B l,P,B l,P,B LOW • 4:2:0 4:2:0 352 x 200 352 x 200 4 Mb/s 4 Mb/s l,P,B l,P,B MAIN SNR 4:2:0 4:2:2 Level SIMPLE Profile SPATIAL HIGH Figure 6.28: M P E G - 2 Profile and Level based on motion compensated prediction and D C T residual coding. However, M P E G - 2 offers a more efficient means to code interlaced video signals and supports scalable video coding. The readers should be aware that M P E G - 2 is a very complex standard. The description we provide in the following subsections is only an overview of its video part. 6.1.1 Profile and Level As suggested by its title, Information technology - Generic coding of moving pictures and associated audio information, the M P E G specification is intended to' be generic in the sense that it serves a wide range of applications. For example, coded M P E G - 2 bit rates of up to 400 Gigabits/s and picture sizes up to 16,000 by 16,000 pixels can be defined. In order to cope with the great variety of video applications within one coding standard, M P E G - 2 adopts a "toolkit-like" approach; that is, M P E G - 2 is a collection of tools which Chapter 6. Application: MPEG-2 Video Coder 94 satisfies the requirements of specific major applications. The range of coding support provided by M P E G - 2 is divided into profiles and levels. For each profile/level, M P E G - 2 provides the syntax for the coded bit stream and the decoding requirements. A profile is a defined subset of the entire bit stream syntax specified by M P E G - 2 . The M P E G - 2 profiles can be divided into two categories: non-scalable and scalable. The two non-scalable profiles are the Simple and Main profiles. The three scalable profiles are the SNR scalable, Spatially scalable, and High profiles. Within a profile, a level is defined as a set of constraints imposed on the parameters of the bit stream. For each profile, the four levels are the Low, Main, High-1440, and High levels. The constraints of the coding parameters for the various profile/level combinations are shown in Figure 6.28. Among its profile/level combinations, the most widely used combination in the M P E G 2 industry is the'main profile/main level. Therefore, in this chapter, we will concentrate on main profile/main level. Moreover, for simplicity, only I- and P-frames will be used in our simulation. 6.1.2 Fields, Frames and Picture M P E G - 2 supports both progressive and interlaced video (see Figure 6.29). For interlaced video, fields can be coded separately (field pictures) or 2 fields can be interleaved and coded as one picture (frame picture). Both frame and field pictures may be used in a single video sequence. M P E G - 2 defines three types of coded frames: I-frames, P-frames, and B-frames. A coded I-frame consists of an I-frame picture, a pair of I-field pictures, or an I-field picture followed by a P-field picture. A coded P-frame consists of a P-frame picture or a pair of P-field pictures. A coded B-frame consists of a B-frame picture or a pair of B-field pictures. If the first field of an I-frame is an I-field picture, the second field can be a P-field picture, provided that prediction is to be based on the'first I-field Chapter 6. Application: MPEG-2 Video Coder 95 Top Field Bottom Field Figure 6.29: Interlaced video frame consists of two fields. picture. One confusing aspect is the use of the words, "field" and "frame". If "frame" is used as a noun, it can mean either a frame that is coded as a single picture (either a progressive frame, or a frame created by interleaving two fields) or a frame that is coded as two separate field pictures. If "frame" is used as an adjective (as in "frame picture"), it always means that the frame is coded as one picture. In this chapter, the video sequences used in our simulation results are interlaced frame pictures. 6.1.3 Motion Prediction In M P E G - 2 , motion prediction depends mainly on the type of the picture (frame picture or field picture). For frame pictures, there are three motion prediction types: frame prediction, field prediction, and dual prime prediction. For field pictures, there are also three motion prediction types: field prediction, 16x8 motion compensation, and dual prime prediction. The prediction type can be specified at the macroblock layer, which means the encoder has the latitude of selecting a different motion estimation/compensation type for each macroblock. In what follows, we only discuss frame prediction and field Chapter 6. Application: MPEG-2 Video Coder 96 Frame motion vector 16x16 Frame motion vector Field motion vector Top 16x8 T o p Bottom 16x8 Bottom Top field motion vector Bottom field motion vector Figure 6.30: M P E G - 2 Motion vector type prediction in the context of frame pictures. Figure 6.30 shows the two motion prediction types. The shaded area represents the top field pixels and the clear area represents the bottom field pixels. For frame prediction, only one motion vector will be transmitted for a macroblock. For field prediction, the encoder/decoder assumes that the macroblock consists of two fields, and it will deinterlace both the 16x16 macroblock and the reference frame into top field and bottom field as shown in Figure 6.30. For each field, the encoder sends one motion vector and the corresponding side information to indicate the field type. Chapter 6. Application: MPEG-2 Video Coder 97 Frame DCT of the luminance macroblock Field DCT of the luminance macroblock Figure 6.31: M P E G - 2 D C T type 6.1.4 F r a m e D C T and F i e l d D C T To fully exploit the interlaced video properties, two D C T types, frame D C T and field DCT, are defined in M P E G - 2 . For frame D C T , the encoder performs the D C T directly on each of the four 8x8 blocks (see Figure 6.31). For field D C T , the encoder assumes that the macroblock consists of two fields and de-interlaces the 16x16 luminance block (or the two chrominance blocks for 4:2:2 video formats) in advance. The D C T type can be specified at the macroblock layer. However, this option is valid only for frame pictures. Chapter 6. Application: MPEG-2 Video Coder Mode Name Group FrJntra FdJntra Skip FrMC . FdMC FrDCT FdDCT FrMC_FrDCT FrMC-FdDCT FdMC-FrDCT FdMCFdDCT INTRA INTRA INTER INTER INTER+DCT INTER+DCT INTER+DCT INTER+DCT INTER+DCT INTER+DCT 98 Motion Type D C T Type Frame Field Frame " Field Frame Frame Field Field Frame Field Frame Field Frame Filed Table 6.8: The available macroblock mode types for P-frame. 6.1.5 Bit Stream Syntax and Macroblock Layer In M P E G - 2 , the structure of the bit stream syntax depends on the profile adapted by the application. As we will concentrate on only the main profile/main level part of the standard, we will only describe the bit stream syntax for the main profile. The M P E G - 2 bit stream syntax for the main profile is a superset of the M P E G - 1 bit stream syntax. This non-scalable profile is aimed at higher bit rates (4 Mbps to 8 Mbps for C C I R 601 video) and is able to handle interlaced video formats. The bitstream syntax for the main profile consists of six layers: sequence layer; group of picture (GOP) layer; picture layer; slice layer; macroblock layer; and block layer. These layers have been created to satisfy many important requirements such as coding flexibility, ease of implementation, and channel error resilience by way of re-synchronization markers. The range and values of the layer parameters depend mainly on the level. From a coding viewpoint, the most important layer is the macroblock layer. As blocks within the macroblock cannot be encoded differently, the macroblock is the smallest coding unit. Chapter 6. Application: MPEG-2 Video Coder 99 In what follows, we will briefly discuss the various modes in I-frame and P-frame for MPEG-2's main profile. Here, the mode is defined by the coding operation applied on the macroblock. The coding operations are (1) intra/inter, (2) frame/field motion prediction, and (3) frame/field DCT. As specified in the MPEG-2 standard, in an I-frame picture, two intra modes are supported. They are the intra frame-DCT and the intra field-DCT modes. If the quantization step of a present macroblock differs from that of the previous macroblock, the encoder will send 5 bits to indicate the new value of the quantization step. In P-frame pictures, there are three principal modes of operation at the macroblock layer: (1) intra, (2) motion only, and (3) hybrid motion compensated prediction and DCT coding. However, as motion prediction and/or DCT coding can be applied in a frame or field mode of operation, at least eleven modes of operation are possible. Table 6.8 lists eleven modes for P-frame pictures. 6.2 Computation-Distortion Optimized Motion Estimation MPEG-2's main profile/main level is mainly aimed at applications employing a CCIR 601 video format (4:2:0 color sampling format, 720x480 pixels, 30 frames/sec) and supporting a bit rate range from 4 Mbps to 8 Mbps. Unlike in low bit rate coding applications, where the motion vector bit rate represents a significant part of the total bit rate, in MPEG-2's main profile/main level, the lion's share of the bit stream represents the DCT coefficients (see Figure 6.32). Thus, a rate-distortion optimized motion estimation algorithm would not yield significant gains in compression performance. On the other hand, the computational resources required for motion estimation in MPEG-2 are, in most cases, relative very high. In fact, the full search algorithm can take more than 99% of the total number of computations. Therefore, the computation-distortion (CD) optimized motion estimation algorithm described in Chapter 3 can yield substantially Chapter 6. Application: MPEG-2 Video Coder 100 Bit Rate Ratio vs. Bit Rate 0 2 4 6 Bit Rate (Mbits/sec) 8 10 12 Figure 6.32: The bit rate distribution for DCT, motion vector, and side information better computation-performance tradeoffs. The proposed framework restricts the motion estimation to a relatively small search area whose center is a two-dimensional median predicted motion vector [43, 42]. The number of computations is further reduced by incorporating explicitly a computational constraint during the estimation process. Experimental results show that the median prediction as well as the dynamic nature of the implementation lead to a dramatic increase in motion estimation speed for only a small loss in compression performance. A unique feature of our algorithm is that the reconstruction quality of the resulting MPEG-2 coder is proportional to the computer's processing power, providing much needed flexibility in many MPEG-2 software applications. Chapter 6. Application: MPEG-2 Video Coder 101 R Q F M 1 N H G J L K 0 C S E B D A P Previous Frame Present Frame Figure 6.33: The region of support (ROS). 6.2.1 Motion Vector Prediction In our framework, the motion vector search region will have as a center the predicted motion vector. Therefore, the motion vector prediction directly affects the efficiency of our proposed algorithm. We next evaluate the performance of two different motion estimation vector prediction methods: 1) median prediction and 2) least mean square error (LMSE) prediction. The performance of these two prediction methods will be evaluated for different configurations of region of support (ROS) and using the mean squared error (MSE) and mean absolute error (MAE) The measures. ROS is decided by the values of correlation coefficients and of the values of LMSE coefficients values between motion vectors within a sufficiently large 3-D region, and the current motion vector. Such a region includes previously coded motion vectors representing blocks that are close spatially and/or temporally. Figure 6.33 shows a 3-D ROS extending over two frames. Block G of the previous frame is positioned at the same spatial location as the unlabeled block of the current frame (or current block). For the Chapter 6. Application: MPEG-2 Video Coder 102 LMSE predictor, the ROS is decided by the absolute value of the LMSE coefficients. For the median predictor, the ROS is decided by the correlation coefficients and the LMSE coefficients. The simulation results show that the ROS decided by the LMSE coefficients yields higher prediction accuracy than that of the ROS decided the correlation coefficients for the same number of blocks. From the simulation results, the best number of blocks for median estimation is either 4 to 5, and prediction performance will degrade as the number of blocks increases. For LMSE, the larger the number of blocks, the better performance. However, the performance improvement diminishes by employing more than 6 blocks. If we use the mean squared error (MSE) to evaluate the performance of the two predictors, the LMSE predictor is always better than the median predictor for the same ROS. However, for the ROS consisting of 5 blocks the difference is not big. The situation will change when we use the mean absolute error (MAE) as the evaluation measure. As expected, the median predictor is better than the LMSE one for the same cases. From our simulation result, and a channel robustness point of view, we suggest using the median predictor with a 5-block ROS. The suggested ROS consists of the blocks labeled A, B, C , and D in the present frame and block labeled G in in the past frame in Figure 6.33. 6.2.2 Search Path and Search Termination The diamond-shaped search path discussed in Chapter 3 is used in our motion estimation framework (see Figure 6.34). After searching each layer, the following Lagrangian cost Ji will be evaluated J, = SAD, + j8xC,, (6.49) where SAD\ is the minimum sum of the absolute distortion (SAD) within /th layer, and C\ is the total number of computations performed so far after searching the /th layer. 103 Chapter 6. Application: MPEG-2 Video Coder V Most likely motion vector R Search region o o o Layer 0, Jo ^ o o o Layer 1, J l \ 0 o o Layer 2, J2 -e—e—e—e—e—e—eo c t > o o o o o o o Figure 6.34: Diamond-shaped search area. Blocks with the same shade are in the same search layer. The parameter (3 controls the tradeoffs between computation and distortion. In this work, Model I (2-layer hypothesis) method which is described in Chapter 3 will be used to terminate the search process. 6.2.3 Hierarchical Structure A hierarchical motion estimation algorithm that is very efficient is presented next. The efficiency of the algorithm originates from searching low resolution video frames. This is because when the algorithm performs motion search at a low resolution frame, both the size of the search window and the size of the matching block are smaller than those in a higher resolution image, reducing substantially the number of computations. If the downsampled frames (low resolution frames) can still preserve the video structural Chapter 6. Application: MPEG-2 Video Coder 104 Figure 6.35: The search region on full resolution level and half-pel resolution level information, the performance of hierarchical search will be very close to that of full search. The idea of performing motion search in low resolution video frames is adopted into our proposed framework. In our method, a 3-level hierarchical structure is used. The three frame resolutions in the hierarchical structure are (1) half-pel resolution, (2) full resolution, and (3) low resolution. The reference pictures for motion estimation at low resolution levels and full resolution levels are derived from the original versions. However, the reference pictures for motion estimation at half-pel resolution level is the interpolated version of the decoded picture. During hierarchical motion estimation, we start searching a small diamond-shaped area in the low resolution frame, an area whose center is the predicted motion vector. The motion vector found in the low resolution frame is then used as the starting point for motion vector search in the full resolution image. After motion vector search at the full resolution image, the resulting motion vector will be the starting point for half-pel refinement. Note that, the search are allowed in both the full Chapter 6. Application: MPEG-2 Video Coder resolution and the half-pel resolution frame is limited to 9 points 105 shown in Figure 6.35. 6.2.4 Implementation Issue 16x16 Frame Prediction and 16x8 Field Prediction For interlaced video sequences, MPEG-2 defines two types of motion prediction in a Pframe: 16x16 frame prediction and 16x8 field prediction (see Section 6.1.3). Therefore, to exploit the interlacing property, the encoder needs to perform two types of motion estimation and select the "best" motion vector(s) for motion compensation. Because both the present frame and the reference frame consist of two fields (the top field and the bottom field), four types of motion prediction are required for 16x8 field prediction. In what follows, we list the 5 required motion search types: • 16x16 frame prediction (F-F) • 16x8 field prediction for present top field: present top to reference top (T-T), and present top to reference bottom (T-B) • 16x8 field prediction for present bottom field: present bottom to reference top (B-T), and present bottom to reference bottom (B-B) One solution to find the 5 motion vectors is to apply the same predictor (median or LMSE) five times before each search. The second approach is to use the motion vector found in a particular type of motion estimation to predict the other types of motion vectors before their search. For example, the motion vector found in motion search of the 16x16 frame prediction type could be used to predict the motion vectors for the other four 16x8 field prediction types motion estimation. Simulation results show that a Chapter 6. Application: MPEG-2 Video Coder 106 strong correlation amongst those five types of motion vectors exists, and that the second approach is more efficient. Full Block Matching and Partial Block Matching In motion estimation, the encoder searches for the motion vector that gives the minimum matching error (the sum of the absolute difference, or SAD, for example). One simple and straightforward approach is to calculate the SAD of the 256 pixels of the 16x16 macroblock for each vector, and then choose the vector with the minimum SAD. This full block matching method is suitable only for hardware implementations, because a very simple and structured digital circuit can perform this type of operation in a parallel manner. For software implementations, the more efficient partial block matching method is favored. The partial block matching method consists of comparing the partial sum SADN of N pixels to the current minimum SAD. If the value SADN is larger than that of SAD, then the matching process is stopped, and several unnecessary computations can then be avoided. Simulation results show that, on average, the partial block matching method yields 25% gain in number of computations. 6.3 Fast Rate-Distortion Optimized Mode Selection This section presents an MPEG-2 compliant interlaced video encoder that employs an efficient macroblock coding mode selection algorithm. The macroblock coding mode is selected based on a rate-distortion (RD) optimized criterion. However, predictive and statistical modeling techniques are introduced that lead to substantially better computationperformance tradeoffs. The proposed MPEG-2 video encoder is shown experimentally to outperform the MPEG-2 TM5 encoder, expressed in a 15 — 60% reduction in bit rate and/or a 1.5 — 2.5 dB increase in objective quality. Chapter 6. Application: MPEG-2 Video Coder 6.3.1 107 R D Optimized Mode Selection Theoretically, in order to obtain the optimal mode and associated quantization step that minimizes the J(A) in Equation (4.40), the encoder has to exhaustively search all the available combinations of mode M and quantization step Q. For MPEG-2 main profile/main level, the exhaustive searching implies that all 64 (M,Q) combinations in an I picture and all 259 (M, Q) combinations in a P picture need to be considered (see Table 6.8). Exhaustive search is computationally very costly. However, it is the only way that optimal performance gains can be assessed. When all (M, Q) combinations are considered, the compression performance for the three CCIR 601 video sequences, BUS and F O O T B A L L , are shown in Figures 6.36 and 6.37. The figures show that a 1.5 - 2.0 dB improvement can be achieved as compared to the mode selection method used in the MPEG-2 TM5 coder. The standard TM5 coder selects the coding mode by comparing the energy of the predictive residuals. For example, the intra/inter decision is determined by a comparison of the variance of the macroblock pixels against the variance of the predictive residuals; the inter prediction mode is selected to be the inter mode that has the least predictive residual mean square error. However, the RD optimized MPEG-2 video coder requires, on average, more than 30 minutes of CPU time on a 167Mhz Ultra Sparc machine just to encode a group of 12 frames! To reduce substantially the number of computations, a fast finite state machine (FSM) RD optimized mode selection algorithm proposed in Chapter 4 is here developed for MPEG-2 video coding. The fast FSM algorithm is developed based on the analysis results presented in the following parts. Chapter 6. Application: MPEG-2 Video Coder 108 (a) BUS Rate vs. P S N R 39 38 37 Bus, 60 frames Win Size 3 t P, G O P 12: 36 I DC 3 5 | 34 - 33 full-sea ch RD optir lized : TM5 32 31 6 30 8 10 12 Rate (bps) (b) 14 x10 FOOTBALL Rate vs. PSNR 41 I 40 39 38 Football, 24 frames I Win Size 47 36 IP, G O P 12 35 - full-sea ch RD optinlized : TM5 34 33 32 1 6 8 Rate (bps) 10 12 14 x10 Figure 6.36: PSNR (in dB) of the TM5 and the full-search RD optimized algorithm for the BUS and F O O T B A L L sequences. 109 Chapter 6. Application: MPEG-2 Video Coder Rate vs. PSNR 37 • 36 • 35 • 34- g33p tr l32r frames Win Size 15 31 • IP, G O P 12 30- : - full-sea rch RD optinlized : TM5 29 • 28 • 6 8 12 10 Rate (bps) 14 x 10 Figure 6.37: PSNR (in dB) of the TM5 and the full-search RD optimized algorithm for the GARDEN. Dependency Between Coding Mode and Quantization Step In this section the analysis of the statistical dependency between the mode M and the quantization step Q is performed. Figure 6.38 shows the joint distribution functions p(M, Q) of the mode M and the quantization step Q which are derived from the exhaustive search method on the BUS sequence with 8 Mbps. The difference between the joint entropy H(M, Q) and the mutual information / ( M , Q) is used to evaluate the dependency between the M and Q, where H(M) J2- °92(p(M))p(M), = L m H(Q) = Y,- °9MQ))P(Q), 1 9 H(M,Q) = J2- °92(p(M,Q)) (M,Q), L P m,q Chapter 6. Application: MPEG-2 Video Coder 110 Distribution Function of Q and Mode Q Index Figure 6.38: Joint distribution function p(M, Q) of coding mode M and quantization step Q I(M,Q) If the value of I(M,Q) = H(M) + H(Q) - H(M,Q). is much smaller than that of the H(M,Q), (6.50) then very little information of quantization step Q can be offered by the mode M (and vise versa) which implies that small dependencies exist between the mode M and the quantization step Q. Table 6.9 shows the H(M, Q), H(M) + H(Q), and I(M, Q) derived from the simulation results of the three video sequences (BUS, F O O T B A L L , and GARDEN) using the exhaustive search method. In Table 6.9, the values of I(M, Q) are much smaller than the values of H(M,Q), implying very small dependencies between the M and Q. Therefore, the fast coding mode selection method, described in Chapter 4, that separately process the M and Q can be employed. Chapter 6. Application: MPEG-2 Video Coder Name Bus Football Garden 111 H(M,Q) H(M)+H(Q) I(M, Q) 5.6890 5.6442 5.2887 5.7467 5.7321 5.3578 0.0577 0.0879 0.0691 Table 6.9: The entropy, joint entropy, and mutual-information of coding mode M and quantization step Q. Coding Mode In this section, the correlation among the neighboring macroblocks' mode will be analyzed, where the entropy is used as the correlation measure. The purpose of this study is to determine whether or not the neighboring macroblocks' coding mode types offer information about the mode of the present macroblock. Two quantities will be studied in this section: (1) the entropy H(M) of the mode M of the present macroblock, and (2) the conditional entropy H(M \ MN) of the mode M of the present macroblock when the N neighboring macroblocks' modes are known. Here, M . ^ represents the set of the N neighboring macroblocks' modes, and H(M \ A4N) is given by H(M | M ) N = P(M ) £ -log (p(M N M N 2 | M )) N p(M | M ). N (6.51) M Because each macroblock can take on any one of twelve modes, only the nearest four neighboring macroblocks' modes in the present frame are considered in this test. There are 20736 possible configurations for the 4 neighboring macroblocks. The result is shown in Table 6.10. The percentage in Table 6.10 is defined as H(M | M )/H(M). N From the table, we can see that knowledge of the neighboring macroblocks' modes lower the present macroblock coding mode's entropy around 58% to 48%. This implies that an FSM that exploits such high correlation will be very effective. Chapter 6. Application: MPEG-2 Video Coder Name H(M) Bus Football Garden 3.0895 3.1575 3.0394 H(M | 112 M) N Percentage 1.8055 1.5245 1.7552 58.44% 48.28% 57.75% Table 6.10: The entropy and conditional entropy coding mode M for interlaced P pictures. Quantization Step The number of possible quantization steps per macroblock in MPEG-2 is 32. Therefore, an FSM method will likely be unnecessarily complex. Fortunately, the distribution of the quantization step Q is unimodal (see Figure 6.38). Thus, a mean predictor is used to predict the "centroid" Q from the quantization steps Q of the neighboring macroblocks. c After the predicted Q is obtained, the search order for the quantization step is then c arranged according to the absolute difference between Q and the Q . For example, if the c predicted quantization step Q is 14, then the search order for the Q will be 14, 13, 15, c 12, 16, 11, 17, etc. Figure 6.39 shows the histogram of the prediction error when a mean predictor is applied. The figure shows that the quantization step Q can be accurately predicted. In order to further reduce the computational cost used in forward/backward quantization of the DCT coefficients, only 3 quantization steps, Q , Q — 2 and Q + 2, c c c are considered. The reason that we chose the (Q — 2, Q , Q + 2) rather than the (Q — 1, c c c c Qc, Qc + 1) is because the changing step for quantization in the later case is too small, which will affect the performance for lower bit rate situation. 6.3.2 Encoding of I-Pictures Figure 6.40 illustrates the intra coding algorithm employed in this work. A finite-state machine (FSM), described later, is used to determine the more likely of the two possible Chapter 6. Application: MPEG-2 Video Coder 113 Histogram of the quantization prediction error 3us, .60. frames -15 -10 JI ru -5 0 5 Quantization prediction error 10 15 Figure 6.39: The histogram of the quantization step step prediction error. modes of operation: frame DCT (FR-DCT) and field DCT (FI-DCT). Suppose, without loss of generality, that the mode FR-DCT has been selected. The value of the quantization step Q is set to the mean values corresponding to the MBs above, left, and above-right. C Then, each of the six 8x8 blocks in a macroblock is transformed by an integer DCT. The resulting coefficients are quantized and entropy coded as specified in the standard. Given a Lagrangian parameter A, the Lagrangian J\ = D + \R can be evaluated by the MB's average sum of squared error for D and the MB's average number of bits for R. After the three Lagrangian J\ of the three quantization steps, Q , Q — 2 and Q + 2, C C C are considered, the minimum value of the three Lagrangian J\s will be compared to a threshold JR (described in Section 4.4.2). If J\ < JR, then intra coding of the current MB is terminated. Otherwise, the value Q* representing the minimum Lagrangian is Chapter 6. Application: MPEG-2 Video Coder 114 Q X MSE STOP & FRAME DCT Rate CONTROL MSE STOP CONTROL FIELD & DCT Rate CONTROL Figure 6.40: M B mode selection: I-pictures. saved, and the next mode FI-DCT is tested. Statistical analysis results show that, even when all 32 possible values of quantization step are considered, the distribution of the prediction error of quantization step is both symmetric and very narrow (see Figure 6.39). In fact, a member of the triple"(<5 — 2, Q , Q + 2) is selected more than 90% of the time. c c c The procedure required to test the F I - D C T mode is identical to the one described above, except for the following two modifications. First, the luminance Y - M B is pre-arranged by odd/even selection before D C T coding. Second, only the value Q* is considered, as it is shown experimentally to be an accurate estimate. Assuming that the exit condition has still not been satisfied, the ( M , Q) pair of values yielding the smallest J\ is selected, and intra coding of the current M B is terminated. 6.3.3 Encoding of P-Pictures Inter coding of P-pictures depends on either an I- or a P-picture. In inter coding, there are three principal modes of operation: (1) motion only, (2) hybrid motion-compensated prediction and D C T , and (3) intra D C T . However, as motion estimation/compensation and/or D C T coding can be applied to frames or fields, at least nine non-intra modes of Chapter 6. • Application: MPEG-2 Video Coder 115 operation are possible (see Table 6.8). The block diagram shown in Figure 6.41 illustrates the proposed M B coding mode selection algorithm used during the inter coding of P-pictures. As in intra coding, considering all modes of operation is neither practical nor required. The modes are investigated in a specific order determined by the local statistical behavior of the principal modes of operation. More specifically, the F S M is first used to order the principal modes of operation in terms of their probabilities. The principal modes are then considered in that order. The specific modes belonging to the selected principal mode are investigated following a sequence that is specified by the computation requirements. For example, because frame motion estimation requires much less computations than field motion estimation, therefore, it will be performed before field motion estimation. Moreover, because of duplicate operations in frame D C T and field D C T , the FrMC_FdDCT mode will be tested after the FrMC_FrDCT one, and the F d M C F d D C T mode will be tested after the F d M C L F r D C T one. For each candidate mode, the corresponding algorithm is executed, yielding an average distortion and an average bit rate. Then, the resulting Lagrangian J\ is compared to the threshold JR. If J\ < JR, the appropriate bits' are emitted, and the encoding for the current M B is terminated. Otherwise, one or more modes within and outside the subject principal mode are then investigated. As in intra coding, at most three values of quantization step are considered for each principal mode (when applicable). They are tlie mean values Q , Q — 2, and Q + 2. The value Q* representing the minimum Lagrangian C C C for the mode currently being investigated is saved, and is used for other modes within the same principal mode. 6.3.4 The Finite State Machine (FSM) The computation-performance tradeoffs that can be obtained by the above algorithm depend on the characteristics of the input interlaced video sequence and on the efficiency Chapter 6. Application: MPEG-2 Video Coder 116 Motion Only MSE Rate W e MSE FSM Rate Control MSE CONTROL Rate Motion+DCT C Lh>c MSE Rate FIELD MV MSE D Rate MSE Rate - M FRAME DCT Control MSE Rate Intra MSE P RELD DCT Figure 6.41: M B mode selection: P-pictures. Rate Control Chapter 6. Application: MPEG-2 Video Coder 117 of the finite state machine (FSM). In our work, the F S M statistical modeling is employed to estimate the conditional probabilities of the two intra coding modes or the three inter coding principal modes. As many as 25 and 125 states, with associated tables of probabilities, are allowed in intra and inter coding, respectively. During intra coding, a state corresponds to a vector (51,52) whose component value's are obtained as follows. Given the region of support (ROS) shown in Figure 6.42, the weighting coefficients of the M B s that were F R - D C T encoded are added, and the sum is quantized to one of five uniformly distributed levels. The value of 51 is set to the output level. The value of s is obtained similarly using the M B s that were F I - D C T encoded. 2 During inter coding, a state corresponds to a vector (51,52,33) whose components are obtained as described above using the principal modes of the ROS's MBs. Once a state is computed, it is used as a pointer to a table of two (intra coding) or three (inter coding) probabilities. The probabilities of the principal modes (two principal modes for intra coding and three principal modes for inter coding) will be used to decide the testing order of the modes. That is, the mode with highest probability will be tested first, and so on. The probabilities are initially uniformly distributed, and are then updated during the encoding procedure. ( The search for motion vector(s) is performed only if one of the applicable modes in Figure 6.41 is considered. The procedure used in field or frame motion vector search is essentially the same as the algorithm used in Section 6.2.3. The proposed motion search algorithm is very effective because candidate motion vectors away from the predicted motion vector are quickly eliminated from consideration. This behavior is expected since likely motion vectors are usually localized within a small neighborhood of the predicted motion vector and the number of computations increases substantially as more layers are examined. . • Chapter 6. Application: MPEG-2 Video Coder 118 1 2 1 1 3 4 3 2 4 X 1 Figure 6.42: The ROS used in this work. 6.3.5 Threshold JR and Rate Control The most important.practical issue is finding appropriate values for the Lagrangian parameter A and the threshold JR, the two key parameters that control the performancespeed tradeoffs. Although they are generally not independent, we will here optimize them separately. Threshold ' J R The threshold JR is used to provide a. good balance between performance and number of computations. A small/large value of JR increases/decreases both the computational demands and the probability of selecting the optimal mode of operation. In both intra and inter coding, JR is an adaptive threshold whose value is proportional to the weighted average of the Lagrangians corresponding to the neighboring MBs within the.ROS shown in Figure 6.42. The same weights shown in the figure are employed. Rate Control The second important practical issue is finding appropriate values for the Lagrangian parameter A, the key parameters that control mainly the R D tradeoffs. In this work, the parameter A is initially estimated based on computed long-term statistics, and is then Chapter 6. Application: MPEG-2 Video Coder 119 varied adaptively during the encoding process depending on the buffer fullness and/or the history of A. The rate control algorithm used in M P E G - 2 video coder is based on the rate control algorithm used in Section 5.2.3. Without loss of generality, let's assume thatour encoder's output bits are stored in a buffer of size S , and are then transmitted over a fixed-rate max communication channel. The system is governed by s(t + 1) = s(t) -f R(t) — 7, where t denotes time, s(t) is the buffer occupancy, R(t) is the variable output bit rate of the encoder, and 7 is the channel bit rate. Then, assuming a linear feedback control model and a desired buffer occupancy of s* = ^ e Lagrangian parameter A(t) at time t can be obtained by the recursion formula [43] A(t). = X(t - 1) (a + ( l - a ) ^ ) , (6.52) where the parameter a is the decay factor used to control the variation of the parameter A(i). The purpose of introducing the parameter a in Equation (6.52) is to reduce the variation of the A, because simulation results show that better R D performance can be achieved when A is constant within a frame. Updating A based on the above recursion formula is found to be an efficient solution. In fact, relative to the more complex algorithms described in [8], the resulting R D control algorithm is very simple, yet its performance is good. 6.4 Simulation Results This section is divided into two parts. In the first part, we compare our computationdistortion optimized motion estimation algorithm to two fast motion estimation algorithms, the 2D-logarithmic search algorithm and the 4-level hierarchical search algorithm. In the second part, we present comparison results between our fast R D optimized M P E G - 2 encoder and MPEG-2's T M 5 encoder. Chapter 6. Application: MPEG-2 Video Coder 120 The Test Video Format The video sequences used in our simulation experiment conform to the C C I R Rec 601 (CCIR 601) format. The C C I R 601 specifies the image format, acquisition semantics, and parts of the coding for digital "standard" television signals. The video sequence is digitized from an interlaced television signal, thus each frame consists of two fields: the top field and the bottom field, as shown in Figure 6.29. The frame rate is 30 frames per second and the size of each frame is 720x480 pixels. The three color components are one luminance (Y) and two chrominance (Cr and Cb). The color sampling format is 4:2:0, which means the Cr and Cb components down-sampled in both vertical and horizontal direction. Each pixel has 8 bits precision. Thus, the raw bit rate for the video is 720 x 480 x 30 x 1.5 x 8 = 124.416 Mbits/sec. Three video sequences, F O O T B A L L , BUS, and G A R D E N , are are selected based on their representative video characteristics. Motion of the F O O T B A L L sequence is very active and complex, with moderate spatial details. The G A R D E N sequence features significant spatial variations, but with very simple motion. The BUS sequence shows a lot of detail, with relative high motion activity. We show snap shots of the above three sequences in Figure 6.43. 6.4.1 C D Optimized Motion Estimation In the following simulations, the three video sequences shown in Figure 6.43 (BUS, F O O T BALL, and GARDEN) are used. The frame distance between I pictures (the G O P param- eter) is 12 frames, and the target bit rate is 8 Mbits/sec. Forty eight frames are used in the simulations. The motion search window size in the simulations is determined by the motion activity in each sequence. Thus, for each sequence, we choose the smallest window size that just covers the largest motion displacement. A l l coders in simulations Chapter 6. Application: MPEG-2 Video Coder (a) Bus (b) FOOTBALL Figure 6.43: CCIR 601 video sequences. 121 Chapter 6. Application: MPEG-2 Video Coder Sequence Window Size P S N R (dB) Computations per . M B Bus 23 34.89 551510 122 Football 47 37.83 2184500 Garden 15 33.79 258070 Table 6.11: Required number of computation in motion estimation per macroblock that use full search algorithm and partial block matching method. are developed based on the M P E G - 2 T M 5 video coder, and only the motion estimation algorithm in the M P E G - 2 T M 5 coder is modified. 1 The comparison is based on the peak signal to noise ratio (PSNR) of the reconstructed video and the computation required for motion search under the same bit rate. The computation unit used in the comparisons consists of a subtraction, an absolute value and an accumulation. For comparison purposes, Table 6.11 shows the coding results of full search algorithm using partial block matching method. 2D-logarithmic Search vs. C D Optimized Diamond-Shaped Layer Search In this part, we compare our C D optimized diamond-shaped layer search algorithm with the 2D-logarithmic search algorithm. In our algorithm, Model I (2-layer hypothesis) method is used to terminate the search process, and the median predictor with a 5macroblock ROS is used to predict the motion vector. Our algorithm uses the motion vector obtained in 16x16 frame prediction type motion estimation to predict the motion vectors for the other four 16x8 field prediction types motion estimation. The 2Dlogarithmic search algorithm and our algorithm employ the same partial block matching method, the same 2-level hierarchical structure of the M P E G - 2 T M 5 coder, and the same motion search method in half-pel resolution frame. Simulation results are shown in Table 6.12. Chapter 6. Application: MPEG-2 Video Coder Search Type Window Size PSNR(dB) Computations per M B Bus 2D-Log Ours 23 23 33.59 34.69 24312 24856 123 Football 2D-Log Ours 47 47 36.97 37.42 30307 27564 Garden 2D-Log Ours 15 15 • 32.75 33.61 21475 13414 Table 6.12: Performance comparison between the 2D-logarithmic motion search and our CD optimized diamond-shaped layer search. From Table 6.12, we can see that our algorithm is more efficient than the 2Dlogarithmic search algorithm. For BUS and especially G A R D E N sequence, the difference is quite significant. This is because, in the G A R D E N sequence, most of the image area has a lot of texture (for example, the flower bed), and the 2D-logarithmic search algorithm may very easily get trapped into a local minimum. For the F O O T B A L L sequence, because the image is blurred, the probability of being trapped in a local minimum is much lower than that for the G A R D E N and BUS sequences for the 2D-logarithmic algorithm, resulting in a smaller difference between the 2D-logarithmic search algorithm and our algorithm. Another explanation is the motion property of the sequences. For sequences with structured and stable motion fields (such as BUS and G A R D E N ) , the median predictor used in our algorithm achieves high motion vector prediction accuracy. However, the motion field in F O O T B A L L sequence is very complex, which makes the predicted motion vectors much less reliable. 4-level Hierarchical Search vs. 3-level Hierarchical C D Optimized DiamondShaped Layer Search In this part, we will compare the 4-level hierarchical search algorithm with our 3-level hierarchical C D optimized diamond-shaped layer search algorithm. The 4-level hierarchical search will generate two other lower resolution frames for motion search, and performs Chapter 6. Application: MPEG-2 Video Coder .124 PSNR vs. Computation 1 38 37.9 37.8 Football, 48;frames Search Window: - 4 7 to 47 37.7 _37.6 m rr 37.5 z CO CL 37.4 37.3 Full search with full match (37.83 dB): 4113920 opsi 37.2 Full search with partial match (37.83 dB): 2184527 ops 37.1 37 0.4 0.6 0.8 1 1.2 1.4 Computations/MB (# of operations) 1.6 1.8 x10 4 Figure 6.44: P S N R performance as a function of computation for an M P E G - 2 coder with our 3-level hierarchical C D optimized diamond-shaped layer search a full search in the lowest resolution frame. For other three resolution frames, the 4levelhierarchical search only searches the 9-point region centered on the'motion vector obtained in the lower resolution framed In our algorithm, only one low resolution frame is created, and the search regions in the full and half-pel resolution frames are limited to a 9-point region. As described in Section 6.2.3, the proposed C D optimized search algorithm using Mode I (2-layer hypothesis) termination method is employed in all the three resolution frames. Simulation results are presented in Table 6.13. From Table 6.13, we can see that our algorithm is better than the 4-level hierarchical search algorithm. For sequences with active motion, such as F O O T B A L L , our method is much more efficient. Besides, our algorithm requires less memory and is substantially less complex. This is because only one low resolution image is required. Figure 6.44 shows Chapter 6. Application: MPEG-2 Video Coder Search Type Window Size P S N R (dB) Computations per M B Bus Hier Ours 23 23 34.55 34.60 15168 8765 125 Football Hier Ours 47 47 37.39 37.38 23157 9771 Garden Hier Ours 15 .15 33.55 33.72 13734 7802 Table 6.13: Performance comparison between the 4-level hierarchical motion search and our 3-level hierarchical CD optimized diamond-shaped layer search. the resulting M P E G - 2 video coder with different values (different computation levels) for the parameter f3. 6.4.2 R D Optimized M P E G - 2 Video Coder Figure 6.45 shows the P S N R performance of the. M P E G - 2 T M 5 coder, a full-search R D optimized M P E G - 2 coder and efficient R D optimized M P E G - 2 coder for 60 frames of the BUS sequence and 24 frames of the F O O T B A L L sequences. The full-search R D optimized M P E G - 2 coder considers all intra/inter modes of operation and all possible values of quantization step, but also employs proposed C D optimized motion estimation algorithm. The efficient R D optimized M P E G - 2 coder employs the FSM-based mode selection algorithm as well as the C D optimized motion search algorithm. As shown in Figure 6.45, the overall improvement in P S N R •performance of the full-search R D optimized coder is substantial, and may justify the relatively large number of required computations. Using such an encoder, we require between 65% and 85% of the bit rate required by the T M 5 encoder for the same level of P S N R performance. Although the fullsearch R D optimized coder maintains a 3 : 1 computational advantage to the T M 5 coder, it still takes 2.1 C P U minutes of a 167 Mhz Ultra Sparc machine to encode one video frame of F O O T B A L L sequence. On the other hand, the proposed efficient R D optimized coder only requires 5 - 6 C P U seconds to encode one video frame of F O O T B A L L sequence Chapter 6. Application: MPEG-2 Video Coder a 126 BUS 39r 38- 37- 36 - Bus, .1/ P,.W.31 35 - m / z03 Ct 34 - yf / CL : MPE3-2.TM5.CQd6 r / / / / 33 - . -. Full- search RD Op timized / /'/ 32 - - Effic ent RD Optim zed // 31 - 30 - 29 - 10 Rate (bps) (b) 12 x 10 x 10 FOOTBALL 42 r 41 - 40- 39 - /' FOQtbrULI/P..W 4 7 ^y / _38- / m 2. / rr 3 7 / / > - 2 03 Q. - / : / / / / / / / 36 - / / -. Full-search RD Optimized . - Efficient RD Optim zed • / / / / J I : / 35 - 34- 33 - Rate (bps) 10 32 - Figure 6.45: P S N R as a function of bit rate. Chapter 6. Application: MPEG 2 : Video Coder 127 (a) Buffer fullness 12 Buffer vs. Frame X 10 Optm Buffer: 355360 bits 10 Bus 60 frame Buffer Size 8C)*163854bits 36.12 dB, 7.9(3 Mbps Decay Factory.25 10 30 Frame 20 40 50 60 40 50 60 (b) A variation Lambda vs. Frame 140 120 ous ou irame Buffer Size 8C 100 80 "163854 bits 36,12 dB,-7.9f Decay Factor 0.25 60 40 h 20 10 20 30 Frame Figure 6.46: Buffer fullness and A variation. Chapter 6. Application: MPEG-2 Video Coder 128 (a) P S N R PSNR vs. Frame 40 Bus 60 frame 39 Buffer Size 8C *163854 bits 36.12 dB, 7.9 3 Mbps 38 Q. 36 35 34 10 20 30 Frame 40 50 60 50 60 (b) Bit rate Bits vs. Frame x 10 Bus 60 frame Buffer Size 8C "163854 bits 36.12 dB, 7:9 3 Mbps Decay Factor 0.25 10 20 30 Frame 40 Figure 6.47: P S N R and bit rate as a function of time. Chapter 6. Application: MPEG-2 Video Coder 129 on the same machine. Figure 6.46 and 6.47 show the simulation results obtained using our rate control algorithm in our efficient R D optimized M P E G - 2 coder for a target bit rate at 8 Mbits/sec. In this simulation, the Lagrangian parameter A is updated at the end of each frame, and the memory decay factor a is set to 0.25. The buffer size used in this simulation is set to 1,310,720 bits, which is smaller than the maximum buffer size 1,802,240 bits for main profile/main level in the M P E G - 2 standard. Simulation results show that, although a simple algorithm is used, the proposed rate control algorithm is very effective, regulating the buffer fullness around a half of the buffer size (655,360 bits). In general, if a bigger buffer size is used, the R D performance will be better. Another parameter that affects RD performance is the decay factor a. More specifically, R D performance will be better when a smaller a is used. However, the buffer fullness variation become larger due to the slow updating of A. 6.5 Summary We presented an M P E G - 2 compliant interlaced video encoder that employs an efficient macroblock mode selection method, a simple motion vector prediction technique, and a computation-distortion optimized motion vector search algorithm. The macroblock mode is selected based on a rate-distortion optimized criterion. However, predictive and statistical modeling techniques are introduced that lead to substantially better computationperformance tradeoffs. The proposed M P E G - 2 video encoder is shown experimentally to outperform the M P E G - 2 T M 5 encoder, expressed in a 15 — 60% reduction in bit rate or a 1.5 — 2.5 dB increase in objective quality. Besides its compression advantage, our encoder requires less computations than required by the T M 5 . I Chapter 7 Conclusion and Future Study 7.1 Contributions of the Thesis Video coding is a decision making process where the encoder determines the values of the coded parameters and appends the corresponding bits to the bit stream. For D C T based video coding, three types of data, the.motion vectors, the coding modes, and assqciated parameters, are determined by the encoder on a macroblock-by-macroblock basis. Undoubtedly, these parameters need to be optimized with respect to rate-distortion performance and computational demands. The objective of this thesis is to propose video coding algorithms that can improve the compression performance given specific computational constraints. We achieved such a goal by introducing new motion estimation and mode selection algorithms which provide a good balance among the distortion, bit rate, and computational cost. In the following section, we will list the main accomplishments of the thesis. 1. A new cost function J for motion estimation The first contribution of the thesis is the' introduction of a new cost function, the Lagrangian J, for motion estimation. The introduced cost function J is a weighted sum of the distortion D, motion vector bit rate R, and computational cost C , that is, J = D + \R + (3C. The two parameters, A and /3, control the influence of the motion vector bit rate R and computational cost C during the motion estimation process. One 130 Chapter 7. Conclusion and Future Study 131 of the advantages of using the proposed cost function J in motion estimation is that the motion vector with the minimum J provides good tradeoffs among the distortion, bit rate, and computational cost. Another advantage is that by carefully choosing the value of the parameters A and /?, the motion estimation algorithm can control the bit rate R and the computational cost C respectively. Depending on the video application, the proposed cost function J as well as the motion estimation algorithm can be simplified to yield rate-distortion (RD) optimized motion estimation and computation-distortion (CD) optimized motion estimation. Experimental results show that for low bit rate video coding applications the R D optimized motion estimation algorithm can control the motion vector bit rate and significantly improve the compression performance. Similarly, for CD optimized motion estimation, the proposed algorithm can also well control the quality and the motion estimation speed. 2. Predictive motion estimation algorithms The second contribution of the thesis is the introduction of predictive,motion estimation algorithms which consist of motion vector prediction, search path, and termination strategies. In this thesis, four well known prediction techniques (mean predictor, median predictor, weighted mean predictor, and statistical mean predictor) used for motion vector prediction are studied and compared. The simulation results show that the median predictor performs best with regards to prediction accuracy and complexity. We tested three types of search path strategies (spiral search path, diamond-shaped search path, and floating center search path) and found that the floating center search and the diamond-shape are both quite efficient. We also proposed and studied two termination algorithms for motion search. The proposed termination strategies with the proposed cost functions and search paths can dynamically shrink or expand the search region. Simulation results show that the Model I (2-layer Hypothesis) algorithm perform better Chapter 7. Conclusion and Future Study 132 than the Model II (3-layer Hypothesis) algorithm. 3. The fast rate-distortion (RD) optimized mode selection algorithms The third contribution of the thesis is the introduction of the fast R D optimized mode selection algorithms. The R D optimized, mode selection algorithm chooses the coding mode and the associated parameters that give the minimum value for the Lagrangian J — D + XR. To solve the huge computational requirement problem of the full search method, fast R D optimized mode selection algorithms are proposed. In this thesis, we propose two fast R D optimized mode selection algorithms, (1) the thresholding method and (2) the .FSM method. The thresholding method is mainly used to to safely eliminate the I N T R A and/or I N T E R coding options from consideration. It is suitable for nonactive video applications such as the heacl-and-shoulder images in video conferencing. The F S M method develops a search order for the different coding modes according to their conditional probabilities which are derived from a finite state machine (FSM). To simplify the F S M method, we separately process the coding modes and their associated parameters (such as the quantization step). During the search a threshold, which is derived from the neighboring macroblocks' Lagrangian J , is used to terminate the search for the coding mode and its associated parameters. The proposed method dramatically reduces the computations required by the full search method, but at the expense of a small degradation in the compression performance. 4. i Implementation of an H.263-based video encoder for very low bit rate video applications The fourth contribution of the thesis is that of using our proposed motion estimation and mode selection algorithms to implement an H.263-based video encoder for very low bit rate applications. The proposed H.263-based video coder overcomes some of the Chapter 7. Conclusion and Future Study 133 drawbacks of the traditional ones (such as the Telenor's T M N 5 H.263 video encoder) by employing (1) a fast full-pixel median-based predictive motion searching technique, (2) a R D optimized motion estimation algorithm, and (3) a R D optimized coding mode selection algorithm. The simulation results show that the proposed video coder gives better compression performance than the T M N 5 H.263 video coder. The simulation results also show that, due to the R D optimized motion estimation, the motion vector bit rate is carefully controlled. In the thesis, we also introduce a rate control mechanism which decides the A value according to the bit rate requirement and the buffer fullness. 5. Implementation of an M P E G - 2 compliant video encoder for interlaced video The fifth contribution of the thesis is the implementation of an M P E G - 2 compliant video encoder for interlaced video. The M P E G - 2 video encoder implemented in the thesis is mainly for the main profile and the main level case. The proposed M P E G - 2 video coder employs (1) a hierarchical computational-distortion (CD) optimized motion estimation algorithm and (2) a fast FSM-based R D optimized mode selection algorithm. Because of the M P E G - 2 high bit rate application and the active motion field of the coded video sequence, the CD optimized motion estimation algorithm is chosen for our proposed video coder. To further improve the computational efficiency, the motion estimation algorithm is implemented in a hierarchical layer structure. The FSM-based mode selection algorithm in our proposed M P E G - 2 video coder is also implemented hi a hierarchical layer structure. The simulation results show that the proposed video encoder outperforms the T M 5 M P E G - 2 video encoder in both speed and performance. Chapter 7. Conclusion and Future Study 7.2 134 Topics for Future Study Even though the proposed video coder presented in this thesis outperforms other existing video coders, the following subjects can further improve the proposed encoder performance and need to be studied further. 1. More sophisticated rate control mechanism In this thesis, we introduce a simple one-pass rate control mechanism which determines the value of the parameter A using the buffer fullness and the bit rate requirement. However, the rate-distortion (RD) characteristics of the image strongly depend on the picture coding type (I, P, B ) . Therefore, our proposed rate control algorithm may exhibit oscillation problems when A is updated only once per frame. One solution to the oscillation problem is to employ a decay factor which reduces the changing magnitude of the new A. However, a larger decay factor will also reduce the rate controllability. Another solution is to employ different updating strategies for different picture coding types. 2. Computation allocation algorithm for motion estimation The proposed computation-distortion (CD) optimized motion estimation algorithm efficiently reduces the number of computations needed for motion search. However, in our proposed video coder, the parameter (3 which controls the tradeoff between the distortion D and the computational cost C in the cost function J — D + (3C remains fixed during the whole coding process. This does not allow the user to pre-determine the overall coding speed of the motion search'since different (3 values are not tried. To overcome this problem, a computation allocation algorithm for motion search is required which can automatically determine the proper value for the (3 for each macroblock (or slice). This will allow us to efficiently allocate the pre-set.limited computation resource. 1 Chapter 7. Conclusion and Future Study 135 3. Fast R D optimized mode selection algorithm for M P E G - 2 B picture The fast R D optimized mode selection algorithm in this thesis is developed for only I and P picture types. Because a B pictures utilizes two reference frames, in general, the available coding modes for a B picture is more than double that of a P picture type. Due to the greater number of available coding modes for a B picture, we believe that larger improvements can be achieved if an R D optimized mode selection algorithm is used. The implementation of a fast algorithm for B pictures similar to the one we developing for the I and P pictures seems quite straight forward. However, because of the greater variety of mode selections, the design of such a fast R D optimized mode selection algorithm should be carefully studied. Bibliography D. Anastassiou. Current status of the M P E G - 4 standardization effort. In SPIE Proc. Visual Communications and Image Processing, volume 2308, pages 16-24, 1994, . R. Arminato, R. Schafer, F. Kitson, and V . Bhaskaran. Linear predictive coding of motion vectors. In Proceedings of the International Picture Coding Symposium, Sept. 1994. R. Arminato, R. Schafer, F. Kitson, and V . Bhaskaran. Linear predictive coding of motion vectors. In Proceedings of the IS&T/SPIE EI'96, Jan. 1996. T. Berger. Rate Distortion Theory. Prentice-Hall, Inc., New Jersey, 1971. V . Bhaskaran and K . Konstantinides. Image and Video Compression Standards: Algorithms and Architecture. Kluwer Academic Publishers, Boston, 1995. M . Bierling. Displacement estimation by hierarchical blockmatching. In SPIE Vol.1001 Visual Communications and Image Processing, pages 942-951, May 1988. C. W . C h e n and T. Huang. Left ventricle motion analysis by hierarchical decomposition. In Proc. IEEE Int. Gonf. Acoust., Speech, and Signal Processing, volume III, pages 273-276, 1992. J. Choi and D. Park. A stable feedback control of the buffer state using the L a grangian multiplier method. IEEE Transactions on Image Processing: Special Issue on Image Sequence Compression, 3(5):546-558, Sept. 1994. P. A . Chou. Application of entropy-constrained vector quantization to waveform coding of images'. Visual Communications and Image Processing IV, SPIE-1199:970978, 1989. P. A . Chou, T. Lookabaugh, and R. M . Gray. Entropy-constrained vector quantization. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP37(l):31-42, Jan. 1989. : W. Chung, F. Kossentini, and M . Smith. A new approach to scalable video coding. In IEEE Data Compression Conference, pages 381-390, Snowbird, U T , U S A , Mar. 1995. 136 Bibliography 137 [12] W . Chung, F . Kossentini, and M . Smith. Rate-distortion constrained statistical motion estimation for video coding. In ICIP95, volume 3, pages 184-187, Washington, D C , Oct. 1995. [13] W . Chung, F. Kossentini, and M . Smith. A n efficient motion estimation technique based on a rate-distortion criterion. In Proc. IEEE Int. .Conf. Acoust., Speech, and Signal Processing, volume 4, pages 1926-1929, Atlanta, G A , U S A , May 1996. [14] T. Cover and J . Thomas. Elements of Information Theory. John Wiley and Sons, Inc, New York, 1991. [15] J. E. Dennis and R. B . Schnabel. Numerical Methods for. Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, New Jersey, 1983. [16] H . Everett. Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Operation Research, 11:399-417, 1963. • [17] R. M . Fano. Transmission of Information. Wiley, New York, 1961. [18] T. R. Fisher and M . Wang. Entropy-constrained trellis-coded quantization. IEEE Trans, on Information Theory, 38(2):415-426, Mar. 1992. [19] R. G. Gallager. Information Theory and Reliable Communication. Sons, New York, 1968. John Wiley & [20] R. G. Gallager. Variations on a theme by Huffman. IEEE Transactions on Information Theory, IT-24(6):668-674, Nov. 1978. [21] A . Gersho and R. M . Gray. Vector Quantization and Signal -Compression. Kluwer Academic Publishers, Boston, 1992. [22] H . Gharavi. Mutilayer subband-based video coding. IEEE Transactions on Communications, 39:1288-1291, 1991. [23] B . Girod. Rate-constrained motion estimation. In SPIE Proc. Visual Communications and Image Processing, volume 2308, pages 1026-1034, 1994. [24] B . Girod, D. J. LeGall, M . I. Sezan, M . Vetterli, and Y . Hiroshi. Special issue on image sequence compression. IEEE Trans, on Image Processing, 3(5), Sept. 1994. [25] B . S. Gottfried and J. Weisman. Introduction to Optimization Theory. PrenticeHall, New Jersey, 1973. [26] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, New Jersey, 1986. - Bibliography 138 [27] D. Hoang, P. Long, and J . Vitter. Efficient cost measures for motion compensation at low bit rates. In IEEE Data Compression Conference, pages 102-111, Snowbird, U T , USA, Apr. 1996. [28] R. J. Holt and A . N . Netravali. Motion from optical flow: Multiplicity of .solutions. Journal on Vision Communications and Image Representation, 4(l):14-24,. 1994. [29] B . Horn and B . Schunck. Determining optical flow. Artificial Intelligence, 17:185, 1981. [30] D. A . Huffman. A method for the construction of minimum redundancy codes. Proceedings of the IRE, 40:1098-1101, 1952. [31] ISO-IEC. M P E G video report draft. JTC1/SC2/WG11, 1991. [32] ISO-IEC. M P E G - 4 proposal package description - revision 3. 1995. • ' JTC1/SC2/WG11, . . . [33] ISO/IEC 11172-2. Information technology - Coding of moving pictures .and associated audio for digital storage media up to about 1.5 Mbit/s - Part 2: Video. I S O / I E C , 1993. [34] ISO/IEC 13818-2—ITU-T Rec. H.262. Generic Coding of Moving Pictures and Associated Audio Information: Video.-ISO/IEC, 1995. [35] ISO/IEC JTC1/SC29/WG11. Test Model 5. M P E G 93/457, Apr. 1993. [36] I T U Telecom. Standardization Sector Study Group 15. Document LBC-95 Working Party 15/1 Question 2/15. Draft Recommendation H.263,. Leidschendam, 7 April 1995. • " ' . i [37] J. Jain and A. Jain. Displacement measurement and its application in interframe image coding. IEEE Trans, on Communications, COM-29(15):1799-1808, Dec. 198l'. [38] M . Kaneko, A . Koike, and Y . Hatori. Three-dimensional motion estimation of head in model-based coding of moving facial images. IEICE Transactions on Communications, J74-B-I(10):789-798, 1991. [39] J. Karush. A simple proof of an inequality of McMillan. Information Theory, 7:118, 1961. . IRE Transactions on [40] F . Kos'sentini, W . Chung, and M . Smith. Conditional entropy-constrained residual V Q with application to image coding. Transactions on Image Processing: Special Issue on Vector Quantization, 5(2):311-320, Feb. 1996. Bibliography 139 [41] F . Kossentini, W . Chung, and M . Smith. Rate-distortion-constrained subband video coding. Submitted to Transactions on Image Processing, Mar. 1996. [42] F . Kossentini and Y . Lee. Computation-constrained fast M P E G - 2 encoding. IEEE Signal Processing Letters, 4:224-226, Aug. 1997. [43] F. Kossentini, Y . Lee, M . Smith, and R. Ward. Predictive RD-constrained. motion estimation for very low bit rate video coding. Special Issue of the IEEE Transactions on Selected Areas in Communications, 15:1752-1763, Dec. 1996. [44] T. L. Kuni. Visual Computing. Springer-Verlag, 1992. [45] G. Langdon. A n introduction to arithmetic coding. IBM J. Res. Dev., 28:135-149, Mar. 1984. [46] G. G. Langdon and J. Rissanen. Compression of black-white images with arithmetic coding. IEEE Transactions on Communications, 29(6):858-867, 1981. [47] D. Le Gall. M P E G : a video compression standard for multimedia applications. Communications of the ACM, 34(4):46-58, Apr. 1991. [48] Y . Lee, F. Kossentini, M . Smith, and R. Ward. Prediction and search techniques for RD-optimized motion estimation in a very low bit rate video coding framework. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, volume 4, pages 2861-2864, Munich, Germany, Apr. 1997." [49] Y . Lee, F . Kossentini, R. Ward, and M . Smith. Towards M P E G - 4 : A n improved H.263-based video coder. Signal Processing: Image Communication: Special Journal ' Issue on MPEG-4, 10:143-158, July 1997. ' [50] Y . Lee, F. Kossentini, and R. K . Ward. Efficient R D optimized macroblock coding mode selection for M P E G - 2 video encoding. In International Conference on Image Processing, volume 5, pages 582-585, Santa Barbara, U S A , 1997. [51] Y . . Lee, F. Kossentini, and R. K . Ward. Efficient M P E G - 2 encoding of interlaced video. C'JECE Special Issue on Visual Computing and Communications, in press. [52] Y . Lee, R. K . Ward, F. Kossentini, and M . J. T. Smith. Very low rate DCT-based video coding using dynamic V Q . In International Conference on Image Processing, pages 669-672, Lauzanne, Switzerland, Sept. 1996. ! [53] W . L i , Y . Q. Zhang, and M . L. Liou. Special Issue on Advances in Image and Video Compression. Proc. of the IEEE, 83(2):135-340, Feb. 1995. Bibliography 140 J. 0 . Limb. Distortion criteria of the human viewer. IEEE Trans., pages S M C 9:778-793, Dec. 1979. T. J . Lynch. Data Compression: Techniques and Applications. ing, Wadsworth, Belmont, Calif., 1985. Lifetime Learn- K . Mase, Y . Watanabe, and Y . Suenaga. A realtime head motion detection system. In Proceeding of the SPIE Conference on Sensing and Reconstruction of 3-D Objects and Scenes, volume 1260, pages 262-269, 1990. H . Morikawa, E . Kondo, and H . Harashima. 3D facial modeling for model-based coding. TEICE Transactions on Communications, J76-B(6):626-633, 1993. M P E G Software Simulation Group. M P E G - 2 encoder / decoder, version 1.0. ISO/IEC DIS 13818-2 codec, May 1995. Y . Nakaya and H . Harashima. Model-based/waveform hybrid coding for lowrate transmission of facial images. IEICE Transactions on Communications, J75B(5):377-384,1992. F. Pereira. M P E G - 4 : a new challenge for the representation of the audio-visual information. In Proceedings of the International Picture Coding Symposium, March 1996. '' • , " x K . Ramchandran and M . Vetterli. Best wavelet packet bases in a rate-distortion sense. IEEE Trans, on Image Processing, 2(2):160—174, April 1993. , C. Reader. M P E G - 4 syntactic descriptive language: a universal interface for exchange of coded audio-visual data. In Proceedings of the International Picture Coding Symposium, March 1996. B. G. Schunk. The image flow constraint. Computer Vision, Graphics and Image Processing, 35(6):20-46, 1986. A. Schuster and A . Katsaggelos. Fast and efficient mode and quantizer selection in the rate-distortion sense for H.263. In SPIE Proc. Visual Communications and Image Processing, volume 2727, pages 784-795, 1996. G. M . Schuster and A . K . Katsaggelos. Rate-distortion based video compression. Kluwer Academic Publishers, Netherlands, 1997. • • < C. E . Shannon. Coding theorems for a discrete source with a fidelity criterion. In IRE National Convention Record, Part 4, pages 142-163, 1959. Also in Information .and Decision Processes, R. E . Machol, E d . New York, N Y : McGraw-Hill, 1960, pp. 93-126. i Bibliography 141 [67] Y . Shoham and A . Gersho. Efficient bit allocation for an arbitrary set of quantizers. IEEE Trans, on Acoustics, Speech, and Signal Processing, 36(9): 1445-1453, Sept. 1988. [68] J . Storer. Data Compression. Computer Science Press, Rockville, Maryland, 1988. [69] H . Sun, W . Kwok, M . Chien, and C. H . J . Ju. M P E G coding performance improvement by jointly optimizing coding mode decisions and rate control. IEEE Trans, on Circuits and Systems for Video Technology, 7:449-458, June 1997. , [70] K . I. T. Koga, A. Hirano, Y . Iijima, and T. Ishiguro. Motion-compensated interframe coding for video conferencing. In Proc. NTC 81, volume 28, pages 239-251, New Orleans, Dec. 1992. [71] D. Taubman and A . Zakhor. Rate and resolution scalable subband coding of video. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, volume 5, pages 493-496, 1994. ' [72] Telenor Research. codec, May 1995. T M N (H.263) encoder/decoder, version 1.4a. TMN (H.263) [73] The International Telegraph and Telephone Consultant Committee. Video codec for audiovisual services at p x 64 kbits/s; Recommendation H.261, 1990. [74] K . - H . Tzou, H . G. Musmann, and K . Aizawa. Special Issue on Very Low Bit Rate Video Coding. IEEE Trans, on Circuits and Systems for Video Technology, 4(3):213-367, June 1994. [75] D. Walker and K . Rao. Improved pel-recursive motion compensation. IEEE Trans, on Communications, pages 1128-1134, Oct. 1984. [76] Q. Wang and R. J. Clarke. Motion-compensated sequence coding using-image pyramids. In Electronic Letters, volume 26, pages 575-576, 1990. i [77] W . J . Welsh, S. , Searby, and J. B . Waite. Model-based image coding. British Telecommunication Journal, 8:94-106, July 1990. [78] T. Wiegand, M . Lightstone, T. Campbell, and S. Mitra. Efficient mode selection for block-based motion compensated video coding. In International Conference on Image Processing, Washington, D C , USA, Oct. 1995. [79] T. Wiegand, M . Lightstone, D. Mukherjee, T. Campbell', and S. Mitra. RateDistortion Optimized Mode Selection for Very Low Bit Rate Video Coding and the Emerging H.263 Standard. IEEE Trans, on Circuits and Systems for Video Technology, pages 182-190, Apr. 1996. ! i i Bibliography 142 [80] J. W . Woods and T. Naveen. Motion compensated multiresolution transmission of high definition video using multistage quantizers. In Proc. IEEE Int. Conf. Acoust:, Speech, and Signal Processing, volume 5, pages 582-585, 1993. [81] S. Zafar, Y . Zhang, and J . Baras. Predictive block-matching motion estimation for T V coding— Part I: Inter-block prediction. IEEE Transactions on Broadcasting, 37(3):97-101, Sept. 1991. [82] Y . Zhang and S. Zafar. Predictive block-matching motion estimation for T V coding— Part II: Inter-frame prediction.- IEEE Transactions on Broadcasting, 37(3):102-105, Sept. 1991.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Rate-computation optimized block based video coding
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Rate-computation optimized block based video coding Lee, Yuen-Wen 1998
pdf
Page Metadata
Item Metadata
Title | Rate-computation optimized block based video coding |
Creator |
Lee, Yuen-Wen |
Date Issued | 1998 |
Description | This dissertation reports on rate-distortion optimized and computation-constrained video coding algorithms which yield effective and efficient block-based video encoders. We first propose a new cost function for motion estimation which can achieve good tradeoffs among motion vector bit rate, compensation distortion, and number of computations. Based on the new cost function, we propose a predictive motion estimation algorithm which can effectively determine the size of the search region according to the statistics of the motion field. The proposed motion estimation algorithm allows control of the motion vector bit rate and the computational cost, simultaneously. We also develop two efficient rate-distortion optimized coding mode selection algorithms, based on thresholding and finite state machine (FSM) methods, respectively. The thresholding method can eliminate consideration of the computationally expensive modes. The FSM method determines the test orders of the modes and associated parameters by exploiting the local statistics of the input video sequence. The proposed two methods can reduce substantially the computation requirements as compared to the full search mode selection method. Based on the proposed motion estimation and mode selection algorithms, we implement an H.263-based video coder and an MPEG-2 compliant video coder for very low bit rate video applications, and high bit rate interlaced video applications, respectively. The resulting video coders achieve excellent tradeoffs among bit rate, quality, and computational complexity. In fact, experimental results show that our video coders outperform the best known video coders in terms of compression performance and encoding speed. |
Extent | 11543736 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-06-02 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065031 |
URI | http://hdl.handle.net/2429/8636 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1998-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1998-271854.pdf [ 11.01MB ]
- Metadata
- JSON: 831-1.0065031.json
- JSON-LD: 831-1.0065031-ld.json
- RDF/XML (Pretty): 831-1.0065031-rdf.xml
- RDF/JSON: 831-1.0065031-rdf.json
- Turtle: 831-1.0065031-turtle.txt
- N-Triples: 831-1.0065031-rdf-ntriples.txt
- Original Record: 831-1.0065031-source.json
- Full Text
- 831-1.0065031-fulltext.txt
- Citation
- 831-1.0065031.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0065031/manifest