Rate-Computation Optimized Block Based Video Coding
Yuen-Wen Lee, 1998
Ph.D. thesis, The University of British Columbia

Full Text

RATE-COMPUTATION OPTIMIZED BLOCK BASED VIDEO CODING

By Yuen-Wen Lee
B.A.Sc. (Electrical Engineering), Chung-Yuen Christian University, 1986
M.A.Sc. (Electrical Engineering), Syracuse University, 1992

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Faculty of Graduate Studies, Department of Electrical and Computer Engineering.

We accept this thesis as conforming to the required standard.

THE UNIVERSITY OF BRITISH COLUMBIA
April 1998
© Yuen-Wen Lee, 1998

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical and Computer Engineering
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5

Abstract

This dissertation reports on rate-distortion optimized and computation-constrained video coding algorithms which yield effective and efficient block-based video encoders. We first propose a new cost function for motion estimation which can achieve good tradeoffs among motion vector bit rate, compensation distortion, and number of computations. Based on the new cost function, we propose a predictive motion estimation algorithm which can effectively determine the size of the search region according to the statistics of the motion field. The proposed motion estimation algorithm allows control of the motion vector bit rate and the computational cost, simultaneously.

We also develop two efficient rate-distortion optimized coding mode selection algorithms, based on thresholding and finite state machine (FSM) methods, respectively. The thresholding method can eliminate consideration of the computationally expensive modes. The FSM method determines the test orders of the modes and associated parameters by exploiting the local statistics of the input video sequence. The two proposed methods can substantially reduce the computation requirements as compared to the full search mode selection method.

Based on the proposed motion estimation and mode selection algorithms, we implement an H.263-based video coder and an MPEG-2 compliant video coder for very low bit rate video applications and high bit rate interlaced video applications, respectively. The resulting video coders achieve excellent tradeoffs among bit rate, quality, and computational complexity. In fact, experimental results show that our video coders outperform the best known video coders in terms of compression performance and encoding speed.

Table of Contents

Abstract
List of Tables
List of Figures
Acknowledgments
1 Introduction
2 Background
  2.1 Information Theory Concepts
    2.1.1 Entropy
    2.1.2 Joint Entropy and Conditional Entropy
    2.1.3 Mutual Information
  2.2 Entropy Coding
    2.2.1 Huffman Coding
    2.2.2 Arithmetic Coding
  2.3 Rate-Distortion Theorem Concepts
  2.4 Hybrid Video Coding
  2.5 Motion Estimation
    2.5.1 Block Matching Algorithms
    2.5.2 Fast Motion Estimation Algorithms
  2.6 Discrete Cosine Transform (DCT)
  2.7 DCT Coefficient Quantization and Coding
  2.8 Video Coding Standard
    2.8.1 The H.263 Video Coding Standard
    2.8.2 The MPEG-2 Video Coding Standard
3 Block-Based Motion Estimation
  3.1 Motivation
  3.2 Motion Estimation with Rate-Computation Constraint
    3.2.1 Rate-Distortion Optimized Motion Estimation
    3.2.2 Computation-Distortion Optimized Motion Estimation
  3.3 Proposed Motion Estimation Algorithm
    3.3.1 Starting Point: Prediction
    3.3.2 Search Path
    3.3.3 Termination Point Algorithm
  3.4 Simulation Results
    3.4.1 Initialization
    3.4.2 Search Path
    3.4.3 Termination Point
  3.5 Summary
4 Rate-Distortion Optimized Mode Selection
  4.1 Multi-mode Video Coding
    4.1.1 Macroblock Coding Mode
    4.1.2 Conventional Mode Selection Algorithms
  4.2 Rate-Distortion Optimized Coding Mode Selection
    4.2.1 Exhaustive Search: Performance Upper Bound
  4.3 Fast Mode Selection Algorithm I: Thresholding Method
  4.4 Fast Mode Selection Algorithm II: Finite State Machine Method
    4.4.1 Finite-State Machine (FSM)
    4.4.2 The Reference Lagrangian JR
    4.4.3 Implementation Issue
  4.5 Summary
5 Application: H.263 Based Low Bit Rate Video Coding
  5.1 Motivation and Background
    5.1.1 The H.263 Standard: Overview
    5.1.2 The H.263 Bit Stream Structure
    5.1.3 H.263 Macroblock Layer
  5.2 Proposed Coder
    5.2.1 I-picture MB Coding Strategy
    5.2.2 P-picture MB Coding Method
    5.2.3 Rate Control
  5.3 Experimental Results
  5.4 Summary
6 Application: MPEG-2 Video Coder
  6.1 MPEG-2 Overview
    6.1.1 Profile and Level
    6.1.2 Fields, Frames and Picture
    6.1.3 Motion Prediction
    6.1.4 Frame DCT and Field DCT
    6.1.5 Bit Stream Syntax and Macroblock Layer
  6.2 Computation-Distortion Optimized Motion Estimation
    6.2.1 Motion Vector Prediction
    6.2.2 Search Path and Search Termination
    6.2.3 Hierarchical Structure
    6.2.4 Implementation Issue
  6.3 Fast Rate-Distortion Optimized Mode Selection
    6.3.1 RD Optimized Mode Selection
    6.3.2 Encoding of I-Pictures
    6.3.3 Encoding of P-Pictures
    6.3.4 The Finite State Machine (FSM)
    6.3.5 Threshold JR and Rate Control
  6.4 Simulation Results
    6.4.1 CD Optimized Motion Estimation
    6.4.2 RD Optimized MPEG-2 Video Coder
  6.5 Summary
7 Conclusion and Future Study
  7.1 Contributions of the Thesis
  7.2 Topics for Future Study
Bibliography

List of Tables

2.1 The computational cost for different motion estimation algorithms
3.2 Entropy of motion vector x (Miss America sequence)
3.3 Entropy of motion vector x (Car Phone sequence)
3.4 Entropy of motion vector x for FSM (Miss America sequence)
3.5 Entropy of motion vector x for FSM (Car Phone sequence)
5.6 H.263 source picture formats
5.7 Coded symbols for RM and RA for various coding modes
6.8 The available macroblock mode types for P-frame
6.9 The entropy, joint entropy, and mutual information for (M, Q)
6.10 The entropy and conditional entropy for coding mode M
6.11 Required number of computations for full search motion estimation
6.12 Computational performance for the 2D-log search and our search
6.13 Computational performance for the hierarchical search and our search

List of Figures

2.1 Generic hybrid video coder and decoder
2.2 System diagram for block and DCT based video coder
2.3 Full search motion estimation process
2.4 The 2D logarithmic search
2.5 The 3-step search
2.6 The parallel hierarchical one-dimensional search (PHODS)
2.7 The hierarchical search
3.8 Profiles of SAD and Lagrangian J = D + β·C
3.9 A spiral search path
3.10 A diamond-shaped search path
3.11 Floating center search path
3.12 Correlation coefficients between motion vectors
3.13 PSNR vs. computations for different search paths
3.14 PSNR vs. computations for different termination methods
4.15 Compression performance for the TM5 and the RD optimized coder
4.16 The J*, coding mode, and quantization step for a frame from Football
5.17 Bit rates of motion vector, DCT, and side information for TMN5 coder
5.18 H.263 video sequence layers
5.19 The regions of support used by H.263 and our coder
5.20 The search area used during motion estimation
5.21 The PSNR improvement due to the RD optimized coding mode strategy
5.22 Bit rate of motion vector, DCT, and side information for our coder
5.23 The PSNR loss due to the new motion vector estimation/coding method
5.24 The PSNR improvement due to the M-search technique
5.25 The rate-distortion profile for our coder (constant λ)
5.26 The rate-distortion profile for our coder (varying λ)
5.27 Overall comparison between our coder and Telenor's coder
6.28 MPEG-2 Profile and Level
6.29 Interlaced video frame consists of two fields
6.30 MPEG-2 motion vector type
6.31 MPEG-2 DCT type
6.32 The bit rate distribution for DCT, motion vector, and side information
6.33 The region of support (ROS)
6.34 Diamond-shaped search area used in motion estimation
6.35 The search region on full resolution level and half-pel resolution level
6.36 PSNR of the TM5 coder and the full-search RD optimized coder
6.37 PSNR of the TM5 coder and the full-search RD optimized coder
6.38 Joint distribution function p(M, Q)
6.39 The histogram of the quantization step prediction error
6.40 MB mode selection: I-pictures
6.41 MB mode selection: P-pictures
6.42 The ROS used in this work
6.43 CCIR 601 video sequences
6.44 PSNR performance as a function of computation for our algorithm
6.45 PSNR as a function of bit rate
6.46 Buffer fullness and λ variation
6.47 PSNR and bit rate as a function of time

Acknowledgments

First, my greatest debt of gratitude goes to my supervisors, Professor Rabab K. Ward and Professor Faouzi Kossentini, who have introduced and guided me into this exciting field of research, and provided sound advice and invaluable critical feedback at every stage of this project. Without their support, this thesis could not have been written.

Second, I want to thank my parents for their encouragement and support. I also want to thank my wife Nancy for her patience and support in the past years.

I also want to thank Michael Gallant and Ian Caven in the Image Lab for their valuable suggestions and discussions. I also wish to acknowledge Julong Du, Xiaoli Li and other colleagues in the Image group and the SAR group for their friendship and encouragement during the long years.
I also want to express my special thanks to Mei X u for her assistance in simulations and implementations in the MPEG-2 project. xi Chapter 1 Introduction Storage and transmission of data is becoming more important in modern society. Data compression is an important tool for reducing the cost of storing and transmitting data. This is especially true for the case of video signals, whose tremendous amount of data require large spaces and wide bandwidths for storage and transmission, respectively. For example, the bit rate of a 30 frames/sec CCIR-601 digital video sequence is 248.83 Mbits/sec, and storing just a 5-minute clip of such a video sequence requires a 4.66-gigabyte storage space. Without video compression, the demands of digital video storage and transmission cannot be met by present or near-future communication infrastructures. Recently, several successful and effective video coding standards have emerged, namely M P E G - 1 , H.261, MPEG-2 , and H.263 [31, 73, 34, 36]. One common property of such coding standards is that they all employ block-based motion compensation and D C T coding algorithms for removing spatio-temporal redundancies in the video sequence. For simplicity, in the following, we will call such video encoders block-based encoders. Dur-ing the encoding process, generally, the block-based encoders determine and transmit the type of coding algorithm (the coding mode) and associated parameters (motion vector, quantization step, etc.) for each macroblock to the decoder. The goal of this thesis is to develop block-based video encoders that are rate-dist'ortion-computation optimized. More specifically, the proposed video encoders select the coding mode, motion vector(s), and associated parameters (such as quantization-step sizes, etc.) that yield the minimum Lagrangian value J = D + X R, where D and R are the associated 1 Chapter 1. Introduction 2 distortion and bit rate. For simplicity, we assume that no dependencies exist between macroblocks. Therefore, the constrained optimization is performed independently for each macroblock. Theoretically, in order to achieve optimal RD performance, all combinations of var-ious coding modes, motion vectors, and quantization step sizes should be considered to find the minimum Lagrangian J. However, this approach is clearly impractical. For ex-ample, for a H.263 video standard [36] compliant video coder, to find the minimum value of J one would need to consider 19851 possible different cases for one 16x16 macroblock. To simplify this problem, we first decouple motion vector estimation from coding. The motion vector will be determined using our proposed motion estimation algorithm, and the coding mode and the quantization step size will be determined using our proposed mode selection algorithm. This decoupling reduce greatly the video encoding computa-tional complexity. Unfortunately, however, the computation demands of the resulting RD optimized video coder can still be very large. This is clue mainly to two computation intensive components: motion estimation and coding mode selection. To overcome the motion estimation problem, this thesis proposes a new predictive motion estimation algorithm that will yield good tradeoffs .among motion compensation distortion, motion vector bit rate, and computational cost. First, a new cost function for motion estimation which is a weighted sum of the three quantities in a Lagrangian formulation is introduced. 
The advantage of the proposed cost function is that via the parameters in the cost function, the motion vector bit rate and the computational cost can be carefully controlled. Second, the search area is allowed to expand or contract as a function of the local statistics of the motion field. Simulation results show the pro-posed predictive motion estimation algorithm can lead to a dramatic increase in motion estimation speed for only a small loss in compression performance. A unique feature of our algorithm is that the reconstruction quality of the associated video encoder is Chapter 1. Introduction 3 proportional to the processing power. To overcome the mode selection problem, we propose an efficient rate-distortion (RD) optimized coding mode selection method. The new method chooses the coding mode and quantization step size that give the minimum Lagrangian value J. A direct result of the above method is that significantly better RD performance can be obtained. Theoreti-cally, in order to achieve the optimum RD performance, all combinations of the coding mode and the quantization step size should be exhaustively searched. However, such an approach requires a considerable amount of computations. To solve this problem, thresholding and finite state machine (FSM) algorithms are proposed. Thresholding al-lows the encoder to safely eliminate the computationally expensive coding options from consideration. In F S M case, the search order of the coding mode and quantization step size are determined by the conditional probabilities. The proposed motion estimation and the coding mode selection algorithms have been implemented-in the context of an H.263-based video encoder for operation at very low bit rates (lower than 16 kbps), and in the context of an MPEG-2 compliant video encoder for operation at high bit rates. The proposed H.263-based video encoder employs: (1) a'fast median-based predictive motion searching technique, (2) an RD optimized motion estimating algorithm, and (3) a thresholding-based RD optimized coding mode selection method. The above algorithm take into account the fact that, at low bit rates, the motion vector bit rate is a significant part of the total bit rate. The proposed MPEG-2 video encoder employs: (1) a computation-distortion (CD) optimized predictive motion estimation algorithm and (2) a FSM-based rate-distortion (RD) optimized mode selection method. The proposed CD optimized motion estimation algorithm and the FSM-based RD optimized mode selection method are intended to reduce the huge computational requirements of motion estimation and improves the compression performance in M P E G -2 video coding, respectively. Simulation results indicate that the proposed video encoders Chapter 1. Introduction •4 outperform existing ones both in terms of compression performance and computational complexity. Another important advantage of our algorithms is that the bit rate, quality, and number of computations can be simultaneously controlled. • In Chapter 2, we begin with a brief review of some fundamental concepts that will serve as background material throughput the thesis. Chapter 3 will introduce the new cost function JV for motion estimation we mentioned above.' In the same chapter, we also address motion vector prediction and search techniques. The RD optimized mode selection algorithm as well as corresponding efficient implementations will then be dis-cussed in Chapter 4. 
In Chapter 5, the proposed motion estimation algorithm and mode selection algorithm will be implemented in the context of an H.263-based video encoder that is designed for very low bit rate applications. In Chapter 6, we will implement an MPEG-2 compliant video encoder using the two algorithms discussed in Chapters 3 and 4. In Chapter 7, conclusions and future research directions are discussed.

Chapter 2  Background

In this chapter, the fundamental concepts and some of the most prominent techniques for video coding are reviewed and presented. Our discussion is only introductory, and the interested reader may wish to consult the given references for more detailed discussions. In Section 2.1, relevant concepts from information theory are reviewed. In Section 2.2, two well-known lossless coding techniques, Huffman coding and arithmetic coding, are described. Next, in Section 2.3, rate-distortion fundamentals needed to explain our lossy coding techniques are addressed. In Section 2.4, generic hybrid video coding and DCT-based block motion compensated video coding are discussed. Motion estimation is discussed in Section 2.5. In the latter section, we focus mainly on block matching algorithms, and we limit our discussion to the most popular fast motion estimation algorithms. In Section 2.6, block transforms, and especially the DCT, are addressed. DCT coefficient quantization and coding are described in Section 2.7. We will briefly introduce the two popular video coding standards, H.263 and MPEG-2, in the last section.

2.1 Information Theory Concepts

The most important contribution to information theory was made by Shannon, who defined the information measure H(X), called the entropy, of a random source X. This measure gives the lowest number of bits required to represent the symbol X losslessly. Based on the entropy, other information measures, such as the conditional entropy and the mutual information, can be derived.

2.1.1 Entropy

The entropy, which is a measure of the uncertainty of a random variable, will first be introduced. Let X be a discrete random variable with alphabet \mathcal{X} and probability mass function p(x) = Pr{X = x}, x \in \mathcal{X}. The entropy H(X) of a discrete random variable X is defined by

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x),    (2.1)

where \log_2 is the base-2 logarithm and the entropy is expressed in bits. Note that the entropy is a function of the distribution of X. It does not depend on the actual values taken by the random variable X, but only on the probabilities. From the viewpoint of source coding, the entropy H(X) also represents the average number of bits required to represent the random variable X.

2.1.2 Joint Entropy and Conditional Entropy

In this section, the definition of entropy of a single random variable, presented in the previous section, is extended to a pair of random variables. The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 p(x, y),    (2.2)

where \mathcal{X} and \mathcal{Y} are the alphabets of X and Y, respectively. H(X, Y) then represents the average number of bits required to represent the pair of random variables (X, Y). We also define the conditional entropy of a random variable given another as the expected value of the entropies of the conditional distributions, averaged over the conditioning random variable. The conditional entropy H(Y|X) is defined as

H(Y|X) = \sum_{x \in \mathcal{X}} p(x) H(Y|X = x)
        = -\sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log_2 p(y|x)
        = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 p(y|x),    (2.3)

where p(y|x) represents the conditional probability of y given x.

The naturalness of the definitions of the joint entropy and conditional entropy is exhibited by the fact that the joint entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other, that is,

H(X, Y) = H(X) + H(Y|X).    (2.4)

The results given above can be generalized to k random variables, where k > 2. Therefore, the joint entropy H(X_1, X_2, ..., X_k) of k random variables {X_1, X_2, ..., X_k} can be expressed as

H(X_1, X_2, ..., X_k) = \sum_{i=2}^{k} H(X_i | X_1, X_2, ..., X_{i-1}) + H(X_1).    (2.5)

2.1.3 Mutual Information

The mutual information is a measure of the amount of information that one random variable contains about another random variable. It is the reduction in the uncertainty of one random variable due to the knowledge of the other. Consider two random variables X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y), i.e.,

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}.    (2.6)

The mutual information I(X; Y) is the reduction in the uncertainty of X due to the knowledge of Y (or vice versa), that is,

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).    (2.7)

It is easy to derive the following relationship between entropy and mutual information from Equations (2.1), (2.2) and (2.6):

I(X; Y) = H(X) + H(Y) - H(X, Y).    (2.8)

From the same equations, the following inequalities can also be easily proved [17, 19, 14]:

I(X; Y) \geq 0,    (2.9)
I(X; Y) \leq H(X),    (2.10)
I(X; Y) \leq H(Y),    (2.11)
H(X, Y) \leq H(X) + H(Y).    (2.12)
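The quantities above are easy to verify numerically. The following short Python sketch is not part of the thesis; the joint distribution is invented purely for illustration. It computes the entropies and the mutual information of a pair of binary sources and checks Equations (2.4) and (2.8)-(2.11).

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a probability vector, Equation (2.1)."""
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return -np.sum(p * np.log2(p))

# Joint pmf p(x, y) of two binary sources (values chosen only for illustration).
p_xy = np.array([[0.40, 0.10],
                 [0.15, 0.35]])

p_x = p_xy.sum(axis=1)                # marginal p(x)
p_y = p_xy.sum(axis=0)                # marginal p(y)

H_X  = entropy(p_x)
H_Y  = entropy(p_y)
H_XY = entropy(p_xy.ravel())          # joint entropy, Equation (2.2)
H_Y_given_X = H_XY - H_X              # chain rule, Equation (2.4)
I_XY = H_X + H_Y - H_XY               # mutual information, Equation (2.8)

print(f"H(X)   = {H_X:.4f} bits")
print(f"H(Y)   = {H_Y:.4f} bits")
print(f"H(X,Y) = {H_XY:.4f} bits")
print(f"H(Y|X) = {H_Y_given_X:.4f} bits")
print(f"I(X;Y) = {I_XY:.4f} bits")
assert 0.0 <= I_XY <= min(H_X, H_Y)   # Equations (2.9)-(2.11)
```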
This coding efficiency can be improved if groups of symbols are encoded together. However, by grouping symbols, the size of the Huffman code tables grow exponentially. Another problem with Huffman coding is that the coder design and the statistical modeling are combined into a single process, making adaptive coding difficult. If the probabilities of the input symbols change, a relatively complex algorithm is needed to adapt the Huffman coder. 2.2.2 Arithmetic Coding Arithmetic coding is a lossless compression technique that benefits from treating multiple symbols as a single data unit but at the same time retains the incremental symbol-by-symbol coding approach of Huffman coding. Arithmetic coding separates the coding component from the modeling component. This process allows for the dynamic adapta-tion of the probability model without affecting the design of the coder. Theoretically, in arithmetic coding, a single codeword is assigned to each possible sequence of symbols. Each codeword can always be represented by a half-open subinterval in the interval [0,1). Chapter 2. Background 10 By assigning enough precision bits, one can distinguish one subinterval from any other subinterval, and thus uniquely decode the corresponding sequence of symbols. Like Huff-man codewords, the more probable sequence of symbols correspond to larger subintervals and thus require fewer bits of precision. Because an arithmetic encoder can encode symbols with a fraction of a bit, it can achieve close-to-optimal compression performance for sources with very low entropies. Arithmetic coders usually achieve better compression performance than Huffman coders. However, arithmetic decoders require that it be informed about the number of symbols that need to be decoded. For a detailed discussion of arithmetic coding, please see [21, 45, 46]. ' 2.3 Rate-Distortion Theorem Concepts Shannon [66] also introduced the rate-distortion function R(D), which represents the lowest information rate R achievable while still maintaining a fixlelity D. He proved that it is not possible to have a source,code of rate R with distortion d smaller that the minimum distortion D(R), or the inverse of R(D). This is often called the negative source coding theorem. However, he also proved that source codes exist which can arbitrarily approach the rate-distortion bound R(D) - a result that is known as the positive source coding theorem. In what follows, some of the basic rate-distortion theory results will be reviewed. . • First, let X and X be discrete random variables with alphabet X and X respectively, and the joint probability mass function be given by p(x,x) = Pr{X = x,X = x}, where x £ X and x £ X. Next, let the distortion function d{x,x) be a measure of the cost of representing the symbol x with the symbol x. The distortion function C/(X,.T), together Chapter 2. Background 11 with the joint probability function p(x, x), can be used to compute an average distortion 4 = E E P W P ( « W ^ . J ) - ' (2-14) The rate-distortion function R(D) for a source X with average distortion dm is given by R(D)= . . .min / ( * ; * ) , (2-15) p(a;|x) : dm<r> where I(X; X) is the mutual information between X and X. The minimization in Equa-tion (2.15) is performed over all conditional distribution p(x\x) subject to the constraint that the average distortion dm is less than or equal to D. The rate R(D) is the lowest achievable rate when a distortion no greater than D is allowed. Shannon showed that no source code with Rs < R and distortion D exists. 
It is important to emphasize that the above theoretical results apply to stationary discrete-time memoryless sources with finite output alphabets. These results are extended to sources with continuous amplitudes in [19, 4]. For a memoryless Gaussian source with finite variance a2, the rate-distortion function Rg(D) with mean square error distortion measure d(x,x) = (x — x)2 is Rg(D) = [-log2^. (2.16) In general, explicit results of the rate-distortion functions for memoryless non-Gaussian sources are not available. However, there are useful upper and lower bounds on the rate-distortion function R(D) for discrete-time continuous-amplitude memoryless sources with zero mean and finite variance a2, that is, Ri(D) < R{D) < RU(D), . (2.17) where Ri(D) and RU(D) are the lower and upper bounds respectively. From [4, 66], Ri(D) and RU{D) axe given by RU(D) = Rg(D) = ^log2^, and (2.18) Chapter 2. Background 12 Ri(D) = H(X) - - log2 2ireD, (2.19) where H(X) is the.differential entropy of the continuous amplitude memoryless source X. From the above, it is clear that the- Gaussian source requires the maximum rate • among all other sources with the same variance a2. • 2.4 Hybrid Video Coding The goal of video coding is to remove statistical and psychovisual redundancies in an image sequence while keeping intact as much perceptually important information as pos-sible. The two major types of statistical redundancies in a video sequence are spatial and temporal redundancies. Temporal redundancies exist usually due to structured motion and camera movement, while the spatial ones are normally due to the smoothness and edge continuity. These two types of redundancies can be reduced efficiently by using a generic hybrid video coder such as the one shown in Figure 2.1. The hybrid video coder first performs temporal prediction with motion compensation. Here, motion compensa-tion is the process of compensating for the displacement of a moving object from one frame to another. In practice, motion compensation is preceded by motion estimation, the process of finding displacement of pixels among frames. Next, the corresponding difference image, or the residual image, is processed by a waveform coder, such as a D C T or a subband coder, to further remove the spatial redundancies. To reconstruct the video sequence, the decoder in Figure 2.1 merely reverses the process of the encoder just described. Two types of coding methods, intra-frame coding and inter-frame coding, are used in the hybrid video coding scheme. Intra-frame coding employs only spatial redundancy reduction method, therefore, the frames can be encoded independently. For inter-frame coding, however, both temporal and spatial redundancy reduction methods are used Chapter 2. Background 13 MM) no Generic Encoder Generic Decoder Mt) Motion Estimation Motion Vector Motion —C=-| Compensation Residual Coding Motion Compensation Figure 2.1: Generic hybrid video coder and decoder (see Figure 2.1), and frames in the past and/or future are required to perform temporal prediction. A video frame that is coded by intra-frame coding is called an I picture. A video frame that is coded by inter-frame coding is called a P or B picture. For P pictures, only one frame in the past is used as reference for temporal prediction. For B pictures, one frame in the past and one frame in the future are used for temporal prediction. 
In order to reduce the implementation complexity of both the encoder and decoder, the B pictures are not referenced by any other types of picture, and only the closest I or P picture is used as a reference. Chapter 2. Background 14 Motion Estimation DCT Quantization Entropy Encoding Inverse Quantization Inverse DCT Motion Compensation Frame Memory motion vector Figure 2.2: A block motion-compensated prediction and DCT-based residual coding video coder Block Motion-Compensated Prediction and DCT-based Residual Coding The most popular hybrid video coder is the block motion-compensated prediction and DCT-based residual coding video coder shown in Figure 2.2. This type of video coder employs a block matching algorithm for motion estimation and a D C T and scalar quan-tization for residual coding. Because of its effectiveness and efficiency with respect to algorithm complexity and computational cost, the framework illustrated in Figure 2.2 has been adopted in video coding standards, such as H.261, M P E G - 1 , MPEG-2 , and H.263 [73, 36, 34, 31]. For simplicity, we call this type of video coder an MPEG-l ike video coder in what follows. In this work, our proposed algorithms will be developed based on the MPEG-l ike video coding framework. Therefore, in the following sections, methods used in Figure Chapter 2. Background ' • 15 2.2, such as block matching, D C T and scalar quantization, will be discussed in more detail. 2.5 Motion Estimation Many motion estimation methods have been used to compute the apparent motion in an image sequence for use in video coding, with each algorithm relying on a different repre-sentation of the object trajectory in the three dimensional scene. In this section, three of the most popular algorithms, the block matching algorithm, pel recursive algorithm, and object based algorithm, are briefly introduced. • Block Matching Algorithms: Block matching algorithms are the most success-ful tools for motion estimation in video coding. They have been implemented in many video codecs and have been specified in a number of video coding standards (e.g. H.261, MPEG-1 , MPEG-2 , and H.263 [73, 36, 34, 31]). Typically, block matching algorithms track only translation motion using a fixed block size. The block matching algorithms are popular because of their computational efficiency and minimal information representation of the motion field. In this work, a block matching algorithm is used for motion estimation. A more detailed discussion will be given in Section 2.5:1. • Pel-Recursive Algorithms: Pel-recursive techniques use gray-scale gradients and recursion to determine object motion [29, 63, 75]. With a pel-recursive algorithm, an affine model (capable of representing both translation and rotation motion) can be easily incorporated into the video coder. Gradient-based estimation algorithms calculate motion parameters for each pixel in the frame of interest. In this respect, they are well suited for motion compensated prediction since they do not intro-duce blocking artifacts and are able to deal with complex motion models allowing Chapter 2. Background 16 rotational motion and object deformation, as well as translational motion. How-ever, this flexibility comes at a cost since computing and transmitting pixel motion parameters is more expensive when compared to the block matching algorithms. In addition, large displacements in the motion field can not be easily tracked by pel-recursive algorithms. 
• Object Based Algorithms: Object (or region) based models are very powerful tools for motion estimation and motion representation since typical scenes can be characterized by small sets of objects moving in time [77, 56, 44, 28, 38, 59,. 57].' These methods first segment each frame into regions according to characteristics such as motion activity, texture or color, and then find motion parameters for each region. Object based motion estimation algorithms are able to faithfully represent true object motion and avoid blocking and mosquito effects since regions are not restricted to fixed shape as in typical block matching algorithms. However, these algorithms are much more complex than the block matching and pel-recursive ones. 2.5.1 Block Matching Algorithms A block matching algorithm (BMA) provides a parametric representation of the motion flow in a video signal. This algorithm estimates the motion of individual pixels using a single motion vector. Within the block, it is assumed that every pixel has the same motion. In this way, each block in the.current frame is predicted from a block in a reference frame which has undergone some type of motion. The "optimal" block matching algorithm is the full search (FS) method [37]. The full search method subdivides the current frame into non-overlapping blocks that are usually square. For each block b, at (m, n) in the current frame a block hrej at (m + dx, n + dy) in a reference frame Irej, that best matches it (given a distortion criterion) within a search Chapter 2. Background 17 Current Frame Reference Frame Search Region Figure 2.3: Full search motion estimation process region W, is found. The search region W is defined as the region over which one varies the displacement (dx, dy). In Figure 2.3, the full search method is illustrated for a search region that is ±p pels in the x-direction and ±q pels in the y-direction. The block size is (M,N). In causal or forward motion estimation, a past frame is used to compute the displacement estimate. In non-causal and backward motion estimation, a future frame is used to compute the displacement estimate.-The disadvantage of block matching algorithms lies in their simplistic assumptions. For example, block matching algorithms assume that all pixels within a block undergo identical motion, which is not always the case, especially when there are small moving objects. In addition, if pixels along the boundaries of adjacent blocks are assigned differ-ent motion vectors and an object straddles these block boundaries, the adjacent motion vectors "tear" the object'. In this case, the residual signal will contain high frequency edges, expressed using the so called "blocking artifacts". Chapter 2. Background 18 Matching Criterion The block matching algorithm uses a very simple cost function to evaluate the displace-ment vectors that describe motion in image sequences. In most simple pel-based absolute difference is used to determine the similarity between blocks. The two popular cost functions used to measure the similarity between blocks are the mean square error (MSE) and the mean absolute difference (MAD) . For a block at location (xm,yn) with block size MxN, the M S E and M A D for motion displace d = (dx,dy) are, ^ xm+M y„+N DMSE(dx,dy) = —— J2 Y.{I{x^j)-Ires(x + dx,y + dy)f (2.20) x=xm y=yn - | .r,„; A/ >/„+.Y DMAD(dx,dy) = E \I(x,y)-IreJ(x + dx,y + dy)\, (2.21) x=xrn y=yn where / and Irej are the pixels of the present frame and the reference frame, respectively. 
The difference in subjective performance between using the M S E and M A D is quite small for most video sequences. Therefore, in general, the M A D is preferred because it required less computations than the MSE. However, even using the M A D , the computa-tional complexity is still quite high for most video sequences, simply because hundreds of reference blocks must be considered for each candidate block. For example, for CCIR601 video (30 frames/sec, 720x480 pixels/frame) with a (±16, ±16) search region, a 30 GOPS (giga-operations per second) capable processor is required for performing full search block matching algorithm in real time. 2.5.2 Fast Motion Estimation Algorithm Full-search block matching algorithms are very computationally expensive, but they guar-antee finding the motion vector with the minimum M A D value. To alleviate the com-putational burden, many fast block matching algorithms have been proposed since 1981, when Jain and Jain developed the 2D logarithmic search algorithm [37]. In what follows, Chapter 2. Background 19 several heuristic search methods are described. These methods achieve sub-optimum performance at significantly reduced number of computation compared to the full-search method. 2D Logarithmic Search Method The 2D logarithmic search method [37] uses a sparse binary search area, where the "best" motion vector is located. The M A D matching criterion is usually used, as in the full search method. In this case, let (dtp, dtp) be the size of the search region. Consider the grid shown in Figure 2.4, nine points within the search region are chosen about the block center. This corresponds to points on a d0(d0 = 2^ 0 f l 2 ' p ^ _ 1 ' ) sampling grid, e.g. points at (-<i0, d0), (0, d0), (d0, dQ), (-c/0, 0), (0, 0), (d0, 0), (-c/0, -c/0), (0, -d0), {d0, -d0). The minimum distortion is found at one of these locations. The sampling period for the •ith iteration is given by di = 2k~l, (2.22) where k = (\log2(p)] — 1) and i is the iteration index. The location of minimum distortion for the 0th search (i — 0 iteration) is used as the search origin for the next iteration i = 1. For this and subsequent searches, eight locations need to be searched as the search origin was computed at-the i — 1 iteration. The iterative process continue until the search sampling period is equal to one, and i — k. This corresponds to an integer pixel motion vector sampling resolution. The total number of search locations in 2D logarithmic search is far less that in the full search. In total, the method visits 8\log2(p)'\ + 1 locations, and each location requires a M A D computation. This is compared to (2p + l ) 2 for the full search method. If a monotonic distortion surface exists, the 2D log search algorithm will locate a global minimum. In most cases, however, the M A D surface is not monotonic, and the search Chapter 2. Background 20 (-7, -7) (0, 7) (7, 7) 0 !•-0 —E3-0 j C 0 "l o 1 o r . J • •  El : • 1 r 1 1 c j t 0 : 1 - ° l 1 J LJ • — I 1 , ! \ j c (-7, 7) (7, 7) Figure 2.4: The 2D logarithmic search converges on a local minimum. Three-step Search Method Three-step search method is iterative, and similar to the logarithmic one. As in loga-rithmic search, a nine-point grid is searched first with a sampling distance of d0 = [|]. As each iteration, the nine-point sampling grid used in three-step search is identical to the nine-point sampling grid used in 2D log search. 
However, the sampling distance for a subsequent iteration is given by dt+1 = di - 1, (2.23) where i is the iteration index. As in 2D log search, the iteration process ends when the search distance is equal to one. By reducing the initial sampling distance d0, more precise motion vectors can be obtained at the cost of higher computational complexity [70]. The number of operations for the three step search method is greater than that of the 2D log Chapter 2. Background 21 (-7,-7) (0,7) (7.7) 0 0 0 1 l J " i. J t 1 0 0 0 1 j 1 1 1 0 1— 1 0 1 r 0^  1 1 L 2 3 r 1 1 L 2 1 1 i 1 • 7) IP: p . (7 Figure 2.5: The 3-step search search. Using the initial sampling distance d0 = [^], the total number of search location is s rn+1 . Parallel Hierarchical One-Dimensional Search (PHODS) This method is based on a one-dimensional binary search that relies on the assumption that the M A D surface is monotonically decreasing in both x and y direction. During this search process, the horizontal and vertical spatial dimensions are searched independently [7, 5]. As in the logarithmic search , a (±p, ±p) search region is here assumed. For a one-dimensional search along the x dimension illustrated in Figure 2.6, the three sampling grid for the first iteration is (-do, 0), (0, 0), (do, 0), with the sampling interval d0 = 2^° 5 2 ' P 'L The location with the minimum M A D among the three points is used as the center of the next iteration. The second iteration searches for the minimum DMAD at the following grid points, (-dl: 0), (0, 0), (<i;, 0), centered at the location with Chapter 2. Background 22 (-7,-7) (0,7) (7,7) }\ 1 J y°i • 1 3 ' 2 i I 1 1 1 r n L )'o ] r 1 • r 1 L XQ J XQ L *1 J *0 1 1 *2 1 1 X, I L *2 J y\ "1 L J (-7,7) . (7,7) Figure 2.6: The parallel hierarchical one-dimensional search (PHODS) the minimum M A D in the first iteration, where d,- = 2k~\ i is the iteration index, and k = \log2(p)]- The iteration process continues until a sampling interval of d{ = 1 is obtained. The remaining y dimension can be searched at the same time and in the same manner. The number of search locations in PHODS per block is 4|7og2(p)l + 1, which is roughly half of the number of locations considered in 2D logarithmic search. Hierarchical Search Method To circumvent the deficiencies in fast search algorithms, which depend on a monotonic decreasing M A D surface, hierarchical search methods were developed [6]. In hierarchical search, the search process is not confined to a subset of the search region, and is less likely to be trapped in a local minimum. Using low resolution representations of the frames, the entire search region can be processed at different layers. In addition, the hierarchical structure allows the use of different sized search regions and measurement windows that Chapter 2. Background 23 Figure 2.7: The hierarchical search match the image resolution during the search. This adaptability makes the hierarchical search more robust since it is able to dynamically adjust its motion parameters to the motion structure. A hierarchical motion estimation algorithm decomposes each frame in the image se-quence into multi-resolution frames. The decomposition can be made using a pyramidal structure [76, 7] or subband decomposition structure [80, 22, 71]. At the lowest resolu-tion, the largest motion displacements are measured. This is possible since the search window size can be made large without increasing the number of required computations. 
Due to the low-pass nature of the lowest resolution image, an accurate motion vector is not yet possible. However, because of low-pass filtering, the displacement estimate is less sensitive to noise. Using coarse measurements made at the lowest level of the pyramid, motion estimation is then performed at the higher level. Higher levels in the pyramid Chapter 2. Background 24 require a smaller search window, which allows for more accurate measurement of small displacements. Appropriately-scaled motion vectors from the previous level are used to .initiate the search at the higher resolution level of the image pyramid. Since the search for motion vectors begins at the lowest resolution, selecting an ap-propriate search size at such level is very important. The search size should be large enough to estimate the magnitude of most motion variation present in the video signal. Therefore, given an L-levels dyadic decomposition (subdividing the region by two at each iteration) and a (±p, ±p) search region, the search region for the lowest level should be ( P L , P L ) , where /'/• = (2-24) For a higher resolution level, in general, the search region is set to ( ± 1 , ±1) pixels (total 9 locations) to (±2, ±2) pixels (total 25 locations) at the corresponding' resolution. Hierarchical search has a computational complexity roughly equivalent to half of that of 2D logarithmic search, but yielding significantly better performance. Assuming a block size of M x A" and a search region of (±p, ±p), and a three level resolutions is used, the search region in the lowest level is ( ± [ | ] , ± [ | ~ | ) , and the search region in the two higher levels is set to ( ± 1 , ± 1 ) . The total number of operations required for each block is [(2[jjl + l ) 2 ^ + 9 ^ + 9NM}. Here, one operation includes one subtraction, one absolute value, and one addition. From a computation viewpoint, the hierarchical search method is- very efficient; how-ever, such a method requires increased storage due to the need to keep pictures at different resolution. Furthermore, this method may yield inaccurate motion vectors for regions •containing small objects. Since the search method starts at the lowest resolution in the pyramid structure, regions containing small objects may be completely eliminated and may never be tracked. Chapter 2. Background 25 Search Method Operations per Macroblock Operations per second 720x480 at 30 fps ' p = 15 p = 7 Full-search (2p + IfNM 9963.65 MOPS 2332.80 MOPS Logarithmic (%\log2p\+l)NM 342.14 MOPS 259.20 MOPS Three-step (8[fl + l ) i V M 673.92 MOPS 342.14 MOPS PHODS (4\log2p] +1)NM 176.26 MOPS 134.78 MOPS Hierarchical ( ( 2 [ f l + l ) 2 + 1 8 0 ) ^ 169.13 MOPS 132.84 MOPS Table 2.1: Computational complexity and operation requirements for pictui'e with image size of 720x480 and frame rate of 30 frames/sec. The macroblock size in this case is NM = 16x16. Table 2.1 summarizes the computational requirements of the five motion estimation methods we described in the section. The "MOPS" in Table 2.1 is the abbreviation of million operations per second. In all cases, we assume 16x16 macroblocks size and a (±p, ±p) search region. The picture size is 720x480 and the frame rate is 30 frames/sec. 2.6 Discrete Cosine Transform (DCT) Neighboring pixels in an image are usually highly correlated, implying that high levels of redundancies exist in the image data. Block transforms are very popular tools for image data decorrelation. The optimal block transform for this purpose is the Karhunen-Loeve (KL) block transform. 
Theoretically, the K L transform can completely decorrelate the input blocks, yielding highest energy compaction. However, the drawback of the K L transform is its high computational requirements: Besides, the basis of the K L transform depends on the statistics of the specific input blocks. There are other block transform techniques which can achieve good decorrelation performance, and are much easier to implement. They are the Discrete Fourier Transform (DFT), the Hadamard Transform, the Slant Transform, and the Discrete Cosine Transform (DCT). Each of Chapter 2. Background 26 these transforms can be implemented efficiently. Among these block transform methods, the performance of the D C T is the closest to the optimal K L transform for conventional image data (such as photographic images and movie frames). In fact, D C T is today's most popular orthogonal transformation, as it is used in almost all image and video coding standards. D C T based video coding systems usually yield reasonably good image reproduction quality at moderate and high bit rates. However, such systems will produce annoying blocking and/or mosquito artifacts at the lower bit rates. This is because some or most of the A C coefficients of a block will often be quantized to zeros. 2.7 D C T Coefficient Quantization and Coding After applying the D C T , the resulting coefficients need to be quantized and then entropy coded. Since the human eye is less sensitive to energy with high spatial frequencies, the lower frequency D C T coefficients .are, in general, visually more important than others [54]. For this reason, a set of non-uniform weights called the quantization matrix is used to quantize the coefficients. Each weight should vary with the frequency, as well as the type (intra, inter) of the image. The process of quantization of the 8x8 D C T coefficients i , j = 0 ,1 , . . . , 7, can be expressed by = ronnc/(^M^-), where qij is the weight in the quantization matrix Q, and the q is the quantization factor. In general, two quantization matrixes, Qintra and Qinter are required in video coding because of the different properties of the D C T coefficients in intra and inter coded pictures. The low-frequency weights usually have smaller magnitudes than the high-frequency weights in Qintra- However, in general, uniform weights are used in Qinter- Because the quantization matrix Q will not change, at least within a frame, the rate and the quality of the coded frame are directly controlled by the quantization factor q of each block. The algorithm that determines the value of q depends on the application, the rate", and the quality Chapter 2. Background 27 requirements. In fixed bit rate coding, the q is mainly determined by a rate control mechanism that attempts to maintain a constant buffer occupancy [35, 72]. Following quantization, the coefficients are re-ordered into a one-dimensional array by reading out the entire two dimensional array in a zigzag direction. In this way, the quantized coefficients are "approximately" arranged in order of ascending frequency. This process usually increases the number of consecutive D C T zero-valued coefficients, improves significantly compression efficiency [73, 36, 34, 31]. 2.8 Video Coding Standard In this thesis, our proposed algorithms will be implemented in the context of the H.263 and MPEG-2 video coding standard frameworks. Therefore, in the following subsections, a brief description of such video coding standards is presented. 
2.8.1 The H.263 Video Coding Standard The main application of H.263 is video conferencing at a bit rates below 64 kbits/sec. The H.263 standard [36], like the M P E G ones, is based on block-based motion compen-sated prediction and DCT-based residual coding as shown in Figure 2.2. However, it also involves more advanced prediction techniques, more effective motion vector coding methods, and more flexible mechanisms for alternating between inter and intra coding. These new features take into account the facts that, for very low bit rate video coding applications such as video telephony, video conferencing, and video monitoring, motion changes are relatively small and motion vector side information occupies a substantial portion of the overall bit rate. Detailed description of the H.263 coding standard will be presented in Chapter 5. Chapter 2. Background 28 2.8.2 The M P E G - 2 Video Coding Standard As suggested by its title, Information technology - Generic coding of moving pictures and associated audio information, the MPEG-2 standard is intended to be generic in the' sense that it serves a wide range of applications. MPEG-2 differs from MPEG-1 in that it supports many profiles and levels, which define applications and specific complexity constraints, respectively. A total of five profiles: 1) simple profile, 2) main profile, .3) SNR scalability profile, 4) spatial scalability profile, and 5) high profile, are supported by the standard. In order to encode efficiently interlaced video sequences, such as those used in T V , the MPEG-2 video coding standard also defines interlaced motion estimation and the D C T coding algorithm. A more detailed description of the MPEG-2 video coding standard will be presented in Chapter 6. Chapter 3 Block-Based Motion Estimation In this chapter, an efficient block-based motion estimation algorithm is proposed. Unlike in traditional block-based motion estimation algorithms, the proposed framework will yield very good tradeoffs among distortion, motion vector bit rate, and computations. Our proposed motion estimation algorithm finds the motion vector that minimizes the distortion subject to the constraints of motion vector bit rate and computations. This chapter consists of five sections. In section 3.1, the two well-known drawbacks in traditional motion estimation algorithms will be addressed. To overcome such drawbacks, a new optimization criterion for motion estimation is proposed in section 3.2. Depending on the intended video coding application, the proposed general optimization criterion can be greatly simplified. The principle and the implementation of the resulting motion estimation algorithms will then be discussed in detail in-Section 3.3 and Section 3.4. A summary is given in the last section. 3.1 Motivation Motion compensated prediction plays a vital role in video coding systems, as it is the most popular method to remove temporal redundancies between video frames. Many models have been developed to represent the apparent motion field in a video sequence. Mainly because of its computational efficiency, the most common motion model used in video coding systems and standards is that which tracks translational motion. With this model, the full-search block matching algorithms (FS-BMAs) are usually favored because 29 Chapter 3. Block-Based Motion Estimation 30 of their hardware simplicity and their generally minimal side information required for compensation. 
It has been implemented in many video codecs and has been specified in a number of video coding standards (e.g. H.261, M P E G - 1 , MPEG-2 , and H.263 [73, 33, 34, 36]). While the full-search B M A (FS-BMA) is the simplest in concept and implementation, it also has two well-known drawbacks. The first drawback is a degraded video quality in low bit rate applications. This problem is the result of two limitations: (1) the encoder cannot control the motion vector bit rate, and (2) the encoder cannot look ahead. Hence, the obtained motion vectors do not necessarily yield the "best" overall encoding results. The impact of the above limitations is usually more significant at low bit rates, since the motion vector bit rate becomes relatively large (30% ~ 50% of the total bit rate) [74, 24, 49]. In order to satisfy a certain bit rate requirement, the encoder then has to use a larger quantization step during the encoding of the D C T coefficients. This, in turn, severely degrades the reproduction quality. The second drawback is FS-BMA's very large computation requirements. For ex-ample, the two-step F S - B M A algorithm used in the MPEG-2 TM5 encoder typically requires 90% ~ 95% of the total number of computations. FS-BMAs require many com-putations mainly because they usually follow a fixed search path whose starting point is also fixed. This implies, that, to accommodate video sequences with non-stationary motion fields, a search path traversing all points in a very large search area must be used. The FS-BMA's high computational requirements have fueled many research ac-tivities, which have produced several efficient algorithms such as log search, three-step search, cross search, conjugate gradient search, hierarchical search, and block subsam-pling [5, 81, 82, 2, 11, 12, 52]. However, most of the above algorithms may quickly get trapped in local minima, yielding a significant loss in estimation performance. In this work, we propose a new motion estimation algorithm that is both efficient and Chapter 3. Block-Based Motion Estimation 31 effective. The new algorithm is discussed in the following sections. 3.2 Motion Estimation with Rate-Computation Constraint One of the problems of traditional motion estimation algorithms is that they minimize the distortion without consideration of the motion vector bit rate and the computational cost. In fact, the performance of a motion estimation algorithm should be evaluated by not only the distortion but also the other two important measures.-Our proposed motion estimation algorithm is a constrained optimization problem. Given the constraints of total motion vector bit rate Rt and total computational cost Gt for a frame with NxM macroblocks, we want to find the NM motion vectors {vi,j,i — 0 , . . . , N, j = 0 , . . . , M} that minimize E ^ ( v « . i ) ( 3 - 2 5 ) subject to the constraints T.Civij) < Cu (3.26) where the D(vitj), R(vi'tj) and C(vij) are, respectively, the motion compensation distor-tion, the motion vector bit rate and the computational cost for the (i,j)th macroblock. To solve the above constrained discrete optimization problem, we can use the well known Lagrangian multiplier method introduced in [16, 25, 65]. It basically means, that one adds the constraint which makes the problem hard to solve to the objective function, weighted by the Lagrangian multiplier. Then the new problem, called unconstrained or relaxed problem, is solved to find the Lagrangian multiplier that results in the optimal solution. Therefore, in our. 
case, the above constrained optimization problem can be solved by minimizing the following Lagrangian cost function $J$:

$J = \sum_{i,j} D(\mathbf{v}_{i,j}) + \lambda \cdot \sum_{i,j} R(\mathbf{v}_{i,j}) + \beta \cdot \sum_{i,j} C(\mathbf{v}_{i,j}),$   (3.27)

where the non-negative values of the Lagrangian multipliers $\lambda$ and $\beta$ must be chosen so that the two constraints in Equation (3.26) are satisfied. However, due to the rate and distortion dependencies manifested in the $D(\mathbf{v}_{i,j})$ and $R(\mathbf{v}_{i,j})$ terms, solving Equation (3.27) with the constraints of Equation (3.26) remains rather complicated, and almost impossible. In order to simplify this problem, we have to ignore rate-distortion dependencies between macroblocks and minimize, instead, the Lagrangian cost function for each macroblock,

$J(\mathbf{v}_{i,j}) = D(\mathbf{v}_{i,j}) + \lambda \cdot R(\mathbf{v}_{i,j}) + \beta \cdot C(\mathbf{v}_{i,j}).$   (3.28)

The motion vectors $\{\mathbf{v}_{i,j},\ i = 0, \ldots, N,\ j = 0, \ldots, M\}$ that minimize Equation (3.28) are suboptimal; however, this approach reduces significantly the complexity of the motion estimation algorithm. According to Equation (3.28), the proposed motion estimation algorithm seeks the motion vector $\mathbf{v}_{i,j}$ for each macroblock that minimizes the Lagrangian cost function $J(\mathbf{v}_{i,j})$. The algorithms that find the value of $\lambda$ will be discussed in the rate control sections of Chapters 5 and 6. In Section 6.4.1 we discuss the effects of different values of $\beta$ on $J(\mathbf{v}_{i,j})$. In the next section, we give a more detailed discussion of the proposed Lagrangian cost function.

The Proposed Lagrangian Cost Function J(n)

Let us assume that for a macroblock a pre-defined search path $V$ that covers the search area exists, and that $V$ consists of the sequence of motion vectors $(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N)$. The subscript $n$ in $\mathbf{v}_n$ indicates the search order, that is, $\mathbf{v}_n = (x_n, y_n)$ is the $n$th motion vector in the search path. After considering the vector $\mathbf{v}_n$ along the search path $V$, one can obtain the corresponding distortion $D(n)$, motion vector bit rate $R(n)$, and computational cost $C(n)$. The distortion $D(n)$ is defined as

$D(n) = \sum_{\mathbf{r} \in W} \rho\bigl(I_t(\mathbf{r}),\ I_{t-1}(\mathbf{r} + \mathbf{v}_n)\bigr),$   (3.29)

where $\mathbf{r}$ is the spatial index of the image pixels in the image plane, $I_t(\mathbf{r})$ is the image intensity at position $\mathbf{r}$ of the candidate block in the current frame, $I_{t-1}(\mathbf{r} + \mathbf{v}_n)$ is the image intensity at position $(\mathbf{r} + \mathbf{v}_n)$ of the matching block in the reference frame, $W$ is the matching window, and $\rho(\cdot,\cdot)$ is the distortion measure. The most popular distortion measures are the mean squared error (MSE), where $\rho(\cdot) = |\cdot|^2$, and the mean absolute difference (MAD), where $\rho(\cdot) = |\cdot|$. The bit rate $R(n)$ represents the number of bits required to encode the motion vector $\mathbf{v}_n$. The computational cost $C(n)$ is equal to the total number of computations for all vectors up to, and including, the $n$th motion vector. The "best" values of $D(n)$, $R(n)$, and $C(n)$ depend on the cost function being minimized.

From Equation (3.28), the proposed algorithm is to find the motion vector $\mathbf{v}_n$ that yields the minimum value of the Lagrangian cost $J(n)$, where

$J(n) = D(n) + \lambda \cdot R(n) + \beta \cdot C(n).$   (3.30)

The parameters $\lambda$ and $\beta$ control the influence of the rate $R(n)$ and the computational cost $C(n)$ on the value of the Lagrangian cost $J(n)$, respectively. For example, if the value of the parameter $\lambda$ increases, the proposed algorithm will favor the motion vectors requiring a lower bit rate.
Similarly, if the value of the parameter $\beta$ increases, the computation constraint becomes more important, forcing the algorithm to follow the shortest path or terminate the search process after only a few vectors have been considered. The proposed algorithm is a generalization of traditional motion estimation algorithms. In fact, when $\lambda$ and $\beta$ are set to zero, the Lagrangian cost $J(n)$ is equal to $D(n)$, the cost measure usually used in motion estimation.

Minimization of the Lagrangian cost $J(n)$ given by Equation (3.30) seems straightforward, yet there are two problems that need to be addressed. First, specific values for $\lambda$ and $\beta$ must be obtained, usually via rate/computation control algorithms. Second, the minimum value of $J(n)$ can be obtained only if all candidate vectors in the search path have been considered, rendering useless the use of the computational cost as part of $J(n)$. However, depending on the video application, minimization of $J(n)$ can be greatly simplified. For low bit rate video coding applications, such as video conferencing, because the motion field is often stationary, and the spatio-temporal resolution of the video sequence is usually very small, the computational cost $C(n)$ in Equation (3.30) becomes substantially less important than the motion vector bit rate $R(n)$. Therefore, in low bit rate applications, Equation (3.30) can be simplified to

$J(n) = D(n) + \lambda \cdot R(n).$   (3.31)

Minimizing the above equation for motion vectors along a specific path is referred to as rate-distortion (RD) optimized motion estimation. However, in high bit rate applications, such as MPEG-2 video encoding, incorporating the motion vector bit rate $R(n)$ in Equation (3.30) leads to an insignificant performance improvement. On the other hand, incorporating explicitly the computational constraint yields a large improvement in encoding speed. Therefore, for high bit rate applications, one should minimize the cost $J(n)$ given by

$J(n) = D(n) + \beta \cdot C(n).$   (3.32)

Minimizing the above equation for motion vectors along a specific path is referred to as computation-distortion (CD) optimized motion estimation. In the following two subsections, we will discuss the specific RD and CD optimized motion estimation algorithms respectively implemented in our work.

3.2.1 Rate-Distortion Optimized Motion Estimation

According to [11, 12, 23, 41, 13, 27, 49, 43, 48], finding a motion vector $\mathbf{v}_n$ that minimizes the Lagrangian $J(n)$ in Equation (3.31) is equivalent to finding a motion vector that minimizes the distortion $D(n)$ subject to a constraint on the motion vector bit rate $R(n)$. An important advantage of RD optimized motion estimation is that, via the parameter $\lambda$, the encoder is able to control the bit rate of the motion vectors. As we will explain in Chapter 5, this property is very important in low bit rate video coding applications. Another advantage is the algorithm's robustness with respect to video input noise. Because a bit rate constraint is embedded in the above Lagrangian equation, for stationary areas in the video sequence (e.g., background), the RD optimized motion estimation algorithm will be insensitive to random fluctuations of block matching errors caused by various types of video input noise.
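The following minimal sketch illustrates RD optimized motion estimation along a given search path. The helpers sad and motion_vector_bits are simplified stand-ins (not the actual SAD routine or H.263 VLC table), and the candidate list is assumed to follow the chosen search path; it is an illustration rather than the exact implementation used in our coder.

```python
import numpy as np

def sad(cur_block, ref_frame, top, left, mv):
    """Sum of absolute differences between the current block and the
    motion-compensated block in the reference frame (distortion D).
    Assumes the displaced block stays inside the reference frame."""
    h, w = cur_block.shape
    y, x = top + mv[1], left + mv[0]
    ref_block = ref_frame[y:y + h, x:x + w]
    return int(np.abs(cur_block.astype(int) - ref_block.astype(int)).sum())

def motion_vector_bits(mv, predicted_mv):
    """Crude stand-in for the VLC table: cost of coding the difference
    between the candidate and the predicted motion vector (rate R)."""
    dx, dy = mv[0] - predicted_mv[0], mv[1] - predicted_mv[1]
    return 2 + 2 * (abs(dx).bit_length() + abs(dy).bit_length())

def rd_motion_search(cur_block, ref_frame, top, left, candidates,
                     predicted_mv, lam):
    """Return the candidate minimizing J(n) = D(n) + lambda * R(n)."""
    best_mv, best_cost = None, float('inf')
    for mv in candidates:          # candidates enumerated along the search path
        cost = (sad(cur_block, ref_frame, top, left, mv)
                + lam * motion_vector_bits(mv, predicted_mv))
        if cost < best_cost:
            best_mv, best_cost = mv, cost
    return best_mv, best_cost
```

Setting lam to zero recovers conventional minimum-SAD matching, consistent with Equation (3.30) with both multipliers set to zero.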
3.2.2 Computation-Distortion Optimized Motion Estimation

The advantage of computation-distortion optimized motion estimation is that good tradeoffs between computation and distortion can be achieved by manipulating the value of the parameter $\beta$ [42]. In our proposed motion estimation framework, described in detail in the next section, the motion vector search starts from the predicted motion vector, and the search then follows the search path specified by the estimation algorithm. The Lagrangian in Equation (3.32) is used by the algorithm to decide when to terminate the search. Because the cost function $J(n)$ takes both computational cost and distortion into account, the termination algorithm can dynamically control the search region according to the weighted sum of the two measures along the search path. Experimental results in Section 3.4 show that the dynamic nature of the implementation leads to a dramatic increase in motion estimation speed for only a small loss in compression performance. Another advantage of our algorithm is that the reconstruction quality of the resulting MPEG-2 coder is proportional to the computer's processing power, providing much needed flexibility in many MPEG-2 software applications.

In Figure 3.8.a, we show the profile of the sum of absolute distortion (SAD) $D$ for macroblock (8, 4) in the first frame of the sequence BUS. The profile of the cost function $J = D + \beta \cdot C$ along the diamond-shaped search path (which will be discussed in a later section) centered at (0, 0) is shown in Figure 3.8.b. The minimizer of $D$ in Figure 3.8.a is (5, -3), and the minimizer of $J$ in Figure 3.8.b is (4, -2). Notice that the $D + \beta C$ surface is smoother, indicating that the CD optimized algorithm would be less sensitive to video input noise.

3.3 Proposed Motion Estimation Algorithm

Our proposed motion estimation algorithm consists of the following three important parts: (1) initialization, (2) search path, and (3) termination. Initialization involves the selection, through prediction, of the first motion vector (or the first point in the search path) to be considered. In the first subsection, four predictors, the mean predictor, the median predictor, the LMSE (weighted mean) predictor, and the statistical mean predictor, will be studied. In the following subsection, we will discuss the search path employed in this work. The search path should specify a sequence of motion vectors such that the "optimal" motion vector can be found with minimum computational cost. Three search paths, a spiral search path, a diamond-shaped search path, and a floating-center search path, are addressed. In the final subsection, two effective termination algorithms, each based on a specific probabilistic model, will be introduced and discussed.

[Figure 3.8: Profile of the sum of absolute distortion (SAD) (a) and of the Lagrangian $J = D + \beta C$ (b) for motion estimation. The search window is ±15.]

3.3.1 Starting Point: Prediction

The first point in the search path should correspond to the motion vector with the highest probability of being the "optimum" motion vector. By employing prediction, one may be able to achieve such an objective. This is possible mainly because the motion field in a typical video sequence is smooth and structured, exhibiting strong spatio-temporal dependencies between its motion vectors.
Memory has always been incorporated into the motion vector coding process, but only a few researchers have suggested exploiting it to simplify motion estimation. Recently, predictive motion estimation [81, 82, 3, 41, 49, 52] has become an important research area and several motion vector predictors have been proposed. In this section, the mean predictor, the weighted mean (LMSE) predictor, the median predictor, and the statistical mean predictor are discussed [43], and their complexity and performance are then evaluated.

Mean Predictor

The mean predicted motion vector is given by $\bar{\mathbf{v}} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{v}_k$, which is the average of the $K$ motion vectors $\mathbf{v}_k$, evaluated for each $x$ and $y$ component independently. As confirmed by our experimental results, only the vectors corresponding to the closest 3 blocks (the A, B, D blocks in Figure 3.12) should be used.

Weighted Mean Predictor

A more accurate predicted vector is given by $\bar{\mathbf{v}} = \sum_{k=1}^{K} a_k \mathbf{v}_k$, where the linear prediction coefficients $a_k$ are computed off-line using the well-known autocorrelation method. Based on experimental results, a fourth order predictor ($K = 4$) seems to be a good choice. A higher order predictor improves slightly the prediction accuracy, but at the expense of a large increase in the number of computations. While this weighted mean predictor outperforms the uniform mean predictor discussed above, further improvements in performance are expected by changing the values of the coefficients $a_k$ adaptively during motion estimation.

Median Predictor

Like those of the mean and weighted mean predicted vectors, the two components of the median predicted vector are computed independently but using the same procedure. Two different median predicted vectors are computed: one based on the H.263 3-block region of support (ROS) and another based on a 5-block ROS. The 3-block median predictor is better in terms of prediction performance, as will be shown later.

Statistical Mean Predictor

A more powerful non-linear predictor is the statistical mean vector $\bar{\mathbf{v}} = (\bar{x}, \bar{y})$, whose two components are also computed independently and in an identical manner. Without loss of generality, we will next describe the procedure used to estimate $\bar{x}$, the $x$ component of $\bar{\mathbf{v}}$. The scalar $\bar{x}$ is the weighted average of the $x$ components of all possible motion vectors. The weighting coefficients are equal to the conditional probabilities of the $x$ components. The conditional probabilities are determined using a finite-state machine (FSM) model that is represented by a codebook, or a set of state code vectors. Each code vector represents a relatively large set of template vectors $\mathbf{u}_t$. The components of the template vectors $\mathbf{u}_t$ are feature symbols representing the $x$ components of previously coded motion vectors in the ROS. In other words, a template vector $\mathbf{u}_t$ is the output of a mapping function whose inputs are the ROS motion vector $x$ components, and is the input to a vector quantizer (VQ) whose codebook contains all state code vectors. In this work, the following two mapping functions and VQs are considered:

1. Scalar quantized mean: the mean of the $x$ components is computed and scalar quantized. The ROS associated with the $x$ components is the one used in the computation of the mean predicted vector. Each possible value of the scalar quantized mean represents a template vector $\mathbf{u}_t$.

2. Scalar quantized weighted mean: the
input to the scalar quantizer is the weighted mean, computed using the same ROS adopted during the computation of the weighted mean predicted vector.

The state VQ codebook is generated on-line [41]. The codebook is initialized with one code vector $\mathbf{u}_s^1$, the first template vector. The next template vector $\mathbf{u}_t$ is compared to $\mathbf{u}_s^1$. If the distortion $d(\mathbf{u}_t, \mathbf{u}_s^1)$, computed as a sum of component-wise differences over the $N$ feature symbols, is less than a threshold $T_1$, then $\mathbf{u}_t$ is mapped to the state code vector $\mathbf{u}_s^1$. Otherwise, $\mathbf{u}_t$ is added to the codebook. Similarly, each new vector $\mathbf{u}_t$ is compared to all vectors in the codebook. If the best matching code vector is still not a good match, as determined by the threshold $T_1$, then $\mathbf{u}_t$ is added to the codebook. When the maximum codebook size is reached, the least popular code vector is deleted before adding the new vector.

Next, suppose $\mathbf{u}_t$ is the template vector whose components are the different feature symbols considered above. Then, an $N$-size state codebook $C = \{\mathbf{u}_s^1, \mathbf{u}_s^2, \ldots, \mathbf{u}_s^N\}$ is searched, and the probability table corresponding to the state code vector $\mathbf{u}_s^*$ closest to $\mathbf{u}_t$ is selected. Then, $\bar{x}$ is given by

$\bar{x} = \sum_{\ell=1}^{L_x} p(x_\ell \mid \mathbf{u}_s^*) \, x_\ell,$   (3.33)

where $p(x_\ell \mid \mathbf{u}_s^*)$ is the probability of the motion vector $x$ component $x_\ell$ given state $\mathbf{u}_s^*$, and $L_x$ is the number of all possible motion vectors.

Since both the codebook and the probability tables are adaptive, this procedure is expected to yield very accurate statistical mean predicted motion vectors. Note that our implementation of the statistical mean predictor does not require that the table probabilities be stored or even directly processed. The estimates can be obtained easily by maintaining a counter for each state code vector. Thus, such a prediction technique is relatively simple.

3.3.2 Search Path

Three types of search paths, a spiral search path, a diamond-shaped search path, and a floating-center search path, will be discussed in this section. The spiral search path and the diamond-shaped search path are static in the sense that the associated trajectories are fixed during the searching process. On the other hand, the floating-center search path is dynamic, as its trajectory is time-varying. In our work, all three paths consist of layers, and each layer consists of connected search points. The layers are searched in an order that depends on their geometrical distance with respect to the predicted motion vector. This is based on the assumption that the probability of finding the optimal motion vector is inversely proportional to the distance between the layer and the predicted motion vector.

Spiral Search Path

The search area is initially divided into square-shaped contours or layers, as shown in Figure 3.9. The search starts at the predicted motion vector V, marked as "Layer 0" in the figure. We then search layers 1, 2, 3, ... sequentially. The search order within each layer is arbitrary, since all points in the layer will be searched. The spiral search path has been used in the H.263 TMN5 coder's motion estimation algorithm. The number of search points in each layer is given by $8 \times l$, where $l$ is the layer index.

[Figure 3.9: A spiral search path.]

Diamond-Shaped Search Path

As shown in Figure 3.10, the diamond-shaped search path also consists of a layer structure; however, its layers are diamond-shaped while those of the spiral search path are rectangular. If $l$ represents the layer index, then the number of search points in the $l$th layer is $4 \times l$. A minimal sketch of this layer structure appears below.
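The following sketch (an illustration only, not the thesis implementation) enumerates the candidate offsets of the $l$th diamond-shaped layer around a predicted motion vector; each layer is the set of points at city-block distance $l$ from the center, so it indeed contains $4 \times l$ points.

```python
def diamond_layer(center, l):
    """Return the 4*l offsets at L1 (city-block) distance l from center.

    center : (x, y) predicted motion vector used as Layer 0.
    l      : layer index (l >= 1); Layer 0 is the center itself.
    """
    cx, cy = center
    if l == 0:
        return [(cx, cy)]
    points = []
    for dx in range(-l, l + 1):
        dy = l - abs(dx)
        points.append((cx + dx, cy + dy))
        if dy != 0:                       # avoid duplicating the two corner points
            points.append((cx + dx, cy - dy))
    return points

# Example: layer 2 around the predicted vector (1, -1) has 4*2 = 8 candidates.
assert len(diamond_layer((1, -1), 2)) == 8
```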
One interesting property of the diamond-shaped search path is that its layers have the same shape as the contour lines of a two-dimensional Laplacian distribution function. Therefore, the diamond-shaped search path is an optimum path for motion vector fields which can be modeled by two-dimensional Laplacian distributed random variables.

[Figure 3.10: A diamond-shaped search path, showing the most likely motion vector V, the search region R, and layers 0, 1, 2 with Lagrangian values $J_0$, $J_1$, $J_2$.]

Floating-Center Search Path

An alternative path is the floating-center search path shown in Figure 3.11. Unlike the previous two search paths, the floating-center search path changes dynamically, and the associated search process is more complicated than that of the above two search paths. The corresponding algorithm searches only the layer whose center is the "best" motion vector found at the previous layer. For example, in Figure 3.11, the algorithm first searches the 4 points inside the diamond-shaped area marked as layer 0. After the 4 points have been tested, the "best" motion vector found in layer 0, marked as X, becomes the center of the new search region marked as layer 1. However, because of the overlap, only the 3 points in layer 1 will be tested. Next, the best motion vector found among the 3 points becomes the center of the new search region marked as layer 2, and so on.

[Figure 3.11: Floating-center search path.]

The advantage of the floating-center search is that the number of search points in each layer is equal to 4. Therefore, this method is inherently more efficient than the spiral and diamond-shaped search methods. However, depending on the input video sequence, the size of the search region in each layer should be decided carefully. For example, for video with active motion, if the size of the search region is too small, then the method can be more easily trapped in local minima than the other two methods.

3.3.3 Termination Point Algorithm

In RD optimized motion estimation, minimizing $J(n) = D(n) + \lambda R(n)$ can be very computationally demanding. In CD optimized motion estimation, to find the motion vector that gives the minimum $J$, we would have to test all candidate motion vectors, rendering useless the computational cost constraint. Fortunately, however, the probability of finding the minimum value of $J(n)$ along the search path is usually a monotonically decreasing function of the distance away from the predicted motion vector. Thus, by terminating the search process subject to a rate or computation constraint, one can still obtain a near-optimal motion vector with respect to the designated search path. However, along the search path the value of the Lagrangian $J(n)$ is not necessarily monotonically decreasing, and choosing an appropriate termination point is therefore a difficult task. One solution is to group positions on the search path into layers according to their geometrical locations, as shown in Figures 3.9, 3.10 and 3.11. Let us assume that $M$ motion vectors $(\mathbf{v}_k, \mathbf{v}_{k+1}, \ldots, \mathbf{v}_{k+M})$ are in the $l$th layer; then, among the computed Lagrangian values $\{J(k), J(k+1), \ldots, J(k+M)\}$, we select the Lagrangian $J_l$ with the minimum value to represent the $l$th layer. Then, the Lagrangian values $J_0, J_1, \ldots, J_l$ are compared to decide when to "best" terminate the search path.
Intuitively, if the Lagrangian value $J_l$ keeps increasing for several consecutive layers, then it is very unlikely that we could find a smaller $J(n)$ in the following layers. In fact, for a smooth motion field, it is very likely that the "optimal" motion vector is located in the vicinity of the predicted motion vector. Building on the above assumption, we next propose two probabilistic models, where each model is based on a specific hypothesis. The two models are:

• Model I (2-layer Hypothesis): If $J_l$ is larger than $J_{l-1}$, then it is unlikely that we will find a smaller Lagrangian $J(n)$ by continuing the search outward. Thus, searching more motion vectors is not necessary. This hypothesis is often violated in areas where the motion field is not smooth. In such a case, the Lagrangian surface is likely not convex. Even when the motion field is smooth, the search as specified by this hypothesis can quickly get trapped in a local minimum.

• Model II (3-layer Hypothesis): Only if $J_{l-2} < J_{l-1} < J_l$ will we be confident that searching more layers for a better motion vector is wasteful. This hypothesis places a larger convexity constraint on the search. It almost guarantees optimality, but at the expense of a much larger computational load. Because this test may never be satisfied, we must limit the number of considered vectors.

By using either of the two models corresponding to the above two hypotheses, we allow the search area to expand or contract based on the local statistics of the motion field. Either model can be better than the other in terms of computation-performance tradeoffs, depending on the particular video sequence being coded. However, Model I (2-layer Hypothesis) would be the choice when computational complexity is a major concern. As can be expected, both models fail in areas where many non-motion changes occur. Finally, note that the two models can be used interchangeably as dictated by performance and/or computation constraints.

3.4 Simulation Results

3.4.1 Initialization

The purpose of this experiment is to evaluate the performance of the four predictors (mean predictor, weighted mean predictor, median predictor, and statistical mean predictor) discussed in Section 3.3.1. In this experiment, the average entropy of the difference between the estimated (using the full-search method) motion vector and the predicted motion vector is used as the performance measure. For simplicity, only the $x$ component of the difference motion vector is considered, and only integer-pel accuracy motion estimation is used. A smaller value of the entropy indicates that the corresponding motion predictor is more accurate. The video sequences used in this simulation are the QCIF sequences (Quarter Common Intermediate Format, a video conferencing format in which each frame contains 144 lines and 176 pixels per line) CAR PHONE and MISS AMERICA, and the block size used for matching is 8×8. Tables 3.2 and 3.3 show the average entropy for MISS AMERICA and CAR PHONE (respectively) of the motion vector $x$ component in the search area centered at the mean, weighted mean, median, and statistical mean. The peak signal to noise ratio (PSNR) used in Tables 3.2 and 3.3 is defined as

$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\sigma_e^2},$   (3.34)

where $\sigma_e^2$ is the mean square error (MSE) of the reconstructed video sequence.
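For reference, here is a minimal sketch of the mean and 3-block median predictors compared in these tables. It is a simplified illustration assuming integer-pel motion vectors and the 3-block region of support of Figure 3.12, not the thesis implementation.

```python
def mean_predictor(ros):
    """Uniform mean of the ROS motion vectors (per component).

    ros: list of (x, y) motion vectors from the already coded
         neighbouring blocks, e.g. [A, B, D] for MEAN-A.
    """
    n = len(ros)
    return (round(sum(v[0] for v in ros) / n),
            round(sum(v[1] for v in ros) / n))

def median_predictor(ros):
    """Component-wise median, as used for MED-A with the 3-block ROS."""
    xs = sorted(v[0] for v in ros)
    ys = sorted(v[1] for v in ros)
    return (xs[len(xs) // 2], ys[len(ys) // 2])

# Example: neighbours A=(2, 0), B=(3, -1), D=(8, 0)
ros = [(2, 0), (3, -1), (8, 0)]
print(mean_predictor(ros))    # (4, 0)  -- pulled toward the outlier
print(median_predictor(ros))  # (3, 0)  -- robust to the outlier
```

The median's robustness to outliers is consistent with the lower entropies reported for MED-A in Tables 3.2 and 3.3.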
There are two different mean predictors: (1) MEAN-A, where the ROS in Figure 3.12 consists of blocks (A, B, D), and (2) MEAN-B, where the ROS consists of blocks (A, B, C, D). As expected, MEAN-A leads to a lower entropy, as averaging many motion vectors adversely affects prediction accuracy. Since MEAN-A and MEAN-B are special cases of the weighted mean predictor (WM), the latter should yield better prediction accuracy. This is confirmed by the results shown in the tables. It is clear that the 3-block (A, B, D) median predictor (MED-A) outperforms the 5-block (A, B, C, D, G) median predictor (MED-B) in terms of both prediction accuracy and complexity. With the exception of its potentially higher robustness to channel errors, MED-B does not seem to be a good choice. When used to drive the FSM model, the weighted mean (WM-SM) also outperforms the uniform mean (M-A-SM and M-B-SM) in terms of prediction accuracy. Moreover, the statistical mean predictor (WM-SM) leads to a slightly better performance than WM. Finally, notice that MED-A yields a lower entropy than the more complex linear and statistical mean predictors. This, coupled with its higher robustness to channel errors, is likely what made it part of the H.263 standard.

[Figure 3.12: Average correlation coefficients $(c_x, c_y)$ between the ROS motion vectors and the current one; the region of support consists of labeled blocks in the present frame and the previous frame.]

PSNR   MEAN-A  MEAN-B  WM     MED-A  MED-B  M-A-SM  M-B-SM  WM-SM
35.9   0.566   0.569   0.566  0.561  0.582  0.566   0.570   0.566
36.8   0.620   0.635   0.602  0.612  0.628  0.620   0.635   0.602
37.8   0.689   0.679   0.670  0.659  0.679  0.689   0.677   0.668

Table 3.2: Entropy for MISS AMERICA of motion vector x component offsets in the search area centered at the mean (MEAN-A and MEAN-B), weighted mean (WM), median (MED-A and MED-B), mean-based statistical mean (M-A-SM and M-B-SM), and weighted mean-based statistical mean (WM-SM).

PSNR   MEAN-A  MEAN-B  WM     MED-A  MED-B  M-A-SM  M-B-SM  WM-SM
31.2   1.686   1.737   1.598  1.553  1.602  1.658   1.681   1.588
33.0   2.008   2.044   1.930  1.860  1.928  1.946   2.011   1.909
34.9   2.318   2.335   2.190  2.109  2.194  2.237   2.296   2.169

Table 3.3: Entropy for CAR PHONE of motion vector x component offsets in the search area centered at the mean (MEAN-A and MEAN-B), weighted mean (WM), median (MED-A and MED-B), mean-based statistical mean (M-A-SM and M-B-SM), and weighted mean-based statistical mean (WM-SM).

PSNR   WM-SM   FSM
35.9   0.569   0.498
36.8   0.602   0.547
37.8   0.668   0.592

Table 3.4: Comparison between the FSM entropy of the motion vector x components and the entropy of the x component offsets of those motion vectors located in the search area centered at the weighted mean-based statistical mean (WM-SM) for the sequence MISS AMERICA.

It is clear from Tables 3.2 and 3.3 that the differences in average entropies are relatively small. This suggests that motion estimation/coding is not sensitive to prediction accuracy. Moreover, as shown in Tables 3.4 and 3.5 for MISS AMERICA and CAR PHONE
(respectively), prediction-based motion estimation/coding compares favorably with the statistical FSM-based estimation/coding studied in [11, 12, 41]. In the latter case, an FSM model based on dynamic VQ is employed for probability-based estimation and direct coding of the motion vectors (i.e., explicit prediction is not performed).

PSNR   WM-SM   FSM
31.2   1.598   1.469
33.0   1.930   1.751
34.9   2.190   2.026

Table 3.5: Comparison between the FSM entropy of the motion vector x components and the entropy of the x component offsets of those motion vectors located in the search area centered at the weighted mean-based statistical mean (WM-SM) for the sequence CAR PHONE.

3.4.2 Search Path

In this part, the computation-distortion performance of the methods based on the three search paths (spiral search path, diamond-shaped search path, and floating-center search path) discussed in Section 3.3.2 is evaluated. The computation-distortion optimized motion estimation algorithms associated with the three search paths are implemented in the context of the MPEG-2 TM5 video encoder. Model I (2-layer Hypothesis), discussed in Section 3.3.3, is used for termination. The median predictor is used to obtain the predicted motion vector in all three video encoders. The test video sequences are the CCIR 601 (720×480 pixels, 30 frames/sec) FOOTBALL and BUS sequences, and the target bit rate is 8 Mbits/sec. The search window sizes for FOOTBALL and BUS are ±47 pixels. The results of our experiment are shown in Figure 3.13.

[Figure 3.13: Compression performance of the 3 search paths in terms of PSNR and number of computations per macroblock, for (a) FOOTBALL and (b) BUS (48 frames each, ±47 search window, 8 Mbits/sec).]

For the FOOTBALL sequence, the results show that the method based on the floating-center search path is more efficient than those based on the diamond-shaped and spiral search paths. However, for the BUS sequence, the performance of the three methods is quite similar. This is due to the more active motion field in the FOOTBALL sequence, which forces the algorithm to search more layers. As explained in Section 3.3.2, the number of search points in each layer of the floating-center search is fixed, whereas for the spiral search and the diamond-shaped search the number of search points in each layer is proportional to the layer index. For the BUS sequence, because of the smoother motion field, the motion vector can be predicted quite accurately by the median predictor, and thus the algorithm needs to search fewer layers. The simulation results suggest that the floating-center search path is a good choice.

3.4.3 Termination Point

The proposed probabilistic models (based on the 2-layer Hypothesis and the 3-layer Hypothesis) for termination of the motion search process are here evaluated and compared. The video encoder employed a floating-center search path in this experiment; a sketch of the layer-based termination test is given below.
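The following sketch is an illustration of the two termination hypotheses only, not the encoder's code; layer_candidates and lagrangian_cost are hypothetical helpers standing in for the search-path enumeration and the cost of Equation (3.31) or (3.32).

```python
def search_with_termination(layer_candidates, lagrangian_cost,
                            model=1, max_layers=16):
    """Layer-by-layer search with early termination.

    layer_candidates(l) -> list of candidate motion vectors in layer l
    lagrangian_cost(mv) -> J(n) for one candidate (D + lambda*R or D + beta*C)
    model: 1 stops when J_l > J_{l-1} (2-layer Hypothesis);
           2 stops when J_{l-2} < J_{l-1} < J_l (3-layer Hypothesis).
    """
    best_mv, best_cost = None, float('inf')
    layer_minima = []                       # J_0, J_1, ..., J_l
    for l in range(max_layers):
        costs = [(lagrangian_cost(mv), mv) for mv in layer_candidates(l)]
        if not costs:
            break
        j_l, mv_l = min(costs)
        layer_minima.append(j_l)
        if j_l < best_cost:
            best_cost, best_mv = j_l, mv_l
        if model == 1 and l >= 1 and layer_minima[-1] > layer_minima[-2]:
            break                           # one increase: stop (Model I)
        if model == 2 and l >= 2 and \
           layer_minima[-3] < layer_minima[-2] < layer_minima[-1]:
            break                           # two consecutive increases (Model II)
    return best_mv, best_cost
```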
The test video sequences are the CCIR 601 (720×480 pixels, 30 frames/sec) FOOTBALL and GARDEN sequences, and the target bit rate is 8 Mbits/sec. Results are shown in Figure 3.14.

[Figure 3.14: Compression performance of the 2 termination models (2-layer and 3-layer Hypotheses) in terms of PSNR and number of computations per macroblock, for the FOOTBALL and GARDEN sequences (48 frames each, ±47 search window, 8 Mbits/sec).]

For the FOOTBALL sequence, the two methods are very close in performance. However, for the GARDEN sequence, the method based on Model I is more efficient than the one based on Model II. For very active sequences, such as FOOTBALL, the proposed distortion-computation optimized motion estimation algorithm yields a 200:1 encoding speed improvement with only a 0.14 dB degradation. Besides its computational advantage, an important feature of our method is that computational resources are better allocated, depending on the statistics of the input video sequence.

3.5 Summary

In this chapter, an efficient motion estimation algorithm that minimizes the compensation distortion under constraints on the motion vector bit rate and the computational cost has been proposed. Depending on the video application, the proposed algorithm can be simplified into a rate-distortion optimized motion estimation algorithm or a computation-distortion optimized motion estimation algorithm. For CD optimized motion estimation, an in-depth discussion of the design and implementation of the three different parts of the motion search, (1) initialization (motion vector prediction), (2) search path, and (3) termination, was also presented.

The main contribution of this chapter is that the proposed motion estimation provides excellent tradeoffs among the distortion, the motion vector bit rate, and the computational cost. The proposed algorithm overcomes the two common drawbacks of the full-search block matching algorithm: low image reproduction quality and huge computation requirements. Besides the above advantages, our algorithm provides easy control of rate, distortion and computations.

Chapter 4

Rate-Distortion Optimized Mode Selection

In this chapter, an efficient rate-distortion (RD) optimized mode selection algorithm is introduced and studied in the context of a motion compensated prediction and DCT-based coding framework. The main advantage of our method as compared to the traditional methods is that it chooses the mode and associated parameters that yield very good tradeoffs between the distortion and the bit rate.

This chapter is divided into five sections. Section 4.1 describes the macroblock-based coding modes in an MPEG-like video coder. The traditional mode selection algorithms used in MPEG-like video coders are also briefly addressed in that section. In Section 4.2, rate-distortion (RD) optimized mode selection is discussed. In Section 4.3 and Section 4.4, two efficient RD optimized mode selection algorithms are introduced. A summary is then given in the last section.

4.1 Multi-mode Video Coding

Typical video sequences contain widely varying content and motion.
Therefore, better coding performance can be achieved if different strategies are permitted to encode different image regions. Currently, the most effective block-based video coders address this problem by employing several modes of operation which are selected on a block-by-block basis. The advantage of the multi-mode approach is that its inherent adaptability lays the foundation for better coding results.

In most block-based video coding standards, such as MPEG-1, MPEG-2, and H.263, the current frame is subdivided into unit regions called macroblocks, which contain a 16×16 luminance block and two or more 8×8 chrominance blocks. As blocks within the macroblock cannot be encoded differently, the macroblock is the smallest coding unit. At the macroblock level, the encoder has the latitude of choosing the coding mode. As such, a given macroblock can be intra-frame coded, inter-frame coded using motion compensated prediction, or simply replicated from the previously decoded frame. For example, inter-frame coding, or block-based motion compensation followed by DCT and quantization of the prediction error, is generally regarded as an efficient means for coding image sequences. On the other hand, intra-frame coding, or DCT coding a macroblock directly, may be more efficient in situations where the block-based translational motion model fails. For static regions of the video, simply copying a portion of the previously decoded frame into the current frame may be preferred. Intuitively, by allowing multiple modes of operation, we expect improved rate-distortion performance if the modes are applied judiciously to different spatial and temporal regions of an image sequence.

4.1.1 Macroblock Coding Mode

The number of allowed coding modes depends mainly on the video coding standard and the video application. For example, in P pictures, the MPEG-2 main profile/main level standard offers 11 modes, with 7 of the 11 modes involving DCT and quantization. However, the coding modes can be classified into 3 principal modes: 1) INTRA, 2) INTER, and 3) INTER+DCT. For the INTRA mode, the macroblock is DCT coded, and only the DCT coefficients and the quantization step are transmitted. The INTRA mode is normally more appropriate in the case of scene changes and occlusions. The INTER mode, which involves only motion compensation, is suitable for input video with a predictable motion field. In the INTER+DCT mode, both motion compensation and DCT residue coding techniques are used.

4.1.2 Conventional Mode Selection Algorithms

The selection of the macroblock coding mode and its associated parameters (e.g., the quantization step) in conventional video coders, such as the MPEG-2 TM5 and the H.263 TMN5 video coders [72, 58], is based on intuition and is not rate-distortion optimized. For example, in the H.263 TMN5 video coder, the INTRA mode is selected if the motion compensation mean square error (MSE) is larger than the variance of the macroblock. Moreover, the quantization step is determined using a simple rate control mechanism. The drawback of this approach is that the mode and its associated parameters are selected without joint consideration of the bit rate and the video reproduction quality.
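As a point of reference, here is a minimal sketch of that kind of heuristic intra/inter test. It is a simplified illustration of the TMN5-style rule quoted above, not the exact reference code.

```python
import numpy as np

def heuristic_mode_decision(mb, prediction):
    """Conventional (non rate-distortion optimized) mode test:
    choose INTRA when the motion-compensated MSE exceeds the
    variance of the macroblock itself, otherwise INTER."""
    mb = mb.astype(float)
    mc_mse = np.mean((mb - prediction.astype(float)) ** 2)
    variance = np.var(mb)
    return "INTRA" if mc_mse > variance else "INTER"
```

Note that no bit count enters this decision, which is precisely the limitation that the RD-optimized selection of the next section removes.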
4.2 Rate-Distortion Optimized Coding Mode Selection

In this section, a Lagrangian formulation is employed to address the problem of selecting the coding mode on a macroblock-by-macroblock basis within each frame of a video sequence. To simplify the problem, we assume here that each frame in the sequence is optimized independently.

Consider an image which is partitioned into $N$ macroblocks $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_N)$. For a multi-mode video coder, each macroblock in $\mathbf{X}$ can be coded using only one of $K$ possible modes given by the set $\mathcal{I}$. Let $M_i \in \mathcal{I}$ be the mode selected to code macroblock $\mathbf{X}_i$. The modes assigned to the elements of $\mathbf{X}$ are then given by the $N$-tuple $\mathcal{M} = (M_1, \ldots, M_N) \in \mathcal{I}^N$. The problem of finding the combination of modes that minimizes the distortion for a given video frame, given a rate constraint $R_c$, can be formulated as

$\min_{\mathcal{M}} D(\mathbf{X}, \mathcal{M}) \quad \textrm{subject to} \quad R(\mathbf{X}, \mathcal{M}) \leq R_c.$   (4.35)

Here, $D(\mathbf{X}, \mathcal{M})$ and $R(\mathbf{X}, \mathcal{M})$ represent the total distortion and rate, respectively, resulting from the quantization of $\mathbf{X}$ with a particular mode combination $\mathcal{M}$. Assuming an additive distortion measure, the cost function $D(\mathbf{X}, \mathcal{M})$ and the rate constraint can be simultaneously decomposed into sums of terms over the elements of $\mathbf{X}$ and re-written using an unconstrained Lagrangian formulation, so that the objective function becomes

$\min_{\mathcal{M}} \sum_{i=1}^{N} J(\lambda, \mathbf{X}_i, \mathcal{M}),$   (4.36)

where $J(\lambda, \mathbf{X}_i, \mathcal{M})$ is the Lagrangian cost function for macroblock $\mathbf{X}_i$ and is given by

$J(\lambda, \mathbf{X}_i, \mathcal{M}) = D(\mathbf{X}_i, \mathcal{M}) + \lambda R(\mathbf{X}_i, \mathcal{M}).$   (4.37)

It is not difficult to show that each solution to Equation (4.36) for a given value of the Lagrangian multiplier $\lambda$ corresponds to an optimal solution to Equation (4.35) for a particular value of $R_c$ [67, 16]. Unfortunately, even with the simplified Lagrangian formulation, the solution to Equation (4.36) remains rather unwieldy due to the rate and distortion dependencies manifested in the $D(\mathbf{X}_i, \mathcal{M})$ and $R(\mathbf{X}_i, \mathcal{M})$ terms.

To simplify this problem, we can assume that the rate and distortion for a given macroblock are impacted only by the content of the subject macroblock and its corresponding mode. As a result, the rate and distortion associated with each macroblock can be computed without regard to the modes selected for the other macroblocks, resulting in a simplified Lagrangian given by

$J(\lambda, \mathbf{X}_i, \mathcal{M}) = J(\lambda, \mathbf{X}_i, M_i).$   (4.38)

In this case, the optimization problem of Equation (4.36) reduces to

$\min_{\mathcal{M}} \sum_{i=1}^{N} J(\lambda, \mathbf{X}_i, M_i) = \sum_{i=1}^{N} \min_{M_i} J(\lambda, \mathbf{X}_i, M_i),$   (4.39)

and, as a result, can be easily minimized by independently selecting the best mode for each macroblock.

The idea of applying an RD criterion to the problem of mode selection was presented by Chung et al. [11] in a subband video framework. Similar criteria were then applied to Telenor's H.263 coder by Wiegand et al. [78, 79], by Chung et al. [13], by Schuster and Katsaggelos [64], and by Lee et al. [49, 51, 50]. An intuitive approach to coding mode decision for MPEG-2 was proposed by Sun et al. [69]. A common important result of the above investigations is that significantly better compression performance can be obtained by using an RD-optimized mode selection criterion.
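Because Equation (4.39) decouples the macroblocks, the selection loop is straightforward. The sketch below illustrates it with a hypothetical encode_macroblock(mb, mode) helper that returns the distortion and bit count of one trial encoding; it is an illustration of the criterion, not the thesis code.

```python
def select_modes(macroblocks, modes, encode_macroblock, lam):
    """Per Equation (4.39): with independent macroblocks, the frame-level
    minimization reduces to choosing, for each macroblock, the mode with
    the smallest Lagrangian cost J = D + lambda * R.

    encode_macroblock(mb, mode) -> (distortion, bits) for one trial encoding
    (a hypothetical stand-in for the actual DCT/quantize/VLC path).
    """
    decisions = []
    for mb in macroblocks:
        best_mode, best_cost = None, float('inf')
        for mode in modes:
            dist, bits = encode_macroblock(mb, mode)
            cost = dist + lam * bits
            if cost < best_cost:
                best_mode, best_cost = mode, cost
        decisions.append((best_mode, best_cost))
    return decisions
```

The exhaustive variant discussed in the next subsection simply extends the inner loop over the quantization step as well.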
4.2.1 Exhaustive Search: Performance Upper Bound

Theoretically, in order to achieve optimum performance, all the modes and associated parameters should be considered (exhaustive search). For block-based, DCT-based video coders, the most important parameter associated with the INTRA and INTER+DCT principal modes is the quantization step $Q$. In fact, an exhaustive search RD optimized mode selection algorithm will test, for the INTRA and INTER+DCT principal modes, all $(M_i, Q_i)$ pairs of a macroblock $\mathbf{X}_i$ to find the minimum Lagrangian $J(\lambda)$ given by

$J(\lambda) = D(M_i, Q_i) + \lambda R(M_i, Q_i).$   (4.40)

However, the above exhaustive search method can be very inefficient for video coding standards supporting many coding modes and parameters. For example, the number of possible combinations of coding modes and associated quantization steps in MPEG-2 main profile/main level for P pictures is 259. In this case, the resulting MPEG-2 encoding time will be 30-40 times longer than that of the MPEG-2 TM5 encoder. However, as shown in Figure 4.15, the associated gain in compression efficiency is 1.5-2.0 dB in PSNR as compared to the MPEG-2 TM5 encoder.

[Figure 4.15: Compression performance comparison between the MPEG-2 TM5 coder and the full-search RD optimized mode and quantization step selection coder for the BUS sequence (top) and the FOOTBALL sequence (bottom).]

To increase computational efficiency, we propose two fast mode-selection algorithms. The first algorithm is based on a thresholding method. For video coding standards with a relatively small number of $(M_i, Q_i)$ combinations, the thresholding method can be very effective in reducing the computational demands. The second algorithm is based on an adaptive finite state machine (FSM) method, which exploits the statistical redundancy among neighboring macroblocks.

4.3 Fast Mode Selection Algorithm I: Thresholding Method

One way to reduce the total number of computations is to employ thresholding techniques that allow the encoder to safely eliminate some INTRA and/or INTER coding options from consideration. For example, the Lagrangian of the "SKIP" mode of operation (the INTER mode with a zero motion vector) can be compared to a varying threshold $T_{skip}$, and if it is smaller than $T_{skip}$, the other modes are no longer considered. These techniques are similar, in principle, to those employed in conventional video encoders, but are here found to be more effective because the bit rate and distortion are more carefully traded. Although practical constraints often dictate that such methods be used, they are unfortunately still ad hoc, as there is no guarantee that the best coding control decision has been made.

The threshold $T_{skip}$ is used to provide a good balance between performance and number of computations. A small value of $T_{skip}$ increases both the computational complexity and the probability of achieving the optimal solution. A large value of $T_{skip}$ reduces the computational complexity, but at the expense of possibly selecting an inferior operating mode. Conventional video coders (e.g., the H.263 TMN5 coder) employ constant thresholds that have been optimized to incorporate the MAD as a cost measure and to favor the predicted motion vector. Although the bit rate and distortion are taken into account simultaneously in our framework, similar thresholds can still be heuristically derived. To improve our estimation of the threshold $T_{skip}$, we also incorporate memory in the form of simple prediction models indicating the level of activity in the video scene.
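A minimal sketch of this kind of SKIP-mode early-out follows; it is an illustration only, with skip_threshold and the full mode search supplied by the caller as hypothetical placeholders.

```python
def mode_decision_with_skip_test(mb, zero_mv_prediction, lam,
                                 skip_threshold, full_rd_search):
    """Try the cheap SKIP option first: if its Lagrangian cost is already
    below the (possibly adaptive) threshold T_skip, accept it and avoid
    the expensive INTRA / INTER+DCT trials."""
    skip_distortion = float(((mb.astype(float)
                              - zero_mv_prediction.astype(float)) ** 2).sum())
    skip_bits = 1                             # roughly the not-coded indication
    j_skip = skip_distortion + lam * skip_bits
    if j_skip < skip_threshold:
        return "SKIP", j_skip
    return full_rd_search(mb)                 # fall back to the full search
```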
The thresholding techniques are very effective for inactive video sequences such as those encountered in video-phone applications. However, for video sequences with active motion the thresholding techniques are less effective, mainly because far fewer macroblocks are encoded in the SKIP mode of operation.

4.4 Fast Mode Selection Algorithm II: Finite State Machine Method

In the proposed fast mode selection algorithm, we still need to test the Lagrangian value $J(\lambda)$ in Equation (4.40) in order to determine the $(M_i, Q_i)$ pair that gives the smallest $J(\lambda)$. However, the testing order of the $(M_i, Q_i)$ pairs depends on their probability of yielding the minimum $J(\lambda)$. We next provide a detailed explanation of the algorithm. To begin, the algorithm chooses the most probable $(M_i, Q_i)$ pair as the candidate and computes the corresponding Lagrangian $J(\lambda)$. The Lagrangian $J(\lambda)$ is then compared to a reference $J_R$, and the comparison result is used to decide whether to terminate the search. If $J(\lambda)$ is smaller than $J_R$, then we stop the search and the corresponding $(M_i, Q_i)$ is selected. If $J(\lambda)$ is larger than $J_R$, then we continue the search and select the second most probable $(M_i, Q_i)$ as the candidate, and so on. We stop the search when 1) the Lagrangian $J(\lambda)$ is smaller than $J_R$, or 2) we exceed a specific computational constraint. Intuitively, if the encoder knew the "correct" probabilities of the $(M_i, Q_i)$ pairs and the "optimal" value of $J_R$ for the present macroblock, then it is very likely that the proposed algorithm locates the "best" $(M_i, Q_i)$ pair, even when aborting the search after just a few comparisons. To summarize, the performance of the proposed fast algorithm depends on two key factors: 1) the probabilities of the $(M_i, Q_i)$ pairs, and 2) the reference $J_R$. These factors are addressed in the following subsections.

4.4.1 Finite-State Machine (FSM)

Ideally, the probability of the $(M_i, Q_i)$ pair can be determined from the conditional distribution function of the mode $M_i$ and the quantization step $Q_i$ given the previously coded macroblocks,

$f(M_i, Q_i \mid \mathcal{X}_{i-1}),$   (4.41)

where $\mathcal{X}_{i-1} = (\mathbf{X}_1, \ldots, \mathbf{X}_{i-1})$ represents the previous $i-1$ coded macroblocks. However, in practice, such a distribution function is very difficult to estimate. Fortunately, as shown in Figure 4.16 for a full-search RD optimized coder, the distributions of the mode and quantization step are very structured. In fact, the conditional distribution function $f(M_i, Q_i \mid \mathcal{X}_{i-1})$ can be approximated by

$f(M_i, Q_i \mid \mathcal{X}_{i-1}) \approx f(M_i, Q_i \mid \mathcal{M}_{i-1}, \mathcal{Q}_{i-1}),$   (4.42)

where $\mathcal{M}_{i-1} = \{M_1, \ldots, M_{i-1}\}$ and $\mathcal{Q}_{i-1} = \{Q_1, \ldots, Q_{i-1}\}$ represent the set of modes and the set of quantization steps, respectively, of the previous $i-1$ coded macroblocks.

To estimate the conditional probabilities $f(M, Q \mid \mathcal{M}_{i-1}, \mathcal{Q}_{i-1})$, a finite state machine (FSM) can be employed. The FSM consists of a mapping rule and a set of $K$ probability tables $\{f_k(M_i, Q_i),\ k = 1, 2, \ldots, K\}$, where $K$ is the number of states in the FSM. The mapping rule assigns a probability table $f_k(M_i, Q_i)$ to each state. In general, the probability table $f_k(M_i, Q_i)$ is created on-line and updated regularly in order to dynamically track the local statistics of the video sequence.
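A minimal sketch of the probability-ordered search described above follows. It is an illustration only: the state mapping, the probability tables, and encode_macroblock are hypothetical stand-ins, and the reference J_R is assumed to be supplied by the method of the next subsection.

```python
def fsm_fast_mode_selection(mb, state, prob_table, encode_macroblock,
                            lam, j_ref, max_trials=8):
    """Test (mode, Q) candidates in decreasing order of their estimated
    conditional probability f_k(M, Q); stop as soon as J < J_R or the
    computation budget (max_trials) is exhausted."""
    # prob_table[state] maps (mode, q) -> estimated probability.
    candidates = sorted(prob_table[state].items(),
                        key=lambda item: item[1], reverse=True)
    best = None
    for (mode, q), _prob in candidates[:max_trials]:
        dist, bits = encode_macroblock(mb, mode, q)     # trial encoding
        cost = dist + lam * bits
        if best is None or cost < best[0]:
            best = (cost, mode, q)
        if cost < j_ref:                                # early termination
            break
    best_cost, best_mode, best_q = best
    return best_mode, best_q, best_cost
```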
4.4.2 The Reference Lagrangian JR

As shown in Figure 4.16, the distribution of the optimum Lagrangian $J^*$ is also very structured. Therefore, the reference Lagrangian $J_R$ can be derived from the optimum Lagrangian $J^*$ values of the neighboring macroblocks.

[Figure 4.16: The optimum Lagrangian $J^*$, the mode, and the quantization step for a frame of the FOOTBALL sequence, obtained using a full-search RD optimized encoder: (a) video frame, (b) optimum Lagrangian $J^*$, (c) mode, (d) quantization step.]

One approach to estimating $J_R$ is to compute the weighted mean of the Lagrangian $J^*$ values of the neighboring macroblocks within a region of support (ROS). That is,

$J_R = \sum_{(i,j) \in \mathrm{ROS}} w_{i,j} \cdot J^*_{i,j},$   (4.43)

where $w_{i,j}$ are the weighting coefficients at position $(i,j)$ in the ROS. The Lagrangian $J_R$ directly affects the speed of the proposed algorithm. Therefore, the encoding speed can be controlled by scaling $J_R$. For example, if we scale down $J_R$, more $(M_i, Q_i)$ pairs will be considered, and the probability of finding the optimum $(M_i, Q_i)$ increases. The cost, however, is a substantial increase in the number of computations.

4.4.3 Implementation Issue

In a typical DCT-based video coder, several modes must be supported, and many quantization steps are allowed. Thus, the memory resources required to implement the FSM described in Section 4.4.1 can be prohibitive. For example, there are 259 $(M_i, Q_i)$ combinations in the P pictures of the MPEG-2 main profile/main level standard. In this case, if three neighboring macroblocks are used to compute the states of the FSM, then $259^3 = 17{,}373{,}979$ states would be generated. Moreover, sorting a probability table with many entries would require a large number of computations, offsetting the reduction in computations that would be achieved using the FSM.

To address the memory and computation problems, we have to decouple the mode and quantization step in the FSM; that is, we assume that the two parameters are independent. In other words, Equation (4.42) can be simplified, yielding

$f(M_i, Q_i \mid \mathcal{M}_{i-1}, \mathcal{Q}_{i-1}) = f(M_i \mid \mathcal{M}_{i-1}) \cdot f(Q_i \mid \mathcal{Q}_{i-1}).$   (4.44)

According to the above equation, the mapping mechanism and the probability tables for the mode and the quantization step can be designed separately, simplifying greatly the algorithm in terms of design complexity, memory requirements, and computational cost.

To further simplify the mode selection algorithm, a layered structure is adopted here. In the new framework, the modes are re-organized in a two-layer hierarchical structure. The highest layer consists of three principal modes: 1) INTRA, 2) INTER, and 3) INTER+DCT. In the lower layer, the modes in each principal mode category reflect variations that are normally dependent on the video coding application. For example, in the MPEG-2 main profile/main level application, two modes belong to the INTRA principal mode: 1) INTRA-FIELD and 2) INTRA-FRAME. Locating the "best" coding mode in the above two-layer structure is a more efficient task, mainly because all the modes that belong to the same principal mode are considered before moving to the next principal mode. Another reason is that, due to the hierarchical structure, the design of the FSM that determines the search order of the principal modes becomes much simpler. For example, in this case, if three neighboring macroblocks are used to compute the states of the FSM, then only $3^3 = 27$ states would be generated.
The smaller number of states significantly improves the speed of state determination and reduces the memory requirement. In addition, due to the smaller number of probability entries for the principal modes, the FSM can estimate the conditional distribution of the principal mode more efficiently. The test order of the modes under each principal mode, depending on the video application, can be fixed or can be dynamically arranged by another FSM.

Another source of computational efficiency is that (1) the distribution of the quantization step is unimodal, and (2) high correlation exists between the quantization steps of neighboring macroblocks. Simulation results, some of which are discussed in Chapter 6, show that $Q_i$ can be predicted accurately from the quantization steps of just a few neighboring macroblocks. For simplicity, we will use the following example to explain the operation of the proposed fast RD optimized mode selection algorithm.

Let us assume that $M_a$, $M_b$, $M_c$, and $M_d$ are the 4 modes in the mode selection process. Among the 4 modes, $M_a$, $M_b$ and $M_c$ involve the DCT. Let the maximum number of quantization steps to be searched be 9 in this case, and let the conditional probabilities of the 4 modes be 0.1, 0.3, 0.4 and 0.2. The search order of the 4 modes then becomes $M_c$, $M_b$, $M_d$ and $M_a$, and the numbers of allocated quantization steps are 5 for $M_c$, 3 for $M_b$, 0 for $M_d$, and 1 for $M_a$. If $Q$ is the quantization step predicted from the neighboring macroblocks, then the 5 quantization steps for mode $M_c$ will be $Q-2$, $Q-1$, $Q$, $Q+1$, $Q+2$. After the 5 quantization steps of $M_c$ are searched, the minimum Lagrangian $J^*$ of mode $M_c$ is compared to the reference $J_R$. If $J^* < J_R$, then the search stops, and mode $M_c$ with the quantization step $Q^*$ that gives $J^*$ is the result. Otherwise, the 3 quantization steps $Q^*-1$, $Q^*$, $Q^*+1$ of mode $M_b$ will be searched, and so on.

4.5 Summary

In this chapter, we introduced an RD optimized mode selection algorithm which significantly improves the compression performance. Several fast implementations were also proposed and discussed. The important result is that, by employing hierarchical structures and exploiting structural dependencies in the video sequence, our mode selection algorithm can be both effective and efficient. In the next two chapters, we will discuss specific implementations of the proposed method in the context of H.263's TMN5 and MPEG-2's TM5 video coding frameworks, respectively.

Chapter 5

Application: H.263 Based Low Bit Rate Video Coding

This chapter introduces an H.263-based video coder which operates at very low bit rates (lower than 16 Kbps) and is based on the algorithms developed in Chapter 3 and Chapter 4. Although the proposed coder is also based on block-based motion compensation and DCT residual coding, it is very different from conventional coders. The proposed video coder employs: 1) a fast median-based predictive motion searching technique, 2) an RD optimized motion estimation algorithm, and 3) an RD optimized coding mode decision algorithm.

This chapter consists of three major sections. In Section 1, a brief introduction to the H.263 coding standard is presented. In that section, we also explain the disadvantages of conventional H.263 coders such as Telenor's TMN5 [72]. In Section 2, we propose a new H.263-based coder that employs the algorithms proposed in the previous chapters. Experimental results and some conclusions are given in the last two sections.
5.1 Motivation and Background

The great demand for very low bit rate video compression has motivated extensive research and standardization activities around the world. Many new compression algorithms have been developed that allow transmission or storage of QCIF resolution video with acceptable quality at bit rates as low as 16 Kbps [74, 24, 53, 1]. Most notable are the H.263 video coders [36], which have been shown recently to perform quite well in comparison with more complex video coders. The H.263 framework was initially adopted by the MPEG-4 [32, 60, 62] group; it is expected to provide a toolkit-based audiovisual coding standard allowing many application-driven functionalities, such as high compression performance and sufficient robustness in error-prone environments, by the end of the year 1998.

[Figure 5.17: Bits per frame versus bit rate: motion vector bit rate, DCT bit rate, and side information for Telenor's coder.]

It is widely accepted that current H.263 compression algorithms perform well for the target applications at bit rates between 20 and 64 kbps relative to coders of the earlier generation, but the compression performance tends to deteriorate rapidly at lower bit rates (such as 8 or 10 kbps). The main reason the compression performance of most H.263 video coders deteriorates at very low bit rates is that, as shown in Figure 5.17 for Telenor's H.263 video coder [72], both the motion vector bit rate and the side information become excessive at such low bit rates. This is due to the following reasons:

• Inadequate macroblock (MB) coding control strategy: the mechanism used for switching between inter and intra coding is mostly ad hoc.

• Independent estimation and coding of the motion vectors: regardless of the target bit rate or quality, estimation and coding are performed independently.

• High motion vector precision: in very low bit rate applications, using half-pel accuracy does not improve the compression performance much.

• Inefficient motion estimation: physical motion is usually very limited, structured, and slowly varying in very low bit rate applications. This property is not well exploited.

In this chapter, we develop an H.263-based video coder that addresses the above issues. Before we present a detailed discussion of our coder, we will first briefly introduce the H.263 coding standard.

5.1.1 The H.263 Standard: Overview

The H.263 standard was developed for video coding for low bit rate audiovisual communication systems. H.263 is based on H.261 and MPEG-1. It employs as input the same luminance/chrominance color space (4:2:0, Y Cb Cr), and only non-interlaced pictures are coded. The macroblock and block definitions and the DCT/IDCT equations are identical to those of H.261 and MPEG-1.

In addition to the H.261 CIF and QCIF image formats (see Table 5.6), H.263 supports one smaller source image format (sub-QCIF) and two larger formats (4CIF and 16CIF). Another difference from H.261 is quantization. Unlike in H.261, quantization is not restricted to change only every 11 macroblocks; it can be changed at the picture, group of blocks (GOB), and macroblock layers. Moreover, H.263 allows half-pel accuracy
sub-QCIF 128 ' 96 64 48 QCIF 176 144 88 72 CIF 352 288 176 144 4CIF 704 576 352 288 16CIF 1408 1152 704 576 Table 5.6: H.263 source picture formats. There is no ability to specify arbitrary image size as can be done in MPEG-2 motion vectors, and the predicted motion vector is computed based on the median fil-ter output of three neighboring macroblock (or block) motion vectors. However, the major departure from H.261 is H.263's use of four optional coding modes as Annexes. First, the unrestricted motion vector mode allows motion vectors that point outside the picture. The edge samples on the picture boundaries are duplicated to create the "not existing pixels" for prediction. Second, the syntax-based arithmetic coding (SAC) mode is recommended as an alternative to Huffman variable length codes. The greater coding efficiency of SAC gives the same picture quality, but with approximately 5% fewer com-• pression bits. Third, in the advanced prediction mode, four motion vectors representing the four luminance 8x8 blocks in a macroblock are allowed. Fourth, in the PB-frames mode, two pictures are coded as one unit. A macroblock in a P-picture is immediately followed by the corresponding macroblock in the B-picture. Clearly, the above modes improve compression performance and increase the coder's flexibility, although of course, at. the expense of additional complexity. 5.1.2 The H.263 Bit Stream Structure Figure 5.18 shows an H.263 video sequence layer structure. The H.263 bit stream syntax is arranged in a hierarchical structure with four layers. Data for each picture consists Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 71 picture GOB GOB GOB header EOS Picture layer GOB header MB MB MB MB (if not first G O B ) ( i f not skipped) macroblock block block block block header 1 6 7 12 ^(if PB-frame mode) Intra DC run-level VLC EOB (if intra) Group of blocks layer Macroblock layer Block layer Figure 5.18: H.263 video sequence layers. of a picture header followed by data for a Group of Blocks, eventually followed by an end-of-sequence code (EOS) and stuffing bits. The header contains the start code for re-synchronization and the information about the picture size, temporal reference, quan-tization step, the type information for the four optional coding modes, etc. Data for each Group of Blocks (GOB) in a GOB layer consists of a GOB header followed by data for macroblocks. Each GOB contains one or more rows of macroblocks. For the first G O B , no GOB header is transmitted. The GOB header consists of the start code for re-synchronization, the GOB number, quantization step, etc. Data in the macroblock (MB) layer consists of a macroblock header followed by data, for the six 8x8 blocks. An M B is comprised of four luminance (Y) blocks and one of each of the two chrominance blocks (Cr, Cb). When the PB-frame mode is selected, the M B in a P-picture and the M B in a B-picture are coded as one unit, in this case an M B layer contains 12 blocks. The header contains the macroblock type, the coded block pattern, the motion vectors, etc. A detailed discussion about the M B layer will be given in the Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 72 following section. In the block layer, if an intra M B is coded, the DC coefficient in the block is trans-mitted first. It is followed by as many run-level variable length codes (VLC) as needed to code the nonzero coefficients. For an inter M B , the DC coefficient is coded with the A C coefficients. 
The last nonzero coefficient is coded as the last for the block. 5.1.3 H.263 Macroblock Layer In H.263, the bit rate representing a macroblock can be divided into three types: (1) the header (or the side information) bit rate Rs , (2) the motion vector(s) bit rate Rm, and (3) the D C T bit rate Rc. The side information will be used to indicate the mode type of the encoded macroblock, and if necessary, it will contain some extra information such as the quantization step and the coded pattern of the 6 (8x8) D C T blocks. The motion vector bits represent the motion vector(s), and the D C T bits represent the D C T coefficients of the 6 (8x8) blocks. In Table 5.7 we show all the data elements (except the DCT) used in the macroblock layer, where the symbol " X " in the table denotes that the element of the correspond-ing M B type is used. Thus, Rs will be the normalized sum of the bits of COD (coded macroblock indication), M C B P C (macroblock type and coded block pattern for chromi-nance), C B P Y (coded block pattern for luminance), and the D Q U A N T (quantizer infor-mation). The bit rate Rm will be the normalized sum of the bits of M V D and MVD2-4 (motion vector data). The bit rate Rc will then be the normalized sum of the bits of D C T coefficients of the coded 8x8 blocks. I Picture From Table 5.7, we can see that for I-pictures, two modes are available: I N T R A and I N T R A + Q . The difference between them is that in the case of the I N T R A + Q mode, the Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 73 Picture M B Type Side Information Rs Motion Vector Rm COD M C B P C C B P Y D Q U A N T M V D MVD2-4 P-picture Not-Coded X I N T E R X X X X INTER+Q X X X X X INTER4V X X X X X I N T R A X X X I N T R A + Q X X X X I-picture I N T R A X X I N T R A + Q X X X Table 5.7: Coded symbols for Rm and Rs for various coding modes, encoder will send 2 bits for D Q U A N T (the difference in the quantization step). P Picture • For P-picture, there are six modes of operation denoted by Not-Coded, INTER, IN-T E R + Q , INTER4V, I N T R A , and INTRA+Q. In the Not-Coded mode, the COD pa-rameter is set to "1" and the current macroblock is replaced by the macroblock at the same spatial location in the previous reconstructed picture. In this case, only the COD parameter needs to be coded. This mode is designed for areas in the picture where little or no change relative to the previously reconstructed picture is detected. In both the. INTER and the INTER+Q modes of operation, one motion vector is transmitted along with the D C T coefficients of the prediction difference blocks. The difference between the INTER and INTER+Q modes of operation is that, for the latter, the value of Q U A N T is updated. This is often required to compensate for prediction inaccuracies. The mode INTER4V is similar to the'mode INTER, except that four motion vectors representing -four 8x8 blocks are transmitted. The mode INTER4V is found to be useful for pic-ture areas with high motion activity. Due to noise, occlusion, zoom, large illumination Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 74 changes, or complex motion activity, simple translational motion-compensated prediction' may be inadequate. In such a case, operating in the INTRA or I N T R A + Q modes may be beneficial. In the INTRA+Q mode, the parameter Q U A N T is changed, usually to compensate for illumination variations. 
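To make the bookkeeping of Table 5.7 easier to follow, the sketch below tabulates, in code form, which syntax elements contribute to the side-information rate Rs and the motion-vector rate Rm for each P-picture coding mode. It is only an illustration of the table: the enum and struct names are ours and are not part of the H.263 syntax.

```c
#include <stdbool.h>
#include <stdio.h>

/* P-picture macroblock modes of Table 5.7 (names are ours). */
typedef enum { NOT_CODED, INTER, INTER_Q, INTER_4V, INTRA, INTRA_Q } MBMode;

/* Which syntax elements are present for a given mode (an "X" in Table 5.7). */
typedef struct {
    bool cod, mcbpc, cbpy, dquant;  /* contribute to the side-info rate Rs */
    bool mvd, mvd2_4;               /* contribute to the motion rate Rm    */
} MBSyntax;

static MBSyntax p_picture_syntax(MBMode m)
{
    MBSyntax s = { false, false, false, false, false, false };
    s.cod = true;                          /* COD is sent for every P-picture MB */
    if (m == NOT_CODED) return s;          /* only COD is coded                  */
    s.mcbpc = true; s.cbpy = true;         /* all coded modes carry MCBPC, CBPY  */
    s.dquant = (m == INTER_Q || m == INTRA_Q);
    s.mvd    = (m == INTER || m == INTER_Q || m == INTER_4V);
    s.mvd2_4 = (m == INTER_4V);            /* three extra vectors for INTER4V    */
    return s;
}

int main(void)
{
    const char *names[] = {"Not-Coded","INTER","INTER+Q","INTER4V","INTRA","INTRA+Q"};
    for (int m = NOT_CODED; m <= INTRA_Q; ++m) {
        MBSyntax s = p_picture_syntax((MBMode)m);
        printf("%-9s COD:%d MCBPC:%d CBPY:%d DQUANT:%d MVD:%d MVD2-4:%d\n",
               names[m], s.cod, s.mcbpc, s.cbpy, s.dquant, s.mvd, s.mvd2_4);
    }
    return 0;
}
```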
5.2 Proposed Coder To address the issues discussed in Section 5.1 we propose an H.263 video coder based on the algorithms introduced in the previous two chapters. The new H.263 video coder employs a rate-distortion (RD) based mechanism (see Chapter 4) to select amongst the H.263's M B coding types. An RD criterion for alternating between intra and inter coding modes was first presented by Chung, et al.[ll] in a subband video framework. Similar criteria were then applied to Telenor's H.263 coder by Wiegand et al [79], by Chung et al [13], and by Schuster and Katsaggelos [64]. A common important result of the above investigations is that significantly better RD tradeoffs can be obtained at very low bit rates using an RD mode selection criterion. Unique to our RD-based selection criterion, .however, is that it is also dependent, through the use of multi-path searching,'on the performance of the D C T residual coder. Our proposed coder also employs a fast median-based predictive integer-pel motion estimation technique that minimizes the Lagrangian, i.e., the distortion1 biased by the number of required bits. Recently, linear prediction has been proposed [3] to reduce the large computational complexity associated with block matching algorithm (BMA) based motion estimation. The application of several linear and nonlinear prediction techniques to BMA-based motion estimation is studied in [43]. We here study two types of median predictors: the H.263-specified 3-block median predictor and another 5-block median 1 Although other distortion measures can be used, only the popular M S E and M A D will be considered. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 75 predictor. The Lagrangian minimization provides a more efficient cost measure that has also recently [27] been studied for motion compensation in the context of the H.261 framework. Our strategy, however, is based on the work discussed in [11, 13], where the RD-based cost measure is analyzed in the context of a subband video coding framework. Finally, the proposed algorithm employs a 3-layer diamond-shaped searching area centered at the most likely motion vector, assuming given a prediction model (e.g. a. median predictor). The proposed method is motivated by the fact that, for very low rate video coding applications, physical motion is usually very limited, structured, and slowly-varying. 5.2.1 I-picture M B Coding Strategy As specified in the H.263 standard, intra coding of I-pictures consists of 8x8 D C T coef-ficients calculation, uniform quantization, and then run-level and variable-length coding (or arithmetic coding). Left to the designer, however, is the problem of determining the value of the quantization of the D C T coefficients. The only two requirements are that (1) the value of Q U A N T can be changed and transmitted only at the picture and/or the group-of-blocks (GOB) layers and (2) the four possible values of D Q U A N T (2 bits) are used to adjust the value of Q U A N T at the M B layer. We follow the H.263 approach, where Q U A N T of the first macroblock of the first I-picture is initially set to the middle value (i.e., QUANT=16), and Q U A N T of each other macroblock is set during the encod-ing process to the value of Q U A N T of the previous macroblock. 
Then, given a Lagrangian parameter λ that controls the RD tradeoffs, the value of QUANT can be adjusted by one of the 4 possible values of DQUANT so as to minimize the Lagrangian J(λ) = D + λ(R_s + R_c), where R_c and D are the number of bits and the distortion (MSE or MAD) associated with the corresponding DCT coder, respectively, and R_s is the side information as shown in Table 5.7. Next, we compare the smallest such J(λ) with the Lagrangian obtained in the case where DQUANT = 0, and select the MB type that corresponds to the smaller value.

5.2.2 P-picture MB Coding Method

For simplicity, PB-frames are not used in this work. For H.263's P-pictures, the basic coding operation consists of: 1) motion estimation and compensation, 2) MB type determination, 3) the possible coding of the motion vector(s), and 4) the possible DCT coding of the prediction difference blocks. Assuming that the macroblocks within a P-picture are statistically independent, the optimal macroblock coding control strategy is the one that yields the best RD tradeoffs. More specifically, we should seek the motion vector d = (x, y) (if any), the quality factor (QUANT) Q, and the mode M that minimize the Lagrangian

J_\lambda(d, Q, M) = D + \lambda\,[R_m + R_c + R_s].   (5.45)

In the above equation, D is the overall distortion, R_m is the number of bits needed to code the motion vector(s), R_c is the number of bits required for DCT coefficient coding, and R_s is the number of bits associated with the side information.

Direct Method

It is clear from Table 5.7 that six Lagrangian values must be computed in order to find deterministically the coding mode that yields the lowest Lagrangian value. However, this is generally impractical, especially as the INTER modes of operation involve the joint optimization of the motion vector estimation/coding and the DCT coding of the corresponding prediction difference blocks. That is, for every motion vector candidate, the DCT coder must be applied to the difference block(s) in order to evaluate both R_c and D. Moreover, as can be seen from equation (5.45), even the evaluation of D must be performed for all values of DQUANT. The latter part can be performed efficiently since the DCT is a unitary transformation (i.e., only one forward DCT needs to be evaluated for each 8x8 block). However, although the motion vector search area can be reduced substantially as described in the next subsection, the evaluation of the INTER Lagrangians can still be computationally demanding.

Two Stage Method

An alternative way to reduce the computational complexity is to decouple the motion vector coding and DCT coding processes so that the Lagrangian minimization is applied sequentially. More specifically, we first locate the motion vector(s) that yield(s) the minimum Lagrangian, which is formed by the estimation distortion biased by the number of bits required for the coding procedure. The corresponding difference blocks are then computed and DCT coded. Finally, the resulting average values of bit rates and distortions, along with the values of R_s and R_m, are used to compute the INTER mode Lagrangians. Of course, this procedure is sub-optimal because of its sequential search structure. We found that small gains could be achieved by employing multi-path searching. In M-search, M paths are considered as good candidates for the next stage.
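A minimal sketch of this two-stage, multi-path search is given below. The candidate list, the M-best selection, and the full_lagrangian() callback (which stands in for the DCT residual coder and the evaluation of (5.45)) are our own constructions, not code taken from the coder itself.

```c
#include <float.h>

typedef struct { int dx, dy; double j_stage1; } MVCandidate;

/* Stage 1: keep the M motion vectors with the smallest motion-only
 * Lagrangian  J = D_est + lambda * R_mv  (the cost of Eq. 5.46).           */
static void keep_best_m(MVCandidate *kept, int m,
                        const MVCandidate *cand, int n_cand)
{
    for (int i = 0; i < m; ++i) {
        kept[i].dx = kept[i].dy = 0;
        kept[i].j_stage1 = DBL_MAX;
    }
    for (int i = 0; i < n_cand; ++i) {
        /* insert cand[i] into the sorted "kept" list if it is good enough */
        for (int k = 0; k < m; ++k) {
            if (cand[i].j_stage1 < kept[k].j_stage1) {
                for (int s = m - 1; s > k; --s) kept[s] = kept[s - 1];
                kept[k] = cand[i];
                break;
            }
        }
    }
}

/* Stage 2: DCT-code the difference blocks of each surviving candidate and
 * keep the one with the smallest overall Lagrangian D + lambda*(Rm+Rc+Rs).
 * full_lagrangian() stands in for the residual coder and is an assumption. */
typedef double (*FullCostFn)(int dx, int dy, double lambda);

static MVCandidate m_search(const MVCandidate *cand, int n_cand, int m,
                            double lambda, FullCostFn full_lagrangian)
{
    MVCandidate kept[8];          /* assumes m <= 8 */
    keep_best_m(kept, m, cand, n_cand);
    MVCandidate best = kept[0];
    double best_j = DBL_MAX;
    for (int k = 0; k < m; ++k) {
        if (kept[k].j_stage1 == DBL_MAX) continue;
        double j = full_lagrangian(kept[k].dx, kept[k].dy, lambda);
        if (j < best_j) { best_j = j; best = kept[k]; }
    }
    return best;
}
```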
In M-search, then, more than one motion vector is generally considered as a good candidate, and DCT coding is applied to each of the corresponding difference macroblocks.

RD Optimized Motion Estimation

For each macroblock and its four 8x8 luminance blocks in the current picture, a motion vector d = (x, y) ∈ S, where S is the set of all possible vectors in the search area, is sought. Each motion vector is chosen to minimize the Lagrangian

J_\lambda^m(d) = \sum_{r \in W} \rho\big( I(r, n) - \hat{I}(r + d, n - 1) \big) + \lambda\,R_m(d),   (5.46)

where r is the (x, y) spatial index of an image pixel, n is the time index, I(r, n) is the image intensity of the candidate macroblock/block in the current picture, \hat{I}(r + d, n - 1) is the image intensity of the matching macroblock/block in the previous reconstructed picture, W is the matching window, ρ(·) is the square operation (MSE) or the absolute operation (MAD), and R_m(d) is the number of bits required to encode the motion vector. Minimizing J_\lambda^m(d) is equivalent to the process of full-search entropy-constrained vector quantization [18, 40, 10, 9], where the codebook contains all 16x16 (or 8x8) vectors in the search area. After determining the best motion vector(s) d* representing the macroblock or each of the four 8x8 blocks, either one or four motion vectors, depending on which yields better RD tradeoffs, are selected.

Motion Vector Accuracy

Many experimental results have shown that using half-pel accuracy motion estimation in our RD framework yields an insignificant improvement or even a decrease in compression performance relative to integer-pel accuracy motion estimation at very low bit rates. Thus, only integer-pel accuracy is used; this slightly reduces the computational complexity. The latter can be reduced dramatically if the size of the set S and/or the size of the matching window W is reduced. Several techniques have been proposed [24, 5] that achieve this goal, at the expense of some loss in estimation performance. We here introduce a median-based predictive searching technique that is similar in principle to the statistical-based predictive techniques described in [11, 12, 13, 41], but is simpler in concept and implementation.

Motion Prediction

If compatibility with current H.263 decoders is desired, then the independent median-based prediction and motion vector difference variable-length coding methods described in the H.263 standard must be followed. However, we here propose an alternative method that is generally more robust to channel noise. Figure 5.19 compares the regions of support (ROS) used by the H.263-specified prediction model and ours.

Figure 5.19: The regions of support used by H.263 and our coder.

Not only does the 5-block ROS, which includes 4 spatially and 1 temporally neighboring macroblocks/blocks, provide more prediction accuracy, but it also leads to a motion vector coder output that is more resilient to channel errors. In fact, it can be easily verified that, using the 5-block ROS, only three or more erroneously decoded motion vectors (as compared to two or more of them using the 3-block ROS) can cause error propagation.
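For illustration, a sketch of the 5-block median prediction over the ROS of Figure 5.19 is shown below. The component-wise median and the zero-vector handling of out-of-frame neighbours are conventional choices assumed here, not details taken from the text.

```c
#include <stdlib.h>

typedef struct { int x, y; } MV;

/* Median of five integers: sort a small copy and take the middle element. */
static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static int median5(const int v[5])
{
    int tmp[5];
    for (int i = 0; i < 5; ++i) tmp[i] = v[i];
    qsort(tmp, 5, sizeof tmp[0], cmp_int);
    return tmp[2];
}

/* Component-wise median over the 5-block ROS of Figure 5.19: the left,
 * above, above-right and above-left neighbours of the current frame plus
 * the co-located block of the previous frame.  Out-of-frame neighbours are
 * assumed to be set to the zero vector by the caller.                      */
static MV predict_mv_5block(MV left, MV above, MV above_right,
                            MV above_left, MV colocated_prev)
{
    int xs[5] = { left.x, above.x, above_right.x, above_left.x, colocated_prev.x };
    int ys[5] = { left.y, above.y, above_right.y, above_left.y, colocated_prev.y };
    MV pred = { median5(xs), median5(ys) };
    return pred;
}
```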
Diamond Shaped Search Method Our technique exploits the fact that, in very low bit rate applications, physical motion is usually very limited, structured, and slowly-varying. As illustrated in Figure 5.20, we first determine the most likely motion vector given a prediction model. Then, only the candidate motion vectors in the diamond-shaped small search area centered at the most likely motion vector is considered. Depending on the prediction model, the bit rate of operation and the content of the video scene, we found that full-search yields only a few (2 — 5 %) motion vectors that do not belong to the diamond-shaped search area. Thus, Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 80 y V Median-predicted vector R Search area X Figure 5.20: The search area used during motion estimation. such a technique can reduce substantially the number of computations at the expense of only a small loss in estimation performance. M B Mode Selection Criterion After motion estimation in the two-stage method, the encoder has to determine the mode of the M B , along with the value of the quantization of the D C T coefficients. As discussed in Chapter 4, the "best" M B mode selection is the one that yields the best RD 'tradeoffs. More specifically, we should seek the mode M , and the quality factor (QUANT) Q, given the motion v'ector(s) obtained in the motion estimation stage that minimize the Lagrangian JX(M,Q) = D + \[Rm + Rc + Rs}. • (5.47) Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 81 The procedure of mode selection is similar to the one discussed in Section 5.2.1. To find the minimum J\(M, Q), the encoder has to test six possible modes for a macroblock (see Table 5.7), and select the mode (and associated Q U A N T value) that corresponds to the smallest Lagrangian value. Threshold T One way to reduce the total number of computations is to employ thresholding techniques that allow us to safely eliminate the INTRA and (especially) I N T E R coding options from consideration. For example, the Lagrangian of the N O T - C O D E D mode of operation can be compared to a varying threshold T, and if it is smaller than T, the other modes are no longer considered. These techniques are similar, in principle, to those employed in Telenor's H.263 coder, but are here found to be more effective because the bit rate-and distortion are more carefully traded. Although practical constraints often dictate that such methods be used, they are unfortunately still ad-hoc as there is. no guarantee that the best coding control decision has been made. The threshold T is used to provide a good balance between performance and num-ber of computations. A small value of T increases both the computational complexity and the probability of achieving the optimal solution. A large value of T reduces the computational complexity but at the expense of possibly selecting an inferior operating mode. The H.263 standard suggests some constant thresholds that have been optimized to incorporate the M A D as a cost measure and to favor the predicted motion vector. Although the bit rate and distortion are taken into account simultaneously in our frame-work, similar thresholds can still be heuristically derived. To improve our estimation of the threshold T, we also incorporate memory in the form of simple prediction models indicating the level of activity in the video scene. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 82 5.2.3 Rate Control The Lagrangian parameter A controls mainly the RD tradeoffs. 
One method to find a particular value for λ is to employ the bisection search algorithm described in [15, 61]. This method, however, is usually computationally demanding, as a large number of iterations may be required. An alternative method is to update λ using least-mean-squares (LMS) adaptation [26], as adopted in [79]. In this work, the parameter λ is initially estimated based on computed long-term statistics. It is then varied adaptively during the encoding process depending on a bit rate and/or a quality constraint, following the simple recursive control algorithm discussed in detail in [43]. Without loss of generality, let us assume that our encoder's output bits are stored in a buffer of size S_max, and are then transmitted over a fixed-rate communication channel. The system is governed by s(t + 1) = s(t) + R(t) - γ, where t denotes time, s(t) is the buffer occupancy, R(t) is the variable output bit rate of the encoder, and γ is the channel bit rate. In our case, the rate control algorithm is performed after coding one frame. Then, assuming a linear feedback control model and a desired buffer occupancy of s* = S_max/2, the Lagrangian parameter λ(t) at time t can be obtained by the recursion formula [43]

\lambda(t) = \lambda(t - 1)\,\frac{s(t)}{s^*}.   (5.48)

Updating λ based on the above recursion formula is found to be an efficient solution. In fact, relative to the more complex algorithms described in [8], the resulting RD control algorithm is very simple, yet its performance is good.

5.3 Experimental Results

The following experimental results illustrate the computational complexity and performance of the proposed H.263-based video coding algorithm at very low bit rates. As in Telenor's H.263 coder, motion estimation and compensation is performed only for the Y luminance component. The estimated motion vector field is subsequently used for the motion compensation of the Cr and Cb chrominance signals. However, only integer-pel accuracy motion estimation is allowed in our framework.

Figure 5.21: The PSNR improvement due to the rate-distortion-based MB coding control strategy. Both the MSE and the MAD distortion measures are used.

We compare the coder's performance with that of Telenor's H.263 video coding implementation [72] using the QCIF test sequences MISS AMERICA and CAR PHONE. For fairness, the options/parameters of both coders are set to the same values. For example, neither implementation employs PB-frames. Moreover, the frame rate is set to 10 frames per second, advanced prediction is used, and the unrestricted motion vector mode is selected. Finally, the MSE is used as our coder's distortion measure in most experiments. The only exception is the first set of experiments, where both the MAD and the MSE are employed.

Figure 5.22: Motion vector bit rate, DCT bit rate, and side information for our coder.

Figure 5.21 compares the average peak-signal-to-noise-ratio (PSNR) results of our coder with those of Telenor's H.263 coder for 150 frames of the color test video sequence MISS AMERICA at bit rates between 4 and 10 kilobits per second (kbps). The only difference between the two coders is the MB coding control strategy.
We select the best mode of operation based on an RD criterion, as described previously. As seen from Figure 5.21, our strategy yields a significant improvement in PSNR, especially at very low bit rates. This is expected because while the motion bit rate in Telenor's H.263 coder is nearly constant (see Figure 5.17), it is reduced at lower bit rates (see Figure 5.22) by using the RD-based mode selection criterion. •. Such a criterion yields a more efficient allocation of bits amongst the coder's components. It is, however, more demanding in terms of computations. Nevertheless, by applying the thresholding techniques discussed earlier, we have obtained an insignificant loss in PSNR while requiring only 20% to 30% of the number of computations needed to determine all six Lagrangian values. Figure 5.21 also shows that, when the M A D is substituted for the MSE, the PSNR gain (in dB) is reduced by 10%-25%. Since the PSNR is inversely proportional to the Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 85 PSNR vs. Rate 37.5 37 36.5 36 | 3 5 . 5 tr <n 35 a. 34.5 34 33.5 { 33' 32 5 3000 4000 5000 6000 7000 8000 9000 10000 Rate (bps) Figure 5.23: The PSNR loss clue to the new motion vector estimation/coding method. MSE, minimizing the M S E is equivalent to maximizing the PSNR. Thus, using a non-M S E measure such as the M A D generally yields a slightly lower PSNR. The advantage of using the M A D , however, is that the computational complexity is significantly reduced. Figure 5.23 illustrates that our fast integer-pel accuracy motion vector estimation and coding compare very favorably in terms of PSNR with the full-search motion estimation and variable-length coding used in the earlier implementation. Moreover, as demon-strated earlier and confirmed by our simulations, our technique reduces the number of computations required for motion estimation by more than one order of magnitude. This is an important feature because motion estimation usually requires the lion's share of. the computational load. Also illustrated in Figure 5.23 is the fact that the new diamond shape searching window yields an insignificant loss in PSNR. As a practical solution to the joint motion vector estimation/coding and D C T residual coding, our proposed M-search technique (M = 3) improves the PSNR performance, as is shown in Figure 5.24. However, the improvement is not substantial. The slight PSNR i i 1 motion vector estimation/coding) Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 86 37.5 -37 -36.5 -36 -f 35.5-CC M 35 -Q. 34.5 -34 -i 33.5 -. 33 -32.5L 300 Figure 5.24: The PSNR improvement due to the M-search technique. gain suggests that, at very low bit rates, the M S E used as part of the Lagrangian cost measure is a good approximation of the 8 x'8 D C T coder's MSE. Next, we present simulation results that illustrate the effectiveness of the feedback technique used to control.the bit rate through varying the parameter A. For comparison," in Figures 5.25 we also show the bit rate and PSNR profiles of our coder when a constant value for A is used. Notice that the PSNR in Figure 5.25 is nearly constant for all frames, which indicates that it is easier to control the output quality level through the parameter A. This can be beneficial in several applications where a constant quality of service is desired. Figures 5.26 shows the bit rate and PSNR profiles of our coder when a constant bit rate is placed as a constraint on the encoding algorithm. 
Although a simple model (see Equation (5.48)) is used, the feedback control technique is very effective as shown in Figure 5.26. It shows that' the proposed algorithm keeps the buffer fullness around 10 Kbits through varying the parameter A. However, Figure 5.26 also indicates that it is more difficult to control the bit rate than to control the the output quality via the A. PSNR vs. Rate —1 1 1 1 1 1 I 4000 5000 6000 7000 8000 9000 10000 Rate (bps) Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 87 900 800 700 600 500 3 i 400 300 200 100 40 39 38 37 ^ 3 6 ST rr 35 z CO 0_ 34 33 32 31 30 (a) BIT RATE (BITS) Bits vs. Frames 50 100 Frame Index (b) P S N R (DB) PSNR vs. Frames . .Miss America.. 35.5 (dB) 50 100 • Frame Index 150 Figure 5.25: The rate-distortion profile for our coder: (a) Bit rate and (b) PSNR for 150 frames of the test sequence MISS AMERICA. The distortion and bit rate is controlled through, constant parameter A. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 1.6 sz 1.2 0.6 0.4 0.2 (a) B I T R A T E ( B I T S ) Buffer Length vs. Frame ..Miss. America. 8 kbps Full Buffer length 20 kbits 50 100 Frame. Index 150 (b) P S N R (DB) PSNR vs Frame 50 100 Frame Index 150 Figure. 5.26: The rate-distortion profile for our coder: (a) Bit rate and (b) PSNR for 150 frames of the test sequence MISS A M E R I C A . The buffer state is used to control the bit rate through varying the parameter A. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 89 To reduce the likelihood of overflow/underflow, a sufficiently large buffer should be used. However, if the buffer size is constrained, employing more sophisticated feedback control techniques or incorporating memory into the bit rate control algorithm [8] may provide better solutions. Figures 5.27(a) and 5.27(b) illustrate the overall PSNR improvement of our coder over Telenor's H.263 coder for the two test sequences MISS AMERICA and CAR PHONE, respectively. Our coder differs from Telenor's in that it incorporates RD-based control, thresholding, median-based predictive motion estimation, diamond shape searching win-dow, and M-search. Not only does our coder outperform Telenor's, especially at the lower bit rates, but our coder's computational complexity is also lower. Notice that, although diamond shape searching window is employed, we still obtain a significant improvement in PSNR performance. But more important is the fact that our coder's subjective qual-ity is superior to that of Telenor's H.263 coder. For example, when many viewers were presented with 150 frames of the decoded color sequence MISS A M E R I C A , which was encoded using the new coder at 4 kbps, they all reported that the subjective quality is either acceptable or good. When presented with another 150 frames produced by de-coding Telenor's H.263 output bit stream, the viewers stated that the quality is both unacceptable and inferior to our coder's subjective quality. Finally, to' illustrate the belmvior of our coder at higher bit rates, the x-axis of Figure 5.27(a) is extended to 16 kbps and that of Figure 5.27(b) is extended to 40 kbps. As the proposed techniques are designed for the very low bit rate range of operation, it should not be surprising that the PSNR gap decreases with increasing bit rate. For example, the RD-based M B mode selection criterion makes better use of the available bits, but the improvement is significant only when the bit budget is limited. 
Another example is |-pel motion estimation accuracy, which becomes beneficial at higher bit rates, but is not used in this framework. However, most of the proposed techniques can be optimized so that Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 90 (a) MISS A M E R I C A PSNR vs. Rate 1 Miss America 150 frames Y. Cb, Cr x H.263 o Our coder 4000 6000' 8000 10000 12000 14000 16000 Rate (bps) (b) CAR PHONE PSNR vs. Rate <r 33 1 :Car Phone :101 frames y , . C b , C r . . . . x H.263 o Our code 1 i i i -. Q I 1  1 1 1 1 -. I 0.5 1 1.5 2 2.5 3 3.5 4 Rate (bps) Figure 5.27: Overall comparison between our coder and Telenor's coder for the sequences (a) MISS A M E R I C A and (b) CAR PHONE at bit rates between 3 and 16 kbps, and between 5 and 40 kbps, respectively. Chapter 5. Application: H.263 Based Low Bit Rate Video Coding 91 the performance improvement is large over a wide range of bit rates. This is subject for further research. .' • 5.4 Summary We have presented a new H.263-based video coder that provides an RD-based mechanism for alternating amongst H.263's modes of operation, and employs median-based predictive motion estimation and coding. For motion estimation, a diamond-shaped search window is employed, sacrificing only a small loss in compression performance for a significant reduction in computation complexity. Our coder outperforms Telenor's H.263 coder in compression performance and it also offers the user more control over the bit rate, quality, and number of computations. Although we have not addressed many other issues such as content-based scalability and temporal random access, we believe that the proposed techniques provide some efficient solutions to very low bit rate video coding application. Chapter 6 Application: M P E G - 2 Video Coder This chapter introduces a new MPEG-2 compliant video coder. It is different than conventional coders in that it employs: 1) a computation-distortion (CD) optimized predictive motion estimation algorithm, and 2) a rate-distortion (RD) optimized mode selection algorithm. Experimental results show that the proposed motion estimation and mode selection algorithms yield substantially higher encoding speed and compression , performance than those of the MPEG-2's Test Model TM5 coder [35]. This chapter consists of five sections. Section 1 will briefly review the MPEG-2 video coding standard and its coded bit stream structure. In Section 2, we will discuss the implementation of the computation-distortion optimized motion estimation algorithm, described in Chapter 3, in the context of the MPEG-2 framework. In Section 3, we will discuss the implementation of the rate-distortion (RD) optimized mode selection algo-rithm, described in Chapter 4, also in the context of the M P E G - 2 framework. Simulation results and summary are presented in the last two sections. 6.1 M P E G - 2 Overview MPEG-2 [34] has been a very successful audiovisual communication standard. Decoders, that claim conformance to it have already been produced by the millions, and many new MPEG-2 applications have already been created. MPEG-2 is aimed at a wide range of applications such as television broadcasting, digital storage media and digital high-definition T V (HDTV) . Like H.261 [73] and MPEG-1 [47], the M P E G - 2 video standard is 92 Chapter 6. 
Application: MPEG-2 Video Coder 93 HIGH 4:2:0 1920x 1152 80 Mb/s l,P,B 4:2:0 4:2:2 1920 X 1152 100 Mb/s l ,P,B HIGH 1440 4:2:0 1440x 1152 60 Mb/s l,P,B 4:2:0 1440 x 1152 60 Mb/s l ,P,B 4:2:0 4:2:2 1 4 4 0 x 1 1 5 2 80 Mb/s l,P,B MAIN 4:2:0 720 x 576 15 Mb/s l,P 4:2:0 720 x 576 15 Mb/s l ,P,B 4:2:0 720 x 576 • 15 Mb/s l ,P,B 4:2:0 4:2:2 720 x 576 20 Mb/s l ,P,B L O W 4:2:0 352 x 200 4 Mb/s l,P,B 4:2:0 352 x 200 4 Mb/s l,P,B Level Profile S I M P L E MAIN S N R S P A T I A L HIGH Figure 6.28: MPEG-2 Profile and Level based on motion compensated prediction and D C T residual coding. However, MPEG-2 offers a more efficient means to code interlaced video signals and supports scalable video coding. The readers should be aware that MPEG-2 is a very complex standard. The description we provide in the following subsections is only an overview of its video part. 6.1.1 Profile and Level As suggested by its title, Information technology - Generic coding of moving pictures and associated audio information, the M P E G specification is intended to' be generic in the sense that it serves a wide range of applications. For example, coded MPEG-2 bit rates of up to 400 Gigabits/s and picture sizes up to 16,000 by 16,000 pixels can be defined. In order to cope with the great variety of video applications within one coding standard, MPEG-2 adopts a "toolkit-like" approach; that is, MPEG-2 is a collection of tools which Chapter 6. Application: MPEG-2 Video Coder 94 satisfies the requirements of specific major applications. The range of coding support provided by MPEG-2 is divided into profiles and levels. For each profile/level, MPEG-2 provides the syntax for the coded bit stream and the decoding requirements. A profile is a defined subset of the entire bit stream syntax specified by MPEG-2 . The MPEG-2 profiles can be divided into two categories: non-scalable and scalable. The two non-scalable profiles are the Simple and Main profiles. The three scalable profiles are the SNR scalable, Spatially scalable, and High profiles. Within a profile, a level is defined as a set of constraints imposed on the parameters of the bit stream. For each profile, the four levels are the Low, Main, High-1440, and High levels. The constraints of the coding parameters for the various profile/level combinations are shown in Figure 6.28. Among its profile/level combinations, the most widely used combination in the M P E G -2 industry is the'main profile/main level. Therefore, in this chapter, we will concentrate on main profile/main level. Moreover, for simplicity, only I- and P-frames will be used in our simulation. 6.1.2 Fields, Frames and Picture MPEG-2 supports both progressive and interlaced video (see Figure 6.29). For interlaced video, fields can be coded separately (field pictures) or 2 fields can be interleaved and coded as one picture (frame picture). Both frame and field pictures may be used in a single video sequence. MPEG-2 defines three types of coded frames: I-frames, P-frames, and B-frames. A coded I-frame consists of an I-frame picture, a pair of I-field pictures, or an I-field picture followed by a P-field picture. A coded P-frame consists of a P-frame picture or a pair of P-field pictures. A coded B-frame consists of a B-frame picture or a pair of B-field pictures. If the first field of an I-frame is an I-field picture, the second field can be a P-field picture, provided that prediction is to be based on the'first I-field Chapter 6. 
Application: MPEG-2 Video Coder 95 T o p F ie ld Bot tom Fie ld Figure 6.29: Interlaced video frame consists of two fields. picture. One confusing aspect is the use of the words, "field" and "frame". If "frame" is used as a noun, it can mean either a frame that is coded as a single picture (either a progressive frame, or a frame created by interleaving two fields) or a frame that is coded as two separate field pictures. If "frame" is used as an adjective (as in "frame picture"), it always means that the frame is coded as one picture. In this chapter, the video sequences used in our simulation results are interlaced frame pictures. 6.1.3 Motion Prediction In MPEG-2 , motion prediction depends mainly on the type of the picture (frame picture or field picture). For frame pictures, there are three motion prediction types: frame pre-diction, field prediction, and dual prime prediction. For field pictures, there are also three motion prediction types: field prediction, 16x8 motion compensation, and dual prime prediction. The prediction type can be specified at the macroblock layer, which means the encoder has the latitude of selecting a different motion estimation/compensation type for each macroblock. In what follows, we only discuss frame prediction and field Chapter 6. Application: MPEG-2 Video Coder 96 Frame motion vector 16x16 Frame motion vector Field motion vector 16x8 Top 16x8 Bottom Top field motion vector Bottom field motion vector Top Bottom Figure 6.30: MPEG-2 Motion vector type prediction in the context of frame pictures. Figure 6.30 shows the two motion prediction types. The shaded area represents the top field pixels and the clear area represents the bottom field pixels. For frame predic-tion, only one motion vector will be transmitted for a macroblock. For field prediction, the encoder/decoder assumes that the macroblock consists of two fields, and it will de-interlace both the 16x16 macroblock and the reference frame into top field and bottom field as shown in Figure 6.30. For each field, the encoder sends one motion vector and the corresponding side information to indicate the field type. Chapter 6. Application: MPEG-2 Video Coder 97 Frame DCT of the luminance macroblock Field DCT of the luminance macroblock Figure 6.31: MPEG-2 D C T type 6.1.4 Frame D C T and F i e ld D C T To fully exploit the interlaced video properties, two D C T types, frame D C T and field DCT, are defined in MPEG-2 . For frame DCT, the encoder performs the D C T directly on each of the four 8x8 blocks (see Figure 6.31). For field D C T , the encoder assumes that the macroblock consists of two fields and de-interlaces the 16x16 luminance block (or the two chrominance blocks for 4:2:2 video formats) in advance. The D C T type can be specified at the macroblock layer. However, this option is valid only for frame pictures. Chapter 6. Application: MPEG-2 Video Coder 98 Mode Name Group Motion Type D C T Type FrJntra I N T R A Frame FdJntra I N T R A Field Skip FrMC . INTER Frame " F d M C INTER Field FrDCT I N T E R + D C T Frame FdDCT I N T E R + D C T Field FrMC_FrDCT I N T E R + D C T Frame Frame FrMC-FdDCT I N T E R + D C T Frame Field FdMC-FrDCT I N T E R + D C T Field Frame F d M C F d D C T I N T E R + D C T Field Filed Table 6.8: The available macroblock mode types for P-frame. 6.1.5 Bit Stream Syntax and Macroblock Layer In MPEG-2 , the structure of the bit stream syntax depends on the profile adapted by the application. 
As we will concentrate on only the main profile/main level part of the standard, we will only describe the bit stream syntax for the main profile. The MPEG-2 bit stream syntax for the main profile is a superset of the MPEG-1 bit stream syntax. This non-scalable profile is aimed at higher bit rates (4 Mbps to 8 Mbps for CCIR 601 video) and is able to handle interlaced video formats. The bit-stream syntax for the main profile consists of six layers: sequence layer; group of picture (GOP) layer; picture layer; slice layer; macroblock layer; and block layer. These layers have been created to satisfy many important requirements such as coding flexibility, ease of implementation, and channel error resilience by way of re-synchronization markers. The range and values of the layer parameters depend mainly on the level. From a coding viewpoint, the most important layer is the macroblock layer. As blocks within the macroblock cannot be encoded differently, the macroblock is the smallest coding unit. Chapter 6. Application: MPEG-2 Video Coder 99 In what follows, we will briefly discuss the various modes in I-frame and P-frame for MPEG-2's main profile. Here, the mode is defined by the coding operation applied on the macroblock. The coding operations are (1) intra/inter, (2) frame/field motion prediction, and (3) frame/field DCT. As specified in the MPEG-2 standard, in an I-frame picture, two intra modes are supported. They are the intra frame-DCT and the intra field-DCT modes. If the quantization step of a present macroblock differs from that of the previous macroblock, the encoder will send 5 bits to indicate the new value of the quantization step. In P-frame pictures, there are three principal modes of operation at the macroblock layer: (1) intra, (2) motion only, and (3) hybrid motion compensated prediction and DCT coding. However, as motion prediction and/or DCT coding can be applied in a frame or field mode of operation, at least eleven modes of operation are possible. Table 6.8 lists eleven modes for P-frame pictures. 6.2 Computation-Distortion Optimized Motion Estimation MPEG-2's main profile/main level is mainly aimed at applications employing a CCIR 601 video format (4:2:0 color sampling format, 720x480 pixels, 30 frames/sec) and supporting a bit rate range from 4 Mbps to 8 Mbps. Unlike in low bit rate coding applications, where the motion vector bit rate represents a significant part of the total bit rate, in MPEG-2's main profile/main level, the lion's share of the bit stream represents the DCT coefficients (see Figure 6.32). Thus, a rate-distortion optimized motion estimation algorithm would not yield significant gains in compression performance. On the other hand, the computational resources required for motion estimation in MPEG-2 are, in most cases, relative very high. In fact, the full search algorithm can take more than 99% of the total number of computations. Therefore, the computation-distortion (CD) optimized motion estimation algorithm described in Chapter 3 can yield substantially Chapter 6. Application: MPEG-2 Video Coder 100 Bit Rate Ratio vs. Bit Rate 0 2 4 6 8 10 12 Bit Rate (Mbits/sec) Figure 6.32: The bit rate distribution for DCT, motion vector, and side information better computation-performance tradeoffs. The proposed framework restricts the motion estimation to a relatively small search area whose center is a two-dimensional median predicted motion vector [43, 42]. 
The number of computations is further reduced by incorporating explicitly a computational constraint during the estimation process. Experimental results show that the median prediction as well as the dynamic nature of the implementation lead to a dramatic increase in motion estimation speed for only a small loss in compression performance. A unique feature of our algorithm is that the reconstruction quality of the resulting MPEG-2 coder is proportional to the computer's processing power, providing much needed flexibility in many MPEG-2 software applications. Chapter 6. Application: MPEG-2 Video Coder 101 R M 1 N Q H G J S L K 0 P F C B D E A Previous Frame Present Frame Figure 6.33: The region of support (ROS). 6.2.1 Motion Vector Prediction In our framework, the motion vector search region will have as a center the predicted motion vector. Therefore, the motion vector prediction directly affects the efficiency of our proposed algorithm. We next evaluate the performance of two different motion estimation vector prediction methods: 1) median prediction and 2) least mean square error (LMSE) prediction. The performance of these two prediction methods will be evaluated for different configurations of region of support (ROS) and using the mean squared error (MSE) and mean absolute error (MAE) measures. The ROS is decided by the values of correlation coefficients and of the values of LMSE coefficients values between motion vectors within a sufficiently large 3-D region, and the current motion vector. Such a region includes previously coded motion vectors representing blocks that are close spatially and/or temporally. Figure 6.33 shows a 3-D ROS extending over two frames. Block G of the previous frame is positioned at the same spatial location as the unlabeled block of the current frame (or current block). For the Chapter 6. Application: MPEG-2 Video Coder 102 LMSE predictor, the ROS is decided by the absolute value of the LMSE coefficients. For the median predictor, the ROS is decided by the correlation coefficients and the LMSE coefficients. The simulation results show that the ROS decided by the LMSE coefficients yields higher prediction accuracy than that of the ROS decided the correlation coefficients for the same number of blocks. From the simulation results, the best number of blocks for median estimation is either 4 to 5, and prediction performance will degrade as the number of blocks increases. For LMSE, the larger the number of blocks, the better performance. However, the performance improvement diminishes by employing more than 6 blocks. If we use the mean squared error (MSE) to evaluate the performance of the two predictors, the LMSE predictor is always better than the median predictor for the same ROS. However, for the ROS consisting of 5 blocks the difference is not big. The situation will change when we use the mean absolute error (MAE) as the evaluation measure. As expected, the median predictor is better than the LMSE one for the same cases. From our simulation result, and a channel robustness point of view, we suggest using the median predictor with a 5-block ROS. The suggested ROS consists of the blocks labeled A, B, C , and D in the present frame and block labeled G in in the past frame in Figure 6.33. 6.2.2 Search Path and Search Termination The diamond-shaped search path discussed in Chapter 3 is used in our motion estimation framework (see Figure 6.34). 
After searching each layer, the following Lagrangian cost J_l is evaluated:

J_l = SAD_l + \beta\,C_l,   (6.49)

where SAD_l is the minimum sum of absolute differences (SAD) found within the l-th layer, and C_l is the total number of computations performed so far, i.e., after searching the l-th layer.

Figure 6.34: Diamond-shaped search area. Blocks with the same shade are in the same search layer.

The parameter β controls the tradeoff between computation and distortion. In this work, the Model I (2-layer hypothesis) method, which is described in Chapter 3, is used to terminate the search process.

6.2.3 Hierarchical Structure

A very efficient hierarchical motion estimation algorithm is presented next. The efficiency of the algorithm originates from searching low-resolution video frames: when the motion search is performed on a low-resolution frame, both the size of the search window and the size of the matching block are smaller than those in a higher-resolution image, reducing substantially the number of computations. If the downsampled (low-resolution) frames still preserve the structural information of the video, the performance of hierarchical search will be very close to that of full search.

Figure 6.35: The search region on the full-resolution level and the half-pel resolution level.

The idea of performing the motion search in low-resolution video frames is adopted in our proposed framework. In our method, a 3-level hierarchical structure is used. The three frame resolutions in the hierarchical structure are (1) half-pel resolution, (2) full resolution, and (3) low resolution. The reference pictures for motion estimation at the low-resolution and full-resolution levels are derived from the original versions, whereas the reference picture for motion estimation at the half-pel resolution level is the interpolated version of the decoded picture. During hierarchical motion estimation, we start by searching a small diamond-shaped area in the low-resolution frame, an area whose center is the predicted motion vector. The motion vector found in the low-resolution frame is then used as the starting point for the motion vector search in the full-resolution image. After the motion vector search in the full-resolution image, the resulting motion vector becomes the starting point for half-pel refinement. Note that the search allowed in both the full-resolution and the half-pel resolution frames is limited to the 9 points shown in Figure 6.35.

6.2.4 Implementation Issues

16x16 Frame Prediction and 16x8 Field Prediction

For interlaced video sequences, MPEG-2 defines two types of motion prediction in a P-frame: 16x16 frame prediction and 16x8 field prediction (see Section 6.1.3). Therefore, to exploit the interlacing property, the encoder needs to perform two types of motion estimation and select the "best" motion vector(s) for motion compensation. Because both the present frame and the reference frame consist of two fields (the top field and the bottom field), four types of motion prediction are required for 16x8 field prediction.
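Before listing the five search types required for field prediction, the layer-by-layer evaluation of (6.49) can be sketched as follows. The stop rule shown, terminating once J_l no longer decreases, is a simplified stand-in for the Model I (2-layer hypothesis) test of Chapter 3, and the sad_of() callback that performs the block matching for one layer is an assumption of this sketch.

```c
#include <limits.h>

/* Block matching over the candidates of layer l; returns the minimum SAD
 * within that layer and accumulates the computation count (an assumption). */
typedef long (*LayerSadFn)(int layer, long *computations_done);

/* Layer-by-layer evaluation of  J_l = SAD_l + beta * C_l  (Eq. 6.49).
 * The search stops as soon as the extra computation no longer pays off.    */
static int search_layers(int max_layers, double beta, LayerSadFn sad_of,
                         long *best_sad_out)
{
    long computations = 0;
    long best_sad = LONG_MAX;
    double prev_j = 1e300;
    int layer;
    for (layer = 0; layer < max_layers; ++layer) {
        long sad = sad_of(layer, &computations);            /* min SAD within layer l */
        if (sad < best_sad) best_sad = sad;
        double j = (double)sad + beta * (double)computations;  /* Eq. (6.49) */
        if (j >= prev_j) break;                              /* terminate the search   */
        prev_j = j;
    }
    *best_sad_out = best_sad;
    return layer;                                            /* layers actually searched */
}
```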
In what follows, we list the 5 required motion search types: • 16x16 frame prediction (F-F) • 16x8 field prediction for present top field: present top to reference top (T-T), and present top to reference bottom (T-B) • 16x8 field prediction for present bottom field: present bottom to reference top (B-T), and present bottom to reference bottom (B-B) One solution to find the 5 motion vectors is to apply the same predictor (median or LMSE) five times before each search. The second approach is to use the motion vector found in a particular type of motion estimation to predict the other types of motion vectors before their search. For example, the motion vector found in motion search of the 16x16 frame prediction type could be used to predict the motion vectors for the other four 16x8 field prediction types motion estimation. Simulation results show that a Chapter 6. Application: MPEG-2 Video Coder 106 strong correlation amongst those five types of motion vectors exists, and that the second approach is more efficient. Full Block Matching and Partial Block Matching In motion estimation, the encoder searches for the motion vector that gives the minimum matching error (the sum of the absolute difference, or SAD, for example). One simple and straightforward approach is to calculate the SAD of the 256 pixels of the 16x16 macroblock for each vector, and then choose the vector with the minimum SAD. This full block matching method is suitable only for hardware implementations, because a very simple and structured digital circuit can perform this type of operation in a parallel manner. For software implementations, the more efficient partial block matching method is favored. The partial block matching method consists of comparing the partial sum SADN of N pixels to the current minimum SAD. If the value SADN is larger than that of SAD, then the matching process is stopped, and several unnecessary computations can then be avoided. Simulation results show that, on average, the partial block matching method yields 25% gain in number of computations. 6.3 Fast Rate-Distortion Optimized Mode Selection This section presents an MPEG-2 compliant interlaced video encoder that employs an efficient macroblock coding mode selection algorithm. The macroblock coding mode is selected based on a rate-distortion (RD) optimized criterion. However, predictive and sta-tistical modeling techniques are introduced that lead to substantially better computation-performance tradeoffs. The proposed MPEG-2 video encoder is shown experimentally to outperform the MPEG-2 TM5 encoder, expressed in a 15 — 60% reduction in bit rate and/or a 1.5 — 2.5 dB increase in objective quality. Chapter 6. Application: MPEG-2 Video Coder 107 6.3.1 R D Optimized Mode Selection Theoretically, in order to obtain the optimal mode and associated quantization step that minimizes the J(A) in Equation (4.40), the encoder has to exhaustively search all the available combinations of mode M and quantization step Q. For MPEG-2 main profile/main level, the exhaustive searching implies that all 64 (M,Q) combinations in an I picture and all 259 (M, Q) combinations in a P picture need to be considered (see Table 6.8). Exhaustive search is computationally very costly. However, it is the only way that optimal performance gains can be assessed. When all (M, Q) combinations are considered, the compression performance for the three CCIR 601 video sequences, BUS and F O O T B A L L , are shown in Figures 6.36 and 6.37. 
The figures show that a 1.5 - 2.0 dB improvement can be achieved as compared to the mode selection method used in the MPEG-2 TM5 coder. The standard TM5 coder selects the coding mode by comparing the energy of the predictive residuals. For example, the intra/inter decision is determined by a comparison of the variance of the macroblock pixels against the variance of the predictive residuals, and the inter prediction mode is selected to be the inter mode that has the least predictive residual mean square error. However, the RD optimized MPEG-2 video coder requires, on average, more than 30 minutes of CPU time on a 167 MHz Ultra Sparc machine just to encode a group of 12 frames! To reduce substantially the number of computations, the fast finite state machine (FSM) RD optimized mode selection algorithm proposed in Chapter 4 is here developed for MPEG-2 video coding. The fast FSM algorithm is developed based on the analysis results presented in the following parts.

Figure 6.36: PSNR (in dB) of the TM5 and the full-search RD optimized algorithm for the BUS and FOOTBALL sequences.

Figure 6.37: PSNR (in dB) of the TM5 and the full-search RD optimized algorithm for the GARDEN sequence.

Dependency Between Coding Mode and Quantization Step

In this section, the statistical dependency between the mode M and the quantization step Q is analyzed. Figure 6.38 shows the joint distribution function p(M, Q) of the mode M and the quantization step Q, derived from the exhaustive search method on the BUS sequence at 8 Mbps. The difference between the joint entropy H(M, Q) and the mutual information I(M, Q) is used to evaluate the dependency between M and Q, where

H(M) = \sum_m -\log_2(p(M))\,p(M),
H(Q) = \sum_q -\log_2(p(Q))\,p(Q),
H(M, Q) = \sum_{m,q} -\log_2(p(M, Q))\,p(M, Q),
I(M, Q) = H(M) + H(Q) - H(M, Q).   (6.50)

Figure 6.38: Joint distribution function p(M, Q) of coding mode M and quantization step Q.

If the value of I(M, Q) is much smaller than that of H(M, Q), then very little information about the quantization step Q is offered by the mode M (and vice versa), which implies that only a small dependency exists between the mode M and the quantization step Q. Table 6.9 shows H(M, Q), H(M) + H(Q), and I(M, Q) derived from the simulation results of the three video sequences (BUS, FOOTBALL, and GARDEN) using the exhaustive search method. In Table 6.9, the values of I(M, Q) are much smaller than the values of H(M, Q), implying very small dependencies between M and Q. Therefore, the fast coding mode selection method described in Chapter 4, which processes M and Q separately, can be employed. Chapter 6.
    Name        H(M,Q)    H(M)+H(Q)    I(M,Q)
    Bus         5.6890    5.7467       0.0577
    Football    5.6442    5.7321       0.0879
    Garden      5.2887    5.3578       0.0691

Table 6.9: The joint entropy, the sum of the individual entropies, and the mutual information of the coding mode M and the quantization step Q.

Coding Mode

In this section, the correlation among the neighboring macroblocks' modes is analyzed, where entropy is used as the correlation measure. The purpose of this study is to determine whether or not the coding mode types of the neighboring macroblocks offer information about the mode of the present macroblock. Two quantities are studied in this section: (1) the entropy H(M) of the mode M of the present macroblock, and (2) the conditional entropy H(M | M_N) of the mode M of the present macroblock when the modes of the N neighboring macroblocks are known. Here, M_N represents the set of the N neighboring macroblocks' modes, and H(M | M_N) is given by

H(M \mid \mathcal{M}_N) = \sum_{\mathcal{M}_N} p(\mathcal{M}_N) \sum_{M} -\log_2\big(p(M \mid \mathcal{M}_N)\big)\, p(M \mid \mathcal{M}_N).    (6.51)

Because each macroblock can take on any one of twelve modes, only the nearest four neighboring macroblocks' modes in the present frame are considered in this test. There are 20736 possible configurations for the 4 neighboring macroblocks. The result is shown in Table 6.10. The percentage in Table 6.10 is defined as H(M | M_N)/H(M). From the table, we can see that knowledge of the neighboring macroblocks' modes lowers the entropy of the present macroblock's coding mode to about 48% to 58% of its unconditional value. This implies that an FSM that exploits such high correlation will be very effective.

    Name        H(M)      H(M | M_N)    Percentage
    Bus         3.0895    1.8055        58.44%
    Football    3.1575    1.5245        48.28%
    Garden      3.0394    1.7552        57.75%

Table 6.10: The entropy and conditional entropy of the coding mode M for interlaced P pictures.

Quantization Step

The number of possible quantization steps per macroblock in MPEG-2 is 32. Therefore, an FSM method would likely be unnecessarily complex. Fortunately, the distribution of the quantization step Q is unimodal (see Figure 6.38). Thus, a mean predictor is used to predict the "centroid" Qc from the quantization steps Q of the neighboring macroblocks. After the predicted Qc is obtained, the search order for the quantization step is arranged according to the absolute difference between Q and Qc. For example, if the predicted quantization step Qc is 14, then the search order for Q will be 14, 13, 15, 12, 16, 11, 17, and so on. Figure 6.39 shows the histogram of the prediction error when a mean predictor is applied. The figure shows that the quantization step Q can be accurately predicted. In order to further reduce the computational cost of forward/backward quantization of the DCT coefficients, only 3 quantization steps, Qc, Qc - 2 and Qc + 2, are considered. The reason we chose (Qc - 2, Qc, Qc + 2) rather than (Qc - 1, Qc, Qc + 1) is that the quantization step change in the latter case is too small, which would hurt performance at lower bit rates.

Figure 6.39: Histogram of the quantization step prediction error (BUS, 60 frames).
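The alternating search order around the predicted centroid described above can be generated mechanically. The C sketch below is illustrative only; the clamping range [q_min, q_max] is left as a parameter because the exact handling of boundary values is an assumption here, and the function name is made up.

    /* Build the test order for the quantization step around a predicted
     * centroid qc:  qc, qc-1, qc+1, qc-2, qc+2, ...   Values outside the
     * legal range [q_min, q_max] are skipped.  Returns the number of
     * entries written to order[].                                       */
    int build_q_search_order(int qc, int q_min, int q_max, int *order)
    {
        int n = 0;
        if (qc >= q_min && qc <= q_max)
            order[n++] = qc;
        for (int d = 1; d <= q_max - q_min; d++) {
            if (qc - d >= q_min) order[n++] = qc - d;  /* smaller Q first */
            if (qc + d <= q_max) order[n++] = qc + d;
        }
        return n;
    }

For qc = 14 this produces 14, 13, 15, 12, 16, 11, 17, ..., matching the order quoted in the text.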
6.3.2 Encoding of I-Pictures

Figure 6.40 illustrates the intra coding algorithm employed in this work. A finite-state machine (FSM), described later, is used to determine the more likely of the two possible modes of operation: frame DCT (FR-DCT) and field DCT (FI-DCT). Suppose, without loss of generality, that the mode FR-DCT has been selected. The value of the quantization step Qc is set to the mean of the values corresponding to the MBs above, to the left, and above-right. Then, each of the six 8x8 blocks in a macroblock is transformed by an integer DCT. The resulting coefficients are quantized and entropy coded as specified in the standard. Given a Lagrangian parameter λ, the Lagrangian J_λ = D + λR can be evaluated using the MB's average sum of squared error for D and the MB's average number of bits for R. After the three Lagrangians J_λ corresponding to the three quantization steps Qc, Qc - 2 and Qc + 2 have been evaluated, the minimum of the three J_λ values is compared to a threshold J_R (described in Section 4.4.2). If J_λ < J_R, then intra coding of the current MB is terminated. Otherwise, the value Q* yielding the minimum Lagrangian is saved, and the next mode, FI-DCT, is tested.

Figure 6.40: MB mode selection: I-pictures.

Statistical analysis results show that, even when all 32 possible values of the quantization step are considered, the distribution of the quantization step prediction error is both symmetric and very narrow (see Figure 6.39). In fact, a member of the triple (Qc - 2, Qc, Qc + 2) is selected more than 90% of the time. The procedure required to test the FI-DCT mode is identical to the one described above, except for the following two modifications. First, the luminance Y-MB is rearranged by odd/even line selection before DCT coding. Second, only the value Q* is considered, as it is shown experimentally to be an accurate estimate. Assuming that the exit condition has still not been satisfied, the (M, Q) pair of values yielding the smallest J_λ is selected, and intra coding of the current MB is terminated.
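The control flow just described can be summarized by the following schematic C sketch of the intra decision. The encode_intra_mb() routine that returns the distortion and rate of a given (mode, Q) trial is hypothetical, and the FSM ordering is reduced here to a fixed two-entry list; treat the whole block as an outline rather than the thesis implementation.

    #include <float.h>

    enum intra_mode { FR_DCT = 0, FI_DCT = 1 };

    /* Hypothetical helper: intra-code one MB with the given mode and Q
     * and report the resulting distortion (SSE) and rate (bits).       */
    extern void encode_intra_mb(enum intra_mode m, int q,
                                double *dist, double *bits);

    /* Select (mode, Q) for an intra MB.  modes_in_fsm_order[] is assumed
     * to come from the FSM; qc is the mean-predicted quantization step. */
    void select_intra_mode(const enum intra_mode modes_in_fsm_order[2],
                           int qc, double lambda, double j_threshold,
                           enum intra_mode *best_m, int *best_q)
    {
        const int q_cand[3] = { qc, qc - 2, qc + 2 };
        double best_j = DBL_MAX;

        for (int mi = 0; mi < 2; mi++) {
            enum intra_mode m = modes_in_fsm_order[mi];
            /* Only the best Q found so far is re-tested for the 2nd mode. */
            int nq = (mi == 0) ? 3 : 1;
            for (int qi = 0; qi < nq; qi++) {
                int q = (mi == 0) ? q_cand[qi] : *best_q;
                double d, r;
                encode_intra_mb(m, q, &d, &r);
                double j = d + lambda * r;       /* J = D + lambda * R  */
                if (j < best_j) {
                    best_j = j; *best_m = m; *best_q = q;
                }
            }
            if (best_j < j_threshold)            /* early exit: J < J_R */
                return;
        }
    }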
6.3.3 Encoding of P-Pictures

Inter coding of P-pictures depends on either an I- or a P-picture. In inter coding, there are three principal modes of operation: (1) motion only, (2) hybrid motion-compensated prediction and DCT, and (3) intra DCT. However, as motion estimation/compensation and/or DCT coding can be applied to frames or fields, at least nine non-intra modes of operation are possible (see Table 6.8). The block diagram shown in Figure 6.41 illustrates the proposed MB coding mode selection algorithm used during the inter coding of P-pictures. As in intra coding, considering all modes of operation is neither practical nor required. The modes are investigated in a specific order determined by the local statistical behavior of the principal modes of operation. More specifically, the FSM is first used to order the principal modes of operation in terms of their probabilities. The principal modes are then considered in that order. The specific modes belonging to the selected principal mode are investigated following a sequence that is determined by the computation requirements. For example, because frame motion estimation requires much less computation than field motion estimation, it is performed before field motion estimation. Moreover, because of duplicate operations in frame DCT and field DCT, the FrMC_FdDCT mode is tested after the FrMC_FrDCT one, and the FdMC_FdDCT mode is tested after the FdMC_FrDCT one.

For each candidate mode, the corresponding algorithm is executed, yielding an average distortion and an average bit rate. Then, the resulting Lagrangian J_λ is compared to the threshold J_R. If J_λ < J_R, the appropriate bits are emitted, and the encoding of the current MB is terminated. Otherwise, one or more modes within and outside the subject principal mode are then investigated. As in intra coding, at most three values of the quantization step are considered for each principal mode (when applicable); they are Qc, Qc - 2, and Qc + 2, where Qc is the mean-predicted value. The value Q* representing the minimum Lagrangian for the mode currently being investigated is saved, and is used for the other modes within the same principal mode.

Figure 6.41: MB mode selection: P-pictures.

6.3.4 The Finite State Machine (FSM)

The computation-performance tradeoffs that can be obtained by the above algorithm depend on the characteristics of the input interlaced video sequence and on the efficiency of the finite state machine (FSM). In our work, FSM statistical modeling is employed to estimate the conditional probabilities of the two intra coding modes or the three inter coding principal modes. As many as 25 and 125 states, with associated tables of probabilities, are allowed in intra and inter coding, respectively. During intra coding, a state corresponds to a vector (s1, s2) whose component values are obtained as follows. Given the region of support (ROS) shown in Figure 6.42, the weighting coefficients of the MBs that were FR-DCT encoded are added, and the sum is quantized to one of five uniformly distributed levels. The value of s1 is set to the output level. The value of s2 is obtained similarly using the MBs that were FI-DCT encoded. During inter coding, a state corresponds to a vector (s1, s2, s3) whose components are obtained as described above using the principal modes of the ROS's MBs. Once a state is computed, it is used as a pointer to a table of two (intra coding) or three (inter coding) probabilities. The probabilities of the principal modes (two principal modes for intra coding and three principal modes for inter coding) are used to decide the testing order of the modes. That is, the mode with the highest probability is tested first, and so on. The probabilities are initially uniformly distributed, and are then updated during the encoding procedure.

The search for motion vector(s) is performed only if one of the applicable modes in Figure 6.41 is considered. The procedure used in the field or frame motion vector search is essentially the same as the algorithm used in Section 6.2.3. The proposed motion search algorithm is very effective because candidate motion vectors away from the predicted motion vector are quickly eliminated from consideration. This behavior is expected, since likely motion vectors are usually localized within a small neighborhood of the predicted motion vector, and the number of computations increases substantially as more layers are examined.

Figure 6.42: The ROS used in this work.
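A minimal sketch of how such a state index might be formed from the ROS is given below. The quantization of weighted sums to five levels follows the description above, but the number of ROS neighbours, the weight values, and the exact layout of Figure 6.42 are not reproduced here, so treat those numbers as placeholders.

    #define ROS_SIZE   4     /* number of causal neighbours (assumption) */
    #define FSM_LEVELS 5     /* each component is quantized to 5 levels  */

    enum principal_mode { PM_MOTION_ONLY = 0, PM_MC_DCT = 1, PM_INTRA = 2 };

    /* Quantize a weighted sum to one of FSM_LEVELS uniform levels, given
     * the largest sum that can occur (all neighbours voting one way).   */
    static int quantize_sum(int sum, int max_sum)
    {
        int level = (sum * FSM_LEVELS) / (max_sum + 1);
        return (level >= FSM_LEVELS) ? FSM_LEVELS - 1 : level;
    }

    /* Compute the inter-coding FSM state (s1,s2,s3) as a single index in
     * [0, 124] from the principal modes of the ROS macroblocks.         */
    int fsm_inter_state(const enum principal_mode ros_mode[ROS_SIZE],
                        const int weight[ROS_SIZE])
    {
        int sum[3] = {0, 0, 0}, max_sum = 0;
        for (int i = 0; i < ROS_SIZE; i++) {
            sum[ros_mode[i]] += weight[i];
            max_sum += weight[i];
        }
        int s1 = quantize_sum(sum[PM_MOTION_ONLY], max_sum);
        int s2 = quantize_sum(sum[PM_MC_DCT],      max_sum);
        int s3 = quantize_sum(sum[PM_INTRA],       max_sum);
        return (s1 * FSM_LEVELS + s2) * FSM_LEVELS + s3;  /* 5*5*5 = 125 */
    }

The returned index can then be used to address the per-state table of principal-mode probabilities that determines the testing order.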
6.3.5 Threshold J_R and Rate Control

The most important practical issue is finding appropriate values for the Lagrangian parameter λ and the threshold J_R, the two key parameters that control the performance-speed tradeoffs. Although they are generally not independent, we optimize them separately here.

Threshold J_R

The threshold J_R is used to provide a good balance between performance and the number of computations. A small/large value of J_R increases/decreases both the computational demands and the probability of selecting the optimal mode of operation. In both intra and inter coding, J_R is an adaptive threshold whose value is proportional to the weighted average of the Lagrangians corresponding to the neighboring MBs within the ROS shown in Figure 6.42. The same weights shown in the figure are employed.

Rate Control

The second important practical issue is finding an appropriate value for the Lagrangian parameter λ, the key parameter that mainly controls the RD tradeoffs. In this work, the parameter λ is initially estimated based on computed long-term statistics, and is then varied adaptively during the encoding process depending on the buffer fullness and/or the history of λ. The rate control algorithm used in the MPEG-2 video coder is based on the rate control algorithm used in Section 5.2.3. Without loss of generality, let us assume that our encoder's output bits are stored in a buffer of size S_max, and are then transmitted over a fixed-rate communication channel. The system is governed by

s(t + 1) = s(t) + R(t) - \gamma,

where t denotes time, s(t) is the buffer occupancy, R(t) is the variable output bit rate of the encoder, and γ is the channel bit rate. Then, assuming a linear feedback control model and a desired buffer occupancy of s* = S_max/2, the Lagrangian parameter λ(t) at time t can be obtained by the recursion formula [43]

\lambda(t) = \lambda(t-1)\left(\alpha + (1 - \alpha)\,\frac{s(t)}{s^{*}}\right),    (6.52)

where the parameter α is the decay factor used to control the variation of the parameter λ(t). The purpose of introducing the parameter α in Equation (6.52) is to reduce the variation of λ, because simulation results show that better RD performance can be achieved when λ is constant within a frame. Updating λ based on the above recursion formula is found to be an efficient solution. In fact, relative to the more complex algorithms described in [8], the resulting rate control algorithm is very simple, yet its performance is good.
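The recursion in Equation (6.52) amounts to a one-line update per frame. The sketch below shows how it could be applied together with the buffer bookkeeping; the struct layout is illustrative, and the clamping of λ to a positive range is an added safeguard rather than something specified above.

    typedef struct {
        double lambda;        /* current Lagrangian parameter               */
        double alpha;         /* decay factor, e.g. 0.25 in the experiments */
        double s;             /* buffer occupancy in bits                   */
        double s_target;      /* desired occupancy, S_max / 2               */
        double channel_rate;  /* channel bits per frame interval (gamma)    */
    } rate_ctrl;

    /* Called once per frame after encoding: update the buffer model with
     * the bits just produced, then update lambda using Equation (6.52).  */
    void rate_ctrl_update(rate_ctrl *rc, double bits_this_frame)
    {
        rc->s += bits_this_frame - rc->channel_rate;  /* s(t+1)=s(t)+R(t)-gamma */
        if (rc->s < 0.0) rc->s = 0.0;                 /* buffer cannot go negative */

        double ratio = rc->s / rc->s_target;
        rc->lambda *= rc->alpha + (1.0 - rc->alpha) * ratio;  /* Eq. (6.52) */
        if (rc->lambda < 1e-3) rc->lambda = 1e-3;             /* safeguard (assumption) */
    }

With a buffer above the target occupancy the multiplier exceeds one and λ grows, driving the bit rate down; below target the opposite happens, which is the linear feedback behaviour assumed in the text.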
6.4 Simulation Results

This section is divided into two parts. In the first part, we compare our computation-distortion optimized motion estimation algorithm to two fast motion estimation algorithms: the 2D-logarithmic search algorithm and the 4-level hierarchical search algorithm. In the second part, we present comparison results between our fast RD optimized MPEG-2 encoder and MPEG-2's TM5 encoder.

The Test Video Format

The video sequences used in our simulation experiments conform to the CCIR Rec. 601 (CCIR 601) format. CCIR 601 specifies the image format, acquisition semantics, and parts of the coding for digital "standard" television signals. The video sequence is digitized from an interlaced television signal; thus, each frame consists of two fields, the top field and the bottom field, as shown in Figure 6.29. The frame rate is 30 frames per second and the size of each frame is 720x480 pixels. The three color components are one luminance (Y) and two chrominance (Cr and Cb) components. The color sampling format is 4:2:0, which means that the Cr and Cb components are down-sampled by a factor of two in both the vertical and the horizontal direction. Each pixel has 8 bits of precision. Thus, the raw bit rate for the video is 720 x 480 x 30 x 1.5 x 8 = 124.416 Mbits/sec. Three video sequences, FOOTBALL, BUS, and GARDEN, are selected based on their representative video characteristics. Motion in the FOOTBALL sequence is very active and complex, with moderate spatial detail. The GARDEN sequence features significant spatial variations, but very simple motion. The BUS sequence shows a lot of detail, with relatively high motion activity. Snapshots of the above three sequences are shown in Figure 6.43.

Figure 6.43: CCIR 601 video sequences.

6.4.1 CD Optimized Motion Estimation

In the following simulations, the three video sequences shown in Figure 6.43 (BUS, FOOTBALL, and GARDEN) are used. The frame distance between I pictures (the GOP parameter) is 12 frames, and the target bit rate is 8 Mbits/sec. Forty-eight frames are used in the simulations. The motion search window size in the simulations is determined by the motion activity in each sequence. Thus, for each sequence, we choose the smallest window size that just covers the largest motion displacement. All coders in the simulations are developed based on the MPEG-2 TM5 video coder, and only the motion estimation algorithm of the MPEG-2 TM5 coder is modified. The comparison is based on the peak signal to noise ratio (PSNR) of the reconstructed video and the computation required for the motion search at the same bit rate. The computation unit used in the comparisons consists of a subtraction, an absolute value, and an accumulation. For comparison purposes, Table 6.11 shows the coding results of the full search algorithm using the partial block matching method.

    Sequence                Bus       Football    Garden
    Window Size             23        47          15
    PSNR (dB)               34.89     37.83       33.79
    Computations per MB     551510    2184500     258070

Table 6.11: Required number of computations for motion estimation per macroblock when the full search algorithm and the partial block matching method are used.

2D-logarithmic Search vs. CD Optimized Diamond-Shaped Layer Search

In this part, we compare our CD optimized diamond-shaped layer search algorithm with the 2D-logarithmic search algorithm. In our algorithm, the Model I (2-layer hypothesis) method is used to terminate the search process, and the median predictor with a 5-macroblock ROS is used to predict the motion vector. Our algorithm uses the motion vector obtained in the 16x16 frame prediction type of motion estimation to predict the motion vectors for the other four 16x8 field prediction types of motion estimation. The 2D-logarithmic search algorithm and our algorithm employ the same partial block matching method, the same 2-level hierarchical structure of the MPEG-2 TM5 coder, and the same motion search method in the half-pel resolution frame.
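The median predictor referred to above simply takes a component-wise median over the motion vectors of the causal neighbours. A small illustrative version, written for an arbitrary number of candidates (the 5-macroblock ROS being one choice), could look as follows; the function names and the cap of 16 candidates are assumptions.

    typedef struct { int x, y; } mv_t;

    /* Median of n small integers (n is tiny, so insertion sort is fine). */
    static int median_int(int *v, int n)
    {
        for (int i = 1; i < n; i++)
            for (int j = i; j > 0 && v[j - 1] > v[j]; j--) {
                int tmp = v[j]; v[j] = v[j - 1]; v[j - 1] = tmp;
            }
        return v[n / 2];
    }

    /* Component-wise median motion-vector predictor over ROS candidates. */
    mv_t predict_mv_median(const mv_t *cand, int n)
    {
        int xs[16], ys[16];
        if (n <= 0) { mv_t zero = {0, 0}; return zero; }
        if (n > 16) n = 16;
        for (int i = 0; i < n; i++) {
            xs[i] = cand[i].x;
            ys[i] = cand[i].y;
        }
        mv_t pred = { median_int(xs, n), median_int(ys, n) };
        return pred;
    }

Simulation results are shown in Table 6.12.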
                             Bus                Football           Garden
    Search Type              2D-Log    Ours     2D-Log    Ours     2D-Log    Ours
    Window Size              23        23       47        47       15        15
    PSNR (dB)                33.59     34.69    36.97     37.42    32.75     33.61
    Computations per MB      24856     24312    30307     27564    21475     13414

Table 6.12: Performance comparison between the 2D-logarithmic motion search and our CD optimized diamond-shaped layer search.

From Table 6.12, we can see that our algorithm is more efficient than the 2D-logarithmic search algorithm. For the BUS and especially the GARDEN sequence, the difference is quite significant. This is because, in the GARDEN sequence, most of the image area contains a lot of texture (for example, the flower bed), and the 2D-logarithmic search algorithm can very easily get trapped in a local minimum. For the FOOTBALL sequence, because the image is blurred, the probability of the 2D-logarithmic algorithm being trapped in a local minimum is much lower than for the GARDEN and BUS sequences, resulting in a smaller difference between the 2D-logarithmic search algorithm and our algorithm. Another explanation is the motion property of the sequences. For sequences with structured and stable motion fields (such as BUS and GARDEN), the median predictor used in our algorithm achieves high motion vector prediction accuracy. However, the motion field in the FOOTBALL sequence is very complex, which makes the predicted motion vectors much less reliable.

4-level Hierarchical Search vs. 3-level Hierarchical CD Optimized Diamond-Shaped Layer Search

In this part, we compare the 4-level hierarchical search algorithm with our 3-level hierarchical CD optimized diamond-shaped layer search algorithm. The 4-level hierarchical search generates two additional lower resolution frames for the motion search, and performs a full search in the lowest resolution frame. For the other three resolution frames, the 4-level hierarchical search only searches the 9-point region centered on the motion vector obtained in the next lower resolution frame. In our algorithm, only one low resolution frame is created, and the search regions in the full and half-pel resolution frames are limited to a 9-point region. As described in Section 6.2.3, the proposed CD optimized search algorithm using the Model I (2-layer hypothesis) termination method is employed in all three resolution frames. Simulation results are presented in Table 6.13.

                             Bus                Football           Garden
    Search Type              Hier      Ours     Hier      Ours     Hier      Ours
    Window Size              23        23       47        47       15        15
    PSNR (dB)                34.55     34.60    37.39     37.38    33.55     33.72
    Computations per MB      15168     8765     23157     9771     13734     7802

Table 6.13: Performance comparison between the 4-level hierarchical motion search and our 3-level hierarchical CD optimized diamond-shaped layer search.

From Table 6.13, we can see that our algorithm is better than the 4-level hierarchical search algorithm. For sequences with active motion, such as FOOTBALL, our method is much more efficient. Besides, our algorithm requires less memory and is substantially less complex, because only one low resolution image is required. Figure 6.44 shows the PSNR performance of the resulting MPEG-2 video coder for different values (different computation levels) of the parameter β.

Figure 6.44: PSNR performance as a function of computation for an MPEG-2 coder with our 3-level hierarchical CD optimized diamond-shaped layer search (FOOTBALL, 48 frames, search window -47 to 47; full search with full matching: 37.83 dB at 4113920 operations per MB; full search with partial matching: 37.83 dB at 2184527 operations per MB).
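To give a flavour of how a diamond-shaped layer search with a computation term can be organized, the sketch below expands L1 "layers" around the predicted vector and stops when a complete layer fails to improve the best cost. That stopping rule is a simplified stand-in for the Model I (2-layer hypothesis) test, and the rate and computation terms are crude illustrations; sad16x16(), mv_bits(), and the way β multiplies the candidate count are all assumptions, not the thesis cost model.

    #include <stdlib.h>

    typedef struct { int x, y; } mv_t;

    /* Hypothetical helpers: SAD of the 16x16 block at displacement (dx,dy),
     * and the bits needed to code (dx,dy) relative to the predicted MV.   */
    extern int sad16x16(int dx, int dy);
    extern int mv_bits(int dx, int dy, mv_t pred);

    /* Diamond-shaped layer search around the predicted vector.  Layer d
     * contains all displacements with |dx|+|dy| == d.  The cost combines
     * distortion, motion-vector rate and a computation charge:
     *     J = SAD + lambda * R + beta * C                                 */
    mv_t diamond_layer_search(mv_t pred, int max_layer,
                              double lambda, double beta)
    {
        mv_t best = pred;
        double tested = 0.0;
        double best_j = sad16x16(pred.x, pred.y)
                      + lambda * mv_bits(pred.x, pred.y, pred);

        for (int d = 1; d <= max_layer; d++) {
            int improved = 0;
            for (int dx = -d; dx <= d; dx++) {        /* enumerate layer d */
                int rem = d - abs(dx);
                for (int s = -1; s <= 1; s += 2) {
                    int dy = s * rem;
                    if (rem == 0 && s == 1) continue; /* avoid duplicate   */
                    tested += 1.0;
                    double j = sad16x16(pred.x + dx, pred.y + dy)
                             + lambda * mv_bits(pred.x + dx, pred.y + dy, pred)
                             + beta * tested;         /* computation term  */
                    if (j < best_j) {
                        best_j = j;
                        best.x = pred.x + dx;
                        best.y = pred.y + dy;
                        improved = 1;
                    }
                }
            }
            if (!improved)      /* a whole layer with no improvement:      */
                break;          /* stop expanding (simplified criterion)   */
        }
        return best;
    }

Raising β makes the computation charge grow faster with the number of tested candidates, so the search gives up on outer layers sooner, which is the qualitative effect of the different computation levels in Figure 6.44.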
6.4.2 RD Optimized MPEG-2 Video Coder

Figure 6.45 shows the PSNR performance of the MPEG-2 TM5 coder, a full-search RD optimized MPEG-2 coder, and the efficient RD optimized MPEG-2 coder for 60 frames of the BUS sequence and 24 frames of the FOOTBALL sequence. The full-search RD optimized MPEG-2 coder considers all intra/inter modes of operation and all possible values of the quantization step, but also employs the proposed CD optimized motion estimation algorithm. The efficient RD optimized MPEG-2 coder employs the FSM-based mode selection algorithm as well as the CD optimized motion search algorithm. As shown in Figure 6.45, the overall improvement in PSNR performance of the full-search RD optimized coder is substantial, and may justify the relatively large number of required computations. Using such an encoder, we require between 65% and 85% of the bit rate required by the TM5 encoder for the same level of PSNR performance. Although the full-search RD optimized coder maintains a 3:1 computational advantage over the TM5 coder, it still takes 2.1 CPU minutes on a 167 MHz Ultra Sparc machine to encode one video frame of the FOOTBALL sequence. On the other hand, the proposed efficient RD optimized coder only requires 5 to 6 CPU seconds to encode one video frame of the FOOTBALL sequence on the same machine.

Figure 6.45: PSNR as a function of bit rate for the BUS and FOOTBALL sequences (TM5, full-search RD optimized, and efficient RD optimized coders).

Figures 6.46 and 6.47 show the simulation results obtained using our rate control algorithm in our efficient RD optimized MPEG-2 coder for a target bit rate of 8 Mbits/sec. In this simulation, the Lagrangian parameter λ is updated at the end of each frame, and the memory decay factor α is set to 0.25. The buffer size used in this simulation is set to 1,310,720 bits, which is smaller than the maximum buffer size of 1,802,240 bits for main profile/main level in the MPEG-2 standard. Simulation results show that, although a simple algorithm is used, the proposed rate control algorithm is very effective, regulating the buffer fullness around one half of the buffer size (655,360 bits). In general, if a bigger buffer size is used, the RD performance will be better. Another parameter that affects RD performance is the decay factor α. More specifically, RD performance will be better when a smaller α is used. However, the buffer fullness variation then becomes larger due to the slower updating of λ.

Figure 6.46: Buffer fullness and λ variation (BUS, 60 frames).

Figure 6.47: PSNR and bit rate as a function of time (BUS, 60 frames).
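For reference, the PSNR figures quoted throughout this section follow the usual definition for 8-bit video; a small helper makes the computation explicit (the 99 dB cap for identical pictures is a common convention, not something specified in the thesis):

    #include <math.h>

    /* PSNR in dB for 8-bit video: 10*log10(255^2 / MSE), where MSE is the
     * mean squared error between original and reconstructed pictures.   */
    double psnr_from_sse(double sse, long num_pixels)
    {
        double mse = sse / (double)num_pixels;
        if (mse <= 0.0)
            return 99.0;              /* identical pictures: cap the value */
        return 10.0 * log10((255.0 * 255.0) / mse);
    }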
6.5 Summary

We presented an MPEG-2 compliant interlaced video encoder that employs an efficient macroblock mode selection method, a simple motion vector prediction technique, and a computation-distortion optimized motion vector search algorithm. The macroblock mode is selected based on a rate-distortion optimized criterion. However, predictive and statistical modeling techniques are introduced that lead to substantially better computation-performance tradeoffs. The proposed MPEG-2 video encoder is shown experimentally to outperform the MPEG-2 TM5 encoder, expressed in a 15-60% reduction in bit rate or a 1.5-2.5 dB increase in objective quality. Besides its compression advantage, our encoder requires fewer computations than the TM5.

Chapter 7

Conclusion and Future Study

7.1 Contributions of the Thesis

Video coding is a decision making process in which the encoder determines the values of the coded parameters and appends the corresponding bits to the bit stream. For DCT based video coding, three types of data, the motion vectors, the coding modes, and the associated parameters, are determined by the encoder on a macroblock-by-macroblock basis. Undoubtedly, these parameters need to be optimized with respect to rate-distortion performance and computational demands. The objective of this thesis is to propose video coding algorithms that can improve the compression performance given specific computational constraints. We achieved this goal by introducing new motion estimation and mode selection algorithms which provide a good balance among the distortion, bit rate, and computational cost. In the following, we list the main accomplishments of the thesis.

1. A new cost function J for motion estimation

The first contribution of the thesis is the introduction of a new cost function, the Lagrangian J, for motion estimation. The introduced cost function J is a weighted sum of the distortion D, the motion vector bit rate R, and the computational cost C, that is, J = D + λR + βC. The two parameters, λ and β, control the influence of the motion vector bit rate R and the computational cost C during the motion estimation process. One of the advantages of using the proposed cost function J in motion estimation is that the motion vector with the minimum J provides good tradeoffs among the distortion, bit rate, and computational cost. Another advantage is that, by carefully choosing the values of the parameters λ and β, the motion estimation algorithm can control the bit rate R and the computational cost C, respectively. Depending on the video application, the proposed cost function J, as well as the motion estimation algorithm, can be simplified to yield rate-distortion (RD) optimized motion estimation or computation-distortion (CD) optimized motion estimation. Experimental results show that for low bit rate video coding applications the RD optimized motion estimation algorithm can control the motion vector bit rate and significantly improve the compression performance. Similarly, for CD optimized motion estimation, the proposed algorithm can also control well the quality and the motion estimation speed.

2. Predictive motion estimation algorithms

The second contribution of the thesis is the introduction of predictive motion estimation algorithms which consist of motion vector prediction, search path, and termination strategies.
In this thesis, four well known prediction techniques (the mean predictor, the median predictor, the weighted mean predictor, and the statistical mean predictor) used for motion vector prediction are studied and compared. The simulation results show that the median predictor performs best with regard to prediction accuracy and complexity. We tested three types of search path strategies (the spiral search path, the diamond-shaped search path, and the floating center search path) and found that the floating center search and the diamond-shaped search are both quite efficient. We also proposed and studied two termination algorithms for the motion search. The proposed termination strategies, together with the proposed cost functions and search paths, can dynamically shrink or expand the search region. Simulation results show that the Model I (2-layer hypothesis) algorithm performs better than the Model II (3-layer hypothesis) algorithm.

3. The fast rate-distortion (RD) optimized mode selection algorithms

The third contribution of the thesis is the introduction of the fast RD optimized mode selection algorithms. The RD optimized mode selection algorithm chooses the coding mode and the associated parameters that give the minimum value of the Lagrangian J = D + λR. To solve the huge computational requirement problem of the full search method, fast RD optimized mode selection algorithms are proposed. In this thesis, we propose two fast RD optimized mode selection algorithms: (1) the thresholding method and (2) the FSM method. The thresholding method is mainly used to safely eliminate the INTRA and/or INTER coding options from consideration. It is suitable for non-active video applications such as the head-and-shoulder images in video conferencing. The FSM method develops a search order for the different coding modes according to their conditional probabilities, which are derived from a finite state machine (FSM). To simplify the FSM method, we separately process the coding modes and their associated parameters (such as the quantization step). During the search, a threshold, which is derived from the neighboring macroblocks' Lagrangian J, is used to terminate the search for the coding mode and its associated parameters. The proposed method dramatically reduces the computations required by the full search method, at the expense of a small degradation in the compression performance.

4. Implementation of an H.263-based video encoder for very low bit rate video applications

The fourth contribution of the thesis is that of using our proposed motion estimation and mode selection algorithms to implement an H.263-based video encoder for very low bit rate applications. The proposed H.263-based video coder overcomes some of the drawbacks of traditional coders (such as Telenor's TMN5 H.263 video encoder) by employing (1) a fast full-pixel median-based predictive motion searching technique, (2) an RD optimized motion estimation algorithm, and (3) an RD optimized coding mode selection algorithm. The simulation results show that the proposed video coder gives better compression performance than the TMN5 H.263 video coder. The simulation results also show that, due to the RD optimized motion estimation, the motion vector bit rate is carefully controlled. In the thesis, we also introduce a rate control mechanism which decides the λ value according to the bit rate requirement and the buffer fullness.
5. Implementation of an MPEG-2 compliant video encoder for interlaced video

The fifth contribution of the thesis is the implementation of an MPEG-2 compliant video encoder for interlaced video. The MPEG-2 video encoder implemented in the thesis mainly targets the main profile, main level case. The proposed MPEG-2 video coder employs (1) a hierarchical computation-distortion (CD) optimized motion estimation algorithm and (2) a fast FSM-based RD optimized mode selection algorithm. Because of the MPEG-2 high bit rate application and the active motion field of the coded video sequences, the CD optimized motion estimation algorithm is chosen for our proposed video coder. To further improve the computational efficiency, the motion estimation algorithm is implemented in a hierarchical layer structure. The FSM-based mode selection algorithm in our proposed MPEG-2 video coder is also implemented in a hierarchical layer structure. The simulation results show that the proposed video encoder outperforms the TM5 MPEG-2 video encoder in both speed and performance.

7.2 Topics for Future Study

Even though the proposed video coder presented in this thesis outperforms other existing video coders, the following subjects could further improve the proposed encoder's performance and need to be studied further.

1. More sophisticated rate control mechanism

In this thesis, we introduce a simple one-pass rate control mechanism which determines the value of the parameter λ using the buffer fullness and the bit rate requirement. However, the rate-distortion (RD) characteristics of the image strongly depend on the picture coding type (I, P, B). Therefore, our proposed rate control algorithm may exhibit oscillation problems when λ is updated only once per frame. One solution to the oscillation problem is to employ a decay factor which reduces the changing magnitude of the new λ. However, a larger decay factor will also reduce the rate controllability. Another solution is to employ different updating strategies for different picture coding types.

2. Computation allocation algorithm for motion estimation

The proposed computation-distortion (CD) optimized motion estimation algorithm efficiently reduces the number of computations needed for the motion search. However, in our proposed video coder, the parameter β, which controls the tradeoff between the distortion D and the computational cost C in the cost function J = D + βC, remains fixed during the whole coding process. This does not allow the user to pre-determine the overall coding speed of the motion search, since different β values are not tried. To overcome this problem, a computation allocation algorithm for the motion search is required which can automatically determine the proper value of β for each macroblock (or slice). This will allow us to efficiently allocate a pre-set, limited computation resource.

3. Fast RD optimized mode selection algorithm for MPEG-2 B pictures

The fast RD optimized mode selection algorithm in this thesis is developed for only the I and P picture types. Because a B picture utilizes two reference frames, in general, the number of available coding modes for a B picture is more than double that of a P picture. Due to the greater number of available coding modes for a B picture, we believe that larger improvements can be achieved if an RD optimized mode selection algorithm is used.
The implementation of a fast algorithm for B pictures similar to the one we developing for the I and P pictures seems quite straight forward. However, because of the greater variety of mode selections, the design of such a fast RD optimized mode selection algorithm should be carefully studied. Bibliography D. Anastassiou. Current status of the MPEG-4 standardization effort. In SPIE Proc. Visual Communications and Image Processing, volume 2308, pages 16-24, 1994, . R. Arminato, R. Schafer, F. Kitson, and V . Bhaskaran. Linear predictive coding of motion vectors. In Proceedings of the International Picture Coding Symposium, Sept. 1994. R. Arminato, R. Schafer, F. Kitson, and V . Bhaskaran. Linear predictive coding of motion vectors. In Proceedings of the IS&T/SPIE EI'96, Jan. 1996. T. Berger. Rate Distortion Theory. Prentice-Hall, Inc., New Jersey, 1971. V . Bhaskaran and K . Konstantinides. Image and Video Compression Standards: Algorithms and Architecture. Kluwer Academic Publishers, Boston, 1995. M . Bierling. Displacement estimation by hierarchical blockmatching. In SPIE Vol.1001 Visual Communications and Image Processing, pages 942-951, May 1988. C. W.Chen and T. Huang. Left ventricle motion analysis by hierarchical decompo-sition. In Proc. IEEE Int. Gonf. Acoust., Speech, and Signal Processing, volume III, pages 273-276, 1992. J. Choi and D. Park. A stable feedback control of the buffer state using the L a -grangian multiplier method. IEEE Transactions on Image Processing: Special Issue on Image Sequence Compression, 3(5):546-558, Sept. 1994. P. A. Chou. Application of entropy-constrained vector quantization to waveform coding of images'. Visual Communications and Image Processing IV, SPIE-1199:970-978, 1989. P. A . Chou, T. Lookabaugh, and R. M . Gray. Entropy-constrained vector quan-tization. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-37(l):31-42, Jan. 1989. : W. Chung, F. Kossentini, and M . Smith. A new approach to scalable video coding. In IEEE Data Compression Conference, pages 381-390, Snowbird, UT, USA, Mar. 1995. 136 Bibliography 137 [12] W. Chung, F. Kossentini, and M . Smith. Rate-distortion constrained statistical mo-tion estimation for video coding. In ICIP95, volume 3, pages 184-187, Washington, DC, Oct. 1995. [13] W. Chung, F. Kossentini, and M . Smith. An efficient motion estimation technique based on a rate-distortion criterion. In Proc. IEEE Int. .Conf. Acoust., Speech, and Signal Processing, volume 4, pages 1926-1929, Atlanta, G A , USA, May 1996. [14] T. Cover and J . Thomas. Elements of Information Theory. John Wiley and Sons, Inc, New York, 1991. [15] J. E. Dennis and R. B. Schnabel. Numerical Methods for. Unconstrained Optimiza-tion and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, New Jersey, 1983. [16] H. Everett. Generalized Lagrange multiplier method for solving problems of opti-mum allocation of resources. Operation Research, 11:399-417, 1963. • [17] R. M . Fano. Transmission of Information. Wiley, New York, 1961. [18] T. R. Fisher and M . Wang. Entropy-constrained trellis-coded quantization. IEEE Trans, on Information Theory, 38(2):415-426, Mar. 1992. [19] R. G. Gallager. Information Theory and Reliable Communication. John Wiley & Sons, New York, 1968. [20] R. G. Gallager. Variations on a theme by Huffman. IEEE Transactions on Infor-mation Theory, IT-24(6):668-674, Nov. 1978. [21] A . Gersho and R. M . Gray. Vector Quantization and Signal -Compression. Kluwer Academic Publishers, Boston, 1992. [22] H. Gharavi. 
Mutilayer subband-based video coding. IEEE Transactions on Com-munications, 39:1288-1291, 1991. [23] B. Girod. Rate-constrained motion estimation. In SPIE Proc. Visual Communica-tions and Image Processing, volume 2308, pages 1026-1034, 1994. [24] B . Girod, D. J. LeGall, M . I. Sezan, M . Vetterli, and Y . Hiroshi. Special issue on image sequence compression. IEEE Trans, on Image Processing, 3(5), Sept. 1994. [25] B . S. Gottfried and J. Weisman. Introduction to Optimization Theory. Prentice-Hall, New Jersey, 1973. [26] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, New Jersey, 1986. -Bibliography 138 [27] D. Hoang, P. Long, and J . Vitter. Efficient cost measures for motion compensation at low bit rates. In IEEE Data Compression Conference, pages 102-111, Snowbird, UT, USA, Apr. 1996. [28] R. J. Holt and A. N . Netravali. Motion from optical flow: Multiplicity of .solutions. Journal on Vision Communications and Image Representation, 4(l):14-24,. 1994. [29] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, 17:185, 1981. [30] D. A . Huffman. A method for the construction of minimum redundancy codes. Proceedings of the IRE, 40:1098-1101, 1952. [31] ISO-IEC. M P E G video report draft. JTC1/SC2/WG11, 1991. [32] ISO-IEC. MPEG-4 proposal package description - revision 3. JTC1/SC2/WG11, 1995. • ' . . . [33] ISO/IEC 11172-2. Information technology - Coding of moving pictures .and associ-ated audio for digital storage media up to about 1.5 Mbit/s - Part 2: Video. ISO/IEC, 1993. [34] ISO/IEC 13818-2—ITU-T Rec. H.262. Generic Coding of Moving Pictures and Associated Audio Information: Video.-ISO/IEC, 1995. [35] ISO/IEC JTC1/SC29/WG11. Test Model 5. M P E G 93/457, Apr. 1993. [36] ITU Telecom. Standardization Sector Study Group 15. Document LBC-95 Working Party 15/1 Question 2/15. Draft Recommendation H.263,. Leidschendam, 7 Apri l 1995. • " ' . i [37] J. Jain and A. Jain. Displacement measurement and its application in interframe im-age coding. IEEE Trans, on Communications, COM-29(15):1799-1808, Dec. 198l'. [38] M . Kaneko, A . Koike, and Y . Hatori. Three-dimensional motion estimation of head in model-based coding of moving facial images. IEICE Transactions on Communi-cations, J74-B-I(10):789-798, 1991. [39] J. Karush. A simple proof of an inequality of McMillan. IRE Transactions on Information Theory, 7:118, 1961. . [40] F. Kos'sentini, W. Chung, and M . Smith. Conditional entropy-constrained residual V Q with application to image coding. Transactions on Image Processing: Special Issue on Vector Quantization, 5(2):311-320, Feb. 1996. Bibliography 139 [41] F. Kossentini, W. Chung, and M . Smith. Rate-distortion-constrained subband video coding. Submitted to Transactions on Image Processing, Mar. 1996. [42] F. Kossentini and Y . Lee. Computation-constrained fast MPEG-2 encoding. IEEE Signal Processing Letters, 4:224-226, Aug. 1997. [43] F. Kossentini, Y . Lee, M . Smith, and R. Ward. Predictive RD-constrained. motion estimation for very low bit rate video coding. Special Issue of the IEEE Transactions on Selected Areas in Communications, 15:1752-1763, Dec. 1996. [44] T. L. Kuni. Visual Computing. Springer-Verlag, 1992. [45] G. Langdon. An introduction to arithmetic coding. IBM J. Res. Dev., 28:135-149, Mar. 1984. [46] G. G. Langdon and J. Rissanen. Compression of black-white images with arithmetic coding. IEEE Transactions on Communications, 29(6):858-867, 1981. [47] D. Le Gall. M P E G : a video compression standard for multimedia applications. 
Communications of the ACM, 34(4):46-58, Apr. 1991. [48] Y . Lee, F. Kossentini, M . Smith, and R. Ward. Prediction and search techniques for RD-optimized motion estimation in a very low bit rate video coding framework. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, volume 4, pages 2861-2864, Munich, Germany, Apr. 1997." [49] Y . Lee, F. Kossentini, R. Ward, and M . Smith. Towards MPEG-4: An improved H.263-based video coder. Signal Processing: Image Communication: Special Journal ' Issue on MPEG-4, 10:143-158, July 1997. ' [50] Y . Lee, F. Kossentini, and R. K . Ward. Efficient RD optimized macroblock coding mode selection for MPEG-2 video encoding. In International Conference on Image Processing, volume 5, pages 582-585, Santa Barbara, USA, 1997. [51] Y. . Lee, F. Kossentini, and R. K . Ward. Efficient MPEG-2 encoding of interlaced video. C'JECE Special Issue on Visual Computing and Communications, in press. [52] Y . Lee, R. K . Ward, F. Kossentini, and M . J. T. Smith. Very low rate DCT-based video coding using dynamic V Q . In International Conference on Image Processing, pages 669-672, Lauzanne, Switzerland, Sept. 1996. ! [53] W. L i , Y . Q. Zhang, and M . L. Liou. Special Issue on Advances in Image and Video Compression. Proc. of the IEEE, 83(2):135-340, Feb. 1995. Bibliography 140 J. 0 . Limb. Distortion criteria of the human viewer. IEEE Trans., pages S M C -9:778-793, Dec. 1979. T. J . Lynch. Data Compression: Techniques and Applications. Lifetime Learn-ing, Wadsworth, Belmont, Calif., 1985. K . Mase, Y . Watanabe, and Y . Suenaga. A realtime head motion detection system. In Proceeding of the SPIE Conference on Sensing and Reconstruction of 3-D Objects and Scenes, volume 1260, pages 262-269, 1990. H . Morikawa, E. Kondo, and H. Harashima. 3D facial modeling for model-based coding. TEICE Transactions on Communications, J76-B(6):626-633, 1993. M P E G Software Simulation Group. MPEG-2 encoder / decoder, version 1.0. ISO/IEC DIS 13818-2 codec, May 1995. Y . Nakaya and H. Harashima. Model-based/waveform hybrid coding for low-rate transmission of facial images. IEICE Transactions on Communications, J75-B(5):377-384,1992. F. Pereira. MPEG-4: a new challenge for the representation of the audio-visual information. In Proceedings of the International Picture Coding Symposium, March 1996. '' • - , "x K. Ramchandran and M . Vetterli. Best wavelet packet bases in a rate-distortion sense. IEEE Trans, on Image Processing, 2(2):160—174, April 1993. , C. Reader. MPEG-4 syntactic descriptive language: a universal interface for ex-change of coded audio-visual data. In Proceedings of the International Picture Cod-ing Symposium, March 1996. B. G. Schunk. The image flow constraint. Computer Vision, Graphics and Image Processing, 35(6):20-46, 1986. A. Schuster and A. Katsaggelos. Fast and efficient mode and quantizer selection in the rate-distortion sense for H.263. In SPIE Proc. Visual Communications and Image Processing, volume 2727, pages 784-795, 1996. G. M . Schuster and A. K . Katsaggelos. Rate-distortion based video compression. Kluwer Academic Publishers, Netherlands, 1997. • • < C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. In IRE National Convention Record, Part 4, pages 142-163, 1959. Also in Information .and Decision Processes, R. E. Machol, Ed. New York, N Y : McGraw-Hill, 1960, pp. 93-126. i Bibliography 141 [67] Y . Shoham and A. Gersho. Efficient bit allocation for an arbitrary set of quantizers. 
IEEE Trans, on Acoustics, Speech, and Signal Processing, 36(9): 1445-1453, Sept. 1988. [68] J . Storer. Data Compression. Computer Science Press, Rockville, Maryland, 1988. [69] H. Sun, W. Kwok, M . Chien, and C. H. J . Ju. M P E G coding performance improve-ment by jointly optimizing coding mode decisions and rate control. IEEE Trans, on Circuits and Systems for Video Technology, 7:449-458, June 1997. , [70] K . I. T. Koga, A. Hirano, Y . Iijima, and T. Ishiguro. Motion-compensated interframe coding for video conferencing. In Proc. NTC 81, volume 28, pages 239-251, New Orleans, Dec. 1992. [71] D. Taubman and A. Zakhor. Rate and resolution scalable subband coding of video. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, volume 5, pages 493-496, 1994. ' [72] Telenor Research. T M N (H.263) encoder/decoder, version 1.4a. TMN (H.263) codec, May 1995. [73] The International Telegraph and Telephone Consultant Committee. Video codec for audiovisual services at p x 64 kbits/s; Recommendation H.261, 1990. [74] K . - H . Tzou, H. G. Musmann, and K . Aizawa. Special Issue on Very Low Bit Rate Video Coding. IEEE Trans, on Circuits and Systems for Video Technology, 4(3):213-367, June 1994. [75] D. Walker and K . Rao. Improved pel-recursive motion compensation. IEEE Trans, on Communications, pages 1128-1134, Oct. 1984. [76] Q. Wang and R. J. Clarke. Motion-compensated sequence coding using-image pyra-mids. In Electronic Letters, volume 26, pages 575-576, 1990. i [77] W. J . Welsh, S. , Searby, and J. B . Waite. Model-based image coding. British Telecommunication Journal, 8:94-106, July 1990. [78] T. Wiegand, M . Lightstone, T. Campbell, and S. Mitra. Efficient mode selection for block-based motion compensated video coding. In International Conference on Image Processing, Washington, DC, USA, Oct. 1995. [79] T. Wiegand, M . Lightstone, D. Mukherjee, T. Campbell', and S. Mitra. Rate-Distortion Optimized Mode Selection for Very Low Bit Rate Video Coding and the Emerging H.263 Standard. IEEE Trans, on Circuits and Systems for Video Technology, pages 182-190, Apr. 1996. ! i i Bibliography 142 [80] J. W. Woods and T. Naveen. Motion compensated multiresolution transmission of high definition video using multistage quantizers. In Proc. IEEE Int. Conf. Acoust:, Speech, and Signal Processing, volume 5, pages 582-585, 1993. [81] S. Zafar, Y . Zhang, and J . Baras. Predictive block-matching motion estimation for T V coding— Part I: Inter-block prediction. IEEE Transactions on Broadcasting, 37(3):97-101, Sept. 1991. [82] Y . Zhang and S. Zafar. Predictive block-matching motion estimation for T V coding— Part II: Inter-frame prediction.- IEEE Transactions on Broadcasting, 37(3):102-105, Sept. 1991. 
