FAST MOTION ESTIMATION METHODS FOR H.264 VIDEO CODING ON MOBILE DEVICES

by

Matthias von dem Knesebeck

Master of Science and Business Administration (Diplom-Wirtschaftsingenieur), Technical University of Braunschweig, 2000

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

June 2010

© Matthias von dem Knesebeck, 2010

ABSTRACT

Digital video is becoming an increasingly widespread application on a multitude of devices, ranging from mobile devices to digital cinema. Technological advancements in processing speed and available bandwidth, along with substantial improvements in compression techniques, enable completely new applications and services for digital video content. The most demanding task in video encoding is the motion estimation process, which aims to identify similarities to previously transmitted video frames; up to 90% of the processing requirements are attributable to this element.

In this thesis, we present three methods for encoding new and transcoding existing video content with significantly reduced computational complexity while maintaining both quality and bitrate. The first method reduces the number of steps required to perform motion estimation by adaptively adjusting the search accuracy needed in distortion measurement. The second method addresses the topic of mode decision in video encoding and provides an algorithm that allows an early decision about the most probable modes without the need to evaluate all 259 different combinations of block sizes. The third method provides a multi-dimensional measure that facilitates evaluating only the most likely modes for efficiently transcoding existing pre-encoded content to lower resolutions with an arbitrary downscaling ratio. This is an important factor for the ever-growing number of devices and application scenarios that access existing pre-encoded content. Our method supplements existing fast transcoding schemes that primarily focus on efficiently determining motion vectors in transcoding.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Dedication
Co-Authorship Statement
Chapter 1: Introduction and Overview
1.1 The H.264 Video Coding Standard
1.2 Overview of Current Motion Estimation Algorithms
1.2.1 Elements of Modern Motion Estimation Methods
1.2.2 Search Methods
1.3 Digital Video Transcoding
1.4 Thesis Contributions
1.5 Thesis Summary
1.6 References
Chapter 2: A Fast Motion Estimation Algorithm for Low Processing Power Applications
2.1 Introduction
2.2 Proposed New Motion Estimation Scheme
2.3 Experimental Results
2.3.1 Results for QCIF Resolution
2.3.2 Results for CIF Resolution
2.4 Conclusion
2.5 References
Chapter 3: An Efficient Early-Termination Mode Decision Algorithm for H.264
3.1 Introduction
3.2 Overview of H.264 Motion Estimation Methods
3.3 Proposed Motion Estimation Method
3.3.1 Early Termination of the Mode Decision Process for the 16x16 Partition Size
3.3.2 Early Termination of the Mode Decision Process for Partition Sizes Larger than or Equal to 8x8
3.3.3 Using Pixel Decimation Patterns for Further Reduction in Computational Complexity
3.4 Experimental Results
3.5 Conclusion
3.6 References
Chapter 4: A Fast Mode Decision Algorithm for Transcoding H.264 Video with Arbitrary-Ratio Spatial Downscaling
4.1 Introduction
4.2 Background
4.2.1 Mode Decision in Video Compression
4.2.2 Bottlenecks in Downscaled Transcoding
4.2.3 Motion Search
4.2.4 Mode Decision
4.3 Proposed Algorithm
4.3.1 Cost Values from the 16x16 Motion Search
4.3.2 Motion Vectors of the Pre-encoded Stream
4.3.3 Residual Information of the Pre-encoded Stream
4.3.4 Derivation of the Decision Criterion
4.4 Experimental Results
4.5 Conclusion
4.6 References
Chapter 5: Conclusions
5.1 Overall Significance of the Research
5.2 Potential Applications of the Research Findings
5.3 Contributions
5.4 Comments on Future Research
5.5 References
A. Appendix
LIST OF TABLES

Table 2.1: Occurrences of Optimal Motion Vector Magnitudes in Various Video Sequences
Table 2.2: Performance of Proposed Algorithm for Various Threshold Levels (QCIF Resolution)
Table 2.3: Performance of Proposed Algorithm for Various Threshold Levels (CIF Resolution)
Table 2.4: Macroblock Categories
Table 2.5: Average PSNR (dB) Achieved for QCIF Resolution
Table 2.6: Average Number of Pixel Comparisons per Macroblock (QCIF Resolution)
Table 2.7: Average PSNR (Y) Value for Different Motion Estimation Methods (CIF Sequences)
Table 2.8: Average Number of Pixel Comparisons per Macroblock for Different Motion Estimation Methods (CIF Sequences)
Table 3.1: Properties of Test Sequences Used
Table 3.2: Partition Sizes Chosen for P-Frames (CIF Sequences)
Table 3.3: Partition Sizes Chosen for P-Frames (QCIF Sequences)
Table 3.4: Early Termination for 16x16 Partitions (CIF, Average of Encoding with QP 28, 32, 36, 40)
Table 3.5: Early Termination for 16x16 Partitions (QCIF, Average of Encoding with QP 28, 32, 36, 40)
Table 3.6: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of 'μ-0.5σ' (CIF, Average of Encoding with QP 28, 32, 36, 40)
Table 3.7: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of 'μ-0.5σ' (QCIF, Average of Encoding with QP 28, 32, 36, 40)
Table 3.8: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of 'μ-0.75σ' (CIF, Average of Encoding with QP 28, 32, 36, 40)
Table 3.9: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of 'μ-0.75σ' (QCIF, Average of Encoding with QP 28, 32, 36, 40)
Table 3.10: Categories Defined for Macroblock Classification
Table 3.11: Subset Patterns Chosen for Different Block Sizes
Table 3.12: Main Parameters Used in the Experiments
Table 3.13: Comparison after Encoding Using the Proposed Method
Table 3.14: Comparison after Encoding Using the Proposed Method with the Proposed Pixel Decimation Scheme Enabled
Table 3.15: Comparison after Encoding with the Proposed Method
Table 3.16: Comparison after Encoding with EPZS and Our Method (IPBBP GOP Structure)
Table 3.17: Comparison after Encoding with the Proposed Method
Table 4.1: Average Frequency of Modes Chosen for the Reference Sequences
Table 4.2: Results for Transcoding with a Downscaling Ratio of 2:1
Table 4.3: Results for Transcoding with a Downscaling Ratio of 4:3
Table 4.4: Results for Transcoding with a Downscaling Ratio of 2.2:1

LIST OF FIGURES

Figure 1.1: Block Sizes Defined by the H.264 Standard
Figure 1.2: Block Diagram of a Typical H.264 Encoder
Figure 1.3: Block Diagram of a Typical H.264 Decoder
Figure 1.4: Predicted Vector Calculated as the Median of Known Surrounding Motion Vectors (Shaded Areas)
Figure 1.5: Search Pattern for the Diamond Search
Figure 1.6: Example of a Search Path for the Diamond Search
Figure 1.7: A Typical Cascaded Pixel-Based Transcoder for Downscaled Transcoding
Figure 1.8: Pixel-based Transcoder Reusing Information from the Pre-Encoded Stream
Figure 2.1: Four Types of Subsets for Motion Estimation (1/4 of the Macroblock is Shown)
Figure 2.2: Four Types of Subsets for Motion Estimation (1/4 of the Macroblock is Shown)
Figure 2.3: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Akiyo CIF Sequence, 100 Frames)
Figure 2.4: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Coastguard CIF Sequence, 100 Frames)
Figure 2.5: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Foreman CIF Sequence, 100 Frames)
Figure 2.6: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Mobile CIF Sequence, 100 Frames)
Figure 2.7: Cross-Correlation Between Initial SAD Values and the Magnitude of the Optimal Motion Vectors
Figure 2.8: Average Number of Pixel Comparisons Needed for Encoding Various Sequences with Different Combinations of Decimated Pixel Patterns
Figure 2.9: Average Quality Obtained from Encoding Various CIF Sequences with Different Combinations of Decimated Pixel Patterns
Figure 2.10: Average Quality Achieved Over the Number of Pixel Comparisons Required
Figure 2.11: Histogram of the Number of MVs with Zero Magnitude or Non-Zero Magnitude, Respectively (Foreman CIF Sequence)
Figure 2.12: Flowchart for Our Proposed Algorithm
Figure 2.13: Rate-Distortion Plot for the Foreman QCIF Sequence (H.263)
Figure 2.14: Rate-Distortion Plot for the Salesman QCIF Sequence (H.263)
Figure 2.15: Rate-Distortion Plot for the Foreman QCIF Sequence (H.264)
Figure 2.16: Rate-Distortion Plot for the Salesman QCIF Sequence (H.264)
Figure 2.17: Rate-Distortion Plot for the Coastguard CIF Sequence (H.263)
Figure 2.18: Rate-Distortion Plot for the Foreman CIF Sequence (H.263)
Figure 2.19: Rate-Distortion Plot for the Coastguard CIF Sequence (H.264)
Figure 2.20: Rate-Distortion Plot for the Foreman CIF Sequence (H.264)
Figure 3.1: Mode Decision Process (Simplified) of the JM Reference Software
Figure 3.2: Histogram of the Absolute Cost Values (Foreman CIF)
Figure 3.3: Histogram of Absolute Final Cost Values in Frame 28 (Foreman CIF)
Figure 3.4: Histogram of Absolute Final Cost Values in Frame 243 (Foreman CIF)
Figure 3.5: Histogram of Final Cost Values (Foreman CIF Sequence)
Figure 3.6: Histogram of Final Cost Values (Mobile CIF Sequence)
Figure 3.7: Correlation Between 16x16 Cost Values J16 and Final Rate-Distortion Cost Jfinal (Foreman CIF Sequence)
Figure 3.8: Correlation Between 16x16 Cost Values J16 and Final Rate-Distortion Cost Jfinal (Mobile CIF Sequence)
Figure 3.9: 16x16 Cost Values vs. Final Cost Values (Mobile CIF, QP 28)
Figure 3.10: Mode Histogram, Mixed Motion Content (Foreman CIF)
Figure 3.11: Macroblock Histogram, High Motion Content (Mobile CIF)
Figure 3.12: Level of Quality Achieved for All Threshold Choices Investigated (Foreman, CIF, QP 32)
Figure 3.13: Flowchart of the Proposed Algorithm
Figure 3.14: Pattern A: Two Pixel Subsets Covering ¼ of the MB Pixels
Figure 3.15: Pattern B: Subset Covering ½ of the MB Pixels
Figure 3.16: Rate-Distortion Performance for the Foreman (CIF) Sequence
Figure 3.17: Rate-Distortion Performance for the Mobile (CIF) Sequence
Figure 3.18: Rate-Distortion Performance for the Foreman (QCIF) Sequence
Figure 4.1: Cascaded Transcoding Process
Figure 4.2: Block Sizes Defined in the H.264 Video Coding Standard
Figure 4.3: Frame 2 of the Soccer CIF Sequence with a 16x16 Grid Overlay to Illustrate the Macroblock Division
Figure 4.4: Illustration of Desired Partitioning Structure for Frame 2 of the Soccer CIF Sequence
Figure 4.5: Map of Cost Values Obtained from the 16x16 Motion Search
Figure 4.6: Map of Block Sizes Identified in the Reference Implementation
Figure 4.7: Map of MVDM Values for Frame 2 of the Soccer 4CIF Sequence
Figure 4.8: Map of Block Sizes Identified in the Reference Implementation
Figure 4.9: Map of SAR Values for Frame 2 of the Soccer 4CIF Sequence
Figure 4.10: Map of Block Sizes Identified in the Reference Implementation
Figure 4.11: Flowchart of the Proposed Algorithm
Figure 4.12: Rate-Distortion Plot for Akiyo CIF
Figure 4.13: Rate-Distortion Plot for Foreman CIF
Figure 4.14: Rate-Distortion Plot for Mobile CIF
Figure 4.15: Rate-Distortion Plot for City 4CIF
Figure 4.16: Rate-Distortion Plot for Crew 4CIF
Figure 4.17: Rate-Distortion Plot for Soccer 4CIF

LIST OF ACRONYMS

DCT  Discrete Cosine Transform
DFT  Discrete Fourier Transform
DVD  Digital Versatile Disc
GOP  Group of Pictures
HVS  Human Visual System
IDCT  Inverse Discrete Cosine Transform
IQ  Inverse Quantization
MB  Macroblock
MCP  Motion-Compensated Prediction
ME  Motion Estimation
MV  Motion Vector
PAL  Phase Alternating Line
PSNR  Peak Signal to Noise Ratio
Q  Quantization
QCIF  Quarter Common Intermediate Format
VLC  Variable-Length Coding
VLD  Variable-Length Decoding

ACKNOWLEDGEMENTS

Performing research and developing new ideas in a field covered by a large number of scientists around the world can be a very challenging and demanding task. Without the guidance, support, and example of many other people, this journey could not have been pursued. My sincere appreciation and gratitude belong to everyone who contributed to the experience, some of whom I would like to acknowledge in the following paragraphs.

First, I would like to thank my supervisor, Dr. Panos Nasiopoulos, for his continuous encouragement, guidance, trust, and optimism. His personality as well as his experience both in industry and academia have been qualities I continue to admire.

Secondly, the work has been performed in the context of a wonderfully composed group of colleagues with a wide variety of backgrounds. My sincere appreciation belongs to my friend and mentor Charles Follis, who has provided support, guidance, and trust at every step of the way.

A special note of appreciation goes to my friends in the orchestra Jägercorps Knesebeck, and especially to the Youth Orchestra, which has always inspired my thoughts and creativity and has been a continuous source of happiness.

DEDICATION

To the youth of my hometown Knesebeck.

CO-AUTHORSHIP STATEMENT

This thesis presents research conducted by Matthias von dem Knesebeck in collaboration with Dr. Panos Nasiopoulos. It is a manuscript-based thesis constructed around the three manuscripts described below.

Manuscript 1: "A Fast Motion Estimation Algorithm for Low Processing Power Applications". The identification and design of the research program, the research, and the data analysis were performed by Matthias von dem Knesebeck. This manuscript is the work of Matthias von dem Knesebeck, who received suggestions and feedback from his supervisor Dr. Nasiopoulos and from Dr. Victor Leung.

Manuscript 2: "An Efficient Early-Termination Mode Decision Algorithm for H.264". The identification and design of the research program, the research, and the data analysis were performed by Matthias von dem Knesebeck. This manuscript is the work of Matthias von dem Knesebeck, who received suggestions and feedback from his supervisor Dr. Nasiopoulos.

Manuscript 3: "A Fast Mode Decision Algorithm for Transcoding of H.264 Pre-encoded Video with Spatial Downscaling". The identification and design of the research program, the research, and the data analysis were performed by Matthias von dem Knesebeck. This manuscript is the work of Matthias von dem Knesebeck, who received suggestions and feedback from his supervisor Dr. Nasiopoulos.
The thesis was written by Matthias von dem Knesebeck, with editing assistance and consultation from Dr. Panos Nasiopoulos.

CHAPTER 1: INTRODUCTION AND OVERVIEW

Digital video is rapidly gaining popularity, and a growing number of applications, services, and devices are evolving that make use of this technology. Since the early 1990s, the international video coding standards MPEG-1 [1], H.261, MPEG-2 (also known as H.262) [2], MPEG-4 (Part 2) [3], and H.263 [4], and most recently the ITU-T H.264 / MPEG-4 (Part 10) Advanced Video Coding standard [5], have been designed to efficiently compress motion picture content. This evolution of video coding standards has led to continuously improved coding efficiency, offering higher quality at a given data rate or a lower data rate at a given quality level [6].

H.264, the latest international video coding standard, is designed for a very broad range of broadcast, storage, conversational, and streaming video services and has already been adopted by a large number of consortia and manufacturers, such as the Digital Video Broadcasting Project (DVB), the Blu-ray Disc consortium, Samsung, Nokia, Sagem, Microsoft, and Apple [7]. Compression efficiency has increased by more than 50% compared to MPEG-2, allowing higher visual quality at a given bitrate, or less bandwidth and storage space for encoding at a given quality level. However, this benefit comes at the cost of significantly increased computational complexity, which is a considerable challenge in general, but especially for real-time encoding and for the power consumption of devices such as mobile devices that are limited in processing power, energy reservoir, bandwidth, or storage space.

When analyzing a typical video encoder, motion estimation is by far the most demanding element, accounting for up to 90% of the processing requirements [8]. While the greatly increased number of coding options introduced by H.264 allows more efficient compression, it also imposes the burden of identifying the best choice among them. In the following section, we give an overview of some of these newly introduced coding options defined in the H.264 video coding standard. Subsequently, we demonstrate the challenges involved in motion estimation and mode decision, and we illustrate techniques and problems specific to transcoding pre-encoded video to a lower resolution.

1.1 The H.264 Video Coding Standard

ITU-T H.264 / MPEG-4 (Part 10) Advanced Video Coding is the latest international video coding standard. It was jointly developed by the Moving Picture Experts Group (MPEG) of ISO/IEC and the Video Coding Experts Group (VCEG) of ITU-T. The standard was created to cover a wide variety of application scenarios, from mobile devices all the way to digital cinema.

In general, video is captured by sampling a given visual scene both spatially and temporally. Every temporal sample (frame) is essentially a still picture. Typical temporal sampling rates are between 25 and 30 frames per second, while spatial resolutions usually range from stamp-sized QCIF resolution (176x144 pixels) all the way up to digital cinema resolutions (e.g., 4K: 4096x2160 pixels). Analogous to still imagery, every spatial sample is a digital representation of the luminance and the color information present at that location. In natural video scenes, both the spatial and the temporal samples contain significant redundancy and irrelevance (elements that cannot be perceived by the human eye). The goal of efficient video coding is to exploit these characteristics in order to minimize the number of bits required to store or transmit the video content.
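To put the amount of raw data involved into perspective, the short calculation below estimates uncompressed data rates for the two ends of the resolution range mentioned above. The 8-bit 4:2:0 sampling format assumed here (an average of 12 bits per pixel) is an illustrative assumption, not something prescribed by the discussion above.

```python
# Rough uncompressed data rates, assuming 8-bit 4:2:0 sampling
# (an average of 12 bits per pixel); assumption for illustration only.
BITS_PER_PIXEL = 12

def raw_bitrate_mbps(width, height, fps):
    """Uncompressed bitrate in megabits per second."""
    return width * height * fps * BITS_PER_PIXEL / 1e6

print(raw_bitrate_mbps(176, 144, 30))    # QCIF at 30 fps: ~9 Mbit/s
print(raw_bitrate_mbps(4096, 2160, 25))  # 4K at 25 fps:  ~2654 Mbit/s
```

Even the smallest format yields several megabits per second of raw data, which is what the prediction, transform, and quantization steps described next are designed to reduce.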
The encoder exploits temporal redundancy by aiming to predict the current frame based on one or more previously transmitted frames (INTER-prediction). In the simplest case, the encoder could transmit a frame and then only the difference between it and the preceding frame(s). While this approach is more efficient than transmitting the frames independently, the difference information (residual) still contains significant signal energy.

Instead of transmitting such simple residuals of the entire frame, current video codecs apply a block-based motion estimation and compensation model. Every video frame is subdivided into small blocks called macroblocks. For every one of these blocks, the encoder then aims to find an area in a previously transmitted frame (the reference frame) that matches its content as closely as possible and consequently yields a minimal residual. This process is known as motion estimation. The information to be transmitted then contains both a residual and the corresponding displacement information (motion vectors). The existing international video coding standards do not specify which algorithm is to be used to calculate the motion vectors [7], thus allowing for future improvements and designs optimized for specific applications.

Besides predicting from a temporally neighboring frame (INTER-prediction), a second option within H.264 allows macroblocks to be predicted from spatially adjacent blocks within the current frame that have already been encoded. This process is known as INTRA-prediction. H.264 defines nine INTRA-prediction modes for 4x4 blocks and four modes for 16x16 blocks.

For H.264 (and most preceding standards), macroblocks have a size of 16x16 pixels. These can be further subdivided into smaller partitions. The option of using smaller partitions often allows finding an even closer approximation to the optical flow, e.g., for scenes with complex motion characteristics, and hence facilitates a further reduction in residual energy. However, introducing more partitions increases the amount of overhead for signaling and transmitting the individual motion vectors. The encoder eventually needs to identify the parameter set that provides the best trade-off between a minimal residual and a minimal overhead (motion vector and block-size signaling information), with the purpose of minimizing the overall number of bits required. H.264 defines seven different block sizes: 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 (see Figure 1.1), resulting in 259 possible combinations of partition sizes for every 16x16 macroblock.

Figure 1.1: Block Sizes Defined by the H.264 Standard

The task of the encoder thus becomes to identify, among this plethora of choices, the set of coding parameters that provides the best balance between bitrate and distortion for transmitting the given content while considering rigid system constraints such as a limited bandwidth.
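The figure of 259 combinations per macroblock follows from a short count: a macroblock uses one of three undivided partitionings (one 16x16, two 16x8, or two 8x16 blocks), or it is split into four 8x8 blocks, each of which independently chooses one of four sub-partitionings (one 8x8, two 8x4, two 4x8, or four 4x4 blocks). The snippet below is only a sanity check of that count, not part of any encoder.

```python
# Partition combinations of one 16x16 macroblock:
# three undivided choices (16x16, 16x8, 8x16) plus the 8x8 split,
# where each of the four 8x8 blocks independently picks one of
# four sub-partitionings (8x8, 8x4, 4x8, 4x4).
large_choices = 3
sub_choices_per_8x8 = 4
print(large_choices + sub_choices_per_8x8 ** 4)  # 259
```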
The residual obtained from either INTER- or INTRA-coding is subjected to a transformation into a domain that concentrates the signal energy into fewer samples. Depending on the type of residual, H.264 applies either an integer transform similar to the Discrete Cosine Transform (DCT) or a Hadamard transform. Many of the resulting coefficients are typically close to zero. In order to increase the compression ratio, these coefficients are subjected to a quantization process. H.264 supports 52 defined quantization step sizes (QStep). Large values of QStep lead to many coefficients being quantized to zero, while small values leave more of the information content unaffected. While quantization reduces the number of bits required for encoding, it is the first element in the processing chain that cannot be reversed without loss of data, and it is the main parameter for controlling the trade-off between compression and image quality. Finally, the resulting stream of symbols is subjected to entropy coding.

In summary, given the large number of coding options available, identifying the best combination of partition sizes and finding the best match for every partition during motion estimation are by far the most demanding elements of the video encoding process.

A block diagram of a typical H.264 video encoder, illustrating the main components discussed above, is shown in Figure 1.2. A corresponding decoder is depicted in Figure 1.3.

Figure 1.2: Block Diagram of a Typical H.264 Encoder

Figure 1.3: Block Diagram of a Typical H.264 Decoder

Most elements displayed in these diagrams are very similar to those found in earlier standards. An additional building block introduced by H.264 is the deblocking filter, which diminishes artifacts introduced by the block-based compression process. This filter essentially smooths the edges found at block boundaries.

1.2 Overview of Current Motion Estimation Algorithms

There are three main types of motion estimation algorithms: pixel-level [9], block-level [10], and global-level [11] motion estimation. Among these, block-based motion estimation is widely used by MPEG encoders. In this category, the spatial and temporal correlation of motion vectors is exploited in order to lower the computational cost compared with the conventional Full Search (FS) technique, which scans the entire search area [9]. Other motion estimation algorithms include the Three Step Search (TSS) [12], New Three Step Search (NTSS) [13], Four Step Search (4SS) [14], Block-based Gradient Descent Search (BBGDS) [15, 16], Diamond Search (DS) [17, 18], Recursive Motion Estimation (RME) [19], the Prediction Search Algorithm (PSA) [20], 2D Logarithmic Search (LOGS) [21], One-at-a-Time Search (OTS) [22], and Cross Search [23]. Some other methods aim to match the picture quality of FS while improving its computational speed, but the speed improvement achieved is much less than that obtained by the methods mentioned above [8, 24-27]. More advanced methods have since been developed, including the Prediction Model Four-Step Search (PM4SS) [28], the Adaptive Predicted Direction Search Algorithm (APDSA) [29], the Hybrid Search approach [30], the Motion Vector Field Adaptive Search Technique (MVFAST) [31], the Advanced Predictive Diamond Zonal Search (APDZS) [32, 33], the Cross Diamond Search, the Hexagon-Based Search [34], the Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) [35], and the Enhanced Predictive Zonal Search (EPZS) [36]. However, given the need of video-enabled devices for practical power consumption, there is still room for improvement with regard to motion estimation complexity.

In Subsection 1.2.1, we briefly explain a few of the main elements used in modern motion estimation algorithms, and we subsequently introduce several search methods frequently found in current video encoders.
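As a baseline for the elements and methods described below, the following sketch illustrates exhaustive block matching with the Sum-of-Absolute-Differences (SAD) criterion used throughout this section. The array layout, search range, and the optional decimation step are illustrative assumptions rather than details of any of the cited algorithms.

```python
import numpy as np

def sad(block, candidate, step=1):
    """Sum of Absolute Differences between two equally sized blocks.
    step > 1 emulates the decimated pixel patterns of Section 1.2.1.5
    (step=2 keeps one quarter of the pixels)."""
    a = block[::step, ::step].astype(np.int64)
    b = candidate[::step, ::step].astype(np.int64)
    return int(np.abs(a - b).sum())

def full_search(cur, ref, x, y, size=16, rng=16):
    """Full Search: evaluate every displacement in a +/-rng window
    around (x, y) and keep the motion vector with the minimum SAD."""
    target = cur[y:y + size, x:x + size]
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            ry, rx = y + dy, x + dx
            if 0 <= ry <= ref.shape[0] - size and 0 <= rx <= ref.shape[1] - size:
                cost = sad(target, ref[ry:ry + size, rx:rx + size])
                if cost < best_cost:
                    best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

For a ±16-pixel window this evaluates up to 33 x 33 = 1089 candidate positions per macroblock, each involving up to 256 pixel comparisons; the elements below aim to reduce both factors.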
1.2.1 Elements of Modern Motion Estimation Methods

1.2.1.1 Decimation of Checking Points

Instead of evaluating every possible displacement position (checking point) within a defined search window, many fast motion estimation methods make use of a search pattern featuring a decimated set of checking points, often employing a multiple-stage refinement process [16]. One of these methods is the widely used Diamond Search, which is further described in Section 1.2.2.2. The validity of these approaches rests on the unimodal error surface assumption, which states that the matching error decreases monotonically towards a global minimum that in turn defines the position of the best match [21].

1.2.1.2 Initial Search Point Prediction

It has been shown that the motion vectors of a macroblock are highly correlated with the motion vectors of neighboring MBs [24]. This has propelled the concept of using predictors based on information from temporally and spatially neighboring macroblocks, which help initiate the motion search at a point that has a high probability of being close to the best match [37]. The most frequently used predictor is a predicted motion vector (PMV) defined as the median of specific surrounding motion vectors, as illustrated in Figure 1.4. In this figure, the predicted motion vector MV is the median value of MV1, MV2, and MV3 (shaded areas).

Figure 1.4: Predicted Vector Calculated as the Median of Known Surrounding Motion Vectors (Shaded Areas)

Several different predictors are used in the latest and highly efficient motion estimation methods in order to identify the least computationally expensive search path [35, 36, 38-40]. In addition to spatial correlation, the temporal correlation of a video sequence is also very high. As a result, the absolute differences between two motion vectors corresponding to the same position or neighboring positions in two consecutive frames are very often very small [8]. For this reason, the motion vector of the collocated macroblock and some of the neighboring macroblocks in the previous frame are considered as predictors as well. Most modern motion estimation methods take advantage of these findings in order to reduce computational complexity.

1.2.1.3 Confidence Scores

Having multiple predictors increases the chance of initiating the search at a point close to the best match. Additional information concerning the suitability of an initial search point can be derived from the similarity of these predictors. If, for instance, several or even all of the predictors are the same or very similar, the likelihood that the final motion vector lies within a small region around this consolidated predictor is very high. This similarity is measured by confidence scores, which are used in motion estimation methods such as the Advanced Predictive Diamond Zonal Search (APDZS) [32, 33], the Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) [35], and the Enhanced Predictive Zonal Search (EPZS) [36].

1.2.1.4 Early Termination

In addition to initial search point prediction, recent fast motion estimation methods use fixed or adaptive thresholding parameters at various stages of the search process, enabling early termination at a point beyond which any additional benefit in terms of search accuracy is minimal compared to the added computational cost required [35]. An overview and analysis of adaptive thresholding techniques is given in [41]. These thresholds are generally derived from linear or non-linear combinations of the distortion measures at locations neighboring the current macroblock.
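A minimal sketch of the two ideas from Sections 1.2.1.2 and 1.2.1.4: the componentwise median of the three neighboring motion vectors as the initial search point, and an adaptive early-termination test built from neighboring distortion values. The particular threshold formula and its constants are illustrative assumptions, not those of any specific published method.

```python
def median_predictor(mv1, mv2, mv3):
    """Componentwise median of the three known neighboring motion
    vectors (the shaded blocks of Figure 1.4)."""
    xs = sorted([mv1[0], mv2[0], mv3[0]])
    ys = sorted([mv1[1], mv2[1], mv3[1]])
    return (xs[1], ys[1])

def terminate_early(current_sad, neighbor_sads, a=1.1, b=256):
    """Illustrative adaptive threshold: stop the search once the cost
    is no worse than a linear combination of neighboring costs."""
    return current_sad <= a * min(neighbor_sads) + b

print(median_predictor((4, 1), (3, 0), (12, 2)))  # -> (4, 1)
```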
1.2.1.5 Motion Estimation with Subsampled Pixel Patterns

In motion estimation, a measure such as the Sum of Absolute Differences (SAD) is used for finding the point with the lowest (relative) distortion. Its calculation usually takes every pixel within the macroblock into account. While this procedure gives the most accurate results, it might not be necessary to consider every pixel of the area at hand in order to find the point with the lowest distortion. Pixel subsampling methods that perform motion estimation using only a subset of pixels have been proposed in [42-44]. It has been shown that an appropriate choice of subsampling pattern can achieve a substantial reduction in computational requirements without sacrificing quality.

1.2.2 Search Methods

1.2.2.1 Full Search Motion Estimation Algorithm

The objective of motion estimation is to identify motion-based correlation between consecutive frames in order to minimize the number of bits needed for transmission. The Full Search (FS) algorithm is based on exhaustive testing of all candidate search points within the search window in order to find the block position with the minimum distortion [9]. This method, however, involves a very large number of computations, which is unjustifiable for most applications.

1.2.2.2 Diamond Search (DS)

The Diamond Search (DS) algorithm represents the next level of advancement, significantly reducing computational effort while preserving image quality; it employs the two search patterns shown in Figure 1.5 [45]. The Large Diamond Shaped Pattern (LDSP) consists of nine checking points arranged in a diamond shape. The second pattern, the Small Diamond Shaped Pattern (SDSP), consists of five checking points. In the first step, the LDSP is applied repeatedly until the minimum block distortion is located at the centre of the pattern. The second step then applies the SDSP. Figure 1.6 illustrates an example of a search path. Experimental results have shown that the DS algorithm requires only approximately 6% of the computational effort of FS. A similar method is the Hexagon-Based Search (HEXBS), which applies a hexagon-shaped search pattern instead [34].

Figure 1.5: Search Pattern for the Diamond Search

Figure 1.6: Example of a Search Path for the Diamond Search

1.2.2.3 Predictive Motion Vector Field Adaptive Search Technique (PMVFAST)

The Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) [35] marks a further enhancement and is currently the most widely used search method for real-time videoconferencing solutions. It classifies motion into three categories: small, medium, or large. Based on these categories, the algorithm chooses either the large or the small diamond pattern of the Diamond Search. It additionally creates a set of initial search point predictors, from which the best one is eventually chosen for initiating the search. When the SAD value of a macroblock is lower than certain fixed or adaptive thresholds, the algorithm terminates early. PMVFAST requires 0.38% of the computational load of FS along with a quality loss of approximately 0.02 dB.
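The two-stage diamond refinement used by DS, and adaptively by PMVFAST, can be sketched as follows. The cost argument stands for any distortion measure, for example the SAD function sketched earlier; the iteration cap is an illustrative safeguard, not part of the published algorithms.

```python
# Offsets of the Large and Small Diamond Shaped Patterns (LDSP / SDSP).
LDSP = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2), (1, 1), (1, -1), (-1, 1), (-1, -1)]
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def diamond_search(cost, start=(0, 0), max_steps=64):
    """cost(mv) returns the distortion of candidate motion vector mv.
    Step 1: repeat the LDSP until its centre holds the minimum cost.
    Step 2: refine once with the SDSP."""
    centre = start
    for _ in range(max_steps):
        best = min(((centre[0] + dx, centre[1] + dy) for dx, dy in LDSP), key=cost)
        if best == centre:
            break
        centre = best
    return min(((centre[0] + dx, centre[1] + dy) for dx, dy in SDSP), key=cost)
```

With the earlier full-search example, cost could be wrapped as cost = lambda mv: sad(target, ref[y + mv[1]:y + mv[1] + 16, x + mv[0]:x + mv[0] + 16]), subject to frame-boundary checks.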
1.2.2.4 Enhanced Predictive Zonal Search (EPZS)

The EPZS search method [36] improves upon PMVFAST by introducing further predictors and a more efficient selection of early-termination criteria. A proper selection of predictors appears to be a very important feature for speed improvement in these predictive methods. For instance, EPZS adds the horizontally and vertically adjacent motion vectors of the collocated block in the reference frame to the set of predictors, as well as an accelerator motion vector, which is computed differentially from the last reference frame and the one before it. These predictors cover additional highly probable motion characteristics of typical video sequences (e.g., camera zooming) and hence reduce the search effort significantly. Also, EPZS can afford to omit the LDSP and use the SDSP immediately, due to the high correlation of the best predictor with the result of the search step using the LDSP. EPZS requires approximately 0.21% of the computational requirements of FS along with an average loss in PSNR of 0.03 dB.

1.3 Digital Video Transcoding

More and more digital video content is being shared across a multitude of devices. Existing content that has been pre-encoded with a resolution and bitrate optimized for a specific device (e.g., digital television) might exceed the constraints of another target device (e.g., a mobile phone). Transcoding is a process that allows converting compressed video from one format to another. One major application involves transcoding content to a smaller resolution, which enables delivering the video content to the target device most efficiently with regard to bandwidth and visual quality. It is also desirable that the transcoding process be computationally efficient, so that it can be performed in real time and with minimal power consumption. This is especially important when transcoding is performed on a mobile device, e.g., for transmission of existing compressed content over a severely limited channel.

A straightforward approach to transcoding would be to fully decode the pre-encoded stream, downscale the resulting reconstructed frames (e.g., with a bilinear filter), and re-encode the obtained (raw) video sequence with a regular encoder (see Figure 1.7). This process, known as cascaded transcoding, achieves very high compression efficiency but requires a large amount of computation.

Figure 1.7: A Typical Cascaded Pixel-Based Transcoder for Downscaled Transcoding

As an alternative to pixel-based transcoding, a number of studies propose performing the downscaling operation more efficiently in the transform domain [46, 47]. However, downscaling in the transform domain introduces a drift error that grows substantially with increasing GOP size [48, 49]. Although a number of methods have been proposed that aim to compensate for the drift error in previous standards [50-53], the error remains severe in H.264 due to its deblocking filter and quarter-pixel interpolation feature. Pixel-domain transcoding is therefore the much preferred choice for spatially downscaled transcoding.

Characteristics of the downscaled video content, such as motion or texture, are very highly correlated with those of the full-resolution version. We consequently expect that valuable information about efficient coding choices for the underlying content can be derived from the compressed stream and can assist in reducing the burden of re-encoding the reduced-resolution version.
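The sketch below contrasts the plain cascaded pipeline of Figure 1.7 with the information-reusing variant of Figure 1.8 discussed next. All function bodies are hypothetical placeholders introduced only to make the data flow explicit; they do not correspond to any actual codec API.

```python
import numpy as np

def decode(bitstream):
    """Placeholder decoder: returns reconstructed full-resolution frames
    plus the side information carried by the stream (MVs, modes, residuals)."""
    frames = [np.zeros((288, 352), dtype=np.uint8) for _ in range(2)]
    side_info = {"motion_vectors": [], "modes": [], "residuals": []}
    return frames, side_info

def downscale(frame, ratio=2):
    """Placeholder downscaler (simple decimation stands in for filtering)."""
    return frame[::ratio, ::ratio]

def encode(frames, hints=None):
    """Placeholder encoder; 'hints' would steer motion search and mode decision."""
    return b""

def transcode(bitstream, reuse_side_info=False):
    frames, side_info = decode(bitstream)            # full decode (Figure 1.7)
    small = [downscale(f) for f in frames]           # pixel-domain downscaling
    hints = side_info if reuse_side_info else None   # optional reuse (Figure 1.8)
    return encode(small, hints=hints)
```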
Cascaded transcoding as described above, however, does not take advantage of information already present in the pre-encoded stream, such as motion vectors, modes, or residual values, or of content characteristics that can be derived from these elements. Therefore, various methods have been proposed that do reuse or consider information from the pre-encoded stream in order to speed up the re-encoding process [47, 54, 55]. Since motion search and mode decision are the most demanding tasks in video encoding, most of these methods aim at reducing the complexity of these two elements. The proposed methods provide approaches for efficiently deriving a suitable motion vector for a given partition in the downscaled version based on information present in the pre-encoded stream. However, apart from simple mode-mapping or regression-based strategies that are limited to specific downscaling ratios, the challenge of efficiently identifying the most desirable modes for arbitrary downscaling ratios has not been addressed. Figure 1.8 schematically illustrates the concept of sharing information between the full-resolution decoder and the encoder of the downscaled video in addition to the pixel representation of the individual frames.

Figure 1.8: Pixel-based Transcoder Reusing Information from the Pre-Encoded Stream

1.4 Thesis Contributions

The objective of this thesis is to design algorithms tailored to take advantage of the unique features of H.264, allowing greatly reduced complexity for motion estimation and mode decision while maintaining very high compression efficiency. This is especially attractive for the rapidly growing but technologically challenging market of mobile devices. Mobile devices allow the realization of a powerful vision: being able to communicate from anywhere at any time with any type of data. However, supporting computationally demanding video applications on these devices has proven to be a challenge. In addition to being cost-efficient, mobile devices need to be small, lightweight, and ideally have a long battery life. These restrictions considerably limit the energy available for different applications and consequently call for a well-balanced optimization of the required processing power and data storage.

In this thesis, we present three methods that facilitate a more efficient use of the available processing power or energy reservoir for video encoding, especially for devices or applications that are limited in these factors:

• We propose a method for adaptively adjusting the accuracy required for evaluating displacement candidates in the motion estimation process. It reduces the complexity of a step performed very frequently in the encoding process and hence allows for much faster processing. The algorithm complements existing strategies and saves up to 44% of the pixel evaluations performed while very closely matching the compression efficiency of an unmodified encoder.
• We present an algorithm that addresses the challenge of mode decision, a processing step whose computational complexity has grown significantly with the introduction of variable block sizes in H.264. With our proposed method, the evaluation of all 259 possible combinations is adaptively reduced to a much smaller set of the most likely ones. The algorithm takes advantage of strong correlations identified with easily obtainable measures computed during the encoding of natural video scenes. Up to 55% of the evaluations can be saved, which, combined with the overhead required for initializing such evaluations, leads to an almost 60% reduction in motion estimation time.

• We finally propose a method that allows transcoding existing pre-encoded video content from a given (high) resolution to an efficiently compressed stream with a (lower) resolution optimized for the target playback device. In this scheme, we specifically address the complex challenge of mode decision and take advantage of existing information that can be derived with low-complexity measures from the compressed full-resolution stream. Existing methods in the literature, such as mode-mapping or regression-based approaches, have only provided efficient solutions for specific downscaling ratios. Our method additionally offers the highly desired feature of an arbitrary downscaling ratio, hence allowing content providers to serve existing content at a resolution that is optimal for the needs of any target device. With our proposed approach, motion estimation time is reduced by an average of 50% for large-resolution sources and by more than 60% for small-resolution sources.

1.5 Thesis Summary

This thesis has been prepared according to the guidelines set forth by the University of British Columbia for manuscript-based theses. The document contains an introductory chapter, followed by three core chapters containing work published in or prepared for refereed academic journals. The last chapter discusses the conclusions drawn and points to aspects for future work. In the following, we give a summary of the details presented in the core chapters.

In Chapter 2, we present an algorithm that adaptively adjusts the search accuracy required during the motion estimation process for evaluating different candidate displacement vectors. We first present an analysis of correlations between the required search accuracy and measures derived from the underlying video content. We found that strong correlations exist between the magnitude of the final motion vector for a given partition and the magnitude of an early distortion measure corresponding to this partition. Additionally, we found that high accuracy in distortion measurement is only important for areas with large displacement vectors, while those with small displacement vectors can be properly evaluated with a measure of reduced accuracy. These findings allow us to classify the macroblocks within a frame into two groups with different requirements for distortion measurement. As a result, the proposed algorithm reduces the number of pixel comparisons, the most frequently performed operation in motion estimation, by an average of 44% while not degrading picture quality.

In Chapter 3, we focus on the challenge of mode decision, a feature that adds a great degree of complexity to the encoding process in H.264 due to the large number of possible coding combinations. A total of 259 combinations of partition sizes are feasible within H.264 and would need to be evaluated by the encoder for every 16x16 macroblock in order to determine the best one among them. We propose an algorithm that adaptively reduces the set of "efficient" combinations for every macroblock and hence assists the encoder in avoiding evaluations that do not benefit compression efficiency. This is facilitated by a correlation we identified in natural video scenes: the distribution of the rate-distortion cost values obtained from the 16x16 motion search within a frame is highly correlated with the resulting distribution of final block sizes.
By taking advantage of this finding, we are able to avoid 32% of the computational burden in the mode decision process with a minimal penalty in bitrate and quality. When adding a modified scheme similar to that presented in Chapter 2, as much as 55% of the computations can be saved at a moderate increase in bitrate and a virtually equal level of picture quality.

The study presented in Chapter 4 focuses on transcoding existing pre-encoded video content into a compressed stream with a lower resolution. A plethora of devices capable of displaying digital video is emerging in the marketplace. Many of these devices are limited by system constraints such as bandwidth, processing power, storage capacity, or energy reservoir. In order to respect these constraints and to maximize the user experience, the content ought to be served with high compression efficiency and a resolution optimized for the target application. In our study, we propose a method that allows downscaling pre-encoded compressed video to a resolution of arbitrary size with high compression efficiency and significantly reduced computational complexity for the transcoding process. While a number of downscaled transcoding methods have been proposed for earlier standards [56, 57], they either do not address the variable block sizes introduced by H.264 or are limited to a single downscaling ratio (e.g., 2:1). By taking advantage of certain information obtained from the pre-encoded full-resolution video, the method reduces the computational burden of the mode decision process for re-encoding the downscaled content by 46% in terms of pixel comparisons performed, or 65% in terms of search points evaluated, while preserving image quality with only a small drop of 0.06 dB and a slight bitrate increase of 1.1% on average.

1.6 References

[1] ISO/IEC JTC1, "Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s -- Part 2", ISO/IEC 11172 (MPEG-1), vol. 2, 1993.
[2] ITU-T and ISO/IEC JTC 1, "Generic coding of moving pictures and associated audio information – Part 2: Video", ITU-T Rec. H.262 – ISO/IEC 13818-2 (MPEG-2), 1994.
[3] ISO/IEC JTC1, "Coding of audio-visual objects – Part 2: Visual", ISO/IEC 14496-2 (MPEG-4 Part 2), 1999.
[4] ITU-T, "Video coding for low bit rate communication", ITU-T Rec. H.263 v3, 1995.
[5] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264 / ISO/IEC 14496-10 AVC)", vol. 1, 2003.
[6] A. Luthra and P. Topiwala, "Overview of the H.264/AVC video coding standard," in Applications of Digital Image Processing, pp. 417-31, 2003.
[7] U. Reimers, DVB: The Family of International Standards for Digital Video Broadcasting, New York: Springer, 2004.
[8] S. Han, S. Kwon, T. Lee and M. Lee, "Low power motion estimation algorithm based on temporal correlation and its architecture," in Proceedings of the International Symposium on Signal Processing and its Applications, pp. 647-50, 2001.
[9] J. Biemond, L. Looijenga, D.E. Boekee and R.H.J.M. Plompen, "A pel-recursive Wiener-based displacement estimation algorithm", Signal Processing, vol. 13, pp. 399-412, 1987.
[10] A. Puri, H.-M. Hang and D.L. Schilling, "Motion-compensated transform coding based on block motion-tracking algorithm," in IEEE International Conference on Communications, pp. 136-40, 1987.
Schilling, "Motion-compensated transform coding based on block motion-tracking algorithm," in IEEE International Conference on Communications, pp. 136-40, 1987. 22  [11] Y.T. Tse and R.L. Baker, "Global zoom/pan estimation and compensation for video compression," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2725-8, 1991. [12] T. Koga, K. Iinuma, A. Hirano, Y. Iijima and T. Ishiguro, "Motion-compensated interframe coding for video conferencing," in IEEE National Telecommunications Conference (NTC), pp. G5. 3. 1-G5. 3. 5, 1981. [13] R. Li, B. Zeng and M.L. Liou, "A new three-step search algorithm for block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 4, pp. 438-42, 1994. [14] L. Po and W. Ma, "A novel four-step search algorithm for fast block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 313-7, 1996. [15] A. Averbuch and Y. Keller, "Fast motion estimation using bidirectional gradient methods," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3616-19, 2002. [16] Y. Keller, A. Averbuch and O. Miller, "Fast gradient methods based global motion estimation for video compression," in IEEE International Conference on Image Processing (ICIP), pp. 665-8, 2002. [17] J.Y. Tham, S. Ranganath, M. Ranganath and A.A. Kassim, "A novel unrestricted center-biased diamond search algorithm for block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, pp. 369-77, 1998. [18] C. Zhu, X. Lin, L. Chau, H. Ang and C. Ong, "An optimized diamond search algorithm for block motion estimation," in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 488-91, 2002. [19] G. de Haan, P.W.A.C. Biezen, H. Huijgen and O.A. Ojo, "True-motion estimation with 3-D recursive search block matching", IEEE Transactions on Circuits and Systems for Video Technology, vol. 3, pp. 368-79, 1993. 23  [20] K. Chung and L. Chang, "A new predictive search area approach for fast block motion estimation", IEEE Transactions on Image Processing, vol. 12, pp. 648-52, 2003. [21] J.R. Jain and A.K. Jain, "Displacement measurement and its application in interframe image coding", IEEE Transactions on Communications, vol. COM-29, pp. 1799-808, 1981. [22] R. Srinivasan and K.R. Rao, "Predictive coding based on efficient motion estimation", IEEE Transactions on Communications, vol. COM-33, pp. 888-96, 1985. [23] M. Ghanbari, "The cross-search algorithm for motion estimation [image coding]", IEEE Transactions on Communications, vol. 38, pp. 950-3, 1990. [24] J.H. Lim and H.W. Choi, "Adaptive motion estimation algorithm using spatial and temporal correlation," in IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pp. 473-6, 2001. [25] J. Feng, T.Y. Liu, K.T. Lo and X.D. Zhang, "Adaptive motion tracking for fast block motion estimation," in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 219-22, 2001. [26] S. Mietens, P.H.N. de With and C. Hentschel, "Flexible frame-reordering and multi- temporal motion estimation for scalable MPEG encoding in mobile consumer equipment," in IEEE International Conference on Consumer Electronics, pp. 342-3, 2002. [27] D. Alfonso, F. Rovati, D. Pau and L. Celetto, "An innovative, programmable architecture for ultra-low power motion estimation in reduced memory MPEG-4 encoder," in IEEE International Conference on Consumer Electronics, pp. 
344-5, 2002. [28] J. Xu, L. Po and C. Cheung, "New prediction model search algorithm for fast block motion estimation," in IEEE International Conference on Image Processing (ICIP), pp. 610-13, 1997. 24  [29] J. Nam, J. Seo, J. Kwak, M. Lee and Y.H. Ha, "New fast-search algorithm for block matching motion estimation using temporal and spatial correlation of motion vector", IEEE Transactions on Consumer Electronics, vol. 46, pp. 934-942, 2000. [30] C. Cheung and L. Po, "A novel cross-diamond search algorithm for fast block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 1168-77, 2002. [31] K.K. Ma and P.I. Hosur, "Performance Report of Motion Vector Field Adaptive Search Technique (MVFAST)," in ISO/IEC JTC1/SC29/WG11 MPEG99/m5851, 2000. [32] A.M. Tourapis, O.C. Au and M.L. Liou, "Fast motion estimation using circular zonal search", Proceedings of SPIE, vol. 3653, pp. 1496-504, 1998. [33] A.M. Tourapis, O.C. Au, M.L. Liou and G. Shen, "An advanced zonal block based algorithm for motion estimation," in IEEE International Conference on Image Processing (ICIP), pp. 610-14, 1999. [34] C. Zhu, X. Lin and L. Chau, "Hexagon-based search pattern for fast block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 349-55, 2002. [35] A.M. Tourapis, O.C. Au and M.L. Liou, "Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) - Enhancing block based motion estimation", Proceedings of SPIE, vol. 4310, pp. 883-92, 2001. [36] A.M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation", Proceedings of the SPIE, vol. 4671, pp. 1069-79, 2002. [37] J. Tsai, C. Hsieh, S. Weng and M. Lai, "Block-matching motion estimation using correlation search algorithm", Signal Process Image Commun, vol. 13, pp. 119-133, 1998. 25  [38] C.A. Rahman and W. Badawy, "UMHexagonS algorithm based motion estimation architecture for H.264/AVC," in Proceedings of International Workshop on System- on-Chip for Real-Time Applications, pp. 207-10, 2005. [39] Z. Chen, J. Xu, P. Zhou and Y. He, "Hybrid unsymmetrical-cross multi-hexagon- grid search strategy for integer pel motion estimation in H.264," in Picture Coding Symposium, pp. 17-21, 2003. [40] J. Xu, Z. Chen and Y. He, "Efficient fast ME predictions and early-termination strategy based on H.264 statistical characters," in Proceedings of the Joint Conference of the International Conference on Information, Communications and Signal Processing and Pacific-Rim Conference on Multimedia (ICICS-PCM), pp. 218-22, 2003. [41] A.M. Tourapis, O.C. Au and M.L. Liou, "Highly efficient predictive zonal algorithms for fast block-matching motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 934-47, 2002. [42] B. Liu and A. Zaccarin, "New fast algorithms for the estimation of block motion vectors", IEEE Transactions on Circuits and Syst. for Video Technology, vol. 3, pp. 148-57, 1993. [43] H. Jung, C. Hong, J. Choi and Y. Ha, "VLSI architecture for the alternative subsampling-based block matching algorithm", IEEE Transactions on Consumer Electronics, vol. 41, pp. 239-47, 1995. [44] K. Lee, H. Chin, H. Hsu and C. Jen, "QME: An efficient subsampling-based block matching algorithm for motion estimation," in IEEE International Symposium on Circuits and Systems, pp. 305-8, 2004. [45] S. Zhu and K. 
Ma, "A new diamond search algorithm for fast block matching motion estimation," in IEEE International Conference on Information, Communications and Signal Processing (ICICS), pp. 292-6, 1997. 26  [46] P.A.A. Assuncao and M. Ghanbari, "Frequency-domain video transcoder for dynamic bit-rate reduction of MPEG-2 bit streams", IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, pp. 953-967, 1998. [47] Min Li and Bo Wang, "An Efficient H.264 Transcoder with Spatial Downscaling for Wireless Communications," in Wireless Communications, Networking and Mobile Computing, 2009. WiCom '09. 5th International Conference on, pp. 1-4, 2009. [48] J. Xin, C. Lin and M. Sun, "Digital video transcoding", Proc IEEE, vol. 93, pp. 84- 97, 2005. [49] H. Shen, X. Sun, F. Wu, H. Li and S. Li, "A fast downsizing video transcoder for H.264/AVC with rate-distortion optimal mode decision," in 2006 IEEE International Conference on Multimedia and Expo, ICME 2006, July 09,2006 - July 12, pp. 2017- 2020, 2006. [50] T. Qian, J. Sun, D. Li, X. Yang and J. Wang, "Transform domain transcoding from MPEG-2 to H.264 with interpolation drift-error compensation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, pp. 523-534, 2006. [51] M. Pantoja, L. Nam and S. Weijia, "Coefficient conversion for transform domain VC-1 to H.264 transcoding," in 2007 IEEE Workshop on Signal Processing Systems, SiPS 2007, October 17, 2007 - October 19, pp. 363-367, 2007. [52] G. Chen, S. Lin and Y. Zhang, "A fast coefficients convertion method for the transform domain MPEG-2 to H.264 transcoding," in International Conference on Digital Telecommunications 2006, ICDT'06, August 29, 2006 - August 31, 2006. [53] P. Yin, A. Vetro, B. Liu and H. Sun, "Drift compensation for reduced spatial resolution transcoding: A summary", IEEE Circuits and Systems Magazine, vol. 4, pp. 32-36, 2004. [54] P. Yin, M. Wu and B. Liu, "Video transcoding by reducing spatial resolution," in International Conference on Image Processing (ICIP 2000), September 10, 2000 - September 13, pp. 972-975, 2000. 27  [55] J. Youn and M. Sun, "Adaptive motion vector refinement for high performance transcoding," in Part 2 (of 3), October 4, 1998 - October 7, pp. 596-600, 1998. [56] I. Ahmad, X. Wei, Y. Sun and Y. Zhang, "Video transcoding: An overview of various techniques and research issues", IEEE Transactions on Multimedia, vol. 7, pp. 793-804, 2005. [57] D. Lefol, D. Bull and N. Canagarajah, "Performance evaluation of transcoding algorithms for H.264", IEEE Transactions on Consumer Electronics, vol. 52, pp. 215-222, 2006. 28  CHAPTER 2: A FAST MOTION ESTIMATION ALGORITHM FOR LOW PROCESSING POWER APPLICATIONS1 2.1 Introduction In recent years, significant changes in video compression and wireless communications have stirred an evolution in portable multimedia applications. However, video encoding in devices with low processing power, especially mobile devices, has become one of the most difficult challenges for real time applications. Even though more recent video coding standards, such as H.264 / MPEG-4 AVC allow a higher compression efficiency, the added complexity in the encoding stage position them out of reach for some of these devices. It has been estimated that conventional video encoders spend more than 50% of the computational time on motion estimation (ME) [1-3], for the recent H.264 video coding standard, this share goes up to 90%, when all coding options are enabled. 
However, power consumption of video enabled low processing power devices can be reduced by speeding up this critical process. There are three different main types of motion estimation algorithms, the pixel level [4], block level [5], and the global level [6] motion estimation. We have developed a new method for motion estimation which further improves the speed of the motion estimation process and complements existing techniques. Our proposed method makes use of correlation identified between an early-available distortion measure and the expected accuracy required for the motion estimation process.  1 A version of this chapter has been submitted for publication. M. von dem Knesebeck, P. Nasiopoulos, “A Fast Motion Estimation Algorithm for Low Processing Power Applications” 29  This approach assists in assigning suitable accuracy levels to the motion estimation process for every macroblock. In Section 2.2 we describe our new method. Performance evaluation results are presented in Section 2.3, and the conclusions are discussed in Section 2.4. 2.2 Proposed New Motion Estimation Scheme Our objective is to further minimize the search effort involved in motion estimation to levels that are most beneficial for finding a best match given the content at hand. Our first step involves exploring the possibility of using subsampled pixel patterns, which have proven to be a popular means for significantly reducing the search effort [7- 16]. Since motion estimation is heavily based on numerous distortion measurements, reducing the computations needed for this critical step would greatly improve the encoder speed. However, the accuracy reduction in measuring distortions introduced by subsampled pixel patterns should be taken into account so that it does not affect the prediction quality. Figure 2.1 and Figure 2.2 show examples of such patterns, each displaying a quarter of a 16x16 macroblock with the shaded areas indicating the pixels considered for distortion measurement. Additional decimation ratios can be obtained, when combining these patterns. For instance, a decimation ratio of ½ would be obtained when combining pattern 2.1a with pattern 2.1d, or pattern 2.2b with 2.2d. 30   (a)                               (b)  (c)                                (d) Figure 2.1: Four Types of Subsets for Motion Estimation (1/4 of the Macroblock is Shown)  (a)                              (b)  (c)                              (d) Figure 2.2: Four Types of Subsets for Motion Estimation (1/4 of the Macroblock is Shown) It would be beneficial to know whether or not, and under which conditions, these subsampled pixel patterns can be used for accurate motion estimation that will not affect the quality of the search results. To this end, we analyzed the accuracy performance of these patterns (and various combinations of them) and compared it to that of a full set of pixels for a large set of video sequences. This was done by comparing the optimal (final) motion vector from a full search with the full set of pixels used for distortion measurement with the one found using subsampled patterns. Figures 2.3-2.6 illustrate the histograms of four representative sequences showing the average deviation from the optimal motion vector obtained by using a pixel decimation ratio of ½ (combination of 31  pattern 2.1a with 2.1d). Similar results were obtained for the other combinations of patterns shown in Figure 2.1 and Figure 2.2.  
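As a rough illustration of how such subsampled patterns enter the distortion measurement, the Python sketch below computes a masked SAD over a 16x16 macroblock. The checkerboard and quarter masks are stand-in patterns chosen for illustration only; they are not necessarily identical to the subsets of Figure 2.1 and Figure 2.2, and all names are ours rather than part of any reference implementation.

```python
import numpy as np

def sad(block, ref, mask=None):
    """Sum of absolute differences between a block and a reference region.
    If a boolean mask is given, only the marked pixels are compared."""
    diff = np.abs(block.astype(np.int32) - ref.astype(np.int32))
    return int(diff[mask].sum()) if mask is not None else int(diff.sum())

# Illustrative decimation masks for a 16x16 macroblock (stand-in patterns,
# not the exact subsets of Figures 2.1/2.2).
ys, xs = np.indices((16, 16))
half_mask = (ys + xs) % 2 == 0                 # checkerboard: 1/2 of the pixels
quarter_mask = (ys % 2 == 0) & (xs % 2 == 0)   # every other row and column: 1/4

# Example: compare full and decimated distortion measurements on random data.
rng = np.random.default_rng(0)
cur = rng.integers(0, 256, (16, 16), dtype=np.uint8)
ref = rng.integers(0, 256, (16, 16), dtype=np.uint8)
print(sad(cur, ref), sad(cur, ref, half_mask), sad(cur, ref, quarter_mask))
```

A decimation ratio of ½ thus halves the number of pixel comparisons per candidate position, and a ratio of ¼ quarters it, at the price of the accuracy loss analyzed next.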
Figure 2.3: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Akiyo CIF Sequence, 100 Frames)
Figure 2.4: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Coastguard CIF Sequence, 100 Frames)
Figure 2.5: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Foreman CIF Sequence, 100 Frames)
Figure 2.6: Histogram of the Average Error Resulting from Using a Decimated Pixel Pattern (Ratio: ½) for Distortion Measurement (Mobile CIF Sequence, 100 Frames)
[Figures 2.3-2.6 plot the average deviation from the optimum (in pixels) over the optimal motion vector magnitude.]
We observe that the deviation between the motion vectors obtained using the decimated patterns and the optimal motion vectors increases as the magnitude of the optimal motion vectors from the search with the full set of pixels increases. While the error remains relatively small for macroblocks with a small optimal motion vector, it becomes significant for those with a larger motion vector. In order to assess the impact this reduced accuracy would have on typical video sequences, we analyzed the distribution of optimal motion vector magnitudes using four representative CIF video sequences. Table 2.1 shows the number of occurrences and the corresponding percentages of the optimal motion vector magnitudes for the four sequences.
Table 2.1: Occurrences of Optimal Motion Vector Magnitudes in Various Video Sequences
MV Magnitude (up to x pixels) | Akiyo (# / % of MVs) | Coastguard (# / % of MVs) | Mobile (# / % of MVs) | Foreman (# / % of MVs)
0.0 / SKIP | 37,960 / 95.86% | 10,284 / 25.97% | 15,777 / 39.84% | 15,958 / 40.30%
0.5 | 0 / 0.00% | 0 / 0.00% | 0 / 0.00% | 0 / 0.00%
1.0 | 1,409 / 3.56% | 21,351 / 53.92% | 21,985 / 55.52% | 11,602 / 29.30%
1.5 | 134 / 0.34% | 2,740 / 6.92% | 318 / 0.80% | 3,153 / 7.96%
2.0 | 60 / 0.15% | 3,049 / 7.70% | 461 / 1.16% | 3,853 / 9.73%
2.5 | 7 / 0.02% | 997 / 2.52% | 142 / 0.36% | 1,226 / 3.10%
3.0 | 8 / 0.02% | 906 / 2.29% | 414 / 1.05% | 1,676 / 4.23%
3.5 | 10 / 0.03% | 117 / 0.30% | 287 / 0.72% | 747 / 1.89%
4.0 | 6 / 0.02% | 83 / 0.21% | 167 / 0.42% | 761 / 1.92%
4.5 | 6 / 0.02% | 48 / 0.12% | 30 / 0.08% | 330 / 0.83%
5.0 | 0 / 0.00% | 25 / 0.06% | 19 / 0.05% | 294 / 0.74%
We observe that, on average, 90% of the optimal motion vectors have a magnitude below 1.5 pixels. Since decimation patterns lead to a significant speedup of the motion estimation process, it would be very beneficial to apply them to those macroblocks that do not suffer from the resulting accuracy reduction. Our previous analysis unveiled that the use of decimated pixel patterns mainly affects the accuracy of larger motion vectors, while smaller motion vectors are only affected minimally. However, the magnitude of the final motion vector is not known until the end of the motion estimation process.
In order to resolve this problem, we investigated the possibility of finding a measure that is available before or at an early stage of the motion estimation process, which is closely correlated with the magnitude of the final motion vector and/or requires a minimum of additional computations. Many algorithms, including the Diamond Search, are based on the unimodal error surface assumption [17]. When applying this theory to our case, a low level of distortion would suggest that the best match is close to the current position (corresponding to a small motion vector), whereas a high level of distortion would suggest the best match to be further away (corresponding to a large motion vector). In motion estimation, distortion is generally measured by the Sum-of-Absolute-Differences (SAD). The unimodal error surface assumption points to the idea that the distribution of the SAD values, obtained from direct comparison of every macroblock within the current frame with its collocated macroblock in the previous frame, is correlated with the distribution of the optimal motion vector magnitudes within this frame. The first measure obtained during a motion search is an SAD value, subsequently referred to as the Initial SAD Value, at the center of the macroblock (H.263) or at a position shifted by the Predicted Motion Vector (H.264), which itself is computed as the median of the existing surrounding motion vectors. We analyzed this correlation using a two-dimensional cross-correlation analysis according to:

\mathrm{Corr}(i,j,k) = \sum_{m=0}^{M}\sum_{n=0}^{N} MV_{opt}(m,n,k)\cdot SAD_{0}(m+i,\,n+j,\,k)   (2-1)

where i,j is the deviation in pixels from the origin in the horizontal and vertical direction, MVopt denotes the matrix of optimal motion vectors mapped from the corresponding macroblock onto every pixel in frame k with dimensions MxN, and SAD0 denotes the corresponding MxN matrix of Initial SAD Values associated with every pixel contained in the corresponding macroblock for frame k. The graphical representation of the correlation analysis using the Foreman (CIF) sequence is displayed in Figure 2.7.
Figure 2.7: Cross-Correlation Between Initial SAD Values and the Magnitude of the Optimal Motion Vectors (cross-correlation coefficient plotted over the deviation in the horizontal and vertical direction)
We observe that the magnitude of the optimal motion vector and the Initial SAD Value for a given macroblock are indeed correlated within a frame. The same figure also unveils that the correlation remains reasonably strong as we deviate from the initial (unshifted) position. For our application, we can hence deduce that the Initial SAD Value provides an indication of the optimal motion vector for the corresponding macroblock. Given our first observation regarding the high correlation between the required search accuracy and the size of the motion vectors, these Initial SAD Values could consequently be used (instead of the optimal motion vectors) to assign the frame's macroblocks to categories with different accuracy requirements.
A large number of experiments have been performed for different video sequences to validate these findings. In our approach, we divided the range of macroblocks into two categories, a Low Accuracy Category and a High Accuracy Category, initially separated by the mean (μ) of the frame's Initial SAD Values.
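A minimal sketch of this classification step is given below, assuming the per-macroblock Initial SAD Values of a frame have already been gathered. The function and parameter names are illustrative, not taken from the thesis implementation; the parameter t covers both the plain mean split used initially (t = 0) and the refined 'μ + t·σ' thresholds examined next.

```python
import numpy as np

HIGH_ACCURACY, LOW_ACCURACY = "high", "low"

def classify_macroblocks(initial_sads, t=0.0):
    """Split a frame's macroblocks into accuracy categories based on the
    statistics of its Initial SAD Values.  The threshold is mu + t*sigma;
    t = 0 gives the plain mean split, t = -1.25 the threshold adopted
    later in this chapter."""
    sads = np.asarray(initial_sads, dtype=np.float64)
    threshold = sads.mean() + t * sads.std()
    # Macroblocks with an Initial SAD above the threshold are expected to have
    # larger motion vectors and therefore get the more accurate (less
    # decimated) distortion measure.
    return [HIGH_ACCURACY if s > threshold else LOW_ACCURACY for s in sads]

# Example frame: mostly moderate-SAD macroblocks plus two near-static ones.
frame_sads = [500, 520, 480, 510, 490, 505, 10, 20]
# The two near-static macroblocks end up in the Low Accuracy category.
print(classify_macroblocks(frame_sads, t=-1.25))
```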
These experiments were performed using the Full Search algorithm with different combinations of decimated pixel patterns for the Low Accuracy and the High Accuracy Category. In addition to the mean, we performed these experiments with various threshold levels involving the standard deviation (σ) between the two categories, ranging from 'μ-2.0σ' to 'μ+2.0σ' in increments of 0.25σ. The effects of the threshold choices on the number of pixel comparisons performed during motion estimation are displayed in Figure 2.8. This figure shows the average number of pixel comparisons performed per macroblock at different threshold levels for several combinations of pixel patterns and video sequences.
Figure 2.8: Average Number of Pixel Comparisons Needed for Encoding Various Sequences with Different Combinations of Decimated Pixel Patterns (average pixel comparisons per macroblock over the threshold criterion T in 'Mean + T x Standard Deviation')
We observe that the decimated pixel patterns achieve a significant reduction in the pixel comparisons needed for the motion estimation process. As expected, the Half | Half combination reduces the number of pixel comparisons by 50%, while the Quarter | Quarter and the Full | Full combinations define the two extreme boundaries. More importantly, we observe that combining different decimated pixel patterns significantly reduces the number of pixel comparisons. The next logical step is to investigate what effect these choices of patterns and thresholds have on picture quality. Figure 2.9 shows the average quality level achieved (PSNR) for the same patterns and their combinations as in Figure 2.8.
Figure 2.9: Average Quality Obtained from Encoding Various CIF Sequences with Different Combinations of Decimated Pixel Patterns (PSNR in dB over the threshold criterion T in 'Mean + T x Standard Deviation')
We observe that the application of decimated pixel patterns affects picture quality only marginally, with a maximum average penalty of 0.04 dB compared to using the full set of pixels. In our implementation we used the Full Search approach. However, for practical applications, the complexity of the Full Search algorithm significantly exceeds the constraints present in most systems, especially mobile devices. Our method can complement various existing search algorithms, providing them with an additional speed enhancement while preserving image quality and bitrate. This is particularly attractive for applications with real-time or low-power requirements. We implemented our approach complementing EPZS, the most efficient and widely used search method in H.264, and PMVFAST, the most advanced and widely used method in H.263 applications. While H.264 provides much improved compression efficiency over H.263, real-time implementations are still very challenging. On the other hand, existing software implementations of the widespread H.263 CODEC can often perform in real time, but use reduced parameter sets for this purpose.
They could be easily upgraded with our feature and the processing time saved can be invested into broader parameter sets for improved compression efficiency without changing the given system setup itself. The obtained results for PMVFAST are shown in Figure 2.10. This figure shows the quality achieved and the corresponding number of pixel comparisons for all the combinations of patterns and thresholds used in Figure 2.8 and Figure 2.9.  Figure 2.10: Average Quality Achieved Over the Number of Pixel Comparisons Required 2 3 4 5 6 7 8 9 x 10 4 32.4 32.45 32.5 32.55 32.6 32.65   X = 4.08e+004 Y = 32.52 Number of Pixel Comparisons / Macroblock (in Thousands) P S N R (d B ) Full | Full Half | Full Quarter | Full Half | Half Quarter | Half Quarter | Quarter 39  We observe that increasing the number of pixel comparisons results only in a marginal quality improvement. Depending on the specific requirements of the application at hand, the most efficient encoding parameter choices are found along an “efficient frontier” [18] which is characterized by the points which give the highest quality at any given number of pixel comparisons. In our final algorithm we propose to use the point highlighted in Figure 2.10, which corresponds to a substantial reduction in pixel comparisons at a marginal expense in terms of image quality. The specified performance is achieved by a pixel subset with a decimation ratio of ½ for the Diamond Search and the full set of pixels for predictor evaluation in the High Accuracy Category and a decimation ratio of ¼ for the Diamond Search and ½ for the predictor evaluation in the Low Accuracy Category. The threshold between these categories is defined by ‘μ-1.25σ’ of the frame’s Initial SAD Values. While Figure 2.10 only displays the average results from numerous experiments with various test sequences, detailed performance comparisons for every individual video sequence with QCIF resolution are found in Table 2.2 and for CIF resolution in Table 2.3. 40  Table 2.2: Performance of Proposed Algorithm for Various Threshold Levels (QCIF Resolution)  Threshold Choice (ΔPSNR (dB) and Δ% Pixel Comparisons relative to PMVFAST) Sequence   μ -1 .5 σ μ -1 .2 5σ  μ -1 .0 0σ  μ -.7 5σ  μ -.5 0σ  μ -.2 5σ  μ + σ μ +. 25 σ μ +. 50 σ μ +. 75 σ μ +1 .0 0σ  Akiyo Δ PSNR 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 Δ% Compar. -46.34% -46.34% -46.34% -46.34% -46.34% -46.34% -46.34% -46.34% -46.34% -48.78% -48.78% Coastguard Δ PSNR -0.02 -0.02 -0.05 -0.04 -0.03 -0.03 -0.02 -0.04 -0.04 -0.02 -0.02 Δ% Compar. -43.37% -43.54% -44.22% -44.22% -45.92% -46.94% -47.79% -48.64% -49.15% -49.15% -49.66% Container Δ PSNR 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Δ% Compar. -48.65% -48.65% -48.65% -48.65% -48.65% -48.65% -48.65% -51.35% -51.35% -51.35% -51.35% Foreman Δ PSNR 0.03 0.04 -0.01 -0.02 0.00 -0.02 -0.01 -0.05 -0.05 -0.07 -0.03 Δ% Compar. -45.33% -45.33% -46.21% -47.09% -48.32% -48.85% -49.74% -49.91% -50.26% -50.26% -49.74% Grandma Δ PSNR -0.01 -0.01 -0.01 -0.02 -0.02 -0.02 -0.02 -0.01 -0.01 -0.01 -0.01 Δ% Compar. -42.11% -42.11% -42.11% -42.11% -42.11% -43.86% -43.86% -43.86% -43.86% -45.61% -45.61% Miss America Δ PSNR -0.04 -0.04 -0.04 -0.04 -0.04 -0.05 -0.03 -0.05 -0.05 -0.06 -0.01 Δ% Compar. -43.21% -43.21% -43.21% -43.21% -43.21% -43.21% -44.44% -46.30% -46.91% -48.77% -48.77% Mobile Δ PSNR -0.02 -0.02 -0.01 0.00 -0.01 0.00 -0.01 0.00 0.00 -0.01 -0.02 Δ% Compar. 
-48.02% -48.02% -48.02% -47.72% -48.63% -48.33% -47.72% -48.02% -47.72% -48.02% -48.02% Mother & Daughter Δ PSNR -0.01 -0.01 -0.01 0.02 -0.01 -0.02 0.02 0.00 -0.02 0.01 -0.01 Δ% Compar. -42.53% -42.53% -42.53% -43.10% -42.53% -43.10% -44.25% -46.55% -47.13% -47.70% -48.85% News Δ PSNR 0.01 0.01 0.01 0.01 0.01 0.02 0.01 0.01 0.01 0.00 0.01 Δ% Compar. -43.16% -43.16% -43.16% -43.16% -43.16% -44.21% -45.26% -45.26% -46.32% -46.32% -47.37% Salesman Δ PSNR 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.10 0.08 Δ% Compar. -43.75% -43.75% -43.75% -43.75% -43.75% -43.75% -43.75% -43.75% -43.75% -43.75% -43.75% Suzie Δ PSNR -0.03 -0.03 -0.06 -0.04 -0.09 -0.02 -0.06 -0.03 -0.02 -0.03 -0.03 Δ% Compar. -42.00% -42.00% -42.00% -42.72% -42.96% -44.63% -46.06% -46.78% -47.73% -48.21% -48.69% Average Δ PSNR 0.00 0.00 -0.01 0.00 -0.01 0.00 0.00 -0.01 -0.01 -0.01 0.00 Average Δ% Comparisons -44.41% -44.42% -44.56% -44.73% -45.05% -45.62% -46.17% -46.98% -47.32% -47.99% -48.24% Table 2.2 shows that the most desirable trade-off between quality and speed was achieved with a threshold equal to ‘μ -1.25σ’ which corresponds to no change in PSNR and a reduction of pixel comparisons by 44.42% compared to PMVFAST. Even though the speed improvement at ‘μ +1.5σ’ or ‘μ +σ’ appears superior on average at virtually the same level of quality, sequences with diverse motion content such as Foreman were encoded with a noticeably higher quality with the first threshold choice, rendering it more desirable for video with diverse motion content. 41  Table 2.3: Performance of Proposed Algorithm for Various Threshold Levels (CIF Resolution)  Threshold Choice (ΔPSNR (dB) and Δ% Pixel Comparisons relative to PMVFAST) Sequence  μ - 1 .2 5σ  μ -1 .0 0σ  μ -.7 5σ  μ -.5 0σ  μ -.2 5σ  μ +σ  μ +. 25 σ μ +. 50 σ μ +. 75 σ μ +1 .0 0σ  μ +1 .2 5σ  Akiyo Δ PSNR -0.01 0.01 0.01 0.01 -0.01 0.01 0.00 -0.01 -0.02 -0.01 -0.03 Δ% Comparis. -42.36% -42.36% -42.36% -42.36% -42.36% -42.99% -43.95% -43.95% -44.59% -45.22% -46.18% Coastguard Δ PSNR 0.00 -0.02 -0.04 -0.03 -0.04 -0.04 -0.04 -0.04 -0.04 -0.05 -0.05 Δ% Comparis. -45.40% -45.88% -47.00% -47.85% -48.44% -48.90% -49.23% -49.23% -49.37% -49.66% -49.56% Container Δ PSNR 0.02 0.02 0.02 0.01 0.00 0.01 0.00 0.00 -0.01 -0.01 -0.01 Δ% Comparis. -39.13% -39.13% -39.13% -39.13% -39.86% -41.30% -43.12% -45.65% -47.10% -47.83% -47.83% Flower Δ PSNR 0.12 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 Δ% Comparis. -49.43% -49.78% -49.74% -49.88% -49.81% -50.02% -50.09% -50.09% -50.22% -50.22% -50.19% Foreman Δ PSNR 0.01 -0.03 -0.02 -0.04 -0.05 -0.03 -0.04 -0.04 -0.03 -0.02 -0.03 Δ% Comparis. -45.43% -45.77% -46.31% -47.70% -48.68% -48.98% -49.25% -49.49% -49.63% -49.53% -49.66% Highway Δ PSNR 0.00 0.00 0.00 0.01 0.01 -0.03 -0.03 -0.04 -0.04 -0.04 -0.05 Δ% Comparis. -40.89% -40.89% -40.89% -40.46% -41.94% -44.05% -43.96% -46.15% -47.20% -48.16% -48.60% Mobile Δ PSNR -0.01 -0.02 -0.02 -0.03 -0.02 -0.03 -0.03 -0.03 -0.02 -0.03 -0.03 Δ% Comparis. -48.30% -48.73% -48.96% -49.10% -49.47% -49.44% -49.62% -49.70% -49.73% -49.79% -49.81% Paris Δ PSNR -0.04 -0.04 -0.07 -0.03 -0.06 -0.04 -0.04 -0.06 -0.06 -0.04 -0.06 Δ% Comparis. -42.92% -43.24% -43.65% -45.05% -45.54% -46.68% -47.01% -46.85% -48.08% -47.91% -47.75% Waterfall Δ PSNR 0.03 -0.01 0.00 0.00 -0.01 -0.02 -0.01 -0.01 -0.01 -0.01 -0.01 Δ% Comparis. -39.22% -39.43% -39.99% -41.72% -42.77% -43.74% -45.41% -45.83% -46.73% -47.77% -48.33% Avg. Δ PSNR  0.01 -0.01 -0.02 -0.01 -0.02 -0.02 -0.02 -0.03 -0.03 -0.03 -0.03 Avg. Δ% Comp.  
-43.68% -43.91% -44.22% -44.80% -45.43% -46.23% -46.85% -47.44% -48.07% -48.45% -48.66% Results for CIF sequences shown in Table 2.3 indicate that the most desirable trade-off was achieved with a threshold at ‘μ -1.25σ’ as well, achieving an average quality improvement of 0.01dB and a reduction in pixel comparisons by 43.68%. Table 2.4 shows our threshold choice and the pixel decimation ratios identified for the two macroblock categories High Accuracy and Low Accuracy: Table 2.4: Macroblock Categories Category Initial SAD Range  Decimation Ratio for Predictor Eval. Decimation Ratio for Diamond Search High Accuracy µ – 1.25 σ < Initial SAD Full Set of Pixels 1/2 Low Accuracy Initial SAD ≤   µ – 1.25 σ 1/2 1/4 The inclusion of the standard deviation (σ) in our threshold facilitates a highly adaptive thresholding mechanism which accommodates a wide range of motion content. Sequences with diverse motion content, expressed by a large standard deviation, will hence contain a comparatively large portion of macroblocks in the High Accuracy Category which will in turn be searched with a higher level of accuracy than the Low 42  Accuracy Category. Sequences with highly homogeneous motion content, on the other hand, will be searched with less computational effort. H.264 employs variable block sizes from 16x16 down to 4x4. Our analysis unveiled that performing the classification using only one pixel subset for the Low Accuracy Category of partitions 8x8 and smaller does not provide sufficient search accuracy due to the small absolute number of samples taken for these sizes. Using two pixel subsets for this category, however, was identified to be suitable for preserving both image quality and bitrate. Using higher subsampling factors than ¼ for the decimated pixel patterns (such as e.g., 1/8 or 1/16) for large block sizes did not lead to satisfactory results for search accuracy and was hence not further taken into account. For the specific case of zero motion vectors, it was shown in [14] that below a certain SAD value almost all of the motion vectors are found to be zero. By knowing the threshold, this finding can be used to reduce the computational effort further while not compromising image quality. The validity of this observation is confirmed by the graph shown in Figure 2.11 which illustrates the number of zero-vectors or non-zero-vectors over their corresponding Initial SAD Value in a histogram for the Foreman CIF sequence. 43   Figure 2.11: Histogram of the Number of MV with Zero Magnitude or Non-Zero Magnitude, Respectively (Foreman CIF Sequence) We observe that zero-vectors are more concentrated in the low distortion area while non-zero motion vectors are mainly found in the high distortion area. Especially in the very-low end of the histogram, almost all motion vectors are indeed found to be zero. Knowing the threshold below which all motion vectors are zero would facilitate early termination for every macroblock meeting this criterion. In order to determine such a threshold below which we can be confident to only find zero motion vectors, we introduce a Zero Vector Confidence Score. Once a sufficient level of confidence has been reached, all macroblocks with an Initial SAD Value below this threshold can be confidently considered to have a (0,0) MV. Our algorithm works as follows: The threshold value SADZero is initially set to 0. 
While proceeding from macroblock to macroblock in raster-scan order, the threshold value SADZero is replaced by the current SAD value if either the current MV was determined to be (0,0) and the current SAD is larger than the existing value of SADZero, or the current MV is different from (0,0) and the current SAD is smaller than the existing value of SADZero. A confidence parameter R counts how many times the first of these conditions has been met, and is reset to zero once the second condition is met. Once R exceeds 3, we consider the threshold to be solid, and all subsequent macroblocks within the frame with an Initial SAD Value below SADZero are set to (0,0) without performing additional search steps. Our experiments have shown that a reasonable confidence has been reached once three zero-MVs and no non-zero MVs have been found below this threshold.
The next logical step is to investigate the possibility of subdividing the macroblocks into even more categories, hoping this would result in an even more efficient search. Applying a similar approach to the one used for the two categories, we analyzed whether a pixel subset with a higher degree of subsampling would be sufficient for the very low end of the range of Initial SAD Values. Experiments with numerous subsampled patterns and threshold levels were performed but could not achieve a superior quality-speed trade-off. For this reason, we decided that the best classification for our implementation is to use only two categories of macroblocks, divided by the 'μ-1.25σ' threshold of the frame's Initial SAD Values. The two categories so created are a Low Accuracy Category and a High Accuracy Category. In addition to the early-termination criteria of the implemented search algorithms, we apply our proposed Zero Vector Confidence Score, which terminates the search for macroblocks with an Initial SAD Value below the established threshold SADZero.
For the High Accuracy Category, the distortion measurement during the Diamond Search and the predictor evaluation are performed using an alternating pixel decimation scheme with a combination of two of the pixel subsets shown in Figure 2.1 or Figure 2.2. This procedure reduces the number of pixel comparisons by 50% compared to the original approach. The pixel subset patterns are alternated in accordance with the recommendations of [7-9] so that objects no wider than 1 pixel are still recognized. As an example, pattern 2.2a is combined with pattern 2.2d and then alternated with the combination of patterns 2.2b and 2.2c. For the Low Accuracy Category, we found that less accuracy in distortion measurement can be employed. Suitable results were obtained by using only one of the pixel subsets shown in Figure 2.1 and Figure 2.2. This reduces the number of pixel comparisons by 75%. Figure 2.12 shows a detailed flowchart that describes our proposed algorithm. It shows the implementation when patterns 2.2a and 2.2d are applied, but as indicated before, other combinations such as patterns 2.1a and 2.1d could also be used, leading to virtually identical results across the test sequences evaluated.
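The Zero Vector Confidence Score update described above can be summarized in the following minimal sketch; the class and variable names are ours and not taken from the reference software.

```python
class ZeroVectorConfidence:
    """Tracks the SAD threshold below which motion vectors can be assumed to be
    (0,0), following the update rule described above.  R counts zero-MV
    confirmations and is reset whenever a non-zero MV falls below the
    threshold; once enough confirmations have accumulated (three in the
    chapter's experiments), the threshold is considered solid."""

    def __init__(self):
        self.sad_zero = 0   # current estimate of the threshold SADZero
        self.r = 0          # confidence counter R

    def skip_search(self, initial_sad):
        """True if the macroblock's Initial SAD lies below a solid threshold,
        in which case its MV is set to (0,0) without any further search."""
        return self.r >= 3 and initial_sad < self.sad_zero

    def update(self, initial_sad, mv_is_zero):
        """Update the threshold with the outcome of a completed motion search
        (macroblocks are visited in raster-scan order)."""
        if mv_is_zero and initial_sad > self.sad_zero:
            self.sad_zero = initial_sad   # raise the threshold, gain confidence
            self.r += 1
        elif not mv_is_zero and initial_sad < self.sad_zero:
            self.sad_zero = initial_sad   # lower the threshold, reset confidence
            self.r = 0

# Usage sketch: query skip_search() before each macroblock's motion search and
# call update() with the search result whenever a full search was performed.
```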
46   Figure 2.12: Flowchart for Our Proposed Algorithm 2.3 Experimental Results For our tests we used the JM reference encoder v14.2 for the H.264 implementation with EPZS and the Telenor H.263 encoder 1.7 for the PMVFAST implementation. The 47  search range was set to 16. In our tests we used 11 QCIF sequences (Akiyo, Coastguard, Container, Foreman, Grandma, Miss America, Mobile, Mother & Daughter, News, Salesman and Suzie) and 9 CIF sequences (Akiyo, Coastguard, Container, Flower, Foreman, Highway, Mobile, Paris, and Waterfall). We compared our approach with the EPZS method in H.264 and the PMVFAST method in H.263, in terms of PSNR and computational effort, the latter being expressed by the number of pixel comparisons performed per macroblock. Our experiments unveiled that the threshold choices applied for PMVFAST apply equally well to the EPZS algorithm. 2.3.1 Results for QCIF Resolution The QCIF sequences were each encoded for the H.264 CODEC in accordance with [19] with the fixed quantization parameters 16, 20, 24, 28 for EPZS, and for the H.263 CODEC at 25 different bitrates for PMVFAST. The resulting rate-distortion plots are displayed in Figures 2.13-2.16. We observe that picture quality in all plots is not compromised by our method and overall remains equal to that of the original methods. The average PSNR values obtained by our method and PMVFAST for H.263 as well as EPZS for H.264 are displayed in Table 2.5. 48  Table 2.5: Average PSNR (dB) Achieved for QCIF Resolution   H.263 H.264 Sequence  PMVFAST Proposed ΔPSNR EPZS Proposed ΔPSNR Akiyo  43.47 43.47 0.00 41.44 41.42 -0.01 Coastguard  32.34 32.33 -0.01 37.76 37.75 -0.01 Container  38.17 38.18 0.01 39.23 39.22 -0.01 Foreman  33.43 33.39 -0.04 38.43 38.41 -0.02 Grandma  37.60 37.60 0.00 39.86 39.86 -0.01 Miss America  42.38 42.38 0.00 42.23 42.22 -0.01 Mobile   30.75 30.75 0.00 36.99 37.00 0.01 Mother & Daughter  37.71 37.72 0.01 39.42 39.42 0.00 News  37.80 37.77 -0.03 39.87 39.84 -0.03 Salesman  38.98 38.97 -0.01 39.34 39.34 0.00 Suzie  37.22 37.19 -0.03 39.79 39.77 -0.02 Average  37.26 37.26 -0.01 39.49 39.48 -0.01 We observe that our proposed algorithm achieves the same level of picture quality as both PMVFAST and EPZS. The average number of pixel comparisons needed by PMVFAST for H.263 and EPZS for H.264 and our proposed method are shown in Table 2.6. Table 2.6: Average Number of Pixel Comparisons per Macroblock (QCIF Resolution)   H.263 H.264 Sequence  PMVFAST Proposed Δ% EPZS Proposed Δ% Akiyo  6,137 3,188 -48.1% 5,247 2,859 -45.5% Coastguard  21,627 12,108 -44.0% 9,703 5,173 -46.7% Container  6,294 3,229 -48.7% 6,547 3,618 -44.7% Foreman  17,961 9,709 -45.9% 10,590 5,554 -47.6% Grandma  9,004 5,106 -43.3% 5,629 3,131 -44.4% Miss America  22,863 13,121 -42.6% 4,912 2,709 -44.8% Mobile   19,932 10,263 -48.5% 14,778 7,367 -50.1% Mother & Daughter  20,965 11,667 -44.4% 7,566 4,018 -46.9% News  17,442 9,824 -43.7% 7,259 3,859 -46.8% Salesman  10,076 5,489 -45.5% 6,309 3,466 -45.1% Suzie  19,884 11,294 -43.2% 8,952 4,728 -47.2% Average  15,653 8,636 -45.3% 7,954 4,226 -46.3% We observe that the speed improvement of our proposed algorithm amounts to an average of 45% compared to PMVFAST in H.263 and an average of 46% for EPZS in H.264. 
49   Figure 2.13: Rate-Distortion-Plot for the Foreman (QCIF) Sequence (H.263)  Figure 2.14: Rate-Distortion-Plot for the Salesman QCIF Sequence (H.263) 50 100 150 200 250 300 350 400 450 28 29 30 31 32 33 34 35 36 37 Bitrate in kbit/s P S N R (Y )   PMVFAST Proposed 50 100 150 200 250 300 33 34 35 36 37 38 39 40 41 42 43 Bitrate in kbit/s P S N R (Y )   PMVFAST Proposed 50   Figure 2.15: Rate-Distortion Plot for the Foreman QCIF Sequence (H.264)  Figure 2.16: Rate-Distortion Plot for the Salesman QCIF Sequence (H.264) 2.3.2 Results for CIF Resolution Figures 2.17 to 2.20 show the rate-distortion plots for the CIF resolution sequences while Table 2.7 summarizes the results showing the average PSNR values for the test sequences. 200 400 600 800 1000 1200 1400 32 34 36 38 40 42 44 Bitrate in kbit/s P S N R (Y )   EPZS (JM) Proposed 50 100 150 200 250 300 34 36 38 40 42 44 46 Bitrate in kbit/s P S N R (Y )   EPZS (JM) Proposed 51  Table 2.7: Average PSNR (Y) Value for Different Motion Estimation Methods (CIF Sequences)   H.263  H.264 Sequence  PMVFAST Proposed Δ PSNR to PMVFAST  EPZS Proposed Δ PSNR to EPZS Akiyo  38.67 38.66 -0.01 42.15 42.15 0.00 Coastguard  33.68 33.66 -0.02 37.81 37.80 -0.01 Container  34.77 34.78 0.00 39.06 39.06 0.00 Flower  34.19 34.18 -0.01 38.51 38.51 0.00 Foreman  35.25 35.23 -0.02 38.43 38.41 -0.02 Highway  36.87 36.86 -0.01 40.13 40.12 -0.01 Mobile   31.73 31.70 -0.02 37.48 37.48 0.00 Paris   33.76 33.75 -0.01 39.07 39.06 -0.01 Waterfall  31.69 31.69 0.00 37.73 37.72 -0.01 Average  34.51 34.50 -0.01 38.93 38.92 -0.01 We observe that the picture quality resulting from our method remains unchanged (- 0.01 dB). The strength of our proposed algorithm is again displayed when pointing to the average number of pixel comparisons performed per macroblock, which is shown in Table 2.8. Table 2.8: Average Number of Pixel Comparisons per Macroblock for Different Motion Estimation Methods (CIF Sequences)   H.263 H.264 Sequence  PMVFAST Proposed Δ% to PMVFAST EPZS Proposed Δ% to EPZS Akiyo  7.807 4.405 -43,60% 4.290 2.390 -44,28% Coastguard  22.328 11.971 -46,40% 12.311 6.431 -47,76% Container  8.534 5.022 -41,20% 6.678 3.721 -44,28% Flower  15.747 7.864 -50,10% 11.248 5.860 -47,91% Foreman  15.395 8.340 -45,80% 10.590 5.554 -47,55% Highway  11.871 7.064 -40,50% 8.290 4.618 -44,30% Mobile   17.893 9.224 -48,50% 12.703 6.591 -48,12% Paris   11.625 6.418 -44,80% 7.140 3.871 -45,79% Waterfall  14.404 8.767 -39,10% 10.716 5.619 -47,57% Average  13.956 7.675 -44,40% 9.329 4.961 -46,40% Again, our algorithm requires significantly less pixel comparisons than PMVFAST or EPZS. With 7,675 pixel comparisons per macroblock on average compared to 13,956, it reduces one of the major computational steps in motion estimation by 44.4% for 52  PMVFAST and 46.4% for the EPZS method with 9,329 compared to 4,961 pixel comparisons on average.  
Figure 2.17: Rate-Distortion-Plot for the Coastguard CIF Sequence (H.263)  Figure 2.18: Rate-Distortion Plot for the Foreman CIF Sequence (H.263) 500 1000 1500 2000 2500 3000 28 29 30 31 32 33 34 35 36 37   PMVFAST Proposed 200 400 600 800 1000 1200 1400 1600 30 31 32 33 34 35 36 37 38   PMVFAST Proposed 53   Figure 2.19: Rate-Distortion-Plot for the Coastguard CIF Sequence (H.264)  Figure 2.20: Rate-Distortion-Plot for the Foreman CIF Sequence (H.264) 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 32 34 36 38 40 42 44 Bitrate in kbit/s P S N R (Y )   EPZS (JM) Proposed 200 400 600 800 1000 1200 1400 32 34 36 38 40 42 44 Bitrate in kbit/s P S N R (Y )   EPZS (JM) Proposed  54  Our implementation significantly improves the performance of the existing methods with regards to speed improvement, resulting in a mere 0.01 dB reduction in picture quality for the case of CIF sequences. Finally, in general, the proposed algorithm performs particularly well for sequences with diverse motion content such as Mobile. 2.4 Conclusion We have developed a novel algorithm for motion estimation in video compression, which is computationally highly efficient without sacrificing video quality. The proposed method takes advantage of the correlation between the statistical distribution of Initial SAD values within the frame and the motion estimation accuracy required for finding a suitable match. The main features of the proposed motion estimation algorithms are: • methods to classify macroblocks into a Low Accuracy and a High Accuracy category as indicated by the Initial SAD of each macroblock relative to the Initial SAD statistics of the entire frame; • different efficient search criteria for each class of macroblocks; and • use of decimated pixel patterns for reducing computational effort during the motion estimation process. The proposed algorithm has been implemented using the EPZS algorithm in H.264 and the PMVFAST method in H.263. The proposed algorithm is independent of the search algorithms and others can easily be adopted. To further reduce computational effort, we introduced a Zero Vector Confidence Score which measures the validity of a certain threshold below which a motion search is not performed (terminated early).  55  The results of the performance evaluations show that the proposed algorithm significantly reduces the number of computations without degrading picture quality. For the same picture quality, the proposed algorithm reduces the number of pixel comparisons, the most frequently performed operation in motion estimation, by an average of 44%.   56  2.5 References [1] Z. Chen, Z. Peng and H. Yun, "Fast Integer Pel and Fractional Pel Motion Estimation for JVT, JVT-F017", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, pp. 1-14, 2002. [2] D. Wu, F. Pan, K.P. Lim, S. Wu, Z.G. Li, X. Lin, S. Rahardja and C.C. Ko, "Fast intermode decision in H.264/AVC video coding", IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 953-8, 2005. [3] S. Han, S. Kwon, T. Lee and M. Lee, "Low power motion estimation algorithm based on temporal correlation and its architecture," in Proceedings of the International Symposium on Signal Processing and its Applications, pp. 647-50, 2001. [4] J. Biemond, L. Looijenga, D.E. Boekee and R.H.J.M. Plompen, "A pel-recursive Wiener-based displacement estimation algorithm", Signal Processing, vol. 13, pp. 399-412, 1987. [5] A. Puri, H.-. Hang and D.L. 
Schilling, "Motion-compensated transform coding based on block motion-tracking algorithm," in IEEE International Conference on Communications, pp. 136-40, 1987. [6] Y.T. Tse and R.L. Baker, "Global zoom/pan estimation and compensation for video compression," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2725-8, 1991. [7] B. Liu and A. Zaccarin, "New fast algorithms for the estimation of block motion vectors", IEEE Transactions on Circuits and Syst. for Video Technology, vol. 3, pp. 148-57, 1993. [8] H. Jung, C. Hong, J. Choi and Y. Ha, "VLSI architecture for the alternative subsampling-based block matching algorithm", IEEE Transactions on Consumer Electronics, vol. 41, pp. 239-47, 1995.  57  [9] K. Lee, H. Chin, H. Hsu and C. Jen, "QME: An efficient subsampling-based block matching algorithm for motion estimation," in IEEE International Symposium on Circuits and Systems, pp. 305-8, 2004. [10] C. Cheung and L. Po, "Adjustable partial distortion search algorithm for fast block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 100-10, 2003. [11] P. Nasiopoulos and M. von dem Knesebeck, "A Fast Video Motion Estimation Algorithm for the H.264 Standard," in IEEE International Conference on Multimedia and Expo (ICME), pp. 701-4, 2006. [12] C. Cheung and L. Po, "Normalized partial distortion search algorithm for block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, pp. 417-22, 2000. [13] C. Cheung and L. Po, "A hierarchical block matching algorithm using partial distortion measure," in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1237-40, 1997. [14] H. Lee, P. Nasiopoulos and V.C.M. Leung, "Fast Video Motion Estimation Algorithm for Mobile Devices," in IEEE International Conference on Multimedia and Expo (ICME), pp. 370-3, 2005. [15] Y. Chan and W. Siu, "New adaptive pixel decimation for block motion vector estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 113-8, 1996. [16] S. Lee, J. Lee, J. Lee, J. Paik and J. Choi, "Fast partial distortion elimination based on a maximum error constraint for motion estimation," in 2009 IEEE International Conference on Image Processing, ICIP 2009, November 7, 2009 - November 12, pp. 1549-1552, 2009. [17] J.R. Jain and A.K. Jain, "Displacement measurement and its application in interframe image coding", IEEE Transactions on Communications, vol. COM-29, pp. 1799-808, 1981.  58  [18] H. Markowitz, "Portfolio Selection", Journal of Finance, vol. 7, pp. 77-91, 1952. [19] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves", VCEG-M33, pp. 1-12, Apr. 2001.   59  CHAPTER 3: AN EFFICIENT EARLY-TERMINATION MODE DECISION ALGORITHM FOR H.2642 3.1 Introduction Motion Estimation is an important element of the video encoding process, as it enables the identification of motion-based correlation between successive video frames. However, it also accounts for more than 60% of the processing time in current video encoders [1]. H.264, the most advanced coding standard to date, introduces a number of unique features, which result in doubling the encoding efficiency [2]. Some of these features, however, dramatically increase complexity, especially when it comes to motion estimation. Much research has been done in order to perform motion estimation with higher efficiency and reduced computational burden. 
Most of these approaches exploit typical characteristics of natural video scenes including the prevalence of little or no over large movements, horizontal over vertical movements, as well as correlation between motion vectors of neighboring macroblocks [3, 4]. Advanced examples of block matching algorithms include the Diamond Search (DS) [5] and the Hexagon-Based Search (HEXBS) [6]. While these methods were designed for fixed block sizes (typically 16x16 pixels), more recent video coding standards such as H.263+, MPEG-4, and H.264 introduced the concept of variable block sizes. Consequently, a number of strategies have  2 A version of this chapter has been published. M. v. dem Knesebeck, P. Nasiopoulos, “An Efficient Early- Termination Mode Decision Algorithm For H.264”, IEEE Transactions on Consumer Electronics, v 55, n 3, pp. 1501-10, 2009.  60  been introduced in order to exploit this concept efficiently [7-10]. Despite all these efforts, motion estimation continues to be the most significant challenge concerning processing power in video encoding. Any effort assisting in noticeably alleviating the computational burden of this process would facilitate a broad range of new applications and services. This particularly applies to devices with low energy and power reservoir such as mobile devices. We developed a motion estimation method specifically designed for H.264. Our method takes advantage of correlation between the rate-distortion cost values of the 16x16 pixel motion search, subsequently referred to as J16, and the potential lowest cost values that would be obtained if Full Search was used to terminate the motion search before all possible options have been checked. A pixel decimation method yields an additional reduction in computational complexity. Overall, for almost the same picture quality, our method significantly reduces the computational effort needed by today’s H.264 motion estimation implementation. The rest of the paper is structured as follows: Section II presents an overview of motion estimation techniques proposed for H.264. Section III describes our method, its detailed algorithm and an optional speed enhancement method. The experimental results are discussed in Section IV, and Section V presents the conclusion. 3.2 Overview of H.264 Motion Estimation Methods The process of motion estimation involves finding motion-based temporal correlations between consecutive frames in a video sequence. In most current encoders, every video frame is divided into non-overlapping, rectangular blocks (macroblocks).  61  These blocks are compared with regions in one frame or, as in the case of H.264, several reference frames, and the best-matching block in the reference frame is the one that differs the least from the current macroblock. Popular measures for evaluating distortion levels include Mean-Square-Error (MSE), its non-normalized counterpart Sum of Squared Differences (SSD) and Sum of Absolute-Differences (SAD). SAD requires the least amount of computations while providing a performance comparable to MSE [11] and is thus often the preferred choice. H.264 supports seven different block sizes for temporal prediction (16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4). In addition, there is a SKIP mode which does not transmit a residual but instead reuses the content of the previous frame. 
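As a small illustration of this partitioning structure (the helper below is an assumption for illustration, not part of the standard or of the JM reference software), SAD values computed once for the sixteen 4x4 blocks of a macroblock can be summed to obtain the distortion of every larger partition; the Fast Full Search discussed further below reuses 4x4 results in exactly this spirit.

```python
import numpy as np

def partition_sads(sad4x4):
    """Given the 4x4 grid of 4x4-block SAD values of one 16x16 macroblock
    (sad4x4[row][col]), return the aggregated SAD of every larger partition."""
    s = np.asarray(sad4x4, dtype=np.int64)                       # shape (4, 4)
    out = {"16x16": int(s.sum())}
    out["16x8"] = [int(s[:2].sum()), int(s[2:].sum())]           # top, bottom
    out["8x16"] = [int(s[:, :2].sum()), int(s[:, 2:].sum())]     # left, right
    out["8x8"] = [int(s[r:r + 2, c:c + 2].sum())                 # four quadrants
                  for r in (0, 2) for c in (0, 2)]
    out["8x4"] = [int(s[r, c:c + 2].sum()) for r in range(4) for c in (0, 2)]
    out["4x8"] = [int(s[r:r + 2, c].sum()) for r in (0, 2) for c in range(4)]
    out["4x4"] = [int(v) for v in s.ravel()]
    return out
```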
Since dividing a larger block into smaller partitions creates more motion vectors (MVs), which require more bits for encoding (overhead), there is a trade-off between the best match and the amount of overhead information required for smaller block sizes. The best trade-off is achieved by a Lagrangian cost function which assigns a cost to every coding mode [12]:

J = D + \lambda \cdot R   (3-1)

where D represents the distortion introduced by choosing a certain mode, λ is the Lagrangian multiplier, and R is the number of bits required for encoding the obtained residual and motion vector data [13]. The most efficient encoding parameters are identified when the cost J is minimized.
Instead of using the Full Search method (FS), which evaluates distortion for every block size independently, a Fast Full Search method (FFS) is used in the H.264 reference software JM, which computes only the values for the 4x4 block size. For the macroblock at hand, the obtained results are stored and reused for computing the distortion values of larger partition sizes by appropriate summation. Numerous fast motion estimation methods achieve significant computational savings compared to FS. As an example, a reduction of 92% can be achieved with the Unsymmetrical-Cross Multi-Hexagon-Grid Search (UMHexagonS) [7], or 94% with the Enhanced Predictive Zonal Search (EPZS) [14], compared to FFS [15, 16]. These methods are very efficient options available in H.264 for motion estimation at integer-pixel positions. In addition, H.264 defines sub-pixel motion estimation with quarter-pixel resolution [13], which improves matching accuracy and thus increases coding efficiency even further. A popular method available for sub-pixel motion vector refinement in JM is the Center-Biased Fractional Pel Search (CBFPS) [7, 17].
3.3 Proposed Motion Estimation Method
3.3.1 Early Termination of the Mode Decision Process for the 16x16 Partition Size
Currently, in H.264 (JM), the most preferred partition size for every macroblock is identified by comparing the cost of all possible partition sizes in a tree-structured approach, proceeding from large down to small partition sizes as illustrated in Figure 3.1.
Figure 3.1: Mode Decision Process (Simplified) of the JM Reference Software
Determining the cost for every partition size, however, requires a very large number of computations. Our objective is to reduce the computational effort of the motion estimation process by appropriately preventing the encoder from performing steps that do not add information about a better match or that can be inferred from steps performed earlier by virtue of known correlations. Specifically, we want to investigate the possibility of finding some correlation between the search on large partition sizes and the outcomes of searches on smaller partition sizes. This correlation could then be used to estimate the outcome of the search on smaller partition sizes, an approach which in turn is expected to noticeably reduce the computational complexity. To this end, we analyzed the cost values obtained for different partition sizes of every 16x16 macroblock. For a start, we compared the absolute cost values of the 16x16 partition J16 with those of the aggregated 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 partitions, where for every partitioning scheme the obtained cost values of the sub-partitions within the 16x16 macroblock were summed up (e.g., for 16x8, the two cost values for the two 16x8 partitions were added together).
The frequency at which the cost values occur for the ‘Foreman CIF sequence’ with a quantization parameter set to 28 is illustrated in Figure 3.2.  64   Figure 3.2: Histogram of the Absolute Cost Values (Foreman CIF) We observe that the absolute values for the different partition sizes populate an entirely overlapping range of values. Therefore, knowing the absolute cost value for a given macroblock up-front would not give a suitable criterion for determining an optimal partition size. Since this analysis failed to yield any conclusive evidence regarding the final choice of partition sizes, we proceeded to investigate the possibility of finding some correlation between the ‘final partition sizes’ and their associated cost values within a frame. The distribution of the number of different partition sizes and their associated absolute cost values are shown for two representative frames of the Foreman sequence in Figures 3.3 and 3.4. 400 600 800 1,000 1,200 1,400 1,600 1,800 600 3,100 5,100 7,100 9,100 Absolute Cost Values N um be r o f O cc ur re nc es 8x8 16x8 8x16 16x16  65   Figure 3.3: Histogram of Absolute Final Cost Values in Frame 28 (Foreman CIF)   Figure 3.4: Histogram of Absolute Final Cost Values in Frame 243 (Foreman CIF)  We observe that in the lower part of the histograms shown in Figs. 3 and 4, the 16x16 partition sizes are dominant, while smaller block sizes dominate the area with higher cost values. We can conclude that macroblocks in the low distortion area are expected to be assigned to a 16x16 partition size with a high probability, and that the 0 2 4 6 8 10 12 14 0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 Absolute Rate-Distortion-Cost N um be r o f M ac ro bl oc ks 16x16 16x8 8x16 P8x8 sub8x8 0 2 4 6 8 10 12 14 0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 Absolute Rate-Distortion-Cost N um be r o f M ac ro bl oc ks 16x16 16x8 8x16 P8x8 sub8x8  66  additional benefit from investigating smaller partition sizes in this low distortion area is limited. Considering that the H.264 reference encoder investigates the 16x16 partition size first before proceeding to smaller block sizes, it would be desirable to terminate the search early (preferably immediately after the 16x16 search) once the current macroblock has been identified to be in an area exhibiting a low probability for smaller partition sizes. For all experiments throughout the paper, the number of frames used for every test sequence is summarized in Table 3.1: Table 3.1: Properties of Test Sequences Used Format Sequence # of Frames Format Sequence # of Frames C IF  Akiyo 300 Q C IF  Akiyo 299 Coastguard 300 Coastguard 299 Foreman 300 Carphone 382 Mobile 299 Foreman 399 Tempete 260 Salesman 448 Figs. 3.5 and 3.6 show the distribution of the number of macroblocks and their absolute cost values Jfinal for the entire video sequences of Foreman and Mobile. The individual distortion values have been normalized to the respective frame’s average distortion.  67   Figure 3.5: Histogram of Final Cost Values (Foreman CIF Sequence)   Figure 3.6: Histogram of Final Cost Values (Mobile CIF Sequence)  Once more, we observe that in the low distortion area, the 16x16 partition size is clearly dominant. We conclude that knowing the cost value of the ‘final partition size’ Jfinal for all macroblocks of a frame would enable us to determine a threshold below which a search on partition sizes smaller than 16x16 would only add a limited bitrate advantage compared to the computational effort required. 
However, since the final cost values Jfinal are not known before the entire decision process has been completed, it is desirable to identify a different measure (other than the final cost value) which, on one hand, is closely correlated with the final cost value Jfinal of every macroblock and, on the other hand, is available without the need to go through the entire search process. In an attempt to identify such a measure, we analyzed the correlation between the cost obtained for the 16x16 partition size, J16 (being the first value provided in the encoding process), and the final cost value Jfinal for the chosen partition structure according to [18]:

ρ_k = cov_k(J16, Jfinal) / (σ_k,J16 · σ_k,Jfinal)  (3-2)

where ρ_k is the correlation coefficient obtained for frame k, cov_k(J16, Jfinal) is the covariance between the 16x16 cost values J16 and the final cost values Jfinal in frame k, σ_k,J16 is the standard deviation of the 16x16 cost values J16 in frame k, and σ_k,Jfinal is the standard deviation of the final cost values in frame k. This correlation coefficient was calculated for every P frame of the Foreman and Mobile video sequences (I-frame interval 15) and is illustrated in Figures 3.7 and 3.8. The graphs have been generated using the results from an analysis with four different quantization parameters (28, 32, 36, and 40).

Figure 3.7: Correlation Between 16x16 Cost Values J16 and Final Rate-Distortion Cost Jfinal (Foreman CIF Sequence)

Figure 3.8: Correlation Between 16x16 Cost Values J16 and Final Rate-Distortion Cost Jfinal (Mobile CIF Sequence)

For both sequences, we observe a very strong correlation, with coefficients ranging from 0.86 to 0.99 and averages of 0.964 (QP 28), 0.973 (QP 32), 0.966 (QP 36), and 0.98 (QP 40) for Foreman, and 0.982 (QP 28), 0.978 (QP 32), 0.984 (QP 36), and 0.981 (QP 40) for Mobile. Note that for the vast majority of frames, the correlation coefficient deviates only minimally from the average. The few exceptions visible in Figures 3.7 and 3.8 are attributed to significant changes within the frame when fast objects abruptly appear in the scene.

Figure 3.9 contrasts the 16x16 cost value J16 against the final cost value Jfinal in a scatter plot (QP 28) for the Mobile sequence. Considering that the 45-degree line would indicate the case of a perfect correlation within the scatter plot, we consistently observe that the correlation is particularly strong for low distortion values, a fact that will be taken into account in our proposed method.

Figure 3.9: 16x16 Cost Values vs. Final Cost Values (Mobile CIF, QP 28)

This leads to the conclusion that J16 is indeed closely correlated with Jfinal for every macroblock and is available without the need to go through the entire search process [19].
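The frame-level correlation coefficient of Equation (3-2) can be computed directly from the two per-macroblock cost arrays, as in the minimal C sketch below (a plain sample covariance over the n macroblocks of frame k; the function and array names are illustrative, and the costs are assumed not to be all identical so that the denominator is non-zero).

#include <math.h>
#include <stddef.h>

/* Correlation coefficient of Equation (3-2) between the 16x16 cost values
 * j16[] and the final cost values jfinal[] of the n macroblocks of a frame.
 * The 1/n factors of covariance and standard deviation cancel out. */
static double frame_correlation(const double *j16, const double *jfinal, size_t n)
{
    double m16 = 0.0, mf = 0.0;
    for (size_t i = 0; i < n; i++) { m16 += j16[i]; mf += jfinal[i]; }
    m16 /= (double)n;
    mf  /= (double)n;

    double cov = 0.0, v16 = 0.0, vf = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = j16[i] - m16, b = jfinal[i] - mf;
        cov += a * b;
        v16 += a * a;
        vf  += b * b;
    }
    return cov / (sqrt(v16) * sqrt(vf));
}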
Table 2 shows the frequency of occurrence of the different partition sizes for several different CIF video sequences, and Table 3 summarizes the average for the Akiyo, Carphone, Coastguard, Foreman, and Salesman QCIF video sequences, respectively.  72  Table 3.2: Partition Sizes Chosen for P-Frames (CIF Sequences) Sequence QP SKIP 16x16 16x8 8x16 P8x8 sub8x8 Intra 16x16 Intra 4x4 Akiyo 28 83% 9% 3% 3% 0% 2% 0% 0% (300 frms.) 32 87% 8% 2% 2% 0% 1% 0% 0% 36 91% 7% 1% 1% 0% 0% 0% 0% 40 94% 5% 0% 0% 0% 0% 0% 0% Coastguard 28 15% 38% 13% 13% 2% 18% 0% 1% (300 frms) 32 28% 36% 11% 11% 2% 9% 1% 1% 36 48% 29% 9% 8% 1% 3% 1% 0% 40 65% 23% 5% 4% 0% 1% 2% 0% Foreman 28 33% 33% 9% 11% 1% 8% 1% 1% (300 frms) 32 45% 31% 8% 8% 1% 4% 2% 1% 36 57% 28% 6% 5% 0% 2% 2% 1% 40 67% 23% 4% 3% 0% 1% 2% 0% Mobile 28 7% 30% 17% 16% 3% 27% 0% 0% (299 frms) 32 13% 36% 16% 17% 2% 16% 0% 0% 36 26% 41% 12% 13% 1% 7% 0% 0% 40 45% 38% 6% 7% 1% 3% 0% 0% Average 28 34% 28% 10% 11% 2% 14% 0% 1% 32 44% 28% 9% 9% 1% 7% 1% 0% 36 55% 26% 7% 7% 1% 3% 1% 0% 40 68% 22% 4% 4% 0% 1% 1% 0% % of macroblocks chosen for every mode   Table 3.3: Partition Sizes Chosen for P-Frames (QCIF Sequences) Sequence QP SKIP 16x16 16x8 8x16 P8x8 sub8x8 Intra 16x16 Intra 4x4 Average 28 48% 24% 8% 9% 1% 10% 0% 0% 32 58% 24% 6% 7% 1% 4% 0% 0% 36 68% 22% 4% 4% 0% 1% 0% 0% 40 77% 18% 2% 2% 0% 0% 0% 0% % of macroblocks chosen for every mode  We observe that the 16x16 partition size and the SKIP mode are by far the most frequent final choice. This fact reinforces our decision that it is advantageous to find a way of terminating the search process early, at a point that would result in significant computational savings at a small bitrate penalty.  73  The original H.264 reference encoder performs motion estimation on a macroblock-by-macroblock basis. If we are to use the knowledge of the 16x16 cost values as a measure for early search termination, then the encoding process has to be modified so that all of these values are calculated for the entire frame. Note that changing the processing order for the 16x16 cost values does not add additional computational steps to the encoding process. However, the identified J16 cost for every macroblock within the current frame and its associated motion vector for this block-size have to be kept in memory in order to be available for the mode decision of the investigated macroblock. This leads to an added memory requirement for every macroblock of 4 bytes for the cost value and 2 x 2 bytes for the motion vector information, resulting in added memory requirements of approximately 3kB for CIF sequences or 784 bytes for QCIF sequences. The next decision to be made is to determine the threshold value below which the motion estimation process on partition sizes smaller than 16x16 is not pursued. A large number of pertinent tests was performed using various thresholds for the J16 cost value which are presented in Table 3.4 for CIF sequences (Akiyo, Coastguard, Foreman, Mobile, Tempete) and in Table 3.5 for QCIF sequences (Akiyo, Carphone, Coastguard, Foreman, Salesman). More specifically, the tables show the threshold criterion chosen for terminating the encoder early for partitions lower than 16x16, the average PSNR quality in dB, the bitrate in kbps, the total encoding time, and the number of comparisons performed per macroblock (CMP/MB), along with the differences from the EPZS algorithm (ΔPSNR, ΔBitrate, ΔEncTime, ΔCMP/MB). 
The tables detail the average results obtained using four different quantization parameters (28, 32, 36, and 40). Note  74  that our method can serve as a speed-enhancement supplement to various existing fast motion estimation algorithms. In this case, we chose EPZS as a base for comparison. A similar performance has been verified for UMHexagonS (see Table 3.17), and the same is expected for other search algorithms. Table 3.4: Early Termination for 16x16 Partitions (CIF, Average of Encoding with QP 28, 32, 36, 40) Threshold PSNR (dB) ΔPSNR (dB) Bitrate (kbps) ΔBitrate (%) EncTime (s) ΔEncTime (%) CMP/MB ΔCMP/MB (%) EPZS 31.18 352.5 400.3 232,429 μ - σ 31.19 0.01 352.73 0.0% 365.2 -8.4% 220,088 -5.2% μ - 0.75σ 31.18 0.00 353.03 0.1% 334.9 -16.1% 203,448 -12.3% μ - 0.5σ 31.17 -0.01 353.74 0.2% 301.6 -24.3% 182,285 -21.4% μ - 0.25σ 31.15 -0.03 355.33 0.6% 259.8 -35.2% 159,457 -31.5% μ 31.12 -0.06 357.05 1.0% 225.4 -43.7% 138,250 -40.5% μ + 0.25σ 31.10 -0.08 359.22 1.5% 195.6 -50.9% 119,594 -48.3% Table 3.5: Early Termination for 16x16 Partitions (QCIF, Average of Encoding with QP 28, 32, 36, 40) Threshold PSNR (dB) ΔPSNR (dB) Bitrate (kbps) ΔBitrate (%) EncTime (s) ΔEncTime (%) CMP/MB ΔCMP/MB (%) EPZS 31.44 50.5 88.0 217,534 μ - σ 31.44 0.00 50.38 -0.1% 77.8 -11.0% 205,266 -5.5% μ - 0.75σ 31.44 0.00 50.40 -0.1% 72.6 -17.2% 192,548 -11.4% μ - 0.5σ 31.43 -0.01 50.45 0.0% 66.3 -24.6% 175,898 -19.1% μ - 0.25σ 31.41 -0.03 50.57 0.1% 58.8 -33.3% 156,679 -28.0% μ 31.39 -0.05 50.84 0.5% 51.6 -41.5% 137,210 -36.8% μ + 0.25σ 31.37 -0.07 51.07 0.8% 46.7 -46.4% 119,765 -44.7% We observe that substantial computational savings can be achieved by separating the frame’s macroblocks into two categories with different search termination criteria. The threshold used for separating the macroblocks terminated at 16x16 and the ones for which the encoder proceeds further, is a combination of the mean (μ) and standard deviation (σ) of the J16 values of the entire frame. The two categories defined by this threshold are a low distortion category exhibiting a clear dominance of 16x16 partition sizes and a high distortion category with all partition sizes found. Depending on the  75  constraints of the target application, a different choice for the threshold might be considered optimal. We propose to use ‘μ-0.75σ’ which preserves picture quality and bitrate while allowing computational savings of 12.3% (CIF) / 11.4% (QCIF) on average compared to the original JM  implementation, respectively. Figures 3.10 and 3.11 show the number of macroblocks over their respective level of distortion for the Foreman, Mobile and Coastguard sequence, normalized around the proposed threshold ‘μ-0.75σ’, as opposed to the current frame’s average chosen for normalization in Figs. 3.5 and 3.6. These figures suggest that the mode assignment differs from the optimum for a certain number of macroblocks in the low distortion area (a 16x16 block size was chosen where a smaller block size would have been optimal), and a penalty in terms of residual energy would be expected. We found that assigning a non-optimal mode in the low rate- distortion-cost area only has a small effect on video quality. The experiments confirmed that this penalty is minimized with an appropriate choice for the threshold and can be considered small with respect to the computational savings achieved.  0 1,000 2,000 3,000 4,000 5,000 0 1 2 3 4 5 Rate-Distortion-Cost (relative to Mean-0.75xStd.Dev.) 
Figure 3.10: Mode Histogram for Mixed Motion Content (Foreman CIF)

Figure 3.11: Macroblock Histogram for High Motion Content (Mobile CIF)

As expected, we observe that the number of macroblocks with a partition size smaller than 16x16 is comparatively small below the threshold, whereas the majority of them are found above it. Based on the significant computational savings obtained by the proposed approach, we concluded that it would be desirable to investigate the possibility of introducing a second threshold within the high distortion area which would allow us to terminate the search for certain smaller block sizes.

3.3.2 Early Termination of the Mode Decision Process for Partition Sizes Larger than or Equal to 8x8

The histograms shown in Figs. 3.5 and 3.6 suggest that mode assignments for partition sizes smaller than 8x8 only commence in even higher distortion areas than the larger partition sizes. We therefore investigated the possibility of defining a second threshold within the high cost area that would again allow separating the macroblocks, this time between large to medium-sized partition sizes (16x16, 16x8, 8x16, 8x8) and those including very small partition sizes (8x4, 4x8, 4x4). Experiments similar to the ones in subsection 3.3.1 were performed with the objective of identifying such a suitable second threshold.

1) Configuration A (First threshold fixed at ‘μ-0.5σ’): Tables 3.6 and 3.7 present the results for the case when the first threshold was fixed at ‘μ-0.5σ’ and the second threshold was varied between ‘μ’ and ‘μ+σ’ of the underlying J16 cost values. Table 3.6 shows the results for CIF resolution video sequences, and Table 3.7 for QCIF sequences.
Table 3.6: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of ‘μ-0.5σ’ (CIF, Average of Encoding with QP 28, 32, 36, 40) Sequence Threshold PSNR (dB) ΔPSNR (dB) Bitrate (kbps) ΔBitrate (%) EncTime (s) ΔEncTime (%) CMP/MB ΔCMP/MB (%) Akiyo EPZS 35.61 37.9 281.7 151,086 (300 frms) μ 35.61 0.00 37.7 -0.5% 147.4 -47.7% 105,260 -30.3% μ + 0.25σ 35.61 0.00 37.8 -0.3% 143.6 -49.0% 102,354 -32.3% μ + 0.5σ 35.61 0.00 37.7 -0.4% 136.4 -51.6% 99,515 -34.1% μ + 0.75σ 35.61 0.00 37.8 -0.1% 129.9 -53.9% 96,185 -36.3% μ + σ 35.61 0.00 37.8 -0.1% 124.4 -55.9% 93,155 -38.3% Coastguard EPZS 29.84 414.6 482.4 277,308 (300 frms) μ 29.82 -0.02 417.1 0.6% 214.8 -55.5% 176,657 -36.3% μ + 0.25σ 29.82 -0.02 417.5 0.7% 201.6 -58.2% 165,451 -40.3% μ + 0.5σ 29.82 -0.02 417.2 0.6% 181.6 -62.3% 156,859 -43.4% μ + 0.75σ 29.82 -0.02 417.5 0.7% 184.3 -61.8% 150,847 -45.6% μ + σ 29.81 -0.03 417.0 0.6% 162.9 -66.2% 146,636 -47.1% Foreman EPZS 32.17 203.3 369.4 254,679 (300 frms) μ 32.15 -0.02 203.5 0.1% 199.8 -45.9% 173,081 -32.0% μ + 0.25σ 32.14 -0.03 203.4 0.0% 186.5 -49.5% 164,877 -35.3% μ + 0.5σ 32.13 -0.04 203.4 0.1% 175.0 -52.6% 157,715 -38.1% μ + 0.75σ 32.13 -0.04 203.8 0.2% 164.9 -55.4% 151,541 -40.5% μ + σ 32.13 -0.04 204.1 0.4% 156.6 -57.6% 146,153 -42.6% Mobile EPZS 28.47 633.7 452.5 240,719 (299 frms) μ 28.43 -0.04 635.2 0.2% 197.8 -56.3% 156,489 -35.0% μ + 0.25σ 28.42 -0.05 635.4 0.3% 183.3 -59.5% 147,759 -38.6% μ + 0.5σ 28.41 -0.06 635.6 0.3% 170.9 -62.2% 140,161 -41.8% μ + 0.75σ 28.41 -0.06 635.7 0.3% 160.3 -64.6% 133,714 -44.5% μ + σ 28.40 -0.07 636.2 0.4% 152.9 -66.2% 128,263 -46.7% Tempete EPZS 29.81 471.8 415.4 238,355 (260 frms) μ 29.78 -0.03 471.5 -0.1% 181.4 -56.3% 170,755 -28.4% μ + 0.25σ 29.77 -0.04 470.3 -0.3% 166.7 -59.9% 160,502 -32.7% μ + 0.5σ 29.77 -0.04 470.2 -0.3% 154.3 -62.8% 151,796 -36.3% μ + 0.75σ 29.76 -0.05 470.1 -0.4% 144.9 -65.1% 145,035 -39.2% μ + σ 29.77 -0.04 469.9 -0.4% 137.6 -66.9% 139,783 -41.4% EPZS 31.18 352.3 400.3 232,429 μ 31.16 -0.02 353.0 0.2% 188.3 -53.0% 156,448 -32.7% μ + 0.25σ 31.15 -0.03 352.9 0.2% 176.3 -55.9% 148,189 -36.2% μ + 0.5σ 31.15 -0.03 352.8 0.2% 163.6 -59.1% 141,209 -39.2% μ + 0.75σ 31.15 -0.03 353.0 0.2% 156.9 -60.8% 135,464 -41.7% μ + σ 31.15 -0.03 353.0 0.2% 146.9 -63.3% 130,798 -43.7% Average    78  Table 3.7: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of ‘μ-0.5σ’ (QCIF, Average of Encoding with QP 28, 32, 36, 40) Threshold PSNR (dB) ΔPSNR (dB) Bitrate (kbps) ΔBitrate (%) EncTime (s) ΔEncTime (%) CMP/MB ΔCMP/MB (%) EPZS 31.44 50.5 88.0 217,534 μ 31.42 -0.02 50.6 0.2% 57.6 -34.5% 151,638 -30.3% μ + 0.25σ 31.42 -0.02 50.6 0.3% 53.4 -39.3% 143,751 -33.9% μ + 0.5σ 31.41 -0.03 50.6 0.3% 50.0 -43.2% 136,893 -37.1% μ + 0.75σ 31.40 -0.03 50.7 0.4% 47.8 -45.7% 131,277 -39.7% μ + σ 31.40 -0.03 50.8 0.6% 44.8 -49.1% 126,748 -41.7% We observe that for the same picture quality, introducing a second threshold provides some additional computational savings compared to the case of only one threshold. For instance, the level with an average quality drop of 0.03 dB achieved computational savings of 31.5% (CIF) / 28.0% (QCIF) when using only one threshold and 43.7% (CIF) / 41.7% (QCIF) for the case with a second threshold. Once more, depending on the major constraints of the target application, a different threshold might be preferred considering the tradeoff between quality and computational savings. 
2) Configuration B (First threshold fixed at ‘μ-0.75σ’): In Tables 3.8 and 3.9, we present the results for the case when the first threshold is fixed at ‘μ-0.75σ’ and the second threshold is again varied between ‘μ’ and ‘μ+σ’. We observe that Configuration B provides a slightly preferred performance than Configuration A in terms of computational savings and bitrate at any given level of picture quality. More specifically, the Tempete CIF sequence can be encoded in the current case with a quality drop of 0.03dB at computational savings of 37.0% and a marginal improvement in bitrate of 0.5% while with Configuration A, this quality level was only achievable with computational savings of 28.4% and a 0.1% improvement in bitrate. When we compare the averages of Configuration A with Configuration B for CIF sequences, we observe that, for the same picture quality (approximately 0.02dB loss), we  79  gain a slight advantage in computational savings of approximately 0.8% (33.5% vs. 32.7%) and 0.4% in bitrate. For these reasons we propose to use ‘μ-0.75σ’ as the first threshold in our final algorithm for CIF sequences. The choice of the second threshold may depend on the requirements of the specific application. For evaluation purposes, we propose to adopt the ‘μ+0.5σ’, which results in average computational savings of 33.5% while sacrificing only 0.02dB in picture quality on average and achieving 0.2% in bitrate improvement.  80  Table 3.8: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of ‘μ-0.75σ’ (CIF, Average of Encoding with QP 28, 32, 36, 40) Sequence Threshold PSNR (dB) ΔPSNR (dB) Bitrate (kbps) ΔBitrate (%) EncTime (s) ΔEncTime (%) CMP/MB ΔCMP/MB (%) EPZS 35.61 37.9 281.7 151,086 μ 35.61 0.00 37.8 -0.2% 161.5 -42.7% 117,400 -22.3% μ + 0.25σ 35.60 -0.01 37.8 -0.2% 155.4 -44.8% 114,552 -24.2% μ + 0.5σ 35.61 0.00 37.8 -0.3% 149.8 -46.8% 111,623 -26.1% μ + 0.75σ 35.60 -0.01 37.7 -0.5% 143.5 -49.1% 108,347 -28.3% μ + σ 35.60 -0.01 37.8 -0.4% 137.8 -51.1% 105,241 -30.3% EPZS 29.84 414.6 482.4 277,308 μ 29.83 -0.01 415.2 0.2% 217.0 -55.0% 194,821 -29.7% μ + 0.25σ 29.82 -0.02 414.9 0.1% 199.6 -58.6% 183,679 -33.8% μ + 0.5σ 29.82 -0.02 414.8 0.1% 186.5 -61.3% 175,293 -36.8% μ + 0.75σ 29.82 -0.02 414.8 0.0% 188.6 -60.9% 169,098 -39.0% μ + σ 29.82 -0.02 414.8 0.1% 180.0 -62.7% 164,901 -40.5% EPZS 32.17 203.3 369.4 254,679 μ 32.15 -0.02 203.0 -0.1% 214.7 -41.9% 187,712 -26.3% μ + 0.25σ 32.15 -0.02 203.0 -0.2% 199.9 -45.9% 179,578 -29.5% μ + 0.5σ 32.15 -0.02 202.9 -0.2% 188.2 -49.0% 172,219 -32.4% μ + 0.75σ 32.14 -0.03 203.2 0.0% 178.3 -51.7% 166,064 -34.8% μ + σ 32.14 -0.03 203.5 0.1% 169.9 -54.0% 160,796 -36.9% EPZS 28.47 633.7 452.5 240,719 μ 28.45 -0.02 632.5 -0.2% 219.4 -51.5% 168,513 -30.0% μ + 0.25σ 28.44 -0.03 632.3 -0.2% 205.4 -54.6% 159,720 -33.6% μ + 0.5σ 28.44 -0.03 633.0 -0.1% 183.2 -59.5% 152,150 -36.8% μ + 0.75σ 28.44 -0.03 632.5 -0.2% 171.7 -62.1% 145,679 -39.5% μ + σ 28.43 -0.04 633.2 -0.1% 162.6 -64.1% 140,391 -41.7% EPZS 29.81 471.8 415.4 238,355 μ 29.80 -0.01 471.1 -0.1% 191.2 -54.0% 181,205 -24.0% μ + 0.25σ 29.79 -0.02 471.0 -0.2% 175.2 -57.8% 170,817 -28.3% μ + 0.5σ 29.79 -0.02 470.1 -0.4% 162.9 -60.8% 162,083 -32.0% μ + 0.75σ 29.78 -0.03 469.8 -0.4% 153.3 -63.1% 155,323 -34.8% μ + σ 29.78 -0.03 469.3 -0.5% 146.1 -64.8% 150,211 -37.0% EPZS 31.18 352.3 400.3 232,429 μ 31.17 -0.01 351.9 -0.1% 200.8 -49.8% 169,930 -26.9% μ + 0.25σ 31.16 -0.02 351.8 -0.1% 187.1 -53.3% 161,669 -30.4% μ + 0.5σ 31.16 -0.02 351.7 -0.2% 174.1 -56.5% 154,673 -33.5% μ + 0.75σ 31.15 
-0.03 351.6 -0.2% 167.1 -58.3% 148,902 -35.9% μ + σ 31.15 -0.03 351.7 -0.2% 159.3 -60.2% 144,308 -37.9% Foreman Mobile Akiyo Coastguard Tempete Average    81  Table 3.9: Early Termination of the Encoder for Partition Sizes > 8x8 at a Threshold of ‘μ-0.75σ’ (QCIF, Average of Encoding with QP 28, 32, 36, 40) Threshold PSNR (dB) ΔPSNR (dB) Bitrate (kbps) ΔBitrate (%) EncTime (s) ΔEncTime (%) CMP/MB ΔCMP/MB (%) EPZS 31.44 50.5 88.0 217,534 μ 31.42 -0.01 50.5 0.0% 60.7 -31.0% 162,156 -25.5% μ + 0.25σ 31.43 -0.01 50.5 0.1% 56.5 -35.8% 154,234 -29.1% μ + 0.5σ 31.42 -0.01 50.5 0.1% 53.3 -39.5% 147,312 -32.3% μ + 0.75σ 31.42 -0.02 50.6 0.3% 49.9 -43.2% 141,728 -34.8% μ + σ 31.41 -0.03 50.6 0.3% 47.8 -45.7% 137,257 -36.9%  The same finding was obtained for QCIF sequences. For instance, for an average quality penalty of 0.02dB, Configuration A achieves savings of 33.9% at a bitrate increase of 0.3% while Configuration B would achieve the same level of quality and bitrate with an average of 34.8% in computational savings. We hence also propose to use ‘μ-0.75σ’ for the lower threshold, and ‘μ+0.5σ’ for the upper threshold. The validity of this proposed choice is illustrated in Figure 3.12 for the Foreman CIF sequence. This figure presents the quality level achieved for all variations of thresholds investigated over the number of pixel comparisons expended per macroblock. For instance, the proposed threshold choice achieves a PSNR of 32.157 dB at the expense of 185,650 pixel comparisons per macroblock. The original EPZS implementation achieved a quality level of 32.17 dB at the expense of 254,679 pixel comparisons per macroblock. A reduction in pixel comparisons by 27% led to a marginal quality drop of 0.013 dB.  82   Figure 3.12: Level of Quality Achieved for All Threshold Choices Investigated  (Foreman, CIF, QP 32) Every threshold combination located on or near the upper bound of the convex hull of these points yields the highest quality at a given level of pixel comparisons and would hence be considered efficient [20]. Depending on the constraints of the application, a different choice among these efficient combinations might be preferred. Our proposed combination is highlighted by a large circle mark in the graph. The flow chart of our entire algorithm for CIF sequences is shown in Figure  3.13. 33.94 33.95 33.96 33.97 33.98 33.99 34.00 34.01 34.02 34.03 150,000 175,000 200,000 225,000 250,000 275,000 300,000 Pixel Comparisons per Macroblock PS N R  (d B ) Proposed EPZS All Threshold Choices  83   Figure  3.13: Flowchart of the Proposed Algorithm Note that in the optimal case, frames with diverse motion content and texture would contain a larger number of small block sizes, whereas frames with uniform texture or motion content would exhibit a preference for larger block sizes. It is, therefore, desirable for the proposed thresholds to adapt to this condition. The fact that the proposed thresholds are directly related to the standard deviation of the cost value distribution within a frame facilitates them to dynamically accommodate to this condition. This adaptive behavior is also beneficial for scene changes since only information from the  84  current frame is taken into account for the threshold computation. For example, in Figure , the correlation between the close-to-optimal rate-distortion-cost Jfinal and the 16x16 cost J16 dropped during a scene change but remained strong with a minimum of 0.86, still indicating a strong correlation. The proposed threshold choices are summarized in Table 3.10. 
Table 3.10: Categories Defined for Macroblock Classification
Category  Threshold                        Block Sizes
1         μ + 0.5σ < J16                   all block sizes
2         μ - 0.75σ < J16 <= μ + 0.5σ      16x16, 16x8, 8x16, 8x8
3         J16 <= μ - 0.75σ                 16x16 only

As mentioned above, depending on the limitations of the application in terms of processing power, energy reservoir or available bandwidth, these thresholds may be modified to achieve either lower computational requirements or a higher level of quality.

3.3.3 Using Pixel Decimation Patterns for Further Reduction in Computational Complexity

An additional reduction in computational requirements can be achieved by reducing the number of pixels taken into account for measuring distortion [4, 21-25]. One approach is to use a subsampled (decimated) set of pixels as described in [24] and [25]. If a subsampled set of pixels is used, the decimation should be performed such that the decimated block pixel error functions are highly correlated with the original, complete error functions [11]. This approach can reduce the computational effort considerably, especially when processing power is the major limitation for a given application. An example method using quarter-subsampled SAD values for the Full Search method is described in [23].

Figure 3.14: Pattern A: Two Pixel Subsets Covering ¼ of the MB Pixels

Figure 3.15: Pattern B: Subset Covering ½ of the MB Pixels

We introduce such a pixel decimation scheme which is specifically designed for our early termination method described earlier. Performance evaluations over a large set of different decimation patterns and combinations have shown that, for variable block size motion estimation, the evaluation of different block sizes requires different decimated pixel patterns. Our experiments unveiled that using one of the patterns shown in Figure 3.14 (each one considering ¼ of the pixels present within the block, reducing the number of pixel comparisons performed by 75%) resulted in matching accuracy that is highly satisfactory for the 16x16 block size. However, the same patterns yielded unsatisfactory results for block sizes smaller than 16x16. In order to maintain an acceptable search accuracy for smaller block sizes, our experiments showed that a pattern incorporating more of the original pixels was needed. The pattern that yielded the best results is depicted in Figure 3.15. This pattern covers ½ of the original pixels and reduces the number of pixel comparisons by 50%. As will be demonstrated in Section 3.4, using this combination of different patterns allows our approach to deliver an acceptable level of accuracy, with an average reduction in total pixel comparisons of 55%. The proposed scheme is summarized in Table 3.11.

Table 3.11: Subset Patterns Chosen for Different Block Sizes
Partition size   Pattern Choice
16x16            Subset Pattern A
< 16x16          Subset Pattern B

Some current encoding solutions provide optimized instruction sets for operations frequently found in video coding. A popular example is the family of SIMD (single instruction, multiple data) instruction sets featured, for instance, in Intel's Pentium and XScale lines of microprocessors. These instructions are optimized for performing a single operation on multiple data elements located in adjacent memory locations. With such hardware available, the performance benefit of decimated pixel patterns can be increased further by choosing patterns that match the instruction width (e.g., an 8x1 subsampling pattern for 64-bit instructions).
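Putting Tables 3.10 and 3.11 together, the decision logic of the proposed approach can be summarized in the short C sketch below: the two thresholds are derived from the mean and standard deviation of the current frame's J16 values, every macroblock is assigned to one of the three categories, and the decimation pattern is selected per partition size. The enumeration names and the use of the population standard deviation are illustrative assumptions.

#include <math.h>
#include <stddef.h>

typedef enum { CAT_16X16_ONLY, CAT_DOWN_TO_8X8, CAT_ALL_MODES } MBCategory;
typedef enum { PATTERN_A_QUARTER, PATTERN_B_HALF } SubsetPattern;

/* Thresholds of Table 3.10, derived from the mean and standard deviation
 * of the J16 values of the current frame. */
static void compute_thresholds(const double *j16, size_t n,
                               double *t_low, double *t_high)
{
    double mean = 0.0, var = 0.0;
    for (size_t i = 0; i < n; i++) mean += j16[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++) var += (j16[i] - mean) * (j16[i] - mean);
    double sigma = sqrt(var / (double)n);
    *t_low  = mean - 0.75 * sigma;   /* at or below: 16x16 only         */
    *t_high = mean + 0.50 * sigma;   /* above: evaluate all block sizes */
}

/* Category assignment for one macroblock (Table 3.10). */
static MBCategory classify_mb(double j16, double t_low, double t_high)
{
    if (j16 <= t_low)  return CAT_16X16_ONLY;
    if (j16 <= t_high) return CAT_DOWN_TO_8X8;
    return CAT_ALL_MODES;
}

/* Decimation pattern per partition size (Table 3.11). */
static SubsetPattern pattern_for_block(int block_w, int block_h)
{
    return (block_w == 16 && block_h == 16) ? PATTERN_A_QUARTER
                                            : PATTERN_B_HALF;
}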
3.4 Experimental Results The proposed early-termination method was implemented using the JM reference software 13.1 on a 3.0 GHz Pentium IV platform. The main parameters used for our tests are summarized in Table 3.12. Table 3.12: Main Parameters Used in the Experiments Parameter Setting GOP IPPP (Tables 13-15, 17);  IPBBP (Table 16) I-frame interval 15 QP 28, 32, 36, 40 (Tables 13,14,16,17); 12, 16, 20, 24 (Table 15) # of ref .frames 5 (Tables 13-17) A number of video sequences were encoded for performance evaluation purposes. For CIF sequences, our method used ‘μ-0.75σ’ as the first threshold and ‘μ+0.5σ’ as the second threshold. For QCIF sequences, we used ‘μ-0.5σ’ as the first threshold and ‘μ+0.5σ’ as the second threshold.  87  Table 3.13 shows the results obtained by our proposed method, and contrasts them against the EPZS and CBFPS motion estimation options. The table details the PSNR in dB, bitrate in kbps, the total encoding time, the number of pixel comparisons per macroblock (CMP/MB), along with the differences compared to EPZS (ΔSNR, ΔBitrate, ΔEncTime, ΔCMP/MB). These values are the averages obtained by the quantization factors indicated in Table 3.12. We observe that compared to EPZS our algorithm achieves an average of 32.4% in computational savings with a mere drop in picture quality of 0.02dB and virtually the same bitrate. Table 3.13: Comparison After Encoding Using the Proposed Method Sequence PSNR (dB) Bitrate (kbps) EncTime (s) CMP/MB ΔSNR (dB) Δbitrate (%) ΔEncTime (%) ΔCMP/MB (%) Akiyo 35.61 37.9 281.7 151,086 0.00 -0.21% -46.8% -26.1% Coastguard 29.84 414.6 482.4 277,308 -0.02 0.07% -61.3% -36.8% Foreman 32.17 203.3 369.4 254,679 -0.02 0.10% -49.0% -32.4% Mobile 28.47 633.7 452.5 240,719 -0.03 0.02% -59.5% -36.8% Tempete 29.81 471.8 415.4 238,355 -0.02 -0.36% -60.8% -32.0% Akiyo 33.76 12.6 68.8 162,044 0.00 -0.49% -46.5% -29.7% Carphone 31.99 71.3 92.0 219,842 -0.05 0.14% -37.0% -30.1% Coastguard 29.39 81.1 76.4 250,891 -0.01 0.73% -41.4% -36.1% Foreman 31.09 65.5 103.0 249,079 -0.03 -0.17% -38.8% -32.8% Salesman 30.96 21.9 99.7 205,812 0.01 0.00% -36.0% -31.3% Average 31.31 201.4 244.1 224,982 -0.02 -0.02% -47.7% -32.4% Q C IF C IF H.264 w/EPZS Proposed Algorithm  Table 3.14 compares EPZS and the proposed method, but in this case the suggested decimated pixel patterns were also applied by our method. We observe that in this case the computational savings were increased to 55% for a 0.02dB drop in picture quality and a bitrate increase of 1.96%.  88  Table 3.14: Comparison after Encoding Using the Proposed Method with the Proposed Pixel Decimation Scheme Enabled Sequence PSNR (dB) Bitrate (kbps) EncTime (s) CMP/MB ΔSNR (dB) Δbitrate (%) ΔEncTime (%) ΔCMP/MB (%) Akiyo 35.61 37.9 281.7 151,086 0.00 1.6% -45.2% -40.9% Coastguard 29.84 414.6 482.4 277,308 -0.03 1.6% -59.6% -60.2% Foreman 32.17 203.3 369.4 254,679 -0.05 2.9% -44.4% -58.2% Mobile 28.47 633.7 452.5 240,719 -0.02 2.1% -58.6% -58.9% Tempete 29.81 471.8 415.4 238,355 -0.04 1.3% -61.5% -63.1% Akiyo 33.76 12.6 68.8 162,044 0.02 2.4% -44.4% -44.6% Carphone 31.99 71.3 92.0 219,842 -0.04 2.0% -37.5% -55.1% Coastguard 29.39 81.1 76.4 250,891 -0.02 1.6% -40.4% -58.4% Foreman 31.09 65.5 103.0 249,079 -0.04 2.4% -38.8% -58.3% Salesman 30.96 21.9 99.7 205,812 -0.02 1.8% -40.5% -55.0% Average 31.31 201.4 244.1 224,982 -0.02 1.96% -47.1% -55.3% Q C IF C IF Proposed Algorithm (w/ Decimated Pixel Patterns)H.264 w/EPZS  The same analysis and tests have also been performed for QP 12, 16, 20 and 24. 
The most favorable results for this case were obtained for the same threshold choices proposed in Table 3.10 and are summarized in Table 3.15. Table 3.15: Comparison after Encoding with the Proposed Method Sequence PSNR (dB) Bitrate (kbps) EncTime (s) CMP/MB ΔSNR (dB) Δbitrate (%) ΔEncTime (%) ΔCMP/MB (%) Akiyo 45.78 785.2 422.2 157,576 -0.01 -0.1% -45.0% -33.6% Coastguard 42.67 4,662.6 528.7 268,203 -0.04 -0.4% -39.7% -29.6% Foreman 43.51 3,002.5 501.1 247,907 -0.04 -0.2% -35.5% -26.8% Mobile 42.58 6,293.6 489.8 215,913 -0.05 -0.5% -37.5% -28.0% Tempete 43.02 5,266.2 424.1 226,345 -0.04 -0.3% -37.1% -26.2% Akiyo 45.53 215.0 97.1 142,610 -0.01 0.0% -43.0% -33.5% Carphone 44.09 721.3 150.8 215,785 -0.05 0.0% -39.3% -27.7% Coastguard 42.52 1,047.0 123.6 230,345 -0.03 0.1% -41.6% -30.7% Foreman 43.20 856.1 164.6 228,468 -0.05 0.5% -37.6% -27.1% Salesman 43.66 375.1 147.0 140,010 -0.01 0.1% -47.0% -35.0% Average 43.66 2,322.5 304.9 207,316 -0.03 -0.07% -40.3% -29.8% With proposed Pixel Decimation Scheme enabled: Average 43.53 2,233.3 267.0 206,077 -0.05 1.99% -41.0% -53.6% Q C IF H.264 w/EPZS Proposed Algorithm C IF   89  We observe that a similar performance was achieved for this case (with/without pixel decimation) resulting in 29.8% / 53.6% savings in pixel comparisons, 0.03dB / 0.05dB quality drop and -0.07% / 1.99% change in bitrate. Figures 3.16-3.18 show the rate-distortion curves of some of the investigated test sequences, detailing PSNR in dB and bitrate in kbps, prepared in accordance to [26].  Figure 3.16: Rate-Distortion Performance for the Foreman (CIF) Sequence  Figure 3.17: Rate-Distortion Performance for the Mobile (CIF) Sequence 36 37 38 39 40 41 42 43 44 45 46 500 1,000 1,500 2,000 2,500 3,000 3,500 Bitrate (kbps) PS N R EPZS Proposed Proposed w/ Decimation 33 34 35 36 37 38 39 40 41 42 43 44 45 1,500 2,500 3,500 4,500 5,500 6,500 7,500 Bitrate (kbps) PS N R EPZS Proposed Proposed w/ Decimation  90   Figure 3.18: Rate-Distortion Performance for the Foreman (QCIF) Sequence These figures represent the results from Tables 3.13 and 3.14 and demonstrate the fact that our method does not deteriorate the visual quality of the image (-0.02dB). Table 3.16 illustrates the performance of our algorithm, this time using a GOP structure of IPBBP. 
Table 3.16: Comparison after Encoding with EPZS and Our Method (IPBBP GOP Structure) Sequence PSNR (dB) Bitrate (kbps) EncTime (s) CMP/MB ΔSNR (dB) Δbitrate (%) ΔEncTime (%) ΔCMP/MB (%) Akiyo 35.71 33.8 262.1 507,864 -0.01 -0.62% -32.6% -22.5% Coastguard 29.73 356.5 376.8 965,859 -0.02 -0.71% -39.6% -32.6% Foreman 32.07 199.4 381.8 879,311 -0.02 -0.65% -37.8% -28.7% Mobile 28.39 514.6 361.4 840,560 -0.03 -0.51% -37.8% -32.7% Tempete 29.68 411.2 308.3 825,699 -0.04 -0.87% -34.5% -27.7% Akiyo 33.82 11.4 66.0 547,448 0.00 -0.55% -33.4% -25.5% Carphone 31.97 65.0 80.3 745,347 -0.04 -0.80% -33.9% -26.4% Coastguard 29.36 67.9 87.6 883,373 -0.01 -0.11% -38.0% -32.3% Foreman 31.57 65.2 86.7 849,866 -0.04 -0.91% -35.3% -28.6% Salesman 31.04 21.8 75.7 705,107 -0.01 -0.27% -32.9% -26.4% Average 31.33 174.7 208.7 775,043 -0.02 -0.60% -35.6% -28.4% With proposed Pixel Decimation Scheme enabled: Average 31.33 174.7 208.7 775,043 -0.04 1.76% -35.0% -55.2% C IF Q C IF H.264 w/EPZS Proposed Algorithm  35 36 37 38 39 40 41 42 43 44 45 150 250 350 450 550 650 750 850 950 1,050 Bitrate (kbps) PS N R EPZS Proposed Proposed w/ Decimation  91  We observe that the savings facilitated by our algorithm (with / without decimated pixel patterns) amount to an average of 28.4% / 55.2% with a drop in picture quality of 0.03dB / 0.04dB while the bitrate changes by -0.6% / 1.76%. For comprehensiveness, experimental results have also been obtained using only one reference frame and IPPP GOP structure. Our proposed method saves around 29.7% of the pixel comparisons on average, while compromising quality by only a mere 0.02 dB. The performance in this case is slightly lower (-3%) in terms of computational savings than in the case of five reference frames (see Table 3.13). However, the bitrate performance was improved slightly to -0.33% on average compared to EPZS. When decimated pixel patterns are enabled, our method saves around 61.8% of the pixel comparisons on average with a minor compromise in image quality by 0.04 dB and a bitrate increase of 0.31%. The performance in this case, is better than when five reference frames are used. Experiments were also performed with our algorithm supplementing the UMHexagonS method (see Table 3.17). Table 3.17: Comparison after Encoding with the Proposed Method Sequence PSNR (dB) Bitrate (kbps) EncTime (s) CMP/MB ΔSNR (dB) Δbitrate (%) ΔEncTime (%) ΔCMP/MB (%) Akiyo 36.42 94.5 345.4 121,512 0.00 0.0% -45.0% -25.6% Coastguard 30.72 572.1 510.3 320,842 -0.02 0.1% -47.7% -33.9% Foreman 32.99 301.0 471.9 282,659 -0.02 0.5% -44.0% -30.3% Mobile 29.55 952.1 470.7 267,186 -0.06 0.5% -46.9% -34.4% Tempete 30.73 694.8 409.6 262,528 -0.03 -0.2% -46.1% -30.4% Akiyo 34.52 36.0 85.8 123,883 -0.01 0.0% -45.4% -28.0% Carphone 33.01 102.8 147.4 209,133 -0.03 0.2% -44.8% -27.2% Coastguard 30.42 121.0 116.1 259,924 -0.02 1.1% -47.8% -33.3% Foreman 32.01 97.6 153.1 257,193 -0.04 0.3% -45.2% -30.5% Salesman 31.72 55.4 141.1 149,545 -0.01 0.2% -47.4% -29.8% Average 32.21 302.7 285.1 225,440 -0.02 0.26% -46.0% -30.3% C IF Q C IF H.264 w/UMHexagonS Proposed Algorithm   92  We observe that the proposed method performs similarly well when using the UMHexagonS algorithm as a base with 30.3% savings in pixel comparisons, 0.02 dB quality drop and 0.26% bitrate increase on average. 
Once more, the choice of the thresholds as well as the option of using the decimation patterns entirely depend on the major constraints of the target application, allowing the user to make the final decision about the tradeoff between quality and computational savings. 3.5 Conclusion We presented a novel early-termination motion estimation algorithm specifically designed for the H.264 standard. This method takes advantage of significant correlations present between minimum rate-distortion cost values obtained by the 16x16 macroblock search and the cost values of the potential best partitioning structure for every macroblock in order to terminate the search for smaller partitions as early as possible. Performance evaluations have shown that approximately 32% of the computational burden can be saved compared to the presently used approach by H.264 while preserving picture quality and only mildly affecting the overall bitrate. Computational savings could be further enhanced by applying a pixel decimation scheme for measuring distortion, achieving total savings of 55%. The choices of the thresholds used by our method may be modified depending on the major constraints of the target application and the desired trade-off between picture quality and computational savings. In summary, our algorithm is compliant with the existing H.264 standard and complements existing search methods by defining additional early-termination criteria that aim to avoid unnecessary search efforts.  93   3.6 References [1] S. Goel, Y. Ismail and M.A. Bayoumi, "Adaptive search window size algorithm for fast motion estimation in H.264/AVC standard," in IEEE Intl. Midwest Symposium on Circuits and Syst. (MWSCAS), pp. 1557-60, 2005. [2] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264/ISO/IEC 14 496-10 AVC)", ITU-T Rec. H.264/ISO/IEC 14 496- 10 AVC, vol. 1, 2003. [3] K.M. Nam, J. Kim, R. Park and Y.S. Shim, "A fast hierarchical motion vector estimation algorithm using mean pyramid", IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, pp. 344-51, 1995. [4] B. Liu and A. Zaccarin, "New fast algorithms for the estimation of block motion vectors", IEEE Transactions on Circuits and Syst. for Video Technology, vol. 3, pp. 148-57, 1993. [5] S. Zhu and K. Ma, "A new diamond search algorithm for fast block matching motion estimation," in IEEE International Conference on Information, Communications and Signal Processing (ICICS), pp. 292-6, 1997. [6] C. Zhu, X. Lin and L. Chau, "Hexagon-based search pattern for fast block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, pp. 349-55, 2002. [7] Z. Chen, Z. Peng and H. Yun, "Fast Integer Pel and Fractional Pel Motion Estimation for JVT, JVT-F017", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, pp. 1-14, 2002.  94  [8] H.-.C. Tourapis and A.M. Tourapis, "Fast motion estimation within the H.264 codec," in IEEE International Conference on Multimedia and Expo (ICME), pp. 517-20, 2003. [9] Z. Zhou, M. Sun and Y. Hsu, "Fast variable block-size motion estimation algorithm based on merge and slit procedures for H.264/MPEG-4 AVC," in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 725-8, 2004. [10] Y. Tu, J. Yang, M. Sun and Y.T. Tsai, "Fast variable-size block motion estimation for efficient H.264/AVC encoding", EURASIP Journal for Signal Processing: Image Communication, vol. 20, pp. 
595-623, 2005. [11] J. Xin, M. Sun and V. Hsu, "Diversity-based fast block motion estimation," in IEEE International Conference on Multimedia and Expo (ICME), pp. 525-8, 2003. [12] H. Kim, N. Kamaci and Y. Altunbasak, "Low-complexity rate-distortion optimal macroblock mode selection and motion estimation for MPEG-like video coders", IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 823- 34, 2005. [13] G.J. Sullivan, P. Topiwala and A. Luthra, "The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions," in Applications of Digital Image Processing, pp. 454-74, 2004. [14] A.M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation", Proceedings of the SPIE, vol. 4671, pp. 1069-79, 2002. [15] X. Xu and Y. He, "Comments on Motion Estimation Algorithms in Current JM Software, JVT-Q089.doc", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, pp. 1-12, Oct. 2005. [16] K. Ma and G. Qiu, " Verification for Fast Motion Estimation for JVT, JVT- G017.doc", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, pp. 1-5, Mar. 2003.  95  [17] X. Yi, J. Zhang, N. Ling and W. Shang, "Improved and simplified fast motion estimation for JM, JVT-P021.doc", Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG, pp. 1-23, Jul. 2005. [18] S.J. Orfanidis, Optimum Signal Processing - An Introduction, 2nd Edition, Englewood Cliffs, NJ, USA: Prentice-Hall, 1996. [19] S. Kim and J. Han, "Efficient motion estimation using a modified early termination algorithm in H.264", IEICE Transactions on Information and Systems, vol. E88, pp. 1707-15, 2005. [20] H. Markowitz, "Portfolio Selection", Journal of Finance, vol. 7, pp. 77-91, 1952. [21] X. Yi and N. Ling, "Improved normalized partial distortion search with dual- halfway-stop for rapid block motion estimation", IEEE Transactions on Multimedia, vol. 9, pp. 995-1003, 2007. [22] C. Cheung and L. Po, "Adjustable partial distortion search algorithm for fast block motion estimation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 100-10, 2003. [23] K. Lee, H. Chin, H. Hsu and C. Jen, "QME: An efficient subsampling-based block matching algorithm for motion estimation," in IEEE International Symposium on Circuits and Systems, pp. 305-8, 2004. [24] H. Lee, P. Nasiopoulos and V.C.M. Leung, "Fast Video Motion Estimation Algorithm for Mobile Devices," in IEEE International Conference on Multimedia and Expo (ICME), pp. 370-3, 2005. [25] Y.V. Ivanov and C.J. Bleakley, "Skip prediction and early termination for fast mode decision in H.264/AVC," in International Conference on Digital Telecommunications, pp. 7-7, 2006. [26] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves", VCEG-M33, pp. 1-12, Apr. 2001.   96  CHAPTER 4: A FAST MODE DECISION ALGORITHM FOR TRANSCODING H.264 VIDEO WITH ARBITRARY-RATIO SPATIAL DOWNSCALING3 4.1 Introduction With the continuous evolution of wireless networks and versatile mobile devices, the demand for mobile video applications is rapidly gaining popularity. At the same time, mobile devices suffer from innate constraints such as limited processing power, display capabilities, energy reservoir, storage space and transmission bandwidth. Due to these constraints, it is highly desirable to provide content in a format that is optimized for the target device and transmission channel[1]. 
However, existing content has often been pre-encoded with a resolution or bitrate exceeding these constraints, aiming at different target segments, e.g., home entertainment. Transcoding is a process that allows converting compressed video from one format to another, in the present case from a high resolution to a lower resolution [2-4]. H.264 / MPEG-4 AVC, the latest international video coding standard, has introduced a number of coding features that enable significantly increased compression efficiency compared to all previous standards [5]. However, this increase in compression efficiency comes at the price of substantially increased complexity, which places a heavy burden on the video encoder.

3 A version of this chapter has been submitted for publication. M. von dem Knesebeck, P. Nasiopoulos, “A Fast Mode Decision Algorithm for Transcoding H.264 Video with Arbitrary-Ratio Spatial Downscaling”.

When transcoding an H.264 compressed video stream from one resolution to another, one can apply a straightforward approach known as “cascaded pixel-based transcoding” (see Figure 4.1) [2].

Figure 4.1: Cascaded Transcoding Process

In this process, the compressed full resolution bitstream is completely decoded, the resulting sequence of raw images is then scaled using some process such as a bilinear filter, and a regular H.264 encoder finally encodes these scaled raw images into a new, efficiently compressed H.264 stream. Cascaded pixel-based transcoding results in very high compression efficiency. However, while the decoding and downscaling stages require minimal computational effort, the re-encoding process consumes substantial resources.

As an alternative to pixel-based transcoding, a number of studies propose performing the downscaling operation more efficiently in the transform domain [6, 7]. However, the downscaling procedure in the transform domain introduces a drift error that grows substantially with increasing GOP size [4, 8]. Although a number of methods have been proposed that aim to compensate for the drift error in previous standards [9-12], the error remains severe in H.264 due to its deblocking filter and quarter-pixel interpolation feature. Pixel-domain transcoding is therefore the preferred choice for spatially downscaled transcoding.

Characteristics of the downscaled video content, such as motion or texture, are very highly correlated with those of the full resolution version. We consequently expect that valuable information about efficient coding choices for the underlying content can be derived from the compressed stream, which can assist in reducing the burden of re-encoding the reduced resolution version. Cascaded transcoding, however, as described, does not take advantage of existing information in the pre-encoded stream, such as motion vectors, modes or residual values, or of content characteristics that can be derived from these elements. For this reason, various methods have been proposed that reuse or consider information from the pre-encoded stream in order to speed up the re-encoding process [7, 13, 14]. Since motion search and mode decision are the most demanding tasks in video encoding, most of these methods aim at reducing the complexity of these elements. In this study, we propose an efficient spatial transcoding method that specifically focuses on reducing the computational burden associated with mode decision by taking advantage of existing information, such as motion vectors and residuals, in the compressed full resolution stream.
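Referring back to the downscaling stage of the cascaded pipeline in Figure 4.1, the C sketch below shows a generic bilinear resampler for one luma plane, which supports arbitrary (non-integer) downscaling ratios. It is a minimal illustration of the principle rather than the particular filter used in the experiments.

#include <stdint.h>

/* Bilinear downscaling of one luma plane from (sw x sh) to (dw x dh).
 * Arbitrary downscaling ratios are supported because each destination
 * sample is interpolated from its four nearest source samples. */
static void bilinear_downscale(const uint8_t *src, int sw, int sh,
                               uint8_t *dst, int dw, int dh)
{
    for (int y = 0; y < dh; y++) {
        double fy = (double)y * (sh - 1) / (dh > 1 ? dh - 1 : 1);
        int y0 = (int)fy;
        int y1 = (y0 + 1 < sh) ? y0 + 1 : y0;
        double wy = fy - y0;
        for (int x = 0; x < dw; x++) {
            double fx = (double)x * (sw - 1) / (dw > 1 ? dw - 1 : 1);
            int x0 = (int)fx;
            int x1 = (x0 + 1 < sw) ? x0 + 1 : x0;
            double wx = fx - x0;
            double top = (1.0 - wx) * src[y0 * sw + x0] + wx * src[y0 * sw + x1];
            double bot = (1.0 - wx) * src[y1 * sw + x0] + wx * src[y1 * sw + x1];
            dst[y * dw + x] = (uint8_t)((1.0 - wy) * top + wy * bot + 0.5);
        }
    }
}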
Except for simple mode mapping strategies, adaptive mode decision approaches for arbitrary-ratio downscaling have not been proposed. The rest of the paper is structured as follows: Section 4.2 presents an overview of transcoding techniques proposed for downscaling a pre-encoded H.264 stream. Section 4.3 describes our method and its details. The experimental results are discussed in Section 4.4, and Section 4.5 presents the conclusion.  99  4.2 Background 4.2.1 Mode Decision in Video Compression A major contribution to compression efficiency in H.264 stems from the feature of variable block size motion estimation[15]. While previous standards divided the video frame into macroblocks of a fixed size (usually 16x16 pixels), H.264 defines seven block sizes for motion-compensated prediction (16x16, 16x8, 8x16, 8x8, 8x4, 4x8,4x4) (see Figure 4.2). This permits the encoder to subdivide any given 16x16 area in the video frame into various combinations of smaller partitions, each one carrying a motion vector of its own.  Figure 4.2: Block Sizes Defined in the H.264 Video Coding Standard In total, these choices for the coding mode lead to 259 possible distinct combinations for motion-compensated prediction (MCP) of every 16x16 macroblock, in addition to the SKIP mode that does not transmit a motion vector. For maximum compression efficiency, all of these combinations need to be evaluated for determining the “best one” among them. To achieve this objective, H.264 aims to minimize a Lagrangian cost function for every coding mode that balances the trade-off between bitrate and quality [16]: J = D + λR (4-1) where D denotes the distortion between the predicted and the original macroblock, R represents the number of bits used to encode the motion information, and  100  λ represents the Lagrangian multiplier that is itself dependent on the quantization factor chosen. The coding mode resulting in the lowest cost J is finally selected for encoding. 4.2.2 Bottlenecks in Downscaled Transcoding The results of a motion search are one or several motion vectors (depending on the coding mode used) for every given macroblock partition for creating a predicted frame. A frame downscaled to a smaller resolution carries almost the same motion characteristics as the full resolution frame. However, depending on the downscaling ratio chosen, motion elements previously found in different adjacent macroblocks, are now merged into fewer macroblocks. 4.2.3 Motion Search Instead of performing a regular motion search with many search points such as in raw video encoding, a number of transcoding algorithms have addressed the challenge of deriving a suitable (initial) motion vector for the macroblock partitions in the downscaled video from information present in the pre-encoded stream. Those algorithms include the simple average (SA), the activity-weighted average [17-19], the area-weighted average (AWA) and the area-weighted vector median (AWMVM) [20], or regression-based approaches such as [21] that require initial training. The obtained motion vectors from these schemes are close approximations to the optimal motion vectors and save a substantial amount of computations compared to a regular motion search. Generally, however, an additional 1-pixel motion search around the obtained displacement position with subsequent quarter pixel refinement is still required to obtain high compression efficiency.  
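To give a flavour of these motion vector composition schemes, the sketch below computes an area-weighted average of the full-resolution motion vectors whose partitions overlap the region mapped to one downscaled partition, followed by scaling with the downscaling ratio. The data layout is an illustrative assumption, and the AWMVM scheme used as our reference replaces the average with an area-weighted vector median.

/* One full-resolution motion vector together with the area (in pixels) of
 * its partition that overlaps the region mapped to the downscaled block. */
typedef struct {
    double mv_x, mv_y;   /* motion vector in quarter-pel units           */
    double area;         /* overlapping area in full-resolution pixels   */
} OverlapMV;

/* Area-weighted average of the overlapping motion vectors, scaled by the
 * downscaling ratio (e.g., 0.5 when going from 4CIF to CIF).  The AWMVM
 * reference scheme uses an area-weighted vector median instead. */
static void area_weighted_mv(const OverlapMV *v, int count, double scale,
                             double *out_x, double *out_y)
{
    double sum_x = 0.0, sum_y = 0.0, sum_a = 0.0;
    for (int i = 0; i < count; i++) {
        sum_x += v[i].area * v[i].mv_x;
        sum_y += v[i].area * v[i].mv_y;
        sum_a += v[i].area;
    }
    *out_x = (sum_a > 0.0) ? scale * sum_x / sum_a : 0.0;
    *out_y = (sum_a > 0.0) ? scale * sum_y / sum_a : 0.0;
}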
101  Except for the AWA approach, which measures the activity within a given macroblock by the magnitude of the corresponding DC coefficient, none of these methods considers the characteristics of the underlying content further to reduce complexity. AWMVM yields the best results among the above methods in terms of compression efficiency, and it is used as a reference for our study. Even though a significant amount of effort is saved by these algorithms with regards to the element of motion search, the burden of evaluating the potential 259 possible coding mode combinations remains. 4.2.4 Mode Decision Several studies have proposed mode-mapping strategies that aim to determine suitable block sizes for the downscaled frame based on the block sizes found in the full resolution frames [22, 23]. These methods work well for integer downscaling ratios such as, for instance, 2:1 where four original macroblocks would be merged into a single macroblock. For non-integer downscaling ratios, however, they cannot easily be applied since various fractions of macroblocks with differing modes are merged into a single macroblock. Our proposed algorithm enables an arbitrary spatial downscaling ratio and significantly reduces the computational burden of mode decision for downscaled transcoding. It adaptively determines the most likely modes using information from the pre-encoded full resolution stream and some early measures obtained during the re- encoding process.  102  4.3 Proposed Algorithm The goal of this paper is to find measures that can be obtained with very little computational effort and which will allow an early optimal mode decision for every macroblock in the downscaling transcoding process instead of evaluating every one of the 259 possible coding combinations. A typical frame in natural video sequences exhibits a distribution of block sizes, in which the SKIP and 16x16 mode are the most prevalent choice. Table 4.1 shows the average frequency of final modes chosen for representative reference sequences that cover a broad spectrum of motion and texture characteristics found in natural video. The results in this table are determined by an extensive motion estimation process with Full Search and all modes enabled. Table 4.1: Average Frequency of Modes Chosen for the Reference Sequences  SKIP 16x16 16x8 8x16 8x8 sub8x8 Crew 14% 32% 7% 8% 3% 36% City 7% 29% 13% 17% 4% 30% Soccer 27% 21% 7% 7% 3% 36% Coastguard 9% 28% 8% 7% 3% 45% Foreman 17% 21% 8% 9% 3% 42% Mobile 1% 33% 4% 4% 3% 54% Average 13% 27% 8% 9% 3% 40% We observe that close to 40% of the macroblocks are assigned to the SKIP and the 16x16 mode on average. Sequences with diverse motion and texture characteristics feature a large number of small block sizes and those with uniform motion and texture characteristics bear a high number of large block sizes. It would be very beneficial to know this distribution of the final block sizes within a given frame early, before evaluating all different mode combinations in an extensive process. Under the assumption that the pre-encoded stream has been created using the most efficient coding criteria, we  103  expect to be able to derive valuable information about the content characteristics from the pre-encoded stream that can assist in choosing an appropriate coding mode. A number of methods have already addressed the mode decision challenge when encoding from raw video and some of them have also been applied to transcoding [20, 24-26]. 
However, in transcoding, information in addition to that from raw video encoding is available from the pre-encoded stream and may be used to improve performance, as subsequently presented. In the following subsections, we analyze a number of candidate measures that can be derived with low computational effort from the pre-encoded stream or from completed elements of the encoding process and we investigate if they can be used to assist in determining the most likely modes early. The measures considered are: • the cost results of the 16x16 mode search for the entire frame • the Motion Vector Difference Measure (MVDM), a measure derived from the motion vectors within the pre-encoded stream • the Sum of Absolute Residuals (SAR), a measure representing the residual found in the pre-encoded stream For illustrating the measures, we subsequently refer to frame 2 of the Soccer 4CIF sequence that is being transcoded to CIF resolution (see Figure 4.3).  104   Figure 4.3: Frame 2 of the Soccer CIF Sequence with a 16x16 Grid Overlay to Illustrate the Macroblock Division In this video frame, we find a number of different motion and texture characteristics that are representative for various frequently found coding attributes, such as different degrees of texture detail or motion. In what follows, we explain the contribution and analysis of each of these measures and finally present the proposed algorithm. 4.3.1 Cost Values from the 16x16 Motion Search The cost value J16is the final result of the 16x16 motion search for every macroblock and is also the first measure obtained during the mode decision process applied in the JM reference software for H.264. In our study, the motion search is partly substituted by the AWMVM approach mentioned earlier. First, an initial motion vector is being computed for a given partition by the area-weighted median of the motion vectors in the corresponding area of the pre-encoded stream. Then, a 1-pixel motion search and a subsequent quarter-pixel refinement is performed around this initial vector for further accuracy enhancement. For every search point evaluated during this process, a rate- distortion cost is obtained according to (1). The displacement position featuring the  105  lowest cost is finally chosen as the best match for this partition. For the 16x16 mode search, we obtain the J16 value. In [27], a strong correlation was observed between the distribution of the J16 cost values for the entire frame and the distribution of the final block sizes within this frame that were determined by an extensive mode decision process. Generally, a high cost value indicates that the image area is fairly complex to encode and could bear the potential of a more efficient match using one of the smaller block sizes. Small cost values indicate a very good match, and an additional benefit from evaluating smaller block sizes would be limited. For raw video encoding, the observed correlation was used in [27] to stop the mode decision process early for those macroblocks exhibiting a J16 value below an adaptively determined threshold. The validity of this approach can be illustrated for the example of our model frame 2 of the Soccer CIF sequence. Figure 4.4 illustrates the “desired” partitioning structure that would be chosen if all 259 mode combinations were evaluated. Figure 4.5 shows a map of the (final) J16 cost values obtained for this frame. Figure 4.6 shows a map of the generic modes (16x16, 16x8, 8x16, 8x8, sub8x8) corresponding to the structure in Figure 4.4. 
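The 1-pixel refinement around the initial vector described above, with the Lagrangian cost of Equation (4-1) evaluated at every search point, can be sketched as follows. The SAD routine and the motion vector bit-cost model are placeholders (a real encoder would use the actual exponential-Golomb code length), and the subsequent quarter-pixel refinement step is omitted.

#include <stdint.h>
#include <stdlib.h>

/* Distortion of a 16x16 block for a candidate integer-pel displacement;
 * assumed to be provided elsewhere (placeholder for illustration). */
extern uint32_t sad_16x16_at(int mb_x, int mb_y, int mv_x, int mv_y);

/* Very rough bit-cost model for a motion vector relative to its predictor;
 * a real encoder would use the exponential-Golomb code length instead. */
static uint32_t mv_bits(int mv_x, int mv_y, int pmv_x, int pmv_y)
{
    return (uint32_t)(2 * (abs(mv_x - pmv_x) + abs(mv_y - pmv_y)) + 2);
}

/* One-pixel refinement around the initial vector derived from the
 * pre-encoded stream: the nine integer-pel candidates in a 3x3 window are
 * evaluated with J = D + lambda * R, and the lowest-cost vector is kept.
 * For the 16x16 mode, the returned cost corresponds to J16. */
static uint32_t refine_one_pel(int mb_x, int mb_y, int init_x, int init_y,
                               int pmv_x, int pmv_y, uint32_t lambda,
                               int *best_x, int *best_y)
{
    uint32_t best_j = UINT32_MAX;
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            int mx = init_x + dx, my = init_y + dy;
            uint32_t j = sad_16x16_at(mb_x, mb_y, mx, my)
                       + lambda * mv_bits(mx, my, pmv_x, pmv_y);
            if (j < best_j) { best_j = j; *best_x = mx; *best_y = my; }
        }
    }
    return best_j;
}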
Ideally, the measures analyzed would allow composing this generic map without the evaluation of all mode combinations.

Figure 4.4: Illustration of Desired Partitioning Structure for Frame 2 of the Soccer CIF Sequence

Figure 4.5: Map of Cost Values Obtained from the 16x16 Motion Search

Figure 4.6: Map of Block Sizes Identified in the Reference Implementation

We observe that most of the areas featuring a high J16 value correspond to small block sizes, while low J16 cost values correspond to both large and small block sizes. When looking at the underlying content, the areas exhibiting a large J16 value contain non-uniform motion and the unveiling of occluded areas that are not found in the preceding frame, while small J16 values are found in areas that do not show complex motion characteristics. While this knowledge already assists in reducing the number of modes, it does not deliver a perfect measure. This is especially observable for the region in the lower middle of the frame, which contains a high degree of detail with many similarities in neighboring partitions. Figure 4.4, representing the desired partitioning structure, demonstrates that small block sizes would achieve even smaller cost values (leading to fewer bits for encoding) for this area, but we cannot infer that from the J16 measure.

The J16 measure has been used in raw video encoding for reducing the burden of the mode decision challenge. In transcoding, additional information is available from the pre-encoded stream. Next, we analyze two of the main elements found in the pre-compressed stream: motion vectors and residuals.

4.3.2 Motion Vectors of the Pre-encoded Stream

The motion vectors found in the pre-encoded stream are the result of a motion search that has achieved the lowest rate-distortion cost J for the challenge of predicting a given macroblock (partition) from an area in a reference frame. In most cases, these motion vectors either correspond to the true motion trajectory or point to locations exhibiting a high degree of similarity with the search content. They hence give an indication about the characteristics of the underlying content.

Motion vectors of neighboring macroblocks exhibit a substantial correlation [28]. Consequently, instead of transmitting motion vectors independently of each other, the H.264 video coding standard takes this correlation into account by first computing a predicted motion vector (PMV) for every partition based on the median of specific available neighboring motion vectors. Only the differential between this predicted motion vector and the actual motion vector then needs to be encoded in the bitstream to further reduce redundancy. Therefore, a difference vector transmitted in the bitstream for a given partition generally indicates that the motion content of that partition could not be easily inferred from its surroundings. Hence, the more and the larger these motion vector differences are, the more the given content deviates from its surroundings.

The motion vector differences found in a given frame area stem from a multitude of underlying partitions. In order to obtain a representative measure for the motion vector differences present in the area of the full resolution stream that corresponds to every 16x16 block in the downscaled version, we compute the Motion Vector Difference Measure (MVDM).
This measure is an area-weighted average of all motion vector differences found within this area. For every 16x16 macroblock i in the downscaled video, the corresponding MVDM can be expressed by:

$\mathrm{MVDM}_i = \frac{\mathit{scaling\_factor}}{A_i} \sum_{j=1}^{K} \left\lVert A_j \times \mathrm{MVD}_j \right\rVert_1$   (4-2)

where, for the area Ai corresponding to the 16x16 macroblock i, Aj indicates the area associated with the absolute value of the motion vector difference MVDj for all K areas Aj composing Ai, and scaling_factor is the downscaling ratio used during transcoding. Since the motion characteristics of the full resolution video and the reduced resolution video are highly correlated, the MVDM serves as a motion complexity indicator for every 16x16 macroblock element within the downscaled video. Intuitively, we would expect that areas with high motion complexity benefit from small block sizes for prediction, and those with low motion complexity could be encoded with larger block sizes.

A map of the obtained MVDM values for every region corresponding to a 16x16 macroblock in the downscaled video is illustrated in Figure 4.7, again for frame 2 of the Soccer sequence that has been transcoded from 4CIF to CIF resolution. The corresponding map of final (generic) modes from cascaded transcoding is again shown in Figure 4.8.

Figure 4.7: Map of MVDM Values for Frame 2 of the Soccer 4CIF Sequence

Figure 4.8: Map of Block Sizes Identified in the Reference Implementation

We observe a correlation between large MVDM values and small block sizes and vice versa. In particular, this measure shows a higher correlation with the mode map than the J16 measure in the lower left and right regions of the frame, and it emphasizes coding characteristics not expressed by the previous measure. These areas contain regions with uncovered occlusions or texture details that could not efficiently be predicted by the motion characteristics of neighboring macroblocks. However, while this measure correlates with several areas of the desired generic mode map in Figure 4.8, it does not suffice for an accurate prediction.

4.3.3 Residual Information of the Pre-encoded Stream

The previous measures provide a high correlation with many of the most efficient choices for the block size. However, using those measures, the blocks in the lower middle would be classified as large, whereas in the reference implementation a lower cost was obtained when small block sizes are chosen. The characteristics found in this region are represented by an additional measure that is based on another main component of the pre-encoded stream: the residuals.

The residuals represent the difference between the predicted frame, which is derived from the motion vectors and the reference frame, and the original frame. A large residual in the pre-encoded stream indicates that a "perfect match" could not be found for a given image area. Generally, this also suggests that the underlying content has complex motion, occlusions or texture change characteristics that cannot be easily predicted by a simple displacement. This detail carries valuable information that can be exploited for efficient mode decision.
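As a concrete illustration of the motion-vector-based measure, the sketch below evaluates (4-2) from the motion vector differences parsed out of the pre-encoded stream. The function name and the flat (area, MVD) input format are illustrative assumptions made for this example, not the thesis implementation; the residual-based measure defined next is gathered over the same full-resolution area.

```python
def mvdm(overlaps, scaling_factor):
    """MVDM for one downscaled 16x16 macroblock, following (4-2).

    overlaps: list of (area_j, (mvd_x, mvd_y)) pairs, one per full-resolution
              partition j that intersects the area A_i corresponding to the
              macroblock; area_j is the overlap area in pixels.
    scaling_factor: spatial downscaling ratio used during transcoding.
    """
    area_i = sum(a for a, _ in overlaps)                    # A_i: union of overlaps
    weighted = sum(a * (abs(dx) + abs(dy))                  # A_j * ||MVD_j||_1
                   for a, (dx, dy) in overlaps)
    return scaling_factor / area_i * weighted if area_i else 0.0

# e.g. two overlapping partitions, with MVDs (1, -2) and (0, 0) in quarter-pel units
print(mvdm([(640, (1, -2)), (384, (0, 0))], scaling_factor=2.0))
```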
For every area in the pre-encoded stream that corresponds to a 16x16 macroblock i in the downscaled frame, we define the Sum of Absolute Residuals (SAR) measure as follows:

$\mathrm{SAR}_i = \mathit{scaling\_factor} \times \sum_{y=1}^{\lfloor 16 \cdot \mathit{scaling\_factor} \rfloor} \sum_{x=1}^{\lfloor 16 \cdot \mathit{scaling\_factor} \rfloor} \left| r_i[x, y] \right|$   (4-3)

where r_i[x, y] indicates the residual value at position (x, y) within the area of the pre-encoded video that corresponds to macroblock i of the low resolution frame.

When analyzing the correlation of this measure with the final block sizes in the downscaled video, we have indeed found a strong correlation between the distribution of large, medium-sized and small partition sizes for every 16x16 macroblock in the reduced resolution stream and the sum-of-absolute-residual (SAR) values originating within the areas of the pre-encoded stream that correspond to these macroblocks (in case of a 2:1 downsizing ratio, those corresponding areas contain 32x32 pixels). The illustration, once more referring to frame 2 of the Soccer sequence, shows the map of the SAR values in Figure 4.9, while Figure 4.10 again shows the map of the "efficient" generic modes.

Figure 4.9: Map of SAR Values for Frame 2 of the Soccer 4CIF Sequence

Figure 4.10: Map of Block Sizes Identified in the Reference Implementation

We observe that the SAR measure exhibits a correlation with the lower middle region, where small block sizes are found in the reference but did not correlate with the previous measures, and it also emphasizes scene content that was not present in the previous frame. The correlation identified indicates that small partition sizes are prevalent in the downscaled stream where the absolute residual values in the pre-encoded video are large, and vice versa. However, this statement holds true for most but not all cases, since small partitions that feature a close or perfect match would also carry a very small residual.

4.3.4 Derivation of the Decision Criterion

The observations in the previous paragraphs lead to the conclusion that an appropriate combination of the measures presented might yield a mode decision criterion that satisfies the goal criteria. Every one of the measures analyzed contributes information that improves the correlation with the final modes chosen in the reference. Generally, we would like to find the mode that would be the most likely choice, given our three measures. This can be expressed by:

$\hat{\Omega} = \max_{j} P\left(\omega_j \mid J_{16}, \mathrm{SAR}, \mathrm{MVDM}\right)$   (4-4)

with $\hat{\Omega}$ representing the generic mode distribution within the frame, P indicating the conditional probability, ωj representing a generic mode for macroblock j, and J16, MVDM, and SAR representing the measures chosen. A linear combination defining a threshold plane in the three-dimensional feature space spanned by the measures J16, SAR, and MVDM has the form:

$\alpha \, J_{16} + \beta \, \mathrm{SAR} + \gamma \, \mathrm{MVDM} = \delta$   (4-5)

with α, β, γ, and δ representing the parameters defining the plane. We analyzed whether or not such a linear threshold plane would yield a decision criterion that could significantly reduce the number of mode combinations that need to be evaluated for every macroblock and still achieve a rate-distortion performance similar to the one obtained by evaluating all 259 mode combinations.
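As an illustration of this analysis, the sketch below accumulates the rate-distortion cost penalty of one candidate plane over recorded per-macroblock data. The record layout and the function name are assumptions made for this example; the thesis records the costs of all 259 mode combinations during cascaded transcoding and evaluates a large number of candidate planes in this way.

```python
def plane_penalty(records, alpha, beta, gamma, delta):
    """Cumulative cost penalty when macroblocks below the candidate plane
    alpha*J16 + beta*SAR + gamma*MVDM < delta are restricted to SKIP/16x16.

    records: list of dicts, e.g.
      {'j16': ..., 'sar': ..., 'mvdm': ...,
       'best_large_cost':   min RD cost over SKIP and 16x16,
       'best_overall_cost': min RD cost over all 259 combinations}
    """
    penalty, restricted = 0.0, 0
    for r in records:
        if alpha * r['j16'] + beta * r['sar'] + gamma * r['mvdm'] < delta:
            restricted += 1
            # penalty is zero when SKIP/16x16 was the best choice anyway
            penalty += r['best_large_cost'] - r['best_overall_cost']
    return penalty, restricted
```

With α = 1, β = -1.82, γ = -1/32786 and δ = 421, the plane test reduces to the threshold criterion (4-6) derived next.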
For this purpose, we transcoded six test sequences that represent a wide variety of coding challenges found in natural video scenes (Akiyo, Foreman, Mobile, Soccer, Crew, Harbour) at 3 different downscaling ratios (4:3, 2:1, 2.2:1) with the "cascaded transcoder" and recorded both the cost values and the three measures associated with every one of the 259 mode combinations.

When analyzing the feature space of these measures only for the most efficient generic modes, we identified a prominent cluster of 16x16 modes in the region exhibiting low values for the three measures. This can also be confirmed for our illustrative frame from the Soccer sequence. The reference encoding determined the 16x16 block size or the SKIP mode to be most beneficial for the top middle region of the frame. For this area, Figures 4.5, 4.7, and 4.9 also exhibit low values.

In order to determine a threshold that separates possible clusters of generic modes in the feature space, we numerically evaluated the cost penalty J associated with erroneous mode assignments for a large number of threshold planes using our recorded cost data. Given the dominance of the SKIP and the 16x16 mode in natural video scenes, we first aimed to identify a linear threshold criterion below which the 16x16 mode is by far the most prevalent choice. In turn, the potential error in terms of the cumulative cost J of erroneous mode assignments for the sequences proved to be minimal for this threshold choice. The threshold criterion obtained from this analysis can be expressed by:

$J_{16,i} < 421 + 1.82 \, \mathrm{SAR}_i + \frac{1}{32786} \, \mathrm{MVDM}_i$   (4-6)

In summary, if J16,i remains below this threshold for a given macroblock i, SKIP and the 16x16 mode are the only modes evaluated and all other modes will not be considered.

Another 20% of the macroblocks are generally assigned to block sizes not smaller than 8x8 (see Table 4.1). In order to draw a benefit from this knowledge, we subsequently looked for a threshold that could possibly allow us to separate further clusters of generic modes. The group of modes with block sizes equal to or smaller than 8x8 was indeed identified to be a feasible cluster that could be separated with minimal cost penalty by the following threshold criterion:

$J_{16,i} < 608 + 2.45 \, \mathrm{SAR}_i + \frac{1}{32786} \, \mathrm{MVDM}_i$   (4-7)

If J16,i remains below this threshold for a given macroblock i, the modes involving 8x8 and smaller will not be evaluated; if it lies above the threshold, the 16x8 and 8x16 modes will not be evaluated. A flowchart of the proposed algorithm can be found in Figure 4.11.

Figure 4.11: Flowchart of the Proposed Algorithm

4.4 Experimental Results

The proposed algorithm was implemented using the JM reference software 14.2 on a 3.0 GHz Pentium IV platform. Simulations were performed with the first 100 frames of 3 standard test sequences (Akiyo, Foreman, Mobile) for transcoding from CIF resolution and 3 test sequences (Soccer, City, Crew) for transcoding from 4CIF resolution, using an IPPP GOP structure and 1 reference frame. Transcoding was performed for 3 different downscaling ratios, 2.2:1, 2:1, and 4:3. Table 4.2 shows the results for transcoding the test sequences with a downscaling ratio of 2:1. The data contrasts the results of the AWMVM method with unmodified mode decision against the proposed algorithm, detailing PSNR, bitrate, motion estimation time, the number of search points visited per macroblock, and the number of pixel comparisons per macroblock (CMP/MB).
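Before turning to the detailed results, the per-macroblock decision summarized in Figure 4.11 can be restated compactly. The sketch below is a hedged restatement of the rule given by (4-6) and (4-7), with illustrative mode-group labels; it is not the actual JM integration. The modes retained for a macroblock are then evaluated with the regular rate-distortion mode decision.

```python
def candidate_mode_groups(j16, sar, mvdm):
    """Return the inter mode groups to evaluate for one downscaled macroblock."""
    t1 = 421 + 1.82 * sar + mvdm / 32786.0   # threshold (4-6)
    t2 = 608 + 2.45 * sar + mvdm / 32786.0   # threshold (4-7), always above t1
    if j16 < t1:
        return ['SKIP', '16x16']                      # only the large modes
    if j16 < t2:
        return ['SKIP', '16x16', '16x8', '8x16']      # drop 8x8 and smaller
    return ['SKIP', '16x16', '8x8', 'sub8x8']         # drop 16x8 and 8x16
```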
116  Table 4.2: Results for Transcoding with a Downscaling Ratio of 2:1 Sequence Resol. QP PSNR (dB) Bitrate (kbps) M e  Ti m e  (s ) Co m p/ M B Se ar ch  P oi nt s/ M B PSNR (dB) Δ Bitrate (kbps) Δ% M e  Ti m e  (s ) Δ% Co m p/ M B Δ% Se ar ch  P oi nt s/ M B Δ% 16 47.36 122.2 6.6 37,579 631 47.29 ‐0.06 125.3 2.59% 2.4 ‐64.21% 18,949 ‐49.58% 190 ‐69.89% 20 44.42 68.6 7.2 36,239 593 44.37 ‐0.05 69.9 1.94% 2.4 ‐66.97% 17,957 ‐50.45% 161 ‐72.85% 24 41.49 38.5 6.8 34,653 546 41.46 ‐0.04 39.1 1.55% 2.0 ‐70.28% 17,163 ‐50.47% 139 ‐74.54% 28 38.32 20.9 5.2 33,956 522 38.28 ‐0.04 21.2 1.57% 2.0 ‐61.97% 16,805 ‐50.51% 125 ‐76.05% Average ‐0.05 1.91% ‐65.86% ‐50.25% ‐73.33% 16 45.50 689.9 6.7 39,099 639 45.42 ‐0.08 702.5 1.83% 3.0 ‐54.68% 24,680 ‐36.88% 307 ‐51.96% 20 42.13 405.1 7.5 39,644 638 42.08 ‐0.04 411.7 1.62% 3.9 ‐48.06% 24,602 ‐37.94% 297 ‐53.45% 24 39.00 237.8 5.7 40,017 633 38.99 0.00 242.3 1.89% 3.0 ‐47.17% 23,044 ‐42.41% 252 ‐60.19% 28 35.90 137.7 8.2 40,290 625 35.86 ‐0.04 139.8 1.51% 3.0 ‐63.18% 20,923 ‐48.07% 195 ‐68.80% Average ‐0.04 1.71% ‐53.27% ‐41.33% ‐58.60% 16 45.25 1,670.1 8.8 37,416 641 45.19 ‐0.06 1,679.0 0.54% 4.7 ‐46.90% 28,459 ‐23.94% 422 ‐34.17% 20 41.25 1,121.3 7.2 37,806 641 41.20 ‐0.05 1,127.6 0.56% 3.7 ‐48.52% 28,274 ‐25.21% 411 ‐35.88% 24 37.33 689.6 7.2 38,420 641 37.28 ‐0.05 695.1 0.80% 4.4 ‐38.74% 27,294 ‐28.96% 377 ‐41.19% 28 33.13 347.3 7.6 39,484 640 33.11 ‐0.02 349.9 0.76% 4.0 ‐46.66% 26,547 ‐32.77% 343 ‐46.41% Average ‐0.04 0.66% ‐45.20% ‐27.72% ‐39.41% Average (CIF ‐> QCIF) ‐0.04 1.43% ‐54.78% ‐39.77% ‐57.11% 16 45.79 4,060.0 10.6 42,249 641 45.69 ‐0.10 4,082.7 0.56% 3.9 ‐63.64% 22,223 ‐47.40% 209 ‐67.39% 20 43.13 2,342.7 12.4 42,161 640 43.09 ‐0.04 2,358.8 0.69% 4.3 ‐65.28% 23,128 ‐45.14% 229 ‐64.22% 24 40.51 1,410.8 9.6 41,905 634 40.49 ‐0.02 1,418.1 0.52% 3.6 ‐62.39% 22,479 ‐46.36% 214 ‐66.25% 28 37.57 796.6 9.7 41,433 622 37.57 ‐0.01 800.5 0.49% 3.5 ‐63.61% 20,757 ‐49.90% 175 ‐71.86% Average ‐0.04 0.56% ‐63.73% ‐47.20% ‐67.43% 16 45.12 3,408.5 9.4 39,068 641 45.03 ‐0.10 3,422.6 0.41% 4.0 ‐57.02% 22,195 ‐43.19% 247 ‐61.47% 20 41.43 1,796.5 9.7 39,589 641 41.38 ‐0.05 1,813.6 0.96% 4.4 ‐54.69% 22,524 ‐43.11% 248 ‐61.31% 24 37.97 921.3 10.7 40,424 640 37.96 ‐0.01 929.5 0.90% 4.0 ‐62.25% 22,377 ‐44.64% 234 ‐63.44% 28 34.66 448.2 10.5 41,339 637 34.67 0.01 453.6 1.20% 3.8 ‐63.31% 20,473 ‐50.48% 179 ‐71.90% Average ‐0.04 0.87% ‐59.32% ‐45.35% ‐64.53% 16 45.20 3,337.2 10.6 40,082 641 45.11 ‐0.10 3,352.9 0.47% 4.4 ‐59.09% 22,835 ‐43.03% 248 ‐61.31% 20 41.60 1,899.2 10.0 40,786 641 41.56 ‐0.04 1,914.4 0.80% 4.0 ‐59.41% 22,474 ‐44.90% 232 ‐63.81% 24 38.35 1,122.5 10.4 41,457 637 38.35 0.00 1,136.8 1.28% 3.9 ‐62.23% 21,954 ‐47.04% 210 ‐67.03% 28 35.28 643.1 9.9 41,919 631 35.29 0.01 647.9 0.75% 3.4 ‐65.94% 21,125 ‐49.61% 183 ‐71.00% Average ‐0.03 0.82% ‐61.67% ‐46.14% ‐65.79% Average (4CIF ‐> CIF) ‐0.04 0.75% ‐61.57% ‐46.23% ‐65.92% 704 x 576 to 352 x 288 704 x 576 to 352 x 288 704 x 576 to 352 x 288 Akiyo Foreman Mobile Crew City Soccer Reference (AWMVM) Proposed 352 x 288 to 176 x 144 352 x 288 to 176 x 144 352 x 288 to 176 x 144  We observe for the 2:1 downscaling ratio that 39.77% (CIFÆQCIF) or 46.23% (4CIFÆCIF) of pixel comparisons are avoided by using the proposed method with a small decrease in picture quality by 0.04dB and a bit rate penalty of 0.92% on average. The benefit becomes even larger when looking at the motion estimation time saved, which amounts to 54.78% (CIFÆQCIF) or 61.57% (4CIFÆCIF) , respectively.  
117  While the 2:1 downscaling ratio can also be handled efficiently by other approaches such as mode-mapping, our proposed algorithm provides the substantial benefit of an arbitrary downscaling ratio and can hence adapt to a multitude of requirements and usage scenarios. Table 4.3: Results for Transcoding with a Downscaling Ratio of 4:3 summarizes the results for transcoding with a downscaling ratio of 4:3.  118  Table 4.3: Results for Transcoding with a Downscaling Ratio of 4:3 Sequence Resol. QP PSNR (dB) Bitrate (kbps) M e  Ti m e  (s ) Co m p/ M B Se ar ch  P oi nt s/ M B PSNR (dB) Δ Bitrate (kbps) Δ% M e  Ti m e  (s ) Δ% Co m p/ M B Δ% Se ar ch  P oi nt s/ M B Δ% 16 47.85 220.8 15.9 38,922 631 47.81 ‐0.04 222.0 0.51% 5.8 ‐63.75% 20,411 ‐47.56% 210 ‐66.72% 20 45.31 118.9 15.2 36,310 576 45.29 ‐0.02 119.2 0.30% 5.8 ‐61.93% 20,712 ‐42.96% 222 ‐61.46% 24 42.59 65.6 13.4 34,271 524 42.63 0.03 65.7 0.13% 6.3 ‐52.99% 21,095 ‐38.45% 230 ‐56.11% 28 39.77 35.4 12.7 32,161 479 39.73 ‐0.04 35.2 ‐0.52% 6.7 ‐46.90% 21,433 ‐33.36% 235 ‐50.94% Average ‐0.02 0.11% ‐56.39% ‐40.58% ‐58.81% 16 45.96 1,142.6 15.4 39,724 639 45.84 ‐0.12 1,170.3 2.43% 5.4 ‐64.86% 20,693 ‐47.91% 215 ‐66.35% 20 42.85 653.5 16.4 40,179 637 42.75 ‐0.11 669.0 2.38% 4.9 ‐70.36% 20,984 ‐47.77% 215 ‐66.25% 24 39.88 391.0 16.2 40,374 631 39.85 ‐0.03 399.8 2.23% 6.5 ‐59.67% 21,214 ‐47.46% 214 ‐66.09% 28 36.90 229.8 17.3 40,256 618 36.88 ‐0.02 233.5 1.59% 6.3 ‐63.32% 21,600 ‐46.34% 217 ‐64.89% Average ‐0.07 2.16% ‐64.55% ‐47.37% ‐65.89% 16 45.35 2,821.7 15.3 37,498 641 45.16 ‐0.19 2,878.3 2.01% 5.6 ‐63.34% 20,433 ‐45.51% 237 ‐63.03% 20 41.62 1,837.8 15.2 37,919 641 41.49 ‐0.12 1,874.5 1.99% 5.7 ‐62.78% 20,306 ‐46.45% 228 ‐64.43% 24 37.98 1,123.5 16.9 38,475 639 37.89 ‐0.09 1,143.0 1.73% 5.5 ‐67.14% 20,406 ‐46.96% 222 ‐65.26% 28 34.08 579.2 15.9 39,327 638 34.02 ‐0.06 589.8 1.83% 5.9 ‐62.96% 20,769 ‐47.19% 220 ‐65.52% Average ‐0.12 1.89% ‐64.06% ‐46.53% ‐64.56% Average (CIF ‐> 256 x 208) ‐0.07 1.38% ‐61.67% ‐44.83% ‐63.09% 16 46.13 8,437.6 24.4 42,835 641 46.00 ‐0.13 8,471.9 0.41% 9.1 ‐62.80% 22,465 ‐47.55% 213 ‐66.77% 20 43.66 4,571.6 24.7 42,594 640 43.61 ‐0.05 4,621.1 1.08% 9.4 ‐61.87% 22,429 ‐47.34% 216 ‐66.25% 24 41.34 2,687.5 24.1 42,027 631 41.32 ‐0.02 2,710.5 0.86% 9.8 ‐59.50% 22,431 ‐46.63% 217 ‐65.61% 28 38.59 1,538.1 23.9 41,142 613 38.58 ‐0.01 1,549.4 0.74% 9.7 ‐59.54% 22,284 ‐45.84% 213 ‐65.25% Average ‐0.05 0.77% ‐60.93% ‐46.84% ‐65.97% 16 45.61 6,855.8 22.2 39,818 641 45.46 ‐0.14 6,885.6 0.43% 8.0 ‐64.02% 20,168 ‐49.35% 194 ‐69.73% 20 42.18 3,295.1 23.0 39,914 641 42.11 ‐0.07 3,328.2 1.00% 8.2 ‐64.54% 20,187 ‐49.42% 194 ‐69.73% 24 39.02 1,590.2 22.7 40,339 639 38.99 ‐0.03 1,633.7 2.74% 8.2 ‐63.73% 20,729 ‐48.61% 203 ‐68.23% 28 35.90 810.3 22.5 40,905 633 35.88 ‐0.03 831.3 2.59% 8.0 ‐64.42% 21,036 ‐48.57% 202 ‐68.09% Average ‐0.07 1.69% ‐64.18% ‐48.99% ‐68.95% 16 45.85 6,858.3 24.2 40,840 641 45.72 ‐0.12 6,854.6 ‐0.06% 7.9 ‐67.29% 20,100 ‐50.78% 177 ‐72.39% 20 42.47 3,709.6 24.9 41,037 640 42.43 ‐0.05 3,721.7 0.33% 8.3 ‐66.56% 20,553 ‐49.92% 185 ‐71.09% 24 39.35 2,107.4 23.8 41,276 635 39.33 ‐0.02 2,119.4 0.57% 7.8 ‐67.06% 21,099 ‐48.88% 193 ‐69.61% 28 36.29 1,205.4 23.1 41,420 625 36.28 ‐0.01 1,217.1 0.97% 8.6 ‐62.85% 21,302 ‐48.57% 193 ‐69.12% Average ‐0.05 0.45% ‐65.94% ‐49.54% ‐70.55% Average (4CIF ‐> 528 x 432) ‐0.06 0.97% ‐63.68% ‐48.46% ‐68.49% Soccer 704 x 576 to 528 x 432 Mobile 352 x 288 to 256 x 208 Crew 704 x 576 to 528 x 432 City 704 x 576 to 528 x 432 Reference (AWMVM) Proposed Akiyo 352 x 288 to 256 x 208 
Foreman 352 x 288 to 256 x 208  Again, we observe similar computational savings and rate-distortion performance. For this downscaling ratio, an average of44.83% (CIFÆQCIF) or 48.46% (4CIFÆCIF) could be saved in terms of pixel comparisons per macroblock. In conjunction with all the operations involved in the evaluation of every mode, the motion estimation time could be reduced by 61.67% (CIFÆQCIF) or 63.68% (4CIFÆCIF) on average at a distortion  119  penalty of 0.07dB(CIFÆQCIF) or 0.06dB (4CIFÆCIF) and 1.38% (CIFÆQCIF) or 0.97% (4CIFÆCIF) in bitrate. In order to validate the results further, a third downscaling ratio of 2.2:1 was evaluated and summarized in Table 4.4: Results for Transcoding with a Downscaling Ratio of 2.2:1. Table 4.4: Results for Transcoding with a Downscaling Ratio of 2.2:1 Sequence Resol. QP PSNR (dB) Bitrate (kbps) M e  Ti m e  (s ) Co m p/ M B Se ar ch  P oi nt s/ M B PSNR (dB) Δ Bitrate (kbps) Δ% M e  Ti m e  (s ) Δ% Co m p/ M B Δ% Se ar ch  P oi nt s/ M B Δ% 16 47.27 92.2 5.4 37,595 632 47.18 ‐0.09 93.2 1.07% 2.2 ‐59.42% 19,502 ‐48.13% 205 ‐67.56% 20 44.24 51.8 5.4 37,112 608 44.15 ‐0.09 52.1 0.65% 2.3 ‐57.98% 19,950 ‐46.24% 214 ‐64.80% 24 41.21 29.3 5.6 36,222 573 41.18 ‐0.03 29.4 0.38% 3.5 ‐37.60% 20,901 ‐42.30% 229 ‐60.03% 28 38.04 16.3 5.6 35,168 539 38.04 0.01 16.4 0.98% 2.6 ‐52.85% 21,284 ‐39.48% 230 ‐57.33% Average ‐0.05 0.77% ‐51.96% ‐44.04% ‐62.43% 16 45.63 531.5 5.3 38,881 640 45.36 ‐0.27 542.2 2.01% 2.4 ‐55.18% 21,092 ‐45.75% 238 ‐62.81% 20 42.25 316.2 6.1 39,476 639 42.09 ‐0.16 322.6 2.01% 2.5 ‐59.19% 21,175 ‐46.36% 233 ‐63.54% 24 39.08 190.7 6.2 39,917 634 39.03 ‐0.05 194.4 1.95% 2.2 ‐63.76% 21,265 ‐46.73% 225 ‐64.51% 28 35.90 113.7 6.2 40,354 628 35.87 ‐0.03 115.9 1.94% 2.8 ‐55.25% 21,621 ‐46.42% 224 ‐64.33% Average ‐0.13 1.98% ‐58.35% ‐46.32% ‐63.80% 16 45.10 1,169.6 5.8 37,136 641 44.92 ‐0.17 1,177.9 0.71% 2.4 ‐58.53% 20,796 ‐44.00% 255 ‐60.22% 20 41.15 764.4 6.3 37,606 641 41.00 ‐0.15 771.9 0.99% 2.2 ‐65.65% 20,378 ‐45.81% 235 ‐63.34% 24 37.31 455.6 5.9 38,325 641 37.19 ‐0.12 459.6 0.87% 1.8 ‐69.15% 20,990 ‐45.23% 242 ‐62.25% 28 33.29 228.4 5.8 39,509 640 33.22 ‐0.07 230.2 0.79% 2.6 ‐55.85% 21,474 ‐45.65% 240 ‐62.50% Average ‐0.13 0.84% ‐62.29% ‐45.17% ‐62.08% Average (CIF ‐> 160x128) ‐0.10 1.20% ‐57.53% ‐45.17% ‐62.77% 16 46.02 3,014.2 8.5 42,033 641 45.87 ‐0.15 3,044.3 1.00% 3.2 ‐62.02% 22,376 ‐46.77% 223 ‐65.21% 20 43.30 1,803.0 8.0 42,027 640 43.20 ‐0.10 1,824.7 1.20% 3.6 ‐55.64% 22,357 ‐46.80% 224 ‐65.00% 24 40.55 1,102.6 8.5 41,901 635 40.50 ‐0.05 1,116.7 1.28% 2.8 ‐67.70% 22,278 ‐46.83% 219 ‐65.51% 28 37.47 631.5 7.9 41,625 625 37.44 ‐0.03 639.4 1.25% 3.2 ‐59.83% 22,197 ‐46.67% 215 ‐65.60% Average ‐0.08 1.18% ‐61.30% ‐46.77% ‐65.33% 16 45.22 2,259.1 8.3 38,945 641 45.02 ‐0.20 2,274.5 0.68% 2.9 ‐64.92% 20,816 ‐46.55% 224 ‐65.05% 20 41.58 1,156.6 9.4 39,592 641 41.43 ‐0.15 1,165.5 0.77% 2.7 ‐70.83% 21,038 ‐46.86% 225 ‐64.90% 24 38.23 590.2 8.2 40,469 640 38.15 ‐0.08 594.4 0.72% 2.8 ‐65.78% 21,653 ‐46.49% 227 ‐64.53% 28 34.98 299.0 9.4 41,400 638 34.96 ‐0.02 301.5 0.82% 2.9 ‐68.84% 21,545 ‐47.96% 211 ‐66.93% Average ‐0.11 0.75% ‐67.59% ‐46.97% ‐65.35% 16 45.24 2,401.5 9.3 40,026 641 45.13 ‐0.10 2,422.6 0.88% 2.8 ‐70.30% 20,449 ‐48.91% 195 ‐69.58% 20 41.73 1,403.9 8.5 40,834 641 41.69 ‐0.04 1,413.0 0.65% 2.9 ‐65.27% 21,233 ‐48.00% 204 ‐68.17% 24 38.60 863.1 8.5 41,512 638 38.58 ‐0.02 869.6 0.75% 3.1 ‐63.55% 22,107 ‐46.75% 215 ‐66.30% 28 35.54 507.6 9.1 41,967 633 35.54 0.00 512.6 0.98% 3.7 ‐59.15% 22,143 ‐47.24% 209 ‐66.98% Average ‐0.04 0.81% 
‐64.57% ‐47.72% ‐67.76% Average (4CIF ‐> 320x256) ‐0.08 0.91% ‐64.49% ‐47.15% ‐66.15% Reference (AWMVM) Proposed Akiyo 352 x 288 to 160 x 128 Foreman 352 x 288 to 160 x 128 Soccer 704 x 576 to 320 x 256 Mobile 352 x 288 to 160 x 128 City 704 x 576 to 320 x 256 Crew 704 x 576 to 320 x 256

For this downscaling ratio, an average of 45.17% (CIF→QCIF) or 47.15% (4CIF→CIF) of the pixel comparisons per macroblock could be saved. The motion estimation time could be reduced by 57.53% (CIF→QCIF) or 64.49% (4CIF→CIF) on average, at a distortion penalty of 0.11 dB (CIF→QCIF) or 0.08 dB (4CIF→CIF) and 1.20% (CIF→QCIF) or 0.91% (4CIF→CIF) in terms of bitrate. The corresponding rate-distortion plots are shown in Figures 4.12 to 4.17.

Figure 4.12: Rate-Distortion Plot for Akiyo CIF
Figure 4.13: Rate-Distortion Plot for Foreman CIF
Figure 4.14: Rate-Distortion Plot for Mobile CIF
Figure 4.15: Rate-Distortion Plot for City 4CIF
Figure 4.16: Rate-Distortion Plot for Crew 4CIF
Figure 4.17: Rate-Distortion Plot for Soccer 4CIF
(Each plot shows PSNR(Y) versus bitrate in kbit/s for the AWMVM reference and the proposed method at the 2.2:1, 2:1, and 4:3 downscaling ratios.)

4.5 Conclusion

We presented an early-termination mode decision process for efficiently transcoding H.264 video content to a smaller resolution by an arbitrary downscaling factor. The method takes advantage of the correlation present between the residual of the pre-encoded stream, the final cost value from the 1-pixel full search around the AWMVM vector, and the motion vector differences found in the pre-encoded stream. Our performance evaluations have shown that about 46% of the computations can be saved compared to the unmodified AWMVM method while preserving picture quality (-0.06dB) and incurring a 1.1% increase in bitrate on average.

4.6 References

[1] R. Mohan, J.R. Smith and Chung-Sheng Li, "Adapting multimedia Internet content for universal access", IEEE Transactions on Multimedia, vol. 1, pp. 104-114, 1999.
[2] A. Vetro, C. Christopoulos and H. Sun, "Video transcoding architectures and techniques: An overview", IEEE Signal Processing Magazine, vol. 20, pp. 18-29, 2003.
[3] I. Ahmad, X. Wei, Y. Sun and Y. Zhang, "Video transcoding: An overview of various techniques and research issues", IEEE Transactions on Multimedia, vol. 7, pp. 793-804, 2005.
[4] J. Xin, C. Lin and M.
Sun, "Digital video transcoding", Proc IEEE, vol. 93, pp. 84-97, 2005. [5] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, "Draft ITU-T recommendation and final draft international standard of joint video specification (ITU-T Rec. H.264/ISO/IEC 14 496-10 AVC)", ITU-T Rec. H.264/ISO/IEC 14 496-10 AVC, vol. 1, 2003. [6] P.A.A. Assuncao and M. Ghanbari, "Frequency-domain video transcoder for dynamic bit-rate reduction of MPEG-2 bit streams", IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, pp. 953-967, 1998. [7] Min Li and Bo Wang, "An Efficient H.264 Transcoder with Spatial Downscaling for Wireless Communications," in Wireless Communications, Networking and Mobile Computing, 2009. WiCom '09.5th International Conference on, pp. 1-4, 2009. [8] H. Shen, X. Sun, F. Wu, H. Li and S. Li, "A fast downsizing video transcoder for H.264/AVC with rate-distortion optimal mode decision," in 2006 IEEE International Conference on Multimedia and Expo, pp. 2017-20, 2006. [9] T. Qian, J. Sun, D. Li, X. Yang and J. Wang, "Transform domain transcoding from MPEG-2 to H.264 with interpolation drift-error compensation", IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, pp. 523-34, 2006.  125  [10] M. Pantoja, L. Nam and S. Weijia, "Coefficient conversion for transform domain VC-1 to H.264 transcoding," in 2007 IEEE Workshop on Signal Processing Systems, SiPS 2007, October 17, 2007 - October 19, pp. 363-7, 2007. [11] G. Chen, S. Lin and Y. Zhang, "A fast coefficients convertion method for the transform domain MPEG-2 to H.264 transcoding," in International Conference on Digital Telecommunications 2006, ICDT'06, August 29, 2006 - August 31, 2006. [12] P. Yin, A. Vetro, B. Liu and H. Sun, "Drift compensation for reduced spatial resolution transcoding: A summary", IEEE Circuits and Systems Magazine, vol. 4, pp. 32-6, 2004. [13] P. Yin, M. Wu and B. Liu, "Video transcoding by reducing spatial resolution," in International Conference on Image Processing (ICIP 2000), September 10, 2000 - September 13, pp. 972-5, 2000. [14] J. Youn and M. Sun, "Adaptive motion vector refinement for high performance transcoding," in Part 2 (of 3), October 4, 1998 - October 7, pp. 596-600, 1998. [15] T. Wiegand, G.J. Sullivan, G. Bjontegaard and A. Luthra, "Overview of the H.264/AVC video coding standard", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-76, 2003. [16] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini and G.J. Sullivan, "Rate- constrained coder control and comparison of video coding standards", IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 688-703, 2003. [17] A.M. Tourapis, "Enhanced predictive zonal search for single and multiple frame motion estimation", Proceedings of the SPIE, vol. 4671, pp. 1069-79, 2002. [18] S. Goel, Y. Ismail and M.A. Bayoumi, "Adaptive search window size algorithm for fast motion estimation in H.264/AVC standard," in IEEE Intl. Midwest Symposium on Circuits and Syst. (MWSCAS), pp. 1557-60, 2005.  126  [19] B. Shen, I.K. Sethi and B. Vasudev, "Adaptive motion-vector resampling for compressed video downscaling", IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, pp. 929-36, 1999. [20] Y. Tan and H. Sun, "Fast motion re-estimation for arbitrary downsizing video transcoding using H.264/AVC standard", IEEE Transactions on Consumer Electronics, vol. 50, pp. 887-94, 2004. 
[21] Jiao Wang, En-hui Yang and Xiang Yu, "An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Spatial Resolution Conversion," in Multimedia and Expo, 2007 IEEE International Conference on, pp. 444-447, 2007. [22] P. Zhang, Y. Lu, Q. Huang and W. Gao, "Mode mapping method for H.264/AVC spatial downscaling transcoding," in 2004 International Conference on Image Processing, ICIP 2004, October 18,2004 - October 21, pp. 2781-2784, 2004. [23] J. Bu, L. Mo, C. Chen and Z. Yang, "Fast mode decision algorithm for spatial resolutions down-scaling transcoding to H.264", Journal of Zhejiang University: Science, vol. 7, pp. 70-5, 2006. [24] K.M. Nam, J. Kim, R. Park and Y.S. Shim, "A fast hierarchical motion vector estimation algorithm using mean pyramid", IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, pp. 344-51, 1995. [25] B. Liu and A. Zaccarin, "New fast algorithms for the estimation of block motion vectors", IEEE Transactions on Circuits and Syst. for Video Technology, vol. 3, pp. 148-57, 1993. [26] S. Zhu and K. Ma, "A new diamond search algorithm for fast block matching motion estimation," in IEEE International Conference on Information, Communications and Signal Processing (ICICS), pp. 292-6, 1997. [27] M. von dem Knesebeck and P. Nasiopoulos, "An efficient early-termination mode decision algorithm for H.264", IEEE Transactions on Consumer Electronics, vol. 55, pp. 1501-10, 2009.  127  [28] G.J. Sullivan, P. Topiwala and A. Luthra, "The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions," in Applications of Digital Image Processing, pp. 454-74, 2004.  128  CHAPTER 5: CONCLUSIONS 5.1 Overall Significance of the Research In the introductory chapter, we illustrated that motion estimation is the most demanding element in video encoding. It remains the key task and challenge for a video encoder to find the best combination of encoding options for a given video scene among the set of parameters offered by the chosen video coding standard. While compression efficiency is highly desired for all applications, many application scenarios such as real-time encoding or applications with low processing power do not permit the exhaustive evaluation of all coding options. In many if not most of these cases, manufacturers or users retreat to non-adaptively reducing the number of choices available for the purpose of meeting the main application constraints, but concurrently compromising significant portions of the compression efficiency available by a given coding standard. For the aspect of motion estimation, the research work presented in this thesis allows making informed, efficient and highly-adaptive choices about the most likely coding options for the video content at hand, allowing compression with significantly increased efficiency even in environments with challenging technical constraints. First, we analyzed the accuracy required for evaluating distortion during the motion estimation process. In Chapter 2, we propose a method that adaptively adjusts the applied accuracy in distortion evaluation in order to reduce the number of computations  129  necessary for identifying a “best match”. The obtained method reduces computations by more than 40% without sacrificing quality or bitrate for the resulting video stream. In the second section, we analyze the plethora of block size combinations defined and introduced by the H.264 standard. 
In Chapter 3, we propose an algorithm which makes an early decision about the most probable block size combinations for the underlying video content and hence assists in avoiding the evaluation of many combinations that are highly unlikely to add further compression benefit. In the third section, we address the challenge of mode decision for the specific topic of transcoding existing compressed video content to a compressed stream with a lower resolution, an application that is gaining rapid significance with the growing number of different devices accessing and sharing digital video material. In chapter 4, we propose a pertinent method that takes information from the pre- encoded video stream and from information during the re-encoding process that can be used to derive important conclusions about most efficient block size combinations most suitable for the downscaled video stream. 5.2 Potential Applications of the Research Findings A large number of application scenarios could benefit from the proposed methods. Since video encoding is among the most demanding applications, achieving a high degree of compression efficiency is not only desirable but sometimes also a key factor for even realizing the benefit of certain applications. The first method which focuses on adaptively adjusting the required accuracy for candidate evaluation during motion search can be applied to any video encoding  130  application that is highly limited in processing power or energy reservoir, such as mobile devices. The same application scenario applies to the second method presented and offers additional means to significantly reduce computational requirements for encoding with high compression efficiency. This is again highly attractive for devices with limiting constraints in terms of processing power or energy reservoir. Since the required time for encoding is drastically reduced by the proposed algorithm as well, it becomes attractive for real-time or near real-time encoding applications. With its feature of arbitrarily adjusting the target resolution, the third method opens new applications and distribution opportunities for content providers. The algorithm allows serving digital video content in a format that is specifically optimized for the resolution of the target device that is requesting the content. Due to the high compression efficiency, bandwidth, storage space and possible processing power for adjusting the content are efficiently used. This allows serving the content with higher quality and a more efficient use of the available technical environment. 5.3 Contributions In the preceding chapters, three methods are proposed that speed-up the motion estimation process in video encoding or video transcoding, respectively. In summary, the contributions are: • We presented an analysis of the search accuracy required for evaluating distortion and analyzed correlations in natural video sequences between magnitudes of displacement vectors and distortion measures  131  • We developed an efficient method that permits taking advantage of this correlation in order to speed-up the motion estimation process in video encoding. This is done by defining an adaptive threshold for assigning reduced accuracy distortion measurements where applicable. 
• We investigated the computationally intensive mode decision process in H.264 video encoding and identified correlations that can be used to adaptively reduce the evaluation process to only those block size combinations that are most beneficial for compressing a given underlying video content. By applying our method, the computational burden was reduced significantly without diminishing compression efficiency. • We analyzed the aspect of block size selection in the specific application of video transcoding with an arbitrary downscaling ratio. By taking advantage of measures derived from the pre-encoded video stream and an early measure from the re- encoding process, we were able to create a much faster video transcoder with very high compression efficiency. 5.4 Comments on Future Research The proposed algorithms supply a contribution to making highly efficient digital video encoding accessible for many new applications and services by reducing the massive computational burden of motion estimation. The large number of coding combinations defined by the H.264 video coding standard facilitates a greatly improved compression efficiency. However, finding the best one among them involves performing a large number of evaluations and usually discarding every result except for the most  132  beneficial one. In this thesis, we presented methods that assist in avoiding many of these computations that would otherwise lack further benefit. For the upcoming H.265 standard, additional features are proposed for encoding that also affect the motion estimation process and decision criteria [1]. These features include e.g. the 32x32 or 64x64 pixel block size for mode decision, additional motion vector predictors (motion vector competition), adaptive interpolation filters or 1/8-pel accuracy for motion vectors. This is complemented by putting emphasis on higher resolutions which affects  the most likely motion vector sizes and search ranges to be considered in motion estimation. In future research, these criteria need to be considered for identifying further measures or techniques that could assist in avoiding unnecessary computations and to limit the complexity while preserving the achievable compression efficiency. Currently, the best coding combination is identified by the evaluation of a Lagrangian cost function that provides a means for balancing the trade-off between bitrate and distortion. Distortion itself is currently assessed by a rather technical measure, the Peak-Signal-to-Noise Ratio (PSNR). A number of studies have shown, however, that PSNR does not accurately correlate with the quality perceived by the Human Visual System (HVS). Some new measures are consequently being proposed in literature that more precisely model “perceived quality” [2]. However, they often involve a larger degree of computational complexity compared to PSNR and hence often even add more to the burden of video encoding. A promising candidate presently spreading in literature is the Structure Similarity Index Measure (SSIM), which adds complexity within reasonable limits and shows an improved correlation with “perceived quality” [3]. Using more accurate measures for “perceived quality” opens the opportunity for even higher  133  compression efficiency but the computational complexity involved in achieving this level continues to be a critical challenge. 
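For reference, SSIM compares the local means, variances and cross-covariance of the original and the distorted image. The sketch below is a minimal single-window version following [3]; the full metric averages this statistic over sliding local windows, and the constants are the ones proposed by the authors. It is included only to illustrate the structure of the metric, not as the evaluation used in this thesis.

```python
import numpy as np

def ssim_block(x, y, L=255.0, K1=0.01, K2=0.03):
    """Single-window SSIM between two equally sized luma blocks (Wang et al. [3])."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2            # stabilizing constants
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
```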
Motion estimation algorithms specifically designed for the requirements and priorities of a target application allow reaping the desired benefit and maximizing the user experience while adhering to the underlying technological and economic constraints.

Presently, 3D or multiview video is gaining great interest in academic research and also substantial commercial support. Efficient compression and encoding with low computational complexity are key topics with high priority in this research realm. The existing H.264/MVC has been shown to be very limited. A new inter-view prediction structure must be introduced to address the challenges of this new application. A favoured approach could be the inclusion of depth and inter-view frames. Motion estimation may end up being a much more complex process.

5.5 References

[1] Joint Collaborative Team on Video Coding (JCT-VC), "Table of Proposal Design Elements for High Efficiency Video Coding (HEVC)," Output Document JCTVC-A203, pp. 1-10, Dresden, 2010. http://ftp3.itu.int/av-arch/jctvc-site/2010_04_A_Dresden/JCTVC-A203.zip
[2] K. Seshadrinathan and A.C. Bovik, "New vistas in image and video quality assessment," in SPIE Human Vision and Electronic Imaging, Jan. 2007.
[3] Z. Wang, A.C. Bovik, H.R. Sheikh and E.P. Simoncelli, "Image quality assessment: From error visibility to structural similarity", IEEE Transactions on Image Processing, vol. 13, pp. 600-612, 2004.

A. APPENDIX

A.1 The Area-Weighted Motion Vector Median Method (AWMVM)

With every motion vector vi, the AWMVM operation for N underlying motion vectors V = {v1, v2, …, vN} can be expressed by:

$v_{\mathrm{AWMVM}} = S \times \underset{v_j \in V}{\arg\min} \sum_{i=1}^{N} \left\lVert v_j - v_i \right\rVert_2$

where vAWMVM indicates the median vector, $\lVert \cdot \rVert_2$ the Euclidean norm for vector distances, and S symbolizes a 2x2 diagonal matrix for downscaling the resulting vector. Figure A.1 shows an example for a non-integer downscaling ratio of a full resolution section of a frame, with the dark-shaded region indicating an area that corresponds to a 16x16 macroblock in the downscaled version of the frame. AWMVM computes the initial motion vector for the downscaled 16x16 partition (red arrow) as the area-weighted median of all motion vectors covered by the dark-shaded area.

Figure A.1: Illustration of the AWMVM Method for a Non-integer Downscaling Ratio
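To make the appendix concrete, the following sketch selects the area-weighted vector median for one downscaled macroblock and then applies the downscaling matrix S. The explicit area weights follow the verbal description and Figure A.1, and all names and the tuple-based input format are illustrative assumptions rather than the thesis implementation.

```python
from math import hypot

def awmvm(vectors, areas, scale_x, scale_y):
    """vectors: full-resolution motion vectors v_1..v_N as (vx, vy) tuples.
    areas:   overlap of each vector's partition with the target 16x16 region.
    scale_x, scale_y: diagonal entries of the 2x2 downscaling matrix S."""
    def weighted_distance_sum(v):
        # area-weighted sum of Euclidean distances to all covered vectors
        return sum(a * hypot(v[0] - w[0], v[1] - w[1])
                   for a, w in zip(areas, vectors))
    best = min(vectors, key=weighted_distance_sum)   # argmin over v_j in V
    return best[0] * scale_x, best[1] * scale_y      # apply S to the median vector

# e.g. three covered partitions with overlap areas of 256, 128 and 64 pixels
print(awmvm([(4, 2), (6, 2), (-8, 0)], [256, 128, 64], 0.5, 0.5))
```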
