EFFICIENT CODING AND MAPPING ALGORITHMS FOR SOFTWARE IMPLEMENTATION OF A REAL-TIME H.263+ COMPLIANT LOW BIT RATE VIDEO CODER

by

BERNA EROL
B.Sc. (Control and Computer Engineering), Istanbul Technical University, 1994

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1998
© Berna Erol, 1998

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical and Computer Engineering
The University of British Columbia
Vancouver, Canada

Abstract

The growing interest in video coding applications has led to the development of video coding standards. Several successful standards have emerged, such as the ITU-T H.261 and H.263 and the ISO/IEC MPEG-1 and MPEG-2. ITU-T recently announced its latest low bit rate video coding standard, H.263 Version 2, also known as the H.263+ standard [25]. While this new low bit rate video coding standard provides better compression performance levels than those of the former ones, it is more complex and more computationally demanding.
In this thesis, we develop coding and mapping algorithms that improve video coding speed performance, making software implementation of interactive real-time low bit rate video coding on the Pentium MMX general purpose processor a reality. First, using statistical properties of low resolution and slowly varying video sequences, we significantly reduce the computation times of the most computationally intensive components of video coding, particularly the DCT, the IDCT, quantization and the half pixel motion search. We also map some of the SIMD (Single Instruction Multiple Data) oriented functions onto Intel's MMX architecture. The developed algorithms are implemented using our public-domain H.263+ encoder/decoder software [45]. After the algorithmic optimization and the MMX mapping, the resulting H.263+ video encoder implementation runs approximately 3 times faster than the original unoptimized public-domain encoder implementation. Moreover, our optimized H.263+ compliant video coder implementation can encode/decode more than 15 frames of a QCIF (176x144) video sequence per second without sacrificing video reproduction quality.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments
Chapter 1 Introduction
Chapter 2 Background and Preliminaries
  2.1 Introduction
  2.2 Basics of Video Coding
    2.2.1 Motion Estimation and Motion Compensation
    2.2.2 Transform Coding, Quantization and Variable Length Coding
  2.3 The H.263 Video Coding Standard
    2.3.1 Baseline Coding
      2.3.1.1 Video Input / Output and Picture Formats
      2.3.1.2 The Baseline Encoder
      2.3.1.3 Decoder
      2.3.1.4 Motion Vector Prediction in H.263
      2.3.1.5 Forward and Inverse Quantization
      2.3.1.6 3D VLC Coding of Transform Coefficients
        2.3.1.6.1 Coding of DC coefficients
        2.3.1.6.2 Coding of AC coefficients
    2.3.2 Coded Video Bit Stream Structure
    2.3.3 Negotiable Modes of H.263
      2.3.3.1 Unrestricted Motion Vector Mode (UMV)
      2.3.3.2 Advanced Prediction Mode (AP)
      2.3.3.3 PB-frames Mode (PB)
      2.3.3.4 Syntax-based Arithmetic Coding Mode (SAC)
  2.4 H.263 Version 2 (H.263+)
    2.4.1 H.263+ Optional Modes
      2.4.1.1 Unrestricted Motion Vector Mode (UMV)
      2.4.1.2 Advanced Intra Coding Mode (AIC)
      2.4.1.3 Deblocking Filter Mode (DF)
      2.4.1.4 Slice Structured Mode (SS)
      2.4.1.5 Supplemental Enhancement Information Mode (SEI)
      2.4.1.6 Improved PB-frames Mode (IPB)
      2.4.1.7 Reference Picture Selection Mode (RPS)
      2.4.1.8 Temporal, SNR, and Spatial Scalability Mode
      2.4.1.9 Reference Picture Resampling Mode (RPR)
      2.4.1.10 Reduced Resolution Update Mode (RRU)
      2.4.1.11 Independently Segmented Decoding Mode (ISD)
      2.4.1.12 Alternative Inter VLC Mode (AIV)
      2.4.1.13 Modified Quantization Mode (MQ)
    2.4.2 Fast Motion Vector Search Algorithm
  2.5 Intel's MMX Architecture
    2.5.1 MMX Data Types, Features and Instructions
      2.5.1.1 MMX Data Types
      2.5.1.2 Saturation Arithmetic
      2.5.1.3 MMX Instructions
        2.5.1.3.1 Packed addition and subtraction with optional saturation
        2.5.1.3.2 Packed multiplications
        2.5.1.3.3 Packed compare instructions
        2.5.1.3.4 Packed shift instructions
        2.5.1.3.5 Conversion instructions
        2.5.1.3.6 Logical instructions
        2.5.1.3.7 Memory transfer instructions
        2.5.1.3.8 Empty MMX state operation
    2.5.2 Backward Compatibility
      2.5.2.1 MMX Registers and IA Floating-Point Registers
    2.5.3 The Superscalar Architecture of MMX
      2.5.3.1 MMX Pipeline Structure
Chapter 3 Performance of the H.263+ Video Coding Standard
  3.1 Introduction
  3.2 Performance Levels of Individual Modes
    3.2.1 Advanced Intra Coding mode (AIC)
    3.2.2 Deblocking Filter mode (DF)
    3.2.3 Improved PB-frames mode (IPB)
    3.2.4 Alternative Inter VLC mode (AIV)
    3.2.5 Modified Quantization mode (MQ)
  3.3 Computation Times and Compression Performance Improvements of Individual Modes
  3.4 Compression Performance and Complexity of Mode Combinations
Chapter 4 Efficient Low Bit Rate Video Coding Algorithms
  4.1 Introduction
  4.2 DCT
    4.2.1 Zero Block Prediction Prior to DCT
      4.2.1.1 Zero Block Predictor
      4.2.1.2 Prediction Accuracy
      4.2.1.3 Computation Time Gains
      4.2.1.4 Simulation Results
      4.2.1.5 Implementation Complexity and Memory Requirements
  4.3 IDCT
    4.3.1 Implementation Complexity and Memory Requirements
  4.4 Quantization and Dequantization
    4.4.1 Quantization with Tables
    4.4.2 Implementation Complexity and Memory Requirements
  4.5 Partial SAD Computation Technique
  4.6 Fast Half Pixel Motion Estimation Based on Approximation
    4.6.1 A Simplified ME for the MPEG-2 Encoder
    4.6.2 Fast Search ME and Half Pixel ME Approximation
    4.6.3 Implementation Complexity and Memory Requirements
Chapter 5 MMX Implementation of the H.263+ Video Encoder
  5.1 Introduction
  5.2 Video Coding Cores with SIMD Structure
  5.3 SAD Computations for Motion Estimation
  5.4 DCT
  5.5 IDCT
  5.6 Interleaved Data Transfers for Half Pixel ME
  5.7 Interpolation
  5.8 Motion Compensation
  5.9 Load ME Area
  5.10 Optimization of the MMX Functions
    5.10.1 Data Alignment
    5.10.2 Optimization of Pipeline Usage
    5.10.3 Memory Read and Write
    5.10.4 Other Optimizations
  5.11 Overall Performance Improvement of the Application
Chapter 6 Conclusions and Future Research
Appendix A Overall Performance Improvements
Appendix B Example Applications for the H.263+ Video Coding Standard
  6.1 Video Telephony/Conferencing over a Serial Link
  6.2 Internet Video Conferencing

List of Tables

List of Figures

Figure 1. Forward motion compensation
Figure 2. Structure of a macroblock
Figure 3. Block diagram of the H.263 encoder
Figure 4. Bilinear prediction in half pixel motion estimation
Figure 5. Block diagram of the H.263 decoder
Figure 6. Motion vector prediction from surrounding macroblocks
Figure 7. Zig-zag scan order of an 8x8 block
Figure 8. Hierarchical block structure in a CIF image
Figure 9. PB frame structure
Figure 10. Forward and bidirectional prediction for a B picture block
Figure 11. Neighboring blocks used for intra prediction in the Advanced Intra Coding mode
Figure 12. MMX architecture data types
Figure 13. Multiply-add word to doubleword
Figure 14. Compare words (greater than)
Figure 15. Unpacking a byte to a word
Figure 16. Mapping MMX registers to floating-point registers
Figure 17. Pentium and MMX technology pipeline structures
Figure 18. Advanced Intra Coding rate-distortion performance for the AKIYO sequence
Figure 19. Reconstructed image for FOREMAN picture number 100 without (a) and with (b) the Deblocking Filter mode and Post Filter set on
Figure 20. Improved PB-frames mode: rate-distortion performance for FOREMAN
Figure 21. (a) Chrominance and (b) average PSNR performance for the Modified Quantization mode
Figure 22. Encoding times for the H.263 and H.263+ modes for FOREMAN at 64 kb/s on a 200 MHz PC
Figure 23. Block diagram of the H.263+ encoder implementation
Figure 24. Flowgraph for the fast 1-D DCT from Arai, Agui, and Nakajima [1]
Figure 25. Zero block statistics for different predictors
Figure 26. Number of computations of a fast IDCT for different sub zero block patterns
Figure 27. Zero sub block detection via bit stream parsing
Figure 28. Half pixel motion vector prediction using 8 precomputed integer pixel SADs
Figure 29. Half pixel motion vector prediction using 5 precomputed integer pixel SADs
Figure 30. Half and integer pixel locations in a diamond shape area
Figure 31. PSNR vs. number of iterations tradeoff in method 3
Figure 32. Data flow to compute the absolute value of differences of two unsigned data
Figure 33. Flowgraph of multiplication and downscaling operations performed with MMX instructions
Figure 34. Half pixel search block locations in interpolated reconstructed data
Figure 35. Flowgraph of the data interleaving for the MMX half pixel ME function
Figure 36. H.263+ interpolation
Figure 37. The inner core of the MMX implementation of interpolation
Figure 38. C routine to create an n-byte aligned array
Figure 39. Video telephony application over a serial link using wireless modems
Figure 40. VIC application using the H.263+ codec

Acknowledgments

First of all, I would like to thank my husband, Arda Erol, for his great love and confidence in me, and for making my life easier during my heavy course load and research work.
My research would not have become a reality without the very valuable supervision of Dr. Faouzi Kossentini and Dr. Hussein Alnuweiri. Their guidance and support are greatly appreciated. I would like to thank my colleagues Guy Cote, Michael Gallant and Ayman Elnaggar for sharing their valuable technical knowledge with me. I also wish to express my gratitude to Dr. Stephan Wenger for the useful reference materials he has provided me, and to our research engineer Dr. Alen Docef for making our lives easier with his great technical support. Last, but certainly not least, I thank all my colleagues and friends for their invaluable friendship.

Chapter 1 Introduction

During the last decade, there has been a significant amount of interest in video coding and its applications, such as video conferencing, video e-mailing and video telephony. In the late 1970s, the telecommunications industry realized that continuous growth of audiovisual services would only be possible with international standards. This necessity led to the standardization of a number of international video coding algorithms, yielding the ITU-T H.261, H.263 and H.263 Version 2 and the ISO/IEC MPEG-1 and MPEG-2 standards, which address a wide range of applications with different requirements in terms of bit rate, quality, complexity, error resilience and delay. The limited transmission rates supported by public switched telephone networks (PSTN) and wireless networks pose a significant problem for digital video communication applications. When the currently available algorithms are used in very low bit rate applications (<64 kb/s), they mostly yield very poor picture quality with blocking artifacts, or require operation at low frame rates, yielding low temporal resolution.
In recent years, ITU-T successfully established two video coding standards, H.263 in December 1995 and H.263 Version 2 (H.263+) in January 1998, that provide better performance levels at low bit rates than those achieved by previously available video coding standards, including ITU-T H.261. The basic configurations of these standards are based on ITU-T Recommendation H.261. In addition to their baseline operation, H.263 and H.263+ have negotiable modes that improve rate-distortion performance and provide robust operation in the presence of channel errors, of course at the expense of additional complexity, computation and memory requirements.

Until recently, real-time video encoding and decoding were only possible using Application Specific Integrated Circuits (ASICs) or multi-DSP platforms, resulting in coding systems such as the VSP3 H.261 from NEC, the VDSP2 MPEG-2 from Matsushita Electric, and the Multimedia Video Processor (MVP), which employs the TMS320C80, from Texas Instruments [2]. Due to the growing interest in multimedia applications and the enormous computing power these applications require, many general purpose processors now have multimedia extension units where one instruction can be performed on more than one data input simultaneously. Examples of such architectural extensions are the VIS extension of Sun's UltraSparc [49], the MAX-2 extension of Hewlett-Packard's PA-RISC [19], the MMX extension of Intel's Pentium [21] and the Multimedia Instruction Set extensions of Silicon Graphics' microprocessors [46]. These enhancements have greatly helped make real-time software video applications a reality.

In this thesis, we propose two approaches for improving the speed performance of video coding, thereby supporting interactive real-time video applications on the Pentium MMX general purpose processor.
In the first approach, we develop platform independent efficient video coding algorithms that reduce the encoding time by exploiting statistical properties of low resolution and slowly varying video sequences. Such algorithms significantly improve the computation times of the DCT, the IDCT, block matching (SAD), quantization and the half pixel motion search. The second approach is processor specific: we map several SIMD (Single Instruction Multiple Data) oriented components of video coding onto Intel's MMX architecture. Such components include the DCT, interpolation, the SAD and the half pixel motion vector search. All the algorithms and techniques proposed in this thesis are implemented using our public-domain H.263+ encoder/decoder software [45]. After the algorithmic optimization and MMX mapping, the H.263+ video encoder implementation runs approximately 3 times faster than our unoptimized public-domain encoder implementation. Moreover, the resulting H.263+ compliant video coding implementation can encode/decode QCIF video sequences at 15 frames per second on a Pentium MMX 200 MHz processor without sacrificing video reproduction quality.

The thesis is organized as follows. Chapter 2 provides an overview of the H.263 and H.263+ low bit rate video coding standards; it also introduces Intel's MMX architecture and briefly describes its instruction set and superscalar architecture. The performance of an H.263+ compliant encoder, more specifically our public-domain H.263+ video coding software [45], is discussed in Chapter 3. Chapter 4 describes the platform independent algorithms developed and implemented to increase the efficiency of H.263+ encoding, and discusses the complexity-performance tradeoffs associated with each algorithm. In Chapter 5, the MMX mapping algorithms for some of the compute intensive video encoder functions, and their performance results, are presented.
Finally, conclusions and suggestions for future research are given in Chapter 6.

Chapter 2 Background and Preliminaries

2.1 Introduction

As stated earlier, the main objective of this thesis is to develop an efficient implementation of the H.263+ video encoding algorithm on Intel's MMX architecture. In this chapter, we first present basic concepts of video coding. This is followed by a discussion of the H.263 and H.263+ standards. The chapter concludes with a presentation of the basic features of the MMX architecture.

2.2 Basics of Video Coding

In most video sequences, there is usually little change from one video frame to the next. Video compression and coding algorithms take advantage of such temporal redundancies by coding the difference between a predicted frame and the current frame instead of encoding the current frame itself. A compression scheme that employs only temporal redundancy reduction is referred to as an interframe coder. If the difference between two consecutive video frames is very large, then it is more advantageous to code the frame itself rather than the corresponding difference. Video compression algorithms reduce spatial redundancies within a video frame by employing transform coding, as in still image coding. A compression scheme that employs only spatial redundancy reduction, without any prediction from the previous frame, is referred to as an intraframe coder. The combination of intra and inter frame coding is called hybrid coding. In video coding terminology, an intra coded picture is called an I-picture, a picture that is predicted from the previous picture is called a P-picture, and a picture that is bidirectionally predicted from both the previous and future pictures is referred to as a B-picture.

2.2.1 Motion Estimation and Motion Compensation

If the present picture is well predicted from the previous picture, then the difference between these two pictures will be quite small.
In order to achieve a good prediction of the picture currently being encoded, the motion of the objects in the video sequence must be taken into account. Although a number of approaches have been investigated, the method that works best is block-based motion compensation. In this method, the picture is divided into blocks of size MxM pixels each. For each block, the previous picture is searched for the best match. This search operation is called Motion Estimation (ME), and the relative distance between the best matching block in the previous picture and the current block is called the Motion Vector (MV). Figure 1 illustrates the block-based motion estimation procedure. Since it would be computationally very expensive to search the entire picture for the best match, ME is usually performed within a limited window called a search window.

Figure 1. Forward motion compensation. (MV: motion vector; the prediction error between the current block and the best matching macroblock in the past picture is sent to transform coding.)

There are different measures that can be used to determine the best matching block, such as minimizing the mean squared error (MSE) or minimizing the mean absolute error (MAE), which are given by

MSE = M1(i,j) = (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} ( X_{m,n} − X_{m+i,n+j} )²,  |i| ≤ m2, |j| ≤ n1, M = N,   (1)

MAE = M2(i,j) = (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} | X_{m,n} − X_{m+i,n+j} |,  |i| ≤ m2, |j| ≤ n1, M = N,   (2)

where N and M are the dimensions of the block and X_{m+i,n+j} is the value of the block element located in row m+i and column n+j of the reference (previous) frame. Extensive investigations have shown that the MAE performs as well as the MSE in motion prediction [38]. Since the number of operations necessary for calculating the MAE is much smaller than that for the MSE, the MAE error measure is more commonly used. Motion estimation is the most time consuming part of all video encoders [38], [41].
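Dropping the 1/MN normalization in (2) gives the sum of absolute differences (SAD), the matching cost that block-based encoders actually compute. The sketch below illustrates the idea for a 16x16 block and an exhaustive integer pixel search; the function names and the full-search driver are ours for illustration, not the thesis implementation.

```c
#include <limits.h>

/* SAD between the 16x16 block at 'cur' and the block displaced by
 * (dx, dy) in the previous picture; 'stride' is the picture width in
 * pixels.  Illustrative code, not the thesis implementation. */
int sad16(const unsigned char *cur, const unsigned char *prev,
          int stride, int dx, int dy)
{
    int sad = 0;
    for (int m = 0; m < 16; m++)
        for (int n = 0; n < 16; n++) {
            int d = cur[m * stride + n]
                  - prev[(m + dy) * stride + (n + dx)];
            sad += (d < 0) ? -d : d;
        }
    return sad;
}

/* Exhaustive integer pixel search over a (2r+1) x (2r+1) window;
 * returns the smallest SAD and writes the motion vector.  The caller
 * must ensure the window stays inside the previous picture. */
int full_search(const unsigned char *cur, const unsigned char *prev,
                int stride, int r, int *mvx, int *mvy)
{
    int best = INT_MAX;
    for (int dy = -r; dy <= r; dy++)
        for (int dx = -r; dx <= r; dx++) {
            int s = sad16(cur, prev, stride, dx, dy);
            if (s < best) { best = s; *mvx = dx; *mvy = dy; }
        }
    return best;
}
```

Each candidate costs 256 subtractions, absolute values and additions, which is why the thesis later introduces a partial SAD computation technique (Section 4.5) and maps the SAD onto MMX (Section 5.3).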
Therefore, many efficient algorithms (e.g., logarithmic search, hierarchical search) have been suggested in order to reduce the number of operations required to find the best matching block [2]. Later in this chapter, we present a fast ME search algorithm [8], developed in our laboratory, that was recently included in the H.263+ Test Model [26].

2.2.2 Transform Coding, Quantization and Variable Length Coding

After motion estimation is performed, the difference block between the best matching block and the current block is computed. A transform operation is then applied to the difference block to reduce its spatial redundancies. Most coding algorithms use the Discrete Cosine Transform (DCT) because of its ability to greatly decorrelate signals and its smaller complexity compared to that of other transforms with similar performance, such as the Karhunen-Loeve-Hotelling (KLH) transform [38]. The 2-D DCT of an 8x8 block f(i,j) is given by

F(k,l) = (c(k) c(l) / 4) Σ_{i=0}^{7} Σ_{j=0}^{7} f(i,j) cos( (2i+1)kπ / 16 ) cos( (2j+1)lπ / 16 ),   (3)

where k, l = 0, 1, ..., 7, and c(k) = 1/√2 if k = 0 and c(k) = 1 otherwise.

Following transform coding, the resulting coefficients are quantized, and the quantized DCT coefficients, motion vectors and side information, such as the picture coding type and the quantizer step size, are entropy coded using Variable Length Codes (VLCs). Quantization is a lossy coding method, whereas VLC coding is lossless. After quantization, the coefficient block is reconstructed by dequantization and application of the Inverse DCT (IDCT), and fed back to the coding loop to be used for prediction of temporally future blocks.

2.3 The H.263 Video Coding Standard

The objective of H.263 [24] is to provide significantly better picture quality at the same bit rate than the existing ITU-T H.261 video coding standard [23]. Due to the short standardization schedule, H.263 was based on the existing H.261 technology.
In principle, H.263 is network independent and can be used in a large variety of applications. However, it targets low bit rate networks such as the public switched telephone network (PSTN) and wireless networks, and its target applications are video telephony and video conferencing. Therefore, coding techniques that introduce high delays are not used in the H.263 standard. In summary, the basic requirements of the H.263 standard were as follows:

> Use of available technology.
> Low complexity, thus low cost.
> Interoperability and coexistence with other ITU communication standards.
> Robust operation in the presence of channel errors.
> Flexibility to allow further extension.
> Flexibility of tradeoff between picture quality, complexity and bit rate.

The standard was finalized in December 1995. With its improved baseline coding and four negotiable modes, H.263 compliant coders outperform H.261 ones, and it is predicted that the H.263 standard will replace the H.261 standard in many applications [16].

2.3.1 Baseline Coding

The baseline coding algorithm of H.263 is similar to that used by H.261; however, there are some enhancements and changes that improve rate-distortion performance, including more flexible input picture formats, half pixel motion compensation and modified VLC coding.

2.3.1.1 Video Input / Output and Picture Formats

The input to the H.263 coder consists of non-interlaced pictures that are based on the Common Intermediate Format (CIF), in order to provide a single recommendation for both 625- and 525-line television standards. The standard itself, however, does not define the conversion to CIF from video sources such as NTSC, PAL and SECAM. In H.263, the pictures are coded in the YCbCr color format, which was developed by ITU-R specially for digital video coding. While Y represents the luminance component of a pixel, Cb and Cr represent its chrominance components.
The YCbCr signals are scaled and offset versions of the YUV signals.

Format     Luminance pixels (H x V)   Chrominance pixels (H x V)
sub-QCIF   128 x 96                   64 x 48
QCIF       176 x 144                  88 x 72
CIF        352 x 288                  176 x 144
4CIF       704 x 576                  352 x 288
16CIF      1408 x 1152                704 x 576

Table 1. Resolutions of the H.263 picture formats.

H.263 supports the sub-QCIF, QCIF, CIF, 4CIF and 16CIF input picture formats, whereas H.261 supports only the QCIF and CIF formats. These standardized picture formats and their spatial resolutions are listed in Table 1. For all of these picture formats, the luminance component is sampled at every pixel in the picture, while each chrominance component is sampled at half the luminance resolution in both the horizontal and vertical directions. The CIF, 4CIF and 16CIF picture formats are optional for encoders as well as decoders. The encoder, however, should support at least one of the QCIF or sub-QCIF formats, while support for both QCIF and sub-QCIF is mandatory for decoders. This requirement of two mandatory formats for the decoder and only one for the encoder is the result of a compromise between high resolution and low cost: the more expensive encoder needs to support only one of the resolutions.

2.3.1.2 The Baseline Encoder

Figure 2. Structure of a macroblock (four luminance blocks Y1-Y4, one Cb block and one Cr block).

Each picture in the input video sequence is divided into macroblocks. A macroblock consists of four 8x8 luminance blocks, one 8x8 Cb block and one 8x8 Cr block, as shown in Figure 2. The motion compensated prediction operation is performed at the macroblock level.

Figure 3. Block diagram of the H.263 encoder.
(ME1: Integer pixel motion estimation and intra/inter decision; ME2: Half pixel motion estimation; M1: Input frame store; M2: Decoded frame store; PRED: Make prediction block; MBTYPE: Decide block type and block pattern; VLC(C): Variable-length coder for transform coefficients; VLC(M): VLC for motion vectors; CC: Coding control; DCT: Discrete Cosine Transform; IDCT: Inverse DCT; Q: Quantizer; IQ: Inverse Quantizer)

All pictures are inter coded in H.263 encoding except for the first picture. Even when a picture is inter coded, it can contain some macroblocks that are intra coded; the decision for each macroblock is made during the encoding process. Figure 3 shows the block diagram of the H.263 baseline encoder. First, an integer pixel motion vector is calculated for the 16x16 luminance block. Then, a comparison is made between the currently encoded block and the displaced block in the previous picture. If the difference is small enough, the block passes through a motion compensated predictive coding process. If the difference is large, then the macroblock is intra coded, i.e., the 8x8 luminance and chrominance blocks are transform coded with the 2-D DCT, and the transform coefficients are then quantized and VLC coded. The H.263 standard itself does not define the decision-making processes for inter or intra coding; instead, the test models defined for the standard specify the decision-making processes of H.263 encoders. After integer pixel motion estimation, if a block is to be inter coded, the motion search continues with a half pixel search around the current motion vector position. Half pixel values are found using bilinear interpolation, as shown in Figure 4. Half pixel search is optional in H.263. Once motion estimation is completed, the difference between the motion compensated block and the current block is DCT coded, quantized and VLC coded.
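The bilinear half pixel interpolation of Figure 4 can be sketched as follows. The routine and its name are ours for illustration; it upsamples a w x h area to (2w−1) x (2h−1), producing the a, b, c and d positions of the figure with the rounding terms and truncating integer division of the standard.

```c
/* 2:1 bilinear upsampling per Figure 4: even/even outputs are the
 * integer pixels (a), odd/even and even/odd are the horizontal (b)
 * and vertical (c) two-point averages, and odd/odd is the four-point
 * average (d).  "/" is truncating integer division. */
void interpolate2x(const unsigned char *in, int w, int h,
                   unsigned char *out /* (2w-1) x (2h-1) */)
{
    int ow = 2 * w - 1;
    for (int y = 0; y < 2 * h - 1; y++)
        for (int x = 0; x < ow; x++) {
            int x0 = x / 2, y0 = y / 2;            /* top-left integer pel */
            int x1 = x0 + (x & 1), y1 = y0 + (y & 1);
            int A = in[y0 * w + x0], B = in[y0 * w + x1];
            int C = in[y1 * w + x0], D = in[y1 * w + x1];
            int v;
            if (!(x & 1) && !(y & 1))      v = A;                        /* a */
            else if ((x & 1) && !(y & 1))  v = (A + B + 1) / 2;          /* b */
            else if (!(x & 1) && (y & 1))  v = (A + C + 1) / 2;          /* c */
            else                           v = (A + B + C + D + 2) / 4;  /* d */
            out[y * ow + x] = (unsigned char)v;
        }
}
```

Because the interpolated picture holds roughly four samples per original pixel, this is one of the SIMD-friendly cores that Chapter 5 maps onto packed MMX arithmetic.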
Information bits for the motion vectors are also added to the bit stream. A B X a X X Inte ger pix el p ositi on o o O Half pixel position c d C X X D a = A b = (A + B) / 2 c=(A + C) /2 d=(A + B+ C + D+2) /4 Figure 4. Bilinear prediction in half pixel motion estimation. 12 Chapter 2: Background and Preliminaries Similar to a Differential Pulse Code Modulator (DPCM), H.263 decodes and reconstructs the image in the encoder. Then the reconstructed picture is added to the current predicted picture in order to obtain the decoded picture, which is then stored in memory (M2). 2.3.1.3 Decoder Video out IDCT IQ VLD(C) DMUX BUF Data in Inter/Intra 0 / M ** MBTYPE *\ M H PRED VLD(M) Video signal Side information Figure 5. Block diagram of the H.263 decoder. (M: Decoded frame store; MBTYPE: Decided block type and block pattern;. VLD(C): Variable-length decoder for transform coefficients; VLD(M): VLD for motion vectors; IDCT: Inverse Discrete Cosine Transform; IQ: Inverse Quantizer) Figure 5 shows a block diagram of the H.263 video decoder. First, the bit stream is parsed and variable-length decoded to obtain the coefficients, the motion vectors and other side information. After the coefficients are decoded by inverse quantization, IDCT. is applied to the coefficient blocks. If the block is intra coded, the reconstructed block is equal to the result of the inverse 13 Chapter 2: Background and Preliminaries transformation. For inter coded blocks, the reconstruction is formed by motion compensation, i.e. summing the predicted block and the inverse transformed block. 2.3.1.4 Motion Vector Prediction in H.263 Motion vectors are coded using differential coding at the encoder. A median-based M V predictor is used in order to predict the motion vectors. The motion vector is obtained by adding predicted vectors to the difference vectors (MVD) at the decoder. 
In the case of one motion vector per macroblock, the candidate predictors are taken from three surrounding macroblocks, as shown in Figure 6. The predicted horizontal and vertical motion vector components are calculated independently. For each component, the predicted value is the median of the three candidates. The same motion vector is used for all of the luminance blocks. The motion vectors for the chrominance blocks are obtained by dividing each component of the luminance MV by two, due to the 2:1 chrominance resolution.

Figure 6. Motion vector prediction from surrounding macroblocks. (MV: current motion vector; MV1, MV2, MV3: the motion vectors used to predict MV. At a picture or GOB border, candidates outside the border are replaced by (0,0) or by MV1, as indicated in the figure.)

2.3.1.5 Forward and Inverse Quantization

Depending on the prediction method selected at the encoder, the coefficients to be quantized can have widely varying statistics. In the case of intra prediction, the first element of the coefficient block, also referred to as the DC coefficient, takes a much larger value than the rest of the coefficients, which are called the AC coefficients. If the block is inter coded, all the coefficients take smaller values. In order to efficiently quantize both inter and intra coded blocks, one quantizer is used for the intra DC coefficient and one of 31 predefined quantizers is used for the AC coefficients. The decision levels of the quantizers are not defined within the standard. It is suggested that the quantizer for the DC coefficient be a uniform quantizer with a step size of 8. Each of the other 31 quantizers uses equally spaced reconstruction levels with a central dead-zone around zero and an even step size in the range 2 to 62.
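The component-wise median prediction described above can be sketched as follows (an illustrative fragment; the function name and the (x, y) tuple representation are ours):

```python
def predict_mv(mv1, mv2, mv3):
    """Median-based MV prediction: each component of the predictor is
    the median of the corresponding components of the three candidates."""
    def median(a, b, c):
        return sorted((a, b, c))[1]
    return (median(mv1[0], mv2[0], mv3[0]),
            median(mv1[1], mv2[1], mv3[1]))
```

At picture or GOB borders, the caller would first substitute (0,0) or MV1 for unavailable candidates, as shown in Figure 6.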
The absolute values of the reconstruction levels of non-zero coefficients are calculated as

|Reconstruction Level| = QUANT x (2 x |LEVEL| + 1)       if QUANT is odd, and    (4)
|Reconstruction Level| = QUANT x (2 x |LEVEL| + 1) - 1   if QUANT is even,

where QUANT is the quantizer step size and LEVEL is the quantized coefficient. Note that this process yields only odd-valued reconstruction levels, which prevents the accumulation of IDCT mismatch errors. The sign of the quantized transform coefficient is signaled at the end of the VLC code word. After inverse quantization, the reconstruction levels of all coefficients other than the DC coefficient are clipped to the range -2048 to 2047.

2.3.1.6 3D VLC Coding of Transform Coefficients

One of the main improvements that the H.263 standard introduces over H.261 is the new VLC tables and the 3D VLC coding technique.

2.3.1.6.1 Coding of DC coefficients

Since the human visual system is more sensitive to blocking artifacts, which are mainly due to the DC coefficient, this coefficient is treated separately from the other 63 coefficients. The DC coefficient is coded differentially, using a first order predictor given by

DIFF = DC_i - DC_{i-1} ,    (5)

where DC_i is the DC coefficient of the current block and DC_{i-1} is the DC coefficient of the previous block.

2.3.1.6.2 Coding of AC coefficients

After quantization, most of the AC values that represent higher frequencies become zero. To exploit this, before VLC coding, the 63 quantized AC coefficients are scanned in a zigzag order, as illustrated in Figure 7.

Figure 7. Zig-zag scan order of an 8x8 block.

In H.261, non-zero coefficients are called LEVELs and are coded using Huffman-like variable length coding. Because many AC coefficients become zero after quantization, runs of zeros along the zigzag scan direction are also identified and compacted.
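Equation (4), together with the clipping step, can be sketched as follows (an illustrative fragment; the function name is ours, and the separate handling of the intra DC coefficient is omitted):

```python
def inverse_quantize(level, quant):
    """Reconstruction per Eq. (4): |REC| = QUANT*(2*|LEVEL| + 1),
    minus 1 when QUANT is even; sign restored; result clipped to
    [-2048, 2047] as for non-DC coefficients."""
    if level == 0:
        return 0
    rec = quant * (2 * abs(level) + 1)
    if quant % 2 == 0:
        rec -= 1                      # keeps the result odd-valued
    if level < 0:
        rec = -rec
    return max(-2048, min(2047, rec))
```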
This is done by using RUN VLCs, which indicate the number of zero coefficients preceding a non-zero coefficient. If the remaining AC coefficients are all zero, a VLC that represents End Of Block (EOB) is used and coding is completed. The H.263 standard improves this technique by embedding the end-of-block indication in the code of each coefficient instead of coding it separately. Therefore, a single code represents the LEVEL of the coefficient, the number of zero coefficients preceding it (RUN), and whether that particular AC coefficient is the LAST of the non-zero coefficients. The VLCs are optimized so that blocks with very few non-zero AC coefficients are represented with short VLCs. Variable length codes are not defined for all combinations of LEVEL, RUN and LAST. If a VLC is not found for a specific combination, the combination is coded by a 7-bit ESCAPE code followed by a 1-bit LAST, a 6-bit RUN and an 8-bit LEVEL code.

2.3.2 Coded Video Bit Stream Structure

The H.263 syntax for the coded video bit stream has a hierarchical representation, as shown in Figure 8, with four data layers: a Picture layer, a Group of Blocks (GOB) layer, a Macroblock (MB) layer and a Block (8x8) layer. Each layer is composed of data and corresponding header information. Picture properties, such as temporal reference, picture format, picture type, quantizer and use of optional modes, are kept in the Picture layer. Each picture is divided into groups of blocks (GOBs) in order to enable quick resynchronization after transmission errors. The quantizer can be updated at the GOB layer. The Macroblock layer carries macroblock-specific information such as motion vectors, the Coded Macroblock Indication (COD), which indicates whether a macroblock is coded, the Coded Block Pattern (CBP), which shows which of the blocks within a macroblock are coded, and the macroblock-level quantizer.
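The formation of (LAST, RUN, LEVEL) events from a zigzag-scanned block can be sketched as follows (an illustrative fragment; the function name is ours, and the subsequent table lookup or ESCAPE coding is omitted):

```python
def run_level_last(ac_coeffs):
    """Form (LAST, RUN, LEVEL) events from a zigzag-ordered list of
    quantized AC coefficients, as in H.263's 3D VLC scheme."""
    events, run = [], 0
    for c in ac_coeffs:
        if c == 0:
            run += 1                   # extend the current run of zeros
        else:
            events.append([0, run, c]) # LAST=0: more non-zeros follow
            run = 0
    if events:
        events[-1][0] = 1              # mark the final event as LAST
    return [tuple(e) for e in events]
```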
Finally, the Block layer consists of the VLC-coded residual data of the 8x8 blocks.

Figure 8. Hierarchical block structure in a CIF image (352x288 pels): the picture is divided into groups of blocks (GOBs), each GOB into macroblocks, and each macroblock into 8x8 blocks (luminance blocks Y1-Y4 plus chrominance).

2.3.3 Negotiable Modes of H.263

In addition to the core encoding and decoding algorithms described above, H.263 includes four negotiable advanced coding modes: Unrestricted Motion Vectors, Advanced Prediction, PB-frames and Syntax-based Arithmetic Coding. The first two modes are used to improve inter picture prediction. The PB-frames mode improves temporal resolution with little increase in bit rate. If the Syntax-based Arithmetic Coding mode is enabled, arithmetic coding replaces the default Huffman-based VLC coding. These optional modes allow developers to trade off compression performance and complexity. In the following, we provide a brief description of each of these modes. A more detailed description can be found in [16].

2.3.3.1 Unrestricted Motion Vector Mode (UMV)

In baseline H.263, motion vectors can only reference pixels that are within the picture area. Because of this, macroblocks at the border of a picture may not be well predicted. When the Unrestricted Motion Vector mode is used, motion vectors can take on values in the range [-31.5, 31.5] instead of [-16, 15.5] and are allowed to point outside the picture boundaries. The longer motion vectors improve coding efficiency for larger picture formats, i.e., 4CIF or 16CIF. Moreover, by allowing motion vectors to point outside the picture, a significant gain is achieved when there is movement along picture edges. This is especially useful in the case of camera movement or background movement.
2.3.3.2 Advanced Prediction Mode (AP)

In this mode, overlapped block motion compensation is used for the luminance component of P-pictures. When it is advantageous, four motion vectors per macroblock, i.e., one vector per 8x8 luminance block, are allowed instead of one. Also, as in the Unrestricted Motion Vector mode, the motion vectors are allowed to point outside the picture. The encoder has to decide which type of motion vectors to use: four motion vectors require more bits, but yield better prediction. The use of this mode generally provides a considerable improvement and results in fewer blocking artifacts.

2.3.3.3 PB-frames Mode (PB)

Figure 9. PB frame structure.

In this mode, a frame structure consists of a P-picture and a B-picture, as illustrated in Figure 9. The quantized DCT coefficients of the B- and P-pictures are interleaved at the macroblock layer such that a P-picture macroblock is immediately followed by a B-picture macroblock. Therefore, the maximum number of blocks transmitted at the macroblock layer is twelve rather than six. The P-picture is forward predicted from the previously decoded P-picture. The B-picture is bi-directionally predicted from the previously decoded P-picture and the P-picture currently being decoded, as illustrated in Figure 10. The forward and backward motion vectors for a B-macroblock are calculated by scaling the motion vector of the corresponding P-picture macroblock according to the temporal positions of the B- and P-pictures with respect to the previous P-picture. If this scaled motion vector does not yield a good prediction, it can be enhanced by a delta vector, which is obtained by performing motion estimation within a small search window around the calculated motion vectors.
When decoding a PB-frame macroblock, the P-macroblock is reconstructed first, followed by the B-macroblock, since information from the P-macroblock is needed for B-macroblock prediction. When the PB-frames mode is used, the picture rate can be doubled without a significant increase in bit rate.

Figure 10. Forward and bidirectional prediction for a B-picture block.

2.3.3.4 Syntax-based Arithmetic Coding Mode (SAC)

Baseline H.263 employs Huffman-based variable length coding as its means of entropy coding. In this mode, all variable length encoding/decoding operations of H.263 are replaced with arithmetic encoding/decoding operations. Since both VLC and arithmetic coding are lossless coding schemes, the SNR and the reconstructed frames remain the same, but the bit rate is reduced by approximately 5% due to the more efficient arithmetic codes.

2.4 H.263 Version 2 (H.263+)

The objective of H.263 Version 2, also known as H.263+ in the standards community, is to broaden the range of applications and to improve compression efficiency and error resilience. H.263+ offers many improvements over H.263. It allows the use of a wide range of custom source formats, as opposed to H.263, where only five video source formats defining picture size, picture shape and clock frequency can be used. This added flexibility opens up H.263+ to a broader range of video scenes and applications, such as wide-format pictures, resizeable computer windows and higher refresh rates. Picture size, aspect ratio and clock frequency can be specified as part of the H.263+ bit stream. Moreover, with 12 new negotiable modes, H.263+ not only improves the rate-distortion performance but also increases error resilience and provides a scalable syntax.
Scalability can improve the delivery of video information in error-prone, packet-lossy or heterogeneous environments by allowing multiple display rates, bit rates and resolutions to be available at the decoder.

2.4.1 H.263+ Optional Modes

Next, the 12 new optional coding modes of the H.263+ video coding standard are described, including the modification of H.263's Unrestricted Motion Vector mode when used within an H.263+ framework.

2.4.1.1 Unrestricted Motion Vector Mode (UMV)

The definition of the Unrestricted Motion Vector mode in H.263+ differs from that of H.263. When this mode is employed within an H.263+ framework, new reversible VLCs (RVLCs) are used for encoding the difference motion vectors. Reversible VLCs are easy to implement, as a simple state machine can be used to generate and decode them. The idea behind RVLCs is that decoding can be performed by processing the received motion vector part of the bit stream in both the forward and reverse directions. If an error is detected while decoding in the forward direction, the motion vector data is not completely lost, as the decoder can proceed in the reverse direction; this improves the error resilience of the bit stream. Furthermore, the motion vector range is extended to up to 256, depending on the picture size. This is very useful given the wide range of new picture formats available in H.263+.

2.4.1.2 Advanced Intra Coding Mode (AIC)

This mode improves compression performance when coding intra macroblocks. It employs intra-block prediction from neighboring intra blocks, a modified inverse quantization of intra DCT coefficients, and a separate VLC table for intra coded coefficients. Block prediction is performed using data from the same luminance or chrominance component (Y, Cb or Cr).

Figure 11.
Neighboring blocks used for intra prediction in the Advanced Intra Coding mode.

As illustrated in Figure 11, one of three prediction options can be signaled: DC only, vertical DC & AC, or horizontal DC & AC. In the DC-only option, only the DC coefficient is predicted, usually from both the block above and the block to the left, unless one of these blocks is not in the same picture segment or is not an intra block. In the vertical DC & AC option, the DC coefficient and the first row of AC coefficients are vertically predicted from those of the block above. Finally, in the horizontal DC & AC option, the DC coefficient and the first column of AC coefficients are horizontally predicted from those of the block to the left. The option that yields the best prediction is applied to all blocks of the subject intra macroblock. The difference coefficients, obtained by subtracting the predicted DCT coefficients from the original ones, are then quantized and scanned differently depending on the selected prediction option. Three scanning patterns are used: the basic zigzag scan for DC-only prediction, the alternate-vertical scan (as in MPEG-2) for horizontally predicted blocks, and the alternate-horizontal scan for vertically predicted blocks. The main part of the standard employs the same VLC table for coding all quantized coefficients. However, this table is designed for inter macroblocks and is not very effective for coding intra macroblocks, in which larger coefficients with smaller runs of zeros are more common. Thus, the Advanced Intra Coding mode employs a new VLC table for encoding the quantized coefficients, one that is optimized for the global statistics of intra macroblocks.

2.4.1.3 Deblocking Filter Mode (DF)

This mode introduces a deblocking filter inside the coding loop. Unlike in post-filtering, predicted pictures are computed based on filtered versions of the previous pictures.
A filter is applied to the edge boundaries of the four luminance and two chrominance 8x8 blocks. The strength of the filter's coefficients depends on the quantizer step size of the macroblock, with stronger filtering used for a coarser quantizer. This mode also allows the use of four motion vectors per macroblock and allows motion vectors to point outside picture boundaries. These techniques, together with the filtering, result in better prediction and a reduction in blocking artifacts.

2.4.1.4 Slice Structured Mode (SS)

A slice structure, instead of a GOB structure, is employed in this mode. This allows the subdivision of the picture into segments containing variable numbers of macroblocks. A slice consists of a slice header followed by consecutive complete macroblocks. Two additional submodes can be signaled to reflect the order of transmission (sequential or arbitrary) and the shape of the slices (rectangular or not). These add flexibility to the slice structure so that it can be tailored to different environments and applications.

2.4.1.5 Supplemental Enhancement Information Mode (SEI)

In this mode, supplemental information is included in the bit stream in order to offer display capabilities within the coding framework. This information includes support for picture freeze, picture snapshot, video segmentation, progressive refinement and chroma keying [25]. These options are aimed at providing decoder-supported features and functionalities within the bit stream. For example, such options facilitate interoperability between different applications within the context of windows-based environments.

2.4.1.6 Improved PB-frames Mode (IPB)

This mode is an enhanced version of the H.263 PB-frames mode.
The main difference is that the H.263 PB-frames mode allows only bi-directional prediction of the B-pictures in a PB-frame, whereas the Improved PB-frames mode permits forward, backward and bi-directional prediction. Bi-directional prediction is the same in both modes, except that, in the Improved PB-frames mode, no delta vector is transmitted. In forward prediction, the B-macroblock is predicted from the previous P-macroblock, and a separate motion vector is transmitted. In backward prediction, the predicted macroblock is equal to the future P-macroblock, and therefore no motion vector is transmitted. The additional forward and backward predictors make the Improved PB-frames mode less susceptible to significant changes that may occur between pictures.

2.4.1.7 Reference Picture Selection Mode (RPS)

In baseline H.263, a picture is predicted from the previous picture. If part of the subject picture is lost due to channel errors or packet loss, the quality of future pictures can be severely degraded. Using this mode, it is possible to select the reference picture for prediction in order to suppress temporal error propagation due to inter coding. The information specifying the selected reference picture is included in the encoded bit stream.

2.4.1.8 Temporal, SNR, and Spatial Scalability Mode

This mode specifies syntax to support temporal, SNR and spatial scalability. Scalability means that a bit stream consists of a separately decodable base layer and associated enhancement layers. This structure is especially desirable for error-prone and heterogeneous environments, to counter limitations such as constraints on bit rate, display resolution, network throughput and decoder complexity. Temporal scalability provides a mechanism for enhancing perceptual quality by increasing the picture display rate.
This is achieved via bi-directionally predicted B-pictures, inserted between two P-pictures and predicted from either one or both of them. SNR scalability is achieved by using a finer quantizer to encode the difference picture in an enhancement layer. This additional information increases the SNR, and thus the quality, of the overall reproduced picture. Spatial scalability is closely related to SNR scalability; the only difference is that spatial scalability provides an increased spatial resolution in the enhancement layer. Spatial scalability allows the creation of multi-resolution bit streams to meet varying display requirements and constraints for a wide range of applications.

2.4.1.9 Reference Picture Resampling Mode (RPR)

This mode describes an algorithm for warping the reference picture prior to its use for prediction. It can be useful for resampling a reference picture that has a different source format than the picture being predicted. It can also be used for global motion estimation, or for estimation of rotating motion, by warping the shape, size and location of the reference picture.

2.4.1.10 Reduced Resolution Update Mode (RRU)

This mode allows the encoder to send update information for a picture encoded at a lower resolution, while still maintaining a higher resolution for the reference picture, to create a final image at the higher resolution. This is most useful in the case of movement over picture boundaries, motion of large objects, and highly active motion scenes with detailed backgrounds.

2.4.1.11 Independently Segmented Decoding Mode (ISD)

In this mode, picture segment boundaries are treated as picture boundaries, in the sense that no data dependencies across segment boundaries are allowed. This includes motion vector estimation and texture operations across segment boundaries. Use of this mode prevents the propagation of errors, thus providing enhanced error resilience and recovery capabilities.
2.4.1.12 Alternative Inter VLC Mode (AIV)

Large quantized coefficients and small runs of zeros, typically present in intra blocks, become more frequent in inter blocks when small quantizer step sizes are used. When this mode is enabled, the intra VLC table designed for encoding quantized intra DCT coefficients in the Advanced Intra Coding mode can also be used for inter block coding.

2.4.1.13 Modified Quantization Mode (MQ)

In H.263, the modification of the quantizer value at the macroblock level is limited to a small adjustment (+/-1 or +/-2) of the most recent quantizer value. The Modified Quantization mode allows the quantizer to be changed to any value, and thus gives rate control methods more flexibility. Also, this mode increases chrominance quality significantly by using a smaller quantizer step size for the chrominance blocks than for the luminance blocks. In H.263, when a quantizer smaller than 8 is employed, quantized coefficients exceeding the representable range of [-127, +127] are clipped. The Modified Quantization mode allows the representation of coefficients outside this range, improving picture quality at high bit rates.

2.4.2 Fast Motion Vector Search Algorithm

In the H.263+ Test Model, the ITU-T suggests the use of one of two motion vector search algorithms: full search, or a fast search algorithm that was developed at UBC [8]. The fast search algorithm results in an encoder that is approximately 5 times faster than one using the full search algorithm, while maintaining similar PSNR performance levels. The fast search algorithm proceeds by sequentially searching diamond-shaped layers, each of which contains the four immediate neighbors of the current search center. Layer i+1 is then centered at the point of minimum MAE, also referred to as the Sum of Absolute Differences (SAD), of layer i.
Thus successive layers have different centers and contain at most four untested candidate motion vectors. The search is stopped only when the following two conditions are met:

1. All candidate motion vectors in the current layer have been considered.
2. The minimum SAD value of the current layer is larger than that of the previous layer.

2.5 Intel's MMX Architecture

The computational demands of many multimedia and communication applications are enormous, due mainly to the large amount of data to be processed and the compute-intensive repetitive operations of the underlying processing algorithms. Recently, Intel announced its Multimedia Accelerator (MMX), an extension to the current Intel Architecture (IA) that improves the performance of communication, signal processing and multimedia applications. After analyzing a wide range of software applications, including graphics, MPEG video, music synthesis, speech compression, image processing, games, speech recognition and videoconferencing, Intel scientists and others determined that, although the applications are different, the underlying compute-intensive routines have common characteristics [36], [37]. These characteristics include:

> Small native data types (for example, 8-bit pixels, 16-bit pixels)
> Localized, recurring operations
> A large amount of inherent parallelism.

Taking these characteristics into account, Intel's hardware architects designed a Single Instruction Multiple Data (SIMD) architecture in which one instruction performs the same operation on multiple data elements simultaneously. For most applications, this architecture provides an average processing speed increase of 150%-200% compared to the same applications run on the same processor without MMX.
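The layered search and its two stopping conditions can be sketched as follows. This is an illustrative fragment (the function name and the `sad` callback are ours): a real encoder would evaluate the SAD over 16x16 blocks, start from the predicted motion vector, and restrict candidates to the search range; the non-improvement test is written as `>=` so that the sketch also terminates on ties.

```python
def fast_diamond_search(sad, start=(0, 0)):
    """Layered fast search: each layer holds the four immediate
    neighbors of the current center; the next layer is centered at the
    layer's minimum-SAD point.  `sad` maps a candidate (dx, dy) to its
    SAD value.  Returns (best_vector, best_sad)."""
    center, best = start, sad(start)
    while True:
        # Condition 1: consider all four candidates of the current layer.
        layer = [(center[0] + dx, center[1] + dy)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        winner = min(layer, key=sad)
        # Condition 2: stop when the layer does not improve on the
        # previous minimum.
        if sad(winner) >= best:
            return center, best
        center, best = winner, sad(winner)
```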
2.5.1 MMX Data Types, Features and Instructions

Intel's MMX instruction set can be seen as an extension of the Intel Architecture (IA) instruction set. Besides providing 57 new instructions, MMX introduces four new data types.

2.5.1.1 MMX Data Types

In conventional Intel architecture processors, when 8- or 16-bit data is processed, the existing 32- or 64-bit CPU bandwidth and processing resources are underutilized: the ALU, multiplier and other units modify only the low-order 8 or 16 bits, and the higher-order bits are unused. In order to eliminate this inefficiency, Intel's architects developed an architecture in which small data elements can be processed independently but simultaneously, thereby utilizing all the CPU processing units. The architecture is based on a maximum data width of 64 bits, as the Pentium and P6 generations of processors use 64-bit wide busses. Figure 12 shows the new data types supported by the MMX architecture: three packed data types (packed byte, packed word and packed doubleword) and a 64-bit quadword. Note that 64 bits can represent one quadword, two packed doublewords, four packed words or eight packed bytes [21]. In addition, the MMX architecture defines eight new 64-bit registers, each of which can be directly addressed using the register names MM0 to MM7.

Figure 12. MMX architecture data types: packed byte (eight 8-bit elements), packed word (four 16-bit elements), packed doubleword (two 32-bit elements) and quadword (one 64-bit element).

2.5.1.2 Saturation Arithmetic

One of the important features of MMX instructions is saturation arithmetic. In regular fixed-point arithmetic, when an operation overflows or underflows, the most significant bit is lost. For example, the addition of two unsigned 16-bit numbers residing in 16-bit registers may result in an unsigned 17-bit value.
Naturally, this number is too large to be represented in a 16-bit register: the low-order 16 bits of the result appear in the register, while the seventeenth bit does not fit and is therefore lost. There is usually a status flag that indicates overflow or underflow by being set to '1'. Unless special attention is paid, this kind of behavior may cause serious problems, especially in graphics applications. In saturation arithmetic, however, when such an overflow or underflow occurs, instead of generating a 17-bit number and losing the most significant bit, the corresponding instruction clamps the result to the largest unsigned number that can be represented in 16 bits, i.e., FFFF in hexadecimal. MMX supports two types of saturation arithmetic:

> Unsigned saturation: in the case of underflow, the register is set to 0, and in the case of overflow, the register holds the largest representable number, i.e., 2^n - 1, where n is the number of bits.
> Signed saturation: if underflow occurs, the register takes the smallest representable number, i.e., -2^(n-1); if overflow occurs, the register takes 2^(n-1) - 1 (n: number of bits).

In the MMX architecture, saturation arithmetic is not a mode that is activated by setting a control bit. Instead, certain instructions inherently perform saturation during their operation. For example, some MMX add instructions employ conventional wrap-around arithmetic while others employ saturation arithmetic.

2.5.1.3 MMX Instructions

MMX adds a new set of instructions that perform parallel operations on multiple data elements packed into a 64-bit register. Since most multimedia applications use 16-bit data, the new instruction set is optimized mostly for the 16-bit data type. The 8-bit (byte) data type is supported by the same instructions as the 16-bit word, with the exception of the multiplication instructions.
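The two saturation behaviors can be modeled per element as follows (an illustrative fragment; the function names are ours, echoing the MMX saturating-add mnemonics PADDUSW and PADDSW):

```python
def padd_usw(a, b, n=16):
    """Unsigned saturating add of two n-bit values (cf. PADDUSW):
    results above 2**n - 1 clamp to 2**n - 1 instead of wrapping."""
    return min(a + b, (1 << n) - 1)

def padd_sw(a, b, n=16):
    """Signed saturating add (cf. PADDSW): results are clamped to the
    signed n-bit range [-2**(n-1), 2**(n-1) - 1]."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(a + b, hi))
```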
Limited support is provided for 32-bit doubleword data. The MMX instructions are summarized in the following sections.

2.5.1.3.1 Packed addition and subtraction with optional saturation

These instructions exist for the byte, word and doubleword types. Each individual add or subtract operation is independent of the others and takes place in parallel. The result can use wrap-around, unsigned saturation or signed saturation.

2.5.1.3.2 Packed multiplications

The multiplication instructions support 16-bit word data. There are two types of multiplication instructions. The first performs four 16-bit x 16-bit multiplies and lets the user choose the low- or high-order half of each 32-bit product; both the inputs and the results are thus packed 16-bit data. The second type is a multiply-accumulate operation. This instruction takes packed 16-bit operands and returns a packed 32-bit result: it multiplies the respective elements of its two sources, generating four 32-bit products, then adds each pair of adjacent products to produce the two final doublewords, as illustrated in Figure 13. This instruction therefore performs four multiplies and two 32-bit accumulations in a single instruction.

Figure 13. Multiply-add word to doubleword: (A3, A2, A1, A0) x (B3, B2, B1, B0) yields (A3xB3 + A2xB2, A1xB1 + A0xB0).

2.5.1.3.3 Packed compare instructions

These instructions compare eight packed bytes, four 16-bit words or two 32-bit doublewords simultaneously. The result is a mask of ones and zeros for each element, depending on whether the condition is true or false. For example, a greater-than compare of the words (45, 23, 8, 11) with (22, 64, 41, 2) yields the mask (111...1, 000...0, 000...0, 111...1), as shown in Figure 14.

Figure 14. Compare words (greater than).

2.5.1.3.4 Packed shift instructions

There are two types of MMX shift instructions.
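The multiply-add operation of Figure 13 can be modeled as follows (an illustrative fragment; the function name echoes the MMX mnemonic PMADDWD, and saturation of the 32-bit sums is ignored). This is the operation that maps naturally onto the inner-product loops of the DCT and SAD computations.

```python
def pmaddwd(a, b):
    """Model of the MMX multiply-add (PMADDWD): a and b each hold four
    16-bit words; multiply element-wise, then sum adjacent products to
    form two 32-bit doublewords."""
    products = [x * y for x, y in zip(a, b)]
    return (products[0] + products[1], products[2] + products[3])
```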
The first type consists of the usual arithmetic right, logical right and logical left shift operations, performed on four 16-bit words or two 32-bit doublewords. The second type of shift operation is performed on the quadword and can be either a logical left shift or a logical right shift.

2.5.1.3.5 Conversion instructions

These instructions perform conversions between the packed data types. This is especially important when an algorithm needs higher precision to store its intermediate results. The unpack instruction converts a lower-precision data type into a higher-precision one. This instruction also serves as an interleaved merge operation, as shown in Figure 15, which is very useful in applications such as matrix transposition, conversion between color planes and interpolation. The unpack operation proper is the special case of this instruction in which one of the operands is filled with zeros: in Figure 15, if the B register is filled with zeros, bytes are unpacked into words.

Figure 15. Unpacking a byte to a word: interleaving the low halves of registers B = (B7, ..., B0) and A = (A7, ..., A0) produces (B3, A3, B2, A2, B1, A1, B0, A0).

2.5.1.3.6 Logical instructions

Logical instructions operate on the 64-bit quadword. The operations supported are AND, OR, XOR and ANDNOT.

2.5.1.3.7 Memory transfer instructions

The memory transfer instructions move data between memory and the MMX registers, as well as between two MMX registers. These instructions operate on a 64-bit quadword or on a 32-bit doubleword. When a 32-bit doubleword is used, the low-order 32 bits of the 64-bit MMX register are used for the transfer.

2.5.1.3.8 Empty MMX State operation

This instruction is used to maintain compatibility with the conventional Intel architecture; it is covered in more detail in the "MMX Registers and IA Floating-Point Registers" section.
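The interleaved merge of Figure 15 can be modeled as follows (an illustrative fragment; the function name is ours, and elements are listed high-to-low as in the figure):

```python
def interleave_low(a, b):
    """Interleaved merge of Figure 15: a and b are eight-element tuples
    listed high-to-low, (X7, ..., X0); the low four elements of each are
    interleaved to give (B3, A3, B2, A2, B1, A1, B0, A0)."""
    out = []
    for i in range(4, 8):        # indices of elements X3 .. X0
        out += [b[i], a[i]]
    return tuple(out)
```

Filling `b` with zeros reproduces the plain unpack: each byte of `a` is widened to a word with a zero high half.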
2.5.2 Backward Compatibility

Besides speeding up multimedia applications, it is also very important that MMX processors retain backward compatibility with the existing architectures. All existing software and operating systems that run on Intel processors have to run on MMX-based processors without modification. Moreover, if MMX is supported, software programs should be able to detect it and make use of the added capabilities. Software developers can employ the "CPUID" instruction in order to detect the existence of MMX technology within the processor. Executing the "CPUID" instruction sets a bit in the result, and thus, during run-time, a software program can query the microprocessor to determine whether MMX is supported.

Compatibility with software that is written for the conventional IA cannot be achieved easily when it comes to operating systems that support a multitasking programming environment. New applications that use MMX instructions should be able to multitask with any other applications. During multitasking, the operating system performs context-switching, i.e., it saves all the registers in memory and restores them the next time that task gets to run. This places a constraint on MMX, in the sense that it cannot introduce new registers, new modes or new states. Otherwise, operating systems that are not designed for MMX-based architectures would not know about these new registers or modes, and would not be able to save them during context-switching. Intel's solution is to hide the MMX registers and states in the existing floating-point registers and states.

2.5.2.1 MMX Registers and IA Floating-Point Registers

For software transparency, an existing operating system views the MMX instructions as an extension of the floating-point instructions.
This is achieved by using the existing IA floating-point register space as temporary storage for MMX data, as shown in Figure 16. MMX instructions write values to the low order 64 bits of the 80-bit floating-point registers. When an MMX instruction accesses a register, the remaining 16 bits are set to "1". If a floating-point instruction accesses these registers at that point, it will see the values in the registers as NaN (Not a Number).

Figure 16. Mapping the MMX registers MM0-MM7 to the floating-point registers and their tag fields.

Sharing the registers with the floating-point unit introduced some design problems. One problem was the stack architecture of the floating-point registers. Since random access to the MMX registers is needed, the stack architecture is replaced by one that provides random access to the registers. Another problem was that an application may require the use of both the MMX registers and the floating-point registers. An application can use both of these register sets as long as it does not use them simultaneously. Partitioning the floating-point and MMX code into long execution blocks makes transition events infrequent and simple to handle.

The IA defines a tag field for each floating-point register. Each of these fields indicates whether the corresponding register is empty, full or a special case of a floating-point value, e.g., NaN. When a new value is pushed into a floating-point register, the corresponding tag field is set. When a value is popped off a floating-point register, the corresponding tag is cleared. The software convention followed by today's IA programmers is to leave the floating-point stack empty after using it. Some operating systems employ the floating-point tags to save only the valid floating-point registers during context-switching.
Random access to the MMX registers makes it difficult to reproduce the effect that the floating-point instructions have on the stack-structured floating-point tags. This problem was solved as follows: the first time an MMX instruction accesses a floating-point register, the tag bits of all the floating-point registers are set so that all the registers will be saved during a context switch. Intel also supplies an instruction to clear all the floating-point tags. After completion of each MMX code section, application programmers should use the EMMS (Empty MMX State) instruction to clear all the tags.

2.5.3 The Superscalar Architecture of MMX

The Pentium processor is an advanced superscalar processor, which means that it can handle two or more instructions simultaneously. It employs two general-purpose integer pipelines and a pipelined floating-point unit, allowing the processor to execute two integer instructions simultaneously. Moreover, a software-transparent dynamic branch-prediction mechanism is used to minimize branching-related pipeline stalls. MMX instructions were designed to run in the integer pipelines despite their use of the floating-point registers to hold data. Since MMX instructions operate on packed integer types, it makes sense to use the integer pipelines. Pentium processors can issue two integer instructions, one in each integer pipe, in every clock cycle. The first pipe is referred to as the "U" pipe and the second as the "V" pipe. During decoding of any given instruction, the next two instructions are checked, and if possible, they are issued such that the first one is executed in the U-pipe and the second in the V-pipe. If it is not possible to issue two instructions, then the next instruction is issued to the U-pipe and none is issued to the V-pipe. In this case, the second instruction waits for the next cycle and becomes the first instruction in the next instruction pair.
In a Pentium processor with MMX, two integer instructions, one integer instruction and one MMX instruction, or two MMX instructions can be executed simultaneously. MMX instructions, with the exception of the multiply operation, execute in one cycle. The multiply instruction completes in 3 cycles, but because of the pipelined architecture, it is possible to issue one multiply instruction in every clock cycle.

2.5.3.1 MMX Pipeline Structure

The MMX pipeline structure adds additional stages to the conventional Intel architecture pipeline, as shown in Figure 17. The basic integer pipeline structure of a Pentium processor consists of the following stages:

• PF : Instruction prefetch
• IF : Instruction fetch
• D1 : Instruction decode
• D2 : Instruction pairing and dispatch
• E : Execution and memory access
• WB : Write back

Figure 17. Pentium and MMX technology pipeline structures (showing the decoupled MMX stages, the MMX stages integrated in the integer pipeline, and the integer-only stages).

The pipeline structure for MMX instructions is different from that of the Pentium integer instructions. The stages of the MMX pipeline are given below:

• PF stage: Prefetches instructions.
• F stage: The prefetched code bytes are parsed into instructions. The prefixes are decoded, and up to two instructions are pushed into the FIFO. Two MMX instructions can be pushed if each of the instructions is less than or equal to 7 bytes in length.
• D1 stage: Integer, floating-point and MMX instructions are decoded at this stage.
• D2 stage: Source values are read.
• E stage: The instruction is committed for execution.
• Mrw stage: Operands are read from memory.
• Mex stage: Execution stage for MMX instructions: ALU, shift, pack and unpack operations are executed and completed at this stage. This is the first clock cycle of the multiply instruction.
• WM/M2 stage: The results of single-clock operations are written. This is the second stage of the multiply operation.
• M3 stage: This is the third stage of the multiply instruction.
• Wmul stage: Multiplier results are written.

Instruction parsing is decoupled from instruction decoding by an instruction First In, First Out (FIFO) buffer, which is situated between the F and Decode 1 (D1) stages, as shown in Figure 17. The FIFO has slots for up to four instructions. This FIFO does not add latency when it is empty. During every clock cycle, two instructions can be pushed into the instruction FIFO (depending on the availability of code bytes, and on other factors such as prefixes). Instruction pairs are pulled out of the FIFO into the D1 stage. Since the average rate of instruction execution is less than two instructions per clock cycle, the FIFO is normally full. As long as the FIFO is full, it can buffer any stalls that may occur during instruction fetch and parsing. If such a stall occurs, the FIFO prevents the stall from propagating to the execution stage of the pipe. If the FIFO is empty, then an execution stall may result from the pipeline being "starved" for instructions to execute. Stalls at the FIFO entrance may result from long instructions or prefixes.

Chapter 3 Performance of H.263+ Video Coding Standard

3.1 Introduction

In order to implement a real-time H.263+ encoder, it is first necessary to implement an H.263+ standard compliant encoder and decoder. At the beginning of this thesis work, the definition of the H.263+ standard was still in progress, and the UBC signal processing and multimedia group agreed to contribute to this work by providing a publicly available official ITU-T standard compliant software encoder/decoder (codec) [45]. Our software codec is based on a publicly available H.263 codec software which was developed by a company named Telenor from Norway.
The H.263+ became an International Standard in January 1998, and our H.263+ encoder/decoder software is now very well known and widely referenced within the video coding community.

3.2 Performance Levels of Individual Modes

In this chapter, simulation results based on our implementation of an H.263+ encoder are presented. The results illustrate the tradeoffs between compression performance, complexity, encoding/decoding speed and memory requirements of each of the implemented modes: Advanced Intra Coding mode, Deblocking Filter mode, Improved PB-frames mode, Alternative Inter VLC mode, and Modified Quantization mode. Complexity and performance results on the scalability and error resilience/recovery modes are discussed in [7] and [51] respectively.

The average peak signal-to-noise ratio (PSNR) of all encoded pictures is used as a measure of objective quality, and is given by

PSNR = 10 log10 [ 255^2 / ( (1/M) sum_{i=1..M} (o_i - r_i)^2 ) ] ,    (6)

where M is the number of samples and o_i and r_i are the amplitudes of the original and reconstructed pictures respectively. The test sequences that are used in the simulations have QCIF resolution and consist of 300 frames. Unless otherwise specified, rate control strategies are not employed. Instead, a quantizer step size is fixed for an entire sequence. Rate-distortion graphs are obtained by selecting different values for the quantizer step size. Also, unless otherwise indicated, all results are obtained using the fast search motion estimation implementation.

3.2.1 Advanced Intra Coding mode (AIC)

This mode significantly improves compression of intra macroblocks. Prediction lowers the number of bits required to represent the quantized DCT coefficients, while quantization without a dead zone improves the picture reproduction quality. This is illustrated in Figure 18, which presents the coding results for the first intra picture (i.e., where all the macroblocks are intra coded) of the Y component of the video sequence AKIYO. Compression performance improvements of 15-25% are achieved. However, the Advanced Intra Coding mode only improves the compression performance of intra coded macroblocks. Thus, negligible compression improvements are achieved for low activity video sequences, where most macroblocks are inter coded.

Figure 18. Advanced Intra Coding Rate-Distortion Performance for the AKIYO sequence.

Based on our implementation, the associated encoding time increases by 5% on average, due to the prediction method selection operations. This mode requires slightly more memory to store the reconstructed DCT coefficients, which are needed for intra prediction. The increase in decoding time is negligible, as only a few additions are required to predict an intra coded macroblock.

3.2.2 Deblocking Filter mode (DF)

The Deblocking Filter mode improves subjective quality by removing the blocking and mosquito artifacts common to block-based video coding at low bit rates. The effects of the deblocking filter are more pronounced when combined with the post filter described in the TMN-8 model [26]. This post filter is usually present at the decoder and is outside the coding loop. Therefore, prediction is not based on the post filtered version of the picture. To illustrate the improvement in subjective quality, the sequence FOREMAN was encoded using both the deblocking and post filters at 24 kbps and 10 fps. Figure 19 shows the reconstructed image for picture number 100. The Deblocking Filter mode allows the use of four motion vectors per macroblock. This requires additional motion estimation, increasing the computational load, and thereby resulting in a 5-10% additional encoding time.

Figure 19.
Reconstructed image for FOREMAN picture number 100 (a) without and (b) with the Deblocking Filter mode and post filter set on.

3.2.3 Improved PB-frames mode (IPB)

The PB-frames mode of H.263 can double the picture rate without significantly increasing the bit rate. The increase in bit rate is small due mainly to the bits saved by coarser quantization of B-macroblocks. While this causes the B-picture to have a lower quality than the P-picture, the increased temporal resolution results in much better overall subjective quality. The PB-frames mode provides good compression performance levels, especially for low motion video sequences. However, since only bi-directional prediction is used for the B-picture of a PB-frame, when irregular motion is present in the video sequence, the quality of the B-picture decreases considerably. The Improved PB-frames mode of H.263+ addresses this problem by allowing forward only or backward only prediction, in addition to bi-directional prediction, of B-macroblocks. While the prediction of B-pictures in this mode is not as effective as that of true B-pictures, it does make the H.263+ Improved PB-frames mode more robust than the H.263 PB-frames mode.

Figure 20. Improved PB-frames mode: Rate-Distortion performance for FOREMAN (P- and B-picture curves for the baseline, the PB-frames mode and the Improved PB-frames mode).

Our simulation results show that a significant improvement in PSNR is achieved as compared to the H.263 PB-frames mode when an active video sequence is coded, as illustrated in Figure 20. The figure shows the rate-distortion performance levels achieved by the H.263 baseline, the H.263 PB-frames mode and the H.263+ Improved PB-frames mode for the active video sequence FOREMAN at 10 frames per second (fps).
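The PSNR values plotted in these rate-distortion figures follow equation (6). A minimal sketch of the computation (the function name and the flat-sample-list interface are ours, for illustration only):

```python
import math

def psnr(original, reconstructed):
    # average PSNR per equation (6): 10*log10(255^2 / MSE), where the
    # MSE is the mean squared error over M 8-bit picture samples
    m = len(original)
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / m
    if mse == 0:
        return float('inf')  # identical pictures
    return 10 * math.log10(255 ** 2 / mse)
```

For instance, a mean squared error of 1 over 8-bit samples corresponds to roughly 48.13 dB, while typical low bit rate coding results, as in the figures above, fall in the 28-40 dB range.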
On the other hand, the PSNR gain over the H.263 PB-frames mode is smaller for video sequences that have moderate motion. The H.263+ Improved PB-frames mode increases the encoder/decoder complexity requirements. The complexity of the Improved PB-frames mode is slightly larger than that of the PB-frames mode due to the additional prediction modes. The computational load associated with the H.263+ Improved PB-frames mode is also usually larger than that of the H.263 PB-frames mode due to the forward prediction method of the Improved PB-frames mode.

Like the H.263 PB-frames mode, the H.263+ Improved PB-frames mode requires more memory at both the encoder and the decoder because of the need to store two pictures in memory. There is also a one frame delay associated with both of the above modes. This may present a problem in real-time applications.

3.2.4 Alternative Inter VLC mode (AIV)

This mode allows the intra macroblock quantized DCT coefficient VLCs of the Advanced Intra Coding mode to be used for some inter coded blocks. This mode of operation is useful at high bit rates, when short runs of zeros and large coefficient values are present, as the Advanced Intra Coding mode run-length VLCs are designed for such statistics. The best results for this mode are obtained when fine quantizers are used, as can be seen in Table 2. At very high bit rates, bit savings of as much as 10% can be achieved. The added complexity that this mode introduces is negligible, especially in software applications. In fact, less than 2% additional encoding/decoding time is usually required.

Sequence   Quantizer Step Size   Y PSNR   Bits (w/o mode)   Bits (AIV mode)   Bit savings
AKIYO      4                     43.79    9354              8891              463 (5%)
           8                     39.47    4128              4073              55 (1%)
           12                    36.94    2411              2405              6 (0%)
FOREMAN    4                     41.41    47659             44118             3541 (7%)
           8                     37.15    22036             21175             861 (4%)
           12                    34.73    13498             13201             297 (2%)
           16                    33.12    9434              9320              114 (1%)

Table 2.
Average bit savings for inter coded pictures with the Alternative Inter VLC mode.

3.2.5 Modified Quantization mode (MQ)

To fully illustrate the capabilities of this mode, the TMN-8 [26] rate control method is used for the simulations in this section. Figure 21 (a) shows the chrominance PSNR performance for the video sequence FOREMAN with and without the Modified Quantization mode enabled. From this figure, it is clear that the chrominance PSNR increases substantially at low bit rates. Naturally, this causes a drop in luminance PSNR, as fewer bits remain to represent the luminance coefficients. However, this drop is rather insignificant, and the overall PSNR performance is usually improved. Figure 21 (b) shows that the overall PSNR performance is indeed higher when the Modified Quantization mode is enabled. The Modified Quantization mode adds very little computation time and complexity to the coder.

Figure 21. (a) Chrominance and (b) average PSNR versus rate for the Modified Quantization mode for FOREMAN.

3.3 Computation Times and Compression Performance Improvements of Individual Modes

Figure 22 illustrates the added encoding computation times of the individual implemented H.263 and H.263+ optional modes. The results were obtained by encoding 300 frames of the video sequence FOREMAN at 64 kbps and 10 fps on a Pentium 200 MHz computer. The new TMN-8 rate control method was used for these simulations.

Figure 22. Encoding times for the H.263 and H.263+ modes (full search and fast search ME) for FOREMAN at 64 kbps on a 200 MHz PC.
Additional computational resources required by the H.263+ modes are negligible in our software decoder implementation, and real-time decoding of an H.263+ compliant bit stream can be supported. The encoding time of the H.263 baseline coder with any individual H.263 or H.263+ mode enabled is at most 15% larger than that of the baseline. Setting the PB-frames mode on results in a reduction in encoding time, since only a restricted motion estimation operation is performed for the B-picture of a PB-frame. Encoding time is at most half of that of full search motion estimation when the fast search method is used (except for the H.263 PB-frames mode).

                     Low Bit Rate                     High Bit Rate
Mode                 Foreman   Silent    Akiyo       Foreman   Silent    Akiyo
                     32 kbps   24 kbps   8 kbps      128 kbps  96 kbps   32 kbps
No mode              30.44     32.74     33.9        35.83     38.84     39.26
UMV                  +0.61     +0.01     -0.01       +0.64     +0.01     -0.04
SAC                  +0.08     +0.04     -0.12       +0.06     +0.13     +0.08
AP                   +0.28     +0.06     +0.19       +0.58     +0.13     +0.33
PB   B-pictures      -0.57     -0.05     +0.51       -1.64     -0.69     +0.61
     P-pictures      +0.11     +0.5      +0.6        +0.37     +0.41     +1.05
AIC                  +0.04     +0.11     +0.14       0         +0.07     -0.03
DF                   +0.18     +0.12     -0.24       +0.48     -0.02     -0.23
IPB  B-pictures      -0.17     +0.19     +0.54       -1.21     -0.27     +0.74
     P-pictures      +0.36     +0.62     +0.61       +0.62     +0.66     +1.12
AIV                  +0.02     +0.02     -0.02       +0.06     +0.23     +0.01
MQ                   +0.4      +0.2      +0.4        +0.2      +0.03     -0.05

Table 3. Summary of improvement in PSNR (dB) for H.263 and H.263+ individual modes at low bit rates and high bit rates.

A summary of the compression performance improvements resulting from the use of individual modes is given in Table 3. Results are presented for low and high bit rates using three QCIF video sequences at 10 fps: an active video sequence, FOREMAN, a sign language video sequence, SILENT, and a typical low motion videophone sequence, AKIYO. It can be observed that a given mode is not always suitable for every bit rate and/or every sequence. For example, the Alternative Inter VLC mode achieves compression gains only at high bit rates.
The Deblocking Filter mode usually yields a decrease in PSNR, but the resulting subjective picture quality is generally much better. However, this mode may result in excessive blurriness at very low bit rates. Another observation is that the Modified Quantization mode does not lead to compression gains at high bit rates for low motion sequences, as the extended quantized coefficient range and the finer chrominance quantization are rarely used. Finally, the Unrestricted Motion Vector mode shows PSNR improvements for sequences with motion across picture boundaries (in FOREMAN, for example), or at CIF and larger resolutions.

3.4 Compression Performance and Complexity of Mode Combinations

With the large number of possible mode combinations, it becomes difficult for implementers to select combinations that are suitable for their applications. The ITU-T video experts group decided to include non-normative mode combinations as guidelines for implementers [25]. The recommended mode combinations are based on the performance of the individual modes. The performance criteria are the improvement in subjective quality, the impact on delay, and the additional complexity, computation, and memory demands. The Level 1 preferred combination of modes includes the Advanced Intra Coding, Deblocking Filter, Supplemental Enhancement Information (Full Frame Freeze only) and Modified Quantization modes. The Level 2 preferred combination of modes includes, in addition to the Level 1 modes, the Unrestricted Motion Vector, Slice Structure, and Reference Picture Resampling modes. Finally, the Level 3 preferred combination of modes includes the Level 2 and Level 1 preferred modes and the Advanced Prediction, Improved PB-frames, Independent Segment Decoding and Alternative Inter VLC modes.

In our experiments, an error-free environment is assumed.
Thus, the H.263+ modes involving error resilience, although part of the preferred mode combinations, are excluded here. Similarly, the Full Frame Freeze mode is not included in our simulations, as it provides enhanced display capabilities but does not impact performance. Also, the Reference Picture Resampling mode is not considered here, since it is currently not available in [45].

                                     Fast ME                            Full ME
AKIYO, 8 kbps                        P-pictures  B-pictures  time       P-pictures  B-pictures  time
Baseline                             33.9 dB     N/A         16.9 sec   33.9        N/A         41 sec
AIC+DF+MQ (Level 1)                  +0.74       N/A         +28%       +0.74       N/A         +25%

FOREMAN, 128 kbps                    P-pictures  B-pictures  time       P-pictures  B-pictures  time
Baseline                             35.83       N/A         25.3 sec   35.83       N/A         54.9 sec
AIC+DF+MQ+UMV+AP+AIV                 +1.05       N/A         +34%       +1.1        N/A         +21%
AIC+DF+MQ+UMV+AP+AIV+IPB (Level 3)   +1.52       -0.42       +15%       +1.53       -0.27       +19%

Table 4. Mode combinations for FOREMAN and AKIYO.

Table 4 presents results for the Level 1 and Level 3 mode combinations for the video sequences AKIYO and FOREMAN, respectively. Based on our experiments, using a higher level of mode combinations provides better compression performance, especially for highly active video sequences such as FOREMAN, by as much as 1.5 dB at high bit rates. Moreover, note that the encoding time is, at most, approximately 25% greater, even for the Level 3 combinations of modes.

Chapter 4 Efficient Low Bit Rate Video Coding Algorithms

4.1 Introduction

In this chapter, we propose efficient platform independent coding algorithms and techniques that significantly improve the performance of the most computationally intensive components of video coding. The proposed algorithms include zero block prediction prior to the DCT, partial IDCT computations, fast half pixel motion estimation, quantization with look-up tables and partial SAD computations. There has been a significant effort in both industry and academia towards developing efficient video coding algorithms to improve the performance of video encoding and decoding.
Most of these efforts have concentrated on developing efficient methods for motion vector estimation and DCT/IDCT computation. However, it is possible to improve the existing algorithms by exploiting specific statistical properties of coarsely quantized, slowly varying, low resolution sequences.

The main focus of this thesis is to implement a real-time H.263+ encoder in software. At this point, it is important to define the parameters of real-time video encoding. In H.263+, besides custom picture frequencies, there are two predefined input video frequencies: 25 Hz and 30 Hz. However, it is currently not possible for a software implementation on a single general purpose processor to support encoding at these frequencies. Very low picture frequencies, such as a couple of frames per second, would not provide acceptable quality in applications such as video telephony and video conferencing, where video is multiplexed with speech. For such applications, an acceptable rate would be at least 10 frames per second at QCIF (176 x 144) resolution. There are commercially available software implementations that can encode at this rate and resolution. Many of these implementations employ less computationally intensive encoding algorithms that attain lower compression levels, such as motion JPEG or H.261. However, some applications, including wireless video telephony, may require bit rates as low as 8 kbits/sec, which can only be achieved, while maintaining acceptable reproduction quality, using more advanced video compression algorithms. In turn, these compression algorithms, such as those compliant with H.263+, are computationally more expensive. The block diagram of our H.263+ encoder software implementation is given in Figure 23. For simplicity, negotiable mode branches are not shown on the diagram.
As illustrated in the figure, the integer pixel and half pixel accuracy motion vector searches and the interpolation are performed on the whole picture at a time, whereas the DCT, IDCT, quantization, dequantization and Variable Length Coding (VLC) are performed at the macroblock level. This way, we can start transmitting part of the bit stream even before the encoding of the complete frame is finished.

Figure 23. Block diagram of the H.263+ encoder implementation (frame reading and interpolation of the previous frame, integer and half pixel motion vector search, followed by a per-macroblock loop of DCT, quantization, zigzag scan and VLC into the bit stream, with dequantization, IDCT, image reconstruction and motion compensated prediction).

Table 5 shows the relative computational costs of some of the H.263+ encoder functions, obtained by profiling the software with the fast motion vector search algorithm enabled and without rate control. These functions are also highlighted in Figure 23. As seen from Table 5, any speed performance improvement of the integer pixel motion estimation (ME), half pixel ME, DCT, IDCT and quantization will affect the overall encoding time significantly. Note that the integer pixel ME computations still take considerable time, although a fast search method is employed.

Function            Computational Load
Reference IDCT      25%
Integer Pixel ME    12%
Fast DCT            11%
Half Pixel ME       10%
Quantization        9%

Table 5. Relative computational costs of the most computationally intensive H.263+ encoder functions prior to optimization.

4.2 DCT

The DCT is one of the main building blocks of the H.263+ encoder and one of the most computationally expensive functions, since it has to be computed for every block.
Despite its small implementation complexity, the number of operations involved in calculating the DCT is significant. For example, a 2-D DCT of an 8x8 block requires 2048 floating-point multiplications and 1792 additions. A number of fast DCT algorithms have been developed to greatly reduce the number of operations [1], [6], [11], [27], [43]. Most of these algorithms are variations of Lee's pioneering 1984 work [28] or are based on variants of Winograd's FFT.

To select a fast 2-D DCT algorithm suitable for our H.263+ encoder, we considered several properties of the algorithms. Since our implementation involves only software, implementation complexity was not a primary concern. However, we did require the DCT algorithm to have a regular structure, so that irregular memory access patterns, which may cause cache misses, are avoided. Many of the fast DCT algorithms aim only at reducing the number of multiplications, at the expense of increasing the number of additions. Today, most processors include fast floating-point units, and the cost of a multiplication operation is not much more than the cost of an addition or a shift operation. Hence, the total number of operations is more important than the number of multiplications for our software implementation.

Another consideration is the number of consecutive multiplications required by the fast DCT algorithm. The DCT consists of many floating-point operations. If these operations are converted to fixed-point operations, some speed-up can be achieved. Since fixed-point operations have limited accuracy, some error is accumulated after each operation. This error is much larger for a multiplication operation. Thus, it is important to minimize the number of consecutive multiplications in order to keep the accumulated error at a minimum.
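For reference, the transform that these fast algorithms accelerate can be written directly from the standard 2-D DCT-II definition. The following sketch is a straightforward textbook implementation (not the encoder's fast scaled DCT); it makes the cost of the direct approach obvious, as every one of the 64 output coefficients touches all 64 input samples:

```python
import math

def dct2_8x8(block):
    # direct 8x8 2-D DCT-II from its definition; O(N^4) work,
    # for comparison against the fast algorithms discussed above
    n = 8
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            cu = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            cv = math.sqrt(1.0 / n) if v == 0 else math.sqrt(2.0 / n)
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = cu * cv * s
    return out
```

A quick sanity check: a constant block of value 8 transforms to a DC coefficient of 64 with all AC coefficients (numerically) zero, reflecting the energy compaction that makes the subsequent quantization and run-length coding effective.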
Figure 24. Flowgraph for the fast 1-D DCT of Arai, Agui, and Nakajima [1]. Solid circles represent additions and the arrows represent negating the data. a1 = 0.707, a2 = 0.541, a3 = 0.707, a4 = 1.307, a5 = 0.383.

The selected fast DCT algorithm, which is based on the Fast Fourier Transform (FFT) and was developed by Arai, Agui and Nakajima [1], satisfies most of the above requirements. Using this algorithm, a 2-D DCT of an 8x8 block requires 144 multiplications and 464 additions. However, what makes this algorithm special are its properties for a scaled DCT implementation. In the case of the forward DCT, 64 of these multiplications can be performed after the transform, and therefore they can be combined with the quantization operation. Similarly, 64 of the IDCT multiplications can be done prior to the transform and combined with the dequantization operation. Figure 24 shows the data flowgraph of an 8-point 1-D scaled fast DCT. The IDCT flowgraph can be obtained simply by reversing the direction of the flowgraph in Figure 24.

The scaled DCT algorithm uses floating-point arithmetic. The input data to the transform is 8-bit signed data and the output is 16-bit signed data. When converting this algorithm to fixed-point arithmetic, the worst case scenario in precision loss should be considered. Even for intermediate variables, 16-bit accuracy is used, which is far from floating-point accuracy. However, ITU-T does not impose a DCT accuracy specification in the H.263+ coding standard. Since this standard is designed for low bit rate video coding, where the quantizer takes large values, this precision loss in the DCT computations is tolerable and can be viewed as a loss in the quantization process.
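The folding of the 64 post-transform multiplications into the quantizer can be sketched as follows. This is an illustrative simplification, not the exact H.263+ quantizer: it assumes uniform quantization with step 2*QP, ignores the dead zone, and takes the per-coefficient scale table as given (the entries would be the products of the 1-D output factors implied by Figure 24, omitted here). The point is that the scale factors and the division by the quantizer step merge into one multiplier table, computed once per quantizer value, so the scaling costs nothing per block:

```python
def build_multipliers(scale, qp):
    # merge the scaled-DCT output factors with uniform quantization by
    # step 2*qp; computed once per quantizer value, not once per block
    return [[s / (2 * qp) for s in row] for row in scale]

def quantize_scaled(coefs, mult):
    # a single multiply per coefficient replaces scale-then-divide
    return [[int(round(c * m)) for c, m in zip(crow, mrow)]
            for crow, mrow in zip(coefs, mult)]
```

In the encoder, `build_multipliers` would run only when the quantizer step size changes, while `quantize_scaled` runs for every block.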
Even though using fast algorithms greatly reduces the number of computations of the DCT and IDCT operations, the DCT still takes 15% and the IDCT takes 7% of the encoding time when the fast search algorithm is used for motion estimation. However, further speed performance gains can still be achieved by taking advantage of the fact that the DCT is used within a low bit rate video coding framework.

4.2.1 Zero Block Prediction Prior to DCT

Statistical properties of low bit rate video coding can be used to estimate the blocks that are not going to have any nonzero coefficients. A typical H.263+ video coder is based mostly on inter picture prediction, unlike MPEG, where intra pictures are inserted quite frequently between a number of inter pictures. Also, typical sequences used in H.263+ applications, such as video telephony or video conferencing sequences, are generally low-activity video sequences. Therefore, the prediction error of each macroblock is small. After applying the DCT to the prediction error and performing quantization, many of the blocks will be zero blocks (i.e., all their coefficient values are zero). If these blocks can be predicted prior to computing the DCT, then the corresponding DCT computations can be eliminated. In this case, the prediction operation should be repeated for each block and thus adds some overhead. The overall computation time, C, obtained after prediction of zero blocks can be calculated as

C = X (1 - α) + Y ,   (7)

where X is the DCT computation time, Y is the computation time of the prediction of zero blocks and α is the percentage of blocks which are predicted to have all zero coefficients. α can also be expressed as β × δ, where β is the percentage of blocks that have all zero coefficients and δ is the accuracy of the prediction of zero blocks.
The DCT computation time, X, is taken into account only for the non-zero blocks, whereas the computation time of the prediction of zero blocks, Y, always adds to the overall computation time, since it has to be computed for every block.

4.2.1.1 Zero Block Predictor

The computation time of the prediction of zero blocks, Y, depends on the prediction operation. The following four predictor functions were considered to exploit the properties of the block before applying the DCT:

1. Variance of the block = (1/64) ΣΣ (block[i][j] - average)² , where average = (1/64) ΣΣ block[i][j]
2. Sum of squares = (1/64) ΣΣ (block[i][j])²
3. Sum of absolute differences = (1/64) ΣΣ |block[i][j] - average|
4. Sum of absolute values = (1/64) ΣΣ |block[i][j]|

with all sums taken over i = 0..7 and j = 0..7.

Figure 25. Zero block statistics for different predictors: the number of zero and non-zero blocks plotted against the variance, the sum of squares, the sum of absolute differences, and the sum of absolute values.

Figure 25 shows the statistical results for each predictor function. The simulations were performed using the FOREMAN video sequence and a quantizer value of 13. As can be seen from the figure, there does not appear to be a deterministic method to predict zero blocks. The sum of squares and the sum of absolute values predictor functions perform better than the other two functions in distinguishing zero blocks from the non-zero ones. Other simulations that were performed using different sequences and quantizers also showed very similar characteristics.
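The fourth predictor is cheap: one absolute value and one addition per sample. A minimal C sketch, assuming 16-bit residual samples and a per-quantizer threshold such as those in Table 6 (the function and variable names are ours):

```c
#include <stdlib.h>

/* Predict whether an 8x8 residual block will quantize to all zeros.
 * Returns 1 (predicted zero block: the DCT and quantization can be
 * skipped) when the mean absolute value of the residual is at or below
 * the quantizer-dependent threshold. */
int predict_zero_block(const short block[64], long threshold)
{
    long sum = 0;
    for (int i = 0; i < 64; i++)
        sum += labs(block[i]);          /* sum of absolute values */
    return (sum / 64) <= threshold;
}
```

A block predicted to be zero skips the DCT and quantization entirely; a mean absolute value above the threshold predicts a non-zero block.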
Because of its associated low computation demands, the sum of absolute values is here selected as the predictor function.

QP    α(high)  α(low)  β(high)  β(low)  δ(high)  δ(low)  threshold
<9    Too low to consider
9     0.284    0.563   0.607    0.906   0.468    0.622   2
10    0.268    0.527   0.642    0.918   0.417    0.574   2
11    0.258    0.483   0.665    0.922   0.387    0.523   2
12    0.412    0.657   0.691    0.931   0.596    0.705   3
13    0.396    0.625   0.708    0.938   0.559    0.667   3
14    0.380    0.601   0.731    0.944   0.520    0.637   3
15    0.495    0.708   0.744    0.948   0.665    0.747   4
16    0.479    0.696   0.761    0.955   0.630    0.729   4
17    0.467    0.682   0.772    0.957   0.605    0.713   4
18    0.454    0.682   0.788    0.961   0.576    0.709   4
19    0.447    0.667   0.795    0.963   0.562    0.692   4
20    0.529    0.741   0.808    0.966   0.655    0.766   5
21    0.524    0.731   0.816    0.968   0.643    0.755   5
22    0.583    0.785   0.827    0.971   0.706    0.809   6
23    0.575    0.772   0.833    0.972   0.691    0.795   6
24    0.575    0.769   0.843    0.974   0.682    0.789   6
25    0.562    0.761   0.847    0.975   0.664    0.781   6
26    0.617    0.805   0.854    0.977   0.722    0.824   7
27    0.612    0.787   0.859    0.978   0.712    0.805   7
28    0.608    0.783   0.866    0.979   0.701    0.800   7
29    0.609    0.783   0.870    0.980   0.699    0.799   7
30    0.652    0.814   0.877    0.981   0.744    0.830   8
31    0.647    0.812   0.879    0.981   0.736    0.828   8

Table 6. α, β, δ and thresholds for different quantizers. high: high activity sequence, low: low activity sequence (more typical for H.263+).

4.2.1.2 Prediction Accuracy

The existence of zero blocks in a bit stream strongly depends on the characteristics of the original video sequence and on the quantizer value. In high-activity sequences, the difference between two corresponding blocks that belong to two consecutive frames is quite significant. Also, coarse quantizers result in more zero blocks, while for very fine quantizers it may not be worth trying to predict zero blocks. Table 6 lists the experimentally best values of the α, β and δ parameters, defined in Equation (7), for different quantizers.
If the sum of absolute values is greater than the threshold, then the block is predicted to be non-zero. The threshold is determined statistically, based on the characteristics of several video sequences, so that it does not cause any quality degradation in the reconstructed picture. The threshold is found to vary only slightly as a function of the video content. As can be seen from Table 6, zero block prediction becomes much more advantageous at low bit rates, i.e., when a coarse quantizer is employed. It is also possible to increase the parameter α by selecting higher thresholds. However, in that case, the reproduction quality may decrease.

4.2.1.3 Computation Time Gains

Since zero block prediction has to be done for every DCT calculation, the computation gain (in percentage) is given by

Cg = [X - ((1 - α) X + Y)] / X × 100 = (α X - Y) / X × 100 ,   (8)

where X is the DCT computation time, Y is the computation time of the prediction of zero blocks and α is the percentage of blocks which are predicted to have all zero coefficients. Even if a block is predicted to be zero, it may still be necessary to perform some additional operations, e.g., to fill in the coefficients array with zeros. In that case, Equation (7) becomes

C = X (1 - α) + Y + α Z ,   (9)

where Z is the computation time required to process a zero block after it is detected, and Equation (8) becomes

Cg = [α (X - Z) - Y] / X × 100 .   (10)

The values of X, Y and Z strongly depend on the implementation of each function. Also, the relative computation cost of individual operations, such as addition and multiplication, is processor dependent. For instance, if an Intel MMX processor is used, then with proper pipelining the cost of a multiplication can be the same as the cost of an addition, which is one clock cycle. On the other hand, other processors may have a multiplication cost that is higher than an addition cost.
Therefore, it is not practical to count the number of additions and multiplications necessary for an operation and place these in Equation (10) to determine a precise value of Cg. Rather, it is more meaningful to use the processor time needed to perform each operation as a parameter during the estimation process.

4.2.1.4 Simulation Results

In order to evaluate the computation advantage that zero block prediction provides, the computation times of the individual functions and implementations should be known. The computation times of different implementations of the DCT function running on a Pentium MMX 200 MHz computer are given in Table 7.

Function X:                                        Time (for 100 frames: 59400 calls)
Fast DCT (H.263 encoder)                           2000 ms
Fast DCT (Arai et al.)                             1900 ms
Integer DCT (Arai et al.)                          1150 ms
MMX DCT (MMX assembly version of Integer DCT)      400 ms (+80 ms preprocessing, ignored)

Table 7. Computation times of DCT functions.

Table 8 shows the computation costs of calculating the sum of absolute values of an 8x8 block.

Function Y:                             Time (for 100 frames: 59400 calls)
Sum of absolute values (C imp.)         200 ms
Sum of absolute values (MMX imp.)       110 ms

Table 8. Computation times of different sum of absolute values functions.

Also, the implementation dependent computation cost of setting an 8x8 array of integer coefficients to zero values is given in Table 9.

Function Z:             Time (for 100 frames: 59400 calls)
memset (C imp.)         150 ms

Table 9. Computation time of memory set functions.

Using Equation (9), the computation time for the MMX DCT implementation, with a quantizer of 20 and for a low activity video sequence, is calculated as

C = 400 (1 - 0.741) + 110 + 0.741 × 150 = 103 + 110 + 111 = 324 ms.

When the same conditions are simulated, a total computation time for the DCT of 340 ms is obtained.
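Equation (9) is easy to check against these measurements. A small sketch (the function name is ours; the constants are the Table 6 to Table 9 values for a quantizer of 20 and a low activity sequence):

```c
/* Equation (9): overall DCT-stage time with zero block prediction.
 * X: DCT time, Y: zero block predictor time, Z: time to post-process a
 * detected zero block, alpha: fraction of blocks predicted all-zero. */
double zero_block_time(double X, double Y, double Z, double alpha)
{
    return X * (1.0 - alpha) + Y + alpha * Z;
}

/* zero_block_time(400.0, 110.0, 150.0, 0.741) gives about 325 ms,
 * against the 340 ms obtained by simulation. */
```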
If we take into consideration that MSVC++* profiling is not extremely accurate, and that some operations change the cache content and thus affect the cache hit ratio, the estimated computation time appears to be quite accurate. Using the above zero block prediction method, a computation gain of 15% is achieved. This speed improvement is not very significant. This is due to the fact that the computation times of the sum of absolute values (Y) and the memset function (Z) are relatively large as compared to the computation time of the DCT (X) on an MMX processor. On the other hand, this is not the case for an implementation of the DCT on a Pentium processor that does not support MMX. Therefore, greater speed improvements can be expected in such a case. In fact, using our "C" DCT implementation, with a quantizer of 20, and for low activity video sequences, the computation time becomes

C = 1900 (1 - 0.741) + 110 + 0.741 × 150 = 492 + 110 + 111 = 713 ms.

The actual computation time for the function is 751 ms, resulting in a computation gain of 60%.

4.2.1.5 Implementation Complexity and Memory Requirements

The above method adds very little complexity and memory requirements to software implementations of the DCT. However, in hardware implementations of DCT units, the residual data transfer to the DCT component is one of the main time consuming parts of the whole process. Since the described method does not reduce the amount of data transfer, hardware implementations may not benefit from it as much as software implementations.

* Microsoft Visual C++

4.3 IDCT

In video coding, the IDCT is performed both at the encoder and at the decoder. While the DCT has to be computed for each block, the IDCT needs to be computed only for the non-zero blocks, which are indicated by the Coded Block Pattern (CBP) in the H.263+ bit stream. The CBP shows which 8x8 blocks are encoded in a macroblock structure.
In a typical H.263+ low bit rate bit stream, approximately 70-80% of the prediction error blocks are not coded, i.e., they are zero blocks. Moreover, even the non-zero blocks have dominating zero sub-blocks. This makes it possible to further reduce the computation time. Figure 26 shows the number of computations necessary for a fast IDCT if zero sub-blocks exist in the block.

Figure 26. Number of computations of a fast IDCT for different zero sub-block patterns (a)-(d). A fully non-zero block requires 144 multiplications and 464 additions; blocks with zero sub-blocks require 128 multiplications and 432 additions, 92 multiplications and 284 additions, or 56 multiplications and 252 additions, depending on the pattern.

Different fast DCT algorithms may produce a greater or smaller reduction in the number of computations. The computation saving for the block structure shown in Figure 26.c is approximately 50%. This structure is also the most typical in a low bit rate video coding application, where approximately 80% or so of the blocks conform to such a structure, depending of course on the input sequence and the quantizer. Hence, the other structures shown in Figure 26 can be ignored, and therefore, the saving in computation time is

Cg ≈ 80% × 0.5 = 40%.

There is also an overhead involved in finding which of the sub-blocks are zero. However, this overhead can be eliminated by finding the required information during variable length coding of the coefficients at the encoder side, and during variable length decoding of the bit stream at the decoder side. After the residual blocks are DCT coded and quantized, they are scanned in zigzag order. Variable length coding is then performed. After the last non-zero coefficient is VLC coded, a bit sequence indicates that the rest of the coefficients are zero valued.
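In an implementation, the required information reduces to the zigzag index of the last non-zero coefficient. A minimal sketch (the zigzag table is the standard 8x8 scan; the index bound of 10 and the function name are our own illustration):

```c
/* Zigzag scan order for an 8x8 block: zigzag[k] = row * 8 + column. */
static const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* Record the zigzag index of the last non-zero coefficient during VLC
 * coding (or decoding), then decide whether a reduced IDCT applies:
 * zigzag indices 0..9 all lie in the triangle row + column <= 3, so if
 * the last non-zero coefficient is among them, every non-zero
 * coefficient falls in the low-frequency corner of the block. */
int low_frequency_only(int last_nonzero_zigzag_index)
{
    return last_nonzero_zigzag_index < 10;
}
```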
The position of the last non-zero coefficient makes the zero sub-block structure apparent. However, since the coefficients are zigzag scanned, it is not possible to exactly identify the structure in Figure 26.c; instead, the structure in Figure 27 can be identified. Approximately 70% of the blocks in a low bit rate bit stream have this structure.

Figure 27. Zero sub-block detection via bit stream parsing: only a triangular low-frequency corner of the block is non-zero, requiring 50 multiplications and 240 additions.

4.3.1 Implementation Complexity and Memory Requirements

The implementation of zero sub-block detection is a fairly simple task for both hardware and software video coders. Hardware applications may especially benefit from the fact that the above method reduces the amount of data transfer to the IDCT unit. However, two different IDCT implementations may be necessary in hardware, one for non-zero blocks and the other for partially zero blocks. On the other hand, a software implementation of an IDCT module that supports both types of blocks is trivial.

4.4 Quantization and Dequantization

The quantization operation takes 8% and the dequantization operation takes 2% of the whole encoding time if the fast search algorithm is used for the motion estimation process. While dequantization is performed only for the non-zero blocks, quantization has to be performed for every block. It was mentioned in Section 4.2 that the pre-scaled fast DCT algorithm and the post-scaled fast IDCT algorithm may remove the need to perform quantization and dequantization if the quantizer is embedded into the pre-scale and post-scale coefficients. While this is true for most video coding algorithms, because of the dead-zone quantization of H.263, it is not applicable to our encoder implementation unless the Advanced Intra Coding mode is turned on.
The zero block prediction method that is described in Section 4.2.1 can also be applied to the quantization process. The computation gain of both the DCT and the quantization can be calculated by including the quantization as follows:

Cg = [X + Q - ((1 - α)(X + Q) + Y + α Z)] / (X + Q) × 100 = [α (X + Q - Z) - Y] / (X + Q) × 100 ,   (11)

where Q is the computation time of quantization.

4.4.1 Quantization with Tables

Another method for faster quantization is to use look-up tables. After the residual block is DCT coded, the coefficients block is clipped into the range [-2048, 2047]. It is possible to create a 2-D array by pre-calculating all quantization results for each of the 31 quantizers, and to access this array by using the coefficient value and the quantizer as indices. Such an array takes 253,952 bytes of memory if 16-bit short integers are used. If look-up tables are used, 64 irregular memory read operations are performed per 8x8 block. Since these memory accesses are irregular and thus cause cache misses, this method is only three times faster than computing the quantization arithmetic directly. If look-up tables are used in conjunction with the previously discussed zero block prediction method, then the advantage of using look-up tables becomes insignificant during the encoding process.

4.4.2 Implementation Complexity and Memory Requirements

The implementation of look-up tables, and the memory needed for them, is trivial in software applications. In hardware, however, using 250 Kbytes of memory for such look-up tables may not be suitable. But if memory is not a problem, the implementation of look-up tables becomes practical for hardware applications too, as it removes the need for a multiplication unit.
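A sketch of such a table, assuming the common H.263 dead-zone rule for inter coefficients, level = (|c| - QP/2) / (2·QP) with the sign reapplied (the exact rule, the layout and the names below are illustrative rather than the thesis code):

```c
#include <stdlib.h>

#define COEF_RANGE 4096   /* clipped coefficients: -2048 .. 2047 */

/* quant_table[qp - 1][coef + 2048] holds the quantized level.
 * 31 quantizers x 4096 coefficient values x 2 bytes = 253,952 bytes. */
static short quant_table[31][COEF_RANGE];

void build_quant_tables(void)
{
    for (int qp = 1; qp <= 31; qp++)
        for (int c = -2048; c <= 2047; c++) {
            int level = (abs(c) - qp / 2) / (2 * qp);   /* dead-zone rule */
            if (level < 0) level = 0;                   /* dead zone */
            quant_table[qp - 1][c + 2048] = (short)(c >= 0 ? level : -level);
        }
}

/* The look-up replaces a division per coefficient. */
short quantize(int coef, int qp)
{
    return quant_table[qp - 1][coef + 2048];
}
```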
4.5 Partial SAD Computation Technique

If the full search algorithm is used, the motion estimation procedure takes approximately 75% of the encoding time. A number of fast motion vector search algorithms have been developed; however, the full search algorithm is still the most commonly used one, because of the unavoidable quality degradation caused by the use of fast search algorithms.

A common computation reduction technique is to compare the accumulated SAD value to the minimum SAD after each row of the block, and to terminate the computation if the accumulated SAD is greater than the minimum SAD. Statistics show that more than half of all SAD computations are terminated before calculating the 5th line of a 16x16 block. Using this technique, approximately 70 SAD computations are performed instead of the 256 SAD calculations normally needed for a 16x16 block. The number of SAD computations can be further reduced by trying to predict whether the SAD will exceed the minimum SAD before it actually exceeds it. For example, during the SAD computation of a 16x16 block, after the computation is completed for the first row, if the accumulated SAD so far is more than half of the minimum SAD, it can be predicted that the final SAD will exceed the minimum SAD. This prediction is clearly a function of the SAD value computed so far, the number of rows processed and the total number of rows. Considering these, a fairly effective prediction model is given by

Predicted SAD = SAD + I × [SAD / (N - I)] × φ ,   (12)

where I is the number of rows that have not been processed so far, N is the dimension of the ME block and φ is the accuracy coefficient.
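A sketch of the row-wise loop with this predictive early termination, following Equation (12) (the interface and the rejection convention are our own illustration):

```c
#include <stdlib.h>

#define N 16   /* ME block dimension */

/* SAD of two 16x16 blocks with predictive row-wise early termination.
 * After each row, the final SAD is predicted from the value accumulated
 * so far (Equation (12)); the candidate is rejected as soon as the
 * prediction exceeds the best SAD found so far. phi is the accuracy
 * coefficient, e.g. 0.5. */
int sad_16x16_early(const unsigned char *cur, const unsigned char *ref,
                    int stride, int min_sad, double phi)
{
    int sad = 0;
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < N; col++)
            sad += abs(cur[row * stride + col] - ref[row * stride + col]);
        int remaining = N - 1 - row;   /* I: rows not yet processed */
        double predicted = sad + remaining * ((double)sad / (row + 1)) * phi;
        if (predicted > min_sad)
            return min_sad + 1;        /* reject this search candidate */
    }
    return sad;
}
```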
The number of SAD calculations per search and the PSNR decrease for different φ values are presented in Table 10, for the FOREMAN and AKIYO sequences at different bit rates, based on 16x16 block motion estimation.

                          w/o SAD prediction  φ = 0.2  φ = 0.5  φ = 0.7  φ = 1
FOREMAN (48 kbit/sec)
  PSNR (dB)               30.74               -0.04    -0.07    -0.3     -0.95
  Number of SADs          89                  61       40       33       27
AKIYO (24 kbit/sec)
  PSNR (dB)               37.17               +0.03    0.0      -0.06    -0.13
  Number of SADs          54                  33       24       22       20

Table 10. Change in picture quality for different φ.

The SAD prediction is not as effective when it is used in conjunction with fast search ME algorithms. The reason is that in many fast search algorithms, the minimum SAD and the calculated SAD are very close to each other, delaying the termination of the SAD computation until the late stages of the process. Thus, trying to predict the SAD would usually not be as successful, and would not save as much computation time, when fast search motion estimation is employed.

4.6 Fast Half Pixel Motion Estimation Based on Approximation

Half pixel motion estimation is one of the main enhancements of H.263 over H.261. It provides, in most cases, more than 2 dB PSNR increase in picture quality. The eight SAD calculations that are necessary for half pixel ME form a rather insignificant computational load when compared to the over 256 SAD calculations needed for a full search ME. Yet, there have been efforts to reduce or eliminate the computation time of the half pixel ME. The resulting computation reduction techniques become especially beneficial when fast search algorithms are used.

4.6.1 A Simplified ME for the MPEG-2 Encoder

Figure 28. Half pixel motion vector prediction using 8 precomputed integer pixel SADs.

Senda Y. et al.
[42] suggested a simple approximation technique to remove the need for half pixel ME within an MPEG-2 coding framework. Their suggestion was to compute each half pixel SAD from the surrounding integer pixel SADs, as illustrated in Figure 28 and given by

SADa = ψVH × (SADA + SADB + SADC + SADD + 2) / 4 ,   (13)
SADb = ψH × (SADC + SADD + 1) / 2 , and   (14)
SADc = ψV × (SADB + SADD + 1) / 2 ,   (15)

where ψVH, ψH and ψV are statistically optimized coefficients for high bit rate (>1 Mbit/sec) MPEG-2 encoding. It is possible to adapt this technique to low bit rate video coding (<64 Kbit/sec) by re-optimizing these coefficients, as shown in Table 11.

                                    Foreman 48 Kb/sec  Akiyo 8 Kb/sec  Foreman 64 Kb/sec  Akiyo 28 Kb/sec
                                    (Q>13)             (Q>13)          (Q<13)             (Q<13)
PSNR with half pixel ME             30.47              32.63           31.56              37.92
PSNR with ψVH=13/16, ψV=ψH=14/16    -0.15              -0.61           -0.15              -0.34
PSNR with ψVH=27/32, ψV=ψH=29/32    -0.13              -0.21           -0.21              -0.47

Table 11. The change in PSNR using different approximation factors (ψ).

4.6.2 Fast Search ME and Half Pixel ME Approximation

Figure 29. Half pixel motion vector prediction using 5 precomputed integer pixel SADs.

To perform the half pixel ME approximation, all the surrounding integer pixel SADs must be known. This is not a problem in the case of full search ME. However, in many fast search ME algorithms, not all of the surrounding SADs are computed. One solution is simply to perform ME for the missing locations. Another solution is to modify the approximation equations and the approximation coefficients, ψ, so that only the available integer SADs are used. The H.263+ Test Model Near-term (TMN) fast integer pixel ME algorithm computes the SADs at locations that are the vertices of a diamond, as illustrated in Figure 29.
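Before the equations are adapted to fast search, note that Equations (13)-(15) reduce each half pixel SAD to a few additions and one scaled multiplication. A sketch in integer arithmetic, using the low bit rate coefficients of Table 11 (ψVH = 27/32, ψV = ψH = 29/32); the function name and the ordering of the divisions are our own illustration:

```c
/* Approximate the three distinct half pixel SADs of Figure 28 from the
 * four surrounding integer pixel SADs, per Equations (13)-(15), with the
 * re-optimized low bit rate coefficients psiVH = 27/32, psiV = psiH = 29/32. */
void approx_half_pixel_sads(int sadA, int sadB, int sadC, int sadD,
                            int *sad_a, int *sad_b, int *sad_c)
{
    *sad_a = 27 * ((sadA + sadB + sadC + sadD + 2) / 4) / 32;  /* Eq. (13) */
    *sad_b = 29 * ((sadC + sadD + 1) / 2) / 32;                /* Eq. (14) */
    *sad_c = 29 * ((sadB + sadD + 1) / 2) / 32;                /* Eq. (15) */
}
```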
The modified half pixel ME equations and the optimized ψ's for this algorithm are given by

SADa = ψVH × (SADA + SADB + SADC + 1) / 3 ,  ψVH = 28/32,   (16)
SADb = ψH × (SADA + SADC + 1) / 2 ,  ψH = 29/32, and   (17)
SADc = ψV × (SADB + SADC + 1) / 2 ,  ψV = 29/32.   (18)

Although this technique removes the need for the 8 half pixel SAD computations, the reproduction quality may not be acceptable in some applications. Thus, we next suggest a new approach that yields better computation-performance tradeoffs. In this approach, three fast half pixel ME methods are considered. Our motivation is essentially the same for all three methods: if a half pixel block is estimated to have the minimum SAD by using the surrounding integer pixel blocks, it is most likely that it, or a neighboring block, has the minimum SAD. A limited search can then be performed to find the block corresponding to the smallest SAD. Even if the described procedure fails to find the minimum SAD, a very close match is likely to be found. The three methods are described next:

1. Method 1: We find the smallest SAD out of the four surrounding integer pixel SADs. Then we perform three half pixel MEs accordingly. For example, in Figure 30, if B corresponds to the smallest SAD, then we compute the half pixel SADs for pixels 1, 2 and 3. If C corresponds to the smallest SAD, then we compute the half pixel SADs for pixels 3, 5 and 8, and so on. We also compare these to the center pixel SAD and pick the smallest of the two.

Figure 30. Half and integer pixel locations in a diamond shape area: the center integer pixel E is surrounded by the half pixels 1-8, with the integer pixels A, B, C and D at the vertices of the diamond.

2. Method 2: We find the two smallest SADs out of the four surrounding integer pixel SADs. We then perform two or three half pixel SAD computations, depending on their location.
For instance, in Figure 30, if the SADs for pixels A and B are the smallest, we find the half pixel SADs for pixels 1, 2 and 4. If the SADs for B and D are the smallest, then we find the half pixel SADs for pixels 2 and 7. We then pick the smallest of these SADs and the center pixel SAD.

3. Method 3: We compute the half pixel SADs for the eight surrounding positions by approximation, using the four surrounding integer pixels and the center pixel. We then perform half pixel ME to find more accurate SADs for the N pixels that correspond to the smallest approximated SADs out of the 8 pixels. Last, we find the smallest of the N SADs and the center SAD to determine the half pixel MV.

The change in PSNR associated with each of the above methods is given in Table 12. Method 3, with N equal to 4, thus performing 4 ME operations per macroblock, gives the best results. Method 1, which uses 3 ME operations, and Method 2, which performs 2-3 ME operations, cause very small quality degradation. The only method that does not need to perform any ME is the approximation method that uses 5 integer pixels. When compared to the PSNR decrease caused by not using half pixel ME at all, the 5 pixel approximation method gives very good results and can be used in applications where processing power is very limited.

                                  Foreman 64 Kb/sec  Foreman 48 Kb/sec  Akiyo 28 Kb/sec  Akiyo 8 Kb/sec
with half-pixel ME                31.56              30.47              37.92            32.63
no half-pixel ME                  -2.24              -2.16              -1.61            -1.33
Approximation using 9 integer
pixels, ψVH=27/32, ψV=ψH=29/32    -0.21              -0.13              -0.47            -0.21
Approximation using 5 integer
pixels, ψVH=28/32, ψV=ψH=29/32    -0.36              -0.31              -0.52            -0.21
Method 1                          -0.07              -0.05              -0.04            0
Method 2                          -0.12              -0.11              0                -0.02
Method 3 (N=4 pixels)             -0.04              -0.02              +0.02            +0.06
Method 3 (N=2 pixels)             -0.15              -0.14              -0.05            -0.02

Table 12. PSNR table for different fast half pixel ME methods.

Figure 31.
PSNR - number of iterations tradeoff in Method 3.

Figure 31 shows the change in PSNR when different N values are used, for different bit rates of the Foreman sequence. N defines the number of times the half pixel ME search is iterated. Selecting an N value of 3 seems to yield a good compromise between computation time and picture quality.

4.6.3 Implementation Complexity and Memory Requirements

Fast half pixel ME not only reduces the number of SAD computations but also lowers the number of irregular memory accesses. Therefore, it increases the performance of both hardware and software systems. There is some overhead resulting from the computation of the half pixel SAD approximations and from finding the minimum one, but it is rather small. While a hardware implementation of the proposed methods may be nontrivial because of the need for a multiplier unit, the complexity of the associated software implementations is negligible.

Chapter 5 MMX Implementation of the H.263+ Video Encoder

5.1 Introduction

In this chapter, we propose platform (and more specifically MMX) dependent mapping techniques for the computationally intensive SIMD structured components of video coding, in order to improve the video coding speed performance. These video coding components include the DCT, SAD computations, interpolation, data interleaving for half pixel motion estimation, and motion compensation functions. The mapping of these components to MMX provides an additional speed advantage over the platform independent optimizations proposed in Chapter 4, and makes real-time H.263+ video encoding possible on the Pentium MMX processor [20], [36], [37].

5.2 Video Coding Cores with SIMD Structure

Video coding systems, like many other multimedia systems, have very computationally intensive core functions that can be implemented efficiently using a Single Instruction Multiple Data (SIMD) structure.
Table 13 shows a profile of the H.263+ video encoder for the FOREMAN sequence, without any modes enabled, with the fast ME search, fast IDCT, fast DCT and fast quantization enabled, and with rate control disabled. The relative computation time percentages may vary slightly as a function of the input video sequence and the quantizer.

Function                 Computational Load
Fast ME                  17%
DCT                      15%
Half Pixel ME            12%
Interpolation            7%
IDCT                     5%
Load ME Area             4%
Motion Compensation      3.5%
Quantization             3%

Table 13. Relative computational costs of the most compute intensive H.263+ encoder functions.

Except for the quantization function, which is implemented using a look-up table, all of the functions in Table 13 employ SIMD structured algorithms. Thus, a significant performance increase can be expected if the MMX architecture is properly exploited.

5.3 SAD Computations for Motion Estimation

The motion vectors are found by performing a number of SAD operations. In H.263+, the SAD is performed either on 16x16 blocks, in baseline coding, or on 8x8 unsigned byte blocks, when the Advanced Prediction or Deblocking Filter modes are enabled. Since the size of each data element is 8 bits, 8 data elements can be processed at the same time on MMX, giving a significant performance improvement. Subtraction of two unsigned bytes produces a 9-bit signed number. Therefore, to compute the absolute value of the difference, it may seem that a conversion to 16-bit format is needed. However, this conversion is not necessary if the saturated arithmetic feature of the MMX architecture is used. Consider the two MMX registers in Figure 32. One of the registers contains 8 bytes of data from the target block and the other register contains 8 bytes of data from the current block.
When register B is subtracted from register A with unsigned saturation, the result is set to the difference between ai and bi if ai is greater than bi, or to zero if ai is smaller than or equal to bi, simply because the negative result saturates to zero. When the situation is reversed, i.e., A is subtracted from B, the result saturates to zero when bi is smaller than ai. If the OR operation is applied to these two subtraction results, the outcome is the absolute value of the difference between the A and B registers. The data can then be unpacked and accumulated in 16-bit precision registers. Computing one SAD for two 16x16 blocks requires that 64 memory load, 64 packed subtraction, 32 packed-OR, 32 unpack and 63 addition operations be performed. The basic core of this algorithm was published by Intel [20]. We adapted their technique to work with 16x16 and 8x8 blocks.

Figure 32. Data flow to compute the absolute value of the differences of two unsigned data registers, using two saturated subtractions (psubusb) and a packed OR (por).

One disadvantage of using this technique is that we cannot use any kind of partial SAD computation technique. Since four 16-bit partial sums are accumulated in the same register, unpacking these values and adding them together to find the SAD after computing each line, i.e. 16 bytes, would cause an unacceptable overhead. Moreover, if fast search is used, not using the partial SAD computation technique does not slow down the ME, since most of the time the SAD computation is not terminated until the later stages. The MMX implementation of the SAD function runs approximately 4-5 times faster than the optimized C implementation.
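The saturated-subtraction trick of Figure 32 can be prototyped in scalar C before mapping it to psubusb/por (a sketch of what the MMX instructions do to each of the 8 packed bytes; the function names are ours):

```c
/* Unsigned saturating subtract: the scalar equivalent of what psubusb
 * does to each packed byte (a negative result clamps to zero). */
static unsigned char subs_u8(unsigned char a, unsigned char b)
{
    return a > b ? (unsigned char)(a - b) : 0;
}

/* |a - b| without widening to 16 bits: one of the two saturating
 * differences is always zero, so OR-ing them yields the absolute value,
 * exactly as the psubusb/psubusb/por sequence does on 8 bytes at once. */
unsigned char abs_diff_u8(unsigned char a, unsigned char b)
{
    return subs_u8(a, b) | subs_u8(b, a);
}
```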
For example, when encoding the FOREMAN sequence with a constant quantizer of 13 on a Pentium MMX 200 MHz computer, 221901 SAD computations were performed in 2349 ms in the optimized C version of the encoder, while the same number of operations took only 522 ms in the MMX version. This implementation does not have any effect on the rate-distortion performance of the encoder.

5.4 DCT

Computing the DCT takes 15% of the encoding time and, unlike the SAD, its high computational cost is caused mainly by its complexity. We implemented the fast floating point DCT algorithm of Arai et al. [1] using MMX instructions because of its scalability properties, the small number of operations it requires, and its structure, which is well suited to a SIMD implementation. Since MMX instructions can perform only integer arithmetic, the floating point algorithm was converted to a fixed-point algorithm.

The input to the DCT is an array of signed 8-bit data and the output is an array of signed 16-bit data. In order to keep the precision as high as possible, some downscaling and upscaling operations are applied to the data in the intermediate stages of the computations. These operations are given by

    Upscale (X, precision): X << precision, and    (19)

    Downscale (X, precision): (X + 2^(precision-1)) >> precision,    (20)

where "<<" denotes the left shift operation and ">>" denotes the right shift operation. In the intermediate stages of the computations, it is possible to use either 16-bit or 32-bit registers. Using 32-bit registers would increase the accuracy of the DCT, but then only 2 intermediate data elements would fit into one MMX register, causing a performance loss in terms of speed. On the other hand, by using 16-bit intermediate data it is possible to process 4 data elements at a time.
Since H.263+ is a low bit rate video coding standard and DCT coefficients are thus usually quantized very coarsely, the resulting DCT errors are mostly insignificant. Therefore, 16-bit intermediate registers are used here. There are three ways to implement the DCT function using MMX instructions:

1. Computing the DCT of one 1x8 vector at a time. Here, the DCT is performed on the first row, then the second row, and so on. This is also how our C implementation works.

2. Performing the DCT on four 1x8 vectors at a time. In this method, the DCT is first applied to the first four rows, and then the remaining four rows are processed. The reason for processing four rows at a time, instead of 2 or 3 for example, is that four 16-bit data elements fit into one 64-bit MMX register.

3. Performing the DCT on four 8x8 blocks simultaneously. Here, each block is processed as described in the first method, but the computations for four blocks are carried out at the same time.

The disadvantages of the first method are its implementation complexity and the extra precision loss caused by fitting many data elements into the same MMX register. Implementing the third method appears to be simple, but the data of the four 8x8 matrices has to be interleaved before applying the DCT, and this may cause a major overhead. In order to implement the second method, the input array has to be transposed before applying the DCT. Because of its simplicity, the second method is selected in our implementation.

To perform the DCT of four 8x1 vectors, one 16-bit data word from each row is read into the same MMX register. In order to obtain the highest efficiency, these 16-bit words should be placed next to each other in memory, making transposition of the input matrix necessary.
After transposition, the DCT for the upper four rows of the array is computed, followed by the computations for the remaining four rows. After that, the array is transposed again and the DCT is applied once more, four rows at a time.

Operations other than multiplications are performed on four 16-bit data elements at a time. The multiplications are performed using 32-bit data, and the corresponding results are immediately downscaled and packed into 16 bits. As illustrated in Figure 33, two multiplication operations are performed first, one for the low significant part and one for the high significant part. The results are then interleaved so that the high significant part and the low significant part of the same multiplication end up in the same MMX register. Next, 32-bit additions and arithmetic right shift operations are executed for downscaling. Last, the four 32-bit results are packed into one 64-bit MMX register with signed saturation.

Figure 33. Flowgraph of the multiplication and downscaling operations performed with MMX instructions: Z = Downscale(Y x A, P), where Y is an intermediate variable, A the coefficient to multiply and P the downscaling precision, computed with the PMULHW, PMULLW, PUNPCKLWD, PUNPCKHWD, PADDD, PSRAD and PACKSSDW instructions.

The MMX DCT implementation runs approximately four times faster than the floating point implementation and three times faster than the integer optimized C implementation. In our C encoder implementation, the residual block is stored in a 32-bit integer array.
However, the MMX DCT implementation takes a 16-bit array as input. Therefore, the 32-bit integer array needs to be converted to a 16-bit integer array, adding a significant overhead. Hence, the overall speed up drops to three times when compared to the optimized C floating point algorithm. Since an integer DCT algorithm is used in the MMX implementation, the rate-distortion performance of the encoder is affected as well. Table 14 shows the resulting picture quality in terms of PSNR for both DCT implementations, for different bit rates and sequences. As seen from the table, using an integer DCT implementation does not cause a significant decrease in PSNR.

                                             Foreman    Akiyo     Foreman    Akiyo
                                             48 Kb/sec  8 Kb/sec  64 Kb/sec  28 Kb/sec
    PSNR with floating point DCT (dB)        30.47      32.63     31.56      37.92
    Variation in PSNR with integer/MMX DCT   -0.01      +0.02     -0.01       0.00

Table 14. Performance degradation caused by using the integer DCT implementation.

5.5 IDCT

The H.263+ standard is based on inter picture prediction. Thus, any errors caused by a low-accuracy IDCT propagate throughout the video coding process. ITU-T defines an IDCT accuracy measurement procedure within the H.263+ standard. Unless very high fixed point accuracy is used, it is very difficult, if not impossible, to implement a standard compliant IDCT function with MMX using 32-bit or lower precision. Moreover, when the IDCT is implemented with 32-bit precision, and when the computational overhead is considered, a large speed up cannot be expected. Intel has implemented and made publicly available an MMX IDCT routine for MPEG decoding. Their implementation employs 32-bit precision for multiplication and 16-bit precision for accumulation operations, and is approximately 4 times faster than the optimized C implementation. However, it does not meet the specifications of the ITU-T standard.
The ITU-T IDCT accuracy specifications require that the DCT, followed by the IDCT under test, be applied to a number of blocks generated by a pseudo random number generator routine, with a 64-bit floating point DCT/IDCT serving as the reference, and that the resulting peak, mean square and mean errors (pixel-wise and overall) be less than predefined thresholds. Also, if the reference IDCT produces a zero output, the IDCT under test should also produce a zero output. More details on these requirements can be found in [25]. Even if an implementation does not meet the standard, it is still useful to compare its computational accuracy to these thresholds. Table 15 shows the partial accuracy results for 32-bit floating point, 64-bit integer and 32-bit integer IDCT implementations, along with Intel's MMX IDCT implementation. The test results are given only for numbers that are in the range [-256, +255].

                                      Peak    Zero output      Maximum mean          Overall
                                      error   rule violation   error for any pixel   mean error
    Thresholds                        1       No               0.015                 0.0015
    32-bit floating point fast IDCT   1       No               0.0002                0.0000156
    64-bit integer IDCT               1       No               0.0001                0.00000625
    32-bit integer IDCT               1       Yes              0.0539                0.0289
    Intel's MMX IDCT                  3       Yes              1.866                 0.55

Table 15. ITU-T IDCT accuracy test results for various IDCT implementations.

The IDCT accuracy is more important for H.263+ than it is for MPEG, because in a typical H.263+ application only one frame (out of 100 or more frames) is intra coded, whereas in an MPEG application a frame is intra coded every 10-15 frames. Nevertheless, Intel's IDCT implementation can still be used in our H.263+ encoder, provided that the encoder and decoder software run on exactly the same type of processor. The codec, however, would not be truly standard compliant.
5.6 Interleaved Data Transfers for Half Pixel ME

Figure 34. Half pixel search block locations in the interpolated reconstructed picture data (bold cells denote integer pixels; the nine half pixel search blocks are interleaved within the interpolated picture).

Half pixel ME consists of two computationally intensive parts: the SAD computation and the extraction of the search block from the interpolated reference picture. The MMX SAD function implemented for integer pixel ME can also be used for half pixel ME. However, this does not speed up the other computationally expensive tasks, which represent approximately 25% of all computations. Figure 34 shows the interpolated picture that is used for prediction and the interleaved locations of the half pixel search blocks. In order to perform SAD computations using the interleaved data, the data must first be collected into an array. This operation requires irregular memory accesses and takes considerable time. Since the blocks consist of 8-bit data arrays, it is possible to obtain a 700% speed increase for this operation by using data pack instructions. Figure 35 shows the flowgraph of the MMX implementation. First, 16 bytes are read into two MMX registers; then an AND operation is performed with the mask 00FF00FF00FF00FF to eliminate the data that will not be placed into the half pixel search block. Then, the remaining bytes are packed into one MMX register with either signed or unsigned saturation.
Last, the MMX register is written back to memory as the search block.

Figure 35. Flowgraph of the data interleaving for the MMX half pixel ME function (PAND with the mask 00FF00FF00FF00FF followed by PACKUSWB).

5.7 Interpolation

Interpolation is performed on every reconstructed I and P picture in order to predict the current picture with half pixel accuracy. Figure 36 illustrates the bilinear interpolation performed as specified in the H.263+ standard. Given the four integer pixels p11, p12, p21 and p22, the interpolated values are

    a = p11,
    b = (p11 + p12 + 1 - rtype) / 2,
    c = (p11 + p21 + 1 - rtype) / 2,
    d = (p11 + p12 + p21 + p22 + 2 - rtype) / 4,

where a is the integer pixel, b, c and d are the half pixel values, and rtype is the rounding type, an optional parameter that improves the picture quality and can be signaled in the bit stream header. When the option is on, rtype is 1 for every other picture and 0 for the rest.

Figure 36. H.263+ interpolation.

Since the interpolation is done on 8-bit data and the result is also stored in 8 bits, it is possible to achieve a large speed up by processing 8 units of data simultaneously. Figure 37 shows the inner core of the interpolation function implemented using MMX instructions. First, the four 1x8 pixel vectors are loaded into the MMX registers and the data to be processed is separated using unpacking instructions. Next, the packed additions are performed and the results are written into 16-bit registers. Also, rtype is subtracted from each data element and the outcome is shifted to the right by one or two, depending on the position of the pixel. Last, the 8-bit results are packed together again and written back to memory. In the outer loop, this operation is repeated for the pixels one row below, until the bottom line of the picture is reached.
Then, starting from the top, the same procedure is repeated for the next 8 pixels in the horizontal direction.

Figure 37. The inner core of the MMX implementation of the interpolation (PUNPCKLBW unpacking of the pixel vectors followed by PADDW packed additions and rounding).

The implemented MMX interpolation algorithm is more than three times faster than the optimized C implementation. This optimization does not change the rate-distortion performance of the encoder.

5.8 Motion Compensation

Motion compensation consists of simple additions of the reconstructed residual blocks and the prediction blocks. Since it is performed for all the coded macroblocks, this simple function takes considerable computation time. In our software implementation, the difference and prediction blocks are stored in 32-bit integer arrays. Thus, our MMX implementation gave less than a 100% speed up. Obviously, implementations that use 16-bit arrays for motion compensation benefit more from the use of MMX.

5.9 Load ME Area

MMX instructions can be used to speed up memory copy operations. A good example is the search block loading function for motion estimation. This function is called more than once for each macroblock.
Using MMX instructions for these simple read and write operations, an approximately 70% gain in performance is achieved.

5.10 Optimization of the MMX Functions

All of the MMX implementations explained above were manually optimized to fully utilize the MMX processor's superscalar architecture. Currently, there are no software optimizers written for MMX, and there probably never will be, since Intel believes that programmers can do a much better job of optimizing the code than any optimization software. The following are some of the issues that were considered when optimizing the MMX source code.

5.10.1 Data Alignment

A misaligned access in the data cache or on the bus costs at least three extra clock cycles on the Pentium processor. A misaligned access in the data cache that crosses a cache line boundary costs nine to twelve clock cycles on Pentium processors. Intel recommends that 64-bit data be aligned on an 8-byte boundary. In C, it is not possible to guarantee this alignment for dynamically allocated memory, but a small adjustment of the memory pointer after the allocation can provide the alignment. Figure 38 shows a C routine that creates an n-byte aligned array, where n is a power of two. If this routine is used, all memory accesses should be done through aligned_ptr; however, the memory should be freed using ptr.

    n_byte *ptr;
    n_byte *aligned_ptr;
    int sizeofarray;

    ptr = malloc(sizeofarray * sizeof(n_byte) + n);
    aligned_ptr = (n_byte *) (((unsigned int) ptr + n - 1) & ~(unsigned int)(n - 1));

Figure 38. C routine to create an n-byte aligned array.

5.10.2 Optimization of Pipeline Usage

Optimizing the pipeline usage of the MMX instructions provides most of the gain in execution time. The MMX pipeline structure has already been discussed in Chapter 4. In this section, some techniques that enhance the pairing of instructions are discussed.
As indicated before, MMX uses the two existing integer pipelines of the Pentium processor, namely the U pipe and the V pipe. MMX has only one multiplier and one shifter unit. Thus, multiplication and shifting instructions, including pack and unpack, are not pairable with other multiplication and shifting instructions. The shift operation takes only one cycle; therefore, it is necessary to insert at least one instruction between two shift instructions. On the other hand, since a multiplication takes three cycles to complete, at least two instructions should be placed between two multiplication instructions. Also, a multiplication result can only be used two instructions later. Instruction operand dependencies also affect instruction pairing. For instance, two instructions cannot be paired if one of them writes to a register and the other reads from the same register.

5.10.3 Memory Read and Write

The MMX Pentium processor has four write buffers. Therefore, at most four 64-bit data elements should be written to memory at a time. Unlike on many other processors, on the Pentium, reading from memory that is in the cache takes the same time as reading from a register. Hence, when optimizing the source code, instructions that read from memory should be preferred, as this reduces the load on the registers so that they can be used by other instructions that accept only register operands.

5.10.4 Other Optimizations

Full optimization of the MMX source code involves many more techniques than those explained in the preceding sections. For instance, by taking into account that some instructions can be issued only to the U pipe or only to the V pipe, or by understanding the branch prediction algorithm of the Pentium, it is possible to achieve further gains.
Moreover, the Pentium has two instructions, RDMSR and WRMSR, that give the exact utilization of the pipelines, cache and other resources. These instructions can be used for further optimization of the MMX source code.

5.11 Overall Performance Improvement of the Application

Our MMX implementation of a real-time H.263+ encoder is capable of encoding 10 frames per second at QCIF (176 x 144) resolution on a Pentium MMX 200 MHz processor. It is approximately 25% faster than the optimized C implementation. Table 16 shows a profile of the H.263+ baseline encoder for the FOREMAN sequence after the MMX optimization. As seen from the table, since some of the functions were optimized more than others, the computational loads were re-distributed. Also, the computation times of some functions that were insignificant before, such as Scan, have become important.

    Function                 Computational Load
    DCT                      9%
    IDCT                     9%
    Quantization             8%
    Load ME Area             7%
    Scan                     6%
    Interpolation            5.5%
    Fast ME                  5%
    Reconstruct Full Image   5%
    Half Pixel ME            4%
    Motion Compensation      4%

Table 16. Relative computational costs of the most computationally intensive H.263+ encoder functions after incorporation of the MMX functions.

Chapter 6 Conclusions and Future Research

In this thesis, we propose efficient coding algorithms and techniques that significantly improve the performance of low bit rate (<64 Kbps) video coding. These algorithms and techniques are implemented and tested using our public-domain software implementation of an H.263+ codec. The contribution of our research consists of two parts. In the first part, we develop platform independent algorithms, mostly using the global statistical properties of low resolution and slowly varying video sequences, for the most computationally intensive components of video coding.
The proposed algorithms are zero block prediction prior to the DCT, partial IDCT computations, fast half pixel motion estimation, quantization with look-up tables, and partial SAD computations. These algorithms are discussed in detail in Chapter 4. In the second part, we propose MMX mapping strategies for the SIMD oriented video coding components. By processing two, four and even eight data elements simultaneously, we achieve major speed ups for some of the functions, particularly the DCT, SAD, interpolation, data interleaving for half pixel motion estimation, and motion compensation functions. After employing the platform independent algorithmic optimizations and the MMX mapping, we achieve a much faster encoder that can encode 15 frames per second (fps) on a Pentium MMX 200 MHz processor. The unoptimized version of our encoder can only encode 5 fps on the same processor. More detailed performance results are presented in Appendix A. Also, two video telephony applications that were developed to demonstrate the capabilities of the H.263+ video coding standard and our optimized encoder are presented in Appendix B. Although this thesis focuses mainly on video encoding, some of the developed algorithms, such as the more efficient IDCT and motion compensation algorithms, can also be used to increase the speed of video decoding. Moreover, the performance of the software implemented H.263+ encoder can be improved by further optimization of the MMX optimized functions. This can be achieved by using advanced profiling tools, such as Intel's VTune software package [22], which provides detailed information on the pipeline usage and cache misses of the MMX instructions. Although some effort has been spent on improving the speed performance of the optional coding modes, such as the implementation of the 8x8 SAD function using MMX instructions, we focused mostly on optimizing the baseline components of H.263+.
It is possible to optimize some of the optional coding modes of H.263+, such as the Deblocking Filter mode and the Improved PB-frames mode. However, since the H.263+ optional modes usually require relatively complex functions that are performed on large data types, the hardware dependent speed improvements are not expected to be very significant. Another interesting research direction is to extend the hardware dependent techniques presented in Chapter 5 to similar SIMD architectures, such as the VIS architecture of Sun's UltraSparc [49], the VLIW architecture of TI's C6x, and Philips' Trimedia chips.

Bibliography

[1] Y. Arai, T. Agui, and M. Nakajima, "A Fast DCT-SQ Scheme for Images", Transactions of IEICE, pp. 1095-1097, 1988.
[2] V. Bhaskaran, K. Konstantinides, "Image and Video Compression Standards", Kluwer Academic Publishers, 1995.
[3] J. Blinn, "Fugue for MMX", Computer Graphics and Applications, pp. 88-93, April 1997.
[4] R. Booth, "Inner Loops - A Source Book for Fast 32-Bit Software Development", Addison-Wesley Developers Press, 1997.
[5] D. Chan, J. Yang, C. Fang, "Fast Implementation of MPEG Audio Coder Using Recursive Formula with Fast Discrete Cosine Transforms", IEEE Transactions on Speech and Audio Processing, Vol. 4, pp. 114-148, March 1996.
[6] S. L. Chang, T. Ogunfunmi, "A Fast DCT Implementation and Application in MPEG-1 Video Compression", IEEE Trans. on Signal Processing, pp. 961-964, 1996.
[7] G. Cote, B. Erol, M. Gallant and F. Kossentini, "H.263+: Video Coding at Low Bitrates", to appear in IEEE Transactions on Circuits and Systems for Video Technology, September 1998.
[8] G. Cote, M. Gallant and F. Kossentini, "Efficient Motion Vector Estimation and Coding for H.263-Based Very Low Bit Rate Video Compression", submitted to IEEE Transactions on Image Processing, June 1997.
[9] B. Erol, M. Gallant, G. Cote and F.
Kossentini, "The H.263+ Video Coding Standard: Complexity and Performance", IEEE Data Compression Conference, Snowbird, Utah, pp. 259-268, March 1998.
[10] N. Faerber, B. Girod, E. Steinbach, "Performance of the H.263 Video Compression Standard", to appear in Journal of VLSI Signal Processing: Systems for Signal, Image, and Video Technology, Special Issue on Recent Development in Video: Algorithms, Implementation and Applications, 1997.
[11] E. Feig and S. Winograd, "Fast Algorithms for Discrete Cosine Transform", IEEE Trans. Signal Processing, pp. 2174-2193, 1992.
[12] D. L. Gall, "MPEG: A Video Compression Standard for Multimedia Applications", Communications of the ACM, vol. 34, pp. 47-58, April 1991.
[13] M. Gallant, G. Cote, B. Erol and F. Kossentini, "UBC's H.263+ Public Domain Software, Release 1", ITU-T Study Group 16, Video Experts Group, Document Q15-B-33, Sunriver, September 1997.
[14] B. Girod, K. B. Younes, R. Bernstein, P. Eisert, N. Farber, F. Hartung, U. Horn, E. Steinbach, and T. Wiegand, "Recent Advances in Video Compression", in IEEE International Symposium on Circuits and Systems, February 1996.
[15] B. Girod, "Advances in Digital Image Communication", in Proceedings 2nd Erlangen Symposium, Erlangen, Germany, April 1997.
[16] B. Girod, E. Steinbach, and N. Faerber, "Comparison of the H.263 and H.261 Video Compression Standards", in Standards and Common Interfaces for Video Information Systems, K. R. Rao, editor, Critical Reviews of Optical Science and Technology, Philadelphia, Pennsylvania, Oct. 1995, vol. 60, pp. 233-251.
[17] M. Hans, "An MPEG Audio Decoder Based on 16-Bit Integer Arithmetic and SIMD Usage", IEEE Transactions on Signal Processing, pp. 463-468, 1997.
[18] M. Hans, V. Bhaskaran, "A Compliant MPEG-1 Layer II Audio Decoder with 16-Bit Arithmetic Operations", IEEE Signal Processing Letters, Vol. 4, pp. 121-122, May 1997.
[19] Hewlett-Packard PA-RISC web site at http://www.hp.com/
[20] Intel web site at http://www.intel.com/
[21] Intel's MMX Technology Programmers Reference Manual at http://developer.intel.com/drg/mmx/manuals/prm/prm.htm
[22] Intel VTune software web site at http://developer.intel.com/design/perftool/vtcd/
[23] ITU Telecom. Standardization Sector of ITU, "Video Coding for Low Bitrate Communication", ITU-T Recommendation H.261, 1991.
[24] ITU Telecom. Standardization Sector of ITU, "Video Coding for Low Bitrate Communication", ITU-T Recommendation H.263, March 1996.
[25] ITU Telecom. Standardization Sector of ITU, "Video Coding for Low Bitrate Communication", Draft ITU-T Recommendation H.263 Version 2, September 1997.
[26] ITU Telecom. Standardization Sector of ITU, "Video Codec Test Model Near-Term, Version 8 (TMN8)", H.263 Ad Hoc Group, June 1997.
[27] Y. Jeong, I. Lee, T. Yun, G. Park, K. Park, "A Fast Algorithm Suitable for DCT Implementation with Integer Multiplication", IEEE TENCON - Digital Signal Processing Applications, pp. 784-787, 1996.
[28] B. Lee, "A New Algorithm to Compute the Discrete Cosine Transform", IEEE Trans. Signal Processing, pp. 1243-1245, December 1984.
[29] Y. Lee, F. Kossentini, M. Smith, and R. Ward, "Prediction and Search Techniques for RD-Optimized Motion Estimation in a Very Low Bit Rate Video Coding Framework", in Proc. ICASSP, Munich, Germany, April 1997.
[30] Y. Lee, F. Kossentini, R. Ward and M. Smith, "Towards MPEG-4: An Improved H.263-Based Video Coder", Signal Processing: Image Communication, Special Issue on MPEG-4, pp. 143-148, October 1997.
[31] H. Li, A. Lundmark, and R. Forchheimer, "Image Sequence Coding at Very Low Bitrates: A Review", IEEE Trans. on Image Processing, vol. 3, pp. 568-609, September 1994.
[32] W. Li, Y. Q. Zhang, and M. L. Liou, "Special Issue on Advances in Image and Video Compression", Proc. of the IEEE, vol. 83, no. 2, pp. 135-340, Feb. 1995.
[33] MTTTY Serial Communication software web site at http://premium.microsoft.com/msdn/library/techart/msdn_serial.htm
[34] W. B. Pennebaker, J. L. Mitchell, "JPEG: Still Image Compression Standard", Van Nostrand Reinhold, New York, 1993.
[35] Pentium II Processor Developer's Manual, Intel Architecture Optimizations Manual, Intel Architecture Software Developer's Manual Volume 1: Basic Architecture, Intel Architecture Software Developer's Manual Volume 2: Instruction Set Reference at http://developer.intel.com/design/PentiumII/manuals/
[36] A. Peleg, U. Weiser, "MMX Technology Extension to the Intel Architecture", IEEE Micro, pp. 42-50, August 1996.
[37] A. Peleg, S. Wilkie, U. Weiser, "Intel MMX for Multimedia PCs", Communications of the ACM, pp. 25-38, January 1997.
[38] K. R. Rao, J. J. Hwang, "Techniques & Standards for Image Video & Audio Coding", Prentice-Hall, 1996.
[39] J. Ribas-Corbera and S. Lei, "Optimal Quantizer Control in DCT Video Coding for Low-Delay Video Communications", in Picture Coding Symposium, Berlin, Germany, Sept. 1997.
[40] K. Rijkse, "H.263: Video Coding for Low-Bit-Rate Communication", IEEE Communications Magazine, pp. 42-45, December 1996.
[41] K. Sayood, "Introduction to Data Compression", Morgan Kaufmann Publishers, 1996.
[42] Y. Senda, H. Harasaki, M. Yamo, "A Simplified Motion Estimation Using an Approximation for the MPEG-2 Real-Time Encoder", IEEE, pp. 2273-2276, 1995.
[43] A sample implementation of the 2D-IDCT, web site at http://rnvs.informatik.tu-chemnitz.de/~ja/MPEG/HTML/IDCT.html
[44] Sample MMX audio, communications and video routines on the Intel web site at http://developer.intel.com/drg/mmx/appnotes/
[45] Signal Processing & Multimedia Group, University of British Columbia, "TMN (H.263+) encoder/decoder, version 3.0", TMN (H.263+) codec, September 1997. See web site at http://www.ee.ubc.ca/spmg/.
[46] Silicon Graphics web site at http://www.sgi.com/
[47] K. H.
Tzou, H. G. Musmann, and K. Aizawa, "Special Issue on Very Low Bit Rate Video Coding", IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, no. 3, pp. 213-367, June 1994.
[48] VIC software web site at http://www-nrg.ee.lbl.gov/vic/#overview
[49] VIS extension of Sun UltraSparc web site at http://www.sun.com/microelectronics/vis/
[50] J. Wen and J. D. Villasenor, "A Class of Reversible Variable Length Codes for Robust Image and Video Coding", in International Conference on Image Processing, Santa Barbara, CA, Oct. 1997.
[51] S. Wenger, "Video Redundancy Coding in H.263+", in Audio-Visual Services over Packet Networks, Scotland, UK, Sept. 1997.
[52] R. Westwater, "Real-time Video Compression: Techniques and Algorithms", Kluwer Press, 1997.

List of Acronyms

AIC    Advanced Intra Coding
AIV    Alternative Inter VLC
AP     Advanced Prediction
ASIC   Application Specific Integrated Circuit
bps    Bits Per Second
CBP    Coded Block Pattern
CIF    Common Intermediate Format
COD    Coded Macroblock Indication
DCT    Discrete Cosine Transform
DF     Deblocking Filter
DPCM   Differential Pulse Code Modulator
DSP    Digital Signal Processor
EOB    End Of Block
FIFO   First In First Out
FFT    Fast Fourier Transform
fps    Frames Per Second
GOB    Group Of Blocks
GSTN   General Switched Telephone Network
IA     Intel Architecture
IDCT   Inverse Discrete Cosine Transform
IPB    Improved PB-frames
ISD    Independently Segmented Decoding
ITU-T  International Telecommunication Union Telecommunication Standardization Sector
JPEG   Joint Picture Expert Group
kbps   Kilo Bits Per Second
KLH    Karhunen-Loeve-Hotelling
MAE    Minimum Absolute Error
MB     Macroblock
ME     Motion Estimation
MV     Motion Vector
MVD    Motion Vector Data
MPEG   Moving Picture Expert Group
MQ     Modified Quantization
MSE    Minimum Squared Error
MSVC   Microsoft Visual C
NAN    Not A Number
PB     PB-frames
PC     Personal Computer
PSNR   Peak Signal to Noise Ratio
PSTN   Public Switched Telephone Network
RPR    Reference Picture Resampling
RPS    Reference
Picture Selection R R U Reduced Resolution Update RTP Real-time Transport Protocol SAC Syntax-based Arithmetic Coding SAD Sum of Absolute Differences SEI Supplemental Enhancement Information SIMD ( Single Instruction Multiple Data SS Slice Structured T M N Test Model Near-term U M V Unrestricted Motion Vector VIC Video Conferencing Tool V L C Variable Length Code V L D Variable Length Decode 104 Appendix A Overall Performance Improvements Besides developing a number of hardware independent and dependent techniques to achieve an improvement in encoding speed, we also restructured our public domain H.263+ video encoding software performing faster function calls, re-arranging memory allocation operations and optimizing some of the computations. The overall performance improvements in encoding speed are summarized in Table 17. The table shows the encoding times for 100 frames of the high motion, F O R E M A N , and low motion, A K I Y O , video sequences. The simulations, were performed on a PC with a Pentium M M X 200 MHz processor and 64 M B of R A M . The PC runs under the Windows95 operating system and our software does not use any of the multitasking features of the operating system. Each simulation was repeated several times and the numerical results were averaged. As can be seen from the table, the encoding speed improves almost 3 times in baseline coding when the fast M V search is enabled and twice as much in the other cases. Only.a 100% speed improvement achieved for the full M V search case might seem to be surprising since 75% of all the operations correspond to SAD computations and the M M X -optimized SAD implementation is 3-4 times faster than that of unoptimized SAD one. However, this should be expected because, as indicated in Chapter 5, the partial SAD computation technique, which is very useful in full search M E , cannot be employed efficiently in the M M X implementation. 
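The partial SAD computation with early termination can be sketched in C as follows. This is an illustrative sketch, not the thesis code; the function name and the 16x16 macroblock size are assumptions. The search loop passes in the best SAD found so far, and the candidate is abandoned as soon as the running sum can no longer beat it:

```c
#include <stdlib.h>

#define MB_SIZE 16

/* Sum of absolute differences between the current macroblock and a
 * candidate block in the reference frame.  After each row, the partial
 * sum is compared against the best SAD found so far; once it is equal
 * or larger, the candidate cannot win and the remaining rows are
 * skipped.  This per-row test is exactly what cannot be done cheaply in
 * the MMX version, which always processes all 16 rows. */
static int sad16_early_exit(const unsigned char *cur, const unsigned char *ref,
                            int stride, int best_so_far)
{
    int sad = 0;
    int row, col;

    for (row = 0; row < MB_SIZE; row++) {
        for (col = 0; col < MB_SIZE; col++)
            sad += abs(cur[col] - ref[col]);
        if (sad >= best_so_far)   /* partial SAD already too large */
            return sad;           /* terminate without the remaining rows */
        cur += stride;
        ref += stride;
    }
    return sad;
}
```

In a full search, most candidates are rejected after only a few rows, which is why the plain C version closes much of the gap to the MMX SAD kernel.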
Therefore, in the MMX implementation all the SAD computations are performed until the last row is processed, whereas in the C implementation many of the SAD computations are terminated after processing, on average, only the first couple of rows.

                                                 Encoding times before       Encoding times after
                                                 optimization (sec)          optimization (sec)
                                                 FOREMAN      AKIYO          FOREMAN      AKIYO
H.263+ Baseline coding (Full ME)                    57.2       42.8             28.4       27.0
H.263+ Baseline coding (Fast ME)                    23.5       17.3              8.4        6.5
H.263+: MQ, DF, AIC, AIV, IPB, UMV, AP
modes turned on (Fast ME)                           33.7       26.1             16.8       14.9

Table 17. The overall performance improvements in video encoding speed.

Appendix B

Example Applications for the H.263+ Video Coding Standard

The H.263+ video coding standard specifically targets low bit rate applications, such as video telephony and video conferencing, and low bit rate networks, such as the PSTN (Public Switched Telephone Network) and wireless networks. In order to demonstrate the capabilities of this new video coding standard, two applications were developed with the sponsorship of Avtel and the collaboration of the Technical University of Berlin. This appendix provides an overview of these applications.

6.1 Video Telephony/Conferencing over a Serial Link

A video telephony/conferencing application that allows two PCs to communicate over a serial link was developed with the collaboration of Guy Cote and Michael Gallant. The project was sponsored by Avtel. The purpose of the application was to demonstrate that, using our H.263+ video encoding algorithms, it is possible to achieve video communication with acceptable quality levels over a link with a bit rate as low as 7200 bits/sec. The target platform of such an application is wireless networks. The application currently allows only one-way video communication, and speech transmission is not supported.
In the future, the coded speech bit stream will be multiplexed with the video bit stream by using the ITU-T H.324 terminal standard. In order to provide serial connection capabilities, our H.263+ encoder and decoder software is merged with publicly available serial communication software: Microsoft MTTTY [33]. This Win32 application runs on PCs with the Windows 95/NT operating systems. Two wireless modems, each having a 9600 bits/sec bandwidth, are connected to the serial ports. The remaining 2400 bits/sec of bandwidth is reserved for transmitting speech. Figure 39 shows the platform and the simple block diagram of the application.

Figure 39. Video telephony application over a serial link using wireless modems (a camera supplies video data in YUV format to the H.263+ encoder and the MTTTY serial communication software on a laptop running Windows 95; the 7200 bits/sec bit stream travels over the wireless modems to a PC running Windows NT, where MTTTY passes it to the H.263+ decoder for reconstruction).

Since the bit rate is quite low, the smallest predefined picture format, sub-QCIF (128x96), is used. It is also possible to use larger picture formats, such as QCIF, but the picture quality would then be degraded. Also, a rate control algorithm, which is defined in [26] and whose performance is discussed in [7], is employed to keep the bit rate constant at 7200 bits/sec. No error resilience modes are employed, as they are not supported in our H.263+ implementation. At very low bit rates, blocking artifacts increase and the color of the video becomes less visible because of the 2:1 downsampling and the coarse quantization of the color components.
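The bit budget implied by these numbers can be checked with a little arithmetic. The sketch below is ours (the function names are not from the thesis); it only restates the figures in the text, namely a 9600 bits/sec modem link with 2400 bits/sec reserved for speech:

```c
/* Bit budget for the serial-link demo: a 9600 bits/sec wireless modem
 * link minus the 2400 bits/sec reserved for speech leaves 7200 bits/sec
 * for the H.263+ video bit stream. */
static double video_bits_per_sec(double modem_bps, double speech_bps)
{
    return modem_bps - speech_bps;
}

/* Average number of bits available per coded frame at a given frame
 * rate.  At 7200 bits/sec and 10 fps this is only 720 bits per
 * sub-QCIF frame, which is why the smallest picture format and a tight
 * rate control algorithm are needed. */
static double bits_per_frame(double video_bps, double fps)
{
    return video_bps / fps;
}
```

With video_bits_per_sec(9600, 2400) = 7200 and a 10 fps target, bits_per_frame yields an average of just 720 bits per frame.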
In this case, enabling the Deblocking Filter and Modified Quantization modes helps improve the subjective quality significantly without adding much computation time. Even when the processor power needed for frame grabbing and serial communication is included, we still manage to encode and decode at approximately 10 fps.

6.2 Internet Video Conferencing

Figure 40. VIC application using the H.263+ codec (a camera supplies video data in YUV format to the H.263+ encoder and the Video Conferencing (VIC) Tool on a PC running Linux; a variable bit rate RTP stream carries the video to the VIC tool and the H.263+ decoder on an SGI running IRIX).

The UCB*/LBNL** video tool, VIC, provides a real-time multimedia application for video conferencing over the Internet [48]. VIC uses the Draft Internet Standard Real-time Transport Protocol (RTP) for communication over the Internet. Even though it comes with some video coding standards embedded in the application, such as motion JPEG and H.261, it is possible to use it with other video coding algorithms, such as ours. Gerd Knorr and Stephan Wenger from the Technical University of Berlin, Germany, developed an Internet video conferencing application by merging the VIC tool with our publicly available H.263+ codec. The platform and the simple block diagram of the application are shown in Figure 40. They also implemented the Slice Structured mode and the Reference Picture Selection mode of H.263+ for enhanced error resilience capabilities. Also, in order to provide synchronization points, one frame every 5 seconds was intra coded. We recently combined our optimized H.263+ encoder with their application and achieved an encoding performance of approximately 8-10 fps in QCIF resolution on an SGI O2 computer with a 175 MHz processor and 128 MB of RAM.
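The periodic intra refresh used for synchronization (one intra frame every 5 seconds) comes down to a simple frame counter. The sketch below is a hypothetical illustration, not the VIC/H.263+ integration code, and it assumes an approximately fixed frame rate:

```c
/* Decide whether frame number `frame_no` should be intra coded, so that
 * decoders joining the RTP session get a synchronization point every
 * `interval_sec` seconds.  Assumes a roughly constant frame rate;
 * frame 0 is always intra coded. */
static int force_intra(long frame_no, double fps, double interval_sec)
{
    long period = (long)(fps * interval_sec + 0.5); /* frames between intras */
    if (period < 1)
        period = 1;
    return (frame_no % period) == 0;
}
```

For example, at 10 fps and a 5 second interval, frames 0, 50, 100, ... are intra coded and every other frame is inter coded.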
Over a T1 connection established between Germany and Canada at a non-peak time, the packet loss rate was approximately 3-10%, and there was almost no quality degradation.

* University of California, Berkeley
** Lawrence Berkeley National Laboratory