"Applied Science, Faculty of"@en . "Electrical and Computer Engineering, Department of"@en . "DSpace"@en . "UBCV"@en . "Mansour, Hassan"@en . "2009-12-23T18:34:49Z"@en . "2005"@en . "Master of Applied Science - MASc"@en . "University of British Columbia"@en . "Advances in digital video coding are pushing the boundaries of multimedia\r\nservices and Internet applications to mobile devices. The Scalable Video\r\nCoding (SVC) project is one such development that provides different video\r\nquality guarantees to end users on mobile networks that support different\r\nterminal capability classes. Such networks are Digital Video Broadcast-\r\nHandheld (DVB-H) and the Third Generation Partnership Project's (3 GPP)\r\nMultimedia Broadcast Multicast Service (MBMS). In this thesis, we propose\r\na multiple-description coding scheme for SVC (MD-SVC) to provide\r\nerror resilience to SVC and ensure the safe delivery of the video pay load\r\nto end users on MBMS networks. Due to the highly error-prone nature of\r\nwireless environments, received video quality is normally guaranteed by either\r\nvideo redundancy coding, retransmissions, forward-error-correction, or\r\nerror-resilient video coding. MD-SVC takes advantage of the layered structure\r\nof SVC to generate two descriptions (or versions) of the higher layer\r\n(enhancement layer) frames in SVC while utilizing Unequal Erasure Protection\r\n(UXP) to efficiently protect the base layer frames. The result is three\r\nseparable streams that can be transmitted over the same channel or over three separate channels to minimize the packet loss rate. The two enhancement\r\ndescriptions are independently decodable, however, both descriptions\r\ndepend on the error-free reception of the base layer. Furthermore, error detection\r\nand concealment features are added to the SVC decoder to cope with\r\nframe losses. The proposed scheme is implemented and integrated fully into\r\nthe SVC codec and tested using a 3GPP/3GPP2 offline network simulator.\r\nObjective and subjective performance evaluations show that under the same\r\npacket loss conditions our scheme outperforms the single description SVC\r\nand existing scalable multiple-description coding schemes."@en . "https://circle.library.ubc.ca/rest/handle/2429/17264?expand=metadata"@en . "Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC) by Hassan Mansour B.E., The American University of Beirut, 2003 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in The Faculty of Graduate Studies (Electrical Engineering) The University Of British Columbia August 29, 2005 \u00C2\u00A9 Hassan Mansour 2005 ii Abstrac t Advances in digital video coding are pushing the boundaries of multimedia services and Internet applications to mobile devices. The Scalable Video Coding (SVC) project is one such development that provides different video quality guarantees to end users on mobile networks that support different terminal capability classes. Such networks are Digital Video Broadcast-Handheld (DVB-H) and the Third Generation Partnership Project's ( 3 G P P ) Multimedia Broadcast Multicast Service (MBMS). In this thesis, we pro-pose a multiple-description coding scheme for SVC (MD-SVC) to provide error resilience to SVC and ensure the safe delivery of the video pay load to end users on MBMS networks. 
Due to the highly error-prone nature of wireless environments, received video quality is normally guaranteed by video redundancy coding, retransmissions, forward error correction, or error-resilient video coding. MD-SVC takes advantage of the layered structure of SVC to generate two descriptions (or versions) of the higher layer (enhancement layer) frames in SVC while utilizing Unequal Erasure Protection (UXP) to efficiently protect the base layer frames. The result is three separable streams that can be transmitted over the same channel or over three separate channels to minimize the packet loss rate. The two enhancement descriptions are independently decodable; however, both descriptions depend on the error-free reception of the base layer. Furthermore, error detection and concealment features are added to the SVC decoder to cope with frame losses. The proposed scheme is implemented and integrated fully into the SVC codec and tested using a 3GPP/3GPP2 offline network simulator. Objective and subjective performance evaluations show that under the same packet loss conditions our scheme outperforms the single description SVC and existing scalable multiple-description coding schemes.

Contents

Abstract
Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
1 Introduction
2 Digital Video Transmission and Coding, From H.264/AVC To SVC
  2.1 Multimedia Broadcast/Multicast Services
    2.1.1 MBMS Modes
  2.2 Overview of the H.264/AVC Standard
    2.2.1 Motion Estimation and Motion Compensation
  2.3 Scalable Video Coding
  2.4 Error Resilience in Video Coding
    2.4.1 Unequal Erasure Protection in SVC
    2.4.2 Multiple Description - Motion Compensated Temporal Filtering
3 Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC)
  3.1 Multiple Description Coding of High Pass (HP) Frames
    3.1.1 Multiple Description Coding of Prediction (Motion) Data
    3.1.2 Entropy Coding of Error Recovery Data
    3.1.3 Multiple Description Coding of Residual Data
  3.2 Theoretical Analysis
  3.3 Multiplexing and Error Concealment
4 Simulation Results
5 Conclusion and Future Work
Bibliography

List of Tables

2.1 MBMS video delivery bandwidth resources
3.1 MDC macroblock classification
3.2 MB mode and the associated coded BN indices
3.3 Syntax element classification in MDC
3.4 BestNeighbor symbol probabilities and the corresponding binstrings
3.5 BestNeighbor context models
3.6 Relation between motion index and the coded video file size
3.7 Error handling routine description
4.1 Coding configuration of the test sequences
4.2 MD scheme redundancy

List of Figures

2.1 Example of (a) Broadcast Mode and (b) Multicast Mode Networks
2.2 Macroblock and sub-macroblock partitions in H.264/AVC
2.3 Example of a coded video sequence
2.4 Hybrid encoder similar to the H.264/AVC encoder
2.5 Functional structure of the Inter prediction process
2.6 Motion vector and reference index assignment in the motion estimation process
2.7 MCTF temporal decomposition process
2.8 Basic structure of a 5 layer combined scalability encoder
2.9 UXP transmission sub-block (TSB) packetization
2.10 Coding structure of the MD-MCTF
3.1 Coding structure of the MD-MCTF
3.2 Single layer MCTF encoder with additional and modified modules for multiple-description coding
3.3 Basic structure of the Multiple Description Coding module with a two description coder of High Pass frames
3.4 Quincunx lattice structure of prediction data MBs
3.5 Typical motion data macroblock map of a HP frame in description 1
3.6 Change in motion vector prediction: (a) standard defined neighbor MBs, (b) MDC defined neighbor MBs
3.7 Neighboring 8x8 block indices
3.8 Fixed Length binarization for the BestNeighbor syntax element
3.9 Multiple-description coding of residual frames in a temporal-scalable layer with three temporal levels. Dark blocks indicate Intra-coded macroblocks in the residual frame
3.10 Frequency response properties of the [-1/8 1/4 3/4 1/4 -1/8] low pass filter
3.11 Frequency response of the [-1/2 1 -1/2] high pass filter
3.12 Decomposition of a group of pictures (GOP) of size 8. The shaded frames will be coded and transmitted in the bitstream
3.13 Reconstruction of a group of pictures showing the MDC high pass frames with zero residual marked with the red Xs
3.14 Variation of motion index with respect to GOP number in the Crew sequence
3.15 Crew sequence showing (a) frames from GOP number 2 and (b) frames from GOP number 10
3.16 Comparison of the motion index values between the 113 coded frames of the four sequences Crew, City, Foreman, and Harbour
3.17 (a) Original Multiplex module of the SVC encoder, (b) modified Multiplex module to accommodate MDC frames
3.18 Basic structure of the MDC Decoder with the Inverse MDC module shown in gray
3.19 Multiple-description coded group of pictures with four SVC layers of combined scalability
3.20 GOP with five possible frame-loss scenarios
4.1 RTP encapsulation of NALUs with varying sizes
4.2 CDMA-2000 user plane protocol stack packetization presented in VCEG-N80
4.3 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Crew sequence
4.4 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Foreman sequence
4.5 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Harbour sequence
4.6 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the City sequence
4.7 Comparison of MD-SVC and MD-MCTF redundancy rates and associated distortions
4.8 Comparison of visual quality from the Foreman sequence, MD-SVC (left), MD-MCTF (right)
4.9 Comparison of visual quality from the Crew sequence, MD-SVC (left), MD-MCTF (right)
4.10 Comparison of visual quality from the City sequence, MD-SVC (left), MD-MCTF (right)
4.11 Comparison of visual quality from the Harbour sequence, MD-SVC (left), MD-MCTF (right)

List of Acronyms

3GPP 3rd Generation Partnership Project
AVC Advanced Video Coding
BLER BLock Error Rate
BN Best Neighbor
BS Base Station
CABAC Context-Adaptive Binary Arithmetic Coding
CAVLC Context-Adaptive Variable Length Coding
CBP Coded Block Pattern
CDMA Code Division Multiple Access
CRC Cyclic Redundancy Code
DCT Discrete Cosine Transform
DVB-H Digital Video Broadcasting - Handheld
FEC Forward Error Correction
GERAN GSM/EDGE Radio Access Network
GOP Group Of Pictures
GSM Global System for Mobile communications
HP High Pass
ICT Integer Cosine Transform
IPDC IP Data Cast
LP Low Pass
LPS Least Probable Symbol
LTU Logical Transmission Unit
MB Macroblock
MBMS Multimedia Broadcast Multicast Services
MCTF Motion Compensated Temporal Filtering
MDC Multiple Description Coding
MD-MCTF Multiple Description Motion Compensated Temporal Filtering
MD-SVC Multiple Description Scalable Video Coding
MPEG Moving Pictures Experts Group
MPS Most Probable Symbol
MTAP Multi-Time Aggregation Packet
MV Motion Vector
NAL Network Abstraction Layer
PDU Protocol Data Unit
PLMN Public Land Mobile Network
PPP Point to Point Protocol
PSNR Peak Signal to Noise Ratio
QP Quantization Parameter
QoS Quality of Service
RFS Radio Frame Size
RLC Radio Link Control
RS Reed-Solomon
RTP Real-time Transport Protocol
SAD Sum of Absolute Differences
SD Single Description
SDU Service Data Unit
SNR Signal to Noise Ratio
SSD Sum of Square Differences
SVC Scalable Video Coding
TSB Transmission Sub-block
UDP User Datagram Protocol
UE User Equipment
UMTS Universal Mobile Telecommunications System
UTRAN Universal Terrestrial Radio Access Network
UXP Unequal Erasure Protection
VCEG Video Coding Experts Group

Acknowledgements

I would like to thank my supervisors, Professor Panos Nasiopoulos and Professor Victor Leung, for their time, patience, and support. To the reviewers, I express my gratitude for their constructive feedback. To my best friend and "buddy", Rayan, I would like to express my appreciation for always being there for me. I would also like to thank Lina and Rachel for tolerating my irritability during the last few weeks. Last but not least, I would like to thank my mom, dad, and brother for their endless love and support.

This work was supported by a grant from Telus Mobility, and by the Natural Sciences and Engineering Research Council of Canada under grants CRD247855-01 and CBNR11R82208.

Chapter 1

Introduction

The distribution of multimedia services to mobile devices is finally a reality. Multimedia service providers are teaming up with mobile telecommunications operators to deliver to the public numerous services, from transferring light video and audio clips to the heavy duty streaming of mobile TV [1], [2]. There are currently two major solutions for the support of mobile multimedia delivery, namely, 3GPP's MBMS (Multimedia Broadcast/Multicast Service) and DVB 2.0's DVB-H (Digital Video Broadcast-Handheld) supported by Nokia [3].
These solutions suffer, however, from limited network bandwidth resources in addition to transmission errors that significantly affect the quality of the distributed video. MBMS and DVB-H therefore rely on the recent advances in digital video coding technology to deliver video content that efficiently utilizes the allocated channel bandwidth while ensuring various QoS (Quality of Service) requirements. The newest international video coding standard is H.264/AVC. Approved by ITU-T as Recommendation H.264 and by ISO/IEC as International Standard 14496-10 (MPEG-4 Part 10) Advanced Video Coding (AVC), H.264/AVC has proved its superiority to previous video coding standards (MPEG-4 Visual, MPEG-2, H.263) through its improved coding efficiency and its provision of a network-friendly video representation [4]. However, coding efficiency alone is not enough to support QoS features, due to the highly error-prone nature of wireless environments and the unexpected fluctuation in available bandwidth. Therefore, the demand for bandwidth-adaptive codecs and robust error-resilient techniques is constantly increasing. The Scalable Video Coding (SVC) standardization project was launched by MPEG (Moving Pictures Experts Group) and ITU-T's Video Coding Experts Group (VCEG) in January 2005 as an amendment to their H.264/AVC standard to solve the bandwidth fluctuation problem and offer multiple QoS guarantees to the end user [5]. However, SVC presently does not offer any error-resilient features that can protect all the layers in the coded bitstream. Existing solutions to protect a coded video stream include the use of Forward Error Correction (FEC) codes or Unequal Erasure Protection (UXP) techniques to combat bit-errors in wireless transmission. Unfortunately, these techniques fail to recover any packet losses that might occur due to network congestion.

An alternative to FEC and UXP is the implementation of Multiple Description Coding (MDC) of video content in order to protect against packet losses. In MDC, a coded video sequence is separated into multiple descriptions (or versions) of the coded sequence, such that each description is independently decodable and provides a decoded video quality that is poorer than the original video quality. However, if all the descriptions are received by the decoder, then the original video quality can be regenerated. The independent decodability feature is made possible at the expense of additional coding overhead, also known as data "redundancy" between the various descriptions. The drawback of existing multiple-description coding techniques lies in the inefficient allocation of redundant data. The allocation of redundant data controls both the level of redundancy in the coded video bitstream and the quality (or distortion) produced from decoding only one description.

In this thesis, we develop a Multiple-Description Coding scheme (MD-SVC), specifically designed for the scalable extension of H.264/AVC (SVC), that dramatically improves the error resilience of the SVC codec. Our proposed MD-SVC scheme generates two descriptions of the enhancement layers of an SVC coded stream by embedding in each description only half of the motion information and half of the texture information of the original coded stream, with a minimal degree of redundancy.
The two descriptions are complementary but independently decodable, such that if only one description is received, the decoder will be able to recover the missing motion information from the available data and generate an output video sequence with an acceptable degradation in quality. If both descriptions are received, then the full quality of a single description SVC stream is delivered. Furthermore, we have implemented our proposed scheme and integrated all of its functionalities into the existing SVC standard. We have also added error detection and error concealment features to the SVC decoder, which did not previously exist. These error handling routines help the decoder cope with packet losses that might arise due to the unexpected fluctuations in available bandwidth. The proposed framework thus provides a highly error-resilient video bitstream that requires no retransmissions or feedback channels, while minimizing any channel overhead imposed by the video redundancy due to multiple description coding.

The rest of this thesis is organized as follows. In Chapter 2, we discuss some of the features supported by the MBMS video distribution framework. Next, we introduce the basic functions of digital video coding as implemented in H.264/AVC, leading up to the motion-compensated temporal filtering structure of SVC. In Chapter 3, we develop our multiple-description coding scheme and give a detailed account of its integration into the SVC codec. We also offer a theoretical analysis of the redundancy and distortion imposed by our scheme. In Chapter 4, we describe the network simulation setup and compare the performance of our MD-SVC scheme with other multiple description and single description coding methods. Finally, we conclude our work in Chapter 5 and offer suggestions for future research in this field.

Chapter 2

Digital Video Transmission and Coding, From H.264/AVC To SVC

2.1 Multimedia Broadcast/Multicast Services

While DVB-H is still in its testing stages, 3GPP's MBMS has already completed its first stage with Release 6 of the Universal Mobile Telecommunications System (UMTS) in September of 2004 [6]. Broadcast and multicast are two IP datacast (IPDC) type services for transmitting datagrams from a single server to multiple clients (point-to-multipoint) that can be supported by the existing GSM (Global System for Mobile communications) and UMTS cellular networks [6], [1].

2.1.1 MBMS Modes

MBMS defines two functional modes: broadcast mode and multicast mode. The MBMS multicast mode differs from the broadcast mode in that it requires a client to subscribe to and activate an MBMS service, whereas the broadcast mode does not. The broadcast mode is a unidirectional point-to-multipoint bearer service in which multimedia data is transmitted from one server to multiple users in a broadcast service area. This service efficiently utilizes the available radio and network resources by transmitting data over a common radio channel that can be received by all users within the service area. The multicast mode similarly allows a unidirectional point-to-multipoint transmission of multimedia data, but the users are restricted to those clients that belong to the multicast subscription group. [6] specifies that MBMS data transmission should have the ability to adapt to different RAN (Radio Access Network) resources and capabilities by efficiently managing the bit-rate of the MBMS data.
Furthermore, individual broadcast/multicast services are allocated to independent broadcast areas/multicast groups that may overlap. Quality of Service (QoS) guarantees can also be independently configured by the PLMN (Public Land Mobile Network) for each individual broadcast/multicast service [6]. Figure 2.1 shows examples of the broadcast mode and multicast mode networks as depicted in [6].

Figure 2.1: Example of (a) Broadcast Mode and (b) Multicast Mode Networks.

Handoff between different operators sharing the broadcast service area or multicast subscription group is allowed in MBMS [7]. Therefore, the broadcast/multicast resources will be allocated for all subscribers of a certain operator A in addition to all inbound roamers of operator A. However, if there are not enough resources (such as available bandwidth) to provide the requested service, the MBMS application will have to support the QoS requirements instead. MBMS User Services may be delivered to a user at different bit rates and quality of service depending on radio networks and conditions. Table 2.1 lists the bandwidth resources allocated for video streaming and video download as defined in [7]. Note that the specified bit-rates are those of the user data at the application layer; in GERAN, lower bandwidth is available, which may constrain some applications. Since the MBMS applications must adapt to the resource heterogeneity in UTRAN (Universal Terrestrial Radio Access Network) and GERAN (GSM/EDGE Radio Access Network), it falls upon the design of the multimedia content to provide the required scalability, which gives rise to the need for scalable content, especially in video streaming services.

Table 2.1: MBMS video delivery bandwidth resources.

Service             Media                                           Distribution Scope   MBMS User Service Classification   Application Bit-Rate
Video streaming     Video and auxiliary data (text, still images)   Broadcast            Streaming                          < 384 kbps
Video streaming     Video and auxiliary data (text, still images)   Multicast            Streaming                          < 384 kbps
Video distribution  Video and auxiliary data (text, still images)   Broadcast            Download                           < 384 kbps
Video distribution  Video and auxiliary data (text, still images)   Multicast            Download                           < 384 kbps

MBMS video streaming services also suffer from packet losses and transmission errors that arise from multi-path fading, channel interference, network congestion, and noise. MBMS utilizes Forward Error Correction (FEC) schemes in order to protect the data packets during transmission [8]. The generic mechanism employed for systematic FEC of RTP streams involves the generation of two RTP payload formats, one for FEC source packets and another for FEC repair packets. FEC schemes use Reed-Solomon (RS) codes to protect the data content. A Reed-Solomon code with an x symbol error protection and a codeword length of n symbols is labeled an (n, n−x) RS code. The imposed overhead can then be calculated as n/(n−x) times the original data size. For example, if the codeword length is 255 bytes with a symbol error protection of 50 bytes, then the overhead factor is 255/205 ≈ 1.24, a 24% increase in data size.
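To make the overhead arithmetic concrete, the following Python sketch (an illustration of ours, not part of any MBMS implementation) computes the (n, n−x) RS transmission overhead; the 255/205 figures reproduce the worked example above.

```python
def rs_overhead(n: int, x: int) -> float:
    """Overhead factor of an (n, n - x) Reed-Solomon code.

    n: codeword length in symbols; x: protection (parity) symbols.
    Each codeword carries n - x data symbols, so the coded stream is
    n / (n - x) times the size of the original data.
    """
    if not 0 < x < n:
        raise ValueError("need 0 < x < n")
    return n / (n - x)

# Worked example from the text: 255-byte codewords, 50 protection bytes.
factor = rs_overhead(255, 50)
print(f"overhead factor: {factor:.2f}")   # 1.24
print(f"size increase: {factor - 1:.0%}")  # 24%
```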
Although FEC can be an effective tool, the problems that arise from its implementation include the lack of flexibility in error protection, since the entire bitstream has to be equally protected. Moreover, the redundancy added by including the FEC packets can significantly increase the video payload bit-rate and in turn cause additional congestion in the network. Alternatively, unequal error protection, unequal erasure protection, and error-resilient video coding techniques are more desirable solutions.

Finally, MBMS allows the use of multiple channels for the transmission of broadcast data in order to reduce the transmission power of a broadcast station and allow for flexible resource management. The separate physical channels can have varying power allocation and can thus cover different or overlapping portions of a service area. This scalability option would offer the entire broadcast service area a base quality of service guarantee using a high powered base channel, and the core of the service area an enhanced quality of service through another low powered enhancement channel [9], [1].

2.2 Overview of the H.264/AVC Standard

Digital video coding was first made possible on a worldwide basis with the establishment of the MPEG-1 standard back in 1992. Over the past fifteen years, significant advances have been achieved with the launch of several international video coding standards, such as the well known MPEG-2 standard in 1994, which set off the DVD industry, in addition to more recent standards including ITU-T's H.263 series and ISO/IEC's latest MPEG-4 standard [4], [10]. Video compression is achieved by applying a series of processes, categorized in [10] into the following:

• A prediction process that takes advantage of the spatial correlation within a single image and the temporal correlation between successive images to send "prediction" information (motion vectors, reference indices) to the decoder to help reconstruct a "prediction image". The differences in sample values (residual components) between the original image and the prediction image are then compressed using a different process.

• A transform process that converts the sample values (usually applied to the residual components) into a set of samples where most of the signal power is grouped into fewer samples. The most commonly used transform in image and video coding is the discrete cosine transform (DCT).

• A quantization process that reduces the precision of representing a sample value in order to decrease the amount of data to be coded.

• An entropy coding process that takes advantage of the symbol probabilities to generate binary representations of the coded symbols.

The H.264/AVC standard defines two interworking layers: the video coding layer, which performs the video compression, and the network abstraction layer, which packs the coded data into "network friendly" units that can be easily passed on to different communications networks. In this overview, we will focus our discussion on the video coding layer.
The H.264/AVC video coder is a block-based hybrid video coder that utilizes all of the above mentioned techniques and applies them on a block basis. Therefore, every input image to the video coder is first split into a group of macroblocks, each containing 16x16 samples. In addition to the macroblock structure, H.264/AVC also defines smaller block structures that are partitions of the macroblocks and sub-macroblocks to better capture the details in an image. Figure 2.2 shows the possible partitions supported in H.264, extracted from [11]. The two major components of H.264/AVC that contribute to its superior coding efficiency are the prediction process and the entropy coding process. We will start by giving a brief functional overview of the encoder and then focus on the above-mentioned components.

Figure 2.2: Macroblock and sub-macroblock partitions in H.264/AVC.

There are three types of coded pictures supported in H.264, namely, I pictures, P pictures, and B pictures. An I picture (or frame) is coded using Intra-frame prediction as an independent, self-contained compressed picture. P and B pictures, on the other hand, are Inter-frame predicted: from previous pictures only in P frames, or from both previous and future pictures in B frames. The prediction process itself will be explained in detail in the following section. Given a sequence of pictures, the H.264 encoder codes the first picture as an I-frame. The remaining pictures can be coded as either P-frames or B-frames, based on the delay restrictions imposed by the video application (B-frames require more delay but provide better coding efficiency). Figure 2.3 shows a sequence of coded pictures. The arrows indicate which pictures act as references to other pictures.

Figure 2.3: Example of a coded video sequence.

Notice in Figure 2.3 that P-frames can only use other previously coded P- or I-frames as references. B-frames are never used as references for motion prediction. Therefore, an input picture first enters the H.264/AVC encoder, where an Intra- or Inter-frame prediction decision is made to indicate the type of picture to be used. If Intra-frame prediction is to be implemented, then every macroblock in the picture is Intra-coded. If Inter-frame prediction is used, then a motion estimation process is initiated on the macroblock level to find matching macroblocks in the reference frames. After the motion estimation process is completed, a full frame containing the prediction information (motion vectors and reference indices) is produced, which is used to generate a prediction image composed of the motion compensated blocks, as will be discussed in the next section. The original input picture is then subtracted from the motion compensated image to obtain a residual image. The integer cosine transform is then applied to the residual samples in order to exploit the correlation within them. Next, the transformed coefficients are quantized, and finally the quantized samples are entropy coded and passed on to the network abstraction layer to be packetized and transmitted across the communications network [4].
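The transform and quantization steps can be illustrated with a small numeric sketch. The Python example below uses a plain 4x4 orthonormal DCT as a stand-in for the H.264/AVC integer cosine transform (which approximates this matrix with integer arithmetic), and a single uniform quantization step in place of the QP-derived scaling; both simplifications are ours, for illustration only.

```python
import math

def dct4():
    """4x4 orthonormal DCT-II matrix (stand-in for the integer transform)."""
    rows = []
    for k in range(4):
        c = math.sqrt(0.25) if k == 0 else math.sqrt(0.5)
        rows.append([c * math.cos((2 * j + 1) * k * math.pi / 8) for j in range(4)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Toy 4x4 residual block (original minus motion-compensated prediction).
residual = [[5, 4, 1, 0],
            [4, 3, 1, 0],
            [2, 1, 0, 0],
            [1, 0, 0, 0]]

T = dct4()
coeffs = matmul(matmul(T, residual), transpose(T))   # C = T * R * T'

step = 2.5  # illustrative quantization step; H.264 derives it from QP
quantized = [[round(c / step) for c in row] for row in coeffs]
print(quantized)  # energy compacts into the top-left (low-frequency) corner
```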
Figure 2.4, extracted from [10], gives an example of the functional blocks of the H.264/AVC encoder we just described.

Figure 2.4: Hybrid encoder similar to the H.264/AVC encoder.

Note that in the Inter prediction mode, a P- or B-frame can contain Intra-coded macroblocks if the Intra coding mode yields a lower rate-distortion optimization cost.

2.2.1 Motion Estimation and Motion Compensation

Compression is achieved in I-frames using Intra prediction, a process that takes advantage of the spatial correlation between the samples of a picture. Intra prediction is performed on a macroblock or sub-macroblock basis. For every macroblock, the values of the samples located within the macroblock bounds are estimated using the boundary samples of the neighboring left and above macroblocks. The estimation is performed using a set of predefined Intra-prediction modes. A detailed discussion of the Intra prediction modes can be found in [11] and [4]; however, it is out of the scope of this thesis, since our multiple-description coding scheme does not deal with Intra-coded frames or macroblocks.

The Inter prediction process, on the other hand, takes advantage of temporal redundancies found in successive pictures in a video sequence. In Inter-frame prediction, the encoder runs two complementary processes called motion estimation and motion compensation, on a block basis, to find a match in the neighboring frames for every block within an Inter-coded frame. Figure 2.5 illustrates the general functional structure of the Inter prediction process. Note that only the residual frame is passed through the transform, scaling, and quantization processes. The prediction frame is only entropy coded and multiplexed with the coded residual frame on a macroblock level before being transmitted.

Figure 2.5: Functional structure of the Inter prediction process.

Figure 2.6 demonstrates the motion estimation process, where one block in the current picture, labeled frame N, is estimated by a block of the same size in frame N−2. The output of the motion estimation process is a set of motion vectors (MV) and reference indices (r) for every estimated block in the Inter-predicted frame. Figure 2.6 shows that the current block is allocated a motion vector MV, shown in red, that represents the spatial shift between the matching block and the co-located block in the reference frame.

Figure 2.6: Motion vector and reference index assignment in the motion estimation process.

As a result, the motion estimation process generates a complete frame containing only motion information, such as macroblock modes, motion vectors, and reference indices. Let us refer to this frame as the "prediction frame". The choice of the best matching block is made based on a rate-distortion tradeoff.
Lagrangian optimization is used to minimize both the bit-rate required to encode the block and the distortion resulting from the chosen estimate. The Lagrangian cost function to be minimized is expressed as

    J(m, r | QP, λ) = D(m, r | QP, λ) + λ·R(m, r)

where D is the distortion function, R is the rate for encoding the motion information, and λ is the Lagrangian parameter specified in [12] to be equal to

    λ = sqrt(0.85 × 2^((QP−12)/3)),   if sum of absolute differences (SAD) distortion is used
    λ = 0.85 × 2^((QP−12)/3),        if sum of squared differences (SSD) distortion is used

The distortion term is expressed as

    D_err(P, r_est, m_est) = Σ_{(i,j)∈P} | l_orig[i, j] − l_est[i + m_est,x , j + m_est,y] |^n

where P is the macroblock partition, r the reference index, m the motion vector, l[] the sample value, and n is equal to 1 if err = SAD or 2 if err = SSD.

The prediction frame next enters the motion compensation process, which reconstructs the estimated blocks using the motion vectors and reference indices generated by motion estimation. The output of motion compensation is a blocky image that approximates the original image. The motion compensation process itself is a direct inversion of the motion estimation process, where the reference blocks are grouped together into one image. This motion compensated image is then subtracted from the original image to produce a residual image, to which the transform, scaling, and quantization processes are applied before pairing the blocks up with their respective motion data and entropy coding the entire coded frame [11], [4], [10], [13].
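A minimal sketch of this search is given below. The bit-cost model mv_bits() is a crude stand-in of our own (a real encoder counts the actual entropy-coded bits), frames are plain 2-D lists of luma samples, and only the SAD variant of the Lagrangian parameter is shown.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def mv_bits(dx, dy):
    """Hypothetical rate model R(m, r): longer vectors cost more bits."""
    return 2 * (abs(dx) + abs(dy)) + 2

def motion_search(cur, ref, bx, by, size, rng, lam):
    """Full search minimizing J = D_SAD + lambda * R over a square window."""
    block = [row[bx:bx + size] for row in cur[by:by + size]]
    best_cost, best_mv = None, (0, 0)
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            ry, rx = by + dy, bx + dx
            if not (0 <= ry <= len(ref) - size and 0 <= rx <= len(ref[0]) - size):
                continue  # candidate block falls outside the reference frame
            cand = [row[rx:rx + size] for row in ref[ry:ry + size]]
            cost = sad(block, cand) + lam * mv_bits(dx, dy)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost

# Lagrangian parameter for SAD distortion, as given above.
qp = 26
lam = (0.85 * 2 ** ((qp - 12) / 3)) ** 0.5
```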
2.3 Scalable Video Coding

The subband extension of H.264/AVC was developed at the Heinrich Hertz Institute (HHI) in Germany. In a document [14] presented at the JVT Munich meeting in March of 2004, the authors produced a subband extension of H.264/AVC using Motion Compensated Temporal Filtering (MCTF) based on the lifting representations of the Haar wavelet and the 5/3 wavelet. The design keeps most of the components of the original H.264/AVC standard, while only a few adjustments have been made to support the MCTF structure. The lifting representations of the Haar and 5/3 wavelets are used as filter-banks in uni-directional and bi-directional temporal prediction, respectively. Previous attempts have been made to use lifting wavelet transforms with frame-adaptive motion compensation in the building of video codecs [15]. However, the advantage in [14] lies in the use of the highly efficient motion model of H.264/AVC along with an adaptive switching between the Haar and 5/3 spline wavelets on a block basis.

A group of pictures (GOP) of the original video stream is decomposed into a set of High Pass (or difference) frames after a prediction step and a set of Low Pass (or average) frames after an update step [14]. Figure 2.7, extracted from [5], illustrates the temporal decomposition of a group of 8 pictures. A High Pass (HP) frame is produced after the prediction step and a Low Pass (LP) frame is produced after the update step. Each stage halves the temporal resolution. A high pass frame is equivalent to a coded B-frame in H.264/AVC; it contains both residual data and prediction (motion) data. Each following stage uses the LP frames from the previous stage to produce its respective HP and LP frame sets, which in turn have half the temporal resolution of the previous stage.

Figure 2.7: MCTF temporal decomposition process.

The following equations describe the extended motion-compensated temporal filtering prediction and update operators implemented in SVC:

    P_Haar(s[x, 2k+1]) = s[x + m_P0, 2k − 2r_P0]
    U_Haar(s[x, 2k])   = ½ h[x + m_U0, k + r_U0]                                     (2.1)

    P_5/3(s[x, 2k+1])  = ½ (s[x + m_P0, 2k − 2r_P0] + s[x + m_P1, 2k + 2 + 2r_P1])
    U_5/3(s[x, 2k])    = ¼ (h[x + m_U0, k + r_U0] + h[x + m_U1, k − 1 − r_U1])       (2.2)

where s[] and h[] indicate the y, u, and v samples in the original and high pass frames, respectively; x refers to the spatial coordinates of the luma samples; k is the picture temporal index; and m and r are the prediction information (motion vectors and reference indices).

It is also shown in [14] that using this model it is possible to produce a fully scalable video codec offering temporal, spatial, and SNR scalability. We are interested in the case of combined scalability, such that, as we move from one layer to the subsequent one, the bitstream can lose spatial resolution, temporal resolution, and SNR quality by down-sampling the decomposed sequence spatially and/or temporally. The extended coding structure of an SVC encoder, as specified in [5], is shown in Figure 2.8. Spatial down-sampling between layers is used to reduce spatial resolution. Temporal scalability is controlled by limiting the number of LP and HP frames transmitted within a specific layer. The number of HP frames transmitted can also control the SNR scalability. When all layers are available, the decoder can reconstruct the original video stream in full quality and resolution [16].

Figure 2.8: Basic structure of a 5 layer combined scalability encoder.

This combined scalability helps the bitstream adapt to the conditions of the channel, namely fluctuations in bandwidth and network congestion. However, the bitstream is still vulnerable to bit-errors, which may lead to the dropping of packets. If the packets contain HP information, then the decoder can be modified to recover from the packet loss by inserting a zero residual frame instead of the lost frame. If, on the other hand, an LP packet is dropped, then the consequence is more severe, since that can greatly affect the PSNR and quality of the decoded stream.
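The lifting structure is easy to exercise on a toy example. The sketch below applies the Haar prediction and update steps of equation 2.1 to a GOP of scalar "frames", with motion compensation deliberately omitted (all m and r set to zero); it illustrates the filter-bank arithmetic, not the SVC reference software.

```python
def haar_analysis(frames):
    """One Haar MCTF stage with m = r = 0.
    Prediction: h[k] = s[2k+1] - s[2k]; update: l[k] = s[2k] + h[k] / 2."""
    high = [frames[2 * k + 1] - frames[2 * k] for k in range(len(frames) // 2)]
    low = [frames[2 * k] + high[k] / 2 for k in range(len(frames) // 2)]
    return low, high

def haar_synthesis(low, high):
    """Inverse stage: s[2k] = l[k] - h[k] / 2, then s[2k+1] = h[k] + s[2k]."""
    out = []
    for l, h in zip(low, high):
        even = l - h / 2
        out += [even, h + even]
    return out

gop = [10.0, 12.0, 14.0, 13.0, 11.0, 10.0, 9.0, 9.0]  # GOP of 8 "frames"
low, stages = gop, []
while len(low) > 1:                 # 3 temporal levels for a GOP of 8
    low, high = haar_analysis(low)
    stages.append(high)             # 4 + 2 + 1 = 7 HP frames in total

rec = low                           # final LP frame
for high in reversed(stages):       # invert the stages in reverse order
    rec = haar_synthesis(rec, high)
assert rec == gop                   # perfect reconstruction
```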
Other error-resilience techniques have targeted the design of the video bitstream itself to gen-erate an inherently error resilient coded video representation through multiple descrip-tion coding. Multiple description coding schemes generate several versions of a video sequence which are transmitted either on the same channel or on different channels and paths in order to reduce the probability of transmission errors. 2.4.1 Unequal Erasure Protection in SVC Unequal Erasure Protection (UXP) is used to provide content aware protection to a video bitstream. [17] develops an UXP scheme suited for protecting the base layer of a SVC bitstream. This scheme cannot be applied to the enhancement layer of SVC. We will give a brief overview of the erasure protection scheme implemented in [17]. UXP defines several protection classes for the different layers in a scalable video bitstream. Each of these classes is characterized by a difference in the symbol erasure protection level. Figure 2.9 demonstrates the UXP procedure as it is illustrated in [17]. A trans-mission sub-block (TSB) contains a number of protected NAL units separated by Multi Time Aggregation Unit (MTAP) headers. The MTAP headers specify the NAL unit size, RTP timestamp, and a Decoding Order Number (DON). The DON is used to fix any higher level interleaving problems that might arise due to a reordering of the decoding order of NALs. For further detail on the functioning of MTAP, please refer to [18]. The TSBs are then grouped into one Transmission Block (TB) which, in turn, is encapsulated into several RTP packets. Simulation performed in [17] show that this UXP can successfully protect the base layer information of an SVC bitstream with more than 30% network packet loss rate. Chapter 2. Digital Video Transmission and Coding, From H.264/AVC To SVC 22 The UXP scheme described above is not designed to protect higher layer packets, therefore, another error resilience scheme is required to protect those packets from network errors. 2.4.2 Multiple Description - Motion Compensated Temporal Filtering Recent studies in Internet error resilience technology have shown that Multiple De-scription Coding (MDC) along with path or server diversity can reduce the effects of packet delay and loss [19] [20]. Although this technique can be very effective to combat bit errors, it also suffers from data redundancy and thus requires additional bandwidth. However, it falls upon the design of the multiple-description to solve the problem of data redundancy. Multiple Description Scalable Coding (MDSC) merges scalable coding with MDC. Scalable coding facilitated the adaptability of the different video descriptions to variations in channel bandwidth [19]. One approach to MDSC is Multiple Description Motion Compensated Temporal Filtering (MD-MCTF) developed Chapter 2. Digital Video Transmission and Coding, From H.264/AVC To SVC 23 in [19]. This approach applies MDC to a wavelet-based MCTF codec, similar to SVC, by dividing the HP frames between the different video descriptions and keeping the lowest layer LP frames as redundant data inside the streams. The motion information in each HP frame is also replicated in both streams. Figure 2.10 illustrates the frame coding order of the MD-MCTF coding scheme. Original coded sequence L H1 H2 H3 H4 H5 H6 H7 First Description Second Description L H1 H2 H3 H4 H5 H6 H7 Figure 2.10: Coding structure of the MD-MCTF. In each of the two descriptions, the lightly shaded HP frames contain only motion information. 
All residual components are set to zero. The problem with this approach is that most of the redundant bit allocation is dedicated to duplicating the motion information. Texture information, on the other hand, is sacrificed to reduce the redun-dancy. However, this sacrifice in texture data results in increased quality degradation, Chapter 2. Digital Video Transmission and Coding, From H.264/AVC To SVC 24 a feature which is very undesirable. 25 Chapter 3 Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC) Scalable Video Coding (SVC) clearly solves the network heterogeneity problem faced in the current M B M S architecture. The prevalent problem with this coding technique is its lack of error resilience schemes that can ensure the safe delivery of coded video data to its destination. We have seen that there are currently several error resilience techniques being employed to protect the video data, however, all of these techniques suffer from limited protection to the entire video data or from extensive redundancy in the coded video representations. For this reason, we propose a new Multiple Description Coding scheme specifically designed for protecting the high pass frames of an SVC compatible bitstream. We shall call our scheme Multiple Description Scalable Video Coding (MD-SVC). M D - S V C takes advantage of the layered structure of SVC to generate two descrip-tions of each enhancement layer of an SVC coded video.provide multiple coded versions of the video layers with minimum redundancy. The SVC structure is modified to allow the encoder to create two complementary descriptions of HP frames, such that each description is ensured to be independently decodable. The challenge thus lies in cre-ating a description of every SVC layer that will produce an acceptable video quality Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC) 26 when decoded separately while limiting the redundancy induced by M D C . The target of our coding scheme is to be able to support networks and users with heterogeneous capabilities while providing an error-resilient video representation with the minimum possible redundancy. Figure 3.1 shows one example of the applications that can be serviced using our M D - M C T F coding scheme over 3G M B M S networks. m Hand held computer\" 0. In the BestNeighbor case, the maximum length of the FL binstring is 3 and therefore n< 2. In Figure 3.8, we assume that 0 is the most probable symbol and 1 the least probable symbol of the arithmetic coder, as defined by [11]. Therefore, using the 400 coded frame dataset, we calculate the symbol probabilities of the BestNeighbor symbol values and P(xt+1\s(^)) = P(s(xt+1)) P(s(xt)) (3.4) Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC) 42 Table 3.4: BestNeighbor symbol probabilites and the corresponding bin-string Non-Binary Symbol 0 1 2 3 4 5 6 7 Probability PT(X2) 0.22 0.34 0.23 0.17 0.0065 0.011 0.0039 0.019 Binstring (*2) 001 000 010 011 110 101 111 100 assign these values to the leaves of the binary tree accordingly. Table 3.4 shows the BestNeighbor symbol probabilities and the corresponding binstring. To calculate the internal node probabilities, take for example node C3, apply equa-tion 3.4 as follows. We want to calculate P(0|01) and P(l|01) = 1 - P(0|01). 
    P(0|01) = P(010) / (P(010) + P(011)) = Pr_{BN=2} / (Pr_{BN=2} + Pr_{BN=3})

Similarly, the conditional probabilities of node C1 are P(0|0) and P(1|0) = 1 − P(0|0), where

    P(0|0) = (P(000) + P(001)) / (P(000) + P(001) + P(010) + P(011))

The conditional probabilities of all remaining internal nodes are calculated in the same way as those of contexts C3 and C1. These conditional probabilities define the context models of every node in the binstring and are therefore used to initialize the context models at the beginning of encoding or decoding of every frame. The CABAC entropy coder then updates these context probabilities based on the encoded BestNeighbor syntax element statistics as more and more MDC macroblocks are encoded into the bitstream. The initial conditional probabilities of all BestNeighbor context models are shown in Table 3.5.

Table 3.5: BestNeighbor context models.

Context   MPS    LPS      RLPS   state   m   n
C0        0.96   0.0403   20     44      0   19
C1        0.58   0.42     215    2       0   61
C2        0.74   0.26     201    1       0   62
C3        0.61   0.39     213    1       0   62
C4        0.58   0.42     131    10      0   53
C5        0.63   0.37     189    3       0   60
C6        0.62   0.38     192    1       0   62

[11] specifies three steps for initializing context models, in which a context model is initialized by assigning to it a state number and a meaning for the most probable symbol (MPS). The first step is linearly dependent on the frame quantization parameter (QP) and involves calculating prestate = ((m × QP) >> 4) + n [11]. The next step clips prestate to the range [1, 126]. The final step maps the prestate to a {state, MPS} pair, such that if prestate ≤ 63, then state = 63 − prestate and MPS = 0; otherwise, state = prestate − 64 and MPS = 1 [11].

Now, the arithmetic coder used in H.264/AVC and SVC is a table-based arithmetic coder, where the bin probabilities are estimated using a table of scaled least probable symbol (LPS) ranges (RLPS) indexed by the context model state. Therefore, the calculated LPS probabilities shown in Table 3.5 are mapped into the RLPS values found in column 4 of Table 3.5, where the maximum RLPS range is 256 and corresponds to an LPS probability of 0.5. The frame quantization parameter does not affect the value of BestNeighbor, and therefore m is set equal to 0. Moreover, we are assuming that 0 is the most probable symbol, so the prestate value should not exceed 63. Columns 5, 6, and 7 of Table 3.5 show the initialization parameters of the BestNeighbor syntax element contexts resulting from the above mentioned initialization procedure.
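The derivation above can be checked mechanically. The sketch below recomputes an internal node probability from the Table 3.4 leaf probabilities via equation 3.4 and then applies the three-step {state, MPS} initialization; it is our own illustration, and the rounding of the Table 3.4 entries explains small differences from the Table 3.5 values.

```python
# Leaf probabilities from Table 3.4, keyed by the 3-bit binstring.
leaf = {"000": 0.34, "001": 0.22, "010": 0.23, "011": 0.17,
        "100": 0.019, "101": 0.011, "110": 0.0065, "111": 0.0039}

def prefix_prob(prefix):
    """P(s): probability that a coded binstring starts with `prefix`."""
    return sum(p for s, p in leaf.items() if s.startswith(prefix))

def cond_zero(prefix):
    """Equation 3.4: P(next bin = 0 | prefix)."""
    return prefix_prob(prefix + "0") / prefix_prob(prefix)

p = cond_zero("0")  # node C1 conditions on the prefix "0"
print(f"C1: P(0|0) = {p:.2f}, P(1|0) = {1 - p:.2f}")  # 0.58 / 0.42

def init_context(m, n, qp):
    """Three-step context initialization per [11]."""
    prestate = ((m * qp) >> 4) + n
    prestate = max(1, min(126, prestate))
    if prestate <= 63:
        return {"state": 63 - prestate, "MPS": 0}
    return {"state": prestate - 64, "MPS": 1}

# With m = 0 the result is QP-independent; n = 61 reproduces C1's state 2.
print(init_context(m=0, n=61, qp=26))  # {'state': 2, 'MPS': 0}
```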
Controlling the number of HP frames duplicated between the two descriptions leads to controlling the video redundancy. Although bitrate effi-cient, this approach suffers from a reduced decoded video quality since any Intra coded macroblocks in the HP frames cannot be retrieved if one description is lost. Our approach to multiple-description coding of residual data is also based on sepa-rating HP frames into even numbered and odd numbered frames and coding one group in each description. However, we extend this separation process by inserting any Intra-coded macroblocks found in the high pass frames in both descriptions. The first step is to add an MDC flag to the slice header to allow the decoder to distinguish between fully coded HP frames and modified or multiple-description coded HP frames. Note that this flag is only inserted to refer to multiple-description coding of residual data and not motion (prediction) data. Therefore, if description one is allocated the even numbered frames, for example, then all even numbered HP frames are coded entirely Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC) 45 without any modification in the first description bitstream and the MDC flag is set to zero. The odd numbered frames are multiple-description coded by setting the MDC flag to one and only inserting the Intra-coded macroblocks into the second-description bitstream. No residual information is coded for residual macroblocks. Figure 3.9 shows the separation of a set of residual high pass frames into two descriptions. The dark blocks in the high pass frames refer to Intra coded macroblocks. Original HP sequence Description One Description Two ^ 0 MDC- 1 j MOCO MDC. Figure 3.9: Multiple-description coding of residual frames in a temporal-scalable layer with three temporal levels. Dark blocks indicate Intra-coded macroblocks in the residual frame. Notice that the HP frames coded in the two descriptions are complementary. The only redundancy involved rises from the Intra-coded macroblocks in the residual frames that are duplicated to improve the decoded video quality in case of packet loss. In order to ensure independent decodability of each description, essential modifications were required. First, the coding of the coded block pattern (CBP) syntax element is modified in MDC frames such that it is independent of neighboring CBP values in the adjacent blocks. All neighboring blocks are assumed to be unavailable when coding the Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC) 46 CBP field of a residual block. Moreover, cross layer dependence of residual prediction is also eliminated for all residual frames by assuming that all base layer residual frames contain zero-valued samples. Finally, the TransformSize8x8 flag is always added to the bitstream to inform the decoder whether or not an 8x8 block transform is used. This is crucial when the motion information is multiple-description coded and the macroblock mode is MODE_8x8. In this case, the decoder has no clue as to whether the residual data was transformed using an 8x8 block transform or a 4x4 block transform since the BestNeighbor field is limited to 8x8 block precision and does not exceed that to sub-block precision. Consequently, adding the TransformSize8x8 flag eliminates any ambiguity that might arise from multiple-description coding of HP motion information. 
In the following section, we present a theoretical analysis of the multiple-description coding framework we just described. We perform a redundancy rate-distortion analysis of the suggested framework and develop a video motion index that reflects the level of distortion produced by this multiple-description coding framework.

3.2 Theoretical Analysis

Recall from Section 2.3 that the current SVC framework provides three kinds of scalability: temporal, spatial, and SNR. Spatial scalability is achieved by downsampling the input video sequence and then performing cross-layer prediction between the original (high-resolution) and downsampled (low-resolution) video sequences. This scalability is optimized by utilizing the already coded motion information in the lower layers.

SNR scalability is generally accomplished by coding the residual (texture) signals obtained from computing the difference between the original pictures and the reconstructed pictures produced after decoding the base layer. This scalability is extended to include all temporal subband pictures obtained after temporal scalable coding.

Temporal scalability is realized by applying the concept of motion-compensated temporal filtering (MCTF) to a group of original video pictures. MCTF utilizes an adaptive selection of the lifting representations of the Haar and 5/3 spline wavelets on a block basis. It is important to emphasize that the wavelet filters are applied in the temporal domain to blocks of picture samples, and not in the spatial domain. The result is a set of high-pass and low-pass subband pictures having half the temporal resolution of the original group of pictures. Note that after performing temporal decomposition on a group of pictures (GOP), spatial and SNR scalability can be added to the resulting subband pictures. In our MDC framework, we produce two descriptions of the HP frames obtained by MCTF and support any spatial and SNR scalability features applied to those HP frames.

To begin with, we revert back to equations 2.1 and 2.2, which describe the extended motion-compensated temporal filtering prediction and update operators:

    P_Haar(s[x, 2k+1]) = s[x + m_P0, 2k − 2r_P0]
    U_Haar(s[x, 2k])   = ½ h[x + m_U0, k + r_U0]

    P_5/3(s[x, 2k+1])  = ½ (s[x + m_P0, 2k − 2r_P0] + s[x + m_P1, 2k + 2 + 2r_P1])
    U_5/3(s[x, 2k])    = ¼ (h[x + m_U0, k + r_U0] + h[x + m_U1, k − 1 − r_U1])

The total distortion introduced by our framework separates into a motion term and a residual term:

    D_Total = D_Motion + D_Residual        (3.11)

The number of high pass frames N_HP in a GOP is expressed in terms of the GOP size as

    N_HP = Σ_{i=1}^{log₂(GOPSize)} GOPSize / 2^i
         = (GOPSize / 2^{log₂(GOPSize)}) · Σ_{i=0}^{log₂(GOPSize)−1} 2^i        (3.12)
         = GOPSize − 1

For example, a GOP of 8 pictures yields 4 + 2 + 1 = 7 high pass frames. We will assume that the number of MDC coded macroblocks is fixed per frame in a GOP, and let Ω be the set of MDC coded macroblocks, with n being the macroblock (MB) index.
Therefore, the minimal distortion requirement is satisfied for motion estimation.

The second term in the D_{Total} expression is D_{Residual}. Recall from section 3.1.3 that multiple-description coding of residual (texture) information is performed on the entire GOP level (or inter-frame level): the texture information in a multiple-description coded frame is simply set to zero. This, however, does not include Intra-coded macroblocks found within the MDC frame. The purpose behind this decision is to restrict the loss in coded data to the high pass components produced by MCTF. [14] specifies that the samples of Intra-coded macroblocks in high pass frames are not used for updating the low-pass pictures and therefore are not included in the reconstruction process at the decoder. Moreover, all sample values of the Intra macroblocks are set to zero when used in the update process [14].

For this analysis, we will begin by defining D_{Residual} in terms of the components affected by the dropping of residual data, and then we will assess the level of that distortion based on the video sequence statistics. The only process affected by multiple-description coding of texture information is the reconstruction stage of motion-compensated temporal filtering. In other words, the loss is restricted to some of the high pass subbands needed for the complete reconstruction of the output pictures. We will express the reconstruction of a set of low pass pictures in terms of the MCTF synthesis equations of two adjacent pictures. Equations 3.14 and 3.15 have been derived from the MCTF analysis equations 3.7 and 3.8. Let s[x, 2k] and s[x, 2k+1] be the reconstructed picture samples; the MCTF reconstruction equations are then

s[x, 2k] = l[x, k] - \frac{1}{4}(h[x, k] + h[x, k-1])    (3.14)
s[x, 2k+1] = h[x, k] + \frac{1}{2}(s[x, 2k] + s[x, 2k+2])    (3.15)

Note that prediction data variables have been removed from the above equations for clarity. Our MDC framework involves setting every other high pass frame to zero in a single description. Hence, if the odd numbered high pass frames are set to zero and k is even, and we denote the resulting reconstructed samples by \hat{s}, the above equations become

\hat{s}[x, 2k] = l[x, k] - \frac{1}{4} h[x, k]    (3.16)
\hat{s}[x, 2k+1] = h[x, k] + \frac{1}{2}(\hat{s}[x, 2k] + \hat{s}[x, 2k+2]) = \frac{1}{2}(l[x, k] + l[x, k+1]) + \frac{3}{4} h[x, k]    (3.17)

For a more comprehensive analysis, let us consider a group of pictures of size GOPSize. One example of a GOP with GOPSize = 8 is shown in Figure 3.13 below, where the shaded frames indicate the transmitted frames and the crossed frames are multiple-description coded.

Figure 3.13: Reconstruction of a group of pictures showing the MDC high pass frames with zero residual marked with the red Xs.

We define D_{Residual} as the sum of square error between the MCTF reconstruction of a single-description (SD) coded GOP and the MCTF reconstruction of one description of an MDC GOP. The calculated distortion is strictly limited to the error caused by the dropping of residual information. Therefore, we will develop an expression for D_{Residual} that incorporates all the lost residual samples across the MCTF stages. Using equations 3.16 and 3.17, the final expression for an entire GOP of size GOPSize is written as
D_{Residual} = \sum_{s=1}^{\log_2(GOPSize)-1} \sum_{k=0}^{GOPSize/2^s - 2} \left( (s[x, 2k] - \hat{s}[x, 2k])^2 + (s[x, 2k+1] - \hat{s}[x, 2k+1])^2 \right)
= \sum_{s=1}^{\log_2(GOPSize)-1} \sum_{k=0}^{GOPSize/2^s - 2} \left( \left(\frac{1}{4} h[x, k-1]\right)^2 + \left(\frac{1}{8} h[x, k-1] + \frac{1}{8} h[x, k+1]\right)^2 \right)    (3.18)
= \sum_{s=1}^{\log_2(GOPSize)-1} \sum_{k=0}^{GOPSize/2^s - 2} \left( \frac{5}{64} h[x, k-1]^2 + \frac{1}{64} h[x, k+1]^2 + \frac{1}{32} h[x, k-1] h[x, k+1] \right)

where s is the MCTF stage number, GOPSize/2^s - 2 is the maximum value that k can take in stage s, and the sums are accumulated over all spatial locations x.

Let us now define a new variable, the motion index, as the normalized sum of square error distortion between the original picture samples and the motion-compensated samples produced after the Lagrangian based rate-distortion optimization. The motion index, labeled MIdx, is generally expressed as

MIdx = \frac{\sum_{k \in GOP} \sum_{x \in Frame} (s_{orig}[x, k] - p[x, m, r])^2}{f(GOPSize, \Lambda)}    (3.19)

where p[\cdot] represents the motion-compensated samples, f(\cdot) is a normalization function defined below, and \Lambda is a constant reflecting the maximum SSD distortion per Inter-coded macroblock. \Lambda is calculated experimentally by encoding a number of sequences with varying levels of motion and is estimated to be around 22000. The normalization function is simply written as

f(GOPSize, \Lambda) = GOPSize \times \frac{FrameWidth \times FrameHeight}{256} \times \Lambda

The question that arises from the above definition is how MIdx can be characteristic of the level of motion in a group of pictures. The answer is inherent in the concept of motion-compensated prediction itself. Motion-compensated prediction can be viewed as a mechanism for tracking motion in a video sequence. The process involves tracking a fixed size block of pixels across a group of pictures and trying to find the best match for that block in previous and future pictures by finding a set of motion vectors and reference indices. If the tracking is highly successful, then the picture generated from the motion tracking will closely resemble the original picture. This results in a residual frame with small sample values. On the other hand, if the tracking is unsuccessful, then the motion-compensated picture will only vaguely resemble the original picture, causing the residual frame to have large sample values and, in more extreme cases, causing the macroblocks to be coded in INTRA_MODE. Therefore, if the residual frame sample values are large, then the motion tracking process has difficulty keeping up with the level of motion in the sequence, which reflects that the sequence exhibits high levels of motion. Similarly, if a residual frame contains a large number of Intra-coded macroblocks, then the sequence is more likely to exhibit high levels of motion. Since the high pass subband samples are defined by h[x] = s_{orig}[x] - p[x, m, r] [14], the motion index equation can be rewritten in terms of the square of the residual high pass samples as

MIdx = \frac{\sum_{k \in GOP} \sum h^2[k]}{f(GOPSize, \Lambda)}    (3.20)

Note that the spatial coordinate vector x is dropped from the above equation for clarity.
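The motion index is straightforward to compute from the high pass subband samples. The sketch below follows equations 3.19 and 3.20; it assumes the normalization counts (FrameWidth x FrameHeight)/256 macroblocks of 16x16 samples per frame, with \Lambda taken as the empirical constant of about 22000 quoted above.

```python
import numpy as np

LAMBDA = 22000.0  # empirical maximum SSD per Inter-coded macroblock

def motion_index(hp_frames, frame_width, frame_height):
    """Motion index MIdx of one GOP, computed from the squared high pass
    samples h[x, k] (equation 3.20). hp_frames holds the GOPSize-1 high
    pass subband frames of the GOP as numpy arrays."""
    gop_size = len(hp_frames) + 1  # GOPSize-1 high pass frames per GOP
    ssd = sum(float(np.sum(h.astype(np.float64) ** 2)) for h in hp_frames)
    mbs_per_frame = (frame_width * frame_height) / 256.0  # 16x16 macroblocks
    return ssd / (gop_size * mbs_per_frame * LAMBDA)
```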
This argument parallels our previous discussion dealing with the amount of signal power retained in the low pass and high pass subbands of the MCTF process. To reiterate, if the video sequence exhibits low levels of motion, then the low pass subband retains more of the signal power; if, on the other hand, the sequence exhibits high levels of motion, then the high pass subbands retain more of the signal power. Thus, we speculate that the motion index will closely reflect the level of motion found in a video sequence and act as an indicator of the size of the encoded video sequence.

To test this speculation, we have encoded 113 pictures from each of four video sequences at 15 frames per second and a GOPSize of 8 frames. The sequences Crew, City, Foreman, and Harbour were used. Figure 3.14 shows the motion index calculated for 14 GOPs of the Crew sequence. The motion index values displayed here are normalized with respect to the maximum distortion of the Crew sequence.

Figure 3.14: Variation of motion index with respect to GOP number in the Crew sequence.

Based on our speculation, Figure 3.14 indicates that the Crew sequence exhibits low levels of motion during the first three GOPs; the level of motion then starts increasing until it hits a maximum in GOP number 10. To demonstrate this analysis, we display screen captures of the frames of GOP 2 and GOP 10. Figure 3.15 (a) shows that very little motion can be noticed in the displayed group of eight pictures. The GOP intensity is almost homogeneous as well, and therefore the value of the motion index was small. Figure 3.15 (b), on the other hand, exhibits more motion in the foreground and, more specifically, the background of the sequence. Moreover, the intensity of these pictures is also changing, which results in a high motion index.

Figure 3.15: Crew sequence showing (a) frames from GOP number 2 and (b) frames from GOP number 10.

The above figures clearly demonstrate the relevance of the motion index in indicating the motion within a video sequence. The question that arises, however, is how the motion index can reflect the difference in the level of motion between two or more sequences. Figure 3.16 shows the motion index values of the four sequences mentioned earlier.

Figure 3.16: Comparison of the motion index values between the 113 coded frames of the four sequences Crew, City, Foreman, and Harbour.

The plots shown in Figure 3.16 indicate that the sequences Crew and Harbour exhibit higher levels of motion than Foreman or City. The motion index values shown in Figure 3.16 were normalized using the constant \Lambda referred to in equation 3.19. By averaging the motion index over all the GOPs in each sequence, the motion index can be used to compare the level of motion between sequences. Moreover, this measure also reflects the bit-rate of the coded video sequence. Table 3.6 lists the average motion index for each sequence along with the bit-rate and size of each coded video sequence. Note that the same QP value was used during the encoding of the four sequences. Clearly, the motion index can now be related to the residual distortion D_{Residual} caused by multiple-description coding of a scalable video sequence.

Table 3.6: Relation between motion index and the coded video file size.
Sequence    Average MIdx    Bit-Rate (kbit/sec)    File Size (KB)
Crew        0.5126          95.762                 88
Harbour     0.4902          95.498                 87.8
Foreman     0.3874          70.008                 64.3
City        0.1898          50.937                 46.8

Since multiple-description coding of a video sequence involves dropping some of the high pass (or residual) components, sequences with a high MIdx will suffer from high residual distortion compared to sequences with a lower MIdx. We will demonstrate this relationship in chapter 4.

The proposed multiple-description coding scheme incorporates some degree of redundancy in the video payload in order to assure independent decodability of every individual description. Let R_{Total} be the overall redundancy rate introduced by our MDC framework. Similar to the distortion measure, R_{Total} can be separated into two components, R_{Motion} and R_{Residual}, such that R_{Total} = R_{Motion} + R_{Residual}.

The redundancy arising from multiple-description coding of motion information is mainly due to the common syntax elements coded in both MDC and non-MDC macroblocks, in addition to the new syntax elements introduced by MDC macroblocks. The syntax elements coded for both MDC and non-MDC macroblocks are shown in Table 3.3. Previous multiple-description coding schemes such as [19] required copying all motion information into both descriptions to ensure correct decodability of the independent descriptions. The problem with such an approach is mainly the considerable increase in bitrate of the multiple-description coded bitstream, which translates into an increase in video redundancy. We have overcome this problem by limiting the level of redundancy to a minimum. Therefore, based on the syntax elements listed in Table 3.3, the motion redundancy is estimated to be

R_{Motion} = \underbrace{R_{BLSkipFlag} + R_{MbMode} + R_{BLQRefFlag}}_{\text{common syntax elements}} + \underbrace{R_{MDCFlag} + R_{BestNeighbor}}_{\text{added syntax elements}}    (3.21)

We will label the common syntax element rates by R_{common}. The major contribution to motion redundancy comes from the coding of the BestNeighbor field. However, this increase in bitrate remains much smaller than coding the actual motion-vector differences instead, for the following reasons. The BestNeighbor field is binarized using the 3-bit fixed length binarization scheme discussed in section 3.1.2. The motion-vector difference (Mvd) field binarization, by contrast, is performed by concatenating the truncated unary and Exp-Golomb binarization schemes (UEG3) with a cutoff value of 9. Moreover, motion-vector differences in H.264 are set to quarter-pixel accuracy; for instance, if a block shifts by one pixel from one picture to the next, then the coded motion-vector will have a value of 4 [11]. Applying UEG3 binarization will produce 5 bits for every one-pixel shift: four bits to code the value of the Mvd and an additional sign bit [25]. Entropy coding of both the BestNeighbor and Mvd fields is governed by their respective context models, which closely approximate the actual symbol probabilities; the respective binstrings are then each coded using the same binary arithmetic coder. Experiments show that coding the finite-alphabet BestNeighbor syntax element requires fewer bits than encoding the actual Mvd syntax element, which has a much larger alphabet, the tradeoff being the distortion cost calculated earlier in this section. A sketch of the two binarizations follows.
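The contrast between the two binarizations can be sketched as follows, assuming the standard CABAC UEG3 convention (truncated unary prefix with cutoff 9, order-3 Exp-Golomb suffix, trailing sign bin); the exact bin count for a given Mvd depends on the terminator convention, so the numbers printed below are illustrative only.

```python
def fl3(value):
    """3-bit fixed-length binarization, as used for the BestNeighbor field."""
    return format(value & 0b111, "03b")

def ueg3(mvd, cutoff=9):
    """UEG3-style binarization of a motion-vector difference: truncated
    unary prefix up to `cutoff`, order-3 Exp-Golomb suffix for the excess,
    and a trailing sign bin for nonzero values."""
    v = abs(mvd)
    bins = "1" * min(v, cutoff)
    if v < cutoff:
        bins += "0"                      # unary terminator
    else:
        k, rem = 3, v - cutoff           # order-3 Exp-Golomb suffix
        while rem >= (1 << k):
            bins += "1"
            rem -= 1 << k
            k += 1
        bins += "0" + format(rem, f"0{k}b")
    if mvd != 0:
        bins += "0" if mvd > 0 else "1"  # sign bin
    return bins

# A one-pixel shift (mvd = 4 in quarter-pel units) costs several bins,
# whereas BestNeighbor always costs exactly three:
print(len(ueg3(4)), len(fl3(5)))  # e.g. 6 vs 3 under this convention
```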
The additional rate contributed by multiple-description coding of the residual information, R_{Residual}, is directly related to the number of Intra-coded blocks found within a high pass frame. Note that since all high pass subband components are set to zero in an MDC high pass frame, the redundancy rate is strictly limited to the coding of Intra macroblocks. Let N_{Intra} and R_{Intra} be the number of Intra-coded macroblocks in a group of pictures and the average bitrate required to encode an Intra-coded macroblock, respectively. We can then estimate the residual redundancy by

R_{Residual} \approx N_{Intra} \times R_{Intra}    (3.22)

3.3 Multiplexing and Error Concealment

The final step before transmitting the multiple-description coded stream is multiplexing the generated descriptions. Our framework is based on a layered structure of a combined scalable video stream offering spatial, temporal, and SNR scalability. The original SVC multiplexing scheme operates on a group of pictures (GOP) basis, such that every frame belonging to all the SVC layers of a coded GOP is embedded into the bitstream before the next GOP's frames are inserted. Figure 3.17 (a) shows the Multiplex module of the original SVC encoder. The coded frames of a GOP k are arranged starting with the layer SVC0 frames, followed by the layer SVC1 frames, until all frames belonging to GOP k are inserted. The frames of GOP k+1 follow, again starting with the layer SVC0 frames, and so on.

Performing multiple-description coding on the high pass frames of all SVC layers except layer SVC0 yields two descriptions of the high pass frames. This requires a slight modification to the Multiplex module such that all frames belonging to GOP k, for instance, are still inserted into the bitstream before the insertion of any frame from GOP k+1. Moreover, we further propose embedding all frames belonging to layer l of both descriptions into the bitstream before inserting any frame from layer l+1 as an additional multiplexing constraint; the resulting ordering is sketched below.

Figure 3.17: (a) Original Multiplex module of the SVC encoder. (b) Modified Multiplex module to accommodate MDC frames.
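The modified ordering constraint can be sketched as a simple generator (the stream labels are hypothetical tags; the real module interleaves the actual coded frames):

```python
def mdc_multiplex_order(num_gops, num_layers):
    """Yield (gop, layer, description) tags in the modified multiplex
    order: all frames of GOP k before GOP k+1, and within a GOP all
    frames of layer l (both descriptions of its MDC high pass frames)
    before any frame of layer l+1. Layer 0 is single-description."""
    for gop in range(num_gops):
        for layer in range(num_layers):
            if layer == 0:
                yield (gop, 0, 0)      # base layer frames
            else:
                yield (gop, layer, 1)  # description one HP frames
                yield (gop, layer, 2)  # description two HP frames

print(list(mdc_multiplex_order(num_gops=2, num_layers=3)))
```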
The encoding side of our multiple-description coding framework partitions a single stream into three sub-streams: a base-layer stream, labeled stream 0, that contains all SVC0 frames along with the low pass frames of the higher layers, and two descriptions of an enhancement layer that contains the high pass frames of the higher layers, labeled stream 1 and stream 2, respectively. The stream number is also coded into the frame header to facilitate stream partitioning. From this point on, it falls upon the transmission network to decide how to transmit the multiple descriptions, whether to transmit the entire bitstream over one channel or to partition the bitstream and transmit each description over a separate channel. Note that stream one and stream two are independently decodable of each other; however, they both depend on the correct reception of stream zero. Therefore, a necessary condition for independently decoding each of the two descriptions is the correct reception and decodability of the base layer (stream 0); this dependency rule is sketched below.

The MDC decoder simply inverts all the steps performed by the encoder to create the two descriptions. Figure 3.18 shows the basic structure of an MDC decoder with emphasis on the inverse MDC module, the shaded block. The combine motion fields and combine residual fields modules restore the coded motion and texture data to the original SVC compatible, single-description input that is accepted by the MCTF decoder.

Figure 3.18: Basic structure of the MDC Decoder with the Inverse MDC module shown in gray.

Let us first evaluate the significance of the coded frames produced by our SVC based multiple-description coder. The proposed framework offers a layered approach to MDC. We separate a coded video sequence into one logical layer, the base-layer, which contains all SVC0 frames in addition to the low pass frames of all higher SVC layers. This architecture necessitates the correct reception of all base-layer frames in order to ensure decodability. Therefore, if we rate the importance of the coded video frames, the base-layer frames take the highest priority. Next, we move to the multiple-description coded logical enhancement-layer, which contains all high pass frames of all SVC layers higher than SVC0. These high pass frames may or may not contain prediction data (motion information), depending on the type of scalability associated with the SVC layer. For instance, if an SVC layer is SNR scalable, then its high pass frames do not contain motion information. High pass frames belonging to SVC layers that extend the temporal and/or spatial scalability of the lower layers do contain motion information. Hence, the loss of a high pass frame from an SNR scalable layer will only cause some degradation in output picture quality; it does not hamper the decoding process, since the lost residual samples can be assumed to be zero. The loss of motion information, on the other hand, is not replaceable and will completely obstruct the decoding of the respective layer.

To clearly illustrate our discussion, consider the group of pictures shown in Figure 3.19. The dark frames belong to the base-layer, whereas the light frames belong to the multiple-description coded enhancement-layer.

Figure 3.19: Multiple-description coded group of pictures with four SVC layers of combined scalability.
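The dependency rule stated above, namely that streams 1 and 2 are mutually independent but both require stream 0, reduces to a few lines:

```python
def decodable_descriptions(received):
    """Given the set of received stream numbers {0, 1, 2}, return the
    descriptions that can actually be decoded. Without the base stream 0
    nothing decodes; with it, each received description decodes on its own."""
    if 0 not in received:
        return set()
    return {d for d in (1, 2) if d in received}

assert decodable_descriptions({0, 2}) == {2}    # one description suffices
assert decodable_descriptions({1, 2}) == set()  # base layer lost: no output
```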
Figure 3.20 shows the same GOP displayed earlier with five frame-loss scenarios labeled LS1 to LS5. These scenarios include losses in base-layer frames, in the motion information of enhancement-layer frames, and in the residual information of enhancement-layer frames. They constitute a comprehensive study of the possible frame losses that might occur when transmitting the coded video bitstream. The error handling routines we developed for coping with these losses are listed in Table 3.7.

Table 3.7: Error handling routine description.

LS1: Loss of the two descriptions of a high pass frame belonging to an SNR layer. Effect: reduces the quality of the decoded video. Error handling: assume that the lost HP frame has zero residual information, extract the motion information from the lower SVC layer, and resume decoding of the GOP.

LS2: Loss of one description of a high pass frame of a spatial/temporal layer. Effect: activates the motion recovery routine. Error handling: use the second-description high pass frame to recover an estimate of the lost motion and residual information, then resume decoding of the GOP.

LS3: Loss of the two descriptions of a high pass frame belonging to a spatial/temporal layer. Effect: disrupts decoding of the current and higher SVC layers. Error handling: discard decoded frames belonging to the current layer and ignore all other received frames belonging to the same GOP; repeat the last decoded picture to fill the GOP space.

LS4: Loss of the two descriptions of a high pass frame belonging to the highest SVC layer (also an SNR layer). Effect: reduces the quality of the decoded video. Error handling: assume that the lost HP frame has zero residual information, extract the motion information from the lower SVC layer, and resume decoding of the GOP.

LS5: Loss of a base layer frame. Effect: disrupts decoding of the current GOP. Error handling: discard all received frames belonging to the current GOP and repeat the last decoded picture to fill the GOP space.

Figure 3.20: GOP with five possible frame-loss scenarios.

The error handling routines listed in Table 3.7 were developed to allow an SVC decoder to detect and recover from frame losses. This feature is not supported in the current SVC decoder, nor is there any recommendation that describes error handling in SVC. The existing recommendation suggests replacing any lost output frame with the previously decoded output frame, a condition that occurs every time a single coded frame in a GOP is lost. Therefore, the reference error handling routine recommends that for every lost frame in a GOP, the entire GOP be replaced by the insertion of the last output picture.
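As an illustration, the five routines can be organized as a dispatch table keyed by the detected loss scenario (the action strings paraphrase Table 3.7; the actual decoder hooks these actions into its GOP decoding loop):

```python
ERROR_HANDLERS = {
    # scenario: recovery action (paraphrasing Table 3.7)
    "LS1": "zero the residual, reuse lower-layer motion, resume GOP decoding",
    "LS2": "estimate lost motion/residual from the surviving description",
    "LS3": "drop current and higher layers for this GOP, repeat last picture",
    "LS4": "zero the residual, reuse lower-layer motion, resume GOP decoding",
    "LS5": "discard the whole GOP, repeat the last decoded picture",
}

def handle_frame_loss(scenario):
    """Look up the concealment action for a detected loss scenario."""
    return ERROR_HANDLERS.get(scenario, "unknown scenario: repeat last picture")

print(handle_frame_loss("LS2"))
```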
In the next section, we show simulation results that favor our proposed multiple-description coding and error handling routines over the reference SVC and its error handling routines.

Chapter 4

Simulation Results

The JVT standardization committee has put together common conditions to test the error resilience of SVC. These common conditions are meant to be used when testing any proposal of error-resilient coding tools developed for SVC, by providing simulation results adhering to the specified conditions [28]. We will use the conditions specified in [28] as the common ground to test our MDC framework and compare our results with the single-description SVC and the multiple-description scalable coding framework developed by [19].

We will begin by describing the simulated network environment that transfers our multiple-description coded SVC stream. As a primary guideline, it is important to note that [28] assumes that the communication of video streams is performed using an RTP/UDP/IP transport [29]. This specification conforms to the MBMS streaming delivery transport protocol [8]. [28] further assumes that transmission errors are restricted to packet losses, bit errors not being considered since UDP would discard packets with any bit errors. Moreover, the RTP payload size is limited to 1400 bytes, with only one NAL unit encapsulated in one RTP packet. This condition cannot be satisfied at this stage of SVC development, since SVC does not yet support the coding of multiple slices per frame. Therefore, the alternative is to partition every NAL unit among several RTP packets with a maximum size of 1400 bytes.

We have developed an RTP encapsulator that extracts the NAL units from the SVC bitstream and packetizes them into RTP packets with a maximum payload size of 1400 bytes. If the NALU size is smaller than 1400 bytes, then the entire NALU is encapsulated in one RTP packet. If, on the other hand, the NALU size is larger than 1400 bytes, then the NALU is partitioned into blocks of 1400 bytes, each of which is encapsulated in one RTP packet. Moreover, the RTP marker bit is set to one to indicate that the RTP payload is a full NALU or the last partition of a NALU; otherwise, the marker bit is set to zero. Figure 4.1 shows the NALU encapsulation process, and a packetization sketch follows at the end of this passage.

Figure 4.1: RTP encapsulation of NALUs with varying sizes.

[28] recommends the use of the erasure simulation patterns ITU-T VCEG Q15-I-16r1 to model Internet and 3GPP/3GPP2 packet-loss environments. We have chosen the network simulator tool recommended in [28], "Offline simulator for RTP/IP over UTRAN", provided by the 3GPP SA4 S4-AHVIC036, to test the resilience of our framework. This simulator was developed from VCEG-N80, which describes common test conditions suited for transmission on 3GPP/3GPP2 networks for conversational and streaming video applications based on RTP/IP. VCEG-N80 provides a controlled environment for experiments by defining an offline software simulator of the 3GPP/3GPP2 radio bearer protocols and RLC physical layer error patterns with average packet loss rates of 3%, 5%, 10%, and 20%. We shall give a brief overview of the simulated network environment and the user plane protocols in 3G presented in VCEG-N80 [30].
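First, though, the NALU-to-RTP encapsulation described above can be sketched as follows. This is a simplified model covering only the payload splitting and the marker bit, not the remaining RTP header fields:

```python
MAX_PAYLOAD = 1400  # bytes, per the common test conditions in [28]

def packetize_nalu(nalu):
    """Split one NAL unit into RTP payloads of at most MAX_PAYLOAD bytes.
    The marker bit is set on a packet carrying a complete NALU or the
    last partition of one, and cleared otherwise."""
    packets = []
    for offset in range(0, len(nalu), MAX_PAYLOAD):
        chunk = nalu[offset:offset + MAX_PAYLOAD]
        is_last = offset + MAX_PAYLOAD >= len(nalu)
        packets.append({"marker": 1 if is_last else 0, "payload": chunk})
    return packets

pkts = packetize_nalu(b"\x00" * 3300)  # splits into 1400 + 1400 + 500 bytes
print([(p["marker"], len(p["payload"])) for p in pkts])
```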
The video source and video receiver are both assumed to be located within a private operator's network. This network is composed of an IP-based core network and a radio access network. The streaming server is assumed to be directly connected to the core network. Moreover, the core network is assumed to be error free, restricting the bottleneck to the radio interface. Consequently, the simulated packet losses result only from fading/shadowing errors at the radio interface [30].

The user plane protocol stack specifications defined by 3GPP and 3GPP2 are similar between the user equipment (UE) or mobile station (MS) and the radio base station (BS). We will give a brief overview of the protocol stack as presented in [30]. Figure 4.2 shows the packetization of a video payload unit according to the user plane protocol stack of CDMA-2000 [30].

Figure 4.2: CDMA-2000 user plane protocol stack packetization presented in VCEG-N80.

The protocol stack used for UMTS is very similar, except that 3GPP defines different names for the protocols. The differences listed in [30] are as follows:

- The point-to-point protocol (PPP) is not used. Instead, the radio link control (RLC) protocol includes additional information that signals the RLC payload boundaries.
- The size of a physical layer unit in UMTS is more flexible than in CDMA-2000; therefore, a physical layer frame is not split further into logical transmission units (LTU) as occurs in CDMA-2000.

In CDMA-2000, an LTU is the smallest fixed-size unit and carries a cyclic redundancy code (CRC) to detect possible errors. UMTS, on the other hand, defines an RLC-PDU (protocol data unit), which is a fixed-length physical layer unit determined when the radio bearer is set up. In our simulations, it is assumed that one link-layer frame is packed into one LTU/RLC-PDU [30]. In general, an RTP/UDP/IP packet is packed into one PDCP (packet data convergence protocol)/PPP packet, which becomes an RLC-SDU (service data unit). Since video packet sizes are variable in nature, the RLC-SDUs maintain this variation in size. If an RLC-SDU is larger than the RLC-PDU, then the RLC layer segments the SDU into multiple PDUs.

For our simulations, we have merged the packet loss patterns presented in ITU-T VCEG Q15-I-16r1 to create radio bearers with 3%, 5%, 10%, and 20% block error rates (BLER). The bearer specifications used in the offline simulator first assume that no retransmissions are allowed during transmission; the bearer is therefore unacknowledged. The bearer bitrate is set to 128 kbit/s with a radio frame size (RFS), or RLC-PDU size, of 320 bytes. Moreover, we use UMTS as our telecommunications system.

In our simulations, we demonstrate the performance of our MD-SVC framework in the simulated UMTS network compared to the performance of the MD-MCTF scheme presented in [19] and the single description SVC bitstream [14]. We have coded 961 frames from each of the four test sequences Crew, City, Harbour, and Foreman.
The 961 frames were generated according to the recommendation in [28] by first encoding the pictures "in normal order till the original sequence ends, then in reverse order from the second last picture in the sequence end to the beginning, and then again in normal order from the second picture, and so on." Four SVC layers were encoded according to the configuration shown in Table 4.1.

Table 4.1: Coding configuration of the test sequences.

Layer   Spatial Resolution   Frame-Rate   GOP Size   Scalability
SVC0    QCIF                 15           8          Temporal
SVC1    QCIF                 15           8          SNR
SVC2    CIF                  30           16         Spatial-temporal
SVC3    CIF                  30           16         SNR

The multiple-description coded SVC stream, which we will refer to as MD-SVC, is composed of a base-layer, which contains all SVC0 frames along with the low pass frames of the higher three SVC layers, and an enhancement-layer containing two descriptions of the high pass frames belonging to SVC layers 1, 2, and 3. One example of the coding structure can be seen in Figure 3.19; note, however, that the GOP sizes shown in Figure 3.19 are half of the GOP sizes used for our tests.

We will assume that all SVC0 layer frames, along with the low pass frames belonging to the higher SVC layers, are protected using the Unequal Erasure Protection (UXP) scheme presented in [17], which was developed specifically to protect base layer frames in SVC. A brief overview of the UXP scheme was presented earlier. Our assumption is that the UXP scheme is applied to all coded streams (MD-SVC, MD-MCTF, and SD-SVC). The UXP scheme results in a loss rate significantly below 1% [17], with a 20% overhead in the transmission bitrate of the protected frames. Furthermore, the MD-MCTF scheme is applied only to the high pass frames of the enhancement layer defined by our MD-SVC framework. This restriction was imposed because the MD-MCTF scheme replicates low pass frames in MCTF, which dramatically increases the bitrate of the coded video stream and would result in an unfair comparison with our developed scheme. Therefore, in the MD-MCTF case, low pass and base layer frames are protected using only UXP.

Figures 4.3 to 4.6 demonstrate the performance of our MD-SVC scheme compared to the MD-MCTF and single description SVC (SD-SVC) schemes when faced with loss rates between 3% and 20%. The sequences Crew, Foreman, Harbour, and City were encoded using each of the above-mentioned coding schemes, and the reconstructed frames obtained after decoding are compared using the luminance PSNR (Y-PSNR) measure.

Table 4.2: MD scheme redundancy.

           Crew    Foreman   Harbour   City
MD-SVC     21.9%   17.9%     8.1%      14.1%
MD-MCTF    13.9%   16%       7.7%      15.5%

The PSNR performance plots show that MD-SVC outperforms MD-MCTF in most cases, except for the City sequence, where MD-MCTF has a better PSNR performance. The City sequence has very little motion, as indicated by the motion index (MIdx) plots shown in Figure 3.16. However, it is important to note that the level of redundancy imposed by our MD-SVC scheme is lower than that imposed by the MD-MCTF scheme in the case of the City sequence. As indicated in section 3.2, the redundancy of our MD-SVC scheme results mainly from the repeated Intra-coded macroblocks in the video sequence. The MD-MCTF scheme does not insert these Intra-coded blocks in multiple-description coded high pass frames.
The result is improved decoded video quality for our MD-SVC over MD-MCTF in sequences that contain a large number of Intra-coded macroblocks. Figure 4.7 illustrates the relation between the redundancy rate and the associated distortion due to the packet loss patterns. The redundancy rates used are those generated by encoding the four video sequences and are listed in Table 4.2. The redundancies listed in Table 4.2 show that both MD-SVC and MD-MCTF impose very similar redundancy rates; except for the Crew sequence, which exhibits a high number of Intra-coded blocks in the high pass frames, the redundancy rates are within 1% of each other. The peaks seen in the MD-MCTF plots in Figure 4.7 are due to the City sequence, which exhibits very little motion; losses in high pass frames therefore did not affect the quality much, since most of the signal power is conserved in the low pass frames.

Figure 4.3: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Crew sequence.

Figure 4.4: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Foreman sequence.

Figure 4.5: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Harbour sequence.

Figure 4.6: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the City sequence.

Figure 4.7: Comparison of MD-SVC and MD-MCTF redundancy rates and associated distortions.
The MD-MCTF scheme does not generate any distortion due to the coding of motion information, since motion information is replicated in both descriptions. This, however, is not the case in our MD-SVC scheme, which saves on the redundancy in motion information by multiple-description coding of the motion information. In exchange, MD-SVC uses the savings in motion information coding to insert redundant texture information by replicating Intra-coded macroblocks. Figures 4.8 to 4.11 show snapshots from the four video sequences; the pictures on the left are generated by MD-SVC, and the pictures on the right are MD-MCTF coded. It can be seen that MD-SVC causes slight distortions that reflect negatively in PSNR but are not as significant visually. These slight distortions are mainly due to the approximation of the motion vectors of macroblocks from the neighboring blocks. The benefit of this coding mechanism lies in reducing the video data-rate to allocate more bits to duplicated Intra-coded blocks. Since the MD-MCTF scheme does not duplicate Intra-coded blocks, the packet loss is more obvious visually, as can be seen in Figures 4.8 to 4.11.

Figure 4.8: Comparison of visual quality from the Foreman sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.9: Comparison of visual quality from the Crew sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.10: Comparison of visual quality from the City sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.11: Comparison of visual quality from the Harbour sequence, MD-SVC (left), MD-MCTF (right).

Chapter 5

Conclusion and Future Work

In this thesis, we have developed a new multiple-description coding scheme for the scalable extension of the H.264/AVC video coding standard (SVC). Our scheme (MD-SVC) generates two descriptions of the high pass frames of the enhancement layers of an SVC coder by coding in each description only half the motion information and half the texture information, with a minimal degree of redundancy. Intra-coded macroblocks are inserted as redundant information, since they cannot be approximated using the motion information. The two descriptions are complementary but independently decodable, such that if only one description is received, the decoder is able to recover the missing motion information from the available data and generate an output video sequence with acceptable degradation in quality. If both descriptions are received, then the full quality of a single description SVC (SD-SVC) stream is delivered. However, the two descriptions are highly dependent on the base layer; therefore, the base-layer and the low pass frames of the enhancement layers are protected using an unequal erasure protection (UXP) technique developed specifically for SVC payload data. We have also added error detection and concealment features to the SVC decoder to cope with frame losses.

Our multiple-description coding framework conforms with the protocol and codec specifications of the 3GPP multimedia broadcast/multicast services (MBMS). Moreover, the framework is built on and integrated into the SVC codec, and can therefore offer highly reliable video services to MBMS networks and clients with heterogeneous resources and capabilities. The outcome of our coding scheme is three separable streams that can be transmitted over the same channel (in multicast or broadcast mode) or over three separate channels (in broadcast mode) to minimize the packet loss rate.
Objective and subjective performance evaluations have shown that our scheme delivers superior decoded video quality when compared with the UXP-protected SD-SVC and the multiple-description motion compensated temporal filtering (MD-MCTF) scheme at comparable redundancy levels.

It would be desirable to further reduce the redundancy level and remove the dependence on the base layer; therefore, more work can be performed to generate low-redundancy multiple descriptions of the low pass frames using techniques such as correlating transforms. Furthermore, the accuracy of the motion recovery approach can be improved through boundary matching algorithms and block interpolation.

Bibliography

[1] Mobile Broadcast/Multicast Service (MBMS), 2004.
[2] DigiTAG. Television on a Handheld Receiver - broadcasting with DVB-H, 2005.
[3] Nokia makes air interface of its mobile TV end-to-end solution (DVB-H) publicly available, May 2005.
[4] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560-576, 2003.
[5] R. Schafer, H. Schwarz, D. Marpe, T. Schierl, and T. Wiegand. MCTF and scalability extension of H.264/AVC and its applications to video transmission, storage, and surveillance. In Visual Communications and Image Processing, July 2005.
[6] European Telecommunications Standards Institute. Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Stage 1, September 2004.
[7] European Telecommunications Standards Institute. Digital cellular telecommunications system (Phase 2+); Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Stage 1, September 2004.
[8] European Telecommunications Standards Institute. Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Protocols and Codecs, March 2005.
[9] Samsung Electronics. Scalable Multimedia Broadcast and Multicast Service (MBMS), May 2002.
[10] G.J. Sullivan and T. Wiegand. Video compression - from concepts to the H.264/AVC standard. Proceedings of the IEEE, 93(1):18-31, 2005.
[11] T. Wiegand, G.J. Sullivan, and A. Luthra. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC). Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 2003.
[12] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G.J. Sullivan. Rate-constrained coder control and comparison of video coding standards. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):688-703, July 2003.
[13] T. Wedi and H.G. Musmann. Motion- and aliasing-compensated prediction for hybrid video coding. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):577-586, July 2003.
[14] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Subband Extension of H.264/AVC, March 2004.
[15] M. Flierl. Video coding with lifted wavelet transforms and frame-adaptive motion compensation. In VLBV, September 2003.
[16] H. Schwarz, D. Marpe, and T. Wiegand. MCTF and scalability extension of H.264/AVC. In PCS, December 2004.
[17] T. Schierl, H. Schwarz, D. Marpe, and T. Wiegand. Wireless broadcasting using the scalability extension of H.264/AVC. In ICME, July 2005.
[18] S. Wenger, M.M. Hannuksela, T.
Stockhammer, M. Westerlund, and D. Singer. RTP Payload Format for H.264 Video. IETF, February 2005.
[19] M. van der Schaar and D.S. Turaga. Multiple description scalable coding using wavelet-based motion compensated temporal filtering. In ICIP, September 2003.
[20] A.R. Reibman, H. Jafarkhani, Y. Wang, M.T. Orchard, and R. Puri. Multiple description coding for video using motion compensated prediction. In ICIP, 1999.
[21] C. Kim and S. Lee. Multiple description coding of motion fields for robust video transmission. IEEE Transactions on Circuits and Systems for Video Technology, 11(9):999-1010, September 2001.
[22] Y.K. Wang, M.M. Hannuksela, V. Varsa, A. Hourunranta, and M. Gabbouj. The error concealment feature in the H.26L test model. In ICIP, 2002.
[23] T. Stockhammer, M.M. Hannuksela, and T. Wiegand. H.264/AVC in wireless environments. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):657-673, July 2003.
[24] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Scalable Video Coding - Working Draft 2, April 2005.
[25] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620-636, July 2003.
[26] M.J. Weinberger, J.J. Rissanen, and M. Feder. A universal finite memory source. IEEE Transactions on Information Theory, 41(3):643-652, May 1995.
[27] D. Marpe, H. Schwarz, G. Blattermann, G. Heising, and T. Wiegand. Context-based adaptive binary arithmetic coding in JVT/H.26L. In ICIP, September 2002.
[28] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Common conditions for SVC error resilience testing, July 2005.
[29] S. Wenger. H.264/AVC over IP. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):645-656, July 2003.
[30] ITU-T VCEG. Common Test Conditions for RTP/IP over 3GPP/3GPP2, December 2001.