Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC)

by

Hassan Mansour

B.E., The American University of Beirut, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Applied Science

in

The Faculty of Graduate Studies (Electrical Engineering)

The University of British Columbia

August 29, 2005

© Hassan Mansour 2005

Abstract

Advances in digital video coding are pushing the boundaries of multimedia services and Internet applications to mobile devices. The Scalable Video Coding (SVC) project is one such development: it provides different video quality guarantees to end users on mobile networks that support different terminal capability classes, such as Digital Video Broadcast-Handheld (DVB-H) and the Third Generation Partnership Project's (3GPP) Multimedia Broadcast Multicast Service (MBMS). In this thesis, we propose a multiple-description coding scheme for SVC (MD-SVC) to provide error resilience to SVC and ensure the safe delivery of the video payload to end users on MBMS networks. Due to the highly error-prone nature of wireless environments, received video quality is normally guaranteed by either video redundancy coding, retransmissions, forward error correction, or error-resilient video coding. MD-SVC takes advantage of the layered structure of SVC to generate two descriptions (or versions) of the higher-layer (enhancement-layer) frames in SVC while utilizing Unequal Erasure Protection (UXP) to efficiently protect the base-layer frames. The result is three separable streams that can be transmitted over the same channel or over three separate channels to minimize the packet loss rate. The two enhancement descriptions are independently decodable; however, both descriptions depend on the error-free reception of the base layer. Furthermore, error detection and concealment features are added to the SVC decoder to cope with frame losses. The proposed scheme is implemented and integrated fully into the SVC codec and tested using a 3GPP/3GPP2 offline network simulator. Objective and subjective performance evaluations show that under the same packet loss conditions our scheme outperforms the single-description SVC and existing scalable multiple-description coding schemes.

Contents

Abstract
Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
1 Introduction
2 Digital Video Transmission and Coding, From H.264/AVC To SVC
  2.1 Multimedia Broadcast/Multicast Services
    2.1.1 MBMS Modes
  2.2 Overview of the H.264/AVC Standard
    2.2.1 Motion Estimation and Motion Compensation
  2.3 Scalable Video Coding
  2.4 Error Resilience in Video Coding
    2.4.1 Unequal Erasure Protection in SVC
    2.4.2 Multiple Description - Motion Compensated Temporal Filtering
3 Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC)
  3.1 Multiple Description Coding of High Pass (HP) Frames
    3.1.1 Multiple Description Coding of Prediction (Motion) Data
    3.1.2 Entropy Coding of Error Recovery Data
    3.1.3 Multiple Description Coding of Residual Data
  3.2 Theoretical Analysis
  3.3 Multiplexing and Error Concealment
4 Simulation Results
5 Conclusion and Future Work
Bibliography

List of Tables

2.1 MBMS video delivery bandwidth resources
3.1 MDC macroblock classification
3.2 MB mode and the associated coded BN indices
3.3 Syntax element classification in MDC
3.4 BestNeighbor symbol probabilities and the corresponding binstrings
3.5 BestNeighbor context models
3.6 Relation between motion index and the coded video file size
3.7 Error handling routine description
4.1 Coding configuration of the test sequences
4.2 MD scheme redundancy

List of Figures

2.1 Example of (a) Broadcast Mode and (b) Multicast Mode Networks
2.2 Macroblock and sub-macroblock partitions in H.264/AVC
2.3 Example of a coded video sequence
2.4 Hybrid encoder similar to the H.264/AVC encoder
2.5 Functional structure of the Inter prediction process
2.6 Motion vector and reference index assignment in the motion estimation process
2.7 MCTF temporal decomposition process
2.8 Basic structure of a 5 layer combined scalability encoder
2.9 UXP transmission sub-block (TSB) packetization
2.10 Coding structure of the MD-MCTF
3.1 Coding structure of the MD-MCTF
3.2 Single layer MCTF encoder with additional and modified modules for multiple-description coding
3.3 Basic structure of the Multiple Description Coding module with a two description coder of High Pass frames
3.4 Quincunx lattice structure of prediction data MBs
3.5 Typical motion data macroblock map of a HP frame in description 1
3.6 Change in motion vector prediction: (a) standard defined neighbor MBs, (b) MDC defined neighbor MBs
3.7 Neighboring 8x8 block indices
3.8 Fixed Length binarization for the BestNeighbor syntax element
3.9 Multiple-description coding of residual frames in a temporal-scalable layer with three temporal levels
3.10 Frequency response of the [-1/8 1/4 3/4 1/4 -1/8] low pass filter
3.11 Frequency response of the [-1/2 1 -1/2] high pass filter
3.12 Decomposition of a group of pictures (GOP) of size 8
3.13 Reconstruction of a group of pictures showing the MDC high pass frames with zero residual
3.14 Variation of motion index with respect to GOP number in the Crew sequence
3.15 Crew sequence showing (a) frames from GOP number 2 and (b) frames from GOP number 10
3.16 Comparison of the motion index values between the 113 coded frames of the four sequences Crew, City, Foreman, and Harbour
3.17 (a) Original Multiplex module of the SVC encoder; (b) modified Multiplex module to accommodate MDC frames
3.18 Basic structure of the MDC decoder with the Inverse MDC module shown in gray
3.19 Multiple-description coded group of pictures with four SVC layers of combined scalability
3.20 GOP with five possible frame-loss scenarios
4.1 RTP encapsulation of NALUs with varying sizes
4.2 CDMA-2000 user plane protocol stack packetization presented in VCEG-N80
4.3 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Crew sequence
4.4 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Foreman sequence
4.5 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Harbour sequence
4.6 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the City sequence
4.7 Comparison of MD-SVC and MD-MCTF redundancy rates and associated distortions
4.8 Comparison of visual quality from the Foreman sequence: MD-SVC (left), MD-MCTF (right)
4.9 Comparison of visual quality from the Crew sequence: MD-SVC (left), MD-MCTF (right)
4.10 Comparison of visual quality from the City sequence: MD-SVC (left), MD-MCTF (right)
4.11 Comparison of visual quality from the Harbour sequence: MD-SVC (left), MD-MCTF (right)

List of Acronyms

3GPP    3rd Generation Partnership Project
AVC     Advanced Video Coding
BLER    BLock Error Rate
BN      Best Neighbor
BS      Base Station
CABAC   Context-Adaptive Binary Arithmetic Coding
CAVLC   Context-Adaptive Variable Length Coding
CBP     Coded Block Pattern
CDMA    Code Division Multiple Access
CRC     Cyclic Redundancy Code
DCT     Discrete Cosine Transform
DVB-H   Digital Video Broadcasting - Handheld
FEC     Forward Error Correction
GERAN   GSM/EDGE Radio Access Network
GOP     Group Of Pictures
GSM     Global System for Mobile communications
HP      High Pass
ICT     Integer Cosine Transform
IPDC    IP Data Cast
LP      Low Pass
LPS     Least Probable Symbol
LTU     Logical Transmission Unit
MB      Macroblock
MBMS    Multimedia Broadcast Multicast Services
MCTF    Motion Compensated Temporal Filtering
MDC     Multiple Description Coding
MD-MCTF Multiple Description Motion Compensated Temporal Filtering
MD-SVC  Multiple Description Scalable Video Coding
MPEG    Moving Pictures Experts Group
MPS     Most Probable Symbol
MTAP    Multi-Time Aggregation Packet
MV      Motion Vector
NAL     Network Abstraction Layer
PDU     Protocol Data Unit
PLMN    Public Land Mobile Network
PPP     Point to Point Protocol
PSNR    Peak Signal to Noise Ratio
QoS     Quality of Service
QP      Quantization Parameter
RFS     Radio Frame Size
RLC     Radio Link Control
RS      Reed-Solomon
RTP     Real-time Transport Protocol
SAD     Sum of Absolute Differences
SD      Single Description
SDU     Service Data Unit
SNR     Signal to Noise Ratio
SSD     Sum of Square Differences
SVC     Scalable Video Coding
TSB     Transmission Sub-block
UDP     User Datagram Protocol
UE      User Equipment
UMTS    Universal Mobile Telecommunications System
UTRAN   Universal Terrestrial Radio Access Network
UXP     Unequal Erasure Protection
VCEG    Video Coding Experts Group

Acknowledgements

I would like to thank my supervisors, Professor Panos Nasiopoulos and Professor Victor Leung, for their time, patience, and support. To the reviewers, I express my gratitude for their constructive feedback. To my best friend and "buddy", Rayan, I would like to express my appreciation for always being there for me. I would also like to thank Lina and Rachel for tolerating my irritability during the last few weeks. Last but not least, I would like to thank my mom, dad, and brother for their endless love and support.

This work was supported by a grant from Telus Mobility, and by the Natural Sciences and Engineering Research Council of Canada under grants CRD247855-01 and CBNR11R82208.

Chapter 1

Introduction

The distribution of multimedia services to mobile devices is finally a reality. Multimedia service providers are teaming up with mobile telecommunications operators to deliver to the public numerous services, from transferring light video and audio clips to the heavy-duty streaming of mobile TV [1], [2].
There are currently two major solutions for the support of mobile multimedia delivery, namely, 3GPP's MBMS (Multimedia Broadcast/Multicast Service) and DVB's DVB-H (Digital Video Broadcast-Handheld), supported by Nokia [3]. These solutions, however, suffer from limited network bandwidth resources in addition to transmission errors that significantly affect the quality of the distributed video. MBMS and DVB-H therefore rely on recent advances in digital video coding technology to deliver video content that efficiently utilizes the allocated channel bandwidth while ensuring various QoS (Quality of Service) requirements.

The newest international video coding standard is H.264/AVC. Approved by ITU-T as Recommendation H.264 and by ISO/IEC as International Standard 14496-10 (MPEG-4 Part 10) Advanced Video Coding (AVC), H.264/AVC has proved its superiority to previous video coding standards (MPEG-4 Visual, MPEG-2, H.263) through its improved coding efficiency and its provision of a network-friendly video representation [4]. However, coding efficiency alone is not enough to support QoS features due to the highly error-prone nature of wireless environments and the unexpected fluctuation in available bandwidth. Therefore, the demand for bandwidth-adaptive codecs and robust error-resilient techniques is constantly increasing. The Scalable Video Coding (SVC) standardization project was launched by MPEG (Moving Pictures Experts Group) and ITU-T's Video Coding Experts Group (VCEG) in January 2005 as an amendment of their H.264/AVC standard to solve the bandwidth fluctuation problem and offer multiple QoS requirements to the end user [5]. However, SVC presently does not offer any error resilience features that can protect all the layers in the coded bitstream. Existing solutions to protect a coded video stream include the use of Forward Error Correction (FEC) codes or Unequal Erasure Protection (UXP) techniques to combat bit errors in wireless transmission. Unfortunately, these techniques fail to recover any packet losses that might occur due to network congestion.

An alternative to FEC and UXP is the implementation of Multiple Description Coding (MDC) of video content in order to protect against packet losses. In MDC, a coded video sequence is separated into multiple descriptions (or versions) of the coded sequence such that each description is independently decodable and provides a decoded video quality that is poorer than the original video quality. However, if all the descriptions are received by the decoder, then the original video quality should be generated. The independent decodability feature is made possible at the expense of additional coding overhead, also known as data "redundancy", between the various descriptions. The drawback of existing multiple-description coding techniques lies in the inefficient allocation of redundant data. The allocation of redundant data controls both the level of redundancy in the coded video bitstream and the quality (or distortion) produced from decoding only one description.

In this thesis, we develop a Multiple-Description Coding scheme (MD-SVC), specifically designed for the scalable extension of H.264/AVC (SVC), that dramatically improves the error resilience of the SVC codec.
Our proposed MD-SVC scheme generates two descriptions of the enhancement layers of an SVC coded stream by embedding in each description only half of the motion information and half of the texture information of the original coded stream, with a minimal degree of redundancy. The two descriptions are complementary but independently decodable, such that if only one description is received, the decoder will be able to recover the missing motion information from the available data and generate an output video sequence with an acceptable degradation in quality. If both descriptions are received, then the full quality of a single-description SVC stream is delivered. Furthermore, we have implemented our proposed scheme and integrated all of its functionalities into the existing SVC standard. We have also added error detection and error concealment features to the SVC decoder, which previously had none. These error handling routines help the decoder cope with packet losses that might arise due to the unexpected fluctuations in available bandwidth. The proposed framework thus provides a highly error-resilient video bitstream that requires no retransmissions or feedback channels, while minimizing any channel overhead imposed by the video redundancy due to multiple description coding.

The rest of this thesis is organized as follows. In Chapter 2, we discuss some of the features supported by the MBMS video distribution framework. Next, we introduce the basic functions of digital video coding as implemented in H.264/AVC, leading up to the motion-compensated temporal filtering structure of SVC. We develop in Chapter 3 our Multiple-Description coding scheme and give a detailed account of its integration into the SVC codec. We also offer a theoretical analysis of the redundancy and distortion imposed by our scheme. In Chapter 4, we describe the network simulation setup and compare the performance of our MD-SVC scheme with other multiple description and single description coding methods. Finally, we conclude our work in Chapter 5 and offer suggestions for future research in this field.

Chapter 2

Digital Video Transmission and Coding, From H.264/AVC To SVC

2.1 Multimedia Broadcast/Multicast Services

While DVB-H is still in its testing stages, 3GPP's MBMS has already completed its first stage with Release 6 of the Universal Mobile Telecommunications System (UMTS) in September of 2004 [6]. Broadcast and Multicast are two IP Datacast (IPDC) type services for transmitting datagrams from a single server to multiple clients (point-to-multipoint) that can be supported by the existing GSM (Global System for Mobile communications) and UMTS cellular networks [6], [1].

2.1.1 MBMS Modes

MBMS defines two functional modes, broadcast mode and multicast mode. The MBMS multicast mode differs from the broadcast mode in that it requires a client to subscribe to and activate an MBMS service, whereas the broadcast mode does not. The broadcast mode is generalized as a unidirectional point-to-multipoint bearer service in which multimedia data is transmitted from one server to multiple users in a broadcast service area. This service efficiently utilizes the available radio or network resources by transmitting data over a common radio channel that can be received by all users within the service area. The multicast mode similarly allows a unidirectional point-to-multipoint transmission of multimedia data, but the users are restricted to those clients that belong to the multicast subscription group.
[6] specifies that MBMS data transmission should have the ability to adapt to different RAN (Radio Access Network) resources and capabilities by efficiently managing the bitrate of the MBMS data. Furthermore, individual broadcast/multicast services are allocated for independent broadcast areas/multicast groups that may overlap. Quality of Service (QoS) guarantees can also be independently configured by the PLMN (Public Land Mobile Network) for each individual broadcast/multicast service [6]. Figure 2.1 shows examples of the Broadcast Mode and Multicast Mode networks as depicted in [6].

Figure 2.1: Example of (a) Broadcast Mode and (b) Multicast Mode Networks.

Handoff between different operators sharing the broadcast service area or multicast subscription group is allowed in MBMS [7]. Therefore, the broadcast/multicast resources will be allocated for all subscribers of a certain operator A in addition to all inbound roamers of operator A. However, if there are not enough resources (such as available bandwidth) to provide the requested service, the MBMS application will have to support the QoS requirements instead. MBMS User Services may be delivered to a user at different bit rates and quality of service depending on the radio networks and conditions. Table 2.1 lists the bandwidth resources allocated for video streaming and video download as defined in [7]. Note that the specified bit-rates are those of the user data at the application layer; in GERAN, lower bandwidth is available, which may constrain some applications. Since MBMS applications must adapt to the resource heterogeneity in UTRAN (Universal Terrestrial Radio Access Network) and GERAN (GSM/EDGE Radio Access Network), it falls upon the design of the multimedia content to provide the required scalability, which gives rise to the need for scalable content, especially in video streaming services.

Table 2.1: MBMS video delivery bandwidth resources.

Service            | Media                                               | Distribution Scope | Classification | Application Bit-Rate
Video streaming    | Video and auxiliary data such as text, still images | Broadcast          | Streaming      | < 384 kbps
Video streaming    | Video and auxiliary data such as text, still images | Multicast          | Streaming      | < 384 kbps
Video distribution | Video and auxiliary data such as text, still images | Broadcast          | Download       | < 384 kbps
Video distribution | Video and auxiliary data such as text, still images | Multicast          | Download       | < 384 kbps

MBMS video streaming services also suffer from packet losses and transmission errors that arise from multi-path fading, channel interference, network congestion, and noise. MBMS utilizes Forward Error Correction (FEC) schemes in order to protect the data packets during transmission [8]. The generic mechanism employed for systematic FEC of RTP streams involves the generation of two RTP payload formats, one for FEC source packets and another for FEC repair packets. FEC schemes use Reed-Solomon (RS) codes to protect the data content. A Reed-Solomon code with x symbols of error protection and a codeword length of n symbols is labeled an (n, n-x) RS code. The imposed overhead can then be calculated as n/(n-x) of the original data size. For example, if the codeword length is 255 bytes with a symbol error protection of 50 bytes, then the overhead is equal to 255/205 = 1.24, which is a 24% increase in data size.
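To make the overhead arithmetic concrete, the short sketch below (our illustration, not part of the MBMS specification) computes the expansion factor n/(n-x) for the example above and for a second, hypothetical protection level:

```python
def rs_overhead(n, x):
    """Expansion factor of an (n, n-x) Reed-Solomon code: n / (n - x)."""
    if not 0 <= x < n:
        raise ValueError("need 0 <= x < n")
    return n / (n - x)

# The example from the text: 255-byte codewords, 50 parity bytes.
print("(255, 205) RS: x%.2f" % rs_overhead(255, 50))  # x1.24 -> 24% overhead
# A lighter, hypothetical protection level for comparison:
print("(255, 235) RS: x%.2f" % rs_overhead(255, 20))  # x1.09 ->  9% overhead
```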
Although FEC can be an effective tool, the problems that arise from its implementation include a lack of flexibility in error protection, since the entire bitstream has to be equally protected. Moreover, the redundancy added by including the FEC packets can significantly increase the video payload bit-rate and in turn cause additional congestion in the network. Alternatively, unequal error protection, unequal erasure protection, and error-resilient video coding techniques are more desirable solutions.

Finally, MBMS allows the use of multiple channels for the transmission of broadcast data in order to reduce the transmission power of a broadcast station and allow for flexible resource management. The separate physical channels can have varying power allocation and can thus cover different or overlapping portions of a service area. This scalability option would offer the entire broadcast service area a base quality of service guarantee using a high-powered base channel, and the core of the service area an enhanced quality of service through another low-powered enhancement channel [9], [1].

2.2 Overview of the H.264/AVC Standard

Digital video coding was first made possible on a worldwide basis with the establishment of the MPEG-1 standard back in 1992. Over the past fifteen years, significant advances have been achieved with the launch of several international video coding standards, such as the well-known MPEG-2 standard, which set off the DVD industry, in addition to more recent standards including ITU-T's H.263 series and ISO/IEC's latest MPEG-4 standard [4], [10]. Video compression is achieved by applying a series of processes, categorized in [10] into the following:

• A prediction process that takes advantage of the spatial correlation within a single image and the temporal correlation between successive images to send "prediction" information (motion vectors, reference indices) to the decoder to help reconstruct a "prediction image". The differences in sample values (residual components) between the original image and the prediction image are then compressed using a different process.

• A transform process that converts the sample values (usually applied to the residual components) into a set of samples where most of the signal power is grouped into fewer samples. The most commonly used transform in image and video coding is the discrete cosine transform (DCT).

• A quantization process that reduces the precision of representing a sample value in order to decrease the amount of data to be coded.

• An entropy coding process that takes advantage of the symbol probabilities to generate binary representations of the coded symbols.
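As a toy illustration of the prediction, transform, and quantization steps in this chain (our own sketch: a real H.264/AVC encoder folds QP-dependent scaling matrices into the quantizer, which a single uniform step stands in for here), the following applies the well-known 4x4 integer core transform to a residual block:

```python
import numpy as np

# The 4x4 integer core transform matrix used by H.264/AVC.
C = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

original  = np.arange(16).reshape(4, 4)       # toy current block
predicted = np.full((4, 4), 7)                # toy prediction block
residual  = original - predicted              # prediction step

coeffs = C @ residual @ C.T                   # transform step: Y = C X C^T
levels = np.round(coeffs / 32.0).astype(int)  # quantization step (uniform)
print(levels)  # energy concentrates in a few low-frequency coefficients
```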
The H.264/AVC standard defines two interworking layers: the video coding layer, which performs the video compression, and the network abstraction layer, which packs the coded data into "network friendly" units that can be easily passed on to different communications networks. In this overview, we will focus our discussion on the video coding layer.

The H.264/AVC video coder is a block-based hybrid video coder that utilizes all of the above-mentioned techniques for video coding and applies them on a block basis. Therefore, every input image to the video coder is first split into a group of macroblocks, each containing 16x16 samples. In addition to the macroblock structure, H.264/AVC also defines smaller block structures that are partitions of the macroblocks and sub-macroblocks to better capture the details in an image. Figure 2.2 shows the possible partitions supported in H.264, extracted from [11].

Figure 2.2: Macroblock and sub-macroblock partitions in H.264/AVC (one, two, or four macroblock partitions down to 8x8, and one, two, or four sub-macroblock partitions down to 4x4).

The two major components of H.264/AVC that contribute to its superior coding efficiency are the prediction process and the entropy coding process. We will start by giving a brief functional overview of the encoder and then focus on the above-mentioned components. There are three types of coded pictures supported in H.264, namely, I pictures, P pictures, and B pictures. An I picture (or frame) is coded using Intra-frame prediction as an independent, self-contained compressed picture. P and B pictures, on the other hand, are Inter-frame predicted: from previous pictures only in P frames, or from both previous and future pictures in B frames. The prediction process itself will be explained in detail in the following section. Given a sequence of pictures, the H.264 encoder codes the first picture as an I-frame. The remaining pictures can be coded as either P-frames or B-frames based on the delay restrictions imposed by the video application (B-frames require more delay but provide better coding efficiency). Figure 2.3 shows a sequence of coded pictures (I B B P B B P B B I). The arrows indicate which pictures act as references to other pictures.

Figure 2.3: Example of a coded video sequence.

Notice in Figure 2.3 that P-frames can only use other previously coded P- or I-frames as references. B-frames are never used as references for motion prediction. Therefore, an input picture first enters the H.264/AVC encoder, where an Intra- or Inter-frame prediction decision is made to indicate the type of picture to be used. If Intra-frame prediction is to be implemented, then every macroblock in the picture is Intra-coded. If Inter-frame prediction is used, then a motion estimation process is initiated on the macroblock level to find matching macroblocks in the reference frames. After the motion estimation process is completed, a full frame containing the prediction information (motion vectors and reference indices) is produced, which is used to generate a prediction image composed of the motion-compensated blocks, as will be discussed in the next section. The motion-compensated image is then subtracted from the original input picture to obtain a residual image.
The integer cosine transform is then applied to the residual samples in order to exploit the correlation in the residual samples. Next, the transformed coefficients are quantized, and finally the quantized samples are entropy coded and passed on to the network abstraction layer to be packetized and transmitted across the communications network [4]. Figure 2.4, extracted from [10], gives an example of the functional blocks of the H.264/AVC encoder we just described.

Figure 2.4: Hybrid encoder similar to the H.264/AVC encoder.

Note that in the Inter-prediction mode, a P- or B-frame can contain Intra-coded macroblocks if the Intra coding mode reduces a rate-distortion optimization equation.

2.2.1 Motion Estimation and Motion Compensation

Compression is achieved in I-frames using Intra prediction, a process that takes advantage of the spatial correlation between the samples of a picture. Intra prediction is performed on a macroblock or sub-macroblock basis. For every macroblock, the values of the samples located within the macroblock bounds are estimated using the boundary samples of the neighboring left and above macroblocks. The estimation is performed using a set of predefined Intra-prediction modes. A detailed discussion of the Intra prediction modes can be found in [11] and [4]; however, it is out of the scope of this thesis, since our multiple-description coding scheme will not deal with Intra-coded frames or macroblocks.

The Inter prediction process, on the other hand, takes advantage of temporal redundancies found in successive pictures in a video sequence. In Inter-frame prediction, the encoder runs two complementary processes, called motion estimation and motion compensation, on a block basis to find a match in the neighboring frames for every block within an Inter-coded frame. Figure 2.5 illustrates the general functional structure of the Inter prediction process. Note that only the residual frame is passed through the transform, scaling, and quantization processes. The prediction frame is only entropy coded and multiplexed with the coded residual frame on a macroblock level before being transmitted.

Figure 2.5: Functional structure of the Inter prediction process.

Figure 2.6 demonstrates the motion estimation process, where one block in the current picture, labeled frame N, is estimated by a block of the same size in frame N-2. The output of the motion estimation process is a set of motion vectors (MV) and reference indices (r) for every estimated block in the Inter-predicted frame. Figure 2.6 shows that the current block is allocated a motion vector MV (shown in red) that represents the spatial shift between the matching block and the co-located block in the reference frame.

Figure 2.6: Motion vector and reference index assignment in the motion estimation process.

As a result, the motion estimation process generates a complete frame containing only motion information, such as macroblock mode, motion vectors, and reference indices. Let us refer to this frame as the "prediction frame". The choice of the best matching block is made based on a rate-distortion tradeoff. Lagrangian optimization is used to minimize both the bit-rate required to encode the block and the distortion resulting from the chosen estimate. The Lagrangian cost function to be minimized is therefore expressed as

$$J(m, r \mid QP, \lambda) = D(m, r \mid QP) + \lambda R(m, r)$$

where D is the distortion function, R is the rate for encoding the motion information, and λ is the Lagrangian parameter, specified in [12] to be equal to

$$\lambda = \begin{cases} \sqrt{0.85 \times 2^{(QP-12)/3}}, & \text{if sum of absolute difference (SAD) distortion is used} \\ 0.85 \times 2^{(QP-12)/3}, & \text{if sum of squared difference (SSD) distortion is used} \end{cases}$$

The distortion term is also expressed as

$$D_{err}(P, r_{est}, m_{est}) = \sum_{i,j \in P} \big| l_{orig}[i, j] - l_{est}[i + m_{est,x},\; j + m_{est,y}] \big|^{n}$$

where P is the macroblock partition, r the reference index, m the motion vector, l[] the sample value, and n is equal to 1 if err = SAD or 2 if err = SSD.
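The sketch below (ours) evaluates this Lagrangian cost over a toy candidate set using SAD distortion and the λ expression above; the candidate blocks and their rates are made-up placeholders:

```python
import math

def lagrange_lambda(qp, distortion="SSD"):
    """Lagrangian multiplier from [12]: 0.85 * 2**((QP - 12) / 3) for SSD,
    and its square root when SAD distortion is used."""
    lam = 0.85 * 2 ** ((qp - 12) / 3)
    return math.sqrt(lam) if distortion == "SAD" else lam

def sad(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def best_candidate(current, candidates, rates, qp):
    """Pick the candidate minimizing J = D + lambda * R, with D the SAD of
    the flattened blocks and R the bits needed to code each candidate."""
    lam = lagrange_lambda(qp, "SAD")
    costs = [sad(current, c) + lam * r for c, r in zip(candidates, rates)]
    return costs.index(min(costs))

current = [100, 102, 98, 101]  # toy 2x2 block, flattened
candidates = [[99, 101, 99, 100], [90, 95, 92, 99], [100, 102, 98, 101]]
# 1: the cheap-to-code candidate wins despite its higher distortion
print(best_candidate(current, candidates, rates=[6, 2, 20], qp=28))
```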
The prediction frame next enters the motion compensation process, which reconstructs the estimated blocks using the motion vectors and reference indices generated by motion estimation. The output of motion compensation is a blocky image that approximates the original image. The motion compensation process itself is a direct inversion of the motion estimation process, where the reference blocks are grouped together into one image. This motion-compensated image is then subtracted from the original image to produce a residual image, to which the transform, scaling, and quantization processes are applied before pairing the blocks up with their respective motion data and entropy coding the entire coded frame [11], [4], [10], [13].

2.3 Scalable Video Coding

The subband extension of H.264/AVC was developed at the Heinrich Hertz Institute (HHI) in Germany. In a document [14] presented at the JVT Munich meeting in March of 2004, the authors produced a subband extension of H.264/AVC using Motion Compensated Temporal Filtering (MCTF) based on the lifting representations of the Haar wavelet and the 5/3 wavelet. The design keeps most of the components of the original H.264/AVC standard, while only a few adjustments have been made to support the MCTF structure. The lifting representations of the Haar and 5/3 wavelets are used as filter banks in uni-directional and bi-directional temporal prediction, respectively. Previous attempts have been made to use lifting wavelet transforms with frame-adaptive motion compensation in the building of video codecs [15]. However, the advantage of [14] lies in the use of the highly efficient motion model of H.264/AVC along with adaptive switching between the Haar and 5/3 spline wavelets on a block basis.

A group of pictures (GOP) of the original video stream is decomposed into a set of High Pass (or difference) frames after a prediction step and a set of Low Pass (or average) frames after an update step [14]. Figure 2.7, extracted from [5], illustrates the temporal decomposition of a group of 8 pictures. A High Pass (HP) frame is produced after the prediction step and a Low Pass (LP) frame is produced after the update step. Each stage has half the temporal resolution of the original stream. A high pass frame is equivalent to a coded B-frame in H.264/AVC: it contains both residual data and prediction (motion) data. Each following stage uses the LP frames from the previous stage to produce its respective HP and LP frame sets, which in turn have half the temporal resolution of the previous stage.

Figure 2.7: MCTF temporal decomposition process.

The following equations describe the extended motion-compensated temporal filtering prediction and update operators implemented in SVC:

$$P_{Haar}(s[x, 2k+1]) = s[x + m_{P_0},\, 2k - 2r_{P_0}]$$
$$U_{Haar}(s[x, 2k]) = \tfrac{1}{2}\, h[x + m_{U_0},\, k + r_{U_0}]$$
$$P_{5/3}(s[x, 2k+1]) = \tfrac{1}{2}\big( s[x + m_{P_0},\, 2k - 2r_{P_0}] + s[x + m_{P_1},\, 2k + 2 + 2r_{P_1}] \big)$$
$$U_{5/3}(s[x, 2k]) = \tfrac{1}{4}\big( h[x + m_{U_0},\, k + r_{U_0}] + h[x + m_{U_1},\, k - 1 - r_{U_1}] \big) \qquad (2.1)$$

where s[] and h[] indicate the y, u, and v samples in the original and high pass frames, respectively, x refers to the luma spatial coordinates of the samples, k is the picture temporal index, and m and r are the prediction information.
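To illustrate the lifting structure, here is a minimal one-dimensional sketch (ours) of the Haar prediction and update steps with motion compensation disabled (m = 0, r = 0), in which case the operators reduce to differences and half-sums of neighboring frames; the lifting form guarantees perfect reconstruction:

```python
import numpy as np

def haar_mctf_decompose(frames):
    """One Haar lifting stage with zero motion:
    prediction: h[k] = s[2k+1] - s[2k]    (high pass frames)
    update:     l[k] = s[2k]   + h[k] / 2 (low pass frames)
    Each output set has half the temporal resolution of the input."""
    s = np.asarray(frames, dtype=float)
    h = s[1::2] - s[0::2]
    l = s[0::2] + h / 2
    return l, h

def haar_mctf_reconstruct(l, h):
    """Invert the lifting steps exactly."""
    even = l - h / 2
    odd = even + h
    out = np.empty(2 * len(l))
    out[0::2], out[1::2] = even, odd
    return out

gop = np.array([10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0, 18.0])  # toy GOP of 8
l, h = haar_mctf_decompose(gop)
print(np.allclose(haar_mctf_reconstruct(l, h), gop))  # True
```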
It is also shown in [14] that using this model it is possible to produce a fully scalable video codec offering temporal, spatial, and SNR scalability. We are interested in the case of combined scalability, such that, as we move from one layer to the subsequent one, the bitstream can lose spatial resolution, temporal resolution, and SNR quality by down-sampling the decomposed sequence spatially and/or temporally. The extended coding structure of an SVC encoder, as specified in [5], is shown in Figure 2.8. Spatial down-sampling between layers is used to reduce spatial resolution. Temporal scalability is controlled by limiting the number of LP and HP frames transmitted within a specific layer. The number of HP frames transmitted can also control the SNR scalability. When all layers are available, the decoder can reconstruct the original video stream in full quality and resolution [16].

Figure 2.8: Basic structure of a 5 layer combined scalability encoder (an AVC-compatible encoder at layer 0 and spatial enhancement layer encoders above it, each performing inter-layer motion and texture prediction, motion-compensated temporal filtering, base layer coding, and progressive refinement texture coding for SNR scalability).

This combined scalability helps the bitstream adapt to the conditions of the channel, namely fluctuations in bandwidth and network congestion. However, the bitstream is still vulnerable to bit errors, which may lead to the dropping of packets. If the packets contain HP information, then the decoder can be modified to recover from the packet loss by inserting a zero residual frame in place of the lost frame. If, on the other hand, an LP packet is dropped, then the consequence is more severe, since that can greatly affect the PSNR and quality of the decoded stream.

2.4 Error Resilience in Video Coding

Error correction schemes such as the commonly used forward error correction (FEC) can be implemented to combat the problem of bit errors in transmission networks. However, protection of the entire bitstream requires large overhead, which can be very costly.
Alternatively, [17] suggests the use of unequal error or erasure protection (UEP/UXP) schemes to protect only the important frames or layers in a video bitstream. Other error-resilience techniques have targeted the design of the video bitstream itself to generate an inherently error-resilient coded video representation through multiple description coding. Multiple description coding schemes generate several versions of a video sequence, which are transmitted either on the same channel or on different channels and paths in order to reduce the probability of transmission errors.

2.4.1 Unequal Erasure Protection in SVC

Unequal Erasure Protection (UXP) is used to provide content-aware protection to a video bitstream. [17] develops a UXP scheme suited for protecting the base layer of an SVC bitstream; this scheme cannot be applied to the enhancement layers of SVC. We will give a brief overview of the erasure protection scheme implemented in [17]. UXP defines several protection classes for the different layers in a scalable video bitstream. Each of these classes is characterized by a different symbol erasure protection level. Figure 2.9 demonstrates the UXP procedure as it is illustrated in [17]. A transmission sub-block (TSB) contains a number of protected NAL units separated by Multi-Time Aggregation Packet (MTAP) headers. The MTAP headers specify the NAL unit size, the RTP timestamp, and a Decoding Order Number (DON). The DON is used to fix any higher-level interleaving problems that might arise due to a reordering of the decoding order of NALs. For further detail on the functioning of MTAP, please refer to [18]. The TSBs are then grouped into one Transmission Block (TB), which, in turn, is encapsulated into several RTP packets. Simulations performed in [17] show that this UXP scheme can successfully protect the base layer information of an SVC bitstream at network packet loss rates of more than 30%. The UXP scheme described above is not designed to protect higher-layer packets; therefore, another error resilience scheme is required to protect those packets from network errors.
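As a schematic illustration of the unequal protection idea (ours: the actual class structure, parity allocations, and interleaving are specified in [17], and the class names and numbers below are invented), each protection class simply maps to a different (n, n-x) RS code:

```python
# Illustrative only: UXP assigns each protection class its own (n, n-x)
# Reed-Solomon code, so more important data receives more parity symbols.

N = 255  # codeword length in symbols

protection_classes = {
    "base layer (strongest)":  50,  # parity symbols per codeword
    "middle layers":           30,
    "highest layer (weakest)": 10,
}

for name, parity in protection_classes.items():
    print("%-25s (%d,%d) RS, overhead x%.2f"
          % (name, N, N - parity, N / (N - parity)))
```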
2.4.2 Multiple Description - Motion Compensated Temporal Filtering

Recent studies in Internet error resilience technology have shown that Multiple Description Coding (MDC) along with path or server diversity can reduce the effects of packet delay and loss [19], [20]. Although this technique can be very effective in combating bit errors, it also suffers from data redundancy and thus requires additional bandwidth. However, it falls upon the design of the multiple descriptions to solve the problem of data redundancy. Multiple Description Scalable Coding (MDSC) merges scalable coding with MDC; scalable coding facilitates the adaptability of the different video descriptions to variations in channel bandwidth [19]. One approach to MDSC is Multiple Description Motion Compensated Temporal Filtering (MD-MCTF), developed in [19]. This approach applies MDC to a wavelet-based MCTF codec, similar to SVC, by dividing the HP frames between the different video descriptions and keeping the lowest-layer LP frames as redundant data inside the streams. The motion information in each HP frame is also replicated in both streams. Figure 2.10 illustrates the frame coding order of the MD-MCTF coding scheme.

Figure 2.10: Coding structure of the MD-MCTF (an original coded sequence L, H1, ..., H7 and the two descriptions derived from it).

In each of the two descriptions, the lightly shaded HP frames contain only motion information; all residual components are set to zero. The problem with this approach is that most of the redundant bit allocation is dedicated to duplicating the motion information. Texture information, on the other hand, is sacrificed to reduce the redundancy. However, this sacrifice in texture data results in increased quality degradation, which is very undesirable.

Chapter 3

Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC)

Scalable Video Coding (SVC) clearly solves the network heterogeneity problem faced in the current MBMS architecture. The prevalent problem with this coding technique is its lack of error resilience schemes that can ensure the safe delivery of coded video data to its destination. We have seen that there are currently several error resilience techniques being employed to protect the video data; however, all of these techniques suffer from limited protection of the entire video data or from extensive redundancy in the coded video representations. For this reason, we propose a new Multiple Description Coding scheme specifically designed for protecting the high pass frames of an SVC-compatible bitstream. We shall call our scheme Multiple Description Scalable Video Coding (MD-SVC).

MD-SVC takes advantage of the layered structure of SVC to generate two descriptions of each enhancement layer of an SVC coded video, providing multiple coded versions of the video layers with minimum redundancy. The SVC structure is modified to allow the encoder to create two complementary descriptions of HP frames, such that each description is ensured to be independently decodable. The challenge thus lies in creating a description of every SVC layer that will produce an acceptable video quality when decoded separately, while limiting the redundancy induced by MDC. The target of our coding scheme is to be able to support networks and users with heterogeneous capabilities while providing an error-resilient video representation with the minimum possible redundancy. Figure 3.1 shows one example of the applications that can be serviced using our MD-SVC coding scheme over 3G MBMS networks, with a server delivering the base layer over channel 1 and descriptions 1 and 2 over channels 2 and 3 to terminals such as cell phones, PDAs, and handheld computers.

Figure 3.1: Coding structure of the MD-MCTF.

The proposed framework describes two logical layers, a base layer and a multiple-description coded enhancement layer. These logical layers are composed of K SVC layers of the coded video stream, where K is an integer. The SVC layers will be referred to as SVCx, where x stands for 0, 1, 2, ..., K-1. The base layer contains the SVC0 layer frames along with all the LP frames belonging to the SVC1, SVC2, ..., SVC(K-1) layers. The enhancement layer contains two descriptions of the HP frames belonging to the SVC1, SVC2, ..., SVC(K-1) layers. All SVC layers considered are temporally scalable.
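The sketch below (ours, with illustrative names) shows how this logical-layer definition routes coded frames into the three separable streams:

```python
# Sketch: routing coded frames of K SVC layers into the three separable
# streams of the proposed framework. The base stream carries all SVC0 frames
# plus the LP frames of every higher layer; the two description streams share
# the HP frames of layers SVC1 .. SVC(K-1).

def route_frame(layer, frame_type, description=None):
    """layer: SVC layer index; frame_type: 'LP' or 'HP';
    description: 1 or 2 for the MDC copies of enhancement-layer HP frames."""
    if layer == 0 or frame_type == "LP":
        return "base"
    return "description%d" % description

print(route_frame(0, "HP"))      # base (all SVC0 frames)
print(route_frame(2, "LP"))      # base (LP frames of every layer)
print(route_frame(1, "HP", 1))   # description1
print(route_frame(3, "HP", 2))   # description2
```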
In addition to temporal scalability, if an SVC layer is also a spatially scalable layer, the HP frames include both prediction data and residual (texture) data. If, on the other hand, an SVC layer is an SNR scalable layer, then the HP frames contain only residual (texture) data. Hence, creating multiple descriptions of HP frames entails generating multiple copies of both prediction data and texture data in a way that reduces the redundancy between the descriptions while ensuring independent decodability of each description. Therefore, the SVC encoder is modified in order to incorporate all of the above-mentioned features.

As discussed in Section 2.3, the SVC encoder supports three types of scalability: spatial, temporal, and SNR (quality) scalability. The core of the coding engine is the motion-compensated temporal filtering (MCTF) coder that utilizes lifting representations of the Haar and 5/3 wavelet filters to provide temporal scalability in a video bitstream. Consequently, our multiple-description coding framework is integrated into the existing MCTF coder by inserting new modules and modifying existing ones. Figure 3.2 shows a modified single layer MCTF encoder with the additional and modified modules marked in gray.

Figure 3.2: Single layer MCTF encoder with additional and modified modules for multiple-description coding (the MCTF loop with its decompose prediction and decompose update stages, a Multiple Description Coding module producing D1/D2 motion and HP texture, and a modified Multiplex module producing the scalable bitstream).

The MCTF loop shown in Figure 3.2 takes as input a group of pictures (GOP) and performs log2(GOPSize) stages of motion-compensated prediction. In every stage, the input pictures first pass through a decompose prediction stage where the motion information and high pass (HP) texture frames are produced. Next, the motion and texture frames are fed along with the input pictures to a decompose update module, which produces the low pass (LP) texture frames. Note that every stage produces a set of LP frames that are decimated by two in time, i.e., the number of LP frames at the end of a stage is half the number of input pictures to that stage. Our multiple-description coding framework is restricted to generating two descriptions of the high pass frames. Therefore, the motion and HP texture frames are passed as input to the Multiple Description Coding module, discussed in detail in the next section, which produces two copies of each input, labeled D1 and D2 to indicate description one and description two, respectively. Moreover, the Multiplex module is also modified to manage the order of frames in the output bitstream, as will be shown in a later section.

3.1 Multiple Description Coding of High Pass (HP) Frames

High Pass (HP) frames constitute most of the transmitted frames in an SVC bitstream: for every GOP, only one frame is a low pass (LP) frame and GOPSize - 1 frames are HP frames. Moreover, depending on the level of motion found in a video sequence, HP frames can occupy around 50% to 70% of the bitrate allocated for a scalable SVC layer. Therefore, in error-prone environments, multiple-description coding of HP frames is a necessary and attractive error resilience tool, since it minimizes the coding overhead as opposed to replicating these frames.
Every HP frame contains both motion and texture information, so it is desirable to create multiple descriptions of both. Available multiple description techniques have only dealt with texture information, while motion information was replicated in each description, thus increasing the redundancy between the descriptions. The multiple-description framework we developed generates two descriptions of both motion data and texture data. A HP frame is first separated into two frames, a motion frame and a texture frame. Next, each frame is handled separately and follows a different coding path. Figure 3.3 shows the two-description coding structure of the MDC module presented in Figure 3.2.

Figure 3.3: Basic structure of the Multiple Description Coding module with a two description coder of High Pass frames (motion frames pass through a Quincunx Lattice separator and Best Neighbor Estimators to produce D1 and D2 motion; texture frames are split by frame number modulo 2 into D1 and D2 HP texture).

Notice that the two-description coding of motion data is performed on the intra-frame level (an individual frame basis), whereas the coding of texture data is performed on the inter-frame level (a group-of-frames basis). In other words, every motion frame is separated into two descriptions, whereas the texture frames in a group of pictures (GOP) are split based on the frame number. In the following sections, we offer a detailed illustration of the two-description coding procedure for high pass motion and texture data.

3.1.1 Multiple Description Coding of Prediction (Motion) Data

One approach to multiple-description coding of the motion information of temporally scalable video streams can be found in [19]. This approach includes a copy of all the motion information in each description, which increases the video redundancy between the multiple descriptions. In order to minimize this redundancy, we have developed a motion coding method that reduces the number of coded motion vectors (MVs) by half and predicts the missing MVs using information from the existing neighboring motion vectors.

As a first step, the motion MBs are separated between the two descriptions using the quincunx lattice structure adopted in [21]. Figure 3.4 shows a 5x7 MB portion of the motion data of a HP frame; the D1 and D2 labeled MBs are placed in the first and second descriptions of the HP frame, respectively. However, this produces HP frames with only half the motion information. For the remaining motion data MBs that are not assigned to a particular description, new syntax elements are devised to help predict the missing motion information using the neighboring correctly transmitted MBs specific to the description. Therefore, a typical MDC HP frame in the first description, for example, would contain the motion data shown in Figure 3.5: all D2 MBs of the original HP frame are now labeled multiple-description coded (MDC) macroblocks.

Figure 3.4: Quincunx lattice structure of prediction data MBs.

Figure 3.5: Typical motion data macroblock map of a HP frame in description 1.
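A sketch (ours) of the quincunx split follows; we assume the checkerboard phase in which MBs with an even row-plus-column sum go to description 1, with the opposite phase feeding description 2:

```python
import numpy as np

def quincunx_split(mb_rows, mb_cols):
    """Checkerboard (quincunx) assignment of motion macroblocks: MBs whose
    row + column sum is even go to description 1, the rest to description 2.
    Returns a map of 1s and 2s over the MB grid."""
    r, c = np.indices((mb_rows, mb_cols))
    return np.where((r + c) % 2 == 0, 1, 2)

# The 5x7 MB portion of Figure 3.4:
print(quincunx_split(5, 7))
```

In each description, the MBs assigned to the other description are not simply dropped: as described next, they carry recovery data that lets the decoder estimate the missing motion.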
An important condition imposed by multiple description coding is the independent decodability of each description. Since only every other motion MB from the original frame is available per MDC frame, slight modifications to the motion vector coding process specified in [11] are required. In order to reduce the size of the coded motion information, [11] specifies that for every coded block, the associated motion vector (MV) is subtracted from a predicted MV, and only this difference is coded and transmitted in the bitstream. The motion vector prediction process described in [11] can be summarized as follows:

• Check the availability of the left, above, and either the above-right or above-left blocks.

• The prediction MV is equal to the motion vector of one of the above-mentioned blocks, or to the median of three of the motion vectors, based on block availability and the current MB mode.

In the case of MDC frames, however, independent decodability of each description is a priority. This imposes a constraint on the selection of the neighboring blocks used for the motion vector prediction process, since the left and above MBs are expected to be missing. As a result, MDC skips one MB to the left and uses both the above-left and above-right MBs as the neighboring blocks for motion vector prediction. Figure 3.6 demonstrates this change in the motion vector prediction process.

Figure 3.6: Change in motion vector prediction: (a) standard defined neighbor MBs, (b) MDC defined neighbor MBs.

The MDC labeled macroblocks are not left as empty sets. These MBs contain necessary motion information, calculated by the encoder, that enables the decoder to recover the closest match possible to the missing motion vectors using the neighboring correctly coded MBs. The encoder first copies one set of motion macroblocks of a HP frame into the respective description. For instance, an MDC HP frame belonging to description (or stream) one will first contain all D1 labeled MBs, as shown in Figure 3.5.

A new MDC flag is added to the macroblock syntax elements to distinguish between MDC and non-MDC coded blocks. The MDC flag is reset for all D1 labeled MBs and set for all MDC labeled MBs. The MDC labeled MBs are separated into two classes based on the coded MB mode and base layer dependence. Since most motion bits are allocated to representing the motion vector field, the goal is to reduce the number of coded motion vectors within a motion frame. Hence, of the MDC labeled MBs, only those that contain motion vector information retain an MDC flag set to one. The remaining MBs are copied directly from the original motion frame and coded in the bitstream with the MDC flag reset to zero. Table 3.1 classifies macroblocks into multiple-description codable and non-multiple-description codable blocks.

Table 3.1: MDC macroblock classification.

Syntax Element | Multiple-Description Codable               | non-Multiple-Description Codable
MbMode         | MODE_16x16, MODE_16x8, MODE_8x16, MODE_8x8 | MODE_Skip, INTRA_4x4, MODE_PCM, INTRA_BL
BLSkipFlag     | 0                                          | 1
BLQRefFlag     | 0                                          | 1
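Returning to the motion vector prediction change of Figure 3.6, the following sketch (ours; the MB coordinates are illustrative) contrasts the two neighbor sets:

```python
# Standard H.264/AVC MV prediction consults the left, above, and above-right
# MBs; in an MDC frame those immediate neighbors belong to the other
# description, so prediction skips one MB to the left and uses the above-left
# and above-right MBs instead. Note that every substituted position lies on
# the same quincunx lattice as the current MB.

def predictor_neighbors(row, col, is_mdc_frame):
    """Return the (row, col) MB positions consulted for MV prediction."""
    if not is_mdc_frame:
        return [(row, col - 1), (row - 1, col), (row - 1, col + 1)]
    return [(row, col - 2), (row - 1, col - 1), (row - 1, col + 1)]

print(predictor_neighbors(3, 4, is_mdc_frame=False))  # [(3, 3), (2, 4), (2, 5)]
print(predictor_neighbors(3, 4, is_mdc_frame=True))   # [(3, 2), (2, 3), (2, 5)]
```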
The next step is to derive the motion recovery data for all remaining macroblocks with an MDC flag of one. This is done by taking advantage of the correlation between spatially neighboring macroblocks with a zero MDC flag to estimate the current macroblock's motion vectors. Two possible techniques are motion vector median or averaging, and motion-compensated best neighbor-block matching. In [22] and [23], a non-normative error concealment algorithm for Inter coded frames is presented, based on motion-compensated motion-vector recovery with boundary pixel matching. Experiments run in [22] showed that the median or average motion vector approach did not give better results when compared to the motion-compensated prediction approach. Therefore, we decided to follow through with a modified version of the motion-compensated prediction approach, as discussed in the following paragraphs.

In order to reduce the number of computations performed by the decoder, the motion-vector recovery is performed at the encoder, and an index of the best matching neighboring block is coded in the bitstream and passed to the decoder. First, the MDC macroblock is divided into four 8x8 blocks {X0, X1, X2, X3} and the motion recovery algorithm is performed for each of these blocks. The neighboring eight 8x8 blocks are then assigned indices as shown in Figure 3.7. Since motion recovery is performed at the encoder, the original motion-compensated block is available in full, so there is no need to run the matching solely on the boundary pixels. Instead, the entire motion-compensated block is matched with every candidate block, which improves the accuracy of the estimate.

Figure 3.7: Neighboring 8x8 block indices.

The motion-vector recovery algorithm, which will be referred to as the Best-Neighbor Matching algorithm, can be summarized as follows:

• For every block partition P which belongs to the set of block partitions P = {X0, X1, X2, X3}, find the neighboring 8x8 block that minimizes the sum of square error distortion between the original motion-compensated block and the candidate motion-compensated block.

• The output is the set of indices BN_P associated with every partition P ∈ P.

The distortion measure can be expressed as follows:

$$D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) = \sum_{i,j \in P} \big( l_{r_{orig}}[i + m_{orig,x},\; j + m_{orig,y}] - l_{r_{est}}[i + m_{est,x},\; j + m_{est,y}] \big)^2 \qquad (3.1)$$

where l_r[] represents the motion-compensated luma samples, m the motion vectors, and P the sub-macroblock partition. The best matching neighbor block index is then found by minimizing equation (3.1), as shown in equation (3.2) below:

$$BN_P = \arg\min_{BN \in S} D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) \qquad (3.2)$$

where S = {0, 1, 2, 3, 4, 5, 6, 7} is the set of neighboring 8x8 block indices.
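A compact sketch (ours) of the encoder-side search of equations (3.1) and (3.2): for one 8x8 partition, pick the neighbor index whose motion-compensated block is closest in SSE to the original motion-compensated block. The candidate blocks here are synthetic stand-ins for the eight neighbors:

```python
import numpy as np

def best_neighbor_index(orig_mc_block, candidate_mc_blocks):
    """Equation (3.2): return the index BN in S = {0..7} of the candidate
    minimizing the SSE distortion of equation (3.1) against the original
    motion-compensated 8x8 block."""
    sse = [float(((orig_mc_block - cand) ** 2).sum())
           for cand in candidate_mc_blocks]
    return int(np.argmin(sse))

rng = np.random.default_rng(1)
orig = rng.integers(0, 256, (8, 8)).astype(float)
# Stand-ins for the eight candidates: the predictions the current partition
# would get by borrowing the motion of each neighboring 8x8 block.
candidates = [orig + rng.normal(0, s, (8, 8))
              for s in (1, 5, 9, 13, 17, 21, 25, 29)]
print(best_neighbor_index(orig, candidates))  # 0, the least-perturbed candidate
```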
As a result of the best-neighbor matching algorithm, every MDC macroblock is allocated four indices that need to be coded in the bitstream. This might end up requiring more bits than coding the original motion vectors, especially if, for instance, index seven is chosen for all partitions P. In order to cope with such cases, an appropriate binarization and entropy-coding scheme must be developed on the one hand. This will be discussed in the following section. On the other hand, the macroblock mode of an MDC MB is also coded in the bitstream in exchange for removing some of the best-neighbor indices from the bitstream. For example, if the macroblock mode is MODE_16x16, then the original motion is homogeneous over all blocks. Therefore, it is sufficient to code the best-neighbor index of only one 8x8 block partition to estimate the missing motion in the MDC case. The chosen best-neighbor index BN_P is that of the 8x8 block P with the minimum distortion D_SSE located within the macroblock partition. The remaining 8x8 blocks in the macroblock partition are set equal to the chosen index. Table 3.2 shows the prime best-neighbor indices that are coded based on the value of MB mode.

Table 3.2: MB mode and the associated coded BN indices.

  MB Mode      Prime Best-Neighbor Indices
  MODE_16x16   BN_X0
  MODE_16x8    BN_X0, BN_X2
  MODE_8x16    BN_X0, BN_X1
  MODE_8x8     BN_X0, BN_X1, BN_X2, BN_X3

The decoder first reads the MB mode and then decides which prime best-neighbor indices are to be read from the bitstream, as shown in Table 3.2. The remaining best-neighbor indices of the designated macroblock partition are set equal to the prime index. Finally, for every best-neighbor index BN_P, the decoder copies the motion vectors from the neighboring block indexed by BN_P into the macroblock partition P.
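The decoder-side expansion of the prime indices can be sketched as follows; the dictionary representation and the partition-coverage maps are illustrative assumptions consistent with Table 3.2.

```python
# Sketch of how a decoder could expand the prime best-neighbor indices of
# Table 3.2 onto all four 8x8 partitions X0..X3.

PRIME_PARTITIONS = {
    "MODE_16x16": [0],           # one prime index covers X0..X3
    "MODE_16x8":  [0, 2],        # X0 covers the top, X2 the bottom
    "MODE_8x16":  [0, 1],        # X0 covers the left, X1 the right
    "MODE_8x8":   [0, 1, 2, 3],  # every partition coded explicitly
}

COVERS = {  # partitions that inherit each prime index, per MB mode
    "MODE_16x16": {0: [0, 1, 2, 3]},
    "MODE_16x8":  {0: [0, 1], 2: [2, 3]},
    "MODE_8x16":  {0: [0, 2], 1: [1, 3]},
    "MODE_8x8":   {0: [0], 1: [1], 2: [2], 3: [3]},
}

def expand_best_neighbors(mb_mode, coded_indices):
    """Map the coded prime indices onto all four partitions X0..X3."""
    bn = [None] * 4
    for prime, coded in zip(PRIME_PARTITIONS[mb_mode], coded_indices):
        for p in COVERS[mb_mode][prime]:
            bn[p] = coded
    return bn

# Example: MODE_16x8 with prime indices 5 (top) and 2 (bottom)
assert expand_best_neighbors("MODE_16x8", [5, 2]) == [5, 5, 2, 2]
```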
The bitrate reduction achieved by multiple-description coding of motion can be seen by comparing the number of syntax elements coded in a regular macroblock as opposed to a multiple-description coded macroblock. Table 3.3 lists the syntax elements encoded in both MDC and non-MDC macroblocks. Moreover, in the case of bidirectional prediction, two sets of the motion prediction flags, reference frame indices, and motion vectors are coded in the bitstream. The best-neighbor indices are only coded once, since the afore-mentioned fields will be copied in full from the indexed neighbor block.

Table 3.3: Syntax element classification in MDC.

  non-MDC                MDC
  BLSkipFlag             BLSkipFlag
  MbMode                 MbMode
  BLQRefFlag             BLQRefFlag
  MDCFlag                MDCFlag
  BlockModes             BestNeighbors
  MotionPredictionFlag
  ReferenceFrames
  MotionVectors
  TerminatingBit         TerminatingBit

3.1.2 Entropy Coding of Error Recovery Data

The addition of new syntax elements to the SVC bitstream requires additional entropy coding contexts that will ensure optimal coding into the bitstream. H.264/AVC supports two entropy coding techniques to efficiently encode all syntax elements into the bitstream: Context-Adaptive Variable Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC). Since SVC is the scalable extension project of H.264/AVC, our method has also adopted the same entropy coding techniques implemented in SVC. However, SVC defines additional syntax elements that did not exist in H.264/AVC; hence, [24] specifies additional context indices, context initialization tables, binarization schemes, and so on. The full list of modifications can be found in section S.9.3 of [24]. It is worth mentioning that CAVLC is not affected by the support of new syntax elements; only CABAC requires the modifications and additions referred to in section S.9.3. On a similar note, MDC of SVC defines two new syntax elements embedded in the motion data stream. These syntax elements are the MDC flag and the BestNeighbor indices. In this section, we describe the binarization and context modeling process developed to support entropy coding of these additional MDC syntax elements.

To begin with, we give a brief summary of the CABAC framework for encoding syntax elements as described in [25]. A syntax element undergoes three steps of encoding, namely, binarization, context modeling, and binary arithmetic coding. In the binarization stage, a non-binary syntax element is mapped into a unique binary sequence referred to as a binstring [25]. Next, in the regular coding mode, every element in the binstring passes through a context modeling stage where previously coded elements can be used to select a probability model for the intended element. Finally, the element is arithmetic coded using the probability model selected in the context modeling stage. The other coding mode is the bypass mode, where a binary valued symbol is passed directly to the arithmetic coder without the step of context modeling.

In the MD-SVC case, only the regular coding mode is used to encode the new syntax elements. We first encode a data set of 400 frames drawn from 4 representative test sequences and monitor the BestNeighbor symbol statistics, from which we derive the symbol probabilities. Since the MDC flag can only take binary values, the calculated symbol statistics directly reflect the binary representation of the syntax element. The current version of the SVC software does not support multiple slices per frame, and therefore the symbol statistics are based on the occurrence of each binary value in every coded frame. Given the procedure above, the MDC flag syntax element appeared to have a simple uniform distribution. Therefore, the context model designed in [24] for the BLSkip flag, which has the same probability distribution, was adopted to model the MDC flag context as well.

The case of entropy coding the BestNeighbor syntax element is more complicated. The first difficulty lies in finding a suitable binarization scheme that will minimize the output bit-rate. Since our entropy coding engine is a binary arithmetic coder, the task is to represent every symbol by a set of binary values, or binstring, that best reflects the actual symbol probabilities. As mentioned in the previous section, the BestNeighbor syntax element holds the values of the indices pointing to the neighboring 8x8 blocks of a multiple-description coded macroblock. Let us revert to the previously defined arithmetic terms in order to define a mathematical model for the context modeling task at hand. The BestNeighbor syntax element is defined as the index BN_P ∈ S, S = {0, 1, 2, 3, 4, 5, 6, 7}, where S is the set of neighboring 8x8 block indices and P is the sub-macroblock partition index. Therefore, BestNeighbor can hold one of eight possible non-binary values, each of which demands a separate symbol probability model. S defines a finite alphabet of values for the BestNeighbor syntax element, leading to the application of Fixed Length (FL) binarization as recommended by [25]. Moreover, by monitoring the symbol statistics of the BestNeighbor index values over the same data set of 400 coded frames, it was shown that the BestNeighbor syntax element has a nearly uniform distribution, which further supports the choice of FL binarization.
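A sketch of FL binarization for an eight-symbol alphabet is shown below. The natural-binary bit assignment used here is an assumption for illustration; the assignment actually used is the one listed later in Table 3.4.

```python
import math

def fl_binarize(symbol, alphabet_size=8):
    """Fixed Length binstring of `symbol` as a list of bits.

    With an alphabet of eight indices, every symbol maps to a
    ceil(log2(8)) = 3-bit binstring.
    """
    n_bits = math.ceil(math.log2(alphabet_size))
    return [(symbol >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]

assert fl_binarize(5) == [1, 0, 1]
assert all(len(fl_binarize(s)) == 3 for s in range(8))
```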
Let Pr_BN be the probability of occurrence of a non-binary value BN ∈ S. The first step is to calculate Pr_BN using the available coded frame data set. Let N_f be the total number of coded frames and C_BN(k) be the number of times the index BN occurred in frame k. The probability Pr_BN is expressed as

Pr_{BN} = \frac{\sum_{k=1}^{N_f} C_{BN}(k)}{\sum_{BN' \in S} \sum_{k=1}^{N_f} C_{BN'}(k)}, \quad BN \in S \qquad (3.3)

The next step is to derive the bin probabilities in the binstring using the non-binary symbol probabilities. These bin probabilities will be used to define the contexts, or probability models, of the internal bins of the binstring that will determine the coded bitstream and its efficiency, as stated in [25]. The binarization stage ensures the reduction of the alphabet size of the syntax elements to be encoded. This stage does not lead to the loss of any information on the high-level symbol probabilities: these symbol probabilities can be completely recovered using the probabilities of the internal bins [25].

The processes of binarization and context modeling are employed to fit the available non-binary data to a universal finite memory source, a concept developed in [26] and applied in the design of CABAC in H.264 [27]. [26] also defines a "tree source" as a "machine" that can implement a Markov source using a "simple tree architecture". The concept of a "tree source" is important in this discussion because it will lead to the definition and design of the BestNeighbor bin contexts. [26] defines P(x^n) and Pr(x^n) as the probabilities assigned to a string x^n by the universal source and by a data-generating finite-memory source with K states. In the BestNeighbor binarization application, the universal source is finite and the tree is binary. Since we are using FL binarization, the depth of the tree will be l_BN = \lceil \log_2 8 \rceil = 3. Figure 3.8 shows the binary tree used for the binarization of the BestNeighbor syntax element. Notice that the tree nodes are labeled C0, C1, ..., C6, which stand for the context models of the internal bins.

Figure 3.8: Fixed Length binarization for the BestNeighbor syntax element.

The calculated probability set Pr_BN provides the leaf probabilities of the binary tree, i.e.

Pr_{BN} = Pr(x^n) = \prod_{t=0}^{n} P(x_{t+1} \mid s(x^t)) \quad [26]

These conditional probabilities define the bin contexts, which we would like to calculate. Since each leaf of the binary tree represents one state of a Markov chain, the conditional probabilities P(x_{t+1} | s(x^t)) of the internal nodes define the state transition probabilities of the Markov chain and can be expressed as

P(x_{t+1} \mid s(x^t)) = \frac{P(s(x^{t+1}))}{P(s(x^t))} \qquad (3.4)

where s(x^n) is defined as s(x^n) = x_n \ldots x_{\max(0, n-k+1)} for some k > 0. In the BestNeighbor case, the maximum length of the FL binstring is 3 and therefore n ≤ 2. In Figure 3.8, we assume that 0 is the most probable symbol and 1 the least probable symbol of the arithmetic coder, as defined by [11]. Therefore, using the 400 coded frame data set, we calculate the symbol probabilities of the BestNeighbor symbol values and assign these values to the leaves of the binary tree accordingly. Table 3.4 shows the BestNeighbor symbol probabilities and the corresponding binstrings.

Table 3.4: BestNeighbor symbol probabilities and the corresponding binstrings.

  Non-Binary Symbol    0      1      2      3      4       5      6       7
  Probability Pr(x^2)  0.22   0.34   0.23   0.17   0.0065  0.011  0.0039  0.019
  Binstring (x^2)      001    000    010    011    110     101    111     100
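The following sketch derives the internal-node probabilities of the tree from the leaf probabilities of Table 3.4 using equation 3.4; the container names are assumptions made for illustration.

```python
# Leaf probabilities keyed by binstring (Table 3.4).
LEAF = {"001": 0.22, "000": 0.34, "010": 0.23, "011": 0.17,
        "110": 0.0065, "101": 0.011, "111": 0.0039, "100": 0.019}

def prefix_prob(prefix):
    """Probability mass of all binstrings starting with `prefix`."""
    return sum(p for b, p in LEAF.items() if b.startswith(prefix))

def cond_prob(prefix, bit):
    """P(bit | prefix) at an internal tree node (eq. 3.4)."""
    return prefix_prob(prefix + bit) / prefix_prob(prefix)

print(cond_prob("", "0"))    # ~0.96, the root context C0 of Table 3.5
print(cond_prob("0", "0"))   # ~0.583 -> the 0.58 entry for C1
print(cond_prob("01", "0"))  # ~0.575, the node worked out in the example below
```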
To calculate the internal node probabilities, take for example the node reached by the prefix 01 and apply equation 3.4 as follows. We want to calculate P(0|01) and P(1|01) = 1 − P(0|01):

P(0 \mid 01) = \frac{P(010)}{P(010) + P(011)} = \frac{Pr_{BN=2}}{Pr_{BN=2} + Pr_{BN=3}}

Similarly, the conditional probabilities of node C1 are P(0|0) and P(1|0), where

P(0 \mid 0) = \frac{P(000) + P(001)}{P(000) + P(001) + P(010) + P(011)}

The conditional probabilities of all remaining internal nodes are calculated in the same manner. These conditional probabilities define the context models of every node in the binstring and will therefore be used to initialize the context models at the beginning of encoding or decoding of every frame. The CABAC entropy coder will then update these context probabilities based on the encoded BestNeighbor syntax element statistics as more and more MDC macroblocks are encoded into the bitstream. The initial conditional probabilities of all BestNeighbor context models are shown in Table 3.5.

Table 3.5: BestNeighbor context models.

  Context   MPS    LPS     RLPS   state   m   n
  C0        0.96   0.0403  20     44      0   19
  C1        0.58   0.42    215    2       0   61
  C2        0.74   0.26    201    1       0   62
  C3        0.61   0.39    213    1       0   62
  C4        0.58   0.42    131    10      0   53
  C5        0.63   0.37    189    3       0   60
  C6        0.62   0.38    192    1       0   62

[11] specifies three steps for initializing context models, in which a context model is initialized by assigning to it a state number and a meaning for the most probable symbol (MPS). The first step is linearly dependent on the frame quantization parameter (QP) and involves calculating prestate = ((m × QP) >> 4) + n [11]. The next step limits prestate to the range [1, 126]. The final step maps the prestate to a {state, MPS} pair, such that if prestate ≤ 63, then state = 63 − prestate and MPS = 0; otherwise, state = prestate − 64 and MPS = 1 [11].

Now, the arithmetic coder used in H.264/AVC and SVC is a table-based arithmetic coder, where the bin probabilities are estimated using a table of scaled least probable symbol (LPS) ranges (RLPS) indexed by the context model state. Therefore, the calculated LPS probabilities shown in Table 3.5 are mapped into the RLPS values found in column 4 of Table 3.5, where the maximum RLPS range is 256 and corresponds to an LPS probability of 0.5. The frame quantization parameter does not affect the value of BestNeighbor, and therefore m is set equal to 0. Moreover, we are assuming that 0 is the most probable symbol, so the prestate value should not exceed 63. Columns 5, 6 and 7 of Table 3.5 show the initialization parameters of the BestNeighbor syntax element contexts resulting from the above-mentioned initialization procedure.
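The initialization mapping can be sketched as follows, reproducing the three steps quoted from [11] above; the function name and the test values are illustrative.

```python
def init_context(m, n, qp):
    """Map (m, n, QP) to a {state, MPS} pair per the three steps above."""
    prestate = ((m * qp) >> 4) + n
    prestate = min(max(prestate, 1), 126)     # clip to [1, 126]
    if prestate <= 63:
        return 63 - prestate, 0               # state, MPS = 0
    return prestate - 64, 1                   # state, MPS = 1

# BestNeighbor contexts use m = 0, so the result is independent of QP:
assert init_context(0, 19, qp=30) == (44, 0)  # context C0 of Table 3.5
assert init_context(0, 61, qp=30) == (2, 0)   # context C1
```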
3.1.3 Multiple Description Coding of Residual Data

In addition to motion information, High Pass frames also contain residual information that can consume at least as much bandwidth as the motion information. Therefore, it is imperative to multiple-description code the residual data as well in order to reduce the redundancy between the two descriptions. We suggest using an inter-frame based multiple-description coding of residual data similar to MD-MCTF, the approach used in [19]. In [19], HP frames are divided into even numbered and odd numbered frames at every temporal level in an MCTF-coded video layer. The even numbered frames are then transmitted in one description while the odd numbered frames are transmitted in the second description. Controlling the number of HP frames duplicated between the two descriptions leads to controlling the video redundancy. Although bitrate efficient, this approach suffers from a reduced decoded video quality, since any Intra coded macroblocks in the HP frames cannot be retrieved if one description is lost.

Our approach to multiple-description coding of residual data is also based on separating HP frames into even numbered and odd numbered frames and coding one group in each description. However, we extend this separation process by inserting any Intra-coded macroblocks found in the high pass frames in both descriptions. The first step is to add an MDC flag to the slice header to allow the decoder to distinguish between fully coded HP frames and modified, or multiple-description coded, HP frames. Note that this flag is only inserted to refer to multiple-description coding of residual data and not motion (prediction) data. Therefore, if description one is allocated the even numbered frames, for example, then all even numbered HP frames are coded entirely without any modification in the first description bitstream and the MDC flag is set to zero. The odd numbered frames are multiple-description coded by setting the MDC flag to one and only inserting the Intra-coded macroblocks into the second-description bitstream. No residual information is coded for the remaining (Inter-coded) macroblocks. Figure 3.9 shows the separation of a set of residual high pass frames into two descriptions. The dark blocks in the high pass frames refer to Intra coded macroblocks.

Figure 3.9: Multiple-description coding of residual frames in a temporal-scalable layer with three temporal levels. Dark blocks indicate Intra-coded macroblocks in the residual frame.

Notice that the HP frames coded in the two descriptions are complementary. The only redundancy involved arises from the Intra-coded macroblocks in the residual frames that are duplicated to improve the decoded video quality in case of packet loss. In order to ensure independent decodability of each description, essential modifications were required. First, the coding of the coded block pattern (CBP) syntax element is modified in MDC frames such that it is independent of the CBP values in the adjacent blocks. All neighboring blocks are assumed to be unavailable when coding the CBP field of a residual block. Moreover, cross-layer dependence of residual prediction is also eliminated for all residual frames by assuming that all base layer residual frames contain zero-valued samples. Finally, the TransformSize8x8 flag is always added to the bitstream to inform the decoder whether or not an 8x8 block transform is used. This is crucial when the motion information is multiple-description coded and the macroblock mode is MODE_8x8. In this case, the decoder has no way of knowing whether the residual data was transformed using an 8x8 block transform or a 4x4 block transform, since the BestNeighbor field is limited to 8x8 block precision and does not extend to sub-block precision. Consequently, adding the TransformSize8x8 flag eliminates any ambiguity that might arise from multiple-description coding of HP motion information.
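A sketch of this even/odd split with Intra duplication is given below; the frame and macroblock containers, and the is_intra predicate, are assumptions made for illustration.

```python
def split_residual(hp_frames, is_intra):
    """Split residual HP frames into two descriptions.

    hp_frames: list of frames, each a list of macroblocks.
    is_intra(mb): True if the macroblock is Intra-coded.
    Returns (description_one, description_two) as lists of
    (mdc_flag, macroblocks) pairs.
    """
    d1, d2 = [], []
    for k, frame in enumerate(hp_frames):
        intra_only = [mb for mb in frame if is_intra(mb)]
        if k % 2 == 0:                  # even frame: full in D1
            d1.append((0, frame))       # MDC flag 0: fully coded
            d2.append((1, intra_only))  # MDC flag 1: Intra MBs only
        else:                           # odd frame: full in D2
            d1.append((1, intra_only))
            d2.append((0, frame))
    return d1, d2
```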
In the following section, we present a theoretical analysis of the multiple-description coding framework we just described. We perform a redundancy rate distortion analysis of the suggested framework and develop a video motion index that reflects the level of distortion produced by this multiple-description coding framework.

3.2 Theoretical Analysis

Recall from section 2.3 that the current scalable SVC framework provides three kinds of scalability: temporal, spatial, and SNR. Spatial scalability is achieved by downsampling the input video sequence and then performing cross-layer prediction between the original (high-resolution) and downsampled (low-resolution) video sequences. This scalability is optimized by utilizing the already coded motion information in the lower layers.

SNR scalability is generally accomplished by coding the residual (texture) signals obtained from computing the difference between the original pictures and the reconstructed pictures produced after decoding the base layer. This scalability is extended to include all temporal subband pictures obtained after temporal scalable coding.

Temporal scalability is realized by applying the concept of motion-compensated temporal filtering (MCTF) to a group of original video pictures. MCTF utilizes an adaptive selection of the lifting representations of the Haar and 5/3 spline wavelets on a block basis. It is important to emphasize that the wavelet filters are applied in the temporal domain to blocks of picture samples and NOT in the spatial domain. The result is a set of high-pass and low-pass subband pictures having half the temporal resolution of the original group of pictures. Note that after performing temporal decomposition on a group of pictures (GOP), spatial and SNR scalability can be added to the resulting subband pictures. In our MDC framework, we produce two descriptions of the HP frames obtained by MCTF and support any spatial and SNR scalability features applied to those HP frames.

To begin with, we revert back to equations 2.1 and 2.2, which describe the extended motion-compensated temporal filtering prediction and update operators. These equations are presented below.

P_{Haar}(s[x, 2k+1]) = s[x + m_{P_0}, 2k - 2r_{P_0}]
U_{Haar}(s[x, 2k]) = \frac{1}{2} h[x + m_{U_0}, k + r_{U_0}]
P_{5/3}(s[x, 2k+1]) = \frac{1}{2} \left( s[x + m_{P_0}, 2k - 2r_{P_0}] + s[x + m_{P_1}, 2k + 2 + 2r_{P_1}] \right)
U_{5/3}(s[x, 2k]) = \frac{1}{4} \left( h[x + m_{U_0}, k + r_{U_0}] + h[x + m_{U_1}, k - 1 - r_{U_1}] \right)

where s[] and h[] indicate the y, u, and v samples in the original and high pass frames, respectively, x refers to the luma spatial coordinates of the samples, k is the picture temporal index, and m and r are the prediction information.

Switching between the Haar and 5/3 filters is done at the macroblock level, where the Haar filter is used for unidirectional prediction while the 5/3 filter is used for bidirectional prediction. For the remainder of this discussion, we will consider the 5/3 filter as the general prediction process and analyze accordingly.
[14] defines the decomposition process using the following equations:

h[x, k] = s[x, 2k+1] - P(s[x, 2k+1]) \qquad (3.5)
l[x, k] = s[x, 2k] + U(s[x, 2k]) \qquad (3.6)

where P(s[x, 2k+1]) and U(s[x, 2k]) represent the motion-compensated samples of the prediction and update processes, respectively, and h[x, k] and l[x, k] are the high-pass and low-pass decomposition components. Substituting equations 2.1 and 2.2 into equations 3.5 and 3.6, we get

h[x, k] = s[x, 2k+1] - \frac{1}{2} \left( s[x + m_{P_0}, 2k - 2r_{P_0}] + s[x + m_{P_1}, 2k + 2 + 2r_{P_1}] \right) \qquad (3.7)
l[x, k] = s[x, 2k] + \frac{1}{4} \left( h[x + m_{U_0}, k + r_{U_0}] + h[x + m_{U_1}, k - 1 - r_{U_1}] \right) \qquad (3.8)

which, upon expansion, result in the following FIR filter representations of the low pass and high pass subbands. The motion-compensated prediction variables m and r have been dropped from the two equations for clarity.

h[x, k] = -\frac{1}{2} s[x, 2k] + s[x, 2k+1] - \frac{1}{2} s[x, 2k+2] \qquad (3.9)
l[x, k] = -\frac{1}{8} s[x, 2k-2] + \frac{1}{4} s[x, 2k-1] + \frac{3}{4} s[x, 2k] + \frac{1}{4} s[x, 2k+1] - \frac{1}{8} s[x, 2k+2] \qquad (3.10)

Hence, the l[x, k] samples (more appropriately called subbands) now clearly correspond to the filtered output of the 5-tap low pass filter [-1/8 1/4 3/4 1/4 -1/8]. The h[x, k] subbands, in turn, correspond to the filtered output of the 3-tap high pass filter [-1/2 1 -1/2]. Equations 3.9 and 3.10 therefore describe the 5/3 spline-wavelet analysis filters. Figures 3.10 and 3.11 illustrate the properties of the two filters.

Figure 3.10: Frequency response properties of the [-1/8 1/4 3/4 1/4 -1/8] low pass filter.

Figure 3.11: Frequency response of the [-1/2 1 -1/2] high pass filter.

The application of these temporal filters on a group of pictures (GOP) of size N produces N/2 low pass subbands and N/2 high pass subbands. When we implement a dyadic wavelet decomposition by applying multiple stages of the filters described above to the pictures of a GOP of size 8, for instance, the resulting output pictures will have the subband structure shown in Figure 3.12.

Figure 3.12: Decomposition of a group of pictures (GOP) of size 8. The shaded frames will be coded and transmitted in the bitstream.
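As a numerical sanity check, the following sketch applies the lifting steps of equations 3.7 and 3.8 to a one-dimensional sample sequence (with all motion vectors and reference offsets set to zero) and confirms that the interior outputs match the FIR filters of equations 3.9 and 3.10. The symmetric border extension is an assumption made only to keep the sketch short.

```python
import numpy as np

def mctf_53_analysis(s):
    """One temporal 5/3 decomposition of a 1-D sample sequence s
    (motion vectors and reference offsets set to zero)."""
    s = np.asarray(s, dtype=float)
    n = len(s) // 2
    pad = np.pad(s, 2, mode="reflect")      # border extension (assumed)
    h = np.array([pad[2 + 2*k + 1]
                  - 0.5 * (pad[2 + 2*k] + pad[2 + 2*k + 2])
                  for k in range(n)])       # prediction step, eq. 3.7
    hp = np.pad(h, 1, mode="reflect")
    l = np.array([pad[2 + 2*k]
                  + 0.25 * (hp[1 + k] + hp[k])
                  for k in range(n)])       # update step, eq. 3.8
    return l, h

s = np.arange(8.0) ** 2                     # any test signal
l, h = mctf_53_analysis(s)
# Interior samples match the FIR filters of eqs. 3.9 and 3.10:
print(h[1], s[3] - 0.5 * (s[2] + s[4]))                   # both -1.0
print(l[1], np.dot([-1/8, 1/4, 3/4, 1/4, -1/8], s[0:5]))  # both 3.5
```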
Since both high pass and low pass filters are applied in the temporal domain, the low pass subbands will consequently correspond to the homogeneous components found within the GOP. In other words, areas in a GOP that are common in all the pictures across the GOP will be conserved in the low pass subband. The high pass subbands will contain all high pass components in a GOP. More specifically, high pass subbands will contain most of the motion found in the GOP. Therefore, if a GOP in a video sequence exhibits high levels of motion, the low pass subband components of the GOP will retain less of the GOP signal power when compared to the power retained in the low pass subband components corresponding to a GOP with lower levels of motion. The reasoning behind this is quite simple. A low pass subband can be viewed as the average frame of a GOP. Hence, if all the pictures in a GOP resemble each other, i.e. the GOP exhibits little motion, then the average of these pictures will highly resemble each and every picture in the GOP. On the other hand, the high pass components will retain more of the GOP power when there is considerable motion and low power when the level of motion is low. This reasoning leads to the conclusion that the amount of information coded within the high pass frames is proportional to the level of motion found in a GOP.

Keeping in mind that the focus of this thesis is multiple-description coding of the high pass (HP) frames in SVC, we now relate our proposed MDC framework to the discussion above. Multiple description coding calls for the production of two or more versions of a video stream such that the reception of one stream can guarantee a decodable video sequence with an acceptable loss in quality. In more precise terms, we need to create two independently decodable streams such that the distortion produced from decoding a single stream is minimal, while maintaining a low redundancy between the two descriptions. The task remains to develop an expression for single-stream distortion and for the redundancy added by creating the two streams compared to a single-description coded bitstream.

Let D_Total be the distortion resulting from decoding a group of pictures using only one description of the MDC high-pass frames. D_Total can then be expressed as

D_{Total} = D_{Motion} + D_{Residual} \qquad (3.11)

where D_Motion and D_Residual are the distortions caused by multiple-description coding of the motion information and residual (texture) information, respectively, in one GOP.

The distortion resulting from multiple-description coding of motion information is completely due to the motion vector approximation process described in section 3.1.1. The motion vectors of an MDC macroblock are approximated using the motion vectors of the neighboring blocks. Since this approximation is performed on an 8x8 block basis, with P being the 8x8 block index, D_Motion can then be expressed as follows. Let N_HP be the number of high pass frames found in a GOP of size GOPSize. Based on the dyadic decomposition structure of MCTF, N_HP is evaluated in terms of GOPSize as

N_{HP} = \sum_{i=1}^{\log_2 GOPSize} \frac{GOPSize}{2^i} = \frac{GOPSize}{2} \left( \sum_{i=0}^{\log_2 GOPSize - 1} 2^{-i} \right) = GOPSize - 1 \qquad (3.12)

We will assume that the number of MDC coded macroblocks is fixed per frame in a GOP and let Ω be the set of MDC coded macroblocks with n being the macroblock (MB) index. The distortion due to multiple-description coding of motion information is finally written as

D_{Motion} = N_{HP} \left( \sum_{n \in \Omega} \sum_{P=0}^{3} D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) \right) = (GOPSize - 1) \times \left( \sum_{n \in \Omega} \sum_{P=0}^{3} D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) \right) \qquad (3.13)

where D_SSE(P, r_orig, m_orig, r_est, m_est) is the sum of square error distortion between the original motion-compensated block and the candidate motion-compensated block defined in equation 3.1. The expression for D_Motion above is minimized by minimizing the D_SSE term. This has already been included in our framework as a design parameter, since the best-neighbor algorithm discussed in section 3.1.1 selects the neighboring 8x8 block that minimizes the D_SSE term. Therefore, the minimal distortion requirement is satisfied for motion estimation.
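Equation 3.12 is also easy to verify numerically: each MCTF stage halves the number of pictures and emits that many high pass frames.

```python
# Quick check of equation 3.12: a dyadic MCTF decomposition of a GOP of
# size N emits N/2 + N/4 + ... + 1 = N - 1 high pass frames in total.

def n_hp(gop_size):
    total, span = 0, gop_size
    while span > 1:
        span //= 2          # pictures remaining after this stage
        total += span       # HP frames emitted by this stage
    return total

assert all(n_hp(n) == n - 1 for n in (2, 4, 8, 16, 32))
```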
The second term in the D_Total expression is D_Residual. Recall from section 3.1.3 that multiple-description coding of residual (texture) information is performed on the entire GOP level (or inter-frame level). The texture information in a multiple-description coded frame is simply set to zero. This, however, does not include Intra-coded macroblocks found within the MDC frame. The purpose behind this decision is to restrict the loss in coded data to the high pass components produced by MCTF. [14] specifies that the samples of Intra-coded macroblocks in high pass frames are not used for updating the low-pass pictures and therefore are not included in the reconstruction process at the decoder. Moreover, all sample values of the intra macroblocks are set to zero when used in the update process [14]. For this analysis, we will begin by defining D_Residual in terms of the components affected by the dropping of residual data, and then we will assess the level of that distortion based on the video sequence statistics. The only process affected by multiple-description coding of texture information is the reconstruction stage of motion-compensated temporal filtering. In other words, the loss is restricted to some of the high pass subbands needed for the complete reconstruction of the output pictures.

We will express the reconstruction of a set of low pass pictures in terms of the MCTF synthesis equations of two adjacent pictures. Equations 3.14 and 3.15 have been derived from the MCTF analysis equations 3.7 and 3.8. Let ŝ[x, 2k] and ŝ[x, 2k+1] be the reconstructed picture samples; the MCTF reconstruction equations will then be

\hat{s}[x, 2k] = l[x, k] - \frac{1}{4} \left( h[x, k] + h[x, k-1] \right) \qquad (3.14)
\hat{s}[x, 2k+1] = h[x, k] + \frac{1}{2} \left( \hat{s}[x, 2k] + \hat{s}[x, 2k+2] \right) \qquad (3.15)

Note that the prediction data variables have been removed from the above equations for clarity. Our MDC framework involves setting every other high pass frame to zero in a single description. Hence, if the odd numbered frames are set to zero and k is even, then the above equations become

\hat{s}[x, 2k] = l[x, k] - \frac{1}{4} h[x, k] \qquad (3.16)
\hat{s}[x, 2k+1] = h[x, k] + \frac{1}{2} \left( \hat{s}[x, 2k] + \hat{s}[x, 2k+2] \right) = \frac{1}{2} \left( l[x, k] + l[x, k+1] \right) + \frac{3}{4} h[x, k] \qquad (3.17)

For a more comprehensive analysis, let us consider a group of pictures of size GOPSize. One example of a GOP with GOPSize = 8 is shown in Figure 3.13 below, where the shaded frames indicate the transmitted frames and the crossed frames are multiple-description coded.

Figure 3.13: Reconstruction of a group of pictures showing the MDC high pass frames with zero residual marked with the red Xs.

We define D_Residual as the sum of square error between the MCTF reconstruction of a single-description (SD) coded GOP and the MCTF reconstruction of one description of an MDC GOP. The calculated distortion is strictly limited to the error caused by the dropping of residual information. Therefore, we will develop an expression for D_Residual that incorporates all the lost residual samples across the MCTF stages. Using equations 3.16 and 3.17, the final expression for an entire GOP of size GOPSize is written as

D_{Residual} = \sum_{s=1}^{\log_2(GOPSize) - 1} \sum_{k=0}^{\frac{GOPSize}{2^s} - 2} \left( (s[x, 2k] - \hat{s}[x, 2k])^2 + (s[x, 2k+1] - \hat{s}[x, 2k+1])^2 \right)
= \sum_{s=1}^{\log_2(GOPSize) - 1} \sum_{k=0}^{\frac{GOPSize}{2^s} - 2} \left( \left( \frac{1}{4} h[x, k-1] \right)^2 + \left( \frac{1}{8} h[x, k-1] + \frac{1}{8} h[x, k+1] \right)^2 \right)
= \sum_{s=1}^{\log_2(GOPSize) - 1} \sum_{k=0}^{\frac{GOPSize}{2^s} - 2} \left( \frac{5}{64} h[x, k-1]^2 + \frac{1}{64} h[x, k+1]^2 + \frac{1}{32} h[x, k-1] h[x, k+1] \right) \qquad (3.18)

where s is the MCTF stage number and GOPSize/2^s − 2 is the maximum value that k can take in stage s.
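The following sketch implements one synthesis stage with the odd-numbered high pass frames zeroed, mirroring equations 3.16 and 3.17; the border handling (clamping at the GOP edges) is an assumption made to keep the sketch short.

```python
import numpy as np

def synthesize_one_description(l, h):
    """One MCTF synthesis stage when the odd-numbered high pass frames of
    this stage were multiple-description coded away (treated as zero).
    l and h are equal-length lists of 2-D frames."""
    h = [hk if k % 2 == 0 else np.zeros_like(hk) for k, hk in enumerate(h)]
    n = len(l)
    s = [None] * (2 * n)
    for k in range(n):                       # eq. 3.14 with lost h zeroed
        h_prev = h[k - 1] if k > 0 else np.zeros_like(h[k])
        s[2 * k] = l[k] - 0.25 * (h[k] + h_prev)
    for k in range(n):                       # eq. 3.15
        s_next = s[2 * k + 2] if k + 1 < n else s[2 * k]
        s[2 * k + 1] = h[k] + 0.5 * (s[2 * k] + s_next)
    return s

low = [np.full((2, 2), 10.0), np.full((2, 2), 12.0)]
high = [np.ones((2, 2)), np.ones((2, 2))]
print(len(synthesize_one_description(low, high)))   # 4 output pictures
```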
Let us now define a new variable, the motion index, as the normalized sum of square error distortion between the original picture samples and the motion-compensated samples produced after the Lagrangian based rate-distortion optimization. The motion index, labeled MIdx, will generally be expressed as

MIdx = \frac{\sum_{k \in GOP} \sum_{x \in Frame} \left( l_{orig}[x, k] - p[x, m, r] \right)^2}{f(GOPSize, \Lambda)} \qquad (3.19)

where p[] represents the motion-compensated samples, f(·) is a normalization function to be defined below, and Λ is a constant reflecting the maximum SSD distortion per INTER coded macroblock. Λ is calculated experimentally by encoding a number of sequences with varying levels of motion and is estimated to be around 22000. The normalization function is simply written as

f(GOPSize, \Lambda) = GOPSize \times N_{MB} \times \Lambda

where N_MB = (FrameWidth × FrameHeight)/256 is the number of macroblocks per frame.

The question that arises from the above definition is how MIdx can be characteristic of the level of motion in a group of pictures. The answer is inherent in the concept of motion-compensated prediction itself. Motion-compensated prediction can be viewed as a mechanism for tracking motion in a video sequence. The process involves "tracking" a fixed size block of pixels across a group of pictures and trying to find the best match for that block in previous and future pictures by finding a set of motion vectors and reference indices. If the tracking is highly successful, then the picture generated from the motion tracking will closely resemble the original picture. This will result in a residual frame with small sample values. On the other hand, if the tracking is unsuccessful, then the motion-compensated picture will vaguely resemble the original picture, causing the residual frame to have large sample values and, in more extreme cases, causing the INTRA mode to be used to code the macroblocks. Therefore, if the residual frame sample values are large, then the motion tracking process is having difficulty keeping up with the level of motion in the sequence, which reflects that the sequence exhibits high levels of motion. Similarly, if a residual frame contains a large number of Intra-coded macroblocks, then the sequence is more likely to exhibit high levels of motion. The high pass subband samples are defined by h[x] = l_orig[x] − p[x, m, r] [14]. Consequently, the motion index equation is rewritten in terms of the square of the residual high pass samples as

MIdx = \frac{\sum_{k \in GOP} \sum h[k]^2}{f(GOPSize, \Lambda)} \qquad (3.20)

Note that the spatial coordinate vector x is dropped from the above equation for clarity. This argument parallels our previous discussion dealing with the amount of signal power retained in the low pass and high pass subbands of the MCTF process. To reiterate, if the video sequence exhibits low levels of motion, then the low pass subband retains more of the signal power. If, on the other hand, the sequence exhibits high levels of motion, then the high pass subbands will retain more of the signal power.
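A direct computation of equation 3.20 can be sketched as follows. Treating Λ as the per-macroblock constant quoted above and normalizing by the per-frame macroblock count follow the reconstruction of the normalization function given above and should be read as assumptions; the container layout is likewise assumed.

```python
import numpy as np

LAMBDA = 22000.0   # experimental maximum SSD per INTER macroblock

def motion_index(hp_frames, gop_size, width, height):
    """Motion index of eq. 3.20 from the residual HP frames of one GOP."""
    ssd = sum(float(np.sum(f.astype(np.float64) ** 2)) for f in hp_frames)
    n_mb = (width * height) / 256.0            # 16x16 macroblocks per frame
    return ssd / (gop_size * n_mb * LAMBDA)    # f(GOPSize, Lambda)

# Illustrative call on random QCIF-sized residuals (7 HP frames, GOP of 8):
rng = np.random.default_rng(0)
hp = [rng.normal(0.0, 3.0, (144, 176)) for _ in range(7)]
print(motion_index(hp, gop_size=8, width=176, height=144))
```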
Thus, we speculate that the motion index will closely reflect the level of motion found in a video sequence and act as an indicator of the size of the encoded video sequence. To verify our speculation, we have encoded 113 pictures from each of four video sequences at 15 frames per second and a GOPSize of 8 frames. The sequences Crew, City, Foreman, and Harbour were used. Figure 3.14 shows the motion index calculated for 14 GOPs of the Crew sequence. The motion index values displayed here are normalized with respect to the maximum distortion of the Crew sequence.

Figure 3.14: Variation of motion index with respect to GOP number in the Crew sequence.

Based on our speculation, Figure 3.14 indicates that the sequence Crew exhibits low levels of motion during the first three GOPs; the level of motion then starts increasing until it hits a maximum in GOP number 10. To demonstrate this analysis, we display screen captures of the frames of GOP 2 and GOP 10. Figure 3.15 (a) shows that very little motion can be noticed in the displayed group of eight pictures. The GOP intensity is almost homogeneous as well, and therefore the value of the motion index was small. Figure 3.15 (b), on the other hand, exhibits more motion in the foreground and, more specifically, the background of the sequence. Moreover, the intensity of these pictures is also changing, which results in a high motion index.

Figure 3.15: Crew sequence showing (a) frames from GOP number 2 and (b) frames from GOP number 10.

The above figures clearly demonstrate the relevance of the motion index in indicating the motion within a video sequence. The question that arises, however, is how the motion index can reflect the difference in the level of motion between two or more sequences. Figure 3.16 shows the motion index values of the four sequences mentioned earlier.

Figure 3.16: Comparison of the motion index values between the 113 coded frames of the four sequences Crew, City, Foreman, and Harbour.

The plots shown in Figure 3.16 indicate that the sequences Crew and Harbour exhibit higher levels of motion than Foreman or City. The values of the motion indices shown in Figure 3.16 were normalized using the constant Λ referred to in equation 3.19. By averaging the motion index over all the GOPs in each sequence, the motion index can then be used to compare the level of motion between sequences. Moreover, this measure will also reflect the bit-rate of the coded video sequence. Table 3.6 lists the average motion index for each sequence along with the bit-rate and size of each coded video sequence. Note that the same QP value was used during the encoding of the four sequences.

Table 3.6: Relation between motion index and the coded video file size.

  Video Sequence   Average MIdx   Bit-Rate (kbit/sec)   File Size (KB)
  Crew             0.5126         95.762                88
  Harbour          0.4902         95.498                87.8
  Foreman          0.3874         70.008                64.3
  City             0.1898         50.937                46.8
Clearly, the motion index can now be related to the residual distortion D_Residual caused by multiple-description coding of a scalable video sequence. Since multiple-description coding of a video sequence involves dropping some of the high pass (or residual) components, sequences with a high MIdx will suffer from high residual distortion compared to sequences with a lower MIdx. We will demonstrate this relationship later in chapter 4.

The proposed multiple-description coding scheme incorporates some degree of redundancy in the video payload in order to assure independent decodability of every individual description. Let R_Total be the overall redundancy rate introduced by our MDC framework. Similar to the distortion measure, R_Total can also be separated into two components, R_Motion and R_Residual, such that R_Total = R_Motion + R_Residual.

The redundancy arising from multiple-description coding of motion information is mainly due to the common syntax elements coded in both MDC and non-MDC macroblocks, in addition to the new syntax elements introduced by MDC macroblocks. The syntax elements coded for both MDC and non-MDC macroblocks are shown in Table 3.3. Previous multiple-description coding schemes such as [19] required copying all motion information in both descriptions to ensure correct decodability of the independent descriptions. The problem with such an approach is mainly the considerable increase in the bitrate of the multiple-description coded bitstream, translated into an increase in video redundancy. We have overcome this problem by limiting the level of redundancy to a minimum. Therefore, based on the syntax elements listed in Table 3.3 above, the motion redundancy is estimated to be

R_{Motion} = \underbrace{R_{BLSkipFlag} + R_{MbMode} + R_{BLQRefFlag}}_{\text{common syntax elements}} + \underbrace{R_{MDCFlag} + R_{BestNeighbor}}_{\text{added syntax elements}} \qquad (3.21)

We will label the common syntax element rates by R_common. The major contribution to motion redundancy comes from the coding of the BestNeighbor field. However, this increase in bitrate remains much smaller than coding the actual motion-vector differences instead, for the following reasons. The BestNeighbor field is binarized using the 3-bit fixed length binarization scheme discussed in section 3.1.2. The motion-vector difference (Mvd) field binarization is performed by concatenating the truncated unary and Exp-Golomb binarization schemes (UEG3) with a cutoff value of 9. Moreover, motion-vector differences in H.264 are set to quarter-pixel accuracy. For instance, if a block shifts by one pixel from one picture to the other, then the coded motion-vector will have a value of 4 [11]. Applying UEG3 binarization will produce 5 bits for every one-pixel shift: four bits to code the value of the Mvd and an additional sign bit [25]. Entropy coding of both the BestNeighbor and Mvd fields is governed by their respective context models, which closely approximate the actual symbol probabilities; the respective binstrings are then each coded using the same binary arithmetic coder. Experiments show that the coding of the finite-alphabet BestNeighbor syntax element requires fewer bits than encoding the actual Mvd syntax element, which has a much larger alphabet, the tradeoff being the distortion cost calculated earlier in this section.
The additional rate contributed by multiple-description coding of the residual information, R_Residual, is directly related to the number of Intra-coded blocks found within a high pass frame. Note that since all high pass subband components are set to zero in an MDC high pass frame, the redundancy rate is strictly limited to the coding of Intra macroblocks. Let N_Intra and R_Intra be the number of Intra-coded macroblocks in a group of pictures and the average bitrate required to encode an Intra-coded macroblock, respectively. We can estimate the residual redundancy by

R_{Residual} \approx N_{Intra} \times R_{Intra} \qquad (3.22)

3.3 Multiplexing and Error Concealment

The final step before transmitting the multiple-description coded stream is multiplexing the generated descriptions. Our framework is based on a layered structure of a combined scalable video stream offering spatial, temporal, and SNR scalability. The original SVC multiplexing scheme is performed on a group of pictures (GOP) basis, such that every frame belonging to all the SVC layers of a coded GOP is embedded into the bitstream before the next GOP frames are inserted. Figure 3.17 (a) shows the Multiplex module of the original SVC encoder. The coded frames of a GOP k are arranged starting with the layer SVC0 frames, followed by layer SVC1 frames, until all frames belonging to GOP k are inserted. The frames of GOP k+1 then follow, starting with the layer SVC0 frames, and so on.

Performing multiple-description coding on the high pass frames of all SVC layers, except for layer SVC0, will yield two descriptions of the high pass frames. This requires a slight modification to the Multiplex module such that all frames belonging to GOP k, for instance, are still inserted into the bitstream before the insertion of any frame from GOP k+1. Moreover, we further propose embedding all frames belonging to layer l of both descriptions into the bitstream before inserting any frame from layer l+1 as an additional multiplexing constraint.

Figure 3.17: (a) Original Multiplex module of the SVC encoder. (b) Modified Multiplex module to accommodate MDC frames.
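The ordering constraint can be expressed compactly as a sort key, sketched below with an assumed tuple layout for the coded frames.

```python
def multiplex(frames):
    """Order coded frames per the modified Multiplex constraint.

    frames: iterable of (gop, layer, description, frame_id) tuples, where
    description is 0 for base/low-pass data and 1 or 2 for the two
    enhancement descriptions. All frames of GOP k precede GOP k+1, and
    both descriptions of layer l precede layer l+1 within a GOP.
    """
    return sorted(frames, key=lambda f: (f[0], f[1], f[2], f[3]))

stream = [(0, 1, 2, 7), (0, 0, 0, 1), (1, 0, 0, 9), (0, 1, 1, 5)]
print(multiplex(stream))
# [(0, 0, 0, 1), (0, 1, 1, 5), (0, 1, 2, 7), (1, 0, 0, 9)]
```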
The encoding side of our multiple-description coding framework partitions a single stream into three sub-streams: a base-layer stream, labeled stream 0, that contains all SVC0 frames along with the low pass frames of the higher layers, and two descriptions of an enhancement layer that contains the high pass frames of the higher layers, labeled stream 1 and stream 2, respectively. The stream number is also coded into the frame header to facilitate stream partitioning. From this point on, it falls upon the transmission network to decide how to transmit the multiple descriptions, whether to transmit the entire bitstream using one channel or to partition the bitstream and transmit each description using a separate channel. Note that stream one and stream two are independently decodable of each other; however, they do depend on the correct reception of stream zero. Therefore, a necessary condition for independently decoding each of the two descriptions is the correct reception and decodability of the base layer (stream 0).

The MDC decoder will simply invert all the steps performed by the encoder to create the two descriptions. Figure 3.18 shows the basic structure of an MDC decoder, with emphasis on the inverse MDC module (the shaded block). The combine motion fields and combine residual fields modules restore the coded motion and texture data to the original SVC compatible, single-description input that is accepted by the MCTF decoder.

Figure 3.18: Basic structure of the MDC Decoder with the Inverse MDC module shown in gray.

Let us first evaluate the significance of the coded frames produced by our SVC based multiple-description coder. The proposed framework offers a layered approach to MDC. We separate a coded video sequence into one logical layer, the base-layer, which contains all SVC0 frames in addition to the low pass frames of all higher SVC layers. This architecture necessitates the correct reception of all base-layer frames in order to ensure decodability. Therefore, if we rate the importance of the coded video frames, the base-layer frames take the highest priority. Next, we move to the multiple-description coded logical enhancement-layer, which contains all high pass frames of all SVC layers higher than SVC0. These high pass frames may or may not contain prediction data (motion information), depending on the type of scalability associated with the SVC layer. For instance, if an SVC layer is SNR scalable, then its high pass frames do not contain motion information. High pass frames belonging to SVC layers that extend the temporal and/or spatial scalability of the lower layers do contain motion information. Hence, the loss of a high pass frame from an SNR scalable layer will only cause some degradation in output picture quality; it does not hamper the decoding process, since the lost residual samples can be assumed to be zero. The loss of motion information, on the other hand, is not replaceable and will completely obstruct the decoding of the respective layer.

To clearly illustrate our discussion, consider the group of pictures shown in Figure 3.19. The dark frames belong to the base-layer whereas the light frames belong to the multiple-description coded enhancement-layer.

Figure 3.19: Multiple-description coded group of pictures with four SVC layers of combined scalability.

Figure 3.20 shows the same GOP displayed earlier with five frame-loss scenarios labeled LS1 to LS5.
The frame loss scenarios shown in Figure 3.20 include losses in base-layer frames, motion information of enhancement-layer frames and residual information in enhance-ment frames. These loss scenarios constitute a comprehensive study of the possible frames losses that might occur when transmitting the coded video bitstream. The error handling routines we developed for coping with these losses are listed in Table 3.7. The error handling routines listed in Table 3.7 were developed to allow an SVC decoder to detect and recover from frame losses. This feature is not supported in the current SVC decoder nor is there any recommendation that describes error handling in SVC. The existing recommendation suggests replacing any lost output frame with the previously decoded output frame, a condition that occurs every time a single coded Chapter 3. Multiple Description Coding of the Scalable Extension ofH.264/AVC (SVC) 68 Table 3.7: Error handling routine description. Loss Sce-nario Description Effect Error Handling LSI Loss of the two descriptions of a high pass frame belonging to an SNR layer Reduces quality of decoded video Assume that the lost HP frame has zero residual information. Extract motion, information from the lower SVC layer. Resume decoding of the GOP. LS2 Loss of one description of a high pass frame of a spatial/temporal layer Activate motion recovery routine Use the second description high pass frame to recover an estimate to the lost motion and residual information. Resume decoding of the GOP. LS3 Loss of the two descriptions of a high pass frame belonging to a spatial/temporal layer Disrupt decoding of current and higher SVC layers Discard decoded frames belonging to current layer and ignore all other received frames belonging to the same GOP. Repeat the last decoded picture to fill the GOP space. LS4 Loss of the two descriptions of a high pass frame belonging to the highest SVC layer (also an SNR layer) Reduces quality of decoded video Assume that the lost HP frame has zero residual information. Extract motion information from the lower SVC layer. Resume decoding of the GOP. LS5 Loss of a base layer frame Disrupts decoding of the current GOP Discard all received frames belonging to the current GOP. Repeat the last decoded picture to fill the GOP space. Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC) 69 S V C 3: S N R S V C 2: spatial/temporal S V C 1: S N R S V C 0: temporal ^ g j g ^ l H | H H H H H LS 4 \ H / H H H H H H H H \ L S 2 H H V D1 i D2 H H H H H H H LS 3 H H H H D1 D2 D1 D2 F i g u r e 3 . 2 0 : GOP w i t h five p o s s i b l e f r a m e - l o s s s c e n a r i o s . f r a m e i n a GOP i s l o s t . T h e r e f o r e , t h e r e f e r e n c e e r r o r h a n d l i n g r o u t i n e c o m m e n d s t h a t f o r e v e r y l o s t f r a m e i n a GOP, t h e e n t i r e GOP w i l l b e r e p l a c e d b y t h e i n s e r t i o n o f t h e l a s t o u t p u t p i c t u r e . I n t h e n e x t s e c t i o n , w e s h o w s i m u l a t i o n r e s u l t s t h a t f a v o r o u r p r o p o s e d m u l t i p l e - d e s c r i p t i o n c o d i n g a n d e r r o r h a n d l i n g r o u t i n e s a s o p p o s e d t o t h e r e f e r e n c e SVC a n d e r r o r h a n d l i n g r o u t i n e s . 70 Chapter 4 Simulation Results The J V T standardization committee has put together common conditions to test the error resilience of SVC. 
These common conditions are meant to be used in testing any proposal of error resilient coding tools developed for SVC by providing simulation results adhering to the specified conditions [28]. We will use the conditions specified in [28] as the common grounds to test our MDC framework and compare our results with the single-description SVC and the multiple-description scalable coding framework developed in [19].

We will begin by describing the simulated network environment which will transfer our multiple-description coded SVC stream. As a primary guideline, it is important to note that [28] assumes that the communication of video streams is performed using an RTP/UDP/IP transport [29]. This specification adheres to the MBMS streaming delivery transport protocol [8]. [28] further assumes that transmission errors are restricted to packet losses, bit errors not being considered since UDP would discard packets with any bit errors. Moreover, the RTP payload size is limited to 1400 bytes, with only one NAL unit encapsulated in one RTP packet. This condition cannot be satisfied at this stage of SVC development, since SVC does not support the coding of multiple slices per frame. Therefore, the alternative is to partition every NAL unit among several RTP packets with a maximum size of 1400 bytes. We have developed an RTP encapsulator that extracts the NAL units from the SVC bitstream and packetizes them into RTP packets with a maximum payload size of 1400 bytes. If the NALU size is smaller than 1400 bytes, then the entire NALU is encapsulated in one RTP packet. If, on the other hand, the NALU size is larger than 1400 bytes, then the NALU is partitioned into blocks of 1400 bytes, each of which is encapsulated in one RTP packet. Moreover, the RTP marker bit is set to one to indicate that the RTP payload is a full NALU or the last partition of a NALU; otherwise, the marker bit is set to zero. Figure 4.1 below shows the NALU encapsulation process.

Figure 4.1: RTP encapsulation of NALUs with varying sizes.
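The packetization rule can be sketched as follows; the (marker, payload) representation is an assumption, and RTP header construction is omitted.

```python
MAX_PAYLOAD = 1400   # maximum RTP payload size in bytes

def packetize(nalu):
    """Split one NALU (bytes) into a list of (marker_bit, payload).

    The marker bit is 1 on a complete NALU or on the last partition of
    a partitioned NALU, and 0 otherwise.
    """
    chunks = [nalu[i:i + MAX_PAYLOAD]
              for i in range(0, len(nalu), MAX_PAYLOAD)]
    return [(1 if i == len(chunks) - 1 else 0, c)
            for i, c in enumerate(chunks)]

pkts = packetize(bytes(3100))
print([(m, len(p)) for m, p in pkts])   # [(0, 1400), (0, 1400), (1, 300)]
```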
Consequently, the simulated packet losses result only from fading/shadowing errors at the radio interface [30]. The user plane protocol stack specifications defined by 3GPP and 3GPP2 are similar between the user equipment (UE) or mobile station (MS) and the radio base station (BS). We will give a brief overview of the protocol stack as presented in [30]. Figure 4.2 shows the packetization of a video payload unit according to the user plane protocol stack of CDMA-2000 [30]. Physical frame LTU Figure 4.2: CDMA-2000 user plane protocol stack packetization presented in VCEG-N80. The protocol stack used for UMTS is very similar, except that 3GPP defines different names for the protocols. The differences listed in [30] are as follows: • The point-to-point protocol (PPP) is not used. Instead, the radio link control (RLC) protocol includes additional information that signals the RLC payload boundaries. • The size of a physical layer unit in UMTS is more flexible than in CDMA-2000 and therefore, a physical layer frame is not split further into logical transmission units (LTU) as occurs in CDMA-2000. In CDMA-2000, an LTU is the smallest fixed-size unit and has a cyclic redundancy Chapter 4. Simulation Results 73 code (CRC) to detect possible errors. U M T S on the other hand defines an R L C - P D U (protocol data unit) which is a fixed length physical layer unit that is determined when the radio bearer is setup. In our simulations, it is assumed that one link-layer frame is packed into one L T U / R L C - P D U [30]. In general, an R T P / U D P / I P packet is packed into one P D C P (packet data convergence protocol)/PPP packet which becomes an R L C - S D U (service data unit). Since video packet sizes are variable in nature, the R L C -SDUs will maintain the variation in size. If an R L C - S D U is larger that the R L C - P D U , then the R L C layer segments the SDU into multiple PDUs. For our simulations, we have merged the packet loss patterns presented in ITU-T V C E G Q15-I-16rl to create radio bearers with 3%, 5%, 10%, and 20% block error rates (BLER). The bearer specifications used in the offline simulator first assume that no retransmissions are allowed during transmission and therefore, the bearer is unac-knowledged. The bearer bitrate is set to 128kbits/s with a radio frame size (RFS) or R L C - P D U size of 320 bytes. Moreover, we will use U M T S as our telecommunications system. In our simulations, we will demonstrate the performance of our MD-SVC framework in the simulated U M T S network compared to the performance of the M D - M C T F scheme presented in [19] and the single description SVC bitstream [14]. We have coded 961 frames from each of the four test sequences crew, city, harbour, and foreman. The 961 frames were generated according to the recommendation in [28] by first encoding the pictures "in normal order till the original sequence ends, then in reverse order from the second last picture in the sequence end to the beginning, and then again in normal order from the second picture, and so on." Four SVC layers were encoded according to the configuration shown in Table 4.1. The multiple-description coded S V C stream, which we will refer to as MD-SVC, is composed of a base-layer which contains all SVCO frames along with the low pass frames Chapter 4. Simulation Results 74 Table 4.1: Coding configuration of the test sequences. 
In our simulations, we will demonstrate the performance of our MD-SVC framework in the simulated UMTS network compared to the performance of the MD-MCTF scheme presented in [19] and the single-description SVC bitstream [14]. We have coded 961 frames from each of the four test sequences Crew, City, Harbour, and Foreman. The 961 frames were generated according to the recommendation in [28] by first encoding the pictures "in normal order till the original sequence ends, then in reverse order from the second last picture in the sequence end to the beginning, and then again in normal order from the second picture, and so on." Four SVC layers were encoded according to the configuration shown in Table 4.1.

Table 4.1: Coding configuration of the test sequences.

Layer  Spatial Resolution  Frame Rate  GOP Size  Scalability
SVC0   QCIF                15          8         Temporal
SVC1   QCIF                15          8         SNR
SVC2   CIF                 30          16        Spatial-temporal
SVC3   CIF                 30          16        SNR

The multiple-description coded SVC stream, which we will refer to as MD-SVC, is composed of a base layer, which contains all SVC0 frames along with the low pass frames of the higher three SVC layers, and an enhancement layer containing two descriptions of the high pass frames belonging to SVC layers 1, 2, and 3. One example of the coding structure can be seen in Figure 3.19; note, however, that the GOP sizes shown in Figure 3.19 are half of the GOP sizes used for our tests.

We will assume that all SVC0 layer frames, along with the low pass frames belonging to the higher SVC layers, are protected using the Unequal Erasure Protection (UXP) scheme presented in [17], which was developed specifically to protect base layer frames in SVC. A brief overview of the UXP scheme was presented earlier. Accordingly, the UXP scheme is applied to all coded streams (MD-SVC, MD-MCTF, and SD-SVC). The UXP scheme results in a loss rate significantly below 1% [17] with a 20% overhead in the transmission bitrate of the protected frames. Furthermore, the MD-MCTF scheme is applied only to the high pass frames of the enhancement layer defined by our MD-SVC framework. This restriction was imposed because the MD-MCTF scheme replicates low pass frames in MCTF, which dramatically increases the bitrate of the coded video stream and would result in an unfair comparison with our developed scheme. Therefore, in the MD-MCTF case, low pass and base layer frames are protected using only UXP.

Figures 4.3 to 4.6 demonstrate the performance of our MD-SVC scheme compared to the MD-MCTF and single-description SVC (SD-SVC) schemes when faced with loss rates between 3% and 20%. The sequences Crew, Foreman, Harbour, and City were encoded using each of the above-mentioned coding schemes, and the reconstructed frames obtained after decoding are compared using the luminance PSNR (Y-PSNR) measure.
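For reference, the Y-PSNR used in these comparisons follows the standard luminance-PSNR formulation. The sketch below assumes 8-bit luma planes stored as NumPy arrays; it is a generic implementation, not code taken from the SVC reference software.

```python
import numpy as np

def y_psnr(reference: np.ndarray, decoded: np.ndarray) -> float:
    """Luminance PSNR (in dB) between two 8-bit luma planes.

    Y-PSNR = 10 * log10(255^2 / MSE), with the MSE taken over
    the luma samples only.
    """
    diff = reference.astype(np.float64) - decoded.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Per-sequence curves such as those in Figures 4.3 to 4.6 are obtained
# by evaluating y_psnr on each decoded frame against the original.
```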
Figure 4.3: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Crew sequence.

Figure 4.4: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Foreman sequence.

Figure 4.5: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Harbour sequence.

Figure 4.6: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the City sequence.

The PSNR performance plots show that MD-SVC outperforms MD-MCTF in most cases, except for the City sequence, where MD-MCTF has a better PSNR performance. The City sequence has very little motion, as indicated by the motion index (MIdx) plots shown in Figure 3.17. However, it is important to note that the level of redundancy imposed by our MD-SVC scheme is lower than that imposed by the MD-MCTF scheme in the case of the City sequence. As indicated in Section 3.2, the redundancy of our MD-SVC scheme results mainly from the repeated Intra-coded macroblocks in the video sequence. The MD-MCTF scheme does not insert these Intra-coded blocks in multiple-description coded high pass frames. The result is improved decoded video quality for our MD-SVC over MD-MCTF in sequences that contain a large number of Intra-coded macroblocks.

Figure 4.7 illustrates the relation between the redundancy rate and the associated distortion due to the packet loss patterns. The redundancy rates used are those generated by encoding the four video sequences and are listed in Table 4.2.

Table 4.2: MD scheme redundancy rates.

Scheme    Crew   Foreman  Harbour  City
MD-SVC    21.9%  17.9%    8.1%     14.1%
MD-MCTF   13.9%  16%      7.7%     15.5%

The redundancies listed in Table 4.2 show that both MD-SVC and MD-MCTF impose very similar redundancy rates. Except for the Crew sequence, which exhibits a high number of Intra-coded blocks in the high pass frames, the redundancy rates are within 1% of each other.

Figure 4.7: Comparison of MD-SVC and MD-MCTF redundancy rates and associated distortions.

The peaks seen in the MD-MCTF plots in Figure 4.7 are due to the City sequence, which exhibits very little motion; losses in high pass frames therefore did not affect the quality much, since most of the signal power is conserved in the low pass frames. The MD-MCTF scheme does not generate any distortion due to the coding of motion information, since motion information is replicated in both descriptions. This, however, is not the case in our MD-SVC scheme, which saves on the redundancy in motion information by multiple-description coding of the motion information. In exchange, MD-SVC uses the savings in motion information coding to insert redundant texture information by replicating Intra-coded macroblocks.

Figures 4.8 to 4.11 show snapshots from the four video sequences. The pictures on the left are generated from MD-SVC; the pictures on the right are MD-MCTF coded. It can be seen that MD-SVC causes slight distortions that reflect negatively in PSNR but are not as significant visually. These slight distortions are mainly due to the approximation of the motion vectors of macroblocks from the neighboring blocks; the benefit of this coding mechanism lies in reducing the video data rate to allocate more bits to duplicate Intra-coded blocks. Since the MD-MCTF scheme does not duplicate Intra-coded blocks, the packet loss is more obvious visually, as can be seen in Figures 4.8 to 4.11.
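The motion-vector approximation mentioned above can be illustrated with a simple recovery rule. The sketch below estimates a lost macroblock's motion vector as the component-wise median of its surviving neighbors' vectors; this median rule is a common concealment heuristic and stands in for the recovery approach of Chapter 3, whose exact rule may differ. All names are illustrative.

```python
# Illustrative sketch: estimate a lost macroblock's motion vector from
# available neighboring blocks using a component-wise median. This is a
# generic recovery rule, not necessarily the exact rule used in MD-SVC.

from statistics import median

def recover_motion_vector(neighbor_mvs):
    """neighbor_mvs: list of (dx, dy) vectors from decoded neighbors
    (e.g., left, top, and top-right macroblocks). Returns an estimated
    (dx, dy) for the lost block, or (0, 0) if no neighbors survived."""
    if not neighbor_mvs:
        return (0, 0)  # fall back to the co-located block, i.e. no motion
    dx = median(mv[0] for mv in neighbor_mvs)
    dy = median(mv[1] for mv in neighbor_mvs)
    return (dx, dy)

# Example: neighbors moved mostly right and slightly down.
print(recover_motion_vector([(4, 1), (5, 0), (3, 2)]))  # -> (4, 1)
```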
Figure 4.8: Comparison of visual quality from the Foreman sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.9: Comparison of visual quality from the Crew sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.10: Comparison of visual quality from the City sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.11: Comparison of visual quality from the Harbour sequence, MD-SVC (left), MD-MCTF (right).

Chapter 5

Conclusion and Future Work

In this thesis, we have developed a new multiple-description coding scheme for the scalable extension of the H.264/AVC video coding standard (SVC). Our scheme (MD-SVC) generates two descriptions of the high pass frames of the enhancement layers of an SVC coder by coding in each description only half the motion information and half the texture information, with a minimal degree of redundancy. Intra-coded macroblocks are inserted as redundant information since they cannot be approximated using the motion information. The two descriptions are complementary but independently decodable, such that if only one description is received, the decoder is able to recover the missing motion information from the available data and generate an output video sequence with acceptable degradation in quality. If both descriptions are received, then the full quality of a single-description SVC (SD-SVC) is delivered. However, the two descriptions are highly dependent on the base layer; therefore, the base layer and the low pass frames of the enhancement layers are protected using an unequal erasure protection (UXP) technique developed specifically for SVC payload data. We have also added error detection and concealment features to the SVC decoder to cope with frame losses.

Our multiple-description coding framework conforms to the protocol and codec specifications of the 3GPP multimedia broadcast/multicast services (MBMS). Moreover, the framework is built on and integrated into the SVC codec, and can therefore offer highly reliable video services to MBMS networks and clients with heterogeneous resources and capabilities. The outcome of our coding scheme is three separable streams that can be transmitted over the same channel (in multicast or broadcast mode) or over three separate channels (in broadcast mode) to minimize the packet loss rate. Objective and subjective performance evaluations have shown that our scheme delivers superior decoded video quality when compared with the UXP-protected SD-SVC and the multiple-description motion compensated temporal filtering (MD-MCTF) scheme at comparable redundancy levels.

It would be desirable to further reduce the redundancy level and remove the dependence on the base layer; therefore, more work can be performed to generate low-redundancy multiple descriptions of the low pass frames using techniques such as correlating transforms. Furthermore, the accuracy of the motion recovery approach can be improved through boundary matching algorithms and block interpolation.

Bibliography

[1] Mobile Broadcast/Multicast Service (MBMS), 2004.

[2] DigiTAG. Television on a Handheld Receiver - Broadcasting with DVB-H, 2005.

[3] Nokia makes air interface of its mobile TV end-to-end solution (DVB-H) publicly available, May 2005.

[4] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard.
IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560-576, July 2003.

[5] R. Schafer, H. Schwarz, D. Marpe, T. Schierl, and T. Wiegand. MCTF and scalability extension of H.264/AVC and its applications to video transmission, storage, and surveillance. In Visual Communications and Image Processing, July 2005.

[6] European Telecommunications Standards Institute. Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Stage 1, September 2004.

[7] European Telecommunications Standards Institute. Digital cellular telecommunications system (Phase 2+); Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Stage 1, September 2004.

[8] European Telecommunications Standards Institute. Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Protocols and Codecs, March 2005.

[9] Samsung Electronics. Scalable Multimedia Broadcast and Multicast Service (MBMS), May 2002.

[10] G.J. Sullivan and T. Wiegand. Video compression - from concepts to the H.264/AVC standard. Proceedings of the IEEE, 93(1):18-31, 2005.

[11] T. Wiegand, G.J. Sullivan, and A. Luthra. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC). Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 2003.

[12] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G.J. Sullivan. Rate-constrained coder control and comparison of video coding standards. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):688-703, July 2003.

[13] T. Wedi and H.G. Musmann. Motion- and aliasing-compensated prediction for hybrid video coding. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):577-586, July 2003.

[14] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Subband Extension of H.264/AVC, March 2004.

[15] M. Flierl. Video coding with lifted wavelet transforms and frame-adaptive motion compensation. In VLBV, September 2003.

[16] H. Schwarz, D. Marpe, and T. Wiegand. MCTF and scalability extension of H.264/AVC. In PCS, December 2004.

[17] T. Schierl, H. Schwarz, D. Marpe, and T. Wiegand. Wireless broadcasting using the scalability extension of H.264/AVC. In ICME, July 2005.

[18] S. Wenger, M.M. Hannuksela, T. Stockhammer, M. Westerlund, and D. Singer. RTP Payload Format for H.264 Video. IETF, February 2005.

[19] M. van der Schaar and D.S. Turaga. Multiple description scalable coding using wavelet-based motion compensated temporal filtering. In ICIP, September 2003.

[20] A.R. Reibman, H. Jafarkhani, Y. Wang, M.T. Orchard, and R. Puri. Multiple description coding for video using motion compensated prediction. In ICIP, 1999.

[21] C. Kim and S. Lee. Multiple description coding of motion fields for robust video transmission. IEEE Transactions on Circuits and Systems for Video Technology, 11(9):999-1010, September 2001.

[22] Y.K. Wang, M.M. Hannuksela, V. Varsa, A. Hourunranta, and M. Gabbouj. The error concealment feature in the H.26L test model. In ICIP, 2002.

[23] T. Stockhammer, M.M. Hannuksela, and T. Wiegand. H.264/AVC in wireless environments. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):657-673, July 2003.

[24] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG.
Scalable Video Coding - Working Draft 2, April 2005.

[25] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620-636, July 2003.

[26] M.J. Weinberger, J.J. Rissanen, and M. Feder. A universal finite memory source. IEEE Transactions on Information Theory, 41(3):643-652, May 1995.

[27] D. Marpe, H. Schwarz, G. Blattermann, G. Heising, and T. Wiegand. Context-based adaptive binary arithmetic coding in JVT/H.26L. In ICIP, September 2002.

[28] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Common conditions for SVC error resilience testing, July 2005.

[29] S. Wenger. H.264/AVC over IP. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):645-656, July 2003.

[30] ITU-T Video Coding Experts Group (VCEG). Common Test Conditions for RTP/IP over 3GPP/3GPP2, December 2001.
