Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC)

by

Hassan Mansour

B.E., The American University of Beirut, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in The Faculty of Graduate Studies (Electrical Engineering)

The University of British Columbia

August 29, 2005

© Hassan Mansour 2005

Abstract

Advances in digital video coding are pushing the boundaries of multimedia services and Internet applications to mobile devices. The Scalable Video Coding (SVC) project is one such development, providing different video quality guarantees to end users on mobile networks that support different terminal capability classes. Examples of such networks are Digital Video Broadcast-Handheld (DVB-H) and the Third Generation Partnership Project's (3GPP) Multimedia Broadcast Multicast Service (MBMS). In this thesis, we propose a multiple-description coding scheme for SVC (MD-SVC) to provide error resilience to SVC and ensure the safe delivery of the video payload to end users on MBMS networks. Due to the highly error-prone nature of wireless environments, received video quality is normally guaranteed by video redundancy coding, retransmissions, forward error correction, or error-resilient video coding. MD-SVC takes advantage of the layered structure of SVC to generate two descriptions (or versions) of the higher layer (enhancement layer) frames in SVC while utilizing Unequal Erasure Protection (UXP) to efficiently protect the base layer frames. The result is three separable streams that can be transmitted over the same channel or over three separate channels to minimize the packet loss rate. The two enhancement descriptions are independently decodable; however, both descriptions depend on the error-free reception of the base layer. Furthermore, error detection and concealment features are added to the SVC decoder to cope with frame losses. The proposed scheme is implemented and fully integrated into the SVC codec and tested using a 3GPP/3GPP2 offline network simulator. Objective and subjective performance evaluations show that, under the same packet loss conditions, our scheme outperforms the single-description SVC and existing scalable multiple-description coding schemes.

Contents

Abstract
Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
1 Introduction
2 Digital Video Transmission and Coding, From H.264/AVC To SVC
  2.1 Multimedia Broadcast/Multicast Services
    2.1.1 MBMS Modes
  2.2 Overview of the H.264/AVC Standard
    2.2.1 Motion Estimation and Motion Compensation
  2.3 Scalable Video Coding
  2.4 Error Resilience in Video Coding
    2.4.1 Unequal Erasure Protection in SVC
    2.4.2 Multiple Description - Motion Compensated Temporal Filtering
3 Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC)
  3.1 Multiple Description Coding of High Pass (HP) Frames
    3.1.1 Multiple Description Coding of Prediction (Motion) Data
    3.1.2 Entropy Coding of Error Recovery Data
    3.1.3 Multiple Description Coding of Residual Data
  3.2 Theoretical Analysis
  3.3 Multiplexing and Error Concealment
4 Simulation Results
5 Conclusion and Future Work
Bibliography

List of Tables

2.1 MBMS video delivery bandwidth resources
3.1 MDC macroblock classification
3.2 MB mode and the associated coded BN indices
3.3 Syntax element classification in MDC
3.4 BestNeighbor symbol probabilities and the corresponding binstrings
3.5 BestNeighbor context models
3.6 Relation between motion index and the coded video file size
3.7 Error handling routine description
4.1 Coding configuration of the test sequences
4.2 MD scheme redundancy

List of Figures

2.1 Example of (a) Broadcast Mode and (b) Multicast Mode Networks
2.2 Macroblock and sub-macroblock partitions in H.264/AVC
2.3 Example of a coded video sequence
2.4 Hybrid encoder similar to the H.264/AVC encoder
2.5 Functional structure of the Inter prediction process
2.6 Motion vector and reference index assignment in the motion estimation process
2.7 MCTF temporal decomposition process
2.8 Basic structure of a 5 layer combined scalability encoder
2.9 UXP transmission sub-block (TSB) packetization
2.10 Coding structure of the MD-MCTF
3.1 Coding structure of the MD-MCTF
3.2 Single layer MCTF encoder with additional and modified modules for multiple-description coding
3.3 Basic structure of the Multiple Description Coding module with a two-description coder of High Pass frames
3.4 Quincunx Lattice structure of prediction data MBs
3.5 Typical motion data macroblock map of a HP frame in description 1
3.6 Change in motion vector prediction. (a) Standard defined neighbor MBs (b) MDC defined neighbor MBs
3.7 Neighboring 8x8 block indices
3.8 Fixed Length binarization for the BestNeighbor syntax element
3.9 Multiple-description coding of residual frames in a temporal-scalable layer with three temporal levels. Dark blocks indicate Intra-coded macroblocks in the residual frame
3.10 Frequency response properties of the [-1/8 1/4 3/4 1/4 -1/8] low pass filter
3.11 Frequency response of the [-1/2 1 -1/2] high pass filter
3.12 Decomposition of a group of pictures (GOP) of size 8. The shaded frames will be coded and transmitted in the bitstream
3.13 Reconstruction of a group of pictures showing the MDC high pass frames with zero residual marked with the red Xs
3.14 Variation of motion index with respect to GOP number in the Crew sequence
3.15 Crew sequence showing (a) frames from GOP number 2 and (b) frames from GOP number 10
3.16 Comparison of the motion index values between the 113 coded frames of the four sequences Crew, City, Foreman, and Harbour
3.17 (a) Original Multiplex module of the SVC encoder. (b) Modified Multiplex module to accommodate MDC frames
3.18 Basic structure of the MDC Decoder with the Inverse MDC module shown in gray
3.19 Multiple-description coded group of pictures with four SVC layers of combined scalability
3.20 GOP with five possible frame-loss scenarios
4.1 RTP encapsulation of NALUs with varying sizes
4.2 CDMA-2000 user plane protocol stack packetization presented in VCEG-N80
4.3 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Crew sequence
4.4 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Foreman sequence
4.5 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Harbour sequence
4.6 Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the City sequence
4.7 Comparison of MD-SVC and MD-MCTF redundancy rates and associated distortions
4.8 Comparison of visual quality from the Foreman sequence, MD-SVC (left), MD-MCTF (right)
4.9 Comparison of visual quality from the Crew sequence, MD-SVC (left), MD-MCTF (right)
4.10 Comparison of visual quality from the City sequence, MD-SVC (left), MD-MCTF (right)
4.11 Comparison of visual quality from the Harbour sequence, MD-SVC (left), MD-MCTF (right)

List of Acronyms

3GPP      3rd Generation Partnership Project
AVC       Advanced Video Coding
BLER      BLock Error Rate
BN        Best Neighbor
BS        Base Station
CABAC     Context-Adaptive Binary Arithmetic Coding
CAVLC     Context-Adaptive Variable Length Coding
CBP       Coded Block Pattern
CDMA      Code Division Multiple Access
CRC       Cyclic Redundancy Code
DCT       Discrete Cosine Transform
DVB-H     Digital Video Broadcasting - Handheld
FEC       Forward Error Correction
GERAN     GSM/EDGE Radio Access Network
GOP       Group Of Pictures
GSM       Global System for Mobile communications
HP        High Pass
IPDC      IP Data Cast
ICT       Integer Cosine Transform
LP        Low Pass
LPS       Least Probable Symbol
LTU       Logical Transmission Unit
MB        Macroblock
MBMS      Multimedia Broadcast Multicast Services
MCTF      Motion Compensated Temporal Filtering
MDC       Multiple Description Coding
MD-MCTF   Multiple Description Motion Compensated Temporal Filtering
MD-SVC    Multiple Description Scalable Video Coding
MPEG      Moving Pictures Experts Group
MPS       Most Probable Symbol
MTAP      Multi-Time Aggregation Packet
MV        Motion Vector
NAL       Network Abstraction Layer
PDU       Protocol Data Unit
PLMN      Public Land Mobile Network
PPP       Point-to-Point Protocol
PSNR      Peak Signal to Noise Ratio
QP        Quantization Parameter
QoS       Quality of Service
RFS       Radio Frame Size
RS        Reed-Solomon
RTP       Real-time Transport Protocol
SAD       Sum of Absolute Differences
SD        Single Description
SDU       Service Data Unit
SNR       Signal to Noise Ratio
SSD       Sum of Square Differences
SVC       Scalable Video Coding
TSB       Transmission Sub-block
UDP       User Datagram Protocol
UMTS      Universal Mobile Telecommunications System
RLC       Radio Link Control
UE        User Equipment
UTRAN     Universal Terrestrial Radio Access Network
UXP       Unequal Erasure Protection
VCEG      Video Coding Experts Group

Acknowledgements

I would like to thank my supervisors, Professor Panos Nasiopoulos and Professor Victor Leung, for their time, patience, and support. To the reviewers, I express my gratitude for their constructive feedback. To my best friend and "buddy", Rayan, I would like to express my appreciation for always being there for me. I would also like to thank Lina and Rachel for tolerating my irritability during the last few weeks. Last but not least, I would like to thank my mom, dad, and brother for their endless love and support.

This work was supported by a grant from Telus Mobility, and by the Natural Sciences and Engineering Research Council of Canada under grants CRD247855-01 and CBNR11R82208.

Chapter 1

Introduction

The distribution of multimedia services to mobile devices is finally a reality.
Multimedia service providers are teaming up with mobile telecommunications operators to deliver numerous services to the public, from the transfer of light video and audio clips to the heavy-duty streaming of mobile TV [1], [2]. There are currently two major solutions for the support of mobile multimedia delivery, namely 3GPP's MBMS (Multimedia Broadcast/Multicast Service) and DVB 2.0's DVB-H (Digital Video Broadcast-Handheld), supported by Nokia [3]. These solutions suffer, however, from limited network bandwidth resources in addition to transmission errors that significantly affect the quality of the distributed video. MBMS and DVB-H therefore rely on recent advances in digital video coding technology to deliver video content that efficiently utilizes the allocated channel bandwidth while ensuring various QoS (Quality of Service) requirements.

The newest international video coding standard is H.264/AVC. Approved by ITU-T as Recommendation H.264 and by ISO/IEC as International Standard 14496-10 (MPEG-4 Part 10) Advanced Video Coding (AVC), H.264/AVC has proved its superiority to previous video coding standards (MPEG-4 Visual, MPEG-2, H.263) through its improved coding efficiency and its provision of a network-friendly video representation [4]. However, coding efficiency alone is not enough to support QoS features, due to the highly error-prone nature of wireless environments and the unexpected fluctuation in available bandwidth. Therefore, the demand for bandwidth-adaptive codecs and robust error-resilient techniques is constantly increasing. The Scalable Video Coding (SVC) standardization project was launched by MPEG (Moving Pictures Experts Group) and ITU-T's Video Coding Experts Group (VCEG) in January 2005 as an amendment to their H.264/AVC standard to solve the bandwidth fluctuation problem and offer multiple QoS levels to the end user [5]. However, SVC presently does not offer any error resilience features that can protect all the layers in the coded bitstream.

Existing solutions to protect a coded video stream include the use of Forward Error Correction (FEC) codes or Unequal Erasure Protection (UXP) techniques to combat bit errors in wireless transmission. Unfortunately, these techniques fail to recover any packet losses that might occur due to network congestion. An alternative to FEC and UXP is the implementation of Multiple Description Coding (MDC) of video content in order to protect against packet losses. In MDC, a coded video sequence is separated into multiple descriptions (or versions) of the coded sequence such that each description is independently decodable and provides a decoded video quality that is poorer than the original video quality. However, if all the descriptions are received by the decoder, then the original video quality should be reproduced. The independent decodability feature is made possible at the expense of additional coding overhead, also known as data "redundancy", between the various descriptions. The drawback of existing multiple-description coding techniques lies in the inefficient allocation of redundant data. The allocation of redundant data controls both the level of redundancy in the coded video bitstream and the quality (or distortion) produced from decoding only one description.

In this thesis, we develop a Multiple-Description Coding scheme (MD-SVC), specifically designed for the scalable extension of H.264/AVC (SVC), that dramatically
improves the error resilience of the SVC codec. Our proposed MD-SVC scheme generates two descriptions of the enhancement layers of an SVC coded stream by embedding in each description only half of the motion information and half of the texture information of the original coded stream, with a minimal degree of redundancy. The two descriptions are complementary but independently decodable, such that if only one description is received, the decoder will be able to recover the missing motion information from the available data and generate an output video sequence with an acceptable degradation in quality. If both descriptions are received, then the full quality of a single-description SVC stream is delivered. Furthermore, we have implemented our proposed scheme and integrated all of its functionalities into the existing SVC standard. We have also added error detection and error concealment features to the SVC decoder, which previously had none. These error handling routines help the decoder cope with packet losses that might arise due to the unexpected fluctuations in available bandwidth. The proposed framework thus provides a highly error-resilient video bitstream that requires no retransmissions or feedback channels while minimizing any channel overhead imposed by the video redundancy due to multiple description coding.

The rest of this thesis is organized as follows. In Chapter 2, we discuss some of the features supported by the MBMS video distribution framework. Next, we introduce the basic functions of digital video coding as implemented in H.264/AVC, leading up to the motion-compensated temporal filtering structure of SVC. In Chapter 3, we develop our Multiple-Description coding scheme and give a detailed account of its integration into the SVC codec. We also offer a theoretical analysis of the redundancy and distortion imposed by our scheme. In Chapter 4, we describe the network simulation setup and compare the performance of our MD-SVC scheme with other multiple-description and single-description coding methods. Finally, we conclude our work in Chapter 5 and offer suggestions for future research in this field.

Chapter 2

Digital Video Transmission and Coding, From H.264/AVC To SVC

2.1 Multimedia Broadcast/Multicast Services

While DVB-H is still in its testing stages, 3GPP's MBMS has already completed its first stage with Release 6 of the Universal Mobile Telecommunications System (UMTS) in September of 2004 [6]. Broadcast and Multicast are two IP datacast (IPDC) type services for transmitting datagrams from a single server to multiple clients (point-to-multipoint) that can be supported by the existing GSM (Global System for Mobile communications) and UMTS cellular networks [6], [1].

2.1.1 MBMS Modes

MBMS defines two functional modes, broadcast mode and multicast mode. The MBMS multicast mode differs from the broadcast mode in that it requires a client to subscribe to and activate an MBMS service, whereas the broadcast mode does not. The broadcast mode is generalized as a unidirectional point-to-multipoint bearer service in which multimedia data is transmitted from one server to multiple users in a broadcast service area. This service efficiently utilizes the available radio or network resources by transmitting data over a common radio channel that can be received by all users within the service area. The multicast mode similarly allows a unidirectional point-to-multipoint
transmission of multimedia data, but the users are restricted to those clients that belong to the multicast subscription group. [6] specifies that MBMS data transmission should have the ability to adapt to different RAN (Radio Access Network) resources and capabilities by efficiently managing the bitrate of the MBMS data. Furthermore, individual broadcast/multicast services are allocated for independent broadcast areas/multicast groups that may overlap. Quality of Service (QoS) guarantees can also be independently configured by the PLMN (Public Land Mobile Network) for each individual broadcast/multicast service [6]. Figure 2.1 shows examples of the Broadcast Mode and Multicast Mode networks as depicted in [6].

[Figure 2.1: Example of (a) Broadcast Mode and (b) Multicast Mode Networks.]

Handoff between different operators sharing the broadcast service area or multicast subscription group is allowed in MBMS [7]. Therefore, the broadcast/multicast resources will be allocated for all subscribers of a certain operator A in addition to all inbound roamers of operator A. However, if there are not enough resources (such as available bandwidth) to provide the requested service, the MBMS application will have to support the QoS requirements instead. MBMS User Services may be delivered to a user at different bit rates and qualities of service depending on the radio networks and conditions. Table 2.1 lists the bandwidth resources allocated for video streaming and video download as defined in [7].

Table 2.1: MBMS video delivery bandwidth resources.

Service             Media                                                  Distribution Scope   MBMS User Service Classification   Application Bit-Rate
Video streaming     Video and auxiliary data such as text, still images   Broadcast            Streaming                          < 384 kbps
Video streaming     Video and auxiliary data such as text, still images   Multicast            Streaming                          < 384 kbps
Video distribution  Video and auxiliary data such as text, still images   Broadcast            Download                           < 384 kbps
Video distribution  Video and auxiliary data such as text, still images   Multicast            Download                           < 384 kbps

Note that the specified bit rates are those of the user data at the application layer; in GERAN, lower bandwidth is available, which may constrain some applications. Since MBMS applications must adapt to the resource heterogeneity in UTRAN (Universal Terrestrial Radio Access Network) and GERAN (GSM/EDGE Radio Access Network), it falls upon the design of the multimedia content to provide the required scalability, which gives rise to the need for scalable content, especially in video streaming services. MBMS video streaming services also suffer from packet losses and transmission errors that arise from multi-path fading, channel interference, network congestion, and noise.
MBMS utilizes Forward Error Correction (FEC) schemes in order to protect the data packets during transmission [8]. The generic mechanism employed for systematic FEC of RTP streams involves the generation of two RTP payload formats, one for FEC source packets and another for FEC repair packets. FEC schemes use Reed-Solomon (RS) codes to protect the data content. A Reed-Solomon code with x symbols of error protection and a codeword length of n symbols is labeled an (n, n-x) RS code. The imposed overhead can then be calculated as n/(n-x) of the original data size. For example, if the codeword length is 255 bytes with a symbol error protection of 50 bytes, then the overhead is equal to 255/205 = 1.24, which is a 24% increase in data size.
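As a quick check of the overhead formula above, the following Python sketch (our illustration, not code from [8]) computes the expansion factor of an (n, n-x) RS code:

    def rs_overhead(n, x):
        # An (n, n-x) Reed-Solomon code carries n - x information symbols in
        # every n coded symbols, so the data size grows by a factor n / (n - x).
        return n / (n - x)

    # The example from the text: 255-byte codewords, 50 bytes of protection.
    print(round(rs_overhead(255, 50), 2))   # -> 1.24, i.e. a 24% increase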
Although FEC can be an effective tool, the problems that arise from its implementation include the lack of flexibility in error protection, since the entire bitstream has to be equally protected. Moreover, the redundancy added by including the FEC packets can significantly increase the video payload bit-rate and, in turn, cause additional congestion in the network. Alternatively, unequal error protection, unequal erasure protection, and error-resilient video coding techniques are more desirable solutions.

Finally, MBMS allows the use of multiple channels for the transmission of broadcast data in order to reduce the transmission power of a broadcast station and allow for flexible resource management. The separate physical channels can have varying power allocations and can thus cover different or overlapping portions of a service area. This scalability option would offer the entire broadcast service area a base quality of service guarantee using a high-powered base channel, and the core of the service area an enhanced quality of service through another, low-powered enhancement channel [9], [1].

2.2 Overview of the H.264/AVC Standard

Digital video coding was first made possible on a worldwide basis with the establishment of the MPEG-1 standard back in 1992. Over the past fifteen years, significant advances have been achieved with the launch of several international video coding standards, such as the well-known MPEG-2 standard, which set off the DVD industry, in addition to more recent standards including ITU-T's H.263 series and ISO/IEC's latest MPEG-4 standard [4], [10]. Video compression is achieved by applying a series of processes, categorized in [10] into the following:

• A prediction process that takes advantage of the spatial correlation within a single image and the temporal correlation between successive images to send "prediction" information (motion vectors, reference indices) to the decoder to help reconstruct a "prediction image". The differences in sample values (residual components) between the original image and the prediction image are then compressed using a different process.

• A transform process that converts the sample values (usually the residual components) into a set of samples where most of the signal power is grouped into fewer samples. The most commonly used transform in image and video coding is the discrete cosine transform (DCT).

• A quantization process that reduces the precision of representing a sample value in order to decrease the amount of data to be coded.

• An entropy coding process that takes advantage of the symbol probabilities to generate binary representations of the coded symbols.

The H.264/AVC standard defines two interworking layers: the video coding layer, which performs the video compression, and the network abstraction layer, which packs the coded data into "network friendly" units that can be easily passed on to different communications networks. In this overview, we will focus our discussion on the video coding layer. The H.264/AVC video coder is a block-based hybrid video coder that utilizes all of the above-mentioned techniques for video coding and applies them on a block basis. Therefore, every input image to the video coder is first split into a group of macroblocks, each containing 16x16 samples. In addition to the macroblock structure, H.264/AVC also defines smaller block structures that are partitions of the macroblocks and sub-macroblocks to better capture the details in an image. Figure 2.2 shows the possible partitions supported in H.264, extracted from [11].

[Figure 2.2: Macroblock and sub-macroblock partitions in H.264/AVC.]

The two major components of H.264/AVC that contribute to its superior coding efficiency are the prediction process and the entropy coding process. We will start by giving a brief functional overview of the encoder and then focus on the above-mentioned components. There are three types of coded pictures supported in H.264, namely, I pictures, P pictures, and B pictures. An I picture (or frame) is coded using Intra-frame prediction as an independent, self-contained compressed picture. P and B pictures, on the other hand, are Inter-frame predicted: from previous pictures only in P frames, or from both previous and future pictures in B frames. The prediction process itself will be explained in detail in the following section. Given a sequence of pictures, the H.264 encoder codes the first picture as an I-frame. The remaining pictures can be coded as either P-frames or B-frames based on the delay restrictions imposed by the video application (B-frames require more delay but provide better coding efficiency). Figure 2.3 shows a sequence of coded pictures. The arrows indicate which pictures act as references for other pictures.

[Figure 2.3: Example of a coded video sequence.]

Notice in Figure 2.3 that P-frames can only use other previously coded P- or I-frames as references. B-frames are never used as references for motion prediction. Therefore, an input picture first enters the H.264/AVC encoder, where an Intra- or Inter-frame prediction decision is made to indicate the type of picture to be used. If Intra-frame prediction is to be implemented, then every macroblock in the picture is Intra-coded. If Inter-frame prediction is used, then a motion estimation process is initiated at the macroblock level to find matching macroblocks in the reference frames.
After the motion estimation process is completed, a full frame containing the prediction information (motion vectors and reference indices) is produced, which is used to generate a prediction image composed of the motion-compensated blocks, as will be discussed in the next section. The original input picture is then subtracted from the motion-compensated image to obtain a residual image. An integer cosine transform is then applied to the residual samples in order to exploit the correlation in the residual samples. Next, the transformed coefficients are quantized, and finally the quantized samples are entropy coded and passed on to the network abstraction layer to be packetized and transmitted across the communications network [4]. Figure 2.4, extracted from [10], gives an example of the functional blocks of the H.264/AVC encoder we just described.

[Figure 2.4: Hybrid encoder similar to the H.264/AVC encoder.]

Note that in the Inter-prediction mode, a P- or B-frame can contain Intra-coded macroblocks if the Intra coding mode yields a lower cost in a rate-distortion optimization equation.

2.2.1 Motion Estimation and Motion Compensation

Compression is achieved in I-frames using Intra prediction, a process that takes advantage of the spatial correlation between the samples of a picture. Intra prediction is performed on a macroblock or sub-macroblock basis. For every macroblock, the values of the samples located within the macroblock bounds are estimated using the boundary samples of the neighboring left and above macroblocks. The estimation is performed using a set of predefined Intra-prediction modes. A detailed discussion of the Intra prediction modes can be found in [11] and [4]; however, it is out of the scope of this thesis, since our multiple-description coding scheme does not deal with Intra-coded frames or macroblocks.

The Inter prediction process, on the other hand, takes advantage of temporal redundancies found in successive pictures in a video sequence. In Inter-frame prediction, the encoder runs two complementary processes, called motion estimation and motion compensation, on a block basis to find a match in the neighboring frames for every block within an Inter-coded frame. Figure 2.5 illustrates the general functional structure of the Inter prediction process.

[Figure 2.5: Functional structure of the Inter prediction process.]

Note that only the residual frame is passed through the transform, scaling, and quantization processes. The prediction frame is only entropy coded and multiplexed with the coded residual frame on a macroblock level before being transmitted. Figure 2.6 demonstrates the motion estimation process, where one block in the current picture, labeled frame N, is estimated by a block of the same size in frame N-2. The output of the motion estimation process is a set of motion vectors (MV) and reference indices (r) for every estimated block in the Inter-predicted frame.
Figure 2.6 shows that the current block is allocated a motion vector MV, shown in red, that represents the spatial shift between the matching block and the co-located block in the reference frame.

[Figure 2.6: Motion vector and reference index assignment in the motion estimation process.]

As a result, the motion estimation process generates a complete frame containing only motion information, such as macroblock modes, motion vectors, and reference indices. Let us refer to this frame as the "prediction frame". The choice of the best matching block is made based on a rate-distortion tradeoff. Lagrangian optimization is used to minimize both the bit-rate required to encode the block and the distortion resulting from the chosen estimate. The Lagrangian cost function to be minimized is therefore expressed as:

$$ J(\mathbf{m}, r \mid QP, \lambda) = D(\mathbf{m}, r \mid QP, \lambda) + \lambda R(\mathbf{m}, r) $$

where D is the distortion function, R is the rate for encoding the motion information, and \lambda is the Lagrangian parameter, specified in [12] to be equal to:

$$ \lambda = \begin{cases} \sqrt{0.85 \times 2^{(QP-12)/3}}, & \text{if the sum of absolute differences (SAD) distortion is used} \\ 0.85 \times 2^{(QP-12)/3}, & \text{if the sum of squared differences (SSD) distortion is used} \end{cases} $$

The distortion term is also expressed as

$$ D_{err}(P, r_{est}, \mathbf{m}_{est}) = \sum_{(i,j) \in P} \left| l_{orig}[i, j] - l_{r_{est}}[i + m_{est,x},\ j + m_{est,y}] \right|^{n} $$

where P is the macroblock partition, r the reference index, m the motion vector, l[] the sample value, and n is equal to 1 if err = SAD or 2 if err = SSD.

The prediction frame next enters the motion compensation process, which reconstructs the estimated blocks using the motion vectors and reference indices generated by motion estimation. The output of motion compensation is a blocky image that approximates the original image. The motion compensation process itself is a direct inversion of the motion estimation process, where the reference blocks are grouped together into one image. This motion-compensated image is then subtracted from the original image to produce a residual image, to which the transform, scaling, and quantization processes are applied before pairing the blocks up with their respective motion data and entropy coding the entire coded frame [11], [4], [10], [13].
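To make this tradeoff concrete, the following sketch implements the cost J = D + lambda * R for one candidate block; the function names and the simple exhaustive search are our own illustration, not code from the reference encoder:

    import numpy as np

    def lagrangian_cost(orig, cand, rate_bits, qp, use_sad=True):
        # J = D + lambda * R for one candidate block.
        lam = 0.85 * 2 ** ((qp - 12) / 3)       # lambda for SSD distortion
        if use_sad:
            lam = lam ** 0.5                    # SAD uses the square root
            dist = np.abs(orig - cand).sum()    # D_SAD
        else:
            dist = ((orig - cand) ** 2).sum()   # D_SSD
        return dist + lam * rate_bits

    def best_match(orig, candidates, rates, qp):
        # Pick the candidate block with the smallest rate-distortion cost.
        costs = [lagrangian_cost(orig, c, r, qp) for c, r in zip(candidates, rates)]
        return int(np.argmin(costs))

The encoder's actual search additionally scans motion vectors at sub-pel precision over several reference frames, but the cost comparison follows this same pattern.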
2.3 Scalable Video Coding

The subband extension of H.264/AVC was developed at the Heinrich Hertz Institute (HHI) in Germany. In a document [14] presented at the JVT Munich meeting in March of 2004, the authors produced a subband extension of H.264/AVC using Motion Compensated Temporal Filtering (MCTF) based on the lifting representations of the Haar wavelet and the 5/3 wavelet. The design keeps most of the components of the original H.264/AVC standard, while only a few adjustments have been made to support the MCTF structure. The lifting representations of the Haar and 5/3 wavelets are used as filter banks in uni-directional and bi-directional temporal prediction, respectively. Previous attempts have been made to use Lifting Wavelet Transforms with Frame-Adaptive Motion Compensation in the building of video codecs [15]. However, the advantage of [14] lies in the use of the highly efficient motion model of H.264/AVC along with adaptive switching between the Haar and 5/3 spline wavelets on a block basis.

A group of pictures (GOP) of the original video stream is decomposed into a set of High Pass (or difference) frames after a prediction step and a set of Low Pass (or average) frames after an update step [14]. Figure 2.7, extracted from [5], illustrates the temporal decomposition of a group of 8 pictures.

[Figure 2.7: MCTF temporal decomposition process.]

A High Pass (HP) frame is produced after the prediction step and a Low Pass (LP) frame is produced after the update step. Each stage has half the temporal resolution of the original stream. A high pass frame is equivalent to a coded B-frame in H.264/AVC: it contains both residual data and prediction (motion) data. Each following stage uses the LP frames from the previous stage to produce its respective HP and LP frame sets, which, in turn, have half the temporal resolution of the previous stage. The following equations describe the extended motion-compensated temporal filtering prediction and update operators implemented in SVC:

$$ \begin{aligned} P_{Haar}(s[x, 2k+1]) &= s[x + \mathbf{m}_{P_0}, 2k - 2r_{P_0}] \\ U_{Haar}(s[x, 2k]) &= \tfrac{1}{2}\, h[x + \mathbf{m}_{U_0}, k + r_{U_0}] \\ P_{5/3}(s[x, 2k+1]) &= \tfrac{1}{2}\left( s[x + \mathbf{m}_{P_0}, 2k - 2r_{P_0}] + s[x + \mathbf{m}_{P_1}, 2k + 2 + 2r_{P_1}] \right) \\ U_{5/3}(s[x, 2k]) &= \tfrac{1}{4}\left( h[x + \mathbf{m}_{U_0}, k + r_{U_0}] + h[x + \mathbf{m}_{U_1}, k - 1 - r_{U_1}] \right) \end{aligned} \qquad (2.1) $$

where s[] and h[] indicate the y, u, and v samples in the original and high pass frames, respectively, x refers to the luma spatial coordinates of the samples, k is the picture temporal index, and m and r are the prediction information.
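As an illustration of how the prediction and update operators in equation (2.1) decompose a GOP, the sketch below applies Haar and 5/3 lifting to a list of frames with all motion vectors and reference indices set to zero (i.e., no motion compensation) and mirrored GOP edges; it is a simplified model of the decomposition, not code from the SVC software:

    import numpy as np

    def mctf_stage(frames, use_53=True):
        # One temporal decomposition stage with zero motion (m = r = 0).
        even, odd = frames[0::2], frames[1::2]      # s[2k], s[2k+1]
        high, low = [], []
        for k in range(len(odd)):
            if use_53:
                nxt = even[k + 1] if k + 1 < len(even) else even[k]  # mirror at GOP edge
                high.append(odd[k] - 0.5 * (even[k] + nxt))          # P_5/3
            else:
                high.append(odd[k] - even[k])                        # P_Haar
        for k in range(len(even)):
            h_cur = high[min(k, len(high) - 1)]
            if use_53:
                h_prev = high[k - 1] if k > 0 else h_cur             # mirror at GOP edge
                low.append(even[k] + 0.25 * (h_cur + h_prev))        # U_5/3
            else:
                low.append(even[k] + 0.5 * h_cur)                    # U_Haar
        return low, high

    # Decompose a GOP of 8 frames: 3 stages leave 1 LP frame and 7 HP frames.
    gop = [np.full((4, 4), float(i)) for i in range(8)]
    lp, hp = gop, []
    while len(lp) > 1:
        lp, stage_hp = mctf_stage(lp)
        hp.extend(stage_hp)
    print(len(lp), len(hp))   # -> 1 7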
It is also shown in [14] that, using this model, it is possible to produce a fully scalable video codec offering temporal, spatial, and SNR scalability. We are interested in the case of combined scalability, such that, as we move from one layer to the subsequent one, the bitstream can lose spatial resolution, temporal resolution, and SNR quality by down-sampling the decomposed sequence spatially and/or temporally. The extended coding structure of an SVC encoder, as specified in [5], is shown in Figure 2.8.

[Figure 2.8: Basic structure of a 5 layer combined scalability encoder.]

Spatial down-sampling between layers is used to reduce the spatial resolution. Temporal scalability is controlled by limiting the number of LP and HP frames transmitted within a specific layer. The number of HP frames transmitted can also control the SNR scalability. When all layers are available, the decoder can reconstruct the original video stream in full quality and resolution [16]. This combined scalability helps the bitstream adapt to the conditions of the channel, namely fluctuations in bandwidth and network congestion. However, the bitstream is still vulnerable to bit errors, which may lead to the dropping of packets. If the packets contain HP information, then the decoder can be modified to recover from the packet loss by inserting a zero-residual frame instead of the lost frame. If, on the other hand, an LP packet is dropped, then the consequence is more severe, since that can greatly affect the PSNR and quality of the decoded stream.

2.4 Error Resilience in Video Coding

Error correction schemes such as the commonly used forward error correction (FEC) can be implemented to combat the problem of bit errors in transmission networks. However, protection of the entire bitstream requires a large overhead, which can be very costly. Alternatively, [17] suggests the use of unequal error or erasure protection (UEP/UXP) schemes to protect only the important frames or layers in a video bitstream. Other error-resilience techniques have targeted the design of the video bitstream itself to generate an inherently error-resilient coded video representation through multiple description coding. Multiple description coding schemes generate several versions of a video sequence which are transmitted either on the same channel or on different channels and paths in order to reduce the probability of transmission errors.

2.4.1 Unequal Erasure Protection in SVC

Unequal Erasure Protection (UXP) is used to provide content-aware protection to a video bitstream. [17] develops a UXP scheme suited for protecting the base layer of an SVC bitstream. This scheme cannot be applied to the enhancement layers of SVC. We will give a brief overview of the erasure protection scheme implemented in [17]. UXP defines several protection classes for the different layers in a scalable video bitstream. Each of these classes is characterized by a different symbol erasure protection level. Figure 2.9 demonstrates the UXP procedure as it is illustrated in [17].

[Figure 2.9: UXP transmission sub-block (TSB) packetization.]

A transmission sub-block (TSB) contains a number of protected NAL units separated by Multi-Time Aggregation Packet (MTAP) headers. The MTAP headers specify the NAL unit size, the RTP timestamp, and a Decoding Order Number (DON). The DON is used to fix any higher-level interleaving problems that might arise due to a reordering of the decoding order of NALs. For further detail on the functioning of MTAP, please refer to [18]. The TSBs are then grouped into one Transmission Block (TB) which, in turn, is encapsulated into several RTP packets. Simulations performed in [17] show that this UXP can successfully protect the base layer information of an SVC bitstream at more than 30% network packet loss rate. The UXP scheme described above is not designed to protect higher-layer packets; therefore, another error resilience scheme is required to protect those packets from network errors.
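To illustrate how unequal protection spends overhead where it matters most, the sketch below assigns a different (255, 255-x) RS protection level to each class and computes the resulting stream expansion; the class names, protection levels, and size shares are hypothetical and are not values taken from [17]:

    def uxp_expansion(classes, n=255):
        # Overall expansion of a stream whose protection classes each apply
        # an (n, n-x) RS code to their share of the bytes.
        return sum(frac * n / (n - x) for frac, x in classes)

    # Hypothetical classes: (share of stream bytes, protected symbols x).
    classes = [(0.3, 50),   # base layer: strongest protection (+24% on its share)
               (0.5, 16),   # LP enhancement frames: moderate protection
               (0.2, 0)]    # HP enhancement frames: left unprotected by UXP
    print(f"x{uxp_expansion(classes):.3f}")   # -> x1.107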
2.4.2 Multiple Description - Motion Compensated Temporal Filtering

Recent studies in Internet error resilience technology have shown that Multiple Description Coding (MDC), along with path or server diversity, can reduce the effects of packet delay and loss [19], [20]. Although this technique can be very effective in combating bit errors, it also suffers from data redundancy and thus requires additional bandwidth. It therefore falls upon the design of the multiple descriptions to solve the problem of data redundancy. Multiple Description Scalable Coding (MDSC) merges scalable coding with MDC; scalable coding facilitates the adaptability of the different video descriptions to variations in channel bandwidth [19]. One approach to MDSC is Multiple Description Motion Compensated Temporal Filtering (MD-MCTF), developed in [19]. This approach applies MDC to a wavelet-based MCTF codec, similar to SVC, by dividing the HP frames between the different video descriptions and keeping the lowest layer LP frames as redundant data inside the streams. The motion information in each HP frame is also replicated in both streams. Figure 2.10 illustrates the frame coding order of the MD-MCTF coding scheme.

[Figure 2.10: Coding structure of the MD-MCTF. The original coded sequence (L, H1, ..., H7) is split into a first and a second description.]

In each of the two descriptions, the lightly shaded HP frames contain only motion information; all residual components are set to zero. The problem with this approach is that most of the redundant bit allocation is dedicated to duplicating the motion information. Texture information, on the other hand, is sacrificed to reduce the redundancy. However, this sacrifice in texture data results in increased quality degradation, which is very undesirable.

Chapter 3

Multiple Description Coding of the Scalable Extension of H.264/AVC (SVC)

Scalable Video Coding (SVC) clearly solves the network heterogeneity problem faced in the current MBMS architecture. The prevalent problem with this coding technique is its lack of error resilience schemes that can ensure the safe delivery of the coded video data to its destination. We have seen that there are currently several error resilience techniques being employed to protect the video data; however, all of these techniques suffer either from limited protection of the entire video data or from extensive redundancy in the coded video representations. For this reason, we propose a new Multiple Description Coding scheme specifically designed for protecting the high pass frames of an SVC compatible bitstream. We shall call our scheme Multiple Description Scalable Video Coding (MD-SVC). MD-SVC takes advantage of the layered structure of SVC to generate two descriptions of each enhancement layer of an SVC coded video, providing multiple coded versions of the video layers with minimum redundancy. The SVC structure is modified to allow the encoder to create two complementary descriptions of HP frames, such that each description is ensured to be independently decodable. The challenge thus lies in creating a description of every SVC layer that will produce an acceptable video quality when decoded separately, while limiting the redundancy induced by MDC. The target of our coding scheme is to be able to support networks and users with heterogeneous capabilities while providing an error-resilient video representation with the minimum possible redundancy. Figure 3.1 shows one example of the applications that can be serviced using our MD-MCTF coding scheme over 3G MBMS networks.

[Figure 3.1: Coding structure of the MD-MCTF: a server delivers the base layer over channel 1 and descriptions 1 and 2 over channels 2 and 3 to cell phone, PDA, and handheld computer clients.]

The proposed framework defines two logical layers, a base layer and a multiple-description coded enhancement layer. These logical layers are composed of K SVC layers of the coded video stream, where K is an integer. The SVC layers will be referred to as SVCx, where x stands for 0, 1, 2, ..., K-1. The base layer contains the SVC0 layer frames along with all the LP frames belonging to the SVC1, SVC2, ..., SVC(K-1) layers. The enhancement layer contains two descriptions of the HP frames belonging to the SVC1, SVC2, ..., SVC(K-1) layers. All SVC layers considered are temporally scalable. In addition to temporal scalability, if an SVC layer is also a spatially scalable layer, the HP frames include both prediction data and residual (texture) data. If, on the other hand, an SVC layer is an SNR scalable layer, then the HP frames contain only residual (texture) data. Hence, creating multiple descriptions of HP frames entails generating multiple copies of both prediction data and texture data in a way that limits the redundancy between the descriptions while ensuring independent decodability of each description.

Therefore, the SVC encoder is modified in order to incorporate all of the above-mentioned features. As discussed in Section 2.3, the SVC encoder supports three types of scalability: spatial, temporal, and SNR (quality) scalability. The core of the coding engine is the motion-compensated temporal filtering (MCTF) coder that utilizes lifting representations of the Haar and 5/3 wavelet filters to provide temporal scalability in a video bitstream. Consequently, our multiple-description coding framework is integrated into the existing MCTF coder by inserting new modules and modifying existing ones. Figure 3.2 shows a modified single layer MCTF encoder with the additional and modified modules marked in gray.

[Figure 3.2: Single layer MCTF encoder with additional and modified modules for multiple-description coding.]

The MCTF loop shown in Figure 3.2 takes as input a group of pictures (GOP) and performs log2(GOPSize) stages of motion-compensated prediction. In every stage,
Therefore, motion and HP texture frames are passed as input to the Multiple Description Coding module, discussed in detail in the next section, which produces two copies of each input, labeled D l and D2 to indicate description one and description two respectively. Moreover, the Multiplex module is also modified to manage the order of frames in the output bitstream as will be shown in a later section.  3.1  M u l t i p l e Description C o d i n g of H i g h Pass ( H P ) Frames  High Pass (HP) frames constitute most of the transmitted frames in an SVC bitstream. For every GOP, only one frame is a low pass (LP) frame and GOP Size — 1 frames are HP frames. Moreover, depending on the level of motion found in a video sequence, HP frames can occupy around 50% to 70% of the bitrate allocated for a scalable SVC layer. Therefore, in error prone environments, multiple-description coding of HP frames is a necessary and attractive error resilience tool since it minimizes the coding overhead as opposed to replicating these frames. Every HP frame contains both motion and texture information so it is desirable to create multiple descriptions of both. Available multiple description techniques have only dealt with texture information, meanwhile motion information was replicated in each description, thus increasing the redundancy between  Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/'AVC (SVC) 29 the descriptions. The multiple-description framework we developed generates two descriptions of both motion data and texture data. A HP frame is first separated into two frames, motion frame and texture frame. Next, each frame is handled separately and follows a different coding path. Figure 3.3 shows the two description coding structure of the M D C module presented in Figure 3.2. Texture  Motion  Multiple Description C o d i n g  Frame number mod ( 2 ) Quincunx Lattice Frame Separator 1  i  D1 motion v  Jodd  D2 motion  I  Best Neighbor Estimator  Best Neighbor  Zero residual MBs Non-zero INTRA MBs  Estimator  -D1 motion D2 H P texture  Figure 3.3: Basic structure of Multiple Description Coding module with a two description coder of High Pass frames.  Notice that the two-description coding of motion data is performed on the intraframe level (or individual frame basis), whereas, the coding of texture data is performed  Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/AVC  (SVC) 30  on the inter-frame level (or group-of-frame basis). In other words, every motion frame is separated into two descriptions, whereas, the texture frames in a group of pictures (GOP) are split based on a frame number. In the following sections, we will offer a detailed illustration of the two-description coding procedure for high pass motion and texture data.  3.1.1 Multiple Description Coding of Prediction (Motion) Data One approach to multiple-description coding of motion information of temporally scalable video streams can be found in [19]. This approach includes a copy of all the motion information in each description, which increases the video redundancy between the multiple descriptions. In order to minimize this redundancy, we have developed motion coding method that reduces the number of coded motion vectors (MVs) to half and predicts the missing MVs using information from the existing neighboring motion vectors. As a first step, the motion MBs are separated between the two descriptions using the Quincunx Lattice structure adopted in [21]. 
Figure 3.4 shows a 5x7 MB portion of the motion data of a HP frame. The D1 and D2 labeled MBs are placed in the first and second descriptions of a HP frame, respectively.

[Figure 3.4: Quincunx Lattice structure of prediction data MBs.]

However, this produces HP frames with only half the motion information. For the remaining motion data MBs that are not assigned to a particular description, new syntax elements are devised to help predict the missing motion information using the neighboring correctly transmitted MBs specific to the description. Therefore, a typical MDC HP frame in the first description, for example, would contain the motion data shown in Figure 3.5. All D2 MBs of the original HP frame are now labeled multiple-description coded (MDC) macroblocks.

[Figure 3.5: Typical motion data macroblock map of a HP frame in description 1.]

An important condition imposed by multiple description coding is the independent decodability of each description. Since only every other motion MB from the original frame is available per MDC frame, slight modifications to the motion vector coding process specified in [11] are required. In order to reduce the size of the coded motion information, [11] specifies that for every coded block, the associated motion vector (MV) is subtracted from a predicted MV, and only this difference is coded and transmitted in the bitstream. The motion vector prediction process described in [11] can be summarized as follows:

• Check the availability of the left, above, and either the above-right or above-left blocks.

• The prediction MV is equal to the motion vector of one of the above-mentioned blocks or to the median of three of the motion vectors, based on block availability and the current MB mode.

In the case of MDC frames, however, independent decodability of each description is a priority. This imposes a constraint on the selection of the neighboring blocks used for the motion vector prediction process, since the left and above MBs are expected to be missing. As a result, MDC skips one MB to the left and uses both the above-left and above-right MBs as the neighboring blocks for motion vector prediction. Figure 3.6 demonstrates the change in the motion vector prediction process discussed above.

[Figure 3.6: Change in motion vector prediction. (a) Standard defined neighbor MBs. (b) MDC defined neighbor MBs.]

The MDC labeled macroblocks are not left as empty sets. These MBs contain necessary motion information, calculated by the encoder, that enables the decoder to recover the closest possible match to the missing motion vectors using the neighboring correctly coded MBs. The encoder first copies one set of motion macroblocks of a HP frame into the respective description. For instance, an MDC HP frame belonging to description (or stream) one will first contain all D1 labeled MBs, as shown in Figure 3.5.
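To make the modified neighborhood of Figure 3.6 explicit, the helper below lists the macroblock coordinates consulted by the standard predictor of [11] and by our MDC variant; the coordinate convention is illustrative, and picture-boundary checks are omitted:

    def mv_prediction_neighbors(mb_x, mb_y, mdc=False):
        if not mdc:
            # H.264/AVC: left, above, and above-right neighbors.
            return [(mb_x - 1, mb_y), (mb_x, mb_y - 1), (mb_x + 1, mb_y - 1)]
        # MDC: the left and above MBs belong to the other description, so
        # skip one MB to the left and use the above-left and above-right MBs,
        # all of which lie on the same quincunx lattice as the current MB.
        return [(mb_x - 2, mb_y), (mb_x - 1, mb_y - 1), (mb_x + 1, mb_y - 1)]

Note that all three MDC neighbors have the same coordinate parity as the current MB, so they are guaranteed to be present in the same description.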
Table 3.1: MDC macroblock classification.

Syntax Element   Multiple-Description Codable                   non-Multiple-Description Codable
MbMode           MODE_16x16, MODE_16x8, MODE_8x16, MODE_8x8     MODE_Skip, INTRA_4x4, MODE_PCM, INTRA_BL
BLSkipFlag       0                                              1
BLQRefFlag       0                                              1

A new MDC flag is added to the macroblock syntax elements to distinguish between MDC and non-MDC coded blocks. The MDC flag is reset for all D1 labeled MBs and set for all MDC labeled MBs. The MDC labeled MBs are separated into two classes based on the coded MB mode and base layer dependence. Since most motion bits are allocated to representing the motion vector field, the goal is to reduce the number of coded motion vectors within a motion frame. Hence, of the MDC labeled MBs, only those that contain motion vector information retain an MDC flag set to one. The remaining MBs are copied directly from the original motion frame and coded in the bitstream with the MDC flag reset to zero. Table 3.1 classifies macroblocks into multiple-description codable and non-multiple-description codable blocks.

The next step is to derive the motion recovery data for all remaining macroblocks with an MDC flag of one. This is done by taking advantage of the redundancy (or correlation) between spatially neighboring macroblocks with a zero MDC flag to estimate the current macroblock's motion vectors. Two possible techniques are motion vector median or averaging, and motion-compensated best neighbor-block matching. In [22] and [23], a non-normative error concealment algorithm for Inter-coded frames is presented, based on motion-compensated motion-vector recovery with boundary pixel matching. It was shown in experiments run in [22] that the median or average motion vector approach did not give better results when compared to the motion-compensated prediction approach. Therefore, we decided to follow through with a modified version of the motion-compensated prediction approach, as discussed in the following paragraphs.

In order to reduce the number of computations performed by the decoder, the motion-vector recovery is performed at the encoder, and an index of the best matching neighboring block is coded in the bitstream and passed to the decoder. First, the MDC macroblock is divided into four 8x8 blocks {X0, X1, X2, X3} and the motion recovery algorithm is performed for each of these blocks. The neighboring eight 8x8 blocks are then assigned indices as shown in Figure 3.7.

[Figure 3.7: Neighboring 8x8 block indices.]

Since motion recovery is performed at the encoder, the original motion-compensated block is available in full, and therefore there is no need to run the matching solely on the boundary pixels. Instead, the entire motion-compensated block is matched with every candidate block, which improves the accuracy of the estimate. The motion-vector recovery algorithm, which will be referred to as the Best-Neighbor Matching algorithm, can be summarized as follows:

• For every block partition P which belongs to the set of block partitions {X0, X1, X2, X3}, find the neighboring 8x8 block that minimizes the sum of squared error distortion between the original motion-compensated block and the candidate motion-compensated block.
The distortion measure can be expressed as follows:

$$D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) = \sum_{i,j \in P} \left( l_{r_{orig}}[i + m_{orig,x},\, j + m_{orig,y}] - l_{r_{est}}[i + m_{est,x},\, j + m_{est,y}] \right)^2 \qquad (3.1)$$

where $l_r[]$ represents the motion-compensated luma samples, $m$ the motion vectors, $r$ the reference frames, and $P$ the sub-macroblock partition. The best matching neighbor block index is then found by minimizing equation 3.1, as shown in equation 3.2 below:

$$BN_P = \arg\min_{BN \in S} D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) \qquad (3.2)$$

where $S = \{0, 1, 2, 3, 4, 5, 6, 7\}$ is the set of neighboring 8x8 block indices.

As a result of the best-neighbor matching algorithm, every MDC macroblock is allocated four indices that need to be coded in the bitstream. This might end up requiring more bits than coding the original motion-vectors, especially if, for instance, index seven is chosen for all partitions P. In order to cope with such cases, an appropriate binarization and entropy-coding scheme must be developed on the one hand; this will be discussed in the following section. On the other hand, the macroblock mode of an MDC MB is also coded in the bitstream in exchange for removing some of the best-neighbor indices from the bitstream. For example, if the macroblock mode is MODE_16x16, then the original motion is homogeneous over all blocks. Therefore, it is sufficient to code the best-neighbor index of only one 8x8 block partition to estimate the missing motion in the MDC case. The chosen best-neighbor index BN_P is that of the 8x8 block P with the minimum distortion D_SSE located within the macroblock partition. The remaining 8x8 blocks in the macroblock partition are set equal to the chosen index. Table 3.2 shows the prime best-neighbor indices that are coded based on the value of MB mode.

Table 3.2: MB mode and the associated coded BN indices.

Mb Mode       Prime Best-Neighbor Index
MODE_16x16    BN_X0
MODE_16x8     BN_X0, BN_X2
MODE_8x16     BN_X0, BN_X1
MODE_8x8      BN_X0, BN_X1, BN_X2, BN_X3

The decoder first reads the MB mode and then decides which prime best-neighbor indices are to be read from the bitstream, as shown in Table 3.2. The remaining best-neighbor indices of the designated macroblock partition are set equal to the prime index. Finally, for every best-neighbor index BN_P, the decoder copies the motion-vectors from the neighboring block indexed by BN_P into the macroblock partition P.

The bitrate reduction achieved by multiple-description coding of motion can be seen by comparing the number of syntax elements coded in a regular macroblock as opposed to a multiple-description coded macroblock. Table 3.3 lists the syntax elements encoded in both MDC and non-MDC macroblocks.

Table 3.3: Syntax element classification in MDC.

MDC              non-MDC
BLSkipFlag       BLSkipFlag
MbMode           MbMode
BLQRefFlag       BLQRefFlag
MDCFlag          MDCFlag
BestNeighbors    BlockModes
TerminatingBit   MotionPredictionFlag
                 ReferenceFrames
                 MotionVectors
                 TerminatingBit

Moreover, in the case of bidirectional prediction, two sets of the motion prediction flags, reference frame indices, and motion-vectors are coded in the bitstream. The best-neighbor indices are only coded once, since the afore-mentioned fields will be copied in full from the indexed neighbor block.
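The following sketch illustrates the encoder-side Best-Neighbor Matching search of equations 3.1 and 3.2. It is a simplified stand-in, assuming the motion-compensated 8x8 luma blocks have already been extracted into arrays; the real encoder evaluates the candidates directly from the reference frame buffers using each neighbor's m and r fields.

    import numpy as np

    # Sketch of the Best-Neighbor Matching search (equations 3.1 and 3.2).
    # orig_mc is the original motion-compensated 8x8 luma block of
    # partition P; cand_blocks maps a neighbor index (0..7, Figure 3.7)
    # to the block motion-compensated with that neighbor's motion vectors.

    def d_sse(orig_mc, cand_mc):
        """Sum of square error distortion D_SSE over one 8x8 partition."""
        diff = orig_mc.astype(np.int64) - cand_mc.astype(np.int64)
        return int(np.sum(diff * diff))

    def best_neighbor(orig_mc, cand_blocks):
        """BN_P: the available neighbor index minimizing D_SSE (eq. 3.2)."""
        return min(cand_blocks, key=lambda bn: d_sse(orig_mc, cand_blocks[bn]))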
3.1.2 Entropy Coding of Error Recovery Data

The addition of new syntax elements to the SVC bitstream requires additional entropy coding contexts that will ensure optimal coding into the bitstream. H.264/AVC supports two entropy coding techniques to efficiently encode all syntax elements into the bitstream: Context-Adaptive Variable Length Coding (CAVLC) and Context-Adaptive Binary Arithmetic Coding (CABAC). Since SVC is the scalable extension project of H.264/AVC, our method has also adopted the same entropy coding techniques implemented in SVC. However, SVC defines additional syntax elements that did not exist in H.264/AVC; hence, [24] specifies additional context indices, context initialization tables, binarization schemes, and so on. The full list of modifications can be found in section S.9.3 of [24]. It is worth mentioning that CAVLC is not affected by the support of new syntax elements; only CABAC requires the modifications and additions referred to in section S.9.3.

On a similar note, MDC of SVC defines two new syntax elements embedded in the motion data stream: the MDC flag and the BestNeighbor indices. In this section, we will describe the binarization and context modeling process developed to support entropy coding of these additional MDC syntax elements.

To begin with, we will give a brief summary of the CABAC framework for encoding syntax elements as described in [25]. A syntax element undergoes three steps of encoding, namely, binarization, context modeling, and binary arithmetic coding. In the binarization stage, a non-binary syntax element is mapped into a unique binary sequence referred to as a binstring [25]. Next, in the regular coding mode, every element in the binstring passes through a context modeling stage where previously coded elements can be used to select a probability model for the intended element. Finally, the element is arithmetic coded using the probability model selected in the context modeling stage. Another coding mode is the bypass mode, where a binary valued symbol is passed directly to the arithmetic coder without the step of context modeling. In the MD-SVC case, only the regular coding mode is used to encode the new syntax elements.

We first encode 400 frames of four representative test sequences and monitor the BestNeighbor symbol statistics, from which we derive the symbol probabilities. Since the MDC flag can only take binary values, the calculated symbol statistics directly reflect the binary representation of the syntax element. The current version of the SVC software does not support multiple slices per frame, and therefore the symbol statistics are based on the occurrence of each binary value in every coded frame.

Given the procedure above, the MDC flag syntax element appeared to have a simple uniform distribution. Therefore, the context model designed in [24] for representing the BLSkip flag, which has the same probability distribution, was adopted to model the MDC flag context as well.

The case of entropy coding the BestNeighbor syntax element is more complicated. The first difficulty lies in finding a suitable binarization scheme that will minimize the output bit-rate.
Since our entropy coding engine is a binary arithmetic coder, the task is to represent every symbol by a representative set of binary values, or binstring, that best reflects the actual symbol probabilities. As mentioned in the previous section, the BestNeighbor syntax element holds the values of the indices pointing to the neighboring 8x8 blocks of a multiple-description coded macroblock.

Let us revert to the previously defined arithmetic terms in order to define a mathematical model for the context modeling task at hand. The BestNeighbor syntax element is defined as the index $BN_P \in S$, $S = \{0, 1, 2, 3, 4, 5, 6, 7\}$, where S is the set of neighboring 8x8 block indices and P is the sub-macroblock partition index. Therefore, BestNeighbor can hold one of eight possible non-binary values, each of which demands a separate symbol probability model. S defines a finite alphabet of values for the BestNeighbor syntax element, leading to the application of Fixed Length (FL) binarization as recommended by [25]. Moreover, by monitoring the symbol statistics of the BestNeighbor index values over the same data set of 400 coded frames, it was shown that the BestNeighbor syntax element has a nearly uniform distribution, which further supports the choice of FL binarization. Let $Pr_{BN}$ be the probability of occurrence of a non-binary value $BN \in S$. The first step is to calculate $Pr_{BN}$ using the available coded frame data set. Let $N_f$ be the total number of coded frames and $C_{BN,k}$ be the number of times the index $BN$ occurred in frame $k$. The probability $Pr_{BN}$ is expressed as

$$Pr_{BN} = \frac{\sum_{k=1}^{N_f} C_{BN,k}}{N_f}, \quad \forall BN \in S \qquad (3.3)$$

The next step is to derive the bin probabilities in the binstring using the non-binary symbol probabilities. These bin probabilities will be used to define the contexts, or probability models, of the internal bins of the binstring that determine the coded bitstream and its efficiency, as stated in [25]. The binarization stage ensures the reduction of the alphabet size of the syntax elements to be encoded. This stage does not lead to the loss of any information on the high-level symbol probabilities; these symbol probabilities can be completely recovered using the probabilities of the internal bins [25].

The processes of binarization and context modeling are employed to fit the available non-binary data to a universal finite memory source, a concept developed in [26] and applied in the design of CABAC in H.264 [27]. [26] also defines a "tree source" as a "machine" that can implement a Markov source using a "simple tree architecture". The concept of "tree source" is important in this discussion because it will lead to the definition and design of the BestNeighbor bin contexts. [26] defines $P(x^n)$ and $P_T(x^n)$ as the probabilities assigned to a string $x^n$ by the universal source and by a data-generating finite-memory source with $K$ states.

In the BestNeighbor binarization application, the universal source is finite and the tree is binary. Since we are using FL binarization, the length of the tree will be $l_{BN} = \lceil \log_2 8 \rceil = 3$. Figure 3.8 shows the binary tree used for the binarization of the BestNeighbor syntax element. Notice that the tree nodes are labeled C0, C1, ..., C6, which stand for the context models of the internal bins.
Figure 3.8: Fixed Length binarization for the BestNeighbor syntax element.

The calculated probability set $Pr_{BN}$ provides the leaf probabilities of the binary tree, i.e. $Pr_{BN} = Pr(x^n) = \prod_{t=0}^{n-1} P(x_{t+1} \mid s(x^t))$ [26]. These conditional probabilities define the bin contexts, which we would like to calculate. Since each leaf of the binary tree represents one state of a Markov chain, the conditional probabilities $P(x_{t+1} \mid s(x^t))$ of the internal nodes define the state transition probabilities of the Markov chain and can be expressed as

$$P(s(x^{t+1})) = P(s(x^t)) \, P(x_{t+1} \mid s(x^t)) \qquad (3.4)$$

where $s(x^n)$ is defined as $s(x^n) = x_n \ldots x_{n-k+1}$ for some $k > 0$. In the BestNeighbor case, the maximum length of the FL binstring is 3 and therefore $n \le 2$.

In Figure 3.8, we assume that 0 is the most probable symbol and 1 the least probable symbol of the arithmetic coder, as defined by [11]. Therefore, using the 400 coded frame dataset, we calculate the symbol probabilities of the BestNeighbor symbol values and assign these values to the leaves of the binary tree accordingly. Table 3.4 shows the BestNeighbor symbol probabilities and the corresponding binstrings.

Table 3.4: BestNeighbor symbol probabilities and the corresponding binstrings.

Non-Binary Symbol    0      1      2      3      4      5      6       7
Probability Pr_BN    0.22   0.34   0.23   0.17   0.011  0.019  0.0065  0.0039
Binstring            000    001    010    011    100    101    110     111

To calculate the internal node probabilities, take for example node C3 and apply equation 3.4. We want to calculate $P(0 \mid 01)$ and $P(1 \mid 01) = 1 - P(0 \mid 01)$:

$$P(0 \mid 01) = \frac{P(010)}{P(010) + P(011)} = \frac{Pr_{BN=2}}{Pr_{BN=2} + Pr_{BN=3}}$$

Similarly, the conditional probabilities of node C1 are $P(0 \mid 0)$ and $P(1 \mid 0) = 1 - P(0 \mid 0)$, where

$$P(0 \mid 0) = \frac{P(000) + P(001)}{P(000) + P(001) + P(010) + P(011)}$$

The conditional probabilities of all remaining internal nodes are calculated similarly to those of contexts C3 and C1. These conditional probabilities define the context models of every node in the binstring and will therefore be used to initialize the context models at the beginning of encoding or decoding of every frame. The CABAC entropy coder will then update these context probabilities based on the encoded BestNeighbor syntax element statistics as more and more MDC macroblocks are encoded into the bitstream. The initial conditional probabilities of all BestNeighbor context models are shown in Table 3.5.

Table 3.5: BestNeighbor context models.

Context   MPS    LPS      R_LPS   state   m   n
C0        0.96   0.0403   20      44      0   19
C1        0.58   0.42     215     2       0   61
C2        0.74   0.26     201     1       0   62
C3        0.61   0.39     213     1       0   62
C4        0.58   0.42     131     10      0   53
C5        0.63   0.37     189     3       0   60
C6        0.62   0.38     192     1       0   62

[11] specifies three steps for initializing context models, in which a context model is initialized by assigning to it a state number and a meaning for the most probable symbol (MPS). The first step is linearly dependent on the frame quantization parameter (QP) and involves calculating a prestate = ((m × QP) >> 4) + n [11]. The next step limits the prestate to the range [1, 126]. The final step maps the prestate to a {state, MPS} pair, such that if prestate ≤ 63, then state = 63 − prestate and MPS = 0; otherwise, state = prestate − 64 and MPS = 1 [11].
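The derivation of the bin contexts and their initialization can be reproduced with the small sketch below. The leaf probabilities follow our reconstruction of Table 3.4, the R_LPS scaling (512 × p_LPS, i.e. a maximum of 256 at p_LPS = 0.5) anticipates the mapping described in the next paragraph, and the helper names are ours.

    # Sketch: derive the bin-context (internal node) probabilities of the
    # FL binarization tree from the leaf probabilities of Table 3.4, then
    # map (m, n) initialization values to a {state, MPS} pair as in [11].

    PR = {'000': 0.22, '001': 0.34, '010': 0.23, '011': 0.17,
          '100': 0.011, '101': 0.019, '110': 0.0065, '111': 0.0039}

    def p_zero_given(prefix):
        """P(next bin = 0 | prefix), by summing the leaf probabilities."""
        p0 = sum(p for b, p in PR.items() if b.startswith(prefix + '0'))
        p1 = sum(p for b, p in PR.items() if b.startswith(prefix + '1'))
        return p0 / (p0 + p1)

    def init_state(m, n, qp):
        """Context initialization of [11]: prestate = ((m*QP) >> 4) + n."""
        prestate = min(max(((m * qp) >> 4) + n, 1), 126)
        return (63 - prestate, 0) if prestate <= 63 else (prestate - 64, 1)

    for prefix in ['', '0', '1', '00', '01', '10', '11']:   # C0 .. C6
        p0 = p_zero_given(prefix)
        lps = min(p0, 1.0 - p0)
        print(prefix or 'root', round(max(p0, 1.0 - p0), 2), round(512 * lps))

For the root node, for example, the sketch reproduces the MPS probability of 0.96 and an R_LPS of about 21, close to the C0 entry (20) of Table 3.5.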
The arithmetic coder used in H.264/AVC and SVC is a table-based arithmetic coder, where the bin probabilities are estimated using a table of scaled least probable symbol (LPS) ranges ($R_{LPS}$) indexed by the context model state. Therefore, the calculated LPS probabilities shown in Table 3.5 are mapped into the $R_{LPS}$ values found in column 4 of Table 3.5, where the maximum $R_{LPS}$ range is 256 and corresponds to an LPS probability of 0.5. The frame quantization parameter does not affect the value of BestNeighbor, and therefore m is set equal to 0. Moreover, we assume that 0 is the most probable symbol, so the prestate value should not exceed 63. Columns 5, 6 and 7 of Table 3.5 show the initialization parameters of the BestNeighbor syntax element contexts resulting from the above-mentioned initialization procedure.

3.1.3 Multiple Description Coding of Residual Data

In addition to motion information, High Pass frames also contain residual information that can consume at least as much bandwidth as the motion information. Therefore, it is imperative to multiple-description code the residual data as well in order to reduce the redundancy between the two descriptions. We suggest using an inter-frame based multiple-description coding of residual data similar to MD-MCTF, the approach used in [19]. In [19], HP frames are divided into even numbered and odd numbered frames at every temporal level in an MCTF-coded video layer. The even numbered frames are then transmitted in one description while the odd numbered frames are transmitted in the second description. Controlling the number of HP frames duplicated between the two descriptions leads to controlling the video redundancy. Although bitrate efficient, this approach suffers from a reduced decoded video quality, since any Intra coded macroblocks in the HP frames cannot be retrieved if one description is lost.

Our approach to multiple-description coding of residual data is also based on separating HP frames into even numbered and odd numbered frames and coding one group in each description. However, we extend this separation process by inserting any Intra-coded macroblocks found in the high pass frames in both descriptions. The first step is to add an MDC flag to the slice header to allow the decoder to distinguish between fully coded HP frames and modified, or multiple-description coded, HP frames. Note that this flag is only inserted to refer to multiple-description coding of residual data and not motion (prediction) data. Therefore, if description one is allocated the even numbered frames, for example, then all even numbered HP frames are coded entirely without any modification in the first description bitstream and the MDC flag is set to zero. The odd numbered frames are multiple-description coded by setting the MDC flag to one and only inserting the Intra-coded macroblocks into the second-description bitstream. No residual information is coded for the remaining residual macroblocks. Figure 3.9 shows the separation of a set of residual high pass frames into two descriptions. The dark blocks in the high pass frames refer to Intra coded macroblocks.

Figure 3.9: Multiple-description coding of residual frames in a temporal-scalable layer with three temporal levels. Dark blocks indicate Intra-coded macroblocks in the residual frame.
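A minimal sketch of this even/odd separation is given below. The frame representation (a list of macroblocks carrying an 'intra' flag) is a toy stand-in for the real residual data; it only illustrates which information each description carries.

    # Sketch of the inter-frame residual split of Figure 3.9: each HP
    # frame is coded in full in the description that owns it (MDC flag 0);
    # in the other description only its Intra-coded macroblocks are kept
    # (MDC flag 1) and the remaining residual blocks are coded as zero.

    def split_residual(hp_frames):
        desc1, desc2 = [], []
        for k, frame in enumerate(hp_frames):
            full = {'mdc_flag': 0, 'mbs': frame}
            intra_only = {'mdc_flag': 1,
                          'mbs': [mb if mb.get('intra') else None for mb in frame]}
            if k % 2 == 0:                 # description one owns even frames
                desc1.append(full)
                desc2.append(intra_only)
            else:
                desc1.append(intra_only)
                desc2.append(full)
        return desc1, desc2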
Notice that the HP frames coded in the two descriptions are complementary. The only redundancy involved arises from the Intra-coded macroblocks in the residual frames that are duplicated to improve the decoded video quality in case of packet loss.

In order to ensure independent decodability of each description, essential modifications were required. First, the coding of the coded block pattern (CBP) syntax element is modified in MDC frames such that it is independent of neighboring CBP values in the adjacent blocks; all neighboring blocks are assumed to be unavailable when coding the CBP field of a residual block. Moreover, cross-layer dependence of residual prediction is also eliminated for all residual frames by assuming that all base layer residual frames contain zero-valued samples. Finally, the TransformSize8x8 flag is always added to the bitstream to inform the decoder whether or not an 8x8 block transform is used. This is crucial when the motion information is multiple-description coded and the macroblock mode is MODE_8x8. In this case, the decoder has no clue as to whether the residual data was transformed using an 8x8 block transform or a 4x4 block transform, since the BestNeighbor field is limited to 8x8 block precision and does not extend to sub-block precision. Consequently, adding the TransformSize8x8 flag eliminates any ambiguity that might arise from multiple-description coding of HP motion information.

In the following section, we present a theoretical analysis of the multiple-description coding framework we just described. We perform a redundancy rate distortion analysis of the suggested framework and develop a video motion index that reflects the level of distortion produced by this multiple-description coding framework.

3.2 Theoretical Analysis

Recall from section 2.3 that the current scalable SVC framework provides three kinds of scalability: temporal, spatial, and SNR. Spatial scalability is achieved by downsampling the input video sequence and then performing cross-layer prediction between the original (high-resolution) and downsampled (low-resolution) video sequences. This scalability is optimized by utilizing the already coded motion information in the lower layers. SNR scalability is generally accomplished by coding the residual (texture) signals obtained from computing the difference between the original pictures and the reconstructed pictures produced after decoding the base layer. This scalability is extended to include all temporal subband pictures obtained after temporal scalable coding.

Temporal scalability is realized by applying the concept of motion-compensated temporal filtering (MCTF) to a group of original video pictures. MCTF utilizes an adaptive selection of the lifting representations of the Haar and 5/3 spline wavelets on a block basis. It is important to emphasize that the wavelet filters are applied in the temporal domain to blocks of picture samples and NOT in the spatial domain. The result is a set of high-pass and low-pass subband pictures having half the temporal resolution of the original group of pictures. Note that after performing temporal decomposition on a group of pictures (GOP), spatial and SNR scalability can be added to the resulting subband pictures.
In our MDC framework, we produce two descriptions of the HP frames obtained by MCTF and support any spatial and SNR scalability features applied to those HP frames. To begin with, we revert back to equations 2.1 and 2.2, which describe the extended motion-compensated temporal filtering prediction and update operators. These equations are presented below:

$$P_{Haar}(s[x, 2k+1]) = s[x + m_{P_0}, 2k - 2r_{P_0}]$$
$$U_{Haar}(s[x, 2k]) = \tfrac{1}{2}\, h[x + m_{U_0}, k + r_{U_0}]$$

$$P_{5/3}(s[x, 2k+1]) = \tfrac{1}{2}\left(s[x + m_{P_0}, 2k - 2r_{P_0}] + s[x + m_{P_1}, 2k + 2 + 2r_{P_1}]\right)$$
$$U_{5/3}(s[x, 2k]) = \tfrac{1}{4}\left(h[x + m_{U_0}, k + r_{U_0}] + h[x + m_{U_1}, k - 1 - r_{U_1}]\right)$$

where s[] and h[] indicate the y, u, and v samples in the original and high pass frames, respectively, x refers to the spatial coordinates of the samples, k is the picture temporal index, and m and r are the prediction information. Switching between the Haar and 5/3 filters is done at the macroblock level, where the Haar filter is used for unidirectional prediction while the 5/3 filter is used for bidirectional prediction. For the remainder of this discussion, we will consider the 5/3 filter as the general prediction process and analyze accordingly. [14] defines the decomposition process using the following equations:

$$h[x, k] = s[x, 2k+1] - P(s[x, 2k+1]) \qquad (3.5)$$
$$l[x, k] = s[x, 2k] - U(s[x, 2k]) \qquad (3.6)$$

where $P(s[x, 2k+1])$ and $U(s[x, 2k])$ represent the motion-compensated samples of the prediction and update processes, respectively, and $h[x, k]$ and $l[x, k]$ are the high-pass and low-pass decomposition components. Replacing equations 2.1 and 2.2 in equations 3.5 and 3.6, we get

$$h[x, k] = s[x, 2k+1] - \tfrac{1}{2}\left(s[x + m_{P_0}, 2k - 2r_{P_0}] + s[x + m_{P_1}, 2k + 2 + 2r_{P_1}]\right) \qquad (3.7)$$
$$l[x, k] = s[x, 2k] + \tfrac{1}{4}\left(h[x + m_{U_0}, k + r_{U_0}] + h[x + m_{U_1}, k - 1 - r_{U_1}]\right) \qquad (3.8)$$

which upon expansion result in the following FIR filter representations of the low pass and high pass subbands. The motion-compensated prediction variables $m_{P/U}$ and $r_{P/U}$ have been dropped from the two equations for clarity:

$$h[x, k] = -\tfrac{1}{2} s[x, 2k] + s[x, 2k+1] - \tfrac{1}{2} s[x, 2k+2] \qquad (3.9)$$
$$l[x, k] = -\tfrac{1}{8} s[x, 2k-2] + \tfrac{1}{4} s[x, 2k-1] + \tfrac{3}{4} s[x, 2k] + \tfrac{1}{4} s[x, 2k+1] - \tfrac{1}{8} s[x, 2k+2] \qquad (3.10)$$

Hence, the $l[x, k]$ samples (more appropriately called subbands) now clearly correspond to the filtered output of the 5-tap low pass filter [-1/8 1/4 3/4 1/4 -1/8]. The $h[x, k]$ subbands, in turn, correspond to the filtered output of the 3-tap high pass filter [-1/2 1 -1/2]. Equations 3.9 and 3.10 therefore describe the 5/3 spline-wavelet analysis filters. Figures 3.10 and 3.11 illustrate the properties of the two filters.

Figure 3.10: Frequency response properties of the [-1/8 1/4 3/4 1/4 -1/8] low pass filter.

The application of these temporal filters on a group of pictures (GOP) of size N produces N/2 low pass subbands and N/2 high pass subbands. When we implement a dyadic wavelet decomposition by applying multiple stages of the filters described above to the pictures of a GOP of size 8, for instance, the resulting output pictures will have the subband structure shown in Figure 3.12.

Figure 3.11: Frequency response of the [-1/2 1 -1/2] high pass filter.

Figure 3.12: Decomposition of a group of pictures (GOP) of size 8. The shaded frames will be coded and transmitted in the bitstream.
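The analysis stage can be illustrated with the following zero-motion sketch, in which equations 3.7 and 3.8 reduce to pure temporal lifting (all motion vectors and reference offsets are zero, and GOP edges are handled by sample repetition; both are simplifications of the real motion-compensated filters).

    import numpy as np

    # One temporal 5/3 lifting stage (eqs. 3.7 and 3.8) with zero motion.
    # s is an array of shape (N, H, W) with N even; returns N/2 low pass
    # and N/2 high pass subband pictures.

    def mctf_53_analysis(s):
        s = s.astype(np.float64)
        n = s.shape[0]
        h = np.empty((n // 2,) + s.shape[1:])
        for k in range(n // 2):
            right = s[2 * k + 2] if 2 * k + 2 < n else s[2 * k]
            h[k] = s[2 * k + 1] - 0.5 * (s[2 * k] + right)      # eq. 3.7
        l = np.empty_like(h)
        for k in range(n // 2):
            l[k] = s[2 * k] + 0.25 * (h[k] + h[max(k - 1, 0)])  # eq. 3.8
        return l, h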
Since both high pass and low pass filters are applied in the temporal domain, the low pass subbands will consequently correspond to the homogeneous components found within the GOP. In other words, areas in a GOP that are common to all the pictures across the GOP will be conserved in the low pass subband. The high pass subbands will contain all high pass components in a GOP; more specifically, high pass subbands will contain most of the motion found in the GOP. Therefore, if a GOP in a video sequence exhibits high levels of motion, the low pass subband components of the GOP will retain less of the GOP signal power when compared to the power retained in the low pass subband components corresponding to a GOP with lower levels of motion. The reasoning behind this is quite simple. A low pass subband can be viewed as the average frame of a GOP. Hence, if all the pictures in a GOP resemble each other, i.e. the GOP exhibits little motion, then the average of these pictures will highly resemble each and every picture in the GOP. On the other hand, the high pass components will retain more of the GOP power when there is considerable motion and less power when the level of motion is low.

The above reasoning leads to the conclusion that the amount of information coded within the high pass frames is proportional to the level of motion found in a GOP. Keeping in mind that the focus of this thesis is multiple-description coding of the high pass (HP) frames in SVC, we will now relate our proposed MDC framework to the discussion above.

Multiple description coding calls for the production of two or more versions of a video stream such that the reception of one stream can guarantee a decodable video sequence with an acceptable loss in quality. In more precise terms, we need to create two independently decodable streams such that the distortion produced from decoding a single stream is minimal, while maintaining a low redundancy between the two descriptions. The task remains to develop an expression for the single-stream distortion and for the redundancy added by creating the two streams compared to a single-description coded bitstream.

Let $D_{Total}$ be the distortion resulting from decoding a group of pictures using only one description of the MDC high-pass frames. $D_{Total}$ can then be expressed as

$$D_{Total} = D_{Motion} + D_{Residual} \qquad (3.11)$$

where $D_{Motion}$ and $D_{Residual}$ are the distortions caused by multiple-description coding of the motion information and the residual (texture) information, respectively, in one GOP.

The distortion resulting from multiple-description coding of motion information is completely due to the motion vector approximation process described in section 3.1.1. The motion vectors of an MDC macroblock are approximated using the motion vectors of the neighboring blocks. Since this approximation is performed on an 8x8 block basis, with P being the 8x8 block index, $D_{Motion}$ can then be expressed as follows. Let $N_{HP}$ be the number of high pass frames found in a GOP of size $GOP\_Size$.
Based on a dyadic decomposition structure of MCTF, $N_{HP}$ is evaluated in terms of $GOP\_Size$ as

$$N_{HP} = \sum_{i=1}^{\log_2 GOP\_Size} \frac{GOP\_Size}{2^i} = GOP\_Size - 1 \qquad (3.12)$$

We will assume that the number of MDC coded macroblocks is fixed per frame in a GOP and let $\Omega$ be the set of MDC coded macroblocks, with n being the macroblock (MB) index. The distortion due to multiple-description coding of motion information is finally written as:

$$D_{Motion} = N_{HP} \sum_{n \in \Omega} \sum_{P=0}^{3} D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) = (GOP\_Size - 1) \times \sum_{n \in \Omega} \sum_{P=0}^{3} D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est}) \qquad (3.13)$$

where $D_{SSE}(P, r_{orig}, m_{orig}, r_{est}, m_{est})$ is the sum of square error distortion between the original motion-compensated block and the candidate motion-compensated block defined in equation 3.1. The expression for $D_{Motion}$ above is minimized by minimizing the $D_{SSE}$ term. This has already been included in our framework as a design parameter, since the best-neighbor algorithm discussed in section 3.1.1 selects the neighboring 8x8 block that minimizes the $D_{SSE}$ term. Therefore, the minimal distortion requirement is satisfied for motion estimation.

The second term in the $D_{Total}$ expression is $D_{Residual}$. Recall from section 3.1.3 that multiple-description coding of residual (texture) information is performed on the entire GOP level (or inter-frame level). The texture information in a multiple-description coded frame is simply set to zero. This, however, does not include Intra-coded macroblocks found within the MDC frame. The purpose behind this decision is to restrict the loss in coded data to the high pass components produced by MCTF. [14] specifies that the samples of Intra-coded macroblocks in high pass frames are not used for updating the low-pass pictures and are therefore not included in the reconstruction process at the decoder. Moreover, all sample values of the intra macroblocks are set to zero when used in the update process [14]. For this analysis, we will begin by defining $D_{Residual}$ in terms of the components affected by the dropping of residual data, and then we will assess the level of that distortion based on the video sequence statistics.

The only process affected by multiple-description coding of texture information is the reconstruction stage of motion-compensated temporal filtering. In other words, the loss is restricted to some of the high pass subbands needed for the complete reconstruction of the output pictures. We will express the reconstruction of a set of low pass pictures in terms of the MCTF synthesis equations of two adjacent pictures. Equations 3.14 and 3.15 have been derived from the MCTF analysis equations 3.7 and 3.8. Let $s[x, 2k]$ and $s[x, 2k+1]$ be the reconstructed picture samples; the MCTF reconstruction equations will then be

$$s[x, 2k] = l[x, k] - \tfrac{1}{4}\left(h[x, k] + h[x, k-1]\right) \qquad (3.14)$$
$$s[x, 2k+1] = h[x, k] + \tfrac{1}{2}\left(s[x, 2k] + s[x, 2k+2]\right) \qquad (3.15)$$

Note that prediction data variables have been removed from the above equations for clarity.
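Continuing the zero-motion toy setup of the earlier analysis sketch, the synthesis stage below inverts it exactly; zeroing every other high pass subband before calling it reproduces the single-description approximations derived next.

    import numpy as np

    # Inverse of the zero-motion analysis sketch (eqs. 3.14 and 3.15):
    # even-indexed pictures are recovered first, then the odd ones.

    def mctf_53_synthesis(l, h):
        n = 2 * l.shape[0]
        s = np.empty((n,) + l.shape[1:])
        for k in range(n // 2):
            s[2 * k] = l[k] - 0.25 * (h[k] + h[max(k - 1, 0)])  # eq. 3.14
        for k in range(n // 2):
            right = s[2 * k + 2] if 2 * k + 2 < n else s[2 * k]
            s[2 * k + 1] = h[k] + 0.5 * (s[2 * k] + right)      # eq. 3.15
        return s

    # Round trip: with l, h = mctf_53_analysis(s) from the earlier sketch,
    # mctf_53_synthesis(l, h) reproduces s exactly; setting h[1::2] = 0
    # first yields the single-description approximations derived below.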
Our MDC framework involves setting every other high pass frame to zero in a single description; hence, if the odd numbered frames are set to zero and k is even, then the above equations become

$$\hat{s}[x, 2k] = l[x, k] - \tfrac{1}{4} h[x, k] \qquad (3.16)$$
$$\hat{s}[x, 2k+1] = h[x, k] + \tfrac{1}{2}\left(\hat{s}[x, 2k] + \hat{s}[x, 2k+2]\right) = \tfrac{1}{2}\left(l[x, k] + l[x, k+1]\right) + \tfrac{3}{4} h[x, k] \qquad (3.17)$$

For a more comprehensive analysis, let us consider a group of pictures of size $GOP\_Size$. One example of a GOP with $GOP\_Size = 8$ is shown in Figure 3.13 below, where the shaded frames indicate the transmitted frames and the crossed frames are multiple-description coded.

Figure 3.13: Reconstruction of a group of pictures showing the MDC high pass frames with zero residual marked with the red Xs.

We define $D_{Residual}$ as the sum of square error between the MCTF reconstruction of a single-description (SD) coded GOP and the MCTF reconstruction of one description of an MDC GOP. The calculated distortion is strictly limited to the error caused by the dropping of residual information. Therefore, we will develop an expression for $D_{Residual}$ that incorporates all the lost residual samples across the MCTF stages. Using equations 3.16 and 3.17, the final expression for an entire GOP of size $GOP\_Size$ is written as

$$D_{Residual} = \sum_{s=1}^{\log_2(GOP\_Size)-1} \sum_{k=0}^{\frac{GOP\_Size}{2^s}-2} \left( (s[x, 2k] - \hat{s}[x, 2k])^2 + (s[x, 2k+1] - \hat{s}[x, 2k+1])^2 \right)$$
$$= \sum_{s=1}^{\log_2(GOP\_Size)-1} \sum_{k=0}^{\frac{GOP\_Size}{2^s}-2} \left( \left(\tfrac{1}{4} h[x, k-1]\right)^2 + \left(\tfrac{1}{8} h[x, k-1] + \tfrac{1}{8} h[x, k+1]\right)^2 \right)$$
$$= \sum_{s=1}^{\log_2(GOP\_Size)-1} \sum_{k=0}^{\frac{GOP\_Size}{2^s}-2} \left( \tfrac{5}{64} \left(h[x, k-1]\right)^2 + \tfrac{1}{64} \left(h[x, k+1]\right)^2 + \tfrac{1}{32} h[x, k-1]\, h[x, k+1] \right) \qquad (3.18)$$

where s is the MCTF stage number and $\frac{GOP\_Size}{2^s} - 2$ is the maximum value that k can take in stage s.

Let us now define a new variable, the motion index, as the normalized sum of square error distortion between the original picture samples and the motion-compensated samples produced after the Lagrangian based rate-distortion optimization. The motion index, labeled MIdx, will generally be expressed as

$$MIdx = \frac{\sum_{k \in GOP} \sum_{x \in Frame} \left( s_{orig}[x, k] - p[x, k] \right)^2}{f(GOP\_Size, \Delta)} \qquad (3.19)$$

where p[] represents the motion-compensated samples, f(·) is a normalization function to be defined later, and $\Delta$ is a constant reflecting the maximum SSD distortion per INTER coded macroblock. $\Delta$ is calculated experimentally by encoding a number of sequences with varying levels of motion and is estimated to be around 22000. The normalization function is simply written as

$$f(GOP\_Size, \Delta) = GOP\_Size \times \frac{FrameWidth \times FrameHeight}{256} \times \Delta$$

where $FrameWidth \times FrameHeight / 256$ is the number of macroblocks per frame, so that MIdx is normalized to the range [0, 1].

The question that arises from the above definition is how MIdx can be characteristic of the level of motion in a group of pictures. The answer is inherent in the concept of motion-compensated prediction itself. Motion-compensated prediction can be viewed as a mechanism for tracking motion in a video sequence. The process involves "tracking" a fixed size block of pixels across a group of pictures and trying to find the best match for that block in previous and future pictures by finding a set of motion vectors and reference indices. If the tracking is highly successful, then the picture generated from the motion tracking will closely resemble the original picture. This will result in a residual frame with small sample values.
On the other hand, if the tracking is unsuccessful, then the motion-compensated picture will only vaguely resemble the original picture, causing the residual frame to have large sample values and, in more extreme cases, forcing the use of INTRA_MODE to code the macroblocks. Therefore, if the residual frame sample values are large, then the motion tracking process is finding difficulty in keeping up with the level of motion in the sequence, which reflects that the sequence exhibits high levels of motion. Similarly, if a residual frame contains a large number of Intra-coded macroblocks, then the sequence is more likely to exhibit high levels of motion.

Since the high pass subband samples are defined by $h[x] = l_{orig}[x] - p[x, m, r]$ [14], the motion index equation can be rewritten in terms of the square of the residual high pass samples as

$$MIdx = \frac{\sum_{k \in GOP} \sum h^2[k]}{f(GOP\_Size, \Delta)} \qquad (3.20)$$

Note that the spatial coordinate vector x is dropped from the above equation for clarity. This argument parallels our previous discussion dealing with the amount of signal power retained in the low pass and high pass subbands of the MCTF process. To reiterate, if the video sequence exhibits low levels of motion, then the low pass subband retains more of the signal power. If, on the other hand, the sequence exhibits high levels of motion, then the high pass subbands will retain more of the signal power. Thus, we speculate that the motion index will closely reflect the level of motion found in a video sequence and act as an indicator of the size of the encoded video sequence.

To verify this speculation, we have encoded 113 pictures from each of four video sequences at 15 frames per second and a $GOP\_Size$ of 8 frames. The sequences Crew, City, Foreman, and Harbour were used. Figure 3.14 shows the motion index calculated for 14 GOPs of the Crew sequence. The motion index values displayed here are normalized with respect to the maximum distortion of the Crew sequence.

Figure 3.14: Variation of motion index with respect to GOP number in the Crew sequence.

Based on our speculation, Figure 3.14 indicates that the sequence Crew exhibits low levels of motion during the first three GOPs; the level of motion then starts increasing until it hits a maximum in GOP number 10. To demonstrate this analysis, we display screen captures of the frames of GOP 2 and GOP 10. Figure 3.15 (a) shows that very little motion can be noticed in the displayed group of eight pictures. The GOP intensity is almost homogeneous as well, and therefore the value of the motion index is small. Figure 3.15 (b), on the other hand, exhibits more motion in the foreground and, more specifically, in the background of the sequence. Moreover, the intensity of these pictures is also changing, which results in a high motion index.

The above figures clearly demonstrate the relevance of the motion index in indicating the motion within a video sequence. The question that arises, however, is how the motion index can reflect the difference in the level of motion between two or more sequences.
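The motion index itself reduces to a few lines of code. The sketch below follows equation 3.20 with the normalization f(·) as reconstructed above (our reading of the original expression; Δ ≈ 22000 is the experimental per-macroblock maximum quoted earlier).

    import numpy as np

    # Sketch of the motion index (eq. 3.20): the normalized energy of the
    # high pass (residual) subband samples of one GOP.

    def motion_index(hp_frames, width, height, gop_size, delta=22000.0):
        sse = sum(float(np.sum(np.square(h.astype(np.float64))))
                  for h in hp_frames)
        f_norm = gop_size * (width * height / 256.0) * delta
        return sse / f_norm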
Figure 3.15: Crew sequence showing (a) frames from GOP number 2 and (b) frames from GOP number 10.

Figure 3.16 shows the motion index values of the four sequences mentioned earlier.

Figure 3.16: Comparison of the motion index values between the 113 coded frames of the four sequences Crew, City, Foreman, and Harbour.

The plots shown in Figure 3.16 indicate that the sequences Crew and Harbour exhibit higher levels of motion than Foreman or City. The values of the motion indices shown in Figure 3.16 were normalized using the constant $\Delta$ referred to in equation 3.19. By averaging the motion index over all the GOPs in each sequence, the motion index can then be used to compare the level of motion between sequences. Moreover, this measure will also reflect the bit-rate of the coded video sequence. Table 3.6 lists the average motion index for each sequence along with the bit-rate and size of each coded video sequence. Note that the same QP value was used during the encoding of the four sequences.

Table 3.6: Relation between motion index and the coded video file size.

Video Sequence   Average MIdx   Bit-Rate (kbit/sec)   File Size (KB)
Crew             0.5126         95.762                88
Harbour          0.4902         95.498                87.8
Foreman          0.3874         70.008                64.3
City             0.1898         50.937                46.8

Clearly, the motion index can now be related to the residual distortion $D_{Residual}$ caused by multiple-description coding of a scalable video sequence. Since multiple-description coding of a video sequence involves dropping some of the high pass (or residual) components, sequences with a high MIdx will suffer from high residual distortion compared to sequences with a lower MIdx. We will demonstrate this relationship later in chapter 4.

The proposed multiple-description coding scheme incorporates some degree of redundancy in the video payload in order to assure independent decodability of every individual description. Let $R_{Total}$ be the overall redundancy rate introduced by our MDC framework. Similar to the distortion measure, $R_{Total}$ can also be separated into two components, $R_{Motion}$ and $R_{Residual}$, such that $R_{Total} = R_{Motion} + R_{Residual}$.

The redundancy arising from multiple-description coding of motion information is mainly due to the common syntax elements coded in both MDC and non-MDC macroblocks, in addition to the new syntax elements introduced by MDC macroblocks. The syntax elements coded for both MDC and non-MDC macroblocks are shown in Table 3.3. Previous multiple-description coding schemes such as [19] required copying all motion information in both descriptions to ensure correct decodability of the independent descriptions. The problem with such an approach is mainly the considerable increase in bitrate of the multiple-description coded bitstream, translated into an increase in video redundancy. We have overcome this problem by limiting the level of redundancy to a minimum. Therefore, based on the syntax elements listed in Table 3.3 above, the motion redundancy is estimated to be
$$R_{Motion} = \underbrace{R_{BLSkipFlag} + R_{MbMode} + R_{BLQRefFlag}}_{\text{common syntax elements}} + \underbrace{R_{MDCFlag} + R_{BestNeighbor}}_{\text{added syntax elements}} \qquad (3.21)$$

We will label the common syntax element rates by $R_{common}$. The major contribution to motion redundancy comes from the coding of the BestNeighbor field. However, this increase in bitrate remains much smaller than coding the actual motion-vector differences instead, for the following reasons. The BestNeighbor field is binarized using the 3-bit fixed length binarization scheme discussed in section 3.1.2. The motion-vector difference (Mvd) field binarization is performed by concatenating the truncated unary and Exp-Golomb binarization schemes (UEG3) with a cutoff value of 9. Moreover, motion-vector differences in H.264 are set to quarter-pixel accuracy. For instance, if a block shifts by one pixel from one picture to the other, then the coded motion-vector will have a value of 4 [11]. Applying UEG3 binarization will produce 5 bits for every one-pixel shift: four bits to code the value of the Mvd and an additional sign bit [25]. Entropy coding of both the BestNeighbor and Mvd fields is governed by their respective context models, which closely approximate the actual symbol probabilities, and the respective binstrings are each coded using the same binary arithmetic coder. Experiments show that the coding of the finite-alphabet BestNeighbor syntax element requires fewer bits compared to encoding the actual Mvd syntax element, which has a much larger alphabet, the tradeoff being the distortion cost calculated earlier in this section.

The additional rate contributed by multiple-description coding of the residual information, $R_{Residual}$, is directly related to the number of Intra-coded blocks found within a high pass frame. Note that since all high pass subband components are set to zero in an MDC high pass frame, the redundancy rate is strictly limited to the coding of Intra macroblocks. Let $N_{Intra}$ and $R_{Intra}$ be the number of Intra-coded macroblocks in a group of pictures and the average bitrate required to encode an Intra-coded macroblock, respectively. We can estimate the residual redundancy by

$$R_{Residual} \approx N_{Intra} \times R_{Intra} \qquad (3.22)$$

3.3 Multiplexing and Error Concealment

The final step before transmitting the multiple-description coded stream is multiplexing the generated descriptions. Our framework is based on a layered structure of a combined scalable video stream offering spatial, temporal, and SNR scalability. The original SVC multiplexing scheme is performed on a group of pictures (GOP) basis, such that every frame belonging to all the SVC layers of a coded GOP is embedded into the bitstream before the next GOP frames are inserted. Figure 3.17 (a) shows the Multiplex module of the original SVC encoder. The coded frames of a GOP k are arranged starting with the layer SVC0 frames followed by layer SVC1 frames until all frames belonging to GOP k are inserted. The frames of GOP k + 1 then follow, starting with layer SVC0 frames, and so on.

Performing multiple-description coding on the high pass frames of all SVC layers except for layer SVC0 will yield two descriptions of the high pass frames. This requires a slight modification to the Multiplex module such that all frames belonging to GOP k, for instance, are still inserted into the bitstream before the insertion of any frame from GOP k + 1. Moreover, we further propose embedding all frames belonging to layer l of both descriptions into the bitstream before inserting any frame from layer l + 1 as an additional multiplexing constraint.
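The resulting ordering can be sketched as follows (a simplified illustration with names of our choosing; the real Multiplex module operates on coded NAL units rather than on an abstract frame map). Stream 0 carries the base-layer and low pass frames, while streams 1 and 2 carry the two enhancement descriptions.

    # Sketch of the modified Multiplex ordering of Figure 3.17 (b): all of
    # GOP k precedes GOP k + 1, and within a GOP all frames of layer l
    # from every stream are emitted before any frame of layer l + 1.
    # 'coded' maps (gop, layer, stream) -> list of coded frames.

    def multiplex(coded, num_gops, num_layers):
        bitstream = []
        for gop in range(num_gops):
            for layer in range(num_layers):
                for stream in (0, 1, 2):
                    bitstream.extend(coded.get((gop, layer, stream), []))
        return bitstream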
Moreover, we further propose embedding all frames belonging to layer I of both descriptions into the bitstream before inserting any frame from layer / + 1 as an additional multiplexing constraint.  Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/AVC  (SVC)  Multiplex  a  HHHHHHH  L  HHHHHHI  , ,  1  | HHHHHHH | L | | HHHHHHH | P k«1  GOPk •  ' ^ • „.  •  |H H H H H H H  [HHHHHHHHHHHHHHH  | L |»  •  |H H H H H H H  | L | [HHIHHHHH  | L  i^Wlsttlillllllii GOPk •  HHHHHHHHHHHHHHH  0  Gi  L  HHHHHHHHHHHHHHH  L  •  (a)  Multiplex  ;  v  c  | HHHHHHH| L I I HHHHHHh| I  o  k.1  GOPk  :  ^^^cription 1 —  " | HHHHHHHJ~L"| HHHHHHH| L I HHHHHHH l  :  s-.tj ft-  *  ..C,  SV3-.  V  "T'M-'-i* "** * | HHHHHHH| L 11 HHHHHHHHHHHHHHl^ HHHHHHHHHHHHHHl] L | •  HHHHHHHHHHHHHHlL  "•xi\  7\  •Description 2 -  j HHHHHHH | ••  33-  HHHHHHHHHHHHHH  -Description V —  HHH • H HI H IH I IHHf  HHHH  1  H H H H H H H H H H H H H H ^ . . ' | H H H H H H H H H H H H H H l ! Description 2  (b)  Figure 3.17: (a) Original Multiplex module of the S V C encoder, (b) Modified Multiplex module to accommodate M D C frames.  iy  Chapter 3. Multiple Description Coding of the Scalable Extension of H.264/'AVC (SVC) 65  The encoding side of our multiple-description coding framework partitions a single stream into three sub-streams, a base-layer stream, labeled stream 0, that contains all SVCO frames along with the low pass frames of the higher layers, and two descriptions of an enhancement layer that contains the high pass frames of the higher layers, labeled stream 1 and stream 2 respectively. The stream number is also coded into the frame header to facilitate stream partitioning. From this point on, it falls upon the transmission network to decide how to transmit the multiple descriptions, whether to transmit the entire bitstream using one channel or to partition the bitstream and transmit each description using a separate channel. Note that stream one and stream two are independently decodable of each other, however, they do depend on the correct reception of stream zero. Therefore, a necessary condition to independently decode each of the two descriptions is the correct reception and decodability of the base layer (stream 0). The MDC decoder will simply invert all the steps performed by the encoder to create the two descriptions. Figure 3.18 shows the basic structure of an MDC decoder with emphasis on the inverse MDC module, the shaded block. The combine motion fields and combine residual fields modules restore the coded motion and texture data to the original SVC compatible, single-description input that is accepted by the MCTF decoder. Let us first evaluate the significance of the coded frames produced by our SVC based multiple-description coder. The proposed framework offers a layered approach to MDC. We separate a coded video sequence into one logical layer, the base-layer, which contains all SVCO frames in addition to the low pass frames of all higher SVC layers. This architecture necessitates the correct reception of all base-layer frames in order to ensure decodability. Therefore, if we rate the importance of the coded video frames, the base-layer frames will take the highest priority. Next, we move to the multiple-description coded logical enhancement-layer which contains all high pass  Chapter 3. 
Figure 3.18: Basic structure of the MDC decoder with the inverse MDC module shown in gray.

These high pass frames may or may not contain prediction data (motion information), depending on the type of scalability associated with the SVC layer. For instance, if an SVC layer is SNR scalable, then its high pass frames do not contain motion information. High pass frames belonging to SVC layers that extend the temporal and/or spatial scalability of the lower layers do contain motion information. Hence, the loss of a high pass frame from an SNR scalable layer will only cause some degradation in output picture quality; it does not hamper the decoding process, since the lost residual samples can be assumed to be zero. The loss of motion information, on the other hand, is not replaceable and will completely obstruct the decoding of the respective layer.

To clearly illustrate our discussion, consider the group of pictures shown in Figure 3.19. The dark frames belong to the base-layer, whereas the light frames belong to the multiple-description coded enhancement-layer.

Figure 3.19: Multiple-description coded group of pictures with four SVC layers of combined scalability.

Figure 3.20 shows the same GOP displayed earlier with five frame-loss scenarios labeled LS1 to LS5. The frame loss scenarios shown in Figure 3.20 include losses in base-layer frames, motion information of enhancement-layer frames, and residual information in enhancement frames. These loss scenarios constitute a comprehensive study of the possible frame losses that might occur when transmitting the coded video bitstream. The error handling routines we developed for coping with these losses are listed in Table 3.7. These routines were developed to allow an SVC decoder to detect and recover from frame losses. This feature is not supported in the current SVC decoder, nor is there any recommendation that describes error handling in SVC. The existing recommendation suggests replacing any lost output frame with the previously decoded output frame, a condition that occurs every time a single coded frame in a GOP is lost.

Table 3.7: Error handling routine description.

LS1
  Description: Loss of the two descriptions of a high pass frame belonging to an SNR layer.
  Effect: Reduces the quality of the decoded video.
  Error Handling: Assume that the lost HP frame has zero residual information. Extract motion information from the lower SVC layer. Resume decoding of the GOP.

LS2
  Description: Loss of one description of a high pass frame of a spatial/temporal layer.
  Effect: Activates the motion recovery routine.
  Error Handling: Use the second description high pass frame to recover an estimate of the lost motion and residual information. Resume decoding of the GOP.
LS3
  Description: Loss of the two descriptions of a high pass frame belonging to a spatial/temporal layer.
  Effect: Disrupts decoding of the current and higher SVC layers.
  Error Handling: Discard decoded frames belonging to the current layer and ignore all other received frames belonging to the same GOP. Repeat the last decoded picture to fill the GOP space.

LS4
  Description: Loss of the two descriptions of a high pass frame belonging to the highest SVC layer (also an SNR layer).
  Effect: Reduces the quality of the decoded video.
  Error Handling: Assume that the lost HP frame has zero residual information. Extract motion information from the lower SVC layer. Resume decoding of the GOP.

LS5
  Description: Loss of a base layer frame.
  Effect: Disrupts decoding of the current GOP.
  Error Handling: Discard all received frames belonging to the current GOP. Repeat the last decoded picture to fill the GOP space.

Figure 3.20: GOP with five possible frame-loss scenarios.

Therefore, the reference error handling routine recommends that for every lost frame in a GOP, the entire GOP be replaced by the insertion of the last output picture.
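The decoder-side dispatch implied by Table 3.7 can be condensed into the following sketch; the classification arguments and returned actions are simplified stand-ins for the actual decoding logic, which classifies losses from the received NAL unit headers.

    # Sketch of the frame-loss handling of Table 3.7.

    def handle_loss(layer_kind, lost_descriptions, is_base_layer):
        if is_base_layer:                      # LS5
            return "discard GOP, repeat last decoded picture"
        if lost_descriptions == 1:             # LS2
            return "recover motion/residual from the surviving description"
        if layer_kind == "SNR":                # LS1 and LS4
            return "assume zero residual, reuse lower-layer motion, resume"
        return "discard current and higher layers for this GOP"   # LS3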
If on the other hand, the N A L U size is larger than 1400 bytes, then the N A L U is partitioned into blocks of 1400 bytes, each of which is encapsulated in one R T P packet. Moreover, the R T P marker bit is set to one to indicate if the R T P payload is a full N A L U or the last partition of a N A L U ; otherwise, the marker bit is set to zero. Figure 4.1 below shows the N A L U encapsulation process. 1400 bytes  1200 bytes  1400 bytes  . I.  1400 bytes  900 bytes  -. -  NALU I  RTP  RTP  header  header  Figure 4.1: R T P encapsulation of N A L U s with varying sizes.  [28] recommends the use of the erasure simulation patterns I T U - T V C E G Q15-I16rl to model Internet and 3 G P P / 3 G P P 2 packet-loss environments. We have chosen the network simulator tool recommended in [28], "Offline simulator for R T P / I P over U T R A N " provided by the 3GPP SA4 S4-AHVIC036 to test the resilience of our framework. This simulator was developed from V C E G - N 8 0 which describes common test conditions suited for transmission on 3 G P P / 3 G P P 2 networks for conversational and streaming video applications based on R T P / I P . V C E G - N 8 0 provides a controlled environment for experiments by defining an offline software simulator of the 3 G P P / 3 G P P 2 radio bearer protocols and R L C physical layer error patterns with average packet loss rates of 3%, 5%, 10%, and 20%. We shall give a brief overview of the simulated network environment and the user plane protocols in 3G presented in V C E G - N 8 0 [30]. The video source and video receiver are both assumed to be located within a private operator's network.  This network is composed of an IP-based core network and a  Chapter 4. Simulation Results  72  radio access network. The streaming server is assumed to be directly connected to the core network. Moreover, the core network is assumed to be error free restricting the bottleneck to the radio interface. Consequently, the simulated packet losses result only from fading/shadowing errors at the radio interface [30]. The user plane protocol stack specifications defined by 3GPP and 3GPP2 are similar between the user equipment (UE) or mobile station (MS) and the radio base station (BS). We will give a brief overview of the protocol stack as presented in [30]. Figure 4.2 shows the packetization of a video payload unit according to the user plane protocol stack of CDMA-2000 [30].  Physical frame LTU  Figure 4.2: CDMA-2000 user plane protocol stack packetization presented in VCEG-N80.  The protocol stack used for UMTS is very similar, except that 3GPP defines different names for the protocols. The differences listed in [30] are as follows: • The point-to-point protocol (PPP) is not used. Instead, the radio link control (RLC) protocol includes additional information that signals the RLC payload boundaries. • The size of a physical layer unit in UMTS is more flexible than in CDMA-2000 and therefore, a physical layer frame is not split further into logical transmission units (LTU) as occurs in CDMA-2000. In CDMA-2000, an LTU is the smallest fixed-size unit and has a cyclic redundancy  Chapter 4. Simulation Results  73  code (CRC) to detect possible errors. U M T S on the other hand defines an R L C - P D U (protocol data unit) which is a fixed length physical layer unit that is determined when the radio bearer is setup. In our simulations, it is assumed that one link-layer frame is packed into one L T U / R L C - P D U [30]. 
For our simulations, we have merged the packet loss patterns presented in ITU-T VCEG Q15-I-16r1 to create radio bearers with 3%, 5%, 10%, and 20% block error rates (BLER). The bearer specifications used in the offline simulator assume that no retransmissions are allowed during transmission; the bearer is therefore unacknowledged. The bearer bitrate is set to 128 kbit/s with a radio frame size (RFS), i.e., RLC-PDU size, of 320 bytes. Moreover, we will use UMTS as our telecommunications system.

In our simulations, we will demonstrate the performance of our MD-SVC framework in the simulated UMTS network compared to the performance of the MD-MCTF scheme presented in [19] and the single-description SVC bitstream [14]. We have coded 961 frames from each of the four test sequences Crew, City, Harbour, and Foreman. The 961 frames were generated according to the recommendation in [28] by first encoding the pictures "in normal order till the original sequence ends, then in reverse order from the second last picture in the sequence end to the beginning, and then again in normal order from the second picture, and so on." Four SVC layers were encoded according to the configuration shown in Table 4.1.

Table 4.1: Coding configuration of the test sequences.

    Layer  Scalability       Spatial Resolution  Frame Rate  GOP Size
    SVC0   Temporal          QCIF                15          8
    SVC1   SNR               QCIF                15          8
    SVC2   Spatial-temporal  CIF                 30          16
    SVC3   SNR               CIF                 30          16

The multiple-description coded SVC stream, which we will refer to as MD-SVC, is composed of a base layer, containing all SVC0 frames along with the low pass frames of the three higher SVC layers, and an enhancement layer, containing two descriptions of the high pass frames belonging to SVC layers 1, 2, and 3. One example of the coding structure can be seen in Figure 3.19; note, however, that the GOP sizes shown in Figure 3.19 are half of those used in our tests.

We will assume that all SVC0 frames, along with the low pass frames belonging to the higher SVC layers, are protected using the Unequal Erasure Protection (UXP) scheme presented in [17], which was developed specifically to protect base layer frames in SVC; a brief overview of the UXP scheme was given earlier. Our assumption is that the UXP scheme is applied to all coded streams (MD-SVC, MD-MCTF, and SD-SVC). The UXP scheme reduces the loss rate to significantly below 1% [17] at the cost of a 20% overhead in the transmission bitrate of the protected frames. Furthermore, the MD-MCTF scheme is applied only to the high pass frames of the enhancement layer defined by our MD-SVC framework. This restriction was imposed because the MD-MCTF scheme replicates low pass frames in MCTF, which dramatically increases the bitrate of the coded video stream and would result in an unfair comparison with our scheme. Therefore, in the MD-MCTF case, low pass and base layer frames are protected using only UXP. A sketch of the resulting stream partitioning follows.
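To make the construction concrete, here is a minimal Python sketch (our own illustration, with hypothetical data structures). Duplicating Intra-coded macroblocks in both descriptions is taken directly from our scheme; the even/odd alternation below is only a toy stand-in for the motion/texture splitting rule developed in Chapter 3.

    def split_hp_frame(macroblocks):
        """Split one high pass frame's macroblocks into two descriptions.

        macroblocks: list of dicts with keys 'idx' (raster position),
        'intra' (bool), and 'data' (coded payload). Intra-coded macroblocks
        cannot be approximated from motion information, so they are
        duplicated in both descriptions; the rest alternate between them.
        """
        desc1, desc2 = [], []
        for mb in macroblocks:
            if mb['intra']:
                desc1.append(mb)  # redundant copy ...
                desc2.append(mb)  # ... in both descriptions
            elif mb['idx'] % 2 == 0:
                desc1.append(mb)
            else:
                desc2.append(mb)
        return desc1, desc2

Together with the UXP-protected base stream, this yields the three separable streams described above.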
Figures 4.3 to 4.6 demonstrate the performance of our MD-SVC scheme compared to the MD-MCTF and single-description SVC (SD-SVC) schemes when faced with loss rates between 3% and 20%. The sequences Crew, Foreman, Harbour, and City were encoded using each of the above-mentioned coding schemes, and the reconstructed frames obtained after decoding are compared using the luminance PSNR (Y-PSNR) measure; a sketch of this measure is given at the end of this discussion.

The PSNR performance plots show that MD-SVC outperforms MD-MCTF in most cases, except for the City sequence, where MD-MCTF has the better PSNR performance. The City sequence has very little motion, as indicated by the motion index (MIdx) plots shown in Figure 3.17. However, it is important to note that, for the City sequence, the level of redundancy imposed by our MD-SVC scheme is lower than that imposed by the MD-MCTF scheme.

As indicated in Section 3.2, the redundancy of our MD-SVC scheme results mainly from the repeated Intra-coded macroblocks in the video sequence. The MD-MCTF scheme does not insert these Intra-coded blocks in multiple-description coded high pass frames. The result is improved decoded video quality for our MD-SVC over MD-MCTF in sequences that contain a large number of Intra-coded macroblocks. Figure 4.7 illustrates the relation between the redundancy rate and the distortion associated with the packet loss patterns. The redundancy rates used are those generated by encoding the four video sequences; they are listed in Table 4.2.

Table 4.2: MDC scheme redundancy.

              Crew    Foreman  Harbour  City
    MD-SVC    21.9%   17.9%    8.1%     14.1%
    MD-MCTF   13.9%   16%      7.7%     15.5%

The redundancies listed in Table 4.2 show that MD-SVC and MD-MCTF impose very similar redundancy rates. Except for the Crew sequence, which exhibits a high number of Intra-coded blocks in its high pass frames, the redundancy rates are within two percentage points of each other. The peaks seen in the MD-MCTF plots in Figure 4.7 are due to the City sequence, which exhibits very little motion; losses in high pass frames therefore did not affect the quality much, since most of the signal power is conserved in the low pass frames.
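For reference, the Y-PSNR measure used in these comparisons is the standard luminance PSNR; a minimal sketch (the function and array names are ours) is:

    import numpy as np

    def y_psnr(ref_y, rec_y):
        """Luminance PSNR in dB between reference and reconstructed frames.

        ref_y, rec_y: 8-bit luma (Y) sample arrays of identical shape.
        """
        diff = ref_y.astype(np.float64) - rec_y.astype(np.float64)
        mse = np.mean(diff ** 2)
        if mse == 0:
            return float('inf')  # identical frames
        return 10.0 * np.log10(255.0 ** 2 / mse)

Sequence-level summaries are typically averages of such per-frame values.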
Figure 4.3: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Crew sequence.

Figure 4.4: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Foreman sequence.

Figure 4.5: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the Harbour sequence.

Figure 4.6: Comparison of PSNR performance of MD-SVC, MD-MCTF, and (single description) SD-SVC for the City sequence.

Figure 4.7: Comparison of MD-SVC and MD-MCTF redundancy rates and associated distortions.

The MD-MCTF scheme does not generate any distortion due to the coding of motion information, since motion information is replicated in both descriptions. This, however, is not the case in our MD-SVC scheme, which saves on the redundancy in motion information by multiple-description coding the motion information. In exchange, MD-SVC uses the savings in motion information coding to insert redundant texture information by replicating Intra-coded macroblocks.

Figures 4.8 to 4.11 show snapshots from the four video sequences. The pictures on the left are generated from MD-SVC; the pictures on the right are MD-MCTF coded. It can be seen that MD-SVC causes slight distortions that reflect negatively in PSNR but are not as significant visually. These slight distortions are mainly due to the approximation of the motion vectors of macroblocks from the neighboring blocks; a sketch of such neighbor-based recovery is given after the figures. The benefit of this coding mechanism lies in reducing the video data rate to allocate more bits to duplicated Intra-coded blocks. Since the MD-MCTF scheme does not duplicate Intra-coded blocks, the packet loss is more obvious visually, as can be seen in Figures 4.8 to 4.11.

Figure 4.8: Comparison of visual quality from the Foreman sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.9: Comparison of visual quality from the Crew sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.10: Comparison of visual quality from the City sequence, MD-SVC (left), MD-MCTF (right).

Figure 4.11: Comparison of visual quality from the Harbour sequence, MD-SVC (left), MD-MCTF (right).
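As a hedged illustration of that neighbor-based approximation (the recovery rule actually used is the one developed in Chapter 3; the component-wise median below is a common stand-in, and the function name is ours):

    import statistics

    def recover_motion_vector(neighbors):
        """Approximate a lost macroblock's motion vector from its neighbors.

        neighbors: list of (mvx, mvy) vectors from available adjacent
        macroblocks (e.g., left, top, top-right). Falls back to the zero
        vector when no neighbor is available.
        """
        if not neighbors:
            return (0, 0)
        return (statistics.median(mv[0] for mv in neighbors),
                statistics.median(mv[1] for mv in neighbors))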
Chapter 5

Conclusion and Future Work

In this thesis, we have developed a new multiple-description coding scheme for the scalable extension of the H.264/AVC video coding standard (SVC). Our scheme (MD-SVC) generates two descriptions of the high pass frames of the enhancement layers of an SVC coder by coding in each description only half the motion information and half the texture information, with a minimal degree of redundancy. Intra-coded macroblocks are inserted as redundant information, since they cannot be approximated using motion information. The two descriptions are complementary but independently decodable, such that if only one description is received, the decoder can recover the missing motion information from the available data and generate an output video sequence with an acceptable degradation in quality. If both descriptions are received, the full quality of single-description SVC (SD-SVC) is delivered. However, the two descriptions are highly dependent on the base layer; therefore, the base layer and the low pass frames of the enhancement layers are protected using an unequal erasure protection (UXP) technique developed specifically for SVC payload data. We have also added error detection and concealment features to the SVC decoder to cope with frame losses.

Our multiple-description coding framework conforms to the protocols and codecs specifications of the 3GPP multimedia broadcast/multicast services (MBMS). Moreover, the framework is built on and fully integrated into the SVC codec, and can therefore offer highly reliable video services to MBMS networks and clients with heterogeneous resources and capabilities. The outcome of our coding scheme is three separable streams that can be transmitted over the same channel (in multicast or broadcast mode) or over three separate channels (in broadcast mode) to minimize the packet loss rate. Objective and subjective performance evaluations have shown that our scheme delivers superior decoded video quality compared with the UXP-protected SD-SVC and with the multiple-description motion compensated temporal filtering (MD-MCTF) scheme at comparable redundancy levels.

It would be desirable to further reduce the redundancy level and to remove the dependence on the base layer; future work could therefore generate low-redundancy multiple descriptions of the low pass frames using techniques such as correlating transforms. Furthermore, the accuracy of the motion recovery approach can be improved through boundary matching algorithms and block interpolation.

Bibliography

[1] Mobile Broadcast/Multicast Service (MBMS), 2004.

[2] DigiTAG. Television on a Handheld Receiver - Broadcasting with DVB-H, 2005.

[3] Nokia makes air interface of its mobile TV end-to-end solution (DVB-H) publicly available, May 2005.

[4] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560-576, 2003.

[5] R. Schafer, H. Schwarz, D. Marpe, T. Schierl, and T. Wiegand. MCTF and scalability extension of H.264/AVC and its applications to video transmission, storage, and surveillance. In Visual Communications and Image Processing, July 2005.

[6] European Telecommunications Standards Institute. Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Stage 1, September 2004.

[7] European Telecommunications Standards Institute. Digital cellular telecommunications system (Phase 2+); Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Stage 1, September 2004.

[8] European Telecommunications Standards Institute. Universal Mobile Telecommunications System (UMTS); Multimedia Broadcast/Multicast Service (MBMS); Protocols and Codecs, March 2005.

[9] Samsung Electronics. Scalable Multimedia Broadcast and Multicast Service (MBMS), May 2002.

[10] G. J. Sullivan and T. Wiegand. Video compression - from concepts to the H.264/AVC standard. Proceedings of the IEEE, 93(1):18-31, 2005.

[11] T. Wiegand, G. J. Sullivan, and A. Luthra. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC). Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, 2003.

[12] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan. Rate-constrained coder control and comparison of video coding standards. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):688-703, July 2003.
[13] T. Wedi and H. G. Musmann. Motion- and aliasing-compensated prediction for hybrid video coding. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):577-586, July 2003.

[14] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Subband Extension of H.264/AVC, March 2004.

[15] M. Flierl. Video coding with lifted wavelet transforms and frame-adaptive motion compensation. In VLBV, September 2003.

[16] H. Schwarz, D. Marpe, and T. Wiegand. MCTF and scalability extension of H.264/AVC. In PCS, December 2004.

[17] T. Schierl, H. Schwarz, D. Marpe, and T. Wiegand. Wireless broadcasting using the scalability extension of H.264/AVC. In ICME, July 2005.

[18] S. Wenger, M. M. Hannuksela, T. Stockhammer, M. Westerlund, and D. Singer. RTP Payload Format for H.264 Video. IETF, February 2005.

[19] M. van der Schaar and D. S. Turaga. Multiple description scalable coding using wavelet-based motion compensated temporal filtering. In ICIP, September 2003.

[20] A. R. Reibman, H. Jafarkhani, Y. Wang, M. T. Orchard, and R. Puri. Multiple description coding for video using motion compensated prediction. In ICIP, 1999.

[21] C. Kim and S. Lee. Multiple description coding of motion fields for robust video transmission. IEEE Transactions on Circuits and Systems for Video Technology, 11(9):999-1010, September 2001.

[22] Y. K. Wang, M. M. Hannuksela, V. Varsa, A. Hourunranta, and M. Gabbouj. The error concealment feature in the H.26L test model. In ICIP, 2002.

[23] T. Stockhammer, M. M. Hannuksela, and T. Wiegand. H.264/AVC in wireless environments. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):657-673, July 2003.

[24] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Scalable Video Coding - Working Draft 2, April 2005.

[25] D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620-636, July 2003.

[26] M. J. Weinberger, J. J. Rissanen, and M. Feder. A universal finite memory source. IEEE Transactions on Information Theory, 41(3):643-652, May 1995.

[27] D. Marpe, H. Schwarz, G. Blattermann, G. Heising, and T. Wiegand. Context-based adaptive binary arithmetic coding in JVT/H.26L. In ICIP, September 2002.

[28] Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG. Common conditions for SVC error resilience testing, July 2005.

[29] S. Wenger. H.264/AVC over IP. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):645-656, July 2003.

[30] ITU-T Video Coding Experts Group (VCEG). Common Test Conditions for RTP/IP over 3GPP/3GPP2, December 2001.
