COMPUTATIONALLY EFFICIENT TECHNIQUES FOR H.264/AVC TRANSCODING APPLICATIONS

by

Qiang Tang

B.Eng., Tianjin University, 2001
M.A.Sc., Tianjin University, 2004

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

in

The Faculty of Graduate Studies (Electrical and Computer Engineering)

The University of British Columbia (Vancouver)

May 2010

© Qiang Tang 2010

ABSTRACT

Providing universal access to end-users is the ultimate goal of the communications, entertainment and broadcasting industries. H.264/AVC has become the coding standard of choice for broadcasting and entertainment (e.g., DVD/Blu-ray), meaning that the latest set-top boxes and playback devices support this new video standard. Since many existing videos have been encoded using previous video coding standards (e.g., MPEG-2), playing them back on the new devices is possible only if they are converted, or transcoded, into the H.264/AVC format. In addition, even when videos are compressed using H.264/AVC, transmitting them over different networks for different user applications (e.g., mobile phones, TV) requires transcoding in order to adapt them to different bandwidth and resolution requirements.

This thesis tackles the H.264/AVC transcoding problem in three respects. First, we propose algorithms that improve the video quality produced by the transform-domain MPEG-2 to H.264/AVC transcoding structure. Transform-domain transcoding offers the lowest complexity, but the transcoded videos suffer from some inherent distortions. We provide a theoretical analysis of these distortions and propose algorithms that compensate for them. Performance evaluation shows that the proposed algorithms greatly improve the quality of the transcoded video at a reasonable computational complexity.

Second, we develop an algorithm that speeds up pixel-domain MPEG-2 to H.264/AVC transcoding.
Motion re-estimation is the most time-consuming process in this type of transcoding. The proposed algorithm accelerates motion re-estimation by predicting the H.264/AVC block-size partitioning. Performance evaluation shows that the proposed algorithm significantly reduces the computational complexity compared to the existing state-of-the-art method, while maintaining the same compression efficiency.

Finally, we propose an algorithm that accelerates the transcoding of a coded H.264/AVC video into a downscaled version using arbitrary downscaling ratios. To accelerate the encoding of the downscaled video, the proposed algorithm derives accurate initial motion vectors for the downscaled video, thus greatly reducing the computational complexity of the motion re-estimation process. Compared to other state-of-the-art downscaling methods, the proposed method requires the least computation while yielding the best compression efficiency.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Figures
List of Tables
Glossary
Acknowledgements
Dedication
Co-authorship Statement

Chapter 1: Introduction and Overview
  1.1 Introduction
  1.2 Overview of H.264/AVC Video Coding Standard and Traditional Transcoding Structures
    1.2.1 Overview of H.264/AVC video coding standard
    1.2.2 Overview of traditional transcoding structures
  1.3 Challenges in H.264/AVC Transcoding Applications
  1.4 Thesis Objectives
  1.5 Thesis Contributions
  1.6 Thesis Summary
  1.7 References

Chapter 2: Compensation of Re-quantization and Interpolation Errors in MPEG-2 to H.264/AVC Transcoding
  2.1 Introduction
  2.2 Problem Description
    2.2.1 Re-quantization error
    2.2.2 Luminance half-pixel interpolation error
    2.2.3 Chrominance quarter/three-quarter pixel interpolation errors
  2.3 Proposed Compensation Algorithms
    2.3.1 Re-quantization error compensation algorithm
    2.3.2 Luminance half-pixel and chroma quarter/three-quarter interpolation error compensation algorithm
    2.3.3 The proposed closed-loop transcoding structure
  2.4 Experimental Results
    2.4.1 Compensation of re-quantization error
    2.4.2 Compensation of re-quantization and interpolation errors
    2.4.3 Comparison and complexity analysis of the proposed compensation algorithms
  2.5 Conclusion and Future Work
  2.6 References

Chapter 3: Efficient Motion Re-estimation with Rate-distortion Optimization for MPEG-2 to H.264/AVC Transcoding
  3.1 Introduction
  3.2 Proposed Efficient Motion Re-estimation with Rate-Distortion Optimization
    3.2.1 RDO-based block size partitioning prediction for P/B frames
    3.2.2 Limiting block size partitioning to 8x8
    3.2.3 Our proposed cascaded pixel-domain transcoding structure
  3.3 Experimental Results and Computational Complexity Analysis
  3.4 Conclusion
  3.5 References

Chapter 4: An Efficient Motion Re-estimation Scheme for H.264/AVC Video Transcoding with Arbitrary Downscaling Ratios
  4.1 Introduction
  4.2 Overview of Motion Vector Estimation during Downscaling Transcoding
  4.3 Proposed H.264/AVC Motion Vector Re-estimation during Downscaling Transcoding
    4.3.1 Calculation of the area-weighted align-to-worst motion vector
    4.3.2 Finding related integer transform coefficients of parts of the original blocks associated with the downscaled block
    4.3.3 Scaling down the area-weight align-to-worst MV by the downscaling ratio
  4.4 Computational Complexity Analysis
  4.5 Experimental Results and Discussions
    4.5.1 Comparison between the proposed algorithm and the multiple-regression-models method
    4.5.2 Comparison between the proposed algorithm and the area-weighted vector median filter approach
    4.5.3 Investigating of the impact of MV refinement range on compression performance
  4.6 Conclusion
  4.7 References

Chapter 5: Conclusions and Future Work
  5.1 Significance of the Research
  5.2 Potential Applications
  5.3 Contributions
  5.4 Suggestions for Future Research
  5.5 References

Appendices
  Appendix A
  Appendix B

LIST OF FIGURES

Figure 1.1 The concept of video transcoding
Figure 1.2 The basic coding structure of encoding a macroblock in H.264/AVC
Figure 1.3 Variable block size motion compensation in H.264/AVC
Figure 1.4 The cascaded pixel domain transcoder that outputs H.264/AVC encoded videos
Figure 1.5 A general framework of the open-loop video transcoding structure
Figure 1.6 A general framework of the closed-loop video transcoding structure
Figure 2.1 Cascaded pixel-domain transcoding structure
Figure 2.2 P frame residue data re-using transcoding framework
Figure 2.3 Interpolation of half-pixel values in (a) MPEG-2 and (b) H.264/AVC. (a) Half-pixel sample in MPEG-2 reconstructed I-frame $I_1^{M}$. (b) Half-pixel sample in H.264/AVC reconstructed I-frame $I_1^{H}$
Figure 2.4 Interpolation of chrominance components in MPEG-2 and H.264/AVC
Figure 2.5 The flowchart of the Inter MB transcoding scheme with re-quantization error compensation
Figure 2.6 The flowchart of the Inter MB transcoding scheme with re-quantization error and half-pixel / quarter-pixel interpolation error compensation
Figure 2.7 The PSNR of transcoded video obtained by the open-loop structure, cascaded structure and our proposed transcoding structure under different quantization parameters (quantizers)
Figure 2.8 Rate-distortion curves obtained by different transcoding schemes using four test video sequences (a) Akiyo sequence (b) Foreman sequence (c) Football sequence (d) Mobile sequence
Figure 2.9 PSNR values (including the Y and UV components) of each frame of the Mobile sequence obtained by different transcoding schemes (a) Mobile with GOP size equal to 15 (Y) (b) Mobile with GOP size equal to 15 (U) (c) Mobile with GOP size equal to 15 (V) (d) Mobile with one I-frame as the first frame and all other 99 frames are P-frames (Y)
Figure 2.10 Last frame in the second GOP of the Mobile test sequence obtained by different transcoding schemes (a) Open-loop structure (b) Proposed structure (c) Cascaded structure reusing MPEG-2 MVs (d) Cascaded structure performing ±2 pixels MV refinement
Figure 3.1 Variable block sizes supported by the H.264/AVC motion estimation process
Figure 3.2 Rate-distortion curves resulting from different transcoding schemes over different videos (a) Music Video (b) Commercial (c) Sports Scene (d) Movie
Figure 3.3 Proposed cascaded pixel-domain transcoding structure for P/B frames
Figure 3.4 Rate-distortion curves from different transcoding schemes over different SDTV sequences (a) Music Video (b) Commercial (c) Movie (d) Sport Scene
Figure 3.5 Rate-distortion curves from different transcoding schemes over different CIF sequences (a) Foreman (b) Football (c) Mobile (d) Sign Irene
Figure 4.1 An example showing 9 MBs (in the original video) being downscaled into one MB in the down-scaled video
LIST OF TABLES

Table 2.1 List of notations
Table 2.2 Average PSNR values (dB) obtained by the three different transcoding schemes
Table 2.3 Transcoding parameters of different transcoding schemes
Table 2.4 Operations and PSNR comparison of different transcoding schemes for one 16x16 MB (Y)
Table 3.1 Search level and coding parameters of different tests
Table 3.2 Average PSNR differences of different rate-distortion curves (dB)
Table 3.3 Key differences between the MotionMapping and our proposed schemes
Table 3.4 H.264/AVC encoding settings
Table 3.5 The PSNR values and bit-rates of the transcoded videos with/without using MPEG-2 MVs
Table 3.6 Differences between our proposed transcoding schemes and other schemes
Table 3.7 Average PSNR differences (to MPEG-2 encoded videos) between different schemes (300 frames)
Table 3.8 Average PSNR differences (to the original videos before MPEG-2 encoding) between different schemes (300 frames)
Table 3.9 Comparison of motion estimation execution time for different transcoding schemes (300 frames)
Table 3.10 Average transcoding execution time per-frame (seconds) in different transcoding schemes
Table 4.1 The number of operations needed for each method
Table 4.2 The encoding parameters of the original H.264/AVC videos which are the inputs to the transcoder
Table 4.3 Transcoding performance of the proposed method and FullSearch (with rate distortion optimization)
Table 4.4 Comparative results with the FullSearch method
Table 4.5 BD PSNR difference between our proposed algorithm and AWVMF
Table 4.6 Comparison of time needed to estimate the initial motion vectors for the downscaled videos (seconds)
Table 4.7 The BD-PSNR differences between FullSearch and our proposed algorithm with MV refinement
Table 4.8 Comparison of motion estimation time between FullSearch and the proposed algorithm with MV refinement

GLOSSARY

ATSC      Advanced Television Systems Committee
AVC       Advanced Video Coding
B-frame   Bi-Predictive Picture
CABAC     Context-Adaptive Binary Arithmetic Coding
CAVLC     Context-Adaptive Variable Length Coding
CIF/QCIF  Common Intermediate Format / Quarter CIF
CPDT      Cascaded Pixel-Domain Transcoder
CPU       Central Processing Unit
dB        Decibel
DCT       Discrete Cosine Transform
DTV       Digital TV
DVB       Digital Video Broadcasting
DVD       Digital Versatile Disc
DVR       Digital Video Recorder
FME       Fast Motion Estimation
FRExt     Fidelity Range Extensions
GOP       Group of Pictures
H/SDTV    High/Standard Definition TV
I-frame   Intra-coded Picture
IPTV      Internet Protocol Television
ISO       International Organization for Standardization
ITU-T     International Telecommunication Union, Telecommunication Standardization Sector
JVT       Joint Video Team
MAD       Mean of Absolute Differences
MB        Macroblock
ME/MC     Motion Estimation/Motion Compensation
MPEG      Moving Picture Experts Group
MV        Motion Vector
P-frame   Predictive Picture
PSNR      Peak Signal-to-Noise Ratio
QP        Quantization Parameter
RAM       Random Access Memory
RDO       Rate Distortion Optimization
SA(T)D    Sum of Absolute (Transformed) Differences
SSE       Sum of Squared Error
UMA       Universal Multimedia Access
VBSME/C   Variable Block-Size Motion Estimation / Compensation
VCEG      Video Coding Experts Group
VLC       Variable Length Code

ACKNOWLEDGEMENTS

It has been a long journey for me towards the Ph.D. degree. I am lucky that I have not walked all the way by myself. Without the assistance and support of a large number of people, this work would not have been done.

First, I would like to express my deep and sincere gratitude to my supervisors, Dr. Rabab Ward and Dr. Panos Nasiopoulos. Both of them are caring people who helped me in many ways beyond my research. As a profound and distinguished professor, Dr. Ward sets an example of being a great researcher. Dr. Nasiopoulos, as a prestigious professor, has succeeded in both academia and industry.
His sharpness not only directs my research, but also influences my attitude towards pursuing the goal of happiness. I would also like to express my gratitude to the committee members for their constructive feedback. In addition, I wish to express my sincere thanks to Professor Ricardo L. de Queiroz, my external examiner, for his insightful feedback and suggestions.

Second, I am deeply grateful to have the best lab-mates that one could ever have. Although we are an international group, our different backgrounds do not keep us from having great friendships with each other. The world is big, but here in our lab it is becoming smaller every day. I would like to thank my research group colleagues: Dr. Lino Coria for his jokes and patience, Dr. Merhdad Fatourechi for his insightful suggestions, Dr. Hassan Mansour for the deep discussions about anything, Dr. Shan Du and Dr. Qing Wang for sharing their valuable Ph.D. study experience, Zicong Mai for his warm-hearted character, Di Xu for her awesome personality, Ashfiqua Connie for her gentleness, Colin Doutre for letting me get to know an extraordinary Canadian and for teaching me genuine English, Matthias von dem Knesebeck for his all-round talents and kind character, Mahsa Pourazad for her strong personality, Sergio Infante for his smile, and Victor Sanchez for his free spirit.

Next, I wish to express my warm and sincere thanks to my friends in the UBC International Bible Study Fellowship and the Vancouver Westside Alliance Church. In addition, my warm thanks are due to many friends in Vancouver. Zheman Zhao and Xiaoping Hu gave me the feeling of home in Vancouver. Vancouver would be much colder without Jacky Zhang and Lesley Xie. Yifan Tian and Yangwen Liang, thank you for your friendship. Xudong Lv, little brother, we share a lot in common. Liwei and Yaolan Wang, thank you for your help.

Furthermore, I also wish to thank Professor Guiling Li and Dr. Yu Liu from my home university in China, Tianjin University.
The knowledge and training I received from them benefited me then, benefit me now, and will benefit me forever.

Last but not least, I owe my thanks to my loving wife, Miao Yao, for her constant support, understanding and encouragement. To my parents, Dinglv Tang and Ling Li, thank you for bringing me into this world and taking care of me all these years. To my sister, Jian Tang, who is not only my sister but also my best friend, thank you.

The financial support of the University of British Columbia is gratefully acknowledged.

DEDICATION

To My Parents and My Wife

CO-AUTHORSHIP STATEMENT

This thesis presents research conducted by Qiang Tang, in collaboration with Dr. Panos Nasiopoulos and Dr. Rabab Ward. This is a manuscript-based thesis constructed around the three manuscripts described below.

Manuscript 1: "Compensation of Re-Quantization and Interpolation Errors in MPEG-2 to H.264/AVC Transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 314-325, Mar. 2008. The identification and design of the research program, the research and the data analysis were performed by Qiang Tang. The manuscript is the work of Qiang Tang, who received suggestions and feedback from Dr. Nasiopoulos and Dr. Ward.

Manuscript 2: "Efficient Motion Re-Estimation with Rate-Distortion Optimization for MPEG-2 to H.264/AVC Transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 2, pp. 262-274, Feb. 2010. The identification and design of the research program, the research and the data analysis were performed by Qiang Tang. The manuscript is the work of Qiang Tang, who received suggestions and feedback from Dr. Nasiopoulos.

Manuscript 3: "An Efficient Motion Re-Estimation Scheme for H.264/AVC Video Transcoding with Arbitrary Downscaling Ratios," submitted to IEEE Trans. Circuits Syst. Video Technol., Feb. 2010. The identification and design of the research program, the research and the data analysis were performed by Qiang Tang.
The manuscript is the work of Qiang Tang, who received suggestions and feedback from Dr. Nasiopoulos and Dr. Ward.

The first and last chapters of the thesis were written by Qiang Tang, with editing assistance and consultation from Dr. Panos Nasiopoulos and Dr. Rabab Ward.

Chapter 1: INTRODUCTION AND OVERVIEW

1.1 Introduction

Many video coding standards have been established during the past two decades. The two groups involved in the video coding standardization process are the Moving Picture Experts Group (MPEG), which belongs to the International Organization for Standardization (ISO), and the Video Coding Experts Group (VCEG) of the ITU Telecommunication Standardization Sector (ITU-T), where ITU stands for International Telecommunication Union. The video coding standards developed by the MPEG group are named MPEG-x, where x indicates the version of the standard, such as MPEG-2 [1] or MPEG-4 [2]. The video coding standards proposed by the VCEG group are named H.26x, where x again refers to the version of the standard, such as H.263 [3]. Since 2002, the MPEG and VCEG groups have worked together as the Joint Video Team (JVT) to develop the next generation of video coding standards. The output of this cooperation is the latest video coding standard, referred to as H.264/AVC, where AVC stands for Advanced Video Coding [4]. The MPEG group refers to this standard as MPEG-4 Part 10 Advanced Video Coding, and the VCEG group names it H.264. As a result, this standard is usually referred to as H.264/AVC. Compared to previous standards, H.264/AVC increases the video compression efficiency by about 50% while maintaining the same picture quality [5]. For this reason, H.264/AVC has quickly gained ground and has had a great impact on the video industry [6].

Providing universal access to end-users is the ultimate goal of the communications, entertainment and broadcasting industries. This is not an easy task for many reasons.
For instance, since video coding standards evolve with time and many of them have been proposed over the years, content has been encoded using a variety of coding standards [7]. Another challenging factor is the large variety of available networks and their bandwidth limitations, which follow directly from the underlying application [8]. On top of that, the quality of experience also depends on the display device available to the user. For all these reasons, providing universal multimedia access (UMA) has become one of the hottest research topics of the last decade. To this end, many different approaches have been proposed, with transcoding emerging as the most promising and efficient solution to this problem [9]-[11]. By definition, video transcoding converts videos from one format to another, in terms of bit-rate, spatial resolution, temporal resolution and video coding standard (refer to Figure 1.1).

[Figure 1.1 The concept of video transcoding: a transcoder converts an input video with bit rate R1, frame rate F1, resolution S1 and coding standard C1 into an output video with bit rate R2, frame rate F2, resolution S2 and coding standard C2.]

H.264/AVC has become the coding standard of choice for broadcasting and entertainment (e.g., DVD/Blu-ray), meaning that the latest set-top boxes and playback devices support this new standard. Since many existing videos have been encoded using previous video coding standards, playing them back on the new devices is possible only if they are transcoded into the H.264/AVC format [12]-[15]. In addition, even when videos are compressed using the H.264/AVC video coding standard, transmitting them over different networks for different user applications (e.g., mobile phones, TV) requires transcoding to adapt them to different bandwidth and resolution requirements [16]. One additional challenge in transcoding to H.264/AVC is the fact that this standard is not backward compatible with previous video standards [17].
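The four format parameters named in the transcoding definition and in Figure 1.1 (bit rate R, frame rate F, resolution S, coding standard C) can be written down as a small record. The following is only an illustrative sketch: the class name `VideoFormat`, the helper `changed_parameters`, and the numeric values for the two example formats are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VideoFormat:
    """Hypothetical record of the four parameters shown in Figure 1.1."""
    bit_rate_kbps: int      # R: coded bit rate
    frame_rate_fps: float   # F: temporal resolution
    width: int              # S: spatial resolution (width in pixels)
    height: int             # S: spatial resolution (height in pixels)
    coding_standard: str    # C: e.g. "MPEG-2" or "H.264/AVC"


def changed_parameters(source: VideoFormat, target: VideoFormat) -> list:
    """Names of the parameters a transcoder would have to change."""
    return [f.name for f in fields(source)
            if getattr(source, f.name) != getattr(target, f.name)]


# Illustrative values only (assumed, not measured).
dvd = VideoFormat(6000, 30.0, 720, 480, "MPEG-2")
mobile = VideoFormat(512, 15.0, 352, 288, "H.264/AVC")
print(changed_parameters(dvd, mobile))
```

A transcoding task that changes only `coding_standard` corresponds to the MPEG-2 to H.264/AVC case, while one that changes `width`, `height` and `bit_rate_kbps` corresponds to downscaling within the same standard.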
Among the H.264/AVC related video transcoding applications, transcoding from the MPEG-2 to the H.264/AVC format stands out as the most popular application. MPEG-2 is presently the video coding standard used in most consumer products, ranging from digital TV broadcasting to DVD. Therefore, at present, a large amount of video in the consumer electronics market is encoded in the MPEG-2 format. As H.264/AVC is quickly gaining ground in many applications such as DTV, Blu-ray, and mobile applications [6], it is expected that MPEG-2 and H.264/AVC will co-exist in the near future.

Besides the need for MPEG-2 to H.264/AVC transcoding, H.264/AVC to H.264/AVC transcoding that involves resolution downscaling has also attracted attention recently. The popularity of this type of application stems from the ability of H.264/AVC to support a much wider variety of applications than the video coding standards preceding it. H.264/AVC has been adopted in both DTV and smart phone (e.g., iPhone, BlackBerry, etc.) applications [6], which use different transmission networks. Depending on the available bandwidth, modulation, and channel conditions, different networks have different rate capacities. As a result, H.264/AVC videos are encoded at different bit-rates so as to adapt to the diverse networks. Therefore, when the bit-rate of a transmitted H.264 encoded video stream exceeds the bit-rate constraint of a network, a video transcoder is needed to reduce the coded video bit-rate and satisfy the network requirement [18]. A video transcoder can reduce the video bit-rate by increasing the quantization step, reducing the spatial resolution, and/or decreasing the temporal resolution (frame rate). Since mobile devices usually rely on wireless networks and typically have low spatial resolutions, the first two schemes are better choices for reducing the bit-rates of videos for wireless networks.
In the remainder of this chapter, we give an overview of the H.264/AVC video coding standard and the traditional video transcoding structures. After that, we present the challenges of H.264/AVC transcoding applications. Next, we introduce the thesis statement and list the research objectives. Finally, we summarize our thesis contributions and the following chapters.

1.2 Overview of H.264/AVC Video Coding Standard and Traditional Transcoding Structures

1.2.1 Overview of H.264/AVC video coding standard

The H.264/AVC video coding standard introduces many new coding tools that make H.264/AVC very different from previous video coding standards. The intention in designing this standard was to outperform any previous standard and, for this reason, backward compatibility was not a requirement. Traditional transcoding techniques thus cannot be directly applied to transcoding applications involving H.264/AVC. In order to better grasp the challenges and the corresponding solutions of H.264/AVC transcoding applications, a brief overview of H.264/AVC and its new features is given below.

General description

A video sequence is defined as a set of successive still images. Each still image is referred to as a picture frame. The frame rate determines the time interval between two consecutive frames. Frequently used frame rates range from 15 to 30 frames per second, depending on the target video application. In the encoding process, each frame is subdivided into macroblocks (MBs) of 16x16 pixels. A basic coding structure for encoding an MB in H.264/AVC is given in Figure 1.2.
Figure 1.2 The basic coding structure of encoding a macroblock in H.264/AVC (blocks: integer transform, quantization, entropy coding, inverse quantization, inverse integer transform, deblocking filter, frame buffer, motion estimation, motion compensation, intra prediction, Inter/Intra switch)

In Figure 1.2, the grids inside the input image represent the MBs. When an MB enters an H.264/AVC encoder, the encoder first finds, in the previous frames or in the already encoded region of the current frame, the block that forms the closest approximation of this MB. This approximation is then subtracted from the MB to produce the prediction errors of the MB. Finally, these prediction errors are transformed, quantized and entropy coded to generate the binary-coded video stream. To provide the references for finding the estimate of the current MB, a frame buffer is used to store several previous frames and the encoded MBs of the current frame. Inter prediction uses the previous frames as the references, while Intra prediction uses the already encoded part of the current frame as the reference. The encoder decides between these two types of prediction (the switch in Figure 1.2). Below we give more details about these two types of prediction.

Inter prediction aims at reducing the temporal redundancy of the video content, since two consecutive video frames are usually very similar to each other. During inter prediction, the motion estimation process first finds the best match of the current MB in the previous frames. Then the reference $m_p$ is subtracted from the original macroblock $m_o$ to form the prediction errors $m_r$ of the MB, i.e., $m_r = m_o - m_p$. At the same time, a displacement vector is generated that stores the location of the reference of the original macroblock. This displacement vector is referred to as the motion vector in video coding standards.

Intra prediction means finding the reference of the current MB in the already encoded region of the current frame (located to the left of and above the current MB).
Intra prediction is one of the new coding tools introduced in the H.264/AVC standard. It aims at reducing the spatial redundancy of the video content in the pixel domain.

Besides this general description of the H.264 encoding process, in what follows we give more details on the new coding tools that are related to the transcoding problems addressed in this thesis.

Variable block-size motion estimation/compensation with quarter-pixel accuracy

During Inter prediction, the motion estimation process searches for an area in the reference frame that best matches the current MB. After that, the motion compensation process subtracts the reference from the original MB. To obtain a better match, the precision of the motion compensation is often at the sub-pixel level. For instance, MPEG-2 uses half-pixel motion compensation and H.264/AVC uses quarter-pixel motion compensation. The values of the sub-pixels are found by interpolating the integer pixels in the reference frame. As the common size of the search range is ±32 pixels, motion estimation is regarded as the most time-consuming process in video coding, especially when the exhaustive full search is employed.

In MPEG-2, motion estimation (ME) is implemented using only the 16x16 pixel block size, or 16x8 in field mode. On the other hand, H.264 performs motion estimation for a variety of block sizes that, in addition to 16x16, include 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 blocks. This means that one MB can be subdivided into smaller blocks, each with its own independent motion vector (Figure 1.3). This feature is referred to as variable block size motion compensation (VBSMC) [17]. It results in much smaller prediction errors for the motion compensation and thus greatly improves the compression performance of the encoded video. However, VBSMC also dramatically increases the computational expense of the motion estimation process.
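To make the cost of the exhaustive search concrete, the following is a minimal sketch of integer-pixel full-search block matching. It assumes the sum of absolute differences (SAD) as the matching criterion, which the text above does not prescribe; the function names, the toy 12x12 frames and the ±3 search range are illustrative only. With the ±32 range mentioned above, the double loop would visit up to (2·32+1)² = 4225 integer positions for every block, and H.264/AVC repeats such a search for each of its block sizes.

```python
def sad(cur, ref, bx, by, rx, ry, bw, bh):
    """Sum of absolute differences between a block of `cur` at (bx, by)
    and a block of `ref` at (rx, ry), both of size bw x bh."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(bh)
        for i in range(bw)
    )


def full_search(cur, ref, bx, by, bw, bh, search_range):
    """Exhaustive integer-pixel search around (bx, by).

    Returns ((dx, dy), best_sad), where (dx, dy) is the displacement from the
    current block position to its best match in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), sad(cur, ref, bx, by, bx, by, bw, bh)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + bw > width or ry + bh > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, bw, bh)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Toy reference frame with a non-repeating pattern (illustrative data only).
ref_frame = [[(7 * x + 13 * y) % 31 for x in range(12)] for y in range(12)]
# Current frame: the reference content moved 2 pixels right and 1 pixel down.
cur_frame = [
    [ref_frame[max(y - 1, 0)][max(x - 2, 0)] for x in range(12)]
    for y in range(12)
]

mv, cost = full_search(cur_frame, ref_frame, bx=4, by=4, bw=4, bh=4, search_range=3)
print("motion vector:", mv, "SAD:", cost)
```

In this toy setup the 4x4 block at (4, 4) of the current frame is found two pixels to the left and one pixel up in the reference frame, so the residual after motion compensation is zero. A practical encoder would add sub-pixel refinement on interpolated samples, which this sketch leaves out.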
Figure 1.3 Variable block size motion compensation in H.264/AVC (a 16x16 MB partitioned as 16x16, 16x8, 8x16 or 8x8; each 8x8 sub-block partitioned as 8x8, 8x4, 4x8 or 4x4; partitions labelled b0 to b3)

In H.264/AVC, since one MB can use 7 block sizes for motion compensation, the block size to use must be decided during the encoding process. The next section explains how this decision is made.

Rate distortion optimization

Rate distortion optimization (RDO) helps the encoder decide among different encoding modes (e.g., the block size for the motion compensation). The goal of RDO is to achieve the best picture quality using the lowest bit-rate. RDO is not a new concept in video coding, but it is particularly important in H.264/AVC encoding due to the many coding options supported by this standard [17]. The most popular method for achieving RDO uses the Lagrangian technique [19]. The objective function of this optimization problem calculates the Lagrangian cost of the different encoding modes. The general formulation of the Lagrangian cost J is as follows:

$$J = D + \lambda \times R \tag{1-1}$$

where D is the distortion of the reconstructed video (after encoding and decoding), R is the number of bits needed to encode the corresponding MB, and λ is the Lagrangian multiplier, which is related to the quantization parameter. To decide among the different block sizes, the Lagrangian cost J of each block size is calculated. The one that offers the lowest cost is chosen as the final block size for partitioning the current MB. To obtain the accurate Lagrangian cost, one needs to calculate the accurate D and R. This means that each MB has to be completely encoded many times (once for each block size), although only one block size is finally chosen. Therefore, the computational complexity of an accurate RDO can be very high.

After an appropriate block size is chosen, the prediction errors are obtained through the motion compensation process.
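The mode decision based on (1-1) can be sketched in a few lines. The distortion and rate figures below are made-up numbers for a single macroblock, and the helper names are hypothetical; in a real encoder each (D, R) pair would come from actually encoding the MB with that block size, which is exactly the expensive part described above.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """Lagrangian cost J = D + lambda * R, as in (1-1)."""
    return distortion + lam * rate_bits


def choose_mode(candidates, lam):
    """Return the (name, D, R) candidate with the smallest Lagrangian cost."""
    return min(candidates, key=lambda c: lagrangian_cost(c[1], c[2], lam))


# Hypothetical (block size, distortion, bits) triples for one macroblock.
# Smaller partitions lower the distortion but cost more bits.
candidates = [
    ("16x16", 900.0, 40),
    ("16x8", 700.0, 70),
    ("8x8", 500.0, 130),
]

for lam in (1.0, 5.0, 20.0):
    name, d, r = choose_mode(candidates, lam)
    print(f"lambda={lam:5.1f} -> {name} (J={lagrangian_cost(d, r, lam):.0f})")
```

With these illustrative numbers, a small λ favours the 8x8 partition because distortion dominates the cost, while a large λ (coarse quantization) favours 16x16 because the bits spent on extra motion vectors and residuals are penalised more heavily.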
The resulting prediction errors are then transformed and quantized as explained below.

Transform and quantization

It is important to understand the spatial transform used in H.264/AVC, since it helps explain the challenges in transform-domain MPEG-2 to H.264/AVC transcoding. Transform-domain MPEG-2 to H.264/AVC transcoding reuses the MPEG-2 DCT coefficients, which differ from the transform coefficients used in the H.264/AVC encoding process. In H.264/AVC, the transform coding uses a 4x4 integer transform, which is derived from the 4x4 DCT [20]. Although the Fidelity Range Extensions (FRExt) of H.264/AVC also support an 8x8 integer transform [21], the problems addressed in this thesis do not relate to FRExt. Therefore, FRExt is not considered.

Transform

Given a 4x4 signal matrix $\mathbf{x}$, the corresponding 4x4 DCT coefficients $\mathbf{X}$ are obtained using $\mathbf{X} = H\mathbf{x}H^T$ (where T denotes transposition). H is the 4x4 DCT kernel matrix whose entry in the kth row and nth column is given by:

$$H_{kn} = c_k \sqrt{\frac{2}{4}} \cos\left[\left(n + \frac{1}{2}\right)\frac{k\pi}{4}\right] \tag{1-2}$$

where $c_0 = 1/\sqrt{2}$ and $c_k = 1$ when k > 0. The original 4x4 DCT kernel matrix (1-2) contains irrational numbers. Digital computers represent an irrational number by approximating it using a rounding procedure. However, the rounding errors cause a mismatch between the original signal and its reconstructed signal, even though the DCT is an orthogonal transform. Thus, to obtain an orthogonal transform that is similar to the DCT but has integer entries, H.264/AVC uses an integer version of a scaled 4x4 DCT [19]. The kernel matrices of the H.264/AVC forward and inverse 4x4 integer transforms are shown below:

$$H = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}, \qquad IH = \begin{bmatrix} 1 & 1 & 1 & \tfrac{1}{2} \\ 1 & \tfrac{1}{2} & -1 & -1 \\ 1 & -\tfrac{1}{2} & -1 & 1 \\ 1 & -1 & 1 & -\tfrac{1}{2} \end{bmatrix} \tag{1-3}$$

where H refers to the kernel matrix of the forward 4x4 integer transform and IH stands for the kernel matrix of the inverse 4x4 integer transform.
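The two kernels in (1-3) can be checked numerically. The short sketch below, which uses exact rational arithmetic and helper names of our own choosing, verifies two properties: the rows of H are mutually orthogonal but not of unit length (H·Hᵀ is diagonal with entries 4, 10, 4, 10), and IH equals Hᵀ with its second and fourth columns halved.

```python
from fractions import Fraction as F

# Forward and inverse 4x4 integer-transform kernels from (1-3).
H = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]
IH = [
    [1, 1, 1, F(1, 2)],
    [1, F(1, 2), -1, -1],
    [1, F(-1, 2), -1, 1],
    [1, -1, 1, F(-1, 2)],
]


def matmul(a, b):
    """Plain 4x4 (or any compatible size) matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


# Rows of H are orthogonal, but their squared norms are 4 and 10, not 1.
gram = matmul(H, transpose(H))

# IH is H transposed with columns 1 and 3 scaled by 1/2.
col_scale = [1, F(1, 2), 1, F(1, 2)]
ih_from_h = [[H[j][i] * col_scale[j] for j in range(4)] for i in range(4)]

print(gram)
print(ih_from_h == IH)
```

Because the row norms differ from one, the integer transform is orthogonal only up to a per-coefficient scale, so a rescaling step is needed somewhere between the forward and inverse transforms.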
The H.264 integer transform (1-3), however, introduces a gain in the energy of the transform coefficients. In order to achieve perfect reconstruction, the integer transform coefficients must be scaled down. This is achieved by applying the following scaling matrix before the inverse transform [22]:

$$V = \begin{bmatrix} \tfrac{1}{16} & \tfrac{1}{20} & \tfrac{1}{16} & \tfrac{1}{20} \\ \tfrac{1}{20} & \tfrac{1}{25} & \tfrac{1}{20} & \tfrac{1}{25} \\ \tfrac{1}{16} & \tfrac{1}{20} & \tfrac{1}{16} & \tfrac{1}{20} \\ \tfrac{1}{20} & \tfrac{1}{25} & \tfrac{1}{20} & \tfrac{1}{25} \end{bmatrix} \tag{1-4}$$

Richardson's analysis in [23] separates V into two scaling matrices, one for the forward transform and the other for the inverse transform:

$$S^{For} = \begin{bmatrix} a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \\ a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \end{bmatrix}, \qquad S^{Inv} = \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix} \tag{1-5}$$

where $a = 1/2$ and $b = \sqrt{2/5}$. $S^{For}$ is the scaling matrix for the forward transform, and $S^{Inv}$ is the scaling matrix for the inverse transform. This can be illustrated as follows:

$$\mathbf{X}_S = \mathbf{X} \otimes S^{For} = (H \times \mathbf{x} \times H^T) \otimes S^{For}, \qquad \mathbf{x} = IH \times (\mathbf{X}_S \otimes S^{Inv}) \times IH^T \tag{1-6}$$

where $\mathbf{X}_S$ stands for the scaled integer transform coefficients and $\otimes$ denotes element-by-element multiplication.

Quantization

The two scaling matrices above are absorbed into the quantization and de-quantization processes in the H.264/AVC standard. Furthermore, in order to avoid division operations and reduce the rounding errors, H.264/AVC implements the quantization using multiplication factors (MF) and the right-shift operation. Eventually, the multiplication factors and the scaling matrices ($S^{For}$ and $S^{Inv}$) are all combined together. This is illustrated in the following equations:

$$\mathbf{X}_Q = \operatorname{sign}(\mathbf{X}) \left[ \left( |\mathbf{X}| \otimes MF(Q_m) + f \times 2^{15+Q_e} \right) \gg (15 + Q_e) \right], \qquad \mathbf{X}_R = \mathbf{X}_Q \otimes S(Q_m) \times 2^{Q_e} \tag{1-7}$$

where $\gg$ stands for the right-shift operation, $Q_m = Q \bmod 6$, and $Q_e = \lfloor Q/6 \rfloor$. $\mathbf{X}_Q$ and $\mathbf{X}_R$ are the quantized and the corresponding reconstructed integer transform coefficients, respectively.
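As a side check on the scaling step in (1-4) to (1-6), the sketch below verifies with exact fractions that applying V between the forward and inverse kernels reconstructs an arbitrary 4x4 block without error. It works with V directly rather than with the split into the two scaling matrices of (1-5), whose element-by-element product equals V (for example (ab/2)·(ab) = a²b²/2 = 1/20). The sample block and helper names are illustrative, and quantization is deliberately left out.

```python
from fractions import Fraction as F

H = [
    [1, 1, 1, 1],
    [2, 1, -1, -2],
    [1, -1, -1, 1],
    [1, -2, 2, -1],
]
IH = [
    [1, 1, 1, F(1, 2)],
    [1, F(1, 2), -1, -1],
    [1, F(-1, 2), -1, 1],
    [1, -1, 1, F(-1, 2)],
]

# V from (1-4) is the outer product of (1/4, 1/5, 1/4, 1/5) with itself,
# which yields the 1/16, 1/20, 1/25 pattern.
v = [F(1, 4), F(1, 5), F(1, 4), F(1, 5)]
V = [[v[i] * v[j] for j in range(4)] for i in range(4)]


def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward(x):
    """Unscaled forward integer transform X = H x H^T."""
    return matmul(matmul(H, x), transpose(H))


def inverse(coeffs):
    """Scale by V element by element, then apply IH (.) IH^T."""
    scaled = [[coeffs[i][j] * V[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(IH, scaled), transpose(IH))


# Arbitrary sample block (illustrative values only).
x = [
    [5, 11, 8, 10],
    [9, 8, 4, 12],
    [1, 10, 11, 4],
    [19, 6, 15, 7],
]
roundtrip = inverse(forward(x))
print(roundtrip == x)
```

The round trip is exact because no rounding takes place here; in the codec itself the only loss comes from the quantization that the scaling is merged with.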
In (1-7), $MF(Q_m)$ and $S(Q_m)$ are the scaling matrices that include the effects of both the scaling factors for the integer transform and the scaling factors for removing the division operation during quantization. Below we also give a simplified expression of the H.264/AVC quantization and its corresponding reconstruction process. To interpret the H.264/AVC quantization as a scalar quantization scheme, the quantization process and its corresponding reconstruction process can be expressed using the following equations, which are explained in [23]:

$$\mathbf{X}_Q = \operatorname{sign}(\mathbf{X}) \left\lfloor \frac{|\mathbf{X}| \otimes S^{For}}{Q_s} + f \right\rfloor, \qquad \mathbf{X}_R = (\mathbf{X}_Q \otimes S^{Inv}) \times Q_s \tag{1-8}$$

where $Q_s$ denotes the quantization step, and f is the rounding factor that controls the range of the dead-zone. Finally, the reconstructed coefficients $\mathbf{X}_R$ go through the following equation to produce the pixel-domain reconstructed signal $\mathbf{x}_r$:

$$\mathbf{x}_r = \left( IH \times \mathbf{X}_R \times IH^T + 2^5 \right) \gg 6 \tag{1-9}$$

In what follows, we present an overview of traditional transcoding structures.

1.2.2 Overview of traditional transcoding structures

Transcoding techniques have evolved over the last two decades. In 2003, Vetro et al. gave the first overview of video transcoding architectures and techniques [9]. After that, in 2005, Xin et al. and Ahmad et al. presented overviews of digital video transcoding in [10] and [11], respectively. These overviews discussed many transcoding structures that are used in different scenarios. In general, traditional transcoding structures can be categorized into three types: the cascaded pixel-domain transcoder, the open-loop transcoder, and the closed-loop transcoder. The loop refers to the loop in the motion compensation process that reconstructs the encoded frame. Below we briefly introduce the general workflow of these three types of transcoding structure.

Cascaded pixel-domain transcoding structure

The cascaded pixel-domain transcoding structure (CPDT) is the most straightforward structure.
It cascades a full decoder and a full encoder, so that the original video is fully decoded back to the pixel domain and the decoded video is then re-encoded into the new format. A general framework of a CPDT that transcodes a P frame is illustrated in Figure 1.4. This figure demonstrates the basic idea of how a video decoder is cascaded with a video encoder. The CPDT for a particular transcoding application may differ slightly from the example shown in Figure 1.4.

Figure 1.4 The cascaded pixel domain transcoder that outputs H.264/AVC encoded videos (decoder: entropy decoding, inverse quantization/transform, MC, frame store; encoder: transform, quantization, entropy encoding, inverse quantization/transform, ME, MC, frame store; MC: motion compensation, ME: motion estimation)

Being the most straightforward structure, CPDT has the highest computational complexity among all existing transcoding structures. To reduce the computational complexity, a common practice is to re-employ the motion vectors of the original video as the motion vectors of the output video. This avoids performing another motion estimation process when encoding the resultant transcoded video. To improve the compression performance in this scenario, many researchers have proposed using the original motion vectors as the initial motion vectors and then searching for the motion vectors of the output video within a very small range [24], [25]. In this case, the compression performance can be significantly improved with reasonable computational complexity. In summary, CPDT has the highest computational complexity among the three transcoding structures.

Open-Loop transcoding structure

A general framework of the open-loop video transcoding structure is shown in Figure 1.5. Inside the open-loop structure, the input video is entropy decoded and inverse quantized to produce the reconstructed transform coefficients.
After that, the reconstructed transform coefficients are re-quantized (the input video has already been quantized once) and entropy coded to produce the new video. The open-loop structure carries out all the transcoding computations in the transform domain. Therefore, the underlying assumption behind the open-loop structure is that the output video uses the same transform coefficients as the input video. For this reason, the open-loop structure suffers from drift and is mostly used for transcoding within the same standard, e.g., MPEG-2 to MPEG-2 bit-rate reduction transcoding [9].

Figure 1.5 A general framework of the open-loop video transcoding structure (input video, entropy decoding, inverse quantization, quantization, entropy encoding, output video)

Among the three types of transcoding structures, the open-loop structure has the fastest transcoding speed and the lowest computational complexity. The disadvantage of this approach is that its transcoded videos often suffer from poor picture quality. For this reason, the open-loop structure is not commonly used in practice, especially for transcoding between different video coding standards. However, since this structure offers the lowest computational complexity, it is often used as a comparison reference.

Closed-Loop transcoding structure

To improve the picture quality of the videos produced by the open-loop transcoding structure, a closed-loop structure was proposed [26]. If we take the open-loop structure shown above as an example of MPEG-2 to MPEG-2 bit-rate reduction transcoding, a corresponding closed-loop structure is shown in Figure 1.6. Unlike the open-loop structure, which has no motion compensation loop, the closed-loop structure has one motion compensation loop on the encoder side. This loop records the errors caused by the re-quantization process of the encoder, and uses them to adjust the future coefficients so as to compensate for the re-quantization errors.
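The difference between the two structures can be illustrated with a deliberately simplified model. The sketch below tracks a single transform coefficient over a run of P frames and assumes zero motion, so that the error frame buffer reduces to one number carried from frame to frame; the function names, the residual values and the step size are all illustrative, and this is not the structure of Figure 1.6 in full.

```python
def requantize(value, step):
    """Uniform re-quantization followed by reconstruction
    (rounds to the nearest multiple of `step`)."""
    return step * ((value + step // 2) // step)


def open_loop(residuals, step):
    """Re-quantize every residual independently.

    Returns the drift, i.e. the difference between what the output decoder
    and the original decoder have accumulated after the last frame.
    """
    original = transcoded = 0
    for r in residuals:
        original += r                      # decoder of the input stream
        transcoded += requantize(r, step)  # decoder of the transcoded stream
    return transcoded - original


def closed_loop(residuals, step):
    """Feed the previous re-quantization error back before re-quantizing."""
    original = transcoded = 0
    error = 0  # stands in for the error frame buffer (zero-motion assumption)
    for r in residuals:
        original += r
        compensated = r + error            # adjust the incoming coefficient
        q = requantize(compensated, step)
        error = compensated - q            # record the new re-quantization error
        transcoded += q
    return transcoded - original


residuals = [3] * 8  # small residuals that a step of 8 rounds to zero
print("open-loop drift:  ", open_loop(residuals, step=8))
print("closed-loop drift:", closed_loop(residuals, step=8))
```

In the open-loop case every residual is rounded away and the errors add up from frame to frame, whereas with feedback the accumulated error never exceeds half a quantization step. A real closed-loop transcoder has to motion-compensate the stored error before adding it, which is where the extra computational complexity comes from.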
Therefore, the closed-loop structure shown in our example aims at improving the video quality of P frame transcoding, albeit at the expense of a higher computational complexity compared to the open-loop structure. Nevertheless, the added computational complexity of the closed-loop structure is still much less than that of the CPDT structure.

Figure 1.6 A general framework of the closed-loop video transcoding structure (entropy decoding, inverse quantization, quantization and entropy encoding, with a feedback loop on the encoder side consisting of inverse quantization, input motion vector, error frame buffer and transform)

Another advantage of the closed-loop structure is that it offers great flexibility in the design of the loop. Although all designs rely on a motion compensation loop, the detailed structure inside the loop can differ from one design to another. This flexibility also enables the designer to adopt different structures in different applications. Keeping a reasonable trade-off between transcoding speed and video quality is the key to designing a successful closed-loop transcoding structure.

1.3 Challenges in H.264/AVC Transcoding Applications

As discussed in Section 1.1, among the H.264/AVC related video transcoding applications, transcoding from MPEG-2 encoded videos to H.264/AVC encoded videos stands out as the most popular application. In order to achieve the fastest transcoding speed, the MPEG-2 DCT coefficients are re-utilized during transcoding. However, due to the large differences between the MPEG-2 8x8 DCT coefficients and the H.264/AVC 4x4 integer transform coefficients, the MPEG-2 DCT coefficients cannot be directly re-utilized. To this end, a method has been proposed that efficiently converts one block of 8x8 DCT coefficients into four blocks of 4x4 integer transform coefficients (DCT-to-HT) in the transform domain [27].
Using this DCT-to-HT conversion method, a transform-domain structure for transcoding a coded MPEG-2 I frame to an H.264/AVC I frame is proposed in [28]. When transcoding a coded MPEG-2 P frame to an H.264/AVC P frame, the transform-domain structure that re-utilizes the MPEG-2 DCT coefficients has some inherent video distortions. These distortions severely affect the picture quality of the resultant H.264/AVC video and are a major challenge for this type of transcoding [29].

The main advantage of the above transform-domain transcoding structure is the speed of converting MPEG-2 to H.264/AVC, while its disadvantage is that it limits the compression efficiency of the transcoded H.264/AVC video to the options offered by MPEG-2. Most of the existing MPEG-2 to H.264/AVC transcoding methods use the cascaded pixel-domain transcoding structure to improve the compression efficiency of the resultant H.264/AVC video by implementing an efficient motion estimation process [30]-[33]. However, the computational complexity of motion re-estimation is very high, especially since H.264/AVC supports variable block-size motion estimation, which requires a separate motion search for each of the 7 block sizes, i.e., 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4. The computational complexity of the existing methods that address pixel-domain MPEG-2 to H.264/AVC transcoding is therefore very high and leaves room for further improvement.

Another popular H.264/AVC transcoding application is the conversion of a coded large-resolution H.264/AVC video into a downscaled version. One challenge in this process is to find the motion vectors of the downscaled video during re-encoding. Since the motion vectors of the large-resolution video cannot be directly used as the motion vectors for the downscaled video, a motion re-estimation process is needed. However, motion estimation is known to be computationally intensive.
Traditional methods compose the motion vectors of the downscaled video from the motion vectors of the original video [34]. However, the H.264/AVC variable block-size motion estimation introduces new challenges. In addition, most existing methods only address the 2x2:1x1 downscaling ratio, since transcoding coded H.264/AVC video to arbitrary downscaling ratios involves high computational complexity [35]-[36].

1.4 Thesis Objectives

In general, video transcoding technologies share many aspects with video coding technologies. However, they also differ in important ways. Since the video has already been encoded before the transcoding stage, the transcoder has access to many coding parameters already determined by the first encoding. These parameters not only help to reduce the computational complexity during transcoding, but also provide important information about the video characteristics. It is this information that allows more efficient encoding during the transcoding step. Overall, the general objective in transcoding is to achieve the best possible video quality at the lowest possible computational complexity [9]-[11]. Regarding the transcoding applications discussed in this thesis, the objectives of the corresponding research are listed below.

1. Develop an efficient distortion-compensation scheme for MPEG-2 to H.264/AVC transform-domain transcoding. When MPEG-2 DCT coefficients are re-employed during transcoding, several inherent distortions exist. These distortions cause unacceptable picture quality degradation in the transcoded video. Compensating for these distortions at a reasonable computational cost is the primary goal for this type of transcoding.

2. Propose a solution to achieve efficient pixel-domain MPEG-2 to H.264/AVC transcoding. In an MPEG-2 to H.264/AVC transcoding system, one of the primary objectives is to reduce the computational complexity of the overall system.
As H.264/AVC provides many new coding options, it is essential to derive the corresponding coding decisions from existing MPEG-2 coding information in order to minimize the overall complexity of a transcoding scheme. In that respect, a significant part of the overall complexity is due to the variable block-size motion estimation (VBSME) process in H.264/AVC. Therefore, our goal is to reduce the computational complexity of VBSME during pixel-domain MPEG-2 to H.264/AVC transcoding.

3. Develop an algorithm that accelerates the H.264/AVC encoding process of the downscaled video. In an H.264/AVC downscaling transcoder, one of the most time-consuming processes is finding the motion vectors for the downscaled video. One common practice is to estimate the motion vectors of the downscaled video by using the available motion vectors of the original higher-resolution video. Our aim is to find a solution that requires the least computational complexity for predicting the motion vectors of the downscaled video. At the same time, this solution should offer better, or at least the same, accuracy compared to existing state-of-the-art methods.

1.5 Thesis Contributions

The main contributions of this thesis are summarized as follows:

We present a theoretical analysis of the inherent distortions caused by re-employing the MPEG-2 DCT coefficients in transform-domain MPEG-2 to H.264/AVC transcoding. The analysis offers a useful reference for understanding the distortions in any transform-domain based transcoding that outputs H.264/AVC coded videos.

We develop an efficient scheme that compensates for the distortions caused by the re-quantization and interpolation errors inherent in transform-domain based MPEG-2 to H.264/AVC transcoding. The traditional re-quantization error compensation algorithm for DCT coefficients is updated so that it can be applied to the H.264 integer transform coefficients.
Equations that compensate for the luminance half-pixel and chrominance quarter/three-quarter pixel interpolation errors are derived. Our proposed algorithm greatly improves the resultant video quality compared to the open-loop structure. At the same time, it requires less computational complexity than the cascaded pixel-domain structure.

We demonstrate the importance of enabling rate-distortion optimization (RDO) during MPEG-2 to H.264/AVC transcoding. Most of the existing approaches choose not to use RDO during transcoding in order to reduce the computational complexity. However, ignoring RDO severely degrades the quality of the transcoded H.264/AVC videos. Our thorough experiments show that RDO should always be considered during MPEG-2 to H.264/AVC transcoding.

We propose empirical rate and distortion models that efficiently predict the block-size partitioning of the MBs when encoding the output H.264/AVC video during transcoding. Our algorithm uses the predicted initial motion vectors and rate-distortion optimization techniques to estimate the block partition of the MBs. Experimental results show that, compared to the state-of-the-art transcoding scheme, our transcoder yields similar rate-distortion performance on the output H.264/AVC video, while significantly reducing the computational complexity.

We create an area-weighted align-to-worst strategy to calculate the motion vectors of the downscaled videos when downscaling compressed H.264/AVC videos. Compared to the other state-of-the-art methods, our proposed approach requires the least computational complexity for finding the motion vectors of the downscaled video during transcoding. At the same time, our approach yields better, or at least the same, compression performance on the output downscaled H.264/AVC video.

We investigate the impact of the range of the motion vector refinement on the compression performance of the downscaled videos.
We conclude that a motion vector refinement range of ±0.75 pixel allows us to achieve the same compression performance as the cascaded recoding approach (FullSearch) for downscaling CIF test sequences. In order to achieve the same performance as the cascaded recoding approach for downscaling large-resolution test sequences such as 4CIF, a larger motion vector refinement range (i.e., ±1.75 pixels) is suggested.

1.6 Thesis Summary

This is a manuscript-based thesis that follows the specifications required by the University of British Columbia for this format. In addition to this introductory chapter, the thesis includes three chapters that were originally published or prepared for refereed academic journals and have been slightly modified in order to offer a logical progression in the thesis. The final chapter discusses the conclusions and directions for future work. In what follows, we give a detailed summary of the three core chapters, i.e., chapters 2, 3, and 4, which contain the main contributions of this thesis.

In chapter 2, our proposed algorithms aim at improving the resultant video quality of the transform-domain MPEG-2 to H.264/AVC transcoding structure. The transform-domain transcoding structure is more computationally efficient than its pixel-domain counterpart. This is because the transform-domain transcoder does not require the computation of the transform coefficients of the output video. Instead, the existing transform coefficients of the input video, i.e., the MPEG-2 DCT coefficients, are re-employed. However, this re-employment inherently causes distortions in the resultant video due to mismatches between the MPEG-2 and H.264/AVC motion compensation processes. Chapter 2 first presents a theoretical analysis of the inherent video distortions existing in transform-domain MPEG-2 to H.264/AVC transcoding.
The analysis clearly shows that the distortions come from the re-quantization errors, the luminance half-pixel interpolation errors and the chrominance quarter/three-quarter pixel interpolation errors. We then propose computationally efficient algorithms that compensate for these errors. For the re-quantization errors, the traditional re-quantization error compensation algorithm for DCT coefficients is updated so that it can be applied to the H.264/AVC integer transform coefficients. For the interpolation errors, equations that compensate for the luminance half-pixel and chrominance quarter/three-quarter pixel interpolation errors are derived. The experimental results show that the proposed compensation algorithms achieve a 5 dB quality improvement over the transform-domain transcoding approach that does not compensate for these distortions. An additional advantage is a reduction in the computational complexity compared to the pixel-domain distortion-free method. The pixel-domain transcoding structure is distortion-free since it re-calculates the transform coefficients, at additional computational cost.

In chapter 3, our proposed algorithm aims at speeding up the motion estimation process of the pixel-domain MPEG-2 to H.264/AVC transcoder. Compared to the transform-domain transcoding structure, the advantage of the pixel-domain solution is not only that it is distortion-free, but also that it offers better compression efficiency if motion re-estimation is employed. The disadvantage of the pixel-domain solution, on the other hand, is its higher computational complexity. In the pixel-domain MPEG-2 to H.264/AVC transcoder, the most time-consuming process is the variable block-size motion estimation (VBSME). In order to reduce the computational complexity of the VBSME during transcoding, one of the state-of-the-art methods [33] significantly reduces the search range for each block-size motion estimation.
However, the motion search still needs to be performed many times, once for each block-size. In order to further reduce the computational complexity of the state-of-the-art transcoding method [33], we present an efficient algorithm that predicts the MB partitioning size for the VBSME. Therefore, instead of performing the motion search many times for each MB, our algorithm performs the motion search only once. To predict the MB partitioning size, our proposed approach first uses the predicted initial motion vectors (from MPEG-2 or H.264/AVC) to quickly obtain the residual data for each block-size. After that, it uses the proposed empirical distortion and rate models to select the best block-size in the rate-distortion optimization sense. The use of rate-distortion optimization techniques is important but is neglected by most of the state-of-the-art transcoding methods. To achieve additional computational savings, we also show that using block-sizes smaller than 8x8 (i.e., 8x4, 4x8 and 4x4) results in negligible compression improvements, and thus these sizes should be avoided in transcoding. Experimental results show that, compared to the state-of-the-art transcoding scheme, our transcoder yields similar rate-distortion performance, while the computational complexity is significantly reduced, requiring an average of 29% of the computations.

In chapter 4, we present a study of the downscaling video transcoder, in which both the input and output video are encoded using the H.264/AVC standard. This study proposes an efficient motion-vector composition scheme that is specifically designed for H.264/AVC transcoding applications with arbitrary downscaling ratios. For the downscaling transcoding process, the most time-consuming step is the motion estimation process that finds the motion vectors for the downscaled video.
In order to speed up the motion estimation process when encoding the downscaled video, most state-of-the-art methods derive initial motion vectors for the downscaled video and then perform a small range of motion vector refinement around those initial motion vectors. However, the process of calculating the initial motion vectors is complex, especially when arbitrary downscaling ratios are considered.

In our study, we propose a motion-vector composition scheme that supports arbitrary downscaling ratios for H.264/AVC transcoding. Our algorithm utilizes the information in the transform domain, i.e., the H.264/AVC integer transform coefficients. Since the transform coefficients often give a good indication of the motion activity of the video, utilizing them also makes the algorithm inherently adaptive to the video content. Although a transform-domain based method has been proposed [34], it does not consider the H.264/AVC variable block-size motion estimation or arbitrary downscaling ratios. In order to support both, our method uses the area-weighted numbers of nonzero AC integer-transform coefficients as the weights of the available motion vectors in the original video. The motion vector that corresponds to the highest weight is used to calculate the motion vector of the corresponding downscaled block. Compared to other state-of-the-art methods [35], [36], our proposed approach achieves the best accuracy with the least computational complexity.

Note that the notations used in the different chapters are independent of each other.

1.7 References¹

[1] "Information technology – Generic coding of moving pictures and associated audio information: Video," ITU-T Rec. H.262 and ISO/IEC Standard 13818-2 (MPEG-2 Video), 2nd ed., Feb. 2000.
[2] "Coding of audio-visual objects—Part 2: Visual (MPEG-4 video)," Int. Standards Org./Int. Electrotech. Comm.
(ISO/IEC), ISO/IEC 14496-2:2001, 2nd ed., 2001.
[3] "Video coding for low bit rate communication," Int. Telecommun. Union-Telecommun. (ITU-T), Geneva, Switzerland, Recommendation H.263, 1998.
[4] "Advanced video coding for generic audiovisual services," ITU-T Rec. H.264 and ISO/IEC Standard 14496-10 (MPEG-4 AVC), 3rd ed., Mar. 2005.
[5] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. Sullivan, "Rate-Constrained Coder Control and Comparison of Video Coding Standards," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688-703, Jul. 2003.
[6] D. Marpe, T. Wiegand, G. Sullivan, "The H.264/MPEG4 Advanced Video Coding Standard and its Applications," IEEE Communications Mag., vol. 44, no. 8, pp. 134-143, Aug. 2006.
[7] G. Sullivan and T. Wiegand, "Video Compression – from Concepts to the H.264/AVC Standard," Proceedings of the IEEE, vol. 93, no. 1, Jan. 2005.
[8] D. Wu, Y. T. Hou, Y.Q. Zhang, "Transporting Real-Time Video over the Internet: Challenges and Approaches," Proceedings of the IEEE, vol. 88, no. 12, Dec. 2000.
[9] A. Vetro, C. Christopoulos, and H. Sun, "Video transcoding architectures and techniques: An overview," IEEE Signal Process. Mag., vol. 20, no. 2, pp. 18–29, Mar. 2003.
[10] J. Xin, C.W. Lin, and M.T. Sun, "Digital Video Transcoding," Proceedings of the IEEE, vol. 93, no. 1, pp. 84–97, Jan. 2005.

¹ The references included in this chapter are generic references. More specific references for the subsequent chapters follow at the end of each chapter.

[11] I. Ahmad, X. Wei, Y. Sun, and Y-Q. Zhang, "Video Transcoding: An Overview of Various Techniques and Research Issues," IEEE Trans. on Multimedia, vol. 7, no. 5, pp. 793-804, Oct. 2005.
[12] H. Kalva, "Issues in H.264/MPEG-2 video transcoding," in Proc. IEEE Consumer Commun. Networking Conf., Jan. 2004, pp. 657-659.
[13] K. T. Fung and W. C. Siu, "Low complexity H.263 to H.264 video transcoding using motion vector decomposition," in Proc. IEEE Int. Symp. Circuits and Systems, vol.
2, May 2005, pp. 908-911.
[14] Y. K. Lee, S. S. Lee, and Y. L. Lee, "MPEG-4 to H.264 Transcoding using Macroblock Statistics," in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2006, pp. 57-60.
[15] J.-B. Lee and H. Kalva, "An Efficient Algorithm for VC-1 to H.264 Video Transcoding in Progressive Compression," in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2006, pp. 53-56.
[16] Y.-P. Tan and H. Sun, "Fast Motion Re-Estimation for Arbitrary Downsizing Video Transcoding using H.264/AVC Standard," IEEE Trans. Consumer Electronics, vol. 50, no. 3, pp. 887-894, Aug. 2004.
[17] T. Wiegand, G. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, Jul. 2003.
[18] Q. Tang, H. Mansour, P. Nasiopoulos, R. Ward, "Bit-rate estimation for bit-rate reduction H.264/AVC video transcoding in wireless networks," in Proc. IEEE Int. Symp. Wireless Pervasive Computing, May 2008, pp. 464-467.
[19] G. Sullivan, T. Wiegand, "Rate-Distortion Optimization for Video Compression," IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74-90, Nov. 1998.
[20] H.S. Malvar, A. Hallapuro, M. Karczewicz, L. Kerofsky, "Low-Complexity Transform and Quantization in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 598-603, Jul. 2003.
[21] G. Sullivan, P. Topiwala, A. Luthra, "The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions," in Proc. SPIE Annual Conf. Apps. of Digital Image Processing XXVII, Special Session on Advances in the New Emerging Standard H.264/AVC, Aug. 2004, pp. 454–74.
[22] H. Mansour, "Modeling of Scalable Video Content for Multi-user Wireless Transmission," Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. British Columbia, Vancouver, Canada, 2009.
[23] Iain Richardson, Transform and quantization, H.264/MPEG-4 Part 10 White Paper, [online], Available: http://www.vcodex.com.
[24] T.
Shanableh and M. Ghanbari, "Heterogeneous video transcoding to lower spatial-temporal resolutions and different encoding formats," IEEE Trans. Multimedia, vol. 2, no. 2, pp. 101–110, Jun. 2000.
[25] J. Xin, M.-T. Sun, B. S. Choi, and K.W. Chun, "An HDTV to SDTV spatial transcoder," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 11, pp. 998–1008, Nov. 2002.
[26] P. A. A. Assuncao and M. Ghanbari, "A frequency-domain video transcoder for dynamic bitrate reduction of MPEG-2 bit streams," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 8, pp. 953–967, Dec. 1998.
[27] J. Xin, A. Vetro and H. Sun, "Converting DCT Coefficients to H.264/AVC Transform Coefficients," in Proc. Pacific-rim Conference on Multimedia, Nov. 2004, pp. 939-946.
[28] Y. Sun, J. Xin, A. Vetro, and H. Sun, "Efficient MPEG-2 to H.264/AVC Intra Transcoding in Transform Domain," in Proc. IEEE Int. Symp. Circuits and Systems, May 2005, pp. 1234-1237.
[29] Q. Tang, R. Ward, P. Nasiopoulos, "An Efficient MPEG-2 to H.264 Half-Pixel Motion Compensation Transcoding," in Proc. IEEE Int. Conf. Image Processing, Oct. 2006, pp. 865-868.
[30] Z. Zhou, S. Sun, S. Lei, and M.T. Sun, "Motion Information and Coding Mode Reuse for MPEG-2 to H.264 Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, May 2005, pp. 1230-1233.
[31] G. Chen, Y. Zhang, S. Lin, F. Dai, "Efficient Block Size Selection for MPEG-2 to H.264 Transcoding," in Proc. 12th ACM Int. Conf. Multimedia, Oct. 2004, pp. 300-303.
[32] X. Lu, A. Tourapis, P. Yin and J. Boyce, "Fast Mode Decision and Motion Estimation for H.264 with a Focus on MPEG-2/H.264 Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, May 2005, pp. 1246-1249.
[33] J. Xin, J. Li, A. Vetro, H. Sun, and S. Sekiguchi, "Motion Mapping for MPEG-2 to H.264/AVC Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, May 2007, pp. 1991-1994.
[34] B. Shen, I.K. Sethi and B.
Vasudev, "Adaptive Motion-Vector Resampling for Compressed Video Downscaling," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 929-936, Sep. 1999.
[35] Y.-P. Tan and H. Sun, "Fast Motion Re-Estimation for Arbitrary Downscaling Video Transcoding using H.264/AVC Standard," IEEE Trans. Consumer Electronics, vol. 50, no. 3, pp. 887-894, Aug. 2004.
[36] J. Wang, E.H. Yang, X. Yu, "An efficient motion estimation method for H.264-based video transcoding with arbitrary spatial resolution conversion," in Proc. IEEE Int. Conf. Multimedia & Expo, Jul. 2007, pp. 444-447.

Chapter 2: COMPENSATION OF RE-QUANTIZATION AND INTERPOLATION ERRORS IN MPEG-2 TO H.264/AVC TRANSCODING²

2.1 Introduction

MPEG-2 is presently the video coding standard used in most consumer products ranging from digital TV broadcasting to DVDs [1]. However, H.264/AVC, the latest video coding standard of the Joint Video Team (JVT), is gaining ground in many applications [2]. The introduction of several new advanced features in H.264/AVC results in improved compression efficiency. It is nevertheless expected that MPEG-2 and H.264/AVC will coexist in the foreseeable future. Since MPEG-2 has been in existence for over a decade, most video has been stored using the MPEG-2 standard. To have universal multimedia access, users with H.264/AVC players should be able to access and play the MPEG-2 coded video.

Transcoding from MPEG-2 to H.264/AVC can be carried out at the transmitter, at the receiving end, or at a server in between. The straightforward cascaded approach fully decodes the MPEG-2 encoded video back to the pixel domain and then re-encodes it using H.264/AVC. Transcoding is a more efficient approach and can result in almost the same video quality as the cascaded structure [3].

Several transcoding architectures have been proposed [4]. Most can be classified into three categories: cascaded pixel-domain, closed-loop, and open-loop transcoding.
The cascaded pixel-domain transcoding structure produces the best picture quality but requires a high degree of computational effort. The closed-loop structure reduces the computational complexity, but unfortunately it introduces some distortions in the picture. The loop can be implemented either in the pixel domain or in the transform domain [4]. The open-loop structure carries out transcoding on the transform coefficients directly without introducing a loop. This method introduces the highest degree of distortion. Since this structure offers the least computational complexity, it is often used as a comparison reference.

² A version of this chapter has been published: Q. Tang, P. Nasiopoulos, R. Ward, "Compensation of Requantization and Interpolation Errors in MPEG-2 to H.264 Transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 314-325, Mar. 2008.

Since MPEG-2 uses the discrete cosine transform (DCT) and H.264/AVC uses an integer transform, most efforts addressing MPEG-2 to H.264/AVC transcoding have focused on implementing the transcoding in the pixel domain [5]–[7]. Methods that convert the 8x8 MPEG-2 DCT coefficients directly to the H.264/AVC 4x4 integer transform coefficients in the transform domain have been proposed (DCT-to-HT) [8], [9]. These conversions have lower computational complexity than the cascaded structure while maintaining the same picture quality [8], [9]. Based on these methods, an Intra macroblock (MB) MPEG-2 to H.264/AVC transcoding scheme that operates in the transform domain has been proposed [10]. This transcoding scheme is designed to work only on I-frames. For P-frames, the DCT-to-HT transcoding method results in unavoidable distortions caused by the incompatible motion compensation schemes supported by the two standards. A P-frame transcoding scheme that uses the DCT-to-HT method is proposed in [11], which also addresses the issues of interpolation and requantization errors.
However, the nature of the requantization errors is not analyzed in detail there, and the interpolation errors in the chrominance components are not addressed at all.

This study presents a new compensating algorithm that addresses the requantization errors and the interpolation errors (arising from the luminance as well as the chrominance components). In Section 2.2, we first analyze the nature of the distortions present in the open-loop structure which uses the DCT-to-HT method. In this structure, the encoded residue data of the MPEG-2 P frame are used by the H.264/AVC encoder, and the distortions are the result of the substantial differences between the MPEG-2 and H.264/AVC motion compensation processes. The main types of distortion are: 1) those resulting from the requantization of the coefficients; 2) the luminance half-pixel interpolation error; and 3) the chrominance quarter/three-quarter pixel interpolation errors. The requantization error is well known in the transcoding area. The traditional requantization error compensation algorithm cannot be directly used in MPEG-2 to H.264/AVC transcoding, since H.264/AVC uses the integer transform and MPEG-2 relies on the DCT. The luminance half-pixel interpolation error is the result of MPEG-2's use of a half-pixel interpolation filter that is different from that used by H.264/AVC. Similarly, the chrominance quarter/three-quarter pixel interpolation error is the result of the different schemes used by MPEG-2 and H.264/AVC to interpolate the chrominance quarter/three-quarter pixel samples. After analyzing these three types of errors, we propose a procedure that jointly compensates for all of them. The compensation algorithms and the transcoding structure are discussed in Section 2.3. In Section 2.4, two different types of experimental results are shown. The first set of results relates to the case when only the requantization error exists and is compensated.
The second set relates to implementing the procedure that jointly compensates for the luminance and chrominance interpolation errors as well as the requantization error. Experimental results show that our distortion compensation algorithms achieve almost the same picture quality as the ideal one achieved by the cascaded structure that reuses MPEG-2 motion vectors. An additional advantage is that, in the worst case, our method reduces the computational complexity by 13%.

2.2 Problem Description

The main objective of this study is to compensate for the inherent distortions present in DCT-to-HT transcoding. Transcoding DCT coefficients to integer transform coefficients is very efficient, since it reuses as much of the existing MPEG-2 information as possible (e.g., motion vectors and MB modes) when generating the H.264/AVC-compliant streams. However, reusing MPEG-2 information may introduce video distortions [11], [16]. In order to analyze these distortions, we compare the cascaded pixel-domain transcoding structure with the open-loop structure that is based on the DCT-to-HT transcoding method.

Figure 2.1 shows the cascaded pixel-domain transcoding structure. Here the MPEG-2 video is fully decoded back to the pixel domain, and the resulting video is re-encoded using H.264/AVC. The new features introduced in H.264/AVC, such as the deblocking filter and quarter-pixel motion compensation, can thus be added in the latter process.

Figure 2.2 shows the block diagram of an open-loop P-frame transcoding scheme, which uses the DCT-to-HT approach. Unlike the cascaded structure in Figure 2.1, the open-loop structure does not recalculate the residue values of the P-frames and is the most computationally efficient transcoding scheme. However, since it introduces many video distortions, it is not commonly used in practice. To improve its video quality, we compare the cascaded and the open-loop transcoding structures so as to identify the distortions caused by the transform-domain-based approach.
To fairly compare these two structures, the same motion vectors (MVs) and MB modes are used in both cases.

Figure 2.1 Cascaded pixel-domain transcoding structure

[Block diagram: Input MPEG-2 → VLD → IQM → DCT-to-HT transcoding → QH → CAVLC → Output H.264; the QH and CAVLC blocks are labeled "H.264 Part". VLD: Variable Length Decoding; IQM: MPEG-2 Inverse Quantization; QH: H.264 Quantization; CAVLC: Context-based Adaptive Variable Length Coding.]

Figure 2.2 P frame residue data re-using transcoding framework

In Figure 2.2, the DCT-to-HT transcoding method converts the DCT coefficients into integer H.264/AVC coefficients (DCT-to-HT) [8], [9]. The notations used below are shown in Table 2.1.

Table 2.1 List of notations

F1, F2      Consecutive original video frames
IM1, IM2    F1 and F2 after being MPEG-2 encoded and MPEG-2 decoded
IH1, IH2    IM1 and IM2 after passing through H.264/AVC encoding and H.264/AVC decoding
Mp          MPEG-2 pixel-domain motion compensation (MC) function
Mh          H.264/AVC pixel-domain MC function
PM2         MPEG-2 P residue frame (PM2 = IM2 – Mp(IM1))
PH2         H.264/AVC P residue frame (PH2 = IM2 – Mh(IH1))

To understand what causes the distortions in Figure 2.2, let us consider the first two frames of a video sequence. The first frame (F1) is first encoded as an MPEG-2 I frame. After MPEG-2 decoding, we get the reconstructed MPEG-2 frame (IM1). IM1 is then encoded using H.264/AVC. After H.264/AVC decoding, the transcoded H.264/AVC frame is IH1. Due to the lossy quantization of H.264/AVC, IH1 is not equal to IM1.

The second frame (F2) is encoded as an MPEG-2 P frame. After MPEG-2 pixel-domain motion compensation, the MPEG-2 P residue frame (PM2) is:

$$ P^{M}_{2} = I^{M}_{2} - M_{p}(I^{M}_{1}) \tag{2-1} $$

This residue frame PM2 is then encoded using H.264/AVC. The above applies to the open-loop structure. Since the cascaded pixel-domain transcoding structure yields much higher picture quality than this structure, let us find out what happens when the cascaded structure is used. There is no difference for the first I frame.
For the second frame, which is a P frame, the input residue frame PH2 (which is to be encoded using H.264/AVC) is:

$$ P^{H}_{2} = I^{M}_{2} - M_{h}(I^{H}_{1}) \tag{2-2} $$

Thus, the open-loop structure uses PM2 and the cascaded structure uses PH2 as the input to the H.264/AVC encoder. The difference between PM2 and PH2 will not only affect the quality of this frame but also that of all the subsequent frames inside one group of pictures (GOP). Thus, the quality of the transcoded video will deteriorate frame by frame until the next I frame is reached.

From (2-1) and (2-2), the difference between PM2 and PH2 is the result of the difference between Mp(IM1) and Mh(IH1). This difference can be classified into three categories: the re-quantization error, the luminance half-pixel interpolation error, and the chrominance quarter/three-quarter pixel interpolation errors. The following three subsections explain how these differences arise.

2.2.1 Re-quantization error

MPEG-2 and H.264/AVC use the same motion vectors in both the cascaded and the open-loop transcoding structures. When motion vectors point to the integer samples in the reference frame, (2-1) and (2-2) become:

$$ P^{M}_{2} = I^{M}_{2} - M^{int}_{p}(I^{M}_{1}) = I^{M}_{2} - I^{M}_{1} \tag{2-3} $$

$$ P^{H}_{2} = I^{M}_{2} - M^{int}_{h}(I^{H}_{1}) = I^{M}_{2} - I^{H}_{1} \tag{2-4} $$

In this case, Mintp(IM1) and Minth(IH1) are different because the integer-position samples of IH1 are different from those of IM1. This difference is called the re-quantization error, since it is the result of the H.264/AVC re-quantization process.

2.2.2 Luminance half-pixel interpolation error

When a motion vector points to a half-pixel sample in the reference frame, the encoder has to interpolate the half-pixel sample using the known integer-position sample values. MPEG-2 and H.264/AVC use different methods to interpolate the half-pixel values. MPEG-2 uses a 2-tap linear filter and H.264/AVC uses a 6-tap finite impulse response filter. Now assume that IM2(x) is a pixel in IM2.
In Figure 2.3 (a), IM1(xC) and IM1(xD) are two consecutive integer-position samples. The motion vector of IM2(x) points to IM1(xhalf) in the reference frame of MPEG-2 (IM1). In Figure 2.3 (b), the motion vector points to IH1(xhalf) in the reference frame of H.264/AVC (IH1). In Figure 2.3 (b), IH1(xA), IH1(xB), IH1(xC), IH1(xD), IH1(xE) and IH1(xF) are integer-position samples, and round denotes rounding towards the nearest integer.

(a) The motion vector of IM2(x) points to IM1(xhalf), located between IM1(xC) and IM1(xD):

$$ I^{M}_{1}(x_{half}) = \mathrm{round}\{ (I^{M}_{1}(x_{C}) + I^{M}_{1}(x_{D})) / 2 \} $$

(b) The motion vector of IM2(x) points to IH1(xhalf), located between IH1(xC) and IH1(xD), with IH1(xA), IH1(xB) to the left and IH1(xE), IH1(xF) to the right:

$$ I^{H}_{1}(x_{half}) = \mathrm{round}\{ (I^{H}_{1}(x_{A}) - 5 I^{H}_{1}(x_{B}) + 20 I^{H}_{1}(x_{C}) + 20 I^{H}_{1}(x_{D}) - 5 I^{H}_{1}(x_{E}) + I^{H}_{1}(x_{F})) / 32 \} $$

Figure 2.3 Interpolation of half-pixel values in (a) MPEG-2 and (b) H.264/AVC. (a) Half-pixel sample in the MPEG-2 reconstructed I-frame IM1. (b) Half-pixel sample in the H.264/AVC reconstructed I-frame IH1.

For the open-loop structure, the residue value of IM2(x) which forms the input to the H.264/AVC encoder in the second frame is:

$$ P^{M}_{2}(x) = I^{M}_{2}(x) - M^{half}_{p}(I^{M}_{1}) = I^{M}_{2}(x) - I^{M}_{1}(x_{half}) \tag{2-5} $$

where Mhalfp is the half-pixel interpolation motion compensation function. This residue will be encoded using H.264/AVC. However, in the cascaded structure, the residue value of IM2(x) is:

$$ P^{H}_{2}(x) = I^{M}_{2}(x) - M^{half}_{h}(I^{H}_{1}) = I^{M}_{2}(x) - I^{H}_{1}(x_{half}) \tag{2-6} $$

Since MPEG-2 and H.264/AVC use different filters to interpolate half-pixel samples, IM1(xhalf) is usually different from IH1(xhalf). This difference makes PM2(x) different from PH2(x) and introduces interpolation errors. In this study, we call this type of error the luminance half-pixel interpolation error.

2.2.3 Chrominance quarter/three-quarter pixel interpolation errors

For the chrominance components (U, V), the distortion is different from that of the luminance component (Y). In the commonly used 4:2:0 color format, the U and V components are sub-sampled in both the horizontal and vertical directions.
This means that the half-pixel motion vectors of the luminance component correspond to quarter-pixel motion vectors in the chrominance components. The different quarter-pixel interpolation methods used by MPEG-2 and H.264/AVC introduce degradation in the picture quality.

Figure 2.4 shows the MPEG-2 and H.264/AVC chrominance interpolation, where IM1(xA), IM1(xB) and IH1(xA), IH1(xB) are two adjacent integer pixel samples in the MPEG-2 and H.264/AVC images. IM1(x0.25) and IH1(x0.25) are the samples located one quarter of the way from xA to xB in the MPEG-2 and H.264/AVC images, respectively. Similarly, IM1(x0.5) and IH1(x0.5) are the half-way samples, and IM1(x0.75) and IH1(x0.75) are the samples three quarters of the way from xA to xB.

[Two panels, each showing the samples at xA, x0.25, x0.5, x0.75 and xB: (a) MPEG-2 chrominance interpolation, with IM1 samples; (b) H.264 chrominance interpolation, with IH1 samples.]

Figure 2.4 Interpolation of chrominance components in MPEG-2 and H.264/AVC

For the one-quarter pixel sample (refer to Figure 2.4 (a)), the resulting interpolated values in MPEG-2 and H.264/AVC are:

$$ \begin{aligned} \text{MPEG-2:} \quad & I^{M}_{1}(x_{0.25}) = I^{M}_{1}(x_{A}) \\ \text{H.264:} \quad & I^{H}_{1}(x_{0.25}) = (48 I^{H}_{1}(x_{A}) + 16 I^{H}_{1}(x_{B}) + 32) / 64 \end{aligned} \tag{2-7} $$

For the three-quarter pixel sample (refer to Figure 2.4 (b)), the interpolated values in MPEG-2 and H.264/AVC are:

$$ \begin{aligned} \text{MPEG-2:} \quad & I^{M}_{1}(x_{0.75}) = I^{M}_{1}(x_{0.5}) = (I^{M}_{1}(x_{A}) + I^{M}_{1}(x_{B}) + 1) / 2 \\ \text{H.264:} \quad & I^{H}_{1}(x_{0.75}) = (16 I^{H}_{1}(x_{A}) + 48 I^{H}_{1}(x_{B}) + 32) / 64 \end{aligned} \tag{2-8} $$

We observe that for the Inter chrominance MBs with quarter/three-quarter pixel motion vectors, the residue data that are input to the H.264/AVC encoder in the open-loop structure are different from those in the cascaded structure. This difference causes distortion in the open-loop transcoding structure which uses the DCT-to-HT method. We call this error the chrominance quarter/three-quarter pixel interpolation error in this study.
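The filter mismatch described in Sections 2.2.2 and 2.2.3 can be illustrated numerically. The following is a minimal Python sketch and not part of the original study: the function names and sample values are hypothetical, both filters are applied to the same integer samples (so the re-quantization error is deliberately left out), and the rounding offsets are the integer forms used in (2-7) and (2-8) together with the usual integer form of the two filters in Figure 2.3.

```python
def clip8(v):
    """Clip a value to the 8-bit sample range [0, 255]."""
    return max(0, min(255, v))


def mpeg2_luma_half(c, d):
    """MPEG-2 half-pixel sample: 2-tap average of the two neighbours, rounded."""
    return (c + d + 1) >> 1


def h264_luma_half(a, b, c, d, e, f):
    """H.264/AVC half-pixel sample: 6-tap filter (1, -5, 20, 20, -5, 1) / 32."""
    return clip8((a - 5 * b + 20 * c + 20 * d - 5 * e + f + 16) >> 5)


def mpeg2_chroma(a, b, pos):
    """MPEG-2 chroma value used for position 0.25 or 0.75, as in (2-7) and (2-8)."""
    if pos == 0.25:
        return a                     # the integer sample is used
    if pos == 0.75:
        return (a + b + 1) >> 1      # the half-pixel sample is used
    raise ValueError("pos must be 0.25 or 0.75")


def h264_chroma(a, b, pos):
    """H.264/AVC chroma value: linear weights 48/16 or 16/48 out of 64."""
    if pos == 0.25:
        return (48 * a + 16 * b + 32) >> 6
    if pos == 0.75:
        return (16 * a + 48 * b + 32) >> 6
    raise ValueError("pos must be 0.25 or 0.75")


# Hypothetical integer-position samples x_A .. x_F (x_C = 80, x_D = 120).
row = (10, 20, 80, 120, 60, 30)
luma_gap = h264_luma_half(*row) - mpeg2_luma_half(row[2], row[3])
chroma_gap_025 = h264_chroma(80, 120, 0.25) - mpeg2_chroma(80, 120, 0.25)
chroma_gap_075 = h264_chroma(80, 120, 0.75) - mpeg2_chroma(80, 120, 0.75)
print(luma_gap, chroma_gap_025, chroma_gap_075)
```

Even with identical integer samples, the interpolated values differ for these inputs, which is the interpolation part of the mismatch between PM2(x) and PH2(x).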
In fact, the re-quantization error described in subsection 2.2.1 also affects the luminance half-pixel interpolation error and the chrominance quarter/three-quarter pixel interpolation error. For instance, in Figure 2.3, IM1(xhalf) is different from IH1(xhalf) not only because of the different half-pixel interpolation filters, but also because of the different reconstructed integer-pixel samples resulting from the re-quantization of H.264/AVC, i.e., in Figure 2.3, IM1(xC) and IM1(xD) are different from IH1(xC) and IH1(xD), respectively. The same is true for the chrominance quarter/three-quarter pixel interpolation error.

2.3 Proposed Compensation Algorithms

The objective of our work is to provide efficient algorithms that compensate for the inherent distortions in the open-loop DCT-to-HT transcoding. Based on the discussion above, and in order to improve the video quality of the open-loop structure, the value of PM2 should be changed so that it is equal to the value of PH2 (or the value of PM2(x) should be changed to the value of PH2(x)) before it goes through the H.264/AVC integer transform. In this section, the re-quantization error compensation algorithm is introduced first. Then, the proposed luminance half-pixel and chrominance quarter/three-quarter pixel interpolation error compensation schemes are introduced.

Subtracting (2-2) from (2-1) yields the difference between the distorted residue PM2 and the cascaded structure's desired residue PH2 as:

$$ \Delta P = P^{M}_{2} - P^{H}_{2} = M_{h}(I^{H}_{1}) - M_{p}(I^{M}_{1}) = \Delta I \tag{2-9} $$

Therefore, the difference between PM2 and PH2 is equal to the difference between Mh(IH1) and Mp(IM1).
Once ∆I is properly calculated, the value of PM2 can be adjusted to become equal to the value of PH2 as:

$$ P^{H}_{2} = P^{M}_{2} - \Delta I \tag{2-10} $$

Furthermore, due to the linearity of the H.264/AVC integer transform, the final adjustment can be implemented in the transform domain, i.e., (2-10) can be expressed as follows:

$$ HI(P^{H}_{2}) = HI(P^{M}_{2}) - HI(\Delta I) = HI(P^{M}_{2} - \Delta I) \tag{2-11} $$

where HI denotes the H.264/AVC integer transform.

2.3.1 Re-quantization error compensation algorithm

The most straightforward way to calculate the exact value of ∆I is to go back to the pixel domain and reconstruct both frames IM1 and IH1. However, the computational complexity in this case is too high. Since the re-quantization error ∆I is only related to the quantization process, which takes place in the transform domain, a more efficient approach for calculating this type of error should be sought. In (2-11), HI(∆I) actually denotes the loss in the original transform coefficients due to the quantization process.

The re-quantization error in DCT coefficients is well known in the transcoding area. For the DCT coefficients, the loss in each original coefficient is measured by calculating the difference between the original transform coefficient and the quantized/de-quantized coefficient [13]. However, this algorithm cannot be directly used on the H.264/AVC integer transform coefficients. The differences between the original H.264/AVC integer transform coefficients and the quantized/de-quantized integer transform coefficients are not equal to the amount of loss in the coefficient values. This is because the H.264/AVC integer transform and the inverse integer transform are not orthogonal unless scaling is performed. Unfortunately, in the H.264/AVC standard, scaling is combined with the quantization/de-quantization process. In order to calculate the re-quantization error, the scaling processes need to be separated from the quantization/de-quantization processes.
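Two properties used above can be checked numerically: the linearity behind (2-11), and the fact that the unscaled integer transform is not orthonormal. The following is a minimal Python sketch under stated assumptions: the 4x4 forward core matrix CF is the widely published H.264/AVC core transform matrix (it is not listed in this chapter), and the residue block P_M2 and error block DELTA_I hold hypothetical values.

```python
# Assumed forward core matrix of the H.264/AVC 4x4 integer transform.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_core(x):
    """Unscaled forward transform CF * x * CF^T."""
    return matmul(matmul(CF, x), transpose(CF))


def sub(a, b):
    """Element-wise difference of two equally sized matrices."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# CF * CF^T is diagonal but not the identity: the rows are orthogonal with
# squared norms 4 and 10, so the transform needs scaling to be orthonormal.
GRAM = matmul(CF, transpose(CF))

# Hypothetical residue block and error block for the linearity check in (2-11).
P_M2 = [[5, -3, 0, 2], [1, 4, -2, 0], [0, 0, 3, -1], [2, -2, 1, 0]]
DELTA_I = [[1, 0, -1, 0], [0, 2, 0, 0], [0, 0, 0, 1], [-1, 0, 0, 0]]
lhs = forward_core(sub(P_M2, DELTA_I))
rhs = sub(forward_core(P_M2), forward_core(DELTA_I))
print(GRAM, lhs == rhs)
```

Because the squared row norms are 4 and 10 rather than 1, a coefficient difference taken before scaling does not directly measure the loss, which is why the scaling has to be separated from the quantization.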
After scaling is performed on the original coefficients, the scaled integer transform coefficients are subtracted from the de-quantized transform coefficients. The differences are then used to calculate the re-quantization errors. The following sub-sections explain the proposed algorithm to separate the scaling from the quantization/de-quantization processes.

H.264/AVC quantization and de-quantization

For ease of explanation, we use the quantization/de-quantization equations in [15], in which the quantization formula is given by:

$$ Z_{ij} = W_{ij} \times S^{For}_{ij} / Q_{step} \tag{2-12} $$

where i, j denote the 2-D coordinates inside the 4x4 block, SFor denotes the scaling matrix for the forward H.264/AVC integer transform, Wij denotes the integer transform coefficient before quantization, and Zij denotes the quantized and scaled integer transform coefficient. Qstep denotes the quantization step size. The de-quantization operation is described by:

$$ W'_{ij} = Z_{ij} \times Q_{step} \times S^{Inv}_{ij} \times 64 \tag{2-13} $$

where SInv denotes the scaling matrix for the inverse H.264/AVC integer transform, W'ij denotes the de-quantized integer transform coefficient, and the factor 64 is used to reduce the rounding errors [15].

Derivation of ideal scaling

If Zij in (2-13) is substituted using (2-12), the quantization process is removed and only the scaling process remains. Therefore, the final formula that obtains the scaled forward integer transform coefficients (Wscaleij) is given by:

$$ W^{scale}_{ij} = (W_{ij} \times S^{For}_{ij} / Q_{step}) \times Q_{step} \times S^{Inv}_{ij} \times 64 = W_{ij} \times S^{For}_{ij} \times S^{Inv}_{ij} \times 64 = W_{ij} \times S_{ij} \tag{2-14} $$

where S = SFor ⊗ SInv × 64 (⊗ indicates that each element of SFor is multiplied by the element in the same position in matrix SInv, i.e., element-by-element multiplication rather than matrix multiplication). It follows that the scaling matrix is given by:

$$ S = \begin{bmatrix} a^{4} & \frac{a^{2}b^{2}}{2} & a^{4} & \frac{a^{2}b^{2}}{2} \\ \frac{a^{2}b^{2}}{2} & \frac{b^{4}}{4} & \frac{a^{2}b^{2}}{2} & \frac{b^{4}}{4} \\ a^{4} & \frac{a^{2}b^{2}}{2} & a^{4} & \frac{a^{2}b^{2}}{2} \\ \frac{a^{2}b^{2}}{2} & \frac{b^{4}}{4} & \frac{a^{2}b^{2}}{2} & \frac{b^{4}}{4} \end{bmatrix} \times 64 \tag{2-15} $$

where $a = 1/2$ and $b = \sqrt{2/5}$.
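The separation of scaling from quantization in (2-12) to (2-14) can be sketched numerically. This is a minimal Python illustration under stated assumptions, not the implementation used in the study: the entries of SFor and SInv are taken from the commonly published factorization of the H.264/AVC 4x4 transform (they are not listed in this chapter), quantization uses plain rounding to the nearest integer rather than an encoder's rounding offset, and the coefficient and Qstep values are hypothetical.

```python
import math

A = 0.5
B = math.sqrt(2.0 / 5.0)


def s_for(i, j):
    """Assumed forward scaling factor SFor at position (i, j)."""
    if i % 2 == 0 and j % 2 == 0:
        return A * A
    if i % 2 == 1 and j % 2 == 1:
        return B * B / 4
    return A * B / 2


def s_inv(i, j):
    """Assumed inverse scaling factor SInv at position (i, j)."""
    if i % 2 == 0 and j % 2 == 0:
        return A * A
    if i % 2 == 1 and j % 2 == 1:
        return B * B
    return A * B


# Combined scaling matrix S = SFor (element-wise) SInv x 64.
S = [[s_for(i, j) * s_inv(i, j) * 64 for j in range(4)] for i in range(4)]


def requant_error(w, i, j, qstep):
    """De-quantized coefficient minus the scaled original coefficient."""
    z = round(w * s_for(i, j) / qstep)          # (2-12), plain rounding assumed
    w_deq = z * qstep * s_inv(i, j) * 64        # (2-13)
    w_scaled = w * S[i][j]                      # (2-14)
    return w_deq - w_scaled


print([round(v, 2) for v in S[0]], [round(v, 2) for v in S[1]])
print(requant_error(103, 0, 0, 10), requant_error(103, 0, 0, 0.25))
```

With these assumed factors the entries of S are 4, 3.2 and 2.56, so two of the three distinct values are not integers; this is consistent with the remark that division, and hence some rounding, cannot be avoided once the scaling is applied on its own.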
The derived scaling matrix S is applied to the forward integer transform coefficients. After that, the scaled coefficients are subtracted from the de-quantized coefficients to obtain the re-quantization error. Ideally, the separated scaling process should introduce no loss in the forward transform coefficients. However, after the scaling factors are separated from the quantization process, division cannot be avoided. Therefore, some rounding errors exist in the scaling process. Experimental results show that our scaling algorithm introduces negligible change to the forward transform coefficients [16].

After the re-quantization errors of all MBs in the current frame are calculated, these error values are used to adjust the values of the residues in the next frame. Figure 2.5 shows the flowchart of the Inter MB transcoding scheme including the re-quantization error compensation.

[Flowchart. Blocks shown: Input MPEG-2, VLD, IQM, DCT-to-HT Transcoding, QH, CAVLC, Output H.264, IQH, Scaling, Re-quantization Error Frame, Inverse Integer Transform, Integer Transform, Input motion vector; the H.264 blocks are grouped as "H.264 Part". VLD: Variable Length Decoding; IQM: MPEG-2 Inverse Quantization; QH / IQH: H.264 Quantization / H.264 Inverse Quantization; CAVLC: Context-based Adaptive Variable Length Coding.]

Figure 2.5 The flowchart of the Inter MB transcoding scheme with re-quantization error compensation

2.3.2 Luminance half-pixel and chroma quarter/three-quarter interpolation error compensation algorithm

For the luminance half-pixel interpolation error, subtracting (2-6) from (2-5) yields the difference between PM2(x) and PH2(x):

$$ \Delta P(x) = P^{M}_{2}(x) - P^{H}_{2}(x) = I^{H}_{1}(x_{half}) - I^{M}_{1}(x_{half}) = \Delta x_{half} \tag{2-16} $$

Substituting the interpolated values of IH1(xhalf) and IM1(xhalf) using the equations in Figure 2.3, we get:

$$ \Delta x_{half} = \frac{I^{H}_{1}(x_{A}) - 5 I^{H}_{1}(x_{B}) + 20 I^{H}_{1}(x_{C}) + 20 I^{H}_{1}(x_{D}) - 5 I^{H}_{1}(x_{E}) + I^{H}_{1}(x_{F}) + 16}{32} - \frac{I^{M}_{1}(x_{C}) + I^{M}_{1}(x_{D}) + 1}{2} \tag{2-17} $$

The most straightforward way to compensate for the luminance
half-pixel interpolation error is to reconstruct the previous MPEG-2 frame so as to obtain the values of IM1(xC) and IM1(xD), and to reconstruct the H.264/AVC frame to obtain the IH1(xA), IH1(xB), …, IH1(xF) values. Unfortunately, this makes the computational complexity high. In (2-17), the difference between IM1(xC) and IH1(xC), and the difference between IM1(xD) and IH1(xD), are equal to the re-quantization errors ∆xC and ∆xD, respectively, i.e., IM1(xC) = IH1(xC) + ∆xC and IM1(xD) = IH1(xD) + ∆xD. Therefore, (2-17) can be rewritten as:

$$\Delta x_{half} = I_{H1}(x_{half}) - \frac{I_{H1}(x_C) + \Delta x_C + I_{H1}(x_D) + \Delta x_D + 1}{2} \qquad \text{(2-18)}$$

Since our algorithm calculates the re-quantization errors (∆xC and ∆xD), only the previously reconstructed H.264/AVC frame is required for obtaining IH1(xC) and IH1(xD). Then, the difference between PM2(x) and PH2(x) can be correctly measured. This approach does not need to reconstruct the previous MPEG-2 frame, and the term IH1(xhalf) in equation (2-18) has already been calculated in the H.264/AVC motion compensation process. Only the re-quantization errors and the surrounding integer-position samples of the luminance half-pixel samples are thus needed.

For the chrominance quarter/three-quarter pixel interpolation error, the difference between the MPEG-2 and the H.264/AVC quarter/three-quarter pixel sample values can also be obtained using the H.264/AVC reconstructed frame and the re-quantization errors.
Thus, according to Figure 2.3, when the motion vector points to the one-quarter pixel sample [(2-7)], the amount of the adjustment is equal to:

$$\Delta_{0.25} = \frac{48 I_{H1}(x_A) + 16 I_{H1}(x_B) + 32}{64} - I_{M1}(x_A) = I_{H1}(x_{0.25}) - (I_{H1}(x_A) + \Delta x_A) \qquad \text{(2-19)}$$

When the motion vector points to the three-quarter pixel sample [(2-8)], the amount of the adjustment is equal to:

$$\Delta_{0.75} = \frac{16 I_{H1}(x_A) + 48 I_{H1}(x_B) + 32}{64} - \frac{I_{M1}(x_A) + I_{M1}(x_B) + 1}{2} = I_{H1}(x_{0.75}) - \frac{I_{H1}(x_A) + \Delta x_A + I_{H1}(x_B) + \Delta x_B + 1}{2} \qquad \text{(2-20)}$$

As is the case with the luminance component, the terms IH1(x0.25) and IH1(x0.75) in equations (2-19) and (2-20), respectively, have already been calculated in the H.264/AVC motion compensation process. Only the re-quantization errors and the surrounding integer-position samples of the chrominance quarter/three-quarter pixels are thus needed.

2.3.3 The proposed closed-loop transcoding structure

As discussed above, our objective is to correct the distortion inherent in the open-loop DCT-to-HT transcoding. For the re-quantization error, the amount of adjustment comes from the re-quantization error in the previous frame. The errors are located by the motion vectors of the current frame. For the luminance half-pixel and chrominance quarter/three-quarter pixel interpolation errors, the amount of adjustment comes from two parts. One is from the re-quantization error. The other is from the surrounding integer-position samples of the half-pixel or quarter/three-quarter pixel samples in the previous reconstructed frame.

All errors discussed in this work are only related to Inter-coded MBs. To effectively evaluate the performance of the proposed algorithms, the Intra-coded MBs are transcoded using the cascaded pixel-domain transcoding structure, i.e., they are MPEG-2 decoded and then H.264/AVC re-encoded. This allows us to avoid any additional errors that might be caused by the transcoding of Intra-coded MBs.
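The interpolation adjustments of Section 2.3.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: samples are 8-bit integers, the H.264/AVC interpolated values are recomputed here although the text notes that they are already available from motion compensation, each adjustment is taken as the H.264/AVC value minus the estimated MPEG-2 value (the sign used in (2-16)), and all function names are hypothetical. The MPEG-2 integer-position samples are estimated as the H.264/AVC reconstructed sample plus the stored re-quantization error, which is the substitution that avoids reconstructing the MPEG-2 frame.

```python
def clip8(value):
    """Clip to the 8-bit sample range."""
    return max(0, min(255, value))


def h264_luma_half(a, b, c, d, e, f):
    """H.264/AVC six-tap half-sample filter on integer samples A..F."""
    return clip8((a - 5 * b + 20 * c + 20 * d - 5 * e + f + 16) >> 5)


def luma_half_adjustment(h_samples, err_c, err_d):
    """Luminance half-pixel adjustment.

    h_samples: H.264/AVC reconstructed integer samples (A, B, C, D, E, F).
    err_c, err_d: re-quantization errors at positions C and D.
    """
    a, b, c, d, e, f = h_samples
    # MPEG-2 half-pixel value estimated without the MPEG-2 frame.
    mpeg2_half = (c + err_c + d + err_d + 1) // 2
    return h264_luma_half(a, b, c, d, e, f) - mpeg2_half


def chroma_quarter_adjustment(h_a, h_b, err_a):
    """Chrominance adjustment at the one-quarter position."""
    h264_quarter = (48 * h_a + 16 * h_b + 32) // 64
    # MPEG-2 uses the integer sample A at this position.
    return h264_quarter - (h_a + err_a)


def chroma_three_quarter_adjustment(h_a, h_b, err_a, err_b):
    """Chrominance adjustment at the three-quarter position."""
    h264_three_quarter = (16 * h_a + 48 * h_b + 32) // 64
    # MPEG-2 uses the half-pixel average of A and B at this position.
    mpeg2_half = (h_a + err_a + h_b + err_b + 1) // 2
    return h264_three_quarter - mpeg2_half
```

In a flat region with zero re-quantization error, for example six luminance samples all equal to 100, the two interpolation filters agree and the adjustment is zero; a non-zero adjustment appears only where the filters or the stored errors differ.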
The proposed P frame transcoding architecture is shown in Figure 2.6. In Figure 2.6, the MPEG-2 DCT coefficients of the input MPEG-2 video streams are converted to the H.264/AVC integer transform coefficients directly by using the DCT-to-HT method. Re-quantization errors are generated during the H.264/AVC quantization/de-quantization process. The amount of adjustment needed for correcting the distorted H.264/AVC coefficients is calculated using the proposed algorithms. Once these values are known, they are converted into the H.264/AVC integer transform domain. The final adjustment is then executed on the H.264/AVC integer transform coefficients. In Figure 2.6, a switch is defined which is turned on when the current motion vector has a half-pixel value. In that case, the adjustment is the summation of the re-quantization error and either the luminance half-pixel interpolation error or the chrominance interpolation errors.

Figure 2.6 The flowchart of the Inter MB transcoding scheme with re-quantization error and half-pixel / quarter-pixel interpolation error compensation. (VLD: Variable Length Decoding; IQM: MPEG-2 Inverse Quantization; QH / IQH: H.264 Quantization / H.264 Inverse Quantization; CAVLC: Context-based Adaptive Variable Length Coding; Switch: on when the motion vector has a half-pixel value.)

2.4 Experimental Results

To evaluate the performance of the proposed re-quantization and interpolation compensation algorithms, we carried out two sets of experiments. The first set evaluates the re-quantization error compensation scheme only. Thus, video streams that do not use half-pixel motion vectors are employed in this experiment.
The second set evaluates our combined compensation scheme, which includes the luminance half-pixel interpolation error, the chrominance quarter/three-quarter pixel interpolation error and the re-quantization error compensation schemes. The transcoding schemes used in our experiments are: 1) the open-loop structure; 2) the cascaded pixel-domain transcoding which makes use of the deblocking filter and reuses the MPEG-2 MVs; 3) the cascaded pixel-domain transcoding which makes use of the deblocking filter and performs motion estimation within a ±2 pixel window around the MPEG-2 MVs with quarter-pixel accuracy; 4) the proposed compensation scheme for the re-quantization error only; 5) the proposed compensation scheme for the luminance interpolation error only; 6) the proposed combined compensation scheme that includes the re-quantization and interpolation error schemes. The motion compensation block size used in all tested schemes is 16x16, since the 16x16 MPEG-2 MVs are reused or refined. The proposed closed-loop structure does not perform the H.264/AVC motion estimation, since all the interpolation error distortions discussed in the study are the result of re-using MPEG-2 MVs. The deblocking filter is turned off in the proposed compensation algorithms. This is because the deblocking filter is applied after the inverse H.264/AVC integer transform, while our proposed re-quantization error calculation is carried out before the inverse H.264/AVC integer transform. If the deblocking filter were used in this case, it would introduce drift errors, since the re-quantization error compensation process would not have access to the difference between the H.264/AVC reference frame before and after the deblocking filter is applied. In all our experiments, the Test Model 5 MPEG-2 reference software and the JM10.2 reference software were used [17][18]. CIF format video sequences are used in all our tests. Rate control is used during MPEG-2 encoding. The desired bit-rate is chosen to be 2 Mbits/sec.
Only I and P frames are used. The group of pictures (GOP) size is set to 15 frames, which means that one I frame exists in every 15 frames. Main profile is used to encode the MPEG-2 streams since it represents the profile used in MPEG-2 DTV broadcasting applications. For this reason, the H.264/AVC streams are also encoded as Main profile during MPEG-2 to H.264/AVC transcoding.

2.4.1 Compensation of re-quantization error

This experiment focuses on the distortions caused by the re-quantization errors only, so half-pixel motion vectors are not used during MPEG-2 encoding. In fact, in order to evaluate only the re-quantization error compensation, the deblocking filter in the cascaded structure that reuses MPEG-2 MVs is also turned off. For the rate control in H.264/AVC, every frame uses the same quantization parameter. By choosing proper quantization parameters, one can roughly control the desired bit-rate for the transcoded H.264/AVC streams. The transcoding scheme is shown in Figure 2.5 above. Four video sequences are used: Akiyo, Foreman, Football and Mobile & Calendar (with motion details varying from subtle to large). We compare the picture quality of the transcoded video obtained by the cascaded structure which reuses MPEG-2 MVs and by the proposed re-quantization error compensation scheme. The average PSNR values over 100 frames are used as the measure; they are calculated between the decoded MPEG-2 video and its resulting H.264/AVC transcoded video at different quantization parameters (see Figure 2.7).
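The quality measure described above can be sketched as follows. The sketch assumes 8-bit samples, frames given as flat sequences of equal length, and an average taken over per-frame PSNR values; the text does not state whether PSNR or mean squared error is averaged, so that choice, like the function names, is an assumption.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two frames given as flat sample sequences."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(peak * peak / mse)


def average_psnr(reference_frames, test_frames):
    """Mean of the per-frame PSNR values over a sequence."""
    values = [psnr(r, t) for r, t in zip(reference_frames, test_frames)]
    return sum(values) / len(values)
```

Here the reference frames would be the decoded MPEG-2 frames and the test frames the decoded frames of the transcoded H.264/AVC stream.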
Table 2.2 shows the average PSNR values of the open-loop, the cascaded and the proposed transcoding schemes when the quantization parameter is 25. We observe that, compared to the cascaded structure, our proposed algorithm produces almost the same picture quality. The average picture quality loss is equal to a mere 0.145 dB.

Table 2.2 Average PSNR values (dB) obtained by the three different transcoding schemes

Sequence            Scheme I:    Scheme II:   Scheme IV:   Scheme IV         Scheme IV
                    Open-loop    Cascaded     Proposed     minus Scheme I    minus Scheme II
                    structure    structure    structure
Akiyo               36.93        40.24        40.15        ∆ = 3.22          ∆ = -0.09
Foreman             33.15        37.18        37.04        ∆ = 3.89          ∆ = -0.14
Football            34.43        37.98        37.86        ∆ = 3.43          ∆ = -0.12
Mobile & Calendar   30.55        36.17        35.94        ∆ = 5.39          ∆ = -0.23

Note: Quantization parameter = 25, and the deblocking filter of Scheme II is turned off.

Figure 2.7 The PSNR of transcoded video obtained by the open-loop structure, cascaded structure and our proposed transcoding structure under different quantization parameters (quantizers). Each of the four panels (a) to (d) plots PSNR (dB) versus quantizer for one test sequence (Akiyo, Foreman, Football, Mobile).

2.4.2 Compensation of re-quantization and interpolation errors

For this set of experiments, the transcoding parameters for the different transcoding schemes used are shown in Table 2.3.
In this section, we compare our combined re-quantization and interpolation error compensation scheme with the two cascaded structures 2) and 3): the first reuses the MPEG-2 MVs, while the other uses the MPEG-2 MVs as the initial search points and then performs H.264/AVC motion estimation within a ±2 pixel window with quarter-pixel accuracy. The cascaded structure 3) checks the Intra mode cost value for every MB, even in P frames, to determine whether or not it is better to encode this MB as an Intra MB. The deblocking filter is turned on in both cases.

Table 2.3 Transcoding parameters of different transcoding schemes

Transcoding parameters                Open-loop           Proposed            Cascaded reusing    Cascaded ±2 pixels
                                                          transcoding scheme  MPEG-2 MVs          MV refinement
Picture format                        352x288             352x288             352x288             352x288
GOP structure                         IPPP… (15 frames)   IPPP… (15 frames)   IPPP… (15 frames)   IPPP… (15 frames)
Half-pixel MV accuracy                Yes                 Yes                 Yes                 Yes
Re-using MPEG-2 MVs                   Yes                 Yes                 Yes                 No
Quarter-pixel MV accuracy             N/A                 N/A                 No                  Yes
Block size for MC                     16x16               16x16               16x16               16x16
±2 pixels motion vector refinement    N/A                 N/A                 No                  Yes
Reference frame                       1                   1                   1                   1
Intra mode search                     No                  No                  No                  Yes
Deblocking filter                     N/A                 No                  Yes                 Yes
Rate control                          No                  No                  No                  No
RD-Optimization                       N/A                 N/A                 Low                 Low

Figure 2.8 shows the rate-distortion curves for the open-loop structure, the two cascaded structures 2) and 3), and the proposed error compensation transcoding structure. The range of quantization parameters used in Figure 2.8 varies depending on the tested sequences. The smallest quantization parameter ensures that the bit-rate of the transcoded H.264/AVC video streams is kept under 2 Mbits/sec (the MPEG-2 video bit-rate used is 2 Mbits/sec for all video sequences). The largest chosen quantization parameter guarantees that the picture quality of the transcoded H.264/AVC video stream is acceptable.
We observe that our proposed transcoding algorithm achieves almost the same picture quality (average 0.3 dB loss) as the cascaded pixel-domain transcoding structure that reuses the MPEG-2 MVs. The cascaded pixel-domain transcoding structure which performs ±2 pixel motion vector refinement needs 3 times the computational effort of the proposed scheme in the JM software, for a mere 0.6 dB average increase in picture quality.

Figure 2.8 Rate-distortion curves obtained by different transcoding schemes using four test video sequences (a) Akiyo sequence (b) Foreman sequence (c) Football sequence (d) Mobile sequence. For each sequence, separate plots for the Y, U and V components show PSNR (dB) versus bit-rate (Kbits) for the open-loop, cascaded re-using MV, cascaded ±2 pixels ME and proposed closed-loop schemes.

Figure 2.9 shows the PSNR values (including the Y and UV components) of each frame in the Mobile sequence, obtained by the different transcoding schemes with the quantization parameter (QP) equal to 25. For a subjective evaluation of the results, Figure 2.10 shows the transcoded last frame of the second GOP obtained by the different transcoding schemes with the quantization parameter remaining at 25. We observe that the picture quality of the open-loop structure is significantly improved by the proposed distortion compensation algorithms, and is visually equal to that obtained by the cascaded structures. The picture quality loss caused by the proposed algorithm is mainly due to the re-quantization error calculation.
Figure 2.9 PSNR values (including the Y and UV components) of each frame of the Mobile sequence obtained by different transcoding schemes (a) Mobile with GOP size equal to 15 (Y) (b) Mobile with GOP size equal to 15 (U) (c) Mobile with GOP size equal to 15 (V) (d) Mobile with one I-frame as the first frame and all other 99 frames as P-frames (Y). Each panel plots PSNR (dB) versus frame number with QP = 25 for the cascaded re-using MV, cascaded ±2 pixels ME and proposed closed-loop schemes.

As discussed in Section 2.3, the proposed re-quantization error calculation process may introduce rounding errors. These rounding errors propagate from one frame to the next until the next I frame is reached. In order to show the worst-case scenario, one extreme, non-practical case is tested, where only the first frame is an I frame and the remaining 99 frames are P frames, with the quantization parameter equal to 25. In this extreme case, the average PSNR of the transcoded video of the proposed structure is 3 dB lower than that of the cascaded structure (see Figure 2.9 (d)).

Figure 2.10 Last frame in the second GOP of the Mobile test sequence obtained by different transcoding schemes (a) Open-loop structure (b) Proposed structure (c) Cascaded structure reusing MPEG-2 MVs (d) Cascaded structure performing ±2 pixels MV refinement
Since the performance of the proposed algorithms decreases as the number of P frames increases, a good trade-off between the picture quality and the computational complexity is obtained when the number of P frames in one GOP is between 14 and 29. Experimental results show that when the number of P frames in one GOP is 29, the proposed scheme has only a 0.6 dB loss in picture quality compared to the cascaded structure that reuses MPEG-2 MVs (Mobile sequence with QP equal to 25). When the number of P frames is larger than 29, other transcoding techniques such as Intra-refresh are recommended to stop the rounding errors from accumulating [4].

2.4.3 Comparison and complexity analysis of the proposed compensation algorithms

The above experimental results have shown that our algorithms successfully compensate picture distortions in transform-domain-based transcoding, significantly improving the overall picture quality to levels obtained by cascaded methods. In this section, we offer a complexity analysis of the overall system in order to better understand the advantages and disadvantages of cascaded and transform-domain-based transcoding methods. The complexity analysis takes into account every part used by the tested transcoding schemes (see Table 2.4). In the analysis, the time needed to implement the variable length coding and memory operations is omitted, since these operations are the same in all the tested schemes. The average transcoding execution time of the different schemes is also shown in Table 2.4, with each video sequence transcoded 9 times using 9 different quantization parameters. A 3.2 GHz Intel Pentium 4 CPU with 1 GB of RAM was used for our experiments. Since the codec is not optimized, the execution time cannot be an accurate measure for comparing complexity. Referring to Table 2.4, the operations related to the Y component in one 16x16 MB are listed for all of the six tested transcoding schemes.
We observe that the cascaded structure which implements ±2 pixel motion vector refinement needs the largest number of operations (49024). The cascaded structure which reuses the MPEG-2 MVs needs the second largest number of operations (17280). The proposed transcoding structure that uses the combined compensation scheme needs 15040 operations. We observe that our algorithms, which were designed mainly for improving the picture quality in transform-domain-based transcoding, also achieve computational savings over the cascaded approaches. These savings range from 13% when the cascaded method only reuses MPEG-2 MVs to 69% when ±2 pixel motion vector refinement is used by the cascaded approach. Finally, note that Table 2.4 shows the worst-case scenario for our combined compensation method, where half-pixel interpolation is present. If half-pixel interpolation is not present, our method needs 8384 operations per macroblock, while the cascaded method needs 12160 operations. That corresponds to a reduction of about 30% in computational complexity.

Table 2.4 Operations and PSNR comparison of different transcoding schemes for one 16x16 MB (Y)

The table counts the ADD., MUL. and SHIFT operations of each unit (per 16x16 MB) and marks the units used by each scheme. The units are: DCT-to-HT conversion [9]; MPEG-2 fast IDCT [8]; MPEG-2 half-pixel interpolation; MPEG-2 motion compensation; H.264/AVC reference MB calculation (half-pixel and quarter-pixel interpolation); ±2 pixels MV refinement; Intra MB cost comparison (4 16x16 modes and 9 4x4 modes); H.264/AVC motion compensation; H.264/AVC integer transform [8]; H.264/AVC quantization; H.264/AVC inverse quantization; deblocking filter (taking strength = 2, threshold < β as an example); H.264/AVC inverse integer transform; H.264/AVC MB reconstruction; and the compensation-related units (H.264/AVC coefficients scaling, re-quantization error calculation, inverse integer transform of the re-quantization errors, re-quantization error compensation only, interpolation error compensation only, combined compensation, and integer transform of the re-quantization errors).

Scheme                                          Total operations
Open-loop                                       3520
Cascaded re-using MPEG-2 MVs                    17280
Cascaded ±2 pixels MV refinement                49024
Proposed, combined compensation                 15040
Proposed, re-quantization error only            8384
Proposed, interpolation error only              12992

Complexity analysis of partial compensation

Further experiments were carried out on the luminance component to evaluate and compare the complexity-quality trade-off between the full compensation structure and the compensation of the re-quantization error or the interpolation error only. We observe that when the quantization parameter is small, the interpolation-error-only compensation scheme results in much higher PSNR values (refer to Table 2.4). This means that when small quantization values are used in transcoding, the interpolation error is the dominant distortion.
However, as the quantization parameter increases, the re-quantization error becomes the dominant distortion (the PSNR obtained by compensating for the re-quantization error is higher). In this case, however, the differences in the PSNR values are smaller than those for which the interpolation error is dominant. In terms of computational complexity, compensating for the re-quantization error only uses about 56% of the computations required by the combined compensation scheme, while compensating for the interpolation error alone takes about 86% of the computations of our combined compensation scheme. In terms of picture fidelity, the picture quality resulting from each compensation separately is much worse than that obtained by combining the two compensation schemes. For instance, when only the re-quantization error is compensated for, the picture quality loss caused by the interpolation error is about 8.5 dB (Mobile with QP equal to 30). On the other hand, when only the interpolation error compensation is implemented, the picture quality loss caused by the re-quantization error is about 5 dB. Thus, to achieve the best results, the combined interpolation and re-quantization error compensation approach should be used. However, for MBs with integer motion vectors, only the re-quantization error compensation is needed during transcoding.

Memory requirements

We consider IBBPBBP… as the GOP structure for our memory analysis. In MPEG-2, a P frame uses only one forward reference frame, while a B frame uses two reference frames (one for forward prediction and the other for backward prediction). Thus, reconstructing a B frame requires storing two reference frames. We use the B frame as an example to illustrate the memory requirement (knowing that the P frame needs half the memory).
In the cascaded pixel-domain transcoding structure, in addition to the memory needed to store the quantized coefficients, motion vectors, and the overhead for variable length coding (VLC), two MPEG-2 and two H.264/AVC reference frames should also be stored to reconstruct the B frames. The TM5 software implementation of MPEG-2 uses 8 bits per pixel to store the reference frames. In JM10.2, the H.264/AVC reference frames are stored using 16 bits per pixel. If a frame has S pixels, the total memory needed to store the MPEG-2 and H.264/AVC reference frames for one B frame is 2S + 4S = 6S (bytes). In the proposed algorithm, for each B frame, we need to store two H.264/AVC reference frames and one re-quantization error value for each pixel of these two frames. Because the re-quantization error measures the loss in the magnitude of each pixel in the spatial domain, its maximum magnitude cannot be larger than the pixel value itself, which is 8 bits long in our experiment. Therefore, the magnitude of the re-quantization error can be represented by 8 bits. Since the re-quantization error can be positive or negative, one more bit is needed to indicate its sign. Thus, theoretically, 9 bits are needed to store the re-quantization error of each pixel. If a frame has S pixels, the total memory needed to store the H.264/AVC reference frames and their re-quantization errors for one B frame is $4S + 2 \times \frac{9S}{8} = 6.25S$ (bytes). In real-life applications, it is common for the PSNR value of the transcoded video to be above 30 dB. In this case, the re-quantization error is always within the range [-128, 127], and thus, only 8 bits are needed to store the re-quantization error and its sign for each pixel.
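The byte counts above follow from simple arithmetic, reproduced in the sketch below. The function names and the CIF luminance frame size used as S are illustrative assumptions; the bit depths are those stated in the text.

```python
def cascaded_bytes(s, refs=2, mpeg2_bits=8, h264_bits=16):
    """Reference-frame memory of the cascaded structure for one B frame."""
    return refs * s * (mpeg2_bits + h264_bits) / 8


def proposed_bytes(s, refs=2, h264_bits=16, error_bits=9):
    """H.264/AVC reference frames plus one stored error value per pixel."""
    return refs * s * (h264_bits + error_bits) / 8


S = 352 * 288  # pixels in one CIF luminance frame (example value)
print(cascaded_bytes(S) / S)                 # 6.0
print(proposed_bytes(S) / S)                 # 6.25
print(proposed_bytes(S, error_bits=8) / S)   # 6.0
```

With 9-bit error storage the proposed structure needs 0.25S bytes more than the cascaded one; with 8-bit storage the two totals coincide at 6S bytes.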
This means that for the majority of real-life applications, the memory requirement of the proposed structure is the same as that of the cascaded structure.

2.5 Conclusion and Future Work

We presented a scheme that compensates for distortions caused by re-quantization and interpolation errors inherent in transform-domain-based MPEG-2 to H.264/AVC transcoding. Our algorithms successfully offset mismatches between the MPEG-2 and H.264/AVC motion compensation processes, without a significant increase in computational complexity. Performance evaluations have shown that the proposed algorithms achieve a 5 dB quality improvement over the open-loop transform-domain-based transcoding and almost the same picture quality (0.3 dB to 0.6 dB lower) as the cascaded structures. In addition, our methods result in computational savings that range from 13% compared to the cascaded method that reuses MPEG-2 MVs to 69% compared to the cascaded method that uses ±2 pixel motion vector refinement. As the proposed re-quantization error compensation algorithm is designed to comply with the H.264/AVC integer transform features, it can also be used in transcoding from H.263, MPEG-4 or H.264/AVC to H.264/AVC. The chrominance interpolation method in H.264/AVC is more complex than that in previous video coding standards, and its use in MPEG-2 to H.264/AVC transcoding has not been previously addressed. As the demand for MPEG-2 to H.264/AVC transcoding increases, the proposed structure can provide transcoded streams of high picture quality at a reasonable computational expense. Future work on designing a more advanced rate-control scheme may improve the rate-distortion performance of the transcoded video streams. In the next chapter, we present a study that accelerates the motion re-estimation process for the pixel-domain MPEG-2 to H.264/AVC transcoding structure.
This structure can offer better compression efficiency for the resultant transcoded H.264/AVC videos at the cost of extra computational effort.

2.6 References

[1] "Information technology – Generic coding of moving pictures and associated audio information: Video," ITU-T Recommendation H.262, International Standard 13818-2, 2nd ed., Feb. 2000.
[2] "Advanced video coding for generic audiovisual services," ITU-T Recommendation H.264, International Standard 14496-10, Mar. 2005.
[3] J. Xin, C.W. Lin, and M.T. Sun, "Digital video transcoding," Proceedings of the IEEE, vol. 93, no. 1, pp. 84–97, January 2005.
[4] A. Vetro, C. Christopoulos, and H. Sun, "Video Transcoding Architectures and Techniques: An Overview," IEEE Signal Processing Magazine, vol. 20, no. 2, pp. 19-29, March 2003.
[5] Z. Zhou, S. Sun, S. Lei, and M.T. Sun, "Motion Information and Coding Mode Reuse for MPEG-2 to H.264 Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, May 2005, pp. 1230-1233.
[6] G. Chen, Y. Zhang, S. Lin, F. Dai, "Efficient Block Size Selection for MPEG-2 to H.264 Transcoding," in Proc. 12th ACM Int. Conf. Multimedia, October 2004, pp. 300-303.
[7] X. Lu, A. Tourapis, P. Yin and J. Boyce, "Fast Mode Decision and Motion Estimation for H.264 with a Focus on MPEG-2/H.264 Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, May 2005, pp. 1246-1249.
[8] J. Xin, A. Vetro, and H. Sun, "Converting DCT coefficients to H.264/AVC transform coefficients," Technical Report of Mitsubishi Electric Research Lab., TR-2004-058, June 2004.
[9] G. Chen, S. Lin, Y. Zhang, G. Cao, "An New Coefficients Transform Matrix for the Transform Domain MPEG-2 to H.264/AVC Transcoding," in Proc. IEEE Int. Conf. Multimedia and Expo, July 2006, pp. 321-324.
[10] Y. Sun, J. Xin, A. Vetro, and H. Sun, "Efficient MPEG-2 to H.264/AVC Intra transcoding in transform domain," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, May 2005, pp. 1234–1237.
[11] T. Qian, J. Sun, D.
Li, X. Yang, J. Wang, "Transform Domain Transcoding From MPEG-2 to H.264 With Interpolation Drift-Error Compensation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 523-534, April 2006.
[12] Iain E.G. Richardson, H.264 and MPEG-4 Video Compression - Video Coding for Next-generation Multimedia, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, England, 2003, pp. 174-175.
[13] Y. Liu, Research on Digital Video Transcoding, Tianjin University Ph.D. Dissertation, Tianjin, China, 2005.
[14] H.S. Malvar, A. Hallapuro, M. Karczewicz, L. Kerofsky, "Low-Complexity Transform and Quantization in H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 598-603, July 2003.
[15] Iain Richardson, H.264 / MPEG-4 Part 10 White Paper, Transform and quantization, [online], Available: http://www.vcodex.com.
[16] Q. Tang, P. Nasiopoulos, R. Ward, "An Efficient Re-quantization Error Compensation for MPEG-2 to H.264 Transcoding," in Proc. IEEE Int. Symp. Signal Processing and Information Technol., August 2006, pp.530-353.
[17] MPEG Software Simulation Group, MPEG-2 reference software – Test Model 5 (TM5), [online], Available: http://www.mpeg.org/MPEG/MSSG/.
[18] Joint Video Team, H.264/AVC Reference Software Codec (JM), ver. 10.2, [online], Available: http://iphome.hhi.de/suehring/tml/download/.

Chapter 3: EFFICIENT MOTION RE-ESTIMATION WITH RATE-DISTORTION OPTIMIZATION FOR MPEG-2 TO H.264/AVC TRANSCODING³

3.1 Introduction

Video coding standards have been constantly evolving during the last two decades. MPEG-2 is presently the dominant video coding standard in DTV broadcasting and DVD applications [1]. However, H.264/AVC is quickly gaining ground in many applications due to its higher compression efficiency [2]. Since the two standards are destined to coexist for some time, providing universal multimedia access between them has become an active research area.
Transcoding from one standard to the other is the most cost-effective and attractive way to achieve such universal multimedia access [3]-[4]. Although implementing a multi-standard decoder at the receiver end would also allow playback of different standards, such an approach may be cost and space prohibitive. This may be the case for video-enabled mobile devices, where the use of a transcoder at the server station is the desired solution. Therefore, as the majority of suitable legacy content is in MPEG-2 format, this study concentrates on transcoding from MPEG-2 to H.264/AVC.

H.264/AVC is the latest and most advanced video coding standard developed by the Joint Video Team (JVT). It provides an impressive improvement in compression efficiency compared to the previous standards [5]. The improvement is due to many new video coding features, which include variable block-size motion compensation (VBSMC), multiple reference picture motion compensation (MRPMC), quarter-sample accurate motion compensation, improved "skipped" and "direct" motion inference, the deblocking filter, Intra prediction, and rate-distortion optimization (RDO) [5]-[6]. Unfortunately, the same features that yield such improvements in compression performance are also the cause of a significant increase in computational complexity.

Footnote 3: A version of this chapter has been published. Q. Tang, P. Nasiopoulos, "Efficient Motion Re-Estimation with Rate-Distortion Optimization for MPEG-2 to H.264/AVC Transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 2, pp. 262-274, Feb. 2010.

In an MPEG-2 to H.264/AVC transcoding system, one of the primary objectives is to reduce the computational complexity of the overall system. As H.264/AVC provides many new coding options, it is essential to base them on existing MPEG-2 coding information in order to minimize the overall complexity of a transcoding scheme.
In that respect, a significant part of the overall complexity is due to the H.264/AVC motion estimation process.

In video compression, a block of 16x16 pixels (known as a macroblock (MB)) is the basic unit of the motion estimation process. One MB can be subdivided into smaller blocks, which are known as partitions. In MPEG-2, motion estimation (ME) is relatively simple, using only the 16x16 block size. H.264/AVC, on the other hand, performs motion estimation using a variety of block sizes, which in addition to 16x16 include the 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 block sizes. This feature, referred to as variable block-size motion compensation (VBSMC) [6], yields optimized motion vectors with better matching accuracy and thus higher compression performance [7]. Nevertheless, VBSMC dramatically increases the computational complexity of the motion re-estimation process in MPEG-2 to H.264/AVC transcoding.

Most existing work on MPEG-2 to H.264/AVC transcoding addresses how to accelerate the motion re-estimation process for H.264/AVC encoding by using coding information that already exists in MPEG-2 videos [8]-[13]. In [8] and [9], for instance, thresholds are used to terminate the motion search in H.264/AVC early. However, after those thresholds are reached, an exhaustive full search is still needed for the 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4 block sizes. In [10], machine-learning techniques are used to choose either 16x16 or 8x8 as the partitioning size by calculating the mean and variance of all the 4x4 blocks within each 16x16 macroblock. In [11], the MPEG-2 DCT coefficients are used to predict the block size partitioning choice from the set {16x16, 16x8, 8x16, 8x8}. In [12], the MPEG-2 DCT coefficients are used to decide whether or not an MPEG-2 inter-coded macroblock should be encoded as an Intra macroblock in H.264/AVC.
A motion vector refinement scheme (MotionMapping), which reduces the search range used for finding the best motion vectors in H.264/AVC, is proposed in [13]. Among the above cases, the MotionMapping scheme achieves the least computational complexity since the search range is greatly reduced. However, the authors in [13] chose not to consider the use of a fast block size partitioning prediction method.

In this study, we propose an efficient block size partitioning prediction algorithm for MPEG-2 to H.264/AVC transcoding applications. Our algorithm uses rate-distortion optimization techniques and predicted initial motion vectors to estimate block size partitioning for H.264/AVC. Rate-distortion optimization (RDO) techniques and fast block size partitioning prediction have not been considered together in other state-of-the-art methods [8]-[12]. In addition to our proposed fast block size partitioning algorithm, we also illustrate that using block size partitioning smaller than 8x8 (i.e., 8x4, 4x8 and 4x4) results in negligible compression improvements, and thus these sizes should be avoided in MPEG-2 to H.264/AVC transcoding.

Experimental results show that our transcoder yields a rate-distortion performance that is almost the same as that of the MotionMapping scheme [13]. At the same time, the computational complexity is significantly reduced, requiring 29% of the computations used by the MotionMapping algorithm, which itself requires the least computations among the other state-of-the-art algorithms. Compared to the full-search scheme, our algorithm reduces the computational complexity by about 99.47% for SDTV sequences and 98.66% for CIF sequences. Compared to UMHexagonS, the fast motion estimation algorithm used in H.264/AVC, experimental results have shown that our proposed algorithm offers a better trade-off between computational complexity and picture quality.

The rest of this chapter is structured as follows.
Section 3.2 explains in detail the advantages of the proposed algorithms. Performance evaluations and computational complexity analysis are presented in Section 3.3. Finally, conclusions are drawn in Section 3.4.

3.2 Proposed Efficient Motion Re-estimation with Rate-Distortion Optimization

3.2.1 RDO-based block size partitioning prediction for P/B frames

The block size partitioning choice selects the best block size for H.264/AVC motion compensation from a set of 7 block sizes (i.e., 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4). When the partitioning choice uses a block size smaller than 16x16, it creates more than one partition (block) for one 16x16 MB (shown in Figure 3.1), each of which has its own independent motion vector. Choosing the best block size as the partitioning choice can be a computationally expensive process.

Figure 3.1 Variable block sizes supported by the H.264/AVC motion estimation process

In MPEG-2 to H.264/AVC transcoding, the block size partitioning selection process can be accelerated by using coding information that exists in MPEG-2 videos. For instance, using the MPEG-2 motion vectors (MVs) as initial motion vectors for H.264/AVC encoding significantly reduces the search range for estimating the final motion vectors. Using these initial motion vectors, an accurate block size partitioning choice can be achieved through rate-distortion optimization techniques. In our work, we propose to predict the block size partitioning choice for MBs in P/B frames using Lagrangian techniques. In this case, the block size partitioning choice aims at minimizing the Lagrangian cost J = D + λ × R (with D measuring the picture quality of the reconstructed video obtained after encoding and decoding the original video, R measuring the bits needed to encode the corresponding MB, and λ being the Lagrangian multiplier, which is related to the quantization parameter).
Our proposed algorithm consists of two steps, which are explained in the following two sub-sections.

Determining the initial motion vector

The first step is to find the initial motion vector, which will be used as the initial search point to perform the motion estimation for each partition in the H.264/AVC encoding process. The MotionMapping scheme proposed in [13] derives the initial search point by using the motion vectors of the surrounding MPEG-2 MBs for each partition. In our scheme, in addition to those vectors, we also consider the H.264/AVC predicted motion vector as another candidate for the initial search point.

As for the B frames, the method in [13] is limited to the options offered by MPEG-2 when deriving the initial motion vectors (i.e., it does not exploit the additional prediction options supported by H.264/AVC). In a B frame, motion vectors have three predictive direction modes: forward prediction, backward prediction and bi-directional prediction. During transcoding, if all the surrounding MPEG-2 MBs have only forward-predicted MVs, the H.264/AVC encoder cannot obtain an initial backward-predicted MV from the surrounding MPEG-2 MBs. To let the H.264/AVC encoder explore more prediction modes in a B frame, we propose to initialize forward-predicted and backward-predicted MVs for every MB in a frame, regardless of which predictive direction mode the corresponding MPEG-2 MB uses. The following equation shows the proposed initialization procedure (see footnote 4):

$$ (MV_{forward},\ MV_{backward}) = \begin{cases} (MV_{current},\ 0) & \text{if } MV_{current} \text{ is a forward-predicted MV} \\ (0,\ MV_{current}) & \text{else if } MV_{current} \text{ is a backward-predicted MV} \\ (MV_{current\_forward},\ MV_{current\_backward}) & \text{else (} MV_{current} \text{ is a bi-directional-predicted MV)} \end{cases} \tag{3-1} $$

Footnote 4: Experiments have confirmed that the choice of the bi-directional predicted MV, whether zero or from the reverse direction to that currently used, does not critically affect the success of the algorithm.
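As an illustration of (3-1), the minimal Python sketch below initializes both directional MVs for one MB. The function name init_b_frame_mvs, the tuple representation of a motion vector, and the use of None for a direction that the MPEG-2 MB does not use are assumptions made for this example; they are not taken from the implementation described in this chapter.

```python
from typing import Optional, Tuple

MV = Tuple[int, int]      # (x, y) displacement; this representation is an assumption
ZERO_MV: MV = (0, 0)


def init_b_frame_mvs(mv_fwd: Optional[MV], mv_bwd: Optional[MV]) -> Tuple[MV, MV]:
    """Return (MV_forward, MV_backward) following (3-1).

    A direction that the MPEG-2 MB does not use is passed as None and is
    initialized to the zero vector, so both directions are always defined.
    """
    forward = mv_fwd if mv_fwd is not None else ZERO_MV
    backward = mv_bwd if mv_bwd is not None else ZERO_MV
    return forward, backward


# Forward-only, backward-only and bi-directional MPEG-2 MBs (toy values).
print(init_b_frame_mvs((3, -2), None))
print(init_b_frame_mvs(None, (-1, 4)))
print(init_b_frame_mvs((3, -2), (-1, 4)))
```

With both directions always populated, a later stage can evaluate forward, backward and bi-directional prediction for every MB, whatever mode MPEG-2 selected.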
This initialization procedure enables us to derive an initial search point for each prediction mode (forward, backward or bi-directional prediction) using surrounding MPEG-2 MVs when encoding MBs in B frames during the H.264/AVC encoding process. In summary, for both P and B frames, our proposed algorithm selects the initial search motion vector from a larger pool of candidates.

Since in our algorithm there is more than one candidate to be considered as the initial search point for each partition, we use the Lagrangian cost to decide which one to use. Since this Lagrangian cost is used to find the best motion vector, we use JMOTION to represent the cost. The equation is given below:

$$ \min \{ J_{MOTION}(B_k, V_{INIT} \mid \lambda_{MOTION}) = SATD(B_k, V_{INIT}) + \lambda_{MOTION} R_{MOTION}(B_k, V_{INIT}) \} \tag{3-2} $$

where Bk represents the index of partitions (blocks) in an MB. VINIT = {MVm, MVh}, where MVm is the initial search-point candidate derived using motion vectors from surrounding MPEG-2 MBs, and MVh is the initial search-point candidate which is equal to the H.264/AVC predicted motion vector. Note that in B frames, MVm includes two initial search-point candidates, one for forward prediction and the other for backward prediction. The same applies to MVh. RMOTION represents the number of bits needed for encoding the corresponding motion vectors. SATD is the sum of the absolute transformed differences between the original block and the predicted block, using VINIT in the reference frame. The details of the SATD calculation are explained in the next step of our algorithm. λMOTION is the Lagrangian multiplier, which is related to the quantization parameter and frame type (I, P, and B). This multiplier is calculated as follows:

$$ \lambda_{MOTION,I,P} = 0.85 \times 2^{(QP-12)/3} \tag{3-3} $$

$$ \lambda_{MOTION,B} = \max\left(2, \min\left(4, \frac{QP-12}{6}\right)\right) \times \lambda_{MOTION,I,P} \tag{3-4} $$

where QP stands for the quantization parameter used for encoding the current MB. The initial search-point candidate which yields the least JMOTION is chosen as the best initial motion vector.
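The selection rule in (3-2) can be sketched in a few lines of Python. This is only an illustration: select_initial_mv, the callables satd and mv_bits, and the toy cost values are hypothetical names and numbers introduced here, and lambda_motion is passed in by the caller rather than computed.

```python
from typing import Callable, Iterable, Tuple

MV = Tuple[int, int]


def select_initial_mv(candidates: Iterable[MV],
                      satd: Callable[[MV], float],
                      mv_bits: Callable[[MV], float],
                      lambda_motion: float) -> MV:
    """Pick the candidate minimizing J_MOTION = SATD + lambda_MOTION * R_MOTION, as in (3-2)."""
    return min(candidates, key=lambda mv: satd(mv) + lambda_motion * mv_bits(mv))


# Toy numbers (hypothetical): MV_m is derived from surrounding MPEG-2 MBs,
# MV_h is the H.264/AVC predicted motion vector.
mv_m, mv_h = (4, 0), (1, 1)
toy_satd = {mv_m: 900.0, mv_h: 1000.0}   # matching error of each candidate
toy_bits = {mv_m: 12.0, mv_h: 2.0}       # bits needed to code each candidate

best = select_initial_mv([mv_m, mv_h], toy_satd.__getitem__, toy_bits.__getitem__,
                         lambda_motion=4.0)
print(best)
```

In this toy setting the MPEG-2-derived candidate wins at a small multiplier because its matching error is lower, while a large multiplier favours the cheaper-to-code H.264/AVC predictor.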
After the initial search point is decided following the above procedure, a motion vector refinement process with a search range equal to [-1.75 +1.75] is adequate for finding the best H.264/AVC motion vector [13]. In [13], four block sizes {16x16, 16x8, 8x16 and 8x8} are used as the partitioning candidates. Each partitioning candidate finds its own initial search points and then performs the motion vector refinement process to find its final motion vectors. Therefore, the motion vector refinement process has to be repeated four times to find the best motion vectors for all four block size partitioning candidates. To reduce the computational complexity, we propose to predict the block size partitioning choice before performing the motion vector refinement process.

Proposed block size partitioning prediction using the rate-distortion optimization method

Different partitioning choices correspond to different encoding modes of Inter MBs in H.264/AVC. To be consistent with other literature, we also use the term mode selection to represent the block size partitioning choice when describing the related Lagrangian techniques in this study. Based on the Lagrangian RDO theory, the following equation should be used to determine the block size partitioning choice (or mode decision) in H.264/AVC:

$$ \min \{ J_{MODE}(S_k, I_k \mid QP, \lambda_{MODE}) = D_{REC}(S_k, I_k \mid QP) + \lambda_{MODE} R_{REC}(S_k, I_k \mid QP) \} \tag{3-5} $$

where Sk represents the index of MBs in a video frame and Ik represents the set of possible block size partitioning candidates such as 16x16, 16x8, 8x16 and 8x8. QP refers to the quantization parameter used to encode MB Sk. DREC refers to the distortion between the original MB and the reconstructed MB obtained by encoding and decoding the original MB. RREC refers to the number of bits used to encode the current MB. λMODE is the Lagrangian multiplier, which is equal to the square of λMOTION. The best block partition candidate should yield the least JMODE.
Calculating the accurate DREC and RREC requires significant computational efforts. For instance, calculating the accurate RREC requires a complete encoding process, including transform, quantization and entropy coding. This means that we have to completely encode each MB many times (one for each partitioning candidate), although only one partitioning candidate is finally chosen. This method, therefore, is computationally expensive.

To reduce computational complexity, we propose to predict the block size partitioning choice using our proposed rate-distortion model. We based our rate-distortion model on the rate and distortion models proposed in [14], with some necessary modifications for adapting the general cases to the specific needs of MPEG-2 to H.264/AVC transcoding. The updated distortion model used in our algorithm is given by:

$$ D_{REC}(S_k, I_k \mid QP) = -b_1 \times \log_{10}\left( \frac{SATD(S_k, I_k, V_{INIT})}{256} + 1 \right) \times QP + b_2 \tag{3-6} $$

where b1, b2 are model parameters. VINIT stands for the chosen initial motion vector and SATD(Sk, Ik, VINIT) is the sum of absolute transformed differences between the original video blocks and the predicted video blocks, using initial motion vectors for each block. The SATD is calculated as follows:

$$ SATD(S_k, I_k, V_{INIT}) = \sum_{m=1}^{16} \frac{\left| H \, [\, s(m) - c(m, V_{INIT}(I_k)) \,] \, H^{T} \right|}{2} \tag{3-7} $$

where m is the index of the 4x4 blocks, the basic units of the H.264/AVC integer transform (one 16x16 MB having 16 4x4 blocks). s(m) and c(m, VINIT(Ik)) are the original block and its predicted block located using VINIT(Ik) in the reference frame. H is the kernel matrix of Hadamard transform. This SATD value is obtained right after motion compensation. This approach reduces the computational complexity compared to computing the SATD value between the original and reconstructed videos, a common practice used in other fast rate-distortion optimization methods. In summary, using the SATD and QP values, the distortion model measures the distortion DREC in PSNR.
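To make (3-7) concrete, the following pure-Python sketch computes the SATD of one 16x16 MB from its original and motion-compensated predicted samples. It assumes the common 4x4 Hadamard kernel with entries of +1 and -1 and takes |.| as the sum of absolute values of the transformed coefficients; the function names and the list-of-lists sample layout are choices made for this example only.

```python
# 4x4 Hadamard kernel (assumed form); it is symmetric, so H^T equals H.
H = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, -1, 1],
     [1, -1, 1, -1]]


def _matmul4(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd_4x4(orig, pred):
    """Sum of |H (orig - pred) H^T| over one 4x4 block, before the division by 2."""
    diff = [[orig[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    transformed = _matmul4(_matmul4(H, diff), H)   # H is symmetric, so H^T = H
    return sum(abs(v) for row in transformed for v in row)


def satd_mb(orig, pred):
    """SATD of a 16x16 MB as in (3-7): sum over its sixteen 4x4 blocks, halved."""
    total = 0
    for by in range(0, 16, 4):
        for bx in range(0, 16, 4):
            o = [row[bx:bx + 4] for row in orig[by:by + 4]]
            p = [row[bx:bx + 4] for row in pred[by:by + 4]]
            total += satd_4x4(o, p)
    return total / 2


flat = [[0] * 16 for _ in range(16)]
offset = [[1] * 16 for _ in range(16)]   # prediction error of 1 at every pixel
print(satd_mb(offset, flat))
```

Because the predicted block comes straight from motion compensation with the initial vector, no transform, quantization or reconstruction of the MB is needed to evaluate this quantity.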
The use of SATD makes the updated distortion model different from the original distortion model. In (3-6), SATD(Sk, Ik, VINIT)/256 calculates the mean value of the absolute transformed differences (MATD) between the original video and predicted video blocks. In the original model, instead of using MATD, the mean of absolute difference (MAD) is used, which is estimated from the MAD of the previous frame. One drawback is that when the current frame has local scene changes, the estimation may become inaccurate. In addition, in general, the difference of transformed coefficients yields better rate-distortion optimization performance than the difference in the spatial domain. For these reasons, we use the initial motion vectors of the current MB to calculate MATD. Theoretically, the MATD calculation should use the final motion vector of each block size partitioning candidate. However, in MPEG-2 to H.264/AVC transcoding, since the final motion vector of each partitioning choice can be obtained through a small search window ([-1.75 +1.75]) starting from the initial search point, the probability of the initial motion vector being very close to the final motion vector is very high. This is why our distortion model yields very good results.

The updated rate model in our proposed algorithm is given by the following equation:

$$ R_{REC}(S_k, V_{INIT} \mid QP) = c_2 \times \frac{SATD(S_k, I_k, V_{INIT})}{256} \times 2^{-QP/6} \tag{3-8} $$

where c2 is the model parameter. As in the distortion model, instead of estimating the MAD from previous frames, we use SATD(Sk, Ik, VINIT)/256 and calculate the MATD by using the initial motion vectors of the current MB. This rate model estimates the number of bits needed for encoding the texture information of the current MB using SATD and QP. Note that in low bit-rate scenarios, the bits for encoding motion vectors could represent a large portion of the total number of bits needed to encode the current MB.
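The two models (3-6) and (3-8) are simple enough to evaluate by hand. The short Python sketch below does so for one illustrative operating point; the function names d_rec and r_rec are introduced here for illustration, the default parameter values are the practical values quoted later in this section (b1 = 0.52, b2 = 47, c2 = 64), and the SATD and QP inputs are made-up numbers.

```python
import math

B1, B2, C2 = 0.52, 47.0, 64.0   # practical parameter values quoted later in this section


def d_rec(satd: float, qp: float, b1: float = B1, b2: float = B2) -> float:
    """Distortion model (3-6); the result is on a PSNR-like scale."""
    return -b1 * math.log10(satd / 256.0 + 1.0) * qp + b2


def r_rec(satd: float, qp: float, c2: float = C2) -> float:
    """Texture-rate model (3-8); SATD / 256 is the MATD of the macroblock."""
    return c2 * (satd / 256.0) * 2.0 ** (-qp / 6.0)


# Illustrative inputs: SATD = 2304 gives MATD = 9, so log10(MATD + 1) = 1.
satd_value, qp_value = 2304.0, 30
print(d_rec(satd_value, qp_value), r_rec(satd_value, qp_value))
```

Both functions need only the SATD obtained after motion compensation and the QP, which is what allows every partitioning candidate to be scored without a full encode.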
We estimate the bits needed to encode motion vectors for the different partitions using the following equation:

$$ R_{MV}(S_k, I_k) = \sum_{i=1}^{N(I_k)} \left| \max\left( MVX_i(S_k, I_k),\ MVY_i(S_k, I_k) \right) \right| + a \tag{3-9} $$

where N(Ik) refers to the number of motion vectors, which is equal to the number of partitions under each block size partitioning candidate (N(Ik) = 1 for 16x16 partitioning and N(Ik) = 4 for 8x8 partitioning). a is a constant equal to 1 that compensates for zero-value motion vectors. MVXi and MVYi represent the values of the i-th motion vector along the x and y axes, respectively.

Based on Equations (3-6)-(3-9), our final proposed rate-distortion model is given in (3-10). To calculate JMODE, besides SATD(Sk, Ik, VINIT) and QP, we also need to know the model parameters b1, b2 and c2. The paragraph following (3-10) explains how we choose the model parameters in our proposed algorithm.

$$ J_{MODE}(S_k, I_k, V_{INIT} \mid QP, \lambda_{MODE}) = -D_{REC}(S_k, I_k, V_{INIT} \mid QP) + \lambda_{MODE} \left[ R_{REC}(S_k, I_k, V_{INIT} \mid QP) + R_{MV}(S_k, I_k) \right] $$

$$ = b_1 \times \log_{10}\left( \frac{SATD(S_k, I_k, V_{INIT})}{256} + 1 \right) \times QP - b_2 + \lambda_{MODE} \times \left[ c_2 \times \frac{SATD(S_k, I_k, V_{INIT})}{256} \times 2^{-QP/6} + R_{MV}(S_k, I_k) \right] \tag{3-10} $$

In the previous models, the values of these model parameters are video-content dependent. However, we propose to fix the values of these parameters in our model. When we calculate JMODE for different block size partitions, the calculations take place within the same macroblock. In this scenario, b1, b2 and c2 are constants. As a result, by knowing the values of SATD(Sk, Ik, VINIT) and QP, we can determine which block size partition gives the least JMODE. In other words, there is no need to calculate the absolute value of JMODE, since the relative value suffices for finding the lowest cost. Suggested practical values of b1, b2 and c2 are 0.52, 47 and 64, respectively. Note that c2 cannot be too small; otherwise the value of RREC(Sk, VINIT|QP) will always be a fractional number, which cannot give a correct value for JMODE.
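A small Python sketch may help to see how (3-9) and (3-10) combine into one cost per partitioning candidate. It is an illustration under stated assumptions: the constant a is added once per motion vector (that is, inside the sum), which is one reading of (3-9); the SATD values, motion vectors and the value of lambda_mode are invented for the example; and the names r_mv and j_mode are hypothetical.

```python
import math


def r_mv(mvs, a=1):
    """Motion-vector rate estimate (3-9): |max(MVX, MVY)| + a for each partition.

    Adding `a` once per motion vector is an assumption of this sketch.
    """
    return sum(abs(max(x, y)) + a for x, y in mvs)


def j_mode(satd, qp, mvs, lambda_mode, b1=0.52, b2=47.0, c2=64.0):
    """Mode cost (3-10): -D_REC + lambda_MODE * (R_REC + R_MV)."""
    matd = satd / 256.0
    neg_d_rec = b1 * math.log10(matd + 1.0) * qp - b2      # -D_REC from (3-6)
    r_texture = c2 * matd * 2.0 ** (-qp / 6.0)              # R_REC from (3-8)
    return neg_d_rec + lambda_mode * (r_texture + r_mv(mvs))


# Hypothetical SATD values and initial MVs for two candidates of the same MB.
candidates = {
    "16x16": (2304.0, [(2, 1)]),
    "8x8": (2200.0, [(2, 1), (2, 0), (3, 1), (2, 2)]),
}
costs = {name: j_mode(satd, 30, mvs, lambda_mode=1.0)
         for name, (satd, mvs) in candidates.items()}
print(min(costs, key=costs.get))
```

In this toy case the 8x8 candidate has a slightly lower SATD, but the extra motion-vector bits outweigh that gain, so the 16x16 candidate has the lower cost. Since b2 shifts every candidate by the same amount, only the relative costs matter, as noted above.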
Note that in the final rate-distortion model (3-10), we use the negative value of DREC(Sk, Ik, VINIT|QP), since DREC measures the distortion in PSNR and thus lower values of -DREC(Sk, Ik, VINIT|QP) correspond to less distortion. Therefore, the best block partition Ib is chosen using the following minimization equation:

$$ I_b = \arg\min_{I_k \in I} J_{MODE}(S_k, I_k, V_{INIT} \mid QP, \lambda_{MODE}) \tag{3-11} $$

The original λMODE was designed for the case where the distortion is measured using the sum of squared errors (SSE). Since our distortion is measured in PSNR, the λMODE in our model (3-10) should be different from the original one. Since λMODE is a parameter related to the quantization parameter (QP), in order to calculate λMODE in J = D + λMODE×R, we calculate the derivative of J with respect to QP and let dJ/dQP = dD/dQP + λMODE×dR/dQP = 0 to obtain the minimum value of J. Therefore, λMODE = -dD/dR, and the λMODE in our model is simplified as the following equation:

$$ \lambda_{MODE} = \frac{6 \times b_1 \times 2^{QP/6}}{\log(2) \times c_2} \tag{3-12} $$

After the best partition is chosen, the final step is to use the best initial search point to perform a [-1.75 +1.75] motion vector refinement in order to find the best motion vector.

In summary, the entire H.264/AVC motion re-estimation process in our proposed algorithm consists of three steps. The first step involves determining the best initial search point for finding the final H.264/AVC motion vectors (3-2). The second step is to use (3-11) to predict the best block size partition. The last step is to refine the motion vector using a search window of [-1.75 +1.75] pixels.

Although our proposed algorithm significantly reduces the computational complexity of block size partitioning, using all 7 block sizes {16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4} during transcoding is still time-consuming.
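Returning to (3-12), the multiplier can be tabulated directly from QP once b1 and c2 are fixed. The Python sketch below does this under one assumption that should be stated plainly: log(2) is taken to be the natural logarithm, which is what differentiating 2^(-QP/6) with respect to QP would produce. The function name lambda_mode and the chosen QP values are illustrative only.

```python
import math


def lambda_mode(qp: float, b1: float = 0.52, c2: float = 64.0) -> float:
    """Lagrangian multiplier of (3-12), assuming log denotes the natural logarithm."""
    return 6.0 * b1 * 2.0 ** (qp / 6.0) / (math.log(2.0) * c2)


# The multiplier doubles every 6 QP steps, mirroring the 2^(-QP/6) term of (3-8).
for qp in (24, 30, 36):
    print(qp, round(lambda_mode(qp), 3))
```

Because the multiplier depends only on QP and the fixed model parameters, it can be precomputed once per QP value rather than per MB.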
Our next objective is to demonstrate that, by limiting the block size partitioning to some of these block sizes, the computational complexity of the overall transcoding system is further reduced without affecting the compression performance of the transcoded H.264/AVC videos.

3.2.2 Limiting block size partitioning to 8x8

Searching the entire range of block size partitioning candidates (i.e., 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4) may be considered counterproductive, since one of the main objectives in transcoding is to reduce the computational complexity. It is, thus, desirable to limit the search effort without sacrificing compression performance.

An interesting coding mode analysis of MPEG-2 to H.264/AVC transcoding is presented in [15]. This work shows that the combination of using only one reference frame, the 16x16 block size and the deblocking filter, while turning off the rate-distortion optimization (RDO) option, achieves the best balance between computational complexity and picture quality. However, two important factors are not addressed in [15]. First, since only the baseline H.264/AVC encoder is tested, the effect B frames have on compression is not studied. Using B frames is a common practice in DTV applications. Second, the test sequences and MPEG-2 encoder used do not accurately represent real-life applications.

In order to expand on [15] and seek a better selection of H.264/AVC coding modes, we use several real-life digital TV broadcasting video streams (standard DTV with resolution equal to 720x576 or 704x480), which are encoded using a commercial MPEG-2 encoder. These DTV streams were compressed at high bit-rates, ranging from 3 Mbits/sec to 8 Mbits/sec, to ensure high picture quality, and represent content that is widely used in real-life applications, such as movies, commercials, music videos and sports events.
In our experiments, these MPEG-2 video streams, which are all encoded using the main profile and main level, are decoded back into the pixel domain and re-encoded using the official H.264/AVC reference software JM14.2 [16] (main profile). Table 3.1 shows the different search levels and some coding parameters used in our tests. The remaining coding parameters follow the simulation common conditions proposed by the JVT group [17].

Table 3.1 Search level and coding parameters of different tests

Search Level | Number of ref. frames | Rate Distortion Optimization (RDO) | GOP Structure | Deblocking Filter
FullSearch | 1 for P, 2 for B | On / Off | IBBP | On
NotBelow8x8 | 1 for P, 2 for B | On / Off | IBBP | On
16x16 | 1 for P, 2 for B | On / Off | IBBP | On

Regarding the search levels, three sets of block size partitioning candidates were examined in our experiments: 1) {16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4} (FullSearch), 2) {16x16, 16x8, 8x16, 8x8} (NotBelow8x8), and 3) {16x16} (16x16). For each set, we turned the rate-distortion optimization option on and off, and both results were examined. Our tests mainly aimed at finding how rate-distortion optimization and different block size partitions affect the compression performance of the transcoded videos.

Figure 3.2 shows the rate-distortion curves obtained using different sets of block size partitioning candidates during MPEG-2 to H.264/AVC transcoding. In order to clearly show the objective differences between RD curves, we measure these differences using the average PSNR differences explained in [18] (a related tool may be found in [19]). Using the method in [18], the average PSNR differences of the RD curves of Figure 3.2 are calculated and the results of four video sequences (i.e., Music Video, Commercial, Sports Scene, and Movie) are shown in Table 3.2.
The last row of Table 3.2 shows the average result for the four video sequences.

Figure 3.2 Rate-distortion curves (PSNR in dB versus bit-rate in kbps) resulting from different transcoding schemes over different videos: (a) Music Video (b) Commercial (c) Sports Scene (d) Movie

Table 3.2 Average PSNR differences of different rate-distortion curves (dB). Each value is obtained by comparing the selected scheme with FullSearch / RDO on.

Test Sequences | ΔPSNR of FullSearch RDO off | ΔPSNR of NotBelow8x8 RDO on | ΔPSNR of NotBelow8x8 RDO off | ΔPSNR of 16x16 RDO on | ΔPSNR of 16x16 RDO off
Music Video | -0.64 dB | -0.01 dB | -0.54 dB | -0.11 dB | -0.59 dB
Commercial | -0.68 dB | -0.02 dB | -0.61 dB | -0.33 dB | -0.87 dB
Sports Scene | -0.86 dB | -0.004 dB | -0.74 dB | -0.13 dB | -0.85 dB
Movie | -0.60 dB | 0.01 dB | -0.56 dB | -0.21 dB | -0.72 dB
Avg. of Four Sequences | -0.70 dB | -0.01 dB | -0.61 dB | -0.20 dB | -0.76 dB

First, we observe that when the rate-distortion optimization option is disabled (off), the drop in PSNR quality is not negligible (average 0.6 dB). It is, therefore, recommended that the rate-distortion optimization option is always turned on.
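As a quick arithmetic check, the last row of Table 3.2 can be reproduced from the four per-sequence rows. The Python sketch below simply averages each column; the variable names are introduced here for illustration, and the values are copied from the table.

```python
# Per-sequence average PSNR differences (dB) from Table 3.2, relative to FullSearch / RDO on.
columns = ["FullSearch RDO off", "NotBelow8x8 RDO on", "NotBelow8x8 RDO off",
           "16x16 RDO on", "16x16 RDO off"]
rows = {
    "Music Video":  [-0.64, -0.01, -0.54, -0.11, -0.59],
    "Commercial":   [-0.68, -0.02, -0.61, -0.33, -0.87],
    "Sports Scene": [-0.86, -0.004, -0.74, -0.13, -0.85],
    "Movie":        [-0.60, 0.01, -0.56, -0.21, -0.72],
}

# Column-wise mean over the four sequences.
averages = [sum(col) / len(col) for col in zip(*rows.values())]

for name, avg in zip(columns, averages):
    print(f"{name}: {avg:.3f} dB")
```

The column means agree with the tabulated averages to within rounding, and they make the two patterns discussed in the text visible: every RDO-off column loses roughly 0.6 to 0.8 dB, while NotBelow8x8 with RDO on stays within about 0.01 dB of the reference.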
Secondly, the differences in PSNR quality between the FullSearch and using only 4 block size partitions (NotBelow8x8) are negligible (average 0.01 dB). The main reason behind this result is the fact that the MPEG-2 video streams had already gone through a quantization process which removed a significant amount of video detail. The smaller block size partitions do not improve the performance, since they work well only on blocks containing a significant amount of detail. Combining this finding with the fact that block sizes smaller than 8x8 need many more bits for storing the corresponding motion vectors allows us to conclude that the overall cost savings from using these blocks become insignificant. Although our experiments use SDTV sequences, similar results have been reported for CIF format video sequences in [20], our previous work, which differs from the present one in that its fast block size partitioning prediction approach does not consider rate-distortion optimization (RDO) and B frames. As shown above, the use of RDO significantly affects the compression performance of the transcoded H.264/AVC videos. Finally, we observed that using only 16x16 block size partitioning results in significant deterioration in picture quality when compared to using partition sizes down to 8x8 (the average drop in PSNR value is 0.2 dB).

We could summarize our findings as follows:
1) The rate-distortion optimization option should always be turned on.
2) There is no need for using block size partitions below 8x8.

Based on the above findings, our proposed MPEG-2 to H.264/AVC transcoding algorithm uses only four block size partitioning candidates (i.e., 16x16, 16x8, 8x16 and 8x8). Compared to using all 7 block size partitioning candidates, the computational complexity of the overall transcoding system is reduced without affecting the compression performance.
This algorithm, together with our proposed block size partitioning prediction algorithm, achieves further reduction of the computational complexity of the motion re-estimation process. A transcoding structure which utilizes our proposed algorithms is presented in the next sub-section.

3.2.3 Our proposed cascaded pixel-domain transcoding structure

Generally speaking, MPEG-2 to H.264/AVC transcoding structures can be categorized in two types, which address two different issues and have different objectives. The first type is the DCT-domain transcoder (DDT), which is based on reusing the MPEG-2 DCT coefficients, aiming at achieving the lowest computational complexity. This type of transcoding deals mainly with the inherent distortions in reusing MPEG-2 DCT coefficients [21]-[22]. The other type of transcoding uses the cascaded pixel-domain transcoder (CPDT), which performs a motion re-estimation process to find new motion information that leads to better compression. Although motion re-estimation could be done in the transform domain for the standards previous to H.264/AVC, the introduction of the deblocking filter in H.264/AVC makes its motion estimation a non-linear process and, thus, motion estimation cannot be performed in the transform domain any more. In addition, implementing motion re-estimation in the transform domain involves matrix multiplication, an operation of high computational complexity [23]. For the above reasons, most existing methods that address this type of transcoding are based on the cascaded pixel-domain transcoding structure [8]-[13].

Our proposed MPEG-2 to H.264/AVC transcoding structure for P/B frames is shown in Figure 3.3 and, as expected, it belongs to the cascaded pixel-domain transcoder (CPDT). This is different from our previous work presented in [22] which falls in the DDT category (as explained above).
The method presented in [22] aims at compensating for the inherent distortions caused by re-using the residue data and the corresponding motion vectors from MPEG-2 videos during transcoding. The method proposed here, on the other hand, aims at reducing the computational complexity of finding new H.264/AVC motion vectors and the corresponding partitioning block size.

Figure 3.3 Proposed cascaded pixel-domain transcoding structure for P/B frames

In Figure 3.3, the gray shaded blocks show our proposed algorithms. In our implementation, blocks that were coded in Skip or Intra mode by MPEG-2 keep the same mode in H.264/AVC. For the rest of the blocks, we first limit the block size partitioning choice to four block sizes (i.e., 16x16, 16x8, 8x16, and 8x8) in H.264/AVC. Then, the H.264/AVC predicted motion vector and the motion vectors of the surrounding MPEG-2 MBs are chosen as candidates for finding the best initial search point. Afterwards, the best initial search point is used to predict the block size partitioning choice based on our proposed rate and distortion models. After the best block size partitioning choice is selected, a motion vector refinement ([-1.75 +1.75]) yields the final motion vector. As a final step, for all these blocks we allow the H.264/AVC encoder to check the cost of encoding them using Intra and Skip mode. If the cost of using Intra or Skip mode is lower, that mode is chosen as the final encoding mode for the current MB. We allow H.264/AVC to choose the Intra/Skip mode for the blocks that were not coded as such by MPEG-2 because the percentage of MBs which are encoded using Intra or Skip mode in H.264/AVC is much higher than that of MPEG-2 [12]. In summary, our proposed transcoding algorithm aims at offering the best trade-off between computational complexity and picture quality compared to the other existing methods.
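The per-MB decision flow described above can be summarized as a short control-flow sketch in Python. Everything in it is hypothetical scaffolding: transcode_mb and its callable parameters (pick_initial_mvs, predict_cost, refine) stand in for the encoder stages and are not the interface of the reference software, and the Intra and Skip costs are passed in as plain numbers for simplicity.

```python
PARTITIONS = ("16x16", "16x8", "8x16", "8x8")   # candidates kept after Section 3.2.2


def transcode_mb(mpeg2_mode, pick_initial_mvs, predict_cost, refine,
                 intra_cost, skip_cost):
    """Decide the H.264/AVC mode of one P/B-frame MB; returns (mode, motion vectors)."""
    if mpeg2_mode in ("skip", "intra"):
        return mpeg2_mode, None                   # mode reused from MPEG-2

    # Step 1: best initial search point(s) for every candidate, cf. (3-2).
    init = {p: pick_initial_mvs(p) for p in PARTITIONS}
    # Step 2: predict the partitioning from the modelled cost, cf. (3-11).
    best = min(PARTITIONS, key=lambda p: predict_cost(p, init[p]))
    # Step 3: a single [-1.75, +1.75] refinement, only for the predicted partition.
    mvs, inter_cost = refine(best, init[best])

    # Final check against Intra and Skip, which H.264/AVC selects more often.
    options = {best: inter_cost, "intra": intra_cost, "skip": skip_cost}
    final = min(options, key=options.get)
    return final, (mvs if final == best else None)


# Toy stand-ins for the encoder stages (all values are made up).
refine_calls = []

def toy_refine(partition, mvs):
    refine_calls.append(partition)
    return mvs, 100.0

toy_costs = {"16x16": 5.0, "16x8": 3.0, "8x16": 4.0, "8x8": 6.0}
result = transcode_mb("inter",
                      pick_initial_mvs=lambda p: [(1.0, 0.0)],
                      predict_cost=lambda p, mvs: toy_costs[p],
                      refine=toy_refine,
                      intra_cost=250.0, skip_cost=180.0)
print(result, refine_calls)
```

The point of the sketch is the call count: the refinement stage runs once per MB for the predicted partition, instead of once for each of the four candidates.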
Table 3.3 lists the key differences between our scheme and MotionMapping, the MPEG-2 to H.264/AVC transcoding scheme with the best performance among the existing methods. One main contribution of this study is that it avoids iterating over the block sizes while still including RDO and variable block-size motion search.

Table 3.3 Key differences between the MotionMapping and our proposed schemes

Key differences | MotionMapping | Proposed
Initial motion vectors | Use only MPEG-2 MV | 1) Use both MPEG-2 and H.264 MVs; 2) Derive more MVs for B frames
Block size partitioning choice | Iterate every block size partitioning choice and check the one yielding the minimum cost | Avoid the iteration and predict the block size using our proposed rate and distortion models
Total search range | Four times of search in the range of [-1.75 +1.75] | One time of search in the range of [-1.75 +1.75]
Rate-distortion optimization | Without the consideration of RDO | With the consideration of RDO using our proposed rate and distortion models

3.3 Experimental Results and Computational Complexity Analysis

The TM5 MPEG-2 and the JM14.2 H.264/AVC reference software codecs were used in our implementation [24][16]. Performance evaluations were carried out using a large number of different video test sequences with a wide variety of content. Ten CIF (352x288) resolution video sequences were encoded into MPEG-2 streams using TM5. The GOP structure was set to IPPP with length equal to 15. The bit-rates range from 1 to 3 Mbits/sec. Full search with a range of ±32 pixels is used to encode these CIF sequences. In addition to the CIF sequences, six real-life SDTV broadcasting streams were also tested. The bit-rates of these streams range from 3 to 7 Mbits/sec. These SDTV video streams were encoded with flexible GOP structures, i.e., the number of B frames between P and I/P frames differs in different GOPs. In our transcoding, the H.264/AVC encoding process reuses the MPEG-2 GOP structure.
The SDTV streams are interlaced and thus, the interlace-to-progressive conversion is applied before the streams go through the H.264/AVC encoding process. The encoding settings of H.264/AVC are listed in Table 3.4. Note that reference B is turned off in H.264/AVC since the B frames of MPEG-2 videos do not support the reference B mode. It is therefore not considered during MPEG-2 to H.264/AVC transcoding, which saves computation.

Table 3.4 H.264/AVC encoding settings

Experimental settings of H.264/AVC encoding
Search Range | ±32 (CIF), ±64 (SDTV)
No. of Frames | 300
Profile | Main
Level | 2.0 (CIF), 3.0 (SDTV)
Reference B | Disabled
Deblocking Filter | Enabled
Entropy Coding | Context-Adaptive Binary Arithmetic Coding (CABAC)

The first set of experiments aims at justifying the advantage of using MPEG-2 information (i.e., existing surrounding MVs) during transcoding. To measure the improvement of the picture quality caused by using MPEG-2 MVs, we transcoded four CIF sequences (Football, Foreman, Mobile and SignIrene) using our proposed algorithm with and without using the MPEG-2 MVs. Zero motion vectors are not considered, so that the comparison shows only the difference between using and not using MPEG-2 MVs. Table 3.5 shows the bit-rates and picture quality of the transcoded videos in PSNR values with and without the use of MPEG-2 motion vectors. To measure the overall picture quality over four different quantization parameters, the method proposed in [18] is used to calculate the average PSNR differences and the results are shown in the last column of Table 3.5. We observe that the use of MPEG-2 MVs by our algorithm improves the picture quality by about 0.4 dB on average ([0.76+0.71+0.06+0.16]/4). Moreover, for the video sequences containing fast motion, such as Football and Foreman, the improvement reaches 0.76 dB and 0.71 dB, respectively.
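The average PSNR difference of [18] is commonly computed by fitting a third-order polynomial to each rate-distortion curve on a logarithmic bit-rate axis and averaging the gap between the two fits over the overlapping bit-rate interval. The following Python sketch illustrates that idea, assuming exactly four rate-distortion points per curve (so each cubic passes through all of its points). The function names are illustrative, and this is not the AVSNR tool of [19].

```python
import math


def _solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def _cubic_through(xs, ys):
    """Coefficients c0..c3 of the cubic through four (x, y) points."""
    return _solve([[x ** k for k in range(4)] for x in xs], list(ys))


def _integral(coeffs, lo, hi):
    """Definite integral of the polynomial with the given coefficients over [lo, hi]."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def average_psnr_difference(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average PSNR gain (dB) of the test curve over the reference curve."""
    log_ref = [math.log10(r) for r in rates_ref]
    log_test = [math.log10(r) for r in rates_test]
    fit_ref = _cubic_through(log_ref, psnr_ref)
    fit_test = _cubic_through(log_test, psnr_test)
    # Compare the curves only where their bit-rate ranges overlap.
    lo = max(min(log_ref), min(log_test))
    hi = min(max(log_ref), max(log_test))
    return (_integral(fit_test, lo, hi) - _integral(fit_ref, lo, hi)) / (hi - lo)
```

For example, the four (bit-rate, PSNR) pairs of one sequence without MPEG-2 MVs would be passed as the reference curve and the four pairs with MPEG-2 MVs as the test curve; a positive result means the test curve is better on average.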
We conclude that using the MPEG-2 information during transcoding greatly improves the picture quality of the transcoded videos.

Table 3.5 The PSNR values and bit-rates of the transcoded videos with/without using MPEG-2 MVs

Test Seq. | Quant. Param. | Without MPEG-2 MVs: PSNR (dB) | Without MPEG-2 MVs: Bit-rate (kbits/sec) | With MPEG-2 MVs: PSNR (dB) | With MPEG-2 MVs: Bit-rate (kbits/sec) | Average PSNR diff. of using vs. not using MPEG-2 MVs
Football (CIF) | QPI | 41.14 | 3313.26 | 40.93 | 2917.67 | +0.76 dB
 | QPII | 37.16 | 1882.52 | 36.96 | 1582.24 |
 | QPIII | 33.75 | 966.03 | 33.56 | 787.08 |
 | QPIV | 30.75 | 478.55 | 30.54 | 378.51 |
Foreman (CIF) | QPI | 40.72 | 2116.81 | 40.65 | 1895.69 | +0.71 dB
 | QPII | 37.03 | 948.38 | 36.97 | 803.65 |
 | QPIII | 33.73 | 393.38 | 33.65 | 307 |
 | QPIV | 30.88 | 188.29 | 30.79 | 145.77 |
Mobile (CIF) | QPI | 40.08 | 5724.9 | 40.08 | 5633.85 | +0.06 dB
 | QPII | 35.28 | 3453.13 | 35.29 | 3403.95 |
 | QPIII | 30.39 | 1566.84 | 30.41 | 1559.26 |
 | QPIV | 26.38 | 541.75 | 26.37 | 545.22 |
SignIrene (CIF) | QPI | 42 | 1196.29 | 41.99 | 1170.2 | +0.17 dB
 | QPII | 39.01 | 593.47 | 39.01 | 574.08 |
 | QPIII | 35.83 | 275.29 | 35.82 | 262.13 |
 | QPIV | 32.79 | 132.22 | 32.74 | 122.89 |

Note: QPI (I: 22, P: 23, B: 24), QPII (I: 27, P: 28, B: 29), QPIII (I: 32, P: 33, B: 34), QPIV (I: 37, P: 38, B: 39)

The second set of experiments compares our proposed MPEG-2 to H.264/AVC transcoding scheme with four other schemes:

1) The MotionMapping scheme, which uses the motion-mapping algorithm proposed in [13] and also takes advantage of rate-distortion optimization (RDO). In our experiments, four block size partitions {16x16, 16x8, 8x16 and 8x8} are used.

2) The Full-Search scheme, which is the exhaustive full-search motion estimation scheme in the H.264/AVC encoder using all 7 block sizes and RDO. This scheme yields the best compression performance but also has the highest computational complexity. The reason for this comparison is that although MotionMapping has been compared with the full-search scheme in [13], the rate-distortion optimization was disabled in those experiments.
3) The UMHexNotBelow8x8 scheme, which employs the UMHexagonS fast motion estimation (FME) algorithm in the JM encoder and uses four block size partitions {16x16, 16x8, 8x16 and 8x8}. The UMHexagonS is one of the most popular fast motion estimation algorithms with negligible impact on compression performance [25].

4) The UMHex16x16 scheme, which also employs the UMHexagonS FME algorithm but in this case uses only the 16x16 block size partition.

MotionMapping is the scheme that requires the least computational complexity among the other existing techniques. The comparison with UMHexNotBelow8x8 is used in order to measure the quality loss of our proposed algorithm versus the UMHexagonS FME algorithm. The comparison with UMHex16x16 is used in order to show whether our proposed algorithm is a better trade-off than the UMHexagonS FME algorithm. Table 3.6 shows the main differences between our proposed transcoding scheme and the other four schemes in terms of coding parameters related to motion estimation. The remaining coding parameters follow the common conditions proposed by the JVT group in [17] (same as the experiments explained in Section 3.2.2). The set of quantization parameters suggested in [17] is {22, 27, 32, 37} for I frames, {23, 28, 33, 38} for P frames, and {24, 29, 34, 39} for B frames.
Table 3.6 Differences between our proposed transcoding scheme and other schemes

Transcoding scheme | Search range of motion vectors | Available block size partitioning candidates | Motion estimation strategy
Our Proposed | ±1.75 | 16x16, 16x8, 8x16, 8x8 | Full search for integer pixel refinement and UMHexagonS for sub-pixel refinement
Full-Search | ±32 (CIF), ±64 (SDTV) | 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, 4x4 | Exhaustive full search
MotionMapping | ±1.75 | 16x16, 16x8, 8x16, 8x8 | Full search for integer pixel refinement and UMHexagonS for sub-pixel refinement
UMHexNotBelow8x8 | ±32 (CIF), ±64 (SDTV) | 16x16, 16x8, 8x16, 8x8 | UMHexagonS fast ME
UMHex16x16 | ±32 (CIF), ±64 (SDTV) | 16x16 | UMHexagonS fast ME

The resulting rate-distortion curves of the transcoded SDTV sequences are shown in Figure 3.4. The rate-distortion curves of the transcoded CIF sequences are shown in Figure 3.5. Note that only the results of four SDTV sequences and four CIF sequences are shown due to space limitations.

Figure 3.4 Rate-distortion curves (PSNR in dB versus bit-rate in Kbits/sec) from different transcoding schemes (Full-Search, Proposed, MotionMapping, UMHexNotBelow8x8, UMHex16x16) over different SDTV sequences (a) Music Video (b) Commercial (c) Movie (d) Sport Scene
From Figure 3.4 and Figure 3.5, we observe that our proposed algorithm achieves almost the same compression performance as that of the MotionMapping scheme, which is presently considered as the state-of-the-art transcoding technique. Regarding the UMHexagonS FME algorithm, our proposed scheme yields compression performance similar to the UMHexNotBelow8x8 scheme and better compression performance than UMHex16x16. Finally, and as expected, the compression performance of our proposed scheme, MotionMapping, UMHexNotBelow8x8 and UMHex16x16 is slightly lower than that of the Full-Search scheme. The Full-Search and UMHexNotBelow8x8 yield the best results in terms of compression performance.

Figure 3.5 Rate-distortion curves (PSNR in dB versus bit-rate in Kbits/sec) from different transcoding schemes (Full-Search, Proposed, MotionMapping, UMHexNotBelow8x8, UMHex16x16) over different CIF sequences (a) Foreman (b) Football (c) Mobile (d) Sign Irene

Another way of comparing the picture quality of the different schemes is by calculating the objective PSNR differences of the rate-distortion curves between different transcoding schemes as suggested in [17]. These values are listed in Table 3.7. We observe that the proposed method achieves almost the same picture quality as that of the MotionMapping scheme.
For the SDTV sequences, the differences are negligible (average 0.03 dB). Regarding the CIF sequences, the difference is 0.06 dB on average. Compared to the exhaustive Full-Search scheme, the drop in picture quality of our transcoding scheme is about 0.28 dB for the SDTV sequences and 0.22 dB for the CIF sequences (similar to that of MotionMapping). Compared to the UMHexNotBelow8x8, the picture quality from our proposed transcoding scheme drops by about 0.16 dB for the SDTV sequences and 0.15 dB for the CIF sequences (visually no difference). Finally, compared to the UMHex16x16, our proposed algorithm improves the picture quality by about 0.09 dB for the SDTV sequences and 0.07 dB for the CIF sequences.

Table 3.7 Average PSNR differences (to MPEG-2 encoded videos) between different schemes (300 frames)

To the MPEG-2 encoded videos
Test Sequences | ΔPSNR of Proposed vs. MotionMapping | ΔPSNR of Proposed vs. Full-Search | ΔPSNR of Proposed vs. UMHexNotBelow8x8 | ΔPSNR of Proposed vs. UMHex16x16
Music Video (SDTV) | -0.03 dB | -0.16 dB | -0.09 dB | +0.02 dB
Commercials (SDTV) | -0.02 dB | -0.43 dB | -0.24 dB | +0.09 dB
Sports Scenes (SDTV) | 0 dB | -0.17 dB | -0.01 dB | +0.15 dB
Movies (SDTV) | -0.06 dB | -0.36 dB | -0.28 dB | +0.08 dB
Football (CIF) | -0.07 dB | -0.26 dB | -0.18 dB | +0.05 dB
Foreman (CIF) | -0.07 dB | -0.27 dB | -0.18 dB | +0.10 dB
Mobile (CIF) | -0.04 dB | -0.13 dB | -0.08 dB | +0.02 dB
SignIrene (CIF) | -0.07 dB | -0.22 dB | -0.17 dB | +0.10 dB
Average of SDTV Seq. | -0.03 dB | -0.28 dB | -0.16 dB | +0.09 dB
Average of CIF Seq. | -0.06 dB | -0.22 dB | -0.15 dB | +0.07 dB

Note: The initial MPEG-2 encoding parameters can be found at the beginning of Section 3.3.

The above results are obtained when PSNR values are calculated based on the MPEG-2 compressed/decompressed videos. In order to offer a more complete quality index, in Table 3.8 we also calculate the PSNR values based on the original videos, i.e., before MPEG-2 encoding.
In addition, the last column shows the quality (in PSNR values) of the MPEG-2 decoded streams, which are the inputs to the transcoding scheme. These results are shown only for the CIF sequences, since we have no access to the real-life SDTV original streams.

Table 3.8 Average PSNR differences (to the original videos before MPEG-2 encoding) between different schemes (300 frames)

To the original videos before MPEG-2 encoding
Test Seq. | ΔPSNR of Proposed vs. MotionMapping | ΔPSNR of Proposed vs. Full-Search | ΔPSNR of Proposed vs. UMHexNotBelow8x8 | ΔPSNR of Proposed vs. UMHex16x16 | PSNR of the MPEG-2 decoded videos
Football (CIF) | -0.04 dB | -0.21 dB | -0.14 dB | +0.05 dB | 36.81 dB
Foreman (CIF) | -0.06 dB | -0.25 dB | -0.17 dB | +0.09 dB | 38.58 dB
Mobile (CIF) | -0.02 dB | -0.09 dB | -0.05 dB | +0.02 dB | 28.90 dB
SignIrene (CIF) | -0.07 dB | -0.23 dB | -0.17 dB | +0.09 dB | 41.48 dB
Avg. of CIF Seq. | -0.05 dB | -0.20 dB | -0.14 dB | +0.06 dB | 36.44 dB

Compared to the Full-Search scheme, the drop in picture quality of our proposed transcoding scheme is 0.2 dB on average. Compared to the UMHexagonS FME algorithm, our proposed algorithm reduces the picture quality by 0.14 dB versus UMHexNotBelow8x8 and improves the picture quality by 0.06 dB versus UMHex16x16. The difference in picture quality between our proposed algorithm and the MotionMapping scheme is 0.05 dB on average. We observe that the results for these original sequences are similar to the ones obtained when the PSNR values are based on the MPEG-2 encoded streams (Table 3.7).

Table 3.9 shows the execution time of the motion estimation process for the proposed and the other four transcoding schemes. For the SDTV sequences, with the search range equal to 64, our proposed algorithm reduces the computational complexity by 99.47% compared to the Full-Search scheme. Compared to the MotionMapping transcoding scheme, our proposed transcoding scheme achieves almost the same rate-distortion performance, with the reduction of computational complexity reaching 71%.
Compared to UMHexNotBelow8x8 and UMHex16x16, the reduction of computational complexity reaches 83% and 58%, respectively.

Table 3.9 Comparison of motion estimation execution time for different transcoding schemes (300 frames)

Reduction in motion estimation execution time
Test Sequences | Proposed vs. Full-Search | Proposed vs. MotionMapping | Proposed vs. UMHexNotBelow8x8 | Proposed vs. UMHex16x16
Music Video (SDTV) | 99.57% | 70.93% | 93.04% | 51.04%
Commercials (SDTV) | 99.22% | 70.24% | 82.35% | 58.57%
Sports Scenes (SDTV) | 99.66% | 71.67% | 77.00% | 55.33%
Movies (SDTV) | 99.43% | 69.58% | 81.34% | 65.58%
Football (CIF) | 98.80% | 72.32% | 72.91% | 34.84%
Foreman (CIF) | 98.56% | 72.98% | 81.42% | 36.69%
Mobile (CIF) | 99.03% | 73.06% | 73.98% | 31.49%
SignIrene (CIF) | 98.24% | 75.92% | 66.86% | 22.43%
Average SDTV Seq. | 99.47% | 70.61% | 83.43% | 57.63%
Average CIF Seq. | 98.66% | 73.57% | 73.79% | 31.36%

For the CIF sequences, with the search range equal to 32, our proposed algorithm reduces the computational complexity by 98.66% compared to the Full-Search scheme and 73% compared to MotionMapping. Compared to UMHexNotBelow8x8 and UMHex16x16, the computational complexity is reduced by 74% and 31%, respectively.

In addition to the above comparisons, Table 3.10 shows the average transcoding execution time per frame for one specific example each of the SDTV and CIF sequences. An Intel Pentium 4 CPU at 3.2 GHz with 2 GB of RAM was used for our experiments. We observe that our proposed algorithm requires the least transcoding execution time (on average). We also observe that, for the same rate-distortion performance, our proposed algorithm significantly reduces the computational complexity compared to MotionMapping, which has been shown to be the least computationally demanding among other state-of-the-art MPEG-2 to H.264/AVC transcoding schemes.
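As a cross-check of the averages quoted above, the per-sequence reductions of Table 3.9 can be averaged directly. The short Python sketch below does that and also converts a percentage reduction into an equivalent speed-up factor, assuming that a reduction of r% means the proposed scheme needs (100 - r)% of the other scheme's motion estimation time. The helper names are illustrative.

```python
# Reduction in motion estimation execution time (%), proposed vs. each scheme (Table 3.9).
# Row order: Music Video, Commercials, Sports Scenes, Movies.
SDTV = {
    "Full-Search": [99.57, 99.22, 99.66, 99.43],
    "MotionMapping": [70.93, 70.24, 71.67, 69.58],
    "UMHexNotBelow8x8": [93.04, 82.35, 77.00, 81.34],
    "UMHex16x16": [51.04, 58.57, 55.33, 65.58],
}
# Row order: Football, Foreman, Mobile, SignIrene.
CIF = {
    "Full-Search": [98.80, 98.56, 99.03, 98.24],
    "MotionMapping": [72.32, 72.98, 73.06, 75.92],
    "UMHexNotBelow8x8": [72.91, 81.42, 73.98, 66.86],
    "UMHex16x16": [34.84, 36.69, 31.49, 22.43],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def speedup(reduction_percent):
    """Speed-up factor implied by a given percentage reduction in execution time."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    for label, table in (("SDTV", SDTV), ("CIF", CIF)):
        for scheme, reductions in table.items():
            avg = mean(reductions)
            print(f"{label} vs. {scheme}: {avg:.2f}% reduction, about {speedup(avg):.1f}x faster")
```

Under this reading, the 99.47% average reduction against Full-Search for SDTV corresponds to a motion estimation stage that is well over one hundred times faster, while the 70.61% reduction against MotionMapping corresponds to roughly a threefold speed-up.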
In summary, based on all the above observations, we conclude that our proposed algorithm achieves the best trade-off between the computational complexity and the picture quality.

Table 3.10 Average transcoding execution time per frame (seconds) in different transcoding schemes

Test Seq. | Full-Search (secs) | MotionMapping (secs) | UMHexNotBelow8x8 (secs) | UMHex16x16 (secs) | Proposed (secs)
Sports Scenes (SDTV) | 32.70 | 3.25 | 2.92 | 2.46 | 1.85
Foreman (CIF) | 2.38 | 0.71 | 0.94 | 0.58 | 0.46

3.4 Conclusion

We presented an efficient H.264/AVC block size partitioning prediction algorithm for MPEG-2 to H.264/AVC transcoding applications. Our algorithm uses rate-distortion optimization techniques and predicted initial motion vectors to estimate block size partitioning for H.264/AVC. In addition to the fast block size partitioning algorithm, we also illustrated that using block size partitioning smaller than 8x8 (i.e., 8x4, 4x8 and 4x4) results in negligible compression improvements, and thus these sizes should be avoided in MPEG-2 to H.264/AVC transcoding. Experimental results showed that, compared to the state-of-the-art MotionMapping scheme, our transcoder yields similar rate-distortion performance, while the computational complexity is significantly reduced, requiring an average of 29% of the computations used by the MotionMapping algorithm. Compared to the full-search scheme, our proposed algorithm reduces the computational complexity by about 99.47% for SDTV sequences and 98.66% for CIF sequences. Compared to UMHexagonS, the fast motion estimation algorithm used in H.264/AVC, the experimental results showed that our proposed algorithm is a better trade-off between computational complexity and picture quality. In summary, our proposed transcoding scheme outperforms the other state-of-the-art methods, requiring the least computational complexity while achieving similar compression performance.
In the next chapter, we present a transcoding method that is designed for efficiently converting a coded large-resolution H.264/AVC video to its downscaled version. Using H.264/AVC videos over different networks for different user applications (e.g., mobile phones, TV) will require transcoding either at the transmitter end or at a server level in order to adapt them to different bandwidth and resolution requirements.

3.5 References

[1] "Information technology – Generic coding of moving pictures and associated audio information: Video," ITU-T Recommendation H.262 and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) 13818-2, 2nd ed., Feb. 2000.
[2] "Advanced video coding for generic audiovisual services," ITU-T Recommendation H.264 and Int. Standards Org./Int. Electrotech. Comm. (ISO/IEC) 14496-10, Mar. 2005.
[3] J. Xin, C. W. Lin, and M. T. Sun, "Digital Video Transcoding," Proceedings of the IEEE, vol. 93, no. 1, pp. 84-97, Jan. 2005.
[4] I. Ahmad, X. Wei, Y. Sun, and Y-Q. Zhang, "Video Transcoding: An Overview of Various Techniques and Research Issues," IEEE Trans. on Multimedia, vol. 7, no. 5, pp. 793-804, Oct. 2005.
[5] T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. Sullivan, "Rate-Constrained Coder Control and Comparison of Video Coding Standards," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 688-703, Jul. 2003.
[6] T. Wiegand, G. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560-576, Jul. 2003.
[7] M. Wien, "Variable Block-Size Transforms for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 604-613, Jul. 2003.
[8] Z. Zhou, S. Sun, S. Lei, and M.T. Sun, "Motion Information and Coding Mode Reuse for MPEG-2 to H.264 Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, May 2005, pp. 1230-1233.
[9] X. Lu, A. Tourapis, P. Yin and J.
Boyce, "Fast Mode Decision and Motion Estimation for H.264 with a Focus on MPEG-2/H.264 Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, May 2005, pp. 1246-1249.
[10] G. Fernández, H. Kalva, P. Cuenca, and L. O. Barbosa, "Speeding-up the Macroblock Partition Mode Decision in MPEG-2/H.264 Transcoding," in Proc. IEEE Int. Conf. Image Process., Oct. 2006, pp. 869-872.
[11] G. Chen, Y. Zhang, S. Lin, and F. Dai, "Efficient Block Size Selection for MPEG-2 to H.264 Transcoding," in Proc. 12th ACM Int. Conf. Multimedia, 2004, pp. 300-303.
[12] H. Kato, A. Yoneyama, Y. Takishima, and Y. Kaji, "Coding Mode Decision for High Quality MPEG-2 to H.264 Transcoding," in Proc. IEEE Int. Conf. Image Process., Oct. 2007, pp. IV77-IV80.
[13] J. Xin, J. Li, A. Vetro, H. Sun, and S. Sekiguchi, "Motion Mapping for MPEG-2 to H.264/AVC Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, May 2007, pp. 1991-1994.
[14] H. Mansour, P. Nasiopoulos, and V. Krishnamurthy, "Real-Time Joint Rate and Protection Allocation for Multi-User Scalable Video Streaming," in Proc. IEEE Int. Symp. Personal, Indoor, and Mobile Radio Communications (PIMRC), Sep. 2008, pp. 1-5.
[15] Y.N. Liu, C.S. Tang, and S.Y. Chien, "Coding Mode Analysis of MPEG-2 to H.264/AVC Transcoding for Digital TV Applications," in Proc. IEEE Int. Symp. Circuits and Systems, May 2007, pp. 1995-1998.
[16] Joint Video Team, H.264/AVC Reference Software Codec (JM), ver. 14.2, [online], Available: http://iphome.hhi.de/suehring/tml/download/.
[17] T. K. Tan, G. Sullivan, and T. Wedi, "Recommended Simulation Common Conditions for Coding Efficiency Experiments Revision 1," ITU-T SG16/Q6, Doc. VCEG-AE010, Jan. 2007.
[18] G. Bjontegaard, "Calculation of Average PSNR Differences between RD curves," ITU-T SG16/Q6, Doc. VCEG-M33, Apr. 2001.
[19] Average PSNR Calculation Tool (AVSNR), [online], Available: http://ftp3.itu.int/av-arch/video-site/H26L/avsnr4.zip.
[20] Q. Tang, P. Nasiopoulos, and R.
Ward, "Fast Block Size Prediction for MPEG-2 To H.264/AVC Transcoding," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Apr. 2008, pp. 1029-1032.
[21] T. Qian, J. Sun, D. Li, X. Yang, and J. Wang, "Transform Domain Transcoding From MPEG-2 to H.264 With Interpolation Drift-Error Compensation," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 523-534, Apr. 2006.
[22] Q. Tang, P. Nasiopoulos, and R. Ward, "Compensation of Re-quantization and Interpolation Errors in MPEG-2 to H.264 Transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 314-325, Mar. 2008.
[23] A. Vetro, C. Christopoulos, and H. Sun, "Video transcoding architectures and techniques: An overview," IEEE Signal Process. Mag., vol. 20, no. 2, pp. 18–29, Mar. 2003.
[24] MPEG Software Simulation Group, MPEG-2 reference software – Test Model 5 (TM5), [online], Available: http://www.mpeg.org/MPEG/MSSG/.
[25] Z. Chen, P. Zhou, Y. He, and Y. Chen, "Fast integer pel and fractional pel motion estimation," ISO/IEC JTC1/SC29/WG11 and ITU-T SG16/Q6, Doc. JVT-F017, Dec. 2002.

Chapter 4: AN EFFICIENT MOTION RE-ESTIMATION SCHEME FOR H.264/AVC VIDEO TRANSCODING WITH ARBITRARY DOWNSCALING RATIOS^5

4.1 Introduction

Broadcasting videos for a variety of different terminal devices is a challenging matter because of the diverse types of existing devices and network limitations. The resolutions of existing terminal devices vary widely, ranging from 1920x1080 (i.e., HDTV resolution) to much smaller resolutions for smart phones, such as 320x240. Existing networks also vary in terms of rate capacity and channel conditions. For instance, a 3G wireless network (384K~2M bps) [1] offers a much lower bit-rate capacity compared to the traditional cable-based networks (100~1000M bps).
Thus, providing universal multimedia access (UMA) for mobile applications, while starting with a broadcast HD video stream, is challenging and is usually addressed by downscaling the image size of the encoded video. Image downscaling should adapt to the device resolution and the bit-rates of different device/network requirements. Video transcoding can offer efficient solutions for this type of application [2]-[3].

Transcoding the coded video into a downscaled version involves motion estimation of the lower resolution video stream so as to achieve good compression performance. Motion estimation is known to be the most complex and time consuming encoding process. In order to accelerate the motion estimation process when encoding the downscaled video, many algorithms have been proposed. These algorithms derive accurate motion vectors for the downscaled video, using the motion vectors of the original video [4]-[9]. Some methods have been designed to work for H.263, MPEG-1, MPEG-2 or MPEG-4 compressed streams [4]-[7]. These methods cannot be applied directly to H.264/AVC because of its more complex motion estimation process, which supports variable block-sizes [10]. Tan et al. proposed the first approach designed for downscaling coded H.264/AVC video streams using an area-weighted vector median filter (AWVMF) [8]. Another method uses multiple linear-regression-models to estimate the motion vector of a downscaled H.264 video, reducing the complexity introduced by the median filter [9]. Although the computational complexity is reduced compared to AWVMF, an extensive off-line training process is needed to find the model parameters.

^5 A revised version of this work has been submitted for publication. Q. Tang, P. Nasiopoulos, and R. Ward, "An Efficient Motion Re-Estimation Scheme for H.264/AVC Video Transcoding with Arbitrary Downscaling Ratios," February 2010.
Moreover, for different downscaling ratios, the multiple linear-regression-models inherently have different numbers and values of model parameters, and the training process has to be implemented for all possible ratios and new model parameters. This limits the practicality of this approach for supporting arbitrary downscaling ratios.

In this study, we propose a computationally efficient downscaling approach for H.264/AVC compressed video. Our method aims at reducing the computational complexity of the motion vector estimation of the downscaled video without affecting the overall picture quality. Moreover, this method is designed to work for any arbitrary downscaling ratio, without the need for prior training. Our method calculates the motion vector of each block in the downscaled video from the motion vectors of corresponding blocks in the original video. The motion vectors of the original blocks corresponding to one downscaled block are weighted using the area of the blocks they belong to as well as the number of non-zero AC coefficients in the original blocks. The motion vector of the original block that has the highest weight is used to calculate the motion vector of the corresponding downscaled block.

Performance evaluation shows that the proposed algorithm offers less computational complexity and achieves better compression picture quality than other downscaling state-of-the-art methods. Compared to the area-weighted vector median filter approach (AWVMF), the proposed algorithm greatly reduces the computational complexity of estimating the downscaled motion vectors and also improves the overall compression picture quality. Compared to the multiple linear-regression-models based scheme, our algorithm does not need an offline training process and can easily support any arbitrary downscaling ratio. Furthermore, it offers better compression performance.
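The selection rule outlined above (weight each original block, keep the motion vector with the highest weight, then scale it) can be illustrated with a short Python sketch. This is a rough illustration only: the text states that both the block area and the number of non-zero AC coefficients enter the weight, but the exact combination is not given at this point, so the product used below is an assumption, as are the type and function names.

```python
from typing import NamedTuple, Sequence, Tuple


class SourceBlock(NamedTuple):
    """One block of the original video that maps into a downscaled block (illustrative type)."""
    mv: Tuple[float, float]  # motion vector in the original video, in pixels
    area: int                # area (in pixels) of the original block involved
    nonzero_ac: int          # number of non-zero AC coefficients in the original block


def initial_downscaled_mv(
    blocks: Sequence[SourceBlock], scale_x: float, scale_y: float
) -> Tuple[float, float]:
    """Scale the MV of the highest-weighted original block to the downscaled resolution.

    Assumption: weight = area * nonzero_ac. Ties go to the first block in the sequence.
    scale_x and scale_y are the downscaling factors, e.g. 0.5 for a 2:1 reduction.
    """
    best = max(blocks, key=lambda b: b.area * b.nonzero_ac)
    return (best.mv[0] * scale_x, best.mv[1] * scale_y)
```

Because the rule only needs one pass over the contributing blocks and two multiplications, it works unchanged for any pair of horizontal and vertical scaling factors, which is the property the text emphasizes.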
By using the motion vectors derived by our algorithm as the initial vectors, a search range of ±0.75 pixels results in the same compression performance as that of the cascaded recoding structure with full scale search over 32x32 pixels (FullSearch) for downscaling CIF videos. Using ranges larger than ±0.75 pixels yields even slightly better performance than the FullSearch method does. In the case of downscaling larger resolution video streams, such as 4CIF, a larger range of motion vector refinement (e.g., ±1.75 pixels) is recommended.

The rest of this study is organized as follows. Section 4.2 gives an overview of the existing methods. The proposed algorithm is explained in Section 4.3 and the computational complexity analysis is shown in Section 4.4. Section 4.5 presents the experimental results and discussions. Finally, the conclusions are drawn in Section 4.6.

4.2 Overview of Motion Vector Estimation during Downscaling Transcoding

During the downscaling transcoding process, a full-resolution video is first decoded back into the pixel domain, where it is downscaled according to a lower resolution (ratio). After that, the downscaled video is encoded again. During this re-encoding process, the motion re-estimation has the highest computational complexity, especially for H.264/AVC video coding. To reduce this complexity, many proposed methods compose accurate initial motion vectors using the motion vectors of the full-resolution video. Using these initial motion vectors, a small range of search is usually performed to further improve the accuracy of the motion vectors of the downscaled videos.

One well-known downscaling method designed for MPEG-1/2, H.261/3 video streams is the adaptive motion vector resampling (AMVR) approach [4]. AMVR is designed to work only for 16x16 size blocks (MBs) and 2x2:1x1 downscaling ratio. In this case, each 16x16 MB of the downscaled video is associated with four MBs in the full-resolution version of the video.
The most straightforward strategy for composing the motion vector of the downscaled MB is to average the four motion vectors of the four associated MBs in the original video. This strategy is referred to as the align-to-average weighting (AAW) strategy. In addition to this strategy, two other strategies are also discussed in [4]: align-to-best weighting (ABW) and align-to-worst weighting (AWW). Each of these two strategies composes the MV of the downscaled MB by calculating the weighted average of the MVs of the four associated MBs in the original video. The align-to-best, or align-to-worst weighting strategy assigns the smallest or largest weight, respectively, to the original MB that has the largest motion compensation prediction error. The study in [4] shows that the AWW strategy yields the best performance. The adaptive motion vector resampling (AMVR) approach using the AWW strategy is as follows:

\[ mv' = \frac{1}{2} \cdot \frac{\sum_{i=1}^{4} mv_i A_i}{\sum_{i=1}^{4} A_i} \tag{4-1} \]

where mv_i denotes the motion vector of MB i in the original video and A_i represents the motion prediction error of MB i and is measured as the number of nonzero AC coefficients of MB i. Since this approach was proposed before H.264/AVC fully evolved, the H.264/AVC variable size of the block considered in motion estimation was not accounted for.

In order to address the block-size variation of downscaling H.264 compressed videos, Tan et al. proposed a new method, as mentioned above, which is based on an area-weighted vector median filter [8]. This filter weighs the motion vectors of the original video by the portion of the area (in terms of number of pixels) involved in forming the new blocks of the downscaled video. The use of area weighted motion vectors allows the support of variable block sizes as well as arbitrary size downscaling. The area-weighted vector median filter (AWVMF) approach is expressed as follows:
The area-weighted vector median filter (AWVMF) approach is expressed as follows: 89 K mv AWVM arg min mv j mv i mv V j (4-2) i 1 mv ' S mv AWVM where mv' is the resultant motion vector of the downscaled video, and mvAWVM is the area-weighted vector median of the original video. K is the number of available motion vectors in the original videos. V= {mv1, mv2, ……, mvk} is the set of available motion vectors in the original videos. The Euclidean norm (γ = 2) is used to measure the distance between different motion vectors. S is the scaling matrix representing the downscaling ratios (in the x and y direction). The median vector is defined as the vector with the shortest aggregate distance to all other vectors. When the number of available motion vectors (K) is large, the computational complexity of the median filter is high. Another state-of-the-art downscaling method for H.264/AVC is proposed by Wang et al [9]. This method uses multiple linear-regression-models to compose the new motion vector of the downscaled video during transcoding. The linear-regression-models use the motion vectors of the full-resolution video as inputs. They assume the motion vectors in the downscaled videos have a linear relationship with those in the fullresolution video. Since H.264/AVC uses 7 block sizes, 7 linear models are defined as follows: y α0 Mi αm xm, m1 (4-3) i 1,,7. where i corresponds to the index of the block-size set, i.e., 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4, and Mi is the number of available motion vectors in the full-resolution videos. The model parameters, α0, α1, ……, αM are obtained via an off-line training process. refers to the element-wise multiplication between two vectors. xm is the motion vector in the full-resolution videos. Note that the motion vector is a 2-dimensional vector, i.e., it has elements in both the x and y directions. The y and αm are also 2dimensinal vectors. 
In this approach, the number of model parameters changes with the downscaling ratio. Thus, for every downscaling ratio, a new training process is needed to find the new model parameters. Therefore, it is not practical for this approach to support arbitrary downscaling ratios. Furthermore, due to the vast diversity of video content, the train-and-use approach inherently lacks the flexibility and accuracy needed to offer the best solution for different video contents.

In summary, the complexity of the existing downscaling methods is relatively high, leaving room for improvement. In the following section we describe our approach that addresses this issue.

4.3 Proposed H.264/AVC Motion Vector Re-estimation during Downscaling Transcoding

In what follows, we propose a low computational complexity method that composes accurate motion vectors for downscaled H.264/AVC videos. Inspired by the AMVR scheme [4], we use the information of the transform coefficients of the full-resolution videos. Our proposed method, however, is designed to work for H.264/AVC variable block-size motion estimation and to support any arbitrary downscaling ratio.

For ease of explanation, Figure 4.1 shows one example of downscaling 9 MBs (on the left-hand side) to one MB (on the right-hand side). Some of the original MBs are partitioned into smaller blocks by the H.264/AVC variable block-size motion estimation process. The blocks inside the downscaled MB correspond to blocks in the original video. However, the H.264/AVC encoding of this MB may result in a different partitioning from that shown in Figure 4.1. The resulting partitioning will of course consist of blocks from the 7 block sizes, i.e., 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4. Notice that such a resulting block may correspond to more than one block in the original video.
Our goal is to find the relationship between the motion vectors of these blocks in the downscaled video and the motion vectors of the corresponding blocks in the original video. Based on that relationship, we can then estimate the motion vectors of the downscaled video without performing an exhaustive full search, which is computationally very expensive.

Figure 4.1 An example showing 9 MBs (in the original video) being downscaled into one MB in the downscaled video

4.3.1 Calculation of the area-weighted align-to-worst motion vector

As discussed in Section 4.2, there are three strategies to calculate the motion vectors of the downscaled videos from the motion vectors of the full-resolution videos, i.e., align-to-best weighting, align-to-average weighting and align-to-worst weighting. The align-to-worst weighting strategy has been proven to be the best strategy for MPEG-1/2 and H.261/3 [4] because the block-matching motion estimation process assumes all motion is of the translational type. Since H.264/AVC also assumes that the motion is of the translational type, one may assume that the align-to-worst weighting strategy can also be used for H.264/AVC transcoding that involves the downscaling operation. However, the fact that H.264/AVC uses different block sizes for motion estimation makes the align-to-worst strategy very inefficient. When an original block has a small size (e.g., 4x4), giving the largest weight to the MV of this block might not be a good strategy even if this block has the largest motion-compensation prediction error. Without considering the differences in area among the original blocks, the derived MVs of the downscaled blocks are not accurate enough. In order to extend the align-to-worst strategy to support the variable block-size motion estimation of H.264/AVC, we add an area-weighting factor to the weights. Assume the downscaled MB is partitioned into two 8x16 blocks as shown in Figure 4.1.
To find the MV of the left 8x16 block, we consider the 14 corresponding blocks in the original video associated with it. These blocks are of different sizes in terms of pixel area. For example, the area weighting factor $\omega_i$ of the shaded block $i$ in the original video in Figure 4.1 is calculated using the following function:

$$ \omega_i = \frac{W_i \times H_i}{16 \times 16} \tag{4-4} $$

where both $W_i$ and $H_i$ are measured in the original video. For example, for block $i$ in Figure 4.1, $W_i$ is equal to 8 and $H_i$ is also equal to 8, so $\omega_i = 64/256 = 0.25$.

The modified align-to-worst weighting strategy could thus be as follows: two weights are associated with the MV of each original block. One weight is related to the area of the original block and the other weight is related to its prediction error. The two weights are multiplied to form one weight, and the weighted average of all the MVs of the associated original blocks is found. However, our experiments showed that using the double-weighted average of the MVs of the original blocks fails to yield good accuracy for estimating the MV of the corresponding downscaled block. For this reason, in our algorithm, we estimate the MV of the downscaled block by choosing the MV of the original block that has the largest area-weighted motion-compensation prediction error, instead of calculating the double-weighted average. We refer to our proposed algorithm as the area-weighted align-to-worst (AWAW) method. The area-weighted align-to-worst motion vector, $MV_{AWAW}$, is calculated by the following equation:

$$ MV_{AWAW} = MV_{i^*}, \quad \text{where } i^* = \arg\max_{i \in \{1, \ldots, N\}} \left( \omega_i A_i \right) \tag{4-5} $$

where $MV_{i^*}$ is the motion vector whose weighted motion-compensation error is the largest and the weight $\omega_i$ is as in equation (4-4). $N$ is the number of blocks in the original video that correspond to the considered downscaled block, and $i$ is the index of these blocks. $A_i$ represents the motion-compensation prediction error for block $i$.
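Equations (4-4) and (4-5) amount to a single pass over the associated original blocks. The following Python sketch illustrates this under stated assumptions: the `OrigBlock` container and its field names are hypothetical, and `activity` is simply taken as a given value of $A_i$.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class OrigBlock:
    """Hypothetical record for one original block associated with a downscaled block."""
    mv: Tuple[float, float]  # motion vector of the original block
    width: int               # W_i, measured in the original video
    height: int              # H_i, measured in the original video
    activity: float          # A_i, the prediction-error measure


def area_weight(width: int, height: int) -> float:
    """Equation (4-4): block area relative to a 16x16 macroblock."""
    return (width * height) / (16 * 16)


def awaw_mv(blocks: Sequence[OrigBlock]) -> Tuple[float, float]:
    """Equation (4-5): MV of the block with the largest area-weighted error."""
    worst = max(blocks, key=lambda b: area_weight(b.width, b.height) * b.activity)
    return worst.mv
```

In this example a 4x4 block with a large activity of 10 (weighted value 0.625) loses to a 16x16 block with an activity of only 2 (weighted value 2.0), which is the effect the area weighting is meant to achieve.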
The value of $A_i$ should be chosen so that a smaller $A_i$ implies a better prediction. In the AMVR method for downscaling MPEG-1/2 and H.261/3 [4], $A_i$ is represented by the number of non-zero AC coefficients, i.e., the zero norm of the AC coefficients. This is a simple yet effective measure of the motion-compensation prediction error. Since H.264/AVC uses a DCT-like integer transform, our proposed algorithm also uses this measure to represent $A_i$. In summary, our proposed AWAW method chooses the motion vector of the original block that has the largest product (area weight × number of non-zero AC integer-transform coefficients).

Next, we explain in detail how $A_i$ is calculated. $A_i$ is calculated for each original block that has an available motion vector. Since AMVR was designed for MPEG-1/2 and H.261/3, DCT coefficients of 8x8 blocks are used and each block has the same number of DCT coefficients [4]. However, H.264/AVC uses 4x4 as the basic size of a DCT-like integer transform, and the original blocks have different block sizes, i.e., 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 or 4x4. This means each original block has at least one set of 4x4 integer-transform coefficients (e.g., 8x4 blocks have 2 sets). When evaluating the number of nonzero AC integer-transform coefficients, this causes a problem: the larger a block is, the more likely it is to have a larger number of nonzero AC coefficients, simply because it has more coefficients in the first place. In order to fairly compare the number of nonzero AC coefficients for associated original blocks of different sizes, we propose to calculate the average (over the number of 4x4 blocks in the original block) of the number of nonzero AC coefficients for each associated original block. We give an illustration below. If the size of the original block is 16x16, it will have sixteen 4x4 blocks.
Since the integer transform is applied on each 4x4 block, this 16x16 block will have 16 sets of 4x4 integer-transform coefficients, and each set has its own number of non-zero AC coefficients. On the other hand, these sixteen 4x4 blocks use the same motion vector since the motion-compensation partitioning size of this MB is 16x16. When evaluating the prediction error for this block, all 16 sets of non-zero AC coefficients should be considered. We thus use the average of the numbers of non-zero AC coefficients from all 4x4 blocks as the final number of non-zero AC coefficients for this 16x16 block. The same strategy is applied to other block sizes as well. Notice that if the original MB is intra-coded, there is no motion vector associated with it. In our algorithm, we set the block size of an intra-coded MB to 16x16. At the same time, we set its corresponding motion vector to be zero-valued.

4.3.2 Finding related integer transform coefficients of parts of the original blocks associated with the downscaled block

Assume in Figure 4.1 that we wish to find the weights of the MVs of the original blocks associated with the left 8x16 block that forms the left half of the downscaled MB on the right-hand side of the figure. As seen from Figure 4.1, only the left half of block j in the original video is associated with the considered block in the downscaled image. This causes a problem since we do not have the motion vector of the left part of block j that is related to the left 8x16 downscaled block. Therefore, we have to use the motion vector of the whole block j as the MV of the left part of this block. Also, when calculating the area weight and the number of nonzero AC coefficients, we should consider only the area of the left part of block j, i.e., only the part related to the considered downscaled block. In this case, the calculation of the area weight is straightforward. In what follows, we explain how to calculate the number of nonzero AC coefficients in this scenario.
Thanks to the small block size (i.e., 4x4) that H.264/AVC uses as the basic size for the integer transform, we are able to easily separate all 4x4 integer-transform coefficients inside one block into two categories: one category that includes all the 4x4 integer-transform coefficients in the region that is associated with the downscaled block, and a second category that includes all other, non-associated 4x4 integer-transform coefficients. To understand this categorization, take block j in Figure 4.1 as an example. Only the left part of block j is related to the left 8x16 downscaled block. The non-associated 4x4 coefficients in the right part of block j should be discarded when calculating the average number of non-zero AC coefficients of block j that is associated with the left 8x16 downscaled block. Therefore, the average number of nonzero AC coefficients is calculated as follows:

$$ \mathrm{avg}\left( \| ac \|_0 \right) = \frac{\sum_{m=1}^{M} \| ac_m \|_0}{M}, \quad \text{where } m \text{ represents the index of } B_{related} \tag{4-6} $$

where $\| \cdot \|_0$ stands for the zero norm of the corresponding 4x4 AC coefficients, and $B_{related}$ is the set of 4x4 blocks in the original video which are associated with the downscaled block. $M$ is the number of all related 4x4 blocks. This consideration dramatically improves the accuracy of the calculated motion vector for the downscaled block, especially when the downscaling ratio is not an even integer. Performance evaluation shows that if the size of the corresponding block in the original video is equal to or smaller than 8x8, then all its coefficients may be used for evaluating the block's contribution to the downscaled MV. We use this observation to reduce the complexity of our algorithm.

4.3.3 Scaling down the area-weighted align-to-worst MV by the downscaling ratio

Through equations (4-5) and (4-6), we obtain $MV_{AWAW}$, which is the motion vector of the original block having the largest area-weighted number of nonzero AC coefficients amongst all associated original blocks.
$MV_{AWAW}$, however, cannot be used directly as the motion vector of the block in the downscaled video. A common practice is to scale down the original motion vector by the downscaling ratio, i.e.:

$$ MV' = S \cdot MV_{AWAW}, \qquad S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} \tag{4-7} $$

where $MV'$ is the scaled motion vector of the block in the downscaled video, $S$ is the downscaling matrix, and $s_x$, $s_y$ represent the downscaling ratio in the x and y direction, respectively. Using the calculated $MV'$ as the initial motion vector, a small-range search (referred to as motion vector refinement in the literature) is usually performed to reach better compression performance. Nevertheless, since H.264/AVC also derives a predicted MV from the encoded blocks, a common practice is to compare this MV with the $MV'$ obtained by Eq. (4-7). The one yielding the smaller prediction error is used as the final initial motion vector for the motion vector refinement. The size of the MV refinement range could be as small as quarter-pixel accuracy, i.e., ±0.25 pixel. To achieve better compression performance, a larger range could be used, such as a half-pixel motion vector refinement followed by a quarter-pixel motion vector refinement, i.e., ±0.75 pixel. We show the impact of the MV refinement range on the compression performance in our experimental results discussed in Section 4.5.

In summary, our AWAW approach uses the motion vector of the original block that has the largest area-weighted number of non-zero AC coefficients as the motion vector of the block in the downscaled video. For 8x16, 16x8 and 16x16 blocks, only the AC coefficients of the areas (in one original block) associated with the downscaled block are considered. Since finding the largest number of non-zero AC coefficients is very simple to implement, the computational complexity of our proposed algorithm should be very low. Next, we analyze the computational complexity of our proposed algorithm.
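The remaining steps of Section 4.3, namely the averaging of equation (4-6), the scaling of equation (4-7), and the choice between the scaled MV and the H.264/AVC predicted MV, can be sketched as below. All names are hypothetical. Two assumptions are made for illustration: each 4x4 block is given as 16 coefficients with the DC term first, and $s_x$, $s_y$ are expressed as downscaled size divided by original size. The `cost` callable stands in for whatever prediction-error measure the encoder uses.

```python
from typing import Callable, Sequence, Tuple

MV = Tuple[float, float]


def avg_nonzero_ac(related_blocks: Sequence[Sequence[int]]) -> float:
    """Equation (4-6): mean zero norm of the AC coefficients over the M related 4x4 blocks.

    Only the 4x4 blocks that overlap the considered downscaled block are passed in;
    non-associated 4x4 blocks are dropped by the caller.
    """
    counts = [sum(1 for c in block[1:] if c != 0) for block in related_blocks]
    return sum(counts) / len(counts)


def scale_mv(mv: MV, s_x: float, s_y: float) -> MV:
    """Equation (4-7): MV' = S * MV_AWAW with S = diag(s_x, s_y)."""
    return (s_x * mv[0], s_y * mv[1])


def pick_initial_mv(scaled: MV, predicted: MV, cost: Callable[[MV], float]) -> MV:
    """Keep whichever candidate yields the smaller prediction error."""
    return scaled if cost(scaled) <= cost(predicted) else predicted
```

The returned vector would then serve as the starting point of the small-range refinement (for example ±0.25 or ±0.75 pixel) described above.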
4.4 Computational Complexity Analysis

We measure the computational complexity by counting the basic operations needed for our method and for two other state-of-the-art methods: the Area-weighted Median Filter and the Multiple Linear-Regression-Models methods. The pseudocode of the basic operations for each approach is listed in Appendix A at the end of the thesis. The number of operations needed for each approach is listed in Table 4.1. Note that these basic operations include summation, subtraction, multiplication, division, comparison, and the calculation of the square and the square root of a number.

Table 4.1 The number of operations needed for each method (n is the number of original blocks corresponding to a downscaled block)

| Method | No. of basic operations |
| Area-weighted Median Filter | $T(n) = 12n^2 + 3n$ |
| Multiple linear-regression-models | $T(n) = 4n + 1$ (+ off-line training) |
| Our proposed AWAW | $T(n) = 4n$ |

We observe that our proposed algorithm and the multiple linear-regression-models based method have the least computational complexity ($O(n)$). The multiple linear-regression-models method, though, requires a significant amount of off-line training to find the model parameters for each block size. To obtain good picture quality, this method needs extensive training on a very large variety of video data sets. Therefore, it is not easy for this method to support arbitrary downscaling ratios. Among all the methods, the area-weighted median filter needs the most computations ($O(n^2)$) since the median filter has a relatively high computational complexity. This is especially the case because the number of related blocks is usually large in the full-resolution video (original H.264/AVC coded videos usually have many blocks with small block sizes). In our pseudocode in the Appendix, the operation of counting the non-zero AC coefficients is omitted for ease of explanation.
Note that the number of non-zero AC coefficients can be easily obtained when decoding the full-resolution videos during downscaling transcoding. Next, we present our experimental results and discuss our findings.

4.5 Experimental Results and Discussions

JM14.2 (H.264/AVC reference software) is used in our implementation [11]. Ten representative CIF (352x288) sequences (i.e., Akiyo, Bus, Bridge, Crew, Coastguard, Football, Foreman, M&D: Mother Daughter, and Paris) and four 4CIF (704x576) sequences (i.e., Crew, Harbour, Ice, and Soccer) are used in our experiments. Almost all the state-of-the-art methods test their algorithms using only CIF sequences. However, testing video sequences with larger resolution such as 4CIF is necessary in order to evaluate the performance of the proposed algorithm for real-life applications. The encoding parameters of the original H.264/AVC videos are listed in Table 4.2. The original H.264/AVC video streams are decoded back into the pixel domain and downscaled according to the required downscaling ratio. Down-sampling of the original decoded video is performed using the tool available in the official scalable video coding reference software (JSVM) [12]. The downscaled video is then re-encoded as an H.264/AVC video stream using our method, the multiple linear-regression-models method, the area-weighted vector median filter method, and the full-scale motion search (FullSearch) approach. The search range of the FullSearch method is ±32 pixels.

Table 4.2 The encoding parameters of the original H.264/AVC videos which are the inputs to the transcoder

| Encoding parameters | Settings |
| GOP structure/length | IPPPP / 30 |
| Variable blocks | 16x16, 16x8, 8x16, 8x8, 8x4, 4x8, and 4x4 |
| Search range | ±64 (4CIF), ±32 (CIF) |
| Search mode | Fast full search |
| Rate-distortion optimization | High complexity |
| Quantization parameter | 22 (I frames), 23 (P frames) |
| No. of frames | 300 |
| Deblocking filter | On |
| Adaptive rounding | Yes |
| Entropy coding | CABAC |
4.5.1 Comparison between the proposed algorithm and the multiple-regression-models method

We first compare the proposed algorithm to the multiple linear-regression-models method [9]. Since the model parameters and the detailed experimental settings are not given in [9], we are not able to compare our method directly with that method. For this reason, we use FullSearch as a reference point for comparing the two methods. Table 4.3 shows the PSNR values and the bit-rates of the downscaled videos resulting from our approach with/without MV refinement and from the FullSearch method that uses the range ±32 pixels. The comparison between the multiple linear-regression-models approach and the FullSearch method is found in [9]. Table 4.4 compares the average bit-rate increase and the PSNR decrease in dB between FullSearch and each of the multiple linear-regression-models method and our proposed method.

Table 4.3 Transcoding performance of the proposed method and FullSearch (with rate-distortion optimization)

| Method | Measure | Akiyo | Bridge | Coastguard | M&D | Paris |
| FullSearch | PSNR (dB) | 38.74 | 40.61 | 33.64 | 37.49 | 34.85 |
| FullSearch | Bit-rate (kbits/sec) | 34.07 | 130.94 | 140.94 | 43.68 | 119.85 |
| Proposed without refinement | PSNR (dB) | 38.67 | 40.56 | 33.56 | 37.35 | 34.77 |
| Proposed without refinement | Bit-rate (kbits/sec) | 35.62 | 132.88 | 156.64 | 47.38 | 132.45 |
| Proposed with quarter-pixel refinement | PSNR (dB) | 38.79 | 40.60 | 33.63 | 37.50 | 34.84 |
| Proposed with quarter-pixel refinement | Bit-rate (kbits/sec) | 34.16 | 131.37 | 146.31 | 44.53 | 123.96 |
| Quantization parameters | | I:28, P:29 | I:22, P:23 | I:29, P:30 | I:29, P:29 | I:29, P:30 |

Table 4.4 Comparative results with the FullSearch method

| Measure | Without MV refinement: Proposed | Without MV refinement: Multiple Linear-Regression-Models* | With quarter-pixel refinement: Proposed | With quarter-pixel refinement: Multiple Linear-Regression-Models* |
| Average bit-rate increase | 7% | 26% | 2% | 11% |
| Average PSNR decrease (dB) | 0.08 | 0.27 | 0 | 0.13 |

* Notice that it is not easy for this method to support any arbitrary downscaling ratio.

The results show that our proposed algorithm achieves better compression performance than the multiple linear-regression-models approach, both with and without MV refinement. Without MV refinement, our proposed algorithm reduces the bit-rate by 19% (26% - 7%) and increases the video quality by 0.19 dB (0.27 - 0.08), on average. With quarter-pixel MV refinement, our proposed algorithm reduces the bit-rate by 9% and improves the video quality by 0.13 dB, on average. In addition, our proposed algorithm does not need the extensive offline training process and can easily support any arbitrary downscaling ratio.

4.5.2 Comparison between the proposed algorithm and the area-weighted vector median filter approach

AWVMF is presently the state-of-the-art H.264/AVC transcoding method for video downscaling applications. It supports arbitrary downscaling ratios. For each CIF or 4CIF sequence, three downscaled spatial resolutions are tested. For the 4CIF sequences, the tested downscaled spatial resolutions are 640x480, 352x288, and 320x240. For the CIF sequences, the tested downscaled spatial resolutions are 224x176, 176x144, and 160x120. When encoding the downscaled videos using H.264/AVC, the rate-distortion optimization (RDO) and the deblocking filter are enabled. All 7 block sizes (i.e., 16x16, 16x8, 8x16, 8x8, 8x4, 4x8 and 4x4) are used in our experiments. The set of quantization parameters is {22, 27, 32, 37} for the I frames and {23, 28, 33, 38} for the P frames. Context-adaptive binary arithmetic coding (CABAC) is used for entropy coding. Adaptive rounding is enabled.

The picture quality of the transcoded videos obtained by our proposed method and that of the AWVMF are compared using the Bjontegaard delta (BD) PSNR differences [13]. BD PSNR measures the average video-quality difference of the encoded videos over a wide range of bit-rates. It offers a more compact and, in some sense, more accurate way to evaluate the compression performance of video encoding, compared to measuring the data resulting from using a single quantization parameter.
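For readers unfamiliar with the BD PSNR measure, the sketch below follows the commonly described procedure: each rate-distortion curve is fitted with a third-order polynomial of PSNR over the logarithm of the bit-rate, both fits are integrated over the overlapping rate interval, and the difference is divided by the interval length. This is a minimal illustration that assumes exactly four rate-distortion points per curve (matching the four quantization parameters used here), so the cubic fit reduces to interpolation. The function names are hypothetical and this is not the code used to produce the results in this chapter.

```python
import math
from typing import List, Sequence


def _cubic_through(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Coefficients c0..c3 of the cubic through four points (Gaussian elimination)."""
    n = 4
    a = [[x ** k for k in range(n)] + [y] for x, y in zip(xs, ys)]
    for col in range(n):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    coeffs = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(a[r][k] * coeffs[k] for k in range(r + 1, n))
        coeffs[r] = (a[r][n] - rest) / a[r][r]
    return coeffs


def _integral(coeffs: Sequence[float], lo: float, hi: float) -> float:
    """Definite integral of the polynomial with the given coefficients."""
    def antiderivative(x: float) -> float:
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def bd_psnr(rates_ref: Sequence[float], psnr_ref: Sequence[float],
            rates_test: Sequence[float], psnr_test: Sequence[float]) -> float:
    """Average PSNR gain (dB) of the test curve over the reference curve."""
    log_ref = [math.log10(r) for r in rates_ref]
    log_test = [math.log10(r) for r in rates_test]
    lo = max(min(log_ref), min(log_test))   # overlapping log-rate interval
    hi = min(max(log_ref), max(log_test))
    fit_ref = _cubic_through(log_ref, psnr_ref)
    fit_test = _cubic_through(log_test, psnr_test)
    return (_integral(fit_test, lo, hi) - _integral(fit_ref, lo, hi)) / (hi - lo)
```

A positive return value means the test method delivers higher PSNR than the reference at the same bit-rate, averaged over the common rate range.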
Only the experimental results of downscaling three CIF and three 4CIF sequences are shown in Table 4.5, since the rest of the sequences yield very similar results. Both the results with and without quarter-pixel MV refinement are given. The quarter-pixel range is used in the experiments of evaluating AWVMF [8]. Compared to AWVMF, our proposed algorithm improves the picture quality of downscaled videos by an average 0.1 dB (maximum 0.17 dB) for original CIF videos. The improvement is 0.07 dB on average for original 4CIF videos (maximum 0.14 dB). These comparative results reflect the compression performance of the transcoded video using our proposed algorithm and AWVMF without performing the MV refinement. When the quarter-pixel MV refinement is employed, our algorithm shows similar results in terms of the improvement on the compression performance.

Table 4.5 BD PSNR difference between our proposed algorithm and AWVMF (proposed minus AWVMF)

Without refinement, original CIF (352x288) sequences (CIF avg. 0.1 dB):

| Reduced frame size | Bus | Crew | Foreman |
| 224x176 | 0.17 dB | 0.08 dB | 0.12 dB |
| 176x144 | 0.16 dB | 0 dB | 0.04 dB |
| 160x120 | 0.17 dB | 0.07 dB | 0.10 dB |

Without refinement, original 4CIF (704x576) sequences (4CIF avg. 0.07 dB):

| Reduced frame size | Crew | Ice | Soccer |
| 640x480 | 0.11 dB | 0.15 dB | 0.08 dB |
| 352x288 | 0 dB | 0 dB | 0 dB |
| 320x240 | 0.13 dB | 0.06 dB | 0.06 dB |

With ±0.25 pixel refinement, original CIF (352x288) sequences (CIF avg. 0.09 dB):

| Reduced frame size | Bus | Crew | Foreman |
| 224x176 | 0.14 dB | 0.06 dB | 0.07 dB |
| 176x144 | 0.16 dB | 0.03 dB | 0.04 dB |
| 160x120 | 0.15 dB | 0.07 dB | 0.05 dB |

With ±0.25 pixel refinement, original 4CIF (704x576) sequences (4CIF avg. 0.05 dB):

| Reduced frame size | Crew | Ice | Soccer |
| 640x480 | 0.09 dB | 0.09 dB | 0.07 dB |
| 352x288 | 0.02 dB | 0 dB | 0 dB |
| 320x240 | 0.11 dB | 0.03 dB | 0.05 dB |

More importantly, our proposed algorithm significantly reduces the computational complexity of finding the motion vectors for the downscaled video as shown in Table 4.6. An Intel Pentium 3.2GHz CPU with 2GB of RAM was used in these experiments.
Although timing measured under an operating system such as Windows XP is not highly accurate, the results agree with our computational complexity analysis in Section 4.4.

Table 4.6 Comparison of time needed to estimate the initial motion vectors for the downscaled videos (seconds)

Original CIF (352x288) sequences (avg. of reduction 73%):

| Downscaled size | Sequence | AWVMF | Proposed | Time reduction |
| 224x176 | Bus | 16.4 | 3.5 | 79% |
| 224x176 | Crew | 26.9 | 7.2 | 73% |
| 224x176 | Foreman | 22.8 | 7.4 | 68% |
| 176x144 | Bus | 6.3 | 2.4 | 61% |
| 176x144 | Crew | 10.1 | 3.2 | 68% |
| 176x144 | Foreman | 8.8 | 4.3 | 51% |
| 160x120 | Bus | 17.3 | 2.5 | 85% |
| 160x120 | Crew | 26.1 | 3.5 | 87% |
| 160x120 | Foreman | 21.5 | 3.9 | 82% |

Original 4CIF (704x576) sequences (avg. of reduction 61%):

| Downscaled size | Sequence | AWVMF | Proposed | Time reduction |
| 640x480 | Crew | 73.1 | 39.5 | 46% |
| 640x480 | Ice | 63.9 | 26.9 | 58% |
| 640x480 | Soccer | 84.7 | 42.8 | 49% |
| 352x288 | Crew | 28.6 | 12.9 | 55% |
| 352x288 | Ice | 21.6 | 10.2 | 53% |
| 352x288 | Soccer | 25.3 | 12.9 | 49% |
| 320x240 | Crew | 68.3 | 11 | 83% |
| 320x240 | Ice | 37.1 | 9.8 | 74% |
| 320x240 | Soccer | 54.8 | 11.4 | 79% |

4.5.3 Investigation of the impact of MV refinement range on compression performance

In the following experiments, we investigate the impact that the MV refinement range has on the compression performance when the proposed algorithm is used. In this set of experiments, the proposed algorithm is employed to derive the motion vectors of the downscaled videos. Since H.264/AVC also derives a predicted MV from encoded blocks, our derived motion vector is compared with the H.264/AVC predicted motion vector and the one yielding the smaller prediction error is chosen as the final initial motion vector. Using this initial motion vector, a small range of motion vector refinement is performed to improve the accuracy of the initial motion vectors. The final transcoded videos are then compared to the videos resulting from the cascaded full-scale motion search structure (FullSearch). The search range of the FullSearch method is ±32 pixels. In this set of experiments, four versions of our scheme are compared with the cascaded full-scale motion search scheme (FullSearch).
These are 1) the proposed algorithm without motion vector refinement, 2) the proposed algorithm with quarter-pixel motion vector refinement, 3) the proposed algorithm with half-pixel refinement followed by quarter-pixel refinement, and 4) the proposed algorithm with ±1.75 pixel motion vector refinement. Table 4.7 shows the BD PSNR difference between the FullSearch scheme and each of these 4 schemes. It shows that without any motion vector refinement, the BD PSNR difference between the proposed algorithm and FullSearch is 0.66 dB for downscaling the CIF sequences and 0.92 dB for downscaling the 4CIF sequences. However, the resultant video quality from our proposed transcoding scheme is much improved by using a small range of motion vector refinement. With the quarter-pixel accuracy (±0.25 pixel) refinement, the difference between FullSearch and our proposed algorithm is 0.12 dB for downscaling CIF sequences and 0.44 dB for downscaling 4CIF sequences. With ±0.75 pixel MV refinement, the average difference becomes 0 dB for downscaling CIF sequences, which means our proposed algorithm achieves the same compression performance as that of the FullSearch scheme. With a refinement range of ±1.75 pixels, our proposed algorithm achieves slightly better downscaled video quality (0.06 dB improvement on average) than that of the FullSearch method. As for downscaling 4CIF sequences, the difference between FullSearch and our proposed algorithm with ±0.75 pixel MV refinement is 0.27 dB on average. The difference becomes 0.17 dB when the MV refinement range is ±1.75 pixels.

Table 4.7 The BD-PSNR differences between FullSearch and our proposed algorithm with MV refinement (Bus: 150 frames, Ice: 240 frames, others: 300 frames). Each column gives the BD-PSNR of the proposed algorithm vs. FullSearch; positive values mean the proposed algorithm yields better quality than FullSearch.

| Sequence (downscaled size) | Proposed without refinement | Proposed with ±0.25 refinement | Proposed with ±0.75 refinement | Proposed with ±1.75 refinement |
| CIF (352x288) | | | | |
| Bus (224x176) | -0.53 dB | -0.04 dB | +0.01 dB | +0.04 dB |
| Bus (176x144) | -0.70 dB | -0.06 dB | +0.01 dB | +0.05 dB |
| Bus (160x120) | -0.73 dB | -0.13 dB | -0.05 dB | 0 dB |
| Foreman (224x176) | -0.79 dB | -0.20 dB | -0.03 dB | +0.03 dB |
| Foreman (176x144) | -0.88 dB | -0.24 dB | -0.05 dB | +0.01 dB |
| Foreman (160x120) | -0.89 dB | -0.19 dB | -0.02 dB | +0.02 dB |
| Crew (224x176) | -0.46 dB | -0.10 dB | +0.03 dB | +0.11 dB |
| Crew (176x144) | -0.47 dB | -0.08 dB | +0.05 dB | +0.13 dB |
| Crew (160x120) | -0.46 dB | -0.05 dB | +0.10 dB | +0.17 dB |
| CIF Avg. | -0.66 dB | -0.12 dB | 0 dB | +0.06 dB |
| 4CIF (704x576) | | | | |
| Crew (640x480) | -0.48 dB | -0.23 dB | -0.12 dB | -0.05 dB |
| Crew (352x288) | -0.60 dB | -0.24 dB | -0.09 dB | +0.01 dB |
| Crew (320x240) | -0.52 dB | -0.17 dB | -0.04 dB | +0.05 dB |
| Ice (640x480) | -1.27 dB | -0.62 dB | -0.39 dB | -0.29 dB |
| Ice (352x288) | -1.56 dB | -0.74 dB | -0.47 dB | -0.33 dB |
| Ice (320x240) | -1.60 dB | -0.77 dB | -0.47 dB | -0.33 dB |
| Soccer (640x480) | -0.58 dB | -0.29 dB | -0.21 dB | -0.15 dB |
| Soccer (352x288) | -0.85 dB | -0.45 dB | -0.31 dB | -0.22 dB |
| Soccer (320x240) | -0.83 dB | -0.48 dB | -0.32 dB | -0.22 dB |
| 4CIF Avg. | -0.92 dB | -0.44 dB | -0.27 dB | -0.17 dB |

In addition, Table 4.8 shows the motion estimation time of the FullSearch method and our proposed algorithm with different MV refinement ranges. For downscaling the CIF sequences, our proposed algorithm with quarter-pixel refinement reaches an impressive time reduction of 94%. The time reduction is 90% when ±0.75 pixel MV refinement is implemented and 89% for ±1.75 pixel MV refinement. As for downscaling the 4CIF sequences, the motion estimation time reduction of our proposed algorithm is 95% when quarter-pixel motion vector refinement is implemented. When ±0.75 pixel motion vector refinement is executed, the time reduction is 91%, and with ±1.75 pixel motion vector refinement, the motion estimation time reduction is 89.5%.
In summary, using the motion vectors derived by our proposed algorithm together with a ±0.75 pixel motion vector refinement achieves the same picture quality as that of the FullSearch method for downscaling CIF test sequences. In the case of downscaling larger resolution video streams, such as 4CIF, a larger range of motion vector refinement (e.g., ±1.75 pixels) is recommended.

Table 4.8 Comparison of motion estimation time between FullSearch and the proposed algorithm with MV refinement (Bus: 150 frames, Ice: 240 frames, others: 300 frames). Times are in seconds; ME time reduction is relative to FullSearch.

| Sequence (downscaled size) | FullSearch (s) | Proposed ±0.25 (s) | Proposed ±0.75 (s) | Proposed ±1.75 (s) | Reduction ±0.25 | Reduction ±0.75 | Reduction ±1.75 |
| CIF | | | | | | | |
| Bus (224x176) | 147.5 | 8.0 | 12.7 | 14.5 | 94.6% | 91.4% | 90.1% |
| Bus (176x144) | 103.4 | 6.1 | 8.4 | 9.7 | 94.1% | 91.9% | 90.6% |
| Bus (160x120) | 65.9 | 3.4 | 6.3 | 7.3 | 94.8% | 90.5% | 88.9% |
| Foreman (224x176) | 217.2 | 16.5 | 25.6 | 28.9 | 92.4% | 88.2% | 86.7% |
| Foreman (176x144) | 141.4 | 9.8 | 16.2 | 17.6 | 93.0% | 88.6% | 87.5% |
| Foreman (160x120) | 100.9 | 7.4 | 10.8 | 12.4 | 92.6% | 89.3% | 87.7% |
| Crew (224x176) | 297.3 | 17.4 | 28.2 | 31.0 | 94.2% | 90.5% | 89.6% |
| Crew (176x144) | 193.6 | 10.9 | 17.9 | 19.7 | 94.4% | 90.8% | 89.8% |
| Crew (160x120) | 136.1 | 7.5 | 12.4 | 14.4 | 94.5% | 90.9% | 89.4% |
| CIF avg. | | | | | 94% | 90% | 88.9% |
| 4CIF | | | | | | | |
| Ice (640x480) | 1804.6 | 142.3 | 223.6 | 264.9 | 92.1% | 87.6% | 85.3% |
| Ice (352x288) | 666.0 | 50.9 | 76.1 | 90.8 | 92.4% | 88.6% | 86.4% |
| Ice (320x240) | 524.5 | 36.2 | 59.9 | 69.0 | 93.1% | 88.8% | 86.8% |
| Soccer (640x480) | 4858.5 | 218.6 | 366.2 | 442.5 | 95.5% | 92.5% | 90.4% |
| Soccer (352x288) | 1693.4 | 72.3 | 120.9 | 145.0 | 95.7% | 92.9% | 91.3% |
| Soccer (320x240) | 1278.6 | 54.7 | 92.3 | 111.1 | 95.7% | 92.8% | 91.4% |
| Crew (640x480) | 4611.8 | 215.5 | 361.5 | 445.0 | 95.3% | 92.1% | 90.8% |
| Crew (352x288) | 1665.6 | 72.0 | 119.9 | 143.6 | 95.7% | 92.8% | 91.5% |
| Crew (320x240) | 1288.2 | 54.6 | 91.9 | 111.0 | 95.8% | 92.9% | 91.3% |
| 4CIF avg. | | | | | 95% | 91% | 89.5% |
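The time-reduction percentages reported in Table 4.8 are consistent with the usual relative-saving formula, (FullSearch time - proposed time) / FullSearch time. The short Python check below assumes that formula (it is not stated explicitly in the text) and uses a hypothetical function name.

```python
def time_reduction(full_search_s: float, proposed_s: float) -> float:
    """Relative ME time saving in percent, assuming (T_full - T_prop) / T_full."""
    return 100.0 * (full_search_s - proposed_s) / full_search_s


# Bus (224x176) with quarter-pixel refinement: 147.5 s vs. 8.0 s
print(round(time_reduction(147.5, 8.0), 1))
# Ice (640x480) with quarter-pixel refinement: 1804.6 s vs. 142.3 s
print(round(time_reduction(1804.6, 142.3), 1))
```

Both calls reproduce the corresponding table entries (94.6% and 92.1%), which supports the assumed definition.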
105 4.6 Conclusion An efficient motion re-estimation approach is proposed to accelerate the transcoding process for compressed H.264/AVC video downscaling applications. The proposed algorithm uses the area-weighted non-zero AC integer-transform coefficients to choose the motion vectors from the full-resolution videos as the initial motion vectors of the downscaled videos. The proposed algorithm is compared with two state-of-the-arts methods. Compared to the area-weighted vector median filter approach, our proposed algorithm reduces the computation of estimating the motion vector by 73% for downscaling CIF coded videos and by 61% for downscaling 4CIF coded videos. In addition, the proposed algorithm yields slightly better compression efficiency. Compared to the multiple linear-regression-models method, our proposed algorithm does not need the extensive offline training process. Moreover, it achieves better compression efficiency for the downscaled video. Finally, in order to achieve the same compression performance as that of the cascaded full-scale motion-search scheme, a small range of motion vector refinement is needed. For downscaling CIF videos, a range of ±0.75 pixel MV refinement results in the same compression performance as that of FullSearch. Using ranges larger than ±0.75 pixel yields even slightly better performance than the FullSearch method does. In the case of downscaling larger resolution video streams, such as 4CIF, a larger range of motion vector refinement (e.g., ±1.75 pixels) is recommended. 106 4.7 References [1] A. T. Connie, P. Nasiopoulos, V. C. M. Leung and Y. P. Fallah, ―Video Packetization Techniques for Enhancing H.264 Video Transmission over 3G Networks, ‖ in Proc. IEEE Consumer Communication and Networking Conf., Jan. 2008, pp. 802-804. [2] I. Ahmad, X. Wei, Y. Sun, Y-Q. Zhang, ―Video Transcoding: An Overview of Various Techniques and Research Issues,‖ IEEE Trans. on Multimedia, vol. 7, no. 5, pp. 793-804, Oct. 2005. [3] J. Xin, C. W. 
Lin, and M. T. Sun, "Digital Video Transcoding," Proceedings of the IEEE, vol. 93, no. 1, pp. 84-97, Jan. 2005.
[4] B. Shen, I. K. Sethi, and B. Vasudev, "Adaptive Motion-Vector Resampling for Compressed Video Downscaling," IEEE Trans. Circuits Syst. Video Technol., vol. 9, pp. 929-936, 1999.
[5] M. Chen, M. Chu, and S. Lo, "Motion vector composition algorithm for spatial scalability in compressed video," IEEE Trans. Consumer Electronics, vol. 47, pp. 319-325, 2001.
[6] J. Xin, M.-T. Sun, and T.-D. Wu, "Motion vector composition for MPEG-2 to MPEG-4 video transcoding," in Proc. Workshop and Exhibition MPEG-4, 2002, pp. 9-12.
[7] Y.-P. Tan, H. Sun, and Y. Liang, "On the methods and applications of arbitrary downscaling video transcoding," in Proc. IEEE Int. Conf. Multimedia & Expo, Aug. 2002, pp. 609-612.
[8] Y.-P. Tan and H. Sun, "Fast Motion Re-Estimation for Arbitrary Downscaling Video Transcoding using H.264/AVC Standard," IEEE Trans. Consumer Electronics, vol. 50, no. 3, pp. 887-894, Aug. 2004.
[9] J. Wang, E. H. Yang, and X. Yu, "An efficient motion estimation method for H.264-based video transcoding with arbitrary spatial resolution conversion," in Proc. IEEE Int. Conf. Multimedia & Expo, Jul. 2007, pp. 444-447.
[10] M. Wien, "Variable Block-Size Transforms for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 604-613, Jul. 2003.
[11] Joint Video Team, H.264/AVC Reference Software Codec (JM), ver. 14.2, [online]. Available: http://iphome.hhi.de/suehring/tml/download/.
[12] Joint Video Team, SVC reference software (JSVM), ver. 9.14, [online]. Available: http://wftp3.itu.int/av-arch/jvt-site/2008_07_Hannover/JVT-AB203.zip.
[13] G. Bjontegaard, "Calculation of Average PSNR Differences between RD curves," ITU-T SG16/Q6, Doc. VCEG-M33, Apr.
2001.

Chapter 5: CONCLUSIONS AND FUTURE WORK

5.1 Significance of the Research

During the last two decades, the video industry has grown so fast that video applications are no longer limited to the traditional TV business. Mobile video applications [1] (e.g., BlackBerry, iPhone/iTouch, etc.) as well as online video applications (YouTube, IPTV, etc.) are starting to have a great impact on our daily life. As sharing videos between different users and applications becomes more and more popular, the need for digital video transcoding techniques that offer cost-effective and efficient universal multimedia access is increasing [2]-[3]. Among all the transcoding applications, those whose output is H.264/AVC coded video are the most popular due to the better compression performance offered by this latest video coding standard [4].

Out of all the H.264/AVC-related transcoding applications, MPEG-2 to H.264/AVC transcoding stands out as the most important. This is because, while MPEG-2 is presently the dominant video coding standard in the digital TV and DVD industries [5]-[6], H.264/AVC is quickly gaining ground, and the two are expected to co-exist for at least the near future. Besides MPEG-2 to H.264/AVC transcoding, downscaling the image size of an H.264/AVC coded video has also drawn a lot of attention [7]. Related video applications support a wide range of display resolutions and bit-rates. In order to provide universal multimedia access to these applications, transcoding an H.264/AVC coded video into its downscaled version is often necessary.

MPEG-2 to H.264/AVC transcoding can be implemented in either the transform domain or the pixel domain. While the transform-domain transcoding structure offers the fastest transcoding speed, it produces transcoded videos with low picture quality. On the other hand, the pixel-domain transcoding structure produces transcoded videos with high picture quality, but at the expense of higher computational complexity.

In Chapter 2, we present a scheme that improves the resultant transcoded video quality by compensating for the distortions inherent in transform-domain MPEG-2 to H.264/AVC transcoding. Our algorithms successfully offset the mismatches between the MPEG-2 and H.264/AVC motion compensation processes without a significant increase in computational complexity. Performance evaluations have shown that using the proposed algorithms in transform-domain transcoding achieves a 5 dB quality improvement over the method that does not compensate for those distortions. In order to offer a more complete comparison, we also compare our proposed algorithms with cascaded pixel-domain structures. The results have shown that our proposed transcoding structure yields computational savings that range from 13%, compared to the cascaded method that re-uses MPEG-2 MVs, to 69%, compared to the cascaded method that uses ±2 pixel motion vector refinement. At the same time, our proposed transform-domain structure achieves almost the same picture quality (a difference of 0.3 dB to 0.6 dB) as that of the cascaded pixel-domain structures.

In Chapter 3, we present an algorithm that speeds up the motion estimation process involved in compressing the output H.264/AVC video for the pixel-domain MPEG-2 to H.264/AVC transcoding structure. This is done by predicting the block-size partitioning of the output video. The algorithm uses our proposed empirical distortion and rate models to select the block size that yields better rate-distortion performance. Experimental results show that our transcoder yields rate-distortion performance similar to that of the state-of-the-art MotionMapping scheme [8], while the computational complexity is significantly reduced, requiring on average 29% of the computations used by the MotionMapping algorithm. Compared to the full-scale motion search scheme, our proposed algorithm reduces the computational complexity by about 99.47% for SDTV sequences and 98.66% for CIF sequences.
The experimental results show that our proposed algorithm offers a better trade-off between computational complexity and picture quality.

The above transcoding structures keep the input and output video at the same spatial resolution. In some applications, such as video-enabled mobile devices, a downscaling operation is often needed to convert a large-resolution video into its downscaled version to satisfy both the bit-rate and the device display requirements. For this type of transcoding, our study focuses on transcoding a coded large-resolution H.264/AVC video to a smaller-resolution H.264/AVC video. Traditional downscaling transcoding techniques cannot be directly applied to H.264/AVC transcoding applications due to the many additional features supported by the H.264/AVC standard. To address this issue, in Chapter 4 we present a study that accelerates the transcoding process of downscaling a coded H.264/AVC video into its downscaled version using arbitrary downscaling ratios. Our method is based on an efficient motion vector composition approach that derives accurate initial motion vectors for the downscaled video. Compared to existing state-of-the-art methods [9]-[10], our proposed scheme achieves the best accuracy with the least computational complexity. In order to achieve the same compression performance as that of the cascaded full-scale motion search scheme, a small range of motion vector refinement should be performed. When the resolution of the downscaled videos is small, the range can be as small as ±0.75 pixels. However, when the resolution of the downscaled videos is large, a larger refinement range, such as ±1.75 pixels, is suggested.

5.2 Potential Applications

The H.264/AVC transcoding applications include digital video streaming, video broadcasting, and video recording over a wide range of modern transmission and storage systems.
The two major TV broadcasting standards, i.e., DVB [6] and ATSC [5],[11], support both the MPEG-2 and H.264/AVC standards. Since MPEG-2 was introduced long before H.264/AVC, the majority of digital TV content has been encoded using MPEG-2. As H.264/AVC is quickly gaining ground in digital TV broadcasting applications, with the new set-top boxes supporting the new video standard, there is an immediate need for transcoding existing MPEG-2 video into its H.264/AVC counterpart. The proposed MPEG-2 to H.264/AVC transcoding algorithms are well suited to this task.

In addition, the proposed MPEG-2 to H.264/AVC transcoding algorithms can be used in digital video recording and Blu-ray playback systems. Recording in H.264/AVC is desirable since it results in a lower overall bit-rate for the same picture quality. Since Blu-ray devices also support H.264/AVC, converting video content from MPEG-2 to H.264/AVC allows playback on these devices.

With the rapid development of wireless networks (e.g., 3GPP [13], WiMAX [14], etc.), video applications for mobile/handheld devices have quickly gained importance in that market. As a result, mobile TV is becoming a very popular application for mobile phone users, and both ETSI and ATSC are developing standards such as DVB-H (handheld) [15] and ATSC-M/H (mobile/handheld) [16], respectively. In both cases, H.264/AVC has been adopted as the standard for video coding. Since mobile/handheld devices usually have smaller resolutions than those of TV sets, transcoding a coded high-resolution video to a downscaled version is often necessary in order to support the two applications. The proposed downscaling transcoding algorithm is well suited to this type of application.

5.3 Contributions

As this thesis tackles the H.264/AVC transcoding problems in three aspects, the corresponding contributions are listed below.

First, for the transform-domain MPEG-2 to H.264/AVC transcoding:

We present a theoretical analysis of the inherent distortions caused by re-employing the MPEG-2 DCT coefficients in the transform-domain MPEG-2 to H.264/AVC transcoding. This analysis offers a useful reference for understanding the distortions in any transform-domain based transcoding that outputs H.264/AVC coded videos.

We develop an efficient scheme that compensates for the distortions caused by re-quantization and interpolation errors inherent in the transform-domain based MPEG-2 to H.264/AVC transcoding. The traditional re-quantization error compensation algorithm for DCT coefficients is updated so that it can be applied to the H.264 integer transform coefficients. Equations that compensate for the luminance half-pixel and chrominance quarter/three-quarter pixel interpolation errors are derived. Our proposed algorithm greatly improves the resultant video quality compared to the open-loop structure. At the same time, it requires less computational complexity than the cascaded pixel-domain structure.

Second, for the pixel-domain MPEG-2 to H.264/AVC transcoding:

We demonstrate the importance of enabling rate-distortion optimization (RDO) during MPEG-2 to H.264/AVC transcoding. Most of the existing approaches choose not to use RDO during the transcoding in order to reduce the computational complexity. However, ignoring RDO severely degrades the resultant video quality of the transcoded H.264/AVC videos. Our thorough experiments show that RDO should always be considered during MPEG-2 to H.264/AVC transcoding.

We propose empirical rate and distortion models that efficiently predict the block-size partitioning of the MBs when encoding the output H.264/AVC video during transcoding. Our algorithm uses the predicted initial motion vectors and rate-distortion optimization techniques to estimate the block partition of the MBs.
Experimental results show that, compared to the state-of-the-art transcoding scheme, our transcoder yields similar rate-distortion performance on the resultant output H.264/AVC video, while significantly reducing the computational complexity.

Finally, for the H.264/AVC video transcoding that involves the resolution-downscaling operation:

We create an area-weighted align-to-worst strategy to calculate the motion vectors of the downscaled videos when downscaling compressed H.264/AVC videos. Compared to the other state-of-the-art methods, our proposed approach requires the least computational complexity for finding the motion vectors of the downscaled video during transcoding. At the same time, our approach yields better or at least the same compression performance on the output downscaled H.264/AVC video.

We investigate the impact of the range of the motion vector refinement on the compression performance of the downscaled videos. We conclude that a ±0.75 pixel motion vector refinement range allows us to achieve the same compression performance as that of the cascaded recoding approach (FullSearch) for downscaling CIF test sequences. In order to achieve the same performance as that of the cascaded recoding approach for downscaling large-resolution test sequences such as 4CIF, a larger range of motion vector refinement (e.g., ±1.75 pixels) is suggested.

5.4 Suggestions for Future Research

The main objective of a transcoding system is to reduce the overall computational complexity while maintaining the best possible picture quality. In most cases, transcoding techniques significantly reduce the computational complexity of converting the video into different formats, but they also cause a slight quality loss compared to the computationally demanding cascaded decoder-encoder structure. Evaluating the video quality and the quality of experience offered to the end-user is still an open question [17].
Since all existing studies evaluate the quality of the transcoded video in terms of PSNR, exploring other quality metrics would greatly benefit the transcoding field. An advantage of transcoding is that existing information about texture and motion is available from the input bitstream. This information can help us choose the appropriate quality metric for the corresponding transcoding application.

As the diversity of video applications keeps increasing, the requirements of transcoding applications also keep evolving. In some applications, it is desirable to obtain better video quality than that of the cascaded decoder-encoder structure, even if a higher computational complexity is required. One possible solution is to control the quantization parameters in order to minimize the quantization errors resulting from encoding the output videos [18]-[19]. Existing methods consider re-quantization schemes only for JPEG images and MPEG-2 Intra-frames. Designing an efficient re-quantization scheme for both Intra-frames and Inter-frames is still a challenging problem, especially for H.264/AVC.

Another approach to achieving higher video quality is to utilize rate-distortion optimization techniques. The current rate-distortion optimization uses Lagrangian optimization, which forms a cost function J = D + λR, with λ being the Lagrangian multiplier, D being the distortion and R being the bit-rate. By minimizing the Lagrangian cost function J, rate and distortion are minimized jointly. The multiplier λ determines how distortion and rate are weighted against each other in the cost function. The selection of λ plays an important role for a video encoder in achieving good rate-distortion performance [20]-[21], which can be measured either by picture quality or by compression ratio. The current approach is an approximate solution that leaves out the influence of the motion and resolution of the original video.
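The Lagrangian mode decision described above can be illustrated with a minimal sketch. It assumes that each candidate coding mode already comes with a measured distortion D and bit-rate R. The `Candidate` type, the mode labels and all numeric values are hypothetical and are not taken from the thesis or from any reference encoder.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    name: str          # hypothetical mode label, e.g. a block-size partition
    distortion: float  # D, e.g. a sum of squared differences
    rate: float        # R, bits needed to code the block in this mode


def lagrangian_cost(c: Candidate, lam: float) -> float:
    """Return J = D + lambda * R for one candidate."""
    return c.distortion + lam * c.rate


def best_mode(candidates: Sequence[Candidate], lam: float) -> Candidate:
    """Pick the candidate with the smallest Lagrangian cost J."""
    return min(candidates, key=lambda c: lagrangian_cost(c, lam))


# Made-up numbers, for illustration only.
modes = [
    Candidate("16x16", distortion=1200.0, rate=40.0),
    Candidate("8x8", distortion=900.0, rate=110.0),
]
print(best_mode(modes, lam=2.0).name)   # prints 8x8: a small lambda favours low distortion
print(best_mode(modes, lam=10.0).name)  # prints 16x16: a large lambda favours low rate
```

The two calls show why the choice of λ matters: the same candidates lead to different decisions depending on how strongly the rate term is weighted.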
Since the input video has already been encoded once during transcoding, existing information about the motion, resolution and resulting residue may be used to achieve better rate-distortion optimization and improve the compression performance at the re-encoding stage.

Finally, the research presented in this thesis can itself be improved in several ways. First, as the resolutions of video streams become larger and larger, additional tests evaluating our proposed algorithms should be run on larger-resolution videos, such as 720p or 1080p. Second, more advanced fast motion estimation algorithms have been developed recently. Comparing our proposed algorithms to these recent fast motion estimation algorithms is important because one of our contributions is to reduce the computational complexity of the motion estimation process. Third, for our work on H.264/AVC downscaling transcoding, the B-frame scenario is currently missing and should be considered in the future.

5.5 References

[1] K. O’Hara, A. S. Mitchell, and A. Vorbau, "Consuming Video on Mobile Devices," in Proc. SIGCHI Conf. on Human Factors in Computing Systems, Apr. 2007, pp. 857-866.
[2] J. Xin, C. W. Lin, and M. T. Sun, "Digital Video Transcoding," Proceedings of the IEEE, vol. 93, no. 1, pp. 84-97, Jan. 2005.
[3] I. Ahmad, X. Wei, Y. Sun, and Y-Q. Zhang, "Video Transcoding: An Overview of Various Techniques and Research Issues," IEEE Trans. on Multimedia, vol. 7, no. 5, pp. 793-804, Oct. 2005.
[4] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits Syst. Mag., vol. 4, no. 1, pp. 7-28, first quarter 2004.
[5] Advanced Television Systems Committee (ATSC), "ATSC Digital Television Standard, Part 4 – MPEG-2 Video System Characteristics," Doc. A/53 Part 4:2009, Aug.
2007.
[6] European Telecommunications Standards Institute (ETSI), "Digital Video Broadcasting (DVB); Specification for the use of Video and Audio Coding in Broadcasting Applications based on the MPEG-2 Transport Stream," Technical Specification 101 154 V1.9.1, Sep. 2009.
[7] D. Marpe, T. Wiegand, and G. Sullivan, "The H.264/MPEG4 Advanced Video Coding Standard and its Applications," IEEE Communications Mag., vol. 44, no. 8, pp. 134-143, Aug. 2006.
[8] J. Xin, J. Li, A. Vetro, H. Sun, and S. Sekiguchi, "Motion Mapping for MPEG-2 to H.264/AVC Transcoding," in Proc. IEEE Int. Symp. Circuits and Systems, May 2007, pp. 1991-1994.
[9] Y.-P. Tan and H. Sun, "Fast Motion Re-Estimation for Arbitrary Downscaling Video Transcoding using H.264/AVC Standard," IEEE Trans. Consumer Electronics, vol. 50, no. 3, pp. 887-894, Aug. 2004.
[10] J. Wang, E. H. Yang, and X. Yu, "An efficient motion estimation method for H.264-based video transcoding with arbitrary spatial resolution conversion," in Proc. IEEE Int. Conf. Multimedia & Expo, Jul. 2007, pp. 444-447.
[11] Advanced Television Systems Committee (ATSC), "Video System Characteristics of AVC in the ATSC Digital Television System," Doc. A/72 Part 1, Jul. 2008.
[12] J. Xin, A. Vetro, S. Sekiguchi, and K. Sugimoto, "MPEG-2 to H.264/AVC Transcoding for Efficient Storage of Broadcast Video Bitstreams," in Proc. Int. Conf. Consumer Electronics, Jan. 2006, pp. 417-418.
[13] 3GPP, "Services and service capabilities (Release 6.2.0)," Technical Specification TS 22.105, Jun. 2003.
[14] B. Li, Y. Qin, C. Low, and C. Gwee, "A survey on mobile WiMAX," IEEE Communications Mag., Dec. 2009.
[15] G. Faria, J. A. Henriksson, E. Stare, and P. Talmola, "DVB-H: Digital Broadcast Services to Handheld Devices," Proceedings of the IEEE, vol. 94, no. 1, pp. 194-209, Jan. 2006.
[16] Advanced Television Systems Committee (ATSC), "ATSC Mobile TV Standard, Parts 1-8," ATSC standard A/153, 2009.
[17] K. Seshadrinathan, A. C.
Bovik, "New vistas in image and video quality assessment," Proc. of SPIE: Human Vision and Electronic Imaging, Jan. 2007.
[18] O. Werner, "Requantization for transcoding of MPEG-2 intra frames," IEEE Trans. Image Process., vol. 8, no. 2, pp. 179-191, Feb. 1999.
[19] H. H. Bauschke, C. H. Hamilton, M. S. Macklem, J. S. McMichael, and N. R. Swart, "Recompression of JPEG images by requantization," IEEE Trans. Image Process., vol. 12, no. 7, pp. 843-849, Jul. 2003.
[20] X. Li, N. Oertel, and A. Kaup, "Adaptive Lagrange Multiplier Selection for Intra-Frame Video Coding," in Proc. IEEE Int. Symp. Circuits and Systems, May 2007, pp. 3643-3646.
[21] L. Chen and I. Garbacea, "Adaptive λ Estimation in Lagrangian Rate-Distortion Optimization for Video Coding," in Proc. SPIE Visual Communication and Image Processing, vol. 6077, 60772B, Jan. 2006.

APPENDICES

Appendix A Pseudocode of different motion vector re-estimation methods

The notation used in the following pseudocode is listed in the table below. Self-explanatory variables are not listed.

n                             The number of available blocks in the original video
mvj                           The motion vector of block bj
mvxj, mvyj                    The x and y components of the motion vector of block bj
mvselect                      The selected motion vector in the original video
mvxselect, mvyselect          The x and y components of the selected motion vector
α0 … αn                       The model parameters of the linear-regression models
Aj                            The measure of the motion-compensation prediction error for block bj
weight(mvj)                   The area-weighting factor for motion vector mvj
width(bj), height(bj)         The width and height of block bj in pixels
width(total), height(total)   The width and height of all related blocks in the original video
MIN_DIST                      The minimum distance
MAX_NONZERO                   The maximum weighted number of non-zero AC integer-transform coefficients

Area-Weighted Vector Median Filter Method

for j ← 1 to n
    weight(mvj) = width(bj) × height(bj) / (width(total) × height(total))
for j ← 1 to n
    distance = 0
    for i ← 1 to n
        distance += sqrt( (weight(mvi) × mvxi - weight(mvj) × mvxj)^2 + (weight(mvi) × mvyi - weight(mvj) × mvyj)^2 )
    if distance < MIN_DIST
        MIN_DIST = distance
        mvselect = mvj

Multiple Linear-Regression-Models Based Method

mvxselect = α0
mvyselect = α0
for j ← 1 to n
    mvxselect += αj × mvxj
    mvyselect += αj × mvyj

Our Proposed AWAW Method

for j ← 1 to n
    weightj = width(bj) × height(bj) / 256
    Aj = nonzeroAC(bj)
    if weightj × Aj > MAX_NONZERO
        MAX_NONZERO = weightj × Aj
        mvselect = mvj

Appendix B List of recent publications

Journals

Q. Tang, P. Nasiopoulos, and R. Ward, "An Efficient Motion Re-Estimation Scheme for H.264/AVC Video Transcoding with Arbitrary Downscaling Ratios," submitted to IEEE Trans. Circuits Syst. Video Technol., Feb. 2010.

Q. Tang and P. Nasiopoulos, "Efficient Motion Re-Estimation with Rate-Distortion Optimization for MPEG-2 to H.264/AVC Transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 2, pp. 262-274, Feb. 2010.

Q. Tang, P. Nasiopoulos, and R. Ward, "Compensation of Re-quantization and Interpolation Errors in MPEG-2 to H.264 Transcoding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, pp. 314-325, Mar. 2008.

Conference Proceedings

Q. Tang, P. Nasiopoulos, and R. Ward, "Fast Block-Size Partitioning Using Empirical Rate-Distortion Models for MPEG-2 to H.264/AVC Transcoding," in Proc. IEEE Int. Sym. Circuits Syst., Paris, May 2010, accepted.

Q. Tang, P. Nasiopoulos, and R. Ward, "Efficient Motion Vector Re-Estimation for MPEG-2 to H.264/AVC Transcoding with Arbitrary Down-Sizing Ratios," in Proc. IEEE Int. Conf. Image Processing, Nov. 2009, pp. 3689-3692.

Q. Tang, H. Mansour, P. Nasiopoulos, and R. Ward, "Bit-Rate Estimation for Bit-Rate Reduction H.264/AVC Video Transcoding in Wireless Networks," in Proc. IEEE Int. Sym. Wireless Pervasive Computing, May 2008, pp. 464-467.

Q. Tang, P. Nasiopoulos, and R. Ward, "Fast Block Size Prediction for MPEG-2 to H.264/AVC Transcoding," in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, Apr. 2008, pp. 1029-1032.

Q. Tang, P. Nasiopoulos, and R. Ward, "Efficient Chrominance Compensation for MPEG-2 to H.264 Transcoding," in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, Apr. 2007, pp. I.1129-I.1132.

Q. Tang, R. Ward, and P. Nasiopoulos, "An Efficient MPEG-2 to H.264/AVC Half-Pixel Motion Compensation Transcoding," in Proc. IEEE Int. Conf. Image Process., Oct. 2006, pp. 865-868.

Q. Tang, P. Nasiopoulos, and R. Ward, "An Efficient Re-quantization Error Compensation for MPEG-2 to H.264 Transcoding," in Proc. IEEE Int. Sym. Signal Process. Inform. Technology, Aug. 2006, pp. 530-535.
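
Supplementary sketch for Appendix A. Two of the motion vector selection procedures listed in Appendix A (the proposed AWAW method and the area-weighted vector median filter) can be made concrete in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the thesis: the `Block` type and its field names are hypothetical, the total area is taken as the sum of the block areas, the running maximum and minimum are initialised to -1 and infinity, and ties are resolved in favour of the first block.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Block:
    width: int               # block width in pixels
    height: int              # block height in pixels
    mv: Tuple[float, float]  # (mvx, mvy) of the block in the original video
    nonzero_ac: int          # number of non-zero AC integer-transform coefficients


def select_mv_awaw(blocks: Sequence[Block]) -> Tuple[float, float]:
    """Area-weighted align-to-worst: return the MV of the block whose
    area-weighted non-zero AC count is largest."""
    if not blocks:
        raise ValueError("at least one block is required")
    best_mv = blocks[0].mv
    max_nonzero = -1.0  # assumed initial value, below any possible score
    for b in blocks:
        weight = b.width * b.height / 256.0  # area relative to 16x16 pixels
        score = weight * b.nonzero_ac
        if score > max_nonzero:
            max_nonzero = score
            best_mv = b.mv
    return best_mv


def select_mv_vector_median(blocks: Sequence[Block]) -> Tuple[float, float]:
    """Area-weighted vector median: return the MV whose weighted copy has the
    smallest summed Euclidean distance to all weighted MVs."""
    if not blocks:
        raise ValueError("at least one block is required")
    total_area = sum(b.width * b.height for b in blocks)  # assumption, see lead-in
    weighted = [
        (b.width * b.height / total_area * b.mv[0],
         b.width * b.height / total_area * b.mv[1])
        for b in blocks
    ]
    best_mv = blocks[0].mv
    min_dist = math.inf
    for j, (xj, yj) in enumerate(weighted):
        distance = sum(math.hypot(xi - xj, yi - yj) for xi, yi in weighted)
        if distance < min_dist:
            min_dist = distance
            best_mv = blocks[j].mv
    return best_mv


# Made-up blocks: the AWAW scores are 2.0, 3.0 and 2.5, so the second block is chosen.
blocks = [
    Block(16, 16, (4.0, 0.0), nonzero_ac=2),
    Block(8, 8, (1.0, 1.0), nonzero_ac=12),
    Block(8, 16, (0.0, 2.0), nonzero_ac=5),
]
print(select_mv_awaw(blocks))
print(select_mv_vector_median(blocks))
```

The AWAW selection needs a single pass over the blocks, whereas the vector median filter compares every pair of weighted vectors, which is in line with the lower computational cost reported for the proposed method.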
Item Metadata
Title | Computationally efficient techniques for H.264/AVC transcoding applications |
Creator |
Tang, Qiang |
Publisher | University of British Columbia |
Date Issued | 2010 |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-05-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0064910 |
URI | http://hdl.handle.net/2429/24821 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2010-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |