AN EFFICIENT MIDDLE-OUT PREDICTION STRUCTURE FOR LIGHT FIELD VIDEO COMPRESSION USING MV-HEVC

by

Joseph Khoury

B.Sc., American University of Science and Technology, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

July 2019

© Joseph Khoury, 2019

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, a thesis entitled "An efficient middle-out prediction structure for light field video compression using MV-HEVC", submitted by Joseph Khoury in partial fulfillment of the requirements for the degree of Master of Applied Science in Electrical and Computer Engineering.

Examining Committee:
Dr. Panos Nasiopoulos, Supervisor
Dr. Robert Rohling, Supervisory Committee Member
Dr. Matthew Yedlin, Supervisory Committee Member

Abstract

Light field imaging has emerged as a technology that enables the capture of richer visual information. While traditional photography captures only a 2D projection of the light in a scene, a light field camera collects the radiance of rays travelling in all directions and extracts the angular information that is otherwise lost in conventional photography. This angular information can be used to substantially improve immersiveness, focus, depth, color, intensity and perspective, opening up new market opportunities. Nevertheless, the high dimensionality of light fields brings its own challenges, most notably the size of the captured data. Research in light field image compression is becoming increasingly popular, but light field video compression remains a relatively under-explored field. State-of-the-art solutions attempt to apply existing multi-view coding (MVC) methods to encode light field videos. While these solutions show potential, they do not fully address the bandwidth problem imposed by the amount of data involved. Hence, there is a real need for improvement that takes advantage of the additional redundancies and intricacies of light field video.

In this thesis, we propose a three-dimensional prediction structure for efficiently coding light field video using the MV-HEVC standard. First, we modify the inter-view structure in order to exploit the higher similarity found around the central set of views compared to the views around the edges. In addition, the views that start with a P-frame are chosen so as to maximize their utilization as references by other views. Second, we build upon this structure by expanding the GOP size and creating a more efficient temporal structure that better utilizes the higher-fidelity sequences as references for subsequent frames. The scheme assigns different temporal structures to the views based on their encoding order, allowing the later views to rely more heavily on frames from other views as references and thereby compensate for the fact that their own preceding frames are more coarsely quantized.

Lay Summary

The advent of light field technology has brought with it a need for an efficient form of compression.
While light field image coding has become a well-studied area of research, light field video coding remains a relatively un-surveyed topic. One promising avenue for research is to utilize multi-view encoders to compress the numerous views as a collection of sequences. This enables us to take advantage of the redundancy found between views. The majority of existing solutions do not take into consideration the intricacies exhibited by light fields and thus there remains room for improvement. In this work, we propose a three dimensional prediction structure for coding light field videos. The results show that it outperforms existing state-of-the-art solutions by significantly improving the compression efficiency. vi  Preface  All of the work presented in this thesis was conducted in the Digital Multimedia Laboratory at the University of British Columbia, Vancouver campus. All figures and tables used in this thesis are original. All projects and associated methods were approved by the University of British Columbia’s Research Ethics Board [certificate #H12-00308]. A version of Chapter 3, sub-sections 3.2.3 has been published as J. Khoury, M. T. Pourazad and P. Nasiopoulos, “A New Prediction Structure for Efficient MV-HEVC based Light Field Video Compression,” in IEEE International Conference on Computing, Networking and Communications (ICNC), Honolulu, USA, 2019. I was the lead investigator responsible for all areas of research, data collection, and the majority of manuscript composition. M. T. Pourazad was involved in the early stages of research concept formation, technical guidance and aided with manuscript edits. P. Nasiopoulos was the supervisor on this project and was also involved with research concept formation, guidance and manuscript edits. vii  Table of Contents  Abstract ......................................................................................................................................... iii Lay Summary .................................................................................................................................v Preface ........................................................................................................................................... vi Table of Contents ........................................................................................................................ vii List of Tables ................................................................................................................................ ix List of Figures .................................................................................................................................x List of Abbreviations ................................................................................................................... xi Acknowledgements ..................................................................................................................... xii Dedication .................................................................................................................................... xii Chapter 1: Introduction ................................................................................................................1 1.1 Overview ......................................................................................................................... 1 1.2 Motivation ....................................................................................................................... 
6 1.3 Thesis Organization ........................................................................................................ 7 Chapter 2: Background .................................................................................................................8 2.1 High Efficiency Video Coding ....................................................................................... 8 2.2 Multi-view coding ......................................................................................................... 10 2.2.1 H.264/AVC MVC ..................................................................................................... 10 2.2.2 MV-HEVC ................................................................................................................ 11 2.3 Video compression frame types .................................................................................... 13 2.4 Light field video coding ................................................................................................ 14 Chapter 3: Proposed Approach ..................................................................................................19 viii  3.1 Introduction ................................................................................................................... 19 3.2 Our Proposed Method ................................................................................................... 19 3.2.1 Overview ................................................................................................................... 19 3.2.2 Inter-prediction structure .......................................................................................... 21 3.2.3 Temporal prediction structure ................................................................................... 25 Chapter 4: Results and Discussion .............................................................................................28 4.1 Introduction ................................................................................................................... 28 4.2 Experimental Setup ....................................................................................................... 28 4.3 Objective Results .......................................................................................................... 30 4.4 Subjective Results ......................................................................................................... 33 Chapter 5: Conclusion and Future Work ..................................................................................44 5.1 Conclusion .................................................................................................................... 44 5.2 Future Work .................................................................................................................. 45 Bibliography .................................................................................................................................46  ix  List of Tables  Table 4.1 BD-rate for PSNR variations ........................................................................................ 33 Table 4.2 BD-rate for MOS variations ......................................................................................... 40  x  List of Figures  Figure 1.1 Light field camera design .............................................................................................. 
2 Figure 1.2 Raw light field image .................................................................................................... 3 Figure 1.3 2D array of sub-aperture images ................................................................................... 3 Figure 1.4 Sub-aperture image extraction ....................................................................................... 4 Figure 1.5 Camera 2D array............................................................................................................ 5 Figure 2.1 Combination of temporal and inter-view prediction structure for MVC .................... 11 Figure 2.2 Example of a multiview prediction structure for 3-views ........................................... 12 Figure 2.3 Example of a GOP containing the 3 frame types ........................................................ 13 Figure 2.4 3D matrix representation of a light field video sequence. ........................................... 15 Figure 2.5 Inter-prediction structure proposed in [38].................................................................. 18 Figure 3.1 Inter-prediction structure of our proposed scheme ...................................................... 19 Figure 3.2 Heat maps of SSIM ..................................................................................................... 20 Figure 3.3 View numbering in our scheme ................................................................................... 21 Figure 3.4 Encoding process of the proposed scheme .................................................................. 24 Figure 4.1 Content preparation before input to encoder ............................................................... 29 Figure 4.2 PSNR distortion curves of our method and the structure in [38] ................................ 32 Figure 4.3 Video arrangement for subjective test ......................................................................... 35 Figure 4.4 MOS-rate comparison of our method and the structure in [38] .................................. 39 Figure 4.5 Visual results from the “ChessPieces” sequence......................................................... 42 Figure 4.6 Visual results from the “Boxer-Irishman-Gladiador” sequence .................................. 43  xi  List of Abbreviations  BD-rate Bjøntegaard Delta rate CTB  Coding Tree Block CTU  Coding Tree Unit  CU  Coding Unit DSIS  Double stimulus impairment scale GOP  Group of Pictures HEVC  High Efficiency Coding JVT  Joint Video Team MOS  Mean opinion score MPEG  Moving Picture Experts Group MRM  Motion region merging MVC  Multi view coding MV-HEVC Multi view High Efficiency Coding PSNR  Peak signal-to-noise ratio PU  Prediction Unit SSIM  Structural Similarity Index metric TU  Transform Unit xii  Acknowledgements I would like to express my deepest gratitude to my supervisors, Dr. Panos Nasiopoulos for providing me with this great opportunity and for his continuous support and guidance throughout my program. Without him, the work presented here would not have been possible. Also, many thanks to Dr. Mahsa Talebpourazad for her help and support throughout this work and for providing me with continuous and valuable feedback. To all my colleagues who helped me improve myself and push my boundaries, and all my friends who made me feel loved and supported me during this challenging chapter of my life. xiii  Dedication  I would like to dedicate this thesis to my parents, for all the love and sacrifices they have made in my life. 
Thank you for making me the man I am today.  To my beloved brothers, thank you for the love and support you have given me. I could not imagine my life without you.  Finally, I owe my sincere gratitude to my uncle, friend and role model Mr. Ibrahim Gedeon. Thank you for believing in me and providing me with your words of wisdom when I needed it. 1  Chapter 1: Introduction 1.1 Overview Light field technology is the latest emerging innovation in digital media, seen as an “upgrade” to the way optical devices take in light, with the potential to revolutionize the way we view and deliver real-life and virtual video content. Light field is defined as the light rays at every point in a space that travel in every direction. It can be represented as a 4D-vector function that describes the light rays’ radiance as a function of their position and direction in free space. Light field acquisition schemes are implemented by the means of sampling the 4D plenoptic function to measure radiance, color and intensity values for the light rays flowing in different directions [1]. As opposed to traditional cameras that only capture the scene from a single viewpoint, a light field camera allows light to be captured from multiple viewpoints. When the 3D scene is captured with a conventional camera, only the light color and intensity are projected on the 2D image.  Alternatively, when a light field camera is used to capture the scene, the directional information of the light is preserved. As a result, each rays’ color, intensity and angular direction are captured, producing images with both color and depth information. Light field cameras (also called plenoptic cameras) have a micro lens array just in front of the imaging sensor [2]. Such arrays consist of many microscopic lenses (often in the range of the hundreds of thousands) with tiny focal lengths, and split up what would have become a 2D-pixel into individual light rays just before reaching the sensor. This setup can be seen in Fig 1.1, where a simplified design of the light field camera is illustrated. Note that the rays of light are separated by the micro-lens before they hit the image sensor at the back of the camera.  2  The resulting raw image consists of a two dimensional array of micro-lens images else know as sub-aperture images, which exhibit repetitive patterns that contain the spatial and angular information of a scene [3]. The output of the light field camera is a 2D raw image that can then be decomposed to create a 4D light field as illustrated in Fig. 1.2. Beyond this, the processed sub-aperture images are extracted from the raw image by taking a single pixel location under each micro-lens and combining them in a 2D array formation, producing a single image. This results in a number of views based on the number of micro-lenses, whereby each view contains a slightly shifted perspective from the other.  This 2D array of views can be seen in Fig 1.3. This process of sub-aperture image extraction is illustrated in Fig. 1.4. The figure shows the top-left view being extracted by combining the top-left pixels under each micro-lens and arranging the pixels based on the respective order of the   Fig. 1.1 Light field camera design   3  micro-lens arrangement in 2D array. For example, the pixels of the central micro-lens will correspond to the central pixel of each view (using different pixels). The same will apply to every other micro lens in the array.  Fig. 1.2 Raw light field image  Fig. 
1.3 2D array of sub-aperture images    4  In addition to this, light field capture systems can also be created from an array of cameras, an example of such a camera arrangement can be seen in Fig. 1.5. Each camera in the array captures the same scene from a slightly different perspective and contributes a varied vantage point. This  Fig. 1.4 Sub-aperture image extraction   5  results in a large number of views of the same scene, resulting in a huge increase in the size of captured data, which for the setup shown in Fig. 1.5 is 96 times larger than that of a single view.   Using post processing, every vantage point in the array is then combined to produce light field images. The result is an abundance of captured information that enables the depth of field, focus position, or resolution to be changed even after the scene has been recorded. Furthermore, depth and distance information can be obtained from the images, which can be used for applications requiring segmentation or object detection [4]. With this in mind, there is a vast number of applications for light field technology from cinematography [5] and virtual reality [6] and medical applications [7].  On the other hand, compared to conventional image/video content, light field generates huge amounts of data, demanding the design of new and effective compression methods. The  Fig. 1.5 Camera 2D array  6  objective in this case is to modify existing coding standards to address the idiosyncrasies of light field content to efficiently match the requirements of storage, transmission bandwidths, display devices, and computational resources. While there are differences between the sub-aperture images, there also exists a large amount of redundancy. Emerging light field compression methods attempt to remove this redundancy utilizing existing conventional image and video compression methods, such as wavelet transforms [8], pseudo sequence coding [9] and predictive coding [10]. The introduction of light field video has added the time dimension into the function, making the storage and compression challenges yet more urgent and difficult.  A light field video can be regarded as a temporal sequence of 2D image arrays. In other words, a light field video can be thought of as a series of video sequences arranged in a two dimensional matrix. Most of the existing work on light field compression focuses on light field image coding, as video content was not widely available. A straightforward application of image coding solutions for light video coding would not be efficient. As a result, a number of works have been proposed that utilize multi-view encoders to compress light field video content. However, these structures can still be further improved upon as they do not take into account all the intricacies of light field content. 1.2 Motivation The emergence of light field technology and the complexity of its data has created a demand for an efficient form of compression, with a number of research works proposed that attempt to tackle the problem. Multi-view encoders have proved to be the most promising platform for designing a prediction structure, since the multiple views of a light field can be processed as different 7  sequences by the encoder while still utilizing the redundancy between the images in both the spatial and temporal domain. 
In this thesis, we propose an efficient three-dimensional prediction structure that incorporates both inter-view and inter-frame predictions using the multi-view extension of HEVC (MV-HEVC) for coding light field video sequences. Our method is based on a novel two-dimensional inter-prediction structure that is based on the fact that the correlation between views around the center is higher than the views around the edges. This is complemented by a temporal prediction model that utilizes the correlation between the different views in a sequence within the temporal domain to more efficiently encode light field video content. 1.3 Thesis Organization The remainder of the thesis is structured as follows: Chapter 2 provides background information on HEVC, multi-view and light field compression. Chapter 3 explains in detail our proposed prediction structure for encoding light field video content. In Chapter 4, we present the objective and subjective tests that we performed to evaluate our proposed prediction structure and analyzes and discusses the results. Finally, conclusions and future work are drawn in Chapter 5. 8  Chapter 2: Background In this section we provide background information on video compression and multi-view video compression standards and then light field video compression techniques. In subsection 2.1 we give a detailed summary of the existing video compression standard known as High Efficiency Video Coding (HEVC), and a comparison with its predecessor (H.264/AVC) on their coding efficiency. In subsection 2.2 we provide basic information on multi-view video coding schemes and the various developments in standards. In  2.2.1 we cover H.264/AVC MVC, in 2.2.2 we detail MV-HEVC. Finally, in subsection 2.3 we give an overview of the existing light field video compression methods.  2.1  High Efficiency Video coding (HEVC) High Efficiency Video Coding (HEVC) is a video compression standard that is designed to succeed the widely used H.264/AVC. In comparison to AVC, HEVC offers an improved compression performance of at the same bit rate. Objective comparative findings indicate that the present HEVC design outperforms H.264/AVC from 25% to 50% in terms of bit rate. On the subjective side, a visual quality comparison of compressed videos indicates that HEVC outperforms AVC in average bit rate savings by 58% [11], [12]. One advantage of HEVC is its capability to use larger coding tree unit (CTU) sizes. In tests that were performed evaluating the impact of CTU size on compression efficiency, the large CTU sizes increased coding efficiency while also reducing decoding time [13]. HEVC uses a quad-tree based coding framework that supports coding units of more diverse dimensions than those found in H.264/AVC. The CTU size is typically set to 64×64 (up from 16x16 in AVC). Initially, HEVC 9  splits the image into CTUs that are then split into coding tree blocks (CTBs) for each luma / chroma element. CTBs are then split into one or more coding units (CUs). CUs are then divided into prediction units (PUs) of either intra-picture or inter-picture prediction type which can vary in size from 64×64 to 4×4. A CU is split into a quad tree of transform units (TUs) to code the residual prediction. TUs contain coefficients for spatial block transform and quantization. A TU can be 32×32, 16×16, 8×8, or 4×4 pixel block sizes [14]. HEVC introduces a technique titled “motion region merging” (MRM) to facilitate inter-prediction. At the PU level, HEVC offers a range of modes for encoding. 
These modes include the MRM mode, the skip mode, and an explicit encoding of motion parameters. The MRM mode generates a list of previously coded neighboring (spatially or temporally) PUs, called candidates, for encoding the current PU. The motion information of a selected candidate is copied for the current PU, avoiding the need to encode a motion vector; HEVC encodes only the candidate's index in the motion merge list and the residual data. An alternative mode is the skip mode: the encoder signals the index of a motion merge candidate and copies its motion parameters for the current PU, but unlike the MRM mode it does not send any residual data. Because of this, areas of a picture that change little from one frame to the next can be encoded using very few bits.

HEVC utilizes 35 luma intra prediction modes, compared to the nine used in H.264/AVC. In addition, depending on the size of the PU, HEVC's intra prediction is flexible enough to be applied to a range of block sizes, from as small as 4×4 up to 64×64 [15].

2.2 Multi-view coding
With the advent of multi-view technologies, the industry has sought out solutions that facilitate the end-to-end multi-view pipeline. The simplest solution would be to encode every view separately with a state-of-the-art video codec. Multi-view video, however, exhibits a high cross-correlation between views, since all the cameras capture the same scene from slightly different viewpoints.

2.2.1 H.264/AVC MVC
To exploit this redundancy, the Joint Video Team (JVT) created Multiview Video Coding (MVC), an extension of the H.264/AVC video compression standard [16]. MVC allows effective encoding of sequences captured independently by multiple cameras [17]-[19]. MVC is designed to exploit the dependencies between the multiple views of the same scene as well as the temporal dependencies within each view, as illustrated in Fig. 2.1. The frames in each sequence are predicted not only from temporally neighbouring images but also from corresponding images in the adjacent views. This is accomplished through inter-view prediction, where pictures from other views can be used as references when coding an image. MVC allows flexible prediction, referencing multiple images in either the temporal or the inter-view direction [20]. The standardization of MVC began in July 2006 and was completed in 2009. Statistical evaluations have shown that significant gains can be expected from such combined temporal/inter-view prediction [21], [22]. Pioneering work on multi-view image coding is reported in [23], [24].

2.2.2 MV-HEVC
To expand on this previous work and leverage the state-of-the-art compression capabilities of the High Efficiency Video Coding (HEVC) standard [25], [26], the Moving Picture Experts Group (MPEG) released a vision for the next-generation 3D video format [27]. Their aim was a 3D video format that would enable the efficient generation of intermediate views in order to support sophisticated stereoscopic display features and emerging autostereoscopic displays. Multi-view HEVC (MV-HEVC) is an extension of the high-level syntax that includes the signalling of prediction dependencies between different views, the identification of which pictures belong to each view, and syntax elements to enable base-view extraction.
A major advantage of the MV-HEVC architecture is that it does not change the syntax or decoding process required for   Fig. 2.1 Combination of temporal and inter-view prediction structure for MVC [20] 12  HEVC single-layer coding below the slice level [28]. This enables the reuse of current applications to build MV-HEVC decoders without major changes. The inter-view prediction is enabled through the versatile reference picture management capabilities found in HEVC. A prediction structure example is illustrated in Fig. 2.2. The decoded images from other views are essentially placed in the reference picture lists of the present perspective for use in prediction processing. As a consequence, the reference picture lists include the temporal reference photos of the present perspective that can be used to predict the present image along with the interview reference images from the same instance's adjacent views. This design requires only minor changes to the high-level syntax, in this case an indication of the prediction dependency across views. The prediction is adaptive so that the best predictor can be chosen on block-based basis between temporal and interview references. In this manner, stereo content is compressed more effectively than using so-called frame-compatible formats [28].    Fig. 2.2 Example of a multiview prediction structure for 3-views [29] 13  2.3 Video compression frame types In the field of video compression a video frame is compressed using different algorithms with different advantages and disadvantages, centered mainly around amount of data compression. These different algorithms for video frames are called picture types or frame types. The three major picture types used in the different video algorithms are I, P and B. An I‑frame is the least compressible and does not require other video frames to be decoded previously. A P‑frame holds only the changes in the image from one previous frame and are more compressible than I‑frames. B‑frames can use both previous and forward frames for data reference to get the highest amount of data compression. The GOP (group of pictures) is a collection of successive pictures within a coded video stream that specifies the order in which the frames are arranged. An I frame indicates the beginning of a GOP, after which several P and B frames follow as seen in Fig. 2.3.   Fig. 2.3 Example of a GOP containing the 3 frame types 14  2.4 Light field video coding Compared to conventional image/video content, light field generates huge amounts of data, demanding the design of new and effective compression methods. The objective in this case is to modify existing coding standards to address the idiosyncrasies of light field content to efficiently match the requirements of storage, bandwidth transmission, display devices, and computational resources.  While there are differences between the sub-aperture images, there also exists a large amount of redundancy. Emerging light field compression methods attempt to remove this redundancy by utilizing existing conventional image and video compression methods, such as wavelet transforms [8], pseudo sequence coding [9] and predictive coding [10]. The introduction of light field video has added the time dimension into the function, making the storage and compression challenges yet more urgent and difficult. Light field videos can be perceived as a sequence of 2D image arrays in time. In other words, a light field video can be thought of as a series of video sequences arranged in a two dimensional matrix. 
The majority of the existing work on light field compression focuses on image coding, as video content was not widely available.  A straightforward application of image coding solutions for light video coding would not be efficient. One straightforward approach to compress a light field video sequence is by separately encoding each view sequence as a 2D video stream. This was implemented in a previous work using the H.264/MPEG4-AVC simulcast coding [30]. This produces an independent encoded video stream for each field of view of the light field video sequence. Nevertheless, this method is 15  not very efficient as it only takes into consideration the inter-frame correlation between subsequent frames of each view sequence. However, it does not exploit the strong correlation between views. An alternative approach is to represent a light field video stream as a three dimensional matrix (see Fig. 2.4). In this case, the horizontal parallax corresponds to one dimension, the vertical parallax corresponds to the second dimension, and the time corresponds to the third dimension. This 3D matrix should first be converted into multiple individual video streams which may be compressed using existing standards. Taking into consideration the fact that a light field video exhibits similar characteristics to multi-view videos, one potential solution is to use multi-view encoding schemes for compression. One such work is presented in [31], where the authors propose an applicable method for 3D video by applying the Multiview Coding extension of H.264/AVC (MVC) [18] on light field video using two prediction structures, “IPPPP” and “IBPBP”, showing that it outperforms the conventional 2D H.264 standard. Despite that, the multi-view approach   Fig. 2.4 A 3D matrix representation of a light field video sequence. 16  considered in this work possesses only one directional parallax between views, which is not efficient for the light field video case as it possesses two-directional parallaxes.   Accordingly, a more appropriate approach to compress light field multi-view video would be to convert the two-dimensional multi-view video sequence into a one-dimensional multi-view video sequence. This can be done by arranging the ordering of extracted views from the 2D multi-view matrix into a specific order that best utilizes the similarity between the views. Rearranging views is important as existing encoders are primarily used for content that displays one-dimensional (horizontal) parallax in contrast to the two dimensions (horizontal and vertical) supported by light field. The authors used a transposed ordering of images in MVC in [32]. This approach goes from top to bottom through the views of each frame in a horizontal scan order, converting a two-dimensional video into a one-dimensional video. This method therefore takes advantage of both spatial and temporal correlations between light field views and can outperform the compression efficiency of the previous method.  Later this work was then expanded upon in [33] by creating the one-dimensional multi-view video and then applying the conventional hierarchical B-frame prediction structure in MVC. Despite this, in both methods the vertical correlation between views is lost during the conversion process, resulting in ineffective predictions. Although other similar methods use different scanning schemes varying from spiral to horizontal zigzag, they all experience the same challenges in prediction accuracy [34].  
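As a concrete illustration of this view-reordering step, the following sketch generates the one-dimensional orderings produced by a horizontal (raster) scan and by the transposed (column-wise) scan for a 5x5 view grid. It is a simplified example under our own naming, not the exact schemes of [32]-[34].

```python
# Convert a 2D grid of views into a 1D encoding order (simplified illustration).
ROWS, COLS = 5, 5

def raster_scan(rows, cols):
    """Horizontal (row-by-row) scan order over the view grid."""
    return [(r, c) for r in range(rows) for c in range(cols)]

def transposed_scan(rows, cols):
    """Vertical (column-by-column) scan order, i.e. the transposed ordering."""
    return [(r, c) for c in range(cols) for r in range(rows)]

print(raster_scan(ROWS, COLS)[:6])      # (0,0), (0,1), ... along the first row
print(transposed_scan(ROWS, COLS)[:6])  # (0,0), (1,0), ... down the first column
```

Whatever the scan, the resulting one-dimensional sequence only preserves neighbourhood relations along the scan direction, which is exactly the loss of vertical (or horizontal) correlation noted above.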
A number of methods have been published that follow an alternative approach, by employing a two directional inter-view prediction structure. The advantage of these methods is that they utilize 17  both the vertical and horizontal correlation between the different views. The authors in [35] propose a structure that uses a horizontal prediction scheme, adds vertical inter-view prediction on the first and central column using an H.264/AVC encoder and applies the typical settings for MVC. Nevertheless, the main drawback of this structure is that only a limited number of views have access to both a horizontal and vertical reference.  In order to address this issue, a number of proposed works have attempted to find the ideal 2D MVC-based prediction structure for light field video coding. One example of such work is presented in [36], which proposes a 2D prediction structure by encoding the central view as an I-frame, and then the remaining views as P-frames using the spatially adjacent views as references. Moreover, this work was expanded on in [37] whereby the 2D prediction structure proposed exploits the two-dimensional view alignment in the light field. The proposed structure, titled Central 2D, first encodes the center view as an I-frame, then the other views on the horizontal and vertical centerlines as P-frames, and finally the remaining views as B-frames using the two preceding spatially adjacent views as references. Performance evaluations showed that this prediction structure outperformed the other existing MVC-based prediction structures. A compression method that is based on MVC and is designed specifically for light field, known as LF-MVC, is presented in [38]. The prediction structure proposed aims at improving upon [37] by using sub-structures in the prediction of the MVC configuration as illustrated in Fig. 2.5. This approach uses a two-directional parallel prediction run starting from the top-left view and diverging towards the right and bottom directions. This results in a prediction structure consisting of smaller square sub-structures. For example, in the case of 5x5 views, the structure would be composed of four 3x3 sub-structures. The vertices of each square sub-structure follow the same 18  two-directional parallel prediction flow, and the prediction within each sub-structure forms a closed-loop. One benefit of this design is that the prediction remains localized to a small sub-area of views and the scheme does not utilize any diagonal references. One further benefit of this prediction structure is that it can also be applied onto light field videos of arbitrary sizes. Nevertheless, this structure can still be further improved upon as it does not take into account all the intricacies of light field content. Furthermore, the inter-frame prediction structure found in [38] follows the conventional hierarchical B-frame prediction structure which is not the most efficient method since not all views are referenced equally. For example, views that start with P-frames are better suited as references than those that start with B-frames.  Fig. 2.5 Inter-prediction structure proposed in [38]. 19  Chapter 3: Proposed Prediction Structure  3.1 Introduction In this thesis, we propose an efficient three-dimensional prediction structure that incorporates both inter-view and interframe predictions using the multi-view extension of HEVC (MV-HEVC) [28] for coding light field video content. 
Our method is designed as a novel two-dimensional inter-prediction structure that is based on the fact that the correlation between views around the center is higher than the views around the edges. This is complemented by a temporal prediction model that utilizes the correlation between the different views in a sequence within the temporal domain to more efficiently encode light field video content.   3.2 Our Proposed Method 3.2.1 Overview In this section, we describe our novel three-dimensional prediction structure for light field video content in HEVC. The prediction’s three dimensions are the horizontal inter-view, vertical inter-  Fig. 3.1 Inter-prediction structure of our proposed scheme. 20  view and temporal inter-frame. The following subsections describe these structures in detail. In our previous work [39], we proposed an inter-prediction structure that is designed to take advantage of the high correlation that is found between views around the center compared to the outer edges. However, while that work stops at the frame itself, in this work we expand the design to further incorporate the temporal dimension and provide a more detailed explanation of the method.     Fig. 3.2 Heat maps of SSIM between (a) central view and the inner ring, (b) top-left corner view and the outer ring. 21  3.2.2 Inter-prediction structure The inter-prediction structure can be seen in Fig.3.1. This structure was originally presented in our previous work [39], where we proposed an inter-prediction structure that is designed to take advantage of the high correlation that is found between views around the center compared to the outer edges. However, while that work stops at the frame itself, in this work we expand the design to further incorporate the temporal dimension and provide a more detailed explanation of the method. The structure is designed around maximizing the utilization of the higher similarity found between the central views compared to those around the edges.  To illustrate this correlation, heat maps that correspond to the similarity between the views are shown in Fig. 3.2. The values in each block in the array correspond to the average Structural Similarity Index metric (SSIM) of each frame between the initial view and the subsequent set. These values are the average SSIM values of 300 frames of video. The heat map of the central   Fig. 3.3 View numbering in our scheme 22  view (a) indicates a higher similarity to the surrounding inner ring of views compared to the top-left corner view’s (b) similarity to the outer ring of views. In Fig. 3.3 we present a view numbering that will be used from here on out.  Due to the higher similarity of the central views, the proposed method utilizes the central view (view 13 in Fig. 3.3) as the I-frame of the structure, from which subsequent views will refer to, as opposed to the top-left corner view that is found in [38]. This also has the added benefit of providing a direct reference to more views based on its location. As it is surrounded by 4 views as opposed to the corner view that only has 3 direct neighboring views (excluding diagonal neighbors due to the increased bit rate of that prediction). Beyond this, the prediction structure then uses this I-frame as the sole reference to predict four P-frame sequences on the outer edge (views 3, 11, 15 and 23). In addition to taking advantage of the high correlation around the inner edge views, the proposed structure also aims at efficiently positioning the P-frames for optimized compression. 
By maximizing the usage of P-frames, the number of frames at the lower levels of the encoding hierarchy is reduced, since only one predecessor needs to be encoded first. This is achieved by placing the P-frames in specific positions that maximize the number of views that can benefit from them as direct references. Furthermore, the P-frames should be placed so that they are not adjacent to each other and are surrounded by as many B-frames as possible. Therefore, the P-frames are placed on the outer edge of the view array, on the horizontal and vertical axes through the center.

Placing the P-frames on the corners was originally attempted; however, it entailed using diagonal references, which would lead to a higher bit rate and limit the number of direct neighbors that could refer to them. In our structure, each P-frame can be used as a direct reference by up to three other views due to its placement in the design. After the P-frames have been encoded, the four remaining views found between the P- and I-frames on the central axes (8, 12, 14 and 18) are encoded as B-frames of the highest hierarchical level (henceforth B1), using both the surrounding I- and P-frames as references. This results in a cross shape across the two-dimensional array of encoded views. Then the corner views (1, 5, 21 and 25) are encoded as B-frames using the nearest horizontal and vertical P-frames as references; since they use two P-frames as references, they are also B1 in terms of hierarchy. At this point, four square sub-structures are created with the central I-frame common to all of them (in a 5x5 array the square sub-structures are 3x3 in size). Next, the remaining views on the outer ring (2, 4, 6, 10, 16, 20, 22 and 24) are encoded as B-frames using the two surrounding views, i.e. the closest P- and B1-frames, as references. As a result, they are B2 frames in the hierarchy, since two levels of preceding views need to be encoded before they can be accessed. Finally, the remaining views on the inner ring (7, 9, 17 and 19) are encoded using the views above and below them, i.e. the closest previously encoded views, as references, and are thus encoded as B3 frames. One further advantage of this structure is that references are taken from nearby views, ensuring that the prediction is contained within the surrounding region. In addition, no diagonal references are used anywhere in the structure.

Fig. 3.4 Encoding process of the proposed scheme for a 5x5 light field video in a single GOP of size 8.

3.2.3 Temporal prediction structure
The next contribution of the proposed method is the temporal prediction structure. The 3D structure, as seen in Fig. 3.4, has a GOP of 8 and four hierarchical levels: I- and P-frames at the top, followed by B1, B2 and finally B3 frames at the lowest level. There is a direct correlation between the hierarchy level and the prediction dependencies needed to access a view: I- and P-frames do not require any previous frames, whereas B2 frames require the preceding B1 frames (and, through them, the I- and P-frames) to be encoded first.
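To summarize the inter-view layout of Section 3.2.2 together with the hierarchy levels just listed, the sketch below records, for each of the 25 views, its frame type, hierarchy level and inter-view references as transcribed from the description above. It is illustrative only: the row-major numbering of views (1 to 25, assumed to match Fig. 3.3) and the exact references of the B3 views ("the views above and below") are assumptions on our part, and the dictionary is not an actual MV-HEVC configuration.

```python
# Sketch of the proposed inter-view structure for a 5x5 light field.
# Each entry: view number -> (frame type, hierarchy level, inter-view references).
inter_view = {13: ("I", 0, [])}

# P-frame views in the middle of each edge, predicted from the central I-view.
for v in (3, 11, 15, 23):
    inter_view[v] = ("P", 0, [13])

# B1: views between the centre and the edge P-views (the "cross"), then the
# corner views, each predicted from its nearest horizontal/vertical P-views.
for v, refs in {8: [13, 3], 12: [13, 11], 14: [13, 15], 18: [13, 23],
                1: [3, 11], 5: [3, 15], 21: [11, 23], 25: [15, 23]}.items():
    inter_view[v] = ("B", 1, refs)

# B2: remaining outer-ring views, predicted from the closest P- and B1-views.
for v, refs in {2: [3, 1], 4: [3, 5], 6: [11, 1], 10: [15, 5],
                16: [11, 21], 20: [15, 25], 22: [23, 21], 24: [23, 25]}.items():
    inter_view[v] = ("B", 2, refs)

# B3: remaining inner-ring views, predicted from the views directly above and
# below them (assumed row-major numbering); no diagonal references anywhere.
for v in (7, 9, 17, 19):
    inter_view[v] = ("B", 3, [v - 5, v + 5])

# Views can be encoded level by level: I/P first, then B1, B2 and finally B3.
order = sorted(inter_view, key=lambda v: inter_view[v][1])
print(order)
```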
For simplicity of illustration, frames T3, T5, T6 and T7 are excluded from the figure, since their prediction follows a similar pattern to those already shown; only their direct references differ. Frames T3, T5 and T7 are similar to frame T1: T3 refers to T2 and T4, T5 refers to T4 and T6, and T7 refers to T6 and T8. Frame T6 is similar to frame T2 and refers to T0, T4 and T8.

As opposed to the structure in [38], which follows a generic hierarchical B-frame structure, the proposed structure employs an arrangement that better utilizes the similarity between views in the temporal domain. This is achieved by maximizing the utilization of the view sequences that contain an I-, P- or higher-hierarchy B-frame. In [38], all sequences are used evenly as references for subsequent frames in the surrounding sequences, based solely on their relative position. However, I- and P-frames are better suited as references for subsequent frames than B-frames. Because of this, the proposed structure uses the sequences that contain I- and P-frames more heavily than the others.

Furthermore, the frames of the remaining views are encoded only as B-frames, creating a hierarchy among them. At the top of this hierarchy are the B-frames that are bi-directionally predicted from the surrounding I- and P-frames; further down are B-frames that are predicted from other B-frames. Because of this, certain design considerations are made to account for multiple levels of reliance on other B-frames.

The view sequences that are encoded last in this order do not use inter-frame prediction for certain frames within the GOP. This applies specifically to the B3-frame sequences, where inter-frame prediction is foregone in favour of the more efficient inter-view prediction alone: a better prediction can be achieved by using only the surrounding views as references, without spending bit rate on redundant references.

The type of the initial frame in each view sequence depends on its location relative to the other views; in other words, it is determined by the aforementioned inter-prediction structure. The subsequent frames in each sequence are B-frames until the next GOP, where the initial frame type of the sequence is repeated. What differs between these B-frames is their hierarchy and, consequently, the number and types of references they use. In sequences that start with an I- or P-frame, the subsequent B-frames use only two temporal references (one forward and one backward); the subsequent frames in the P-sequences also refer to their temporal counterparts in the I-sequence. In sequences that start with a B-frame, the subsequent B-frames use up to four references, a mixture of temporal and inter-view references, and these sequences prioritize the I- or P-sequences as references when available. In these cases, the temporal prediction that uses inter-frame references is combined with the inter-view prediction of the previous subsection. The inter-frame direct referencing of the proposed structure is as follows: frames at T0 and T8 use only intra-frame (same time instance, inter-view) references; frames at T4 reference views in T0 and T8; frames at T2 reference T0, T4 and T8; and frames at T1 reference T0 and T2.
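The temporal referencing just described can be condensed into a small table. The sketch below is illustrative (the variable names are ours and inter-view references are omitted); it is not an excerpt of the encoder configuration.

```python
# Temporal (inter-frame) references within one GOP of 8, transcribed from the
# description above. Keys are temporal positions T0..T8 and values are the
# temporal references of that position; inter-view references come on top of
# these and are omitted here.
temporal_refs = {
    0: [],         # T0: I/P frames of the GOP, no temporal references
    8: [],         # T8: first frame of the next GOP
    4: [0, 8],
    2: [0, 4, 8],
    6: [0, 4, 8],  # T6 follows the same pattern as T2
    1: [0, 2],
    3: [2, 4],
    5: [4, 6],
    7: [6, 8],
}

# A valid encoding order processes every frame after all of its references.
encode_order = [0, 8, 4, 2, 6, 1, 3, 5, 7]
assert all(r in encode_order[:encode_order.index(t)]
           for t in encode_order for r in temporal_refs[t])
print(encode_order)
```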
The encoding process order of views in the frames is the same across the whole sequence, the difference being the addition of references from their respective direct-references. The detailed step-by-step explanation of the proposed structure is as follows: First in Fig. 3.4a, the central views of frames T0 and T8 predict the central view of T4 as well as the four central edge views 3, 11, 15, and 23 (see Fig. 3.3) as P-frames on their respective frame. The same process is repeated for frame T4 (Fig. 3.4b), T2 (Fig. 3.4c) and T1 (Fig. 3.4d), but these edge views are encoded as B-frames in this case. Next on the inner ring, the views 8, 12, 14 and 18 (between the central view and views 3, 11, 15 and 23) are encoded using the central view and their closest neighbor of the previously encoded four edge views. The same process is repeated for frame T4 (Fig. 3.4c), T2 (Fig. 3.4d) and T1 (Fig. 3.4e). After that, the remaining views around the edges of the frame (2, 4, 6, 10, 16, 20, 22 and 24) are encoded using their two neighboring views, as seen in Fig. 3.4d. The same process is repeated for frame T4 (Fig. 3.4e), T2 (Fig. 3.4f) and T1 (Fig. 3.4h).  Finally, in Fig. 6e the four remaining views (views 7, 9, 17, 19) are encoded using the above and below views of the same frame as their sole references. The same process of using only two intra-frame references is followed by T2 (Fig. 3.4g) and T1 (Fig. 3.4i). The prediction of these views on this frame foregoes the temporal prediction to solely utilize the inter-view prediction On the other hand, T4 combines the two intra-frame references with respective inter-frame references to T0 and T8. 28  Chapter 4: Results & Discussion 4.1 Introduction In this section, we present how we evaluated our proposed structure, our results and a discussion behind these results. First, we present our experimental setup, which includes a description of the content that was used for the tests and the preparation performed before the tests were conducted. Then we present our objective results by comparing our method against a state-of-the-art prediction structure. Next, we discuss the details of our subjective test which include the features of the test, the participants involved and how the results were collected.  4.2 Experimental Setup A R8 Raytrix plenoptic camera captured the video test set used in our experiment which comes from a public dataset that consisted of two light field video streams: “ChessPieces” and “Boxer-IrishMan-Gladiador” [40]. These sequences are selected because they provide a relatively high resolution and a large number of frames. At the time of writing, the number of public light field datasets were limited that satisfied both of these requirements. Alternatives were either limited in resolution or the videos were too short in length. Both video sequences contain objects rotating on a turntable with a calendar in the background, with the camera positioned at a distance of 35cm from the scene. The difference between them is the object on the turn-table and the direction of rotation. The “ChessPieces” contains four wooden chess pieces rotating counter-clockwise. The other video sequence, “Boxer-IrishMan-Gladiador”, contains three plastic Lego characters rotating clockwise. The Raytrix software RxLive stores the data received from the R8 camera as a Raytrix sequence file. The views are extracted from each 29  light field frame using the Raytrix API and are used to create separate video sequences. 
Each sequence contains 25 views based on the 5x5 micro lens array found in the cameras. For each view 300 frames were captured at a resolution of 1920x1080. The data is available as single frames, so we created videos by placing the frames of the same view back-to-back in time, the result is 25 separate video sequences. As usual in video coding, each view sequence was transformed into the video format of YCbCr, in order to be input into the video encoder. The full process of pre-processing is illustrated in Fig. 4.1. The compression performance of our scheme is evaluated against that of the state-of-the-art scheme presented in [38]. We selected [38] as the base of our comparison, because it could outperform the other previously proposed prediction structures in the literature, and proved to be the most efficient solution at the time of writing. Based on the authors’ results in [38], their prediction structure showed a BD-rate gain of up to 32.48% over the structure in [37] which was already an improvement over existing MVC light field video solutions.    Fig. 4.1 Content preparation before input to encoder 30  Our scheme was implemented in the 16.2 version of the MV-HEVC software [14]. We set the intra-frame period to 32 frames, based on the camera frame rate capturing the sequences. The predictive structure in [38] was also implemented using MV-HEVC to ensure a fair comparison. In addition to this, the temporal prediction structure was adapted to a GOP of 8 with a hierarchical level of 4 for both structures. This is achieved by expanding the structure the authors presented but maintaining the same design principles laid out in their goals. Otherwise, the larger GOP size in of itself would improve the compression efficiency regardless of the structure it was implemented in.  4.3 Objective Results In order to objectively evaluate the compression efficiency of our proposed structure, we use the conventional multi-view compression content evaluation tools: the peak signal-to-noise ratio (PSNR) and the Bjøntegaard Delta (BD) rate [41]. The PSNR and the BD-rate are calculated with the QP range set to 25-30-35-40 for both our structure and the one proposed in [38]. Fig. 4.2 illustrates the rate-distortion curves for both sequences using both structures.  From the figure, we can see that for all 4 QP levels our proposed method offers the same or better visual quality (based on PSNR) at a significant reduction in bitrate. The difference in performance gain between our proposed method and that found in [38] grows as the visual quality of the video sequence increases. From the figures, the PSNR values range between 36 to 42 dB as the typical values for the PSNR in video compression are between 30 and 50 dB. This is because below 30 31  dB the video would be un-viewable due to the artifacts produced from compression, and beyond 50 dB the human perception cannot notice a difference in visual quality. In the “ChessPieces” sequence, for instance, for PSNR 38 our proposed structure yields a bit rate of 34.9 kbps while the structure in [38] yields a bit rate of 63.9 kbps, a reduction of 54.6%. Note that the gains grow larger for higher bit rates (lower QPs). In the “ChessPieces” sequence for PSNR 41.5, our proposed structure yields a bit rate of 237 kbps while the structure in [38] produced a bit rate of 414 kbps. In this case, there is a reduction of 57.25% in bit rate. 
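For reference, the sketch below outlines how the Bjøntegaard delta rate [41] between two rate-distortion curves can be computed from four (bitrate, PSNR) points per method, using the usual cubic fit of log-rate against PSNR. The function and the sample points are placeholders for illustration, not the measured results reported in this chapter.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test curve vs. the anchor over the
    overlapping PSNR range, following the Bjøntegaard delta-rate method [41]."""
    lr_a, lr_t = np.log10(rate_anchor), np.log10(rate_test)
    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the common PSNR interval and average the gap.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_diff - 1) * 100   # negative means the test curve saves bitrate

# Placeholder rate-distortion points (kbps, dB) for four QPs, NOT measured data.
anchor = ([414.0, 180.0, 90.0, 45.0], [41.5, 40.0, 38.5, 36.5])
test   = ([237.0, 110.0, 55.0, 30.0], [41.5, 40.0, 38.5, 36.5])
print(round(bd_rate(anchor[0], anchor[1], test[0], test[1]), 2))
```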
As it can be observed, our proposed method outperforms the other method for both experimental sequences and the gain in performance over the proposed method increases at higher bit-rates.  It should also be noted, that the gains of our method have increased from those reported in [39]. This is down to a number of factors. Firstly, the resolution and number of frames is larger. In that work we used a downscaled resolution on only a portion of the total frames from the dataset. In addition to that, the introduction of the temporal prediction structure has proved to improve the results across the board. Our proposed inter-frame structure outperforms the generic hierarchal B-structure used by both methods in [39].    32    (a)  (b) Fig. 4.2 PSNR distortion curves of our method and the structure in [38] for (a) “ChessPieces” and  (b) “Boxer-IrishMan-Gladiador” 33  In addition to PSNR gains, the BD-rate is used to quantify the difference between the two curves and the results can be seen in Table 4.1. As it can be observed, our proposed method outperforms the other method by 48.32% on average. Our results confirm that the central view approach we use proves to have better correlation between the views resulting in an improved compression efficiency compared to the approach that starts with the border views. Video Sequence BD-rate ChessPieces -45.75% Boxer-Irishman-Gladiador -50.89% Average -48.32%  Table 4.1 BD-rate for PSNR variations of the proposed structure against the structure from [38] used as anchor However, one unique feature of light field content is that it can be used to reconstruct multiple views of the same scene using the numerous light rays emitted in the capture scene. Therefore, the subjective quality evaluation should also be measured for the multiple views, which is done in the following section. 4.4 Subjective Results A subjective test was also conducted to evaluate how a human would perceive the effects of the different compression prediction structures on the decoded light field content. Furthermore, certain aspects of an degradation can be perceived differently from an objective metric to a human visually.  In this case, the compressed video content was compared to the original uncompressed 34  video for extracted views. We presented both the original and decoded videos at the same time in a side-by-side manner to the viewers based on the double stimulus impairment scale (DSIS) method of Recommendation BT.500-13 DSIS [42]. The goal of this test is to determine whether or not our compression using prediction structure can achieve a closer image quality match to the original content than a state-of-the-art method.   Each test sequence consists of an extracted view as two videos one on top of the other: the base uncompressed version as the top video and the compressed version as the bottom video. During the test, the subjects were aware of the original video’s position. We did not change the position of the original and test videos throughout the test. Each video had an original resolution of 1920 x 1080, while after stitching the two videos together ended up having a 1920 x 2160 resolution. 35  However, a grey border of 16 pixels is taken out of the video to create a neutral color separation between the two videos, the setup can be seen in Fig. 4.3. Because of this, both videos were cropped to a size of 1856x1032. Each view is shown at the four levels of compression (25-30-35-40), once for the proposed method and once for the method in [38]. 
The 300 frames of each video were shown at 30 fps, matching the frame rate of the camera that captured the sequences, and each video was followed by 3 s of grey frames, for a total runtime of 13 seconds per presentation. For each view, the number of comparisons is 16 (four QPs x two sequences x two methods). Therefore, to keep the test within the recommended duration of about 20 minutes for subjective tests, 6 of the 25 views were selected from each video sequence (the same views as used in the objective tests).

Fig. 4.3 Video arrangement for the subjective test.

The six views were selected so that the test contains an equal number of P-frame and B-frame views from each method, ensuring a fair comparison. This yields a test composed of a total of 96 sequences (6 views x 16 comparisons), with a total runtime of roughly 21 minutes. The view sequences selected for the test were 5, 14, 15, 21, 23 and 25 (based on the numbering in Fig. 3.3). The order of the videos was randomized in each session, and we ensured that the same sequence coded with a different QP was never shown consecutively. For the test we used a Samsung UN65KS9000 65-inch display at a 3840x2160 resolution to accommodate the two videos on top of each other. The subjects were asked to rate, on a discrete scale of 1 to 5 (1: worst, 5: best), how close the bottom (compressed) view appeared to the top (uncompressed) view. Nineteen adult subjects, 7 males and 12 females, aged 23 to 34 years, participated in our test. All participants had normal or better vision, were naïve in the field, and were not made aware of the test objectives.

Prior to the test session, a training session was held to familiarize the subjects with the assessment procedure, using a set of training videos outside the actual test dataset that exhibited the various compression levels and their effect on video quality. To keep the subjects' choices consistent, the order of the 96 sequences was randomized. Furthermore, three different playlist orders were created so that the first sequences shown to a subject, which may be graded less evenly than later ones, were different for every six participants in the test. This was done by splitting the 96 sequences into three sections of 32 each and reordering the sections to produce three different playlist orderings. The room in which the tests were run was compliant with the BT.500 recommendations for the subjective evaluation of visual data [42].

Once the subjective test results were collected, we performed outlier detection based on the standardized method in [42]. An outlier in this case is a participant whose scores deviate drastically from the mean of the data set. This was determined by calculating the mean of all participants' scores for each video sequence and then calculating the correlation coefficient of each participant's scores with this mean; participants whose correlation was below 0.75 were dropped as outliers. In our test, three subjects were regarded as outliers, as their results deviated from the overall mean score of all subjects, and their results were excluded from our analysis.
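A simplified sketch of this screening step is shown below; the array layout, the function name, and the single non-iterative pass are assumptions, and the full BT.500 procedure includes additional checks not reproduced here.

```python
import numpy as np

def screen_outliers(scores, threshold=0.75):
    """Drop subjects whose ratings correlate poorly with the panel mean.

    scores: array of shape (num_subjects, num_sequences) holding one DSIS
    rating per subject per test sequence. A subject is retained only if the
    Pearson correlation between their ratings and the per-sequence mean of
    all subjects is at least `threshold` (0.75 in our test)."""
    scores = np.asarray(scores, dtype=float)
    mean_per_sequence = scores.mean(axis=0)  # panel mean for each sequence
    keep = np.array([
        np.corrcoef(subject, mean_per_sequence)[0, 1] >= threshold
        for subject in scores
    ])
    # Return the retained ratings and the indices of the rejected subjects.
    return scores[keep], np.flatnonzero(~keep)
```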
The mean opinion score (MOS) was calculated by averaging the scores of all remaining subjects across the six views, together with a 95% confidence interval. The plot in Fig. 4.4 shows the MOS values at different bit rates; the vertical bars at each point indicate the 95% confidence interval at that QP level. Based on the figures, we can observe that our proposed structure also outperforms the structure proposed in [38] subjectively, reaffirming our objective results. Further analysis shows that the performance gains increase as the visual quality of the video improves, i.e., at lower levels of quantization (lower QP values) our gains grow larger. To illustrate this point, averaged over the six selected views of the "ChessPieces" sequence, at an MOS of 2 our proposed structure produces a bit rate of 4.4 kbps against 12 kbps for the structure in [38], a saving of 7.6 kbps, while at an MOS of 4 the corresponding bit rates are 49.4 kbps and 95 kbps, a saving of 45.6 kbps. Furthermore, in order to quantify the difference in performance between the two structures, the BD-rate difference for MOS between our proposed method and that of [38] was calculated for the two sequences; the results are listed in Table 4.2.

Fig. 4.4 MOS-rate comparison of our method and the structure in [38] for (a) "ChessPieces" and (b) "Boxer-IrishMan-Gladiador"

The visual quality achieved by both compression schemes can be seen in Fig. 4.5 (for the "ChessPieces" sequence) and Fig. 4.6 (for the "Boxer-Irishman-Gladiador" sequence). The figures show a single frame of view 14 after compression by both methods, at each of the four QP values used in the experiment. The areas that exhibit the most visible differences across QP levels are the texture of the carpet, the details of the figurines and the text on the background calendar. Based on the MOS results, the proposed method outperforms the scheme proposed in [38] by 64.24% on average. The scores illustrate that the proposed method offers a level of visual quality closer to the original content than the structure proposed in [38], at a significant reduction in bit rate (49.76% in this case for the six selected views). These findings align with the objective tests, although the difference between the two methods is larger here.

Table 4.2 BD-rate for MOS of the proposed structure against the structure from [38]

Video Sequence               BD-rate
ChessPieces                  -62.65%
Boxer-Irishman-Gladiador     -65.83%
Average                      -64.24%

The disparity in results between the two sequences can be put down to the fact that the details of the chess pieces (e.g., wood grain, notches) were more visible than those found on the Lego pieces. This can be attributed to the eye's sensitivity to different brightness levels and the corresponding spatial frequencies; in other words, the two sequences involve different lighting conditions and different textures on the plastic and wooden models. Because of this, the participants were more aware of the compression artifacts on the wooden chess pieces and noticed them more readily as the quantization parameter increased.
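For reference, the MOS values and 95% confidence intervals plotted in Fig. 4.4 can be computed along the lines of the following sketch, which assumes the usual normal approximation (1.96 times the standard error); the function name is illustrative.

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and 95% confidence interval for one sequence/QP point.

    ratings: 1-D array of scores (1 to 5) from the retained subjects."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.size
    mos = ratings.mean()
    half_width = z * ratings.std(ddof=1) / np.sqrt(n)  # normal approximation
    return mos, half_width
```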
Fig. 4.5 Visual results of view 14, frame 211 from the "ChessPieces" sequence: (a) proposed QP25, (b) proposed QP30, (c) proposed QP35, (d) proposed QP40, (e) structure from [38] QP25, (f) structure from [38] QP30, (g) structure from [38] QP35, (h) structure from [38] QP40.

Fig. 4.6 Visual results of view 14, frame 211 from the "Boxer-Irishman-Gladiador" sequence: (a) proposed QP25, (b) proposed QP30, (c) proposed QP35, (d) proposed QP40, (e) structure from [38] QP25, (f) structure from [38] QP30, (g) structure from [38] QP35, (h) structure from [38] QP40.

Chapter 5: Conclusion and Future Work

5.1 Conclusion

In this work, we have introduced a new and efficient three-dimensional prediction structure in MV-HEVC designed to encode light field video sequences captured with a plenoptic camera. This work's contribution combines an improved inter-view prediction with our novel inter-frame prediction structure. The inter-view prediction structure proposed here is built on the idea of exploiting the high level of similarity found between the extracted views around the center of a light field video while using a two-dimensional prediction structure. The inter-frame prediction structure takes this a step further by relying more heavily on the views that are higher in the temporal hierarchy as references.

To assess the performance of the proposed structure, a combination of objective and subjective tests was performed against an existing state-of-the-art light field video coding method. The tests were applied to two publicly available light field video sequences. The results of our objective experiments show that our method exceeds the prediction structure proposed in [38] by up to 50.89% in BD-rate for PSNR. Furthermore, the subjective results showed that the proposed structure produces a more visually appealing image at a significantly lower bit rate. This is based on the MOS given by the participants when comparing the encoded videos of both structures against the original uncompressed sequences, where the proposed method gains up to 65.83% in MOS BD-rate. These findings illustrate the impact of exploiting the high similarity between the extracted central views and the importance of a more efficient prediction structure in the temporal domain.

5.2 Future Work

Our prediction structure attempts to maximize the utilization of P-frames by placing them on the edge of the view array. However, such a design could be improved by not placing these P-frames on the edge, ensuring at least one additional neighboring view. Such a structure would require a completely overhauled inter-view prediction design, but could yield improved compression efficiency and a lower random-access cost.

One drawback of the complex hierarchy of encoded frames in our proposed structure is the cost of random access to a specific view: in some cases, six other views need to be decoded first as references before the desired view can be accessed. Our plan is to further investigate reducing the random-access complexity of the later-encoded views by modifying the prediction structure while also taking the temporal domain into account. Essentially, by reducing the number of hierarchical levels, we can reduce the complexity of accessing a given view. Of course, since random access and compression efficiency can be opposing objectives, we will have to find a solution that yields the best possible trade-off between the two.

Bibliography
[1] E. H. Adelson and J. R. Bergen, "The plenoptic function and the elements of early vision," in Computational Models of Visual Processing, MIT Press, 1991, pp. 3–20.
[2] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, "Light field photography with a hand-held plenoptic camera," Dept. Comp. Sci., Stanford Univ., Stanford, CA, USA, Tech. Rep. CSTR 2005-02, 2005.
[3] M. Levoy and P. Hanrahan, "Light field rendering," in Proc. SIGGRAPH 96, ACM Press, New York, 1996, pp. 31–42.
[4] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, "Saliency detection on light field," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2014, pp. 2806–2813.
[5] P. A. Kara, Z. Nagy, M. G. Martini, and A. Barsi, "Cinema as large as life: Large-scale light field cinema system," in Proc. Int. Conf. 3D Immersion (IC3D), 2017, pp. 1–8.
[6] S. Lee, C. Jang, S. Moon, J. Cho, and B. Lee, "Additive light field displays: Realization of augmented reality with holographic optical elements," ACM Trans. Graph., vol. 35, no. 4, 2016, Art. no. 60.
[7] M. Levoy, R. Ng, A. Adams, M. Footer, and M. Horowitz, "Light field microscopy," ACM Trans. Graphics, vol. 25, no. 3, pp. 924–934, 2006.
[8] A. Aggoun, "Compression of 3D integral images using 3D wavelet transform," J. Display Technol., vol. 7, no. 11, pp. 586–592, 2011.
[9] F. Dai, J. Zhang, Y. Ma, and Y. Zhang, "Lenselet image compression scheme based on subaperture images streaming," in Proc. IEEE Int. Conf. Image Process., 2015, pp. 4733–4737.
[10] X. Jiang, M. L. Pendu, R. Farrugia, and C. Guillemot, "Light field compression with homography-based low rank approximation," IEEE J. Sel. Topics Signal Process., vol. 11, no. 7, pp. 1132–1145, Oct. 2017.
[11] M. T. Pourazad, C. Doutre, M. Azimi, and P. Nasiopoulos, "HEVC: The new gold standard for video compression: How does HEVC compare with H.264/AVC?" IEEE Consum. Electron. Mag., vol. 1, no. 3, pp. 36–46, Jul. 2012.
[12] G. J. Sullivan, J.-R. Ohm, F. Bossen, T. Wiegand, and J. Xu, "JCT-VC AHG report: HM subjective quality investigation (AHG22)," JCT-VC of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Doc. JCTVC-H0022r1, San José, CA, Feb. 2012.
[13] I.-K. Kim, J. Min, T. Lee, W.-J. Han, and J. Park, "Block partitioning structure in the HEVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1697–1706, Dec. 2012.
[14] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[15] J. Lainema, F. Bossen, W.-J. Han, J. Min, and K. Ugur, "Intra coding of the HEVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1792–1801, Dec. 2012.
[16] Y. Chen, Y.-K. Wang, K. Ugur, M. M. Hannuksela, J. Lainema, and M. Gabbouj, "The emerging MVC standard for 3D video services," EURASIP J. Adv. Signal Process., vol. 2009, no. 1, p. 786015, Jan. 2009.
[17] A. S. Akbari, N. Canagarajah, D. Redmill, and D. Agrafiotis, "A novel H.264/AVC based multi-view video coding scheme," in Proc. 3DTV Conf., May 2007, pp. 1–4.
[18] A. Vetro, T. Wiegand, and G. J. Sullivan, "Overview of the stereo and multiview video coding extensions of the H.264/AVC standard," Proc. IEEE, vol. 99, no. 4, pp. 626–642, Apr. 2011.
[19] Y. Chen, Y.-K. Wang, K. Ugur, M. Hannuksela, J. Lainema, and M. Gabbouj, "The emerging MVC standard for 3D video services," EURASIP J. Adv. Signal Process., vol. 2009, no. 1, Jan. 2009, Article 8.
[20] A. Smolic, K. Mueller, N. Stefanoski, J. Ostermann, A. Gotchev, G. B. Akar, G. Triantafyllidis, and A. Koz, "Coding algorithms for 3DTV: A survey," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1606–1621, Nov. 2007.
[21] P. Merkle, K. Müller, A. Smolic, and T. Wiegand, "Statistical evaluation of spatiotemporal prediction for multiview video coding," in Proc. ICOB 2005, Berlin, Germany, Oct. 27–28, 2005.
[22] A. Kaup and U. Fecker, "Analysis of multireference block matching for multiview video coding," in Proc. 7th Workshop Digital Broadcasting, Erlangen, Germany, Sep. 2006, pp. 33–39.
[23] M. Magnor and B. Girod, "Data compression for light-field rendering," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 3, pp. 338–343, Apr. 2000.
[24] M. Magnor, P. Ramanathan, and B. Girod, "Multi-view coding for image-based rendering using 3-D scene geometry," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 11, pp. 1092–1106, Nov. 2003.
[25] B. Bross, W.-J. Han, G. J. Sullivan, J.-R. Ohm, and T. Wiegand, "High Efficiency Video Coding (HEVC) text specification draft 9," document JCTVC-K1003, ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC), Oct. 2012.
[26] T. Wiegand, J.-R. Ohm, G. J. Sullivan, W.-J. Han, R. Joshi, T. K. Tan, and K. Ugur, "Special section on the joint call for proposals on High Efficiency Video Coding (HEVC) standardization," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 12, pp. 1661–1666, Dec. 2010.
[27] MPEG Video and Requirements Group, "Vision on 3D video coding," document N10357, Lausanne, Switzerland, Feb. 2009.
[28] G. Tech, K. Müller, J.-R. Ohm, and A. Vetro, "Overview of the multiview and 3D extensions of High Efficiency Video Coding," IEEE Trans. Circuits Syst. Video Technol., vol. 26, no. 1.
[29] G. J. Sullivan, J. M. Boyce, Y. Chen, J.-R. Ohm, C. A. Segall, and A. Vetro, "Standardized extensions of High Efficiency Video Coding (HEVC)," IEEE J. Sel. Topics Signal Process., vol. 7, no. 6, pp. 1001–1016, Dec. 2013.
[30] ISO/IEC JTC1/SC29/WG11, "Report of the subjective quality evaluation for MVC Call for Evidence," Doc. N6999, Hong Kong, China, Jan. 2005.
[31] J. Dick, H. Almeida, L. D. Soares, and P. Nunes, "3D holoscopic video coding using MVC," in Proc. 2011 IEEE EUROCON - International Conference on Computer as a Tool, Apr. 2011, pp. 1–4.
[32] U. Fecker and A. Kaup, "H.264/AVC-compatible coding of dynamic light fields using transposed picture ordering," in Proc. 13th European Signal Processing Conference (EUSIPCO), Antalya, Turkey, Sep. 2005.
[33] T.-Y. Chung, I.-L. Jung, K. Song, and C.-S. Kim, "Multi-view video coding with view interpolation prediction for 2D camera arrays," J. Vis. Commun. Image Represent., vol. 21, no. 5/6, pp. 474–486, Jul. 2010.
[34] S. Shi, P. Gioia, and G. Madec, "Efficient compression method for integral images using multi-view video coding," in Proc. 18th International Conference on Image Processing (ICIP 2011), Brussels, Belgium, Sep. 2011, pp. 141–144.
[35] P. Merkle, A. Smolic, K. Mueller, and T. Wiegand, "Efficient prediction structures for multiview video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1461–1473, Nov. 2007.
[36] A. Avci, J. De Cock, P. Lambert, R. Beernaert, J. De Smeta, L. Bogaert, Y. Meuret, H. Thienpont, and H. De Smeta, "Efficient disparity vector prediction schemes with modified P frame for 2D camera arrays," J. Vis. Commun. Image Represent., vol. 23, pp. 287–292, Feb. 2012.
[37] A. Dricot, J. Jung, M. Cagnazzo, B. Pesquet, and F. Dufaux, "Full parallax super multi-view video coding," in Proc. ICIP, Paris, France, Oct. 2014, pp. 135–139.
[38] G. Wang, W. Xiang, M. Pickering, and C. W. Chen, "Light field multiview video coding with two-directional parallel inter-view prediction," IEEE Trans. Image Process., vol. 25, no. 11, pp. 5104–5117, Nov. 2016.
[39] J. Khoury, M. T. Pourazad, and P. Nasiopoulos, "A new prediction structure for efficient MV-HEVC based light field video compression," in Proc. ICNC, Honolulu, HI, USA, 2019, pp. 588–591.
[40] L. Guillo, X. Jiang, G. Lafruit, and C. Guillemot, "Light field video dataset captured by a R8 Raytrix camera (with disparity maps)," MPEG and JPEG contributions, ISO/IEC JTC1/SC29/WG11 MPEG2018/m42468, ISO/IEC JTC1/SC29/WG1 JPEG2018/m79046, 2018.
[41] G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," document VCEG-M33, Austin, TX, USA, Apr. 2001.
[42] Int. Telecommun. Union, "Methodology for the subjective assessment of the quality of television pictures," ITU-R Recommendation BT.500-11, Tech. Rep., 2000.
