UBC Theses and Dissertations

Correcting capturing and display distortions in 3D video Doutre, Colin Ray 2012

Correcting Capturing and Display Distortions in 3D Video

by

Colin Ray Doutre
B.Sc., Queen's University, 2005
M.A.Sc., The University of British Columbia, 2007

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
THE FACULTY OF GRADUATE STUDIES
(Electrical & Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA
(Vancouver)

March 2012
© Colin Ray Doutre, 2012

Abstract

3D video systems provide a sense of depth by showing slightly different images to the viewer's left and right eyes. 3D video is usually generated by capturing a scene with two or more cameras, and 3D displays need to be able to concurrently display at least two different images. The use of multiple cameras and multiple display channels creates problems that are not present in 2D video systems. At the capturing side, there can be inconsistencies in the videos captured with the different cameras; for example, the videos may differ in brightness, colour, sharpness, etc. At the display side, crosstalk is a major problem. Crosstalk is an effect where there is incomplete separation of the images intended for the two eyes, so the left eye sees a portion of the image intended for the right eye and vice versa. In this thesis, we develop methods for correcting these capturing and display distortions in 3D video systems through new digital video processing algorithms.
First, we propose a new method for correcting the colour of multiview video sets. Our method modifies the colour of all the input videos to match the average colour of the original set of views. Experiments show that applying our method greatly improves the efficiency of multiview video coding. We present a modification of our colour correction algorithm which also corrects vignetting (darkening of an image near its corners), which is useful when images are stitched together into a panorama.
Next, we present a method for making stereo images match in sharpness based on scaling the discrete cosine transform coefficients of the images. Experiments show that our method can greatly increase the accuracy of depth maps estimated from two images that differ in sharpness, which is useful in 3D systems that use view rendering.
Finally, we present a new algorithm for crosstalk compensation in 3D displays. Our algorithm selectively adds local patches of light to regions that suffer from visible crosstalk, while considering temporal consistency to prevent flickering. Results show our method greatly reduces the appearance of crosstalk, while preserving image contrast.

Preface

This thesis presents research conducted by Colin Doutre, under the guidance of Dr. Panos Nasiopoulos. A list of publications resulting from the work presented in this thesis is provided below. The work presented in Chapter 2 has been published in [P1]-[P4]. The content of Chapter 3 appears in one conference [P5] and one journal publication [P6]. Portions of Chapter 4 appear in [P7] and [P8], and a provisional patent application has been filed based on the material in the chapter [P9]. The work presented in all of these manuscripts was performed by Colin Doutre, including designing and implementing the proposed algorithms, performing all experiments, analyzing the results and writing the manuscripts. The work was conducted with the guidance and editorial input of Dr. Panos Nasiopoulos.
Over the course of his PhD, Colin Doutre has participated in a number of other projects that were collaborations with colleagues.
Publications resulting from these projects are listed at the end of this preface [P11]-[P16]. While these other projects are related to the topic of this thesis, this thesis exclusively includes work for which Mr. Doutre and Dr. Nasiopoulos are the only authors. The first and last chapters of this thesis were written by Colin Doutre, with editing assistance from Dr. Nasiopoulos.

List of Publications Based on Work Presented in This Thesis

[P1] C. Doutre and P. Nasiopoulos, "A Colour Correction Preprocessing Method For Multiview Video Coding," European Signal Processing Conference (EUSIPCO-2008), Lausanne, Switzerland, 5 pages, Aug. 2008.
[P2] C. Doutre and P. Nasiopoulos, "Color Correction of Multiview Video With Average Color as Reference," IEEE International Symposium on Circuits and Systems (ISCAS 2009), Taipei, Taiwan, pp. 860-863, May 2009.
[P3] C. Doutre and P. Nasiopoulos, "Color Correction Preprocessing For Multiview Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 9, pp. 1400-1406, Sept. 2009.
[P4] C. Doutre and P. Nasiopoulos, "Fast Vignetting Correction and Color Matching For Panoramic Image Stitching," IEEE International Conference on Image Processing (ICIP 2009), Cairo, Egypt, pp. 709-712, Nov. 2009.
[P5] C. Doutre and P. Nasiopoulos, "Correcting Sharpness Variations in Stereo Image Pairs," European Conference on Visual Media Production (CVMP 2009), London, United Kingdom, pp. 45-51, Nov. 2009.
[P6] C. Doutre and P. Nasiopoulos, "Sharpness Matching in Stereo Images," accepted in the Journal of Virtual Reality and Broadcasting, Mar. 2011.
[P7] C. Doutre and P. Nasiopoulos, "Optimized Contrast Reduction for Crosstalk Cancellation in 3D Displays," 3DTV Conference 2011, Antalya, Turkey, May 2011.
[P8] C. Doutre and P. Nasiopoulos, "Crosstalk Cancellation in 3D Video with Local Contrast Reduction," European Signal Processing Conference (EUSIPCO-2011), Barcelona, Spain, 5 pages, Aug. 2011.
[P9] C. Doutre and P. Nasiopoulos, "Crosstalk Cancellation in 3D Displays," US Provisional Patent Application No. 61/486268, filed May 14, 2011.
[P10] C. Doutre and P. Nasiopoulos, "Contrast Preserving Crosstalk Cancellation in 3D Video," to be submitted.

Other PhD Publications

[P11] D. Xu, C. Doutre, and P. Nasiopoulos, "Saturated-Pixel Enhancement for Color Images," IEEE International Symposium on Circuits and Systems (ISCAS 2010), pp. 3377-3380, May 2010.
[P12] C. Doutre, M. T. Pourazad, A. Tourapis, P. Nasiopoulos and R. K. Ward, "Correcting Unsynchronized Zoom In 3D Video," IEEE International Symposium on Circuits and Systems (ISCAS 2010), pp. 3244-3247, May 2010.
[P13] D. Xu, C. Doutre, and P. Nasiopoulos, "An Improved Bayesian-based Algorithm for Saturated Color-Pixel Correction," IEEE International Conference on Image Processing (ICIP 2010), pp. 1325-1328, Sept. 2010.
[P14] D. Xu, C. Doutre and P. Nasiopoulos, "Correction of Clipped Pixels in Color Images," IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 3, pp. 333-344, Mar. 2011.
[P15] Z. Mai, C. Doutre, P. Nasiopoulos and R. Ward, "Subjective Evaluation of Tone-Mapping Methods on 3D Images," IEEE International Conference on Digital Signal Processing (DSP 2011), July 2011.
[P16] Z. Mai, C. Doutre, P. Nasiopoulos and R. Ward, "Rendering 3D High Dynamic Range Images: Subjective Evaluation of Tone-Mapping Methods and Preferred 3D Image Attributes," accepted in IEEE Journal of Selected Topics in Signal Processing, Mar. 2012.
[P17] M. T. Pourazad, C. Doutre, M. Azimi and P. Nasiopoulos, "HEVC: The New Gold Standard for Video Compression," accepted in IEEE Consumer Electronics Magazine, Mar. 2012.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Acronyms
Acknowledgements
Dedication
1 Introduction and Overview
  1.1 3D Video Technology Overview
    1.1.1 Capturing and Representation
    1.1.2 3D and Multiview Video Compression
    1.1.3 3D Display Technologies
  1.2 Capturing Distortions
    1.2.1 Colour Inconsistencies in Multiview Imaging
    1.2.2 Sharpness Compensation Methods
  1.3 Crosstalk in 3D Displays
    1.3.1 Sources of Crosstalk in 3D Displays
    1.3.2 Crosstalk Cancellation
    1.3.3 Raising the Image Black Level
  1.4 Thesis Contributions
2 Multiview Colour Correction
  2.1 Colour Correction Preprocessing For Multiview Video Coding
    2.1.1 Choice of Colour Reference in Multiview Colour Correction
    2.1.2 Proposed Colour Correction Preprocessing Method
    2.1.3 Experimental Results
    2.1.4 Conclusions
  2.2 Fast Vignetting Correction and Colour Matching
    2.2.1 Proposed Vignetting and Colour Correction Method
    2.2.2 Results
    2.2.3 Conclusions
3 Sharpness Matching in Stereo Images
  3.1 Proposed Sharpness Matching Method
    3.1.1 Removing Non-overlapping Edge Regions
    3.1.2 Noise Variance Estimation
    3.1.3 Division into Frequency Bands
    3.1.4 DCT Coefficient Scaling
  3.2 Experimental Results
    3.2.1 Impact of the Number of Frequency Bands
    3.2.2 Disparity Map Improvement for Blurred Images
    3.2.3 Complexity
  3.3 Conclusion
  3.4 Derivation of Optimal Attenuation Factor
4 Crosstalk Cancellation in 3D Video with Local Contrast Reduction
  4.1 Proposed Method
    4.1.1 Algorithm for Still Images
    4.1.2 Extension to Video Sequences
  4.2 Results
  4.3 Discussion
  4.4 Conclusions
5 Conclusions and Future Work
  5.1 Significance and Potential Applications of the Research
  5.2 Summary of Contributions
  5.3 Directions for Future Work
Bibliography

List of Tables

Table 2.1: Relative H.264 coding efficiency obtained on the Flamenco2 video set with histogram matching colour correction and different views used as the colour reference
Table 2.2: Average mean squared difference between videos obtained using NCC and MRSAD as the block matching criteria
Table 2.3: Average mean squared difference between original data and colour corrected views using different colour references
Table 3.1: Percentage of errors in disparity maps with belief propagation stereo method, out-of-focus blurring
Table 3.2: Percentage of errors in disparity maps with window stereo method, out-of-focus blurring
Table 3.3: Percentage of errors in disparity maps with belief propagation stereo method, linear motion blur
Table 3.4: Percentage of errors in disparity maps with window stereo method, linear motion blur

List of Figures

Figure 1.1: Stages of a 3D video system, from capturing to display
Figure 1.2: Stereoscopic camera setups. Side-by-side 2D cameras (left), and a stereoscopic JVC camera with two lenses (right)
Figure 1.3: Common stereoscopic 3D formats. (a) Full resolution left and right views (b) Horizontally downsampled frame compatible format (c) Vertically downsampled frame compatible format
Figure 1.4: One frame of a video in 2D plus depth (2D+Z) format
Figure 1.5: Example prediction structure for MVC using prediction in both time and between views
Figure 1.6: Illustration of different wavelengths being used for the RGB colour channels in the left and right eye images
Figure 1.7: Illustration of the parallax barrier concept for a two view 3D display (http://en.wikipedia.org/wiki/File:Parallax_Barrier.jpg)
Figure 1.8: Example pixel layout and angled lens position in an 8 view lenticular lens 3D display. The numbers indicate which view each sub-pixel belongs to
Figure 1.9: Two examples of 3D images with 10% crosstalk between the left and right views
Figure 2.1: IPPP coding structure using inter-view prediction
Figure 2.2: Finding matching points between all views by choosing an anchor view and performing block matching between the anchor and all other views
Figure 2.3: Rate distortion performance obtained with different polynomials used for the colour correction function
Figure 2.4: Colour correction of the Flamenco2 multiview video set (a) original data (b) colour corrected with proposed method
Figure 2.5: Sample views from the Rena test video set (a) Frames from two views before correction (b) After proposed colour correction
Figure 2.6: Sample views from the Race test video set (a) Frames from two views before correction (b) After proposed colour correction
Figure 2.7: Rate-distortion performance obtained with the proposed colour correction pre-processing method
Figure 2.8: Two images showing severe colour mismatch aligned with no blending (left) and multi-band blending (right)
Figure 2.9: Green lake panorama (a) Original images, (b) Corrected with Goldman's method, (c) Proposed correction method
Figure 2.10: Skyline panorama. (a) Original images, (b) Correction with Goldman's method, (c) Proposed correction method
Figure 2.11: Sunset panorama. Original images (top) and after proposed correction (bottom)
Figure 2.12: Mountain panorama. Original images (top) and after proposed correction (bottom)
Figure 3.1: Removing non-overlapping edge regions from a stereo image. (a) Left and right original images with edge strips used in search shown in red, and matching regions found through SAD search (equation (4.3)) shown in blue. (b) Images cropped with non-overlapping regions removed
Figure 3.2: Division of DCT coefficients into M frequency bands in each direction, illustrated for M=8
Figure 3.3: Test images used in experiments (the left image of each stereo pair is shown). In reading order: Tsukuba, teddy, cones, art, laundry, moebius, reindeer, aloe, baby1, and rocks
Figure 3.4: Errors in disparity maps as a function of the number of frequency bands used
Figure 3.5: Demonstration of blurring filters used in our tests on the Tsukuba images (a) Original left image (b) Original right image, (c)-(h) Left image blurred with: (c) Out-of-focus blur, radius 1 (d) Out-of-focus blur, radius 2 (e) Out-of-focus blur, radius 3 (f) motion blur, length 2 (g) motion blur, length 3 (h) motion blur, length 4. The blurring filter is illustrated in the top left corner of each image
Figure 3.6: Example of cones image pair before and after correction. (a) Blurred left image (b) Original right image (c) Disparity map obtained from images a and b with belief propagation (d) Corrected left image (e) Corrected right image (f) Disparity map obtained from corrected images with belief propagation
Figure 4.1: The steps of our method for still images. (a) Original images (left and right) (b) Amount each pixel needs to be raised, calculated with equations (4.4) and (4.5), (c) Labelled regions that remain after thresholding, erosion and dilation (d) Luminance patches for the above regions, calculated with equation (4.6), (e) Patches from the other image, shifted by the estimated disparity for each region (f) The final smooth signal that will be added to each image (pixel-wise maximum of the previous two signals) (g) The output images with the added patches of luminance
Figure 4.2: Illustration of crosstalk reduction with local and global level raising. (a) Original left image with no crosstalk (b) Image with crosstalk reduction but no level raising (c) Global raising of minimum image level (d) Proposed method with local raising
Figure 4.3: Region intensities for a 3D video calculated: (a) On a frame-by-frame basis, and (b) With our proposed method including temporal consistency

List of Acronyms

AVC - Advanced Video Coding
BP - Belief Propagation (stereo matching method)
CRT - Cathode Ray Tube
DCT - Discrete Cosine Transform
DFT - Discrete Fourier Transform
FTV - Free-Viewpoint Television
HEVC - High-Efficiency Video Coding
HM - Histogram Matching (colour correction method)
IC - Illumination Compensation
JM - Joint Model (H.264/AVC reference software)
JMVM - Joint Multiview Video Model (MVC reference software)
JVT - Joint Video Team
LCD - Liquid Crystal Display
LoG - Laplacian of Gaussian (filter)
MB - Macroblock
MC - Motion Compensation
ME - Motion Estimation
MPEG - Moving Picture Experts Group
MRSAD - Mean-Removed Sum of Absolute Differences
MRSAV - Mean-Removed Sum of Absolute Values
MVC - Multiview Video Coding
NCC - Normalized Cross-Correlation
PSNR - Peak Signal-to-Noise Ratio
RGB - Red, Green, Blue (colour space)
S3D - Stereoscopic Three Dimensional
SAD - Sum of Absolute Differences
SIFT - Scale-Invariant Feature Transform
SSD - Sum of Squared Differences

Acknowledgements

I give my sincere thanks to my supervisor, Dr. Panos Nasiopoulos, for his guidance and support throughout my M.A.Sc. and PhD programs. I also thank my colleagues at the Digital Multimedia Lab, Zicong Mai, Lino Coria, Di Xu, Qiang Tang, Mahsa Pourazad, Hassan Mansour, Ashfiqua Connie, Sergio Infante, Matthias von dem Knesebeck, Victor Sanchez, Mohsen Amiri, Maryam Azimi and Sima Valizadeh.
It was a pleasure working with you all in such a vibrant and eclectic environment.
I would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for financially supporting me with the PGS and CGS scholarships.
Finally, I thank my parents, wife and daughter for their constant love, support and encouragement.

To my wife and daughter

1 Introduction and Overview

3D video can give a realistic and engaging visual experience by providing viewers with a strong sense of depth. While 3D displays have existed for over a century, and have experienced brief periods of some commercial success, they have never achieved the widespread use that 2D displays have. Many factors have contributed to the limited adoption of 3D technology, including higher production costs, challenging capturing, and low quality display systems.
The appearance of depth in 3D video is achieved by showing slightly different images to the viewer's left and right eyes, which the viewer's brain matches and fuses together to form a three-dimensional interpretation of the scene. The most straightforward way to capture 3D video is to use two side-by-side cameras (or one camera with two lenses) that capture one video for each eye. The use of multiple cameras can create new problems that are not present in 2D video. Inconsistencies between the cameras can result in videos that differ in brightness, colour, sharpness, etc. These inconsistencies can greatly lower the perceived quality of the 3D video, and can also negatively affect video processing algorithms that involve matching or combining information from the two views.
A 3D display has to be able to show separate images to the viewer's left and right eyes, which also creates new challenges. 3D displays often have to sacrifice brightness, temporal resolution and/or spatial resolution to allow multiple images to be displayed. More importantly, the image intended for one eye will usually leak through to the other eye as well, an effect known as crosstalk. The perception of crosstalk is called ghosting, and it will often result in viewers seeing double edges. Ghosting is one of the biggest complaints about 3D technology.
In this thesis we propose new methods for correcting capturing and display problems in 3D video systems through digital video processing algorithms. In Chapter 2, we present methods for correcting colour mismatches between cameras. Chapter 3 deals with correcting sharpness variations in 3D images. In Chapter 4 we present a method for reducing the appearance of crosstalk in 3D displays.
The following sections in this introductory chapter provide background information on 3D video systems and a literature review of the topics addressed in each of the research chapters. Section 1.1 provides basic background information on 3D video technology. Existing work dealing with 3D video capturing distortions is reviewed in Section 1.2, including colour inconsistencies (Section 1.2.1) and sharpness inconsistencies (Section 1.2.2). A literature review on crosstalk in 3D displays and crosstalk compensation methods is presented in Section 1.3. Section 1.4 concludes the introduction with an overview of the research contributions presented in this thesis.

1.1 3D Video Technology Overview

The process of going from a real-world scene to a representation of the scene viewed on a 3D display consists of several steps, as illustrated in Figure 1.1.
Figure 1.1: Stages of a 3D video system, from capturing to display

First, the scene must be captured with one or more cameras. Then some processing is typically performed, including converting the captured data into the 3D representation used in the system (there are several competing 3D representation formats, which will be described later). This processing can be as simple as shifting the images from two cameras, or as complicated as multi-camera depth estimation. Next the 3D data must be encoded with some compression scheme, in order to limit the number of bits used to store the video.
Then the 3D video can be distributed. This may involve producing Blu-ray discs of a 3D movie to be sold in stores. Alternatively, the video may be transmitted over the internet (YouTube has introduced support for 3D videos), or sent over a video broadcast network (e.g. World Cup matches are now broadcast in 3D in many countries).
At the playback side, the video must be decoded, and additional processing may be performed, such as rendering novel viewpoints or modifying the video based on the display properties. Finally, the 3D video is displayed on hardware that can show different images to the viewer's two eyes.
There are competing methods and technologies for each of the steps in Figure 1.1, and these steps are also interrelated. For example, the way the data is captured affects the processing that must be done to convert it to a given 3D representation. Different 3D representations require different encoding and transmission methods. Different display technologies place different requirements on the 3D representation; for example, some displays simply require a left and right view, whereas others require multiple viewpoints. An overview of the major techniques for capturing, representing, encoding and displaying 3D video is provided in the following sub-sections.

1.1.1 Capturing and Representation

There are several ways to capture and represent 3D video data. Since ultimately a 3D display will show separate images to the left and right eyes, perhaps the most straightforward way to capture 3D data is to use two side-by-side cameras, one for the left eye image and one for the right eye image. This can be done by mounting two separate 2D cameras side-by-side, or with a stereoscopic camera that has two separate lenses and sensors within a single camera body (Figure 1.2).

Figure 1.2: Stereoscopic camera setups. Side-by-side 2D cameras (left), and a stereoscopic JVC camera with two lenses (right)

In this two-sensor setup, the cameras are usually placed parallel to each other, but they can also be placed in a 'toed-in' configuration where the two cameras are angled towards each other, similar to how human eyes adjust their angle to converge on the object being viewed. However, the toed-in camera configuration results in keystone distortion [1], an effect where the left and right images will not be aligned vertically towards their corners. Keystone distortion contributes to eyestrain, so the parallel camera arrangement is usually used.
Representing 3D video with left and right videos that will be viewed by each eye is known as stereoscopic 3D (S3D). It is currently the most popular 3D format, and is used in 3D cinema, 3D Blu-ray discs [2], and most current 3D broadcasts [4]. To allow S3D content to be easily encoded and transmitted with existing infrastructure and equipment designed for 2D video, frame-compatible S3D formats are sometimes used.
Frame-compatible formats multiplex the information for the left and right views into a single stream that can be treated as a 2D video [5]. The left and right videos can be spatially down-sampled in either the vertical or horizontal direction to squeeze the two videos into a single frame with conventional resolution (i.e., 1920x1080p HD). These down-sampled videos would usually be stored one on top of the other or side by side, but they could also be interlaced row by row. It is also possible to temporally interleave the frames from the left and right videos; for example, if the left and right videos are captured at 30 Hz, they can be combined into a 60 Hz video which alternates between the left and right images. In practice, spatial multiplexing is usually used, because temporal multiplexing decreases the performance of video compression. Figure 1.3 shows one frame of a 3D video in a full resolution format, as well as vertically and horizontally down-sampled frame-compatible formats.

Figure 1.3: Common stereoscopic 3D formats. (a) Full resolution left and right views (b) Horizontally downsampled frame compatible format (c) Vertically downsampled frame compatible format

The basic two camera capturing setup can be extended to more cameras to create multiview video [6]. Multiview data is useful for Free-viewpoint TV (FTV) [7], where the user can select the viewpoint they watch the scene from. More importantly, multiview video is needed for many autostereoscopic 3D displays that allow users to view 3D without glasses [8][9] (these will be discussed in more detail later in Section 1.1.3.2). Multiview capturing setups have been made with up to 100 cameras [10].
An alternative format to stereo or multiview video is to transmit a 2D video plus per-pixel depth information (sometimes denoted 2D+Z). Using image based rendering techniques, new viewpoints can be created [11][12][13], allowing flexible stereo or multiview video to be generated from the 2D+Z data. The depth information can be captured directly with a depth camera [14][15], obtained through matching two or more videos from different camera locations [12][16][17][18], or generated from a single 2D video through 2D to 3D conversion [19][20]. An example of a video in 2D+Z format where the depth has been estimated through stereo matching is shown in Figure 1.4.

Figure 1.4: One frame of a video in 2D plus depth (2D+Z) format.

The 2D+Z format, combined with rendering at the display side, provides greater flexibility than stereo or multiview formats. Through the rendering process, the amount of depth can be controlled, allowing the 3D effect to be tailored to the display. In principle an arbitrary number of views can be rendered, allowing one transmission format to service displays with different numbers of views [18]. However, occlusions are a problem in the rendering process. When a new view is rendered, there will be regions that should be visible in the new view that are not visible in the original 2D video (because they fall behind another object, or show a different side of an object). Therefore, hole filling methods are needed [21].
Since multiview and 2D+Z formats each have their advantages, it has been proposed that 3D data be represented in a multiview plus depth format [22]. Having data from multiple views helps with occlusions (since the data that is not visible in one view may be present in others), while retaining the flexibility of the depth-based rendering approach.
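To make the depth-based rendering idea concrete, the sketch below (a minimal illustration, not the thesis' rendering method; the 8-bit inverse-depth convention, the purely horizontal camera shift and the maximum disparity value are assumptions) warps a 2D+Z frame to a nearby viewpoint by shifting each pixel horizontally by a disparity derived from its depth value, and marks the disoccluded pixels that a hole filling step would then need to fill.

```python
import numpy as np

def render_view(image, depth, max_disparity=20):
    """Naive 2D+Z view synthesis: shift pixels horizontally by a
    disparity proportional to their depth value.

    image: HxWx3 uint8 colour frame; depth: HxW uint8 map where larger
    values are assumed to mean closer objects.
    Returns the rendered view and a boolean mask of holes (disocclusions).
    """
    h, w = depth.shape
    rendered = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    disparity = (depth.astype(np.float32) / 255.0 * max_disparity).round().astype(int)
    # Draw far pixels first so that closer objects overwrite farther ones.
    for d in range(disparity.max() + 1):
        ys, xs = np.where(disparity == d)
        xs_new = xs + d
        valid = xs_new < w
        rendered[ys[valid], xs_new[valid]] = image[ys[valid], xs[valid]]
        filled[ys[valid], xs_new[valid]] = True
    holes = ~filled  # visible in the new view but not in the source view
    return rendered, holes
```

In a complete system the marked holes would then be filled from background regions, as discussed above.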
1.1.2 3D and Multiview Video Compression

Data compression is an important part of any system that involves storing or transmitting video. It is even more important for 3D video systems due to the additional data present for the depth information or additional video streams.
In traditional 2D video compression, transform coding and intra frame prediction exploit redundancies within a single frame, and motion compensation is used to exploit correlation between frames captured at different times. The H.264/AVC (Advanced Video Coding) standard, developed jointly by the Moving Picture Experts Group (MPEG) and Video Coding Experts Group (VCEG), is currently the dominant video coding standard [24]. Work is currently underway on a new High Efficiency Video Coding (HEVC) standard [25], which is again a joint effort of the MPEG and VCEG groups.
In Multiview Video Coding (MVC) [26], techniques also exploit redundancies between the views captured with different cameras to reduce the number of bits required to store the videos. This is known as disparity compensated coding. Disparity compensation can be accomplished by using standard motion compensation techniques between frames in different views.
In both traditional 2D H.264/AVC coding and MVC there are three main classes of frames: I (intra), P (predictive) and B (bi-predictive) frames. I frames are coded independently of all other frames using still image compression techniques [22]. P frames are predicted from one previously coded frame and B frames are predicted from two or more previously coded frames. In multiview video coding, P and B frames can be predicted either from frames of the same view captured at different times or from different views captured at the same time [26]. Different prediction structures for multiview video are evaluated in [27]. An example prediction structure using prediction in both time and between views is shown in Figure 1.5. For simplicity, only three views are shown in Figure 1.5, but the concept can be extended to an arbitrary number of views and temporal frames.

Figure 1.5: Example prediction structure for MVC using prediction in both time and between views.

Stereo (two view) encoding is an important special case of MVC. In stereo encoding the first view is encoded with a standard temporal-only prediction structure, and the second view can be encoded with prediction both from the first view and from previously coded frames of the same view [26].
As introduced in the previous section, 3D video can also be represented as a 2D video plus a per-pixel depth map (2D+Z format). In 2D+Z format, the 2D video and the depth video can be compressed separately as two conventional video streams. Since the depth information is typically very smooth, and does not contain any colour information, compressing the depth information separately with H.264 can require as little as 10% additional bitrate compared to compressing just the 2D video [28]. To reduce the bitrate further, compression schemes have been proposed that use information from the 2D compressed video when compressing the depth map video, for example sharing motion information between the two streams [29].
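As a rough illustration of the disparity compensated prediction described in this section (a sketch only, not the MVC reference software; the 16x16 block size and the search range are assumptions), the code below finds, for one macroblock of the current view, the best matching block in a frame of a neighbouring view using a SAD search, exactly as a motion search would but across views instead of across time.

```python
import numpy as np

def disparity_search(cur_frame, ref_frame, bx, by, block=16, search=32):
    """Block-based disparity estimation between two views using SAD.

    cur_frame, ref_frame: HxW luma arrays (uint8) from two cameras.
    (bx, by): top-left corner of the block being predicted in cur_frame.
    Returns the (dx, dy) displacement into ref_frame with the lowest SAD.
    """
    h, w = cur_frame.shape
    target = cur_frame[by:by + block, bx:bx + block].astype(np.int32)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + block > w or y + block > h:
                continue
            cand = ref_frame[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(target - cand).sum()
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad
```

The residual between the target block and the chosen reference block is what is then transformed and entropy coded.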
1.1.3 3D Display Technologies

3D displays show different images to the viewer's left and right eyes. There are many techniques for doing this, some based on the user wearing special glasses that separate the left and right images, others based on projecting different images in different directions from the display (these are called autostereoscopic displays). A summary of the most significant technologies is provided in the following subsections.

1.1.3.1 Glasses-Based 3D Displays

A simple, low cost method of viewing 3D content on any standard 2D display is with colour anaglyph glasses. Anaglyph glasses are simple colour filters that allow different, near-complementary, colours to reach the viewer's left and right eyes. Red and cyan filters are often used, but other colour combinations are possible [30]. Anaglyph glasses are very cheap to produce and can be used with any colour electronic or paper display. However, they have extremely poor colour reproduction and high levels of crosstalk, and they can cause headaches and discomfort. Therefore they are mostly used as a novelty rather than a practical 3D display for extended viewing.
There are currently two competing 3D technologies employed in 3D cinema: polarized projection [31] (used by RealD) and wavelength-multiplexed RGB (used in Dolby 3D). In polarized 3D projection, the left and right images are displayed with different circular polarizations (left-handed and right-handed), and the viewer wears passive glasses that selectively block light of one polarization [31]. In wavelength-multiplexed 3D, slightly different wavelengths are used for each colour channel (red, green and blue) in the left and right eyes [32][33] (Figure 1.6). The glasses act as a comb filter, selectively passing one of the wavelengths for each colour channel. Wavelength-multiplexed 3D requires more expensive glasses than circularly polarized 3D, but circularly polarized 3D requires a special silver screen to maintain the polarization of the light when it is reflected.

Figure 1.6: Illustration of different wavelengths being used for the RGB colour channels in the left and right eye images.

Another method employed for viewing 3D video, particularly for home use, is active shutter glasses. In active shutter glasses, the lenses covering the eyes are transparent liquid crystal shutters, which turn opaque when a voltage is applied [34]. The display alternates between showing frames from the left and right videos, and through a signalling mechanism (e.g. IR) the glasses selectively block one of the eyes, so that each eye only sees half of the frames (the ones intended for that eye). The main disadvantages of active shutter glasses are their high cost and bulkiness compared to passive glasses; however, they are used in many current consumer products due to the difficulty of making cost-effective small-scale polarized displays. Recently, though, consumer 3DTVs that use line-by-line polarization together with passive glasses have been put on the market [35].
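As a minimal illustration of the anaglyph principle described above (an assumed red-cyan channel split, not a colour-calibrated anaglyph method), the red channel of the output is taken from the left image and the green and blue channels from the right image; the colour filters in the glasses then route each part to the intended eye.

```python
import numpy as np

def red_cyan_anaglyph(left, right):
    """Compose a simple red-cyan anaglyph from a stereo pair.

    left, right: HxWx3 uint8 RGB images from the two cameras.
    The left view is carried in the red channel, the right view in
    the green and blue (cyan) channels.
    """
    anaglyph = np.empty_like(left)
    anaglyph[..., 0] = left[..., 0]      # red from the left image
    anaglyph[..., 1:] = right[..., 1:]   # green and blue from the right image
    return anaglyph
```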
1.1.3.2 Autostereoscopic 3D Displays

The need for special glasses has long been one of the major complaints about 3D technology. Therefore, 3D without glasses has been a goal of several display manufacturers, and some predict that achieving a high quality 3D effect without glasses will be necessary for 3D to achieve mainstream success in the home consumer market. To achieve 3D viewing without glasses (called autostereoscopic viewing), the display needs to emit different images in different directions, allowing a separate image to reach each eye of the viewer. In practice, more than two views are usually needed, to allow the viewers to move in front of the display and still see the 3D effect, and to allow multiple viewers to watch at the same time. Current autostereoscopic displays use between five and several dozen views. There are currently two major techniques for autostereoscopic display: parallax barrier and lenticular lens.
Parallax barrier 3D displays place small barrier strips in front of the screen, which block light from travelling at certain angles, so that each eye sees different pixels behind the barrier (Figure 1.7) [36]. The barrier itself can be a transparent LCD panel, allowing the barrier to be customized for different viewing distances and numbers of views. The barrier principle can also be achieved by placing the barrier behind the screen, but in front of a backlight. If an LCD barrier is used, it is easy to switch between 2D and 3D viewing, by simply making the entire barrier transparent in 2D mode. A disadvantage of parallax barrier 3D displays is their low brightness due to light being blocked by the barrier (in principle the brightness is reduced by the number of views). Furthermore, the resolution of each view is reduced by the number of views.

Figure 1.7: Illustration of the parallax barrier concept for a two view 3D display. (http://en.wikipedia.org/wiki/File:Parallax_Barrier.jpg)

The main competitor to parallax barriers is lenticular lens [9] technology (sometimes called micro-lens). Lenticular lens 3D displays place small cylindrical lenses in front of a flat panel display, each lens covering multiple sub-pixels (one pixel being composed of three sub-pixels for emitting red, green and blue light). The lenses magnify the light from the display such that in the correct viewing zone, a user will only see one of the sub-pixels behind the lens, with their viewing position controlling which one is seen. Early lenticular designs used vertical lenses. In current displays the lenses are usually placed at a small angle relative to the pixels on the display [9] (Figure 1.8), because slanted lenses trade a smaller loss of both vertical and horizontal resolution for what would otherwise be a larger loss of purely horizontal resolution, and because the lenses are less noticeable when placed at an angle (the human visual system is more sensitive to vertical and horizontal edges than to diagonal ones).

Figure 1.8: Example pixel layout and angled lens position in an 8 view lenticular lens 3D display. The numbers indicate which view each sub-pixel belongs to.

Note that in an N-view 3D display based on either parallax barrier or lenticular lens technology, each view only has 1/N of the resolution of the display (because all views are spatially multiplexed at once on the display). Consequently, autostereoscopic displays using these technologies have only recently started to become practical due to the increasing resolution of flat panel displays.
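The sub-pixel-to-view assignment suggested by Figure 1.8 can be sketched as follows (an illustrative mapping only; the slant of one sub-pixel per row and the modulo-N assignment are assumptions, and real displays each have their own calibrated geometry). It interleaves N rendered views into a single display frame by picking, for every sub-pixel, the view that the slanted lens layout assigns to that position.

```python
import numpy as np

def interleave_views(views, slant=1):
    """Spatially multiplex N views onto a slanted-lenticular panel.

    views: list of N HxWx3 uint8 images (one per viewing direction).
    For the sub-pixel in column x, row y, colour channel c, the view
    index is taken as (3*x + c + slant*y) mod N, a simplified model of
    the slanted layout sketched in Figure 1.8.
    """
    n = len(views)
    h, w, _ = views[0].shape
    panel = np.zeros((h, w, 3), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            for c in range(3):
                v = (3 * x + c + slant * y) % n
                panel[y, x, c] = views[v][y, x, c]
    return panel
```

The 1/N resolution loss mentioned above is visible directly in this construction: each view contributes only one in every N sub-pixels of the panel.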
1.2 Capturing Distortions

A challenge that arises in multiview video systems that is not present in traditional single view video is inconsistencies between cameras. Capturing well calibrated stereo or multi-view video sequences is a challenging problem, and in fact many of the multiview sequences used by the Joint Video Team (JVT) standardization committee have noticeable inconsistencies between the videos captured with different cameras [37]. Inconsistencies between cameras can cause the images to differ in brightness, colour, sharpness, contrast, etc. These differences reduce the correlation between the images and therefore make multiview video coding less efficient, since predicting one view based on another will not work well if there are inconsistencies between views. Inconsistencies also make stereo matching between views more challenging, resulting in lower quality depth maps. Variations in colour or texture between views also negatively affect rendering of new virtual viewpoints, and will be unpleasant to users as they switch between different views in a free-viewpoint video application. Consequently, it is important to be able to correct inconsistencies between cameras. In the following subsections we review methods for correcting two possible inconsistencies in multiview video: colour and sharpness.

1.2.1 Colour Inconsistencies in Multiview Imaging

In both image and video cameras, colour calibration is usually done by capturing a known reference such as a colour chart [38]. One way to try to calibrate an array of cameras in multiview video systems is to capture a colour chart and calibrate every camera based on it [39]. However, this approach is sensitive to lighting conditions, and it is not always practical to capture a colour chart. In multiview imaging, it is more important that colours are consistent between the different cameras than that all views perfectly match an outside colour reference.
In multiview video systems, brightness and colour correction can be performed either as preprocessing before compression, or incorporated into the compression process itself. Accounting for colour variations in the compression process itself has the advantage that the original data is restored during the decompression process, which gives flexibility for different colour correction methods to be applied after decoding. The disadvantages are that the complexity of the compression process is increased, and it forces correction to be performed at the decoder, which further increases the complexity and cost of the decoder/display side. Performing colour correction as preprocessing also has the advantage that more complex correction algorithms can be applied, since correction only has to be performed once before encoding rather than at every decoder.

1.2.1.1 Coding Based Colour Compensation

A leading method for accounting for brightness variations in the compression process is the macroblock (MB) based illumination change compensation method proposed by Hur et al. in [40]. In their method, an illumination change (IC) value is calculated for each MB, which is the difference in the DC values between the MB being coded and the corresponding MB in the reference view being used for prediction. The IC for each MB is predicted from the IC values of neighbouring MBs, and the difference between the actual value and the predicted value is encoded in the bit-stream. The motion estimation (ME) and motion compensation (MC) processes are altered to account for the IC values. Note that this method is only applied to the luma channel of the video, so it corrects for variations in brightness but not colour. This method was included by the JVT standardization committee in their multiview video coding reference software (JMVC), but the JVT decided not to include any colour or illumination compensation in the final multiview video coding standard [26].
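A sketch of the idea behind such illumination compensation (not the JMVC implementation; the block-level DC model is the only assumption used here) is to remove the mean of each block before computing the matching cost, giving the mean-removed SAD (MRSAD), so that a constant brightness offset between views does not penalise an otherwise good match.

```python
import numpy as np

def mrsad(block_cur, block_ref):
    """Mean-removed SAD between two equally sized luma blocks.

    The difference of the block means plays the role of the per-MB
    illumination change (IC) value; subtracting it before the SAD makes
    the cost insensitive to a constant brightness offset between views.
    """
    cur = block_cur.astype(np.float32)
    ref = block_ref.astype(np.float32)
    ic = cur.mean() - ref.mean()   # illumination change estimate
    return np.abs(cur - (ref + ic)).sum(), ic
```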
In [41] another method for correcting brightness and colour during the encoding process is proposed, which uses lookup tables in the RGB colour space. Each frame being used for prediction is converted to the RGB colour space, and the red, green and blue colour channels are modified independently with three separate lookup tables. The correction lookup tables are calculated by finding matching points between pairs of views, and using a dynamic programming method to find a function that will make the colours of the matching points agree. Then the frame is converted back to the YUV colour space to be used for inter-view prediction. The lookup tables are sent as side information in the bit-stream.
Note that in coding-based colour compensation methods [40][41], the goal is strictly to reduce the bit-rate of the encoded multiview videos. The data output by the decoder will have any colour inconsistencies that were present in the input.

1.2.1.2 Preprocessing Colour Correction Methods

A simple method for correcting colour in multiview video as a preprocessing step is histogram matching, as proposed by Fecker et al. in [42] and further evaluated in [43]. In that method, a lookup table is calculated for each of the Y, U and V channels based on the histograms of the view being corrected and the reference view. The histogram matching method was improved in [44], with the correction being performed in the RGB colour space and a time-constant correction function being used (so the same correction lookup table is used for every temporal frame in a sequence). A fundamental disadvantage of histogram based methods is that they cannot deal with occlusions between views, or situations where the views have different amounts of foreground and background.
Another preprocessing method is proposed in [45], where a scaling and an offset parameter are calculated for each YUV component. For example, the modified value for each Y sample in a view being corrected is calculated with:

$Y_{cor} = a Y_{org} + b$   (1.1)

where $Y_{cor}$ is the corrected value, $Y_{org}$ is the original value, and $a$ and $b$ are the scaling and offset parameters for the Y channel. The U and V channels are corrected equivalently. The scaling and offset values for each channel are calculated based on the histograms of the reference view and the view being corrected. Thus, this method has the same weaknesses as histogram methods. Furthermore, it modifies each colour channel independently, so information from the other colour channels is not used.
An object based colour correction method is proposed by Shao et al. in [46]. This method uses simple colour clustering to segment each frame into objects, and then applies a separate colour transformation on each object. Colour transformation is performed in the CIELAB colour space, using a matrix multiplication. Defining $X$ as a vector of CIELAB colour coordinates for one pixel, $[L, a, b]^T$, the corrected colour $X_{cor}$ is calculated from the original colour $X_{org}$ using a matrix multiplication:

$X_{cor} = M X_{org}$   (1.2)

Here $M$ is a 3x3 correction matrix calculated based on the covariance matrices of the CIELAB colour coordinates of the reference frame and the frame being corrected. Since the colour correction is linear in the CIELAB space, this method has limited ability to correct non-linear distortions.
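As an illustration of the preprocessing approaches reviewed in this subsection, the sketch below implements basic single-channel histogram matching (a generic version, not the specific method of [42]-[44]): it builds a lookup table that maps each level of the view being corrected to the level with the closest cumulative frequency in the reference view.

```python
import numpy as np

def histogram_match_lut(src_channel, ref_channel):
    """Build a 256-entry lookup table mapping the histogram of one 8-bit
    channel (e.g. Y of the view being corrected) onto the histogram of
    the same channel in the reference view."""
    src_hist = np.bincount(src_channel.ravel(), minlength=256).astype(np.float64)
    ref_hist = np.bincount(ref_channel.ravel(), minlength=256).astype(np.float64)
    src_cdf = np.cumsum(src_hist) / src_hist.sum()
    ref_cdf = np.cumsum(ref_hist) / ref_hist.sum()
    # For each source level, find the reference level with the closest CDF value.
    lut = np.searchsorted(ref_cdf, src_cdf).clip(0, 255).astype(np.uint8)
    return lut

# Applying the correction is then a simple table lookup:
# corrected = lut[view_channel]
```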
1.2.1.3 Panorama Colour Correction Methods

Another application of multiview imaging is panoramic stitching [47][48], where multiple images are stitched together to form a single panoramic image with a larger field of view. Colour correction is also important in panoramic stitching, as the images may have different exposure, white balance, etc. because of the different content in each image. In panoramic stitching it is also important to correct for vignetting [49] (darkening of an image towards its corners).
To compensate for brightness and colour differences between images in panoramas, several techniques have been proposed. The simplest of these is to multiply each image by a scaling factor [47]. Scaling can somewhat correct exposure differences, but cannot correct vignetting, so a number of more sophisticated techniques have been developed.
The multiview video methods described in the previous section all find a mapping function that will make the colour of one video match that of another. That is, the function takes pixel values as inputs (in RGB or a related colour space) and directly produces modified pixel values. In contrast, colour correction in panoramic stitching is usually done by trying to solve for the radiance of the scene by finding the inverse of the camera response function. Once the radiance values of all images have been estimated, a new camera response can be simulated to render the values back to pixels in RGB colour space, with consistent simulated camera parameters applied to all images. A common camera model is used in most previous work on vignetting and exposure correction for panoramas [50]-[53]. A scene radiance value $L$ is mapped to an image pixel value $I$ through:

$I = f(eV(x)L)$   (1.3)

In equation (1.3), $e$ is the exposure with which the image was captured, and $f(E)$ is the camera's response function, which in general is a non-linear function. The pixel value $I$ represents a red, green or blue sample. $V(x)$ is the vignetting function, which is dependent on the position of the pixel in the image ($x$). $V(x)$ is one at the image's optical center and it decreases towards the edges. The irradiance observed by the image sensor at location $x$ is $E = eV(x)L$. With an estimate of the exposure, and models for the camera response $f(E)$ and the vignetting $V(x)$, the scene radiance can be recovered from the measured pixel value as:

$L = \frac{g(I)}{eV(x)}$   (1.4)

where $g(I)$ is the inverse of the camera response function. Once the radiance values are found, each image can be rendered with a common exposure and no vignetting (i.e. $V(x) = 1$) with equation (1.3), which should correct colour mismatches between images.
In [50], the camera response function is modeled as a gamma curve and vignetting is modeled as $\cos^4(d/f)$, where $d$ is the radial distance between the pixel and the center of the image and $f$ is the focal length. In [51], the authors generalize the camera response model to be a polynomial. An influential method for removing vignetting and exposure differences is proposed by Goldman and Chen in [52], which uses an empirical camera response model together with a polynomial model for vignetting. Specifically, a 6th order even polynomial is used:

$V(d) = 1 + \alpha_1 d^2 + \alpha_2 d^4 + \alpha_3 d^6$   (1.5)

In (1.5), the $\alpha$ values are weights adjusted to make the polynomial fit the observed vignetting. Another method based on Goldman and Chen's work is proposed in [53]. A white balance factor is introduced, so that $I = f(eV(x)Lw_c)$. The optimization problem is reformulated so that a single non-linear optimization is done with the Levenberg-Marquardt method [52].
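A sketch of how this camera model is applied once its parameters are known follows (the gamma response, the exposure value and the alpha coefficients below are placeholders, not fitted values from any of the cited methods): each pixel is first inverted to radiance with equation (1.4) and then re-rendered with equation (1.3) using a common exposure and no vignetting.

```python
import numpy as np

def correct_vignetting(img, e, alphas, gamma=2.2, e_out=1.0):
    """Remove vignetting and normalise exposure using the model of
    equations (1.3)-(1.5), assuming a simple gamma camera response.

    img: HxWx3 float image scaled to [0, 1]; e: estimated exposure;
    alphas: (a1, a2, a3) coefficients of the even vignetting polynomial V(d).
    """
    h, w, _ = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Radial distance from the optical centre, normalised to [0, 1].
    d = np.hypot(xx - w / 2.0, yy - h / 2.0)
    d /= d.max()
    a1, a2, a3 = alphas
    v = 1.0 + a1 * d**2 + a2 * d**4 + a3 * d**6      # equation (1.5)
    v = v[..., None]                                 # broadcast over colour channels
    radiance = (img ** gamma) / (e * v)              # g(I) / (e V(x)), equation (1.4)
    corrected = (e_out * radiance) ** (1.0 / gamma)  # re-render with V(x)=1, equation (1.3)
    return np.clip(corrected, 0.0, 1.0)
```

Applying this with the same e_out to every image in a panorama is what makes their colours consistent before blending.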
1.2.1.4 Robust Stereo Matching Methods

Colour mismatches between videos can also negatively impact stereo matching. Stereo matching is a classical computer vision problem that has been extensively studied [12][16]. By finding matching points between two images captured at different locations, the depth of those points (i.e. the distance between the camera and the object) can be determined by triangulation [1]. More than two views can be used in the matching process to increase the accuracy of the depth estimation [12][17] (sometimes this is called "multiview matching" or "multiview stereo"). Stereo or multiview matching is often used to generate 3D video in the 2D+Z format.
Colour and brightness mismatches between videos captured with different cameras can affect the accuracy of stereo matching, because it will be harder to match points if those points appear different in each view. One way to deal with colour mismatches is to apply a multiview colour correction method, such as those described in Section 1.2.1.2, before matching is performed. Alternatively, the matching process itself can be made robust to variations in brightness/colour.
A number of techniques have been proposed to make stereo matching robust to radiometric differences between images (i.e., variations in brightness, contrast, vignetting). A simple but effective method for compensating for small brightness differences is to pre-filter the images with a Laplacian of Gaussian (LoG) kernel [54]. The LoG filter kernel is zero-mean, so it will remove changes in bias (i.e., the DC component). Therefore, the LoG would perfectly compensate for brightness variations if one image was a constant amount brighter than the other.
Instead of modifying the images being matched, a robust matching cost can be used (the matching cost is the function that is used to measure how closely different points match). Various matching costs have been proposed that are robust to variations in brightness, such as normalized cross-correlation [55] (which is robust to variations in scaling) and mutual information [56] (which is robust to complicated relationships between the pixel values in the two images).
Another successful technique is to take the rank transform of the images [57], which replaces each pixel by the number of pixels in a local window that have a value lower than the current pixel. The rank transform is robust to any monotonic relationship between the pixel values of the images. A detailed evaluation of several matching techniques that are robust to radiometric differences was first presented in a conference paper [58] and was extended further in a journal paper [59].
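A minimal sketch of the rank transform described above (the window size is an assumption): each pixel is replaced by the count of pixels in its local window that are darker than it, after which ordinary SAD matching on the transformed images becomes insensitive to any monotonic brightness change between the views.

```python
import numpy as np

def rank_transform(img, radius=2):
    """Rank transform of a greyscale image: each pixel becomes the number
    of pixels in its (2*radius+1)^2 window with a strictly lower value."""
    img = img.astype(np.float32)
    h, w = img.shape
    rank = np.zeros((h, w), dtype=np.uint16)
    padded = np.pad(img, radius, mode='edge')
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue
            # Neighbour value at offset (dy, dx) for every pixel.
            shifted = padded[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            rank += (shifted < img).astype(np.uint16)
    return rank
```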
However, sharpness variations can severely degrade the processing of multiview video, resulting in inefficient inter-view prediction, stereo matching and view rendering, due to the lower correlation between views. Some research has addressed improving multiview compression efficiency when there are sharpness variations between videos, and also making stereo matching robust to variable sharpness. An overview of these topics is presented in the following sub-sections.

1.2.2.1 Multiview Video Coding Tools for Sharpness Variations

A technique called blur compensation is proposed in [62], which aims at increasing the efficiency of motion compensation when there are different amounts of blurring between frames in a 2D video. A set of pre-defined low-pass filters is used to create a series of blurred frames which are used as reference frames during motion compensation. Since pre-defined filters are used, the method has limited ability to correct different amounts of blurring between frames. Sharpening filters are not considered in their method.

An adaptive filtering method is proposed in [63] whose objective is to increase the coding efficiency of multiview video coding when there are focus mismatches between views. In that method, 5x5 linear filters are applied to reference frames before the frames are used for disparity compensated prediction. The frame is segmented into different depth levels, and a separate filter is used for each depth level in the frame. First, block based disparity estimation is used to estimate a disparity vector (dx, dy) for every block in the frame being coded. Then, the frame is segmented into different depth levels based on the disparity vectors. Finally, the coefficients of a 5x5 filter φ_i for each depth level are calculated such that the square of the prediction error after disparity compensation is minimized:

    \min_{\varphi_i} \sum_{(x,y) \in depth_i} \left[ I_{view1}(x, y) - \varphi_i * I_{view2}(x + dx, y + dy) \right]^2    (1.6)

The filter defined by equation (1.6), where φ_i * I denotes 2D convolution of the filter with the reference frame, is called the minimum mean square error filter. Using this criterion to design the filter will produce good coding efficiency results, but will not result in images that match the reference in perceived sharpness. In fact, designing a filter using equation (1.6) will almost always produce a low-pass filter because block based motion compensation is more accurate at reconstructing the low frequency portions of a signal [64]. Consequently, the filtered reference frame will usually be blurred.

1.2.2.2 Stereo Matching Robust to Sharpness Variations

As described in section 1.2.1.4, there has been considerable research on making stereo matching robust to variations in brightness/colour, including detailed comparison papers [59]. Much less work has addressed stereo matching when there are variations in sharpness/blurring between the images. In [65], a stereo method is proposed using a matching cost based on phase quantization. That method involves comparing highly quantized Discrete Fourier Transform (DFT) phase values of image blocks. Since phase values are not affected by convolution with a centrally symmetric point spread function, their method is robust to blurring such as out-of-focus blur or Gaussian blur. A problem with their method is that the DFT must be taken on fairly large square windows for reliable matching, which leads to the well-known foreground fattening effect [16].
In [66] a method is proposed for performing stereo matching on images where a small portion of the image suffers from motion blur. A probabilistic framework is used, where each region can be classified as affected by motion blur or not. Different smoothness parameters in an energy minimization step are used for the pixels estimated as affected by motion blur. Note that neither [65] nor [66] attempts to correct the blurring in the images (they do not modify the input images); instead, they attempt to make the matching more robust to blurring.

1.3 Crosstalk in 3D Displays

Since 3D displays show different images to the viewer's left and right eyes, a common problem with them is incomplete separation of these images. That is, a portion of the image intended for the left eye reaches the right one, and vice versa. This effect is known as crosstalk [67][68]. Crosstalk is a critical factor that limits the quality of many 3D displays, as it can cause the viewer to perceive 'ghosting', an effect where a double image is seen. Crosstalk can severely degrade the perceived 3D picture quality [69][70][71], can affect the impression of depth [73] and can result in viewers not being able to fuse the two images. Two examples of 3D images with 10% crosstalk between the left and right views are shown in Figure 1.9.

Figure 1.9: Two examples of 3D images with 10% crosstalk between the left and right views.

Different 3D displays have different mechanisms that cause crosstalk, and suffer from different amounts of crosstalk. However, all 3D displays have some crosstalk except for those that have physically separate optic channels for the left and right eyes (such as head mounted displays [74]).

1.3.1 Sources of Crosstalk in 3D Displays

Anaglyph 3D displays suffer from very high amounts of crosstalk (often over 20% [30]), due to the spectral response of the simple colour filters used (the filters typically have a fairly large transition between the pass-band and stop-band). Furthermore, many displays have a large amount of overlap in the wavelengths emitted by their red, green and blue sub-pixels (particularly between red and green) [30], so even with perfect filters there would be some crosstalk.

In polarized projection 3D displays, crosstalk is caused by imperfect extinction of the glasses (i.e., the polarized filters leak a small amount of the light with the wrong polarization), as well as the optical quality of the polarizer (not all of the photons may be polarized as intended) and the optical quality of the screen [67]. A disadvantage of polarized projection is the need for a silver screen, because normal cinema screens will not retain the polarization of reflected photons.

In wavelength-multiplexed 3D displays the amount of crosstalk is controlled by the spectral emission characteristics of the projector combined with the frequency response of the comb filters in the glasses. Recall that wavelength-multiplexed 3D involves using slightly different wavelengths for each colour channel in the left and right eyes (Section 1.1.3.1). The amount of crosstalk depends on how narrow the emitted wavelengths can be made (so that the red wavelengths used for the left eye don't overlap with those of the right, etc.), as well as how well the glasses reject the wavelengths used for the RGB channels of the other eye. Through careful design of the projector and glasses, the crosstalk of the system can be made quite low [33].
3D displays based on active glasses often have high levels of crosstalk; some of the first generation 3D displays released in early 2010 had up to 20% crosstalk [75]. One source of crosstalk in shutter displays is transmission in the 'off' state. When the shutters turn opaque (black) there is still a significant amount of light leaking through to the eye [76]. The timing of the shutter glasses and its interaction with the display update timing also contributes to crosstalk. It takes some time for the glasses to switch between the clear and opaque states. If the display is still emitting light during the transition it can cause crosstalk. If shutter glasses are used with a phosphor based display (i.e., CRTs or plasma), phosphor persistence (afterglow) is a problem [76]. CRTs and plasma displays are impulse-type displays; in each frame the phosphors 'fire', releasing an impulse of light which gradually decays. After the display and glasses have updated to a new frame the phosphors can still be emitting some light from the previous frame (which was intended for the other eye).

Combining shutter glasses with LCD displays also creates crosstalk problems. LCDs are a hold-type display; they continue to emit light throughout the entire frame period. LCDs have a backlight, and the voltage applied to each LCD cell controls how much of the backlight is transmitted or absorbed, thus controlling the amount of light reaching the viewer. The time it takes for the LCD to transition from one output to another when a new voltage is applied highly depends on the two voltages (i.e., the two gray levels). When the change in voltage is small, the transition time can be quite long, so the display will often still be transitioning when the glasses switch to a new frame. Therefore, LCDs have different levels of crosstalk for different gray level combinations, called the gray-to-gray crosstalk characteristics of the display [77][78]. Techniques such as black frame insertion and modulated backlight have been employed to improve the transition performance of LCDs, both for reducing motion blur and reducing crosstalk [67][75].

Autostereoscopic displays, both lenticular lens and parallax barrier, also have high levels of crosstalk. The amount of crosstalk depends on the optical quality of the lenses or barriers, their alignment relative to the display sub-pixels and the viewer's position in front of the screen [79][67]. Crosstalk is actually necessary in autostereoscopic displays, to allow smooth transitions between viewpoints as the viewer's eye moves relative to the display [80].

1.3.2 Crosstalk Cancellation

Since crosstalk is a problem in virtually all practical 3D displays, methods to reduce it through image processing are of interest. An effective method for reducing the appearance of crosstalk is subtractive crosstalk cancellation [83]-[86]. This is a technique where the image levels are lowered based on the anticipated amount of crosstalk. Therefore, after crosstalk is added during playback, the intended images should be seen by the viewers. A simple linear model for crosstalk is sometimes used:

    i_{L,eye}(x,y) = i_L(x,y) + c \cdot i_R(x,y)
    i_{R,eye}(x,y) = i_R(x,y) + c \cdot i_L(x,y)    (1.7)

Here i_L(x,y) and i_R(x,y) are one colour channel of the input images in linear space, and c is the amount of crosstalk in that colour channel. The signals i_{L,eye} and i_{R,eye} are the ones reaching the viewer's eyes.
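As a simple illustration, the linear model of equation (1.7) can be simulated directly. The following Python/NumPy sketch is an example only; the 4% crosstalk level, the function name and the image size are arbitrary assumptions for this illustration, not measurements from any particular display.

    import numpy as np

    def simulate_crosstalk(i_left, i_right, c=0.04):
        # Linear crosstalk model of equation (1.7).
        # i_left, i_right: one colour channel in linear intensity (0-1).
        # c: assumed crosstalk ratio for that channel.
        i_left_eye = i_left + c * i_right
        i_right_eye = i_right + c * i_left
        return i_left_eye, i_right_eye

    # Example usage with random linear-light test images (assumed 480x640).
    left = np.random.rand(480, 640)
    right = np.random.rand(480, 640)
    left_eye, right_eye = simulate_crosstalk(left, right, c=0.04)

In practice c would be measured separately for each colour channel of the target display, since, as noted next, it usually differs between channels.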
Note that the amount of crosstalk (c) often varies for the different colour channels. Equation (1.7) requires the images be represented as linear intensity values, whereas digital images are usually stored as 8-bit unsigned values (0-255) in the standard RGB (sRGB) colour space [81]. The sRGB colour space is nonlinear and approximately a gamma curve with gamma 2.2. The equations for converting between sRGB and linear RGB intensities [81] could be used in the crosstalk model; however, that may not represent the response of the display. Displays are often calibrated to a gamma of 2.2 rather than the sRGB curve [82], so a simple gamma curve can often be used to convert from 8-bit values (i_{L8}) to linear intensities:

    i_L = \left( \frac{i_{L8}}{255} \right)^{\gamma}    (1.8)

With equation (1.8), the images will be represented in the range zero to one, with one corresponding to the maximum intensity of the display. Crosstalk can be compensated for if we input the following processed images to the display [83]:

    i_{L,disp}(x,y) = \frac{i_L(x,y) - c \cdot i_R(x,y)}{1 - c^2}
    i_{R,disp}(x,y) = \frac{i_R(x,y) - c \cdot i_L(x,y)}{1 - c^2}    (1.9)

By substituting equation (1.9) into the crosstalk model of (1.7), it can easily be verified that the images reaching the viewer's eyes will be the original images, i_L and i_R. Note that (1.9) basically involves subtracting the right image from the left one and vice versa, with a small gain of 1/(1-c^2) applied to all terms. The factor 1/(1-c^2) is close to one if the amount of crosstalk is relatively low, so it is sometimes ignored to lower complexity. A slightly altered form of equation (1.9) is used in the patent [84], for implementation with hardware circuitry. The same concept can be extended to multiview displays with crosstalk from more viewpoints [85].

An alternate crosstalk model based purely on visual measurements is proposed by Konrad et al. in [86]. They use a model where the amount of crosstalk depends not only on the image causing crosstalk, but also on the intended image. Hence, the crosstalk is a two dimensional function, depending on the pixels' values in both the left and right images:

    i_{L,eye} = i_{L8} + \phi(i_{R8}, i_{L8})
    i_{R,eye} = i_{R8} + \phi(i_{L8}, i_{R8})    (1.10)

In equation (1.10), \phi(\cdot) is the crosstalk function, and i_{L8} and i_{R8} are the 8-bit gamma encoded values of one colour channel in the left and right images (note that this model is based entirely on values in 8-bit format, not linear values). The model in equation (1.10) is calibrated by a series of visual experiments where a user adjusts the brightness of a square that has a certain amount of crosstalk until it visually matches the brightness of another square that has no crosstalk (because one image is set to black). Based on these visual measurements, the function \phi(\cdot) is determined. Then, a mapping function is derived that takes the input values i_{L8} and i_{R8} and produces modified 8-bit values, \gamma_{L8} and \gamma_{R8}, that satisfy the property:

    i_{L8} = \gamma_{L8} + \phi(\gamma_{R8}, \gamma_{L8})
    i_{R8} = \gamma_{R8} + \phi(\gamma_{L8}, \gamma_{R8})    (1.11)

The idea of equation (1.11) is that after crosstalk is added to the modified image values, \gamma_{L8} and \gamma_{R8}, the desired images should be obtained. An iterative procedure is used to find the mapping from the input values to the modified values [86].
This method can handle more general crosstalk functions than the linear model in (1.7), but still requires the crosstalk function be smooth, monotonic and have zero crosstalk when the image values are zero (which may not be the case for 3D LCDs [77][78]). It also requires a large set of tedious and eyestrain-inducing visual measurements for calibrating the model.

1.3.3 Raising the Image Black Level

A problem with crosstalk cancellation occurs if there is a high amount of crosstalk in an image region that is close to black. In this case, there may not be enough light to subtract from the almost-black image to compensate for the crosstalk. That is, if the amount of light from crosstalk is greater than the amount of light from the signal at a pixel, the intended pixel value cannot be lowered by enough to compensate for the crosstalk. One solution to this problem is to raise the minimum image level; for example, if the input images cover the entire range [0, 255], then compress the range to [30, 255]. This ensures there will always be 'foot room' for lowering the image values to compensate for crosstalk. This global approach of compressing the image range is used in several previous works [83]-[86]. Obviously raising the black level is undesirable, as it will reduce image contrast and therefore lower the picture quality. To allow full crosstalk cancellation even in the worst case (when one image is white and the other image is black), the image range would have to be compressed from [0, 1] to [c, 1] in a linear space, where c is the amount of crosstalk. The corresponding range reduction in 8-bit gamma encoded space is from [0, 255] to [255 c^{1/\gamma}, 255]. For 4% crosstalk (c = 0.04), the image range would have to be compressed to [59, 255], so even for moderate amounts of crosstalk a large portion of the useable range would be lost.

A low complexity method for raising the black level and performing subtractive crosstalk cancellation in one step has been patented by Sharp [87]. Their method requires only subtractions, bit shift operations and one fixed multiplication per pixel, thus it can be easily implemented in either hardware or software.

Instead of globally raising the image levels, which reduces contrast over the entire image, an alternative approach is to raise the image levels only in local regions that suffer from noticeable crosstalk. The company RealD has applied for a patent on this approach [88]. In their method, they detect regions where conventional crosstalk cancellation will fail (i.e., where the signal is too low to be able to compensate for crosstalk) and around these regions patches of 'disguising' luminance are added. These patches of added luminance are very smooth, and are therefore likely to be less noticeable than crosstalk. Since the patches typically occupy a small percentage of the total image area, the method preserves image contrast better than globally compressing the image range. However, the method as described in [88] operates on a frame-by-frame basis, without considering temporal consistency. Therefore, it may often result in flickering or sudden jumps in brightness.

It is worth noting that in addition to providing foot-room for subtractive cancellation, raising the image levels will also decrease the visibility of crosstalk due to Weber's Law. According to Weber's Law, the just noticeable difference (JND) of a signal (in this case luminance from a display) is proportional to the signal's magnitude [89].
Therefore, by raising the black level of an image, the JND will also be increased, and a constant amount of crosstalk will be less noticeable.  34  1.4 Thesis Contributions In this thesis, we present novel methods for correcting capturing and display problems in 3D video systems. This includes correction of inconsistent colour and sharpness in stereo and multiview data, and crosstalk cancellation for 3D displays. In Chapter 2, section 2.1, we present a new multiview video colour correction method that increases the compression efficiency of multiview video coding. Unlike previous colour correction methods which require choosing a reference view and modifying all other views to match it, our method modifies all views to match the average colour of the original set of videos. This results in less change being needed to make the set of videos consistent. Experimental results show our method is effective at increasing the coding efficiency of MVC when there are colour mismatches in the original multiview video data. In section 2.2, we present a modification of our multiview video colour correction algorithm for panoramic image stitching. When images are stitched into a panorama, vignetting (images becoming darker towards their corners) must also be corrected. Our method corrects both colour mismatch and vignetting with one linear least squares regression per colour channel, avoiding the need for more complex non-linear optimization used in other methods. In Chapter 3, a pre-processing method for correcting sharpness variations in stereo images is presented. Our method improves the accuracy of stereo matching performed after the correction is applied. Correcting sharpness variations as a pre-processing step gives the advantage that any stereo matching method can be used. Our method is based 35  on scaling the 2D discrete cosine transform (DCT) coefficients of both images so that the two images have the same amount of energy in each of a set of frequency bands. Experiments show that applying the proposed correction method can greatly improve the disparity map quality when one image in a stereo pair is more blurred than the other. Finally, in Chapter 4 we present a method for improving crosstalk cancellation in 3D displays. A common problem with crosstalk cancellation methods is that they require the black level of the image be raised, so that there is always some ‘foot-room’ to lower the image levels based on the amount of crosstalk.  Alternatively, the image levels can be  selectively raised in local regions that suffer from crosstalk, however existing methods for doing this do not consider temporal consistency so they may result in flicker or sudden jumps/drop in brightness. We propose a new method for locally raising image levels for crosstalk cancellation, which provides temporal consistency through smoothing the intensity of the added light, and by providing fade-ins and fade-outs to the added signal to prevent sudden changes in brightness. Results show that our method allows effective crosstalk cancellation, yielding smooth videos without flicker, while maintaining better image contrast than globally scaling the image levels.  36  2 Multiview Colour Correction 2.1 Colour Correction Preprocessing For Multiview Video Coding Current multiview video systems capture a scene with at least three and up to a hundred cameras. Since a number of cameras are needed, and each individual camera captures a large amount of data, the total amount of data captured in multi-view video systems is huge. 
Hence, efficient compression methods are required to allow efficient storage and transmission of the data.  As discussed in section 1.2.1, multiview video sets  can have inconsistent brightness and colour, particularly when the number of cameras is large and therefore accurately calibrating all of them is extremely challenging. In fact many of the multiview video sets used by the JVT committee on multiview video coding suffer from quite noticeable variations in colour [37]. Variations in colour between views negatively impact predictive coding between views and therefore lower multiview video coding efficiency. In previous work on colour correction in multiview video sets, one view is always chosen as the colour reference and all other views are modified to match it on a pair-wise basis [40]-[45]. In this chapter we propose a new method, where instead we find the average colour of the video set, and modify all views to match the average. Our method involves finding a set of matching points between all views in the video set with block based disparity estimation, then calculating the average Y, U and V value for these matching points. A least squares regression is performed for each view to find the coefficients of a polynomial that will make the corrected YUV values of the view most 37  closely match the average YUV values. Experimental results show that applying the proposed colour correction greatly increases the compression efficiency when multiview video is compressed with inter-view prediction. The impact of the choice of colour reference in multiview video colour correction is discussed in Section 2.1.1. Our proposed method is described in detail in Section 2.1.2. Experimental results are presented in Section 2.1.3, including subjective colour comparisons and tests on how the proposed method improves inter-view prediction in multiview video coding. Finally, conclusions are given in Section 2.1.4.  2.1.1 Choice of Colour Reference in Multiview Colour Correction Existing multiview colour correction algorithms choose one view as the reference, and try to alter the brightness and colour in all other views to match it [40]-[45]. The reference view is usually chosen as a view in the center of the camera arrangement. The reason for choosing the center view is that it will usually have the most in common with the other views. Also, it will usually be easier to find matching points in views that are close together, so using the center view as a reference is logical if the colour correction method involves point matching. However, the center view may not always be a good choice for the colour reference. If there is an even number of views, there is no logical way to choose between the two views closest to the center.  A bigger problem is that the center view may have  substantially different colour from the other views.  For example, in the standard  Flamenco2 multiview video set, the center camera has much lower brightness and less  38  saturated colours than the other views. If it is used as the reference, all other views will have to be greatly altered to make their colours match the reference. The choice of colour reference can have a significant impact on the coding efficiency when the multiview video is compressed.  In order to demonstrate this, we colour  corrected the Flamenco2 video with the histogram matching method in [42] five separate times using each of the five views as the colour reference.  
Each resulting colour corrected video was compressed with H.264 using an IPPP structure in the view direction (i.e. using entirely inter-view prediction in P frames) (Figure 2.1). The relative coding efficiency obtained using each view as the reference is shown in Table 2.1, with the Bjontegaard metric used to calculate the average Peak Signal to Noise Ratio (PSNR) difference between two curves [90].

Figure 2.1: IPPP coding structure using inter-view prediction

Reference View   ΔPSNR relative to view 0 (dB)   Y Variance   U Variance   V Variance
0                 0.00                            2009         636          577
1                 0.02                            2145         589          564
2                 0.42                            2021         511          445
3                -0.65                            2375         662          612
4                -0.19                            2409         536          457

Table 2.1: Relative H.264 coding efficiency obtained on the Flamenco2 video set with histogram matching colour correction and different views used as the colour reference

As seen in Table 2.1, the difference in coding efficiency obtained using different reference views can be larger than 1 dB. This shows that the choice of reference colour can have a large impact on coding efficiency. However, choosing the reference that will give the best coding efficiency is not necessarily the best choice. In Table 2.1, we can see that the view that produces the best coding efficiency (view 2) has low variance in the Y, U and V channels compared to the other views. Likewise, the view with the worst coding efficiency (view 3) has high variance in the colour channels. This is quite intuitive, as signals with low variance are easier to compress. This does not mean that choosing a low variance view as the colour reference is necessarily a good choice, as doing this will result in corrected videos with lower contrast and less saturated colours.

These results show that the choice of the colour reference is important when performing colour correction on multiview video sets. We propose a new method for choosing the colour reference that is used for correcting multiview video. Instead of using the colour from one of the captured views as the reference, we find the average colour from all the captured views and use that as the colour reference. The advantage of using the average colour as reference is that the minimum amount of modification will have to be done across the set of views in order to make them consistent.

2.1.2 Proposed Colour Correction Preprocessing Method

Although modifying each view to match the average colour of all views is a relatively intuitive concept, in practice defining and calculating the average colour is not very straightforward. We propose a point based method for finding the average colour. We find matching points between all views using a correlation method, and then calculate the average Y, U and V values of these matching points. We use a least squares regression to find a polynomial function to map the initial YUV values in each view to match these average YUV values. These steps are described in detail in the following sub-sections.

2.1.2.1 Point Matching Between Views

Point matching in stereo and multiview camera arrays is a classical problem that has been extensively studied. An overview paper that evaluates several methods is presented in [16]. There are two main approaches to point matching: block based methods and feature based methods. Block based methods divide one image into small blocks and attempt to find matching blocks in another image based on some criteria, for example the sum of squared differences.
Feature based methods, such as the Scale Invariant Feature Transform (SIFT) [91], involve extracting keypoints from each image and looking for matches between keypoints. Block based methods produce far more matching points between images, but the matches are less reliable.

In order to find a large number of matching points, we use block based disparity estimation on the luma channel to find matching points between all views. One view in the center of the camera arrangement is chosen as the anchor. This view is divided into blocks of size 8x8 pixels, and matching blocks for every block in the anchor are found in all other views (Figure 2.2).

Figure 2.2: Finding matching points between all views by choosing an anchor view and performing block matching between the anchor and all other views

In video coding, block matching is usually done using the sum of absolute differences (SAD) or sum of squared differences (SSD) as the cost function. However, these cost functions are sensitive to the brightness level of the two blocks, and have been shown to have very poor performance when there are brightness variations between views [58]. In multiview video sets, there may be substantial brightness variations, so a more robust matching criterion is needed. Therefore, we use the normalized cross correlation (NCC), defined for an NxN block located at position (x_0, y_0) in the anchor frame as:

    NCC(i,j) = \frac{\sum_{x=x_0}^{x_0+N-1} \sum_{y=y_0}^{y_0+N-1} (Y_{anc}(x,y) - m_{anc}) (Y_{view}(x+i, y+j) - m_{view})}{\sqrt{\sum_{x=x_0}^{x_0+N-1} \sum_{y=y_0}^{y_0+N-1} (Y_{anc}(x,y) - m_{anc})^2} \sqrt{\sum_{x=x_0}^{x_0+N-1} \sum_{y=y_0}^{y_0+N-1} (Y_{view}(x+i, y+j) - m_{view})^2}}    (2.1)

where Y_{anc}(x,y) is a frame of the anchor view, Y_{view}(x,y) is the view in which a matching block is being found, and m_{view} and m_{anc} are the mean values of the blocks in the two frames:

    m_{view} = \frac{1}{N^2} \sum_{x=x_0}^{x_0+N-1} \sum_{y=y_0}^{y_0+N-1} Y_{view}(x, y)
    m_{anc} = \frac{1}{N^2} \sum_{x=x_0}^{x_0+N-1} \sum_{y=y_0}^{y_0+N-1} Y_{anc}(x, y)    (2.2)

Note that the mean of the block is subtracted and the energy of the block is normalized in the NCC calculation. This provides robustness to changes in brightness between the views. The disparity is estimated by choosing the vector (i*, j*) that results in the maximum NCC over a search range i \in [s_{x1}, s_{x2}], j \in [s_{y1}, s_{y2}]:

    (i^*, j^*) = \arg\max_{s_{x1} \le i \le s_{x2},\; s_{y1} \le j \le s_{y2}} NCC(i, j)    (2.3)

In this work we have used a rectangular search range and a full search (where every displacement vector within the search range is evaluated). The search range was chosen separately for each video set based on the range of disparities observed in the sequence. Note that the computational cost of disparity estimation could be reduced by using a fast disparity estimation algorithm such as those proposed in [92] and [93].

The disparity vector calculated with equation (2.3) may not correspond to true disparity, because of occlusion between the views, noise, or other factors. Therefore, an additional test is used to decide whether the disparity estimation has found matching points for the current block. The NCC takes values in [-1, 1] and has a value of one when the blocks are scaled versions of each other (after mean removal). If the NCC value between the block in the anchor view and the matching blocks in other views is above 0.7 for all views, the blocks are considered to be valid matches across all views, and the pixels in the block are added to vectors of matching points Y_1, Y_2, ..., Y_M, U_1, U_2, ..., U_M, V_1, V_2, ..., V_M.
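To make this block-matching step concrete, the following Python/NumPy sketch implements the full-search NCC disparity estimation of equations (2.1)-(2.3) and the 0.7 validity test. The function names, the default 8x8 block size and the example search range are illustrative assumptions for this sketch, not parameters taken from the software used in the experiments.

    import numpy as np

    def ncc(block_anc, block_view):
        # Normalized cross correlation between two equally sized blocks (equation 2.1).
        a = block_anc - block_anc.mean()
        b = block_view - block_view.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / denom if denom > 0 else 0.0

    def best_match(y_anc, y_view, x0, y0, n=8, search=(-32, 32, -2, 2)):
        # Full-search disparity estimation for the n x n block at (x0, y0),
        # returning (best NCC value, i*, j*) as in equation (2.3).
        # 'search' holds the assumed range (sx1, sx2, sy1, sy2).
        block_anc = y_anc[y0:y0 + n, x0:x0 + n].astype(np.float64)
        sx1, sx2, sy1, sy2 = search
        h, w = y_view.shape
        best = (-2.0, 0, 0)
        for j in range(sy1, sy2 + 1):
            for i in range(sx1, sx2 + 1):
                xs, ys = x0 + i, y0 + j
                if 0 <= xs and xs + n <= w and 0 <= ys and ys + n <= h:
                    cand = y_view[ys:ys + n, xs:xs + n].astype(np.float64)
                    score = ncc(block_anc, cand)
                    if score > best[0]:
                        best = (score, i, j)
        return best

    # A block contributes matching points only if its best NCC exceeds 0.7
    # in every non-anchor view, e.g.:
    # valid = all(best_match(y_anchor, yv, x0, y0)[0] > 0.7 for yv in other_views)

The full search above is written for clarity rather than speed; in practice the fast search strategies cited in the text would be used instead.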
In these vectors, the subscript indicates the view number 1, ..., M, where M is the number of views in the video set. The threshold 0.7 was determined experimentally to provide good results for typical multiview content. With the vectors of matching points, the average YUV values across all views can easily be calculated for these points:

    Y_{avg} = \frac{1}{M} \sum_{i=1}^{M} Y_i
    U_{avg} = \frac{1}{M} \sum_{i=1}^{M} U_i
    V_{avg} = \frac{1}{M} \sum_{i=1}^{M} V_i    (2.4)

2.1.2.2 Colour Correction Function

A transformation has to be found for each view to map the view's YUV values to match the average YUV values as closely as possible. We use three functions to calculate the corrected Y, U and V values for each frame:

    Y_i^{cor} = f_Y(Y_i, U_i, V_i)
    U_i^{cor} = f_U(Y_i, U_i, V_i)
    V_i^{cor} = f_V(Y_i, U_i, V_i)    (2.5)

The functions f_Y, f_U and f_V are designed to minimize the error between the corrected YUV values for the view and the average YUV values:

    Y_{avg} = Y_i^{cor} + \varepsilon    (2.6)

In colour correction research, polynomials are often used to find least-squares fits to non-linear functions. For a 2nd order polynomial, the corrected Y, U or V value for each pixel is calculated based on the captured YUV values of the pixel as:

    Y^{cor} = a_{Y1} Y + a_{Y2} U + a_{Y3} V + a_{Y4} Y^2 + a_{Y5} U^2 + a_{Y6} V^2 + a_{Y7} YU + a_{Y8} YV + a_{Y9} UV + a_{Y10}    (2.7)

The correction function for each view is controlled by the weighting vector a_Y = [a_{Y1}, a_{Y2}, ..., a_{Y10}]. The corrected U and V values are calculated equivalently with different weight vectors a_U and a_V. A 2nd order polynomial is shown in equation (2.7), but any order is possible; for example, a third order polynomial would have 20 terms. The optimal weight vectors that will make the corrected values match the average values can be calculated with a least squares regression. Define Ψ as follows, with '*' denoting element-wise multiplication of vectors:

    \Psi = [\,Y_i \;\; U_i \;\; V_i \;\; Y_i * Y_i \;\; U_i * U_i \;\; V_i * V_i \;\; Y_i * U_i \;\; Y_i * V_i \;\; U_i * V_i \;\; 1\,]    (2.8)

Here Y_i, U_i, and V_i are the vectors of pixels of view 'i' found in the block matching process. The colour corrected Y values can be calculated as:

    Y^{cor} = \Psi a_Y    (2.9)

Substituting this into equation (2.6) gives:

    Y_{avg} = \Psi a_Y + \varepsilon    (2.10)

The parameter vector a_Y which minimizes the energy of the error vector ε can be found with a standard least squares estimator [95]:

    a_Y = (\Psi^T \Psi)^{-1} \Psi^T Y_{avg}    (2.11)

Similarly, the coefficients for generating the corrected U and V values can be found with:

    a_U = (\Psi^T \Psi)^{-1} \Psi^T U_{avg}
    a_V = (\Psi^T \Psi)^{-1} \Psi^T V_{avg}    (2.12)

After the weight vectors have been calculated with equations (2.11) and (2.12), each pixel in the view can be colour corrected with equation (2.7).

Here, we find a single disparity vector for 8x8 pixel blocks and consider every pixel within the block to be a match. In stereo matching research, it is more common to estimate disparity on a pixel by pixel basis, with a block surrounding the current pixel being used in the matching process [16]. Calculating disparity and finding matches on a pixel by pixel basis may result in more accurate matches, and hence fewer outliers in the regression, but would greatly increase the number of computations needed. It would also make it more difficult to reuse the disparity vectors calculated during later compression of the multiview video.
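As a concrete illustration of equations (2.8)-(2.12), the sketch below builds the 2nd order design matrix, fits the weight vectors and applies the correction. It assumes the matched Y, U, V samples and the averages are already available as 1D float arrays, and it uses numpy.linalg.lstsq rather than the explicit normal equations of (2.11); both choices are implementation conveniences for this example, not requirements of the method.

    import numpy as np

    def design_matrix(y, u, v):
        # Second order design matrix Psi of equation (2.8);
        # y, u, v are 1D float arrays of matched samples from one view.
        ones = np.ones_like(y)
        return np.column_stack([y, u, v, y * y, u * u, v * v,
                                y * u, y * v, u * v, ones])

    def fit_correction(y, u, v, y_avg, u_avg, v_avg):
        # Least squares estimates of the weight vectors (equations 2.11 and 2.12).
        psi = design_matrix(y, u, v)
        a_y, _, _, _ = np.linalg.lstsq(psi, y_avg, rcond=None)
        a_u, _, _, _ = np.linalg.lstsq(psi, u_avg, rcond=None)
        a_v, _, _, _ = np.linalg.lstsq(psi, v_avg, rcond=None)
        return a_y, a_u, a_v

    def apply_correction(y, u, v, a_y, a_u, a_v):
        # Corrected samples via equation (2.7); the same call works on
        # flattened full frames once the weights are known.
        psi = design_matrix(y, u, v)
        return psi @ a_y, psi @ a_u, psi @ a_v

Using a QR- or SVD-based solver such as lstsq avoids explicitly forming \Psi^T \Psi, which is numerically safer when the number of matching points is large.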
The majority of video content is stored in YUV 4:2:0 format, where the chroma (UV) channels are downsampled by a factor of two in the vertical and horizontal directions relative to the luma. Equations (2.7) through (2.12) require a Y, U and V sample for every matching point between frames. Therefore, we downsample the Y channel by a factor of two in both directions for the purpose of finding the correction parameters. This yields a quarter size video in YUV 4:4:4 format. To apply equation (2.7) during the correction step, we upsample the U and V channels in order to calculate the corrected Y channel. That is, when applying equation (2.7), the corrected Y channel is a function of the original Y channel and upsampled U and V channels, and the corrected U and V channels are functions of the original U and V channels and the downsampled Y channel.

2.1.3 Experimental Results

Results are presented for four standard multiview video test sequences: Rena (16 views), Flamenco2 (5 views), Race (8 views), and Ballroom (8 views). All videos have resolution 640x480. 60 temporal frames were processed for each video, with every tenth frame being used in the regression analysis for calculating the parameter vectors a_Y, a_U and a_V. The same set of parameters was used to correct every temporal frame in a view.

In most video coding papers only the luma PSNR is reported, since the human visual system is more sensitive to luma than chroma. However, here we are modifying the colour of the video, so the chroma quality is also relevant. Hence we provide two RD curves for each video to show the luma and chroma PSNR. We calculate the chroma PSNR as the average PSNR of the U and V channels.

When measuring PSNR, the compressed video is compared to a reference which is assumed to have perfect quality. Usually the original video is used as the reference. However, when colour correction is being performed, a fundamental assumption is that the original data is not perfect and needs to be modified. Therefore, when measuring the PSNR of video where both colour correction and compression have been applied, we use the colour corrected version of the video as the reference. The PSNR reported is a measure of the amount of distortion introduced in the compression process. Note that when different colour correction schemes are compared, we use different references for measuring the PSNR, because there is no ground truth video that has perfect colour.

2.1.3.1 Effect of the Colour Transform Function

Here we evaluate the effect of the polynomial used in the correction function on multiview video compression performance when disparity compensation is used. To show the effect of the polynomial function used in the proposed method, the test videos were compressed with the H.264 reference software (JM). Disparity compensated prediction was used with an IPPP coding structure in the view direction (Fig. 2.1). The search range was selected on a video by video basis to ensure that the search range is somewhat larger than the range of disparities for each video set. We compare using polynomials of order up to four. The equations for generating the corrected values for orders one, two and three are given in equations (2.13), (2.14) and (2.15), respectively.

    Y^{cor} = a_{Y1} Y + a_{Y2} U + a_{Y3} V + a_{Y4}    (2.13)

    Y^{cor} = a_{Y1} Y + a_{Y2} U + a_{Y3} V + a_{Y4} Y^2 + a_{Y5} U^2 + a_{Y6} V^2 + a_{Y7} YU + a_{Y8} YV + a_{Y9} UV + a_{Y10}    (2.14)

    Y^{cor} = a_{Y1} Y + a_{Y2} U + a_{Y3} V + a_{Y4} Y^2 + a_{Y5} U^2 + a_{Y6} V^2 + a_{Y7} YU + a_{Y8} YV + a_{Y9} UV + a_{Y10} Y^3 + a_{Y11} U^3 + a_{Y12} V^3 + a_{Y13} YUV + a_{Y14} Y^2 U + a_{Y15} Y^2 V + a_{Y16} YU^2 + a_{Y17} YV^2 + a_{Y18} U^2 V + a_{Y19} UV^2 + a_{Y20}    (2.15)

The equation for a fourth order polynomial is similar, but involves 35 terms.
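The design matrix for any of these model orders can be generated programmatically. The helper below is a hypothetical example, not code from the thesis; it enumerates every monomial of Y, U and V up to a given total degree plus a constant term, which reproduces the term counts quoted above (4, 10, 20 and 35 columns for orders one to four).

    import numpy as np
    from itertools import combinations_with_replacement

    def poly_design_matrix(y, u, v, order=3):
        # Columns are all monomials Y^a * U^b * V^c with 1 <= a+b+c <= order,
        # followed by a constant column (equations 2.13-2.15).
        channels = [y, u, v]
        cols = []
        for degree in range(1, order + 1):
            for combo in combinations_with_replacement(range(3), degree):
                term = np.ones_like(y)
                for idx in combo:
                    term = term * channels[idx]
                cols.append(term)
        cols.append(np.ones_like(y))
        return np.column_stack(cols)

    # Column counts: order 1 -> 4, order 2 -> 10, order 3 -> 20, order 4 -> 35.

The column ordering produced this way differs from the coefficient numbering written out in (2.13)-(2.15), but the fitted model spans the same set of terms.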
In addition we give results for correcting the colour channels independently. Independent correction has the advantage of lower complexity, and it is easier to perform in parallel. Using independent correction of each channel with a 3rd order polynomial, the corrected Y values are calculated as:

    Y^{cor} = a_{Y1} Y + a_{Y2} Y^2 + a_{Y3} Y^3 + a_{Y4}    (2.16)

Note that to use each polynomial form, the matrix Ψ in equation (2.8) has to be redefined accordingly, but otherwise the proposed correction algorithm is identical.

Rate distortion curves for the test videos Rena and Flamenco2 obtained with the different polynomial forms for colour correction are shown in Figure 2.3. We observe that as the polynomial order is increased the compression efficiency increases. This shows that higher order polynomials are able to more accurately characterize the colour distortion between different cameras. That is, using a higher order polynomial results in more accurate colour correction and hence makes inter-view prediction more effective. However, very similar performance is obtained for 3rd and 4th order polynomials, which suggests that increasing the order higher than three will result in very small gains. There is also a risk of overfitting the data as more terms are used. Therefore, we propose to use 3rd order polynomials for the correction function, as they give similar results to 4th order polynomials while having lower complexity.

Figure 2.3: Rate distortion performance obtained with different polynomials used for the colour correction function (luma and chroma PSNR versus bit rate for the Flamenco2 and Rena sequences)

The results in Figure 2.3 also show that substantial gain is achieved by using the information from all three colour channels during correction rather than correcting each channel independently. Independent correction with a 3rd order polynomial gives similar or slightly worse results than a 1st order polynomial using all three colour channels. The rest of the results in this chapter use a 3rd order polynomial with all three colour channels used to generate each corrected channel, as in equation (2.15).

2.1.3.2 Block Matching Criteria

There are alternatives to the normalized cross correlation that can be used as the block matching criterion when there are differences in brightness between views. In previous MVC work (such as [40]), the Mean-Removed Sum of Absolute Differences (MRSAD) has been used. The MRSAD has been shown to be more accurate than the standard SAD for disparity estimation on multi-view video [40].
For an NxN block of pixels located at position (x_0, y_0) the MRSAD is defined as:

    MRSAD(i,j) = \sum_{x=x_0}^{x_0+N-1} \sum_{y=y_0}^{y_0+N-1} \left| (Y_{anc}(x,y) - m_{anc}) - (Y_{view}(x+i, y+j) - m_{view}) \right|    (2.17)

Here, m_{anc} and m_{view} are the mean values of each block, as defined in equation (2.2). If the MRSAD is used as the block matching criterion, an alternate test is needed to decide if a valid match has been found. We compare the MRSAD to the Mean Removed Sum of Absolute Values (MRSAV) of the anchor block, which is a measure of how much texture is present in the block:

    MRSAV = \sum_{x=x_0}^{x_0+N-1} \sum_{y=y_0}^{y_0+N-1} \left| Y_{anc}(x,y) - m_{anc} \right|    (2.18)

If the MRSAD (which measures the residual between the blocks) is significantly lower than the MRSAV, then it is more likely that a valid match has been found. We consider the blocks valid matches if MRSAD ≤ 0.5 · MRSAV.

In order to determine if the NCC and MRSAD produce significantly different results, we have performed colour correction on the test videos using each of them as the block matching criterion. The mean squared differences between the videos obtained with the two methods are shown in Table 2.2.

Video Set     Y msd   U msd   V msd
Flamenco2     1.16    0.26    0.27
Rena          0.30    0.20    0.29
Race          1.49    0.24    0.33
Ballroom      0.57    0.13    0.10

Table 2.2: Average mean squared difference between videos obtained using NCC and MRSAD as the block matching criteria

As seen in Table 2.2, there is essentially no difference in the resulting videos when the MRSAD is used as the block matching criterion. Note that the mean squared differences in Table 2.2 are equivalent to PSNR values of 46 to 58 dB, which are high enough that there is no visible difference in quality between the videos. Therefore, either block matching method could be used. The MRSAD might be preferred since it involves fewer calculations. The NCC has been used since it can correct for both variations in the mean and scaling of the block (since the energy is normalized in the NCC calculation), while the MRSAD only accounts for variations in the mean value. However, the MRSAD can still produce good results if there are only minor variations in the scaling between different views (which seems to be the case for the MVC test videos). Note that the different block matching criteria will produce different vectors of matching points, with different bad matches (outliers). Since so many matching points can be found in multiview video data, a least squares regression can still estimate the weighting parameters even in the presence of different matching errors.
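A minimal sketch of the MRSAD-based validity test of equations (2.17) and (2.18) is shown below, assuming the same NumPy block conventions as the earlier NCC example; the function names and the use of floating point blocks are assumptions of this illustration.

    import numpy as np

    def mrsad(block_anc, block_view):
        # Mean-removed sum of absolute differences (equation 2.17).
        return np.abs((block_anc - block_anc.mean())
                      - (block_view - block_view.mean())).sum()

    def mrsav(block_anc):
        # Mean-removed sum of absolute values of the anchor block (equation 2.18).
        return np.abs(block_anc - block_anc.mean()).sum()

    def is_valid_match(block_anc, block_view):
        # Accept the match only if the residual is well below the anchor's texture energy.
        return mrsad(block_anc, block_view) <= 0.5 * mrsav(block_anc)

Because only additions, subtractions and absolute values are involved, this test is cheaper per candidate block than the NCC computation it replaces.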
2.1.3.3 Subjective Quality of Corrected Views

The first frame from each view of the Flamenco2 video set before and after colour correction is shown in Figure 2.4. Before correction there are visible differences in brightness and colour between the different views, particularly the center view, which is darker than the rest and has less saturated colours. After correction the views look very consistent in brightness and colour, and all the views have colour which looks like the average colour before correction. Since the other video sets have many views (8 or 16), it is not practical to show all of the views at once. For this reason we show only two views from the 'Rena' and 'Race' video sets for subjective comparison (Figures 2.5 and 2.6). These views have noticeably different colour before correction but look consistent after our proposed colour correction algorithm is applied.

Figure 2.4: Colour correction of the Flamenco2 multiview video set (a) original data (b) colour corrected with proposed method

Figure 2.5: Sample views from the Rena test video set (a) Frames from two views before correction (b) After proposed colour correction

Figure 2.6: Sample views from the Race test video set (a) Frames from two views before correction (b) After proposed colour correction

2.1.3.4 Distortion of Original Data

An advantage of correcting to the average colour is that the minimal amount of average modification is needed to make the set of videos consistent. The mean squared difference between the original data and the colour corrected data of view 'i' is calculated as:

    msd_i = \frac{1}{WHT} \sum_{t=0}^{T-1} \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} \left( f_i^{org}(x,y,t) - f_i^{cor}(x,y,t) \right)^2    (2.19)

where f_i^{org}(x,y,t) is frame 't' of one channel (Y, U, or V) of the original video, and f_i^{cor}(x,y,t) is the colour corrected version of that channel. T, W, and H are the number of frames, the width of the video and the height of the video, respectively. The average difference across all views is calculated as:

    msd_{avg} = \frac{1}{M} \sum_{i=0}^{M-1} msd_i    (2.20)

The mean squared differences in each colour channel between the original and colour corrected videos calculated with equation (2.19) are shown in Table 2.3. The proposed method of correcting to the average colour is compared with the traditional method of choosing the centre view as the colour reference. To use the center view as the colour reference, the proposed method can be applied with Y_{avg}, U_{avg} and V_{avg} being replaced by Y_k, U_k and V_k in equations (2.11) and (2.12), where 'k' is the index of the center view.

Video Set    Colour Reference    Y msd   U msd   V msd
Flamenco2    Center (View 0)      56.1    19.6    58.7
Flamenco2    Average              22.7     7.2    13.4
Rena         Center (View 46)      4.5     8.2    27.0
Rena         Average               3.0     4.7    14.3
Race         Center (View 4)     138.6     7.8    20.2
Race         Average              47.3     4.6    10.6
Ballroom     Center (View 4)      23.2     4.5     3.9
Ballroom     Average              22.0     3.9     3.6

Table 2.3: Average mean squared difference between original data and colour corrected views using different colour references

As seen in Table 2.3, correcting to the average colour requires less average change be made to the set of videos to make the colour consistent. In the Flamenco2 video set, the center view is much darker and has less saturated colours than the other views, so using it as the reference results in much greater differences than using the average colour as the reference. Likewise, in the Race video set the center view is brighter than the rest, so using the average colour as the reference also greatly reduces the amount of change being made to the video set. For the Ballroom video set, the center view has almost the same brightness and colour as the average, so the results are similar using it as the reference.
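For reference, the distortion measure of equations (2.19) and (2.20) can be computed as follows; the array layout assumed here (one colour channel of a view stored as a T x H x W array) is a convention of this sketch only.

    import numpy as np

    def msd(original, corrected):
        # Mean squared difference of equation (2.19) for one channel of one view;
        # inputs are (T, H, W) arrays before and after colour correction.
        diff = original.astype(np.float64) - corrected.astype(np.float64)
        return (diff * diff).mean()

    def msd_avg(originals, correcteds):
        # Average over all M views (equation 2.20).
        return np.mean([msd(o, c) for o, c in zip(originals, correcteds)])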
2.1.3.5 Effect on Compression Performance

In order to evaluate how the proposed method affects multiview video encoding performance, experiments were run with the Joint Multiview Video Model reference software (JMVM 8.0) [96]. A temporal GOP size of 8 was used, and the QPs used are those specified in the JVT common test conditions [97]. We compare our method with 1) compressing the original video, 2) the illumination compensation (IC) method adopted in JMVM, and 3) the histogram matching (HM) method presented in [44]. As noted in Section 2.1.1, the contrast of the reference view is important to multiview coding performance. So when testing the HM method, the reference view has been chosen to have contrast as close to the average colour (used in the proposed method) as possible. Figure 2.7 shows the Peak Signal to Noise Ratio (PSNR) vs. bit rate curves for both the luma and chroma channels of the test videos.

Figure 2.7: Rate-distortion performance obtained with the proposed colour correction preprocessing method (luma and chroma PSNR versus bit rate for the Flamenco2, Rena, Race and Ballroom sequences, comparing JMVM, IC, HM and the proposed method)

The proposed method results in luma PSNR gains ranging from about 0.5 to 1 dB over compressing the original data without IC. The gains compared to using IC range from 0 to 0.6 dB. Greater PSNR gains are seen in the chroma channels. The proposed method gives chroma PSNR gains ranging from about 0.7 to 2.1 dB over compressing the original videos. For all videos the proposed method gives similar or better performance than the HM method.

The coding efficiency gains are particularly high for the Rena video set. The Rena set has very low spacing between cameras (5 cm), so neighboring views are very highly correlated and disparity compensation is very effective. Furthermore, there are very significant colour variations between views in the Rena set. Therefore, correcting the colour between views to improve disparity compensation results in large gains in compression efficiency. For the video sets that are less densely captured (i.e., with greater camera spacing), disparity compensation is less effective, so improving disparity compensation through colour correction provides less gain. However, for all tested video sets, significant compression efficiency gains are observed.

2.1.3.6 Computational Complexity

The proposed method consists of three main steps: 1) finding matching points through disparity estimation, 2) performing the least squares regressions, and 3) correcting the YUV values. The complexity of each of the first two steps is dependent on the number of frames used to calculate the correction parameters, which can be a small subset of the total temporal frames in the video (six frames were used in our tests). The last step must be performed on every frame in the video set, so its computational complexity depends on the temporal length of the video.
Disparity estimation is used to find matching points between all views. The computational complexity of this process is heavily dependent on the search range and search method used. The search range required varies from video to video, as it depends on the scene and camera geometry. In our experiments, we have used a rectangular search window and a full search (where the cost function is evaluated at every possible disparity within the window). The speed of the search process could be considerably increased by using fast disparity estimation methods such as those in [92][93]. If the videos are well rectified, disparity estimation can be reduced to a 1D search [94], greatly reducing the number of search points. Note that the disparity vectors calculated in this step could be reused in later compression of the video (or serve as a starting point for a smaller refinement search), so the additional complexity this step adds to a complete system may be minor.

Three least squares regressions must be performed per view to calculate the correction vectors with (2.11) and (2.12). There are many very efficient numerical methods for performing least squares regressions [95]. In our experiments we have used the MATLAB "\" operator, which calculates the least squares solution based on QR factorization [98]. On an Intel Core2 Duo E4400 2.0 GHz system, performing the least squares regressions in MATLAB took 1.5 to 3.1 seconds per view, depending on how many matching points were found in disparity estimation.

To calculate the final corrected YUV values with equation (2.15), 45 multiplications and 19 additions are needed per sample (Y, U or V). In our implementation, written in C code and compiled with Microsoft Visual Studio®, this takes 15 ms per 640x480 pixel frame. The complexity could be reduced by using a lower order polynomial, or removing some terms from equation (2.15), at the expense of slightly less accurate correction.

2.1.4 Conclusions

In this section, we have proposed a method for correcting colour in multiview video sets to the average colour of the set of original views. The colour correction is done as a pre-processing step to compression. Block based disparity estimation is used to find matching points between all views. A least squares regression is performed on the set of matching points to find the optimal parameters for a polynomial that will make the captured values from each view match the average colour values. Experimental results show that the proposed method produces video sets that are highly consistent in colour. Applying the proposed method increases compression efficiency by up to 1.0 dB in luma PSNR when multiview video is compressed with JMVM.

2.2 Fast Vignetting Correction and Colour Matching

In the previous section, we have presented a colour matching method for multiview video sets. Here we describe a modified version of our method for another multiview imaging application, panoramic image stitching. Panoramic image stitching involves taking multiview photos of a scene with any camera and stitching them into a wide-angle panoramic image that can have a field of view of up to 360 degrees. Panoramic image stitching has been extensively studied in the literature and there are also several commercial software programs for creating panoramas [47][48]. When two images are aligned, there are almost always visible seams due to mismatches in the brightness and colour between images (Figure 2.8).
There are several causes of these seams, such as registration errors, exposure differences, vignetting [49] (radial falloff towards the edges of an image) and variable white balance.

Figure 2.8: Two images showing severe colour mismatch aligned with no blending (left) and multi-band blending (right)

To prevent visible edges in the final panorama, blending techniques are used to make transitions between images smooth. Popular methods include multi-band blending [99] and gradient domain blending [100]. While these techniques do remove visible seams, they do nothing to remove global colour differences. Therefore, using blending techniques alone can result in unnatural looking panoramas, as seen in Figure 2.8 (right).

In previous work that corrects for both vignetting and exposure differences, non-linear optimization is used [50]-[53]. Non-linear optimization methods have considerably higher complexity than linear ones, and therefore may take a long time to run [50][52], and may have problems with convergence [51]. In this chapter we propose a vignetting correction and colour matching method that uses only linear least-squares regression. Linear regression has far lower complexity than non-linear optimization and has no problems with convergence, since the global optimum can be determined in closed form. In order to use simple linear regression, we correct vignetting with an additive term, and match the colour between two images with a 2nd order polynomial. Our proposed method is described in detail in section 2.2.1. Experimental results, presented in section 2.2.2, show that the proposed method effectively corrects for vignetting, exposure and white balance differences between images. When combined with blending methods, the proposed method gives panoramas with no visible mismatch between images.

2.2.1 Proposed Vignetting and Colour Correction Method

We assume that we have two images, I_1(x,y) and I_2(x,y), that have been registered so that an overlapping area has been identified (in our experiments we align images using autostitch [47]). Our goal is to correct vignetting in both images and make the colour of I_1 match that of I_2 (so that I_2 is the image used as the colour reference). We will first describe the method for correcting two images, and then describe how it is extended to a greater number of images.

Vignetting correction is usually done by dividing the irradiance by the estimated vignetting attenuation, as in equation (1.4). In order to use a simple linear regression, we would like to compensate for vignetting with an additive term instead. Define E_0 to be the irradiance that would be measured by the sensor if there were no vignetting (E_0 = eL). The drop in irradiance due to vignetting, using the polynomial model of Goldman et al. (equation 1.5), is:

    \Delta E_{vig} = E_0 - E = E_0 - E_0 (1 + \alpha_1 d^2 + \alpha_2 d^4 + \alpha_3 d^6) = -E_0 (\alpha_1 d^2 + \alpha_2 d^4 + \alpha_3 d^6)    (2.21)

Since an image pixel value is proportional to the irradiance, we can say that the corresponding drop in the image sample will be approximately:

    \Delta I_{vig} = -I (\alpha_1 d^2 + \alpha_2 d^4 + \alpha_3 d^6)    (2.22)

Note that the vignetting model of equation (2.22) would be equivalent to the model in equation (2.21) if the camera's response were linear.
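A minimal Python/NumPy sketch of this additive compensation is shown below. The normalization of the radial distance d (so that the farthest corner has d = 1), the RGB array layout and the function names are assumptions made for this example only, since the text does not fix those conventions here.

    import numpy as np

    def radial_distance(height, width):
        # Distance of every pixel from the image centre, normalized so the
        # farthest corner has d = 1 (an assumed convention for this sketch).
        yy, xx = np.mgrid[0:height, 0:width]
        cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
        d = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
        return d / d.max()

    def remove_vignetting(image, alphas):
        # Additive compensation based on equation (2.22): add back the estimated
        # drop by subtracting I*(a1*d^2 + a2*d^4 + a3*d^6) from each channel.
        # 'image' is a float RGB array of shape (H, W, 3); the fitted alphas are
        # typically negative, so this brightens the corners.
        a1, a2, a3 = alphas
        d = radial_distance(*image.shape[:2])
        falloff = a1 * d ** 2 + a2 * d ** 4 + a3 * d ** 6
        return image - image * falloff[..., None]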
In image 2, we will only correct vignetting (since it serves as the colour reference), so applying equation (2.22) the corrected image is calculated with:

I_{2,cor} = I_2 + I_2 \left( \alpha_1 d_2^2 + \alpha_2 d_2^4 + \alpha_3 d_2^6 \right)    (2.23)

In image 1, we have to correct both vignetting and colour differences between image 1 and 2. To do this we propose to use a 2nd order polynomial to model a transfer function to render a pixel in image 1 with the exposure and white balance of image 2. Hence the function we use for correcting vignetting in image 1 and making its colour match image 2 is:

I_{1,cor} = a_1 I_1 + a_2 I_1^2 + a_3 + I_1 \left( \alpha_1 d_1^2 + \alpha_2 d_1^4 + \alpha_3 d_1^6 \right)    (2.24)

The optimal weights ai and αi that will make the images consistent will be estimated from matching points in the overlapping region. Vignetting affects all colour channels equally, so the weights for vignetting should be common to all three channels (R, G, and B), but the other weights are different for each channel. The corrected values of image 1 calculated with (2.24) should equal the corrected values of image 2 calculated with (2.23). Applying this to the R, G and B channels gives the set of equations:

R_2 + \alpha_1 d_2^2 R_2 + \alpha_2 d_2^4 R_2 + \alpha_3 d_2^6 R_2 = a_{R1} R_1 + a_{R2} R_1^2 + a_{R3} + \alpha_1 d_1^2 R_1 + \alpha_2 d_1^4 R_1 + \alpha_3 d_1^6 R_1
G_2 + \alpha_1 d_2^2 G_2 + \alpha_2 d_2^4 G_2 + \alpha_3 d_2^6 G_2 = a_{G1} G_1 + a_{G2} G_1^2 + a_{G3} + \alpha_1 d_1^2 G_1 + \alpha_2 d_1^4 G_1 + \alpha_3 d_1^6 G_1    (2.25)
B_2 + \alpha_1 d_2^2 B_2 + \alpha_2 d_2^4 B_2 + \alpha_3 d_2^6 B_2 = a_{B1} B_1 + a_{B2} B_1^2 + a_{B3} + \alpha_1 d_1^2 B_1 + \alpha_2 d_1^4 B_1 + \alpha_3 d_1^6 B_1

After both images have been warped into the coordinates of the final panorama, there will be a region where the images are overlapping. Define vectors of the red, green and blue samples in the overlapping region as R1, G1, B1 and R2, G2, B2. The equations in (2.25) can be written for these vectors of matching points as:

\begin{bmatrix} R_2 \\ G_2 \\ B_2 \end{bmatrix} = \Psi a + \varepsilon    (2.26)

with a and Ψ defined as:

a = \left[ a_{R1}, a_{R2}, a_{R3}, a_{G1}, a_{G2}, a_{G3}, a_{B1}, a_{B2}, a_{B3}, \alpha_1, \alpha_2, \alpha_3 \right]^T

\Psi = \begin{bmatrix}
R_1 & R_1^2 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & R_1 d_1^2 - R_2 d_2^2 & R_1 d_1^4 - R_2 d_2^4 & R_1 d_1^6 - R_2 d_2^6 \\
0 & 0 & 0 & G_1 & G_1^2 & 1 & 0 & 0 & 0 & G_1 d_1^2 - G_2 d_2^2 & G_1 d_1^4 - G_2 d_2^4 & G_1 d_1^6 - G_2 d_2^6 \\
0 & 0 & 0 & 0 & 0 & 0 & B_1 & B_1^2 & 1 & B_1 d_1^2 - B_2 d_2^2 & B_1 d_1^4 - B_2 d_2^4 & B_1 d_1^6 - B_2 d_2^6
\end{bmatrix}    (2.27)

where products such as R_1 d_1^2 are taken element-wise over the vectors of matching points. The ε vector in (2.26) is the error between the corrected image samples, which we want to minimize. The vector a which minimizes the squared error between the corrected RGB values of image 1 and the matching RGB values of image 2 can be obtained with a standard linear least-squares regression:

a = \left( \Psi^T \Psi \right)^{-1} \Psi^T \begin{bmatrix} R_2 \\ G_2 \\ B_2 \end{bmatrix}    (2.28)

After the parameter vector is calculated with (2.28), the final corrected version of image 1 is calculated with (2.24), and the corrected version of image 2 is calculated with (2.23). After correcting these first two images, more images can be corrected using those that have already been corrected. Suppose we are correcting image j, which has an overlapping area with image k that has already been corrected. The corrected version of image j will be generated with (2.24), which should equal the already known corrected pixels of image k:

I_{k,cor} = a_1 I_j + a_2 I_j^2 + a_3 + I_j \left( \alpha_1 d_j^2 + \alpha_2 d_j^4 + \alpha_3 d_j^6 \right)    (2.29)

Based on (2.29), a set of equations equivalent to those of (2.25) can easily be derived (but without the vignetting terms on the left hand side of each equation, since vignetting will already have been corrected in the reference image). Another least squares regression is done as in equation (2.28), only with Ψ modified slightly to account for the fact that there is no vignetting in the already corrected image.
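The regression of equations (2.26)-(2.28) amounts to assembling Ψ from the matched samples and calling a linear solver. A minimal MATLAB sketch, with illustrative variable names, is:

    % Assembling Psi of (2.27) and solving (2.28). R1,G1,B1,R2,G2,B2 are column
    % vectors of matched samples in the overlap, and d1,d2 the corresponding
    % radial distances from each image centre (variable names are illustrative).
    n = numel(R1);
    z = zeros(n,1);  o = ones(n,1);

    vig = @(C1,C2) [C1.*d1.^2 - C2.*d2.^2, ...
                    C1.*d1.^4 - C2.*d2.^4, ...
                    C1.*d1.^6 - C2.*d2.^6];

    Psi = [R1 R1.^2 o  z  z     z  z  z     z  vig(R1,R2);
           z  z     z  G1 G1.^2 o  z  z     z  vig(G1,G2);
           z  z     z  z  z     z  B1 B1.^2 o  vig(B1,B2)];

    a = Psi \ [R2; G2; B2];          % linear least-squares solution of (2.28)

    % a(1:9) hold the per-channel colour weights a_R1..a_B3 and a(10:12) the
    % shared vignetting coefficients alpha_1..alpha_3.

When correcting a later image j against an already corrected image k, the same code applies with the reference-image halves of the last three columns dropped, matching the modified Ψ described above.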
To make the correction more accurate, we can exclude some less reliable points from being used in the regressions. Image registration is usually not accurate to within a pixel, so matching points near edges are less reliable than those in flat image regions. Therefore we do not use matching points that have a high gradient value in either image. We calculate the gradient magnitude at each pixel as:

G(x,y) = \left| I(x+1,y) - I(x-1,y) \right| + \left| I(x,y+1) - I(x,y-1) \right|    (2.30)

If the gradient is above a threshold (we use 10 in our experiments), the matching points are not used in the regressions for calculating the correction parameters.

Consumer cameras now usually capture images with several megapixels of resolution. To lower the computational cost of performing the regressions, a small subset of the pixels in the overlapping region can be used in each regression. In our experiments, we have used 200 matching points in each regression, sampled uniformly in spherical coordinates in the overlapping region.

2.2.2 Results

Sample panoramas before and after colour correction are shown in Figures 2.9 and 2.10. Each panorama is obtained by blending the images with a checkerboard pattern, in order to highlight colour differences between images. We compare our proposed method against the method of Goldman and Chen [52]. Since our goal is to provide visually pleasing panoramas, only subjective comparisons are made.

Figure 2.9: Green lake panorama (a) Original images, (b) Corrected with Goldman's method, (c) Proposed correction method

Figure 2.9 shows images with significant exposure differences, causing the right-most image of the panorama to look much darker than the left-most image. After correcting the images with Goldman's method, the checkerboard pattern is still visible in the grass and the image on the right is still visibly darker than the one on the left (Figure 2.9b). These problems are not visible on the images corrected with the proposed method (Figure 2.9c). Figure 2.10 shows an example with pronounced vignetting in the sky. While Goldman's method reduces the appearance of seams (Figure 2.10b), they are even less visible after applying the proposed method (Figure 2.10c). In addition to providing better colour matching between images, the proposed method requires only linear regressions on a relatively small number of samples, whereas Goldman's method requires a much more complex non-linear optimization [52].

Figure 2.10: Skyline panorama. (a) Original images, (b) Correction with Goldman's method, (c) Proposed correction method

Two more example panoramas are shown in Figures 2.11 and 2.12. Before correction the colour mismatch between images is highly visible, whereas after correction only minor seams are noticeable. Applying blending techniques [99][100] will easily remove the remaining seams, which are mostly due to small alignment problems rather than colour mismatch.

Figure 2.11: Sunset panorama. Original images (top) and after proposed correction (bottom)

Figure 2.12: Mountain panorama. Original images (top) and after proposed correction (bottom)

2.2.3 Conclusions

In this section we have proposed a method for correcting vignetting and colour differences between images being stitched together to form panoramas.
Our method is a modification of our multiview colour correction method described in section 2.1, which includes additional terms for correcting vignetting. Unlike previous methods, which use complex non-linear optimization to solve for the correction parameters, our proposed method uses only linear least squares regressions with a lower number of parameters. This makes our proposed method fast, and avoids problems with convergence that can be encountered with non-linear optimization. Results show that our method effectively corrects for vignetting, exposure and white balance differences between images, producing panoramas with negligible colour mismatch between images.

3 Sharpness Matching in Stereo Images

Stereo matching is a classical problem in computer vision that has been extensively studied [12],[16],[93],[94],[54]-[66]. It has many applications such as 3D scene reconstruction, image based rendering, and robot navigation. Most stereo matching research uses high quality datasets that have been captured under very carefully controlled conditions. However, capturing well-calibrated high quality images is not always possible, for example when cameras are mounted on a robot [101], or simply due to a low cost camera setup being used. Therefore stereo images may differ in brightness, contrast, sharpness, etc. As described in the introduction, the problem of making stereo matching robust to radiometric variations (brightness, contrast, colour) has been well studied (section 1.2.1.4). In comparison, the body of work on sharpness variations is much more limited [65][66].

In this chapter, we propose a fast method for correcting sharpness variations in stereo images. Unlike previous works [65][66], the method is applied as pre-processing before depth estimation. Therefore, it can be used together with any stereo method. Our method takes a stereo image pair as input, and modifies the blurrier image so that it matches the sharper image. This is achieved by scaling the DCT coefficients of both images so that the two images have equal energy in a set of frequency bands. Experimental results show that applying the proposed method can greatly improve the quality of the depth map when there are variable amounts of blur between the two images. The rest of this chapter is organized as follows. The proposed method is described in section 3.1, experimental results are given in section 3.2, conclusions are presented in section 3.3, and a derivation of the optimal attenuation factor is provided in section 3.4.

3.1 Proposed Sharpness Matching Method

When an image is captured by a camera, it may be degraded by a number of factors, including optical blur, motion blur, and sensor noise. Hence the captured image can be modeled as:

\tilde{i}(x,y) = h(x,y) * i(x,y) + n(x,y)    (3.1)

where i(x,y) is the "true" or "ideal" image and h(x,y) is the point spread function (psf) of the capturing process. The '*' operator represents two dimensional convolution. The n(x,y) term is additive noise, which is usually assumed to be independent of the signal and normally distributed with some variance σn². Throughout this chapter, we will use the tilde '~' to denote an observed (and hence degraded) image. The psf, h(x,y), is usually a low-pass filter, which makes the observed image blurred (high frequency details are attenuated).
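For reference, the degradation model of equation (3.1) can be simulated directly with the MATLAB commands used later in this chapter. The sketch below blurs one view with an out-of-focus psf and adds Gaussian noise; the file names are placeholders.

    % Simulating the degradation model (3.1)/(3.2); file names are placeholders.
    iL = im2double(imread('left.png'));       % "true" left and right views
    iR = im2double(imread('right.png'));

    hL = fspecial('disk', 2);                 % out-of-focus psf, radius 2
    sigma_n = sqrt(2) / 255;                  % noise variance of 2 on an 8-bit scale

    iL_obs = imfilter(iL, hL, 'symmetric') + sigma_n * randn(size(iL));
    iR_obs = iR + sigma_n * randn(size(iR));  % right view left unblurred here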
In the case of stereo images, we have left and right images, iL and iR, for which the observed images can be modeled as in equation (3.1):

\tilde{i}_L(x,y) = h_L(x,y) * i_L(x,y) + n_L(x,y)
\tilde{i}_R(x,y) = h_R(x,y) * i_R(x,y) + n_R(x,y)    (3.2)

If the same amount of blurring occurs in both images, i.e., hL and hR are the same, the images may lack detail but they will still be consistent. Therefore, stereo matching will still work reasonably well. We are interested in the case where different amounts of blurring occur in the images, so hL and hR are different. Our method attempts to make the more blurred image match the less blurred image by scaling the DCT coefficients of the images.

The basis for our method is that un-blurred stereo images typically have very similar frequency content, so that the signal energy in a frequency band should closely match between the two images. Therefore, we scale the DCT coefficients in each frequency band so that, after scaling, the image that originally had less energy in the band will have the same amount of energy as the other image. The resulting corrected images will match closely in sharpness, making stereo matching between the images more accurate. The steps of our method are described in detail in the following subsections.

3.1.1 Removing Non-overlapping Edge Regions

Typical stereo image pairs have a large overlapping area between the two images. However, there is usually also a region on the left side of the left image and a region on the right side of the right image that are not visible in the other image (Figure 3.1a). If these non-overlapping areas are removed, the assumption that the two images will have similar frequency content is stronger.

Figure 3.1: Removing non-overlapping edge regions from a stereo image. (a) Left and right original images with edge strips used in search shown in red, and matching regions found through SAD search (equation (3.3)) shown in blue. (b) Images cropped with non-overlapping regions removed.

In order to identify the overlapping region between the two images, we consider two strips: one along the right edge of iL and one along the left edge of iR (see Figure 3.1a, regions highlighted in red). Strips five pixels wide are used in our experiments. We find matching strips in the other image using simple block based stereo matching. Using the sum of absolute differences (SAD) as a matching cost, two SAD values are calculated for each possible disparity d, one for the edge of iL and one for the edge of iR:

SAD_L(d) = \sum_{(x,y) \in edge_L} \left| \tilde{i}_L(x,y) - \tilde{i}_R(x-d, y) \right|
SAD_R(d) = \sum_{(x,y) \in edge_R} \left| \tilde{i}_R(x,y) - \tilde{i}_L(x+d, y) \right|    (3.3)

The disparity value d that minimizes the sum SAD_L(d) + SAD_R(d) is chosen as the edge disparity D. Cropped versions of iL and iR are created by removing D pixels from the left of iL and D pixels from the right of iR (Figure 3.1b). These cropped images, which we will denote iLc and iRc, contain only the overlapping region of iL and iR.

In equation (3.3), we have used the standard sum of absolute differences as the matching cost. If there are variations in brightness between the images, a more robust cost should be used, such as normalized cross correlation or mean-removed absolute differences.
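A minimal MATLAB sketch of this overlap-removal step, continuing the simulation sketch above, is given below. The search range dmax is an assumption, and grayscale double images are assumed.

    % Overlap removal of section 3.1.1 (equation (3.3)).
    [H, W] = size(iL_obs);
    stripL = iL_obs(:, W-4:W);                 % 5-pixel strip, right edge of left image
    stripR = iR_obs(:, 1:5);                   % 5-pixel strip, left edge of right image

    dmax = 64;                                 % assumed maximum disparity
    cost = inf(1, dmax+1);
    for d = 0:dmax
        sadL = sum(sum(abs(stripL - iR_obs(:, W-4-d:W-d))));   % SAD_L(d)
        sadR = sum(sum(abs(stripR - iL_obs(:, 1+d:5+d))));     % SAD_R(d)
        cost(d+1) = sadL + sadR;
    end
    [~, idx] = min(cost);
    D = idx - 1;                               % edge disparity

    iLc = iL_obs(:, D+1:end);                  % remove D columns from the left of iL
    iRc = iR_obs(:, 1:end-D);                  % remove D columns from the right of iR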
3.1.2 Noise Variance Estimation

Noise can have a significant effect on blurred images, particularly in the frequency ranges where the signal energy is low due to blurring. We wish to remove the effect of noise when estimating the signal energy, which requires estimating the noise variance of each image.

We take the two dimensional DCT [103] of the cropped images, which we will denote as \tilde{I}_{Lc}(u,v) and \tilde{I}_{Rc}(u,v). The indices u and v represent horizontal and vertical frequencies, respectively. These DCT coefficients are affected by the additive noise. We can obtain an estimate for the noise standard deviation from the median absolute value of the high frequency DCT coefficients [104]:

\sigma_N = \frac{ \mathrm{median}_{u > u_T,\, v > v_T} \left| \tilde{I}(u,v) \right| }{ 0.6745 }    (3.4)

Values uT and vT are the thresholds above which DCT coefficients are classified as high frequency. We have used 20 less than the maximum values of u and v as the thresholds in our tests, so that 400 coefficients are used when calculating the median in equation (3.4). The reasoning behind equation (3.4) is that the high frequency coefficients are dominated by noise, with the signal energy concentrated in a small number of coefficients. The use of the median function makes the estimator robust to a few large coefficients which represent signal rather than noise. Using equation (3.4), we obtain estimates for the noise standard deviation in both images, σN,L and σN,R.

3.1.3 Division into Frequency Bands

We wish to correct the full left and right images, without the cropping described in section 3.1.1. Therefore, we also need to take the DCT of the original images, \tilde{I}_L(u,v) and \tilde{I}_R(u,v), so that those coefficients can be scaled. This is in addition to taking the DCT of the cropped images, \tilde{I}_{Lc}(u,v) and \tilde{I}_{Rc}(u,v), from which we will calculate the scaling factors.

The DCT coefficients of each image have the same dimensions as the image in the spatial domain. If the width and height of the original images are W and H, the cropped images will have dimensions (W-D)xH. In the DCT domain, the dimensions of the coefficients will also be WxH for the original images and (W-D)xH for the cropped images.

Our proposed method is based on the observation that stereo images typically have very similar amounts of energy in the same frequency ranges. Therefore, we divide the DCT coefficients of both the original and cropped images into a number of equally sized frequency bands, as illustrated in Figure 3.2.

Figure 3.2: Division of DCT coefficients into M frequency bands in each direction, illustrated for M=8.

Each frequency band consists of a set of (u,v) values such that u_i \le u < u_{i+1} and v_j \le v < v_{j+1}, where u_i is the starting index of band i in the horizontal direction and v_j is the starting index of band j in the vertical direction (Figure 3.2). If we use M bands in both the horizontal and vertical directions, then the starting frequency index of each band in the original and cropped images can be calculated as:

u_i = \mathrm{round}\!\left(\frac{iW}{M}\right), \quad v_j = \mathrm{round}\!\left(\frac{jH}{M}\right), \quad u_{i,c} = \mathrm{round}\!\left(\frac{i(W-D)}{M}\right), \quad v_{j,c} = \mathrm{round}\!\left(\frac{jH}{M}\right)    (3.5)

where u_i and v_j are the indices for the original images, and u_{i,c} and v_{j,c} are the indices for the cropped images. Although u_i and u_{i,c} are different numbers, they correspond to the same spatial frequencies.
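Continuing the sketch, the noise estimate of equation (3.4) and the band boundaries of equation (3.5) are straightforward to compute once the DCTs are available. The starts below are 0-based, as in (3.5); add 1 when indexing MATLAB arrays.

    % DCTs, noise estimates (3.4) and band boundaries (3.5).
    ILc = dct2(iLc);    IRc = dct2(iRc);       % cropped images (give scaling factors)
    IL  = dct2(iL_obs); IR  = dct2(iR_obs);    % full images (coefficients to scale)

    hf = abs(ILc(end-19:end, end-19:end));     % 20x20 block of high-frequency coefficients
    sigmaNL = median(hf(:)) / 0.6745;          % equation (3.4), left image
    hf = abs(IRc(end-19:end, end-19:end));
    sigmaNR = median(hf(:)) / 0.6745;          % right image

    M   = 20;                                  % bands per direction (see section 3.2.1)
    ui  = round((0:M) * size(IL,2)  / M);      % band starts, original width   (3.5)
    uic = round((0:M) * size(ILc,2) / M);      % band starts, cropped width
    vj  = round((0:M) * size(IL,1)  / M);      % height is unchanged by cropping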
The number of frequency bands to use in each direction, M, is a parameter that must be decided. If more bands are used, the correction can potentially be more accurate. However, if too many bands are used, each band will contain little energy and therefore the estimate for the scaling factor will be less reliable. We evaluate the impact of the number of bands used experimentally in Section 3.2.1.

The DC coefficient for each image, I_L(0,0) and I_R(0,0), is treated as a special case, because the DC coefficient usually has much more energy than any of the AC coefficients, and has different statistical properties [106]. Therefore, we treat the DC coefficient as a frequency band on its own. When the images are divided into frequency bands as illustrated in Figure 3.2, one additional band is created that contains only the DC coefficient. The size of the band is smaller for the DC coefficient, but otherwise the correction process is carried out in the same way as for the other bands.

3.1.4 DCT Coefficient Scaling

Our assumption is that the true (un-blurred) images should have the same amount of energy in each frequency band. Therefore, we will scale the coefficients in each band so that the left and right images have the same amount of signal energy in each frequency band. The energy of each observed image in band ij can be computed as:

En_{ij}(\tilde{I}) = \sum_{u=u_i}^{u_{i+1}-1} \sum_{v=v_j}^{v_{j+1}-1} \tilde{I}(u,v)^2    (3.6)

We wish to remove the effect of noise from the energy calculated with (3.6). Let us define HI(u,v) as the DCT of the blurred signal h(x,y) * i(x,y), and define N(u,v) as the DCT of the noise. Since we are using an orthogonal DCT, N(u,v) is also normally distributed with zero mean and variance σN². Given that the noise is independent of the signal and zero mean, we can calculate the expected value of the energy of the observed signal:

E\!\left[\tilde{I}(u,v)^2\right] = E\!\left[\left(HI(u,v) + N(u,v)\right)^2\right] = E\!\left[HI(u,v)^2\right] + 2\,E\!\left[HI(u,v)\right]E\!\left[N(u,v)\right] + E\!\left[N(u,v)^2\right] = E\!\left[HI(u,v)^2\right] + \sigma_N^2    (3.7)

Summing the above relation over all the DCT coefficients in a frequency band gives:

\sum_{u=u_i}^{u_{i+1}-1} \sum_{v=v_j}^{v_{j+1}-1} E\!\left[\tilde{I}(u,v)^2\right] = \sum_{u=u_i}^{u_{i+1}-1} \sum_{v=v_j}^{v_{j+1}-1} E\!\left[HI(u,v)^2\right] + \sum_{u=u_i}^{u_{i+1}-1} \sum_{v=v_j}^{v_{j+1}-1} \sigma_N^2 = En_{ij}(HI) + C_{ij}\,\sigma_N^2    (3.8)

where En_{ij}(HI) is the energy of the blurred signal in frequency band ij and C_{ij} is the number of coefficients in the band. The left hand side of equation (3.8) can be estimated with the observed signal energy calculated with equation (3.6). Therefore, we can estimate the blurred signal energy in the band with:

En_{ij}(HI) = \max\left(0,\; En_{ij}(\tilde{I}) - C_{ij}\,\sigma_N^2\right)    (3.9)

In (3.9) we have clipped the estimated energy to zero if the subtraction gives a negative result, since the energy must be positive (by definition). Using (3.9) we estimate the signal energies En_{ij}(HI_L) and En_{ij}(HI_R). We wish to multiply the coefficients in the image with less energy by a gain factor (G_{ij}) so that this image ends up having the same amount of signal energy as the other image. The scale factors to apply to each image in this band can be found as:

En_{ij,max} = \max\left(En_{ij}(HI_L),\; En_{ij}(HI_R)\right)    (3.10)

G_{ij,L} = \sqrt{\frac{En_{ij,max}}{En_{ij}(HI_L)}}, \qquad G_{ij,R} = \sqrt{\frac{En_{ij,max}}{En_{ij}(HI_R)}}    (3.11)

Note that either G_{ij,L} or G_{ij,R} will always be one.
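For given band indices i and j, the per-band computation can be sketched as follows, continuing the code above. The attenuation factor A of equation (3.13), described in the next subsection, is included here so the per-band computation is complete; sigmaNR is assumed to be estimated in the same way as sigmaNL.

    % Energy, gain and attenuation for one band (i,j) of the cropped DCTs,
    % following (3.6) and (3.9)-(3.11).
    uc = uic(i)+1 : uic(i+1);                  % columns of band i (cropped DCT)
    vc = vj(j)+1  : vj(j+1);                   % rows of band j

    EnL = sum(sum(ILc(vc, uc).^2));            % observed band energy (3.6)
    EnR = sum(sum(IRc(vc, uc).^2));
    Cij = numel(uc) * numel(vc);

    EnL = max(0, EnL - Cij * sigmaNL^2);       % estimated signal energy (3.9)
    EnR = max(0, EnR - Cij * sigmaNR^2);

    EnMax = max(EnL, EnR);                                   % (3.10)
    GL = sqrt(EnMax / max(EnL, eps));                        % gains (3.11)
    GR = sqrt(EnMax / max(EnR, eps));                        % (eps guards 0/0)

    EnMin  = min(EnL, EnR);
    sigMin = (EnL <= EnR)*sigmaNL + (EnL > EnR)*sigmaNR;     % noise of the weaker image
    A = EnMin / (EnMin + Cij * sigMin^2);                    % Wiener-style factor (3.13)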
The gain factors calculated with (3.11) do not consider the effect of noise. If the signal energy is very low, the gain will be very high, and noise may be amplified excessively (this is a common issue in deblurring methods [105]). To prevent noise amplification from corrupting the recovered image, we multiply the gain by an attenuation factor, denoted A, which lowers the gain applied to the band based on the signal to noise ratio. Ideally, there would be no noise in the images, and we would be able to calculate the scaled DCT coefficients as G·HI(u,v). With noise, the scaled coefficients will actually be G·A·(HI(u,v) + N(u,v)). We choose the attenuation factor (A) to minimize the squared error between the ideal coefficients and the actual coefficients, i.e.,

\min_A \; E\!\left[\left(G \cdot HI(u,v) - G \cdot A\left(HI(u,v) + N(u,v)\right)\right)^2\right]    (3.12)

The value of A that minimizes (3.12) is given by:

A_{ij} = \frac{En_{ij,min}}{En_{ij,min} + C_{ij}\,\sigma_{N,min}^2}    (3.13)

where En_{ij,min} = \min\left(En_{ij}(HI_L),\; En_{ij}(HI_R)\right) and \sigma_{N,min}^2 is the noise variance of the image that has less signal energy in the band. We provide a derivation of (3.13) at the end of the chapter, in section 3.4. Note that the attenuation factor in equation (3.13) is basically the classic Wiener filter [105][107].

The scaling factors G and A are calculated based on the DCT coefficients of the cropped images (because the assumption of equal signal energy in each frequency band is stronger for the cropped images). So equations (3.6) through (3.13) are all applied only to the DCT coefficients of the cropped images. Once G and A are calculated for a frequency band ij, we scale the DCT coefficients of the original images:

I_{L,cor}(u,v) = G_{ij,L} \cdot A_{ij} \cdot \tilde{I}_L(u,v), \qquad I_{R,cor}(u,v) = G_{ij,R} \cdot A_{ij} \cdot \tilde{I}_R(u,v), \qquad \text{for } u_i \le u < u_{i+1},\; v_j \le v < v_{j+1}    (3.14)

Note that we apply the same attenuation factor to the coefficients from both the left and right images. This ensures that the corrected images will have the same amount of signal energy in the band, at the expense of blurring the sharper image somewhat. However, unless the noise variance is very high in the blurred image, the sharper image will not be affected much.

After calculating all of the scaling factors and applying equation (3.14) for every frequency band, we will have the complete DCT coefficients of the corrected images I_{L,cor}(u,v) and I_{R,cor}(u,v). Then we simply take the inverse DCT to obtain the final corrected images in the spatial domain.
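Putting the pieces together, the correction of equation (3.14) is a loop over bands followed by an inverse DCT. The helper band_factors below is a hypothetical wrapper around the per-band code sketched earlier, not a function from this thesis, and for brevity this sketch folds the DC coefficient into the first band rather than treating it as its own band.

    % Scaling the full-size DCT coefficients band by band (3.14) and inverting.
    ILcor = IL;  IRcor = IR;
    for i = 1:M
        for j = 1:M
            [GL, GR, A] = band_factors(ILc, IRc, uic, vj, i, j, sigmaNL, sigmaNR);
            ur = ui(i)+1 : ui(i+1);            % band extent in the full-size DCTs
            vr = vj(j)+1 : vj(j+1);
            ILcor(vr, ur) = GL * A * IL(vr, ur);    % equation (3.14)
            IRcor(vr, ur) = GR * A * IR(vr, ur);
        end
    end
    iL_cor = idct2(ILcor);                     % corrected images, spatial domain
    iR_cor = idct2(IRcor);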
3.2 Experimental Results

We test our proposed method on 10 stereo image pairs from the Middlebury stereo page [108] that have ground truth disparities obtained through a structured light technique [102]. Thumbnails of the test images are shown in Figure 3.3. The 2005 and 2006 data sets from the Middlebury page (art, laundry, moebius, reindeer, aloe, baby1, rocks) all have seven views; we used the one-third size versions of views 1 and 3 in our tests.

Figure 3.3: Test images used in experiments (the left image of each stereo pair is shown). In reading order: Tsukuba, teddy, cones, art, laundry, moebius, reindeer, aloe, baby1, and rocks.

Our sharpness correction method is a pre-processing step performed before stereo matching, and therefore it can be used together with any stereo method. We test our method together with two representative stereo algorithms: one simple window based matching method, and one global method that solves an energy minimization problem using Belief Propagation (BP) [109].

Our window stereo method involves first performing block matching with a 9x9 window using the sum of absolute differences (SAD) as the matching cost. A single disparity is chosen for each pixel that has the minimum SAD (winner take all). A left-right cross check is done to invalidate occluded pixels and unreliable matches [55], and disparity segments smaller than 160 pixels are eliminated [110]. Invalid pixels are interpolated by propagating neighbouring background disparity values. Our window based method is similar to the window based method used for the comparative study in [58]. The global belief propagation (BP) method solves a 2D energy minimization problem, taking into account the smoothness of the disparity field. We refer readers to [109] for the details of the BP method.

Two kinds of blurring filters are tested: out-of-focus blur, which is modeled with a disk filter of a given radius [111], and linear motion blur, which is modelled as an average of samples along a straight line of a given length. We use the MATLAB commands fspecial('disk',…) and fspecial('motion',…) to generate the blurring filters.

The performance metric we use is the percentage of 'bad' pixels in the disparity map in the un-occluded regions of the image. A bad pixel is defined as one where the calculated disparity differs from the true disparity by more than one pixel. This is the most commonly used quality measure for disparity maps, and it has been used in major studies such as [16] and [59].

3.2.1 Impact of the Number of Frequency Bands

In this section we evaluate how the performance of our method is affected by the number of frequency bands used in each direction (M), as described in section 3.1.3. We filtered the left image of each stereo pair with a disk filter of radius 2 (simulating an out-of-focus image) and added white Gaussian noise with variance 2. The right image was left unmodified. Then, we corrected the stereo pair using our algorithm a number of times, with the value of M ranging from 2 to 80. Finally, we ran the BP stereo matching method on the corrected images. The number of bad pixels versus the value of M is plotted in Figure 3.4 for the ten test image pairs, together with the average error across the ten image pairs.

Figure 3.4: Errors in disparity maps as a function of the number of frequency bands used

From Figure 3.4, we can see that the amount of errors is generally higher when the number of bands is very low (2-6) or very high (50+). For all of the test images the number of errors is steady and near minimum when M is in the range 10 to 30. The average number of errors is minimized when M is 20. Furthermore, choosing M=20 gives very close to optimal results for all 10 test images. We have done similar tests with other blur filters and different levels of blurring, and in all cases the results were similar to those shown in Figure 3.4 (i.e., the minimum was at or near M=20, and the curves were flat in a large range around the minimum). Therefore we use M=20 in the rest of our tests.

3.2.2 Disparity Map Improvement for Blurred Images

Here we compare the quality of depth maps obtained using our proposed method relative to performing stereo matching directly on the blurred images. In each test, the right image was left unfiltered, while the left image was blurred with either a disk filter (simulating out-of-focus blur) or a linear motion blur filter at 45 degrees to the x-axis.
We tested out-of-focus-blur with radii of 0, 1, 2 and 3 pixels and motion blur with lengths of 2, 3 and 4 pixels. A larger radius or length means the image is blurred more. A blur radius of zero means the image is not blurred at all, i.e., the filter is an impulse response and convolving it with the image leaves the image unaltered. White Gaussian noise with a variance of 2 was added to all of the blurred images (which is typical of the amount of noise found in the original images). Figure 3.5 shows the Tsukuba image blurred with all of the filters tested, to give the reader an idea of how severe the blurring is in different tests.  89  Figure 3.5: Demonstration of blurring filters used in our tests on the Tsukuba images (a) Original left image (b) Original right image, (c)-(g) Left image blurred with: (c) Out-offocus blur, radius 1 (d) Out-of-focus blur, radius 2 (e) Out-of-focus blur, radius 3 (f) motion blur, length 2 (g) motion blur, length 3 (h) motion blur, length 4. The blurring filter is illustrated in the top left corner of each image.  Tables 3.1 through 3.4 show the percentage of errors in the disparity maps obtained with different levels of blurring, with and without the proposed correction. The second column of each table (images) shows whether stereo matching was performed on either the blurred left image and original right image (the “blurred” case), or on the left-right pair obtained by applying our proposed method (the “corrected” case). Table 3.1 gives results for out-of-focus blur and the Belief Propagation stereo method, Table 3.2 for out90  of-focus blur and the window stereo method, Table 3.3 for motion blur and the Belief Propagation stereo method, and Table 3.4 for motion blur and the window stereo method. Radius 0 1 2 3  Images  Tsukuba  Teddy  Cones  Art  Laundry  Moebius  Reindeer  Aloe  Baby1  Rocks  Average  Blurred  2.0  14.8  9.7  6.8  13.2  7.9  5.9  2.8  1.6  3.8  6.8  Corrected  2.2  12.2  5.0  7.3  13.4  6.1  3.7  2.9  1.5  3.7  5.8  Blurred  3.3  16.6  12.9  8.9  13.0  8.0  8.0  3.9  1.7  4.2  8.1  Corrected  2.6  12.6  5.5  8.1  11.9  6.6  5.2  3.4  1.5  3.5  6.1  Blurred  6.4  28.1  28.4  13.2  18.6  12.4  15.3  8.4  6.2  7.0  14.4  Corrected  3.9  15.5  6.5  9.5  11.5  7.7  7.0  4.3  1.7  4.6  7.2  Blurred  11.0  39.6  55.3  26.8  33.1  26.5  24.5  32.1  25.8  16.3  29.1  Corrected  6.0  24.9  15.6  14.0  15.8  11.7  16.2  7.7  2.6  6.6  12.1  Table 3.1: Percentage of errors in disparity maps with belief propagation stereo method, out-of-focus blurring Radius 0 1 2 3  Images  Tsukuba  Teddy  Cones  Art  Laundry  Moebius  Reindeer  Aloe  Baby1  Rocks  Average  Blurred  5.3  16.3  13.4  13.3  19.5  13.5  10.4  5.6  4.8  5.3  10.7  Corrected  5.4  14.4  8.4  15.0  19.5  12.2  7.4  5.8  4.9  6.7  10.0  Blurred  7.7  19.7  18.2  16.7  23.3  13.7  13.2  6.4  6.0  5.7  13.1  Corrected  8.0  17.4  8.4  16.3  17.8  12.9  11.9  5.9  5.1  5.1  10.9  Blurred  9.8  31.1  35.0  25.1  29.7  21.0  36.6  10.7  14.2  12.6  22.6  Corrected  8.8  23.9  10.1  19.4  18.7  15.4  26.4  7.0  6.7  8.4  14.5  Blurred  17.8  49.9  64.0  39.1  47.9  42.2  52.3  29.5  56.7  30.8  43.0  Corrected  10.6  44.3  31.4  27.2  33.7  24.1  42.6  12.6  13.4  13.2  25.3  Table 3.2: Percentage of errors in disparity maps with window stereo method, out-of-focus blurring Length 2 3 4  Images  Tsukuba  Teddy  Cones  Art  Laundry  Moebius  Reindeer  Aloe  Baby1  Rocks  Average  Blurred  2.7  16.2  12.1  8.0  13.0  7.7  7.7  3.5  1.7  4.2  7.7  Corrected  2.6  12.5  5.3  8.1  12.2  5.5  5.4  
3.0  1.4  3.6  6.0  Blurred  3.0  18.4  15.6  9.0  14.6  8.5  9.6  4.6  1.9  5.4  9.1  Corrected  2.6  13.1  5.9  8.1  10.8  6.0  5.6  3.6  1.6  4.3  6.2  Blurred  3.6  21.1  18.3  10.4  15.3  9.2  11.0  5.8  3.1  5.8  10.3  Corrected  3.0  14.5  6.4  8.5  10.9  7.1  6.4  4.1  1.6  4.1  6.7  Table 3.3: Percentage of errors in disparity maps with belief propagation stereo method, linear motion blur  Length 2 3 4  Images  Tsukuba  Teddy  Cones  Art  Laundry  Moebius  Reindeer  Aloe  Baby1  Rocks  Average  Blurred  6.7  18.8  16.8  15.6  22.0  13.6  11.5  6.0  5.5  5.5  12.2  Corrected  6.7  16.3  8.5  16.0  19.3  12.5  9.2  5.9  4.7  5.0  10.4  Blurred  7.6  20.8  19.2  17.7  23.6  14.4  16.2  6.7  6.4  7.5  14.0  Corrected  6.5  17.2  8.9  17.0  17.2  13.0  15.3  6.1  5.1  5.8  11.2  Blurred  7.9  22.3  23.8  20.2  27.0  16.1  26.9  7.8  8.5  10.6  17.1  Corrected  7.2  19.8  10.3  17.4  20.2  13.5  21.7  6.7  5.5  7.4  13.0  Table 3.4: Percentage of errors in disparity maps with window stereo method, linear motion blur  91  From Tables 3.1 through 3.4, we can see there is a substantial reduction in the number of errors in the disparity maps when the proposed correction is used, particularly when the amount of blurring in the left image is high. For the case of out-of-focus blur with a radius of 3 and the BP stereo method, the average number of errors is reduced from 29.1% to 12.1% using our proposed method. For all images, there is some improvement when the proposed correction is used if one image is blurred. Even when neither image is blurred (the out-of-focus, zero radius case in Tables 3.1 and 3.2), there is some improvement on average when applying the proposed method. Using the Belief Propagation stereo method the average amount of errors is reduced from 6.8% to 5.8%, and for the window stereo method the average error is reduced from 10.7% to 10.0%, when neither image is blurred at all. One possible reason for this improvement is that through equation (3.13), noise is filtered from both images (the effect is similar to using a Weiner filter to remove noise [107]). Another possible reason is that the original images may have slightly different levels of sharpness that the proposed method can correct. Comparing Tables 4.1 to Table 4.3, and Table 4.2 to Table 4.4, we can see that our correction method gives larger gains for out-of-focus blurring than for motion blur. This is because modifying the DCT coefficients can only correct for magnitude distortion, not phase distortion [112]. Since the out-of-focus blur filter is symmetric, it is zero-phase, and therefore introduces no phase distortion and can be accurately corrected for in the DCT domain.  A linear motion blur filter is in general not symmetric, and hence  introduces phase distortion. Modifying the DCT coefficients can correct the magnitude 92  distortion caused by motion blur, but the corrected image will still suffer from phase distortion. Consequently, the proposed method provides some improvement for motion blur but not as much as it does for out-of-focus blur. An example demonstrating the subjective visual quality of the corrected images and resulting disparity maps is shown in Figure 3.6. The blurred left image and original right image of the cones stereo pair is shown, along with the result of correcting the image pair with our method. The disparity maps obtained based on the blurred pair and corrected pair are also shown. 
Comparing (c) and (f) in Figure 3.6, we can see that the proposed method greatly reduces the errors in the disparity map, and produces more accurate depth edges. We can also see that the corrected images, (d) and (e), are perceptually closer in sharpness than the blurred images, (a) and (b). Therefore, the proposed method may also be useful in applications such as free-viewpoint TV [113] and other multiview imaging scenarios, for making the subjective quality of different viewpoints more uniform.

Figure 3.6: Example of cones image pair before and after correction. (a) Blurred left image (b) Original right image (c) Disparity map obtained from images a and b with belief propagation (d) Corrected left image (e) Corrected right image (f) Disparity map obtained from corrected images with belief propagation

3.2.3 Complexity

Take the dimensions of the original left and right images as WxH. Removing the non-overlapping areas requires O(H) operations per candidate disparity, since the SAD of (3.3) is computed over strips of fixed width. We need to take four DCT's (of both original and cropped images). There are many fast algorithms for computing DCT's, which have O(N log(N)) complexity [114], where N is the number of pixels in the image, N = W·H. The number of operations required for the energy calculations and coefficient scaling is linear in the number of pixels, and therefore these steps have O(N) complexity. Finally, two inverse DCT's must be performed to generate the corrected images in the spatial domain, which again have O(N log(N)) complexity. Overall, our proposed method has O(N log(N)) complexity, with the DCT's and inverse DCT's taking the large majority of the running time. Many fast algorithms, as well as efficient software and hardware implementations, have been developed for performing the DCT and similar transforms [114].

We have implemented our proposed method in C code, and the running time for the Tsukuba images is 62 ms on an Intel Core 2 E4400 2 GHz processor under Windows XP. Using simple parallel processing, such as taking the DCT's of the left and right images in parallel, could provide significant speedup. Therefore, the proposed method is fast enough to be used in real-time systems at reasonable frame rates.

3.3 Conclusion

In this chapter we have proposed a pre-processing method for correcting sharpness variations in stereo image pairs. We modify the more blurred image to match the sharpness of the less blurred image as closely as possible, taking noise into account. The DCT coefficients of the images are divided into a number of frequency bands, and the coefficients are scaled so that the images have the same amount of signal energy in each band. Experimental results show that applying the proposed method before estimating disparity on a stereo image pair can significantly improve the accuracy of the disparity map compared to performing stereo matching directly on the blurred images.

3.4 Derivation of Optimal Attenuation Factor

Here we provide a detailed derivation for the value of the attenuation factor A given in equation (3.13). We wish to find the value of A which will minimize the difference between the desired scaled coefficients, G·HI(u,v), and the noisy scaled coefficients, G·A·(HI(u,v) + N(u,v)).
The expected square error is:

E\!\left[\left(G \cdot HI(u,v) - G \cdot A\left(HI(u,v) + N(u,v)\right)\right)^2\right]    (3.15)

Fully expanding gives:

E\!\left[ G^2 HI(u,v)^2 - 2G^2 A\, HI(u,v)^2 - 2G^2 A\, HI(u,v)N(u,v) + G^2 A^2 HI(u,v)^2 + 2G^2 A^2 HI(u,v)N(u,v) + G^2 A^2 N(u,v)^2 \right]    (3.16)

Since the noise is zero mean and independent of the signal, the third and fifth terms in (3.16) are zero, so the equation reduces to:

E\!\left[ G^2 HI(u,v)^2 - 2G^2 A\, HI(u,v)^2 + G^2 A^2 HI(u,v)^2 + G^2 A^2 N(u,v)^2 \right]    (3.17)

To find the minimum, we take the derivative with respect to A and set it to zero:

\frac{d}{dA} = -2G^2 E\!\left[HI(u,v)^2\right] + 2G^2 A\, E\!\left[HI(u,v)^2\right] + 2G^2 A\, E\!\left[N(u,v)^2\right] = 0    (3.18)

Solving for A yields:

A = \frac{E\!\left[HI(u,v)^2\right]}{E\!\left[HI(u,v)^2\right] + E\!\left[N(u,v)^2\right]}    (3.19)

The expected value E[N(u,v)²] is simply the noise variance. We can estimate the expected energy of an individual signal coefficient, E[HI(u,v)²], as the total signal energy in the band, calculated with equation (3.9), divided by the number of coefficients in the band (C_{ij}):

A = \frac{En(HI)/C_{ij}}{En(HI)/C_{ij} + \sigma_N^2}    (3.20)

Rearranging gives the result of equation (3.13).

4 Crosstalk Cancellation in 3D Video with Local Contrast Reduction

The past two chapters have dealt with capturing problems in 3D video that arise from the use of multiple cameras. In this chapter we move to a display issue, namely crosstalk in 3D displays. As mentioned in the introduction, crosstalk is one of the biggest complaints about 3D displays, particularly for displays using active or polarized glasses.

Subtractive crosstalk cancellation (section 1.3.2) can reduce the appearance of crosstalk by lowering the image levels by the anticipated amount of crosstalk. However, to be effective, subtractive cancellation requires that the minimum level of the input images be raised, to give some foot-room in the signal to ensure that the subtraction does not give negative values. In most previous works, the image levels are raised throughout the whole image by compressing their range. Recently, the company RealD has applied for a patent on an alternative method, where patches of luminance are instead added only to regions of the 3D image that would have uncorrectable crosstalk without the added signal [88]. However, their method as described in [88] operates on a frame-by-frame basis, so it may introduce flickering or sudden jumps/drops in brightness.

In this chapter we present a new method for locally adding patches of luminance to a 3D video that explicitly considers temporal consistency. Our method involves detecting regions with crosstalk problems in each frame, matching these regions with regions in neighbouring frames, and smoothing the region intensities over time to prevent flicker. We also add fade-ins and fade-outs to prevent sudden jumps in brightness. Our proposed method is described in Section 4.1. Experimental results, presented in Section 4.2, show that our method allows effective crosstalk cancellation without introducing temporal artifacts. At the end of the chapter we discuss how the method could be modified for different applications (section 4.3), and provide a summary of the chapter in section 4.4.

4.1 Proposed Method

Here we will use the linear crosstalk model, based on equation (1.7), as an example.
With the linear model, the following images are input to the display in subtractive crosstalk cancellation (repeated from section 1.3.2):

i_{L,disp}(x,y) = \frac{i_L(x,y) - c \cdot i_R(x,y)}{1 - c^2}, \qquad i_{R,disp}(x,y) = \frac{i_R(x,y) - c \cdot i_L(x,y)}{1 - c^2}    (4.1)

From equation (4.1), we can see that the subtraction will give a negative result if i_L(x,y) < c·i_R(x,y) or i_R(x,y) < c·i_L(x,y). In these cases one of the displayed images would have a negative value with (4.1), which obviously cannot be displayed. The pixel would need to be clipped to zero, and crosstalk would not be fully cancelled out.

More generally, we can refer to the two images as the signal image (i_S) and the crosstalk image (i_C). To prevent negative values from occurring when applying equation (4.1), we need the condition:

i_S(x,y) \ge c \cdot i_C(x,y)    (4.2)

Condition (4.2) basically says that there needs to be more light from the signal image than from crosstalk for subtractive cancellation to work.

In order to reduce ghosting, we need to raise the levels of the image in regions where crosstalk is visible and there is insufficient luminance to subtract from (i.e., pixels that do not meet condition (4.2)). We follow the idea from [88], where patches of "disguising" luminance are added to the image regions where crosstalk cannot be corrected and is likely to be visually disturbing unless the image levels are raised. Since the patches of luminance are very smooth, they will be less noticeable and disturbing than visible crosstalk. Each image is altered by adding a smooth signal:

\hat{i}(x,y) = i(x,y) + \phi(x,y)    (4.3)

Here i(x,y) represents any one of the red, green and blue colour channels of the image. The same signal φ(x,y) is added to all channels in order to avoid altering the colours of the pixels. Two versions of φ(x,y) need to be generated: one for the left image and one for the right image. To construct these signals we need to determine the regions of each image where crosstalk cannot be compensated for using equation (4.1), and then generate a smooth signal that will raise the luminance in those regions enough for effective crosstalk compensation. For clarity, we will first describe our algorithm for still images and then describe its extension to video.
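For reference, subtractive cancellation with the linear model, and the clipping that motivates condition (4.2), can be sketched as follows (linear-light images in [0,1] are assumed):

    % Subtractive cancellation (4.1) and the clipping that motivates (4.2).
    c = 0.05;                                  % 5% crosstalk, as in section 4.2

    iLdisp = (iL - c*iR) / (1 - c^2);          % equation (4.1)
    iRdisp = (iR - c*iL) / (1 - c^2);

    uncorrectable = (iL < c*iR) | (iR < c*iL); % pixels violating condition (4.2)
    iLdisp = max(iLdisp, 0);                   % clipping leaves residual crosstalk
    iRdisp = max(iRdisp, 0);                   %   exactly in those pixels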
In many of these pixels the crosstalk may not be visually noticeable and thus raising the luminance in those areas would be unnecessary. Therefore, we apply additional processing to determine which areas of the image need luminance added to them. First we set to zero all the pixels in R(x,y) that are below a threshold. We use 1% of the maximum display luminance, i.e. 0.01 with the images represented as linear intensities in the range [0,1]. Then we remove small regions, which are less visually noticeable, by eroding and dilating with a circular mask of 8 pixels (as in done in [88]). We then divide the signal R(x,y) into a number of connected regions with the binary labeling algorithm presented in [115] (Fig. 4.1c). For each of these regions we will add a patch of smoothly varying luminance that has its maximum value at the pixels in the  101  region and gradually decreases for the surrounding pixels based on their distance from the region. The patch for one region is calculated as:     j  x, y   M j  max 0,  w  d  x, y    w   (4.6)  In equation (4.6), j represents the region label, Mj is the maximum value of R(x,y) of any of the pixels in the region, w is the width of the transition region, and d(x,y) is the distance of pixel (x,y) from the region. The distance can be calculated efficiently in O(n) time with the algorithm described in [116]. Equation (4.6) uses a linear ramp for the transition region but other smoothly decreasing shapes would also work, such as a Gaussian or sigmoid. For HD resolution videos, an appropriate transition width (w) is 200 pixels. A larger width will result in the transition being less noticeable, but also a larger portion of the image having lower contrast. Since we are adding the patch of luminance given by equation (4.6) to one of the images (left or right) in the stereo 3D pair, we have to add a corresponding patch to the other image to prevent retinal rivalry (objects appearing different in each eye). To do this, we first calculate the centre-of-mass of the region and then perform block matching with a large block size (e.g., 16 pixels) to estimate the disparity at the centre of the region. Then, we create a copy of the signal  j x, y  shifted in the x-direction by the estimated disparity that will be added to the other image. An example of this is shown in Figure 4.1 (d) and (e). In Figure 4.1d, the patches  j  x, y  are shown for the left and right images. In Figure 4.1e shows the patches from the other image shifted by the estimated disparity for each region. 102  Finally, the smooth signal   x, y is added to the input images (equation (4.3)). Then, conventional crosstalk cancellation can be applied with equation (4.1), and it will be more effective since the images now have been raised in areas that suffer from noticeable crosstalk.  If required (for displaying purposes), the final images can be  converted from the linear space to a gamma encoded 8-bit space. Figure 4.1 illustrates the steps of our method for still images on an example 3D pair.  103  Figure 4.1: The steps of our method for still images. 
(a) Original images (left and right) (b) Amount each pixel needs to be raised, calculated with equations (4.4) and (4.5), (c) Labelled regions that remain after thresholding, erosion and dilation (d) Luminance patches for the above regions, calculated with equation (4.6), (e) Patches from the other image, shifted by the estimated disparity for each region (f) The final smooth signal that will be added to each image (pixel-wise maximum of the previous two signals) (g) The output images with the added patches of luminance  4.1.2 Extension to Video Sequences In the previous section, we have described an algorithm for adding local patches of luminance to images to ensure there is enough ‘foot room’ for effective crosstalk cancellation. If this method were applied to video sequences on a frame by frame basis, annoying flickering would occur as temporal consistency is not considered. In this section of the paper, we describe an extension of the previous method to video sequences, which includes temporal filtering of the patches, removing patches of short temporal 104  duration, and fading patches in and out when regions of uncorrectable crosstalk appear or disappear over the course of the video. We will describe a non-causal version of the algorithm that assumes the entire video is available during processing, but later we will comment on how it could be adapted for a causal real-time application. First we detect significant regions of uncorrectable crosstalk in every frame of the video, following the same process as in the still image case. That is, equations (4.4) and (4.5) are applied, followed by thresholding, erosion, and dilation. For each detected region, we calculate its centre-of-mass and the amount the region needs to be raised (the value of Mj in equation (4.6) for the region). Next we match regions in temporally adjacent frames. This is done with two passes through the frames of the video, a forward pass to link each region in the following frame, followed by a backward pass to check for regions that do not have a link to the previous frame. Starting at the first frame and progressing through every frame in the video, we match regions with regions in the next frame. For each region in frame N we compare its centre to those of regions in frame N+1. If the distance between the centre of the region in frame N and the centre of a region in frame N+1 is less than 20 pixels, the regions are considered a match and ‘linked’ together. We implement this linking by having a data structure for each region that contains pointers to the matching regions in the next and previous frames. If more than one region meets the matching criterion, we choose the one with the lowest distance from centre to centre. If no match is found in frame N+1 for a region in frame N, we make a copy of the region in frame N+1. This copying serves 105  two purposes; it allows us to fade out luminance patches, preventing a visually noticeable sudden drop in luminance, and it also sometimes allows us to fill in temporal gaps for regions that are not detected in one or more frames, but then later are detected again. When copying a region from frame N into frame N+1, we decrease its value for Mj by a small amount so that the region will fade out over time if it does not appear again in the video (we use 0.1% of the display luminance, i.e. 0.001, which typically makes regions fade out over 2-3 seconds for moderate amounts of crosstalk). 
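A sketch of this forward linking pass is given below. It assumes each frame's detected regions are stored as a struct array with 'centre', 'M' and 'next' fields (initially empty); the field and variable names are illustrative only, not from the thesis implementation.

    % Forward linking pass over the detected regions of each frame.
    for N = 1:numFrames-1
        for r = 1:numel(regions{N})
            ctr = regions{N}(r).centre;
            d = arrayfun(@(s) norm(s.centre - ctr), regions{N+1});
            [dmin, best] = min([d(:); inf]);   % inf handles an empty next frame
            if dmin < 20
                regions{N}(r).next = best;     % link to the closest region
            else
                cpy = regions{N}(r);           % no match: copy into frame N+1
                cpy.M = max(0, cpy.M - 0.001); % fade out by 0.1% per frame
                cpy.next = [];
                regions{N+1}(end+1) = cpy;
                regions{N}(r).next = numel(regions{N+1});
            end
        end
    end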
Next, we eliminate regions that have a very short temporal duration, as viewers are not likely to notice crosstalk that appears for a short amount of time. To achieve this, we count how many frames each group of regions was linked to in the previous stage (not counting any copied regions). If the count for a group of temporally linked regions is less than one second worth of frames (i.e., 30 for 30 fps video), then we delete all the regions in the group.

Afterwards, we identify regions that appear for the first time midway through the video. To prevent a sudden jump in luminance when a region first appears in time, we apply a fade-in to these regions. To achieve this, we perform a backward pass through the frames of the video, checking for regions in frame N that do not have a backward link to a region in frame N−1. If any such regions are found, we copy the region into the previous frames, first into frame N−1, then N−2, and so on. Each time a region is copied, its value for Mj is decreased by a small amount so that the region will fade in. As with the fade-out case discussed earlier, we use 0.1% of the display luminance for the step size of decreasing Mj, to achieve a fade transition time of 2-3 seconds.

At this stage, we have a series of regions that are linked temporally to regions in other frames. To prevent flickering, we apply temporal filtering to ensure that the amount each region is raised by (Mj) is consistent between frames and changes very slowly over time. Based on the temporal frequency response of the human visual system, applying a low pass filter with cut-off frequency 0.5 Hz is sufficient to eliminate flicker in video [117]. Therefore, we design an 80-tap low pass filter with cut-off frequency 0.5 Hz, denoted l(N), which can easily be done with the MATLAB 'fir1' command. We apply this filter to the values of Mj(N) for each region over time to generate filtered region intensities, Mj,filt(N):

M_{j,filt}(N) = M_j(N) * l(N)    (4.7)

In (4.7), '*' denotes 1D convolution. After filtering, we calculate a patch of luminance for each region using equation (4.6), only this time using the temporally filtered versions, Mj,filt(N). As in the still image case, we perform a disparity search using block matching for each region, and generate a shifted version of each patch for the other image (left or right). To save computations, a smaller disparity search can be used for most frames, using the disparity of the region in the previous frame as the initial estimate (and searching, for example, +/− 2 pixels).

The rest of the process is the same as described for the still image case. The final version of φ(x,y) for each frame is calculated by taking the maximum of all the individual patches, and is then added to each image. After that, conventional crosstalk cancellation can be performed. If required, the image can be converted from the linear space to a gamma encoded representation.

4.2 Results

We tested our method on several stereo videos captured with a pair of parallel cameras. The capturing setup is described in detail in [118]. The videos had a resolution of 1280x720p and were 30 fps. In Figure 4.2, we give an example of how a frame will look using crosstalk cancellation and different methods for raising the image levels. In Figure 4.2a, we show the original left view of one frame of video. Figure 4.2b shows what the left view will look like with 5% crosstalk from the right view, using crosstalk cancellation based on equation (4.1).
Crosstalk from the right view results in an extremely annoying double edge appearing along the person’s left side. Since the image levels are low across the person’s black clothing, there is not enough light available when performing the cancellation. Raising the minimum image level globally, as shown in Figure 4.2c, results in effective crosstalk cancellation, but at the expense of lowering the contrast and hence degrading the image quality. Our proposed method, where patches of luminance are added locally to regions that suffer from crosstalk, results in effective crosstalk cancellation while retaining better image contrast, as seen in Figure 4.2d.  108  Figure 4.2: Illustration of crosstalk reduction with local and global level raising. (a) Original left image with no crosstalk (b) Image with crosstalk reduction but no level raising (c) Global raising of minimum image level (d) Proposed method with local raising  To illustrate how our method improves temporal consistency, we plot the intensity of the patches added over the course of a 10 second video when using our proposed method and compare to calculating the regions on each frame independently (Figure 4.3). As seen in Figure 4.3a, when the regions are calculated independently on each frame, they can be quite inconsistent from frame to frame. Sometimes a region is not detected in every frame, and a few regions are only detected in one or two frames. This results in annoying and highly noticeable flickering in the output video. Our proposed method gives much smoother temporal transitions, and avoids any flickering (Figure 4.3b).  109  Figure 4.3: Regions intensities for a 3D video calculated: (a) On a frame-by-frame basis, and (b) With our proposed method including temporal consistency  4.3 Discussion In this chapter we have described a non-causal method, which could be applied to applications such as 3D cinema where the data can be batch-processed ahead of display time. Our algorithm could be modified to be real-time and causal, so that it could be applied to situations such as 3DTVs. The temporal low-pass filter would have to be replaced with a causal filter, and fewer taps may need to be used. When a new region is detected that was not in the previous frame, a fade-in for the new region would have to start at that frame (as opposed to fading in during the previous frames). A faster fade-in time would need to be used since 110  crosstalk would be visible during the fade-in time. Also, deleting regions that appear only a short amount of time would not be possible, since when a new region is detected, it is not known how long it will last. It should be noted that our algorithm could be applied using different crosstalk models than the linear one of equation (1.7), such as the model in [86], which is based entirely on visual measurements. To use a different crosstalk model, equation (4.4) would simply need to be replaced with a different function that gives the minimum amount of light required to compensate for the crosstalk from the corresponding pixel in the other image.  4.4 Conclusions In this chapter we have proposed a method for locally adding patches of luminance to videos in order to improve crosstalk cancellation. Our method considered temporal consistency by applying low pass filtering to detected regions over time, removing regions of short duration, and creating fade in and fade outs for regions that appear or disappear midway through the video. 
4.4 Conclusions

In this chapter we have proposed a method for locally adding patches of luminance to videos in order to improve crosstalk cancellation. Our method enforces temporal consistency by applying low-pass filtering to detected regions over time, removing regions of short duration, and creating fade-ins and fade-outs for regions that appear or disappear midway through the video. Our method allows crosstalk to be effectively corrected, while maintaining good image contrast and avoiding the flickering problems that can occur in methods applied on a frame-by-frame basis.

5 Conclusions and Future Work

5.1 Significance and Potential Applications of the Research

3D video can provide a more immersive and realistic visual experience by showing slightly different images to the viewer's left and right eyes. The presence of multiple imaging channels during capturing and multiple display channels introduces new problems in 3D video that are not present in 2D systems. In this thesis, we propose new digital video processing algorithms for correcting capturing and display issues in 3D video systems. In particular, we develop a multiview video colour matching algorithm (Chapter 2), a sharpness correction method for stereo images (Chapter 3), and a method for preserving contrast while performing crosstalk cancellation (Chapter 4).

Calibrating multiple cameras to be consistent with each other is always challenging, and it will become a bigger problem when lower cost cameras are used. For example, the company Point Grey Research produces popular stereo camera setups. Despite these cameras being explicitly designed for stereo vision and being relatively expensive, they still need to compensate for brightness variations during stereo matching [124]. Our multiview colour correction can be used to obtain higher quality multiview data using lower quality cameras for capturing. This can provide new trade-offs in multiview system design. Instead of using a small number of expensive cameras, a larger number of lower cost cameras could be used. Having more cameras (viewpoints) will reduce problems with occlusions, and therefore can help produce higher quality rendered views. Our proposed method also lowers the bitrate required for multiview video compression, so it can be used to reduce storage and transmission requirements in a system.

Our method for matching colours and correcting vignetting in panoramic image stitching would be most useful implemented directly on consumer cameras. Even low-cost cameras often provide a panoramic mode where the user can take a series of overlapping images with horizontal panning between them, and they are automatically stitched together on the camera. Since the processing power is limited on consumer cameras, our method would be appropriate since it requires only linear regression with a small number of samples.
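To give a sense of how lightweight this regression is, the sketch below fits a per-channel gain and offset between matched samples from the overlap of two images using ordinary least squares. The gain-plus-offset model and the function name are purely illustrative; the actual correction of Chapter 2 also accounts for vignetting, but the computational pattern (a small linear least-squares solve) is the same.

    import numpy as np

    def fit_channel_correction(src_samples, ref_samples):
        # Solve min || [src 1] [gain offset]^T - ref ||^2 for one channel,
        # using matched samples from the overlap region of the two images.
        src = np.asarray(src_samples, dtype=float)
        ref = np.asarray(ref_samples, dtype=float)
        A = np.column_stack([src, np.ones_like(src)])
        coeffs, _, _, _ = np.linalg.lstsq(A, ref, rcond=None)
        gain, offset = coeffs
        return gain, offset

    # Applying the correction is then a single multiply-add per pixel:
    # corrected = gain * src_pixel + offset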
Consumer cameras often produce blurred images under low light conditions, due to motion blur and/or a large aperture being used. Our sharpness matching method presented in Chapter 3 may be useful in these cases, for making the data from the two cameras consistent. Stereo cameras are also often used in robot navigation [101], and system constraints often require small, low cost cameras to be used. With moving robots, camera shake can cause motion blur. Our sharpness correction method could be useful in these cases, as it can improve the accuracy of the depth map obtained from blurred stereo images. This would help improve the navigation and recognition performance of the robot system.

Our local crosstalk compensation method described in Chapter 4 could be applied to 3D displays with polarized glasses or active shutter glasses. In 3D cinema using polarized projection displays, crosstalk compensation (sometimes called "ghost-busting") is required to achieve high-quality display [123]. For cinema applications, the crosstalk cancellation can be performed as a batch process before display, and our proposed method could be used employing heavy temporal filtering and long fade-ins and fade-outs. Most current commercial 3D TVs use active glasses, but line-by-line polarized 3D TVs have recently entered the market. In either case, the amount of crosstalk is significant and compensation methods could improve 3D picture quality. In 3D TVs, crosstalk compensation would have to be performed in real time, because the combination of the content and the display equipment generally won't be known until the content is being displayed. Our proposed method could be applied as a real-time method using a shorter filter for temporal smoothing. Since crosstalk cancellation is performed through digital processing, it is inexpensive to implement on the existing chips in 3D TVs, and can provide a much lower cost solution than designing better display hardware that inherently has lower crosstalk.
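To illustrate how little per-pixel computation the cancellation itself requires, the sketch below performs subtractive cancellation for a symmetric linear leakage model in which a fraction c of each view leaks into the other eye. The thesis uses the linear model of equation (1.7); the 5% leakage value and the assumption of linear-light values normalised to [0, 1] are illustrative. The clipping step is exactly where residual crosstalk survives in dark regions, which is the problem the local level raising of Chapter 4 addresses.

    import numpy as np

    def subtractive_cancellation(left, right, c=0.05):
        # Pre-distort both views so that, after a fraction c of each displayed
        # view leaks into the other eye, the intended images are perceived.
        left_comp = (left - c * right) / (1.0 - c * c)
        right_comp = (right - c * left) / (1.0 - c * c)
        # A display cannot emit negative light; clipping here is what leaves
        # visible crosstalk in dark regions unless the image levels are raised.
        return np.clip(left_comp, 0.0, 1.0), np.clip(right_comp, 0.0, 1.0)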
115  5.3 Directions for Future Work One direction for future research would be to further integrate our colour correction method with multiview video coding. The disparity vectors found in the colour correction process could be reused in the multiview video coding stage. However, directly reusing the disparity vectors would decrease compression performance, as the vectors chosen to maximize correlation (or minimize absolute/squared differences) in general will not be identical to the vectors that give the best coding efficiency. This is a similar issue to reusing motion vectors when transcoding compressed video from one format to another [122]. As with the transcoding case, a much smaller disparity search could be performed in the coding stage that uses the disparity vectors and correlation scores from the colour correction stage to identify which vectors should be evaluated during encoding. In order to have accurate colour correction using our method, the matching points used in the regression should cover the entire range of colours present in the input videos. We currently achieve this by performing block matching over several entire frames, which yields a very large number of matching points to use in the regressions (typically in the hundreds of thousands). Complexity could be lowered by smartly selecting fewer points that cover the range of Y, U and V values in the input videos. Future work could explore using an intelligent method for quickly selecting a smaller number of matching points that can be accurately matched in all views and also cover the entire range of colours present in the input videos. Our sharpness correction method for stereo images attempts to make the images consistent in sharpness by boosting and attenuating DCT coefficients in a set of 116  frequency bands. It does not solve for the blurring functions applied to each image. A more accurate, but also more complex, sharpness correction method could be designed for 3D images that explicitly solves for the blurring function of each image and applying inverse filtering (i.e., blind deconvolution) [120][121]. Deconvolution methods have been extensively studied for 2D images, but further work would be needed to apply them to 3D images, where two blurring kernels could be simultaneously estimated using information from both input images. Our crosstalk reduction method could be improved through better modelling of the visibility of crosstalk in 3D images. Our method employs simple techniques to detect which regions of crosstalk will be noticeable, by ignore small regions or regions with low crosstalk intensity. It could be improved through more sophisticated modelling of the human visual system such as that used in the visual difference predictor [119][89], which take into account contrast sensitivity and masking effects. A better method for estimating which regions of crosstalk will be visible would allow the method to only add brightness to regions that really need it. A related direction for research could be modeling the image quality taking into account both the loss of contrast when image levels are raised, as well as the visibility of un-corrected crosstalk. A new method could provide trade-offs between preserving contrast and compensating for crosstalk. In some regions it may be desirable to leave some, barely noticeable, crosstalk in order to keep image contrast higher. A challenging direction for future research is correcting crosstalk in autostereoscopic (glasses-free) multiview displays. 
Crosstalk in autostereoscopic displays varies with the viewing position, and also with the position of the pixels on the screen [79]. Therefore it is very difficult to model the amount of crosstalk each viewer will experience for each pixel on the screen, and this model would have to be continuously updated as the viewer moves.

References

[1] A. Woods, T. Docherty, and R. Koch, "Image distortions in stereoscopic video systems," in SPIE Volume 1915: Stereoscopic Displays and Applications IV, 1993.
[2] Blu-ray Disc Association, "System Description: Blu-ray Disc Read-Only Format Part 3: Audio Visual Basic Specifications," Dec. 2009.
[3] YouTube in 3D, available: http://youtu.be/5ANcspdYh_U, accessed Mar. 28, 2012.
[4] N. Hur, H. Lee, G. S. Lee, S. J. Lee, A. Gotchev, and S. Park, "3DTV Broadcasting and Distribution Systems," IEEE Transactions on Broadcasting, vol. 57, no. 2, pp. 395-407, June 2011.
[5] A. Vetro, A. M. Tourapis, K. Müller, and T. Chen, "3D-TV content storage and transmission," IEEE Trans. Broadcast., Jun. 2011.
[6] A. Kubota, A. Smolic, M. Magnor, M. Tanimoto, T. Chen, and C. Zhang, "Multiview Imaging and 3DTV," IEEE Signal Processing Magazine, vol. 24, no. 6, pp. 10-21, Nov. 2007.
[7] M. Tanimoto, "Overview of Free Viewpoint Television," Signal Processing: Image Communication, vol. 21, no. 6, pp. 454-461, July 2006.
[8] N. A. Dodgson, "Autostereoscopic 3D displays," IEEE Computer, vol. 38, pp. 31-36, 2005.
[9] C. Van Berkel, D. W. Parker, and A. R. Franklin, "Multiview 3D-LCD," in Proc. SPIE Stereoscopic Displays and Virtual Reality Systems, vol. 2653, pp. 32-39, 1996.
[10] Tanimoto Laboratory webpage, available: http://www.tanimoto.nuee.nagoya-u.ac.jp/, accessed Mar. 28, 2012.
[11] S. C. Chan, H. Y. Shum, and K. T. Ng, "Image-based rendering and synthesis," IEEE Signal Processing Mag., vol. 24, no. 7, pp. 22-33, Nov. 2007.
[12] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High Quality Video View Interpolation Using a Layered Representation," in Proc. ACM SIGGRAPH and ACM Trans. Graphics, Los Angeles, CA, Aug. 2004.
[13] C. Fehn, "Depth-Image-Based Rendering (DIBR), Compression and Transmission for a New Approach on 3D-TV," in Proc. SPIE Stereoscopic Displays and Virtual Reality Systems XI, San Jose, CA, USA, pp. 93-104, Jan. 2004.
[14] M. Kawakita, T. Kurita, H. Kikuchi, and S. Inoue, "HDTV AXI-vision camera," in Proc. International Broadcasting Conference, pp. 397-404, 2002.
[15] G. J. Iddan and G. Yahav, "3D Imaging in the Studio (and Elsewhere …)," in Proc. SPIE Videometrics and Optical Methods for 3D Shape Measurements, pp. 48-55, 2001.
[16] D. Scharstein and R. Szeliski, "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms," International Journal of Computer Vision, vol. 47, pp. 7-42, 2002.
[17] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms," in International Conference on Computer Vision and Pattern Recognition, pp. 519-528, 2006.
[18] P. Kauff, N. Atzpadin, C. Fehn, M. Müller, O. Schreer, A. Smolic, and R. Tanger, "Depth Map Creation and Image Based Rendering for Advanced 3DTV Services Providing Interoperability and Scalability," Signal Processing: Image Communication, Special Issue on 3DTV, February 2007.
[19] W. J. Tam and L. Zhang, "3D-TV content generation: 2D-to-3D conversion," IEEE International Conference on Multimedia and Expo, pp. 1869-1872, 2006.
[20]  M. T. Pourazad, P. Nasiopoulos and R. K. Ward, "An H.264-based scheme for 2D to 3D video conversion", IEEE Transactions on Consumer Electronics, vo1.55, no.2 May, 2009.  [21]  Kwan-Jung Oh; Sehoon Yea; Yo-Sung Ho, "Hole filling method using depth based in-painting for view synthesis in free viewpoint television and 3-D video," Proc. Picture Coding Symposium, May 2009.  [22]  P. Merkle, A. Smolic, K. Müller, and T. Wiegand, “Multi-view Video Plus Depth Representation and Coding”, ICIP 2007, IEEE International Conference on Image Processing, San Antonio, TX, USA, September 2007.  [23]  A. Smolic, K. Müller, K. Dix, P. Merkle, P. Kauff, and T. Wiegand, "Intermediate View Interpolation based on Multi-View Video plus Depth for Advanced 3D Video Systems," Proc. IEEE International Conference on Image Processing (ICIP'08), San Diego, CA, USA, pp. 2448-2451, Oct. 2008.  [24]  T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–560, Jul. 2003. 120  [25]  G. J. Sullivan and J.R. Ohm, "Recent developments in standardization of high effiency video coding (HEVC)," in SPIE Applications of Digital Image Processing XXXIII, vol. 7798, Aug. 2010.  [26]  A. Vetro, T. Wiegand, and G.J. Sullivan, “Overview of the Stereo and Multiview Video Coding Extensions of the H.264/AVC Standard,” Proceedings of the IEEE, 2011.  [27]  P. Merkle, A. Smolic, K. Muller and T. Wiegand, "Efficient Prediction Structures for Multiview Video Coding," IEEE Transactions on Circuits and Systems for Video Technology, vol.17, no.11, pp.1461-1473, Nov. 2007.  [28]  C. Fehn, "A 3D-TV system based on video plus depth information," Signals, Systems and Computers, vol. 2, pp. 1529–33, Nov. 2003.  [29]  S. Grewatsch, and E. Miiller, "Sharing of motion vectors in 3D video coding,'' in Proc. IEEE International Conference on Image Processing (ICIP 2004), vol. 5, pp. 3271–74, 2004.  [30]  A. J. Woods and C. R. Harris, “Comparing levels of crosstalk with red/cyan, blue/yellow, and green/magenta anaglyph 3D glasses,” In Proceedings of SPIE Stereoscopic Displays and Applications 2010, vol. 7524, 2010.  [31]  L. Lipton, "Stereoscopic motion picture projection system," US Patent No. 5,481,321, Jan. 2, 1996.  [32]  H. Jorke, “Device for projecting a stereo color image,” US Patent No. 7,001,021, Feb. 21, 2006.  [33]  M. Richards and G. D. Gomes, "Spectral Separation Filters For 3D Stereoscopic D-Cinema Presentation,” US Patent Application No. 2011/0205494, Aug. 2011.  [34]  J. A. Roese, “Liquid crystal stereoscopic viewer,” US Patent 4,021,846, May 3, 1977.  [35]  P. Miller and T. Moynihan, “Active 3D vs. Passive 3D,” PCWorld Magazine, Apr. 2011, available: www.pcworld.com/article/225218/active_3d_vs_ passive_3d.html, accessed Mar. 28, 2012.  [36]  H. Urey, K. V. Chellappan, E. Erden, and P. Surman, “State of the Art in Stereoscopic and Autostereoscopic Displays,” Proceedings of the IEEE, vol. 99, no. 4, pp. 540-555, Apr. 2011.  [37]  Y. Ho, K. Oh, C. Lee, B. Choi and J. Park, “Observations of Multi-view Test Sequences,” Joint Video Team standard committee input document JVT-W084, Joint Video Team 23rd meeting, San Jose, USA, April 2007.  121  [38]  C.S. McCamy, H. Marcus, and Davidson, J.G., "A Color Rendition Chart," Journal of Applied Photographic Engineering, vol. 11, no. 3 pp. 95-99. 1976.  [39]  A. Ilie and G. Welch, “Ensuring color consistency across multiple cameras,” in Proc. 
IEEE International Conference on Computer Vision (ICCV), pp. 1268– 1275, Oct. 2005.  [40]  J.H. Hur, S. Cho, and Y.L. Lee, “Adaptive Local Illumination Change Compensation Method for H.264-Based Multiview Video Coding,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1496–1505, Nov. 2007.  [41]  K. Yamamoto, M. Kitahara, H. Kimata, T. Yendo, T. Fujii, M. Tanimoto, S. Shimizu, K. Kamikura, and Y. Yashima, “Multiview Video Coding Using View Interpolation and Color Correction,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1436–1449, Nov. 2007.  [42]  U. Fecker, M. Barkowsky, and A. Kaup, “Improving the prediction efficiency for multi-view video coding using histogram matching,” in Proc. Picture Coding Symposium 2006, pp. 2–16, Apr. 2006.  [43]  M. Flierl, A. Mavlankar, and B. Girod, “Motion and Disparity Compensated Coding for Multiview Video,” IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 11, pp. 1474-1484, Nov. 2007.  [44]  U. Fecker, M. Barkowsky, and A. Kaup, “Histogram-Based Pre-Filtering for Luminance and Chrominance Compensation of Multi-View Video,” IEEE Trans Circuits and Systems for Video Technology, vol.18, no.9, pp.1258-1267, Sept. 2008.  [45]  Y. Chen, C. Cai, and J. Liu, “YUV Correction for Multi-View Video Compression,” Proc. Int. Conf. Pattern Recognition 2006, pp. 734-737, Aug. 2006.  [46]  F. Shao, G. Jiang, M. YuI, and K. Chen, "A Content-Adaptive Multi-View Video Color Correction Algorithm, " Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 969-972, Apr. 2007.  [47]  M. Brown and D. Lowe. "Automatic Panoramic Image Stitching using Invariant Features," International Journal of Computer Vision, vol. 74, no. 1, pp. 59-73, 2007.  [48]  R. Szeliski. Image alignment and stitching: A tutorial. Technical Report MSRTR-2004-92, Microsoft Research, Dec. 2004.  [49]  Y. Zheng, S. Lin, S.B. Kang, "Single-Image Vignetting Correction," Proc Computer Vision and Pattern Recognition (CVPR 2006), vol.1, no., pp. 461-468, 17-22 June 2006. 122  [50]  D. Hasler and S. Süsstrunk, "Color Handling in Panoramic Photography," Proc. IS&T/SPIE Electronic Imaging 2001: Videometrics and Optical Methods for 3D Shape Measurement, Vol. 4309, pp. 62-72, 2001.  [51]  D. Hasler and S. Süsstrunk, "Mapping colour in image stitching applications," Journal of Visual Communication and Image Representation, vol. 15, num. 12, pp. 65-90, 2004.  [52]  D. Goldman and J. Chen, "Vignette and Exposure Calibration and Compensation," Proc. IEEE Int'l Conf. Computer Vision, pp. 899-906, Oct. 2005.  [53]  P. d'Angelo, "Radiometric alignment and vignetting calibration," Proc. Camera Calibration Methods for Computer Vision Systems, Mar. 2007.  [54]  L. Matthies, A. Kelly and T. Litwin. “Obstacle detection for unmanned ground vehicles: a progress report,” Proc. International Symposium on Robotics Research (ISRR), 1995.  [55]  P. Fua, “A parallel stereo algorithm that produces dense depth maps and preserves image features,” Machine Vision and Applications, vol. 6, no. 1, 1993.  [56]  J. Kim, V. Kolmogorov and R. Zabih, “Visual correspondence using energy minimization and mutual information,” Proc IEEE Conference on Computer Vision (ICCV), 2003.  [57]  R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” Proc. European Conference on Computer Vision (ECCV), pp. 151–158, 1994.  [58]  H. Hirschmuller and D. 
Scharstein, "Evaluation of Cost Functions for Stereo Matching," Proc IEEE Conference on Computer Vision and Pattern Recognition, June 2007.  [59]  H. Hirschmuller and D. Scharstein, "Evaluation of Stereo Matching Costs on Images with Radiometric Differences," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.31, no.9, pp.1582-1599, Sept. 2009.  [60]  L. B. Stelmach, W. J. Tam, D. V. Meegan, A. Vincent, and P. Corriveau, “Human perception of mismatched stereoscopic 3D inputs,” Proc. IEEE International Conference on Image Processing (ICIP), Vancouver, BC, Canada, 2000.  [61]  G. Saygili, G. Gurler, A.M. Tekalp, "Quality assessment of asymmetric stereo video coding," Proc. IEEE International Conference on Image Processing (ICIP), Hong Kong, 2010  [62]  M. Budagavi, "Video compression using blur compensation," Proc IEEE International Conference on Image Processing (ICIP), vol.2, pp. 882-885, Sept. 2005. 123  [63]  J.H. Kim, P. Lai, J. Lopez, A. Ortega, Y. Su, P. Yin, and C. Gomila, “New Coding Tools for Illumination and Focus Mismatch Compensation in Multiview Video Coding,” IEEE Trans. Circuits and Systems for Video Technology, vol. 17, no. 11, pp. 1519-1535, Nov. 2007.  [64]  B. Girod, “The efficiency of motion-compensating prediction for hybrid coding of video sequences,” IEEE Journal on Selected Areas in Communications, vol. SAC5, pp. 1140-1154, Aug. 1987.  [65]  M. Pedone and J. Heikkila, “Blur and Contrast Invariant Fast Stereo Matching,” Proc. Advanced Concepts for Intelligent Vision Systems, pp. 883-890, Oct. 2008.  [66]  W. Wang, Y. Wang; L. Huo, Q. Huang and W. Gao, "Symmetric segment-based stereo matching of motion blurred images with illumination variations," International Conference on Pattern Recognition (ICPR), Dec. 2008.  [67]  A. J. Woods, "Understanding Crosstalk in Stereoscopic Displays" (Keynote Presentation) at Three-Dimensional Systems and Applications (3DSA) conference, Tokyo, Japan, 19-21 May 2010.  [68]  A. J. Woods, "How are crosstalk and ghosting defined in the stereoscopic literature?" in Proceedings of SPIE Stereoscopic Displays and Applications XXII, Vol. 7863, 78630Z, Jan. 2011.  [69]  L. M. Wilcox and J. A. D. Stewart, "Determinants of perceived image quality: ghosting vs. brightness", Proc. SPIE Stereoscopic Displays and Applications, vol. 5006, 263, Jan. 2003.  [70]  F. Kooi and A. Toet, "Visual confort of binocular and 3D displays," Displays, vol. 25, pp. 99-108, 2004.  [71]  P. J. H. Seuntiens, L. M. J. Meesters, and W. A. IJsselsteijn, "Perceptual Attributes of Crosstalk in 3D Images," Displays, vol. 26, no. 4-5, pp. 177-183, 2005.  [72]  L. Wang, K. Teunissen, Y. Tu, L. Chen; P. Zhang, T. Zhang, I. Heynderickx, "Crosstalk Evaluation in Stereoscopic Displays," Journal of Display Technology, vol. 7, no. 4, pp. 208-214, April 2011  [73]  I. Tsirlin, L. M Wilcox, R. S. Allison, "The Effect of Crosstalk on the Perceived Depth From Disparity and Monocular Occlusions," IEEE Transactions on Broadcasting, vol. 57, no. 2, pp. 445-453, 2011.  [74]  S. S. Fisher, M. McGreevy, J. Humphries, W. Robinett, “Virtual Environment Display System,” Proceedings of the Workshop on Interactive 3D Graphics, Oct. 1986. 124  [75]  S. Daly, R Held and D. Hoffman, “Perceptual Issues in Stereoscopic Signal Processing,” IEEE Transactions On Broadcasting, vol. 57, no. 2, pp. 347-361 June 2011.  [76]  A. J. Woods and S. Tan, “Characterising sources of ghosting in time-sequential stereoscopic video displays,” Proc. SPIE, vol. 4660, pp. 66–77, Jan. 21–23, 2002.  [77]  S. 
Shestak, D. Kim, and S. Hwang "Measuring of Gray-to-Gray Crosstalk in a LCD Based Time-Sequential Stereoscopic Display" proc SID 2010, Seattle, pp. 132-135, May 2010.  [78]  C. C. Pan, Y. R. Lee, K. F. Huang, T. C. Huang, "Cross-Talk Evaluation of Shutter-Type Stereoscopic 3D Display" in SID Digest 2010, Seattle, 128-131 (2010).  [79]  A. Boev, A. Gotchev and K. Egiazarian: “Crosstalk Measurement Methodology for Auto-Stereoscopic Screens”, IEEE 3DTV Conference, Kos, Greece, May 7-9, 2007.  [80]  R. Kaptein and I. Heynderickx, “Effect of crosstalk in multi-view autostereoscopic 3D displays on perceived image quality,” in SID Digest, 2007, pp. 1220–1223.  [81]  M. Stokes, M. Anderson, S. Chandrasekar, and R. Motta, “A standard default color space for the internet srgb,” 1996, available: http://www.color.org/sRGB.html, accessed Mar. 28, 2012.  [82]  M. A. Weissman and A. J. Woods, “A simple method for measuring crosstalk in stereoscopic displays” in Proc. SPIE Stereoscopic Displays and Applications XXII, San Francisco, vol. 7863, 2011.  [83]  J. S. Lipscomb, W. L. Wooten, “Reducing crosstalk between stereoscopic views,” Proceedings of SPIE Stereoscopic Displays and Virtual Reality Systems, vol. 2177, pp. 92-96, February 1994.  [84]  G. Street, "Method and apparatus for image enhancement." US Patent No. 6075555, Jun. 2000.  [85]  M . Barkowsky , P. Campisi, P . Le Callet , V. Rizzo, “Crosstalk measurement and mitigation for autostereoscopic displays,” Proc. SPIE 3D Image Processing and Applications, San Jose, USA, 2010.  [86]  J. Konrad, B. Lacotte, E. Dubois, “Cancellation of image crosstalk in timesequential displays of stereoscopic video” in IEEE Transactions on Image Processing, vol. 9, no. 5, pp. 897-908, May 2000.  [87]  G. R. Jones, N. S. Holliman, “Display controller, three dimensional display, and method of reducing crosstalk,” U.S Patent 6573928, Jun. 3, 2003. 125  [88]  D. J. McKnight., "Enhanced Ghost Compensation for Stereoscopic Imagery," U.S. Patent application 2010/0040280, Feb. 2010.  [89]  R. Mantiuk, S. Daly, K. Myszkowski, H. P. Seidel: "Predicting Visible Differences in High Dynamic Range Images - Model and its Calibration," Proc. of Human Vision and Electronic Imaging X, IS&T/SPIE's 17th Annual Symposium on Electronic Imaging, pp. 204-214, 2005.  [90]  G. Bjontegaard, “Calculation of average PSNR differences between RD-curves”, VCEG Contribution VCEG-M33, Austin, Apr. 2001.  [91]  D.G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, Nov. 2004.  [92]  J. Lu, H. Cai, J.G. Lou, and J. Li, "An Epipolar Geometry-Based Fast Disparity Estimation Algorithm for Multiview Image and Video Coding," IEEE Trans. Circuits and Systems for Video Technology, vol.17, no.6, pp.737-750, Jun. 2007.  [93]  Y. Kim, J. Kim and K. Sohn, "Fast Disparity and Motion Estimation for Multiview Video Coding," IEEE Trans. Consumer Electronics, vol.53, no.2, pp.712719, May 2007.  [94]  D.V. Papadimitriou and T.J. Dennis, "Epipolar line estimation and rectification for stereo image pairs," IEEE Trans Image Processing, vol.5, no.4, pp.672-676, Apr. 1996.  [95]  A. Björck, Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics (SIAM), 1996.  [96]  A. Vetro, P. Pandit, H. Kimata, A. Smolic, and Y. K. Wang, “Joint Multiview Video Model (JMVM) 8.0,” ISO/IEC JTC1/SC29/WG11 and ITU-T Q6/SG16, Doc. JVT-AA207, Apr. 2008.  [97]  Y. Su, A. Vetro and A. 
Smolic, “Common Test Conditions for Multiview Video Coding,” ISO/IEC JTC1/SC29/WG11 and ITU-T Q6/SG16, Doc. JVT-T207, Jul. 2006.  [98]  C. Moler, Numerical Computing with MATLAB, Society for Industrial and Applied Mathematics (SIAM), 2004.  [99]  P. Burt and E. Adelson, "A multiresolution spline with application to image mosaics," ACM Transactions on Graphics, vol. 2, no 4, pp. 217-236, 1983.  [100] A. Levin, A. Zomet, S. Peleg, and Y. Weiss, "Seamless image stitching in the gradient domain," in Proc. European Conf. Computer Vision (ECCV 2004), pp. 377-389.  126  [101] D. Murray and J. Little, “Using real-time stereo vision for mobile robot navigation,” Autonomous Robots, vol. 8, no. 2, pp. 161–171, 2000. [102] D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using structured light,” Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pages 195-202, Madison, WI, June 2003. [103] N. Ahmed, T. Natarajan, and K.R. Rao, “Discrete Cosine Transform,” IEEE Transactions on Computers, Vol. C-23, pp.90-93, Jan. 1974. [104] M. K. Hasan, S. Salahuddin, M. R. Khan, “Reducing signal-bias from MAD estimated noise level for DCT speech enhancement,” Signal Processing, vol. 84, issue 1, pp. 151-162, Jan. 2004. [105] R. C. Gonzalez and R. E. Woods, Digital Image Processing (2nd Edition). Prentice Hall, January 2002. [106] R. C. Reininger and J. D. Gibson, “Distributions of the two-dimensional DCT coefficients for images,” IEEE Transactions on Communications, vol. 31, pp. 835–839, June 1983. [107] W. K. Pratt, “Generalized Wiener filtering computation techniques,” IEEE Trans. Comput., vol. C-21, pp. 636–641, July 1972. [108] D. Scharstein and R. Szeliski, Middlebury stereo vision page, available: http://vision.middlebury.edu/stereo, accessed Mar. 28, 2012. [109] P.F. Felzenszwalb and D.P. Huttenlocher, “Efficient Belief Propagation for Early Vision,” International Journal of Computer Vision, Vol. 70, No. 1, Oct. 2006. [110] H. Hirschmuller, “Stereo Vision Based Mapping and Immediate Virtual Walkthroughs.” PhD thesis, School of Computing, De Montfort University, Leicester, UK, 2003. [111] J. Ens and P. Lawrence, “An investigation of methods for determining depth from focus,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 2, pp. 97–108, 1993. [112] S.A. Martucci, “Symmetric Convolution and the Discrete Sine and Cosine Transforms,” IEEE Transactions on Signal Processing, Vol. 42, No. 5, pp. 103851, May, 1994. [113] M. Tanimoto, “Free Viewpoint Television – FTV,” Proc. Picture Coding Symposium (PCS 2004), San Francisco, CA, Dec. 2004. [114] M. Frigo and S. G. Johnson, "The design and implementation of fftw3," Proc. IEEE, vol. 93, no. 2, pp. 216-231, 2005. 127  [115] R. M. Haralick, and L. G. Shapiro. Computer and Robot Vision, Volume I. Addison-Wesley, 1992, pp. 40-48. [116]  H. Breu, J. Gil, D. Kirkpatrick, and M. Werman, "Linear Time Euclidean Distance Transform Algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 5, May 1995, pp. 529-533.  [117] R. Mantiuk, S. Daly, and L. Kerofsky, “Display adaptive tone mapping,” ACM trans. on Graphics, vol. 27, no. 3, pp. 68, July 2008. [118] D. Xu, L. Coria, P. Nasiopoulos, "Guidelines for Capturing High Quality Stereoscopic Content Based on a Systematic Subjective Evaluation," IEEE International Conference on Electronics, Circuits, and Systems, ICECS 2010, pages 166-169, Dec. 2010. [119] S. 
Daly, "The visible difference predictor: An algorithm for the assessment of image fidelity," in Proc. SPIE, vol. 1616, pp. 2–15, 1992.
[120] Q. Shan, J. Jia, and A. Agarwala, "High-quality motion deblurring from a single image," ACM Transactions on Graphics (SIGGRAPH), 2008.
[121] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, "Understanding and evaluating blind deconvolution algorithms," Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[122] Q. Tang and P. Nasiopoulos, "Efficient Motion Re-Estimation With Rate-Distortion Optimization for MPEG-2 to H.264/AVC Transcoding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 2, pp. 262-274, Feb. 2010.
[123] M. Cowen, "REAL D 3D Theatrical System: A Technical Overview," (white paper), available: http://www.edcf.net/edcf_docs/real-d.pdf, accessed Mar. 28, 2012.
[124] Point Grey Stereo Vision Camera Catalog, available: http://www.ptgrey.com/products/Point_Grey_stereo_catalog.pdf, accessed Mar. 28, 2012.
