Improvements of Demosaicking and Compression for Single Sensor Digital Cameras

by

Colin Ray Doutre

B.Sc. (Electrical Engineering), Queen's University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Applied Science in THE FACULTY OF GRADUATE STUDIES (Electrical & Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

February 2007

© Colin Ray Doutre, 2007

Abstract

Most consumer digital cameras capture colour images using a single light sensor and a colour filter array (CFA). This results in a mosaic image being captured, where at each pixel location either a red, green, or blue sample is obtained. The two missing colours at each location must be interpolated from the surrounding samples in a process known as demosaicking. In most current digital cameras, demosaicking is carried out in RGB colour space and later the image or video is converted to YCbCr 4:2:0 format and compressed with standard methods.

Most previous research on demosaicking and compression in digital cameras has addressed the issues separately. That is, current demosaicking methods ignore the fact that the data will later be compressed, and compression techniques ignore the fact that the data was captured with a CFA. In this thesis we propose two methods for optimizing the demosaicking and compression processes in digital cameras. First, we propose a fast demosaicking method that directly produces an image in YCbCr 4:2:0 format (the colour format most commonly used in compression). This reduces the computational complexity relative to the conventional approach of performing demosaicking in the RGB space and later converting to YCbCr 4:2:0 format. Second, we propose two methods for compressing video captured with a CFA prior to demosaicking being performed. This allows us to take advantage of the smaller input data size, since demosaicking expands the number of samples by a factor of three. Our first CFA video compression method uses standard H.264, while our second method uses a modified motion compensation scheme that further increases compression efficiency by exploiting the properties of CFA data.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Background
 2.1 Digital Camera Design
  2.1.1 Optical System and Sensors
  2.1.2 Digital Image Processing
 2.2 Image and Video Compression
  2.2.1 YCbCr Colour Space
  2.2.2 Image Compression
  2.2.3 Video Compression
  2.2.4 H.264 Video Compression
 2.3 Demosaicking Algorithms
 2.4 Previous Work on CFA Image and Video Compression
3 Demosaicking Directly to YCbCr 4:2:0
 3.1 Introduction
 3.2 Proposed Demosaicking Method
  3.2.1 Generating a Full Green Channel
  3.2.2 Calculating Low-pass R-G and B-G Samples
  3.2.3 Calculating Down-sampled Chroma Channels
  3.2.4 Calculating the Full Luma Channel
  3.2.5 Summary of Complete Algorithm
 3.3 Experimental Results
 3.4 Complexity Analysis
 3.5 Conclusions
4 H.264 Based Compression of Bayer Pattern Video
 4.1 Introduction
 4.2 Impact of Aliasing on CFA Video Coding
 4.3 Proposed Methods for CFA Video Compression
  4.3.1 Pixel Rearranging
  4.3.2 Modified Motion Compensation
 4.4 Results
  4.4.1 Testing Methodology
  4.4.2 Demosaicking Algorithm Choice
  4.4.3 Quality Comparison Against Demosaick-First Approach
  4.4.4 Complexity Comparison
  4.4.5 Comparison Against Method in [19]
 4.5 Conclusions
5 Conclusions and Future Work
 5.1 Conclusions
 5.2 Future Work
Bibliography
Appendix A: List of Acronyms

List of Tables

Table 3.1: Y-PSNR comparison of different demosaicking methods (dB)
Table 3.2: Cb-PSNR comparison of different demosaicking methods (dB)
Table 3.3: Cr-PSNR comparison of different demosaicking methods (dB)
Table 3.4: Number of operations per pixel required for the proposed method
Table 3.5: Number of operations per pixel required for the demosaicking method in [9] plus conversion to YCbCr 4:2:0 format
Table 4.1: Bit rate comparison of our proposed methods with the method in [19]

List of Figures

Figure 2.1: Typical optical path for a three sensor camera
Figure 2.2: Typical optical path for a single sensor camera using a CFA
Figure 2.3: Bayer pattern CFA
Figure 2.4: Example digital processing pipeline for a single sensor camera
Figure 2.5: Illustration of YCbCr sampling formats. (a) 4:4:4 (b) 4:2:2 (c) 4:2:0
Figure 2.6: Block diagram of a typical motion compensated hybrid DCT based video encoder
Figure 2.7: Original RGB and the result of applying bilinear interpolation to the image when it is sampled with the Bayer pattern
Figure 2.8: Location numbering of Bayer pattern
Figure 2.9: Conversion from Bayer cell to YCbCr 4:2:0
Figure 2.10: Flow diagram of CFA compression schemes: (a) Traditional demosaick-first approach. (b) Compression of CFA data prior to demosaicking
Figure 2.11: Modified YCbCr conversion performed on the Bayer unit cell in [16]
Figure 2.12: 45 degree rotation of luma samples used in [16]
Figure 2.13: Methods for forming rectangular arrays of luma data in [17]. (a) Original luma arrangement (b) Structure conversion (c) Structure separation
Figure 2.14: RGB based structure conversion method used in [19]
Figure 3.1: Neighbourhood of pixels used for generating the luma and chroma samples in the cell containing positions 1-4
Figure 3.2: Test images used in evaluating demosaicking algorithm performance, numbered 1-24 from top left to bottom right
Figure 3.3: Procedure used for measuring demosaicking algorithm performance
Figure 3.4: Comparison of demosaicking methods on a cropped portion of image 6. (a) original image (b) bilinear (c) method in [6] (d) method in [9] (e) YUV method in [31] (f) proposed method
Figure 3.5: Comparison of demosaicking methods on a cropped portion of image 8. (a) original image (b) bilinear (c) method in [6] (d) method in [9] (e) YUV method in [31] (f) proposed method
Figure 4.1: A frame of the red channel from the "Mobile and Calendar" video (top), and a blowup of the highlighted region over four successive frames (bottom), illustrating the effect of aliasing and the low correlation between frames
Figure 4.2: The four frames of red data in Figure 4.1 after demosaicking has been performed
Figure 4.3: Methods for forming rectangular arrays of luma data in [17]. (a) Original luma arrangement (b) Structure conversion (c) Structure separation
Figure 4.4: Conversion from mosaic data into separate R, G and B arrays used in our CFA video compression methods
Figure 4.5: (a) Performing demosaicking on a block of pixels from a reference frame. (b) Sampling the demosaicked reference frame to obtain a prediction for the block when the motion vector is (1,0) in full pel units
Figure 4.6: Screenshots of the test videos, CrowdRun (top) and OldTownCross (bottom)
Figure 4.7: Plots of PSNR (of the CFA data) vs. bit rate for the two test videos obtained when different demosaicking algorithms are used for motion compensation
Figure 4.8: Plots of CPSNR (measured after compression and demosaicking) vs. bit rate

1 Introduction

Consumer digital cameras have gained widespread popularity in recent years and have replaced analog film and tape cameras as the most common means for home users to capture still images and video. In capturing an image, many processing steps are carried out by the camera before the image can be stored and viewed by the user. Two critical steps are demosaicking (colour filter array interpolation) and data compression.

Due to the properties of the human visual system, at least three colour samples are needed at each pixel location to represent a full colour image. Most digital cameras use red, green and blue samples. One way to design a digital camera is to have three separate sensors for capturing a red, green and blue sample at every pixel location. However, only high-grade cameras use this design, as it is expensive to implement. Instead, most consumer digital cameras use a single light sensor together with a colour filter array (CFA). The CFA allows only one colour of light to reach the sensor at each pixel location. This results in a mosaic image, where at each pixel location either a red, green or blue sample is captured, with the CFA controlling the pattern of the colour samples. The two missing colour samples at each location must be interpolated from the surrounding samples. This process is known as colour filter array interpolation or demosaicking.

Another critical processing step in digital cameras is data compression. The large amount of data required for high-resolution images and video sequences makes efficient compression necessary in order to keep the camera's memory requirements reasonable. For still images, the JPEG (Joint Photographic Experts Group) standard [1] is by far the most common compression method used. For video sequences, a number of different compression standards are available, the most popular being ITU-T H.263 [2], ISO/IEC MPEG-2 [3], and, most recently, H.264/AVC [4].

Virtually all of the work done in the area of single sensor digital cameras has addressed the issues of demosaicking and compression separately. Most demosaicking algorithms have been designed without regard as to how the image will be compressed afterwards, and most compression techniques ignore the fact that the data was originally captured with a colour filter array. Advanced demosaicking algorithms put a lot of computational effort into reconstructing high frequency detail in the red and blue colour channels [5]-[15]. If the image is compressed afterwards, it will typically be converted to YCbCr 4:2:0 format. In this format, the chroma channels (Cb, Cr) are down-sampled by a factor of two in both the horizontal and vertical directions, resulting in a loss of the high frequency colour information.
Therefore, it is wasteful to generate the high-frequency colour information in the demosaicking process.

The traditional approach to compressing image or video data captured with a CFA is to first perform demosaicking and then compress the resulting full-colour data. This approach is less than optimal, because demosaicking increases the size of the data without introducing new information. That is, the demosaicking process introduces redundancies into the data that the compression process must undo. Some work has been done on compressing raw CFA still image data, rather than first performing demosaicking and then applying compression [16]-[21]. However, little work has been done on compressing video captured with a CFA.

In this thesis we develop schemes that jointly optimize the demosaicking and compression processes. Two approaches to doing this are considered:

1. Creating a demosaicking algorithm that directly produces an image in the format used for compression (YCbCr 4:2:0), thus reducing the complexity of the demosaicking process.

2. Compressing CFA video data prior to demosaicking rather than after, increasing compression efficiency by taking advantage of the smaller input data size.

The rest of this thesis is organized as follows. Chapter 2 provides background information on digital cameras, image and video compression, demosaicking methods, and existing techniques for CFA data compression. In Chapter 3, we present a new demosaicking algorithm that directly produces an image in YCbCr 4:2:0 format that can immediately be compressed, without the need for a separate colour space conversion step. Two methods for compressing CFA video prior to demosaicking are presented in Chapter 4. Finally, conclusions and directions for further research are provided in Chapter 5.

2 Background

In this chapter, we provide background information on digital camera design, the fundamentals of image and video compression, demosaicking in digital cameras, and the existing work on compression in single-sensor cameras. In Section 2.1 we provide basic information on the design of digital cameras, including the optical path, sensors and digital image processing steps. Section 2.2 provides an overview of the basics of image and video compression, including the YCbCr colour space, the discrete cosine transform, motion compensation, and the H.264 standard. In Section 2.3, an overview of demosaicking algorithms is provided, with emphasis placed on fast techniques that run directly on the digital camera itself. A detailed summary of existing research on direct compression of CFA images and video is provided in Section 2.4.

2.1 Digital Camera Design

2.1.1 Optical System and Sensors

A colour digital image can be represented using three colour samples. Often red, green and blue are used, which roughly correspond to the wavelengths that the three types of cones in the human eye are most sensitive to. By combining red, green and blue light in different ratios, almost any visible colour can be obtained.

To capture a digital image, the most straightforward design would be to use three separate sensors to capture red, green and blue light. The optical path for such a system is shown in Figure 2.1. A beam splitter would be used to project the light through three colour filters and towards three sensors. This three sensor design is used in some high-end cameras.
However, since the sensor is one of the most expensive parts of a camera, typically accounting for 10-25% of the total cost [22], the three sensor design is not used in most consumer digital cameras.

Figure 2.1: Typical optical path for a three sensor camera.

To reduce cost, most consumer digital camera manufacturers use only a single light sensor. To capture colour information in such devices, a colour filter array (CFA) is placed before the sensor [23]. The optical path for such a camera is shown in Figure 2.2. The CFA only allows one colour of light to reach the sensor at each pixel location. Several CFA designs exist [24], the most popular being the Bayer pattern [25]. The Bayer pattern consists of a repeating 2x2 cell, each cell containing two green samples, one red sample and one blue sample, as shown in Figure 2.3. More green samples are captured than red or blue because the human visual system is more sensitive to the green portion of the light spectrum.

Figure 2.2: Typical optical path for a single sensor camera using a CFA.

Figure 2.3: Bayer pattern CFA.

In the optical path for both the three sensor and the single sensor camera designs, optical filtering is done before the light passes through the colour filters. This is necessary to provide infrared (IR) rejection. Most colour dyes transmit light beyond 700 nm, which the human visual system does not perceive, but which the camera's light sensor may be sensitive to. This light must be filtered out to capture an image that corresponds to what a human observes. Typically, a hot mirror is used for IR rejection. A hot mirror transmits shorter (visible) wavelengths and reflects longer (infrared) wavelengths. An anti-aliasing filter may also be placed at the "optical filter" location in Figures 2.1 and 2.2. An anti-aliasing filter provides spatial blurring to filter out high frequencies that cannot be captured due to the spatial resolution of the sensor.

The sensor used in digital cameras is either a charge-coupled device (CCD) or a CMOS (Complementary Metal Oxide Semiconductor) device. CCD sensors were used in virtually all early cameras, but CMOS sensors are now becoming more common due to their lower cost, lower power consumption and their ability to be incorporated onto a single chip with other circuits. However, CCD sensors produce superior image quality and are used in high-end devices.
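To make the mosaic structure concrete, the following sketch simulates single sensor capture by sampling a full RGB image through a Bayer CFA. It is an illustration only; the function name and the exact cell layout (green/red on the first cell row, blue/green on the second, matching the location numbering used later in Figure 3.1) are assumptions.

```python
import numpy as np

def bayer_sample(rgb):
    """Simulate Bayer CFA capture of an H x W x 3 image (H, W even).

    Assumed unit cell layout:  G R
                               B G
    Returns a single H x W mosaic with one colour sample per pixel.
    """
    h, w, _ = rgb.shape
    cfa = np.zeros((h, w), dtype=rgb.dtype)
    cfa[0::2, 0::2] = rgb[0::2, 0::2, 1]  # green on even rows, even cols
    cfa[0::2, 1::2] = rgb[0::2, 1::2, 0]  # red on even rows, odd cols
    cfa[1::2, 0::2] = rgb[1::2, 0::2, 2]  # blue on odd rows, even cols
    cfa[1::2, 1::2] = rgb[1::2, 1::2, 1]  # green on odd rows, odd cols
    return cfa
```

Sampling a test image this way (as done in the experiments of Chapter 3) discards two of the three colour samples at every pixel, which is exactly what demosaicking must later undo.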
2.1.2 Digital Image Processing

The output from the image sensor goes through an analog to digital (A/D) converter to produce a digital image. After this, many digital image processing steps must be carried out in order to produce a final viewable image. A typical digital image processing pipeline for a single sensor camera is shown in Figure 2.4. It should be noted that this pipeline is an example only; different cameras perform different processing steps and the order of steps may be different in some cameras.

Figure 2.4: Example digital processing pipeline for a single sensor camera.

After the A/D converter has generated a digital image, some pre-processing may be applied. These steps vary significantly from camera to camera, but may include interpolating values at defective pixel locations, or converting the output from a non-linear sensor to a linear space.

The colour of light coming off an object is a function of both the incident light colour and the reflectance of the object. The human visual system corrects for the lighting conditions so that a white object is still perceived as being white regardless of the light source. To produce images that correspond to what a human perceives, a camera must also correct the colours in an image based on the lighting conditions. This processing is called white balancing or white-point correction. White balancing is done either by having the user select from a number of pre-programmed settings, or by using an auto white balance (AWB) algorithm [26].

When a CFA is used in a single sensor camera, the sensor captures a mosaic image, where there is either a red, green or blue sample at each pixel location. The two missing colours at each location must be estimated from the surrounding samples in a process known as demosaicking [5]. An overview of demosaicking algorithms is provided later in this chapter in Section 2.3.

The spectral characteristics of an image sensor are not likely to be matched to the spectral emission characteristics of the display device used. That is, if we took the RGB values recorded by the sensor in response to a scene and directly displayed those RGB values on a monitor, the colours produced by the monitor would not match the colours in the image scene. Thus, it is necessary to convert the RGB values produced by a sensor into a standard colour space, most often the sRGB (standard red, green, blue) space [27]. The sRGB space defines a mapping between a set of sRGB values and the CIE (Commission Internationale de l'Eclairage) chromaticity coordinates of the light produced by a display device. In digital cameras, the RGB values produced by the image processing chain discussed so far are typically converted into CIE XYZ (chromaticity) values through a linear transformation (designed based on the sensor's characteristics) [28]. Then a non-linear transformation is applied to convert from XYZ values to sRGB space [27].

Some cameras apply post-processing to the image before it is compressed. This may include steps such as sharpening and artifact removal to improve the subjective image quality.

The final step of the digital image processing chain is compression, where redundancies are removed from the image representation so it can be coded using fewer bits. The JPEG image compression standard [1] is by far the most commonly used image compression scheme in digital cameras. Section 2.2 of this thesis provides more details on image and video compression.

Some cameras provide a setting for the user to bypass all of the digital image processing steps above, and just store the raw data obtained by the sensor (perhaps with some minimal compression). This allows the digital processing to be performed later on a computer, with guidance from the user. By performing the digital processing on a computer, more advanced algorithms for white balance, demosaicking and post-processing can be used, since a typical computer has much more computational power than a digital camera. This mode of operation is used by professional users, but rarely by typical consumers.

2.2 Image and Video Compression

In the following sections the basics of image and video compression are covered. Section 2.2.1 discusses the YCbCr colour space, which is used in most still image and video compression standards.
Still image compression techniques are covered in Section 2.2.2 and video compression is discussed in Section 2.2.3. Finally, a brief overview of H.264/AVC, the latest major video coding standard, is provided in Section 2.2.4.

2.2.1 YCbCr Colour Space

A digital image can be represented using three colour samples per pixel. Typically red (R), green (G) and blue (B) are used when capturing images with a digital camera. However, storing images in RGB space is inefficient, as there is a large amount of correlation between the channels. Instead, when images or video are to be compressed, they are usually converted into YCbCr colour space. In YCbCr space, an image is represented by one luma (Y) and two chroma (Cb, Cr) components. The luma channel contains "brightness" information; it is essentially a greyscale version of the image. The chroma values are colour offsets, which show how much a pixel deviates from greyscale in the blue (Cb) and red (Cr) directions.

The equations used for converting from RGB to YCbCr in the JPEG JFIF format [29] are:

$$\begin{aligned}
Y &= 0.299 \cdot R + 0.587 \cdot G + 0.114 \cdot B \\
Cb &= -0.1687 \cdot R - 0.3313 \cdot G + 0.5 \cdot B \\
Cr &= 0.5 \cdot R - 0.4187 \cdot G - 0.0813 \cdot B
\end{aligned} \qquad (2.1)$$

The reverse equations for converting from YCbCr to RGB are:

$$\begin{aligned}
R &= Y + 1.402 \cdot Cr \\
G &= Y - 0.34414 \cdot Cb - 0.71414 \cdot Cr \\
B &= Y + 1.772 \cdot Cb
\end{aligned} \qquad (2.2)$$

For most natural images, the RGB to YCbCr conversion strongly de-correlates the colour channels, so the Y, Cb and Cr components can be coded independently without loss of efficiency. In YCbCr space, the energy of an image tends to be concentrated in the Y channel. This leaves the Cb and Cr channels with less information, so they can be represented with fewer bits.

Another advantage of the YCbCr space comes from the properties of the human visual system. The human eye is more sensitive to brightness information than colour information. Consequently, the chroma signals can be down-sampled relative to the luma without significant loss of perceived quality. In fact, chroma down-sampling is almost always done when compressing image or video data.

Applying equation (2.1) at every pixel in an image results in a YCbCr 4:4:4 picture. In YCbCr 4:4:4 format, the chroma is sampled at the same rate as luma (Figure 2.5a). This format is rarely used, except in professional applications. In YCbCr 4:2:2 format, the chroma signals are down-sampled by a factor of two relative to the luma in the horizontal direction (Figure 2.5b). Some higher end digital video formats use YCbCr 4:2:2 sampling. However, the most common colour format used in compressed images and video is YCbCr 4:2:0 (Figure 2.5c). In YCbCr 4:2:0 format, the chroma signals are down-sampled by a factor of two in both the horizontal and vertical directions. It should be noted that different chroma positions are used in YCbCr 4:2:0 format; sometimes the chroma samples are considered to be half-way between the luma samples vertically, or in the center of a group of four luma samples. In the downsampling process the chroma channels should be low-pass filtered to limit aliasing.

Figure 2.5: Illustration of YCbCr sampling formats. (a) 4:4:4 (b) 4:2:2 (c) 4:2:0.
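As a concrete reference, here is a minimal numpy sketch of the 4:4:4 conversion in equations (2.1) and (2.2); the matrix constant and function names are illustrative, not from the thesis.

```python
import numpy as np

# Forward transform of equation (2.1) (JPEG/JFIF weights).
RGB2YCBCR = np.array([[ 0.299,   0.587,   0.114 ],
                      [-0.1687, -0.3313,  0.5   ],
                      [ 0.5,    -0.4187, -0.0813]])

def rgb_to_ycbcr(rgb):
    """Apply equation (2.1) at every pixel of an H x W x 3 float image,
    producing a YCbCr 4:4:4 picture."""
    return rgb @ RGB2YCBCR.T

def ycbcr_to_rgb(ycbcr):
    """Invert the conversion; the matrix inverse reproduces the
    coefficients of equation (2.2) up to rounding."""
    return ycbcr @ np.linalg.inv(RGB2YCBCR).T
```

Low-pass filtering and downsampling the Cb and Cr planes of the 4:4:4 result by two in each direction then yields the 4:2:0 format discussed above.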
There are minor variations on equations (2.1) and (2.2) used for converting to YCbCr format. These typically involve scaling of the luma and chroma values so the final samples will fall within a lower range. In digital images and video, the term YCbCr is typically used interchangeably with YUV to refer to the same colour space.

2.2.2 Image Compression

Still image compression involves removing spatial redundancies within an image. This is usually done with a transform, such as the discrete cosine transform (used in JPEG [1]) or the wavelet transform (used in JPEG 2000 [30]). The idea of transform coding is to apply a mathematical transformation to a group of pixels that will concentrate the energy of the signal into a smaller number of coefficients. After applying a transform, many coefficients are typically near zero, and hence can be discarded with minimal impact on the output image when the inverse transform is taken.

By far, the most popular image compression method used in digital cameras is baseline JPEG. JPEG is based on the 2D discrete cosine transform (DCT). The image is divided into 8x8 blocks of pixels, and the DCT of each block is taken. The DCT coefficients are calculated with the following formula:

$$D(p,q) = \frac{C(p)\,C(q)}{4} \sum_{x=0}^{7}\sum_{y=0}^{7} f(x,y)\,\cos\frac{(2x+1)p\pi}{16}\,\cos\frac{(2y+1)q\pi}{16}, \qquad C(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & k > 0 \end{cases} \qquad (2.3)$$

where f(x,y) is the pixel value at position (x,y) in the block. The DCT coefficients, D(p,q), show how much energy the image block has at different spatial frequencies. Higher values of p and q correspond to higher frequencies in the x and y directions. The coefficients are quantized based on how the human visual system perceives different frequencies, with the higher frequency coefficients being more coarsely quantized. The quantized coefficients are scanned in zigzag order, in the direction of increasing frequency. Entropy coding is applied to (run, level) pairs, where level is the value of a quantized coefficient, and run is the number of zero values preceding the coefficient in the zigzag scan.

2.2.3 Video Compression

Due to the large amount of raw data needed to represent video, efficient data compression is critical in systems involving digital video. With a given amount of data storage capacity, more efficient video compression schemes allow either the video quality to be increased or the play time to be extended. Efficient video compression is essential in applications such as personal video players (e.g., DVD), streaming video over the internet, digital camcorders, video conferencing, broadcasting, etc.

A digital video is a time sequence of 2D frames. Video compression attempts to remove redundancies both within a single frame and between frames. Some frames in a video are coded independently of other frames using intra coding methods (I frames). These frames are compressed with still image techniques, typically similar to those used in JPEG.

The key technique that distinguishes video compression from image compression is motion compensation (MC). When MC is used, some frames are coded by predicting the frame from previously coded frames and storing the difference between the actual frame and the prediction. Such frames are called P frames. After forming a prediction for a block of pixels, the residual prediction error is calculated, which is the difference between the actual block and the prediction. Then a transform (e.g., DCT) is usually applied to the residual. The transform coefficients are then quantized, scanned and entropy coded. MC involves dividing a frame into blocks and estimating and coding a displacement vector for each block. The displacement vector tells the decoder which block in a previously coded reference frame to use as a prediction. Using the displacement vector, the decoder can form the same prediction for the block, and by adding the residual to the prediction the original block can be restored.
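The displacement vector itself is typically found by searching the reference frame for the best-matching block. As a sketch (function name and interface are ours, not from the thesis), an exhaustive integer-pel search minimising the sum of absolute differences (SAD):

```python
import numpy as np

def full_search(block, ref, top, left, radius=8):
    """Find the motion vector (dy, dx) within +/-radius that minimises
    the SAD between `block` and a same-sized region of reference frame
    `ref`; (top, left) is the block's position in the current frame."""
    n, m = block.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + m > ref.shape[1]:
                continue  # candidate block falls outside the frame
            sad = np.abs(block.astype(np.int64)
                         - ref[y:y+n, x:x+m].astype(np.int64)).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv
```

Practical encoders use much faster search patterns and fractional-pel refinement, but the encoder/decoder contract is the same: only the vector and the residual are transmitted.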
Figure 2.6 shows a simplified block diagram of a motion compensated, transform based hybrid video encoder. This basic structure is used in most major video coding standards, including MPEG-1, MPEG-2, H.261, H.263 and H.264/AVC.

Figure 2.6: Block diagram of a typical motion compensated hybrid DCT based video encoder.

2.2.4 H.264 Video Compression

H.264/AVC is the latest major video coding standard, and it is being used in applications such as next generation video players (Blu-ray, HD DVD). A number of new coding features were introduced in H.264, the most important of which are summarized below.

The transform used in H.264 is different from previous standards. Instead of using a DCT calculated with floating point arithmetic, as in JPEG and MPEG-2, H.264 uses an integer transform that approximates the DCT. The integer transform matrix only contains the values +1, -1, +2, and -2, so the transform can be calculated using only addition/subtraction and bit shift operations. In baseline H.264, the transform is calculated on blocks of 4x4 pixels, which is smaller than the 8x8 blocks used in most previous standards. The smaller block size helps reduce ringing artifacts.

When coding I frames, instead of taking the transform of blocks of pixels directly, each block is predicted in the spatial domain from previously coded blocks (usually blocks above or to the left of the one being coded). A number of different intra prediction modes are supported; for example, each pixel in the block can be predicted from the pixel outside the block directly above it, to the left, or at a diagonal direction. A DC prediction mode is also supported, where the entire block is predicted based on the average value of the surrounding previously coded pixels. After the prediction has been made, the integer transform is applied to the prediction residual.

Motion compensation (MC) is much more flexible in H.264 than in previous standards (e.g., MPEG-2). In H.264, variable block size motion compensation is supported; blocks from 16x16 pixels down to 4x4 pixels can be used. In addition, H.264 allows the encoder to select from multiple previously coded frames to find the one that will provide the best prediction for the frame being coded. Finally, motion vectors in H.264 can be defined in units of one quarter pixel. This significantly increases the accuracy of MC, resulting in higher compression efficiency. In order to support fractional pixel motion vectors, the reference frames must be interpolated. This is done with a 6 tap FIR (finite impulse response) filter to generate samples at half pixel positions, followed by bilinear interpolation to generate samples at quarter pixel positions.
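The 6 tap filter referenced here is, in the standard, the kernel (1, -5, 20, 20, -5, 1)/32. A sketch of half-pel interpolation along one row (the helper name and interface are ours):

```python
import numpy as np

H264_HALFPEL = np.array([1, -5, 20, 20, -5, 1])  # 6-tap kernel, divided by 32

def halfpel_horizontal(row, x):
    """Half-pel luma sample between integer positions x and x+1 of
    `row` (8-bit samples): 6-tap filter, rounded and clipped."""
    window = row[x-2:x+4].astype(np.int32)       # E, F, G, H, I, J
    val = (int(window @ H264_HALFPEL) + 16) >> 5  # +16 rounds before /32
    return int(np.clip(val, 0, 255))
```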
2.3 Demosaicking Algorithms

Most consumer digital cameras capture colour information with a single light sensor and a colour filter array (CFA). The CFA only allows one light colour (red, green or blue) to reach the sensor at each pixel location. The Bayer pattern [25] is the most commonly used CFA design. Demosaicking is the process of estimating the two missing colour samples at each pixel location.

The simplest demosaicking methods interpolate each colour channel separately. One such technique is bilinear interpolation, where the average of the surrounding samples is used. When bilinear interpolation is used, each missing green sample is calculated as the average of the four surrounding green samples, and each missing red or blue sample is taken as the average of the two or four nearest neighbours, depending on the position. Other standard interpolation methods, such as cubic spline interpolation, can be used to slightly improve the performance when processing each colour channel separately.

The problem with methods that interpolate the colour channels independently is that they usually fail at sharp edges in images, resulting in objectionable colour artifacts. Figure 2.7 illustrates this; it shows a full colour image and the result of using bilinear interpolation to reconstruct the image after it has been sub-sampled with the Bayer pattern. Note how the bilinear image has false colours in regions of high frequency image content and also appears significantly blurred.

Figure 2.7: Original RGB and the result of applying bilinear interpolation to the image when it is sampled with the Bayer pattern.

To overcome the problems caused by simple methods that interpolate the colour channels separately, many adaptive demosaicking algorithms have been developed which exploit the correlation between the colour channels. One class of adaptive demosaicking algorithms is edge-directed interpolation [6]-[8]. These algorithms attempt to preserve edges by calculating gradients from the CFA data and interpolating along the direction (typically either horizontal or vertical) that has the lower gradient.

Figure 2.8: Location numbering of Bayer pattern.

In [6], Hamilton and Adams propose a method where Laplacian second order correction terms are used to enhance the estimates for missing pixels. Referring to Figure 2.8, consider the task of estimating the green sample at the location of sample R2 (we denote the missing green sample G2). In the method in [6] (which we will refer to as 'Laplacian demosaicking'), gradients are calculated as:

$$\begin{aligned}
DH &= |R16 + R24 - 2 \cdot R2| + |G1 - G9| \\
DV &= |R20 + R12 - 2 \cdot R2| + |G7 - G3|
\end{aligned} \qquad (2.4)$$

Bilinear interpolation is carried out in the direction with the lower gradient, or in both directions if the gradients are equal. The Laplacian of the red or blue channel is calculated in the same direction and added to the bilinear estimate. For example, G2 is calculated as follows:

$$G2 = \begin{cases}
\dfrac{G1 + G9}{2} + \dfrac{2 \cdot R2 - R16 - R24}{4}, & DH < DV \\[2ex]
\dfrac{G7 + G3}{2} + \dfrac{2 \cdot R2 - R20 - R12}{4}, & DV < DH \\[2ex]
\dfrac{G1 + G7 + G9 + G3}{4} + \dfrac{4 \cdot R2 - R16 - R20 - R24 - R12}{8}, & \text{otherwise}
\end{cases} \qquad (2.5)$$

In equation (2.5), the left term of every sum is the result of applying bilinear interpolation to the green channel in the lower gradient direction (horizontal, vertical, or both). The right hand term of each sum is the Laplacian of the red channel. The Laplacian is a high-pass filter. The RGB colour planes are typically very highly correlated, especially in high-frequency content, so adding the Laplacian of the red channel to the green channel enhances the bilinear estimate. The red and blue planes are filled in a similar manner (calculating gradients and using directional interpolation with Laplacian terms).
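In array terms, the samples in (2.4) and (2.5) are simply the mosaic values one and two pixels away from the red location: at a red pixel, the horizontal and vertical neighbours at distance one are green and at distance two are red. A sketch of the green estimate under that reading (the function name and bounds handling are our assumptions):

```python
def green_at_red(cfa, i, j):
    """Laplacian demosaicking estimate (equations (2.4)-(2.5)) of the
    missing green at red location (i, j) of the raw mosaic `cfa`."""
    dh = abs(cfa[i, j-2] + cfa[i, j+2] - 2*cfa[i, j]) + \
         abs(cfa[i, j-1] - cfa[i, j+1])
    dv = abs(cfa[i-2, j] + cfa[i+2, j] - 2*cfa[i, j]) + \
         abs(cfa[i-1, j] - cfa[i+1, j])
    if dh < dv:   # interpolate along the row, plus horizontal red Laplacian
        return (cfa[i, j-1] + cfa[i, j+1]) / 2 + \
               (2*cfa[i, j] - cfa[i, j-2] - cfa[i, j+2]) / 4
    if dv < dh:   # interpolate along the column, plus vertical red Laplacian
        return (cfa[i-1, j] + cfa[i+1, j]) / 2 + \
               (2*cfa[i, j] - cfa[i-2, j] - cfa[i+2, j]) / 4
    return (cfa[i, j-1] + cfa[i, j+1] + cfa[i-1, j] + cfa[i+1, j]) / 4 + \
           (4*cfa[i, j] - cfa[i, j-2] - cfa[i, j+2]
            - cfa[i-2, j] - cfa[i+2, j]) / 8
```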
In [9], Laplacian correction terms are also used, but to reduce the number of computations required, the interpolation of the green channel always uses samples in both directions (the last case in equation (2.5)). This eliminates the need for calculating the gradients and performing comparisons, resulting in a very computationally simple method. After the green channel has been filled, the red and blue channels are estimated using bilinear interpolation on the difference between the red/blue channel and the green channel. The signals R-G and B-G are generally much smoother (less high frequency content) than the red and blue channels themselves and are thus more suitable for conventional linear interpolation.

Many demosaicking methods have been developed which are far more advanced than the methods described so far. Wavelet decomposition is used in [10], where the high-frequency sub-band of the green channel is used to iteratively update the high-frequency content in the red and blue channels. Techniques have also been proposed using neural networks [11], Markov random fields [12] and a soft decision process for estimating edge directions [13]. These methods can achieve very good image quality but have high computational complexity. Thus, instead of performing demosaicking directly on the camera when the image is taken, the raw data would be stored on the camera and then transferred to a computer where demosaicking would take place. In this thesis, we focus on methods that run directly on the digital camera, so these advanced algorithms are not discussed in detail.

When video is captured with a CFA device, demosaicking is usually done on each frame independently. However, recent work has explored using motion estimation to improve performance when demosaicking is performed on a video [14],[15]. Through motion estimation, information from other frames can be used to improve the estimate for the missing pixels in each frame, at the expense of greatly increased complexity.

All the demosaicking algorithms described up to this point produce RGB output images. If the image is to be compressed afterwards (it almost always will be), it will be converted into YCbCr 4:2:0 format, which is more suitable for compression. Instead of performing demosaicking in the RGB space and then converting to YCbCr 4:2:0 space afterwards, the demosaicking algorithm can be designed to directly produce YCbCr 4:2:0 output.

Figure 2.9: Conversion from a Bayer cell (G1, R2, B3, G4) to YCbCr 4:2:0 (Y1-Y4, Cb, Cr).

There is one previously published demosaicking algorithm that produces a YCbCr 4:2:0 output image [31]. Their algorithm, entitled "YUV through green interpolation with median filtering post-processing" (YUVGM), starts by creating a full green channel, using the method for interpolating the green channel in [6]. For each 2x2 cell in the Bayer pattern, four luma (Y) samples must be generated, together with one Cb and one Cr sample (Figure 2.9). Let Y(R,G,B), Cb(R,G,B) and Cr(R,G,B) denote functions for the equations in (2.1) for converting from RGB space to YCbCr space. Then, the output samples are calculated by the following equations:

$$\begin{aligned}
Y1 &= Y(R2, G1, B3) \\
Y2 &= Y(R2, G2, B3) \\
Y3 &= Y(R2, G3, B3) \\
Y4 &= Y(R2, G4, B3) \\
G_{avg} &= (G1 + G2 + G3 + G4)/4 \\
Cb1 &= Cb(R2, G_{avg}, B3) \\
Cr1 &= Cr(R2, G_{avg}, B3)
\end{aligned} \qquad (2.6)$$

After the above equations have been evaluated for every Bayer cell, median filtering is performed on the Cb and Cr channels to remove colour artefacts.
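Per cell, (2.6) amounts to a handful of multiply-adds. The following sketch makes that explicit (helper names are ours; the median post-filter is not shown):

```python
def _y(r, g, b):  return  0.299 * r + 0.587 * g + 0.114 * b
def _cb(r, g, b): return -0.1687 * r - 0.3313 * g + 0.5 * b
def _cr(r, g, b): return  0.5 * r - 0.4187 * g - 0.0813 * b

def yuvgm_cell(r2, b3, g1, g2, g3, g4):
    """Equation (2.6): YUVGM output for one Bayer cell. All four luma
    samples reuse the cell's single red and blue value; one Cb/Cr pair
    is computed from the average green."""
    luma = [_y(r2, g, b3) for g in (g1, g2, g3, g4)]
    g_avg = (g1 + g2 + g3 + g4) / 4
    return luma, _cb(r2, g_avg, b3), _cr(r2, g_avg, b3)
```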
Note that the YUVGM method uses zero-order hold interpolation on the red and blue channels, which produces severe false colours around edges. The median filtering post-processing is an ad-hoc and computationally expensive method of reducing the false colours.

2.4 Previous Work on CFA Image and Video Compression

The conventional approach used for compressing images and video generated with single-sensor cameras is to first perform demosaicking and then compress the resulting full-colour data with standard methods (Figure 2.10a). This approach is sub-optimal because demosaicking expands the size of the data to be compressed by a factor of three and introduces further redundancy into the data that will be removed by the compression stage. Instead, compression can be carried out on the CFA data, with demosaicking being performed after decompression (Figure 2.10b) [16]-[19].

Figure 2.10: Flow diagram of CFA compression schemes: (a) Traditional demosaick-first approach. (b) Compression of CFA data prior to demosaicking.

There have been a few papers published on lossy compression of CFA image data prior to demosaicking. The goal of these is to provide better image quality at a given bit rate, allowing a camera to either store higher quality images or store more images at the same quality level. In [16], Lee and Ortega propose a method starting with a modified YCbCr colour space conversion. The conversion is performed on each group of four Bayer pattern samples, and creates two Y samples and one Cb and one Cr sample (Figure 2.11). The equations for the modified conversion are:

$$\begin{aligned}
Yu &= 0.299 \cdot R + 0.587 \cdot Gu + 0.114 \cdot B \\
Yl &= 0.299 \cdot R + 0.587 \cdot Gl + 0.114 \cdot B \\
Cb &= -0.1687 \cdot R - 0.16565 \cdot Gu - 0.16565 \cdot Gl + 0.5 \cdot B \\
Cr &= 0.5 \cdot R - 0.20935 \cdot Gu - 0.20935 \cdot Gl - 0.0813 \cdot B
\end{aligned} \qquad (2.7)$$

When calculating each luma value with (2.7), the corresponding green sample is used together with the red and blue samples. For calculating the chroma values, the average of the two green samples is used along with the red and blue. After the conversion, the chroma samples are arranged into rectangular arrays and then compressed with standard JPEG. The Y samples, which are arranged in a quincunx lattice, are rotated 45 degrees to form a compact rhombus shape, as shown in Figure 2.12. Then the Y samples are compressed with a modified JPEG algorithm, where padding is done at the edges of the Y rhombus to create square 8x8 blocks that can be compressed with the JPEG DCT method.

Figure 2.11: Modified YCbCr conversion performed on the Bayer unit cell in [16].

Figure 2.12: 45 degree rotation of luma samples used in [16].
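A sketch of the per-cell conversion of equation (2.7) (function name ours). Note that the green weight of the chroma equations in (2.1) is simply split evenly between the two greens of the cell:

```python
def modified_ycbcr(r, gu, gl, b):
    """Equation (2.7) [16]: one Bayer cell (R, Gu, Gl, B) maps to two
    luma samples (one per green) and a single Cb/Cr pair in which each
    green carries half of the usual green weight."""
    yu = 0.299 * r + 0.587 * gu + 0.114 * b
    yl = 0.299 * r + 0.587 * gl + 0.114 * b
    cb = -0.1687 * r - 0.16565 * (gu + gl) + 0.5 * b
    cr = 0.5 * r - 0.20935 * (gu + gl) - 0.0813 * b
    return yu, yl, cb, cr
```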
Two CFA image compression methods are proposed by Koh et al. in [17], called 'structure conversion' and 'structure separation'. Both methods start with the modified YCbCr conversion proposed in [16], followed by compressing the chroma channels with JPEG. The methods differ in how they process the luma samples. In the structure conversion method, the two luma samples generated from every Bayer cell are merged into a single column, as shown in Figure 2.13b. This creates a single luma array that has half the size of the CFA data. The process of merging the two samples distorts the image content somewhat. The structure separation method involves separating the even column and odd column luma samples into two arrays, each having size one quarter that of the CFA data (Figure 2.13c). In both methods, after rectangular arrays of luma samples have been formed, they are compressed with standard JPEG.

Figure 2.13: Methods for forming rectangular arrays of luma data in [17]. (a) Original luma arrangement (b) Structure conversion (c) Structure separation.

Other techniques for compressing CFA images include sub-band decomposition [20] and vector quantization of groups of pixels [21]. However, these methods have worse compression efficiency than the JPEG based methods in [16] and [17]. A comparative analysis of the two workflows (compression-first vs. conventional demosaick-first) is presented in [18].

The only previously published work on lossy compression of CFA video is presented in [19] by Gastaldi et al. Their method is based on the structure conversion method in [17]. In [19], the structure conversion process is done in the RGB domain (without first applying a YCbCr colour space conversion), as shown in Figure 2.14. The arrays of green, red and blue samples are compressed with a custom MPEG-2 like coding scheme. A custom coding scheme was developed because no major video coding standard supports input data where one channel (green) has twice the vertical size and the same horizontal size as the other channels (red and blue). The motion vectors from the green channel are used on the red and blue channels, with appropriate scaling and downsampling. An IBBPBBPBBPBB group of pictures structure is used, with JPEG compression used for the I frames and for the residuals of P and B frames.

Figure 2.14: RGB based structure conversion method used in [19].

3 Demosaicking Directly to YCbCr 4:2:0

3.1 Introduction

Single sensor cameras capture colour information using a colour filter array (CFA). This results in a mosaic image being captured, where at each pixel location either a red, green or blue sample is obtained. The two missing colours at each location are interpolated from the surrounding samples in a process known as demosaicking.

As discussed in our overview of demosaicking methods in Section 2.3, virtually all demosaicking algorithms produce an RGB output image. If the image is to be compressed afterwards, it will typically be converted to YCbCr 4:2:0 format. The green channel is the dominant component in determining luma, and the red and blue samples contribute most to the Cr and Cb channels, respectively. Advanced demosaicking methods put a lot of computational effort into reconstructing fine detail in the red and blue colour planes. This is a difficult task because red and blue are more sparsely sampled in the Bayer pattern. It is also somewhat unnecessary, as most of the high frequency detail in those channels is lost in the conversion from RGB to YCbCr 4:2:0 format.
In this chapter, we present a demosaicking method that directly produces a YCbCr 4:2:0 output image. This reduces computational complexity by avoiding the need to perform demosaicking in the RGB domain and then convert to YCbCr 4:2:0 format afterwards.

The rest of this chapter is organized as follows. Our demosaicking method is described in Section 3.2. Simulation results comparing our proposed method against other fast demosaicking algorithms are presented in Section 3.3. A complexity analysis of our method is given in Section 3.4, where the complexity of our proposed method is compared against a very fast RGB based method. Finally, conclusions are given in Section 3.5.

3.2 Proposed Demosaicking Method

The Bayer pattern consists of cells of size 2x2 pixels, each cell containing two green samples, one red sample and one blue sample. To produce YCbCr 4:2:0 output, four luma samples and one Cb and one Cr sample must be generated for each cell. Figure 3.1 shows a 2x2 cell (locations 1-4) and the surrounding Bayer pattern samples that will be used to calculate the luma and chroma samples in the cell. In the following sections we describe how our algorithm generates chroma samples at location 1 (denoted Cb1, Cr1) and luma samples at locations 1-4 (denoted Y1, Y2, Y3, Y4). The steps described are repeated on every 2x2 cell to generate the entire YCbCr 4:2:0 image. The location numbering given in Figure 3.1 will be used throughout the rest of the chapter.

Figure 3.1: Neighbourhood of pixels used for generating the luma and chroma samples in the cell containing positions 1-4.

Our demosaicking algorithm consists of four steps:

1. Generating a full green channel
2. Generating low-pass filtered, down-sampled red-green and blue-green colour difference values (R-G, B-G)
3. Calculating low-pass filtered, down-sampled chroma channels
4. Filling the luma channel

Each step is discussed in turn in the following sections.

3.2.1 Generating a Full Green Channel

Since a full luma channel is needed, and green is the dominant component in determining luma, our method begins by filling the green channel. A complete green channel allows us to estimate a high-quality luma channel.

Many existing demosaicking methods start by generating a complete green channel. We base our method for calculating the missing green samples on the method by Hamilton and Adams [6], which is a popular low-complexity demosaicking algorithm. The idea behind the method for filling the green channel is to calculate horizontal and vertical gradients at the current location, and interpolate along the direction that contains the lower gradient. This results in interpolation being performed along edges rather than across edges. If the gradients are similar in magnitude, interpolation is done using samples in both directions.

There are two cases that need to be considered when calculating the missing green samples: generating a green sample at a red location (R2) or at a blue location (B3). At the red location (R2), the gradients are calculated as:

$$\begin{aligned}
DH &= |R14 + R16 - 2 \cdot R2| + |G1 - G15| \\
DV &= |R7 + R22 - 2 \cdot R2| + |G11 - G4|
\end{aligned} \qquad (3.1)$$

The missing green sample is calculated as follows:

$$G2 = \begin{cases}
\dfrac{G1 + G15}{2} + \dfrac{2 \cdot R2 - R14 - R16}{4}, & DH + T < DV \\[2ex]
\dfrac{G11 + G4}{2} + \dfrac{2 \cdot R2 - R7 - R22}{4}, & DV + T < DH \\[2ex]
\dfrac{G1 + G11 + G15 + G4}{4} + \dfrac{4 \cdot R2 - R7 - R14 - R16 - R22}{8}, & \text{otherwise}
\end{cases} \qquad (3.2)$$

The second term in each sum in (3.2) is the second order gradient (Laplacian) of the red channel.
The red, green and blue channels are typically very highly correlated, 30 especially in high frequency content [10], so adding the Laplacian term improves the bilinear interpolation [32]. A threshold T ' is used to ensure that the gradients are sufficiently different for interpolation to happen in only one direction. If the difference between the gradients is less then T (the last case in equation (3.2)), the interpolation uses samples in both directions. A threshold is not used in the method by Hamilton and Adams [6]. If the threshold is set to zero, the method for filling the green channel would be identical to their method. Experimentally, we found a threshold of 35 to provide good results for a wide range of images. At the blue location (B3), the gradients and missing green sample are calculated analogously to the red location using the following equations: DH=|B17 + B19-2 -B3 | + |G18-G4| DV= IBIO + B24 - 2 • B3| + |G1 - G21| (3.3) i f (DH + T < D V ) ^ G18 + G 4 2 B 3 - B 1 7 - B 1 9 G 3 = + • 2 4 else i f (DV + T < D H ) ^ ^ G1 + G21 2 - B 3 - B 1 0 - B 2 4 G 3 = + else G1 + G 4 + G18 + G21 4 - B 3 - B 1 7 - B 1 9 - B 1 0 - B 2 4 G 3 = + 3.2.2 Calculating Low-pass R-G and B-G Samples Since the conversion from R G B to Y C b C r is a linear process, we can equivalently perform loss-pass filtering in the R G B domain rather than the Y C b C r domain. We apply 31 low-pass filtering to generate R - G and B - G values located at the location of G l in Figure 3.1. Instead of performing interpolation on the red and blue channels themselves, we perform interpolation on the difference between green and red or blue. The R - G and B - G images are generally much smoother than the R and B channels, so they are more suitable for interpolation [9]. From the low-pass R - G and B - G values, we can generate low-pass chroma samples, as wi l l be explained in the next section. The following 2D filter is used on the R - G channel: 1/8 0 1/8 0 0 0 1/4 0 1/4 0 0 0 1/8 0 1/8 (3.5) This filter provides low-pass filtering in both the horizontal and vertical directions, while only using positions that have red samples available. The filter in equation (3.5) is separable, and equivalent to using the following two filters in the horizontal and vertical directions: hhor=[\/2 0 1/2] _ Ken =[l /4 0 1/2 0 1/4] The vertical filter in (3.6) provides stronger low-pass filtering than the horizontal filter, as it has two zeros at n radians/sample rather than just one. However, due to the required positioning of the output, a filter with an even number of taps is needed in the 32 horizontal direction. Using a four tap horizontal filter would require significantly more operations to implement. 
The resulting equation for calculating the low-pass R-G value, denoted KR, is:

$$KR1_{lp} = (R-G) * h_{KR} = \frac{R14 - G14 + R2 - G2}{4} + \frac{R5 - G5 + R7 - G7 + R20 - G20 + R22 - G22}{8} \qquad (3.7)$$

For calculating a low-pass B-G sample at location 1, an equivalent filter to that of equation (3.5) is used, only this time it is rotated 90 degrees due to the different positioning of the blue samples in the Bayer pattern:

$$h_{KB} = \begin{bmatrix}
1/8 & 0 & 1/4 & 0 & 1/8 \\
0 & 0 & 0 & 0 & 0 \\
1/8 & 0 & 1/4 & 0 & 1/8
\end{bmatrix} \qquad (3.8)$$

The low-pass KB sample at location 1 is calculated equivalently to the low-pass KR sample using:

$$KB1_{lp} = (B-G) * h_{KB} = \frac{B10 - G10 + B3 - G3}{4} + \frac{B8 - G8 + B12 - G12 + B17 - G17 + B19 - G19}{8} \qquad (3.9)$$

3.2.3 Calculating Down-sampled Chroma Channels

The conversion from RGB to YCbCr space is linear, so instead of performing linear low-pass filtering on the Cb and Cr channels, the filtering can be equivalently performed on the RGB samples:

$$\begin{aligned}
Cb_{lp} &= -0.1687 \cdot R * h - 0.3313 \cdot G * h + 0.5 \cdot B * h \\
Cr_{lp} &= 0.5 \cdot R * h - 0.4187 \cdot G * h - 0.0813 \cdot B * h
\end{aligned} \qquad (3.10)$$

where h is the low-pass filter used to limit aliasing after downsampling, Cb_lp and Cr_lp are the low-pass chroma channels, and * denotes 2D convolution. The equations in (3.10) can easily be rewritten in terms of the colour differences R-G and B-G using the linear property of convolution:

$$\begin{aligned}
Cb_{lp} &= -0.1687 \cdot (R-G) * h + 0.5 \cdot (B-G) * h \\
Cr_{lp} &= 0.5 \cdot (R-G) * h - 0.0813 \cdot (B-G) * h
\end{aligned} \qquad (3.11)$$

By allowing different low-pass filters to be applied to the R-G and B-G signals in (3.11), the chroma values at location 1 can be calculated using the KR and KB samples generated with equations (3.7) and (3.9):

$$\begin{aligned}
Cb1 &= -0.1687 \cdot KR1_{lp} + 0.5 \cdot KB1_{lp} \\
Cr1 &= 0.5 \cdot KR1_{lp} - 0.0813 \cdot KB1_{lp}
\end{aligned} \qquad (3.12)$$

In our method, equation (3.12) is used to calculate the final chroma samples from the low-pass filtered, down-sampled KR and KB values.
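Putting steps 2 and 3 together for the cell whose G1 sample sits at array position (i, j) (with R2 at (i, j+1) and B3 at (i+1, j), as in Figure 3.1), a sketch under our own indexing and storage assumptions:

```python
def cell_chroma(d_rg, d_bg, i, j):
    """Equations (3.7), (3.9) and (3.12) for one 2x2 cell.

    `d_rg` holds R-G at red locations and `d_bg` holds B-G at blue
    locations (greens come from step 1); other entries are unused.
    (i, j) is the position of the cell's G1 sample."""
    kr = (d_rg[i, j-1] + d_rg[i, j+1]) / 4 + \
         (d_rg[i-2, j-1] + d_rg[i-2, j+1] +
          d_rg[i+2, j-1] + d_rg[i+2, j+1]) / 8        # equation (3.7)
    kb = (d_bg[i-1, j] + d_bg[i+1, j]) / 4 + \
         (d_bg[i-1, j-2] + d_bg[i-1, j+2] +
          d_bg[i+1, j-2] + d_bg[i+1, j+2]) / 8        # equation (3.9)
    cb = -0.1687 * kr + 0.5 * kb                      # equation (3.12)
    cr = 0.5 * kr - 0.0813 * kb
    return kr, kb, cb, cr
```

The KR and KB values are kept, since the luma samples of the cell reuse them in the next step.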
3.2.4 Calculating the Full Luma Channel

Once the down-sampled chroma values have been calculated, the only task left is to generate the full luma channel. Since different samples are available at each location, we consider the task of generating luma samples at locations 1 through 4 separately.

At the location of G1, we already have G, KR and KB samples available. By rearranging the equation for luma in (2.1) in terms of R-G and B-G, the Y1 sample can be calculated with:

$$Y1 = 0.299 \cdot KR1_{lp} + G1 + 0.114 \cdot KB1_{lp} \qquad (3.13)$$

Note that we are using low-pass KR and KB values, when ideally unfiltered values should be used. However, since the green sample has not been filtered and green is the dominant component in calculating luma, the value calculated with equation (3.13) is still a good estimate.

At the location of R2, we have red and green samples available. An assumption often made in demosaicking methods is that chroma varies smoothly in natural images, so bilinear interpolation provides a good estimate for missing chroma samples. Using this assumption, an estimate for the blue chroma at R2 is found as:

$$Cb2 = \frac{Cb1 + Cb15}{2} \qquad (3.14)$$

The Cb2 sample in (3.14) does not need to be calculated and stored, but the equation will be used for deriving an expression for Y2. We would like to calculate the luma value at location 2 using R2, G2 and Cb2. This can be done by substituting the equation for B in equation (2.2) into the Y definition in equation (2.1):

$$Y2 = 0.299 \cdot R2 + 0.587 \cdot G2 + 0.114 \cdot (Y2 + 1.772 \cdot Cb2) \qquad (3.15)$$

Further substituting the estimate for Cb2 given by (3.14) and solving for Y2 yields:

$$Y2 = 0.3375 \cdot R2 + 0.6625 \cdot G2 + 0.114 \cdot (Cb1 + Cb15) \qquad (3.16)$$

At location 3, green and blue samples are available. Using an analogous method as described for calculating Y2, only now with bilinear interpolation performed on Cr samples, Y3 is calculated as:

$$Y3 = 0.299 \cdot (Cr1 + Cr21) + 0.8374 \cdot G3 + 0.1626 \cdot B3 \qquad (3.17)$$

At location 4, only a green sample is available, so here we use bilinear interpolation on both the Cb and Cr channels to calculate Y4. Substituting the R and B equations from (2.2) into the definition of luma gives:

$$Y4 = 0.299 \cdot (Y4 + 1.402 \cdot Cr4) + 0.587 \cdot G4 + 0.114 \cdot (Y4 + 1.772 \cdot Cb4) \qquad (3.18)$$

Using bilinear interpolation on the four surrounding samples to estimate Cb4 and Cr4, and solving for Y4, yields:

$$Y4 = 0.1785 \cdot (Cr1 + Cr15 + Cr21 + Cr23) + G4 + 0.086 \cdot (Cb1 + Cb15 + Cb21 + Cb23) \qquad (3.19)$$

3.2.5 Summary of Complete Algorithm

Our complete demosaicking algorithm for producing YCbCr 4:2:0 output consists of the following steps. Each step must be carried out on every 2x2 cell before proceeding to the next step.

1. Fill the missing green samples with equations (3.1), (3.2), (3.3) and (3.4).
2. Using (3.7) and (3.9), find low-pass R-G and B-G values.
3. Using (3.12), calculate the final Cb and Cr samples.
4. Fill the luma channel with equations (3.13), (3.16), (3.17) and (3.19).
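Combining the four cases, one cell's luma samples can be computed entirely from values already produced in steps 1-3. A sketch (names ours; cb/cr are indexed by the neighbouring cell labels 1, 15, 21, 23 of Figure 3.1, i.e. this cell and its right, lower and lower-right neighbours):

```python
def cell_luma(g1, g2, g3, g4, r2, b3, kr1, kb1, cb, cr):
    """Equations (3.13), (3.16), (3.17) and (3.19) for one 2x2 cell.
    `cb` and `cr` map the cell labels {1, 15, 21, 23} to the chroma
    samples of this cell and its neighbours."""
    y1 = 0.299 * kr1 + g1 + 0.114 * kb1
    y2 = 0.3375 * r2 + 0.6625 * g2 + 0.114 * (cb[1] + cb[15])
    y3 = 0.299 * (cr[1] + cr[21]) + 0.8374 * g3 + 0.1626 * b3
    y4 = (0.1785 * (cr[1] + cr[15] + cr[21] + cr[23]) + g4
          + 0.086 * (cb[1] + cb[15] + cb[21] + cb[23]))
    return y1, y2, y3, y4
```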
3.3 Experimental Results

The 24 RGB images from the Kodak set were used in our experiments. These images have been used extensively in demosaicking research. Thumbnails of the images are provided in Figure 3.2. CFA images were obtained by sampling the RGB images with the Bayer pattern.

Figure 3.2: Test images used in evaluating demosaicking algorithm performance, numbered 1-24 from top left to bottom right.

We compared our proposed method against the YUV method in [31] and against fast demosaicking methods that operate in RGB space. The RGB methods tested were bilinear interpolation and the methods in [6] and [9]. The quality of the different methods is evaluated by comparing the demosaicked image to the original full colour image, as illustrated in Figure 3.3.

Figure 3.3: Procedure used for measuring demosaicking algorithm performance.

Here objective quality is measured in the YCbCr 4:2:0 domain using the peak signal-to-noise ratio (PSNR). The PSNR (in dB) of a demosaicked image channel Idem(i,j) with 8-bit precision is calculated as:

PSNR = 10 · log10( 255^2 / MSE ),  MSE = (1/(M·N)) · Σ_{i=0..M-1} Σ_{j=0..N-1} ( Idem(i,j) - Iref(i,j) )^2    (3.20)

where i denotes the row, j the column, M is the height of the channel, N is the width of the channel, and Iref(i,j) is the reference channel against which quality is measured. The reference images were obtained by converting the full colour RGB images to YCbCr space with (2.1), filtering the Cb and Cr channels with a 9-tap FIR low-pass filter and downsampling. The 9-tap filter closely approximates an ideal low-pass filter with cut-off 1.571 radians/sample (π/2), so the reference images contain very little aliasing in the down-sampled chroma channels.

For the demosaicking methods that operate in RGB space, the following low-complexity filter was used for filtering the chroma channels in the downsampling process:

h = [1/4  1/2  1/4]    (3.21)

This filter provides a good tradeoff between complexity and limiting aliasing.

Tables 3.1, 3.2 and 3.3 show the PSNR (in dB) obtained with each demosaicking method in the Y, Cb and Cr channels, respectively. In almost all cases, the proposed method gives higher PSNR than the other methods. On average, the proposed method gives about 1 dB higher PSNR in each channel than the RGB based methods in [6] and [9]. For all images, the proposed method performs far better (over 5 dB higher PSNR in luma) than the only other YUV 4:2:0 based demosaicking method, presented in [31].

Image     Bilinear   Method in [6]   Method in [9]   YUV Method in [31]   Proposed
1         29.58      35.99           35.97           30.90                37.43
2         36.31      41.68           41.28           37.04                42.21
3         37.45      43.44           43.40           38.17                44.32
4         36.82      41.71           42.33           37.65                42.52
5         29.60      37.49           37.09           30.85                38.36
6         31.04      37.47           37.38           32.53                38.87
7         36.59      43.60           42.42           36.88                43.92
8         27.08      34.76           33.11           28.55                35.85
9         35.67      42.75           41.55           36.57                43.81
10        35.57      42.63           42.36           36.78                43.58
11        32.33      38.57           38.44           33.58                39.65
12        36.82      43.32           42.47           38.12                43.93
13        26.90      32.18           33.51           28.18                33.57
14        32.23      38.50           38.16           33.07                39.09
15        35.98      40.32           41.43           36.64                41.26
16        34.62      40.94           40.42           36.04                42.24
17        35.17      41.04           41.50           36.31                42.11
18        31.06      36.51           37.48           32.15                37.71
19        31.49      39.75           37.38           32.94                40.94
20        34.78      41.23           41.02           35.83                42.40
21        31.69      37.80           38.00           32.90                39.18
22        33.67      39.24           39.33           34.88                40.35
23        38.21      44.58           43.75           38.69                44.86
24        29.90      35.03           36.36           31.06                35.88
Average   33.36      39.60           39.42           34.43                40.58

Table 3.1: Y-PSNR comparison of different demosaicking methods (dB)

Image     Bilinear   Method in [6]   Method in [9]   YUV Method in [31]   Proposed
1         35.03      40.03           39.37           37.85                42.68
2         41.73      45.01           44.25           42.46                45.48
3         42.76      45.46           45.36           41.85                45.94
4         43.45      45.69           46.07           43.31                47.56
5         37.22      40.87           40.73           36.61                41.59
6         36.17      41.30           40.63           38.29                43.35
7         42.88      45.60           44.67           39.87                45.40
8         32.05      38.44           35.73           33.49                40.35
9         41.19      45.22           44.06           40.84                45.38
10        41.53      45.61           45.21           41.64                46.17
11        37.86      42.53           41.84           40.11                44.36
12        41.97      46.35           45.15           43.01                46.60
13        32.78      36.18           37.08           35.50                38.90
14        37.75      40.60           40.03           36.09                40.35
15        42.01      43.61           44.93           41.58                45.81
16        39.10      44.32           43.15           41.03                45.68
17        41.78      44.44           44.68           41.40                45.27
18        37.09      40.29           40.76           37.83                41.75
19        36.78      43.00           40.37           37.16                43.85
20        40.64      43.91           43.45           40.73                44.38
21        37.12      41.34           41.12           38.69                43.26
22        39.12      42.17           41.80           39.15                42.57
23        44.66      46.79           45.87           41.62                45.46
24        35.23      38.04           38.95           36.15                39.53
Average   39.08      42.78           42.30           39.43                43.82

Table 3.2: Cb-PSNR comparison of different demosaicking methods (dB)

Image     Bilinear   Method in [6]   Method in [9]   YUV Method in [31]   Proposed
1         35.97      41.36           39.33           37.83                42.36
2         39.88      42.04           40.62           38.55                41.53
3         44.34      46.36           44.24           42.16                46.01
4         41.18      42.36           41.40           39.13                41.90
5         36.74      42.02           39.06           37.24                42.18
6         38.93      42.77           40.77           39.24                43.76
7         42.51      46.23           43.65           40.51                45.89
8         30.25      39.29           35.19           34.03                40.03
9         40.88      46.34           43.66           42.05                46.65
10        42.03      46.30           44.23           42.24                46.45
11        38.02      42.62           40.58           39.29                42.82
12        42.70      46.10           44.27           42.51                46.03
13        34.78      38.57           38.19           37.01                40.28
14        37.92      40.29           38.41           36.37                39.57
15        39.77      41.03           40.69           38.46                41.05
16        43.35      45.92           43.92           42.13                46.68
17        41.48      46.09           44.55           43.10                46.87
18        37.58      41.40           40.24           38.57                42.05
19        36.36      44.24           39.80           38.05                44.48
20        40.53      45.86           43.03           42.60                46.41
21        38.86      43.15           41.22           39.92                44.19
22        38.27      42.23           41.00           39.01                42.36
23        43.91      45.97           43.56           40.98                45.28
24        37.12      40.32           39.72           37.58                40.71
Average   39.31      43.29           41.31           39.52                43.56

Table 3.3: Cr-PSNR comparison of different demosaicking methods (dB)
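The figures in Tables 3.1-3.3 follow directly from the measurement procedure of Figure 3.3 and equation (3.20). A minimal numpy sketch of that measurement (assuming 8-bit channels stored as arrays; the function names and interface are ours, not from the thesis):

```python
import numpy as np

def psnr(a, b):
    """Equation (3.20) for one 8-bit channel."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def evaluate(demosaick, cfa, ref_y, ref_cb, ref_cr):
    """Run a demosaicking method on a CFA image and report (Y, Cb, Cr)
    PSNR against the reference 4:2:0 channels; `demosaick` is any
    function returning Y, Cb and Cr planes."""
    y, cb, cr = demosaick(cfa)
    return psnr(y, ref_y), psnr(cb, ref_cb), psnr(cr, ref_cr)
```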
Since PSNR is not always an accurate measure of perceived image quality, we also provide images for subjective quality comparisons. Figure 3.4 and Figure 3.5 show cropped portions of images 6 and 8, respectively, and the result of applying each demosaicking method to the image. For the RGB based demosaicking methods, the images have been converted to YCbCr 4:2:0 format using the filter in (3.21) for the chroma downsampling process. Close visual inspection of the images shows that the proposed method produces fewer colour artifacts and results in less blurring of edges than the other methods. Also note that despite providing competitive PSNR, the method in [9] produces unpleasant zipper artifacts along some edges (Figure 3.4d, Figure 3.5d).

3.4 Complexity Analysis

A key advantage of the proposed method is the computational complexity saved by directly producing YCbCr 4:2:0 output rather than performing demosaicking in RGB space and then converting to YCbCr 4:2:0. Table 3.4 shows a summary of the number of operations per pixel in the CFA image needed for the proposed demosaicking method. Note that there are some fractional values in Table 3.4 because many equations are not evaluated at every pixel location. The number of operations performed when evaluating equations (3.2) and (3.4) varies depending on the result of the comparisons; only the worst case complexity is shown in Table 3.4.

For comparison, Table 3.5 presents a complexity analysis of the method in [9], which is one of the lowest complexity RGB based demosaicking algorithms reported in the literature. This table shows the number of operations required for demosaicking with the method in [9] and then converting to YCbCr 4:2:0 format. In this analysis, the simple filter in equation (3.21) is used for limiting aliasing in the Cb and Cr channels, and an efficient downsampling scheme is used (where the filtering operations are performed at the lower sampling rate).

Step                                Addition   Shift   Multiplication   Absolute Value   Comparison
Green interpolation (worst case)    9          2.5     0                2                1
Low-pass KR, KB                     5.5        1       0                0                0
Generating chroma                   0.5        0       1                0                0
Calculating luma                    4          0       2.5              0                0
Total (worst case)                  19         3.5     3.5              2                1

Table 3.4: Number of operations per pixel required for the proposed method.

Step                                        Addition   Shift   Multiplication   Absolute Value   Comparison
Green interpolation                         4          1.5     0                0                0
Red interpolation                           4          0.75    0                0                0
Blue interpolation                          4          0.75    0                0                0
RGB to YCbCr conversion                     6          0       9                0                0
Filtering and downsampling rows, Cr         1          1       0                0                0
Filtering and downsampling columns, Cr      1          1       0                0                0
Filtering and downsampling rows, Cb         0.5        0.5     0                0                0
Filtering and downsampling columns, Cb      0.5        0.5     0                0                0
Total                                       21         6       9                0                0

Table 3.5: Number of operations per pixel required for the demosaicking method in [9] plus conversion to YCbCr 4:2:0 format.

Comparison of Tables 3.4 and 3.5 shows that the proposed demosaicking method has lower complexity than the method in [9]. In the worst case, our method requires slightly fewer additions and shift operations. More importantly, the proposed method uses far fewer multiplication operations, which are expensive to implement in the low-cost DSP (digital signal processor) chips typically used in digital cameras. The multiplications are required for the RGB to YCbCr conversion, so our proposed method uses fewer multiplications than performing any RGB based demosaicking method and subsequently converting to YCbCr space. The demosaicking method in [6] has considerably higher complexity than the method in [9], so our method has much lower complexity than that of [6].
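The chroma downsampling rows of Table 3.5 assume that the separable filter of (3.21) is evaluated only at the retained samples. The following is our own sketch of such a polyphase implementation (edge replication at the borders and even channel dimensions are assumptions); note that the 1/4 and 1/2 coefficients are implementable as shifts, which is how they are counted in the table:

```python
import numpy as np

def downsample_2x(ch):
    """Separable [1/4 1/2 1/4] downsampling per (3.21), with the filter
    evaluated only at the retained (even) positions."""
    p = np.pad(ch, 1, mode='edge')
    # vertical pass, computed only at the kept rows
    v = 0.25 * p[0:-2:2, 1:-1] + 0.5 * p[1:-1:2, 1:-1] + 0.25 * p[2::2, 1:-1]
    # horizontal pass, computed only at the kept columns
    p = np.pad(v, ((0, 0), (1, 1)), mode='edge')
    return 0.25 * p[:, 0:-2:2] + 0.5 * p[:, 1:-1:2] + 0.25 * p[:, 2::2]
```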
We are not aware of any demosaicking methods with complexity lower than or equal to that of [9] that provide comparable image quality.

3.5 Conclusions

In this chapter, we have presented a fast demosaicking algorithm that directly produces YCbCr 4:2:0 output. Our method saves considerable computational complexity by avoiding the need to perform demosaicking in the RGB colour space and then convert from RGB to YCbCr 4:2:0 format. The proposed method generates a full green channel, and low-pass filtered, down-sampled red and blue samples. The green channel contains the fine detail needed to generate a high quality luma channel, while the low-pass R-G and B-G values allow us to directly compute low-pass, down-sampled chroma channels. The proposed method achieves much higher PSNR than the only other demosaicking method that produces luma and chroma output. It also achieves better quality than fast RGB based demosaicking methods, with lower complexity than performing demosaicking in RGB space and then converting to YCbCr 4:2:0 format.

Our demosaicking method prepares the image (or video) for compression. Thus, it would be used in the conventional camera workflow of first performing demosaicking and then compressing the image/video with standard methods. In performing demosaicking to YCbCr 4:2:0 format, the number of samples is increased by a factor of 1.5, which is somewhat undesirable. To avoid this, in the next chapter we present methods for compressing a CFA video prior to demosaicking, taking advantage of the smaller data size.

4 H.264 Based Compression of Bayer Pattern Video

4.1 Introduction

The conventional approach to compressing CFA data (still image or video) is to first perform demosaicking and then compress the resulting full colour data. This approach is sub-optimal because the amount of data is expanded by a factor of three in the demosaicking stage, which increases the compression processing time and introduces redundancy that the compression stage must remove. To avoid these problems, compression can be carried out on the CFA data prior to demosaicking.

In this chapter two new methods are proposed for compressing CFA video data. One uses the H.264 video coding standard and one uses a modified version of H.264. H.264 is the latest major video coding standard and it provides significant improvements in coding efficiency over previous standards such as MPEG-2. Basing our methods on H.264 allows us to exploit the latest powerful video coding tools. Our first proposed method compresses the CFA video with standard H.264 and achieves better quality (measured with mean-square error) than the demosaick-first approach at high bit-rates. Our second method further increases compression efficiency by introducing a modified motion compensation scheme into H.264, alleviating problems that arise due to aliasing in the CFA data. Both methods are suitable for devices such as digital camcorders where video is encoded at high quality.

The rest of the chapter is organized as follows. In Section 4.2, aliasing in CFA data and its negative effect on video coding are discussed, providing the motivation for our modified motion compensation scheme. The proposed methods for compressing CFA video data are described in Section 4.3. Simulation results showing the performance of our methods relative to the conventional demosaick-first approach are presented in Section 4.4, along with a comparison of the complexity of the different approaches. Conclusions are presented in Section 4.5.
4.2 Impact of Aliasing on CFA Video Coding

Aliasing in video has been shown to negatively impact the coding of P frames [33]. If there is movement between frames, the effect of aliasing will be different in each frame. This results in low correlation between frames, and hence large P-frame sizes. The negative effect of aliasing can be reduced by using sub-pixel accurate motion vectors together with adaptive interpolation filters [34].

CFA data can contain severe aliasing [35]. Single sensor cameras usually use an optical filter to limit aliasing [5]. When selecting how much filtering to use, there is a trade-off between limiting aliasing and capturing fine image detail. Furthermore, since the Bayer pattern contains more green samples than red or blue, different amounts of filtering are needed to limit aliasing in the different colours. If enough filtering were used so that there was little aliasing in the red and blue channels, significant detail would be lost from the green channel, which is undesirable and defeats the purpose of sampling green at a higher rate than red or blue.

An assumption sometimes made in demosaicking research is that enough optical filtering is done so that, if a full colour image were captured, it would contain negligible aliasing; however, sampling with the Bayer CFA introduces aliasing [36]. Other work makes the assumption that significant aliasing occurs in the red and blue channels, but not in the green [10],[37].

The effect of aliasing in CFA video is illustrated in Figure 4.1, which shows a frame of red data from the "Mobile and Calendar" video, and a blowup of the highlighted region over four successive frames. Over these frames, the calendar is moving vertically. Notice how the moving portion, especially the number '3', looks considerably different in each frame due to aliasing.

Figure 4.1: A frame of the red channel from the "Mobile and Calendar" video (top), and a blowup of the highlighted region over four successive frames (bottom), illustrating the effect of aliasing and the low correlation between frames.

Advanced demosaicking algorithms attempt to reduce the effects of aliasing in each colour channel by using information from the other colour channels [9],[10],[37],[38]. An example of this is shown in Figure 4.2, which presents the four frames of red data in Figure 4.1 after demosaicking has been performed with the method presented in [38].

Figure 4.2: The four frames of red data in Figure 4.1 after demosaicking has been performed.

Demosaicking increases the amount of red data by a factor of four, so the frames in Figure 4.2 are bigger than in Figure 4.1. We observe that there is considerably more temporal correlation between the frames after demosaicking than there is in the original CFA data. Since each CFA frame is a subset of the corresponding demosaicked frame, the CFA frames can be effectively predicted from the demosaicked versions of the other frames.
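The decorrelation that Figure 4.1 illustrates can be reproduced in a few lines. The following is our own toy sketch, not from the thesis: a high-detail one-dimensional "texture" moves by exactly one pixel per frame, so full-resolution motion compensation predicts the next frame perfectly, but the stride-2 "red" subsamples of the two frames interleave different pixels, and no integer shift of one matches the other.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal(256)            # high-detail 1-D "texture"
f0 = base[1:-1]                            # frame 0
f1 = base[0:-2]                            # frame 1: texture moved right by 1
r0, r1 = f0[0::2], f1[0::2]                # "red" CFA sites: every 2nd pixel

# full-resolution MC: a 1-pixel shift predicts frame 1 exactly
print(np.abs(f1[1:] - f0[:-1]).mean())     # 0.0
# subsampled MC: no integer shift of r0 predicts r1 (samples interleave)
print(min(np.abs(r1[4:-4] - np.roll(r0, s)[4:-4]).mean()
          for s in range(-3, 4)))          # large residual
```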
4.3 Proposed Methods for CFA Video Compression

Both of our proposed methods involve dividing the CFA data into separate arrays of green, blue and red data, which are compressed in 4:2:2 sampling mode. In our first method, standard H.264 is used for compressing the arrays of red, green and blue. In our second method, a modified motion compensation (MC) scheme is also applied, where demosaicking is performed on the reference frames within the encoder and decoder to reduce the negative effects of aliasing on P-frames. The method of creating rectangular arrays of each colour and arranging the data for compression with H.264 is described in Section 4.3.1. The modified MC scheme used in our second method is described in Section 4.3.2.

4.3.1 Pixel Rearranging

Most video compression standards, including H.264, can only compress video of rectangular shape, so in order to compress the CFA data using H.264, the pixels must be rearranged into rectangular arrays. The red and blue data are sampled in a rectangular manner, so they can easily be separated into arrays one quarter the size of the Bayer data. In [17], two options for creating rectangular arrays of the quincunx sampled data are proposed (Figure 4.3). These methods are applied to luma samples after a colour space conversion in [17], but they can also be applied to the green samples directly.

If a frame of Bayer pattern data has dimensions MxN (height by width), the structure separation method involves separating the green data into two arrays of size (M/2)x(N/2), one containing the even column samples and the other containing the odd column samples (Figure 4.3c). Implementing this method in a video codec would require extensive modifications to allow four channels to be compressed together rather than the usual three. Also, downsampling the green data into two separate channels introduces further aliasing in each channel (in addition to the aliasing introduced by Bayer sampling the green channel). In [17] a filter is applied to the green data before downsampling to limit the aliasing. However, applying filtering is undesirable since it removes high-frequency detail which cannot be recovered later.

Figure 4.3: Methods for forming rectangular arrays of luma data in [17]: (a) original luma arrangement, (b) structure conversion, (c) structure separation.

The structure conversion method in [17] involves merging the two green samples from every group of four Bayer pattern samples into a single column, resulting in a green channel of size Mx(N/2). This method does not suffer from the aliasing problems of the structure separation method; however, the merging process does distort the data somewhat. In [19] the structure conversion method is used for compressing video: a green channel of size Mx(N/2) is compressed together with red and blue channels of size (M/2)x(N/2). Since the relative dimensions of the three colour channels do not correspond to any sampling scheme supported in major video coding standards, a custom codec was used, where the motion vectors from the green channel are reused on the red and blue channels, with appropriate downsampling and scaling of the motion vectors.

We propose a different structure conversion method, where the samples are merged into rows. Let f(i,j) be the value of the CFA data at spatial location (i,j) within the image, where i denotes the row and j the column. Let R(i,j), G(i,j) and B(i,j) denote the arrays of red, green and blue data after conversion to separate arrays. The colour arrays can be expressed in terms of the CFA data by the following equations:

G(i,j) = f(2i, j)      for j even
G(i,j) = f(2i+1, j)    for j odd
R(i,j) = f(2i, 2j+1)
B(i,j) = f(2i+1, 2j)    (4.1)

The array G(i,j) has dimensions (M/2)xN, while B(i,j) and R(i,j) have dimensions (M/2)x(N/2), as illustrated in Figure 4.4. This approach allows the data to be compressed in 4:2:2 sampling mode, with the green data in the luma channel and the red and blue data in the chroma channels.
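The rearrangement in (4.1) amounts to a few slicing operations per frame. The following sketch is our own illustration (numpy, 0-indexed coordinates, even frame dimensions, and the GRBG layout of Figure 4.4 are assumptions):

```python
import numpy as np

def split_cfa(f):
    """Structure conversion of equation (4.1): green rows are merged so
    that G is (M/2)xN while R and B are (M/2)x(N/2), ready to be fed to
    a 4:2:2 encoder as the luma and two chroma planes."""
    m, n = f.shape
    g = np.empty((m // 2, n), dtype=f.dtype)
    g[:, 0::2] = f[0::2, 0::2]          # G(i,j) = f(2i, j),   j even
    g[:, 1::2] = f[1::2, 1::2]          # G(i,j) = f(2i+1, j), j odd
    r = f[0::2, 1::2]                   # R(i,j) = f(2i, 2j+1)
    b = f[1::2, 0::2]                   # B(i,j) = f(2i+1, 2j)
    return g, r, b
```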
Since 4:2:2 sampling support was added to H.264 with the Fidelity Range Extensions (FRExt) [39], the CFA data can be compressed with H.264, achieving the same effect as in [19] (reusing motion vectors from the green channel on the red and blue data) without the need for a custom codec. Our first proposed method simply consists of compressing the green, blue and red arrays given by (4.1) with standard H.264 using 4:2:2 sampling.

Figure 4.4: Conversion from mosaic data into separate R, G and B arrays used in our CFA video compression methods.

Our first proposed method has the advantage of using standard H.264. In the following section we describe a second proposed method, which increases coding efficiency by using a modified motion compensation scheme, at the expense of increased complexity.

4.3.2 Modified Motion Compensation

As discussed in Section 4.2, aliasing in CFA data negatively affects P-frame coding. In our second proposed method we minimize this problem by performing demosaicking on the reference frames in the encoder and decoder for the purposes of motion compensation. Each P-frame of CFA data is predicted from demosaicked reference frames, providing a better prediction for the frame being coded and hence lowering the bit-rate.

In H.264, I and P frames are used for predicting other frames of the video. After a frame has been encoded, it is decoded within the encoder, and the decoded version of the frame is used for prediction. In our method, rather than directly using the decoded frame for prediction, demosaicking is first performed on the decoded frame, and the demosaicked frame is used for prediction. This increases the size of the green data by a factor of two and the red and blue data by a factor of four. This is illustrated in Figure 4.5a, which shows a block of pixels from a decoded frame and the block after demosaicking. The numbers in Figure 4.5a indicate the location of each of the original pixels in the demosaicked frame; the unlabeled pixels are generated in the demosaicking process.

Any demosaicking method could be used on the reference frames. The choice of a particular method would depend on the application and the amount of complexity that can be tolerated. Different demosaicking methods are evaluated for this purpose in Section 4.4.2.

Figure 4.5: (a) Performing demosaicking on a block of pixels from a reference frame. (b) Sampling the demosaicked reference frame to obtain a prediction for the block when the motion vector is (1,0) in full pel units.
After demosaicking, all three colour channels are up-sampled by a factor of four in the horizontal and vertical directions to support quarter-pel accurate motion vectors. This is done using the method defined in the H.264 standard for upsampling the luma channel (where a 6-tap FIR filter generates the half-pel samples and bilinear interpolation generates the quarter-pel samples). RGB data has properties more similar to luma than to chroma, so the luma upsampling method is used to give better performance.

In the H.264 reference encoder, motion estimation (ME) is done on the luma channel. When ME is performed on RGB data, the green channel is usually used [14],[15], since the green data is more highly correlated with luma than red or blue. So in our second proposed method, ME is done on the green channel. Consider a green pixel at location (iG, jG) in a CFA frame. After demosaicking, this pixel will be located at position (2iG, jG) in the demosaicked frame if jG is even, and at position (2iG+1, jG) if jG is odd (Figure 4.5a). Let Ω denote the set of coordinates of the green pixels within a block in the CFA data. A motion vector (mi, mj) is calculated for the block by minimizing the sum of absolute differences (SAD) given by:

SAD(mi, mj) = Σ_{(i,j) in Ω, j even} | G(i,j) - Gdem(2i + mi, j + mj) |
            + Σ_{(i,j) in Ω, j odd}  | G(i,j) - Gdem(2i + 1 + mi, j + mj) |    (4.2)

where Gdem(i,j) is the demosaicked reference frame being used for prediction. In equation (4.2), the demosaicked reference frame is being sampled with the shape of the green data in the Bayer pattern, with the motion vector controlling the relative position of the sampling. After a full-pel motion vector has been found with (4.2), the motion vector is refined to quarter-pel accuracy, as is done in the H.264 reference encoder [4] (this basically involves minimizing the SAD given by (4.2), only now letting mi and mj take on values in increments of 0.25 pixels).

In our method, the motion vectors calculated from the green channel are also used on the red and blue channels. A red pixel at location (iR, jR) will be moved to (2iR, 2jR+1) after demosaicking, and a blue pixel at (iB, jB) will be moved to (2iB+1, 2jB). In order to obtain a prediction for a pixel in a CFA frame using motion compensation, the motion vector is added to the corresponding position of the pixel in the demosaicked frame, and the demosaicked frame is sampled at that position. Let Bdem(i,j) and Rdem(i,j) be the values of the demosaicked frames being used for prediction, and let the motion vector for a block of CFA data be (mi, mj). Then the predictions for the CFA pixels in the block are:

Ĝ(i,j) = Gdem(2i + mi, j + mj)        for j even
Ĝ(i,j) = Gdem(2i + 1 + mi, j + mj)    for j odd
R̂(i,j) = Rdem(2i + mi, 2j + 1 + mj)
B̂(i,j) = Bdem(2i + 1 + mi, 2j + mj)    (4.3)

As an example, consider the task of predicting the 8x4 block of CFA pixels in Figure 4.5a in a future frame when the motion vector is (1,0) in full pel units. Figure 4.5b shows how the demosaicked frame is sampled to obtain a prediction for the block. The white squares represent pixels that are obtained by edge replication.

In summary, our MC scheme uses demosaicked versions of the reference frames to predict the CFA frame being coded. This takes advantage of the increased temporal correlation of frames after demosaicking has been performed, without the need to compress the larger demosaicked frames themselves.
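A direct, unoptimized rendering of equations (4.2) and (4.3) may make the sampling pattern clearer. This is an illustrative sketch rather than the reference encoder's implementation: the function interfaces and names are ours, motion vectors are full-pel, and no bounds checking is done at the frame edges (where the encoder would use edge replication, as in Figure 4.5b).

```python
import numpy as np

def sad_green(g_blk, g_dem, i0, j0, mi, mj):
    """Equation (4.2): SAD between a block of CFA green samples with
    top-left corner (i0, j0) in green-array coordinates and the
    demosaicked reference g_dem, sampled with the shape of the green
    data in the Bayer pattern."""
    h, w = g_blk.shape
    s = 0.0
    for i in range(h):
        for j in range(w):
            jj = j0 + j
            ii = 2 * (i0 + i) + (jj & 1)   # even column: row 2i; odd: 2i+1
            s += abs(float(g_blk[i, j]) - float(g_dem[ii + mi, jj + mj]))
    return s

def predict_rb(r_dem, b_dem, i, j, mi, mj):
    """Equation (4.3): the green-channel motion vector reused on red and
    blue; a red sample (i, j) sits at (2i, 2j+1) in the demosaicked
    frame, a blue sample at (2i+1, 2j)."""
    r_hat = r_dem[2 * i + mi, 2 * j + 1 + mj]
    b_hat = b_dem[2 * i + 1 + mi, 2 * j + mj]
    return r_hat, b_hat
```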
4.4 Results

4.4.1 Testing Methodology

To evaluate the performance of our compression schemes, two videos from the SVT High Definition Test Set [40], CrowdRun and OldTownCross, were used in our simulations. The CrowdRun sequence is a shot of hundreds of marathoners running through a park; it contains a high amount of motion due to the runners and also slight camera motion. The OldTownCross sequence is an aerial shot of a European city containing camera motion. Both videos contain large amounts of fine detail (high-frequency image content). Screenshots of the test videos are shown in Figure 4.6.

Figure 4.6: Screenshots of the test videos, CrowdRun (top) and OldTownCross (bottom).

These videos were digitized at a resolution of 3840x2160, 16 bits per (RGB) colour plane, 50 frames/sec. We down-sampled and cropped the videos to give 720x480, 8-bit RGB data at 25 frames/sec. This is the quality of video typically captured by consumer digital camcorders, our target application. CFA videos were obtained by sampling the RGB videos with the Bayer pattern. In all tests, 60 frames of each video were compressed, with I-frames inserted every 15 frames. CABAC and B-frames were disabled to lower the encoding complexity (these features are not likely to be used in digital camcorders). In the ME process, two reference frames were used and the search range was ±16 integer pixels.

4.4.2 Demosaicking Algorithm Choice

Here we evaluate the effectiveness of different demosaicking algorithms for use in our modified motion compensation scheme. Three different demosaicking algorithms were tested: bilinear interpolation, an edge-sensing method by Hamilton and Adams using Laplacian second-order correction terms [6] (referred to as Laplacian), and a state-of-the-art method based on estimating luma from the CFA data [38]. These algorithms vary in complexity and in the quality of image they produce. Bilinear interpolation has very low complexity but gives the poorest image quality. The luma-based method gives much better image quality (in terms of mean-square error) but has significantly greater computational cost. Laplacian demosaicking is intermediate in both image quality and complexity.

Rate-distortion curves obtained using the different demosaicking algorithms for MC are shown in Figure 4.7. Here, the PSNR of the CFA data itself is measured (as opposed to measuring quality after demosaicking has been performed). Results are shown for our proposed MC scheme with each demosaicking algorithm, as well as for compressing the CFA video with a standard H.264 encoder in 4:2:2 sampling mode (without our proposed MC scheme).

Figure 4.7: Plots of PSNR (of the CFA data) vs. bit rate (Mbps) for the two test videos, obtained when different demosaicking algorithms are used for motion compensation.

The results for both videos show there is substantial benefit in using our MC scheme. The bit-rate reductions for the CrowdRun sequence are about 4%, 10% and 12% using bilinear, Laplacian and luma-based demosaicking, respectively. Greater bit-rate reduction is obtained on the OldTownCross sequence: up to 15% using bilinear, 43% using Laplacian and 48% using luma-based demosaicking. The OldTownCross sequence experiences greater bit-rate reduction when our MC scheme is used because that video contains camera motion. In general, MC is very effective on videos with camera motion, because the picture experiences simple translational motion. When the motion is more complicated, such as the motion of the marathoners in the CrowdRun sequence, MC is less effective, and thus enhancing MC provides less benefit. These results show that the more advanced demosaicking schemes provide significant bit-rate reductions over simple demosaicking such as bilinear. Using Laplacian demosaicking for MC provides bit rates almost as low as the luma-based demosaicking method with considerably lower complexity, so Laplacian demosaicking is used in the rest of our tests.
4.4.3 Quality Comparison Against Demosaick-First Approach

In this section we compare our two methods against the conventional demosaick-first approach. When comparing to the demosaick-first approach, the quality of the final RGB video, after both compression and demosaicking have been performed, needs to be measured. The quality measure used here is the Composite Peak Signal-to-Noise Ratio (CPSNR), which has been used in previous work on compressing CFA data [17],[19]. The CPSNR is calculated as the standard PSNR, but with the mean-square error evaluated across the red, green and blue colour channels. The CPSNR for one frame of video with 8-bit data is given by:

CPSNR = 10 · log10( 255^2 / MSE ),  MSE = (1/(3·M·N)) · Σ_{k=1..3} Σ_{i=0..M-1} Σ_{j=0..N-1} ( Icomp(i,j,k) - Iref(i,j,k) )^2    (4.4)

where k denotes the colour component (R, G or B), Icomp(i,j,k) is the compressed video after demosaicking, and Iref(i,j,k) is the reference video against which quality is measured. The reference is taken as the video obtained by demosaicking the CFA data without any compression; thus, the reference video represents the best quality that can be obtained from the CFA data with a given demosaicking algorithm. The CPSNR for a video is taken to be the average CPSNR of the frames in the video.
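Equation (4.4) amounts to pooling the squared error over all three channels of a frame before taking the logarithm, then averaging the per-frame values over the video. A minimal numpy sketch (assuming MxNx3 8-bit frames; the names are ours):

```python
import numpy as np

def cpsnr(comp, ref):
    """Equation (4.4): PSNR with the MSE pooled over the R, G and B
    channels of a single MxNx3, 8-bit frame."""
    err = comp.astype(np.float64) - ref.astype(np.float64)
    return 10.0 * np.log10(255.0 ** 2 / np.mean(err ** 2))

def video_cpsnr(comp_frames, ref_frames):
    """The CPSNR of a video: the average of the per-frame values."""
    return float(np.mean([cpsnr(c, r)
                          for c, r in zip(comp_frames, ref_frames)]))
```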
For the demosaick-first results, the CFA data was demosaicked (with the Laplacian method), converted from RGB to YUV and compressed with standard H.264. Results are presented for the demosaick-first approach using both YUV 4:2:0 and YUV 4:2:2 format during compression. Plots of CPSNR vs. bit-rate for the two test videos are shown in Figure 4.8.

Figure 4.8: Plots of CPSNR (measured after compression and demosaicking) vs. bit rate.

Both proposed methods outperform the conventional demosaick-first approach at high bit-rates. This is primarily because the proposed methods compress data one third the raw size of that in the demosaick-first approach. The proposed methods do not perform well at low quality levels. Highly compressing the CFA data removes detail necessary for demosaicking and may introduce blocking artefacts, which are interpreted as edges in the demosaicking process. Consequently, demosaicking does not work well on highly compressed data. This problem obviously does not arise when demosaicking is performed prior to compression.

4.4.4 Complexity Comparison

In the conventional demosaick-first approach, a video of size MxN is compressed, usually in YUV 4:2:0 format, resulting in 1.5MN total samples. In either of our proposed methods, a video of size (M/2)xN is compressed in 4:2:2 format, which requires MN samples. Hence our proposed methods require two-thirds the number of samples to be compressed compared to the demosaick-first approach. Although the input video size is smaller in our proposed methods, the video will have more detail (high-frequency image content), as each frame has effectively been downsampled by a factor of two in the vertical direction.

The most computationally expensive operation of an H.264 video encoder is motion estimation (ME), which typically accounts for about 65% of the encoding time [41]. The complexity of ME varies greatly based on the algorithm used. If a full search is used, the complexity of ME in either of our proposed methods will be about half that of the demosaick-first approach, since the luma channel is half the size. If a fast, content-adaptive ME method is used, the ME complexity reduction obtained by using our methods will be variable. Other significant functions that contribute to H.264 encoder complexity are intra prediction, interpolation, transform and quantization, loop filtering and entropy encoding [41]. With the exception of interpolation, the computational complexity of these functions varies depending on the video content; generally, video with more detail requires more computations. For these functions, our proposed methods will require fewer computations than the demosaick-first approach, but not fewer by a factor of two-thirds, since the smaller video will contain more detail.

The number of calculations for interpolation varies between our two methods. In Method 1 (standard H.264), luma interpolation requires half the number of computations of the demosaick-first approach, and chroma interpolation requires the same number of operations, due to the relative channel sizes. In Method 2 (modified MC), the luma interpolation filter is applied to all of the red, green and blue channels. Therefore the same number of luma interpolation calculations is required as in the demosaick-first approach, because the combined size of the R, G and B channels is the same as the size of the luma channel in the demosaick-first approach. However, no chroma interpolation calculations are required in Method 2. Therefore fewer interpolation operations are required in both proposed methods than in the demosaick-first approach.

In Method 1 (using standard H.264), demosaicking does not have to be performed at the encoder, which saves some computations relative to the demosaick-first approach. In Method 2, demosaicking is performed at the encoder, so the same number of demosaicking calculations is required as in the demosaick-first approach.

The most time consuming operations in H.264 video decoding are loop filtering, interpolation, inverse transform and quantization, and entropy decoding [42]. With the exception of interpolation, the amount of computation required for these operations is highly content and bit-rate dependent.
In general, videos with more detail require more computations. Hence the amount of computation required for the loop filter, inverse quantization and entropy decoding will be lower in the proposed methods due to the smaller frame size, but not lower by a factor of two-thirds, because the video will have more detail. Our proposed methods will require two-thirds the number of inverse transform calculations, because the amount of calculation required to perform the inverse transform is not content dependent. The relative number of interpolation calculations is the same at the encoder and decoder, so our proposed methods require fewer interpolation calculations than the demosaick-first approach at the decoder as well.

In either of our proposed methods, the decoder will have to perform demosaicking, which it would not in the demosaick-first approach. The decoder complexity of our proposed methods relative to the demosaick-first approach will depend on the video coding options used and on the demosaicking algorithm. On average, H.264 decoding of a QCIF sized frame (176x144) requires 30-40 million RISC operations [43], which corresponds to about 1200-1600 operations per pixel. By comparison, the Laplacian demosaicking method in [6] requires about 60 operations per pixel (the exact number will depend on the hardware and software implementation details). Hence, if a computationally efficient demosaicking method is used, demosaicking will only take a small fraction, around 4-5%, of the decoding computations. This means that both proposed methods will have lower decoder complexity than the demosaick-first approach, due to the smaller frame size of the encoded video.

4.4.5 Comparison Against Method in [19]

In order to compare our proposed methods against the method in [19], we use two standard test videos, Foreman and Carphone, for which results are presented in [19]. These videos are of QCIF resolution (176x144) and sampled at 30 frames/sec. Both videos were compressed at three quality levels and the resulting bit-rates are summarized in Table 4.1. The results for the method in [19] are taken directly from that paper. Both of our methods give far lower bit rates than the method in [19]: the bit-rate reductions achieved range from 68% to 83% for our first method using standard H.264, and from 76% to 90% for our second method using modified MC. Because our methods are based on H.264, which is a highly optimized, efficient video coding standard, they achieve much better compression than the custom-built method in [19].

                        Bit-rate (Kbps)
Video      CPSNR (dB)   Method in [19]   Proposed Method 1:   Proposed Method 2:
                                         Standard Encoder     Modified MC
Foreman    29.4         1360             246                  182
           32.5         2080             478                  350
           36.0         2780             882                  664
Carphone   27.2         812              100                  90
           30.0         1154             194                  166
           34.5         1850             458                  370

Table 4.1: Bit rate comparison of our proposed methods with the method in [19].

4.5 Conclusions

In this chapter, we have proposed two methods for compressing Bayer pattern CFA video prior to demosaicking. Our first method involves separating the CFA data into arrays of green, blue and red samples and compressing them in 4:2:2 sampling mode with standard H.264. In our second method, a modified motion compensation scheme is also used, where demosaicking is performed on the reference frames in the encoder and decoder in order to alleviate problems due to aliasing in the CFA data.
Both proposed methods give better compression efficiency than the conventional demosaick-first approach at high bit-rates, and much better performance than the only previously proposed method for compressing CFA video prior to demosaicking.

5 Conclusions and Future Work

5.1 Conclusions

In this thesis, two means of jointly optimizing demosaicking and compression in single sensor cameras are investigated: 1) creating a demosaicking algorithm that directly produces an image in the format used for compression (YCbCr 4:2:0), and 2) compressing CFA video data prior to demosaicking, taking advantage of the smaller raw data size before demosaicking has been performed.

In Chapter 3, a new demosaicking method is proposed which directly generates a YCbCr 4:2:0 output image. This allows the image to be compressed directly after demosaicking, avoiding the need for a separate stage where the image is converted from RGB to YCbCr space. The proposed method provides better image quality than fast RGB based demosaicking methods and has lower computational complexity.

In Chapter 4, two methods for compressing colour filter array video prior to demosaicking are proposed. Our first method uses standard H.264, and involves separating the CFA data into arrays of green, blue and red which are compressed in 4:2:2 sampling mode. Our second method also uses GBR 4:2:2 sampling, but with a modified motion compensation scheme where demosaicking is performed on reference frames; this alleviates problems due to aliasing in the CFA data. Both proposed methods give better compression efficiency than the standard demosaick-first approach at high bit-rates, and far better performance than the only other method for directly compressing CFA video.

5.2 Future Work

Our demosaicking method that produces YCbCr 4:2:0 output works strictly on still images. If it were applied to a video sequence, the demosaicking would be done on each frame independently, using only information from the current frame. When performing demosaicking on a video, additional information is available from other frames. The demosaicking method could be altered to take into account temporal correlation within a sequence, which may improve performance.

In our work on compressing CFA video prior to demosaicking, we have relied on existing demosaicking methods for producing the final RGB video after demosaicking and compression. Instead, the demosaicking could use knowledge of how the data was compressed. Given knowledge of how a video has been degraded in the compression process, a demosaicking method could be designed to produce the highest quality image possible given the amount of compression that has been applied.

Bibliography

[1] ITU-T and ISO/IEC JTC1, "Digital compression and coding of continuous-tone still images," ISO/IEC 10918-1 - ITU-T Recommendation T.81 (JPEG), September 1992.

[2] ITU-T, "Video coding for low bitrate communication," ITU-T Recommendation H.263; version 1, November 1995; version 2, January 1998.

[3] ISO/IEC JTC1/SC29/WG11 and ITU-T, "Revised text for ITU-T Recommendation H.262 - ISO/IEC 13818-2: Information technology - generic coding of moving pictures and associated audio information: Video," ISO/IEC and ITU-T, Geneva, Switzerland, 1995.

[4] ITU-T, "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification," ITU-T Rec. H.264 / ISO/IEC 14496-10 AVC, March 2003.

[5] B.K. Gunturk, J. Glotzbach, Y. Altunbasak, R.M. Mersereau, and R.W. Schafer, "Demosaicking: Color filter array interpolation," IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 44-54, 2005.
[6] J.F. Hamilton and J.E. Adams, "Adaptive color plane interpolation in single sensor color electronic camera," U.S. Patent 5,629,734, May 1997.

[7] C.A. Laroche and M.A. Prescott, "Apparatus and method for adaptively interpolating a full color image utilizing chrominance gradients," U.S. Patent 5,373,322, December 1994.

[8] R.H. Hibbard, "Apparatus and method for adaptively interpolating a full color image utilizing luminance gradients," U.S. Patent 5,382,976, 1995.

[9] S.C. Pei and I.K. Tam, "Effective color interpolation in CCD color filter arrays using signal correlation," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 6, pp. 503-513, June 2003.

[10] B.K. Gunturk, Y. Altunbasak, and R.M. Mersereau, "Color plane interpolation using alternating projections," IEEE Transactions on Image Processing, vol. 11, no. 9, 2002.

[11] J. Go, K. Sohn, and C. Lee, "Interpolation using neural networks for digital still cameras," IEEE Transactions on Consumer Electronics, vol. 46, no. 3, pp. 610-616, August 2000.

[12] J. Mukherjee, R. Parthasarathi, and S. Goyal, "Markov random field processing for color demosaicing," Pattern Recognition Letters, vol. 22, no. 3-4, pp. 339-351, March 2001.

[13] X. Wu and N. Zhang, "Primary-consistent soft-decision color demosaicking for digital cameras," IEEE Transactions on Image Processing, vol. 13, pp. 1263-1274, 2004.

[14] X. Wu and L. Zhang, "Color video demosaicking via motion estimation and data fusing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 2, pp. 231-240, February 2006.

[15] X. Wu and N. Zhang, "Improvement of color video demosaicking in temporal domain," IEEE Transactions on Image Processing, vol. 15, no. 10, pp. 3138-3151, October 2006.

[16] S.Y. Lee and A. Ortega, "A novel approach of image compression in digital cameras with a Bayer color filter array," IEEE Int. Conf. Image Processing 2001, vol. 3, pp. 482-485, October 2001.

[17] C.C. Koh, J. Mukherjee, and S.K. Mitra, "New efficient methods of image compression in digital cameras with color filter array," IEEE Transactions on Consumer Electronics, vol. 49, no. 4, pp. 1448-1456, 2003.

[18] N.X. Lian, L. Chang, V. Zagorodnov, and Y.P. Tan, "Reversing demosaicking and compression in color filter array image processing: Performance analysis and modeling," IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3261-3278, November 2006.

[19] F. Gastaldi, C.C. Koh, M. Carli, A. Neri, and S.K. Mitra, "Compression of videos captured via Bayer patterned color filter arrays," Proc. 13th European Signal Processing Conference (EUSIPCO-2005), Antalya, Turkey, 2005.

[20] T. Toi and M. Ohta, "A subband coding technique for image compression in single CCD cameras with Bayer color filter arrays," IEEE Transactions on Consumer Electronics, vol. 45, no. 1, pp. 176-180, 1999.

[21] A. Bruna, A. Buemi, F. Vella, and A. Vitali, "A low cost algorithm for CFA data compression," Int. Conf. on Consumer Electronics, 2006 Digest of Technical Papers, pp. 385-386, January 2006.

[22] J. Adams, K. Parulski, and K. Spaulding, "Color processing in digital cameras," IEEE Micro, pp. 20-29, November-December 1998.

[23] K.A. Parulski, "Color filter arrays and processing alternatives for one-chip cameras," IEEE Trans. Electron Devices, vol. ED-32, no. 8, pp. 1381-1389, August 1985.

[24] R. Lukac and K.N. Plataniotis, "Color filter arrays: Design and performance analysis," IEEE Transactions on Consumer Electronics, vol. 51, no. 4, pp. 1260-1267, November 2005.
[25] B.E. Bayer, "Color imaging array," U.S. Patent 3,971,065, 1976.

[26] K. Barnard, V. Cardei, and B. Funt, "A comparison of computational color constancy algorithms - Part I: Methodology and experiments with synthesized data," IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 972-983, 2002.

[27] IEC 61966-2-1 (1999-10), Multimedia systems and equipment - Colour measurement and management - Part 2-1: Colour management - Default RGB colour space - sRGB, International Electrotechnical Commission, www.srgb.com, 1999.

[28] R. Ramanath, W.E. Snyder, Y. Yoo, and M.S. Drew, "Color image processing pipeline," IEEE Signal Processing Magazine, vol. 22, no. 1, pp. 34-43, 2005.

[29] E. Hamilton, "JPEG File Interchange Format, Version 1.02," September 1992. Available: http://www.w3.org/Graphics/JPEG/jfif.pdf

[30] ISO/IEC JTC1, "Information technology - JPEG 2000 image coding system - Part 1: Core coding system," ISO/IEC 15444-1, 2000.

[31] J. Mukherjee, M.K. Lang, and S.K. Mitra, "Demosaicing of images obtained from single-chip imaging sensors in YUV color space," Pattern Recognition Letters, vol. 26, no. 7, pp. 985-997, May 2005.

[32] J.E. Adams, Jr., "Design of practical color filter array interpolation algorithms for digital cameras," Proc. SPIE, vol. 3028, pp. 117-125, 1997.

[33] T. Wedi and H.G. Musmann, "Motion- and aliasing-compensated prediction for hybrid video coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 577-587, July 2003.

[34] T. Wedi, "Adaptive interpolation filters and high-resolution displacements for video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 4, pp. 484-491, April 2006.

[35] D. Alleysson and S. Susstrunk, "Aliasing in digital cameras," SPIE EI Newsletter, Special Issue on Smart Image Acquisition and Processing, vol. 14, no. 12, pp. 1, 8, 2004.

[36] H.J. Trussell and R.E. Hartwig, "Mathematics for demosaicking," IEEE Trans. Image Processing, vol. 11, pp. 485-492, April 2002.

[37] J.W. Glotzbach, R.W. Schafer, and K. Illgner, "A method of color filter array interpolation with alias cancellation properties," in Proc. IEEE Int. Conf. Image Processing, vol. 1, pp. 141-144, 2001.

[38] N. Lian, L. Chang, and Y.P. Tan, "Improved color filter array demosaicking by accurate luminance estimation," IEEE Int. Conf. Image Processing 2005, vol. 1, pp. 41-44, September 2005.

[39] Joint Video Team of ITU-T and ISO/IEC, "Draft text of H.264/AVC Fidelity Range Extensions amendment," Doc. JVT-L047, September 2004.

[40] L. Haglund, "The SVT High Definition Multi Format Test Set," Available: ftp://vqeg.its.bldrdoc.gov/HDTV/SVT_MultiFormat/

[41] A. Hallapuro and M. Karczewicz, "Complexity analysis of H.26L," ITU-T SG16 Doc. VCEG-M50, 2001.

[42] V. Lappalainen, A. Hallapuro, and T.D. Hamalainen, "Complexity of optimized H.26L video decoder implementation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 717-725, July 2003.

[43] C. Xu, T.M. Le, and T.T. Tay, "H.264/AVC codec: Instruction-level complexity analysis," Proc. IASTED International Conference on Internet and Multimedia Systems and Applications (IMSA 2005), Honolulu, Hawaii, August 2005.
Appendix A - List of Acronyms

AWB     auto white balance
CCD     charge-coupled device
CFA     colour filter array
CIE     Commission Internationale de l'Eclairage (International Commission on Illumination)
CMOS    complementary metal oxide semiconductor
CPSNR   composite peak signal-to-noise ratio
DCT     discrete cosine transform
FIR     finite impulse response
JPEG    Joint Photographic Experts Group
MC      motion compensation
ME      motion estimation
MPEG    Moving Picture Experts Group
PSNR    peak signal-to-noise ratio
sRGB    standard red, green, blue (colour space)
