V i s u a l E c h o Analysis and Multi-evidential C o r r e l a t i o n Non-linear Matching & Registration of Signals and Images By Esfandiar Bandari B.Sc. University of California, Berkeley M.Sc. University of California, Berkeley M.Sc. Stanford University A THESIS S U B M I T T E D IN P A R T I A L F U L F I L L M E N T O F THE REQUIREMENTS FOR T H E D E G R E E OF DOCTOR OF PHILOSOPHY in T H E F A C U L T Y O F G R A D U A T E STUDIES ( D E P A R T M E N T O F C O M P U T E R SCIENCE) We accept this thesis as conforming to the required standard T H E U N I V E R S I T Y O F BRITISH C O L U M B I A November 1995 © Esfandiar Bandari, 1995 In presenting degree at the this thesis in University of partial fulfilment of of department this thesis for or by his or requirements British Columbia, I agree that the freely available for reference and study. I further copying the representatives. an advanced Library shall make it agree that permission for extensive scholarly purposes may be granted her for It is by the understood that head of copying my or publication of this thesis for financial gain shall not be allowed without my written permission. Department The University of British Columbia Vancouver, Canada DE-6 (2/88) Abstract Many low-level vision tasks - such as measurement of visual motion, stereo disparity estimation, or texture segmentation - can be solved by similar computational or biological mechanisms. The primary aim of this dissertation is to introduce and describe a broadly applicable approach to address a variety of low level computational vision problems. This unified framework, which is named visual echo analysis, is based on the simple observation that many computer vision problems can be viewed as detection and estimation of echo arrival periods in time and space. To this end, the framework uses cepstral techniques, a common and effective non-linear signal processing methods for detecting the presence of echoes and estimating their spatial or temporal arrival period. The thesis introduces computational and performance improvements to the traditional power and differential cepstrum with direct extensions to complex and phase cepstrum. Visual echo analysis (and multi-dimensional cepstrum) is then applied to a number of low-level vision tasks such as: motion estimation, binocular and trinocular stereo disparity, motion-stereo analysis, multi-frame disparity estimation (multi-frame motion, multiple baseline stereo), stationary texture segmentation, boundary symmetry analysis, and detection and estimation of multiple disparities (i.e., motion transparency, reflection, and occluding boundary). The relationship between echo analysis and matching is briefly examined, and a new technique for signal registration - called multi-evidential correlation (MEC) is introduced. M E C provides multiple, and thus verifiable, measurements for individual point disparities. The technique utilizes specific matching kernels - such as cepstrum, phase correlation or Hadamard based disparity measurements methods - to furnish multiple estimates of individual disparities; estimates that can be used to verify one another and/or be combined to establish a robust and accurate measure of signal displacements. ii Table of Contents Abstract i i List of Tables viii List of Figures 1 2 ix Introduction 1 1.1 Visual Echo Analysis . . 3 1.2 Multi-evidential Correlation 6 1.3 Outline and Organizational Overview 7 B a c k g r o u n d and H i s t o r i c a l Review 11 2.1 Previous Work 13 2.1.1 Correlation based techniques 14 2.1.2 Differenced Based Techniques 20 2.1.3 Differential or Gradient Based Techniques 21 2.1.4 Velocity tuned 28 2.1.5 Other Techniques 2.2 filtering 30 Stereo 36 2.2.1 Epipolar Constraint and Stereo Vision 37 2.2.2 Stereo Matching 40 2.3 Hierarchical techniques 41 2.4 Summary 43 iii 3 Cepstral Analysis: Background, Mathematical Preliminaries, and Mod- ifications 44 3.1 A Brief History of Cepstrum 45 3.2 Themes, Variations and Mathematical Preliminaries 48 3.2.1 Prototypical Cepstrum 48 3.2.2 Power Cepstrum 49 3.2.3 Complex and Phase Cepstrum 53 3.2.4 Differential Cepstrum 54 3.3 3.4 3.5 4 Improvements to the traditional approaches: posCepsCos and diffCepsSin 55 3.3.1 Positive CepsCos 56 3.3.2 diffCepsSin 58 Matching and Registration of Magnetic Resonance Images 60 3.4.1 61 M R I Processing Summary/Conclusion: 66 Cepstral Properties 68 4.1 Power Cepstrum's Sign Ambiguity in Calculating Displacement: 69 4.2 False Peak at (0,0) 72 4.3 Windowing 74 4.3.1 Weighted Windowing of the Input Images 75 4.3.2 Windowing in the Logarithmic Domain 76 4.3.3 Zero Padding and Numerical Stability 78 4.4 Noise, Convolution, First Order Distortions, and Illumination Change . . 79 4.4.1 Effects of Additive Stationary Random Gaussian Noise 80 4.4.2 Effects of Convolution and Blur 82 4.4.3 Effects of Illumination Change: 84 iv 4.5 Rotation and Expansion 86 4.6 Subpixel Motion 88 4.7 Parallelism and Computational Complexity 89 4.8 Comparison with Standard Techniques 91 4.8.1 Differences from Standard Techniques 93 4.8.2 Similarity with Standard Techniques 97 4.9 5 6 Summary 99 Cooperative Analysis of M u l t i p l e Frames by V i s u a l Echoes 102 5.1 Introduction, Background and Previous Work 103 5.2 Visual Echoes and Spatio-temporal Image Analysis 105 5.3 Direct Measurement of Trinocular Stereo Disparities 108 5.4 Improved Disparity Measures by Multiple Baseline Stereo 110 5.5 Decoupling of Motion and Stereo Disparities in Binocular-Motion 5.6 Summary and Conclusion . . . . 112 114 M u l t i - E v i d e n t i a l Correlation and C e p s C o r r 116 6.1 Multi-evidential Correlation: Definition 117 6.2 Multi-evidential Correlation: General Properties 120 6.3 Cepstrum and Multi-evidential Correlation—cepsCorr 122 6.3.1 CepsCorr and the (0,0) Peak: 123 6.3.2 Sign Ambiguity 125 6.3.3 Other properties of cepsCorr 127 6.4 6.5 Verification and Combination of Correlated Estimates 127 6.4.1 130 Max (0,0) peak and winner take all Summary/Conclusion 134 v 7 8 Results of Multi-evidential Correlation 137 7.1 Results and Comparative Analysis 139 7.1.1 139 The Hamburgh Taxi Sequence 7.2 Effects of Rotation 150 7.3 Effects of Expansion/Convergent Flow 154 7.4 Effects of Illumination Change 159 7.5 Tiny Camera 165 7.6 Binocular/Outdoor Scene 169 7.7 Computational Complexity of Motion Analysis Algorithms 174 7.8 Summary/Conclusion 175 Detection and E s t i m a t i o n of M u l t i p l e Disparities by Multi-evidential Correlation 8.1 Introduction 178 8.2 Multi-evidential Correlation, Cepstrum and Phase Correlation 180 8.2.1 Phase Correlation and Multi-evidential Correlation 182 8.2.2 cepsCorr: Cepstrum and Multi-evidential Correlation 184 8.2.3 Benefits of Non-linear Matching Kernels 185 8.3 8.4 9 178 Results 187 8.3.1 Motion Transparency 187 8.3.2 Occluding Boundary 189 8.3.3 MultiCeps and Multiple Motion Due to Reflection 192 Conclusion 194 C o n c l u d i n g R e m a r k s / O v e r v i e w and Future Directions 195 9.1 197 Future Directions vi Bibliography 199 vii List of Tables Preliminary experimental results for M R I image registration using different data types and cepstral techniques viii List of Figures 1.1 Measurement of visual motion using echoes, (a) shows one frame of the motion sequence, and (b) shows the superposition of two frames in order to display the spatial echo produced by motion. Tweedy bird, as well as part of the cloths in the background display an echo due to motion. . . . 2.1 4 Example of the effects of large motion and sharp intensity change on the differential algorithm for one dimension. f(x) and g(x) are the two functions that have gone through the translation h. In frame (a) the value of motion, h, is calculate correctly namely h = (f(x)—g(x))*^&. The same approach in frame (b) gives one the value d which is much smaller than the actual motion. Frame (c) shows how the motion can be estimated completely incorrectly due to a sharp change in the derivatives of the function f(x). Obviously, in frame (c) one could have a quadratic function with a large second derivative to display the same effect 2.2 23 A n illustration of the spatio-temporal filtering, using the space time cube, (a) the first frame of motion for a grey bar moving to the left with velocity v. (b) the stacked (or echoed) representation of this motion for five frames, (c) a horizontal cut for continuous time (i.e., in the (x,t) plane). Adopted from Adelson and Bergen 29 2.3 Epipolar constraint and the verging cameras 38 3.1 Plot of log(l + cos(2?r/)) 50 ix 3.2 Cepstral analysis of synthetic motion with (x,y) displacement of (7,3). The lower portion of the figure displays the cepstral result. The top peak is located at 3 pixels down and 7 pixels plus the width of the window horizontally. The lower peak is generated due to symmetric properties of the power spectrum of a real signal 3.3 51 (a) Regular cepstrum of a synthetic motion of 7 horizontally and 3 vertically, (b) Positive cepsCos of the synthetic motion 3.4 56 (a) regular power cepstrum for Figure 1; (b) its positive CepsCos - i.e., cepsCos with all the negative values discarded (c) the difference of (a) and (b) . Note that the landscape showing the difference has a different visual scale 3.5 57 (a) Differential cepstrum and (b) differential sine cepstral analysis of synthetic motion. Since differential cepstrum of a function is antisymmetric, for clarity the absolute value of the differential cepstrums is shown. Note the improved signal to noise ratio in the right 3.6 figure 58 (a) regular differential cepstrum for a simulated motion; (b) diffCepsSin, (c) and their difference. Note that the landscape showing the difference has a different visual scale 3.7 59 Magnetic resonance image pair from a single slice, (a) the magnitude image (b) the phase image 3.8 62 (a) The real image and (b) the imaginary image of a magnetic resonance image pair 3.9 63 Cepstral results, (a) cepsCos of real only data, amplitude only data (b) power Cepstrum of 66 x 4.1 (a) Power cepstrum of a motion sequence. In (a) the disparity is 3 downward and 7 to the left. Figure (b) shows the disparity 3 rows up and 7 to the left, even though the direction of motion is reversed the cepstrum results are identical 4.2 70 Differential cepstrum (diffCepsSin) of two motion disparities with opposite signs. Even though the cepstral extreme values are located in the same place, being and odd function the sign of the values determines whether the disparity is (+3,+7) or (-3,-7) 4.3 71 A n example of false peak at (0,0) disparity. Even though a displacement of (1,1) has been simulated, the main peak corresponds to a (0,0) disparity. 73 4.4 Removal of false peak at (0,0) disparity by using the edge image of the previous figure 4.5 74 Effects of windowing the two signal with Gaussians. A signal of width 128 was randomly generated and concatenated with a delayed version of itself. The delay was 7 pixels, the signal, (a) shows the the regular cepstrum of (b) Is the cepstrum of the signal, where the two windows were multiplied with very narrow Gaussians before concatenation, (c) is the cepstrum where two Gaussians were not narrow but were displaced toward the edges of the window by 20 pixels 4.6 76 (a) The result of the log of power spectrum of an image and its echo after the removal of the effects of the original signal. Even with the additional removal of the signal the quefrency window is still seems noisy. But the cepstrum of the signal is quite visible in (b) 4.7 77 Effect of moving square windows in front of a black background. The checkered multiple peaks not only interfere with the measurement of displacements, they also obscure the number of independently moving objects. 79 xi 4.8 Effects of motion blur on power cepstrum 84 4.9 Power cepstrum of a moving plaid function 85 4.10 Effects of rotation on cepstral results. The gray arrows are representative of the motions recognized by ignoring the peak at (0,0) of cepstrum. As it can be seen, places where the secondary peak is used are near the center of the rotation or the back wall (i.e., areas of virtually no motion). In these regions the motions are very close to zero. Secondly, note that since power cepstrum does not know the difference between (u, v) and (—u, —v) then the arrows on the left half of the flow field are pointing the wrong the wrong direction. None the less the magnitude and the orientation of the motion is qualitatively correct 87 4.11 Effects of expansion on cepstral results 88 4.12 Positive cepsCos for a simulated subpixel motion, (a) the figure and its positive cepsCos. (b) the landscape of the cepstral plane 4.13 The optical flow field due to egomotion 89 91 4.14 Cepstral equivalent filtering: note how the low frequency functions are deemphasized 98 4.15 Cepstral equivalent filtering for an image and an image with added noise. 100 4.16 A n image after it has been put through the different layers of retina. . . . 101 5.1 Multi-frame optical flow analysis utilizing cepstrum. The simulated motion field is made to reflect acceleration of the object. The motion disparities with respect to the leftmost frame are (1,1), (3,3) and (7,7) xii 106 5.2 The results of spatio temporal cepstrum analysis for a constant velocity for eight frames, (a) shows the disparity for every two consecutive frames, (b) every two frames separated by one frame, (c) for two windows differing in image number by two, and (d) for two frames differing in three frames. 5.3 (a) The peak magnitude and (b) the signal-to-noise ratio as a function of the number of frames used 5.4 107 108 Trinocular disparity measurements using visual echoes. The three peaks in the cepstrum reflect a disparity of 3.5 horizontally between the horizontal cameras, a vertical disparity of 6.0 between the cameras located vertically from each other, and a disparity of (3.5, 6.0) for the two diagonal cameras. 109 5.5 Simulated camera grid by the use of stationary texture segmentation. . . 110 5.6 Results of multiple baseline stereo Ill 5.7 Topographic plot of multiple baseline stereo results using the visual echo analysis approach. The peaks in the figure point to the horizontal disparity of the figure - roughly 1.65 pixels horizontally. The second largest peak corresponds to the overlap of disparity measures between two windows separated at twice the baseline Ill 5.8 The result of cepstral analysis for stereo disparity measurement 112 5.9 Motion-stereo disparity measurement. The two windows on the left are a stereo pair taken prior to the object going through a translation. The cepstral results shows how the motion disparity of (3.5, 6.4) and the stereo disparity of (4.3, 0.0) are decoupled 113 5.10 A stereo motion scene, and the horizontal stereo disparities using visual echo 114 xiii 5.11 Motion-stereo disparity measurement. The background is stationary anddoes not move while the brightly shaded and darker areas point to the pixel values corresponding to the Tweedy Bird and the piece of cloth both of which have gone through a motion 6.1 115 A n illustration of the persistence of the (0,0) peak for a portion of an image that lacks much detail. The window on the upper left hand corner was matched against the region in the upper right using cepsCorr. The lower right shows an enlarged result and the bright pixels show the positions (or disparities) where the (0,0) peak was the dominant peak. As it can be seen this area includes a large region within the search region. The lower left hand corner shows the relative magnitude of the (0,0) peaks, and as one can see at the proper disparity—i.e., the disparity (2,1)—the (0,0) peak increases dramatically 6.2 124 Two concatenated 2D signals that match perfectly and the logarithm of their power spectrum on the lower half. The black stripes represent very large negative values 6.3 133 The random signal used for comparing cepsCorr magnitude and signal to noise ratio as the disparity increases 6.4 (a) Magnitude of cepstral peaks, and (b) the base 10 logarithm of the peak magnitudes 6.5 134 135 The cepstral results generated during the iterations of cepsCorr. Each row of the image shows the cepstrum result normalized between zero and one; the first row corresponds to disparity 16 and the last row to disparity +16. 135 6.6 Signal to noise ratio for cepstrum as a function of disparity 6.7 Horizontal and vertical disparity fields for a natural scene using ceps Corr 136 xiv 136 7.1 Results of the region based approaches for 16x16 window size. The search regions were vertically from -3 to 2 and horizontally from -3 to 7 pixels. . 141 7.2 The same matching operations as in the last but with 8x8 window size. . 143 7.3 Results of Horn Sz, Schunk for different values of a and number of iteration. The disparity range for a = 2.0 and 20 iterations (or higher) was within the proper motion region. For a = 0.0, on the other hand, the vertical disparity after 100 iterations ranged from -41.772167 27.433767 7.4 145 (a) Cepstral results generated by cepsCorrOO iterations, individually normalized and then concatenated. The cepstrum in the upper left corner was replaced with the input image to the first iteration to show typical images being compared, (b) The concatenated cepstral results, without being separately normalized. This figure shows how cepstrum's instability may dominate the cepsCorrOO result based on (0,0) peak magnitude. . . . 147 7.5 Results the 8x8 window sized registrations after 3x3 median filtering. . . 149 7.6 Results of rotation for 16x16 window size 151 7.7 Results of rotation for 8x8 window size 152 7.8 Results of rotation for 8x8 window size after median 7.9 Horn & Schunk method for rotation, a = 2.0 and number of iterations filtering was 50 153 154 7.10 Results of expansion for 16x16 window size 156 7.11 Results of expansion for 8x8 window size 157 7.12 Results of expansion for 8x8 window size after median filtering 7.13 Results of Horn and Schunk for looming effect 158 159 7.14 Results of change in the direction of illumination from 60 to 30 degrees, search area (-7,7) in both vertical and horizontal direction and 16x16 window size 161 xv 7.15 Results of illumination change for 8x8 window size 162 7.16 Results of illumination change for 8x8 window size after median filtering. 163 7.17 Results of illumination change and Horn & Schunk method for a = 20 and 20 iterations 164 7.18 Original images used in the analysis 166 7.19 Effects of image quality on matching using 16x16 windows for matching. 167 7.20 Results of tinyCamera change for 8x8 window size 168 7.21 Horn & Schunk method a = 20 after 20 iterations 169 7.22 Results of binocular disparity measurements. Search for (-1,1) in rows and (-1,9) pixels in column-wise. 16x16 window size 171 7.23 Results of binocular disparity measurements. Search for (-1,1) in rows and (-1,9) pixels in column-wise. 8x8 window size 172 7.24 Results of binocular disparity measurements after median filtering for 8x8 comparison window size 173 7.25 Horn &; Schunk method a = 2.0 after 20 iterations 8.1 Illustration of multiple peaks due to multiple motions, 174 (a) & (b) the images with two synthetic motions, (c) the result of analyzing the images with cepstrum 8.2 181 Results of M E C for stereo, (a) Each row of the figure is the first row (i.e., zero vertical disparity) of the result of one iteration of M E C with phase correlation as its matching kernel. The location of the peaks are indicative of the disparity, with negative disparities located close to the right most column, (b) Topographic image of Figure (a) displaying the magnitude of the phase correlation peaks 183 xvi / 8.3 (a) The collection of binocular disparity signals generated by cepsCorr. Each row of the figure represents the result of one iteration of the multievidential correlation with cepstrum as its matching kernel. Each disparity is represented by two peaks which are symmetrically displaced away from the center column by the amount of the disparity, (b) Topographic image of cepsCorr in (a) in order to display the magnitude of the cepstrum peaks. 185 8.4 (a) Topographic representation of multi-evidential correlation results for the Pepsi scene with phase correlation as the matching kernel, (b) Result of cepsCorr for the same image area in the Pepsi scene. Note that cepsCorr generates a strong peak in the center corresponding to one of the disparities while for phase correlation the results are very weak and lack a distinguishing structure 8.5 186 The effects of simulated multiple disparities on: (a) sum of squared differences (mean squared error) (b) phase correlation, and (c) cepstrum. The second peaks in phase correlation and cepstrum correspond to the correct second disparity while the second minimum in SSD does not. Using M E C the other peaks in cepstrum and phase correlation will vanish and all but the main two peaks are rejected 8.6 187 (a) The first frame of a sequence of motion transparency images. The flower pattern is apparently due to the stained glass, and the tripod is the object moving in the background, (b) & (c) The magnitude of M E C at (0,0) disparities for cepstrum (b), and phase correlation (c). These two figures analogous to plotting the middle column of Figure 3(a) and the left most column of Figure 2(a) respectively xvii 188 8.7 Detection vs. localization of occluded regions for random dot stereogram. The area between a pair of white and black edges corresponds to the locations where multiple disparities were detected 8.8 189 (a) Sz (b) A pair of stereo images of a curved object, (c) the occluding boundary of the object using M E C 8.9 190 (a) One of the three images with two constant disparities caused by reflection. The three frames where produced by looking at the Escher drawing with the reflection of the person in the glass, and moving the camera a constant distance each time and acquiring two more frames, (b) The magnitude of the (0,0) peaks caused by the overlap of individual multiple disparities. Note that the graph shows two maximums due to two disparities. 191 8.10 A portion of the cepstrum result for multi-frame analysis 192 8.11 The results of analyzing Figure 8(a) for multiple echoes. The difference images between frame 1 and 2 after disparity compensation for the two estimates. Note how that the reflection of the person in the glass is now easily detected as compared to the original image xviii 193 Acknowledgment I have been extremely fortunate to have numerous outstanding colleagues, instructors and advisors who have influenced my ideas and interests in research. But I would like to particularly acknowledge V . V . Bertero and Egor Popov of University of California, Berkeley who got me started in research as an undergraduate, Boyd Paulson and Thomas Binford of Stanford University who peaked my interest and encouraged me to study robotics and vision, and Takeo Kanade, Martial Hebert, Ben Motazed and William (Red) Wittaker of Carnegie-Mellon University who provided me with a wonderful opportunity to work and learn about vision and robotics. Furthermore, I would especially like to extend my appreciation to my committee members: Bob Woodham, David Lowe, Peter Lawrence, Mathew Palmer, Tad Ulrych, Rabab Ward, Allan Jepson and finally my official advisor James Little, as well as my good friends Rob Scharein, Ron Rensink, Norm Goldstien and Marc Romanycia. They supported my ideas, reviewed my work and provided me with valuable recommendations and advice. Most of all, I would like to thank and dedicate this work to my parents. They provided me with the best early education possible, continuously encouraged me to follow my interests and set valuable examples for their childern to follow. To them I am always indebted. xix Dedicated to the memory of my father Bahman Ardeshir Bandari and to my mother Golbaii S. Kaikhosrawi xx Chapter 1 Introduction Visual systems, whether computational or biological, are often characterized by their versatility as well as their complexity. The tasks often demanded from such systems can range from navigation and obstacle detection to monitoring or recognition. But fortunately many of the problems encountered can be addressed through similar mechanisms. Such uniform treatment of topics not only gives us a better understanding of the underlying concepts, but provides uniform solutions that can be cast into specialized hardware. This kind of efficiency is particularly striking in biological systems, where widely different visual processes share similar computational steps [SW90]. The objective of this dissertation is twofold: (1) to introduce a new computational framework which addresses several related areas in computer vision, and (2) to introduce a new, robust and hybrid image registration technique particularly used in computational motion and stereo analysis. To elucidate further, a framework refers to a conceptual/computational structure that provides a unified treatment of a class of problems sometimes seemingly unrelated and disjoint. Therefore, a framework not only addresses multitude of tasks, but its solutions 1 emanate from a singular approach; as a result, the analysis of solutions - for example, *As an example of an unrelated and disjoint problem one can cite multi-frame motion analysis and 3D reconstruction from trinocular stereo. The problems are unrelated because one deals with motion displacement measurement from multiple frames over time, while the other deals with 3D scene reconstruction from disparities due to sensor displacements. The solutions (or solution space) to these problems are disjoint because to date all the solutions provided for these problems are specific to the two domains. 1 Chapter 1. Introduction 2 the effects of additive zero-mean uncorrelated Gaussian noise - are addressed analogously across all application domains. A good conceptual example in computational vision and one that strongly motivated the present dissertation - is Whitman Richards' [Ric88] visual flow analysis which draws attention to the similarity among motion, binocular and texture processing. Motivated by this work as well as others [Pri86, SS93], the first part of this thesis introduces and develops a computational structure, nicknamed visual echo analysis. With this framework I try to address such computational vision tasks as motion analysis, binocular and trinocular stereo, stationary texture segmentation, boundary symmetry recognition, multi-frame analysis, and position/scale and rotation invariant recognition of binarized images. The framework exists because the core conceptual notion employed is unique but diversely applicable. This provides a common methodology for analysis of results and effects of irregularities in inputs. Based on this and similar techniques in literature, the second part of this thesis introduces multi-evidential correlation - a new hybrid and robust approach for a fundamental task in computational vision, namely matching and registration of raw images. Multi-dimensional signal correspondence is a significant topic in signal processing and computational vision. The signals compared are often acquired under different circumstances at different times, by different sensors, with different modalities, under different view points, and for different reasons. But the fundamental operation, whether to find similarities or differences between the inputs, is indeed matching and registration of the data. The primary focus of this work is matching and registration as they relate to visual motion and stereo analysis. This interest is not accidental. Both motion analysis and stereopsis are fundamental topics in computational and biological vision. Moreover, Chapter 1. Introduction 3 together they encompass the majority of essential issues raised in image registration. 2 Therefore, from this point on, "image matching and registration" refers primarily to the issues and techniques that concern visual motion and stereo analysis. Through out this work, I will present examples and applications of the techniques utilized, discuss the limitations and advantages of these routines, and provide comparisons to the traditional approaches when applicable. The next two sections provide a more clear and specific discussion of the concepts developed in the thesis, and are followed by a brief outline of the dissertation. 1.1 Visual Echo Analysis There are quite a few books on problem solving and scientific thinking [Pol87, Ada86] and a common approach often utilized is solution by association, analogy or equivalence. Accordingly, let me describe visual echo analysis by examining motion analysis from a new perspective. Assume we are recording images of a static scene with a stationary camera. Ignoring sensor noise, the images captured over time are identical. Indeed one can claim that the second frame is a repetition, replication or an echo of the first frame in time. Similarly, every frame can be viewed as a temporal echo of the previous ones. But what if an object in the scene begins to move? Then the moving object also creates an echo of itself, but this time in space as well as in time (see Figure 1.1). Thus motion analysis can be viewed as detection and estimation of spatio-temporal visual echo arrival periods between two or more image frames. We will visit this issue again. There are of course gross geometric and photometric distortions that may not be encountered in either motion or stereo analysis. Many of these involve landmark registration (often manual) followed by image interpolation and morphing. These are outside the scope of this work. But others, such as shadows, changed illumination, or multiple motion (e.g., motion of ground behind moving clouds) are issues that we attempt to address. 2 Chapter 1. Introduction (a) 4 (b) Figure 1.1: Measurement of visual motion using echoes, (a) shows one frame of the motion sequence, and (b) shows the superposition of two frames in order to display the spatial echo produced by motion. Tweedy bird, as well as part of the cloths in the background display an echo due to motion. What we are doing is correspondence; that is we are trying to match different portions of two images and decide how much has each part moved. Except we do this by looking for echoes. There are two main advantages to this conceptual shift in our thinking. First, echoes have been studied extensively in many fields such as acoustics, oceanography, geophysics, seismology, physics, communication, power engineering, speech recognition, system analysis and signal restoration [CSK77], making it possible to build on the extensive research literature on this topic. 3 Second, modeling matching as echo analysis presents new possibilities and a great 3 I.e., building on the shoulders of giants. Chapter 1. Introduction 5 deal of flexibility to address other topics such as, binocular or trinocular stereo, multiframe motion analysis, multiple baseline stereo, stationary texture segmentation, motionstereo, boundary symmetry analysis, multiple disparity analysis, and occluding boundary detection. These applications have been developed by exploring many questions about echoes and vision. For instance, what if we have multiple echoes, in space (e.g., stationary texture, trinocular stereo, and multiple baseline stereo) or time (multi-frame motion analysis), or both (motion stereo analysis)? What if we have multiple moving objects, each having a unique, or perhaps overlapping, echo arrival periods? Moreover, echoes of waves reflected from boundaries can be symmetric or antisymmetric, depending on the boundary conditions. Can one detect different types of symmetries and self similarities in object boundaries when they are present? How stable are the underlying operations, especially in the presence of distortions? Is the underlying technique a different variation of previous methods, in the same mathematical class, or is it fundamentally different (e.g., linear vs. non-linear routines)? What properties can be fully characterized and which ones remain to be studied further? The cardinal question, of course, is how this flexibility affects the performance of our objective tasks. Poorly performing flexible techniques are almost always best replaced 4 with specialized accurate methodologies. To answer this question, one has to furnish a comparison with the best performing techniques, in the applicable domain whenever possible. T h e r e is an interesting recursive definition attributed to the late Alan Newell, one of the prominent figures in the history of artificial intelligence: "Almost always almost always means always." 4 Chapter 1. Introduction 1.2 6 Multi-evidential Correlation As pointed out earlier, one of the most significant tasks in signal processing and computational vision is matching of multi-dimensional signals. Whether images have been gathered over time, such as motion sequences, or are different by the spatial location of the sensors, such as binocular and trinocular stereo, with different number of bands and formats , the measurement of displacement defines the task. 5 It is thus not surprising that a variety of approaches have been introduced over time. Some schemes minimize differences, others maximize similarities, some use linear, and yet others utilize nonlinear techniques. Many methods are preceded by preprocessing the raw data followed by establishing constraints through post processing the results. But the fundamental objective remains the same. The primary purpose of the second part of this thesis is to address the correspondence issue. The weak assertion behind this work is that the more estimates of displacements available the better. The strong assertion (the main observation), however, is that as the disparity or displacement between the images increase, the more difficult and unreliable 6 our results become. Consequently, reducing the disparities increases the accuracy of our measurements. To give a simplistic example, when measuring the length (weight) of an object, one can make several measurements and average the results. The higher the number of times one makes the measurement, the higher the confidence in the final answer. Furthermore, the more accurately each measurement is made the better the outcome. Multi-evidential correlation uses algorithms that provide a direct estimate of disparity between two images, and combines these with correlation to provide better measures of displacements. Since the actual matching kernel used can be changed, one might assert 5 6 i.e., real vs. complex format. These two terms are often used interchangeably. Chapter 1. Introduction 7 that multi-evidential correlation is a hybrid technique. Application of multi-evidential correlation to detection and measurement of multiple displacements due to occlusion, reflectance, or transparency is shown. In this case the multiple evidences are used to differentiate between spurious noise and peaks due to secondary motion. 1.3 Outline and Organizational Overview Due to the breadth of the applications and the techniques considered, chapters are organized to follow a uniform skeleton when possible. Most chapters have sections on background, computational considerations and summary. When applicable I also include sections on mathematical derivations, applications and results as well as future works. To keep the discussion focused I have made ample use of dashes, parenthesis and footnotes for the material not always essential to the discussion. This is often necessary in order to point out facts and exceptions with minimum disruption to the flow of conversation. Many of these interjections can therefore be skipped over for the casual reader or in the first reading of the text. 7 The following are brief chapter descriptions, which are also expanded upon at the end of each chapter as a summary. C h a p t e r 2. This chapter provides a brief review of multi-dimensional signal and image registration. Majority of the techniques reviewed, mean squared difference (sum of squared differences), mean absolute difference (sum of absolute differences), spatially tuned filters, Haddamard based routines and differential approaches have their routes in linear filtering and statistics. Phase correlation (and non-linearly preprocessed token based) Here is an opportune example, where one may ultimately ask: "why would anybody want to read this twice?!" ; ) 7 Chapter 1. Introduction 8 methods are the only two non-linear techniques reviewed. Motion analysis is one of the most general applications of image registration and majority of the references and the works cited are from this domain. Section 2.2 reviews stereo matching as a special case of motion analysis and image registration. Hierarchical pyramids and their role in matching are also discussed, and their role in reducing the assumptions and limitations of region-based methods explored. Chapter 3. Chapter 3 focuses primarily on the mathematical definitions of cepstrum and its variations. For the sake of brevity and conciseness, some questions, especially the intuitive ones, such as cepstrum's behavior in the presence of multiple echoes (e.g., occlusions and transparency) are deferred to later chapters and will not be addressed in this chapter. Other properties such as cepstrum's behavior in the presence of noise or convolution will also be addressed later. But the chapter introduces new variations of power and differential cepstrum that are better from the performance aspects of echo detection as well as computational efficiency. Chapter 4. This chapter provides a review of cepstrum properties as a matching routine, particularly emphasizing its limitations and shortcomings, which need to be addressed. Some of the topics covered include windowing, numerical stability, effects of additive noise, convolution (particularly blur), rotation, expansion and illumination changes. A short comparison, particularly emphasizing the similarities to other approaches was also provided with graphical examples and theoretical analysis. Chapter 5. This chapter provides a uniform approach to multi-frame analysis and shows the application of the visual echo framework to the analysis of the disparity calculations for spatio-temporal motion analysis, trinocular stereo disparities, multiple baseline stereo Chapter 1. Introduction 9 measurements, and depth and motion calculations using motion-stereo. What sets this approach apart from its predecessors is that the equal disparities constraint, when applicable, is fully utilized to increase the disparity detectability and the accuracy in the estimation process. Several domains where this constraint is satisfied include motionstereo, multiple baseline stereo, and constant velocity motion. When this constraint is violated, such as in the case of trinocular stereo or accelerating motion, the visual echo techniques provide all disparities between every two windows analyzed. One of the interesting features of this work is its extendibility into other possible mechanical setups such as trinocular stereo-motion analysis, or calibrated camera grids. Chapter 6. This section introduces Multi-evidential Correlation (MEC) and the additional robustness, flexibility, and accuracy that it provides. I will then concentrate on cepsCorr M E C with cepstrum as its matching kernel - and show how some of the limitations and ambiguities in cepstrum can be overcome through the use of M E C . Lastly, we introduce, a simple, fast and efficient winner take all approach to utilize the disparity estimates (i.e., multiple evidence) provided by cepsCorr. This method not only overcomes, both the problems with sign ambiguity of cepstral disparities, but it also solves the sometimes troublesome incorrect (0,0) peaks and turns this limitation of cepstrum into an advantage. Chapter 7. This chapter shows the use of cepsCorrOO and phaseCCorrOO for matching and registration of several images. The images chosen address some of the more common image transformations, such as rotation, expansion, and translation parallel to the focal plane, plus changes in lighting condition and image quality. Even with considering only the Chapter 1. Introduction 10 magnitude of the (0,0) peak - a rather limited application of M E C - cepsCorrOO performed rather nicely. As was expected and pointed out by Yeshurun and Schwartz [YS89] and Ludwig, Neumann and Neumann [LNN94] for illumination changes, rotation and expansion, cepsCorrOO inherits cepstrum's robustness and performs better than the other techniques considered. The chapter discusses in detail possible instabilities on cepstrum and phase correlation and how they can be easily overcome by the use of M E C (and additional proofs of its robustness). These instabilities sometime occur at the occluding boundaries, a significant area of research (especially for object segmentation) and a topic that will be discussed along with multiple motion - in more detail in the next chapter. C h a p t e r 8. In this section I discuss a direct method for detection and estimation of multiple disparities due to occlusion, motion transparency, and reflection using multi-evidential correlation. No a priori assumptions about the existence or the number of disparities were made. The two matching kernels used, cepstrum and phase correlation, both generate multiple peaks in the presence of multiple motions. When these peaks persist between different iterations of multi-evidential correlation, and they are consistent in the measurement of disparities, the existence of multiple displacements is assured. I will also address the use of multi-frame analysis and cepstrum to estimate multiple constant disparities over time. Both cepstrum and phase correlation performed well, but cepstrum had higher signal to noise ratio, and an overall better performance than phase correlation. A significant fact that should be pointed out here is that motion transparency and reflection often cause blurring in the image. Both cepstrum and phase correlation are robust to blurring, and in fact cepstrum is often used as a deblurring mechanism. Chapter 2 B a c k g r o u n d and H i s t o r i c a l Review This chapter is intended to provide a brief tutorial of matching and registration routines often used in practice. It is practically impossible to provide an objective comparison of all the routines developed. One reason is that the performance of these processes are effected not only by the nature of the raw data being matched, but also the number and applicability of the assumptions being made as well as the number and value of the thresholds and internal parameters being set. Therefore, even though a technique may perform poorly on one set of data, given the right assumptions and thresholds it can perform best for another set of data. Thus for the purposes of this review, I have tried to provide the basic theoretical formulation of each technique as well as the practical and performance related issues. When the results of a technique are dependent on certain presuppositions they are made explicit in the review. In regards to assumptions, I have found that in general: in every assumption (explicit or implied), there exists the possibility of its violation, thus limiting the accuracy and applicability of the procedure to which it pretains. Therefore, in my own work I have tried to minimize or eliminate the number of assumptions and arbitrary thresholds and limits. But there are inherent restrictions that one has to note. In fact, even though a variety of registration techniques and approaches will be reviewed below, emphasis is placed on region-based techniques and methods that are particularly suited to estimate small displacements. The primary assumption in region-based 11 Chapter 2. Background and Historical Review 12 motion and stereo analysis is that of "local constancy of disparity". That is, for small neighborhoods the displacements of all enclosed pixels are constant. This assumption is, of course, violated in the presence of occluding boundaries or multiple displacements due to transparency or reflection. These problems are addressed in later chapters. A related and implied assumption for region-based approaches is that of fronto-parallel displacement. That is, for small regions the displacements are due to simple two dimensional translations (of the object or camera) parallel to the image plane. That is, large 1 distortions (affine or perspective) are either prohibited or reduced through prepossessing. For measurements of small visual motion and stereo disparity (controlled by the stereo baseline compared to depths of interest) the assumption of fronto-parallel displacement is not an unreasonable one. In practice, for applications that may violate the above assumptions - such as satellite imagery - gross geometrical distortions are corrected automatically by considering the location and orientation of the cameras during the image capture, or registration of control or ground points between the two images, and subsequent resampling of the points between control points. 2 Fine registration of the interpolated images can be then conducted using the techniques described below. Section 2.1 reviews a quintessential area of registration, namely motion analysis which is a comprehensive representation of this topic; specific matching techniques are described. Section 2.2 covers stereo, both binocular and trinocular, and reviews epipolar constraint and reviews the pros and cons of feature versus region matching; and section 2.3 covers the hierarchical pyramids often used to eliminate the assumption of small displacement fields. A s we will see it is really not necessary to restrict the horizontal and vertical disparities. disparities are often addressed using hierarchical pyramids. J Large For a good discussion of image interpolation and morphing between registered landmarks the reader is referred to [Bro92]. 2 Chapter 2. Background and Historical Review 2.1 13 Previous Work It would be truly prohibitive to consider all the techniques, methodologies and approaches from the ever growing number of applications that discuss image registration. Some techniques look for maximum similarities, others for minimum differences, and yet a third force matching through interpolation and morphing [Bro92, SC94]. Moreover, 3 the techniques developed could be categorized into, linear vs. nonlinear, spatial vs. Fourier based, or feature based vs. region-based approaches. Being one the most general representative of image registration, motion analysis embodies the majority of problems as well as technical solutions developed to date. In fact, as noted by others [Hor86, BB82], stereo analysis is a special case of visual motion processing. But independent of other topics, the importance of visual motion analysis can not be over stated. In biological and computational vision time varying image analysis plays a fundamental role in egomotion estimation, computation of time to collision, determination of focus of attention, target tracking, scene segmentation, interpretation and encoding of 3D information, and monitoring [Nak85]. In today's multi-media information technology, motion compensation is an integral part of video compression algorithms [Nas94, Tho87, Ana94]. There have already been several sound reviews for motion analysis techniques [Ana85, Bro92, AN88, Bar84, The91]. Building on these works, the next sections will cover three often cited categories of motion analysis, i.e., the well entrenched correlation based techniques, gradient or differential schemes, and spatio-temporal (or velocity tuned filtering) approaches. This is followed by two powerful but often ignored routines , phase correlation and Hadamard 4 E v e n though considered by some as matching, interpolation and morphing of image sections between isolated registered points is more a topic related to computer graphics than computational vision. Therefore this class of procedures will not be covered here. See [Bro92] for a good discussion. Sometimes, but hardly ever, the word routine is used in the text to reflect, algorithms, methods, 3 4 Chapter 2. Background and Historical Review 14/ based routines. Section 2.2 covers stereo analysis and the provides a comparison between feature based and region-based approaches. Finally, I'll briefly address some remaining issues such as the use of hierarchical pyramids, and image differencing. 2.1.1 Correlation based techniques There is no class of matching methods better entrenched in computational vision than the one based on linear cross-correlation coefficient of statistics and its variations. To begin near the beginning, based on research on insects, Reichard [Rei57, Rei61], one of the most influential early vision scientists, postulated that autocorrelation is the motion processing instrumentation of the central nervous system. The significance of these works, as well as [PR73], was in giving correlation plausible neurophysiological basis. But even without such endorsements, cross-correlation has a long history as the basic statistical approach to template matching and pattern recognition. For matching two dimensional discrete signals simple cross correlation can be written as: 5 W Wy X Rhj ( ,v) u 2 = ^^h(x,y)I (x 2 - u,y - v) (2.1) where w and w define the size of the two image templates I\ and io., and the limits x Umin < u< u y max and v min < v< v max delimit the search area for the best match. Theoretically, correlation is merely a similarity measure. As template i i is moved over the second image, the sum of pixel by pixel multiplications of the two windows provides a similarity measure between the template and the image at that location. Therefore, the correlation of a template and an image is a matching metric. The location of the maximum similarity value is considered the location of the best match for the template. techniques, or approaches. This is not to be confused with "visual routines" [U1183] which was first used by Ullman to differentiate between serial and parallel visual tasks. Obviously one can replace vector quantities (x, y) and (u, v) by single bold letters and thus extending the present representation to arbitrary dimensional signals. 5 Chapter 2. Background and Historical Review 15 One of the most attractive properties of correlation - particularly in signal processing - is its great speed. In frequency domain, correlation is merely multiplication of the one window and the complex conjugate of the other. But typical correlation-based techniques implicitly assume the conservation of local intensity distribution. In the case of pure correlation this assumption can lead to severe problems. Assume for instance that a part of the search region considered in the second image, contains an area of very high intensity (e.g., a uniform extremely bright region with the same dimension as the template from the first image). It is easy to see that the correlation results will be maximum within this region, independent of the template values. To alleviate this problem often normalized correlation is used. ^ / 6 Ylr 1, h(x, y)Io.(x — u, y — v) s Ci (u,v)= ^x, ,y > ^ ^ VEi,y ~U,yv)^x,y IK - ^y) The normalization of the cross-correlation is used to reduce the influence of the local uh y i\ ,yj 2K = X U = V intensities. Since the template energy - i.e., J2 , I\ ( , y) ~ is constant, one can omit x x y normalization with respect to this measure without effecting the location of the peak. Unfortunately, this improvement comes at a great cost. Not only the new approach is computationally intensive, it also can not be implemented in the frequency domain, thus loosing the primary method for correlation speed up. Another similarity measure used is correlation coefficient [SMA76]: , CC {u,v) Iuh . = covarianceiL, E ,y(h(x,y) x I) (2.3) 2 - Ph){h{x -u,y-v)- \jEx, (h(x, y) ~Vh) 2 y Ex, (h{x -u,y-v)y p) h fl ) 2 h provides a linear measure of similarity between -1 and 1 inclusive. In the formulation above fij and Oj are the mean and variance of an image window Ij. 6 F o r simplicity I am substituting Y^ , x y f° r S j r Sorry if I am stating the obvious again. Chapter 2. Background and Historical Review 16 A related and empirically more effective technique is sum of squared differences (See [AW84, AW85] for a justification) or SSD [BAHH92, Ana87, Sco88, Sin91, Fua91, FL94, AW85, AW84, Sco86, B Y X 8 2 , B Y X 8 3 , WH84, BL83, AR84, CCM84, GGB84, BT80, Rei61, Rei57]. SSD (u,v) Iuh = Y,Vi(x,y) - I (x -u,y2 v)f (2.4) This computationally efficient statistical method looks for minimum difference between the signals. It is obvious that SSD is just sum of powers of template and the underlying image minus twice the correlation value. Most often it is the correlation value that minimizes SSD, but the addition of the signal powers tries to offset the problems with regular cross-correlation such as drastic brightness change. Nonetheless it is advisable to normalize SSD. One of the primary characteristics of SSD and cross-correlation are their optimality with respect to certain type of noise, specifically additive stationary white Gaussian noise [Mat92b]. It should be noted however, that image noise is often neither completely additive nor Gaussian white noise. Yet this is a popular noise model, perhaps more applicable to photodetector imaging sensors [Pra78]. Another realistic additive noise model is pink, or higher frequency random noise [Nis84]. To avoid the correlation result being undully influenced or confounded by differently scaled structures (i.e., different frequency bands), many authors prefilter the image with Laplacian of Gaussians ( V G often approximated numerically with difference of Gaus2 sians with standard deviation ratios of 0.6). In fact, simple windowed correlation or SSD 7 does very poorly in regions of low contrast and near constant intensity. In the converse case, the presence of a great deal of intensity variation or texture can also bog down 7 I will revisit this problem in hierarchical signal matching [BAHH92, BAK91]. Chapter 2. Background and Historical Review 17 correlation based methods [Cro83]. That is why prefiltering with V G is used to local2 ize or band limit the correlation in the frequency domain (i.e., to lock into a particular spatial structure scale). By increasing the size of the V G a greater percentage of the 2 mean of the signals is removed [AW84]. But the same process also changes the noise characteristics so that, even for white Gaussian noise, correlation will no longer be the optimal process. Obviously, up to now the primary computational expense of the techniques considered was due to the multiplication operation. Sum of absolute differences (SAD) is a comparable difference minimizing search technique which in its normalized form is written as: SAD (u,v) Iul2 = x,y \h{x,y) - n h + -I (x 2 - u,y - v) + p | h (2.5) A n interesting but scarcely referenced work by Barnea and Silverman [BS72] is Sequential Similarity Detection Algorithm. This approach not only uses S A D but also looks for the best fit where the highest number of pixels are subtracted and summed before a threshold is reached. The inner loop of summation is stopped if the threshold is passed at present maximum number of pixels. Obviously, not only this algorithm is very fast, it is particularly suitable for large templates, as in satellite imagery. Barnea and Silverman provide interesting variation of the algorithm. Although there are studies comparing different cross-correlation techniques [BYX82] and SSD [AW84], showing the relatively greater robustness of SSD particularly in the presence of noise, I am not aware of any adequate comparisons made between SSD and S A D techniques. In conclusion, the following points about correlation techniques are noteworthy: • Correlation based methods are one of the simplest, oldest and best entrenched techniques for match filtering and registration. In fact, in many situations, these Chapter 2. Background and Historical Review 18 approaches have taken a Hammer Syndrome - i.e., "since the only tool in my toolbox is a hammer, all the problems I face are nails". This metaphor was first 8 noted by Horn [Hor83], who urged examination of alternative registration strategies. A side effect of this popularity has been the development of different hardware configurations for fast cross-correlation [ea93, PSW93]. • Like many other region-based matching routines, the choice of window size, and search boundaries are often arbitrary [OK92]. Small window sizes limit the amount of signal to be matched and therefore reduce the robustness of the match. Large window sizes, on the other hand, blend different disparities that may arise due to surface gradients or motion discontinuities. This is analogous to the problem of windowing or support region for an edge detector. Nishihara [Nis84] provides a good mathematical treatment of window size and convolution with a band limiting filter. I will also provide a treatment of both window size and search boundary under the hierarchical search methods. • In addition to areas of low contrast, cross-correlation techniques produce very poor results in the presence of occlusion, rotation, affine or perspective transformation or scale change. Bergen et al [BAHH92] have an excellent mathematical treatment for different transformations that may occur. But even this work does not cover all the possible transformations. • Aside from regular cross-correlation - which can be implemented in Fourier domain - and S A D all other variations described are computationally expensive, and slow, but can be implemented on parallel architectures [Fua91]. U s i n g a stretched analogy, one may argue that simple correlation to matching and registration, is what Fortran has been to computer language. 8 er 2. Background and Historical Review 19 These techniques are data type independent, but again aside from the traditional cross-correlation approach, they are not really useful when matching complex type data. Singh [Sin91], Scott [Sco86], and Anandan [Ana87] all provided confidence measures for their matching algorithm. But interestingly none of these confidence measures was based on the traditional signal to noise ratio. Anandan's approach [AW85, Ana87] is more localized and involves the values and directions of principal curvatures at the correlation peaks. Scott and Singh also provide directional confidences, but they are based on principal axis and principal component analysis of the entire correlation plane. But they modify the correlation results through heuristically based formulations which are rather arbitrary and effect the outcome of the process. Singh [Sin91], however, does have an interesting explanation of his choice of formulation and effects of his use of the entire correlation result. Other researchers such as Fua [Fua91] and Matthies [MS86] have also provided detailed treatment of the topic. Such error analysis techniques are particularly important in applications of Kalman filtering. There has been no well known multi-frame analysis method using correlation based techniques. Multi-frame analysis is important for several reasons. First it produces results between every other frame. This multiple result generation is particularly useful because it provides verification and a consistency check among results. More importantly, in the case of visual echo analysis, temporal smoothness and constant velocity provides additional verification by improving the signal to noise ratio of the final disparity maps through signal reinforcement. I will discuss this in more detail later on. In the case of cross-correlation there are no simple multi-frame analysis. The only Chapter 2. Background and Historical Review 20 solution is to compare every two images, an approach that grows exponentially with the number of images. Matthies et al [MKS89], Heel [Hee91], Singh [Sin91] and many others have used consecutive correlation results in Kalman filtering to improve 3D reconstruction of surfaces. But this is not the same as multi-frame analysis. Singh [Sin91] also uses averaging of correlation results to increase the spatial smoothness of flow. This is particularly interesting because it assumes and, in fact forces, temporal smoothness of the solution. • Majority of the subpixel approximations to the correct match are based on bicubic or quadratic surface fitting to the correlation plane. Again Heel [Hee91] has a good discussion of this topic. • Finally, there is no process to generate multiple number of evidences by any of the correlation techniques. It is possible to increase ones confidence by comparing different correlation measures, but the approaches are so similar that no real variation can be expected. It is when one gets different answers with low signal to noise that one can be more confident in the presence of anomalies such as occlusions or multiple disparities. 2.1.2 Differenced Based Techniques Sometimes the most effective techniques are the simplest ones. Case in point is the simple differencing technique which is applicable to tracking and motion analysis under constant illumination for stationary sensor. Barron [Bar84] provides a review of works to isolate moving objects against stationary background using simple differencing of consecutive images. As simple as this technique may seem, for many simple applications it is often overlooked. Noise removal, region growing, and object differentiation are some of the required Chapter 2. Background and Historical Review 21 post processing operations. Needless to say, this is the simplest, most intuitive and cheapest motion detection algorithm to date. More sophisticated techniques to improve the speed of this algorithm, by projecting the image samples on different axes and isolating moving samples has been also studied particularly in pattern recognition. 2.1.3 Differential or Gradient Based Techniques Before describing the gradient based techniques we need to clearly define two terminologies which are sometimes used interchangeably (perhaps due to a lack of consistently agreed upon definition). • At a point on the imaging surface the image flow depicts the instantaneous 2D projection of a 3D velocity vector of the corresponding point in the scene. • The optic-flow, on the other hand, is the apparent 2D velocity of a constant brightness pattern on the image plane. It is the second definition that gives rise to the gradient based optical flow techniques. These matching methodologies are purely concerned with the 2D motion of the brightness patterns and typically work based on the assumption of conservation of image intensity, also referred to as the brightness constancy model. That is, one assumes that the intensity of a point between the two images being matched remains identical. In motion analysis it is this assumption that isolates the 2D projection of the 3D velocities in space. The first two papers on gradient based techniques were apparently due to Limb and Murphy [LM75] and Netravali and Robins [NR79] in 1975 and 1979 respectively. But the most cited treatment of this topic has been by Horn and Schunk [HS81]. Subsequently variations of Horn and Schunk's approach soon followed notably [CK83, Nag83b, Nag86, TB87] and are still being developed to date. Chapter 2. Background and Historical Review 22 In short, given our assumption of constancy of brightness: I(x, y, t) = I(x + Sx, y + Sy, t + St) (2.6) Using the MacLauren-Taylor series expansion of the right hand side, and discarding the second and higher order terms: dl dl dl I(x, y, t) = I(x, y, t) + Sx— + Sy— + St— (2.7) simplifying this equation and dividing by St £ j , + g l , + /,=0 (2.8) where I , I and I , are partials of intensity I(x, y, t) with respect to x, y and t respectively. x y t In the limit as St approaches zero | | and | | are the instantaneous velocity components u and v. Therefore one can write: E - b ul + vl + I x y t = U.VI + I = 0 (2.9) t Since the spatial and temporal derivatives of the image intensity can be easily derived, Equation 2.9 provides a linear constraint equation between the unknowns (u,v). Obviously this is an ill-posed problem and a second independent equation is required to solve for both variables. Horn and Schunk then introduce the assumption of locally smooth optical flow. That 9 is, one should try to minimize: = Vu .Vu + Vv .Vv T T T h i s is often referred to as a the optical flow smoothness constraint. This assumption is only valid for the interior points of smooth surfaces, and is thus not a real physical or photometric constraint since it is obviously violated at motion boundaries. 9 23 Chapter 2. Background and Historical Review Minimizing the El + \E% - where A is the regularization parameter - over the entire image provides one with two non-linear equations for the two unknowns, which can then be solved iteratively. Some researchers, particularly Scott [Sco88], are overly critical of the gradient based methods. But even though limited - again due to their underlying assumptions - differential based techniques are an excellent mathematical treatment of the motion and registration based purely on the photometric characteristics of images. Perhaps the biggest limitations of Horn and Schunk's approach stem from discarding the higher order terms of the Taylor series. The assumption that these terms are negligible depends on the spatial frequency content of the intensity pattern and the magnitude of the displacement. In fact, highly textured regions, or large displacement can not be accommodated properly with this methodology. This is perhaps best seen through Figure 2.1. A x (a) A X (b) (c) Figure 2.1: Example of the effects of large motion and sharp intensity change on the differential algorithm for one dimension. f(x) and g(x) are the two functions that have gone through the translation h. In frame (a) the value of motion, h, is calculate correctly namely h = (f(x) - g{x)) * The same approach in frame (b) gives one the value d which is much smaller than the actual motion. Frame (c) shows how the motion can be estimated completely incorrectly due to a sharp change in the derivatives of the function f(x). Obviously, in frame (c) one could have a quadratic function with a large second derivative to display the same effect. Chapter 2. Background and Historical Review 24 This is a severe limitation. Image intensities are not always smoothly varying, particularly for textured surface patterns, and the assumption of smoothness of flow fails, especially at object boundaries. Another implied assumptions is that the light source is fixed, both in its intensity and position; and surface shading of the object do not change due to motion or surface inter-reflections. Perhaps one of most knowledgeable and prolific pioneers in the differential registration techniques is Hans-Helmut Nagel. Nagel[Nag83a, Nag83b, Nag86] included the second order terms in the calculation, and restricted the direction of the displacement field variations. This approach was interesting but in fact aggravates one of the major problems with any differential based approach - namely their sensitivity to noise. A n excellent contribution of Nagel's work, even though restricted to Lambertian surfaces, was changing Horn and Schunk's ill posed problem to one that is well defined by changing the nature of the assumptions. Mainly, the bightness constancy assumption was replaced with the temporal constancy of the spatial gradient of brightness. UE y + VEyy + Efy = Oj X uE xx + vE yx + Etx = 0; 10 That is: (2.11) (2.12) which leads to a two linear equations for two unknowns. Although elegant and well formulated in terms of the underlying assumptions, the use of double differentials, and the physical conditions that have to be met, obviously renders of this work rather limited to specialized domains. Even though not referred to by the authors, the intensity variation smoothness is partially responsible for the successful application of gradient based techniques in [BAK91] and [WM93]. Battiti et al [BAK91] use a hierarchical pyramid to lock into the proper T h i s is of course equivalent to, and sometimes referenced as, the assumption of locally constant flow field. 1 0 Chapter 2. Background and Historical Review 25 bandwidth for matching, but this could also be because of smoothness of the image in the higher levels of the pyramid. They also provide a heuristic measure of determining which level of the pyramid is a correct one to stop at. Weber and Malik [WM93] have another formulation, which involves convolving the two images with smooth but orthogonal filters. Using the brightness constancy model, the motion sequence for each band limited image pair creates a linear constraint for u and v. Given several smooth image pairs there is an overdetermined system that is used to find (u, v). The unusual thing about this work is that all the image pairs used are smooth non-overlapping band passed versions of the original image. Although the information content of the band pass filters is unique, as Nagel points out [Nag94]: "Unfortunately, this kind of estimating optical flow must fail the more, the better the estimated partial derivatives approximate the real derivatives, because in this case, the equations tend to [should] become linearly dependent". Tistarelli [Tis94] combined the first order constraint of Equation 2.9 and Nagel's second order constraints to form a three equation two unknowns overdetermined system. Each equation can be interpreted geometrically as a straight line in the (u, v) velocity space. Ideally all three lines should pass through the same point corresponding to the correct u, v. But in reality they intersect in three separate points. One can then solve for the velocity vector by least squares, or take normalized sum of the velocity calculations by solving the three possible pair of linear equations. Campani and Verri [CV90] used a first order brightness gradients within a neighborhood to create an over constrained linear system of equations i.e.,: 0 = 0 = Ix(X0, Vo)™ + Iy( 0> Vo) X v + It(XQ, Vo) Ix(xu yi)(u + u (xi - XQ) + u (y - y )) + x y x 0 Chapter 2. Background and Historical Review 26 I {xi, yi)(v + v (xi - x ) + v (y y x 0 y 1 - y )) + 0 +h(xi,yi) i (2.13) The final solution not only estimates of the optical flow but also its spatial variations. As Nagel [Nag94] points out, since Campani and Verri choose a 10x10 neighborhood to generate 100 linear equations for 6 unknowns, one can extend this work to solve for the two temporal derivatives of u and v as well. Similar to Battiti et al, Galzer [Gla84, Gla87] have also an interesting hierarchical treatment of the differential registration, but the brightness constancy assumption is particularly questionable in the presence of large disparities. 11 Finally, in [NY93] Negah- daripour and Yo try to partially eliminate the brightness constancy assumption. This is different than Cornelius and Kanade's work [CK83] which completely ignored the assumption. It would be interesting to see a gradient-based hierarchical approach that would address both large displacement as well as eliminate the constraint on conservation of image intensity. Lastly, one has to note Lucas and Kanade's work [LK81] which had quite a few novel ideas embedded in it. This particular work is harder to categorize since it was a cross between differential and region-based approaches. In summary, differential approaches suffer from a few shortcomings, mostly caused by the underlying assumptions (as usual). • The motion of objects has to be small for best effects; in fact, Schunk[Sch88] postulated that differential motion estimation techniques are similar to the neurophysiological mechanisms of short range motion first discovered in psychophysical One may argue (and in fact this is an implicit assumption often made) that the levels of Laplacian pyramid provides a normalization effect. 1 1 Chapter 2. Background and Historical Review experiments by Braddick[Bra74]. 27 12 • The smoother the image intensity variation the better. Many authors use smoothing operations to reduce the effect of sharp intensity discontinuities, as well as noise. One should ask, however, what happens when the intensity of a region is uniform. 13 • Due to the use of numerical differentiation the method is very noise sensitive. Therefore differential techniques are not ideal for registration of images corrupted by random noise. • Assumption of smooth disparity variation is quite limiting particularly at motion boundaries. Several techniques, especially those using Markov random field ap- proach add line processes to get a nice Bayesian estimations to the motion discontinuities [KD92]. Giachetti, Campani and Torre [GCT94] showed that because of shocks and vibrations present during image acquisition for autonomous navigation of vehicles on highways the "numerical computations of temporal derivatives is very noisy and therefore differential techniques to compute the optical flow do not provide adequate results." • Constant and motionless light source is an easily violated assumption. There is always interreflection and casting of shadows among the objects moving in a scene. In may robotics applications the mobile automatons carry their own light source. For majority of these shortcomings I have referenced papers that try to overcome them. The problem is that many of these papers introduce their own assumptions and T h e existence of neurophysiological short range vs. long range motion mechanisms has been brought under question by Cavanagh[CM89]. 1 2 Remember no technique does very well when the intensity is constant. The difficulty with differential technique is that the motion calculation is so localized that neighborhood distribution of the intensity does not help. 13 Chapter 2. Background and Historical Review 28 limitations, which is perhaps why no one has fully integrated them. On the positive side, one has to note that differential approaches do estimate motion to subpixel accuracy and are localized, hence do not suffer from the limitations of window based approaches. 14 Lastly, in the gradient based techniques it is not clear how to address multiple motion, for instance due to transparency or reflection. 2.1.4 Velocity tuned filtering Spatio-temporal filters for motion analysis were first discussed as plausible neurophysiological mechanisms by Watson and Ahumada [WA83, WA85] and Adelson and Bergen [AB85, AB86]. 15 Figure 2.2 shows the fundamental principle behind the velocity tuned filtering. In the right most figure a cross section of the moving strip parallel to the the xt plane is shown. A n edge detector can easily find the slope of the left and right most edges thus calculating the velocity of the stripe. But since motion is not always so clearly defined scale-specific velocity-tuned linear filters such as Gabor - which are Gaussian modulated 3D sine and cosine functions - are used for motion processing. The nicety of the Gabor filters lie in the fact that they provide minimum uncertainty in localization in time/space and frequency. Localization limits the size of the neighborhood contributing to the velocity. Localization in the frequency domain ensures that proper spatial structures are being matched. In fact, frequency domain analysis shows that for pure lateral translation of the image, the 3D Fourier energy of the images (arranged in space time cube of Figure 2.2) A g a i n one has to take this with a grain of salt! Calculating motion to subpixel accuracy has to take into account the optics and the sampling process that has given rise to a digital image. Lack of neighborhood support can be extremely detrimental and add to the noise process. Smoothing the image with a Gaussian, commonly used, has the effect of using a windowing operation. Lastly, the regularization processes often used, for example in Horn and Schunk, smooth the optical flow, thus causing neighborhood measurements (sometimes far away neighborhoods) to effect one another. 14 15 Interested readers are particularly referred to [AB86] for an interesting discussion of the topic. Chapter 2. Background and Historical Review (a) (b) 29 (c) Figure 2.2: An illustration of the spatio-temporal filtering, using the space time cube, (a) the first frame of motion for a grey bar moving to the left with velocity v. (b) the stacked (or echoed) representation of this motion for five frames, (c) a horizontal cut for continuous time (i.e., in the (x,t) plane). Adopted from Adelson and Bergen. has its energy concentrated along a plane whose slope and normal defines the velocity. That is: ui t = UUJ X + vujy (2-14) This means that to measure several motions within an image region one needs a collection of properly oriented filters which detect the proper motion at any location by having the highest energy (assuming no cross talk) when proper velocity and frequency are matched. Heeger[Hee87] was the first person to implement the above mentioned formulation shortly after this theory was introduced. This work was later improved by Porat and Friedlander [PF90] and other spatio-temporal formulations - such as phased based techniques by Jepson and Fleet [FJ90a] - soon followed. I will discuss some of the more recent and interesting approaches in this category later on, especially that of Shizawa and Mase [SM90], for multiple motion detection. The latest work in this area is probably by Niyogy and Adelson [NA94]. Some of the shortcomings of the technique can be summarized as: • Almost all the papers encountered assumes constant motion. • Little or no analysis of noise, blurring or geometric distortions especially on the phase-based spatiotemporal routines have been conducted. Chapter 2. Background and Historical Review 30 • Even though multiple motion has been discussed by some authors recognition of occluding contours is not fully developed. • Usually large number of frames are used to isolate the motion plane. This is particularly true of phase-based techniques that typically use nineteen frames or more. Although of interest to long motion sequences, this reduces the use of this technique for smaller frame sequences, (see [JJ94] for the application of phase based approaches to two frame stereo analysis.) • Using large number of frames also introduces delays in the estimation of motion sequence. This does not mean that there has to be a problem with causality of the motion sequence. It means that for a 19 frame sequence we will not have the proper motion estimates for the first 18 frames. On the positive side, this approach provides a novel and interesting methodology to registration, and tries to directly address both the frequency and spatial information of an image sequence through time. One application of this work to unstructured domains is that of Jahne for analysis of ocean waves [J91]. 2.1.5 Other Techniques Phase correlation Phase correlation is perhaps one of the simplest matching techniques in signal processing. Amazingly, however, the first written record of its use for image registration was not until 1975 [KH75]. This technique is however simple enough that it has been rediscovered by others [OC90] including this author. Phase correlation is basically based on the Fourier shift theorem. Given two signals, h (x) x = s(x) and h (x) 2 = s(x — r ) , their Fourier Chapter 2. Background and Historical Review 31 transforms are related by: H (f)=H (f)e- (2.15) 2niTf 2 1 Now given the above idealized relationships, there are a variety of ways one can formulate the phase correlation. These include: - 2 W _ 7*2 (f) ( 2 1 6 ) nm n-2 (f) ll«l(f)ll ^(f)^x(f) ll%(f)ll |^(f)Hi(f)| where ft*(f) is the complex conjugate of %(f) and ||?*,(f)|| = |^(f)| 2 7^(f)ft*(f); then the disparity r can be easily retrieved by the inverse Fourier transform of the above exponential: <S(x - r) = ^{e- } 27riTf (2.17) which will show up as a strong peak (Kroenecker 5) in the image plane, at the proper disparity. A few points about Equations 2.15 and 2.17 are in order. First, Equation 2.15 is not met in practice always. Because aside from aliasing effects, noise, blurring and transformation which introduce non-translational variations between the two images, the windows used in phase correlation are localized. Therefore, because of motion, rows and columns near the two window boundaries are not the same. As a result, the equations in 2.17 are not exact equalities, but very close approximations. Moreover, all the equations in 2.17 do not have exactly the same number of computational operations. So the question is which one should one choose; the computationally most optimal one? The most Chapter 2. Background and Historical Review 32 unbiased estimate of disparity using phase correlation is: |W5(f)Wi(f)| { } 1 j which is the original formulation by Kuglin and Hines [KH75]. The above distinctions becomes particularly significant when one notes Girod and Kuo [GK89] work on forward, reverse, and unbiased phase correlation r ^ forward ^backward r ^unbiased r — — — \{H (f)|| \6.1-i>) 11^ -j* H v \\ 2{1)\\ / u / V and their use as the probability density function of the motion disparity. 16 A more interesting issue is the relationship of phase correlation to regular spatial correlation. The numerator of Equation 2.18 is the cross-correlation of the two windows in the Fourier domain. Therefore, phase correlation is basically a normalized version of cross-correlation in the frequency domain, where each data point in the frequency plane 17 is divided by the magnitude of the Fourier at that point leaving the contribution of the phase difference Therefore one can write phase correlation as: (P (Ui)-phase(H ))_ l e hase 2 PC{HuH2) = \Hi(f)HZ(f)\-\Carr{Ui,H ) 2 (2.22) Kuglin and Hines recognized this similarity and mentioned that a more general approach is not to completely ignore the magnitude of cross correlation but a parametrized normalization: One should note again that PC bi d is not exactly the same as Equation 2.18 and is more computationally intensive. This is not to be confused with normalized cross-correlation in Equation 2.2 which does not have a frequency domain implementation. 16 un 17 ase Chapter 2. Background and Historical Review PC(Ui,H ,a) 2 = \Hi(f).H* (f)\- 33 a 2 . Corr(H H ) u 2 (2.23) where 1 > a > 0. Therefore, at the two extremes of a = 0 and a = 1 one gets the regular correlation and phase correlation respectively. Many authors such as Girod [Gir87] and Thomas [Tho87] have discussed phase correlation in motion compensated image coding. Wu and Fernando [WF88] provide a quantitative as well as qualitative comparison between phase correlation and differential techniques and found phase correlation to provide an overall better result in most cases. Pearson, Hines and Golosman [PJGK77] discuss a cleverly optimized hardware implementation of phase correlation, capable of comparing 128x128 windows at 30 Hz. As we will see, one of the advantages of phase correlation is that it can be used in multi-evidential correlation. Moreover, from Equation 2.17 one can see that phase correlation is not susceptible to scaled or constant intensity variation between the two image. Unfortunately, however, phase-based approaches are generally considered susceptible to noise. In fact, Lee, Krile, and Mitra [LKM88] show the relative weakness of phase correlation to noise compared to cepstrum. Perhaps realizing this problem, Pearson, Hines and Golosman, provided an interesting treatment for the noise, including low pass filtering the result. Nonetheless, phase correlation is still a complementary technique to Cepstrum approach and one that is very powerful, robust, popular and interesting. Section 4.8.2 gives a more detailed mathematical treatment of how phase correlation, cepstrum and regular correlation are related to one another mathematically. Hadamard Domain Technique In this section I will cover a disparity measurement technique relatively unknown in computational vision. There are three reasons that I will cover this technique. First, to Chapter 2. Background and Historical Review 34 bring this technique to the attention of other researchers in the field for further research. Second the approach is extremely fast and it is specifically designed for discretely sampled images, with integer intensity values. Third, the approach can be used in conjunction with multi-evidential correlation. Walsh-Hadamard transform is perhaps one of the simplest and fastest orthogonal encoding methods. To describe the technique briefly, given a sequence of discretized images I(x, y, t), x min <x< x , y max min <y< y x, ma and t min <t< t , then its three max dimensional Hadamard transform is: % A 7 ) = Xmax Vmax E E tmax _E I(x,y,t)(-iy^«>W (2.24) where p(...) is the parity function: max Xmin ^Og2{ymax-ymin)] )1 p(x, y, t , a, 7) = E ii + a x /=0 E m=0 \iog2(tmax-tmin + E )1 7n*n n=0 (2.25) The index values I, m and n simply refer to bit numbers in x, y and t respectively, and they range from 0 to the maximum number of bits applicable. The notation Wi G {0,1} refers to the i th bit of the integer w. The calculation of the parity function seems deceiv- ingly difficult. But in reality each of the three summations merely calculates the binary A N D of the two index, and counts the number of ones. So a better definition of p which I have not found in the literature is: p(x, y, t, a, (3,7) = t(a f > ) + W(P |>) + it (7 (1 *) (- ) where $(w) stands for the number of bits set to 1 in standard binary coded representation of the number w. Now, how can we use Hadamard transform to calculate image displacements? Following the examples of [YCC92, LC88, FC89] I will demonstrate the Hadamard based 2 26 Chapter 2. Background and Historical Review 35 algorithm for an image consisting of a single moving non-zero pixel with intensity L. Given two images, J ( x i , y i , 0 ) and I(x ,y ,1) 2 #(0,0,0) = /(xi,yi,0)(-l) from Equation 2.24: 2 ' ' ' '^ (2.27) = /(zi^OX-l)^™ '^ (2.28) p ( l l y i 0 0 = 2L and #(a,0,l) 0 = L(—l) ( ' ) p Xl Q -|- L{—i)p( ' ) X2 a +1 Assuming that the a = 2" and using Equation 2.26 we see that only the n th bit of x x and X 2 contribute to Equation 2.29, and: H(a, 0,1) = L ( ( - l ) - » - (-1)* '") Xl where x i and x , are the value of the n th > n 2 n (2.29) 2 bit of x and x respectively. Obviously, these n 2 bit values can be only 0 or 1. Plugging the four possible combinations into Equation 2.29: H{a,0,l) = L(x 2 ) (2-30) therefore, *>* - = m m ( 2 - 3 1 ) But the disparity in the x direction (assuming integer disparities) can be represented as: P°S2(* 5 =x -x = x 2 l Cma:c Y, ~ xmin)~\ ( ^n r x ~ xi, ) n (2.32) 71=0 which is simply the binary expanded representation of the disparity. Using Equations 2.31 and 2.32 we see that: \\og2( xmax ~%min)\ TTI nil A 1^ Chapter 2. Background and Historical Review 36 Using symmetry, we can derive a similar equation for the disparity in y: \log2{y max-ymin)] TTI(\ i)n 1 \ To extend this analysis to an image, / , containing p non-zero pixels instead of just one: R°g2{.Vmax— ymin)1 TT ( ryn n n ^ A 1 n ^i=p—1 TT Inn 1\ yV' ) 1 _ \ ^ ~ V #(0,0,0) h o n EJX Egg" but from Equation 2.33 £ fiog (y^-^)l 2 ^ 1 1\ Hj(Z ,0,1) 1 #(0,0,0) [ 2 - 3 b ) En 2"ff (2", 0,1) #,(0,0,0) i 2 f l ( 2 , 0 1 ) = 5 H;(0,0,0), therefore n n U o n n i ) X ff n j(2 £jg- ^g,(0,0,0) 1 )Q)1) #7(0,0,0) = EST 1 #(0,0,0) f 2 1 36 ' > ' which is the same as Equations 2.33 and 2.34. Unlike other techniques that are numerical implementations of algorithms developed for the continuous domain, Hadamard based disparity analysis is designed for modern digital computers and imaging equipment. The high speed of the algorithm and the availability of fast Walsh-Hadamard transforms both in hardware and software is attracting more research on this technique, especially for image coding and motion compensated imaging. But the algorithm in its basic form suffers from a few shortcomings as well. 2.2 Stereo Even though binocular stereo analysis can be viewed as a special case of motion registration [Hor86], the significance of the topic and its treatment in this thesis (including trinocular and multiple baseline stereo), as well as the unique matching methodology used for stereo warrants it special attention. Chapter 2. Background and Historical Review 37 There are many reasons for stereo analysis including segmentation and figure ground separation [Gri93], and gaze-control through vergence [OC91, 01s93, P E , Kam93, TOM94, TM94]. But historically stereo vision systems have been used in vision and photogrammetry for three dimensional scene reconstruction and distance measurement. As Regan et al point out [SW90] (page 13, and chapter 12 page 318): The hypothesis that binocular disparity is the effective stimulus for depth sensation is credited to the astronomer Johannes Kepler. The demonstration that disparity alone can produce depth sensation was first provided by Wheatstone (1852) ... In the next section, I will briefly review the geometrical constraint provided by stereo configuration. The material in this section is not only of importance to stereo vision, but has important applications in motion analysis. This provides an excellent example of how physical configuration restricts the space of possible solutions, thus furnishing a genuine constraint. In Section 2.2.2 I will review token based matching that has had its history and most of its application in stereo computational vision. 2.2.1 E p i p o l a r Constraint and Stereo V i s i o n Figure 2.3 shows two cameras and the projection of a point P into their image planes. Given the relative position of the two cameras, their focal lengths, and the coordinates of the two projections, the three dimensional location of the point can be derived through geometry. But matching and registration of points in binocular stereo is simplified through a physical constraint called epipolar geometry. Simply stated, given two cameras the line connecting their focal points, F and F', Chapter 2. Background and Historical Review 38 Figure 2.3: Epipolar constraint and the verging cameras. intersects the two image planes at points e and e' called their respective epipoles. Now 18 a point in space, P, projects on the two image planes at points p and p'. But the plane that passes through P, F and F',intersects the two image planes at lines ep and e'p', referred to as the epipolar lines for the point P. One can utilize epipolar lines to ones advantage, because these lines reduce the search space for finding matches from a two dimensional to a one dimensional space. That is, to find a match for p we can construct the line ep and its corresponding epipolar line in the second image which passes through e'. The match point p' is granteed to lie on this line thus constraining constraining the search space. Obviously, all the discussion so far relates to the frames captured by a moving camera. In fact, finding epipoles between two image frames and establishing correspondences has been used for finding the intrinsic and extrinsic camera parameters [LC94]. But the advantage of stereo analysis is that one can localize the epipoles as desired, thus simplifying the computations and increase the accuracy of our measurements. This is the reason that in majority of stereo vision systems (binocular or trinocular) the image I am using upper case letters for points in space, and the lower case letters for the points on image planes. 18 Chapter 2. Background and Historical Review 39 planes are kept coplanar - i.e., the optical axis of the two cameras are parallel. In this 19 case, the epipoles for the two images are located at infinity. This means that epipolar lines (all of which have to pass through the epipoles) are parallel to one another. In a calibrated binocular stereo system, for instance, the two cameras are parallel to one another and located horizontally. Thus, to find a match for a point in one image, one only has to search along the same row in the second image. Another advantage of this configuration is that it is "order preserving". That is, if there are points on an epipolar line ordered from left to right, their matches are also ordered from left to right. This property, which is not true for non-calibrated stereo, reduces the search area even further. It is easy to show that for calibrated camera, the depth of a point from the camera baselines is: Z = fl c (2.37) where f is the focal length of each camera, b is the distance between the two cameras c often referred to as the baseline, and 5x is the horizontal disparity in the image planes. Of course, if the two cameras are located vertically, then the epipolar lines are along the columns of the images. Using three cameras provides two epipolar lines for each point in an image thus constraining the problem even further and increasing the accuracy of the match. The configuration often used is a right angle isosceles, with calibrated cameras [KY90]. Later chapters provide a more detailed discussion of trinocular and multiple-baseline stereo. It is however noteworthy that epipolar constraint is not restricted to stereo cameras, and is just applicable to two frames captured from a moving camera in a stationary scene. 19 This assumes the focal length of the two cameras are equal. Chapter 2. Background and Historical Review 2.2.2 40 Stereo Matching Because of its applicability and importance, there are already several thorough reviews of stereo matching algorithms [BF82, Bak82, DA89]. One class of algorithms that has found particular applicability in stereo analysis, is the feature based techniques. Simply stated, feature or token based techniques establish correspondences between individual spatial image primitives in two frames. Each feature can have different attributes to distinguish it from near by ones, thus improving the robustness of the result. The methodology relies upon the ability of the system to both find an adequate number of tokens and then accurately localize them in the two frames. In this sense finding tokens is simply a filtering operation [Bak82]. The two most cited token based techniques are those of Moravec [Mor80] and Grimson [Gri80]. Moravec's interest operator was a corner detector based on maximal directional intensity variance. Once interesting points in the two scene were found and matched - using correlation - the cues were used to provide a mobile robot with motion calibration information and locate possible obstacles. Grimson's technique was the first implementation of David Marr and Tomaso Poggio's stereo algorithm [MP77]. This algorithm that had its roots in the preliminary studies in psychophysics of human stereopsis, was based on matching intensity edges and with the intensity gradient and edge orientation as attributes of each feature. The matching was conducted at different scales by using different size V 2 G operators. The matches were then used for controlling the vergence angle of a stereo system. Similar pioneering systems, notably that of Hildreth [Hil83], for motion analysis followed. Using tokens and their attributes for correspondence is the traditional symbolic image matching. The primitives used should be stable and easily detectable from both view points, dense enough to give adequate information about the surface characteristics, yet Chapter 2. Background and Historical Review 41 sparse enough that they are differentiated easily within a neighborhood. The feasibility of these requirements, much like other techniques in computational vision, is scene dependent. But in feature based approaches the probability of missing features between the two frames, and incorrect localization of features is a more serious problem. The edge based techniques, for instance, suffer immediately from the uncertainty in the location of edges using the linear filters. Even more serious is the implied assumption that intensity discontinuity separates image locales of equivalent depth. 2 0 Nonetheless symbolic image matching is a significant domain of computational vision that should not be overlooked. It is not always that one needs dense disparity maps, and stable features can greatly improve the stability and accuracy of a system. Unlike the area based techniques, given enough attributes, the matching can be quite local - i.e., no area based matching. Lastly, because of locality of the matching process - and assuming low angle cameras (i.e., orthographic projection) - foreshortening does not effect the matching process. 2.3 Hierarchical techniques Hierarchical based approaches [BA83] have gained great popularity in computational vision literature because they address several problems faced by variety of techniques. Two obvious limitations in region-based matching algorithms is the choice of window size and the search area [BYX82, BYX83]. Just like in edge detection, large windows contain more structure and information that can be used and are therefore less susceptible to noise. They do however blur the disparities (especially at discontinuity, or for curved objects) and suffer from localizing the exact portion of the image corresponding to (or contributing most) to the disparity. Small windows are better overall [Nis84] but can To see how easily this assumption is violated consider the occluding edges of a cylinder viewed by a calibrated stereo pair. 2 0 Chapter 2. Background and Historical Review 42 produce noisy outputs. Choosing small search regions can give rise to incorrect disparities, if the actual motion is outside the "guessed" area. On the other hand large regions also increase the probability of false matches and increase the computational cost. Pyramid structures provide a natural answer to these problems. If S max is the maxi- mum expected scene displacement at the base level of the pyramid, at the k th disparity is reduced to S 2~ . k max side) on the k th level the By the same token matching a window size w (each level is comparing a window size of w2 . k 21 Therefore, by matching at the higher levels of pyramid one can get a rough estimate of the motion disparity. These values are then doubled and passed on as the center of the search regions for lower levels. Using this approach, on each level one conducts matching on a small neighborhood. The total cost of this method is much less than a complete search at the lowest level. The window size can also be kept constant or changed based on the information contained in the windows and the goodness of the match. A more important reason for using hierarchical pyramids is that the ideal match is usually detectable at particular scales. That is, using the high resolution image for matching implicitly assumes that the spatial and temporal sampling scales are correct. But for large motion the high frequency image components are aliased. Such aliasing is a serious source of matching errors; and thus for high disparities, ignoring high resolution information is not only computationally efficient it is necessary [BHK91, B A K 9 1 , Ana87]; this is particularly true when one has a poor initial guess of the disparity value. Using the Laplacian of Gaussian pyramid allows one to "lock into" the correct disparity at higher levels and pass them on to lower lower levels (higher acuity levels) of the pyramid. A question that remains to be addressed is how to recognize the particular structural level of the pyramid, and discontinue the search if the lower levels of the pyramid obscure the This is of course a bit more complicated. With Laplacian of Gaussian pyramids each level is complimentary to the previous ones. But the structures included are larger anyway, so you want larger windows. 21 Chapter 2. Background and Historical Review 43 disparity by adding random errors. 2.4 Summary This chapter provided a brief review of multi-dimensional signal and image registration. Majority of the techniques reviewed, mean squared difference (sum of squared differences), mean absolute difference (sum of absolute differences), spatially tuned filters, Hadamard based routines and differential approaches have their roots in linear filtering and statistics. Phase correlation (and non-linearly preprocessed token based) methods were perhaps the only two non-linear techniques reviewed. Section 2.2 reviewed stereo matching as a special case of motion analysis and image registration. Hierarchical pyramids and their role in matching was also discussed, and their role in reducing the assumptions and limitations of region-based methods explored. Chapter 3 C e p s t r a l Analysis: B a c k g r o u n d , M a t h e m a t i c a l Preliminaries, and Modifications Mathematics not only defines the properties of a technique formally, it also furnishes clues and specifics to improving its behavior. This section is an investigation into detailed theoretical aspects of the cepstrum filter. Its primary objective is to provide a simple yet mathematically sound and well-formulated understanding of cepstrum, and to introduce computational improvements to two commonly used versions of the cepstral filter. As pointed out earlier, the visual echo framework is based on the simple observation that self-similar patterns in time or space can be viewed as spatial or temporal echoes of one another. Consequently, the calculation of disparities or isolation of these patterns (as in the case of stationary textures) can be seen as determination of echo arrival periods in time or space. Cepstral analysis [BHT62] is a powerful nonlinear adaptive signal processing methodology developed for detection and estimation of echoes and their arrival period. Because of the significance of echoes, cepstrum has been studied and used in variety of scientific fields such as acoustics, oceanography, seismology, geophysics, wave analysis, speech processing (phoneme chunking), radar and sonar processing, medicine, image deblurring and restoration, system analysis and signal recovery just to name a few. The breadth of applications of cepstrum and its variations give a vast amount of background material to be drawn upon. Consequently not all the literature and research work will be presented here (for reviews of cepstrum see [CSK77, LE84].). The aim of this chapter is to: 44 Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^ 1. present a brief historical review of origins of cepstrum with emphasis on its utility in computational vision. 2. provide a detailed mathematical treatment of cepstrum, its themes, variations, and extensions. 3. introduce computational and performance improvements to power and differential cepstrum for use in detection of echoes; and to provide a comparison between these methods and the traditional cepstral techniques. 4. furnish the mathematical foundations to examine cepstrum's properties and limitations. Section 3.1 provides a brief review of cepstrum's origins, concentrating primarily on its applications to computational vision for no other purpose than disparity analysis (i.e., due to the breadth of the domain I will not consider all applications of cepstrum in computational vision, such as deblurring, image restoration, etc.). Section 3.2 provides a detailed mathematical treatment of cepstrum and its variations. Section 3.3 provides computational and performance improvements to differential cepstrum and power cepstrum, which are the two most popular and robust forms of cepstrum. These improvements can be applied in other application domains frequently utilizing cepstrum. Finally, Section 5 concludes this chapter by providing a brief summary. 3.1 A Brief History of Cepstrum It would be prohibitive to discuss all the papers that utilize, compare, or even discuss cepstral filtering and its variations or extensions, such as homomorphic filtering. But, if Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modihcations46 ever possible, it is always interesting and informative to seek the origins of a technique and the person or people that first invented an algorithm. Silvia and Robinson [SR78] have traced the mathematical evolution of cepstrum for signal analysis to the classical works of Poisson (1823) and Schwarz (1872) for the study of the potential and the logarithmic potential theory. Others, Szego (1915) and K o l mogorov (1939), used cepstrum to factorize the power spectrum of random processes and to extract stable causal systems. Robinson's thesis (1954) was perhaps one the first applications of homomorphic filtering to engineering, but it was Bogert, Healy and Tukey [BHT62] that coined the terms "cepstrum" and "quefrency analysis" and used cepstrum as a signal processing technique [SR78]. Bogert and his colleagues were primarily interested in the recognition and retrieval of echo arrival periods due to seismic waves. Thus they defined cepstrum as the power spectrum of the logarithm of the power spectrum of the signal. Due to its definition, this 1 variation of cepstrum involves no phase information, i.e., it only uses the magnitude of the spectrum. Therefore, it is often referred to as "power" cepstrum and, as we will see later, avoids the problems of phase unwrapping that can hinder echo detection [Tri77]. Soon after, different variations and numerous applications of cepstrum followed. In fact, the papers utilizing cepstrum strictly for image enhancement and deconvolution, disparity analysis, or simple non-linear filtering are too numerous to cite. But there are a few papers that have to be cited as milestones in this area. For instance, some of the first applications of cepstrum to image processing were those Oppenheim, Schafer and Stockham [OSS68] and Dudgeon [Dud77]. Restricting our attention to disparity analysis, Lee, Krile and Mitra [LKM87, L K M 8 8 , L M K 8 8 , L M K 8 9 , KM92] were the first researchers to use power cepstrum specifically for stereo and motion analysis. Inspired by the columnar structure of primate ocular 1 T h e term power spectrum refers to the amplitude of the Fourier transform of the signal. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^ dominant columns, Yeshurun and Schwartz[YS89] postulated a plausible theory for the neurophysiological mechanism of stereo vision based on power cepstrum. Because of its speed and versatility, many researchers have used cepstrum in real-time vergence control for binocular robots. Most notably, Olson and his colleagues [OC90, OC91, OL92, 01s93], Coombs and Brown [CB92], Pahlavan and his colleagues [PE93, PUE92] and most recently Taylor, Olson and Martin [TOM94] have successfully used cepstrum for stereo convergence and motion tracking. Both Ludwig et al [LNN92] and Yeshurun [YS89] discussed the robustness of cepstrum with regards to image distortion and specifically rotations of up to 8 degrees and expansions of up to 6 percent. This is consistent with the results of Lee, Ramirez and Mitra [LRM91] who also got robust cepstral results of up to 6 degrees, compared with only 2 degrees for phase correlation produced by De Castro and Morandi [CM87]. Ludwig, Neumann and Neumann [LNN94] expanded on the stereopsis work of Yeshurun and Schwartz and "corrected some of the misstatements" of [YS89] but unfortunately, as we will see later, introduced some new potential problems by not considering the effect of their windowing operations. Much like Yeshurun et al and Ludwig and his colleagues, Jones and Lamb [JL93] used cepstrum analysis for stereo analysis and depth measurement. In this case, Jones and Lamb constructed a range finder by constructing a camera with two apertures; the resulting image consists of an image and an overlapped echo of itself which can then be studied using cepstral analysis for disparity measurement. The majority of other cepstrum applications in image processing involve image restoration i.e., deblurring, noise removal, and contrast enhancement. In the next sections of this chapter I provide a more detailed mathematical treatment of all cepstrum variations, and introduce computational and performance improvements to differential cepstrum and power cepstrum (and consequently complex cepstrum) [BL93a]. The following Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications48 chapters will present the broader applications of cepstrum and the development of visual echo paradigm to computational vision problems [BL92a, BL92b, BL93a, BL93d, BL93b, BL93c, BL92c, BXL94, BL94]. 3.2 T h e m e s , Variations and M a t h e m a t i c a l Preliminaries This section provides mathematical treatments of cepstrum and its variations as they appeared chronologically in literature. The aim is to formally explain the properties of these techniques, and provide the necessary formulas which have led to the improvements discussed in Section 3.3. The mathematical derivations and explanatory formula are written for one dimensional signals, but can be easily generalized to higher dimensions. This simplicity will become particularly beneficial once we discuss the differential and complex cepstrum. 3.2.1 2 Prototypical Cepstrum Before addressing specific cepstral techniques and their properties, it is helpful to define prototypical cepstrum as the common definition of all cepstral filters. In general, Prototypical Cepstrum = some kind of (linear) transform, followed by some kind of logarithm, followed by some (other) kind of transform Traditionally, the transforms used are Fourier and inverse Fourier transforms or power spectrum, which turn convolutions into multiplication. The original inventors of cepstrum discovered that to extract the echo - which is equivalent to convolution with a delayed Kroenecker 8 - one must follow the first transform by some kind of a pointwise logarithm Alternatively one could replace all the parameters x, f and T with x, f and r representing a multidimensional version of the these factors, and further replace their multiplications with dot products. 2 Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and ModiGcations49 to turn the multiplication into addition. The second (inverse) transform now returns the convolution filter to its transformed spatial form. In summary, the prototypical cepstrum of two convolved functions is equal to the sum of the prototypical cepstrums of the two functions. This property makes cepstral niters particularly suited for deconvolving the delayed echo arrival periods. 3.2.2 Power Cepstrum Bogert, Healy, and Tukey[BHT62] introduced cepstral filtering and quefrency analysis for estimating the arrival time of the echo of a complex signal. They defined (power) cepstrum as the power spectrum of the logarithm of the power spectrum of the signal. Given a signal h(x) consisting of an input s(x) and its echo delayed by r: h(x) = s(x) + s(x - (3.1) T) and its Fourier transform (using the Fourier shift theorem), T{h(x)} H(f) = F{S(X) = S(f)(l + S{X-T)} + e- ™f) (3.2) 2 where Ti(f) and <5(/) are the Fourier transforms of h(x) and s{x) respectively. The logarithm of the power spectrum of h{x) is then: log(||W)||) = log(||5(/)||) + log(l + cos(27rr/)) + % 2 Figure 3.1 shows the plot of the function log(l + cos(27r/)). 3 (3.3) As expected, the function is periodic with the maximum value of log 2, and asymptotically approaches —oo in the limit as the cosine approaches -1. 3 Obviously one can set the value of r by scaling the horizontal axis. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and ModificationsdO Figure 3.1: Plot of log(l + cos(2?r/)). A n obvious choice of analysis from here is to take the MacLauren-Taylor series expansion of log(l 4- COS(27TT/)) which results in: ~ (-l) cos (27r/r) n n (3.4) n=l In effect, the logarithm operation transforms the power spectrum of h(x) into the logarithm of the power spectrum of the single input, s(x), and a residual summation of decreasing cosine polynomials generated due to the presence of the delayed replica. 4 A better transformation of log(l + cos(27rr/)) for our present analysis is its Fourier (or more appropriately the Cosine) series expansion(see [Spi74] (pagel72), [GR80] (page45 Equation 1.514)): log(l + COS(2TTT/)) = log(2cos (7rr/)) (3.5) 2 , /„N , , „ ^ log(2) + 2(- log 2 + £ (-l) cos(27rnr/) — r —) n n 71 = x 1 n *We will see that Equation 3.4 will come very handy later on - thus being introduced here. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^! Substituting the above equation into Equation 3.3 one gets: log(IIW)H) = log(||5(/)||) + ± (-l)"cos(2mrr/) 71=1 ( 3 g ) U Thus the logarithm not only isolates the echo arrival period, r , from the contribution of the original signal, it also places it in a series of decreasing cosine functions. To extract r, traditionally, a second power spectrum has been applied, which in turn transforms the cosine series into oo (_i)n E ^zr-K* - ) nT 71=1 (-) 3 7 1 i.e., a ripple of decreasing Kroenecker deltas with periodicity r. Figure 3.2: Cepstral analysis of synthetic motion with (x,y) displacement of (7,3). The lower portion of the figure displays the cepstral result. The top peak is located at 3 pixels down and 7 pixels plus the width of the window horizontally. The lower peak is generated due to symmetric properties of the power spectrum of a real signal. Figure 3.2 shows the power cepstrum of an image and its spatially transformed temporal echo for a simulated motion field of three rows down and seven columns to the right. In these examples, for the sake of display clarity, I concatenated the images horizontally (i.e., placed them side by side to create a new image) instead of simply adding them. Therefore the horizontal disparity is increased by the width of the window under consideration. This point is worth emphasizing. Many authors, perhaps due to a misstatement 7 Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^ in [YS89], place the origin of the coordinate system at row zero but the middle column; and to correct a non-existent bias on the part of cepstrum they rotate the rows thus placing the new origin at the center of cepstrum plane. The reasoning often provided for this is that using the Fourier transform displaces the origin of the signal, and therefore the original of cepstrum (relating to (0,0) disparity) has been moved up to the top rather than being at the center. But this is quite erroneous. First, if the origin was displaced, it would be moved for both the row coordinate and the column coordinate. Second, since we are taking two transforms the origin of the coordinate system is moved back to the upper left corner of the cepstral plane. What needs to be realized is that the echo arrival period of the image in the horizontal direction has been increased by the window width. That is, if the true disparity between the two windows was (T TOWS , r / ) , because co s of the concatenation the two windows (as opposed to adding them) the new disparity is (Trows, window width + T I) CO S where the cepstrum peak appears. A beneficial effect of concatenating the images is that it moves the peak of the cepstrum away from the corners of the cepstrum plane, where the aliasing effects are largest and could hinder the peak recognition process. The obvious cost of the concatenation is the additional processing of Fourier transforms and pointwise logarithms, which amounts to nearly doubling the processing time. Another obvious result of such concatenation is that the Fourier transform of the resulting image has uneven spatial sampling with the horizontal components having twice the resolution of those in the vertical direction. To compensate for this (and assuming a 1:1 spatial sampling aspect ratio) one can pad the image with rows of zero, which also increases the computational cost further. A better justification, should one be needed, is that the camera can be rotated ninety degrees, thus interchanging the rows and columns of the image. One can then compare the even number columns of the two image (corresponding to the even fields of the images) and the odd numbered columns (corresponding to odd fields of the images) separately using Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications53 concatenated windows. This concatenation does however have some adverse properties that I will discuss further in Chapter 4. 3.2.3 C o m p l e x and Phase C e p s t r u m Oppenheim, Shafer and Stockham[OSS68] introduced the complex cepstrum and homomorphic filtering for use in predictive deconvolution and inverse filtering. Complex cepstrum, h(x), is defined as the inverse Fourier transform of the logarithm of the Fourier transform of the signal: h(x)=f- {\og(T{h(x)})} 1 It is easy to show that both power and complex cepstrum behave as deconvolving filters - i.e., if two functions are combined by convolution, the cepstrum of the resulting signal is equal to the sum of the cepstrums of the convolved functions. In fact the echo of a signal is merely the result of the convolution of the original signal with a delayed impulse function. As a result when a function and its echo are put through a cepstral filter the echo arrival period is marked by a deconvolved delta function [BL92a]. It is interesting to see the relationship between power and complex cepstrum for the detection of echoes. Taking the complex logarithm of the Fourier transform of Equation 3.1 one obtains: H(f) = log(5(/)) + log(l + e- ^) 2 \\S(f)\\+iphase(S(f)) + i l o g ( 2 + 2cos 2TT/T) (3.8) The real portion of this equation, i.e., the first and third term, is in fact, equivalent to the power cepstrum. The fourth term can be simplified to —infr and defines a plane in the Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^ imaginary portion of the Fourier transform with the slope —7rr. This term is normally referred to as the phase cepstrum, and due to phase wrapping is a periodic function. 5 Therefore the inverse Fourier transform of this function will also result in a peak at r . But even if the phase of S(f) is removed, like most phase correlation techniques, phase 6 cepstrum is susceptible to noise and phase unwrapping ambiguities. 3.2.4 Differential Cepstrum Noting the problems with phase unwrapping, Polydoros, A u and Fam [PAF79] introduced differential cepstrum as a shift and scale invariant deconvolution filter. Differential cepstrum simply replaces the logarithm operation in the complex cepstrum with the logarithmic derivative. 7 It is trivial to verify that, like its predecessors, differential cepstrum still transforms a convolution into additions of the differential cepstrum of the original signals. Bednar and Watt [BW85], for instance, used differential cepstrum to calculate complex cepstrum of exponentially decreasing signals [RR87a] without any phase unwrapping. As for the effects of a signal containing an echo, taking the logarithmic derivative of the Fourier transform of Equation 3.1 one obtains: H (f) d S(f) Sd(f) ~ 1 + e- ™/ 2 — 7rr tan(7rr/) (3.9) The echo arrival time, r , is again encoded as a parameter to the tan function and it can be retrieved using the inverse Fourier transform, producing a rippling effect with spatial Phase wrapping is a common signal processing problem which results from the theoretical equivalence of two phase angles separated by 2kn for any integer k. It is usually desirable but often difficult to "unwrap" the phase of a signal by adding the proper 2kir value when applicable[Tri77]. B y dividing the Fourier transform of h{x) by S(f) before taking the logarithm. The logarithmic derivative (or the derivative of logarithm) of a function is simply the derivative of the function divided by the function. 5 6 7 Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modihcationsbb periodicity r . Raghuramireddy and Unbehauen [RU85, RU84] extended the differential cepstrum to images and higher dimensional signals. Briefly stated, the higher dimensional differential cepstrum is defined as the inverse Fourier transform of the sum of the partial derivatives of the Fourier transform of the function along each axis divided by the Fourier transform of the function. For images, for instance, the differential cepstrum becomes: * 3.3 1 n(fuh) + nfuh) } - (3 10) Improvements to the traditional approaches: p o s C e p s C o s and diffCepsSin Because of its wide applicability in not only echo detection, but also in non-linear deconvolution, signal decomposition, adaptive filtering, and development of stable IIR filters, even recently researchers have been pursuing improvements to cepstrum[WY91]. Furthermore, since both complex and phase cepstrum contain a mixture of original phase of the signals, and additionally suffer from phase unwrapping problems, these improvements have dealt almost exclusively with power cepstrum and windowing of the logarithm result [Has74, Has75, AA92]. But for image analysis, as I will show later, power cepstrum suffers from sign ambiguities that can be overcome by verification through differential cepstrum. In the subsequent two subsections, I present simple modifications to power and differential cepstrum that not only improve their signal to noise ratio, but also increase their computational efficiency. Section 3.4 shows an application of these improvements to matching magnetic resonance images which have both a real and an imaginary component. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^ (a) (b) Figure 3.3: (a) Regular cepstrum of a synthetic motion of 7 horizontally and 3 vertically, (b) Positive cepsCos of the synthetic motion. 3.3.1 Positive CepsCos Reviewing Equation 3.3: MlW)||) = log(||5(/)||) + log(l + cos(27rr/)) + log 2 (3.11) the echo arrival period, r, is only present in the term log(l + cos(27rr/)). A n obvious method of improving cepstrum, whenever possible, is to subtract the cepstrum of the original signal s(x) from the result [LMK89, YS89]. In many applications (such as phoneme chunking), however, the form of s(t) is unknown, rendering this approach inviable. Moreover, my experience has shown that subtracting the cepstrum of s(x) can introduce additional noise and ambiguities near the peak value, primarily due to ringing and windowing effects. But, unlike log(||«S(/)||), l o g ( l + cos(27rr/)) is always a periodic even function. Moreover, from Equation 3.7 one observes that the main peak of the power cepstrum is a positive 5. Therefore to eliminate the additional noise introduced by the odd portion of Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modihcationshl (a) (b) (c) Figure 3.4: (a) regular power cepstrum for Figure 1; (b) its positive CepsCos - i.e., cepsCos with all the negative values discarded (c) the difference of (a) and (b). Note that the landscape showing the difference has a different visual scale. the function log(||<S(/)||), as well as all the negative components of its even portion, I introduce cepsCos. In short, cepsCos replaces the second power spectrum of cepstrum by its cosine transform [RY90] and optionally discards the negative portion of the result - i.e., positive cepsCos. By making this simple change, cepsCos reduces the computational cost and increases the signal to noise ratio of final result. In image processing the original form of the signal s{x) is real - i.e., non complex. Consequently, its power spectrum and the log(||5(/)||) are also even functions. So if for any reason the traditional cepstrum result for a real function is more desirable, one only has to put the result of cepsCos through an absolute value filter and still benefit from the lower computational cost. The reduction in computational cost is twofold; first, the second Fourier transform is replaced by a Cosine transform of an even function, and secondly the calculation of power spectrum (two multiplication and one addition) is replaced by an absolute value. However, if registering two images that are naturally in complex format (i.e., images consisting of real and imaginary parts) - for instance magnetic resonance images - then log(||<S(/)||) is no longer a purely even signal. Consequently, cepsCos will result in an improved signal by eliminating the added noise due to the odd portion of log(||(S(/)||). In Section 3.4 we will see use of cepsCos for matching complex medical images. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and (a) Modifications^ (b) Figure 3.5: (a) Differential cepstrum and (b) differential sine cepstral analysis of synthetic motion. Since differential cepstrum of a function is antisymmetric, for clarity the absolute value of the differential cepstrums is shown. Note the improved signal to noise ratio in the right figure. Figure 3.3 shows the result of power cepstrum and positive cepsCos for a simple simulated planar motion. Figure 3.4 shows the reduction in noise between the regular power cepstrum and posCepsCos for Figure 3.3 more clearly. As a quantitative measure the signal to noise ratio increased from 20.961 for regular power cepstrum to 36.132 for 8 positive cepsCos. 3.3.2 diffCepsSin By analogy to cepsCos, in Equation 3.9 the term — irr tan(7rr/) is a periodic odd function. The Fourier transform of this function is therefore odd and imaginary, with 8 Defined as the ratio of average signal to average noise. For signal to noise ratio, several different measures are often used. One of the most common ones, is the ratio of the power of the signal to the power of noise/background. Since power cepstrum is the absolute value of cepsCos, they would both have the same power. As a result, the ratio of average signal to average noise/background was used. In later sections, a more formal definition of S / N is used which treats the background signal as noise only. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^ (a) (b) (c) Figure 3.6: (a) regular differential cepstrum for a simulated motion; (b) diffCepsSin, (c) and their difference. Note that the landscape showing the difference has a different visual scale. anti-symmetric period peaks at r . Using a similar approach to the derivation of cepsCos, diffCepsSin can be defined as the Sine transform of the real part of the differential logarithm. In addition to the increased computational efficiency, diffCepsSin eliminates some of the additive noise from the inverse Fourier transform of — — S(f) . To be more specific to S(f) image processing, it is easy to show that for real signals ^srjy n a s the form o(f) + ie(f), where e(f) and o(/) are even and odd functions respectively. Consequently, diffCepsSin eliminates the contribution of the ie(f) to the additive noise. Figure 3.5 shows the (absolute value of) differential cepstrum and diffCepsSin for the simulated motion in Figure 3.3; Figure 3.6 is the isometric view of these functions and 9 shows how the peak in diffCepsSin is more pronounced that regular differential cepstrum. For this example diffCepsSin increased the signal to noise ratio by more than 50% - from 6.67 to 10.34. 10 1 used the absolute value to emphasize the difference and delay the discussion of anti-symmetric nature of peaks in differential cepstrum and diffCepsSin to a later section. Since, unlike posCepsCos, diffCepsSin contains both positive and negative values, I used the signal to average absolute value of noise as the signal to noise measure. 9 10 Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications60 3.4 M a t c h i n g and Registration of M a g n e t i c Resonance Images To precisely measure the progress of disease or treatment from volumetric data requires accurate positioning of patients during repeated examinations. Automatic matching, in the case of MRI, will allow appropriate changes to acquire more optimal positioning coordinates. Such techniques should also eliminate the use of unpleasant surgically- installed fixation devices which are currently employed in practice. Additionally, accurate positioning and characterization of 3D data can lead to matching of complementary data from different imaging systems or modalities. Our improvements to Cepstrum - as I will show theoretically - are particularly suited for matching and registration of complex signals such as magnetic resonance images. 11 This section presents preliminary work in M R I matching using visual echo analysis and Cepstral techniques. The aims of this section are: 1. To show the performance improvements caused by using cepsCos rather than the traditional power cepstrum. 2. To analyze the improvements in Cepstral matching through the use of imaginary data. 3. To gain better understanding of MRI, and the advantages of using multiple information channels it provides. 4. A comparative analysis of real and imaginary vs. amplitude only matching for M R I analysis. Section 3.4.1 describes the M R images used and gives an overview of the analysis, while Section 3.4.1 describes my preliminary results and other possible modifications to my methodology. 1 1 Complex signals are signals that consist of a real and an imaginary component Chapter3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modihcations61 3.4.1 M R I Processing To reiterate, cepsCos reduces the contribution of log(|<S(/)|), hence improving the signal to noise ratio and echo detectability. To describe this effectively, let us note that log(|<S(/)|) can be described as the sum of two functions: 0(f) and E(f) which is an odd function, which is an even function. Using the Cosine transform instead of the second power spectrum eliminates the contribution of 0(f) power spectrum, the Cosine transform of E(f) completely. Moreover, unlike the is not a purely positive function. This reduces the interference of the transform of E(f) with the detection of the main peak due to the echo. As mentioned earlier, one of the reasons for my interest in matching magnetic resonance images was to study and compare the performance of cepsCos in registering complex data as it compared to the traditional power cepstrum. To do this I acquired two sets of contiguous multi-sliced complex M R data with slice thickness of 2.5 mm. Figure 3.7 shows a pair of images from a single slices, one representing the magnitude and the other the phase information. As can be seen, unlike the magnitude image, the phase image of the skull is rather noisy on the periphery and is rather flat and lacks any structure or information in the brain region. In fact, close examination of the phase measurements in the brain area shows that the majority of the phase values are nearly zero, while for the rest of the image the phase values fluctuate between — ir and ir. This is primarily due the fact that the phase data, unless encoded otherwise, is normally generated by anomalies, such as magnetic field inhomogeneities or motion. As a result, non-zero phase values can be used to signal the presence of noisy or unreliable data. By pixel-wise multiplication of the magnitude image by the cosine and the sine of the phase data one can create a pair of real and imaginary images. Figure 3.8 shows the real Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modihcations62 (a) (b) Figure 3.7: Magnetic resonance image pair from a single slice, (a) the magnitude image (b) the phase image. and the imaginary M R images. As it can be seen, the lack of structure in the imaginary image, especially in the brain area confirms the fact that phase value (and therefore the sine of the phase) is close to zero. Moreover, the real image is similar to the magnitude image and it contains the main structural information of the image. This is primarily due to the fact that the cosines of the phase values in the brain region are close to one in value. The improved appearance and contrast of the real image results from the noisy areas in the magnitude having higher phase values. Since the cosines of the larger phase values decrease, multiplying the amplitude by the cosine deemphasizes (i.e., reduces) the noise in the real image. Lastly, it would be better if the phase channel of the M R images contained relevant data for matching, rather than acting as a noise deemphasizer. But as we will see, even in its present form one is able to get better registration using real data, than using pure magnitude images. But to improve registration, the phase channel can also be encoded to carry more significant signals, such as velocity or chemical shift information. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and ModiGcationsGS (a) (b) Figure 3.8: (a) The real image and (b) the imaginary image of a magnetic resonance image pair. Results and Analysis To test the effectiveness of cepstrum and cepsCos and to measure the improvements in the cepsCos performance for real and imaginary image matching, I acquired two sets of M R I brain data. In the second set, the subject was moved forward by roughly 3.0 cm, and his head was moved vertically upward as well (no lateral motion was feasible). The spacing of the M R I slices was 2.5 mm, i.e., causing a depth or Z disparity of approximately 12 slices. I then found the real and the imaginary data, based on the amplitude and phase images, and created three sets of data for these experiments: 1. Amplitude only data: This set of images contained the raw amplitude data as the real component, and zero as the imaginary component. 2. Real and imaginary data: This set consisted of both real and imaginary component of the data. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^^ 3. Real only data: This consisted of only the real image, and zero for the imaginary component. The image sets being compared were then concatenated in the Z direction and each set was put through the traditional power Cepstrum filter and cepsCos. Because of the fact that slices from the first image were concatenated with those from the second, the peak location corresponding to the depth disparity, A , appears somewhere in the z vicinity of the middle slices. To find this exact disparity we can search for the magnitude of the maximum peak, but I found the value of the rejection ratio - i.e., the ratio of the maximum peak to second highest peak - to be a much better indicator. Table 1 shows typical results of the cepstral matching for the chosen slices. Amplitude Location Rejection Ratio S/N Real & Imaginary cepsCos Ceps 3781.16 3983.65 (16, 0) (16, 0) 1.32865 1.2001 Amplitude Only cepsCos Ceps 4454.66 4454.66 (16, 0) (16, 0) 1.44977 1.35987 Real Only cepsCos Ceps 4832.12 4832.12 (16, 0) (16, 0) 1.53223 1.39225 21.1984 20.6612 21.3363 20.1591 19.655 20.384 Table 3.1: Preliminary experimental results for M R I image registration using different data types and cepstral techniques. Several points can be immediately made from the table. First, all techniques showed a disparity of 16 pixels vertically and zero pixels horizontally, which were visually verified to be accurate. More importantly, for all three sets of data, cepsCos provided a superior performance compared to the traditional cepstrum. This is reflected in the signal to noise and rejection ratio measurements. Lastly, real only data had the best performance 12 12 The definition of signal to noise ratio used here was: Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications^ over all. In fact, depending on which portions of the two image sets were being matched, the real and imaginary data performed better than the amplitude only data. This again reflects the importance of incorporating phase - at least as a noise suppressor - in the matching process. It would, of course, be better to use the phase channel to encode more relevant information such as velocity or chemical shift. There is other more subtle information present in the table. For instance, the peak magnitudes of cepsCos and Cepstrum for both the amplitude only or real only data are equal. This should be expected since, as I pointed out, these are both real images. But cepCos always outperforms regular Cepstrum in rejection ratio as well as signal to noise measurement. This is due to the fact that cepsCos reduces some of the positive value noise generated by power spectrum. More interesting still is that cepsCos outperforms Cepstrum when both real and imaginary data are used. This verifies that cepsCos is still a better matcher even when the imaginary data basically lacks any relevant information. It should be pointed out that, given the large spacing between the slices compared to pixel distance on each slice, the depth disparity calculation is not considerably less effective and accurate than the vertical or horizontal disparity measurement. In fact, my experience showed that, given the high degree of correlation between slices, the cepstral results showed many slices with the correct vertical disparity, while finding the exact depth disparity was rather inaccurate. Figure 3.9 shows two slices of the cepstral results. The figure on the left, is the cepsCos result for real only images, while the noisy image on the right is the regular power cepstrum of the amplitude only data. The windows chosen are 128x128 pixels, which makes the recognition of the peaks rather difficult. But close examination reveals a small but very bright peak at row 16 column 0 which corresponds to the correct disparity. Along with the ratio of the power of the signal to the power of background, this is a common S/N ratio. Variation of both measures are commonly used in practice. Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications66 (a) Figure 3.9: Cepstral results, amplitude only data. (b) (a) cepsCos of real only data, (b) power Cepstrum of In summary, this section introduced a new technique based on Cepstral filtering for matching and registration of volumetric magnetic resonance images. As expected, these improvements to power Cepstrum consistently outperformed the traditional methodology. Moreover, even though the M R phase information was used as a noise deemphasizer, I were still able to generate better results than matching only the amplitude data. Greater benefits can be expected by having the phase channel carry more relevant information with such techniques as velocity phase encoding, chemical shift imaging or inversion recovery. 3.5 Summary/Conclusion: This chapter focuses primarily on the mathematical definitions of cepstrum and its variations. Admittedly, not all questions, especially the intuitive ones, such as cepstrum's behavior in the presence of multiple echoes (e.g., occlusions and transparency) were Chapter 3. Cepstral Analysis: Background, Mathematical Preliminaries, and ModificationsQ7 addressed here. Other property related questions such as cepstrum's behavior in the presence of noise or convolution are also deferred to later chapters. But this chapter did introduce new variations of power and differential cepstrum that are better from the performance aspects of echo detection as well as computational efficiency. We will visit these criteria later on. But it should be noted that from now on my primary focus is cepsCos as a matching routine. 13 In fact, when the term cepstrum is used, I am in fact referring to cepsCos. Chapter 4 Cepstral Properties In the last chapter I set the preliminary mathematical ground works for the description and analysis of cepstrum filters, particularly power cepstrum. In this chapter I will look at the properties of two dimensional cepstrum particularly as it relates to image registration. To briefly outline its intended objectives, this chapter attempts to: 1. examine properties of cepstrum, particularly emphasizing its shortcomings which one needs to overcome. 2. provide a mathematical analysis of effects of convolution and noise on (power) cepstrum. 3. discuss the effects of windowing either the original images or the logarithm of the magnitude. 4. discuss the computational complexity of cepstrum, algorithms to speed up cepstrum using Hartley transform [HCL91, LRM91, SD89], and introducing simpler more intuitive algorithms to achieve similar or higher speed ups using Fourier transforms. 5. provide mathematical and experimental comparison between cepstrum's properties and other matching techniques. 6. discuss some misconceptions and oversights in utilizing cepstrum. 68 Chapter 4. Cepstral Properties 69 It is almost impossible to address all the conditions that may appear and adversely or positively affect cepstral results. This is particularly true because cepstrum is a nonlinear process. But fortunately, given its popularity as a signal processing tool, there have been many publications, and dissertations on the properties and behavior of cepstrum particularly in one-dimensional signal processing such as epoch detection [Pea79], and signal decomposition [Ski74]. Unfortunately these studies are at times inconclusive and can not address relevant issues in the applications of cepstrum to images and higher dimensional signals, such as effects of rotation or image distortions. Some of the material presented in this chapter were originally discussed in [BL93d] and are elaborated on herein. Rom [Rom75] has also discussed the properties of cepstrum for two dimensional signal and to the best of my knowledge this is the only paper dedicated strictly to (power) cepstrum's 2D properties. The next sections will address such issues as mathematics of cepstrum that results in disparity sign ambiguity, effects of low frequency signal, cepstrum's performance in the presence of noise, convolution, or change in illumination and other issues that may effect cepstrum's effectiveness as a signal matching filter. 4.1 Power Cepstrum's Sign Ambiguity in Calculating Displacement: Even though power cepstrum had been commonly used as a matching process, its inability to distinguish between oppositely signed displacement vectors were first discussed in [BL93d] and [BL93b]. To be more specific, reviewing Equations 3.3 and 3.4 for a 2D signal, if the horizontal and vertical disparities - i.e., the echo arrival periods - are or ( T , —Ty) — X (r ,r ) x y the cosine function, and therefore the result of the logarithm, remains the same. That is: log(l + cos(2 r(r / + r f ))) = log(l + cos(2n(-T f 7 I x y y x x - T f ))) y y (4.1) Chapter 4. Cepstral Properties 70 This is reflected in Figure 4.1 where the order of the windows, and thus the sign of the displacement vector, is reversed. The location of disparity peaks for power cepstrum however remains unchanged. (a) (b) Figure 4.1: (a) Power cepstrum of a motion sequence. In (a) the disparity is 3 downward and 7 to the left. Figure (b) shows the disparity 3 rows up and 7 to the left, even though the direction of motion is reversed the cepstrum results are identical. This problem should not come as a surprise. After all power spectrum and its logarithm is a positive even function. Taking a cosine transform will result in an even function, producing a coordinate system with 2 quadratures rather than 4. Therefore, to solve this problem I need a solution that does not produce an even function as its result. Differential cepstrum (or diffCepsSin) is exactly such a solution. Figure 4.2 shows the result of differential cepstrum for the above figures. Since differential cepstrum is an odd function, the sign of the disparity vector is reflected in the sign of the peak values (in the upper half of the result for instance). It is important to note that this limitation of power cepstrum is specific to multidimensional signals. In a single dimension the origin of the signal is determined and the Chapter 4. Cepstral Properties (a) 71 (b) Figure 4.2: Differential cepstrum (diffCepsSin) of two motion disparities with opposite signs. Even though the cepstral extreme values are located in the same place, being and odd function the sign of the values determines whether the disparity is (+3,-1-7) or (-3,-7). echo could only appear in the positive disparity. This is perhaps the reason that power cepstrum's sign ambiguity was previously overlooked or ignored by other researchers. One solution is to calculate both power and differential cepstrum; the power cepstrum provides reliable displacement measures with high signal to noise ratio, while differential cepstrum provides the proper sign of the displacement vector. Although there is some computational overlap, and the operations can be conducted in parallel, this seems to be a duplication of effort. Moreover, since differential cepstrum is a noisy process it seems better to calculate the power cepstrum to find an estimate of the disparity, and then move one window with respect to the other by a known amount - for instance (+d , +d ), d and d being both positive. If the peak in power cepstrum is moved x y x y by (—d, — d ) then one knows that the original disparity in the opposite direction of x y what was expected. 1 1 Olson [01s93] uses a similar procedure to avoid sign ambiguity in T h i s is exactly what I do in cepsCorr and multi-evidential correlation. Chapter 4. Cepstral Properties 72 binocular disparity calculation. He introduces a large enough artificial vertical disparity in one of the windows before using the cepstrum. Then depending on the change in the vertical location of the cepstral peak the sign of the cepstrum disparity is modified. The reason that this simple procedure works is that the domain of the application was stereo disparity analysis. 4.2 False Peak at (0,0) As I pointed out in Chapter 2 different structural scales in an image may result in aliasing of motion and incorrect disparity calculations. For instance, narrow band low frequency signal may not seem to move at all. As noted by Brown [Bro92], this is particularly true of frequency based techniques such as phase correlation and cepstrum; the main reason being that in these techniques all frequencies equally affect the outcome. Thus the low frequency signal can be viewed as a slow moving (or generally stationary) transparent motion, which is added to the higher frequency (i.e., more detailed) signal which provides the correct displacement. 2 Figure 4.3 shows the effects of lack of structure in an image. Although the image has gone through a motion of 1 pixel horizontally and 1 pixel vertically the main peak of cepstrum corresponds to a (0,0) disparity. To show that the false peak at (0,0) disparity is caused by the absence of adequate image structure - i.e., large amount of low frequency signal - Figure 4.4 shows the cepstrum results of disparity for the edges of Figure 4.3. Obviously, this effect is not restricted to cepstrum alone [Ana87]. But in the case of cepstrum the effect is more significant because, as we will see in Chapter 8 cepstrum is This should not be confused with the motion aliasing of high frequency signals [BAK91] in the presence of large motion. 2 Chapter 4. Cepstral Properties 73 Figure 4.3: A n example of false peak at (0,0) disparity. Even though a displacement of (1,1) has been simulated, the main peak corresponds to a (0,0) disparity. a technique particularly strong in recognizing and measuring multiple motions. Therefore, the existence of a peak at (0,0) can be indicative of a transparent motion with one of the images not moving, or motion boundary with the foreground or the background stationary. Some researchers [LNN94] ignore any peaks at (0,0) when calculating disparities. Others [TOM94] multiply the input signals by modified Hamming or Gaussian windows [LNN92] which basically amounts to removing the lower frequency cosine component in the log domain. A n alternative approach to band pass filtering the image is to use the usual hierarchical algorithm. This method, however, ignores cepstrum's ability in recognizing multiple motion and can be costly. A better method to check whether a second peak at (0,0) is due to multiple motions or not, is to move one window with respect to the other, and recalculate the cepstrum result (i.e., a small version of cepsCorr). If the peak at (0,0) moves due to simulated motion, then the peak is due to multiple motion. If the peak had resulted from low frequency structures however, the peak will remain at (0,0). Chapter 4. Cepstral Properties 74 Figure 4.4: Removal of false peak at (0,0) disparity by using the edge image of the previous figure. 4.3 Windowing Cepstrum, like most other methods in use today, is a region based analysis technique. This gives rise to several standard questions. The most popular question, of course, is the size of cepstrum window used for matching. As pointed out earlier, if the motion disparity is too large, small windows will fail to capture enough area of the moving object to estimate the displacement confidently. A large window, on the other hand, blurs the displacement field thus reducing the accuracy especially across the motion boundaries. Yeshurun and Schwartz [YS89] claim that the window size in each direction has to be at least twice the size of the motion for cepstrum to be succeed. Of course this also depends on other factors, such as the correlation between the new data entering the window and the old data moving out. One solution, of course, is to use pyramidal techniques to localize the location of matching windows on the higher levels before matching small windows on the lower level. A n alternative solution is to use algorithms such as [OK92] where the window size is modified to increase the accuracy of the match. A third approach, and one that Chapter 4. Cepstral Properties 75 provides more flexibility and information is cepsCorr, which will be introduced later. 4.3.1 Weighted W i n d o w i n g of the Input Images Intuitively, multiplying the windows being matched by a Gaussian filter should increase the contribution of the center pixels to the disparity match, thus localizing the disparity measurements more. Ludwig, Neumann and Neumann [LNN92] propose this approach and show improved detectability for the cepstrum peak. Taylor, Olson and Martin [TOM94] use a Hamming window in order to reduce the contribution of the borders. But a simple mathematical analysis advises against such an approach. To explain briefly, two concatenated Gaussians result in an even Gabor function (i.e., a modulated Gaussian) in the Fourier domain. Therefore, multiplying the two windows by Gaussians (or Gaussian like filters) before concatenating them, is equivalent to convolving the Fourier domain image with a Gabor. Consequently, if the frequency of the Gabor filter and log(l-|-cos(27r/r)) are not the same the cepstrum peak will be reduced or eliminated, while the correct Gabor function frequency will in fact improve the result. The choice of frequency and the width of the Gabor function depend on the eccentricity of the Gaussians (i.e., how centered they are on the two windows) and their standard deviations respectively. Figure 4.5 shows a one-dimensional example the effects these parameters may have. Figure 4.5(a) shows the cepstrum of a randomly generated signal, and its echo which is delayed by the window width (128 samples) plus seven pixels. The peaks of the cepstrum are exactly the window width +/— 7. Figure 4.5(b) shows the result of cepstrum by multiplying the two windows with very narrow Gaussians (standard deviation of 8 pixels). Figure 4.5(c) is the result of cepstrum when the two Gaussian functions (std div 50) were off center from the two windows by 20 pixels. As it can be seen the two cepstrum results are incorrect. Interestingly when the Gaussians were Chapter 4. Cepstral Properties 76 centrally located and had a standard deviation of fifty, the magnitude of the cepstrum results increased by about ten percent. This is perhaps due to the fact that with a large standard deviation the width of the Gabor function is very small and has a smoothing effect on the log(l + cos(27r/r)). 250 300 (a) (b) (c) Figure 4.5: Effects of windowing the two signal with Gaussians. A signal of width 128 was randomly generated and concatenated with a delayed version of itself. The delay was 7 pixels, (a) shows the the regular cepstrum of the signal, (b) Is the cepstrum of the signal, where the two windows were multiplied with very narrow Gaussians before concatenation, (c) is the cepstrum where two Gaussians were not narrow but were displaced toward the edges of the window by 20 pixels. 4.3.2 Windowing in the Logarithmic Domain As pointed out earlier, the signal to noise ratio of the cepstrum can be improved by removing the cepstrum of the original signal. But some authors have proposed other mechanisms to improve the signal to noise by windowing the result of the logarithm operation. Hassab and Boucher [HB78] showed marked improvements by using a modified Hamming window. Armestrong and Ahmed [AA92, AA91] showed even better results using an inverted Butterworth window. Windowing in the logarithm result is equivalent to noise removal in the cepstral Chapter 4. Cepstral Properties (a) 77 (b) Figure 4.6: (a) The result of the log of power spectrum of an image and its echo after the removal of the effects of the original signal. Even with the additional removal of the signal the quefrency window is still seems noisy. But the cepstrum of the signal is quite visible in (b). window using convolution but is computationally less expensive. It is also much more cost effective than removing the cepstrum of the original window. But as many researchers have shown cepstrum is a very effective matching technique. My own experience has shown that such windowing is really not necessary. Figure 4.6 shows logarithm of the power spectrum of an image and its echo after the logarithmic result of the original signal was removed. The subtraction of the log of the power spectrum of the original signal was made to show that even with such operation the result is quite noisy. Without the additional expense the periodic striped pattern of log(l + cos(27r/r)) is even less visible. But because of cepstrum's inherent strengths, the cepstral peaks of 3 the results are however remarkably strong. 3 E v e n the windows were made large so that the stripe pattern would appear. Chapter 4. Cepstral Properties 4.3.3 78 Zero P a d d i n g and N u m e r i c a l Stability As I pointed out earlier concatenating the windows horizontally increases the sampling rate in the horizontal direction, but taking the second power spectrum would be equivalent to taking an inverse Fourier transform and finding its magnitude. Therefore, in reality the sampling rate remains unchanged. What the concatenation does however is to move the peak away from the corners of the image where aliasing and noise can reduce the accuracy of the estimated displacement. But the peaks are still near the horizontal edges of the window, zero padding the image in the vertical direction (or placing the two windows diagonally) may seem a natural improvement. In fact, Skinner [Ski74], as well as others (as well as my own experiments), show that zero padding the signal increases the signal to noise ratio of the cepstrum. But generally the improvements are so minimal that additional computational cost does not warrant this. 4 There is however a more important issue involving the presence of black background, with rectangular moving objects. Ludwig et al [LNN92, LNN94] showed that linear edge segments may result in additional maxima peaks in power cepstrum. This is perhaps best reflected in Figure 4.7. Two randomly generated squares are placed on a black background and are independently displaced. The figure shows the cepstrum result. Even though the multiple motions are detected (I will discuss this in more detail later on), the peaks generate a checkered pattern of multiple peaks. Even though in this result the two main peaks are detected, the additional peaks does obscure the number of independent motions and may result in incorrect estimates of disparity. As one expects, aliasing becomes more of a problem when the windows being matched are small. In such cases padding may help more, but there are other windowing techniques, discussed before, that are just as effective. 4 Chapter 4. Cepstral Properties 79 Figure 4.7: Effect of moving square windows in front of a black background. The checkered multiple peaks not only interfere with the measurement of displacements, they also obscure the number of independently moving objects. Obviously this is a very unlikely problem and one that does not warrant much concern. But a more serious problem is the calculation of the log of the power spectrum. Theoretically the power spectrum of the signal can be zero or cos(27r/r) can become -1. If either condition is satisfied then the power spectrum of the signal and its echo becomes zero and log(O) is negative infinity, thus causing numerical problems. Even for very small values of the spectrum the log can produce large negative numbers that adversely affect the outcome. There are two solutions to overcome this problem. One is to filter out small values of spectrum and replace their log with a reasonable constant; another - which I will discuss later - is to add a one (or a small positive constant) to the power spectrum before taking the log. Because of cepstrum's matching strength both techniques provide very similar and adequate results. It should also be noted even without these precautions that numerical instabilities due to log of small values is extremely uncommon. 4.4 Noise, Convolution, First Order Distortions, and Illumination Change Image blur and distortions due to motion or convolution are significant phenomena sometimes overlooked in computational vision. In reality local distortions can be generally modeled as affine transformations some of which (particularly scale and rotation) in turn, Chapter 4. Cepstral Properties 80 can be analyzed to the first order of approximation as convolution with asymmetric (i.e., deformed) Gaussians and derivatives of Gaussians [Man94, M092]. Another significant effect, especially in a local sense, is the change in illumination or brightness of an image patch due to changes in inter-reflectance, shadows, or simple change in surface orientation. In this section I will provide an analysis of the effects of these irregularities on cepstrum analysis. This material is basically a rederivation and modification of materials presented in [HB76, AA92, AA91, HB78, Has74, Has75] and additional material on higher order spectral analysis and Bicepstrum [NM93, NR87, NP88, TG92]. 4.4.1 Effects of A d d i t i v e Stationary R a n d o m Gaussian Noise The title of this subsection exemplifies the main assumption in this analysis - i.e., the noise is additive stationary random Gaussian process. The additive assumption is a rather common and accurate one. As for the stationary random Gaussian process, the noise is almost never (see the definition of "almost always" in Chapter 2) either Gaussian or stationary and random. The central limit theorem tells us that without any additional knowledge specific to the sensor and the sampling operation, and assuming finite temporal sampling of each pixel (modelled as summation of pixel values of several images) a Gaussian distribution is an adequate model for the sensor noise. But the additive noise is really never uncorrelated. Luckily for us, sensor noise is usually not a major problem in computational vision applications. 5 In section 2.1.1 we saw that minimizing the mean squared error between the two signals, or equivalently their sum of squared differences, is the optimal matching procedure in the presence of white Gaussian noise. For correlated Gaussian noise, higher order correlation or cepstrum analysis are in order [NM93, NR87, NP88, TG92], but these analyses usually require long signal sequences and are computationally very expensive. 5 At least the ones that I am focusing on in this dissertation. Chapter 4. Cepstral Properties 81 In section 4.3.2 we saw the use of inverse Butterworth windowing or Hassab's modified Hamming window to remove the high frequency noise from cepstrum. In the tradition of doing things backward(I), and since I have discussed some possible solutions, this section provides a mathematical discussion of the problem. Manifestation of A d d i t i v e Noise in Power C e p s t r u m In their ground breaking work on cepstrum, Bogert, Healy and Tukey provided experimental comparison for the echo detection performance of cepstrum and autocorrelation in the presence of different types of noise (e.g., pink noise, green noise etc.) [BHT62]. Childers et al [CSK77] also provided some experimental results providing additional results and the first treatment of noise removal in the logarithmic domain. But Hassab and Boucher [HB76] provided the first detailed mathematical treatment of cepstrum performance in the presence of uncorrelated Gaussian noise and showed the significance of the relative bandwidths of the signal and the noise. A modified and more accessible version of their work is as follows. Given a signal, its echo and additive noise n(x), h(x) = s(x) + s(x -T)+ n(x) (4.2) the Fourier transform of the signal becomes: H(f) = S(f)(l + e~ ™f) + M(f) (4.3) 2 = ) /A(7)e W s ( / ) ~' r i T / + y/N(f)e iMf) where A(f) = |«S(/)|(l + cos(-27rr/)), &?(/) is the phase of «S(/) and 0V(7) (4.4) and <j> {f) N are the amplitude and phase of the Fourier noise. Now multiplying the above with its 6 conjugate we get: \n(f)\ 6 = A(f) + N(f) + 2y/A(f)N(f) COS(TTT/ - <M/)) This is basically writing the Fourier transform in terms of the magnitude and phase. Chapter 4. Cepstral Properties + cos(27rr/)) + \M(f)\ + 2 /A(f)N(f)cos(nrf = + \M(f)\ + (1 + where </> (f) = Mf)+Mf)w e s e e 2y/A(f)N{f)co8(*Tf - Mf))) <f> (f)) R * W))*ccs(W) W ) | + Mf)\ t° - ] = R 82 + 2^A(f)N(f)cos(nrf - <f> (f)) R Abbreviating |5(/)| + |Af(/)| + 2^A(f)N(f) C0S(7TT/ - that the logarithm of the above equation becomes: log(Q(/)) + log(l + cos(27rr/)). (4.6) This is exactly the result that Hassab and Boucher [HB76] and Bogert and Ossanna [B066] produced but more simplified and less elaborate. Namely, the peak of the cepstrum is still produced as usual except that it is modulated with the power spectrum of ^Qfj) • The log(Q(/)) also adds noise to the final result. By considering the probability distribution density function for the noise and making a few additional simplifying assumptions, Hassab and Boucher concluded on "the dependence of the statistical measures upon the pointwise variation of input to noise spectra. Thus the total signal power to noise power (S/N) alone is insufficient measure for setting the cepstrum performance, and relative bandwidths of signal and noise are also needed." They also calculated the detection probability of an echo for an exponential signal with equal bandwidth for the signal and the noise, and showed that it degrades for signal to noise ratio greater than -4 db. This is far below the capabilities of the present common cameras. 4.4.2 Effects of C o n v o l u t i o n and B l u r As was pointed out earlier, cepstrum is in fact a deconvolving filter, and is often used for system analysis. It is however important to know what would be the effect of convolution on a signal's echo, as in the case of motion blur, and how cepstrum is effected by such problems. Both Hassab and Boucher [Has75] and Tavakkoli [Tav86] treat the subject ^ Chapter 4. Cepstral Properties 83 adequately. Given the signal: h(x) = s(x) + s(x - T) * r(x) (4.7) where r(x) is the convolving filter, then the power spectrum of the signal and its echo can be simplified to: 7 cos(27rr/)) U{x) = \S(f)\(l + (4.8) Therefore, after taking the logarithm and using the MacLauren-Taylor series expansion, we will get: U{f) = log(|S(/)|) + \TZ(f)\cos(27rr/) + 0.5|7e(/)| cos(27rr/) + 1.0/3.0... 2 (4.9) Taking the cosine transform of the above we get: 8 h(x) = s(x)+f(x—r)+0.5f(x — T)*f(x — r ) + 1.0/3.Of (a; —T)*f(x—r)*f(x —r) + ... (4.10) Therefore the peak of the cepstrum is in fact convolved with the cepstrum of the convolving filter. Note that this derivation also has applications to the result of Equation 4.6. Figure 4.8 shows the application of cepstrum to a blurred echo. The motion is again (3,7) and the blurring function is an anisotropic Gaussian function with the blur axis parallel to the direction of motion. Even after the distortion, the peak is still at (3,7) and quite visible. Being a deconvolving filter, a natural question to ask would be about cepstrum's performance in the presence of periodic or convolved signal. Without using any more 9 The actual formula is slightly different and involves square of power spectrum of r(x). This equation is actually a result of complex cepstrum analysis. It is easy to show (but outside the scope of this discussion) that power cepstrum's peak is not only convolved with the delayed cepstrum of f(x) but it is distorted due to the effects of phase of 11(f). I am assuming an even function for the blur. Also known as the "moving stripped apple" syndrome! 7 8 9 Chapter 4. Cepstral Properties 84 Figure 4.8: Effects of motion blur on power cepstrum. mathematics, it should be clear that cepstrum deconvolves the two signals and generates the sum of their cepstrums. Figure 4.9 shows how cepstrum demodulates a moving plaid image constructed from the convolution of two sinusoids. 4.4.3 Effects of Illumination Change: Changes in position of the light source - as in the case of autonomous vehicle navigation at night - or the effects of inter-reflectance or shadows cause brightness changes in an image. Simplifying the problem by modeling - for the small local neighborhoods considered the illumination change as a constant multiplicative factor: h(x) = s(x) + as(x — T) (4.11) where a is the change in illumination. Then the power spectrum becomes: = W ) | ( l + o + 2ocos(27rnr/)) 2 (4.12) Chapter 4. Cepstral Properties 85 Figure 4.9: Power cepstrum of a moving plaid function. therefore the cepstral result is: ?/ x , / /~x ( a) 8(x — TIT) - / x — n h(x) = log(a/2) + s(x) + V *—'— X n n 1 4.13 where a = 2a/(a + 1) thus the basic effect is to reduce the magnitude of the cepstral 2 peak by the illumination effect. An interesting problem arises if a = 1.0. In this case, lim (...)_•_! log(l + C O S ( 2 7 T T / ) ) cos — r —oo. To counter this, Hassab and others multiply the signal with a exponential function. Existence of any noise, illumination change, and/or blurring will eliminate the degenerate case. 10 Fortunately, in my image processing applications, the sensors were very good and quite adequate. Consequently improving cepstrum's performance in noise is not as critical of an issue as improving its performance in the presence of image distortions. Noise analysis is outside the present scope of this thesis. A dramatic display of cepstrum's effectiveness was provided by Yeshurum and Schwartz [YS89] 10 I n my experience, perhaps due to numerical round off, log(...) never approaches negative infinity. M y cepstrum routine does check for very small values passed to the logarithm and returns a large negative number. Chapter 4. Cepstral Properties 86 where they blurred an image and deleted portion of the image, and still calculated the proper disparity. 4.5 R o t a t i o n and E x p a n s i o n As pointed out in Chapter 2, one of the fundamental assumptions of region based techniques is that on a local scale the motion of the scene, is singular (no multiple motion), and parallel to the image plane of the camera. That is there are no rotations and or scale changes allowed. But reality often defies assumptions. Observation of human motion, for instance, shows that there are plenty of rotational and scale changes that occur even as the subject tries to move parallel to the image plane. Yeshurun and Schwartz [YS89], Ludwig et al [LNN92] were among a few who studied the effects of scale and rotation on cepstrum results. Interestingly enough, even though such studies are often dependent on the scene content their results are remarkably close. Both experiments showed that cepstrum performed very well in the presence of rotations anywhere in the range of 6 to 8 degrees and expansions of up to 10 percent. Lee, Krile and Mitra [LKM87] used a polar representation of the power spectrum to transform rotational motion of ocular fundus images, into translational motion, and then registered these images using power cepstrum. 11 Figure 4.10 shows the results of cepstrum analysis of a rotational motion. This result is particularly interesting, because it also displays the two shortcomings of cepstrum pointed out earlier. First, at times the primary peak of cepstrum corresponds to (0,0) motion and it is the secondary peak that should be considered as the result. Second, even though power cepstrum is good at recognizing the magnitude and orientation of the motion, it can not differentiate between the two directions of motions (e.g., (u,v) and In polar, or log polar techniques, the main issue is the choice of the origin. In the case of fundus image analysis, the images used are circular images of the retina, so the choice was already made. 11 Chapter 4. Cepstral Properties 87 (—u, —v)). This gives rise to an interesting physical effect, as if the image of the plant is being split from the middle, while in reality switching the direction of the arrows of the left half of the plant image will give rise to the correct disparities. (a) (b) (c) Figure 4.10: Effects of rotation on cepstral results. The gray arrows are representative of the motions recognized by ignoring the peak at (0,0) of cepstrum. As it can be seen, places where the secondary peak is used are near the center of the rotation or the back wall (i.e., areas of virtually no motion). In these regions the motions are very close to zero. Secondly, note that since power cepstrum does not know the difference between (u, v) and (— u, —v) then the arrows on the left half of the flow field are pointing the wrong the wrong direction. None the less the magnitude and the orientation of the motion is qualitatively correct. Figure 4.11 shows the same result for expansion. Again the secondary motion especially near the center of expansion is close to the real value of motion and very close to zero. The background wall has basically no motion, which can be recognized by the high randomness of the grey arrows indicating that the real motion is in fact depicted by the peak at (0,0). 88 Chapter 4. Cepstral Properties (a) (b) (c) Figure 4.11: Effects of expansion on cepstral results. 4.6 Subpixel Motion There have been several interesting papers on subpixel analysis for motion and registration [WC90, TH86, Hee91] with the majority using a bicubic spline. Earlier, however, I discussed the effects of convolution on the cepstrum peak. Consequently, one can confidently claim that subpixel motion analysis is purely dependent on the sampling filter used in the image. Olson and Coombs have tested and used bilinear distribution for subpixel analysis [OC90, OC91]. I have not found this type of interpolation an adequate operation for many of the examples I have tested, even though at times it seems adequately accurate. Personally I believe more study is required to gain more understanding of the effects of sampling on cepstrum and its ability to get better result. Figure 4.12 shows the result of posCepsCos for a simulated subpixel motion. Bilinear sampling shows a disparity of 4.3 pixels, which is close to the actual displacement which is 4.4 pixels horizontally. 12 In images where subpixel motion is simulated by bicubic interpolation, the cepstrum filter deconvolves the result into the cepstrum of the bicubic and the cepstrum of the actual motion. 12 Chapter 4. Cepstral Properties 89 ceps cos (for subpixeition) l 30 (a) (b) Figure 4.12: Positive cepsCos for a simulated subpixel motion, (a) the figure and its positive cepsCos. (b) the landscape of the cepstral plane. 4.7 Parallelism and C o m p u t a t i o n a l C o m p l e x i t y The computational complexity of fast Fourier transform is 0(n log n). Simple back of the envelope calculations reveal that power cepstrum requires less computation than the mean squared error analysis (i.e., sum of squared differences). 13 But there are stan- dard computational techniques especially for dense disparity measurement to reduce the computational cost of these methods. A popular form of improving the computational cost of cepstrum for instance has been the use of Hartley transform. He, Cai, and Liang [HCL91] along with Lee, Ramirez and Mitra [LRM91] and Steckner and Drost [SD89] have all used Hartley transform to reduce the computational cost of cepstrum in image processing by about 40 percent. Given cepstrum's popularity in signal processing, He, Cai, and Liang also report the existence of real time hardware for television signal processing. One of the advantages of using Hartley transform often cited is that the memory requirement is exactly the same of that of the image size, while for the F F T the memory requirement is twice as much. There are however a few points that needs to being pointed Given a h by w window and a search margin of h by w , sum of squared differences uses 0(h *w * h * w ) operation while cepstrum uses 0(m log m) where m = (h + h„) * (w + w„). 13 s s s s Chapter 4. Cepstral Properties 90 out: 1. None of the above mentioned techniques use cosine transform instead of power spectrum as the final transform for cepstrum. This can not only reduce the cost of the final Fourier transform by fifty percent, it also reduces the number of multiplications and square root operations used in the final stage. 2. The main cost of cepstrum analysis is in the trigonometric computations of the sines and cosines. The standard trick of using a precomputed table goes a long way to reducing the cost of F F T and the cepstrum. Tabulated logarithm values and interpolations are used in commercial signal processing products, and can be used here as well. 3. Another major cost of F F T is the reordering of the input data into a bit reversed version using a butterfly algorithm [GW93]. This operation can easily be accomplished using the standard Fourier transform hardware [PSW93]. 4. Since usually more than one cepstrum is used, one can use standard tricks of computing two Fourier transforms with one computation and then calculating the first power spectrums without use of any additional memory [PFTV88]. Another issue worth pointing out is that cepstrum is a local matching operation, and as such it can be done in parallel. Figure 4.13 shows an application of cepstrum using 16 transputer nodes. 1 4 Needless to say individual Fourier transform operations can be implemented in parallel. Images courtesy of SRI International. 14 Chapter 4. Cepstral Properties 91 Figure 4.13: The optical flow field due to egomotion 4.8 C o m p a r i s o n w i t h S t a n d a r d Techniques In Chapter 2 we reviewed different techniques, their behavior and their properties. If there is a logical conclusion to be drawn, it is that there is really no optimal matching technique that works best under all illumination, sampling, noise, blur, and geometrical distortions. A l l techniques are bound by explicit or intrinsic assumptions that limit them or under which they are formulated. For instance, it can be proven that the sum of the square differences algorithm is the least sensitive technique to uncorrelated Gaussian noise under motion parallel to the camera plane and no rotation [Mat92a]. For the more realistic assumption of correlated Gaussian noise and similar physical constraints higher order spectral analysis provides the greatest insensitivity [NR87]. But as Horn points Chapter 4. Cepstral Properties 92 out [Hor83], depending on the domain, the environment, and the circumstances these assumptions are easily and often violated. As an example one might think of motion blur, or scale change and transformation distortions. Therefore it is logical to state that all techniques are at one level or another input dependent and they perform (qualitatively as well as quantitatively) best under different conditions. What complicates the issue even further is that not all attributes based on which different techniques are judged can be measured quantitatively. Thus even though a technique may not perform best under variety of imaging conditions [BFBB92, BFB94] it may still be the best overall performer given the variations among its application domains, and all the attributes or requirements for which it is judged. Having said all this, it is still important to know strengths, and perhaps more importantly weaknesses and failure modes of a particular technique. For instance, it is trivial to prove that under uncorrelated Gaussian noise, mean squared error - or equivalently sum of squared differences - provides the optimal matching solution[Mat92a]. Moreover, for the more realistic assumption of correlated Gaussian noise higher order spectral analysis provides the correct solution [NR87]. But depending on the domain, the environment, and the circumstances these assumptions are almost always violated. As an example one might think of motion blur, or scale change and transformation distortions. Another significant factor in design and evaluation of matching routines is their versatility and generality of application. I introduced cepstrum as a general, powerful, and robust framework for the analysis of motion, stereo, texture and boundary symmetry. But this simple routine is also able to address other important computational vision routines such as multi-frame analysis, motion transparency, and subpixel motion estimation. These flexibilities are addressed in the next sections of this chapter briefly, and more detailed treatments of them are provided in the next chapters. Chapter 4. Cepstral Properties 93 This chapter has taken a critical look at cepstrum particularly emphasizing its shortcomings. The objective has been to amplify these weaknesses in order to be able to provide new algorithms that will overcome these problems. These improvements will be introduced under the heading of multi-evidential correlation and more quantitative and qualitative comparisons provided there. The next subsection provides comparisons of cepstrum to other techniques, particularly drawing upon the existing literature and personal experiences. Section 4.8.2 will show the similarity of cepstrum to regular correlation and conclude with the statements of the originators of cepstrum. 4.8.1 Differences from Standard Techniques To evaluate cepstrum's performance I need to do two things. First I must examine cepstrum as an echo detection tool; and then choose a general and representative area of computational vision to compare cepstrum's performance and generality to other techniques. To date cepstrum routines have proven to be the most powerful technique for detection of echoes. Reddy and Rao [RR87b], for instance, showed that differential cepstrum performed better than waveform delay analysis, or unwrapped phase averaging. Such performance differences will become even more acute for images and higher dimensional signals, primarily due to greater possibility of errors in multidimensional phase unwrapping[Dud77]. Of course, as is also seen in my experimental examples, power cepstrum has a higher signal to noise ratio than differential cepstrum. This is primarily due to the well known fact that differentiation leads to higher noise levels. As a result, it is safe to claim that power cepstrum is the most robust and common approach for the detection of echoes, primarily because it does not suffer from additive noise due to differentiation or phase unwrapping. On the other hand, however, since Chapter 4. Cepstral Properties 94 power cepstrum lacks any phase information, signal reconstruction (which is really not required in registration) is impossible when using power cepstrum[OL81]. As a result I only use power cepstrum as a matching technique. 15 To point out cepstrum's differences with other matching techniques, one has to narrow down an area of computational vision related to matching and registration. As pointed out earlier, my area interest is motion analysis and stereo. Not only is visual motion analysis one of the fundamental areas of computational vision, it also provides one of the richest area for comparative analysis. Time varying image analysis plays a significant role in segmentation of scenes, encoding 3D information, egomotion estimation, object tracking, determination of focus of attention, and estimation of time to collision[Nak85], as well as video frame predictive coding, motion compensated interpolation and motion compensated hybrid coding[Cla90]. Barron, Fleet, Beauchemin, and Burkitt[BFBB92] compared several motion algorithms, notably spatio-temporal filtering [Hee87], correlation matching techniques [Ana89], gradient based[HS81], and phase based approaches[FJ90b]. I have not performed as detailed an analysis, but I have examined the performance of power cepstrum relative to phase correlation[KH75], gradient based approaches [HS81], and correlation [LBP88]. On synthetic motion images, compared to the first two techniques, I found cepstrum to perform very well particularly in the presence of noise. Mitra and Lee[LMK89] have also shown that cepstrum is much more robust in the presence of noise than phase correlation [KH75]. Moreover, Wu and Fernando[WF88] showed that, on at least synthetic images, phase correlation performed better than differential techniques [LL88]. They furthermore stipulate that the only reason phase correlation did not perform better, on a qualitative level, than the differential technique on their only real T V image, was due to the iterative nature and smoothing Actually I use cepsCos - my version of power cepstrum - because it performs better and is computationally more efficient. 1 5 Chapter 4. Cepstral Properties 95 properties of the differential techniques. This shortcoming of phase correlation can be overcome by simple post processing. No comparison of matching methods is complete however without considering correlation or Sum of Square Differences. These techniques are often used due to their simplicity and their optimality in the presence of uncorrelated Gaussian noise. But camera noise is often spatially correlated ( with a symmetric probability distribution). Consequently, I explored the use of third order cumulants[NR87, TG92] and BiCepstrum [NP88] for analysis of displacements in one dimensional signals. These techniques are mathematically optimal in the presence of uncorrelated or "correlated" Gaussian noise. But unfortunately, they require large data segments - which means large window sizes. Another important problem (and sometimes more prevalent than noise) is the violation of the assumption of brightness constancy due to motion of the light source, cast shadows, or change in the direction of illumination due to motion itself[WM93]. What makes this problem particularly difficult to handle - and as a result often ignored in computational vision - is its dependence on outside (i.e., environmental factors), while noise can often be eliminated by use of better hardware. Unlike other techniques, particularly differential methods which are particularly sensitive to changes in lighting, it can be shown mathematically that Cepstrum behaves very robustly in the presence of illumination changes. 16 What sets cepstrum apart, of course, is its non-linearity. Without its nonlinearity, as we will see in the next section, cepstrum has more in common with correlation than one may expect. In fact, originally Pratt [Pra78] and Gonzales et al [GW93] and later Olson and Coombs [OC90] showed that cepstrum can be interpreted as an adaptive correlation mechanism. But more significantly, the logarithm in cepstrum behaves as a selective Basically the log{l + COS(27TT/)} becomes log{l + acos(27rr/)} where a is the multiplicative factor (less than one) which corresponds to the change in the illumination. 16 Chapter 4. Cepstral Properties 96 filter by compressing the effect of narrow band signals, and emphasizing the information rich broad band portions of the image (i.e., edges). This behavior is quite different from many linear techniques, particularly first and second gradient approaches, or linear filtering methodologies. Comparing techniques analytically, or even numerically, is a good academic exercise, and provides a valuable guide to improving the present methodologies. In fact, the development of cepsCorr was partially a consequence of comparisons that I drew between cepstrum and different correlative approaches [LBP88, Ana89, Fua91]. Even though I will be providing such a comparison myself - in the results section - 1 generally found that drawing firm conclusions on such comparisons, without the consideration of the images being matched, can be rather deluding. Specifically, there are constant improvements introduced to traditional methods, that render such comparisons outdated (see for instance [CV92]). This does not mean that one should dismiss the existence of the "silver bullet", the methodology that is "mathematically" proven to be the best matching routine under any and all distortions, noise types and levels, and any other conceivable anomalies. 17 Finally, as we will see later, visual echoes as a framework provides a substantial flexibility that is not so easily exploitable by other routines. Even for a subset of visual echoes framework, such as multi-frame analysis or recognition of motion boundaries and multiple motions, some techniques provide solutions but these solutions, when applicable, require a change in the premises, assumptions, computations and the properties of the original approach [JB93a]. With visual echoes, on the other hand, one should not need the benefit of thresholding, a priori knowledge and/or modified methodology. Banking one's Ph.D. on this however, is like trying to find the solution to a historically open problem as a thesis topic. 17 Chapter 4. Cepstral Properties 4.8.2 97 Similarity w i t h Standard Techniques I introduced cepstrum as an adaptive non-linear correlation technique. To understand this mathematically, the power spectrum of the logarithm is equivalent to its autocorrelation. Thus to see what it means to calculate the cepstrum, one can say it is equivalent of finding the correlation of: (4.14) There are two noteworthy points. First, this is exactly like taking a regular correlation, except that the magnitude of the function is reduced. What this means is that the normally higher frequency signals are now more emphasized. This is primarily due to the fact that higher frequencies in images have lower amplitudes. By the same token lower bandwidth signals - i.e., constant intensity blobs and periodic functions are generally deemphasized. This is reflected in Figures 4.14. The edges of the circle are well emphasized, while the effects of the stripe and the uniform intensity of the blob disappears. The second point to be noticed is that emphasizing the high frequency makes the cepstrum not only a good matching technique, but also makes it very noise sensitive since usually the noise associated with images, whether correctly or not, is often modeled as high frequency signal. Figure 4.15 shows the cepstrum equivalent filtering of an image and the same image with added Gaussian noise of standard deviation 20.0. As can be seen the high frequency signals are emphasized and in fact the dynamic range of the signal is also improved, but unfortunately the effect of the Gaussian noise is also reflected in the image. Finally, I would like to draw attention to the similarity of the cepstral equivalent image and that of a silicon retina image simulated by software [MM91] (see Figure 4.16). Chapter 4. Cepstral Properties (a) 98 (b) Figure 4.14: Cepstral equivalent filtering: note how the low frequency functions are deemphasized. Perhaps the best way to conclude this analysis is to quote the statements of the originators of cepstrum, Bogert, Healy and Tukey [BHT62]. 18 "We have given evidence that autocorrelation is more likely to fail to detect complex echo than are the more elaborate processes we have been developing. It is important to understand why this is so. It is natural to think of the duality between time and frequency as an either-or phenomenon-sometimes we look at things one way, sometimes the other. In so far as the actual time series are concerned, the situation is often far more both than either. Series containing echoes, whether or not geophysical, are not likely to have had a white spectrum before the echo was added in. Some may have had a spectrum that was smooth enough to give the autocorrelation a fighting chance to find 18Page 220 of the reference. Chapter 4. Cepstral Properties 99 the echo, but many have not. We could always try to proceed by first finding the power spectrum of the original series and then calculating a filter that, when applied to the original series, would approximately smooth out the short-quefrency variations in the spectrum. Then we may apply this filter to the original data and calculate the autocorrelation of the resulting series. On the other hand, we could try many alternative filterings of the original series and then look at the autocorrelation of that one whose frequency was free of short-quefrency components (i.e., trends with frequency). Such procedures will surely be no easier than these we suggest. Their only advantage would seem to be that they would allow one to whom autocorrelation is sacred to apply it. The usefulness of the more nonlinear process lies in the fact that liftering 19 the log power spectrum removes the effects of short-quefrency component without any need for recognizing their individuality." 4.9 Summary This chapter provides a review of cepstrum properties as a matching routine, particularly emphasizing its limitations and shortcomings, which need to be addressed. In Chapter 2, I provided a partial list of characteristics for matching routines. But the scope of this work should surpass a cursory list-like review, emphasizing windowing, numerical stability, effects of additive noise, convolution (particularly blur), rotation, expansion and illumination changes. A short comparison, particularly emphasizing the similarities to other approaches, was also provided with graphical examples and theoretical analysis. Liftering refers to filtering in the log domain. Chapter 4. Cepstral Properties 100 Figure 4.15: Cepstral equivalent filtering for an image and an image with added noise. Chapter 4. Cepstral Properties 101 Figure 4.16: A n image after it has been put through the different layers of retina. Chapter 5 Cooperative Analysis of M u l t i p l e Frames by V i s u a l Echoes Traditionally, image matching and registration techniques in computational vision utilize only two image frames at a time. This was originally motivated by the lack of multipleframe image matching routines, as well as the perceived similarity of this approach to human perceptual operations such as binocular stereo disparity measurements. But, as noted by Price[Pri90], multi-frame analysis allows introduction and use of constraints on the equations of motion or stereo. Such constraints can improve the signal-to-noise ratio of disparity detection mechanisms and improve the accuracy of such estimates. Moreover, such processing can be used in the direct estimation of constant or variable accelerations for 3D structure recognition [Fra91]. Historically, multi-frame analysis has been limited in use only to motion analysis. But many computational vision tasks, such as trinocular stereo disparity calculations, multiple baseline stereo measurements, multi-frame motion analysis, motion-stereo (binocular or trinocular) techniques, vergence control and structure from uniform three dimensional acceleration involve determination of disparities among multiple frames. The large breadth of applications of multi-frame analysis warrants it a greater deal of attention and research in computational vision. As a framework (which by definition should provide a large breath of applications), visual echo analysis is flexible enough to extend to this domain, thus providing a uniform approach to multi-frame analysis. Where equal disparities exist between multiple frames, the cepstral peaks reinforce one another in time and space. Consequently, when the 102 Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 103 constraint of constant separation is applicable, the increased number of frames leads to an increase in the detection and accuracy of disparity estimation. In the case of unequal separations - for instance due to acceleration - disparities between every two frames are calculated, providing a rate of change in disparities. 5.1 Introduction, B a c k g r o u n d and Previous W o r k One of the first - and perhaps most interesting - work in multi-frame analysis was by Mohanty[Moh81] who used a maximum-likelihood approach to position and track target points in space. Another early work in this area by Braniv[Bar85] used dynamic programming for detection of dim moving targets , but much like other methodologies [GV93] the constraints on inter-frame disparities were not used to improve detectability or es1 timation. Perhaps the best known multi-frame methodologies are the spatio-temporal techniques [AB85, WA85] (see Figure 2.2). Even in these techniques, although constant linear velocities are normally assumed, there is no inter-frame reinforcement of disparities. Instead, in the case of multi-frame phase analysis for instance [Fle92], the presence of a constant velocity gives rise to a better fit of the data to a plane. In effect, in the spatiotemporal approach the use of multiple frames is a requirement or restriction, rather than a feature. Lastly, routines that utilize a large number of image frames [Fle92] automatically introduce a lag or delay time in processing caused primarily by the non-causality of routines which require future data for present disparity calculations. Huang, L u i , Hayes and Mersereau[HLHM92] on the other hand, did introduce an interesting recursive differential approach for multiframe analysis where disparities between every two frames were registered. Two more noteworthy works in this domain, Inter-frame disparities refers to the disparity between any two frames separated by one or more frames. 1 Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 104 that fall into the 3D structure from multiple frames categories are by Spetsakis and Aloimonos[SA90] who used an elegant mean squared approach, and K i m and Price[KP93] who used a feedback mechanism for improving structure from motion estimates. The objective of this chapter is twofold. First, it extends the domain of visual echo analysis in order to provide a uniform approach for the analysis of multiple frames and applies it to such areas as trinocular stereo, motion-stereo, multi-frame motion analysis, and multiple-baseline stereo. Second, it provides a technique that fully utilizes the natural constraints of multi-frame analysis - such as reinforcement of constant disparities in time and/or space, and direct recognition of acceleration or inter-frame disparities. The next section will briefly cover the behavior of cepstrum in the presence of multiple echoes or disparities [BL93b, BL92c] and sets the ground work for the use of visual echoes in spatio-temporal motion analysis. Section 5.3 provides a treatment of trinocular stereo as a visual signal with multiple echoes, and shows how horizontal, vertical and diagonal disparities can be derived directly. Section 5.4 covers multiple-baseline stereo and how the signal-to-noise ratio improves dramatically as we incorporate more frames. Finally, section 5.5 shows the application of cooperative multi-frame analysis to stereo-motion disparity measurement. This section shows how disparities in time and space (motion and stereo) are decoupled, and moving targets can be directly segmented. Extensions of this approach this approach can be applied to vergence control and trinocular stereo-motion analysis are also discussed. Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 5.2 105 V i s u a l Echoes and Spatio-temporal Image Analysis The interesting question is what if one has multiple replications or echoes of the signal in time or space, i.e.,: Mx) = 5(x) + E ? s ( x - r 0 (5.1) = o where is the disparity of the I th echo with respect to the first signal. Taking the Fourier transform and factoring out the power spectrum of the S(f) we get: H(f) = S(f)(l + Er exp(-27rzr f)) (5.2) i =0 taking the power spectrum and the logarithm we are left with: log(|S(f)|) + log(n + 2 + £? o cos(2 rr f) + E ^ - cos(2 r(r - )f)) = 7 i 7 i Tj (5.3) This is a significant formula, because it shows that if we have multiple echoes, either due to multiple frames or due to multiple motions, then the argument of the logarithm becomes a sum of several cosines instead of a single one. The periods of these periodic functions are the relative disparity of every two echoes in the scene. The power spectrum of the logarithm then generates multiple peaks corresponding to the differences of the arrival periods (or disparities) between every two echoes. Figure 5.1 illustrates the result of cepstrum analysis on a simulated non-linear acceleration. The simulated motion field is made to reflect acceleration of the object. The motion disparities with respect to the leftmost frame are (1,1), (3,3) and (7,7). The peaks in the result, displayed below the original image, show the displacements between each pair of windows. As usual, the origin in the results is the upper left corner; peaks occur at one frame width plus the relative motion of windows. It is significant to note that should the object go through a constant velocity (or, in the case of multiple baseline stereo, constant disparities) the echo peaks between Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 106 Figure 5.1: Multi-frame optical flow analysis utilizing cepstrum. The simulated motion field is made to reflect acceleration of the object. The motion disparities with respect to the leftmost frame are (1,1), (3,3) and (7,7). frames overlap, thus increasing the detectability and accuracy of the echo arrival measure. This important fact sets cepstrum apart from other multi-frame analysis routines. To demonstrate this, I simulated a constant velocity motion and analyzed the motion in space-time cube. This is much like the spatio-temporal analysis depicted in Figure 2.2 except that I are using a non-linear filter. I then conducted a three-dimensional cepstrum analysis on the space-time cube, and looked at the spatial disparity signals in time. The peak in the first cepstral image shows all the disparities between any two image separated by one time frame. If the object is going through a constant motion, the cepstral peaks overlap thus increasing the signal-tonoise ratio. This is shown in Figure 5.2(a) where the single large peak in the first frame corresponds to a constant disparity of 7 pixels horizontally and 3 pixels vertically. The peaks in the second frame, Figure 5.2(b), correspond to all the disparities between the frames separated by two time slices in the cepstrum results. We see that the displacement of the peak doubles, reflecting the disparity between every two frames separated by two time steps. The peak magnitudes of these comparisons reduces dramatically and the Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes (c) 107 (d) Figure 5.2: The results of spatio temporal cepstrum analysis for a constant velocity for eight frames, (a) shows the disparity for every two consecutive frames, (b) every two frames separated by one frame, (c) for two windows differing in image number by two, and (d) for two frames differing in three frames. Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes Peak Magnitude YulO 108 Signal to Noise Ratio Y 6 900 .0 (a) (b) Figure 5.3: (a) The peak magnitude and (b) the signal-to-noise ratio as a function of the number of frames used. noise level increases as the temporal disparity between the frames increases. It is also important to note that spatio-temporal cepstral analysis is simpler than the traditional approaches because there is no interpolation of a surface, whose normal defines the velocity of the motion. Figure 5.3 shows the effects of multiple frames on signal-to-noise ratio and peak magnitude of the cepstrum result. 5.3 2 Direct Measurement of Trinocular Stereo Disparities There are numerous articles on the use and effectiveness of trinocular stereo measurements and their performance compared to binocular stereo routines. Dhond and Aggrawal[DA91] showed how trinocular stereo provides a higher degree of accuracy in disparity measurements than binocular stereo. Traditionally, trinocular stereo measurement techniques 2 As a reminder, I am still using the formula [COP90]: „„ , .PeakMaqnitude y 20*log ( 10 to define signal-to-noise ratio. . . — AveraqeNoise. T / . VJyoisevariance ) .„ (5.4) Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 109 Figure 5.4: Trinocular disparity measurements using visual echoes. The three peaks in the cepstrum reflect a disparity of 3.5 horizontally between the horizontal cameras, a vertical disparity of 6.0 between the cameras located vertically from each other, and a disparity of (3.5, 6.0) for the two diagonal cameras. use three calibrated cameras located in a right angle triangle arrangement with their optical axes in parallel. This leads to three epipolar constraints and three disparity measurements. Figure 5.4 shows the effects of cepstrum analysis of trinocular triplet. The cameras are located in the order of (0,D), (0,0) and (D, 0). The lower portion of the image shows the result of the cepstrum, where there are three peaks corresponding to the three disparities. Note that the locations of peaks are consistent and the results of the three disparities can be combined to get a more accurate disparity. Should the locations of peaks not be consistent, the multiple disparities are caused by occlusions or motion transparency. It is also noteworthy that if one uses a fourth camera, so that the ensemble is a square grid, every pair of disparities reinforces one another giving a higher signal-tonoise ratio, and better detectability. This approach can be further extended to a grid of cameras, or multiple baseline trinocular stereo. A similar structure can be simulated by a stationary texture ensemble. Figure 5.5 shows the results of cepstral analysis of a Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes (a) 110 (b) Figure 5.5: Simulated camera grid by the use of stationary texture segmentation, simulated stationary texture field. 5.4 Improved Disparity Measures by Multiple Baseline Stereo Kanade, Nakahara and Okutomi [KON92, NK92] have examined increased accuracy in the detection of binocular disparities using the multiple baseline stereo technique. Described briefly, multiple baseline stereo measures the disparities between a window of an original image frame, and subsequent windows in the frames taken by the camera displaced by a constant amount over time. The authors then showed that, using a simple formulation, they can get better disparity measures and higher signal-to-noise ratio by accumulating the properly scaled cross correlation result between the original window and the subsequent images. This work has been followed by Ens and Li[EL93] who used a stationary camera to recognize the shape of objects moving at constant velocity on a conveyor belt. Figures 5.6 and 5.7 show an experiment using multiple baseline stereo. As it can be seen the disparities between the windows overlap, giving rise to a large peak between the Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 111 Figure 5.6: Results of multiple baseline stereo. first and the second window. A n added advantage of using visual echo for analyzing multiple baseline stereo is that it generates multiple evidences. Therefore the peak generated between frames two and three is the overlap of the disparities of all frames separated by two baselines. Figure 5.7: Topographic plot of multiple baseline stereo results using the visual echo analysis approach. The peaks in the figure point to the horizontal disparity of the figure - roughly 1.65 pixels horizontally. The second largest peak corresponds to the overlap of disparity measures between two windows separated at twice the baseline. To show how multiple baseline stereo increases the signal-to-noise ratio I also conducted cepstrum analysis of disparity for only two frames. Figure 5.8 shows the results. Comparing figures 5.8(b) and 5.7 reveals that the peak magnitude multi-frame analysis increases the peak magnitudes by nearly a factor of five. Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes multiBaseline 112 result 60 (a) (b) Figure 5.8: The result of cepstral analysis for stereo disparity measurement. 5.5 Decoupling of Motion and Stereo Disparities in Binocular-Motion So far we have examined the application of cooperative multi-frame analysis to either spatial vision - such as binocular or trinocular stereo - or temporal domain applications - such as multi-frame motion analysis. In this section I will address the application of multi-frame visual echo analysis to the stereo motion domain. Several interesting works in this area have already been presented by Cui, Weng and Cohen [CWC91] who used an iterative least squared error approach, Waxman and Duncan[WD86] who used the rate of change of stereo disparity, and Waxman and Sinha[WS86] who used two image flows from two cameras with slightly different motions. In this work I make a simple and yet fruitful observation: if the motion with respect to the binocular stereo pair is parallel to the image plane or small then the motion disparities between the two stereo cameras are equal. Moreover, the binocular disparities at the two time periods also remain the same. This simple observation has a trivial Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 113 Figure 5.9: Motion-stereo disparity measurement. The two windows on the left are a stereo pair taken prior to the object going through a translation. The cepstral results shows how the motion disparity of (3.5, 6.4) and the stereo disparity of (4.3, 0.0) are decoupled. geometrical proof but two significant side effects. First, visual echo registration decouples the disparities due to motion and stereo. Second, since these disparities remain constant and decoupled, similar type disparities reinforce one another increasing the signal-tonoise ratio. Figure 5.9 shows the use of cepstral analysis for two pairs of stereo images. The two main peaks show that the figure has a stereo disparity of 4.3 pixels horizontally and a motion disparity of 3.5 pixels horizontally and 6.4 pixels vertically. It is noteworthy that in this example I concatenated all the frames together, while another alternative would be to concatenate the stereo data in space and the motion sequence in time, as was done previously. In figure 5.10 I used a calibrated stereo system to find the stereo disparities of a scene. Figure 5.11 shows the portion of the scene (i.e., the tweedy bird!) that has gone through a relative motion. One of the applications of this methodology can be the active pursuit of targets using stereo vergence[PE]. When the object being tracked remains within the Horopter Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 114 (a) Figure 5.10: A stereo motion scene, and the horizontal stereo disparities using visual echo. the stereo disparity remains null, but the motion disparity changes as the object moves. 3 Consequently one can track the object by following the motion disparity. Once the object moves in depth the binocular disparity also changes and the camera vergence angle can then be adjusted appropriately. 5.6 Summary and Conclusion In this chapter I showed how visual echo analysis provides a uniform approach to multiframe analysis and applied this framework to the analysis of the disparity calculations for spatio-temporal motion analysis, trinocular stereo disparities, multiple baseline stereo measurements, and depth and motion calculations using motion-stereo. What sets this approach apart from its predecessors is that the equal disparities constraint, when applicable, is fully utilized to increase the disparity detectability and the accuracy in the 3 This actually depends on the vergence angle between the two cameras Chapter 5. Cooperative Analysis of Multiple Frames by Visual Echoes 115 Figure 5.11: Motion-stereo disparity measurement. The background is stationary anddoes not move while the brightly shaded and darker areas point to the pixel values corresponding to the Tweedy Bird and the piece of cloth both of which have gone through a motion. estimation process. Several domains where this constraint is satisfied include motionstereo, multiple baseline stereo, and constant velocity motion. When this constraint is violated, such as in the case of trinocular stereo or accelerating motion, the visual echo techniques provide all disparities between every two windows analyzed. One of the interesting features of this work is its extendibility into other possible mechanical setups such as trinocular stereo-motion analysis, or calibrated cameras grids. Chapter 6 Multi-Evidential Correlation and CepsCorr Previous chapters—particularly Chapter 4—provided an objective evaluation of cepstrum, including its main advantages and shortcomings. One advantage of cepstrum was its simplicity of implementation and its ability to provide a direct (i.e., one shot) estimate of the disparity. Specifically, unlike mean squared error analysis (SSD) where individual correlation values are gathered over a region and the location of the minimum is chosen as the disparity, the cepstral result provides a direct value for the disparity without incremental calculation of the result. But cepstrum is not unique in this behavior. In fact, there are several techniques that provide direct estimates of disparities. Among these cepstrum, phase correlation and Hadamard based approaches are the most notable. 1 A l l these techniques have several attributes in common, namely: • the signal to noise ratio and the accuracy of the measured estimates increase with reduced disparity. • the uncertainty in disparity measurements due to other factors such as image distortions, blur, or noise can be reduced if multiple measurements of disparities are available. • if the same signal is matched against a modified copy of itself at different disparities, all the measured estimates are correlated. A review of these approaches has been provided previously in Chapters 2, 3 and 4. l 116 Chapter 6. Multi-Evidential Correlation and CepsCorr 117 It is in accordance with these observations and a desire to remove limitations of cepstrum that multi-evidential correlation has been developed. The next section provides a mathematical definition of M E C and provides some intuition behind its use. Section 6.2 discusses some more properties of this technique and the choice of the matching kernel. Section 6.3 addresses the use of this methodology to overcome some of problems and ambiguities associated with cepstrum. Section 6.4 discusses verification of estimated disparities and the as of yet unanswered questions about optimally combining the correlated disparity estimates; this section then provides a preliminary but very robust solution which is both accurate and computationally practical. 6.1 M u l t i - e v i d e n t i a l Correlation: Definition Described briefly, multi-evidential correlation combines the direct disparity estimation of a matching kernel—such as cepstrum, phase correlation or Hadamard transform— with the iterative behavior of correlation to generate multiple number of estimates for disparity. As we saw in Chapter 2, the individual pixel values of the spatial correlation plane are calculated from: = J2'52 i( >y) 2( Riui ( , ) u v I 2 W X x I x - >y ~ ) u (-) v 6 where (w , w ) defines the spatial extent of the matching window and u { < u < x and v min plane. y m < v < v max 1 Wy n u max defines the search area and the size of the resulting correlation 2 Analogously a comparative kernel such as cepstrum, C(Ii(x, y), ^{x — u,y — v)), can take two equally sized image windows as input and directly generate a disparity value The disparity (S , 6 ) is then (u i values of u and v. 2 x y m n + u*, v i m n + v*) such that Ri j (u*, lt 2 v*) is maximum for all Chapter 6. Multi-Evidential Correlation and CepsCorr T>(u,v). Repeating the process for different values of 118 u and v gives rise to multiple estimates of disparity. Therefore a window from one image is matched against a series of windows within a neighborhood in the second image, each comparison producing a disparity value. In short, the pseudo-code for multi-evidential correlation can be summarized as follows: 1. given a window Wi(x,y) from the first image, and 2. a search area, S(x, y), in the second image where one expects the closest match for W\ to reside, 3. clip a region, W (x,y), in S to be compared to Wi(x,y). 2 4. Match the two regions W\ arid W using the non-linear filters cepstrum and/or 2 phase correlation. Locate the maximum peak and save the result as a possible disparity measure. 3 5. repeat steps 3 and 4 - of course with a new W (x, y) - to gather more disparity 2 estimates as needed. 6. combine the measured disparities to get a final estimate for the motion of W\(x, y) in S. It is worth noting again that matching kernels such as cepstrum or phase correlation obviously do not generate a disparity value directly but an image output I (x,y) u>v The disparity, V(u,v), first. is then found by searching for the coordinates of the maximum peak in the output image and subtracting the artificial disparity - i.e., (u, v) - introduced 3 One can also use techniques based on Hadamard transform [LC88]. This approach is fast but not necessarily very robust and fails particularly in the presence of multiple motion. Chapter 6. Multi-Evidential Correlation and CepsCorr 119 due to the sliding window. This may sound like additional computational burden but it is in fact a major asset of this process. Since the output of each iteration is an image, other attributes can be associated with each disparity estimate V(u,v). For instance, signal to noise ratio, the absolute magnitude of the peak, or the ratios of the first and second peak (often referred to as the rejection ratio) can be used as possible confidence measures for each estimate. An obvious question to raise is why can one not use a regular iterative approach such as SSD with multi-evidential correlation? The answer is simple. Using any spatial correlation method such as sum of absolute differences technique as the matching kernel of multi-evidential correlation is simply equivalent to increasing the search area of the matching kernel (and doing a lot of redundant calculations). The minimum will (hopefully) remain at the same absolute location. The significant point to note is that techniques such as cepstrum and phase correlation provide a direct estimate of disparity with each iteration step. Moreover, as the values of u and v change with each iteration, one of the input signals to multi-evidential correlation is constantly changing. Consequently, the estimated disparities may be partially correlated (since at least on of the underlying signals remains the same) but they are uniquely generated and non-identical estimates (i.e., not based on the same inputs). As we will see in Section 6.4 this has important implications in terms for localizing the matching features and combining of evidence. There are however certain properties that are independent of the choice of kernel. For instance as we saw earlier, in the case of Hadamard kernel the accuracy of the disparity increases, while with cepstrum and phase correlation the detectability of the disparity and its signal to noise ratio is also enhanced. Chapter 6. Multi-Evidential Correlation and CepsCorr 6.2 120 Multi-evidential Correlation: General Properties I have only discussed three matching kernels for multi-evidential correlation. But there may be other (perhaps application dependent) registration routines that can be used as well. The properties of a matching kernel have a fundamental effect on the results of multi-evidential correlation as well as how the individual measurements should be treated and eventually combined. But the choice of the kernel is not always an obvious one. For instance, although cepstrum is the most reliable and robust of the three alternatives presented here it is also computationally much more expensive than Hadamard transform. Therefore, for some applications one may decide to use the Hadamard kernel to gain much higher processing speed at the expense of some accuracy and reliability. One of the main advantages of using multi-evidential correlation is that it reduces the effects of window size. As we saw earlier, in area based techniques choosing a large window increases the signal to noise ratio. But it also has the effect of blurring the disparity values. 4 This is particularly a problem when the moving objects are much smaller than the window size. In the case of regular cepstrum or phase correlation it is critical that the window is large enough to cover an adequate portion of the object for proper matching. Therefore, as the size of disparity grows so should the window size. This coupling of window size 5 and disparity is a limiting factor in the use of cepstrum or phase correlation. For small fast moving objects, for instance, the window size necessary to include the object may be This phenomenon is also referred to as bleeding since it tends to defocus the disparity vectors and is most notable across displacement boundaries. One may say that the window size has an absolute lower bound which is the size of the disparity. At or below this size you are guaranteed that the disparity value estimated is meaningless since the content of the two windows are different. In practice, of course, as less of a moving object is included into the two windows the signal to noise ratio reduces and after a certain point (depending on the frequency content of the object and the background) the correct disparity peak disappears before the window size reaches the absolute lower bound. 4 5 Chapter 6. Multi-Evidential Correlation and CepsCorr 121 so large that the object's displacement may go completely unnoticed compared to that of the background. Using multi-evidential correlation partially overcomes this problem. Since the window from the first image is compared against a series of displaced windows from the second image (i.e., the second window is a sliding one) one can reduce the size of comparison window and increase the search space. Modifying the sizes of the windows and the search region has a direct effect on the computational expense of ones operations. Computational expense increases linearly with the size of the search area while the effects of the window size depends on the computational complexity of the matching kernel. Thus for cepstrum, which has a O(nlogn) complexity, reducing the window size reduces the computational time faster than linear.Therefore it seems even more useful to choose a small window size and increase the search area. We still have not completely eliminated the problems associated with windows or the problem of choosing the proper search region. That is, as the window size reduces, the amount of the information being compared is reduced, and the accuracy and the reliability of the estimated displacement diminishes. This is particularly true for cepstrum and phase correlation where diminishing window size causes increase noise due to aliasing in the result. Therefore, reducing the window size at the expense of increasing the search region is not always a viable solution. Moreover, increasing the window size can still have the effects of blurring the disparity especially when the objects in the scene have curved surfaces. Luckily, as we will see in Chapter 8, cepstrum produces multiple peaks for multiple distinct disparities. This allows us to recognize the existence of motion boundaries, or transparent objects. 6 Another significant property, which was briefly mentioned in the introduction, is that Needless to say, choosing the proper search domain, requires an a priori disparities in each direction. 6 estimate of the maximum Chapter 6. Multi-Evidential Correlation and CepsCorr 122 as the relative disparity of the two matching windows diminishes, the signal to noise ratio and reliability in the estimated disparity increases. This is an important property which will be discussed in more detail—and provide experimental results for—in Section 6.4. The key idea which should be emphasized, is that multi-evidential correlation provides multiple estimates of disparity, which can be used for verification of individual measurements, or combined into a final more reliable estimate of disparity. This is particularly important when one is considering subpixel displacement estimations. The remainder of this discussion concentrates primarily on the cepstrum based approach nicknamed cepsCorr. But much of the material presented is also applicable to other matching kernels. For instance, as we will see in a Chapter 8, both cepstrum and phase correlation can be used to detect, verify and calculate multiple motion disparities. 6.3 Cepstrum and Multi-evidential Correlation—cepsCorr In Chapter 4 I reviewed some of the strengths and limitations of cepstrum as a matching technique. The primary problems with cepstrum were not due to lack of accuracy or robustness but rather results of mathematical and computational operations that resulted in ambiguities in the cepstral results. Specifically, the existence of a spurious dominating 7 peak corresponding to the (0,0) disparity and the sign ambiguity of the displacement vector are the two problems that can obscure the effectiveness of cepstrum. In Chapter 4 I examined these problems in adequate detail and provided simple solutions. Such problems and simple intuitions were in fact the reason behind the development of cepsCorr and eventually multi-evidential correlation. The initial objectives were a) to improve the performance of the cepstrum and provide a confidence measure for the displacement estimates b) formulate a solution that overcomes these problems, c) This would be a good place to remind the reader that term cepstrum refers specifically to the power cepstrum technique. 7 Chapter 6. Multi-Evidential Correlation and CepsCorr 123 eliminate or avoid use of arbitrary thresholds and assumptions. In the next two sections I will briefly revisit these problems and show how cepsCorr overcomes these problems. 6.3.1 CepsCorr and the (0,0) Peak: In Chapter 4 we saw how lack of image detail results in a dominating peak which corresponds to a null disparity. This is of course not surprising since most techniques fail when there is not adequate high frequency information that would facilitate the match. This effect can in fact be quite dominating. Figure 6.1 uses the window in the upper left corner and using cepsCorr matches against a sliding window which covers the region in the right image. The corresponding match in the right window has a disparity of two rows down and one column to the right—i.e., a disparity of (2,1). The search region was specified to be (-3,3) rows and (-3,3) columns; that is a 7x7 search region. The lower right hand corner bright pixels show all the window locations where the (0,0) peak was the maximum peak value. Black pixels were supposed to show the points where correct 8 disparity was estimated. Interestingly the (0,0) peak dominates even if the motion is only one pixel in any direction. Therefore, it should not be surprising that the actual disparity is never found. The lower left corner shows the relative (0,0) peak magnitudes. When the window is at the proper disparity the (0,0) peak increases dramatically. It is fair to question the significance of solving this problem before going any further. After all the (0,0) peak appears in a fixed location. The problem is however that an area dominated with low frequency may go through a large motion, and still not be detected. Validation of the (0,0) is also important for such tasks as the figure ground separation and motion transparency. There are basically two simple choices to this problem: a) high pass filter the image This figure is obviously enlarged to fit the lower left hand quadrant. The actual size of this region is obviously the size of the margin—i.e., the dimensions of the search region. 8 Chapter 6. Multi-Evidential Correlation and CepsCorr 124 Figure 6.1: A n illustration of the persistence of the (0,0) peak for a portion of an image that lacks much detail. The window on the upper left hand corner was matched against the region in the upper right using cepsCorr. The lower right shows an enlarged result and the bright pixels show the positions (or disparities) where the (0,0) peak was the dominant peak. As it can be seen this area includes a large region within the search region. The lower left hand corner shows the relative magnitude of the (0,0) peaks, and as one can see at the proper disparity—i.e., the disparity (2,1)—the (0,0) peak increases dramatically. or use image edges to calculate the disparities, or b) use a criterion to simply ignore the (0,0) peak. As it should be clear by now, cepstrum is a deconvolving filter, therefore high pass filtering the image is equivalent to adding the cepstrum of high pass filter to the final result. 9 This in effect has the effect of reducing the peak at (0,0) and the amount subtracted from the peak depends on the parameters of the filter. Not only choosing these parameters are arbitrary the cepstral values near the (0,0) peak may also get adversely affected. A logical criterion to ignore the (0,0) peak is the ratio of its magnitude over the second largest peak. If this ratio is below a certain value one can be reasonably assured that This actually gets a bit more complicated, since not only I am filtering the image but also windowing it. However the end result is basically the same. 9 Chapter 6. Multi-Evidential Correlation and CepsCorr 125 the secondary peak is the correct disparity. There are three problems with this approach however. First, it again requires using an arbitrary threshold which should be dependent on the contents of the windows. Second, this approach may fail for subpixel motions where the magnitude of the correct peak may be distributed between two pixels. Lastly, this method again fails if multiple disparities are present(i.e., image boundary, motion transparency or reflection) and one of the disparities is in fact (0,0). CepsCorr easily solves these problems through several different means. For instance, as the moving window sweeps the search region the disparity peak should also move accordingly. If the (0,0) peak moves along with the window we can be certain that the initial displacement was null. On the other hand, if the second largest peak moves properly and increases with reduced disparity while the peak at (0,0) does not, then we can be confident that the (0,0) peak is a false match. Lastly, if both the (0,0) peak and the secondary peak move, we may be facing a multiple motion (more on this topic in a later chapter). The simplest solution of course, is to iterate over search region and choose the location where the (0,0) peak is maximum. This is reflected in the lower left corner of Figure 6.1 which displays the size of the (0,0) peaks (regardless of whether they are the maximum peak in the cepstral plane or not). Obviously the (0,0) peak that corresponds to the overlap of the two images dominates all other (0,0) peaks and is quite distinguishable (larger by roughly an order two). I will discuss this approach in more detail later on. 6.3.2 Sign Ambiguity We saw in Chapter 4 that one of the ambiguities facing the cepstral echo detection was the sign of displacement field. That is disparities (8 , 8 ) and (—5, X y X —5 ) give rise to y the same peak location on the cepstral plain. This is of course due the fact that the log of the power spectrum is an even function and its cosine transform is also even. Chapter 6. Multi-Evidential Correlation and CepsCorr 126 This is not a real problem when dealing with stereo disparity analysis where the direction of displacement is obvious. But when dealing with motion analysis and tracking the sign of the displacement plays an important role. I also pointed out that one way to overcome this problem is to move one of the images up vertically (and wrapping the image around) and conducting the cepstrum. The cepstral peak should then appear with a vertical disparity near the artificial shift. The location of the peak and the amount of change in the vertical shift then gives us the direction of disparity. The problem is that one again has to assume something about the amount of what constitutes reasonable vertical shift. In the case of verging system where the vertical displacement is minimal this technique is quite adequate [01s93]. This is not a problem for calibrated binocular stereo since the direction of horizontal disparity between the two cameras is known (and the vertical disparity is zero). Even for verging binocular stereo cameras the vertical disparity is small enough that by shifting the camera upward one can produce an artificial vertical disparity and deduce the sign of the disparity vector [TOM94]. But for a more general case of motion analysis such assumptions are impractical. 10 With cepsCorr the problem of guessing the sign of the displacement or having to artificially manipulate the input data is solved. In short, as one window slides over the second image, the cepstral peak should be moving in the opposite direction - showing how the disparity is changed. Unless, of course, the sign of the disparity is opposite of what the cepstral peak reflects. This simple geometric constraint can be used to disambiguate the sign of the disparity in the cepstrum result. It should also be noted that aside from the small computational cost of manipulating data in this form, onw also pays a price in lower signal to noise ratio, and additional inaccuracy at least on a subpixel level. 10 1 127 Chapter 6. Multi-Evidential Correlation and CepsCorr 6.3.3 Other properties of cepsCorr But cepsCorr's greatest advantage is that it combines the flexibility and accuracy of cepstrum with the robustness, informativeness and reliability of multi-evidential correlation. In the next section a very preliminary but fast and accurate solution to the complex problem of combining evidence is provided. The aim of next section is to provide a stepping stone for future developments and improvements to cepsCorr and multi-evidential correlation without pretending that the solution presented is the final, complete and optimal solution to the problem of combining disparity estimates and multi-evidential correlation. 6.4 Verification and Combination of Correlated Estimates Fusion of information, whether from a single source - as in multi-evidential correlation or separate origins - as in sensor fusion - is a complicated and a broad topic of research, and the subject of numerous graduate theses and research projects. As such it would be prohibitive and outside the scope of this dissertation to cover the topic entirely and provide an optimal solution for combining the evidence for M E C . Instead, in this section I will cover alternative approaches, examine their strengths and weaknesses, and finally propose a simple yet adequate solution based on the winner take all scheme. The remaining approaches discussed herein are left to future development and examination. But before examining different alternatives of combining evidence, it is important to visit some of the properties of these measurements, which I have briefly pointed out before: • For cepsCorr and other variations of M E C , one of the matching signals remains unchanged, and as a result the estimated disparities are correlated. Consequently, the simpler traditional statistical techniques based on independent measurements Chapter 6. Multi-Evidential Correlation and CepsCorr 128 and estimates of disparity are theoretically inapplicable to our application. • The uncertainty in disparity measurements due to geometric distortions, blur and noise can be reduced if multiple measurements of disparity are available and are combined properly. But finding the confidence in each measurement itself is a complicated task and subject to error if limiting and simplifying assumptions are made. • The detectability and accuracy of disparity measurements increase as the disparity between the two signals reduces. • It is preferable to use one of the disadvantages and limiting properties our matching kernel - in our case cepstrum - to our advantage in finding the correct disparity. As pointed out, combining evidence is a complicated task that must satisfy many objectives including those mentioned in Chapter 2. This is particularly true when one seeks subpixels accuracy as well as confidence measurements. Below I will briefly present four alternatives, and finally choose the simplest approach based on the winner take all alternative. V o t i n g techniques: One of the first things that comes to mind when having multiple measurements is a voting scheme. For integral disparity measurements, in particular, each measurement can be used as a vote for a particular integer disparity in x and y. The disparities that have the highest votes can then be used to refine the motion estimation, and the number of votes can be used as the confidence measure in the disparity. This, of course, is a simple and crude form of outlier detection and robust statistics [Hub81, RL83] Relaxation labeling by Hummel and Zucker [HZ83] can similarly be used to to refine and combine the measurements. But all these techniques, particularly Chapter 6. Multi-Evidential Correlation and CepsCorr 129 the outlier detection, should be used as a preprocessing method to choose the correct disparity measures. The question of how to combine the remaining correlated measurements is still an open issue. Note also that the voting mechanism can be deceivingly simple. In the case of multiple motion, as we will see later, the cepstrum produces multiple peaks. Consequently, we have to keep track of at least the top two motion estimates. A d d to this the fact that a (0,0) peak may be present if the signal is of low frequency, and consequently we have to keep track of the top three peaks if the (0,0) peak is maximum. In the presence of multiple motion, each of the true disparity peaks may at one point or another become larger than the other. A d d to this ambiguity of sign of the disparity, and one soon finds that the problem was not at simple as it first appeared but requires a rather elaborate book keeping mechanism. Cuplas: A tough choice about combining multiple evidence, even if we assume that they are independent, is what do we use as the confidence measure: the peak magnitude, the rejection ratio, the signal to noise ratio, maximum and minimum curvatures at the peak, or a combination of these. Without adequate analytical analysis or experimentation any choice would be quite arbitrary. Girod and Kuo [GK89] showed that under certain assumptions one can use the shape of the peak of the phase correlation as the probability distribution of disparity. Given the similarity of cepstrum and phase correlation one may adopt such a scheme and use the shape of each disparity peak as the marginal probability distribution function of that particular measurement. Statistically, one can then use Cuplas [DKS91] in order to combine these correlated marginals. But to use Cuplas we have to know something about the marginal probability distribution of the correlated values and how they are correlated with one another. This will requires one to make more arbitrary assumptions which may be quite erroneous. In fact, after eliminating the outliers, one may be better Chapter 6. Multi-Evidential Correlation and CepsCorr 130 off treating the remaining measurements as independent variables. Bootstrapping: It has been shown that many traditional statistical techniques as well as robust statistics such as cross validation, leave one out, and redistribution methods can suffer from large bias and/or large variances in their measurements [JBC87]. Bootstrap method [DE83, Efr79, Efr82] is a computationally intensive method that can be used to determine the statistical property of an estimator and try to combine measurements when little is known about the underlying distribution and relationship of the samples available. There are however problems with the bootstrap method. First, it is very computationally intensive, which makes its use with many disparity measurements methods practically prohibitive. Secondly, many questions about how to use the disparity values, or even whether to use the raw disparity image, and how to include measurements such as signal to noise ratio, and the effects of multiple motion still remain unanswered. 6.4.1 M a x (0,0) peak and winner take all As I have pointed out several times before, the artificial peak corresponding to (0,0) disparity occurs when the signal contains little structure and predominantly consists of low frequency values. I have already proposed a few solutions to this problem, which included band passing the signal, using hierarchical pyramids, and keeping track of motion of the (0,0) peak in cepsCorr. Obviously, any of these solutions should be adequate to overcome the problem, but they all require additional computational and bookkeeping efforts. A good question to ask is: how can one turn this problem to one's advantage and avoid the additional computational cost. The natural answer is simple. Instead of looking for the largest two or three peaks in each iteration of cepsCorr, look for the largest peak at the location corresponding to the (0,0) disparity. When the two signal match exactly (i.e., have a zero disparity) this peak Chapter 6. Multi-Evidential Correlation and CepsCorr 131 will be at its largest. This is true regardless of whether the (0,0) peak is artificially high due to the existence of low frequency signal components and lack of spatial structure. Moreover, there is no reason to worry about the sign ambiguity of cepstral disparity when we adopt this method. Simple mathematical analysis shows that there are other reasons, besides greater efficiency, to pursue this technique - especially as a preliminary solution. Referring back to Equation 3.2 in Chapter 3, the Fourier transform of a one dimensional signal and its echo is: F{h(x)} T{s{x) + s(x - T)} = = S(f)(l + e- ™f) 2 (6.2) Now, if we are matching two signals Si(x) and S2(x) of length iV sample points, and concatenate them for matching, we have: ?{h(x)} = f{ (x) n(f) Of course, both h(x) and H(f) + Sl s (x-N)} 2 = S (f)+S (f)e- f (6.3) 2mN 1 2 have 2iV sample points; and in the Fourier domain the sampling period between any two / becomes ^ 1 1 • Therefore replacing / with k/2N where k is an integer from 0 to 2N — 1 we get: •H(f) = Sx(f) + S (f)e-™ k 2 = +«S (/)cos(A;7r) 2 = S (f) + (-l) S (f) k l 2 if I had used u instead of 2nf in my Fourier transform, the sampling period would become is another version to which a lot of people are accustomed to. 11 (6.4) This Chapter 6. Multi-Evidential Correlation and CepsCorr 132 Now if s±(x) and S2(x) are exact replica, then: « ( / ) = 5 ( / ) ( l + (-l)*) (6.5) 1 So far so good! But the next operation in cepstrum involves taking a logarithm, and for odd values of k the Fourier transform is zero. Consequently, when k becomes an odd integer (i.e., every other sample point) %(/) = H(k/2n) becomes log(O) which is negative infinity. This of course causes a numerical instability, and is easily overcome by replacing this value by a large negative value. What is important, however, is that we have a periodic function with a large amplitude and periodicity of two sampling points in the frequency domain. 12 Figure 6.2 shows this effect for a two dimensional signal. The top half of the image shows the two exact windows that have been concatenated and are being matched. The lower part of the image show the logarithm of the power spectrum of this signal. Since there is only a horizontal disparity of exactly N pixels where N is the horizontal window width of each signal, the result of the log of the power spectrum are horizontal stripes with the black pixels representing very large negative values. What is significant is that the next cosine transform or power spectrum operation which results in cepstrum - will produce a very large peak at pixel position (0,N), which reflects zero disparity between the two signals. To show this quantitatively, I generated a one-dimensional random signal displayed in Figure 6.3. This signal was then used in cepsCorr against itself with the search area being from -16 to 16. 13 Figure 6.4(a) shows the magnitude of cepstral peaks. The zero disparity peak is so much larger that the remaining peaks practically disappear after scaling. To show the effects of disparity on the peak, Figure 6.4(b) shows the logarithm of the peak magnitudes. As it can be seen the magnitude of the peaks do in general reduce as the There is also a negative constant added to the function, which results in a reduction in the DC component of the next Fourier transform, and it does not effect the peak at the echo arrival period. A sequence of 64 pixels from the middle portion of this signal was used in the comparison. Consequently, the missing data due to artificial disparity was random values. 12 1 3 Chapter 6. Multi-Evidential Correlation and CepsCorr JL ... 133 M ; Figure 6.2: Two concatenated 2D signals that match perfectly and the logarithm of their power spectrum on the lower half. The black stripes represent very large negative values. disparity increases. The sharp increase in the peak at 0 disparity is however impressive and roughly of two orders of magnitude. 14 To show that the peak magnitudes do correspond to the correct disparities, the relevant portion of the cepstral filter for each cepsCorr iteration were extracted and normalized between zero and one. The cepstral results were then merged vertically into a two-dimensional signal with the first row displaying the result of cepstrum at disparity -16 and the last row the result of cepstrum for disparity -16. Figure 6.5 shows that all the cepstral peaks reflect the correct disparity, with the disparity reducing from -16 to 0, and then increasing to +16 as cepsCorr iterations progress. The increase in the peak magnitude is important, but for detectability one has to also check the increase or decrease in the amount of background noise. Figure 6.6 shows the change in signal to noise ratio (S/N) in cepstrum as a function of disparity. As we can see the signal to noise ratio is almost constant at about 17 db for all disparities between -16 and +16, but increases almost five fold when the disparity between the two signals 14 For two-dimensional signals this ratio generally increases usually to four orders of magnitude. Chapter 6. Multi-Evidential Correlation and CepsCorr 134 The random signal .00 300 .0 400 .0 500 .0 600 .0 700 .0 800 .0 Figure 6.3: The random signal used for comparing cepsCorr magnitude and signal to noise ratio as the disparity increases. becomes zero. In short, even though this approach does not technically combine all the evidence, because of large reduction in memory requirement and computational cost, greater detectability, high accuracy and simplicity of the process, I have used the maximum magnitude of the (0,0) peak as a preliminary solution of utilizing cepsCorr for disparity measurement. Figure 6.7 shows the application of cepsCorr and the maximum (0,0) peak criteria. I will discuss the performance of our preliminary approach and issues related to its stability in the next chapter. 6.5 Summary / Conclusion In this chapter I briefly introduced Multi-evidential Correlation and the additional robustness, flexibility, and accuracy that it provides. I then concentrated on cepsCorr M E C with cepstrum as its matching kernel - and showed how some of the limitations Chapter 6. Multi-Evidential Correlation and CepsCorr peakValues Y* 103 31 .0 56 .0 54 .0 32 .0 50 .0 41 .0 46 .0 44 .0 42 .0 6500 .0 6000 .0 5500 .0 5000 .0 4000 .0 3500 .0 3000 .0 —2500 .0 2000 .0 1500 .0 1000 .0 — 500 .0 -15.00 -10.00 -5.00 00 .0 50 .0 100 .0 150 .0 (a) Log 10 of Peak Magnitude Y 7500 .0 135 v 31 .0 ^ 36 .0 34 .0 -15.00 -10.00 -5.00 00 .0 50 .0 100 .0 150 .0 (b) Figure 6.4: (a) Magnitude of cepstral peaks, and (b) the base 10 logarithm of the peak magnitudes. and ambiguities in cepstrum can be overcome through the use of M E C . Lastly, we introduced, a simple, fast and efficient winner take all approach to utilize the disparity estimates (i.e., multiple evidence) provided by cepsCorr. This method not only overcomes, both the problems with sign ambiguity of cepstral disparities, but it also solves the sometimes troublesome incorrect (0,0) peaks and turns this limitation of cepstrum into an advantage. Figure 6.5: The cepstral results generated during the iterations of cepsCorr. Each row of the image shows the cepstrum result normalized between zero and one; the first row corresponds to disparity 16 and the last row to disparity +16. 136 Chapter 6. Multi-Evidential Correlation and CepsCorr Signal to Noise Ratio y 1* 00 -15.00 -10.00 -5.00 00 .0 50 .0 100 .0 150 .0 X Figure 6.6: Signal to noise ratio for cepstrum as a function of disparity. Figure 6.7: Horizontal and vertical disparity fields for a natural scene using ceps Corr Chapter 7 Results of Multi-evidential C o r r e l a t i o n This section presents the results of applications of M E C on several natural images. The two comparison kernels used with M E C are cepstrum and phase correlation. To determine the disparities, both methods use the maximum (0,0) peak heuristic, presented in the last chapter. Moreover, for the sake of brevity and ease of discussion, I have nicknamed these processes cepsCorrOO and phaseCCorrOO respectively. The suffix 00 in the above 1 names is to be a constant reminder of the heuristic being used. Other matching routines used for comparative purposes are Sum of Square Differences (SSD) or mean squared error, and Sum of Absolute Differences (SAD) or mean absolute difference, and Horn & Schunk's differential technique were also provided. 2 The images chosen reflect different types of motion, such as rotation, divergence, or uncalibrated stereo, as well as different types of image properties such as cluttered natural scene vs. completely featureless artificial environment, dark noisy images taken with a small inexpensive camera, or changes in brightness by a moving light source. As it is traditionally the case, the disparities are measured in integer values. More accurate subpixel analysis is another significant topic that will be addressed in future work. In fact, subpixel accuracy is particularly interesting because it should be one of the most beneficial and intrinsically useful applications of multi-evidential correlation. Having a pronounced Phase Si kor zero zero. All the techniques mentioned with the exception of Horn & Schunk are region-based search techniques. Any kind of pre or post processing operations such as multiplying the comparison windows with Gaussians, or using regularization to smooth the imageflow,can be applied equally to all these region-based methods. 1 2 137 Chapter 7. Results of Multi-evidential Correlation 138 number of subpixel estimates of displacement should provide a more accurate estimate of the motion as well as measures of confidence in the combined result. The format of the results are similar to the ones in the last chapter. The two images being compared are shown on the top upper half. The area inside the two images that is being compared is delimited by a black rectangle. The remaining narrow margins near the boundary of the two images are the extra pixels necessary for search space and the window width of the pixels near the boundary of compared regions. The lower left quadrant shows the vertical disparities between two images for each pixel (i.e., drow) and the lower right quadrant shows the vertical disparities between two images for each pixel (dcolumn). Darker regions show the greater disparity, and the disparity results are scaled based on the size of the search regions (mentioned in the caption of each image). As we will see, this is a more informative format for the disparity display. The normal arrow type displays which shows the disparity only at selected points on a grid of pixels on one of the images is sometimes better looking but inadequate. This is primarily because they do not show the anomalies that may occur near the object boundaries or the effects of window size. Lastly, as a reminder, both cepsCorrOO and phaseCCorrOO analysis use the location of maximum (0,0) peak for disparity measurement. As pointed out this is a simple, fast, and relatively robust estimate, but as one would expect the results should improve if all the disparity estimates were used together (this is particularly true for subpixel motion analysis.). Nevertheless, even with this simple heuristic cepsCorrOO proves to be more accurate than all the other techniques considered. These analyses also shows possible instabilities in cepstrum and phase correlation which were not considered before. An analysis of possible instabilities are provided and simple solutions to overcoming them with cepsCorrOO are provided. The next sections will present the results and the parameters used for motion analysis. Chapter 7. Results of Multi-evidential Correlation 139 The window sizes chosen for region based techniques were usually 8x8 and 16x16, with the search region varying depending on the size of the motion. For the differential motion technique, the main two parameters to choose are a used in the smoothing and the number of iterations. Experience with the images shows that for the majority of images chosen in these experiments there is no substantial change in the motion estimations past 20 iterations. The parameter a does have strong effects as expected. These will all be shown via examples. Section 7.7 talks about the computational burden and the speed of each methodology. 7.1 Results and C o m p a r a t i v e Analysis 7.1.1 T h e H a m b u r g h T a x i Sequence This is a popular sequence of images of a common traffic scene. I present this set of images first because comparatively speaking cepsCorrOO did not perform as well as usual on this set of images. In this sequence, there are objects moving in different directions and leaving or entering the scene. The images being compared are not of high quality especially since they lack good contrast. Additionally there are plenty of featureless objects and regions including the road. The results (which may not seem very impressive) for different techniques are presented. Almost all techniques distinguish four moving objects: a small object on the upper left corner (a pedestrian) two cars at the two edges of the image moving toward the center of the image, and a white taxi in the middle of the image making a right hand turn. The presence of all these objects has been verified visually. Looking at the results between different window sizes for individual techniques, as one should expect, the larger window sizes produce less noisy and more uniform disparity values (which is primarily the effects of displacement blurring as well a greater Chapter 7. Results of Multi-evidential Correlation 140 computational stability). It is also worth remembering that, much like edge detection, region based approaches do tend to introduce ambiguities in motion boundaries, often making the moving objects look larger. Comparing these techniques for different window sizes, one sees remarkable similarity in answers, particularly between SSD and SAD. Looking at the results for 16x16 window sizes, the cepsCorrOO and SSD results seem very much the same. There are two things that are worth pointing out however. First, cepsCorrOO tends to produce smaller objects (we will also see this in later experiments); this is particularly notable in the column disparity (i.e., the lower right quadrants) of the moving object in the upper right corner and the two cars to the left and right of the image boundary. Second, cepsCorrOO produces some sporadic values near the boundary of taxi and the other two cars. This is because cepstrum recognizes two disparities, and therefore it generates two peaks, one of which when at 0,0 can become dominant. I will address the use of this property to detect multiple motions in the next chapter. Comparing the results for the 8x8 window sizes gives one more interesting insights. First, not only are the SSD and S A D results are very similar (which is expected), the majority of noisy values seem to cluster together. In the case of cepsCorrOO however, the noise seems to break up and become more dispersed giving it a more speckle-like effect. There are two explanations for this. First, for SSD and S A D calculations of disparity, each pixel shares a lot of data in common with its neighboring pixels. In fact some of the values calculated for the correlation plane are the same multiplications and additions (an observation that is used for faster more efficient implementations of SSD and SAD). Consequently, one expects that if there are noisy disparities, they would have quite a bit in common with their neighboring values and therefore one finds clusters of them. In the case of cepstrum on the other hand, a non-linear operation and frequency domain analysis were used, consequently the same constraints do not apply. But more importantly, the Chapter 7. Results of Multi-evidential Correlation 141 Figure 7.1: Results of the region based approaches for 16x16 window size, The search regions were vertically from -3 to 2 and horizontally from -3 to 7 pixels. Chapter 7. Results of Multi-evidential Correlation 142 speckled appearance of the noise in cepsCorrOO may be due to instability in the results. As we will see in Section 7.1.1 this is the case. It should also be pointed out that the areas of inaccuracy are all dark and featureless, such as portions of the road or the side of the building near the upper right corner of the image. In these areas, not only does the image lack spatial structure needed for matching, but there is hardly any brightness or adequate signal power to reliably match the two images. But before providing a complete analysis of possible instabilities in cepstrum, let us consider the two other techniques phaseCCorrOO and Horn Sz Schunk. As it can be seen phaseCCorrOO did not fare very well on this image. This is mostly due to susceptibility of phase to noise, and as we will see in the next section, the fact that phase correlation can also suffer from numerical instability. I will provide different solutions for this problem. As for Horn and Schunk, one has to note that because there was no window size and search region the two image region are larger in both direction by a small margin. One problem with this approach is choice of the parameter a . Three cases, of a = 0.0, a = 2.0, and a = 4.0 are shown to reflect the high sensitivity of the technique to the choice of this parameter. For a = 0.0, there are really no motions visible, even after 20 iterations or more. For a — 4.0 the two cars on the left and right are missing, a = 2.0 furnishes the best answer, especially around 20 iterations. The result for 100 iterations does not show much change, and basically spreads disparities (i.e., by weighted averaging) as expected . 3 The disparities are more localized and they are subpixel values as expected. In fact, the results do not look bad (i.e., the three moving cars are visible - the object on the Compare the horizontal disparity on the left for the white taxi for 20 and 100 iterations. The disparities are more blurred out. 3 143 Chapter 7. Results of Multi-evidential Correlation - phaseCCorrOO SAD Figure 7.2: The same matching operations as in the last but with 8x8 window size. Chapter 7. Results of Multi-evidential Correlation 144 upper right corner of the previous image also creates a small blob for a = 2.0), but they are not really accurate because even though each car has basically uniform motion it is the motion of boundaries and edges that make the cars stand out - this is exactly where the motion is incorrect (due to sharp changes in intensity). Looking at the white taxi, for instance, the motion of the roof or the body is practically like the background i.e., (0,0), which is quite incorrect. As for the other two cars on the image boundary, they fared even more poorly. Many pixels on these two cars received a motion disparity of nearly zero (i.e., similar to the background) and the shape of the two cars does not really resemble rectangular moving objects. In fact, for a = 4.0 these two cars completely disappeared. In addition, there are indications of motion at the corners with high intensity change (e.g., the white parked cars and the van, as well as the striped store awning on the upper right corner of the image). Instability As we saw, unlike SSD or S A D which show inaccurate measurements in clusters, the noise in cepstrum seems more sporadic and speckle-like. Therefore, one may suspect that these specular noise are due to some inherent numerical instabilities that may arise in cepstrum. Further investigations show that this is in fact the case. To show this, I chose an area, on the back window of the taxi, that showed high amount of noise in the 8x8 window size disparity field calculation of cepsCorrOO. The results of each cepstrum iteration of cepsCorrOO were then calculated and normalized and all the results were then concatenated (to correspond to their location in the search region). Figure 7.4(a) shows the result of this operation. Since the search region was -3 to 2 pixels vertically and -3 to 7 vertically, 66 cepstrum images where collected. It 4 The cepstrum on the upper left corner was replaced with the image used in the first iteration of cepsCorrOO. So there are 65 cepstrum images present. 4 Chapter 7. Results of Multi-evidential Correlation 145 Figure 7.3: Results of Horn & Schunk for different values of a and number of iteration. The disparity range for a = 2.0 and 20 iterations (or higher) was within the proper motion region. For a — 0.0, on the other hand, the vertical disparity after 100 iterations ranged from -41.772167 27.433767. Chapter 7. Results of Multi-evidential Correlation 146 is not necessary to distinguish the boundary between the cepstral results. But one can 5 easily note the cepstrum with vertical stripes to left of the result. This cepstrum plane is obviously caused by a numerical instability and corresponds to the search coordinate point -2 vertically and -3 horizontally. More importantly, Figure 7.4(b) shows the same cepsCorrOO figure except that the cepstrum fields were not normalized before they were concatenated. The final concatenated result was then normalized in order to show the relative magnitude of cepstral peaks. As we can see the unstable cepstrum dominates all other cepstral values. The reason behind this is simple. Taking the log of the power spectrum of the image and its echo, we have: l o g { | 5 ( / ) | * ( l + cos(27rr/))} (7.1) where as usual, | 5 ( / ) | is the power spectrum of the image, and r is disparity. Now if at any point on the frequency plane either \S(f)\ or (1 + cos(27rr/)) becomes zero or very very small, the logarithm will result in a very large negative value, which can act as a negative 5 function. This is followed by another power spectrum (or cosine transform in our case). Therefore, the large negative peak will result in a sinusoid, similar to the one in Figure 7.4(b) , with a very large amplitude that dominates all other cepstral results. 6 Of course, it is not often that log \S(f) \ * (1 + COS(2TTT/)) will have a large negative value; additionally, not all instabilities have large value at (0,0). Moreover, there are several easy ways to avoid these instabilities. For instance, one can filter the logarithm result, add one to the value of power spectrum before taking the logarithm, or make sure that the signal to noise ratio of the cepstral result generating the largest (0,0) peak is In fact, one could put a boundary of uniform value between the cepsturms, but this may result in some changes to the results of Figure 7.4(b). Of course, the shape of the sinusoid - i.e., its periodicity in x and y - depends on the location of negative 6 peak in the log domain. 5 6 Chapter 7. Results of Multi-evidential Correlation 147 (a) (b) Figure 7.4: (a) Cepstral results generated by cepsCorrOO iterations, individually normalized and then concatenated. The cepstrum in the upper left corner was replaced with the input image to the first iteration to show typical images being compared, (b) The concatenated cepstral results, without being separately normalized. This figure shows how cepstrum's instability may dominate the cepsCorrOO result based on (0,0) peak magnitude. Chapter 7. Results of Multi-evidential Correlation 148 above a certain threshold. But these approaches require more calculations, and can introduce additional noise or inaccuracies. Additionally, we may require extra memory for keeping the intermediate cepstrum results. Given the fact that these instabilities do not occur often, and trying to keep the computational time to a manageable level, a simple median filtering of the results was used to reduce the computational burden, and memory requirement. This of course is not an ideal option and the results are probably not as impressive, but the process is very fast and robust. Figure 7.5 shows the result of median filtering on the results presented in Figure 7.2. As we can see the 3x3 median filtering did not really change the results of SSD or S A D but did improve the cepsCorrOO results. In comparing the results of cepsCorrOO, SSD, and S A D , one point is worth noting. Instabilities seldom arise within cepstrum; their occurrence is rather sporadic and primarily in the areas of very low signal energy and contrast - i.e., \S(f)\ —> 0. But since cepstrum 8 is generally very good even in the areas of low energy [CSK77], the unstable results have a speckle nature which increases the randomness (or entropy) in the resulting disparity figures. This is unlike SSD or S A D where the errors are clustered and have a more uniform distribution. Additionally, cepstrum's instability may occur any place within the search region and therefore the result can often be very different from the disparity of the neighboring pixels (see the speckle white and black pixel noise in Figure 7.2. This also increases the randomness of the values in the result. As a result, even though the errors in cepstrum are not very common, they may seem deceptively more distinguished and noisy than in SSD or S A D where the regions of error are more uniform. Moreover, for small window sizes, such as 8x8, the aliasing effect is very pronounced; and in the regions of low contrast or signal content the effects of instability seem more T o avoid the setting arbitrary thresholds, one can require that signal to noise ratio for the (0,0) minimum of all cepstral results. 7 8 Also see Ludwig's discussion of effects of straight lines and edges in cepstrum [LNN94]. be 149 Chapter 7. Results of Multi-evidential Correlation - phaseCCorrOO - - SAD - Figure 7.5: Results the 8x8 window sized registrations after 3x3 median filtering. Chapter 7. Results of Multi-evidential Correlation 150 pronounced. Padding with zero, or windowing the cepstrums are also options that may be used, but at the expense of additional computational expense. A closer look at the mathematics of phase correlation shows how one may also encounter instabilities. For each phase correlation calculation in phaseCCorrOO if either or |52(/)| is nearly zero for a particular frequency value then there will also be a huge peak in the inverse Fourier transform of \Si(f)\^S2(f)\ ^ o r * t particular frequency. n& This in turn will result in a sinusoid with a large magnitude, similar to what was experienced in cepstrum. Given that phaseCCorrOO provides multiple number of displacement estimates as well, one can always distinguish the instabilities (from their low signal to noise ratio for instance) and consider one of the other measurements instead. 7.2 Effects of R o t a t i o n In the previous chapters I gave references to the works on the effects of rotation and scale on cepstrum. Figure 7.6 shows the effects of rotation on the four different techniques discussed before. For a purely rotational motion drow and dcol are simply a linear function of the distance of the pixel from the center of rotation. In these images, we can clearly see horizontal and vertical stripes present for dcol and drow respectively. As it can be seen in all images, the area of the image being rotated is the table with the plant located above it. The stationary portion is a thin stripe on the left hand side of the image. A l l techniques performed well, but for cepsCorrOO blotches of error in the upper middle portion of the image are missing. For 8x8 images we see that again cepsCorrOO suffers at few isolated points from instabilities (see Figure 7.7) which can be quickly and efficiently removed using simple median filtering (Figure 7.8). After median filtering, even phaseCCorrOO performed better than SSD. Chapter 7. Results of Multi-evidential Correlation 151 - cepsCorrOO - - SSD - - phaseCCorrOO - - SAD - Figure 7.6: Results of rotation for 16x16 window size. Chapter 7. Results of Multi-evidential Correlation Figure 7.7: Results of rotation for 8x8 window size. 152 - phaseCCorrOO - - SAD - Figure 7.8: Results of rotation for 8x8 window size after median filtering. Chapter 7. Results of Multi-evidential Correlation 154 Figure 7.9 shows the effects of rotation and Horn and Schunk method. This is particularly a hard image for any differential technique to analyze because it is very textured; and even though the boundary is well preserved, the differential technique did not perform very well. Figure 7.9: Horn Sz Schunk method for rotation, a = 2.0 and number of iterations was 50. 7.3 Effects of Expansion/Convergent Flow Next I investigated the effects of looming. In these figures the camera was moved toward a statue of Apollo. Again, in a divergent or a convergent flow the disparities are a linear function of the distance of the image point from the focus of expansion (FOE). As we can see for the 16x16 window size all techniques performed nicely, but both SSD and S A D had a problem in the vicinity of the ear of the statue. Moreover there is a bit of abnormality to the left of the image, primarily due to lack of image structure. Chapter 7. Results of Multi-evidential Correlation 155 This problem becomes more pronounced in the 8x8 window sizes. Median filtering the flows provides us with very nice flow for both cepsCorrOO and phaseCCorrOO, with little or no anomalies. Chapter 7. Results of Multi-evidential Correlation - cepsCorrOO - - phaseCCorrOO - 156 SSD - - SAD - Figure 7.10: Results of expansion for 16x16 window size. Chapter 7. Results of Multi-evidential Correlation Figure 7.11: Results of expansion for 8x8 window size. 157 Chapter 7. Results of Multi-evidential Correlation 158 Chapter 7. Results of Multi-evidential Correlation 159 Finally one should examine the results of Horn and Schunk's algorithm for diverging field. Figure 7.13 shows the result for two different a values. = 4.0 number of iterations is 20 - - a — 16.0 number of iterations is 20 - Figure 7.13: Results of Horn and Schunk for looming effect. 7.4 Effects of Illumination Change Implicit assumptions can be and are often more dangerous than explicitly declared ones. Many application domains in computational vision or image processing are conducted under differing lighting conditions. This could be due to the existence of shadows, the irradiance change due to the motion of the object, light source or the viewing angle or simply an inherently dynamic nature of the problem domain as in the case of satellite imaging. The figures below shows pictures of the same object taken at different lighting 9 9 D a t a courtesy of Dr. Ardeshir Bagheri and Professor Robert Woodham. Chapter 7. Results of Multi-evidential Correlation 160 conditions. The lights are on the same horizontal plane as the camera and are located so that they make angles of 60 and 30 degrees with the camera's optical axis respectively. The search region is from -7 to 7 pixels in both horizontal and vertical direction. This search space was chosen to correspond to the results of the Horn & Schunk technique. As usual the SSD and S A D figures are similar. 10 Obviously all pixel values should be 0 for disparity. The black background gives the disparity -7 for both vertical and horizontal disparity values. This is of course sensible since all the disparity signals are the same, and the first minimum or maximum signal is chosen as the disparity. As usual the incorrect disparity values for SSD and S A D are clustered together, while for cepsCorrOO they are more sporadic. For SSD the area of the statue corresponding to the boy's hair near the upper left corner has a band of incorrect of measurement for -2 pixel vertically and when the vertical disparity is correct the horizontal one is not. In the case of cepsCorrOO the disparity mistakes in terms of value are more dramatic, but both the horizontal and the vertical disparities are incorrect, and very isolated in nature (i.e., one pixel at a time) showing the effects of instabilities. Note also the occluding boundary of the statue analyzed by cepsCorrOO. The occluding boundary shows up as a 4 pixels wide region with disparity of +7 pixels horizontally or vertically. This results from the fact that at an occluding boundary cepsCorrOO signals the existence of multiple disparities, a topic I will address in depth in the next chapter. Needless to say I am persistently displaying this, hoping for a change against all theoretical and practical odds! 10 Chapter 7. Results of Multi-evidential Correlation 161 Figure 7.14: Results of change in the direction of illumination from 60 to 30 degrees, search area (-7,7) in both vertical and horizontal direction and 16x16 window size. Chapter 7. Results of Multi-evidential Correlation 162 - SSD - cepsCorrOO - n Jlll phaseCCorrOO • ' . » „; J - J' SAD Figure 7.15: Results of illumination change for 8x8 window size. 163 Chapter 7. Results of Multi-evidential Correlation 3fc cepsCorrOO SSD V." phaseCCorrOO - SAD Figure 7.16: Results of illumination change for 8x8 window size after median filtering. Chapter 7. Results of Multi-evidential Correlation 164 Figure 7.17 shows the effects of illumination change on Horn k. Schunk's differential method. This is in violation one of the primary assumptions of the technique, therefore it should be expected that the range of disparities in rows and columns will be large, as shown below: Maximum/minimum v a l u e f o r d e l row: -6.333128 5.884905 Maximum/minimum v a l u e f o r d e l c o l : - 5 . 1 9 5 5 3 3 5.436991 Figure 7.17: Results of illumination change and Horn & Schunk method for a = 20 and 20 iterations. These are the values used to set the search space parameters for the region based methods. The fact that the surfaces are basically Lambertian and are not textured helped this methodology even more by eliminating highlights and sudden disparity changes. Chapter 7. Results of Multi-evidential Correlation 7.5 165 Tiny Camera It is not always possible to acquire clean (i.e., not noisy) data. Sometimes technological constraints (e.g., sending images over telephone wires, or low bit quantized images for higher compression) or simply the price of camera may force one to settle for noisy, dark images. 11 The images below are real images taken with a small inexpensive C C D camera. The images are very dark and noisy. The objects/people in the scene are stationary but the camera goes through a forward/looming and sideway motion. The original images used for matching are displayed in Figure 7.18, but in the results they are renormalized in order to show their structure better. As expected, SSD does a much better job than any other technique since it is optimal in random uncorrelated Gaussian noise with mean of zero. In fact, as we will see in the next image, SSD and SAD perform better than cepsCorrOO but also note that the performance of these techniques drops rather dramatically with the size of window used. I like to refer to this as the 7/11 syndrome! Chapter 7. Results of Multi-evidential Correlation Figure 7.18: Original images used in the analysis. 166 167 Chapter 7. Results of Multi-evidential Correlation l| SSD - cepsCorrOO - •J. 1 -•> If Sfef * iL^- " K JIM J? ' r 1 - phaseCCorrOO - ' » ^ - SAD - Figure 7.19: Effects of image quality on matching using 16x16 windows for matching. 168 Chapter 7. Results of Multi-evidential Correlation - cepsCorrOO - - phaseCCorrOO - - SAD - Figure 7.20: Results of tinyCamera change for 8x8 window size. Chapter 7. Results of Multi-evidential Correlation 169 The figure below shows the result of Horn & Schunk algorithm with the actual images (i.e., not normalized). Figure 7.21: Horn & Schunk method a = 20 after 20 iterations. These results are so poor that I will not bother to do median filter the results. Instead, one should note the effects of oriented structures - such as the boundary between the ceiling and the back wall - on all the region based methods. 7.6 Binocular/Outdoor Scene The next images (i.e., SRI trees) 12 are particularly interesting, because not only they represent typical outdoor scene, but they are also uncalibrated binocular images of a complex and cluttered environment. The images are also very clean - i.e., There is little if any noise, and the range of brightness in the images is also nicely exploited. SSD and SAD give similar disparity results which are rather blurred (typical of region 1 2 Courtesy of Stanford Research Institute International. Chapter 7. Results of Multi-evidential Correlation 170 based approaches) and not very crisp. CepsCorrOO and phaseCCorrOO provide more distinct results in the sense the images are more crisp, and the disparity of the objects are not very blurred, the objects are segmented better and their boundaries are better refined. There are no blotches of mistaken disparities but simple speckles. Note that all the results improve with median filtering particularly the phaseCCorrOO results. For this particular image, looking at the row disparities for median filtered images can reflect the accuracies of the disparities (since these images are practically stereo pairs). As one can see the errors for S A D and SSD seem a bit more pronounced. 171 Chapter 7. Results of Multi-evidential Correlation - cepsCorrOO - - SSD - - phaseCCorrOO - - SAD - Figure 7.22: Results of binocular disparity measurements. Search for (-1,1) in rows and (-1,9) pixels in column-wise. 16x16 window size. Chapter 7. Results of Multi-evidential Correlation 172 Figure 7.23: Results of binocular disparity measurements. Search for (-1,1) in rows and (-1,9) pixels in column-wise. 8x8 window size. Chapter 7. Results of Multi-evidential Correlation 173 Figure 7.24: Results of binocular disparity measurements after median filtering for 8x8 comparison window size. Chapter 7. Results of Multi-evidential Correlation 174 Below we see the effects of Horn & Schunck algorithm with a = 2.0 after 20 iterations. Figure 7.25: Horn & Schunk method a = 2.0 after 20 iterations. 7.7 C o m p u t a t i o n a l C o m p l e x i t y of M o t i o n Analysis A l g o r i t h m s I pointed out earlier that M E C ' s computational cost is linearly proportional to the search area and the entire process is parallelizable. The primary operation in cepstrum and phase correlation are the calculation of Fourier transform which is 0(nlogn). Both cepstrum and phase correlation require two Fourier transforms. But on a per pixel calculation, phase correlation requires 4 multiplications and two additions for the numerator, and another two multiplications (minimum), one addition and a square root for the calculation of denominator. Therefore, there are basically six multiplications, one square root, and a final division per pixel that are the main computational burden of the process, besides the Fourier transform. Chapter 7. Results of Multi-evidential Correlation 175 CepsCos, on the other hand, requires two multiplication, one addition and a logarithm. Therefore, for the same size windows, without any concatenation power cepstrum is a clearly a faster process. With concatenation, we will have four multiplications, two additions and two logarithms per pixel of a matching window, since the number of pixels has doubled. Fourier transform now becomes 2 + more expensive (practically twice). These programs were run on a standard Sparc 2 and Sparc 10 and even with window concatenations used in cepstral analysis, the cepstrum algorithm was performed faster than phase correlation. The complexity of SSD and S A D are well established, with S A D being a faster process. Horn & Schunk's algorithm was the fastest of all programs mainly because there is no search or comparison (to find the maximum or minimum) is involved. 13 7.8 Summary/Conclusion This chapter shows the use of cepsCorrOO and phaseCCorrOO for matching and registration of several images. The images chosen address some of the more common image transformations, such as rotation, expansion, and translation parallel to the focal plane, plus changes in lighting condition and image quality. Even using only the maximum (0,0) peak magnitude - a rather limited application of M E C - cepsCorrOO performed rather nicely. As was expected and pointed out by Yeshurun and Schwartz [YS89] and Ludwig, Neumann and Neumann [LNN94] for illumination changes, rotation and expansion, cepsCorrOO inherits cepstrum's robustness and performs better than the other techniques considered. Even on today's fast reduced instruction sets machines, comparison is one the more expensive operations. It should also be noted that when using regular cepstrum or phase correlation (as opposed to cepsCorrOO or phaseCCorrOO) the search region for cepstrum is half of that of phase correlation, even with concatenation. Since I are using the maximum (0,0) peak values, this does not come into play in these techniques. 13 Chapter 7. Results of Multi-evidential Correlation 176 The chapter discussed in detail possible instabilities on cepstrum and phase correlation and how they can be easily overcome by the use of MEC. These instabilities sometime occur at the occluding boundaries, a significant area of research (especially for object segmentation) and a topic that will be discussed - along with multiple motion - in more detail in the next chapter. Chapter 7. Results of Multi-evidential Correlation Chapter 8 Detection and E s t i m a t i o n of M u l t i p l e Disparities by M u l t i - e v i d e n t i a l Correlation 8.1 Introduction In region-based image registration methods the implicit assumption of a single uniform and constant disparity within the matching window is often violated. As pointed out graphically by Bergen, Burt, Hingorani and Peleg [BBHP90], many phenomena may result in multiple displacements in a single image neighborhood. Some of these include multiple disparities due to reflection or specularities, relative motion of background scenes and transparent surfaces (such as clouds, smoke and stained glass), and multiple motions or stereo disparities at occluding boundaries. The goal of this chapter is to address both detection and estimation of multiple disparities in motion and stereo registration. No a priori assumption regarding the existence or absence of multiple displacements is necessary. As a result, the problems of detection and measurements of multiple motions are decoupled. But before providing the details of this work a brief review is in order. One of the first attempts to find multiple disparities was by Fenema and Thompson [FW79] who used a histogram of the correlation results to localize occlusions. Since then many researchers have provided more robust and interesting approaches based on a variety of displacement estimation techniques. Burt, Hingorani and Kolczynski [BHK91], use a Laplacian pyramid and the iterative 178 Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl79 selective stabilization routine to lock into and cancel a dominant velocity when this velocity is detectable within a frequency band provided by the Laplacian's band-pass structure. Bergen, Burt, Hingorani and Peleg [BBHP92] also used an iterative approach to determine two constant velocities present in three frames. Peleg and Irani [IP92] then built on these works and showed an application of multiple motion estimation to track objects through image frames and to improve their appearance and resolution over time. Besides signal reconstruction and enhancement, multiple disparities are needed in object segmentation and figure ground separation[Gri93]. Other application domains include recognition of motion transparency, and signal separation and recovery from reflection. Peleg and Rom [PR90] used an iterative approach for motion segmentation based on constraint equations of brightness change for images where depth of the scene was already known or remained constant. Campani and Verri [CV92] also used the differential approach for optical flow estimation to calculate multiple motion disparities in image sequences. Little and Gillet [LG90] used a normalized correlation approach, but introduces two mechanisms to independently determine the occluding boundaries in stereo images. Darrel and Pentland [DP91] used robust statistics and temporal integration to find distinct "layers" of motion. Jepson and Black [JB93b] also used robust statistics and the optical flow gradient constraint equation to localize occlusions. Jones and Malik [JM92] proposed an elegant approach based on the distance of vectorized response measures to orthogonal linear filters to determine occluded regions on stereo scenes. Finally Chen, Shirai and Asada [CSA93] also used linear spatial filtering as well as the motion constraint equation to find occluded regions. In some works the presence of multiple disparities is assumed a priori. This is often not the case in real applications, where multiple disparities may or may not be present at all, Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl80 and when several displacements do occur, they are often limited to certain locations in the image. Therefore, there is a need to first detect the presence of multiple displacements. Moreover, it would be nice to verify these disparities by having multiple pieces of evidence that point to their proper value. This chapter describes multi-evidential correlation ( M E C ) and its use to detect and verify the presence of multiple displacements (without any a priori knowledge of their existence) and to estimate differing disparities due to reflection, transparency and occlusion. The number of disparities present is not limited to two since the matching kernels used - namely phase correlation and cepstrum filtering - provide direct estimates of multiple displacements. Furthermore, in the case of constant velocities, I will show how multi-frame analysis can improve the detection and estimation process by allowing constant disparities among frames to reinforce one another. The organization of this discussion is from general properties to specific applications. Examples based on random dot stereograms are used to provide a quantitative comparison between cepstrum and phase correlation first. Section 8.3 presents results of multiple disparity estimation due to transparent motion and discusses different algorithms to segment a stereo image and to detect occluded regions. Finally I will revisit the use of a multi-frame analysis technique, called multiCeps, to find constant multiple disparities due to reflection in a sequence of three images. 8.2 1 M u l t i - e v i d e n t i a l C o r r e l a t i o n , C e p s t r u m and Phase C o r r e l a t i o n A n important property of cepstrum and phase correlation is that they both produce multiple peaks in the presence of multiple motions. Figure 8.1 shows a manufactured image to display the effects of multiple motion on phase correlation and cepstrum. The 1 This chapter, like some of the others, is based on a previous paper and technical report [BL93c]. Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlational "/Onp/esf.gdata" 60000 50000 40000 30000 - (a) (c) (b) Figure 8.1: Illustration of multiple peaks due to multiple motions, (a) & (b) the images with two synthetic motions, (c) the result of analyzing the images with cepstrum. box in the image was moved by 3 pixels vertically and 6 pixels horizontally, while the main image was moved 10 pixels vertically and 5 pixels horizontally. These motions created the corresponding peaks in the cepstrum result, a relevant portion of which is shown topographically, with the two peaks estimating the correct disparity. This property, in conjunction with multi-evidential correlation, provides a unique approach to detection and recognition of multiple disparities when they are present. To elaborate, when two or more disparities appear in an image, they generate two or more peaks in cepstrum and phase correlation. Since M E C sweeps a window from one frame over a region in the other, all the comparisons result in two consistently located maxima corresponding to the two displacements. The relative magnitude of these peaks will, of course, increase and decrease as the relative overlapping areas of the matching image parts decrease or increase. But the peaks themselves persist over all or some part of the iteration. In this manner one can first estimate their proper values. detect the presence of multiple disparities and then Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl82 The following subsections describe phase correlation and a variation of power cepstrum for estimation of motion and binocular disparity. I defer the mathematical treatment of their behavior in the presence of multiple motion and instead provide examples based on random dot stereograms. 8.2.1 Phase C o r r e l a t i o n and M u l t i - e v i d e n t i a l C o r r e l a t i o n It can also be shown mathematically that if multiple disparities are present in the two windows, phase correlation results in multiple peaks corresponding to the individual disparities plus a small residual noise. To show this effect, and how multiple peaks persist over individual iterations of M E C , a random dot stereogram with two disparity levels of 2 and 5 were generated. I then chose an area near the occluding boundary. Using phase correlation as the matching kernel I conducted multi-evidential correlation. 16x16 windows were chosen and for the correlation span (i.e., the area swept over during the 2 comparisons) I selected 0 to 10 columns horizontally and zero columns vertically. Since the epipolar constraint was preserved, and no vertical displacement was introduced, the peaks in each phase correlation outcome appear only in the first row. Only the first row of each comparison in the multi-evidential correlation were then kept, and they were then concatenated together vertically. Figure 8.2 shows the result with each row representing the outcome of one phase correlation match. The first row of the figure shows the result (i.e., the first row) of the left most M E C and each subsequent row shows the result of next M E C iteration as one moves to the right. As can be seen, each row contains two peaks corresponding to the two disparities. As one moves to the right these disparities get smaller by one pixel, as expected, and reach their respective maximum magnitudes at 0 disparities (i.e., when they reach the left Eight pixels by eight pixels patches or even smaller window sizes would also be adequate, but I chose a larger size for display purposes. 2 Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl83 (a) (b) Figure 8.2: Results of M E C for stereo, (a) Each row of the figure is the first row (i.e., zero vertical disparity) of the result of one iteration of M E C with phase correlation as its matching kernel. The location of the peaks are indicative of the disparity, with negative disparities located close to the right most column, (b) Topographic image of Figure (a) displaying the magnitude of the phase correlation peaks hand column). If one continues moving to the right the peaks reappear in the right hand column, indicating a negative disparity, and their magnitude again starts to decrease with each iteration. What is important to note is the consistent presence of the two peaks and how their relative location moves as one iterates over an area. If such consistency persists during multi-evidential correlation then it is easy to infer that multiple disparities due to occlusion, reflection, or transparency have occurred. As mentioned briefly, two additional conditions can also be used as steps for verification of multiple motion. First, as the disparity of the peaks diminish, their magnitude should increase; and second, the distance between the two peaks is the width of the occluded region (or disparity differences) in the image, and should remain constant. Even though the example presented here corresponds to binocular disparity (primarily for proper illustration), the same approach can be extended to the detection and Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl84 estimation of two or more motion disparities. 8.2.2 cepsCorr: C e p s t r u m and M u l t i - e v i d e n t i a l C o r r e l a t i o n I next used cepstrum as the correlation kernel in our multi-evidential routine - which I often refer to as cepsCorr for brevity - and repeated the experiment described in the previous section. As with the phase correlation the peaks of the cepstrum appear only in the first row of each iteration, and hence they can be concatenated vertically. Since I use the same windowing routine described in [YS89, BL93b], and since the cepstrums (or cepsCos) of real signals are symmetric and even functions, for binocular disparity the first row of cepstrum will contain two symmetric peaks around the center column for each of the disparities present. Figure 8.3 shows the cepsCorr results for the experiment described above. Note that, as with phase correlation, as the disparity between two identical parts in the two images reduces, the peak magnitude of their disparity increases, and the peak moves close to the middle column (i.e., zero disparity). Comparing the two topographic maps in Figures 8.2 and 8.3 for the matching kernels indicates that the disparity peaks generated by the cepstrum filter are larger in magnitude than those generated by phase correlation by roughly a factor of three to one. Moreover, the experiments indicate that cepstrum results have higher signal to noise ratios than those of phase correlation. Other researchers [LMK89] have also shown that cepstrum performs better than phase correlation in the presence of noise. Lastly, Figures 8.4 (a) and (b) are the results of the same operations as above on the boundary of the real image depicted in Figure 8.8. The estimated peaks in phase correlation, Figure 8.4 (a), even after taking into account spreading due to the curvature of the Pepsi can were quite inaccurate, while the results of cepstrum correlation, Figure 8.4 (b), were much better suited for further analysis. Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlation!^ (a) (b) Figure 8.3: (a) The collection of binocular disparity signals generated by cepsCorr. Each row of the figure represents the result of one iteration of the multi-evidential correlation with cepstrum as its matching kernel. Each disparity is represented by two peaks which are symmetrically displaced away from the center column by the amount of the disparity, (b) Topographic image of cepsCorr in (a) in order to display the magnitude of the cepstrum peaks. 8.2.3 Benefits of Non-linear Matching Kernels I have already reviewed one of the primary advantages of multi-evidential correlation i.e., providing multiple evidences of the existence, or lack of presence, of multiple moving patterns (as well as estimates for each motion value). But before getting into more specific results and examples it would be interesting to compare the behavior of nonlinear kernels used with the more traditional region based schemes specifically sum of squared differences (SSD). I chose SSD because it is one of the more popular matching methodologies and its behavior is indicative of other similar techniques such as sum of absolute differences (SAD) or regular correlation. To do so I created two randomly generated I D signals with rather diverse disparities 1 and 13. I then used cepstrum, phase correlation and SSD to detect the presence of two disparities and to measure their proper values. Figure 8.5 shows the results of the Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlation! 86 (a) (b) Figure 8.4: (a) Topographic representation of multi-evidential correlation results for Pepsi scene with phase correlation as the matching kernel, (b) Result of cepsCorr for same image area in the Pepsi scene. Note that cepsCorr generates a strong peak in center corresponding to one of the disparities while for phase correlation the results very weak and lack a distinguishing structure. comparison. 3 the the the are As we can see, both cepstrum and phase correlation have their first and second peak at 1 and 13. The second minimum in SSD, on the other hand, does not correspond to the correct value of the second displacement. This is more than likely due the fact that the two motion displacements were quite diverse in magnitude. Even though this result can not be generalized to all displacements and all signals, it points to the strength of non-linear kernels in finding multiple displacements compared to most linear techniques. Additionally, it has been shown that both cepstrum and phase correlation are very robust in the case of illumination change [Has74, KH75]. Illumination change can be particularly a problem for motion transparency or reflection where the direction of illumination or viewing angle can change the brightness of a motion layer compared to the next. Lastly, without providing multiple evidences for the presence of several motions, using 3 Only one half of the cepstrum result is shown, since cepstrum is symmetric. Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl87 (a) (b) (c) Figure 8.5: The effects of simulated multiple disparities on: (a) sum of squared differences (mean squared error) (b) phase correlation, and (c) cepstrum. The second peaks in phase correlation and cepstrum correspond to the correct second disparity while the second minimum in SSD does not. Using M E C the other peaks in cepstrum and phase correlation will vanish and all but the main two peaks are rejected. SSD or similar techniques will require choice of arbitrary thresholds and parameters to decide when two or more minimum values correspond to the existence of two or more moving objects. M E C , on the other hand, requires no such measures and all but persistent peaks are rejected. 8.3 Results In this section M E C is applied to the detection of multiple disparities. The three examples tackled involve motion transparency, detection and localization of occluding boundaries in binocular stereo, and multi-frame analysis of constant dual disparities due to reflection. 8.3.1 Motion Transparency Determination of the motions of two objects where one object is transparent and in front of the other is often referred to as the motion transparency solution. One of the applications of this work is matching of satellite images where thin cloud cover or smoke Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl88 (a) (b) (c) Figure 8.6: (a) The first frame of a sequence of motion transparency images. The flower pattern is apparently due to the stained glass, and the tripod is the object moving in the background, (b) & (c) The magnitude of M E C at (0,0) disparities for cepstrum (b), and phase correlation (c). These two figures analogous to plotting the middle column of Figure 3(a) and the left most column of Figure 2(a) respectively. can overshadow the primary matching disparity estimates. Irani and Peleg [IP92] also showed how they used results of motion transparency to track objects behind stained glass and improved the clarity and resolution of the object images over time. Figure 8.6(a) shows an image of a sequence where a tripod is moving behind a window with a picture of flower. In the first two frames of this sequence the flower stays stationary and the tripod moves by 4 pixels horizontally (again there is no vertical motion present). I used phase correlation and cepstrum as the matching kernels for this approach and then checked for maximum peaks at zero disparities - i.e., when either the tripods or the flowers in the two images overlapped. I also chose rather large windows of 128x128 pixels to ensure that both disparities are included in my analysis. The span of this routine was from -2 to 7 pixels horizontally. The results of the (0,0) peaks are shown in Figure 8.6 (b) and (c) (these correspond to the trace of the middle column in Figure 3(a) and the left most column in Figure 2(a) respectively). The main two peaks appear at the proper Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential CorrelationlS9 Figure 8.7: Detection vs. localization of occluded regions for random dot stereogram. The area between a pair of white and black edges corresponds to the locations where multiple disparities were detected. locations and correspond to the offsets where either the flower or the tripod overlap. It is important to note that even though the disparity in this case was only horizontal this is not at all a requirement, and that vertical as well as horizontal disparities can be found quite easily. 8.3.2 Occluding B o u n d a r y Recognition of occluded regions in motion or stereo plays a significant role in segmentation of three dimensional objects. Such segmentations are important for tracking, recognition, or object manipulation in robotics and automation. As it has been demonstrated, in the region near an occluded boundary, both phase correlation and cepstrum generate two peaks corresponding to the two disparities present. The distance between the two peaks is the width of the occluded region. But while detection and estimation of disparities is easy, localization of the occluded regions is a more cumbersome task. The primary problem is the duality in detection and localization, and how they relate to the window size. Large window sizes are generally better in detection and estimation, but produce greater uncertainty in the location of the occluded Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl90 (c) Figure 8.8: (a) & (b) A pair of stereo images of a curved object, boundary of the object using M E C . (c) the occluding regions. Figure 8.7 shows the uncertainty in locating the occluded regions in a pair of random dot stereograms for two squares. This problem is aggravated, and the consistency of multiple peaks in M E C is jeopardized, if the window shape does not match the occluding boundary, or if one or both objects in view contain curved or slanted surface structures; this is primarily due to the fact that a curved surface will have multiple disparities within a window which result in a blurring of the disparity peak. The natural answer to this problem is to use a smaller window size. This approach was examined by Okutomi and Kanade [OK92] who used windows with locally adaptive extents. This approach improves the localization of the occluded regions significantly, but the extent of the matching window, however small, still generates an uncertainty in the location of the occluded area. In many ways, especially the trade of between detection and localization, is analogous to the traditional edge detection problem in computer vision. It should also be pointed out here that relative sizes of image patches that give rise to multiple disparities are not the only factors affecting the relative peak magnitudes. Our studies show that other factors such as the relative frequency distribution content of Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlational the image windows also play a significant role. Moreover, often the objects creating an occluding boundary are similar in nature, making the detection and localization of such boundaries even more difficult. This is why random dot stereograms are not a good representative problem. A n engineering solution that will work with real scenes is to first find edges in the image and then determine if the edge is an occluding boundary or a surface 4 feature based on the existence or lack of strong multiple peaks along the edge [GP87]. The difference between the two disparities indicates the width of the occluded area. (a) (b) Figure 8.9: (a) One of the three images with two constant disparities caused by reflection. The three frames where produced by looking at the Escher drawing with the reflection of the person in the glass, and moving the camera a constant distance each time and acquiring two more frames, (b) The magnitude of the (0,0) peaks caused by the overlap of individual multiple disparities. Note that the graph shows two maximums due to two disparities. Instead, different direct methodologies for the detection of occluding boundaries were examined , including voting schemes, and thresholding of relative peak magnitudes. The final approach to solving the occlusion problem is similar to a technique used by Fua [Fua91] to verify correct disparity calculations. That is, using M E C I selected the 4 Obviously with an edge finder that has good localization characteristics such as [RHKvdH92]. Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl92 largest of the two dominant peaks and centered a stationary window in the second image at that location. I then tried to find the disparity for this window in the first image using M E C . It is easy to show geometrically that the verification procedure fails inside an occluded region. In fact, in the occluded regions the secondary peak in the first pass becomes the dominant peak in the second. Figure 8.8 shows a pair of stereo images and the occluded regions found by the above technique. The thickness of the occluded regions can also be determined from the distance between the two disparities. Where the image lacks features for matching (such the lower part of the Pepsi can) this procedure also shows lack of consistency in results. One way to reduce the noise in this result is to verify the existence of multiple peaks for all occluded pixel candidates which unfortunately was not conducted for this experiment. 8.3.3 M u l t i C e p s and M u l t i p l e M o t i o n D u e to Reflection In [BBHP92] Bergen, Burt, Hingorani and Peleg used an iterative technique to determine two constant motion disparities from three frames. As we can see in Figure 8.9, the magnitudes of the (0,0) disparity peaks in cepsCorr also indicate the presence of at least two motions in the three frames. Figure 8.10: A portion of the cepstrum result for multi-frame analysis. In this section multi-frame analysis technique called multiCeps [BL93b] was used to Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correlationl93 estimate the constant disparities among the three frames. To do this I concatenated the three frames and conducted cepstral analysis. It is easy to show mathematically that 5 in multiCeps the constant disparities between the first and the second frame and the disparity between the second and third frame reinforce one another [BL94]. The same results will hold for the disparity between the first and third frame. Figure 8.10 shows a small portion of the multiCeps result. (a) (b) Figure 8.11: The results of analyzing Figure 8(a) for multiple echoes. The difference images between frame 1 and 2 after disparity compensation for the two estimates. Note how that the reflection of the person in the glass is now easily detected as compared to the original image. Close examination of these results shows three peaks corresponding to the three horizontal disparities at -3.4, -0.6 and 6.7 using linear interpolation between the peaks. The first two peaks correspond to the multiple disparity measures (of the Escher print and the reflection) between the first frame and the second frame, reinforcing similar disparities between frames two and three. The last disparity corresponds to one of the two Actually I used a 341 by 256 sub-image of the three figures. 5 Chapter 8. Detection and Estimation of Multiple Disparities by Multi-evidential Correiationl94 disparities between frame one and frame three. The vertical disparities between these images are also zero. Figure 8.11 shows the outlines of the two images generated by simple subtraction of the first two frames after proper disparity realignment. The procedure above shows how multi-frame analysis can be utilized in estimating of constant multiple disparity measurements. 8.4 Conclusion In this section I discussed a direct method for detection and estimation of multiple disparities due to occlusion, motion transparency, and reflection using multi-evidential correlation. No a priori assumptions about the existence or the number of disparities were made. The two matching kernels used, cepstrum and phase correlation, both generate multiple peaks in the presence of multiple motions. When these peaks persist between different iterations of multi-evidential correlation, and they are consistent in the measurement of disparities, the existence of multiple displacements is assured. Also addressed were the use of multi-frame analysis and cepstrum to estimate multiple constant disparities over time. Both cepstrum and phase correlation performed well, but cepstrum had higher signal to noise ratio, and an overall better performance than phase correlation. A significant fact that should be pointed out here is that motion transparency and reflection often cause blurring in the image. Both cepstrum and phase correlation are robust to blurring, and in fact cepstrum is often used as a deblurring mechanism. Chapter 9 C o n c l u d i n g R e m a r k s / O v e r v i e w and Future Directions The objective of this dissertation can be separated into two related topics. The first is to develop a broadly applicable framework in order to address variety of computational vision tasks through a common methodology. This goal is partially motivated by a number of papers on the neurophysiology and psychophysics of biological visual systems, and partially by discussions on the significant directions in computer vision [Pri86, Ric88]. The result was visual echo analysis, a framework which was developed through reformulation of various topics in computational vision as detection and estimation of echo arrival periods in time and space. Using the strengths and versatility of cepstral filtering as the echo detection mechanism, visual echo analysis was then successful applied to: • measurement of visual motion, stereopsis and stationary texture segmentation (topics discussed under visual flow analysis in [Ric88]), • trinocular stereo analysis, • multi-frame and 3D spatio-temporal motion analysis - for constant or variable velocity, • detection and estimation of multiple disparities (e.g., motion transparency, reflection, or occluding boundaries), • multiple baseline stereo, • boundary symmetry analysis, 195 Chapter 9. Concluding Remarks/Overview and Future Directions 196 • scale and rotation invariant binary image registration, • and matching and registration of complex (i.e., real and imaginary) signals - as in the case of Magnetic Resonance Images. It was the development of this framework that led to the extensive analysis of phase correlation and cepstral filtering. These studies resulted in the development of cepsCos and diffCepsSin which provide both computational and performance improvements to the traditional power and differential cepstrum respectively. The second objective of this dissertation involved a dominant area of computational vision and signal processing: i.e., matching and registration. Many subfields of computer vision, from low level motion or stereo analysis to higher level recognition or structure from motion, rely on accurate registration of features and data. Having studied the traditional techniques - their strengths, shortcomings and underlying assumptions - and drawing on the properties of cepstrum, phase correlation and Hadamard based techniques, I introduced Multi-Evidential Correlation (MEC). The strengths of this approach are three-fold: 1. This approach provides multiple and thus verifiable estimates of individual disparities. Determination of confidence measures for disparities and mechanisms to verify displacements is a significant and desirable attribute which is not always attainable in computational vision. 2. It provides a strong alternative to the traditional linear filtering methods as well as additional flexibility in the choice of its matching kernel. M E C can utilize the nonlinear filtering techniques of phase correlation and cepstral filtering, or the linear and extremely fast Hadamard based technique as its matching routine. Chapter 9. Concluding Remarks/Overview and Future Directions 197 3. In its present form, it was formulated to avoid any (or at least as many) assumptions, whether analytic (e.g., smoothness of ...) or numeric (e.g., thresholds); this of course was a self imposed constraint and objective. Chapter 7 provides an extensive comparison of M E C , using cepstrum and phase correlation, with the more common motion analysis techniques. Even the most primitive form of M E C - i.e., cepsCorrOO - produced better results than the traditional methods, especially in the presence of rotation, looming and changes in illumination. 9.1 Future Directions The only list longer than the list of things I have done is the list of things I would like to do. - anon. There are numerous topics and subjects that one can further develop based on the new material presented here. Among the most important topics is the combining of evidences for M E C . To this end, I have already generated preliminary results for combining different M E C evidences, based on a rather involved voting mechanism for cepsCorr. The result of these experiments are very promising and shed light on other aspects of cepstrum and M E C . Using Hadamard based matching kernels, or improving the phaseCCorr are two other topics that can be pursued in the future. As for visual echo analysis, it clearly provides a very fertile ground for future research and development. Each topic discussed under this framework - such as motion-stereo analysis, or 3D spatio-temporal analysis - can be extended and further studied. Moreover, there are a wealth of other topics that may be approached by the same frameworks; for instance, 3D or 2D motion trinocular, or motion-stereo analysis for vergence control. Added to these is the combination of numerous of other methods such as hierarchical Chapter 9. Concluding Remarks/Overview and Future Directions 198 matching schemes or variety of pre, middle, and post processing operations that may be combined with cepstrum to improve its performance. It is the author's sincere hope that this work is both an original contribution to the field and a stepping stone for others researchers toward future developments. E.B.B. Bibliography [AA91] B . Armestrong and N . Ahmed. A method of cepstral enhancement for detection and parameter estimation of echoes. Asilomar Conference on Circuits, Systems and Computers, pages 160-164, 1991. [AA92] B . Armestrong and N . Ahmed. A method of cepstral enhancement for detection and parameter estimation of echoes. IEEE Transaction on Signal Processing, pages 160-164, 1992. [AB85] E . H . Adelson and J. R. Bergen. Spatiotemporal energy models for the perception of motion. Journal of Optical Society of America, 2(2):284299, 1985. [AB86] E . H . Adelson and J . R. Bergen. The extraction of spatiotemporal energy in human and machine vision. In Proceedings of IEEE Motion Worskshop, pages 151-155, 1986. [Ada86] J. Adams. Conceptual Blockbusting. Addison-Wesley, Cambridge, Mass, 1986. [AN88] J. K . Aggarwal and N . Nandhakumar. On the computation of motion from sequences of images - a review. Proceedings IEEE, 76(8):917-35, Aug 1988. [Ana85] P. Anandan. A review of motion and stereopsis research. Technical Report COINS-85-52, University of Massachusetts at Amherst, December 1985. [Ana87] P. Anandan. Measuring image motion from image sequences. Technical Report COINS-87-21, University of Massachusetts at Amherst, 1987. [Ana89] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:283-310, 1989. [Ana94] D . Anastassiou. Digital television. Proceedings IEEE, 82(4):510-519, April 1994. [AR84] J. Altmann and H. J. P. Reitbock. A fast correlation method for scale- and translation-invariant pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:46-57, 1984. 199 Bibliography 200 [AW84] P. Anandan and R. Weiss. Computing dense displacement fields with confidence measures in scenes containing occlusion. In SPIE Intelligent Robots and Computer Vision Conference, volume 521, pages 184-194, 1984. [AW85] P. Anandan and R. Weiss. Introduction of smoothness constraint in a matching approach for the computation of optical flow fields. In IEEE Workshop on Computer Vision, pages 186-194, Michigan, 1985. [BA83] P. Burt and E . Adelson. The Laplacian pyramid and the compact image code. IEEE Transactions on Communication, 31:532-540, 1983. [BAHH92] J. Bergen, P. Anandan, K . Hanna, and R. Hingorani. Hierarchical modelbased motion estimation. In Proceedings of European Conference in Computer Vision, pages 237-252, 1992. [Bak82] H . Baker. Stereo vision systems. In Proceedings of International ence on Systems, Man and Cybernetics, pages 322-326, 1982. [BAK91] R. Battiti, E. Amaldi, and C. Koch. Computing optical flow across multiple scales: an adaptive coarse-to-fine strategy. International Journal of Computer Vision, 6(2):133-45, June 1991. [Bar84] J. Barron. A survey of approaches for determining optic flow, environmental layout and egomotion. Technical Report RBCV-TR-84-5, University of Toronto, November 1984. [Bar85] Y . Barniv. Dynamic programming solution for detecting dim moving targets. IEEE Transactions on Aerospace and Electronics Systems, A E S 21(1):144-156, Jan 1985. [BB82] C. Brown and D. Ballard. Computer Vision. Perintice Hall, Englewood Cliffs, N.J., 1982. [BBHP90] J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. Computing two motions from three frames. In Proceedings of IEEE International Conference on Computer Vision, pages 27-32, Osaka, Japan, Dec 1990. [BBHP92] J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. A three-frame algorithm for estimating two-component image motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(9):886-896, Sept. 1992. [BF82] S. Barnard and M . Fischler. Computational stereo. Computational Surveys, 14(4):553-572, 1982. Confer- Bibliography 201 [BFB94] J.L. Barron, D . J . Fleet, and S.S. Beauchemin. Performance of optical flow techniques. 12(l):43-79, February 1994. [BFBB92] J.L. Barron, D . J . Fleet, S.S. Beauchemin, and T . A . Burkitt. Performance of optical flow techniques. In Proceedings of Conference on Computer Vision and Pattern Recognition, pages 236-242, 1992. [BHK91] P.J. Burt, R. Hingorani, and R . J . Kolczynski. Mechanism of isolating component patterns in the sequential analysis of multiple motion. In Proceedings of IEEE Workshop On Visual Motion, pages 187-193, Princeton, New Jersey, Oct. 1991. [BHT62] B.P. Bogert, M.J.R. Healy, and J.W. Tukey. The quefrency analysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. Proceedings of Symposium on Time Series Analysis, 1962. [BL83] P. Bolon and J. L . Lacoume. Speed measurement by cross correlation theoretical aspects and applications in the paper industry. IEEE Transactions on Acoustics, Speech and Signal Processing, 31:1374-1378, 1983. [BL92a] E. Bandari and J. Little. Cepstral analysis of optical flow. Report T R 92-6, University of British Columbia, 1992. [BL92b] E. Bandari and J. Little. Spatial-Quefrency approach to optical echo analysis. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 850-852, Urbana-Champaign, Illinois, 1992. [BL92c] E. Bandari and J. Little. Visual echo analysis. Unpublished report, 1992. [BL93a] E. Bandari and J. Little. Cepstral methods in computational vision. In Nonlinear Image Processing IV; Proceedings SPIE, pages 266-277, San Jose, C A , February 1993. [BL93b] E. Bandari and J. Little. Visual echo analysis. In Proceedings of Fourth International Conference in Computer Vision, pages 220-225, Berlin, Germany, May 1993. [BL93c] E. Bandari and J.J. Little. Detection and estimation of multiple disparities by multi-evidential correlation. Technical Report T R 93-38, University of British Columbia, 1993. [BL93d] E. Bandari and J.J. Little. Multi-evidential correlation and visual echo analysis. Technical Report T R 93-1, University of British Columbia, 1993. Technical Bibliography 202 [BL94] E . Bandari and J. Little. Cooperative analysis of multiple frames by visual echoes. In IEEE International Conference on Image Processing, pages 111:766-770, Austin, Texas, 1994. [B066] B.P. Bogert and J.F. Ossanna. The heuristics of cepstrum analysis of a stationary complex echoed gaussian signal in stationary gaussian noise. IEEE Transactions on Information Theory, IT-12(3):373-380, July 1966. [Bra74] O. Braddick. A short range process in apparent motion. Spatial Vision, 14:519-527, 1974. [Bro92] L . Brown. A survey of image registration techniques. ACM Computing Surveys, 24(4):326-376, December 1992. [BS72] D . Barnea and H . Silverman. A class of algorithms for fast digital registration. IEEE Transactions on Computers, C(21):179-186, 1972. [BT80] S. Barnard and W . Thompson. Disparity analysis of images. IEEE Transactions in Pattern Analysis and Machine Intelligence, 2(4):333-340, July 1980. [BW85] J. Bednar and T. Watt. Calculating the complex cepstrum without phase unwrapping or integration. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(4): 1014-1017, August 1985. [BXL94] E . Bandari, Q. Xing, and J. Little. Visual echo registration of magnetic resonance images. In AAAI Spring Symposium on Applications of Computer Vision in Medical Image Processing, pages 38-41, Stanford, California, 1994. [BYX82] P. Burt, C. Yen, and X . X u . Local correlation measures for motion analysis: a comparative study. In IEEE Proceedings of Pattern Recognition and Image Processing, pages 269-274, 1982. [BYX83] P. Burt, C. Yen, and X . X u . Multi-resolution flow through motion analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 246-252, 1983. [CB92] D . Coombs and C. Brown. Real-time smooth pursuit tracking for a moving binocular robot. In Proceedings of Computer Vision and Pattern Recognition, pages 23-28, 1992. [CCM84] B . Chanda, B . B . Chaudhuri, and D. Dutta Majumder. Application of least square estimation technique for image restoration using signal-noise Bibliography 203 correlation constraint. IEEE Transactions on Systems Man and Cybernetics, 14:515-519, 1984. [CK83] N . Cornelius and T. Kanade. Adapting optical flow to measure object motion in reflectance and x-ray image sequences. In Proceedings of ACM Siggraph/Sigart Interdisciplinary Workshop on Motion, pages 50-58, Toronto, Ontario, 1983. [Cla90] R . J . Clark. Basic principles of motion estimation and compensation. In IEE Colloquium on Applications of Motion Compensation, pages 1/1-1/7, October 1990. [CM87] E . De Castro and C. Morandi. Registration of translated and rotated images using finite fourier transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):700-703, Sept. 1987. [CM89] P. Cavanagh and G . Mather. Motion: the long and short of it. Spatial Vision, 4(2-3):103-129, 1989. [COP90] P. Concetti, G . Orlandi, and F Piazza. A n enhanced image registration algorithm by burg filter. In Proceedings of International Conference in Acoustics, Speech and Signal Porcessing (ICASSP), volume 4, pages 20536. I E E E , 1990. [Cro83] M . Crombie. Coordination of stereo image registration and pixel classification. Photogrammetric Engineering Remote Sensing, 49(4):529-532, April 1983. [CSA93] H . J. Chen, Y . Shirai, and M . Asada. Obtaining optical flow with multiorientation filters. In Proceedings of International Conference in Computer Vision and Pattern Recognition, pages 736-738, New York, New York, 1993. I E E E . [CSK77] D . G . Childers, D.P. Skinner, and R . C . Kemerait. The cepstrum: a guide to processing. Proceedings of IEEE, 65(10):1428-1443, October 1977. [CV90] M . Campani and A . Verri. Optical flow from an overconstrained system of linear algebraic equations. In IEEE International Conference on Computer Vision, pages 22-26, Osaka, Japan, December 1990. [CV92] M . Campani and A . Verri. Motion analysis from first order properties of optical flow. CVGIP. Image Understanding, 56(1):90-107, July 1992. Bibliography 204 [CWC91] N. Cui, J . Weng, and P. Cohen. Motion and structure from long stereo sequences. In Proceedings of IEEE Workshop on Visual Motion, pages 75-80, 1991. [DA89] U . R. Dhond and J . K . Aggarwal. Structure from stereo - a review. IEEE Transactions on Systems, Man and Cybernetics, 19(6):1489-1510, Dec 1989. [DA91] U.R. Dhond and J . K . Aggarwal. A cost-benefit analysis of a third camera for stereo correspondence. International Journal of Computer Vision., 6(l):39-59, April 1991. [DE83] P. Diaconis and B . Efron. Computer intensive techniques in statistics. Scientific American, pages 116-127, 1983. [DKS91] G . Dall'Aglio, S. Kotz, and G . Salinetti. Advances in probability distributions with given marginals. Kluwer Academic Publishers, 1991. [DP91] T. Darrell and A . Pentland. Robust estimation of a multi-layered motion representation. In Proceedings of the IEEE Workshop on Visual Motion, pages 173-8, Princeton, N J , 1991. [Dud77] D.E. Dudgeon. The computation of two-dimensional cepstra. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-25(6):476-484, December 1977. [ea93] S. Rao et al. A real-time P * 6 4 / M P E G video encoder chip. In Proceedings of International Solid-State Circuits Conference - ISSCC 93, pages 32-33, San Francisco, C A , February 1993. [Efr79] B . Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 7:1-26, 1979. [Efr82] B . Efron. The jacknife, the bootstrap, and other resampling plans. In CBMS-NSF Regional Conference in Applied Mathematics, number S I A M 38, 1982. [EL93] J . Ens and Z. L i . Realtime motion stereo. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 130-135, New York, New York, 1993. [FC89] Chang-Wu Fu and Shyang Chang. A motion estimation algorithm under time-varying illumination. Pattern Recognition Letters, 10(3): 195-199, Sept 1989. Bibliography 205 [FJ90a] D. J. Fleet and A . D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5(1):77-104, 1990. [FJ90b] D . J . Fleet and A . D . Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5:77-104, 1990. [FL94] P. Fua and Y . Leclerc. Registration without matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 121-128, Seattle, Wash, June 1994. [Fle92] D. Fleet. Measurement of image velocity. Kluwer, 1992. [Fra91] W . Franzen. Structure and motion from uniform 3d acceleration. In Proceedings of IEEE Workshop on Visual Motion, pages 14-20, New York, New York, Oct. 1991. [Fua91] P. Fua. A parallel stereo algorithm that produces dense depth maps and preserves image features. Technical Report 1369, INRIA, January 1991. [FW79] C L . Fennema and W.B.Thomson. Velocity determination in scenes containing several moving objects. Computer Graphics and Image Processing, 9:301-315, 1979. [GCT94] A . Giachetti, M . Campani, and V . Torre. The use of optical flow for autonomous navigation. In European Conference in Computational Vision, pages 148-151, Stockholm, Sweden, April 1994. [GGB84] A . Goshtasby, S. H . Gage, and J. F . Bartholic. A two-stage cross correlation approach to template matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:374-378, 1984. [Gir87] Brend Girod. The efficiency of motion-compensating prediction for hybrid coding of video sequences. IEEE Journal on Selected Areas in Communications, SAC-5(7), August 1987. [GK89] B . Girod and D. Kuo. Direct estimation of displacement histogram. In Image Understanding and Machine Vision, pages 73-76, 1989. [Gla84] F. Glazer. Multi-level relaxation in low level computer vision. In E . Rosenfeld, editor, Multiresolution Image Processing and Analysis, pages 312-330. Springer-Verlag, 1984. Bibliography 206 [Gla87] F . Glazer. Hierarchical gradient based motion detection. In Image Understand Workshop, pages 733-748, Los Angeles, C A , Feb 1987. [GP87] E . Gamble and T. Poggio. Visual integration and detection of discontinuities: The key role of intensity edges. A.I. Memo No. 970, MIT-AI, October 1987. [GR80] I. Gradshteyn and I. Ryzhik. Table of integrals, series and products. Academic Press, New York, N . Y . , 1980. [Gri80] W . E . Grimson. Computing shape using a theory of human stereo system. Technical report, M I T Department of Mathematics, June 1980. [Gri93] W . E . L . Grimson. Why stereo vision is not always about 3D reconstruction. AI-Memo-1435, MIT-AI, July 1993. [GV93] M . Gothe and J. Vaisey. Improving motion compensation using multiple temporal frames. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, pages 157-160, New York, New York, May 1993. [GW93] R. Gonzalez and R. Woods. Digital Image Processing. Addison-Wesley Publishing Company, 1993. [Has74] J.C. Hassab. Time delay processing near the ocean surface. Sound and Vibration, 35(4):489-501, 1974. [Has75] J.C. Hassab. Analysis of signal extraction, echo and removal by complex cepstrum in the presence of distortion and noise. Journal of Sound and Vibration, 40(3):321-335, 1975. [HB76] J.C. Hassab and R. Boucher. A probabilistic analysis of time delay extraction by the cepstrum in stationary gaussian noise. IEEE Transactions on Information Theory, 22(4):444-53, 1976. [HB78] J. Hassab and R. Boucher. Improved cepstrum performance through the windowing of log spectrum. Journal of Sound and Vibration, 58(4) :597598, 1978. [HCL91] P. He, Y . Cai, and D. Liang. On cepstrum approach to sequential image matching. In IEEE China 1991 International Conference on Circuits and Systems, pages 797-800, Shenzhen, China, June 1991. [Hee87] D . Heeger. Optical flow from spatio-temporal filters. Proceedings of 1st International Conference in Computer Vision, pages 181-190, 1987. Journal of Bibliography 207 [Hee91] Joachim Heel. Temporal surface reconstruction. Ph.D. Dissertation, M I T Artificial Intelligence Laboratory, 1991. [Hil83] E . C. Hildreth. Measurement of visual motion. M I T Press, Cambridge, Mass, 1983. [HLHM92] J. Huang, S. L i u , M . H . Hayes, and R. M . Mersereau. Multi-frame pelrecursive algorithm for varying frame-to-frame displacement estimation. In Proceedings of IEEE International Conference Acoustics Speech and Signal Processing (ICASSP), volume 3, pages III-241, 1992. [Hor83] B . K . P. Horn. Non-correlation methods for stereo matching. Photogrammetric Engineering Remote Sensing, 49(4):535-536, April 1983. [Hor86] B . K . P. Horn. Robot Vision. M I T Press, Cambridge, Mass, 1986. [HS81] B . K . P . Horn and B . G . Shunck. Determining optical flow. Artificial Intelligence, 17:185-204, 1981. [Hub81] P. Huber. Robust Statistics. Wiley, New York, New York, 1981. [HZ83] R. A . Hummel and S. W . Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:267-287, 1983. [IP92] M . Irani and S. Peleg. Image sequence enhancement using multiple motions analysis. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 216-221, Champaign, Illinois, June 1992. [J91] B . Jahne. Digital Image Processing. Springer-Verlag, 1991. [JB93a] Allan Jepson and Michael Black. Mixture models for optical flow computation. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 760-761, New York, New York, June 1993. [JB93b] Allen Jepson and Michael Black. Mixture models for optical flow computation. Technical Report 93-44, R B C V , November 1993. [JBC87] A . Jain, R. Bubes, and C. Chen. Bootstrap technique for error estimation. IEEE Transaction of Pattern Analysis and Machine Intelligence, P A M I 9(5):628-633, Sept 1987. [JJ94] M . R. Jenkin a n d . A . D.. Jepson. Recovering local surface structure through local phase difference measurements. CVGIP: Image Understanding, 59(l):72-93, January 1994. Bibliography 208 [JL93] D . Jones and D . Lamb. Analyzing the visual echo: passive 3-d imaging with a multiple aperture camera. Technical Report CIM-93-3, M c G i l l University, Feb 1993. [JM92] David Jones and Jitendra Malik. A computational framework for determining stereo correspondence from a set of linear spatial niters. In Proceedings of Second European Conference on Computer Vision, 1992. [Kam93] J . K a m . A Real-Time 3D Motion Tracking System. University of British Columbia. Dept. of Comp. Sci. M.Sc. Thesis, 1993. [KD92] J . Konrad and E . Dubois. Bayesian estimation of motion vector fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(9):910-927, Sept 1992. [KH75] C D . Kuglin and D . C Hines. The phase correlation image alignment method. Proceedings of IEEE International Conference on Cybernetics and Society, pages 163-165, 1975. [KM92] A . Kher and S. Mitra. Registration of noisy S A R imagery using morphological feature extractor and 2d epstrum. In SPIE Conference on Applications of Digital Image Processing XV, volume 1771, pages 281-291, 1992. [KON92] T. Kanade, M . Okutomi, and T. Nakahara. A multi-baseline stereo method. In DARPA Image Understanding Workshop, pages 409-426, 1992. [KP93] Y . K i m and K . Price. A n integrated motion analysis system guided by feedback information. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 130-135, New York, New York, 1993. [KY90] Y . Kitamura and M . Yachida. Three-dimensional data acquisition by trinocular stereo. Advanced Robotics, 4(l):29-42, 1990. [LBP88] J.J. Little, H . H . Bulthoff, and T. Poggio. Parallel optical flow using local voting. Proceedings of International Conference on Computer Vision, pages 454-459, December 1988. [LC88] S.H. L a i and S. Chang. Estimation of 2-d translation via Hadamard transform. In J.L. Lacoume, A . Chehikian, N . Martin, and J . Malbos, editors, Signal Processing IV: Theories and Applications. Proceedings of EUSIPCO-88. Fourth European Signal Processing Conference, volume 1, pages 35-37. North-Holland, Sept. 1988. Bibliography 209 [LC94] J . Lawn and R. Cipolla. Robust egomotion estimation from affine motion parallax. In European Conference in Computational Vision, pages 205-210, Stockholm, Sweden, April 1994. [LE84] J. Lin and E . Eisner. A review of homomorphic deconvolution. Reviews of Geophysics and Space Physics, 22(3):255-263, August 1984. [LG90] J. J. Little and W . E . Gillett. Direct evidence of occlusion in stereo and motion. In European Conference in Computational Vision, pages 336-340, Antibes, France, April 1990. [LK81] B . Lucas and T. Kanade. A n iterative image registration technique with an application to stereo vision. In Proceedings of Eight International Joint Conference on Artificial Intelligence, pages 674-679, 1981. [LKM87] D . J . Lee, T . F . Krile, and S. Mitra. Digital registration technique for sequential fundus images. Proceedings of SPIE on Applications of Digital Image Processing X, SPIE-829:293-300, 1987. [LKM88] D. Lee, T. F . Krile, and S. Mitra. Power cepstrum and spectrum techniques applied to image registration. Applied Optics, 27(6): 1099-1106, March 1988. ......... : [LL88] M . Lamnabhi and J. J. Lhuillier. Motion compensated time rate conversion of video sequences. In Proceedings of 2nd International Workshop on Signal Processing of HDTV, 1988. [LM75] J. Limb and J. Murphy. Estimating the velocity of moving objects from television signals. Proceedings of Computer Graphics and Image Processing, 4:311-327, 1975. [LMK88] D.J. Lee, S. Mitra, and T . F . Krile. Power cepstrum in registration of noisy images. Electronic Imaging, 2:953-956, 1988. [LMK89] D . J . Lee, S. Mitra, and T . F . Krile. Analysis of sequential complex images, using feature extraction and two-dimensional cepstrum techniques. Journal of Optical Society of America, 6(6):863-870, 1989. [LNN92] K . O . Ludwig, H . Neumann, and B . Neumann. Local stereoscopic depth estimation using ocular stripe maps. In Proceedings of Second European Conference on Computer Vision, pages 373-377, 1992. [LNN94] K . O . Ludwig, H . Neumann, and B . Neumann. Local stereoscopic depth estimation. Image and Vision Computing, 12(1):16—35, Jan 1994. Bibliography 210 [LRM91] D . J . Lee, M . Romirez, and S. Mitrra. Fast 2-d hartley transform in 3-d object representation and recognition. In Intelligent Robots and Computer Vision; Proceedings SPIE, volume 1608, pages 302-312, 1991. [Man94] R. Manmatha. Framework for recovering affine transformation using points, lines or image brightness. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 141-146, 1994. [Mat92a] L . Matthies. Passive stereo range imaging for semi-autonomous land navigation. Journal of Robotic Systems, 9(6):787-816, Sept. 1992. [Mat92b] L . Matthies. Stereo vision for planetary rovers: stochastic modeling to near real-time implementation. International Journal of Computer Vision, 8(1):71-91, July 1992. [MKS89] L . Matthies, T. Kanade, and R. Szeliski. Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision, 3:209-236, 1989. [MM91] M . Mahowald and C. Mead. The silicon retina. Scientific American (International Edition), 264(5):76-82, May 1991. [M092] R. Manmatha and J. Oliensis. Measuring the affine transform - 1 : recovering scale and rotation. Comp Sci T R 92-74, University of Massachusetts at Amherst, Amherst, M A , 1992. [Moh81] N . C. Mohanty. Computer tracking of moving point targets in space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(11):606—611, Sept 1981. [Mor80] H . Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. Technical Report 340, Stanford AI-Memo, 1980. [MP77] D. Marr and T. Poggio. A theory of human stereo vision. AI-Memo-451, MIT-AI, Cambridge, M A , 1977. [MS86] L . Matthies and S. A . Shafer. Error modelling in stereo navigation, 1986. [NA94] S. Niyogi and E . Adelson. Analyzing and recognizing walking figures. In Proceedings of IEEE Conference Computer Vision and Pattern Recognition, pages 469-474, 1994. [Nag83a] H . Nagel. Constraints on the estimation of displacement fields from image sequences. In Proceedings of Eight International Joint Conference on Artificial Intelligence, pages 945-951, August 1983. Bibliography 211 [Nag83b] H . Nagel. Displacement vectors derived from second order intensity variations in image sequences. Computer Vision, Graphics, and Image Processing, 21:85-117, 1983. [Nag86] H . Nagel. O n the estimation of dense displacement maps from image sequences. In Proceedings of ACM Motion Workshop, pages 59-65, Toronto, Ontario, 1986. [Nag94] H . H . Nagel. Optical flow estimation: advances and comparisons. In European Conference in Computational Vision, pages 51-60, Stockholm, Sweden, April 1994. [Nak85] K . Nakayama. Biological motion processing: a review. Vision Research, 25:625-660, 1985. [Nas94] P. Nasiopoulos. Error Resistant Schemes for All-Digital High Definition Television. 1994. [Nis84] H . Nishihara. Prism: a practical real-time imaging stereo matcher. A I Memo-780, Cambridge, M A , 1984. [NK92] T. Nakahara and T. Kanade. Experiments in multiple-baseline stereo. Technical Report CMU-CS-93-102, Carnegie-Mellon University, August 1992. [NM93] Chrysostomos L . Nikias and Jerry M . Mendel. Signal processing with higher order spectra. IEEE Signal Processing Magazine, pages 10-37, July 1993. [NP88] C L . Nikias and R. Pan. Time delay estimation in unknown gaussian spatially correlated noise. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(11), November 1988. [NR79] A . Netravali and J. Robins. Motion compensated television coding: part I. Bell Systems Technical Journal, 58:631-670, 1979. [NR87] L . C Nikias and M.R. Raghuveer. Bispectrum estimation: a digital signal processing framework. Proceedings of IEEE, 75(7):869-92, July 1987. [NY93] S. Negahdaripour and C. Yo. A generalized brightness model for computing optical flow. In Proceedings of International Conference in Computer Vision, pages 2-11, Berlin, Germany, May 1993. [OC90] T . J . Olson and D . J . Coombs. Real-time vergence control of binocular robots. Technical Report T R 348, University of Rochester, June 1990. Bibliography 212 [0C91] T. Olson and D. Coombs. Real-time vergence control for binocular robots. Internaltional Journal of Computer Vision, 7(1):67-81, 1991. [OK92] M . Okutomi and T. Kanade. A locally adaptive window for signal matching. International Journal of Computer Vision, 7(2): 143-162, January 1992. [OL81] A . V . Oppenheim and J.S. Lim. The importance of phase in signals. Proceedings of IEEE, 69(5):529-541, May 1981. [OL92] T. Olson and R. Lockwood. Fixation based filtering. In Proceedings of SPIE Intelligent Robots and Computer Vision IX, pages 685-696, 1992. [01s93] T. Olson. Stereopsis and verging systems. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 55-60, New York, New York, 1993. [OSS68] A . V . Oppenheim, R . W . Schafer, and T . G . Stockham. Non-linear filtering of multiplied and convolved signals. Proceedings of IEEE, 56:21364-21291, August 1968. [PAF79] A . Polydoros, W . A u , and A . Fam. Shift invariant homomorphic filtering. In Proceedings of Twenty Second Mid-west Symposium on Circuits and Systems, 1979. [PE] K . Pahlevan and J. Eklundh. A head-eye system for active purposive computer vision. Technical Report CVAP-80, Royal Institute of Technology, Sweden. [PE93] K . Pahlavan and J. Eklundh. Heads, eyes and head-eye systems. International Journal of Pattern Recognition and Artificial Intelligence, 7(1):3349, February 1993. [Pea79] L . G . Peardon. Aspects of cepstral analysis. Ph.D. Thesis, Dept. of Mathematics, Portsmouth Polytechnic, 1979. [PF90] B . Porat and B . Friedlander. A frequency domain algorithm for multiframe detection and estimation of dim targets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(4):398-401, April 1990. [PFTV88] W . Press, B . Flannery, S. Teukolsky, and W . Vetterling. Numeriacl Recepies in C. Cambridge University Press, Cambridge, England, 1988. Bibliography 213 [PJGK77] J. J. Pearson, D . C . Hines Jr., S. Golosman, and C D . Kuglin. Video-rate image correlation processor. In Proceedings of SPIE on Applications of Digital Image Processing, pages 197-205, San Jose, C A , August 1977. [Pol87] G . Polya. How to solve it; a new aspect of mathematical method. Princeton University Press, Princeton, N.J., 1987. [PR73] T. Poggio and W . Reichardt. Considerations on the models of movement detection. Kybernetics, 13:223-227, 1973. [PR90] S. Peleg and H . Rom. Motion based segmentation. In Proceedings of IEEE Tenth International Conference on Pattern Recognition, pages 109-113, Osaka, Japan, June 1990. [Pra78] W . Pratt. Digital Image Processing. Wiley-Interscience Publication, New York, New York, 1978. [Pri86] Keith Price. Anything you can do I can do better (not you can't). Computer Vision, Graphics, and Image Processing, 36:387-391, 1986. [Pri90] Keith Price. Multi-frame feature-based motion analysis. In Proceedings of IEEE International Conference in Pattern Recognition, pages 114-117, Atlantic City, N.J., 1990. [PSW93] M . Platzner, C. Steger, and R. Weiss. Performance measurements on a multi-DSP architecture with tms320c40. In Proceedings of the Fourth International Conference on Signal Processing Applications and Technology, pages 1144-1152, Santa Clara, C A , October 1993. [PUE92] K . Pahlavan, T. Uhlin, and J. Eklundh. Fixation by active accommodation. In Intelligent Robots and Computer Vision XI: Algorithms, Techniques and Active Vision, volume SPIE-1825, pages 670-684, November 1992. [Rei57] W . Reichardt. Autokorrelations-auswertung als functionsprinzip des zentralnervensystems. Naturaforsch., 12b:448-457, 1957. [Rei61] W . Reichardt. Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In W . Rosenblith, editor, Sensory Communication, pages 303-314. M I T Press, 1961. [RHKvdH92] L . Rosenthler, F . Heitger, O. Kiibler, and R. von der Heydt. Detection of general edges and key points. In Proceedings of Second European Conference on Computer Vision. Springer-Verlag, May 1992. Bibliography 214 [Ric88] Whitman Richards. Natural Computation. M I T Press, Cambridge, Mass, 1988. [RL83] P. Rousseeuw and A . Leroy. Robust regression and outlier detection. Wiley, New York, New York, 1983. [Rom75] R. Rom. On the cepstrum of two-dimensional functions. IEEE Transactions on Information Theory, pages 214-217, March 1975. [RR87a] G . Reddy and V Rao. On the computation of complex cepstrum through differential cepstrum. Signal Processing, 13(l):79-83, July 1987. [RR87b] G.R. Reddy and V . V . Rao. Signal delay and waveform estimation through differential cepstrum averaging. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(10):1487-89, October 1987. [RU84] D. R. Reddy and R. Unbehauen. The n-dimensional differential cepstrum. In V . Cappellini and A . G . Constantinides, editors, Digital Signal Processing - 84, pages 145-148. Elsevier Science Publishers, 1984. [RU85] D. Raghuramireddy and R. Unbehauen. The two-dimensional differential cepstrum. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-33(4):1335-1337, October 1985. [RY90] K . R. Rao and P. Y i p . Discrete cosine transform: algorithms, advantages, and applications. Academic Press, Boston, Mass., 1990. [SA90] M . Spetsakis and J. Aloimomonos. A multi-frame approach to visual motion perception. International Journal of Computer Vision, 6(3):245-255, 1990. [SC94] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 194-201, Seattle, Wash, June 1994. [Sch88] B . Schunk. Image flow: fundamentals and algorithms. In J. Martin and W. Aggarwal, editors, Motion Understanding: Robot and Human Vision, pages 23-68. Kluwer Academic Publishers, 1988. [Sco86] G . Scott. Four line method for locally estimating optical flow. Vision and Image Computing, 5(2):67-72, May 1986. [Sco88] G . Scott. Local and global interpretation of moving images. Morgan Kaufmann Publishers, London, England, 1988. Bibliography 215 [SD89] M . Steckner and D. Drost. Fast cepstrum analysis using hartley transform. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(8):1300-1302, 1989. [Sin91] A . Singh. Opticalflowcomputation, a unified approach. I E E E Computer Society Press, Los Alamitos, C A , 1991. [Ski74] D. P. Skinner. Real-time composite signal decomposition. Ph.D. Thesis, Dept. of Electrical Eng., University of Florida, 1974. [SM90] M . Shizawa and K . Mase. Simultaneous multiple optical flow estimation. In Proceedings of IEEE 10th International Conference on Pattern Recognition, pages 274-278, Atlantic City, New Jersey, June 1990. [SMA76] M . Svedlow, C. McGillem, and P. Anuta. Experimental examination of similarity measures and preprocessing methods used for image registration. In The Symposium on Machine Processing of Remotely Sensed Data, pages 4A-9, Westville, Indiana, June 1976. [Spi74] M . Spiegel. Fourier analysis, with applications to boundary value problems. McGraw-Hill, New York, N . Y . , 1974. [SR78] M . Silvia and E . Robinson. Use of kepstrum in signal analysis. Geoexploration, 16:55-73, 1978. [SS93] M . Swain and M . Strieker. Promising directions in active vision. International Journal of Computer Vision., 11(2):109-126, 1993. [SW90] L . Spillmann and J. Werner. Visual perception, the neurophysiological foundations. Academic Press, 1990. [Tav86] S. Tavakkoli. Application of cepstrum techniques to location of acoustic sources in the presence of reflective surfaces. M.Sc. Thesis, Dept. of Mechanical Engineering, Virginia Polytechnique Institute and State University, 1986. [TB87] W . Thompson and S. Barnard. Lower level estimation and interpretation of visual motion. Computer, 20:20-28, 1987. [TG92] M . K . Tsatsanis and B . G . Giannakis. Object and texture classification using higher order statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):733-50, July 1992. [TH86] Q. Tian and M . N . Huhns. Algorithms for subpixel registration. Computer Vision, Graphics and Image Processing, 35(35):220-233, August 1986. Bibliography 216 [The91] P. Thevanaz. Motion analysis. In Proceedings of the 37th Scottish Summer School in Physics, pages 129-166, 1991. [Tho87] G . A . Thomas. Television motion measurement for D A T V and other application. Technical Report 1987-11, B B C Research Department, 1987. [Tis94] M . Tistarelli. Multiple constraints for optical flow. In European Conference in Computational Vision, pages 61-70, Stockholm, Sweden, April 1994. [TM94] W . Theimer and H . Mallot. Phase-based binocular vergence control and depth reconstruction using active vision. Computer Vision, Graphics and Image Processing IU, 60(3):343-358, 1994. [TOM94] J. Taylor, T. Olson, and W . Martin. Accurate vergence control in complex scene. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 540-545, Seattle, Washington, 1994. [Tri77] J . M . Tribolet. A new phase unwrapping algorithm. IEEE Transactions onAcoustics, Speech, and Signal Processing, ASSP-25(1):170-179, Dec. 1977. [U1183] S. Ullman. Visual routines. M I T AI-Memo-723, MIT-AI, Cambridge, M A , 1983. [WA83] A . Watson and A . Ahumada. A look at motion in the frequency domain. Technical Report N A S A Technical Memorandum 84352, N A S A Ames Research Center, 1983. [WA85] A . B . Watson and A . J. Ahumada. Model of human visual-motion sensing. Journal of Optical Society of America, 2(2):322-342, 1985. [WC90] G . A . W . West and T. A . Clark. A survey and examination of subpixel measurement techniques. In Close Range Photogrammetry Meets Machine Vision; Proceedings SPIE, volume 1395, pages 456-463, 1990. [WD86] A . Waxman and J. Duncan. Binocular image flow: steps toward stereomotion fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):715-729, November 1986. [WF88] S. F. Wu and G . M . X . Fernando. Comparative study of two motion estimation. In Proceedings of IEE Third International Conference on Image Processing and Its Applications, volume 307, pages 305-9. I E E , 1988. [WH84] R. Wong and E . Hall. Sequential hierarchical scene matching. IEEE Transactions on Computers, 27(4):359-366, 1984. Bibliography 217 [WM93] J . Weber and J . Malik. Robust computation of optical flow in a multiscale differential framework. In Proceedings of International Conference in Computer Vision, pages 12-20, Berlin, Germany, May 1993. [WS86] A . Waxman and S. Sinha. Dynamic stereo: passive ranging to moving objects from relative image flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(4):406-412, July 1986. [WY91] F . Wang and P. Y i p . Cepstrum analysis using discrete trigonometric transforms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 39(2):538-541, February 1991. [YCC92] J . C . Yen, F . J . Chang, and S. Chang. A new architecture for motioncompensated image coding. Pattern Recognition, 25(4):357-66, April 1992. [YS89] Y . Yeshurun and E . L . Schwartz. Cepstral filtering on a columnar image architecture: a fast algorithm for binocular stereo segmentation. IEEE PAMI, 11(7), July 1989.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Visual echo analysis and multi-evidential correlation:...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Visual echo analysis and multi-evidential correlation: non-linear matching & registration of signals… Bandari, Esfandiar 1996
pdf
Page Metadata
Item Metadata
Title | Visual echo analysis and multi-evidential correlation: non-linear matching & registration of signals and images |
Creator |
Bandari, Esfandiar |
Date Issued | 1996 |
Description | Many low-level vision tasks - such as measurement of visual motion, stereo disparity estimation, or texture segmentation - can be solved by similar computational or biological mechanisms. The primary aim of this dissertation is to introduce and describe a broadly applicable approach to address a variety of low level computational vision problems. This unified framework, which is named visual echo analysis, is based on the simple observation that many computer vision problems can be viewed as detection and estimation of echo arrival periods in time and space. To this end, the framework uses cepstral techniques, a common and effective non-linear signal processing methods for detecting the presence of echoes and estimating their spatial or temporal arrival period. The thesis introduces computational and performance improvements to the traditional power and differential cepstrum with direct extensions to complex and phase cepstrum. Visual echo analysis (and multi-dimensional cepstrum) is then applied to a number of low-level vision tasks such as: motion estimation, binocular and trinocular stereo disparity, motion-stereo analysis, multi-frame disparity estimation (multi-frame motion, multiple baseline stereo), stationary texture segmentation, boundary symmetry analysis, and detection and estimation of multiple disparities (i.e., motion transparency, reflection, and occluding boundary). The relationship between echo analysis and matching is briefly examined, and a new technique for signal registration - called multi-evidential correlation (MEC) is introduced. MEC provides multiple, and thus verifiable, measurements for individual point disparities. The technique utilizes specific matching kernels - such as cepstrum, phase correlation or Hadamard based disparity measurements methods - to furnish multiple estimates of individual disparities; estimates that can be used to verify one another and/or be combined to establish a robust and accurate measure of signal displacements. |
Extent | 55601224 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-03-19 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051640 |
URI | http://hdl.handle.net/2429/6230 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 1996-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_1996-09040X.pdf [ 53.03MB ]
- Metadata
- JSON: 831-1.0051640.json
- JSON-LD: 831-1.0051640-ld.json
- RDF/XML (Pretty): 831-1.0051640-rdf.xml
- RDF/JSON: 831-1.0051640-rdf.json
- Turtle: 831-1.0051640-turtle.txt
- N-Triples: 831-1.0051640-rdf-ntriples.txt
- Original Record: 831-1.0051640-source.json
- Full Text
- 831-1.0051640-fulltext.txt
- Citation
- 831-1.0051640.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051640/manifest