Visual Echo Analysis and Multi-evidential Correlation
Non-linear Matching and Registration of Signals and Images

By Esfandiar Bandari
B.Sc., University of California, Berkeley
M.Sc., University of California, Berkeley
M.Sc., Stanford University

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (DEPARTMENT OF COMPUTER SCIENCE)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
November 1995
© Esfandiar Bandari, 1995

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Abstract

Many low-level vision tasks - such as measurement of visual motion, stereo disparity estimation, or texture segmentation - can be solved by similar computational or biological mechanisms. The primary aim of this dissertation is to introduce and describe a broadly applicable approach to address a variety of low-level computational vision problems. This unified framework, which is named visual echo analysis, is based on the simple observation that many computer vision problems can be viewed as detection and estimation of echo arrival periods in time and space.
To this end, the framework uses cepstral techniques, a common and effective family of non-linear signal processing methods for detecting the presence of echoes and estimating their spatial or temporal arrival period. The thesis introduces computational and performance improvements to the traditional power and differential cepstrum, with direct extensions to the complex and phase cepstrum. Visual echo analysis (and the multi-dimensional cepstrum) is then applied to a number of low-level vision tasks such as: motion estimation, binocular and trinocular stereo disparity, motion-stereo analysis, multi-frame disparity estimation (multi-frame motion, multiple baseline stereo), stationary texture segmentation, boundary symmetry analysis, and detection and estimation of multiple disparities (i.e., motion transparency, reflection, and occluding boundaries).

The relationship between echo analysis and matching is briefly examined, and a new technique for signal registration - called multi-evidential correlation (MEC) - is introduced. MEC provides multiple, and thus verifiable, measurements for individual point disparities. The technique utilizes specific matching kernels - such as cepstrum, phase correlation, or Hadamard based disparity measurement methods - to furnish multiple estimates of individual disparities; estimates that can be used to verify one another and/or be combined to establish a robust and accurate measure of signal displacements.

Table of Contents

Abstract
List of Tables
List of Figures

1 Introduction
1.1 Visual Echo Analysis
1.2 Multi-evidential Correlation
1.3 Outline and Organizational Overview

2 Background and Historical Review
2.1 Previous Work
2.1.1 Correlation based techniques
2.1.2 Difference based techniques
2.1.3 Differential or gradient based techniques
2.1.4 Velocity tuned filtering
2.1.5 Other techniques
2.2 Stereo
2.2.1 Epipolar Constraint and Stereo Vision
2.2.2 Stereo Matching
2.3 Hierarchical techniques
2.4 Summary

3 Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications
3.1 A Brief History of Cepstrum
3.2 Themes, Variations and Mathematical Preliminaries
3.2.1 Prototypical Cepstrum
3.2.2 Power Cepstrum
3.2.3 Complex and Phase Cepstrum
3.2.4 Differential Cepstrum
3.3 Improvements to the traditional approaches: posCepsCos and diffCepsSin
3.3.1 Positive CepsCos
3.3.2 diffCepsSin
3.4 Matching and Registration of Magnetic Resonance Images
3.4.1 MRI Processing
3.5 Summary/Conclusion

4 Cepstral Properties
4.1 Power Cepstrum's Sign Ambiguity in Calculating Displacement
4.2 False Peak at (0,0)
4.3 Windowing
4.3.1 Weighted Windowing of the Input Images
4.3.2 Windowing in the Logarithmic Domain
4.3.3 Zero Padding and Numerical Stability
4.4 Noise, Convolution, First Order Distortions, and Illumination Change
4.4.1 Effects of Additive Stationary Random Gaussian Noise
4.4.2 Effects of Convolution and Blur
4.4.3 Effects of Illumination Change
4.5 Rotation and Expansion
4.6 Subpixel Motion
4.7 Parallelism and Computational Complexity
4.8 Comparison with Standard Techniques
4.8.1 Differences from Standard Techniques
4.8.2 Similarity with Standard Techniques
4.9 Summary

5 Cooperative Analysis of Multiple Frames by Visual Echoes
5.1 Introduction, Background and Previous Work
5.2 Visual Echoes and Spatio-temporal Image Analysis
5.3 Direct Measurement of Trinocular Stereo Disparities
5.4 Improved Disparity Measures by Multiple Baseline Stereo
5.5 Decoupling of Motion and Stereo Disparities in Binocular-Motion
5.6 Summary and Conclusion

6 Multi-Evidential Correlation and CepsCorr
6.1 Multi-evidential Correlation: Definition
6.2 Multi-evidential Correlation: General Properties
6.3 Cepstrum and Multi-evidential Correlation - cepsCorr
6.3.1 CepsCorr and the (0,0) Peak
6.3.2 Sign Ambiguity
6.3.3 Other properties of cepsCorr
6.4 Verification and Combination of Correlated Estimates
6.4.1 Max (0,0) peak and winner take all
6.5 Summary/Conclusion

7 Results of Multi-evidential Correlation
7.1 Results and Comparative Analysis
7.1.1 The Hamburg Taxi Sequence
7.2 Effects of Rotation
7.3 Effects of Expansion/Convergent Flow
7.4 Effects of Illumination Change
7.5 Tiny Camera
7.6 Binocular/Outdoor Scene
7.7 Computational Complexity of Motion Analysis Algorithms
7.8 Summary/Conclusion

8 Detection and Estimation of Multiple Disparities by Multi-evidential Correlation
8.1 Introduction
8.2 Multi-evidential Correlation, Cepstrum and Phase Correlation
8.2.1 Phase Correlation and Multi-evidential Correlation
8.2.2 cepsCorr: Cepstrum and Multi-evidential Correlation
8.2.3 Benefits of Non-linear Matching Kernels
8.3 Results
8.3.1 Motion Transparency
8.3.2 Occluding Boundary
8.3.3 MultiCeps and Multiple Motion Due to Reflection
8.4 Conclusion

9 Concluding Remarks/Overview and Future Directions
9.1 Future Directions

Bibliography

List of Tables

Preliminary experimental results for MRI image registration using different data types and cepstral techniques

List of Figures

1.1 Measurement of visual motion using echoes. (a) shows one frame of the motion sequence, and (b) shows the superposition of two frames in order to display the spatial echo produced by motion. Tweedy bird, as well as part of the cloth in the background, display an echo due to motion.

2.1 Example of the effects of large motion and sharp intensity change on the differential algorithm in one dimension. f(x) and g(x) are the two functions that have gone through the translation h. In frame (a) the value of the motion, h, is calculated correctly, namely h = (f(x) - g(x)) / (df/dx). The same approach in frame (b) gives one the value d, which is much smaller than the actual motion. Frame (c) shows how the motion can be estimated completely incorrectly due to a sharp change in the derivatives of the function f(x). Obviously, in frame (c) one could have a quadratic function with a large second derivative to display the same effect.

2.2 An illustration of spatio-temporal filtering, using the space-time cube. (a) the first frame of motion for a grey bar moving to the left with velocity v; (b) the stacked (or echoed) representation of this motion for five frames; (c) a horizontal cut for continuous time (i.e., in the (x,t) plane). Adapted from Adelson and Bergen.

2.3 Epipolar constraint and the verging cameras.

3.1 Plot of log(1 + cos(2πf)).

3.2 Cepstral analysis of synthetic motion with (x,y) displacement of (7,3). The lower portion of the figure displays the cepstral result.
The top peak is located at 3 pixels down and 7 pixels plus the width of the window horizontally. The lower peak is generated due to symmetric properties of the power spectrum of a real signal.

3.3 (a) Regular cepstrum of a synthetic motion of 7 horizontally and 3 vertically. (b) Positive cepsCos of the synthetic motion.

3.4 (a) regular power cepstrum for Figure 1; (b) its positive CepsCos - i.e., cepsCos with all the negative values discarded; (c) the difference of (a) and (b). Note that the landscape showing the difference has a different visual scale.

3.5 (a) Differential cepstrum and (b) differential sine cepstral analysis of synthetic motion. Since the differential cepstrum of a function is antisymmetric, for clarity the absolute value of the differential cepstrums is shown. Note the improved signal-to-noise ratio in the right figure.

3.6 (a) regular differential cepstrum for a simulated motion; (b) diffCepsSin; (c) their difference. Note that the landscape showing the difference has a different visual scale.

3.7 Magnetic resonance image pair from a single slice. (a) the magnitude image; (b) the phase image.

3.8 (a) The real image and (b) the imaginary image of a magnetic resonance image pair.

3.9 Cepstral results. (a) cepsCos of real-only data; (b) power cepstrum of amplitude-only data.

4.1 (a) Power cepstrum of a motion sequence. In (a) the disparity is 3 downward and 7 to the left. Figure (b) shows the disparity 3 rows up and 7 to the left; even though the direction of motion is reversed, the cepstrum results are identical.

4.2 Differential cepstrum (diffCepsSin) of two motion disparities with opposite signs. Even though the cepstral extreme values are located in the same place, being an odd function the sign of the values determines whether the disparity is (+3,+7) or (-3,-7).

4.3 An example of the false peak at (0,0) disparity. Even though a displacement of (1,1) has been simulated, the main peak corresponds to a (0,0) disparity.

4.4 Removal of the false peak at (0,0) disparity by using the edge image of the previous figure.

4.5 Effects of windowing the two signals with Gaussians. A signal of width 128 was randomly generated and concatenated with a delayed version of itself. The delay was 7 pixels. (a) shows the regular cepstrum of the signal. (b) is the cepstrum of the signal where the two windows were multiplied with very narrow Gaussians before concatenation. (c) is the cepstrum where the two Gaussians were not narrow but were displaced toward the edges of the window by 20 pixels.

4.6 (a) The result of the log of the power spectrum of an image and its echo after the removal of the effects of the original signal. Even with the additional removal of the signal the quefrency window still seems noisy. But the cepstrum of the signal is quite visible in (b).

4.7 Effect of moving square windows in front of a black background. The checkered multiple peaks not only interfere with the measurement of displacements, they also obscure the number of independently moving objects.

4.8 Effects of motion blur on power cepstrum.

4.9 Power cepstrum of a moving plaid function.

4.10 Effects of rotation on cepstral results. The gray arrows represent the motions recognized by ignoring the peak at (0,0) of the cepstrum. As can be seen, places where the secondary peak is used are near the center of the rotation or the back wall (i.e., areas of virtually no motion). In these regions the motions are very close to zero. Secondly, note that since the power cepstrum does not know the difference between (u, v) and (-u, -v), the arrows on the left half of the flow field are pointing in the wrong direction. Nonetheless the magnitude and the orientation of the motion are qualitatively correct.

4.11 Effects of expansion on cepstral results.

4.12 Positive cepsCos for a simulated subpixel motion. (a) the figure and its positive cepsCos; (b) the landscape of the cepstral plane.

4.13 The optical flow field due to egomotion.

4.14 Cepstral equivalent filtering: note how the low frequency functions are deemphasized.

4.15 Cepstral equivalent filtering for an image and an image with added noise.

4.16 An image after it has been put through the different layers of the retina.

5.1 Multi-frame optical flow analysis utilizing cepstrum. The simulated motion field is made to reflect acceleration of the object. The motion disparities with respect to the leftmost frame are (1,1), (3,3) and (7,7).

5.2 The results of spatio-temporal cepstrum analysis for a constant velocity over eight frames. (a) shows the disparity for every two consecutive frames; (b) for every two frames separated by one frame; (c) for two windows differing in image number by two; and (d) for two frames differing by three frames.

5.3 (a) The peak magnitude and (b) the signal-to-noise ratio as a function of the number of frames used.

5.4 Trinocular disparity measurements using visual echoes. The three peaks in the cepstrum reflect a disparity of 3.5 horizontally between the horizontal cameras, a vertical disparity of 6.0 between the cameras located vertically from each other, and a disparity of (3.5, 6.0) for the two diagonal cameras.

5.5 Simulated camera grid by the use of stationary texture segmentation.

5.6 Results of multiple baseline stereo.

5.7 Topographic plot of multiple baseline stereo results using the visual echo analysis approach. The peaks in the figure point to the horizontal disparity of the figure - roughly 1.65 pixels horizontally. The second largest peak corresponds to the overlap of disparity measures between two windows separated at twice the baseline.

5.8 The result of cepstral analysis for stereo disparity measurement.

5.9 Motion-stereo disparity measurement. The two windows on the left are a stereo pair taken prior to the object going through a translation. The cepstral results show how the motion disparity of (3.5, 6.4) and the stereo disparity of (4.3, 0.0) are decoupled.

5.10 A stereo motion scene, and the horizontal stereo disparities using visual echo.

5.11 Motion-stereo disparity measurement. The background is stationary and does not move, while the brightly shaded and darker areas point to the pixel values corresponding to the Tweedy Bird and the piece of cloth, both of which have gone through a motion.

6.1 An illustration of the persistence of the (0,0) peak for a portion of an image that lacks much detail. The window in the upper left hand corner was matched against the region in the upper right using cepsCorr. The lower right shows an enlarged result, and the bright pixels show the positions (or disparities) where the (0,0) peak was the dominant peak. As can be seen, this area includes a large region within the search region. The lower left hand corner shows the relative magnitude of the (0,0) peaks, and as one can see, at the proper disparity - i.e., the disparity (2,1) - the (0,0) peak increases dramatically.

6.2 Two concatenated 2D signals that match perfectly, and the logarithm of their power spectrum on the lower half. The black stripes represent very large negative values.

6.3 The random signal used for comparing cepsCorr magnitude and signal-to-noise ratio as the disparity increases.

6.4 (a) Magnitude of cepstral peaks, and (b) the base 10 logarithm of the peak magnitudes.

6.5 The cepstral results generated during the iterations of cepsCorr. Each row of the image shows the cepstrum result normalized between zero and one; the first row corresponds to disparity -16 and the last row to disparity +16.

6.6 Signal-to-noise ratio for cepstrum as a function of disparity.

6.7 Horizontal and vertical disparity fields for a natural scene using cepsCorr.

7.1 Results of the region based approaches for 16x16 window size. The search regions were vertically from -3 to 2 and horizontally from -3 to 7 pixels.

7.2 The same matching operations as in the last figure but with 8x8 window size.

7.3 Results of Horn & Schunk for different values of α and numbers of iterations. The disparity range for α = 2.0 and 20 iterations (or higher) was within the proper motion region. For α = 0.0, on the other hand, the vertical disparity after 100 iterations ranged from -41.772167 to 27.433767.

7.4 (a) Cepstral results generated by cepsCorr00 iterations, individually normalized and then concatenated. The cepstrum in the upper left corner was replaced with the input image to the first iteration to show typical images being compared. (b) The concatenated cepstral results, without being separately normalized. This figure shows how cepstrum's instability may dominate the cepsCorr00 result based on (0,0) peak magnitude.

7.5 Results of the 8x8 window sized registrations after 3x3 median filtering.

7.6 Results of rotation for 16x16 window size.

7.7 Results of rotation for 8x8 window size.

7.8 Results of rotation for 8x8 window size after median filtering.

7.9 Horn & Schunk method for rotation; α = 2.0 and the number of iterations was 50.

7.10 Results of expansion for 16x16 window size.

7.11 Results of expansion for 8x8 window size.

7.12 Results of expansion for 8x8 window size after median filtering.

7.13 Results of Horn and Schunk for the looming effect.

7.14 Results of change in the direction of illumination from 60 to 30 degrees; search area (-7,7) in both the vertical and horizontal directions and 16x16 window size.

7.15 Results of illumination change for 8x8 window size.

7.16 Results of illumination change for 8x8 window size after median filtering.

7.17 Results of illumination change and the Horn & Schunk method for α = 20 and 20 iterations.

7.18 Original images used in the analysis.

7.19 Effects of image quality on matching using 16x16 windows for matching.

7.20 Results of tinyCamera change for 8x8 window size.

7.21 Horn & Schunk method, α = 20 after 20 iterations.

7.22 Results of binocular disparity measurements. Search over (-1,1) pixels row-wise and (-1,9) pixels column-wise; 16x16 window size.

7.23 Results of binocular disparity measurements. Search over (-1,1) pixels row-wise and (-1,9) pixels column-wise; 8x8 window size.

7.24 Results of binocular disparity measurements after median filtering for 8x8 comparison window size.

7.25 Horn & Schunk method, α = 2.0 after 20 iterations.

8.1 Illustration of multiple peaks due to multiple motions. (a) & (b) the images with two synthetic motions; (c) the result of analyzing the images with cepstrum.

8.2 Results of MEC for stereo. (a) Each row of the figure is the first row (i.e., zero vertical disparity) of the result of one iteration of MEC with phase correlation as its matching kernel. The locations of the peaks are indicative of the disparity, with negative disparities located close to the rightmost column. (b) Topographic image of Figure (a) displaying the magnitude of the phase correlation peaks.

8.3 (a) The collection of binocular disparity signals generated by cepsCorr. Each row of the figure represents the result of one iteration of the multi-evidential correlation with cepstrum as its matching kernel. Each disparity is represented by two peaks which are symmetrically displaced away from the center column by the amount of the disparity. (b) Topographic image of cepsCorr in (a) in order to display the magnitude of the cepstrum peaks.

8.4 (a) Topographic representation of multi-evidential correlation results for the Pepsi scene with phase correlation as the matching kernel. (b) Result of cepsCorr for the same image area in the Pepsi scene. Note that cepsCorr generates a strong peak in the center corresponding to one of the disparities, while for phase correlation the results are very weak and lack a distinguishing structure.

8.5 The effects of simulated multiple disparities on: (a) sum of squared differences (mean squared error); (b) phase correlation; and (c) cepstrum. The second peaks in phase correlation and cepstrum correspond to the correct second disparity, while the second minimum in SSD does not. Using MEC, the spurious peaks in cepstrum and phase correlation vanish and all but the two main peaks are rejected.

8.6 (a) The first frame of a sequence of motion transparency images. The flower pattern is apparently due to the stained glass, and the tripod is the object moving in the background. (b) & (c) The magnitude of MEC at (0,0) disparities for cepstrum (b) and phase correlation (c). These two figures are analogous to plotting the middle column of Figure 8.3(a) and the leftmost column of Figure 8.2(a) respectively.

8.7 Detection vs. localization of occluded regions for a random dot stereogram. The area between a pair of white and black edges corresponds to the locations where multiple disparities were detected.

8.8 (a) & (b) A pair of stereo images of a curved object; (c) the occluding boundary of the object using MEC.

8.9 (a) One of the three images with two constant disparities caused by reflection. The three frames were produced by looking at the Escher drawing with the reflection of the person in the glass, moving the camera a constant distance each time and acquiring two more frames. (b) The magnitude of the (0,0) peaks caused by the overlap of individual multiple disparities. Note that the graph shows two maxima due to two disparities.
8.10 A portion of the cepstrum result for multi-frame analysis.

8.11 The results of analyzing Figure 8.9(a) for multiple echoes. The difference images between frames 1 and 2 after disparity compensation for the two estimates. Note how the reflection of the person in the glass is now easily detected as compared to the original image.

Acknowledgment

I have been extremely fortunate to have numerous outstanding colleagues, instructors and advisors who have influenced my ideas and interests in research. But I would like to particularly acknowledge V. V. Bertero and Egor Popov of the University of California, Berkeley, who got me started in research as an undergraduate; Boyd Paulson and Thomas Binford of Stanford University, who piqued my interest and encouraged me to study robotics and vision; and Takeo Kanade, Martial Hebert, Ben Motazed and William (Red) Whittaker of Carnegie-Mellon University, who provided me with a wonderful opportunity to work and learn about vision and robotics. Furthermore, I would especially like to extend my appreciation to my committee members: Bob Woodham, David Lowe, Peter Lawrence, Mathew Palmer, Tad Ulrych, Rabab Ward, Allan Jepson and finally my official advisor James Little, as well as my good friends Rob Scharein, Ron Rensink, Norm Goldstien and Marc Romanycia. They supported my ideas, reviewed my work and provided me with valuable recommendations and advice. Most of all, I would like to thank and dedicate this work to my parents. They provided me with the best early education possible, continuously encouraged me to follow my interests and set valuable examples for their children to follow. To them I am always indebted.

Dedicated to the memory of my father Bahman Ardeshir Bandari and to my mother Golbaii S. Kaikhosrawi

Chapter 1

Introduction

Visual systems, whether computational or biological, are often characterized by their versatility as well as their complexity.
The tasks often demanded from such systems can range from navigation and obstacle detection to monitoring or recognition. But fortunately many of the problems encountered can be addressed through similar mechanisms. Such uniform treatment of topics not only gives us a better understanding of the underlying concepts, but provides uniform solutions that can be cast into specialized hardware. This kind of efficiency is particularly striking in biological systems, where widely different visual processes share similar computational steps [SW90].

The objective of this dissertation is twofold: (1) to introduce a new computational framework which addresses several related areas in computer vision, and (2) to introduce a new, robust and hybrid image registration technique particularly suited to computational motion and stereo analysis.

To elucidate further, a framework refers to a conceptual/computational structure that provides a unified treatment of a class of problems that are sometimes seemingly unrelated and disjoint. Therefore, a framework not only addresses a multitude of tasks, but its solutions emanate from a singular approach; as a result, the analysis of solutions - for example, the effects of additive zero-mean uncorrelated Gaussian noise - is addressed analogously across all application domains.

1 As an example of an unrelated and disjoint pair of problems one can cite multi-frame motion analysis and 3D reconstruction from trinocular stereo. The problems are unrelated because one deals with motion displacement measurement from multiple frames over time, while the other deals with 3D scene reconstruction from disparities due to sensor displacements. The solutions (or solution spaces) to these problems are disjoint because to date all the solutions provided for these problems are specific to the two domains.
A good conceptual example in computational vision - and one that strongly motivated the present dissertation - is Whitman Richards' [Ric88] visual flow analysis, which draws attention to the similarity among motion, binocular and texture processing. Motivated by this work as well as others [Pri86, SS93], the first part of this thesis introduces and develops a computational structure, nicknamed visual echo analysis. With this framework I try to address such computational vision tasks as motion analysis, binocular and trinocular stereo, stationary texture segmentation, boundary symmetry recognition, multi-frame analysis, and position/scale and rotation invariant recognition of binarized images. The framework exists because the core conceptual notion employed is unique but diversely applicable. This provides a common methodology for analysis of results and effects of irregularities in inputs.

Based on this and similar techniques in the literature, the second part of this thesis introduces multi-evidential correlation - a new hybrid and robust approach for a fundamental task in computational vision, namely matching and registration of raw images.

Multi-dimensional signal correspondence is a significant topic in signal processing and computational vision. The signals compared are often acquired under different circumstances at different times, by different sensors, with different modalities, under different viewpoints, and for different reasons. But the fundamental operation, whether to find similarities or differences between the inputs, is indeed matching and registration of the data. The primary focus of this work is matching and registration as they relate to visual motion and stereo analysis. This interest is not accidental. Both motion analysis and stereopsis are fundamental topics in computational and biological vision. Moreover,
together they encompass the majority of essential issues raised in image registration. Therefore, from this point on, "image matching and registration" refers primarily to the issues and techniques that concern visual motion and stereo analysis.

Throughout this work, I will present examples and applications of the techniques utilized, discuss the limitations and advantages of these routines, and provide comparisons to the traditional approaches when applicable. The next two sections provide a clearer and more specific discussion of the concepts developed in the thesis, and are followed by a brief outline of the dissertation.

1.1 Visual Echo Analysis

There are quite a few books on problem solving and scientific thinking [Pol87, Ada86], and a common approach often utilized is solution by association, analogy or equivalence. Accordingly, let me describe visual echo analysis by examining motion analysis from a new perspective. Assume we are recording images of a static scene with a stationary camera. Ignoring sensor noise, the images captured over time are identical. Indeed one can claim that the second frame is a repetition, replication or an echo of the first frame in time. Similarly, every frame can be viewed as a temporal echo of the previous ones. But what if an object in the scene begins to move? Then the moving object also creates an echo of itself, but this time in space as well as in time (see Figure 1.1). Thus motion analysis can be viewed as detection and estimation of spatio-temporal visual echo arrival periods between two or more image frames.

2 We will visit this issue again. There are of course gross geometric and photometric distortions that may not be encountered in either motion or stereo analysis. Many of these involve landmark registration (often manual) followed by image interpolation and morphing. These are outside the scope of this work.
But others, such as shadows, changed illumination, or multiple motions (e.g., motion of the ground behind moving clouds) are issues that we attempt to address.

Figure 1.1: Measurement of visual motion using echoes. (a) shows one frame of the motion sequence, and (b) shows the superposition of two frames in order to display the spatial echo produced by motion. Tweedy bird, as well as part of the cloth in the background, display an echo due to motion.

What we are doing is correspondence; that is, we are trying to match different portions of two images and decide how much each part has moved. Except we do this by looking for echoes. There are two main advantages to this conceptual shift in our thinking.

First, echoes have been studied extensively in many fields such as acoustics, oceanography, geophysics, seismology, physics, communication, power engineering, speech recognition, system analysis and signal restoration [CSK77], making it possible to build on the extensive research literature on this topic.

3 I.e., building on the shoulders of giants.

Second, modeling matching as echo analysis presents new possibilities and a great deal of flexibility to address other topics such as binocular or trinocular stereo, multi-frame motion analysis, multiple baseline stereo, stationary texture segmentation, motion-stereo, boundary symmetry analysis, multiple disparity analysis, and occluding boundary detection. These applications have been developed by exploring many questions about echoes and vision. For instance, what if we have multiple echoes, in space (e.g., stationary texture, trinocular stereo, and multiple baseline stereo) or time (multi-frame motion analysis), or both (motion-stereo analysis)? What if we have multiple moving objects, each having a unique, or perhaps overlapping, echo arrival period?
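The echo view of motion can be illustrated numerically. The following is only an illustrative sketch, not the algorithm developed in this thesis (which operates on windowed 2-D images and adds the refinements of Chapter 3): a 1-D signal containing an attenuated echo of itself is analyzed with a power cepstrum, i.e., the inverse Fourier transform of the log power spectrum, and the echo arrival period appears as a peak.

```python
import numpy as np

def power_cepstrum(s):
    # inverse FFT of the log power spectrum: an echo at delay d in the
    # signal becomes a peak at "quefrency" d in the cepstrum
    power = np.abs(np.fft.fft(s)) ** 2
    return np.real(np.fft.ifft(np.log(power + 1e-12)))

rng = np.random.default_rng(0)
n, delay, a = 256, 7, 0.8
x = rng.standard_normal(n)
s = x + a * np.roll(x, delay)      # a signal plus an attenuated (circular) echo

c = power_cepstrum(s)
# skip the lowest quefrencies, which are dominated by the signal itself
q = np.arange(3, n // 2)
est = q[np.argmax(c[q])]
print(est)  # -> 7, the echo arrival period
```

The echo contributes a cosine ripple of period 1/delay to the log spectrum (the quantity plotted in Figure 3.1), and the final inverse transform concentrates that ripple into a peak at a quefrency equal to the delay.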
Moreover, echoes of waves reflected from boundaries can be symmetric or antisymmetric, depending on the boundary conditions. Can one detect different types of symmetries and self-similarities in object boundaries when they are present? How stable are the underlying operations, especially in the presence of distortions? Is the underlying technique a variation of previous methods, in the same mathematical class, or is it fundamentally different (e.g., linear vs. non-linear routines)? What properties can be fully characterized and which ones remain to be studied further?

The cardinal question, of course, is how this flexibility affects the performance of our objective tasks. Poorly performing flexible techniques are almost always best replaced with specialized, accurate methodologies. To answer this question, one has to furnish a comparison with the best performing techniques in the applicable domain whenever possible.

4 There is an interesting recursive definition attributed to the late Allen Newell, one of the prominent figures in the history of artificial intelligence: "Almost always almost always means always."

1.2 Multi-evidential Correlation

As pointed out earlier, one of the most significant tasks in signal processing and computational vision is matching of multi-dimensional signals. Whether the images have been gathered over time, such as motion sequences, or differ by the spatial location of the sensors, such as binocular and trinocular stereo, with differing numbers of bands and formats, the measurement of displacement defines the task. It is thus not surprising that a variety of approaches have been introduced over time. Some schemes minimize differences, others maximize similarities; some use linear, and yet others utilize non-linear techniques. Many methods are preceded by preprocessing of the raw data and followed by the establishment of constraints through post-processing of the results.
But the fundamental objective remains the same. The primary purpose of the second part of this thesis is to address the correspondence issue. The weak assertion behind this work is that the more estimates of displacement are available, the better. The strong assertion (the main observation), however, is that as the disparity or displacement (these two terms are often used interchangeably) between the images increases, our results become more difficult to obtain and less reliable. Consequently, reducing the disparities increases the accuracy of our measurements. To give a simplistic example, when measuring the length (or weight) of an object, one can make several measurements and average the results. The more times one makes the measurement, the higher the confidence in the final answer. Furthermore, the more accurately each measurement is made, the better the outcome. Multi-evidential correlation uses algorithms that provide a direct estimate of the disparity between two images, and combines these with correlation to provide better measures of displacement. Since the actual matching kernel used can be changed, one might assert that multi-evidential correlation is a hybrid technique. The application of multi-evidential correlation to the detection and measurement of multiple displacements due to occlusion, reflectance, or transparency is shown. In this case the multiple evidences are used to differentiate between spurious noise and peaks due to secondary motion.

1.3 Outline and Organizational Overview

Due to the breadth of the applications and techniques considered, chapters are organized to follow a uniform skeleton when possible. Most chapters have sections on background, computational considerations and summary. When applicable I also include sections on mathematical derivations, applications and results, as well as future work.
To keep the discussion focused I have made ample use of dashes, parentheses and footnotes for material not always essential to the discussion. This is often necessary in order to point out facts and exceptions with minimum disruption to the flow of conversation. Many of these interjections can therefore be skipped by the casual reader or on a first reading of the text. (Here is an opportune example, where one may ultimately ask: "why would anybody want to read this twice?!")

The following are brief chapter descriptions, which are also expanded upon at the end of each chapter as a summary.

Chapter 2. This chapter provides a brief review of multi-dimensional signal and image registration. The majority of the techniques reviewed - mean squared difference (sum of squared differences), mean absolute difference (sum of absolute differences), spatially tuned filters, Hadamard based routines and differential approaches - have their roots in linear filtering and statistics. Phase correlation and non-linearly preprocessed token based methods are the only two non-linear techniques reviewed. Motion analysis is one of the most general applications of image registration, and the majority of the references and works cited are from this domain. Section 2.2 reviews stereo matching as a special case of motion analysis and image registration. Hierarchical pyramids and their role in matching are also discussed, and their role in reducing the assumptions and limitations of region-based methods explored.

Chapter 3. Chapter 3 focuses primarily on the mathematical definitions of the cepstrum and its variations. For the sake of brevity and conciseness, some questions, especially the intuitive ones, such as the cepstrum's behavior in the presence of multiple echoes (e.g., occlusions and transparency), are deferred to later chapters and will not be addressed in this chapter.
Other properties, such as the cepstrum's behavior in the presence of noise or convolution, will also be addressed later. But the chapter introduces new variations of the power and differential cepstrum that improve on the traditional forms in both echo detection performance and computational efficiency.

Chapter 4. This chapter provides a review of the cepstrum's properties as a matching routine, particularly emphasizing its limitations and shortcomings, which need to be addressed. Some of the topics covered include windowing, numerical stability, effects of additive noise, convolution (particularly blur), rotation, expansion and illumination changes. A short comparison, particularly emphasizing the similarities to other approaches, is also provided, with graphical examples and theoretical analysis.

Chapter 5. This chapter provides a uniform approach to multi-frame analysis and shows the application of the visual echo framework to the analysis of disparity calculations for spatio-temporal motion analysis, trinocular stereo disparities, multiple baseline stereo measurements, and depth and motion calculations using motion-stereo. What sets this approach apart from its predecessors is that the equal disparities constraint, when applicable, is fully utilized to increase the disparity detectability and the accuracy of the estimation process. Several domains where this constraint is satisfied include motion-stereo, multiple baseline stereo, and constant velocity motion. When this constraint is violated, as in the case of trinocular stereo or accelerating motion, the visual echo techniques provide all disparities between every two windows analyzed. One of the interesting features of this work is its extensibility to other possible mechanical setups such as trinocular stereo-motion analysis, or calibrated camera grids.

Chapter 6.
This chapter introduces multi-evidential correlation (MEC) and the additional robustness, flexibility, and accuracy that it provides. I will then concentrate on cepsCorr - MEC with cepstrum as its matching kernel - and show how some of the limitations and ambiguities of the cepstrum can be overcome through the use of MEC. Lastly, we introduce a simple, fast and efficient winner-take-all approach to utilize the disparity estimates (i.e., multiple evidence) provided by cepsCorr. This method not only overcomes the sign ambiguity of cepstral disparities, but also resolves the sometimes troublesome incorrect (0,0) peaks, turning this limitation of the cepstrum into an advantage.

Chapter 7. This chapter shows the use of cepsCorr00 and phaseCCorr00 for matching and registration of several images. The images chosen address some of the more common image transformations, such as rotation, expansion, and translation parallel to the focal plane, plus changes in lighting conditions and image quality. Even considering only the magnitude of the (0,0) peak - a rather limited application of MEC - cepsCorr00 performed rather nicely. As was expected, and as pointed out by Yeshurun and Schwartz [YS89] and Ludwig, Neumann and Neumann [LNN94] for illumination changes, rotation and expansion, cepsCorr00 inherits the cepstrum's robustness and performs better than the other techniques considered. The chapter discusses in detail possible instabilities of the cepstrum and phase correlation and how they can easily be overcome by the use of MEC (with additional proofs of its robustness). These instabilities sometimes occur at occluding boundaries, a significant area of research (especially for object segmentation) and a topic that will be discussed - along with multiple motion - in more detail in the next chapter.

Chapter 8.
In this chapter I discuss a direct method for the detection and estimation of multiple disparities due to occlusion, motion transparency, and reflection using multi-evidential correlation. No a priori assumptions about the existence or the number of disparities are made. The two matching kernels used, cepstrum and phase correlation, both generate multiple peaks in the presence of multiple motions. When these peaks persist between different iterations of multi-evidential correlation, and they are consistent in the measurement of disparities, the existence of multiple displacements is assured. I will also address the use of multi-frame analysis and cepstrum to estimate multiple constant disparities over time. Both cepstrum and phase correlation performed well, but cepstrum had a higher signal to noise ratio and an overall better performance than phase correlation. A significant fact that should be pointed out here is that motion transparency and reflection often cause blurring in the image. Both cepstrum and phase correlation are robust to blurring, and in fact the cepstrum is often used as a deblurring mechanism.

Chapter 2

Background and Historical Review

This chapter is intended to provide a brief tutorial on the matching and registration routines often used in practice. It is practically impossible to provide an objective comparison of all the routines developed. One reason is that the performance of these processes is affected not only by the nature of the raw data being matched, but also by the number and applicability of the assumptions being made, as well as the number and values of the thresholds and internal parameters being set. Therefore, even though a technique may perform poorly on one set of data, given the right assumptions and thresholds it can perform best for another set of data. Thus, for the purposes of this review, I have tried to provide the basic theoretical formulation of each technique as well as the practical and performance related issues.
When the results of a technique are dependent on certain presuppositions, these are made explicit in the review. In regard to assumptions, I have found that, in general, in every assumption (explicit or implied) there exists the possibility of its violation, thus limiting the accuracy and applicability of the procedure to which it pertains. Therefore, in my own work I have tried to minimize or eliminate the number of assumptions and arbitrary thresholds and limits. But there are inherent restrictions that one has to note. In fact, even though a variety of registration techniques and approaches will be reviewed below, emphasis is placed on region-based techniques and methods that are particularly suited to estimating small displacements. The primary assumption in region-based motion and stereo analysis is that of "local constancy of disparity". That is, for small neighborhoods the displacements of all enclosed pixels are constant. This assumption is, of course, violated in the presence of occluding boundaries or multiple displacements due to transparency or reflection. These problems are addressed in later chapters. A related and implied assumption for region-based approaches is that of fronto-parallel displacement. That is, for small regions the displacements are due to simple two dimensional translations (of the object or camera) parallel to the image plane. (As we will see, it is really not necessary to restrict the horizontal and vertical disparities; large disparities are often addressed using hierarchical pyramids.) That is, large distortions (affine or perspective) are either prohibited or reduced through preprocessing. For measurements of small visual motion and stereo disparity (controlled by the stereo baseline compared to the depths of interest) the assumption of fronto-parallel displacement is not an unreasonable one.
In practice, for applications that may violate the above assumptions - such as satellite imagery - gross geometrical distortions are corrected automatically by considering the location and orientation of the cameras during image capture, or by registration of control or ground points between the two images and subsequent resampling of the points between control points. (For a good discussion of image interpolation and morphing between registered landmarks the reader is referred to [Bro92].) Fine registration of the interpolated images can then be conducted using the techniques described below. Section 2.1 reviews a quintessential area of registration, namely motion analysis, which is a comprehensive representation of this topic; specific matching techniques are described. Section 2.2 covers stereo, both binocular and trinocular, reviews the epipolar constraint, and weighs the pros and cons of feature versus region matching; and section 2.3 covers the hierarchical pyramids often used to eliminate the assumption of small displacement fields.

2.1 Previous Work

It would be truly prohibitive to consider all the techniques, methodologies and approaches from the ever growing number of applications that discuss image registration. Some techniques look for maximum similarities, others for minimum differences, and yet a third group forces matching through interpolation and morphing [Bro92, SC94]. (Even though considered by some as matching, interpolation and morphing of image sections between isolated registered points is a topic more related to computer graphics than computational vision; this class of procedures will therefore not be covered here. See [Bro92] for a good discussion.) Moreover, the techniques developed can be categorized into linear vs. nonlinear, spatial vs. Fourier based, or feature based vs. region-based approaches. Being one of the most general representatives of image registration, motion analysis embodies the majority of the problems as well as the technical solutions developed to date.
In fact, as noted by others [Hor86, BB82], stereo analysis is a special case of visual motion processing. But independent of other topics, the importance of visual motion analysis cannot be overstated. In biological and computational vision, time varying image analysis plays a fundamental role in egomotion estimation, computation of time to collision, determination of focus of attention, target tracking, scene segmentation, interpretation and encoding of 3D information, and monitoring [Nak85]. In today's multi-media information technology, motion compensation is an integral part of video compression algorithms [Nas94, Tho87, Ana94]. There have already been several sound reviews of motion analysis techniques [Ana85, Bro92, AN88, Bar84, The91]. Building on these works, the next sections will cover three often cited categories of motion analysis, i.e., the well entrenched correlation based techniques, gradient or differential schemes, and spatio-temporal (or velocity tuned filtering) approaches. This is followed by two powerful but often ignored routines, phase correlation and Hadamard based routines. (Occasionally the word routine is used in the text to stand for algorithms, methods, techniques, or approaches; this is not to be confused with "visual routines" [Ull83], a term first used by Ullman to differentiate between serial and parallel visual tasks.) Section 2.2 covers stereo analysis and provides a comparison between feature based and region-based approaches. Finally, I will briefly address some remaining issues such as the use of hierarchical pyramids, and image differencing.
2.1.1 Correlation based techniques

There is no class of matching methods better entrenched in computational vision than the one based on the linear cross-correlation coefficient of statistics and its variations. To begin near the beginning: based on research on insects, Reichardt [Rei57, Rei61], one of the most influential early vision scientists, postulated that autocorrelation is the motion processing instrumentation of the central nervous system. The significance of these works, as well as [PR73], was in giving correlation a plausible neurophysiological basis. But even without such endorsements, cross-correlation has a long history as the basic statistical approach to template matching and pattern recognition. For matching two dimensional discrete signals, simple cross-correlation can be written as:

$$R_{I_1 I_2}(u, v) = \sum_{x=1}^{w_x} \sum_{y=1}^{w_y} I_1(x, y)\, I_2(x - u, y - v) \qquad (2.1)$$

where $w_x$ and $w_y$ define the size of the two image templates $I_1$ and $I_2$, and the limits $u_{min} \le u \le u_{max}$ and $v_{min} \le v \le v_{max}$ delimit the search area for the best match. (Obviously one can replace the vector quantities $(x, y)$ and $(u, v)$ by single bold letters, thus extending the present representation to signals of arbitrary dimension.) Theoretically, correlation is merely a similarity measure. As the template $I_1$ is moved over the second image, the sum of pixel by pixel multiplications of the two windows provides a similarity measure between the template and the image at that location. Therefore, the correlation of a template and an image is a matching metric. The location of the maximum similarity value is considered the location of the best match for the template. One of the most attractive properties of correlation - particularly in signal processing - is its great speed.
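To make the bookkeeping of Eq. 2.1 concrete, a direct (unoptimized) sketch in Python follows. This is purely illustrative - it is not an implementation used in this thesis, and all names are of my own choosing. A distinctive patch is planted in an otherwise empty image, and the peak of $R$ recovers its offset:

```python
import numpy as np

def cross_correlation(template, image, u_max, v_max):
    # Direct form of Eq. 2.1: R(u, v) = sum_{x,y} I1(x, y) * I2(x - u, y - v).
    # The caller must keep the search area within the image bounds.
    wy, wx = template.shape
    R = np.zeros((u_max, v_max))
    for u in range(u_max):
        for v in range(v_max):
            patch = image[v:v + wy, u:u + wx]  # window of I2 at shift (u, v)
            R[u, v] = np.sum(template * patch)
    return R

# Plant a textured 8x8 patch at offset (u, v) = (3, 2) in an empty image.
rng = np.random.default_rng(0)
image = np.zeros((20, 20))
image[2:10, 3:11] = rng.random((8, 8)) + 0.5
template = image[2:10, 3:11].copy()

R = cross_correlation(template, image, 8, 8)
u_best, v_best = np.unravel_index(np.argmax(R), R.shape)  # location of best match
```

As the surrounding text cautions, the peak of the raw correlation is only trustworthy when the local intensity distribution is conserved.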
In the frequency domain, correlation is merely the multiplication of one window with the complex conjugate of the other. But typical correlation-based techniques implicitly assume the conservation of the local intensity distribution. In the case of pure correlation this assumption can lead to severe problems. Assume for instance that part of the search region considered in the second image contains an area of very high intensity (e.g., a uniform, extremely bright region with the same dimensions as the template from the first image). It is easy to see that the correlation results will be maximal within this region, independent of the template values. To alleviate this problem, normalized correlation is often used (for simplicity I substitute $\sum_{x,y}$ for $\sum_x \sum_y$ below):

$$C_{I_1 I_2}(u, v) = \frac{\sum_{x,y} I_1(x, y)\, I_2(x - u, y - v)}{\sqrt{\sum_{x,y} I_1^2(x, y)\; \sum_{x,y} I_2^2(x - u, y - v)}} \qquad (2.2)$$

The normalization of the cross-correlation is used to reduce the influence of the local intensities. Since the template energy - i.e., $\sum_{x,y} I_1^2(x, y)$ - is constant, one can omit normalization with respect to this measure without affecting the location of the peak. Unfortunately, this improvement comes at a great cost. Not only is the new approach computationally intensive, it also cannot be implemented in the frequency domain, thus losing the primary method for correlation speed-up. Another similarity measure used is the correlation coefficient [SMA76]:

$$CC_{I_1 I_2}(u, v) = \frac{\sum_{x,y} \left(I_1(x, y) - \mu_{I_1}\right)\left(I_2(x - u, y - v) - \mu_{I_2}\right)}{\sqrt{\sum_{x,y} \left(I_1(x, y) - \mu_{I_1}\right)^2\; \sum_{x,y} \left(I_2(x - u, y - v) - \mu_{I_2}\right)^2}} \qquad (2.3)$$

which provides a linear measure of similarity between -1 and 1 inclusive. In the formulation above, $\mu_j$ and $\sigma_j$ are the mean and standard deviation of an image window $I_j$.
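The failure mode just described - a uniform, very bright region hijacking the raw correlation peak - and its cure by the normalization of Eq. 2.2 can be reproduced in a few lines. Again this is only an illustrative sketch with invented names, not code from this thesis:

```python
import numpy as np

def correlation_surfaces(template, image, search):
    # Returns the raw correlation (Eq. 2.1) and the normalized form (Eq. 2.2),
    # in which each window of I2 is divided by its energy.
    wy, wx = template.shape
    raw = np.zeros((search, search))
    norm = np.zeros((search, search))
    t_energy = np.sum(template ** 2)
    for u in range(search):
        for v in range(search):
            patch = image[v:v + wy, u:u + wx]
            raw[u, v] = np.sum(template * patch)
            norm[u, v] = raw[u, v] / np.sqrt(t_energy * np.sum(patch ** 2))
    return raw, norm

rng = np.random.default_rng(1)
image = rng.random((24, 24))
image[12:22, 12:22] = 10.0            # uniform, extremely bright region
template = image[4:10, 4:10].copy()   # true match at (u, v) = (4, 4)

raw, norm = correlation_surfaces(template, image, 16)
raw_peak = np.unravel_index(np.argmax(raw), raw.shape)
norm_peak = np.unravel_index(np.argmax(norm), norm.shape)
# raw_peak lands inside the bright block; norm_peak recovers (4, 4).
```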
A related and empirically more effective technique is the sum of squared differences, or SSD (see [AW84, AW85] for a justification) [BAHH92, Ana87, Sco88, Sin91, Fua91, FL94, AW85, AW84, Sco86, BYX82, BYX83, WH84, BL83, AR84, CCM84, GGB84, BT80, Rei61, Rei57]:

$$SSD_{I_1 I_2}(u, v) = \sum_{x,y} \left( I_1(x, y) - I_2(x - u, y - v) \right)^2 \qquad (2.4)$$

This computationally efficient statistical method looks for the minimum difference between the signals. It is obvious that the SSD is just the sum of the energies of the template and the underlying image window minus twice the correlation value. Most often it is the correlation term that drives the minimization of the SSD, but the addition of the signal energies helps to offset problems with regular cross-correlation such as drastic brightness changes. Nonetheless it is advisable to normalize the SSD. One of the primary characteristics of SSD and cross-correlation is their optimality with respect to a certain type of noise, specifically additive stationary white Gaussian noise [Mat92b]. It should be noted, however, that image noise is often neither completely additive nor Gaussian white noise. Yet this is a popular noise model, perhaps more applicable to photodetector imaging sensors [Pra78]. Another realistic additive noise model is pink, or higher frequency random, noise [Nis84]. To avoid the correlation result being unduly influenced or confounded by differently scaled structures (i.e., different frequency bands), many authors prefilter the image with the Laplacian of Gaussian ($\nabla^2 G$, often approximated numerically by a difference of Gaussians with a standard deviation ratio of 0.6; I will revisit this problem in hierarchical signal matching [BAHH92, BAK91]). In fact, simple windowed correlation or SSD does very poorly in regions of low contrast and near constant intensity. In the converse case, the presence of a great deal of intensity variation or texture can also bog down correlation based methods [Cro83].
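The identity just noted - SSD as the signal energies minus twice the correlation - is worth seeing explicitly. A minimal Python sketch (illustrative names, not thesis code):

```python
import numpy as np

def ssd(template, image, search):
    # Eq. 2.4: the *minimum* of the SSD surface marks the best match.
    wy, wx = template.shape
    out = np.empty((search, search))
    for u in range(search):
        for v in range(search):
            d = template - image[v:v + wy, u:u + wx]
            out[u, v] = np.sum(d * d)
    return out

rng = np.random.default_rng(2)
image = rng.random((20, 20))
template = image[5:11, 7:13].copy()   # true displacement (u, v) = (7, 5)

S = ssd(template, image, 8)
u_best, v_best = np.unravel_index(np.argmin(S), S.shape)

# SSD = sum I1^2 + sum I2^2 - 2 R(u, v), checked at an arbitrary shift:
u, v = 3, 2
patch = image[v:v + 6, u:u + 6]
expansion = (np.sum(template ** 2) + np.sum(patch ** 2)
             - 2.0 * np.sum(template * patch))
```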
That is why prefiltering with $\nabla^2 G$ is used to localize or band limit the correlation in the frequency domain (i.e., to lock into a particular spatial structure scale). By increasing the size of the $\nabla^2 G$ a greater percentage of the mean of the signals is removed [AW84]. But the same process also changes the noise characteristics, so that even for white Gaussian noise correlation will no longer be the optimal process. Obviously, up to now the primary computational expense of the techniques considered was due to the multiplication operation. The sum of absolute differences (SAD) is a comparable difference minimizing search technique, which in its normalized form is written as:

$$SAD_{I_1 I_2}(u, v) = \sum_{x,y} \left| I_1(x, y) - \mu_{I_1} - I_2(x - u, y - v) + \mu_{I_2} \right| \qquad (2.5)$$

An interesting but scarcely referenced work by Barnea and Silverman [BS72] is the Sequential Similarity Detection Algorithm. This approach not only uses SAD but also looks for the best fit as the candidate where the highest number of pixels is subtracted and summed before a threshold is reached: the inner loop of summation is stopped as soon as the threshold is passed, at the present maximum number of pixels. Not only is this algorithm very fast, it is also particularly suitable for large templates, as in satellite imagery. Barnea and Silverman provide interesting variations of the algorithm. Although there are studies comparing different cross-correlation techniques [BYX82] and SSD [AW84], showing the relatively greater robustness of SSD particularly in the presence of noise, I am not aware of any adequate comparison between SSD and SAD techniques. In conclusion, the following points about correlation techniques are noteworthy:

• Correlation based methods are one of the simplest, oldest and best entrenched techniques for match filtering and registration. In fact, in many situations, these
approaches have taken on the Hammer Syndrome - i.e., "since the only tool in my toolbox is a hammer, all the problems I face are nails". This metaphor was first noted by Horn [Hor83], who urged examination of alternative registration strategies. (Using a stretched analogy, one may argue that simple correlation is to matching and registration what Fortran has been to computer languages.) A side effect of this popularity has been the development of different hardware configurations for fast cross-correlation [ea93, PSW93].

• Like many other region-based matching routines, the choices of window size and search boundaries are often arbitrary [OK92]. Small window sizes limit the amount of signal to be matched and therefore reduce the robustness of the match. Large window sizes, on the other hand, blend different disparities that may arise due to surface gradients or motion discontinuities. This is analogous to the problem of windowing or support region for an edge detector. Nishihara [Nis84] provides a good mathematical treatment of window size and convolution with a band limiting filter. I will also provide a treatment of both window size and search boundary under the hierarchical search methods.

• In addition to areas of low contrast, cross-correlation techniques produce very poor results in the presence of occlusion, rotation, affine or perspective transformations, or scale change. Bergen et al. [BAHH92] provide an excellent mathematical treatment of the different transformations that may occur. But even this work does not cover all the possible transformations.

• Aside from regular cross-correlation - which can be implemented in the Fourier domain - and SAD, all the other variations described are computationally expensive and slow, though they can be implemented on parallel architectures [Fua91].
• These techniques are data type independent, but again, aside from the traditional cross-correlation approach, they are not really useful when matching complex type data.

• Singh [Sin91], Scott [Sco86], and Anandan [Ana87] all provide confidence measures for their matching algorithms. But interestingly, none of these confidence measures is based on the traditional signal to noise ratio. Anandan's approach [AW85, Ana87] is more localized and involves the values and directions of the principal curvatures at the correlation peaks. Scott and Singh also provide directional confidences, but theirs are based on principal axis and principal component analysis of the entire correlation plane. They modify the correlation results through heuristically based formulations which are rather arbitrary and affect the outcome of the process. Singh [Sin91], however, does have an interesting explanation of his choice of formulation and the effects of his use of the entire correlation result. Other researchers, such as Fua [Fua91] and Matthies [MS86], have also provided detailed treatments of the topic. Such error analysis techniques are particularly important in applications of Kalman filtering.

• There has been no well known multi-frame analysis method using correlation based techniques. Multi-frame analysis is important for several reasons. First, it produces results between every pair of frames. This multiple result generation is particularly useful because it provides verification and a consistency check among results. More importantly, in the case of visual echo analysis, temporal smoothness and constant velocity provide additional verification by improving the signal to noise ratio of the final disparity maps through signal reinforcement. I will discuss this in more detail later on. In the case of cross-correlation there is no simple multi-frame analysis. The only
solution is to compare every two images, an approach whose cost grows quadratically with the number of images. Matthies et al. [MKS89], Heel [Hee91], Singh [Sin91] and many others have used consecutive correlation results in Kalman filtering to improve 3D reconstruction of surfaces. But this is not the same as multi-frame analysis. Singh [Sin91] also uses averaging of correlation results to increase the spatial smoothness of flow. This is particularly interesting because it assumes and, in fact, forces temporal smoothness of the solution.

• The majority of the subpixel approximations to the correct match are based on bicubic or quadratic surface fitting to the correlation plane. Again, Heel [Hee91] has a good discussion of this topic.

• Finally, none of the correlation techniques provides a process to generate multiple evidences. It is possible to increase one's confidence by comparing different correlation measures, but the approaches are so similar that no real variation can be expected. It is when one gets different answers with low signal to noise ratios that one can be more confident in the presence of anomalies such as occlusions or multiple disparities.

2.1.2 Difference Based Techniques

Sometimes the most effective techniques are the simplest ones. A case in point is the simple differencing technique, which is applicable to tracking and motion analysis under constant illumination for a stationary sensor. Barron [Bar84] provides a review of works that isolate moving objects against a stationary background using simple differencing of consecutive images. As simple as this technique may seem, for many simple applications it is often overlooked. Noise removal, region growing, and object differentiation are some of the required post processing operations. Needless to say, this is the simplest, most intuitive and cheapest motion detection algorithm to date.
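A sketch of the differencing idea, under the stated assumptions of a stationary sensor and constant illumination (illustrative code, with a thresholding step standing in for the post-processing mentioned above):

```python
import numpy as np

def motion_mask(frame_a, frame_b, tau):
    # Flag a pixel as moving when its frame-to-frame intensity change
    # exceeds the threshold tau.
    return np.abs(frame_b.astype(float) - frame_a.astype(float)) > tau

# A 2x2 bright square moving one pixel to the right on a dark background.
a = np.zeros((8, 8)); a[3:5, 2:4] = 1.0
b = np.zeros((8, 8)); b[3:5, 3:5] = 1.0

mask = motion_mask(a, b, tau=0.5)
# Only the leading and trailing edges of the square are flagged; the
# overlapping column does not change and is (correctly) silent.
```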
More sophisticated techniques to improve the speed of this algorithm, by projecting the image samples onto different axes and isolating the moving samples, have also been studied, particularly in pattern recognition.

2.1.3 Differential or Gradient Based Techniques

Before describing the gradient based techniques we need to clearly define two terms which are sometimes used interchangeably (perhaps due to a lack of a consistently agreed upon definition).

• At a point on the imaging surface, the image flow depicts the instantaneous 2D projection of the 3D velocity vector of the corresponding point in the scene.

• The optic flow, on the other hand, is the apparent 2D velocity of a constant brightness pattern on the image plane.

It is the second definition that gives rise to the gradient based optical flow techniques. These matching methodologies are purely concerned with the 2D motion of the brightness patterns and typically work based on the assumption of conservation of image intensity, also referred to as the brightness constancy model. That is, one assumes that the intensity of a point between the two images being matched remains identical. In motion analysis it is this assumption that isolates the 2D projection of the 3D velocities in space. The first two papers on gradient based techniques were apparently due to Limb and Murphy [LM75] and Netravali and Robbins [NR79] in 1975 and 1979 respectively. But the most cited treatment of this topic has been that of Horn and Schunck [HS81]. Variations of Horn and Schunck's approach soon followed, notably [CK83, Nag83b, Nag86, TB87], and are still being developed to date.
In short, given our assumption of constancy of brightness:

$$I(x, y, t) = I(x + \delta x, y + \delta y, t + \delta t) \qquad (2.6)$$

Using the Maclaurin-Taylor series expansion of the right hand side, and discarding the second and higher order terms:

$$I(x, y, t) = I(x, y, t) + \delta x \frac{\partial I}{\partial x} + \delta y \frac{\partial I}{\partial y} + \delta t \frac{\partial I}{\partial t} \qquad (2.7)$$

Simplifying this equation and dividing by $\delta t$:

$$\frac{\delta x}{\delta t} I_x + \frac{\delta y}{\delta t} I_y + I_t = 0 \qquad (2.8)$$

where $I_x$, $I_y$ and $I_t$ are the partials of the intensity $I(x, y, t)$ with respect to $x$, $y$ and $t$ respectively. In the limit as $\delta t$ approaches zero, $\frac{\delta x}{\delta t}$ and $\frac{\delta y}{\delta t}$ are the instantaneous velocity components $u$ and $v$. Therefore one can write:

$$E_b = u I_x + v I_y + I_t = \mathbf{U} \cdot \nabla I + I_t = 0 \qquad (2.9)$$

Since the spatial and temporal derivatives of the image intensity can easily be derived, Equation 2.9 provides a linear constraint equation between the unknowns $(u, v)$. Obviously this is an ill-posed problem, and a second independent equation is required to solve for both variables. Horn and Schunck then introduce the assumption of locally smooth optical flow (often referred to as the optical flow smoothness constraint; this assumption is only valid for the interior points of smooth surfaces, and is thus not a real physical or photometric constraint, since it is obviously violated at motion boundaries). That is, one should try to minimize:

$$E_s^2 = \nabla u^T \cdot \nabla u + \nabla v^T \cdot \nabla v \qquad (2.10)$$

Minimizing $E_b^2 + \lambda E_s^2$ - where $\lambda$ is the regularization parameter - over the entire image provides one with two non-linear equations for the two unknowns, which can then be solved iteratively. Some researchers, particularly Scott [Sco88], are overly critical of the gradient based methods. But even though limited - again due to their underlying assumptions - differential based techniques provide an excellent mathematical treatment of motion and registration based purely on the photometric characteristics of images.
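As a numerical sanity check on Eq. 2.9 (an illustrative sketch of my own, not part of the original derivation): for a pattern translating with velocity (u, v), the three partials estimated by central differences should satisfy the constraint to truncation accuracy.

```python
import numpy as np

def constraint_residual(u, v, f, h=1e-3):
    # I(x, y, t) = f(x - u t, y - v t) is a pattern translating at (u, v);
    # estimate Ix, Iy, It by central differences and evaluate Eq. 2.9.
    x, y, t = 1.3, 0.7, 0.0
    I = lambda xs, ys, ts: f(xs - u * ts, ys - v * ts)
    Ix = (I(x + h, y, t) - I(x - h, y, t)) / (2 * h)
    Iy = (I(x, y + h, t) - I(x, y - h, t)) / (2 * h)
    It = (I(x, y, t + h) - I(x, y, t - h)) / (2 * h)
    return u * Ix + v * Iy + It

f = lambda x, y: np.sin(0.5 * x) * np.cos(0.3 * y)
res = constraint_residual(u=0.4, v=-0.2, f=f)
# res vanishes up to the O(h^2) truncation error of the differences.
```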
Perhaps the biggest limitations of Horn and Schunck's approach stem from discarding the higher order terms of the Taylor series. The assumption that these terms are negligible depends on the spatial frequency content of the intensity pattern and the magnitude of the displacement. In fact, highly textured regions or large displacements cannot be accommodated properly by this methodology. This is perhaps best seen through Figure 2.1.

Figure 2.1: Example of the effects of large motion and sharp intensity change on the differential algorithm in one dimension. f(x) and g(x) are the two functions related by the translation h. In frame (a) the value of the motion, h, is calculated correctly, namely $h = (f(x) - g(x)) / f'(x)$. The same approach in frame (b) gives the value d, which is much smaller than the actual motion. Frame (c) shows how the motion can be estimated completely incorrectly due to a sharp change in the derivatives of the function f(x). Obviously, in frame (c) one could have a quadratic function with a large second derivative to display the same effect.

This is a severe limitation. Image intensities are not always smoothly varying, particularly for textured surface patterns, and the assumption of smoothness of flow fails, especially at object boundaries. Another implied assumption is that the light source is fixed, both in its intensity and position, and that the surface shading of the object does not change due to motion or surface inter-reflections. Perhaps one of the most knowledgeable and prolific pioneers of the differential registration techniques is Hans-Hellmut Nagel. Nagel [Nag83a, Nag83b, Nag86] included the second order terms in the calculation, and restricted the direction of the displacement field variations. This approach is interesting but in fact aggravates one of the major problems of any differential based approach - namely, sensitivity to noise.
An excellent contribution of Nagel's work, even though restricted to Lambertian surfaces, was changing Horn and Schunck's ill-posed problem to one that is well defined by changing the nature of the assumptions. Mainly, the brightness constancy assumption was replaced with the temporal constancy of the spatial gradient of brightness.¹⁰ That is:

u E_xy + v E_yy + E_ty = 0;   (2.11)
u E_xx + v E_yx + E_tx = 0;   (2.12)

which leads to two linear equations for two unknowns. Although elegant and well formulated in terms of the underlying assumptions, the use of double differentials, and the physical conditions that have to be met, obviously render this work rather limited to specialized domains.

¹⁰This is of course equivalent to, and sometimes referenced as, the assumption of a locally constant flow field.

Even though not referred to by the authors, the intensity variation smoothness is partially responsible for the successful application of gradient based techniques in [BAK91] and [WM93]. Battiti et al. [BAK91] use a hierarchical pyramid to lock into the proper bandwidth for matching, but this could also be because of the smoothness of the image in the higher levels of the pyramid. They also provide a heuristic measure for determining which level of the pyramid is the correct one to stop at. Weber and Malik [WM93] have another formulation, which involves convolving the two images with smooth but orthogonal filters. Using the brightness constancy model, the motion sequence for each band limited image pair creates a linear constraint for u and v. Given several smooth image pairs there is an overdetermined system that is used to find (u, v). The unusual thing about this work is that all the image pairs used are smooth non-overlapping band passed versions of the original image.
Although the information content of the band pass filters is unique, as Nagel points out [Nag94]:

"Unfortunately, this kind of estimating optical flow must fail the more, the better the estimated partial derivatives approximate the real derivatives, because in this case, the equations tend to [should] become linearly dependent."

Tistarelli [Tis94] combined the first order constraint of Equation 2.9 and Nagel's second order constraints to form an overdetermined system of three equations in two unknowns. Each equation can be interpreted geometrically as a straight line in the (u, v) velocity space. Ideally all three lines should pass through the same point corresponding to the correct (u, v), but in reality they intersect in three separate points. One can then solve for the velocity vector by least squares, or take the normalized sum of the velocity calculations obtained by solving the three possible pairs of linear equations.

Campani and Verri [CV90] used the first order brightness gradients within a neighborhood to create an over-constrained linear system of equations, i.e.:

0 = I_x(x₀, y₀) u + I_y(x₀, y₀) v + I_t(x₀, y₀)
0 = I_x(x_i, y_i)(u + u_x(x_i - x₀) + u_y(y_i - y₀)) + I_y(x_i, y_i)(v + v_x(x_i - x₀) + v_y(y_i - y₀)) + I_t(x_i, y_i)   (2.13)

The final solution estimates not only the optical flow but also its spatial variations. As Nagel [Nag94] points out, since Campani and Verri choose a 10x10 neighborhood to generate 100 linear equations for 6 unknowns, one can extend this work to solve for the two temporal derivatives of u and v as well. Similar to Battiti et al., Glazer [Gla84, Gla87] also has an interesting hierarchical treatment of differential registration, but the brightness constancy assumption is particularly questionable in the presence of large disparities.¹¹ Finally, in [NY93] Negahdaripour and Yu try to partially eliminate the brightness constancy assumption.
This is different from Cornelius and Kanade's work [CK83], which completely ignored the assumption. It would be interesting to see a gradient-based hierarchical approach that would address large displacements as well as eliminate the constraint on conservation of image intensity. Lastly, one has to note Lucas and Kanade's work [LK81], which had quite a few novel ideas embedded in it. This particular work is harder to categorize since it was a cross between differential and region-based approaches.

In summary, differential approaches suffer from a few shortcomings, mostly caused by the underlying assumptions (as usual).

• The motion of objects has to be small for best effects; in fact, Schunck [Sch88] postulated that differential motion estimation techniques are similar to the neurophysiological mechanisms of short range motion first discovered in psychophysical experiments by Braddick [Bra74].¹²

¹¹One may argue (and in fact this is an implicit assumption often made) that the levels of the Laplacian pyramid provide a normalization effect.

• The smoother the image intensity variation the better. Many authors use smoothing operations to reduce the effect of sharp intensity discontinuities, as well as noise. One should ask, however, what happens when the intensity of a region is uniform.¹³

• Due to the use of numerical differentiation the method is very noise sensitive. Therefore differential techniques are not ideal for registration of images corrupted by random noise.

• The assumption of smooth disparity variation is quite limiting, particularly at motion boundaries. Several techniques, especially those using the Markov random field approach, add line processes to obtain Bayesian estimations of the motion discontinuities [KD92].
• Giachetti, Campani and Torre [GCT94] showed that because of the shocks and vibrations present during image acquisition for autonomous navigation of vehicles on highways, the "numerical computations of temporal derivatives is very noisy and therefore differential techniques to compute the optical flow do not provide adequate results."

• A constant and motionless light source is an easily violated assumption. There is always interreflection and casting of shadows among the objects moving in a scene. In many robotics applications the mobile automatons carry their own light source.

For the majority of these shortcomings I have referenced papers that try to overcome them. The problem is that many of these papers introduce their own assumptions and limitations, which is perhaps why no one has fully integrated them. On the positive side, one has to note that differential approaches do estimate motion to subpixel accuracy and are localized, hence do not suffer from the limitations of window based approaches.¹⁴ Lastly, in the gradient based techniques it is not clear how to address multiple motions, for instance due to transparency or reflection.

¹²The existence of neurophysiological short range vs. long range motion mechanisms has been brought under question by Cavanagh [CM89].
¹³Remember that no technique does very well when the intensity is constant. The difficulty with differential techniques is that the motion calculation is so localized that the neighborhood distribution of the intensity does not help.

2.1.4 Velocity tuned filtering

Spatio-temporal filters for motion analysis were first discussed as plausible neurophysiological mechanisms by Watson and Ahumada [WA83, WA85] and Adelson and Bergen [AB85, AB86].¹⁵ Figure 2.2 shows the fundamental principle behind velocity tuned filtering. In the rightmost figure a cross section of the moving strip parallel to the xt plane is shown.
An edge detector can easily find the slope of the leftmost and rightmost edges, thus calculating the velocity of the stripe. But since motion is not always so clearly defined, scale-specific velocity-tuned linear filters such as Gabor filters - which are Gaussian modulated 3D sine and cosine functions - are used for motion processing. The nicety of the Gabor filters lies in the fact that they provide minimum uncertainty in localization in time/space and frequency. Localization limits the size of the neighborhood contributing to the velocity. Localization in the frequency domain ensures that the proper spatial structures are being matched.

¹⁴Again one has to take this with a grain of salt! Calculating motion to subpixel accuracy has to take into account the optics and the sampling process that have given rise to a digital image. Lack of neighborhood support can be extremely detrimental and add to the noise process. Smoothing the image with a Gaussian, commonly used, has the effect of using a windowing operation. Lastly, the regularization processes often used, for example in Horn and Schunck, smooth the optical flow, thus causing neighborhood measurements (sometimes far away neighborhoods) to affect one another.
¹⁵Interested readers are particularly referred to [AB86] for an interesting discussion of the topic.

Figure 2.2: An illustration of spatio-temporal filtering, using the space-time cube. (a) the first frame of motion for a grey bar moving to the left with velocity v. (b) the stacked (or echoed) representation of this motion for five frames. (c) a horizontal cut for continuous time (i.e., in the (x,t) plane). Adopted from Adelson and Bergen.

In fact, frequency domain analysis shows that for pure lateral translation of the image, the 3D Fourier energy of the images (arranged in the space-time cube of Figure 2.2) has its energy concentrated along a plane whose slope and normal define the velocity.
That is:

ω_t = u ω_x + v ω_y   (2.14)

This means that to measure several motions within an image region one needs a collection of properly oriented filters which detect the proper motion at any location by having the highest energy (assuming no cross talk) when the proper velocity and frequency are matched. Heeger [Hee87] was the first person to implement the above mentioned formulation, shortly after this theory was introduced. This work was later improved by Porat and Friedlander [PF90], and other spatio-temporal formulations - such as phase based techniques by Jepson and Fleet [FJ90a] - soon followed. I will discuss some of the more recent and interesting approaches in this category later on, especially that of Shizawa and Mase [SM90] for multiple motion detection. The latest work in this area is probably by Niyogi and Adelson [NA94]. Some of the shortcomings of the technique can be summarized as:

• Almost all the papers encountered assume constant motion.

• Little or no analysis of noise, blurring or geometric distortions, especially on the phase-based spatiotemporal routines, has been conducted.

• Even though multiple motion has been discussed by some authors, recognition of occluding contours is not fully developed.

• Usually a large number of frames is used to isolate the motion plane. This is particularly true of phase-based techniques, which typically use nineteen frames or more. Although of interest for long motion sequences, this reduces the use of the technique for smaller frame sequences. (See [JJ94] for the application of phase based approaches to two frame stereo analysis.)

• Using a large number of frames also introduces delays in the estimation of the motion sequence. This does not mean that there has to be a problem with the causality of the motion sequence. It means that for a 19 frame sequence we will not have the proper motion estimates for the first 18 frames.
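In one spatial dimension Equation 2.14 reduces to a line, ω_t = u ω_x, which can be verified directly with a DFT. The check below is my own toy construction; the sign of the recovered slope depends on the Fourier convention used, so only the magnitude of the ratio is compared:

```python
import numpy as np

# For a 1D pattern translating at u pixels per frame, the (x, t) slice of the
# sequence has its 2D Fourier energy concentrated on a line through the
# origin: the 1D analogue of Equation 2.14. Illustration only.

nx, nt, u = 64, 64, 2
x = np.arange(nx)
frames = np.array([np.sin(2 * np.pi * 4 * (x - u * t) / nx) for t in range(nt)])

F = np.fft.fft2(frames)                      # axes ordered (t, x)
ti, xi = np.unravel_index(np.argmax(np.abs(F)), F.shape)
wt = np.fft.fftfreq(nt)[ti]                  # signed temporal frequency of the peak
wx = np.fft.fftfreq(nx)[xi]                  # signed spatial frequency of the peak
print(abs(wt / wx))                          # -> 2.0, the speed of the pattern
```

A bank of oriented Gabor filters effectively probes for exactly this concentration of energy along candidate lines (planes, in 2D plus time).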
On the positive side, this approach provides a novel and interesting methodology for registration, and tries to directly address both the frequency and spatial information of an image sequence through time. One application of this work to unstructured domains is that of Jahne for the analysis of ocean waves [J91].

2.1.5 Other Techniques

Phase correlation

Phase correlation is perhaps one of the simplest matching techniques in signal processing. Amazingly, however, the first written record of its use for image registration was not until 1975 [KH75]. This technique is, however, simple enough that it has been rediscovered by others [OC90], including this author. Phase correlation is basically based on the Fourier shift theorem. Given two signals, h₁(x) = s(x) and h₂(x) = s(x - τ), their Fourier transforms are related by:

H₂(f) = H₁(f) e^{-2πiτf}   (2.15)

Now given the above idealized relationship, there are a variety of ways one can formulate the phase correlation. These include:

e^{-2πiτf} = H₂(f) / H₁(f)   (2.16)
          = H₂(f) H₁*(f) / ||H₁(f)||   (2.17)
          = H₂(f) H₁*(f) / |H₂(f) H₁(f)|   (2.18)

where H*(f) is the complex conjugate of H(f) and ||H(f)|| = |H(f)|² = H(f) H*(f); then the disparity τ can be easily retrieved by the inverse Fourier transform of the above exponentials:

PC_forward = F⁻¹[ H₂(f) H₁*(f) / ||H₁(f)|| ]   (2.19)
PC_backward = F⁻¹[ H₂(f) H₁*(f) / ||H₂(f)|| ]   (2.20)
PC_unbiased = F⁻¹[ H₂(f) H₁*(f) / |H₂(f) H₁(f)| ]   (2.21)

and their use as the probability density function of the motion disparity.¹⁶ A more interesting issue is the relationship of phase correlation to regular spatial correlation. The numerator of Equation 2.18 is the cross-correlation of the two windows in the Fourier domain.
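In discrete form this recipe amounts to a few FFT calls. The sketch below (my own minimal implementation of the normalized cross-power spectrum idea, not code from the thesis) recovers a circular shift between two random images:

```python
import numpy as np

# Minimal phase correlation: normalize the cross-power spectrum to unit
# magnitude (cf. Equation 2.18), inverse-transform, and read the displacement
# off the location of the resulting peak. The eps guard is my own addition.

def phase_correlate(h1, h2, eps=1e-12):
    H1, H2 = np.fft.fft2(h1), np.fft.fft2(h2)
    cross = H2 * np.conj(H1)
    pc = np.fft.ifft2(cross / (np.abs(cross) + eps))
    return np.unravel_index(np.argmax(np.abs(pc)), pc.shape)

rng = np.random.default_rng(0)
img = rng.random((64, 64))
shifted = np.roll(np.roll(img, 5, axis=0), 3, axis=1)   # circular shift (5, 3)
print(phase_correlate(img, shifted))                    # -> (5, 3)
```

For an ideal shift the normalized spectrum is a pure complex exponential, so the inverse transform is (up to discretization) a delta at the displacement.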
Therefore, phase correlation is basically a normalized version of cross-correlation in the frequency domain,¹⁷ where each data point in the frequency plane is divided by the magnitude of the Fourier transform at that point, leaving the contribution of the phase difference e^{i(phase(H₁) - phase(H₂))}. Therefore one can write phase correlation as:

PC(H₁, H₂) = |H₁(f) H₂*(f)|⁻¹ · Corr(H₁, H₂)   (2.22)

Kuglin and Hines recognized this similarity and mentioned that a more general approach is not to completely ignore the magnitude of the cross correlation but to use a parametrized normalization:

PC(H₁, H₂, α) = |H₁(f) H₂*(f)|⁻ᵅ · Corr(H₁, H₂)   (2.23)

where 1 ≥ α ≥ 0. Therefore, at the two extremes of α = 0 and α = 1 one gets the regular correlation and phase correlation respectively.

¹⁶One should note again that PC_unbiased is not exactly the same as Equation 2.18 and is more computationally intensive.
¹⁷This is not to be confused with the normalized cross-correlation of Equation 2.2, which does not have a frequency domain implementation.

Many authors, such as Girod [Gir87] and Thomas [Tho87], have discussed phase correlation in motion compensated image coding. Wu and Fernando [WF88] provide a quantitative as well as qualitative comparison between phase correlation and differential techniques and found phase correlation to provide an overall better result in most cases. Pearson, Hines and Golosman [PJGK77] discuss a cleverly optimized hardware implementation of phase correlation, capable of comparing 128x128 windows at 30 Hz. As we will see, one of the advantages of phase correlation is that it can be used in multi-evidential correlation. Moreover, from Equation 2.17 one can see that phase correlation is not susceptible to scaled or constant intensity variation between the two images. Unfortunately, however, phase-based approaches are generally considered susceptible to noise.
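Equation 2.23 is equally short in code. The sketch below (my own; the test image is made up) shows that for a clean circular shift the correlation peak sits at the same place for α = 0, 0.5 and 1; what changes in practice is how sharp that peak is:

```python
import numpy as np

# Parametrized normalization of Equation 2.23 (Kuglin and Hines): alpha = 0
# leaves ordinary frequency-domain cross-correlation, alpha = 1 gives phase
# correlation, intermediate values trade magnitude against phase.

def generalized_correlation(h1, h2, alpha, eps=1e-12):
    H1, H2 = np.fft.fft2(h1), np.fft.fft2(h2)
    cross = H2 * np.conj(H1)
    return np.fft.ifft2(cross / (np.abs(cross) + eps) ** alpha)

rng = np.random.default_rng(1)
a = rng.random((32, 32))
b = np.roll(a, 4, axis=1)                     # shift a by 4 columns
for alpha in (0.0, 0.5, 1.0):
    surf = np.abs(generalized_correlation(a, b, alpha))
    peak = np.unravel_index(np.argmax(surf), surf.shape)
    print(alpha, peak)                        # peak at (0, 4) for every alpha
```

With noise or intensity scaling between the frames, the choice of α matters much more, which is exactly the trade-off Kuglin and Hines were after.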
In fact, Lee, Krile, and Mitra [LKM88] show the relative weakness of phase correlation to noise compared to cepstrum. Perhaps realizing this problem, Pearson, Hines and Golosman provided an interesting treatment of the noise, including low pass filtering the result. Nonetheless, phase correlation is still a complementary technique to the cepstrum approach, and one that is very powerful, robust, popular and interesting. Section 4.8.2 gives a more detailed treatment of how phase correlation, cepstrum and regular correlation are related to one another mathematically.

Hadamard Domain Technique

In this section I will cover a disparity measurement technique relatively unknown in computational vision. There are three reasons that I will cover this technique. First, to bring this technique to the attention of other researchers in the field for further research. Second, the approach is extremely fast and it is specifically designed for discretely sampled images with integer intensity values. Third, the approach can be used in conjunction with multi-evidential correlation.

The Walsh-Hadamard transform is perhaps one of the simplest and fastest orthogonal encoding methods. To describe the technique briefly, given a sequence of discretized images I(x, y, t), the Walsh-Hadamard coefficients are:

H(α, β, γ) = Σ_x Σ_y Σ_t I(x, y, t) (-1)^{p(x, y, t, α, β, γ)}   (2.24)

where p(...) is the parity function:

p(x, y, t, α, β, γ) = Σ_{l=0}^{⌈log₂(x_max - x_min)⌉} α_l x_l + Σ_{m=0}^{⌈log₂(y_max - y_min)⌉} β_m y_m + Σ_{n=0}^{⌈log₂(t_max - t_min)⌉} γ_n t_n   (2.25)

The index values l, m and n simply refer to bit numbers in x, y and t respectively, and they range from 0 to the maximum number of bits applicable. The notation w_i ∈ {0,1} refers to the ith bit of the integer w. The calculation of the parity function seems deceivingly difficult. But in reality each of the three summations merely calculates the binary AND of the two indices, and counts the number of ones.
So a better definition of p, which I have not found in the literature, is:

p(x, y, t, α, β, γ) = #(α ∧ x) + #(β ∧ y) + #(γ ∧ t)   (2.26)

where #(w) stands for the number of bits set to 1 in the standard binary coded representation of the number w, and ∧ is the bitwise AND. Now, how can we use the Hadamard transform to calculate image displacements? Following the examples of [YCC92, LC88, FC89] I will demonstrate the Hadamard based algorithm for an image consisting of a single moving non-zero pixel with intensity L. Given two images, I(x₁, y₁, 0) and I(x₂, y₂, 1), from Equation 2.24:

H(0, 0, 0) = I(x₁, y₁, 0)(-1)^{p(x₁,y₁,0,0,0,0)} + I(x₂, y₂, 1)(-1)^{p(x₂,y₂,1,0,0,0)} = 2L   (2.27)

and

H(α, 0, 1) = L(-1)^{p(x₁,α)} + L(-1)^{p(x₂,α)+1}   (2.28)

Assuming that α = 2ⁿ and using Equation 2.26, we see that only the nth bit of x₁ and x₂ contributes to Equation 2.28, and:

H(2ⁿ, 0, 1) = L((-1)^{x₁,ₙ} - (-1)^{x₂,ₙ})   (2.29)

where x₁,ₙ and x₂,ₙ are the values of the nth bit of x₁ and x₂ respectively. Obviously, these bit values can only be 0 or 1. Plugging the four possible combinations into Equation 2.29:

H(2ⁿ, 0, 1) = 2L(x₂,ₙ - x₁,ₙ)   (2.30)

therefore,

x₂,ₙ - x₁,ₙ = H(2ⁿ, 0, 1) / H(0, 0, 0)   (2.31)

But the disparity in the x direction (assuming integer disparities) can be represented as:

δx = x₂ - x₁ = Σ_{n=0}^{⌈log₂(x_max - x_min)⌉} 2ⁿ (x₂,ₙ - x₁,ₙ)   (2.32)

which is simply the binary expanded representation of the disparity. Using Equations 2.31 and 2.32 we see that:

δx = Σ_{n=0}^{⌈log₂(x_max - x_min)⌉} 2ⁿ H(2ⁿ, 0, 1) / H(0, 0, 0)   (2.33)
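The single-pixel derivation above is short enough to execute directly. The following sketch is my own one-dimensional transcription of Equations 2.26-2.33 (all names are mine); it recovers the displacement of one bright pixel from ratios of Walsh-Hadamard coefficients:

```python
# Hadamard-ratio disparity for a single moving non-zero pixel, in one spatial
# dimension, following Equations 2.26-2.33. Illustration only.

def parity(w1, w2):
    return bin(w1 & w2).count("1")          # popcount of the bitwise AND (Eq. 2.26)

def hadamard_coeff(frames, alpha, gamma):
    # H(alpha, gamma) = sum_{t,x} I(x,t) * (-1)^(parity(alpha,x) + parity(gamma,t))
    return sum(I * (-1) ** (parity(alpha, x) + parity(gamma, t))
               for t, frame in enumerate(frames)
               for x, I in enumerate(frame))

def disparity(frames, nbits):
    h0 = hadamard_coeff(frames, 0, 0)       # equals 2L for one pixel of intensity L
    # Each bit of the displacement comes from the ratio H(2^n, 1) / H(0, 0) (Eq. 2.33)
    return sum((2 ** n) * hadamard_coeff(frames, 2 ** n, 1) / h0 for n in range(nbits))

L, x1, x2 = 7, 3, 9                          # intensity and positions in the two frames
f0 = [0] * 16; f0[x1] = L
f1 = [0] * 16; f1[x2] = L
print(disparity([f0, f1], nbits=4))          # -> 6.0, i.e. x2 - x1
```

Note that the per-bit ratios can be negative (a borrow in the binary difference), and the weighted sum still reassembles the signed integer displacement exactly.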
Using symmetry, we can derive a similar equation for the disparity in y:

δy = Σ_{n=0}^{⌈log₂(y_max - y_min)⌉} 2ⁿ H(0, 2ⁿ, 1) / H(0, 0, 0)   (2.34)

To extend this analysis to an image, I, containing p non-zero pixels instead of just one, sum the contributions H_i of the individual pixels:

δx = Σ_{i=0}^{p-1} Σ_{n=0}^{⌈log₂(x_max - x_min)⌉} 2ⁿ H_i(2ⁿ, 0, 1) / Σ_{i=0}^{p-1} H_i(0, 0, 0)   (2.35)

but from Equation 2.33, Σ_n 2ⁿ H_i(2ⁿ, 0, 1) = δx H_i(0, 0, 0); therefore

δx = Σ_{i=0}^{p-1} δx H_i(0, 0, 0) / Σ_{i=0}^{p-1} H_i(0, 0, 0)   (2.36)

which is the same as Equations 2.33 and 2.34.

Unlike other techniques that are numerical implementations of algorithms developed for the continuous domain, Hadamard based disparity analysis is designed for modern digital computers and imaging equipment. The high speed of the algorithm and the availability of fast Walsh-Hadamard transforms, both in hardware and software, is attracting more research on this technique, especially for image coding and motion compensated imaging. But the algorithm in its basic form suffers from a few shortcomings as well.

2.2 Stereo

Even though binocular stereo analysis can be viewed as a special case of motion registration [Hor86], the significance of the topic and its treatment in this thesis (including trinocular and multiple baseline stereo), as well as the unique matching methodology used for stereo, warrants it special attention. There are many reasons for stereo analysis, including segmentation and figure-ground separation [Gri93], and gaze control through vergence [OC91, Ols93, PE, Kam93, TOM94, TM94]. But historically stereo vision systems have been used in vision and photogrammetry for three dimensional scene reconstruction and distance measurement.
As Regan et al. point out [SW90] (page 13, and chapter 12, page 318):

The hypothesis that binocular disparity is the effective stimulus for depth sensation is credited to the astronomer Johannes Kepler. The demonstration that disparity alone can produce depth sensation was first provided by Wheatstone (1852) ...

In the next section, I will briefly review the geometrical constraint provided by the stereo configuration. The material in this section is not only of importance to stereo vision, but has important applications in motion analysis. It provides an excellent example of how a physical configuration restricts the space of possible solutions, thus furnishing a genuine constraint. In Section 2.2.2 I will review token based matching, which has had its history and most of its applications in stereo computational vision.

2.2.1 Epipolar Constraint and Stereo Vision

Figure 2.3 shows two cameras and the projection of a point P into their image planes. Given the relative position of the two cameras, their focal lengths, and the coordinates of the two projections, the three dimensional location of the point can be derived through geometry. But matching and registration of points in binocular stereo is simplified through a physical constraint called epipolar geometry.

Figure 2.3: Epipolar constraint and the verging cameras.

Simply stated, given two cameras, the line connecting their focal points, F and F', intersects the two image planes at points e and e', called their respective epipoles.¹⁸ Now a point in space, P, projects onto the two image planes at points p and p'. But the plane that passes through P, F and F' intersects the two image planes at lines ep and e'p', referred to as the epipolar lines for the point P. One can utilize epipolar lines to one's advantage, because these lines reduce the search space for finding matches from a two dimensional to a one dimensional space.

¹⁸I am using upper case letters for points in space, and lower case letters for points on the image planes.
That is, to find a match for p we can construct the line ep and its corresponding epipolar line in the second image, which passes through e'. The match point p' is guaranteed to lie on this line, thus constraining the search space. Obviously, all the discussion so far also applies to frames captured by a moving camera. In fact, finding epipoles between two image frames and establishing correspondences has been used for finding the intrinsic and extrinsic camera parameters [LC94]. But the advantage of stereo analysis is that one can localize the epipoles as desired, thus simplifying the computations and increasing the accuracy of our measurements. This is the reason that in the majority of stereo vision systems (binocular or trinocular) the image planes are kept coplanar¹⁹ - i.e., the optical axes of the two cameras are parallel. In this case, the epipoles for the two images are located at infinity. This means that the epipolar lines (all of which have to pass through the epipoles) are parallel to one another. In a calibrated binocular stereo system, for instance, the two cameras are parallel to one another and located horizontally. Thus, to find a match for a point in one image, one only has to search along the same row in the second image. Another advantage of this configuration is that it is "order preserving". That is, if there are points on an epipolar line ordered from left to right, their matches are also ordered from left to right. This property, which does not hold for non-calibrated stereo, reduces the search area even further.

¹⁹This assumes the focal lengths of the two cameras are equal.
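For this calibrated, parallel-axis configuration, depth follows from similar triangles as focal length times baseline divided by horizontal disparity. A toy computation (the numbers are made up for illustration):

```python
# Depth from disparity for a calibrated parallel-axis stereo pair:
# Z = (focal length in pixels) * (baseline in metres) / (disparity in pixels).
# Illustrative values only; no real camera is being modelled here.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    return focal_px * baseline_m / disparity_px

# With f = 800 px and a 0.25 m baseline, a 10-pixel disparity places the
# point at 20 m; halving the disparity doubles the depth.
print(depth_from_disparity(800, 0.25, 10))   # -> 20.0
print(depth_from_disparity(800, 0.25, 5))    # -> 40.0
```

The inverse relationship is worth noting: depth resolution degrades quadratically with distance, since a fixed subpixel disparity error translates into an ever larger depth error as the disparity shrinks.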
It is easy to show that for calibrated cameras, the depth of a point from the camera baseline is:

Z = f_c b / δx   (2.37)

where f_c is the focal length of each camera, b is the distance between the two cameras - often referred to as the baseline - and δx is the horizontal disparity in the image planes. Of course, if the two cameras are located vertically, then the epipolar lines run along the columns of the images. Using three cameras provides two epipolar lines for each point in an image, thus constraining the problem even further and increasing the accuracy of the match. The configuration often used is a right angle isosceles triangle, with calibrated cameras [KY90]. Later chapters provide a more detailed discussion of trinocular and multiple-baseline stereo. It is, however, noteworthy that the epipolar constraint is not restricted to stereo cameras, and is just as applicable to two frames captured from a moving camera in a stationary scene.

2.2.2 Stereo Matching

Because of its applicability and importance, there are already several thorough reviews of stereo matching algorithms [BF82, Bak82, DA89]. One class of algorithms that has found particular applicability in stereo analysis is the feature based techniques. Simply stated, feature or token based techniques establish correspondences between individual spatial image primitives in two frames. Each feature can have different attributes to distinguish it from nearby ones, thus improving the robustness of the result. The methodology relies upon the ability of the system to both find an adequate number of tokens and then accurately localize them in the two frames. In this sense finding tokens is simply a filtering operation [Bak82]. The two most cited token based techniques are those of Moravec [Mor80] and Grimson [Gri80]. Moravec's interest operator was a corner detector based on maximal directional intensity variance.
Once interesting points in the two scenes were found and matched - using correlation - the cues were used to provide a mobile robot with motion calibration information and to locate possible obstacles. Grimson's technique was the first implementation of David Marr and Tomaso Poggio's stereo algorithm [MP77]. This algorithm, which had its roots in preliminary studies of the psychophysics of human stereopsis, was based on matching intensity edges, with the intensity gradient and edge orientation as attributes of each feature. The matching was conducted at different scales by using different size ∇²G operators. The matches were then used for controlling the vergence angle of a stereo system. Similar pioneering systems, notably that of Hildreth [Hil83] for motion analysis, followed.

Using tokens and their attributes for correspondence is the traditional symbolic image matching. The primitives used should be stable and easily detectable from both view points, dense enough to give adequate information about the surface characteristics, yet sparse enough that they are differentiated easily within a neighborhood. The feasibility of these requirements, much like other techniques in computational vision, is scene dependent. But in feature based approaches the probability of missing features between the two frames, and incorrect localization of features, is a more serious problem. The edge based techniques, for instance, suffer immediately from the uncertainty in the location of edges found using linear filters. Even more serious is the implied assumption that intensity discontinuity separates image locales of equivalent depth.²⁰ Nonetheless, symbolic image matching is a significant domain of computational vision that should not be overlooked. It is not always the case that one needs dense disparity maps, and stable features can greatly improve the stability and accuracy of a system.
Unlike the area based techniques, given enough attributes, the matching can be quite local - i.e., no area based matching is needed. Lastly, because of the locality of the matching process - and assuming low angle cameras (i.e., orthographic projection) - foreshortening does not affect the matching process.

²⁰To see how easily this assumption is violated, consider the occluding edges of a cylinder viewed by a calibrated stereo pair.

2.3 Hierarchical techniques

Hierarchical approaches [BA83] have gained great popularity in the computational vision literature because they address several problems faced by a variety of techniques. Two obvious limitations in region-based matching algorithms are the choice of window size and the search area [BYX82, BYX83]. Just as in edge detection, large windows contain more structure and information that can be used and are therefore less susceptible to noise. They do, however, blur the disparities (especially at discontinuities, or for curved objects) and suffer from localizing the exact portion of the image corresponding (or contributing most) to the disparity. Small windows are better overall [Nis84] but can produce noisy outputs. Choosing small search regions can give rise to incorrect disparities if the actual motion is outside the "guessed" area. On the other hand, large regions also increase the probability of false matches and increase the computational cost.

Pyramid structures provide a natural answer to these problems. If S_max is the maximum expected scene displacement at the base level of the pyramid, at the kth level the disparity is reduced to S_max 2⁻ᵏ. By the same token, matching a window of size w (each side) on the kth level is comparing a window of size w 2ᵏ.²¹ Therefore, by matching at the higher levels of the pyramid one can get a rough estimate of the motion disparity. These values are then doubled and passed on as the centers of the search regions for the lower levels.
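The estimate-double-refine loop just described can be written in a few lines. The sketch below is my own one-dimensional toy (SSD matching, a two-tap averaging pyramid, circular shifts), not an implementation from the thesis:

```python
import numpy as np

# Coarse-to-fine 1D registration: estimate the shift at the coarsest pyramid
# level, then at each finer level double the estimate and refine it with a
# small local SSD search. Illustration only.

def downsample(s):
    return 0.5 * (s[0::2] + s[1::2])             # 2-tap averaging pyramid level

def ssd_shift(a, b, center, radius):
    # Best integer d in [center - radius, center + radius] with b(x) ~ a(x - d).
    shifts = range(center - radius, center + radius + 1)
    return min(shifts, key=lambda d: np.sum((a - np.roll(b, -d)) ** 2))

def coarse_to_fine(a, b, levels=4, radius=2):
    pyramid = [(a, b)]
    for _ in range(levels - 1):
        a, b = downsample(a), downsample(b)
        pyramid.append((a, b))
    d = 0
    for la, lb in reversed(pyramid):             # coarsest level first
        d = ssd_shift(la, lb, center=2 * d, radius=radius)
    return d

x = np.arange(64)
signal = np.cos(2 * np.pi * x / 64)
print(coarse_to_fine(signal, np.roll(signal, 12)))   # -> 12
```

A shift of 12 samples would need a costly radius-12 search at full resolution; here each level searches only a radius-2 neighborhood around the doubled coarse estimate.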
Using this approach, on each level one conducts matching over a small neighborhood. The total cost of this method is much less than a complete search at the lowest level. The window size can also be kept constant, or changed based on the information contained in the windows and the goodness of the match.

A more important reason for using hierarchical pyramids is that the ideal match is usually detectable at particular scales. That is, using the high resolution image for matching implicitly assumes that the spatial and temporal sampling scales are correct. But for large motion the high frequency image components are aliased. Such aliasing is a serious source of matching errors; thus for high disparities, ignoring high resolution information is not only computationally efficient, it is necessary [BHK91, BAK91, Ana87]; this is particularly true when one has a poor initial guess of the disparity value. Using the Laplacian of Gaussian pyramid allows one to "lock into" the correct disparity at higher levels and pass it on to the lower (higher acuity) levels of the pyramid. A question that remains to be addressed is how to recognize the particular structural level of the pyramid, and discontinue the search if the lower levels of the pyramid obscure the disparity by adding random errors.

²¹This is of course a bit more complicated. With Laplacian of Gaussian pyramids each level is complementary to the previous ones. But the structures included are larger anyway, so one wants larger windows.

2.4 Summary

This chapter provided a brief review of multi-dimensional signal and image registration. The majority of the techniques reviewed - mean squared difference (sum of squared differences), mean absolute difference (sum of absolute differences), spatially tuned filters, Hadamard based routines and differential approaches - have their roots in linear filtering and statistics.
Phase correlation (and non-linearly preprocessed token based) methods were perhaps the only two non-linear techniques reviewed. Section 2.2 reviewed stereo matching as a special case of motion analysis and image registration. Hierarchical pyramids and their role in matching were also discussed, and their role in reducing the assumptions and limitations of region-based methods explored.

Chapter 3

Cepstral Analysis: Background, Mathematical Preliminaries, and Modifications

Mathematics not only defines the properties of a technique formally, it also furnishes clues and specifics for improving its behavior. This chapter is an investigation into the detailed theoretical aspects of the cepstrum filter. Its primary objective is to provide a simple yet mathematically sound and well-formulated understanding of cepstrum, and to introduce computational improvements to two commonly used versions of the cepstral filter.

As pointed out earlier, the visual echo framework is based on the simple observation that self-similar patterns in time or space can be viewed as spatial or temporal echoes of one another. Consequently, the calculation of disparities or the isolation of these patterns (as in the case of stationary textures) can be seen as the determination of echo arrival periods in time or space. Cepstral analysis [BHT62] is a powerful nonlinear adaptive signal processing methodology developed for the detection and estimation of echoes and their arrival period. Because of the significance of echoes, cepstrum has been studied and used in a variety of scientific fields such as acoustics, oceanography, seismology, geophysics, wave analysis, speech processing (phoneme chunking), radar and sonar processing, medicine, image deblurring and restoration, system analysis and signal recovery, just to name a few. The breadth of applications of cepstrum and its variations gives a vast amount of background material to be drawn upon.
Consequently not all the literature and research work will be presented here (for reviews of cepstrum see [CSK77, LE84]). The aims of this chapter are to:

1. present a brief historical review of the origins of cepstrum, with emphasis on its utility in computational vision;

2. provide a detailed mathematical treatment of cepstrum, its themes, variations, and extensions;

3. introduce computational and performance improvements to power and differential cepstrum for use in the detection of echoes, and provide a comparison between these methods and the traditional cepstral techniques;

4. furnish the mathematical foundations needed to examine cepstrum's properties and limitations.

Section 3.1 provides a brief review of cepstrum's origins, concentrating primarily on its applications to disparity analysis in computational vision (i.e., due to the breadth of the domain I will not consider other applications of cepstrum in computational vision, such as deblurring, image restoration, etc.). Section 3.2 provides a detailed mathematical treatment of cepstrum and its variations. Section 3.3 provides computational and performance improvements to differential cepstrum and power cepstrum, which are the two most popular and robust forms of cepstrum. These improvements can also be applied in other application domains that frequently utilize cepstrum. Finally, Section 3.5 concludes this chapter with a brief summary.

3.1 A Brief History of Cepstrum

It would be prohibitive to discuss all the papers that utilize, compare, or even discuss cepstral filtering and its variations or extensions, such as homomorphic filtering. But, whenever possible, it is always interesting and informative to seek the origins of a technique and the person or people who first invented an algorithm.
Silvia and Robinson [SR78] have traced the mathematical evolution of cepstrum for signal analysis to the classical works of Poisson (1823) and Schwarz (1872) on potential theory and the logarithmic potential. Others, Szego (1915) and Kolmogorov (1939), used cepstrum to factorize the power spectrum of random processes and to extract stable causal systems. Robinson's thesis (1954) was perhaps one of the first applications of homomorphic filtering to engineering, but it was Bogert, Healy and Tukey [BHT62] who coined the terms "cepstrum" and "quefrency analysis" and used cepstrum as a signal processing technique [SR78]. Bogert and his colleagues were primarily interested in the recognition and retrieval of echo arrival periods due to seismic waves. Thus they defined cepstrum as the power spectrum of the logarithm of the power spectrum of the signal (the term power spectrum here refers to the amplitude of the Fourier transform of the signal). Due to its definition, this variation of cepstrum involves no phase information, i.e., it only uses the magnitude of the spectrum. Therefore, it is often referred to as "power" cepstrum and, as we will see later, avoids the problems of phase unwrapping that can hinder echo detection [Tri77]. Soon after, different variations and numerous applications of cepstrum followed. In fact, the papers utilizing cepstrum strictly for image enhancement and deconvolution, disparity analysis, or simple non-linear filtering are too numerous to cite. But there are a few papers that have to be cited as milestones in this area. For instance, some of the first applications of cepstrum to image processing were those of Oppenheim, Schafer and Stockham [OSS68] and Dudgeon [Dud77]. Restricting our attention to disparity analysis, Lee, Krile and Mitra [LKM87, LKM88, LMK88, LMK89, KM92] were the first researchers to use power cepstrum specifically for stereo and motion analysis. Inspired by the columnar structure of primate ocular
dominance columns, Yeshurun and Schwartz [YS89] postulated a plausible theory for the neurophysiological mechanism of stereo vision based on power cepstrum. Because of its speed and versatility, many researchers have used cepstrum in real-time vergence control for binocular robots. Most notably, Olson and his colleagues [OC90, OC91, OL92, Ols93], Coombs and Brown [CB92], Pahlavan and his colleagues [PE93, PUE92] and most recently Taylor, Olson and Martin [TOM94] have successfully used cepstrum for stereo convergence and motion tracking. Both Ludwig et al [LNN92] and Yeshurun [YS89] discussed the robustness of cepstrum with regard to image distortion, specifically rotations of up to 8 degrees and expansions of up to 6 percent. This is consistent with the results of Lee, Ramirez and Mitra [LRM91], who also obtained robust cepstral results for rotations of up to 6 degrees, compared with only 2 degrees for the phase correlation of De Castro and Morandi [CM87]. Ludwig, Neumann and Neumann [LNN94] expanded on the stereopsis work of Yeshurun and Schwartz and "corrected some of the misstatements" of [YS89] but unfortunately, as we will see later, introduced some new potential problems by not considering the effect of their windowing operations. Much like Yeshurun et al and Ludwig and his colleagues, Jones and Lamb [JL93] used cepstral analysis for stereo analysis and depth measurement. In this case, Jones and Lamb built a range finder using a camera with two apertures; the resulting image consists of an image and an overlapped echo of itself, which can then be studied using cepstral analysis for disparity measurement. The majority of other cepstrum applications in image processing involve image restoration, i.e., deblurring, noise removal, and contrast enhancement.
In the next sections of this chapter I provide a more detailed mathematical treatment of the cepstrum variations, and introduce computational and performance improvements to differential cepstrum and power cepstrum (and consequently complex cepstrum) [BL93a]. The following chapters will present the broader applications of cepstrum and the development of the visual echo paradigm for computational vision problems [BL92a, BL92b, BL93a, BL93d, BL93b, BL93c, BL92c, BXL94, BL94].

3.2 Themes, Variations and Mathematical Preliminaries

This section provides mathematical treatments of cepstrum and its variations as they appeared chronologically in the literature. The aim is to formally explain the properties of these techniques, and to provide the necessary formulas which have led to the improvements discussed in Section 3.3. The mathematical derivations and explanatory formulas are written for one dimensional signals, but can be easily generalized to higher dimensions. This simplicity will become particularly beneficial once we discuss the differential and complex cepstrum.

3.2.1 Prototypical Cepstrum

Before addressing specific cepstral techniques and their properties, it is helpful to define the prototypical cepstrum as the common definition underlying all cepstral filters. In general,

Prototypical Cepstrum = some kind of (linear) transform, followed by some kind of logarithm, followed by some (other) kind of transform.

Traditionally, the transforms used are Fourier and inverse Fourier transforms or the power spectrum, which turn convolutions into multiplications.
The original inventors of cepstrum discovered that to extract the echo - which is equivalent to a convolution with a delayed Kronecker δ - one must follow the first transform by some kind of pointwise logarithm to turn the multiplication into addition. The second (inverse) transform then returns the convolution filter to its transformed spatial form. In summary, the prototypical cepstrum of two convolved functions is equal to the sum of the prototypical cepstrums of the two functions. This property makes cepstral filters particularly suited for deconvolving the delayed echo arrival periods. (Alternatively, one could replace the scalar parameters x, f and τ with vectors representing their multi-dimensional versions, and further replace their multiplications with dot products.)

3.2.2 Power Cepstrum

Bogert, Healy, and Tukey [BHT62] introduced cepstral filtering and quefrency analysis for estimating the arrival time of the echo of a complex signal. They defined (power) cepstrum as the power spectrum of the logarithm of the power spectrum of the signal. Given a signal h(x) consisting of an input s(x) and its echo delayed by τ:

h(x) = s(x) + s(x - τ)   (3.1)

and its Fourier transform (using the Fourier shift theorem),

F{h(x)} = F{s(x) + s(x - τ)}
H(f) = S(f)(1 + e^{-2πifτ})   (3.2)

where H(f) and S(f) are the Fourier transforms of h(x) and s(x) respectively. The logarithm of the power spectrum of h(x) is then:

log(||H(f)||) = log(||S(f)||) + log(1 + cos(2πτf)) + log 2   (3.3)

Figure 3.1 shows the plot of the function log(1 + cos(2πf)). (One can set the value of τ by scaling the horizontal axis.) As expected, the function is periodic with a maximum value of log 2, and asymptotically approaches -∞ in the limit as the cosine approaches -1.

Figure 3.1: Plot of log(1 + cos(2πf)).
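The derivation above is easy to verify numerically. The following sketch (Python with NumPy; the signal length, random content, and echo period are illustrative choices of mine, not part of the original derivation) forms a signal plus a delayed replica and locates the power cepstrum peak at the echo arrival period:

```python
import numpy as np

def power_cepstrum(h):
    # power spectrum of the log of the power spectrum;
    # a small epsilon guards the logarithm against zeros
    spec = np.abs(np.fft.fft(h)) ** 2
    return np.abs(np.fft.fft(np.log(spec + 1e-12))) ** 2

rng = np.random.default_rng(0)
tau = 17                          # echo arrival period, in samples
s = rng.standard_normal(256)
h = s + np.roll(s, tau)           # signal plus its delayed (circular) echo

c = power_cepstrum(h)
# skip the low quefrencies dominated by log ||S(f)||, and the
# symmetric upper half of the quefrency axis
peak = 4 + np.argmax(c[4:len(c) // 2])
print(peak)                       # the strongest peak falls at tau
```

Note that np.roll makes the echo circular, so the Fourier shift theorem holds exactly; with a real windowed echo the peak is broadened but remains at τ.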
An obvious choice of analysis from here is to take the MacLaurin-Taylor series expansion of log(1 + cos(2πτf)), which results in:

log(1 + cos(2πτf)) = Σ_{n=1}^{∞} (-1)^{n+1} cos^n(2πτf) / n   (3.4)

In effect, the logarithm operation transforms the power spectrum of h(x) into the logarithm of the power spectrum of the single input, s(x), and a residual summation of decreasing cosine polynomials generated due to the presence of the delayed replica. (We will see that Equation 3.4 comes in very handy later on - thus its introduction here.) A better transformation of log(1 + cos(2πτf)) for our present analysis is its Fourier (or more appropriately its cosine) series expansion (see [Spi74], page 172; [GR80], page 45, Equation 1.514):

log(1 + cos(2πτf)) = log(2 cos²(πτf))
                   = log 2 + 2(-log 2 + Σ_{n=1}^{∞} (-1)^{n+1} cos(2πnτf) / n)   (3.5)

Substituting the above equation into Equation 3.3 one gets:

log(||H(f)||) = log(||S(f)||) + 2 Σ_{n=1}^{∞} (-1)^{n+1} cos(2πnτf) / n   (3.6)

Thus the logarithm not only isolates the echo arrival period, τ, from the contribution of the original signal, it also places it in a series of decreasing cosine functions. To extract τ, traditionally, a second power spectrum has been applied, which in turn transforms the cosine series into

Σ_{n=1}^{∞} (-1)^{n+1} δ(x - nτ) / n   (3.7)

i.e., a ripple of decreasing Kronecker deltas with periodicity τ.

Figure 3.2: Cepstral analysis of synthetic motion with (x,y) displacement of (7,3). The lower portion of the figure displays the cepstral result. The top peak is located at 3 pixels down and 7 pixels plus the width of the window horizontally. The lower peak is generated due to the symmetric properties of the power spectrum of a real signal.

Figure 3.2 shows the power cepstrum of an image and its spatially transformed temporal echo for a simulated motion field of three rows down and seven columns to the right.
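The two-dimensional simulation just described can be reproduced in a few lines. The sketch below (Python with NumPy; the frame size and random content are choices of mine) adds a frame to a circularly shifted copy of itself rather than concatenating, so the peak lands directly at the displacement, together with the symmetric twin mentioned in the caption of Figure 3.2:

```python
import numpy as np

rng = np.random.default_rng(1)
frame = rng.standard_normal((63, 63))
moved = np.roll(frame, (3, 7), axis=(0, 1))   # 3 rows down, 7 columns right
h = frame + moved                             # summed, not concatenated

spec = np.abs(np.fft.fft2(h)) ** 2
ceps = np.abs(np.fft.fft2(np.log(spec + 1e-12))) ** 2
ceps[0, 0] = 0.0                              # suppress the trivial origin peak
r, c = np.unravel_index(np.argmax(ceps), ceps.shape)
print(r, c)   # the displacement (3, 7), or its symmetric twin (60, 56)
```

Because the log power spectrum is real, its transform is symmetric: the peak at (3, 7) and its twin at (60, 56) have identical height, which is the direction ambiguity of power cepstrum discussed later in Chapter 4.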
In these examples, for the sake of display clarity, I concatenated the images horizontally (i.e., placed them side by side to create a new image) instead of simply adding them. Therefore the horizontal disparity is increased by the width of the window under consideration. This point is worth emphasizing. Many authors, perhaps due to a misstatement in [YS89], place the origin of the coordinate system at row zero but the middle column; and to correct a non-existent bias on the part of cepstrum they rotate the rows, thus placing the new origin at the center of the cepstrum plane. The reasoning often provided for this is that using the Fourier transform displaces the origin of the signal, and that therefore the origin of cepstrum (relating to (0,0) disparity) has been moved to the top rather than being at the center. But this is quite erroneous. First, if the origin were displaced, it would be moved for both the row coordinate and the column coordinate. Second, since we are taking two transforms, the origin of the coordinate system is moved back to the upper left corner of the cepstral plane. What needs to be realized is that the echo arrival period of the image in the horizontal direction has been increased by the window width. That is, if the true disparity between the two windows was (τ_rows, τ_cols), then because of the concatenation of the two windows (as opposed to adding them) the new disparity is (τ_rows, window width + τ_cols), which is where the cepstrum peak appears. A beneficial effect of concatenating the images is that it moves the peak of the cepstrum away from the corners of the cepstrum plane, where the aliasing effects are largest and could hinder the peak recognition process. The obvious cost of the concatenation is the additional processing of Fourier transforms and pointwise logarithms, which amounts to nearly doubling the processing time.
Another obvious result of such concatenation is that the Fourier transform of the resulting image has uneven spatial sampling, with the horizontal components having twice the resolution of those in the vertical direction. To compensate for this (and assuming a 1:1 spatial sampling aspect ratio) one can pad the image with rows of zeros, which increases the computational cost further. A better justification, should one be needed, is that the camera can be rotated ninety degrees, thus interchanging the rows and columns of the image. One can then compare the even-numbered columns of the two images (corresponding to the even fields of the images) and the odd-numbered columns (corresponding to the odd fields of the images) separately using concatenated windows. This concatenation does however have some adverse properties that I will discuss further in Chapter 4.

3.2.3 Complex and Phase Cepstrum

Oppenheim, Schafer and Stockham [OSS68] introduced the complex cepstrum and homomorphic filtering for use in predictive deconvolution and inverse filtering. The complex cepstrum, ĥ(x), is defined as the inverse Fourier transform of the logarithm of the Fourier transform of the signal:

ĥ(x) = F^{-1}{log(F{h(x)})}

It is easy to show that both power and complex cepstrum behave as deconvolving filters - i.e., if two functions are combined by convolution, the cepstrum of the resulting signal is equal to the sum of the cepstrums of the convolved functions. In fact the echo of a signal is merely the result of the convolution of the original signal with a delayed impulse function. As a result, when a function and its echo are put through a cepstral filter, the echo arrival period is marked by a deconvolved delta function [BL92a]. It is interesting to see the relationship between power and complex cepstrum for the detection of echoes.
Taking the complex logarithm of the Fourier transform of Equation 3.1 one obtains:

Ĥ(f) = log(S(f)) + log(1 + e^{-2πifτ})
     = log(||S(f)||) + i phase(S(f)) + (1/2) log(2 + 2cos(2πfτ)) - iπfτ   (3.8)

The real portion of this equation, i.e., the first and third terms, is in fact equivalent to the power cepstrum. The fourth term, -iπfτ, defines a plane in the imaginary portion of the Fourier transform with slope -πτ. This term is normally referred to as the phase cepstrum and, due to phase wrapping, is a periodic function. Therefore the inverse Fourier transform of this function will also result in a peak at τ. But even if the phase of S(f) is removed, like most phase correlation techniques, phase cepstrum is susceptible to noise and phase unwrapping ambiguities.

3.2.4 Differential Cepstrum

Noting the problems with phase unwrapping, Polydoros, Au and Fam [PAF79] introduced differential cepstrum as a shift and scale invariant deconvolution filter. Differential cepstrum simply replaces the logarithm operation in the complex cepstrum with the logarithmic derivative. It is trivial to verify that, like its predecessors, differential cepstrum still transforms a convolution into an addition of the differential cepstrums of the original signals. Bednar and Watt [BW85], for instance, used differential cepstrum to calculate the complex cepstrum of exponentially decreasing signals [RR87a] without any phase unwrapping.
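The definition is easy to sketch numerically. In the example below (Python with NumPy; the Gaussian test pulse, the attenuated echo, and the crude np.gradient spectral derivative are all simplifying choices of mine, not part of the original formulation), the logarithmic derivative is computed directly as the derivative of the spectrum divided by the spectrum, so no logarithm - and hence no phase unwrapping - is needed:

```python
import numpy as np

N, tau = 256, 17
x = np.arange(N)
s = np.exp(-0.5 * ((x - 64) / 1.5) ** 2)    # smooth test pulse
h = s + 0.5 * np.roll(s, tau)               # pulse plus an attenuated echo

H = np.fft.fft(h)
# logarithmic derivative of the spectrum: H'(f) / H(f)
dceps = np.abs(np.fft.ifft(np.gradient(H) / H))
peak = 2 + np.argmax(dceps[2:N // 2])
print(peak)   # the strongest spike of the ripple falls at tau
```

The attenuated echo keeps the spectrum away from the poles of the tan term in the derivation that follows; with a full-strength echo the ripple's harmonics decay much more slowly.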
As for the effects of a signal containing an echo, taking the logarithmic derivative of the Fourier transform of Equation 3.1 one obtains:

H_d(f) / H(f) = S_d(f) / S(f) - πτ tan(πτf) - iπτ   (3.9)

The echo arrival time, τ, is again encoded as a parameter of the tan function, and it can be retrieved using the inverse Fourier transform, producing a rippling effect with spatial periodicity τ. (Phase wrapping is a common signal processing problem which results from the theoretical equivalence of two phase angles separated by 2kπ for any integer k; it is usually desirable but often difficult to "unwrap" the phase of a signal by adding the proper 2kπ value when applicable [Tri77]. The phase of S(f) can be removed by dividing the Fourier transform of h(x) by S(f) before taking the logarithm. The logarithmic derivative, or derivative of the logarithm, of a function is simply the derivative of the function divided by the function.) Raghuramireddy and Unbehauen [RU85, RU84] extended the differential cepstrum to images and higher dimensional signals. Briefly stated, the higher dimensional differential cepstrum is defined as the inverse Fourier transform of the sum of the partial derivatives of the Fourier transform of the function along each axis, divided by the Fourier transform of the function. For images, for instance, the differential cepstrum becomes:

F^{-1}{ (∂H(f₁,f₂)/∂f₁ + ∂H(f₁,f₂)/∂f₂) / H(f₁,f₂) }   (3.10)

3.3 Improvements to the traditional approaches: posCepsCos and diffCepsSin

Because of its wide applicability not only in echo detection, but also in non-linear deconvolution, signal decomposition, adaptive filtering, and the development of stable IIR filters, researchers have continued to pursue improvements to cepstrum even recently [WY91].
Furthermore, since both complex and phase cepstrum contain a mixture of the original phase of the signals, and additionally suffer from phase unwrapping problems, these improvements have dealt almost exclusively with power cepstrum and windowing of the logarithm result [Has74, Has75, AA92]. But for image analysis, as I will show later, power cepstrum suffers from sign ambiguities that can be overcome by verification through differential cepstrum. In the subsequent two subsections, I present simple modifications to power and differential cepstrum that not only improve their signal to noise ratio, but also increase their computational efficiency. Section 3.4 shows an application of these improvements to matching magnetic resonance images, which have both a real and an imaginary component.

Figure 3.3: (a) Regular cepstrum of a synthetic motion of 7 horizontally and 3 vertically. (b) Positive cepsCos of the synthetic motion.

3.3.1 Positive CepsCos

Reviewing Equation 3.3:

log(||H(f)||) = log(||S(f)||) + log(1 + cos(2πτf)) + log 2   (3.11)

the echo arrival period, τ, is only present in the term log(1 + cos(2πτf)). An obvious method of improving cepstrum, whenever possible, is to subtract the cepstrum of the original signal s(x) from the result [LMK89, YS89]. In many applications (such as phoneme chunking), however, the form of s(x) is unknown, rendering this approach inviable. Moreover, my experience has shown that subtracting the cepstrum of s(x) can introduce additional noise and ambiguities near the peak value, primarily due to ringing and windowing effects. But, unlike log(||S(f)||), log(1 + cos(2πτf)) is always a periodic even function. Moreover, from Equation 3.7 one observes that the main peak of the power cepstrum is a positive δ. Therefore, to eliminate the additional noise introduced by the odd portion of
Figure 3.4: (a) Regular power cepstrum for Figure 1; (b) its positive cepsCos - i.e., cepsCos with all the negative values discarded; (c) the difference of (a) and (b). Note that the landscape showing the difference has a different visual scale.

where A(f) and φ_N(f) are the amplitude and phase of the Fourier noise. Now multiplying the above with its conjugate we get:

||H(f)|| = |S(f)| + |N(f)| + 2 √(A(f)N(f)) cos(πτf - R(f))

where R(f) = φ_S(f) + φ_N(f). Abbreviating |S(f)| + |N(f)| + 2 √(A(f)N(f)) cos(πτf - R(f)) as Q(f), we see that the logarithm of the above equation becomes:

log(Q(f)) + log(1 + cos(2πτf))   (4.6)

This is exactly the result that Hassab and Boucher [HB76] and Bogert and Ossanna [BO66] produced, but more simplified and less elaborate. Namely, the peak of the cepstrum is still produced as usual, except that it is modulated with the power spectrum of Q(f). The log(Q(f)) term also adds noise to the final result. By considering the probability density function of the noise and making a few additional simplifying assumptions, Hassab and Boucher concluded "the dependence of the statistical measures upon the pointwise variation of input to noise spectra. Thus the total signal power to noise power (S/N) alone is an insufficient measure for setting the cepstrum performance, and relative bandwidths of signal and noise are also needed." They also calculated the detection probability of an echo for an exponential signal with equal bandwidth for the signal and the noise, and showed that it degrades for signal to noise ratios greater than -4 db. This is far below the capabilities of present common cameras.

4.4.2 Effects of Convolution and Blur

As was pointed out earlier, cepstrum is in fact a deconvolving filter, and is often used for system analysis.
It is however important to know what the effect of convolution on a signal's echo would be, as in the case of motion blur, and how cepstrum is affected by such distortions. Both Hassab and Boucher [Has75] and Tavakkoli [Tav86] treat the subject adequately. Given the signal:

h(x) = s(x) + s(x - τ) * r(x)   (4.7)

where r(x) is the convolving filter, the power spectrum of the signal and its echo can be simplified to (the actual formula is slightly different and involves the square of the power spectrum of r(x)):

||H(f)|| = ||S(f)|| (1 + |R(f)| cos(2πτf))   (4.8)

Therefore, after taking the logarithm and using the MacLaurin-Taylor series expansion, we get:

log(||H(f)||) = log(||S(f)||) + |R(f)| cos(2πτf) - 0.5 |R(f)|² cos²(2πτf) + (1/3) |R(f)|³ cos³(2πτf) - ...   (4.9)

Taking the cosine transform of the above we get:

ĥ(x) = ŝ(x) + r̂(x - τ) - 0.5 r̂(x - τ) * r̂(x - τ) + (1/3) r̂(x - τ) * r̂(x - τ) * r̂(x - τ) - ...   (4.10)

(This equation is actually a result of complex cepstrum analysis. It is easy to show, but outside the scope of this discussion, that power cepstrum's peak is not only convolved with the delayed cepstrum of r(x) but is also distorted due to the effects of the phase of R(f); an even blur function is assumed here.) Therefore the peak of the cepstrum is in fact convolved with the cepstrum of the convolving filter. Note that this derivation also has applications to the result of Equation 4.6. Figure 4.8 shows the application of cepstrum to a blurred echo. The motion is again (3,7) and the blurring function is an anisotropic Gaussian with the blur axis parallel to the direction of motion. Even after the distortion, the peak is still at (3,7) and quite visible. Being a deconvolving filter, a natural question to ask is about cepstrum's performance in the presence of periodic or convolved signals (also known as the "moving striped apple" syndrome!).

Figure 4.8: Effects of motion blur on power cepstrum.

Without using any more
mathematics, it should be clear that cepstrum deconvolves the two signals and generates the sum of their cepstrums. Figure 4.9 shows how cepstrum demodulates a moving plaid image constructed from the convolution of two sinusoids.

4.4.3 Effects of Illumination Change

Changes in the position of the light source - as in the case of autonomous vehicle navigation at night - or the effects of inter-reflectance or shadows cause brightness changes in an image. Simplifying the problem by modeling the illumination change - for the small local neighborhoods considered - as a constant multiplicative factor:

h(x) = s(x) + a s(x - τ)   (4.11)

where a is the change in illumination, the power spectrum becomes:

||H(f)|| = ||S(f)|| (1 + a² + 2a cos(2πτf))   (4.12)

Figure 4.9: Power cepstrum of a moving plaid function.

Therefore the cepstral result is:

ĥ(x) = log(α/2) + ŝ(x) + Σ_{n=1}^{∞} (-1)^{n+1} αⁿ δ(x - nτ) / n   (4.13)

where α = 2a/(a² + 1). Thus the basic effect is to reduce the magnitude of the cepstral peak by the illumination factor. An interesting problem arises if a = 1.0. In this case log(1 + cos(2πτf)) → -∞ as the cosine approaches -1. To counter this, Hassab and others multiply the signal by an exponential function. The existence of any noise, illumination change, and/or blurring will eliminate the degenerate case. (In my experience, perhaps due to numerical round off, log(...) never approaches negative infinity; my cepstrum routine does check for very small values passed to the logarithm and returns a large negative number.) Fortunately, in my image processing applications, the sensors were very good and quite adequate. Consequently improving cepstrum's performance in noise is not as critical an issue as improving its performance in the presence of image distortions. Noise analysis is outside the present scope of this thesis. A dramatic display of cepstrum's effectiveness was provided by Yeshurun and Schwartz [YS89]
where they blurred an image and deleted a portion of it, and still calculated the proper disparity.

4.5 Rotation and Expansion

As pointed out in Chapter 2, one of the fundamental assumptions of region based techniques is that on a local scale the motion of the scene is singular (no multiple motions) and parallel to the image plane of the camera. That is, no rotations or scale changes are allowed. But reality often defies assumptions. Observation of human motion, for instance, shows that there are plenty of rotational and scale changes that occur even as the subject tries to move parallel to the image plane. Yeshurun and Schwartz [YS89] and Ludwig et al [LNN92] were among the few who studied the effects of scale and rotation on cepstrum results. Interestingly enough, even though such studies are often dependent on the scene content, their results are remarkably close. Both experiments showed that cepstrum performed very well in the presence of rotations anywhere in the range of 6 to 8 degrees and expansions of up to 10 percent. Lee, Krile and Mitra [LKM87] used a polar representation of the power spectrum to transform rotational motion of ocular fundus images into translational motion, and then registered these images using power cepstrum. Figure 4.10 shows the results of cepstral analysis of a rotational motion. This result is particularly interesting, because it also displays the two shortcomings of cepstrum pointed out earlier. First, at times the primary peak of cepstrum corresponds to (0,0) motion and it is the secondary peak that should be considered as the result. Second, even though power cepstrum is good at recognizing the magnitude and orientation of the motion, it cannot differentiate between the two directions of motion (e.g., (u,v) and (-u,-v)). In polar, or log-polar, techniques the main issue is the choice of the origin.
In the case of fundus image analysis, the images used are circular images of the retina, so the choice was already made. The direction ambiguity gives rise to an interesting physical effect, as if the image of the plant is being split down the middle, while in reality switching the direction of the arrows on the left half of the plant image will give rise to the correct disparities.

Figure 4.10: Effects of rotation on cepstral results. The gray arrows represent the motions recognized by ignoring the peak at (0,0) of cepstrum. As can be seen, the places where the secondary peak is used are near the center of the rotation or the back wall (i.e., areas of virtually no motion). In these regions the motions are very close to zero. Secondly, note that since power cepstrum does not know the difference between (u,v) and (-u,-v), the arrows on the left half of the flow field are pointing in the wrong direction. Nonetheless the magnitude and the orientation of the motion are qualitatively correct.

Figure 4.11 shows the same result for expansion. Again the secondary motion, especially near the center of expansion, is close to the real value of motion and very close to zero. The background wall has basically no motion, which can be recognized by the high randomness of the gray arrows, indicating that the real motion is in fact depicted by the peak at (0,0).

Figure 4.11: Effects of expansion on cepstral results.

4.6 Subpixel Motion

There have been several interesting papers on subpixel analysis for motion and registration [WC90, TH86, Hee91], with the majority using a bicubic spline. Earlier, however, I discussed the effects of convolution on the cepstrum peak. Consequently, one can confidently claim that subpixel motion analysis is purely dependent on the sampling filter used in the image.
Olson and Coombs have tested and used bilinear interpolation for subpixel analysis [OC90, OC91]. I have not found this type of interpolation adequate for many of the examples I have tested, even though at times it seems adequately accurate. I believe more study is required to gain a better understanding of the effects of sampling on cepstrum and of its ability to produce better results. Figure 4.12 shows the result of posCepsCos for a simulated subpixel motion. Bilinear sampling shows a disparity of 4.3 pixels, which is close to the actual horizontal displacement of 4.4 pixels. (In images where subpixel motion is simulated by bicubic interpolation, the cepstrum filter deconvolves the result into the cepstrum of the bicubic and the cepstrum of the actual motion.)

Figure 4.12: Positive cepsCos for a simulated subpixel motion. (a) The figure and its positive cepsCos. (b) The landscape of the cepstral plane.

4.7 Parallelism and Computational Complexity

The computational complexity of the fast Fourier transform is O(n log n). Simple back of the envelope calculations reveal that power cepstrum requires less computation than mean squared error analysis (i.e., sum of squared differences): given an h by w window and a search margin of h_s by w_s, sum of squared differences uses O(h · w · h_s · w_s) operations while cepstrum uses O(m log m), where m = (h + h_s)(w + w_s). But there are standard computational techniques, especially for dense disparity measurement, to reduce the computational cost of these methods. A popular way of improving the computational cost of cepstrum, for instance, has been the use of the Hartley transform. He, Cai, and Liang [HCL91], along with Lee, Ramirez and Mitra [LRM91] and Steckner and Drost [SD89], have all used the Hartley transform to reduce the computational cost of cepstrum in image processing by about 40 percent. Given cepstrum's popularity in signal processing, He, Cai, and Liang also report the existence of real time hardware for television signal processing.
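The back of the envelope comparison above can be made concrete; the window and search-margin sizes below are hypothetical values of my choosing, and the operation counts ignore the constant factors hidden in the O-notation:

```python
import math

h = w = 32            # matching window (hypothetical sizes)
hs = ws = 16          # search margin
ssd_ops = h * w * hs * ws                  # sum of squared differences
m = (h + hs) * (w + ws)
ceps_ops = m * math.log2(m)                # FFT-based cepstrum
print(ssd_ops, round(ceps_ops))            # the FFT route wins by roughly 10x
```

The gap widens quickly as the search margin grows, since the SSD count is multiplicative in all four dimensions while the cepstrum count grows only as m log m.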
One of the often-cited advantages of using the Hartley transform is that the memory requirement is exactly that of the image size, while for the FFT the memory requirement is twice as much. (Given an h by w window and a search margin of hs by ws, sum of squared differences uses O(h·w·hs·ws) operations, while cepstrum uses O(m log m), where m = (h + hs)·(w + ws).) There are, however, a few points that need to be pointed out: 1. None of the above-mentioned techniques use the cosine transform instead of the power spectrum as the final transform for cepstrum. This can not only reduce the cost of the final Fourier transform by fifty percent, it also reduces the number of multiplications and square root operations used in the final stage. 2. The main cost of cepstrum analysis is in the trigonometric computations of the sines and cosines. The standard trick of using a precomputed table goes a long way toward reducing the cost of the FFT and the cepstrum. Tabulated logarithm values and interpolations are used in commercial signal processing products, and can be used here as well. 3. Another major cost of the FFT is the reordering of the input data into a bit-reversed version using a butterfly algorithm [GW93]. This operation can easily be accomplished using standard Fourier transform hardware [PSW93]. 4. Since usually more than one cepstrum is used, one can use the standard trick of computing two Fourier transforms with one computation and then calculating the first power spectrums without use of any additional memory [PFTV88]. Another issue worth pointing out is that cepstrum is a local matching operation, and as such it can be done in parallel. Figure 4.13 shows an application of cepstrum using 16 transputer nodes (images courtesy of SRI International). Needless to say, individual Fourier transform operations can also be implemented in parallel.
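Point 4 above refers to the classic real-signal packing trick described in [PFTV88]: two real signals are packed into one complex signal, transformed with a single FFT, and unpacked using conjugate symmetry. A minimal numpy sketch (illustrative, not the thesis's implementation):

```python
import numpy as np

def two_real_ffts(a, b):
    # Pack the two real signals as z = a + j*b, take one complex FFT,
    # then separate using the conjugate-symmetry term Z*[(N-k) mod N].
    Z = np.fft.fft(a + 1j * b)
    Zr = np.conj(np.roll(Z[::-1], 1))     # index k holds conj(Z[(N-k) mod N])
    return 0.5 * (Z + Zr), -0.5j * (Z - Zr)

rng = np.random.default_rng(0)
a, b = rng.standard_normal(64), rng.standard_normal(64)
A, B = two_real_ffts(a, b)
print(np.allclose(A, np.fft.fft(a)), np.allclose(B, np.fft.fft(b)))  # True True
```

This halves the number of FFT calls when cepstra of two windows are needed at once, which is exactly the situation in dense disparity measurement.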
Figure 4.13: The optical flow field due to egomotion. 4.8 Comparison with Standard Techniques In Chapter 2 we reviewed different techniques, their behavior and their properties. If there is a logical conclusion to be drawn, it is that there is really no optimal matching technique that works best under all illumination, sampling, noise, blur, and geometrical distortions. All techniques are bound by explicit or intrinsic assumptions that limit them or under which they are formulated. For instance, it can be proven that the sum of squared differences algorithm is the technique least sensitive to uncorrelated Gaussian noise under motion parallel to the camera plane with no rotation [Mat92a]. For the more realistic assumption of correlated Gaussian noise and similar physical constraints, higher order spectral analysis provides the greatest insensitivity [NR87]. But as Horn points out [Hor83], depending on the domain, the environment, and the circumstances, these assumptions are easily and often violated. As an example one might think of motion blur, or scale change and transformation distortions. Therefore it is logical to state that all techniques are at one level or another input dependent, and they perform best (qualitatively as well as quantitatively) under different conditions. What complicates the issue even further is that not all attributes by which different techniques are judged can be measured quantitatively. Thus even though a technique may not perform best under a variety of imaging conditions [BFBB92, BFB94], it may still be the best overall performer given the variations among its application domains and all the attributes or requirements for which it is judged. Having said all this, it is still important to know the strengths, and perhaps more importantly the weaknesses and failure modes, of a particular technique.
For instance, it is trivial to prove that under uncorrelated Gaussian noise, mean squared error - or equivalently sum of squared differences - provides the optimal matching solution [Mat92a]. Moreover, for the more realistic assumption of correlated Gaussian noise, higher order spectral analysis provides the correct solution [NR87]. But depending on the domain, the environment, and the circumstances, these assumptions are almost always violated. As an example one might think of motion blur, or scale change and transformation distortions. Another significant factor in the design and evaluation of matching routines is their versatility and generality of application. I introduced cepstrum as a general, powerful, and robust framework for the analysis of motion, stereo, texture and boundary symmetry. But this simple routine is also able to address other important computational vision routines such as multi-frame analysis, motion transparency, and subpixel motion estimation. These flexibilities are addressed briefly in the next sections of this chapter, and more detailed treatments of them are provided in the next chapters. This chapter has taken a critical look at cepstrum, particularly emphasizing its shortcomings. The objective has been to amplify these weaknesses in order to provide new algorithms that will overcome these problems. These improvements will be introduced under the heading of multi-evidential correlation, and more quantitative and qualitative comparisons are provided there. The next subsection provides comparisons of cepstrum to other techniques, particularly drawing upon the existing literature and personal experience. Section 4.8.2 will show the similarity of cepstrum to regular correlation and conclude with the statements of the originators of cepstrum. 4.8.1 Differences from Standard Techniques To evaluate cepstrum's performance I need to do two things.
First I must examine cepstrum as an echo detection tool, and then choose a general and representative area of computational vision in which to compare cepstrum's performance and generality to other techniques. To date, cepstrum routines have proven to be the most powerful technique for the detection of echoes. Reddy and Rao [RR87b], for instance, showed that differential cepstrum performed better than waveform delay analysis or unwrapped phase averaging. Such performance differences become even more acute for images and higher dimensional signals, primarily due to the greater possibility of errors in multidimensional phase unwrapping [Dud77]. Of course, as is also seen in my experimental examples, power cepstrum has a higher signal-to-noise ratio than differential cepstrum. This is primarily due to the well known fact that differentiation leads to higher noise levels. As a result, it is safe to claim that power cepstrum is the most robust and common approach for the detection of echoes, primarily because it does not suffer from additive noise due to differentiation or phase unwrapping. On the other hand, since power cepstrum lacks any phase information, signal reconstruction (which is really not required in registration) is impossible when using power cepstrum [OL81]. As a result I use only power cepstrum as a matching technique. To point out cepstrum's differences from other matching techniques, one has to narrow down an area of computational vision related to matching and registration. As pointed out earlier, my area of interest is motion analysis and stereo. Not only is visual motion analysis one of the fundamental areas of computational vision, it also provides one of the richest areas for comparative analysis.
Time-varying image analysis plays a significant role in segmentation of scenes, encoding 3D information, egomotion estimation, object tracking, determination of focus of attention, and estimation of time to collision [Nak85], as well as video frame predictive coding, motion compensated interpolation and motion compensated hybrid coding [Cla90]. Barron, Fleet, Beauchemin, and Burkitt [BFBB92] compared several motion algorithms, notably spatio-temporal filtering [Hee87], correlation matching techniques [Ana89], gradient based [HS81], and phase based approaches [FJ90b]. I have not performed as detailed an analysis, but I have examined the performance of power cepstrum relative to phase correlation [KH75], gradient based approaches [HS81], and correlation [LBP88]. On synthetic motion images, compared to the first two techniques, I found cepstrum to perform very well, particularly in the presence of noise. (Actually I use cepsCos - my version of power cepstrum - because it performs better and is computationally more efficient.) Mitra and Lee [LMK89] have also shown that cepstrum is much more robust in the presence of noise than phase correlation [KH75]. Moreover, Wu and Fernando [WF88] showed that, on at least synthetic images, phase correlation performed better than differential techniques [LL88]. They furthermore stipulate that the only reason phase correlation did not perform better, at a qualitative level, than the differential technique on their only real TV image was the iterative nature and smoothing properties of the differential techniques. This shortcoming of phase correlation can be overcome by simple post processing. No comparison of matching methods is complete, however, without considering correlation or sum of squared differences. These techniques are often used due to their simplicity and their optimality in the presence of uncorrelated Gaussian noise.
But camera noise is often spatially correlated (with a symmetric probability distribution). Consequently, I explored the use of third order cumulants [NR87, TG92] and the bicepstrum [NP88] for the analysis of displacements in one-dimensional signals. These techniques are mathematically optimal in the presence of uncorrelated or correlated Gaussian noise. But unfortunately, they require large data segments - which means large window sizes. Another important problem (and sometimes more prevalent than noise) is the violation of the assumption of brightness constancy due to motion of the light source, cast shadows, or change in the direction of illumination due to motion itself [WM93]. What makes this problem particularly difficult to handle - and as a result often ignored in computational vision - is its dependence on outside (i.e., environmental) factors, while noise can often be eliminated by use of better hardware. Unlike other techniques, particularly differential methods which are especially sensitive to changes in lighting, it can be shown mathematically that cepstrum behaves very robustly in the presence of illumination changes: the term log{1 + cos(2πτf)} becomes log{1 + a·cos(2πτf)}, where a is the multiplicative factor (less than one) corresponding to the change in illumination. What sets cepstrum apart, of course, is its non-linearity. Without its nonlinearity, as we will see in the next section, cepstrum has more in common with correlation than one might expect. In fact, originally Pratt [Pra78] and Gonzalez et al. [GW93], and later Olson and Coombs [OC90], showed that cepstrum can be interpreted as an adaptive correlation mechanism. But more significantly, the logarithm in cepstrum behaves as a selective filter by compressing the effect of narrow band signals and emphasizing the information-rich broad band portions of the image (i.e., edges).
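The illumination-robustness claim can be checked numerically: attenuating the echo by a multiplicative factor a < 1 scales the cepstral peak but leaves its location - the measured disparity - unchanged. A small 1-D sketch with illustrative parameters:

```python
import numpy as np

def power_cepstrum(x):
    # log power spectrum, then power spectrum again (classical power cepstrum)
    S = np.abs(np.fft.fft(x)) ** 2
    return np.abs(np.fft.fft(np.log(S + 1e-12))) ** 2

rng = np.random.default_rng(1)
s = rng.standard_normal(256)
d = 23                                    # true disparity (echo lag)
for a in (1.0, 0.5):                      # a < 1 models dimmed illumination
    x = np.zeros(512)
    x[:256] += s
    x[d:d + 256] += a * s                 # echo attenuated by factor a
    c = power_cepstrum(x)
    print(a, np.argmax(c[5:256]) + 5)     # peak location stays at 23
```

Only the peak's magnitude shrinks with a; its quefrency, which is what the matcher reads off, does not move.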
This behavior is quite different from many linear techniques, particularly first and second gradient approaches, or linear filtering methodologies. Comparing techniques analytically, or even numerically, is a good academic exercise, and provides a valuable guide to improving present methodologies. In fact, the development of cepsCorr was partially a consequence of comparisons that I drew between cepstrum and different correlative approaches [LBP88, Ana89, Fua91]. Even though I will be providing such a comparison myself - in the results section - I generally found that drawing firm conclusions from such comparisons, without consideration of the images being matched, can be rather misleading. Specifically, there are constant improvements introduced to traditional methods that render such comparisons outdated (see for instance [CV92]). This does not mean that one should dismiss the existence of the "silver bullet": the methodology that is "mathematically" proven to be the best matching routine under any and all distortions, noise types and levels, and any other conceivable anomalies. (Banking one's Ph.D. on this, however, is like trying to find the solution to a historically open problem as a thesis topic.) Finally, as we will see later, visual echoes as a framework provides a substantial flexibility that is not so easily exploitable by other routines. Even for a subset of the visual echoes framework, such as multi-frame analysis or recognition of motion boundaries and multiple motions, some techniques provide solutions, but these solutions, when applicable, require a change in the premises, assumptions, computations and properties of the original approach [JB93a]. With visual echoes, on the other hand, one should not need the benefit of thresholding, a priori knowledge and/or a modified methodology. 4.8.2 Similarity with Standard Techniques I introduced cepstrum as an adaptive non-linear correlation technique.
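One way to visualize this adaptive-correlation view is to keep each Fourier component's phase while compressing its magnitude logarithmically; matching then amounts to correlating these magnitude-compressed images. A sketch of the idea (the log1p form is an assumed regularization to keep the compressed magnitude non-negative, not the thesis's exact filter):

```python
import numpy as np

def cepstral_equivalent(img):
    # Keep phase, compress magnitude: broad-band structure (edges) is
    # emphasized relative to dominant narrow-band / low-frequency content.
    F = np.fft.fft2(img)
    G = np.log1p(np.abs(F)) * np.exp(1j * np.angle(F))
    return np.real(np.fft.ifft2(G))

img = np.zeros((64, 64))
img[:, 32:] = 100.0                       # a step edge with a strong DC term
out = cepstral_equivalent(img)
mag_in = np.abs(np.fft.fft2(img))
mag_out = np.abs(np.fft.fft2(out))
# The spectral dynamic range shrinks: low frequencies lose their dominance.
print(mag_in.max() / mag_in[mag_in > 1e-9].min() >
      mag_out.max() / mag_out[mag_out > 1e-9].min())
```

The compression of the spectral dynamic range is exactly the "selective filter" behavior described above.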
To understand this mathematically, recall that the power spectrum of the logarithm is equivalent to an autocorrelation. Thus, calculating the cepstrum is equivalent to finding the correlation of a version of the signal whose Fourier magnitude has been logarithmically reduced (Eq. 4.14). There are two noteworthy points. First, this is exactly like taking a regular correlation, except that the magnitude of the function is reduced. What this means is that the normally higher frequency signals are now more emphasized. This is primarily due to the fact that higher frequencies in images have lower amplitudes. By the same token, lower bandwidth signals - i.e., constant intensity blobs and periodic functions - are generally deemphasized. This is reflected in Figure 4.14: the edges of the circle are well emphasized, while the effects of the stripe and the uniform intensity of the blob disappear. The second point to be noticed is that emphasizing the high frequencies makes the cepstrum not only a good matching technique, but also very noise sensitive - since the noise associated with images, whether correctly or not, is often modeled as a high frequency signal. Figure 4.15 shows the cepstrum-equivalent filtering of an image and of the same image with added Gaussian noise of standard deviation 20.0. As can be seen, the high frequency signals are emphasized and in fact the dynamic range of the signal is also improved, but unfortunately the effect of the Gaussian noise is also reflected in the image. Finally, I would like to draw attention to the similarity of the cepstral equivalent image to that of a silicon retina image simulated by software [MM91] (see Figure 4.16). Figure 4.14: Cepstral equivalent filtering: note how the low frequency functions are deemphasized. Perhaps the best way to conclude this analysis is to quote the statements of the originators of cepstrum, Bogert, Healy and Tukey [BHT62].
(Page 220 of the reference.) "We have given evidence that autocorrelation is more likely to fail to detect complex echo than are the more elaborate processes we have been developing. It is important to understand why this is so. It is natural to think of the duality between time and frequency as an either-or phenomenon - sometimes we look at things one way, sometimes the other. In so far as the actual time series are concerned, the situation is often far more both than either. Series containing echoes, whether or not geophysical, are not likely to have had a white spectrum before the echo was added in. Some may have had a spectrum that was smooth enough to give the autocorrelation a fighting chance to find the echo, but many have not. We could always try to proceed by first finding the power spectrum of the original series and then calculating a filter that, when applied to the original series, would approximately smooth out the short-quefrency variations in the spectrum. Then we may apply this filter to the original data and calculate the autocorrelation of the resulting series. On the other hand, we could try many alternative filterings of the original series and then look at the autocorrelation of that one whose frequency was free of short-quefrency components (i.e., trends with frequency). Such procedures will surely be no easier than these we suggest. Their only advantage would seem to be that they would allow one to whom autocorrelation is sacred to apply it. The usefulness of the more nonlinear process lies in the fact that liftering the log power spectrum removes the effects of short-quefrency component without any need for recognizing their individuality." 4.9 Summary This chapter provides a review of cepstrum properties as a matching routine, particularly emphasizing its limitations and shortcomings, which need to be addressed. In Chapter 2, I provided a partial list of characteristics for matching routines.
But the scope of this work surpasses a cursory list-like review, emphasizing windowing, numerical stability, the effects of additive noise, convolution (particularly blur), rotation, expansion and illumination changes. A short comparison, particularly emphasizing the similarities to other approaches, was also provided with graphical examples and theoretical analysis. (Liftering refers to filtering in the log domain.) Figure 4.15: Cepstral equivalent filtering for an image and an image with added noise. Figure 4.16: An image after it has been put through the different layers of the retina. Chapter 5 Cooperative Analysis of Multiple Frames by Visual Echoes Traditionally, image matching and registration techniques in computational vision utilize only two image frames at a time. This was originally motivated by the lack of multiple-frame image matching routines, as well as the perceived similarity of this approach to human perceptual operations such as binocular stereo disparity measurements. But, as noted by Price [Pri90], multi-frame analysis allows the introduction and use of constraints on the equations of motion or stereo. Such constraints can improve the signal-to-noise ratio of disparity detection mechanisms and improve the accuracy of the resulting estimates. Moreover, such processing can be used in the direct estimation of constant or variable accelerations for 3D structure recognition [Fra91]. Historically, multi-frame analysis has been limited to motion analysis. But many computational vision tasks - such as trinocular stereo disparity calculations, multiple baseline stereo measurements, multi-frame motion analysis, motion-stereo (binocular or trinocular) techniques, vergence control, and structure from uniform three-dimensional acceleration - involve determination of disparities among multiple frames.
The large breadth of applications of multi-frame analysis warrants it a greater deal of attention and research in computational vision. As a framework (which by definition should provide a large breadth of applications), visual echo analysis is flexible enough to extend to this domain, thus providing a uniform approach to multi-frame analysis. Where equal disparities exist between multiple frames, the cepstral peaks reinforce one another in time and space. Consequently, when the constraint of constant separation is applicable, the increased number of frames leads to an increase in the detection and accuracy of disparity estimation. In the case of unequal separations - for instance due to acceleration - disparities between every two frames are calculated, providing a rate of change in disparities. 5.1 Introduction, Background and Previous Work One of the first - and perhaps most interesting - works in multi-frame analysis was by Mohanty [Moh81], who used a maximum-likelihood approach to position and track target points in space. Another early work in this area, by Barniv [Bar85], used dynamic programming for detection of dim moving targets, but much like other methodologies [GV93] the constraints on inter-frame disparities were not used to improve detectability or estimation. Perhaps the best known multi-frame methodologies are the spatio-temporal techniques [AB85, WA85] (see Figure 2.2). Even in these techniques, although constant linear velocities are normally assumed, there is no inter-frame reinforcement of disparities. Instead, in the case of multi-frame phase analysis for instance [Fle92], the presence of a constant velocity gives rise to a better fit of the data to a plane. In effect, in the spatio-temporal approach the use of multiple frames is a requirement or restriction, rather than a feature.
Lastly, routines that utilize a large number of image frames [Fle92] automatically introduce a lag or delay time in processing, caused primarily by the non-causality of routines which require future data for present disparity calculations. Huang, Lui, Hayes and Mersereau [HLHM92], on the other hand, did introduce an interesting recursive differential approach for multiframe analysis in which disparities between every two frames were registered. (Inter-frame disparities refers to the disparity between any two frames separated by one or more frames.) Two more noteworthy works in this domain, which fall into the category of 3D structure from multiple frames, are by Spetsakis and Aloimonos [SA90], who used an elegant mean squared approach, and Kim and Price [KP93], who used a feedback mechanism for improving structure from motion estimates. The objective of this chapter is twofold. First, it extends the domain of visual echo analysis in order to provide a uniform approach for the analysis of multiple frames, and applies it to such areas as trinocular stereo, motion-stereo, multi-frame motion analysis, and multiple-baseline stereo. Second, it provides a technique that fully utilizes the natural constraints of multi-frame analysis - such as reinforcement of constant disparities in time and/or space, and direct recognition of acceleration or inter-frame disparities. The next section briefly covers the behavior of cepstrum in the presence of multiple echoes or disparities [BL93b, BL92c] and sets the groundwork for the use of visual echoes in spatio-temporal motion analysis. Section 5.3 provides a treatment of trinocular stereo as a visual signal with multiple echoes, and shows how horizontal, vertical and diagonal disparities can be derived directly. Section 5.4 covers multiple-baseline stereo and how the signal-to-noise ratio improves dramatically as we incorporate more frames.
Finally, section 5.5 shows the application of cooperative multi-frame analysis to stereo-motion disparity measurement. This section shows how disparities in time and space (motion and stereo) are decoupled, and how moving targets can be directly segmented. Extensions of this approach to vergence control and trinocular stereo-motion analysis are also discussed. 5.2 Visual Echoes and Spatio-temporal Image Analysis The interesting question is: what if one has multiple replications or echoes of the signal in time or space, i.e.,

h(x) = s(x) + Σ_{i=0..n} s(x − τ_i)   (5.1)

where τ_i is the disparity of the i-th echo with respect to the first signal. Taking the Fourier transform and factoring out the spectrum S(f), we get:

H(f) = S(f) (1 + Σ_{i=0..n} exp(−2πj τ_i f))   (5.2)

Taking the power spectrum and the logarithm, we are left with:

log(|S(f)|²) + log(n + 2 + 2 Σ_{i=0..n} cos(2π τ_i f) + 2 Σ_{i<j} cos(2π (τ_i − τ_j) f))   (5.3)

This is a significant formula, because it shows that if we have multiple echoes, either due to multiple frames or due to multiple motions, then the argument of the logarithm becomes a sum of several cosines instead of a single one. The periods of these periodic functions are the relative disparities of every two echoes in the scene. The power spectrum of the logarithm then generates multiple peaks corresponding to the differences of the arrival periods (or disparities) between every two echoes. Figure 5.1 illustrates the result of cepstrum analysis on a simulated non-linear acceleration. The simulated motion field is made to reflect acceleration of the object. The motion disparities with respect to the leftmost frame are (1,1), (3,3) and (7,7). The peaks in the result, displayed below the original image, show the displacements between each pair of windows.
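A 1-D sketch of this effect: a signal carrying two echoes produces cepstral peaks at both lags. The lags and the modest echo amplitudes below are illustrative choices; with equal-strength echoes, difference-of-lag peaks appear as well, as Eq. (5.3) predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.standard_normal(200)
x = np.zeros(512)
x[:200] += s                 # original signal
x[20:220] += 0.6 * s         # echo at lag 20
x[33:233] += 0.35 * s        # weaker echo at lag 33
log_power = np.log(np.abs(np.fft.fft(x)) ** 2 + 1e-12)
ceps = np.abs(np.fft.fft(log_power)) ** 2
q = ceps[5:256]              # positive quefrencies, origin skipped
print(np.argmax(q) + 5)      # strongest echo peak: quefrency 20
```

Both echo lags stand out well above the quefrency-domain noise floor, which is the multiple-peak behavior the multi-frame analysis below relies on.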
As usual, the origin in the results is the upper left corner; peaks occur at one frame width plus the relative motion of the windows. It is significant to note that should the object go through a constant velocity (or, in the case of multiple baseline stereo, constant disparities), the echo peaks between frames overlap, thus increasing the detectability and accuracy of the echo arrival measure. Figure 5.1: Multi-frame optical flow analysis utilizing cepstrum. The simulated motion field is made to reflect acceleration of the object. The motion disparities with respect to the leftmost frame are (1,1), (3,3) and (7,7). This important fact sets cepstrum apart from other multi-frame analysis routines. To demonstrate this, I simulated a constant velocity motion and analyzed the motion in a space-time cube. This is much like the spatio-temporal analysis depicted in Figure 2.2, except that I am using a non-linear filter. I then conducted a three-dimensional cepstrum analysis on the space-time cube, and looked at the spatial disparity signals in time. The peak in the first cepstral image shows all the disparities between any two images separated by one time frame. If the object is going through a constant motion, the cepstral peaks overlap, thus increasing the signal-to-noise ratio. This is shown in Figure 5.2(a), where the single large peak in the first frame corresponds to a constant disparity of 7 pixels horizontally and 3 pixels vertically. The peaks in the second frame, Figure 5.2(b), correspond to all the disparities between the frames separated by two time slices in the cepstrum results. We see that the displacement of the peak doubles, reflecting the disparity between every two frames separated by two time steps. The peak magnitudes of these comparisons decrease dramatically and the
noise level increases as the temporal disparity between the frames increases. Figure 5.2: The results of spatio-temporal cepstrum analysis for a constant velocity over eight frames: (a) shows the disparity for every two consecutive frames, (b) for every two frames separated by one frame, (c) for two windows differing in image number by two, and (d) for two frames differing by three frames. It is also important to note that spatio-temporal cepstral analysis is simpler than the traditional approaches because there is no interpolation of a surface whose normal defines the velocity of the motion. Figure 5.3 shows the effects of multiple frames on the signal-to-noise ratio and peak magnitude of the cepstrum result. Figure 5.3: (a) The peak magnitude and (b) the signal-to-noise ratio as a function of the number of frames used. As a reminder, I am still using the formula [COP90]

SNR = 20 · log10((PeakMagnitude − AverageNoise) / √NoiseVariance)   (5.4)

to define the signal-to-noise ratio. 5.3 Direct Measurement of Trinocular Stereo Disparities There are numerous articles on the use and effectiveness of trinocular stereo measurements and their performance compared to binocular stereo routines. Dhond and Aggarwal [DA91] showed how trinocular stereo provides a higher degree of accuracy in disparity measurements than binocular stereo. Figure 5.4: Trinocular disparity measurements using visual echoes. The three peaks in the cepstrum reflect a disparity of 3.5 horizontally between the horizontal cameras, a vertical disparity of 6.0 between the cameras located vertically from each other, and a disparity of (3.5, 6.0) for the two diagonal cameras. Traditionally, trinocular stereo measurement techniques
use three calibrated cameras located in a right-angle triangle arrangement with their optical axes parallel. This leads to three epipolar constraints and three disparity measurements. Figure 5.4 shows the effects of cepstrum analysis on a trinocular triplet. The cameras are located in the order (0, D), (0, 0) and (D, 0). The lower portion of the image shows the result of the cepstrum, where there are three peaks corresponding to the three disparities. Note that the locations of the peaks are consistent, and the results of the three disparities can be combined to get a more accurate disparity. Should the locations of the peaks not be consistent, the multiple disparities are caused by occlusions or motion transparency. It is also noteworthy that if one uses a fourth camera, so that the ensemble is a square grid, every pair of disparities reinforces one another, giving a higher signal-to-noise ratio and better detectability. This approach can be further extended to a grid of cameras, or multiple baseline trinocular stereo. A similar structure can be simulated by a stationary texture ensemble. Figure 5.5 shows the results of cepstral analysis of a simulated stationary texture field. Figure 5.5: Simulated camera grid by the use of stationary texture segmentation. 5.4 Improved Disparity Measures by Multiple Baseline Stereo Kanade, Nakahara and Okutomi [KON92, NK92] have examined increased accuracy in the detection of binocular disparities using the multiple baseline stereo technique. Described briefly, multiple baseline stereo measures the disparities between a window of an original image frame and subsequent windows in the frames taken by the camera displaced by a constant amount over time.
The authors then showed that, using a simple formulation, they can get better disparity measures and a higher signal-to-noise ratio by accumulating the properly scaled cross correlation results between the original window and the subsequent images. This work has been followed by Ens and Li [EL93], who used a stationary camera to recognize the shape of objects moving at constant velocity on a conveyor belt. Figures 5.6 and 5.7 show an experiment using multiple baseline stereo. As can be seen, the disparities between the windows overlap, giving rise to a large peak between the first and the second window. Figure 5.6: Results of multiple baseline stereo. An added advantage of using visual echoes for analyzing multiple baseline stereo is that it generates multiple evidence: the peak generated between frames two and three is the overlap of the disparities of all frames separated by two baselines. Figure 5.7: Topographic plot of multiple baseline stereo results using the visual echo analysis approach. The peaks in the figure point to the horizontal disparity of the figure - roughly 1.65 pixels horizontally. The second largest peak corresponds to the overlap of disparity measures between two windows separated by twice the baseline. To show how multiple baseline stereo increases the signal-to-noise ratio, I also conducted cepstrum analysis of disparity for only two frames. Figure 5.8 shows the results. Comparing Figures 5.8(b) and 5.7 reveals that multi-frame analysis increases the peak magnitudes by nearly a factor of five. Figure 5.8: The result of cepstral analysis for stereo disparity measurement.
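For reference, the signal-to-noise figure used in these comparisons (Eq. 5.4, after [COP90]) can be sketched as follows; treating every bin except the peak as "noise" is my reading of that definition, not a detail spelled out in the text:

```python
import numpy as np

def cepstral_snr_db(plane, peak_idx):
    # 20*log10((peak - mean_noise) / sqrt(noise_variance)),
    # where "noise" means every bin except the peak itself.
    c = np.asarray(plane, dtype=float).ravel()
    noise = np.delete(c, peak_idx)
    return 20.0 * np.log10((c[peak_idx] - noise.mean()) / noise.std())

rng = np.random.default_rng(4)
plane = rng.standard_normal(256)
plane[40] = 100.0                 # a lone cepstral peak over unit noise
print(round(cepstral_snr_db(plane, 40), 1))
```

For a peak roughly 100 times the noise standard deviation, the measure comes out near 40 dB, which gives a feel for the scale of the improvements reported above.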
5.5 Decoupling of Motion and Stereo Disparities in Binocular-Motion

So far we have examined the application of cooperative multi-frame analysis to either spatial vision - such as binocular or trinocular stereo - or temporal domain applications - such as multi-frame motion analysis. In this section I will address the application of multi-frame visual echo analysis to the stereo-motion domain. Several interesting works in this area have already been presented by Cui, Weng and Cohen [CWC91], who used an iterative least squared error approach, Waxman and Duncan [WD86], who used the rate of change of stereo disparity, and Waxman and Sinha [WS86], who used two image flows from two cameras with slightly different motions. In this work I make a simple and yet fruitful observation: if the motion with respect to the binocular stereo pair is parallel to the image plane, or small, then the motion disparities between the two stereo cameras are equal. Moreover, the binocular disparities at the two time periods also remain the same. This simple observation has a trivial geometrical proof but two significant side effects. First, visual echo registration decouples the disparities due to motion and stereo. Second, since these disparities remain constant and decoupled, similar-type disparities reinforce one another, increasing the signal-to-noise ratio. Figure 5.9 shows the use of cepstral analysis for two pairs of stereo images. The two main peaks show that the figure has a stereo disparity of 4.3 pixels horizontally and a motion disparity of 3.5 pixels horizontally and 6.4 pixels vertically.

Figure 5.9: Motion-stereo disparity measurement. The two windows on the left are a stereo pair taken prior to the object going through a translation. The cepstral results show how the motion disparity of (3.5, 6.4) and the stereo disparity of (4.3, 0.0) are decoupled.
It is noteworthy that in this example I concatenated all the frames together, while another alternative would be to concatenate the stereo data in space and the motion sequence in time, as was done previously. In figure 5.10 I used a calibrated stereo system to find the stereo disparities of a scene. Figure 5.11 shows the portion of the scene (i.e., the Tweedy Bird!) that has gone through a relative motion. One of the applications of this methodology can be the active pursuit of targets using stereo vergence [PE]. When the object being tracked remains within the horopter the stereo disparity remains null,3 but the motion disparity changes as the object moves. Consequently one can track the object by following the motion disparity. Once the object moves in depth the binocular disparity also changes, and the camera vergence angle can then be adjusted appropriately.

Figure 5.10: A stereo motion scene, and the horizontal stereo disparities using visual echo.

Figure 5.11: Motion-stereo disparity measurement. The background is stationary and does not move, while the brightly shaded and darker areas point to the pixel values corresponding to the Tweedy Bird and the piece of cloth, both of which have gone through a motion.

3 This actually depends on the vergence angle between the two cameras.

5.6 Summary and Conclusion

In this chapter I showed how visual echo analysis provides a uniform approach to multi-frame analysis and applied this framework to the analysis of the disparity calculations for spatio-temporal motion analysis, trinocular stereo disparities, multiple baseline stereo measurements, and depth and motion calculations using motion-stereo. What sets this approach apart from its predecessors is that the equal disparities constraint, when applicable, is fully utilized to increase the disparity detectability and the accuracy of the estimation process. Several domains where this constraint is satisfied include motion-stereo, multiple baseline stereo, and constant velocity motion. When this constraint is violated, such as in the case of trinocular stereo or accelerating motion, the visual echo techniques provide all disparities between every two windows analyzed. One of the interesting features of this work is its extendibility to other possible mechanical setups such as trinocular stereo-motion analysis, or calibrated camera grids.

Chapter 6

Multi-Evidential Correlation and CepsCorr

Previous chapters—particularly Chapter 4—provided an objective evaluation of cepstrum, including its main advantages and shortcomings. One advantage of cepstrum was its simplicity of implementation and its ability to provide a direct (i.e., one shot) estimate of the disparity. Specifically, unlike mean squared error analysis (SSD), where individual correlation values are gathered over a region and the location of the minimum is chosen as the disparity, the cepstral result provides a direct value for the disparity without incremental calculation of the result. But cepstrum is not unique in this behavior. In fact, there are several techniques that provide direct estimates of disparities. Among these, cepstrum, phase correlation and Hadamard based approaches are the most notable.1 All these techniques have several attributes in common, namely:

• the signal to noise ratio and the accuracy of the measured estimates increase with reduced disparity.

• the uncertainty in disparity measurements due to other factors such as image distortions, blur, or noise can be reduced if multiple measurements of disparities are available.
• if the same signal is matched against a modified copy of itself at different disparities, all the measured estimates are correlated.

1 A review of these approaches has been provided previously in Chapters 2, 3 and 4.

It is in accordance with these observations, and a desire to remove the limitations of cepstrum, that multi-evidential correlation has been developed. The next section provides a mathematical definition of MEC and some intuition behind its use. Section 6.2 discusses further properties of this technique and the choice of the matching kernel. Section 6.3 addresses the use of this methodology to overcome some of the problems and ambiguities associated with cepstrum. Section 6.4 discusses verification of the estimated disparities and the as yet unanswered question of how to optimally combine the correlated disparity estimates; this section then provides a preliminary but very robust solution which is both accurate and computationally practical.

6.1 Multi-evidential Correlation: Definition

Described briefly, multi-evidential correlation combines the direct disparity estimation of a matching kernel—such as cepstrum, phase correlation or the Hadamard transform—with the iterative behavior of correlation to generate multiple estimates of disparity. As we saw in Chapter 2, the individual pixel values of the spatial correlation plane are calculated from:

R_{I1,I2}(u, v) = Σ_{wx} Σ_{wy} I1(x, y) I2(x − u, y − v)    (6.1)

where (wx, wy) defines the spatial extent of the matching window, and u_min ≤ u ≤ u_max and v_min ≤ v ≤ v_max define the search area and the size of the resulting correlation plane.2 Analogously, a comparative kernel such as cepstrum, C(I1(x, y), I2(x − u, y − v)), can take two equally sized image windows as input and directly generate a disparity value D(u, v).

2 The disparity (δx, δy) is then (u_min + u*, v_min + v*) such that R_{I1,I2}(u*, v*) is maximum over all values of u and v.

Repeating the process for different values of u and v gives rise to multiple estimates of disparity. Therefore a window from one image is matched against a series of windows within a neighborhood in the second image, each comparison producing a disparity value. In short, the pseudo-code for multi-evidential correlation can be summarized as follows:

1. given a window W1(x, y) from the first image, and

2. a search area, S(x, y), in the second image where one expects the closest match for W1 to reside,

3. clip a region, W2(x, y), in S to be compared to W1(x, y).

4. Match the two regions W1 and W2 using the non-linear filters cepstrum and/or phase correlation. Locate the maximum peak and save the result as a possible disparity measure.3

5. repeat steps 3 and 4 - of course with a new W2(x, y) - to gather more disparity estimates as needed.

6. combine the measured disparities to get a final estimate for the motion of W1(x, y) in S.

3 One can also use techniques based on the Hadamard transform [LC88]. This approach is fast but not necessarily very robust, and fails particularly in the presence of multiple motion.

It is worth noting again that matching kernels such as cepstrum or phase correlation obviously do not generate a disparity value directly, but an image output I_{u,v}(x, y) first. The disparity, D(u, v), is then found by searching for the coordinates of the maximum peak in the output image and subtracting the artificial disparity - i.e., (u, v) - introduced by the sliding window. This may sound like an additional computational burden, but it is in fact a major asset of this process. Since the output of each iteration is an image, other attributes can be associated with each disparity estimate D(u, v).
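The numbered steps above can be sketched in code. The following is a minimal illustration of my own, not the thesis software, using phase correlation as the matching kernel; the window size, search range, and synthetic test images are arbitrary assumptions. Note that each iteration also records the peak height of the kernel output, the kind of per-estimate attribute discussed next.

```python
import numpy as np
from collections import Counter

def phase_corr(w1, w2):
    """Matching kernel: phase correlation of two equally sized windows.
    Returns the (dy, dx) location of the top peak and its height."""
    R = np.fft.fft2(w1) * np.conj(np.fft.fft2(w2))
    R /= np.abs(R) + 1e-12                 # keep phase only
    p = np.fft.ifft2(R).real
    dy, dx = np.unravel_index(np.argmax(p), p.shape)
    n, m = p.shape
    if dy > n // 2: dy -= n                # undo circular wrap-around
    if dx > m // 2: dx -= m
    return dy, dx, p.max()

def mec(img1, img2, y, x, win=32, search=3):
    """One window of img1 vs. a sliding neighbourhood of img2: each
    comparison yields its own estimate of the disparity, plus a peak
    height usable as a confidence attribute."""
    w1 = img1[y:y + win, x:x + win]
    estimates = []
    for u in range(-search, search + 1):
        for v in range(-search, search + 1):
            w2 = img2[y + u:y + u + win, x + v:x + v + win]
            dy, dx, peak = phase_corr(w1, w2)
            # remove the artificial disparity (u, v) of the sliding window
            estimates.append(((u - dy, v - dx), peak))
    return estimates

# toy usage: img2 is img1 shifted by (2, 1) pixels
rng = np.random.default_rng(0)
img1 = rng.standard_normal((80, 80))
img2 = np.roll(img1, (2, 1), axis=(0, 1))
ests = mec(img1, img2, 20, 20)
consensus = Counter(est for est, peak in ests).most_common(1)[0][0]
```

In this sketch all 49 comparisons are made to agree on the true shift; in real imagery the estimates disagree, and verifying or combining them is exactly the topic of Section 6.4.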
For instance, the signal to noise ratio, the absolute magnitude of the peak, or the ratio of the first and second peaks (often referred to as the rejection ratio) can be used as possible confidence measures for each estimate. An obvious question to raise is why one cannot use a regular iterative approach, such as SSD, with multi-evidential correlation. The answer is simple. Using any spatial correlation method, such as the sum of absolute differences technique, as the matching kernel of multi-evidential correlation is simply equivalent to increasing the search area of the matching kernel (and doing a lot of redundant calculations). The minimum will (hopefully) remain at the same absolute location. The significant point to note is that techniques such as cepstrum and phase correlation provide a direct estimate of disparity with each iteration step. Moreover, as the values of u and v change with each iteration, one of the input signals to multi-evidential correlation is constantly changing. Consequently, the estimated disparities may be partially correlated (since at least one of the underlying signals remains the same) but they are uniquely generated, non-identical estimates (i.e., not based on the same inputs). As we will see in Section 6.4 this has important implications for localizing the matching features and combining evidence. There are, however, certain properties that are independent of the choice of kernel. For instance, as we saw earlier, in the case of the Hadamard kernel the accuracy of the disparity increases, while with cepstrum and phase correlation the detectability of the disparity and its signal to noise ratio are also enhanced.

6.2 Multi-evidential Correlation: General Properties

I have only discussed three matching kernels for multi-evidential correlation. But there may be other (perhaps application dependent) registration routines that can be used as well.
The properties of a matching kernel have a fundamental effect on the results of multi-evidential correlation, as well as on how the individual measurements should be treated and eventually combined. But the choice of the kernel is not always an obvious one. For instance, although cepstrum is the most reliable and robust of the three alternatives presented here, it is also computationally much more expensive than the Hadamard transform. Therefore, for some applications one may decide to use the Hadamard kernel to gain much higher processing speed at the expense of some accuracy and reliability. One of the main advantages of using multi-evidential correlation is that it reduces the effects of window size. As we saw earlier, in area based techniques choosing a large window increases the signal to noise ratio. But it also has the effect of blurring the disparity values.4 This is particularly a problem when the moving objects are much smaller than the window size. In the case of regular cepstrum or phase correlation it is critical that the window is large enough to cover an adequate portion of the object for proper matching. Therefore, as the size of the disparity grows so should the window size.5 This coupling of window size and disparity is a limiting factor in the use of cepstrum or phase correlation. For small fast moving objects, for instance, the window size necessary to include the object may be so large that the object's displacement may go completely unnoticed compared to that of the background.

4 This phenomenon is also referred to as bleeding, since it tends to defocus the disparity vectors and is most notable across displacement boundaries.

5 One may say that the window size has an absolute lower bound, which is the size of the disparity. At or below this size you are guaranteed that the disparity value estimated is meaningless, since the contents of the two windows are different. In practice, of course, as less of a moving object is included in the two windows the signal to noise ratio reduces, and after a certain point (depending on the frequency content of the object and the background) the correct disparity peak disappears before the window size reaches the absolute lower bound.

Using multi-evidential correlation partially overcomes this problem. Since the window from the first image is compared against a series of displaced windows from the second image (i.e., the second window is a sliding one), one can reduce the size of the comparison window and increase the search space. Modifying the sizes of the windows and the search region has a direct effect on the computational expense of one's operations. Computational expense increases linearly with the size of the search area, while the effect of the window size depends on the computational complexity of the matching kernel. Thus for cepstrum, which has O(n log n) complexity, reducing the window size reduces the computational time faster than linearly. Therefore it seems even more useful to choose a small window size and increase the search area. We still have not completely eliminated the problems associated with windows, or the problem of choosing the proper search region. That is, as the window size reduces, the amount of information being compared is reduced, and the accuracy and reliability of the estimated displacement diminish. This is particularly true for cepstrum and phase correlation, where diminishing window size causes increased noise due to aliasing in the result. Therefore, reducing the window size at the expense of increasing the search region is not always a viable solution.
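The window-versus-search trade-off can be made concrete with a back-of-the-envelope cost model; the n² log n per-comparison FFT cost and the quadratic search-area term below are my own illustrative assumptions, not measurements from the thesis implementation:

```python
import math

def cepscorr_cost(win, search):
    """Rough operation count for cepsCorr: one O(n^2 log n) FFT-based
    cepstrum per comparison, times (2*search + 1)^2 comparisons."""
    per_window = win * win * math.log2(win)   # FFT cost for a win x win patch
    comparisons = (2 * search + 1) ** 2       # sliding-window search positions
    return per_window * comparisons

# Halving the window side cuts the per-comparison cost by more than the
# factor of 4 that window area alone would suggest, because the log
# factor shrinks too: cost(32)/cost(16) = (1024*5)/(256*4) = 5.
ratio = cepscorr_cost(32, 3) / cepscorr_cost(16, 3)
```

This is the superlinear saving referred to in the text: shrinking the window pays for a substantial enlargement of the search region.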
Moreover, increasing the window size can still have the effect of blurring the disparity, especially when the objects in the scene have curved surfaces. Luckily, as we will see in Chapter 8, cepstrum produces multiple peaks for multiple distinct disparities. This allows us to recognize the existence of motion boundaries, or transparent objects.6 Another significant property, which was briefly mentioned in the introduction, is that as the relative disparity of the two matching windows diminishes, the signal to noise ratio and the reliability of the estimated disparity increase. This is an important property, which will be discussed in more detail—with experimental results—in Section 6.4. The key idea which should be emphasized is that multi-evidential correlation provides multiple estimates of disparity, which can be used for verification of individual measurements, or combined into a final, more reliable estimate of disparity. This is particularly important when one is considering subpixel displacement estimation. The remainder of this discussion concentrates primarily on the cepstrum based approach nicknamed cepsCorr. But much of the material presented is also applicable to other matching kernels. For instance, as we will see in Chapter 8, both cepstrum and phase correlation can be used to detect, verify and calculate multiple motion disparities.

6 Needless to say, choosing the proper search domain requires an a priori estimate of the maximum disparities in each direction.

6.3 Cepstrum and Multi-evidential Correlation—cepsCorr

In Chapter 4 I reviewed some of the strengths and limitations of cepstrum as a matching technique.
The primary problems with cepstrum were not due to a lack of accuracy or robustness, but rather were the result of mathematical and computational operations that produce ambiguities in the cepstral results.7 Specifically, the existence of a spurious dominating peak corresponding to the (0,0) disparity and the sign ambiguity of the displacement vector are the two problems that can obscure the effectiveness of cepstrum. In Chapter 4 I examined these problems in adequate detail and provided simple solutions. Such problems and simple intuitions were in fact the reason behind the development of cepsCorr and eventually multi-evidential correlation. The initial objectives were a) to improve the performance of the cepstrum and provide a confidence measure for the displacement estimates, b) to formulate a solution that overcomes these problems, and c) to eliminate or avoid the use of arbitrary thresholds and assumptions. In the next two sections I will briefly revisit these problems and show how cepsCorr overcomes them.

7 This would be a good place to remind the reader that the term cepstrum refers specifically to the power cepstrum technique.

6.3.1 CepsCorr and the (0,0) Peak:

In Chapter 4 we saw how a lack of image detail results in a dominating peak which corresponds to a null disparity. This is of course not surprising, since most techniques fail when there is not adequate high frequency information to facilitate the match. This effect can in fact be quite dominating. Figure 6.1 takes the window in the upper left corner and, using cepsCorr, matches it against a sliding window which covers the region in the right image. The corresponding match in the right window has a disparity of two rows down and one column to the right—i.e., a disparity of (2,1). The search region was specified to be (-3,3) rows and (-3,3) columns; that is, a 7x7 search region.
In the lower right hand corner, bright pixels show all the window locations where the (0,0) peak was the maximum peak value.8 Black pixels were supposed to show the points where the correct disparity was estimated. Interestingly, the (0,0) peak dominates even if the motion is only one pixel in any direction. Therefore, it should not be surprising that the actual disparity is never found. The lower left corner shows the relative (0,0) peak magnitudes. When the window is at the proper disparity the (0,0) peak increases dramatically. It is fair to question the significance of solving this problem before going any further. After all, the (0,0) peak appears in a fixed location. The problem, however, is that an area dominated by low frequency may go through a large motion and still not be detected. Validation of the (0,0) peak is also important for such tasks as figure-ground separation and motion transparency. There are basically two simple choices for this problem: a) high pass filter the image, or use image edges to calculate the disparities, or b) use a criterion to simply ignore the (0,0) peak.

8 This figure is obviously enlarged to fit the lower left hand quadrant. The actual size of this region is the size of the margin—i.e., the dimensions of the search region.

Figure 6.1: An illustration of the persistence of the (0,0) peak for a portion of an image that lacks much detail. The window in the upper left hand corner was matched against the region in the upper right using cepsCorr. The lower right shows an enlarged result, and the bright pixels show the positions (or disparities) where the (0,0) peak was the dominant peak. As can be seen, this area includes a large region within the search region. The lower left hand corner shows the relative magnitude of the (0,0) peaks; as one can see, at the proper disparity—i.e., the disparity (2,1)—the (0,0) peak increases dramatically.
As should be clear by now, cepstrum is a deconvolving filter; therefore high pass filtering the image is equivalent to adding the cepstrum of the high pass filter to the final result.9 This has the effect of reducing the peak at (0,0), and the amount subtracted from the peak depends on the parameters of the filter. Not only is the choice of these parameters arbitrary, but the cepstral values near the (0,0) peak may also be adversely affected. A logical criterion for ignoring the (0,0) peak is the ratio of its magnitude over the second largest peak. If this ratio is below a certain value one can be reasonably assured that the secondary peak is the correct disparity. There are, however, three problems with this approach. First, it again requires using an arbitrary threshold, which should be dependent on the contents of the windows. Second, this approach may fail for subpixel motions, where the magnitude of the correct peak may be distributed between two pixels. Lastly, this method again fails if multiple disparities are present (i.e., image boundary, motion transparency or reflection) and one of the disparities is in fact (0,0). CepsCorr easily solves these problems through several different means. For instance, as the moving window sweeps the search region the disparity peak should also move accordingly. If the (0,0) peak moves along with the window we can be certain that the initial displacement was null. On the other hand, if the second largest peak moves properly and increases with reduced disparity while the peak at (0,0) does not, then we can be confident that the (0,0) peak is a false match. Lastly, if both the (0,0) peak and the secondary peak move, we may be facing multiple motion (more on this topic in a later chapter).

9 This actually gets a bit more complicated, since I am not only filtering the image but also windowing it. However, the end result is basically the same.
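The rejection-ratio criterion is easy to state in code. The sketch below is my own illustration of the bookkeeping only; the threshold value is exactly the kind of arbitrary, content-dependent parameter the text warns about, and the toy cepstral plane is a made-up example:

```python
import numpy as np

def secondary_peak_trusted(plane, ratio_threshold=1.5):
    """Given a cepstral output plane whose (0,0) bin holds the null-disparity
    peak, decide whether the second largest peak should be taken as the
    disparity instead. Returns (trusted, location_of_second_peak)."""
    zero_peak = plane[0, 0]
    rest = plane.copy()
    rest[0, 0] = -np.inf                     # mask out the (0,0) peak
    second_loc = np.unravel_index(np.argmax(rest), rest.shape)
    ratio = zero_peak / rest[second_loc]
    # low ratio: the (0,0) peak does not dominate, so trust the second peak
    return ratio < ratio_threshold, second_loc

# toy plane: a weak (0,0) peak and a stronger peak at disparity (2, 1)
plane = np.zeros((7, 7))
plane[0, 0] = 0.6
plane[2, 1] = 1.0
trusted, loc = secondary_peak_trusted(plane)
```

As the surrounding text notes, this single-plane test is fragile (subpixel peaks, multiple motions); cepsCorr instead cross-checks peaks across the sliding-window iterations.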
The simplest solution, of course, is to iterate over the search region and choose the location where the (0,0) peak is maximum. This is reflected in the lower left corner of Figure 6.1, which displays the size of the (0,0) peaks (regardless of whether they are the maximum peak in the cepstral plane or not). Obviously the (0,0) peak that corresponds to the overlap of the two images dominates all other (0,0) peaks and is quite distinguishable (larger by roughly a factor of two). I will discuss this approach in more detail later on.

6.3.2 Sign Ambiguity

We saw in Chapter 4 that one of the ambiguities facing cepstral echo detection is the sign of the displacement field. That is, the disparities (δx, δy) and (−δx, −δy) give rise to the same peak location on the cepstral plane. This is of course due to the fact that the log of the power spectrum is an even function and its cosine transform is also even. This is not a real problem when dealing with stereo disparity analysis, where the direction of displacement is obvious. But when dealing with motion analysis and tracking, the sign of the displacement plays an important role. I also pointed out that one way to overcome this problem is to move one of the images up vertically (wrapping the image around) and then conduct the cepstrum. The cepstral peak should then appear with a vertical disparity near the artificial shift. The location of the peak and the amount of change in the vertical shift then give us the direction of disparity. The problem is that one again has to assume something about what constitutes a reasonable vertical shift. In the case of a verging system where the vertical displacement is minimal this technique is quite adequate [Ols93]. This is not a problem for calibrated binocular stereo, since the direction of horizontal disparity between the two cameras is known (and the vertical disparity is zero).
Even for verging binocular stereo cameras the vertical disparity is small enough that by shifting the camera upward one can produce an artificial vertical disparity and deduce the sign of the disparity vector [TOM94]. But for the more general case of motion analysis such assumptions are impractical.10 With cepsCorr the problem of guessing the sign of the displacement, or having to artificially manipulate the input data, is solved. In short, as one window slides over the second image, the cepstral peak should be moving in the opposite direction - showing how the disparity is changed. Unless, of course, the sign of the disparity is opposite of what the cepstral peak reflects. This simple geometric constraint can be used to disambiguate the sign of the disparity in the cepstrum result.

10 It should also be noted that aside from the small computational cost of manipulating data in this form, one also pays a price in lower signal to noise ratio, and additional inaccuracy at least on a subpixel level.

6.3.3 Other properties of cepsCorr

But cepsCorr's greatest advantage is that it combines the flexibility and accuracy of cepstrum with the robustness, informativeness and reliability of multi-evidential correlation. In the next section a very preliminary but fast and accurate solution to the complex problem of combining evidence is provided. The aim of the next section is to provide a stepping stone for future developments and improvements to cepsCorr and multi-evidential correlation, without pretending that the solution presented is the final, complete and optimal solution to the problem of combining disparity estimates and multi-evidential correlation.
6.4 Verification and Combination of Correlated Estimates

Fusion of information, whether from a single source - as in multi-evidential correlation - or from separate origins - as in sensor fusion - is a complicated and broad topic of research, and the subject of numerous graduate theses and research projects. As such, it would be prohibitive and outside the scope of this dissertation to cover the topic entirely and provide an optimal solution for combining the evidence for MEC. Instead, in this section I will cover alternative approaches, examine their strengths and weaknesses, and finally propose a simple yet adequate solution based on the winner-take-all scheme. The remaining approaches discussed herein are left to future development and examination. But before examining different alternatives for combining evidence, it is important to visit some of the properties of these measurements, which I have briefly pointed out before:

• For cepsCorr and other variations of MEC, one of the matching signals remains unchanged, and as a result the estimated disparities are correlated. Consequently, the simpler traditional statistical techniques based on independent measurements and estimates of disparity are theoretically inapplicable to our application.

• The uncertainty in disparity measurements due to geometric distortions, blur and noise can be reduced if multiple measurements of disparity are available and are combined properly. But finding the confidence in each measurement itself is a complicated task, and subject to error if limiting and simplifying assumptions are made.

• The detectability and accuracy of disparity measurements increase as the disparity between the two signals reduces.

• It is preferable to turn one of the disadvantages and limiting properties of our matching kernel - in our case cepstrum - to our advantage in finding the correct disparity.
As pointed out, combining evidence is a complicated task that must satisfy many objectives, including those mentioned in Chapter 2. This is particularly true when one seeks subpixel accuracy as well as confidence measurements. Below I will briefly present four alternatives, and finally choose the simplest approach, based on the winner-take-all alternative.

Voting techniques: One of the first things that comes to mind when one has multiple measurements is a voting scheme. For integral disparity measurements, in particular, each measurement can be used as a vote for a particular integer disparity in x and y. The disparities that have the highest votes can then be used to refine the motion estimation, and the number of votes can be used as the confidence measure in the disparity. This, of course, is a simple and crude form of outlier detection and robust statistics [Hub81, RL83]. Relaxation labeling by Hummel and Zucker [HZ83] can similarly be used to refine and combine the measurements. But all these techniques, particularly the outlier detection, should be used as a preprocessing method to choose the correct disparity measures. The question of how to combine the remaining correlated measurements is still an open issue. Note also that the voting mechanism can be deceivingly simple. In the case of multiple motion, as we will see later, the cepstrum produces multiple peaks. Consequently, we have to keep track of at least the top two motion estimates. Add to this the fact that a (0,0) peak may be present if the signal is of low frequency, and consequently we have to keep track of the top three peaks if the (0,0) peak is maximum. In the presence of multiple motion, each of the true disparity peaks may at one point or another become larger than the other.
Add to this the ambiguity of the sign of the disparity, and one soon finds that the problem is not as simple as it first appeared, but requires a rather elaborate bookkeeping mechanism.

Copulas: A tough choice in combining multiple pieces of evidence, even if we assume that they are independent, is what to use as the confidence measure: the peak magnitude, the rejection ratio, the signal to noise ratio, the maximum and minimum curvatures at the peak, or a combination of these. Without adequate analytical analysis or experimentation any choice would be quite arbitrary. Girod and Kuo [GK89] showed that under certain assumptions one can use the shape of the peak of the phase correlation as the probability distribution of disparity. Given the similarity of cepstrum and phase correlation, one may adopt such a scheme and use the shape of each disparity peak as the marginal probability distribution function of that particular measurement. Statistically, one can then use copulas [DKS91] in order to combine these correlated marginals. But to use copulas we have to know something about the marginal probability distributions of the correlated values and how they are correlated with one another. This requires one to make more arbitrary assumptions, which may be quite erroneous. In fact, after eliminating the outliers, one may be better off treating the remaining measurements as independent variables.

Bootstrapping: It has been shown that many traditional statistical techniques, as well as robust statistics such as cross validation, leave-one-out, and redistribution methods, can suffer from large bias and/or large variances in their measurements [JBC87]. The bootstrap method [DE83, Efr79, Efr82] is a computationally intensive method that can be used to determine the statistical properties of an estimator and to combine measurements when little is known about the underlying distribution and relationship of the available samples.
There are, however, problems with the bootstrap method. First, it is very computationally intensive, which makes its use with many disparity measurement methods practically prohibitive. Secondly, many questions about how to use the disparity values, or even whether to use the raw disparity image, how to include measurements such as the signal to noise ratio, and the effects of multiple motion still remain unanswered.

6.4.1 Max (0,0) peak and winner take all

As I have pointed out several times before, the artificial peak corresponding to (0,0) disparity occurs when the signal contains little structure and predominantly consists of low frequency values. I have already proposed a few solutions to this problem, which included band passing the signal, using hierarchical pyramids, and keeping track of the motion of the (0,0) peak in cepsCorr. Obviously, any of these solutions should be adequate to overcome the problem, but they all require additional computational and bookkeeping effort. A good question to ask is: how can one turn this problem to one's advantage and avoid the additional computational cost? The natural answer is simple. Instead of looking for the largest two or three peaks in each iteration of cepsCorr, look for the largest peak at the location corresponding to the (0,0) disparity. When the two signals match exactly (i.e., have a zero disparity) this peak will be at its largest. This is true regardless of whether the (0,0) peak is artificially high due to the existence of low frequency signal components and a lack of spatial structure. Moreover, there is no reason to worry about the sign ambiguity of cepstral disparity when we adopt this method. Simple mathematical analysis shows that there are other reasons, besides greater efficiency, to pursue this technique - especially as a preliminary solution.
Referring back to Equation 3.2 in Chapter 3, the Fourier transform of a one dimensional signal and its echo is:

F{h(x)} = F{s(x) + s(x - τ)} = S(f)(1 + e^{-2πifτ})     (6.2)

Now, if we are matching two signals s1(x) and s2(x) of length N sample points, and concatenate them for matching, we have:

F{h(x)} = F{s1(x) + s2(x - N)}
H(f) = S1(f) + S2(f) e^{-2πiNf}     (6.3)

Of course, both h(x) and H(f) have 2N sample points, and in the Fourier domain the sampling period between any two f becomes 1/2N.¹¹ Therefore, replacing f with k/2N, where k is an integer from 0 to 2N - 1, we get:

H(f) = S1(f) + S2(f) e^{-πik}
     = S1(f) + S2(f) cos(kπ)
     = S1(f) + (-1)^k S2(f)     (6.4)

¹¹If I had used ω instead of 2πf in my Fourier transform, the sampling period would become π/N. This is another version to which many people are accustomed.

Now, if s1(x) and s2(x) are exact replicas, then:

H(f) = S1(f)(1 + (-1)^k)     (6.5)

So far so good! But the next operation in cepstrum involves taking a logarithm, and for odd values of k the Fourier transform is zero. Consequently, when k is an odd integer (i.e., at every other sample point) H(k/2N) is zero, and its logarithm becomes negative infinity. This of course causes a numerical instability, and is easily overcome by replacing this value with a large negative value. What is important, however, is that we have a periodic function with a large amplitude and a periodicity of two sampling points in the frequency domain.¹² Figure 6.2 shows this effect for a two dimensional signal. The top half of the image shows the two exact windows that have been concatenated and are being matched. The lower part of the image shows the logarithm of the power spectrum of this signal. Since there is only a horizontal disparity of exactly N pixels, where N is the horizontal window width of each signal, the result of the log of the power spectrum is a set of horizontal stripes, with the black pixels representing very large negative values.
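Equation 6.5 can be checked numerically. The following sketch (window length and random seed are arbitrary assumptions) concatenates a window with an exact replica of itself and confirms that every odd-indexed Fourier sample of the length-2N signal vanishes, which is the source of the log(0) stripes.

```python
import numpy as np

# Numerical check of Equation 6.5, assuming the two concatenated windows
# are identical: odd-indexed Fourier samples of the length-2N signal
# must vanish, since 1 + (-1)^k = 0 for odd k.
rng = np.random.default_rng(0)
N = 64
s = rng.standard_normal(N)
h = np.concatenate([s, s])      # s2(x) = s1(x - N): zero disparity
H = np.fft.fft(h)               # 2N samples, f = k / (2N)

odd = np.abs(H[1::2])
even = np.abs(H[0::2])
print(odd.max())    # numerically zero (round-off level)
print(even.max())   # O(1) or larger: even samples carry the signal
```

In floating point the odd samples come out at round-off level rather than exactly zero, which is why a practical implementation still needs the large-negative-value substitution described above.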
What is significant is that the next cosine transform or power spectrum operation - which results in the cepstrum - will produce a very large peak at pixel position (0,N), which reflects zero disparity between the two signals. To show this quantitatively, I generated the one-dimensional random signal displayed in Figure 6.3. This signal was then used in cepsCorr against itself, with the search area being from -16 to 16.¹³ Figure 6.4(a) shows the magnitude of the cepstral peaks. The zero disparity peak is so much larger that the remaining peaks practically disappear after scaling. To show the effects of disparity on the peak, Figure 6.4(b) shows the logarithm of the peak magnitudes. As can be seen, the magnitudes of the peaks do in general decrease as the disparity increases. The sharp increase in the peak at 0 disparity is, however, impressive, and roughly two orders of magnitude.¹⁴ To show that the peak magnitudes do correspond to the correct disparities, the relevant portion of the cepstral filter for each cepsCorr iteration was extracted and normalized between zero and one. The cepstral results were then merged vertically into a two-dimensional signal, with the first row displaying the result of cepstrum at disparity -16 and the last row the result of cepstrum for disparity +16.

¹²There is also a negative constant added to the function, which results in a reduction in the DC component of the next Fourier transform; it does not affect the peak at the echo arrival period.

¹³A sequence of 64 pixels from the middle portion of this signal was used in the comparison. Consequently, the missing data due to the artificial disparity were random values.

Figure 6.2: Two concatenated 2D signals that match perfectly, and the logarithm of their power spectrum on the lower half. The black stripes represent very large negative values.
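A hypothetical one-dimensional version of this experiment can be sketched directly; the floor constant below plays the role of the "large negative value" substituted for log(0), and its value is an assumption.

```python
import numpy as np

def power_cepstrum(h, floor=1e-12):
    # log power spectrum; the floor stands in for the "large negative
    # value" substituted when the spectrum is (numerically) zero
    P = np.abs(np.fft.fft(h)) ** 2
    logP = np.log(np.maximum(P, floor))
    return np.abs(np.fft.fft(logP))

rng = np.random.default_rng(1)
N = 64
s = rng.standard_normal(N)
cep = power_cepstrum(np.concatenate([s, s]))

# the cepstral (quefrency) peak, ignoring the DC term, falls at N:
# the echo arrival period that corresponds to zero disparity
q = np.argmax(cep[1:]) + 1
print(q)
```

With the two windows identical, the period-two stripes in the log spectrum transform into a dominant peak at quefrency N, the one-dimensional analogue of the (0,N) peak described above.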
Figure 6.5 shows that all the cepstral peaks reflect the correct disparity, with the disparity decreasing from -16 to 0 and then increasing to +16 as the cepsCorr iterations progress. The increase in the peak magnitude is important, but for detectability one also has to check the increase or decrease in the amount of background noise. Figure 6.6 shows the change in the signal to noise ratio (S/N) of the cepstrum as a function of disparity. As we can see, the signal to noise ratio is almost constant at about 17 dB for all disparities between -16 and +16, but increases almost five fold when the disparity between the two signals becomes zero.

¹⁴For two-dimensional signals this ratio generally increases, usually to four orders of magnitude.

Figure 6.3: The random signal used for comparing cepsCorr magnitude and signal to noise ratio as the disparity increases.

In short, even though this approach does not technically combine all the evidence, because of the large reduction in memory requirements and computational cost, the greater detectability, the high accuracy, and the simplicity of the process, I have used the maximum magnitude of the (0,0) peak as a preliminary solution for utilizing cepsCorr for disparity measurement. Figure 6.7 shows the application of cepsCorr and the maximum (0,0) peak criterion. I will discuss the performance of this preliminary approach and issues related to its stability in the next chapter.

6.5 Summary / Conclusion

In this chapter I briefly introduced Multi-evidential Correlation and the additional robustness, flexibility, and accuracy that it provides. I then concentrated on cepsCorr - MEC with cepstrum as its matching kernel - and showed how some of the limitations
Figure 6.4: (a) Magnitude of the cepstral peaks, and (b) the base 10 logarithm of the peak magnitudes.

and ambiguities in cepstrum can be overcome through the use of MEC. Lastly, I introduced a simple, fast, and efficient winner-take-all approach to utilize the disparity estimates (i.e., multiple evidence) provided by cepsCorr. This method not only overcomes the problems with the sign ambiguity of cepstral disparities, but also solves the sometimes troublesome incorrect (0,0) peaks, turning this limitation of cepstrum into an advantage.

Figure 6.5: The cepstral results generated during the iterations of cepsCorr. Each row of the image shows the cepstrum result normalized between zero and one; the first row corresponds to disparity -16 and the last row to disparity +16.

Figure 6.6: Signal to noise ratio for cepstrum as a function of disparity.

Figure 6.7: Horizontal and vertical disparity fields for a natural scene using cepsCorr.

Chapter 7

Results of Multi-evidential Correlation

This chapter presents the results of applying MEC to several natural images. The two comparison kernels used with MEC are cepstrum and phase correlation. To determine the disparities, both methods use the maximum (0,0) peak heuristic presented in the last chapter. Moreover, for the sake of brevity and ease of discussion, I have nicknamed these processes cepsCorr00 and phaseCCorr00¹ respectively. The suffix 00 in the above names is to be a constant reminder of the heuristic being used.

¹Pronounced "phase C corr zero zero".
Other matching routines used for comparative purposes are the Sum of Squared Differences (SSD), or mean squared error, and the Sum of Absolute Differences (SAD), or mean absolute difference; Horn & Schunck's differential technique is also provided.² The images chosen reflect different types of motion, such as rotation, divergence, or uncalibrated stereo, as well as different types of image properties, such as a cluttered natural scene vs. a completely featureless artificial environment, dark noisy images taken with a small inexpensive camera, or changes in brightness caused by a moving light source. As is traditionally the case, the disparities are measured in integer values. More accurate subpixel analysis is another significant topic that will be addressed in future work. In fact, subpixel accuracy is particularly interesting because it should be one of the most beneficial and intrinsically useful applications of multi-evidential correlation. Having a number of subpixel estimates of displacement should provide a more accurate estimate of the motion, as well as measures of confidence in the combined result.

²All the techniques mentioned, with the exception of Horn & Schunck, are region-based search techniques. Any kind of pre or post processing operation, such as multiplying the comparison windows with Gaussians, or using regularization to smooth the image flow, can be applied equally to all these region-based methods.

The format of the results is similar to that of the last chapter. The two images being compared are shown in the upper half. The area inside the two images that is being compared is delimited by a black rectangle. The remaining narrow margins near the boundary of the two images are the extra pixels necessary for the search space and the window width of the pixels near the boundary of the compared regions.
The lower left quadrant shows the vertical disparities between the two images for each pixel (i.e., drow), and the lower right quadrant shows the horizontal disparities between the two images for each pixel (dcolumn). Darker regions show greater disparity, and the disparity results are scaled based on the size of the search regions (mentioned in the caption of each image). As we will see, this is a more informative format for the disparity display. The normal arrow-type display, which shows the disparity only at selected points on a grid of pixels on one of the images, is sometimes better looking but inadequate. This is primarily because it does not show the anomalies that may occur near object boundaries, or the effects of the window size. Lastly, as a reminder, both the cepsCorr00 and phaseCCorr00 analyses use the location of the maximum (0,0) peak for disparity measurement. As pointed out, this is a simple, fast, and relatively robust estimate, but as one would expect, the results should improve if all the disparity estimates were used together (this is particularly true for subpixel motion analysis). Nevertheless, even with this simple heuristic, cepsCorr00 proves to be more accurate than all the other techniques considered. These analyses also show possible instabilities in cepstrum and phase correlation which were not considered before. An analysis of the possible instabilities is provided, along with simple solutions for overcoming them with cepsCorr00. The next sections will present the results and the parameters used for motion analysis. The window sizes chosen for the region based techniques were usually 8x8 and 16x16, with the search region varying depending on the size of the motion. For the differential motion technique, the two main parameters to choose are the α used in the smoothing and the number of iterations.
Experience with the images shows that for the majority of the images chosen in these experiments there is no substantial change in the motion estimates past 20 iterations. The parameter α does have strong effects, as expected. These will all be shown via examples. Section 7.7 discusses the computational burden and the speed of each methodology.

7.1 Results and Comparative Analysis

7.1.1 The Hamburg Taxi Sequence

This is a popular sequence of images of a common traffic scene. I present this set of images first because, comparatively speaking, cepsCorr00 did not perform as well as usual on this set of images. In this sequence, there are objects moving in different directions and leaving or entering the scene. The images being compared are not of high quality, especially since they lack good contrast. Additionally, there are plenty of featureless objects and regions, including the road. The results (which may not seem very impressive) for the different techniques are presented. Almost all techniques distinguish four moving objects: a small object in the upper left corner (a pedestrian), two cars at the two edges of the image moving toward the center of the image, and a white taxi in the middle of the image making a right hand turn. The presence of all these objects has been verified visually. Looking at the results for different window sizes for the individual techniques, as one should expect, the larger window sizes produce less noisy and more uniform disparity values (which is primarily the effect of displacement blurring as well as greater computational stability). It is also worth remembering that, much like edge detection, region based approaches do tend to introduce ambiguities at motion boundaries, often making the moving objects look larger. Comparing these techniques for different window sizes, one sees remarkable similarity in the answers, particularly between SSD and SAD.
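For reference, the region-based search underlying the SSD and SAD comparisons can be sketched as follows; the window size, search range, and synthetic test displacement are assumptions for illustration.

```python
import numpy as np

def block_match(ref, tgt, r0, c0, win=8, search=3, metric="ssd"):
    """Integer-disparity search: return the (drow, dcol) minimizing the
    windowed SSD or SAD error around (r0, c0)."""
    patch = ref[r0:r0 + win, c0:c0 + win].astype(float)
    best_err, best_d = np.inf, (0, 0)
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            cand = tgt[r0 + dr:r0 + dr + win,
                       c0 + dc:c0 + dc + win].astype(float)
            diff = patch - cand
            err = np.sum(diff * diff) if metric == "ssd" else np.sum(np.abs(diff))
            if err < best_err:
                best_err, best_d = err, (dr, dc)
    return best_d

rng = np.random.default_rng(2)
img = rng.standard_normal((32, 32))
shifted = np.roll(img, (1, -2), axis=(0, 1))   # known displacement (1, -2)
print(block_match(img, shifted, 12, 12))
```

Neighboring windows share most of their terms, which is both what makes the fast SSD/SAD implementations mentioned below possible and why the errors of these methods tend to cluster.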
Looking at the results for the 16x16 window size, the cepsCorr00 and SSD results seem very much the same. There are two things that are worth pointing out, however. First, cepsCorr00 tends to produce smaller objects (we will also see this in later experiments); this is particularly notable in the column disparity (i.e., the lower right quadrants) of the moving object in the upper right corner and the two cars to the left and right of the image boundary. Second, cepsCorr00 produces some sporadic values near the boundary of the taxi and the other two cars. This is because cepstrum recognizes two disparities, and therefore it generates two peaks, one of which, when at (0,0), can become dominant. I will address the use of this property to detect multiple motions in the next chapter. Comparing the results for the 8x8 window sizes gives further interesting insights. First, not only are the SSD and SAD results very similar (which is expected), the majority of the noisy values seem to cluster together. In the case of cepsCorr00, however, the noise seems to break up and become more dispersed, giving it a more speckle-like effect. There are two explanations for this. First, for the SSD and SAD calculations of disparity, each pixel shares a lot of data in common with its neighboring pixels. In fact, some of the values calculated for the correlation plane are the same multiplications and additions (an observation that is used for faster, more efficient implementations of SSD and SAD). Consequently, one expects that if there are noisy disparities, they would have quite a bit in common with their neighboring values, and therefore one finds clusters of them. In the case of cepstrum, on the other hand, a non-linear operation and frequency domain analysis are used, so the same constraints do not apply. But more importantly, the
Figure 7.1: Results of the region based approaches for a 16x16 window size. The search regions were vertically from -3 to 2 and horizontally from -3 to 7 pixels.

speckled appearance of the noise in cepsCorr00 may be due to instability in the results. As we will see below, this is the case. It should also be pointed out that the areas of inaccuracy are all dark and featureless, such as portions of the road or the side of the building near the upper right corner of the image. In these areas, not only does the image lack the spatial structure needed for matching, but there is hardly any brightness or adequate signal power to reliably match the two images. But before providing a complete analysis of the possible instabilities in cepstrum, let us consider the two other techniques, phaseCCorr00 and Horn & Schunck. As can be seen, phaseCCorr00 did not fare very well on this image. This is mostly due to the susceptibility of phase to noise and, as we will see in the next section, the fact that phase correlation can also suffer from numerical instability. I will provide different solutions for this problem. As for Horn and Schunck, one has to note that, because there is no window size and search region, the two image regions are larger in both directions by a small margin. One problem with this approach is the choice of the parameter α. Three cases, α = 0.0, α = 2.0, and α = 4.0, are shown to reflect the high sensitivity of the technique to the choice of this parameter. For α = 0.0, there are really no motions visible, even after 20 iterations or more. For α = 4.0 the two cars on the left and right are missing; α = 2.0 furnishes the best answer, especially around 20 iterations. The result for 100 iterations does not show much change, and basically spreads the disparities (i.e., by weighted averaging), as expected.³
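The Horn & Schunck iteration whose α sensitivity is discussed here can be sketched minimally as follows; the derivative and averaging stencils are assumptions, and the thesis implementation may differ in both.

```python
import numpy as np

def horn_schunck(I1, I2, alpha=2.0, iters=20):
    """Minimal Horn & Schunck sketch: simple-difference derivatives and a
    4-neighbor mean (assumed stencils, not the thesis implementation)."""
    I1 = I1.astype(float)
    I2 = I2.astype(float)
    Ix = (np.gradient(I1, axis=1) + np.gradient(I2, axis=1)) / 2
    Iy = (np.gradient(I1, axis=0) + np.gradient(I2, axis=0)) / 2
    It = I2 - I1
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)

    def mean4(f):  # average of the 4 neighbors, edges replicated
        p = np.pad(f, 1, mode="edge")
        return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4

    for _ in range(iters):
        ub, vb = mean4(u), mean4(v)
        t = (Ix * ub + Iy * vb + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = ub - Ix * t
        v = vb - Iy * t
    return u, v

# synthetic check: a smooth periodic pattern shifted one pixel rightward
x = np.arange(32)
I1 = np.tile(5 * np.sin(2 * np.pi * x / 8), (32, 1))
I2 = np.roll(I1, 1, axis=1)
u, v = horn_schunck(I1, I2, alpha=0.5, iters=300)
print(u.mean())   # near 1 (rightward shift); v stays at 0
```

Consistent with the discussion above, the recovered flow depends strongly on α: large values over-smooth the field toward zero, while α = 0.0 removes the regularization entirely.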
The disparities are more localized and they are subpixel values, as expected. In fact, the results do not look bad (i.e., the three moving cars are visible, and the object in the upper right corner of the previous image also creates a small blob for α = 2.0), but they are not really accurate, because even though each car has basically uniform motion, it is the motion of boundaries and edges that makes the cars stand out - and this is exactly where the motion is incorrect (due to sharp changes in intensity). Looking at the white taxi, for instance, the motion of the roof or the body is practically like the background, i.e., (0,0), which is quite incorrect. As for the other two cars on the image boundary, they fared even more poorly. Many pixels on these two cars received a motion disparity of nearly zero (i.e., similar to the background), and the shapes of the two cars do not really resemble rectangular moving objects. In fact, for α = 4.0 these two cars completely disappeared. In addition, there are indications of motion at the corners with high intensity change (e.g., the white parked cars and the van, as well as the striped store awning in the upper right corner of the image).

³Compare the horizontal disparity on the left for the white taxi for 20 and 100 iterations. The disparities are more blurred out.

Figure 7.2: The same matching operations as in the last figure, but with an 8x8 window size.

Instability

As we saw, unlike SSD or SAD, which show inaccurate measurements in clusters, the noise in cepstrum seems more sporadic and speckle-like. Therefore, one may suspect that this speckle noise is due to some inherent numerical instabilities that may arise in cepstrum. Further investigation shows that this is in fact the case.
To show this, I chose an area on the back window of the taxi that showed a high amount of noise in the 8x8 window size disparity field calculation of cepsCorr00. The results of each cepstrum iteration of cepsCorr00 were then calculated and normalized, and all the results were then concatenated (to correspond to their locations in the search region). Figure 7.4(a) shows the result of this operation. Since the search region was -3 to 2 pixels vertically and -3 to 7 horizontally, 66 cepstrum images were collected.⁴ It is not necessary to distinguish the boundaries between the cepstral results.⁵ But one can easily note the cepstrum with vertical stripes to the left of the result. This cepstrum plane is obviously caused by a numerical instability, and corresponds to the search coordinate point -2 vertically and -3 horizontally. More importantly, Figure 7.4(b) shows the same cepsCorr00 figure except that the cepstrum fields were not normalized before they were concatenated. The final concatenated result was then normalized in order to show the relative magnitudes of the cepstral peaks. As we can see, the unstable cepstrum dominates all other cepstral values. The reason behind this is simple. Taking the log of the power spectrum of the image and its echo, we have:

log{|S(f)| (1 + cos(2πτf))}     (7.1)

where, as usual, |S(f)| is the power spectrum of the image, and τ is the disparity.

⁴The cepstrum in the upper left corner was replaced with the image used in the first iteration of cepsCorr00, so there are 65 cepstrum images present.

Figure 7.3: Results of Horn & Schunck for different values of α and numbers of iterations. The disparity range for α = 2.0 and 20 iterations (or higher) was within the proper motion region. For α = 0.0, on the other hand, the vertical disparity after 100 iterations ranged from -41.772167 to 27.433767.
Now, if at any point on the frequency plane either |S(f)| or (1 + cos(2πτf)) becomes zero or very small, the logarithm will produce a very large negative value, which can act as a negative δ function. This is followed by another power spectrum (or cosine transform, in our case). Therefore, the large negative peak will result in a sinusoid, similar to the one in Figure 7.4(b),⁶ with a very large amplitude that dominates all other cepstral results. Of course, it is not often that log{|S(f)| (1 + cos(2πτf))} will have a large negative value; additionally, not all instabilities have a large value at (0,0). Moreover, there are several easy ways to avoid these instabilities. For instance, one can filter the logarithm result, add one to the value of the power spectrum before taking the logarithm, or make sure that the signal to noise ratio of the cepstral result generating the largest (0,0) peak is above a certain threshold.⁷ But these approaches require more calculations, and can introduce additional noise or inaccuracies. Additionally, we may require extra memory for keeping the intermediate cepstrum results.

⁵In fact, one could put a boundary of uniform value between the cepstrums, but this might result in some changes to the results of Figure 7.4(b).

⁶Of course, the shape of the sinusoid - i.e., its periodicity in x and y - depends on the location of the negative δ peak in the log domain.

Figure 7.4: (a) Cepstral results generated by cepsCorr00 iterations, individually normalized and then concatenated. The cepstrum in the upper left corner was replaced with the input image to the first iteration, to show typical images being compared. (b) The concatenated cepstral results, without separate normalization. This figure shows how cepstrum's instability may dominate the cepsCorr00 result based on (0,0) peak magnitude.
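The "add one to the power spectrum before taking the logarithm" remedy mentioned above can be sketched as follows. Since log1p(P) is bounded below by zero, a spectral null can no longer inject a huge negative δ into the log domain; the floor constant and test setup are illustrative assumptions.

```python
import numpy as np

def power_cepstrum(h, stabilize=True):
    # log1p bounds the log domain below by 0, so spectral nulls cannot
    # act as large negative delta functions in the next transform
    P = np.abs(np.fft.fft(h)) ** 2
    logP = np.log1p(P) if stabilize else np.log(np.maximum(P, 1e-300))
    return np.abs(np.fft.fft(logP))

rng = np.random.default_rng(3)
s = rng.standard_normal(64)
h = np.concatenate([s, s])     # exact match: spectral nulls at odd samples

raw = power_cepstrum(h, stabilize=False)
stab = power_cepstrum(h, stabilize=True)
# both keep the echo peak at quefrency 64, but the unstabilized peak is
# inflated by the huge negative log values at the nulls
print(raw[64] / stab[64])
```

As the text notes, such remedies cost extra computation, and log1p also compresses the useful low-energy spectral detail, so the choice is a trade-off rather than a free fix.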
Given the fact that these instabilities do not occur often, and in order to keep the computational time at a manageable level, a simple median filtering of the results was used to reduce the computational burden and memory requirements. This of course is not an ideal option, and the results are probably not as impressive, but the process is very fast and robust. Figure 7.5 shows the result of median filtering on the results presented in Figure 7.2. As we can see, the 3x3 median filtering did not really change the results of SSD or SAD, but it did improve the cepsCorr00 results. In comparing the results of cepsCorr00, SSD, and SAD, one point is worth noting. Instabilities seldom arise within cepstrum; their occurrence is rather sporadic, and primarily in areas of very low signal energy and contrast - i.e., |S(f)| → 0.⁸ But since cepstrum is generally very good even in areas of low energy [CSK77], the unstable results have a speckle nature which increases the randomness (or entropy) in the resulting disparity figures. This is unlike SSD or SAD, where the errors are clustered and have a more uniform distribution. Additionally, cepstrum's instability may occur any place within the search region, and therefore the result can often be very different from the disparity of the neighboring pixels (see the speckled white and black pixel noise in Figure 7.2). This also increases the randomness of the values in the result. As a result, even though the errors in cepstrum are not very common, they may seem deceptively more distinguished and noisy than in SSD or SAD, where the regions of error are more uniform. Moreover, for small window sizes, such as 8x8, the aliasing effect is very pronounced, and in regions of low contrast or signal content the effects of instability seem more pronounced.

⁷To avoid setting arbitrary thresholds, one can require that the signal to noise ratio for the (0,0) peak be the minimum of all the cepstral results.
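The 3x3 median filtering used to suppress these isolated unstable disparities is simple enough to sketch directly; edge handling by replication is an assumption.

```python
import numpy as np

def median3x3(d):
    """3x3 median filter of a disparity field: a lone speckle pixel that
    disagrees with its neighborhood is replaced by the local median,
    while clustered errors survive largely untouched."""
    p = np.pad(d, 1, mode="edge")
    stack = [p[i:i + d.shape[0], j:j + d.shape[1]]
             for i in range(3) for j in range(3)]
    return np.median(np.stack(stack), axis=0)

field = np.ones((5, 5))
field[2, 2] = 7.0              # a lone unstable measurement
print(median3x3(field))        # the speckle is removed
```

This is exactly why the filtering helps cepsCorr00 more than SSD or SAD: the cepstral errors are isolated speckles, whereas the clustered SSD/SAD errors out-vote the median in their neighborhoods.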
⁸Also see Ludwig's discussion of the effects of straight lines and edges in cepstrum [LNN94].

Figure 7.5: Results of the 8x8 window sized registrations after 3x3 median filtering.

Padding with zeros, or windowing the cepstrums, are also options that may be used, but at the cost of additional computation. A closer look at the mathematics of phase correlation shows how one may also encounter instabilities there. For each phase correlation calculation in phaseCCorr00, if either |S1(f)| or |S2(f)| is nearly zero for a particular frequency value, then the normalized cross power spectrum S1(f)S2*(f)/(|S1(f)||S2(f)|) can take a huge value at that particular frequency. This in turn will result in a sinusoid with a large magnitude in the inverse Fourier transform, similar to what was experienced in cepstrum. Given that phaseCCorr00 provides multiple displacement estimates as well, one can always distinguish the instabilities (from their low signal to noise ratio, for instance) and consider one of the other measurements instead.

7.2 Effects of Rotation

In the previous chapters I gave references to the work on the effects of rotation and scale on cepstrum. Figure 7.6 shows the effects of rotation on the four different techniques discussed before. For a purely rotational motion, drow and dcol are simply linear functions of the distance of the pixel from the center of rotation. In these images, we can clearly see horizontal and vertical stripes present for dcol and drow respectively. As can be seen in all the images, the area of the image being rotated is the table with the plant located on it. The stationary portion is a thin stripe on the left hand side of the image. All techniques performed well, but for cepsCorr00 the blotches of error in the upper middle portion of the image are missing.
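Returning to the phase correlation instability noted above, one hypothetical guard is a small constant in the normalization so that near-zero spectral magnitudes cannot blow up; the eps value and the peak-unwrapping convention below are assumptions.

```python
import numpy as np

def phase_correlate(a, b, eps=1e-6):
    """Phase correlation with an eps-guarded normalization (a sketch,
    not the thesis implementation)."""
    A, B = np.fft.fft2(a), np.fft.fft2(b)
    cross = A * np.conj(B)
    # eps keeps the division finite where |S1(f)||S2(f)| is nearly zero
    surface = np.fft.ifft2(cross / (np.abs(cross) + eps)).real
    peak = np.unravel_index(np.argmax(surface), surface.shape)
    # unwrap circular peak coordinates into signed disparities
    return tuple(p - s if p > s // 2 else p for p, s in zip(peak, surface.shape))

rng = np.random.default_rng(4)
img = rng.standard_normal((32, 32))
print(phase_correlate(np.roll(img, (3, -2), axis=(0, 1)), img))
```

The alternative mentioned in the text, checking the signal to noise ratio of each correlation surface and falling back on another of the multiple MEC estimates, requires no such constant but costs extra bookkeeping.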
For the 8x8 images, we see that cepsCorr00 again suffers from instabilities at a few isolated points (see Figure 7.7), which can be quickly and efficiently removed using simple median filtering (Figure 7.8). After median filtering, even phaseCCorr00 performed better than SSD.

Figure 7.6: Results of rotation for a 16x16 window size.

Figure 7.7: Results of rotation for an 8x8 window size.

Figure 7.8: Results of rotation for an 8x8 window size, after median filtering.

Figure 7.9 shows the effects of rotation on the Horn & Schunck method. This is a particularly hard image for any differential technique to analyze because it is very textured; and even though the boundary is well preserved, the differential technique did not perform very well.

Figure 7.9: Horn & Schunck method for rotation; α = 2.0 and the number of iterations was 50.

7.3 Effects of Expansion/Convergent Flow

Next I investigated the effects of looming. In these figures the camera was moved toward a statue of Apollo. Again, in a divergent or convergent flow the disparities are a linear function of the distance of the image point from the focus of expansion (FOE). As we can see, for the 16x16 window size all techniques performed nicely, but both SSD and SAD had a problem in the vicinity of the ear of the statue. Moreover, there is a bit of abnormality to the left of the image, primarily due to a lack of image structure. This problem becomes more pronounced for the 8x8 window sizes. Median filtering the flows provides very nice flow fields for both cepsCorr00 and phaseCCorr00, with little or no anomalies.
Figure 7.10: Results of expansion for a 16x16 window size.

Figure 7.11: Results of expansion for an 8x8 window size.

Finally, one should examine the results of Horn and Schunck's algorithm for a diverging field. Figure 7.13 shows the results for two different α values.

Figure 7.13: Results of Horn and Schunck for the looming effect; α = 4.0 with 20 iterations, and α = 16.0 with 20 iterations.

7.4 Effects of Illumination Change

Implicit assumptions can be, and often are, more dangerous than explicitly declared ones. Many application domains in computational vision or image processing operate under differing lighting conditions. This could be due to the existence of shadows, to irradiance changes caused by the motion of the object, the light source, or the viewing angle, or simply to the inherently dynamic nature of the problem domain, as in the case of satellite imaging. The figures below⁹ show pictures of the same object taken under different lighting conditions. The lights are on the same horizontal plane as the camera, and are located so that they make angles of 60 and 30 degrees with the camera's optical axis respectively. The search region is from -7 to 7 pixels in both the horizontal and vertical directions. This search space was chosen to correspond to the results of the Horn & Schunck technique. As usual, the SSD and SAD figures are similar.¹⁰ Obviously, all pixel disparity values should be 0. The black background gives a disparity of -7 for both the vertical and horizontal disparity values.

⁹Data courtesy of Dr. Ardeshir Bagheri and Professor Robert Woodham.
This is of course sensible since all the disparity signals are the same, and the first minimum or maximum signal is chosen as the disparity. As usual the incorrect disparity values for SSD and SAD are clustered together, while for cepsCorr00 they are more sporadic. For SSD the area of the statue corresponding to the boy's hair near the upper left corner has a band of incorrect measurements of -2 pixels vertically, and where the vertical disparity is correct the horizontal one is not. In the case of cepsCorr00 the disparity mistakes are more dramatic in value, with both the horizontal and the vertical disparities incorrect, but they are very isolated in nature (i.e., one pixel at a time), showing the effects of instabilities.

Note also the occluding boundary of the statue as analyzed by cepsCorr00. The occluding boundary shows up as a 4-pixel-wide region with a disparity of +7 pixels horizontally or vertically. This results from the fact that at an occluding boundary cepsCorr00 signals the existence of multiple disparities, a topic I will address in depth in the next chapter.

(10) Needless to say, I am persistently displaying this, hoping for a change against all theoretical and practical odds!

Figure 7.14: Results of change in the direction of illumination from 60 to 30 degrees; search area (-7,7) in both the vertical and horizontal direction and 16x16 window size.

Figure 7.15: Results of illumination change for 8x8 window size (panels: cepsCorr00, SSD, phaseCCorr00, SAD).

Figure 7.16: Results of illumination change for 8x8 window size after median filtering (panels: cepsCorr00, SSD, phaseCCorr00, SAD).

Figure 7.17 shows the effects of illumination change on Horn & Schunck's differential method.
This violates one of the primary assumptions of the technique; therefore it should be expected that the range of disparities in rows and columns will be large, as shown below:

Maximum/minimum value for del row: -6.333128 5.884905
Maximum/minimum value for del col: -5.195533 5.436991

Figure 7.17: Results of illumination change with the Horn & Schunck method, for α = 20 and 20 iterations.

These are the values used to set the search space parameters for the region based methods. The fact that the surfaces are basically Lambertian and not textured helped this methodology even more, by eliminating highlights and sudden disparity changes.

7.5 Tiny Camera

It is not always possible to acquire clean (i.e., noise-free) data. Sometimes technological constraints (e.g., sending images over telephone wires, or low-bit quantized images for higher compression) or simply the price of the camera may force one to settle for noisy, dark images.(11) The images below are real images taken with a small, inexpensive CCD camera. The images are very dark and noisy. The objects and people in the scene are stationary, but the camera goes through a forward/looming and a sideways motion. The original images used for matching are displayed in Figure 7.18, but in the results they are renormalized in order to show their structure better. As expected, SSD does a much better job than any other technique, since it is optimal in uncorrelated random Gaussian noise with zero mean. In fact, as we will see in the next image, SSD and SAD perform better than cepsCorr00, but note also that the performance of these techniques drops rather dramatically with the size of the window used. I like to refer to this as the 7/11 syndrome!

Figure 7.18: Original images used in the analysis.
Figure 7.19: Effects of image quality on matching, using 16x16 windows for matching (panels: cepsCorr00, SSD, phaseCCorr00, SAD).

Figure 7.20: Results for the tiny camera images with 8x8 window size (panels: cepsCorr00, phaseCCorr00, SAD).

The figure below shows the result of the Horn & Schunck algorithm with the actual images (i.e., not normalized).

Figure 7.21: Horn & Schunck method, α = 20 after 20 iterations.

These results are so poor that I will not bother to median filter them. Instead, one should note the effects of oriented structures - such as the boundary between the ceiling and the back wall - on all the region based methods.

7.6 Binocular/Outdoor Scene

The next images (i.e., the SRI trees)(12) are particularly interesting, because not only do they represent a typical outdoor scene, but they are also uncalibrated binocular images of a complex and cluttered environment. The images are also very clean - i.e., there is little if any noise - and the range of brightness in the images is nicely exploited. SSD and SAD give similar disparity results which are rather blurred (typical of region based approaches) and not very crisp. CepsCorr00 and phaseCCorr00 provide more distinct results in the sense that the images are crisper: the disparities of the objects are less blurred, the objects are segmented better and their boundaries are better refined. There are no blotches of mistaken disparities, only simple speckles. Note that all the results improve with median filtering, particularly the phaseCCorr00 results. For this particular image, looking at the row disparities of the median filtered images can reflect the accuracy of the disparities (since these images are practically stereo pairs).

(12) Courtesy of Stanford Research Institute International.
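The median filtering used throughout this chapter is a standard order-statistic clean-up step applied to the disparity maps. A minimal sketch of the idea, using NumPy/SciPy with a synthetic disparity map standing in for the cepsCorr00 output (the array sizes and outlier values are my own illustrative choices, not data from the experiments):

```python
import numpy as np
from scipy.ndimage import median_filter

# Synthetic horizontal-disparity map: constant disparity of 3 pixels
# with a few isolated "instability" outliers, standing in for the
# sporadic errors cepsCorr00 produces at small window sizes.
disp = np.full((32, 32), 3.0)
disp[5, 7] = -7.0      # isolated wrong matches (hypothetical values)
disp[20, 13] = 7.0

# 3x3 median filtering removes isolated outliers while preserving
# region boundaries better than linear smoothing would.
clean = median_filter(disp, size=3)

print(np.abs(clean - 3.0).max())   # 0.0: the isolated outliers are gone
```

Because the median is an order statistic rather than an average, a single-pixel outlier cannot drag its neighborhood with it, which is why the speckle-like errors above disappear while genuine disparity edges survive.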
As one can see, the errors for SAD and SSD seem a bit more pronounced.

Figure 7.22: Results of binocular disparity measurements; search of (-1,1) pixels row-wise and (-1,9) pixels column-wise; 16x16 window size (panels: cepsCorr00, SSD, phaseCCorr00, SAD).

Figure 7.23: Results of binocular disparity measurements; search of (-1,1) pixels row-wise and (-1,9) pixels column-wise; 8x8 window size.

Figure 7.24: Results of binocular disparity measurements after median filtering, for 8x8 comparison window size.

Below we see the results of the Horn & Schunck algorithm with α = 2.0 after 20 iterations.

Figure 7.25: Horn & Schunck method, α = 2.0 after 20 iterations.

7.7 Computational Complexity of Motion Analysis Algorithms

I pointed out earlier that MEC's computational cost is linearly proportional to the search area and that the entire process is parallelizable. The primary operation in both cepstrum and phase correlation is the calculation of the Fourier transform, which is O(n log n). Both cepstrum and phase correlation require two Fourier transforms. But on a per-pixel basis, phase correlation requires four multiplications and two additions for the numerator, and another two multiplications (minimum), one addition and a square root for the calculation of the denominator. Therefore, there are basically six multiplications, one square root, and a final division per pixel that form the main computational burden of the process, besides the Fourier transform. CepsCos, on the other hand, requires two multiplications, one addition and a logarithm. Therefore, for the same window size, without any concatenation, power cepstrum is clearly the faster process.
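The Fourier-domain pipeline being costed here can be sketched in a few lines. The following is my own minimal reconstruction of phase correlation for a single window pair (NumPy, assuming an exact circular shift for illustration; it is not the thesis implementation), showing the two transforms plus the per-pixel normalization discussed above:

```python
import numpy as np

def phase_correlate(w1, w2):
    """Phase correlation of two equal-size windows.

    Two forward FFTs, a per-pixel cross-power product and magnitude
    normalization (the multiplications, square root and division
    counted in the text), then one inverse FFT: O(n log n) overall.
    """
    F1 = np.fft.fft2(w1)
    F2 = np.fft.fft2(w2)
    cross = F2 * np.conj(F1)                       # numerator
    R = cross / np.maximum(np.abs(cross), 1e-12)   # unit-magnitude spectrum
    return np.fft.ifft2(R).real

rng = np.random.default_rng(0)
w1 = rng.standard_normal((16, 16))
w2 = np.roll(w1, (2, 5), axis=(0, 1))              # displace by (2, 5)

surface = phase_correlate(w1, w2)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(surface), surface.shape))
print(peak)   # (2, 5): the displacement appears as a single sharp peak
```

For a pure shift the normalized cross-power spectrum is a pure phase ramp, so its inverse transform is (ideally) a delta at the displacement, which is what makes the peak so easy to locate.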
With concatenation we will have four multiplications, two additions and two logarithms per pixel of a matching window, since the number of pixels has doubled. The Fourier transform also becomes more expensive (practically twice as expensive). These programs were run on standard Sparc 2 and Sparc 10 machines, and even with the window concatenation used in cepstral analysis, the cepstrum algorithm performed faster than phase correlation. The complexities of SSD and SAD are well established, with SAD being the faster process. Horn & Schunck's algorithm was the fastest of all the programs, mainly because no search or comparison (to find the maximum or minimum) is involved.(13)

7.8 Summary/Conclusion

This chapter showed the use of cepsCorr00 and phaseCCorr00 for the matching and registration of several images. The images chosen address some of the more common image transformations, such as rotation, expansion, and translation parallel to the focal plane, plus changes in lighting conditions and image quality. Even using only the maximum (0,0) peak magnitude - a rather limited application of MEC - cepsCorr00 performed rather nicely. As was expected, and as pointed out by Yeshurun and Schwartz [YS89] and by Ludwig, Neumann and Neumann [LNN94] for illumination changes, rotation and expansion, cepsCorr00 inherits cepstrum's robustness and performs better than the other techniques considered.

(13) Even on today's fast reduced instruction set machines, comparison is one of the more expensive operations. It should also be noted that when using regular cepstrum or phase correlation (as opposed to cepsCorr00 or phaseCCorr00) the search region for cepstrum is half that of phase correlation, even with concatenation. Since I am using only the maximum (0,0) peak values, this does not come into play in these techniques.

The chapter also discussed in detail the possible instabilities of cepstrum and phase correlation, and how they can easily be overcome by the use of MEC.
These instabilities sometimes occur at occluding boundaries, a significant area of research (especially for object segmentation) and a topic that will be discussed - along with multiple motion - in more detail in the next chapter.

Chapter 8

Detection and Estimation of Multiple Disparities by Multi-evidential Correlation

8.1 Introduction

In region-based image registration methods the implicit assumption of a single uniform and constant disparity within the matching window is often violated. As pointed out graphically by Bergen, Burt, Hingorani and Peleg [BBHP90], many phenomena may result in multiple displacements in a single image neighborhood. Some of these include multiple disparities due to reflection or specularities, relative motion of background scenes and transparent surfaces (such as clouds, smoke and stained glass), and multiple motions or stereo disparities at occluding boundaries. The goal of this chapter is to address both the detection and the estimation of multiple disparities in motion and stereo registration. No a priori assumption regarding the existence or absence of multiple displacements is necessary. As a result, the problems of detection and measurement of multiple motions are decoupled. But before providing the details of this work a brief review is in order.

One of the first attempts to find multiple disparities was by Fennema and Thompson [FW79], who used a histogram of the correlation results to localize occlusions. Since then many researchers have provided more robust and interesting approaches based on a variety of displacement estimation techniques. Burt, Hingorani and Kolczynski [BHK91] use a Laplacian pyramid and an iterative
selective stabilization routine to lock onto and cancel a dominant velocity when that velocity is detectable within a frequency band provided by the Laplacian's band-pass structure. Bergen, Burt, Hingorani and Peleg [BBHP92] also used an iterative approach to determine two constant velocities present in three frames. Peleg and Irani [IP92] then built on these works and showed an application of multiple motion estimation that tracks objects through image frames and improves their appearance and resolution over time. Besides signal reconstruction and enhancement, multiple disparities are needed in object segmentation and figure-ground separation [Gri93]. Other application domains include recognition of motion transparency, and signal separation and recovery from reflection. Peleg and Rom [PR90] used an iterative approach for motion segmentation, based on constraint equations of brightness change, for images where the depth of the scene was already known or remained constant. Campani and Verri [CV92] also used the differential approach to optical flow estimation to calculate multiple motion disparities in image sequences. Little and Gillett [LG90] used a normalized correlation approach, but introduced two mechanisms to independently determine the occluding boundaries in stereo images. Darrell and Pentland [DP91] used robust statistics and temporal integration to find distinct "layers" of motion. Jepson and Black [JB93b] also used robust statistics and the optical flow gradient constraint equation to localize occlusions. Jones and Malik [JM92] proposed an elegant approach, based on the distance of vectorized response measures to orthogonal linear filters, to determine occluded regions in stereo scenes. Finally, Chen, Shirai and Asada [CSA93] also used linear spatial filtering as well as the motion constraint equation to find occluded regions.
In some works the presence of multiple disparities is assumed a priori. This is often not the case in real applications, where multiple disparities may or may not be present at all, and where, when several displacements do occur, they are often limited to certain locations in the image. Therefore, there is a need to first detect the presence of multiple displacements. Moreover, it would be nice to verify these disparities by having multiple pieces of evidence that point to their proper values. This chapter describes multi-evidential correlation (MEC) and its use to detect and verify the presence of multiple displacements (without any a priori knowledge of their existence) and to estimate differing disparities due to reflection, transparency and occlusion. The number of disparities present is not limited to two, since the matching kernels used - namely phase correlation and cepstrum filtering - provide direct estimates of multiple displacements. Furthermore, in the case of constant velocities, I will show how multi-frame analysis can improve the detection and estimation process by allowing constant disparities among frames to reinforce one another. The organization of this discussion is from general properties to specific applications. Examples based on random dot stereograms are used first, to provide a quantitative comparison between cepstrum and phase correlation. Section 8.3 presents results of multiple disparity estimation due to transparent motion, and discusses different algorithms to segment a stereo image and to detect occluded regions. Finally, I revisit the use of a multi-frame analysis technique, called multiCeps, to find constant multiple disparities due to reflection in a sequence of three images.(1)
8.2 Multi-evidential Correlation, Cepstrum and Phase Correlation

An important property of cepstrum and phase correlation is that they both produce multiple peaks in the presence of multiple motions. Figure 8.1 shows a manufactured image that displays the effects of multiple motion on phase correlation and cepstrum.

Figure 8.1: Illustration of multiple peaks due to multiple motions. (a) & (b): the images with two synthetic motions; (c): the result of analyzing the images with cepstrum.

The box in the image was moved by 3 pixels vertically and 6 pixels horizontally, while the main image was moved 10 pixels vertically and 5 pixels horizontally. These motions created the corresponding peaks in the cepstrum result, a relevant portion of which is shown topographically, with the two peaks estimating the correct disparities. This property, in conjunction with multi-evidential correlation, provides a unique approach to the detection and recognition of multiple disparities when they are present. To elaborate: when two or more disparities appear in an image, they generate two or more peaks in cepstrum and phase correlation. Since MEC sweeps a window from one frame over a region in the other, all the comparisons result in two consistently located maxima corresponding to the two displacements. The relative magnitudes of these peaks will, of course, increase and decrease as the relative overlapping areas of the matching image parts decrease or increase. But the peaks themselves persist over all or some part of the iteration. In this manner one can first detect the presence of multiple disparities and then estimate their proper values.

(1) This chapter, like some of the others, is based on a previous paper and technical report [BL93c].
The following subsections describe phase correlation and a variation of power cepstrum for the estimation of motion and binocular disparity. I defer the mathematical treatment of their behavior in the presence of multiple motion and instead provide examples based on random dot stereograms.

8.2.1 Phase Correlation and Multi-evidential Correlation

It can also be shown mathematically that if multiple disparities are present in the two windows, phase correlation results in multiple peaks corresponding to the individual disparities, plus a small residual noise. To show this effect, and how multiple peaks persist over individual iterations of MEC, a random dot stereogram with two disparity levels of 2 and 5 was generated. I then chose an area near the occluding boundary. Using phase correlation as the matching kernel, I conducted multi-evidential correlation. 16x16 windows were chosen(2) and for the correlation span (i.e., the area swept over during the comparisons) I selected 0 to 10 columns horizontally and zero columns vertically. Since the epipolar constraint was preserved, and no vertical displacement was introduced, the peaks in each phase correlation outcome appear only in the first row. Only the first row of each comparison in the multi-evidential correlation was therefore kept, and these rows were concatenated vertically. Figure 8.2 shows the result, with each row representing the outcome of one phase correlation match. The first row of the figure shows the result (i.e., the first row) of the left-most MEC comparison, and each subsequent row shows the result of the next MEC iteration as one moves to the right. As can be seen, each row contains two peaks corresponding to the two disparities.
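A 1D analogue of this experiment is easy to reproduce. The sketch below (NumPy; the signal length, random seed, and circular-shift construction are my own choices, not those of the original experiment) builds a random-dot pair whose two halves carry disparities of 2 and 5, and shows that a single phase correlation of the pair yields both peaks:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
left = rng.standard_normal(n)

# Right signal: first half displaced by 2 pixels, second half by 5,
# mimicking the two disparity levels of the random dot stereogram.
right = np.empty(n)
right[: n // 2] = np.roll(left, 2)[: n // 2]
right[n // 2 :] = np.roll(left, 5)[n // 2 :]

# Phase correlation: normalized cross-power spectrum, then inverse FFT.
cross = np.fft.fft(right) * np.conj(np.fft.fft(left))
p = np.fft.ifft(cross / np.abs(cross)).real

top2 = sorted(int(i) for i in np.argsort(p)[-2:])
print(top2)   # [2, 5]: one peak per disparity, above a small noise floor
```

Each region contributes its own phase ramp to the cross-power spectrum, so the inverse transform contains one peak per disparity; the residual energy at other lags is the "small residual noise" mentioned above.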
As one moves to the right these disparities get smaller by one pixel, as expected, and the peaks reach their respective maximum magnitudes at 0 disparity (i.e., when they reach the left-hand column).

(2) Eight-by-eight pixel patches or even smaller window sizes would also be adequate, but I chose a larger size for display purposes.

Figure 8.2: Results of MEC for stereo. (a) Each row of the figure is the first row (i.e., zero vertical disparity) of the result of one iteration of MEC with phase correlation as its matching kernel. The locations of the peaks are indicative of the disparity, with negative disparities located close to the right-most column. (b) Topographic image of figure (a), displaying the magnitudes of the phase correlation peaks.

If one continues moving to the right, the peaks reappear in the right-hand column, indicating a negative disparity, and their magnitudes again start to decrease with each iteration. What is important to note is the consistent presence of the two peaks and how their relative locations move as one iterates over an area. If such consistency persists during multi-evidential correlation, then it is easy to infer that multiple disparities due to occlusion, reflection, or transparency have occurred. As mentioned briefly, two additional conditions can be used as verification steps for multiple motion. First, as the disparities of the peaks diminish, their magnitudes should increase; and second, the distance between the two peaks is the width of the occluded region (or the disparity difference) in the image, and should remain constant. Even though the example presented here corresponds to binocular disparity (primarily for proper illustration), the same approach can be extended to the detection and
estimation of two or more motion disparities.

8.2.2 cepsCorr: Cepstrum and Multi-evidential Correlation

I next used cepstrum as the correlation kernel in the multi-evidential routine - a combination I often refer to as cepsCorr for brevity - and repeated the experiment described in the previous section. As with phase correlation, the peaks of the cepstrum appear only in the first row of each iteration, and hence the rows can be concatenated vertically. Since I use the same windowing routine described in [YS89, BL93b], and since the cepstra (or cepsCos) of real signals are symmetric, even functions, for binocular disparity the first row of the cepstrum will contain two symmetric peaks around the center column for each of the disparities present. Figure 8.3 shows the cepsCorr results for the experiment described above. Note that, as with phase correlation, as the disparity between two identical parts of the two images is reduced, the peak magnitude of their disparity increases, and the peak moves closer to the middle column (i.e., zero disparity). Comparing the two topographic maps in Figures 8.2 and 8.3 for the two matching kernels indicates that the disparity peaks generated by the cepstrum filter are larger in magnitude than those generated by phase correlation, by roughly a factor of three to one. Moreover, the experiments indicate that cepstrum results have higher signal-to-noise ratios than those of phase correlation. Other researchers [LMK89] have also shown that cepstrum performs better than phase correlation in the presence of noise. Lastly, Figures 8.4 (a) and (b) are the results of the same operations as above on the boundary of the real image depicted in Figure 8.8.
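The mechanics of the cepstral kernel can be sketched in 1D. The code below is a simplified reconstruction, not the exact routine of [YS89, BL93b]: it uses a circular shift, no tapering, and a single disparity. Concatenating the two windows turns the disparity into an echo at delay L + d (with L the window length), which the power cepstrum converts into a peak at that quefrency:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 128, 5                       # window length and true disparity
w1 = rng.standard_normal(L)
w2 = np.roll(w1, d)                 # second window: w1 displaced by d

# Concatenate the windows: the combined signal now (approximately)
# contains an echo of itself at delay L + d.
s = np.concatenate([w1, w2])
spectrum = np.abs(np.fft.fft(s)) ** 2
cepstrum = np.fft.ifft(np.log(spectrum + 1e-12)).real

# The echo peak sits just past the concatenation offset L; searching a
# band above L avoids the mirror peak of the symmetric cepstrum.
band = np.arange(L + 1, L + 17)
peak = int(band[np.argmax(cepstrum[band])])
print(peak - L)   # 5: the disparity read off from the echo arrival
```

Because the cepstrum of a real signal is even, the same peak appears mirrored on the other side of the quefrency axis, which is why the first row of each cepsCorr iteration shows two symmetric peaks per disparity.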
The estimated peaks in phase correlation, Figure 8.4 (a), even after taking into account spreading due to the curvature of the Pepsi can, were quite inaccurate, while the results of cepstrum correlation, Figure 8.4 (b), were much better suited for further analysis.

Figure 8.3: (a) The collection of binocular disparity signals generated by cepsCorr. Each row of the figure represents the result of one iteration of the multi-evidential correlation with cepstrum as its matching kernel. Each disparity is represented by two peaks which are symmetrically displaced away from the center column by the amount of the disparity. (b) Topographic image of the cepsCorr result in (a), displaying the magnitudes of the cepstrum peaks.

8.2.3 Benefits of Non-linear Matching Kernels

I have already reviewed one of the primary advantages of multi-evidential correlation - i.e., providing multiple pieces of evidence for the existence, or absence, of multiple moving patterns (as well as estimates of each motion value). But before getting into more specific results and examples, it is interesting to compare the behavior of the non-linear kernels used with the more traditional region based schemes, specifically the sum of squared differences (SSD). I chose SSD because it is one of the more popular matching methodologies, and its behavior is indicative of other similar techniques such as the sum of absolute differences (SAD) or regular correlation. To do so I created two randomly generated 1D signals with the rather diverse disparities of 1 and 13. I then used cepstrum, phase correlation and SSD to detect the presence of the two disparities and to measure their proper values. Figure 8.5 shows the results of the
Figure 8.4: (a) Topographic representation of multi-evidential correlation results for the Pepsi scene, with phase correlation as the matching kernel. (b) Result of cepsCorr for the same image area in the Pepsi scene. Note that cepsCorr generates a strong peak in the center corresponding to one of the disparities, while for phase correlation the results are very weak and lack a distinguishing structure.

comparison.(3) As we can see, both cepstrum and phase correlation have their first and second peaks at 1 and 13. The second minimum in SSD, on the other hand, does not correspond to the correct value of the second displacement. This is more than likely due to the fact that the two motion displacements were quite diverse in magnitude. Even though this result cannot be generalized to all displacements and all signals, it points to the strength of non-linear kernels in finding multiple displacements compared to most linear techniques. Additionally, it has been shown that both cepstrum and phase correlation are very robust under illumination change [Has74, KH75]. Illumination change can be a particular problem for motion transparency or reflection, where the direction of illumination or the viewing angle can change the brightness of one motion layer compared to the next.

(3) Only one half of the cepstrum result is shown, since cepstrum is symmetric.

Figure 8.5: The effects of simulated multiple disparities on: (a) sum of squared differences (mean squared error), (b) phase correlation, and (c) cepstrum. The second peaks in phase correlation and cepstrum correspond to the correct second disparity, while the second minimum in SSD does not. Using MEC the other peaks in cepstrum and phase correlation will vanish, and all but the main two peaks are rejected.

Lastly, without providing multiple pieces of evidence for the presence of several motions, using
SSD or similar techniques will require the choice of arbitrary thresholds and parameters to decide when two or more minimum values correspond to the existence of two or more moving objects. MEC, on the other hand, requires no such measures, and all but the persistent peaks are rejected.

8.3 Results

In this section MEC is applied to the detection of multiple disparities. The three examples tackled involve motion transparency, detection and localization of occluding boundaries in binocular stereo, and multi-frame analysis of constant dual disparities due to reflection.

8.3.1 Motion Transparency

Determination of the motions of two objects, where one object is transparent and in front of the other, is often referred to as the motion transparency problem. One application of this work is the matching of satellite images, where thin cloud cover or smoke can overshadow the primary matching disparity estimates. Irani and Peleg [IP92] also showed how they used the results of motion transparency to track objects behind stained glass, and improved the clarity and resolution of the object images over time.

Figure 8.6: (a) The first frame of a sequence of motion transparency images. The flower pattern is due to the stained glass, and the tripod is the object moving in the background. (b) & (c) The magnitude of MEC at (0,0) disparity for cepstrum (b) and phase correlation (c). These two figures are analogous to plotting the middle column of Figure 8.3(a) and the left-most column of Figure 8.2(a) respectively.

Figure 8.6(a) shows an image from a sequence where a tripod is moving behind a window with a picture of a flower. In the first two frames of this sequence the flower stays stationary and the tripod moves by 4 pixels horizontally (again there is no vertical motion present).
I used phase correlation and cepstrum as the matching kernels for this experiment, and then checked for maximum peaks at zero disparity - i.e., when either the tripods or the flowers in the two images overlapped. I also chose rather large windows of 128x128 pixels to ensure that both disparities were included in the analysis. The span of this routine was from -2 to 7 pixels horizontally. The results of the (0,0) peaks are shown in Figures 8.6 (b) and (c) (these correspond to the trace of the middle column in Figure 8.3(a) and the left-most column in Figure 8.2(a) respectively). The two main peaks appear at the proper locations and correspond to the offsets where either the flower or the tripod overlap. It is important to note that even though the disparity in this case was only horizontal, this is not at all a requirement: vertical as well as horizontal disparities can be found quite easily.

Figure 8.7: Detection vs. localization of occluded regions for a random dot stereogram. The area between a pair of white and black edges corresponds to the locations where multiple disparities were detected.

8.3.2 Occluding Boundary

Recognition of occluded regions in motion or stereo plays a significant role in the segmentation of three dimensional objects. Such segmentations are important for tracking, recognition, and object manipulation in robotics and automation. As has been demonstrated, in the region near an occluding boundary both phase correlation and cepstrum generate two peaks corresponding to the two disparities present. The distance between the two peaks is the width of the occluded region. But while detection and estimation of the disparities is easy, localization of the occluded regions is a more cumbersome task. The primary problem is the duality between detection and localization, and how they relate to the window size.
Large window sizes are generally better for detection and estimation, but produce greater uncertainty in the location of the occluded regions. Figure 8.7 shows the uncertainty in locating the occluded regions in a pair of random dot stereograms for two squares.

Figure 8.8: (a) & (b) A pair of stereo images of a curved object; (c) the occluding boundary of the object found using MEC.

This problem is aggravated, and the consistency of multiple peaks in MEC is jeopardized, if the window shape does not match the occluding boundary, or if one or both objects in view contain curved or slanted surface structures; this is primarily due to the fact that a curved surface will have multiple disparities within a window, which results in a blurring of the disparity peak. The natural answer to this problem is to use a smaller window size. This approach was examined by Okutomi and Kanade [OK92], who used windows with locally adaptive extents. This approach improves the localization of the occluded regions significantly, but the extent of the matching window, however small, still generates an uncertainty in the location of the occluded area. In many ways this problem, especially the trade-off between detection and localization, is analogous to the traditional edge detection problem in computer vision. It should also be pointed out here that the relative sizes of the image patches that give rise to multiple disparities are not the only factors affecting the relative peak magnitudes. Our studies show that other factors, such as the relative frequency content of the image windows, also play a significant role. Moreover, often the objects creating an occluding boundary are similar in nature, making the detection and localization of such boundaries even more difficult.
This is why random dot stereograms are not a good representative problem. An engineering solution that will work with real scenes is to first find edges in the image4 and then determine whether each edge is an occluding boundary or a surface feature, based on the existence or absence of strong multiple peaks along the edge [GP87]. The difference between the two disparities indicates the width of the occluded area. Instead, different direct methodologies for the detection of occluding boundaries were examined, including voting schemes and thresholding of relative peak magnitudes.

Figure 8.9: (a) One of the three images with two constant disparities caused by reflection. The three frames were produced by looking at the Escher drawing with the reflection of the person in the glass, moving the camera a constant distance each time and acquiring two more frames. (b) The magnitude of the (0,0) peaks caused by the overlap of individual multiple disparities. Note that the graph shows two maxima due to the two disparities.

The final approach to solving the occlusion problem is similar to a technique used by Fua [Fua91] to verify correct disparity calculations. That is, using MEC I selected the larger of the two dominant peaks and centered a stationary window in the second image at that location. I then tried to find the disparity for this window in the first image using MEC. It is easy to show geometrically that the verification procedure fails inside an occluded region. In fact, in the occluded regions the secondary peak in the first pass becomes the dominant peak in the second. Figure 8.8 shows a pair of stereo images and the occluded regions found by the above technique.

4 Obviously with an edge finder that has good localization characteristics such as [RHKvdH92].
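The forward-backward verification just described can be sketched with a deliberately simple 1-D SSD matcher standing in for MEC; all function names and window parameters here are my own illustrative choices, not the thesis's:

```python
import numpy as np

def best_shift(win, row, center, max_d):
    """1-D SSD block matching: best horizontal shift of `win`,
    compared against `row` around index `center`, within +/- max_d."""
    h = len(win) // 2
    best, best_err = 0, np.inf
    for d in range(-max_d, max_d + 1):
        lo = center + d - h
        seg = row[lo:lo + len(win)]
        if lo < 0 or len(seg) < len(win):
            continue                      # window falls off the signal
        err = np.sum((seg - win) ** 2)
        if err < best_err:
            best, best_err = d, err
    return best

def consistent(left, right, x, half=8, max_d=10):
    """Match left->right at x, then match the found location back;
    away from occlusions the two shifts cancel."""
    win = left[x - half:x + half + 1]
    d = best_shift(win, right, x, max_d)
    win2 = right[x + d - half:x + d + half + 1]
    d_back = best_shift(win2, left, x + d, max_d)
    return d + d_back == 0, d
```

Inside an occluded region the backward match locks onto the other surface, `d + d_back` is non-zero, and the point is flagged, which is exactly how the verification fails where it should.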
The thickness of the occluded regions can also be determined from the distance between the two disparities. Where the image lacks features for matching (such as the lower part of the Pepsi can), this procedure also shows a lack of consistency in its results. One way to reduce the noise in this result is to verify the existence of multiple peaks for all occluded-pixel candidates, which unfortunately was not done for this experiment.

8.3.3 MultiCeps and Multiple Motion Due to Reflection

In [BBHP92] Bergen, Burt, Hingorani and Peleg used an iterative technique to determine two constant motion disparities from three frames. As we can see in Figure 8.9, the magnitudes of the (0,0) disparity peaks in cepsCorr also indicate the presence of at least two motions in the three frames.

Figure 8.10: A portion of the cepstrum result for multi-frame analysis.

In this section a multi-frame analysis technique called multiCeps [BL93b] was used to estimate the constant disparities among the three frames. To do this I concatenated the three frames5 and conducted cepstral analysis. It is easy to show mathematically that in multiCeps the constant disparities between the first and second frame and the disparity between the second and third frame reinforce one another [BL94]. The same result holds for the disparity between the first and third frame. Figure 8.10 shows a small portion of the multiCeps result.

Figure 8.11: The results of analyzing Figure 8.9(a) for multiple echoes: the difference images between frames 1 and 2 after disparity compensation for each of the two estimates. Note how the reflection of the person in the glass is now easily detected as compared to the original image.

Close examination of these results shows three peaks corresponding to the three horizontal disparities at -3.4, -0.6 and 6.7, using linear interpolation between the peaks.
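Sub-pixel estimates such as these come from interpolating around the integer peak of the cepstral trace. The thesis reports linearly interpolated values; the sketch below uses the closely related three-point parabolic fit, with a function name of my own choosing:

```python
import numpy as np

def subpixel_peak(c):
    """Refine the integer argmax of a correlation/cepstrum trace by
    fitting a parabola through the peak and its two neighbours."""
    p = int(np.argmax(c))
    if p == 0 or p == len(c) - 1:
        return float(p)                  # no neighbour on one side
    denom = c[p - 1] - 2 * c[p] + c[p + 1]
    if denom == 0:
        return float(p)                  # flat top: keep integer peak
    return p + 0.5 * (c[p - 1] - c[p + 1]) / denom
```

When the samples really do lie on a parabola around the maximum, the fractional peak location is recovered exactly; in practice it refines a cepstral or correlation peak to a fraction of a pixel.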
The first two peaks correspond to the multiple disparity measures (of the Escher print and the reflection) between the first frame and the second frame, reinforcing similar disparities between frames two and three. The last disparity corresponds to one of the two disparities between frame one and frame three. The vertical disparities between these images are also zero. Figure 8.11 shows the outlines of the two images generated by simple subtraction of the first two frames after proper disparity realignment. The procedure above shows how multi-frame analysis can be utilized in estimating constant multiple disparity measurements.

5 Actually I used a 341 by 256 sub-image of the three figures.

8.4 Conclusion

In this section I discussed a direct method for detection and estimation of multiple disparities due to occlusion, motion transparency, and reflection using multi-evidential correlation. No a priori assumptions about the existence or the number of disparities were made. The two matching kernels used, cepstrum and phase correlation, both generate multiple peaks in the presence of multiple motions. When these peaks persist between different iterations of multi-evidential correlation, and they are consistent in the measurement of disparities, the existence of multiple displacements is assured. Also addressed was the use of multi-frame analysis and cepstrum to estimate multiple constant disparities over time. Both cepstrum and phase correlation performed well, but cepstrum had a higher signal-to-noise ratio and an overall better performance than phase correlation. A significant fact that should be pointed out here is that motion transparency and reflection often cause blurring in the image. Both cepstrum and phase correlation are robust to blurring, and in fact cepstrum is often used as a deblurring mechanism.
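To make the chapter's central mechanism concrete before moving on: a power cepstrum converts an echo at delay d into a sharp peak at quefrency d, which is what both kernels exploit. A minimal one-dimensional sketch; the function name and the circular echo are my own simplifications:

```python
import numpy as np

def power_cepstrum(x, eps=1e-12):
    """Inverse FFT of the log power spectrum: an echo at delay d
    appears as a sharp peak at quefrency d."""
    spectrum = np.abs(np.fft.fft(x)) ** 2
    return np.real(np.fft.ifft(np.log(spectrum + eps)))

rng = np.random.default_rng(1)
x = rng.standard_normal(256)
signal = x + 0.6 * np.roll(x, 40)    # circular echo at delay 40
ceps = power_cepstrum(signal)
# skip the low quefrencies dominated by the signal's own spectrum
delay = np.argmax(ceps[10:128]) + 10  # estimated echo arrival period
```

The log turns the multiplicative echo factor in the spectrum into an additive cosine ripple, whose inverse transform is the peak at the echo delay; two overlapping motions or disparities simply contribute two such peaks.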
Chapter 9

Concluding Remarks/Overview and Future Directions

The objective of this dissertation can be separated into two related topics. The first is to develop a broadly applicable framework in order to address a variety of computational vision tasks through a common methodology. This goal is partially motivated by a number of papers on the neurophysiology and psychophysics of biological visual systems, and partially by discussions on the significant directions in computer vision [Pri86, Ric88]. The result was visual echo analysis, a framework which was developed through reformulation of various topics in computational vision as detection and estimation of echo arrival periods in time and space. Using the strengths and versatility of cepstral filtering as the echo detection mechanism, visual echo analysis was then successfully applied to:

• measurement of visual motion, stereopsis and stationary texture segmentation (topics discussed under visual flow analysis in [Ric88]),

• trinocular stereo analysis,

• multi-frame and 3D spatio-temporal motion analysis - for constant or variable velocity,

• detection and estimation of multiple disparities (e.g., motion transparency, reflection, or occluding boundaries),

• multiple baseline stereo,

• boundary symmetry analysis,

• scale and rotation invariant binary image registration,

• and matching and registration of complex (i.e., real and imaginary) signals - as in the case of Magnetic Resonance Images.

It was the development of this framework that led to the extensive analysis of phase correlation and cepstral filtering. These studies resulted in the development of cepsCos and diffCepsSin, which provide both computational and performance improvements to the traditional power and differential cepstrum respectively. The second objective of this dissertation involved a dominant area of computational vision and signal processing: matching and registration.
Many subfields of computer vision, from low-level motion or stereo analysis to higher-level recognition or structure from motion, rely on accurate registration of features and data. Having studied the traditional techniques - their strengths, shortcomings and underlying assumptions - and drawing on the properties of cepstrum, phase correlation and Hadamard based techniques, I introduced Multi-Evidential Correlation (MEC). The strengths of this approach are three-fold:

1. It provides multiple and thus verifiable estimates of individual disparities. Determination of confidence measures for disparities, and mechanisms to verify displacements, is a significant and desirable attribute which is not always attainable in computational vision.

2. It provides a strong alternative to the traditional linear filtering methods, as well as additional flexibility in the choice of its matching kernel. MEC can utilize the non-linear filtering techniques of phase correlation and cepstral filtering, or the linear and extremely fast Hadamard based technique as its matching routine.

3. In its present form, it was formulated to avoid any (or at least as many) assumptions, whether analytic (e.g., smoothness of ...) or numeric (e.g., thresholds); this of course was a self-imposed constraint and objective.

Chapter 7 provides an extensive comparison of MEC, using cepstrum and phase correlation, with the more common motion analysis techniques. Even the most primitive form of MEC - i.e., cepsCorr00 - produced better results than the traditional methods, especially in the presence of rotation, looming and changes in illumination.

9.1 Future Directions

The only list longer than the list of things I have done is the list of things I would like to do. - anon.

There are numerous topics and subjects that one can further develop based on the new material presented here.
Among the most important topics is the combining of evidence for MEC. To this end, I have already generated preliminary results for combining different MEC evidences, based on a rather involved voting mechanism for cepsCorr. The results of these experiments are very promising and shed light on other aspects of cepstrum and MEC. Using Hadamard-based matching kernels, or improving phaseCCorr, are two other topics that can be pursued in the future. As for visual echo analysis, it clearly provides very fertile ground for future research and development. Each topic discussed under this framework - such as motion-stereo analysis, or 3D spatio-temporal analysis - can be extended and further studied. Moreover, there is a wealth of other topics that may be approached by the same framework; for instance, 2D or 3D trinocular motion, or motion-stereo analysis for vergence control. Added to these is the combination of numerous other methods, such as hierarchical matching schemes or a variety of pre-, mid-, and post-processing operations that may be combined with cepstrum to improve its performance. It is the author's sincere hope that this work is both an original contribution to the field and a stepping stone for other researchers toward future developments. E.B.B.

Bibliography

[AA91] B. Armstrong and N. Ahmed. A method of cepstral enhancement for detection and parameter estimation of echoes. Asilomar Conference on Circuits, Systems and Computers, pages 160-164, 1991.

[AA92] B. Armstrong and N. Ahmed. A method of cepstral enhancement for detection and parameter estimation of echoes. IEEE Transactions on Signal Processing, pages 160-164, 1992.

[AB85] E. H. Adelson and J. R. Bergen. Spatiotemporal energy models for the perception of motion. Journal of Optical Society of America, 2(2):284-299, 1985.

[AB86] E. H. Adelson and J. R. Bergen.
The extraction of spatiotemporal energy in human and machine vision. In Proceedings of IEEE Motion Workshop, pages 151-155, 1986.

[Ada86] J. Adams. Conceptual Blockbusting. Addison-Wesley, Cambridge, Mass, 1986.

[AN88] J. K. Aggarwal and N. Nandhakumar. On the computation of motion from sequences of images - a review. Proceedings IEEE, 76(8):917-35, Aug 1988.

[Ana85] P. Anandan. A review of motion and stereopsis research. Technical Report COINS-85-52, University of Massachusetts at Amherst, December 1985.

[Ana87] P. Anandan. Measuring image motion from image sequences. Technical Report COINS-87-21, University of Massachusetts at Amherst, 1987.

[Ana89] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:283-310, 1989.

[Ana94] D. Anastassiou. Digital television. Proceedings IEEE, 82(4):510-519, April 1994.

[AR84] J. Altmann and H. J. P. Reitbock. A fast correlation method for scale- and translation-invariant pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:46-57, 1984.

[AW84] P. Anandan and R. Weiss. Computing dense displacement fields with confidence measures in scenes containing occlusion. In SPIE Intelligent Robots and Computer Vision Conference, volume 521, pages 184-194, 1984.

[AW85] P. Anandan and R. Weiss. Introduction of smoothness constraint in a matching approach for the computation of optical flow fields. In IEEE Workshop on Computer Vision, pages 186-194, Michigan, 1985.

[BA83] P. Burt and E. Adelson. The Laplacian pyramid and the compact image code. IEEE Transactions on Communication, 31:532-540, 1983.

[BAHH92] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In Proceedings of European Conference in Computer Vision, pages 237-252, 1992.

[Bak82] H. Baker. Stereo vision systems.
In Proceedings of International Conference on Systems, Man and Cybernetics, pages 322-326, 1982.

[BAK91] R. Battiti, E. Amaldi, and C. Koch. Computing optical flow across multiple scales: an adaptive coarse-to-fine strategy. International Journal of Computer Vision, 6(2):133-45, June 1991.

[Bar84] J. Barron. A survey of approaches for determining optic flow, environmental layout and egomotion. Technical Report RBCV-TR-84-5, University of Toronto, November 1984.

[Bar85] Y. Barniv. Dynamic programming solution for detecting dim moving targets. IEEE Transactions on Aerospace and Electronics Systems, AES-21(1):144-156, Jan 1985.

[BB82] C. Brown and D. Ballard. Computer Vision. Prentice Hall, Englewood Cliffs, N.J., 1982.

[BBHP90] J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. Computing two motions from three frames. In Proceedings of IEEE International Conference on Computer Vision, pages 27-32, Osaka, Japan, Dec 1990.

[BBHP92] J.R. Bergen, P.J. Burt, R. Hingorani, and S. Peleg. A three-frame algorithm for estimating two-component image motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(9):886-896, Sept. 1992.

[BF82] S. Barnard and M. Fischler. Computational stereo. Computational Surveys, 14(4):553-572, 1982.

[BFB94] J.L. Barron, D.J. Fleet, and S.S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43-79, February 1994.

[BFBB92] J.L. Barron, D.J. Fleet, S.S. Beauchemin, and T.A. Burkitt. Performance of optical flow techniques. In Proceedings of Conference on Computer Vision and Pattern Recognition, pages 236-242, 1992.

[BHK91] P.J. Burt, R. Hingorani, and R.J. Kolczynski. Mechanism of isolating component patterns in the sequential analysis of multiple motion. In Proceedings of IEEE Workshop On Visual Motion, pages 187-193, Princeton, New Jersey, Oct. 1991.

[BHT62] B.P. Bogert, M.J.R. Healy, and J.W. Tukey. The quefrency analysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. Proceedings of Symposium on Time Series Analysis, 1962.

[BL83] P. Bolon and J. L. Lacoume. Speed measurement by cross correlation - theoretical aspects and applications in the paper industry. IEEE Transactions on Acoustics, Speech and Signal Processing, 31:1374-1378, 1983.

[BL92a] E. Bandari and J. Little. Cepstral analysis of optical flow. Technical Report TR 92-6, University of British Columbia, 1992.

[BL92b] E. Bandari and J. Little. Spatial-Quefrency approach to optical echo analysis. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 850-852, Urbana-Champaign, Illinois, 1992.

[BL92c] E. Bandari and J. Little. Visual echo analysis. Unpublished report, 1992.

[BL93a] E. Bandari and J. Little. Cepstral methods in computational vision. In Nonlinear Image Processing IV; Proceedings SPIE, pages 266-277, San Jose, CA, February 1993.

[BL93b] E. Bandari and J. Little. Visual echo analysis. In Proceedings of Fourth International Conference in Computer Vision, pages 220-225, Berlin, Germany, May 1993.

[BL93c] E. Bandari and J.J. Little. Detection and estimation of multiple disparities by multi-evidential correlation. Technical Report TR 93-38, University of British Columbia, 1993.

[BL93d] E. Bandari and J.J. Little. Multi-evidential correlation and visual echo analysis. Technical Report TR 93-1, University of British Columbia, 1993.

[BL94] E. Bandari and J. Little. Cooperative analysis of multiple frames by visual echoes. In IEEE International Conference on Image Processing, pages III:766-770, Austin, Texas, 1994.

[BO66] B.P. Bogert and J.F. Ossanna. The heuristics of cepstrum analysis of a stationary complex echoed gaussian signal in stationary gaussian noise. IEEE Transactions on Information Theory, IT-12(3):373-380, July 1966.

[Bra74] O. Braddick. A short range process in apparent motion. Spatial Vision, 14:519-527, 1974.

[Bro92] L. Brown.
A survey of image registration techniques. ACM Computing Surveys, 24(4):326-376, December 1992.

[BS72] D. Barnea and H. Silverman. A class of algorithms for fast digital registration. IEEE Transactions on Computers, C(21):179-186, 1972.

[BT80] S. Barnard and W. Thompson. Disparity analysis of images. IEEE Transactions in Pattern Analysis and Machine Intelligence, 2(4):333-340, July 1980.

[BW85] J. Bednar and T. Watt. Calculating the complex cepstrum without phase unwrapping or integration. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(4):1014-1017, August 1985.

[BXL94] E. Bandari, Q. Xing, and J. Little. Visual echo registration of magnetic resonance images. In AAAI Spring Symposium on Applications of Computer Vision in Medical Image Processing, pages 38-41, Stanford, California, 1994.

[BYX82] P. Burt, C. Yen, and X. Xu. Local correlation measures for motion analysis: a comparative study. In IEEE Proceedings of Pattern Recognition and Image Processing, pages 269-274, 1982.

[BYX83] P. Burt, C. Yen, and X. Xu. Multi-resolution flow through motion analysis. In IEEE Conference on Computer Vision and Pattern Recognition, pages 246-252, 1983.

[CB92] D. Coombs and C. Brown. Real-time smooth pursuit tracking for a moving binocular robot. In Proceedings of Computer Vision and Pattern Recognition, pages 23-28, 1992.

[CCM84] B. Chanda, B. B. Chaudhuri, and D. Dutta Majumder. Application of least square estimation technique for image restoration using signal-noise correlation constraint. IEEE Transactions on Systems Man and Cybernetics, 14:515-519, 1984.

[CK83] N. Cornelius and T. Kanade. Adapting optical flow to measure object motion in reflectance and x-ray image sequences. In Proceedings of ACM Siggraph/Sigart Interdisciplinary Workshop on Motion, pages 50-58, Toronto, Ontario, 1983.

[Cla90] R.J. Clark. Basic principles of motion estimation and compensation.
In IEE Colloquium on Applications of Motion Compensation, pages 1/1-1/7, October 1990.

[CM87] E. De Castro and C. Morandi. Registration of translated and rotated images using finite Fourier transform. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):700-703, Sept. 1987.

[CM89] P. Cavanagh and G. Mather. Motion: the long and short of it. Spatial Vision, 4(2-3):103-129, 1989.

[COP90] P. Concetti, G. Orlandi, and F. Piazza. An enhanced image registration algorithm by Burg filter. In Proceedings of International Conference in Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages 2053-6. IEEE, 1990.

[Cro83] M. Crombie. Coordination of stereo image registration and pixel classification. Photogrammetric Engineering Remote Sensing, 49(4):529-532, April 1983.

[CSA93] H. J. Chen, Y. Shirai, and M. Asada. Obtaining optical flow with multi-orientation filters. In Proceedings of International Conference in Computer Vision and Pattern Recognition, pages 736-738, New York, New York, 1993. IEEE.

[CSK77] D.G. Childers, D.P. Skinner, and R.C. Kemerait. The cepstrum: a guide to processing. Proceedings of IEEE, 65(10):1428-1443, October 1977.

[CV90] M. Campani and A. Verri. Optical flow from an overconstrained system of linear algebraic equations. In IEEE International Conference on Computer Vision, pages 22-26, Osaka, Japan, December 1990.

[CV92] M. Campani and A. Verri. Motion analysis from first order properties of optical flow. CVGIP: Image Understanding, 56(1):90-107, July 1992.

[CWC91] N. Cui, J. Weng, and P. Cohen. Motion and structure from long stereo sequences. In Proceedings of IEEE Workshop on Visual Motion, pages 75-80, 1991.

[DA89] U. R. Dhond and J. K. Aggarwal. Structure from stereo - a review. IEEE Transactions on Systems, Man and Cybernetics, 19(6):1489-1510, Dec 1989.

[DA91] U.R. Dhond and J.K. Aggarwal. A cost-benefit analysis of a third camera for stereo correspondence.
International Journal of Computer Vision, 6(1):39-59, April 1991.

[DE83] P. Diaconis and B. Efron. Computer intensive techniques in statistics. Scientific American, pages 116-127, 1983.

[DKS91] G. Dall'Aglio, S. Kotz, and G. Salinetti. Advances in probability distributions with given marginals. Kluwer Academic Publishers, 1991.

[DP91] T. Darrell and A. Pentland. Robust estimation of a multi-layered motion representation. In Proceedings of the IEEE Workshop on Visual Motion, pages 173-8, Princeton, NJ, 1991.

[Dud77] D.E. Dudgeon. The computation of two-dimensional cepstra. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-25(6):476-484, December 1977.

[ea93] S. Rao et al. A real-time P*64/MPEG video encoder chip. In Proceedings of International Solid-State Circuits Conference - ISSCC 93, pages 32-33, San Francisco, CA, February 1993.

[Efr79] B. Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 7:1-26, 1979.

[Efr82] B. Efron. The jackknife, the bootstrap, and other resampling plans. In CBMS-NSF Regional Conference in Applied Mathematics, number SIAM-38, 1982.

[EL93] J. Ens and Z. Li. Realtime motion stereo. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 130-135, New York, New York, 1993.

[FC89] Chang-Wu Fu and Shyang Chang. A motion estimation algorithm under time-varying illumination. Pattern Recognition Letters, 10(3):195-199, Sept 1989.

[FJ90a] D. J. Fleet and A. D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5(1):77-104, 1990.

[FJ90b] D.J. Fleet and A.D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5:77-104, 1990.

[FL94] P. Fua and Y. Leclerc. Registration without matching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 121-128, Seattle, Wash, June 1994.

[Fle92] D.
Fleet. Measurement of image velocity. Kluwer, 1992.

[Fra91] W. Franzen. Structure and motion from uniform 3D acceleration. In Proceedings of IEEE Workshop on Visual Motion, pages 14-20, New York, New York, Oct. 1991.

[Fua91] P. Fua. A parallel stereo algorithm that produces dense depth maps and preserves image features. Technical Report 1369, INRIA, January 1991.

[FW79] C.L. Fennema and W.B. Thomson. Velocity determination in scenes containing several moving objects. Computer Graphics and Image Processing, 9:301-315, 1979.

[GCT94] A. Giachetti, M. Campani, and V. Torre. The use of optical flow for autonomous navigation. In European Conference in Computational Vision, pages 148-151, Stockholm, Sweden, April 1994.

[GGB84] A. Goshtasby, S. H. Gage, and J. F. Bartholic. A two-stage cross correlation approach to template matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:374-378, 1984.

[Gir87] Bernd Girod. The efficiency of motion-compensating prediction for hybrid coding of video sequences. IEEE Journal on Selected Areas in Communications, SAC-5(7), August 1987.

[GK89] B. Girod and D. Kuo. Direct estimation of displacement histogram. In Image Understanding and Machine Vision, pages 73-76, 1989.

[Gla84] F. Glazer. Multi-level relaxation in low level computer vision. In E. Rosenfeld, editor, Multiresolution Image Processing and Analysis, pages 312-330. Springer-Verlag, 1984.

[Gla87] F. Glazer. Hierarchical gradient based motion detection. In Image Understanding Workshop, pages 733-748, Los Angeles, CA, Feb 1987.

[GP87] E. Gamble and T. Poggio. Visual integration and detection of discontinuities: The key role of intensity edges. A.I. Memo No. 970, MIT-AI, October 1987.

[GR80] I. Gradshteyn and I. Ryzhik. Table of integrals, series and products. Academic Press, New York, N.Y., 1980.

[Gri80] W. E. Grimson. Computing shape using a theory of human stereo system. Technical report, MIT Department of Mathematics, June 1980.
[Gri93] W. E. L. Grimson. Why stereo vision is not always about 3D reconstruction. AI-Memo-1435, MIT-AI, July 1993.

[GV93] M. Gothe and J. Vaisey. Improving motion compensation using multiple temporal frames. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, pages 157-160, New York, New York, May 1993.

[GW93] R. Gonzalez and R. Woods. Digital Image Processing. Addison-Wesley Publishing Company, 1993.

[Has74] J.C. Hassab. Time delay processing near the ocean surface. Journal of Sound and Vibration, 35(4):489-501, 1974.

[Has75] J.C. Hassab. Analysis of signal extraction, echo and removal by complex cepstrum in the presence of distortion and noise. Journal of Sound and Vibration, 40(3):321-335, 1975.

[HB76] J.C. Hassab and R. Boucher. A probabilistic analysis of time delay extraction by the cepstrum in stationary gaussian noise. IEEE Transactions on Information Theory, 22(4):444-53, 1976.

[HB78] J. Hassab and R. Boucher. Improved cepstrum performance through the windowing of log spectrum. Journal of Sound and Vibration, 58(4):597-598, 1978.

[HCL91] P. He, Y. Cai, and D. Liang. On cepstrum approach to sequential image matching. In IEEE China 1991 International Conference on Circuits and Systems, pages 797-800, Shenzhen, China, June 1991.

[Hee87] D. Heeger. Optical flow from spatio-temporal filters. Proceedings of 1st International Conference in Computer Vision, pages 181-190, 1987.

[Hee91] Joachim Heel. Temporal surface reconstruction. Ph.D. Dissertation, MIT Artificial Intelligence Laboratory, 1991.

[Hil83] E. C. Hildreth. Measurement of visual motion. MIT Press, Cambridge, Mass, 1983.

[HLHM92] J. Huang, S. Liu, M. H. Hayes, and R. M. Mersereau. Multi-frame pel-recursive algorithm for varying frame-to-frame displacement estimation. In Proceedings of IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), volume 3, pages III-241, 1992.

[Hor83] B. K. P. Horn.
Non-correlation methods for stereo matching. Photogrammetric Engineering Remote Sensing, 49(4):535-536, April 1983.

[Hor86] B. K. P. Horn. Robot Vision. MIT Press, Cambridge, Mass, 1986.

[HS81] B.K.P. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-204, 1981.

[Hub81] P. Huber. Robust Statistics. Wiley, New York, New York, 1981.

[HZ83] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5:267-287, 1983.

[IP92] M. Irani and S. Peleg. Image sequence enhancement using multiple motions analysis. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 216-221, Champaign, Illinois, June 1992.

[J91] B. Jahne. Digital Image Processing. Springer-Verlag, 1991.

[JB93a] Allan Jepson and Michael Black. Mixture models for optical flow computation. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 760-761, New York, New York, June 1993.

[JB93b] Allan Jepson and Michael Black. Mixture models for optical flow computation. Technical Report 93-44, RBCV, November 1993.

[JBC87] A. Jain, R. Dubes, and C. Chen. Bootstrap technique for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(5):628-633, Sept 1987.

[JJ94] M. R. Jenkin and A. D. Jepson. Recovering local surface structure through local phase difference measurements. CVGIP: Image Understanding, 59(1):72-93, January 1994.

[JL93] D. Jones and D. Lamb. Analyzing the visual echo: passive 3-d imaging with a multiple aperture camera. Technical Report CIM-93-3, McGill University, Feb 1993.

[JM92] David Jones and Jitendra Malik. A computational framework for determining stereo correspondence from a set of linear spatial filters. In Proceedings of Second European Conference on Computer Vision, 1992.

[Kam93] J. Kam. A Real-Time 3D Motion Tracking System. University of British Columbia, Dept. of Comp.
Sci. M.Sc. Thesis, 1993.

[KD92] J. Konrad and E. Dubois. Bayesian estimation of motion vector fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(9):910-927, Sept 1992.

[KH75] C.D. Kuglin and D.C. Hines. The phase correlation image alignment method. Proceedings of IEEE International Conference on Cybernetics and Society, pages 163-165, 1975.

[KM92] A. Kher and S. Mitra. Registration of noisy SAR imagery using morphological feature extractor and 2D cepstrum. In SPIE Conference on Applications of Digital Image Processing XV, volume 1771, pages 281-291, 1992.

[KON92] T. Kanade, M. Okutomi, and T. Nakahara. A multi-baseline stereo method. In DARPA Image Understanding Workshop, pages 409-426, 1992.

[KP93] Y. Kim and K. Price. An integrated motion analysis system guided by feedback information. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 130-135, New York, New York, 1993.

[KY90] Y. Kitamura and M. Yachida. Three-dimensional data acquisition by trinocular stereo. Advanced Robotics, 4(1):29-42, 1990.

[LBP88] J.J. Little, H.H. Bulthoff, and T. Poggio. Parallel optical flow using local voting. Proceedings of International Conference on Computer Vision, pages 454-459, December 1988.

[LC88] S.H. Lai and S. Chang. Estimation of 2-d translation via Hadamard transform. In J.L. Lacoume, A. Chehikian, N. Martin, and J. Malbos, editors, Signal Processing IV: Theories and Applications. Proceedings of EUSIPCO-88, Fourth European Signal Processing Conference, volume 1, pages 35-37. North-Holland, Sept. 1988.

[LC94] J. Lawn and R. Cipolla. Robust egomotion estimation from affine motion parallax. In European Conference in Computational Vision, pages 205-210, Stockholm, Sweden, April 1994.

[LE84] J. Lin and E. Eisner. A review of homomorphic deconvolution. Reviews of Geophysics and Space Physics, 22(3):255-263, August 1984.

[LG90] J. J. Little and W. E. Gillett.
Direct evidence of occlusion in stereo and motion. In European Conference in Computational Vision, pages 336-340, Antibes, France, April 1990.

[LK81] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of Eight International Joint Conference on Artificial Intelligence, pages 674-679, 1981.

[LKM87] D.J. Lee, T.F. Krile, and S. Mitra. Digital registration technique for sequential fundus images. Proceedings of SPIE on Applications of Digital Image Processing X, SPIE-829:293-300, 1987.

[LKM88] D. Lee, T. F. Krile, and S. Mitra. Power cepstrum and spectrum techniques applied to image registration. Applied Optics, 27(6):1099-1106, March 1988.

[LL88] M. Lamnabhi and J. J. Lhuillier. Motion compensated time rate conversion of video sequences. In Proceedings of 2nd International Workshop on Signal Processing of HDTV, 1988.

[LM75] J. Limb and J. Murphy. Estimating the velocity of moving objects from television signals. Proceedings of Computer Graphics and Image Processing, 4:311-327, 1975.

[LMK88] D.J. Lee, S. Mitra, and T.F. Krile. Power cepstrum in registration of noisy images. Electronic Imaging, 2:953-956, 1988.

[LMK89] D.J. Lee, S. Mitra, and T.F. Krile. Analysis of sequential complex images using feature extraction and two-dimensional cepstrum techniques. Journal of Optical Society of America, 6(6):863-870, 1989.

[LNN92] K.O. Ludwig, H. Neumann, and B. Neumann. Local stereoscopic depth estimation using ocular stripe maps. In Proceedings of Second European Conference on Computer Vision, pages 373-377, 1992.

[LNN94] K.O. Ludwig, H. Neumann, and B. Neumann. Local stereoscopic depth estimation. Image and Vision Computing, 12(1):16-35, Jan 1994.

[LRM91] D.J. Lee, M. Ramirez, and S. Mitra. Fast 2-d Hartley transform in 3-d object representation and recognition.
In Intelligent Robots and Computer Vision; Proceedings SPIE, volume 1608, pages 302-312, 1991.
[Man94] R. Manmatha. Framework for recovering affine transformation using points, lines or image brightness. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 141-146, 1994.
[Mat92a] L. Matthies. Passive stereo range imaging for semi-autonomous land navigation. Journal of Robotic Systems, 9(6):787-816, September 1992.
[Mat92b] L. Matthies. Stereo vision for planetary rovers: stochastic modeling to near real-time implementation. International Journal of Computer Vision, 8(1):71-91, July 1992.
[MKS89] L. Matthies, T. Kanade, and R. Szeliski. Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision, 3:209-236, 1989.
[MM91] M. Mahowald and C. Mead. The silicon retina. Scientific American (International Edition), 264(5):76-82, May 1991.
[MO92] R. Manmatha and J. Oliensis. Measuring the affine transform - I: recovering scale and rotation. Technical Report Comp. Sci. TR 92-74, University of Massachusetts at Amherst, Amherst, MA, 1992.
[Moh81] N. C. Mohanty. Computer tracking of moving point targets in space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(11):606-611, September 1981.
[Mor80] H. Moravec. Obstacle avoidance and navigation in the real world by a seeing robot rover. Technical Report 340, Stanford AI Memo, 1980.
[MP77] D. Marr and T. Poggio. A theory of human stereo vision. AI-Memo-451, MIT-AI, Cambridge, MA, 1977.
[MS86] L. Matthies and S. A. Shafer. Error modelling in stereo navigation, 1986.
[NA94] S. Niyogi and E. Adelson. Analyzing and recognizing walking figures. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 469-474, 1994.
[Nag83a] H. Nagel. Constraints on the estimation of displacement fields from image sequences. In Proceedings of the Eighth International Joint Conference on Artificial Intelligence, pages 945-951, August 1983.
[Nag83b] H. Nagel. Displacement vectors derived from second order intensity variations in image sequences. Computer Vision, Graphics, and Image Processing, 21:85-117, 1983.
[Nag86] H. Nagel. On the estimation of dense displacement maps from image sequences. In Proceedings of ACM Motion Workshop, pages 59-65, Toronto, Ontario, 1986.
[Nag94] H. H. Nagel. Optical flow estimation: advances and comparisons. In European Conference on Computer Vision, pages 51-60, Stockholm, Sweden, April 1994.
[Nak85] K. Nakayama. Biological motion processing: a review. Vision Research, 25:625-660, 1985.
[Nas94] P. Nasiopoulos. Error Resistant Schemes for All-Digital High Definition Television. 1994.
[Nis84] H. Nishihara. PRISM: a practical real-time imaging stereo matcher. AI-Memo-780, Cambridge, MA, 1984.
[NK92] T. Nakahara and T. Kanade. Experiments in multiple-baseline stereo. Technical Report CMU-CS-93-102, Carnegie-Mellon University, August 1992.
[NM93] C. L. Nikias and J. M. Mendel. Signal processing with higher order spectra. IEEE Signal Processing Magazine, pages 10-37, July 1993.
[NP88] C. L. Nikias and R. Pan. Time delay estimation in unknown Gaussian spatially correlated noise. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(11), November 1988.
[NR79] A. Netravali and J. Robbins. Motion compensated television coding: part I. Bell Systems Technical Journal, 58:631-670, 1979.
[NR87] C. L. Nikias and M. R. Raghuveer. Bispectrum estimation: a digital signal processing framework. Proceedings of the IEEE, 75(7):869-892, July 1987.
[NY93] S. Negahdaripour and C. Yu. A generalized brightness model for computing optical flow. In Proceedings of International Conference on Computer Vision, pages 2-11, Berlin, Germany, May 1993.
[OC90] T. J. Olson and D. J. Coombs. Real-time vergence control of binocular robots. Technical Report TR 348, University of Rochester, June 1990.
[OC91] T. Olson and D. Coombs.
Real-time vergence control for binocular robots. International Journal of Computer Vision, 7(1):67-81, 1991.
[OK92] M. Okutomi and T. Kanade. A locally adaptive window for signal matching. International Journal of Computer Vision, 7(2):143-162, January 1992.
[OL81] A. V. Oppenheim and J. S. Lim. The importance of phase in signals. Proceedings of the IEEE, 69(5):529-541, May 1981.
[OL92] T. Olson and R. Lockwood. Fixation based filtering. In Proceedings of SPIE Intelligent Robots and Computer Vision IX, pages 685-696, 1992.
[Ols93] T. Olson. Stereopsis and verging systems. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 55-60, New York, New York, 1993.
[OSS68] A. V. Oppenheim, R. W. Schafer, and T. G. Stockham. Non-linear filtering of multiplied and convolved signals. Proceedings of the IEEE, 56(8):1264-1291, August 1968.
[PAF79] A. Polydoros, W. Au, and A. Fam. Shift invariant homomorphic filtering. In Proceedings of Twenty-Second Midwest Symposium on Circuits and Systems, 1979.
[PE] K. Pahlavan and J. Eklundh. A head-eye system for active purposive computer vision. Technical Report CVAP-80, Royal Institute of Technology, Sweden.
[PE93] K. Pahlavan and J. Eklundh. Heads, eyes and head-eye systems. International Journal of Pattern Recognition and Artificial Intelligence, 7(1):33-49, February 1993.
[Pea79] L. G. Peardon. Aspects of cepstral analysis. Ph.D. Thesis, Dept. of Mathematics, Portsmouth Polytechnic, 1979.
[PF90] B. Porat and B. Friedlander. A frequency domain algorithm for multiframe detection and estimation of dim targets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(4):398-401, April 1990.
[PFTV88] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes in C. Cambridge University Press, Cambridge, England, 1988.
[PJGK77] J. J. Pearson, D. C. Hines Jr., S. Golosman, and C. D. Kuglin. Video-rate image correlation processor.
In Proceedings of SPIE on Applications of Digital Image Processing, pages 197-205, San Jose, CA, August 1977.
[Pol87] G. Polya. How to solve it: a new aspect of mathematical method. Princeton University Press, Princeton, NJ, 1987.
[PR73] T. Poggio and W. Reichardt. Considerations on models of movement detection. Kybernetik, 13:223-227, 1973.
[PR90] S. Peleg and H. Rom. Motion based segmentation. In Proceedings of IEEE Tenth International Conference on Pattern Recognition, pages 109-113, Osaka, Japan, June 1990.
[Pra78] W. Pratt. Digital Image Processing. Wiley-Interscience, New York, New York, 1978.
[Pri86] K. Price. Anything you can do I can do better (not you can't). Computer Vision, Graphics, and Image Processing, 36:387-391, 1986.
[Pri90] K. Price. Multi-frame feature-based motion analysis. In Proceedings of IEEE International Conference on Pattern Recognition, pages 114-117, Atlantic City, NJ, 1990.
[PSW93] M. Platzner, C. Steger, and R. Weiss. Performance measurements on a multi-DSP architecture with TMS320C40. In Proceedings of the Fourth International Conference on Signal Processing Applications and Technology, pages 1144-1152, Santa Clara, CA, October 1993.
[PUE92] K. Pahlavan, T. Uhlin, and J. Eklundh. Fixation by active accommodation. In Intelligent Robots and Computer Vision XI: Algorithms, Techniques and Active Vision, volume SPIE-1825, pages 670-684, November 1992.
[Rei57] W. Reichardt. Autokorrelations-Auswertung als Funktionsprinzip des Zentralnervensystems. Z. Naturforsch., 12b:448-457, 1957.
[Rei61] W. Reichardt. Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In W. Rosenblith, editor, Sensory Communication, pages 303-314. MIT Press, 1961.
[RHKvdH92] L. Rosenthaler, F. Heitger, O. Kübler, and R. von der Heydt. Detection of general edges and key points. In Proceedings of Second European Conference on Computer Vision. Springer-Verlag, May 1992.
[Ric88] W. Richards. Natural Computation. MIT Press, Cambridge, Mass., 1988.
[RL83] P. Rousseeuw and A. Leroy. Robust regression and outlier detection. Wiley, New York, New York, 1983.
[Rom75] R. Rom. On the cepstrum of two-dimensional functions. IEEE Transactions on Information Theory, pages 214-217, March 1975.
[RR87a] G. Reddy and V. Rao. On the computation of complex cepstrum through differential cepstrum. Signal Processing, 13(1):79-83, July 1987.
[RR87b] G. R. Reddy and V. V. Rao. Signal delay and waveform estimation through differential cepstrum averaging. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(10):1487-1489, October 1987.
[RU84] D. R. Reddy and R. Unbehauen. The n-dimensional differential cepstrum. In V. Cappellini and A. G. Constantinides, editors, Digital Signal Processing - 84, pages 145-148. Elsevier Science Publishers, 1984.
[RU85] D. Raghuramireddy and R. Unbehauen. The two-dimensional differential cepstrum. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-33(4):1335-1337, October 1985.
[RY90] K. R. Rao and P. Yip. Discrete cosine transform: algorithms, advantages, and applications. Academic Press, Boston, Mass., 1990.
[SA90] M. Spetsakis and J. Aloimonos. A multi-frame approach to visual motion perception. International Journal of Computer Vision, 6(3):245-255, 1990.
[SC94] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 194-201, Seattle, Wash., June 1994.
[Sch88] B. Schunck. Image flow: fundamentals and algorithms. In W. Martin and J. Aggarwal, editors, Motion Understanding: Robot and Human Vision, pages 23-68. Kluwer Academic Publishers, 1988.
[Sco86] G. Scott. Four line method for locally estimating optical flow. Image and Vision Computing, 5(2):67-72, May 1986.
[Sco88] G. Scott. Local and global interpretation of moving images. Morgan Kaufmann Publishers, London, England, 1988.
[SD89] M. Steckner and D. Drost. Fast cepstrum analysis using the Hartley transform. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(8):1300-1302, 1989.
[Sin91] A. Singh. Optical flow computation, a unified approach. IEEE Computer Society Press, Los Alamitos, CA, 1991.
[Ski74] D. P. Skinner. Real-time composite signal decomposition. Ph.D. Thesis, Dept. of Electrical Eng., University of Florida, 1974.
[SM90] M. Shizawa and K. Mase. Simultaneous multiple optical flow estimation. In Proceedings of IEEE 10th International Conference on Pattern Recognition, pages 274-278, Atlantic City, New Jersey, June 1990.
[SMA76] M. Svedlow, C. McGillem, and P. Anuta. Experimental examination of similarity measures and preprocessing methods used for image registration. In The Symposium on Machine Processing of Remotely Sensed Data, pages 4A-9, Westville, Indiana, June 1976.
[Spi74] M. Spiegel. Fourier analysis, with applications to boundary value problems. McGraw-Hill, New York, NY, 1974.
[SR78] M. Silvia and E. Robinson. Use of kepstrum in signal analysis. Geoexploration, 16:55-73, 1978.
[SS93] M. Swain and M. Stricker. Promising directions in active vision. International Journal of Computer Vision, 11(2):109-126, 1993.
[SW90] L. Spillmann and J. Werner. Visual perception, the neurophysiological foundations. Academic Press, 1990.
[Tav86] S. Tavakkoli. Application of cepstrum techniques to location of acoustic sources in the presence of reflective surfaces. M.Sc. Thesis, Dept. of Mechanical Engineering, Virginia Polytechnic Institute and State University, 1986.
[TB87] W. Thompson and S. Barnard. Lower level estimation and interpretation of visual motion. Computer, 20:20-28, 1987.
[TG92] M. K. Tsatsanis and G. B. Giannakis. Object and texture classification using higher order statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):733-750, July 1992.
[TH86] Q. Tian and M. N. Huhns.
Algorithms for subpixel registration. Computer Vision, Graphics and Image Processing, 35:220-233, August 1986.
[The91] P. Thevanaz. Motion analysis. In Proceedings of the 37th Scottish Summer School in Physics, pages 129-166, 1991.
[Tho87] G. A. Thomas. Television motion measurement for DATV and other applications. Technical Report 1987-11, BBC Research Department, 1987.
[Tis94] M. Tistarelli. Multiple constraints for optical flow. In European Conference on Computer Vision, pages 61-70, Stockholm, Sweden, April 1994.
[TM94] W. Theimer and H. Mallot. Phase-based binocular vergence control and depth reconstruction using active vision. CVGIP: Image Understanding, 60(3):343-358, 1994.
[TOM94] J. Taylor, T. Olson, and W. Martin. Accurate vergence control in complex scenes. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 540-545, Seattle, Washington, 1994.
[Tri77] J. M. Tribolet. A new phase unwrapping algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-25(1):170-179, December 1977.
[Ull83] S. Ullman. Visual routines. AI-Memo-723, MIT-AI, Cambridge, MA, 1983.
[WA83] A. Watson and A. Ahumada. A look at motion in the frequency domain. Technical Report NASA Technical Memorandum 84352, NASA Ames Research Center, 1983.
[WA85] A. B. Watson and A. J. Ahumada. Model of human visual-motion sensing. Journal of the Optical Society of America, 2(2):322-342, 1985.
[WC90] G. A. W. West and T. A. Clark. A survey and examination of subpixel measurement techniques. In Close Range Photogrammetry Meets Machine Vision; Proceedings SPIE, volume 1395, pages 456-463, 1990.
[WD86] A. Waxman and J. Duncan. Binocular image flow: steps toward stereo-motion fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):715-729, November 1986.
[WF88] S. F. Wu and G. M. X. Fernando. Comparative study of two motion estimation.
In Proceedings of IEE Third International Conference on Image Processing and Its Applications, volume 307, pages 305-309. IEE, 1988.
[WH84] R. Wong and E. Hall. Sequential hierarchical scene matching. IEEE Transactions on Computers, 27(4):359-366, 1984.
[WM93] J. Weber and J. Malik. Robust computation of optical flow in a multi-scale differential framework. In Proceedings of International Conference on Computer Vision, pages 12-20, Berlin, Germany, May 1993.
[WS86] A. Waxman and S. Sinha. Dynamic stereo: passive ranging to moving objects from relative image flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(4):406-412, July 1986.
[WY91] F. Wang and P. Yip. Cepstrum analysis using discrete trigonometric transforms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 39(2):538-541, February 1991.
[YCC92] J. C. Yen, F. J. Chang, and S. Chang. A new architecture for motion-compensated image coding. Pattern Recognition, 25(4):357-366, April 1992.
[YS89] Y. Yeshurun and E. L. Schwartz. Cepstral filtering on a columnar image architecture: a fast algorithm for binocular stereo segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), July 1989.