Multi-Image Matching using Invariant Features by Matthew Alun Brown B.A., Cambridge University, 2000 M.Eng., Cambridge University, 2000 A THESIS S U B M I T T E D IN P A R T I A L F U L F I L M E N T O F T H E REQUIREMENTS FOR T H E D E G R E E OF Doctor of Philosophy in T H E F A C U L T Y O F G R A D U A T E STUDIES (Computer Science) The University of British Columbia July 26, 2005 © Matthew Alun Brown 2005 Abstract This thesis concerns the problems of automatic image stitching and 3D modelling from multiple views. These are basic problems of computer vision, with applications in robotics, architecture, industrial inspection, surveillance, computer graphics and film. Recent work has brought increasing automation to these tasks, but despite a large amount of progress, state-of-the-art algorithms still require some form of user input or assumptions about the image sequence. For example, the best image stitchers currently require an ordered set of input images, or user input to identify the matching images, before automatic registration can proceed. In this work we show how such tasks can be performed automatically and without any user input at all. We formulate the multi-image matching problem as one of finding all matching images, subject to the constraint that they are consistent views from a perspective camera. We use invariant features as a mechanism for finding correspondences, and indexing techniques to efficiently find matches between multiple views. We then find all sets of geometrically consistent feature matches, using a probabilistic model for verification. This allows us to identify each object or scene in the dataset using only the structure already present in the data. The major contributions of this thesis are the development of a system that can automatically recognise and stitch 2D panoramas in unordered image datasets, and a new class of invariant features for this purpose. iii Contents Abstract ii Contents iii L i s t of Tables vii L i s t of F i g u r e s viii Acknowledgements 1 2 xi Introduction 1 1.1 Motivation 1 1.2 Approach 3 1.3 Contributions 4 1.4 Outline of Thesis 9 M u l t i - S c a l e O r i e n t e d Patches 10 2.1 Introduction 10 2.2 Multi-Scale Oriented Patches 11 2.3 Interest Points 13 2.3.1 Adaptive Non-Maximal Suppression 15 2.3.2 Sub-Pixel Accuracy 16 2.3.3 Repeatability 18 2.4 Orientation 20 2.5 Analysis of Interest Point Extraction 20 2.6 Feature Descriptor 22 2.6.1 Illumination Invariance 25 2.6.2 Haar Wavelet Transform 26 2.7 Feature Matching 29 Contents 3 2.7.1 Feature-Space Outlier Rejection 29 2.7.2 Spatial Variation of Errors 34 2.7.3 Position and Orientation Errors for Correct Matches 35 2.8 Feature Indexing 37 2.9 Analysis of Descriptor Sampling 41 2.9.1 Patch Refinement 41 2.9.2 Image Downsampling and Bilinear Interpolation 42 2.9.3 Comparison to SIFT features 43 2.10 Summary 48 A u t o m a t i c P a n o r a m i c Image S t i t c h i n g 50 3.1 Introduction 50 3.2 Feature Matching 51 3.3 Image Matching 53 3.3.1 Robust Homography Estimation using R A N S A C 53 3.3.2 Probabilistic Model for Image Match Verification 54 3.4 4 iv Bundle Adjustment 55 3.4.1 60 Fast Solution by Direct Computation of the Linear System . . . 3.5 Automatic Panorama Straightening 60 3.6 Gain Compensation 61 3.7 Multi-Band Blending 64 3.8 Results 69 3.9 Summary 70 E v a l u a t i o n of Image S t i t c h i n g A l g o r i t h m s 74 4.1 Introduction 74 4.1.1 Image Stitching Algorithms 75 Evaluation of Image Stitching Algorithms 75 4.2.1 Dealing with Failed Matches 77 4.3 Creating Gold Standard Stitching Results 80 4.4 Experimental Setup 80 4.4.1 Comparison of Automatic Stitching Algorithms 80 4.4.2 Parameter Tuning for Automatic Stitchers 81 4.2 4.5 Results 81 Contents 4.6 5 6 Summary v 86 3 D Object R e c o g n i t i o n a n d R e c o n s t r u c t i o n 88 5.1 Introduction 88 5.2 Feature Matching 89 5.3 Image Matching 89 5.4 Bundle Adjustment 91 5.4.1 Sparse Bundle Adjustment 92 5.4.2 Graphical Model Interpretation 97 5.4.3 Sparsity in the Graphical Model 98 5.5 Results 99 5.6 Summary Conclusions 103 106 Bibliography 108 A Multiple View Geometry 120 A.l Camera Models 120 A. 1.1 Pinhole Camera 120 A. 1.2 Projective Camera 121 A.1.3 Orthographic Camera 122 A . 1.4 Affine Camera 123 A.2 Viewing a Plane 123 A. 2.1 Homography 123 A.2.2 Affine Transformation 124 A.2.3 Similarity Transformation 125 A.3 Hierarchy of 2D Transformations 125 A.4 Canonical Frames 126 A.5 Multi-View Constraints 127 A.5.1 Panoramic Geometry 128 A.5.2 Epipolar geometry 129 A.6 Plane (at Infinity) Plus Parallax 131 A.7 Axis-Angle Rotations 132 Contents B Image Datasets vi 136 vii List of Tables 2.1 Indexing on wavelet coefficients vs pixel values 41 2.2 Effect of pyramid downsampling on feature matching 43 2.3 Comparison of M O P S and SIFT feature matching 44 viii List of Figures 1.1 Multi-image matching 2 1.2 Panorama recognition 5 1.3 Fully automatic 2D image stitching 7 2.1 Multi-Scale Oriented Patches (MOPS) extracted at five pyramid levels 12 2.2 Isocontours of popular interest point detection functions 15 2.3 Adaptive non-maximal suppression 17 2.4 Computation of derivatives for sub-pixel accuracy 18 2.5 Repeatability of interest points with and without sub-pixel correction . 19 2.6 Repeatability vs accuracy for M O P S 21 2.7 Repeatability of M O P S at 5 pyramid levels (Matier dataset) 23 2.8 Repeatability of M O P S at 5 pyramid levels (Van Gogh dataset) 2.9 M O P S descriptor sampling . . . . 24 25 2.10 M O P S sample spacing R O C 26 2.11 Corresponding M O P S 1 27 2.12 Corresponding M O P S 2 28 2.13 Feature space outlier rejection 32 2.14 Distributions of matching error (Abbey dataset) 33 2.15 Thresholding based on outlier distance 34 2.16 Spatial error variation across the patch 35 2.17 M O P S location/orientation error distribution 36 2.18 Distributions of matching error (Matier, Van Gogh datasets) 37 2.19 High and low contrast features 1 38 2.20 High and low contrast features 2 39 2.21 Haar wavelet coefficients 40 2.22 Patch refinement with different alignment models 42 2.23 Comparison of SIFT and M O P S features 1 45 List of Figures ix 2.24 Comparison of SIFT and M O P S features 2 46 2.25 Matching mistakes 47 3.1 Image alignment according to a homography 56 3.2 Recognising panoramas 57 3.3 Finding the up-vector 61 3.4 Automatic panorama straightening 62 3.5 Gain compensation 63 3.6 Images and blend weights in spherical coordinates 66 3.7 Multi-band blending 67 3.8 Multi-band blending V S no blending 68 3.9 Multi-band blending V S linear blending 68 3.10 Panorama recognition 2 71 3.11 Fully automatic 2D image stitching 2 72 4.1 Evaluation function 76 4.2 Generating synthetic ground truth 78 4.3 Panorama stitching with and without 360° wrap-around 78 4.4 Solution of focal length for panoramas of 360°, 315° and 180° 79 4.5 Elfin sequence from the synthetic dataset 82 4.6 Comparison of AutoStitch and MSRStitch for the Alaska sequence . . . 82 4.7 Comparison of AutoStitch and MSRStitch using the synthetic dataset . 84 4.8 Comparison of AutoStitch and MSRStitch using the real dataset . . . . 84 4.9 Comparison of AutoStitch and MSRStitch with variable overlap and scale 85 4.10 Tuning of a and a parameters 85 4.11 Tuning of number of features extracted per image 86 5.1 Finding sets of consistent matches using SIFT and R A N S A C 90 5.2 Graphical model for bundle adjustment 97 5.3 Sparse and fully connected bundle adjustment problems 100 5.4 Fully automatic object recognition and 3D reconstruction 101 5.5 Fully automatic structure and motion (SAM) estimation 102 A.l The pinhole camera 121 A.2 The orthographic camera 122 A.3 Canonical reference frame 127 List of Figures x A.4 Panoramic geometry 128 A.5 Epipolar geometry 129 A.6 Plane (at infinity) plus parallax 133 A . 7 Axis-angle rotations 134 B. l Synthetic dataset , . . 136 B.2 Real dataset 137 B.3 Van Gogh dataset (pure rotation) 138 B.4 Matier dataset 138 B.5 Abbey dataset 139 B.6 Dash point dataset 140 B.7 Times Square dataset 141 xi Acknowledgements Firstly, I would like to thank David Lowe for sharing his considerable wisdom in this field, and providing invaluable guidance in the supervision of this thesis. I would also like to thank the other members of my supervisory committee: Nando de Freitas, Jim Little and Wolfgang Heidrich for their insights and inspiration. I am most grateful to my colleagues and friends at U B C for many interesting discussions: Bob Woodham, Dinesh Pai, Jesse Hoey, Don Murray, Adrian Secord, Arthur Louie, Alex Stevenson, Chris Majewski, Mark McCann, Mike Cline, Doug James, Paul Kry, Dave Burke, Ben Forsyth, Peter Carbonetto, Jacek Kisynski, Hendrik Kueck and Matthew Trentacoste. I also, had the opportunity to work with a great many imaginative and inspirational people during internships at Microsoft Research. It was a pleasure to work with Andrew Blake, Phil Torr and Patrick Perez at Cambridge, and with Richard Szeliski, Matt Uyttendaele and Simon Winder in Redmond. For extra-curricular distractions, I am deeply indebted to Steph Durocher and Reid Holmes, with whom I have spent the best of times. I would also like to thank Zosia Bornik for her patient encouragement, support and many Friday night dinners. Finally, thanks to M u m and Dad, for encouraging me in the pursuit of my dreams, and providing a loving family in which to do so. 1 Chapter 1 Introduction 1.1 Motivation The ability to deduce information about the world from multiple views is a fundamental capability of both human and machine vision systems. Such systems typically employ sensors (eyes or cameras) that lose information in the projection from 3D to 2D, and rely on a processing unit (the brain or computer) that combines information from multiple views to create the impression of a visual world. A n example of this in humans is foveal vision [Ray98]. Most of us are not immediately aware that the high-resolution portion of the retina (the fovea) has an angular range of only about 2 degrees - approximately the size of a thumb held at arms length. Yet, we perceive a much larger field of view of up to 135 x 200 degrees. This is enabled by rapid eye movements called saccades, during which the eyes move with angular velocities up to 500 degrees per second. In between the saccades are fixations, where we focus our foveal vision on a point. Though a typical task such as reading would involve fixations of 200-300 ms between saccades of 30 ms or so, the brain is able to assimilate all this information and create the illusion of a single, immersive, high-resolution scene. Another basic capability of human vision is the ability to perceive depth. Since the eyes are separated in space, each receives a slightly different image, and the difference in position (disparity) of corresponding points in these images can be used to judge depth . 1 Both of these processes have been mimicked in machine vision systems where they are known as panoramic [Mil75, BK01] and stereo [MP79, SS02] vision. The basic ideas can also be extended to multiple views and were first implemented in computers by the photogrammetry community for the purposes of aerial cartography [Bro58, Sla80]. A distinction is made between problems where the imagery is essentially 2-dimensional, and may be combined into a larger composite image (a process known as image align1 T h i s was first recognised by Euclid in 280 A . D . Chapter 1. Introduction Figure 1.1: The multi-image matching framework. The objective is to operate on an unordered database of images, and find all the matching images. Subsets of matching images can be combined into 3D models or panoramic views as appropriate. ment or stitching [Sze04]), and cases when the imagery is truly 3-dimensional. The latter case is distinguished by the fact that the images exhibit parallax (depth dependent motion), which can be used to deduce the camera motion and structure of the scene (known as structure and motion estimation [HZ04]). Both problems also have many compelling applications. Image stitching can be used to create beautiful panoramic mosaics, which are often viewed interactively for applications such as virtual tourism, or to provide backdrops in films and video games. Camera tracking and 3D structure estimation are used extensively in the visual effects industry, for video shot stabilisation, and for modelling and visualisation in areas as diverse as archaeological digs and crime scenes. Computer vision has brought increasing automation to such problems, with several commercial offerings [Che95, R E A , 2D3] in addition to an extensive research literature [Sze04, HZ04, BTZ96, PolOO] devoted to automatic image stitching and 3D modelling from multiple views. Despite much recent progress, state-the-art systems for image registration [REA] and camera tracking [2D3] still require assumptions about the image sequence or user input to define matching images. The main theme of thesis is that such models can be discovered automatically in a database of images using the structure of the data only. No prior information is required about the camera parameters or image sequence other than mild assumptions about the nature of projection. Chapter 1. 1.2 Introduction 3 Approach A basic task in vision is the correspondence problem. This is the task of finding points in different images that are corresponding in the sense that they are projections of the same point in the world. While humans perform this activity without conscious thought, it has proven to be a difficult problem in computer vision. Several approaches to image correspondence have been proposed. Some are iterative, for example tracking of object contours [BI98], and some are non-iterative, such as indexing using invariants [MZ92]. Another distinction is between techniques that are direct [IA99] (using all image information) and feature-based [TZ99] (using a sparse representation of the image). In this work we adopt a non-iterative, feature based approach to image matching. It is well known that the first stages in the visual cortex involve the detection of lowlevel features such as edges [HW62]. However, more recently Tanaka [Tan97] and others have shown that certain cells in the inferotemporal cortex respond only to more complex (and irreducible) features. Feature detection in computer vision has followed a parallel path, with early work in edge [Can86] and corner [Har92] detection being augmented with more complex and distinctive image features [Low99, M C U P 0 2 , Lin98, KZB04]. Our choice of image features follows this trend, and will be discussed in more detail in chapter 2. Our approach to image matching also differs from more traditional methods in that we see matching as an operation on multiple images and not just an image pair. The problem can be stated as follows: Given an unordered set of images and a geometric matching constraint, find all subsets of matching images We will mainly be interested in the cases of a) stationary but rotating cameras, and b) and moving cameras viewing rigid scenes, for which it is possible to a) stitch panoramas and b) generate 3D models (see figure 1.1). There are several advantages of multi-image matching over the traditional pairwise approach: Complexity. B y representing a collection of images as a database of features, we can pose the image matching problem as an all nearest neighbours problem [GMOO], which can be solved in 0(nlogn) time (cf. 0(n ) for naive pairwise matching). 2 Chapter 1. Introduction 4 Geometry. Multi-view constraints are stronger than pairwise constraints. This allows for more accurate solution of the image geometry, and more incorrect matches to be rejected. Probability. Using a large database of images provides a background distribution of incorrect feature matches, which can be used to help verify correct matches. Furthermore, any feature matching errors that do occur are distributed across all of the images, and not just the pair being matched. Another key advantage is the ability to automatically recognise consistent objects or scenes in an image database. A n algorithmic overview of our approach is shown in algorithm 1. We begin by extracting invariant features from all of the input images (step I). The features we have used are Scale Invariant Feature Transform (SIFT) features [Low99] and MultiScale Oriented Patches (MOPS) (chapter 2). The goal of this stage is to generate a set of descriptors for locations in each image that are (as far as possible) invariant to the imaging process. Each feature is then matched to the features from all other images using an efficient indexing technique. We have used k-d trees [BL97] and wavelet indexing (section 2.8) for this purpose. Step III consists of outlier rejection by robust estimation of multi-view geometric constraints. For this we have used pairwise constraints based on the panoramic and epipolar geometry [HZ04]. Finally we use bundle adjustment to compute optimal estimates of the camera parameters and reject further outliers [TMHF99]. 1.3 Contributions The main contribution of this thesis is the development and evaluation of a system capable of recognising and stitching 2D panoramas without any user input. We also develop a new class of invariant features (MOPS) designed specially for this purpose. Finally, we show how the multi-image matching framework can be applied to 3D object recognition and reconstruction. Automatic Panoramic Image Stitching We develop a novel, fully automated 2D panorama stitcher. This has the following advantages over previous image stitching approaches Chapter 1. (b) Introduction 5 Output panoramas Figure 1.2: Panorama recognition: (a) an image set containing multiple panoramas and distractor images is input, and (b) panoramic sequences are recognised and rendered as output. Chapter 1. Introduction 6 Algorithm: Multi-Image Matching using Invariant Features Input: n unordered images I. Extract invariant features from all n images II. Find k nearest-neighbours for each feature using an efficient indexing scheme III. For each image: (i) Select m candidate matching images (ii) Find geometrically consistent feature matches by robustly estimating the pairwise image geometry (iii) Verify image matches using a probabilistic model IV. Find connected components of image matches V . For each connected component: (i) Perform bundle adjustment to estimate the global geometry and reject further outliers. (ii) Compute 3D models/render panoramas as appropriate Output: Matched images and 3D model(s)/panorama(s) Algorithm 1: Multi-image matching Chapter 1. (c) Introduction 7 Panorama rendered with multi-band blending Figure 1.3: Fully automatic 2D image stitching. A l l 57 images are registered automatically without any user input, and stitched into a seamless panoramic image. Chapter 1. Introduction 8 • Robustness to image zoom, rotation and exposure change, due to the use of invariant features. • O(nlogn) running time, due to tree based matching (cf. 0(n ) 2 for naive pairwise matching). • Automatic detection of matching images, using a probabilistic model for image match verification. This also allows us to recognise panoramas [BL03] (figure 1.2) in unordered datasets. • High quality rendering, even in the presence of misregistration, motion and exposure differences, due to gain compensation and multi-band blending (figure 1.3). • Automatic panorama straightening, using a heuristic based on the way users typically shoot panoramic images. Evaluation of Image Stitching Algorithms We introduce a framework for evaluation of multi-image registration techniques. This uses a novel error function based on the projection errors of a test alignment compared to ground truth. Multi-Scale Oriented Patches (MOPS) • We show that a direct patch based sampling of an oriented image patch can serve as a useful invariant feature descriptor. • We propose an novel adaptive non-maximal suppression algorithm that distributes features more evenly over the image than previous approaches. • We show that data driven classifiers that make use of known incorrect matches can provide superior performance to the basic Gaussian noise model. • We show that indexing based on Haar wavelet coefficients speeds up the search for matching features, and is superior to indexing on other dimensions of the feature descriptor. Chapter 1. Introduction 9 3D Object Recognition and Reconstruction We show how the previous techniques can be extended to the case of a moving camera and enable recognition and reconstruction of 3D objects from unordered datasets. We present a graphical model formulation for the reconstruction problem, and implement an efficient solution using sparse bundle adjustment. 1.4 Outline of Thesis The remainder of the thesis is organised as follows. Chapter 2 reviews the literature on invariant features and introduces Multi-Scale Oriented Patches (MOPS) - a new class of invariant feature that use a direct patch based sampling of the local image region. We perform a detailed analysis of the properties of these image features and include comparisons to the current state-of-the-art in feature matching. In chapter 3 we describe the design of an automatic panoramic image stitcher, capable of recognising and stitching panoramas from unordered datasets. We develop methods for evaluation and tuning the parameters of our automatic panorama stitcher in chapter 4. Chapter 5 extends the multi-view matching framework to the case of moving cameras and 3D model acquisition. In chapter 6 we present conclusions and ideas for future work. 10 Chapter 2 Multi-Scale Oriented Patches 2.1 Introduction A quantity is invariant under a group of transformations if it is conserved (unchanged) by any transformation in that group. In the context of computer vision, the transformations of interest are those induced by the perspective projection of the world onto the image plane (see appendix A ) . Invariance provides a mechanism for finding correspondences between images for which the transformation parameters are unknown [MZ92, RZFM92]. A n alternative approach to correspondence is to start with some initial guess of the transformation parameters and iteratively refine this whilst minimising an error function based on the quality of registration. Such error functions typically use all of the image data, for example, the sum squared error of overlapping pixels, and these are called direct methods [Ana89, LK81]. In contrast, feature-based methods attempt to extract salient features such as edges and corners, and then to use a small amount of local information, for example, correlation of a small image patch, to establish matches [Har92, ST94]. Invariant features can be seen as a hybrid of earlier matching methods. As with traditional image features, they use a distributed representation which makes matching methods robust (in that failed feature matches do not adversely affect the solution). In common with direct methods, each feature descriptor typically uses a large amount of local image data, making the features more distinctive than traditional image features such as edges and corners. Finally, the use of invariants makes it possible to find matches over a wide range of image transformations using indexing instead of iterative search. The first work in the area was by Schmid and Mohr [SM97] who used a jet of Gaussian derivatives to form a rotationally invariant descriptor around a Harris corner. Lowe extended this approach to incorporate scale invariance [Low99, Low04]. Other Chapter 2. Multi-Scale Oriented Patches 11 researchers have developed features that are invariant under affine transformations [BauOO, TGOO, BL02]. Interest point detectors vary from standard feature detectors such as Harris corners or D O G maxima to more elaborate methods such as maximally stable regions [MCUP02] and stable local phase structures [CJ03]. Generally, interest point extraction and descriptor matching are considered as two basic steps, and there has been some progress in evaluating the various techniques with respect to interest point repeatability [SMB98] and descriptor performance [MS03]. Other researchers have suggested that interest points should be located such that the solutions for matching position [ST94], orientation and scale [Tri04] are stable. There have been several compelling applications of invariant feature based matching in the context of object recognition [Low99], structure from motion [SZ02], panoramic imaging [BL03] and searching for objects in videos [SZ03]. In this chapter, we describe the implementation of a patch-based invariant feature called Multi-Scale Oriented Patches (MOPS). M O P S have been designed with the task of panoramic image stitching in mind. They have a number of advantages over previous approaches in that regard. First, we use a novel adaptive non-maximal suppression algorithm that better distributes features across the image than previous techniques (section 2.3.1). This facilitates improved matching for panoramic sequences with low overlap. Second, we develop a feature space outlier rejection strategy that uses all of the images in an n-image matching problem to give a background distribution for incorrect matches (section 2.7). Finally, we develop an indexing scheme based on low-frequency Haar wavelet coefficients that greatly speeds up the search for feature correspondences with minimal impact on matching performance (section 2.8). We close the chapter with a discussion of our results and ideas for future work in this area. 2.2 Multi-Scale Oriented Patches In general the transformation between corresponding regions in a pair of images is a complex function of the geometric and photometric properties of the scene and the cameras. For the purposes of this work we reduce this to a simple 6 parameter model /'(x') = a/(x) + /? + n(x) and (2.1) Chapter 2. Multi-Scale Oriented Patches 12 Figure 2.1: Multi-scale Oriented Patches (MOPS) extracted at five pyramid levels. The boxes show the feature orientation and the region from which the descriptor vector is sampled. Chapter 2. Multi-Scale Oriented Patches 13 x' = Ax +1 A = s cos 9 (2.2) sin 9 (2.3) — sin 9 cos 9 where /(x) and Z'(x') are the corresponding image patches. There are four geometric parameters ti,t2,9,s (position, orientation and scale) and two photometric parameters are a,/3 (gain and bias). The error n(x) represents imaging noise and modelling error. Features are located at points where this transformation is well defined i.e. the autocorrelation of I(x) is peaked [ST94]. To compare features, one could in principle compute the maximum likelihood estimates for the transformation parameters between a pair of image locations. Assuming Gaussian noise, this can be done iteratively by solving a non-linear least squares problem [BM04]. However, this would require an iterative registration step to match any pair of features. Instead, we establish a canonical frame (see appendix A.4) for each feature, and sample invariant descriptor vectors relative to that frame. This allows us to efficiently compute an approximation to the minimum matching error n(x) between any pair of features using indexing techniques (see section 2.7.1). We then use the statistics of the matching error n(x) to verify whether a match is correct or incorrect. 2.3 Interest Points The interest points we use are multi-scale Harris [Har92] corners. For efficiency, we work with greyscale images I(x,y). For each input image I(x,y), we form an image pyramid with the lowest level Po(x, y) = I(x, y) and higher levels related by smoothing and subsampling operations Pi{x,y)*g* (x,y) (2.4) P[(sx, sy) (2.5) P / denotes the pyramid level, and g (x,y) denotes a Gaussian kernel of standard devia ation a. We use a subsampling rate r = 2 and pyramid smoothing a = 1.0. Interest p Chapter 2. Multi-Scale Oriented Patches 14 points are extracted from each level of the pyramid. Other authors use sub-octave pyramids, for example Lowe [Low04j. This gives improved matching for images at different scales. Since we are mostly concerned with matching images that have the same scale, this is left for future work. The Harris matrix at level I and position (x, y) is the smoothed outer product of the gradients H (x,y) l = V P (x,y)V P (x,y) * (x,y) T ad l (Td l 9l7i (2.6) Vo- represents the spatial derivative at scale o i.e. V f{x,y)±Vf(x,y)* {x,y) 0 (2.7) gir We set the integration scale a"; = 1.5 and the derivative scale <Jd = 1.0 and use the corner detection function _ det Hi(x,y) _ A A jHM{x,y) -—— — — -——— trH/(x,y) Ai + A X 2 (2.8) 2 which is the harmonic mean of the eigenvalues (Ai, A ) of H. Interest points are located 2 where the corner strength / # M ( £ , y) is a local maximum of a 3 x 3 neighbourhood, and above a threshold t — 10.0. The reason for this choice of interest point detection function can be understood in terms of the relationship between H and the local autocorrelation function. For an image i(x), the first order Taylor expansion gives an expression for the local autocorrelation e(x) = |/(x) - I | » 2 0 x f^ x T T = x Hx T (2.9) ax ax Interest points are located at peaks in the autocorrelation function. This means that e(u) is large for all unit vectors u, which is equivalent to requiring that both eigenvalues of H are large . Figure 2.2 compares isocontours of our interest point detection function 1 (Harmonic mean) with the common Harris [Har92] and Shi-Tomasi [ST94] detectors. Note that all the detectors require both eigenvalues to be large. Harmonic mean and Shi-Tomasi detectors have the slight advantage that they are parameter free, whereas ^ o t e that in practice H is integrated over a range as in equation 2.6 (otherwise it would be rank 1). Chapter 2. Multi-Scale Oriented Patches I2r Harris Harmonic mean Shi-Tomasi I I 'Oh 15 >l :i i! 11 i\ I \ I I \ * \ \ ; \ \ v. 2h g | 1 0 1 —'. ' ' 1 1 2 3 I 4 1 5 \ 1 6 1 7 1 B 1 1 9 10 Figure 2.2: Isocontours of popular interest point detection functions. Each detector looks for points where the eigenvalues Ai, A of H = J V J V / < i x are both large. T 2 N the Harris detector has a single parameter to be tuned. Harris Harmonic mean Shi-Tomasi f„ = A A - 0.04(Ai + A ) f HM = £fe fsr = : 2 2 2 - det H - 0.04(tr H) = 2 ^ min(Ai,A ) 2 Preliminary experiments suggest each of these detectors give roughly the same performance, although one could compute repeatability statistics to confirm this (see section 2.3.3). 2.3.1 Adaptive Non-Maximal Suppression Since the computational cost of matching is superlinear in the number of interest points, it is desirable to restrict the maximum number of interest points that are extracted from each image. At the same time it is important that the interest points that are generated are well spatially distributed over the image, since the area of overlap between a pair of images may be small. To satisfy these requirements, we use an adaptive non-maximal suppression (ANMS) strategy to select a fixed number of interest points from each image. Chapter 2. Multi-Scale Oriented Patches 16 Interest points are suppressed based on the corner strength JHM and only those that are a maximum in a neighbourhood of radius r pixels are retained. Conceptually, we initialise the suppression radius r = 0 and then increase it, removing interest points by non-maximal suppression, until the desired number of interest points is obtained. In practice, we can perform this operation without search as the set of interest points which are generated in this way form an ordered list. The first entry in the list is the global maximum, which is not suppressed at any radius (however large). As the suppression radius decreases from infinity, interest points are added to the list. However, once an interest point appears, it will always remain in the list. This is true because if an interest point is a maximum in radius r then it is also a maximum in radius r' < r. In practice we robustify the non-maximal suppression by requiring that a neighbour has a sufficiently larger strength. Thus the minimum suppression radius r is given by t 7-j = min |xi - X j | , s.t. /(x^) < c/(xj), x^- el j (2.10) where Xj is a 2D interest point image location, and J is the set of all interest point locations. We use a value c = 0.9, which ensures that a neighbour must have significantly higher strength for suppression to take place. We select the n ip = 500 interest points with the largest values of rj. Experiments on a large database of panoramic images suggest that distributing interest points spatially in this way, as opposed to selecting based on max corner strength, results in fewer dropped image matches (we found improved matching on 5 sequences from our 200 sequence dataset). Another interesting experiment would be to test if A N M S also improves registration accuracy for a fixed number of features. This test could be performed using the apparatus developed in chapter 4. 2.3.2 Sub-Pixel Accuracy Interest points are located to sub-pixel accuracy by fitting a 2D quadratic to the corner strength function in a local 3 x 3 neighbourhood (at the detection scale) and finding its maximum. Chapter 2. Multi-Scale Oriented Patches (a) (c) Strongest 250 A N M S 250, r = 24 (b) (d) 17 Strongest 500 A N M S 500, r = 16 Figure 2.3: Adaptive non-maximal suppression (ANMS). The two upper images show interest points with the highest corner strength, while the lower two images show interest points selected with adaptive non-maximal suppression (along with the corresponding suppression radius r). Note how the latter features have a much more uniform spatial distribution across the image. Chapter 2. Multi-Scale Oriented Patches /-l,l fo,i fl,l f-1,0 fo,o /l,0 f-1,-1 fo,-i 18 fl,-l Figure 2.4: For subpixel accuracy, derivatives are computed from pixel difference in a 3 x 3 neighbourhood according to the equation 2.17. where x denotes position (x,y), and / ( x ) = / H M ( X ) is the corner strength measure. Derivatives are computed from the 3 x 3 neighbourhood using pixel differences i.e. dj_ dx d± dy a / dx &l dy df dxdy = (/i,o -/-i,o)/2 (2-12) = (/o,i — /o,-i)/2 (2.13) = /i,o — 2 2/o,o + /-i,o (2.14) = /o,i — 2 2/0,0 + /o,-i (2-15) = (/_i,_i - 2 2 - / l , _ l - /l,l)/4 (2-16) See figure 2.4. The subpixel location is given by d f~ df = x - ^-j ax ax 2 x 2.3.3 m 0 l ( - ) 2 2 17 Repeatability The fraction of interest points whose transformed position is correct up to some toler2 ance epsilon is known as repeatability [SMB98]. We use a slightly different definition Here we assume that images are related by a homography (see appendix A.2.1). Hence interest point detections in a pair of images are 'correct' if they are consistent with the homography between that pair of images. Later chapters will discuss alternate motion models between images. 2 Chapter 2. Multi-Scale Oriented Patches 19 with sub-pixel localisation without sub-pixel localisation 70 20 0 I 1 0.5 1 1 1 1.5 1 1 2 2.5 3 accuracy/pixels Figure 2.5: Repeatability of interest points with and without sub-pixel correction. These results were computed from the Matier dataset. of repeatability to that defined in [SMB98] (which is not symmetric) as follows. Let IM denote the set of all points belonging to image M, and T M denote the set of interest points in image M. The set of points from image M that project to image N is given by VN M VMN = {xi : HjvMXi e I } (2.18) N where UNM is the homography between images M and TV. The set of points from image M that are repeated in image N (within tolerance e) is given by "^MAr(e) = {xi : 3j : |x* - H M A r x . , | < e, x e 1 , 4 M x.jel } N TZMN(^) (2-19) The repeatability is the number of interest points that are repeated as a fraction of the total number of interest points that could be repeated. It is useful to adopt a symmetrical definition r(e) = mm — 1 — i \\rMN\ \rNM\J (2.20) The repeatability of our interest points with and without sub pixel localisation is shown in figure 2.5. Note that sub-pixel localisation gives approximately 5% improvement in repeatability. Chapter 2. Multi-Scale Oriented Patches 2.4 20 Orientation Each interest point has an orientation 9, where the orientation vector [cos 6, sin 6] = u/|u| comes from the smoothed local gradient u (x,y) = V P (x,y) l ao (2.21) l Note that the image gradient is covariant under similarity transforms, and hence can be used to compute a canonical frame (see appendix A.4). The integration scale for orientation is a = 4.5. A large derivative scale is desirable so that the vector field 0 ui(x,y) varies smoothly across the image, making orientation estimation robust to errors in interest point location. The orientation estimate is poorly conditioned if the first derivative is close to zero, in which case it may be favourable to look at higher order derivatives [SF95]. This is left for future work. 2.5 Analysis of Interest Point Extraction Figure 2.6 compares the errors introduced in four stages of feature extraction: position, scale and orientation measurement, and descriptor matching . These experiments 3 were conducted using the Matier dataset (see appendix B). Features were extracted and matched between all 7 images, and the top 2 image matches for each image were selected. The maximum number of matches per feature was 5. For each of these (14) image matches, the number of features in the area of overlap was found, and the number of features with consistent position, scale and orientation measurements computed. Consistent position means that the interest point was detected within e pixels of the projected position using the homographies computed from bundle adjustment. Con- sistent scale means that the interest point was detected at the same scale in the two images. Consistent orientation means that the transformed orientations differ by less than 3 standard deviations (= 3 x 18.5 degrees). To an accuracy of 3 pixels, 72% of interest points are repeated (have correct position), 66% have the correct position and scale, 64% also have correct orientation, and in total 59% of interest points are correctly matched (meaning they are one of the top 5 matches in terms of Euclidean distance in feature space). That is, given that an interest point overlaps another image, N o t e that the extraction and matching of descriptor vectors is discussed i n sections 2.6 and 2.7. The results are included here for completeness. 3 Chapter 2. Multi-Scale Oriented Patches 21 100 90 80 70 S? 60 I 50 30 20 10 0 0.5 1 1.5 2 2.5 3 accuracy/pixels 3.5 4 4.5 5 Figure 2.6: Repeatability vs accuracy for Multi-Scale Oriented Patches. To an accuracy of 3 pixels, 72% of interest points in the overlap region have consistent position, 66% have correct position and scale, 64% also have correct orientation, and in total 59% of interest points in the overlap region are correctly matched. the probability that it will be correctly matched is 59%. Whilst figure 2.6 shows combined results for all levels, figure 2.7 shows separate results for interest points extracted at each level of the pyramid. A l l measurements are relative to the base image. Note that contrary to the popular perception that Harris corners are sub-pixel accurate, the majority of interest points have location errors in the 0-3 pixel range, even at the finest scale of detection. Also note that interest points at higher levels of the pyramid are less accurately localised relative to the base image than those at a lower level, due to the larger sample spacing. Although less useful for accurate localisation, these higher level features are still useful in verifying an image match or a coarse R A N S A C hypothesis. Also, the orientation estimate improves as the level increases. As expected, features at levels 4 and 5 generally have poor accuracy, and their distributions show many features have accuracy worse than 3 pixels. However, it is slightly counter intuitive that features at levels 2 and 3 tend to have accuracies of 3 pixels or better. Figure 2.8 show the same results as computed for the Van Gogh sequence. This is a pure rotation sequence, with no projective distortion. As compared to the Matier sequence, which does have perspective distortion, matching is improved. Note in particular that the orientation repeatability curves and the matched curves are very close, indicating that if feature orientation is correctly estimated, then it is very likely that Chapter 2. Multi-Scale Oriented Patches 22 the feature will also be correctly matched. This is not the case for the Matier dataset due to perspective distortion. 2.6 Feature Descriptor Once we have determined where to place our interest points, we need to extract a description of the local image structure that will support reliable and efficient matching of features across images. A wide range of such local feature vectors have been developed, including local intensity patches [For86, Har92], Gaussian derivatives [SM97], scale invariant feature transforms [Low04], and affine-invariant descriptors [BauOO, TGOO, BL02]. In their comparative survey, Mikolajczyk and Schmid [MS03] evaluated a variety of these descriptors and found that SIFT features generally perform the best. Local patches oriented to the dominant local orientation were also evaluated, but found not to perform as well. In this section, we show how such patches can be made less sensitive to the exact feature location by sampling the pixels at a lower frequency than the one at which the interest points are located. Given an interest point (x, y, I, 9), the descriptor is formed by sampling from a patch centred at (x,y) and oriented at angle 9 from pyramid level /. We sample an 8 x 8 patch of pixels around the sub-pixel location of the interest point, using a spacing of s — 5 pixels between samples (figure 2.9). Figure 2.10 shows how varying the sample spacing s affects the reliability of feature matching. We have found that performance increases up to a value s = 5, with negligible gains thereafter. To avoid aliasing, the sampling is performed at a higher pyramid level, such that the sampling rate is approximately once per pixel (the Nyquist frequency). This means sampling the descriptor from a level l levels above the detection scale, where s (2.22) The descriptor vector is sampled using bilinear interpolation. In practice, s — 5 so the descriptor vectors are sampled at l — 2 levels above the detection scale. s Suppose the interest point was detected at level /. This suggests sampling the descriptor from Pi i (x,y) + 3 = Pi {x,y)- However, we have found better results by +2 instead sampling the descriptor from P{ (x, y), where P( (x, y) = Pi i(x, y)*ga (x, y), +1 +1 + p i.e. blurring but not downsampling. Further (smaller) gains are made by sampling from Chapter 2. Multi-Scale Oriented Patches 23 position — position, seals • . — position, seals, orientation : — matched - (a) A l l levels. 6649 features extracted, 6610 correct matches (b) Level 1. 4997 features extracted, 5318 correct matches — position — position, orientation matched * 60 s (c) Level 2. 1044 features extracted, 860 correct matches (d) Level 3. 372 features extracted, 295 correct matches — posrtkon position, orientation matched • (e) Level 4. 180 features extracted, 120 correct matches (f) Level 5. 56 features extracted, 17 correct matches Figure 2.7: Repeatability of interest points, orientation and matching for Multi-Scale Oriented Patches at 5 pyramid levels (Matier dataset). The top left figure is a combined result for all levels. A l l measurements are relative to the base image in the pyramid. Chapter 2. Multi-Scale Oriented Patches (a) A l l levels. 6557 features extracted, 9880 correct matches (b) Level 1. 4925 features extracted, 7559 correct matches (c) Level 2. 1041 features extracted, 1512 correct matches (d) Level 3. 392 features extracted, 542 correct matches 24 position position, orientation — matched - J^' f^Lj i e 40 y i (e) Level 4. 158 features extracted, 212 correct matches (f) Level 5. 41 features extracted, 55 correct matches Figure 2.8: Repeatability of interest points, orientation and matching for Multi-Scale Oriented Patches at 5 pyramid levels (Van Gogh dataset). This dataset consists of pure rotations with no perspective distortion. The top left figure is a combined result for all levels. A l l measurements are relative to the base image in the pyramid. Chapter 2. Multi-Scale Oriented Patches 25 Figure 2.9: Descriptors are formed using an 8 x 8 sampling of bias/gain normalised intensity values, with a sample spacing of 5 pixels relative to the detection scale. This low frequency sampling gives the features some robustness to interest point location error, and is achieved by sampling at a higher pyramid level than the detection scale. P('(x,y) = Pi(x,y) * #2x<T (£,?/)• Whilst theoretically one can interpolate a function P exactly given a sampling of that function at the Nyquist frequency, in practice it is better to maintain a denser sampling if using a bilinear resampling kernel. This is discussed in more detail with quantative results in section 2.9.2. 2.6.1 Illumination Invariance In the previous sections we have described some of the geometrical transformations under which we wish our image features to achieve invariance. However, there is also a potentially complex transformation of the illumination between corresponding image regions, which depends upon the surface reflectance properties and lighting conditions. Digital cameras also perform several non-linear transformations on the image intensities such as white-balancing and gamma correction. In this work we use a simple affine model for illumination change I' = al + p (2.23) Whilst it would be desirable to define a more accurate model of illumination change and represent features using illumination invariants, this is left for future work. In practice, we normalise the descriptor vector so that the mean is 0 and the standard deviation is 1, i.e. di = (d'i - fi)/a (2.24) Chapter 2. Multi-Scale Oriented Patches 0 0.1 0.2 0.3 0.4 false positive 0.5 0.6 26 0.7 Figure 2.10: Effect of changing the descriptor sample spacing on performance. These R O C curves show the results of thresholding feature matches based on normalised match distance as in section 2.7.1. Performance improves as the sample spacing increases (larger patches), but gains are minimal above a sample spacing of 5 pixels. where d[, ie{l..d } 2 and cr = \JYli=i(°H are the elements of the descriptor vector, with p = ^J2i=i^i ~ A*) - This makes the features invariant to afFme changes in 2 intensity (bias and gain). 2.6.2 Haar Wavelet Transform Finally, we perform the Haar wavelet transform on the 8 x 8 descriptor patch di to form a 64 dimensional descriptor vector containing the wavelet coefficients q . Due to the orthogonality property of Haar wavelets, distances are preserved " <*?) = <*) 2 i (2-25) 2 i So nearest neighbours in a sum-squared difference sense are unchanged. Our motivation for using wavelet coefficients to parameterise descriptors was the intuition that some dimensions of the feature descriptor would be more noisy than others (in fact we show that this is true in section 2.7.2). We exploit this in an indexing strategy which uses the first 3 non-zero wavelet coefficients C i , c , c (see section 2.8). 2 Note that the mean CQ is equal to 0. 3 Chapter 2. Multi-Scale Oriented Patches (a) (b) 27 Feature locations Feature descriptors Figure 2.11: Corresponding features in a pair of images. For each feature a characteristic scale and orientation is established and an 8 x 8 patch of pixels sampled for the feature descriptor. Since the reference frame and the image undergo the same transformation between the images, the descriptor vector is the same in both cases (up to noise and modelling error). Chapter 2. Multi-Scale Oriented Patches 28 (a) (b) Figure 2.12: Examples of corresponding features from different images in the Matier dataset. For each image, M O P S are extracted and descriptor vectors are stored in an indexing structure. Feature descriptors are indexed and matched as described in section 2.7. Chapter 2. Multi-Scale Oriented Patches 2.7 29 Feature Matching Given Multi-Scale Oriented Patches extracted from all n images, the goal of the matching stage is to find geometrically consistent feature matches between all images. This proceeds as follows. First, we find a set of candidate feature matches using an approximate nearest neighbour algorithm (section 2.8). Then we refine matches using an outlier rejection procedure based on the noise statistics of correct/incorrect matches. Finally we use R A N S A C to apply geometric constraints and reject remaining outliers. 2.7.1 Feature-Space Outlier Rejection Our basic noise model assumes that a patch in one image, when correctly oriented, located and scaled, corresponds to a patch in the other image modulo additive Gaussian noise: J'(x') = a/(x) + (3 + n(x) A x +1 X — A = s (2.27) cos 9 sin 9 - sin 9 cos 9 n(x)~M(0,a ) n x a r e ^ n e (2.28) (2.29) 2 where J(x) and I ' ( ) (2.26) corresponding patches, and n(x) is independent Gaussian noise at each pixel. To compare two features, we estimate the geometrical parameters from the translation, rotation and scale of the canonical frames (figure A.3), and the photometric parameters using the approximations a 0 = — a = p'-ap (2.30) (2.31) These expressions are obtained by taking the mean and variance of equation 2.26, and assuming that n is small. The parameters p, a, p', a' are the means and variances of patches 7(x) and / ' ( ' ) respectively. We then compute the matching error (Euclidean x distance) between the two patches e = ^ / ^ n(x) . 2 x 30 Chapter 2. Multi-Scale Oriented Patches Unfortunately, we have found this model to be inadequate for classification, as the error distributions for correctly and incorrectly matching patches overlap significantly (see figure 2.14(a) ). Hence, it is not possible to set a global threshold on the matching error to distinguish between correct and incorrect matches. Note that the above results apply to errors in the image plane, after correcting for the brightness changes a, j3 between patches. We have also repeated this experiment using a Gaussian noise model in (bias-gain normalised) feature space n(x) = Ji(xi)-mi i~ (x )-m cti cr 2 2 2 (2.32) 2 and found similar results. This behaviour has also been observed by Lowe [Low04], who suggested thresholding instead on the ratio 6 i — a w / H e r e e\_ NN denotes the error for the best match (first nearest neighbour) and e -NN denotes the error for the second best match 2 (second nearest neighbour). As in Lowe's work, we have also found that the distributions of ei^NNI&2-NN for correct and incorrect matches are better separated than the distributions of e^^N alone (figure 2.14(b) ). The intuition for why this works is as follows. For a given feature, correct matches always have substantially lower error than incorrect matches. However, the overall scale of errors varies greatly, depending upon the appearance of that feature (location in feature space). See figures 2.19 and 2.20. For this reason it is better to use a discriminative classifier that compares correct and incorrect matches for a particular feature, than it is to use a uniform Gaussian noise model in feature space. Lowe's technique works by assuming that the 1-NN in some image is a potential correct match, whilst the 2-NN in the same image is an incorrect match. In fact, we have observed that the distance in feature space of the 2-NN and subsequent matches is almost constant . We call this the outlier distance e u , 4 out er as it gives an estimate of the matching distance (error) for an incorrect match (figure 2.15). We have found that in the n image matching context we can improve outlier rejection by using information from all of the images (rather than just the two being matched). Using the same argument as Lowe, the 2-NN from each image will almost certainly be an incorrect match. Hence we average the 2-NN distances from all n T h i s is known as the shell property ([Bis95] exercise 1.4). The distances of a set of uniformly distributed points from a query point in high dimensions are almost equal. 4 Chapter 2. Multi-Scale Oriented Patches 31 images, to give an improved estimate for the outlier distance . This separates the dis5 tributions for correct and incorrect matches still further, resulting in improved outlier rejection (figure 2.14(d) ). Hence we accept a feature match with error e iff e</x (2.33) ^outlier Where the threshold / = 0.65. In the general case it is prudent to attempt to match a given feature to the features from all other images in the dataset, using the above criterion (this assumes that every pair of images may have a match). However, in many applications e.g. panoramic stitching (chapter 3) and 3D modelling (chapter 5) the number of images that view a ray or point in the world n overlap is small, and hence it is only necessary to find a small number of candidate matches for each feature. Let us assume that we know the maximum number of images that may overlap a given ray n i . over ap ^overlap — In an ordered list of nearest-neighbour matches, we assume that the first 1 elements are potential correct matches, and that the n i over elements are incorrect matches. Typically we use a value n i over ap ap and subsequent = 5. In addition to speeding up the search for nearest neighbour matches, finding k matches per feature in a collection of n images (where k « n) has the advantage that any incorrect feature matches that do occur are distributed over a large number of images, and thus less likely to cause an incorrect image match to be declared. In general the feature-space outlier rejection test is very powerful. For example, we can eliminate 80% of the false matches for a loss of less than 10% correct matches. This allows for a significant reduction in the number of R A N S A C iterations required in subsequent geometry estimation steps (see figure 2.13). These results are computed for Matier (7 images, 6649 features, 5970 correct matches), Van Gogh (7 images, 6557 features, 9260 correct matches) and Abbey (20 images, 18567 features, 15558 correct matches). W e also tried using other statistics of the distribution of outliers e.g. max, min, median, but found that using the average gave the best results. Future work might try to model the whole outlier distribution for these tests. 5 Chapter 2. Multi-Scale Oriented Patches (a) (b) (c) 32 A l l 1313 feature matches 839 outliers rejected using feature space outlier rejection A further 96 matches rejected using geometrical constraints Figure 2.13: Outlier rejection using b) feature space outlier rejection c) geometric constraints. The raw matches are a). There were 1313 initial matches, of which 839 were rejected without geometric constraints by thresholding based on the outlier distance, and a further 96 were rejected using geometric constraints. The input images are 385 x 512 and there were 378 matches in the final solution. Chapter 2. Multi-Scale Oriented Patches 0 10 20 30 40 t-NN squared error 50 60 0 33 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1-NN/2-NN squared error (a) (b) (c) (d) Figure 2.14: Distributions of matching error for correct and incorrect matches. Note that the distance of the closest match (the 1-NN) is a poor metric for distinguishing whether a match is correct or not (figure (a)), but the ratio of the closest to the second closest (1-NN/2-NN) is a good metric (figure (b)). We have found that using an average of 2-NN distances from multiple images (INN/(average 2-NN)) is an even better metric (figures (c)-(d)). These results were computed from 18567 features in 20 images of the Abbey dataset (see appendix B), and have been verified for several other datasets. Chapter 2. Multi-Scale Oriented Patches — — JZZL 1 fr It 2 3 4 34 Outlier distance Threshold 5 feature number t I ft | | ft 1 | Figure 2.15: Thresholding based on outlier distance. This figure shows the best 10 matches for a sample feature. The first is a correct match, and the rest are incorrect matches. Thresholding based purely on matching error gives poor results, since matching errors vary greatly depending upon the position in feature space. However, thresholding at a fraction of the outlier distance gives better results. 2.7.2 Spatial Variation of Errors In actual fact the errors between corresponding patches are not uniform across the patch as suggested in equation 2.29. We have also computed the error variance assuming a diagonal covariance model ii ( x ) ~ J v * ( 0 , E ) (2.34) n where — *?1 0 0 0 0 0 0 (2.35) *33 If we assume that the error variance is constant across the patch ( £ „ = <r?I) we find that the standard deviation of intensity errors for correct matches is o = 0.0334 (for n brightness values in the range 0 < / < 1). That is, the error for correct matches is around 3%. However, with the diagonal covariance model of equation 2.35 we find that Chapter 2. Multi-Scale Oriented Patches 35 Figure 2.16: Spatial variation of errors across the patch (for correct feature matches). Lighter tones indicate larger values of variance. The variance of the errors at the edge of the patches are larger than those in the centre. This is consistent with making small errors in scale / orientation selection. the standard deviation of errors at the edge of the patch is approximately 2 times that in the centre. This is shown in figure 2.16. Note that this is consistent with small errors in scale / orientation estimation for each patch, as these would generate larger errors at the edge of the patch. Preliminary experiments have shown that weighting _ i the errors by their inverse standard deviation i.e. minimising | E n ( x ) | does not give 2 2 n much improvement over simply minimising |n(x)| . These results were computed using 2 7572 correctly matching features ( R A N S A C inliers with e = 10 pixels) from the Matier dataset. 2.7.3 Position and Orientation Errors for Correct Matches Figure 2.17 shows the residual image position errors and errors in rotation estimates for correctly matching features. Note that the features from the (rotation only) Van Gogh dataset are more accurately located than those in the Matier dataset (see appendix B). For the pure rotation dataset, features are typically accurate in the 0-2 pixel range, whilst those from the Matier dataset are typically in the 0-4 pixel range. This discrepancy could be due to perspective distortion (which could adversely affect the feature location), and also unmodelled parameters such as radial distortion. A n interesting future experiment would be to correct for radial distortion to check if the accuracy of features is still poorer when perspective distortion is present. This could also be simulated synthetically as in chapter 4. Another interesting observation is that there are a significant number of features that match correctly when the rotation estimate is 180° out. This suggests that some the of the features might be rotationally symmetric e.g. the 2 x 2 checkerboard pattern has the properties of ambiguous gradient Chapter 2. Multi-Scale Oriented Patches 36 0.2 0.15 2 a 0.1 0.05 0 2 (a) 4 6 distance error/pixels 8 10 12 angular error/radians Image position error (Matier) (b) 0.181 0 1 2 3 4 5 6 7 distance error/pixels 8 9 (c) Image position error (Van Gogh) Orientation error (Matier) 1 1—p 1 1 1 1: 10 angular error/radians (d) Orientation error (Van Gogh) Figure 2.17: Distributions of image location error and feature orientation error for correctly matching features ( R A N S A C inliers). Note that features from the Van Gogh dataset are more accurately located. Also, there are a significant number of features that match correctly when the rotation estimate is 180° out. This could occur for rotationally symmetric features where the gradient is ambiguous e.g. a 2 x 2 checkerboard pattern. 37 Chapter 2. Multi-Scale Oriented Patches D 10 201N 30 *0uB»fd *0W 5 60 -N o rD 0 01. 02. 03. 04. 0N 5-.IN 8. e 0n 9.of 1 N 2-0 /8.N07.squ0 sid (d) (e) 0 01. 02. 03. 04. 05. 08. 0la 7.inO pSorv ti* 09. 1 (f) Figure 2.18: Distributions of matching error for correct and incorrect matches. The results were computed for the Matier (a, b, c) and Van Gogh (d, e, f) datasets. Again the distance of the closest match (the 1-NN) is a poor metric for distinguishing whether a match is correct or not (figures (a), (d)). The ratio of the closest to the second closest (1-NN/2-NN) is a better metric (figures (b), (e)), but using an average of 2-NN distances from multiple images (lNN/(average 2-NN)) is better still (figures (c), (f)). and rotational symmetry. To compute position and orientation errors, features were projected between images using the homographies obtained from bundle adjustment over all images. 2.8 Feature Indexing At this stage we wish to find nearest neighbours for all features from all images. This is the well known all nearest neighbours problem Vj NN(j) = arg min | | X i - X j 11, i ^ j (2.36) i This is naively 0(n ) 2 but several more efficient (approximate) solutions have been pro- Chapter 2. Multi-Scale Oriented Patches 38 1 absolute distance ~j relative distance - threshold 1 2 3 4 5 6 7 feature number 8 9 10 (a) ] absolute distance 3 relative distance -threshold feature number (b) Figure 2.19: Distances of correct and incorrect matches for high and low contrast features. We plot absolute and relative distances. Absolute distance is the Euclidean distance between brightness normalised patches. Relative distance is the distance relative to the outlier distance (the outlier distance is take as the distance of the closest 2-NN match from all other images). Note that for the high contrast features, the absolute distances are all much larger than for the low contrast features. Chapter 2. Multi-Scale Oriented Patches 39 1.6r absolute distance relative distance threshold 1.2 - 3 C T E I 0.8- (D Q 0.4 h 0.2 [ 1 2 3 4 5 6 7 feature number 8 9 10 (b) Figure 2.20: Distances of correct and incorrect matches for high and low contrast features. Absolute distances are larger for high contrast features than low contrast features. Hence, thresholding based on absolute match distances is a poor test, but thresholding on relative distances is better. Chapter 2. Multi-Scale Oriented Patches 40 B Figure 2.21: Indexing is performed on the first 3 non-zero wavelet coefficients (the mean is 0). These represent the first derivatives in x and y and the second order cross derivative. posed [BL97, NN97, GM00, SVD03]. In this section, we describe a nearest-neighbour algorithm that exploits the properties of our descriptor vectors. In particular, it makes use of the stability of the low-frequency components of our descriptor vectors, by indexing on the first 3 non-zero wavelet coefficients. Features are indexed in a three-dimensional lookup table with dimensions corresponding to the first 3 non-zero wavelet coefficients Ci,C2,C3 (estimates of f j , f^, over the patch) (see figure 2.21). The lookup table has 6 — 1 0 bins per dimension, which cover ±n c = 3 standard deviations from the mean of that dimension. Note that the means are typically around zero except for the first derivative that is aligned with the feature orientation, which is significantly positive. The bins are overlapped so that data within half a bin width, i.e. = | , are guaranteed to be matched against the query. These are approximate nearest neighbours as it is possible (but unlikely) that the true nearest neighbour lies outside | in one of the 3 dimensions. The query is exhaustively matched to all features in the query bin, and k approximate nearest neighbours are selected. We then apply the outlier distance constraint as described in section 2.7.1 to verify correct matches and eliminate outliers. Indexing with b bins on 3 dimensions gives a speedup of 6 / 2 (assuming features are 3 3 evenly distributed in the histogram) at the expense of some potential for lost feature matches. Table 2.1 shows the percent recall for 3 indexing methods: Wavelet low freq Indexing uses the first 3 non-zero wavelet coefficients Pixel random Indexing uses 3 random grey values from the descriptor Wavelet random Indexing uses 3 random wavelet coefficients from the descriptor It is clear from the results that using the low frequency wavelet coefficients for indexing is most effective. Choosing 10 bins per dimension gives a speedup of 1 0 / 2 = 3 3 Chapter 2. Multi-Scale Oriented Patches 41 125 compared to exhaustive nearest neighbour matching, with the loss of less than 10% of the matches. Indexing using low frequency wavelet coefficients is 10-20% better than the other methods at this operating point. In later chapters (section 3.2) we also describe an alternative, k-d tree based nearest neighbour algorithm. Indexing Method Dataset Wavelet low freq Matier Dash point Abbey Number of bins / dimension 1 5 10 15 100 99.6 91.4 72.4 100 99.8 93.2 76.8 80.2 100 99.9 95.1 Pixel random Matier Dash point Abbey 100 100 100 97.9 96.4 96.7 Wavelet random Matier Dash point Abbey 100 100 100 84.4 81.5 83.0 79.8 74.0 77.8 49.2 42.8 45.4 57.8 52.9 56.3 28.1 25.6 24.6 Table 2.1: Indexing on wavelet coefficients vs pixel values - percent recall in database matching. Using 10 bins per dimension, indexing on the 3 non-zero low frequency Haar wavelet coefficients (x and y derivative and the cross derivative) gives about 10% better recall than indexing on random dimensions (pixels) of the descriptor. 2.9 Analysis of Descriptor Sampling In this section we describe experiments to test the properties of our patch based descriptors. Firstly, we attempt an iterative refinement strategy for patches using a a direct method for registration (section 2.9.1). Next we discuss the effects of different image sampling procedures (section 2.9.2). Finally, we compare our descriptors with the commonly used SIFT descriptors (section 2.9.3). 2.9.1 Patch Refinement In [MS03], Mikolajczyk and Schmid note that "It would be interesting to include correlation with patch alignment which corrects for these errors and to measure the gain obtained by such an alignment." Since sensitivity to localization errors has been touted as one of the weaknesses of pixel-based descriptors, we decided to implement this suggestion to see how much it would help. Rather than computing sum-squared error on Chapter 2. Multi-Scale Oriented Patches 42 1 0.9 0.6 0.7 o, 0.6 Ia 0.5 2 ~ 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.4 0.3 false positive 0.5 0.6 0.7 Figure 2.22: R O C curves for patch refinement with different alignment models (Matier dataset). Each additional free parameter degrades the matching performance. pixel patches (or wavelet coefficients) directly, we included a stage of Lucas-Kanade [LK81, BM04] refinement to bring the patches more closely into spatial alignment before computing the pairwise descriptor distance. Since this has elements in common with the use of tangent distances [SLDV96] we expected that there might be an improvement in the separation of good and bad matches. Instead we found the opposite to be true. We used four motion models (direct, translation, similarity and affine) with 0, 2, 4 and 6 parameters respectively. The results are shown in figure 2.22. Note that matching performance is degraded for each new parameter that is added to the model. Since correct matches are already fairly well aligned, but bad matches typically have large errors, refinement tends to overfit the incorrect matches, whilst making only small improvements to the correct matches. This means that Lucas-Kanade refinement actually makes it more difficult to distinguish between correct and incorrect matches than before. This is a somewhat unexpected result. Future work would incorporate priors on the transformation parameters to prevent overfitting of the incorrect matches. 2.9.2 Image Downsampling and Bilinear Interpolation A typical approach to image sampling is to downsample the original signal to the sampling frequency before interpolation, so that the image is sampled at the Nyquist Chapter 2. Multi-Scale Oriented Patches Relative Sampling Level Extra Smoothing 0 -1 -2 0 a 2a 43 Number of feature matches Matier Dash Point Abbey 2620 3017 3139 2323 2988 3058 7467 9078 9268 Table 2.2: Effect of pyramid downsampling on feature matching. We found better results by sampling the feature descriptor at a smoothed version of a finer pyramid level, when using bilinear resampling. frequency (once per pixel). In theory this results in a correct sampling of the (bandlimited) original signal, assuming the correct decimation/interpolation kernel is used (a sine function). However, in practice bilinear interpolation is often used for speed. In this case, I have found that better results are obtained by maintaining a denser signal sampling during interpolation than the usual 1 pixel/sample. Table 2.2 compares feature matching performance when bilinear resampling is performed at successively finer pyramid levels corresponding to 1, 2 and 4 pixels of the original signal per sample. 'Relative level' means the level relative to the Nyquist sampling level, where one would sample exactly once per pixel in the pyramid. 'Extra smoothing' is the standard deviation of the extra Gaussian smoothing applied (to prevent aliasing) i.e. instead of sampling at some level I, we sample at level I — 1, but introduce extra smoothing with a Gaussian kernel standard deviation a. In each case, exactly the same interest points were extracted, so the total number of feature matches that result is a good indication of how well the descriptors are working. The results show that better results are obtained (about 15%-20% more matches) by performing bilinear resampling at the next finer pyramid level (where the signal sampling is approximately twice the sampling frequency). Smaller gains are obtained by moving to the next finer pyramid level again. These gains must be balanced against the cost of the extra convolution operations required in Gaussian smoothing. 2.9.3 Comparison to SIFT features To compare Multi-Scale Oriented Patches (MOPS) and SIFT features, I used 3 datasets of panoramic images. For each method, I extracted, approximately the same number of interest points from each image, and then exhaustively matched them to find k = 4 exact nearest neighbour matches for each feature. I then used identical R A N S A C Chapter 2. Multi-Scale Oriented Patches Dataset 44 MOPS SIFT Matier ^interest points ^matches #matches/interest point 3610 3776 1.05 3678 4344 1.18 Dash point ^interest points #matches ^matches/interest point 32689 11750 0.36 32517 22928 0.71 Abbey ^interest points #matches #matches/interest point 15494 18659 1.20 15710 21718 1.38 Table 2.3: Comparison of Multi-Scale Oriented Patches and SIFT feature matching. Note that SIFT features have a larger number of matches per interest point for each of the 3 datasets. algorithms to find the number of correct feature matches. In each case I have tabulated the number of correct matches per feature. The results are given in table 2.3. Note that both methods could potentially find more matches by using more scales / adjusting interest point strength thresholds etc. From the results of table 2.3 it seems that in terms of number of matches per interest point SIFT features outperform M O P S . Why is this? Lowe [Low04] reports higher repeatability for his difference of Gaussian (DOG) interest points (around 90% compared to 70% for our Harris corners), although this is highly dataset dependent. The rotation estimate used in SIFT features (maxima of a histogram of local gradients) is more robust since multiple orientations can be assigned if the histogram peaks are close. Lowe reports repeatability of around 80% for position (to an accuracy of 3 pixels), orientation and scale compared to our value of 58%. Another issue is that SIFT features are found to be located very close to image edges, but since M O P S use relatively large image patches they are constrained to be at least 20 pixels from the image edge (this is the main reason SIFT performs much better on the Dash Point dataset). This is shown in figure 2.24. Finally, the SIFT descriptor is more robust to affine change and small shifts in interest point position than patch correlation, due to accumulating measurements in spatial histograms. Note however, that M O P S and SIFT features tend to concentrate on different areas in the images. In particular, the D O G detector used for SIFT makes it more likely to find 'blobs' - bright areas surrounded by dark pixels or vice versa, whereas the autocorrelation detector used for M O P S makes it more likely to find edge or corner Chapter 2. Multi-Scale Oriented Patches (a) S I F T feature matches (167) (b) M O P S feature matches (238) 45 Figure 2.23: Comparison of SIFT features and M O P S features. Note that more M O P S features are found at edges/corners, whereas SIFT features concentrate on blobs. Though in this case there were more M O P S feature matches than SIFT feature matches, in general SIFT features gave better overall performance on our test datasets (see table 2.3). like features. This suggests that a combination of the two feature types might be effective. Also, it is important to note that M O P S features matches are by design well spatially distributed in the image. Sometimes SIFT feature matches are very close together. Are feature matches equally useful if they are very close together? We think not. It seems that some other criterion, such as registration accuracy, would be a better criterion for evaluating features, than simply counting the number of matches per feature (see chapter 4). Another approach would be to compare the number of matched/dropped images in a panorama dataset. Chapter 2. Multi-Scale Oriented Patches (a) S I F T feature matches (421) (b) M O P S feature matches (372) 46 Figure 2.24: Comparison of SIFT features and M O P S features. For the Dash Point dataset, the SIFT feature detector performed better in terms of number of matches. However, note that the M O P S feature matches are better spatially distributed e.g. there are many more matches in the sea/sand. Also note that the SIFT features are located right up to the edge of the image, but due to the large patches used in M O P S , the features are constrained to be at least 20 pixels away from an image edge. Chapter 2. Multi-Scale Oriented Patches 47 (a) Abbey. Though several feature matches have been found between the two images, the stained glass windows are in fact different. Though generalisation capability is desirable in many object recognition applications, in image stitching we are generally only interested in matching to the same object or scene. (b) Office. Though the windows look strikingly similar, the presence of the tree i n the first image gives away the fact that they are i n fact different. In this case it would be a difficult to distinguish whether motion of the tree or the observer caused the error. Figure 2.25: Matching mistakes are are often caused by repeating structures or similar looking objects that appear in multiple views. Often, the set of feature matches may appear consistent, but the images will differ substantially in other areas (a). This suggests the need for robust image matching metrics based on all the image data, and not just feature positions. In some cases however, it will still be difficult to tell if the source of error is object motion, or if the images are really different (b). Chapter 2. Multi-Scale Oriented Patches 2.10 48 Summary We have presented a new type of invariant feature, which we call Multi-Scale Oriented Patches (MOPS). These features utilise a novel adaptive non-maximal suppression algorithm for interest point location, and a simple sampling of the (oriented) local image intensity for the feature descriptor. We have also introduced two innovations in multi-image matching. First, we have demonstrated an improved test for verification of pairwise image matches that uses matching results from all n images. Second, we have shown that an indexing scheme based on low frequency wavelet coefficients yields a fast approximate nearest neighbour algorithm that is superior to indexing using the raw data values. Future Work We conclude by noting some possible areas for future development of M O P S . We discuss more general possibilities for invariant features in the final conclusions (chapter 6). Orientation Estimation Orientation estimation is currently problematic if the gradient is not well defined (i.e. small) or varies rapidly around the interest point (i.e. small changes in interest point location lead to large changes in the orientation). Alternative methods could include: using the largest eigenvector of H (although this also has a degeneracy for rotational symmetry), steerable filters, or peaks in a histogram of local orientations. The latter option is attractive as it can be made robust by considering multiple peaks in the histogram in ambiguous cases. Colour We could use R G B features instead of greyscale, or just add a few dimensions of colour information (e.g. the average [R, G, B ] / ( R + G + B) for the patch) to the descriptors with an appropriate weighting. Interest Operators In addition to using autocorrelation maxima, we could form features using other interest operators e.g. difference of Gaussian maxima, watershed regions or edge based features. Multi-scale/3D Matching In order to better cope with multi-scale matching we could use a true scale-space interest operator. For example, we could use an Chapter 2. Multi-Scale Oriented Patches 49 image pyramid with a finer scale sampling, and interpolate interest points to subscale accuracy. To cope better with 3D matching problems we should introduce more robustness to affine change and relative shifting of edge positions. Learning Feature Matching The multi-image matching systems that are described in subsequent chapters make it easy to generate large datasets of known correct and incorrect matches. This could be used to learn data driven classifiers to distinguish between correct and incorrect matches. 50 Chapter 3 Automatic Panoramic Image Stitching 3.1 Introduction This chapter extends the ideas of invariant feature matching developed in the previous chapter to the problem of panoramic image stitching. Automatic panoramic image stitching has an extensive research literature [Sze04] and several commercial applications [Che95, R E A , MSF]. The basic geometry of the problem is well understood, and consists of estimating a 3 x 3 camera matrix or homography for each image [HZ04, SS97]. This estimation process needs an initialisation, which is typically provided by user input to approximately align the images, or a fixed image ordering. For example, the PhotoStitch software bundled with Canon digital cameras requires a horizontal or vertical sweep, or a square matrix of images. The R E A L V I Z Stitcher [REA] has a user interface to roughly position the images with a mouse, before automatic registration proceeds. Our work is novel in that we require no such initialisation to be provided. In the research literature methods for automatic image alignment and stitching fall broadly into two categories - direct [SK95, IA99, SK99, SS00] and feature based [ZFD97, CZ98, MJ02]. Direct methods have the advantage that they use all of the available image data and hence can provide very accurate registration, but they require a close initialisation. Feature based registration does not require initialisation, but traditional feature matching methods (e.g. correlation of image patches around Harris corners [Har92, ST94]) have lacked the invariance properties needed to enable reliable matching of arbitrary panoramic image sequences. In this chapter we describe an invariant feature based approach to fully automatic panoramic image stitching. This has several advantages over previous approaches. Firstly, our use of invariant features enables reliable matching of panoramic image sequences despite rotation, zoom and illumination change in the input images. Secondly, Chapter 3. Automatic Panoramic Image Stitching 51 by viewing image stitching as a multi-image matching problem, we can automatically discover the matching relationships between the images, and even recognise panoramas in unordered datasets. Thirdly, we generate high-quality results using automatic straightening, gain compensation, and multi-band blending to render seamless output panoramas. The results shown in this chapter are from a system implemented us- ing SIFT features [Low99]. We have also implemented a similar system using M O P S (chapter 2). The remainder of this chapter is structured as follows. Section 3.2 develops the geometry of the problem and motivates our choice of invariant features. Section 3.3 describes our image matching methodology (RANSAC) and a probabilistic model for image match verification. In section 3.4 we describe our image alignment algorithm (bundle adjustment) which jointly optimises the parameters of each camera. Sections 3.5 - 3.7 describe the rendering pipeline including automatic straightening, gain compensation and multi-band blending. In section 3.9 we present conclusions and ideas for future work. 3.2 Feature Matching The first step in the panoramic recognition algorithm is to extract and match SIFT features between all of the images. SIFT features are located at scale-space max- ima/minima of a difference of Gaussian function. A t each feature location, a characteristic scale and orientation is established. This gives a similarity-invariant frame in which to make measurements. Although simply sampling intensity values in this frame would be similarity invariant, the invariant descriptor is actually computed by accumulating local gradients in orientation histograms. This allows edges to shift slightly without altering the descriptor vector, giving some robustness to affine change. This spatial accumulation is also important for shift invariance, since the interest point locations are typically accurate in the 0-3 pixel range (see figure 2.17 and also [BSW05, SZ03]). Illumination invariance is achieved by using gradients (which eliminates bias) and normalising the descriptor vector (which eliminates gain). Assuming that the camera rotates about its optical centre, the group of transformations the images may undergo is a special group of homographies. We parameterise each camera by a rotation vector 9 = [6\, 8 ,0 ] and focal length / . This gives pairwise 2 3 Chapter 3. Automatic Panoramic Image Stitching 52 homographies iij = H ^ u , where Hjj — K j R j R j K j 1 (3-1) and i i i , iij are the homogeneous image positions (iij = S;[iij,l], where Uj is the 2dimensional image position). The 4 parameter camera model is defined by 'fi o 0' 0 fi 0 0 0 1 (3.2) and (using the exponential representation for rotations) 0 Ft = e [Oil . Pi]x = # 0 — i3 i3 0 — &i2 Gil 9 i2 -0^1 (3.3) 0 Ideally one would use image features that are invariant under this group of transformations. However, for small changes in image position du.j u = u ? (3.4) ALL, or equivalently iij = A y t i j , where a n a>\2 «i3 0 0 (3-5) fl23 1 is an affine transformation obtained by linearising the homography about u . This i 0 implies that each small image patch undergoes an affine transformation, and justifies the use of SIFT features which are partially invariant under affine change. Once features have been extracted from all n images (linear time), they must be matched. Since multiple images may overlap a single ray, each feature is matched to its k nearest neighbours (we use k = 4). This can be done in 0(n log n) time by using a k-d tree to find approximate nearest neighbours [BL97]. A k-d tree is an axis aligned binary space partition, which recursively partitions the feature space at the mean in the dimension with highest variance. The k-d tree is ultimately (for large n) more Chapter 3. Automatic Panoramic Image Stitching 53 efficient than the indexing scheme described in section 2.8 as each feature lookup is O(logn) (and not 0(n)). 3.3 Image Matching At this stage the objective is to find all matching (i.e. overlapping) images. Connected sets of image matches will later become panoramas. Since each image could potentially match every other one, this problem appears at first to be quadratic in the number of images. However, it is only necessary to match each image to a small number of neighbouring images in order to get a good solution for the image geometry. From the feature matching step, we have identified images that have a large number of matches between them. We consider a constant number m images, that have the greatest number of feature matches to the current image, as potential image matches (we use m = 6). First, we use R A N S A C to select a set of inliers that are compatible with a homography between the images. Next we apply a probabilistic model to verify the match. 3.3.1 Robust Homography Estimation using RANSAC R A N S A C (random sample consensus) [FB81] is a robust estimation procedure that uses a minimal set of randomly sampled correspondences to estimate image transformation parameters, and finds a solution that has the best consensus with the data. In the case of panoramas we select sets of r = 4 feature correspondences and compute the homography H between them using the direct linear transformation (DLT) method [HZ04]. We repeat this with n = 500 trials and select the solution that has the maximum number of inliers (whose projections are consistent with H within a tolerance e pixels). Given the probability that a feature match is correct between a pair of matching images (the inlier probability) is Pi, the probability of finding the correct transformation after n trials is p ( H is correct) = 1 - (1 - (pi) ) r (3.6) n After a large number of trials the probability of finding the correct homography is very high. For example, for an inlier probability pi = 0.5, the probability that the correct homography is not found after 500 trials is approximately 1 x 10~ . 14 Chapter 3. Automatic Panoramic Image Stitching 54 R A N S A C is essentially a sampling approach to estimating H . If instead of maximising the number of inliers one maximises the sum of the log likelihoods, the result is maximum likelihood estimation (MLE). Furthermore, if priors on the transformation parameters are available, one can compute a maximum a posteriori estimate (MAP). These algorithms are known as M L E S A C and M A P S A C respectively [Tor02]. 3.3.2 Probabilistic Model for Image Match Verification For each pair of potentially matching images we have a set of feature matches that are geometrically consistent ( R A N S A C inliers) and a set of features that are inside the area of overlap but not consistent ( R A N S A C outliers). The idea of our verification model is to compare the probabilities that this set of inliers/outliers was generated by a correct image match or by a false image match. For a given image we denote the total number of features in the area of overlap rif and the number of inliers n*.. The event that this image matches correctly/incorrectly is represented by the binary variable m e {0,1}. The event that the i th feature match e{0,1} is an inlier/outlier is assumed to be independent Bernoulli, so that the total number of inliers is Binomial p(/( "/)|m = l ) 1: p(/( -/)| 1 m = = 0) = B( ;n , ) (3.7) B(n n ,p ) (3.8) ni f Pl i; f 0 where i is the probability a feature is an inlier given a correct image match, and p is P 0 the probability a feature is an inlier given a false image match. The number of inliers rii — Y^lL\ a n d B(.) is the Binomial distribution B(x-n,p) = • ..p (l-p) x\(n — x)\ x n x (3.9) We choose values p\ = 0.6 and po = 0.1. We can now evaluate the posterior probability that an image match is correct using Bayes' Rule Chapter 3. Automatic Panoramic Image Stitching 55 (3.10) 1 ~ 1 p(/( p(/ l!W (1:n /V=0)p(m=0) / |m=l)p(m=l) ) We accept an image match if p(m = l | / ( / ) ) > p 1:n min B{rii\rif,pi)p(m = 1) ^P B(rii\n ,p )p(m = 0) f ct - f ac 0 1 t 1 re e choosing values p(m = 1) = 1 0 and p - 6 min ni>a (3.11) 1 (3.12) Pmin = 0.999 gives the condition + pn (3.13) f for a correct image match, where a = 8.0 and (3 = 0.3. Though in practice we have chosen values for p , pi, p(m = 0), p(m = 1) and p in-, they could in principle be 0 m learnt from the data. For example, pi could be estimated by computing the fraction of matches consistent with correct homographies over a large dataset. Once pairwise matches have been established between images, we can find panoramic sequences as connected sets of matching images. This allows us to recognise multiple panoramas in a set of images, and reject noise images which match to no other images (see figure (3.2)). 3.4 Bundle Adjustment Given a set of geometrically consistent matches between the images, we use bundle adjustment [TMHF99] to solve for all of the camera parameters jointly. This is an essential step as concatenation of pairwise homographies would cause accumulated errors and disregard multiple constraints between images e.g. that the ends of a panorama should join up. Images are added to the bundle adjuster one by one, with the best matching image (maximum number of matches) being added at each step. The new image is initialised with the same rotation and focal length as the image to which it best matches. Then the parameters are updated using Levenberg-Marquardt. The objective function we use is a robustified sum squared projection error. That Chapter 3. Automatic Panoramic Image Stitching (c) S I F T matches 1 . (e) (d) 56 S I F T matches 2 Images aligned according to a homography Figure 3.1: SIFT features are extracted from all of the images. After matching all of the features using a k-d tree, the m images with the greatest number of feature matches to a given image are checked for an image match. First R A N S A C is performed to compute the homography, then a probabilistic model is invoked to verify the image match based on the number of inliers. In this example the input images are 517 x 374 pixels and there are 247 correct feature matches. Chapter 3. Automatic Panoramic Image Stitching (a) (b) 57 A l l feature matches Geometrically consistent feature matches (c) Output panoramas Figure 3.2: Recognising panoramas. Given a noisy set of feature matches (a), we use R A N S A C and a probabilistic verification procedure to find consistent image matches (b). Connected components of image matches are stitched into panoramas (c). Chapter 3. Automatic Panoramic Image Stitching 58 is, each feature is projected into all the images in which it matches, and the sum of squared image distances is minimised with respect to the camera parameters . Given 1 a correspondence uf <-> ( u | denotes the position of the kth feature in image z), the residual is 4 = u?-p* (3.14) where p£- is the projection from image j to image i of the point corresponding to p£ = K i R i R j K j * ; (3.15) 1 The error function is the sum over all images of the robustified residual errors e = E E E'/tfi) (- ) 3 t=l jeJ(i) keF(i,j) where n is the number of images, 16 is the set of images matching to image i , ^{i, j) is the set of feature matches between images i and j. We use a Huber robust error function { Ixl , if Ixl < cr 2 1 1 1 2cr|x| — cr , (3.17) 1 if |x| > cr This error function combines the fast convergence properties of an L norm optimisation 2 scheme for inliers (distance less than a), with the robustness of an L\ norm scheme for outliers (distance greater than cr). We use an outlier distance a = oo during initialisation and a — 2 pixels for the final solution. This is a non-linear least squares problem which we solve using the LevenbergMarquardt algorithm. Each iteration step is of the form 0 = (J J + AC; )- J r T 1 1 T where 0 are all the parameters, r the residuals and J = dr/d®. (3.18) We encode our prior N o t e that it would also be possible (and in fact statistically optimal) to represent the unknown ray directions X explicitly, and to estimate them jointly with the camera parameters. This would not increase the complexity of the algorithm if a sparse bundle adjustment method was used (as in section 5.4.1). This is left for future work. 1 Chapter 3. Automatic Panoramic Image Stitching 59 belief about the parameter changes in the (diagonal) covariance matrix C p C — p n 0 0 0 0 0 a e 0 0 0 0 0 a 0 0 0 0 0 a} 0 0 0 0 0 2 2 (3.19) This is set such that the standard deviation of angles is ag = 7r/16 and focal lengths af — f/10 (where / is the mean of the focal lengths estimated so far). This helps in choosing suitable step sizes, and hence speeding up convergence. For example, if a spherical covariance matrix were used, a change of 1 radian in rotation would be equally penalised as a change of 1 pixel in focal length. Finally, the A parameter is varied at each iteration to ensure that the objective function of equation 3.16 does in fact decrease. The derivatives are computed analytically via the chain rule, for example (3.20) where d x/z y/z x y z 1/z 0 -x/z 0 l/z -y/z 2 2 (3.21) and dRi <9Rj —i / ft [0iU (3.22) 0 0 9 e e 0 0 1 0 0-1 0 (3.23) Chapter 3. Automatic Panoramic Image Stitching 3.4.1 60 Fast Solution by Direct Computation of the Linear System Since the matrix J is sparse, forming J J by explicitly multiplying J by its transpose is T inefficient. In fact, this would be the most expensive step in bundle adjustment, costing 0(MN ) 2 for a n M x J V matrix J. The sparseness arises because each image typically only matches to a small subset of the other images. This means that in practice each element of J J can be computed in much fewer than M multiplications T (3.24) i.e. the inverse covariance between cameras i and j depends only on the residuals of feature matches between i and j. Similarly, J r need not be computed explicitly, but can be computed via T (3.25) In both cases each summation would require M multiplications if each feature matched to every single image, but in practice the number of feature matches for a given image is much less than this. 3.5 Automatic Panorama Straightening Image registration using the steps of sections 3.2 - 3.4 gives the relative rotations between the cameras, but there remains an unknown 3D rotation to a chosen world coordinate frame. If we simply assume that R = I for one of the images, we typically find a wavy effect in the output panorama. This is because the real camera was unlikely to be perfectly level and un-tilted. We can correct this wavy output and automatically straighten the panorama by making use of a heuristic about the way people typically shoot panoramic images. The idea is that it is rare for people to twist the camera relative to the horizon, so the camera X vectors typically lie in a plane. By finding the null vector of the covariance matrix of the camera X vectors, we can find the Chapter 3. Automatic Panoramic Image Stitching (a) (b) 61 (c) Figure 3.3: Finding the up-vector u. A good heuristic to align wavy panoramas is to note that people rarely twist the camera relative to the horizon. Hence despite tilt (b) and rotation (c), the camera X vectors typically lie in a plane. The up-vector u (opposite to the direction of gravity) is the vector normal to this plane. "up-vector" u (normal to the plane containing the camera centre and the horizon) ^ X i=i ; X [ u = 0 (3.26) / Applying a global rotation such that up-vector u is vertical (in the rendering frame) effectively removes the wavy effect from output panoramas as shown in figure 3.4. 3.6 Gain Compensation In previous sections, we described a method for computing the geometric parameters (orientation and focal length) of each camera. In this section, we show how to solve for a photometric parameter, namely the overall gain between images. This is set up in a similar manner, with an error function defined over all images. The error function is the sum of gain normalised intensity errors for all overlapping pixels n e = 2 E n E E iii ~ {gikiyo) - QjlMi)) 2 HijUj (3-27) Chapter 3. Automatic Panoramic Image Stitching (a) 62 Without automatic straightening (b) W i t h automatic straightening Figure 3.4: Automatic panorama straightening. Using the heuristic that users rarely twist the camera relative to the horizon allows us to straighten wavy panoramas by computing the up-vector (perpendicular to the plane containing the horizon and the camera centre). where g , gj are the gains, and TZ(i,j) is the region of overlap between images i and j. t In practice we approximate /(u.j) by the mean in each overlapping region lij Iii = E "^ ( i J " (3.28) ) / i ( U i ) This simplifies the computation and gives some robustness to outliers, which might arise due to small misregistrations between the images. Also, since g = 0 is an optimal solution to the problem, we add a prior term to keep the gains close to unity. Hence the error function becomes 1 e n = o£ i=\ n E j=i V ( W « - 9JIH)WN + (1 - N ft)7"2) (3-29) where iVy = \H(i,j)\ equals the number of pixels in image i that overlap in image j. The parameters and o are the standard deviations of the normalised intensity g error and gain respectively. We choose values <tjv = 10.0, (7e{0..255|) and u — 0.1. g This is a quadratic objective function in the gain parameters g which can be solved in closed form by setting the derivative to 0 (see figure 3.5). Chapter 3. Automatic Panoramic Image Stitching (a) (b) (c) 63 Without gain compensation W i t h gain compensation W i t h gain compensation and multi-band blending Figure 3.5: Gain compensation. Note that large changes in brightness between the images are visible if gain compensation is not applied (a). After gain compensation, some image edges are still visible due to unmodelled effects such as vignetting (b). These can be effectively smoothed out using multi-band blending (c). Chapter 3. Automatic Panoramic Image Stitching 3.7 64 Multi-Band Blending Ideally each sample (pixel) along a ray would have the same intensity in every image that it intersects, but in reality this is not the case. Even after gain compensation some image edges are still visible due to a number of unmodelled effects, such as vignetting (intensity decreases towards the edge of the image), parallax effects due to unwanted motion of the optical centre, mis-registration errors due to mis-modelling of the camera, radial distortion and so on. Because of this a good blending strategy is important. From the previous steps we have n images P(x,y) (i e {l..n}) which, given the known registration, may be expressed in a common (spherical) coordinate system as P(6,<f>). In order to combine information from multiple images we assign a weight function to each image W(x,y) = w(x)w(y) where w(x) varies linearly from 1 at the centre of the image to 0 at the edge. The weight functions are also resampled in spherical coordinates W (9, (f>). A simple approach to blending is to perform a weighted l sum of the image intensities along each ray using these weight functions rUnear where I hnear EIU ^ 4)^(6, </>) EIU^(M) _ (f) (3 - 30) (9, 4>) is a composite spherical image formed using linear blending. How- ever, this approach can cause blurring of high frequency detail if there are small registration errors (see figure 3.9). To prevent this we use the multi-band blending algorithm of Burt and Adelson [BA83]. The idea behind multi-band blending is to blend low frequencies over a large spatial range, and high frequencies over a short range. We initialise blending weights for each image by finding the set of points for which image i is most responsible = ^ M - . * . - ^ . , , ) I0 ( 3 3 i ) otherwise i.e. Wm (0, (f>) is 1 for (9, (ft) values where image i has maximum weight, and 0 where OI some other image has a higher weight (see figure 3.6). These max-weight maps are successively blurred to form the blending weights for each band. A high pass version of the rendered image is formed £{9, <t>) = I (6, 4)-r(6 i t flight) (3.32) Chapter 3. Automatic Panoramic Image Stitching where and g {9,4>) 1S a a 65 Gaussian of standard deviation <r, and I (6,<f>) represents spatial a frequencies in the range of wavelengths A — 0 —> a. We blend this band between images using a blending weight formed by blurring the max-weight map for this image B {d,<t>) = W (6A)*9<r{0A) (3.33) i a rnax where B (8,(j)) is the blend weight for the wavelength 0 —> a band. Subsequent frel a quency bands are blended by forming lower frequency bandpass images and further blurring the blend weights, i.e. for k > 1 t i ) . = £ ( W = 4 , - 4 , * 0* (3-34) l*9a (3.35) B For each band, overlapping images are linearly combined using the corresponding blend weights ° H ^UBLMJ) ( 3 - 3 6 ) This causes high frequency bands (small ka) to be blended over short ranges whilst low frequency bands (large ka) are blended over larger ranges (see figure (3.7)). Note that we have chosen to render the panorama in spherical coordinates 9,<f>. In principle one could choose any 2-dimensional parameterisation of a surface around the viewpoint for rendering. One good choice would be to render to a triangulated sphere, constructing the blending weights in the image plane. This would have the advantage of uniform treatment of all images, and it would also allow easy resampling to other surfaces (in graphics hardware). A n assumption of our blending strategy is that the image whose centre is closest to a given pixel in the rendering has the 'best' information about that pixel, and therefore the highest blending weight. This might not always be the case, for example, if the image set contained defocussed or otherwise degraded images. In this scenario one might want to select blending weights based on some other criterion e.g., sharpness of the image. Chapter () c w 3. Automatic Panoramic Image Stitching ( ^) 66 (d) W {6,4>) e max Figure 3.6: Images I(x,y) and weights W(x,y) are resampled in spherical coordinates 1(9, (j)), W(9,<p) and max weight maps W (9, (f>) are computed. In linear blending images are combined as linear sums using the weights W{9, <f>). In multi-band blending blurred versions of the max weight map are used to form separate blending functions for each frequency band. max Chapter 3. Automatic Panoramic Image Stitching (c) (a) Band 1 (scale 0 to a) (b) Band 2 (scale a to 2a) G7 Band 3 (scale lower than 2a) Figure 3.7: Multi-band blending. Bandpass images Ik<j(9, 4>) for k = 1,2,3 are shown on the left, with the corresponding blending weights B (6, 4>) shown on the right. Initial blending weights are assigned to 1 where each image has maximum weight. To obtain each blending function, the weights are blurred at spatial frequency a and bandpass images of the same spatial frequency are formed. The bandpass images are blended together using weighted sums based on the blending weights (Note: the contrast has been enhanced and blending widths exaggerated for clarity in these figures). ka Chapter 3. Automatic Panoramic Image Stitching (a) No blending (b) 68 Multi-band blending Figure 3.8: Results for panorama rendering with and without multi-band blending. See figure 3.7 for the blending functions used for these images. In this case multi-band blending smooths out exposure differences and vignetting between the images. For an example of dealing with with misregistrations, see figure 3.9. (a) Linear blending (b) Multi-band blending Figure 3.9: Comparison of linear and multi-band blending. The image on the right was blended using multi-band blending using 5 bands and a = 5 pixels. The image on the left was linearly blended. In this case matches on the moving person have caused small misregistrations between the images, which cause blurring in the linearly blended result, but multi-band blended image is clear. Chapter 3. Automatic Panoramic Image Stitching 69 Algorithm: Panoramic Recognition Input: n unordered images I. Extract SIFT features from all n images II. Find k nearest-neighbours for each feature using a k-d tree III. For each image: (i) Select m candidate matching images that have the most feature matches to this image (ii) Find geometrically consistent feature matches using R A N S A C to solve for the homography between pairs of images (iii) Verify image matches using a probabilistic model IV. Find connected components of image matches V . For each connected component: (i) Perform bundle adjustment to solve for the rotation 81,82,03 and focal length / of all cameras (ii) Render panorama using multi-band blending Output: Panoramic image(s) 3.8 Results Figure 3.10 shows typical operation of the panoramic recognition algorithm. A set of images containing 2 panoramas and 5 noise images was input. The algorithm detected 2 connected components of image matches and 5 unmatched images, and output the 2 blended panoramas. The complete algorithm ran in 83 seconds on a 2GHz P C , with input images 525 x 375 pixels (7" x 5" prints scanned at 75 dpi), and rendering the larger output panorama as a 300 x 3000 pixel image. The majority of computation time is spent in extracting the SIFT features from the images. Figure 3.11 shows a larger example where 80 images were used to create a 360° x 90° panorama. No user input is required: the object recognition system decides which images match, and the bundle adjustment algorithm optimises jointly for the Chapter 3. Automatic Panoramic Image Stitching 70 4 x 80 = 320 parameters of all the cameras. Finally, the multi-band blending scheme effectively hides the seams despite the illumination changes (camera flash and change 2 in aperture/exposure). We have tested the system on many other image sets, for example full 360° x 180° panoramas, and sequences where different cameras are used in the same panorama. 3.9 Summary This chapter has presented a novel system for fully automatic panorama stitching. Our use of invariant local features and a probabilistic model to verify image matches allows us recognise multiple panoramas in unordered image sets, and stitch them fully automatically without user input. The system is robust to camera zoom, orientation of the input images, and changes in illumination due to flash and exposure/aperture settings. A multi-band blending scheme ensures smooth transitions between images despite illumination differences, whilst preserving high frequency details. Future Work Possible areas for future work include compensation for motion in the camera and scene, and more advanced modelling of the geometric and photometric properties of the camera: Camera Motion Panoramas often suffer from parallax errors due to small motions of the optical centre. These could be removed by solving for camera translations and depths in the scene, before re-rendering from a central point. A good representation to use might plane at infinity plus parallax [RC02] (appendix A.6). Whilst gross camera motions cause parallax artifacts, small motions during shooting result in motion blur. Motion blurred images could be deblurred using nearby in-focus images as in [BBZ96]. Similar techniques can also be used to generate super-resolution images [CZ98]. Scene Motion Though our multi-band blending strategy works well in many cases, large motions of objects in the scene cause visible artifacts when blending between Note that in this case the camera flash caused significant illumination changes between images of the grass in the lower part of the panorama. In some of the grass images the flash fired whilst in others it did not. 2 Chapter 3. Automatic Panoramic Image Stitching 73 multiple images. Another approach would be to automatically find optimal seam lines based on regions of difference between the images [Dav98, UES01, ADA+04]. Advanced Camera Modelling A n important characteristic of most cameras that is not modelled by the projective camera model (which preserves straight lines) is radial distortion [Bro71]. This can be accurately modelled by low order polynomials, the parameters of which could be included in the bundle adjustment framework. The ideal image stitcher would also support multiple motion models, for example, rotation about a point (e.g. panoramas), viewing a plane (e.g. whiteboards) and Euclidean transforms (e.g. aligning scanned images). One could also render to multiple surface types e.g. spherical, cylindrical, planar. Further geometric corrections could also be applied using local warping as in [SSOO]. Photometric Modelling In principle it should also be possible to estimate many of the photometric parameters of the camera. Vignetting (decrease in intensity towards image edges) is a common source of artifacts, particularly in uniform colour regions such as sky [LS05]. One could also acquire high-dynamic range [DM97, SHS ] information from the overlapping image regions, and render tone + mapped or synthetic exposure images. We have developed a C++ implementation of the algorithm described in this chapter, called AutoStitch. A demo of this program can be downloaded from the website at http://www.autostitch.net. Chapter 3. Automatic Panoramic Image Stitching (a) 71 Input images (b) Output panorama 1 (c) Output panorama 2 Figure 3.10: Typical operation of the panoramic recognition algorithm: an image set containing multiple panoramas and noise images is input, and panoramic sequences are recognised and rendered as output. The algorithm is insensitive to the ordering, scale and orientation of the images. It is also insensitive to noise images which are not part of a panorama. Chapter 3. Automatic Panoramic Image Stitching (a) (b) (c) 72 40 of 80 images registered A l l 80 images registered Rendered with multi-band blending Figure 3.11: Green College. This sequence was shot using the camera's automatic mode, which allowed the aperture and exposure time to vary, and the flash to fire on some images. Despite these changes in illumination, the SIFT features match robustly and the multi-band blending strategy yields a seamless panorama. These 360° x 90° images have been rendered in spherical coordinates (8,<f>). The sequence consisted of 80 images all of which were matched fully automatically with no user input, and a 4 x 80 = 320 parameter optimisation problem was solved for the final registration. With 400 x 300 pixel input images and a 500 x 2000 pixel output panorama, the whole process ran in 197 seconds on a 2GHz P C . 74 Chapter 4 Evaluation of Image Stitching Algorithms 4.1 Introduction In the previous chapters we developed apparatus for fully automatic multi-image registration. This chapter develops a framework for evaluation and performance tuning of such multi-image matching methods. Algorithms for stitching multiple images into seamless photomosaics have been used in satellite photography and digital mapping for decades [Mil75]. Early approaches involved much user input and specialised hardware [Sla80, Mee90]. Computer vision techniques have brought increasing automation to the problem, from automatic pairwise alignment [BAHH92, IA99] to bundle adjustment [SK99, SSOO], and culminating in systems that recognise matches and stitch images with no user input whatsoever [BL03, BSW05]. As more approaches are proposed [ZFD97, CZ98, SK99, SSOO, BL03] it becomes increasingly important to form comparisons and elicit best practices. There has been excellent comparison work in the related areas of stereo [SS02] and feature matching [MS03]. However, knowing the performance of individual feature matching techniques is not sufficient to predict overall stitching performance, which involves global decisions as to which image pairs truly match and global optimization for registration. Relying on visual inspection of stitching results to assess algorithm performance is tedious and becomes unworkable for large test sets. Instead we develop a database of image sequences for which the ground truth registration is known, and a metric for quantative evaluation of stitching results. This allows us to form comparisons between different image stitching methods, and to tune the parameters of these increasingly complex algorithms on real world data. Chapter 4. Evaluation of Image Stitching Algorithms 4.1.1 75 Image Stitching Algorithms In this chapter we focus on multi-image matching problems in which there is a one-toone correspondence between the images. In particular we are interested in problems where images are related by a linear transformation in 2D projective space, so that each camera can be characterised by a 3 x 3 projection matrix P. This generalises most image stitching problems, for example panoramas (rotating camera), whiteboards (moving camera, planar scene) and flatbed scanning. It does not include cases where the camera is free to move in 3-dimensions (and the scene is non-planar), or cases where the objects in the scene are in motion. 4.2 Evaluation of Image Stitching Algorithms For the purposes of this chapter, we concentrate on datasets with only a single image sequence to be stitched, although outlier images which are not part of the sequence may be present. Given a set of gold standard projection matrices P* = {P?,P*,...P;} (4.i) and a test set of image stitching parameters P = {P ,P ,...P } 1 2 N (4.2) we wish to write an evaluation function e = e(P*,P) (4.3) to evaluate the test set against the gold standard. Since we are only interested in the point mapping between images Py = P i P j 1 (4.4) the P matrices may be post-multiplied by an arbitrary (invertible) 3 x 3 matrix, and still yield the same stitching results. One could attempt to recover this matrix and compute distances in the space of projection matrices. However, we have found that different sets of camera parameters can often give quite similar stitching results (see Chapter 4. Evaluation of Image Stitching Algorithms 76 Figure 4.1: Evaluation function for multi-image matching. Uniformly distributed points VL-® are projected from image i to each matching image j. The residual r-^ between the projections of the point under the gold standard and test homographies is computed (a). The error function is the root mean square of all such residuals. Random points are generated for all images i, and residuals are summed for all images that overlap image i (b). figures 4.3 and 4.4). Also, for stitching problems, we are generally most concerned with errors in the image plane, since these lead to visible artifacts in the stitching results. Hence, we propose an evaluation function based on the sum squared projection error of random image points, relative to the gold standard i=l j=l fc=l (k) where the residual r)- is the difference between the projections of the A;th random point from image % to image j under the gold standard and test homographies (fc) r,-,~{k) J (fc) * J — UL 3 U,- (4.6) (fc) = (4.7) U = u- fc) P*-U (4.8) W is a random point uniformly distributed in image i, and P*^ and Py are the gold standard and test homographies from image i to image j. See figure 4.1. The third (fc) summation in equation 4.5 is over all randomly generated points u- that success- Chapter 4. Evaluation of Image Stitching Algorithms 77 fully project inside image j under either the gold standard or test homographies i.e. u[ ^eO(i,j) iff u ^ e l ( j ) or u*^ eX(j) where X(j) denotes the set of points in image k 3In practice we divide by the number of residuals and take the square root to give the R M S projection error over all images ( \ 0.5 *( '* P ] m ) (4.9) We generate a fixed number of random samples N = 100 in each image. 4.2.1 Dealing with Failed Matches Suppose that the image dataset contains some outlier images which do not belong to the panorama. In this case it will be possible for an image stitcher to generate false positives (images that are incorrectly matched, and do not match in the gold standard) and false negatives (images that fail to match, and are matched in the gold standard). We would also like to label gross registration errors as failed matches. In order to differentiate minor registration errors from failed matches we compute the summation of equation 4.5 only over image pairs whose R M S errors are less than a threshold i ( )i2 ^fc=l,uf 60(i,j) l ^ f c r ) RMS C — 1 v^AT N -, ^r ^ 'max i l^- i u J 2~>k=l,u\ eO(i,j) k) Furthermore, we label an image i as a failed match if it is (a) a false positive, (b) a false negative, or (c) belongs to a pair whose R M S error is above the threshold r max . Hence, the overall evaluation of an image stitching result should be based on the R M S projection error of all correctly matching images e s RM (pixels), and the number of images which fail to match np- Note that we count the number of images that fail to match correctly, instead of counting the number of failed pairwise matches, so the number of failed matches 0 < n < n (where n is the number of images in the dataset). F Chapter 4. Evaluation of Image Stitching (a) (b) Algorithms 78 Initial panorama Resampled camera views Figure 4.2: Generating synthetic ground truth. To generate synthetic ground truth data we first use our existing stitchers to generate a panorama (a). Next, we resample virtual camera views from this panorama, for which the camera matrices are known exactly (b). (a) mm • — W i—j (b) Without 360° wrap-around Bffi ^"Trn n — liJTJ W i t h 360° wrap-around Figure 4.3: Panorama stitching with and without 360° wrap-around. In both cases an optimal algorithm (bundle adjustment) has been used to solve for the camera parameters. However, in (a) matching between the ends of the panorama has been suppressed, whereas in (b) matching between the ends is allowed. Note that although the stitching is accurate (RMS pixel errors are small) in both cases, the focal length estimate in (a) is out by almost 20% (520 pixels in (a) compared to 630 pixels in (b)). These results demonstrate that it is difficult to solve for focal length (even using an optimal bundle adjustment algorithm) unless 360 wrap-around is achieved (Cedar Court sequence, real image dataset). Chapter 4. Evaluation of Image Stitching Algorithms 79 Figure 4.4: Solution of focal length for panoramas of 360°, 315° and 180°. These results were computed for the synthetic database. Note that even though the R M S pixel errors are approximately the same in each case (figure (b)), the focal length is commonly off by up to 5% (even with synthetic data and an optimal algorithm for the solution) when the panorama is less than 360°. These results indicate that 360° wrap-around is essential for accurate solution of focal length. It also justifies our use of projection error and not distance in parameter space for our evaluation function. Chapter 4. Evaluation of Image Stitching 4.3 Algorithms 80 Creating Gold Standard Stitching Results We have used two methods for generating gold standard stitching results, using both synthetic and real world data. Synthetic Virtual camera views are generated from previously stitched panoramas. The camera matrices are known exactly. Real World Real camera views are registered using a state of the art algorithm (different from those being tested) and manual intervention as necessary. See appendix B for examples from these datasets. Note that only the first case gives actual "ground truth" camera matrices. The second approach is limited by the current best algorithms and human error, and is best described as "gold standard". To create our synthetic data, we first use an image stitching algorithm to generate large panoramas. We then resample these panoramas to form virtual camera views. The statistics of the resulting images will differ slightly from natural images due to errors in the stitching process, but the camera matrices will be exact. To create our real world dataset, we have used a direct (intensity-based) algorithm similar to [SS97]. Manual intervention was used for challenging situations such as low overlap or low texture areas, and to correct any visually unsatisfactory results. While the resulting camera matrices may be subject to small errors, real world images are used as inputs, which makes the results more representative of expected performance. 4.4 Experimental Setup In this section, we describe the experimental methodology used to compare the performance of two automatic stitching algorithms, as well as showing how the same apparatus is used to tune algorithm parameters. 4.4.1 Comparison of Automatic Stitching Algorithms We use the error metrics defined in section 4.2 to perform a comparison between two well known automatic image stitchers: AutoStitch , based on SIFT features [BL03] 1 1 h t t p : / / w w w . autostitch.net Chapter 4. Evaluation of Image Stitching Algorithms 81 from U B C , and MSRStitch, based on M O P S [BSW05] from Microsoft Research. We compare performance on synthetic and real datasets. The synthetic database consists of 10 sequences, each containing 16 images of 600 x 800 pixels, with 50% overlap between the images. The real database consists of 40 sequences of intentionally difficult stitching problems. These contain significant amounts of radial distortion, featureless regions (e.g. sky), motion (e.g. water) and parallax errors. We have created gold standard stitching data for 8 of these sequences using VideoMosaic [SS97]. We also use synthetically generated sequences to characterise performance with variable overlap and scale. 4.4.2 Parameter Tuning for Automatic Stitchers Our evaluation function can be used to tune the parameters of automatic image stitching algorithms. For example, AutoStitch uses a Huber robust error function in the bundle adjustment stage. This function takes a parameter cr which corresponds to the distance (in pixels) of outliers to be suppressed. By plotting the R M S projection error efiMS against <r, we can find an optimal value of cr. We perform similar experiments for the a parameter in AutoStitch (which controls the probability of declaring a valid image match given a set of feature matches) and also for the number of features extracted from each image. 4.5 Results Figure 4.4 shows R M S errors in focal length, and R M S pixel errors for the synthetic dataset. We found (as in [KW99]) that when the panoramas wrap around 360°, the solution for focal length is very good (average R M S error in focal length = 0.029%). However, if the panorama is less than 360°, the focal length estimates are often up to 5% off (figure 4.4(a) ), yet the R M S pixel errors are unchanged (figure 4.4(b) ). Figure 4.7 shows results for testing AutoStitch and MSRStitch on the synthetic database. Both stitchers perform very well, with R M S projection errors typically around 0.1 pixel. MSRStitch has a larger R M S error for sequence #5. This was caused by the stitcher failing to notice a match between a pair of images of fairly featureless snow (see figure 4.5). In these examples we used an outlier threshold of Tmax = 2 pixels. Chapter 4. Evaluation of Image Stitching Algorithms 82 L^-i^...^ Figure 4.5: Elfin sequence from the synthetic dataset. MSRStitch failed to match the rightmost pair of images, in which the area of overlap is mainly featureless snow. Figure 4.6: Comparison of AutoStitch and MSRStitch for the Alaska sequence from the real dataset. MSRStitch finds 6 matching images, and AutoStitch only 3. However, MSRStitch has actually stitched some images out of order, and has larger registration errors. The ground truth was stitched manually. Chapter 4. Evaluation of Image Stitching Algorithms 83 Figure 4.8 shows results for the (harder) real database. In this case we have set the outlier threshold r max to 50 pixels as there is significant radial distortion in many of these images. The radial distortion is the main cause of the relatively high R M S errors in figure 4.8(a) . For the more difficult sequences there are also many match failures due to dropped or misregistered images, see figure 4.6 for an example. We also tested AutoStitch and MSRStitch with variable image overlap and scale (figure 4.9). For the image overlap test we used images from the Green dataset (figure 4.2). The image size (600 x 800) and focal length were kept constant whilst the number of equally spaced images was varied to generate a range of overlap from 10% to 90%. The performance of MSRStitch dropped off for low image overlap (< 20%). This is probably due to the larger footprint of M O P S relative to the detection scale when compared to SIFT, which means that fewer features can be extracted towards image edges. For the variable image scale test, we again used the Green dataset, but with images at scales from 10% to 100%. The R M S error is shown relative to the full size images. AutoStitch gave superior performance for the smaller image scales. This may again be due to the larger footprint of M O P S compared to SIFT at a given detection scale. Figures 4.10 and 4.11 show stitching performance for AutoStitch whilst the parameters a, a and the number of features per image were varied. The parameter a is the minimum number of matches for a correct image match to be declared in the probabilistic matching model of equation 3.13. The parameter a is the outlier distance in the Huber robust error function 3.17. The plot for Huber a in figure 4.10(b) shows a clear minimum at 0.25 pixels which represents to the optimal setting for this parameter on this dataset. It would be instructive to repeat this experiment on other datasets e.g. real images with moving objects, where the optimal setting would likely be higher. The plot for a in figure 4.10(a) shows a wide basin (20 < a < 120) in which the optimal solution is achieved. Note that as a is increased, the R M S error increases as the quality of image registration falls, until finally some images fail to match. Once images fail to match, the R M S error often dips down again, as those remaining images can be registered adequately. Finally, figure 4.11 shows how stitching performance varies with the number of features extracted from each image. With small numbers of features (< 200) failed matches and misregistrations are common. This improves as the number of features increases, with negligible gains after about 400 features per image. Chapter 4. Evaluation of Image Stitching Algorithms 84 - autostitch - msrstitch - autostitch - msrstitch g 0, 5 6 sequence number 7 sequence number (a) (b) Figure 4.7: Comparison of AutoStitch and MSRStitch using the synthetic image dataset. This dataset contains 10 synthetic image sequences. Note that both stitchers give very accurate solutions (0.1 pixel error) compared to the ground truth. MSRStitch has a glitch in sequence 5 (figure 4.5) which contains a pair of images of snow with few features. 3 4 5 sequence number (a) 3 4 5 sequence number (b) Figure 4.8: Comparison of AutoStitch and MSRStitch using the real image dataset. This dataset contains 8 real image sequences. This dataset contains difficult sequences with radial distortion, featureless regions, motion and parallax. The radial distortion is the main cause of R M S errors around 10 pixels shown in (a). Chapter 4. Evaluation of Image Stitching Algorithms 85 Figure 4.9: Comparison of AutoStitch and MSRStitch for panoramas with variable image overlap and scale. The first dataset (a) contains sequences with overlap from 10% to 90%. The second dataset (b) contains sequences with images with scale from 30% to 100%. Note that the performance of MSRStitch drops off at low image scales and small overlaps. This may be due to the larger footprint of M O P S compared to SIFT at a given detection scale, meaning that fewer features can be extracted towards image edges. 0.155r alpha (a) huber sigma (b) Figure 4.10: Tuning of a and a parameters. The results in (a) were generated by stitching a dataset of 16 synthetic images whilst varying a in increments of 10 from 0 to 200. The results are compared to ground truth. Here, values of alpha in the range 20-120 give the best stitching results. The results in (b) show similar results for tuning the a parameter in the Huber robust error function. It can be seen that the optimal value of Huber a is 0.25 pixels in this case. Chapter 4. Evaluation of Image Stitching Algorithms 86 1.25 3 | 0.75 i 50 100 150 200 250 300 350 number of features per image 400 450 500 Figure 4.11: Effect of number of features extracted per image on R M S error, and number of failed matches. These results were computed using the Elfin (figure 4.5) dataset. Note that if the number of features extracted per image is small (< 200), then the number of failed matches is high. The R M S errors can still be low if the few images that remain are registered well. As more features are added, the R M S errors can actually increase as new images are matched but are poorly registered. In this case all images are correctly matched using 200 features per image but gains are minimal after more than about 400 features per image are used. 4.6 Summary We have proposed a framework for evaluation of image stitching algorithms and used it to compare two automatic image stitchers, AutoStitch and MSRStitch. Both algorithms performed very well on the synthetic ground truth data, with registration errors typically around 0.1 pixels. Generally, AutoStitch performed better on the real dataset, but both algorithms showed room for improvement, with radial distortion a major cause of large R M S projection errors. Future Work Advanced Camera/Scene Models In this work we sampled virtual camera views from panoramas using a simple 4 parameter camera model. A n interesting direction for future work would be to generate high quality computer graphic renderings of virtual scenes with a large number of camera and scene parameters e.g. motion of the camera (parallax), radial distortion, illumination changes etc. One would then attempt to solve for all of these parameters and evaluate the results using the image based error function of equation 4.5. Chapter 4. Evaluation of Image Stitching Algorithms 87 Feature Based vs Direct Methods The relative merits of direct and feature based methods have been the subject of much discussion in the computer vision community [TZ99, IA99]. A n interesting area for future work would be to compare these two methods based on their registration accuracy, using the evaluation framework presented here. 88 Chapter 5 3D Object Recognition and Reconstruction 5.1 Introduction Object recognition and structure and motion recovery are two long standing problems in computer vision. The structure and motion (SAM) problem has reached a degree 1 of maturity, with several commercial offerings [2D3, R E A ] , in addition to an extensive research literature [SK93, BTZ96, PolOO, HZ04]. Object recognition is also well studied but remains an extremely active research area, with recent advances in image features and probabilistic modelling inspiring previously unexplored areas such as object class recognition [FPZ03]. Invariant local features have emerged as an invaluable tool in tackling the ubiquitous image correspondence problem. By using descriptors that are invariant not just to translation, but also to rotation [SM97], scale [Low99] and affine warping [BauOO, MS02, MCUP02], invariant features provide much more robust matching than previous correlation based methods. Until recently, the majority of object recognition algorithms have depended upon some form of training phase [Low99, VJ01]. However, algorithms have been developed recently that operate in an unsupervised manner on an image dataset [BL03, SZ02, RLSP03]. In this chapter we develop such an algorithm. We operate in an unsupervised setting on an unordered image dataset, and pose the object recognition problem as one of finding matches that are consistent views of some 3D scene. Our algorithm recognises objects in the sense that it finds all images that view a given object or scene. However, it makes no attempt to attach a label to the object [DBdFF02] or generalise object classes [FPZ03]. The remainder of this chapter is structured as follows. In section 5.2 we describe our A l s o known as structure from motion (SFM). We prefer to use the term structure and motion ( S A M ) , to emphasise the fact that the estimation of 3D structure and camera motion parameters are fundamentally linked. 1 Chapter 5. 3D Object Recognition and Reconstruction 89 invariant feature extraction and matching scheme. Section 5.3 describes the geometric constraints used to find correct image matches. Section 5.4 describes the sparse bundle adjustment algorithm used to solve jointly for the cameras and structure. Section 5.5 demonstrates results of object recognition and reconstruction on a test dataset, and section 5.6 presents conclusions and ideas for future work. 5.2 Feature Matching We extract and match SIFT features from all images in exactly the same way as for automatic panorama stitching (described in section 3.2), using a k-d tree to find nearest neighbour matches. We then apply a feature space outlier rejection test as described in section 2.7.1. 5.3 Image Matching During this stage, the objective is to find all matching images, that is, those that view a common subset of 3D points. Connected sets of image matches will later become 3D models. From the feature matching step, we have identified images with a large number of matches between them. As in section 3.3, we consider a constant number m images as potential image matches (we use m = 6). We parameterise each camera using 7 parameters. Oi These are a rotation vector = [On, &i2,0i3], translation t* = [tn, t , t ] and focal length fi (see section 3.2). Each i2 i3 pairwise image match adds four constraints on the camera parameters whilst adding three unknown structure parameters X = [Xi,X , X] 2 — KjX 3 (5.1) c i = KjX (5.2) ' R i X + tj (5.3) = R j X + tj (5.4) cj where iij, u,- are the homogeneous image positions in camera i and j respectively. The single remaining constraint (4 equations minus 3 unknowns = 1 constraint) expresses the fact that the two camera rays p>j, pj and the translation vector between Chapter 5. 3D Object Recognition and Reconstruction (c) S I F T features 1 (d) S I F T features 2 (e) Epipolar geom 1 (f) Epipolar geom 2 (g) 1 R A N S A C inliers (h) 2 R A N S A C inliers 90 Figure 5.1: Finding sets of consistent matches using SIFT and R A N S A C . SIFT features are extracted from all input images, and each feature is matched to k = 4 nearest neighbours. Outliers are first rejected by thresholding against the distance of an incorrect match (section (2.7.1)), before R A N S A C is used to find a final set of inliers that are consistent with the fundamental matrix. For this pair of 1024 x 768 input images, there were 365 SIFT features in image 1 and 379 in image 2. Of the initial feature matches, 133 matches remained after feature space outlier rejection, and there were 103 matches in the final solution after using R A N S A C . Chapter 5. 3D Object Recognition and Reconstruction 91 camera centres ty are coplanar, and hence their scalar triple product is equal to zero P?[ttf]xPi = 0 Writing p>j,Pj and (5.5) in terms of camera parameters Pi = RfKr *, (5.6) PJ = RjK^uj (5.7) tij = R,Jtj — R f tj 1 (5-8) and substituting in equation 5.5 gives u f Fj.-Uj = 0 (5.9) where Fij = K - R j [Rjt,- - R f tj] R J K J r (5.10) 1 x This is the well known epipolar constraint. Image matching entails robust estimation of the fundamental matrix [TM97]. Since equation 5.9 is non-linear in the camera parameters, it is commonplace to relax the non-linear constraints and estimate a general 3 x 3 matrix F j j . This enables a closed form solution via S V D [HZ04]. We use R A N S A C to robustly estimate F and hence find a set of inliers that have consistent epipolar geometry. A n image match is declared if the number of R A N S A C inliers n i in iers > n h, matc where the minimum number of matches n h matc is a constant (typically around 20). 3D objects/scenes are identified as connected components of image matches. This simple approach typically requires hand tuning of the threshold nmatch- A n improved verification procedure might extend the probabilistic model of section 3.3.2 to include parallax. This is left for future work. 5.4 Bundle Adjustment Given a set of geometrically consistent matches, we use bundle adjustment to solve for the camera and structure parameters jointly. In contrast to other approaches Chapter 5. 3D Object Recognition and Reconstruction 92 [HZ04, PolOO] that begin with a projective reconstruction and later refine to a metric reconstruction, we solve directly for the metric structure and camera parameters. The cameras are added one by one, starting with the best matching pair. We have found that initialising each new camera with the rotation, translation and focal length of the best matching image works well, even if the images have different rotation and scale (see example in figure 5.4). In practice we peturb the camera positions slightly by adding Gaussian noise (of standard deviation unity) to the translation vector of each new camera that is added. To cope with Necker reversal, we first run bundle adjustment on the initial image pair, noting the final value of the error function (section 5.4.1). We then swap the camera positions, and flip the 3D point depths, before repeating bundle adjustment (as in [SK93]). This normally converges to a different local minimum. We retain the solution that minimises the error function. 5.4.1 Sparse Bundle Adjustment Each connected component of feature matches defines a 3D point X j , and our error function is the sum squared error between the projected 3D point and the measured feature position iel jeX(i) where J is the set of all images, X(i) is the set of 3D points projecting to image i, and Tij is the residual error in image % for 3D point j. The residual r^ is the difference between the measured feature position and projected 3D point Yij = my - u tj (5.12) where m,j is the measured feature position, and is the projection of point X j in image i utj = K ( R X - + t ) i i J i (5.13) We use a Huber robust error function as in equation 3.17. The outlier distance o is set at 3 standard deviations of the current (un-normalised) residual error. We use the Levenberg-Marquardt algorithm to solve this non-linear least squares problem. Each Chapter 5. 3D Object Recognition and Reconstruction 93 iteration step is of the form $ = (J J + A C - ) - J r T 1 1 (5.14) T where <I> = [0,X] is the vector of camera (0) and structure (X) parameters, r is the vector of residuals and J = dr/d$>. The Jacobean J is an M x N matrix, where M is the number of measurements (twice the number of features), and N — UQ + nx is the number of camera (n&) and structure (nx) parameters (7 for each camera plus 3 for each 3D point). The prior covariance matrix C is set such that the standard p deviation of angles is <JQ = n/16, translations a — 0.01, focal lengths Oj = //100 and t 3D points ax = 0.1. Note that new cameras are initialsed with the same translation, rotation and focal length as the best matching camera (largest number of matches), but Gaussian noise of standard deviation 1 is added to the new camera to prevent the camera centres being coincident. Although one could in principle solve equation 5.14 directly (by solving an N x N linear system), to do so would ignore the sparse structure of the problem, and be very inefficient. Firstly, the matrix J is mostly zeros (since the derivatives of residuals for image i are zero except with respect to the parameters of image i), so the elements of J J T should be computed directly, instead of computing J first. Examining the structure of J J r dr T dr d® d@ dr Tdr ^©x a© ax J J = (5.15) r dr Tdr dr T dr Cx I ax a© ax ax where the camera parameter inverse covariance matrix j a©i a©i 0 0 0 o E j a© a© dr2j dr2j T 2 0 1 2 0 dr j dr j T E j d& 3 (5.16) 3 3 a© 3 is block diagonal, consisting of 7 x 7 blocks, and the structure parameter inverse co- Chapter 5. 3D Object Recognition and Reconstruction 94 variance matrix dr drn E i 9X! axi T 0 n 0 9r dr i 9X2 3X2 T 0 i2 0 i2 (5.17) 0 is also block diagonal, consisting of 3 x 3 blocks. The camera/structure cross covariance is a full matrix C1 - 1 (5.18) HX — but consists of a single multiplication for each element (the covariance between image i and point j depends only on the residual of point j in image i). Computing J J by T explicit multiplication of J would take 0(MN ) 2 be computed in 0(nQn ) x operations. However, J J can in fact T operations (the cost of computing C@ ). x Secondly, the matrix inversion involving J J need not be computed explicitly r (0(N )) 3 due to the sparse structure of J J . This sparseness reflects the loose couT pling inbetween cameras, and inbetween 3D points, in the error function of equation 5.11. The cameras are independent given the 3D structure parameters, and the 3D points are independent given the cameras. Equation 5.14 may be rewritten A B T where B 0 C X e© (5.19) Chapter 5. 3D Object Recognition and Reconstruction A Cq + a C P 0 C Cx + cr C P X 1 2 1 B 2 95 1 (5.20) 1 (5.21) (5.22) 'ex dr T e© ex (5.23) (5.24) dx 1 and Cp© 0 Multiplying both sides of equation 5.19 by A - BC^B B 0 7 C T 0 1 C P (5.25) X I - B C " 0 I 1 gives j 0 X |e e -BC^exl (5.26) ex which eliminates X from the solution for 0. This is known as a reduced camera subsystem [TMHF99]. Note that again none of the matrix products need be computed directly. For example each element of e© = [e© e © , . . . ] is computed as 1; 2 dr E 50" T ik Y i k (5.27) which involves only one term for each residual in image i. Similarly, each element of ex = [ e i , e 2» • • •] is given by X X dr = Yl dXi T e ki X i (5.28) and the right-hand side elimination product B C e x x ( B C e x ) i = y^BjfcCfcfcexfc 1 k (5.29) Chapter 5. 3D Object Recognition and Reconstruction 96 where B B12 B13 B22 B23 B31 B3 B33 n B21 C Where each element B ^ = | ^ = 2 Cn 0 0 0 C 2 2 0 0 0 C33 (5.30) (5.31) 1^ is a 7 x 3 matrix (number of parameters per camera x number of parameters per 3D point). This gives an no x UQ linear system to solve for the camera parameters 0. The resulting value of 0 can be substituted into the linear system for X , which reduces to independent 3 x 3 linear systems for each 3D point. The most expensive stage in this process is (potentially) in computation of the left-hand side elimination product BC _ 1 B . The ijth element is given by r (BC 1 B )jj = ^ BjfeCfc/tBfcj T (5.32) k where B y = -g^- is a 7 x 3 matrix (number of parameters per camera x number of parameters per 3D point) and C k = Yli f x ^ f x t i k sa 3 x 3 matrix. This summation may involve in the worst case M terms (if every 3D point is imaged in every camera). Hence the worst case complexity of sparse bundle adjustment is 0(Mn%). Note that this is still much cheaper than it would be were C not block diagonal. If C were a general nx x nx matrix the cost of this elimination step would be 0(n ). x However, the cost of bundle adjustment is usually much less than 0 ( A f n | ) . This is because the terms B y are zero unless point j is viewed in camera i. This means that each summation above involves only a constant number of 3D points for each camera. In this case, the complexity of sparse bundle adjustment is 0(mn@), where m is the number of residuals in each image. The best case complexity (given small m) is 0(n@), which is the cost Chapter 5. 3D Object Recognition and Reconstruction 97 0 (a) 3 (b) Figure 5.2: A simple bundle adjustment problem and the corresponding probabilistic graphical model. The unknown quantities to be inferred are the camera parameters Qj and the 3D points X j . The measured quantities are the projected feature positions of solving the linear system for the camera parameters. Hence the total computational cost for one step of sparse bundle adjustment is now 0 ( m n | ) , reduced from 0(MN ) 2 for naive solution of the normal equations of equation 5.14. Note that UQ <^ N since the number of camera parameters UQ is very much less than the number of structure parameters % (N = HQ + % ) . This is a very significant reduction in practice. For example, with 10 cameras (UQ = 70), and 1000 3D points {n = 3000), sparse bundle adjustment would be about (N/n ) 2 x e = (3070/70) « 2000 2 times faster than naive bundle adjustment. Furthermore, if a constant number of 3D points are imaged by each camera, the cost would be further reduced. 5.4.2 Graphical Model Interpretation The bundle adjustment process (joint estimation of camera and structure parameters from noisy image based measurements) can be described conveniently using a probabilistic graphical model (see figure 5.2). Each 3D point X j gives rise to a measurement rrijj in each camera j (with parameters 0 j ) that it is imaged. The joint probability of the measurements, cameras and structure (assuming uniform priors on 0 j , X j ) is Chapter 5. 3D Object Recognition and Reconstruction 98 given by p ( m , X , 0 ) = HI] piniijlQuXj) (5.33) ia jeX(i) If the measurement density is assumed to be Gaussian then p{mij\®i, Xj) = N(mij - uy; 0, a I) (5.34) 2 where a is the standard deviation of measurement noise in the image plane and uy is n the projection of 3D point Xj to image i u y = Ki(RiXj - U) (5.35) Taking the negative logarithm of equation 5.33 gives -logp(m|0,X) ex J2 -logW(my -Uy;0 r I) 2 )( (5.36) ieZ jeX{i) x £ £ l u - ulV^n m u (5-37) The maximum likelihood solution is 0 , X = argmine(0,X) = ^ ^ Im^ - m / / ^ (5.38) Hence, the optimal solution for camera and structure parameters (under the assumption of Gaussian noise) is obtained by solving a non-linear least squares problem (equation 5.38). In practice robustness is achieved by using Gaussian priors on the parameters 0, X and a robust Huber kernel is used instead of a Gaussian kernel in the measurement density of equation 5.34. 5.4.3 Sparsity in the Graphical Model Sparsity manifests itself in two ways in bundle adjustment problems. These can be seen clearly from the graphical model structure of the problem (figure 5.3). Firstly, the structure of the observation process implies fundamental independence relationships Chapter 5. 3D Object Recognition and Reconstruction 99 between cameras and 3D points. More precisely, the camera parameters are independent given the 3D points, and the 3D points are independent given the cameras. Secondly, many bundle adjustment problems are sparse in the sense that the unknown camera and structure parameters are loosely coupled. This leads to a reduction in the complexity of sparse bundle adjustment algorithms. The first level of sparseness arises because camera parameters are only coupled by the 3D points that they observe, and vice versa. A pair of cameras 0 ; , ®j are dependent if they view a common 3D point X&, or if there is sequence of cameras (viewing common 3D points), that connects them. For example in figure 5.3(a) , cameras O i and 02 are coupled because they share a common 3D point. However, cameras 0 i and © 3 are also coupled because camera 0 2 shares common 3D points with both. In other words, there is a path for a 'Bayes Ball' [Sha98] between 0 X and 0 , so they are not independent. The cameras can be made to be independent if all 3 of the structure parameters X are known (observed). Likewise, the 3D points can be made to be independent if all the camera parameters 0 are known. This is the basis for the sparse bundle adjustment algorithm of section 5.4.1. The second level of sparseness arises in certain kinds of bundle adjustment problems. Consider a sequence consisting of a few images of an object taken from similar viewpoints. In this case (almost) all of the 3D points will be visible in each of the cameras, and the graph has a fully connected structure (figure 5.3(b) ). This might occur in a short turntable sequence with a small range of viewpoints. Conversely, if the sequence consists of widely separated views which share few common points, the graph has a sparse structure (figure 5.3(a) ). The first case leads to the worst case complexity of sparse bundle adjustment of 0(Mn@) (where M is the number of measurements and rtQ is the number of camera parameters). The second case leads to the best case complexity of sparse bundle adjustment 0(n@). Note that in the sparsely connected case (figure 5.3) more general sparse matrix techniques [GL81] may be used in addition to the reduced camera/structure systems described in section 5.4.1. 5.5 Results Figure 5.4 shows typical operation of the object recognition algorithm. A set of images containing 2 objects and 6 distractor images was input. The algorithm detected 2 Chapter 5. 3D Object Recognition and Reconstruction Xi X2 01 X3 X4 © X5 2 Xe 100 X7 ©3 (b) Figure 5.3: Graphical models for (a) sparsely and (b) fully connected bundle adjustment problems. Note that in both cases the value of each unknown parameters X j , ®j is dependent on every other unknown parameter. However, if all the camera parameters 0 i are known (observed), the 3D structure parameters X j are independent, and vice versa. This is the basis for the sparse bundle adjustment method described in section 5.4.1. The best case complexity of this algorithm is 0(n@) i.e. cubic in the number of camera parameters (in the sparsely connected case (a)), and the worst case complexity is 0(Mn@) (in the fully connected case (b)). The number of camera parameters is UQ and the number of measurements is \M. Chapter 5. 3D Object Recognition and Reconstruction 101 C IIHHBKSiirAW (a) (b) (c) i >• Input images Output 3D model 1 - Tiger Output 3D model 2 - Rose Garden Figure 5.4: Fully automatic object recognition and 3D reconstruction. Note that despite the incorrect ordering, rotation and scale changes, and distractor images in the input, the system is able to successfully recognise the two consistent objects and perform 3D reconstruction. The Tiger sequence consisted of 13 images and yielded a 3D model with 675 points. The Rose Garden sequence consisted of 11 images and the 3D model contained 1351 points. The whole process of feature matching, image matching and bundle adjustment took a total of 556 seconds, of which 230 seconds were spent during bundle adjustment. The tests were run using a M A T L A B implementation on a 2.8GHz Pentium processor. Chapter 5. 3D Object Recognition and Reconstruction (b) 102 Output 3D model Figure 5.5: Fully automatic structure and motion (SAM) estimation for a 3 x 3 array of cameras. Note that the relative camera positions have been recovered correctly. In this example the input images were 800 x 600, and a 3D model of 1684 points was computed. The total computation time for feature extraction and matching, image matching and bundle adjustment was 293 seconds. Chapter 5. 3D Object Recognition and Reconstruction 103 Algorithm: 3D Object Recognition/Reconstruction Input: n unordered images I. Extract SIFT features from all n images II. Find k nearest-neighbours for each feature using a k-d tree III. For each image: (i) Select m candidate matching images (with the maximum number of feature matches to this image) (ii) Find geometrically consistent feature matches using R A N S A C to solve for the fundamental matrix between pairs of images (hi) (Future work) Verify image matches using a probabilistic model IV. Find connected components of image matches V . For each connected component: (i) Perform sparse bundle adjustment to solve for the rotation 9\, 9 , 8 , translation ti,t , t and focal length / of all cameras, and pointwise 3D geometry 2 2 3 3 (ii) (Future Work) Compute dense depth estimates, triangulate, texture map etc. Output: 3D model(s) connected components of image matches and 6 unmatched images, and output the 2 reconstructed 3D models. The complete algorithm ran i n 556 seconds on a 2.8GHz P C . About half of the computation time was spent in bundle adjustment. Another example of fully automatic structure and motion estimation is given in figure 5.5. 5.6 Summary We have presented a fully automatic 3D object recognition and reconstruction system. Our system starts by extracting SIFT features from a collection of images, and recognises 3D objects/scenes as geometrically consistent sets of feature matches. We perform bundle adjustment for metric structure directly, without the initial projective Chapter 5. 3D Object Recognition and Reconstruction 104 reconstruction common to other approaches. We have found that initialising each new camera with the same parameters as the best matching camera gives no problems with convergence. Future Work We close the chapter by noting some possible areas for future work development of the rigid object recognition/reconstruction system. We leave discussion of more general image matching and object class recognition for the final conclusions (chapter 6). Probabilistic Model for Image Match Verification A n important part of the object recognition system is the ability to identify whether a given set of feature matches represents a correct or incorrect image match. Currently this has been implemented using the feature space test of 2.7.1 and simply thresholding on the number of inliers to the fundamental matrix F between each image pair. A n important area for future work would be to develop are more principled model for verifying correct image matches. One possibility would be to extend the model of 3.3.2 to include parallax. Multi-View Tensors In this work we have used pairwise geometric constraints to reject outliers. These constraints are fairly weak because corresponding points may lie anywhere along an epipolar line. This gives the potential for introducing outliers with unrealistic (e.g. negative) depths. Such outliers could be identified by using 3 view matching constraints. For example, one could robustly estimate the trifocal tensor [Har96] using R A N S A C , and reject matches which are inconsistent between the 3 views. Multiple Rigid Motions A straightforward extension of the ideas presented in this chapter would be to associate more than one set of camera parameters with each image to enable multiple independently moving rigid objects to be discovered in the images. This would require a model selection/cross validation stage to determine how many objects are present in each image. Photorealistic 3D Modelling The algorithm presented in this chapter gives a method for automatic camera calibration from unordered image datasets. Given calibrated cameras, various methods could be applied to generate photorealistic 3D Chapter 5. 3D Object Recognition and Reconstruction 105 models. A traditional approach would be to perform dense stereo and extract a triangle mesh [CSG03]. Another method would be to represent the scene as a voxel grid which is coloured using an algorithm such as [SD99]. These representations have depth ambiguities for areas which are viewed in only one camera, or areas that have constant intensity [BSK01]. One way to deal with such ambiguities is to adopt an image based rendering approach, using priors based on the image statistics [FWZ03]. 106 Chapter 6 Conclusions This thesis has presented a practical approach to multi-image matching for the automatic discovery and reconstruction of image panoramas and 3D models from image datasets. We have proposed an invariant feature based approach, using geometric constraints to find structure in an unordered dataset. There are several advantages to such an approach. Firstly, by organising the feature database as a tree structure one can reduce the complexity of matching n images from 0{n ) for naive pairwise match2 ing to 0(n log n). Secondly, the geometric constraints for multiple views are stronger than their pairwise counterparts, which allows more incorrect matches to be rejected. Finally, we can exploit the probabilistic nature of an n image matching problem by using known incorrect matches in data driven classifiers. The major contribution of this work has been the development of a novel image stitching algorithm, that can automatically recognise and stitch high quality image panoramas from unsorted image datasets. We have also described a new type of invariant feature - Multi-Scale Oriented Patches (MOPS) that are especially suited for this purpose. We have developed a framework for evaluation of multi-image matching methods and used it to test the performance and tune the parameters of our image stitching algorithms. Finally, we have also shown how our framework can also be applied to the problem of structure and motion recovery in unordered datasets. Future Directions We conclude by identifying some avenues for future exploration: Partially Invariant Features The results of section 2.9.1 demonstrate the pitfalls of adding too much invariance to feature descriptors. One major improvement to feature matching methods would be to use feature descriptors that are partially invariant under some transformations. For many parameters such as affine stretch, full invariance is not really desirable, and we would rather specify a prior Chapter 6. Conclusions 107 probability distribution on those parameters. For example, very large or small stretch factors would be very unlikely, and difficult to match due to sampling issues. Whilst it would be straightforward to add priors to the Lucas-Kanade refinement framework, performing indexing with partial invariance would be more of a challenge. Computational Photography In chapters 3 and 5 we discussed automatic image stitching and.3D modelling from multiple views. These problems can be thought of within the general framework of computational photography, where one attempts to augment the capabilities of a digital imaging device using computational processing. In the context of stitching and 3D modelling the most obvious application is novel view synthesis - generating virtual camera views with arbitrary positions/orientations from a collection of images. However, one could also generate novel views which have virtual exposures [DM97] or digital refocussing [Ng05]. In the future, there may be no such thing as a 'bad' photograph, because the shooting conditions will be completely reconfigurable after the event has been recorded. A n even bigger challenge is to capture and resynthesise dynamic scenes [AZP+05]. Non-Rigid Object Recognition In chapter 5 we describe a system for recognising rigid objects in an unordered dataset. A straightforward extension of this work would be to relax the global geometric constraints and instead search for sets of feature matches that are consistent with local geometric constraints. This would enable multi-image matching of non-rigid objects or scenes. Object Class Recognition Recent results in neuroscience [IUMH00] and machine learning [FPZ03] have shown that both humans and machines can detect classes of objects. Progress in this area will require flexible models for correspondence with similarity metrics linked to human perception. Image Searching and Sorting W i t h the explosion of digital photography, our ability to understand images has not kept pace with our ability to generate them. In the future, algorithms for searching and sorting in image databases will become as fundamental as those for searching for text on the World-Wide Web. Such algorithms could borrow from the ideas of Brin and Page[PBMW98], using feature matches as the visual analogue of text hyperlinks on the web. 108 Bibliography [2D3] 2d3. http://www.2d3.com. [ADA+04] A . Agarwala, M . Dontcheva, M . Agarwala, S. Drucker, A . Colburn, B. Curless, D. Salesin, and M . Cohen. Interactive digital photomontage. In ACM Transactions on Graphics (SIGGRAPH'04), 2004. [Ana89] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:283-310, 1989. [AZP+05] A . Agarwala, C. Zheng, C. Pal, M . Agrawala, M . Cohen, B. Curless, D. Salesin, and R. Szeliski. Panoramic video textures. actions on Graphics (SIGGRAPH'05), In ACM Trans- 2005. [BA83] P. Burt and E. Adelson. A multiresolution spline with application to image mosaics. ACM Transactions on Graphics, 2(4):217-236, 1983. [BAHH92] J. Bergen, P. Anandan, K . Hanna, and R. Hingorani. Hierachical modelbased motion estimation. In Proceedings of the 2nd European Conference on Computer Vision (ECCV92), pages 237-252. Springer-Verlag, May 1992. [BauOO] A . Baumberg. Reliable feature matching across widely separated views. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR00), pages 774-781, 2000. [BBZ96] B . Bascle, A . Blake, and A . Zisserman. Motion deblurring and superresolution from and image sequence. In Proceedings of the 4th Euro- pean Conference on Computer Vision (ECCV96), pages 312-320. SpringerVerlag, 1996. Bibliography 109 [BI98] A . Blake and M . Isard. Active Contours. Springer-Verlag, 1998. [Bis95] C. Bishop. Neural Networks for Pattern Recognition. Clarendon, Oxford, U K , 1995. [BK01] R. Benosman and S. Kang. Panoramic Vision: Sensors, Theory and Applicatons. Springer, New York, 2001. [BL97] J. Beis and D. Lowe. Shape indexing using approximate nearest-neighbor search in high-dimensional spaces. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR97), pages 1000-1006, 1997. [BL02] M . Brown and D. Lowe. Invariant features from interest point groups. In Proceedings of the 13th British Machine Vision Conference (BMVC02), pages 253-262, Cardiff, 2002. [BL03] M . Brown and D. Lowe. Recognising panoramas. In Proceedings of the 9th International Conference on Computer Vision (ICCV03), volume 2, pages 1218-1225, Nice, October 2003. [BL05] M . Brown and D. Lowe. Unsupervised 3D object recognition and reconstruction in unordered datasets. In 5th International Conference on 3D Imaging and Modelling (3DIM05), 2005. [BM04] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework: Part 1: The quantity approximated, the warp update rule, and the gradient descent approximation. International Journal of Computer Vision, 56(3):221-255, March 2004. [Bro58] D. Brown. A solution to the general problem of multiple station analytical stereotriangulation. Technical report, Patrick Airforce Base, 1958. [Bro71] D. Brown. Close-range camera calibration. Photogrammetric Engineering, 37(8):855-866, 1971. [BSK01] S. Baker, T. Sim, and T. Kanade. A characterization of inherent stereo ambiguities. In Proceedings of the 8th International Conference on Computer Vision (ICCV01), volume 1, pages 428-435, Vancouver, July 2001. Bibliography 110 [BSW05] M . Brown, R. Szeliski, and S. Winder. Multi-image matching using multiscale oriented patches. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR05), San Diego, June 2005. [BTZ96] P. Beardsley, P. Torr, and A . Zisserman. 3D model acquisition from extended image sequences. In Proceedings of the 4th European Conference on Computer Vision (ECCV96), volume II, pages 683-695, Cambridge, April 1996. Springer-Verlag. [Can86] J . Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679-698, 1986. [Che95] S. Chen. QuickTime V R - A n image-based approach to virtual environment navigation. In SIGGRAPH'95, volume 29, pages 29-38, 1995. [CJ03] G . Carneiro and A. Jepson. Multi-scale local phase-based features. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR03), 2003. [CRB99] R. Cipolla, D. Robertson, and E . Boyer. 3D models of architectural scenes from uncalibrated images. In IEEE International Conference on Multimedia Computing and Systems, 1999. [CSG03] T. Tuytelaars C. Strecha and L. Van Gool. Dense matching of multiple wide-baseline views. In Proceedings of the 9th International Conference on Computer Vision (ICCV03), 2003. [CZ98] D. Capel and A . Zisserman. Automated mosaicing with super-resolution zoom. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR98), pages 885-891, June 1998. [Dav98] J . Davis. Mosaics of scenes with moving objects. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR98), pages 354-360, 1998. [DBdFF02] P. Duygulu, K . Barnard, N . de Freitas, and D. Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabu- Bibliography 111 lary. In Proceedings of the 7th European Conference on Computer Vision (ECCV02), pages 97-112, 2002. [DC99] T. Drummond and R. Cipolla. Real-time tracking of complex structures with on-line camera calibration. In Proceedings of the 10th British Machine Vision Conference (BMVC99), pages 574-583, Nottingham, 1999. [DM97] P. Debevec and J. Malik. Recovering high dynamic range radiance maps from photographs. Computer Graphics, 31:369-378, 1997. [DSTT00] F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Structure from motion without correspondence. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR00), pages 557-564, June 2000. [Fau93] O. Faugeras. Three-Dimensional Computer Vision - A Geometric Viewpoint. M I T press, 1993. [FB81] M . Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with application to image analysis and automated cartography. Communications of the ACM, 24:381-395, 1981. [For86] W . Forstner. A feature-based correspondence algorithm for image matching. Intl. Arch. Photogrammetry & Remote Sensing, 26(3):150-166, 1986. [FPZ03] R. Fergus, P. Perona, and A . Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR03), 2003. [FS03] J. Flusser and T. Suk. Construction of complete and independent systems of rotation moment invariants. In Computer Analysis of Images and Patterns: 10th International Conference (CAIP2003), pages 41-48, Groningen, Netherlands, 2003. [FWZ03] A . Fitzgibbon, Y . Wexler, and A . Zisserman. Image-based rendering using image-based priors. In Proceedings of the 9th International Conference on Computer Vision (ICCV03), 2003. Bibliography 112 [FZ03] W . Freeman and H . Zhang. Shape-time photography. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR03), volume 2, pages 151-157, June 2003. [GL81] J. George and J. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, 1981. [GM00] A . Gray and A . Moore. 'N-Body' problems in statistical learning. In NIPS00, pages 521-527, 2000. [Har92] C. Harris. Geometry from visual motion. In A. Blake and A. Yuille, editors, Active Vision, pages 263-284. M I T Press, 1992. [Har94] R. Hartley. Self-calibration from multiple views with a rotating cam- era. In Proceedings of the 3rd European Conference on Computer Vision (ECCV94), pages 471-478, Stockholm, 1994. [Har96] R. Hartley. Lines and points in three views and the trifocal tensor. International Journal of Computer Vision, 22(2):125-140, 1996. [HS03] A . Hertzmann and S. Seitz. Shape and materials by example: A photometric stereo approach. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR03), volume 1, pages 533-540, Madison, WI, June 2003. [HW62] D. Hubel and T. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106-154, 1962. [HZ04] R. Hartley and A . Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004. [IA99] M . Irani and P. Anandan. About direct methods. In B . Triggs, A . Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, number 1883 in L N C S , pages 267-277. Springer-Verlag, Corfu, Greece, September 1999. Bibliography 113 [IUMHOO] A . Ishai, L . Ungerleider, A . Martin, and J. Haxby. The representation of objects in the human occipital and temporal cortex. Journal of Cognitive Neuroscience, 12(2):35-51, 2000. [KSU03] S. Kang, R. Szeliski, and M . Uyttendaele. Seamless stitching using multiperspective plane sweep. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR03), Madison, WI, June 2003. [KW99] S. Kang and R. Weiss. Characterization of errors in compositing panoramic images. Computer Vision and Image Understanding, 73(2):269-280, February 1999. [KZB04] T. Kadir, A . Zisserman, and M . Brady. A n affine invariant salient region detector. In Proceedings of the 8th European Conference on Computer Vision (ECCV04), pages 404-416, 2004. [Lin98] T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2):79-116, 1998. [LK81] B. Lucas and T. Kanade. A n iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, pages 674-679, 1981. [Low99] D. Lowe. Object recognition from local scale-invariant features. In Proceedings of the 7th International Conference on Computer Vision (ICCV99), pages 1150-1157, Corfu, Greece, September 1999. [Low04] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004. [LS05] A . Litvinov and Y . Schechner. Adressing radiometric nonidealities: A unified framework. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR05), volume 2, pages 52-59, 2005. [MCUP02] J. Matas, O. Chum, M . Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the 13th British Machine Vision Conference (BMVC02), 2002. [Mee90] J. Meehan. Panoramic Photography. Amphoto Books, September 1990. Bibliography 114 [Mil75] D. Milgram. Computer methods for creating photomosaics. IEEE Transactions on Computers, C-24(ll):1113-1119, November 1975. [MJ02] P. McLauchlan and A . Jaenicke. Image mosaicing using sequential bundle adjustment. Image and Vision Computing, 20(9-10):751-759, August 2002. [MP79] D. Marr and T. Poggio. A computational theory of human stereo vision. Proceedings of the Royal Society of London, 13204:301-328, 1979. [MS02] K . Mikolajczyk and C. Schmid. A n affine invariant interest point detector. In Proceedings of the 7th European Conference on Computer Vision (ECCV02), 2002. [MS03] K . Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR03), 2003. [MSF] Microsoft Digital Image Pro. http://www.microsoft.com/products/imaging. [MZ92] J. Mundy and A. Zisserman, editors. Geometrical Invariance in Computer Vision. M I T Press, 1992. [Ng05] R. Ng. Fourier slice photography. (SIGGRAPH'05), In A CM Transactions on Graphics 2005. To appear. [NN97] S. Nene and S. Nayar. A simple algorithm for nearest neighbour search in high dimensions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(9):989-1003, September 1997. [OTdF+04] K . Okuma, A . Taleghani, N . de Freitas, J. Little, and D. Lowe. A boosted particle filter: Multitarget detection and tracking. In Proceedings of the 8th European Conference on Computer Vision (ECCV04), number 3021 in L N C S , pages 28-39. Springer-Verlag, 2004. [PBMW98] L . Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998. Bibliography 115 [PKG99] M . Pollefeys, R. Koch, and L. Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters. International Journal of Computer Vision, 1999. [PolOO] M . Pollefeys. 3D modelling from images. In Tutorial organised in conjunction with ECCV 2000, Dublin, June 2000. Tutorial notes. [Ray98] K . Rayner. Eye movements in reading and information processing: 20 years of research. Pyschological Bulletin, 124(3) :372-422, 1998. [RC02] C. Rother and S. Carlsson. Linear multi view reconstruction and camera recovery using a reference plane. International Journal of Computer Vision, 49(2/3): 117-141, 2002. [REA] Realviz. http://www.realviz.com. [RLSP03] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce. 3D object modelling and recognition using affine-invariant patches and multi-view spatial constraints. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR03), pages 272-277, Madison, W I , June 2003. [RZFM92] C. Rothwell, A . Zisserman, D. Forsyth, and J. Mundy. Canonical frames for planar object recognition. In Proceedings of the 2nd European Conference on Computer Vision (ECCV92), pages 757-772, 1992. [RZFM95] C. Rothwell, A . Zisserman, D. Forsyth, and J. Mundy. Planar object recognition using projective shape representation. In International Journal of Computer Vision, number 16, pages 57-99, 1995. [Saw94] H . Sawhney. 3D geometry from planar parallax. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR94), pages 929-934, Seattle, W A , 1994. [SD99] S. Seitz and C. Dyer. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2), 1999. [SF95] E . Simoncelli and W . Freeman. The steerable pyramid: A flexible architecture for multi-scale derivative computation. In International Conference Bibliography 116 on Image Processing, volume 3, pages 444-447, 23-26 Oct. 1995, Washington, D C , USA, 1995. [Sha98] R. Shachter. Bayes-ball: The rational pastime for determining irrelevance and requisite information in belief networks and influence diagrams. In G. Cooper and S. Moral, editors, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 480-487, San Francisco, C A , 1998. Morgan Kaufmann. [SHS+] H . Seetzen, W . Heidrich, W . Stuerzlinger, G . Ward, L. Whitehead, M . Trentacoste, A . Ghosh, and A . Vorozcovs. High dynamic range display systems. In ACM Transactions on Graphics (SIGGRAPH'04)- [SK93] R. Szeliski and S. Kang. Recovering 3D shape and motion from image streams using non-linear least squares. Technical Report C R L 93/3, Digital Equipment Corporation, Cambridge Research Laboratory, March 1993. [SK95] R. Szeliski and S. Kang. Direct methods for visual scene reconstruction. In IEEE Workshop on Representations of Visual Scenes, pages 26-33, Cambridge, M A , 1995. [SK99] H . Sawhney and R. Kumar. True multi-image alignment and its application to mosaicing and lens distortion correction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(3):235-243, 1999. [Sla80] C. Slama, editor. Manual of Photogrammetry. American Society of Photogrammetry, Falls Church, Virginia, fourth edition edition, 1980. [SLDV96] P. Simard, Y . LeCun, J. Denker, and B . Victorri. Transformation invariance in pattern recognition-tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239-27, 1996. [SM97] C. Schmid and R. Mohr. Local gray value invariants for image re- trieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530-535, May 1997. [SMB98] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. In Proceedings of the 6th International Conference on Computer Vision (ICCV98), pages 230-235, Bombay, 1998. Bibliography 117 [SS97] R. Szeliski and H . Shum. Creating full view panoramic image mosaics and environment maps. Computer Graphics (SIGGRAPH'97), 31(Annual Conference Series) :251-258, 1997. [SSOO] H . Shum and R. Szeliski. Construction of panoramic mosaics with global and local alignment. International Journal of Computer Vision, 36(2): 101130, February 2000. [SS02] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense twoframe stereo correspondence algorithms. International Journal of Computer Vision, 47(l):7-42, May 2002. [ST94] J. Shi and C. Tomasi. Good features to track. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR94), Seattle, June 1994. [Ste95] G . Stein. Accurate internal camera calibration using rotation, with analysis of sources of error. In Proceedings of the 5th International Conference on Computer Vision (ICCV95), pages 230-236, 1995. [SVD03] G . Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter sensitive hashing. In Proceedings of the 9th International Conference on Computer Vision (ICCV03), pages 750-757, Nice, October 2003. [SZ02] F. Schaffalitzky and A . Zisserman. Multi-view matching for unordered image sets, or "How do I organise my holiday snaps?". In Proceedings of the 7th European Conference on Computer Vision (ECCV02), pages 414-431, 2002. [SZ03] J. Sivic and A . Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the 9th International Conference on Computer Vision (ICCV03), October 2003. [Sze04] R. Szeliski. Image alignment and stitching: A tutorial. Technical Report MSR-TR-2004-92, Microsoft Research, December 2004. [Tan97] K . Tanaka. Mecahnisms of visual object recognition: Monkey and human studies. Current Opinion in Neurobiology, (7):523-529, 1997. Bibliography [TB01] K . Toyama and A . Blake. 118 Probabilistic tracking in a metric space. In Proceedings of the 8th International Conference on Computer Vision (ICCV01), volume 2, pages 50-57, 2001. [TG00] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely invariant regions. In Proceedings of the 11th British Machine Vision Conference (BMVC00), pages 412-422, Bristol, U K , 2000. [TM97] P. Torr and D. Murray. The development and comparison of robust methods for estimating the fundamental matrix. International Journal of Computer Vision, 24(3):271-300, 1997. [TMHF99] W. Triggs, P. McLauchlan, R. Hartley, and A . Fitzgibbon. Bundle adjustment: A modern synthesis. In Vision Algorithms: Theory and Practice, number 1883 in L N C S , pages 298-373. Springer-Verlag, Corfu, Greece, September 1999. [Tor02] P. Torr. Bayesian model estimation and selection for epipolar geometry and generic manifold fitting. International Journal of Computer Vision, 50(1):35-61, 2002. [Tri04] B . Triggs. Detecting keypoints with stable position, orientation, and scale under illumination changes. In Proceedings of the 8th European Conference on Computer Vision (ECCV04), volume 4, pages 100-113, Prague, May 2004. [TZ99] P. Torr and A . Zisserman. Feature based methods for structure and motion estimation. In B. Triggs, A . Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, number 1883 in L N C S , pages 278-294. Springer-Verlag, Corfu, Greece, September 1999. [UES01] M . Uyttendaele, A . Eden, and R. Szeliski. Eliminating ghosting and exposure artifacts in image mosaics. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR01), volume 2, pages 509-516, Kauai, Hawaii, December 2001. Bibliography 119 [VJ01] P. Viola and M . Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the Interational Conference on Computer Vision and Pattern Recognition (CVPR01), December 2001. [ZFD97] I. Zoghlami, 0 . Faugeras, and R. Deriche. Using geometric corners to build a 2D mosaic from a set of images. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, Puerto Rico. I E E E , June 1997. 120 Appendix A Multiple View Geometry A.l Camera Models A camera induces a mapping between the 3-dimensional world and the 2-dimensional image. In order to form invariants that describe the images of points in the world, we must understand this mapping. In the next sections we describe pinhole, orthographic, projective and affine camera models, and characterise their properties when viewing a plane. For a more detailed presentation of these results, see [HZ04]. A. 1.1 Pinhole Camera In its simplest form, a camera is a device which interprets 3-dimensional points in the world (R ) as points of a 2-dimensional projective space (P ). 3 2 x = X where X = [X,Y,Z] c (A.l) c is the position in camera coordinates, and x = s[x,y, 1] is the homogeneous image position. Hence, the actual image coordinates x = [x, y] are given by X 'x/z y. Y/Z (A.2) See figure A . l . This is the basic pinhole camera model. Assuming a rigid body transformation X = R X +1 between world coordinates X and camera coordinates X , and c a linear transformation in the image plane u = K x , we have c Appendix A. Multiple View Geometry 121 Figure A . l : The pinhole camera interprets points in R as points of P . Points X that lie on a line through the optical centre are equivalent in P as they map to the same point x on the plane Z — 1. 3 2 c 2 u = K ( R X +1) Ui s u 2 = _1_ (A.3) fl 0 Wl 0 h ^2 0 0 ru 0 ri2 ri3 ti r r3 t 22 0 1 _ J31 r 2 3 2 x x 2 2 (A.4) 3 r3 *3. 3 1 where R , t are the rotation and translation of the camera (extrinsic parameters) and K is the calibration matrix (intrinsic parameters). The actual image position is u = [1*1,112] and the world position is X = [Xi,X ,X ]. 2 3 Here, we parameterise K by focal lengths fi, f , and the position of the principal point u , u - Parameterisation of R 2 l o 2o is discussed in appendix A.7. A. 1.2 Projective Camera Sometimes it is convenient to use a linear model for the mapping between homogeneous world and image coordinates. This can be accomplished by relaxing the non-linear constraints in equation A.3 (e.g. that R R = I) T Appendix A. Multiple View Geometry 122 Figure A.2: The orthographic camera projects points X parallel to the Z axis (with scaling in the image plane based on the average depth). Points X that lie on a line perpendicular to the plane Z = 1 project to the same point x. c c u = PX = s _1_ (A.5) Pu Pl2 Pl3 PlA P21 P22 P23 P2A _P31 P32 P33 £>34 X 2 (A.6) 1 where P is an arbitrary 3 x 4 matrix. This is known as a projective camera. This enables closed form linear solutions for the camera parameters given image and 3D correspondences [HZ04]. Projective cameras have the property that straight lines in the world map to straight lines in the image. This is also known as rectilinear projection in photography. A. 1.3 Orthographic Camera A simplified camera model assumes that the camera depth Z = Zo + AZ and the depth variation AZ is small. Hence X y. 'X/Zo Y/Z _ (A.7) 0 This is the basic orthographic camera model (figure A.2). In this model world points Appendix A. Multiple View Geometry 123 axe projected perpendicular to the image plane, and then scaled based on their average depth. A. 1.4 Affine Camera Substituting for the rigid body and linear transforms, and linearising the result gives u= P X (A.8) a Pll Mi = s 1 Pl2 Pl3 Pu (A.9) P21 P22 P23 P24, 0 1 0 0 X3 1 which gives a linear relationship between the ordinary (non-homogeneous) world and image coordinates. This is known as an affine camera. A.2 Viewing a Plane Given the above camera models, we now wish to characterise the geometric invariants of the imaging process. Unfortunately, the geometric relationships between corresponding image regions is complex, depending on the unknown 3D structure. One way to proceed is to linearise this structure, which is equivalent to viewing a planar region. In this section we describe the mappings between images of planes given the camera models developed above. A. 2.1 Homography Consider viewing a plane with a projective camera. Without loss of generality, assume that this is the plane X = 0. Then 3 Ui Pll = s _1_ P12 Pu P21 P22 P21 x2 _1 _ (A.10) Appendix A. Multiple View Geometry 124 or u = HX„ (A.ll) Ui = H j X p (A.12) u = H X P (A.13) Ui = H i U 2 (A.14) If two cameras view the same plane 2 2 so 2 where H12 = H i H . Writing this in terms of image coordinates 1 2 hi2 U12 = hi3 U21 hi h-22 h 3 U22 hi h2 2 1 _ 2 3 h 3 3 (A.15) 1 So views of a plane with projective cameras are related by a matrix multiplication in homogeneous coordinates, known as a homography. Since the homography has the property that straight lines map to straight lines, it is also known as a collineation [Fau93]. A homography has 8 degrees of freedom corresponding to translation (2), rotation, scale, shear (2), and perspective (2) (characterised by the horizon line). A.2.2 Affine Transformation Similarly, when viewing the plane X = 0 with an affine camera 3 = 1 Pll Pl2 Pl4 P21 0 P22 P24 0 1 x 2 1 (A.16) Appendix A. Multiple View Geometry 125 So views are related by iii = A i u 2 where A i (A.17) 2 = A i A ^ . Writing in terms of image coordinates 1 2 an s U12 = «13 a-12 Ul 2 1 0 1 0 (A. 18) U22 0-22 1 Hence views of a plane with affine cameras are related by a linear transformation in (non-homogeneous) image coordinates, called a 2D affine transformation. A n affine transformation differs from a homography by the fact that parallel lines are preserved. This is easily seen from the fact that points at infinity u = (ui,tt ,0) (where parallel 2 lines intersect) remain at infinity under an affinity, but not a homography. A n affine transform has 6 degrees of freedom corresponding to translation (2), rotation, scale and shear (2). A.2.3 Similarity Transformation A further simplification can be made if there is no rotation in depth of the plane relative to the cameras Uu s Ul2 1 = acosi9 —asini9 t\ U2l asim9 a cos 0 t U22 0 0 1 2 (A.19) 1 This is known as a similarity transform. The similarity transform has four degrees of freedom corresponding to translation (2), rotation and scale. A.3 Hierarchy of 2D Transformations The formulas for homographies, affinities and similarities derived above no longer depend on the 3D geometry, so planar regions differ only by a planar projective transformation. Some examples of these invariants under these transformations are given below Appendix Group A. Multiple View Geometry Example Projection matrix Pu Projective - Pl2 PU P21 P22 a P24 J>31 P32 P34 Pii Pl2 Pl4 P21 P22 P24 _ 0 0 1 Affine a cos 9 —asm9 Similarity asm. 9 a cos 9 0 0 cos# — sin# Euclidean sin# cos# 0 0 126 Invariants cross ratio, intersection + parallel lines, ratio of areas + ratio of lengths, angles + length, area One approach to matching and object recognition is to characterise objects by geometric invariants. For example, one might compute the cross ratios of colinear points and use this for indexing. One can also compute invariants from the local brightness structure. For example, the integral of intensity around a circle would be invariant under rotations about that point. A.4 Canonical Frames A desirable characteristic for the set of invariants chosen is that they should be complete and independent [FS03]. In other words, one should be able to exactly reproduce the region of interest given the invariant descriptor and the unknown transformation parameters (completeness). Also, the number of elements of the descriptor vector should be no greater than necessary to reproduce the original data (independence). One way to do this is to warp corresponding image regions to a canonical frame (figure A.3). Consider a reference frame with coordinates U ref = H U ref uf re defined by a homography H r e / (A. 20) Suppose that the coordinates u undergo a planar projective transformation u' = Hu. Appendix A. Multiple View Geometry 127 H canonical Figure A.3: A canonical reference frame transforms according to the same homography as the images i.e. u' — Hu =>• ^-'ref Hre/H = If the new reference frame - 1 also transforms according to H H' f re = H^H" H' ref (A.21) 1 then u r e / =H r e / u= (A.22) H' u' ref Then the corresponding feature points u, u' map to the same reference point u /. The re reference frame defined by H r e / is called a canonical frame. Note that though corre- sponding feature points are covariant (transform under H ) , the reference homography is contravariant (transforms under H A.5 _ 1 ). Multi-View Constraints In the next sections we describe the pairwise constraints that apply to panoramic image geometry and epipolar geometry. Information about 3 and 4 view constraints can be found in [HZ04]. Appendix A. Multiple View Geometry 128 Figure A.4: Panoramic Geometry. For a camera that rotates about it's optical centre, there is a one-to-one mapping between the points in the two images. A.5.1 Panoramic Geometry For images related by a rotation about the camera centre there is a one-to-one mapping between corresponding points. Consider two cameras centred at 0 viewing the point X xi = [ R i | 0 ] X = R i X (A.23) x = [R |0]X = R X (A.24) xi = R i R ^ x s (A.25) 2 2 2 Hence Substituting u = K x gives ux = H 1 2 u (A.26) 2 where H 1 2 = KIRJR^KJ 1 (A.27) is a special homography known as the homography of the plane at infinity. In the general case when the camera centres are distinct, the homography of the plane at infinity Appendix A. Multiple View Geometry 129 Figure A.5: The epipolar constraint expresses the fact that the two camera centres and the 3D point lie in the same plane (the epipolar plane). Hence the scalar triple product p f ( t x p ) is equal to zero. 2 describes the motion between points that are (infinitely) far from both camera centres (see appendix A.6). If the camera centres are coincident, this homography describes the motion of all points. The elements of H are typically linearised to facilitate linear solution by SVD. A.5.2 Epipolar geometry In the general case of moving cameras a point viewed in one image defines a line in the second, corresponding to a continuous range of possible depths for that point. Hence there is a point to line mapping between images. Consider a point X viewed in a pair of images ii = X C 1 X = R!X x = X C 2 X = R X 2 2 C l C 2 + ti (A.28) +t (A.29) 2 The homogeneous coordinates x are related to the normalised ray directions x via the unknown depths s Appendix A. Multiple View Geometry x X i = Si l 1 130 = SlXi (A.30) S2X2 (A.31) l 2| X x = s 2 2 1 For consistency of the 3D point position between the cameras R<iX + t i = R 2 X Cl +1 C 2 (A.32) 2 Substituting for the observed ray directions R i S1X1 + t i = s R x + 1 2 2 2 2 (A.33) this is of the form S1P1 + S2P2 = (A.34) t where p i , P 2 and t are 3-vectors as follows Pi = R i X ! (A.35) p = R x (A.36) 2 2 2 t = t - ti (A.37) 2 Equation A.34 gives 3 (scalar) equations in 2 unknowns. Eliminating the unknown depths s i , s gives 2 p f (t x p ) = 0 (A.38) 2 This expresses the fact that p i , p 2 and t all lie in the same plane, and hence their scalar triple product is equal to zero. Writing this out in terms of the camera extrinsic parameters xfRf[t 2 -ti] R2X x 2 - 0 (A.39) Appendix A. Multiple View Geometry 131 or x ^ E x 2 = 0 where E = Rnt -ti] R 2 x (A.40) 2 is a 3 x 3 matrix known as the essential matrix. Substituting u = K x gives ufK^RHta - t ] R K T u = 0 (A.41) 1 1 or u f F u 2 x 2 2 2 = 0 where F = Kr Rnt -t ] R Kj 1 2 1 x (A.42) 1 2 is known as the fundamental matrix. Again, the fundamental matrix is often linearised to facilitate linear solution via SVD. A.6 Plane (at Infinity) Plus Parallax A l l image motion can be decomposed as a homography of the plane at infinity (rotation between the cameras) and depth dependent motion towards the epipole (parallax). Adopting similar camera models as used previously iii = siui = K i X C l (A.43) u C 2 (A.44) 2 = s u = K X 2 2 2 where X = RiX C l + ti = R X 2 C 2 + t (A.45) 2 Assuming that K is upper triangular and scaled such that £33 = 1, s = Z , the unknown c point depth. Substituting for u RiK x ^ i i i i + ti = R K 2 1 2 s u 2 2 + t 2 (A.46) Appendix A. Multiple View Geometry 132 Hence - u = K R RiK]" ui + -KaRiXti - t ) (A.47) Si Si For points far from both cameras, Si and s are both large, and the image positions r 2 2 1 2 2 2 are related by a homography u = Hoofix (A.48) 2 where Hoo = K R f R i K f 1 2 (A.49) is known as the homography of the plane at infinity. As the depth of the point decreases, it moves in a line towards e = K R ( t i — t ) , which is the projection of the centre n 2 2 2 of camera 1 in camera 2, otherwise known as the epipole. Hence the motion of points may be written u = HooUx + Ae 2 (A.50) i.e. points transform under the homography of the plane at infinity (rotation between the cameras), plus depth dependent motion towards the epipole. Note that a similar result can be shown for an arbitrary plane H . This is known as plane plus parallax representation [Saw94]. A.7 Axis-Angle Rotations The axis-angle representation provides a representation for rotations that is a) minimal b) easily differentiable. The first point is important because some other rotation representations e.g. a 3 x 3 rotation matrix, can become non-Euclidean due to round off errors after manipulation. Consider a rotation of the vector x by an angle 9 about axis u , resulting in the vector x'. We begin with the identity x = u u x — u x (u x x) T (A.51) Appendix A. Multiple View Geometry 133 Figure A.6: The motion of point X between cameras can be expressed as a homography plus motion towards the epipole. If the point is far from the cameras, it's position is given by the homography of the plane at infinity. As the depth of the point Z decreases, it moves along a line in the image towards the epipole e. c See figure A.7(a) . Note that the vectors u uxx (A.52) ux (uxx) are orthogonal and —ux ( u x x ) is the component of x in the plane perpendicular to u. Under a rotation about an angle 6 about u, the component u u x is unchanged, but r the component — u x ( u x x ) becomes —ux ( u x x ) cos# + ( u x x ) sin# (A.53) Figure A.7(b) . Therefore the rotation of x by angle 6 about axis u is given by x' = u u x - u x (u x x) cos 9 + (u x x) sin 8 (A.54) x' = Rx (A.55) T Equivalently Appendix A. Multiple View Geometry 134 -ux ( u x x ) —ux ( u x x ) • ux(uxx) (a) (b) Figure A.7: Axis-angle rotations. The vector x is expressed as components parallel to u and in the plane perpendicular to u (a). Only the component perpendicular to u is affected by rotation about u (b). where R = uu + [u] sin(9 + (I - uu )cos# T T x (A.56) and [u] is the cross product matrix defined by x u v = 0 -0 03 0 — 02 01 3 02 (A.57) -0i 0 Rodriguez Equation It can be shown (by polynomial expansion of e®*) that e [0] x = e f (j _ 0 £ ) T + g i n + c o s |0| (A.58) Hence R = e^* (A.59) Appendix A. Multiple View Geometry 135 represents a rotation by an angle \9\ around the axis 9, where 9 = 9/\9\. Inverting Rodriguez Equation Consider the trace of R tr(R) = 3 - 201 + § + §j)(l - cos \9\) 2 2 (A.60) also V = R32 — R23 Rl3 — R-31 .R2I — R l 2 = 62 e 3 and therefore |0| = cos ! tr(R) - 1 9 = v/|v (A.62) (A.63) 136 Appendix B Image Datasets (a) (b) 13k* hia^f Elfin Green ** ! ^4 11 .i WSP (c) L C I Figure B . l : Example panorama sequences from the synthetic dataset. Each camera view is synthetically sampled from a previously stitched panorama. Synthetic sampling allows us to generate arbitrary image sequences whose camera matrices are known exactly. Appendix B. Image Datasets (a) Alaska (b) Room WW 137 — - J P T IpLll; Ijfej it^j| (c) Cedar Figure B.2: Example panorama sequences from the real dataset. The ground truth registrations shown here have been generated by manual stitching. Our complete test dataset consists of over 200 real image sequences containing I D (single row) and 2D (multi row) panoramas. Figure B.4: Matier dataset. Appendix B. Image Datasets 139 Figure B.6: Dash point dataset. Appendix H B. Image Datasets 141 1 • 4H, 1 i a Figure B.7: Times Square. This difficult stitching problem contains many moving objects and large changes in brightness between the images. Future automatic image stitchers could detect the moving objects, and compute high dynamic range radiance maps of the scene. This would enable the user to 're-photograph' the scene with different exposure settings and moving objects selected.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Multi-image matching using invariant features
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Multi-image matching using invariant features Brown, Matthew Alun 2005
pdf
Page Metadata
Item Metadata
Title | Multi-image matching using invariant features |
Creator |
Brown, Matthew Alun |
Date Issued | 2005 |
Description | This thesis concerns the problems of automatic image stitching and 3D modelling from multiple views. These are basic problems of computer vision, with applications in robotics, architecture, industrial inspection, surveillance, computer graphics and film. Recent work has brought increasing automation to these tasks, but despite a large amount of progress, state-of-the-art algorithms still require some form of user input or assumptions about the image sequence. For example, the best image stitchers currently require an ordered set of input images, or user input to identify the matching images, before automatic registration can proceed. In this work we show how such tasks can be performed automatically and without any user input at all. We formulate the multi-image matching problem as one of finding all matching images, subject to the constraint that they are consistent views from a perspective camera. We use invariant features as a mechanism for finding correspondences, and indexing techniques to efficiently find matches between multiple views. We then find all sets of geometrically consistent feature matches, using a probabilistic model for verification. This allows us to identify each object or scene in the dataset using only the structure already present in the data. The major contributions of this thesis are the development of a system that can automatically recognise and stitch 2D panoramas in unordered image datasets, and a new class of invariant features for this purpose. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2009-12-23 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051634 |
URI | http://hdl.handle.net/2429/17085 |
Degree |
Doctor of Philosophy - PhD |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2005-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-ubc_2005-104724.pdf [ 22.48MB ]
- Metadata
- JSON: 831-1.0051634.json
- JSON-LD: 831-1.0051634-ld.json
- RDF/XML (Pretty): 831-1.0051634-rdf.xml
- RDF/JSON: 831-1.0051634-rdf.json
- Turtle: 831-1.0051634-turtle.txt
- N-Triples: 831-1.0051634-rdf-ntriples.txt
- Original Record: 831-1.0051634-source.json
- Full Text
- 831-1.0051634-fulltext.txt
- Citation
- 831-1.0051634.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051634/manifest