The Use of Model-Guided Grouping in Model-Based Motion Tracking by Xun Li B.Sc, Tsinghua University, 1986 A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in The Faculty of Graduate Studies Department of Computer Science We accept this thesis as conforming to the required standard The University of British Columbia © X u n Li October 1991 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. The University of British Columbia Vancouver, Canada Abstract A general motion tracking system consists of three major parts: segmentation, correspondence and viewpoint verification. This thesis presents improved methods for solving the correspondence problem. Most previous approaches to solving the correspondence problem have used lower-level primitives such as points and lines as their matching tokens. It has been found that if the matching tokens, such as points and lines, are less distinctive, greater ambiguity in matching inevitably arises. The objective of this thesis is to develop a robust solution to the correspondence problem even when the frame-to-frame motion is relatively fast. A new approach called M o d e l - G u i d e d G r o u p i n g , which is used to derive intermediate-level structures as our matching tokens, has been introduced. The term Model-Guided comes from the fact that the groupings are guided and derived locally, with the contemporary use of model structures, around the predicted model during the object tracking. We choose junctions and parallel pairs as our matching tokens, thus the information coded in these structures is relatively invariant in consecutive frames. The matching strategy is coarse-to-fine, and partial matching will also be allowed when occlusions are present. The method for evaluation of probability of accidental match based on junction groupings will be discussed. Comparisons are made between the current system and Lowe's previous system. The results show that with a slight increase in computational cost, the use of the higher-level groupings as matching tokens leads to a much more robust model-based motion tracking system. K e y w o r d s : motion tracking, perceptual grouping, model-guided grouping, correspondence, model-based vision, segmentation, token matching, viewpoint verification. ii Contents Abstract ii Contents iii L i s t of T a b l e s v List of Figures vi Acknowledgement 1 2 Introduction 1 1.1 1.2 1.3 2 3 6 What is Our Problem? Motivation of the Thesis Thesis Organization ; Outline of Related Topics 2.1 2.2 2.3 3 viii 9 Motion Tracking System 2.1.1 Segmentation 2.1.2 Correspondence 2.1.3 Viewpoint Verification 2.1.4 Model-based Motion Tracking System , Previous Correspondence Algorithms for Motion Tracking 2.2.1 Gennery's Approach 2.2.2 Asada's Approach 2.2.3 Verghese's Approach 2.2.4 Lowe's Line-to-Line Tracking System 2.2.5 Tracking Line Segments The Use of Grouped Line Segments as Matching Tokens 9 9 11 12 13 14 14 15 15 16 16 17 Model-Guided Grouping 19 3.1 19 21 Perceptual Grouping 3.1.1 Grouping Based on Cotermination iii 3.2 3.3 3.4 4 38 4.1 4.2 38 39 39 40 42 4.5 4.6 Model-Based Motion Tracking as a Matching Problem Line to Line Matching 4.2.1 Psychological Experiments from Ullman 4.2.2 Lowe's Model-Based Motion Tracking System Matching candidates to target - a probability problem Evaluation of Probability of Accidental Match with Model-guided Junction Groupings Matching with Model-Guided Groupings 4.5.1 Algorithm I: Matching through Junctions 4.5.2 Algorithm II: Matching through Parallel Pairs Matching Strategy 43 45 45 46 46 Implementation Results 48 5.1 5.2 48 5.3 5.4 6 22 24 24 26 26 • • • 27 32 33 37 Matching with Model-Guided Groupings 4.3 4.4 5 3.1.2 Collinearity Grouping 3.1.3 Grouping Based on Parallelism Indexing Line Segments Searching Correspondence for Motion Tracking 3.3.1 Psychological Studies of Human Visual Perception 3.3.2 Constraints Reduce the Search Space Model-Guided Grouping 3.4.1 Model-Guided Junction Grouping 3.4.2 Model-Guided Parallel Grouping Model-guided Grouping Matching with Junction Grouping - Comparison with Lowe's Line-to-Line Matching Matching through Parallel Pairs Interface for Model-based Motion Tracking 49 54 54 Conclusions and Future Directions 57 6.1 Conclusions 57 6.2 Future Directions 58 A C a n n y a n d Deriche's E d g e D e t e c t o r 61 B Least-Square Line Fitting 65 C Finding Pseudo-Cotermination Point 67 References 69 iv List of Tables 4.1 5.1 5.2 5.3 The relative effect of orientation difference, length ratio, and distance. Adapted from Ullman's paper Comparisons of average compatible tokens found and computing time at each iteration between matching with junction groupings and line segments Comparisons of error rates between translational test cases for matching between junction groupings and line segments Comparisons between translational and rotational test cases for matching between junction groupings and line segments. v 40 53 53 53 List of Figures 1.1 1.2 1.3 3.1 3.2 3.3 Projected Model Edges (a) Initial prediction (in thick lines), (b) Incorrect matching, result due to incomplete match The framework of our motion tracking system. Model-Guided Groupings are incorporated as our matching tokens 3.8 Five nonaccidental relations and 3-D inference from image features In the real image, lines are terminate at a small common region. p e p is determined by minimization the sum of squares of perpendicular distance to each line Grouping based on collinearity (a) Parallel pair, (b) Salient parallel pair: h < min^li,^) Image is partitioned into m x m windows, each window is attached a bucket. Line segments are indexed into bucket arrays which are formed by dotted lines. . (a) We focus the search in the Local Focus Window, which is determined by the bounds calculated from superimposed model edges, (b) Compatible image edges will be derived after passing all the unary constraints Both angles for image edge and model edge are in the range of [0,360] degree. . . 3.9 (a) Distance is interpreted as perpendicular distance, (b) Distance is determined 3.4 3.5 3.6 3.7 from two endpoints (e.g. oa, ob, oc) 3.10 The angle between a pair of edge angles (a, 6), must lie within the range (&', a"), except (a', 6"), since counterclockwise relationship should be reserved 3.11 T junctions will be examined for potential connected constraints 3.12 We focus our searching around the model vertex, junction (thick lines) is formed around the model vertex by checking potentially connected constraints 4.1 3 4 7 20 22 23 23 24 25 29 30 30 31 34 35 Improper matches arise due to spurious data or inconsistent local structures. Thick lines are predicted model edges, thin lines are image edges. Ambiguities can be reduced by using groupings, (a) Correct matches should be between a, b, c and a,/3,7, instead of local best matches between a,c and a',c'. (b) Correct matches should be between a, 6 and a,(3 instead of local best match between b and a . 43 vi 4.2 5.1 5.2 5.3 Matching refinement with model-guided junction groupings. Superimposed model edges are projected from its prior estimated viewpoint, nearby are top ranked matching edges, heavy bars are the perpendicular errors to be minimized. Searching regions (circles) will be reduced in consecutive iterations (a)-(d) 47 Images and their segmentation results used for testing. . Interface for Model-based Motion Tracking XView: Notification-based System. Each panel button is registered with its own callback procedures 56 A. l Pipeline structure for implementation of Canny's filters 62 B. l Extraction of Line Segments by least-square line fitting 66 vii 51 55 Acknowledgement It has been my great pleasure to be supervised by Dr. David Lowe. Being an excellent researcher, David has used his insight to help identify crucial research problems. He is a constant source of inspiration and support, and his help is always direct and substantial. Throughout the time I worked with him, he repeatedly challenged my thinking, and always led me to a clearer understanding of the problems I was working on. I would not have survived without his patience and encouragement. I sincerely wish him the greatest success in his future academic pursuits. I would like to thank professor Alan Mack worth for his willingness to read the final draft. His valuable contributions improved the quality of this work. I would like to thank Dr. Jim Little for his encouragement during the time I was taking Computer Science 505. Finally, I would like to thank professor David Kirkpatrick, who provided useful theoretical papers concerning the bucket technique. Carol Whitehead, Theresa Fong, Grace Wolkosky, Gale Arndt, Sunnie Khuman, Evelyn Fong, Carlin Chao, Ming Koon Lau, Valerie McRae, and Deborah Wilson, also deserve a special note of thanks for their kindness and their superb assistance. The friendship between many graduate students made my stay at UBC a most enjoyable time, among them especially are: Yoko Ando, Jeff Beis, Li Li, Antonio Loureiro, Runping Qi, and Sijian Zhang. Crucial support was also provided by my wife, Lucy, with her endless love and her inimitable gift for inspiring instant cheerfulness. She has showed great faith and confidence in my thesis work. I want to thank my parents and sisters for their love and generosity. During the past many years, they have always supported and encouraged me. The generous and continuous financial support from IRIS research funding and the Computer Science Department is gratefully acknowledged. viii Chapter 1 Introduction The importance of motion information has been proved in many industrial applications. Tracking of moving objects from a series of images is a fundamental task in computer vision. A general motion tracking system consists of three major parts: segmentation, correspondence (local matching) and viewpoint verification (global matching). Correspondence is treated as a central and crucial part. However, most previous approaches to solving the correspondence problem have used lower-level primitives such as points [55] and lines [34, 14, 60] as their matching tokens. It has been found [60, 45] that if the matching tokens, such as points and line segments, are less distinctive, greater ambiguity in matching arises inevitably. The objective of this thesis is to develop a robust solution to the correspondence problem. We propose a new approach called M o d e l - G u i d e d G r o u p i n g , which is used to derive intermediatelevel structures as our matching tokens. The term Model-Guided comes from the fact that the groupings are guided and derived locally, with the contemporary use of model structures, around the predicted model during the object tracking. We choose junctions and parallel pairs as our matching tokens, thus the information coded in these structures is relatively invariant in consecutive frames. Once the matching tokens have been derived, the problem to be considered now is the pairing by 1 CHAPTER 1. INTRODUCTION 2 the correspondence process of tokens in one frame (represented by the model information) with tokens in the subsequent frame. The key issue here is that the correspondence between isolated token pairs is governed by a certain proximity and similarity metric, termed Correspondence Strength [58]. We also discuss how to formulate the measure of correspondence strength based on model-guided groupings in this thesis. Correspondence is performed locally, thus is referred as local matching. Local matching adopts a coarse-to-fine strategy, which is performed on the model-guided groupings. Partial matching will also be allowed when occlusions are present. Global matching is examined by viewpoint consistency, which is based on the Newton-Raphson iterative method [34]. Comparisons are made between current system and Lowe's previous system. The results show that with a slight increase in the computational cost, the use of the higher-level groupings as matching tokens leads to a much more robust model-based motion tracking system. 1.1 What is Our Problem? Given a relatively dense sequence of images of a known 3 D object which is moving arbitrarily, we describe our problem for model-based motion tracking as follows: Given an initial viewpoint estimate (the problems of initially locating the object or calculating the initial viewpoint will not be considered, we focus on the dynamic aspects of 3D modelbased tracking), we assume a single object is moving arbitrarily in space, permitting 6 degrees of freedom (in fact^ 3 rotational, 3 translational parameters and 1 camera focal length). Our objective is to continuously tracking the moving object by updating 7 parameter values needed to keep the projection of the 3D object in correspondence with the dynamic images. In this thesis, the test object to be tracked is an articulated box, which is represented by a CHAPTER 1. INTRODUCTION 3 wireframe model (Figure 1.1), and we start our tracking with a known correspondence position. The use of a polyhedral model of the 3D object we are tracking distinguishes our problem from those works which are not model-based. The relatively dense sequence assumption enables us to take advantage of the frame-to-frame coherence between a projection of the model in one image and its slightly differing projection in the successive image frame. Figure 1.1: Projected Model Edges Correspondence is the process that identities elements in different views as representing the same object at different times, thereby maintaining the perceptual identity of objects in motion. Two important issues concerning motion correspondence algorithms will be discussed: 1. What kind of tokens (primitives) will be used in our matching process? How to derive these tokens? 2. How should correspondence strength of these tokens (primitives) be measured? 1.2 Motivation of the Thesis The correspondence algorithm plays the crucial role in motion tracking. The robustness of the correspondence process is essential in any motion tracking system. Since the correspondence result for any frame depends heavily on the matching result from the previous one, the loss of CHAPTER 1. INTRODUCTION 4 the correspondence at any frame will lead to failure of the whole system. The previous correspondence algorithms using line segments as the matching token are relatively robust when motion is very slow between frame-to-frame changes. However, when the motion is relatively fast, the system fails due to the following inevitable problem: in all those algorithms token matching is facing the selection from multiple matches - a token being tracked may find multiple matches in the current frame around the predicted region, especially when there are several scene tokens which are near to each other. For matching with line segments, the selection of the correct match can not depends on the geometric information of the isolated line segment only, the failure of the correct selection among the multiple matches will lead to the failure of overall matching. (a) (b) Figure 1.2: (a) Initial prediction (in thick lines), (b) Incorrect matching result due to incomplete match. All methods solved this problem by selecting locally "best" token [33, 14]: to choose the scene token which is best predicted according to a kind of measure for correspondence strength. Lowe also formulates the probability for accidental match to prune the searching space, so only several lowest ranked probabilities for accidental match will be used as the correct matches. Deriche CHAPTER 1. INTRODUCTION 5 simply chooses the scene token which is the nearest to the predicted position in the sense of Mahalanobis distance. The best-first approach, however, may result in the false match due to spurious data or in a close clustered scene tokens, and thus may lead to unpredicted results (See Figure 1.2). Zhang [60] claimed a robust approach which keeps tracking the token using two nearest scene tokens to the position in terms of Mahalanobis distance, if the distances are both less than some threshold. He then distinguished two different cases of multiple matches: non-collinear and collinear, and introduced the concepts of splitting and merging to cope with these two cases. Collinear segments are merged into a single segment, while in present of two non-collinear segments, model edge will be split into two duplicated ones, each pursuing the tracking by incorporating a different matcher. In a complex image, however, the number of model edges with multiple image matches is large, thus leads to an exponential growth of computational complexity. Besides, in some extreme case, the correct match might even not be contained in the best two matches. In recent years, the use of perceptual organization has drawn great attention in the machine vision community. Most machine vision systems implicitly use some aspects of perceptual organization. For instances, edge linking and region segmentation are structuring processes often used for object recognition; however, they are seldom considered as being part of an overall attempt to structure the image information. Most aspects of perceptual organization remain untried within machine vision. Marr [41] was the prominent machine vision pioneer to pay some attention to perceptual organization when he treated edge grouping as part of his concept of his primal sketch. In spite of this, he and his followers ignored the explicit use of perceptual organization in machine vision. Since the middle 1980s, increasing attention has been paid to many aspects of perceptual organization, its emerging importance has been signified by David Lowe [34], Fischer and Firschein CHAPTER 1. INTRODUCTION 6 [19], Pentland [46], et al. The objective of this thesis is to find a robust solution for the correspondence algorithm even when the frame-to-frame motion is relatively fast. Inspired by the human capability of visual perception, our approach is to incorporate perceptual groupings - intermediate-level structures as the matching tokens, since higher level groupings are far less likely to be incorrectly matched than isolated line segments [37]. In this thesis, we coin the term of Model-Guided Grouping (MGG). Unlike most perceptual grouping process, model-guided grouping is used to derive perceptual groupings with the contemporary knowledge of the model structures, therefore it can be computed much more efficiently. The correspondence is performed between intermediate-level structures from model and image pairings. Figure 1.3 illustrates the framework of our matching system. 1.3 Thesis Organization C h a p t e r 2. We will describe the major components of the general motion tracking system: segmentation, correspondence and viewpoint verification. Higher-level processing depends heavily on the input from the low-level segmentation results. We focus pur segmentation process on the extract of straight line segments. The frame work for the motion correspondence will be described in this section. Viewpoint verification can be regarded as the global matching, we simply adopt Lowe's [35] Newton-Raphon iterative method. The major contribution of this thesis is described in Chapter 3 and 4. C h a p t e r 3. We review the work in perceptual grouping; and we emphasis what will be used as our matching token. In fact, the matching tokens are derived through Model-Guided Grouping, CHAPTER 1. INTRODUCTION 7 3D Scene 3D Model 2D Image WireFrame Model Representation Segmentation 3D Transformation into Indexing Line Segments Into Bucket Arrays 2D Views Viewpoint Verification Figure 1.3: The framework of our motion tracking system. Model-Guided Groupings are incorporated as our matching tokens. CHAPTER 1. INTRODUCTION 8 with, the use of contemporary use of model structures, which is used to detect intermediate-level structures, such as junctions and parallel pairs. These kinds of model-guided groupings derived are used as the tokens for local matching. C h a p t e r 4. We emphasis how to use the model-guided groupings detected. The central problem in this chapter is to formulate the measure of correspondence strength for model-guided groupings, coarse-to-fine matching strategy is also discussed. C h a p t e r 5. Various implementation results are presented. It turns out that the use of modelguided junction groupings leads to the most successful result, while the use of modehguided parallel groupings can not work independently to attack the correspondence problem. It is important to integrate all the groupings into one system. C h a p t e r 6. We present the conclusions of this thesis, and discuss the directions for future research. i Chapter 2 Outline of Related Topics This chapter outlines some related topics in both motion tracking and matching with grouped line segments. 2.1 Motion Tracking System A general discrete motion tracking system consists of three major steps: segmentation, correspondence and viewpoint verification. Following sections describe the approaches in these three steps. 2.1.1 Segmentation Segmentation plays a crucial part in any vision system. Success at subsequent levels depends heavily on a good low-level feature extraction. Since our objective is to track solid object which is represented with rich line segment components, we focus our segmentation scheme on the line segment extraction. Various work has been reported to extract line segments in previous literatures. Our implementation works bottom up from edges detection, edge linking to least-square line fitting. 9 CHAPTER 2. OUTLINE OF RELATED TOPICS .10 Edge Detection and Linking Edge detectors can be classified in two broad classes: gradient operators and second derivative operators. Gradient operators, such as Canny and Deriche's, respond with a broad peak at an edge location, therefore these operators require a thinning or non-maximum suppression step [10]. Second derivative operators such as Marr-Hildreth's, on the other hand, respond with each zero-crossing at each edge location. We implemented various edge detectors which can be used in our experiment: Zero-crossing [42], Canny [10] and Deriche [13]. Since Zero-crossing technique is well known, we will give more details about the implementation of Canny and Deriche's detectors in Appendix A . Following the edge detection, non-maximum suppression will be applied to remove spurious edge points. Edges extracted are linked to form chains of edges points. During the linking phase, an hysteresis thresholding [10] is applied to connected chains. Line Segments Extraction The primitive features discussed in this thesis are line segments. After edge linking, we obtain the connected contours, line segments can be extracted from the connected contours by applying a recursive split-and-merge algorithm [27] to each contour. This method can be simply described as follows: 1. Giving a contour, construct the line between its endpoints, and measure the maximum deviation of the curve from that line. 2. If that deviation exceeds some threshold, split the curve at that point, and replace the original line with two new straight lines approximations between the old points and the new one. CHAPTER 2. OUTLINE OF RELATED TOPICS 11 3. Repeat this process for each new formed segment, until all the segments have small enough deviations. 4. Finally, test consecutive segments to see if they can be merged into a single straight line, without exceeding the threshold on deviation. This can also be done at grouping stage. Least-Square Line Fitting In the split-and-merge algorithm [27], line segments are formed by simply connecting the old points and the new ones located in the maximum deviations. The formation of the line segment in above algorithm might loss the accuracy for the real location. Accurate straight segments can be obtained by fitting linked edgels from minimizing the least-square errors 2.1.2 (Appendix B). Correspondence The correspondence problem is that of identifying a portion of the changing visual array as representing a single object in motion or change. The notion of a "correspondence" comes about when the problem is considered in the context of a sequence of images. The problem becomes one of establishing a match between parts of one frame and their counterparts in a subsequent frame that represents the same object at a later time: An important theoretical basis for computer understanding of 3D motion was given by Ullman [58] who studied the problem from a different direction - the understanding of psychological evidences. Many of the algorithms presented in the literature for motion tracking fall within the framework of token matching presented by Ullman. The algorithms developed differ only in the complexity and dimensionality of the matching primitives chosen, as well as the type of cor- respondence strength used. The algorithms rely on correspondence as the underlying visual CHAPTER 2. OUTLINE OF RELATED TOPICS 12 process, and assume that this process is a 2D task operating on discrete presentation. The matching of low-level tokens, determined by finding a correspondence between primitives of some type and dimensionality, is the major task of many computer vision algorithms. Whether the task be stereopsis or object tracking, the approach of minimizing some function over the possible pairing of inputs has been a common theme of many algorithms. Since this thesis mainly deals with correspondence, in section 2.2, we will discuss in detail about various correspondence algorithms for motion tracking. 2.1.3 Viewpoint Verification Correspondence is first performed among local token pairings. The global consistency should be further examined. We adopt the viewpoint verification technique developed by Lowe [33]. Given a 3-D model point p from the original model coordinates, we project p into a point (x,y, z) in camera-centered coordinates. The viewpoint is specified by 6 parameters: a 3x3 orthonormal matrix R and a 3-D vector D = (D , D , D ), such that the transform between the camera centered coordinate x y z system {x,y, z) and the image plane coordinate system (u,v) is: (x,y,z) = Rp •M where D x and D y =l 7 & . + * ~ 7 j k (2.1) + '> D ( 2 - 2 ) specify the location of the object on the image plane and D specifies the z the distance of the object from the camera. It should be noted that the above projection transform is an affine approximation rather than true perspective. To see this, consider the true perspective: (x',y',z') = R(p-t) (2.3) CHAPTER 2. OUTLINE OF RELATED TOPICS 13 (2.4) If we define t' = Rt, then (x',y',z') = (x,y,z) — (t' ,t' ,t' ), then we have: x y z (2.5) (2.6) (2.7) Consider the true perspective (equation 2.3, 2.4), it can be seen that assuming D ,D x y are constants amounts to an afRne approximation. However, it works quite well in the implementation. An improved solution without approximation can be found in Lowe's recent paper [40]. The viewpoint calculation is based on the Newton-Raphson iterative technique [33]. At each iteration, it involves linearizing the fit error between model segment and matched image segment pair about the current estimate, solving for the corrections by least squares if the number of data points is greater that the number of parameters (6 points, 3 segments), and update the estimate by the corrections. The error between an matched image segment and a model segment is measured by the perpendicular distance of each end of the model segment from the matched image segment. The iterations are terminated when the correction factors fall below a tolerant threshold. 2.1.4 Model-based Motion Tracking System In our model-based motion tracking, we limited our tracked solid objects to polyhedra. Wireframes are often used to model polyhedra, which are solid objects whose faces consist of planar polygons. The wireframe model captures the geometry of the object through vertices and edges, and defines a polyhedron according to its face. The shape of the 3D object is interpreted in two lists: a vertex list and an edge list. The vertex list specifies geometric information: where each corner is located. The edge list pro- CHAPTER 2. OUTLINE OF RELATED TOPICS 14 vides connectivity information, specifying the two vertices that form the endpoints of each edge. In current system, given the viewpoint, the visibility of each planar surface is determined through its normal direction. Thus, the visibility of each model edge is determined from the visibilities of the two planar surfaces which form that edge. Using true hidden line removal instead of this kind of approximation would have eliminated some false matches in motion tracking by reducing the length of a partially occluded model edge segment. 2.2 Previous Correspondence Algorithms for Motion Tracking Motion tracking is concerned with the real-time implementation, previous correspondence algorithms emphasis on the simplicity and efficiency, and the tokens they used are all low-level features such as points and line segments. 2.2.1 Gennery's Approach An early system for model-based motion tracking was studied by Gennery [20]. He used a prediction and a matching process to adjust parameters. The prediction of the object position and orientation for the current frame is taken by extrapolating from the previously data. Matching is performed by searching nearest image edges to the predicted the model line, if it is within five pixels. However, a more elaborate method was also proposed in his paper [20]. This new method varies the extent of searching according to the accuracy of the predicted data, accepts all edge elements within the search width, and gives the edge elements variable weight according to their agreement with the predicted line and their intensity. A similar approach is described in Lowe's recent paper [38], his system also provides an integrated treatment of matching and measurement errors duing the motion tracking. Two types CHAPTER 2. OUTLINE OF RELATED TOPICS 15 of errors can be treated in an integrated way by using the computation of variance in predicted feature measurements to decrease the amount of search for subsequent matching process. 2.2.2 Asada's Approach Asada et al. [2] track moving polyhedral objects in 3D block world scenes. Their work assumes perfect line drawings as input and each line drawing image is labeled independently before the correspondence between images is performed based on a "junction transition table" to define legal correspondence. Correspondence is established between their vertices of two consecutive frames. The perfect line drawing input assumption enables them to use a junction labeling scheme. However, a junction labeling scheme such as Huffman's [28] is only effective for static scene analysis, they extend it to the dynamic environment. Appearance and disappearance of junctions and transition of vertex label are studied in the dynamic scene. Next, the shape rigidness property of three vertices on a block is used to evaluate geometric parameters, such as orientation and edge lengths, then their interframe rotational movements are determined. 2.2.3 Verghese's Approach Two correspondence algorithms are described in Verghese's [55] work: hypothesize-and-test and dynamic edge tracking. From a set of slightly different viewpoints, the hypothesize-and-test algorithm uses the previous camera parameters to hypothesize model projections which are cross-correlated with edges detected in the latest image. The results of cross-correlation will be used to select the best match and to update the camera parameters. The second algorithm is divided into two problems: dynamic edge tracking and model-based segment tracking. Dynamic edge tracking is performed in a 3x3 neighbor around each model edge point; model-based segment tracking uses an adaptation of Lowe's viewpoint solution algorithm. The underlying CHAPTER 2. OUTLINE OF RELATED TOPICS 16 assumption behind their approaches is that the motion is very slow - the average motion displacement is 1 pixel per frame. 2.2.4 Lowe's Line-to-Line Tracking System The correspondence algorithm in Lowe's [38] tracking system works through following two steps: with projected model edges predicted from previous viewpoint parameters, he searches compatible edges around each projected model edge. Compatiblity is defined in terms of location, length and orientation. Then the probabilities for accidental match will be measured and ranked for all compatible candidate edges associated with each model edge [33], and the lowest ranked ones are reported as the matchers and will be used to update the viewpoint. 2.2.5 Tracking Line Segments Deriche [14] presents a tracking approach that combines a prediction and a matching steps. The prediction is a Kalman filtering based approach that is used to estimates the region where the matching process will be performed. Correspondence in the search area is done through the use of a similarity function based on from each line segment. Mahalanobis distance between following attributes chosen , After consider the effects of image degradation of the process of extracting edge lines, they express the attributes of an observed edge line as a vector: S = {c, d, 9,h}. c is the perpendicular distance of the segment from the origin, d is the distance from the perpendicular intercept of the origin to the midpoint of the segment, 6 is the orientation of the segment, and h is the half-length of the segment. Zhang's [60] problem is to find correspondence positions of static and moving objects in each stereo frame. The correspondence is performed in a similar prediction and matching steps CHAPTER 2. OUTLINE OF RELATED TOPICS 17 used in Deriche algorithm. He defines an object as: an object is a set of tokens with the same kinematic parameters. Interesting enough, his grouping is performed after the matching process - grouping the set of tokens with the same kinematic parameters. 2.3 The Use of Grouped Line Segments as Matching Tokens Typically, an object can be recognized by a human: 1) rapidly, 2) when viewed from different viewpoint, 3) under moderate levels of visual noise, and 4) when partially occluded. Psychological evidence suggest that human visual perception is in a manner consistent with nonaccidental relations. Five of these 2D nonaccidental properties and the associated 3D inferences are described in Lowe's book [33]. They are: collinearity, curvilinearity, symmetry, parallelism and cotermination (See Figure 3.1). Many people apply these nonaccidental principles in different ways, since they notice that the grouped segments are more reliable than isolated ones, thus initiated matching with grouped line segments could result in a more robust result. An early example for the use of grouped line segments as the matching tokens can be found in Clark's work [11]. The objective of this work is to build a reliable matching system. Clark [11] initiated match between three connected lines in one model, referred to as a line triple, and line triples in the other model. Matching corresponding line triples in the two models takes place by sequencing through the possible line triples around the boundaries of the regions. Although corresponding line triples have slightly tolerance between orientations and relative lengths, this method depends heavily on the segmentation result: this system did not work well with degraded line segments around the boundary. Based on the exact matching of the connected line segments, Dougherty [16] uses the idea of string matching measured by Levenstein distance. In the real image, however, segmentation CHAPTER 2. OUTLINE OF RELATED TOPICS 18 seldom provided the perfect input for higher-level processing. With the tolerance of degenerated case, Burr [9] proposed an "elastic model" which was designed to be iterative because of the difficulty in producing low-error matches in one step. However, this approach is computational expensive. Matsuyama [43] defined a "neighborhood relations" between line segments. The neighboring relation defined here is not a topological relation such as intersection, but a general geometric relation between a pair of spatially separated line segments. In his 2D-to-2D matching, geometric properties of neighboring line segments are used as "matching tokens" - local matching, and clustering in the parameter space is used to determine the geometrical transformation between line drawings - global matching. Instead of using bucket array, he used slabs to reduce the computational complexity for forming neighborhood relations. Wallace [54] used pairwise relationships to guide the 2-D matching process. The primitive elements of the component model are line and arc segments which are formed interactively by superimposion over an ideal image. The idea behind this approach is to cope with the rotational transformation. As pointed out by Lowe [38], higher-level groupings are far less likely to be incorrectly matched than isolated line segments. However, the use of grouped line segments as matching tokens for model-based motion tracking is still untried. Most errors in current motion tracking system are found in the image due to locally inconsistency with local relationships in the model. As we will show later that matching in motion tracking using junction groupings is a very suitable method, since the computational cost is limited by the model complexity and searching is only performed around the model vertices. Chapter 3 Model-Guided Grouping Two issues are of great importance in our matching algorithms: 1. What kind of tokens (primitives) will be used in our matching process? How to detect these tokens? 2. How should correspondence strength of these tokens (primitives) be measured? We will discuss the first problem in this chapter. 3.1 Perceptual Grouping The interest for perceptual grouping stems from Gestalt psychologists' figure-ground demonstrations: certain image elements are organized to produce an emergent figure. Zucker [61] distinguishes two types of grouping. The first corresponds to the inference of one-dimensional contours. The second involves orientation information that is densely distributed across areas rather than curves. In this thesis, we are only concerned with the first type of grouping. Marr [41] pointed out that an image contains two type of information: changes in intensity and image local geometry. Then grouping is a process that makes both these pieces of information explicit. 19 CHAPTER 3. MODEL-GUIDED GROUPING 20 Witkin and Tenenbaum [57] defined perceptual organization as process being able to detect primitive structure: those relationships we are able to perceive even when we can not interpret them in terms of familiar objects. Our objective, in model-based motion tracking system, is to incorporate these kind of primitives even when we have no prior knowledge of the object. Principle of Non-Accidentalness Critical information is unlikely to be a consequence of an accident of viewpoint 2-D Relation 1. Gollinearity of 3-D Inference Collinearity in 3-space. points or lines 2. Curvilinearity of Curvilinearity in 3-space points or arcs 3. Symmetry Symmetry in 3-space (Skew symmetry) 4. Parallel lines Curves are parallel in 3-space or curves 5. Cotermination: two or more terminations at a common Lines(Curves) terminate at a common point in 3-space point Figure 3.1: Five nonaccidental relations and 3-D inference from image features. For a more practical point of view, Dolan and Weiss [15] attacked the problem of perceptual grouping and implemented an algorithm which performed linking, grouping, and replacement in an iterative procedure. Sets of simple parametric tests were used to determine which structural relations were applicable to a given set of tokens. Then a set of tokens was replaced by a simple token. CHAPTER 3. MODEL-GUIDED GROUPING 21 A direct exploration of the role of grouping in reducing search is given by Jacobs [30]. Jacobs also considered the problem of selecting salient groupings of features on which to focus search, he concentrates on directly estimating the likelyhood that different edge structures have arisen from the same object. We will illustrate later the role of grouping to reduce the searching space in our matching system. Lowe [33] argued that perceptual groupings are useful to the extent that they are unlikely to have arisen by accident of viewpoint or position, and therefore are likely to reflect meaningful structure in the scene. Such groupings include collinearity, curvilinearity, symmetry, parallelism and cotermination (Figure 3.1). Moreover, Lowe demonstrated the necessity of such groupings for object recognition. All these groupings, advocated by Lowe, could be embedded in our Model-Guided Groupings. However, our grouping in the current system are mainly based on cotermination, collinearity and parallelism. Based on threshold selection, following are the grouping algorithms (without model guidance) developed by the author. 3.1.1 Grouping Based on Cotermination An ideal junction is a set of lines terminating at a common point. In real images however, the lines rarely terminate exactly at the same point due to noise or algorithms for constructing the line segments. Instead they terminate at a "small common region" (Figure 3.2). Hence the difficulty of detecting junctions is two fold: • the complexity of considering all subsets of lines as potential junctions; • the lack of a simple and exact mathematical definition of a set of lines terminating at a common region. In order to reduce the complexity mentioned above, we built a simple data structure, called a bucket array (Figure 3.6). Each bucket consists of lists of line segments falling into the bucket according to their endpoints (see Section 3.2). By examination of the neighboring buckets, for any image neighborhood one can quickly determine the list of segments with their endpoints CHAPTER 3. MODEL-GUIDED GROUPING 22 Figure 3.2: In the real image, lines are terminate at a small common region, termination at this neighborhood. The problem is that it is hard to decide what is the suitable size of this common region. Besides it is difficult to separate if several junctions are falling within the same region. So, the detection of the junctions without any prior knowledge is still uncertain. In order to avoid giving a formal definition of the junction in the real image, we propose the concept of pseudo-cotermination point (pep), which is defined as: given a set of lines (more than 2 lines), the point which minimizes the sum of squares of perpendicular distance to each line is called pseudo-cotermination point. Obviously, an ideal junction has a perfect pep with zero sum of squares of perpendicular distance. Thus the "quality" of junction formed in the real image can be judged from the maximum distance from pep to all the line segemnts considered. For instance, if the maximum distance is less than 4 pixels, we can assume that junction is formed. The detailed information for finding the pseudo-cotermination point is described in A p p e n d i x C . 3.1.2 Collinearity Grouping Given two line segments, we define the distance between two line segments as the distance from the middle point of the smaller segment to the longer one. Let 6 be the value of the angle between two segments, c be the threshold set for the angle differ- CHAPTER 3. MODEL-GUIDED GROUPING 23 Figure 3.3: pep is determined by minimization the sum of squares of perpendicular distance to each line. ence. Let / be the overlap between two line segments, d be the distance between two lines. Let g be the gap between two nearer endpoints of segments, max.overlap, max-distance, max.gap are the thresholds set for corresponding constraints. The two line segments are collinear if the following conditions are satisfied (Figure 3.4): 1. 6<eorx-e<$<K 2. ./ < maxjoverlap or g < max.gap 3. d < max.distance Figure 3.4: Grouping based on collinearity. The collinear grouping is performed .around the endpoints associated with each segment in the region of a circle with the radius of max.gap. Thus the complexity of collinear detection can be obtained similar to the method of junction detection. CHAPTER 3. MODEL-GUIDED GROUPING 3.1.3 24 Grouping Based on Parallelism Parallel grouping is treated as the special case of symmetry detection. With the same parameter definitions, the two line segments are parallel if the following conditions are satisfied (Figure 3.5): 1. 6 < e or 7T - e < 0 < TT 2. I > miti-overlap , A pair of parallel segments is called a salient parallel pair (Figure 3.5) if the perpendicular distance between two lines is less than either of the length of these two segments. Our parallel grouping is mainly focused on detecting salient parallel pairs. (a) (b) . Figure 3.5: (a) Parallel pair, (b) Salient parallel pair: h < m i n ^ , ^ ) . 3.2 Indexing Line Segments After segmentation, we obtain an image containing N line segments. The straightforward method for finding all pairs of neighboring line segment relations, based on endpoint proximity for instance, requires a large amount of computation: there are N(N — l)/2 candidate pairs and for each pair we need to check the positions of the other N — 2 line segments. CHAPTER 3. MODEL-GUIDED GROUPING 25 (0,m-l) (m-l,m-l) (0,0) (m-1,0) Figure 3.6: Image is partitioned into m XTOwindows, each window is attached a bucket. Line segments are indexed into bucket arrays which are formed by dotted lines. Attempts to reduce the computational complexity for detecting neighborhood relations between segments has been inspired by the methods in computational geometry - partition the whole image into buckets [17, 3] or slabs [43]. The computation of the buckets is performed by an indexing algorithm: whose complexity is linear in the average with respect to the number of segments in the images. The overall image is partitioned intoTOXTOsquared windows (Figure 3.6). To each window is attached a bucket that is a list of segments according to.their geometrical distribution. Two kinds of information are stored into the buckets for different grouping processes: line segment endpoints and line segment middle points. Theoretically, it has been observed that we should partition the image according to the line segment distribution. Instead to partition the image into squared buckets, we should partition CHAPTER 3. MODEL-GUIDED GROUPING 26 the image into the rectangular ones by first computing the parameters which determine the partition size. Edhiro [17] described in detail about this method. Matsuyama [43] used another approach - slab to facilitate computation. However, the bucket technique is more suitable in our application, especially for grouping and searching. With the dynamic aspect introduced, it is not suitable to compute the bucket partition parameters for each frame. We simply partition the image into squared windows.- 3.3 Searching Correspondence for Motion Tracking 3.3.1 Psychological Studies of Human Visual Perception A number of psychophysical studies concerning the detection, localization and tracking objects in the visual field have suggested a two-stage theory of human visual perception. The first stage is the "preattentive" mode [52], in which simple features are processed rapidly and in parallel over the entire visual field. In the second, "attentive" mode, a specialized processing focus, usually called focus of attention, is directed to particular locations in the visual field. Empirical and theoretical studies suggest that beyond a certain processing stage, the analysis of visual information proceeds in a sequence of operations, each one applied to a selected location (or- locations). The sequential application of operation to selected locations raises two central problems. First, what are the operations that visual system can apply to the selected locations? Second, how does the selection proceed? That is, what determines the next location to be processed, and how does the processing shift from the current to the next selected location. As Koch [31] suggested that the major rule for initial selection.of a location is based on the conspicuity of that location, i.e. by how much its properties differ from the property of its CHAPTER 3. MODEL-GUIDED GROUPING 27 neighborhood. In this thesis, we argue that perceptual groupings do preserve the properties different from its neighborhood, thus they could be one of the candidates to guide the initial selection process. Consider the second of questions in motion tracking, two rules for shifting from one selected location to another are based on (1) proximity preference and (2) similarity preference. Searching for an "interesting" target around a selected location would profit from a selection mechanism biased to nearby location. Scanning the visual field for objects with a common identifying property, for instance the orientation of the line segment, would be likewise facilitated if locations with similar property to the presently selected location are preferentially selected. Both mechanism are related to phenomena on perceptual grouping and "Gestalt effects" which occur as a function of object similarity and spatial proximity. From the computational point of view, we impose the constraints, which will be describeded in section 3.3.2, to simulate the focus attention in the region of interest, and to perform the shifting rules based on proximity and similarity preference. 3.3.2 Constraints Reduce the Search Space The role of the constraints is to reduce the amount of search required to find the correct interpretations of the image data. Our guidelines to develop constraints are: • the constraints should reduce the searching space in order to minimize the run time effort. • the constraints should be as complete as possible. With these principles in mind, we consider two types of constraints: geometric constraints and physical constraints. Since we are dealing with geometric features, we focus on the geometric ones. CHAPTER 3. MODEL-GUIDED GROUPING 28 Local Focus Window The matching in motion tracking proceeds from two sets of data which are obtained in two consecutive frames from the same physical position. Except for the initial guess, we use superimposed model edges (Figure 1.1) as the matching result from the previous frame, and we use image edges to represent the continuants of tokens in the current frame. Matching is performed between model edges (previous frame) and image edges (current frame). Attentional influence on the computation of visual information has profound effects: 1. It can reduce the computational cost in the lower-level processing. 2. It reduce the spatial region that only contain potential matches. In order to focus the search space, all the grouping and matching processes should be focused within the local focus window which is bounded by superimposed model edges. Suppose that {xmin, Vmin) is the coordinate of the lower-left corner, and the upper-right corner bounded by the object. For (x , y ) max 1S max local focus window, we use the coordinate of (lx i ,ly i ) m n m n (lx x,lymax) to represent the coordinates of lower-left corner and upper-right corner respectively. Here lx i = max(x i — d,Q), ly i = rnax(y i — d,0)), lx ax = min(x x + d, N — 1) and ly = min(y x + d,N — l) (Figure 3.7(a)). Here image size is N x N, d can be and ma m n max m n m n m n m ma determined by the searching bound. For computational efficiency, both low-level segmentation and model-guided grouping could be performed only within this window. It should be noticed that the position of this window is at the object to be tracked and must be updated during each iteration for the matching refinement. Unary Constraints Three unary constraints will be used to impose consistency between each individual model edge and image edge. ma CHAPTER 3. MODEL-GUIDED GROUPING 29 Figure 3.7: (a) We focus the search in the Local Focus Window, which is determined by the bounds calculated from superimposed model edges, (b) Compatible image edges will be derived after passing all the unary constraints. Length constraint This constraint says that if the ratio between the length of image edge and the length of model edge is less (or greater) than certain amount, then it is possible to assign this image edge to this model edge. In the current implementation, occlusion and distortion of perspective projection must be considered, thus it has been found that the ratio within the range of (1/4, 5/4) is suitable. Angle constraint This says that if the angle of the image edge agrees with the angle of the model edge, within some threshold 6, then it is possible to consistently assign the image fragment to lie on the model one. In Figure 3.8, maximum 0 is 30 degree. Distance constraint This says that if the distance measure between the image edge and the model edge is bounded within our maximum tracking displacement, then it is possible to consistently assign the image edge to lie on the model edge as a potential matcher. CHAPTER 3. MODEL-GUIDED GROUPING 30 Figure 3.8: Both, angles for image edge and model edge are in the range of [0,360] degree. Figure 3.9: (a) Distance is interpreted as perpendicular distance, (b) Distance is determined from two endpoints (e.g. oa, ob, oc). CHAPTER 3. MODEL-GUIDED GROUPING It should be noticed that the 31 distance measure varies according to the specific algorithm. It can be the perpendicular distance or the distance measured from the endpoints (Figure 3.9). Binary Constraints On top of the unary constraints, we impose binary constraints to further reduce the inconsistency between paired image fragments and model ones. Figure 3.10: The angle between a pair of edge angles (a, 6), must lie within the range (&', a"), except (a', 6"), since counterclockwise relationship should be reserved. Angle constraint The angle between a pair of image edges, as shown in Figure 3.10, must lie within the range of angles between a pair of model edges. This can be further imposed in the junction grouping. The maximum 9 is 30 degree. The counterclockwise relationship between two line segments should be examined and is of particular importance if they are close to each other. Distance constraint We use distance constraint in both junction grouping and salient parallel grouping. In junction grouping, we force the pep of the junction so that it can fall in the searching region bounded by the maximum tracking displacement. In salient parallel grouping, the range of the perpendicular distance between parallel image edge pair must CHAPTER 3. MODEL-GUIDED GROUPING 32 be contained within the range of distance between the parallel model pair, subject to certain threshold. 3.4 Model-Guided Grouping Choice of meaningful grouped features mainly depends on two factors: the problem domain and the system goal. Our objective is, however, not limited to a domain which relies on the specific knowledge, but on the knowledge that these structures have certain generic shapes which are viewpoint invariant. Since our domain of 3D motion tracking is focused on the 3D block world. We choose junction, parallel pair as our grouped features, i.e. primitive tokens. The underlying assumption used in these groupings is: the properties of these kind of 2D groupings are most likely preserved in 3D scene as well as i the consecutive frames. In this thesis, we propose a new approach in grouping especially applied to model-based motion tracking - Model-Guided Grouping (MGG). The term MGG is defined as: Grouping, based on some generic structures compatible with the model, is guided and der locally, with the contemporary use of model structures, around the predicted model position. In motion tracking, our objective is to continuously revise the camera viewpoint parameters from the previous solution. The method we use is to match the tokens from the two consecutive frames which are from the same physical position. As we noticed that the properties of junctions and parallel pairs will be preserved in consecutive frames, thus we choose these groupings as our matching tokens. So, our grouping differs from the general grouping in that the grouping is guided by the known model structures. CHAPTER 3. MODEL-GUIDED GROUPING 3.4.1 33 Model-Guided Junction Grouping The registration of cotermination is important for determining vertices, which provide information that can serve to distinguish the other structures. In fact Binford [8] has suggested that the major function of eye movement is to determine coterminous edges. Many researchers demonstrated the use of junction information to facilitate 3-D object recognition. An example of a non-coterminous vertex is the T , although there is a termination of one segment on another. Such vertices are important for determining occlusion. Thus the T vertex might have a somewhat different status than the Y (fork), /jv (arrow), and L vertices. The arrangement of vertices, particularly for polyhedra, offers constraints on "possible" interpretations of lines as convex, concave, or occluding. Impossible objects [28] can be constructed from violation of these constraints. The impossibility of most impossible objects is not immediately identified, but requires scrutiny and thought before the inconsistency is detected. This indicates that the visual system has a capacity for classifying vertices locally, but no perceptual routines for determining the global consistency of a set of vertices. The discussion above provides us with the computational consideration that the use of vertices should be based on the local vertex information only. Given a 3D polyhedral model, we present a method for extracting junctions which are compatible with the model junctions. However, the detection of junctions without prior model information is still unsolved. Model Junction Indexing Since the grouping is guided by the model junction information, model junction information must be expressed explicitly. Because we have a list to store the information of all model edges, CHAPTER 3. MODEL-GUIDED GROUPING 34 model junction indexing is thus straightforward. First, we assigne each vertex with distinct number; second, associated with each vertex there is a pointer to the model edges terminating at this vertex. All these edges should be sorted in counter-clockwised orientation. Grouping Connected Segment Pairs For grouping junctions with two or more than two lines, two lines are identified as "potential connected" if they satisfy the following constraints: 1. their gap is small - less than a certain threshold (say 12 pixels), 2. they are not T junctions. Figure 3.11: T junctions will be examined for potential connected constraints. By T junction we mean that for two line segments if the gap between their endpoints are less than a threshold and intersection point is inside one of the segment (Figure 3.11) with / longer than certain threshold. From our tests we found that if / is longer than 5 pixels, T junctions will no longer be the connected pairs. Model-Guided Junction Grouping CHAPTER 3. MODEL-GUIDED GROUPING 35 Unlike general cotermination grouping, we focus on the grouping around the predicted model vertices. With indexed model edges for each vertex, we first find all the compatible edges associated with each model edge. By compatible we mean that image edges should pass all the unary constraints. We use r as our maximum tracking displacement, within given radius r and centered around the corresponding model vertex, we only consider the image segments falling into the circle (Figure 3.12), following with the examination of the length, angle constraints. Figure 3.12: We focus our searching around the model vertex, junction (thick lines) is formed around the model vertex by checking potentially connected constraints. A list of compatible image edges associated with each model edge will be created for further computation. Two directions can follow to form the junctions from all the combinations: • based on pep: compute the peps for all the junction combinations; • based on proximity: examination of all pairs of potentially connected constraints. CHAPTER 3. MODEL-GUIDED GROUPING 36 Our implementation follows the later one, and test results give satisfactory formation, and efficiency. Starting from the model edge indexed in the junction database, we form the junction, in the sequence of model edges with ascending orientation, by examination of all pairwised relations satisfying the "potential connected" constraints. Look at the example from Figure 3.12, model edges a,(3 and 7 are sorted, according to their orientation, in counter-clockwise: a,/3 and 7. Junction grouping is performed around model vertex m, image segments a, b, and c are found compatible with model edges a, (3 and 7 respectively. In order to form the junction by a, 6, c, we should check pairwised relations between a and b, b and c. If they all pass the "potential connected" constraints, a junction, formed by a,b,c, will be created. A complicated algorithm can also include following constraint: pep formed by this junction should be in the searching circle, and the maximum distance from pep to all the line segments should be less than a threshold. In order to reduce the computational cost for checking edge endpoint proximity, we group all the "connected" images segments during the contour-splitting algorithm - all the split segments from the same connected chains will be assign the same group identification number. This information will be used at a later stage to facilitate the computation based on proximity between endpoints. Algorithms for Model-Guided Junction Grouping 1. Grouping connected line segments during the preprocessing (recursive contour splitting); 2. Indexing image segments according to the endpoints; 3. Indexing model junctions associated with each model vertex; for each model vertex do 4. Searching all compatible image edges by imposing unary constraints, linking all compatible edges associated with each model junction edge. 5. Junction is formed by checking the potential connected constraints for all pairs. CHAPTER 3. MODEL-GUIDED GROUPING 3.4.2 37 M o d e l - G u i d e d Parallel Grouping Model-guided parallel grouping is performed in a similar way to the model-guided junction grouping. Salient parallel model pairs should be indexed first, this is done by checking all the segment pairs in each polygon. Then we will search all the compatible image edges associated with each salient parallel model edge. Salient parallel grouping is performed between compatible image edges associated with salient parallel model edges. Algorithm for Model-Guided Parallel Grouping 1. Indexing image segments according to the endpoints; 2. Indexing salient parallel model pairs; for each salient parallel model pair do 3. Searching all compatible image edges by imposing unary constraints, linking all compatible edges associated with each model edge. 4. Salient image parallel pair is formed by checking the constraints for salient parallel and compatibility between the image pair and model pair. Chapter 4 Matching with Model-Guided Groupings In this chapter, we first review the previous line-to-line matching system, and explain how can we extend the matching problem from Lowe's line tokens to grouping tokens. 4.1 Model-Based Motion Tracking as a Matching Problem The problem of initially locating the object or calculating the initial viewpoint has been studied by Grimson [22], Huttenlocher [29], Lowe [33] and Besl [4]. In this thesis however, we only address the dynamic aspects of 3-D model-based tracking, and assume that the initial viewpoint has been obtained from other sources. The 2D motion correspondence problem is to find the displacement between points projected onto the image plane at different times by the same physical point. This can be accomplished through token matching: pairing of tokens in one frame with tokens in the subsequent frame from the projection of the same scene feature, as each matched token can provide at least one point-to-point correspondence. Matching is performed between two sets of object data which are from two consecutive frames: (a) object features (tokens) from previous frame, and (b) object features (tokens) from current 38 CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS 39 frame. Since our approach is model-based, segmented object features from the previous frame will be represented by a projection of a known model which is in accurate agreement with the object features in the image. Since the motion is relatively dense, token matching can be performed locally by taking frame-to-frame coherence. 4.2 Line to Line Matching An extension of point-to-point correspondences will allow the use of line-to-line correspondences, as people noticed that the information from line segments is more reliable than from isolated points. The solution adopted in Lowe's system is to measure the errors of perpendicular distance of each endpoint of the matched line in the image from the model edge. The result is that each line-to-line correspondence between model and image edge can give us two equations for viewpoint verification. Solving 6 parameters, a minimum 3 lines are required. 4.2.1 Psychological Experiments from Ullman Ullman considered the matching problem as two stages: local and global matching. In this approach, he started matching of the complex configuration from their elementary constituents, followed with the examination of more complex configuration. Between these two stages are some interactions. He coined the term Correspondence Strength (CS) as a measure of affinity between tokens in one frame and the tokens in the subsequent frame. Each correspondence is established by selecting matched candidate tokens with highest CS. The effects of distance, orientation and length determine this measure. The effect of distance The selection criterion can be met, at least approximately, by defining the segments' separation CHAPTER.4. MATCHING WITH MODEL-GUIDED GROUPINGS 40 as the distance between their midpoint when their lengths are similar; and as the perpendicular distance when their lengths are quite different. The effect of orientation The affinity between line segments decreases as a function of the orientation difference between them. The effect of length Similarity of length enhances the affinity between line segments. Comparing the effects of distance, orientation and length, he summarized the results in a table [58] which gives the relative strengths of the different effects. For example, it can be seen from the table that the effect of a 60 degrees orientation difference is comparable to the effect of a 2.25 distance ratio or a 2.1 length ratio. All these entries have a strength of about 1 on a scale of 0 (no preference) to 4 (absolute preference). Preference Orientation difference Distance Ratio Length Ratio 4 15 1-1 1.04 3 30 1.2 1.13 2 45 1.6 1.5 1 60 2.25 2.1 0 75+ 2.7+ 2.5 Table 4.1: The relative effect of orientation difference, length ratio, and distance. Adapted from Ullman's paper. His affinity measure, in terms of distance, orientation and length, bears some resemblance to Lowe's probability evaluation for matching with accidental agreement. However, he did not give an integrated quantitative measure for overall CS. 4.2.2 Lowe's Model-Based Motion Tracking System The primitive token used for matching in Lowe's system [38] is the line segment, and the overall structure of Lowe's model-based motion tracking system can be summarized as follows. CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS 41 Edge Detection and Formation of Line Segments An image is processed with a Marr-Hildreth [42] edge detector, and the zero-crossings are reported together with the gradient of the image through the zero-crossings. Edge linking is then performed, followed with the formation of line segments by using recursive contour splitting algorithm [27]. Model Projection This is determined from the previous matching result by superimposing model edges on top of matched image edges. Searching and Probability Ranking for Potentially Matched Image Segments All potentially matched image segments, which satisfy three unary constrains, will be identified locally around the predicted model edge position. Under the assumption that incorrect matches will be uniformly distributed in terms of position, orientation and scale, Lowe integrates all these measures in one function which is equivalent to computing the probability of the occurrences of an accidental match: 7rm Here, 6 is the orientation difference, m is the length of the matched segment, and s is' the perpendicular distance from the center of the matched image segment to the predicted segment. To verify a predicted model feature, he ranks all potentially matched image segments in terms of probability of accidental agreement in terms of p: Viewpoint Update and Verification The ranked probabilities of accidental agreement are used to prune the set of possible matches: only top matches are accepted. Since there are 6 degree of freedom of the model, a minimum CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS 42 of 4 line matches are required. These top matched image segments will be used for updating the new viewpoint using Newton-Raphson technique [34] then the process will be repeated. 4.3 Matching candidates to target - a probability problem In response to Tsotsos' target article "Analyzing vision at the complexity level", Lowe [39] emphasized the use of probability theory on vision problems. He pointed out that: The probabilistic approach is based on the assumption that perfect visual performance is possible (due to inherent ambiguities in the data) and that therefore the goal of a system maximize the probability of making correct interpretations. As stated above that the goal of a system is to maximize the probability of making correct interpretations, this statement can also be interpreted as to minimize the probability of making incorrect interpretations. All the matching algorithms are facing the ambiguities from the input low-level features, due to noise from camera input as well as the imperfect output from the lowerlevel segmentation. On the other hand, however, grouping can bridge the gap from imperfect data to a better interpretation. Thus model-guided grouping can reduce the ambiguities by making full use of local consistent interpretation from image data. Our tests show that Lowe's system does work well in slow frame-to-frame motion. However, the real maximum tracking displacement depends heavily on both the input data and the object structures, which are determined from a particular viewpoint. As Lowe argued that all the visual.systems face the ambiguities in their input, thus it is impossible to guarantee 100 percent correct matches by evaluating a kind of formula, since the correct matches are not only determined from isolated geometrical information. Besides, due to spurious data, the best matches, measured from his formula, do not always reflect the correct matches (Figure 4.1). Another problem is that when two candidates have similar evaluation to the same predicted model edge, the choice of which matcher is uncertain. These uncertainties can be minimized CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS (a) 43 (b) Figure 4.1: Improper matches arise due to spurious data or inconsistent local structures; Thick lines are predicted model edges, thin lines are image edges. Ambiguities can be reduced by using groupings, (a) Correct matches should be between a,b,c and a,/3,7, instead of local best matches between a,c and a',c'. (b) Correct matches should be between a,b and a,/3 instead of local best match between 6 and a. through the use of model-guided groupings. 4.4 Evaluation of Probability of Accidental Match with Modelguided Junction Groupings Taking model-guided junctions as our matching tokens, our evaluation of the probability of accidental match makes use of the following considerations: 1. The probability should reflect the proximity and similarity rules. 2. The probability of accidental agreement should decrease when the number of compatible constituents in a junction increases. Example: suppose a junction of a model consists of three edges. Then, a detected image junction compatible with three line segments would have a lower probability of accidental match than a detected image junction compatible with two line segment. 3. When occlusion is present, partial matching among grouped junction segments should be allowed. Thus the evaluation should be flexible with respect to the number of the segments. Example: suppose a junction of a model consists of three edges, the maximum number CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS 44 of line segments in detected compatible image junction is two, this match should also be allowed.. 4. If no junction groupings are present, this system should work at least as well as Lowe's line-to-line matching. I.e. this system should be compatible with Lowe's. A straight forward method is to find an integrated evaluation directly between a set of compatible image line tokens and model junction edges. We can evaluate the accidental probability of the grouped feature as the product of each individual line-to-line accidental matching agreement (formula 4.2). All the principles will be satisfied automatically, for instance, when no grouped features are present, i.e. only one image edge matches against one model edge, this will result in the identical evaluation as Lowe's (Figure 4.2). Now we formulate the evaluation of probability of accidental match between junction tokens. Suppose that there are m compatible image segments with respect to n model edges, here 1 < m < n, and the probability of these m compatible image edges which can form a junction is p(junction). Each individual probability for accidental match is p,-, which can be obtained from equation 4.2. Making use of conditional probabilities and Bayesian inference, the overall probability for accidental match between image junction with m segments and model junction with n edges is: p(accidental8zjunction) = p(accidental\junction)p( junction) = (pi Xp2 X . . .Xp )xp(j unction) m (4-3) Since an image junction is derived with the use of model structure, we assume that p(junction) is very close to 1. Thus the equation 4.3 can be simplified as: p(accidentalSz junction) = p\ x p2 x ... x p m (4-4) Thus the inverse probabilities for accidental match will be ranked for all compatible candidate edges. It is obvious that the highest ranked ones are the complex junction groupings; and lowest CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS 45 ranked ones are single line segments. Top ranked ones will be regarded as the matches and be used to update the viewpoint. 4.5 Matching with Mo del-Guided Groupings Lowe [38] suggested that the development of feature groupings could be very useful for the motion tracking problem as well as for general recognition problem, as the higher-level groupings are far less likely to be incorrectly matched than isolated line segments. Many reasons for using grouped line segments as the matching tokens, among them are: • compatible higher-level groupings are less numerous than compatible low-level primitives, thus the probability of making incorrect match can be reduced through the use of groupings. • higher-level groupings are less sensitive to motion and noise, thus they are less likely to be incorrectly matched. • the computational cost is limited by the complexity of the model structure. The entire motion tracking system structure is similar to Lowe's line-to-line matching system, however, the matching tokens have been replaced by higher-level groupings such as junctions and parallelpairs instead of line segments. 4.5.1 Algorithm I: Matching through Junctions Algorithm for matching with junctions: 1. Model-guided junction grouping, starting from the most complex group matching, and ending with line-to-line matching. 2. Evaluation of the matching agreement according to formula 4.4. 3. Ranking all the above matchers according to the evaluation, and select top matches for viewpoint verification. CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS 4.5.2 46 Algorithm II: Matching through Parallel Pairs Algorithm for matching with salient parallel pairs: 1. Model-guided salient parallel pair grouping. 2. Evaluation of the matching agreement. 3. Ranking all the parallel pair as well as line-to-line matches, and select top matches for viewpoint verification. 4.6 Matching Strategy Initial matching will be performed around each model vertex where the area for searching compatible candidate tokens is bounded by the circle (Figure 4.2 (a)) with radius of r. Top matches will be reported to update the viewpoint. Next viewpoint estimation will be performed with updated viewpoint, thus the model structure will be updated (Figure 4.2 (b)). During this iteration, searching for compatible candidate tokens is bounded by a shrinking circle with radius of r' (here r' < r). Matching refinement will be repeated until the agreement between model structures and image edges is close enough (Figure 4.2 (d)). CHAPTER 4. MATCHING WITH MODEL-GUIDED GROUPINGS 47 Figure 4.2: Matching refinement with model-guided junction groupings. Superimposed model edges are projected from its prior estimated viewpoint, nearby are top ranked matching edges, heavy bars are the perpendicular errors to be minimized. Searching regions (circles) will be reduced in consecutive iterations (a)-(d). Chapter 5 Implementation Results All the implementations are performed on SPARC II among three images which are captured from different viewpoints (Figure 5.1). System performance will be compared with the one in Lowe's line-to-line matching. Zero-crossing, Canny and Deriche's detectors have been used in the past to extract the edges. Since current real-time implementation of line-to-line matching adopts Marr-Hildreth detector, our implementation adopts the same edge detector. All the edges in three images are extracted by marking zero-crossings, then connecting edges that are inked together. Recursive contour splitting is applied for all the linked edges. Finally, line segments are formed by least-square fitting (Figure 5.1). It should be noticed that as a by-product of least-square fitting, for each image segment, its line equation parameters can be obtained (refer to Appendix A), and can be used to facilitate further computation. 5.1 Model-guided Grouping The projected positions of model edges can be computed directly from the known viewpoint. Each model vertex will be assigned a distinct identification, and the junction of each model 48 CHAPTER 5: IMPLEMENTATION RESULTS 49 vertex is formed and sorted according to the ascending order in the counterclockwise direction. The number of the junction and the orientation of each line segment will be stored into the model junction database. During each iteration, viewpoint will be updated and the information of each current model junction can be obtained from the junction database, and then will be used to guide the grouping for image segments. Figure 3.2 shows the result of junction groupings for Image 1. Similar to the model-guided junction grouping, salient model parallel pairs will be formed into a database. Then they will be used to find the image parallel pairs around the predicted model position during each matching refinement. 5.2 Matching with Junction Grouping — Comparison with Lowe's Line-to-Line Matching The estimation of initial viewpoint for each image is done interactively by using our interface. When the position is close to the true position, we can use the matching with model-guided junction groupings to finalize the viewpoint parameters. It should be noticed that the viewpoint parameters measured by matching with junction groupings is slightly different from the matching with line-to-line, because the rankings of the top matches in the two methods are not identical. Projection from a slightly different viewpoint form the initial viewpoint obtained, matching with Model-Guided Junction Groupings, will be performed between model junctions and compatible image junctions. In Image 1, the number of junctions in predicted model is 17 for each iteration. At each iteration, the number of compatible images junctions found through model-guided grouping is about 21. The ambiguity caused by line-to-line matching can also be identified by checking the average compatible lines for each model edge. Looking at Table 5.1, it has been found that the average number of compatible junctions is greatly reduced. CHAPTER 5. IMPLEMENTATION RESULTS 50 The matching strategy for each iteration is coarse-to-fine: around each vertex, the searching radii for the first three iterations are 40, 35, 30 pixels respectly, and will all be reduced to 25 pixels after the third iteration. As the results show, within a tolerant threshold, most matching can be terminated in two iterations. The computation time for model-guided junction matching was measured by running 50 iterations: for line-to-line matching, the total time is 4.43 seconds; for junction matching, the total time is 9.85 seconds. The average time for each iteration is shown in Table 5.1. It has been found that the average time for the use of model-guided junction groupings is slightly more than double the time for the use of single line segment. However, it should be noticed that since the radii of the first four iterations are graduately reduced, the computation time should also be slightly reduced for the first four iterations. The robustness of the system is tested in the following two steps: first, we have obtained the correct viewpoint for image 1-3; then we translate and rotate the model from initial viewpoint we obtained to the adjacent regions: the tests will be performed to find the matching between the new position and the position projected from the initial viewpoint. As we measured from various tests, depth information will not affect the system performance. Thus, our tests are measured up to 5 degrees of freedom. Since we know the correct viewpoint, we can translate the model from the correct position in four directions: up, down, left, right. Measured by pixels, the translational amount is bounded in 24 pixels in each direction, and the test data are sampled by the interval of 4 pixels, so the total number of tests generated in two dimensional is 13*13=169. The implementation results show that among 169 translational motion, almost all correspon- Figure 5.1: Images and their segmentation results used for testing. CHAPTER 5. IMPLEMENTATION RESULTS 52 deuces are correctly established by using junction groupings. It has been found that if the translation amount is not large enough to make the junctions out of the searching region, all the matching can be correctly established with the use of model-guided junction groupings. We found however, line-to-line matching failed about 40 percent. If the translational amount is small, line-to-line matching is relatively robust. And when the translation is relatively large, especially if ambiguities with other model edges, it is sensitive to the local "best" match based on single line-to-line correspondence measure. Most failures are due to partial local "best" match. The tests for both translation and rotation are measured in two steps: As we mentioned above, We first translate the model in four directions to the adjacent 169 positions, then we rotate the model in each position in 8 different directions, each within the 15 degrees. Thus the. number of tests for combination of translation and rotation is 1352. It has been found that the use of Model-Guided Junction Groupings is also robust with respect to the combination of both translation and rotation. The success rates, in three images, for the use of grouping are all over 95 percent. Failure has been identified that due to both rotation and translation, the junctions in the image are falling outside of the searching region. It should also be noted that the junctions with three line segments are all correctly matched in three images. Despite of the fact that there are few incorrect matching candidates can be derived for the "junction" with two line segments, probability ranking according to our method can also reduce the ambiguities. Viewpoint can also be correctly identified since the matches for the junction with three line segments will be ranked on top the ranking list. It also seems that the junctions with three line segments dominate the matching process. This might indicate in turn that the higher level the groupings are used the more accurate will be the matching result. CHAPTER 5. IMPLEMENTATION RESULTS Test Cases Junction Groupings Line-to-Line 53 Compatible Tokens 1.1 - 1.4 2.5 - 3.1 Average Iteration Time (sec) 0.197 0.089 Table 5.1: Comparisons of average compatible tokens found and computing time at each iteration between matching with junction groupings and line segments. Images Image 1 Image 2 Image 3 Test Cases Junction Groupings Line-to-Line Junction Groupings Line-to-Line Junction Groupings Line-to-Line Translation (169) 169 112 164 89 168 105 % Error 0.0 33.7 3.0 47.3 0.6 37.9 Table 5.2: Comparisons of error rates between translational test cases for matching between junction groupings and line segments. Images Image 1 Image 2 Image 3 Test Cases Junction Groupings Line-to-Line Junction Groupings Line-to-Line Junction Groupings Line-to-Line Translation-)-Rotation (1352) 1309 758 1275 695 1298 712 % Error 3.2 43.9 5.7 48.6 4.0 47.3 Table 5.3: Comparisons between translational and rotational test cases for matching between junction groupings and line segments. CHAPTER 5. IMPLEMENTATION RESULTS 5.3 54 Matching through Parallel Pairs The test results indicate that evaluation of accidental probability for parallel pair is very important. Inadequate evaluation may result in unexpected matching results. The author tried different evaluations of accidental match, and it turn out to be that the displacement of each individual edge is a very important factor. The matching constraints are relatively weak, because only salient model parallel pairs are used to test matching. As the results show that except small motion can be uncovered, when motion is large, only result in partial correct matches between image parallel pairs and salient model parallel pairs. 5.4 Interface for Model-based Motion Tracking Interface for our model-based motion tracking system was developed using C and XView Toolkit. The use of XView toolkit eases the development of a GUI, as it allows the programmer to use the window system at a higher and more abstract level.. XView [24] is an object-oriented toolkit on top of the X Window system. Based on OPEN LOOK standards, it offers a rich set of composition strategies along with a variety of predefined components that allow the programmer to easily develop and implement complex user interfaces. From the programming point of view, XView can be viewed as a notification-based system. The centralized Notifier acts as the controlling entity within a user process, reading from the operating system and formatting it into higher-level events, which it distributes to the different XView objects (Figure 5.3). XView has a two-tired scheme in which the packages that support the various objects—panels, canvases, etc.—interact with the notifier directly, registing their own callback procedures. The application, in turn, registers its own callback procedures with CHAPTERS. IMPLEMENTATION RESULTS ] Model - Based Motion Trcklng Display Window | [input Flit V-J [ Rotate Model v " ) (Translate ModeTv^) [Depth Adjustment? J (Model Guided Grouping ( Line to Line Matching 7 ) ( Group to Croup Matching? ) Step size: Q Q (Show Movie) (Toggle visibility) (pelatG model) (Delete edges) (create test matches) (init view J (solve) (Toggle grid display) [ Save model viewpoint ) (Test Curve Drawing ) ( Connect ) ( Pre pro ) (Buckets) [Output File v " ) ( Test Rot )fTest Trans ) f Test Both ) Figure 5.2: Interface for Model-based Motion Tracking 55 CHAPTER 5. IMPLEMENTATION RESULTS 56 the object. mouse button control... I I Unix events: input on file descriptors 1 1 Notifier formats UNIX input into XView events, passes each event to the event procedure of the appropriate window XView events XView V V Control Panel Paint Canvas Application 1 1 N event proc for Paint Canvas notify proc for itemn notify proc for item 1 ss application's event procedures ' Application's notify procedures for panel items Figure 5.3: XView: Notification-based System. Each panel button is registered with its own callback procedures. Our interface is visually designed into two parts: control area and canvas. Within control area, we create various buttons and menu buttons, meanwhile, each callback procedure is registered for corresponding button and application programs are written in various callback procedures. Canvas is used interactively to test our implementation results, manipulate our model and display the image line segments. Various groupings and tests of matching are performed in this area, thresholds can be selected through the control of sliders. Chapter 6 Conclusions and Future Directions 6.1 Conclusions As we mentioned before, the goal of our system is to minimize the probability of making incorrect interpretations. The use of Model-Guided Groupings that we have presented assigns a central role to perceptual grouping as a way of reducing the ambiguities during the matching process in motion tracking. Most incorrect local matching, as shown from Lowe's line-to-line matching, can be eliminated through the use of the inherent contextual information formed from these groupings. As the results show in section 5.2, we found, particularly, that the use of junction grouping greatly improves the system performance. In comparison with line-to-line matching algorithm, the use of junction groupings as the matching tokens has been found very robust with respect to both translational motion and rotational motion. We also found that the model-guided junction groupings with three line segments are seldomly incorrectly matched, although the "junctions" with two line segments can still find incorrect matching candidates. Since the final set of correspondences will be greatly overdetermined, the final correct interpretation can still be reliably made even in the presence of some missing features (due to occlusion). 57 CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS 58 Lowe's line-to-line's matching can successfully interpreted the frame-to-frame slow motion (up to 10 pixels) in our current tests (the real maximum displacement depends heavily on the model structures which is determined from a particular viewpoint). In our implementation, the frameto-frame motion information, with the displacement up to 30 pixels, can be reliably recovered. Since the metric of the evaluations of different model-guided groupings are different, the optimal integration through the probability ranking is still unknown. However, as the author noticed, the correct integration should not be performed from a simple weighting scheme. Instead, we should examine the local interpretation of matched parallel pairs. For model-guided junction grouping, the running time for each iteration during the matching is only slightly more than double the time in comparison with line-to-line matching. However, since the correspondences established with the junction groupings are much less ambiguous, final probability ranking can be made properly. Thus the solution converged very quickly, normally within at most two iterations. 6.2 Future Directions The use of model-guided grouping for model-based motion tracking provides an attractive route for the further development of a robust motion tracking system with improved capabilities. Many research directions, as listed below, can be pursued if time allowed. Integration of Different Mo del-Guided Groupings Model-guided grouping based on cotermination, parallelism and collinearity should be incorporated into one system in order to reduce all sources of ambiguities during the matching process. Grouping based on proximity suggests that this method is crucial for limiting the computational complexity, as the computation is performed locally. Model-guided grouping based on CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS 59 cotermiiiation does give a robust and fast solution for the new viewpoint during the tracking. Most local ambiguities caused by close parallel pairs can also be avoided by using junction groupings. However, the iteration of model-guided parallel grouping would have eliminated the errors due to local inconsistent parallel relations, although it might cause slightly redundancy. It has been found however that if the parallel pair is not so "salient" the ambiguities in the detection process remain. The model-guided grouping based on collinearity can reduce the ambiguities caused by the multiple line matches, since a longer segments based on the fitting of two shorter ones can be replaced and thus have a higher ranking. Since the probability ranking for groupings based on cotermination and collinearity has the same metric, however, one grouping based on parallelism involved, the integration using a kind of weighting scheme dose not.yield an optimal solution. The proper integration should be based on the structure interpretation, rather than blindly testing the weighting scheme for general implementation. Extension of the features The current system is constrained by the use of line segments only. Provided the additional complexity of the segmentation is acceptable, the constraint can be removed by adding other features such as circular arc segments or general curve fragments. The use of curves in grouping and matching has been studied by Lowe[36], Dolan[15], Fischler[18] and Wolfson[59]. However, curve description and curve partitioning remain the research problems. Multiple object tracking system Tracking multiple objects would require at least a linear increase in computational complexity. Pylyshyn [49] proposed a FINST [48] - which posits a primitive identity maintenance mechanism that indexes and tracks a limited number of visual objects in parallel. His indexes are hypothesized to serve the function of binding visual features prior to subsequent pattern recognition. A FINST is an index to a particular feature or feature clusters on the retina, and a FINST keeps pointing to the "same" feature clusters as the cluster moves across the retina. The feature clusters he mentioned is similar to Lowe's grouping primitives, thus maintaining primitive identity based on model-guided groupings might be a promising route to parallelism. Enhanced Grouping Perceptual grouping suggested by Lowe was performed after features have been extracted. However, there are possibilities to perform the groupings during the segmentation stage. For instance, line segment grouping based on proximity can be partially realized during the recursive line segment formation, since each recursively segmented line fragment can assign the same group identification as long as they belong to the identical connected chain, this in turn can facilitate the grouping based on proximity at a later stage by examination of group identification. Current grouping is based on the geometrical information only, since other information such as gradients of each pixel, we can group the features based on other information such as intensity or contrast across a line segment or gradient information for a line segment. 60 Appendix A Canny and Deriche's Edge Detector Canny's Edge Detector According to Canny's three criteria[10], he leads to the solution f(x) such that 2f(x)-2\ f"(x) + 2\ f"'(x) + \ = 0 1 2 3 (A.l) The general solution for A . l is in a rather complexed form[10]. Since f(x) is antisymmetric, on inspection of the shape of the optimal filter, Canny observed that it is approximately the first derivative of a Gaussian: V2JT(7 j The major advantage for doing this is of practical consideration: representation simplicity and computation efficiency for the 2D extension: a Gaussian smoothed gradient estimation can be performed by separable 2D filtering. Defined ID Gaussian function as g(x): 9 { X ) = ^k~a ~ e & Canny's 2D filters are: d G(x,y) = g'(x)g(y) (A.2) d G(x,y) = g(x)g'(y) (A.3) x y The separability of the 2D filters leads to the efficient implementation: two ID filters will be used: one is smoothing filter - Gaussian, another is the derivative of Gaussian. 61 An efficient implementation of Canny's edge detector can be reorganized in a pipeline structure: Image Gaussian Smoothing in X direction Gaussian Smoothing in Y direction dx dy Non-maximum Suppression Figure A . l : Pipeline structure for implementation of Canny's filters. Using Sobel operators for dx and dy, we can save about half of the convolution time. In comparison with the method using 2 separable derivative filters directly (See A.2, A.3), we tested several images with this kind of approximation and found the accuracy, of this implementation can reach above 97 percent. Deriche's Edge Detector Canny's design was developed for afiniteextent antisymmetric filter. Dealing with an infinite extent antisymmetric filter, Deriche leads to the general solution (approximation) for the same equation A . l : Similar to Canny's implementation of 2D separable filters, the separable filters are: D (x,y) = f(x)g(y) x D (x,y) = g(x)f(y) y 62 Here f(x) is the optimal derivative filter: F(x) = -cxe- \ \ (A.4) a x and #(2:) is the optimal smoothing filter: g(x) = k(a\n\ +l)e~ \ \ (A.5) a x k and c can be fixed by the normalization requirement, and can leads to the following values:. c= {l-e- Ya k = 1 + 2ae- - e~ a a 2a Let f(n) be the samples of f(x) and F(Z) its Z transform: F(Z) = J2f( ) ~ n z n = -oo,...,oo n Using Z transform: F(Z) = F.(Z) + F+{Z~ ) l where 1+ 61Z- + 6 Z - ' 1 1+ 6 Z+ 6 Z 2 2 X 2 2 with a = —ce~ sinco, 61 = — 2e~ cosu>, 62 = e~ . a a 2a The output sequence {y(m)} in response to {x(m)} as input to a system, with impulse response { / ( 7 1 ) } , can be obtained recursively according to: y (m) = x{m - 1) - 6 y ( m - 1) - b y {m- 2) y~(m) = x(m + 1) - 6i?/~(m + 1) - b y~(m + 2) + + + 2 1 j/(m) = a(?/ (m) - y-(m)) + 2 m=l,...,M m = M,...,l m = 1,.... , M (A.6) (A.7) (A.8) Now let us see how g(n) can be implemented recursively. Calculating the Z-transform G(Z) of <?(n), we obtain: G(Z) = G+iZ- ) + G-(Z) 1 Here ap + a i Z ~ ^ } " 1+ 6^-1 + 6 2 ^ 63 aZ + aZ 2 2 3 with ao = C2 fli = (—C2C0SU a = —c b 3 2 + cisin^e 2 b\ — —2e~ 2 a = a\ — c b\ -01 a b = e~ coslj + {x(m)}, with the impulse response a x(m) — aix(m — 1) - biy (m — 1) - b y (m - 2) TO = 1,..., M + + 0 2 y~(m) = a x(m + 1) - aa:(TO + 2) - biy~(m - 1) - b y~{m + 2) 2 2a 2 The output {y(m)} of the convolution of an input sequence {g(n)} is obtained as follows: y (m) = 2 2 3 y(m) = J/ (TO) + + 2/"(TO) 64 TO = 1,...,M m=M, ...,1 (A.9) (A.10) (A.ll) Appendix B Least-Square Line Fitting Given n points: (xi,yi), i — 1,2,...,ra. Our objective is to minimize the sum of squares of perpendicular distances (Equation B.2) of these n points from a fitting line represented as: x sin<? — j/cos 6 + p = 0 e = YL( i 2 x ~ Vi s i ne c o s6 (B.l) ~ pf (-) B 2 Solving parameters (0,p) for least-square fitting line From =0 dp (B.3) de2 we can obtain: p — — xq sin 0 + yo cos 0 and xo = yo - n Define: x'i - xi - x , y[ = yi - yo 0 we can obtain: i i zZy'i = £ yi - y<> 2 2 n i i ]T x'iV'i = Y2 ? ~ oVo Xiy 65 nx 2 Let • i i From i de d6 2 (B.4) = 0 we can obtain 9: cos 9 = Ib + c 2c and sin 0 = (a/2c)/ cos 9 Finding the projection points on Least-Square fitting line of the end points. Figure B.l: Extraction of Line Segments by least-square line fitting. Given a point (xq,yq) (Figure B.l), following is to find the projection point (xp, yp) to the line (See B.l). The line passing (xq,yg) and perpendicular to the line (Equation B.l) is represented as: (x - x )cos9 + (y-y )sin9 = 0 q (B.5) q Solving equations B.5 and B . l , we can get the projection point: (x ,y ). Let d = xq cos 9 + yq sin 9 p p then x = d cos 9 — p sin 9 p y = p cos 9 -f d sin 9 p Similarly, we can obtain another projection point form line segment's another end. 66 Appendix C Finding Pseudo-Cotermination Point Given n lines, each is represented as (x sin 6{ — y cos 6{ + pi), i = 1,2,..., n. Our objective is to find a point with minimum sum of squares of perpendicular distance from each line. e 2 = ^(a; sin #- — y cos -f pi) s i — x Y2 i i 2 s n 2 0i ~ v Y/ i x de * (Cl) 2 s n cos ^« ^ 2 cos2 ^* ^ X] ^ 2a; 8 sm t — ^ypi t z2 cos t ^« + P* 2 i 2 —j— = 2(x ^2 sin Oi — sinffi cos 6i + Y^Pi « « « 2 -r— = ^«) sm 2(—a; ^ sin^j-cos^,-+ cos 0j — cos^,) 2 (C-2) (0.3) Let A = ]p sin Oi ]P cos 0j 2 « Set ^ = 0, and ^ sin Bicosdi) 2 i 2 i = 0, we have: a: ^ sin #j — ^ ^ sin 0; cos #j + />» sin 0j- = 2 « x ^2 i s m @i cos ^» — 2/ 0 (C4) i c o s 2 ^* ^ C O S ^' = 0 (C-5) A will only be zero if all the lines are parallel. Solving the above set of equations in x and y we get: . 67 t t Oi^picose^/A t t y = (-^(sin^cos^0l]ptSmtf/+5^8in ^^^cos^)/A t « . « i By examination of 2 (C.6) ( C J ) « (f e 2 2 2 ]T cos Oi > 0 2 we know that this gives us a minimum. We can also weight the lines according to the length of each segments by simply multiplying each line equation with weights u>i. 68 Bibliography [I] By Adobe Systems Inc., PostScript lishing Company, Inc., 1987. language, Tutorial and Cookbook. Addison-Wesley Pub- [2] M. Asada, M. Yachida and S. Tsuji, "Analysis of three-dimensional motions in blocks world," Pattern Recognition 17, 57-71, 1984. [3] N. Ay ache and B. Faverjon, "Efficient Registration of Stereo Images by Matching Graph Descriptions of Edge Segments," Inter. Jour, of Computer Vision, 1987. [4] P.J. Besl and R.C. Jain, "Three-Dimensional Object Recognition," ACM Computing Sur- veys, V17, N l , 75-154, 1985. [5] I. Biederman, "Human Image Understanding: Recent Reaserch and a theory," CVGIP 32, 29-73, 1985. [6] H.G. Barrow, J.M. Tenenbaum, R.C. Bolles and H.C. Wolf, "Parameter correspondence and chamfer matching: two new techniques for image matching," in Proc. 5th Int. Joint Conf. Artificial Intellegence, Cambridge, MA, 659-663, 1977. [7] G. Borgefors, "Hierarchical Chamfer Matching: A Parameteric Edge Matching Algorithm," IEEE Tansactions on PAMI, Vol 10, No. 6, Nov, 1988. [8] T. O. Binford, "Inferring Surfaces from Images," Artificial Intelligence 17, 205-244, 1981. [9] D.J. Burr, " Elastic matching of line drawings," 5th International Conference on Pattern Recognition, v l , 223-228, 1980. [10] J.F. Canny, "Finding Edges and Lines in Images," MIT A l Lab., TR-720, 1983. [II] C S . Clark, D.K. Conti, W.O. Eckhardt, T.A. McCulloh, R. Nevatia, and D.Y. Tseng, "Matching of Natural Terrain Scenes," 5th International Conference on Pattern Recognition, v l , 217-222, 1980. [12] J.L. Crowley, P. Stelmaszyk and C. Discours, "Measuring Image Flow by Tracking EdgeLine," Second International conference on Computer Vision, Tampa, Florida, USA, 658-664, 1988. 69 [13] R. Deriche, "Using canny's criteria to derive a recursively implemented optimal edge de- tector," IJCV, 1(2), 167-187, May 1987. [14] R. Deriche and 0. Faugeras, "Tracking Line Segments," First European Conference on Computer Vision, Antibes, France, 259-268, April 1990. [15] J. Dolan and R. Weiss. "Perceptual Grouping of Curved Lines," In Proc. Image Under- standing Workshop, 1135-1145, 1989. [16] E.R. Dougherty and C R . Giardina, Mathematicaal autonomous systems, Prentice-Hall, Inc., 1988. methods for artificial intelligence and [17] M. Edhiro, I. Kokubo, and T. Asano, "A New Point-Location Algorithm and Its Practical Efficiency - Comparison with Existing Algorithms," ACM Trans, on Graphics, v3, N2, 86-109, April 1984. [18] M.A. Fischler and R.C. Bolles, "Perceptual organization and curve partitioning," IEEE Trans, on PAMI, V 210-215, 1986. [19] M.A. Fischler and 0. Firschein, Readings in Computer and Paradigms, Morgan Kaufmann, Los Altos, 1987. Vision: Issues, Problems, Principles [20] D.B.Gennery, "Tracking Known Three-Dimensional Objects," Proceedings of the National Conference on Artificial Intelligence, Pittsburgh, 13-17, 1982. [21] D.B.Gennery, "Stereo vision for the acqusition and tracking of moving three-dimensional objects," Techniques for 3-D Machine Perception, A. Rosenfeld, ed., North-Holland, New York, 1983. [22] W. L. Grimson and T. Lozano-Perez, Model-Based Recognition and Localization from Sparse Range or Tactile Data, Int. J. Robotics Res., V3, N3, 3-35, 1984. [23] W. L. Grimson, Object MIT Press, 1990. [24] D. Heller, Recognition by Computer: The Role of Geometric Constraints, Th XView Programming Manual. O'Reilly & Associates, Inc., 1990. [25] R. Horaud, "New Methods for Matching 3-D Objects with Single Perspective Views," IEEE Tansactions on PAMI, Vol 9, No. 3, May, 1987. [26] R. Horaud, F. Veillon, T. Skordas, "Finding Geometric and Relational Structures in an Image," First European Conference on Computer Vision, Antibes, France, 374-384, April 1990. [27] S.L Horowitz and T. Pavlidis, "Picture segmentation by a tree traversal algorithm," Journal of A C M , V. 23, 368-388, 1976. 70 [28] D. Huffman, "Impossible objects as nonsense sentences," Machine Intelligence and D. Michie, eds, 295-323, Edinburgh University Press, Edinburgh, 1971. 6, B. Meltzer Three-Dimensional Recognition of Solid Objects from a TwoDimensional Image, MIT AI-TR 1045, 1988. [29] D. P. Huttenlocher, [30] D. W. Jacobs, The. Use of Grouping in Object Recognition, 1023, MIT A l Lab Memo. [31] C. Koch and S. Ullman, "Selecting one Among the Many: A simple Network Implementing Shifts in Selective Visual Attention," MIT A l Lab Memo 770, Janurary, 1984. [32] X. Li and Y. Ando, "Edge Detection with Subpixel Accuracy Using Canny's Detector," CPSC525 project paper, Computer Science Dept., UBC, 1990. [33] D. G. Lowe, 1985. Perceptual Organization and Visual Recognition, Kluwer Academic Publishers, [34] D. G. Lowe, "The Viewpoint Consistency Constraint," IJCV 1, 57-72, 1987. [35] D. G. Lowe, "Three-Dimensional Object Recognition from Single Two-Dimensional Images," Artificial Intelligence 31, 355-395, 1987. [36] D. G. Lowe, "Organization of Smooth Image Curves at Multiple Scales," IJVC 3, 2; 119130, 1989. [37] D. G. Lowe, "Fitting Parameterized 3D Models to Images," T R 89-26, Computer Science Department, 1989. [38] D. G. Lowe, "Integrated Treatment of Matching and Measurement Errors for Robust Model-Based Motion Tracking," Third International conference on Computer Vision, Osaka, Japan, 436-440, 1990. [39] D. G. Lowe, "Probability theory as ail alternative to complexity," Commentatory on J.K. Tsotsos' target paper "Analyzing vision at the complexity level," Behavioral and brain sciences, 13, 423-469, 1990. [40] D- G. Lowe, "Fitting Parameterized Three-Dimensional Models to Images," IEEE Tans- actions on PAMI, Vol 13, No. 5, May, 1991. [41] D. Marr, Vision, W. H. Freeman and Company, San Francisco, 1982. [42] D. Marr and Hildreth, "Theory of edge detection," 187-217, 1980. Proc. Roy. Soc. London B, vol 207, [43] T. Matsuyama, H. Arita and M. Nagao, "Structural Matching of Line Drawings Using the Geometric Relationship between Line Segments," CVGIP, vol. 27, 177-194, 1984. 71 [44] It. Mohan and R. Nevatia, "Using Perceptual Organization to Extract 3-D Structures," IEEE Tansactions on PAMI, Vol 11, No. 11, November, 1989. [45] D.W. Murray, D.A. Castelow and B.F. Buxton, "Frome Image Sequences to Recognized Moving Polyhedral Objects," IJVC 3, 181-208, 1989. [46] A. P. Pentland(Ed.), From Pixel to Predicates: Recent Robotic Vision, Ablex Publishing Corporation, 1986. Advances in Computational and [47] T. Poggio and C. Koch, "Ill-posed problems in early vision: From computational theory to analogue networks," Proc. R. Soc. London B, 226, 1985, 303-323. [48] Z. Pylyshyn, "The role of location indexes in spatial perception: A sketch of the FIN ST spatial-index model," Cognition, 32, 65-97, 1989. [49] Z. Pylyshyn and R.W. Storm, "Traking multiple independent targets: Evidence for a parallel tracking mechanism," Spatial Vision, vol.3, n3, 179-197, 1989. [50] L. Quan and R. Mohr, "Matching Perspective" Images Using Geometric Constraints and Perceptual Grouping," Second International conference on Computer Vision, Tampa, Florida, USA, 679-683, 1988. [51] P.L. Rosin and. G.A. West, "Segmentation Curves into Elliptic Arcs and Straight Lines," Third International conference on Computer Vision, Osaka, Japan, 75-78, 1990. [52] A. Treisman, "Preattentive processing in vision," CVGIP 31, 156-177, 1985. [53] S. Umeyama, T. Kasvand and M . Hospatal, "Recognition and Positioning of ThreeDimensional Object by Combining Matchings of Primitive Local Patterns," CVGIP, vol 44, 58-76, 1988. . [54] A . M . Wallace, "Matching segmented scenes to models using pairwise relationships between features," Image and Vision Computing, Vol 5, N 2, 114-120, May 1987. [55] G. Verghese, C. R. Dyer, "Real Time, Model-Based Tracking of Three-Dimensional Objects," TR-, Computer Sciences Deparment, Universsity of Wisconsin. [56] D. Walters, "Selection of Image Primitives for General-Purpose Visual Processing". Special Issue on Human and Machine Vision, Part II, in CVGIP, v37, 261-298, 1987. [57] A.P. Witkin and J.M. Tenenbaum, "On perceptual organisation," In Alex B. Pentland, editor, From pixels to Predicates, Chapter 7, pl49-169. Ablex Publishing Corporation, Norwood, New Jersey, 1986. [58] S. Ullman, The Interpretation of Visual Motion, The MIT Press, 1979. 72 [59] H.J. Wolfson, "On curve matching," May 1990. IEEE Tansactions on PAMI, Vol 12, No. 5, 483-489, [60] Z. Zhang and O.D. Faugeras, "Tracking and Grouping 3D Line Segments," Third International conference on Computer Vision, Osaka, Japan, 577-580, 1990. [61] S.W. Zucker, "The diversity of feature grouping," In Michael Arbib and Allen Hanson, editors, Vision, Brain, and Cooperative Computation, 231-262, MIT Press, 1988. 73
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- The use of model-guided grouping in model-based motion...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
The use of model-guided grouping in model-based motion tracking Li, Hsün 1991
pdf
Page Metadata
Item Metadata
Title | The use of model-guided grouping in model-based motion tracking |
Creator |
Li, Hsün |
Publisher | University of British Columbia |
Date Issued | 1991 |
Description | A general motion tracking system consists of three major parts: segmentation, correspondence and viewpoint verification. This thesis presents improved methods for solving the correspondence problem. Most previous approaches to solving the correspondence problem have used lower-level primitives such as points and lines as their matching tokens. It has been found that if the matching tokens, such as points and lines, are less distinctive, greater ambiguity in matching inevitably arises. The objective of this thesis is to develop a robust solution to the correspondence problem even when the frame-to-frame motion is relatively fast. A new approach called Model-Guided Grouping, which is used to derive intermediate-level structures as our matching tokens, has been introduced. The term Model-Guided comes from the fact that the groupings are guided and derived locally, with the contemporary use of model structures, around the predicted model during the object tracking. We choose junctions and parallel pairs as our matching tokens, thus the information coded in these structures is relatively invariant in consecutive frames. The matching strategy is coarse-to-fine, and partial matching will also be allowed when occlusions are present. The method for evaluation of probability of accidental match based on junction groupings will be discussed. Comparisons are made between the current system and Lowe's previous system. The results show that with a slight increase in computational cost, the use of the higher-level groupings as matching tokens leads to a much more robust model-based motion tracking system. Keywords: motion tracking, perceptual grouping, model-guided grouping, correspondence, model-based vision, segmentation, token matching, viewpoint verification. |
Extent | video/mp4 |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-11-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051966 |
URI | http://hdl.handle.net/2429/30018 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-UBC_1991_A6_7 L49.pdf [ 4.43MB ]
- Metadata
- JSON: 831-1.0051966.json
- JSON-LD: 831-1.0051966-ld.json
- RDF/XML (Pretty): 831-1.0051966-rdf.xml
- RDF/JSON: 831-1.0051966-rdf.json
- Turtle: 831-1.0051966-turtle.txt
- N-Triples: 831-1.0051966-rdf-ntriples.txt
- Original Record: 831-1.0051966-source.json
- Full Text
- 831-1.0051966-fulltext.txt
- Citation
- 831-1.0051966.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051966/manifest