Face detection by facets : combined bottom-up and top-down search using compound templates
Holst, Glendon Randal (2000)

FACE DETECTION BY FACETS: COMBINED BOTTOM-UP AND TOP-DOWN SEARCH USING COMPOUND TEMPLATES

by GLENDON R. HOLST

B.Sc. Honours, The University of British Columbia, 1997

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming to the required standard

David G. Lowe
James J. Little

THE UNIVERSITY OF BRITISH COLUMBIA
June 2000
© Glendon R. Holst, 2000

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science, The University of British Columbia, Vancouver, Canada

Abstract

As detection domains increase in size and complexity, new techniques are needed to effectively search the image and feature space. In this thesis, I explore one such approach to object recognition in the domain of face detection. This approach, dubbed compound templates, is compared to a single template approach. The developed system, Facets, provides an implementation of both techniques to enable fair comparison. The compound template technique uses subfeatures and spatial models to represent a compound object (such as a face). From these compound models, hypothesis-based search then combines top-down and bottom-up search processes to localize the search within the image and feature space. Detected subfeatures become evidence for facial hypotheses, which then guide local searches for the remaining subfeatures based upon the expected facial configuration. The compound technique is described, and a comparison of the compound templates technique with a single template technique in a mug-shot style face domain is presented. A description of the implementation, along with issues surrounding the compound templates approach, is also provided. Attention is paid to performance, including both efficiency and accuracy. The results are complex, but the strengths, weaknesses, and various trade-offs of the two techniques are detailed. The combined bottom-up and top-down approach of compound templates demonstrates a clear advantage over bottom-up only approaches. The compound templates approach also demonstrates better performance for feature-sparse images, detection accuracy, domain coverage, and for domains with increasing size.
Table of Contents

Preliminaries
  Abstract
  Table of Contents
  List of Tables
  List of Figures
  Acknowledgements

Chapter 1 - Introduction to Face Detection
  1.1 The Domain
  1.2 Related Work
    1.2.1 Single Template
    1.2.2 Compound Representations
    1.2.3 Combined Search
    1.2.4 Compound and Combined Face Detection
    1.2.5 Motivations
  1.3 Overview
    1.3.1 General Approach
    1.3.2 General Issues
    1.3.3 Compound Templates
    1.3.4 Concurrent Interests

Chapter 2 - Techniques for Facets
  2.1 Simple Templates
  2.2 Compound Face Model
  2.3 Localized Search
  2.4 System Runtime
  2.5 Future Implementations

Chapter 3 - Data and Analysis
  3.1 Template Database Acquisition
  3.2 Training Performance
  3.3 Testing
  3.4 Non-Faces
  3.5 Scale and Density
  3.6 Comparisons

Chapter 4 - Conclusion

Chapter 5 - References

List of Tables

Table 1 - Domain detection coverage compared to template database completeness
Table 2 - Detection times for final detection run
Table 3 - Face detection results for the Nottingham and Yale image sets
Table 4 - Accuracy results for the Nottingham and Yale image sets
Table 5 - Non-face image results
Table 6 - Non-face detection time results

List of Figures

Figure 1 - The CMU 1104 image
Figure 2 - The Yale image database
Figure 3 - The Nottingham image database
Figure 4 - Compound and combined technique search process
Figure 5 - Compound face model with four subfeatures
Figure 6 - Typical facial subfeatures
Figure 7 - Feature templates with masks
Figure 8 - Template pyramids to scale for face, eyes, nose, and mouth
Figure 9 - 2D Face Model
Figure 10 - Facets Runtime Architecture
Figure 11 - Selecting a face to create a template
Figure 12 - Editing the template mask
Figure 13 - Creating a template pyramid from a template
Figure 14 - Selecting subfeatures from an image to create a template
Figure 15 - Creating a manifest record for a template pyramid
Figure 16 - Simple template detection example
Figure 17 - Compound template detection example
Figure 18 - Simple template accuracy test
Figure 19 - Compound template accuracy test
Figure 20 - Simple template detection times
Figure 21 - Compound template detection times
Figure 22 - Tree with 19 false positives using simple templates
Figure 23 - Tree with 18 false positives using compound templates
Figure 24 - Sky with 11 false positives using simple templates
Figure 25 - Sky with 5 false positives using compound templates
Figure 26 - Canyon with 14 false positives using simple templates
Figure 27 - Canyon with 8 false positives using compound templates
Figure 28 - Simple template detection performance by image scale
Figure 29 - Compound template detection performance by image scale
Figure 30 - Simple template detection times by image scale
Figure 31 - Compound template detection times by image scale
Figure 32 - Simple template detection times by facial density
Figure 33 - Compound template detection times by facial density
Figure 34 - Search time comparisons for various partial approaches

Acknowledgements

To Lisa Jackson for such a good start. To my grade 8 science teacher, for the best answer possible.
To the amazing residents of St. John's College, for enriching discoveries, restored faith in humanity, and hope for the future. To Hayedeh Behzad... you know! ;-) To David Lowe for patience, supervision, and proofreading.

Chapter 1 - Introduction to Face Detection

Efficient and effective object recognition techniques are needed for increasingly large and complex image domains. One such candidate approach was implemented and tested for the domain of face detection. This thesis presents the result of that work.

1.1 The Domain

Faces are diverse, semi-rigid, semi-flexible, culturally significant, and part of our individual identity. As a domain they are already well studied, and many image databases already exist. For complexity, the domain is extensible along a continuum ranging from expressionless frontal, to increasingly varied expression, to increasing pose variance, to increasing artistic interpretation.

Face recognition systems [Konen], [Brunelli], [Beymer] already exist for a variety of purposes, from surveillance to mug shot queries. Many of these systems use eigenfaces (principal components); however, eigenface systems need to first normalize the face image. Normalization requires a priori location of the face and its subfeatures. Face detection systems could locate the face and subfeatures in the image as a preliminary step for face recognition. Both face recognition and face detection are categorization problems, except that recognition categorizes faces by individual, and detection categorizes by face or non-face.

The chosen domain is face detection of photographic, grey-scale, near frontal, mildly expressioned faces. This domain is practical yet interesting; challenging yet tractable. Examples of the domain follow.

[Footnote 1: The CMU 1104 image comes from the online testing page for the CMU Face Detection system. The URL is: http://www.ius.cs.cmu.edu/IUS/usrpO/har/FaceDemo/images/1104/input.gif. The Yale images are available from: http://cvc.yale.edu/projects/yalefaces/yalefaces.html. The Nottingham images are available from: http://pics.psych.stir.ac.uk/cgi-bin/PICS/New/pics.cgi]

Figure 1 - The CMU 1104 image contains 400 images of 40 different people, in 10 poses each. These faces formed the training set.

Figure 2 - The Yale image database contains 165 faces from 15 different people, in 11 poses each. These faces were part of the testing set.

Figure 3 - The Nottingham image database contains 100 faces from 50 women and 50 men. These faces were part of the testing set.

1.2 Related Work

There are many other approaches to the face detection problem. Some techniques rely on whole face templates or models for detection [Sung], [Lanitis], [Rowley]; others rely on facial subfeatures [Viola], [Yuille], [Takacs]. A variety of detection techniques are employed: correlation [Brunelli], neural nets [Rowley], cresceptrons [Weng], eigentemplates [Shakunaga], [Sung], [Viola], Bayesian models [Viola], and flexible models [Lanitis], [Yuille]. Some approaches use bottom-up search [Weng], others use top-down search [Sung], and still others combine both search types [Shakunaga].
Combining bottom-up and top-down processes appears to be a promising way to guide the search efficiently. There are other systems, in other domains, which appear to use this approach successfully [Matsuyama], [Milanese].

1.2.1 Single Template

Kah-Kay Sung and Tomaso Poggio propose a technique for face detection [Sung] which uses a small set of full face templates and a small set of non-face templates. This approach uses 'view-based' models, created by the system from training examples; hence the templates are essentially model images. As described in their paper, "[They chose] a piece-wise continuous modelling scheme because face patterns appear to occupy a smoothly varying and continuous region in the vector space — i.e., more often than not, a face pattern with minor spatial and/or grey-level perturbations still looks like another valid face pattern." [Sung]

At 19x19 pixels the templates are small, striking a balance between discrimination power and computational efficiency. Though sufficient in size to distinguish a face, they do not contain sufficient detail to determine subfeature expression (e.g., direction of gaze is not modeled in the template views). The 12 view-based models are the prototypes and anti-prototypes that describe the face space. Each prototype is encoded using the 75 largest eigenvectors.

The process of detection is a classification of each window of the image, at different resolutions, as one or none of the prototypes. The window starts at 19x19 pixels and grows to 100x100 pixels by a factor of 1.2 each time. The window is then scaled down to 19x19, equal in size to the prototypes, and converted into an eigenface. A distance measurement is then taken between each of the 12 prototypes and the image window. The classifier is a Multi-Layer Perceptron (MLP) which learns to map these distance measurements to a yes/no face classification. Essentially, these distance measures either place an image near a face prototype, near a non-face prototype, or not near a face prototype. Only if the first case alone is true is a face detected.

The distances themselves are a 2-value concoction of the normalized Mahalanobis distance, in the space of the 75 top eigenvectors, between the window image and the prototype, and the Euclidean distance between the window image and its projection onto the eigenface. The Mahalanobis distance takes into account the higher variance of faces along the direction of the eigenvectors, so that a measured distance between the centroid of a prototype and an image becomes shorter for images which, while keeping the same Euclidean distance from the prototype centroid, lie closer to the higher variance eigenvectors (i.e., the boundary of equal distance to a prototype is not circular, as in Euclidean distance, but oval). The prototypes themselves are automatically generated by a k-means clustering algorithm from a database of examples. One database contains the normalized canonical faces; the other contains non-face examples discovered in the process of testing the system.
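For reference, the Mahalanobis distance underlying this measure has the standard form below. This is a sketch of the quantity being described, with x the windowed image vector, mu the prototype centroid, and Sigma the covariance estimated in the space of the top 75 eigenvectors; it is not necessarily Sung and Poggio's exact normalized variant:

$$d_M(x, \mu) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}$$

Along eigenvector directions with large variance, the inverse covariance shrinks the contribution of the deviation, which is what makes the equal-distance boundary oval rather than circular.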
Sung and Poggio claim that their technique is a general purpose feature detector, stressing that "[their] ultimate goal is to propose a general methodology for taking on feature detection tasks in multiple domains, including industrial inspection, medical image analysis and terrain classification, where target patterns may not be rigid or geometrically parameterizable, and where imaging conditions may not be within the user's control" [Sung]. How would their system behave as the domains are expanded (e.g., with multiple poses, or articulated objects), and what happens to system performance as the number of domain classes increases? How could we expand the system to detect subfeatures which require higher resolution templates than the enclosing feature, or combine this technique with other feature detection techniques, such as deformable templates [Yuille], better suited to a subdomain of the larger domain? It is these sorts of questions that motivated and informed the exploration into compound templates. Although not as sophisticated as the above eigentemplate approach, simple templates were used as the comparative norm in the Facets implementation. The combined and compound approach described in chapter 1, section 3.1, has the flexibility to combine image analysis techniques and to recognize articulated objects.

1.2.2 Compound Representations

On the question of multiple pose, David Beymer provides one solution as part of his face recognition under varying pose system. The feature finder in this system looks for both irises and a nose lobe. The search is performed on a 5 level image pyramid, with level 0 being the original image, and level 4 the lowest resolution version. Search begins at the top level, using 30 full face model templates that cover 5 rotations, 3 image plane rotations, and 2 scales. At each level, matches are determined by correlation scores greater than a predetermined threshold. At levels 3 and 4, all matches are found and then sorted by correlation score. Search then proceeds in a depth-first manner from the level 3 hypotheses. The first match at level 0 wins. At each level more templates are introduced, covering the pose space using smaller intervals and providing higher resolution templates. Not all the templates are used, rather just the small neighbourhood of templates surrounding the rough face hypothesis from the previous level. On the top levels the templates are whole faces, but they become templates for the individual features, eyes and nose, on the lower levels. One benefit is that the combinatorial explosion of potential faces is prevented while still benefiting from a larger number of feature exemplars. Instead of using eigenfaces as in Sung's approach, templates and image windows are compared using normalized correlation on gradient, Laplacian, and original grey-level versions. Recognition performance for the original grey-level images was 94.5%, and above 98% accuracy for the preprocessed images. [Beymer]

Attractive features of Beymer's approach are the use of face hypotheses to guide further search, and the use of subfeatures within the model framework.
The search process proceeds from general to specific, in a decision-tree-like fashion. As in the Beymer approach, the posited technique uses the location of the more general features to guide the search for lower level prototype matches. Likewise, the incorporation of subfeatures continues the search for more specific prototypes, but in a subregion. How would the system efficiency degrade as new domain objects were added? If the higher level prototypes perform their discrimination tasks well (i.e., there are fewer paths from the tree root to a specific low level prototype), then the search space might only grow logarithmically; otherwise the performance could be worse. It seems plausible, however, that the domain model space would grow much faster than any space of subfeatures, or image primitives. Subfeatures provide a compact way to describe the superfeature space, limiting the combinatorial explosion that affects the superfeature space. Subfeatures also allow for localized search of simpler features. A top-down approach, such as those discussed here, may benefit from the slower growing subfeature space; however, if the superfeature space is very large, having many non-generalizable top-level configurations, then many initial searches are required. Each of these searches may benefit from the hierarchical nature of the search (from superfeature to subfeature, or from general to specific), but without a priori knowledge the search must cover the entire superfeature space.

One way to provide the a priori knowledge is to use bottom-up search. First searching for the subfeatures provides information about the superfeatures in the image space; however, if the search is performed for all subfeatures, there is no need for the top-down search, and thus no way to benefit from it. Such a complete bottom-up search includes more of the subfeature space than the top-down search would require. Combining the bottom-up and top-down searches while preserving their benefits would seem a worthwhile goal.

1.2.3 Combined Search

Matsuyama and Hwang's SIGMA system detects houses, roads and driveways from aerial images. It must deal with a top-level model space (a superfeature space comprising the connection of roads with driveways with houses) where the possible spatial and orientational configurations are so enormous that no reasonable set of image level templates could holistically describe every possible scene. In dealing with the complexity of this domain, SIGMA incorporates bottom-up and top-down processes in a complementary way. SIGMA first looks for subfeatures of the domain, combines these together into a coherent upper-level view, and then searches for missing features. Extending Beymer's approach so that it combines both bottom-up and top-down processes, first searching for facial subfeatures (eyes, nose, mouth), requires that issues such as false positives, undiscovered features, overlapping hypotheses, and the enforcement or discovery of spatial relationships are dealt with.
In short, it requires resolving issues that the SIGMA approach addresses. The SIGMA system comprises three experts: the Geometric Reasoning Expert (GRE), the Low-Level Vision Expert (LLVE), and the Model Selection Expert (MSE), which correspond respectively to the three knowledge categories of scene domain knowledge, image domain knowledge, and meta-knowledge mapping the two knowledge domains. There are also databases for domain model classes and the runtime instances and hypotheses.

Detection first proceeds with a segmentation process searching for prominent features. This task is initiated by the MSE, which selects objects from the domain database and provides the LLVE with the object's description needed to perform the actual image analysis. The LLVE reports back the image level description of the found item (or possibly a Not-Found message). The MSE converts the image level description into a scene level object instance and places it in the runtime database. The GRE evaluates the object instances in the runtime database and determines, using the domain models, which other objects are related to the runtime instance. The GRE then instructs the MSE to search for the missing instances. When and if these new objects are found, the MSE again instantiates them and places them in the runtime database, where the GRE establishes the relationships between them (e.g., PART-OF or spatial-relationship). The GRE is also responsible for ensuring that the runtime database only contains consistent and coherent hypotheses, called interpretation networks. Several interpretation networks may exist in the runtime database at the same time, but they form independent and competing scene interpretations of the image. The quality of an interpretation is determined by its size, assuming here that larger networks require more image level structure to support them and hence are less likely to be random artifacts. In the course of detection, interpretation networks may be joined, split, duplicated, or removed according to rules in the GRE which work to remove interpretation conflicts and support coherent interpretations. The iterative interaction of these three expert modules detects features and creates a scene level interpretation. [Matsuyama]

This process can deal with false positives, undiscovered features, and spatial relationships, plus provide additional benefits. False positives are unlikely to become part of the final interpretation, since it is unlikely that they could form coherent relationships with enough features to compete with the correct interpretation. The use of top-down analysis aids in the detection of as-yet undiscovered features. The top-down expectations encourage the use of local and complete image analysis techniques of a type that would be computationally expensive to blindly perform on the entire image. The LLVE uses threshold-and-binarize at different threshold levels to discover a feature, and such trial and error attempts are localized thanks to the top-down process.
Deformable template techniques [Yuille], [Lanitis], which provide feature parameterization at a level of detail difficult to achieve with template classification, work best if placed near the feature (e.g., deformable eye templates work best if placed slightly below the actual eye), and top-down placement would support these requirements while removing the computational expense of unfocused search. The bottom-up process also reduces computational cost, since the GRE does not search for domain objects that do not have image level evidence. Another benefit comes from declaring spatial relationships in a more flexible and independent manner than a strictly image based approach. Since the SIGMA operative model separates scene and image domains, it is possible to have 3D scene domains and reason about their appearance in the 2D image domain and vice versa, a flexibility that view-based models do not afford. The scene level provides a more natural interface for the designer, who can express the domain using the concepts they are familiar with. Although the hypothesis generation, evaluation, and reasoning aspects of SIGMA add complexity to the system, they do provide a flexibility and elegance from the combined bottom-up, top-down processes they support. Whether this combination of search processes produces a more efficient or more accurate system, in comparison to other approaches, is unknown. Thus, part of the motivation for this thesis is to compare a combined and a standard approach.

1.2.4 Compound and Combined Face Detection

The aerial image domain of SIGMA is more complex than the face domain proposed here. Is the corresponding complexity of the SIGMA system warranted in the face domain? The compound technique proposed here is envisioned as generalizable to other domains and other image analysis techniques; the face domain and the Facets implementation are but a single instance in the exploration.

Takeshi Shakunaga, Keisuke Ogawa, and Shohei Oki use combined bottom-up and top-down search in their face detection system [Shakunaga]. Shakunaga's system uses 32x64 pixel templates to represent eyes, eyebrows, ears, nose and mouth center. Each face is represented by these eight subfeatures. The subfeatures are represented in a combined eigenvector space, and subfeature detection proceeds in a way similar to the Sung design. [Footnote 2: The Shakunaga system doesn't have non-subfeature prototypes, and instead of a MLP (Multi-Layer Perceptron) for classification it uses a threshold based on the Mahalanobis distance.] The first phase of detection searches for all subfeatures, collecting those above a given threshold. The second phase enumerates 10 feasible combinations of found instances. A 3D face model is created for each combination, and its pose parameters are estimated based on subfeatures and defaults. The third phase performs a local top-down search for any missing subfeatures. Missing subfeature attributes are calculated from the 3D face model and its pose.
If a new subfeature instance is detected, the face model parameters are re-calculated, and phase three repeats until no more subfeatures are found, or all subfeatures are found. Although the first search phase searches for all features, it discards the poor performers to limit the combinatorial explosion of face hypotheses. The top-down phase then searches locally for these discarded subfeatures, improving the correct detection rate from around 65% to near 92%.

1.2.5 Motivations

Some single template techniques [Sung] are advertised as generalizable to larger domains. For the compound techniques [Beymer], or those combining bottom-up and top-down search [Matsuyama], [Shakunaga], the emphasis appears to be on detection performance. I was interested to see what efficiency merits a combined and compound approach would have compared to a single template approach. I was also interested to extend my understanding of the implementation issues involved in a combined approach. To this end I developed the Facets face detection system as a platform to compare both approaches in equivalent implementations.

1.3 Overview

The compound template technique is an instantiation of the more general compound and combined approach outlined below. Some key issues of the general approach are discussed, followed by a description of the compound face model and methodology.

1.3.1 General Approach

Figure 4 - Compound and combined technique requirements: illustration (top) of the multi-level structure, and legend for the combined search process (bottom). The feature instance is shown with four subfeatures (two found, two missing), and the model used to calculate feature model parameters.

In general terms, a compound and combined technique requires:

1. An object model with a multi-level structure and partitioning subfeatures. Two or more levels are required to invoke a combined bottom-up and top-down search. There may be any number of object models.

2. At a minimum, the base subfeatures (those features which do not themselves have subfeatures) require an associated image detection procedure. Other features may also have associated image detection procedures, but this is optional. The image detection procedure may implement any image analysis technique.

3. An invocation queue for search requests. This could be a FIFO queue (essentially a direct call to the image detection procedures) or it could be a priority queue. As shown later, the queue helps weave the bottom-up and top-down processes together, and provides the designer with considerable flexibility.

The combined search process for the compound technique performs as follows:

A. The search begins by invoking the image detection procedures for some initial subfeatures (usually base subfeatures). The choice of initial subfeatures critically affects accuracy and efficiency. This is discussed in more detail below.
B. When a feature is found, a model instance for each related superfeature is created, and the detected subfeature is given to the superfeature model instance as evidence. Optionally, if the detected subfeature fits within the hypothesis space of an existing superfeature model instance, it is given to the existing instance. Managing overlapping hypotheses is discussed in more detail below.

C. A model instance represents a hypothesis about an object (or object-space) supported by evidence from the image. The model instance contains the subfeature instances already found (which provide evidence for the hypothesis), and the estimated feature space for all subfeatures (required for missing subfeatures). The model instance may also estimate model level parameters and instance states. A superfeature model instance may set the estimated feature spaces, or the model instance may calculate them itself.

D. Parameters, which might represent rotation, scale, and hypothesis strength, are calculated based upon the evidence (subfeature instances), potential defaults, and model requirements. Instance states might include the status of subfeature searches, or whether the model instance represents an object instance or an object-space. A model instance may represent the hypothesis for only a single object instance, or it may represent an object space. In the latter case, the model may choose the best object instance when the search completes.

E. When a model instance reaches a critical threshold of evidence or strength (normally the first subfeature), the top-down phase begins. The image detection procedures are called for the missing subfeatures, using the estimated feature space as the search parameter. Search scheduling issues are discussed below. Avoiding duplicated search is covered as part of the discussion on overlapping hypotheses, below.

F. A model instance completes its search either when it has exhausted the subfeature search space, or when it represents a complete object instance. The entire search completes when the initial bottom-up search completes and all model instances have completed their search.

1.3.2 General Issues

The efficiency and accuracy of a combined search depend upon the subfeature space used for the initial bottom-up search. If the initial search comprises the entire subfeature space, then the search becomes essentially equivalent to a pure bottom-up search. If the initial feature space is not sufficient to find at least one subfeature per object (or whatever the threshold may be), then not all the objects in the image will be detected. Using a scheduling queue provides a way to increase the feature space of the initial search without delaying the top-down phases. The discrimination quality of the initial subfeatures, and the computational resources needed to detect those features, are also important considerations. For a subfeature s_i, a detected instance S_i of subfeature s_i, and an object o containing subfeature s_i, the preference is for subfeatures with a higher probability p(o | S_i). The preference is also for subfeatures with efficient image analysis procedures. The design decisions made in the Facets system are provided below.
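To make steps A-F and the role of the scheduling queue concrete, here is a minimal, runnable sketch of the combined search loop. All names (Request, Hypothesis, combined_search, detect) are hypothetical illustrations rather than the Facets API, and the hypothesis bookkeeping is reduced to its simplest form:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    """A queued search request; lower priority values run first."""
    priority: float
    part: str = field(compare=False)      # subfeature type to search for
    space: tuple = field(compare=False)   # region of feature space to search

class Hypothesis:
    """A model instance: found subfeatures plus the parts still missing."""
    def __init__(self, parts):
        self.parts = tuple(parts)
        self.found = {}                   # part -> (location, strength)

    def missing(self):
        return [p for p in self.parts if p not in self.found]

    def estimated_space(self, part, whole):
        # A real model would predict a small region from the estimated
        # pose; this placeholder just reuses the whole-image space.
        return whole

def combined_search(initial_parts, all_parts, detect, whole_image):
    """detect(part, space) -> list of (location, strength) matches."""
    queue, hypotheses, searched = [], [], set()
    for part in initial_parts:                 # bottom-up phase
        heapq.heappush(queue, Request(0.0, part, whole_image))
    while queue:
        req = heapq.heappop(queue)
        if (req.part, req.space) in searched:  # avoid duplicated search
            continue
        searched.add((req.part, req.space))
        for loc, strength in detect(req.part, req.space):
            # Attach the evidence to an existing hypothesis missing this
            # part, or create a new superfeature hypothesis for it.
            h = next((h for h in hypotheses if req.part in h.missing()), None)
            if h is None:
                h = Hypothesis(all_parts)
                hypotheses.append(h)
            h.found[req.part] = (loc, strength)
            for part in h.missing():           # top-down phase
                space = h.estimated_space(part, whole_image)
                heapq.heappush(queue, Request(-strength, part, space))
    return hypotheses
```

A FIFO queue would simply run requests in arrival order; the priority field is what lets strong hypotheses pre-empt the remaining bottom-up work, and it is also where stage separation or aggregate search logic could be attached.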
Initial bottom-up search does not require base subfeatures. A mid-level subfeature might have better discrimination powers and lower computational cost. To illustrate, imagine a face model that has eyes, nose, and mouth as mid-level subfeatures. The eye subfeatures are models which contain subfeatures for iris, pupil, lashes, and whites. The mouth subfeature is a model with subfeatures for teeth, lips, and tongue. Searching for a face by first searching for irises, pupils, lashes, eye whites, teeth, lips, and tongues makes little sense, since such subfeatures might not exist in the image, or are not easily discriminated. Searching for these subfeatures is best done after the enclosing feature is localized. For this reason it is not necessary to initiate the bottom-up search from the base subfeatures when intermediate subfeatures provide for more efficient and accurate detection.

When subfeature instances are detected, they are evidence for some object level hypothesis space. It is very likely that overlapping hypotheses are created, especially if the initial subfeature search space is complete enough to ensure the detection of all objects. It is important to merge similar model instances before they instigate their top-down searches. There are many ways to perform the overlap detection and merging; the Facets system provides rudimentary facilities for this, described later.

The scheduling queue provides a way to prioritize the image level search. For example, in the initial bottom-up search, the queue could prioritize based upon the discrimination capabilities and the resources required for the subfeatures. As mentioned, this would provide a way to weave in the top-down searches, performing them before all the bottom-up searches have completed. The queue could also let the designer separate the search into temporally distinct stages (e.g., bottom-up stage, first top-down stage, second top-down stage, etc.). Aggregate searches allow additional search control logic that depends on several search results, for example, cancelling or delaying the search for one subfeature if another subfeature was not detected.

1.3.3 Compound Templates

The compound template technique is a type of compound and combined method. For a model it uses a 2D plane with subfeature points and four degrees of freedom: rotation in the image plane, scale, and <X,Y> location. The image analysis procedure for subfeatures uses normalized correlation. In this implementation there are four subfeatures partitioning the face, and the correlation is performed on the grey-scale image using grey-scale image templates.

A template specifies the spatial arrangement of subfeatures. A 2D model also specifies spatial arrangement, but with more flexibility. With whole templates, the entire template is searched, even if initial portions of the template are poor matches. It would appear beneficial to perform a
Using a model with template s u b regions (subfeatures) gains the flexibility of the model, o c c l u s i o n support, combinatorial s a v i n g s for the object representations, plus control over the s e a r c h p r o c e s s . T h e c o m p o u n d template representation for a c o m p o u n d face u s e s four subfeature types: left eye, right eye, nose, and mouth.  correlation  for the i m a g e a n a l y s i s p r o c e d u r e .  Figure 6  - Typical facial  subfeatures  E a c h subfeature type contains a set of m a s k e d templates used by the normalized correlation image detection procedure. T h e mask describes a, possibly, non-rectangular region of interest in the template that is used for c o m p a r i s o n s . A subset of t h e s e templates is specified a s initial s e a r c h candidates. A complete s e a r c h , over the entire image, is performed for these templates. W h e n the correlation s c o r e for a template is greater than the threshold, the c o r r e s p o n d i n g subfeature instances are created. A s features are found and instantiated, c o m p o u n d face instances are created to represent the face hypothesis evidenced by the subfeature instances. B a s e d on the 2D model and the subfeature evidence, the face instance requests localized s e a r c h e s for the remaining features. The strength of the initial e v i d e n c e determines which l o c a l i z e d s e a r c h e s are p e r f o r m e d first. T h e e v i d e n c e strength for a subfeature instance is the correlation score for the template match. The strength for a c o m p o u n d face is partly the evidence strength of its contained subfeatures, and partly the inverse of  the d i s t a n c e from the current configuration (spatial, orientation, and size) of the subfeatures c o m p a r e d to the ideal model. Implementation details, s u c h as managing overlapping hypotheses, c o m p o u n d face strength c a l c u l a t i o n s , the initial subfeature s p a c e , aggregate s e a r c h optimizations, and more, are d i s c u s s e d below in the T e c h n i q u e s chapter. 1.3.4  Concurrent  Interests  T h e F a c e t s s y s t e m , b e s i d e s serving its primary goal as a testbed to c o m p a r e two face detection a p p r o a c h e s , served a s an exploratory tool for my other interests. S o m e t i m e s t h e s e interests informed F a c e t s beneficially. I find s y s t e m s like S I G M A [Matsuyama], and C o p y c a t [Hofstadter] interesting b e c a u s e they create their own models during the perceptual p r o c e s s . S I G M A uses declarative rules to specify valid connections of roads, driveways, and h o u s e s . T h e system then u s e s the image to inform the development of the s c e n e model, and the s c e n e model to inform the analysis of the image. C o p y c a t creates a model of an analogical mapping by c o m b i n i n g existing relationship c o n c e p t s . T h e generation p r o c e s s is intriguing in that it results from the interaction of many parallel C o d e l e t agents which both add and remove relationships from the mapping model. T h e parallel C o d e l e t s attracted my interest in agent oriented p r o c e s s e s , and the d y n a m i c model generation m e s h e d with my interest in prototype languages. However, the attempt at declarative model generation w a s replaced by the simple 2 D model described later. 
While the object model of F a c e t s supports runtime generation of c l a s s e s , much of its use w a s removed during c o d e profiling. S u c h runtime support w a s not n e e d e d by the models u s e d . T h r e a d overhead from attempted parallelism w a s detrimental to performance, especially for the smaller localized s e a r c h e s . This was r e m o v e d , and now only one s e a r c h occurs at a time (although aggregate s e a r c h optimization provides s o m e of its benefits). S e a r c h e s are still prioritized, though. Interest in anytime algorithms informs the implementation of c o m p o u n d templates. T h e state of a face model instance is a l w a y s updated with the  1 6  most recent information available. Stopping the detection p r o c e s s yields all f a c e s found so far with the best combination of subfeatures yet found.  17 Techniques  2  for  Facets  T h i s chapter details the core techniques u s e d along with the F a c e t s s y s t e m implementation. 2.1  Simple  Templates  T h e foundation of the F a c e t s face detection s y s t e m is the simple template. A simple template is c o m p o s e d from a pyramid of m a s k e d grey s c a l e images. Detection is performed using normalized correlation on the s o u r c e image pyramid. F a c e t s reads any image format supported by QuickTime® and its own V i s a g e format. V i s a g e is an object repository for i m a g e s , templates, p y r a m i d s , filters, manifests (list of objects and their s e a r c h attributes), and their attributes. Images are stored and represented in grey s c a l e as an array of real numbers in the range 0..1, where 0 is black and 1 is white. Images allow a s s o c i a t e d named attributes. T h e s e are used to store image angle and scale attributes. Templates are represented a s two images of the s a m e size. O n e image is the source, the other is the mask. T h e mask has a one-to-one pixel c o r r e s p o n d e n c e with the source and represents the degree to which a s o u r c e pixel is included for correlation c o m p a r i s o n s . In the mask, 0 represents the excluded source and 1 represents full inclusion. F a c e t s provides image scaling and rotation filters. Rotation is performed using bilinear interpolation. S c a l i n g is c a l c u l a t e d using the weighted a v e r a g e of all pixels in the source image that overlap the destination pixel. T h e weight is b a s e d on the percentage a r e a overlap between the destination pixel and the s o u r c e pixel (their intersection). For s e a r c h efficiency, both templates and images are represented at various s c a l e s in image pyramids [Burt]. Detection target images are s c a l e d from the bottom to top (largest to smallest). T h e original image is the largest (bottom - image 0) image, and the smallest (top - image N) image is close to 64 pixels in size. E a c h image X in the pyramid, where X >=1, is 0.75 the a r e a of the image X-1 below it. Scaling is performed from the original image e a c h time (i.e., image 1 is 0.75 the s i z e of image 0, image 2 is 0.5625 the area of image 0, etc.). For example, a training set image (containing 25 faces) would create a pyramid with a 2 3 5 x 2 8 5 pixel  18 bottom image, a 7x9 pixel top image, and containing 25 levels, numbered 0..24. Similarly, pyramids are u s e d for feature templates. To create template pyramids, the source image and the mask are s c a l e d by the s a m e amount, and the s c a l e d templates form the template pyramid. 
The scaling e n s u r e s that the maximum c h a n g e in width or height for the s c a l e d template is 2 pixels smaller than the preceding template in the pyramid. This e n s u r e s that templates are d i s c r e t i z e d to within a pixel of their neighbours, a s required for correlation. E a c h s c a l e d template is created from the original largest template. T h e area of the bottom and top template is specified when generating the pyramid. For f a c e s the range is 400 pixel a r e a for the bottom template and 64 pixels for the top, producing 7 levels average. For e y e s the range is 150 pixels for bottom and 12 pixels for top, with 5 levels average. For n o s e s the range is 200 pixels for bottom and 25 pixels for top, with 6 levels average. For mouths the range is 200 pixels for bottom and 12 pixels for top, with 8 levels average. T h e 400 pixel area for the largest face template w a s c h o s e n a s a 20x20 face similar to S u n g ' s 19x19 face templates. T h e other values were determined by initial experiments to find g o o d working v a l u e s . Template pyramids may also include rotated t e m p l a t e s . 3  E  F  Figure 7  N  M  - F e a t u r e t e m p l a t e s w i t h m a s k s (left) a n d r e l a t i v e s i z e (right). F i s f a c e , E  is e y e ,  N is n o s e , a n d M i s m o u t h .  3  Although the F a c e t s s y s t e m supports rotated t e m p l a t e s and pyramids, rotation w a s not u s e d  for testing s i n c e the d o m a i n c o n t a i n e d only close-to-upright f a c e s .  1  9  Face  Mouth  Figure 8  - T e m p l a t e p y r a m i d s to s c a l e for f a c e , e y e s , n o s e , a n d m o u t h ( m a s k s not  shown).  Normalized correlation [Moore] is used to detect a template in an image. For a source image S , T  a feature template F , and a location L of F T  overlapping s , the intersection of s £  the image s template F  t  t  and F  are I, T and M. I is the area of  t  contained within the intersection,  x  template F  x  contained within the intersection, contained within the intersection,  s i z e with length n . the i t h pixel of M.  i±  is the i t h pixel of i .  T  T is the image area of the M is the mask area of the i , T and M have the s a m e  T i is the i t h pixel of T.  I and T are the means of i and T respectively.  M i is  They  are calculated a s : n  n  T-J.Ii,, n  o^ and a  T - I I T . n  i  are the standard deviation of the i and T vectors respectively.  T h e y are calculated a s :  o = _!_Sfl- I 2  I  n  o =^_y (T - T  \2  2  T  n-l  i  \  i  J  20 T h e m a s k e d correlation r lies in the range [-1,1] a n d is calculated a s :  T - T V  i  r=  "i  n  \2  -1  I  ) V  i  T  /T  +  /  ^  - T  i  M M  o  V  T  /  T h e correlation r a p p r o a c h e s 1 a s the positive i and T increases. T h e correlation threshold of r determine valid template matches. Expanding a 2 a n d a 2 , a n d correlation r produces formulas I  2  linear a s s o c i a t i o n between >= 0.7 is used to the equations for variance calculatable in a single  T  pass:  cr =  Zi 2  1  )2 '  (n  n  n -1 . i  .  1  n  s-£i I  .  a =  E l  n  1  T  .  i  i  n -1  T  i  •  M  i  .  MT  2a r =  I  a fS 2  T  MI  2  a  -1SS w  T  = ZI M.,  2  s  •  2  MTT  I  i  I  n  MI  +J_S S n  2  I  S  2 2  /  i  /  =£T2M.  n  Ua fS  •  Z  S T  MT  2  +J_S MI  2  M  .  MTT  n  MT  MT  =^ITM,  i  -is s -is 2  I  n  =£TM,S 2  MI  n  i  .  n  i  J_  n  n  s  IT .i  i  s -XT.. 
s = £ M  S =£lM,S •  _  n 2 >2_  n  n  MI  n 2  n  - I S  ss  T I M  2  T  S  +J_S S 2  MT  n  2  T  M  21 Repeated  s u m m a t i o n terms, s*,  are optimized into a single calculation and  m a s k e d templates precalculate constant s u b e x p r e s s i o n v a l u e s (i.e., which don't include i ) .  those  T h e s e values are used when the template  intersection is the entire template; which o c c u r s in the majority of c a s e s . They  are c a l c u l a t e d during correlation o t h e r w i s e .  Another optimization  o c c u r s for m a s k s with a large number of z e r o s z .  If the cost of testing  ( M i == 0) n times in the correlation loop is less than the cost for z inner loop c a l c u l a t i o n s involving M^, are not performed when  then inner loop calculations involving  == 0.  T h e feature s p a c e for a t e m p l a t e represents the range of s i z e s , rotations and locations that would contain a feature in the image. Feature s p a c e s are c o m p o s e d from: the minimum and maximum a r e a range in base coordinates, the rotation angle range, and the e n c l o s i n g rectangle for the template center in b a s e coordinates. B a s e coordinates are in pixels b a s e d upon the original (i.e., largest) target image. F a c e t s provides for a rudimentary representation of feature s p a c e regions. Feature s p a c e regions are just collections of feature s p a c e s . If two feature s p a c e s overlap sufficiently they are represented a s a single e n c l o s i n g feature s p a c e . Feature s p a c e s sufficiently overlap w h e n the a r e a and angle ranges overlap, and the intersection i of the overlapping e n c l o s i n g rectangles 1  and R  2  is greater in area than [ a r e a ( u  area(R )+area(i 2  and  2  1 2  ) ], where u  1 2  1 2  )-area(R ) X  is the smallest rectangle enclosing R  1  R . 2  T e m p l a t e s e a r c h p r o c e e d s in three s t a g e s : candidate s e a r c h , refine s e a r c h , and instance s e a r c h . S e a r c h e s are performed on a per template basis and require a feature s p a c e argument F S 5 . T h e candidate search stage s e a r c h e s the given feature s p a c e F S using the lowest resolution templates and images possible from the respective pyramids. The search s p a c e is  4  Templates and template pyramids are used interchangeably to represent a feature instance.  There is a limit on the minimum area for feature space F S that is based upon the area range covered by the template pyramid. The minimum area ensures that detected features are not too small, and that templates used to create an instance have sufficient resolution. For subfeature templates, the minimum area is equal to a template in the middle of the pyramid. For face templates, the minimum area is equal to a template a quarter the way from the bottom of the pyramid. Compared to the size of the original feature, maximum subfeature template sizes are bigger than their face template counterparts, so the difference in the minimum area factor equalizes the two. 5  22 covered by traversing the target image pyramid from top (small) to bottom (large). At each level L the feature s p a c e F S for that level is calculated in b a s e coordinates. T h e template pyramid is queried for a set of templates T that cover the F S s p a c e . T h e feature s p a c e covered by T should equal F S within tolerances, otherwise a gap-error results (i.e., one of the pyramids has too few levels). 
Templates T are matched to the target image i at level L using correlation, FS b e c o m e s F S - F S and the p r o c e s s repeats for the next level L-1 or stops when FS is empty or after level 0. If FS is not empty after searching e a c h image pyramid level, then a gap-error results. T h e result for the completed search is a feature s p a c e region (possibly disjoint) of d i s c o v e r e d matches suitable for the next search stage. l  l  l  L  l  T h e refinement stage s e a r c h e s the given feature s p a c e on ever lower target image pyramid levels (i.e., increasing image size). T h i s differs from the candidate search in that FS is not updated to remove the s e a r c h e d s p a c e F S . S e a r c h stops when level 0 of the pyramid is reached, or when templates T cannot cover the s p a c e F S . This p r o c e s s localizes the feature location more accurately a n d u s e s higher resolution templates for better discrimination. T h e result is a feature s p a c e region of the most l o c a l i z e d (higher resolution) match suitable for the next s e a r c h stage. T h e final instance s e a r c h stage s e a r c h e s the given feature s p a c e starting from the bottom of the target image pyramids. T h e first detected match b e c o m e s the evidence for a feature instance. T h e feature c l a s s owning the template (face, left eye, right e y e , nose, mouth) creates a feature instance. T h e strength of a feature instance is the correlation r of the detected match, plus an early detection factor?. l  l  Feature templates are stored in a V i s a g e file called the environment. E a c h environment has a manifest which details e a c h templates' information, including feature c l a s s , initial detection, s k i p refine step, instance candidate, default weight, a n d evidence sets. A template belongs to a feature c l a s s : face, left eye, right eye, nose, or mouth. Templates  6 Some work from the refine search stage is duplicated during the instance search stage. The early detection factor is calculated as 0 . 0 0 1 / d e t e c t i o n _ t i m e , where d e t e c t i o n t i m e is an integer, greater than 0, measuring system ticks in 1/60th of a second. The purpose of this factor is to 'bless' equivalent matches detected earlier. Equivalent matches can occur when search spaces overlap. 7  23 with initial d e t e c t i o n set to true are invoked at the start of face detection (i.e., they are feature templates which do not require previous evidence). T h e refine s e a r c h stage for a feature template is s k i p p e d if skip refine step is true. For best performance, this w a s set to false for all templates. T e m p l a t e s with i n s t a n c e c a n d i d a t e set to true can generate instances, otherwise the last s e a r c h stage is s k i p p e d and matched template features c a n only be evidence to invoke other s e a r c h e s . For the tests, all feature templates could create instances. T h e d e f a u l t w e i g h t is a real number from 0..1 that represents the a priori strength of a given feature template. T h e default value for testing w a s 0.8, which m e a n s that only candidate matches stronger than 0.8 will pre-empt the s e a r c h for initial feature templates. T h e e v i d e n c e set contains a list (possibly empty) of other feature templates to s e a r c h for w h e n detection matches are found for this feature templates. T h e feature s p a c e to s e a r c h is given by the found detection match. 
T o start face detection, the user must first o p e n a V i s a g e file containing the desired d a t a b a s e of feature templates and set it a s the environment. T h e manifest is read, the feature templates l o a d e d , and the feature m o d e l s created with the a s s o c i a t e d feature templates. It is only n e c e s s a r y to perform this step o n c e , although for testing the environment w a s reloaded e a c h times. The user then selects a target image pyramid to perform the detection o n . A candidate s e a r c h over the entire image is invoked for all feature templates with initial detection set. All i n s t a n c e s created are collected for later use and analysis. For face feature instances, the s e a r c h p r o c e s s is complete now. T h i s d e s c r i b e s the simple template technique u s e d a s the b a s i s for c o m p a r i s o n . For left eye, right eye, nose, and mouth subfeature instances, the s e a r c h p r o c e s s is just beginning. W h e n a left eye, right eye, nose, or mouth c l a s s creates a subfeature instance it gives the instance to the c o m p o u n d face c l a s s which creates a c o m p o u n d face instance to contain the given subfeature. T h e c o m p o u n d face instance is responsible to generate a hypothesis s p a c e from the subfeature evidence and continue the s e a r c h .  8 This feature was not used during testing. 9 To limit the effect of memory fragmentation.  24 2.2 Compound Face Model C o m p o u n d face instances are objects containing data a n d functionality. Data C o m p o u n d f a c e s need to keep track of their detected subfeatures (Subfeature Collections, e x p l a i n e d below), a n d t h o s e subfeatures which actually c o m p r i s e the f a c e (Primary Subfeatures). T o limit duplicate searching, c o m p o u n d f a c e s also need to record the feature s p a c e s already s e a r c h e d (Searched Spaces). T h e behaviour of the c o m p o u n d face instance d e p e n d s upon its current state (Detection State), a n d the system performance is m e a s u r e d by the time required to completely detect a face (Last Modified Time, e x p l a i n e d below). T h e subfeature collections s c , r i g h t _ e e . nose> m o u t h > where each collection contains the subfeature instances of the designated type a d d e d to the c o m p o u n d face instance, N I L represents the nonexistent subfeature instance. E a c h combination of < s f , s f , s f , s f > where, Subfeature s c  Collections:  s c  a  n  l  e  f  t  e  y  e  d s c  Y  l  sf  eSC  e  n  m  left_eye eSC  U{NIL},  re sf  r  U{NIL},  le sf  e  right_eye eSC  U{NIL},  n  nose  sf  eSC  U{NIL},  m  mouth  represents a hypothesis instance, a n d the set of all combinations J<sf ^  , s f , s f , s f > sf , sf , sf , & s f le  re  n  m  le  re  n  each e S C m  U{NTL}1  *  J  is the hypothesis  space H S for the compound face instance c f i . From the H S the c o m p o u n d face only hypothesizes the existence of a single face (i.e., a single c o m p o u n d face instance cannot represent the potential that two or more faces exist in the image). c  f  i  c  Primary Subfeatures: PFright_eye> PF  PF  n o s e > and P F  eSC  left_eye PF  m o u t h  U{NIL},  left_eye eSC  right_eye  T h e primary subfeatures P F  U{NIL},  right_eye  , where  l  e  f  t  e  y  e  f  ,  i  25 PF  ESC nose  PF  nose  U{NIL}, U{NIL},  eSC mouth  mouth  and N I L represents no subfeature instance. 
Primary Subfeatures: The primary subfeatures PF_left_eye, PF_right_eye, PF_nose, and PF_mouth, where

    PF_left_eye ∈ SC_left_eye ∪ {NIL},
    PF_right_eye ∈ SC_right_eye ∪ {NIL},
    PF_nose ∈ SC_nose ∪ {NIL},
    PF_mouth ∈ SC_mouth ∪ {NIL},

and NIL represents no subfeature instance. A primary feature represents at most a single subfeature instance (i.e., only 1 or 0 in number). The primary hypothesis instance <PF_left_eye, PF_right_eye, PF_nose, PF_mouth> represents the most likely face hypothesis instance yet found (e.g., a compound face with four primary instances would represent a single complete face). Primary instances are determined by maximizing the hypothesis strength generated by the underlying 2D face model.

Searched Spaces: The searched spaces ss_left_eye, ss_right_eye, ss_nose, and ss_mouth, which are the sets of feature spaces searched for each subfeature type. These are used to determine the equivalence of HS_cf1 and HS_cf2 for compound faces cf1 and cf2, which assists duplicate hypothesis management.

Detection States: The detection state DS_cfi of the compound face instance cfi denotes one of the states {dormant, active, finalized}: dormant specifies inactive instances, usually during the construction phase; active specifies the search phase of the instance; finalized specifies that the search was completed.

Last Modified Time: The last modified time LMT. Whenever the primary hypothesis instance changes for a compound face, the LMT is updated with the current system time. LMT is used to determine the detection time for a compound face instance.

Functions

A compound face instance needs to determine its pose from the collected subfeatures (pose estimate), determine the feature spaces for missing subfeatures (feature space estimates), and calculate how well the subfeatures match the estimated pose (hypothesis strength). A method to add subfeatures to a compound face instance is required, and the face needs to manage this collection (accept subfeatures). The compound face must also invoke searches for missing subfeatures (top-down search).

Pose Estimate: Estimate the facial pose. For a given hypothesis instance <sf_le, sf_re, sf_n, sf_m>, the face model, described below, calculates the rotational angle θz of the face, the scaling factor s, and the face center point c in base coordinates. The face pose estimate FP for a given hypothesis instance HI is denoted FP_HI. The calculation is an ad hoc estimate based upon the weighted combination of:

    The rotation attributes of the sf_le, sf_re, sf_n, and sf_m subfeature instances.
    The sizes of the subfeatures comprising the hypothesis instance.
    The spatial arrangement of the subfeature instance pairs.

An illustrative sketch follows this list.
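The thesis leaves the weighting ad hoc; as one illustrative possibility (not the actual Facets computation), the eye pair alone already fixes a usable rotation, scale, and center:

```python
import math

# Model-coordinate eye centers from the 2D face model (Figure 9, below).
MODEL_LEFT_EYE, MODEL_RIGHT_EYE = (-0.275, 0.227), (0.275, 0.227)

def pose_from_eyes(left_eye, right_eye):
    """Estimate (theta_z, s, c) from two detected eye centers, given in image
    coordinates with y growing downward (a coordinate-convention assumption)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    theta_z = math.atan2(dy, dx)                          # in-plane rotation
    model_span = MODEL_RIGHT_EYE[0] - MODEL_LEFT_EYE[0]   # 0.55 model units
    s = math.hypot(dx, dy) / model_span                   # pixels per model unit
    mid_x = (left_eye[0] + right_eye[0]) / 2
    mid_y = (left_eye[1] + right_eye[1]) / 2
    # The eye midpoint sits 0.227 model units above the face center, so step
    # that far along the face's "down" direction to recover the center c.
    c = (mid_x - 0.227 * s * math.sin(theta_z),
         mid_y + 0.227 * s * math.cos(theta_z))
    return theta_z, s, c
```

A full estimate would blend this with the nose and mouth positions, the subfeature sizes, and each subfeature's own rotation attribute, as the list above describes.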
Feature Space Estimates: Estimate the feature space containing subfeatures. A feature space estimate FSE_[T,FP,HI] is generated for subfeature type T based upon the pose FP_HI and the face model. The complete feature space estimates FSE_[FP,HI] are defined as the set

    { FSE_[left_eye,FP,HI], FSE_[right_eye,FP,HI], FSE_[nose,FP,HI], FSE_[mouth,FP,HI] }.

The estimates for the primary hypothesis instance are used to initiate the top-down search for missing (i.e., NIL) primary subfeature instances.

Hypothesis Strength: Calculate the hypothesis strength and estimated potential. The hypothesis strength for a given hypothesis instance HI = <sf_le, sf_re, sf_n, sf_m> is calculated as a combination of:

    The feature strengths of sf_le, sf_re, sf_n, and sf_m (zero for missing subfeatures).
    The number of subfeatures found (i.e., not NIL).
    The distance in feature space from the face model under FP_HI to the subfeatures sf_le, sf_re, sf_n, and sf_m.
    And a bonus for similar sized eyes.

Estimated potential is essentially hypothesis strength, except that missing subfeatures for which no search was performed are counted as perfect matches (i.e., the assumption is that the ideal subfeature instance will be found). Estimated potential determines search priorities, while hypothesis strength determines the best candidate face when detection completes. A sketch combining both calculations follows.
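The following sketch combines the two calculations under assumed weights; the 0.1, 0.25, and 1.0 constants and the subfeature attributes strength, x, y, and size are illustrative assumptions, and only the listed ingredients come from the text:

```python
import math

def hypothesis_strength(hi, expected, searched, potential=False):
    """Score a hypothesis instance hi = {type: subfeature-or-None}.
    `expected` maps each subfeature type to the image position predicted
    by the face model under the pose FP_HI; `searched` is the set of types
    already searched. With potential=True, missing subfeatures that were
    never searched count as perfect matches, giving the estimated
    potential used to prioritize searches."""
    score, found = 0.0, 0
    for ftype, pos in expected.items():
        sf = hi.get(ftype)
        if sf is not None:
            found += 1
            # Subfeature strength, penalized by distance from the model.
            score += sf.strength - 0.1 * math.hypot(sf.x - pos[0], sf.y - pos[1])
        elif potential and ftype not in searched:
            found += 1
            score += 1.0          # assume the ideal instance will be found
    score += 0.25 * found         # reward more complete hypotheses
    le, re = hi.get("left_eye"), hi.get("right_eye")
    if le and re:                 # bonus for similar sized eyes
        score += 0.1 * min(le.size, re.size) / max(le.size, re.size)
    return score
```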
Accept Subfeatures: Subfeature instances are added to the appropriate subfeature collection when:

    The compound face is first created. The subfeature instance which provided the evidence for an enclosing face is added.
    During hypothesis management. If a newly created compound face instance cf1 represents an equivalent hypothesis to an existing compound face instance cf2, then all the subfeatures in cf1's subfeature collections are added to cf2.
    A subfeature instance is created as the result of a top-down search. The created subfeature instance is added directly to the requesting compound face instance.

When a subfeature instance is added, two steps are performed. The first step removes duplicates from the subfeature collection, and the second step updates the hypothesis instance. Removing duplicates (i.e., those subfeatures representing an equivalent feature space) limits the potential number of new hypothesis instances (i.e., limits combinatorial explosions), which is an important consideration for the update step. Duplicates with lower strength are removed. If the newly added subfeature instance si is still an element of the subfeature collection, then all hypothesis instances which contain si are evaluated and compared to the primary hypothesis instance. The hypothesis instance HI = <sf_le, sf_re, sf_n, sf_m> with the greatest hypothesis strength becomes the primary hypothesis instance, and the primary instances are likewise updated (PF_left_eye = sf_le, PF_right_eye = sf_re, PF_nose = sf_n, and PF_mouth = sf_m).

Top-Down Search: Initiate a request to search for missing subfeatures. To complete a search request for compound face instance cfi, the detection state DS_cfi must be active. The search request proceeds as follows:

    Determine the estimated feature spaces for subfeatures by calling the function described above. The top-down search is partitioned into two phases; in the second phase the feature spaces are enlarged by a factor of 1.25 times.
    For each missing subfeature of type T (i.e., PF_T = NIL), request a search package from the T feature class. The provided package lists the templates available for searching(10) and includes the requested feature space. The requested feature space for the package is added to the searched space ss_T.
    The packages are combined into an aggregate search package. The priority of this search package is set to the expected potential of the requesting face instance.(11)
    The aggregate search package is added to the scheduling queue.

The first search is requested when DS_cfi becomes active. The second search is requested after the first search completes, if any subfeatures are still missing.

Compound face instance cf1 is equivalent to compound face instance cf2 if, for each subfeature type T, all subfeature instances si_T of type T in cf1's subfeature collection sc_T are within the searched space ss_T for cf2, or within the feature space estimate FSE_[T,FP,HI] for cf2 with primary hypothesis instance HI. The equivalent hypothesis comparison is performed during the duplicate hypothesis management that occurs after a compound face instance is created, but before it is activated. If cf1 is equivalent to cf2, then all subfeatures collected by cf1 are added to cf2, and cf1 is removed.

10 The templates marked for initial detection are excluded because the system is already searching for them over their entire feature space and the entire image.
11 Facets supports other search modes, including: individual feature templates with priority set to the template's default weight, and a single docket with priority set to the expected potential.

Figure 9 - 2D Face Model. Described below.

The face model is represented on a 2D plane with 4 degrees of freedom: the rotational angle θz, the scaling factor s, and the face center point c. The plane in model coordinates is a 1x1 square with the point (0,0) at the center, (-0.5,0.5) at the top left corner, and (0.5,-0.5) at the bottom right corner. The left eye position is (-0.275,0.227), the right eye position is (0.275,0.227), the nose position is (0,0.09), and the mouth position is (0,-0.27). The feature centers may vary by ±0.125 in the x and y directions. Eyes have a minimum area of 0.06 and a maximum area of 0.28 in model coordinates. The nose has a minimum area of 0.06 and a maximum area of 0.35 in model coordinates. The mouth has a minimum area of 0.07 and a maximum area of 0.4 in model coordinates. The width/height ratio for faces is fixed to 10/11. The way in which the compound face instances use the face model to generate face hypotheses and calculate the hypothesis strength was described above as part of the Pose Estimate and Hypothesis Strength functionality.
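The geometry above can be recorded directly; the sketch below (a hypothetical structure, not the Facets source) captures the model constants and maps a model-coordinate point into image coordinates for a pose (θz, s, c), assuming image y grows downward:

```python
import math

# 2D face model constants from the text (model coordinates: 1x1 square,
# y pointing up, (0,0) at the face center; areas are (min, max)).
FACE_MODEL = {
    "left_eye":  {"center": (-0.275, 0.227), "area": (0.06, 0.28)},
    "right_eye": {"center": ( 0.275, 0.227), "area": (0.06, 0.28)},
    "nose":      {"center": ( 0.0,   0.09 ), "area": (0.06, 0.35)},
    "mouth":     {"center": ( 0.0,  -0.27 ), "area": (0.07, 0.40)},
}
CENTER_TOLERANCE = 0.125   # per-feature slack in x and y (model units)
FACE_ASPECT = 10 / 11      # fixed width/height ratio

def model_to_image(point, theta_z, s, c):
    """Map a model point to image coordinates under the pose (theta_z, s, c).
    Model y is flipped into image y (which grows downward) before rotating;
    this matches the pose_from_eyes sketch earlier."""
    x, y = point
    u = x * math.cos(theta_z) + y * math.sin(theta_z)
    v = x * math.sin(theta_z) - y * math.cos(theta_z)
    return (c[0] + s * u, c[1] + s * v)
```

The feature space estimate for a missing subfeature would then be a region of ±CENTER_TOLERANCE model units (scaled by s) around the mapped center, with sizes bounded by the area limits.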
2.3 Localized Search

Facets provides three search modes: single templates, template collections, and aggregate search. The initial bottom-up search uses the simple template search mode. For testing, the localized top-down search uses the aggregate search. For each search mode there is a feature detection class. The feature detector for single templates performs the simple template detection described in chapter 2, section 1, above. The feature detector for collections invokes the single template detector for each template in the collection. The feature detector for the aggregate mode performs optimized searches.

Aggregate feature detectors provide several optimization options, two of which are discussed here. The optimizations work by decomposing search into stages (passes), rescheduling poor performers, and limiting the total search required. Optimization option Op1 is the mid-level technique, while Op2 is the most advanced and complex optimization option provided. Option Op1 performs complete simple template detection as described in chapter 2, section 1, while option Op2 splits template detection into the initial search (candidate search) and the final search (refining and instance search), and reasons about the candidates from the initial search to determine how to proceed with the final search. When template detection is split into initial and final search, the feature detector sorts the candidate matches by strength (correlation score) to order the final search.

The first pass of optimized search determines if it is possible to find a complete face. Each collection, which contains the templates for a subfeature type, is searched for a single candidate subfeature. For Op1 this entails a subfeature instance; for Op2 this entails at least one candidate match with correlation r >= 0.8 (possibly more than one candidate found). If none are found, then the search is rescheduled for a later phase. For Op2, final search is then performed for all found candidates. All subfeature instances created during detection are added to the invoking compound face instance. If the subfeature collections sc_left_eye, sc_right_eye, sc_nose, and sc_mouth are all non-empty, and the hypothesis strength is greater than 0.65, then the feature detector tries to immediately complete the compound face; otherwise, the search is rescheduled for a later pass. Rescheduled searches are re-prioritized at a fraction of the hypothesis strength, depending on why the search was rescheduled. For example, if one of the subfeature collections sc_T is empty then the search is rescheduled at 25% of the hypothesis strength; otherwise, if it just failed to meet the 0.65 strength threshold, it is rescheduled at 75% of the hypothesis strength.

The completion process only searches until size(sc_T) = N_T, where N_T is the maximum number of subfeature instances needed for subfeature type T. For Op1 the values are N_left_eye = 7, N_right_eye = 7, N_nose = 5, and N_mouth = 7. Op2 uses the values N_left_eye = 3, N_right_eye = 3, N_nose = 2, and N_mouth = 4 for the subfeature instances, but uses the Op1 values for the number of candidate matches to search for. The candidate matches are sorted in order of higher correlation scores before searching for subfeature instances.

The remaining passes are invoked for rescheduled searches. For Op1 the final pass attempts to complete the face. For Op2, the second pass appears similar to the first except that the initial search is performed for candidates with r >= 0.75. If this search fails, the search is rescheduled; otherwise it attempts the completion process (no threshold).
Pass three for Op2 is similar to Op1's pass two.

The search passes work well because most compound face instances that do represent actual faces in the image have a high hypothesis strength after the first subfeatures are found. It is important to collect several subfeature instances of each type, since the subfeature instances that make for the best face hypothesis instance may not be the strongest subfeatures individually. The compound face removes duplicate subfeatures from its collection, so the small number of subfeatures found are distinct, even without searching all subfeature templates. Op2 was much more sensitive to the threshold parameters than Op1; however, since Op2 performed better than Op1, it was used for testing. The benefits of aggregate search require a level of coordination that the individual searches cannot provide.

When all passes complete, the invoking compound face instance is notified. The face instance may attempt a second search over a larger feature space for any missing features. The priority of the second search is handicapped by 50%.
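A compact sketch of the pass structure and rescheduling policy; the queue discipline and the face.* helper methods are assumptions, while the r thresholds, the 0.65 strength threshold, and the 25%/75% fractions come from the text:

```python
import heapq
from itertools import count

_tie = count()   # tie-breaker so heap entries never compare face objects

def schedule(queue, priority, pass_no, face):
    """Queue a face for a later pass; heapq is a min-heap, so negate."""
    heapq.heappush(queue, (-priority, next(_tie), pass_no, face))

def run_pass(face, queue, pass_no):
    """One Op2-style aggregate pass (sketch). Pass 1 needs a candidate with
    r >= 0.8 per subfeature type, pass 2 relaxes to r >= 0.75; later passes
    simply attempt completion."""
    r_min = 0.8 if pass_no == 1 else 0.75
    found = {t: face.candidate_search(t, r_min)
             for t in face.subfeature_types()}
    if any(not c for c in found.values()):
        # A subfeature type produced no candidate at all: reschedule at
        # 25% of the current hypothesis strength.
        schedule(queue, 0.25 * face.hypothesis_strength(), pass_no + 1, face)
    elif pass_no == 1 and face.hypothesis_strength() <= 0.65:
        # Everything found, but below the 0.65 threshold: reschedule at 75%.
        schedule(queue, 0.75 * face.hypothesis_strength(), pass_no + 1, face)
    else:
        face.complete(found)   # refine + instance searches, capped at N_T
```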
2.4 System Runtime

Previously, the key system components and their implementations were described. This subsection describes their interaction as an entire system. The runtime is composed from four main units: Template and Model Database (TMD), Runtime Instances (RI), Process Scheduler (PS), and Feature Detection (FD). See Figure 10 for an illustration.

The TMD unit contains the object model classes for detection. The models in this unit are created when a Visage is chosen as the environment. The simple feature classes contain their respective templates from the environment Visage. The compound face class, depicted in Figure 6, has the eye, nose, and mouth simple feature classes as subfeatures, and contains the 2D face model. Compound face instances call upon the compound face class to perform calculations and search requests on their behalf.

The RI unit contains all the instances created during the detection process. These instances are available to display detection results (see the Data and Analysis section). The Feature Instances (FI) vector contains all instances. The Compound Face Instances (CFI) vector contains only compound face instances, and is used to search for compound face instances with duplicate hypothesis spaces. If a compound face instance cfi does not duplicate an existing instance in the CFI, then cfi is added to both the CFI and the FI.

The PS unit queues detection requests by priority weight, and passes them along to the FD unit when a detector task becomes available. The Facets architecture supports multiple threaded detectors, but for testing only a single detector is used. The FD unit contains the detectors that perform the feature searches. The target image is set when detection commences. The detection tasks communicate with the feature classes in the TMD unit (e.g., to create a new feature instance from a candidate match).

Figure 10 - Facets Runtime Architecture. (Diagram: the Template & Model Database, Runtime Instances, Process Scheduler, and Feature Detection units, connected by the labelled flows A through F described below.)

During the runtime, as shown in Figure 10, requests for searches are sent (A) by the feature classes to the scheduler, where they are prioritized by value. Detection tasks are sent (B) to the detection unit to search for matches in the specified regions. Matches exceeding the threshold are returned (C) to the original template class. A subfeature instance is created and added (D) to the runtime instances. If the original search request came from a compound face, the instance is returned (E) to the face; otherwise a new compound face instance is created. Compound face instances are placed (F) among the runtime instances; but, to prevent searching overhead from multiple similar hypotheses, similar faces are merged. Compound face instances make localized search requests (F) via the compound face class, which communicates (E) on to the subfeatures for a search, and then makes the request (A) using an aggregate search package.

When the search concludes (or when the user stops the search), the instances in the RI unit are compared against other overlapping features of the same type, using hypothesis strength for comparison. Those instances with the best score are marked as best instances. The best instances are displayed as the detected features.

2.5 Future Implementations

Facets as a system has evolved considerably from optimistic conception to pragmatic conclusion. Many implementation avenues were explored; some successful, some dead ends. Certainly, I am quite sure that were I to create Facets now, I could create a more elegant, robust, and generalizable system in much less time. These are some of the lessons learned.

Good infrastructure is important. Facets benefited from the image processing and representation primitives in its Vista2 library(12), a flexible object database, and object garbage collection (reference counting in Facets)(13). Although implementing this infrastructure provided good experience and solutions tailored to the system needs, I would have saved considerable effort by investing some time initially into finding existing and more general solutions.

It is worth investing time in automation. Although the Visage object file format supports arbitrary attributes, I did not include any face information with the target images. Originally I thought that I would only perform one testing run, and that the effort needed to gather 50 face templates and determine accuracy was less than that needed to specify 400 faces and the infrastructure to use this information. As it turned out, I restarted the testing runs several times as issues were discovered and corrected. In the end, some AppleScript scripting facilities were added to the program to automate data collection over the training set (I still had to manually review the detection result images for accuracy, though).
Currently, model and system parameters are manually set to working values. Automation would support better optimization of these parameters.

12 Vista2 is an object oriented implementation of a subset of the Vista image processing library by Art Pope and David Lowe <http://www.cs.ubc.ca/labs/lci/vista/vista.html>.
13 The garbage collection supports object sharing without complicating object lifetime and ownership issues. It also reduces dependence, allowing the designer more freedom to experiment and make system changes. As a result, stability was excellent throughout Facets' entire development life cycle.

Create the model first. The model is central to representing and reasoning about the scene-space of an image; and the capabilities of the model can then guide requirements for the underlying detection system and the feature space. In combination with the automation tactics described above, it then becomes possible to test and even train the model over the training set, even before the detection system is available.

Create a general and robust feature space region representation, independent of the detection techniques. Facets already has a rudimentary feature space representation; however, in some cases its simplicity proved a limiting factor. Some potentially useful features of feature space regions would be: probability distributions over the feature spaces, and set operations (e.g., union, intersection, and subtraction). Facets supports combining feature space regions; however, the resulting region can become significantly larger than the union of the two regions. The lack of subtraction (removing a region) in the Facets representation also results in duplicated search effort.

3 Data and Analysis

This chapter details the tests performed using the Facets system implementation described previously.

3.1 Template Database Acquisition

Template databases were created for both simple templates (single face) and compound templates (four subfeatures), and several tests and comparisons performed. System performance over template acquisition was itself one of the tests performed. The remaining tests used the complete databases. Detection accuracy was determined by visual inspection. Other data was calculated by the system.

The template databases for the simple and compound techniques were collected from the CMU 1104 image set shown in Figure 1. The CMU 1104 training set includes 16 images with 25 faces each from 5 people, comprising 400 faces from 40 people in 10 poses all together.

For the simple technique, the whole face templates were collected in the following order: An initial face was chosen from the 1st image, a template created from it, and a detection run performed. The person with the lowest number of correctly detected faces, in the image with the lowest number of correctly detected faces, was selected. From the remaining undetected faces for the selected person, the most normal appearing face was chosen, and a template made from it. Another detection run was performed, and the collection process repeated until all faces in the training set were correctly detected. A sketch of this greedy covering loop follows.
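Assuming helpers that tally detection results per image and per person (all names here are hypothetical), the loop looks like:

```python
def acquire_simple_templates(training_set, make_template, run_detection):
    """Greedy template acquisition for the simple technique (sketch).
    Repeat: detect with the current database, pick the worst-covered image,
    then the worst-covered person within it, add a template from that
    person's most normal undetected face, until every face is detected."""
    database = [make_template(training_set.initial_face())]
    while True:
        results = run_detection(database, training_set)
        if results.all_faces_detected():
            return database
        image = min(training_set.images, key=results.correct_faces_in)
        person = min(image.people, key=results.correct_faces_of)
        face = person.most_normal_undetected_face(results)
        database.append(make_template(face))
```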
For the compound technique, the subfeature templates were collected in the following order: An initial face was chosen from the 1st image, templates created for all four subfeatures, and a detection run performed. For each partially detected face (i.e., a face with one, two, or three subfeatures detected), a template was created from one of the missing subfeatures. Another detection run was performed, and this step repeated until all potentially correct faces(14) detected had four subfeatures. After the last detection run, the person with the lowest number of correctly detected faces, in the image with the lowest number of correctly detected faces, was selected. From the remaining undetected faces for the selected person, the most normal appearing face was chosen, and a subfeature template created from one of the subfeatures.(15) Another detection run was performed, and the collection process repeated until all faces in the training set were potentially correct with four subfeatures. In a final step, subfeature templates were collected from faces with less than four correctly detected subfeatures. This step was repeated until all faces in the training set were correctly and fully detected.

To collect a template for the database (i.e., create a template pyramid), the face is selected from the image, creating a single template. To create a whole face template pyramid, the mask is applied to the template, the template pyramid is then created, and finally saved into the database Visage. To create the subfeature template pyramids, the subfeatures are selected from a face image to create the four subfeature templates. Each subfeature template has a mask applied, then the template pyramid is created from these masked templates, and finally the template pyramids are saved to the database Visage. Manifest records for the template pyramids are added to the Visage manifest. Facets provides several tools and dialogs to assist with this process (Figures 11 through 15).

14 Potentially correct faces have enough correctly detected subfeatures to create a face estimate that includes all subfeatures, although possibly some detected subfeatures are incorrect.
15 These subfeature templates are the initial templates. In our test they were always the left eye.

Figure 11 - Selecting a face to create a template.
Figure 12 - Editing the template mask.
Figure 13 - Creating a template pyramid from a template.
Figure 14 - Selecting subfeatures from a face image to create a template.
Figure 15 - Creating a manifest record for a template pyramid.
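The pyramid-creation steps can be sketched as follows; the 2x2 averaging reduction and the stopping rule are assumptions, with the 64 pixel final area taken from the Figure 13 dialog:

```python
import numpy as np

def make_template_pyramid(template, mask, final_area=64):
    """Build a template pyramid from a masked template (sketch).
    Each level halves the resolution until the next level would drop
    below `final_area` pixels."""
    level = np.where(mask, template, 0).astype(np.float32)
    pyramid = [level]
    while (level.shape[0] // 2) * (level.shape[1] // 2) >= final_area:
        h, w = level.shape[0] // 2 * 2, level.shape[1] // 2 * 2
        # 2x2 block averaging as a simple reduction filter (assumption).
        level = level[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(level)
    return pyramid   # level 0 largest, matching the search convention
```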
Detection accuracy was determined manually by inspection. Accurate detection for simple faces required that the face outline contain the eyes, nose, and mouth in the image. The maximum face size was judged more subjectively for reasonable appearance. Accurate detection for compound faces required that all subfeature outlines contain their respective subfeature, and not contain any other subfeature, in the image. Some small overlap was permitted. Detected faces deemed accurate are specified as correct, otherwise failed. Subjectively, the compound templates produced closer matches to the actual face. Some examples are visible by comparing Figure 16 and Figure 17.

Figure 16 - Simple template detection example. 39 correct, 7 failed (6 false-positives, 1 false-negative).

Figure 17 - Compound template detection example. 40 correct, 0 failed.

The simple template technique required 54 templates, totalling 2,010 KB in size, to cover the training set with 100% accuracy. By comparison, the compound template technique required 22 left eye templates (the initial detection templates), 44 right eyes, 25 noses, and 74 mouths, for a total of 165 templates, totalling 2,604 KB in size.(16)

Tests were performed on a 400 MHz PowerPC based PowerMac G4 with 128 MB. The Velocity Engine was not used. Times are in system ticks (1/60th of a second).

3.2 Training Performance

The performance characteristics of the simple and compound techniques were compared during the template acquisition process. The key results of these tests indicate how domain coverage, detection times, and resource requirements increase as templates are added to the database. For each new initial template, timing and accuracy data were collected from the detection run. For compound templates, the detection run for initial template N was performed just before adding initial template N+1. This ensures that as many complete faces as possible are detected from the initial evidence provided by initial template N.

The following accuracy tests show the domain coverage and accuracy as a function of the number of initial templates. The sharp increase in coverage and accuracy seen near the end, in Figure 19 below, is because incomplete compound faces (i.e., with four subfeatures, at least one subfeature correct, but not all correct) were completed (i.e., correctly matching subfeature templates were added) during the final step of the template acquisition process. Compound templates showed better accuracy throughout. Simple templates made 36 false positive identifications for the final run, while compound templates made only 1 false positive identification. Almost all failed matches for compound templates were incomplete faces, as compared to the simple templates where most failures were non-faces. One potential explanation is that compound templates are less likely to detect smaller features (this issue is addressed further in the other tests below).

16 Why the left eye space was covered with 22 templates, while the right eye space required twice that (44 templates), is not clear. It may result from biases in the template acquisition process.
Figure 18 - Simple template accuracy test (initial templates vs. accuracy). Completeness denotes true-positive matches only, while accuracy accounts for false matches.

Figure 19 - Compound template accuracy test (initial templates vs. accuracy). Completeness denotes true-positive matches only, while accuracy accounts for false matches.

Compound templates also demonstrate better detection rates from the available templates; however, this may result from the larger number of templates initially in the compound template database (i.e., more flexibility of the compound template to represent faces with any appropriate combination of subfeatures; it may also be that the larger number of templates covers a larger portion of the face domain). The growth of the template database is discussed later. In Table 1 the completeness of the database is shown compared to the percentage of correctly detected faces. The completeness is shown using initial templates, and additionally for the compound database, the total number of templates in the database. For compound templates, the initial templates are the constraint on the number of faces it is possible to detect (since initial templates are used in the initial bottom-up search). For simple templates, the initial templates are also constraints on the number of faces it is possible to detect (since they are the faces).

Table 1 - Domain detection coverage compared to template database completeness.

    % faces     Simple Templates          Compound Templates
    detected    # initial templates       # initial templates (% database complete)
                (% database complete)     [# total templates (% complete total)]
    60%         10 (19%)                  2 (9%)    [85 (52%)]
    70%         14 (26%)                  3 (14%)   [97 (59%)]
    80%         18 (33%)                  4 (18%)   [108 (65%)]
    90%         28 (52%)                  8 (36%)   [130 (79%)]
    95%         39 (72%)                  16 (73%)  [143 (87%)]

As the template database increases in size, the detection time for simple templates increases linearly, while the increase is sub-linear for compound templates.(17)

17 This comparison is complicated by the different nature of the two approaches. It is not clear if growth of the database is better counted as initial templates, total templates, size, or coverage of the training face domain. Although initial templates for simple templates represent the entire database, and for compound templates only a fraction thereof, they do represent an important fraction. Initial templates in the compound approach have the greatest influence on detection times.
Detection Time 1400  T  Initial Templates | — • — avg avg time — a v g min time —A—avg max time  Figure 21 - Compound template detection times.  A v e r a g e detection times for c o m p o u n d templates were nearly twice a s long a s those for simple templates. O n e reason may be the large number of feature instances detected. O n the final detection run, simple templates detected a total of 2,201 f a c e s , while the c o m p o u n d a p p r o a c h  44 produced 3,510 f a c e s , 7,543 left e y e s , 7,933 right eyes, 4,815 n o s e s , a n d 12,415 mouths, for a total 3 6 , 2 1 6 features. However, c o m p o u n d templates p r o d u c e d consistently faster times to detect the first face, a n d the s e a r c h optimizations p o s s i b l e with c o m p o u n d t e m p l a t e s demonstrate their effectiveness w h e n the a v e r a g e a n d m a x i m u m detection times for the best f a c e s a r e c o m p a r e d to the c o r r e s p o n d i n g detection times for a l l faces. T a b l e 2 - Detection times for final detection run.  Simple  Templates  Compound  Templates  (1/60th second)  (1/60th second)  Minimum Time (best faces)  296  1 69  Average Time (best faces)  478  624  Maximum Time (best faces)  664  1201  Minimum Time (all)  292  65  Average Time (all)  517  1003  Maximum Time (all)  678  1728  T h e growth of the template d a t a b a s e s during template acquisition w a s similar to the increase in detection times. T h e simple template d a t a b a s e i n c r e a s e d linearly from its initial 5 8 KB to its final 2,010 KB18, p a s s i n g the midway 1,005 KB s i z e with 2 8 templates. T h e c o m p o u n d template d a t a b a s e grows in s i z e from the initial 4 5 8 KB, to the final 2,604 KB, p a s s i n g the half w a y 1,302 KB s i z e with just 3 initial templates (97 t e m p l a t e s total). With 11 initial templates (136 t e m p l a t e s total) t h e d a t a b a s e is 2,117 KB in size. 3.3  Testing  T h e performance of both techniques w a s evaluated o n the Nottingham a n d Y a l e image sets using the template d a t a b a s e acquired previously from the training set. T h e s e image sets a r e described in chapter 1, section 1, above. T h e simple template technique is more likely to d i s c o v e r smaller f a c e s .  1  8 T h e template d a t a b a s e s i z e includes index o v e r h e a d .  45 To make the c o m p a r i s o n a s fair a s p o s s i b l e , simple template matches below a given threshold were ignored. T h e threshold w a s determined from the smallest feature d i s c o v e r e d by the c o m p o u n d template a p p r o a c h over the s a m e data set. For the Nottingham image set, the threshold w a s 5 7 3 pixels in area. F o r the Y a l e image set, the threshold w a s 4 0 9 pixels in area. T h e results from this adjustment a r e noted below. C o m p o u n d templates may find a face, but not find all four of the subfeatures correctly. A n adjustment w a s made for face matches with three correct subfeatures, s o that the incomplete face w a s counted a s 3/4 of a face. This adjustment for c o m p o u n d results is also noted below. Table 3 - Face detection results for the Nottingham and Yale image sets.  
To make the comparison as fair as possible, simple template matches below a given threshold were ignored. The threshold was determined from the smallest feature discovered by the compound template approach over the same data set. For the Nottingham image set, the threshold was 573 pixels in area. For the Yale image set, the threshold was 409 pixels in area. The results from this adjustment are noted below. Compound templates may find a face but not find all four of the subfeatures correctly. An adjustment was made for face matches with three correct subfeatures, so that the incomplete face was counted as 3/4 of a face. This adjustment for compound results is also noted below.

Table 3 - Face detection results for the Nottingham and Yale image sets.

                  Simple Templates              Compound Templates
                  % faces detected (adjusted)   % faces detected (adjusted)
    Nottingham    65.63% (69.58%)               87.71% (93.02%)
    Yale          15.56% (15.56%)               31.11% (35.14%)
    Both          34.45% (35.94%)               52.47% (56.98%)

The vastly better performance for the Nottingham image set is because it contained full frontal, expressionless faces, while the Yale image set contained frontal faces with a wide range of expressions. Therefore, training on different facial expressions would be necessary for reliable detection. These tests suffered from high false positive rates. The detection accuracy is shown in Table 4 below.

Table 4 - Accuracy results for the Nottingham and Yale image sets.

                  Simple Templates        Compound Templates
                  % accuracy (adjusted)   % accuracy (adjusted)
    Nottingham    14.26% (35.92%)         53.43% (57.96%)
    Yale          0.88% (1.18%)           11.29% (13.25%)
    Both          5.93% (14.28%)          27.19% (30.12%)

The compound template technique performed better, sometimes substantially, in all these tests.

3.4 Non-Faces

Both techniques were tried on three non-face images, using the complete template databases, to test for false positive performance. The images are labelled by content and are shown below. As discussed previously, the simple template technique is more likely to detect small features. To equalize the test, simple template features were limited to those greater than 611 pixels, the area of the smallest face detected by compound templates.

Table 5 - Non-face image results.

                        Simple Templates                 Compound Templates
                        # false positives (# adjusted)   # false positives
                        / # total faces                  / # total faces
    Tree (399x536)      61 (19) / 365                    18 / 398
    Sky (350x241)       21 (11) / 63                     5 / 157
    Canyon (670x450)    47 (14) / 118                    8 / 416

The post-detection test images for Tree, Sky, and Canyon are shown below:

Figure 22 - Tree with 19 false positives using simple templates.
Figure 23 - Tree with 18 false positives using compound templates.
Figure 24 - Sky with 11 false positives using simple templates.
Figure 25 - Sky with 5 false positives using compound templates.
Figure 26 - Canyon with 14 false positives using simple templates.
Figure 27 - Canyon with 8 false positives using compound templates.

Again, as in the previous Testing subsection, the compound template technique provides better false positive discrimination. The relative detection times for compound templates also improve significantly for these test images:

Table 6 - Non-face detection time results.

                        Simple Templates        Compound Templates
                        min / avg / max time    min / avg / max time
    Tree (399x536)      1289 / 2090 / 2387      1147 / 1776 / 2668
    Sky (350x241)       538 / 696 / 872         610 / 922 / 1355
    Canyon (670x450)    1255 / 3079 / 3628      1646 / 2332 / 3522

This indicates that compound template performance depends on the number of features more than on image area alone. This result is explored below in the density tests.

3.5 Scale and Density

The performance of simple and compound templates was compared over images with varying scale and facial density.
The test image was a 40 face subset of the training set (shown in Figures 16 and 17). For the scaling tests, 9 scaled images were produced at 50%, 75%, 85%, 95%, 100%, 125%, 150%, 175%, and 200% of the size of the original image. For the facial density tests, 5 images were created by covering some of the faces with a textured swatch, producing images with 40, 30, 20, 10, and 2 faces visible (corresponding approximately to 100%, 75%, 50%, 25%, and 5% of the image uncovered).

For the detection accuracy tests, an adjusted performance value is calculated based upon the smallest face detected in the compound test on the same image. For the simple tests this greatly improves the accuracy and correctly detected face counts for the large images, but it hinders the performance for the smallest image. The reason is that many small false positive faces detected by simple templates in the large images were removed, while in the small image the compound technique did very poorly and was unable to detect the small features, so the threshold was too large and the small features removed from the simple template search were actual faces. The simple template approach performed much better overall, producing a gradual face detection performance decay away from the nominal 100% case. The one bright spot for the compound template test is that, unlike the simple templates, it did perform with 100% accuracy in the 100% image size case. These results may be the result of an implementation issue.

Figure 28 - Simple template detection performance (completeness and accuracy) by image scale.

Figure 29 - Compound template detection performance (completeness and accuracy) by image scale.

Detection time performance was also measured over the different image scales. The graphs in Figures 30 and 31 indicate linear growth for both simple and compound templates. The slope of the average detection time line is 9.8 ticks per 1000 pixels for simple templates, and 14.1 ticks per 1000 pixels for compound templates.

Figure 30 - Simple template detection times by image size.

Figure 31 - Compound template detection times by image size.
O n the other hand, detection times for the compound template approach d e p e n d s on the number of unique initial features d i s c o v e r e d , which d o e s d e p e n d on the image contents. There is always the o v e r h e a d of the initial bottom-up s e a r c h which d o e s depend on image size. T h e graphs in Figure 3 2 and 33 illustrate this.  Simple Density: Detection Time by Image Density 1250  1000  w C  s  E co  750  0)  P  500  33  °  250  40  20  30  10  Faces In Image • m i n time  Figure 32 - S i m p l e  •  avg time  A  max time  template detection times by facial density.  52 Compound Density: Detection Time by Image Density 2250 T  0 -I  ,  ,  40  ,  30  ,  20  10  2  Faces In Image |  •  min time  •  avg time  A maxtime|  Figure 3 3 - Compound template detection times by facial density.  3.6  Comparisons  C o m p a r a t i v e tests w e r e performed to determine the e f f e c t i v e n e s s of combining the bottom-up a n d top-down s e a r c h , a n d of the aggregate s e a r c h optimization. T h e test image w a s the s a m e 4 0 face image, at 1 0 0 % , used for the s c a l e a n d density tests. T h e five tests performed were of the following type: Simple  Compound Primary  Standard  S e a r c h i n g the entire simple template d a t a b a s e over the entire test image (all 5 4 templates). T h i s a m o u n t s to the standard simple t e m p l a t e s . S e a r c h i n g the entire c o m p o u n d template d a t a b a s e (all 165 templates) over the entire test image. S e a r c h only the initial detection templates (22 templates) over the entire test image (i.e., only perform the bottom-up s e a r c h ) . T h e standard c o m p o u n d templates technique a s u s e d throughout the testing p r o c e s s .  53 Unoptimized  T h e aggregate s e a r c h optimizations a r e not performed, instead search p r o c e e d s for all subfeatures over the localized feature s p a c e (combined bottom-up a n d topdown searches).  The results are shown in the Figure 3 4 graph below, which s h o w s for each test the m a x i m u m , average, a n d minimum times for finding the best faces.  Search Times f o r Best Instances b y Technique 6000  simple  compound  primary  standard  unoptimized  Search Technique  Figure 3 4 - C o m p a r i s o n  of s e a r c h t i m e f o r v a r i o u s p a r t i a l a p p r o a c h e s .  The c o m p o u n d test essentially performs simple template detection for 165 templates. Although the number of templates is 3 times the number of templates u s e d for the simple test, the detection times a r e nearly 4.5 times longer, e v e n though the subfeature templates are smaller. T h i s is partly d u e to the slightly higher relative resolution of the subfeature templates c o m p a r e d to the face templates (see Figure 8 for a visual comparison), the computational o v e r h e a d required to prepare e a c h template for s e a r c h , a n d the higher probability that subfeature templates will match during the candidate s e a r c h , thus requiring refine a n d instance searches too. C o m p a r i n g the c o m p o u n d and unoptimized results s h o w s  54 that c o m b i n i n g bottom-up and top-down s e a r c h is more efficient than bottom-up s e a r c h alone. C o m p a r i n g the standard results to the u n o p t i m i z e d results s h o w s that optimizing the top-down s e a r c h is possible and effective.  
Unoptimized - The aggregate search optimizations are not performed; instead search proceeds for all subfeatures over the localized feature space (combined bottom-up and top-down searches).

The results are shown in the Figure 34 graph below, which shows for each test the maximum, average, and minimum times for finding the best faces.

Figure 34 - Comparison of search times for the best instances by technique.

The compound test essentially performs simple template detection for 165 templates. Although the number of templates is 3 times the number used for the simple test, the detection times are nearly 4.5 times longer, even though the subfeature templates are smaller. This is partly due to the slightly higher relative resolution of the subfeature templates compared to the face templates (see Figure 8 for a visual comparison), the computational overhead required to prepare each template for search, and the higher probability that subfeature templates will match during the candidate search, thus requiring refine and instance searches too. Comparing the compound and unoptimized results shows that combining bottom-up and top-down search is more efficient than bottom-up search alone. Comparing the standard results to the unoptimized results shows that optimizing the top-down search is possible and effective.

4 Conclusion

This paper described the background, motivation, techniques, and implementation behind the Facets face detection system. Test data was collected from this system comparing the two object recognition approaches: simple templates and compound templates. The results, while not conclusive, do show that the compound template approach has merits worth further exploration. In particular, compound templates provide better feature generalization, better detection times for feature sparse images, and slower detection time growth as new objects are added to the domain.

Heightened expectations for vision system applications, such as real-world robotics, are sure to increase the size and complexity of object recognition domains. Complete usage of available knowledge, both of the domain and of the image, appears a plausible way to make the detection process more efficient. Combining bottom-up and top-down search processes is one way to use partial image knowledge (features found during the bottom-up search) and domain knowledge (the modeled spatial relations localizing the top-down search). A compound technique was proposed for combining the bottom-up and top-down search.

The face domain was chosen as a tractable, extensible, and interesting domain for study. Existing research in face detection is extensive and provides for a variety of approaches. Sung and Poggio's face detection system utilizes top-down search with whole faces [Sung]. The top-down, whole faced, simple template technique was used as a comparative metric for the compound template approach. Other systems in a variety of domains [Matsuyama], [Shakunaga] use combined bottom-up and top-down search; however, the emphasis appears to be on the detection performance of the whole system, not efficiency. Facets implements a novel variation of a compound technique, and provides a testing platform to evaluate both detection and efficiency performance.

The simple template implementation used correlation between grey scale images and templates. Template searches were performed on the target image pyramid using a template pyramid and a three phase search. The compound template implementation used a 2D face model to represent facial spatial relations, four subfeatures, and simple templates to search for these subfeatures. Searching for subfeatures occurred during the initial bottom-up search, and during localized top-down search. The top-down search was directed by face hypotheses generated from the face model. The search requests were scheduled based upon existing subfeature evidence, and further optimized by combining all subfeature searches for a single hypothesis.

Test results ranged from favourable, to inconclusive, to poor for compound templates. Compound templates appear to represent a larger face space with fewer features, possibly because of the combinatorial way that compound templates combine subfeatures. This is also evident in the Nottingham and Yale image tests, where compound templates detected 60% more faces.
The detection times for compound templates also seemed to increase more slowly as the template database increased in size, when compared to the simple template tests, possibly because hypothesis management could increasingly reduce search duplication. This result could indicate that compound templates are more efficient for the larger databases needed for larger domains; however, this result is still open to interpretation depending on how the database growth is specified. Compound templates are more demanding of memory and computational resources in the chosen face domain, resulting in poorer absolute detection times than simple templates. Compound templates are well suited to image domains that are feature sparse, beating the simple approach in detection times for images with 50% face density or less. Although the purely top-down approach often had faster detection times, the combination of bottom-up and top-down search was 150% faster than searching for all the subfeatures in a bottom-up manner, and 350% faster when the aggregate search optimization was used.

While simple templates proved fast, they were not as accurate as compound templates, sometimes resulting in accuracy that was 2 to 10 times lower than compound templates. This might be surprising, considering that the full face template contained more of the face than the combined eye, nose, and mouth subfeatures. One possible reason is that the whole face templates were slightly smaller in relative size compared to the subfeature templates (i.e., a nose subfeature template would be slightly larger than the nose in the face template). Comparisons using equivalent resolutions would have been preferable, but must be saved for future experiments.

The implemented compound templates technique is one instance of the general compound and combined approach described previously in chapter 1, section 3. The approach is extensible to other domains, to compound models with higher numbers of levels, and to any combination of underlying image analysis techniques. One future direction for development would create a more generalized implementation and evaluate it in a larger domain such as articulated objects, or human bodies [Felzenszwalb]. The approach would also benefit from more formal models, such as a Bayesian approach [Viola], or formal techniques to calculate facial pose [Shakunaga]. Another possible improvement would replace the underlying simple template detection with more robust techniques, such as principal components [Sung], [Shakunaga], or combine it with other techniques, such as flexible templates [Lanitis], [Yuille].

5 References
Beymer, David J. Face Recognition Under Varying Pose. MIT AI Memo, no 1461, 1993.

Bottoni, P., Cinque, L., Levialdi, S., Mussio, P. Matching the Resolution Level to Salient Image Features. Pattern Recognition, vol 31, no 1, pg. 89-104, 1998.

Brautigam, Carsten G., Eklundh, Jan-Olof, Christensen, Henrik I. A Model-Free Voting Approach for Integrating Multiple Cues. From unknown, pg. 735-750.

Brunelli, Roberto, Poggio, Tomaso. Face Recognition: Features versus Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 15, no 10, pg. 1042-1052, 1993.

Burt, Peter J. Smart Sensing within a Pyramid Vision Machine. Proceedings of the IEEE, vol 76, no 8, pg. 1006-1015, August 1988.

Felzenszwalb, Pedro, Huttenlocher, Dan. Efficient Matching of Pictorial Structures. At http://www.ai.mit.edu/people/pff/blobrec/blobrec.html and http://www.ai.mit.edu/people/pff/blobrec/blobrec2.ps.gz.

Hofstadter, Douglas. Fluid Concepts and Creative Analogies. Basic Books, 1994.

Jeng, Shi-Hong, Liao, Hong, Han, Chin, Chem, Ming, Liu, Yao. Facial Feature Detection using Geometrical Face Model: An Efficient Approach. Pattern Recognition, vol 31, no 3, pg. 273-282, 1998.

Konen, Wolfgang. Comparing Facial Line Drawings with Gray-Level Images: A Case Study on PHANTOMAS. At http://www.zn.ruhr-uni-bochum.de.

Lanitis, Andreas, Taylor, Chris J., Cootes, Timothy F. Automatic Interpretation and Coding of Face Images using Flexible Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 19, no 7, pg. 743-755, 1997.

Matsuyama, Takashi, Hwang, Vincent Shang-Shouq. SIGMA: A Knowledge-Based Aerial Image Understanding System. Plenum Publishing Corporation, 199?.

Milanese, Ruggero, Gil, Sylvia, Bost, Jean-Marc, Wechsler, Harry, Pun, Thierry. Integration of Bottom-Up and Top-Down Cues for Visual Attention Using Non-Linear Relaxation. Berkeley Technical Reports, vol 94, no 14, pg. 1-6, 1994.

Moore, David S., McCabe, George P. Introduction to the Practice of Statistics, second edition. W. H. Freeman and Company.

Rowley, Henry A., Baluja, Shumeet, Kanade, Takeo. Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 20, no 1, pg. 23-38, 1998.

Shakunaga, Takeshi. Integration of Eigentemplate and Structure Matching for Automatic Facial Feature Detection. IEEE Journal - Unknown (0-8186-8344-9/98), pg. 94-99, 1998.

Sung, Kah-Kay, Poggio, Tomaso. Example-Based Learning for View-Based Human Face Detection. MIT Lab Report, pg. 1-8.

Takacs, Barnabas, Wechsler, Harry. Detection of Faces and Facial Landmarks using Iconic Filter Banks. Pattern Recognition, vol 30, no 10, pg. 1623-1636, 1997.

Viola, Paul A. Complex Feature Recognition: A Bayesian Approach for Learning to Recognize Objects. MIT AI Memo, no 1591, 1996.

Yuille, Alan L., Hallinan, Peter W., Cohen, David S. Feature Extraction from Faces Using Deformable Templates. International Journal of Computer Vision, vol 8, no 2, pg. 99-111, 1992.

Weng, John J., Hwang, Wey-Shiuan. Toward Automation of Learning: The State Self-Organization Problem for a Face Recognizer. IEEE Journal - Unknown (0-8186-8344-9/98), pg. 384-389, 1998.
