Multiclass object recognition inspired by the ventral visual pathway Mutch, James Vincent 2006

Multiclass Object Recognition Inspired by the Ventral Visual Pathway

by

James Vincent Mutch

B.Sc., The University of British Columbia, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES

(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA

October 2006

© James Vincent Mutch, 2006

Abstract

We describe a biologically-inspired system for classifying objects in still images. Our system learns to identify the class (car, person, etc.) of a previously-unseen instance of an object. As the primate visual system still outperforms computer vision systems on this task by a wide margin, we base our work on a model of the ventral visual pathway, thought to be primarily responsible for object recognition in cortex.

Our model modifies that of Serre, Wolf, and Poggio, which hierarchically builds up feature selectivity and invariance to position and scale in a manner analogous to that of visual areas V1, V2, V4, and IT. As in that work, we first apply Gabor filters at all positions and scales; selectivity and invariance are then built up by alternating template matching and max pooling operations.

We refine the approach in several biologically plausible ways, using simple versions of sparsification and lateral inhibition. We demonstrate the value of retaining some position and scale information above the intermediate feature level. Using feature selection we arrive at a model that performs better with fewer features. Our final model is tested on the Caltech 101 object categories and the UIUC car localization task, in both cases achieving state-of-the-art performance. The results strengthen the case for using this type of model in computer vision.
Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgments
1 Introduction
    1.1 Motivation
    1.2 Scope of Problem Addressed
        1.2.1 Format of Images
        1.2.2 Segmented vs. Unsegmented Training Images
        1.2.3 Categories
        1.2.4 Context
        1.2.5 Types of Classification Tasks
    1.3 Contributions
    1.4 Outline of Thesis
2 Background
    2.1 The Ventral Visual Pathway
        2.1.1 V1 and Topographical Maps
        2.1.2 Hierarchical Organization
        2.1.3 Immediate Recognition and Feedforward Operation
    2.2 Sparsity
    2.3 Geometry and "Bags of Features"
3 Previous Work
    3.1 Other Biologically-Inspired Models
        3.1.1 The Neocognitron
        3.1.2 Convolutional Networks
        3.1.3 "HMAX"
        3.1.4 Serre, Wolf & Poggio Model
    3.2 Relation to Recent Computer Vision Models
4 Base Model
    4.1 Model Overview
    4.2 Feature Computation
        4.2.1 Image Layer
        4.2.2 Gabor Filter (S1) Layer
        4.2.3 Local Invariance (C1) Layer
        4.2.4 Intermediate Feature (S2) Layer
        4.2.5 Global Invariance (C2) Layer
    4.3 SVM Classifier
    4.4 Differences from Serre et al.
5 Improvements
    5.1 Sparser S2 Inputs
    5.2 Inhibited S1/C1 Outputs
    5.3 Limited C2 Position/Scale Invariance
    5.4 Feature Selection
6 Multiclass Experiments (Caltech 101 Dataset)
    6.1 The Caltech 101 Dataset
    6.2 Running the Model
    6.3 Parameter Tuning
    6.4 Multiclass Performance
7 Localization Experiments (UIUC Car Dataset)
    7.1 The UIUC Car Dataset
    7.2 Model Parameters
    7.3 Sliding Window
    7.4 Results
8 Analysis of Features
    8.1 The Value of S2 Features
    8.2 Visualizing Features
    8.3 Feature Sharing Across Categories
    8.4 Utility of Different Feature Sizes
9 Other Experiments (Graz Datasets)
    9.1 The Graz-02 Datasets
    9.2 Training the Model
    9.3 Testing the Model
    9.4 Results
10 Discussion and Future Work
    10.1 Summary
    10.2 Future Work
Bibliography

List of Tables

6.1 Results for the Caltech 101 dataset along with those of previous studies
6.2 Contribution of successive modifications to the overall score
6.3 Per-category classification rates and most common errors for the Caltech 101 dataset
7.1 Results for the UIUC car dataset along with those of previous studies
7.2 Frequency of error types on the multiscale UIUC car dataset
9.1 Our results for the Graz-02 datasets

List of Figures

1.1 Some challenges of visual object recognition
1.2 Example images for the basic classification task
1.3 Example images for the localization task
2.1 The ventral visual pathway
3.1 Architecture of the Neocognitron
3.2 Architecture of the LeNet-5 convolutional network
3.3 Architecture of the original "HMAX" model
4.1 Overall form of our base model
4.2 Base model layers
4.3 An S2 feature (prototype patch) in the base model
5.1 Dense vs. sparse S2 features
5.2 Inhibition in S1/C1
5.3 Limiting the position/scale invariance of C2 units
5.4 Informative and uninformative features
5.5 Using an SVM for feature weighting
6.1 Some images from the Caltech 101 dataset
6.2 Results of parameter tuning using the Caltech 101 dataset
6.3 Results on the Caltech 101 dataset for various numbers of features
6.4 Some example images from easier categories
6.5 Some example images from difficult categories
7.1 Some correct detections on the single-scale UIUC car dataset
7.2 The only errors made on the single-scale UIUC car dataset
7.3 Some correct detections on the multiscale UIUC car dataset
7.4 Some errors made on the multiscale UIUC car dataset
8.1 Best image patches for feature #1
8.2 Best image patches for feature #2
8.3 Best image patches for feature #3
8.4 Best image patches for feature #101
8.5 Best image patches for feature #501
8.6 Best image patches for feature #1001
8.7 Feature sharing among categories
8.8 Sizes of features remaining after feature selection
8.9 Relative proportions of feature sizes by selection rank
9.1 Subimages used to train the Graz-02 "bikes" classifier

Acknowledgments

I am greatly indebted to my thesis supervisor David Lowe, without whom none of this would have happened. David has been both a brilliant collaborator and a kind and patient mentor, pointing me in all the right directions and expertly guiding me through many "firsts" - first paper submission, first conference talk, and many others. Thank you, David.

The UBC Computer Science Department as a whole has been an incredibly supportive environment. Special thanks to Ruben Morales-Menendez, who helped me when I first came back to school, and to the 238 bullpen gang, for their friendship. Thanks to Robert Woodham for much helpful advice, and to Jim Little for being my second reader. Finally, thanks to David Kirkpatrick, for giving me my first book on AI all those years ago, and for encouraging me to come back.

I'm grateful for the lifelong support of my parents, who never failed to encourage my curiosity about the world. To all the friends and family whom I've neglected during this thesis - thanks for your love and understanding.

Finally, my thanks and love to Rachel, who has borne the burden of my obsessiveness and shown me endless patience, understanding, and support.
1 Introduction

1.1 Motivation

Computer vision systems have become quite good at recognizing specific objects they have seen before. This is not an easy task, as it demands a difficult combination of specificity and invariance. The system must give very different responses to stimuli which are often superficially similar, such as two different faces. Yet it must give the same response to two different views of the same object, despite potentially large differences in viewpoint or illumination, presence of background clutter and partially occluding objects, and other factors (see figure 1.1). Nevertheless systems such as [23] are now capable of recognizing previously-seen objects - particularly rigid objects whose shapes do not change much - fairly accurately and at real-time speed, under a variety of conditions.

Figure 1.1: Some challenges of visual object recognition: multiple viewpoints, varying degrees of illumination, background clutter, and partial occlusion.

However, the generalization from object instances to object categories remains much more difficult for computers than it is for human vision. Computer vision systems that are good at recognizing specific, previously-seen objects tend to look for very distinctive low-level visual features having rigid geometric relationships. Various attempts to soften these constraints to span entire categories of objects while still retaining the necessary specificity have met with only partial success.

Given the still vastly superior performance of humans (and animals) on this task, it makes sense to look to the ever-increasing body of neuroscientific and psychophysical data for inspiration. In fact, recent work by Serre, Wolf, and Poggio [37] has shown that a computational model based on some of our current knowledge of visual cortex can be at least competitive with the best existing computer vision systems on some of the standard classification datasets.
Our work builds on this model and improves its performance.

Our ultimate goal is to give computers human-level ability to learn to visually classify objects. The number of potential applications is virtually limitless, from image retrieval to allowing robots to interact more fully with their environments. The success (or failure) of vision systems inspired by our knowledge of the brain may also help add to that knowledge.

1.2 Scope of Problem Addressed

1.2.1 Format of Images

Most research in this area (including this study) focuses on the core problem of object classification in single 2D grayscale images (i.e., still "black and white" photographs). A few object categories are almost impossible to distinguish without colour, such as lemons and limes. For other categories colour helps: most trees are green, a purple object is usually not a face, etc. But for many categories, such as cars, colour is just a distraction, and systems that incorporate it require larger amounts of training data in order to come to the conclusion that colour, for such categories, is not relevant. Moreover, humans do very well without colour cues. We thus focus on the core problem of classification from shape and texture.

Recognition from motion cues is also an important component of real-world vision, and forms another area of active research, but it is outside the scope of this study.

1.2.2 Segmented vs. Unsegmented Training Images

Systems are generally trained on a set of labeled images containing exemplars of object categories; they are then expected to recognize and label new instances of those categories in other images.

Some computer vision systems can learn from completely unsegmented images, knowing only that some images contain an object of type X (but not where) and some do not. Others require training images having bounding boxes or precise outlines.
We focus on learning from segmented images because it is more analogous to human visual learning. It is fairly clear that we learn about objects upon which our attention has already been focused by some other mechanism, possibly aided by motion, stereo, or colour cues. We wish to simply assume the existence of such a mechanism during training by using segmented images, or at least, images in which the object is central and dominant.

1.2.3 Categories

In reality, different object categories can have complex semantic relationships. These include parent-child "is-a" relationships (e.g., a robin is a bird) and whole-part "has-a" relationships (e.g., a cougar has a head). We ignore this for purposes of this study and treat categories as disjoint, unrelated sets, defined simply by the labels we give to our training images.

1.2.4 Context

Object classification in humans can be aided by context (the entire scene or other objects) or other prior beliefs. For example, street scenes prime us to see cars. A white blob next to a computer keyboard is probably a mouse. These kinds of top-down, scene-level effects are outside the scope of this study.

1.2.5 Types of Classification Tasks

The ultimate goal in object recognition is to be able to locate and classify every object in a scene. In this study we focus on two simpler tasks.

Basic Classification: What is this object? The answer can be one of many categories, but there is only one object, which does not have to be found in a larger image; see figure 1.2. It either dominates the image or is contained in an already-chosen attentional window. These experiments are described in chapter 6.

Localization: Where are all the objects of type X in this image, if any? Here the simplification is to reduce the number of categories to one, but the instance(s) of that category can be anywhere in the image; see figure 1.3. These experiments are described in chapter 7.
1.3 Contributions

The model presented in this paper is based on the "standard model" of object recognition in cortex [31] and builds on the "HMAX"-based model of Serre, Wolf, and Poggio [37]. We incorporate some additional biologically-motivated properties including sparsification of feature inputs, lateral inhibition, feature localization, and feature selection (see chapter 5). We show that these modifications further improve classification performance, strengthening our understanding of the computational constraints facing both biological and computer vision systems.

Figure 1.2: Example images for the basic classification task. The goal is to determine the category of the object (airplane, elephant, etc.) which is centrally located and dominant.

We test these modifications on the large Caltech dataset of images from 101 object classes (chapter 6). Our results show that there are significant improvements to classification performance from each of the changes. Further tests on the UIUC car database (chapter 7) demonstrate that the resulting system can also perform well on object localization. Our final system outperformed all previous studies involving these datasets, further strengthening the case for incorporating concepts from biological vision into the design of computer vision systems.

1.4 Outline of Thesis

The rest of the thesis is structured as follows.

• Chapter 2 reviews some of the ideas from biological and computer vision upon which our model is based.
• Chapter 3 discusses some of the previous biologically-motivated computational models for object classification.
• Chapter 4 describes our "base" model, which is essentially an abstraction of [37].
• Chapter 5 describes our enhancements to the base model including sparsification of feature inputs, inhibition, limited position/scale invariance of intermediate-level features, and feature selection. These represent the main contributions of this work.
• Chapter 6 describes the training and tuning of our model on the Caltech 101 dataset and presents the results for the basic classification task.
• Chapter 7 describes our localization experiments on the UIUC car dataset and presents the results.
• Chapter 8 provides some insight into the kinds of features that are being selected.
• Chapter 9 describes experiments on the Graz datasets, in which the images are somewhat more difficult.
• Chapter 10 summarizes the results and discusses possible future work.

2 Background

2.1 The Ventral Visual Pathway

Object classification in cortex is believed to be centered in the ventral visual pathway, running from primary visual cortex (V1) through areas V2, V4, and inferotemporal cortex (IT).

Figure 2.1: The ventral visual pathway. From [32].

2.1.1 V1 and Topographical Maps

Among all the visual areas V1 has received the most study (although it is far from being fully understood [27]). Cells in V1 respond to very simple features (essentially oriented bars) at specific retinal positions and scales [15]. V1 is one of many cortical areas with a clear topographic mapping: its 2D layout corresponds roughly to 2D retinotopic position. The additional stimulus dimensions of scale and bar orientation are folded into this 2D layout so that a given bit of V1 cortex will contain cells responsive to a variety of scales and orientations at a specific retinotopic position. This packing of multiple stimulus dimensions into a 2D array is done in a manner that maintains continuity along stimulus dimensions while covering the entire stimulus space [38]. Note that V1 contains additional stimulus dimensions (colour, stereo, motion) that we ignore in this study.

Topographical maps are numerous in cortex: there are about 25 of them in the visual system alone [5].
There are also topographical maps for body location in the somatosensory system, frequency in the auditory system, and muscle groups in the motor system. Continuous topographical maps minimize the amount of wiring necessary for performing local computations. Neurons representing similar stimuli are kept close together.

2.1.2 Hierarchical Organization

As we move through the ventral stream from V1 through V2, V4, and IT, we encounter cells that are responsive to increasingly complex stimuli with increasing invariance to position and scale. The first step seems to occur within V1 itself, where the outputs of a number of "simple" cells responsive to the same feature (i.e., a particular orientation) are pooled by "complex" cells [15]. A resultant complex cell is responsive to the same orientation as its inputs but has a larger receptive field, i.e., a greater degree of position and scale invariance. Further steps combine heterogeneous features to generate more complex features, and more pooling over position and scale occurs. At the level of IT we encounter cells with a high degree of position and scale invariance which are responsive to specific object views, as well as viewpoint-invariant units.

Despite the increasing degree of spatial invariance at higher levels, combinatorial constraints on the number of complex features that could be represented rule out complete coverage of the potential stimulus space, as seems to occur in V1. This suggests an increasing role for ongoing learning at higher levels, and certainly at the level of IT.

This hierarchical arrangement of topographical maps is not unique to the visual system; similar hierarchies are present in the somatosensory, auditory, and motor systems [5].
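The pooling step described above, in which a complex cell gains position invariance from several simple cells tuned to the same orientation, can be illustrated with a minimal one-dimensional sketch. The function name `complex_cell`, the toy response values, and the choice of a maximum as the pooling operation are assumptions made for illustration only; the text above says only that the simple-cell outputs are pooled.

```python
def complex_cell(simple_responses, center, radius):
    """Pool the simple-cell responses within `radius` of `center`.

    Pooling is done with max() here; that choice is an assumption
    made for this illustration.
    """
    lo = max(0, center - radius)
    hi = min(len(simple_responses), center + radius + 1)
    return max(simple_responses[lo:hi])


# Hypothetical responses of simple cells tuned to one orientation
# at 7 adjacent positions, for a bar at position 2 ...
bar_at_2 = [0.0, 0.1, 0.9, 0.1, 0.0, 0.0, 0.0]
# ... and for the same bar shifted two positions to the right.
bar_at_4 = [0.0, 0.0, 0.0, 0.1, 0.9, 0.1, 0.0]

# A single simple cell (position 2) responds very differently to the two stimuli.
simple_change = abs(bar_at_2[2] - bar_at_4[2])

# A complex cell pooling positions 1..5 responds identically to both.
complex_change = abs(complex_cell(bar_at_2, 3, 2) - complex_cell(bar_at_4, 3, 2))

print(simple_change, complex_change)
```

The pooled unit keeps its orientation tuning, since all of its inputs share that tuning, while its effective receptive field covers all pooled positions.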
2.1.3 Immediate Recognition and Feedforward Operation

Most forward projections between brain areas are matched by corresponding feedback connections, and the ventral stream is known to be modulated by other areas for reasons including attention and contextual priming. Nevertheless we seem to do quite well in an "immediate" recognition mode in which these effects are absent. Human subjects in rapid serial visual presentation (RSVP) experiments have been able to process images as rapidly as 8 per second [29]. EEG studies [39] show response times on the order of the latency of the ventral stream, suggesting a mainly feedforward mode of operation for this first stage of the basic classification task.

This constraint, more than any other, suggests that computational models of biological object classification might be workable despite the current limitations of our knowledge. Studies of the response properties of neurons at various levels in the hierarchy are growing more and more numerous. If we know what is being computed at each level, the feedforward constraint makes it easier to guess how it is being computed.

The feedforward constraint, however, does not apply to learning.

2.2 Sparsity

Representations in visual cortex are known to be overcomplete, i.e., data is represented using a much larger set of basis functions (neurons) than would be minimally necessary. For example, in macaque V1 there are 50 times as many output fibres as input fibres. However, if the decomposition of a signal into such an overcomplete code were done linearly, the components of the resulting vector would be correlated [27]. Some form of nonlinearity is required to sparsify them. Sparse vectors are vectors whose components are mostly zeros. In general, sparse codes are more metabolically efficient and more easily stored in associative memories [2].
Sparsity constraints have proven critical for explaining the response functions of V1 neurons in terms of the statistics of natural images [26].

A direct way to achieve sparsification is for cells to have nonlinear, highly peaked response functions. An alternative method is lateral inhibition, in which cells inhibit their less-active neighbours in a winner-take-all competition. The continuous topographical map organization of many cortical areas (section 2.1.1), in which cells encoding similar stimuli are kept close together, is ideal for this.

Furthermore, within machine learning, it has been found that increasing the sparsity of inputs [9, 17] (equivalent to reducing the capacity of the classifier) plays an important role in improving generalization performance.

Chapter 5 describes our efforts to incorporate these concepts into our classification model.

2.3 Geometry and "Bags of Features"

Some current successful computer vision systems for object classification learn and apply quite precise geometric constraints on feature locations [8, 4], while others ignore geometry and use a "bag of features" approach that ignores the locations of individual features [6]. However, in a hierarchical model, in which simple, low-level features having high position and scale specificity are pooled and combined into more complex, higher-level features having greater location invariance, this ceases to be a binary decision. The question becomes: at what level have features become complex enough that we can ignore their location?

If the features are too simple, we run the risk of "binding" problems: we are still vulnerable to false positives due to chance co-occurrence of features from different objects and/or background clutter. Conversely, the classification system would be unable to distinguish between an instance of a known category and a scrambled version of the same image.
However, if the features are sufficiently complex and large enough to overlap, then random rearrangements of the image would destroy enough features to avoid the false positive.

In this study we investigated retaining some degree of position and scale sensitivity at a higher point in this hierarchy than the approach of [37], and show that this provides a significant improvement in final classification performance. Chapter 5 outlines our approach.

3 Previous Work

In this chapter we review other biologically-inspired models of object classification and discuss our approach in the context of some of the recent computer vision approaches.

3.1 Other Biologically-Inspired Models

This section reviews other biologically-inspired models, specifically, models of feedforward recognition in the ventral stream that hierarchically build up feature complexity and invariance.

3.1.1 The Neocognitron

The Neocognitron [12] was the first of this class of models. Generalizing from the "simple" and "complex" cells of Hubel & Wiesel [15], the Neocognitron starts with a 2D pixel layer and then computes alternating "S" and "C" layers (S1, C1, S2, C2, ...). "S" layers build up feature complexity and "C" layers build up position invariance. Its architecture is illustrated in figure 3.1.

Figure 3.1: Architecture of the Neocognitron. Each layer consists of some number of square feature maps. The S1 layer's feature maps are computed from the image. Each feature map in a "C" layer is created by pooling units from one feature map in the previous "S" layer (or sometimes two in symmetry cases). However, "S" features in S2 or higher are computed by combining features from multiple maps in the previous layer. This is illustrated by the crossing connections from "C" to "S" layers. From [12].

Each S_n layer processes the previous layer and computes d_n feature maps; each map is the response to a particular type of local feature computed at all possible positions in the previous layer. In layer S1, local features are computed directly from pixels. In higher "S" layers, features are computed as local combinations of different types of cells (from different feature maps) in the previous "C" layer.

A C-cell's value is a local weighted sum of a patch of S-cells of the same type (i.e., from the same feature map) in the previous layer. "C" layers also serve to reduce the total number of units by subsampling their input "S" layer.

At the top level, cells are complex enough to represent entire object categories, and are completely position invariant. Classification is performed by selecting the most active top-level cell.

Learning in the Neocognitron means learning what features to compute in the "S" layers. This is typically done in a bottom-up manner. The S1 layer is trained to find common or useful patterns in the pixel layer, then the S2 layer is trained to find patterns in the C1 layer, and so on. Clustering methods are often used.

The Neocognitron was invented for handwritten character recognition, but has been adapted for other 2D pattern classification tasks. It is not explicitly multiscale; patterns must be of a standard size.

3.1.2 Convolutional Networks

The Neocognitron was essentially the first convolutional network. In its regular, feedforward (post-learning) mode, its basic operation is convolution. An "S" layer is generated by convolving the previous layer with d local filters, while each feature map of a "C" layer is generated from the corresponding map in the previous "S" layer via convolution with a fixed local filter.

The term "convolutional network" now seems to refer to a network having this same basic structure, but with two major differences.
• Training is top-down, via backpropagation [20].
• The top-level features do not represent object categories. Instead, the vector of activations is fed into a classifier. This can be a standard, fully-connected multilayer neural network or some other classifier.

Convolutional networks such as LeNet-5 (figure 3.2) have been applied to commercial-level character recognition, speech recognition, and face/object recognition.

Figure 3.2: Architecture of the LeNet-5 convolutional network. (Note that in LeNet, the nomenclature of "S" and "C" levels is the opposite of that used throughout this thesis.) From [20].

3.1.3 HMAX

In 1999, Riesenhuber & Poggio formulated the "standard model" of object recognition in cortex, essentially defining a class of models consistent with the following mostly agreed-upon facts regarding the ventral visual pathway (from [30]):

• "A hierarchical build-up of invariances first to position and scale and then to viewpoint and more complex transformations requiring the interpolation between several different object views;
• in parallel, an increasing size of the receptive fields;
• an increasing complexity of the optimal stimuli for the neurons;
• a basic feedforward processing of information (for "immediate" recognition tasks);
• plasticity and learning probably at all stages and certainly at the level of IT;
• learning specific to an individual object is not required for scale and position invariance (over a restricted range)."

They also created a quantitative model [31], later dubbed "HMAX", which embodied some of these concepts. The basic HMAX model is not an end-to-end object classification system; it was designed to account for the tuning and invariance properties of neurons in IT cortex found by the experiments of Logothetis et al. [22].
The HMAX model is similar to the convolutional networks mentioned above in that it uses alternating "S" and "C" layers to build up feature complexity and invariance; see figure 3.3. Top-level view-trained units are learned vectors of activations of fully position/scale-invariant features. However, HMAX differs in the following ways:

• Rather than learning the low-level (S1) features, HMAX starts with statically defined features (Gaussian derivatives or Gabor filters) designed to mimic cells in V1 cortex.
• C-level pooling uses a MAX operation instead of a weighted sum; the output of a C-cell is that of its strongest input. This increases position and scale invariance without losing any feature specificity. Support for a MAX operation in at least some cells in visual cortex has been found in physiological studies [18].
• HMAX is explicitly multiscale. As in V1 cortex, the S1 layer applies filters at a range of scales. Higher C-units pool not only over local positions but also over nearby scales.

Figure 3.3: Architecture of the original "HMAX" model. From [31]. (Legible diagram labels: view-tuned cells, complex cells (C1), simple cells (S1); legend: weighted sum, MAX.)

The original HMAX model had four fixed feature layers (S1, C1, S2, and C2) - none were learned. The relatively small number of simple features at the C2 level were insufficiently complex or distinct for object recognition tasks [35].

3.1.4 Serre, Wolf & Poggio Model

Serre et al. [37] modified the original HMAX model to make it useful for classification. The static S2 features were replaced by a much larger set of features sampled from training images, and the final vectors of C2 activations were fed into a support vector machine (SVM) classifier.
T h e other layers remained mostly the same, al though the exact S i features and C l pool ing ranges were adjusted to more closely match physiological data [36]. T h e Serre model is described i n chapter 4 by contrast w i t h our base model . T h i s model achieved results comparable to the best non-biologically mot i -vated approaches on the difficult Cal tech 101 dataset. 3.2 Relation to Recent Computer Vision Models T h e above-mentioned systems represent the most direct attempts to model object classification in the primate ventral stream. Nevertheless, many other computer vis ion systems draw analogies to aspects of biological vis ion, and a l l face the same challenges. . . . . . . . V i r t u a l l y al l current approaches to object classification start by extracting various kinds of local features from the image; object category representations are then learned and expressed in terms of the presence or act ivi ty level of these features. C o m m o n choices for features include fragments composed of raw pixels [41, 40] and S I F T descriptors [23]. They may be sparsely sampled at certain interest points where some local saliency condit ion is satisfied, or they may be densely computed at every point i n the image. Recent work by Jurie & Triggs [16] suggests that sparse sampling at interest points is subopt imal for classification tasks, as too much 20 discriminative information is lost. L ike other biologically-motivated approaches, our model uses the dense method. E a r l y object classification systems focused on a single object category at a t ime, learning a set of features that best distinguished that category from the background, i.e., from general clutter and al l other object categories. A s systems scale up towards the thousands of categories humans can recognize, having a separate set of features for each object category is clearly not feasible. Tor ra lba et al. 
[40] employ a multiclass version of boosting to learn a set of shared features, and find that the number of features needed for a given level of performance scales roughly logarithmically with the number of categories. All the biologically-inspired models of object classification are shared-feature models.

The issue of feature selection is also explored in models such as that of Ullman et al. [41], which addresses the optimal complexity of features for classification tasks. Highly complex features may be extremely distinctive, but will not occur often enough to be useful; conversely, very simple features may occur frequently but are often not distinctive. For a single-class-vs.-background task, they found that features of "intermediate complexity" (around 10% of object size in their experiments) maximized mutual information. We address a similar issue for our model in chapter 8.

Higher up, at the level of category representations, there are a number of approaches. Constellation models such as that of Fergus et al. [8] encode explicit geometric relationships between object parts, while "bag of features" methods such as those of Csurka et al. [6] and Opelt et al. [28] represent objects as vectors of feature activations, discarding geometric relationships above the feature level. There are a number of approaches in between. The features of Agarwal et al. [1] retain some coarsely-coded location information, while Leibe & Schiele [21] and Berg et al. [3] retain the locations of features relative to the object center.

In a biologically-inspired feature hierarchy, spatial structure is gradually incorporated into the features themselves. In each "S" layer, there are loose geometric constraints on which features may be combined. Moving upward, as spatial information becomes implicitly encoded into more complex, overlapping features of varying sizes, explicit spatial information becomes more coarsely coded.
4 Base Model

This chapter describes the "base" version of our model. The base model is similar to [37] and performs about as well; nevertheless, it is an independent implementation, and we give its complete description here. Its differences from [37] will be listed briefly at the end of this chapter (section 4.4). Larger changes, representing the main contribution of this work, are described in chapter 5.

Footnote 1: An abbreviated version of chapters 4-7 has been published as a conference paper [25].

4.1 Model Overview

The overall form of the model (shown in figure 4.1) is very simple. Images are reduced to feature vectors, which are then classified by an SVM. The dictionary of features is shared across all categories: all images "live" in the same feature space. The model's biological plausibility lies in the feature computation stage.

Figure 4.1: Overall form of our base model. Images are reduced to feature vectors which are then classified by an SVM.

4.2 Feature Computation

Features are computed in five layers: an initial image layer and four subsequent layers, each layer built from the previous by alternating template matching and max pooling operations. It is shown graphically in figure 4.2, and the following sections describe each layer. Note that features in all layers are computed at all positions and scales; interest point detectors are not used.

Figure 4.2: Base model layers. Each layer has units covering three spatial dimensions (x/y/scale), and at each 3D location, an additional dimension of feature type. The image layer has only one type (pixels), layers S1 and C1 have 4 types, and the upper layers have d (many) types per location. Each layer is computed from the previous via convolution with template matching or max pooling filters. Image size can vary and is shown for illustration. (Layers labelled in the figure, bottom to top: image layer, 1 pixel per location; S1 layer, 4 filters giving 4 orientations per location; C1 layer, local max, 4 orientations per location; S2 layer, d features giving d feature responses per location; C2 layer, global max, d feature responses.)

4.2.1 Image Layer

We convert the image to grayscale and scale the shorter edge to 140 pixels while maintaining the aspect ratio. Next we create an image pyramid of 10 scales, each a factor of 2^(1/4) smaller than the last (using bicubic interpolation).

4.2.2 Gabor Filter (S1) Layer

The S1 layer is computed from the image layer by centering 2D Gabor filters with a full range of orientations at each possible position and scale. Our base model follows [37] and uses 4 orientations. Where the image layer is a 3D pyramid of pixels, the S1 layer is a 4D structure, having the same 3D pyramid shape, but with multiple oriented units at each position and scale (see figure 4.2). Each unit represents the activation of a particular Gabor filter centered at that position/scale. This layer corresponds to V1 simple cells.

The Gabor filters are 11x11 in size, and can be described by:

\[ G(x, y) = \exp\left(-\frac{X^2 + \gamma^2 Y^2}{2\sigma^2}\right) \cos\left(\frac{2\pi}{\lambda} X\right) \qquad (4.1) \]

where X = x cos θ − y sin θ and Y = x sin θ + y cos θ. x and y vary between -5 and 5, and θ varies between 0 and π. The parameters γ (aspect ratio), σ (effective width), and λ (wavelength) are all taken from [37] and are set to 0.3, 4.5, and 5.6 respectively. Finally, the components of each filter are normalized so that their mean is 0 and the sum of their squares is 1. We use the same size filters for all scales (applying them to scaled versions of the image).

The response of a patch of pixels X to a particular S1 filter G is given by:

\[ R(X, G) = \left| \frac{\sum_i X_i G_i}{\sqrt{\sum_i X_i^2}} \right| \qquad (4.2) \]

It should be noted that the filters produced by these parameters are quite clipped; in particular, the long axis of the Gabor filter does not diminish to zero before the boundary of the 11x11 array is reached.
Nevertheless, experiments using much larger arrays did not show any effect on overall system performance.

4.2.3 Local Invariance (C1) Layer

This layer pools nearby S1 units (of the same orientation) to create position and scale invariance over larger local regions, and as a result can also subsample S1 to reduce the number of units. For each orientation, the S1 pyramid is convolved with a 3D max filter, 10x10 units across in position (see footnote 2) and 2 units deep in scale. A C1 unit's value is simply the value of the maximum S1 unit (of that orientation) that falls within the max filter. To achieve subsampling, the max filter is moved around the S1 pyramid in steps of 5 in position (but only 1 in scale), giving a sampling overlap factor of 2 in both position and scale. Due to the pyramidal structure of S1, we are able to use the same size filter for all scales. The resulting C1 layer is smaller in spatial extent and has the same number of feature types (orientations) as S1; see figure 4.2. This layer provides a model for V1 complex cells.

Footnote 2: Note that the max filter is itself a pyramid, so its size is 10x10 only at the lowest scale.

4.2.4 Intermediate Feature (S2) Layer

At every position and scale in the C1 layer, we perform template matches between the patch of C1 units centered at that position/scale and each of d prototype patches. These prototype patches represent the intermediate-level features of the model. The prototypes themselves are randomly sampled from the C1 layers of the training images in an initial feature-learning stage. (For the Caltech 101 dataset, we use d = 4,075 for comparison with [37].) Prototype patches are like fuzzy templates, consisting of a grid of simpler features that are all slightly position and scale invariant.
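The S1 and C1 computations of sections 4.2.2 and 4.2.3 can be made concrete with a small sketch. It is not the thesis implementation: it assumes the standard Gabor form (Gaussian envelope times a cosine carrier) for equation 4.1 and a dot product normalized by the patch norm for equation 4.2, and `c1_pool` is simplified to a single scale, whereas the model's max filter also spans 2 adjacent scales. All function names are hypothetical.

```python
import math


def gabor_filter(theta, size=11, gamma=0.3, sigma=4.5, lam=5.6):
    """Build one size x size Gabor filter as a flat, row-major list.

    Assumed form: exp(-(X^2 + gamma^2 Y^2) / (2 sigma^2)) * cos(2 pi X / lam),
    followed by normalization to zero mean and unit sum of squares.
    """
    half = size // 2
    values = []
    for y in range(-half, half + 1):
        for x in range(-half, half + 1):
            big_x = x * math.cos(theta) - y * math.sin(theta)
            big_y = x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(big_x ** 2 + (gamma * big_y) ** 2) / (2 * sigma ** 2))
            values.append(envelope * math.cos(2 * math.pi * big_x / lam))
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]


def s1_response(patch, filt):
    """Absolute dot product of patch and filter, divided by the patch norm."""
    energy = math.sqrt(sum(p * p for p in patch))
    if energy == 0.0:
        return 0.0
    return abs(sum(p * g for p, g in zip(patch, filt))) / energy


def c1_pool(s1_map, size=10, step=5):
    """Max-pool one orientation map (a list of rows) at a single scale."""
    rows, cols = len(s1_map), len(s1_map[0])
    pooled = []
    for top in range(0, max(1, rows - size + 1), step):
        pooled_row = []
        for left in range(0, max(1, cols - size + 1), step):
            window = [s1_map[r][c]
                      for r in range(top, min(top + size, rows))
                      for c in range(left, min(left + size, cols))]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Four orientations between 0 and pi, as in the base model.
filters = [gabor_filter(k * math.pi / 4) for k in range(4)]
```

Because each filter has zero mean, a uniform patch produces a response of essentially zero, and dividing by the patch norm makes the response insensitive to overall contrast.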
During the feature learning stage, sampling is performed by centering a patch of size 4x4, 8x8, 12x12, or 16x16 (x 1 scale) at a random position and scale in the C1 layer of a random training image. The values of all C1 units within the patch are read out and stored as a prototype. For a 4x4 patch, this means 16 different positions, but for each position, there are units representing each of 4 orientations (see figure 4.3). Thus a 4x4 patch actually contains 4x4x4 = 64 C1 unit values.

Figure 4.3: An S2 feature (prototype patch) in the base model. A 4x4 prototype patch is shown. Each prototype is sampled from the C1 layer of a training image at a random position and scale. For each position within the prototype, there are C1 values for each of the four orientations. Stronger C1 values are shown as darker.

Preliminary tests seemed to confirm that multiple feature sizes worked somewhat better than any single size. Smaller (4x4) features can be seen as encoding shape, while larger features are probably more useful for texture.

Since we learn the prototype patches randomly from unsegmented images, many will not actually represent the object of interest, and others may not be useful for the classification task. The weighting of features is left for the later SVM step. It should be noted that while each S2 prototype is learned by sampling from a specific image of a single category, the resulting dictionary of features is shared, i.e., all features are used by all categories.

During normal operation (after feature learning) each of these prototypes can be seen as just another convolution filter which is run over C1. We generate an S2 pyramid with roughly the same number of positions/scales as C1, but having d types of units at each position/scale, each representing the response of the corresponding C1 patch to a specific prototype patch; see figure 4.2.
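The feature-learning step described above amounts to cutting random patches out of C1 layers. A minimal sketch follows, under assumptions the text does not settle: patch size and scale are drawn uniformly from the combinations that fit, the patch is placed by its top-left corner so that it lies fully inside the layer, and the nested-list layout and the name `sample_prototype` are hypothetical.

```python
import random

PATCH_SIZES = (4, 8, 12, 16)


def sample_prototype(c1_pyramid, rng):
    """Cut one n x n prototype out of the C1 pyramid of a training image.

    c1_pyramid[s][y][x] is the list of orientation values at scale s,
    row y, column x. Returns (n, patch), where patch[i][j] is a copy of
    the orientation values at row i, column j of the sampled patch.
    """
    # Only offer size/scale combinations where the patch fits in the layer.
    choices = [(n, s)
               for n in PATCH_SIZES
               for s, layer in enumerate(c1_pyramid)
               if len(layer) >= n and len(layer[0]) >= n]
    n, s = rng.choice(choices)
    layer = c1_pyramid[s]
    top = rng.randrange(len(layer) - n + 1)
    left = rng.randrange(len(layer[0]) - n + 1)
    patch = [[list(layer[top + i][left + j]) for j in range(n)]
             for i in range(n)]
    return n, patch


def make_layer(rows, cols, orientations=4):
    """Toy C1 layer filled with zeros, for demonstration only."""
    return [[[0.0] * orientations for _ in range(cols)] for _ in range(rows)]


toy_pyramid = [make_layer(20, 20), make_layer(17, 17)]
size, prototype = sample_prototype(toy_pyramid, random.Random(0))
```

With 4 orientations, a sampled 4x4 prototype holds 4x4x4 = 64 values, matching the count given in the text.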
The S2 layer is intended to correspond to cortical area V4 or posterior IT.

The response of a patch of C1 units X to a particular S2 feature/prototype P, of size n x n, is given by a Gaussian radial basis function:

\[ R(X, P) = \exp\left(-\frac{\|X - P\|^2}{2\sigma^2\alpha}\right) \qquad (4.3) \]

Both X and P have dimensionality n x n x 4, where n ∈ {4, 8, 12, 16}. As in [37], the standard deviation σ is set to 1 in all experiments.

The parameter α is a normalizing factor for different patch sizes. For larger patches n ∈ {8, 12, 16} we are computing distances in a higher dimensional space; for the distance to be small, there are more dimensions that have to match. We reduce the weight of these extra dimensions by using α = (n/4)^2, which is the ratio of the dimension of P to the dimension of the smallest patch size.

4.2.5 Global Invariance (C2) Layer

Finally we create a d-dimensional vector, each element of which is the maximum response (anywhere in the image) to one of the model's d prototype patches. At this point, all position and scale information has been removed, i.e., we have a "bag of features".

4.3 SVM Classifier

The C2 vectors are classified using an all-pairs linear SVM (see footnote 3). Data is "sphered" before classification: the mean and variance of each dimension are normalized to zero and one respectively (see footnote 4). Test images are assigned to categories using the majority-voting method.

Footnote 3: We use the Statistical Pattern Recognition Toolbox for Matlab [10].
Footnote 4: Suggested by T. Serre (personal communication).

4.4 Differences from Serre et al.

Our base model, as described above, performs only as well as that of Serre et al. in [37], despite several changes that one might expect to improve performance:

• We scale the smaller edge of each image to 140 pixels, as opposed to always scaling the height to 140. We thus avoid making tall "portrait" images very thin.
• Scales in our image pyramid differ multiplicatively.
[37] does not use an image pyramid, but rather applies different-sized S1 filters to the full-scale image. However, these S1 filters differ additively in size. Hence the larger-scale C1 units in [37] have very little scale invariance because they pool S1 units computed using nearly identically sized filters.
• Our C1 subsampling ranges overlap in scale as well as in position. This should make the system more robust to small changes, but has no noticeable effect on the end result.
• We introduce the α parameter (section 4.2.4) to avoid favoring the smallest (4x4) S2 features. One reason this might not be helping is that it appears (in chapter 8) that 4x4 features are more suited to the task at hand anyway.

Our system also contains a simplification that results in no appreciable performance loss. The S1 filter parameters σ and λ in [37] change from scale to scale, in accordance with physiological studies [36]. We find using the same parameters for all scales makes little difference for our purposes.

5 Improvements

This chapter describes four improvements to the base model, representing the main contribution of this work. The improvements are:

1. Sparsification of inputs to S2 units.
2. Inhibition of S1 and C1 unit outputs.
3. Limiting the position and scale invariance of C2 units.
4. Selecting the best S2 features.

Testing results for each modification are provided in chapter 6.

5.1 Sparser S2 Inputs

In the base model, an S2 unit computes its response using all the possible inputs in its corresponding C1 patch. Specifically, at each position in the patch, it is looking at the response to every orientation of Gabor filter and comparing it to its prototype. Real neurons, however, are likely to be more selective among their inputs. To increase sparsity among an S2 unit's inputs, we reduce the number of inputs to an S2 feature to one per C1 position.
In the feature learning phase, we remember the identity and magnitude of the dominant orientation (maximally responding C1 unit) at each of the n x n positions in the patch. This is illustrated in figure 5.1; the resulting 4x4 prototype patch now contains only 16 C1 unit values, not 64.

Figure 5.1: Dense vs. sparse S2 features. Dense S2 prototypes in the base model are sensitive to all orientations of C1 units at each position. Sparse S2 prototypes are sensitive only to a particular orientation at each position. A 4x4 S2 feature for a 4-orientation model is shown here. Stronger C1 unit responses are shown as darker. (Panels: C1 patch; prototype (dense); prototype (sparse).)

When computing responses to such "sparsified" S2 features, equation 4.3 is still used, but with a lower dimensionality: for each position in the patch, the S2 feature only cares about the value of the C1 unit representing its preferred orientation for that position. The lower dimensionality helps to improve generalization.

In conjunction with this we increase the number of Gabor filter orientations in S1 and C1 from 4 to 12. Since we are now looking at particular orientations, rather than combinations of responses to all orientations, it becomes more important to represent orientation accurately. Cells in visual cortex also have much finer gradations of orientation than π/4 [15].

5.2 Inhibited S1/C1 Outputs

Our second modification is similar: we again ignore non-dominant orientations, but here we focus not on pruning S2 feature inputs but on suppressing S1 and C1 unit outputs. In cortex, lateral inhibition refers to units suppressing their less-active neighbors. We adopt a simple version of this between S1/C1 units encoding different orientations at the same position and scale. Essentially these units are competing to describe the dominant orientation at their location.
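The sparsification of section 5.1 can be illustrated with a short sketch. It assumes equation 4.3 has the Gaussian radial basis form exp(−‖X − P‖² / (2σ²α)) with σ = 1 and α = (n/4)², as section 4.2.4 describes; the flattened patch layout and the names `sparsify` and `sparse_s2_response` are hypothetical.

```python
import math


def sparsify(c1_patch):
    """Keep only the dominant orientation at each position of a patch.

    c1_patch[i] lists the C1 values (one per orientation) at position i
    of an n x n patch, flattened row by row. Returns one
    (orientation index, value) pair per position.
    """
    return [max(enumerate(values), key=lambda pair: pair[1])
            for values in c1_patch]


def sparse_s2_response(c1_patch, prototype, sigma=1.0):
    """Gaussian RBF response using only each position's preferred orientation."""
    n = int(round(math.sqrt(len(prototype))))
    alpha = (n / 4) ** 2  # normalizes for patch size, as in section 4.2.4
    dist2 = sum((c1_patch[i][orientation] - value) ** 2
                for i, (orientation, value) in enumerate(prototype))
    return math.exp(-dist2 / (2 * sigma ** 2 * alpha))


# A 4x4 training patch with 4 orientations: 64 values shrink to 16 pairs.
stored = [[0.1, 0.9, 0.2, 0.0]] * 16
proto = sparsify(stored)

# A test patch whose response at the preferred orientation is weaker.
other = [[0.1, 0.4, 0.2, 0.0]] * 16
```

Only the value at each position's stored orientation enters the distance, so responses at the other orientations of the test patch are ignored entirely.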
We define a global parameter h, the inhibition level, which can be set between 0 and 1 and represents the fraction of the response range that gets suppressed. At each location, we compute the minimum and maximum responses, R_min and R_max, over all orientations. Any unit having R < R_min + h(R_max − R_min) has its response set to zero. This process is illustrated in figure 5.2.

Figure 5.2: Inhibition in S1/C1. Before inhibition, the circled unit in the prototype patch is getting some response to its desired orientation, despite the fact other orientations dominate. Inhibition increases the distance to prototype patches looking for non-dominant orientations.

As a result, if a given S2 unit is looking for a response to a vertical filter (for example) in a certain position, but there is a significantly stronger horizontal edge in that rough position, the S2 unit will be penalized.

This enhancement, together with the increased number of orientations (section 5.1), gives us a sparse, overcomplete code in S1 and C1.

5.3 Limited C2 Position/Scale Invariance

Above the S2 level, the base model becomes a "bag of features" [6], disregarding all geometry. The C2 layer simply takes the maximum response to each S2 feature at any position or scale. This gives complete position and scale invariance, but S2 features are still too simple to eliminate binding problems: we are still vulnerable to false positives due to chance co-occurrence of features from different objects and/or background clutter.

We wanted to investigate the option of retaining some geometric information above the S2 level. In fact, neurons in V4 and IT do not exhibit full invariance and are known to have receptive fields limited to only a portion of the visual field and range of scales [33].
To model this, we simply restrict the region of the visual field in which a given S2 feature can be found, relative to its location in the image from which it was originally sampled, to ±t_p% of image size and ±t_s scales, where t_p and t_s are global parameters. This is illustrated in figure 5.3.

Figure 5.3: Limiting the position/scale invariance of C2 units. The solid boxes represent S2 features sampled from this training image. In test images, we will limit the search for the maximum response to each S2 feature to the positions represented by the corresponding dashed box. Scale invariance is similarly limited (although not shown here).

This approach assumes the system is "attending" close to the center of the object. This is appropriate for datasets such as the Caltech 101, in which most objects of interest are at similar positions and scales within the image. For localization of objects within complex scenes, as in the UIUC car database, we augment it with a search for peak responses over object location using a sliding window.

5.4 Feature Selection

Our S2 features are prototype patches randomly selected from unsegmented training images. Many will be from the background (figure 5.4), and others will have varying degrees of usefulness for the classification task. We wanted to find out how many features were actually needed, and whether cutting out less-useful features would improve performance, as we might expect from machine learning results on the value of sparsity.

Figure 5.4: Informative and uninformative features. An S2 feature sampled from the top right of this training image is not likely to be useful for classification.

We use a simple feature selection technique based on SVM normals [24]. In fitting separating hyperplanes, the SVM is essentially doing feature weighting (see figure 5.5). Our all-pairs m-class linear SVM consists of m(m − 1)/2 binary SVMs.
Each fits a separating hyperplane between two sets of points in d dimensions, in which points represent images and each dimension is the response to a different S2 feature. The d components of the (unit length) normal vector to this hyperplane can be interpreted as feature weights; the higher the kth component (in absolute value), the more important feature k is in separating the two classes.

Figure 5.5: Using an SVM for feature weighting. In this simple 2D binary SVM, feature 2 is clearly more useful in separating the classes than is feature 1.

To perform feature selection, we simply drop features with low weight. Since the same features are shared by all the binary SVMs, we do this based on a feature's average weight over all binary SVMs. Starting with a pool of 12,000 features, we conduct a multi-round "tournament". In each round, the SVM is trained, then at most half the features are dropped (see footnote 1). The number of rounds depends on the desired final number of features d. (For performance reasons, earlier rounds are carried out using multiple SVMs, each containing at most 3,000 features.)

Our experiments show that dropping features (effectively setting their weights to zero rather than those assigned by the SVM) improves classification performance, and the resulting model is more economical to compute.

Footnote 1: Depending on the desired number of features it may be necessary to drop less than half per round.

6 Multiclass Experiments - Tuning and Performance on the Caltech 101 Dataset

In this chapter we describe both the tuning of model parameters and the performance of the final model on the basic classification task. Each modification described in chapter 5 has at least one free parameter (number of orientations, degree of inhibition, allowed position/scale variation, and number of features). Using subsets of the Caltech 101 dataset, we arrive at robust values for each of these parameters.
We then evaluate the final model's performance on the full dataset.

6.1 The Caltech 101 Dataset

The Caltech 101 contains 9,197 images comprising 101 different object categories, plus a background category, collected via Google image search by Fei-Fei et al. [7]. Most objects are centered and in the foreground, making this dataset ideal for testing basic classification on a large number of categories. (It has become the unofficial standard benchmark for this task.) Some sample images are shown in figure 6.1.

Figure 6.1: Some images from the Caltech 101 dataset.

6.2 Running the Model

To run our model for the basic classification task, the experimenter first defines a set of parameters including:

• the number of training images per category (15 or 30 for the Caltech 101),
• which modifications (chapter 5) are turned on, and
• parameter values for those modifications (number of orientations, degree of inhibition, etc.)

The system then:

1. chooses the desired number of training images at random from each category, placing remaining images in the test set,
2. learns features at random positions and scales from the training images (an equal number from each image),
3. builds C2 vectors for the training set,
4. trains the SVM (performing feature selection if that option is turned on),
5. builds C2 vectors for the test set, and
6. classifies the test images.

6.3 Parameter Tuning

The complete parameter space for the full set of modifications is too large to search exhaustively, hence we chose an order and optimized each parameter separately before moving to the next. First we turned on S2 input sparsification and found a good number of orientations, then we fixed that number and moved on to find a good inhibition level, etc.

Our goal was to find parameter values that could be used for any dataset, so we wanted to guard against the possibility of tuning parameters to unknown properties specific to the Caltech 101.
This large dataset has enough variety to make this unlikely; nevertheless, we ran tests independently on two disjoint subsets of the categories and chose parameter values that fell in the middle of the good range for both groups (see figure 6.2). The fact that such values were easy to find increases our confidence in the generality of the chosen values. The two groups were constructed as follows:

1. remove the easy faces and background categories,
2. sort the remaining 100 categories by number of images, then
3. place odd numbered categories into group A and even into group B.

The final parameter, number of features, was optimized for all 102 categories. Since models with fewer features can be computed more quickly, we chose the smallest number of features that still gave results close to the best.

Figure 6.2: The results of parameter tuning for various enhancements to the base model using the Caltech 101 dataset. Each data point is the average of 8 independent runs, using 15 training images and up to 100 test images per category. Tests were run independently on two disjoint groups of 50 categories each. The horizontal lines in the leftmost graph show the performance of the base model (dense features, 4 orientations) on the two groups. Tuning is cumulative: the parameter value chosen in each graph is marked by a solid diamond on the x-axis. The results for this parameter value become the starting points (shown as solid data points) for the next graph. (Panel x-axes, left to right: # orientations; inhibition factor h; allowed % position variation; allowed scale variation.)

Figure 6.3: Results for the final model on the entire Caltech 101 dataset for various numbers of features, selected from a pool of 12,000. Each data point is the average of 4 runs with 15 training images and up to 100 test images per category. The horizontal line represents the performance of the same model but with 4,075 randomly selected features and no feature selection.
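The construction of the two tuning groups in section 6.3 can be written down in a few lines. This is only a sketch: the sort direction (ascending), the tie-break by name, and the category identifiers `Faces_easy` and `BACKGROUND_Google` are assumptions made for illustration, as is the function name.

```python
def split_into_groups(image_counts,
                      excluded=("Faces_easy", "BACKGROUND_Google")):
    """Split categories into two disjoint groups for parameter tuning.

    image_counts maps category name to number of images. Categories are
    sorted by image count (ties broken by name), then dealt alternately:
    1st, 3rd, 5th, ... go to group A and 2nd, 4th, 6th, ... to group B.
    """
    remaining = sorted(
        (name for name in image_counts if name not in excluded),
        key=lambda name: (image_counts[name], name),
    )
    return remaining[0::2], remaining[1::2]


# Toy counts, not the real Caltech 101 figures.
toy_counts = {
    "Faces_easy": 400, "BACKGROUND_Google": 450,
    "ant": 40, "crab": 70, "ketch": 110, "lotus": 60,
}
group_a, group_b = split_into_groups(toy_counts)
```

Dealing the sorted list alternately gives both groups a similar spread of category sizes, which is presumably why the categories are sorted before being split.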
The results of parameter tuning are shown in figures 6.2 and 6.3. The chosen parameters were 12 orientations, h = 0.5, t_p = ±5%, t_s = ±1 scale, 1,500 features.

6.4 Multiclass Performance

Table 6.1 shows the performance of our base and final models and compares them with results of previous studies. Each result is the average of 8 independent runs. Our final results for 15 and 30 training images are 51% and 56%.

| Model | 15 training images/cat. | 30 training images/cat. |
|---|---|---|
| Our model (base) | 33 | 41 |
| Serre et al. [37] | 35 | 42 |
| Holub et al. [14] | 37 | 43 |
| Berg et al. [3] | 45 | |
| Grauman & Darrell [13] | 49.5 | 58.2 |
| Our model (final) | 51 | 56 |

Table 6.1: Our results for the Caltech 101 dataset along with those of previous studies. Scores for our model are the average of 8 independent runs using all available test images. Scores shown are the average of the per-category classification rates.

Note that subsequent work by other groups [19, 42] has exceeded this performance level. These studies use improved kernels for the SVM classifier (as does Grauman & Darrell [13]). It will be interesting to see whether these ideas can be successfully combined with our sparse image features to get further improvements.

Table 6.2 shows the contribution to performance of each successive modification.

| Model | 15 training images/cat. | 30 training images/cat. |
|---|---|---|
| Base | 33 | 41 |
| + sparse S2 inputs | 35 (+2) | 45 (+4) |
| + inhibited S1/C1 outputs | 40 (+5) | 49 (+4) |
| + limited C2 invariance | 48 (+8) | 54 (+5) |
| + feature selection | 51 (+3) | 56 (+2) |

Table 6.2: The contribution of our successive modifications to the overall classification score. Each score is the average of 8 independent runs using all available test images. Scores shown are the average of the per-category classification rates.

Figure 6.4 contains some examples of categories for which the system performed well, while figure 6.5 illustrates some difficult categories. In general, the harder categories are those having greater shape variability due to greater intra-class variation and nonrigidity. Interestingly, the frequency of occurrence of background clutter in a category's images does not seem to be a significant factor.

Figure 6.4: Some example images from easier categories.

Figure 6.5: Some example images from difficult categories.

Table 6.3 shows the classification rate and most common error for each category. Notably, most of these errors are not outrageous by human standards. The most common confusions are schooner vs. ketch (indistinguishable by non-expert humans) and lotus vs. water lily (vaguely similar flowers).

The background category is the least recognized of all. This is not surprising, as our system does not currently have a special case for "none of the above". Background is treated as just another category, and the system attempts to learn it from only 30 exemplars.
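The scores in tables 6.1 and 6.2 are averages of per-category classification rates rather than the overall fraction of correct test images. The distinction matters when categories have very different numbers of test images. A minimal sketch of the metric, with a hypothetical function name and toy labels:

```python
from collections import defaultdict


def mean_per_category_rate(true_labels, predicted_labels):
    """Average the classification rate of each category.

    Returns (mean rate, {category: rate}). Every category counts equally,
    however many test images it has.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    rates = {category: correct[category] / total[category]
             for category in total}
    return sum(rates.values()) / len(rates), rates


# Toy example: a large easy category and a small hard one.
truth = ["airplanes"] * 8 + ["ibis"] * 2
guess = ["airplanes"] * 8 + ["airplanes", "ibis"]
mean_rate, per_category = mean_per_category_rate(truth, guess)
overall = sum(t == g for t, g in zip(truth, guess)) / len(truth)
```

In this toy case 9 of 10 images are correct overall, but the per-category average is lower because the small category is only half right.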
| Rank | Category | Rate | StDev | Most Common Error | Rate |
|---|---|---|---|---|---|
| 1 | car side | 98.39 | 1.52 | | |
| 2 | Faces easy | 97.87 | 0.37 | | |
| 3 | Motorbikes | 96.81 | 0.61 | | |
| 4 | minaret | 94.02 | 2.79 | | |
| 5 | Faces | 93.52 | 2.09 | | |
| 6 | trilobite | 89.96 | 5.13 | | |
| 7 | airplanes | 89.85 | 1.87 | | |
| 8 | grand piano | 86.23 | 4.31 | | |
| 9 | yin yang | 85.42 | 8.53 | | |
| 10 | laptop | 84.80 | 4.16 | | |
| 11 | revolver | 79.09 | 2.80 | | |
| 12 | menorah | 77.63 | 4.67 | | |
| 13 | ferry | 69.59 | 6.42 | | |
| 14 | dragonfly | 69.41 | 7.16 | | |
| 15 | euphonium | 68.38 | 7.16 | | |
| 16 | ketch | 67.56 | 3.70 | schooner | 17.11 |
| 17 | stop sign | 66.54 | 5.87 | | |
| 18 | soccer ball | 66.54 | 7.85 | nautilus | 5.88 |
| 19 | electric guitar | 66.11 | 6.26 | bass | 5.83 |
| 20 | schooner | 65.15 | 8.26 | ketch | 19.32 |
| 21 | watch | 64.65 | 1.36 | | |
| 22 | Joshua tree | 63.97 | 5.83 | | |
| 23 | buddha | 62.50 | 8.58 | | |
| 24 | umbrella | 61.94 | 5.87 | lamp | 5.28 |
| 25 | Leopards | 59.93 | 4.66 | crocodile | 5.44 |
| 26 | ewer | 58.18 | 6.45 | strawberry | 5.00 |
| 27 | brain | 58.09 | 5.50 | | |
| 28 | dalmatian | 53.72 | 5.49 | garfield | 5.07 |
| 29 | helicopter | 53.23 | 6.55 | cannon | 5.17 |
| 30 | cougar face | 52.88 | 6.56 | flamingo head | 5.45 |
| 31 | chair | 50.78 | 11.30 | Windsor chair | 5.08 |
| 32 | hawksbill | 50.54 | 6.52 | crab | 5.71 |
| 33 | chandelier | 50.00 | 3.47 | scissors | 5.36 |
| 34 | bonsai | 49.74 | 4.75 | | |
| 35 | lamp | 49.60 | 6.66 | flamingo | 6.85 |
| 36 | sunflower | 49.55 | 7.39 | starfish | 5.68 |
| 37 | butterfly | 48.98 | 3.66 | | |
| 38 | elephant | 48.90 | 7.53 | brontosaurus | 8.46 |
| 39 | crayfish | 45.31 | 7.84 | lobster | 7.50 |
| 40 | dolphin | 42.14 | 4.52 | bass | 5.00 |
| 41 | llama | 39.84 | 7.91 | kangaroo | 6.51 |
| 42 | kangaroo | 38.84 | 6.87 | sea horse | 6.47 |
| 43 | flamingo | 38.18 | 7.84 | emu | 6.42 |
| 44 | lotus | 37.50 | 7.12 | water lilly | 18.75 |
| 45 | starfish | 34.82 | 3.70 | | |
| 46 | crab | 30.81 | 4.93 | crocodile | 7.85 |
| 47 | ibis | 28.25 | 9.41 | emu | 7.50 |
| 48 | scorpion | 25.00 | 5.24 | ant | 8.80 |
| 49 | BACKGROUND | 12.17 | 2.32 | | |

Table 6.3: Per-category classification rates and most common errors for the Caltech 101 dataset, for the final model using 30 training images per category, averaged over 8 runs. Only categories having at least 30 remaining test images are shown. The most common error is shown only if it occurs at least 5% of the time.
7 Localization Experiments (UIUC Car Dataset)

This chapter describes our experiments on the UIUC car dataset [1], in which we used our final, tuned model to tackle the single-class localization problem. These experiments served two purposes.

• Our introduction of limited C2 invariance (section 5.3) sacrificed full invariance to object position and scale within the image; we wanted to see if we could recover it.
• We wanted to demonstrate that the model, and the parameters learned during the tuning process, worked equally well on another dataset.

7.1 The UIUC Car Dataset

The UIUC car dataset consists of small (100x40) training images of cars and background, and larger test images in which there is at least one car to be found. There are two sets of test images: a single-scale set in which the cars to be detected are roughly the same size (100x40 pixels) as those in the training images, and a multiscale set.

7.2 Model Parameters

Other than the number of features, all parameters were unchanged. The number of features was arbitrarily set to 500 and immediately yielded excellent results. We did not attempt to optimize system speed by reducing this number as we did in the multiclass experiments. As before, the features were selected from a group of randomly-sampled features eight times larger, 4000 in this case, and the selection process comprised 3 rounds. Features were compared in groups of at most 1000. See section 5.4 for details.

We trained the model using 500 positive and 500 negative training images; features were sampled from these same images.

7.3 Sliding Window

For localization in these larger test images we added a sliding window. As in [1], the sliding window moves in steps of 5 pixels horizontally and 2 vertically. In the multiscale case this is done at every scale using these same step sizes, although at larger scales there are fewer pixels, each representing more of the image.
Hence there are fewer window positions at larger scales.

For efficiency reasons, levels S1, C1, and S2 are computed once for the entire image. Then a C2 vector is computed for each position of the sliding window. When computing the C2 vector, the scope of the C2 layer's MAX operation is limited to the sliding window, and feature position and scale ranges (section 5.3) are considered relative to the window frame.

Duplicate detections were consolidated using the neighborhood suppression algorithm from [1]. We increase the width of a "neighborhood" from 71 to 111 pixels to avoid merging adjacent cars.

7.4 Results

Our results are shown in table 7.1 along with those of other studies. Our recall at equal-error rates (recall = precision) is 99.94% for the single-scale test set and 90.6% for the multiscale set, averaged over 8 runs. Scores were computed using the scoring programs provided with the UIUC data.

| Model | Single-scale | Multiscale |
|---|---|---|
| Agarwal et al. [1] | 76.5 | 39.6 |
| Leibe et al. [21] | 97.5 | |
| Fritz et al. [11] | | 87.8 |
| Our model | 99.94 | 90.6 |

Table 7.1: Our results (recall at equal-error rates) for the UIUC car dataset along with those of previous studies. Scores for our model are the average of 8 independent runs. Scoring methods were those of [1].

In our single-scale tests, 7 of 8 runs scored a perfect 100%: all 200 cars in 170 images were detected with no false positives. To be considered correct, the detected position must lie inside an ellipse centered at the true position, having horizontal and vertical axes of 25 and 10 pixels respectively. Repeated detections of the same object count as false positives. Figure 7.2 shows the only errors from the 8th run; figure 7.1 shows some correct single-scale detections.

For the multiscale tests, the sliding window also searches through scale, and the scoring criteria include a scale tolerance (from [1]).
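The sliding-window scan and the single-scale correctness test can be sketched as follows. Two points are assumptions rather than statements from the text: the window is taken to be the 100x40 size of the training images, and the ellipse "axes" of 25 and 10 pixels are treated as semi-axis lengths. The function names are hypothetical, and the scale tolerance used in the multiscale case is not modelled.

```python
def window_positions(width, height, win_w=100, win_h=40, step_x=5, step_y=2):
    """List top-left corners of every window that fits inside the image.

    The window moves 5 pixels at a time horizontally and 2 vertically.
    """
    return [(x, y)
            for y in range(0, height - win_h + 1, step_y)
            for x in range(0, width - win_w + 1, step_x)]


def is_correct(detected, truth, semi_x=25.0, semi_y=10.0):
    """True if the detected position lies inside the tolerance ellipse.

    detected and truth are (x, y) positions; the ellipse is centered on
    the true position.
    """
    dx = detected[0] - truth[0]
    dy = detected[1] - truth[1]
    return (dx / semi_x) ** 2 + (dy / semi_y) ** 2 <= 1.0


positions = window_positions(200, 100)
```

A 200x100 image yields 21 horizontal by 31 vertical window positions. The tolerance is much tighter vertically than horizontally, so a detection 20 pixels off horizontally is accepted while one 11 pixels off vertically is not.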
Figures 7.3 and 7.4 show some correct detections and some errors on the multiscale set. Table 7.2 contains a breakdown of the types of errors made. Even in the multiscale case, outright false positives and missed detections are uncommon. Most of the errors are due to the following two reasons.

1. Two cars are detected correctly, but their bounding boxes overlap. This is more common in the multiscale case; see for example figure 7.4, bottom left. The neighbourhood suppression algorithm eliminates one of them. Since the detector itself is the main focus of this work, this kind of error is not a great concern.
2. For certain instances of cars, the peak response, i.e., the highest-responding placement of the bounding box, occurs at a scale somewhat larger or smaller than that of the best bounding box. This is considered a missed detection (and a false positive) by the scoring algorithm [1].

Figure 7.1: Some correct detections from one run on the single-scale UIUC car dataset.

Figure 7.2: The only 2 errors (1 missed detection, 1 false positive) made in 8 runs on the single-scale UIUC car dataset.

Figure 7.3: Some correct detections from one run on the multiscale UIUC car dataset.

Source of error               Number of test images
Simple false positive         1
Simple false negative         1
Suppression due to overlap    6
Detection at wrong scale      6

Table 7.2: Frequency of error types for one run on the multiscale UIUC car dataset.

Figure 7.4: Examples of the kinds of errors made for one run on the multiscale UIUC car dataset. Top left: a simple false positive. Top right: a simple false negative. Bottom left: the second car is suppressed due to overlapping bounding boxes. Bottom right: the car is detected but the scale is slightly off.

8 Analysis of Features

In this chapter we take a closer look at the kinds of features that are being selected.
8.1 The Value of S2 Features

Most of this model's biological plausibility is in the way the S2 features are being computed: starting with Gabor filters and then building up invariance and complexity, using lateral inhibition along the way, etc. This raises an important question. How much of the model's performance is due to these particular features, and how much is due to the rest of the model - i.e., the large number of features, the powerful SVM classifier, and the feature selection step?

To test this, we trained a "stub" version of the model on the Caltech 101 dataset. In the stub version, layer S2 is computed directly from the image layer; see figure 4.2. Feature learning is performed by sampling patches of pixels. When computing responses to an image, S2 prototypes (now just simple image patches) are compared to candidate patches using normalized cross-correlation.

Note that without the intervening C1 layer, which was performing subsampling, the number of units in S2 would be much larger than in the full model. To keep the number comparable, we tried two different approaches.

1. Subsampling the image down to the size of the full model's C1 layer.
2. Increasing the size of prototype patches and moving them across the image in larger steps.

The best classification rate for the stub model was 37% (for 30 training images per category), down from 56% for the full model. This is strong evidence that the S2 features in the full model really are doing something useful.

8.2 Visualizing Features

Because S2 features are not directly made up of pixels, but rather C1 units, it is not possible to uniquely show what they "look like". However, it is possible to find the image patches in the test set to which a given feature responds most strongly. Figures 8.1 through 8.6 show exactly this for a particular run on the Caltech 101 dataset.
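The idea of ranking patches by feature response can be sketched in a few lines of Python. For simplicity the response used here is the normalized cross-correlation of the stub model in section 8.1, applied to raw pixel patches; the full model's S2 response over C1 units is not reproduced. The function names and the zero-mean form of the correlation are assumptions for illustration, not the thesis implementation.

```python
import math


def normalized_cross_correlation(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-sized patches.

    Patches are lists of rows of numbers. Returns a value in [-1, 1],
    or 0.0 if either patch is constant (zero variance).
    """
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    if not a or len(a) != len(b):
        raise ValueError("patches must be non-empty and the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / norm


def strongest_patches(prototype, patches, k=40):
    """Return (index, response) for the k patches with the highest response."""
    scored = [(normalized_cross_correlation(prototype, p), i)
              for i, p in enumerate(patches)]
    # Highest response first; ties are broken by original patch index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]
```

The default of k=40 mirrors the 40 patches displayed per feature in the figures.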
We ranked the 1,500 features which survived the feature selection step (section 5.4) by their average weight across all binary SVMs - the same criterion by which selection was performed. Figure 8.1 shows the 40 patches to which feature #1 responds most strongly. Feature #1 is the feature with the highest average weight - the most informative feature overall under our selection scheme. Figures 8.2 through 8.6 show the strongest 40 patches for some other features.

For most features, these highest-responding patches do not all come from one object category, although there are often a few commonly recurring categories. S2 features are still rather weak classifiers on their own.

Figure 8.2: Best image patches for feature #2, from one run on the Caltech 101 dataset. The top left patch represents the highest response. Note that this feature seems to be responding to a strong edge introduced into this category by artificial rotation of the images - an unfortunate flaw of this dataset.

Figure 8.4: Best image patches for feature #101, from one run on the Caltech 101 dataset. The top left patch represents the highest response.

Figure 8.6: Best image patches for feature #1001, from one run on the Caltech 101 dataset. The top left patch represents the highest response.

8.3 Feature Sharing Across Categories

For any given feature, there will be one category for which it "votes" the most strongly. In fact it is possible to sort all the categories - in the case of the Caltech 101, from 1 to 102 - in order of how strongly the particular feature votes for them. In figure 8.7 we show, for a few chosen features from one run, how rapidly each feature's influence tails off from the 1st to 102nd category. This gives us a feel for how much features are being shared among categories.
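As a rough illustration of how such a per-category profile might be computed for one feature, the Python sketch below follows the rule given in the caption of figure 8.7: average the feature's weight over all binary SVMs involving a category, counting negative weights as zero. The data layout (a dictionary keyed by category pairs), the sign convention (a positive weight favours the first category of the pair, a negative weight the second), and the function name are all assumptions for illustration.

```python
from collections import defaultdict


def category_vote_profile(pair_weights):
    """Sort categories by how strongly one feature votes for them.

    pair_weights maps (cat_a, cat_b) to the feature's weight in that
    binary SVM. Assumed convention: w > 0 votes for cat_a, w < 0 for cat_b.
    Returns a list of (category, average_vote), strongest first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (cat_a, cat_b), w in pair_weights.items():
        totals[cat_a] += max(w, 0.0)    # votes against cat_a count as zero
        totals[cat_b] += max(-w, 0.0)   # votes against cat_b count as zero
        counts[cat_a] += 1
        counts[cat_b] += 1
    profile = {c: totals[c] / counts[c] for c in counts}
    return sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
```

Plotting the second element of each returned pair against its rank would give a dropoff curve of the kind shown for each feature.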
Most features seem fairly strong for a significant number of categories (at least 20). Notably, the features ranked as most important overall by the feature selection process (section 5.4) tend to be features that vote very strongly for one or two categories over all the others. Features 1-6 are shown in the top half of figure 8.7; the general trend holds for the top 100 or so features.

8.4 Utility of Different Feature Sizes

Recall that S2 features come in various sizes: 4x4, 8x8, 12x12, and 16x16. It turns out that 4x4 features are significantly favoured by the feature selection process, i.e., they are the most informative features for this task [41]. This can be seen in figure 8.8, which shows the percentage of each feature size remaining after feature selection. (Note that we do not show absolute numbers because smaller features are also somewhat preferentially favoured in the original sampling process due to edge effects.) Over 20% of the 4x4 features originally sampled survived the feature selection phase, as compared to just over 5% of the 16x16 features.

An interesting exception to this occurs in the first 100 or so features - the

[Figure 8.7 panels: Feature 1 (12x12), w=1.0000; Feature 2 (4x4), w=0.9896; Feature 3 (4x4), w=0.9298; Feature 4 (12x12); Feature 5 (4x4); Feature 6 (8x8); Feature 501 (4x4), w=0.7916; Feature 502 (8x8), w=0.7915; Feature 503 (4x4), w=0.7912; Feature 1001 (4x4), w=0.7583; Feature 1002 (4x4), w=0.7582; Feature 1003 (4x4), w=0.7582. In each panel the y axis runs from 0 to 0.1 and the x axis is category rank.]

Figure 8.7: Feature sharing among categories, for a few features from one run on the Caltech 101 dataset.
The y axis for each subplot represents the weight with which the feature votes for each category (averaged across all binary SVMs involving the category; negative weights count as zero). Categories (the x axis) are sorted by this value to show the rate of dropoff. The leftmost category for each subplot is the category for which the feature votes most strongly. w represents each feature's overall weight (the area under the curve), relative to that of feature #1.

Figure 8.8: Percentage of each size of feature (4x4, 8x8, 12x12, 16x16) remaining after feature selection.

features ranked as most informative overall by the feature selection process. As shown in figure 8.9, larger features are much more common in the top 100.

Figure 8.9: Relative proportions of feature sizes by selection rank (x axis: feature rank, in bins from 1-250 to 1251-1500). 4x4 features dominate except among the very top-ranked features. This is best seen in the top graph.

9 Other Experiments (Graz Datasets)

This chapter describes our experiments on the more difficult images of the Graz-02 datasets.

9.1 The Graz-02 Datasets

We ran some additional tests on the Graz-02 datasets [28] (bikes, cars, and people). Each image for a given category contains one or more instances of that category only. In other words, "bikes" images contain only bikes, "cars" images contain only cars, etc. There is also a "background" category whose images do not contain instances of any of these categories.

As in other studies utilizing these datasets, we perform only single-category-vs-background tests. For the "bikes" category, the task is simply to tell whether or not a given image contains a bicycle. Unlike the UIUC tests, the particular location within the image is not important.

While the task may be simpler, the individual category datasets are harder than those of the Caltech 101 or UIUC datasets.
They contain greater pose variability, and there are more instances of partial occlusion.

9.2 Training the Model

Our model (with localized C2 features; see section 5.3) needs to be trained on centred images. We make use of the annotations, provided with the Graz-02 data, to randomly select 50 square subimages containing a single object each. The images they came from are set aside, i.e., they are unavailable for testing. We then create a left-right flipped copy of each positive subimage, giving us a total of 100 positive subimages for training.

Next, 500 negative (background) subimages are taken from randomly-selected images in the "background" category. Each subimage is extracted from a randomly-chosen bounding box, equal in size to the average bounding box size of the positive training examples. See figure 9.1.

The feature dictionary contains 1,000 features (selected in 3 rounds from 8,000 features). All other parameters are again unchanged.

9.3 Testing the Model

All remaining images (from the positive category and the background category) form the test set. For each test image, we slide a square window through all positions and scales. Images are classified as positive or negative based on the peak detector response over all window locations.

9.4 Results

Our results for the single-category-vs-background test for each of the three categories are shown in table 9.1. As was the case in [28], we do not do as well on these more difficult images, and our scores are somewhat lower than those of [28].

Figure 9.1: Subimages used to train the Graz-02 "bikes" classifier.

Dataset (vs. background)   Our score   Opelt et al. [28]
Bikes                      72          78
Cars                       64          71
People                     81          81

Table 9.1: Our results for the Graz-02 datasets.

Part of the discrepancy may be due to the fact we are actually solving the more difficult problem of localization, but then throwing the location away.
We are also using fewer training instances than [28], and our method of selecting subimages for training may also be skewing things somewhat. It is possible that by taking many of the easily-separable single-object subimages for training, we are leaving ourselves with a slightly harder test set. Finally, it is possible that a wider range of C2 position/scale tolerance would be appropriate for this more varied dataset.

These results should thus be viewed as preliminary. In any case, this dataset shows there is still room for improvement on difficult single-class-vs.-background tasks.

10 Discussion and Future Work

10.1 Summary

In this study we have shown that a biologically-based model can compete with other state-of-the-art approaches to object classification, strengthening the case for investigating biologically-motivated approaches to object recognition. Even with our enhancements, this model is still relatively simple.

The system implemented here is not real-time; it takes several seconds to process and classify an image on a 2GHz Intel Pentium server. Hardware advances will reduce this to immediate recognition speeds within a few years. Biologically motivated algorithms also have the advantage of being susceptible to massive parallelization. Localization in larger images takes longer; in both cases the bulk of the time is spent building feature vectors.

We have found increasing sparsity to be a fruitful approach to improving generalization performance. Our methods for increasing sparsity have all been motivated by approaches that appear to be incorporated in biological vision, although we have made no attempt to model biological data in full detail. Given that both biological and computer vision systems face the same computational constraints arising from the data, we would expect computer vision research to benefit from the use of similar basis functions for describing images.
Our experiments show that both lateral inhibition and the use of sparsified intermediate features contribute to generalization performance.

We have also examined the issue of feature localization in biologically based models. While very precise geometric constraints may not be useful for broad object categories, there is a substantial loss of useful information in completely ignoring feature location as in bag-of-features models. We have shown a considerable increase in performance by using intermediate features that are localized to small regions of an image relative to an object coordinate frame.

When an object may appear at any position or scale in a cluttered image, it is necessary to search over all potential reference frames to combine appropriately localized features. In biological vision this attentional search appears to be driven by a complex range of saliency measures [33]. For our computer implementation, we can simply search over a densely sampled set of possible reference frames and evaluate each one. This has the advantage of not only improving classification performance but also providing quite accurate localization of each object. The strong performance shown on the UIUC car localization task indicates the potential for further work in this area.

10.2 Future Work

Most of the performance improvements for our model were due to the feature computation stage. Other recent multiclass studies [19, 42] have done well by improving the SVM classifier stage. From a pure performance point of view, the most immediately fruitful direction might be to try to combine these ideas into a single system. However, as we do not wish to stray too far from what is clearly a valuable source of inspiration, we lean towards future enhancements that are biologically realistic. Our ultimate goal is to emulate the process of object classification by humans.
The initial, feedforward mode of classification is the obvious first step. A recent, updated model by Serre et al. [34] - having a slightly deeper feature hierarchy that better corresponds to known connectivity between areas in the ventral stream - has been able to match human performance levels in the classic animal/non-animal task of Thorpe et al. [39]. This model has yet to be tested on a large multiclass problem like the Caltech 101.

Even with a perfect model of human feedforward recognition, it is not clear what level of performance we could expect. More psychophysical study of the boundary between what humans can do in the initial feedforward stage and what requires recurrent processing (serial attention, integration of gist, context, and top-down expectation) is needed.

The feedforward model is probably still far from complete. Features in the various layers can probably be modeled more accurately, and the feature learning stage is still very crude. Within-layer interactions such as contour integration are absent. Higher-order features or view-tuned units might improve performance under wide variations in viewpoint.

Nevertheless, there will come a point at which it is appropriate to begin introducing back-projections into the model from higher levels in the ventral stream and also from other brain areas, visual and non-visual. This growth in model complexity will have to be managed with great care, as there are many recurrent connections in the brain, connecting many areas. This parallels the experience of many AI researchers, who have found that it is very difficult to solve any one problem in isolation.

Bibliography

[1] Shivani Agarwal, Aatif Awan, and Dan Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, 26(11):1475-1490, November 2004.
[2] E. B. Baum, J. Moody, and F. Wilczek.
Internal representations for associative memory. Biological Cybernetics, 59:217-228, 1988.
[3] Alexander C. Berg, Tamara L. Berg, and Jitendra Malik. Shape matching and object recognition using low distortion correspondence. In CVPR, June 2005.
[4] G. Bouchard and Bill Triggs. Hierarchical part-based visual object categorization. In CVPR, June 2005.
[5] Patricia S. Churchland and Terrence J. Sejnowski. The Computational Brain. The MIT Press, 1992.
[6] G. Csurka, C. Dance, J. Willamowski, L. Fan, and C. Bray. Visual categorization with bags of keypoints. In ECCV International Workshop on Statistical Learning in Computer Vision, Prague, 2004.
[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In CVPR Workshop on Generative-Model Based Vision, 2004.
[8] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.
[9] M. Figueiredo. Adaptive sparseness for supervised learning. PAMI, 25(9):1150-1159, September 2003.
[10] Vojtech Franc and Vaclav Hlavac. Statistical pattern recognition toolbox for Matlab.
[11] Mario Fritz, Bastian Leibe, Barbara Caputo, and Bernt Schiele. Integrating representative and discriminative models for object category detection. In ICCV, pages 1363-1370, Beijing, China, October 2005.
[12] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193-202, April 1980.
[13] K. Grauman and T. Darrell. Pyramid match kernels: Discriminative classification with sets of image features. Technical Report MIT-CSAIL-TR-2006-020, March 2006.
[14] A. D. Holub, M. Welling, and P. Perona.
Exploiting unlabelled data for hybrid object classification. In NIPS Workshop on Inter-Class Transfer, Whistler, B.C., December 2005.
[15] D. H. Hubel and T. N. Wiesel. Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148:574-591, 1959.
[16] Frederic Jurie and Bill Triggs. Creating efficient codebooks for visual recognition. In ICCV, 2005.
[17] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. PAMI, 27(6):957-968, 2005.
[18] I. Lampl, D. Ferster, T. Poggio, and M. Riesenhuber. Intracellular measurements of spatial integration and the max operation in complex cells of the cat primary visual cortex. Journal of Neurophysiology, 92(5):2704-13, Nov 2004.
[19] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, June 2006.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.
[21] Bastian Leibe, Ales Leonardis, and Bernt Schiele. Combined object categorization and segmentation with an implicit shape model. In ECCV Workshop on Statistical Learning in Computer Vision, pages 17-32, Prague, Czech Republic, May 2004.
[22] N. K. Logothetis, J. Pauls, and T. Poggio. Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5:552-563, 1995.
[23] David G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150-1157, Corfu, Greece, September 1999.
[24] Dunja Mladenic, Janez Brank, Marko Grobelnik, and Natasa Milic-Frayling. Feature selection using linear classifier weights: Interaction with classification models.
In The 27th Annual International ACM SIGIR Conference (SIGIR 2004), pages 234-241, Sheffield, UK, July 2004.
[25] Jim Mutch and David G. Lowe. Multiclass object recognition with sparse, localized features. In CVPR, pages 11-18, New York, June 2006.
[26] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.
[27] B. A. Olshausen and D. J. Field. How close are we to understanding v1? Neural Computation, 17:1665-1699, 2005.
[28] A. Opelt, A. Pinz, M. Fussenegger, and P. Auer. Generic object recognition with boosting. PAMI, 28(3), March 2006.
[29] M. C. Potter. Meaning in visual search. Science, 187:965-966, 1975.
[30] Maximilian Riesenhuber. The HMAX homepage.
[31] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019-1025, 1999.
[32] Maximilian Riesenhuber and Tomaso Poggio. Models of object recognition. Nature Neuroscience, 3(supp):1199-1204, 2000.
[33] Edmund T. Rolls and Gustavo Deco. The Computational Neuroscience of Vision. Oxford University Press, 2001.
[34] T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio. A theory of object recognition: Computations and circuits in the feedforward path of the ventral stream in primate visual cortex. Technical Report CBCL Paper #259/AI Memo #2005-036, Massachusetts Institute of Technology, Cambridge, MA, October 2005.
[35] T. Serre, J. Louie, M. Riesenhuber, and T. Poggio. On the role of object-specific features for real world object recognition in biological vision. In Workshop on Biologically Motivated Computer Vision, Tubingen, Germany, November 2002.
[36] T. Serre and M. Riesenhuber.
Realistic modeling of simple and complex cell tuning in the HMAX model, and implications for invariant object recognition in cortex. Technical Report CBCL Paper #239/AI Memo #2004-017, Massachusetts Institute of Technology, Cambridge, MA, July 2004.
[37] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. In CVPR, San Diego, June 2005.
[38] Nicholas V. Swindale. How many maps are there in visual cortex? Cerebral Cortex, 10(7):633-643, July 2000.
[39] S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381:520-522, 1996.
[40] Antonio Torralba, Kevin Murphy, and William Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In CVPR, pages 762-769, 2004.
[41] S. Ullman, M. Vidal-Naquet, and E. Sali. Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5(7):682-687, 2002.
[42] Hao Zhang, Alex Berg, Michael Maire, and Jitendra Malik. Svm-knn: Discriminative nearest neighbor classification for visual category recognition. In CVPR, June 2006.

