Data Famine in Big Data EraMachine Learning Algorithms for Visual Object Recognition withLimited Training DatabyZhenyu GuoB.E. Zhejiang University, 2009A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University Of British Columbia(Vancouver)March 2014c© Zhenyu Guo, 2014AbstractBig data is an increasingly attractive concept in many fields both in academia and inindustry. The increasing amount of information actually builds an illusion that weare going to have enough data to solve all the data driven problems. Unfortunatelyit is not true, especially for areas where machine learning methods are heavily em-ployed, since sufficient high-quality training data doesn’t necessarily come withthe big data, and it is not easy or sometimes impossible to collect sufficient train-ing samples, which most computational algorithms depend on. This thesis mainlyfocuses on dealing situations with limited training data in visual object recognition,by developing novel machine learning algorithms to overcome the limited trainingdata difficulty.We investigate three issues in object recognition involving limited training data:1. one-shot object recognition, 2. cross-domain object recognition, and 3. objectrecognition for images with different picture styles. For Issue 1, we propose anunsupervised feature learning algorithm by constructing a deep structure of thestacked Hierarchical Dirichlet Process (HDP) auto-encoder, in order to extract “se-mantic” information from unlabeled source images. For Issue 2, we propose aDomain Adaptive Input-Output Kernel Learning algorithm to reduce the domainshifts in both input and output spaces. For Issue 3, we introduce a new probleminvolving images with different picture styles, successfully formulate the relation-ship between pixel mapping functions with gradient based image descriptors, andalso propose a multiple kernel based algorithm to learn an optimal combination ofbasis pixel mapping functions to improve the recognition accuracy. For all the pro-posed algorithms, experimental results on publicly available data sets demonstratethe performance improvements over previous state-of-arts.iiPrefaceThis thesis is written based on a collection of manuscripts. The majority of theresearch, including literature survey, algorithm development and implementation,numerical simulation, real data analysis, and manuscript writing, were conductedby the author, with suggestions from Prof. Z. Jane Wang. The related manuscriptswere primarily drafted by the author, with helpful revisions and comments fromProf. Z. Jane Wang.Chapter 2 is based on the following manuscripts:• Zhenyu Guo and Z. Jane Wang, “An Unsupervised Hierarchical FeatureLearning Framework for One-Shot Image Recognition,” IEEE Transactionson Multimedia, 15(3): 621-632, 2013.• Zhenyu Guo and Z. Jane Wang, “One-shot recognition using unsupervisedattribute-learning,” in The 4th Pacific-Rim Symposium on Image and VideoTechnology (PSIVT), pp. 1-6, 2010.Chapter 3 is based on the following manuscripts:• Zhenyu Guo and Z. Jane Wang, “Cross-domain object recognition via input-output kernel analysis,” IEEE Transactions on Image Processing, 22(8):3108-3119, 2013.• Zhenyu Guo and Z. Jane Wang, “Cross-domain object recognition by outputkernel learning,” in The 14th International Workshop on Multimedia SignalProcessing (MMSP), pp. 372-377, 2012.iiiChapter 4 is based on the following manuscript:• Zhenyu Guo and Z. Jane Wang, “An Adaptive Descriptor Design for ObjectRecognition in the Wild,” in IEEE International Conference on ComputerVision (ICCV), Sydney, Australia, Dec 1-8, 2013.• Zhenyu Guo and Z. Jane Wang, “Why pixel matters? An Adaptive Descrip-tor Design,” submitted, 2013.Over the course of his PhD, Zhenyu Guo has participated in a number of otherprojects leading to some publications. While these other projects are somehowrelated to the topic of this thesis, the following publications are not included due tothe coherence concern to the main focus of limited training data.• Zhenyu Guo, Z. Jane Wang, and Ali Kashani, “Home Appliance Load Mod-eling from Aggregated Smart Meter Data,” under revision, 2013.• Zhenyu Guo and Z. Jane Wang, “Physiological Parameter Monitoring forDrivers Based on Video and Independent Vector Analysis,”, in IEEE Interna-tional Conference on Acoustics, Speech and Signal Processing (ICASSP),toappear, 2014.• Zhenyu Guo and Z. Jane Wang, “Metric based Gaussian kernel learning forclassification,” in IEEE International Conference on Acoustics, Speech andSignal Processing (ICASSP), pp. 3582-3586, 2013.• Keith Edmonds, Ali Kashani, Zhenyu Guo and Janice Cheam, “Demand Re-sponse Optimization,” Patent No. US 61/811670, filed in April, 2013.• Ali Kashani, Janice Cheam, Jonathan Hallam, and Zhenyu Guo, “Methodand System for Forecasting Power Requirements Using Granular Metrics,”Patent No. US 61/564839, filed on November 29, 2012.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 The Role of Data . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Visual Object Recognition with Limited Training Data . . . . . . 31.2.1 One-shot Object Recognition . . . . . . . . . . . . . . . . 41.2.2 Cross-domain Object Recognition . . . . . . . . . . . . . 61.2.3 Object Recognition for Images with Different Picture Styles 81.2.4 Connections . . . . . . . . . . . . . . . . . . . . . . . . 91.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 One-shot Object Recognition . . . . . . . . . . . . . . . . . . . . . . 132.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18v2.3 The Proposed Method . . . . . . . . . . . . . . . . . . . . . . . 202.3.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . 202.3.2 Hierarchical Dirichlet Process Mixture Model . . . . . . 222.3.3 New Feature Representation . . . . . . . . . . . . . . . . 282.3.4 Hierarchical Feature Learning . . . . . . . . . . . . . . . 292.3.5 Feature Combination . . . . . . . . . . . . . . . . . . . . 312.3.6 One-shot Recognition Decision . . . . . . . . . . . . . . 332.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 342.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . 342.4.2 Results on “4-class” data set . . . . . . . . . . . . . . . . 352.4.3 One-shot Results on “Animals with Attributes” . . . . . . 372.4.4 Multiple Shots Results on “Animals with Attributes” . . . 392.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . 403 Cross-domain Object Recognition . . . . . . . . . . . . . . . . . . . 433.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . 513.3.1 Background on RKHS of Vector-valued Functions . . . . 523.3.2 Vector-valued Functions for Domain Adaptation . . . . . 543.3.3 Output Kernel Learning . . . . . . . . . . . . . . . . . . 553.3.4 Domain Weighted Output Kernel Learning . . . . . . . . 563.3.5 Domain Adaptive Input-Output Kernel Learning . . . . . 583.3.6 Domain Shift Measure Based on Output Kernel Matrix . . 633.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653.4.1 Data Sets and Features . . . . . . . . . . . . . . . . . . . 653.4.2 Results on DA Data Set . . . . . . . . . . . . . . . . . . . 663.4.3 Results on Cal+DA Data Set . . . . . . . . . . . . . . . . 733.4.4 Domain Shift Measure in Output Kernel Space . . . . . . 743.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764 Object Recognition for Images with Different Picture Styles . . . . . 784.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78vi4.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824.3 Kernel Descriptor Revisit . . . . . . . . . . . . . . . . . . . . . . 824.4 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . 834.4.1 g-incorporated Kernel Descriptor . . . . . . . . . . . . . 834.4.2 Basis Functions gi’s . . . . . . . . . . . . . . . . . . . . 864.4.3 Learning of the Parameters . . . . . . . . . . . . . . . . . 884.4.4 Adaptive Descriptor Design . . . . . . . . . . . . . . . . 884.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 894.5.1 Domain Adaptation Data Set . . . . . . . . . . . . . . . . 894.5.2 Oxford Flower Data Set . . . . . . . . . . . . . . . . . . 914.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 945 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . 955.1 Significance and Potential Applications of the Research . . . . . . 955.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . 975.3 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100viiList of TablesTable 1.1 Summarization of the introduction. . . . . . . . . . . . . . . . 11Table 2.1 Major steps for the proposed one-shot recognition system. . . . 21Table 2.2 Parameter details for sampling of CRF. . . . . . . . . . . . . . 26Table 2.3 Performances of one-shot recognition on the 4-class Caltechdata set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36Table 2.4 Results of one-shot recognition on AwA data set. . . . . . . . . 37Table 2.5 Results of using raw and learned features on AwA data set. . . 39Table 2.6 Results of multi-shot recognition on AwA data set. . . . . . . . 39Table 3.1 Algorithm for output kernel learning. . . . . . . . . . . . . . . 57Table 3.2 The proposed Domain Adaptive Input-Output Kernel LearningAlgorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61Table 3.3 Results on single source adaptation. . . . . . . . . . . . . . . . 67Table 3.4 Results of multiple sources adaptation. . . . . . . . . . . . . . 69Table 3.5 Results of multiple features adaptation. . . . . . . . . . . . . . 72Table 3.6 Results on Cal+DA data set . . . . . . . . . . . . . . . . . . . 72Table 3.7 Domain shift measurement based on Output Kernel Divergence(OKD). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75Table 3.8 Domain shift measurement based on Maximum Mean Discrep-ancy (MMD) . . . . . . . . . . . . . . . . . . . . . . . . . . . 75Table 4.1 Results of KDES on DA data set. . . . . . . . . . . . . . . . . 90Table 4.2 Results of SURF on DA data set. . . . . . . . . . . . . . . . . 91Table 4.3 Results on the Oxford Flower data set. . . . . . . . . . . . . . 93viiiList of FiguresFigure 2.1 One-shot recognition framework. . . . . . . . . . . . . . . . 15Figure 2.2 Illustration of how high level features improve recognition. . . 16Figure 2.3 Illustration of the process of feature learning. . . . . . . . . . 22Figure 2.4 The graphical model of HDP. . . . . . . . . . . . . . . . . . . 24Figure 2.5 Illustration of the clustering property of CRF. . . . . . . . . . 25Figure 2.6 Illustration of spatial pyramid. . . . . . . . . . . . . . . . . . 30Figure 2.7 Illustration of HDP-encoder. . . . . . . . . . . . . . . . . . . 30Figure 2.8 Images from 4-class data set . . . . . . . . . . . . . . . . . . 35Figure 2.9 Visualization of learned feature space. . . . . . . . . . . . . . 41Figure 2.10 Two feature learning procedures. . . . . . . . . . . . . . . . . 42Figure 2.11 Sample images from 10 target classes in “Animals with At-tribute” data set . . . . . . . . . . . . . . . . . . . . . . . . . 42Figure 3.1 Example images from different domains. . . . . . . . . . . . 45Figure 3.2 Illustration of domain shift in output space. . . . . . . . . . . 46Figure 3.3 Framewokr for DA-IOKL. . . . . . . . . . . . . . . . . . . . 48Figure 3.4 Accuracy VS. penalty curve. . . . . . . . . . . . . . . . . . . 69Figure 3.5 Visualization of learned output kernel. . . . . . . . . . . . . . 70Figure 3.6 Accuracy VS. number of training samples for dslr/webcam. . 77Figure 4.1 Illustration of different picture styles caused by different reasons. 79Figure 4.2 Illustration of how picture styles affects gradient based imagedescriptor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80Figure 4.3 Plots of the proposed basis functions. . . . . . . . . . . . . . 86ixFigure 4.4 Plots of basis functions. . . . . . . . . . . . . . . . . . . . . 87Figure 4.5 Examples of images with different Instagram effects. . . . . . 92xList of AcronymsADD Adaptive Descriptor DesignAK Average KernelBOW Bag of WordsCH RGB color histogramsDA Domain AdaptationDA-IOKL Domain Adaptive Input & Output Kernel LearningEMK Efficient Match KernelsGMKL Generalized Multiple Kenrel LearningHDP Hierarchical Dirichlet ProcessHDP-ENCODER A feature learning function based on HDP.JBLD Jensen-Bregman LogDetKDES Kernel DescriptorLSS local self-similarity histogramsMKL Multiple Kernel LearningMMD Maximum Mean DiscrepancyNLP Natural Language ProcessingOKD Output Kernel DivergencePHOG pyramid Histogram of Oriented GradientsxiRGSIFT Color SIFTRKHS Reproducing Kernel Hilbert SpaceROD Rank of DomainSDP Semi-Definite ProgrammingSIFT Scale-Invariant Feature TransformSURF Speeded Up Robust FeatureSVM Support Vector MachinexiiAcknowledgmentsI give my sincere thanks to my supervisor, Dr. Z. Jane Wang, for her patient and in-sightful guidance in my research and personal life through my four and half yearsPh.D. study at UBC. This thesis is impossible to achieve without her persistentencouragement, and strong belief and confidence in me. The more than suffi-cient freedom in research granted by Jane made me grow up as an independentresearcher. Her frank and critical suggestions reminded me to balance betweenbeing idealistic and realistic in my career along the way. And her wide and deepknowledge, together with her warm personality, made my Ph.D. journey excitingand joyful.I also would like to extend my thanks to my labmates and fellow graduatestudents at UBC for inspiring interactions and discussions.I would like to thank my friends and colleagues at Max Planck Institute forInformatics, Germany, for their help in both research and personal life when I wasvisiting there.I also want to thank my friends and colleagues at Energy Aware TechnologyInc., for their support and encouragement in years. They introduced me into thearea of smart grid data analysis and helped me achieve publications and patents.At last, I would like to thank my parents and my family for their endless love,care, and support. They are the reasons for me to be happy, and to overcome all theunhappiness.xiiiTo my wife Shanshan, and my daughter Mia.xivChapter 1Introduction1.1 The Role of DataData has become a popular key word that is commonly seen in many engineeringand business fields; this was not the case in the scientific research until a decadeago. The increasingly important role of data is a result of the fast developments inthe sensing, communication, and storage technologies. Such technologies are nowproviding significantly more raw data to process for existing research problems andmethods, as well as cultivating new ways to conduct research. The role data playsin the scientific research has gone through several phases, from the inspiration ofepiphany to the facts that support theoretical conclusions, to resources needed torefine a computational model, and finally to the focus of the research itself.Back to the age of ancient Athens, the discovery of theories about our physicalworld was made by individual philosophers who observed certain phenomena overtime and abstracted general and descriptive rules as explanations. Data, that revealsnature to human beings, was in its premature and instantaneous form. The researcharound data was pure as abstract thinking activities previously.Evolving with human civilization, mathematics and physics emerged and servedas efficient tools for human researchers to conceive theories for more complicatedphenomena, and to design solutions to harder problems. The data then can be re-produced by simulating experiments in laboratories, but it was still limited in termsof amount and variety. To make good use of limited data, trial and error was the1fundamental guidance and belief for researchers to achieve the goals, involving agreat amount of human intelligence and determination. Data and solutions werenot directly connected then, and the super complex systems were beyond the capa-bility of technologies.Then information technology was born, together with the computing machinesthat provide extra power and efficiency in complementary to human intelligence.The explosive amount of data accessible to researchers in the forms of texts, pic-tures, videos, geographical information, and even physiological signals measuredfrom human subjects, etc. These enabled the design of complex systems to solvespecific problems no one would ever have expected before. To make flight plans ata busy international airport, scientist fed the data into an optimization engine to getan optimal schedule under a number of constraints. To have a speech recognitionsystem like Siri in iPhone, researchers and engineers collected tons of human au-dio signals in different languages with annotated scripts to train algorithms to getoptimal models that can recognize speeches from new users. The data has becomeand still is the tool to get an optimal model for a specific task.At last, Big Data arrived along with the highly developed Internet service onmultiple devices for billions of users. Further, with the popularity of mobile andwearable devices which are equipped with a large number of advanced sensors,data is now generated, shared, and stored anywhere and anytime on the earth. Datais no longer only the tool for training better models of algorithms, it is the centralfocus of technologies. It means that data comes to our mind before the problemswe have to solve. For example, twitter collects detailed records on the daily livesof 200 million active users; Facebook stores interactions among 1.15 billion activeusers; Instagram hosts more than 16 billion photos taken by more than 150 millionactive users around the world. The current problem has become how to find theproblems arising from the enormous data coming now from different places, dif-ferent people, in different forms. Research is also now pivoting from making gooduse of data for a certain task, to focusing on the data itself. Although there is noclear answer to the question about what we can do with the data, there is no doubtthat Big Data will benefit the human life with the help of emerging informationtechnologies.This PhD thesis is to develop novel machine learning methods in the applica-2tions of computer vision. According to the statistical learning theory by Vapnik[82], empirical risk of a learning algorithm needs to be minimized based on the es-timation of training data, in order to obtain an optimal model for a specific task. Nomatter for supervised/unsupervised, generative/discreminative, or Bayesian/frequen-tist, machine learning algorithms essentially rely on the estimation of the underly-ing probability distributions of sample data. And it is natural to assume that moretraining data will lead to better computational models, that are the better solutionsfor real-world problems. Then, should we celebrate the glory of the Big Datathat can ultimately solve those complected problems frustrating researchers fordecades? It is unfortunate that the illusion built by the increasing amount of datacannot bring us perfect computational models by simply feeding the data into what-ever methods we currently have. For a specific task, having sufficient high qualitytraining data is always the crucial factor, even for the unsupervised deep learningmethods. Shortage of high quality training data is still an unsolved problem in theBig Data era. Since this thesis focuses on visual object recognition, I will discussseveral real-world problems in the following pages to address how I conquer theproblems introduced by limited training data by designing novel machine learningalgorithms to benefit from the Big Data background.1.2 Visual Object Recognition with Limited TrainingDataComputer vision as a community progressed fast recently, which unarguably ben-efited from the proposal of a number of visual data set. In my opinion, a numberof publicly available data sets are more important than any individual models ormethods for most of the sub areas in computer vision. As stated in a recent work[81] analyzing bias across data sets, Ubiquitous access to image datasets has beenresponsible for much of the recent progress in object recognition after decades ofproverbial wandering in the desert. We also have such examples in other sub areasof computer vision, such as Labeled Face in the Wild (LFW), CMU Multi-PIE andYale Face to face recognition, VOC Pascal to object detection, and Scene-15 andSUN to scene understanding. Instead of conceiving a model first and then validat-ing it by a small amount of visual data, the community has reached a point where3data centric research is more and more popular (i.e., the data set was first proposedand then methods came up to improve the performance on the data set). The bigdata, especially big visual data, is definitely an encouraging trend for the computervision research.However, the shortage of high quality(e.g. well-labeled, well-sampled, withsimilar characteristics of images in testing, etc.) training data concern has not beensolved yet, even given the enormous data we could obtain through Internet and so-cial networks. This thesis studies three real-world visual object recognition prob-lems which involve limited training data, and I propose specific machine learningalgorithms to address these problems accordingly by taking advantages of the BigData trend.1.2.1 One-shot Object RecognitionOur visual world is extremely complicated in terms of various appearances of alarge number of existing objects. It is an impossible task to design a system thatcan recognize all possible objects in the world from images, which is partiallybecause we don’t have enough labeled training images of all objects to train a clas-sification model, and partially because it is hard to find a computational model thatcan handle such a large number of objects. It is not rare that we cannot collect suf-ficient labeled training images for the object categories that we want to recognize,for example, facial images of a stranger who only showed up once, and photosof a particular type of wild animals that is only spotted by human for very limitedtimes. Inspired by the human cognitive system which can recognize a new categoryof object by only seeing an image of it even once, one-shot object recognition wasproposed in [32]. In one-shot setting, there is only one training image per categoryfor the object categories we want to recognize, which are called target categories.It is widely accepted that human will make use of things that have been alreadylearned to improve the learning process of newer concepts. In the big data context,although we don’t have access to many training images from target categories,we do have a relatively large number (or even a really large number) of imagesfrom other categories, which are called source categories. As we discussed above,even though the images from the source domain could be abundant in amount, ac-4curate labels of these images are still hard to obtain. Therefore, this PhD thesisinvestigates on the one-shot recognition problem where there are a large number ofunlabeled images from source categories and only one training image with label isavailable per category in the target domain.The first question in front of us is what information is useful that we can ex-tract from the source images? In the one-shot setting, since the source and targetcategories are distinct, there is not necessarily similarity between images from bothdomains at the object or instance level. But when we studied the representations ofimages, we noted that the Bag-of-Word model or related ones were widely used tomodel an image as a set of local descriptors which encode statistical informationof image patches, such as image gradient, color intensity, texture pattern, and etc.If we treat the visual representation as a mechanism similar to human natural lan-guage, we could assume that there should similarly exist phrases, sentences, andeven paragraphs in the visual representation space, which could provide a moreefficient and informative representation than only using visual words (based onraw image descriptors). For example, instead of only using lines and dots as localfeatures, we could have stripes and grids that can make the representation morediscriminative in order to improve the performance of the classifier. Therefore,we proposed learning such high level visual features from the source images, toconstruct more informative representations of images in the target domain.The second question is naturally how do we learn the features? Also inspiredby the Natural Language Processing (NLP) area where topic models are widelyused for word and topic modeling, we proposed using one of the non-parametricBayesian topic models, Hierarchical Dirichlet Process (HDP), to model the high-level visual phrases and sentences from low-level image descriptors. HDP is es-sentially an infinite mixture clustering method, which can assign low-level visualdescriptors into clusters with possible multiple cluster assignment for one descrip-tor. We refer it as the HDP-encoder that encodes lower level features to higherlevel ones.Furthermore, we proposed stacking the HDP encoders on multiple types ofimage descriptors to achieve a deep structure with multi-modality fusion. Thefinal classification is performed by combining features in all levels through additivekernels. In Chapter 2, we will discuss the one-shot object recognition problem with5more details regarding the algorithms, experiment and analysis.1.2.2 Cross-domain Object RecognitionIn Section 1.2.1, we introduced a problem where training images are extremelylimited for some object categories. In this section, we will discuss another situationthat big data doesn’t help much even when more well-labeled training images arecollected for certain categories.As we mentioned above, data becomes more and more important in computervision than any specific computational model, since the underlying estimation ofthe data distributions influences many vision applications. Particularly for objectrecognition, the correct estimation of the distributions of training data from dif-ferent categories determines the performance of a classifier. With the popularityof photo sharing through social networks, it is desirable to recognize or tag thephotos taken by individual users in everyday life, so that a better personalized in-formation management service could be built on that, such as searching, matching,or retrieval. For a particular school boy, the objects appeared in his photos taken byhis smart-phone are particular objects in his personal life. For instance, a bicyclepainted by himself with his own design, a laptop covered with stickers he chose,clothes, caps, watches, shoes from the brands he likes and worn by him for a year orso. The photos of these objects are sometimes by his iPhone camera with Instagrameffects, sometimes by his DSLR, sometimes by the built-in webcam of his laptop,or sometimes scanned from black & white films taken by a film camera. Thereforethe photos taken by different people with different devices will be very differenteven if they are about the same categories of objects or scenes. However, it is im-possible to collect all the variants of object images to train a classifier for objectrecognition, then the training images will be somehow different from the imagestaken during test. Such difference can be viewed as a shift between distributions oftraining and testing data, which affects the performance of a classifier trained ontraining data. In addition, due to such shift, increasing number of training images(which are different from images in testing) cannot solve the problem. Even in thebig data era, if I want high quality training images for objects in our personal life,there is no convenient way other than collecting and labeling the photos manually.6Images with similar characteristics are considered from the same domain, whichmay relate to the image quality (taken by webcam or DSLR), illumination condi-tions (product photos on Amazon taken in the bright studio lighting conditions),or object appearance (furnitures from IKIA). Visual domain still doesn’t have arigorous definition, but can help us understand the problems. Usually the set ofextra different training images is called the source domain, and the set of imagesappearing in testing is called the target domain (images in source and target do-mains cover the same categories). How to train a model that can take benefits fromthe extra data in similar source domain(s) by overcoming the side-effects intro-duced by domain shift is called domain adaptation. In the literature, the probleminvolving images from the target domain with no labels is referred as unsuperviseddomain adaptation; the problem involving a small number of labeled training im-ages from target is called semi-supervised domain adaptation. In my PhD thesis,I focused on the semi-supervised DA problem since it is more practical and canimprove the performance by only collecting 2 or 3 labeled images per category inthe target domain. Obviously, research in DA tends to develop better personalizedvisual understanding systems benefiting from the big data trend.Object recognition is a multi-class classification problem, which is essentiallyabout learning a vector-valued function that maps the input feature space ontoa output category space, although most of the time the one-against-one or one-against-all mechanism is used as compromise. In this thesis, for the first time weobserve that domain shift doesn’t only exist in the input feature space, but also inthe output category label space. Adopting the kernel theory for vector-valued func-tions, we proposed a Domain Adaptive Input-Output Kernel Learning algorithm toreduce the domain shift in both input and output kernel spaces, in order to obtain atrue multi-class classifier that can adapt useful information from the source domainto the target domain. Experimental results of our proposed algorithm demonstratethe performance improvement over the previous state-of-arts, which also supportsour contribution of introducing the output space analysis into domain adaptationresearch.71.2.3 Object Recognition for Images with Different Picture StylesWe have shown how to benefit from the big data trend to improve object recog-nition in real-world situations lacking of good quality training images. Domainadaptation is especially promising since it takes advantages of the plenty of pub-licly available image training data. However, there are still other training datashortage challenges associated with the current problems in object recognition.Digital images can be different in terms of color tones, contrast, clarity, vi-gnetting, and etc. Here we refer such characteristics of digital images as picturestyles. With the popularity of photo editing and sharing services such as Instagram,Facebook and Flickr that are available on mobile devices, many digital imagesgenerated by users nowadays are captured by a wide range of devices (e.g., smartphones and digital slrs) and processed using different photography effect filters(e.g., “lomo-fi” and “lord-kelvin” available in Instagram) to get distinct picturestyles with strong personal artistic expressions. And clearly most of the imagesshared through social network cannot be characterized into certain pre-defined do-mains, since they are from all possible domains. Therefore, domain adaptation isnot a suitable solution when the training images and testing images are both fromonline sources without “domain-ship” specified.In Chapter 4, we will show that picture style could influence the image gradi-ent, so that the gradient based image descriptors will also be affected. Experimen-tal results also indicate that when training images and testing images have differentpicture styles, the performance of object recognition based on local image descrip-tors will drop. Therefore, it is important to overcome the difficulty brought bypicture styles to fully make good use of big visual data. This thesis made its con-tribution to the computer vision community by (1) introducing the picture styleproblem for the first time; (2) proposing an adaptive descriptor design frameworkfor gradient based descriptors; and (3) proposing a variant of Oxford Flower datasetwith different picture styles. The Since the picture styles can be modeled throughpixel mapping functions, in the thesis study, we took a quite direct approach byformulating the relationship between pixel mapping functions and resulting imagedescriptors from the kernel descriptor [17] point of view. Later, an optimal combi-nation of pixel mapping functions is learned through multiple kernel learning for8object recognition. This work not only proposed a solution for this picture styleproblem, but also opened a new direction for object recognition by bridging thepixel domain with the image feature space constructed based on local descriptors.1.2.4 ConnectionsIn the above sections, we have introduced three related sub-areas of visual objectrecognition, and the corresponding solutions proposed by this Ph.D. thesis. Theseproblems all involve limited training data from different aspects: (1) in one-shotrecognition, there are limited number of training images; (2) in domain adaptation,the training data is limited in a way that there is a shift between the distributionsof training and testing images, which makes the training data of low quality; (3) inthe last application, training and testing images are with different unknown picturestyles, which degrades the performance of a standard classifier trained on such data.In addition, the solutions proposed in this Ph.D. thesis to these problems all makegood use of the resources we have in the big data era, such as a large amount ofunlabeled source category images, a number of well labeled source domain images,and images of all kinds of picture styles that are widely available in social networks.Although we introduce these three applications and the proposed machine learn-ing algorithms separately, they are not exclusive in real-world situations. In par-ticular, the problem caused by different picture styles widely exists in standardobject recognition, and in domain adaptation and one-shot recognition problemsas well. Furthermore, the algorithms are not exclusive for some general purposes.In one-shot recognition, we proposed an unsupervised feature learning algorithmwhich actually can be used for general tasks (e.g. multiple-shot recognition witha number of training images per category) to learn a better image representationfor recognition. In another word, it can be used in parallel in domain adaptationproblem or picture style problem, at the feature extraction stage. The DA-IOKLalgorithm proposed for domain adaptation is essentially a true multi-class classi-fier that can be used for other object recognition tasks involving multiple featuresand/or multiple data sources. At last, the Adaptive Descriptor Design (ADD) algo-rithm proposed for picture style problem can be used to improve any gradient basedimage descriptor in general. In one word, this Ph.D. thesis tackles problems caused9by limited training data in visual object recognition from different angles by mak-ing advantage of big data, and the resulting algorithms can be applied separatelyor combined jointly for real-world situations to improve the performance.1.3 Thesis OutlineThe outline of this thesis is described in this section as follows, and Table 1.1 high-lights the training data shortage challenges, the proposed methods and the contri-butions of this thesis.In Chapter 2, first we review the problem of one-shot object recognition anddiscuss the disadvantages of previous related work, which used labeled source im-ages and/or attribute tables designed by human experts as an auxiliary represen-tation. To formulate the problem in a more practical way, we propose using onlyunlabeled source images to achieve one-shot recognition by unsupervised featurelearning. The proposed feature learning algorithm is a deep structure of stacks ofHierarchical Dirichlet Process (HDP) auto-encoder to construct “semantic” pyra-mid in the feature spaces. Experiments are conducted on Caltech-4 data set andAnimals-with-Attributes data set. The improvement brought by the proposed algo-rithm is demonstrated when comparing it with state-of-art methods.Chapter 3 presents our work on domain adaptation in object recognition. Wenotice that object recognition is essentially about learning a vector-valued functionas a true multi-class classifier, whose output space is neglected by previous work.Therefore, we propose a Domain Adaptive Input-Output Kernel Learning algo-rithm for domain adaptation, by reducing the domain shift in both input and outputkernel spaces through kernel learning. Experimental results show the performanceimprovement when introducing the output kernel space into domain adaptation,and prove the effectiveness of the proposed kernel learning algorithm.Chapter 4 describes a new problem in the computer vision community (i.e.,training and testing images used in real-world applications are mixtures of pho-tos with distinct picture styles), which affects the accuracy of traditional objectrecognition based on local descriptors. In this work, we successfully formulatethe direct relationship between pixel mapping functions and image level features,and also propose an adaptive descriptor design framework through efficient ker-10Problems Chapter 2 Chapter 3 Chapter 4Training Data Challenges One labeled image per category in the targetcategories.Less than 3 labeled images per category in thetarget domain.Labeled images with unknown different pic-ture styles.Side Information Unlabeled images from source categories. Labeled training image from source do-main(s).No.Constraints Target and source categories are different. Categories in source and target domains arethe same.No.Contributions 1.Utilizing only unlabeled images as side in-formation. 2.Flexible and efficient.1.Extending to output space. 2.Reducing do-main shift in both input and output ker-nel spaces. 3.Incorporating multiple featuresthrough multiple input kernels. 4.Measuringdomain shift through Output Kernel Diver-gence (OKD).1.Generally improving gradient-based de-scriptor. 2.Easy to implement into existingmethods. 3.Introducing a new direction forobject recognition. 4.Introducing a new dataset of different picture styles.Table 1.1: Summarization of the applications and the proposed algorithms.11nel learning. In addition, we also propose a variant of Oxford Flower data setwhere images are processed with different popular Instagram effect filters. The ex-periments on domain adaptation data set and the proposed data set demonstrate theperformance improvements of the proposed algorithm over existing gradient-basedimage descriptors.Chapter 5 briefly summarizes the major contributions of the thesis and providesconcluding remarks. A number of future topics are discussed.12Chapter 2One-shot Object Recognition2.1 IntroductionObject recognition using computer vision methods has gone through considerableprogress during the last decade, including methods based on low level features(e.g., Scale-Invariant Feature Transform (SIFT) [60], Speeded Up Robust Fea-ture (SURF) [8], pyramid Histogram of Oriented Gradients (PHOG) [19], and SelfSimilarity [73]) and specific designed machine learning techniques. Numerous pa-pers [37] have shown that recognition accuracy generally increases as the numberof training samples per category increases. However, a large number of labeledtraining samples might not be feasible in practice.However, there are significant evidences that human beings can perform category-level object recognition in a more efficient way, by learning novel concepts fromonly one or a few exemplars. Motivated by such recognition ability of human cog-nitive systems, “one-shot” recognition [32], where the system is given only onetraining sample for each object category, has attracted increasing research atten-tion very recently. In the computer vision community, the one-shot recognitiontask consists of data from two domains: the target domain and the prior-knowledgedomain. The target domain, where we actually perform the one-shot classifica-tion, consists of data from target categories with only one training sample in eachcategory. It is assumed that the prior-knowledge domain consists of data fromcategories that are different from the target categories. Based on previous works13[32, 55, 78, 90], as shown in Figure 2.1, the one-shot recognition procedure gen-erally contains two major components: 1) feature learning in the prior-knowledgedomain, and 2) supervised classification in the target domain with only one labeledtraining sample per category.Although there is only one exemplar for each category in the target domain,some standard classifiers (e.g. Nearest Neighbor [78] and Support Vector Ma-chine (SVM) [45]) can still be used for the supervised classification step. Previousmethods for one-shot recognition [32, 55, 78, 90] are different mainly in the set-tings of the feature learning step in the prior-knowledge domain. [55] uses labelinformation for all images from the prior-knowledge domain, along with a sophis-ticated attribute table designed by human experts. [78] doesn’t use any manuallydesigned attributes, but it still relies on the fully labeled images. This similar settingwas also adopted by a recent work [90]. Although considerable progress has beenachieved by the methods mentioned above, the required side information (e.g., thelarge number of class labels and the manually designed attribute table) in the prior-knowledge domain can be difficult to acquire in practice. Recalling the originalmotivation mentioned earlier that human cognitive systems are able to adapt usefulinformation from prior-knowledge without the help of such side information, wepropose using only the unlabeled images as prior-knowledge, as illustrated in Fig2.1.To intuitively justify that the proposed setting of using unlabeled images asprior-knowledge for one-shot learning is feasible, in Figure 2.2, we take a “zebra-horse” problem as an example. It is a simple task of classifying a test image intocategory “zebra” or “horse”. If we know the concept of “striped” texture pattern,we can easily distinguish zebras from horses by generating this classification rule.If we could learn this “striped” pattern from the “ zebra-horse” data set, we canalso use this simple classification rule to distinguish tigers from leopards. Fromthis example, we believe that such meaningful features are shared among relevantcategories and can significantly benefit category-level recognition. There are ev-idences [39] showing that human beings can learn from unlabeled data throughmanifold regularization, indicating that human beings can learn more meaningfulfeatures (for category-level recognition) from daily experiences. In the zebra-horseexample, it is desirable to learn the “striped” feature (and other abstract attributes)14Figure 2.1: One-shot recognition framework: Only one training image withlabel is provided for each category in the target domain. The prior-knowledge domain contains images from categories that are differentfrom the target categories. Unlike previous works which usually uselabeled images and the designed attribute-table, we only use unlabeledimages in the prior-knowledge domain.15from unlabeled images of zebra and horse. Therefore, how to learn the meaningfulfeatures is a challenging but valuable task.Figure 2.2: The animal pairs, zebra and horse, and tiger and leopard, can beclassified by the “striped” pattern.Since many low-level visual descriptors (e.g., SIFT) have already showed gooddiscriminative power in visual recognition tasks, we plan to incorporate the advan-tages of these handcrafted low-level features. In this chapter, we formulate featurelearning from prior-knowledge as a latent mixture modeling problem, which isto learn mixture components over the low-level descriptors as more meaningfulfeatures from unlabeled images. We propose using Hierarchical Dirichlet Pro-cess (HDP) [79] for this purpose. Since the feature learning process is actually aclustering operation on the histograms of base features, we call it the A featurelearning function based on HDP. (HDP-ENCODER) which can encode low-levelfeatures’ histograms into higher level feature representations automatically, as ex-plained later in Section 2.3.4 and Figure 2.7.Now suppose we have the HDP-ENCODER and its output is also a histogramvector based on the higher level features. We assign the low-level descriptor aslevel− 0 (L0) and the new features as level− 1 (L1). From WordNet [33] in thenatural language processing community, a large lexical database of English, wenote that human language has a hierarchical structure to describe objects and events16from holistic aspects to details. Also, in the object recognition community, spatialpyramid matching [58] shows the matching power of “coarse to fine” in the 2-Dimage spatial domain. Inspired by the hierarchy observed in human language andin image spatial pyramid subdivision, we propose constructing a deep structure ofthe feature pyramid by stacking the HDP-ENCODERs layer by layer, where eachlayer has a unique “describing resolution”. The details will be explained in Section2.3.4 (Figure 2.7) and Section 2.4.3 (Figure 2.10). From top to down, the pyramidprovides image representations from “coarse to fine”, where a higher level capturesmore “macro” information while a lower level captures more “detail” information.It is worth noting that HDP-ENCODERs can be stacked across feature spaces (e.g.,the texture-L2 feature may be learned from SIFT-L1 and SURF-L1 features). Thejoint feature vectors could be viewed as histograms generated from a large jointdictionary.Based on the obtained feature-pyramid, how to transfer such rich descriptionsinto classification power is our next concern. Similar to spatial pyramid match-ing [58], we choose to use weighted summation of intersection kernels to combinethe features at different levels. In addition, since the proposed HDP-ENCODERcould also model the latent components across different feature spaces, we canlearn a single feature that can capture information from multiple lower levels offeatures and is sufficiently compact for real world applications. Therefore, the pro-posed feature learning algorithm can also be viewed as a novel feature combinationmethod. In summary, the main contributions of this chapter are as follows:1. We propose a novel feature learning algorithm based on HDP modeling,which can encode low-level image descriptors into a high level feature vec-tor from unlabeled images in the prior-knowledge domain, as illustrated inSection 2.3.4 (Figure 2.7).2. We propose a deep structure for feature learning by stacking the HDP en-coders to learn a feature pyramid with multiple “describing resolutions”, asillustrated in Section 2.3.4 (Figure 2.7) and Section 2.4.3 (Figure 2.10).3. We evaluate the proposed unsupervised feature learning framework on one-shot image recognition tasks and show comparable performances to that ofprevious supervised feature learning methods [50, 55, 78, 90].174. We evaluate the proposed framework on conventional multi-shot tasks andshow the recognition improvements.In the remainder of this chapter, Section 2.2 provides a brief review of previ-ous work on one-shot recognition and feature combination. Section 2.3 describesthe proposed unsupervised hierarchical feature learning framework. Section 2.4demonstrates the performances of the proposed method on real data sets. Section2.5 concludes the chapter.This chapter expends upon our conference paper [45], in which we investigatedlearning a high level feature from a single type of local descriptor. In this chapter,we propose a novel deep structure for feature learning based on multiple typesof local descriptors and propose a new feature combination scheme using featurepyramid and averaging kernels. We also conduct more experiments.2.2 Related workCompared with the conventional multi-shot visual recognition, there has been rel-atively less work in the area of one-shot recognition. The concept “one-shot learn-ing” was first introduced in [32]. [32] propose a Bayesian framework with a classprior learned from labeled prior-knowledge. [55] tackles the challenge by incorpo-rating human specified high level attributes, where a number of supervised classi-fiers are used to associate bag-of-feature representations with the binary attributes.Their proposed cascade recognition system provides the state-of-art recognitionaccuracy on the “Animal with Attributes” data set. To learn the semantic attributesautomatically from the prior-knowledge data, a nonlinear mapping based methodis used [78] to learn a mapping function by optimizing the discriminative powerof the intermediate representation, where the mapping function can be viewedas a projection from the original feature basis onto higher level latent attributes.Their results on multi-class one-shot recognition are better than the simple naiveBayesian method in [55]. [78] still requires a large number of labeled images asprior-knowledge. Later, similar to the setting in [55], the work in [90] focuses onthe attributes used in the attribute table and designs a better knowledge transferscheme by modeling the attribute priors. [90] can be considered as a better way ofassociating image features with the manually designed binary attributes.18There are several research works in other areas which emphasize weak super-vision, not specific to one-shot learning. [65] exploits side information from pairsof object images labeled as “same” or “different” to learn a metric for measur-ing similarity between unseen object images. Metric learning approaches like [26]require at least weak supervision on the prior-knowledge data, which cannot be ob-tained in our problem. So far, the closest work to our chapter is self-taught learning[69], which uses sparse coding to learn a set of bases for the linear combination toapproximate the image data in the prior-knowledge domain. The weights of thebases are used as features to represent images. In [69], sparse coding is performedon image pixel patches and cannot be applied directly to the discrete space of thehistogram of low-level features (in the bag-of-feature representation). By contrast,our proposed method is based on the handcrafted image descriptors and thus it issuitable to learn a higher level feature without losing the nice property of low-levelfeatures. Another area in machine learning that seems be related to our one-shotproblem is semi-supervised learning (SSL) [11, 35, 92], where unlabeled data areused to improve the weakly supervised classification tasks. As we stated in Section2.1, one-shot recognition generally contains two separated components: 1) featurelearning in the prior-knowledge domain, and 2) supervised classification in the tar-get domain with only one labeled training sample per category. Since there are nolabeled images in the prior-knowledge domain in our setting, there is no room forsemi-supervision in the prior-knowledge data. In addition, it is not feasible to poolthe unlabeled prior-knowledge data and target one-shot training data together torun a SSL method during the classification. Usually SSL methods [11, 92] assumethat the unlabeled training data come from the same categories as the labeled train-ing data, while it is not true in our one-shot recognition setting where the imagesin the prior-knowledge domain generally come from categories that are differentfrom the categories in the target domain.To our knowledge, this chapter is the first work that attempts to learn a deepstructure of the feature pyramid based on low-level image descriptors in the areaof one-shot recognition. Note that the idea of describing an object image in a“coarse to fine” way has a long history, also reflected in the design of low-leveldescriptors. [58] extended the pyramid matching idea [42] to the spatial domainby matching the feature points in different spatial subdivisions. Since the proposed19feature pyramid provides multiple “describing resolutions” which are similar tothe “spatial resolutions” in [58], we will adopt the weighted intersection kernels tocombine different features learned in the feature-learning step.2.3 The Proposed MethodIn this section, we will first formulate our unsupervised feature learning problemfor unlabeled prior-knowledge data, then describe the proposed feature combina-tion method and the supervised classifier for one-shot recognition. For the unsu-pervised feature learning step, we propose using a Hierarchical Dirichlet Process.By wrapping up the feature learning process into the HDP-ENCODER, we proposea deep structure to construct the feature pyramid in two ways: One emphasizes therecognition performance, and the other emphasizes the efficiency and compactnessof the learned features. For the supervised classification in the target domain, wepresent how to incorporate intersection kernels and the average kernel to combineall features in the pyramid, and then a standard Support Vector Kernel Machine isused for classification.2.3.1 Problem FormulationWe denote the data set in the target domain T as XT , with data points {x1,x2, . . . ,xnT },and the data set in the prior-knowledge domain P as XP , with data points {x˜1, x˜2, . . . , x˜nP}.The data in XT comes from M categories CT = {c1,c2, . . . ,cM}, and the data inXP comes from categories CP = {cM+1,cM+2, . . . ,cS}. It is worth noting thatCT and CP are disjoint, which makes one-shot learning a problem different fromsemi-supervised learning. In one-shot recognition tasks, the training data fromXT is XtrainT with category labels YtrainT . The prior-knowledge data is denotedas XtrainP without corresponding category labels. We want to learn latent featuresbased on XtrainP , and to project the target training data XtrainT into the learned fea-ture space, which is denoted as ˆXtrainT . At last, a supervised classifier is trained on{ ˆXtrainT ,YtrainT }. In the testing phase, the trained classifier is used to predict thelabels of projected test data ˆXtestT to get Ypredict . The framework can be describedby major steps in Table 2.1, where step 1 is described in Section 2.3.2 and Section2.3.4, and steps 2 and 4 are described in Section 2.3.3 and Section 2.3.5, and steps203 and 5 are described in Section 2.3.6.Algorithm 1 One-shot Recognition Framework1: HDP-ENCODER ← HDP modeling on XtrainP ,2: ˆXtrainT ← Project XtrainT onto the latent feature spaceusing HDP-ENCODER,3: Classi f ier ← Train a SVM on { ˆXtrainT ,YtrainT } ,4: ˆXtestT ← Project XtestT onto the latent feature spaceusing HDP-ENCODER,5: Ypredict ← Predict the labels of ˆXtestT using Classi f ier.Table 2.1: Major steps for the proposed one-shot recognition system.Furthermore, we briefly discuss data representation in general object recogni-tion problems. A standard way to represent object images is to use the bag-of-feature model, which was originally borrowed from the document modeling area.In the bag-of-feature model, the low-level image descriptors serve as visual words,a codebook or dictionary is computed by clustering the total samples of visualwords, then images are represented by the occurrence histograms of the visualwords in the dictionary. We use V = {w1,w2, . . . ,wi, . . . ,wd} to denote the dictio-nary with vocabulary size d, where wi means the ith visual word. For an image I,we use its histograms of the dictionary visual words h = (h1,h2, . . . ,hi, . . . ,hd) torepresent it in the bag-of-feature model, where hi means the occurrence frequencyof wi in the image I. The histogram vector is based on the low-level features andcan be used as training and testing data in classifiers. Considering that each imageis represented by a histogram vector, our goal here is to learn a higher level fea-ture by fitting a mixture model to the grouped feature data (i.e., histograms of theimages).Let us denote the L0 level feature dictionary as V0 = {w1,w2, . . . ,wi, . . . ,wd}and the L1 level feature dictionary as V1 = {z1,z2, . . . ,zi, . . . ,zl}. Here a higherlevel visual word zi may be a mixture component of wi’s with different proportions.We could see that a low level feature word may belong to different higher levelfeatures with different probabilities according to its group statistics. Since wi’s areexchangeable across different groups, we also need z j’s to be shared among allgroups. Through the feature learning method, we could obtain p(z j|wi), the proba-21Figure 2.3: The blue circles indicate local patches which the low-level de-scriptors are generated from, and the squares are local clusters of de-scriptors within each image. Those two red squares linked by a dottedline belong to the same global cluster across image groups.bility that a low-level feature belongs to a higher level feature. We also can obtainthe new histogram of the image I based on the L1 dictionary V1 by normalizingthe posterior p(z j|I).We therefore need a method that could solve the mixture modeling problemfrom one lower level to the next higher level. Since the new features are homo-geneous with the lower level ones they are learned from, it is desirable to be ableto apply the feature learning method layer by layer to construct a feature pyramid.We propose using HDP [79] as the feature encoder in the proposed unsupervisedfeature learning approach. We will explain our choice of HDP and describe thedetails of HDP shortly.2.3.2 Hierarchical Dirichlet Process Mixture ModelOur idea is to assemble related descriptors (lower level features) into mixture com-ponents as higher level features. In Fig 2.3, the blue circles indicate local patcheswhich the low-level descriptors are generated from, and the squares are clusters onthe descriptors as higher level features. Since we use the bag-of-feature represen-tation, we choose the latent topic model in the document modeling community tofind the latent mixture components. After comparing with parametric latent topicmodels such as Latent Dirichlet Allocation (LDA) [13] and Probabilistic Latent Se-mantic Analysis (pLSA) [46], we adopt HDP, a nonparametric generative model,for our feature learning task. As an infinite mixture model, HDP provides a way22to sample an unbounded number of latent mixture components for grouped data,which means HDP can find the number of the mixture components and the datapoints related to each component automatically. This property is highly desirablefor our feature learning task, since there is no easy way to predetermine the numberof mixture components. Although Dirichlet Process (DP) is also a nonparametricinfinite mixture model and is easier to sample, we don’t choose DP because DP issuitable for mixture modeling in non-grouped data (all the data in a single group),but can’t be applied to grouped data. As illustrated in Fig 2.3, the squares con-nected by the dotted line indicate the same latent component (“striped pattern”)shared by two individual groups. In contrast to DP, HDP has the clustering prop-erty to model the latent components shared among groups. Before describing thedetails of HDP, we first define some notations:1. A feature dictionary V at a low-level. V = {w1,w2, . . . ,wd}, where each en-try is a dictionary visual word.2. An image is a group of visual feature data and represented by orderless visualwords denoted as xj = (x j1,x j2, . . . ,x jN), where x ji is corresponding to aninstance of ith visual word in the jth image. Note that although we use bag-of-feature histogram to represent an image, the x ji here is the indicator forcertain dictionary word in V, not the frequency.We represent an image as a group of orderless visual words, as defined in thebag-of-feature model. We assume that there exist latent mixture components cor-responding to clusters of low level visual words with related attributes. We alsoassume that such latent mixture components are shared among different images.We therefore need to study the latent components and the component membershipsof the visual words. In order to model the images with better describing ability, weconstruct a new visual dictionary based on the learned latent components, and en-code the images with the new dictionary. To serve this learning purpose, we useHDP to model the unlabeled images in the prior-knowledge domain in an unsuper-vised way. The graphical model of HDP with auxiliary variables is showed in Fig2.4. In HDP model, x ji means the ith visual word in image j, z ji is the indicatorvariable (index) associated with a mixture component and z ji has discrete values23Figure 2.4: The graphical model of HDP with auxiliary variables. x ji is the ithobservation (visual word) in group j ( image j), and z ji is the mixturecomponent indicator associated with x ji. pi j is the prior distribution onmixture components, which follows a Beta distribution controlled by theconcentrating parameter α and the stick-breaking random variable β . βfollows a Beta distribution controlled by the parameter γ . θk controlsthe distribution over the observation x ji and H is the Dirichlet Priordistribution on θk.on {1,2, . . .}. θ is the factor associated with the distribution of x ji given each z ji.Referring to Fig 2.4, we now show how to generate x ji for image j.1. Sample β ∼ GEM(γ), where GEM is a distribution designed from stick-breaking construction of Dirichlet Process:β ′k ∼ beta(1,γ),βk = β ′kk−1∏l=1(1−β ′l ),β = (β1,β2, . . . ,β∞).(2.1)2. Sample θk from the Dirichlet prior’s base distribution H .3. Generate the group j (image j) by the following steps:(a) Sample pi j from24Figure 2.5: Illustration of the clustering property of the Chinese RestaurantFranchise [79]. Tables ψ’s, as the local clusters of customers θ ’s, arelinked by dishes φ ’s to form global clusters across the restaurants.pi j|α ,β ∼ DP(α ,β ) by the construction:pi′jk ∼ beta(αβk,α(1−k∏l=1βl)),pi jk = pi′jkk−1∏l=1(1−pi ′jl).(2.2)where pi j = (pi j1,pi j2, . . . ,pi j∞).4. Given pi j, generate x ji by the following steps:(a) Sample component indicator z ji from a multinomial distribution pi j :z ji|pi j ∼ pi j.(b) Sample x ji given z ji and θk from a multinomial distribution F(θz ji):x ji|z ji,θk ∼ F(θz ji).To estimate the HDP model, among those Markov Chain Monte Carlo Samplingschemes, Chinese Restaurant Franchise (CRF) is probably the most intuitive one,which also can illustrate the clustering property of HDP over grouped data. Beforegoing into details of CRF, we first introduce the CRF metaphor.25In CRF, there are multiple restaurants with an unbounded number of tables.A customer comes into a restaurant and chooses a table to sit at, and there is ashared menu across the restaurants and one dish is ordered from the menu by thefirst customer who sits at that table. Tables within each restaurant play the roleof local clusters within one group, over the customers sitting at them. The dishesshared across the restaurants are used to link all the tables together to get mixturecomponents over all data points in different groups. Table 2.2 shows the notationsand the process of CRF.Chinese Restaurant Franchiseθ ji : the ith customers in restaurant j.φk : dishes in the global menu.ψ jt : the dish served at table t in restaurant j.t ji : the index of the ψ jt associated with θ ji.k jt : the index of φk associated with ψ jtThe metaphor will be:customer i in restaurant j sits at table t ji, whereas table t in restaurant j serves dish k jt .n jtk : the number of customers in restaurant j at table t eating dish k.n jt. : the number of customers in restaurant j at table t.n j.k : the number of customers in restaurant j eating dish k.n j.. : the number of customers in restaurant j.m jk : the number of tables in restaurant j serving dish k.m j. : the number of table in restaurant j.m.k : the number of tables serving dish k.m.. : the number of tables occupied.Initialization: customer i = 1 enters the restaurant j and sits at table 1, and orders dish 1.θ j1 = ψ j1,n j11 = 1,m j1 = 1For i = 2, ...,customer i sits at table{t with probability n jt.i−1+α0 , for i = 1,2, ...,n j..n j..+1 with probability α0i−1+α0 , for new tabletable t serves dish{k with probability m.km..+γ , for k = 1,2, ...,m..m..+1 with probability γm..+γ , for new dishTable 2.2: Parameter details and the sampling procedure of the ChineseRestaurant Franchise scheme.26As an analogy to the CRF model in Table 2.2, in our one-shot recognitionproblem, we treat each image as one restaurant, a visual word in that image ascustomer x ji coming in and a local cluster (within the image) as table t ji where x jisitting at. To link the tables in different restaurants, we use dish k jt serving at tablet in restaurant j as the indicator of global mixture component shared across allimages. The distributions of t ji and k jt given previous random variables are givenhere:t ji|t j1, . . . , t j(i−1),α ,G0 ∼Tj∑t=1n jtδt ji=t +αG0, (2.3)k jt |k11,k12, . . . ,k21, . . . ,k jt−1,γ ∼K∑k=1mkδk jt=k + γH. (2.4)where G0 ∼ DP(γ ,H). Equations (2.3), (2.4) have the same meaning as Table 2.2,which is also the guidance for us to do sampling for HDP. Note that the numberof k jt variables is not fixed by the algorithm, which is an important property ofinfinite mixture model that the mixture component space is infinite. The CRF alsoillustrates the clustering property of HDP, as shown in Fig. 2.5. After modeling thelocal clusters (tables) in images (restaurants), HDP also models the global clustersacross all groups using table specific dishes. Finally, the dish k jt indicates thecluster associated with customers x ji’s sitting at the table t jt , which is the higherlevel feature we want to model.Sampling t: According to Equations 2.3 and 2.4, the likelihood due to x jigiven t ji = t for some previously used t is f−x ji(x ji|θk ji), where f−x ji(x ji|θk ji) is theconditional probability of x ji given all data points except itself. The likelihood fort ji = tnew can be calculated as :p(x ji|t− ji, t ji = tnew,k) =K∑k=1m.km..+ γf−x ji(x ji)|θk ji +γm..+ γf−x ji(x ji|θnew). (2.5)Thus the conditional distribution of t ji is:27p(t ji = t|x ji, t− ji,k) ∝n− jijt. f−x ji(x ji|θk ji) if t previously usedα p(t ji = t|x ji, t− ji,k) if t = tnew. (2.6)If the sampled value of t ji is tnew, then we need to assign a global cluster to thistnew. The probability for k jtnew is:p(k jtnew = k|t,k− jtnew) ∝m.k f−x ji(x ji|θk ji) if k previously usedγ f−x ji(x ji|θknew) if k = knew. (2.7)If some table t becomes unoccupied during the updating of t ji, we may delete thecorresponding k jt from the data structure. If the result of deleting k jt some mixturecomponent k becomes unallocated, then we delete this mixture components as well.Sampling k: Because k jt determines the component membership of all thedata points in table t, the likelihood by setting k jt = k is given by f−x ji(x ji|θk ji), sothe probability of k jt is:p(k jt = k|t,k− jt) ∝m.k f−x ji(x ji|θk ji) if k previously usedγ f−x ji(x ji|θknew) if k = knew. (2.8)Following the sampling scheme above, given z ji = k jt , we can update F(θz ji)in Fig. 2.4 for image j.2.3.3 New Feature RepresentationAfter modeling HDP over the prior-knowledge data, we now obtain the likelihoodp(wi|zk,Dprior) and the probability p(zk|wi) for connecting latent components withthe dictionary visual words. To find the representations of images based on thelatent components, we need to compute p(zk|Ij) for the jth image. Recall thatthe histogram for image I j based on visual dictionary V is hj, and we have h ji =p(wi|I j). According to the Bayesian rule, we havep(zk|Ij) = ∑wi∈Ijp(zk|wi)p(wi|Ij), (2.9)28where p(wi|Ij) corresponds to the ith dimension of the normalized bag-of-featurehistogram, i.e. h ji∑d js=1 h js. We define fk(hj) ≡ p(zk|Ij) to map the raw feature vec-tor hj = (h j1,h j2, . . . ,h jd j ) to the higher level representation F(hj), where F(·) =( f1(·), f2(·), . . . , fk(·), . . . , fd j(·)), for j = 1,2, . . . ,N. Suppose hj is the bag-of-feature histogram representation based on visual dictionary Vl , the new represen-tation F(hj) is actually the normalized bag-of-feature histogram based on the nextlevel dictionary Vl+1, where each entry is a latent mixture component learned fromHDP modeling. We call F(·) the HDP-ENCODER.2.3.4 Hierarchical Feature LearningIn this section, we will present the hierarchical feature learning structure based onthe HDP unsupervised feature learning model described in previous sections. Fig2.7 shows the construction process of multiple-level features based on a single typeof low-level features.To motivate the proposed structure, we first review the spatial pyramid match-ing scheme [58]. In spatial pyramid matching, as shown in Fig 2.6, the two-dimensional image space is divided into sub-images equally, then the sub-imagesare treated as separate channels to compute the feature matching at each level. Ex-perimental results show that this spatial pyramid construction yields improvementsin similarity measurement using intersection kernels. Since the bag-of-feature rep-resentation treats the visual vocabulary features as orderless, the improvement in-troduced by such multiple spatial histogram resolutions is intuitive: It actuallytakes the location information of feature points into consideration. In this chapter,we propose a similar solution in the discrete feature space: We construct a multi-layer feature space where each level consists of features with different “describingresolutions”. For example, in the “zebra-horse” problem, the “striped” pattern hashigher level of “describing resolution” than “black” and “white”, since we can de-scribe certain areas as “black” or “white”, while only a structure with repeatinglines and alternative color areas in between could be called “striped”. A lowerlevel feature captures local characteristics of an image while a higher level featuredescribes properties related to certain structure of the object or image background.Incorporating the idea of multiple “describing resolutions”, we now present29Figure 2.6: Illustrations of spatial pyramid subdivisions [58] and the weightsfor each level of resolutions.Figure 2.7: Construction of the feature pyramid based on a single type offeatures using HDP-ENCODERs. Each HDP-ENCODER is obtained byEquation (2.9) from HDP modeling on the lower level features.how to use HDP-ENCODERs to build the feature-pyramid. Fig 2.7 reveals thatthe feature learning method is actually a clustering process on the previous lowerlevel, where L0 denotes the bottom level and L1 denotes the next higher level,and so on. In contrast to spatial pyramid subdivisions where higher levels havemore details, our method provides more informative descriptions at higher levels.In our feature pyramid, from bottom to top, the descriptions focus from local de-tails to regions, then to objects. Similar to spatial pyramid, we believe that ourmultiple “describing resolutions” could provide a more comprehensive similaritymetric, which can benefit the classification later. In details, Fig 2.7 shows how30the HDP-ENCODERs work under the pyramid structure based on a single type offeatures. In Fig 2.7, HDPL0−L1 means the transformation function learned fromL0 features using Equation (2.9) based on HDP modeling, and this HDPL0−L1 isused to encode L0 features into L1 features. Recursively, we learn the functionHDPL1−L2 from L1 features and encode them into L2 features and so on. Note thatwith applying the HDP-ENCODER multiple times, we reduce the feature space toa low dimensional one. In practice, we stop the multiple HDP-ENCODER processonce the dimensionality of the new level feature is below 100, to ensure the dis-criminative power of each level. It is worth noting that we estimate the stackedHDPs layer by layer in a greedy way, under the assumption that features at eachlayer follow a multinomial distribution.So far the feature learning we described is based on a single type of imagedescriptor (e.g., SIFT). Is this enough? Empirically conducting feature learningon one individual image descriptor cannot capture all useful information. For in-stance, both color and texture information can be important for classification tasksand should be learned into joint higher level features. It is therefore important thatwe can jointly model different image descriptors (e.g., SIFT from gray scale im-ages and Color Histogram from color images). We propose learning a higher levelfeature vector from multiple types of low-level descriptors: To couple the featurespaces together in HDP modeling, we concatenate different feature vectors into along vector and then apply HDP, which is equivalent to encoding images with alarge joint dictionary. In practice, we may need to design a specific learning struc-ture for a particular task, by analyzing the features provided. Fig 2.10 in Section2.4.3 shows two possible feature learning procedures.2.3.5 Feature CombinationIn this section, we will discuss how to combine features learned from above sec-tions into the classifier’s input data. One reason why object recognition is a chal-lenging task is that images within the same class usually have high intraclass vari-ability. The low-level image descriptors are designed to be invariant to the varia-tions within classes. At the same time, the descriptors are desired to have discrim-inative power for different classes. There is no single descriptor that can satisfy31both requirements for all object classes, thus adaptively combining different typesof features is preferred. Basically, we want to combine descriptors based on color,shape and texture information.In this chapter, we are facing two kinds of feature combination problems: 1)How to combine all types of features at the same level; and 2) how to combine fea-tures at different levels? For the first question, the described crossing-space HDPmodeling can be a solution. The new higher level features learned from concate-nated spaces capture information from all lower level features, yielding a muchmore compact feature space than the original ones. However, as we mentionedearlier, since every feature space has its unique advantages, we need to integrateuseful information as much as we can, especially for one-shot tasks where onlyextremely limited information is provided.Grauman and Darrell [42] propose pyramid matching to find an approximatecorrespondence between two sets of features. Pyramid matching works by plac-ing a sequence of increasingly coarser grids over the feature space and taking aweighted sum of matches that occur at each level of resolution. In our featurepyramid, at any level, the occurrence of the same feature in two images is consid-ered as a match. More specifically, for the feature pyramid with levels 0,1,2, . . . ,L,we use H lX(k) and HlY(k) to denote the histograms of feature k for images X and Y atlevel l. Then the intersection kernel for these two histogram vectors is:κ l(k)(HX ,HY ) =Dk∑i=1min(H lX(k)(i),HlY(k)(i)). (2.10)It was shown that this histogram intersection is additive Mercer Kernel [42]. Nowlet us look closely at all features at level l of the feature pyramid. [55, 78] simplyconcatenate all feature vectors into a long vector, which is equivalent to using theaverage kernel over intersection kernels calculated for all types of features. [37]shows that Multiple Kernel Learning (MKL) and its variants have the best per-formances on classification tasks. However, in our one-shot recognition problem,we don’t have sufficient training data to optimize the linear combination of differ-ent kernels when optimizing the classifier’s coefficients simultaneously. Thus, wechoose the average kernel here for simplicity and good performance according to32[37]. Then the kernel function for level l is:κ l(HX ,HY ) =1MlMl∑k=1κ l(k)(HX ,HY ). (2.11)After defining the matching kernel at each level, we put weights on the kernelscores according to the “describing resolutions” of different levels. Unlike thespatial pyramid, in which finer grid has higher weights, in our feature pyramid,intuitively higher level features have higher weights since we believe that the fea-tures at higher levels are more meaningful than the lower level ones with respectto objects’ characteristics. More specifically, the low-level features have a largeproportion of noisy information, which is an important concern in the one-shotrecognition problem, and higher level features somehow ’filter’ out unimportantinformation by clustering lower level features. Later in the Caltech 4-class exper-iments, we can see that the higher level feature SIFT-L1 yields better recognitionperformance than the lower SIFT-L0 feature. Therefore, intuitively it makes senseto assign higher weights for higher level features. The weight associated with levell is heuristically set to be 12L−l , where L means the highest level. The final kernelfunction for all levels in the pyramid is as:K(HX ,HY ) = ∑Ll=0 12L−l κ l(HX ,HY )= ∑Ll=0 12L−l 1Ml ∑Mlk=1 κl(k)(HX ,HY )= ∑Ll=0 12L−l 1Ml ∑Mlk=1 ∑Dki=1 min(H lX(k)(i),H lY(k)(i)). (2.12)where the final intersection kernel is actually weighted summation of matchingscores for all the dictionary words in all feature types from all levels.2.3.6 One-shot Recognition DecisionFor one-shot recognition tasks, we first use the proposed unsupervised structurallearning method to build a feature pyramid based on unlabeled prior-knowledgedata, we then compute the kernel matrix for the testing data in the target domainaccording to Equation (2.12). Since [42] shows that the intersection kernel is ad-ditive Mercer Kernel, we can directly input our pre-computed final kernel K, as in33Equation (2.12), into the popular Support Vector Machine classifier to make theclassification decision.2.4 Experimental ResultsTo evaluate the performance of the proposed method on one-shot image recogni-tion, we examine two popular, publicly-available data sets in the area of one-shotrecognition. We first evaluate the proposed unsupervised feature learning methodon a 4-class data set which is a subset of Caltech-101, comparing with previousreported performances based on multi-shot training; We then report the results onthe “Animals with Attributes” data set using two different feature learning proce-dures (as shown in Fig 2.10). All classifications are performed by LIBSVM [22]with using our pre-computed kernels or linear kernel (in the 4-class experiment)with the parameter C = 10.2.4.1 Data Sets4-Class[32, 50], a subset of Caltech-101, is a data set consisting of images fromAirplane, Faces, Leopard and Motorbikes 4 classes. Since no previous work usingunlabeled data as prior-knowledge under one-shot recognition setting, we compareour one-shot recognition results with [50] which used 50 training samples in clas-sifications. Airplane, Faces, Leopard and Motorbikes are used as target categories,and 30 images from each of the remaining categories are used as prior-knowledgedata. We compute SIFT descriptors on a dense grid of the images, every 8 pixels,to form the dictionary and vector data. In the testing phase, we randomly select 1training sample from each target category, and use the remaining 29 samples as thetesting data. The experiments are repeated 10,000 times to calculate the averagedclassification accuracy.“Animals with Attributes” data set [55] contains natural color images of50 animal categories. There are 30,475 images in total and six types of pre-computed features for downloading, including RGB color histograms (CH), SIFT[60], Color SIFT (RGSIFT) [38], PHOG [19], SURF [8] and local self-similarity his-tograms (LSS) [75]. Lampert et al. extracted CH feature vectors for all 21 cells ofa 3-level spatial pyramids(1 ×1, 2×2, 4×4). For each cell, 128-dimensional color34Figure 2.8: Images from 4-class data sethistograms are extracted and concatenated to form a 2688-dimensional feature vec-tor. Each of the other vectors, except PHOG, is 2000-bin bag-of-feature histogram.We didn’t use PHOG because that its simple structure is not suitable for recursivefeature learning. Among the descriptors, SIFT and SURF provide image gradientinformation, CH captures color information, LSS serves as texture descriptor andRGSIFT is a combination of color and local gradient information.For one-shot recognition, we examine the same 10 target categories suggestedby [55], and use the remaining as unlabeled prior-knowledge data. We only use30 samples per prior-class as unlabeled training data, to achieve low computationalcost and to show the generalization ability of the proposed feature learning method.In the testing phase, we randomly select one training sample from each target classand use the remaining as testing data. Therefore we have 10 training samples and6170 testing samples for each independent experiment. We repeat the experiments10,000 times to report the average classification accuracy. The 10 target testingcategories are: chimpanzee, giant panda, leopard, Persian cat, pig, hippopotamus,humpback whale, raccoon, rat and seal (See Fig. 2.11).For the conventional multi-shot recognition, we follow the protocol in [55]with using 50 images in each target category for training and the rest for testing. Werandomly select the training samples in each iteration, and repeat 10,000 iterationsto report an averaged accuracy. Similar to the one-shot recognition experiments,for feature learning, we still only use the features learned from a small subset ofprior-knowledge data, without using the training data in the target categories. Thetraining data in the target categories are only used for classifier training.2.4.2 Results on “4-class” data setBased on the SIFT-L1 features learned as in Section 2.3, we employ the SVM clas-sifier to perform one-shot recognition. For the 4-class data set, the classification35Classes 50 training [50] 50 training (SIFT-L1) one-shot (SIFT-L1) one-shot (SIFT-L0)Airplanes 94 % 89% 79 % 46 %Faces 74 % 93% 82 % 85 %Leopard 92 % 89% 77 % 99 %Motorbikes 88 % 94% 93 % 68%mean 87 % 91% 83 % 74%Table 2.3: Performances of one-shot recognition on the 4-class Caltech dataset.accuracy results are reported in Table 2.3, where an average accuracy of 83% isobserved for SIFT-L1. We clearly note a 9% improvement from SIFT-L0. It isworth noting that the proposed one-shot learning method yields comparable per-formances to the method in [50] which is trained on 50 training samples per classand provides an average accuracy of 87%. In addition, to show the general ap-plicability of the proposed method, we also report the results for SIFT-L1 with 50training samples per category, and the average accuracy is 91% which is better thanthat of [50]. It is worth mentioning that the 50 labeled samples per category areused in both feature learning and classification stages in [50], while they are onlyused in the classification stage in our proposed method.To understand the feature learning process better, in Fig. 2.9, we visualize theraw SIFT-L0 features and the learned intermediate (SIFT-L1) features in 2D plotsusing the t-SNE technique [28]. In the plot of raw features, the testing data pointsfrom different classes are somehow mixed, though they seem to reveal a separablepattern by using nonlinear classifiers. However, it is important to recall that, sincewe only have one training sample per class in one-shot recognition, it is infeasi-ble to discover such nonlinear distribution patterns based on one training samplewithout any prior assumption. In the plot of the learned intermediate representa-tions, it is clear that testing data points from different classes are well separated inthe feature space, therefore even with only one training sample available for eachtarget class, the SVM classifier may still be able to make correct decisions for thetesting data. The proposed feature learning based on disjoint prior-knowledge datamay have the potential to achieve a comparable accuracy as that of the fully trainedclassification method in [50], although more investigations will be needed in the36future to test on a larger number of categories.2.4.3 One-shot Results on “Animals with Attributes”We perform the unsupervised feature learning on a small subset (30 images eachclass and 1200 images in total) of the 40-class prior-knowledge data. The featurepyramids are constructed by two different procedures, as illustrated in Fig 2.10. Inthe first procedure in Fig 2.10(a), we use the HDP-ENCODER to learn L1 featuresfrom each of the 5 L0 descriptors. Here we don’t do the cross-space learning yetmainly due to the concern of computational cost. The HDP learning from L0 toL1 actually filters the feature spaces. We then assign 5 types of L1 features into2 categories, texture and color, and perform cross-space learning to obtain an L2“texture” feature and an L2 “color” feature. Further, we combine the texture andcolor features together to learn an L3 “All” feature. Finally, we compute the finalaverage kernel by Equation (2.12) and perform classifications using SVM.Alternatively, we can construct a feature pyramid following Fig 2.10(b), tolearn a compact L2 “Fusion-single” feature. We compute an interaction kernelbased on this L2 feature by Equation (2.10) to do classifications. Since the “Fusion-single” feature is compact (e.g., 60 dimensions), it is possible to be used in realtime mobile applications. Also [61] designs a fast approximate training approachto speed up the SVM training and testing over intersection kernels. We believe thatthe “Fusion-single” feature has practical advantages.Feature pyramid 1 L0 L0−L1 L0−L1−L2 L0−L1−L2−L3 Feature pyramid 2 L2 Fusion-singleAccuracy 14.1%[55, 78]/17.8%(ours) 20.0% 20.3% 21.6% Accuracy 20.3%Table 2.4: One-shot recognition accuracy on “Animals with Attributes” dataset with feature pyramid 1 and 2. The results shown in boldface aresignificantly better than the others, judged by a paired-sample t-test.Table 2.4 summarizes the results in one-shot recognition tasks when employ-ing the two feature pyramid constructions showed in Fig 2.10. Note that a 14.1%accuracy based on L0 features was reported by [55, 78, 90], where they simplyconcatenated all feature vectors into a long vector and used PCA to get a lowerdimensional representation for classification. In our experiments, we use average37intersection kernels and get a better accuracy as 17.8%. As we can see from thetable, employing feature learning from L0 to L1 can provide a good performanceimprovement, i.e. from 17.8% to 20.0%. By constructing the feature pyramid, weget accuracy gains gradually as the feature level increases. It seems that we canachieve the best performance by constructing a 4-level pyramid. From L0 to L0-L1-L2-L3 we observe an absolute 7.5% accuracy improvement (i.e., representinga relative improvement over 50%). From L0-L1 to L0-L1-L2-L3 we only observea 1.6% absolute accuracy gain. However, it is worth emphasizing that even 1.6%represents a significant progress in one-shot recognition where the provided in-formation is limited and the average accuracy is generally quite low. In one-shotrecognition, the standard deviation of recognition accuracy is usually large due tothe randomness introduced by selecting only a single training sample in each cate-gory. However we observe a consistent improvement (more than 75% of the trials)from L0-L1 to L0-L1-L2-L3 during the repeated 10,000 independent experimentsand the improvement represented by the pair-wise accuracy differences is statis-tically significant, judged by the t-test. In addition, it is worth mentioning thatwe used SIFT-L1 only and can achieve a 18.3% accuracy in our conference paper[45], and here we can further improve the accuracy by 3.3%. From Table 2.4, wealso note that the 60-dimensional L2 “Fusion-single” feature can achieve a 20.30%accuracy. It suggests that the proposed feature learning method can learn a singlecompact feature with good discriminative power.Since there are no specific algorithms designed for one-shot recognition withusing only unlabeled prior-knowledge data, to have a feeling of the accuracy upper-bound, we compare the performance of the proposed method with the ones usingmuch more prior information. Lampert et al. [55] used fully labeled images in theprior-knowledge domain and used a sophisticated animal attribute table designedby human experts for each of the class. They achieve 27.8% in the IAP [55] settingand 40.5% in the DAP [55] setting. Tang et al. [78] used fully labeled data inthe prior-knowledge domain, which is probably the closest experimental settingto ours, and they reported an average accuracy of 23.7% for linear projection and27.2% for logistic projection. We are glad to notice that under our experimentalsetting which provides very limited information compared with previous works,the proposed method can achieve a comparable performance of 21.6%, which is38close to the 23.7% accuracy obtained by using fully labeled prior-knowledge [78].To show the scalability of the proposed one-shot recognition method, we alsoconduct a 50-class recognition task as described in [78]. Here all the testing imagesare still drawn from the 10 target categories, and the training images from therest 40 prior-knowledge categories serve as distractors [78]. For convenience andsaving computational cost, we use a subset of distractors in the feature learning,and the results are shown in Table 2.5. For the proposed method, the L0 featuresachieve an accuracy of 4.68%, and the combination of 4 levels of features achievean accuracy of 5.27%. [78] reported an accuracy of 5.38% for the raw features andan accuracy of 7.5% for the logistic projection method. The accuracy differencebetween the proposed method and [78] when using the raw features (L0) couldbe due to the following major setting differences: We use a subset of 40 prior-knowledge categories as distractors for simplicity and [78] uses the entire data set;we only use the unlabeled prior-knowledge images and [78] uses the labeled ones.However the proposed method still shows an improvement from 4.68% to 5.27%when combining different levels of features.Methods Proposed Method in [78]Accuracy(raw/learned) 4.68%/5.27% 5.38%/7.5%Table 2.5: Recognition accuracy results of using raw features and learned fea-tures in different methods.Methods Raw Features [55] Feature Pyramid 1 Feature Pyramid 2Accuracy 65.9% 71.4% 70.0%Table 2.6: Multi-shot recognition accuracy results on the “Animals with At-tributes” data set when using raw features in [55] and feature pyramids 1and 2 in the proposed method.2.4.4 Multiple Shots Results on “Animals with Attributes”To evaluate the general applicability of the proposed method, we conduct the con-ventional multi-shot recognition experiments on the “Animals with Attributes” data39set, and the results are shown in Table 2.6. The proposed method based on featurepyramid 1 in Fig. 2.10(a) achieves a 71.4% accuracy, and the proposed “Fusion-single” feature also yields an accuracy as high as 70.0%. [55] reported a 65.9%accuracy in multiple shots experiments based on 6 types of L0 raw features. Itis noted that our feature learning process performed only on the prior-knowledgedata can improve the absolute accuracy by 5.5% in multi-shot recognition tasks. Itindicates that the proposed feature learning method can learn a general representa-tion and provides better discriminative power by transferring information betweenthe prior-knowledge and target domains. This example shows that the usage of theproposed method is not limited to one-shot recognition tasks. It is expected that wecould get further improvement if we perform feature learning also on the multipletraining samples in the target categories.2.5 Conclusions and Future WorkIn this chapter, we tackle the problem of one-shot image recognition and we pro-pose a novel unsupervised hierarchical feature learning framework to learn higherlevel features based on low-level image descriptors. To construct the hierarchi-cal feature pyramid, we propose using Hierarchical Dirichlet Process to performfeature learning from a lower level to a higher level. We also show that the HDPencoder can be applied recursively, which makes the feature learning procedureflexible and can be customized depending on particular tasks. Furthermore, wepropose using the summation of weighted intersection kernels and the average ker-nel to transfer our feature pyramid into discriminative power. The proposed featurepyramid construction procedure is capable of learning a single compact featurefor recognition. Our experimental results show that the proposed feature learningframework could benefit both one-shot recognition and conventional multi-shotrecognition tasks.Since we could perform the HDP modeling across feature spaces, in the fu-ture, we plan to incorporate multiple media sources into the proposed one-shotrecognition system. For instance, we would like to add the image-associated textdescriptions into the feature learning phase.40−0.03 −0.02 −0.01 0 0.01 0.02 0.03 0.04−0.03−0.02−0.0100.010.020.030.040.05 FacesLeopardMotorbikesAirplanes(a) Raw features−8 −6 −4 −2 0 2 4 6 8−8−6−4−202468 FacesLeopardMotorbikesAirplanes(b) Learned intermediate representationsFigure 2.9: 2D plots of the subsets of the 4-class data set. Each color/patternrepresents a different class from the four categories.41(a) feature learning procedure 1.(b) feature learning procedure 2.Figure 2.10: Two feature learning procedures used for examining “Animalswith Attributes”. The dotted line boxes mean that we concatenate dif-ferent feature spaces together and then apply the HDP-ENCODER.Figure 2.11: Sample images from 10 target classes in “Animals with At-tribute” data set42Chapter 3Cross-domain Object Recognition3.1 IntroductionIn traditional visual object recognition systems, a model is always trained basedon data from the same domain as the testing data, where the implicit assumptionis that the training and testing distributions are the same. However, we often facethe situations that additional data (with or without labels) from other similar butdifferent domains can be available for training. How to train a model that cantake benefits from the extra data in similar source domain(s) by overcoming theside-effects introduced by domain shift is called Domain Adaptation (DA), a re-search topic drawing increasing research attention in computer vision and machinelearning fields recently. In a DA problem, a source domain is different from, butsomewhat similar to the target domain. One example in the text classification areais that the email spam data sets [1] collected from different individual users formdifferent domains, characterizing individual users’ preferences. One example invisual object recognition is given in [12], where the Caltech-256 data set is consid-ered as one domain of object images, and the other domain is the collection of theresults from the image search engine Bing when using the category names fromCaltech-256 as search queries. Label quality is a key characteristic of differentdomains. Although the above two domains both contain images for the same ob-ject categories, the images in Caltech-256 are manually labeled (i.e., noiseless inlabel information), while on the other hand, the labels for Bing images are noisy43and unreliable. As another example, a specific data set for cross-domain objectrecognition is adopted in [41, 54, 71], where the images are acquired by dslr, we-bcam and from the amazon website. Figure 3.1 shows some example images forthe ‘bike’ category from the above 3 domains, along with images from the Cal-tech256 domain. As we can see, though the 4 domains are somewhat similar (i.e.,representing the same object categories), there are differences across these four do-mains in terms of pose, lighting, resolution, camera peculiarities and other factors.In reality, the available data from target domain usually comes with no labels orwith very limited label information. In the literature, the problem involving withavailable training data from target domain with no labels is referred as unsuper-vised domain adaptation; the problem involving with target training data with asmall number of labels is called semi-supervised domain adaptation. In the areaof computer vision and multimedia, the semi-supervised domain adaptation is at-tracting increasing attention, and it is also the focus of this chapter. In this chapter,we tackle the DA problem of image object recognition that aims at efficiently lever-aging the labeled images from different but related source and target domains toderive better hypothesis testing in the target domain.Though it is intuitive to believe that extra labeled data from other source do-mains can potentially increase the recognition accuracy in the target domain, it isnot trivial to make good use of the extra labeled data because of the distribution dif-ferences in the feature spaces across domains. Experimental results in [15, 71, 88]show that standard classifiers directly trained on the combination of the source andtarget domains perform poorly on the test data in the target domain, when com-pared with the classifiers trained on a large number of labeled data from the targetdomain. Let PA (x,y) and PB(x,y) denote joint distributions of feature-label datafrom source domain and target domain respectively, where x is the feature vectorand label y is the label. Semi-supervised domain adaptation is to use a small num-ber of training samples from PB(x,y) and many from PA (x,y) to build a learningmodel for classification. The shift from PA to PB causes troubles when training astandard classifier. We denote the data set from source domain as SA = {XA ,YA }and data set from target domain as SB = {XB,YB}. In detail, the feature data pointsfrom two domains can be described as {xA1 , . . . ,xAnA}, and {xB1 , . . . ,xBnB}.In order to train a domain adaptive classifier by efficiently leveraging the in-44Figure 3.1: Example images of the same ‘bike’ category from four differentdomains: amazon, dslr, webcam and Caltech256.formation from different domains, different DA techniques have been proposed tolearn a function yˆ = f (xBtest |SA ,SB) that predicts the class label of an unseen test-ing sample xtest from target domain with high probability, pB(Y = yˆ|X = xBtest),by forcing pA (x,y) and pB(x,y) to be close. By definition, p(x,y) = p(y|x)p(x),therefore the mismatch between distributions can be handled in terms of p(y|x) orp(x) under different assumptions [91]. Among those, one trend is to directly orindirectly reduce the mismatch of data distributions across domains by projection[10, 25, 40, 41, 49], kernel analysis [30, 31, 47], or metric learning [54, 71]. An-other trend is to combine the decision functions of different classifiers trained onthe data from different domains to obtain a more powerful augmented classifier fortesting data in the target domain [12, 31, 88].In the previous work, label data y is usually treated as a scalar, and multi-classclassification is achieved based on binary classification by adopting one-versus-all45(a) Sample images from domain webcam.sim(desk chair,back pack) > sim(desk chair,bike).(b) Sample images from domain amazon.sim(desk chair,back pack) < sim(desk chair,bike).Figure 3.2: From left to right in each row are images from categories:back pack, desk chair, bike. The proposed output kernel space analy-sis shows that the categories desk chair and back pack are more similarthan desk chair and bike in the webcam domain, while the opposite isobserved in the amazon domain. These observations are consistent withthe visual inspection results.or one-versus-one principle. The use of scalar functions as classifiers works wellin the traditional classification tasks. However, in practice, we observe that the dis-tribution shift from PA to PB actually results in changes in between-category simi-larity. For instance, Figure 3.2 shows that the categories desk chair and back packappear more similar in the webcam domain, while desk chair and bike appear moresimilar in the amazon domain. Such between-category similarity information playsan important role in multi-class classification in DA, and it also defines the struc-ture of the intrinsic manifold where the multi-class data points embed into. Corre-spondingly, a scalar class label cannot capture the shift from PB(y|x) to PA (y|x)correctly. Therefore, we propose modeling the class label as a vector y, using thebinary coding scheme(1 stands for presence and 0 stands for the absence of a classinstance). Recalling the theories of functional analysis on vector-valued functions46[29, 62], we can consider a multi-class classifier as a vector-valued function withthe structured output which induces a vector-valued Reproducing Kernel HilbertSpace (RKHS). In this case, the Input kernel space (the scalar kernel space on theinput features) of this RKHS is related to the mismatch of data distributions PB(x)and PA (x), and the Output kernel space (the matrix kernel space on the struc-tured output of the function) actually corresponds to between-category similarities,which also contributes to a better estimation of P(y|x). The input and output ker-nels together form a more comprehensive kernel space for vector-valued decisionfunctions than the kernel space used in [30, 31, 47] for scalar functions. It is alsomore comprehensive and richer than the intermediate spaces introduced by linearprojections [10, 25, 40, 41, 49] and metric learning [54, 71].Following the above inspiration, in this chapter, we will investigate both theinput and output kernel spaces to overcome the distribution mismatch issue causedby domain shift. More specifically, we propose a Domain Adaptive Input & OutputKernel Learning (DA-IOKL) algorithm for cross-domain image object recognition,i.e., learning an separable RKHS for vector-valued functions (consisting of inputkernel and output kernel) where the mismatch between the data distributions couldbe reduced. The contributions of this chapter can be summarized as follows:1. For the first time, we introduce the analysis of the output kernel space in-duced by a vector-valued decision function into the DA problem;2. We propose a particular objective function in the DA-IOKL model to learnthe optimal input and output kernels jointly. We adopt Maximum Mean Dis-crepancy (MMD) [18] as a regularizer to minimize the structural classificationerror and the mismatch between data domains, and to avoid large computa-tional cost and optimization difficulty, we propose a multiple kernel form toparameterize the input kernel. In addition, DA-IOKL also provides a vector-valued function as a true multi-class classifier;3. We present an efficient algorithm to solve the DA-IOKL optimization prob-lem. Although the objective function of DA-IOKL is not jointly convex w.r.t.all parameters, since it is invex [29] over the output kernel and convex overthe input kernel, it can converge to the global minimum. By adopting box47Figure 3.3: The proposed DA-IOKL framework for cross-domain objectrecognition. Each circle represents a set of data, and its size indicatesthe number of data points.constraints, we apply efficient off-the-shelf optimization approaches such asL-BFGS-B [20] to compute the DA-IOKL solution.We illustrate the cross-domain object recognition process in Figure 3.3. Wecan see that a relatively smaller number of labeled training data from the targetdomain is available during training, compared with the number of labeled trainingdata from the source domain. And a classifier learned jointly based on the datafrom both source and target domains is used to classify the unlabeled testing datain the target domain. For the details about DA-IOKL and the learned multi-classclassifier, please refer to Section 3.3.The rest of the chapter is organized as follows. Section 3.2 summarizes previ-ous works in cross-domain learning and general multiple kernel learning. Section 3.3presents the proposed DA-IOKL algorithm. In Section 3.4, experiments for cross-domain classification are conducted on two data sets: the object domain adaptation48(DA) data set [71] and Caltech256+domain adaptation (cal+DA) data set [40]. Fi-nally, we conclude the chapter in Section 3.5.3.2 Related WorkThe mismatch problem of data distributions was first investigated in the NLP com-munity, where intensive research has been conducted to handle the domain adap-tation problem. To capture the shift in feature spaces, A -distance is introducedin [10, 15, 52] for Structural Correspondence Learning (SCL), which presents anapproximate estimation of the total variance distance between two distributions.Although this method could measure the shift in the feature spaces, it is hard toestimate and it is not clear how to extend it to computer vision adaptation tasks.Other approaches used in vision research such as [30, 31, 67] employ a domainsimilarity measure based on MMD [18], which is non-parametric, easy to estimateempirically and also flexible for the choice of kernel functions. Due to its goodperformance [18] and its compatibility with kernel methods, we adopt MMD as apenalty term over the input kernel space in this chapter.There are two major categories of DA methods based on various domain shiftcriteria: 1) reducing the mismatch of data distributions in the feature spaces byprojections (or equivalently kernel functions or metrics); 2) combining decisionfunctions trained from different domains. In the first category, following theidea of reducing the mismatch of data distributions, the approaches proposed in[10, 15, 52, 67] aim at finding a feature space which can minimize the divergenceof distributions between domains based on a specific measure. In addition, [10]provides a theoretical analysis on the feature representation function and classifi-cation error. Besides these methods, within the scope of data distributions in fea-ture spaces, several intuitive methods are proposed for recognition purposes. [25]presents a feature augmented way to construct a common feature space. Insteadof looking for a subspace by projection, Saenko et al. [71] and Kulis et al. [54]propose learning a metric that can minimize the distance between similar data pairsand maximize the distance between dissimilar data pairs across two domains, byapplying a regularized metric learning model [27]. This approach needs to solve aSemi-Definite Programming (SDP) [48] problem during learning subject to a large49number of linear constraints, and thus it is computationally expensive and hardto scale up to high dimensional data. It is also limited to two-domain adaptationscenarios and requires data correspondences between two domains for better per-formance. [41] adopts a general subspace approach to learn intermediate featurespaces by sampling points along the geodesic of a Grassmann manifold formedby two different domains. More recently, [40] introduces a geodesic flow kernelto extend the idea from [41] and also proposes a way to automatically determinethe optimal dimensionality of the subspace. Similarly, [49] proposes finding anintermediate space where the data points from the source domain could be wellreconstructed by the data points from the target domain. The intermediate spaceis obtained by searching for an optimal linear operator and removing noise andoutliers simultaneously. In particularly, among the above methods, the methods in[25, 40, 41, 49, 54, 71] are investigated for cross-domain image object recognitiontasks. It is worth noting that most methods [25, 40, 41, 49] in the first categorycan be used in both an unsupervised way and semi-supervised way, since it is notnecessary to have labels for certain subspace methods. The small number of labelsfrom the target domains could further improve the performance on the unsuper-vised subspace learning. In this chapter, we include these methods [40, 41, 49] forcomparison under the semi-supervised learning setting for our DA problem.In the second category, several methods are proposed, with focus on the de-cision functions of the classifiers. [88] employs the adaptive SVM to adapt thedecision function f S trained on the source domain into the target SVM classifier f Tby formulating f T = f S +△ f , where △ f is trained based on data from the targetdomain. Transductive SVM [3, 12], Domain Adaptive SVM and cross-domain SVM[51], and other variants of SVMs are also explored in DA problems by defining anew decision function incorporating data from both the target and source domains.Since the first category of methods could yield better performances in cross-domain object recognition, we decide to follow the idea of searching for one orseveral subspaces to reduce the mismatch of data distributions. Recall that the fea-ture spaces or kernel spaces studied in the previous methods [25, 40, 41, 49, 54, 71]are the spaces for scalar functions, which can be considered as the Input spacesfor the vector-valued functions. Therefore, those methods don’t consider the Out-put space which is highly related to multi-class classification. Instead of dealing50only with data in the input space, we investigate the domain shift in the RKHS forvector-valued functions, which contains both Input and Output kernel spaces andis more comprehensive than the input space only. To reduce the data mismatch inthe RKHS of vector-valued functions, as stated in Section 3.1, we propose an Input-Output kernel learning algorithm, DA-IOKL, to jointly learn an input and outputkernel space for the pooled data from source and target domains. Equivalently itcan be considered as searching for a better input-output kernel space where thedomain shift is minimized. The proposed DA-IOKL also provides a vector-valuedfunction as the optimal multi-class classifier.The proposed DA-IOKL contains the learning process for an input kernel matrixand an output kernel matrix. The dimension of the output kernel matrix is usuallysmall, since it only depends on the number of categories. But the dimension of theinput kernel matrix is usually large and grows with the number of data points usedin training and testing. Therefore, we propose learning the output kernel matrix di-rectly with a non-parametric formulation [29] and learning the input kernel matrixwith a parametric form similar to Multiple Kernel Learning (MKL) [70, 77, 83].We plan to learn a convex combination of kernel bases as the optimal input kernelfunction, instead of learning the kernel matrix directly as in [56].3.3 Proposed MethodIn this section we will present the proposed DA-IOKL algorithm. Different fromprevious works that model the class label as scalar and use a scalar function asthe predictive function, the proposed algorithm represents the first attempt to in-vestigate DA with the analysis of the RKHS for vector-valued functions, inspiredby [21, 29, 62]. With the assumption of a multivariate distribution on class labels,we propose using MMD [18] to constrain the mismatch in the marginal distribu-tions. Following a geometric intuition described in [2], we assume the conditionalprobability distributions should be similar if the marginal distributions are shiftedclose to each other. Therefore, we propose learning a vector-valued function inthe RKHS that can give a best classification performance on the target domain data,where we put an MMD regularization on the parametric input kernel estimationand adopt output kernel estimation approach presented in [29]. Since the output51kernel estimation is based on the specific input kernel function, and the input ker-nel function is constrained by MMD regularization, we first fix the input kernelfunction and present the topics related to output kernel learning in Section 3.3.1,Section 3.3.3 and Section 3.3.4. Section 3.3.2 gives a detailed analysis on the ad-vantage of choosing vector-valued function and how this choice could benefit DA.Section 3.3.5 present the whole DA-IOKL that learns the input and output kerneltogether in an alternating optimizing way. At last, Section 3.3.6 proposes a newdomain similarity measure based on the output kernel matrix.3.3.1 Background on RKHS of Vector-valued FunctionsLet Y be a real Hilbert space with inner product (·, ·)Y , X a set, and H is alinear space of functions on X with values in Y . We assume H is also a Hilbertspace with inner product〈·, ·〉. Apparently, if Y = Rm, H is the space of vector-valued functions. We call a function g ∈H a Y -valued function, and we denotethe kernel associated with the RKHS of g as a Y -kernel. We give the definitions forRKHS of Y -valued functions and Y -kernel as follows.Definition 3.3.1. (RKHS of Y -valued functions). A RKHS of a Y -valued functiong : X → Y is a Hilbert space H such that, for all x ∈X there exists Cx ∈ R,||g(x)||Y ≤Cx||g||H ,∀g ∈H .Definition 3.3.2. (Positive semidefinite Y -kernel). We say that H : X ×X →L (Y ) is a positive semidefinite Y -kernel if it satisfies the following property forany finite integer l:l∑i=1l∑j=1(yi,H(xi,x j)y j)Y ≥ 0,∀(xi,yi) ∈ (X ,Y ).In [29], it states that a unique positive semidefinite Y -kernel H is associatedwith a given RKHS of a Y -valued function from H , which is defined over dataset X . We assume Y = Rm, which is the output space in our object recognitionproblem with m categories. Therefore L (Y ) is the space of m ordered square52matrices. Given a basis {bi}i∈T with T = {1, ...,m}, a kernel R over X ×T canbe defined as(bi,H(x1,x2)b j)Y = R((x1, i),(x2, j)).Similarly, given a Y -valued function g : X → Y , we can uniquely define a func-tion h : X ×T → R such thatg(x) = ∑i∈Th(x, i)bi.More details on RKHSs of vector-valued functions could be found in [21, 29, 62].Output KernelFrom Definition 3.3.2, we know that a Y -kernel H is a function defined on X ×X . For xi,x j ∈X , the value for H(xi,x j) is a linear operator in L (Y ), whichis a square matrix. Recall the theory for RKHS of scalar functions, a kernel K fora scalar function is defined on X ×X and its value K(xi,x j) is a scalar, wherexi,x j ∈X . In multi-class classification tasks, we denote Y = Rm as the outputspace of m categories. If a data point x j belongs to category i, its label is denotedby y j ∈ Rm, which has +1 on the the ith element and 0 on others. Let H be theRKHS of the function g : X → Y associated with the Y -kernel H. H could bedecomposed as [29]:H(xi,x j) = K(xi,x j) ·L,∀xi,x j ∈X (3.1)where L is a symmetric positive semidefinite matrix that measures the relationshipsbetween output components (the category similarity information in the multi-classrecognition case) of function g(x). L is called the Output Kernel; the kernel K isthe scalar part of H, and it measures the similarity between data points in the inputspace of the function g(x), so the kernel K is called the Input Kernel. Given thefunction g(x), the predicted category label of a testing data x could be determinedby ypre = argmaxs∈T g(s)(x), where T = {1, ...,m} and g(s)(x) is the s-th elementof the vector-valued function g(x). We can see that g(x) is a also a true multi-classclassification decision function.533.3.2 Vector-valued Functions for Domain AdaptationIn this section, we will explain why it is important to use vector-valued functionsand how the input and output kernels help DA.Why Vector-Valued Functions?Since the data distribution can be decomposed as p(x,y) = p(y|x)p(x), we focuson the conditional probability to illustrate the advantages of using a vector-valuedfunction over a scalar function (e.g. used in previous methods [25, 40, 41, 49] )in DA. For the convenience of expression, we suppose the predictive function isa draw from a Gaussian Process. Accordingly, for a scalar predictive function,the corresponding p(y|x) is a univariate Gaussian distribution characterized by amean and a variance. When extending the scalar predictive function from binaryclassification to multi-class classification (where a vector y is used for the classlabel) by adopting the one-against-one or one-against all rule, it is equivalent tomodel the conditional probability p(y|x) with a covariance matrix whose diagonalentries are the variances of p(y|x) for different classes and off-diagonal entries arezeros. While for a vector-valued predictive function, the resulting p(y|x) is a gen-eral Gaussian Vector distribution characterized by a mean vector and a covariancematrix. The covariances between components of y reflect the similarities betweenobject categories, which can be distinct across domains as discussed in Section 3.1,shown in Figure 3.2.Let Gaussian distributions pA and pB denote the conditional probability distri-butions of domain A and domain B. Suppose we have a way to project the datainto a new space where the two conditional distributions could be matched. Forthe scalar y and the scalar predictive function, only the mean vectors and variancesof pA and pB could be matched (what most of the previous methods try to do) andthere is still a large portion of mismatch between the two distributions due to thedifferences in covariances. However, for the vector y and the vector-valued pre-dictive function, the mean vectors and the covariance matrices (both variances andcovariances) could be matched, which leads to a better match between pA and pB.Therefore, to fully estimate the data distribution, we propose using a vector y tomodel the class label and using a vector-valued function f to model the predictive54function.How to Use Vector-valued Functions for Domain Adaptation?Recalling that data samples are drawn from the distribution p(x,x) = p(y|x)p(x),we can tackle the distribution shift issue from the marginal and conditional distri-butions separately. According to [9], from the geometric perspective, the con-nection between p(y|x) and p(x) could be assumed as follows: If two pointsx1,x2 ∈ X are close in the intrinsic geometry of p(X), then the conditional dis-tributions p(y|x1) and p(y|x2) are similar. Basically, the conditional probabilityp(y|x) varies smoothly along the geodesics in the intrinsic geometry of p(X).In Section 3.3.1, we decompose the kernel for a vector-valued function asH = K ·L, where K is the input kernel. Therefore, we propose minimizing MMD[18] of pA (x) and pB(x), the marginal distributions for two domains, to reducedistribution shift between the two marginal distributions, by learning a proper in-put kernel function K in the RKHS. The details of MMD regularization can be foundin Section 3.3.5. According to the above geometric assumption, with making themarginal distributions of both domains similar, the conditional distributions for thetwo domains, pA (y|x) and pB(y|x), should be similar. Since we use a vector yfor the class label and a vector-valued function f for the predictive function, thecovariance matrix of y is already taken into consideration. For the convenience ofexpression, we first fix the input kernel function K for the estimation of the outputkernel L in Section 3.3.3 and Section 3.3.4. Then in Section 3.3.5, we incorporatea parametric form of K with a MMD regularizer that can learn an optimal K toreduce the mismatch between pA (x) and pB(x).3.3.3 Output Kernel LearningHere we describe the algorithm proposed in [29] to learn an output kernel frominput data, which involves the learning of a Y -valued function g : X → Y andan output kernel matrix L. The basic assumption here is that an output kernelmatrix describes the data structure best if the associated function could achievethe minimum classification error on the training data. Let S+ denote the positivesemidefinite matrix space. The objective function for obtaining the proper output55kernel could be written asminL∈Sm+[ming∈H(l∑i=1||g(xi)−yi||222λ +||g||2H2+ ||L||2F2)], (3.2)where (xi,yi)∈X ×Y are data-label pairs in multi-class classification. Accordingto the representer theorem [62], the optimal solution for the inner minimization hasthe formg∗(x) =l∑i=1H(x,xi)ci = Ll∑i=1ciK(x,xi), (3.3)where K is the input kernel function. By setting the (i, j) entry of K ∈ Sl+ asKi j = K(xi,x j), and Y,C ∈ Rl×m asY = (y1, . . . ,yl)T ,C = (c1, . . . ,cl)T , (3.4)we can see that the objective function in Eqn. (3.2) becomesQ(L,C) := ||Y−KCL||2F2λ +〈CTKC,L〉F2+ ||L||2F2. (3.5)Dinuzzo et al. shows in [29] that Q(·, ·) is an invex function over the open setSm+×Rl×m and proposes an efficient block-wise coordinate descent optimizationalgorithm, which is described in Table 3.1.3.3.4 Domain Weighted Output Kernel LearningIn domain adaptation, one straightforward but effective method is domain weight-ing [14, 25, 72], where the decision functions or loss functions corresponding toindividual domains are weighted according to their “contributions” to the task inthe target domain. Here we adopt the formulation of convex combination of deci-sion functions [72] to address different importance of training data from differentdomains. The weight parameters can be estimated by cross-validation or empiricalstudies.For data set XA from the source domain A and data set XB from the targetdomain B, the joint data set is XA B = [XA ,XB], which results in an output kernelLA B that captures category relationship of the joint set XA B. Also, the associated56Algorithm 1 Output Kernel Learning1: L,C,E,Z← 02: while ||Z+λC−Y||F ≥ δ do3: C← Solution to KCL+λC = Y,4: E←KC5: P← 12ET C−L6: Q← Solution to (ETE+λ I)Q = P7: L← L+λQ8: Z← EL9: end whileTable 3.1: An algorithm for learning a vector-valued function and an outputkernel simultaneously with block coordinate descent [29].function gA B should minimize the classification errors of training data from bothdomains. Now the loss function in Eqn. (3.2) could be split as∑xi∈Aα · ||g(xi)−yi||222λ + ∑x j∈B||g(x j)−y j||222λ , (3.6)where the first item represents the total training error for the source domain trainingdata, and the second item represents the total training error for the target domaintraining data. In DA object recognition, the goal is to achieve better testing perfor-mance in the target domain, therefore we propose a weighted loss to emphasize theimportance of target training error during optimization. By introducing a weightα ∈ [0,1] into the convex combination of the loss function, we could weight theimportance of training data from the source domain. Let D be a diagonal matrixwhose diagonal elements are α for the first nA ones, and 1 for the rest. Similarto Algorithm 1, we can develop an efficient algorithm for domain weighted outputkernel learning by using ˜Y = DY to replace Y and using ˜K = DK to replace Kin Algorithm 1, shown in Table 3.1. The parameter α could be chosen by cross-validation based on the target domain training data.573.3.5 Domain Adaptive Input-Output Kernel LearningWe now present the proposed DA-IOKL algorithm which jointly learns the input andoutput kernels to reduce the domain shift in both kernel spaces. We first briefly re-visit the MMD measure [18] for regularizing domain shift in the input kernel space,we then present the structural risk function and the corresponding optimizationsolution for the proposed DA-IOKL.Domain Shift MeasureTo reduce the domain shift in cross-domain recognition, we first need to define adomain shift measure based on the data from both domains. An efficient nonpara-metric criterion was proposed by Borgwardt et al. [18], which is referred as MMD,to compare the data distributions based on the distance between the sample meansfrom two domains in a RKHS induced by a certain kernel function. Please refer to[18] for the details about MMD. It is worth noting that, MMD can only measure thedomain shift in the input space of the vector-valued function.Structural Risk with Multiple Kernel FormulationHere we introduce a multiple kernel parameterization for the DA-IOKL algorithmfor the domain adaptation purpose. Recall that Eqn. (3.5) defines the classificationerror on the training data, which is actually a function of L and C. To learn an op-timal input kernel function, we use a convex combination of base kernel functionsto parameterize it, so that the new input kernel is Kd that is a function respect tocoefficients d and has the form Kd = ∑Mm=1 dmKm,dm ≤ 0,m = 1,2, . . . ,M, as usedin [70, 77, 83]. dm’s are the combination coefficients and form a column vector d.Km is the m-th base kernel function and M is the total number of base kernels used.Accordingly, the matrix formulation of the input kernel becomesKd =M∑m=1dmKm, (3.7)then Eqn. (3.5) can be rewritten as,Q(L,C,d) := ||Y−KdCL||2F2λ +〈CTKdC,L〉F2+ ||L||2F2,58where the objective function Q(·, ·, ·) is also a function of the kernel combinationcoefficients d. It is easy to see that we could get an ‘optimal’ input kernel functionby minimizing Q(·, ·, ·) w.r.t. d.In cross-domain recognition, reducing the domain shift is one critical concernto ensure the generalizability of the model when tested in the target domain, sincethe number of training samples from the target domain is usually small (e.g., 3 percategory in our experiments). Reducing the domain shift could make the trainingsamples from the source domain more similar to the samples from the target do-main. In addition to reducing domain shift, minimizing classification error is themost important concern to ensure the model’s discriminative ability. Therefore,by addressing the two concerns jointly, we include a MMD regularizing term intothe structural risk function along with the total classification error, which also hasa multiple kernel parameterization for the input kernel learning. Now the MMDregularizer could be written as,Ω(tr(KdS)) =12tr(M∑m=1dmKmS)2 =12dT ppT d, (3.8)where p = [p1, . . . , pM ]T , pm = tr(KmS), and d = [d1, . . . ,dM ]T . Km is the positivedefinite kernel matrix associated with the m-th base kernel function Km.At last, for the desired kernel function Kd = ∑Mm=1 dmKm, we put a simple boxconstraint on the coefficients, which is dm ≥ 0 and dm ≤ ub, for m = 1,2, . . . ,M,with ub > 0, instead of using the simplex constraint in other works [70]. We foundthat the box constraints are easier to solve for large-scale data set or high dimen-sional data, and could obtain good performances. In addition, in order to controlthe model complexity ( preventing ||d|| to be too large), we place another regu-larizing term to bound the norm of the coefficient. We use ||d||1 for the parameterselection purpose for a single image feature, and use ||d||2 for cases where multipleimage features are available.Therefore, the final objective function for our DA-IOKL is:mind∈DminL,C12dT ppT d+θQ(L,C,d)+ηR(d), (3.9)where the feasible set D is {d|0 d ub,}, and R(d) means the l1 or l2 norm of d,59and θ ,η are penalty coefficients. By solving this optimization problem, we couldlearn the optimal input kernel function kd and the output kernel matrix L jointly.In the next subsection, we propose an computationally efficient algorithm to solveEqn. (3.9) in an alternating way.Learning AlgorithmTo solve the optimization problem of DA-IOKL, we make use of the Output KernelLearning algorithm described in Table 3.1 and build our solution based on it. Bydefining J(d) = minL,C Q(L,C,d), we can rewrite Eqn. (3.9) as,mind∈Df (d) = mind∈D12dT ppT d+θJ(d)+ηR(d). (3.10)For the above objective function f (d), it is easy to show that the first part is linearw.r.t. d and J(d) is quadratic w.r.t. d, so that the objective function is convex w.r.t.d, C and L separately, although not jointly convex. It is shown in [29] that J(d) isinvex w.r.t. C and L jointly, so we could infer that f is invex w.r.t (d,C,L) jointlyand thus has the global minima coinciding with a local minima. Therefore, we canemploy an efficient alternating optimization algorithm to solve Eqn. (3.10).The proposed DA-IOKL algorithm is iterative and each iteration contains twooptimization steps. For a given d, J(d) can be computed and C and L can beestimated by using Algorithm 3.1 described in Table 3.1. We first initialize d toget C and L by solving Eqn. (3.5). We then minimize the function f (d) over dwith the fixed C and L. We repeat this two optimization steps for several iterationsuntil convergence or the maximum number of iterations is reached. This DA-IOKLalgorithm is summarized in Table 3.2.For the 2nd step of each iteration, to minimize f (d) over d, we first computethe gradient of f (d) as,▽ f = ppT +θ▽dJ +η▽dR,where ▽dR is the gradient of the l1 or l2 norm. And ▽dJ is the gradient w.r.t d with60Algorithm 2 DA-IOKL1: d← 1M 1M2: while {not converges}3: C,L← Solution to minC,L Q(L,C,d)by Algorithm 3.14: d← Solution to Eq. (3.10) by L-BFGS-B5: end whileTable 3.2: The proposed Domain Adaptive Input-Output Kernel Learning Al-gorithmfixed L and C, which is given by∂J∂dm= 1λ [12tr(Km(CL)2(KTd +Km))− tr(KmCLYT )]+12tr(CLKmC),where ∂J∂dm is the m-th element of ▽dJ, Kd is the fused kernel matrix given certainkernel coefficient d, and Km is the mth base kernel matrix.After obtaining the gradient, we can use quasi-Newton methods with reason-able memory size to optimize f (d) under a simple box constraint. Therefore, theLimited-memory BFGS with Bound constraints(L-BFGS-B) [20] is the naturalchoice for us. L-BFGS-B converges faster than the previous first-order methods[70] since it uses an approximate second-order Hessian update and is suitable forlarge-scale real world data due to its low rank approximation with limited memorysize.A True Multi-class ClassifierAs illustrated in Figure 3.3, DA-IOKL consists of two major components for cross-domain object recognition: DA-IOKL training and DA-IOKL classification. Basedon the joint set XA B, the DA-IOKL model can be trained by solving Eqn. (3.9)using the efficient algorithm described above. We now describe the corresponding61DA-IOKL classifier. As stated in Section 3.3.1, for a certain RKHS, the associatedvector-valued function g(·) is a true multi-classifier. According to the representertheory [62], the non-parametric form (matrix form) of the corresponding functionin the RKHS of g(·) is given as G = CL. And G is a N×m linear operator, where Nis the number of training data and m is the number of categories. Let d∗,C∗ and L∗be the optimal solution learned by DA-IOKL, then the operator is G∗ = C∗L∗, andthe input kernel function is K∗d . For an unlabeled data point xt to be tested in thetarget domain, its row kernel vector computed based on the training data is givenaskNd∗(xt) = [Kd∗(xt ,x1), . . . ,Kd∗(xt ,xi), . . . ,Kd∗(xt ,xN)], (3.11)where x1, . . . ,xi, . . . ,xN are N labeled training data points. For the testing data pointxt , its predicted label vector by using G∗ isyt = kNd∗(xt)G∗, (3.12)where yt is a m dimensional row vector. According to Section 3.3.1, the categorylabel for testing data xt can be determined asyt = argmaxs∈Ty(s)t ,T = {1, . . . ,m}, (3.13)where T is the set of category labels, and y(s)t is the s-th element of the row vectoryt . Therefore, Eqns. (3.11)(3.12)(3.13) together form a true multi-class classifier,the DA-IOKL classifier.Extension to Multi-Source Domain AdaptationUntil now, we have only talked about adaptation from a single source domain,but it is straightforward to extend the proposed DA-IOKL to multi-source domaincases. By investigating the objective function of DA-IOKL, we can see that theMMD term and the domain weighting parameter α are the two items related todifferent domains. Take the case of having two source domains as an example, inSection 3.4.2, we could simply employ two penalty weights α1 and α2 for each ofthe two source domains, which could be estimated by cross-validation on a grid of62[0,1]× [0,1]. For the MMD term, since we only want to reduce the domain shiftsbetween the target domain and each of the source domains, we use the summationof the MMD’s between two domains as the regularizer. Let A and C be two sourcedomains, and B be the target domain, the MMD regularizer in the final objectivefunction isMMD2K(A ,B)+MMD2K(C ,B).Similarly, we could also extend DA-IOKL to multiple source domains if needed.3.3.6 Domain Shift Measure Based on Output Kernel MatrixAfter proposing the specific DA-IOKL model, we look back to ask an importantquestion: which domain should we adapt from? The selection of source domainsshould be based on domain shift or similarity between two domains. The smallershift two domains have, the better performance the adaptation could get. Althoughthe MMD introduced in Section 3.3.5 provides a measure of shift between pA (x)and pB(x), it cannot reflect the shift from pA (y|x) to pB(y|x), which is in the out-put kernel space. In addition, since in our proposed algorithm the output kernel Lis estimated based on the input kernel K, we believe the domain shift informationin K could also be reflected in L. Therefore, we will introduce a domain shift mea-sure based on the divergence of output kernel matrices in multi-class classificationtasks. Besides serving as a domain shift measure, this divergence can intuitivelyexplain how the proposed DA-IOKL handles shift in the output kernel space. FromSection 3.3.1, we know that a RKHS of Y -valued functions can be associated witha unique Y -kernel H , meaning that by carefully choosing a Y -valued function wecould use the associated H to characterize the underlying structure of data froma certain domain. In other words, if could use a pair of (H,g(x)) to describe theRKHS, we could use this pair to represent the domain. According to Equation (3.1),the Y - kernel H can be further decomposed into an input part and an output part.For a fixed input kernel function, a unique output kernel L could be calculatedwhich can reflect the category level structure. Therefore, by fixing a common in-put kernel function, the corresponding output kernel matrices for different domainscan be seen as a domain signature.We now proceed to define a metric on the signatures to measure the domain63shift. Since the output kernel L is fix-ordered positive semi-definite (PSD), wehave considered several popular metrics proposed for PSD matrices, includingRiemannian Metric [34], Affine Invariant Riemannian Metric [68], Log-EuclideanRiemannian Metric [4], Bregman Divergence (so called LogDet) [5] and Jensen-Bregman LogDet (JBLD) [23]. Among these, JBLD, a symmetric extended versionof Bregman Divergence, is much easier to compute than the others. In this chapter,we adopt the JBLD as a measure between output kernels LA and LB, and we referit as OKD. The OKD between domains A and B is defined as the JBLD betweenLA and B,JOKD(A ,B) = J jbld(LA ,LB)= log|LA +LB2 |− 12 log|LA LB|,(3.14)which is symmetric, nonnegative and invariant under congruence transformation[6]. This proposed output kernel divergence could be used as shift measure be-tween domains A and B. Suppose that we have multiple source domains, thedivergence measures between domain pairs could be used as a criterion for domainselection, i.e., selecting the source domain with the smallest OKD value.Unlike MMD which measures the shift in the input kernel space, OKD actu-ally measures the shift in the output kernel space. We also notice that a Rankof Domain (Rank of Domain (ROD)) metric was proposed in [40] for the domainselection purpose. Similar to MMD, the ROD is limited to the input kernel space(i.e. the distribution shift on p(X)), which fails to capture the shift in the outputspace. In Section 3.4, we will report illustrative OKD values computed based onreal data. The results not only provide a criterion for source domain selection, butalso demonstrate that the proposed DA-IOKL method could learn a RHKS wherethe data distributions from different domains are closer than in the original fea-ture spaces. The results provide a direct and intuitive evidence that the proposedmethod can reduce domain shift in the output kernel space, which leads to betterperformances than previous state-of-art methods which mainly focus on the shiftin the input feature space.643.4 ExperimentsIn this section, we will evaluate the proposed DA-IOKL algorithm on the DA dataset [71] and the Cal+DA data set [40]. Experiments on the DA data set are con-ducted for multi-class classification under three scenarios: single source adapta-tion, multi-source adaptation and multi-feature adaptation. The experiments onCal+DA follow the protocol described in [40]. We also compute the OKD valuebetween domains on DA dataset to demonstrate how our proposed method handlesthe domain shift in output kernel space. We compare our results with the state-of-art methods for all experiments. The results show that the proposed DA-IOKLconsistently outperforms the state-of-art methods, which demonstrates the abilityand robustness of DA-IOKL for cross-domain object recognition tasks.3.4.1 Data Sets and FeaturesObject Domain Adaptation Data Set (DA)This benchmark data set for domain adaptation used in our experiments is releasedby Saenko et al. [71], which contains 31 object categories of images from thefollowing 3 domains: amazon, dslr and webcam. In average, the amazon domaincontains 90 instances for each category, whereas dslr domain and webcam domainhave around 30 instances for each category. Moreover, for dslr and webcam do-mains, images are taken for 5 corresponding objects in each category. There are4652 images in total in the data set.To compare with the state-of-art results on this benchmark data set, we firstuse the same SURF [7] feature file released by Saenko et al. in [71]. All imagesare resized to the same width and converted to gray-scale. When abstracting SURFdescriptors, the blob response threshold is set to be 1000, with other parameters leftto be default values. A 64-dimensional non-rotationally invariant SURF descriptoris used to describe the patch surrounding each detected interest point. Then acodebook of size 800 is constructed by K-means clustering on a random subset ofdescriptors generated from images in the amazon domain. Finally, all images in thedata set are represented by bag-of-word histograms formed by vector quantizationusing the 800 dimensional codebook.65To further study the feature combination ability of DA-IOKL granted by themultiple kernel parameterization in the input kernel part, in addition to SURF, wealso abstract GIST [66], dense-SIFT [60], HoG [24] (signed and unsigned) and LSS[74]. We thus investigate 6 types of features in total.Caltech + Domain Adaptation Data Set(Cal+DA)This data set is introduced in [40] to reduce the potential bias in the DA dataset by adding the Caltech256 data set as the fourth domain in addition to theamazon, dlsr and webcam domains. They select 10 common categories betweenCaltech256 and DA: BACKPACK, TOURING-BIKE, CALCULATOR, HEAD-PHONE, COMPUTER-KEYBOARD, LAPTOP-101, COMPUTER-MONITOR,COMPUTER-MOUSE, COFFEE-MUG, and VIDEO-PROJECTOR. There are 8to 151 images per category per domain, and 2533 images in total. FollowingSection 3.4.1, the same SURF features are abstracted from images in Caltech do-main and quantized into 800 dimensional bag-of-word histograms using the samedictionary.3.4.2 Results on DA Data SetIn this section, we will describe the experiment details on the DA data set anddiscuss the results for each of the experimental settings. For the convenience ofcomparing our results with state-of-art results on this data set reported in [41, 49,54, 71], we first follow the same experimental protocol by using the same SURFfeature file as in these works. In addition, we also conduct experiments with mul-tiple image descriptors abstracted from the same data set to demonstrate the abilityof our model for feature combination. Specifically, we use only SURF feature inSection 3.4.2 and Section 3.4.2, and use multiple types of features in Section 3.4.2.We also want to mention another point: Since there are images taken from the sameobjects in the domains dslr and webcam. To make a fair comparison, the same testobjects are held out of training. In other words, if an image of a certain object isa test object, the images of the same object cannot be used during training. All theresults reported for all methods in Section 3.4.2 and Section 3.4.2 are obtained byfollowing this rule. However, the experiments in Section 3.4.2 don’t leave out the66Source Target NC A-SVM ITML [27] symm [71] ARC-t [54] RDALR [49] DA-IOKLdslr webcam 32.2 ± 1.4 33.0 ± 0.8 23 35.1 36.1 36.9 ± 1.2 39.9 ± 1.1webcam dslr 22.1 ± 1.1 26.0 ± 0.7 18 27.5 25.3 30.1 ± 0.8 34.4 ± 1.0amazon dslr∗ 41.3 ± 1.3 42.2 ± 0.9 41 49.5 50.4 50.7 ± 0.8 72.2 ± 2.0Table 3.3: DA Data Set: Classification accuracy results for single sourceadaptation. The average accuracy in % is reported and the correspondingstandard deviation is included. Here ∗ means ‘without holding out theimages of the same test object’.images from the same objects.Single Source Domain AdaptationFollowing the settings in [54, 71], for single source DA experiments, there arelabeled training images available for all categories in both the source and targetdomains at the training time. In every trial of the experiments, we randomly select8 labeled images per category if dslr or webcam is used as the source domain, or20 labeled images per category if amazon is used as the source. We also select3 labeled images per category from the target domain. Note that in our experi-ments, images of the same test object are generally held out of training. For theexceptional cases, we use the mark ∗ to indicate ‘without holding out the imagesof the same test object’. In particular, for each category, we use images of objectswith IDs {1,2,3} for training and {4,5} for testing in the webcam domain, and useimages of objects with ID {1} for training and {4,5} for testing in the dslr domain.We use kernel functions with the form km(xi,x j) = exp(−γmdistm(xi,x j)) as thekernel basis for the input kernel. We use the χ2 distance, l1 norm and l2 norm as thedist(·, ·) function, leading to the χ2 kernel, Laplacian kernel and Gaussian kernelfunctions respectively. And we select γm = 1dim , where dim is the dimensions of thefeature vectors. For the parameters used in the proposed DA-IOKL algorithm, weuse l1 norm regularization as R(d) in the objective function to enforce sparsity inthe kernel coefficients for the purpose of selecting kernel parameters. We set θ to1×10−4 and η to 1×10−3 empirically. Further, we cross-validate the loss penaltyα on the training set and λ is set proportioned to the norm of the output matrix Y.For single source adaptation experiments, the optimal α is 1 for dslr/webcam and67for webcam/dslr, and 0.2 for amazon/dslr. We repeat the experiments 10 times andreport the average accuracy with standard deviation. Results are shown in Table3.3.To demonstrate the accuracy improvements brought by the proposed DA-IOKLalgorithm, we also report the results from several baseline methods. The baselinemethods are described as follows.• NC stands for naive combination of the training data from source and targetdomains. A standard SVM is applied to the combined training data.• A-SVM applies Adaptive SVM [88] on the training data from both domains.• ITML applies metric learning [27] on combined training data from bothdomains to learn a discriminative metric.• symm applies metric learning [71] on the corresponding pairs of trainingsamples between two domains to reduce the domain shift.• ARC-t is presented in [54], which puts an asymmetric regularizer on thecross-domain metric learning problem. The metric is trained on all the datafrom both domains.• RDALR is presented in [49], which learns an optimal linear operator toproject the data from the source domain into an intermediate space by satis-fying reconstruction constraints using the target domain data.In Table 3.3, the results for previous methods are directly quoted from the re-lated papers published in the literature. From the table we can see that the proposedDA-IOKL outperforms all previous methods in all studied experimental settings,and generally the performance improvement is not negligible. Since the exper-imental case amazon/dslr∗ is conducted without holding out the images of thesame test objects, we can see clearly from the table that all methods yield muchbetter results than other experimental settings and that the proposed DA-IOKL pro-vide the best performance on amazon/dslr∗ . Since we only use the SURF type offeatures in the experiments in this section, the coefficients for the input kernel welearned are quite sparse and thus help select the best kernel function.680.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10.10.20.30.40.50.60.70.8AccuracyDomain weight α amazon\dslrwebcam\dslrdslr\webcamFigure 3.4: Classification accuracy VS. the penalty term α curves under threeexperimental settings.Source Target NC A-SVM [88] RDALR [49] DA-IOKLamazon,dslr webcam 20.6 ± 1.8 30.4 ± 0.6 36.9 ± 1.1 39.2 ± 2.0amazon,webcam dslr 16.4 ± 1.1 25.3 ± 1.1 31.2 ± 1.3 31.6 ± 1.9dslr,webcam amazon 16.9 ± 0.7 17.3 ± 0.9 20.9 ± 0.9 25.2 ± 1.1Table 3.4: DA Data Set: Classification accuracy results for multiple sourcesadaptation. The average accuracy in % and the corresponding standarddeviation are reported.During the experiments, we also note that the domain weight α plays an impor-tant role in cross-domain object recognition. We visualize the recognition accuracyplots as a function of α for three experimental settings in Figure 3.4. From the fig-ure we can see that dslr and webcam domains are close to each other, since α = 169Figure 3.5: Illustrations of the output kernels learned from the webcam do-main (upper) and the amazon domain (lower), where the nodes representthe object categories and each edge represents the pair-wise distance be-tween categories measured by a kernel score on it. The larger the outputkernel score is, the more similar two categories are.gives the best performance, meaning that the algorithm tends to treat these two do-mains as the same. The best choice of α actually reveals the domain similarity forthe training data, we therefore could use the MMD measure to guide the selectionof α , if cross-validation is not feasible. For the rest of the chapter, we choose thesame α as used in this section for simplicity. In Table 3.3, the best performancein amazon/dslr is achieved by setting α = 0.2. To demonstrate the performanceimprovement by using domain weighting in the input and output kernel learningof DA-IOKL, we also conduct experiments with α = 1 for amazon/dslr∗ and com-pare the results of different methods. We note that DA-IOKL still yields a muchbetter accuracy at 63.5%±2.1%, which is 12.8% better than the state-of-art result70in [49].To understand the output kernel shift illustrated in Figure 3.2 in Section 3.1,we visualize a part of the output kernel matrix in Figure 3.5. The output kernelscores are normalized for comparison. And the larger the kernel score is, the closertwo nodes are. We can see from Figure 3.5 that the observations in the outputkernel space are consistent with the observations in visual appearances shown inFigure 3.2 in Section 3.1. We believe that the correct estimation of the relationshipbetween categories by the proposed DA-IOKL leads to the improved performancesin DA.Multi-Source Domain AdaptationWe also study the performances of the proposed DA-IOKL for multi-source DA,where multiple different source domains are available during training. More specif-ically, following previous works, we conduct experiments for 2-source DA andevaluate DA-IOKL by comparing its performances with state-of-art methods: A-SVM [88] and RDALR [49]. The settings of training/testing samples follow Section 3.4.2,where the same testing objects are held out during training. We report the av-erage accuracy with standard deviation for 5 trials of experiments. Results areshown in Table 3.4. We can see that the proposed DA-IOKL consistently outper-forms other methods. The performance improvement is significant in the casesamazon,dslr/webcam and dslr,webcam/amazon. In amazon,webcam/dslr, the re-sult is slightly better than that of RDALR, though it is much better than that ofNC and A-SVM. The results in Table 3.4 demonstrate that the proposed DA-IOKLcould successfully adapt from multiple source domains to improve the overall ob-ject recognition in the target domain.Multi-Feature Domain AdaptationTo validate the proposed algorithm for the feature combination purpose, we con-duct the experiments with using multiple features. The base kernels are the sameas in previous sections, and all 6 types of features are used. We conduct classifi-cation experiments for dslr/webcam, webcam/dslr and amazon/dslr cases withoutholding out the images from the same objects during training. Results are shown71Source Target SimpleMKL [70] SVM DA-IOKLdslr webcam 88.7 ± 1.8 88.6± 1.5 91.4 ± 1.4webcam dslr 88.7 ± 1.4 90.3 ± 1.8 92.7 ± 1.1amazon dslr 68.5 ± 1.9 65.7 ± 2.0 76.5 ± 1.5Table 3.5: DA Data Set: Classification accuracy results for multiple features.The average accuracy in % is reported and the corresponding standarddeviation is included.Method C→ A C→ D A→ C A→W W→ C W→ A D→ A D→WNC 23.1 ± 0.4 26.5 ± 0.7 24.0 ± 0.3 31.6±0.6 20.8±0.5 30.8±0.6 31.3±0.7 55.5±0.7symm [71] 33.7 ± 0.8 35.0 ± 1.1 27.3 ± 0.7 36.0±1.0 21.7±0.5 32.3±0.8 30.3±0.8 55.6±0.7SGF [41] 40.2 ± 0.7 36.6 ± 0.8 37.7 ± 0.5 37.9±0.7 29.2±0.7 38.2±0.6 39.2±0.7 69.5±0.9GFK(PCA,PCA) [40] 42.0 ± 0.5 49.5 ± 0.8 37.8 ± 0.4 53.7±0.8 32.8±0.7 42.8±0.7 45.0±0.7 78.7±0.5GFK(PLS,PCA) [40] 46.1 ± 0.6 55.0 ± 0.9 39.6 ± 0.4 56.9±1.0 32.1±0.7 46.2±0.7 46.2±0.6 80.2±0.4GFK(PLS,PLS) [40] 38.7 ± 0.6 38.6 ± 1.4 36.6 ± 0.4 36.3±0.9 28.6±0.6 36.3±0.5 35.0±0.4 74.6±0.5DA-IOKL 63.7 ± 1.5 67.1 ± 4.2 46.6 ± 1.5 71.0±3.4 33.8±1.7 54.8±2.3 54.8±2.8 83.3±1.6Table 3.6: Cal+DA data set: Classification accuracy results, where the aver-age accuracy in % and the corresponding standard deviation are reported.in Table 3.5. We compare with the MKL [70] and a standard SVM using an aver-age kernel. In DA-IOKL, we use the l2 norm of d as regularizer R, to search for ameaningful combination of kernel bases. From Table 3.5, we can see that the pro-posed DA-IOKL learns the optimal kernel coefficients successfully and the learnedinput and output kernels lead to better performances than the baseline methods. Itsbetter performances than SimpleMKL demonstrate that DA-IOKL could combinemultiple features efficiently for the DA tasks.To show the impacts by varying the numbers of training samples from thesource domain / target domain, we conduct the experiments where the numberof training samples from the source domain is fixed to 20 and the number of train-ing samples from the target domain is set as {1, 2, 3, 4, 5} respectively. Also,we conduct the experiments where the number of training samples from the targetdomain is fixed to 4 and the number of training samples from the source domainis set as {5, 10, 20, 30, 50} respectively. The results are shown in Figure 3.6.Here for illustration, we only show the results for dslr/webcam and only use the72SVM with an average kernel as the baseline method, though similar observationsare noted for other cases. From Figure 3.6, as expected, we note that the accuracykeeps increasing as the number of training samples from the target or source do-main increases, and the accuracy saturates as the number of training samples fromthe source domain is sufficiently large. The DA-IOKL consistently outperforms thebaseline method when employing different numbers of training samples.3.4.3 Results on Cal+DA Data SetSince the DA data set is a medium-scale data set and consists images from similarobjects for the same categories across two different domains, this may lead to biasfor the methods evaluated on this data set. To further demonstrate the robustness ofthe proposed DA-IOKL, we conduct DA experiments on the Cal+DA data set in thissection, which contains the 4th domain consisting of images from Caltech256, inaddition to amazon, dslr and webcam domains. When Caltech256 domain serves asa source domain, we randomly sample 20 images per category for training. WhenCaltech256 domain serves as the target domain, we randomly select 3 images percategory for training. And the experiments follow the protocol mentioned in previ-ous sections for other three domains. We report the average accuracy with standarddeviation for 20 trials of experiments. We summarize the results for DA-IOKL andthe stat-of-art methods in Table 3.6. We use the initials of the domain names in thetable to describe the adaptation between two domains, e.g. C→ A means that thesource domain is Caltech256 and the target domain is amazon.In the table, the methods NC and symm [71] are described in Section 3.4.2.SGF is proposed in [41], which samples subspaces along the geodesic line be-tween two domains to search for an intermediate space where the domain shiftcould be reduced. GFK [40] is a kernelized method based on geodesic flow pre-sented by SGF and it is able to determine the optimal dimensions of subspacesautomatically. From Table 3.6 we could clearly see that the proposed DA-IOKLconsistently outperforms the previous methods, generally with large improvementmargins.The methods symm, SGF and GFK are somehow similar to each other in thesense that they all look for a ‘better’ subspace by projecting the data from the input73feature space using a linear projection. Although some of them also adopt non-linear projections using the kernel trick, these methods are still within the scope ofthe Input kernel space for scalar functions. In the contrast, the proposed DA-IOKLreduces the domain shift in the RKHSs of vector-valued functions, which is morecomprehensive and contains both Input and Output kernels. Therefore, it is notsurprising that the proposed DA-IOKL consistently provides the best results. Wealso note that DA-IOKL usually shows larger variances, especially for C→ D andA→W cases. We believe it is caused by the randomness of selecting trainingsamples from the source domain. Since DA-IOKL could make better use of thesource training data, the randomness introduced by the source domain affects DA-IOKL more.3.4.4 Domain Shift Measure in Output Kernel SpaceHere we present one of the first empirical studies on domain shift measure betweenvisual data domains in the output kernel spaces. Given the DA dataset, for a clas-sification task on the target domain dslr, we would like to choose a more similardomain from the rest two as the source. To measure the domain shift, first we ran-domly select 20 samples from each domain, we then perform the DA-IOKL to learnthe output kernel for each domain separately. We also learn the output kernels forall possible combinations of any domain pairs. In all the experiments in this sub-section, we use linear kernel as the input kernel function. We use OKD to computedistances between these output kernels as the shift measures between domain pairs.The results are shown in Table 3.7.Before computing the JOKD divergence, we normalize each kernel by its largestdiagonal entry to make all kernels comparable. JOKD between two identical ma-trices is a constant, denoting by Jsel f . Therefore we scale the resulting similaritytable by dividing all entries by Jsel f . The smallest JOKD is 1.000, meaning thatthe two domains are identical. The smaller the corresponding JOKD is, the moresimilar two domains are.Now for the domain selection problem, according to Table 3.7, if dslr is the tar-get domain, it is better to choose webcam as the source, since JOKD(webcam,dslr)is smaller given amazon, dslr as source candidates. This conclusion is supported74Domain webcam dslr amazon webcam+dslr amazon+dslr amazon+webcam webcam + dslr + amazonwebcam 1.000 1.293 1.314 1.165 1.301 1.247 1.353dslr 1.293 1.000 1.410 1.195 1.280 1.375 1.683amazon 1.314 1.410 1.000 1.270 1.150 1.107 1.201webcam+dslr 1.165 1.195 1.270 1.000 1.189 1.209 1.125amazon+dslr 1.300 1.280 1.150 1.189 1.000 1.133 1.070amazon+webcam 1.247 1.375 1.107 1.209 1.133 1.000 1.053webcam+dslr+amazon 1.353 1.683 1.202 1.125 1.070 1.053 1.000Table 3.7: Domain shift measurements between all possible domain pairs bythe proposed OKD measure.Domain webcam dslr amazon webcam+dslr amazon+dslr amazon+webcam webcam+dslr+amazonwebcam 0.000 1.281 1.026 0.374 0.437 0.271 0.180dslr 1.281 0.000 2.861 0.303 0.744 1.801 0.872amazon 1.026 2.861 0.000 1.718 0.756 0.309 0.754webcam+dslr 0.374 0.303 1.718 0.000 0.299 0.788 0.200amazon+dslr 0.437 0.744 0.756 0.299 0.000 0.336 0.071amazon+webcam 0.271 1.801 0.309 0.788 0.336 0.000 0.203webcam+dslr+amazon 0.180 0.872 0.754 0.200 0.071 0.203 0.000Table 3.8: Domain similarity results between all possible domain pairs byMMD [18]. The MMD value is 0 between two identical distributions. Andthe smaller the MMD is, the more similar two distributions are.by experimental results in Section 3.4.2, indicating that the proposed OKD is pow-erful for domain shift measure. Another interesting observation noted from Table3.7 is that monotonicity holds for OKD. In the experiments of DA, we actually learna RKHS from a mixed data collection. For the 3 cases dslr/webcam, webcam/dslrand amazon/dslr, from Table 3.7, we know that the proposed DA-IOKL essentiallylearns an intermediate RKHS between the source and target domains, which provesthat DA-IOKL indeed addresses the domain shift in the output kernel space.For comparison, we also compute the domain similarity using the MMD mea-sure [18]. Table 3.8 shows the MMD measure results, from which we can see thatMMD mis-judges the similarity between webcam-dslr and webcam-amazon, and itdoes not reveal the monotonicity along domain changes. In addition, [40] reportsthe results of ROD metric on these three domains, which coincide with our OKDresults and the MMD results for the three domains. However, they didn’t includethe results on transiting domains (e.g. webcam-amazon), we didn’t include themhere for further comparison.753.5 ConclusionIn this chapter, we introduce the output kernel analysis into the DA problem. Wealso propose a Domain Adaptive Input-Output Kernel Learning algorithm, referredas DA-IOKL, to learn an optimal input and output kernel space for cross-domainimage object recognition. The proposed DA-IOKL is generally applicable to singlesource, multiple sources and multiple features DA tasks, and our experiment resultsshow that it consistently outperforms the state-of-art methods when tested on twostandard benchmark data sets. For the future work, to get more compact represen-tations and a more efficient algorithm, we plan to study RKHSs for operator-valuedfunctions.760.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.50.760.780.80.820.840.860.880.90.920.94Number of training data from Target domainAccuracydslr−webcam DA−CKLSVM−averKernel(a) Accuracy VS. the number of training samples from the target domain.0 10 20 30 40 50 600.70.750.80.850.90.95Number of training data from Source domainAccuracydslr−webcam DA−CKLSVM−averKernel(b) Accuracy VS. the number of training samples from the source domain.Figure 3.6: Accuracy results for the dslr/webcam case when varying the num-bers of training samples from the target and source domains.77Chapter 4Object Recognition for Imageswith Different Picture Styles4.1 IntroductionDigital images can be different in terms of color tones, contrast, clarity, vignetting,and etc. Here we refer such characteristics of digital images as picture styles. Withthe popularity of photo editing and sharing services such as Instagram, Facebookand Flickr that are available on mobile devices, many digital images generatedby users nowadays are captured by a wide range of devices (e.g., smart phones anddigital slrs) and processed using different photography effect filters (e.g., “lomo-fi”and “lord-kelvin” available in Instagram) to get distinct picture styles with strongpersonal artistic expressions. Recall that the goal of object recognition is to recog-nize natural scenes [57], daily objects [32], or fine-grained species [64, 84] basedon digital images, it is natural to extend the scope of object recognition from stan-dard laboratory images to photos in the wild for daily use. Although there are alarge number of picture styles, their contributing factors can be separated into 3major categories: (1) scene radiance, (2) image pipeline, and (3) post processing.In Fig. 4.1, we show three pairs of images of the same objects to illustrate differentpicture styles.To illustrate the connection between image descriptors with picture styles, wetake an image from the Oxford Flower data set and process it with a popular Insta-78(a) (b)(c) (d)(e) (f)Figure 4.1: We show 3 pairs of images about the same objects with differ-ent picture styles. The differences between (a) and (b) are mainly dueto different scene radiances (illumination condition). (c) and (d) are ofthe same object and taken under the same condition by a digital SLRand a webcam respectively, representing two different image pipelines.(f) is an image obtained by applying InstaramTMlomo-fi effect filter asa post-processing step to image (e), representing one specific photo-graphic effect.79Figure 4.2: In the upper left is an original image from the Oxford Flowerdata set. In the lower left is the lomo-fi version of the image. We selecttwo regions at the same location from the two images (indicated by redboxes), and show the pixel patch, gradient, and SIFT descriptor for eachof them. We then plot the difference between two descriptors in theright.gram effect filter: lomo-fi. We select two patches at the same locations for thesetwo images respectively, and compute the gradients and SIFT descriptors of thepatches, which are shown in Fig. 4.2. Although these two image patches are al-most the same except the color tones, we note that the resulting SIFT descriptorsdiffer with each other about 33% in terms of l2 norm, which probably will makethem be quantized into two dictionary words in the bag-of-word model. Since thedifference is significant for two images that are almost identical in content, it isreasonable to assume that the difference could be more significant for two content-different images with different picture styles within one object class. Therefore,when images used for training and testing don’t have similar picture styles, the ac-curacy of object recognition will degrade. Among the previous related literature,only Domain Adaptation (DA) considers the situation [71] where some images aretaken by a Digital SLR and the rest are taken by a webcam under similar conditions(e.g. (c) and (d) in Fig. 4.1), and images used in training and in testing are takenby different devices.80Although the DA touches the picture style issue by considering two sets of im-ages from different devices as two domains, in their algorithms the domain label ofan image has to be specified. However, in real world applications, images collectedfrom Internet have no “domain labels”, and the training /testing sets are alwaysmixtures of images with various picture styles. Furthermore, more picture stylescan be created by users through post-processing (e.g. Instagram users or iphonecamera app users) besides the ones due to different cameras. Therefore, with amore general setting than DA, developing robust object recognition algorithms be-comes useful and challenging, which should overcome the difficulties introducedby different picture styles without knowing the style information.In this chapter, we study this general object recognition problem with a focuson picture-style-considered descriptor design. Existing approaches usually ignorethe differences of picture styles when computing the standard descriptors, and thentry to reduce the influences of picture styles in the corresponding feature spaces.Such indirect methods are limited by the feature spaces and always require the styleinformation of the images (e.g. the domain labels in DA). In this chapter, we tacklethe problem in a direct way. Suppose a set of images is denoted by A. For an imageI ∈ A, we define a pixel mapping function g : [0,255]→ [0,255] that can be appliedto all the pixels in I and obtain a new image g(I). Let B denote the set of imagessuch that ∀I ∈ A, g(I) ∈ B. For convenience, we denote B = g(A). From our ob-servation, we assume that the pixel mapping function g would influence the objectrecognition accuracy when the images used in training and testing are processedby g (which is confirmed later by experimental results in Section 4.5.2). Therefore,we propose searching an optimal g∗ that can achieve the best recognition accuracywhen all images used are processed by g∗. The searching of g∗ could be difficultsince there is no clear connections between a general function g and the empiricalrisk of the classifier used in object recognition. However, by defining g based ona convex combination of several basis functions, in this chapter, we incorporatethe pixel mapping function g into image descriptors, and we propose an adaptivedescriptor design based on kernel learning. Though we derive the method based onkernel descriptors [17], it is worth mentioning that the proposed approach can beextended to existing standard descriptors as a general framework. In the following,we discuss some related work in Section 4.2. Then we revisit the kernel descrip-81tors in Section 4.3. We present the proposed method in Section 4.4 and report theexperiments in Section 4.5.4.2 Related workDomain Adaptation is probably the most related area to our problem. In thedata set introduced in [71], images from dslr and webcam are different in pic-ture styles, which is similar to the focus of this chapter. Metric learning basedmethods [54, 71], Grassmann manifold based methods [40, 41], and output kernelspace based method [44] were proposed. As we stated, these DA methods cannotsolve our problem in general situations where the domain label information is un-known and hard to specify for images in the wild. Works in [43, 53, 87] estimatethe model of image pipelines, but such estimations are difficult and have no clearrelationships with the descriptor and recognition accuracy. In the area of keypointmatching, several robust descriptors were proposed, such as DAISY [80], GIH [59]and DaLI [63]. Descriptor learning methods [76, 85, 86] were also developed todetermine the parameters of the descriptors through optimization. All these meth-ods are designed for key point matching between image pairs. The different goalleads to descriptors that are not suitable for object recognition, since they are toodiscriminative to tolerate the within-class variances of object categories.4.3 Kernel Descriptor RevisitThe Kernel Descriptor (KDES) is proposed by Bo et al. in [17], which gives aunified framework and parametric form for local image descriptors. Let z denote apixel at coordinate z, m(z) denote the magnitude of image gradient at pixel z, andθ(z) denote the orientation of image gradient. And m(z) and θ(z) are normalizedby the average values of one patch containing z into m˜(z) and ˜θ(z) respectively.The gradient match kernel between two image patches P and Q can be describedaskgrad(P,Q) = ∑z∈P∑z′∈Qm˜(z)m˜(z′)ko( ˜θ (z), ˜θ (z′))kp(z,z′), (4.1)where kp(z,z′)= exp(−γp||z−z′||2) is a Gaussian position kernel and ko( ˜θ (z), ˜θ (z′))=exp(−γo|| ˜θ (z)− ˜θ (z′)||2 is a Gaussian kernel over gradient orientations. And82m˜(z) = m(z)/√∑z∈P m(z)2 + εg, where εg is a small value. Orientation is normal-ized as ˜θ (z) = [sin(θ(z))cos(θ(z))]. To build compact feature vectors from thesekernels for efficient computation, [17] presented a sufficient finite-dimensional ap-proximation to obtain finite-dimensioned feature vectors and to reduce the dimen-sion by kernel principal component analysis, which provides a closed form for thedescriptor vector Fgrad(P) of patch P such that kgrad(P,Q) = Fgrad(P)T Fgrad(Q).And Bo et al. [17] also showed that gradient based descriptor like SIFT [60], SURF[7], and HoG [24] are special cases under this kernel view framework.For the image-level descriptors, Bo and Sminchisescu [16] presented EfficientMatch Kernels (EMK) which provide a general kernel view of matching betweentwo images as two sets of local descriptors. And they demonstrated that the Bagof Words (BOW) model and Spatial Pyramid Matching are two special cases underthis framework. Let X and Y denote the set of local descriptors for images Ix andIy respectively. x ∈ X is a descriptor vector computed from patch Px in image Ix.When applying EMK on top of kgrad , we can have the image-level kernel asKemk(Ix, Iy) =1|X ||Y | ∑x∈X ∑y∈Y kgrad(Px,Py), (4.2)where | · | is the cardinality of a set. [16] provides a closed-form approximationof the feature vector such that Kemk(Ix, Iy) = Φ(Ix)T Φ(Iy), which makes the matchkernel can be used in real applications with efficient computation and storage.4.4 Proposed Method4.4.1 g-incorporated Kernel DescriptorAs stated in Introduction, we want to apply a pixel mapping function g to imagesused for object recognition. In this section, we will give the relationship betweenpixels and descriptors under the function g. Since a general function is hard tolearn, we define a function g = ∑Ni=1 aigi,ai ≥ 0, which is a convex combinationof basis functions. For the convenience of presentation, let’s look at an simple ex-ample that contains only two basis functions g1 and g2, where g1(u(z)) = u(z) andg2(u(z)) = u(z)2. Then g(u(z)) = a1u(z)+a2u(z)2, where a1,a2 are non-negative,83z is the position of a pixel from image patch P, and u(z) is the pixel value at z.Let g(P) denote the new patch after applying g on the pixels of P. Now the imagegradient at z of g(P) becomes▽g(u(z)) = g′|u(z)▽u(z)= (a1 +2a2u(z))▽u(z),(4.3)where a1 +2a2u(z) is a non-negative real number and ▽u(z) is a vector. Let mg(z)and θg(z) be the magnitude and orientation of ▽g(u(z)), and m(z) and θ(z) becorresponding values of ▽u(z). It is clear that θg(z) = θ(z), therefore ˜θ (z) = ˜θg(z),which means the orientation is invariant to the pixel mapping function g applied tothe image patch. Under the assumption that a1,a2 ≥ 0, we havemg(z) = ||▽g(u(z))||= ||(a1 +2a2u(z))▽u(z)||= a1||▽u(z)||+a2||▽u(z)2||= a1||▽g1(u(z))||+a2||▽g2(u(z))||.(4.4)Notice that the magnitudes used in kgrad are normalized based on local patches,which is important to make the contextual information comparable for differentpatches. Let m1(z) and m2(z) denote ||▽g1(u(z))|| and ||▽g2(u(z))|| respectively.To retain the simple convex combination form of mg(z), we propose a new localnormalizationmˆg(z) = a1m˜1(z)+a2m˜2(z), (4.5)where mˆg(z) denotes the new normalized magnitude of mg(z), and m˜1(z) and m˜2(z)are normalized by l2 norm as mentioned in Section 4.3. It is clear that mˆg(z) is alsolocally normalized and still comparable for different patches. Since the goal ofthis chapter is object recognition, any appropriate local normalization method isacceptable.Now given two image patches g(P) and g(Q), which are obtained by applyingthe function g to patches P and Q respectively, we derive the gradient match kernel84between them as followingˆkgrad(g(P),g(Q))= ∑z∈g(P) ∑z′∈g(Q) mˆg(z)mˆg(z′)ko( ˜θg(z), ˜θg(z′))kp(z,z′)= ∑z∈P ∑z′∈Q(a1m˜1(z)+a2m˜2(z))(a1m˜1(z′)+a2m˜2(z′))ko( ˜θ (z), ˜θ (z′))kp(z,z′)= a1a1 ∑z∈P ∑z′∈Q m˜1(z)m˜1(z′)ko( ˜θ (z), ˜θ (z′))kp(z,z′)+a1a2 ∑z∈P ∑z′∈Q m˜1(z)m˜2(z′)ko( ˜θ (z), ˜θ (z′))kp(z,z′)+a2a1 ∑z∈P ∑z′∈Q m˜2(z)m˜1(z′)ko( ˜θ (z), ˜θ (z′))kp(z,z′)+a2a2 ∑z∈P ∑z′∈Q m˜2(z)m˜1(z′)ko( ˜θ (z), ˜θ (z′))kp(z,z′)= a1a1kgrad(P,Q)+a1a2kgrad(P,Q2)+a2a1kgrad(P2,Q)+a2a2kgrad(P2,Q2),(4.6)where P2 and Q2 denote the pixel-value-squared patches from P and Q. And it isworth noting that ˆkgrad above is different from the standard kgrad in Eq. (4.1), sincewe define a different normalization approach in Eq. (4.5). Since g = ∑2i=1 aigi, Eq.(4.6) indicatesˆkgrad(g(P),g(Q)) =2∑i=12∑j=1aia jkgrad(gi(P),g j(Q)). (4.7)Via Eq. (4.7), we successfully incorporate the pixel mapping function g into imagedescriptors, which we call g-incorporated KDES.To see this connection at the image-level, we plug Eq. (4.7) into Eq. (4.2) andhave the image-level kernelKemk(g(Ix),g(Iy) = 1|X ||Y | ∑x∈X ∑y∈Y ˆkgrad(g(Px),g(Py))= 1|X ||Y | ∑x∈X ∑y∈Y ∑2i=1 ∑2j=1 aia jkgrad(gi(Px),g j(Py))= ∑2i=1 ∑2j=1 aia j( 1|X ||Y | ∑x∈X ∑y∈Y kgrad(gi(Px),g j(Py)))= ∑2i=1 ∑2j=1 aia jK(gi(Ix),g j(Iy))= ∑4m=1 dmKm,(4.8)850 0.2 0.4 0.6 0.8 100.20.40.60.81 g1g2g3Figure 4.3: Plots of the proposed basis functions.where dm and Km have one-to-one correspondence to aia j and K(gi(Ix),g j(Iy)),and the order does not matter since they are exchangeable in the summation. Sincewe limit ai to be non-negative when we define g, dm’s are also non-negative. Inaddition, Km’s are positive definite (PD) kernels, which makes Kemk here a convexcombination of PD kernels and can be used in standard multiple kernel learning.Therefore, we successfully transfer the problem of searching for an optimal g∗ intothe problem of learning the optimal kernel weights through Eq. (4.8). In general,for N basis functions (i.e., gi’s), there will be N2 base kernels in Eq. (4.8).4.4.2 Basis Functions gi’sFrom Section 4.4.1, we know that the selection of basis functions (i.e., gi’s) is asimportant as learning the parameters. By exploring the pixel mapping functions(e.g., photography effect filters) used for photography, we note that Gamma cor-rection and the “S” curve are two major categories of photography effects. Gamma86Figure 4.4: From left to right: the original image I, g1(I), g2(I), and g3(I).correction can brighten (γ < 1) or darken (γ > 1) the images, and the “S” curvecan increase the contrast. Although these popular functions are proposed for vi-sual pleasure of photos, we believe that they can also benefit the construction ofbetter image descriptors. For example, brightening can bring back more detailsin the dark part of an image; Darkening can surpass the irrelevant areas of an ob-ject image, since most images are correctly exposed for the center object; Highercontrast emphasizes the texture and shapes. Therefore, these three types of func-tions make good candidates for basis functions. However, a power-law functionused in Gamma correction contains a free parameter that will be left in the gra-dient expression and cannot be combined using simple addition, therefore a goodapproximation is desired. Since the “S” curve doesn’t have a standard formulation,we adopt the sigmoid function in our algorithm. In this chapter, the three basisfunctions are:g1(x) = 0.3∗ (log(2x+0.1)+ | log(0.1)|) (4.9)g2(x) = 0.8x2 (4.10)g3(x) =11+ e−8(x−0.5) , (4.11)where x takes a value from [0,1], representing a scaled image pixel value. Theplots of these functions are shown in Fig. 4.3, from which we can see that g1 couldserve as a brightening function, similar to gamma correction when γ < 1, g2 hasa shape like gamma correction when γ = 2 and thus could be used as darkening,and g3 has a “S” shape that increases the contrast by brightening the bright regions87and darkening the dark regions. The effects of these functions can be seen fromFig.4.4.4.4.3 Learning of the ParametersAfter defining the basis functions, the next question is how to estimate the weightcoefficients for object recognition. According to Eq. (4.8), the image-level kernelcan be decomposed as a convex combination of several base kernels. We adoptGeneralized Multiple Kenrel Learning (GMKL) [83] and put non-negative con-straints on the kernel coefficients. We also notice that some weight coefficients canbe the same (e.g. a1a2 = a2a1), though our experiments show that the results aresimilar with or without the equal-weight constraint. Since the number of kernels inour algorithm is not large, we use the standard GMKL with l2 norm regularization.4.4.4 Adaptive Descriptor DesignIn this section, we summarize the major steps of the proposed adaptive descriptordesign as follows:Step-1 Process image I from the data set with {gi}3i=1 shown in Eq. (4.9) (4.10)(4.11), to obtain {gi(I)}3i=1.Step-2 Compute gradient-based descriptors for I and {gi(I)}3i=1 to get 4 descriptors.Step-3 Build a codebook using K-means by sampling from all training images andall 4 descriptors of each image.Step-4 Quantize each image from training and testing sets into 4 image-level featurevectors based on the descriptors.Step-5 For each pair of images, compute linear kernels between any two of their 4image-level features to get 16 base kernels, as shown in Eq. (4.8).Step-6 Train GMKL on 16 base kernels to obtain optimal kernel weights.Our proposed method does not require prior knowledge on picture styles oftraining or testing images, and the Adaptive Descriptor Design (ADD) can work asa general framework. For instance, in Step-1, other proper functions can be used88here as basis functions, besides the ones we use here. According to the analysisby [17], most gradient-based descriptors, such as SIFT [60], SURF [7] and HoG[24], are special cases of the kernel descriptor, which all can be used in Step-2to compute descriptors from image patches. In addition, the quantization methodused in Step-4 can be chosen from Bag-of-word, Spatial Pyramid Matching andEfficient Match Kernel. In other words, our proposed algorithm can be used widelyto improve previous methods which are based on gradient descriptors and SVMs.We also want to point out that the proposed ADD is essentially a single featuremethod, although multiple kernels are used for estimating the coefficients. For animage I from the data set, it is equivalent to extracting standard descriptors of g(I)and using a single kernel SVM based on these descriptors for classification.4.5 ExperimentsIn this section, we describe the details of experiments and report the recognitionperformances of the proposed method when compared with standard gradient-based image descriptors. We conduct object recognition on the Domain Adapta-tion data set [71] and the Oxford Flower data set [64]. We also process the imagesfrom the Oxford Flower data set using several popular photography effect filters inInstagramTM1.4.5.1 Domain Adaptation Data SetThe Domain Adaptation data set was introduced by [71], where images for thesame categories of objects are from different sources (called domains): amazon,dslr and webcam. As we stated in Introduction, the two domains dslr and webcamonly differ in picture styles which are due to different image pipelines. Applyingthe proposed ADD algorithm, we adopt KDES + EMK and SURF+BOW two sets offeatures to demonstrate that ADD can work as a general framework to improve theperformances of gradient-based descriptors in general. We follow the experimentalprotocol used in [54, 71] for semi-supervised domain adaptation. It is worth notingthat we don’t use any domain-label information to specify the picture styles of1We use Adobe PhotoshopTMaction files created by Daniel Box, which can give similar effectsas InstagramTM.89Source Target standard KDES ADD AK ADD GMKLdslr webcam 49.30 ± 1.26 50.28 ± 1.02 54.81 ± 1.07webcam dslr 46.67 ± 0.80 48.17 ± 0.99 50.33 ± 0.79amazon dslr 47.43 ± 2.79 48.57 ± 2.51 53.90 ± 2.67Table 4.1: Experiments on the DA data set based on the KDES descriptor.The average recognition accuracy in % is reported and the correspondingstandard deviation is included.images, our proposed method could figure out an optimal descriptor automaticallybased on the training set.ADD based on KDES and EMKWe extract KDES descriptors of all the images in three domains and create a 1,500-word codebook by applying K-means clustering on a subset of all 4 types (original+ 3 variants for each images) of descriptors from amazon domain. And then thiscodebook is used to quantize 4 types of descriptors of all 3 domains of imagesusing EMK. After obtaining the 16 linear kernels by computing the inner productof every two types of descriptors between two given images, we conduct objectrecognition experiments using SVMs for: the standard KDES, the average kernelof these 16 kernels (Average Kernel (AK)), and the GMKL based on 16 kernels. Weshow the results in Table4.1, from which we can see that the proposed AdaptiveDescriptor Design outperforms the standard KDES in all cases for both the averagekernel and an optimal kernel learned by GMKL. Particularly, the ADD GMKLmethod improves about 6% from the standard KDES in all cases, which is close tothe improvements obtained by domain adaptation methods [40, 49, 54, 71] wheredomain-label information is used.ADD based on SURF and BoWTo show the general applicability of the proposed ADD, we follow previous meth-ods [40, 49, 54, 71] to extract standard SURF descriptors from the original and 3variants of each image, then a 800-word codebook is created from amazon domain.90Source Target standard SURF ADD AK ADD GMKLdslr webcam 37.05 ± 1.72 41.61 ± 1.05 42.00 ± 1.16webcam dslr 30.09 ± 0.81 36.57 ± 0.75 36.45 ± 0.49amazon dslr 34.49 ± 1.30 40.62 ± 1.59 36.19 ± 2.04Table 4.2: Experiments on the DA data set based on the SURF descriptor.The average recognition accuracy in % is reported and the correspondingstandard deviation is included.All images in 3 domains are quantized by this codebook using Vector-quantizationto get Bag-of-Word features. After obtaining 16 linear kernels, we also conductexperiments using the standard KDES, the average kernel, and an optimal kernellearned by GMKL. We report the results in Table 4.2. The proposed ADD meth-ods also outperform the standard SURF descriptor in all cases. However, in thisexperiment, the average kernel approach gives better results than that of the GMKLlearned kernel in some cases. We think the worse performance of the GMKL basedADD is due to the lack of training, since the SURF descriptors are sparsely ex-tracted from images and only 11 (8 from the source domain and 3 from the targetdomain) training images per category are used. But the results of ADD AK andADD GMKL are sufficient to show that the proposed Adaptive Descriptor Designcan be applied on top of gradient-based descriptors widely, for different tasks.4.5.2 Oxford Flower Data SetOxford Flower data set [64] contains 1360 images for 17 flower species. To simu-late the images used in real world applications, which are taken by different devicesand under various pixel-level post-processing, we process the images from the Ox-ford Flower data set using 3 photography effect filters that are popularly used inInstagramTM: lomo-fi, lord-kelvin, and Nashville. Together with the original im-ages, we obtain an image data set of 4 effects. We keep the images generated fromthe same original images with same IDs. Note that images from the original OxfordFlower data set were collected from many different sources (e.g., they were takenby different devices under different conditions), the factors of scene radiance and91Figure 4.5: From left to right: the original image, with lomo-fi, lord-kelvin,and Nashville effects.image pipelines are already taken into consideration. We show an example imageand its variants processed by 3 photography effect filters in Fig. 4.5, and these 4images have the ID.Since KDES gives a general framework for gradient-based descriptors, we onlyuse KDES in the experiments in this section. For convenience of expression, werefer the data sets obtained by applying effect filters by picture styles. Similar as inSection 4.5.1, we extract KDES for all images from all 4 styles (original, lomo-fi,lord-kelvin, and Nashville). We construct a 2,000-word codebook by sampling 4types of descriptors from the original style. Then all images from 4 effects arequantized using EMK with this codebook.To simulate the real image collections as mixtures of images with different pic-ture styles, we first generate an experimental data set from the 4 styles, then splitthis data set into training and testing sets. 1) Experimental data set generation:M styles are chosen first, from which we want to sample images. For a given ID,only one image is uniformly randomly selected from M styles (i.e. one and onlyone image from Fig. 4.5 is sampled) to form a experimental data set. This gen-eration procedure makes sure that images with the IDs will not appear in trainingand testing together. 2) Training/testing sets splitting: after obtaining an experi-mental data set, we randomly split the set into training and testing sets with equalsizes. Therefore, there are equal numbers of images from each style appearing intraining and testing. Different from domain adaptation, images used here in train-ing or testing are not separated according to domain labels, which is more similarto real-world applications where no information of picture styles are available. Weperform object recognition using SVMs for the standard KDES, average kernel,and the optimal kernel by GMKL. We report the results for 10 runs of experimental92style1 style2 standard KDES ADD AK ADD GMKLoriginal n/a 69.35 ± 2.20 67.76 ± 2.65 74.32 ± 1.77original lomo-fi 65.85 ± 1.66 64.24 ± 1.88 71.62 ± 1.04original lord-kelvin 67.53 ± 1.32 66.06 ± 1.80 72.82 ± 0.69original Nashville 66.44 ± 1.73 64.62 ± 1.39 71.88 ± 0.72lomo-fi n/a 65.12 ± 1.48 63.97 ± 1.62 69.82 ± 0.70lord-kelvin n/a 68.09± 1.38 67.06 ± 1.40 72.62 ± 1.38Nashville n/a 67.03± 1.10 66.85± 1.28 71.68 ± 1.94all n/a 64.56 ± 0.90 63.24 ± 0.58 69.88 ± 0.51Table 4.3: Results on the Oxford Flower data set. The average recognitionaccuracy in % is reported and the corresponding standard deviation isincluded.data set for each scenario in Table 4.3. For each run, an experimental dataset israndomly generated and split into training and testing.From Table 4.3 we can clearly see that the proposed ADD GMKL methodis superior to the standard KDES in all cases. From the top 4 rows of Table4.3, we note that the recognition accuracy decreases when images with differentpicture styles are used, which confirms the motivation we described in Section4.1. In addition, according to the single style experimental results, the pixel map-ping function g can influence the recognition accuracy through the computationof the g-incorporated descriptors for all images, which supports the proposed ideathat learning an optimal function g∗ can improve object recognition when usinggradient-based descriptors. Further, when images are uniformly sampled from all4 styles, the standard KDES descriptor yields worst performance, which is reason-able since the higher diversity in appearances of images leads to larger differencesbetween descriptors of similar image patches of the same objects.After demonstrating that picture styles can affect the recognition accuracy, theimproved performance of ADD GMKL shows that our proposed algorithm is an ef-ficient solution. Recall that our ADD can be considered as a single feature method,the proposed ADD GMKL outperforms the state-of-art [36] by 4% on the originalOxford Flower data set. Therefore, the Adaptive Descriptor Design can be usedwidely on top of gradient-based descriptors to further improve the recognition ac-93curacy.We also notice that the ADD AK method is not better than the standard KDES.We believe it is due to the small size of the codebook (2,000 words). Since 4 typesof descriptors are extracted from one image on a dense grid and there are 1360images in total, this codebook introduces large distortion in quantization, whichdecreases the performance of the average kernel approach.4.6 ConclusionIn this chapter, we focus on the effects of different picture styles of images onobject recognition. After showing the connection between pixel mapping functionsand gradient-based image descriptors, we incorporate the pixel mapping functiong into the image descriptor and propose an Adaptive Descriptor Design (ADD)framework for object recognition in the wild. We demonstrate that the proposedg-incorporated ADD can be widely used as a general framework based on popularimage descriptors, and the experimental results show the recognition improvementsof ADD on the domain adaptation data set, the standard Oxford Flower data set andits variants with different picture styles (photography effect filters).94Chapter 5Conclusions and Future Work5.1 Significance and Potential Applications of theResearchVisual object recognition has become one of the emerging artificial intelligent tech-nologies that have attractive real-world applications and start influencing people’sdaily life. Especially because of the popularity of smart phones and tablets, im-age object recognition is getting more realistic and meanwhile also challengingdue to the fact that images used in testing could be taken with different devicesunder different conditions for different types of objects. In the uncontrolled envi-ronment, object recognition suffers from the shortage of high quality training data.In this thesis, we propose novel machine learning algorithms to overcome the dif-ficulties for object recognition with limited training data in realistic situations. Inparticular, we develop an unsupervised hierarchical feature learning algorithm forone-shot recognition (Chapter 2), a DA-IOKL algorithm to reduce the mismatch be-tween training and testing data (Chapter 3), and an ADD algorithm to overcome thedifficulties caused by unknown picture styles in both training and testing images(Chapter 4).The shortage of high quality training data could be extreme sometimes, suchas in the one-shot recognition case as we discussed in Chapter 2. One-shot recog-nition has its potential real-world applications: for instance, recognizing the faceof a person whose facial images are not easy to collect (e.g., a wanted criminal or95a newly made friend); or recognizing images of rare animals which have only beentaken photos once. Our proposed unsupervised feature learning method would bevery useful if only unlabeled images are available besides the one-shot training im-ages from target categories. In addition, since the unlabeled images from differentcategories (other than the target categories) come with no cost, the performanceof our proposed algorithm could be further improved by providing more unlabeledimages. Furthermore, our feature learning method is a general method that can beused for traditional multiple-shot recognition problems too. It is also worth notingthat the gain from our algorithm is more significant when the number of labeledtraining images for target categories is relatively small.Besides recognition in web-scale that requires generic object recognition en-gines (e.g., Google, Bing, and Baidu.), there are other practical situations wherea small number of specific object categories with specific objects are in the targetdomain. For example, a personalized recognition system to recognize personal be-longings for a specific user, or an automatic album organizing system to categorizephotos of a particular user who takes photos of several specific scenes and objects.In these cases, it is not efficient to ask every user to collect a large number of la-beled images of the specific objects he/she needs to recognize for a personalizedrecognition system. In this big data era, a large number of labeled training imagesfrom the same categories of interest in other domains can usually be found frompublicly available standard data sets (e.g. SUN data set, MIT Indoor Scene dataset, and ImageNet), and DA aims to tackle the recognition problem by making gooduse of these existing data sets, instead of wasting efforts to collect a large numberof training images for every user. Our proposed DA-IOKL algorithm could improveDA with a better classifier and multiple features comparing with previous meth-ods, with the same amount labeled training images from target domains. We alsopropose a domain similarity measure OKD to help selecting which source domain(e.g., publicly available data set) to use.Different color tones or illumination conditions of images are not thought toplay an important role on the recognition performance, since most of the image de-scriptors are designed to be “invariant to illumination changes”. However, due tothe popularity of services like Instagram, photos shared online or stored on users’smart phones are often with dramatically different picture styles. For the first time,96we notice that different picture styles do cause problems and degrade the perfor-mance of object recognition. Our proposed ADD algorithm could improve the per-formance of any gradient-based image descriptor when images in both trainingand testing are with different picture styles. Furthermore, different picture styleswidely exist in many real-world situations, and our proposed ADD can be used inparallel with other methods to further improve the performance.In summary, the algorithms proposed in this PhD thesis can handle problemscaused by the shortage of high quality training data when (1) the number of trainingimages is limited, (2) there is a mismatch between training and testing images, and(3) images in training and testing have different picture styles. The algorithms canalso be combined appropriately.5.2 Summary of ContributionsIn this thesis, we have investigated several real-world situations of limited trainingdata in the big data context, and proposed novel machine learning algorithms asefficient solutions. We conclude the major contributions of my PhD thesis in thissection.In Chapter 2, we tackle the problem of one-shot object recognition in the set-ting where no labeled images are required in the source categories to reduce thehuman effort in labeling data. We propose an unsupervised hierarchical featurelearning algorithm by stacking HDP-encoders layer by layer to learn more infor-mative image representations. Our algorithm demonstrates superior performanceson both one-shot recognition and conventional multi-shot recognition tasks withonly unlabeled images from source categories, e.g., on Caltech-4 data set and Ani-mal with Attributes data set.In Chapter 3, we present DA-IOKL algorithm to improve the performance ofDA by reducing domain shifts in both input and output spaces. The proposedDA-IOKL can be used in three situations: the single source adaptation, multiplesources adaptation and multiple features adaptation. The experimental results onDomain Adaptation data set and Cal+DA data set show that the proposed DA-IOKLachieves the state-of-art performance for all three cases.Chapter 4, for the first time in the computer vision community to our knowl-97edge, introduces the problem caused by different picture styles. We propose ADDto learn a pixel mapping function for image descriptors that can improve recogni-tion. Experiments demonstrate that ADD is easy to implement based on existingobject recognition algorithms, with only a small amount of modification. Exper-imental results show that ADD can achieve improved performances over severalgradient-based image descriptors on Domain Adaptation data set, Oxford Flowerdataset, and variants of Oxford Flower data set.5.3 Future WorksIn this section, we would like to point out several future directions to further copewith the challenging situations having only limited training data.Besides one-shot recognition and domain adaptation problems described inChapter 2 and Chapter 3 respectively, zero-shot learning is another sub-area deal-ing with shortage of high quality training data in a more extreme setting, wherethere is no training images for the target categories. Zero-shot learning is essen-tially a multi-modal approach which combines textual information and visual in-formation together to improve the prediction on unseen images under the zero-shotsetting. As we discussed in Chapter 2, the proposed HDP-encoder could be extendto multi-modal settings to learn higher level features from the mixture of informa-tion. In addition, textual and visual information could be viewed as in the coupledinput spaces and sharing the same output space. Therefore, we plan to extend ourproposed Input-output Kernel Learning algorithm into the zero-shot setting wheretextural information is available in another input space.In Chapter 4, we have presented an adaptive descriptor design method by in-corporating pixel mapping functions. However, the current approach is based onpre-defined basis functions gi’s and all images used in training and testing are pro-cessed by the same mapping functions. One possible extension is to use moregeneral basis functions or non-parametric functions to provide more degrees offreedom for the learning task. Furthermore, the pixel mapping functions could beadaptive for individual images used in both training and testing, which may im-prove the performance further. One possible way is to adopt the recent developedkernel latent SVM [89] algorithm, which provides a general framework to incorpo-98rate latent variables (the pixel mapping functions in our case) into the kernel SVM.In addition, spatial context has not been considered in the current approach, wherepixel mapping functions are applied on images indifferently everywhere. However,picture styles usually involve with spatial effects, for instance lomo-fi will increasevignetting effects on the corns. And research has already shown that saliency de-tection and human attention models could benefit object recognition. Therefore,one possible way is to extend the pixel mapping functions into spatial-pixel map-ping functions which take spatial context into consideration.99Bibliography[1] Ecml/pkdd discovery challenge, 2006. URLhttp://www.ecmlpkdd2006.org/challenge.html. → pages 43[2] M. A. ´Alvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valuedfunctions: a review. Foundations and Trends in Machine Learning, 4:195–266, 2012. → pages 51[3] A. G. an V. Vovk and V. Vapnik. Learning by transduction. In Conference onUncertainty in Artificial Intelligence (UAI), 1998. → pages 50[4] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Log euclidean metrics forfast and simple calculus on diffusiontensors. Magnetic Resonance inMedicine, 56(2):411–421, 2006. → pages 64[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with bregmandivergences. Journal of Machine Learning Research (JMLR), 6:1705–1749,2005. → pages 64[6] A. Banerjee, D. Boley, and S. Acharyya. Symmetrized bregman divergencesand metrics. In The Learning Workshop, 2009. → pages 64[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features(surf). Computer vision and image understanding, 110(3):346–359, 2008.→ pages 65, 83, 89[8] H. Bay, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features.Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.→ pages 13, 34[9] M. Belkin, P. Niyogi, and V. Sindhwani. Mannifold regularization: Ageometric framework for leanring from labeled and unlabeled examples.Journal of Machine Learning Research (JMLR), 7:2399–2434, 2006. →pages 55100[10] S. Ben-david, J. Blitzer, K. Crammer, and P. M. Sokolova. Analysis ofrepresentations for domain adaptation. In Advances in Neural InformationProcessing Systems(NIPS), 2007. → pages 45, 47, 49[11] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. InAdvances in Neural Information Processing Systems (NIPS), 1998. → pages19[12] A. Bergamo and L. Torresani. Exploiting weakly-labeled web images toimprove object classification: A domain adaptation approach. In Advancesin Neural Information Processing Systems(NIPS), 2010. → pages 43, 45, 50[13] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty. Latent dirichletallocation. Journal of Machine Learning Research (JMLR), 3:2003, 2003.→ pages 22[14] J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman. Learningbounds for domain adaptation. In Advances in Neural InformationProcessing Systems(NIPS), 2007. → pages 56[15] J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boomboxesand blenders: Domain adaptation for sentiment classification. In Conferenceof the Association for Computational Linguistics (ACL), 2007. → pages 44,49[16] L. Bo and C. Sminchisescu. Efficient match kernel between sets of featuresfor visual recognition. In Advances in Neural Information ProcessingSystems (NIPS), 2009. → pages 83[17] L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. InAdvances in Neural Information Processing Systems (NIPS), volume 7,2010. → pages 8, 81, 82, 83, 89[18] K. M. Borgwardt, A. A. Gretton, B. M. J. Rasch, H. peter Kriegel, A. B.Scho¨lkopf, and B. A. J. S. D. Integrating structured biological data by kernelmaximum mean discrepancy. In In ISMB, 2006. → pages 47, 49, 51, 55, 58,75[19] A. Bosch, A. Zisserman, and X. Munoz. Representing shape with a spatialpyramid kernel. In ACM International Conference on Image and VideoRetrieval (CIVR), 2007. → pages 13, 34101[20] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm forbound constrained optimization. SIAM Journal on Scientific and StatisticalComputing, 16(5):1190–1208, 1995. → pages 48, 61[21] C. Carmelie, E. D. Vito, and A. Toigo. Vector valued reproducing kernelhilbert spaces of integrable functions and mercer theorem. Analysis andApplications, 4(4):377–408, 2006. → pages 51, 53[22] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines.ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011. →pages 34[23] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Efficientsimilarity search for covariance matrices via the jensen-bregman logdetdivergence. In International Conference on Computer Vision (ICCV), 2011.→ pages 64[24] N. Dalal and B. Triggs. Histograms of oriented gradients for humandetection. In IEEE Conference on Computer Vision and Pattern Recognition(CVPR), volume 1, pages 886–893. IEEE, 2005. → pages 66, 83, 89[25] H. Daume´ III. Frustratingly easy domain adaptation. In Conference of theAssociation for Computational Linguistics (ACL), 2007. → pages 45, 47,49, 50, 54, 56[26] J. Davis, B. Kulis, S. Sra, and I. Dhillon. Information-theoretic metriclearning. In International Conference on Machine Learning (ICML), 2007.→ pages 19[27] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoreticmetric learning. In International Conference on Machine Learning (ICML),2007. → pages 49, 67, 68[28] L. der Maaten and G. Hinton. Visualizing data using t-sne. Journal ofMachine Learning Research (JMLR), 9:2579–2605, Sept. 2008. → pages 36[29] F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernelswith block coordinate descent. In International Conference on MachineLearning (ICML), 2011. → pages 47, 51, 52, 53, 55, 56, 57, 60[30] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM forvideo concept detection. In IEEE Conference on Computer Vision andPattern Recognition (CVPR), 2009. → pages 45, 47, 49102[31] L. Duan, I. W. Tsang, and D. Xu. Domain transfer multiple kernel learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):465–479, 2012. → pages 45, 47, 49[32] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories.IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006. → pages 4, 13, 14, 18, 34, 78[33] C. Fellbaum, editor. WordNet An Electronic Lexical Database. The MITPress, Cambridge, MA ; London, 1998. → pages 16[34] W. Fo¨rstner and B. Moonen. A metric for covariance matrices. Technicalreport, Dpt of Geodesy and Geoinformatics, Stuttgart University, 1999. →pages 64[35] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In InUncertainty in Artificial Intelligence (UAI), pages 148–155. MorganKaufmann, 1998. → pages 19[36] P. Gehler and S. Nowozin. On feature combination for multiclass objectclassification. In IEEE International Conference on Computer Vision(ICCV), pages 221–228. IEEE, 2009. → pages 93[37] P. Gehler and S. Nowozin. On feature combination for multiclass objectclassification. In International Conference on Computer Vision (ICCV),2009. → pages 13, 32, 33[38] T. Gevers and C. G. M. Snoek. Evaluating color descriptors for object andscene recognition. IEEE Transactions on Pattern Analysis and MachineIntellegence, 32(9):1582–1596, 2010. → pages 34[39] B. R. Gibson, X. Zhu, T. T. Rogers, C. W. Kalish, and J. Harrison. Humanslearn using manifolds, reluctantly. In Advances in Neural InformationProcessing Systems (NIPS), 2010. → pages 14[40] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel forunsupervised domain adaptation. In IEEE Conference on Computer Visionand Pattern Recognition (CVPR), pages 2066–2073. IEEE, 2012. → pages45, 47, 49, 50, 54, 64, 65, 66, 72, 73, 75, 82, 90[41] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for objectrecognition: An unsupervised approach. In IEEE International Conferenceon Computer Vision (ICCV), pages 999–1006. IEEE, 2011. → pages 44, 45,47, 50, 54, 66, 72, 73, 82103[42] K. Grauman and T. Darrell. The pyramid match kernel: Discriminativeclassification with sets of image features. In IEEE International Conferenceon Computer Vision (ICCV), 2005. → pages 19, 32, 33[43] M. D. Grossberg and S. K. Nayar. Modeling the space of camera responsefunctions. IEEE Transactions on Pattern Analysis and Machine Intelligence,26(10):1272–1282, 2004. → pages 82[44] Z. Guo and Z. Wang. Cross-domain object recognition via input-outputkernel analysis. IEEE transactions on image processing, 22(8):3108–3119,2013. → pages 82[45] Z. Guo and Z. J. Wang. One-shot recognition using unsupervisedattribute-learning. In Pacific-Rim Symposium on Image and VideoTechnology (PSIVT), 2010. → pages 14, 18, 38[46] T. Hofmann. Probabilistic latent semantic analysis. In Advances in NeuralInformation Processing Systems (NIPS), 1998. → pages 22[47] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Scho¨lkopf.Correcting sample selection bias by unlabeled data. In Advances in NeuralInformation Processing Systems(NIPS), 2006. → pages 45, 47[48] P. Jain, B. Kulis, J. V. Davis, and I. S. Dhillon. Metric and kernel learningusing a linear transformation. Journal of Machine Learning Research(JMLR), 13:519–547, 2012. → pages 49[49] I.-H. Jhuo, D. Liu, D. Lee, and S.-F. Chang. Robust visual domainadaptation with low-rank reconstruction. In IEEE Conference on ComputerVision and Pattern Recognition (CVPR), pages 2168–2175. IEEE, 2012. →pages 45, 47, 50, 54, 66, 67, 68, 69, 71, 90[50] Y. Ji, K. Idrissi, and A. Baskurt. Object categorization using boosing withinhierarchical bayesian model. In IEEE International Conference on ImageProcessing, 2009. → pages 17, 34, 36[51] W. Jiang, E. Zavesky, and S.-F. Chang. Cross-domain learning methods forhigh-level visual concept classification. In International Conference onImage Processing (ICIP), 2008. → pages 50[52] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. InInternational Conference on Very Large Data Bases (VLDB), 2004. →pages 49104[53] S. Kim, H. Lin, Z. Lu, S. Susstrunk, S. Lin, and M. Brown. A new in-cameraimaging model for color computer vision and its application. IEEETransactions on Pattern Analysis and Machine Intelligence, 34(12):2289 –2302, 2012. → pages 82[54] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get:Domain adaptation using asymmetric kernel transforms. In IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2011. →pages 44, 45, 47, 49, 50, 66, 67, 68, 82, 89, 90[55] C. H. Lampert, H. Nickisch, and S. Hareling. Learning to detect unseenobject classes by between-class attribute transfer. In IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2009. → pages 14, 17,18, 32, 34, 35, 37, 38, 39, 40[56] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan.Learning the kernel matrix with semidefinite programming. Journal ofMachine Learning Research (JMLR), 5:27–72, 2004. → pages 51[57] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatialpyramid matching for recognizing natural scene categories. In IEEEConference on Computer Vision and Pattern Recognition (CVPR), volume 2,pages 2169–2178. IEEE, 2006. → pages 78[58] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatialpyramid matching for recognition natural scene categories. In IEEEConference on Computer Vision and Pattern Recognition (CVPR), 2006. →pages 17, 19, 20, 29, 30[59] H. Ling and D. W. Jacobs. Deformation invariant image matching. In IEEEInternational Conference on Computer Vision (ICCV), volume 2, pages1466–1473. IEEE, 2005. → pages 82[60] D. G. Lowe. Distinctive image features from scale-invariant keypoints.International Journal of Computer Vision, 60:91–110, 2004. → pages 13,34, 66, 83, 89[61] S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernelsupport vector machines is efficient. In IEEE Conference on ComputerVision and Pattern Recognition (CVPR), 2008. → pages 37[62] C. A. Micchelli and M. A. Pontil. On learning vector-valued functions.Neural Computation, 17(1):177–204, 2005. → pages 47, 51, 53, 56, 62105[63] F. Moreno-Noguer. Deformation and illumination invariant feature pointdescriptor. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), pages 1593–1600. IEEE, 2011. → pages 82[64] M.-E. Nilsback and A. Zisserman. A visual vocabulary for flowerclassification. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), volume 2, pages 1447–1454. IEEE, 2006. → pages 78,89, 91[65] E. Nowak and F. Jurie. Learning visual similarity measures for comparingnever seen objects. In IEEE Conference on Computer Vision and PatternRecogntion (CVPR), 2007. → pages 19[66] A. Oliva and A. Torralba. Modeling the shape of the scene: A holisticrepresentation of the spatial envelope. International Journal of ComputerVision, 42(3):145–175, 2001. → pages 66[67] S. J. Pan, I. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfercomponent analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011. → pages 49[68] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensorcomputing. International Journal of Computer Vision (IJCV), 66(1):41–66,2006. → pages 64[69] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning:Transfer learning from unlabeled data. In International Conference onMachine Learning (ICML), 2007. → pages 19[70] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. Simplemkl.Journal of Machine Learning Research (JMLR), 9:2491–2521, 2008. →pages 51, 58, 59, 61, 72[71] K. Saenko, B. Kulis, M. Fritz, and T. Darell. Adapting visual categorymodels to new domains. In European Conference on Computer Vision(ECCV), 2010. → pages 44, 45, 47, 49, 50, 65, 66, 67, 68, 72, 73, 80, 82, 89,90[72] G. Schweikert, C. Widmer, B. Scho¨lkopf, and G. Ra¨tsch. An empiricalanalysis of domain adaptation algorithms for genomic sequence analysis. InAdvances in Neural Information Processing Systems(NIPS), 2008. → pages56106[73] E. Shechtman and M. Irani. Matching local self-similarities across imagesand videos. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), June 2007. → pages 13[74] E. Shechtman and M. Irani. Matching local self-similarities across imagesand videos. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), June 2007. → pages 66[75] E. Shechtman and M. Irani. Matching local self-similarities across imagesand videos. In IEEE Conference on Computer Vision and PatternRecognition (CVPR), 2007. → pages 34[76] K. Simonyan, A. Vedaldi, and A. Zisserman. Descriptor learning usingconvex optimisation. In European Conference on Computer Vision (ECCV),pages 243–256. Springer, 2012. → pages 82[77] S. Sonnenburg, G. Ra¨tsch, C. Scha¨fer, and B. Scho¨lkopf. Large scalemultiple kernel learning. Journal of Machine Learning Research (JMLR), 7:1531–1565, 2006. → pages 51, 58[78] K. D. Tang, M. F. Tappen, R. Sukthankar, and C. H. Lampert. Optimizingone-shot recognition with micro-set learning. In IEEE Conference onComputer Vision and Pattern Recognition (CVPR), 2010. → pages 14, 17,18, 32, 37, 38, 39[79] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichletprocesses. Journal of the American Statistical Association, 101(476):1566–1581, 2006. → pages 16, 22, 25[80] E. Tola, V. Lepetit, and P. Fua. Daisy: An efficient dense descriptor appliedto wide-baseline stereo. IEEE Transactions on Pattern Analysis andMachine Intelligence, 32(5):815–830, 2010. → pages 82[81] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In IEEEConference onComputer Vision and Pattern Recognition (CVPR), pages1521–1528. IEEE, 2011. → pages 3[82] V. Vapnik. The nature of statistical learning theory. springer, 2000. →pages 3[83] M. Varma and B. R. Babu. More generality in efficient multiple kernellearning. In International Conference on Machine Learning (ICML), pages1065–1072, 2009. → pages 51, 58, 88107[84] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition andpart localization with humans in the loop. In IEEE International Conferenceon Computer Vision (ICCV), pages 2524–2531. IEEE, 2011. → pages 78[85] S. Winder, G. Hua, and M. Brown. Picking the best daisy. In IEEEConference on Computer Vision and Pattern Recognition (CVPR), pages178–185. IEEE, 2009. → pages 82[86] S. A. Winder and M. Brown. Learning local image descriptors. In IEEEConference onComputer Vision and Pattern Recognition (CVPR), pages 1–8.IEEE, 2007. → pages 82[87] Y. Xiong, K. Saenko, T. Darrell, and T. Zickler. From pixels to physics:Probabilistic color de-rendering. In IEEE Conference on Computer Visionand Pattern Recognition (CVPR), pages 358–365. IEEE, 2012. → pages 82[88] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video conceptdetection using adaptive SVMs. In ACM Multimedia, 2007. → pages 44, 45,50, 68, 69, 71[89] W. Yang, Y. Wang, A. Vahdat, and G. Mori. Kernel latent SVM for visualrecognition. In Advances in Neural Information Processing Systems (NIPS),pages 818–826, 2012. → pages 98[90] X. Yu and Y. Aloimonos. Attribute-based transfer learning for objectcategorization with zero/one training example. In European Conference onComputer Vision (ECCV), 2010. → pages 14, 17, 18, 37[91] E. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, D. Turaga, and O. Verscheure.Cross domain distribution adaptation via kernel mapping. In ACM SIGKDDConference on Knowledge Discovery and Data Mining, 2009. → pages 45[92] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning usinggaussian fields and harmonic functions. In International Conference onMachine Learning (ICML), 2003. → pages 19108