Bayesian Formulations of Multiple Instance Learning with Applications to General Object Recognition by Hendrik Kiick I Dipl. Inf., Universitat Erlangen-Niirnberg, 2003 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE i n The Faculty of Graduate Studies (Department of Computer Science) We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA August 30, 2004 © Hendrik Kiick, 2004 IUBCL THE UNIVERSITY OF BRITISH COLUMBIA FACULTY OF GRADUATE STUDIES Library Authorization In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Name of Author (please print) 3 0 / Qg/ZOCfr Date (dd/mm/yyyy) Title of Thesis: %0.y) for different kernel types, Part 2 78 5.7 Positive cluster embedded in a mass of negative data points . . . 79 5.8 Adding the fraction of positive information 80 5.9 ROC for added fraction of positives information 81 LIST OF FIGURES vii 5.10 Learnt classifiers for parallel lines data 83 5.11 ROC plots for object classifiers applied to training set 86 5.12 ROC plots for object classifiers applied to training set 87 5.13 Comparison of two ROC plots with error bars 88 5.14 Polar bear classification 88 5.15 Effects of sub-sampling on classification performance 93 ACKNOWLEDGEMENTS viii ACKNOWLEDGEMENTS I would like to thank Peter Carbonetto first of all for laying the foundations for the work presented here with his research. He also left me with a large base of well documented Matlab code as well as data. In particular I am very grateful to him for performing the tedious manual annotation of the image regions in the image data set used in this work. I further would like to thank Kobus Barnard for providing the segmented Corel data set, Eric Brochu for proof reading several sections of this thesis and Paul Gustafson for taking the time to review my thesis and for providing valuable feedback. Further I would like to thank the fellow students in the LCI lab, the inhab-itants of the Kommune as well as other friends and my family for providing support and entertainment. Last but certainly not least I would like to thank Nando de Freitas for being the best supervisor imaginable. CHAPTER 1. INTRODUCTION 1 CHAPTER 1 INTRODUCTION General object recognition is one of the longstanding open problems in computer vision. Humans axe amazingly good at inferring the semantics of a scene from a two-dimensional picture. At the same time it turns out to be an exceptionally difficult problem to solve computationally. In the general object recognition problem, objects are to be identified and localized in an image based on leaxnt appearance models for classes of objects. This is in contrast to the detection and localization of specific objects, which has received a lot of attention in the computer vision community and produced some excellent research results [49]. Dramatic progress has been made recently on the problem of detecting in-stances of specific object classes in images. The class of human faces, in particu-lar, has received a lot of interest. In their seminal paper, Viola and Jones[70] de-veloped a highly efficient face detector based on the AdaBoost algorithm [28, 63]. Their approach is not restricted to faces, but could in theory be trained to rec-ognize arbitrary object classes. It does, however, require supervised training data. That is, the instances of the object class of interest have to be manu-ally segmented or cropped out in all training images. While this approach is reasonable for learning a single object class such as faces, it does not scale to general object recognition problems with a large number of object classes. It would require manual segmentation and annotation of image regions as shown in Figure 1.1(a) for a very large number of training images. Our goal is to instead learn appearance models of object classes from anno-tated images like the one shown in Figure 1.1(b). The big advantage of this type CHAPTER 1. INTRODUCTION 2 Bear Water Fish (a) Training image with annotated (b) Training image with annota-t i o n s tions Figure 1.1: The difference between supervised information per image region and per image. (Photo courtesy http://philip.greenspun.com) of training data is that it already exists in large quantities. Examples include the Corel Photo CDs, which altogether contain 80 000 annotated images, or the commercial Corbis collection, which mostly caters to journalists and contains 70 million images with very detailed and high quality annotations. A fascinat-ing alternative worth mentioning is the ESP Game project by von Ahn and Dabbish [71]. In this web-based game1, players have to label images presented to them. They score points when a label they choose for an image matches that of another player. The game mechanics thus provide both motivation for the players to provide good labels and a verification mechanism to ensure high quality labels. At the point of writing (August 2004), 3.7 million labels had already been created by the players of the game. We work from the assumption that the images are annotated with nouns referring to object classes. The presence of a certain word in an image's annota-tion then indicates the presence of an instance of the corresponding object class in the image. We will use the terms annotation word, object class and, when abstracting from the object recognition application, concept interchangeably. While annotated images are readily available, they provide significant less 1http://www.espgame.org CHAPTER 1. INTRODUCTION 3 information than images with manually segmented and labelled image regions (such as in Figure 1.1(a)). With annotated images, there is ambiguity both in how they should be segmented, and in the associations between image regions and annotations words. Several machine learning and computer vision researchers have attacked the problem of learning from annotated images. Most of this work has focused on learning the correlation between annotation words and whole images [26, 40, 46, 55, 59, 59]. The learnt models can be used for automatic annotation of new images as well as for text-based image retrieval from image databases without annotations. However, these techniques do not provide a,solution for the general object recognition problem as we defined it above, where we are interested not only in detecting the presence of objects of a certain class but also in their location in the image. That is, instead of automatically annotating images, the goal here is to annotate image regions. The additional location information is required in many practical applications such as robotics. There has been some published work on this significantly harder problem of learning the correspondences between image regions and object classes from annotated images. Since the correct segmentation of both training images and new images is a priori unknown, the first step in all these approaches is to divide the images into image regions. This segmentation is typically performed based on low-level vision cues such as colour, texture and edges in the image, using algorithms such as Normalized Cuts [64]. Some authors have instead chosen a simple rectangular grid subdivision [13, 14, 56]. Obviously, we can not expect these image segmentation techniques to provide a segmentation into semantically meaningful image regions. We can hope though, that they provide an over-segmentation of the image. After assigning object class annotations to the image regions, regions with the same annotation can then be merged to produce an approximate segmentation of the image into objects. For each image region, a vector of features is extracted, typically contain-ing summarizing information about colour and texture properties of the image region and potentially other types of information such as shape properties or CHAPTER 1. INTRODUCTION 4 position in the image. Mori et al. [56] proposed a statistical technique for learning from annotated images. While they focus on the task of automatic image annotation, their technique allows for annotation of individual image regions as well. In their approach, the image regions in the training data are clustered to produce a dis-crete dictionary of prototypical image regions. The most likely annotations for these region prototypes are then determined based on co-occurrence frequencies with annotation words in the training data set. Duygulu et al. [7, 23] suggested that finding associations between image regions and annotation words can be interpreted as a machine translation task. They employ a standard statistical machine translation model, trained with the EM algorithm, to learn a transla-tion table, which provides a mapping from prototypical image regions to words. This idea was extended by Carbonetto et al. [13, 15]. In this work, a mixture of multivariate Gaussians is used to model the joint distribution of image regions and their annotations instead of a discrete translation table. The generative process assumed by this model is that an annotated image is created by first randomly and uniformly choosing words from a dictionary. The features for each image region axe then independently sampled from the Gaussian mixture components corresponding to these words. Blei et al. [10] proposed another mix-ture model, which they coined Correspondence LDA. The assumed generative process in this model is opposite to the one in [15]. Image regions are assumed to be generated first, and each annotation word is then generated conditioned on one of these image regions (or more accurately, it is conditioned on the latent factor that gave rise to the image region). A common problem with these different mixture models is that the pos-terior distribution of the parameters tends to be highly multi-modal. As a consequence, the EM algorithm or its variational variants, which are typically used for training, are prone to get stuck in local minima and are highly depen-dent on their initialization. Markov chain Monte Carlo (MCMC) methods can also perform poorly in this mixture model scenario [17], because the number of modes in the posterior distribution of the parameters is factorial in the number CHAPTER 1. INTRODUCTION 5 of mixture components [65]. While the mentioned approaches model the joint probability distribution of image regions and annotation words, in this thesis we instead explore the use of a discriminative approach, in which we model the conditional probability dis-tribution of annotation words, given the image region features. Discriminative approaches do not waste modelling effort on modelling the observations, but focus on modelling the decision boundaries directly. As a result, they often per-form better and/or are easier to train. Evidence of this is for example provided in the work of Kumar and Hebert [45], who show that discriminative random fields can outperform their generative counterparts in contextual recognition with labelled image regions. In this thesis, we restrict ourselves to binary classification. That is, for each object class, we learn a classifier distinguishing between image regions showing objects belonging to this class from image regions showing other things. Because classifiers for different object classes are trained independently, the training cost increases only linearly with the number of object classes. Each individual classifier allows us to detect and localize objects of a certain class in a given image. In most object recognition applications, this is exactly what we want. If however a multi-categorical classifier is desired, which for each image re-gion decides between all possible object classes, then the outputs of the individ-ual binary classifiers have to be combined. While techniques have been devel-oped to facilitate such assembly of multi-class classifiers from multiple binary classifiers [1, 76], we acknowledge that this approach is not without problems. A true multi-categorical approach might be better suited for this type of applica-tion. However, multi-categorical classification is not necessarily the right thing to do when annotating the regions of a test image. It is quite possible that a test image region does not show an object belonging to any of the trained object classes. In this case it is better to assign no label than to annotate the region with the least unlikely object class. Furthermore, the image region boundaries are not likely to be perfectly aligned with object boundaries. As a consequence, one image region can show (parts of) multiple objects. It is not clear that de-CHAPTER 1. INTRODUCTION 6 Figure 1.2: Probabilistic image region classification. The image is segmented into image regions according to low level vision cues. Features de-scribing the appearance of the image regions (such as colour, texture and shape properties) are extracted and fed into the probabilistic classifier. Based on its probabilistic model and the features describ-ing the image region, the classifier computes the probability that a given image region contains a lion. ciding on a single annotation would be preferable to simply accumulating the annotations of the different binary classifiers. For these reasons, we do not ex-plicitly address the multi-categorical classification problem in this thesis, and we instead focus on learning detectors for individual object classes from annotated training images. We are interested in learning probabilistic classifiers, which for a given test image region not only output a binary decision, but a probability of class mem-bership, i.e. the probability that the image region shows an object belonging to the object class of interest. The additional confidence information is important in assessing the risk when basing decisions on the classification result. Figure 1.2 shows the overall process of probabilistic object recognition for CHAPTER 1. INTRODUCTION 7 water boat sky water rock water rock sky trees Figure 1.3: Examples of segmented images with annotations. a given object class (in this example, for the object class 'lion'). Based on the extracted appearance features, the probabilistic classifier determines the prob-ability that an image region shows a lion. This thesis is about the probabilistic model hidden inside the black box and how to train it. As mentioned above, learning a per region classifier from per image annota-tions is a difficult task because of the unknown associations between annotation words and image regions. We suggest that this learning task can be approached as a constrained semi-supervised learning problem. We demonstrate this with the example given in Figure 1.3. Suppose we are interested in being able to detect boats in images. We could assume that if the word boat does not appear in the caption, then there are no boats in the image2. In this case, we can assign the label 0 to each region in an image without the 'boat' annotation. On the other hand if an image does have this annotation, then we know that at least one of its segments corresponds to a boat. The problem is, we do not know which one. By letting Xi denote the feature vector corresponding to the z-th segment and yt denote the existence of a boat, we can write the provided label information as image 1 image 2 image 3 Input x Labels y Xl x2 x3 1 1 1 Xi x5 x6 0 0 0 X7 Xg Xg Xio 0 0 0 0 This training data is semi-supervised, since some of the labels are explicitly 2 Of course, this depends on how good the labels are, but as mentioned earlier, there are many databases with very good captions. So for now we work under this assumption. CHAPTER 1. INTRODUCTION 8 given, while others are unknown. What makes it a constrained semi-supervised problem is the additional constraint that at least one of the question marks must be a 1. Furthermore, we can impose the additional constraint that at least one of the image regions of Image 1 has to show something other than a boat, i.e. one of the labels {2/1,2/2,2/3} must be 0, since we know from the image's annotations, that it contains 'water' as well. The learning problem we are dealing with here is known as the multiple instance (MI) problem and was introduced by Diettrich et al. in [21]. In the multiple instance setting, binary labels are given for groups of data points (of-ten referred to as bags). A negative group label indicates that all data points in the group have negative individual labels. A positive group label on the other hand only provides the information, that at least one of the data points has to have a positive label. The MI problem arises in different contexts in-cluding drug activity prediction [21], text categorization [3], stock selection [51] and learning from annotated images as discussed here. Several researchers have proposed algorithms for learning from MI data. Diettrich et al. [21] proposed several algorithms that use an axis-aligned box in feature space as the classifi-cation boundary and optimize this box using simple heuristic rules. Andrews et al. [3] adapted support vector machines to the MI setting. They used a simple heuristic optimization technique to deal with the combinatorial problem arising from missing associations. In another publication [2], Andrews et al. attacked the problem using the Disjunctive Programming Boosting technique. In this formulation, the constraints on the individual labels lead to a highly non-convex optimization problem. They dealt with this by first convexifying the problem and then re-introducing the non-convexity incrementally to refine an approximate solution. Another popular approach to the MI problem is based on the Diverse Density formulation, first proposed in [51] and subsequently used in [19, 50, 52, 77]. The Diverse Density function is defined on the feature space. Its definition ensures that it attains high values in the proximity of data points belonging to positive groups and low values in the proximity of points from neg-ative groups. The algorithm then searches the feature space point maximizing CHAPTER 1. INTRODUCTION 9 this measure and simultaneously also optimizes weights for the dimensions of the feature space. The classification boundary is then an ellipsoid in feature space centred at the point found in the optimization. The assumption implicit in this approach is that the target concept (in the object recognition applica-tion, the object class) can be adequately described by a point in feature space. This is a very strong assumption which clearly does not hold in the case of ap-pearance based object recognition, where the appearance distributions of object classes generally form complex manifolds in the feature space. In addition, the mentioned approaches are optimization techniques and as such only provide a point estimate of the decision boundary. Specifically, they lack a probabilistic interpretation. Neural Networks [61] and kernel based methods [29] have also been adopted to the MI setting. However, the solutions proposed in both publications only allow for classification of new groups, not of individual instances. In this thesis, we propose a novel solution to the MI problem, which is fully probabilistic. A probability distribution over the appearance feature space is learnt from training data, allowing for probabilistic classification of individual image regions as illustrated in Figure 1.2. The probabilistic model underlying our method, which will be described in Chapter 3, is extremely flexible and allows highly complex class boundaries to be captured. Usually, such flexibility comes at the price of overfitting. In our approach, this issue is addressed in a principled Bayesian way by placing robust hierarchical priors on the model parameters. We follow the Bayesian learning approach of training the model by computing the posterior probability distribution of its parameters. Because this distribution can not be computed analytically, we approximate it using a Markov Chain Monte Carlo (MCMC) based sampling technique, described in Chapter 4. The Bayesian learning approach allows us to account for the uncertainty in both the model parameters and the unknown labels in a principled manner. The probabilistic model and training technique used in this thesis are based on the work by Tham et al. [66, 67], who proposed a fully Bayesian approach for supervised classification and regression. We extend their proposed model and CHAPTER 1. INTRODUCTION 10 training technique to the MI setting where the observations are group labels instead of individual labels. We further address the problem of a negative bias in the classifiers. One cause for biased classifiers is inherent in the MI problem formulation itself and lies in the asymmetry between the amount of information provided by positive and negative group labels. A negative group label determines all data points in the group to have a negative label, whereas a positive group label only provides a limited amount of information by imposing relative weak constraints on the labels of the individual data points in the group. The core of the problem is that the group label only informs about the existence of positive individual labels, but not about the number of positives in the group. In the application of learning from annotated images it is however possible to make an educated guess at this number of positives, based on the number of image regions and the number of annotation words in an image. In Sections 3.3.4 and 3.3.5, we present extensions to our probabilistic model, which allow us to learn classifiers from such estimated number of positives instead of or in addition to the MI information. The second cause for bias in the specific object class recognition learning task lies in the imbalanced training data distribution. For a specific object class, there typically will be far fewer images with objects of this class than images without. To address this issue, we investigate methods for re-balancing the training data and analyze how they affect classification performance in Sec-tion 4.4 This thesis is organized as follows. Chapter 2 provides the theoretic back-ground for the following chapters. It introduces the foundations of Bayesian modelling and learning for classification. The use of a sampling approach is motivated and Markov Chain Monte Carlo sampling techniques are introduced. Chapter 3 presents our probabilistic models for learning from different types of training data. In Chapter 4, we discuss the details of the Markov Chain Monte Carlo sampler employed in our Bayesian learning approach. In Sec-tion 4.2, a linear algebra trick for speeding up the MCMC sampler is discussed. CHAPTER 1. INTRODUCTION 11 Section 4.3 provides some information about our implementation of the overall system. Methods for re-balancing the training data are described in Section 4.4. In Chapter 5, we present results on both synthetic data and on learning object class classifiers from a database of annotated images. Finally, we conclude and present ideas for future work. Previous publications Part of the research described in this thesis has been previously presented at the European Conference on Computer Vision 2004 in Prague and has been published in [44]. CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 12 CHAPTER 2 BAYESIAN LEARNING AND CLASSIFICATION 2.1 Probabilistic Classification The goal in classification problems in general is to predict the class label of some objects based on some observed features. One example is the task of predict-ing the topical category of a text based on the frequency of word occurrences. Another one is the application mentioned in the introduction, the classification of image regions based on colour and texture features. The class labels in this case consist of the type of the object shown in the image region. In the work presented here, we restrict ourselves to the problem of binary classification. That is, we want to predict whether an object belongs to a certain class of interest or not. In the object recognition setting, we might for example be interested in a tiger-classifier, which, given the features describing an image region, decides whether the region likely shows a tiger or not. Many different techniques for binary classification have been developed, amongst them the very successful and popular Support Vector Machine (SVM) [69]. In contrast to the S V M , which simply returns a binary decision, we are interested in soft or probabilistic classification. A probabilistic classifier does not simply return a yes/no decision but the probability Pr(y = l | x ) , that the object described by features x belongs to the class of interest (indicated by the binary variable y). This answer is much more desirable than a binary decision as it pro-vides additional information about the (un)certainty of the prediction. Turning CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 13 a probabilistic classifier into a hard classifier is achieved by simple thresholding. The natural threshold is 0.5. If the probability of the image region showing a tiger is larger than 50% then the hard classifier would classify it as a 'tiger' re-gion. However, this threshold is not optimal, when the costs for misclassification are asymmetric. This is often the case when the prediction is to be used within a decision making process. To stick with the tiger example, let us assume we have a probabilistic classifier that distinguishes between a tiger and a kitten. And based on the output of the classifier we want to choose the actions back off or approach and pet. In this example, the cost of erring are highly asymmetric. Walking away from the kitten does not incur any cost at all, but petting the tiger could be a very bad idea. In settings like this, a cost-sensitive decision threshold has to be chosen [24]. 2 . 2 Bayesian Learning Our goal is being able to make predictions for new objects (new features x with unknown label y) based on the information provided by given training data. That is, we want to generalize the relationship between features and labels from the examples in the training data to new data. The conditional probability of interest is Pr(y = l|x, T>), where V is the training data and x and y are the feature description and the label of the new data point. In a supervised setting, T> would consist of features and labels for a number of training data points. In the problem we are dealing with, labels for individual data points are generally not available. Instead, we are given the features for the data points and some information about groups of labels. We will discuss the exact nature of our training data and ways to learn from it in detail in Section 3.3. In order to learn the probability distribution Pr(y = l|x,T>), we adopt a parametrized probabilistic model Pr(y = l|x,6), where 6 represents the set of model parameters. One common approach to learning using such a parametrized model is to follow the Maximum Likelihood (ML) approach and find the param-CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 14 eters 9 ML that best predict the given training data. That is 9ML = argmaxp(Z>|0) 9 Predictions are then made on new data using the distribution Pr(y = l | x , 9ML)-One major problem with the M L approach is that of overfitting. If the model is flexible enough, then the model might yield perfect predictions for the training data when using BML- But it might not generalize well at all to new data. A possible solution to this problem is to regularize the problem by penalizing the norm of the parameter vector 6 as in the Ridge Regression approach [38]. The more principled way of dealing with this overfitting problem is to follow the Bayesian approach and specify prior probability distributions for all parameters. In specifying this prior distribution p{9), it is possible to incorporate prior knowledge about the expected/desired values of the parameters. But even if no informative prior is available, placing a vague prior on the parameters still regularizes the problem and helps to avoid the overfitting issue. In the Bayesian learning process, the prior distribution is updated accord-ing to the evidence presented by the training data. The resulting posterior distribution can be computed using Bayes rule likelihood prior posterior /-^>^v evidence A possible approach is then to find the parameter vector, which optimizes the posterior. This is the so called Maximum a Posteriori (MAP) approach. While it deals with the problem of overfitting in a principled way, this approach is still working from the assumption that there is one 'correct' 6true- This is a fair assumption if the posterior probability distribution p(6\V) is in fact very peaked around a single point. However, quite often this is not the case and the posterior is instead diluted and/or multi-modal. In these cases, simply picking the 9MAP maximizing the posterior as the 'true' 9 is rather arbitrary and discards valuable information. CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 15 The true Bayesian solution is to use marginalization. Instead of choosing a single parameter, the uncertainty in the parameters is accounted for by inte-grating over all possible values While this integral is in general mathematically well defined, in pretty much all cases of interest it is intractable. Lacking an analytic solution, one has to revert to approximation techniques. One very common choice is to use a sampling technique, which will be discussed in the next section. 2.3 Markov Chain Monte Carlo Sampling The integral in Equation (2.2) can be approximated using Monte Carlo simula-tion. To this end, samples OI.T are drawn from the posterior distribution of the parameters These samples can then be used to obtain the following approximation This approximation converges to the true solution by the Strong Law of Large Numbers. The problem now is how to generate the required samples from the posterior. Usually, the posterior distribution will not be in a form that can be easily sampled from directly. What is more, is that most of the time we can not even evaluate the posterior distribution for a given parameter choice 6. This is because the evidence term in Equation (2.1) often involves an intractable integral. However, the likeli-hood and prior distributions generally can be evaluated exactly. And because Pr(y = l | x , V) = / Pr(y = l\x,0)p(0\V) d9. (2.2) et~P(e\v). (2.3) CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 16 the evidence term in Equation (2.1) is nothing but a normalizing constant, it is possible to evaluate p(9\T>) up to a normalizing constant. This in theory al-lows using rejection sampling or importance sampling [4, 34, 47, 62] to generate samples from the posterior. In importance sampling, samples are generated by a proposal distribution and then weighted by the ratio between the actual dis-tribution of interest (here p(0\V) ) and the proposal distribution. The problem with this approach in practice is that, for efficiency, the proposal distribution has to be a good approximation of the distribution of interest. Constructing good proposal distributions however can be very hard, especially in high dimen-sions. Since the parameters of our probabilistic model (described in Chapter 3) live in a very high dimensional space, importance sampling is impractical. Instead, we will construct a Markov Chain Monte Carlo (MCMC) sampler. This technique generates samples sequentially. Based on the previous sample Bt (and in the MCMC case only the immediate previous one, to satisfy the Markov condition) a new sample location 6t+1 is chosen according to a transition proba-bility T{9t, 9t+1). The trick is to design this probabilistic transition rule in such a way, that the stationary distribution of the resulting Markov chain is exactly the probability distribution of interest (in our case the posterior p(9\V)). After the chain converges, the samples generated by the chain are then distributed exactly according to the distribution of interest. One important difference to non-sequential sampling techniques (such as rejection or importance sampling) is however, that the samples are not i.i.d. but correlated. In our application, where we use the samples to approximate the integral in Equation (2.3), this correlation amongst the samples is not in itself problematic. The theory on Markov Chain Monte Carlo samplers guarantees convergence to the stationary distribution. However, only asymptotic convergence is guaran-teed. That is, an infinite number of simulation steps can be necessary to reach the stationary distribution. In practice, the simulation is run until the chain can be assumed to have converged for practical purposes. - This initial simulation is referred to as the burn in period. After this burn in, the chain is assumed to be converged and subsequent samples are then used as samples from the station-CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 17 ary distribution. However, estimating after how many steps the chain is close enough to convergence turns out to be a very hard problem. Several techniques have been developed for monitoring the convergence of a chain. Some use a comparison of several chains starting from different initializations. Others just monitor a single chain and look at the convergences of averages or other criteria. A good overview over existing techniques can be found in [36, 62]. Another related issue is the question how efficient the chain explores the stationary distribution. If the individual samples are highly correlated, then it might take a very large number of samples to get a good coverage of the probability distribution. The number of samples required to achieve a good approximation of the distribution thus depends on the autocorrelation of the samples. Since there exists no black box technique for automatically choosing the appropriate length of the burn in period and number of samples, usually some of the techniques described in [36, 62] are used to monitor the convergence properties of the chain. The length of the burn in period and the number of samples to gather are then chosen based on this information and the desired accuracy and/or available computing time. 2.3.1 Metropolis-Hastings sampler The Metropolis Hastings sampler [37] is a Markov Chain Monte Carlo technique for generating samples according to some probability density function p(9), which can be evaluated up to an unknown normalizing constant. The basic idea is similar to rejection sampling. At each step, a new sample is proposed using a proposal distribution q, which is easy to sample from. Using a certain criterion it is then decided whether the sample is accepted or not. There are however fundamental differences. The first one is that in rejection sampling one keeps proposing and discarding samples until a sample is accepted. In the Metropolis Hastings sampler on the other hand, there is no discarding of samples. If the proposed sample is rejected, then the Markov chain stays at the CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 18 previous location. That is, it generates the exact same sample as in the previous step. The other difference is that the Metropolis Hastings approach is sequential, while rejection sampling is not. This means that the proposal distribution q in the Metropolis Hastings sampler can (but does not have to) depend on the location of the previous sample. This allows the Markov Chain to efficiently explore high probability regions of p(0). Furthermore, as already pointed out in the previous section, samples from the Metropolis Hastings sampler will be correlated, whereas samples from a rejection sampler are i.i.d. Let 0* denote the sample generated at the previous time step. Further, let q{0l,0') be the probability of proposing a move from the current sample 0l to the new location 0'. The Metropolis Hastings algorithm then chooses a new sample 0t+1 using the update rules in Table 2.1. r 1. Sample 0'~ q{0\ 0') 2. Evaluate acceptance probability A-mm{1>P(0t)q(et,0>)) ( 2 - 4 ) 3. Draw a sample u from the uniform distribution U(0,1) 4. Choose the next sample 0t+1 using: et+i = I °' U \ 0* u>A Table 2.1: Metropolis Hastings Algorithm The choice of a good proposal distribution q is important for the perfor-mance of the Metropolis Hastings sampler. However, it is generally much easier to construct and less performance critical than the global proposal distribution CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 19 required in rejection or importance sampling. Still, designing a proposal distri-bution that leads to a Markov chain that is mixing well (i.e. efficiently explores the probability distribution) can be challenging especially in high dimensions. 2.3.2 Gibbs sampler The Gibbs sampler [31] gets around the need for a user specified proposal dis-tribution. Instead, it constructs a Markov chain from moves according to the full conditional probabilities along individual dimensions of the parameter space. Assume our parameter vector 0 consists of M components 6 = {6\, O2, • • •, 6M}-Then in each such move one is sampled from the full conditional probability p(9i\8-i) where consists of all components except i. Typically, for one sam-ple from the joint distribution p(9), each of the M components is updated once in turn according to its conditional distribution: For i = 1 , . . . , M : . Sample ~ , • • •, 0£\, 6\+l,..., 6M) V ' While this is how the Gibbs sampler is typically used and implemented, it should be mentioned that the theory behind the Gibbs sampler actually allows for more flexibility. Each move according to one of the full conditionals leaves the overall distribution invariant. In fact, each such update can actually be interpreted as a Metropolis-Hastings step. By using the full conditional distri-bution as proposal q, the acceptance probability A in the Metropolis-Hastings update (Table 2.1) is guaranteed to be 1 [4]. This means, that it is not actually necessary to perform the conditional updates in a fixed order. Furthermore, a full round of sampling from the conditionals is not required to produce one new sample 8t+1. It would be perfectly valid to generate a sample 0t+1 after each conditional update. Similarly, it is acceptable to update some of the compo-nents more frequently than others. The only condition to ensure convergence is that all components get updated infinitely often. CHAPTER 2. BAYESIAN LEARNING AND CLASSIFICATION 20 The convergence properties of the Gibbs sampler depend on how correlated the individual variables in 0 are. If some components are strongly correlated, a sampler with much better mixing properties can be constructed by sampling the correlated variables jointly as a group [48]. Of course this is only possible if the corresponding conditional probability can be analytically computed and sampled from. A sampler in which such groups of variables are sampled together is called a blocked Gibbs sampler. 2.3.3 M e t r o p o l i s - H a s t i n g s w i t h i n G i b b s In order to be able to use the Gibbs sampling scheme as described above, it is obviously required, that the conditional probabilities p(6i\0-i) are known and can be directly sampled from. Often however, (some of) these conditional probabilities are themselves only known up to a normalizing constant. But as was already pointed out in the previous section, each conditional up-date in the Gibbs sampling scheme is nothing but a special case of a Metropolis Hastings update. Instead of using the full conditional as proposal distribution (with resulting acceptance probability 1), we can just as well use another pro-posal distribution q which only changes a single component $i. This move is then accepted according to the Metropolis Hastings acceptance ratio, which in this case is Thus the acceptance ratio is exactly the same as for the univariate Metropolis Hastings sampler for sampling 9i according to the full conditional p{9i\0-i)-This type of Gibbs sampler, which uses Metropolis-Hastings steps for some of the conditional probabilities is known as a Metropolis-Hastings within Gibbs sampler[47, 57]. CHAPTER 3. PROBABILISTIC MODEL 21 CHAPTER 3 PROBABILISTIC MODEL 3.1 T h e b a s i c m o d e l As mentioned in Section 2 .2 , we are using a parametrized model for the proba-bility distribution Pr(y = l|x, V), the probability that a given data point with features x belongs to the concept of interest. The representation we chose is based on a parametrized real valued function / with parameters 9 = {/3,7}. Both the function and its parameters will be discussed further down. The output of this function is then mapped to the probability of concept membership as follows is the cumulative distribution function (cdf) of the univariate Normal distribu-tion 7V(0,1). It provides a continuous and monotonic mapping of the output of the parametrized function / from M. to the range [0,1] , thus producing a valid probability. The function $ in this context is called the probit link. Often, the logistic link function is used instead to provide a similar soft clamping to the range [0,1]. However the probit link is an equally valid choice and we will ex-ploit its connection to the standard univariate normal distribution in Chapter 4 to derive an efficient sampler. Following Tarn, Doucet and Kotagiri[67], the unknown function is repre-sented with a sparse kernel machine with radial basis functions centred at the Pi(y = l|x,0) = Pr(y = l|x,/3, 7) = $(/(x,/3,7)), (3.1) where CHAPTER 3. PROBABILISTIC MODEL 22 data points {xi,. . . , X J V } of the given training data: N /(x,/3, 7) = ^ 7 i A ^ ( x , x i ) . (3.2) i=l Here .K" is a kernel function, /3 € U.N is a TV-dimensional vector of kernel weights and 7 € {0,1}N is a ./V-dimensional kernel selection vector. For a data point with index i in the training data set, ji controls whether the kernel basis func-tion located at this data point is active, and pi controls the scale of this basis function. If a kernel function is inactive, then the associated kernel weight is 0: 7i = 0 & = 0. In this representation, the sparsity of the model is explicitly introduced by the kernel selection vector 7. This is in contrast to models which achieve sparsity by directly penalizing the norm of the weight vector (3 or by placing sparsity inducing priors on the kernel weights 0 [27, 68]. With these approaches, many of the weights will go to 0 during training, resulting in a sparse model that can be used for efficient classification. The big advantage of making the kernel selection explicit using the 7 vector is computational efficiency during training. This increased efficiency is due to the fact that during training only a small fraction of the kernels are active at each point, and all other ones can be completely ignored. In models based on penalizing (3 on the other hand, computation has to be performed for all kernels and the associated weights. This becomes infeasible both in terms of computation as well as storage when the size of the training data set gets large. The kernel function K in the model can be any radial basis function of the form Jf(x,x i ) = ^ ( | | x i - x | | ) , where | | X J — x|| measures the distance between Xj and x. Typically, the Eu-clidean distance in feature space is used, but other distance metrics could be used instead. is an arbitrary function (preferably smooth and often mono-tonic) defined on R + . In contrast to the SVM classification technique [69], the CHAPTER 3. PROBABILISTIC MODEL 23 radial basis functions are not required to be Mercer kernels. That is, the kernel function in our setting does not necessarily have to correspond to a dot product in some high dimensional space. Indefinite kernels are therefore a valid choice in our model, whereas they can not be used in the support vector machine framework. Possible choices for the radial basis function K are: Gaussian: K(x, Xj Linear: K(x, Xi Cubic: K(x, Xi Sigmoidal: K(x, Xi Multiquadric: K(x, Xi Cauchy: K(x, Xi Thin plate: K(x, Xi l l ^ l l 3 t a n h ( U ^ f ) v1 + l l ^ l l 2 (i + I I ^ H T 1 II^II 2 MII^II) The best kernel type and the optimal kernel width (controlled by the parameter r) depend on how the data is distributed. One common approach is to use cross-validation to find the best kernel/kernel-width combination. While this approach is not without problems in itself, it is also not workable in our setting, where supervised data is generally not available. While we acknowledge these problems, we work from the assumption that both the type of kernel and the kernel width are appropriately chosen by the user. While this is admittedly a strong assumption, it is extremely common in the research on kernel based methods. During training, the N radial basis functions will be evaluated only at the N training data points { x x , . . . , X J V } . Thus all necessary kernel evaluations can be precomputed into the N x N kernel matrix A T x i . X i ) i f (x i ,x 2 ) K(x2,x1) K(x2,x2) K(xN,Xi) K(xN,x2) K(xi,xN) K(x2,xN) K(xN,xN) CHAPTER 3. PROBABILISTIC MODEL 24 Since radial basis functions are symmetric by definition, this kernel matrix ^ (called the Gram matrix) is symmetric as well. Equation (3.2), when evaluated at a training data point X j can now be rewritten using matrix notation. In the case where all N kernels are active (i.e. 7 = 1 ) , we get N i=i where denotes the j-th row of the Gram matrix. The matrix notation for the general case, where only a subset of the kernels are active, is /(x j ,/3, 7) = ^ j / 3 7 ) (3.3) where is the N x K matrix (K = 2i7») consisting of the columns i of \P where 74 = 1. then is the j-th row of this matrix. /3T £ RK is the reduced version of 0, only containing the K coefficients for the active kernels. In the following we will never actually use the full vector /3 but only the reduced version /3-y. For notational simplicity, we will therefore drop the subscript and from here on use (3 to mean 3.2 Specifying Bayesian Parameter Priors In order to incorporate prior beliefs and to regularize the learning problem (as discussed in Section 2.2 ), we introduce priors on our parameters. We assume that each kernel a priori has the probability r of being active. That is the prior p(7i|r) for a single component 7* of 7 is a Bernoulli distribution. The prior for the full vector 7 is thus given by P(7|r) = n^i(l-T)1-^. i=l In general however, the user will not a priori have good information about how many kernels should be active, because the optimal degree of sparsity is highly dependent on the data. Thus, instead of having the user choose a fixed r a priori, we opt for a hierarchical Bayesian model [8]. That is, we deal with the CHAPTER 3. PROBABILISTIC MODEL 25 uncertainty of this parameter in the Bayesian way and assign another prior (a so called hyper-prior) p(r) to it. This way, the value of r is allowed to adapt to the data. At the same time we can bias it by specifying the prior p(r) according to our prior belief as to what the value of r should be. While Tam, Doucet and Kotagiri [67] use a completely uninformative uniform prior, we instead choose to put a conjugate Beta-prior on T which allows the user to exert as much control as desired over the percentage of active kernels ^'^'TWrWf'' 111-^- <3' 4) For the choice a = b = 1, we get the uninformative uniform distribution. For a > 1, b > 1 the mode of the distribution is located at . By scaling a and b, the peakedness of the distribution can be controlled. Therefore the user can choose a and b according to his belief as to what T should be (by choosing the mode of the distribution) and according to the amount of (un)certainty of this belief (by controlling the peakedness of the distribution). We obtain the prior on the binary vector 7 by integrating out r < \ I t \ \t\A nK + a)T(N-K + b) P(T) = j Ph\r)p(r)dT = - i — ^ + q + '-, (3.5) where K = ^ 7i is the number of active kernels, i.e. the number of non zero elements in 7. A prior is also placed on the parameter vector /3 € RK. A typical prior on the kernel weights is a zero mean multivariate Normal distribution Af(0,52T,), where E is the covariance matrix and 62 a scaling factor. There exist several different common choices for E. One is to use a diagonal covariance matrix (for example the identity matrix), assuming a priori independence of the individual weights. This choice of prior, known as the weight decay prior regularizes the solution for the posterior of beta by penalizing \\(3\\ [9]. It avoids overfitting and also helps to avoid singularities in the computation of the posterior distribution of /3 as we will see later. We experimented with the weight decay prior, but found in our experiments that we got overall better results using a different, non-diagonal prior. This prior, also used in [67] is the maximum entropy g-prior. CHAPTER 3. PROBABILISTIC MODEL 26 The covariance matrix of this prior is given by The covariance structure is thus adapted to the data distribution. The prior distribution for /3 is then p(/3|,52)=^(0,<52(^$T)-1) (3.6) A big advantage of this prior is that it combines nicely with the likelihood function as we will see later. Several different schemes for choosing the regularization parameter 82 have been proposed in the literature [20, 42]. We follow Andrieu and Doucet [5] in taking the Bayesian approach and assign it another hyper-prior. An inverse gamma prior is chosen for its conjugacy to the Normal distribution: This prior has two parameters /z and v that have to be specified by the user. One could argue that this is worse in terms of required parameter fiddling than the single parameter 52. However, the parameters of this hyper-prior have a much less direct influence than S2 itself and are therefore less critical for the performance of the algorithm [8]. Assigning small values to these parameters results in an uninformative prior and allows 62 to adapt to the data. The graphical model in Figure 3.1 shows the dependencies between the vari-ables in the probabilistic model as presented up to here. One major problem when it comes to sampling from the posterior of the parameters (as motivated in Section 2.3) is the discrete nature of the label y. Because of this, the pos-terior conditional probability of the high dimensional /3 can not be computed analytically, as the likelihood and prior do not combine. This means that we would have to revert to a Gibbs sampler and draw samples from the conditional distribution of one component of f3 at a time. Since /3 can be high dimensional and its components tend to be highly correlated, this would result in a very inef-ficient sampler. Instead, we employ the data augmentation trick first introduced CHAPTER 3. PROBABILISTIC MODEL 27 Figure 3.1: The graphical model with hierarchical hyper-priors. In this and fol-lowing diagrams, dark shaded circles represent variables that are observed in the training data, whereas unshaded circles represent unobserved variables. The light grey shaded boxes stand for user denned parameters. The arrows visualize the probabilistic depen-dencies in the model. Figure 3.2: The augmented graphical model. The dashed arrow indicates a deterministic relationship while the regular arrows symbolize prob-abilistic dependencies. by Nobel laureate Daniel McFadden [54], which in the context of this specific model was also used in [67]. The probabilistic model shown in Figure 3.1 is augmented by introducing the continuous latent variable z € R. On a intuitive level, this variable can be seen as a continuous version of the binary label y. Figure 3.2 the resulting augmented graphical model. The dashed arrow from z to y in the figure symbolizes that y follows deter-CHAPTER 3. PROBABILISTIC MODEL 28 ministically from z. This deterministic relationship is given by: f 1 if z > 0, V = 0„O) = < (3-8) I 0 otherwise. Equivalently, the probability distribution p(j/|z) is defined as V{y\z) = • / (x , i 8 , 7 ) X e 2 = * ( / ( * , 0,7)) The fact that this works out so nicely is due to the choice of the probit link in Equation (3.1). This choice leads to the conditional normal distribution of the augmentation variable z. As a consequence, it is easy to sample z conditioned on the other variables. More importantly though, given z, the conditional dis-tribution of the high dimensional variable /3 can now be evaluated analytically and sampled from directly (as will be described in Chapter 4) because where both the likelihood p(z\j3,x,f) and the prior p((3\62) are Normal dis-tributions. Since the Normal is auto-conjugate, the posterior p(/3|z,x, 7,<52) is again a Normal distribution, which can be analytically computed and directly sampled from. This greatly helps in the construction of an efficient sampler. The joint probability distribution of the model parameters 6 = {f3,f, S2}, the label y and the augmentation variable z for the probabilistic model, as shown in Figure 3.2, factors as follows: p(/3|z,x,7,<52 )cxp(z\f3,x,~,)p((3\62), p{y, 0, z\x) = p(y, /3,~f, 62, z\x) = p(y\z)p(z\x, /3, ~i)p{p\82)p{52)p{f) CHAPTER 3. PROBABILISTIC MODEL 30 3 . 3 Learning from Label Meta-Information 3.3.1 The Supervised Learning Case As discussed in Section 2.2, learning in the Bayesian framework involves comput-ing the posterior distribution p(9\V) (or rather sampling from it as motivated in Section 2.3), where V is the given training data and 9 represents the full set of model parameters. In the fully supervised setting, which Tham, Doucet and Kotagiri deal with in [67], the data consists of N i.i.d. input-output pairs with features X = [xi,. . . , X J V ] and corresponding binary labels Y = [yi,... ,VN]-By applying Bayes Rule (Equation (2.1)), the posterior distribution for this type of training data and the probabilistic model shown in Figure 3.2 turns out to be where Z = [z\,..., ZN] is a vector of N independent augmentation variables, one for each training example. The integration over the augmentation variables Z is easy to realize within a sampling framework. Samples are simply drawn from the joint posterior p(9, Z\V). Integrating out Z then only involves ignoring the Z components of the generated samples. The joint posterior is given by p(0|P)=p(0|X,Y) cxp(Y|0,X)p(0) p(0,Z|D)=p(0,Z|X,Y) cxp(Y|Z)p(0,Z|X) = p(Y|Z)p(Z|X, 0,7)p(/3|<52)p(<*2)p(7) (3.11) CHAPTER 3. PROBABILISTIC MODEL 31 The prior distributions for /3, S2 and 7 are specified in Equations (3.6), (3.7) and (3.5). Note that the terms in parentheses in Equation (3.11) are proportional to the full conditional distribution of Z: N N p(Z|Y,X,/3,7) = Y[p(zi\yi,Xi,P,7) oc Y[p(yi\zi)p(zi\xi,P,-y) i=l i=l It follows from the definitions of p(yi\zi) (Equation (3.9)) and p(zi\xi,P,'y) (Equation (3.10)) that p(zi\yt, X j , P, 7) is a truncated Normal distribution, where the value of the binary label yi determines the direction of truncation: p{zi\yi,x.i,P,~f) oc { \I(o,oo)Af(VTiP~„l) if 2/i = 1, Here we used the matrix notation introduced in Equation (3.3). (3.12) 3.3.2 Our Training Data In the specific problem we are dealing with in this work however, we do not have access to exact label information for the training set examples. Instead, we are given the features X for the examples along with some amount of meta information about the labels. As already discussed in the introduction, this information does not relate to individual labels but to groups of labels. The nature of this information depends on the application as well as our additional assumptions. The types of information that we consider are • hard constraints on the number of positives in a group, • a guess at the number of positives for a group (along with a confidence rating for this guess), • a combination of the two. In the following sections these three cases will be discussed and appropriate extensions of the probabilistic model in Figure 3.2 will be presented, which incorporate the group-wise label meta-information. CHAPTER 3. PROBABILISTIC MODEL 32 3.3.3 Constrained Semi-Supervised Learning Model There are several applications, in which for a group of training examples the only information we are given is whether the corresponding group of labels contains • no positives, • both positives and negatives, • only positives. This is (similar to) the multiple instance learning problem already discussed in the introduction. We encode the group-wise information in a new variable C, which takes one of 3 discrete values, which we will refer to using the symbols © (all positive), ® (mixed) and 0 (all negative). In the © and © cases, the labels of the data points are actually explicitly given by the constraint on the group of labels. However, for a group with C = © we only know that at least one data point in the group must have a positive label and furthermore at least one has to have a negative label. Any combination of labels that fulfil these constraints is valid. If all groups have a © or 0 label, then we actually have a fully supervised learning problem as in the previous section. However, if some of the groups have a ® label, then we are dealing with semi-supervised learning, as only some (if any) labels are explicitly given. It should be mentioned that in the literature often a slightly different defini-tion of the multiple instance problem is used, in which only a binary indicator variable is available per group of labels [21]. This variable indicates whether there are any positive labels in the group or not. Thus the ® and © cases in our formulation are combined into one. It depends on the application which of the two formulations is more appropriate. In scenarios such as the drug discovery problem [21], which is one of the classical multiple instance learning examples, experiments on groups of instances do actually have a binary outcome. However in the case of our motivating example, the task of learning object recognition models from annotated images, groups with a © can actually arise. If an image CHAPTER 3. PROBABILISTIC MODEL 33 has only a single annotation and this annotation corresponds to the object/-concept of interest (for example 'sky'), then we could potentially assume that all image regions actually are positives (i.e. show the sky). Whether this is a reasonable assumption to make or not depends on the quality and thoroughness of the image annotations. If we denote the set of indices in a given group by g, the corresponding subset of labels by Y G and the number of data points in the group by n g , then we can write the deterministic relationship between a group of labels Y G and the group label C as e i f E i € g Vi c = , ® else. (3.15) Note that by computing Cj directly from ZSj, we implicitly integrated out the variables Y, s r P(Cj\Zej)= £ P ^ I Y ^ ^ Y g J Z g J Ygie{o>i}"g Now lets take a look at the posterior conditional distribution of the Z g i in a group for the three possible group labels ©, © and ®. It follows from Equa-tion (3.15) that for the labels 0 and ©, the individual augmentation variables in the group are actually independent P(Zg,IQ = e , X R , j 9 , 7 ) ocp(Cj = 0 |Z 6 i ) J] p(2i |xi,/9,7) o c I ( i : 4 6 , i ( o . » , ( ^ ) = o ) ( z » ) II Mxi'fr-r) (3.16) = ^ W c - - ^ * ^ " . ) ^ II P(^|xi,/3,7) = I(-oo,0](2i)p(Zt|Xi,/3,7) Analogous we get for the Cj = © case p(Z g.|C,- = © , X g . , /3 , 7 ) «p(C,- = ©|Z g . ) JJ p(zi|xi,j9,7) (3.17) oc n I(o,oo)te)p(zt|xi,0,7) Thus for these two cases, the individual augmentation variables z in a group are independently distributed as univariate truncated Gaussian distributions exactly as in the supervised case (see Equation (3.12)). This intuitively makes CHAPTER 3. PROBABILISTIC MODEL 37 sense, because as mentioned before the 0 and © labels for a group completely determine the binary labels of the data points in the group, thus providing supervised label information. The novel and more interesting case is Cj = © v{^%i\Cj = ® ,Xg , , /3 , 7 ) ccp(Cj = ®|Z g , ) JJ ptelxi .jS .T) (3.18) « M Z f t ) n^i|xi,/3,7), where £ denotes the region of IR™8 without the positive and negative orthants. That is, all vectors in the region £ have at least one negative and at least one positive component. The indicator function If applied to the augmentation variables for a group j is then 1 i f 0 < £ i G g , I(o,oo)(*i) d X. A itself is trivially computed from the labels Y g (or equivalently the corresponding augmentation variables Z g ) and CHAPTER 3. PROBABILISTIC MODEL 40 Figure 3.6: Model with guessed number of positives per group the number of data points in the group n g : g i £ g The measurement distribution p(m|A,x) is then chosen to be (3.19) (3.20) p(m\X,x) = B(XX + l, x ( l -A) + l) ocm* A ( l -m)* ( 1 - A ) . This choice guarantees that the mode of the distribution p(m|A, x) is located at A. The mean /j,m and variance of this distribution are given by _ XA + 1 x + 2 2 (XA + l)(x(l-A) + l) (x + 2)2(x + 3) The joint posterior distribution of the model parameters and augmentation CHAPTER 3. PROBABILISTIC MODEL 41 variables for the model in Figure 3.6 is then P(e,z\v)=p(o,z\x,M) cxp(M\Z)p(0,Z\X) = p(A*|Z)p(Z|X, /3, -Y)p(f3\S2)p(62)p(y) 3=1 i€gj with p K | Z g . ) = p(m|A,-,X) = S(xAj + 1, x ( l - A,0 + 1), where \j is computed from Z g i using Equation (3.19). As in the previous section, the conditional distributions p(ZSj \rrij, X g ; )., /3,7) for the different groups of training data points are independent and in the case of this model have the form This posterior distribution is a ng-dimensional multivariate Gaussian, which is scaled by different constants in different orthants of the space. The orthants which correspond to about the guessed fraction of positives nij will have larger scaling factors than those that do not. Figure 3.7 shows an example for a group of 2 data-points. It is worth pointing out that if the guessed positive fraction m is 0 or 1, then it follows from the properties of the Beta distribution and Equation (3.20) that A = m. For these cases we therefore get exactly the same posterior distributions (independent univariate truncated Gaussians) as for the 0 and © group labels in the Section 3.3.3. p ( Z g . | m j , X g . , / 3 , 7 ) ccp(mj\ZS:j) JJ p ( ^ | X i , / 3 , 7 ) (3.21) = B(XXj + 1, X ( l - A,-) + 1) J ! P(* l*i . /3.7) . CHAPTER 3. PROBABILISTIC MODEL 42 (a) Prior p(Zgj |X g.,0, 7) (b) Posterior p(Zg.\mj,Xgj,0,~f) Figure 3.7: Prior and posterior distribution for the augmentation variables in a group with only 2 elements and a guessed fraction of positives m3. 3.3 .5 Hybrid Model While with the model described in the previous section, we can enforce all labels in a group to be positive or negative by choosing the guessed fraction of positives m to be 0 or 1, the constraints on a mixed group are not enforced. That is, if we have a mixed group with a guess of for example rrij = 0.5, then there still is a non-zero posterior probability that all labels in the group are 0 or 1 at once. This is in contrast to the constraint based model presented in Section 3.3.3. If in addition to the guessed fraction of positives we have group labels as described in Section 3.3.3 and reason to believe that for our training data a ® group label actually guarantees that at least one positive and one negative have to be in the group, then it makes sense to enforce this constraint. It adds some amount of additional information which is especially useful if the confidence x in the guesses is low and the guessed M thus are not very informative. It is straightforward to combine the two models presented in the previous sections. The resulting graphical model is shown in Figure 3.8 and the posterior CHAPTER 3. PROBABILISTIC MODEL 43 r t Figure 3.8: Model with both hard constraints and guessed number of positives distribution for this model is The posterior conditional distribution for the augmentation variables in one group is now P(Z g : J |m :,-,C j,Xg ; i,/3,7) oc p(m 3-IZg.^CjIZg^ F T p ( 2 l | X i , / 3 , 7 ) . For the cases Cj = G and Cj = © this distribution is the same as in Equa-tions (3.16) and (3.17). In the Cj = ® case, the augmentation variables z in p (0 , Z | 2 ? ) = p (0 , Z | X , M , C ) ocp(M|Z)p(C|Z)p (0 ,Z |X) = p(M|Z)p(C|Z)p(Z|X,/3, 7)p(/3|<5 2)p(<5 2)p(7) xp(/3|<52M<52)p(7). CHAPTER 3. PROBABILISTIC MODEL 44 the group are distributed according to a multivariate Gaussian in E™ where the orthants of the space are weighted as described in Section 3.3.4. The difference is that, additionally, the positive and negative orthants are removed, i.e. the pdf in these orthants is weighted by 0. CHAPTER 4. COMPUTATION 45 CHAPTER 4 COMPUTATION As described in Section 2.2, our goal is to learn the posterior distribution p(6\T>) for some given training data T>. Knowledge of this posterior distribution then allows us to perform Bayesian probabilistic classification of new data points according to equation (2.2). More to the point, we need to generate samples from this distribution in order to approximate the integral in Equation (2.2) with the Monte Carlo approximation in Equation (2.3). For the different models developed in Chapter 3, the posterior distributions can not be computed in closed form and it is thus not possible to directly sample from them. Instead, we take the approach of constructing an efficient Markov Chain Monte Carlo sampler, whose stationary distribution is the desired posterior distribution p(9\T>). To be exact, we really will be sampling from the joint of the parameters and the augmentation variables p(9, Z\T>). Although we only require samples 0* ~ p(0\T>) in order to perform Bayesian classification as discussed in Section 2.3, it turns out to be much easier to sample from the joint posterior p(9, Z|P) than directly from p(9\V). This is why we introduced Z into the model in the first place. Once we have the samples from the joint, integrating out Z merely involves ignoring the Z components of the samples. 4.1 An efficient blocked Gibbs Sampler for sampling from the posterior The sampler we use is based on the one developed by Tham, Doucet and Ko-tagiri [67]. In fact, most components are identical or only slightly modified. CHAPTER 4. COMPUTATION 46 Figure 4.1: The probabilistic model used in the blocked Gibbs sampler. The variables in each of the two highlighted groups are sampled together as a block. However, in their work they are dealing with the supervised learning case (the model described in Section 3.3.1) whereas we are learning from information for groups of labels. This leads to different approaches for sampling the augmen-tation variables Z from their posteriors in the different proposed models as will be explicated in Section 4.1.5. Figure 4.1 shows the generalized probabilistic model for which we want to sample from the posterior distribution p(/3, 7,5 2,Z|P) =p(/3, 7,<52,Z|X,0), (4.1) where O = {0\,..., Ond } represents the abstract information (or Observations) per group of training data points. The three models presented in Sections 3.3.3, 3.3.4 and 3.3.5 are special cases of this generalized model. Note that we removed the variable T in the model. This is because we were able to integrate it out in Equation (3.5) and thus do not have to sample it. Similarly, the individual labels Y were removed from the model as well, since they follow deterministically from the augmentation variables Z, and in deriving the posterior distributions of the models in Section 3.3 we already integrated out Y . CHAPTER 4. COMPUTATION 47 In order to sample from the posterior in Equation (4.1), a Gibbs sampler (see Section 2.3.2) is used. One possibility would be to use a standard Gibbs sampler and sample each of the 4 components /3, 7, 52, Z from their full conditional distributions, i.e. conditioned on the sampled values of the other 3 components. However, we can do better than that by grouping variables together and using a blocked Gibbs sampler to sample from the corresponding joint conditionals. As mentioned in Section 2.3.2, this blocking reduces the correlation between samples and thus leads to a more efficient MCMC sampler. This is because it allows the chain to explore the probability space faster and more efficiently. As illustrated in Figure 4.1, in our sampler, two blocks are used: {7,/3} and {S2,Z}. The joint conditional distribution for {7 , /3} factors as follows p(f3,7|52, Z, X) = K /3| 7 , S2, Z, X)p(7|<52, Z, X). We can generate a sample from this joint distribution by first sampling 7 ~p(7 |<5 2 ,Z,X) and then conditioned on the sampled 7 sample /3~p03|7,<52,Z,X) For sampling {52, Z}, note by looking at the graphical model in Figure 4.1 that <52 and Z are conditionally independent, given (3 and 7: p(62, Z\f3,7, X , O) = p{821/3,7)P(Z|7, A X, O). The two components can thus be sampled independently. Furthermore, we know that the augmentation variables Z g for different groups are independent: p(Z|7,/3,X,C?) = nP (Zf t r7 ,0 .X f e ,Oj ) j=l To sample from the joint conditional for <52 and Z, we can thus draw independent samples from the conditionals for S2 and for each set of augmentation variables ZSj for the groups in our training data. CHAPTER 4. COMPUTATION 48 3 2 initialize 7,<52,Z for t = 1 to #burn_in_samples + #desired_samples: Sample: 6 4 5 7 ~p(7|<5 2,Z,X) // Section 4.1.3 /3~p(/3|7,<5 2,Z,X) // Section 4.1.2 <52 ~p(<5 2 | /3) / / Section 4.1.4 7 for j = 1 to % : s Zg; ~ p(Zg3. |7,0, X, 0 , ) / / Section 4.1.5 10 i f t > #burn_in_samples: i i Store sampled 7 and /3 Listing 4.1: Blocked Gibbs sampler for sampling from the joint posterior distribution of 6 and Z. The overall Gibbs sampler is given as pseudo code in Listing 4.1. The fol-lowing sections will discuss the initialization of the parameters (line 1 in the pseudo code), and how to sample from each of the conditional distributions used in the Gibbs sampler (lines 4-8). Note that the samplers for /3, 7, and 62 do not depend on the per group label information O and thus stay the same, independent of which of the 3 probabilistic models presented in Section 3.3 is used. The sampler for the groups of augmentation variables Zgj however de-pends on Oj and therefore on the choice of model. The implementations of this sampler for the 3 different models will be discussed in Section 4.1.5. 4.1.1 Initialization Since the Gibbs Sampler is sequential, i.e. a new sample is generated based on the previous sample, initial parameter values have to be chosen, relative to which the very first sample can be generated. In theory the initialization does not affect convergence. That is, the Gibbs sampler is guaranteed to converge to its stationary distribution from any initialization. In practical applications however, where only a finite number of samples are generated, good initializa-CHAPTER 4. COMPUTATION 49 tion can be crucial to achieve fast convergence of the chain to the stationary distribution. For the Gibbs sampler outlined in Listing 4.1, we need to initialize the pa-rameters 7, S2 and the augmentation variables Z. The parameters j3 on the other hand do not need initial values because in our Gibbs sampler, a new sample does not explicitly depend on the /3 values of the previous sample. For the other parameters we try to choose reasonable initial values: The regularization parameter 62 is initialized to the mode of its prior distribution p(62\a,b) (see Equation (3.7)) which is located at The kernel activation vector 7 is initialized by randomly activating a user defined number of kernels (in our experiments we generally used a value of 10). The augmentation variables Z are initialized according to the given group wise label information O. For groups known to contain exclusively negatives or positives, the corresponding variables Zi are initialized to —1 or 1 respectively. For mixed groups, one randomly chosen element is chosen and the corresponding Zi set to 1. This guarantees that the initial configuration satisfies the constraints in the constraint based models (Sections 3.3.3 and 3.3.5). If an estimate of the ratio of positives is available, the corresponding number of elements in the group are assigned a value of 1. For the remaining elements in the group we choose an uninformative value of 0. 4.1.2 Sampling the kernel weights In the Gibbs sampler in Listing 4.1, the kernel weights /3 are sampled from the posterior distribution p(/3|7, S2, Z, X). According to Bayes Rule (Equa-tion (2.1)) this distribution follows from the likelihood p(Z|/3,7,X) and prior p(/3|7, S2) as follows p(/3| 7, S2, Z, X) oc p(Z|/3,7, X)p(/3|7, <52) (4.2) CHAPTER 4. COMPUTATION 50 with p(Z| /3 , 7 ,X)=^(^/3 , l ) p(^| 7 ,<5 2 )=Ar(0, < 5 2 (^)- 1 ) . (4.3) By writing out the definitions for the two Normal distributions and subsequently pulling out all constant terms, we get p(/3|7)<$2,Z,X) oc (27r)-^(<52)-f | ( * £ * 7 ) | 4 x e ( - * ( Z - * , / 3 ) T ( Z - * , / 8 ) ) K e ( - i ( - 2 Z ^ + ^ ( ^ * > ) ) ( 4 4 ) Completing the quadratic form we can see that this is proportional to p(/3|7, S2, Z, X) oc e ( - 2 - (0 - M - ) T ( S *) -^- /O) with E* = ^ ( 0 ) = Pr( 7 i = 0|7-i) = 1 - Pr{li = l|7-i) _ N — K-j + b—l N+a+b-1 A change is proposed if a random draw from the uniform distribution U(0,1) is smaller than qt(0—>1) in the 7» = 0 case or smaller than ^(1 —•()) for 7 i = 1. The probability that this change is actually accepted is then computed according to the Metropolis Hastings acceptance probability (see Equation (2.4) for the CHAPTER 4. COMPUTATION 54 general formulation). For the 0-*l change this acceptance ratio is given by A w i - min (l ^ 7 i = l l7-i ,<* 2 ,Z,X) 9 i ( l-^0n ^ ~ m m V ' F r ( 7 , = 0 | 7 - , , * 9 , Z , X ) « ( 0 - > 1 ) J \ p(7i = l ,7 - i |^ 2 ,Z ,X) gj(l->0)\ v >(7i = 0,7-i|«52,Z,X) qiifi^l)) = min V ' P(7i = = l,7-<,g 2,X) p(7i = 1,7-0 ft(l->0) ! P(Z|7 'p(Z|7, • A P(Z|7, = min II, = min (1, P(Z|7 P(Z| 7. = 0,7_ i ! ( J2 ,X) p( 7 i = 0,7_i) ^(0-1) = l,7-i,0) = O ^ - i . ^ . X ) p( 7i = 0|7-i) ft(O-l) = l ,7- i ,5 2 ,X) p(Z\ii=0,>y-i,8*,X), = min ^l,( l + j 2 ) - ^ l i ^ [ ( z T * . I K 1 * . 1 r 1 < z T ) - ( z ^ o ( < * . o ) - 1 < o z where 71 represents the vector 7 with component ji set to 1 and 70 is the same vector but with 7$ = 0. Analogously we get for the 1 —> 0 case A „ = m i n ( l , ( l + J » ) M T & [ ( B T ^ ^ ^ After proposing a change for component ji, the corresponding acceptance probability is evaluated and the change is then randomly accepted or rejected according to this probability (using another random draw from U(0,1) ). The complete algorithm for sampling 7 from its posterior distribution p(7|<52, Z, X) is given in pseudo-code in Listing 4.2. 4.1.4 Sampling the regularization parameter In the Gibbs sampler in Listing 4.1 the regularization parameter <52 is sampled from its posterior distribution p(<52|/3,7). Using Bayes rule we get p(J2|/3,7)ocp(/3|7,<52)p(<52) where p(<52) is the inverse gamma distribution J^(f ,f) (see Equation (3.7)) and p(P\f,62) is the normal distribution given in Equation (4.3). Since the inverse gamma distribution is a conjugate prior for the variance of the normal CHAPTER 4. COMPUTATION 55 for i = 1 to N: 2 sample ui ~ U(0,1) 7 6 4 3 if 7 i = 0 and u\ < g,(0—»1): // change from 0 to 1 proposed compute acceptance probability Ao_ i sample U2 ~ U(0,1) if « 2 < J 4 O - I : 8 set 7 i = 1 .2 9 .0 else if 71 = 1 and ui < <7t(l—»0): / / change /ro/n i to 0 proposed compute acceptance probability J4 I_4) sample ui ~ W(0,1) .3 if « 2 < A I - H O : .4 set 7 i = 0 Listing 4.2: Metropolis-within-Hastings sampler for sampling 7 from its distribution, the posterior p(5217, /3) is again an inverse gamma distribution. It has the form We can thus sample S2 directly from its posterior distribution using a stan-dard Inverse Gamma sampler. 4.1.5 Sampling the augmentation variables In our Gibbs sampler, the augmentation variables Z are sampled in groups according to the grouping of data points in the training data. The set of variables ZSj for a group j axe sampled from their conditional posterior distribution p(ZSj I7, /3, X , Oj) in line 8 of Listing 4.1. The form of this posterior distribution depends on the type of the group wise label information Oj. In Sections 3.3.3, 3.3.4 and 3.3.5 we already derived the conditional posterior distributions for Z g i posterior distribution p(7|<5 2 , Z, X ) . CHAPTER 4. COMPUTATION 56 in the context of the 3 different proposed models. This section will cover the issue of sampling ZSj from these different posterior distributions. 4.1.5.1 Sampling pure groups The first case of interest is the one where we axe given the information that all elements in the group g^ have a positive label (or where all of them axe known to be negative). This axises in the constraint based model in Section 3.3.3 for the group labels © and 0 and for the two other models (Sections 3.3.4 and 3.3.5) for a guessed ratio of positives rtij of 0 or 1. The individual augmentation variables z^ in the group axe a posteriori distributed according to independent truncated Normal distributions given in Equation (3.12). We implemented the algorithm by Geweke [33], which uses rejection sam-pling to generate samples from the truncated univariate Normal. Depending on the cutoff point, one of two different proposal distributions is chosen. If the cutoff point is such that only the tail of the Normal distribution is cut off, i.e. only a small fraction of the probability mass is removed, the untruncated Normal distribution is chosen as proposal. That is, samples are generated from the normal and rejected iff they fall into the restricted region. In the other case, where most of the probability mass is truncated, an appropriately scaled exponential distribution is used as proposal. This distribution is efficient to sample from and provides a very good approximation of the tails of the Normal distribution. Thus even for extreme cutoff points a very good acceptance ratio can be achieved. 4.1.5.2 Sampling mixed groups with constraints As discussed in Section 3.3.3, for ® group labels in the constraint based model, the posterior distribution for the Z g j . (Equation (3.18)) is a multivariate Normal with the positive and negative orthants removed (see Figure 3.5 for an example). A Metropolis-witbin-Hastings sampler is used to generate samples from this distribution. The prior distribution p (Z g j . |X g i , /3 ,7 ) , which is a multivariate CHAPTER 4. COMPUTATION 57 normal distribution, is used as proposal distribution. This choice leads to a binary acceptance ratio. It is 0 if the sample falls into the constraint orthants, i.e. if all Zi in the group have the same sign. Otherwise the acceptance ratio is 1. In contrast to rejection sampling, previous samples of the Markov Chain are reused. The cost of this reuse is correlation among the generated samples. 4.1.5.3 Sampling mixed groups in the measurement model When using the noisy measurement model introduced in Section 3.3.4, ZSi is sampled according to the posterior distribution p(Zgj\mj,Xgj,f3,7) given in Equation (3.21). As previously mentioned, it is a multivariate Normal distri-bution with the different orthants of the probability space scaled by different factors. The scaling factors depend on rrij, the guessed number of positive labels in the group. While in the constraint based model discussed above the prior distribution p ( Z g J X g i , / 3 , 7 ) generally provides a good proposal distribution, this is not necessarily the case for this re-scaled multivariate normal. The rea-son for this is that the scaling factors might cause the probability mass of the posterior to be concentrated in only a few orthants of the ng-dimensional space. The unsealed prior p(Zgj |X g . , /3 ,7) on the other hand might only assign a tiny fraction of the overall probability mass to these orthants. Instead, a Metropolis-Hastings within Gibbs sampler (see Section 2.3.3) with a local proposal distribution is used. That is, new samples are proposed mostly in the local neighbourhood around the previous sample. This allows the Markov Chain to efficiently explore the high probability regions of the posterior distri-bution. The proposal distribution we use is an isotropic multivariate normal positioned at the previous sample location. Thus a new candidate Z'g. is gen-erated based on the previous sample Zg. using Z ^ ~ 9 ( Z g i , Z g 3 ) = ^ ( Z g i , c 2 l ) , _ 1 where c controls the variance of the proposal. We use c = 2.4 n g 2 which was shown to be optimal for sampling from the unit variance multivariate normal distribution [30]. While this only means that it is optimal for the limiting case CHAPTER 4. COMPUTATION 58 of no orthant wise rescaling, this proposal distribution does result in fairly good acceptance rates for our posterior distribution in practice. Since this proposal distribution is symmetric (i.e. g (Z g i , 7l%.) = q{fL'%., Z g ^)) , the Metropolis-Hastings sampler is really just a Metropolis sampler and the acceptance probability is simply For the definitions of the individual terms please consult Section 3.3.4. 4.1.5.4 Sampling mixed groups in the hybrid model The hybrid model presented in Section 3.3.5 is a simple variation of the noisy measurement model in which we added the constraints from the constraint based model. To sample from it, we use the same Metropolis within Gibbs sampler described in the previous section and modify it by adding in the constraints. The only change required is to set the acceptance ratio A to 0 whenever a proposed sample falls into the constrained orthants. 4.2 Speeding up the Sampler by Incremental Matrix Updates One of the most expensive parts in the overall Gibbs sampler is sampling the binary vector 7 from its conditional as described in Section 4.1.3. In particular, evaluating the acceptance probability 4^o_i or AI-Q is an expensive operation since it requires the matrix inverse (Cl (zgj), 0+1-J'=l Similarly, if 7 » is switched from 0 to 1, a new row and column are inserted at index I into the K x K matrix to produce the (K + 1) x (K + 1) matrix ^fT. tyy. The entries aij = aji of the new l-th row and column are computed as where ipi is the i-th row of the full kernel matrix \&. Using these operations, the matrix product $ T , can be efficiently com-puted in 0(KN) time from the matrix 9T^y. CHAPTER 4. COMPUTATION 60 4.2.2 Incremental update of the matrix inverse While updating the matrix product as shown above is quite straightforward, the more interesting and challenging task is to derive an incremental update procedure for the matrix inverse. That is, we are looking for an efficient way to compute (\& .^ ) 1 incrementally from the already known matrix 1 . We are able to derive such an incremental update based on the Shermann-Morrison formula, which will be presented in the next section. In the subsequent sections we will then discuss how this formula can be used to incrementally update the inverse of our matrix for two special cases. The first case is the removal of the last row and column of the matrix and in the second case a new last row and column are appended to the matrix. Finally we show in Section 4.2.2.4 how the more general removal or insertion of a row and column at an arbitrary index can be reduced to these two base cases. 4.2.2.1 The Shermann-Morrison formula The Sherman-Morrison formula (also known as the matrix inversion lemma) provides a way of updating the inverse of a K x K matrix A in 0(K2) com-putation. It applies when A is transformed using a rank one update of the form Anew = A + UVT^ where u and v are K-dimensional vectors. The inverse of this matrix is then given by the Sherman-Morrison formula {A + uvT)~l = A'1 - l—(A-1u)(vTA-1f. (4.8) l-vTA~1u This formula thus allows us to efficiently compute the inverse of the matrix A + uvT, given that we already know the inverse of A. 4.2.2.2 Removing the last row/column Given a symmetric matrix A (in our case A — ^ ^y) and the corresponding inverse A^1, we want to find an incremental update for the matrix inverse A"1 CHAPTER 4. COMPUTATION 61 when removing the last row and column from A. Let us denote the components of the symmetric matrix A as follows: A = -K a — | a where A-K is the desired sub-matrix of A without the K-th (last) row and column, a = O,KK and a is the (K — l)-dimensional vector with components We first apply a rank one update to transform the original matrix into the matrix A of the form T 0 A = 0 A-K 0 — a — 0 = A + We can use the Sherman-Morrison formula (Equation (4.8)) to compute A-1 from A*1 using an 0(K2) computation. The vectors u and v in this case are 1 ' 0 " —a ,v = 1 0 —a 1 A second rank one update is applied to get the matrix A A + 0 A-K 0 0 ••• 0 1 Using the Sherman-Morrison formula again, the corresponding matrix inverse CHAPTER 4. COMPUTATION 62 A 1 can be incrementally computed from A'1. This inverse now has the form 0 1 _ 0 0 ••• 0 1 The desired matrix inverse AZlK can then be extracted simply by removing the last row and column of A - 1 . Since the two incremental updates using the Sherman Morrison formula have 0(K2) complexity, the overall complexity for computing AZXK is 0(K2) as well. This is a drastic improvement over the 0(K3) cost for computing the matrix inverse from scratch. 4.2.2.3 Appending one row/column In a very similar way, we can formulate the update of adding a new row and column to the end of matrix A in terms of rank one updates. The Sherman-Morrison formula can then be employed again to incrementally compute the corresponding matrix inverse. We use the same notation as before and write the expanded matrix A+^x+i) as i+(AT+l) = (4.9) where now a and a contain the entries of the new row and column that are to be appended to the existing matrix A. In the case A = * 7*-y that we are dealing with in our application, these are given by a = ytyi and CHAPTER 4. COMPUTATION 63 where ipt is the ith row of the full kernel matrix i.e. the row corresponding to the newly activated kernel with index i. First, the original matrix A is extended into the following (K + 1) x (K + 1) matrix 0 A+ = A 0 0 ••• 0 1 The corresponding matrix inverse is known to be 0 " A-1 0 0 ••• 0 1 We then basically apply the two rank one updates used in the previous section, but backwards and inverted. T A+ =A+ + 0 0 A 0 — a — 0 A+ = A+ + -, T — a — = A + (KT+1)> By using the Shermann-Morrison formula (Equation (4.8)), the matrix in-verse can be updated according to these two rank one updates to produce the desired matrix; inverse A~^K+1y The computational complexity of this incre-mental update is 0((K + l) 2), if a and d are given. In our application, we have to compute a and a using additional O(KN) and O(N) computations, resulting in an overall complexity of 0((K + 1)(N + K + 1)). CHAPTER 4. COMPUTATION 64 4.2.2.4 Generalization to removal/insertion at index / In the previous two sections, we have shown how the inverse of a symmetric matrix A can be updated when adding or removing the very last row and column of the matrix. In this section, we will generalize these operations to allow for addition or removal of an arbitrary row and column I of matrix A. The key idea is to swap row and column I with the last row and column of the matrix before the removal or after the insertion of the last row and column. To formalize things, we will refer to the removal of the last (K-th) column and row of a matrix using the operator delx-delK{A) = A-K Similarly, the operator addn+i appends a new (K + l)-th row and column to a matrix addK+i(A) = A+(K+i), where A+(K+I) is defined in Equation (4.9). We further introduce the two operators move]71 and move°ut. When applied to a matrix, move™ moves the last column of the matrix in front of the Z-th column and the last row in front of the Z-th row of the matrix. move°ut is the inverse operator which moves the Z-th column of the matrix to the last column and the Z-th row to the last row. From the definition of the operators it is obvious that move?ut(movein{A)) = movein(movefut(A)) = A. Furthermore, both operators commute with the matrix inverse, that is (movef^iA))-1 = movefut (A-1) {move1,71 {A))'1 = mowe}"^"1) The technique described in Section 4.2.2.2 allows (dei^A)) - 1 to be com-puted incrementally from known A and A " 1 . The general case, where we remove the Z-th row and column (with i ^ K) can be written as: dek{A) = del K {move?^ (A)) CHAPTER 4. COMPUTATION 65 The inverse is thus given by (deh(A))-1 = (delKimovef^iA)))-1 (4.10) Let B = move?"1 (A). Using the fact that the move°ut operator commutes with the matrix inverse, B~x is then straightforward to compute from the known A~1 B-1 = {mover1 {A))~l = rnove\ui\A~X) By rewriting Equation (4.10) as (deh{A))-1 = {delKimove^iA)))-1 = [delK{B))'1 it is then obvious, that we can compute the matrix inverse of dek(A) using the incremental technique described in Section 4.2.2.2 from the (easily computed) B-\ Similarly, the general operation addi of inserting a new row and column at index I can be expressed as addi(A) = move]n(addK+i(A) In this case, the corresponding matrix inverse is given by (addiiA))-1 = (movefiaddK+iiA))-1 = movejn((arfdK+i(^))_1), where in the last we step used the commutativity of move]11 and the matrix inverse operator. When adding a new row and column at index I, the inverse can thus be incrementally computed by first computing (addK+i(A))~l from A-1 according to Section 4.2.2.3 and by then applying the move]n operator to the resulting matrix. 4.3 Implementation The MCMC sampler described in this chapter was implemented in C++ on a Linux platform. We chose C++ for its efficiency and the benefits of the CHAPTER 4. COMPUTATION 66 object-oriented software design it allows. Where possible we made use of C+-l-'s support for generic programming using templates. This allowed for flexibility without the runtime overhead of virtual member functions. While a lot of functionality that would have been available in software packages such as MAT-LAB/Octave or the R language had to be written from scratch, we feel that the benefits of using C++ outweigh the additional programming work required. Furthermore, the library of samplers, data analysis functions and related code, which was implemented in the course of this thesis project was designed to be general and kept separate from the project specific source code. Therefore, it will be easy to reuse as the basis of future projects. In our implementation we used several C++ libraries • The boost::uBLAS library1 was used for linear algebra data types and operations. It provides matrix and vector classes as well as efficient im-plementations of basic linear algebra operations. By using advanced C++ features such as template expressions it allows for readable syntax with-out sacrificing performance. In particular, the use of expression templates makes it possible to avoid the expensive creation of unnecessary temporary objects. • The ATLAS (Automatically Tuned Linear Algebra Software) library2 pro-vides a free and highly optimized implementation of BLAS (Basic Linear Algebra Subprograms)3 as well as a selection of LAPACK (Linear Alge-bra PACKage)4 functions. The boost: :uBLAS library contains bindings, which allow BLAS and LAPACK functionality to be accessed without hav-ing to deal directly with the notoriously nasty interface of these libraries. We used BLAS Level 2 (matrix-vector) and Level 3 (matrix-matrix) op-erations in a few performance critical parts of the code for even higher performance than with uBLAS itself. The Cholesky decomposition algo-1http://www.boost.org/libs/numeric/ublas 2bttp://math-atlas.sourceforge.net/ 3 http : / /www.netl ib.org/blas 4 http : / /www.netl ib.org/lapack CHAPTER 4. COMPUTATION 67 rithm provided by ATLAS's LAPACK implementation was also used. • The boost::random library5 provides random number generators and sam-plers for several basic univariate distributions (uniform, normal, exponen-tial). We chose the Mersenne Twister algorithm [53] as the underlying pseudo random number generator for our MCMC sampler. • The MySQL relational database6 was used for storing the parameters and results of our experimental runs. This allows for easy access to all previous runs and the corresponding parameters and provides powerful search and filter tools. We also implemented a simple C++ interface to MATLAB, so that we could use MATLAB's data analysis and visualization tools. 4.4 Training Data (Re-)Balancing In the application used as our motivating example - learning a classifier for one specific object class from a general image database - a major problem is the imbalance of our training data. For most object classes there will only be a few images containing the object and a much larger number of negative example images without the object. We assume here, that the positive examples are not pure positives (exclusively showing the object of interest) but mixed (showing the object along with other things), which is typically the case in practice. Therefore, not only are the positive examples a clear minority but they also carry much weaker label information. This would not necessarily be a problem, were the actual positive and negative data points nicely separated in feature space. In general though, this will not be the case. Instead, the training data will consist of a few 'maybe positive' data points embedded in a large mass of 'definitely negative' data points. Learning from this kind of data tends to produce classifiers which are heavily biased towards the negative majority class. 5http://www.boost.org/libs/random 6http://www.mysql.com/ CHAPTER 4. COMPUTATION 68 One way of partially mitigating this problem is to make the positive data more informative by adding extra information. Adding the estimated number of positive image regions for the positive example images as described in Sec-tion 3.3.4 has exactly this effect. This is still not sufficient to completely solve the imbalance problem. Therefore we propose a second approach, which is to alter the training data distribution itself. The data imbalance problem arises in many different settings but it is espe-cially common when learning 'one vs. the rest' binary classifiers from data with many different categories. The most prominent class of techniques proposed for dealing with the problem involves changing the distribution of the training data by re-sampling the available data. While one might suspect that for best classi-fication results the positives-to-negatives ratio in the training data should match the 'natural' distribution of the data, it has been empirically shown that this is not necessarily the case [12, 72, 73]. To be more precise, whether the natural dis-tribution is optimal or not depends on the choice of performance measure used for evaluation. If the goal is to maximize the overall classification accuracy, then using the natural distribution is indeed close to optimal. However, classification accuracy is known to be an unsuitable performance measure in cases where the test data distribution is unknown, or where different misclassification costs are associated with the two classes [60]. In the 'one vs. the rest' scenario, one is typically more interested in high accuracy on the positive minority class than in the performance on the much larger background class. Unhappily, using the highly skewed natural class distribution for training will generally result in low accuracy on the minority class and high accuracy on the background class [73]. A much better technique for comparing classifiers when misclassification costs and test set class distribution are unknown is to use Receiver Operator Curve (ROC) analysis [25, 60]. When a single real valued performance measure is required, then the area under the ROC curve (AUC) is a widely accepted al-ternative to classification accuracy [11]. Weiss and Provost [72, 73] show that if AUC is used as the performance measure, then using training data in which the minority class is over-represented generally results in significantly better CHAPTER 4. COMPUTATION 69 classifiers than those trained on the natural distribution of the data. Other researchers have come to the same conclusion and several different techniques to re-sample a training data set have been proposed. There are basically two competing approaches: • Under-sampling the majority class, leaving the minority class unchanged. • Over-sampling the minority class, leaving the majority class unchanged. In under-sampling only a randomly chosen subset of the examples of the back-ground class are used [43]. In the over-sampling technique, the positive ex-amples axe artificially multiplied, either by simple duplication or by creating synthetic new positive examples as linear combinations or noisy versions of the existing ones [18]. Some authors have found under-sampling to outperform over-sampling [22] while others have come to the opposite conclusion [39]. The results seem to depend highly on the problem and classification technique. In the context of our application, model and learning algorithm, over-sampling is not a viable solution. Simple duplication of data points would cause the ma-trix ^"fr-y to become singular and therefore its inverse as required in sampling (3 (Equation 4.5) would be ill-defined. In theory, linear interpolation of positives as done in [18] could get around this problem. However, it seems like a rather ad hoc solution in general and is not applicable to our setting. Our 'positives' consist of groups with varying numbers of data points. Since the groups live in different dimensional spaces, linear interpolation is not possible. Under-sampling, on the other hand, is straightforward to implement by only selecting a subset of the available negative groups or individual negative data points. What further makes under-sampling an appealing solution is the fact that it shrinks the training data set, ideally without discarding much useful information. It therefore makes it possible to apply our algorithm to very large data sets. In spite of the efficient MCMC sampler and the sparse representation we are using, our training algorithm by itself still does not scale too well to datasets significantly larger than several thousand data points. CHAPTER 4. COMPUTATION 70 One approach to under-sampling is to simply choose a random subset of the negative examples and discard the rest. A n alternative approach is to try to keep the most informative negative labels and discard the less informative ones. For the supervised classification setting, researchers have proposed heuristics which try to choose negative examples close to the decision boundary [43]. This is similar in spirit to the Support Vector Machine classification technique [69]. We implemented two different under-sampling techniques. The first one is just a simple random sub-sampling of the negative example images. The second one employs a selective sub-sampling strategy, which picks potentially informative images based on the information in the image annotations. The goal of this sub-sampling strategy is to choose images which will help to disambiguate between the actual positive and negative data points (image regions) in the mixed groups (positive example images). To this end, we first analyze the additional annotations of our positive training images, i.e. the labels of the other object classes present in the positive images. Negative examples are then chosen randomly from the full set, with preference given to those images that share annotations with images in the positive set. The more shared annotations, the higher the probability of choosing a certain image. As an example, assume we were training a horse classifier and all our training images had horses either on grass or in front of trees. Then our sub-sampling algorithm would mostly pick negative example images containing grass and/or trees (but no horses). In training, these negative examples help to disambiguate the image regions in the positive example images which actually show horses, since they provide strong evidence that image regions looking grassy or tree-like are not horses. The desired training data class distribution (i.e. the ratio of positive to neg-ative example images) - which determines the degree of under-sampling per-formed - has to be chosen by the user. CHAPTER 5. RESULTS 71 CHAPTER 5 RESULTS In this chapter, we present results for the techniques presented in this thesis on both synthetic data (Section 5.1) as well as for the problem of learning object class recognition models from a database of annotated images (Section 5.2). All of our experiments were run on 2.5 Ghz Pentium 4 machines equipped with 1 GB main memory. 5.1 Synthetic Data Using synthetic data sets to test and demonstrate the performance of our tech-niques is appealing for several reasons. Firstly, it allows for easy validation of the learnt classifiers. This is especially important for the multiple instance type of problem we are dealing with. The probabilistic models and learning techniques described in this thesis are designed to learn classifiers for individual data points when information is only provided for groups of data points. This makes evalu-ation difficult or impossible when using real world data sets, in which the labels of the individual data points are unknown. In synthetic data, on the other hand, we can construct our groups from individual data points with known individual labels, thus allowing for easy validation. Another benefit of synthetic data is the ability to easily visualize the results. We restrict ourselves to training data, in which the data points X live in a two dimensional feature space. This makes it possible to visualize the predictive distribution Pr(y = l|x,Z>) as a 3D sur-face plot. Furthermore, synthetic data allows us to observe the performance of the algorithm and model itself. This is in contrast to real world data, where many external factors influence the overall performance. In the case of the ob-CHAPTER 5. RESULTS 72 ject recognition application, such external factors include the quality (or lack thereof) of the image annotations, of the image segmentation and the feature extraction. In our first experiment, we generate 500 2D data points with binary labels from a simple mixture model. 5% of the data points are created with a positive label and distributed according to an isotropic 2D Normal distribution. The remaining 95% negative data points are sampled from a surrounding ring with a Gaussian cross-section. Figure 5.1(a) shows the resulting supervised data set. We randomly assemble the data points into groups of 5. Each group's label Cj is determined based on the labels of the contained data points according to Equation (3.13). This process results in 23 mixed bags (Cj = ®), 77 pure negative bags (Cj = ©) and no pure positive bags. The corresponding semi-supervised label information is visualized in Figure 5.1(b). The data points and the group labels C are then used to train the probabilis-tic model described in Section 3.3.3. We choose a Gaussian kernel with kernel width r = 1.0 and opt for uninformative priors by choosing ^ i = v = a = b= 1.0. In the training phase, 2000 samples from the posterior distribution p(9\T>) are generated using the MCMC sampler described in Chapter 4 (and in particular Section 4.1.5.2) after a burn-in period of 2000 samples. Using the generated samples, the predictive probability Pr(y = l\x,T>) is evaluated according to Equation (2.3) for points on a 2D grid to produce the surface plot shown in Figure 5.1(c). Although no data points are a priori known to be positive in this case (see Figure 5.1(b)), the constraints provided by the group labels are sufficient for our approach to learn a nice predictive distribution which clearly separates positive and negative data points. Thresholding this distribution at a threshold value of 0.5 produces a hard binary classifier. As shown in Figure 5.3(b), this classifier nearly perfectly re-constructs the labels of the individual data points in the training data (shown in Figure 5.3(a) for comparison). In a second experiment we used the same training data, but inverted the labels of the individual data points. The resulting training data and predictive CHAPTER 5. RESULTS 7.3 Sit .•: o o~ o Vf o (a) Individual labels in the synthetic data set. Red triangles stand for data points with positive labels, blue circles represent data points with negative labels. . ° ° (b) Semi-supervised label information inferred from the group labels. Data points with unknown, but con-strained labels are shown as black dots. (c) The prediction surface Pr(y = l\x,"D) as computed by our approach from the provided group labels. Figure 5 . 1 : Synthetic test data set. Positive data points were generated from a Normal distribution, negatives surrounding ring. Data points were randomly grouped into bags of 5. CHAPTER 5. RESULTS 74 probability distribution are shown in Figure 5.2. Thresholding it at 0.5 again provides a close to perfect classification of the data points as shown in Figure 5.3. With both experiments, the training phase took 10 seconds using our C++ implementation. The Metropolis-Hastings-within-Gibbs sampler employed for sampling the Z g variables (see Section 4.1.5.2) had an average acceptance rate of 80%, while the acceptance rate of the Metropolis-within-Gibbs sampler used for sampling the kernel selection parameters *y had an acceptance rate of 8%. The sampler achieved a high degree of sparsity, with on average only 6 out of the 500 kernels activated per sample. While the synthetic training data used in these experiments might seem extremely simple and does not provide a challenge to our model and algorithm, it should be noted that many commonly used models would fail to separate the two classes in this example. This is true for linear classifiers and simple mixture models. Multiple instance approaches based on axis parallel rectangles [21] or Diverse Density [50, 51, 52, 77] are likely to succeed in reconstructing a good classification boundary for the example in Figure 5.1. However, they will fail to do so for the inverted example shown in Figure 5.2, where the positives are spread out and surround the negative data points. This is because of the explicit or implicit assumption made by these approaches that the positive class is represented by a point or convex region in feature space. Our next synthetic data set is yet simpler and shown in Figure 5.4. The positive and negative data points are generated from a mixture of 2 isotropic Gaussians. The fraction of positives is chosen to be 10% and the individual datapoints are randomly combined into groups of 6. We use this data set to demonstrate the flexibility with respect to the choice of kernel, that our approach allows. Figures 5.5 and 5.6 show the results of using our learning technique with different kernel types. For each kernel type, the learnt predictive distribution Pr(y = l|x,2>) and the resulting classification for a threshold value of 0.5 are visualized. We can see, that for this simple data set, our approach manages to learn predictive models, which separate the two clusters well with all kernels. At the same time, it is also obvious that the choice of kernel has a large influence CHAPTER 5. RESULTS 75 . * » 4 » « * -3 -1.5 . A 4 A 4 a if (a) Individual labels in the synthetic data set. (b) Semi-supervised label information. (c) The prediction surface Pr(y = l\x,V) as computed by our approach from the provided group labels. Figure 5.2: The same data set as in Figure 5.1 but with inverted labels. on the shape of the leaxnt predictive distribution. It is therefore to be expected that some kernels would be performing significant better on certain data sets than others. Although the experiments shown so far feature a strong imbalance in the CHAPTER 5. RESULTS 76 0 5 I 1-5 2 ^.S -2 -1.6 -I -ttS 0 0.5 1 1.5 2 (c) (d) Figure 5.3: Classification results for a threshold of 0.5 for the learnt models shown in Figure 5.1 and Figure 5.2. The actual labels in the syn-thetic data are shown on the left in (a) and (c) and the predictions of our learnt classifiers on the right in (b) and (d). number of positive and negative data points, this has not resulted in any sig-nificant bias in the resulting classifiers, because in these simple examples, the two classes are well separated. In real world applications, this will often not be the case. More typically, a few positive data points will be embedded in a large diffuse mass of negatives. In our next example we construct a more difficult data set, which simulates this scenario. Like in the previous experiment, the data points are again generated from one positive and one negative 2D Gaus-sian distribution. However in this data set, the two Gaussians are not nicely CHAPTER 5. RESULTS 77 Figure 5.5: Posterior probability distribution Pr(y = l | x , V) and resulting clas-sification of the training data for different kernel types. CHAPTER 5. RESULTS 78 Figure 5.6: Posterior probability distribution Pr(y = l|x, V) and resulting clas-sification of the training data for different kernel types. CHAPTER 5. RESULTS 79 (a) Created data with individual labels (b) Corresponding semi-supervised data Figure 5.7: Synthetic data featuring positive data points embedded in a diffuse mass of negatives. separated, but the more concentrated positive distribution is embedded in the region of influence of the Gaussian generating the negative data points. We further changed the generative process as well. In our previous experiments, we generated the data points independently and then randomly grouped them. Here, we create groups of data points directly. For each group, we randomly decide whether we restrict it to be a pure negative group (with 85% probability) or not (15%). The data points are then generated accordingly. For a pure nega-tive group, all data points are generated from the negative Gaussian component. Otherwise, the group's data points are generated with equal probability from the positive and negative components. The group is thus most likely to contain a mixture of positive and negatives, but can also end up being purely positive or negative. This creation process is inspired by the kind of data we see in the case of over-segmented annotated images. An object is typically only present in a small number of all images. But when it is shown in an image, then it is very likely to cover more than just one of the image's regions. Using the described generative process, we create 75 groups of 8 data points each, producing a total of 600 data points. The created data set consists of 13 mixed, 62 negative and no pure positive groups. Figure 5.7 shows the generated labelled data and the corresponding semi-supervised data. CHAPTER 5. RESULTS 80 (a) Constraint based model with Gaus- (b) Hybrid model with Gaussian kernel sian kernel (c) Constraint based model with sig- (d) Hybrid model with sigmoidal kernel moidal kernel Figure 5.8: Experiment on the data from Figure 5.7 demonstrating the benefit of knowing the fraction of positives per group. In Figure 5.8(a), we show the resulting predictive distribution when learning from just the group labels C using a Gaussian kernel. In a second experiment, we use the hybrid model described in Section 3.3.5 and train it on both the group labels C and the additional fractions of positives M. Since we created the data ourselves, our confidence in these fractions of positives can be quite high and accordingly we choose the confidence parameter x in the probabilistic model to be relatively high with x = 1000. The predictive distribution learnt from this enriched group information is visualized in Figure 5.8(b). Clearly, the additional information helped to learn a much improved predictive distribution CHAPTER 5. RESULTS 8 1 0.8 0 . 6 0 .4 0 . 2 f - hybrid model — constraints only 0.8 0 . 6 0 . 4 0 . 2 ' 0 .4 0 . 6 0 . 8 x • hybrid model - constraints only ) 0 . 2 0 .4 0 . 6 0 . 8 1 (a) Probabilistic classifiers based on Gaussian kernel (Figure 5.8(a) vs. 5.8(b)) (b) Probabilistic classifiers based on sig-moidal kernel (Figure 5.8(c) vs. 5.8(d)) Figure 5.9: ROC plots comparing the classification performance of classifiers trained either on only the group label information (using our con-straint based model) or on the group labels and additionally the frac-tion of positives per group (using our hybrid model). The ROC (Re-ceiver Operator Curve) [25] visualizes the classification performance of a probabilistic classifiers with varying thresholds. The curves al-low to compare classifiers independent of a chosen fixed threshold value. The x-axis measures n a t i v e s f a l s e l y c l a s s i f i e d a s p o s i t i v e s h u a c t u a l n e g a t i v e s the y-axis corresponds to c o t I ^ t ^ ^ i ^ t i m ' • in which the positive cluster is nicely reconstructed. Next, we perform the same experiments with models based on the sigmoidal kernel. As we can see in Figures 5.8(c) and 5.8(d), this kernel does a much better job of separating the positives from the negatives than the Gaussian kernel even when only the group labels C are provided. Adding information about the fractions of positives does help to produce a classifier with less negative bias. But it seems to mostly result in a scaling of the predictive distribution and does not lead to the same qualitative difference experienced in the case of the Gaussian kernel. This is confirmed by the ROC curves shown in Figure 5.9, which compare the classification performance on the training data. In the case CHAPTER 5. RESULTS 82 AUC constraints only constraints and fraction of positives Gaussian 0.885266 0.922479 Sigmoidal 0.935802 0.942255 Table 5.1: Area under the ROC curve (AUC) values summarizing the plots in Figure 5.9. The AUC value is a widely recognized value for comparing the performance of probabilistic classifiers [11, 25]. of the Gaussian kernel, the additional information provided by the fractions of positives results in significant better performance, indicated by a dominating ROC curve. For the sigmoidal kernel on the other hand, the two ROC curves are close to identical. Table 5.1 summarizes the performance of the 4 classifiers using the AUC (Area under ROC Curve) [11, 25] measure. It is interesting to note that, while the additional information provided by the fractions of positives leads to improved classification performance, the choice of kernel has an even stronger impact. Both classifiers based on the sigmoidal kernel outperform the classifiers using the Gaussian kernel. With our last synthetic experiment we demonstrate, how the additional in-formation provided by the fractions of positives M can save the day, when the group labels do not provide sufficient information. We further illustrate the effect of the confidence parameter x- The data set used consists of an equal num-ber of positive and negative data points laid out along two parallel lines as shown in Figure 5.10(a). The data points are then randomly assembled into groups of 5, which for this data set results in exclusively mixed groups (Cj = ®). Thus the group labels themselves are basically completely uninformative and when training our constraint based model using this data, the result is the completely uninformative predictive distribution shown in Figure 5.10(c). The remaining surface plots in Figure 5.10 illustrate, that the additional information provided by the fractions of positives in combination with our proposed hybrid model (see Section 3.3.5) allows to separate the two classes. The confidence parameter X affects how informative the resulting predictive distribution is. CHAPTER 5. RESULTS 83 (c) Cons t ra in t s on ly (x = 0) (d) x = 10 (e) x = 100 (f) x = 1000 Figure 5.10: Figures (c-f) show classifiers learnt from the synthetic data set shown in (a) and (b) using the constraint based model (c) or the hybrid model using different confidence parameters x (d,e,f). CHAPTER 5. RESULTS 84 5.2 Object Recognition Data Set Finally, we present experiments on the task of learning object classifiers from annotated images. For these experiments, we use a set of 300 annotated images from the Corel database. The images in this set were annotated with in total 38 different words. Each image was segmented into regions using normalized cuts [64]. The image regions are described by 6-dimensional feature vectors (CIE-Lab colour, vertical position in the image, boundary to area ratio and standard deviation of brightness). The data set is split into one training set containing 200 images with 2070 image regions and a test set of 100 images with 998 regions. In our first experiment, we compare the approach proposed in this thesis with the mixture of Gaussians translation model trained with EM described in [13, 15, 23]. Both approaches are trained on the exact same training data, in particular the same set of image features. We adopt a vague hyper-prior for 52 (fi = v = 0.01). For the hyper-prior on the kernel selection vector 7, we choose more informative settings of o = 30 and b — 2000. This has the effect of biasing the sampler to use slightly more kernels than with a non-informative hyper-prior, which leads to a more efficient exploration of the 7 parameter space. While this prior is informative, the evidence provided by the data still easily dominates it. Experiments with the different types of kernels showed the sigmoidal kernel to work best for this data set. Not only does it produce better performing classifiers than multi-quadratic, cubic, thin-spline and Gaussian kernels, it also leads to numerically more stable and sparser samplers. The average number of activated kernels per sample lies between 5 and 10, depending on the learnt object class. The kernel width parameters were chosen manually for the different object classes based on a few test runs with different kernel widths. We use both EM with the mixture model and our constrained semi-supervised approach (see Section 3.3.3) to learn probabilistic binary classifiers for several of the words in this dataset of annotated images. The Markov chains of our CHAPTER 5. RESULTS 85 sampler are run for 10 000 samples after a burn-in phase of 10 000 samples. The run times for this training phase are in the range of 5 to 10 minutes, which is perfectly acceptable in our setting. The performance of the learnt classifiers is evaluated by comparing their classification results for varying thresholds against a manual annotation of the individual image regions. The ROC plots in Figure 5.11 and Figure 5.12 show the results on training and test set, averaged over 10 runs, plotting true pos-itives against false positives. The plots show that the approach proposed in this thesis yields overall slightly better and for some object classes significantly better classification performance than the EM mixture method. Given the sim-ple image features used and the small size of the data set, the performance is remarkably good. The ROC curves for the test set shown in Figure 5.12 demonstrate that the classifiers learnt using the proposed approach generalize fairly well even where the EM mixture approach fails due to overfitting (see the results for the concept 'space' for an example). . Figure 5.13 illustrates the dramatically higher consistency across runs of the algorithm proposed in this thesis as compared to the EM algorithm for the mixture model. The error bars indicate the standard deviation of the ROC plots across 10 runs. The large amount of variation indicates that the EM got stuck in local minima on several runs. While with our relatively small data set this problem arose only for some of the categories, in larger and higher dimensional data sets, local minima are known to become a huge problem for EM. In Figure 5.14, we show some examples of the classifications generated by the two algorithms for 2 different images containing polar bears. In order to get a fair comparison of the 'polarbear' classifiers learned by the two approaches, we chose the thresholds so that both classifiers would produce 30 positive instances on the full test-set (which contains 27 patches manually labelled as 'polarbear'). While our approach does not lead to a perfect classification (it produces 1 false positive), it correctly classify the polar bears in the two images while the mixture model trained with EM fails to do so. CHAPTER 5. RESULTS 86 bear bird cheetah cloud Figure 5.11: ROC plots measuring the classification performance on the train-ing set image regions from the Corel image dataset of both our proposed model and training algorithm (solid blue line) and the Gaussian mixture model trained with E M (dashed red line), aver-aged over 10 runs. CHAPTER 5. RESULTS 87 bird cheetah 0 ! fox 1 0 grass 1 0 mountain 1 0 polarbear 1 Figure 5.12: ROC plots measuring the classification performance on the test set image regions from the Corel image dataset of both our proposed model and training algorithm (solid blue line) and the Gaussian mixture model trained with EM (dashed red line), averaged over 10 runs. CHAPTER 5. RESULTS 88 sky sky (a) Performance on training data (b) Performance on test data Figure 5.13: ROC plots for the annotation 'sky'. The average performance of our proposed approach is visualized by the solid blue line, that of the EM mixture algorithm by the dashed red line. The error bars represent the standard deviation across 10 runs. It is clear from the plots that our proposed algorithm is more reliable and stable. (a) Original image (b) Sparse kernel classifier (c) Translation mixture trained with MCMC model trained with E M Figure 5.14: Polar bear recognition results on two example images obtained with our constrained semi-supervised approach trained with MCMC and the mixture model trained with EM. The approach presented in this paper yields better results than the EM mixture model. CHAPTER 5. RESULTS 89 In a second large set of runs we investigate, whether estimates of the fraction of positive image regions in an image can be used to learn better performing classifiers. The estimates of this fraction for a given image is computed as the number of regions divided by the number of annotation words as motivated and described in Section 3.3.4. We use this information in addition to the constraint information provided by the group labels C to train the hybrid model proposed in Section 3.3.5. We do not provide explicit results for the pure measurement model as described in Section 3.3.4 here. In our tests, we found the performance to be basically identical to the hybrid model for high values of the confidence parameter \ - In general, there is no benefit to ignoring the additional informa-tion provided by the constraints C and we therefore only use the hybrid model in our experiments. We train this model in combination with the sigmoidal kernel by again gen-erating 10 000 samples from the posterior after a burn in period of 10 000 steps using our MCMC sampler. Classifiers are trained for the same object classes as in Figure 5.11 and for varying values of the confidence parameter X - Table 5.2 summarizes the performance of the learnt classifiers on the test set image re-gions. For most object classes, we observe that the classifiers benefit from the additional information and produce slightly better classification results as x in-creases. However, the increase is not as significant as we hoped for and in some cases the performance even deteriorates (with the 'wolf class for example). We repeated the same set of experiments, replacing the sigmoidal kernel with a Gaussian one. The results are shown in Table 5.3. Here, the additional information does result in a drastic improvement in classification performance for some categories, in particular for the object classes 'tracks' and 'plains', 'tiger' and 'sand'. While the improvements are larger than in the case of the sigmoidal kernel, the overall performance is still inferior to the one achieved with the sigmoidal kernel. This mirrors our experiences in the synthetic data set in Figure 5.8. Next, we experiment with re-balancing the data set as described in Sec-tion 4.4 to address the high degree of imbalance in our data as well as to speed CHAPTER 5. RESULTS 90 Object Class x = o x = io x = ioo X = 1000 bear 0.69 (0.022) 0.71 (0.015) 0.66 (0.023) 0.70 (0.014) bird 0.93 (0.045) 0.93 (0.032) 0.92 (0.018) 0.92 (0.015) cheetah 0.92 (0.013) 0.93 (0.014) 0.93 (0.005) 0.93 (0.003) cloud 0.80 (0.011) 0.79 (0.013) 0.82 (0.014) 0.80 (0.008) fox 0.63 (0.032) 0.57 (0.050) 0.72 (0.008) 0.72 (0.006) grass 0.93 (0.001) 0.93 (0.003) 0.93 (0.001) 0.93 (0.001) mountain 0.54 (0.026) 0.43 (0.015) 0.60 (0.036) 0.62 (0.021) polarbear 0.97 (0.010) 0.98 (0.010) 0.98 (0.000) 0.98 (0.000) road 0.76 (0.010) 0.77 (0.024) 0.85 (0.018) 0.86 (0.016) sand 0.69 (0.019) 0.68 (0.017) 0.70 (0.018) 0.69 (0.020) sky 0.74 (0.001) 0.74 (0.005) 0.76 (0.008) 0.76 (0.008) snow 0.80 (0.005) 0.78 (0.004) 0.81 (0.000) 0.81 (0.001) space 0.66 (0.198) 0.72 (0.081) 0.76 (0.025) 0.76 (0.000) tiger 0.75 (0.011) 0.71 (0.010) 0.77 (0.005) 0.76 (0.006) tracks 0.87 (0.004) 0.88 (0.011) 0.90 (0.008) 0.91 (0.004) train 0.77 (0.017) 0.77 (0.013) 0.78 (0.007) 0.78 (0.005) trees 0.76 (0.002) 0.76 (0.003) 0.76 (0.003) 0.78 (0.003) water 0.73 (0.002) 0.72 (0.002) 0.71 (0.001) 0.70 (0.001) wolf 0.72 (0.014) 0.69 (0.034) 0.68 (0.011) 0.67 (0.009) zebra 0.93 (0.010) 0.93 (0.010) 0.91 (0.013) 0.92 (0.005) Table 5.2: The effect of the additional information provided by our estimates of the fractions of positive regions per image on the classification performance on the test set. We trained our model with varying confidence values X- F ° r x — 0 t n e additional information is not used at all. The values given are the AUC value, averaged over 10 runs, and the standard deviation in parentheses. We can observe a slight increase of performance overall for increasing values of x- That is, the additional information does help, although not much. CHAPTER 5. RESULTS 91 Object Class x = o x = io x = ioo X = 1000 bear 0.47 (0.127) 0.42 (0.051) 0.47 (0.024) 0.52 (0.041) bird 0.58 (0.050) 0.58 (0.046) 0.64 (0.039) 0.65 (0.034) cheetah 0.86 (0.025) 0.83 (0.041) 0.89 (0.015) 0.91 (0.005) cloud 0.79 (0.017) 0.80 (0.017) 0.78 (0.015) 0.78 (0.012) fox 0.69 (0.033) 0.54 (0.079) 0.71 (0.008) 0.73 (0.008) grass 0.91 (0.006) 0.90 (0.003) 0.92 (0.003) 0.92 (0.002) mountain 0.18 (0.053) 0.16 (0.029) 0.20 (0.031) 0.16 (0.024) polarbear 0.88 (0.063) 0.83 (0.056) 0.90 (0.013) 0.91 (0.002) road 0.81 (0.056) 0.84 (0.048) 0.79 (0.023) 0.73 (0.019) sand 0.60 (0.052) 0.61 (0.047) 0.67 (0.028) 0.67 (0.022) sky 0.83 (0.008) 0.82 (0.006) 0.80 (0.002) 0.80 (0.002) snow 0.80 (0.012) 0.80 (0.005) 0.81 (0.003) 0.82 (0.002) space 0.74 (0.112) 0.79 (0.044) 0.75 (0.029) 0.73 (0.005) tiger 0.75 (0.023) 0.75 (0.027) 0.80 (0.014) 0.80 (0.006) tracks 0.77 (0.122) 0.73 (0.141) 0.87 (0.046) 0.90 (0.023) train 0.58 (0.050) 0.68 (0.033) 0.76 (0.012) 0.77 (0.009) trees 0.71 (0.014) 0.70 (0.019) 0.71 (0.005) 0.72 (0.005) water 0.78 (0.003) 0.76 (0.006) 0.76 (0.003) 0.76 (0.005) wolf 0.42 (0.145) 0.37 (0.156) 0.59 (0.051) 0.63 (0.025) zebra 0.75 (0.076) 0.61 (0.076) 0.75 (0.055) 0.76 (0.039) Table 5.3: The same experimental setup as in Table 5.2, except that a Gaus-sian kernel was used. The additional information provided by the estimated fractions of positives seems to be of larger benefit to mod-els based on the Gaussian kernel. The classification performance (AUC) improves significantly for several classes as the confidence x is increased. At the same time the variance between runs (standard deviation given in parentheses) is also strongly decreased. CHAPTER 5. RESULTS 92 up the computation. We choose to perform under-sampling of negatives on the level of complete images (as opposed to sub-sampling of individual image regions). That is, we discard some of the images, which do not carry the an-notation of interest. The two different under-sampling techniques discussed in Section 4.4 are applied to reduce the number of negative images to varying de-grees. The first technique simply performs random sub-sampling of the negative images. The second approach strives to select maximally informative negative example images by selecting images that share annotations with positive train-ing images. We use these sub-sampling techniques to vary the percentage of positive bags within the data set from the original distribution up to an about equal number of positive and negative bags. We select a subset of the object classes, which are represented in only a small fraction of the training set images (in about 5% on average). We then train the hybrid model with the sigmoid kernel and a confidence value of x = 500 on the resulting data sets. The classification performance of the classifiers trained on the reduced train-ing data sets are presented in Figure 5.15. In the plots, the amount of under-sampling performed increases from the left to the right. The leftmost point in all plots corresponds to the natural class distribution, i.e. training with the full training data set without any under-sampling. For 3 of the 6 object classes, namely 'fox', 'sand' and 'train', increasing amounts of under-sampling actually lead to a significant increase in classification performance on the test set. For the 'wolf and 'zebra' categories, there is no such noticeable increase in perfor-mance but no significant decrease either as the training set size is reduced. And in the case of the 'cheetah' classifier, the performance does in fact decrease for large amounts of under-sampling. The proposed selective under-sampling tech-nique leads to significant better performance than the random sub-sampling for the 'cheetah' and 'sand' annotations and provides similar performance for the remaining object classes. Overall, the benefit of using sub-sampling in general and the proposed selective sub-sampling strategy in particular should be obvi-ous from these graphs, especially considering that the reduced size of the data sets leads to drastically reduced runtime for the training algorithm. Using the CHAPTER 5. RESULTS 93 0.95 0.94 0.93 0.92 0.91 0.9-0.89 0.78 0.76 0.74 0.72 0.7 0.68, (a) Cheetah (c) Sand 0.2 0.3 (e) Wolf 0.79 0.78 0.77 0.76 0.75 0.74 0.73 0.72 0.71 0.85 0.84 0.83 0.82 0.81 0.8 0.79 0.78, 0.94r 0.93 0.92 0.91 0.9 0.89 0.88 0.87 0.86 0.1 0.2 0.3 0.4 0.5 (b) Fox (d) Train 0.2 0.3 (f) Zebra Figure 5.15: These graphs show the classification performance in terms of AUC on the test set for sigmoid hybrid models trained on sub-sampled data sets. The x-axis represents the fraction of positive images in the training set, resulting from increasing under-sampling of the negative images. Red squares correspond to data sets created using the simple random sub-sampling procedure, blue stars stand for our selective under-sampling technique described in Section 4.4. CHAPTER 5. RESULTS 94 same amount of samples, the difference between the left and the right end of the graphs in Figure 5.15 in terms of runtime are one order of magnitude (on average around 40 seconds compared to about 400 seconds). Furthermore, the reduced size of the training data set leads to a lower dimensional parameter space as well. Therefore, a smaller number of samples and a shorter burn-in period would likely suffice to provide the same coverage of the parameter space, resulting in an even larger relative speedup. CHAPTER 6. DISCUSSION AND CONCLUSION 95 CHAPTER 6 DISCUSSION AND CONCLUSION In this thesis, we presented a novel discriminative approach to the problem of learning object recognition models from annotated images. For each object class (corresponding to an annotation word in the training images), an inde-pendent probabilistic object classifier is trained, which for a test image region predicts the probability of the regions showing an object belonging to the class of interest. The learnt classifiers thus allow for detection and localization of objects of a certain class in an image. The proposed approach further provides a general solution for data association and multiple instance problems. Instead of following an optimization approach, we opted for a fully Bayesian solution, which accounts for the uncertainty in the parameters and in the unknown indi-vidual labels. It not only provides us with a probabilistic classifier but the full posterior distribution of our model parameters. The kernel based probabilistic model used in our approach is extremely flexible and allows complex decision boundaries to be constructed. At the same time, thanks to the use of appro-priate hierarchical priors, overfitting is avoided and a high degree of sparsity is achieved, leading to good generalization performance as well as computational efficiency. An efficient Markov Chain Monte Carlo based training approach was presented, which allows for easy incorporation of the constraints on the labels in the multiple instance scenario. We demonstrated the performance of our approach on both synthetic data sets and on the task of learning object classifiers from a set of annotated images. CHAPTER 6. DISCUSSION AND CONCLUSION 96 The presented results showed significantly improved classification performance in comparison to previously published work [13, 15, 23]. Further, a new problem formulation was introduced, in which a classifier for individual binary labels is to be learnt from estimates of the fractions of positive labels in groups of data points. We extended our probabilistic model and learning technique to learn probabilistic classifiers from this type of data. Results on synthetic data sets and on the image data set demonstrated, that our proposed approach manages to benefit from the additional information in this setting in that it produced better performing classifiers than those learnt from multiple instance information alone. To our knowledge, this type of problem has not been previously addressed in the machine learning community. We expect problems of this kind, to which our proposed approach could be applied, to also arise in other fields, such as physics, biology and chemistry. We also experimented with techniques for sub-sampling the set of available annotated training images in the object recognition problem. The resulting reduction in training set size leads to a dramatic speedup of the training pro-cess and makes it feasible to attack the task of learning from very large image databases. At the same time, the problem of strongly biased data sets is ad-dressed by under-sampling only the negative example images. Our experiments showed, that using these techniques, a large speedup without a significant loss in classifier performance can be achieved. In fact, in several cases, a performance increase was observed in our experiments. We proposed a novel technique for performing selective under-sampling, in which the additional information pro-vided by the image annotations is exploited to select a subset of informative negative example images. In our experiments, this approach proved to be supe-rior to simple uniform sub-sampling of the negative example images. While the presented results are very encouraging, there remains room for further improvements. Our probabilistic model shares the problem of poor scalability with all kernel based approaches. While our model and training technique are sparse, in our current implementation, the full kernel matrix \I> is precomputed initially and used throughout the training algorithm. This leads CHAPTER 6. DISCUSSION AND CONCLUSION 97 to 0(N2) computational complexity and storage requirements. In the case of Gaussian kernels this could be avoided by instead performing the kernel evaluations using the extremely efficient Improved Fast Gauss Transform by Yang et al. [74, 75] or the dual tree method of Gray and Moore [35] for general types of kernels. Alternatively, one could improve efficiency for large datasets using the recursive MCMC technique of Andrieu and Doucet [6]. While in our current approach, the kernel type and kernel width parameter have to be chosen by the user, we would like to perform this model selection automatically within the training algorithm. Ideally, feature weights should be optimized at the same time as well. The most promising way of increasing the efficiency of our MCMC sampler would be to improve on the sampler for the high dimensional binary kernel selec-tion vector 7. While the Metropolis-Hastings-within-Gibbs sampler employed in our current approach is fairly efficient, we believe that there is still much room for further improvement. In particular, the design of efficient proposals for high dimensional moves in this discrete space seems like a promising direction for future work. The sampling based approach to the computation of the posterior distribu-tion taken in this thesis has the benefit, that it allows for easy incorporation of the constraints, which arise from the multiple instance information. However, in the future, we would also like to explore variational methods for approximating the posterior distribution. In particular, we hope that the measurement model described in Section 3.3.4 would allow such methods to be more easily employed, since this model avoids the hard constraints, which lead to difficult non-convex optimization problems. With respect to the object recognition application, one obvious area for improvement is the use of more sophisticated image features and improved image segmentation techniques. Our current approach furthermore makes fairly strong implicit assumptions about the quality of the labels. While databases with very good annotations exist, there will still be some amount of noise in the annotations and it could prove beneficial to explicitly take this uncertainty in CHAPTER 6. DISCUSSION AND CONCLUSION 98 the provided image annotations into account. Furthermore we would like to find a principled way of combining the output of multiple classifiers as well as relax our independence assumptions to make it possible to exploit contextual information. To achieve this, the contextual recognition methods of Murphy et al. [58] could be modified and extended to our semi-supervised setting. An interesting variation of the approach presented here is to learn classifiers for local image features (such as those proposed in [49]) instead of image regions. The model and training approach presented in this thesis have already been employed in this scenario by Carbonetto et al. in [16] with excellent results. BIBLIOGRAPHY 99 BIBLIOGRAPHY [1] E L Allwein, R E Schapire, and Y Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. In Proc. 17th International Conf. on Machine Learning, pages 9-16. Morgan Kaufmann, San Francisco, CA, 2000. [2] S Andrews and T Hofmann. Disjunctive programming boosting. In Ad-vances in Neural Information Processing Systems. MIT Press, 2003. [3] S Andrews, I Tsochantaridis, and T Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems. MIT Press, 2002. [4] C Andrieu, N de Freitas, A Doucet, and M I Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003. [5] C Andrieu and A Doucet. Joint Bayesian model selection and estimation of noisy sinusoids via reversible jump MCMC. IEEE Trans. Signal Processing, 47:2667-2676, 1999. [6] C Andrieu and A Doucet. Online expectation-maximization type algo-rithms for parameter estimation in general state space models. In IEEE ICASSP, 2003. [7] K Barnard, P Duygulu, N de Freitas, D Forsyth, D Blei, and M I Jor-dan. Matching words and pictures. Journal of Machine Learning Research, 3:1107-1135, 2002. BIBLIOGRAPHY 100 [8] J M Bernardo and A F M Smith. Bayesian Theory. Wiley Series in Applied Probability and Statistics, 1994. [9] C M Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995. [10] D Blei and M Jordan. Modeling annotated data. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 127-134. ACM Press, 2003. [11] A Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145-1159, 1997. [12] E Brochu. milq. Master's thesis, University of British Columbia, 2004. [13] P Carbonetto. Unsupervised statistical models for general object recogni-tion. Master's thesis, University of British Columbia, 2003. [14] P Carbonetto and N de Freitas. Why can't Jose read? The problem of learning semantic associations in a robot environment. In Human Language Technology Conference Workshop on Learning Word Meaning from Non-Linguistic Data, 2003. [15] P Carbonetto, N de Freitas, P Gustafson, and N Thompson. Bayesian fea-ture weighting for unsupervised learning, with application to object recog-nition. In AI-STATS, Florida, USA, January 2003. [16] P Carbonetto, G Dorko, and C Schmid. Bayesian learning for weakly supervised object classificatio. Technical report, INRIA Rhone-Alpes, 2004. [17] G Celeux, M Hum, and C P Robert. Computational and inferential dif-ficulties with mixture posterior distributions. Journal of the American Statistical Association, 95:957-970, 2000. [18] N Chawla, K Bowyer, L Hall, and W P Kegelmeyer. SMOTE: synthetic minority over-sampling technique. In Proceedings of the International Con-ference on Knowledge Based Computer Systems, 2000. BIBLIOGRAPHY 101 [19] Y Chen and J Z Wang. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research, 5:913-939, 2004. [20] H Chipman, EI George, and R E McCulloch. The practical implementation of Bayesian model selection. Model Selection, IMS Lecture Notes, 38:67-116, 2001. [21] T G Dietterich, R H Lathrop, and T Lozano-Perez. Solving the multi-ple instance learning with axis-parallel rectangles. Artificial Intelligence, 89(1/2):31-71, 1997. [22] C Drummond and R C Holte. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. Workshop on Learning from Imbalanced Datasets II held in conjunction with ICML'2003, 2003. [23] P Duygulu, K Barnard, N de Freitas, and D Forsyth. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In European Conference for Computer Vision (ECCV), pages 97-112, 2002. [24] C Elkan. The foundations of cost-sensitive learning. In IJCAI, pages 973-978, 2001. [25] T Fawcett. ROC graphs: Notes and practical considerations for researchers. Technical Report HPL-2003-4, HP Labs, 2003. [26] S L Feng, R Manmatha, and V Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the International Conference on Pattern Recognition (CVPR 2004), 2004. [27] M Figueiredo. Adaptive sparseness using Jeffreys prior. In Advances in Neural Information Processing Systems 14, Cambridge, MA, 2001. MIT Press. [28] Y Freund and R E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997. BIBLIOGRAPHY 102 [29] T Gartner, P A Flach, A Kowalczyk, and A J Smola. Multi-instance kernels. In Proceedings of the 19th International Conference on Machine Learning, pages 179-186, 2002. [30] A Gelman, J B Carlin, H S Stern, and D B Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 1995. [31] S Geman and D Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell, 6(6):721-741, Nov. 1984. [32] J E Gentle. Random Number Generation and Monte Carlo Methods. Springer, 2003. [33] J Geweke. Efficient simulation from the multivariate normal and Student t-distributions subject to linear constraints. In Proceedings of 23rd Symp. Interface, pages 571-577, 1991. [34] W R Gilks, S Richardson, and D J Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, 1995. [35] A Gray and A Moore. N-body problems in statistical learning. In Advances in Neural Information Processing Systems. MIT Press, 2001. [36] C Guihenneuc-Jouyaux, K L Mengersen, and C P Robert. MCMC con-vergence diagnostics: A "reviewww". Papers 9816, Institut National de la Statistique et des Etudes Economiques-, 1998. [37] W K Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97-109, 1970. [38] A E Hoerl and R W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 3(12):55-67, 1970. [39] N Japkowicz and S Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis Journal, 6, 2002. BIBLIOGRAPHY 103 [40] J Jeon, V Lavrenko, and R Manmatha. Automatic image annotation and retrieval using cross-media relevance models. In SIGIR 03 Conference, pages 119-126, 2003. [41] M I Jordan. Graphical models. Statistical Science (Special Issue on Bayesian Statistics), 19:140-155, 2004. [42] R Kohn, M Smith, and D Chan. Nonparametric regression using linear com-binations of basis functions. Model Selection, IMS Lecture Notes, 11:313-322, 2001. [43] M Kubat and S Matwin. Addressing the curse of imbalanced training sets: one-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179-186. Morgan Kaufmann, 1997. [44] H Kiick, P Carbonetto, and N de Freitas. A constrained semi-supervised learning approach to data association. In European Conference for Com-puter Vision (ECCV), 2004. [45] S Kumar and M Hebert. Discriminative fields for modeling spatial depen-dencies in natural images. In NIPS, 2003. [46] V Lavrenko, R Manmatha, and J Jeon. A model for learning the semantics of pictures. In Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems, 2003. [47] J S Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. [48] J S Liu, W H Wong, and A Kong. Covariance structure of the Gibbs sam-pler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81:27-40, 1994. [49] D Lowe. Object recognition from local scale-invariant features. In Inter-national Conference on Computer Vision, Corfu, Greece, pages 1150-1157, 1999. BIBLIOGRAPHY 104 [50] O Maron. Learning from Ambiguity. PhD thesis, Massachusetts Institute of Technology, 1998. [51] O Maron and T Lozano-Perez. A framework for multiple instance learning. In Advances in Neural Information Processing Systems, pages 570-576, 1997. [52] O Maron and A L Ratan. Multiple-instance learning for natural scene classification. In Proceedings of the 15th International Conf. on Machine Learning, pages 341-349. Morgan Kaufmann, 1998. [53] M Matsumoto and T Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans, on Modeling and Computer Simulation, 8:3-30, 1998. [54] D McFadden. A method of simulated moments for estimation of discrete re-sponse models without numerical integration. Econometrica, 57:995-1026, 1989. [55] F Monay and D Gatica Perez. On image auto-annotation with latent space models. In Proceedings of the eleventh ACM international conference on Multimedia, pages 275-278. ACM Press, 2003. [56] Y Mori, H Takahashi, and R Oka. Image-to-word transformation based on dividing and vector quantizing images with words. In MISRM'99 First International Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999. [57] P Miiller. Alternatives to the Gibbs sampling scheme. Technical Report 92-14, Institute of Statistics and Decision Sciences, Duke University, 1993. [58] K Murphy, A Torralba, and W Freeman. Using the forest to see the trees: A graphical model relating features, objects and scenes. In NIPS, 2003. [59] J Y Pan, H J Yang, P Duygulu, and C Faloutsos. Automatic image cap-tioning. In Proceedings of the 2004 IEEE International Conference on Mul-timedia and Expo (ICME 2004), 2004. BIBLIOGRAPHY 105 [60] F Provost, T Fawcett, and R Kohavi. The case against accuracy estima-tion for comparing classifiers. In Proceedings of the Fifteenth International Conference on Machine Learning, San Francisco, CA, 1998. Morgan Kauf-mann. [61] J. Ramon and L. De Raedt. Multi instance neural networks. In Proceedings of IMCL-2000 workshop on Attribute- Value and Relational Learning, 2000. [62] C P Robert and G Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 1999. [63] R E Schapire. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. [64] J Shi and J Malik. Normalized cuts and image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 731-737, 1997. [65] M Stephens. Bayesian Methods for Mixtures of Normal Distributions. PhD thesis, Department of Statistics, Oxford University, England, 1997. [66] S S Tham. Markov chain Monte Carlo for sparse Bayesian regression and classification. Master's thesis, University of Melbourne, 2002. [67] S S Tham, A Doucet, and R Kotagiri. Sparse Bayesian learning for regres-sion and classification using Markov chain Monte Carlo. In International Conference on Machine Learning, pages 634-641, 2002. [68] M E Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001. [69] V N Vapnik. The Nature of Statistical Learning Theory. Springer, 1995. [70] P Viola and M Jones. Robust real-time object detection. International Journal of Computer Vision, 2001. BIBLIOGRAPHY 106 [71] L von Ahn and L Dabbish. Labeling images with a computer game. In Proceedings of the 2004 conference on Human factors in computing systems, pages 319-326. ACM Press, 2004. [72] G M Weiss and F Provost. The effect of class distribution on classifier learning: an empirical study. Technical Report ML-TR-44, Department of Computer Science, Rutgers University, New Jersey, 2001. [73] G M Weiss and F Provost. Learning when training data axe costly: The ef-fect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315-354, 2003. [74] C Yang, R Duraiswami, and L Davis. Efficient kernel machines using the improved fast gauss transform. In NIPS, 2004. [75] C Yang, R Duraiswami, N A Gumerov, and L Davis. Improved fast gauss transform and efficient kernel density estimation. In ICCV, 2003. [76] B Zadrozny and C Elkan. Transforming classifier scores into accurate mul-ticlass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 694-699. ACM Press, 2002. [77] Q Zhang and S A Goldman. EM-DD: An improved multiple-instance learn-ing technique. In Advances in Neural Information Processing Systems, 2002.