Comparative Study of Kernel Based Classification and Feature Selection Methods With Gene Expression Data by Mingyue Tan B.Sc, The University of British Columbia, 2003 - A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Mas te r of Science in THE FACULTY OF GRADUATE STUDIES (Computer Science) The University of British Columbia April 2006 © Mingyue Tan, 2006 Abstract Gene expression profiles obtained by high-throughput techniques such as microarray provide a snapshot of expression values of up to ten thousands genes in a particular tissue sample. Analyzing such gene expression data can be quite cumbersome as the sample size is small, the dimensionality is high, and the data are occasionally noisy. Kernel methods such as Support Vector Machines (SVMs) [5, 45] have been extensively applied within the field of gene expression analysis, and particularly to the problems of gene classification and selection. In general, kernel methods outper-form other approaches due to their ability to handle high dimensionality easily. In this thesis, we perform a comparative study of various state-of-the-art kernel based classification and feature selection methods with gene expression data. It is our aim to have all the results together in one place so that people can easily see their similarities and differences both theoretically and empirically. In the literature, a feature selection method is evaluated by the classification accuracies using the features selected by the method. This evaluation criterion measures the classification capabilities of the data after the elimination of irrelevant features. We propose another criterion, called stability, to evaluate the feature selection methods in addition to classification accuracies. The feature set selected by a stable feature selection algorithm should not change significantly when some small changes are made to the training data. In this thesis, we use both of two evaluation criteria to compare feature selection methods. It has been showed that cross validation technique can be used to improve feature selection methods in terms of classification accuracies [8]. In this thesis, we extend one existing feature selection method which utilizes Gaussian Processes (GP) [47] with Automatic Relevance Determination (ARD) [28, 34], and cross validation, and propose a new feature selection method. Experiments on real gene expression data sets show that our method outperforms all other feature selection methods in terms of classification accuracies, and achieves comparable stability as Sparse Multi-nomial Logistic Regression (SMLR) [23], the most stable feature selection method in the literature. i i Contents Abstract i i Contents i i i List of Tables v i List of Figures ix Acknowledgements x i i 1 Introduction 1 1.1 Gene Expression Data 1 1.2 Motivation and Challenges 3 1.3 Problem Statement 4 1.4 Contribution 6 1.5 Thesis Outline 7 2 Related Works 8 2.1 Kernel Based Classifiers 8 2.1.1 Kernel 8 2.1.2 Relevant Kernel Methods 9 2.2 Feature Selection Methods 10 2.2.1 Feature Construction v.s. Feature Selection 10 iii 2.2.2 Methods for Feature Subset Selection 11 2.3 Cross Validation 13 3 Evaluating Relevant Kernel Classification Algorithms 15 3.1 Support Vector Machines 15 3.1.1 Maximal margin classifier 16 3.1.2 Linear SVM 17 3.1.3 Nonlinear SVM 19 3.2 Distance Weighted Discrimination 22 3.3 Bayesian Kernel Classifiers 23 3.3.1 Gaussian Processes Classifiers 23 3.3.2 Relevance Vector Machines and Other Sparse Classifiers . . . 27 3.4 Experimental Results 28 3.4.1 Diagnosis Accuracy Comparison 29 3.4.2 ROC Curve Analysis 31 3.4.3 CPU Cost Comparison 35 4 Comparing Feature Selection Methods for Kernel Machines 37 4.1 Data Sets and Preprocessing 37 4.2 Recursive Feature Elimination ' 38 4.3 GPC with ARD 43 4.4 RVM and SMLR on Feature Space 49 5 Cross Validation Optimization 52 5.1 Stability of Feature Selection Methods 53 5.2 Multiple SVM-RFE 56 5.3 Forward Feature Selection with Gaussian Processes 57 5.4 GP-ARD-RFE 58 iv 5.5 Experimental Results 63 5.5.1 Comparison Using Stability 63 5.5.2 Comparison Using Classification Accuracy 64 6 Conclusion and Future Work 71 Bibliography 73 v List of Tables 3.1 Gene expression data used for classification 28 3.2 ROC confusion matrix 31 3.3 10-fold cross-validation accuracy without feature selection 36 4.1 Gene expression data used for feature selection 38 4.2 Evaluation of the feature ranks given by SVM-RFE—optimal subsets selected using test accuracies. An optimal set is a subset of genes with which the classifier SVM achieves the highest test accuracy. The smallest optimal set is called the biomarker set. Test accuracy for the optimal subsets is the classification accuracy of SVM on test data using genes in the optimal subsets only 41 4.3 Evaluation of the feature ranks given by SVM-RFE—optimal subsets selected using LOO accuracies on the training data. An optimal set is a subset of genes with which the classifier SVM achieves the highest LOO accuracy over the training samples. The smallest optimal set is called.the biomarker set. LOO accuracy for the optimal subsets is the LOO accuracy on the training set using genes in the optimal subsets only 41 4.4 Evaluation of the feature ranks given by GPC-ARD—optimal subsets selected using test accuracies. Compare to Table 4.2 47 vi 4.5 Evaluation of the feature ranks given by GPC-ARD—optimal subsets selected using LOO accuracies on the training data. Compare to Table 4.3 47 4.6 Test Accuracies of SVM with biomarker sets selected by various fea-ture selection methods. The numbers (e.g. 76.7%) below the data sets (e.g. Colon) are the test accuracies using all genes without any feature selection or feature weighting. The training and test data for the Leukemia data are predefined. For this data set, test accuracy with any biomarker set is higher than the accuracy without feature selection. The failure of feature selection methods on the other data sets might be due to the unfortunate partition of the training and test data 51 4.7 Total number of errors in 10 experiments and the average size of biomarker sets. The number (e.g. N = 62) below the data sets (e.g. Colon) is the total number of validation cases in the 10-fold cross validation. No feature selection method outperforms all other methods in terms of classification accuracies on both data sets. . . . 51 5.1 Features selected by RVM for the Colon data set 53 5.2 Number of occurrences of features selected by RVM for the Colon data set 55 5.3 Stability scores of RVM and SMLR on the Colon and Leukemia data sets 55 vii 5.4 Ineffectiveness of Wilcoxon rank sum (a preprocessing step used by GP-ARD-FS) in eliminating irrelevant features. LOO Accuracies us-ing different subsets of genes: (.1) "All genes"; (2) top 373 genes selected by Wilcoxon rank sum test with p — 0.01; (3) top 1024 genes selected by GP-ARD-RFE; (4) top 512 genes selected by GP-ARD-RFE. LOO accuracy with the top 373 genes is not significantly higher than the accuracy without feature selection, and is even lower than the accuracy with the top 1024 genes selected by GP-ARD-RFE. . . 62 5.5 Stability scores of various feature selection methods on the Colon and Leukemia data sets. SMLR and GP-ARD-RFE are consistently more stable than other methods 64 5.6 Leave-One-Fold-Out accuracies using biomarkers selected by various feature selection methods. The number in brackets are the total num-ber of errors in 10 folds. Part of the results are included in Table 4.7. GP-ARD-RFE achieves the highest accuracies on both data sets. . 65 vm List of Figures 1.1 Illustration of informative genes. G l and G2 are informative genes as they are consistent within each phenotype and discriminative between two phenotypes. G3 and G4 are non-informative since they do not show difference between phenotypes [13] 4 3.1 Two dimensional linearly separable data. Both classifiers correctly classify the data, but the margin in (a) is bigger than the one in (b). The classifier in (a) is the maximal margin classifier of this data set. 16 3.2 Linear SVM—the simplest case. HI and H2 are hyperplanes, and X I and X2 are support vectors ' 17 3.3 Non-linear SVM. Data on the left are not linearly separable in the two dimensional space. Via some transformation, say $, data points in 2D can be transformed to 3D feature space where they become linearly separable 19 3.4 Overfitting problem of hard margin with noisy data. Suppose the data in (a) are correct. If we change the class label of one data point (e.g. the one marked by the red square), we get the separating hyperplane as shown in (b) with hard margin. The hyperplane in (b) overfits the noisy data, and has a poor generalization ability. . . . . 21 ix 3.5 Graphical model for GPC with n training data points and one test data point. X\:n and Y\:n are observed. Given x*, we wish to predict y*. fi is a latent variable associated with x,, and yj is obtained by transforming fi. Given / , x and y are independent, fi and /* are joint Gaussian, and the Gaussian process prior is parameterized by 6 [25] 24 3.6 10-fold cross-validation accuracies of various classifiers without fea-ture selection 30 3.7 ROC curves of the six classifiers for the Colon cancer data 33 3.8 AUC values of various classifiers for the Leukemia (1) and Lung cancer data (2) 34 3.9 Graphical summary of CPU costs of various classifiers 35 4.1 SVM-RFE Algorithm [20] 39 4.2 Results of forward selection: accuracies of SVM with various number of genes ranked by SVM-RFE. The X-axis represents the number of genes, and the Y-axis represents accuracies (test accuracies and LOO accuracies on training data). The X-axes in (a), (c), and (e) are in actual scales, while the X-axes in (b), (d), and (f) are in logarithmic scales 42 4.3 Distributions of ARD values for the Colon data set 44 4.4 Distributions of ARD values for the Leukemia data set 45 4.5 Distributions of ARD values for the Lung cancer data set 46 x 4.6 Results of forward selection: accuracies of S V M with various number of genes ranked by A R D values. The X-axis represents the number of genes, and the Y-axis represents accuracies (test accuracies and L O O accuracies on training data). GPC-Laplace is used to get the A R D values for (a), (c), and (e). A R D values from G P C - E P are used to get (b) and (d) 48 5.1 Algorithm of M S V M - R F E [14] ' 56 5.2 Illustration of G P - A R D - F S algorithm: a running example 59 5.3 . Algorithm of G P - A R D - F S [20] . . . 66 5.4 Illustration of feature elimination of G P - A R D - R F E algorithm: a run-ning example (Part 1) 67 5.5 Illustration of feature elimination of G P - A R D - R F E algorithm: a run-ning example (Part 2) 68 5.6 Illustration of G P - A R D - R F E algorithm: the steps after feature elim-ination 69 5.7 Algorithm of G P - A R D - R F E 70 xi Acknowledgements I would like to thank all the people who gave me help and support throughout my degree. ' First of all, I would like to thank my supervisors, Professor Raymond Ng and Professor Nando de Freitas, for their constant encouragement and rewarding guidance throughout this thesis work. Raymond introduced me to data mining research in his "Honours Hour" section of his undergraduate course. I would not come to graduate school without his inspiration in his "Honours Hour". During my Master studies, he gave me a wonderful introduction to the field of bioinformatics using data mining and machine learning approaches. I am grateful to Nando for giving me the Bayesian education and directing me to the research in kernel methods. I am amazed by Nando's broad knowledge in computer science and the way he made the learning process full of fun. Secondly, I would like to thank Professor Kevin Murphy for dedicating his time and effort in reviewing my thesis. Thirdly, I would like to extend my appreciation to my colleagues for their friendship and help, and especially to the following people: Timothy Chan, Jun Wang, Yuhan Cai, Lin Zhong, Suling Yang, and Yizheng Cai. Last but certainly not least, I would like to thank my parents for their endless love and support. MINGYUE TAN The University of British Columbia April 2006 xii Chapter 1 Introduction Gene expression, also known as protein expression or simply expression, is the pro-cess by which a gene's coded information is converted into a final gene product (i.e. a protein or any of several types of RNA). Gene expression is a multi-step process that begins with transcription and translation and is followed by folding, post-translational modification of the protein product and targeting. Particularly, the process by which the genes encoded by the human genome are expressed as proteins involves two steps: DNA sequences are initially transcriped into mRNA sequences which in turn are translated into the amino acid sequences of the proteins that perform various cellular functions. Changes in cellular protein synthesis are often estimated by measuring mRNA levels. Measuring mRNA levels can provide a detailed molecular view of particular genes expressed in different cell types under different conditions. Expression of genes can be assessed with DNA microarray or SAGE (Serial analysis of gene expression) among several other techniques. 1.1 Gene Expression Data Microarray techniques enable simultaneous measurements of the expression levels of tens of thousands of genes. These measurements are made by quantitating the 1 hybridization of celluar mRNA to an array of defined cDNA or oligonucleotide probes immobilized on a solid surface, such as glass, plastic or silicon chip. Each data point produced by a DNA microarray hybridization experiment represents the ratio of expression levels of a particular gene under two different experimental conditions. Typically, the numerator of each ratio is the expression level of the gene in the condition of interest, which the denominator is the expression level of the gene in the reference state of the cell. The general goal of SAGE is similar to the DNA microarray. However, SAGE is a sequence-based sampling technique, where observations are not based on hy-bridization which result in more qualitative, analog values. Instead, SAGE obtains a quantitative profile of cellular gene expression. It measures the absolute count of the mRNA in each sample. Gene expression data can be represented in matrix form, where columns correspond to genes and rows correspond to tissue samples. In many gene expression data, the samples can be divided into several subgroups or classes, such as tumor tissues and normal tissues. Each subgroup is called a phenotype. In this thesis, we focus on the binary case where samples come from two groups only. The typically characteristics of gene expression data are high dimensionality and low sample size. The sample is small for the following reasons. First, it is very expensive to produce high-throughput gene expression data. It costs about $1,000 to produce a single microarray. Hence, a typical gene expression data with 50 samples (i.e. 50 microarrays) costs $50,000. This is only the technical price without human price adding on top of it. Second, it is not always easy to convince patients to donate their tissue samples for research analysis. Moreover, some diseases are difficult to detect, and therefore collecting a significant number of samples may take years. 2 1.2 Motivation and Challenges Gene expression data can help in better understanding of cancer or other types of disease. One challenge in cancer treatment is to target therapies to pathogenetically distinct tumor types in order to maximize efficacy and minimize toxicity [19]. Thus cancer classification has been a central goal of gene expression analysis. However, cancer classification based on gene expression data is not easy due to the characteristics of the data—high dimensionality and low sample size. The reduction in performance of an algorithm for data sets with many attributes is known as the curse of dimensionality [4]. To overcome the curse of dimensionality, we need to extract genes that are truly relevant to the disease status. This problem of identifying relevant features is known as feature selection. Genes that are discriminative among different phenotypes are referred as informative genes. Figure 1.1 depicts informative genes with an example. Suppose there are 4 samples in a data set, where 2 samples are normal tissues and the other 2 samples are tumor samples. Each sample contains expression levels of 4 genes. Gene G l and G2 are informative genes as they are consistent within each phenotype and discriminative between two phenotypes. In contrast, G3 and G4 are non-informative since they do not show difference between phenotypes. Informative genes are often referred as biomarkers. In the domain of our study, informative genes and biomarkers are interchangeable. Improving the classification and prediction accuracy is not the only motiva-tion for feature selection. Identification of informative genes can help researchers gain significant insights into the nature of a particular disease. Informative genes can be used in the development of efficient cancer diagnosis and classification plat-forms. In drug design, researchers record gene expression data for patient before, during and after treatment, aiming at identifying a subset of drug responsive genes 3 Normal Tissues Tumor Tissues S 1 S2 S3 S4 <i> O GI ro £ o G2 «• .c a> CD „ i) G 3 > ' I "ro E o 1 Q4 o •z Figure 1.1: Illustration of informative genes. GI and G2 are informative genes as they are consistent within each phenotype and discriminative between two pheno-types. G3 and G4 are non-informative since they do not show difference between phenotypes [131. and identifying potential drug targets. Gene selection can also reduce the cost in diagnosis, because researchers only need to obtain profiles of relevant genes in order to predict class membership of a new patient. Smaller number of genes lead to cheaper chips, which in turn reduces the cost. 1.3 Problem Statement In medicine, biomarkers are indicators of a particular disease state. In the context of gene expression analysis, biomarkers often stand for the minimum subset of infor-mative genes. In this thesis, we focus on analyzing high-throughput gene expression data from two specific problem domains—cancer classification and biomarker iden-tification. Cancer Classification Suppose we have a gene expression data set D of data points Xi G with binary 4 class labels € {—1,1}, D = {(xi,yi)\i = 1,2, ...,n}, X = {x{\i = 1,2, ...n}, ^ = {yi\i = 1,2, ...,n}. Given this training data set, we wish to predict the class label y* for a new data point x*. Biomarker Identification The quality of a subset of genes can be measured by several quality criteria including: test error and Leave-One-Out (LOO) validation error. Test error is defined as: NTE Test enov = ^ 2 5(yi^yi) (1.1) i where NTE is the number of test cases, and 5(s) is 1 if s is true, otherwise 0. In our case, 5{yi ^ yi) — 1 if the predicted class label yi does not equal to the true label y^ of Xi. LOO validation error is evaluated as: NLOO LOO error = ] T ^ Vi) (1.2) i where NLOO means the number of all validation cases. We discuss the LOO val-idation procedure in Chapter 2. Furthermore, we define test accuracy to be (1 — T e S t j v f r r ° r ) x 1 0 0 % > a n d similarly LOO accuracy to be (1 - L Q ° L ( j f o r r 0 r ) x 100%. Suppose we are given a training set Dtr and a test set Dte, and let S — {gi, <72, 9d} be the set containing all genes of the data sets Dtr and Dte- Ideally, we wish to identify a subset of features M such that M C S, and the test accuracy Accu-test(M) with features in M is the highest among the test accuracies using all possible subset of features. In the case where the highest accuracy is achieved by multiple subsets, we choose M to be the smallest subset. We refer M as the biomarker set, and the problem of identifying the biomarker set is called biomarker identification. In real world problems, the test data is not available when selecting relevant features. Therefore, we cannot use test accuracies to identify biomarker sets. In-stead, we can use the LOO accuracy on the training set to identify the biomarker set. Specifically, we choose M to be the subset of features such that the LOO ac-curacy over the training set AccuJoo(M) with features in M is the highest among the LOO accuracies with all possible subset of features. Again, we use the size of the subsets to break the ties, and the smallest one wins. Ideally, if the training set and the test set are drawn from the same underlying distribution, then we expect the biomarker set Mtr selected using LOO accuracy on the training data is identical to the true biomarker set Mte selected by the test accuracies. 1.4 Contribution Our motivation in writing this thesis is to summarize the enormous amount of work that has been done in the field of kernel based methods with application to gene expression analysis. It is our aim to have all the results together in one place so that people can easily see their similarities and differences both theoretically and empirically. Contributions of the thesis include: (1) We perform a comparative analysis of kernel based classification algorithms on gene expression data; (2) We perform a comparative analysis of kernel based feature selection methods with application to biomarker identification, and demonstrate experimentally the importance of gene selection; (3) We propose to use a stability measure, called stability score, to evaluate the sensitivity of feature selection methods to the changes in training data; (4) We propose an algorithm that utilizes a resampling method (e.g. cross validation), and demonstrate its effectiveness in improving the quality of biomarkers. 6 1.5 Thesis Outline The remaining of the thesis is organized as follows. Chapter 2 defines related works and gives a number of definitions used in the context of our work. In Chapter 3, we review kernel based classifiers and provide experimental evaluation on their performance with gene expression data without feature selection. In Chapter 4, we review existing kernel based feature selection methods and show the comparison amongst these methods by experimental results. In Chapter 5, we show how to improve the stability and reliability of existing feature selection methods by cross validation, a special resampling method. We first present two existing algorithms that use this approach and then propose our new method. Finally, our conclusions are stated in Chapter 6, along with a number of suggestions for future work. 7 Chapter 2 Related Works In this chapter, we give a general review of related works in the area of kernel based classification, feature selection, and cross validation. The concept of kernel and some of its basic properties are elaborated in Chapter 3 where we treat SVMs in more detail. 2.1 Kernel Based Classifiers There are many classification models in the literature. Instead of reviewing them all, we focus on the classifiers used in this thesis: kernel based classifiers. Particularly, we place emphasis on SVM [5, 45], the Relevance Vector Machine (RVM) [44], and Gaussian Process Classifiers (GPC) [7, 8, 11, 25, 35, 38, 39, 47]. 2.1.1 K e r n e l Kernels provide a framework to represent data and must satisfy some conditions. Suppose we are given training samples (xi,yi),(xn,yn) € X x {±1}, where X is a set from which the training cases Xj are taken. In classification and prediction, we need to generalize unseen data points. Given a new point x £ X, we want to predict the corresponding y G {±1}. Intuitively, we want to choose y such that (x,y) is 8 in some sense similar to the training samples. Thus we need notions of similarity in X and in {±1} [42]. Measuring similarity of labels is easy as there are only two situations. Two labels are either identical or different. Similarity measure of inputs are not so obvious. Consider a similarity measures of the form: fc: ' X x X - R (2.1) (xi,Xj) -> k(xi,Xj) (2.2) that is, given two patterns, the function returns a real value characterizing their similarity. Such function k is called a kernel [42]. This general form is hard to study. A simple type of similarity measure is dot product. Obvious limitation of this kernel is that it is only defined when the data to be analyzed are vectors. For a more general object x £ X, we need to represent the object as a vector </>(x) € W, and then define a kernel for any X j , X j G X by k(xi,x.j) = <l>(xi)T<t>(xj) (2.3) 2.1 .2 Relevant K e r n e l Methods Support vector machine [45] is a generalization of the so-called optimal hyperplane algorithm. SVM algorithm creates a hyperplane that separates the data into two classes with the maximum margin. Unlike SVMs which use only support vectors to construct hyperplanes, Marron and Todd [31] use all vectors to construct hyper-planes aiming at solve the so-called "data piling" problem at the margin. Their algorithm is referred as Distance Weighted Discrimination (DWD). One important limitation of SVM and DWD is that they make point pre-diction rather than generating predictive distributions. Tipping [44] formulated the Relevance Vector Machine (RVM), a probabilistic model whose functional form is equivalent to SVM. RVM achieves comparable prediction accuracy to SVM, yet also provides a full predictive distribution. RVM learns a classifier that is constructed 9 as a weighted linear combination of basis functions. Krishnapuram [22] argues the fact that RVM keeps too few basis functions may make it suffer from a systemic under-fitting. They propose Sparse Multinomial Logistic Regression (SMLR), which performs exact multinomial logistic regression with a sparsity-promoting prior. Another important Bayesian kernel classifier is the Gaussian Process Clas-sifier (GPC), which is derived from Gaussian' process priors over functions. The main idea of the Gaussian process classifier is to assume that the class label yi is obtained by transforming some real valued latent variable / ( X J ) associated with X j . A Gaussian process prior is placed on f(xi), and is combined with the training data to obtain predictions for a new data point. 2.2 Feature Selection Methods 2.2 .1 Feature Construction v.s. Feature Selection Feature construction methods compute new features as a combination of the orig-inal ones and are often used in dimensionality reduction. Many popular feature construction techniques are based on a linear combination of the original features, such as Principal Component Analysis (PCA) which transforms a number of pos-sibly correlated variables into a smaller number of uncorrelated variables called principal components that are ordered by decreasing variability. The first princi-pal component is the combination of variables that explains the greatest amount of variation, and the second principal component accounts for the next largest amount of variation, and so on. Several nonlinear construction algorithms such as the one based on kernel methods [10] have been recently proposed. However, unfortunately, the common infeasibility of feature construction methods with application to gene expression analysis exist for both the linear and nonlinear case. Since the principal components are combinations of all input features, we lose the insights of which 10 subset of genes are most relevant [22]. Moreover, if all features have to be collected, the benefit of reducing diagnosis cost is lost. Feature selection, on the other hand, chooses a subset of features from the original input space, which are supposed to be relevant to the task at hand. The subset of features can be combined in a proper way either in a subsequent stage or during feature selection [21]. The main point is that feature selection does not generate new features via a combination but selects features from the original input feature space. Therefore, only feature selection techniques address the problems we defined in Chapter 1—biomarker identification and diagnosis cost reduction. In subsequent sections, we review current approaches to feature selection in the literature. A good review article on feature selection is [18]. 2.2.2 Methods for Feature Subset Selection In the current literature, there are three basic approaches to the problem of feature subset selection in general, or biomarker identification in particular [18]. The three approaches include filter, wrapper, and embedded methods. Filter methods treat feature selection as an isolated step from the classifier design. Typically, a subset of features are selected as a preprocessing step, then a classifier is designed using only the features selected. Wrapper methods search an optimal subset of features with a classifier. They search through the space of possible feature subsets and score the quality of a particular subset S according to its predictive power (e.g. the accuracy of a classifier using only the subset S). Finally, Embedded methods perform feature selection in the process of training. Filter Methods Filter methods select features based on their ability to distinguish between the two classes without relying on a classifier. Statistical tests such as t-test and Wilcoxon 11 rank sum test are examples of the filter approach. In statistical tests, a ranking crite-rion is first introduced to quantify the discriminative capability of individual genes. Top ranked genes at a particular significant level are then returned as biomarkers. Fisher discriminant ratio (FDR) is an alternative to t-statistics [36]. In binary case, we may compute FDR for each feature i as: K')2 + (*i")s where fj,^}, ^ ,a^} are the mean and standard deviation of gene i on the positive and negative samples, respectively. Under the assumption that the class-conditional density functions are Gaussian, larger values of FDR(z) or t-statistic(-i) suggest that feature % is better able to distinguish between two classes. Wilcoxon rank sum test works similarly, except that it does not assume any particular distribution of the training samples. The major drawback of the filter approach is that it ignores the correlations between features. A set of low ranked features may perform well when combined together. On the other hand, a set of top ranked features may provide similar information for a prediction, and therefore their predictive power may not be better when combined together. Another pitfall of this approach is that data of low sample size may not reveal trustworthy information about their distribution. Wrapper Methods The limitation of filter methods which treat each feature isolated can be overcome by wrapper methods which consider feature selection as a search over the space of all possible feature subsets. Generally the wrapper methods consist in using prediction performance of a given classifier to assess the quality of subsets of features. An exhaustive search can be performed if the number of features is not too large. Brute force on high dimensions may become intractable. For example, there are 12 2 possible feature subsets for a data set in N-dimensional space. In our case, N can easily be tens of thousands. Thus efficient search strategies, such as greedy search, can be used. Greedy search strategies come in two forms—forward selection and backward elimination. Moreover, one can use theoretical error bounds as an indicator of prediction performance. Guyon et al. [20] use this as a basis for assessing the importance of each feature at each iteration. Their algorithm, called Recursive Feature Elimination (RFE), is described in detail in Chapter 4 and 5. Embedded Methods Wrapper methods utilize learning machine (e.g. SVM classifier) as a black box to score subsets of variables according to their predictive power. This approach is universal and simple, but embedded methods that incorporate feature selection as part of the training process may be more efficient. In wrapper methods, to obtain the predictive performance, the data is split into training and test (or validation) samples. In embedded methods, all available data are used to train a classifier. An example of embedded methods is Automatic Relevance Determination (ARD) [27, 35] exploited by GPC. ARD parameters can be directly embedded into the covariance function of GPC, which can be considered a kernel that simplifies the problem in high dimensional space [35]. 2.3 Cross Validation Cross validation is a method of estimating the accuracy of a classification or regres-sion model. There are mainly two types of cross validation: k-fold cross validation and Leave-One-Out (LOO) cross validation. In k-fold cross validation, the data is partitioned into k folds (or subsets). One fold is in turn used for testing, and the remaining k-1 folds are used for training. 13 LOO cross validation is an extreme case of k-fold cross validation, with k — n, the number of total cases. Cross validation can be considered as a special form of resampling in the sense that all data are used for training—none has to be held back in a separate test set., 14 Chapter 3 Evaluating Relevant Kernel Classification Algorithms Although Chapter 2 introduced related work and explored various kernel classifica-tion algorithms, we use this chapter to extend the discussion and to evaluate these algorithms with application to tissue classification using gene expression data. In chapter , we shall discuss how to improve classification (or diagnosis) accuracy by incorporating feature selection techniques. We focus on binary classification for simplicity. For an extensive study of multi-class tissue classification using gene expression, please refer to [26]. Let us assume a data set D of data points Xi e Rd with binary class labels yi G {—1,1}, D = {{xi,yi)\i = 1,2, ...,n}, X = {xi\i = 1,2, ...n}, Y = {y{\i = 1,2, ...,n}. Given this training data set, we wish to predict the class label y* for a new data point a;*. We discuss how various classifiers solve this problem in the following sections. 3.1 Support Vector Machines In this section, we present some basic natures of SVM algorithm. More theoretical background information can be found in [12, 42, 45]. 15 3 (a) Maximal Margin (b) Smaller Margin Figure 3.1: Two dimensional linearly separable data. Both classifiers correctly clas-sify the data, but the margin in (a) is bigger than the one in (b). The classifier in (a) is the maximal margin classifier of this data set. 3.1.1 M a x i m a l m a r g i n classifier Sometimes the data are linearly separable as illustrated in Figure 3.1. A data set is linearly separable if there exists a line (2-D), a plane (3-D), or a hyperplane (4-D or higher) which correctly separates the two groups of data such that data from same group all fall on the same side, and no two data points from different groups fall on the same side. We make this assumption for now. We shall discuss the non-separable case in section 3.1.3. For a specific data set, there may exist multiple classifiers that can correctly separate the two classes. For example, in Figure 3.1, both classifiers correctly sepa-rate all the data points. In maximal margin classification, an optimal linear classifier is the one that maximizes the margin, which is defined as the minimum distance of the data objects to the decision boundary. In this case, the classifier on the left is the maximal margin classifier. The definition of margin is given in Section 3.1.2 (see Figure 3.2). 16 Figure 3.2: Linear SVM—the simplest case. HI and H2 are hyperplanes, and X I and X2 are support vectors. 3.1.2 Linear S V M A 2-D example of linear SVM is illustrated in Figure 3.2. Mathematically we have: xi • w + b > +1 for yi = +1 Xi • w + b < -1 for yi = - 1 , or in the combined form: yi{xi • w + b) > 1 (3.1) Furthermore, we suppose x\ and X2 are two vectors that lie on the hyperplanes Hi and H2 respectively, then we have: Hi : xi • w + b = +1 Hi : X2 • w + b = -1 Hence, the margin can be computed as: M = (xi - x2) • w w 2_ \w\ (3.2) Now we can form the constrained optimization problem as: min M = wTw s.t. (XJ • w + b)yi > 1, i = 1 , n 17 (3.3) Therefore, we need to optimize a quadratic function subject to linear con-straints. The solution involves constructing a dual problem where a Lagarange multiplier on is associated with every constraint in the primary problem. The corresponding Karush-Kuhn-Tucker ( K K T ) conditions are: 1. v^ = o 2. yi(xi -w + b) > 1 3. ai > 0 4. ai[yi(xi -w + b) - 1] = 0 The dual problem (3.4) can now be written as: The last K K T condition implies that only the CKJ'S corresponding to points at the boundary can have non-zero values. These Ns points at the boundary are called support vectors since they support the hyperplane on both sides of the margin. Furthermore, only these Ns support vectors are needed to specify the optimal linear classification boundary. For a new data object x*, we predict its label as follows: A nice property of S V M is that during both training (Equation 3.5) and testing (Equation 3.6), data points always appear in the form of dot products. We shall see how this property make it possible to reduce the computational cost by kernelization in section 3.1.3. (3.4) max a > 0 E j l x <*i-\ Y,i=i T!j=i a^ixfx^yjaj} (3.5) y* — sign(wx* + b) = sign aiyi(xfx*) + b . i = i (3.6) 18 * * * * Figure 3.3: Non-linear SVM. Data on the left are not linearly separable in the two dimensional space. Via some transformation, say data points in 2D can be transformed to 3D feature space where they become linearly separable. 3.1.3 Nonlinear SVM In many real world applications, data are not linearly separable in the original input space as illustrated in Figure 3.3. To solve this problem, we can map the data to a higher-dimensional space, where it becomes linearly separable. Intuitively, the higher the dimensionality, the sparser the distribution of the data, and the more likely to separate the data linearly. Kernel trick Recall that the linear classifier only relies on dot product between vectors in both training (Equation 3.5) and prediction (Equation 3.6). If every data point is mapped into some high-dimensional space via some transformation $ : x —> (f>(x), then the dot product becomes: (j)(xi)T(f)(xj). Similarly, the 4>{x) appears only in the form of dot products. Mercer's theorem states that, under some mild conditions, dot prod-uct in some expanded higher dimensional feature space correspond to kernel func-tions defined in the lower dimensional input space, that is: K(xi,Xj) = (p(xi)T(p(xj). 19 Hence, to train and test SVMs, we do not need to know We only need to choose an appropriate kernel function. Here are some commonly used kernel functions: • Linear kernel . • Polynomial kernel K(xi,Xi) = (xiTXj + l)p • Gaussian kernel (a.k.a. rbf kernel) K(xi, Xj) = exp( -^ (x ' i - Xj)T(xi - Xj)) • Sigmoid K(xi,Xi) = tanh(axiTXj — (5) Hard margin v.s. soft margin In many real world applications, especially in gene expression analysis, the data contains some noise which makes the data non-separable. If we use the nonlinear sviri techniques discussed above to deal with this type of non-separability with the constraints that all data must be correctly classified, overfitting problem may incur as illustrated in Figure 3.4. Slack variables £ can be added to allow misclassification of diffcult or noisy examples to improve the generalization of the classifier. This is called C-SVC [45]. The quadratic optimization incorporating slack variables is formulated as: min™^ \wTw + C YH=\ & s.t. y{wTXi + b) > l - & , i = l , . . . ,n (3.7) and £j > 0 where £j > 0 for all i. The parameter C in (3.7) can be viewed as a penalty factor which controls overfitting. 20 - 2 - 2 - 2 -1 0 1 X1 (a) True Data 2 3 4 - 2 -1 o 1 : X1 (b) Noisy Data 2 3 4 Figure 3.4: Overfitting problem of hard margin with noisy data. Suppose the data in (a) are correct. If we change the class label of one data point (e.g. the one marked by the red square), we get the separating hyperplane as shown in (b) with hard margin. The hyperplane in (b) overfits the noisy data, and has a poor generalization ability. In summary, SVM classifier has the following nice properties among many • Its flexibility in choosing a similarity function: By Mercer's Theorem, every semi-positive definite symmetric function is a kernel, and any valid kernel function can be used as the similarity measure. • Its sparseness of solution when dealing with large data sets—only support vectors are used to specify the separating hyperplane. • Its ability to handle large feature spaces—kernelizaton reduces the computa-tion cost, and its complexity does not depend on the dimensionality of the feature space. • Overfitting can be controlled by soft margin approach. • Its nice math property: a simple convex optimization problem which is guar-others: 21 anteed to converge to a single global solution. 3.2 Distance Weighted Discrimination As discussed in Chapter 1, gene expression data produced by high-throughput tech-niques are characterized by High Dimensionality and Low Sample Size (HDLSS). Marron and Todd [31] presents a novel view of the performance of S V M in HDLSS settings via projecting the data onto the normal vector of the separating hyper-plane. This view reveals substantial so-called "data piling" problem at the margin. They argue that data piling may adversely affect the generalization performance of S V M classifiers. To avoid the data piling problem, they propose a new classification algorithm called "Distance Weighted Discrimination" (DWD). The major difference between D W D and S V M is that the former uses all training vectors to define the classifier (or separating hyperplane), while the latter only uses support vectors. Recall that SVMs try to find the classifier that maximizes the minimum distance of the data objects to the decision boundary as illustrated in Equation 3.3. In D W D , an optimal linear classifier is the one that minimizes the sum of the reciprocals of the residues, perturbed by a penalized vector £. Mathematically, D W D forms the following optimization problem: r,w,f3,t; 1 s.t u = yi(xfw + b) + & wTw = l , r > 0,£ > 0 Again, C is a penalty parameter to control misclassification and overfitting. The equation above can be reformulated into the form of of a so-called second-order cone programming (SOCP) problem. We skip the SOCP formulation for D W D since optimization is not the major topic of the thesis. Interested readers can refere to 22 [31] for the formulation and derivation of SOCP for DWD. 3.3 Bayesian Kernel Classifiers In this section, we present the probabilistic, or Bayesian approach to learning kernel classifiers. The Bayesian approach exhibits some differences with respect to the framework of risk minimization. Two key distinctions include: (1) In the Bayesian approach, one can incorporate prior knowledge into the process of estimation; (2) In contrast to SVM, which simply returns a binary decision, yes or no, a probabilistic classifier returns the probability, P(y = l |x, that an object x belongs to the class of interest indicated by the binary variable y. The probability answer is more desirable than a simple binary decision as it provides additional information about the certainty of the prediction. A binary classifier returns two probabilities with one for each class, and the final class label is decided by choosing the one with higher probability. The following two subsections present two Bayesian learning algorithms— Gaussian processes and Relevance Vector Machines (RVM). 3.3.1 Gaussian Processes Classifiers Formally, we have the following definition for Gaussian process (GP): Def in i t ion 3.1 (Gauss ian Process) Given an index set X, a collection of ran-dom variables f{x) with x € X is a Gaussian process if, for every finite set {x\,xn} C X, the random variable {f(x\),f(xn)} has a multivariate Gaussian distribution, with mean \x £ lZn and covariance K € lZnxn. Gaussian processes were originally developed for the problem of regression. In re-gression, the output (or dependent) variable is a' real value rather than a binary number representing the class membership. In order to adapt this method to the 23 Figure 3.5: Graphical model for GPC with n training data points and one test data point. X\-n and Y\:n are observed. Given x*, we wish to predict y*. /j is a latent variable associated with xt, and yi is obtained by transforming / j . Given / , x and y are independent, ft and /* are joint Gaussian, and the Gaussian process prior is parameterized by 0 [25]. problem of classification, the concept of latent variables are introduced. The main idea of Gaussian process classifier (GPC) is to assume that the class label yi is ob-tained by transforming some real valued latent variable f(xi) associated with X{. The graphical model for GPC is shown in Figure 3.5. This graphical model encodes the assumption that x and y are independent given / . A bayesian framework of GPC is described with more details in the following. Gaussian process prior We place a Gaussian process prior on the function /(•), meaning that for any finite set X = {xi, ...,xm}, the random vector / = [f(xi),...,f(xm)]T is a Gaussian. Without loss of generality, we can assume such a process has a zero mean. The covariance between f(xt) and f(xj) can be defined as = cov(xi,Xj) = v0exp I ~ lm(xf - x™f I + vx + v2S(i, j), (3.8) I 771=1 J 24 We say this GP prior is parameterized by 6 which is also known as hyper-parameters. More explicitly, if we use the covariance function defined in 3.8, then 9 = [vQ,lm,vi,V2] is a hyperparameter vector. In particular, the hyperparameter VQ specifies the overall vertical scale of variation of the latent values, v\ is the overall bias of the latent values from zero mean, V2 is the latent noise variance, and lm is the Automatic Relevance Determination (ARD) variable for the m-th feature that controls the contribution of this feature in the modelling. We shall describe ARD in detail in Chapter 4 when we discuss the problem of feature selection. We assume lm = 1 for all m in this chapter. In fact, Ejj = K(xt,Xj) can be any kernel function such as the four listed in section 3.1.3. By the definition of Gaussian Process, the prior probability of these latent function values {/(XJ)} given 6 is a multivariate Gaussian, that is p(/|x, 6) ~ iV(0,£): P(/|x, 0) = t I . e x p O - V s - 1 / ) (3.9) ( 2 T T ) 2 | E | 2 2 Likelihood for class label The likelihood p(D\f, 0) is the joint probability of observing the sample labels given the latent function values. The likelihood can be evaluated as a product of the likelihood function: n p{D\f,e) = \{p{yl\fuO) (3-10) i = l We give two commonly used likelihood functions below. 1. Probit function( [30], [7], [35]): p(yi\fl,e) = $(ylfl) (3.11) where $ is the cumulative distribution function (c.d.f.) of standard Gaussian distri-bution A r(0,1). In the presence of noise from inputs or targets, we may assume that 25 the latent function values are contaminated by a Gaussian noise which is indepen-dent of inputs. If we use 6 to denote the noise, then 5 has zero mean and an unknown variance a2, i.e. N(5; 0, a2). The likelihood function becomes p{yi\fi) = $(^^); 2. Logistic function: This is given by p(y = l\fl,e)= '*Pnifjn (3.12) 1 + exp(/j) Posterior probability The posterior probability can be written as 1 n p(f\D,e) = j^Ilp(yi\fi>W\x>°) (3-13) where the prior probability p(/|x, 6) is defined as in (3.9), and the likelihood function p{yi\fi-,9) is defined as in (3.11) and (3.12). The normalization factor p(D\0) is known as the evidence for 6, and is defined as p(D\6) = Jp(D\f ,0)p(f\x,6)df. A popular idea to compute the evidence is to approximate the posterior distribution p(f\D,6) as a Gaussian using Laplace approximation or Expectation Propagation (EP). Then the evidence can be calculated analytically ( [27], [30]). Model adaptation In the full Bayesian treatment, the class probability is obtained by integrating over the hyperparameters p(y*\x*,D) = Jp(y*\x*, D,6)p(G\D)d0. Monte Carlo meth-ods [35] can be used to approximate the integral. However, this might be too costly to use in practice. Therefore, in this thesis, rather than integrating over the hyperparameters we fix them by maximizing the posterior probability p(0\D), where p(6\D) oc p(D\6)p(6). The prior distribution p(6) can be specified by domain knowledge, or some vague uninformative distribution. In the latter case, the optimal values inferred by Maximizing A Posterior (MAP) will converge to the ones inferred by Maximizing the marginal Likelihood (ML). 26 Prediction Suppose we have found the optimal settings of hyperparameters 9*, then let us take a test sample x*, for which the class label y* is unknown. By the definition of Gaussian process, the latent variable /(x*) and the / = [ / (x i ) , f (x n ) } T have a joint multivariate Gaussian distribution, i.e. f s * \ ^ fc /C(x*,x*) J where fc = [/C(x*, xi), /C(x*, X 2 ) , /C(x*, xn)]T. The conditional distribution of /(x*) given / is also a Gaussian: p(f.\f,D,0) = a^ o c e x p j - - — , Tv~i_i 1 ) ( 3 - 1 4 ) ~ Af _ / * _ p( / | x , r ) " V 2/C(x*,x*)-fc TE- 1fc The predictive distribution of P(f(x*)\D,6*) can be computed as P(MD,e*) = J P(Mf,D,e*)p(f\D,o*)df (3.15) As both terms of the integrand are (or can be approximated as) Gaussians, the predictive distribution (3.15) can be simplified as a Gaussian N(f(x*); fix,, &Xt2) with mean jiXt and variance <JX*2- Refer to [7] for explicit forms of pXt and a x 2 . The class label yXt can then be decided as arg maxj p{y* = i\x*, D,0*). 3.3.2 Relevance Vector Machines and Other Sparse Classifiers Sparse classification algorithms obtain sparse solutions for regression and classifica-tion while maintaining their Bayesian interpretation. The family of sparse classi-fication algorithms includes Relevance Vector Machines (RVM) [44], sparse online Gaussian processes [11], Sparse Multinomial Logistic Regression (SMLR) [23], and some others. We include RVM and SMLR in our comparative study. By exploiting a probabilistic Bayesian framework, RVM learns a classifier that is constructed as a weighted linear combination of basis functions. The weights 27 Name Number of Samples Dimensionality Type of source Colon cancer 62(22+/40-) 2000 microarrays Leukemia 72(25+/47-) 7129 microarrays Lung cancer 26(11+/15-) 19624 SAGE Table 3.1: Gene expression data used for classification are estimated by ML or M A P in the presence of training data. There is no restriction on basis functions. For example, if we use a kernel function centered on the training samples, the learned classifier will be similar to SVM. If we use original features or their transformed form as the basis functions, then the learned classifier can be considered as a feature weighting algorithm. We shall discuss the performance of RVM for feature selection in Chapter 4. We omit the details of the Bayesian framework of RVM as it is not the core issue concerned by this thesis. For a full derivation of RVM, please refer to [44]. 3.4 Experimental Results We conduct an experimental evaluation on three real gene expression data sets. Table 3.1 provides a summary of those reported here. The Colon cancer data set, originally studied by Alon et al. [1], contains the expression levels of the 2000 genes from 22 normal and 40 tumor tissues. The task is to discriminate tumor from normal tissues. The Leukemia data set, originally analyzed by Golub et al. [19], consists of expression values of 7129 genes from 47 samples of Acute Lymphoblastic Leukemia (ALL) and 25 samples of Acute Myeloid Leukemia (AML). The task is to discrimi-nate the two types of leukemia. The Lung cancer data set, generated by BC Cancer Regency, consists of 26 SAGE libraries (samples) with 15 normal libraries, 5 CIS (carcinoma institu) 28 libraries, and 6 invasive libraries. The normal libraries are from bronchial brushings, the CIS libraries are from bronchial biopsies, and the invasive libraries are from frozen resected samples. The values are the sum of all the tag counts that map to a particular gene. There are a total of 19,624 genes including a small number of hypothetical genes. The task is to discriminate normal samples from CIS/INV ones. In keeping with standard practice on kernel classifiers, we normalize the original expression data sets so that the mean is zero and the standard deviation is one. We use several off-the-shelf implementations including OSU-SVM 3.00 1 for SVM, RVM code from B. Krishnapuram [23], and W. Chu's C code for GPC [7]. The specific form of kernel was selected by cross-validation: the one giving the best result is used for all classifiers to allow comparison. Kernel parameters were selected by cross-validation using SVM, and the best parameters were used for all methods. Regularization parameters are method-specific and therefore was chosen by cross-validation for each method. In our experiment, a linear kernel was used for all three data sets. 3.4.1 Diagnosis Accuracy Compar i son Diagnostic accuracies are commonly assessed by cross validation. We use k-fold cross validation with k = 10. More specifically, we randomly divide the data set into 10 partitions (or folds), and ensure that each partition contains an equal number (if possible) of samples from a class. For example, we divide the Colon cancer data set into 10 folds such that each fold contains 4 tumor samples and 2 or 3 normal samples. With the 10 partitions ready, we pass the data to classifiers 10 times. Each time, one fold is used for testing, and the remaining k — 1 = 9 folds are used for training. The average accuracy from the 10 experiments are reported in Figure 3.6. Neither classifier is incorporated with feature selection in these experiments. We 1Available from http://www.ece.osu.edu/~maj/osu-svm 29 Figure 3.6: 10-fold cross-validation accuracies of various classifiers without feature selection make the following observations from the results in Figure 3.6: • Among the three data sets, Colon data set is the hardest to classify for all classifiers. This is likely caused by the low quality of the data: the expression values may contain lots of noise, and some class labels might be mislabelled. Lung cancer data has the lowest sample size and highest dimensionality, but it is predicted most correctly for all classifiers. This is mainly due to its high quality. • Among the classification algorithms, Gaussian processes with EP outperforms all other classifiers as it achieves the highest accuracies for all data sets. How-ever, the difference is not significant. • Notice SVM is the only classifier that failed to achieve 100% accuracy for the Lung cancer data. The reason behind this might be the data-piling problem discussed in section 3.2. 30 True Class positive negative Classified As positive negative True Positives False Positives False Negatives True Negatives Table 3.2: R O C confusion matrix • Sparse classification algorithms (e.g. R V M and S M L R ) obtained the lowest accuracies for the Colon data. This may imply that sparse classification algo-rithms are heavily affected by irrelevant features. • Besides the effect of the quality of the data sets, the presence of a significant number of irrelevant features may make the classification somewhat prone to the curse of dimensionality. We shall see in Chapter 4 that diagnostic accuracies can be improved dramatically after eliminating irrelevant features. 3.4.2 R O C Curve Analys i s In this subsection, we use Receiver Operating Characteristics (ROC) curves to eval-uate and compare classifiers. A n R O C curve is a useful technique for visualizing, organizing and selecting classifiers based on their performance [15]. In binary classification, given a classifier and an object to be classified, there are four possible outcomes. If the object is positive, and it is correctly classified as positive, it is counted as a true positive; if it is misclassified as negative, it is counted as false negative. Similarly, if the object is negative, and it is correctly classified as negative, it is counted as true negative; if it is misclassified as positive, it is counted as false negative. Table 3.2 shows a two by two confusion matrix that summarizes the four outcomes. Let TP, FP, T N and F N be the total number of true positives, false posi-tives, true negatives and false negatives, respectively. Furthermore, let P and N be the total number of positive and negative objects respectively. Then we have the 31 following terms associated with ROC curves: • True Positive Rate (also called hit rate and recall): TP rate = ^ • False Positive Rate (also called false alarm rate): FP rate = ^ ROC graphs are two-dimensional graphs with TP rate as Y-axis and FP rate as X-axis. A discrete classifier (e.g. decision tree) is a classifier that only outputs class labels [15]. Given a test set and a discrete classifier, we can compute the pair of (FP rate, TP rate), which corresponds to a single point in ROC space. Rather than a discrete class label, some classifiers like the six we consider here, give each object a numerical value that represents the degree to which the object belongs to a class. The value can be a strict probability, or an uncalibrated score. Specifically, SVM and DWD give uncalibrated scores, in which case the only property that holds is that higher scores indicates higher probability that an object is of a class. As opposite to SVM and DWD, the remaining four Bayesian classifiers give a strict probability that an object belongs to a class. Any of the six classifiers can be turned into a discrete classifier by thresholding: if the probability or score is above the threshold, the classifier produces a yes; otherwise, a no. By varying the threshold, we can produce different points in the ROC space. In this thesis, we adopt the method described in Algorithm 2 of the paper [15] and generate the ROC curves as follows: (1) merge test sets Ti , T 2 , T i o , generated from 10-fold cross-validation, into one large test set T; (2) sort T by their scores (or probabilities); (3) let threshold to be +oo and produces the point (0,0); (4) reduce the threshold from highest score to lowest score, and compute (FP rate, TP rate) pair for each threshold; (5) connect the points to generate a ROC curve. ROC curves show the tradeoff between benefits (TP) and costs (FP). The closer the curve follows the left-hand border and then the top border, the more ac-curate the test; the closer the curve follows the 45-degree diagonal, the less accurate 32 ROC curve (AUC=0.85455) ROC curve (AUC=0.86591) 0.9 0.8 0.7 CO CD > 0.6 'co o 0.5 Q . CD 0.4 0.3 0.2 0.1 0 0.9 0.8 0.7 co > 0.6 'co o Q . 0.5 CD D 0.4 0.3 0.2 0.1 0 0.9 0.8 0.7 co CD > 0.6 'co O a 0.5 CD D 0.4 0.3 0.2 0.1 0 ^ 2 0.4 0.6 O.i False positives (a) SVM ROC curve (AUC=0.78182) 2 0.4 0.6 0., False positives (c) RVM ROC curve (AUC=0.85682) 0.2 0.4 0.6 False positives (e) GP-Laplace 0.9 0.8 0.7 CO CD > 0.6 'co o 0.5 Q . CD 0.4 0.3 0.2 0.1 0 0.9 0.8 0.7 CO CD > 0.6 'co o Q . 0.5 CD 0.4 f— 0.3 0.2 0.1 0 J y y y y y y y • y / / / '. / • / : / i /• y 0.2 0.4 0.6 0.8 False positives (b) DWD ROC curve (AUC=0.78068) .2 0.4 0.6 O.i False positives (d) SMLR ROC curve (AUC=0.85568) .jr.. 0.4 0.6 False positives (f) GP-EP Figure 3.7: R O C curves of the six classifiers for the Colon cancer data 33 1 2 Dataset Figure 3.8: AUC values of various classifiers for the Leukemia (1) and Lung cancer data (2) the test. In addition to ROC curves, we can use a single scalar value—the Area Under the ROC Curve (AUC)—to represent the classifier performance [33]. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly cho-sen positive instance higher than a randomly chosen negative instance. Therefore, higher AUC indicates better average performance. Figure 3.7 shows the ROC curves of various classifiers with Colon data set. As our ROC curves are generated from a low-sample-sized data set, the curves in Figure 3.7 are step functions with sharp corners. This makes the AUC values only an estimate of the true probability. Therefore, when using these AUC values to compare two classifiers, one must be aware that the difference may not be statis-tically significant. Having said that, an AUC value gives a good general measure of predictiveness. As shown in Figure 3.7, AUC values given by SVM, DWD, GP-Laplace and GP-EP are comparable (« 0.85), and dominate the AUC values of 31 140 120 «• 100 c o £ 80 CD E 60 i -fe 40 20 0 Figure 3.9: Graphical summary of CPU costs of various classifiers RVM and SMLR, both of which are approximately 0.78. ROC curves of various classifiers with either Leukemia or Lung cancer data exhibit similar shapes, so we omit them here. Figure 3.8 presents a summary of AUC values of various classifiers with Leukemia and Lung cancer data. As Figure 3.8 shows, all classifiers achieve comparably excellent performance with these two data sets. 3.4.3 C P U Cost Compar i son The CPU time taken by the six classifiers considered in this chapter, over the three data sets, is summarized in the graphical display of Figure 3.9. The bars are grouped by data set with 1, 2 and 3 representing Colon cancer, Leukemia and Lung cancer data set respectively. The color of the bars indicate the classifier, and the height shows the C P U time (in seconds) consumed by each classifier. The machine used was an Intel PC with a single 2.66GHz processor and a 1GB R A M . The C P U time reported here is the average time of the 10-fold cross validation experiments. In other words, the CPU time reported is the average n SVM mm D W D • I RVM I I SMLR I I GP-Laplace H GP-EP 2 Dataset 35 Classifier Programming Languages Used Source of Code SVM C++ (Matlab mex interface) [32, 6] DWD Matlab [31] RVM Matlab [23] SMLR Matlab [23] GPC with Laplace C [7] GPC with EP C [7] Table 3.3: 10-fold cross-validation accuracy without feature selection time needed to train with 9-fold data and to test with the remaining 1-fold data. Additionally, for all methods, this CPU time includes the time of loading the data. From some point of view, it is not fair to compare the CPU costs as these classification methods are not implemented in the same programming languages. We report the CPU costs here not aiming to compare the absolute values but to provide some information about the running times of various classifiers. What is important is whether the running time is acceptable, and whether one CPU cost is significantly different from another while the two have comparable accuracies. A summary of the implementations are provided in Table 3.3. Al l six classification methods have reasonably low CPU cost with the maximum cost 131 seconds) occurs when classifying the Lung cancer data (dimensionality = 19624) using Gaussian processes with EP. For all three data sets, SVM, DWD, RVM and SMLR give very similar low cost regardless dimensionality of the data. GP classifiers have higher costs than the other four classifiers over all three data sets. The difference becomes significant when dealing with the two data sets with higher dimensions. It is safe to say that the former four classifiers have much better scalability than Gaussian process classifiers. 36 Chapter 4 Comparing Feature Selection Methods for Kernel Machines As mentioned in Chapter 1, there are often tens of thousands of genes in a gene expression data set generated by high-throughput technologies such as microarray and SAGE, but probably only a small subset of informative genes are relevant to the phenotypes under investigation. Identifying informative genes is a very important research problem in bioinformatic research. For example, informative genes can be used in the development of efficient cancer diagnosis and in the design of effective drugs. In this chapter, we evaluate and compare existing feature selection methods in the context of gene expression analysis. 4.1 Data Sets and Preprocessing We use the same gene expression data sets as in previous chapter (see Table 3.1). We split the training and test data in the same way as in previous studies for fair comparison. According to the source of Leukemia data set, their training set consists 38 samples (27 A L L and 11 AML), and their test set consists 34 samples (20 A L L and 14 AML). Since there is no defined training and test set from She 37 Dataset Number of Training Samples Number of Test Samples Number of Genes Colon 32(11+/21-) 30(11+/19-) 2000 Leukemia 38(ll+/27-) 34(14+/20-) 7129 Lung 14(6+/8-) 12(5+/7-) 19624 Table 4.1: Gene expression data used for feature selection source of Colon data, we randomly split the data into 32 samples for training and 30 samples for testing. Similarly, we randomly split the Lung cancer data into 14 samples for training and 12 samples for testing. Table 4.1 provides a summary of the characteristics of the data sets. There are many different ways to preprocess the data. For example, Golub et al. [19] first normalizes the feature vectors of training samples, and then normalizes the test samples by subtracting the mean and dividing the standard deviation of the training samples. It makes sense to normalize the training and test samples separately as. in [19], because usually the test data is not available when we build classifiers from the training data. However, the sample size of our data is very small, so it would be better if we merge the training and test samples, and normalize them together. One drawback about this type of normalization is that we need to rebuild our classifier before predicting a new test sample, but we gain some improvement in the quality of the normalized data, which we think more important. Therefore, we merge the training and test samples, and normalize the data so that expression values for each gene i have a zero mean and unit variance. 4.2 Recursive Feature Elimination Recursive Feature Elimination is a feature selection algorithm proposed by Guyon et al. in [20]. RFE performs feature selection by iteratively training a classifier with current set of features and remove the feature with the smallest ranking criterion 38 Algorithm S V M - R F E ( X 0 , y) { /* input: training samples XQ = [ x i ,X2 , ...,Xfc, . . . , x ; ] T where x^ € RN */ / * input: class labels y = [yi, y 2 , y k , Vi}T */ / * output: feature ranked list r */ (1) Initialize: Subset of surviving features s = [1, 2 , n ] Feature ranked list r = [] (2) Repeat: until s=[] Restrict training examples to good feature indices X = X0(:,s) Train the classifier a =SVM-train(X, y) Compute the weight vector of dimension length(s) Compute the ranking criteria Ci = (^i)2> for all i Find the feature with smallest ranking criterion / — argmin(c) Update feature ranked list r = [s(/),r] Eliminate the feature with smallest ranking criterion s = s( l : f - l , f+l:length(s)) } /* end algorithm */ Figure 4.1: S V M - R F E Algorithm [20] at each iteration. According to the order of elimination, a ranked feature list r is generated at the end: the earlier a feature is eliminated, the lower its rank is. When S V M , usually with linear kernel, is used as the training classifier, the algorithm is referred to as S V M - R F E . Figure 4.1 presents an outline of S V M - R F E algorithm in the linear case. S V M - R F E can also be generalized to remove more than one feature per iteration for speed reasons. In this chapter, unless otherwise stated, we use S V M -R F E by eliminating one gene at a time. Given a ranked list r, we can obtain subsets of genes by selecting top K genes in the list. By varying the value of K, we can obtain multiple subsets. In the case of Colon data, we get 2000 subsets with size ranging from 1 to 2000. As mentioned in Section 1.3, ideally we want to find the 39 subset of genes with which the test accuracy is the highest. We define optimal subset to be the subset of features with which a classifier achieves the highest test accuracy, and the smallest optimal subset is called the biomarker set. According to [20] and [16], the features selected matter more than the clas-sifier used. Therefore, we evaluate the quality of subsets of genes using SVM only. Figure 4.2 shows the accuracies of SVM on test data using various subsets of genes. Table 4.2 presents a summary of optimal subsets selected using test accuracy. For the Colon data set, SVM achieves best performance of 5 test errors (i.e. 83.3% ac-curacy) with 53 different optimal subsets, and the smallest optimal subset contains 416 genes. For the Leukemia data set, there is only one single optimal subset with size of 5. For the Lung cancer data set, any subset with first 176 genes included are optimal subsets; If the ranks returned by SVM-RFE reveal the true ranks of genes in terms of relevance, and furthermore if we adopt the assumption described in [7] that including a relevant gene lowers the test error and including an irrelevant gene increases the test error, then the ideal curve of accuracies of SVM with respect to various subsets should have continuous global optima in an interval starting at the biomarker set. The curve with multiple spaced optima (e.g. zigzag curve) is not desirable, because it implies that relevant features and irrelevant genes are not separated but somehow mixed together. As discussed earlier, in real world problem, the test accuracy is not available beforehand. Therefore, we have to use the LOO accuracy on the training data as the criterion to select optimal subsets and/or the biomarker set. Table 4.3 gives a summary of optimal subsets selected by LOO accuracy on the training set. The LOO accuracies on the training set are all 100%. If the optimal subsets are reliable, we expect the test accuracies with these optimal subsets to be very high. In Figure 4.2, we show both the LOO accuracies on the training set and the test accuracies 40 Dataset Optimal Subsets Al l Genes # opt. sets test accu. # genes in biomarker set test accu. # genes Colon 53 83.3 416 76.7 2000 Leukemia 1 97.1 5 82.3 7129 Lung 19449 91.7 176 91.7 19624 Table 4.2: Evaluation of the feature ranks given by SVM-RFE—optimal subsets selected using test accuracies. An optimal set is a subset of genes with which the classifier SVM achieves the highest test accuracy. The smallest optimal set is called the biomarker set. Test accuracy for the optimal subsets is the classification accuracy of SVM on test data using genes in the optimal subsets only. Dataset LOO Accu. on Training # Genes # Optimal Subsets Colon 100 5 126 Leukemia 100 2 4821 Lung 100 1 19624 Table 4.3: Evaluation of the feature ranks given by SVM-RFE—optimal subsets selected using LOO accuracies on the training data. An optimal set is a subset of genes with which the classifier SVM achieves the highest LOO accuracy over the training samples. The smallest optimal set is called the biomarker set. LOO accuracy for the optimal subsets is the LOO accuracy on the training set using genes in the optimal subsets only. on the test set with various subsets of genes. The curve of LOO accuracy on the training data and the curve of test accuracy do not achieve the optima at the same gene subsets. This means the biomarker set selected using the training set is not identical to the true biomarker set selected using the test accuracy. Additionally, as shown in Figure 4.2, there is a big gap between the two curves, which implies the problem of overfitting. Additionally, as discussed above, a large number of discontinuous optimal subsets are not desirable as that implies the unreliability of ranks'of the features. 41 Colon Dataset Colon Dataset • LOO Accuracy on Training Accuracy on Test 500 1000 1500 Number of Genes (a) Actual Scale Leukemia Dataset LOO Accuracy on Training Accuracy on Test 0 1000 2000 3000 4000 5000 6000 7000 Number of Genes (c) Actual Scale Lung Dataset LOO Accuracy on Training Accuracy on Test 5000 10000 Number of Genes (e) Actual Scale 15000 0.8 a o.6 0.2 ;- s , ln. LOO Accuracy on Training Accuracy on Test 10 10 10 10 Number of Genes (b) Log Scale on Number of Genes Leukemia Dataset V ... 08 a o.6 02 LOO Accuracy on Training Accuracy on Test 10 10 10 10 Number of Genes (d) Log Scale on Number of Genes Lung Dataset 08 02 LOO Accuracy on Training Accuracy on Test 10 10 10 Number of Genes (f) Log Scale on Number of Genes 10 Figure 4.2: Results of forward selection: accuracies of SVM with various number of genes ranked by SVM-RFE. The X-axis represents the number of genes, and the Y-axis represents accuracies (test accuracies and LOO accuracies on training data). The X-axes in (a), (c), and (e) are in actual scales, while the X-axes in (b), (d), and (f) are in logarithmic scales. 12 4.3 G P C with A R D Automatic Relevance Determination (ARD) ([27], [35]) is one of the most successful Bayesian methods for feature selection and sparse learning. We focus on exploiting ARD to perform feature selection on high dimensional gene expression data. When used in feature selection, ARD is a hierarchical approach where the relevance of each input feature is represented by a hyperparameter. In the context of Gaussian process classifier, ARD hyperparameters can be directly embedded into the covariance function as in Equation 3.8. Let Wd be the weight of the d-th feature and a"1 be the variance of Wd, then we have p(wd\ad) = A/"(0, a - 1 ) . Obviously, if ad —> oo, then variance a~l —*• 0, and weight Wd —> 0, which implies the feature is irrelevant. If ad <C oo, then the variance is finite, and the weight can vary, which indicates the d-th feature is relevant. ARD optimizes a = arg maxQp(y|X, a). During optimization some ad approaches to oo, and the corresponding irrelevant features will be eliminated. Figures 4.3—4.5 show the distribution of ARD values (i.e. weights) returned by GPC-Laplace and GPC-EP for three data sets. From the figures, we can see that many ARD values returned by GPC are very close to zero but not zero. Therefore, to select a subset of genes, one needs to specify a threshold for ARD values. For all of the three datasets, ARD values returned by GPC with Laplace are more spread out than that returned by GPC with EP. Sparser ARD values make it easier to specify a threshold and give more information about the ranks of features in terms of relevance. In the next chapter, we will show how to effectively select features utilizing the ARD values returned by GP-Laplace. For now, we know ARD values indicate the level of relevance (or rank) of a feature: feature with higher ARD value has higher rank in terms of relevance. Therefore, similar to SVM-RFE, GPC with ARD returns a ranked list of genes. 43 Figure 4.3: Distributions of A R D values for the Colon data set 44 Distributions of A R D Values for Leukemia Dataset - 4 - 3 - 2 A R D Values in Logarithmic Scale I G P C - L a p l a c e 1000 -1.86 -1.84 1.82 -1.8 -1.78 -1.76 A R D Values in Logarithmic Scale I G P C - E P -1.74 -1.72 Figure 4.4: Distributions of A R D values for the Leukemia data set 45 Figure 4.5: Distributions of ARD values for the Lung cancer data set 46 Dataset Algorithm Optimal Subsets # opt. sets test accu. # genes in biomarker set Colon GP-Laplace 191 80 8 GP-EP 1 83.3 4 Leukemia GP-Laplace 1 94.1 27 GP-EP 11 94.1 1 Lung GP-Laplace 1 100 3 Table 4.4: Evaluation of the feature ranks given by GPC-ARD—optimal subsets selected using test accuracies. Compare to Table 4.2. Dataset Algorithm LOO Accu. on Training # Genes # Opt. Sets Colon GP-Laplace 93.8 3 3 GP-EP 96.9 29 21 Leukemia GP-Laplace 100 9 4102 GP-EP 100 5 3430 Lung GP-Laplace 100 1 19624 Table 4.5: Evaluation of the feature ranks given by GPC-ARD—optimal subsets selected using LOO accuracies on the training data. Compare to Table 4.3. Forward selection is needed to select the biomarkers. We evaluate the reliability of ARD values as a ranking criteria in a similar way as we evaluate SVM-RFE. The performance of SVM with various subsets of genes are reported in Tables 4.4—4.5 and Figure 4.6, where the columns and figures have similar meanings as in Table 4.2-4.3 and Figure 4.2 . • Similar to SVM-RFE, the curve with multiple spaced optima (e.g. zigzag curve) in Figure 4.6 is not desirable, as it.indicates the unreliability of the ranking of features. 'Qi et al. [37] have shown that evidence optimization can lead to over-fitting by picking one from many classifiers that correctly classify the data. This is consistent with our findings in Figure 4.6, where the LOO accuracies on the training data are significantly better than the test accuracies. In next chapter, we will see how a resampling technique, such as cross validation can be used to address the problem of overfitting. 47 Colon Dataset Colon Dataset • LOO Accuracy on training Test Data 10 10 10 Number of Genes (a) G P C with Laplace Leukemia Dataset - LOO Accuracy on training Test Data 10 10 Number of Genes (c) G P C with Laplace 08 •g o.6 04 02 ~VAu LOO Accuracy on training Test Data 10 10 Number of Genes (b) G P C with E P Leukemia Dataset 0 8 1 0.6 - LOO Accuracy on training Test Data 10 10 Number of Genes (d) G P C with E P Lung Dataset 2 0.6 o 2 r o ^ 10° - LOO Accuracy on training Test Data 10 10 10 Number of Genes (e) G P C with Laplace 10 Figure 4.6: Results of forward selection: accuracies of SVM with various number of genes ranked by ARD values. The X-axis represents the number of genes, and the Y-axis represents accuracies (test accuracies and LOO accuracies on training data). GPC-Laplace is used to get the ARD values for (a), (c), and (e). ARD values from GPC-EP are used to get (b) and (d). 48 From Figure 4.2 and 4.6, we can see that if appropriate subset of genes are selected, we can improve the classification capability of the gene expression data. To evaluate SVM-RFE and GPC-ARD in identifying biomarkers, we use LOO accuracy on training data to select a biomarker set and compute the test accuracy using genes in biomarker sets only. The results are included in Table 4.6. 4.4 R V M and S M L R on Feature Space Sparse classification algorithms, such as RVM and SMLR, obtain sparse solutions for regression and classification while maintaining their Bayesian interpretation. Sparsity-promoting priors are used to encourage the weight estimates to be either significantly large or exactly zero. As mentioned in Chapter 3, if we use original features or their transformed forms as the basis functions, then the learned classifier can be considered as a feature weighting algorithm which returns sparse weights. Therefore, in the context of gene expression analysis, unlike SVM-RFE which simply ranks genes, RVM and SMLR automatically select one optimal subset of genes, which is the biomarker set. In this section, we assess the quality of biomarkers selected by these two sparse classifiers. Table 4.6 presents the test accuracies using biomarker sets selected by the five feature selection algorithms considered in this chapter. As mentioned earlier this Chapter, the training and test data for the Leukemia data are specified by the source of the data, but there are no predefined training and test data for the other two data sets. We randomly partitioned the data set into the training set and the test set. For the Leukemia data, test accuracies for all feature selection methods are higher than the accuracy with all genes. However, all feature selection methods fail on the other data sets. We suspect that the failure might be caused by Test Accuracies of SVM with biomarker sets selected by various feature se-.49 lection methods. The numbers (e.g. 76.7%) below the data sets (e.g. Colon) are the test accuracies using all genes without any feature selection or feature weighting. The training and test data for the Leukemia data are predefined. For this data set, test accuracy with any biomarker set is higher than the accuracy with all genes. The failure of feature selection methods on the other data sets might be due to the possible unfortunate partition of training and test data. The results from one partition may not reveal the true performances of fea-ture selection methods. Therefore, we do. the following: we first split the data into 10 folds to get f l , f2, ... , flO, and then leave the fold fi (i = 1 , 2 , 1 0 ) out, select biomarker set 6; using the remaining 9 folds/ We then train the 9 folds with 6; only using several classifiers and compute the number of errors when classifying the test data fi. The total number of errors in 10 folds are reported in Table 4.7. Now for the Colon data, the classification errors are reduced by using feature selection methods. We exclude the Lung cancer data set from now on, because several methods can not handle it. According to the results in Table 4.6 and Table 4.7, there is no feature selec-tion method that consistently outperforms other methods in terms of classification accuracies. In the next chapter, we shall see how to further improve SVM-RFE and GPC with ARD by utilizing cross validation techniques. We shall also evaluate the sensitivity of the feature selection methods in selecting features with respect to changes in the training data. 50 Dataset Algorithm Test Accu. # Genes (%) in Biomarker Set Colon SVM-RFE 66.7 5 (76.7%) GPC-ARD-LAP 66.7 3 GPC-ARD-EP 73.3 29 RVM 63.3 3 SMLR 63.3 16 Leukemia SVM-RFE 88.2 2 (82.3%) GPC-ARD-LAP 88.2 9 GPC-ARD-EP 88.2 5 RVM 91.2 2 SMLR 94.1 20 Lung SVM-RFE 66.7 1 (91.7%) GPC-ARD-LAP 83.3 1 SMLR 83.3 9 Table 4.6: Test Accuracies of SVM with biomarker sets selected by various feature selection methods. The numbers (e.g. 76.7%) below the data sets (e.g. Colon) are the test accuracies using all genes without any feature selection or feature weighting. The training and test data for the Leukemia data are predefined. For this data set, test accuracy with any biomarker set is higher than the accuracy without feature selection. The failure of feature selection methods on the other data sets might be due to the unfortunate partition of the training and test data. Dataset Algorithm Total # of Errors Avg. Size of of Biomarker Sets Colon All genes 17 2000 SVM-RFE 14 9 (N = 62) GPC-ARD-LAP 15 21 RVM 10 6 SMLR 17 30 Leukemia All genes 2 7129 SVM-RFE 9 4 {N = 72) • GPC-ARD-LAP 8 5 . RVM 7 5 SMLR 2 27 Table 4.7: Total number of errors in 10 experiments and the average size of biomarker sets. The number (e.g. N = 62) below the data sets (e.g. Colon) is the total number of validation cases in the 10-fold cross validation. No feature selection method outperforms all other methods in terms of classification accuracies on both data sets. 51 Chapter 5 Cross Validation Optimization An intuitive solution to solve the problem of low sample size is resampling. Com-monly used resampling methods that generate new data from the distribution of existing data, such as interpolation and extrapolation, do not help in the case of gene expression analysis for the following reasons. First, it is not reasonable to model the distribution from a small number of samples. Second, the conclusion drawn from synthetic data may not be applicable to real-world problems. There-fore, the only option left is to use cross validation. In this way, we can make the best use of the available real data. In the literature, the qualities of feature selection methods are only evaluated by classification accuracies. We suggest to use another criterion, called stability, to evaluate how sensitive a feature selection method is to the changes of training data. In this chapter, we first evaluate the stability of RVM and SMLR in selecting features, and show why they are not suitable to be optimized using cross validation. We then present two existing feature selection methods that utilize cross validation, and then propose our own method. Experimental results show effectiveness and efficiency of our algorithm. 52 Experiment Features (indices) 1 [356 377 765 1969 1924 1976] 2 [356 377 765 1769 1859 1976] 3 [356 377 765 1769 1859 1976] 4 [353 377 765 1013 1024 1759] 5 [356 377 765 1769 1859 1976] 6 [353 377 493 765 1050 1757 1769 1823] 7 [350 377 493 765 792 1769 1823 1976] 8 [377 700 765 1482 1769 1859] 9 [356 377 765 1769 1859 1976] 10 [377 717 1353 1419 1555] Table 5.1: Features selected by RVM for the Colon data set 5.1 Stability of Feature Selection Methods Ideally, if only a small portion of the training data set is changed, we expect that the set of features selected by a feature selection algorithm does not change much. Stability measures how stable a feature selection algorithm is in selecting features. We evaluate the stability of RVM and SMLR as follows. We •first split the data into 10 folds, and each time one fold is left out and the remaining 9 folds are used for training. We end up with 10 different optimal subsets of features returned by the 10 different training sets. Table 5.1 lists the 10 optimal subsets of genes selected by RVM with the Colon data set. It is hard to interpret the stability of a feature selection algorithm using a table like Table 5.1. Therefore, we propose to use a single scalar value—stability score S—to represent the stability of a feature selection algorithm with a particular data set. We first define the following terms and notations before giving the definition of stability score S. • k—the number of folds we divide the data set into (e.g. k = 10). • D\l—the training set containing all folds but the i-th one. • Fi—the feature set selected using 53 • \A\—the cardinality of set A. • Aj—the number of features that appear in i subsets. For example, Ni repre-sents the number of features that only appears in one subset, and A^ represents the number of features that appear in all subsets. We now define the stability score S as follows: E i 2 q x A y i t i * i i S = ^=2ykk (5.1) Since RVM only selects a small number of genes, we ignore the relative ranking of selected genes and only care about which subset of genes are selected. The denominator is the cardinality of the union set of all feature sets, and the numerator is the weighted sum of the number of common genes. The score 5 is defined such that it has a range of [0,1]. If all feature sets are identical, then 5 = 1 because: (1) F{ = Fj: V i,j = l,2,...,fc; (2) \jLiFi = Fv for anY 3 = 1,2,...,k. Therefore, l U ^ i - ^ l = l^ih (3) Nk = \Fi\, and N3 = 0 for all j ^ k; and finally y^k_ (i-xN) J\I S = F i = JF\] = ^ n t n e c o n t r a s t ) if a u feature sets are disjoint, 5 = 0 as every selected feature is selected once, i.e. Ni = \\Ji=1Fi\, and Nt = 0 for i = 2,3, ...,k. This is the reason why we omit i = 1 in the numerator of Equation 5.1. Table 5.2 shows the number of occurrences of all features selected by RVM on Colon data set. We can compute the stability score of RVM on this data set as: g _ f X 3 + | x 2 - f | x l + ^ x l + | x l + | g x l _ ^ Table 5.3 reports the stability scores of RVM and SMLR on Colon and Leukemia data sets. RVM usually selects about 6 features, while SMLR selects 20-30 features in each training. According to our stability evaluation criterion, SMLR is more stable than RVM, as the former has higher stability scores with both data sets. If 54 # Occurrences Features (indices) # Features (Ni) 1 {350, 700, 717, 792, 1013, 1024, 1050, 1353, 1419, 1482, 1555, 1757, 1759, 1924} Ni = 14 2 353, 493, 1823 iV 2 = 3 3 - -4 - -5 356,1859 N5 = 2 6 1976 N6 = l 7 - -8 1769 N8 = 1 9 765 N9 = l 10 377 N10 = l Table 5.2: Number of occurrences of features selected by RVM for the Colon data set RVM SMLR Colon Leukemia 0.213 0.150 0.243 0.321 Table 5.3: Stability scores of RVM and SMLR on the Colon and Leukemia data sets every feature is shared by half of the feature sets, then we expect the stability score to be 0.3. We propose that the stability of a feature selection algorithm is acceptable if its stability score is at least 0.3. In our case, each training set is different by one fold (or partition) of data samples rather than one single data sample. Therefore, we lower our acceptable score to be 0.2. According to our criterion, the stability of RVM is not acceptable as its stability score for the Leukemia data is only 0.15. In summary, both RVM and SMLR select a small number of features, and SMLR is more stable than RVM. The idea of using cross-validation to improve the stability of feature selection algorithms is to average the weights of features selected using different subsets of training samples. 55 Initialize Ranked feature set R = []; Selected subset S = [l,...,d] Repeat until all features are ranked a) Train t linear SVMs on subsets of the original training data, with features in set S as input variables b) Compute and normalize the weight vectors using Equation 5.2 c) Compute the ranking scores c% for features in S using Equation 5.3 d) Find the feature with the smallest ranking score: e = arg minj a e) Update: R = [e, R], S = S - [e] Output Ranked feature list R Figure 5.1: Algorithm of M S V M - R F E [14] 5.2 Multiple S V M - R F E In [14], Duan et al. proposes a new feature selection algorithm called M S V M - R F E (Multiple S V M - R F E ) based on S V M - R F E (Figure 4.1) by adding cross validation into each feature elimination step. Unlike S V M - R F E , which trains S V M once us-ing all training samples, M S V M - R F E trains multiple SVMs using different subsets of training samples at each step, and computes the ranking score from a statisti-cal analysis of weighted vectors of multiple SVMs. The outline of M S V M - R F E is presented in Figure 5.1. Let v/j be the weight vector returned by the j t h S V M and Wji be the weight value associated with the i th feature. The weight vectors are normalized as: w > = i w i ( 5 ' 2 ) Let Vji = w^, and the ranking score is computed as Vi Ci = — (5.3) where Vi — \ Ej=i vji a n d °^ = w ; t - i 56 5.3 Forward Feature Selection with Gaussian Processes In [8], Chu et al. proposed a feature selection algorithm based on Gaussian processes to identify biomarkers for tasks with ordinal labels. We refer their algorithm as GP-ARD-FS. Figure 5.3 presents the outline of their algorithm. In the phase of preprocessing, the goal is to discard a significant amount of genes that have low discriminative capabilities. They adopt the idea in paper [13] and use Wilcoxon rank sum statistic as the ranking criterion. Let Xi be the vector that consists of expression values of all samples for gene i. Split Xi into X^~ and X~', where X^ and X~ are vectors of expression values of samples from positive class and negative class, respectively. For each gene i, Wilcoxon rank sum test is performed to test the null hypothesis that X* and X~ come from distributions with equal medians. The major steps of informative gene selection using Wilcoxon rank sum test are as follows [13]: • Set significance level a = 0.01. • Compute Wilcoxon-statistics for every gene i using Xf and X~. • Use the statistics to compute the corresponding p-values. • Select the genes whose p-values are smaller than the significance level a. Smaller p-values cast doubt on the hypothesis that the distributions of samples from two different classes are identical. In other words, genes with smaller p-values have high discriminative capabilities. The original Colon data set consists of 2000 genes. There are only 373 genes significantly differentially expressed at the significance level of a = 0.01. Therefore, after preprocessing the dimensionality of the data set is reduced to 373. We illustrate the remaining steps of the algorithm using a concrete example. Suppose the data set after preprocessing is D = {(xi,yi)\i = 1,2, ...,9} and Xi £ M 6 57 and Yi £ {±1}- Notice that the aim of this concrete example is to elaborate the algorithm. In real gene expression data, the dimensionality is much higher than the sample size. Figure 5.2 illustrates the algorithm step by step using a running example. 5.4 G P - A R D - R F E In comparison with other feature selection techniques, GP-ARD-FS is effective in selecting relevant genes. However, it has several drawbacks which we will discuss later. We propose a new algorithm called GP-ARD-RFE aiming to overcome the drawbacks of previous algorithms. Figure 5.7 presents an outline of our algorithm. Figure 5.4 - 5.6 explain our algorithm with two concrete examples. In the phase of preprocessing, GP-ARD-FS uses the Wilcoxon rank sum test to eliminate genes that are not significantly differentially expressed at a significance level. A statistical test—either t-test or non-parametric Wilcoxon rank sum test—is used to determine the statistical significance of an observation. In the context of gene expression analysis, statistical test assesses the discriminative capability of a gene by testing if the distribution of normal samples is identical to that of tumor issues for this particular gene. One major shortcoming of using the rank sum test is that data of low sample size does not reveal trustworthy information about the distribution. Alternatively, we propose to use the average ARD values to eliminate a portion of irrelevant genes at the early stage. This procedure is presented under "RFE" in Figure 5.7. During the phase of feature elimination, we adopt the settings of [20] and recursively eliminate chunks of features at a time. At the first iteration, we keep the number of genes, which is the closest power of 2. For example, the original Colon cancer data set contains 2000 genes. At the first iteration, we will 58 2.1 3.1.1 Let k=3, then the 3 folds are: f1 = { (x1, y1), (x2, y2), (x3, f2 = { (x4, y4), (x5, y5), (x6, f3 = { (x7, y7), (x8, y8), (x9, Training sets: D^1 = { f2 , f3 } D X 2 = {f1, f3 } - Q\3= { f 1 , f2 } y3)} V 6 ) } y9)} 3.1.3 Genera te d subsets by adding one top-ranked gene each time. s11=[g2], s12=[g2, g4] s16=R1 S21=[g4], s22=[g4, g2] s26=R2 s31=[g1], S32=[g1, g4] s36=R3 3.1.4 ->\i to n \ i / r* Restrict D X l to DX l(:, sij), and compute L O O error, eij, using G P - L a p l a c e on p\ i (:,sij). E1=[ e11, e12 , e13 , e14, e15 , e16 ] E 2 = [ e 2 1 , e 2 2 , e23 , e24, e25 , e 2 6 ] E 3 = [ e 3 1 , e 3 2 , e33 , e34, e35 , e36 ] fl fl Opt imal A R D Values: 0 1 = [0.3, 1.2, 0.05, 0.4, 0.02, 0.001] 0 *= [0.05, 0.1, 0.01, 0.2, 0.04, 0.008] 0 3 = [2 .1 , 0.6, 0.1, 0.8, 0.3, 0.2] 3.1.2 Sort the genes in descend ing order of optimal A R D va lues. R1 =[g2,g4,g1,g3, g5, g6 ] R2 = [g4, g2, g1,g5, g3, g6 ] R3 = [g1,g4, g2, g5, g6, g3 ] 3.1.5 S u p p o s e min(E1 )=e11, min(E2)=e22, and min(E3)=e33. The minimal gene subsets : M 1 = [ g 2 ] M2=[ g4, g2 ] M3=[g1, g4, g2 ] 5.1 & 5.2 (1) Genera te 3 gene subsets : s1=[g2], s2=[g2,g4], and s3=[g2,g4,g1], (2) Restrict D to D(:,si) and compute the L O O error ei using G P - L a p l a c e on D(:,si). E = [e 1, e2 , e3]. (3) S u p p o s e min(E)=e2, then the minimal subset of relevant genes are: M* = [ g2, g4 ]. Rank the genes in the union set of M 1 , M 2 and M 3 by the number of t imes each gene appears in M i . Breaking ties by average A R D va lues. The ranked gene list is: R = [ g2 , g4,g1] Figure 5.2: Illustration of GP-ARD-FS algorithm: a running example 59 keep 1024(= 2 1 0) genes. At subsequent iterations, we eliminate half of current subset of genes until the number of remaining genes is 2P, where p is specified by the user. Empirically, p = 8 is a good choice. Thus, the number of genes of the Colon data set after each iteration are: 1024, 512 and 256. We randomly split the data into fc folds as usual. One fold is left out in turn and the remaining fc — 1 folds are passed to the classifier of GP-ARD with Laplace for training. The classifier carries evidence maximization and returns the optimal 9*, a vector of optimal values of ARD parameters. Let 9* be the optimal ARD values returned by GP-ARD trained with all folds except the ith fold. Based on the optimal ARD values 9*, the genes are then sorted in descending order of relevance, and a rank list Ri is generated. The ARD values returned by GP-Laplace are not calibrated. Before we compute the average of ARD values, it is important for us to normalize the ARD values so that the norm of 9* is 1 for all i . In this way, each 9* contributes to the average ARD values fairly. For example, in Figure 5.4, 9% has larger values than 9\ in general. Without normalization, 9\ would have greater impact on the average ARD values than 9*. Specifically, we normalize 9* as follows: (5.4) Average ARD value is not the only criterion we use to decide which genes to eliminate. Suppose the number of genes to keep at this iteration is m. We identify the following two groups of genes: • L—the set of features whose rank is lower than m in all Ri L = {g\geC,g#Ri(l:m),Vi} • T—the set of features that are within top m in all Ri. T={g\geC,g€Ri'l:m),Vi} 60 Genes in T survive this iteration, and genes in L are eliminated regardless of their ARD values. The idea is that if everyone thinks that gene g should stay (or leave), then we should not eliminate (or keep) it, respectively. For example, in Figure 5.4, g5 is ranked within top 4 in all Ri's (see 2.1.1.3), but its rank in S according to average ARD values is the fifth (see 2.1.2). We keep g5 as it is a common top gene. Suppose the number of genes in T is |T|, and the number of genes that should survive at this iteration is in, then we have m — \T\ spaces left for surviving genes. We choose these genes according to their average ARD values. We iteratively eliminated chunks of genes until 2P (e.g. 256) genes left. We illustrate feature elimination using a concrete example as shown in Figure 5.4 - 5.5. Suppose the data set consists of 9 samples: D — {{xi,yi)\i = 1,2, ...,9}. Each data sample contains 6 genes at the beginning. That is, x^ € R 6 . We set p — 1, and thus the number of features left after RFE is 2 1 = 2. After the number of features are reduced to 2P, we perform fc-fold validation again on data with these 2P genes: we get optimal ARD values, normalize them, sort genes in descending order of average ARD values, and finally generate a rank list S. No genes are eliminated at this iteration, but the ranked list S is produced according to average ARD values. This ranked list will be used in subsequent forward selection. In the phase of forward selection,' we add one top-ranked gene in R each time into a gene subset r, and carry out LOO cross validation using GP-Laplace without ARD with genes in r as input variables. Alternatively, SVM can also be used in forward selection. In either case, the gene subset with the lowest LOO error is identified as the minimal subset of biomarkers. In summary, the advantages of using GP-ARD-RFE over GP-ARD-FS in-clude: (1) Feature elimination utilizing average ARD values is more reliable than Wilcoxon rank sum test due to the low sample size setting of gene expression data. 61 All genes 373 genes (GP-ARD-FS) 1024 genes (GP-ARD-RFE) 512 genes (GP-ARD-RFE) LOO Accuracy 80.65 82.26 88.71 95.16 Table 5.4: Ineffectiveness of Wilcoxon rank sum (a preprocessing step used by GP-ARD-FS) in eliminating irrelevant features. LOO Accuracies using different subsets of genes: (1) "All genes"; (2) top 373 genes selected by Wilcoxon rank sum test with p = 0.01; (3) top 1024 genes selected by GP-ARD-RFE; (4) top 512 genes selected by GP-ARD-RFE. LOO accuracy with the top 373 genes is not significantly higher than the accuracy without feature selection, and is even lower than the accuracy with the top 1024 genes selected by GP-ARD-RFE. Table 5.4 displays the LOO error using different subsets of genes on Colon data. In the preprocessing step of GP-ARD-FS, there are 373 genes significantly differentially expressed in the rank sum test at the significance level of p = 0.01. The LOO accuracy using these 373 genes is 82.26%, showed in the second column, is not significantly greater than the LOO accuracy (80.65%)using all genes. However, if we use GP-ARD-RFE to eliminate irrelevant genes, the LOO accuracies are significantly improved: the LOO accuracy with top 1024 genes is 88.71%, and the accuracy with top 512 genes is 95.16%. We report these two accuracies in the third and fourth column. To sum up, rank sum is not as reliable as GP-ARD-RFE in eliminating irrelevant features. (2) ARD values are normalized so that ARD values from each training con-tribute fairly to the average. (3) GP-ARD-RFE is more efficient than both MSVM-RFE and GP-ARD-FS in terms of CPU cost. Take GP-ARD-FS as an example, the most costly step is step (3.1.4)—carrying out LOO validation on every training set. Suppose there are N training samples in TJ-dimensional space, then there will be a total of K • N • D validations to be performed for K-fold experiments. Whereas, in GP-ARD-RFE, costly forward selection is only performed once at the end. (4) In GP-ARD-FS, many important features can be discarded in step (3.1.5), 62 as usually the minimal gene set Mi is very small and only features in Mj are candi-dates of the final biomarker set M*. Instead of finding the optimal minimal gene set at an early stage, we use a more conservative approach—we discard unimportant features to avoid possible exclusion of relevant features. There are some variations of our algorithm. For example, G P - E P rather than GP-Laplace can be used as the classifier in forward selection. Moreover, if multiple subsets achieve the same L O O accuracy, we can use the posterior probabilities to break the tie. 5.5 Experimental Results We conduct an experimental evaluation on various feature selection algorithms using Colon cancer data set and Leukemia data set. We exclude Lung data set in the comparisons as its dimensionality is too high for some algorithms as we discussed in Chapter 4. We evaluate the following feature selection methods: S V M - R F E , R V M , S M L R , GP-Laplace, M S V M - R F E , G P - A R D - F S , and G P - A R D - R F E , where the last three methods utilize cross validation. We use two criteria to do the evaluations: classification accuracy using biomarkers and stability in selecting features. 5.5.1 Comparison Using Stability In Section 5.1, we compute the stability scores of R V M and S M L R using Colon and Leukemia data sets. We first split the data into 10 folds, and each time one fold is left out and the remaining 9 folds are used for training. We end up with 10 different optimal biomarker sets returned by the 10 different training sets. We then compute the stability score of the 10 biomarker sets using the formula 5.1. As a side effect, we may find some potential outlying samples at the Step 3.1 presented in Table 5.6, where we run forward selection to identify biomarkers. 63 Colon Leukemia SVM-RFE 0.12 0.14 GP-LAP 0.18 0.11 RVM 0.21 0.15 SMLR 0.24 0.32 MSVM-RFE 0.14 0.08 GP-ARD-FS 0.27 0.12 GP-ARD-RFE 0.27 0.28 Table 5.5: Stability scores of various feature selection methods on the Colon and Leukemia data sets. SMLR and GP-ARD-RFE are consistently more stable than other methods. For example, we found that when we compute the LOO error using biomarker sets, one sample in the Colon data set was constantly classified incorrectly using any biomarker set. Table 5.5 presents the stability scores of the five feature selection algorithms including RVM, SMLR, MSVM-RFE, GP-ARD-FS, and GP-ARD-RFE on Colon and Leukemia data sets. We found GP-ARD-RFE and SMLR outperform other feature selection algorithms in terms of stability. Therefore, the genes selected by GP-ARD-RFE and SMLR are more trustworthy. 5.5.2 Compar i son Us ing Classification Accu racy There are two ways to evaluate the feature selection methods using classification accuracies: (1) We can use all data as training data, select the biomarker set, and then compute the LOO accuracy with biomarker sets only using a classifier. This evalua-tion technique is used in [8]. The flaw of using this type of evaluation is that it does not reveal the generalization ability of the algorithms because all data was used to select biomarkers. (2) We first split the data D into 10 folds to get f l , f2, ... , flO, and then 64 Accuracy (Colon) Accuracy (Leukemia) Al l genes 72.6% (17/62) 9 7 . 2 % ( 2 / 7 2 ) SVM-RFE 77.4% (14) 87.5% (9) GP-Lap 75.8% (15) 88.9% (8) RVM 83.9% (10) 90.2% (7) SMLR 72.6% (17) 9 7 . 2 % (2) MSVM-RFE 79.0% (13) 94.4% (4) GP-ARD-FS 79.0% (13) 93.1% (5) GP-ARD-RFE 8 7 . 1 % (8) 9 7 . 2 % (2) Table 5.6: Leave-One-Fold-Out accuracies using biomarkers selected by various fea-ture selection methods. The number in brackets are the total number of errors in 10 folds. Part of the results are included in Table 4.7. GP-ARD-RFE achieves the highest accuracies on both data sets. leave the fold fi(i = 1 , 2 , 1 0 ) out, select biomarker set ftj using the remaining 9 folds. We then train the 9 folds with 6; only using several classifiers and compute the number of errors when classifying the test data / j . The accuracy is computed 100%, where N is the total number of samples in the whole data set D, and Nerror is the total number of errors in 10 folds. This evaluation techniques are used in [23]. Unlike the first evaluation strategy, the second strategy reveals generalization ability of the feature selection algorithms as the test data is not used to select biomarkers. Table 5.6 presents the results using the second evaluation strategy. MSVM-RFE extends SVM-RFE by utilizing cross validation, similarly, GP-ARD-FS and GP-ARD-RFE extend previous GP-ARD. According to Table 5.6, all of the three feature selection methods that incorporate cross validation techniques outperform their ascendants without cross validation. We find that GP-ARD-RFE outperforms other feature selection methods in terms of both accuracies and stabilities. 65 2. Initialization 3. Loop (2.2) 1. Preprocessing (1.1) Use the Wilcoxon rank sum test as a criterion to measure the variance of the expression values in different classes for each gene. (1.2) Eliminate genes that are not significantly differentially expressed in the rank sum test at the significance level of p = 0.01. Suppose the new data set after elimination is D = {(xi,Ui)\i = 1, ...,n} and Xi G Rd. (2.1) Split the data set D into k mutually disjoint folds. Let D^1 be the training set containing all folds but the i-th one. Initialize i = 1. (3.1) While i < k, let D^i be the tr aining data (3.1.1) Train GP-ARD-Laplace using D^1: maximizing the evidence P(D\l\9) several times and choose the one with the highest evidence as the optimal A R D values 9*. (3.1.2) Sort the genes in descending order of A R D values 9*, and get the ranked gene list Ri £ Rd. (3.1.3) Generate d gene subsets by adding one top-ranked gene in Ri each time such that subset su contains the first gene in Ri, and Si2 contains the top-2 genes in Ri, and so on. (3.1.4) For each gene subset Sij, j — 1 , d , carry out the L O O cross validation using GP-Laplace (without A R D ) on D^1 with only genes in . (3.1.5) Let set M j be the minimal gene subset that yielded the minimal L O O error in (3.1.4). (3.1.6) i = i + l (4.1) Rank the genes in the union set of { M i } ^ = 1 by the number of times each gene was selected in the k subsets { M i } ^ 1 . Genes with the same number of occurrences are further ranked by their average A R D values. Refer the ranked list as R. (5.1) Run forward selection on R, which means repeating steps (3.1.3)-(3.1.4) with Ri replaced by R and by D. (5.2) Let set M* be the minimal gene subset that yielded the minimal L O O error in (5.1). 6. Exit Return the minimal subset of relevant genes M*. 4. Ranking 5. Selection Figure 5.3: A l g o r i t h m of G P - A R D - F S [20] 66 1.1 1.2 Let k=3, then the 3 folds are: f1 ={(x1,y1), f2 = {(x4, y4), f3 = { (x7, y7), Training sets: D^ 1 = {f2, f3 } D\ 2 = { f l , f3 } D \ 3 = { f 1 , f 2 } (x2, y2), (x5, y5), (x8, y8), (x3, y3) } (x6, y6) } (x9, y9) } 2.1.1.3 Sort the genes in descending order of optimal A R D values. R1 = [g4, g1,g2, g5, g6, g3] R2 = [ g4, g6, g2, g5, g3, g1 ] R3 = [ g 6 , g 4 , g 1 , g 5 , g2, g3 ] 4J7- 2.1.1.4 Normalize A R D values using Eqn 5.4. The normalized values are:. A* Q 1 = [0.52, 0.42, 0.10, 0.63, 0.31, 0.21] [0.33, 0.42, 0.36, 0.48, 0.39, 0.45] § = [0.42, 0.32, 0.27, 0.48, 0.37, 0.53] 2.1.3.3-2.1.3.4 Find the subset of genes whose ranks are higher than or equal to m (=4) (i.e. 1,2, 3 or 4) in all Ri's. In this case, T=[g4, g5] Genes in T survive this iteration: M=T=[g4, g5]. 321 2.1.3.5 The number of spaces left for surviving genes: m-|M|=4-2=2. This gene is chosen from S: S=S-T-L=[g1,g6,g2]. The first two genes in S now are g1 and g6. Insert them into M and re-order: M=[g1 ,g4,g5,g6]. Current subset of genes: C=[g1,g2 g6] Size of C: d=6 Surviving subset of genes: M=[ ] The number of surviving genes at this iteration: m= 2 2 = 4 Set p=1, and thus the number of genes at the end of R F E is 2 =2. 2.1 ITERATION 1 2.1.1.1-2.1.1.2 Restrict training samples to D=D(:,C) Optimal A R D values: Q* = [0.5, 0.4, 0.1, 0.6, 0.3, 0.2] Q* = [1.1, 1.4, 1.2, 1.6, 1.3, 1.5] 0* = [1.6, 1.2, 1.1, 1.8, 1.4, 2.0] 2.1.2 Compute the average A R D values of genes in C using normalized A R D values: The average A R D values are: =[0.43,0.39,0.24,0.53,0.36,0.40] Sort the genes in descending order of the average A R D values. The ranked list is: S=[g4, g1, g6, g2, g5, g3] 2.1.3.1 -2.1.3.2 Find the subset of genes whose ranks are lower than m (=4) (i.e. 5 or 6) in all Ri's. In this case, L=[g3]. g3 is eliminated from C: C=[g1,g2,g4,g5,g6]. 2.1.4 Update: Current subset:C=[g1 g4 g5 g6] Length of C: d=4. Surving subset:M=[ ] Number of surviving genes at next iteration: m=4/2=2. Figure 5.4: Illustration of feature elimination of G P - A R D - R F E algorithm: a running example (Part 1) 67 Check condition: d>=2 holds. Enter the second iteration. 2.1 ITERATION 2 2.1.1.1—2.1.1.2 Restrict training samples to D=D(:,C) Optimal A R D values: g1 g4 g5 g6 Q*= [0.1, 0.8, 0.2, 0.6] Q*= [0.6, 1.3, 1.0, 1.4] 0 * = [1.2, 2.0, 1.6, 1.3] 2.1.2 Compute the average A R D values of genes in C using normalized A R D values: The average A R D values are: 0* =[0.25,0.67,0.39,0.54] Sort the genes in descending order of the average A R D values. The ranked list is: S=[g4, g6, g5, g1] 2.1.3.1 -2.1.3.2 Find the subset of genes whose ranks are lower than m (=2) (i.e. 3 or 4) in all Ri's. In this case, L=[g1]. g3 is eliminated from C: C=[g4,g5,g6]. 2.1.4 Update: Current subset:C=[g4,g6] Length of C: d=2. Surving subset:M=[ ] Number of surviving genes at next iteration: m=1. 2.1.1.3 Sort the genes in descending order of optimal A R D values. R1 = [ g4, g6, g5, g1 ] R2 = [ g6, g4, g5, g1 ] R3 = [ g4, g5, g6, g1 ] 2.1.1.4 Normalize A R D values using Eqn 5.4. The normalized values are:. = [0.10, 0.78, 0.20, 0.59] § * = [0.27, 0.58, 0.45, 0.63] @ = [0.39, 0.64, 0.51, 0.42] 2.1.3.3-2.1.3.4 Find the subset of genes whose ranks are higher than or equal to m (=2) (i.e. 1 or 2) in all Ri's. In this case, T=[g4] Genes in T survive this iteration: M=T=[g4]. 2.1.3.5 32. The number of spaces left for surviving genes: m-|M|=2-1=1. This gene is chosen from S: S=S-T-L=[g6,g5]. The first gene in S now is g6. Insert g6 into M: M=[g4,g6]. Figure 5.5: Illustration of feature elimination of G P - A R D - R F E algorithm: a running example (Part 2) 68 Suppose p=2.The number of genes after feature elimination is 4. We now show the last iteration of R F E , in which no gene is eliminated. But G P - A R D is still trained to get a ranked list of genes. 2.1 LAST ITERATION 2.1.1.1-2.1.1.2 2.1.1.4 Restrict training samples to D=D(:,C) Optimal A R D values: Q* = [0.5, 0.4, 0.1, 0.6] 0* =[1.1, 1.4, 1.2, 1.6] 0„ = [1.4,1.6,1.2,1^8] Skip Normalize A R D values using Eqn 5.4. The normalized values are:. £>| § * = [0.57, 0.45, 0.11,0.68] 2.1.1.3 1 A * 0 ? = [0.41, 0.52, 0.45, 0.60] = [0.46, 0.53, 0.40, 0.59] 2.1.2 2.1.4 Update: d=d/2=2 Skip 2.1.3 52. Check condition: d>=4 does not hold. Exit loop of 2.1. J Compute the average A R D values of genes in C using normalized A R D values: The average A R D values are: 0* =[0.48, 0.50, 0.32, 0.62] Sort the genes in descending order of the average A R D values. The ranked list is: S=[g4, g2, g1, g3] 3.1 (1) Generate 4 gene subsets: s1 =[g4], S2=[g4,g2],s3=[g4,g2,g1 ] and s4=[g4,g2,g1,g3] (2) Restrict D to D(:,si) and compute the L O O error ei using GP-Laplace on D(:,si). E = [e1, e2, e3, e4]. (3) Suppose min(E)=e2, then the minimal subset of relevant genes are: M* = [ g4, g2 ]. Figure 5.6: Illustration of GP-ARD-RFE algorithm: the steps after feature elimi-nation 69 1. Initialization (1.1) Split the data set D into k mutually disjoint folds. Let .CA' be the training set, containing all folds but the i-th one. (1.2) Set current subset of genes C = [ 1 , D ] , and d = \C\ Let the subset of genes which will survive this iteration be M = [] Set the number of surviving genes at this iteration to m = 2^og2^D^ Set user defined argument p = 8 2. R F E (2.1) While d > 2r' (2.1.1) Initialize i = 1, and Loop: while i < k (2.1.1.1) Restrict training samples D%ew to be D\\:,C). (2.1.1.2) Train GP-ARD-Laplace using D^ew: maximizing the evidence P(Dnevl\9) several times and choose the one with the highest evidence as the optimal A R D values 9*. (2.1.1.3) If d > 2 P , do the following: Sort the genes in descending order of A R D values 6*, and get the ranked gene list Ri £ Md. (2.1.1.4) Normalize 8* using Equation 5.4, and get the normalized optimal A R D values 9*. (2.1.2) Compute the average A R D values 9 using 9*, where i = 1 , k . Rank the genes in C by the average A R D values, Let S be the ranked list of genes. (2.1.3) If d > 2 P , do elimination as follows: (2.1.3.1) Find the subset of genes whose ranks are lower than m in all Ri,i = 1 : k. Let L be the set containing these common low ranked genes. Genes in L will be eliminated at this iteration regardless of their average A R D values. (2.1.3.2) Remove genes in L from C: C = C — L. (2.1.3.3) Find the subset of genes whose ranks are higher than or equal to m in all Ri,i = I : k. Let T be the set containing these genes. Genes in T will not be eliminated at this iteration regardless of their average A R D values. (2.1.3.4) Update surviving set M = T; (2.1.3.5) The number of genes in set M now is \T\. S = S -T - L. Insert the top m — \T\ genes in S into M so that \M\ is ra. (2.1.4) If d > V update: Current subset of features C = M ; The length of current subset d = \C\ Subset of surviving features M = [] The number of surviving features m = d/2. Otherwise, update d = d/2 (to exit the loop of 2,1). 3. Selection (3.1) Run forward selection on R as in (5.1) of Figure 5.3: (3.1.1) Generate d = 2v subset of genes by adding one top ranked gene each time. (3.1.2) For each gene subset, compute L O O error using D. (3.1.3) Let set M* be the minimal gene subset that yielded the minimal L O O error in (3.1.2). 4. Exit Return the minimal subset of relevant genes M*. Figure 5.7: Algorithm of G P - A R D - R F E 70 Chapter 6 Conclusion and Future Work In this thesis, we did a comparative study on kernel methods with application to gene classification and selection. Gene expression data obtained by high-throughput techniques poses three challenges: high dimensionality, low sample size, and poor quality. Kernel methods outperform other approaches in gene expression data anal-ysis due to their ability to handle high dimensionality easily. Experiments show that all kernel methods (considered in this thesis) are efficient in classifying high dimensional data without feature selection or feature weighting. For a data set in 19624 dimensions, all kernel methods finish the classification in about 2 minutes. In terms of classification accuracies, our experiments with three real gene expression data show that SVM and Gaussian process classifiers with Expectation Propagation outperform their counterparts. Compared with gene classification, biomarker identification is a more impor-tant problem in gene expression analysis. We investigate four kernel-based methods in feature selection or feature ranking. Sparse kernel methods, such as RVM and SMLR, obtain sparse solutions for regression and classification by utilizing sparsity-promoting priors. If original features or their transformed forms are used as basis functions, RVM and SMLR automatically select an optimal subset of genes, which 71 is also known as biomarker set. Unlike RVM or SMLR, SVM-RFE and GPC with ARD ranks all features without selecting an optimal subset. Forward selection has to be used to identify biomarkers. To evaluate the quality of selected features, classification accuracy is com-monly used in the literature. However, this criterion only measures one desired characteristic of good feature selection methods. That is, the elimination of ir-relevant features does not adversely affect the discriminative capability of the two groups of data. Another important property of a good feature selection method is stability, which measures how sensitive a method is to small changes in the training data. If one sample in the training set is changed, the biomarker set selected by a feature selection method is changed dramatically, then the method is not stable. In turn, the features selected by this method are not trustworthy. Therefore, we evaluate the feature selection algorithms using both of the two criteria. The costs of low sample size can be reduced by utilizing cross validation techniques. By using cross validation, we make full use of the available data and aggregate the performance from multiple trainings. MSVM-RFE is the cross vali-dation improvement for SVM-RFE, and GP-ARD-FS and GP-ARD-RFE are two cross validation improvements for GPC with ARD. Experimental results show that by utilizing cross validation techniques, feature selection methods are improved in terms of both accuracies and stabilities. The proposed algorithm GP-ARD-RFE ex-tends from GP-ARD-FS by replacing the useless preprocessing step using Wilcoxon rank sum tests by a step of recursive feature elimination utilizing ARD values. As a side effect of GP-ARD-RFE, we may identify some potential outliers. However, the obstacle of the poor quality of gene expression data is not systematically conquered. We would like to explore this problem in future work. 72 Bibliography [1] U. Alon, N. Barkai, D. A. Notterman, K.'Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Science, 96:6745-6750, 1999. [2] M . Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundation of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964. [3] N . Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950. [4] R. E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961. [5] B. E. Boser, I. M . Guyon, and V. Vapnik. A training algorithm for opti-mal margin classifiers. In D. Haussler, editor, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pitts-burgh, A C M Press, 1992. [6] C. Chang and C. Lin. Libsvm: a library for support vector machines, 2001. Software available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm/. 73 [7] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. Tech-nical report, Gatsby Unit, University College London, 2004. [8] W. Chu, Z. Ghahramani, F. Falciani, and D. L. Wild. Biomarker discovery in microarray gene expression data with Gaussian processes. Bioinformatics, 21(16):3385-3393, 2005. [9] K. K. Chin. Support Vector Machines Applied to Speech Pattern Classification. Master's thesis, Darwin College, Cambridge University, 1999. [10] N. Cristianini, H. Lodhi, and J. Shawe-Taylor. Latent semantic kernels. Jour-nal of Intelligent Information Systems (JJIS), 18(2-3):127-152, 2002. [11] L. Csato and M . Opper. Sparse online Gaussian processes. Neural Computa-tion, 14(3):641-668, 2002. [12] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Ma-chines and other kernel-based learning methods. Cambridge University Press, Cambridge, UK, 2000. [13] L. Deng, J. Pei, J. Ma, and D. Lee. A Rank Sum Test Method for Informa-tive Gene Discovery. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 410-419, 2004. [14] K. Duan, J. C. Rajapakse, H. Wang, and F. Azuaje. Multiple SVM-RFE for Gene Selection in Cancer Classification with Expression Data. IEEE Trans-actions on Nanobioscience, 4(3'):228-34, September 2005. [15] T. Fawcett. ROC Graphs: Notes and Practical Considerations for Data Mining Researchers. Technical Report, HP Labs, 2003. 74 [16] T. Furey, N. Cristianini, N . Duffy, D. Bednarski, M . Schummer, and D. Haus-sler. Support vector machine classification and validation of cancer tissue sam-ples using microarray expression data. Bioinformatics, 16:906-914, 2000. [17] S. H. Friend and R. B. Stoughton. The Magic of Microarrays. Scientific Amer-ican, pages 44-53, February 2002. [18] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003. Special issue on variable and feature selection. [19] T. R. Golub, D. K. Slomin, P. Tamayo, C. Huard, M . Gaasenbeek, J. P. Mesirov, H. Coller, M . L. Loh, J. R. Downing, M . A. Caliguiri, C. D. Bloom-field, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537, 1999. [20] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002. [21] S. Hochreiter and K. Obermayer. Gene selection for Microarray data. In B. Scholkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology, pages 319-355. MIT Press, 2004. [22] B. Krishnapuram. Adaptive classifier design using labeled and unlabeled data. PhD Thesis, Department of Electrical Engineering, Duke University, 2004. [23] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. hartemink. Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transaction on pattern analysis and machine intelligence, 27(6) :957-968, 2005. 75 [24] B. Krishnapuram, L. Carin, and A. Hartemink. Gene expression analysis: joint feature selection and classifier design. In B. Scholkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology, pages 299-318. MIT Press, 2004. [25] H. Kim and Z. Ghahramani. The EM-EP algorithm for Gaussian process clas-sification. In Proceedings of the Workshop on Probabilistic Graphical Models for Classification at ECML, 2003. [26] T. L i , C. Zhang, and M . Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics, 20(15): 2429-2437, 2004. [27] D. J. C. MacKay. A practical Bayesian framework for back propagation net-works. Neural Computation, 4(3):448-472, 1992. [28] D. J. C. MacKay, Bayesian methods for backpropagation networks. Models of Neural Networks III, pages 211-254, 1994. [29] D. J. C. Mackay. Introduction to Gaussian processes. In C. M . Bishop, edi-tor, Neural Networks and Machine Learning, pages 133-165. Berlin, Springer-Verlag, 1998. [30] T. P. Minka. A family of algorithm for approximate Bayesian inference. Ph.D. thesis, Massachusetts Institute of Technology, January 2001. [31] J. S. Marron and M . Todd. Distance Weighted Discrimination, http://www.unc.edu / depts/statistics / postscript / papers / marron/HDD / DWD/DWDl.pdf, 2002. [32] J. Ma, Y . Zhao, and S. Ahalt. OSU SVM Classifier Matlab Toolbox, version 3.00, http://www.ece.osu.edu/~maj/osu_svm/, 2002. 76 [33] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29-36, 1982. [34] R. M . Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statis-tics. Springer, 1996. [35] R. M . Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report No. 9702, Depart-ment of Statistics, University of Toronto, 1997. [36] P. Pavlidis, J. Weston, J. Cai, and W. Grundy. Gene functional classifica-tion from heterogeneous data. In Proceeding of the fifth annual international conference on computational biology, pages 249-255. A C M Press, 2001. [37] Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In Proceedings of the Twenty-first International Conference on Machine Learning, pages 671-678, 2004. [38] C. R. Rasmussen. Gaussian process in machine learning. Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, Febru-ary 2-14, 2003, Tubingen, Germany, August 4-16, 2003. In O. Bousquet, U. von Luxburg, and G. Ratsch, editors, Revised Lectures 3176:63-71. Springer-Verlag, Heidelberg, 2004. [39] M . Seeger. Notes on Minka's expectation propagation for Gaussian process classification. Technical report, University of Edinburgh, 2002. [40] M . Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467-470, 1995. 77 [41] B. Scholkopf. Support vector learning. Munich, Oldenbourg Verlag, 1997. [42] B. Scholkopf and A. J. Smola. Learning with Kernels. Cambridge, MA, MIT Press, 2002. [43] C. J. Stone. Optimal rates of convergence for nonparametric estimators. An-nals of Statistics, 8(6):1348-1360, 1980. [44] M . Tipping. Sparse Bayesian learning and the relevance vector machine. Jour-nal of Machine Learning Research, 1:221-244, 2001. [45] V. N. Vapnik. The Nature of Statistical Learning Theory. New York, Springer Verlag, 1995. ISBN 0-387-94559-8. [46] V. E. Velculescu, L. Zhang, B. Vogelstein, and K. W. Kinzler. Serial analysis of gene expression. Science, 270:484-487, 1995. [47] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian pro-cesses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998. 78
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Comparative study of Kernel based classification and...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Comparative study of Kernel based classification and feature selection methods with gene expression data Tan, Mingyue 2006
pdf
Page Metadata
Item Metadata
Title | Comparative study of Kernel based classification and feature selection methods with gene expression data |
Creator |
Tan, Mingyue |
Date Issued | 2006 |
Description | Gene expression profiles obtained by high-throughput techniques such as microarray provide a snapshot of expression values of up to ten thousands genes in a particular tissue sample. Analyzing such gene expression data can be quite cumbersome as the sample size is small, the dimensionality is high, and the data are occasionally noisy. Kernel methods such as Support Vector Machines (SVMs) [5, 45] have been extensively applied within the field of gene expression analysis, and particularly to the problems of gene classification and selection. In general, kernel methods outperform other approaches due to their ability to handle high dimensionality easily. In this thesis, we perform a comparative study of various state-of-the-art kernel based classification and feature selection methods with gene expression data. It is our aim to have all the results together in one place so that people can easily see their similarities and differences both theoretically and empirically. In the literature, a feature selection method is evaluated by the classification accuracies using the features selected by the method. This evaluation criterion measures the classification capabilities of the data after the elimination of irrelevant features. We propose another criterion, called stability, to evaluate the feature selection methods in addition to classification accuracies. The feature set selected by a stable feature selection algorithm should not change significantly when some small changes are made to the training data. In this thesis, we use both of two evaluation criteria to compare feature selection methods. It has been showed that cross validation technique can be used to improve feature selection methods in terms of classification accuracies [8]. In this thesis, we extend one existing feature selection method which utilizes Gaussian Processes (GP) [47] with Automatic Relevance Determination (ARD) [28, 34], and cross validation, and propose a new feature selection method. Experiments on real gene expression data sets show that our method outperforms all other feature selection methods in terms of classification accuracies, and achieves comparable stability as Sparse Multinomial Logistic Regression (SMLR) [23], the most stable feature selection method in the literature. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2010-01-16 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0051729 |
URI | http://hdl.handle.net/2429/18337 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2006-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_2006-0320.pdf [ 5.01MB ]
- Metadata
- JSON: 831-1.0051729.json
- JSON-LD: 831-1.0051729-ld.json
- RDF/XML (Pretty): 831-1.0051729-rdf.xml
- RDF/JSON: 831-1.0051729-rdf.json
- Turtle: 831-1.0051729-turtle.txt
- N-Triples: 831-1.0051729-rdf-ntriples.txt
- Original Record: 831-1.0051729-source.json
- Full Text
- 831-1.0051729-fulltext.txt
- Citation
- 831-1.0051729.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0051729/manifest