@prefix vivo: <http://vivoweb.org/ontology/core#> .
@prefix edm: <http://www.europeana.eu/schemas/edm/> .
@prefix ns0: .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

vivo:departmentOrSchool "Applied Science, Faculty of"@en, "Electrical and Computer Engineering, Department of"@en ;
edm:dataProvider "DSpace"@en ;
ns0:degreeCampus "UBCV"@en ;
dcterms:creator "Hussein, Faten T."@en ;
dcterms:issued "2009-08-12T22:01:44Z"@en, "2002"@en ;
vivo:relatedDegree "Master of Applied Science - MASc"@en ;
ns0:degreeGrantor "University of British Columbia"@en ;
dcterms:description """Computer-based pattern recognition is a process that involves several sub-processes, including pre-processing, feature extraction, classification, and post-processing. This thesis is concerned with the feature selection and feature weighting processes. Feature extraction is the measurement of certain attributes of the target pattern. Classification utilizes the values of these attributes to assign a class to the input pattern. In our view, the selection and weighting of the right set of features is the hardest part of building a pattern recognition system. The ultimate aim of our research work is the automation of the process of feature selection and weighting, within the context of character/symbol recognition systems. Our chosen optimization method for feature selection and weighting is the genetic algorithm approach. Feature weighting is the general case of feature selection, and hence it is expected to perform better than, or at least as well as, feature selection. The initial purpose of this study was to test the validity of this hypothesis within the context of character recognition systems and using genetic algorithms. However, our study shows that this is not true. We carried out two sets of experimental studies. The first set compares the performance of Genetic Algorithm (GA)-based feature selection to GA-based feature weighting, under various circumstances. The second set of studies evaluates the performance of the better method (which turned out to be feature selection) in terms of optimal performance and time. The results of these studies also show that (a) in the presence of redundant or irrelevant features, feature set selection prior to classification is important for k-nearest neighbor classifiers; and (b) GA is an effective method for feature selection, and the performance obtained using genetic algorithms was comparable to that of exhaustive search. However, the scalability of GA to highly dimensional problems, although far superior to that of exhaustive search, is still an open problem."""@en ;
edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/12103?expand=metadata"@en ;
dcterms:extent "5722150 bytes"@en ;
dc:format "application/pdf"@en ;
skos:note "GENETIC ALGORITHM FOR FEATURE SELECTION AND WEIGHTING FOR OFF-LINE CHARACTER RECOGNITION

by FATEN T. HUSSEIN
B.Sc., Cairo University, Egypt, 1995

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 2002
© Faten Hussein, 2002

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study.
I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department of Electrical and Computer Engineering, The University of British Columbia, Vancouver, Canada.

Abstract

Computer-based pattern recognition is a process that involves several sub-processes, including pre-processing, feature extraction, classification, and post-processing. This thesis is concerned with the feature selection and feature weighting processes. Feature extraction is the measurement of certain attributes of the target pattern. Classification utilizes the values of these attributes to assign a class to the input pattern. In our view, the selection and weighting of the right set of features is the hardest part of building a pattern recognition system. The ultimate aim of our research work is the automation of the process of feature selection and weighting, within the context of character/symbol recognition systems. Our chosen optimization method for feature selection and weighting is the genetic algorithm approach. Feature weighting is the general case of feature selection, and hence it is expected to perform better than, or at least as well as, feature selection. The initial purpose of this study was to test the validity of this hypothesis within the context of character recognition systems and using genetic algorithms. However, our study shows that this is not true. We carried out two sets of experimental studies. The first set compares the performance of Genetic Algorithm (GA)-based feature selection to GA-based feature weighting, under various circumstances. The second set of studies evaluates the performance of the better method (which turned out to be feature selection) in terms of optimal performance and time. The results of these studies also show that (a) in the presence of redundant or irrelevant features, feature set selection prior to classification is important for k-nearest neighbor classifiers; and (b) GA is an effective method for feature selection, and the performance obtained using genetic algorithms was comparable to that of exhaustive search. However, the scalability of GA to highly dimensional problems, although far superior to that of exhaustive search, is still an open problem.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acronyms and Abbreviations
Acknowledgments
Chapter 1  Introduction
  1.1 Motivation
  1.2 Thesis Contributions
  1.3 Thesis Structure
Chapter 2  Pattern Recognition
  2.1 Character Recognition
    2.1.1 Components of a Character Recognition System
  2.2 Pattern Recognition Approaches
  2.3 Supervised versus Unsupervised Learning
  2.4 Parametric versus Non-parametric
  2.5 The Nearest Neighbor Rule
  2.6 Attribute Weighted K-Nearest Neighbor
  2.7 Data Normalization
  2.8 Estimating Classification Error Rate
    2.8.1 Hold Out Method
    2.8.2 Cross-Validation Methods
      2.8.2.1 K-Fold Cross-Validation Method
      2.8.2.2 Random Sub-Sampling Method
      2.8.2.3 Leave-One-Out Cross-Validation Method
    2.8.3 Bootstrap Method
Chapter 3  Feature Selection and Weighting
  3.1 The Curse of Dimensionality
  3.2 Dimensionality Reduction
  3.3 Feature Selection versus Feature Extraction
  3.4 Feature Selection versus Feature Weighting
  3.5 Feature Selection Algorithms
    3.5.1 Feature Selection Objective Function
    3.5.2 Feature Selection Search Strategy
      3.5.2.1 Feature Selection Using Exponential Search Methods
      3.5.2.2 Feature Selection Using Deterministic Search Methods
      3.5.2.3 Feature Selection Using Randomized Search Methods
Chapter 4  Genetic Algorithms for Feature Selection and Weighting
  4.1 Genetic Algorithms Review
    4.1.1 The Basics of Genetic Algorithm
      4.1.1.1 Initialization
      4.1.1.2 Representation
      4.1.1.3 Selection (Reproduction)
      4.1.1.4 Crossover (Recombination)
      4.1.1.5 Mutation
      4.1.1.6 Fitness Function
      4.1.1.7 Generation Replacement
      4.1.1.8 Schema Theorem
  4.2 Genetic Feature Selection and Weighting
    4.2.1 Chromosome Representation
    4.2.2 Type of Classifier
    4.2.3 Fitness Function
    4.2.4 Computational Speed-Up Techniques
    4.2.5 GA Parameters
  4.3 Comparisons
  4.4 Genetic Feature Selection for Character Recognition Systems (Literature Review)
    4.4.1 Recognition of Printed Characters
    4.4.2 Recognition of Handwritten Characters
    4.4.3 Signature Verification
Chapter 5  A Comparative Study between Genetic Feature Selection and Weighting
  5.1 Introduction
  5.2 Purpose
  5.3 The Developed System
  5.4 Experimental Platform
  5.5 Methodology
  5.6 Comparative Study
    5.6.1 The Effect of Varying the Number of Weight Values on the Number of Selected Features (Comparison 1)
    5.6.2 Results of Comparison 1
    5.6.3 Performance of both Genetic Feature Selection and Weighting in the Presence of Irrelevant Features (Comparison 2)
    5.6.4 Results of Comparison 2
    5.6.5 Performance of both Genetic Feature Selection and Weighting in the Presence of Redundant Features (Comparison 3)
    5.6.6 Results of Comparison 3
    5.6.7 Performance of both Genetic Feature Selection and Weighting with Regular Databases (Comparison 4)
    5.6.8 Results of Comparison 4
Chapter 6  Genetic Feature Selection Evaluation
  6.1 Introduction
  6.2 Convergence of Genetic Feature Selection to an Optimal or Near-Optimal Set of Features (Evaluation 1)
  6.3 Results of Evaluation 1
  6.4 Convergence of Genetic Feature Selection to an Optimal or Near-Optimal Set of Features within an Acceptable Number of Generations (Evaluation 2)
  6.5 Results of Evaluation 2
  6.6 Verification Experiments
    6.6.1 Pre-Processing and Feature Extraction
    6.6.2 Verification of Comparison 1
    6.6.3 Verification of Evaluation 1
Chapter 7  Conclusions and Future Research
  7.1 Conclusions
    7.1.1 Genetic Feature Selection versus Genetic Feature Weighting
    7.1.2 Performance of Genetic Feature Selection
  7.2 Future Research
References

List of Tables

Table 1: Accuracy of recognition and number of zero features for various selection and weighting schemes
Table 2: Accuracy of recognition and number of zero features for various selection and weighting schemes, some with low weights forced to zero
Table 3: Best classification accuracy rates achieved by GA and exhaustive search
Table 4: Number of generations to convergence
Table 5: The extracted feature set from handwritten digit images
Table 6: Recognition accuracy and number of zero features for various selection and weighting schemes
Table 7: Classification accuracy rates achieved by GA and exhaustive search

List of Figures

Figure 1: Example of the effect of feature weighting
Figure 2: The two feature selection approaches, filter and wrapper
Figure 3: Basic steps of genetic algorithms
Figure 4: The developed system
Figure 5: (a) The relationship between the number of weight values and the actual number of zero (eliminated) features; (b) the number of zero (eliminated) features as a function of the probability of zero
Figure 6: Classification accuracy as a function of the number of irrelevant features
Figure 7: Number of eliminated features as a function of the number of irrelevant features
Figure 8: Classification accuracy as a function of the number of redundant features
Figure 9: Number of eliminated features as a function of the number of redundant features
Figure 10: Training classification accuracies for the various feature selection and weighting methods (using DB3)
Figure 11: Validation classification accuracies for the various feature selection and weighting methods (using DB3)
Figure 12: Training classification accuracies for the various feature selection and weighting methods (using DB2)
Figure 13: Validation classification accuracies for the various feature selection and weighting methods (using DB2)
Figure 14: Number of generations to convergence as a function of the number of features
Figure 15: Extremal points

Acronyms and Abbreviations

GA    Genetic Algorithm
FS    Feature Selection
FW    Feature Weighting
GFS   Genetic Feature Selection
GFW   Genetic Feature Weighting
OCR   Optical Character Recognition
K-NN  K-Nearest Neighbor
1-NN  1-Nearest Neighbor
BAB   Branch and Bound
SFS   Sequential Forward Selection
SBS   Sequential Backward Selection
SFFS  Sequential Forward Floating Selection
SBFS  Sequential Backward Floating Selection
SA    Simulated Annealing
TS    Tabu Search

Acknowledgments

First, I would like to thank God, for his endless blessings and mercy, and without whom I can do nothing. I would like to express my appreciation to my supervisor Dr. Nawwaf Kharma for his continuous guidance, advice and availability throughout this research work. He has provided me with valuable suggestions and technical assistance, which made this thesis possible. I would like to express my gratitude to my supervisor Dr. Rabab Ward for her support and encouragement. I can't thank her enough for her invaluable feedback.
To my husband, Ayman, I thank you for your continuous love, encouragement, help and understanding. Without your support I wouldn't have been able to complete this work. I would like to thank my son Omar, for all the joy and happiness he has brought to my life. My heart-felt gratitude goes to my parents for their love, support and never-ending compassion throughout my life, which I will never be able to repay.

Chapter 1  Introduction

1.1 Motivation

The main purpose of pattern recognition systems is to classify objects into one of a given number of class labels. Features are scalar properties that represent objects. Multiple features are combined together to form a feature vector. Any pattern recognition system includes two basic parts: feature extraction and classification (Wang & Fan, 1996). Feature extraction is the process of defining the most relevant features, which will minimize the within-class pattern variability and maximize the between-class pattern variability (Devijver & Kittler, 1982). As the number of features extracted increases, the cost and the running time of the recognition system increase as well. On the other hand, using fewer features could lead to failure of classification. These contradicting requirements have emphasized the need for well-balanced feature selection methods. The goal of feature selection is to reduce the number of features extracted, by eliminating irrelevant and redundant features, while simultaneously maintaining or enhancing classification accuracy.

Feature selection is sometimes performed in an ad-hoc way and accomplished by hand. A human designer implicitly selects the features that appear to him or her to be of most potential use. This process of manual feature selection usually depends on the designer's experience, domain knowledge, and trial-and-error. Another problem with the human-expert method of feature selection is that humans cannot usually predict the existence of non-linear interactions between features. The process of choosing the best features to represent the data is a difficult and time-consuming task. As a matter of fact, for the handwritten character recognition problem there are many variants of letter shape, size, and generally, style. Also, different writers have different writing styles. For the letters of an alphabet, there is a nearly unlimited number of variations, so many features must be used in a typical character recognition system to accommodate these variations. However, for a problem with a large number of features, it is not feasible to perform an exhaustive search to find all relevant features. Therefore, a computer algorithm must be developed to determine which features most accurately represent the pattern. This process of automatic feature selection will ensure that the recognition system is optimized.

The problem of feature selection, especially automated feature selection, has received a great deal of attention. Several search algorithms are used to solve the feature selection problem (Dash & Liu, 1997; Jain & Zongker, 1997). Among these, the genetic algorithm has emerged as a powerful search tool for selecting an optimal subset of features. The Genetic Algorithm (GA) has been used for feature selection and weighting in many pattern recognition applications (e.g., texture classification and medical diagnostics). GA has proven to be an effective computational method, especially in situations where the search space is mathematically uncharacterized, not fully understood, and/or highly dimensional.
Moreover, GA is domain independent (i.e., it does not require derivative information or other auxiliary knowledge about the problem), and has been shown to be an excellent approach for solving combinatorial optimization problems. However, its use in feature selection (let alone weighting) in character recognition applications has been infrequent. This has inspired the work of this thesis, which is to apply GA to feature selection and weighting for character recognition applications. This work has so far resulted in the following publications: (Hussein, Kharma & Ward, 2001; Kharma, Hussein & Ward, 2002a; Kharma, Hussein & Ward, 2002b).

The main purpose of our work is to apply genetic algorithms to the problem of feature weighting for character recognition applications. The work is motivated by the fact that there is no published work in the literature that has applied genetic feature weighting (GFW) in the context of character recognition problems; so it will be applied here for the first time. After applying GFW to a character recognition application, we are interested in comparing the performance of genetic feature selection (GFS) and GFW, also for character recognition applications. We are encouraged to perform such a comparison because it is often mentioned in the literature that feature weighting has the potential of working better than (or at least as well as) feature selection, when applied to the same situation (Komosinski & Krawiec, 2000; Punch, Goodman, Pei, Chia-Shun, Hovland & Enbody, 1993). However, this proposition was never fully and comprehensively assessed before. Only a single comparison existed in the literature (Kohavi et al., 1997), which compares the classification accuracy of feature selection (FS) and feature weighting (FW). However, the search method used in that comparison was not genetic algorithms, and the comparison was not carried out for character recognition applications. Therefore, we intend to test the validity of this proposition. In addition, we need to carry out an inclusive comparison between GFS and GFW for character recognition applications. This comparison between GFS and GFW will tackle several issues, including the number of features eliminated by each method, their performance in situations where irrelevant/redundant features exist, and the classification accuracies of both methods in regular situations. Moreover, we aim to test the performance of the better method (which turns out to be GFS) for both optimality and time complexity.

1.2 Thesis Contributions

Basically, the main contributions of the thesis are:
• The development of a GA system used to configure the real-valued weights of a k-nearest neighbor (k-NN) classifier component of a character recognition system. This genetic feature weighting (GFW) system will be applied, for the first time, to off-line recognition of isolated handwritten digits.
• A comparison of GFW with GFS (genetic feature selection), the method previously applied by other researchers to character recognition systems, under various circumstances.
• An evaluation of the performance of the better method in terms of optimal performance and time.
In addition to the abovementioned direct contributions, we believe that this research work is important for the field of character recognition, because it provides the following:
• It investigates the pragmatic aspects (in terms of the number of eliminated features and the performance in situations where irrelevant/redundant features exist) of automatic feature selection and weighting using genetic algorithms for character recognition systems.
• It provides a testable hypothesis (i.e., that feature weighting has the potential of working better than feature selection) and formal explanations for the behavior of GA-based feature selection and weighting in character recognition applications.
• It presents an empirically proven method (GFS), demonstrating the feasibility of searching for an optimal set of features (of moderate size) to enhance recognition rates in character recognition systems, while suggesting a possible use of distributed computational implementations for highly dimensional problems.
• It offers a way of building a semi-automatic optimization method for a general handwritten symbol recognition system, in which human intervention is needed only to provide new libraries of feature extraction and classification functions.

1.3 Thesis Structure

The thesis consists of seven chapters. The first chapter is an introduction. Chapter 2 presents an overview of pattern recognition and character recognition methods and introduces the k-nearest neighbor classifier. In Chapter 3, we present the feature selection and weighting problem and describe the various algorithms used in feature selection. Chapter 4 introduces the genetic algorithm and its various parameters. It also compares genetic feature selection with other feature selection methods and surveys genetic feature selection in character recognition applications. Chapter 5 presents the details of the developed system, the platform and the methodologies used. In addition, it provides a comparative study between genetic feature selection and weighting, describes the experimental work and explains the results obtained from these experiments. Chapter 6 presents an evaluation of genetic feature selection in terms of optimality and time and shows an experimental verification of some of the results obtained from the comparative study and evaluation. Finally, Chapter 7 offers the conclusions and suggestions for future work.

Chapter 2  Pattern Recognition

The goal of pattern recognition is to classify objects into a number of classes. These objects, depending on the application, can be images, signal waveforms, or anything else that needs to be classified. Pattern recognition applications range from automated speech recognition and optical character recognition to fingerprint identification.

2.1 Character Recognition

Character Recognition or Optical Character Recognition (OCR) is the process of converting scanned images of machine-printed or handwritten text (numerals, letters, and symbols) to a computer-processable format (such as ASCII). The popularity of OCR has been increasing each year, with the advent of fast microprocessors providing the vehicle for vastly improved recognition techniques. There is a wide variety of OCR systems in use today, from automatic postal address readers, through massive document-handling computers used by offices, to the desktop systems that employ scanners for reading text into word processing and spreadsheet applications.
In general, there are two main categories of character recognition problem: on-line and off-line. The purpose of on-line handwriting recognition is to recognize symbols while they are being written. In the off-line case, by contrast, the recognition process is performed after the symbols have already been written. OCR belongs to the off-line recognition category. In our research work we are interested in the case of off-line recognition of isolated handwritten digits.

2.1.1 Components of a Character Recognition System

An OCR system typically involves the following processing steps (Kharma & Ward, 1999):
1- Pre-processing.
2- Feature extraction.
3- Pattern classification.
4- Post-processing.

The pre-processing step aims to improve the image data by suppressing unwanted distortions or enhancing image features that are important for further processing. The output of the pre-processing step is thus a cleaned-up version of the original image, which can be used in the next step. Examples of pre-processing functions are noise removal, skeletonization, thinning, normalization and segmentation.

Feature extraction is an important step in achieving good performance in OCR systems. It is the process of defining the most relevant features, which will minimize the within-class pattern variability and maximize the between-class pattern variability (Devijver & Kittler, 1982). Several feature extraction methods exist; for an extensive survey, see (Trier, Jain & Taxt, 1996). The goal is to find those features that are of possible relevance for classification.

Assume now that a list of measured features is provided. The portion of the process that must map these input features onto classes is called the classifier. This step can be accomplished by means of a number of algorithms, including clustering techniques, rule-based systems, neural networks and decision trees (Kharma & Ward, 1999).

Traditional OCR performs post-processing to assist in the resolution of errors produced in the character recognition process. The goal is to increase the level of confidence in the classification results. One of the post-processing methods used is a word dictionary to verify word results. Alternatively, recognition results may be verified interactively with the user.

2.2 Pattern Recognition Approaches

There are three approaches to pattern recognition: the statistical, the structural and the neural network approaches (Pandya & Macy, 1996). In the statistical approach, the input pattern is represented by a number of attributes or features (e.g., a set of measurements performed on the raw data). If a suitable set of features is chosen to properly represent the patterns, feature vectors of the same class will be close to each other, while feature vectors belonging to different classes will be positioned in different regions of the feature space. In this way, the recognition task is reduced to partitioning the feature space into regions of different classes. In our work, we have used the statistical approach for the task of recognizing handwritten digits. The patterns, which are images of handwritten digits, are represented by a d-dimensional feature vector.

In the structural approach, on the other hand, a pattern is assumed to be decomposable into simpler sub-patterns, which in turn can also be decomposed into simpler sub-patterns (called primitives) in a recursive way (Jain, Duin & Mao, 2000).
In syntactic pattern recognition, which is a subset of structural pattern recognition, patterns of a class are viewed as sentences in a language defined by means of a grammar, and primitives are considered the alphabet of the language. Each class has its own defined set of rules, or grammar. The grammar specifies the way in which sub-patterns can be combined to form a valid pattern for a specific class. The pattern recognition problem in this case is to determine whether a given pattern belongs to the language generated by that grammar (Devijver & Kittler, 1982).

Neural networks (NN), in general, consist of a large number of highly interconnected processing elements (nodes) that usually operate in parallel. The collective behavior of a NN demonstrates the ability to learn, recall and generalize from training patterns. The analogy between neural networks and the human brain has made neural networks good candidates for pattern classification problems. In fact, neural networks are a non-parametric type of classifier with predictive capabilities; they make no assumptions about the underlying distributions. Another main advantage of neural networks is their capability to learn complex non-linear relationships between inputs and outputs (Jain et al. 2000). Moreover, they have the ability to generalize and even to recognize partially degraded patterns (Pandya & Macy, 1996).

2.3 Supervised versus Unsupervised Learning

Pattern recognition can be classified into two broad categories: supervised and unsupervised (Pandya & Macy, 1996). A supervised learning process is one in which the user provides some external information about the problem, that is, a set of examples, or training set, of classified elements. In the unsupervised case, no prior information is provided and the system is required to discover the fundamental structure of the data on its own. Correct classes are not available, and grouping or clustering is used in order to infer correct classification from the data elements.

2.4 Parametric versus Non-parametric

In statistical pattern recognition, there are two general ways to design a classifier: parametric and non-parametric. The key distinguishing feature is the form of the information \"learned\" during training and passed on to the classifier. Parametric classifiers assume that the patterns in the training set fit a known statistical distribution. These classifiers are parametric in that they are specified in terms of the parameters (mean and covariance) of the class distributions. Non-parametric classifiers are useful in cases where the underlying distribution cannot be easily parameterized. In such cases, we can either estimate the density function or directly construct the decision boundary from the training data (e.g., k-nearest neighbor) (Jain et al. 2000).

2.5 The Nearest Neighbor Rule

Nearest neighbor classification is a non-parametric method that does not assume that the patterns to be classified have known density functions. The K-Nearest Neighbor Rule (k-NNR) is a very intuitive method that classifies unlabeled samples based on their similarity to classified samples in the training set (Dasarathy, 1991). The algorithm for the k-nearest neighbor rule is described as follows:
• Given a feature vector x, whose class is to be determined, and n training samples, identify the k training samples that are the nearest neighbors of x, regardless of class (where k is chosen to be an odd number).
• Out of these k samples, determine the number of vectors k_i that belong to class w_i (where i = 1, 2, ..., m; the k_i sum to k; and m is the number of classes).
• Assign the unknown vector x to the class w_i with the maximum number k_i of samples.

In other words, the k-nearest neighbor method assigns to an unclassified sample x the class most heavily represented among its k nearest neighbors. When k = 1 (in this case it is known as the nearest neighbor rule), the feature vector x is assigned to the class of its nearest neighbor. The k-NNR is very simple; it only requires an integer k, a set of labeled examples (training data) and a metric to measure the \"closeness\".

The k-NNR is considered an instance-based learning algorithm. Instance-based learning algorithms are a class of supervised machine learning algorithms. These algorithms do not construct abstract concepts, but rather base their classification of new instances on their similarity to specific training instances (Aha, 1992). Old training instances are stored in memory, and classification is postponed until new instances are received by the classifier. When a new instance is received, older instances are retrieved from memory and used to classify the new instance. Other names for instance-based algorithms are memory-based, exemplar-based or case-based. Instance-based learning algorithms have the advantages of being able to (a) learn complex target concepts (e.g., functions); and (b) estimate target concepts distinctly for each new instance. In addition, their training is very fast and simple; it only requires storing all the training instances in memory. In contrast, the cost of classifying new instances can be high, because every new instance is compared to every training instance. Hence, efficient indexing of training instances is important. Another disadvantage of these learning algorithms is that their classification accuracy degrades significantly in the presence of noise (in the training instances).

In the k-nearest neighbor method, the similarity between two cases can be measured in various ways. The most common similarity measure is based on Euclidean distance, described as follows:

D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (Eq. 1)

where D is the distance, x and y are two instances, x_i and y_i are the i-th attribute values of the x and y instances, and n is the total number of features. To compensate for the difference in units between features, normalization should be performed. This often scales all features to a range between 0 and 1, inclusive.

2.6 Attribute Weighted K-Nearest Neighbor

One major drawback of the Euclidean distance function is its sensitivity to the presence of noise and, particularly, redundant or irrelevant features. This is because it treats all features of an instance (relevant or not) as equally important to its successful classification. A possible remedy is to assign weights to features. The weights can then be used to reflect the relative relevance of their respective features to correct classification. Highly relevant features would be assigned high weights relative to the weights of redundant or irrelevant features. Taking that into account, the Euclidean distance measure can now be refined to:

D(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2}    (Eq. 2)

where w_i is the weight of the i-th feature.
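The two distance measures above translate directly into code. The sketch below is an illustration only, using our own toy data and function names rather than the thesis's implementation; it shows how down-weighting one attribute can change a k-NN decision, which is the effect that Figure 1 below illustrates geometrically.

```python
import math
from collections import Counter

def weighted_distance(x, y, w):
    """Weighted Euclidean distance (Eq. 2); w = [1, 1, ..., 1] reduces it to Eq. 1."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def knn_classify(query, train, labels, w, k=3):
    """Assign `query` the majority class among its k nearest (weighted) neighbors."""
    order = sorted(range(len(train)), key=lambda i: weighted_distance(query, train[i], w))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: class 1 is close to the query only in feature 2, class 2 only in feature 1.
train  = [(0.10, 0.50), (0.90, 0.55), (0.15, 0.45),   # class 1
          (0.50, 0.00), (0.52, 1.00), (0.48, 0.95)]   # class 2
labels = [1, 1, 1, 2, 2, 2]
query  = (0.50, 0.50)
print(knn_classify(query, train, labels, w=[1.0, 1.0], k=3))    # unweighted -> class 1
print(knn_classify(query, train, labels, w=[1.0, 0.05], k=3))   # feature 2 down-weighted -> class 2
```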
Assume (see Figure 1) that there are several instances belonging to two classes, class 1 and class 2, each having two features (attributes) represented in 2-dimensional space. It is required to classify the unknown instance into either one of these two classes. In the left side of Figure 1, the unknown instance is classified according to the majority class of its k nearest neighbors (k = 3, in this case). Thus the unknown instance is incorrectly classified as belonging to class 1. However, the right side of Figure 1 shows the effect of using the attribute-weighted k-nearest neighbor. This weighting is achieved by multiplying the Y-axis by a small weight, which corresponds to decreasing the effect of attribute Y, and increasing the influence of attribute X by multiplying it by a large weight. As a result of the attribute weighting, the unknown instance is correctly classified as belonging to class 2. Note that the dimension in the feature space where a high weight is assigned is extended. On the other hand, the dimension in the feature space where a low weight is assigned is compressed.

Figure 1: Example of the effect of feature weighting. (Axes: attribute 1, attribute 2; legend: class 1, class 2, unknown instance.)

2.7 Data Normalization

In addition to feature weighting, it is also important to apply data normalization before classification. Different features have different measurement units, which means that their values lie within different ranges. So, using the Euclidean distance function with features having different ranges of values will result in a significant problem: features with large values will have a larger effect on the classification than features with small values, although this does not necessarily reflect their relative importance for classification (Theodoridis & Koutroumbas, 1998). Therefore, data normalization must be performed first to overcome the differences in units between feature values. A common method used for normalization is to restrict all feature values to a certain range, e.g. [0, 1] or [0, 10] or any other range. This way of scaling features guarantees that all features will be normalized to the same range of values.
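A minimal sketch of the min-max scaling just described (our illustration; the thesis does not prescribe a particular implementation). The per-feature minima and maxima are computed on the training samples and kept so that unseen samples can be scaled with the same parameters.

```python
def minmax_normalize(samples):
    """Rescale every feature (column) to [0, 1] using its min/max over the given samples."""
    n_features = len(samples[0])
    mins = [min(s[j] for s in samples) for j in range(n_features)]
    maxs = [max(s[j] for s in samples) for j in range(n_features)]
    scaled = [tuple((s[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
                    for j in range(n_features))
              for s in samples]
    return scaled, mins, maxs   # keep mins/maxs to scale unseen samples identically

data = [(2.0, 300.0), (4.0, 900.0), (3.0, 600.0)]
print(minmax_normalize(data)[0])   # [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
```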
2.8 Estimating Classification Error Rate

The assessment of a classifier, such as the k-NN, is based on its ability to successfully classify and predict unseen data. The most commonly used measure of the performance of a classifier is its error rate (Weiss & Kulikowski, 1991). The error rate is the number of incorrectly classified data samples divided by the total number of samples. In general, the error rate of a recognition system is estimated by dividing all the available samples into two sets, a training set and a testing set. The classifier is built and designed using the training set, and then the error rate of the classifier is calculated using the testing set. However, both the training and testing sets should be independent and have a large number of samples so as to return a true measure of the classifier error rate (Jain et al. 2000). There are many methods for estimating the classification error rate. These methods mainly differ in the way the available samples are split into training and testing sets. They are described as follows.

2.8.1 Hold Out Method

In this method, all the available samples are separated into two sets, called the training set and the testing set. The common split used for the data samples is 2/3 of the data assigned to training and 1/3 to testing (Weiss & Kulikowski, 1991). Alternatively, half of the data can be used for training and the second half for testing. The classifier is trained using the training set only. Then the trained classifier is asked to predict the output values of the testing set. The accumulated error on the testing set is used as the classifier error rate. The problem with this method is that its evaluation can have a high variance. Since the holdout method is a single train-test method, its estimate of the error rate depends heavily on which data points end up in the training set and which end up in the test set. Thus the evaluation may be significantly different depending on how the division is made (Jain et al. 2000).

2.8.2 Cross-Validation Methods

The limitations of the holdout method can be overcome with a family of cross-validation methods, at the expense of higher computational cost. These methods are: the k-fold, the random sub-sampling and the leave-one-out methods (Weiss & Kulikowski, 1991; Jain et al. 2000).

2.8.2.1 K-Fold Cross-Validation Method

In this method, the data set is divided into k parts (folds), and the holdout method is repeated k times. Each time, one of the k subsets is used as the testing set and the other k-1 subsets are combined to form a training set. The error rate is the average of the error rates obtained from all k trials. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun k times, which means it takes k times as much computation to make an error estimate.

2.8.2.2 Random Sub-Sampling Method

Random sub-sampling can be viewed as a variant of the k-fold cross-validation method, where the data is randomly divided into a test and a training set k different times. In this method, multiple random train-test experiments are performed k times; the train-test sets are chosen randomly each time. Then the classifier is built using the training set, and tested using the testing set. The error rate is the average of the error rates obtained from the k runs. Random sub-sampling gives a better error estimate than the single train-test holdout method (Weiss & Kulikowski, 1991). The advantage of this method over the k-fold method is that you can independently choose how large each test set is and how many trials you average over. On the other hand, in the k-fold method all the examples in the dataset are eventually used for both training and testing.
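The k-fold procedure just described can be sketched as follows (illustrative only; `train_fn` stands for any training routine, for example one that wraps the k-NN classifier of Section 2.6, and the folds here are taken sequentially rather than randomly or stratified).

```python
def kfold_error_rate(samples, labels, k, train_fn):
    """K-fold cross-validation: each fold is the test set exactly once; the estimate
    is the average of the k per-fold error rates.
    `train_fn(train_samples, train_labels)` must return a one-sample classify function."""
    n = len(samples)
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    error_rates, start = [], 0
    for size in fold_sizes:
        test_idx = set(range(start, start + size))
        train_idx = [i for i in range(n) if i not in test_idx]
        classify = train_fn([samples[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        errors = sum(classify(samples[i]) != labels[i] for i in test_idx)
        error_rates.append(errors / size)
        start += size
    return sum(error_rates) / k

# Hypothetical usage with the k-NN sketch of Section 2.6:
#   train_fn = lambda X, y: (lambda q: knn_classify(q, X, y, w=[1.0] * len(X[0]), k=3))
```

Setting k equal to the number of samples gives the leave-one-out method described next.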
2.8.2.3 Leave-One-Out Cross-Validation Method

This is k-fold cross-validation taken to its extreme, with k equal to N, the number of data samples. For a dataset with N examples, N runs are performed. For each run, N-1 examples are used for training and the remaining example for testing. As before, the error rate is the average of the error rates obtained from the N runs. The leave-one-out method gives an unbiased estimate of the error rate but has large computational requirements. For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible. Conversely, for large data sets leave-one-out is computationally expensive, so the random sub-sampling or k-fold methods are preferred.

2.8.3 Bootstrap Method

The bootstrap is a re-sampling technique with replacement. Given a dataset with N examples, N examples are randomly selected (with replacement) and this set is used for training. The remaining examples that were not selected for training are used for testing. This process is repeated for a specified number of folds (K). The error rate is the average of the error rates obtained from the K folds. Obviously, the number of test examples is likely to change from fold to fold. Usually the bootstrap method is used for small-sample datasets (Jain et al. 2000).

Chapter 3  Feature Selection and Weighting

3.1 The Curse of Dimensionality

Intuitively, one would expect that the more information is available, the better we can make decisions; that is, the more features available to the classifier, the better the classification results. However, in practice this is not always true. In fact, adding more features is not always helpful. For a given size of training set, as the feature dimension increases, the data points become more sparse relative to the problem dimension. In addition, new features may not add useful information, and some features may be noise. This phenomenon, which is often observed in pattern recognition, is called the peaking phenomenon (Jain et al. 2000). The peaking phenomenon occurs when adding new features to a feature set leads to a decrease in the classification accuracy of a classifier trained on a finite set of training samples. In general, classifier performance depends on the relationship between the sample size and the number of features (Jain et al. 2000). For a given problem with d-dimensional features, there exists some minimum number of training samples required by the classifier to achieve a good classification rate. However, the required number of training samples grows exponentially with the dimensionality of the feature space (Pandya & Macy, 1996). This is known as the \"curse of dimensionality\". In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of the classifier will degrade rather than improve. It is very difficult to state the exact relation between the probability of misclassification, the sample size and the number of features; however, a generally accepted rule is to employ a number of training samples that is at least ten times the number of features used (Jain et al. 2000).

3.2 Dimensionality Reduction

As mentioned in the previous section, a major problem associated with pattern recognition is the so-called curse of dimensionality. Usually, there is a very large number of features that a domain expert can provide when designing any pattern recognition system in general, or a character recognition system in particular. This number can easily range from a few dozen to hundreds. Thus, features must be evaluated and the most effective ones chosen. This process is referred to as dimensionality reduction. Several reasons have motivated the use of dimensionality reduction techniques. Reducing the number of features will certainly help in reducing or eliminating the curse of dimensionality problem. Moreover, reducing the dimensionality of the problem will, in turn, reduce the time complexity and the memory requirements of the recognition system. In addition, reducing the dimensionality will increase classifier efficiency by eliminating redundant and irrelevant features.
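The practical value of removing irrelevant features for nearest-neighbor classifiers can be seen even on synthetic data. The following is our own small illustration, not an experiment from the thesis: one informative feature plus a growing number of purely random features; the 1-NN accuracy printed at the end typically drops as the noise dimensions swamp the informative one.

```python
import random

def one_nn_accuracy(train, train_labels, test, test_labels):
    """Accuracy of a plain 1-NN classifier using squared Euclidean distance."""
    def nearest(q):
        return min(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(q, train[i])))
    hits = sum(train_labels[nearest(q)] == lbl for q, lbl in zip(test, test_labels))
    return hits / len(test)

rng = random.Random(0)

def make_sample(label, n_irrelevant):
    informative = [label + rng.gauss(0.0, 0.3)]                        # carries the class
    irrelevant = [rng.uniform(0.0, 1.0) for _ in range(n_irrelevant)]  # pure noise
    return informative + irrelevant

for n_irrelevant in (0, 2, 8, 32):
    data = [(make_sample(lbl, n_irrelevant), lbl) for lbl in (0, 1) for _ in range(40)]
    rng.shuffle(data)
    train, test = data[:40], data[40:]
    acc = one_nn_accuracy([x for x, _ in train], [y for _, y in train],
                          [x for x, _ in test], [y for _, y in test])
    print(n_irrelevant, "irrelevant features -> 1-NN accuracy", round(acc, 2))
```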
3.3 Feature Selection versus Feature Extraction

Two approaches are available to perform dimensionality reduction: feature selection and feature extraction. Feature selection can be defined as follows: given a number of features, the feature selection process aims to select the most important of them, so as to reduce their number and at the same time retain as much as possible of their discriminatory power. Feature selection can be modeled as follows: given a feature set X = {x_j | j = 1, 2, ..., D} of D features, a feature selection method should find a subset Y = {y_i | i = 1, 2, ..., d} of d features, where d < D.

Figure 11: Validation classification accuracies for the various feature selection and weighting methods (using DB3). (Methods compared: 1-NN, FS, 3FW, 5FW, 17FW, 33FW.)

Results show that classification accuracy is best for the (unaided) 1-NN and FS, but worse for FW. The trend line highlights the fact that the greater the number of weight levels, the lower the validation accuracy rate achieved. Finally, it appears that classification accuracy cannot be further degraded by increasing the number of weights beyond 17; there is no difference in classification accuracy between 17FW and 33FW. These results are the opposite of those found in Figure 10! This means that FW is returning good accuracy results for training data, but then returning bad results for validation data. Also, it is the performance of a classifier on validation data that truly reflects its (real-world) predictive capacity. To verify these interesting results, we ran the same set of experiments on another database (DB2), to ensure that the above results are independent of the particular sample database used. The results are shown in Figure 12 and Figure 13. It emerges that the results of these experiments indeed confirm the results of Figure 10 and Figure 11.

Figure 12: Training classification accuracies for the various feature selection and weighting methods (using DB2).

Figure 13: Validation classification accuracies for the various feature selection and weighting methods (using DB2).

Based on the results of Figure 10, Figure 11, Figure 12 and Figure 13, we make the following observations:
• For the validation sets, the classification accuracy of FS was slightly better than all the FW settings for both databases. However, it was the same as 1-NN for DB3 and slightly worse for DB2.
• For the training sets, all FW settings have better classification accuracy than FS.
• Increasing the number of weight levels led to a slight decrease in the classification accuracy of the validation sets.
• Increasing the number of weights led to a slight increase in the classification accuracy of the training sets.
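The FS and nFW settings compared above differ only in how a chromosome is decoded into the weight vector used by the weighted distance of Eq. 2. The sketch below shows one plausible decoding; it is our illustration of the idea, and the exact gene coding used by the developed system of Chapter 5 may differ (for instance, in the weight range).

```python
def decode_selection(bits):
    """FS-style chromosome: one bit per feature; 0 drops the feature, 1 keeps it."""
    return [float(b) for b in bits]

def decode_weighting(genes, n_levels):
    """nFW-style chromosome: each gene is an integer level in {0, ..., n_levels - 1},
    mapped to an equally spaced weight in [0, 1]; level 0 still eliminates the feature."""
    return [g / (n_levels - 1) for g in genes]

print(decode_selection([1, 0, 1, 1, 0, 0]))        # FS:  weights in {0, 1}
print(decode_weighting([4, 0, 2, 1, 3, 0], 5))     # 5FW: weights in {0, 0.25, 0.5, 0.75, 1}
```

With two weight levels the two decodings coincide, which is the sense in which feature selection is the special case of feature weighting referred to throughout this comparison.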
We believe the interesting disparity between training and validation results is due to over-fitting (which is caused by the bias-variance tradeoff). Over-fitting means that the learning algorithm adapts so well to a training set that its predictions on new samples are poor (the performance on the validation set is bad) (Duda, Hart & Stork, 2000). A classifier can be seen as a learning machine or learning algorithm aiming to estimate correct classes for the given data samples. A classifier with excellent generalization ability is one that can make good predictions for samples that are not in the training set. However, a classifier which allows perfect classification of the training samples while having poor predictions for new (unseen) samples is said to over-fit the training samples. Over-fitting is related to the bias vs. variance tradeoff (Geman, Bienenstock & Doursat, 1992). In general, classification error can be decomposed into two components, bias and variance. Bias measures how much the error estimate deviates from the true value, whereas variance measures the variability in classification across different training samples.

For a data set with a finite number of samples, there is a trade-off between these two: increasing the bias decreases the variance, and vice versa (Theodoridis & Koutroumbas, 1998). Generally, as the number of parameters of a learning algorithm increases, the classifier will have more flexibility to adapt to the details of the specific training set, hence the bias will decrease but the variance will increase. Conversely, if the number of parameters is small, the classifier will not fit the training data well (high bias), but this fit will not change much for different new samples (low variance) (Duda et al. 2000).

As a result, GFW allows a finer-grain representation of the search space, but at the expense of an increased classification error rate. Allowing many weights (as in GFW) will reduce bias but will also increase variance, and hence increase the probability of over-fitting. In contrast, using a small number of weights (as in GFS) will increase the bias, due to the lack of representation of the space; however, this will also reduce variance, which in turn reduces the possibility of over-fitting the data (Kohavi et al. 1997). The only way to get zero bias and zero variance at the same time is to make the number of training samples very large (possibly infinite) and to have prior knowledge about the problem (i.e., the shape of the decision boundary). Unfortunately, the number of samples in practice is finite and the true model of the problem is not known (Duda et al. 2000). Hence, the best one can do is to find the compromise in the number of parameters that optimizes the bias-variance trade-off. However, obtaining low variance is generally more important for accurate classification than having low bias (Duda et al. 2000), because low variance means better generalization ability and less chance of over-fitting.

In (Kohavi et al. 1997), it is illustrated that increasing the number of weights above two hardly ever decreases classification errors. In their study they show that with 10 weight levels, FW fails to outperform FS on 11 real-world databases. They conclude that, \"On many natural data sets, restricting the set of weights to only two alternatives - which is equivalent to feature subset selection - gives the best results\". Though we used different search methods and larger datasets than those used by Kohavi et al. (1997), we get similar results and conclusions. In addition, we should take into account that the databases we are using to report our results are handwritten character recognition databases. This means that there are many variants of digit shape, size, and generally, style. Also, different writers have different writing styles. For the 10 digits, there is a nearly unlimited number of variations. Thus, feature weighting over-fits the training data by finding suitable weights, but the weights obtained do not generally represent the underlying variations in the handwritten training samples, due to the large variance among these samples.

On the other hand, other papers have shown that feature weighting using GA yields slightly better classification accuracy than feature selection for real-world databases. For example, Punch et al. (1993) performed tests using binary and real-valued weights (with weights in [0, 10]). Their database consisted of images of soil samples.
Their results show that the error rates obtained using real-valued weights are better than those with binary weights. Also, Komosinski & Krawiec (2000) provide further evidence that feature weighting is somewhat better than selection when applied to a brain-tumor diagnosis system. Their results confirm that feature weighting (using weights in [0, 9]) leads to somewhat better classification accuracy rates than feature selection alone. However, several points must be raised here. First, the results reported in (Komosinski & Krawiec, 2000) are those of the GA evaluation; no validation tests were done to test the resultant weights on separate data sets (i.e., their reported results are based on training only). In fact, Kohavi & Sommerfield (1995) noticed this issue and state that separate holdout sets, which were never used during feature selection/weighting, should be used in the final test of performance. Second, the results reported in (Punch et al. 1993) were obtained on validation samples that were randomly drawn from the training samples, which has the same biasing effect on the results. Third, the improvement in classification accuracy reported in both papers for FW over FS is so small that it does not, in itself, provide final evidence that FW gives better results than FS on real datasets. Komosinski & Krawiec (2000) report classification accuracies of 83.43 ± 1.93 and 77.83 ± 2.03 (for two datasets) for FW, vs. accuracies of 80 ± 1.22 and 75.7 ± 1.64 for FS. Punch et al. report classification error rates for FW of 0.83% and 2% for training and validation respectively, whereas those of FS were 1.66% and 3.2% for training and validation.

In conclusion, despite the fact that feature weighting has the best training classification accuracies, feature selection is better in generalization, and hence more suited to real-world applications (in which most data is new). This is because FW over-fits the training data, losing generalizability in the process. Therefore, it is advisable to use 2 (FS) or 3 (3FW) weight levels at most, as Kohavi et al. (1997) recommend.

Chapter 6  Genetic Feature Selection Evaluation

In Chapter 5 we compared the performance of GA-based feature selection to GA-based feature weighting, within the context of character recognition systems and under various conditions. In this chapter, we intend to evaluate the performance of the better method (which turns out to be genetic feature selection, GFS) in terms of optimality and time.

6.1 Introduction

In Chapter 5 we compared genetic feature selection and weighting and showed that GFS outperforms GFW in many aspects, such as dimensionality reduction, the presence of irrelevant and redundant features, and classification accuracy. Therefore, it is important to assess the performance of GFS for both optimality and time. For example, how does GFS perform in comparison with the exhaustive search? To put it another way, are the solutions obtained from GFS optimal (or near-optimal) solutions? Also, if GFS is indeed capable of reaching optimal (or near-optimal) solutions, what is the number of generations required to reach such an optimum? Does increasing the number of features necessitate an increase in the number of generations needed to obtain optimal solutions? We intend to tackle all these questions in this chapter.
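To make the object of these evaluations concrete, here is a deliberately simplified sketch of a wrapper-style GFS loop: binary chromosomes encode feature masks, and the fitness of a mask is the classification accuracy that a k-NN classifier achieves using only the selected features. This is our illustration, not the developed system of Chapter 5; the population size, operator rates and fitness function are placeholders.

```python
import random

def evolve_feature_mask(accuracy, n_features, pop_size=20, generations=30,
                        crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Generational GA over binary feature masks. `accuracy(mask)` must return the
    classification accuracy (e.g. of a k-NN wrapper) obtained with the masked features."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]

    def tournament(scored):
        a, b = rng.sample(scored, 2)              # binary tournament selection
        return list(a[1] if a[0] >= b[0] else b[1])

    best_mask, best_fit = None, -1.0
    for _ in range(generations):
        scored = [(accuracy(ind), ind) for ind in pop]
        for fit, ind in scored:
            if fit > best_fit:
                best_fit, best_mask = fit, list(ind)
        next_pop = []
        while len(next_pop) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            if rng.random() < crossover_rate:     # one-point crossover
                cut = rng.randrange(1, n_features)
                p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (p1, p2):
                for i in range(n_features):       # bit-flip mutation
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                next_pop.append(child)
        pop = next_pop[:pop_size]
    return best_mask, best_fit

# Toy fitness standing in for k-NN accuracy: features 0 and 3 help, the rest only add cost.
def toy_accuracy(mask):
    return 0.5 + 0.2 * mask[0] + 0.2 * mask[3] - 0.01 * sum(mask)

print(evolve_feature_mask(toy_accuracy, n_features=8))
```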
The following sections present two empirical studies that examine the effectiveness of GA-based feature selection (GFS) with respect to (a) finding an optimal set of features, and (b) doing so within an acceptable time frame, for off-line applications (e.g., pre-release optimization of character recognition software).

6.2 Convergence of Genetic Feature Selection to an Optimal or Near-Optimal Set of Features (Evaluation 1)

It is clear from the previously described literature that several studies exist comparing GA with other feature selection search algorithms (see section 4.3), so it is not our intention to repeat such work. However, there is a need to determine whether or not GFS does indeed return optimal or near-optimal feature sub-sets. This necessitates an exhaustive search of the feature space in order to find one or more bona fide optimal feature sub-sets. These can then be used to assess (with certainty) the optimality of the best GFS-generated feature sub-sets. For the experiments in sections 5.5.2 and 5.5.4, we used a single training/testing set of sizes 1000 and 500, respectively. There was no need for a separate validation set in this case, because our goal is to see whether or not GFS will reach the optimal values reached by the exhaustive search, not to test the generalization ability of GFS, which was already proven in previous sections. We ran both the exhaustive search and GFS using a different number of features each time. The numbers of features used in the searches are 8, 10, 12, 14, 16, and 18. We used DB3 with the dataset that contains 6 features, and to obtain the required number of features (8, 10, 12, 14, 16, 18) we randomly selected 2, 4, 6, 8, 10, and 12 features from the 47-feature dataset and added them to the 6-feature dataset. In general, running the exhaustive search means exploring a space of size 2^d, where d is the number of features. However, when d gets large, the exhaustive search becomes computationally prohibitive. For example, with 16 features we have to try 2^16 = 65536 different feature combinations, while 18 features would require 2^18 = 262144. When we ran the exhaustive search for FS using 16 features, it took almost 6 hours, while 18 features took one and a half days. We have only run the exhaustive search for a maximum of 18 features. However, based on these actual run times, the estimated time to run the exhaustive search for 20 features would be seven and a half days, whereas for 47 features the time needed would be 2791 years.
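For reference, the exhaustive wrapper search that GFS is compared against can be written in a few lines (again our sketch; the thesis does not give its implementation). The loop makes the 2^d cost explicit, which is what the run times quoted above reflect.

```python
from itertools import product

def exhaustive_feature_search(accuracy, n_features):
    """Evaluate every binary feature mask (2**n_features of them) and return the best.
    2**16 = 65536 and 2**18 = 262144 masks match the combination counts quoted above;
    at 47 features the count exceeds 1.4e14, which is why the search becomes infeasible."""
    best_mask, best_fit = None, -1.0
    for mask in product((0, 1), repeat=n_features):
        fit = accuracy(mask)
        if fit > best_fit:
            best_mask, best_fit = list(mask), fit
    return best_mask, best_fit

# The result can serve as ground truth for judging the GA's best mask, e.g.
#   exhaustive_feature_search(toy_accuracy, 8) with the toy fitness defined earlier.
```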
In every case but one (when the number of features is 16), the best accuracy achieved by the GA was identical to that found by the exhaustive search of the feature space. Furthermore, the fact that the worst average GA value is less than 1 percentage point away from the optimal value (in the row with 16 features, the difference between the average GA value and the optimal value is 0.92) means that the GA was consistently able to return optimal or near-optimal accuracy values. Hence, GA-based feature selection consistently converges to an optimal or near-optimal set of features.

Number of Features | Best Exhaustive (classification rate) | Best GA (classification rate) | Average GA (for 5 runs)
8  | 74   | 74   | 74
10 | 75.2 | 75.2 | 75.2
12 | 77.2 | 77.2 | 77.04
14 | 79   | 79   | 78.56
16 | 79.2 | 79   | 78.28
18 | 79.4 | 79.4 | 78.92

Table 3: Best classification accuracy rates achieved by GA and exhaustive search.

6.4 Convergence of Genetic Feature Selection to an Optimal or Near-Optimal Set of Features within an Acceptable Number of Generations (Evaluation 2)

In general, the run time of the GA is proportional to the number of features, the number of generations, and the size of the population (Kudo & Sklansky, 2000). Moreover, the time complexity of the nearest neighbor classifier is proportional to $t^2 \times d$, where t is the number of training samples and d is the number of features used. So a GA that is applied in a wrapper configuration (see Figure 2 in section 3.5.1) to a nearest neighbor classifier spends most of its time running the nearest neighbor classifier (Brill et al., 1992). Below, we list the factors that affect the time complexity of our GFS or GFW:
• Number of generations
• Population size
• Number of features
• Number of training samples

In this evaluation, we investigate the relationship between the number of features (in the original set) and the number of generations required to reach an optimal or near-optimal sub-set. We do so while keeping the two other factors (i.e., population size and number of training samples) constant.

6.5 Results of Evaluation 2

The results of experimentation are shown in Table 4. The second column of Table 4 shows the optimal and near-optimal (the second best value after the optimal) classification accuracy rates attained using an exhaustive search, whereas the third column displays the duration of that search. The fourth column contains the best results achieved by a GA-based search, while the last column contains the time it took the GA to attain those values.

Number of Features | Best Exhaustive (optimal / near-optimal) | Exhaustive Run Time | Best GA | Average GA (for 5 runs) | Number of Generations | GA Run Time (single run)
8  | 74 / 73.8   | 2 minutes  | 74   | 73.68 | 5  | 2 minutes
10 | 75.2 / 75   | 13 minutes | 75.2 | 74.96 | 5  | 3 minutes
12 | 77.2 / 77   | 47 minutes | 77   | 76.92 | 10 | 5 minutes
14 | 79 / 78.8   | 3 hours    | 79   | 78.2  | 10 | 5.5 minutes
16 | 79.2 / 79   | 6 hours    | 79.2 | 78.48 | 15 | 8 minutes
18 | 79.4 / 79.2 | 1.5 days   | 79.4 | 78.92 | 20 | 11 minutes

Table 4: Number of generations to convergence.

The following observations can be made:
• In every case, GA-based FS was able to find the optimal or near-optimal values found by the exhaustive search, and it took much less time to do so. With only 8 features, the run times for the exhaustive and GA searches are identical; for 18 features, however, the run time of the exhaustive search was 1.5 days, compared to 11 minutes for the GA.
• As the number of features increases, the number of generations required to find the optimal or near-optimal values increases as well.

Figure 14 below displays the relationship between the number of generations needed to reach optimal or near-optimal values with a GA and the number of features in the original set. The tiny triangles represent actual data from Table 4 (above). There are also two curves in Figure 14. The solid line is a linear best-fit curve that fits, and extrapolates, the data points; it is obtained by regression analysis using the linear best-fit method, which calculates the straight line that best fits the data. The extrapolated segment of this curve (beyond 18 features) represents the most optimistic projection of GA run time. The dotted curve is an exponential curve that fits, and extrapolates, the same data points; again, regression analysis is used to calculate the best exponential fit. The extrapolated segment of this curve (again, beyond 18 features) represents the most pessimistic projection of GA run time. Extrapolation is needed because of the obviously impossible amount of time required to run exhaustive searches of large feature spaces using a single-processor computer. In any case, one can safely conclude that the time needed for a GA-based search is bounded below by the (optimistic) linear best-fit curve and above by the (pessimistic) exponential curve.

[Figure 14: Number of generations to convergence as a function of the number of features. Actual data points (8, 10, 12, 14, 16, 18 features → 5, 5, 10, 10, 15, 20 generations), with the linear extrapolation reaching 89 generations at 60 features and the exponential extrapolation reaching 10585 generations at 60 features.]

Using exponential extrapolation, the number of generations expected for 60 features is 10585, while it is only 89 for linear extrapolation. The true value should be somewhere in between. If we assume that the true value is midway between 89 and 10585 (which is 5337), then running the GA will be computationally expensive. In such cases, using methods of parallel GA, such as DVeGA (Moser, 1999), becomes necessary. Finally, we conclude that GA-based feature selection is a reliable method of locating optimal or near-optimal feature sub-sets. These techniques also save time, relative to exhaustive searches. However, their effective use in large feature spaces is dependent upon the availability of parallel processors to speed up the GA work.

6.6 Verification Experiments

In this section, some of the experimental work (comparative study and evaluation) described in this chapter and in Chapter 5 is repeated using our own extracted features instead of the publicly accessible character databases. This step is necessary to verify the results we obtained earlier.

6.6.1 Pre-Processing and Feature Extraction

The image files used for the verification of the results are from the USPS Office of Advanced Technology Database of Handwritten Digits, produced by the Center of Excellence for Document Analysis and Recognition (CEDAR). Using these handwritten images, we extracted our own features and built a feature set of 40 features. The code used for extracting those features was written in Matlab. We first performed some pre-processing as needed.
The aim of pre-processing is to improve the image data by suppressing unwanted distortions or enhancing image features that are important for further processing. The pre-processing steps performed are:
1- Noise removal: Remove isolated pixels (1's surrounded by 0's).
2- Image resizing: Get the bounding box of the image (the smallest rectangle that can contain the region) and shift the bounding box into the middle of the image, so that the image size is 128x128.
3- Thinning (when needed): Remove pixels on the boundaries of objects without allowing objects to break apart (i.e., thinning the image to 1-pixel width). Extracting some features requires the thinned image rather than the whole image.

The features used are shown in Table 5 and explained below.

Seven Hu Moments | Filled Area | Solidity
Area | Convex Area | Orientation
Bounding Box | Number of Holes | Eccentricity
Major Axis Length | Number of End Points | Centroid
Minor Axis Length | Circularity | Eight Extrema Points

Table 5: The extracted feature set from handwritten digit images.

1. Hu Moments: For a function f(x), we can compute the mean value of the function using $\mu = \frac{\sum_{x=1}^{N} x f(x)}{\sum_{x=1}^{N} f(x)}$. We can also describe the variance by $\sigma^2 = \frac{\sum_{x=1}^{N} f(x)(x-\mu)^2}{\sum_{x=1}^{N} f(x)}$. A third statistical property, called skew, describes how symmetric the function is: $\text{skew} = \frac{\sum_{x=1}^{N} f(x)(x-\mu)^3}{\sum_{x=1}^{N} f(x)}$. All of these are examples of moments of the function. One can define moments about some arbitrary point, usually either about zero or about the mean. The n-th moment about zero, denoted $m_n$, is $m_n = \sum_{x=1}^{N} x^n f(x)$, where $m_0$ is the total value of the function. The mean $\mu$ is the first moment about zero divided by the zero-th moment: $\mu = \frac{m_1}{m_0}$. The n-th moment about the mean, denoted $\mu_n$ and called the n-th central moment, is $\mu_n = \sum_{x=1}^{N} (x-\mu)^n f(x)$. The zero-th central moment $\mu_0$ is, again, the total value of the function. The first central moment $\mu_1$ is always 0. The second central moment $\mu_2$, when normalized by the total value $\mu_0$, is the variance: $\sigma^2 = \frac{\mu_2}{\mu_0}$. The third central moment $\mu_3$, when normalized, is the skew: $\text{skew} = \frac{\mu_3}{\mu_0}$. The fourth central moment $\mu_4$, when normalized, is the kurtosis: $\text{kurtosis} = \frac{\mu_4}{\mu_0}$. Given an infinite number of central moments, we can completely describe the shape of the function. A set of seven invariant moments can be derived from the second and third moments (Gonzalez & Woods, 1992). This set of moments is invariant to translation, rotation, and scale change (Hu moments).

2. Area: The number of pixels in the shape.

3. Bounding Box: A 1-by-4 vector representing the smallest rectangle that can contain the region. The format of the vector is [x y width height], where x and y are the x- and y-coordinates of the upper-left corner of the rectangle, and width and height are the width and height of the rectangle.

4. Major Axis Length: The length (in pixels) of the major axis of the ellipse that has the same second moments as the region.

5. Minor Axis Length: The length (in pixels) of the minor axis of the ellipse that has the same second moments as the region.

6. Filled Area: The number of on pixels in the filled image.

7. Convex Area: The number of pixels in the convex image.

8. Number of Holes: The number of holes in the shape.

9. Number of End Points: The number of end points in the shape.

10. Circularity: The ratio of the squared perimeter to the area, $C = \frac{P^2}{4\pi A}$, where P is the perimeter and A is the area (Parker, 1994).
11. Solidity: The proportion of the pixels in the convex hull that are also in the region, computed as Area/ConvexArea.

12. Orientation: The angle (in degrees) between the x-axis and the major axis of the ellipse that has the same second moments as the region.

13. Eccentricity: The ratio of the length of the longest chord of the shape to the longest chord perpendicular to it. Equivalently, it can be defined as the eccentricity of the ellipse that has the same second moments as the region, i.e., the ratio of the distance between the foci of the ellipse to its major axis length. The value is between 0 and 1 (0 and 1 are degenerate cases: an ellipse whose eccentricity is 0 is actually a circle, while an ellipse whose eccentricity is 1 is a line segment).

14. Centroid: The x and y coordinates of the center of mass of the region.

15. Extrema: An 8-by-2 matrix containing the extremal points of the region. Each row of the matrix contains the x and y coordinates of one of the points; the order of the points is [top-left top-right right-top right-bottom bottom-right bottom-left left-bottom left-top]. Figure 15 shows these extremal points.

[Figure 15: Extremal points of a region (top-left, top-right, right-top, right-bottom, bottom-right, bottom-left, left-bottom, left-top).]

6.6.2 Verification of Comparison 1

We repeated two of the previous experiments using our own database, which contains 40 features. The first experiment repeated using our own extracted features is comparison 1 (see section 5.6.1 in Chapter 5), which studies the effect of varying the number of values that weights can take on the number of selected features. The error estimation method used is leave-one-out cross-validation applied to the training data itself. The number of training samples used is 500, and the resultant weights are assessed on a validation set of size 250 that is never used during training. In addition, we performed random partitioning of the data samples: we randomly selected different training/validation samples each time, provided that the training and validation samples are completely disjoint. Also, the random partitioning is stratified, meaning that all the classes (digit classes) are equally represented (i.e., the number of training/validation samples for each class is equal). This process was repeated five times and the average accuracy calculated. The average results are shown in Table 6.

Method of Selection/Weighting | Increment Value | Number of Levels | Probability of Zero | Accuracy of Training | Accuracy of Validation | Number of Zero Features
1-NN | - | - | - | 70.48 | 70.96 | -
FS | 1 | 2 | 1/2 | 77.68 | 74.72 | 21
3FW (three discrete values: 0, 0.5, 1) | 0.5 | 3 | 1/3 (0.33) | 78.32 | 73.6 | 14
5FW (five discrete values: 0, 0.25, 0.5, 0.75, 1) | 0.25 | 5 | 1/5 (0.2) | 78.44 | 73.6 | 9
33FW | 0.03125 (1/32) | 32+1 | 1/33 | 78.16 | 73.36 | 0

Table 6: Recognition accuracy and number of zero features for various selection and weighting schemes.

Though our only intention in repeating this experiment was to confirm the results previously obtained in comparison 1 (see section 5.6.2), the results shown in Table 6 illustrate the following:
• Comparing the results of the 1-NN to the FS and FW methods, we find that many features were eliminated while the classification accuracy increased.
This means that there are many irrelevant features, and eliminating them helped increase the classification rate.
• As previously shown in section 5.6.2, FS far outperforms FW in terms of the number of zero features. As the number of values a weight can take increases, the number of eliminated features decreases. These results are consistent with those previously obtained in comparison 1 in section 5.6.2.
• FS is better than the other feature weighting settings (3FW, 5FW and 33FW) in terms of the number of eliminated features. Moreover, FS has a somewhat better validation classification accuracy than FW. This result is consistent with the results of comparison 2 (described in section 5.6.4), which deals with the usefulness of FS in the presence of irrelevant features.

6.6.3 Verification of Evaluation 1

The second experiment that we repeated using our own extracted features is evaluation 1, the study of the convergence of GFS to an optimal or near-optimal set of features. As in evaluation 1 (sections 6.2 and 6.3), we ran the exhaustive search to explore the space of $2^d$ subsets, where d is the number of features, and recorded the optimal and near-optimal solutions it reached. Also, we ran GFS five times and reported the best and average accuracies obtained. We used a single train/test set of size 500/250, and the numbers of features used were 16, 18 and 20, for both the exhaustive and GA searches. Running the exhaustive search for the whole 40-feature set would be computationally impossible; therefore, we chose smaller subsets of 20, 18 and 16 features out of the whole 40-feature set to run the exhaustive search. We have only run the exhaustive search for a maximum of 20 features. However, based on the actual run times, the estimated time to run the exhaustive search for 22 features would be 150 hours (6.25 days). The results are shown in Table 7, where the maximum classification accuracy obtained using the exhaustive search is recorded, together with the best and average values reached by GFS.

Number of Features | Best Exhaustive (optimal / near-optimal) | Exhaustive Run Time | Best GA | Average GA (for 5 runs) | Number of Generations | GA Run Time (single run)
16 | 75.2 / 74.8 | 2 hr 30 min  | 75.2 | 74.08 | 15 | 1.5 minutes
18 | 75.2 / 74.8 | 10 hr 45 min | 75.2 | 73.84 | 15 | 1 min 40 sec
20 | 75.2 / 74.8 | 2 days       | 75.2 | 74.24 | 20 | 2.5 minutes

Table 7: Classification accuracy rates achieved by GA and exhaustive search.

Looking at the results in Table 7, we can see the following:
• The optimal values reached by the exhaustive searches for the 16-, 18- and 20-feature sets are all the same. This confirms our previous observation that some of the features are redundant/irrelevant, so eliminating them did not affect the optimal classification rate obtained.
• GFS was able to reach the optimal or near-optimal solutions found by the exhaustive search. These results are consistent with those previously obtained from evaluation 1 in section 6.3.

Finally, it is important to note that the classification accuracies obtained using our extracted 40 features are around 70% using 1-NN and around 75% after GFS. Though these rates are not high for a typical handwritten digit recognition task, they were satisfactory for us, since our main purpose was to show that GFS could reach the optimal solutions reached by the exhaustive search.
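The data-partitioning protocol described for the verification of comparison 1 (section 6.6.2 above) can be sketched as follows. This is an illustration only, not the thesis's bench code; the helper names are ours, and the per-class counts (50 training, 25 validation) simply follow from the stated 500/250 split over the ten digit classes.

```python
import numpy as np

def stratified_split(X, y, n_train_per_class, n_val_per_class, rng):
    """Pick disjoint, class-balanced training and validation index sets."""
    train_idx, val_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        train_idx.extend(idx[:n_train_per_class])
        val_idx.extend(idx[n_train_per_class:n_train_per_class + n_val_per_class])
    return np.array(train_idx), np.array(val_idx)

def repeated_evaluation(X, y, evaluate, repeats=5, seed=0):
    """Average an accuracy function over several stratified re-partitions."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        # 50 training and 25 validation samples per digit class give the
        # 500/250 split described in the text (assuming 10 classes).
        tr, va = stratified_split(X, y, 50, 25, rng)
        scores.append(evaluate(X[tr], y[tr], X[va], y[va]))
    return float(np.mean(scores))
```

Averaging over several disjoint, stratified partitions keeps every digit class equally represented and reduces the variance that a single random split would introduce into the reported accuracies.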
Enhancing the classification accuracy for our extracted features may require the addition of more features and the application of further pre-processing steps, such as slant and slope corrections.

Chapter 7
Conclusions and Future Research

7.1 Conclusions

The objective of this research was to apply genetic algorithms to the problem of feature weighting for character recognition applications. In addition, we expected that, because feature weighting is the general case of feature selection, it should perform better than feature selection, at least in some situations. So, after employing GFW in a character recognition application, we were interested in comparing the performance of GFS and GFW, also in the context of character recognition applications, to test the validity of this hypothesis. To achieve these objectives we built a pattern recognition experimental bench that contains a genetic-based feature selection and weighting module. We then carried out two sets of studies, which produced some unexpected but justified results. The first set compares the performance of Genetic Algorithm (GA)-based feature selection to GA-based feature weighting, under various conditions. The second set of studies evaluates the performance of the better method (which turned out to be feature selection) in terms of optimal performance and time. The results of these studies show the superiority of GFS over GFW in terms of (a) the number of eliminated features, as well as (b) recognition accuracy, in situations where irrelevant and/or redundant features are present. Furthermore, GFS succeeded in finding optimal or near-optimal solutions in all experiments. In addition, the results show that GA is an effective method for feature selection. However, its scalability to highly dimensional problems remains, in practice, an open problem. The following sections summarize the lessons learnt from this research effort.

7.1.1 Genetic Feature Selection versus Genetic Feature Weighting

Genetic feature selection is clearly superior to genetic feature weighting in terms of feature reduction. The main reason for this superiority appears to be the small number of weight values that feature selection uses, which is only 2 (zero and one), compared to the number of weight values used by feature weighting (potentially infinite). In the presence of irrelevant features, feature selection, and not feature weighting, is the technique most suited to feature reduction. Furthermore, it is necessary to use some method of feature selection before a 1-NN or similar nearest-neighbor classifier is applied, because the performance of such classifiers degrades rapidly in the presence of irrelevant features. Since it has been shown that GA-based feature selection is effective at eliminating irrelevant features, it is reasonable to try a GA (at least) as a method of feature selection before classification is attempted using nearest-neighbor classifiers. The classification accuracy of a 1-NN classifier does not suffer as much from having a significant number of redundant features in the (original) feature set. However, due to the other problems associated with redundant features, such as increased dimensionality (and hence computational cost), the arbitrariness of the procedure, and slightly worse classification accuracies, it is recommended that a GA-based feature selection method be used to eliminate as many of the redundant features as possible.
It is clear that the most suitable feature selection/weighting methods for such a task are FS and 3FW. Despite the fact that feature weighting achieves the best training classification accuracies, feature selection is better at generalization (i.e., it makes better predictions than feature weighting for samples that are not in the training set), and hence is more suited to real-world applications (in which most data is new). This is because FW over-fits the training data, resulting in poorer predictions (than feature selection) for new samples. Therefore, it is advisable to use at most 2 (FS) or 3 (3FW) weight levels, as Kohavi et al. (1997) recommend.

7.1.2 Performance of Genetic Feature Selection

Genetic feature selection is a reliable method of locating optimal or near-optimal feature sub-sets. These techniques also save time, relative to exhaustive searches. The question of how well our method will scale up to highly dimensional feature spaces remains open; its effective use in large feature spaces is dependent upon the availability of parallel processors.

7.2 Future Research

We present below a list of research problems, carefully justified, that we believe future research in GFS should address. It is our belief that solutions to such problems will help with the automation of genetic feature selection/weighting in pattern recognition applications.

• Research Problem 1. Feature selection, and therefore feature weighting, is NP-complete. Hence, although feature selection has shown very promising results, practical applications are limited by the dimensionality of the solution search space. Moser (Moser & Murty, 2000) examined the scalability of Distributed Vertical Genetic Algorithms (DVeGA) to very large-scale feature selection applications with more than 500 features. His application succeeded in reducing the dimensionality while simultaneously maintaining high accuracy. Crucially, Moser's \"experiments showed that GA scale well to domains of large complexity in feature selection\" (Moser & Murty, 2000). A possible direction for future research is therefore to use the idea of distributed genetic algorithms (or parallel processing in general) in the implementation of our own GFS/GFW optimizer, to make highly dimensional feature spaces feasible.

• Research Problem 2. Another possible direction for future research is to understand how to generalize the lessons gained from the successful application of the GFS optimizer (to a particular symbol set) to new and different symbol sets. Examples of symbol sets that could be tried are hand-written English characters, mathematical notation, and any (pre-segmented) black & white or gray-scale 2D line drawing. Indeed, the real power of the GFS optimization approach we are proposing will not be fully realized until the experimental bench starts working successfully with different symbol sets. Furthermore, it should do so without recourse to extensive and lengthy trial-and-error tuning. This will help in building and configuring a character recognition software product tailored for a specific symbol set, with minimal help from pattern recognition experts.

• Research Problem 3. We have shown that FS, in general, is superior to FW in terms of the number of eliminated features, as well as recognition accuracy, especially in cases where irrelevant/redundant features are present in the original feature set. However, other combinations of FS and FW could perform better than FS or FW alone.
For example, Raymer, Punch, Goodman, Kuhn & Jain (2000) applied simultaneous feature weighting and selection using a genetic algorithm via a masking technique. They obtained better results on validation samples than with FW alone, and state that operating FS and FW simultaneously allows the GA to find better interactions between features than operating FS and FW independently (Raymer et al., 2000). Hence, researchers may wish to investigate that approach further by comparing the simultaneous feature weighting and selection suggested in (Raymer et al., 2000) with our method of FS and FW operating independently. In addition, researchers could also apply different weighting schemes, such as local weighting, and combine both global and local weighting to compare FS and FW.

• Research Problem 4. Referring to section 4.3, there appears to be no record of any study that compares the performance of GA with simulated annealing for the feature selection problem. In addition, the only existing work (Zhang & Sun, 2002) that compares GA with tabu search for the feature selection problem needs to be verified using the true classifier error rate and real datasets (see section 4.3 for details). Therefore, a fruitful area for future research is to compare the performance of the three stochastic search algorithms (GA, simulated annealing and tabu search) in the context of the feature selection problem.

References

Aha, D. W. (1992). Tolerating noisy, irrelevant, and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36, 267-287.

Aha, D. W., & Bankert, R. L. (1994). Feature selection for case-based classification of cloud types: An empirical comparison. In D. W. Aha (Ed.), Case-Based Reasoning: Papers from the 1994 Workshop (Technical Report WS-94-01). Menlo Park, CA: AAAI Press. (NCARAI TR: AIC-94-011)

Beasley, D., Bull, D. R., & Martin, R. R. (1993). An overview of genetic algorithms: Part 1, fundamentals. University Computing, Inter-University Committee on Computing, 15(2), 58-69.

Blum, A. L., & Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 245-271.

Brill, F., Brown, D., & Martin, W. (1992). Fast genetic selection of features for neural network classifiers. IEEE Transactions on Neural Networks, 3(2), 324-328.

Dasarathy, B. V. (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.

Dash, M., & Liu, H. (1997). Feature selection for classification. Intelligent Data Analysis, 1, 131-156.

Davis, L. (1991). Handbook of Genetic Algorithms. Van Nostrand Reinhold.

De Jong, K. (1975). An analysis of the behavior of a class of genetic adaptive systems. Doctoral dissertation, University of Michigan. Dissertation Abstracts International, 36(10), 5140B (University Microfilms No. 76-9381).

Demiroz, G., & Guvenir, H. A. (1996). Genetic algorithms to learn feature weights for the nearest neighbor algorithm. In Proceedings of the 6th Belgian-Dutch Conference on Machine Learning (BENELEARN-96), pp. 117-126.

Devijver, P., & Kittler, J. (1982). Pattern Recognition: A Statistical Approach. Prentice Hall.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern Classification. John Wiley Interscience.
Estevez, P. A., & Fernandez, M. (1999). Selection of features for the classification of wood board defects. In Proceedings of the Ninth International Conference on Artificial Neural Networks, 1, 347-352.

Ferri, M., & Piccioni, M. (1992). Optimal selection of statistical units: An approach via simulated annealing. Computational Statistics and Data Analysis, 13, 47-61.

Fung, G., Liu, J., & Lau, R. (1996). Feature selection in automatic signature verification based on genetic algorithms. In Proceedings of the International Conference on Neural Information, pp. 811-815.

Gaborski, R. S., & Anderson, P. G. (1993). Genetic algorithm selection of features for handwritten character identification. In International Conference on Artificial Neural Networks & Genetic Algorithms (ANNGA 93), Innsbruck, Austria.

Geman, S., Bienenstock, E., & Doursat, R. (1995). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.

Glover, F. (1986). Future paths for integer programming and links to artificial intelligence. Computers and Operations Research, 5, 533-549.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.

Gonzalez, R. C., & Woods, R. E. (1992). Digital Image Processing. Addison-Wesley.

Handels, H., Ross, T., Kreusch, J., Wolff, H. H., & Poppl, S. J. (1999). Feature selection for optimized skin tumor recognition using genetic algorithms. Artificial Intelligence in Medicine, 16(3), 283-297.

Howe, N., & Cardie, C. (1997). Examining locally varying weights for nearest neighbor algorithms. In D. Leake & E. Plaza (Eds.), Case-Based Reasoning Research and Development: Second International Conference on Case-Based Reasoning, Lecture Notes in Artificial Intelligence, Springer, pp. 455-466.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.

Hussein, F., Kharma, N., & Ward, R. (2001). Genetic algorithms for feature selection and weighting. In Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR'01), pp. 1240-1244.

Jack, L. B., & Nandi, A. K. (2000). Genetic algorithms for feature selection in machine condition monitoring with vibration signals. IEE Proceedings, Vision, Image and Signal Processing, 147(3), 205-212.

Jain, A., & Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2), 153-158.

Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4-37.

Kelly, J., & Davis, L. (1991). A hybrid genetic algorithm for classification. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pp. 645-650.

Kharma, N., & Ward, R. (1999). Character recognition systems for the non-expert. IEEE Canadian Review, 33, 5-8.

Kharma, N., Hussein, F., & Ward, R. K. (2002a). Using genetic algorithms for feature selection and weighting in character recognition systems. Evolutionary Computation Journal.

Kharma, N., Hussein, F., & Ward, R. K. (2002b). Genetic algorithms for feature selection and weighting in character recognition systems, a summary review and comparative study. The Sixteenth International Conference on Pattern Recognition, Quebec City (submitted).
Kim, G., & Kim, S. (2000). Feature selection using genetic algorithms for handwritten character recognition. In Proceedings of the Seventh International Workshop on Frontiers in Handwriting Recognition, pp. 103-112.

Kohavi, R., & Sommerfield, D. (1995). Feature subset selection using the wrapper method: Overfitting and dynamic search space topology. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), Montreal, Canada, pp. 192-197.

Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, special issue on relevance, 97(1-2), 273-324.

Kohavi, R., Langley, P., & Yun, Y. (1997). The utility of feature weighting in nearest-neighbor algorithms. European Conference on Machine Learning (ECML'97).

Komosinski, M., & Krawiec, K. (2000). Evolutionary weighting of image features for diagnoses of CNS tumors. Artificial Intelligence in Medicine, 19, 25-38.

Kudo, M., & Sklansky, J. (2000). Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33(1), 25-41.

Martin-Bautista, M. J., & Vila, M. A. (1998). Applying genetic algorithms to the feature selection problem in information retrieval. In Lecture Notes on Artificial Intelligence (LNAI), 1495. Springer-Verlag.

Matthew, W. (1999). GAlib (2.4.4). Massachusetts Institute of Technology (MIT). http://lancet.mit.edu/ga/

Michalewicz, Z., & Fogel, D. B. (1998). How to Solve It. Springer-Verlag.

Moser, A. (1999). A distributed vertical genetic algorithm for feature selection. In Proceedings of the Fifth International Conference on Document Analysis and Recognition, Open Research Forum.

Moser, A., & Murty, M. (2000). On the scalability of genetic algorithms to very large-scale feature selection. Real World Applications of Evolutionary Computing: Proceedings, 1803, 77-86.

Murphy, P., & Aha, D. (1994). Repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA. http://www.ics.uci.edu/~mlearn/MLRepository

Pandya, A. S., & Macy, R. B. (1996). Pattern Recognition with Neural Networks in C++. Boca Raton, FL: CRC Press.

Parker, J. R. (1994). Practical Computer Vision Using C. John Wiley & Sons, Inc.

Punch, W., Goodman, E., Pei, M., Chia-Shun, L., Hovland, P., & Enbody, R. (1993). Further research on feature selection and classification using genetic algorithms. In Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 379-383.

Raymer, M., Punch, W., Goodman, E., Kuhn, L., & Jain, A. (2000). Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(2), 164-171.

Sadiq, S. M., & Youssef, H. (1999). Iterative Computer Algorithms with Applications in Engineering. IEEE Computer Society.

Sahiner, B., Chan, H. P., Wei, D. T., Petrick, N., Helvie, M. A., Adler, D. D., & Goodsitt, M. M. (1996). Image feature selection by a genetic algorithm: Application to classification of mass and normal breast tissue. Medical Physics, 23(10), 1671-1684.

Shi, D., Shu, W., & Liu, H. (1998). Feature selection for handwritten Chinese character recognition based on genetic algorithms. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 5, 4201-4206.

Siedlecki, W., & Sklansky, J. (1988). On automatic feature selection. International Journal of Pattern Recognition, 2, 197-220.
Siedlecki, W., & Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. IEEE Transactions on Computers, 10, 335-347.

Smith, J. E., Fogarty, T. C., & Johnson, I. R. (1994). Genetic selection of features for clustering and classification. In IEE Colloquium on Genetic Algorithms in Image Processing and Vision.

Theodoridis, S., & Koutroumbas, K. (1998). Pattern Recognition. Academic Press.

Trier, O. D., Jain, A. K., & Taxt, T. (1996). Feature extraction methods for character recognition - A survey. Pattern Recognition, 29, 641-662.

Vafaie, H., & De Jong, K. (1993). Robust feature selection algorithms. In Proceedings of the IEEE International Conference on Tools with Artificial Intelligence, pp. 356-366.

Van Laarhoven, P. J. M., & Aarts, E. H. L. (1987). Simulated Annealing: Theory and Applications. Dordrecht: D. Reidel.

Wang, Y. K., & Fan, K. C. (1996). Applying genetic algorithms on pattern recognition: An analysis and survey. In Proceedings of the Thirteenth International Conference on Pattern Recognition, 2, 740-744.

Weideman, W. E., Manry, E. M., & Yau, H. C. (1989). A comparison of a nearest neighbor classifier and a neural network for numeric handprint character recognition. Proceedings of the International Joint Conference on Neural Networks, Washington, DC, 1, 117-120.

Weiss, S., & Kulikowski, C. (1991). Computer Systems that Learn. San Mateo, CA: Morgan Kaufmann.

Wettschereck, D., Aha, D. W., & Mohri, T. (1997). A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11, 273-314.

Wilson, D. R., & Martinez, T. R. (1996). Instance-based learning with genetically derived attribute weights. Proceedings of the International Conference on Artificial Intelligence, Expert Systems and Neural Networks (AIE'96), pp. 11-14.

Zhang, H., & Sun, G. (2002). Feature selection using tabu search method. Pattern Recognition, 35(3), 701-711. "@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "2002-05"@en ; edm:isShownAt "10.14288/1.0065127"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Electrical and Computer Engineering"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use."@en ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Genetic algorithm for feature selection and weighting for off-line character recognition"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/12103"@en .