@prefix vivo: .
@prefix edm: .
@prefix ns0: .
@prefix dcterms: .
@prefix skos: .

vivo:departmentOrSchool "Applied Science, Faculty of"@en, "Electrical and Computer Engineering, Department of"@en ;
edm:dataProvider "DSpace"@en ;
ns0:degreeCampus "UBCV"@en ;
dcterms:creator "Vilmansen, Toomas Rein"@en ;
dcterms:issued "2010-01-27T18:41:18Z"@en, "1974"@en ;
vivo:relatedDegree "Doctor of Philosophy - PhD"@en ;
ns0:degreeGrantor "University of British Columbia"@en ;
dcterms:description """Two different aspects of the problem of selecting measurements for statistical pattern recognition are investigated. First, the evaluation of features for multiclass recognition problems by using measures of probabilistic dependence is examined. Secondly, the problem of evaluation and selection of features for a general tree type classifier is investigated. Measures of probabilistic dependence are derived from pairwise distance measures such as Bhattacharyya distance, divergence, Matusita's distance, and discrimination information. The properties for the dependence measures are developed in the context of feature class dependency. Inequalities relating the measures are derived. Also upper and lower bounds on error probability are derived for the different measures. Comparisons of the bounds are made. Feature ordering experiments are performed to compare the measures to error probability and to each other. A fairly general tree type sequential classifier is examined. An algorithm which uses distance measures for clustering probability distributions and which uses dependence and distance measures for ordering features is derived for constructing the decision tree. The concept of confidence in a decision in conjunction with backtracking is introduced in order to make decisions at any node of the tree tentative and reversible. Also, the idea of re-introducing classes at any stage is discussed. Experiments are performed to determine the storage and processing requirements of the classifier, to determine effects of various parameters on performance, and to determine the usefulness of procedures for backtracking and reintroducing of classes."""@en ;
edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/19197?expand=metadata"@en ;
skos:note "INFORMATION AND DISTANCE MEASURES WITH APPLICATION TO FEATURE EVALUATION AND TO HEURISTIC SEQUENTIAL CLASSIFICATION

by

TOOMAS REIN VILMANSEN
B.Eng., Carleton University, 1968
M.A.Sc., University of British Columbia, 1970

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in the Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
January 1974

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.
Department of Electrical Engineering
The University of British Columbia
Vancouver 8, Canada

ABSTRACT

Two different aspects of the problem of selecting measurements for statistical pattern recognition are investigated. First, the evaluation of features for multiclass recognition problems by using measures of probabilistic dependence is examined. Secondly, the problem of evaluation and selection of features for a general tree type classifier is investigated.

Measures of probabilistic dependence are derived from pairwise distance measures such as Bhattacharyya distance, divergence, Matusita's distance, and discrimination information. The properties for the dependence measures are developed in the context of feature class dependency. Inequalities relating the measures are derived. Also, upper and lower bounds on error probability are derived for the different measures. Comparisons of the bounds are made. Feature ordering experiments are performed to compare the measures to error probability and to each other.

A fairly general tree type sequential classifier is examined. An algorithm which uses distance measures for clustering probability distributions and which uses dependence and distance measures for ordering features is derived for constructing the decision tree. The concept of confidence in a decision in conjunction with backtracking is introduced in order to make decisions at any node of the tree tentative and reversible. Also, the idea of re-introducing classes at any stage is discussed. Experiments are performed to determine the storage and processing requirements of the classifier, to determine effects of various parameters on performance, and to determine the usefulness of procedures for backtracking and reintroducing of classes.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF ILLUSTRATIONS
LIST OF TABLES
ACKNOWLEDGEMENT

I INTRODUCTION
  1.1 Objectives of the Thesis
  1.2 Justification for the Research
    1.2.1 Dependence Measures as Feature Evaluation Criteria
    1.2.2 Investigation of the Sequential Classifier
  1.3 Scope of the Thesis

II DEVELOPMENT OF MEASURES OF PROBABILISTIC DEPENDENCE
  2.1 Introduction
  2.2 Mutual Information in Feature Evaluation
  2.3 Development of Measures of Dependence
  2.4 On an Approximation of Mutual Information

III PROPERTIES OF MEASURES OF DEPENDENCE
  3.1 Introduction
  3.2 Properties
    3.2.1 Property 1
    3.2.2 Property 2
    3.2.3 Property 3
    3.2.4 Entropy Property of Maxima
  3.3 Relations Between the Measures
    3.3.1 Kolmogorov Dependence and Mutual Information
    3.3.2 Bhattacharyya Coefficient and Kolmogorov Dependence
    3.3.3 Joshi's Dependence, Kolmogorov Dependence, and Mutual Information
    3.3.4 Bhattacharyya Dependence, Joshi's Dependence, Mutual Information
  3.4 Conclusion

IV RELATIONS BETWEEN DEPENDENCE MEASURES AND ERROR PROBABILITY
  4.1 Introduction
  4.2 Bayes Error Probability for Feature Evaluation
  4.3 Dependence Measures and Error Probability
  4.4 Discussion
  4.5 Error Bounds
    4.5.1 Mutual Information and Error Probability
    4.5.2 Kolmogorov Dependence and Error Probability
    4.5.3 Bhattacharyya Dependence and Error Probability
    4.5.4 Joshi's Dependence and Error Probability
    4.5.5 Error Probability and an Approximation of Mutual Information
  4.6 Comparison of Error Bounds

V EXPERIMENTS AND DISCUSSION
  5.1 Introduction
  5.2 Description of Data, Preprocessing and Feature Extraction
  5.3 Description of Experiments
  5.4 Discussion of Results
  5.5 Computation
  5.6 Summary

VI DERIVATION OF SEQUENTIAL CLASSIFIER
  6.1 Introduction
  6.2 Sequential Classification
  6.3 Clustering
  6.4 Clustering Using Probability Distributions
  6.5 Description of Classifier
  6.6 Discussion of "Optimal" Tree
  6.7 Algorithm for Clustering
  6.8 Derivation of Measures of Confidence
  6.9 Derivation of Rules for Tentative Decisions
  6.10 Re-introduction of Classes
  6.11 Conclusion

VII EXPERIMENTS
  7.1 Introduction
  7.2 Data Base, Preprocessing and Feature Extraction
  7.3 Estimation of Probability Distribution Functions
  7.4 Experiments
    7.4.1 Experiment 1
    7.4.2 Experiment 2: Determination of Threshold
    7.4.3 Experiment 3: Investigation of Confidence in Decisions
    7.4.4 Experiment 4: Comparison of Tree Classifier and Independent Bayes Classifier
    7.4.5 Experiment 5: Tree Classifier Applied to All Fonts
  7.5 Discussion

VIII SUMMARY

REFERENCES
LIST OF ILLUSTRATIONS

3.1 Different types of features
4.1 Error probability as overlap
4.2 Dependence measures as overlap
4.3 Upper bound given by equation (4.22)
4.4 Lower bound for Kolmogorov dependence
4.5 A posteriori density function for two class problem
4.6 Different ranges for the a posteriori density function for the three class problem
4.7 Relations between probability of a correct decision and Kolmogorov dependence for a specific feature value, for M = 3
4.8 Upper bound on error probability in terms of Kolmogorov dependence for M = 3
4.9 Relations between probability of a correct decision and Kolmogorov dependence for a specific feature value, for M = 10
4.10 Upper bound on error probability in terms of Kolmogorov dependence for M = 10
4.11 Upper bounds on error probability in terms of Kolmogorov dependence for various values of M
4.12 Lower bounds on error probability in terms of the Bhattacharyya coefficient for various values of M
4.13 Equation 4.53 for various values of M
4.14 Upper bounds on error probability in terms of the Bhattacharyya coefficient for various values of M
4.15 Upper and lower bounds on error probability in terms of an approximation of mutual information for various values of M
4.16 Comparison of error bounds
4.17 Comparison of error bounds
6.1 Probability distributions as vectors
6.2 Tree classifier
6.3 Clustering example
7.1 Features used for recognition experiments
7.2 Error probability vs. overlap threshold
7.3 Error probability vs. confidence threshold
7.4 Storage vs. confidence threshold
7.5 Average number of measurements vs. confidence threshold
7.6 Error probability vs. storage
7.7 Error probability vs. average number of measurements
7.8 Error probability vs. number of features for independent Bayes classifier using one typewriter font
7.9 Error probability vs. number of features for independent Bayes classifier using all fonts

LIST OF TABLES

5.1 Spearman rank correlation coefficients between feature orderings
5.2 Rankings for the dependence measures
7.1 Spearman rank correlation coefficients between feature orderings
7.2 Performance of tree classifier using single font
7.3 Performance of tree classifier using all fonts

CHAPTER I
INTRODUCTION

1.1 Objectives of the Thesis

An important problem in statistical pattern recognition is the selection of sets of measurements to be used for pattern classification. In a practical situation the number of measurements that can be used for classification is constrained by computational complexity, available storage capacity, and cost of measurements. In this thesis, measurement selection (feature extraction) is examined using two different approaches.

The first objective of the thesis is to investigate measures of probabilistic dependence for the evaluation of features. Several measures are derived and are formulated in terms of feature class dependency. Relations between the measures and error probability are derived.
Experiments are performed to compare the feature evaluation capabilities of the measures to error probability and to each other.

The second objective of the thesis is to examine the storage and processing requirements of a fairly general decision tree type sequential classifier. In the decision process, features which discriminate between groups of classes, or clusters, are used. When an unknown pattern is to be classified, a feature measurement is used to select a subset of classes to which the pattern might belong. Based on the chosen cluster, another feature measurement is used to select a smaller subset of the first cluster. Features are sequentially measured and smaller sets of subclasses are chosen until a final classification is made. An algorithm is derived for constructing the decision tree using distance measures for clustering of classes. Because some classes are eliminated from consideration at each stage, it is useful to make the decisions tentative and reversible. The concept of confidence in a decision in conjunction with backtracking procedures is therefore introduced. Also, the idea of re-introducing pattern classes previously excluded at any stage is examined.

1.2 Justification for the Research

1.2.1 Dependence Measures as Feature Evaluation Criteria

When a classifier is constrained by storage limitations and processing capability, it is desirable to have the best subset of features available for the classification process. The best subset of features is normally the one that minimizes the probability of error. However, the probability of error obtainable using a particular classifier on some feature subset is not always known. For example, the probability of error for a k-nearest neighbour classifier, given a finite data set, is not known. In such cases the optimal feature subset is determined by estimating the error probability of each feature subset in turn by training and then testing the classifier using pattern samples. This exhaustive procedure requires excessive computation. For this reason, dependence measures and distance measures [1]-[5], which indicate the discriminating power or informativeness of feature subsets in a quantitative manner, are used to rank order features, thereby defining subsets to be used. Also, various other methods such as multidimensional transformations [6]-[9] and information measures [10] are used.
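To make the rank-ordering idea concrete, the sketch below estimates the mutual information between each discrete feature and the class labels from relative frequencies and orders the features accordingly. This is a minimal illustration rather than the procedure used in the thesis; the toy data, the function names, and the use of base-2 logarithms are assumptions made only for the example.

```python
import numpy as np

def mutual_information(feature_values, class_labels):
    """Estimate I(X, C) in bits for one discrete feature X and class labels C
    from relative-frequency estimates of P(X, C), P(X), and P(C)."""
    xs, x_idx = np.unique(feature_values, return_inverse=True)
    cs, c_idx = np.unique(class_labels, return_inverse=True)
    joint = np.zeros((len(xs), len(cs)))
    np.add.at(joint, (x_idx, c_idx), 1.0)
    joint /= joint.sum()                      # P(X_k, C_i)
    px = joint.sum(axis=1, keepdims=True)     # P(X_k)
    pc = joint.sum(axis=0, keepdims=True)     # P(C_i)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ pc)[nz])).sum())

def rank_features(feature_matrix, class_labels):
    """Return feature indices ordered from most to least informative."""
    scores = [mutual_information(feature_matrix[:, j], class_labels)
              for j in range(feature_matrix.shape[1])]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Toy example: three binary features measured on eight labelled patterns.
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0],
              [1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
print(rank_features(X, y))   # feature 0 separates the classes perfectly
```

Any of the dependence measures developed in Chapter II could be substituted for the mutual information estimate without changing the ranking step.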
F o r a n i n t r o d u c t i o n t o t h e s e and o t h e r c r i t e r i a , t h e r e a d e r i s r e f e r r e d t o t h e r e f e r e n c e s [11] - [ 1 5 ] . I n c l a s s i c a l s e q u e n t i a l c l a s s i f i c a t i o n schemes [ 1 1 ] , [16] t h e 3 o r d e r i n w h i c h f e a t u r e s a r e e x a m i n e d i n a r r i v i n g a t d e c i s i o n s h a s an e f f e c t o n t h e number o f f e a t u r e measurements r e q u i r e d t o make a d e c i s i o n w i t h some s p e c i f i e d e r r o r p r o b a b i l i t y . By u s i n g t h e most \" i n f o r m a t i v e \" f e a t u r e s f i r s t t h e c l a s s i f i c a t i o n p r o c e s s c a n b e t e r m i n a t e d q u i c k l y , u s i n g few f e a t u r e s o n t h e a v e r a g e . The m e a s u r e s o f d e p e n d e n c e i n t h e t h e s i s c a n b e u s e d as a n i n d e x o f t h e i n f o r m a t i o n t h a t f e a t u r e s c o n t a i n a b o u t c l a s s e s and w o u l d t h e r e f o r e be a p p l i c a b l e t o t h e s e q u e n t i a l c l a s s i f i c a t i o n p r o b l e m . 1 . 2 . 2 I n v e s t i g a t i o n o f t h e S e q u e n t i a l C l a s s i f i e r Two t y p e s o f s e q u e n t i a l p a t t e r n c l a s s i f i c a t i o n schemes h a v e b e e n i n v e s t i g a t e d by v a r i o u s w o r k e r s . The f i r s t t y p e o f c l a s s i f i e r i s b a s e d o n s e q u e n t i a l a n a l y s i s [ 1 6 ] . The c o s t s o f m e a s u r e m e n t s and t h e c o n c e p t o f r i s k a r e u s e d . T h i s scheme has b e e n a p p l i e d t o p a t t e r n r e c o g n i t i o n b y F u a n d h i s a s s o c i a t e s , C a r d i l l o , C h i e n , Chen,Kand N i k o l i c [17] - [ 2 5 ] . F u r t h e r w o r k has b e e n done b y H u s s a i n [ 1 6 ] . F o r t h i s m e t h o d , a measurement i s t a k e n , and a n a n a l y s i s i s p e r f o r m e d t o d e t e r -m i n e w h e t h e r t h e e x p e c t e d r i s k o f c o n t i n u i n g b y t a k i n g a n o t h e r m e a s u r e -ment o r t h e r i s k o f m a k i n g a d e c i s i o n i s g r e a t e r . The c o m p u t a t i o n r e q u i r e m e n t s and s t o r a g e r e q u i r e m e n t s a r e f a i r l y h i g h . The p r o c e d u r e i s m o s t u s e f u l when c o s t o f c o m p u t a t i o n and s t o r a g e a r e s m a l l when compared t o t h e c o s t o f e x t r a c t i n g more m e a s u r e m e n t s . A n e x a m p l e w o u l d o c c u r i n m e d i c a l d i a g n o s i s when a n e x t r a measurement m i g h t r e q u i r e a s u r g i c a l o p e r a t i o n . The s e c o n d t y p e o f c l a s s i f i e r i s t h e d e c i s i o n t r e e t y p e s e q u e n t i a l c l a s s i f i e r . F o r t h i s t y p e o f c l a s s i f i e r t h e r e c o g n i t i o n p r o c e s s i s t r e a t e d as a h i e r a r c h y o f sub p r o b l e m s p r o c e e d i n g f r o m a c o a r s e g e n e r a l r e c o g n i t i o n t o a v e r y s p e c i f i c r e c o g n i t i o n . G i v e n a 4 l a r g e number o f c l a s s e s , a measurement i s f o u n d w h i c h g r o u p s t h e c l a s s e s i n t o a few c l u s t e r s . These c l u s t e r s a r e e x a m i n e d and c l u s t e r d e p e n d e n t measurements a r e d e f i n e d t o d i v i d e them i n t o s m a l l e r g r o u p s . The p r o -c e s s c o n t i n u e s u n t i l a l l c l u s t e r s o f p a t t e r n c l a s s e s a r e s p l i t i n t o s i n g l e c l a s s e s . 
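The decision process of the tree type classifier described above can be pictured with a small sketch: each node stores a cluster of candidate classes together with one cluster-dependent feature test, and classification descends from the root until a single class remains. The node layout, thresholds, and class names below are illustrative assumptions, not a scheme taken from the cited references.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    classes: List[str]                  # cluster of classes still under consideration
    feature: Optional[int] = None       # feature measured at this node (None at a leaf)
    threshold: float = 0.0              # split point for the measured value
    low: Optional["Node"] = None        # child cluster if measurement <= threshold
    high: Optional["Node"] = None       # child cluster if measurement > threshold

def classify(root: Node, pattern: List[float]) -> str:
    """Descend the tree: each measurement selects a smaller cluster
    until a single class remains."""
    node = root
    while node.feature is not None:
        value = pattern[node.feature]
        node = node.low if value <= node.threshold else node.high
    return node.classes[0]

# Illustrative four-class tree: feature 0 splits {A, B} from {C, D},
# then feature 1 separates the classes within each cluster.
tree = Node(
    classes=["A", "B", "C", "D"], feature=0, threshold=0.5,
    low=Node(classes=["A", "B"], feature=1, threshold=0.3,
             low=Node(classes=["A"]), high=Node(classes=["B"])),
    high=Node(classes=["C", "D"], feature=1, threshold=0.7,
              low=Node(classes=["C"]), high=Node(classes=["D"])),
)
print(classify(tree, [0.8, 0.9]))   # -> "D": two measurements suffice
```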
When good f e a t u r e s a r e d e f i n e d t h e c l a s s i f i e r i s a t t r a c t i v e b e c a u s e few measurements a r e r e q u i r e d t o \"home i n \" o n a p a t t e r n . B e c a u s e o f t h i s , t h e r e c o g n i t i o n p r o c e s s c a n be r a p i d . S t o r a g e r e q u i r e -ments a r e k e p t down b e c a u s e c l u s t e r p a r a m e t e r s , n o t i n d i v i d u a l c l a s s p a r a m e t e r s , a r e s t o r e d f o r e a c h node o f t h e t r e e . The u s e f u l n e s s o f t h e t r e e c l a s s i f i e r i s u n d e r l i n e d b y t h e f a c t t h a t s e v e r a l o f t h e IBM c h a r a c t e r r e c o g n i t i o n schemes [27] - [29] employ some t y p e o f t r e e c l a s s i f i c a t i o n scheme. S e v e r a l r e s e a r c h e r s s u c h as N a d l e r [30], [31] and G l u c k s m a n [32] h a v e e x a m i n e d d e c i s i o n t r e e x t y p e c l a s s i f i e r s . A d e c i s i o n t r e e c l a s s i f i e r i s an o p t i o n i n Sammon's OLPARS s y s t e m [33], [34]. The p r o b l e m w i t h t h i s t y p e o f c l a s s i f i e r i s , t h a t o n c e a d e c i s i o n i s made a t any n o d e , c l a s s e s a r e e l i m i n a t e d f r o m f u r t h e r c o n s i d e r a t i o n . B e c a u s e t h e p a t t e r n s e x a m i n e d a r e n o i s y , t h e m e a s u r e -ments e x t r a c t e d a r e random v a r i a b l e s . B e c a u s e m i s t a k e s c a n be made, i t w o u l d be u s e f u l t o make t h e d e c i s i o n s a t any node r e v e r s i b l e . P r o c e d u r e s f o r r e v e r s i n g d e c i s i o n s u s i n g t h e i d e a o f c o n f i d e n c e i n a d e c i s i o n and t h e i d e a o f b a c k t r a c k i n g a r e d e r i v e d and i n v e s t i g a t e d i n t h i s t h e s i s . A l s o t h e i d e a o f r e i n t r o d u c i n g c l a s s e s d e e p e r i n t h e t r e e i s i n v e s t i g a t e d . I n c o n s t r u c t i n g t h e t r e e , t h e i d e a o f c l u s t e r i n g u s i n g d i s -t a n c e m e a s u r e s i s u s e d . T h i s i s done t o b u i l d a t r e e w h i c h u s e s l i t t l e 5 storage and which defines c l u s t e r dependent measurements at each node of the tree. The various decision schemes are experimentally evaluated to determine storage requirements and processing requirements. 1.3 Scope of Thesis In chapter I I , the use of mutual information as a feature evaluation c r i t e r i o n i s b r i e f l y reviewed. Other dependence measures are then derived from distance measures. In chapter I I I the properties of the dependence measures are examined with respect to the feature evaluation problem. Also r e l a t i o n s between the measures are i n v e s t i -gated. In chapter IV the r e l a t i o n s between the d i f f e r e n t dependence measures and Bayes error p r o b a b i l i t y are described. The measures are . re l a t e d to er r o r p r o b a b i l i t y through bounds. Experiments to show the effectiveness of the d i f f e r e n t dependence measures are described i n Chapter V. The measures are compared both with each other and Bayes error p r o b a b i l i t y by using them to rank a set of features extracted from Munson's handprinted data. In chapter VI a f a i r l y general decision tree c l a s s i f i e r i s derived. The idea of c l u s t e r i n g i n p r o b a b i l i t y space using distance measures i s developed. An algorithm for defining a decision tree f o r a given set of features i s described. 
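As an informal illustration of clustering in probability space with a distance measure, the sketch below repeatedly merges the pair of class-conditional distributions (or cluster representatives) with the smallest Bhattacharyya distance. The greedy merging rule, the use of averaged distributions as cluster representatives, and the example distributions are assumptions made only for illustration; they are not the algorithm derived in Chapter VI.

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """B = -log sum_k sqrt(p_k * q_k) for two discrete distributions."""
    return -np.log(np.sum(np.sqrt(p * q)))

def cluster_distributions(dists, n_clusters):
    """Greedy agglomerative clustering of class-conditional distributions.
    Each cluster is represented by the average of its member distributions."""
    clusters = [[i] for i in range(len(dists))]
    reps = [np.asarray(d, dtype=float) for d in dists]
    while len(clusters) > n_clusters:
        # find the closest pair of cluster representatives
        i, j = min(((a, b) for a in range(len(reps)) for b in range(a + 1, len(reps))),
                   key=lambda ab: bhattacharyya_distance(reps[ab[0]], reps[ab[1]]))
        reps[i] = (reps[i] + reps[j]) / 2.0     # representative of the merged cluster
        clusters[i] += clusters[j]
        del clusters[j], reps[j]
    return clusters

# Four classes described by their distributions over four feature values:
# classes 0 and 1 are similar to each other, as are classes 2 and 3.
P = [[0.70, 0.20, 0.05, 0.05],
     [0.60, 0.30, 0.05, 0.05],
     [0.05, 0.05, 0.60, 0.30],
     [0.05, 0.05, 0.30, 0.60]]
print(cluster_distributions(P, 2))   # -> [[0, 1], [2, 3]]
```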
The concept of confidence i n a decision i s developed. A backtracking procedure i s described f o r allowing changes i n decisions i n the tree. Also a method of reintroducing classes at deeper l e v e l s of the tree i s discussed. In chapter VII various experiments f o r evaluating the sequential c l a s s i f i c a t i o n scheme are described. The experiments are performed using the C o r n e l l data set. The c l u s t e r i n g c r i t e r i o n i s i n -6 vestigated. The storage and processing requirements for various tree configurations are investigated. Chapter VIII contains a summary of the r e s u l t s , and suggestions f o r further research. 7 I I . DEVELOPMENT OF MEASURES OF P R O B A B I L I S T I C DEPENDENCE 2 . 1 I n t r o d u c t i o n I n t h i s c h a p t e r , t h e u s e o f t h e p r o b a b i l i s t i c d e p e n d e n c e m e a s u r e m u t u a l i n f o r m a t i o n f o r f e a t u r e e v a l u a t i o n i s b r i e f l y r e v i e w e d . The r e l a t i o n s h i p b e t w e e n m u t u a l i n f o r m a t i o n and d i s c r i m i n a t i o n i n f o r m a t i o n [35] i s n o t e d . By a n a l o g y some m e a s u r e s o f p r o b a b i l i s t i c dependence a r e t h e n d e r i v e d f r o m o t h e r d i s c r i m i n a t i o n m e a s u r e s . Some g e n e r a l i z a t i o n s o f t h e p r o p o s e d m e a s u r e s a r e p r e s e n t e d . 2 . 2 M u t u a l I n f o r m a t i o n i n F e a t u r e E v a l u a t i o n M u t u a l i n f o r m a t i o n i s a w e l l known m e a s u r e o f p r o b a b i l i s t i c d e p e n d e n c e b e t w e e n p a i r s o f random v a r i a b l e s w h i c h h a s b e e n a p p l i e d t o f e a t u r e e v a l u a t i o n i n p a t t e r n r e c o g n i t i o n [36] - [ 4 3 ] . F o r d i s c r e t e random v a r i a b l e s m u t u a l i n f o r m a t i o n i s g i v e n b y M L P ( X , , C . ) I ( X , C ) = f I P (X, , C . ) l o g 1 1=1- k = l K 1 P ( X k ) P ( C . ) ( 2 . 1 ) w h e r e P ( X k , C 1 ) i s t h e p r o b a b i l i t y t h a t X = and C = C ^ F o r f e a t u r e e v a l u a t i o n , t h e random v a r i a b l e X r e p r e s e n t s a f e a t u r e s u b s e t and C r e p r e s e n t s t h e c l a s s e s ; [C h a s M p o s s i b l e v a l u e s and X has L p o s s i b l e v a l u e s - „ The j u s t i f i c a t i o n f o r u s i n g f e a t u r e c l a s s d e p e n d e n c e f o r e v a l u a t i n g f e a t u r e s i s as f o l l o w s . Any f e a t u r e s u b s e t , X , c o n t a i n s i n f o r m a t i o n a b o u t t h e c l a s s e s , C . I f t h e i n f o r m a t i o n t h a t t h e f e a t u r e s u b s e t has a b o u t c l a s s e s i s h i g h , t h e n we a r e f a i r l y c e r t a i n o f w h i c h c l a s s i s p r e s e n t when a measurement o f t h e f e a t u r e i s t a k e n on an u n -known p a t t e r n . I f t h e i n f o r m a t i o n t h a t a f e a t u r e s u b s e t has a b o u t a 8 clas s i s low, then we are uncertain about which class i s present when a measurement of the feature i s taken on an unknown pattern. High mutual information or dependence between features and classes corresponds to low p r o b a b i l i t y of error, and low mutual information corresponds to high p r o b a b i l i t y of error . I t i s shown below that f o r a feature set that gives perfect d i s c r i m i n a t i o n between a l l classes, mutual information between features and classes reaches a maximum given by the entropy H(C) where M H(C) = - I P(C ) log P(C.). 
(2.2) i=-l Proof M L P(X, ,C.) i(x,o = y y p(x.p.)io g * 1 i = i k=i ' pcypcc^ ) M L 1=1 k=l M L - 1 1 PO^.C ) log P(C ) . i = l k=l For perfect d i s c r i m i n a t i o n , each possible value of a feature i s mapped in t o one c l a s s . That i s P(C_^/X^) = 1 for some cla s s i P(C j/X k) = 0 for a l l j M 9 Since ^ X log X = 0, M L I(X,C) = - I I ?(\\,C±) log P(C.), i = l k=l M = - I P(C ) log P(C ). Q.E.D. i = l For a feature set which does not give any d i s c r i m i n a t i o n between any classes, the mutual information can be shown to be zero, by s i m i l a r reasoning. For feature sets i n between the two extremes, mutual i n f o r -mation takes on values between zero and H(C). We see that a maximum value of mutual information corresponds to zero error p r o b a b i l i t y and the minimum value corresponds to maximum error p r o b a b i l i t y . I t should be stressed that, i n between the two extremes, there are i n general no exact r e l a t i o n s h i p s between mutual information and error p r o b a b i l i t y . However, a d e f i n i t e association, which i s examined i n the next chapter, e x i s t s between the measure and error p r o b a b i l i t y . Because of t h i s association and because of the i n t u i t i v e concept of informativeness, the measure i s u s e f u l for evaluating features. 2.3 Development of Measures of Dependence To derive other measures of dependence which can s i m i l a r l y be applied to feature evaluation, we examine mutual information more cl o s e l y . I t i s shown below that the product P(X)P(C) has the prop-t i e s of a p r o b a b i l i t y d i s t r i b u t i o n when P(X,.C) i s a d i s c r e t e proba-b i l i t y d i s t r i b u t i o n . 10 F o r a n y k we d e f i n e M 1=1 and f o r any i we d e f i n e L P ( C ) = Y P ( X . , C ) . 1 k = l * 1 F o r t h e f u n c t i o n P ( X ) . P ( C ) e a c h e v e n t o f i n t e r e s t i s a s i n g l e s a m p l e p o i n t i n t h e p r o b a b i l i t y s p a c e X x C . The p r o b a b i l i t y o f a s p e c i -f i c e v e n t \\,C± i s d e f i n e d as P C X ^ . P C C ^ . B e c a u s e P ( X ) and P ( C ) a r e p r o b a b i l i t y d i s t r i b u t i o n s , t h e p r o d u c t f o r any k and i w i l l be p o s i t i v e . T h a t i s : P ( X f c ) . P ( C ± ) *0 .' A l s o , t h e p r o b a b i l i t y o f t h e c e r t a i n e v e n t i s one . T h a t i s : M L I I P ( X ^ ) P ( C ) = 1 . i = l k = l B e c a u s e e a c h e v e n t o f i n t e r e s t i s a s i n g l e s a m p l e p o i n t , t h e i n d i v i d u a l e v e n t s a r e m u t u a l l y e x c l u s i v e . The p r o b a b i l i t y o f t h e e v e n t X ^ C ^ o r t n e e v e n t xce w i l l D e g i v e n by P r ( X , C , o r X , , C ) = P(X ) . P ( C J + P ( X , ) P ( C ) . r a b d ' e a b d e • Hence t h e d i s t r i b u t i o n P(X)p.(c) o b e y s t h e a x i o m s o f p r o b a -11 b i l i t y distributions. Hence, another way of looking at mutual infor-mation i s to consider i t as a special case of discrimination information [35] where P(X,C) and P(X) P(C) are the probability distributions whose discrimination i s being measured. That i s , mutual information gives a measure of distance between P(X,C) and P(X)P(C). b i l i t y distributions P m(Z) and P n(Z), where Z is a random variable taking on T values, are: Some other distance measures [44], [45] between two proba-Kolmogorov variational distance K (Z) = - I I P (Z.) - P (Z.) | . ™* 2 j=l m J n 3 1 T (2.3) Divergence J (Z) = I [P (Z.) 
- P (Z.)] log mn 1 m i n i 6 3=1 T » (2.4) Bhattacharyya distance (2.5) where P (Z) = I P (Z.) P (Z.) 2, Mmns . L. m j n i J ' 3=1 (2.6) Matusita's distance 12 By s u b s t i t u t i n g pm(z.) = PCX^ C.) Pn(zj) = pcy p(c.) i n t o ( 2 . 3 ) , ( 2 . 4 ) , ( 2 . 6 ) , ( 2 . 7 ) we o b t a i n m e a s u r e s o f d e p e n d e n c e . We h a v e 1 M L V X ' C ) = \" I I |P(VC.) \" P C . ) V x' c ) = I I [P(\\,c ) - P(X.) p(c,)] l o g ' ( 2 - 9 ) i = l k-1 * 1 Tc i • p ( C i > BD(X,C) = - l o g PD(X,C), v2 . ( 2 . 1 0 ) w h e r e M L i %(X,C)=I • X [ P ( 3 C , C ) P ( X . ) P ( C , ) ] 2 , ,2 ( 2 . 1 1 ) i = l k = l K 1 k 1 M L i d ( X , C ) = [ I I ( / P Q L , C ) - ^ P ( X , ) P ( C . ) ) 2 j 2 : ( 2 . 1 2 ) i = l k = l k i The K o l m o g o r o v d e p e n d e n c e , ^ ( X , C ) , was o r i g i n a l l y a p p l i e d t o c o n t i n g e n c y t a b l e a n a l y s i s b y H o e f f d i n g [46] and was a p p l i e d t o f e a t u r e e v a l u a t i o n b y V i l m a n s e n [ 4 7 ] . The d i v e r g e n c e as a d e p e n d e n c e m e a s u r e , 13 Jp(X,C), was used by J o s h i to d e r i v e a q u a n t i t y analagous to c a p a c i t y [48] and w i l l . b e r e f e r r e d to a J o s h i ' s dependence. The Bhatta-charyya dependence, B D(X,C), and Matusita's dependence, d D(X,C), to the best of the author's knowledge, have not been p r e v i o u s l y d e r i v e d . KpCX.C), B D(X,C), J D(X,C) and d D(X,C) have not been a p p l i e d p r e v i o u s l y to feature e v a l u a t i o n i n p a t t e r n r e c o g n i t i o n . The above dependence measures are derived from the most w e l l known p a i r w i s e distance measures. However, any p a i r w i s e d i s t a n c e measure can be converted to a measure of p r o b a b i l i s t i c dependence by the proper s u b s t i t u t i o n . G eneralizations of d i s t a n c e , and hence dependence measures, are s t r a i g h t f o r w a r d . For example, we can see that Kolmogorov dependence and J o s h i ' s dependence are both of the form: M L M(X,C) = 1 1 [ P ^ . C . ) - P ( X k ) P ( C i ) ] fCPO^.C^.PCX^PCC.)) (2.12) 1=1 k=l where f ( . ) has the property i f P(X k,C.) < P(X k)P(C.) f < 0 P(X k,C.) > P(X k)P(C.) f > 0 For K ^ X . C ) , f ( . ) i s given by signum (PCX^C.) - P(X k) P ( C . )) 1, X > 0 where signum (X) ={ } - 1, X < 0 14 U s i n g t h i s i d e a , some p o s s i b l e g e n e r a l i z a t i o n s a r e M L 2n / G D ( X , C ) = I- I [ P ( X ^ , C ) - P ( X k ) P ( C . ) ] n=2 ,3 ( 2 . 1 3 ) i = l k = l and M L G D ( X , C ) = I I [ P ( X k , C . ) - P ( X k ) P ( C . ) ] S x i = l k = l [ P ( X k , C . ) - P ( X k ) P ( C . ) ] 2 n + 1 n = 0 , l ( 2 . 1 4 ) 2 . 4 On an a p p r o x i m a t i o n o f M u t u a l I n f o r m a t i o n When m u t u a l i n f o r m a t i o n i s u s e d f o r e v a l u a t i n g f e a t u r e s i t i s common p r a c t i c e [ 1 4 ] , [42;J , [43] t o u s e a n a p p r o x i m a t i o n t o d e c r e a s e t h e amount o f c o m p u t a t i o n . I t i s assumed t h a t any one f e a t u r e by i t s e l f i s n o t e f f i c i e n t i n t h e r e c o g n i t i o n p r o c e s s . T h i s means t h a t t h e v a l u e o f t h e r a t i o i s f a i r l y c l o s e t o u n i t y . W i t h t h i s c o n d i t i o n , i t i s p o s s i b l e t o a p p r o x i m a t e t h e e x p r e s s i o n f o r l o g a r i t h m by t h e f i r s t t e r m o f i t s s e r i e s e x p a n s i o n . Hence we h a v e M L I ( X , C ) • jfc • I I P ( X , , C . ) [ P ( V C i ) - 1] . ( 2 . 
1 5 ) i = l k = l K 1 P ( X k ) T h i s m e a s u r e , b y i t s e l f , has some i n t e r e s t i n g p r o p e r t i e s w h i c h w i l l be i n v e s t i g a t e d l a t e r . 15 CHAPTER I I I PROPERTIES OF MEASURES OF DEPENDENCE 3.1 Introduction The measures of p r o b a b i l i s t i c dependence that were derived i n the previous chapter have properties that are s i m i l a r to the properties of mutual information. In t h i s chapter, the main properties are stated and are discussed with respect to the feature evaluation problem. Only the most common measures are examined for properties. In the following discussions i t i s assumed that the random variables X and C represent features and classes r e s p e c t i v e l y . I t i s also assumed that the number of pos s i b l e values of X i s greater than or equal to the number of possible values of C. That i s L £ M. Perfect d i s c r i m i n a t i o n can be achieved when the preceeding assumption i s made. Before looking at the properties of the dependence measures, d i f f e r e n t types of features w i l l be examined, mainly to get some ph y s i c a l idea of what p o s s i b i l i t i e s e x i s t . (To avoid v i s u a l problems one dimensional features are considered). A three c l a s s problem i s shown i n f i g u r e 3.1 with three d i f f e r e n t features, X, Y, and Z a v a i l a b l e for c l a s s i f y i n g unknown patterns. The cla s s c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n functions (PDF's) for the d i f f e r e n t features are shown i n the f i g u r e . The \"good\" feature, X, i s one for which there i s no overlap of the c l a s s c o n d i t i o n a l PDF's of any of the cla s s e s . When measurements are taken on unknown patterns using t h i s feature, perfect recognition of a l l three classes can be achieved. For the \"average\" feature, Y, there i s some overlap of the class c o n d i t i o n a l PDF's. If values of measurements on unknown patterns c P(X) ri x P ( Y ) Y C P ( Z ) c, a it Z1 5 z c Figure 3.1 Ui^fifetrea-t types of features 17 f a l l in regions r^ and r^, the patterns can be identified with complete certainty. If the values f a l l in r^, there w i l l be some probability of making an error in identification. For example, i f the measurement takes on value Y^, there is some probability that the unknown pattern is in class or in class C^. is more l i k e l y , but we are not really _L sure that i t is correct. The average error probability i s minimized i f the most lik e l y class is chosen in each case. If the \"average\" feature is used for classification, most unknown patterns w i l l be identified correctly but there w i l l be some mistakes. The \"useless\" feature, Z, is one for which the class conditional PDF's of a l l the classes are the same. For any measurement of Z, a l l classes have equal probability. Hence there is no information gained in using Z. This feature gives maximum error probability. Defining the measures as M^(X,C) and keeping the preceding ideas in mind, the main properties of the dependence measures w i l l now be stated and discussed. 3.2 Properties 3.2.1 Property 1 The measures are non negative . MpCX.C) :> 0 . (3.1) Proof It can be shown that I(X,C) ^ 0 (see [49] page 20 ) even though for some points i n feature class space the quantity P(X^,Ci) log p(xk/c.) can be negative. 
An examination of Kolmogorov dependence and p « V Joshi's dependence shows that, for any point in feature class space, each of these measures is greater than or equal to zero. Obviously the 18 sum over a l l points in feature class space w i l l be greater than or equal to zero. The Bhattacharyya coefficient Pp(X,C) is positive. But for the Bhattacharyya dependence to be positive, the coefficient must be less than or equal to one. For any point in feature class space I 1 [P(Xk,C.) P(X k) E(C.)]2 S | [PCX^C.) + P(X k)P(C.)] . This i s the well known geometric mean—arithmetic mean inequality [50]. Summing over a l l points in feature class space, we obtain KM L J. MM L M L I I [P(x k,c.)p(x k)p(c 1)]2 < |[ I I ?(\\,C±) + I I PCX^PCC.)] 1=1 k=l z i=l k=l i=l k=l = 1 . Therefore PD(X,C) < 1 , and B n > 0 . (3.2) Matusita's dependence is related to p^(X,C) by d2(X,C) = 2[1 - PD(X,C)] . Because PD(X,C) <_ 1 , 19 dD(X,C) >, 0. (3.3) 3.2.2 Property 2 A l l measures assume a minimum value of zero when random vari-ables X and C are s t a t i s t i c a l l y independent. Proof For a given P(X,C), a l l measures are zero only i f P(X,C) = P(X)P(C) for a l l points in feature class space-The preceding equality i s the definition of s t a t i s t i c a l independence between the random variables X and C. When s t a t i s t i c a l independence exists between X and C P(X k/C 1)P(C 1) = P(X k)P(C ±) for a l l k and i . Therefore P(xk/ci) = P(V , and P ( X k / C i ) = P ( X k / C j ) f o r a 1 1 k ' i» a n d 3-This corresponds to the \"useless\" feature case in figure 3.1. The condition of s t a t i s t i c a l independence can also be written as P(C i/X k) = P(C ±) for a l l i and k In words, measurement of feature X supplies no extra infor-mation above the a p r i o r i probability information, which is already known before the measurement of X is made. St a t i s t i c a l independence between a feature and the classes implies that the feature i s useless for aiding in classification and hence means that use of the feature for classification w i l l give maximum error probability. 20 3.2.3 Property 3 The measures are a maximum f o r a given set of a p r i o r i p r o b a b i l i t i e s , P(C), when each value of X i s associated with one value of C ( i . e . Pperfe.ct d i s c f i m i n a i i b n ^ d <) I t i s i n s t r u c t i v e to exa-mine each of the measures f o r t h i s property. Kolmogorov Dependence We can rewrite (2.3) as 1 M K^X.C) = j J P(C.) [A. + B.], where A i = t ^ W \" p ( xk ) ]» k i B =.l [P(X) - P(X, /C,)], i and = (k/PCJ^/C.) > P(X k)} kj = ik/p^) > P(xk/c.)} . Because P(X/C i) and P(X) are p r o b a b i l i t y d i s t r i b u t i o n s , A. = B.. x x Therefore we can redefine Kp(X,C) by M KpCX.C) =-lHC±) J f [P^/C.) -PO^)]. (3.4) 21 U s i n g t h e . v i d e n t i t y M we o b t a i n , f r o m ( 3 . 4 ) , M M K ^ X . C ) = I P ( C . ) c> = .X I / ^ . ^ ^ i o g ^ / c , ) . - ,1 |L,p(v^i| P < V 1=1 k=l ° tc 1' 1=1, k=l •.; T . . - = I f -M L M L -I I P(X k)P(C 1)logP(X k/C.) -:- I I P(X k)P(C.)log P(xk), i=l k=l i=l k=l : M L L M =1 P(C ±)[ X P(X k/C.)log P C V 0 ! ) \" I I P ( V C i ) P ( C i ) l o g i=l k=l k=l j=l J J V(\\/C±)]. (3.13) The f i r s t part of the expression (3.13) i s f i n i t e but the second part becomes in f i n i t e i f V(X^/C±) = 0 and V(X^/C^) ± 0 for any j ± i . Therefore the maximum i s given by J (X,C) = ». 
<3.14) MAX 3.2.4 Entropy Property of Maxima It was shown in Section 2.2 that for perfect discrimination, mutual information reached a maximum given by the entropy M W X ' C ) = -.I P(C.)logP(C.). 1=1 2 4 In S e c t i o n 3.2.3 i t was shown that M (X,C) = I P(C )[1-P(C.)] , MAX i = l 1 1 \\ d (x,c) = [2l P ( c , ) [ i - v^TcT)] 2. MAX i = l 1 1 Because p (X,C) < 1, the s e r i e s expansion f o r l o g X can be used MAX as f o l l o w s : l o g X = (X- l ) - | ( X - l ) 2 + - j ( X - l ) 3 + .. f o r 0 < X < 2 MAX ( l - p D ) + f ( l - P D ) 2 + j ( 1 - P D ) 3 P = I F Therefore 00 M B (X,C) = I P(C )[l-v^7c7 ) ] ] P . - (3.15) u t> u MAX p=l v i = l We see that these maxima are of the form M 4( I P ( C . ) - f ( P ( C . ) ) , i = l 25 where , P D(X,C) = 1 - d 2 (X,C) . 2 3.3.3 Joshi's dependence, Kolmogorov dependence, and Mutual Information. Joshi's dependence consists of two terms, one being mutual information and the other being the r e l a t e d quantity M L P(X.)P(C.) I A(X,C) = P j x - l^V(Xcf)P(Q;U-5.gC- / r / • (3.20) . - i=i k=i k - - -Vc ; We can r e l a t e I\" (X,C) to K^XjC) by the same r e l a t i o n s h i p s that were noted i n 3.3.1 But J D(X,C) = I(X,C) + I~(X,C) . Therefore, J D(X,C) i s rela t e d to K^XjC) by the following i n e q u a l i t i e s , J D(X,C) ^ 4 (X,C) - 2 log ( 1 + ^ ( X i C ) ) : , . ( 3 . 2 1 ) J D(X,C) >, 4 K ^ + (X,C), ( 3 . 2 2 ) II+£K$XS:G) 4.K n(x,c) V x> c> * . 2 log [ i-n^x.-cy] - ! + \\ ( x,o - <3-23> 27 Also, because l ' (X,C) 5 0, J D(X,C) > I(X,C) . (3.24) 3.3.4 Bhattacharyya dependence, Joshi's dependence, Mutual Information Substituting into the i n e q u a l i t i e s of Hoeffding and Wolfowitz [57] we obtain 2 B D(X,C) <; I (X,C) , (3.25) and 2B D(X,C) s= I'(X,C) . (3.26) Hence 4BDc) = I nt(1 - I P ( c i ) I ^(vc.pa)} u p=l p 1=1 1 k=l k 33 M L •I 7 [ J I [ p ( y c ) - / P i y c ^ l V p=l i = l k=l For any class C., the dependence measures I ( X , C ) , K J J C X . C ) , J D(X,C), d D(X,C) give ar.direct~ind±ca, log n + n(n+1) (log ^ ) (P £ - ' V \" (4.15) Using the Fano bound for H(C), we obtain: I(X,C) $ H(PM,1-PM) + (l-P M)log(M-l) + log n + n(ti+1) (log ^ ) (P g-(4.16) which can easily be rearranged to obtain the desired result. Lower bounds on error probability in terms of mutual information have been derived by Chu and Chueh [53]. They are f a i r l y complicated. 4.5.2. Kolmogorov Dependence and Error Probability A simple bound on error probability in terms of K^XjC) 37 i s derived below. Theorem 4.3 For equal a p r i o r i p r o b a b i l i t i e s P(C.) = ^ , i = 1....M r e * i \" | \" MTI VX , C ) <4- 1 7> Proof: I t i s known that the p r o b a b i l i t y of a c o r r e c t d e c i s i o n , P , i n a Bayes c l a s s i f i e r i s P c = I max'{PCXj^.C^, P(X k,C 2) ... . POL^C^) } . (4.18) k=l Also max { P ( X k , C 1 ) , P ^ , ^ ) , ... PCX^C^)} >, (4.19) M max {P(X k,C. ), ^ I P(X k,C )} j = l f o r a l l i . But, f o r any two numbers a,b max {a,b} = ^ (a+b) + ^ |a-b|. (4.20) By s u b s t i t u t i n g i n t o (4.20), and using (4.18) and (4.19) we obta i n L M , L , M P ^ — c \\ l [P(x^,c.) + ^ I P ( X k , c )] + \\ I \\H^,c.) XP(VC k=l j ^ i J k=l J T 4 ! j = l j = l (4.21) 38 This i s true f o r a l l i . Summing over the M d i f f e r e n t classes we obtain MP £ c . L M . . L M M k=l i = l k=l i = l 2f± L M 1 M + l A A l p ( x k ' c i > \"Mr! I P < W k=l i = l j f i j=l P c M 2M M L , M I i in^c.) i P C ^ . C >|. 
i = l k=l 3 f i 3=1 P < 1 -e \" M 2M x x M L M i = i k=i k 1 M iH 3 3=1 For equal a p r i o r i p r o b a b i l i t i e s , , M L i M P * 1 - - -6 2M2 1=1 k=l * A r \" j f i j=l • Vilmansen [47] has shown that for equal a p r i o r i p r o b a b i l i t i e s M , M L 1 M * » ™ • Jx Ji | p ( V C i ) - - P w j=i 1 1 M M-l KpCX.C). Q.E.D. 39 The bound is plotted in figure 4.3. For M = 2, there i s a direct relationship. That is Pe = \\ ~ V X ' C ) ' As the number of classes is increased the bound gets more loose. It holds with equality only for the point at which K^(X,C) = 0. A simple lower bound on error probability in terms of Kolmogorov dependence is derived below. Theorem 4.4: For equal a p r i o r i probabilities P(C.) = | , i = 1....M Pe * V X ' C ) • Proof: For equal a p r i o r i probabilities L M yx,o = y I p» P ( C 2 / X k } ' P ( V \\ ) } ' ( 4 ' 2 3 ) k P /X. is the probability of a correct decision in a Bayes c l a s s i f i e r for a given X^. Then, for any X^ we now show that 6 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.S0 0,90 J.00 KOLMOGOROV DEPENDENCE Figure 4.3 Up.p'.ejriib.ou'ndJ given by equation (4.22) 41 By using the arguments of Section 3.2.3 for Kolmogorov depend-ence, we can redefine K = I [P (c ±/x k) -where i ' = { i / P ^ / X ^ > -}.. One of the terms of this summation is P - i c/x k M ' K ^ p,y - h • (4-25) The probability of a correct decision in Bayes classification is given by 4.18. This can be rewritten as L P„ = I PC^) max {P(C 1/X k), P(C 2/X k), ... P(W}' (4'26) k=lBy averaging (4.25) over a l l X^, k=l...L we obtain \\ l p(V l | P ( C . / V -1| * I p = 0 . To obtain some f e e l i n g for the re l a t i o n s h i p s between Kolmogorov dependence and err o r p r o b a b i l i t y , a closer examination i s made of r e l a t i o n s i n the a p o s t e r i o r i density function space. We consider equal a p r i o r i p r o b a b i l i t i e s and various values of M. Upper bounds on e r r o r p r o b a b i l i t y are obtained making extensive use of the quantities K and P , which *k C / X k are defined i n (4.23) and (4.25). 43 0 0 JO 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 i.00 KOLMOGOROV DEPENDENCE Figure 4.4. Loi^rrtpo.un'd 'for Kolmogorov dependence 44 Case 1: M=2 • '.—i—•— From f i g u r e 4.5 i t i s e a s i l y seen t h a t , f o r any K = P , -\\/2 *k C / X k 2 ' ^ - ^ c / ^ - I ^ i \" ( 1 - P c / x k ) ] Averaging over a l l X^ we obt a i n y x . o = PLC - \\ , P e = i - V X ' C ) Case 2: M=3 We consider d i f f e r e n t ranges f o r the q u a n t i t y P. , and de-c / x k termine r e l a t i o n s between P„, and K f o r the ranges. e / x k *k We can define K =Pc/,x.-^+V (4.28) x^ M where V i s the summation of a l l terms (P(C/X k) - j^) which are grea t e r than 0, ex c l u d i n g the .term ^ g ^ ^ ~ |[* W e s e e t h a t V 5 ®' For any X^ we f i r s t examine the case f o r which P ( C / X ) k o 1- P c F i g u r e 4.5 A p o s t e j r i o x i s ; d e n s i'ity^ f^ct±o . n i f p r ftwo^Glasgaprotlem 46 V3 P (C/X ) 2 73 O c 2/3. P(C/X) V2 V3 o *\\—. C Figure 4.6 Different ranges for the a posteriori density 47 T h i s means t h a t b o t h r e m a i n i n g PCC/X^.) a r e l e s s t h a n - j . T h a t i s V=0. T h e r e f o r e , Now we c o n s i d e r t h e r e g i o n 2 c / V 3 -2 1 As P _ , v a r i e s f r o m -- t o -r-, t h e maximum v a l u e o f K^. 
o c c u r s c / x k 3 2 \\ when o n l y one P C C / X ^ ) o t h e r t h a n ^ 5 / ^ l s S r e a t e r t n a n ^ ' H e n c e , we h a v e t h e f o l l o w i n g r e l a t i o n M The l a s t r a n g e we c o n s i d e r i s We c a n see t h a t I 1 Hence V * N P ~ max * c / x ^ i M = 2 P c j / x t - f . (4.31) 48 We have r e l a t i o n s between the quantities K and P , for the allowable *k C / X k range of P. , (1/3,1). The r e l a t i o n s are summarized i n figure 4.7. To C / X k obtain bounds on error p r o b a b i l i t y i n terms of K^(X,C) we must average over a l l X^. We must use a bound which i s v a l i d over the whole range of P , . c / x k Equation (4.31) i s true for the whole range of P , C / X k From figure 4.7 i t i s simple to derive a second r e l a t i o n s h i p which holds over a l l values of P ^ ^ . The i n e q u a l i t y i s given by the s t r a i g h t l i n e through the points (1/2,1/3) and (1,2/3). The i n e q u a l i t y i s K S|P,, . (4-32) *k 3 C / X k By averaging (4.31) and (4.32) over a l l X^. we obtain K^X.C) $ 2 ? R - | , (4.33) KD(X,C) S | P £ • (4-34) Substituting^ P = 1 - P e c i n t o (4.33) and (4.34) we obtain the following upper bounds on e r r o r pro-b a b i l i t y . v f - l v ^ > (4-35) - V p e * I - | K D ( X , C ) , (4.36) 49 Figure 4.7 R e l a t i o n s between p r o b a b i l i t y of a c o r r e c t d e c i s i o n and Kolmogorov dep^encience f o r a s p e c i f i c feature v a l u e , f o r M=3 50 Equation (4.35) i s the bound given by (4.17). The r e s u l t s are summarized i n figure 4.8. We see that the bound i s tig h t e r when K^XjC) i s small, and that i t i s an equality for K = 0. U 2 i s tig h t e r when IC^XjC) > It i s a n l equality when K^XjC) i s a maximum. The best bound i s a combination of the two derived bounds. I t i s given by P g <: min {U-^IL,} (4.37) Case 3: M a r b i t r a r y By examining the a p o s t e r i o r i p r o b a b i l i t y space and using the ideas of the previous sections we can derive upper bounds on erro r pro-b a b i l i t y for M classes. For the range P ^ 5 1 - ^ , K = 1 — P . . \\ C / X k For the range -| :? P,,/^ 1 \" M K * 1 \" h x, M k For the range | ( 1 - I) t P ^ S \\ K < 2(P , - h \\ c / x k 51 Figure 4 .8 Upper bound xm,, e r r o r p r o b a b i l i t y i n terms of Kolmogorov dependence f o r M=3 52 For the range ± < \\{1 - |) , K ' * 1 \" ^ For the range - ( 1 - -) * 3 1„ • l x . „ < 1 For the range — \\$ P , £ / 1 \\ /n l x J c / x i CT-T-) (1 - TT) k j-1 M K $ 1 - J • (4.38) Xk. For the range 1 ( 1 - ± ) $ P ^ * ^ , V c / x k The range i s v a r i e d by incrementing j u n t i l r M ; M ' that i s j = M-l. The r e s u l t of a l l t h i s i s summarized by the f i g u r e s 4.9 and 4.10 where r e l a t i o n s h i p s and e r r o r bounds are shown f o r M=10. 53 l i g u r a k.9 Figure A.9 R e l a t i o n s between p r o b a b i l i t y of a c o r r e c t d e c i s i o n and Kolmogorov dependence f o r a s p e c i f i c feature value, f o r M=10 Figure 4.1.0 gure 4.10 Upper bound on e r r o r p r o b a b i l i t y i n terms of Kolmogorov dependence f o r M=10 55 The upper bound in figure 4.10 is given by the minimum of a l l the nine possible bounds which result from figure 4.9. It can be seen that upper bounds on error probability for any number of classes can be generated. In figure 4.11 upper bounds are plotted for the values M = 2, 3, 5, 10, 20, 0 0. We see how the bound varies for the different values of M. The previous discussion leads to the following theorem. 
Theorem 4.5 For equal a p r i o r i probabilities P(C.) = | i = 1, 2, M for the range 1 T, 1 — T T * p - ^ ~ > n + ± c n n - 1 „ n or $ P S — — T ' n e n + 1 where n is an integer M(l - KTX,C)) - n p < — + . e v n(n + 1) n Proof: For the range n ~ 1 < P < n n \" e v n + 1 the upper bound on error probability is a straight line of the form y = ax + b . 56 Figure 4„11 Figure 4.11 Upper boundsjon e r r o r p r o b a b i l i t y i n terms of Kolmogorov dependence f o r various values of M 57 Two points on the l i n e are n - 1 y l = _ ~ <1 \" V X ' C ) = 1 \" £' n 71 = i T + T ' ,n + 1 N x = 1 - (^ T-^ X / M Substituting i n t o the equation, we obtain n - 1 ,, n,. , = a ( l - -) + b , = a ( l — ) + b-n + 1 M-We solve for a and b, and get -M a n(n + 1) ' M - n , n - 1 b = _,_< , + n(n + 1 ) n Hence, i t follows that the piecewise bound i s given by e n(n + 1) D v ' n(n + 1 ) n 5 8 This can be rearranged to . M[l - K (X,C)] - n P * 1 a . n + \" ~ e n(n + 1 ) n Q.E.D. 4 . 5 . 3 . Bhattacharyya Dependence and Err o r P r o b a b i l i t y Using the i n e q u a l i t i e s KJJCX.C) >, 1 - P D ( X , C ) , ( 4 . 4 0 ) K ^ X . C ) $ dp(X,C) ? ( 4 . 4 1 ) we can, by s u b s t i t u t i n g f or K ^ ( X , C ) , i n the r e s u l t of theorem 4.3, obtain the following error bounds i n terms of Bhattacharyya c o e f f i c i e n t and Matusita's dependence. (The bounds are v a l i d for equal a p r i o r i p r o b a b i l i t i e s . ) pe * 1 - s - i r h : ( 1 - p D ( x ' c ) ) » that i s and M - M 1 d 2 (X,C) P $ 1 - 1 - -2— - (4.43) e M 2 Using the i n e q u a l i t i e s ( 3 . 1 9 ) and ( 4 . 2 7 ) we can bound err o r p r o b a b i l i t y with p (Xy.C) and d D(X,C). The r e s u l t s are 59 P e 5 1 - | - / l - p£(X,C) , (4.44) c7CX,C) 2 2 P e , l - i - / - l i - ~ ( - l - - f ) (4.45) The piecewise linear upper bounds on error probability in terms of Kolmogorov dependence which were derived in the previous section can a l l be used to generate upper bounds for Bhattacharyya dependence and Matusita's dependence for the equal a p r i o r i case. However, we use a direct approach to obtain other error bounds. An interesting relationship between error probability and Bhattacharyya dependence is derived,tcfore. Theorem 4.6 For equal a.priori probabilities P(C.) = j-j i = 1,2,...M, pT,(X,C) S — A - P +,/^T^ ^ ~ D v f t e V e Before proving this theorem the following result is needed. For the set of numbers a^, i = l,2,...k, where a l l a^ >. 0, k / k I /iT s / k I a i=l 1 i=l Proof. We square both expressions and obtain k k-1 k k • I a. + 2 I I /a.a. vs. k £ a. . i=l 1 i=l j=i+l 1 J i=l 1 60 We must determine whether k-1 k k 2 I I >^ 7=7 or (k-1) I a 1=1 j = i + l 2 i = l i s greater. I t can be shown that k (k-1) I a = Uj+a^ + ( a i+a 3) + ...+(a +a k_ 1) + ( a ^ ) i = l + (a 2+ a ) + (a 2+ & 4> + ... + (a 2+ a^) + + ( V - i + V • By applying the geometric a r i t h m e t i c mean i n e q u a l i t y to a l l p o s s i b l e pairs of a., a. i t fo l l o w s that 1 J a. + a. /a. a. <: - — - < i 3 2 and hence k-1 k k 2 I I ^TaT $ (k-1) I a 1=1 j = i + l 3 1=1 Therefore k / k X /aT $ Ak I a. . (4.46) 1=1 v 1=1 61 Proof of theorem 4.6 We define M = I /PCC./X^PCC.) k i = l For equal a p r i o r i p r o b a b i l i t i e s M i *k /M C / X k M i = l \"• '\"V s' 1 ; max where P c 7 x = max { P C C j / X j p , P ^ / X ^ , ... P ^ / X ^ } . and i i s the c o e f f i c i e n t f o r P. ... . max c ' x k From (4.46) we know that M j / M I .Mc./Xjp $ /(M-1) I P ( C . / X k ) . 
(4.48) 1=1 \" 1=1 ±H 1*1 max max But M I P(C i/X k) = 1 - P c / x . (4.49) i = l \" k max By s u b s t i t u t i n g (4.49) i n (4.48) and then by s u b s t i t u t i n g the r e s u l t i n (4.47) we obtain the i n e q u a l i t y 62 We now average o v e r - a l l x^. & P < V \\ * M & P ( X ? + / ^ J i P ( V ^ Because /a i s a convex f] f u n c t i o n of a f o r a l l p o s i t i v e a [49, pp. 82-85], l=1 P£• and a l l other P(C/X^) are set equal to zero. The argument can be used because /P(X) i s a convex downward f u n c t i o n f o r a l l p o s s i b l e values of P(X). Hence, i t f o l l o w s that For M c l a s s e s , n i s v a r i e d from 1 to M-l. A s e r i e s of inequal-i t i e s , g i v i n g a lower bound f o r a l l p o s s i b l e values of P., r e s u l t s . c/x k 66 The relation (A.53) is plotted in figure 4.13 for M = 2, 3, 5, 10, 20, ». The results are valid for any X ^ . However we must average over a l l to obtain a relation between P D and P g . We obtain the piecewise linear convex cover as shown in figure 4.13. We average this cover with respect to P ( X ) and hence obtain an upper bound on error probability. The piece-wise bound is plotted in figure 4.14 for M = 2, 3, 5, 10, 20, °°. We now determine the equations of the straight lines which give the error bound. For an arbitrary n we examine the endpoints of the bound given by (4.52) and from these points:.determine the desired equations. For the range 1 * P , , * 1 n+1 \" &l \\ n one endpoint occurs at 1 , . , . 1 f l ~ P p , = — , whxch gives p £ — n,/ — e / x k n *k M / n the other endpoint occurs at P , = —TT-, which gives p » — (n+1) / 1 ? / x k n + 1 \\ M V . n T l Now, P , = 1 - P , e / x k G / X k Hence, two sets of points on the straight line giving the upper bound on P , are e / x k , 1 n n-1 , , 1 (n+1) n s ( , — — ) and ( V- , —TT) 67 F i g u r e 4 , 1 3 E q u a t i o n ; 4 . 5 3 f o r v a r i o u s v a l u e s o f M 68 Figure 4.14 Figure 4.14 Upper bounds on e r r o r p r o b a b i l i t y i n terms of the Bhattacharyya c o e f f i c i e n t f o r various values of M 69 To obtain the equation of the upper bound we s u b s t i t u t e i n t o the general l i n e a r equation y = ax + b . We o b t a i n : *=L = a ( _ L _ -J2- ) + b , n M AT n , 1 n+1 N , i = a ( ) + b . 1 1 + 1 . vfr v£+T S o l v i n g f o r a and b,. we get a = (v'n+T - AT) (n) (n+1) n-1 b = n ( A + T - v £ ) ( n ) (n+1) For a given X^, the r e s u l t i n g bound i s (y^ip - v^n) *k , n-1 p $ + — — e \\ (v'n+T - v4T)(n)(n+l) By averaging over a l l we obt a i n M P (X,C) - AT p <; 1! + ±- . 6 (v£+T - ) (n) (n+1) n Q.E.D. The previous d i s c u s s i o n Theorem 4<.-5 leads to the fo'il;v i n j th^o-sm. 70 4.5,4 J o s h i ' s dependence and E r r o r P r o b a b i l i t y I t i s p o s s i b l e to r e l a t e J o s h i ' s dependence to e r r o r p r o b a b i l -i t y . I t was p r e v i o u s l y shown that J D(X,C) * I ( X , C ) , = H(C) - H(C/X). From Chu and Chueh [55] we o b t a i n the r e l a t i o n H(C) >. H C K ^ , 1 - W + l o g ^ where P M = max {P(C i)} i and K i s the l a r g e s t i n t e g e r smaller than ~ . 
Fano's i n e q u a l i t y s t a t e s H(C/X) <: H(P e, 1-P ) + P e log(M-l) where P i s e r r o r p r o b a b i l i t y and M i s number of cl a s s e s , e We combine the two i n e q u a l i t i e s and o b t a i n J D(X,C) >, K X , C ) , H)* (4.55) k=l i=l D(X,C) is essentially a functional approximation to the quantity L I max {P(Xk,C.i)} k=l i which i s , of course, the probability of a correct decision in Bayes clas-s i f i c a t i o n . Devijver went on to show the following inequalities. f , I(X,C) = I I P(Xk)P(C.i) i=l k=l M L P(X^) P2(C.i/X^) = I I P f r / \" 1 (4.57) 1=1 k=l For equal a p r i o r i probabilities L M I(X,C) =M I P(X k) I PZ(C./Xk) - 1 (4.58) k=l i=l 72 This gives the r e s u l t I(X,C) = MD - 1 - (4.59) S u b s t i t u t i n g D = i[I(X,.C) + 1] i n t o the i n e q u a l i t i e s of D e v i j v e r , we get the lower bounds P e * ¥ t l -fW^^ 1 \" i5 f ( X ' C ) + 1 * k ~ Z f(X'C)] > (4.60) and the upper bound P e * 1 - | \" | i ( X ' C ) - ( 4 - 6 1 ) For equal a p r i o r i p r o b a b i l i t i e s i t can be shown that I . (X,C) = 0, corresponding to maximum P mm 6 e ' I m a ^ ( X , C ) = M-l, corresponding to;:minimum P g . The upper bound i s p l o t t e d i n s o l i d l i n e s i n f i g u r e 4.5 f o r M=2,3,5,8,10 and the t i g h t e s t lower bound i s p l o t t e d i n broken l i n e s . 4.6 Comparison of E r r o r Bounds E r r o r bounds f o r Kolmogorov dependence, Bhattacharyya dependence, mutual i n f o r m a t i o n , and at the approximation of mutual i n f o r m a t i o n are compared i n t h i s s e c t i o n . There are s e v e r a l p o i n t s to make about the bounds. (1) A l l the bounds are derived f o r the case of equal a p r i o r i p r o b a b i l -i t i e s . This assumption has the e f f e c t of transforming the dependence 73 J.OO-j U P P E R B O U N D L O W E R B O U N D - — 1 1 1 T 1 J.D 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 JO.D RPPRQXJMRTE MUTURL JNFDRHRTJON Figure 4.15 Upper and lower bounds on e r r o r p r o b a b i l i t y i n terms of an approximation of mutual i n f o r m a t i o n f o r various values of M 74 measures into measures of equivocation. That i s , the measures are of the form L M I I f l P C C . / y ] k=l i = l where f ( . ) i s d i f f e r e n t for each measure. (2) The lower bounds can be achieved over the whole range of P . For the measures K (X,C) , p (X,C) , I(X,C) and i~(X,C) , the lower e u u P bound i s achieved when one PCC/X^) = P £ and a l l remaining PCC/X^) = and a l l PCX^ .) = jj> The bounds are the t i g h t e s t possible bounds f o r equal a p r i o r i p r o b a b i l i t i e s . (3) The upper bounds f o r K D(X,C), p D(X,C) and I(X,C) can be achieved at the end points of the ranges n-1 p n n \" e - n+1 where n = 1, 2, 3 M-l. The bound i s achieved when n of the PCC/X^.) = P^ and one PCC/X^ = l-nP ( and a l l remaining PCC/X^ = 0 for a l l k and ? ( \\ ) = £ f o r a 1 1 k - ^ upper bound f or I(X,C) i s achieved when P g = 0 and P g = 1 - ^ . For comparison purposes the four measures are normalized so that each has the range [0,1]. The normalized measures are zero i f error p r o b a b i l i t y i s a maximum and are one i f err o r p r o b a b i l i t y i s zero, We define the normalized measures as: K (X,C) NORM 1 - T 7 M J\"N0RM^'L; logM 75 1 - P d(X,C) PNORM(X'C) = = 1 -n t o m K * 1 M - 1 In f i g u r e s (4.15) and (4.16) the upper and lower bounds on e r r o r p r o b a b i l i t y and the as s o c i a t e d interbound distances are p l o t t e d f o r M = 2, 3, 5, 10, 20, 0 0 . 
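The lower bound on error probability implied by Fano's inequality, which serves as the mutual information lower bound in these comparisons, can be evaluated numerically: for a given equivocation H(C/X) it is the smallest P_e satisfying H(P_e, 1 - P_e) + P_e log(M - 1) >= H(C/X), and the left-hand side is monotone on [0, 1 - 1/M], so bisection suffices. A minimal sketch follows; natural logarithms are used throughout (any base works if used consistently), and the function names and sample values are illustrative only.

import numpy as np

def binary_entropy(p):
    # H(p, 1 - p) in nats; 0*log(0) is treated as 0.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def fano_lower_bound(h_c_given_x, M, iters=60):
    # Smallest Pe with H(Pe, 1 - Pe) + Pe*log(M - 1) >= H(C/X), by bisection;
    # the left-hand side increases monotonically for Pe in [0, 1 - 1/M].
    if h_c_given_x <= 0.0:
        return 0.0
    lo, hi = 0.0, 1.0 - 1.0 / M
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lhs = binary_entropy(mid) + mid * np.log(max(M - 1, 1))
        if lhs < h_c_given_x:
            lo = mid
        else:
            hi = mid
    return hi

# Example: M = 5 equally likely classes with I(X,C) = 0.6 nat,
# so the equivocation is H(C/X) = log M - I(X,C).
print(fano_lower_bound(np.log(5) - 0.6, 5))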
The upper bounds are represented by the c o n t i n -uous l i n e s and the lower bounds are represented by the broken l i n e s . The bounds are l a b e l l e d f o r Kolmogorov dependence. However, the value of M of any of the bounds can be determined from the y - i n t e r c e p t given by M The interbound distances are p l o t t e d because they, give the unc e r t a i n t y that e x i s t s about the e r r o r p r o b a b i l i t y when a measure has been computed. The bounds p l o t t e d are the t i g h t e s t ones der i v e d . The lower bound f o r mutual in f o r m a t i o n i s the Fano bound given by (4.51); and the upper bound i s given by (4.16) f o r equal a p r i o r i p r o b a b i l i t i e s . The upper bound i s the one derived by Tebbe and Dwyern [56] and Kovalevsky [57]. The number of cl a s s e s being considered has a d e f i n i t e e f -f e c t on the e r r o r bounds f o r a l l the measures. We see that the interbound c distances degrade f o r the d i f f e r e n t measures as M i n c r e a s e s . The rate of degradation v a r i e s f o r the d i f f e r e n t measures. As M approaches 0 0, the maximum interbound distances f o r K^(X,C), p^(X,C) and I(X,C) a l l 76 NORMALIZED 3HATTACHARYYA K E B S L R E NORMALIZED BHRTTRCHARYYR MEASURE Figure 4.16 Comparison\"\"of e r r o r bounds 77 NORMALIZED HPPROX. HUTURL INFORMATION N0RHRLI2ED RPPROX. KJIUBL I N F O R M A T I O N NORMRLIZED MUTUAL INFORMATION NORMALIZED MUTURL INFQRMRT I ON Figure 4.17 Comparison o f ^ e r r o r bounds 78 approach one. This means that when the normalized measures take a value close to one we have very l i t t l e idea of what the error probability i s for the feature set being considered. Theiinterbound distance for the approximation of mutual information degrades, but as M approaches <», the maximum interbound distance is 1/4. We s t i l l have a reasonable idea of what the error probability of a feature set i s when we compute-the approximation of mutual information for many classes. When only two classes are being considered, the Kolmogorov dependence is directly related to error probability and' hence the inter-bound distance is zero. The other measures have maximum interbound distances varying from 0.12 to approximately 0.2. Although KD(X,C) i s good for the two class problem, the max-imum interbound distances increase rapidly as M increases. When twenty classes are being considered, K^(X,C), I(X,C) and p^(X,C) have maximum interbound distances of approximately the same magnitude. For this number of classes, the approximation of mutual information has the small-est maximum interbound distance. A second factor to be considered in comparing the measures is the error probability range of the feature sets being evaluated. For example, we consider features which individually have an error pro-2 bability greater than 1 - — . (The value is f a i r l y arbitrary, but es-sentially means that the feature considered by i t s e l f gives poor per-formance. (A situation in which individually poor features have to be ranked could occur in a sequential classification problem.) With poor features K^(X,C) has a smaller interbound distance than the other mea-sures. If features which have low error probability, (arbitrarily, P^ < 1 ) , are considered, then the approximation of mutual information would have the s m a l l e s t interbound distance over a large range of c l a s s e s . 
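The piecewise-linear upper bound of Theorem 4.5 can likewise be evaluated directly once K_D(X,C) is known. A minimal sketch for the equal a priori case: the segment index n is chosen so that M(1 - K_D) falls between n and n + 1, and the result is clamped to the admissible range [0, 1 - 1/M]. The function name and the sample values of M are illustrative.

import math

def kolmogorov_upper_bound(K_D, M):
    # Theorem 4.5: for equal a priori probabilities and (n-1)/n <= Pe <= n/(n+1),
    #   Pe <= [M(1 - K_D) - n] / [n(n + 1)] + (n - 1)/n,   n = 1, ..., M - 1.
    s = M * (1.0 - K_D)                        # runs from 0 (K_D maximal) up to M
    n = min(max(int(math.floor(s)), 1), M - 1)
    bound = (s - n) / (n * (n + 1)) + (n - 1) / n
    return max(0.0, min(bound, 1.0 - 1.0 / M))

for M in (2, 3, 5, 10):
    print(M, [round(kolmogorov_upper_bound(k / 10.0, M), 3) for k in range(11)])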
We can conclude that f o r few classes Kolmogorov dependence and the approximation of mutual i n f o r m a t i o n are most c l o s e l y l i n k e d w i t h e r r o r p r o b a b i l i t y and, that f o r many cl a s s e s the approximation of mutual i n f o r m a t i o n i s most c l o s e l y r e l a t e d to e r r o r p r o b a b i l i t y . 80 CHAPTER V EXPERIMENTS AND DISCUSSION 5.1 I n t r o d u c t i o n In t h i s chapter, an experimental comparison of the measures of dependence i s made, both w i t h respect to each other and w i t h respect to estimated e r r o r p r o b a b i l i t y . A group of fe a t u r e subsets i s e x t r a c t e d from a standard data set and rankings of the feat u r e subsets are made, using each of the measures and usin g estimated e r r o r p r o b a b i l i t y . A q u a n t i t a t i v e comparison of the s i m i l a r i t i e s of a l l p a i r s of rankings i s made using the Spearman rank c o r r e l a t i o n c o e f f i c i e n t . Reasons f o r the s i m i l a r i t i e s and d i f f e r e n c e s of the rankings are discussed. The com-p u t a t i o n a l requirements of the measures are examined. 5.2 D e s c r i p t i o n of Data, Preprocessing and Feature E x t r a c t i o n . Munson's [59] [60] handprinted character data f i l e was used f o r the experiments. The f i l e c o n s i s t s of three samples of each FORTRAN character from f o r t y - n i n e d i f f e r e n t people. Each sample i s stored on magnetic tape as a twenty four by twenty four b i n a r y valued a r r a y . For the experiments, only the a l p h a b e t i c characters were used. The charac-t e r s underwent s i z e n o r m a l i z a t i o n and a shearing transformation [61]. Each twenty four by twenty four a r r a y was transformed i n t o a twenty by twenty array by the preprocessing operation. The r e s u l t i n g a r r a y was d i v i d e d i n t o twenty f i v e non overlapping four by four r e g i o n s . For each r e g i o n a f e a t u r e was defined. The value of the fe a t u r e f o r each re g i o n was equal to the number of bl a c k p o i n t s i n the re g i o n . Bayes estimates of both c l a s s u n c o n d i t i o n a l and c l a s s c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n s were made f o r each f e a t u r e , and these estimates were used i n 81 t h e e x p e r i m e n t s . The p r e p r o c e s s i n g scheme and t h e f e a t u r e e x t r a c t i o n p r o c e s s w e r e d e s i g n e d b y Hussain,-Toussainr\" - and ' D o n a l d s o n [ 6 1 ] , [62] f o r some p r e v i o u s e x p e r i m e n t s . 5 . 3 D e s c r i p t i o n o f E x p e r i m e n t s E x p e r i m e n t s w e r e done u s i n g t h e f e a t u r e s e x t r a c t e d f r o m t h e p r e p r o c e s s e d v e r s i o n o f M u n s o n ' s d a t a f i l e . The e x p e r i m e n t s w e r e p e r -f o r m e d t o compare t h e f e a t u r e e v a l u a t i n g c a p a b i l i t i e s o f t h e d i f f e r e n t m e a s u r e s o f d e p e n d e n c e , b o t h i n r e l a t i o n t o e a c h o t h e r and i n r e l a t i o n t o e s t i m a t e d B a y e s e r r o r p r o b a b i l i t y . 
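The zoning features of Section 5.2 are simple black-point counts, and the extraction step is easily made concrete. A minimal sketch for one preprocessed 20 x 20 binary array; the size normalization and shearing preprocessing of [61] is not reproduced here, and the function name is illustrative.

import numpy as np

def zone_features(img):
    # Count black points in the 25 non-overlapping 4 x 4 regions of a
    # 20 x 20 binary array; feature i is the count for region i.
    img = np.asarray(img)
    assert img.shape == (20, 20)
    blocks = img.reshape(5, 4, 5, 4)          # 5 x 5 grid of 4 x 4 blocks
    return blocks.sum(axis=(1, 3)).ravel()    # 25 integers in the range 0..16

# Example with a randomly generated binary "character".
rng = np.random.default_rng(0)
sample = (rng.random((20, 20)) > 0.7).astype(int)
print(zone_features(sample))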
I d e a l l y , one w o u l d l i k e t o com-p a r e t h e d e p e n d e n c e m e a s u r e s w i t h t h e e r r o r p r o b a b i l i t y o f t h e c l a s s i -f i e r t h a t i s b e i n g u s e d i n a p a r t i c u l a r p r o b l e m , b e c a u s e d i f f e r e n t c l a s s i f i e r s w o u l d p e r f o r m b e t t e r w i t h d i f f e r e n t f e a t u r e s u b s e t s . How-e v e r , t h e B a y e s c l a s s i f i e r i s a good r e f e r e n c e s t a n d a r d b e c a u s e i t y i e l d s t h e minimum e r r o r p r o b a b i l i t y . Twenty f i v e f e a t u r e s u b s e t s w e r e d e f i n e d f o r t h e p r o c e s s e d c h a r a c t e r s , a n d , o v e r t h e t w e n t y s i x c l a s s e s , B a y e s e s t i m a t e s o f b o t h c l a s s c o n d i t i o n a l and c l a s s u n c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n s w e r e made. F o r s i m p l i c i t y e a c h f e a t u r e s u b s e t c o n s i s t e d o f one f e a t u r e ( e a c h f e a t u r e b e i n g t h e number o f b l a c k p o i n t s i n a d e f i n e d r e g i o n ) . T h i s r e d u c e d t h e number o f p a r a m e t e r s t o be e s t i m a t e d and g a v e t h e r e s u l t s g r e a t e r s i g n i f i c a n c e . F o r f u r t h e r e x p l a n a t i o n o f t h e f e a t u r e e x t r a c t i o n and e s t i m a t i o n see [ 6 1 ] , [ 6 2 ] . B e c a u s e d i s t r i b u t i o n s w e r e unknown and h e n c e h a d t o be e s t i -m a t e d , t h e e r r o r p r o b a b i l i t y f o r a f e a t u r e s u b s e t c o u l d n o t be d i r e c t l y c a l c u l a t e d . I n s t e a d t h e e r r o r p r o b a b i l i t y f o r e a c h f e a t u r e s u b s e t was 82 experimentally determined using t r a i n i n g and t e s t i n g data. Each feature subset was used by i t s e l f i n a Bayes c l a s s i f i e r with the assumption of equal a p r i o r i p r o b a b i l i t i e s f o r classes. The r e s u l t s were used to rank the twenty f i v e feature subsets with respect to estimated error proba-b i l i t y , P . The mutual information, Kolmogorov dependence, Bhattacharyya dependence\"'\", Joshi's dependence, and the approximation to mutual i n f o r -mation were computed f o r each feature subset, (using the d i s t r i b u t i o n s estimated from the t r a i n i n g data), and the r e s u l t s were used to make f i v e rankings of the twenty f i v e feature subsets. A method developed e a r l i e r by Toussaint and Donaldson (['63;]: was used to estimate o v e r a l l system performance f o r each feature. This method i s a compromise between Highleyman's [64] H method and the U method of Lachenbruch and Mickey [65]. For the H method the data i s p a r t i t i o n e d into two d i s j o i n t sets, one used f o r t r a i n i n g and the other f o r t e s t i n g . The method y i e l d s p e s s i m i s t i c r e s u l t s f o r small data sets. For the U method, one sample i s removed from the data set, the remaining samples are used f o r t r a i n i n g , and then the sing l e sample i s c l a s s i f i e d . This procedure i s repeated f o r a l l remaining samples. System error p r o b a b i l i t y i s the average over a l l i n d i v i d u a l error proba-b i l i t i e s . This method y i e l d s good estimates f o r system error p r o b a b i l i t y but requires a great deal of computation. The method used i n these experiments was to p a r t i t i o n the data set in t o M d i s j o i n t sets. 
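The H and U methods just described can be summarized in a few lines; the rotation method adopted here, described next, interpolates between these two extremes. The sketch below is illustrative only: train and classify stand in for whatever estimation procedure and decision rule are actually used (a nearest-class-mean stand-in appears in the demonstration, not the Bayes classifier of the experiments).

import numpy as np

def h_method_error(samples, labels, train, classify, rng):
    # Highleyman's H method: split the data once into disjoint
    # training and testing halves and report the test error.
    idx = rng.permutation(len(samples))
    half = len(samples) // 2
    tr, te = idx[:half], idx[half:]
    model = train(samples[tr], labels[tr])
    return np.mean([classify(model, x) != y for x, y in zip(samples[te], labels[te])])

def u_method_error(samples, labels, train, classify):
    # Lachenbruch and Mickey's U method: leave one sample out, train on the
    # rest, classify the held-out sample, and average over all samples.
    n = len(samples)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i
        model = train(samples[keep], labels[keep])
        wrong += classify(model, samples[i]) != labels[i]
    return wrong / n

# Tiny demonstration with a nearest-class-mean stand-in classifier.
def train(x, y):
    return {c: x[y == c].mean(axis=0) for c in np.unique(y)}

def classify(model, x):
    return min(model, key=lambda c: np.linalg.norm(x - model[c]))

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(h_method_error(x, y, train, classify, rng), u_method_error(x, y, train, classify))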
Bayes estimates of unconditional and class c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n 1. Matusita's dependence was not computed because i t gives the same rankings f o r features as Bhattacharyya dependence. For each point i n feature-class space i t requires the same computation. 83 functions were made using a l l samples in M-1 of the sets. The remain-ing set was used for testing. The estimation and testing were repeated for a l l possible partitions of M-1 sets and then associated single set. In the experiments performed M=7. Hence seven sets of d i s t r i -butions were estimated and seven values of 2^ and each of the depend-ence measures were obtained for each feature. These values were averaged, and rankings of features were obtained using the average values. To obtain a quantitative indication of similarity between the six rankings and hence obtain an indication of performance of the different feature evaluation c r i t e r i a , the Spearman rank correlation coefficient [66] was computed between each of the fifteen pairs.of orderings. The Spearman rank correlation coefficient i s a single number st a t i s t i c which indicates the similarity of two orderings of quantities. It takes on values between zero and one. It is zero i f there is no relationship between the orderings; i t is one i f the two orderings are the same. The results of the experiments are shown in table 5.1 and table 5.2. 5.4 Discussion of Results From table 5.1 we see that the rankings obtained using mutual information, Kolmogorov dependence, Bhattacharyya dependence, Joshi's dependence,and the approximation of mutual information:, are highly correlated with each other. From table 5.2 where the actual rankings are presented, we see that the dependence measures ranked the twenty five feature subsets in a very similar manner. For example, the f i r s t TABLE 5.1 SPEARMAN RANK CORRELATION COEFFICIENTS BETWEEN FEATURE ORDERINGS. i(x,c) Vx.c) BD(X,C) J D(x,c) .712 .649 .695 .691 .752 i(x,c) .969 .995 .993 .974 KpCX.C) .972 .971 .912 BD(X,C) .999 .963 JD(x,c) ' .963 T A B L E 5.2 RANKINGS FOR THE DEPENDENCE MEASURES RANK I(X,C) Kj^X.C) B D(X,C) J D(X,C) ?(X,C) : P g 1 3 3 3 3 1 1 1 3 2 1 1 2 3 1 1 1 1 3 3 3 2 3 1 1 2 3 2 3 1 3 4 4 1 5 1 5 1 5 1 5 2 3 1 1 5 1 3 1 0 1 0 1 0 1 5 ( 2 4 ^ 6 1 0 24 1 3 1 3 24 1 2 7 2 4 4 24 2 4 1 6 1 4 8 4 2 0 4 4 4 2 9 2 0 22 2 0 2 0 1 0 1 0 1 0 22 1 3 22 22 1 4 ( 9* 1 1 1 6 25 1 6 16 1 2 2 3 1 2 1 4 1 4 1 2 1 2 22 6 1 3 1 2 1 9 1 4 1 4 2 0 1 5 1 4 2 5 1 2 1 9 1 9 25 22 1 5 1 9 16 25 25 1 9 *18* 1 6 2 2 2 2 2 25 17 9 9 9 1 8 9 ( 5) 1 8 1 8 1 8 1 8 9 1 8 1 9 1 9 6 6 6 6 6 1 6 2 0 2 1 2 1 8 8 2 1 8 2 1 8 1 2 1 2 1 8 2 0 22 5 8 1 1 5 2 1 23 1 17 7 7 6 1 24 7 7 5 5 17 17 •25 17 5 17 1 7 1 7 ( )\" Indicates t i e s i n ranking. 86 three p o s i t i o n s of the rankings of I(X,C), K ^X.C), B D(X,C) and J D(X,C) in c l u d e features 3, 11, 23 w i t h f e a t u r e 3 ranked f i r s t f o r a l l four measures. I(X,C) i s a l i t t l e d i f f e r e n t but three of the f i r s t four p o s i t i o n s of the rankings c o n t a i n the same three f e a t u r e s . The s i m i l -a r i t y i n the rankings e x i s t s because each of the dependence measures computes a number which i n d i c a t e s the d i s t a n c e between the d i s t r i b u t i o n P(X,C) and the d i s t r i b u t i o n P (X) P(C), or, a l t e r n a t i v e l y , computes the average, over a l l c l a s s e s , of the s e p a r a t i o n or overlap of P(X/C^) and P(X). 
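The Spearman coefficient for a pair of orderings follows from the rank differences d_i through the standard formula rho = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)). A minimal sketch for tie-free orderings (the ties marked in table 5.2 require the usual tie correction, omitted here); the example rankings are hypothetical.

def spearman_rho(order_a, order_b):
    # Spearman rank correlation between two orderings of the same items,
    # each given as a list of item identifiers, best-ranked first.
    n = len(order_a)
    rank_a = {item: r for r, item in enumerate(order_a)}
    rank_b = {item: r for r, item in enumerate(order_b)}
    d2 = sum((rank_a[item] - rank_b[item]) ** 2 for item in order_a)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Two hypothetical orderings of five feature subsets.
print(spearman_rho([3, 11, 23, 15, 10], [3, 23, 11, 15, 10]))   # 0.9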
The d i f f e r e n t measures compute the distance or o v e r l a p , i n d i f f e r e n t ways and t h e r e f o r e should emphasize d i f f e r e n t s t r u c t u r a l r e l a t i o n s of the d i s t r i b u t i o n s . However, f o r the data set used, the measures gave s i m i l a r r e s u l t s . From t a b l e 5.1 we see that the rankings obtained using the dependence measures a l l give approximately the same c o r r e l a t i o n w i t h the ranking obtained using the estimated e r r o r p r o b a b i l i t y c r i t e r i o n . We note that the c o r r e l a t i o n of l!(X,C) w i t h P £ i s higher than any of the other measures. This i s a s a t i s f y i n g r e s u l t as we have shown that i t i s a f u n c t i o n a l approximation to e r r o r p r o b a b i l i t y . From t a b l e 5.2 we can see the d i f f e r e n c e s and a l s o the over a l l s i m i l a r i t y which r e -s u l t s i n the high c o r r e l a t i o n c o e f f i c i e n t s . The d i f f e r e n c e e x i s t s because of the nature of Bayes c l a s s i f i c a t i o n . When an unknown p a t t e r n described by the random v a r i a b l e X having value X_. i s presented to a Bayes c l a s s i f i e r , the d i f f e r e n t p r o b a b i l i t i e s of X^ . over a l l the c l a s s e s are examined, and the c l a s s C. f o r which P(X.,C.) i s a maximum i s chosen i J i as the c o r r e c t c l a s s . The goodness of a f e a t u r e X depends on the per-centage of c o r r e c t c l a s s i f i c a t i o n s made when a set of t e s t i n g data i s presented to the Bayes c l a s s i f i e r . For each unknown p a t t e r n w i t h value 87 X_., there i s only one class which i s \"correct\". Hence, only one P(Xj,C\\) a c t u a l l y evaluates the feature when i t takes on the value X^. When using dependence measures f or feature evaluation, f o r a given X., each P(X., C,) (k=l, ... M) adds some contribution to the good-J 3 k ness of a feature. Hence there i s a basic d i f f e r e n c e i n the evaluation procedure. However, there i s also a d e f i n i t e s i m i l a r i t y i n that both the estimated error p r o b a b i l i t y and the dependence measures show the amount of overlap or the separation of class c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n s . The r e s u l t s of the experiments give support to the contention that dependence measures evaluate features i n a manner l i k e estimated error p r o b a b i l i t y . To strengthen t h i s contention, experiments should be performed with data sets having d i f f e r e n t error p r o b a b i l i t i e s and d i f f e r e n t types of co n d i t i o n a l d i s t r i b u t i o n s . Because of the as s o c i a t i o n between error p r o b a b i l i t y and dependence measures discussed previously, i t i s l i k e l y that r e s u l t s from these experiments would be s i m i l a r to the reported r e s u l t s . 5.5 Computation Since dependence measures evaluate feature subsets i n s i m i l a r ways, the amount of computation required f o r the d i f f e r e n t measures must be considered when choosing a measure. Each of the measures requires a summation over a l l the feature-class space, but for any one point i n the space, each requires a d i f f e r e n t amount of computation. 
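The way the measures trade structural emphasis against computation becomes clearer when they are written out for an estimated discrete joint distribution. The sketch below uses the generic forms I(X,C) = sum P(x,c) log[P(x,c)/(P(x)P(c))], K_D(X,C) = sum |P(x,c) - P(x)P(c)|, and a Bhattacharyya-type coefficient sum sqrt(P(x,c) P(x) P(c)); any constant factors in the exact definitions should be taken from the definitions given earlier in the thesis, and the example distribution is purely illustrative.

import numpy as np

def dependence_measures(joint):
    # joint[k, i] = estimated P(X = x_k, C = c_i).
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)        # P(X)
    pc = joint.sum(axis=0, keepdims=True)        # P(C)
    prod = px * pc                               # P(X)P(C)
    nz = joint > 0
    # Per point: one division, one logarithm, one multiplication.
    mutual_info = np.sum(joint[nz] * np.log(joint[nz] / prod[nz]))
    # Per point: one subtraction and one absolute value.
    kolmogorov = np.sum(np.abs(joint - prod))
    # Per point: one multiplication and one square root.
    bhattacharyya = np.sum(np.sqrt(joint * prod))
    return mutual_info, kolmogorov, bhattacharyya

# Illustrative joint distribution: 4 feature values, 3 classes.
P = np.array([[.10, .02, .02],
              [.05, .20, .05],
              [.02, .08, .20],
              [.16, .03, .07]])
print(dependence_measures(P))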
In a com-puter implementation, the factors to be considered are the times r e -quired f o r arithmetic operations such as logarithm,amultiplication d i v i s i o n , square root, exponentiation, absolute value, addition, and 88 subtraction. In terms of amount of computer time required, the f i r s t f i v e operations l i s t e d are complex and the l a s t three operations are simple. As an example we now compare Kolmogorov dependence and mutual information with respect to computational complexity. For any point i n feature-class space, the computation of Kolmogorov dependence requires a subtraction and an absolute value operation; f o r any point, the com-putation of mutual information requires one d i v i s i o n , one logarithm, and one m u l t i p l i c a t i o n or, a l t e r n a t i v e l y , two logarithms, one sub-t r a c t i o n , and one m u l t i p l i c a t i o n . In general, Kolmogorov dependence i s much simpler computationally than mutual information. Using t h i s type of reasoning, we can show that the ordering of the measures with r e s -pect to increasing complexity i s : Kolmogorov dependence, the approxi-mation to mutual information, Bhattacharyya dependence (or Matusita's dependence), mutual information, and Joshi's dependence. For the mult i c l a s s feature s e l e c t i o n problem two other approaches have been used: the expected value approach and the maximum approach. In each method, the distances between a l l p a i r s of classes are computed using c l a s s c o n d i t i o n a l d i s t r i b u t i o n s . For the former, the distances are averaged over a l l p a i r s of classes and the expected value i s maximized, for the l a t t e r , the minimum pairwise distance i s found and M(M-l) i s maximized. In general these methods require — ^ distances to be computed using M d i s t r i b u t i o n s . Dependence measures, however, require only M distances to be computed. But, dependence measures require e s t i -mation of one extra d i s t r i b u t i o n P (X). Hence, dependence measures i n general are computationally l e s s complex, but some of t h e i r advantage i s l o s t because of the extra d i s t r i b u t i o n required. Also, pairwise measures are s i m p l i f i e d f o r some 89 s p e c i a l d i s t r i b u t i o n . (e.g. use of pairwise Bhattacharyya distance f o r Gaussian d i s t r i b u t i o n s ) . No attempt has been made to f i n d any s i m p l i f i -cations for the dependence measures. One should be wary of pronouncing absolute judgements on the d i f f e r e n t multi class measures as there are always p o t e n t i a l p i t f a l l s . However, for the d i s t r i b u t i o n free case, measures of dependence are a f a i r l y simple method of evaluating features. 5.6 Summary The concept that p r o b a b i l i s t i c dependence i s a s p e c i a l case of distance between two p r o b a b i l i t y d i s t r i b u t i o n s has been used to generate measures of dependence from w e l l known distance measures. I t should be noted that the method can be applied to convert any pairwise distance measure to a dependence measure. In th i s tehesrsffivessuch measures have been examined. Of the measures examined, the Bhattacharyya dependence and Matusita's dependence are new. Though Kolmogorov dependence and Joshi's dependence have been derived before [46], [48], they are not w e l l known and have not been previously applied to feature evaluation i n pattern recognition. 
The behaviour of the measures, when applied to feature evaluation has been discussed. I n e q u a l i t i e s have been derived which r e l a t e the measures to p r o b a b i l i t y of error. Ex-periments have been performed using features extracted from Munson's handprinted character data f i l e . I t was found, for the data used, that the measures of p r o b a b i l i s t i c dependence evaluated features i n very s i m i l a r ways, and that they evaluated features much l i k e the estimated error p r o b a b i l i t y c r i t e r i o n d i d. The computation required f o r the d i f f e r e n t measures was examined. I t i s stressed that amount of com-90 putation required should be considered when choosing a measure to evalu-ate features. The r e s u l t s presented i n t h i s portion of the thesis are summarized i n [67]. 91 CHAPTER VI DERIVATION OF SEQUENTIAL\" CLASSIFIER 6.1 Introduction In t h i s section of the thesis an h e u r i s t i c sequential c l a s s i -f i c a t i o n method i s proposed. The c l a s s i f i e r i s constructed by using distance and information measures and hence i s a d i r e c t a p p l i c a t i o n of the measures previously investigated. The c l a s s i f i e r i s a tree structure i n which c l u s t e r i n g of p r o b a b i l i t y d i s t r i b u t i o n s i s used. A backtracking strategy and other methods are implemented to improve recognition rate. In an attempt to use few measurements e f f i c i e n t l y , the pro-posed tree c l a s s i f i e r makes use of features which discriminate between groups of classes (clusters) at each l e v e l or between subsets of classes. When c l a s s i f y i n g an unknown pattern, one feature subset i s observed i n i t i a l l y , and based upon the r e s u l t , the pattern i s c l a s s i f i e d as a member of some group of classes (a c l u s t e r ) . Dependent upon which cl u s t e r was chosen, a second set of measurements i s made. The pattern i s then c l a s s i f i e d as a member of a smaller group of classes. The process of taking c l u s t e r dependent measurements and of making sub-c l a s s i f i c a t i o n s i s repeated u n t i l the pattern i s f i n a l l y recognized as belonging to one i n d i v i d u a l c l a s s . I t can be seen that as measure-ments are taken, classes are eliminated from further consideration. Because of the p r o b a b i l i s t i c nature of the measurements, errors can: be made. For t h i s reason backtracking schemes and r e - i n t r o d u c t i o n of classes are b u i l t into the c l a s s i f i e r . This h i e r a r c h i c a l method of c l u s t e r i n g i s f a i r l y common but the method of c l u s t e r i n g using distance and information measures and 92 the use of ba c k t r a c k i n g are new. 6.2 Sequential C l a s s i f i c a t i o n S e q u e n t i a l c l a s s i f i c a t i o n procedures have been i n v e s t i g a t e d i n the s t a t i s t i c a l l i t e r a t u r e and more r e c e n t l y have been a p p l i e d to s i g n a l d e t e c t i o n and p a t t e r n r e c o g n i t i o n . Sequential c l a s s i f i c a t i o n i s i n t u i t i v e l y appealing because i t i s a process used by humans i n r e c o g n i z i n g p a t t e r n s . We examine a p a t t e r n , and based upon what we see, look at the p a t t e r n f o r more fe a t u r e s . We continue the process u n t i l we are f a i r l y c e r t a i n of what p a t t e r n i s being examined. 
The fea t u r e s we examine are dependent on the previous measurements taken. We e f f e c t i v e l y \"home i n \" on a p a t t e r n . In terms of s t a t i s t i c a l p a t t e r n r e c o g n i t i o n , a f e a t u r e measurement i s taken, a-measure i s computed to determine the c e r t a i n t y of c l a s s i f i c a t i o n , then another measurement i s taken or a f i n a l c l a s s i -f i c a t i o n i s made. The procedure continues u n t i l a d e c i s i o n i s made. In s t a t i s t i c a l l i t e r a t u r e most research on s e q u e n t i a l d e c i s i o n theory has been done using Wald's s e q u e n t i a l p r o b a b i l i t y r a t i o t e s t (SPRT) [16]. The a p p l i c a t i o n of s e q u e n t i a l d e c i s i o n theory to p a t t e r n r e c o g n i t i o n was done p r i m a r i l y by Fu and h i s students. Chen [18], [19] used the s e q u e n t i a l approach to p a t t e r n r e c o g n i t i o n and machine l e a r n i n g . The use of time v a r y i n g stopping boundaries was a p p l i e d to p a t t e r n r e c o g n i t i o n by Chien and Fu [68]. The g e n e r a l i z a t i o n of SPRT to m u l t i -c l a s s problems GSPRT (gene r a l i z e d SPRT) was derived by Reed [69] and was a p p l i e d to p a t t e r n r e c o g n i t i o n by Fu, Chien and Chen [18] [68]. The use of dynamic programming as a computational technique 93 f o r f i n i t e s e q u e n t i a l schemes was proposed by Bellman,Kalaba,and Middleton [70]. This technique has been a p p l i e d to p a t t e r n r e c o g n i t i o n by Fu, C a r d i l l o and Chien [20] - [23]. Much of t h i s work i s described i n the monograph by Fu [71]. The method proposed i n the t h e s i s i s q u i t e d i f f e r e n t from the c l a s s i c a l s e q u e n t i a l methods, as w i l l be shown l a t e r . However i t has the property of using c l a s s dependent measurements to home i n on a c l a s s . The c l a s s i c a l methods are p a r t i c u l a r l y a p p l i c a b l e when cost of measurements i s an important f a c t o r i n the d e c i s i o n making process. As has been noted by Fu [71] the storage and computational requirements are high. The proposed method i s an h e u r i s t i c method i n which cost of measurements i s not r e a l l y a f a c t o r and i n which an attempt i s made to minimize storage. The proposed method i s a m o d i f i c a t i o n of d e c i s i o n t r e e type c l a s s i f i e r s . D e c i s i o n t r e e c l a s s i f i e r have been used by s e v e r a l IBM researchers [27][28][29]. Nadler has done some i n v e s t i g a t i o n s [30],[31]. A l s o Sammon, et a l [33],[34] have implemented the d e c i s i o n t r e e type c l a s s i f i c a t i o n scheme as an o p t i o n f o r OLPARS. 6.3 C l u s t e r i n g The purpose of c l u s t e r i n g i s to f i n d n a t u r a l groupings i n a set of data. Normally, i n a c l u s t e r i n g problem, we are presented w i t h a set of samples, each of which i s a measurement vec t o r e x t r a c t e d from a p a t t e r n . We are re q u i r e d to group the samples i n t o \" c l u s t e r s \" so that the members of each c l u s t e r are s i m i l a r to each other and are d i f f e r e n t 94 from members of other c l u s t e r s . A c l u s t e r i n g c r i t e r i o n i s s p e c i f i e d , and f o r each p o s s i b l e grouping, the c r i t e r i o n takes on some value. The best p o s s i b l e c l u s t e r i n g w i l l maximize or minimize the c r i t e r i o n . 
The c r i t e r i o n i s a f u n c t i o n of distances between the samples or d i s t a n c e s between the samples and some predefined mean v e c t o r s , where the mean vec t o r s c h a r a c t e r i z e the c l u s t e r s . There are v a r i o u s procedures f o r p a r t i t i o n i n g the samples. Good d i s c u s s i o n s are presented i n B B a l l [72], Duda and H a r t [ 7 3 ] , and Fukunaga [74], These conta i n many references to methods used. 6.4 C l u s t e r i n g using P r o b a b i l i t y d i s t r i b u t i o n s For the proposed h i e r a r c h i c a l c l a s s i f i c a t i o n scheme i t was necessary to determine c l u s t e r i n g s of the c l a s s e s . The c o n d i t i o n a l d i s t r i b u t i o n s of a l l c l a s s e s were estimated f o r d i f f e r e n t features and d i s t a n c e and i n f o r m a t i o n measures were used to determine the overlap of the d i f f e r e n t c l a s s e s . The j u s t i f i c a t i o n f o r using the procedure i s s t r a i g h t f o r w a r d : high overlap between c l a s s e s i n d i c a t e s that f o r the measurement used, the c l a s s e s considered are e s s e n t i a l l y the same, and should be c l u s t e r e d ; low overlap i n d i c a t e s that the c l a s s e s should belong to d i f f e r e n t c l u s t e r s . In c l a s s i c a l c l u s t e r i n g methods the i n d i v i d u a l sample ve c t o r s are used. Distances between samplesvectors are employed f o r d e t e r -mining c l u s t e r s . The proposed method uses d i s t a n c e s between v e c t o r s too, but the v e c t o r s used are d i f f e r e n t . A d i s c r e t e PDF can be represented as a v e c t o r w i t h each dimension being one of the p o s s i b l e values of the random v a r i a b l e being considered. The components of the v e c t o r are then the p r o b a b i l i t i e s of the allowed values. A \" d i s t a n c e \" between any p a i r 95 o f v e c t o r s i s t h e n e q u i v a l e n t t o t h e d i s t a n c e b e t w e e n p r o b a b i l i t y d i s t r i -b u t i o n s . F o r e x a m p l e , we h a v e t h e d i s t r i b u t i o n s i n f i g u r e 6 . 1 r e p r e -s e n t e d as t h e v e c t o r s : ? ( X / C 1 ) = ( . 1 , . 2 , . 7 ) P ( X / C 2 ) = ( . 6 , . 3 , . 1 ) To d e t e r m i n e t h e s i m i l a r i t y o f t h e two c l a s s e s and we compute t h e d i s t a n c e b e t w e e n them i n p r o b a b i l i t y s p a c e . I f t h e d i s -t a n c e i s s m a l l , t h e n f o r t h e measurement u s e d , t h e c l a s s e s a r e s i m i l a r and s h o u l d be g r o u p e d t o g e t h e r ; i f t h e d i s t a n c e i s l a r g e , t h e n t h e c l a s s e s a r e d i s s i m i l a r and s h o u l d n o t be i n t h e same c l u s t e r . The d i s t a n c e b e t w e e n c l a s s e s h a s a d e f i n i t e r e l a t i o n t o e r r o r p r o b a b i l i t y . I f two c l a s s e s a r e v e r y s i m i l a r f o r t h e measurement u s e d t h e n t h e y w i l l be m i s t a k e n f o r e a c h o t h e r . The p a i r w i s e e r r o r p r o b a b i l i t y w i l l be h i g h and c l a s s c o n d i t i o n a l d i s t r i b u t i o n s w i l l o v e r l a p a g r e a t d e a l . I f two c l a s s e s a r e v e r y d i f f e r e n t , t h e y w i l l n o t be c o n f u s e d a n d , i n p r o b a b i l i t y s p a c e , t h e c l a s s c o n d i t i o n a l d i s t r i b u t i o n s w i l l h a v e l i t t l e o v e r l a p . 
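For the two example distributions above, the alternative separations are computed in one line each. A minimal sketch: the Euclidean form corresponds to the Patrick and Fischer criterion just quoted, and the other two anticipate the variational and Bhattacharyya forms discussed in the following paragraphs (normalizing constants should follow the definitions used there).

import numpy as np

p1 = np.array([0.1, 0.2, 0.7])    # P(X/C1) of figure 6.1
p2 = np.array([0.6, 0.3, 0.1])    # P(X/C2) of figure 6.1

euclidean     = np.sqrt(np.sum((p1 - p2) ** 2))   # Patrick-Fischer form, equal priors
variational   = 0.5 * np.sum(np.abs(p1 - p2))     # Manhattan form; 0.5 is one common normalization
bhattacharyya = np.sum(np.sqrt(p1 * p2))          # cosine of the angle between the square-root vectors

print(euclidean, variational, bhattacharyya)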
The d i s t a n c e b e t w e e n d i s t r i b u t i o n s c a n be computed w i t h any p a i r w i s e d i s t a n c e m e a s u r e . F o r e x a m p l e , E u c l i d e a n d i s t a n c e i n v e c t o r s p a c e c o r r e s p o n d s t o t h e d i s t a n c e c r i t e r i o n o f P a t r i c k and F i s c h e r [14] ( f o r e q u a l a p r i o r i d i s t r i b u t i o n s ) / = I1 [ P ( X k / C 1 ) - P ( V C 2 ) ] 2 -k = l 96 P (X/C) .z .1 P(X/C)' .3 1 Z 3 X 1 2 3 X The simpler Manhattan metric or Minkowski distance corresponds to Kolmogorov v a r i a t i o n a l distance K k=l : s For equal a p r i o r i d i s t r i b u t i o n s , the Kolmogorov v a r i a t i o n a l distance i s equal to error p r o b a b i l i t y . The Bhattacharyya distance, given by corresponds to the cosine of the angle between the two p r o b a b i l i t y d i s t r i b u t i o n vectors. Kolmogorov v a r i a t i o n a l distance i s most a t t r a c t i v e because of i t s known r e l a t i o n to error p r o b a b i l i t y . 6.5 Description of C l a s s i f i e r The basic c l a s s i f i e r consists of the tree shown i n f i g u r e 6.2. In the f i g u r e , Vti represents c l u s t e r j i n l e v e l i , and X.. represents the feature subset used to separate c l u s t e r U^_j 1„ j i n t o a set of smaller c l u s t e r s . Each c l u s t e r has associated with i t some c l a s s i f i c a t i o n r u l e and the necessary s t a t i s t i c a l information to carry out the c l a s s i f i c a t i o n . For example t h i s could be the p r o b a b i l i t y d i s t r i b u t i o n s of the c l u s t e r s i n t o which the classes under consideration are to be divided i n t o . k=l Other measures can be used f o r computing overlap. However, F i g . 6.2 TreeieiJSSsifier 99 As an example we consider the c l a s s i f i e r i n which the proba-b i l i t i e s of the c l u s t e r s are stored at each l e v e l . The c l a s s i f i c a t i o n of an unknown pattern i s done as follows. At the f i r s t stage we take a measurement (feature X^) , and knowing the p r o b a b i l i t y d i s t r i b u t i o n f o r each of the c l u s t e r s generated by feature we use a c l a s s i f i c a t i o n r u l e , such as Bayes r u l e , to determine to which sub group the pattern belongs. By measuring the feature associated with the chosen sub-group, we c l a s s i f y the pattern into a smaller sub group. Using the proper features we t r a v e l through the tree u n t i l a f i n a l c l a s s i f i c a t i o n i s made. A major drawback of the described c l a s s i f i e r i s , that once a d e c i s i o n i s made at any stage, some classes are eliminated from consideration. There are several ways to make allowances f o r the eliminated classes. One way i s to compute some measure of the c o n f i -dence of any decision made and i f the confidence of a dec i s i o n i s low allow extra measurements to be taken or allow a l t e r n a t i v e paths through the d e c i s i o n tree. Another way i s to re-introduce previously e l i m i n -ated classes at deeper l e v e l s of the tree. The four major problems i n the design and implementation of the proposed c l a s s i f i e r are: (1) the design of an algorithm f o r c l u s t e r i n g classes and hence b u i l d i n g the tree c l a s s i f i e r under the constraints of reasonable error rates and low storage. 
( 2 ) the de r i v a t i o n of us e f u l measures of confidence at each stage of the decision process ( 3 ) the development of rules f o r changing decisions (4) the development of methods f or reintroducing classes at '\"aspsr lav^ lo cL Sa3 tree„ 100 deeper l e v e l s of the tree. 6.6. Discussion of \"optimal\" tree I t i s d i f f i c u l t to define an optimal tree c l a s s i f i e r because of the d i f f e r e n t constraining f a c t o r s . Our objectives include minimi-zation of error p r o b a b i l i t y , keeping storage requirements low, and using few measurements to i d e n t i f y any unknown pattern. Some com-promise d e f i n i t i o n of \"best\" c l a s s i f i c a t i o n scheme i s necessary, as the d e f i n i t i o n depends on the requirements of the s p e c i f i c problem being considered. We consider a m u l t i c l a s s problem with some defined group of feature subsets. We wish to f i n d a good representation of the pro-posed tree c l a s s i f i e r . At the f i r s t l e v e l we would have to examine a l l possible p a r t i t i o n s of the classes (clusters) for each possible feature subset. Then, f o r each of the r e s u l t i n g c l u s t e r s we would have to consider the p a r t i t i o n i n g s r e s u l t i n g from using a l l possible feature subsets. The same procedure would have to be c a r r i e d out on a l l c l u s t e r s at a l l l e v e l s u n t i l a l l remaining c l u s t e r s had one member. Then each possible tree would have to be evaluated on a set of t e s t i n g data. Then,easing some c r i t e r i o n which considers the error p r o b a b i l i t y , the average number of measurements required per pattern, and the storage requirements of the c l a s s i f i e r , the best tree could be chosen. The process for i n v e s t i g a t i n g the trees i s conceptually quite simple but computationally too demanding. One possible s i m p l i f y i n g approach i s to consider adjacent l e v e l s of the tree to be independent of each other. That i s we neglect the e f f e c t (on error p r o b a b i l i t y and storage) that the choosing of a 101 s p e c i f i c feature at one l e v e l has on the choice of feature at the following l e v e l . Then the obvious approach to b u i l d i n g the tree c l a s s i f i e r i s to examine each feature f o r c l u s t e r i n g at a given node of the tree, compute the error p r o b a b i l i t y of the r e s u l t i n g c l u s t e r s , compute storage requirements, and choose that feature and associated c l u s t e r i n g which i s best (unfortunately \"best\" i s some h e u r i s t i c l e f t to the researcher). This procedure i s repeated f o r a l l nodes of the tree. This method i s s t i l l computationally demanding. I t requires the determination of error p r o b a b i l i t y of a l l possible groupings of classes f o r each feature subset at each node of the tree. 6.7 Algorithm f o r c l u s t e r i n g Instead of examining a l l possible p a r t i t i o n i n g s of the classes for a given feature i t i s computationally more f e a s i b l e to define a h e u r i s t i c r u l e for c l u s t e r i n g . In t h i s case the idea i s to determine for each feature subset the distances between a l l possible p a i r s of class c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n functions and then b u i l d up cl u s t e r s from a l l those classes which have an overlap greater than some threshold. 
The actual algorithm i s presented toward the end of t h i s section. Also, instead of examining the clus t e r i n g s of a l l possible feature subsets f o r a given set of classes, i t i s proposed to examine only some of them. This can be done by ordering the feature subsets with some evaluation c r i t e r i o n and then examining some predefined number of the better features f o r c l u s t e r i n g . It i s not clear which type of evaluation measure should be 102 used for ordering the features. At the i n i t i a l l e v e l s of the tree when many classes are being considered i t i s important to f i n d features which c l u s t e r the classes into d i s t i n c t groups. At deeper l e v e l s when only a few classes are i n any node, features which discriminate w e l l are d e s i r a b l e . Discrimination measures such as dependence measures, or ex-pected values of pairwise distance measures, are good f o r choosing feature subsets which separate classes. When only a few classes are i n a c l u s t e r i t would be l o g i c a l to use d i s c r i m i n a t i o n measures to deter-mine a feature subset. However, when many classes are being considered and we wish to c l u s t e r them in t o a small number of sub c l u s t e r s a d i s c r i m i n a t i o n measure might not work. I t might be better to use a pre-evaluation measure such as entropy which indicates the peakiness of the unconditional d i s t r i b u t i o n s . Peakiness can be used as an i n d i c a t o r of regions of c l u s t e r i n g . As a compromise we could rank feature- subsets then examine many of them for c l u s t e r i n g effectiveness at i n i t i a l l e v e l s of the tree and examine a smaller number of them for effectiveness' i n d i s c r i m i n a t i o n at deeper l e v e l s of the tree. A l t e r n a t i v e l y , we could use d i f f e r e n t evaluation c r i t e r i a depending upon the number of classes i n the node being considered. To measure the effectiveness of any c l u s t e r i n g for a feature, i t i s u s e f u l to compute some figu r e of merit. The most obvious choice i s the error p r o b a b i l i t y . To do t h i s properly we would have to use the i m p r a c t i c a l approach of using t e s t i n g data for each c l u s t e r i n g . As a compromise the p r o b a b i l i t y d i s t r i b u t i o n s of the c l u s t e r s are computed from the 103 estimated i n d i v i d u a l class c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n s . The following quantity i s computed. L 1 - I\" max P(JL ,..U ), k=l i fc 1 or 1 L l 1 - _ P(X,/iU.) for equal a p r i o r i t k=l 1 d i s t r i b u t i o n s . X i s a feature, lU i s a c l u s t e r , t i s the number of c l u s t e r s . The expression i s the value of error p r o b a b i l i t y when exact d i s t r i b u t i o n s are known. Because d i s t r i b u t i o n s were estimated t h i s measure i s not exactly the true error p r o b a b i l i t y but i t i s close enough to serve as a guide for the separation of the c l u s t e r s . The actual c l u s t e r i n g algorithm i s now stated. 1. I n i t i a l i z e : depth of search, evaluation c r i t e r i o n , overlap threshold, distance measure for pairwise d i s c r i m i n a t i o n . 2. Rank a l l feature subsets f o r the classes being considered using the evaluation measure.. Set n = 1. 3. 
Using the n ' f h h ranked feature compute pairwise distances between a l l classes considered (the distances are between the class c o n d i t i o n a l density functions). 4. Choose the p a i r of classes which are separated by the smallest 104 p a i r w i s e d istance ( i . e . maximum overlap of P.D.F.'s). 5. Determine a l l c l a s s e s which have overlap greater than some t h r e s h -o l d w i t h t h i s p a i r . 6. For each of these c l a s s e s , determine c l a s s e s which have overlap greater than the same th r e s h o l d . 7. Repeat (6) u n t i l a l l a s s o c i a t e d overlapping c l a s s e s are taken care of. The t o t a l group of c l a s s e s forms a c l u s t e r . 8. Consider the next most overlapping p a i r of c l a s s e s . I f i t has been included i n the previous c l u s t e r , take the next most overlapping p a i r . Search u n t i l an unused p a i r of c l a s s e s i s found. 9. Repeat ( 5 ) , ( 6 ) , ( 7 ) , (8) u n t i l a l l c l a s s e s are taken care of. 10. Compute the d i s t r i b u t i o n s f o r the r e s u l t i n g c l u s t e r s and compute the f i g u r e of merit f o r these d i s t r i b u t i o n s . 11. Set n=n + 1 . I f n i s l e s s than or equal to depth of search go to (3) and continue. Otherwise go to (12). 12. Examine the r e s u l t s f o r the f e a t u r e sets used. I f no c l u s t e r i n g r e s u l t e d , decrease the overlap t h r e s h o l d and set n=l. Go to 3. I f c l u s t e r i n g occurred, choose the featuree andtthe c l u s t e r i n g which gave the best value f o r f i g u r e of m e r i t . Define the c l u s t e r i n g as a node of the t r e e . 13. Reset threshold (or increase i t ) . ' 14. For each c l u s t e r i n the t r e e having more than one member, repeat procedure (2) - (12)., 15. Continue u n t i l t r e e terminates. The c l u s t e r i n g technique used i n steps (5)-(9) i s commonly c a l l e d clumping [72]. I n t h i s technique the two c l o s e s t v e c t o r s ( i n t h i s case v e c t o r s are the d i s t r i b u t i o n s ) are chosen as a s t a r t i n g p o i n t 105 or a clump of vectors. Other vectors are added or not added to the clump dependent upon t h e i r closeness to the p a i r or t h e i r closeness to the mean of the p a i r . In the algorithm used the d i s t r i b u t i o n s of the intermediate cl u s t e r s are not computed as the process searches f o r c l u s t e r s . The d i s t r i b u t i o n s of the i n d i v i d u a l c l u s t e r members are used to f i n d new members. A possible a l t e r n a t i v e procedure i s to compute the d i s t r i -bution of the f i r s t clump, f i n d the nearest c l a s s , recompute the d i s -t r i b u t i o n and proceed t i n the same manner, searching f o r c l u s t e r members and recomputing c l u s t e r density functions as new members are added. The problem with the l a t t e r approach i s that as each class i s added, distances between c l u s t e r ( s ) and remaining classes have to be r e c a l -culated. siderable overlap between classes. With good separation the r e s u l t s w i l l be the same. The f i r s t method was chosen f o r i n v e s t i g a t i o n . To obtain a f e e l f o r the c l u s t e r i n g algorithm, an example i s presented. Refer to figure 6.3. 1 . Feature X has been chosen as a reasonable feature f o r a 6 c l a s s problem. 2. Pairwise distances between classes are computed using the measure The methods can give d i f f e r i n g r e s u l t s when there i s con-L. 3. 
4. The p r o b a b i l i t y d i s t r i b u t i o n s f o r classes 2 and 3 are clo s e s t , having the most overlap. The distances of remaining classes from class 2 are now considered. 106 Figure 6.3 Clustering-example 107 The only p o s s i b i l i t y i s c l a s s 4. I t i s not very c l o s e so i t i s el i m i n a t e d . However c l a s s 4 i s q u i t e c l o s e to c l a s s 3. An i n t e r -mediate c l u s t e r c o n s i s t i n g of cl a s s e s 2, 3, 4 i s defined. 5. An examination of c l a s s 4 shows that none of the remaining c l a s s e s overlap much w i t h i t . 6. The f i r s t f i n a l c l u s t e r c o n s i s t s of c l a s s e s 2, 3, 4. 7. P a i r w i s e d i s t a n c e s between c l a s s e s 1, 5, 6 are examined. Classes 5, 6 are most overlapping. 8. Class 1 remains and forms a c l u s t e r by i t s e l f . The f i n a l three c l u s t e r s are l U l = { C l } i u 2 = { c 2 , c 3 , c 4 } i n 3 = { c 5 , c 6 } t Note that by using a lower t h r e s h o l d both c l a s s 5 and c l a s s 6 could be made i n t o i n d i v i d u a l c l u s t e r s . The t r e e c l a s s i f i e r which r e s u l t s from the c l u s t e r i n g a l -gorithm i s a t t r a c t i v e because the storage r e q u i r e d f o r the implemen-t a t i o n i s q u i t e s m a l l . This i s so because we s t o r e the p r o b a b i l i t i e s of the c l u s t e r s , not the i n d i v i d u a l c l a s s c o n d i t i o n a l p r o b a b i l i t y d i s t r i b u t i o n s . A l s o because the featuresmmeasuredaateeachl'levelaare c l a s s dependent they e f f e c t i v e l y \"zero i n \" on a p a t t e r n as more measurements are made. This leads to e f f i c i e n t use of measurements. However, i f poor fe a t u r e s are used a t any l e v e l , e r r o r p r o b a b i l i t y goes up q u i c k l y . Therefore i t i s e s s e n t i a l to use good features at any node of the t r e e . A l s o , once a d e c i s i o n i s made at any l e v e l some c l a s s e s are e l i m i n a t e d from c o n s i d e r a t i o n . Because of the 108 p r o b a b i l i s t i c nature of the measurements, there w i l l always be err o r s . Hence there i s a need f o r a method of a l t e r i n g decisions or making them tentative instead of f i n a l . One way of doing t h i s i s to compute a measure of confidence i n each de c i s i o n made. By keeping track of the dec i s i o n path and the confidences we can determine i f there i s doubt i n any of the decisions. A backing up procedure can be implemented so that a l t e r n a t i v e routes through the tree can be used. A second way i s to compute the confidence of a tentative decision, and, i f confidence i s low take extra measurements u n t i l there i s a high confidence i n the dec i s i o n being made. A t h i r d way of lessening the class e l i m i n a t i o n problem i s to re-introduce discarded classes deeper i n the tree. Clusterings at any l e v e l w i l l have some overlap and hence some p r o b a b i l i t y of error. By i introducing the most l i k e l y error causing classes again deeper i n the tree t h i s error p r o b a b i l i t y can conceivably be decreased. 6.8 6.8 Derivation of Measured Confidence For each tentative decision made as we follow a path through the tree, some i n d i c a t i o n of confidence of the dec i s i o n made i s required. This should be a si n g l e number. The number should r e f l e c t the p r o b a b i l i t y of making an error when the tentative d e c i s i o n i s made. 
The number depends on: (1) the value of the measurement i . e . X, (k indicates the l e v e l of the tree) 109 ( 2 ) the tentative d e c i s i o n made, (we use Bayes r u l e and choose) max { P(X k,U )•}. j U^j represents the j th c l u s t e r discriminated by X^. We reserve C. for the j t h c l a s s ) . J (3) the \"amount\" of confusion between the c l u s t e r chosen U, . k i and a l l other c l u s t e r s at the stage considered {U, . kj j = 1 ... t, j * i } ( a c t u a l l y confusion between Y(X^,U^/) and a l l P ^ , ) ). A l t e r n a t i v e l y we c a l l i t the confidence of choosing ^ ( X ^ J u j ^ ) • Using Bayes r u l e we choose c l u s t e r U, . i f k i p ( u k i ) p(vu k i) * *, 0 . ( 2 ) f o r a l l c l u s t e r s equally l i k e l y i . e . p(X k>U k^) = P(X k,U k j) for a l l i , j e = 0 . 110 (3) for no overlap, ( i . e . P C X ^ / U ^ ) the only non zero p r o b a b i l i t y ) e = e max The postulated e can be normalized to e = e c £MAX A s i m p l i f i c a t i o n i s to consider only the two most l i k e l y c l u s t e r s f o r a given measurement. Represent these as and where i s the most l i k e l y c l u s t e r . Then e = [ P C V V \" I ^ V k i ) + P V k n ) ) ] « k [P(Xk'Uk±> - P ( X k ' \\ n ) ] e c - 1 - P ( Xk*&h> P ( X k ' U k i > It can be seen that the s i m p l i f i c a t i o n i s e s s e n t i a l l y the l i k e l i h o o d r a t i o between the two best decisions. Any monotonic function of the l i k e l i h o o d i s equivalent. Hence log P C X , , ! k i i s a possible choice P ( Xk' Ukn> 6.9 Derivation of Rules for Tentative decisions A method of making decisions r e v e r s i b l e i s to allow back-tracking. The idea i s as follows: as measurements are taken and decisions are made, the confidence of each decision i s computed and I l l stored. Also the node at which the decision was made i s stored. A tentative f i n a l decision i s made and i f any of the intermediate decisions have a low confidence, the process goes to the node having the f i r s t low confidence, takes the second most l i k e l y path and makes a second tenta-t i v e decision. Then a feature i s measured which discriminates between the two p o t e n t i a l classes. The more l i k e l y class i s chosen. When good features are used i n the tree, most unknown patterns w i l l be c l a s s i f i e d with high confidence. Only a few w i l l require the use of backtracking. Hence, the average number of measurements r e -quired w i l l not increase too much. The cost of using t h i s method i s that extra d i s t r i b u t i o n s w i l l have to be stored to allow the painwise dis c r i m i n a t i o n . I t should be noted that not a l l possible pairwise discriminations have to be considered, that only those pairs of classes which are confused have to be stored. Another method using confidences i s to take measurements at each node u n t i l the confidence of a decision i s high. The procedure i s to takee a measurement, compute the confidence, take more measurements i f necessary, and then proceed. This l a t t e r procedure i s i n t u i t i v e l y appealing but i n p r a c t i c e there i s a problem. We have to compute the j o i n t d i s t r i b u t i o n s of the d i f f e r e n t p o t e n t i a l measurements for the cl u s t e r s involved. 
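The simplified confidence reduces to comparing the two most likely clusters at the current node, and the backtracking strategy only needs the decision path recorded alongside those confidences. A minimal sketch; the node names, probabilities, and the 0.2 confidence cut-off are illustrative, not values used in the experiments.

def node_decision(joint_probs):
    # joint_probs[j] = P(X_k, U_kj) for the clusters at the current node.
    # Returns (chosen cluster, runner-up, confidence), with
    # confidence = 1 - P(X_k, U_kn) / P(X_k, U_ki) as in Section 6.8;
    # the log of the same ratio is an equivalent monotone alternative.
    order = sorted(range(len(joint_probs)), key=lambda j: joint_probs[j], reverse=True)
    best, second = order[0], order[1]
    conf = 1.0 - joint_probs[second] / joint_probs[best] if joint_probs[best] > 0 else 0.0
    return best, second, conf

# Backtracking bookkeeping: record node, decision, runner-up and confidence,
# so a low-confidence step can later be revisited via its runner-up branch.
path = [("node0",) + node_decision([0.08, 0.05, 0.01]),
        ("node1",) + node_decision([0.040, 0.038])]
doubtful = [step for step in path if step[3] < 0.2]
print(path, doubtful)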
I f s t a t i s t i c a l independence f or i n d i v i d u a l class c o n d i t i o n a l density functions i s assumed, then we cannot assume that the c l u s t e r d i s t r i b u t i o n s w i l l be s t a t i s t i c a l l y independent. In f a c t t h i s i s true only i f there are no separate c l u s t e r s at the l e v e l being considered. However, the assumption can be invoked to save storage, at the cost of s t a t i s t i c a l honesty. 112 6.10 Re-iritrbductiori of Classes -When a c l u s t e r i n g of d i s t r i b u t i o n s i s c a r r i e d out at a l e v e l , some classes w i l l overlap and hence w i l l be prone to m i s c l a s s i f i c a t i o n . The troublesome classes can be determined during the c l u s t e r i n g process, These classes are the ones that are j u s t s l i g h t l y too f a r from the classes i n a c l u s t e r to be considered members (membership being de-pendent upon a c l u s t e r i n g threshold) but close enough to be confusing. An algorithm can be devised to re-incorporate the offending classes at the next l e v e l . The idea i s : Using some \"best\" feature, c l u s t e r M classes i n t o t d i s j o i n t groups. Define these as U.. =...{U^ /i = l , t ) W h 6 r e V U. = {M}, i = i 1 H D . = {0}. i-1 1 For each U, determine the classes from the other c l u s t e r s which are k close to c l u s t e r U, k Define these c l u s t e r s as P ={F^/± = l , t } t where Q P = {M}, i = l i but li\\ P. ^ {0} • 1=1 At the next l e v e l use the classes from P or from a mixture of P and U 113 The re-addition of classes procedure must be used with caution as indiscriminate use can be disastrous. If too many classes are i n t r o -duced and the procedure i s repeated at d i f f e r e n t l e v e l s , there w i l l not be much of a reduction i n the number of classes which are to be clustered. The r e s u l t would be a spreading of the tree to an uncomfortable s i z e and growth of the tree to an awesome depth. Hence, there should be r e s -t r i c t i o n s as to the number of classes added to a c l u s t e r ; there should be r e s t r i c t i o n s as to the number of l e v e l s of the tree the procedure i s applied to; there should be r e s t r i c t i o n s i n the c l u s t e r s at a l e v e l to which the procedure i s applied. This procedure i s u s e f u l i n an i n t e r a c t i v e environment. The experimenter i n i t i a l l y b u i l d s a tree, then i n t u i t i v e l y decides where the classes can be introduced. The new tree i s constructed and tested. The i t e r a t i v e procedure can be c a r r i e d out u n t i l a reasonable structure i s arrived at. 6.11 Conclusion In t h i s chapter a f a i r l y general tree type c l a s s i f i c a t i o n scheme has been explored. The c l a s s i f i e r makes use of c l a s s dependent measurements i n an e f f o r t to use features e f f i c i e n t l y . The amount of storage to implement the basic scheme i s f a i r l y low. A method of constructing the tree using distance measures i s described. The idea of confidence i s a decision i s explored. Methods for reducing error rate by taking extra measurements, by using back-tracking, by reintroducing classes have been examined. In the next chapter experiments to evaluate the ideas pre-sented are described. 1 CHAPTER V I I EXPERIMENTS 7.1 I n t r o d u c t i o n In t h i s chapter v a r i o u s experiments are described. 
6.11 Conclusion

In this chapter a fairly general tree type classification scheme has been explored. The classifier makes use of class dependent measurements in an effort to use features efficiently. The amount of storage required to implement the basic scheme is fairly low. A method of constructing the tree using distance measures is described. The idea of confidence in a decision is explored. Methods for reducing the error rate by taking extra measurements, by using backtracking, and by re-introducing classes have been examined. In the next chapter, experiments to evaluate the ideas presented are described.

CHAPTER VII
EXPERIMENTS

7.1 Introduction

In this chapter various experiments are described. These experiments were performed to investigate the different parameters affecting the clustering and to investigate the application of the different heuristic rules for increasing the recognition rate. The experiments were performed using the Cornell typewritten data set [75]. The preprocessing and feature extraction are described first, followed by the various experiments.

7.2 Data Base, Preprocessing and Feature Extraction

The Cornell data base consists of typewritten alphanumeric characters recorded in standard data format on magnetic tape. The data base was \"designed for use in character recognition research for improved automated mail address reading\" [75]. The patterns are stored as 24 x 24 binary arrays (no gray levels). Several different fonts are stored and varying degrees of degradation are allowed. The characters, numerals, and punctuation marks were scanned from typewritten addresses or dead letter mail and from a wide variety of type font samples supplied by the U.S. Post Office to Cornell Aeronautical Lab. Further samples were generated by applying the CAL Synthesis and Manipulation Package to the scanned samples. The generated set simulates different scalings, degradations and mis-orientations of the characters. The resulting data set has over 100,000 typewritten symbols consisting of numerals, alphabetic characters, and punctuation marks.

The data base was preprocessed using an algorithm developed by Hussain, Toussaint and Donaldson [61]. Each 24 x 24 array was transformed into a 20 x 20 array during the preprocessing. After the preprocessing, 43 features were extracted from the 20 x 20 array. The first 25 features were defined by dividing the array into 25 non-overlapping 4 x 4 regions; the number of black points in the i'th region (i = 1, 2, ..., 25) defined the feature X_i. This feature definition was used previously by Hussain, Toussaint, and Donaldson [61]. Also, 18 extra features, defined by various 16 point regions in the 20 x 20 array, were extracted. These are shown in figure 7.1. Again, the number of black points in the region gives the value of the feature. These features were chosen because, intuitively, they are indicators of the presence or absence of certain parts in a character. For example, feature 29 indicates the presence or absence of an opening in the right hand side of an upper case alphabetic character. An indicator such as this would be useful for dichotomizing the alphabetic characters into two clusters.

Figure 7.1 Features used for recognition experiments
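To make the zone features concrete, here is a small Python sketch, not part of the thesis, that computes the first 25 features as black point counts over the 25 non-overlapping 4 x 4 regions of a preprocessed 20 x 20 binary array; the array representation and the region ordering are assumptions, and the 18 additional region features would be computed analogously from lists of the coordinates shown in figure 7.1.

```python
import numpy as np

def zone_features(img20):
    """First 25 features: black point counts over the 25 non-overlapping
    4 x 4 regions of a 20 x 20 binary character array (values 0 or 1)."""
    img = np.asarray(img20).reshape(20, 20)
    feats = []
    for r in range(0, 20, 4):
        for c in range(0, 20, 4):
            feats.append(int(img[r:r + 4, c:c + 4].sum()))
    return feats  # X_1 ... X_25, each between 0 and 16

# The 18 extra features are counts over hand-chosen 16 point regions; given a
# list of (row, col) coordinates per region they are computed the same way,
# as sum(img[r, c] for r, c in region).
```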
7.3 Estimation of Probability Distribution Functions

An important problem in pattern recognition is deciding how to use a finite set of samples to design a classification scheme and then evaluate its performance. The more samples there are available for training, the better is the estimation of the parameters of the classifier, and the more samples there are available for testing, the better is the estimate of error probability and other parameters. The problem is the dividing of the available data into training and testing sets. An excellent bibliographical survey of the various solutions is given by Toussaint [76].

For the experiments in this section of the thesis the rotation method proposed by Toussaint and Donaldson [61] was used. The N data samples are divided into J disjoint sets, each containing N/J samples. Then (J-1) of the sets are used to estimate the distributions for the classification scheme, and the remaining set is used as the testing set. The procedure is repeated for all J rotations. The system parameters, such as error probability or storage, are averaged over all rotations to obtain a mean estimate of system performance. The data sets used were divided into 3 disjoint sets and the rotation method was applied for the final recognition experiments.
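The rotation method can be sketched as follows in Python; this is an illustration under the assumption that the samples and labels are NumPy arrays and that train_fn and test_fn are placeholders for building and scoring a classifier, not the software actually used in the thesis.

```python
import numpy as np

def rotation_estimate(samples, labels, train_fn, test_fn, J=3, seed=0):
    """Rotation method: split the data into J disjoint sets, train on J - 1
    of them, test on the held-out set, and average the test statistic over
    all J rotations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    folds = np.array_split(idx, J)
    results = []
    for j in range(J):
        test_idx = folds[j]
        train_idx = np.concatenate([folds[i] for i in range(J) if i != j])
        model = train_fn(samples[train_idx], labels[train_idx])
        results.append(test_fn(model, samples[test_idx], labels[test_idx]))
    return float(np.mean(results))
```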
7.4 Experiments

In this section various experiments are described.

7.4.1 Experiment 1

The purpose of this experiment was to determine what kind of evaluation criterion would be useful for finding features which partition many classes into clusters. The clustering algorithm previously described was applied to all 43 features for a single font (IBM Executive), using the twenty-six upper case alphabetic characters as classes. Clusterings were obtained for all features, the figure of merit defined below was computed for each clustering, and the features were then ranked using the figure of merit.

The Kolmogorov variational distance, or pairwise error probability, was used to compute the pairwise overlap of classes and hence determine the clusterings. For each feature an iterative procedure was used to determine the \"best\" clustering. The overlap threshold was initially set very high, so that any classes with any overlap at all were clustered. If a partitioning of the classes occurred for the high threshold, it was defined as the best clustering. If no partitioning occurred, the threshold was decreased and the clustering repeated. This process was repeated until a partitioning of the classes occurred, so that clusterings with minimum overlap could be determined.

For each feature a figure of merit was computed. The measure was

F = 1 - (1/L) Σ_{k=1}^{L} max_i P(X_k / C_i).

This measure assumes that the distributions of the clusters are equally weighted; this was done because the clustering algorithm implicitly does this itself. The features were ranked using this figure of merit.

Various feature evaluation criteria were then computed for each feature. The information measures computed were the mutual information I(X,C), the Kolmogorov dependence K_D(X,C), and the Bhattacharyya dependence B_D(X,C). The pre-evaluation measures computed were the quadratic peakiness Σ_{k=1}^{L} P(X_k)^2, the entropy -Σ_{k=1}^{L} P(X_k) log P(X_k), and the Minkowski metric Σ_{k=1}^{L} |P(X_k) - 1/L|. Features were ranked using these measures.

To get a quantitative measure of the similarity of the rankings, the Spearman rank correlation coefficient was computed between each pair of rankings. The results are shown in Table 7.1. The figures of interest are the correlations between the ranking obtained using the figure of merit F and the rankings obtained using the different evaluation criteria. A correlation of 1.0 indicates that two rankings are the same.

Table 7.1 Spearman rank correlation coefficients between feature orderings

Hence we see that none of the evaluation criteria rank features in a manner like the figure of merit when all 26 classes are considered. We can conclude that it is necessary to examine many features for clusterings when many classes are being considered. This procedure was followed in the classifier building process. In fact, features were ordered with the Kolmogorov dependence, but a large depth of search was allowed (i.e. many features were examined). This was found to be satisfactory. Also, when few classes were being considered, the best feature turned out to be one of the better features as determined by the evaluation criteria.
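For reference, a short Python sketch, not from the thesis, of the three pre-evaluation measures as written above and of the Spearman comparison of two feature orderings; the example rankings are hypothetical and SciPy is assumed to be available.

```python
import numpy as np
from scipy.stats import spearmanr

def pre_evaluation_measures(p):
    """Peakiness-type measures of a feature's marginal distribution P(X_k),
    k = 1..L, where p is a normalized histogram over the L feature values."""
    p = np.asarray(p, dtype=float)
    L = len(p)
    quadratic_peakiness = float(np.sum(p ** 2))
    entropy = float(-np.sum(p[p > 0] * np.log(p[p > 0])))  # natural log
    minkowski = float(np.sum(np.abs(p - 1.0 / L)))
    return quadratic_peakiness, entropy, minkowski

# Agreement between two feature orderings (for example, the figure-of-merit
# ranking and a criterion's ranking) summarized by the Spearman coefficient;
# the rankings below are hypothetical.
rho, _ = spearmanr([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
```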
7.4.2 Experiment 2: Determination of Threshold

This experiment was performed to demonstrate the effects of varying the overlap threshold when constructing the tree classifier. The distributions and samples for a single font (IBM Executive) were used. The depth of search was set at 9 (i.e. the first nine features at any level are examined for clustering and the best is used). The initial threshold was varied to obtain different trees, and the thresholds at successive levels were incremented by a constant. Each resulting tree was tested with a small test set (600 samples) to get some indication of the error probability.

The error probability as a function of the initial threshold is plotted in figure 7.2. We see that the tree with a low initial threshold has a high error probability. As the threshold is increased there is a dramatic drop in the error probability, followed by a saturation effect. The explanation is that the low threshold corresponds to very loose clustering; that is, distributions having high overlap are still considered as separate clusters. As the threshold is increased, the natural groupings as determined by the features come out. If the threshold is made too high (the maximum is 2.0), then no partitioning of the data will result. From this experiment it was decided to use an initial threshold of between 1.25 and 2, with a provision for decreasing the threshold whenever no clustering results.

Figure 7.5 Average number of measurements vs. confidence threshold
Figure 7.6 Error probability vs. storage
Figure 7.7 Error probability vs. average number of measurements

For the example considered it appears that a confidence level of between 1.8 and 2.2 would be suitable. The exact value chosen would be determined by the storage available for the problem.

7.4.4 Experiment 4: Comparison of Tree Classifier and Independent Bayes Classifier

In this experiment the 26 upper case characters of a single font, specifically the IBM Executive font, were used in a recognition experiment to compare the tree classifier in its different configurations with an independent Bayes classifier. For each of the 3 rotations a tree classifier was built. The initial feature was set as feature 43. The overlap threshold was initialized at 1.4 and was made to increase by .05 at each level. For each rotation the best features for each pairwise discrimination were determined.

Each rotation of the classifier was tested with its associated testing data and the results were averaged to obtain average values for the parameters. The normal tree classifier was tested both with and without backtracking. Also, a modified tree classifier using the idea of re-introduction of classes was implemented. The re-introduction of classes was done in a purely heuristic way, making use of the author's intuition. Only selected nodes were used, in an attempt to make the re-introduction worthwhile, and care was taken not to cause extraordinary growth of the tree.

The results for the different trees are compared with the results for an independent Bayes classifier. An independent Bayes classifier is defined as a Bayes classifier in which the class conditional distributions of the features are assumed statistically independent; hence, joint distributions are not stored.
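A minimal Python sketch of such an independent Bayes decision rule is given below; it stores only the marginal tables P(X_i/C_j), one per class and feature (M x k distributions in all), as described in the text, while the function name, the dictionary representation, and the small floor used for unseen feature values are assumptions of this sketch.

```python
import numpy as np

def independent_bayes_decision(sample, class_priors, cond_tables):
    """Independent Bayes decision: class conditional feature distributions
    are assumed statistically independent, so only the marginal tables
    P(X_i | C_j) are stored (M x k distributions in total).

    sample       : sequence of k discrete feature values
    class_priors : dict mapping class label -> P(C_j)
    cond_tables  : dict mapping class label -> list of k dicts, each mapping
                   a feature value to P(X_i = value | C_j)   (assumed layout)
    """
    best_class, best_score = None, -np.inf
    for c, prior in class_priors.items():
        score = np.log(prior)
        for i, x in enumerate(sample):
            # floor unseen feature values so the log stays finite
            score += np.log(cond_tables[c][i].get(x, 1e-9))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```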
For the experiments the 43 features were ordered with the Kolmogorov dependence, and the error probability was determined for independent Bayes classifiers using various subsets of the ranked features. For a classifier with k features, the first k ranked features were used. These k features are not necessarily the optimal k features; to find the optimal k features is an exhaustive procedure involving the testing of (43 choose k) classifiers. An intuitively satisfying approach is to use the k best individual features.

The storage required for the independent Bayes classifier is M x k distributions, where M is the number of classes and k is the number of features. The storage used for the tree classifier can be converted to a number equivalent to the number of features of an independent Bayes classifier by dividing by M, the number of classes. For example, for a 26 class problem, a tree classifier requiring 52 distributions uses storage equal to 2 features. We see from figure 7.8 and