FEATURE EVALUATION CRITERIA AND CONTEXTUAL DECODING ALGORITHMS IN STATISTICAL PATTERN RECOGNITION

by

GODFRIED THEODORE PATRICK TOUSSAINT

B.Sc., University of Tulsa, 1968
M.A.Sc., University of British Columbia, 1970

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in the Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
AUGUST 1972

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of
The University of British Columbia
Vancouver 8, Canada

ABSTRACT

Several probabilistic distance, information, uncertainty, and overlap measures are proposed and considered as possible feature evaluation criteria for statistical pattern recognition problems. A theoretical and experimental analysis and comparison of these and other well known criteria are presented. A class of certainty measures is proposed that is well suited to pre-evaluation and intraset post-evaluation.
Two approaches to the M-class problem are considered: the well known expected value approach and a proposed generalized distance approach. In both approaches the main distance measures considered are Kullback's divergence, the Bhattacharyya coefficient, and Kolmogorov's variational distance. In addition, a generalization of the Bhattacharyya coefficient to the M-class problem is considered. Properties and inequalities are derived for the above measures. In addition, error bounds are derived whenever possible, and it is shown that some of these bounds are tighter than existing ones while others represent generalizations of well known bounds found in the literature. Feature ordering experiments were carried out to compare the above measures with each other and with the Bayes error probability criterion. Finally, some very general measures are proposed which reduce to several of the measures considered above as special cases, and error bounds are derived in terms of these measures.

A general class of algorithms is proposed for the efficient utilization of contextual dependencies among patterns, at the syntactic level, in the decision process of statistical pattern recognition systems. The algorithms are dependent on a set of parameters which determine the amount of contextual information to be used, how it is to be processed, and the computation and storage required. One of the most important parameters in question, referred to as the depth of search, determines the number of alternatives considered in the decoding process.
Two general types of algorithms are proposed in which the depth of search is pre-specified and remains fixed. The efficiency of these algorithms is increased by incorporating a threshold and effecting a variable depth of search that depends on the amount of noise present in the pattern in question. Several general types of algorithms incorporating the threshold are also proposed. In order to experimentally analyze, compare, and evaluate the various algorithms, the recognition of handprinted English text was simulated on a digital computer.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF ILLUSTRATIONS
LIST OF TABLES
ACKNOWLEDGEMENT

PART I  FEATURE EVALUATION CRITERIA

I. INTRODUCTION
   1.1 Feature Evaluation Criteria
   1.2 Previous Research
   1.3 Scope of the Thesis (Part I)
II. DATA, FEATURE EXTRACTION, AND PREPROCESSING
   2.1 Introduction
   2.2 Data Base, Preprocessing, and Feature Extraction
   2.3 Some Classification Experiments
   2.4 Results and Discussion
III. A CLASS OF CERTAINTY MEASURES FOR PRE-EVALUATION AND INTRASET POST-EVALUATION
   3.1 Introduction
   3.2 Certainty Measures and Feature Evaluation
   3.3 Inequalities
   3.4 Error Bounds
   3.5 Experiments and Results
   3.6 Discussion and Conclusions
IV. DISTANCE AND INFORMATION MEASURES FOR COMBINED-EVALUATION AND INTERSET POST-EVALUATION: THE EXPECTED VALUE APPROACH TO THE M-CLASS PROBLEM
   4.1 Introduction
   4.2 Information and Distance Measures as Feature Evaluation Criteria
      4.2.1 The Directed and Symmetric Divergence
      4.2.2 Four Proposed Feature Evaluation Criteria
   4.3 Properties and Inequalities
      4.3.1 Properties of the Expected Divergence
      4.3.2 Properties of the M-class Bhattacharyya Coefficient
      4.3.3 Some Inequalities Between the Expected Divergence and the M-class Bhattacharyya Coefficient
      4.3.4 A Generalization of the M-class Bhattacharyya Coefficient and its Relation to the Expected Divergence
      4.3.5 Lower Bounds on the Expected Divergence in Terms of Pairwise Distance Measures
      4.3.6 Lower Bounds on Error Probability in Terms of the Expected Divergence
      4.3.7 An Upper Bound on Error Probability in Terms of the M-class Bhattacharyya Coefficient
   4.4 Experiments and Results
   4.5 Discussion and Conclusions
V. THE GENERALIZED DISTANCE APPROACH TO THE M-CLASS PROBLEM
   5.1 Introduction
   5.2 Proposed Generalized Distance Measures
   5.3 Properties and Inequalities
   5.4 Error Bounds
   5.5 Experiments and Results
   5.6 Discussion and Conclusions
VI. PROBABILITY OF ERROR BOUNDS
   6.1 Introduction
   6.2 Upper Bounds on Error Probability in Terms of Distance and Information Measures for Feature Evaluation with Incorrectly Labelled Training Data: Two-Class Problem
   6.3 Upper Bounds on Error Probability Through Function Approximation: A Family of Distance Measures for the Two-Class Problem
   6.4 Upper Bounds on Error Probability for the M-class Problem: The Pairwise Distance Approach
   6.5 Upper Bounds on Error Probability for the M-class Problem: The Generalized Distance Approach
   6.6 Error Bounds for the M-class Problem: The Function Approximation Approach
VII. DISCUSSION AND CONCLUSIONS
   7.1 Feature Evaluation Criteria Revisited
   7.2 Suggestions for Further Research

PART II  CONTEXTUAL DECODING ALGORITHMS

VIII. INTRODUCTION
   8.1 Using Contextual Constraints in Pattern Classification
   8.2 Compound Decision Theory
   8.3 Previous Research
   8.4 Scope of the Thesis (Part II)
IX. CONTEXTUAL DECODING ALGORITHMS
   9.1 Introduction
   9.2 Definition of Parameters
   9.3 Algorithms with Fixed Depth of Search
   9.4 Algorithms with Variable Depth of Search Incorporating Threshold
   9.5 Approximating the A Priori Distributions
X. EXPERIMENTS AND RESULTS
   10.1 Data Base, Preprocessing, Feature Extraction, Formation of English Texts, and System Evaluation
   10.2 Ordering the Alternatives in the Decision Process
   10.3 Usefulness of Context as a Function of Initial System Performance
   10.4 Evaluation and Comparison of Contextual Decoding Algorithms with Fixed Depth of Search
   10.5 Evaluation and Comparison of Sequential Decoding Algorithms Incorporating Threshold
XI. CONCLUDING REMARKS
   11.1 Efficiency of the Sequential Decoding Algorithms
   11.2 Some General Comments
   11.3 Suggestions for Further Research
APPENDIX A
REFERENCES

LIST OF ILLUSTRATIONS

FIGURE
3.1 Upper bound on error probability as a function of M(c|x) for various values of M. The curves meet the ordinate at (M-1)/M. The dashed lines meet the abscissa at 2(M-1)/M.
3.2 Upper bound on error probability as a function of M^(c|x) for various values of M.
5.1 Comparison of the upper bounds on error probability as a function of ρ and ρ_G as the number of pattern classes varies.
6.1 Upper bound on error probability U(Q_0, β) as a function of Q_0 for various values of β. Dashed lines meet ordinate at (1-β) and abscissa at 2β(1-β).
9.1 Sequential decoding algorithm for ℓ = 2 of Type I-A with search mode I.
9.2 Sequential decoding algorithm for ℓ = 2 of Type I-B with search mode I.
9.3 Sequential decoding algorithm for ℓ = 1 of Type I-A with search mode II.
10.1 Probability of error as a function of depth of search for each of seven texts for block decoding with bigrams in which the alternatives are ordered with the likelihoods.
10.2 Probability of error as a function of depth of search for each of seven texts for block decoding with bigrams in which the alternatives are ordered with the weighted likelihoods.
10.3 Comparison of results obtained from likelihood ordering with those obtained from weighted likelihood ordering.
10.4 Probability of error P_e as a function of the number of features used when no contextual information is used.
10.5 Probability of error P_e as a function of depth of search for sequential decoding with bigrams in the look-ahead mode with 1, 8, 10, and 25 features.
10.6 Quiescent depth of search d^Q and optimum depth of search d^O as a function of error probability P_e without context for sequential decoding with bigrams in the look-ahead mode.
10.7 Percent decrease in P_e due to context as a function of P_e without context for sequential decoding with bigrams in the look-ahead mode of decision with d_n = d_{n+1} = 2 and with d_n = d_{n+1} = 26.
10.8 Comparison of block decoding and sequential decoding with respect to error probability P_e and computational complexity as a function of depth of search with bigram contextual constraints.
10.9 Probability of error P_e as a function of depth of search for sequential decoding with a bigram product approximation to trigrams in the look-ahead-look-back mode of decision.
10.10 Probability of error P_e as a function of depth of search for sequential decoding with trigrams in the look-ahead-look-back mode of decision.
10.11 Comparison of contextual decoding algorithms with fixed depth of search with regard to error probability P_e as a function of depth of search. Unless otherwise stated all algorithms incorporate bigram contextual constraints.
10.12 Probability of error P_e as a function of computational complexity for three cases of interest.
10.13 Probability of error P_e and average number of bigrams searched (ANBS) as a function of threshold t for sequential decoding of Type I with bigrams in the look-ahead mode of decision with search modes I and II.
10.14 Probability of error P_e and average number of bigrams searched (ANBS) as a function of threshold t for sequential decoding with bigrams in the look-ahead mode of decision of Type II.
10.15 Average depth of search values at time of decision as a function of the threshold t for sequential decoding with bigrams in the look-ahead mode of decision of Type II with d_n^D = 4 and d_{n+1}^D = 2.
10.16 Probability of error P_e and average number of bigrams searched (ANBS) as a function of threshold t for sequential decoding with bigrams in the look-ahead mode of decision of Type III with d_n^D = 4 and d_{n+1}^D = 2.
10.17 Average depth of search values at time of decision as a function of threshold t for sequential decoding with bigrams in the look-ahead mode of decision of Type III with d_n^D = 4, d_{n+1}^D = 2, φ_n = 2, and φ_{n+1} = 1.
10.18 Probability of error P_e as a function of φ_n and φ_{n+1} for sequential decoding with bigrams in the look-ahead mode of decision of Type III with d_n^D = 4, d_{n+1}^D = 2, and t = 0.4.
10.19 Average number of bigrams searched (ANBS) as a function of φ_n and φ_{n+1} for sequential decoding in the look-ahead mode of decision of Type III with d_n^D = 4, d_{n+1}^D = 2, and t = 0.4.
10.20 Comparison of the various sequential decoding algorithms with fixed depth of search and incorporating threshold, with respect to efficiency.

LIST OF TABLES

TABLE
2.1 Error probabilities in percent for test sets 1-7, inclusive - Experiment 1.
2.2 Error probabilities in percent for test sets 1-7, inclusive - Experiment 2.
2.3 Results of other experiments on Munson's multiauthor handprinted data set (in percent).
3.1 Spearman rank correlation coefficients r_s between feature orderings for (a) pre-evaluation measures and (b) intraset post-evaluation measures.
4.1 Spearman rank correlation coefficients between feature orderings with combined and post-evaluation criteria.
4.2 Spearman rank correlation coefficients between feature orderings with I and Q.
4.3 Feature orderings with I and Q.
5.1 Spearman rank correlation coefficients between feature orderings with generalized distance measures.
7.1 Summary of general results for measures of interset.

ACKNOWLEDGEMENT

I would like to express my appreciation to Professor Robert W. Donaldson for his generous counsel and encouragement, and for creating a very free atmosphere for research which allowed me to be adventurous and which also made the past few years a very stimulating experience. I am grateful to my colleagues Shahid Hussain, Toomas Vilmansen and Mabo Ito for many stimulating discussions which contributed significantly to the work presented in this thesis. I also wish to thank Dr. J.H. Munson of Stanford Research Institute for making his multiauthor handprinted data set available. I wish to thank Ms. Beverly Harasymchuk for an excellent job of typing the thesis and Ms. Norma Duggan for helping me with final corrections. I also wish to thank Mr. Al MacKenzie for preparing some of the illustrations.

Support of this study by two British Columbia Telephone Co. Graduate Scholarships, the National Research Council of Canada under Grant NRC A-3308, and the Defence Research Board of Canada under Grant DRB 2801-30 is gratefully acknowledged.

To Ninoska, a friend and companion.

PART I

FEATURE EVALUATION CRITERIA

CHAPTER I

INTRODUCTION

1.1 Feature Evaluation Criteria

In statistical pattern recognition problems, a set of measurements (features, variables, attributes, characteristics, components) is taken on a pattern sample and subsequently a decision is made as to what pattern class the unknown pattern sample belongs to. The process of taking these measurements, prior to classification, is usually referred to as feature extraction. Feature extraction presupposes that the set of measurements to be taken has been somehow previously decided upon. This initial choice as to what measurements are to be taken on a set of pattern data by the feature extractor constitutes the feature selection problem. Clearly, the feature selection problem involves two interrelated operations: search and evaluation. In other words, the "best" subset of features, according to some specified criterion, is found by searching over all the possible feature subsets, evaluating each subset with the specified criterion. For example, in applying dynamic programming with the Fisher return function [1], [2], dynamic programming refers to the search operation while the Fisher return is the evaluation criterion.
While in some pattern recognition problems feature selection has as its ultimate goal the selection of those features that minimize the error probability, in some problems other feature evaluation criteria may be of interest. For example, in some approaches the evaluation criteria measure a feature's capability to characterize the patterns to be recognized [3], [4], [5]. In other approaches it is desired to select those features from which the original pattern may be reconstructed [6]. Furthermore, in biometrics distance criteria are of interest as indices for the measurement of distance between populations. In some of these applications one is not interested primarily in classification, but rather in representation and analysis of the data, as in [7] and [8]. In addition, distance, information, and overlap measures are useful in other pattern recognition problems such as clustering [9], [10], [11], [12].

However, there are at least three reasons why feature evaluation criteria other than the error probability may be more desirable even when the ultimate goal is the minimization of error probability. First, an expression for the error probability of a particular classifier may not be available. For example, given a particular problem and finite data set, no expression is available for the error probability of a k-nearest-neighbour classifier.
Second, if it is desired to minimize the Bayes error probability and if the distributions are not known, an estimate of the error probability has to be made with training and testing data for each feature subset considered, and this usually requires too much computation. Finally, even if the distributions are known and the Bayes error probability is to be minimized, the expression contains a discontinuity in the form of max{·} or min{·} which might be undesirable from the computational point of view.

In this thesis, feature evaluation criteria are divided into three basic categories according to their structure. Let X = (x_1, x_2, ..., x_n) be a feature vector composed of n features, each of which can be either continuous or discrete. Let P(X) be the probability density function (PDF) describing the n features, and let P(X|C_i) be the ith-class-conditional PDF. When considering features in relation to either P(X) or P(X|C_i) it is convenient to refer to the features as either general-features or class-features, respectively. This notation was introduced by Barabash [13]. The three basic types of feature evaluation criteria are pre-evaluation,
Pre-evaluation i s done on the t o t a l c o l l e c t i o n of a v a i l a b l e pattern samples i r r e s p e c t i v e of pattern class a f f i l i a t i o n . Hence, the structure of pre-evaluation c r i t e r i a involves P(x) only and the feature s e l e c t i o n methods incorporating pre-evaluation s e l e c t general-features. Post-evaluation i s based on knowledge of class samples. Hence, the structure of post-evaluation involves the P ( x | c \ ) , i = 1,2,...,M, only (M = numbers of classes) and the feature s e l e c t i o n methods incorporating post-evaluation c r i t e r i a s e l e c t c l a s s -features. This d e f i n i t i o n d i f f e r s somewhat from that given by Watanabe, et a l . [ih] i n as much as the l a t t e r includes the combined-evaluation c r i t e r i a defined below. The structure of combined-evaluation c r i t e r i a involves both P(x) and P(x|C\), i = 1 ,2, ... ,M where i t i s understood that P(X) i s not written as a mixture density function. Post-evaluation c r i t e r i a are further divided i n t o two categories: i n t r a s e t post-evaluation c r i t e r i a and i n t e r s e t post-evaluation c r i t e r i a . In i n t r a s e t evaluation the feature effectiveness c r i t e r i o n i s a measure operating on a s i n g l e c l a s s - c o n d i t i o n a l PDF which i s averaged i n some manner over a l l pattern c l a s s e s . In i n t e r s e t evaluation the effectiveness c r i t e r i o n i s a measure operating between two c l a s s - c o n d i t i o n a l PDF's, and i s averaged i n some manner over a l l pattern class p a i r s . Needless to say, there are some feature evaluation c r i t e r i a which combine the notions of i n t r a s e t and i n t e r s e t c r i t e r i a and there are others which defy categorization i n t o the groups defined above. 
1.2 Previous Research

In the past, most research in feature evaluation has centered about post-evaluation, and, of that, more attention has been given to the interset evaluation problem. Furthermore, a great deal of the work considered only the two-class problem, and the Gaussian assumption was very frequently invoked. Needless to say, the work in this area, although abundant, has been limited. In many practical problems there are, invariably, more than two pattern classes, and the distributions are in many cases discrete, as in genetic populations. In addition, very few extensive studies have been made to compare different feature evaluation criteria. Mohn [15] compares two statistical feature evaluation criteria when applied to speaker identification. Mucciardi and Gose [16] compare seven feature evaluation criteria when applied to electrocardiogram classification. Chen [17] gives a theoretical comparison of two popular criteria. A good theoretical and experimental (crop classification) comparison of four feature selection approaches can be found in [18]. For the case of discrete distributions, experimental comparisons of distance measures applied to various biometric problems can be found in [19], [7], [20], [21] and [8].

In considering the multiclass problem, two approaches have been tried in the past: the minimax approach and the expected value approach. In both approaches the feature effectiveness criterion is calculated for all class pairs. However, in the former approach the minimum (distance) value over the class pairs is maximized, while in the latter the expected value is maximized.
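The two aggregation strategies just described can be sketched as follows. The class-conditional distributions are hypothetical, the symmetric divergence is used merely as a representative pairwise measure, and the prior-product weighting in the expected value is one common choice rather than the only one:

```python
import numpy as np
from itertools import combinations

def divergence(p, q):
    """Symmetric (Jeffreys) divergence J(p, q) for strictly positive PMFs."""
    return np.sum((p - q) * (np.log(p) - np.log(q)))

# Hypothetical class-conditional PMFs induced by one candidate feature.
classes = [np.array([0.6, 0.3, 0.1]),
           np.array([0.2, 0.5, 0.3]),
           np.array([0.1, 0.3, 0.6])]
priors = np.array([0.5, 0.3, 0.2])

pair_vals = {(i, j): divergence(classes[i], classes[j])
             for i, j in combinations(range(len(classes)), 2)}

# Minimax approach: score the feature by its worst-separated class pair;
# the search then keeps the feature maximizing this minimum.
minimax_score = min(pair_vals.values())

# Expected value approach: weight each pairwise value, here by the prior
# product P(C_i) P(C_j), and maximize the weighted sum instead.
expected_score = sum(priors[i] * priors[j] * v
                     for (i, j), v in pair_vals.items())
```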
The disadvantage of these criteria is that they are difficult to relate to error probability. In fact, no upper bounds on error probability are available in terms of the expected distance measures for the general distribution-free case.

Pre-selection methods in the past have been limited to using criteria based on the actual pattern samples, as in [3], [14], [22]-[26], and [4], [27]. No attempts have been made at using P(X), the unconditional distribution itself, which has certain advantages.

Intraset post-selection methods incorporating the distributions rather than the pattern samples have been severely limited to using the Shannon entropy as a criterion [28]-[31]. However, recently Vajda [32] has proposed the quadratic entropy for use in pattern recognition applications.

Upper bounds on error probability are very useful in feature selection, and in the past attempts have always been made to relate feature evaluation criteria to error probability. However, while probability-of-error bounds are available for many criteria in the two-class problem with Gaussian distributions, they are less frequent for the two-class distribution-free case and severely lacking for the M-class distribution-free case.

1.3 Scope of the Thesis (Part I)

Part I of this thesis is a theoretical and experimental analysis and comparison of distance, information, and overlap measures, with the feature evaluation problem in mind. The standard criteria are considered and a host of others are proposed that belong to the three groups described in section 1.1.
In addition, several classes of measures that do not fall under the three categories are proposed. For all these measures, properties are derived, relations between the measures are obtained, and, whenever possible, error bounds are derived. Experiments were performed in order to determine the similarity of any two criteria.

In Chapter II the data, feature extraction process, and preprocessing algorithms are described that were used to obtain the features for the feature ordering experiments. Some classification experiments were also performed in order to evaluate the possibility of using the scheme for handprinted character recognition. The results show that the scheme is indeed promising.

In Chapter III, a class of certainty measures is proposed for pre-evaluation and intraset post-evaluation. The problem of pre-evaluation is discussed for real-world problems in terms of the proposed measures, which are functions of P(X) itself rather than the pattern samples. Inequalities are derived between the proposed certainty measures and Shannon's entropies. Upper and lower bounds on the measures are derived in terms of the Bayes error probability, and it is shown for a special case that the upper bound is tighter than Shannon's equivocation bound. Experiments are described which were performed in order to compare the proposed measures with Shannon's entropies with respect to ordering a set of features.

In Chapter IV, distance and information measures are considered for combined-evaluation and interset post-evaluation with emphasis on the expected value approach to the M-class problem. First the classical distance measures are briefly reviewed.
An important distinction is made between various measures of directed and symmetric divergence, and four measures are proposed as feature evaluation criteria. Several important properties of the expected divergence and the M-class Bhattacharyya coefficient are derived. Also, two inequalities between the expected divergence and the M-class Bhattacharyya coefficient are derived. A generalization of the M-class Bhattacharyya coefficient is proposed and its relation to the expected divergence is discussed. Several lower bounds on the expected divergence are derived in terms of pairwise distance measures. Lower bounds on error probability in terms of the expected divergence are given for the case of equiprobable pattern classes. An upper bound on error probability in terms of the M-class Bhattacharyya coefficient is derived for the general, distribution-free case, and it is shown that for two equiprobable classes the bound reduces to a well known bound found in the literature. Methods for obtaining further error bounds for the case of Gaussian distributions are outlined. Experiments are described which were performed in order to analyse the various classical measures and to compare the proposed measures with the classical measures.

In Chapter V a multiclass interset feature evaluation criterion is derived from an information-theoretic point of view. The criterion not only has several desirable properties not possessed by the expected divergence, but it is, in fact, a generalization of the divergence to the M-class problem. The generalization of the divergence suggests similar generalizations of all other probabilistic distance measures.
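As an aside, the two-class bound to which such M-class results reduce is easy to check numerically. With priors P_1, P_2 and the Bhattacharyya coefficient ρ = Σ_x √(p_1(x) p_2(x)), a standard result is P_e ≤ √(P_1 P_2) ρ, which for equiprobable classes is ρ/2. A minimal sketch with hypothetical distributions (note the min{·} form of the exact Bayes error, the discontinuity mentioned in section 1.1):

```python
import numpy as np

p1 = np.array([0.7, 0.2, 0.1])   # P(x | C_1), hypothetical
p2 = np.array([0.2, 0.3, 0.5])   # P(x | C_2), hypothetical
P1, P2 = 0.5, 0.5                # equiprobable priors

# Bhattacharyya coefficient (overlap) between the two PMFs.
rho = np.sum(np.sqrt(p1 * p2))

# Exact Bayes error for this discrete problem: at each cell the smaller
# of the two prior-weighted densities is the probability of error there.
bayes_error = np.sum(np.minimum(P1 * p1, P2 * p2))

# The bound follows from min(a, b) <= sqrt(a * b) applied cell by cell.
bound = np.sqrt(P1 * P2) * rho   # equals rho / 2 for equal priors
assert bayes_error <= bound
```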
Some of the more important measures are considered in detail. Some of their properties are listed and several inequalities between the measures are presented. Upper bounds on error probability are derived for some of the measures for the general distribution-free case, and it is shown that for the two equiprobable class case these bounds reduce to well known bounds found in the literature. It is also shown that for the M-class problem these bounds are tighter than existing bounds in terms of the pairwise Bhattacharyya coefficients as the value of M increases. Experiments are described which were performed in order to compare the generalized measures with each other and with the expected value measures.

In Chapter VI error bounds are derived for various two-class and M-class measures of probabilistic distance, information, and overlap. Upper bounds on error probability in terms of these measures are derived for feature evaluation in which part of the training data is incorrectly labelled, for the two-class problem. It is shown that the bounds are tighter than the available bounds in the literature. A family of distance measures similar to the Bhattacharyya distance is proposed for the two-class problem. Upper bounds on error probability are derived for the general, distribution-free case, and it is shown that these bounds are tighter than the Bhattacharyya bounds. Some of the upper bounds derived in Chapter V concerning the generalized distance measures are generalized still further.
Upper bounds on error probability are derived in terms of pairwise generalized distances and it is shown that they reduce to certain bounds found in the literature as special cases, and are tighter than others found in the literature. Finally, some very general feature evaluation criteria are proposed and upper and lower bounds on error probability are derived in terms of these measures for the general, distribution-free, case. It is shown that many other distance measures are special cases of these general measures. Furthermore, the error bounds can be made as tight as desired by substituting the corresponding value of a parameter. Chapter VII gives a brief summary and discussion of the theoretical and experimental results and their implications, as well as some suggestions for further research.

CHAPTER II

DATA, FEATURE EXTRACTION, AND PREPROCESSING

2.1 Introduction

In order to perform the feature evaluation experiments of Part I and the contextual decoding experiments of Part II of this thesis it was necessary to obtain a data set, choose a feature extraction method (set of measurements), and decide on how much preprocessing had to be done and what structure should be assumed in the classifier. Rather than work with artificial data it was decided to perform the experiments on a "real-world" data set, preferably a standardized one, representing a pattern recognition problem of current interest.
Hence, this chapter, in addition to providing the raw materials necessary for the experiments concerning feature evaluation and contextual decoding, is in itself a contribution to the field of handprinted character recognition. When an unknown pattern appears at a recognition system's input, extracted pattern features are used in conjunction with stored data to classify the pattern. In general, an optimum procedure for selecting features is lacking. One is therefore obliged to select features on the basis of ingenuity, experimentation and knowledge of peculiarities of the patterns to be recognized, while endeavouring to keep the feature extraction scheme reasonably simple. In selecting features one usually seeks to use a few well chosen ones, rather than a large number selected more or less at random. Storage and processing of data needed for use with large feature sets are expensive, and many training samples are needed for data compilation. One should always consider the possibility of reducing the cost and complexity of the recognition system and/or improving recognition performance by preprocessing pattern samples prior to extracting features. Section 2.2 of this chapter describes a size-normalization preprocessing operation, a shearing transformation and test for slanted characters, and a very simple feature extraction scheme which was tested on the 3822 upper case alphabetic characters from Munson's [33] multiauthor handprinted data set. Bayes classification assuming statistically independent features was used.
The recognition results, obtained from experiments described in Section 2.3, compare favourably with those obtained by others [33]-[35] using recognition systems of complexity comparable to the above scheme. The results, their implications, and suggestions for further work appear in Section 2.4. In particular, it is shown that experimental results obtained using standard data sets should be accompanied by a detailed description of how the data was partitioned into training and test sets.

2.2 Data Base, Preprocessing, and Feature Extraction

Munson's [33] multiauthor data file consists of three samples of each FORTRAN character printed by 49 different persons. Each character is stored on magnetic tape as a 24 x 24 binary-valued array. In our work only the 3822 upper case alphabetic characters were used. For a description of Munson's data see [36]. The following paragraphs describe the size normalization preprocessing algorithm, the shearing transformation and test for slantness, and the feature extraction algorithm used in Experiment 1, which was conducted in order to determine recognition performance. The preprocessing and feature extraction algorithms for Experiment 2, which was conducted to indicate the usefulness of size normalization, are described in Section 2.3. Prior to size normalization and feature extraction each character's width W and height H was first defined by locating the rectangle defined by the two rows and columns closest to the edges of the character array and containing at least two black points. At this point a test was performed on the character to determine if it was "very" slanted, "slightly" slanted or not slanted.
To determine if a character was "very" slanted a count was made of the number of black points, within the character rectangle, lying on the upper half of the third column from the left. If this count was equal to zero then the character was labelled "very" slanted. To determine if a character was "slightly" slanted a count was made of the number of black points, within the character rectangle, lying on both the upper and lower halves of the second column from the left. If the count on the lower half was greater than or equal to twice the count on the upper half then the character was labelled as "slightly" slanted. If both of the above conditions were not met, the character was considered to be erect. Following this test, the character was size-normalized in the following way. The ratio H/W was calculated. If H/W <= 3 the original character was size normalized to 20 x 20 points by adding to, or deleting from, rows and/or columns at regular spatial intervals in the original 24 x 24 array. Let NHA and NHD equal the number of rows to be added and deleted, respectively, to make the final height equal to 20. Then NHD = H - 20 and NHA = 20 - H. Let {x} denote the largest integer not exceeding x. Let i be any positive integer. If H > 20 then the NHD rows numbered {i (H+NHD)/(NHD+1)} from the top were deleted from the original 24 rows. If H < 20 then ND = H/(NHA+1) was obtained and then set equal to the nearest integer to yield NDI. A new row was inserted immediately below row i (NDI) in the original 24 x 24 array. Each inserted row was made identical to the row above it in the new array. The first 20 rows were retained as those defining the size-normalized character.
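As a concrete illustration, the slant test and the height-normalization step above can be sketched in Python as follows. This is a modern reconstruction, not the original implementation: the row-deletion indices here use a simplified regular spacing rather than the exact formula given above, and only the row (height) direction is shown; columns are handled analogously.

```python
import numpy as np

def slant_label(char: np.ndarray) -> str:
    """Label a character rectangle 'very' slanted, 'slightly' slanted,
    or erect, using the column-count test described above."""
    H = char.shape[0]
    if char[: H // 2, 2].sum() == 0:      # upper half of 3rd column empty
        return "very"
    upper = char[: H // 2, 1].sum()       # 2nd column, upper half
    lower = char[H // 2 :, 1].sum()       # 2nd column, lower half
    if lower >= 2 * upper:
        return "slightly"
    return "erect"

def normalize_height(char: np.ndarray, target: int = 20) -> np.ndarray:
    """Height-normalize a binary character array of H rows to `target`
    rows by deleting or duplicating rows at regular spatial intervals."""
    H = char.shape[0]
    if H == target:
        return char
    if H > target:                                    # delete NHD = H - target rows
        nhd = H - target
        drop = [int(i * H / (nhd + 1)) for i in range(1, nhd + 1)]
        return np.delete(char, drop, axis=0)
    nha = target - H                                  # insert NHA = target - H rows
    pos = [int(round(i * H / (nha + 1))) for i in range(1, nha + 1)]
    dup = char[[max(p - 1, 0) for p in pos]]          # each new row copies the row above
    return np.insert(char, pos, dup, axis=0)
```

Both branches preserve the regular-interval spirit of the procedure: deleted or duplicated rows are spread evenly across the character's height.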
A similar procedure was used to insert or delete columns. If H/W > 3 then the original width W was not altered. Instead, the character was centered horizontally on a grid 20 columns wide. The height was normalized to 20 rows, as explained in the above paragraph. This procedure prevented characters such as "I", which may be printed either with or without serifs, from being converted into a virtually all-black 20 x 20 array when serifs were absent. After size normalization the following shearing transformation was performed on the character. For the characters labelled "slightly" slanted the 20 x 20 array was divided into three horizontal regions of equal size labelled A, B, and C, where A is the uppermost region and C is the lowermost region. All the black and white points in region A were shifted to the left by one unit and all the points in region C were shifted to the right by one unit. This operation had the effect of reducing the width of some characters by two columns, one on each end. For this reason the character was widened again by two columns. For the characters labelled "very" slanted the 20 x 20 array was divided into five horizontal regions of equal size labelled A, B, C, D, and E, where A is the uppermost region and E is the lowermost region. All the black and white points in regions A and B were shifted to the left by 2 and 1 units, respectively. All the points in regions D and E were shifted to the right by 1 and 2 units, respectively.
This operation had the effect of reducing the width of some characters by four columns, two from each end. For this reason the character was widened again by four columns. Following the shearing transformation the 20 x 20 array was divided into 25 non-overlapping 4 x 4 regions. The 25-component feature vector was then obtained by setting vector component x_i equal in value to the number of black points in the i-th region. Thus, x_i could assume any integer value from 0 to 16, inclusive. This feature extraction algorithm has several desirable properties, as follows:

1. The feature vector is obtained from the pattern matrix by a repetitive procedure involving only the simple logical operation of counting the number of black points in a well-defined region.

2. The feature vector's dimensionality is significantly lower than that of many other schemes, and the features themselves assume only a relatively small number of values.

3. Discontinuities in the character matrix, including salt and pepper noise, can be tolerated.

The potential sensitivity of the feature extraction scheme to large deviations in size and slanting among characters of any given class motivated the size normalization and shearing transformation preprocessing.

2.3 Some Classification Experiments

In classifying an unknown character sample on the basis of its features x_i, the quantity

    R = SUM_{i=1}^{25} log[P(x_i|C_j)] + log[P(C_j)]            (2.1)

was calculated for each of the 26 upper case character classes C_j, where P(x_i|C_j) and P(C_j) denote the probability of feature vector component x_i conditioned on character class C_j and the a priori probability of the character class C_j, respectively. Natural logarithms are used throughout this thesis. The character class C_j
which maximized R was decided to be the true identity of the unknown character sample. This classification algorithm minimizes the probability of a recognition error when the feature vector's components are statistically independent and when the probabilities in (2.1) are either known exactly or obtained using Bayes estimation. Choosing a probabilistic structure for the classifier is a non-trivial operation [37], since it has a bearing on the optimum dimensionality corresponding to a given sample size and its correspondence to the true structure. An increase in probabilistic structure usually means an increase in optimum dimensionality for a certain fixed sample size with a corresponding decrease in error probability [38]-[41]. In this thesis conditional independence of features was assumed to reduce the amount of training, storage and computation. This popular assumption has received considerable attention in the past. Minsky [45], Winder [46], Chow [47] and Nilsson [48] have given proofs that for conditionally independent binary features the Bayes classifier can be expressed as a linear discriminant function. Kazmierczak and Steinbuch [49] stated that for conditionally independent ternary features the Bayes classifier can be expressed as a quadratic discriminant function. The author proved in [50] that the first order Minkowski classifier [51], which implements a piecewise linear decision boundary between two pattern classes, is equivalent to a linear discriminant function for conditionally independent binary valued features.
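The zoning feature extractor and the decision rule (2.1) can be sketched as follows. This is an illustrative reconstruction, not the original program: `log_cond` and `log_prior` are assumed to hold the estimated log probabilities log P(x_i = v | C_j) and log P(C_j).

```python
import numpy as np

def zone_features(char20: np.ndarray) -> np.ndarray:
    """25-component feature vector: black-point counts in the 25
    non-overlapping 4 x 4 zones of a 20 x 20 binary character array.
    Each component lies between 0 and 16, inclusive."""
    return char20.reshape(5, 4, 5, 4).sum(axis=(1, 3)).ravel()

def classify(x, log_cond, log_prior) -> int:
    """Decide the class maximizing R in (2.1).
    log_cond[j][i][v] = log P(x_i = v | C_j); log_prior[j] = log P(C_j)."""
    R = [sum(log_cond[j][i][v] for i, v in enumerate(x)) + log_prior[j]
         for j in range(len(log_prior))]
    return int(np.argmax(R))
```

Under the conditional-independence assumption the sum of log conditional probabilities in `classify` is exactly the log of the product P(x|C_j)P(C_j), so maximizing R implements the Bayes decision rule.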
While Devijver [52] independently generalized some of these results, the author [53] showed that for n-valued conditionally independent features a large family of pattern classifiers is equivalent to an (n-1)st degree polynomial discriminant function. In the feature selection problem the author [54] has shown that even for conditionally independent features situations can easily arise in which the best single features make up the worst feature subset when the Bayes error probability is used as a criterion. A method developed earlier by Toussaint and Donaldson [55], [56] was used to estimate the overall system performance. Let the N alphabets of character samples be partitioned into M disjoint sets. Bayes estimates [57] of P(x_i|C_j) were obtained using all the characters in M-1 of the sets. All the samples from the remaining set were then recognized using (2.1), and P(e|C_j), the probability of error conditioned on character class C_j, was calculated. The error probability P(e) was then evaluated as follows.

    P(e) = SUM_{j=1}^{26} P(e|C_j) P(C_j)            (2.2)

The above procedure was repeated M times, with a different set of alphabets serving as the test set each time. Averaging the results of the M experiments yielded an overall performance measure. In actually calculating R in (2.1) and P(e) in (2.2), P(x_i|C_j) and P(C_j) were stored as 32-bit single precision binary numbers. Subsequent multiplications, additions and taking of natural logarithms were done using single precision.
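The rotation procedure and eq. (2.2) amount to the following. This is a sketch: `train_and_test` stands in for one training-plus-recognition pass and is assumed to return the per-class error rates P(e|C_j) for that rotation.

```python
import numpy as np

def rotation_error(sets, train_and_test, priors):
    """Rotation estimate of P(e): train on M-1 of the M disjoint sets,
    test on the remaining one, rotate, and average.  `sets` is a list of
    M lists of sample indices; `train_and_test(train_idx, test_idx)` is
    assumed to return the conditional error rates P(e|C_j)."""
    M = len(sets)
    per_rotation = []
    for m in range(M):
        test_idx = sets[m]
        train_idx = [i for s in sets[:m] + sets[m + 1:] for i in s]
        cond_err = train_and_test(train_idx, test_idx)        # P(e|C_j)
        per_rotation.append(float(np.dot(cond_err, priors)))  # eq. (2.2)
    return sum(per_rotation) / M, per_rotation
```

With M = 7 this requires seven training passes instead of the 3822 required by the U method, while every sample is still used once for testing.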
The general problem of estimating system performance using a limited number of pattern samples has received considerable attention recently [58]-[68]. Our approach represents a compromise between the H method of Highleyman [58] on the one hand and the U method of Lachenbruch and Mickey [59] on the other. Using Highleyman's [58] method, one would partition the entire data set of N alphabets into two disjoint sets, and then use one set for training and the other for testing. The U method of Lachenbruch and Mickey [59] uses all but one of the character samples for training, and the remaining single sample for testing. This procedure is repeated until each sample has been used for testing. In our case this would involve 3822 repetitions. The U method [59] requires much computation, while Highleyman's approach suffers from the disadvantage that it yields overly pessimistic recognition accuracies on small data sets [37]. One would expect our method to yield a better estimate of performance than the H method while requiring much less computation than the U method. In Experiment 1, all 3822 upper case alphabetic characters were used to form seven disjoint sets. Thus, N = 147 and M = 7. Set 1 contained alphabets 1-21, set 2 contained alphabets 22-42, and so on (see Table 2.1), with the result that each of the seven disjoint sets contained three alphabets from each of seven different authors. In Experiment 2 the 42 samples of each upper case character which appeared to vary most in size and line thickness were first eliminated using visual examination.
The remaining characters were used to form 105 alphabets, and the training and testing procedure described earlier was then used with N = 105 and M = 7 (see Table 2.2). In the first group of recognition tests an algorithm similar to the one described in Section 2.2 was used to size normalize the original 24 x 24 character matrix to 20 wide x 21 high. The normalized matrix was then divided into 28 non-overlapping regions 5 wide x 3 high, and a 28-dimensional feature vector was then obtained for use with (2.1) by counting the number of black points in each region. In the second group of tests size normalization was not used. The feature vector was obtained by counting the number of black points in each of the 36 non-overlapping 4 x 4 rectangles in the original character matrix. In both groups of tests in Experiment 2 the characters' a priori probabilities equalled those of English text.

2.4 Results and Discussion

Tables 2.1 and 2.2 show the error probabilities obtained in Experiments 1 and 2 using each of the seven disjoint sets for testing. Also shown are the means and standard deviations of the error probabilities. It follows from (2.2) and Table 2.1 that, on the average, the more commonly occurring characters are most easily recognized. Table 2.2 indicates that size normalization prior to feature extraction significantly improves recognition accuracy. The results in both Table 2.1 and Table 2.2 depend strongly on the way in which the data is partitioned into training and test sets, and on the system evaluation procedure. This point has also been considered by Munson [69].
If only one of the seven sets of alphabets had been used for testing and the other six for training, then the error probability in Experiment 1 would have varied between 16 and 27 percent depending on the test set. Furthermore, if only set 2 or set 3 had been used, a contradictory conclusion would have been arrived at concerning the usefulness of shearing. With set 2 (equiprobable characters) the error probability increases from 25.6 to 26.3 while with set 3 the error probability decreases from 30.9 to 28.0, when comparing unsheared with sheared characters. It is, therefore, recommended that recognition results obtained using standardized data sets be accompanied by a detailed description of how the data set was partitioned for training and testing. It is noted from Table 2.1 that the shearing algorithm used improved the recognition rate in all except one (rotation No. 2) of the seven rotations for equiprobable characters. However, when English a priori probabilities are used, shearing the size-normalized characters actually results in an increase in error probability, on the average. Hence, this shearing algorithm does not appear to be very useful for systems designed to recognize English text. However, the changes in performance levels in all these cases are so small that one can hardly make any conclusive statements about the shearing algorithm or the test for slantness. The author tends to believe that, since the feature extraction scheme is sensitive to the orientation of the characters, one should be able to further improve recognition performance through proper shearing of the characters, and that the shearing algorithm itself is satisfactory but that a better test for measuring the slantness of characters is needed. The attractiveness of the test used in this chapter is due to the test's simplicity. However, one may have to resort to more complex tests [70]-[75] in order to obtain better results. Table 2.3 summarizes results obtained by others who have used Munson's multiauthor data file. Exact comparisons between these results and those obtained using the above recognition system are not possible because of differences in performance evaluation procedures and partitioning of alphabet sets. Comparison of the performance capabilities of the various feature extraction schemes is even more difficult because of differences in classification procedures.

A PRIORI CHARACTER           ERROR PROBABILITIES                              AVERAGE ERROR     STANDARD
PROBABILITIES                Set 1  Set 2  Set 3  Set 4  Set 5  Set 6  Set 7  PROBABILITY P(e)  DEVIATION OF P(e)
EQUIPROBABLE CHARACTERS*     22.8   25.6   30.9   24.5   24.5   33.3   30.9   27.5              3.7
ENGLISH A PRIORI PROB.*      16.0   23.2   24.4   18.6   17.5   27.0   26.7   21.9              4.1
EQUIPROBABLE CHARACTERS+     19.4   26.3   28.0   23.2   23.8   30.6   30.9   26.0              3.9
ENGLISH A PRIORI PROB.+      16.3   26.6   23.5   17.4   18.8   26.9   26.3   22.3              4.3

* Size-normalized only
+ Size-normalized and sheared

Table 2.1 Error probabilities in percent for test sets 1-7, inclusive - Experiment 1.

CHARACTER PREPROCESSING      ERROR PROBABILITIES                              AVERAGE ERROR     STANDARD
                             Set 1  Set 2  Set 3  Set 4  Set 5  Set 6  Set 7  PROBABILITY P(e)  DEVIATION OF P(e)
SIZE NORMALIZED              20.8   25.8   21.7   22.3   25.0   38.1   28.0   26.0              5.4
NOT SIZE NORMALIZED          27.8   41.4   26.0   32.9   31.8   39.4   26.4   32.3              5.7

Table 2.2 Error probabilities in percent for test sets 1-7, inclusive - Experiment 2.
Munson's results [33], [34] were obtained using some of the alphabet sets for training and some or all of the remaining ones for testing. Knoll's results [35] were obtained by doing most of the training on alphabet sets 1-30. However, since human trial-and-error was the learning approach used for designing the table lookup contents, modifications were introduced after a preliminary recognition run was performed on alphabets 31-60. Recognition runs were then performed on all 60 alphabets to give the results shown in Table 2.3.

DESCRIPTION OF TESTING           TYPE OF          FEATURE              PROBABILITY OF  REJECT       ERROR
AND TRAINING SAMPLES             CLASSIFIER USED  EXTRACTION METHOD    CORRECT RECOG.  PROBABILITY  PROBABILITY
* 2340 samples for training.     Table lookup     Contour tracing      58              19           23
  Testing set unspecified.
  Munson [34]
** Alphabet sets 1-96 for        Adaptive         Feature-template     78              -            22
  training, alphabet sets        piecewise        (PREP 9-View)
  97-147 for testing.            linear
  Munson [33]
** Same as above                 Same as above    Topological          77              -            23
                                                  descriptors (TOPO)
+ Alphabets 1-30 for both        Table lookup     Characteristic loci  75.9++          11.4         12.7
  training and testing [35]
+ Alphabets 31-60 for both       Same as above    Same as above        69.5            15.1         15.4
  training and testing [35]
+ Alphabets 1-60 for training    Same as above    Same as above        73.9++          13.0         13.1
  and 1, 4, 7, 58 for testing

* 26 upper case letters only.
** Samples include the 10 numerals, the 26 upper case letters, and the 10 FORTRAN symbols.
+ 25 upper case samples - letter 0 is excluded.
++ The results have been modified by us to include half of the ambiguities in [35] as correct recognition and half as error.

TABLE 2.3 Results of other experiments on Munson's multiauthor handprinted data set (in percent).
Full 46-character FORTRAN alphabets were used by Munson to obtain recognition accuracies of 77% and 78%, although he achieved a recognition accuracy of 85% by combining the classification responses of PREP 9-VIEW and TOPO. The last three sets of results in Table 2.3 were obtained using 25-character alphabets which contained only upper case characters, the letter 0 being excluded. If one bears in mind that the full 46-character alphabet would normally be more difficult to recognize than the 26-character upper case alphabet, it would be natural to conclude that Munson's PREP 9-VIEW and TOPO systems with adaptive piecewise linear classifiers would yield somewhat lower error probabilities than those found in Table 2.2 if all three systems were evaluated using identical test data and test procedures. The system used in this thesis would likely outperform the contour-tracing table-lookup system of Munson [33] and might outperform the table-lookup characteristic loci system of Knoll [35]. This latter system was evaluated using some of the same data for training and testing; if disjoint training and test data were used the recognition accuracies in lines 4-6 of Table 2.3 would likely decrease [55], [37]. One might argue that the recognition accuracies of the table lookup systems are lower than they would be if reject criteria were not used. However, rejects tend to be inherent in table lookup systems, and avoiding rejects is difficult. Concerning system complexity, the above feature extraction scheme is simpler than all of those in Table 2.3.
However, the use of size normalization and Bayes classification makes it questionable whether the overall recognition system is simpler and/or more economical than those schemes which employ table lookup classifiers. The overall system would seem simpler than those in Table 2.3 which use PREP 9-VIEW or TOPO. By now researchers devoted to both linguistic or structural, and statistical approaches to pattern recognition have concluded that for a specific problem the overall "optimal" system has to combine aspects of both methods [76]. The preprocessing in the form of size normalization incorporated in this thesis, usually referred to as "registration", belongs to the heuristic approach. Chandrasekaran and Kanal [76] have analytically shown that, in general, the minimum error when no registration is permitted is greater than when measurements for statistical classification are taken after registration. The good results obtained in this chapter should encourage other researchers to try to find the ideal compromise between the structural or heuristic and the purely statistical approaches to handprinted character recognition, which perhaps will yield even better results. Some of the results in this chapter are also discussed in [77].

CHAPTER III

A CLASS OF CERTAINTY MEASURES FOR PRE-EVALUATION AND INTRA-SET POST-EVALUATION

3.1 Introduction

Pre-evaluation methods have been studied in the past mainly by Watanabe and his colleagues [3], [14], [22]-[26]. In [23] Watanabe provided theoretical foundations for using the Karhunen-Loeve expansion as a tool for pre-evaluation of features. In [14], [3] Watanabe et al. proposed the SELFIC method of pre-evaluation.
The SELFIC method is based on a transformation of coordinates in the feature space, whose aim is to obtain the optimal coordinate system. The "importance", or "weight", of a coordinate axis may be measured by the average (in a collection of pattern samples) of the square of the component along that axis of each feature vector. The optimal coordinate system is obtained by minimizing the entropy of the "weights" given by

    H(w) = - SUM_{j=1}^{n} w_j log w_j ,            (3.1)

where n is the dimension of the representation space, and w_j, j = 1,2,...,n are the "weights" with the property that

    SUM_{j=1}^{n} w_j = 1,  and  w_j > 0 for all j.

At least three basic types of intraset post-evaluation methods have been studied in the past. The first is based on the application of the Karhunen-Loeve expansion to each class separately. Chien and Fu [27] proposed the generalized Karhunen-Loeve expansion, which is better suited to the multiclass problem, and report experiments using the procedure in [4]. As an aside it is interesting to note that the Karhunen-Loeve expansion method can in fact be modified so that it becomes an interset method, as was demonstrated by Fukunaga and Koontz [78]. The second approach, known as CLAFIC, is an application of the entropy minimizing coordinate system of SELFIC to the pattern samples of each class separately.
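In linear-algebra terms the SELFIC weights can be computed as the normalized eigenvalues of the sample autocorrelation matrix, since the mean squared component along an eigenvector axis equals the corresponding eigenvalue. The following is a minimal sketch of that reading, not the original SELFIC program:

```python
import numpy as np

def selfic_weights(samples: np.ndarray) -> np.ndarray:
    """SELFIC-style weights: rotate to the eigenvector coordinate system
    of the autocorrelation matrix of the (class-unlabelled) samples and
    take the normalized mean squared component along each new axis."""
    R = samples.T @ samples / len(samples)   # sample autocorrelation matrix
    eigvals = np.linalg.eigvalsh(R)          # mean squared components per axis
    w = eigvals / eigvals.sum()
    return np.clip(w, 0.0, None)             # guard against tiny negatives

def weight_entropy(w: np.ndarray) -> float:
    """Entropy of the weights, eq. (3.1), with 0 log 0 taken as 0."""
    w = w[w > 0]
    return float(-(w * np.log(w)).sum())
```

A low value of `weight_entropy` indicates that the sample energy is concentrated in a few axes, which is exactly the condition SELFIC seeks.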
The third approach [28]-[31] consists of minimizing the average conditional entropy of the class-conditional probability distributions themselves, given by

    H(X|C) = - SUM_{k=1}^{M} P(C_k) SUM_{i=1}^{L^n} P(X_i|C_k) log P(X_i|C_k),            (3.2)

where P(X_i|C_k) is the probability that the feature vector X takes on its i-th value conditioned on the k-th pattern class, P(C_k) is the a priori probability of the k-th class, M is the total number of pattern classes and L is the number of values each of the n features can take. In the continuous case the summation sign becomes an integration.

3.2 Certainty Measures and Feature Evaluation

In this section a new approach to pre-evaluation of features is proposed. The success of the approach in real-world problems depends critically on the type of measure used in the evaluation process. A class of certainty measures well suited to this approach is proposed. Finally, it is noted that the proposed certainty measures can be applied to intraset post-evaluation of features. In the past, pre-evaluation methods [3], [4], [14], [22]-[27] were based on the knowledge of actual pattern samples regardless of pattern class affiliation. These methods were based on the actual stochastic variables of the process involved, i.e., functions such as w_j in (3.1) depend on the actual values the features take rather than on the probabilities of occurrence of the corresponding values. These methods suffer from two disadvantages: 1) the evaluation of a set of features depends on the manner in which they are measured and hence the actual values the features take, and 2) features which are labelled "good" on the basis of the values which the features take may not be good for pattern class discrimination.
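For discrete distributions, the average conditional entropy of eq. (3.2) can be computed directly. The sketch below assumes the priors and class-conditional distributions have already been estimated, and takes 0 log 0 = 0 by convention:

```python
import numpy as np

def avg_conditional_entropy(priors, cond) -> float:
    """H(X|C) of eq. (3.2): priors[k] = P(C_k) and cond[k][i] = P(X_i|C_k),
    the probability of the i-th feature-vector value given class k."""
    H = 0.0
    for Pk, Pxk in zip(priors, cond):
        Pxk = np.asarray(Pxk, dtype=float)
        nz = Pxk[Pxk > 0]                       # 0 log 0 = 0
        H -= Pk * float((nz * np.log(nz)).sum())
    return H
```

H(X|C) is zero when every class-conditional distribution is perfectly peaked, and largest when all of them are uniform.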
In f a c t , i t has been stated by Watanabe et a l . [lh] that the success of SELFIC i s quite unexpected. . The proposed method for pre-evaluation i s analogous to the t h i r d i n t r a s e t post-evaluation method discussed i n Section 3.1. I t consist of evaluating the features with a measure which i s a function of P(x), the unconditional d i s t r i b u t i o n on the feature vectors. Since P(X) consists of p r o b a b i l i t i e s the method does not s u f f e r from the f i r s t disadvantage.• Furthermore, by choosing an appropriate function of P(x) to serve as an evaluation measure one can have reas.onab.le .confidence ..in s e l e c t i n g features which are good for pattern-class d i s c r i m i n a t i o n . The obvious advantages of t h i s method over post-evaluation methods are that only one unconditional d i s t r i b u t i o n has to be estimated rather than M c l a s s - c o n d i t i o n a l d i s t r i b u -tions and there i s f a r l e s s computation involved i n c a l c u l a t i n g the feature evaluation measure since i t does not have to be.averaged over the M pattern classes. One obvious f i r s t guess candidate f o r a feature evaluation c r i -t e r i o n i s Shannon's unconditional entropy given by A n H(X) = - E P(X.) l o g P(X.), ( 3 . 3 ) j = l J 3 but, as we s h a l l see below, i t turns out that t h i s i s not a very s a t i s f a c -t o r y measure. An important question t o answer at t h i s point i s : once a measure such as H(x) has been decided upon should i t be maximized or minimized i n order to produce an ordering of the available features. The answer to t h i s question depends on whether one r e l i e s on theory or experiment i n r e a l world problems. The answer w i l l f i r s t be i n v e s t i g a t e d from the t h e o r e t i c a l point of view. The square of the unconditional d i s t r i b u t i o n P(x) can be written as M M P(C .. . .... , j k ' i k j f o r k = 1,2,... ,An. 
Now, by definition, Vajda's "quadratic" entropy [32] Q(X) is given by

$$Q(X) = 1 - \sum_{k=1}^{\lambda^n} P^2(X_k). \qquad (3.5)$$

It should be noted that Q(X) was also proposed as a measure of entropy by Watanabe [24, p. 14]. Now,

$$\sum_{k=1}^{\lambda^n} P(X_k|C_i)\, P(X_k|C_j) \le \rho_{ij}, \qquad (3.6)$$

where $\rho_{ij}$ is the Bhattacharyya coefficient given by

$$\rho_{ij} = \sum_{k=1}^{\lambda^n} \sqrt{P(X_k|C_i)\, P(X_k|C_j)}. \qquad (3.7)$$

From (3.5), (3.6) and (3.7) it follows that

$$H(X) \ge Q(X) \ge 1 - \bar{\rho}, \qquad (3.8)$$

where $\bar{\rho}$ is the expected value of $\rho_{ij}$ and $0 \le \bar{\rho} \le 1$. It is well known that $\rho$ is a measure of discrimination between the pattern classes. From (3.8) it can be seen that when H(X) is close to zero $\bar{\rho}$ is close to 1, which implies that the class-conditional distributions are overlapping. For the two-equiprobable-pattern-class case H(X) can actually be related to the error probability $P_e$. It is well known [79] that

$$P_e \ge \tfrac{1}{4}\rho^2. \qquad (3.9)$$

From (3.8) and (3.9) it follows that, for $H(X) \le \tfrac{1}{2}$,

$$P_e \ge \tfrac{1}{4}\left[1 - 2H(X)\right]^2. \qquad (3.10)$$

From (3.10) it is obvious that the maximization of H(X) is a necessary, but not a sufficient, condition for the realization of arbitrarily low $P_e$. Hence, from the theoretical point of view one would think it desirable to select features with a high value of H(X), i.e., features with a nearly uniform unconditional distribution. However, the theoretical point of view assumes that uniform unconditional distributions can occur in which the conditional distributions are perfectly peaked and non-overlapping. This is an unrealistic assumption for real-world problems, where all these distributions tend to be multimodal.
In real-world pattern recognition problems the class-conditional distributions $P(X|C_k)$, $k = 1, 2, \ldots, M$, tend to be nonuniform. Since the unconditional distribution P(X) is the average of the conditional ones, i.e., since

$$P(X) = \sum_{k=1}^{M} P(C_k)\, P(X|C_k), \qquad (3.11)$$

the nonuniformity of P(X) is a measure of the nonuniformity of the $P(X|C_k)$, $k = 1, 2, \ldots, M$. In other words, if P(X) is nonuniform then so is $P(X|C_k)$ for at least one value of k, although the converse is not necessarily true. Furthermore, the more uniform the $P(X|C_k)$ for all k are, the more they will overlap each other and the less will be the resulting discrimination capability of the features. Conversely, the more nonuniform the $P(X|C_k)$ are, the better the chances that they will have non-overlapping sections. Hence, from a set of real-world features it is desirable to select the subset which has the more nonuniform P(X). Now, although the minimization of the entropy measures the nonuniformity of a distribution, the entropy is a minimum when $P(X_i) = 1$ and $P(X_j) = 0$, $j = 1, 2, \ldots, \lambda^n$, $j \ne i$. This condition implies that all the class-conditional distributions are identical and that the features are completely useless. However, the chances of finding such a subset of features from a set of real-world features are minute, to say the least. One of the main disadvantages of using the entropy as a measure of nonuniformity is that it will necessarily favour the more peaked distributions and hence the more overlapping class-conditional distributions.
Therefore, one would not expect the Shannon entropy to be a very satisfactory measure of nonuniformity for pre-evaluation of features. What is needed is a measure of nonuniformity that does not necessarily favour the more peaked distributions. Such a measure is proposed below. A measure of nonuniformity, or certainty, better suited to pre-evaluation of features, and which is easier to compute than Shannon's entropy, is given by

$$M_\infty(X) = \sum_{j=1}^{\lambda^n} \left| P(X_j) - U(X_j) \right|, \qquad (3.12)$$

where U(X) is the uniform distribution, i.e., $U(X_j) = 1/\lambda^n$, $j = 1, 2, \ldots, \lambda^n$. Like H(X), $M_\infty(X)$ takes on its extreme values when P(X) is either flat or perfectly peaked. However, in real-world problems distributions lie somewhere between the two extremes, and for these distributions $M_\infty(X)$ behaves quite differently from H(X): a higher value of $M_\infty(X)$ means a more nonuniform distribution but not necessarily a more peaked one. Hence one would expect $M_\infty(X)$ to have the capability to select better features from the discrimination-information point of view. One can also look at this problem from the point of view of multimodal finite mixtures. Given the mixture distribution P(X), in an ideal situation for feature selection the $P(X|C_i)$, $i = 1, 2, \ldots, M$, will be peaked and will peak for each i at different values of X, resulting in a multimodal (M-modal) distribution. A criterion such as H(X) favours a unimodal distribution while $M_\infty(X)$ does not necessarily do so.
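A minimal Python sketch of (3.12) (an editorial addition; the function name is hypothetical) makes the contrast with the entropy concrete: $M_\infty$ rewards distance from the uniform distribution without insisting on a single peak, so a bimodal mixture can score as well as or better than a unimodal one:

```python
def m_inf(p):
    """M_inf(X) = sum_j |P(X_j) - U(X_j)| with U(X_j) = 1/lambda^n (eq. 3.12)."""
    u = 1.0 / len(p)
    return sum(abs(pj - u) for pj in p)

print(m_inf([0.25, 0.25, 0.25, 0.25]))  # 0.0 for the uniform distribution
print(m_inf([0.5, 0.0, 0.5, 0.0]))      # bimodal mixture
print(m_inf([0.7, 0.1, 0.1, 0.1]))      # unimodal, peaked
```

Here the bimodal distribution scores higher than the unimodal one, whereas the entropy criterion would rank the unimodal distribution as the more "certain" only when it is more peaked.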
Let the probability distributions be considered as points in a $\lambda$-dimensional space, where $\lambda$ is the number of states the random variable can take. Then $M_\infty(X)$ can be considered to be the first-order Minkowski metric [50]-[53] between the two points representing P(X) and U(X). Hence, the entire family of Minkowski metrics could be interpreted as a class of certainty measures. However, in this chapter only the following class of certainty measures will be considered:

$$M_k(X) = \sum_{j=1}^{\lambda^n} \left| P(X_j) - U(X_j) \right|^{\frac{2(k+1)}{2k+1}}, \qquad (3.13)$$

where $k = 0, 1, 2, \ldots, \infty$. For $k = \infty$ (3.13) reduces to (3.12), and for $k = 0$ (3.13) reduces to the square of the Euclidean distance between the points representing P(X) and U(X). Due to their simplicity, $M_\infty(X)$ and $M_0(X)$ are the most attractive of the measures, and the experiments described in this chapter are concerned with them. The class of certainty measures defined by (3.13) can also be extended so that the measures are analogous to Shannon's average conditional entropy, as follows:

$$M_k(X|C) = \sum_{i=1}^{M} P(C_i) \sum_{j=1}^{\lambda^n} \left| P(X_j|C_i) - U(X_j) \right|^{\frac{2(k+1)}{2k+1}}, \qquad (3.14)$$

and to Shannon's equivocation as follows:

$$M_k(C|X) = \sum_{j=1}^{\lambda^n} P(X_j) \sum_{i=1}^{M} \left| P(C_i|X_j) - \frac{1}{M} \right|^{\frac{2(k+1)}{2k+1}}. \qquad (3.15)$$

The measures defined by (3.14) are of interest because they are applicable to intraset post-evaluation, and those defined by (3.15) because, like Shannon's equivocation, they can be related to error probability. In fact, it will be shown that for the two-class problem the upper bound on error probability in terms of $M_0(C|X)$ is tighter than the tightest available bound in terms of Shannon's equivocation.
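The whole family (3.13)-(3.15) can be sketched in a few lines of Python (an editorial illustration with hypothetical names, not part of the thesis). The exponent $2(k+1)/(2k+1)$ falls from 2 at $k = 0$ towards 1 as $k \to \infty$, so $M_k$ increases with k, which is the monotonicity used later in (3.30):

```python
def m_k(p, k):
    """M_k(X) of eq. (3.13): exponent 2(k+1)/(2k+1).  k = 0 gives the squared
    Euclidean distance to the uniform distribution; large k approaches
    M_inf(X) of eq. (3.12)."""
    u = 1.0 / len(p)
    e = 2.0 * (k + 1) / (2 * k + 1)
    return sum(abs(pj - u) ** e for pj in p)

def m_k_posterior(px, posteriors, k):
    """M_k(C|X) of eq. (3.15): the same distance applied to the posteriors
    P(C_i|X_j) against 1/M, averaged over feature values with weights P(X_j)."""
    M = len(posteriors[0])
    e = 2.0 * (k + 1) / (2 * k + 1)
    return sum(pxj * sum(abs(pij - 1.0 / M) ** e for pij in row)
               for pxj, row in zip(px, posteriors))
```

For a binary feature with posteriors (0.8, 0.2) and (0.2, 0.8) on its two equiprobable values, `m_k_posterior` returns 0.18 at k = 0 and approaches 0.6 as k grows, the two values used in the numerical example of Section 3.4.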
3.3 Inequalities

In this section some inequalities are derived between the proposed certainty measures and Shannon's entropies. Let the following functions be defined:

$$U_1\{M_\infty(X)\} = \log\{\mu[1 + M_\infty(X)]\} - M_\infty(X), \qquad (3.16)$$

$$U_2\{M_\infty(X)\} = \log\mu - \tfrac{1}{2}[M_\infty(X)]^2 - \tfrac{1}{36}[M_\infty(X)]^4, \qquad (3.17)$$

$$U_3\{M_\infty(X)\} = \log\{\mu[1 - \tfrac{1}{4}M_\infty^2(X)]\}, \qquad (3.18)$$

$$U_4\{M_\infty(X)\} = \log\frac{\mu[2 - M_\infty(X)]}{2 + M_\infty(X)} + \frac{2M_\infty(X)}{2 + M_\infty(X)}, \qquad (3.19)$$

where $\mu$ is the number of states the random variable can take. Theorem 3.1 presents four upper bounds on Shannon's entropy H(X) in terms of $M_\infty(X)$.

Theorem 3.1:
$$H(X) \le U_i\{M_\infty(X)\}, \quad i = 1, 2, 3, 4, \qquad (3.20)$$
where $\mu = \lambda^n$.

Proof: Let $P_1(X)$ and $P_2(X)$ be any two hypothesis-conditional PDF's. Then Kullback's discrimination information [80] between the two hypotheses is given by

$$I_{12} = \int_{-\infty}^{\infty} P_1(X) \log\frac{P_1(X)}{P_2(X)}\, dX. \qquad (3.21)$$

Also, the Kolmogorov variational distance [81] between two PDF's is given by

$$V_{12} = \int_{-\infty}^{\infty} \left| P_1(X) - P_2(X) \right| dX. \qquad (3.22)$$

In the past, several lower bounds on $I_{12}$ have been derived in terms of $V_{12}$. Volkonsky and Rozanov [82] have shown that

$$I_{12} \ge V_{12} - \log(1 + V_{12}). \qquad (3.23)$$

Kullback [83], [84] has shown that

$$I_{12} \ge \tfrac{1}{2}V_{12}^2 + \tfrac{1}{36}V_{12}^4. \qquad (3.24)$$

It can easily be shown that

$$I_{12} \ge -\log[1 - \tfrac{1}{4}V_{12}^2]. \qquad (3.25)$$

Vajda [85] has shown that

$$I_{12} \ge \log\frac{2 + V_{12}}{2 - V_{12}} - \frac{2V_{12}}{2 + V_{12}}. \qquad (3.26)$$

It should be noted that (3.23), (3.24), (3.25) and (3.26) are progressively tighter bounds. Now, for the case of discrete PDF's, $P_1(X)$ and $P_2(X)$ can be any two distributions. Substituting P(X) and U(X) in (3.21) and (3.22) for $P_1(X)$ and $P_2(X)$, respectively, yields

$$I_{12} = n\log\lambda - H(X), \qquad (3.27)$$

and

$$V_{12} = M_\infty(X). \qquad (3.28)$$
Substituting (3.27) and (3.28) into (3.23), (3.24), (3.25) and (3.26), and solving in each case for H(X), yields (3.20).

Corollary 3.1: For any positive integer k we have

$$H(X) \le U_i\{M_\infty(X)\} \le U_i\{M_k(X)\} \le U_i\{M_0(X)\}, \qquad (3.29)$$

for $i = 2, 3$, and $\mu = \lambda^n$.

Proof: It is obvious from the definition of $M_k(X)$ that

$$M_\infty(X) \ge \cdots \ge M_{k+1}(X) \ge M_k(X) \ge M_{k-1}(X) \ge \cdots \ge M_0(X). \qquad (3.30)$$

Equation (3.29) follows directly from (3.30) and Theorem 3.1.

Lemma 3.1: If $\alpha_1, \alpha_2, \ldots, \alpha_L$ is a set of vectors all in a region where $f(\alpha)$ is concave, and if $\theta_1, \theta_2, \ldots, \theta_L$ is a set of probabilities, i.e., $\theta_k \ge 0$ for all k and $\sum_{k=1}^{L} \theta_k = 1$, then

$$\sum_{k=1}^{L} \theta_k f(\alpha_k) \le f\!\left(\sum_{k=1}^{L} \theta_k \alpha_k\right). \qquad (3.31)$$

Proof: For a proof of this lemma see [86].

Theorem 3.2 presents four upper bounds on Shannon's average conditional entropy in terms of $M_\infty(X|C)$.

Theorem 3.2:
$$H(X|C) \le U_i\{M_\infty(X|C)\}, \quad i = 1, 2, 3, 4, \qquad (3.32)$$
where $\mu = \lambda^n$.

Proof: A proof similar to that of Theorem 3.1 yields

$$H(X|C_j) \le U_i\{M_\infty(X|C_j)\}, \quad i = 1, 2, 3, 4, \qquad (3.33)$$

for $j = 1, 2, \ldots, M$. Equation (3.32) follows from the application of Lemma 3.1 to (3.33).

Corollary 3.2: For any positive integer k we have

$$H(X|C) \le U_i\{M_\infty(X|C)\} \le U_i\{M_k(X|C)\} \le U_i\{M_0(X|C)\}, \qquad (3.34)$$

for $i = 2, 3$, and $\mu = \lambda^n$.

Proof: From the definition of $M_k(X|C_j)$, $j = 1, 2, \ldots, M$, and the expected value theorem, it follows that

$$M_\infty(X|C) \ge \cdots \ge M_{k+1}(X|C) \ge M_k(X|C) \ge M_{k-1}(X|C) \ge \cdots \ge M_0(X|C). \qquad (3.35)$$

Equation (3.34) follows from (3.35) and Theorem 3.2.

Similar proofs yield the following theorem and corollary regarding Shannon's equivocation.

Theorem 3.3:
$$H(C|X) \le U_i\{M_\infty(C|X)\}, \quad i = 1, 2, 3, 4, \qquad (3.36)$$
where $\mu = M$.
Corollary 3.3: For any positive integer k we have

$$H(C|X) \le U_i\{M_\infty(C|X)\} \le U_i\{M_k(C|X)\} \le U_i\{M_0(C|X)\}, \qquad (3.37)$$

for $i = 2, 3$, and $\mu = M$.

Theorems 3.1-3.3 were concerned with upper bounds on Shannon's entropies in terms of the proposed certainty measures. Theorems 3.4-3.6 present lower bounds on Shannon's entropies. Let the following function be defined:

$$L\{M_0(X)\} = 1 - \frac{1}{\mu} - M_0(X), \qquad (3.38)$$

where $\mu$ is as before. Then, from (3.30) it follows that

$$L\{M_0(X)\} \ge \cdots \ge L\{M_{k-1}(X)\} \ge L\{M_k(X)\} \ge L\{M_{k+1}(X)\} \ge \cdots \ge L\{M_\infty(X)\}. \qquad (3.39)$$

Theorem 3.4: For any positive integer k we have

$$H(X) \ge L\{M_0(X)\} \ge L\{M_k(X)\} \ge L\{M_\infty(X)\}, \qquad (3.40)$$

where $\mu = \lambda^n$.

Proof: Vajda's quadratic entropy [32] Q(X) can be written as

$$Q(X) = \sum_{j=1}^{\lambda^n} P(X_j)[1 - P(X_j)], \qquad (3.41)$$

and hence as

$$Q(X) = 1 - \frac{1}{\mu} - M_0(X) = L\{M_0(X)\}. \qquad (3.42)$$

It can easily be shown that

$$H(X) \ge Q(X). \qquad (3.43)$$

Equation (3.40) follows from (3.39), (3.41), (3.42) and (3.43).

Theorem 3.5: For any positive integer k we have

$$H(X|C) \ge L\{M_0(X|C)\} \ge L\{M_k(X|C)\} \ge L\{M_\infty(X|C)\}, \qquad (3.44)$$

where $\mu = \lambda^n$.

Proof: The proof is similar to that of Theorem 3.4, with the invocation of the expected value theorem.

Similarly, the following theorem can be proved.

Theorem 3.6: For any positive integer k we have

$$H(C|X) \ge L\{M_0(C|X)\} \ge L\{M_k(C|X)\} \ge L\{M_\infty(C|X)\}, \qquad (3.45)$$

where $\mu = M$.

3.4 Error Bounds

Since upper and lower bounds on Shannon's equivocation already exist, they can easily be combined with the results of Theorems 3.3 and 3.6 to yield error bounds on the proposed measures. For example, it is well known [87]-[89], [17], [90] that

$$H(C|X) \ge (2\log 2)\, P_e. \qquad (3.46)$$

Combining (3.46) with the result of Theorem 3.3 yields

$$P_e \le \frac{U_i\{M_\infty(C|X)\}}{2\log 2}, \quad i = 1, 2, 3, 4, \qquad (3.47)$$

where $\mu = M$.
Similar bounds can be obtained for $M_k(C|X)$, for k any positive integer, from Corollary 3.3. Tighter piecewise bounds can also be obtained as follows. Tebbe and Dwyer [88] and Kovalevskij [89] have independently shown that for any integer m such that $1 \le m \le M-1$,

$$H(C|X) \ge \left(P_e - \frac{m-1}{m}\right) m(m+1) \log\!\left(\frac{m+1}{m}\right) + \log m, \qquad (3.48)$$

where (3.48) is valid for the interval given by

$$\frac{m-1}{m} \le P_e \le \frac{m}{m+1}.$$

A plot of (3.48) as m increases can be found in [89]. Combining (3.48) with the results of Theorem 3.3 and Corollary 3.3 yields error bounds similar to (3.47). In a similar way, lower bounds on error probability can be derived using the Fano bound and Theorem 3.6 to yield

$$H(P_e, 1 - P_e) + P_e \log(M - 1) \ge L\{M_0(C|X)\} \ge \cdots \ge L\{M_\infty(C|X)\},$$

where $\mu = M$ and $H(P_e, 1-P_e) = -P_e \log P_e - (1 - P_e)\log(1 - P_e)$.

While the error bounds presented so far are of interest, they are not as tight as the bounds on Shannon's equivocation [90]. Consider the following example. Letting $k = \infty$, $i = 3$, and combining (3.48) with Theorem 3.3 and Corollary 3.3 yields

$$P_e \le \frac{\log\!\left\{\dfrac{M}{m}\left[1 - \dfrac{1}{4}M_\infty^2(C|X)\right]\right\}}{m(m+1)\log\!\left(\dfrac{m+1}{m}\right)} + \frac{m-1}{m}, \qquad (3.49a)$$

within the interval defined by $(m-1)/m \le P_e \le m/(m+1)$. For the two-class problem the above bound reduces to

$$P_e \le \frac{\log\!\left[2 - \tfrac{1}{2}M_\infty^2(C|X)\right]}{2\log 2}. \qquad (3.49b)$$

Furthermore, consider a binary-valued feature x taking on the values 0, 1 with probabilities

$$P(x=1|C_1) = 0.8, \quad P(x=1|C_2) = 0.2,$$
$$P(x=0|C_1) = 0.2, \quad P(x=0|C_2) = 0.8,$$

and let $P(C_1) = P(C_2) = 1/2$. Then the Bayes error probability is $P_e = 0.2$ and $M_\infty(C|X) = 0.6$. Substituting for $M_\infty(C|X)$ in (3.49b) yields $P_e \le 0.43$.
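The two-class bound (3.49b) can be checked numerically on the binary-feature example above (an editorial Python sketch, not part of the thesis; the function name is hypothetical and natural logarithms are assumed throughout, as in the text):

```python
import math

def two_class_bound(m_inf_cx):
    """Two-class form of the piecewise bound, eq. (3.49b):
    P_e <= log[2 - (1/2) M_inf(C|X)^2] / (2 log 2)."""
    return math.log(2.0 - 0.5 * m_inf_cx ** 2) / (2.0 * math.log(2.0))

# Binary-feature example: M_inf(C|X) = 0.6, true Bayes error P_e = 0.2.
print(two_class_bound(0.6))  # about 0.43
```

At $M_\infty(C|X) = 0$ the bound correctly degenerates to 1/2, the worst-case two-class error.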
On the other hand, for this example $H(C|X) = 0.498$, and substituting this value in (3.46) yields $P_e \le 0.36$. The following theorem presents upper and lower bounds on the proposed certainty measures in terms of $P_e$ which are competitive with the equivocation bounds.

Theorem 3.7: For any nonnegative integer k, we have

$$M^{1 - p(k)/2}\left[\frac{M-1}{M} - P_e\right]^{p(k)/2} \ge M_k(C|X) \ge P_e\left[\frac{M}{M-1}P_e - 2\right] + \frac{M-1}{M}, \qquad (3.50)$$

where

$$p(k) = \frac{2(k+1)}{2k+1}.$$

Proof: The righthand side of (3.50) will be proved first. Cover and Hart [91] have shown that the asymptotic nearest-neighbour error rate R, given by

$$R = 1 - \sum_{j=1}^{\lambda^n} P(X_j) \sum_{i=1}^{M} \left[P(C_i|X_j)\right]^2, \qquad (3.51)$$

is related to error probability by the following inequalities:

$$P_e \le R \le P_e\left(2 - \frac{M}{M-1}P_e\right). \qquad (3.52)$$

It can easily be shown that $M_0(C|X)$ is uniquely related to R by the relationship

$$R = \frac{M-1}{M} - M_0(C|X). \qquad (3.53)$$

Hence, from (3.52) and (3.53) it follows that

$$M_0(C|X) \ge P_e\left[\frac{M}{M-1}P_e - 2\right] + \frac{M-1}{M}. \qquad (3.54)$$

The righthand side of (3.50) follows from (3.54) and (3.30). Inequality 7.38 of Mitrinović [92] states that for any non-negative real numbers p, q, and $a_i$ such that $q \ge p$, we have

$$\left(\frac{1}{n}\sum_{i=1}^{n} a_i^p\right)^{1/p} \le \left(\frac{1}{n}\sum_{i=1}^{n} a_i^q\right)^{1/q}, \qquad (3.55)$$

which can be written as

$$\sum_{i=1}^{n} a_i^q \ge \left(\frac{1}{n}\right)^{q/p - 1}\left(\sum_{i=1}^{n} a_i^p\right)^{q/p}. \qquad (3.56)$$

Now, letting $q = 2$, $p = p(k)$ and $a_i = |P(C_i|X_j) - \tfrac{1}{M}|$, and applying Jensen's inequality to the convex function $t^{2/p(k)}$ (the reverse of Lemma 3.1), yields

$$M_0(C|X) \ge M^{1 - 2/p(k)}\left[M_k(C|X)\right]^{2/p(k)}. \qquad (3.57)$$

Also, from (3.52) and (3.53) it follows that

$$\frac{M-1}{M} - P_e \ge M_0(C|X). \qquad (3.58)$$

Combining (3.57) and (3.58) and solving for $M_k(C|X)$ yields the lefthand side of (3.50).

Some further comments concerning the upper bounds are in order. Consider the worst and best cases of the upper bound of (3.50).
The loosest bounds occur for $M_\infty(C|X)$, while the bounds are tightest for $M_0(C|X)$. Consider first the loosest bounds, which are obtained by letting $k = \infty$ in the LHS of (3.50) and are given by

$$P_e \le U\{M_\infty(C|X)\},$$

where

$$U\{M_\infty(C|X)\} = \frac{M-1}{M} - \frac{1}{M}\left[M_\infty(C|X)\right]^2.$$

This bound is plotted in Fig. 3.1 for various values of M. The minimum value of $M_\infty(C|X)$ is zero, whereas its maximum value depends on M and is given by $2(M-1)/M$. The results of Fig. 3.1 show that the bound gets loose as M increases and, in fact, the limit of $U\{M_\infty(C|X)\}$ as $M \to \infty$ is equal to one. For $M \ge 3$ the equality holds when $M_\infty(C|X) = 0$. However, for $M = 2$ the equality holds for both of the extreme values of $M_\infty(C|X)$, and hence the bound is very attractive for the two-class problem. Consider the example of Section 3.4, where $P_e = 0.2$. Substituting $M_\infty(C|X) = 0.6$ into $U\{M_\infty(C|X)\}$ yields $P_e \le 0.32$, which is lower than 0.36, the value obtained from the bound on H(C|X).

Consider now the tightest bounds, which are obtained by letting $k = 0$ in the LHS of (3.50) and are given by

$$P_e \le U\{M_0(C|X)\},$$

where

$$U\{M_0(C|X)\} = \frac{M-1}{M} - M_0(C|X).$$

This bound is plotted in Fig. 3.2 for various values of M. The minimum and maximum values of $M_0(C|X)$ are given, respectively, by zero and $(M-1)/M$.

Fig. 3.1: Upper bound on error probability as a function of $M_\infty(C|X)$ for various values of M. The curves meet the ordinate at $(M-1)/M$. The dashed lines meet the abscissa at $2(M-1)/M$.

Fig. 3.2: Upper bound on error probability as a function of $M_0(C|X)$ for various values of M.

The results of Fig. 3.2 show that, unlike the bounds on $M_\infty(C|X)$, these bounds remain tight as M increases, and for any value of M the equality holds for both of the extreme values of $M_0(C|X)$.
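The two closed-form upper bounds just discussed (the $k = \infty$ and $k = 0$ cases of (3.50)) can be sketched and checked on the Section 3.4 example (an editorial Python illustration; the function names are hypothetical):

```python
def bound_m_inf(m_inf_cx, M):
    """k = infinity case of (3.50): P_e <= (M-1)/M - M_inf(C|X)^2 / M."""
    return (M - 1.0) / M - m_inf_cx ** 2 / M

def bound_m0(m0_cx, M):
    """k = 0 case (eq. 3.58 rearranged): P_e <= (M-1)/M - M_0(C|X)."""
    return (M - 1.0) / M - m0_cx

# Section 3.4 example (M = 2): both bounds give 0.32, below the 0.36
# obtained from the equivocation bound (3.46).
print(bound_m_inf(0.6, 2))
print(bound_m0(0.18, 2))
```

For $M = 2$, both bounds vanish at the maximum value of the respective measure, reflecting the equality at both extremes noted in the text.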
For the example of Section 3.4, $M_0(C|X) = 0.18$. Substituting this value into $U\{M_0(C|X)\}$ yields $P_e \le 0.32$, which is equal to the value obtained from the bound on $M_\infty(C|X)$. For the two-class (M = 2) problem it can be shown analytically that $M_0(C|X)$ provides a tighter upper bound on error probability than Shannon's equivocation.

Corollary 3.4:

$$P_e \le \frac{1}{2} - M_0(C|X) \le \frac{H(C|X)}{2\log 2}. \qquad (3.59)$$

Proof: Ito [93] has proposed a set of measures $Q_n$ given by

$$Q_n = \frac{1}{2} - \frac{1}{2}\int P(X)\left[P(C_1|X) - P(C_2|X)\right]\left[P(C_1|X) - P(C_2|X)\right]^{\frac{1}{2n+1}} dX, \qquad (3.60)$$

where n is a nonnegative integer. Furthermore, he has shown that

$$P_e \le Q_n \le \frac{H(C|X)}{2\log 2}. \qquad (3.61)$$

Now, it can be shown that, in fact,

$$Q_0 = \frac{1}{2} - M_0(C|X). \qquad (3.62)$$

Setting $n = 0$ in (3.61) and substituting (3.62) yields (3.59).

3.5 Experiments and Results

The experiments described in this section were performed with the same 25 features used in Chapter II, following the size normalization and the shearing transformation. All the experiments described below involve a comparison of feature evaluation criteria, either with respect to each other or with respect to the Bayes error criterion. Although one would like to compare the various measures with respect to the error probability obtained from the particular classifier to be used, it would be quite arbitrary in this investigation to choose such a classifier. Hence an optimum (Bayes) classifier was chosen as a standard criterion with which to compare the various measures. Although the choice of this classifier may also be somewhat arbitrary, it is at least special in the sense that it yields the minimum probability of error.
Hence the experimental results will reflect the utility of a feature evaluation criterion with respect to the minimum attainable error probability. For simplicity and computational savings, each subset of features evaluated was considered to consist of one feature. This reduction of the dimensionality, i.e., of the number of parameters to be estimated in the Bayes classifier, to only 17 per pattern class greatly increases the statistical significance of the classification results, and hence of the feature ordering, since Munson's data contains approximately one order of magnitude more samples than parameters to be estimated per class. For practical purposes, estimated parameters can be considered equivalent to the true parameters if the number of samples used in the estimation is an order of magnitude greater than the number of parameters [94]. In order to obtain an indication of the behaviour of a particular measure it is sufficient to order the 25 features described above with the measure in question. Furthermore, in order to obtain a quantitative indication of the similarity between two measures, the Spearman rank correlation coefficient, $r_s$, between the orderings of the two measures can be calculated [95]. To compare a particular measure with the Bayes recognition criterion it is necessary to produce an ordering of the 25 features according to such a criterion. Since the probability distribution functions were unknown and had to be estimated, the Bayes recognition probability of each feature was estimated by performing an experiment with the feature in question used in a Bayes classifier in which the parameters were estimated from training data.
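For reference (an editorial Python sketch, not part of the thesis), the Spearman coefficient used to compare two feature orderings has the simple no-ties form below; the SRANK routine mentioned later additionally averages tied ranks and applies a tie correction:

```python
def spearman(rank_x, rank_y):
    """r_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), the no-ties form of the
    Spearman rank correlation between two orderings of the same n features."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

print(spearman([1, 2, 3, 4], [1, 2, 3, 4]))  # identical orderings: 1.0
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # reversed orderings: -1.0
```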
In classifying an unknown sample on the basis of the i-th feature, the character class $C_j$ which maximized $P(x_i|C_j)$ was decided to be the true identity of the unknown character sample. The experiments were all conducted with equal a priori probabilities. This classification scheme minimizes the probability of a recognition error when the $P(x_i|C_j)$ are either known exactly or obtained using Bayes estimation [57]. The same method, developed earlier by Toussaint and Donaldson [55] and described in Chapter II, was used to estimate the overall system performance for each feature. To test the validity of the estimated error probability, it was also calculated analytically with the distributions obtained from the training data. The 25 features were ordered with $P_e$ and with the analytic values, and the two orderings yielded a value of $r_s = 0.9$. The high value of $r_s$ suggests that the estimated values of $P_e$ for each feature are quite reliable. The 25 features were ordered using each of the measures discussed in this chapter by calculating the value that a particular measure takes for each feature when the probabilities are estimated from the training data. The same training data sets, with the same rotation method used in the classification experiments, were used to order the features according to the rest of the measures. In other words, for each feature the value that a measure takes was calculated seven times, once with each of the above training sets, and the final value associated with the feature and the measure in question was taken to be the average of the seven results.
The 25 features were then ordered according to the final averaged values to yield an ordering representing the behaviour of the measure in question. Some of the measures used are functions of $P(x_i)$, i.e., the unconditional probability distribution of feature $x_i$. Bayes estimates of $P(x_i)$ were made by setting $P(x_i)$ equal to $(n+1)/(N+\lambda)$, where n is the number of times feature $x_i$ takes on a certain value over all the samples regardless of pattern class affiliation, N is the total number of available pattern samples, and $\lambda$ is the number of values $x_i$ can take.

In all the experiments described below, the Spearman rank correlation analysis program SRANK, included in the IBM Scientific Subroutine Package, was used. Subroutine SRANK calls two other subroutines, RANK and TIE, and the programs are based on the analysis of Siegel [95]. In SRANK, tied observations are assigned the average of the tied ranks and a correction factor for tied ranks is incorporated. In addition to calculating $r_s$, SRANK performs a significance test. For details see [95] and the IBM write-ups on SRANK. All the experiments in this thesis were performed with the aid of an IBM time-shared Model 360/67 duplex computer.

Experiment 1: This experiment was performed in order to evaluate $M_\infty(X)$ for pre-evaluation of features and to compare $M_\infty(X)$ with H(X) and $M_0(X)$, with respect to the Bayes recognition criterion. $M_0(X)$ is of interest because it is uniquely related to Vajda's quadratic entropy Q(X), given by

$$Q(X) = \sum_{j=1}^{\lambda^n} P(X_j)[1 - P(X_j)],$$

as can be seen from equations (3.41) and (3.42), and hence yields the same feature ordering as Q(X). Also, it is of interest to determine the similarity between $M_\infty(X)$ and Shannon's well-known entropy H(X).
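The $(n+1)/(N+\lambda)$ Bayes estimate described above can be sketched in a few lines (an editorial Python illustration; the function name is hypothetical):

```python
def bayes_estimate(counts, N):
    """Bayes estimate P(x_i = v) = (n_v + 1)/(N + lambda), where counts[v]
    is the number of samples on which the feature took value v, N the total
    number of samples, and lambda = len(counts) the number of values."""
    lam = len(counts)
    return [(n + 1.0) / (N + lam) for n in counts]

# Even a value never observed in the training data receives nonzero probability.
print(bayes_estimate([8, 2, 0], 10))
```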
The results of this experiment are given in Table 3.1(a).

Experiment 2: This experiment was performed in order to evaluate $M_\infty(X|C)$ for intraset post-evaluation of features and to compare it with $H(X|C)$ and $M_0(X|C)$, with respect to the Bayes recognition criterion. Also, it is of interest to determine the similarity of $M_\infty(X|C)$ and $H(X|C)$, and to compare the intraset post-evaluation measures with the pre-evaluation measures. The results of this experiment are given in Table 3.1(b). In Table 3.1, $P_e$ represents the estimated probability of error from the experiments.

3.6 Discussion and Conclusions

In the following discussion let $r_s[x:y]$ denote the Spearman rank correlation coefficient between the feature orderings produced using the feature evaluation criteria x and y. The results of Table 3.1(a) show that $M_\infty(X)$ performed significantly better than either Shannon's entropy or $M_0(X)$ (effectively Vajda's quadratic entropy Q(X)). The results also show that $M_\infty(X)$ and H(X) are very similar measures, $r_s[M_\infty(X):H(X)] = 0.915$. From Table 3.1(b) it is observed that $M_\infty(X|C)$ also performed slightly better than $H(X|C)$, which in turn performed slightly better than $M_0(X|C)$. Although $M_\infty(X|C)$ did not greatly outperform $H(X|C)$, this was expected, since in the post-evaluation case there is no objection to peaking the class-conditional distributions as there is in pre-evaluation.
As in the pre-evaluation case, the results show that $M_\infty(X|C)$ and $H(X|C)$ are very similar measures, with a value of $r_s = 0.963$.

(a)
             H(X)     M_0(X)    M_inf(X)
  P_e        0.221    0.149     0.376
  H(X)       1.0      0.975     0.915
  M_0(X)     -        1.0       0.831

(b)
             H(X|C)   M_0(X|C)  M_inf(X|C)
  P_e        0.471    0.370     0.497
  H(X|C)     1.0      0.962     0.963
  M_0(X|C)   -        1.0       0.882

Note: $P_e$, H(X) and $H(X|C)$ were minimized to produce the feature orderings, while $M_0(X)$, $M_\infty(X)$, $M_0(X|C)$ and $M_\infty(X|C)$ were maximized. For 23 degrees of freedom and the two-tailed test in [95], $r_s$ is significant to at least the 0.5, 0.2, 0.1 and 0.05 levels for a value of $r_s$ greater than or equal to 0.141, 0.265, 0.337 and 0.396, respectively.

Table 3.1: Spearman rank correlation coefficients $r_s$ between feature orderings for (a) pre-evaluation measures and (b) intraset post-evaluation measures.

Comparing the results from Table 3.1(a) with those from Table 3.1(b), it is noted that pre-evaluation with $M_\infty(X)$ was more effective than post-evaluation with $M_0(X|C)$ (or Vajda's average conditional quadratic entropy). These important results, contrary to intuitive expectation, show that good features, from the Bayes recognition point of view, can be selected with knowledge of only the unconditional distribution P(X). Furthermore, these results take on an added dimension of importance when one takes computational requirements into account. Although $H(X|C)$ and $M_\infty(X|C)$ perform better than $M_\infty(X)$, calculation of the former two measures requires estimation of M distributions as opposed to one for $M_\infty(X)$, and requires M times as much computation, since $M_\infty(X|C_i)$ and $H(X|C_i)$ have to be averaged over i. Some portions of this chapter can also be found in [96]. Since the publication of [96], Fukunaga [11, p.
235] has recently proposed the maximization of H(X) as a feature selection criterion with the goal of pattern representation involving structure-preserving transformations. It is interesting to note from the results of this chapter that, when the goal is pattern-class separation, H(X) must be minimized rather than maximized in the discrete case.

CHAPTER IV

DISTANCE AND INFORMATION MEASURES FOR COMBINED-EVALUATION AND INTERSET POST-EVALUATION: THE EXPECTED VALUE APPROACH TO THE M-CLASS PROBLEM

4.1 Introduction

In the past, combined-evaluation methods usually maximized either Shannon's mutual information [97]-[101], an approximation to it [102]-[104], or a modification of it [105], [106]. More recently, Vilmansen [107]-[109] has analyzed the concept of probabilistic dependence measures between features and classes as combined-evaluation criteria. Shannon's mutual information can be written as

$$I(C, X) = \sum_{k=1}^{M} P(C_k) \int P(X|C_k) \log\!\left[\frac{P(X|C_k)}{P(X)}\right] dX. \qquad (4.1)$$

Interset evaluation methods are characterized by having feature effectiveness criteria which are distance measures on a probability space. Such measures include the Matusita distance [110]-[112], the Kolmogorov variational distance [113], and the more often used Bhattacharyya distance [79], [113]-[117], as well as Kullback's divergence [28], [80], [113], [117], [118]. The Matusita distance between two pattern classes $C_i$ and $C_j$ is given by

$$M(C_i : C_j) = \left(\int \left\{[P(X|C_i)]^{1/2} - [P(X|C_j)]^{1/2}\right\}^2 dX\right)^{1/2}. \qquad (4.2)$$

The Kolmogorov variational distance is given by

$$V(C_i : C_j) = \int \left| P(X|C_i) - P(X|C_j) \right| dX. \qquad (4.3)$$

The Bhattacharyya distance is given by

$$B(C_i : C_j) = -\log \rho(C_i : C_j), \qquad (4.4)$$

where $\rho(C_i : C_j)$
i s the Bhattacharyya c o e f f i c i e n t given by p(C.:C.)= f [P(X|C )?(X|C ) ] l / 2 d X . ()4.5) Kullback's divergence i s given by f p(x|c) J(C. :Cj) = j [p(x|c.)-p(x|c J)]iog[^YTc iy] ( i X- ()+-6) Intraset evaluation for the M-class problem, where M > 2 , i s a d i r e c t extension of the two-class problem, where the c r i t e r i o n i s averaged over the M pattern classes. In i n t e r s e t evaluation the extension of the two-class problem to the M-class problem i s not as straightforward. At l e a s t two basic approaches t o the M-class problem have been studied i n k t h the past [ H 9 J . Let d. . be the distance measure between the i and IJ t h t h j pattern classes for the k , out of n, feature to be evaluated. In one approach (maximin c r i t e r i o n ) the minimum degree of s e p a r a b i l i t y i s maximized; that i s , max I mm k 1 i»J dk . I i = 1, 2 , M-l; _=i+l,.i+2, M, ' k = 1, 2 , n. In a more popular approach (expected c r i t e r i o n ) the expected value of the s e p a r a b i l i t y measures with respect to a l l pattern class p a i r s i s maxi-mized; that i s , M M max k E E P(c.)P(C.)dk. 1=1 3=1 1 3 1 J ; k = 1, 2 , , The two mu l t i c l a s s evaluation c r i t e r i a described above have been applied to feature s e l e c t i o n using the Kullback divergence j(c.:C.) as the s e p a r a b i l i t y measure d^ , and experimental evidence shows that the expected divergence y i e l d s a b e t t e r set of features than does the maximum divergence [ 5 ] , [120] , and i s a measure of considerable i n t e r e s t for future research. As yet, very l i t t l e i s known about the expected 51 dii'-ergence i n i t s r e l a t i o n to Bayes error p r o b a b i l i t y and to other measures of i n t e r e s t . In [120] Grettenberg gives a lower bound on the expected divergence, J , i n terms of Shannon's mutual information. 
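The two multiclass criteria above can be made concrete for the discrete case, with the divergence J(C_i:C_j) of (4.6) as the pairwise separability measure d^k_{ij}. The following is an illustrative sketch in modern notation (Python); it is not part of the original development, and the distributions used are hypothetical and assumed strictly positive.

```python
import math

def divergence(p, q):
    """Symmetric divergence J(Ci:Cj) of (4.6) for two discrete PDFs
    given as equal-length lists of strictly positive probabilities."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

def maximin_score(class_pdfs):
    """Maximin criterion: the minimum pairwise separability of a feature."""
    M = len(class_pdfs)
    return min(divergence(class_pdfs[i], class_pdfs[j])
               for i in range(M - 1) for j in range(i + 1, M))

def expected_score(class_pdfs, priors):
    """Expected criterion: prior-weighted average pairwise separability
    over all pattern-class pairs (terms with i = j contribute zero)."""
    M = len(class_pdfs)
    return sum(priors[i] * priors[j] * divergence(class_pdfs[i], class_pdfs[j])
               for i in range(M) for j in range(M))
```

A feature would then be selected by evaluating each of the n candidate features in this way and keeping the one with the largest score under the chosen criterion.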
Very recently, Babu and Kalra have found an upper bound on error probability in terms of the expected divergence for Gaussian distributions [121].

4.2 Information and Distance Measures as Feature Evaluation Criteria

4.2.1 The Directed and Symmetric Divergence

Let P_1(X) and P_2(X) be two PDF's. It is well known that Kullback's discrimination information, or directed divergence [80], between P_1(X) and P_2(X) is given by

D_{12} = \int P_1(X) \log[ P_1(X) / P_2(X) ] dX,    (4.7)

and that the symmetric divergence is given by

J_{12} = \int [ P_1(X) - P_2(X) ] \log[ P_1(X) / P_2(X) ] dX.    (4.8)

When using these measures for feature evaluation in pattern recognition it is convenient to distinguish between two types of directed as well as symmetric divergence. Let the directed class-feature divergence be defined as

D(C_i:C_j) = \int P(X|C_i) \log[ P(X|C_i) / P(X|C_j) ] dX,    (4.9)

and let the directed divergence between the class-features and the general-features be defined as

D(C_i:X) = \int P(X|C_i) \log[ P(X|C_i) / P(X) ] dX.    (4.10)

Similarly, let the symmetric class-feature divergence be defined as

J(C_i:C_j) = \int [ P(X|C_i) - P(X|C_j) ] \log[ P(X|C_i) / P(X|C_j) ] dX,    (4.11)

and let the symmetric divergence between the class-features and the general-features be defined as

J(C_i:X) = \int [ P(X|C_i) - P(X) ] \log[ P(X|C_i) / P(X) ] dX.    (4.12)

The symmetric class-feature divergence of (4.11) measures the distance between the two class-conditional PDF's P(X|C_i) and P(X|C_j), and, as such, expresses the discrimination capability between pattern class i and pattern class j. The symmetric divergence given by (4.12), on the other hand, measures the distance between P(X|C_i) and P(X), and conveys the uniqueness of the class-features of the i-th pattern class, or the dependence between the general-features and the class-features of the i-th pattern class. It is the expected symmetric class-feature divergence given by

\bar{J} = \sum_{i=1}^{M} \sum_{j=1}^{M} P(C_i) P(C_j) J(C_i:C_j)    (4.13)

that has been used for multiclass feature selection. In the remaining portions of this chapter the terms divergence and expected divergence will be used, respectively, for the symmetric and expected symmetric class-feature divergence. It should be noted that the expected directed divergence between the class-features and the general-features is equivalent to Shannon's mutual information, i.e.,

\sum_{k=1}^{M} P(C_k) D(C_k:X) = I(C:X).    (4.14)

4.2.2 Four Proposed Feature Evaluation Criteria

The first proposed combined-evaluation measure is the expected directed divergence between the general-features and the class-features given by

I(X:C) = \sum_{k=1}^{M} P(C_k) D(X:C_k),    (4.15)

where

D(X:C_k) = \int P(X) \log[ P(X) / P(X|C_k) ] dX.

Additional theoretical justification for use of this measure as a feature effectiveness criterion is given in section 4.3, where it is shown that the maximization of I(X:C) is a necessary condition for the maximization of the expected divergence if the mutual information remains constant. It should be noted that the negative of I(X:C) has appeared in the literature under the name error information [120]. Now, I(X:C) can be written as follows:

I(X:C) = H(X:C) - H(X),    (4.16)

where

H(X:C) = - \sum_{k=1}^{M} P(C_k) \int P(X) \log P(X|C_k) dX.    (4.17)

The second multiclass combined-evaluation criterion proposed is H(X:C), which is formed by adding H(X) to I(X:C), hence yielding a simpler evaluation criterion from the point of view of the computation required. Additional theoretical justification for using H(X:C) as a feature effectiveness criterion is given in section 4.3, where it is shown that the maximization of H(X:C) is a necessary condition for the maximization of the expected divergence if the average conditional entropy H(X|C) remains constant.

In pattern recognition problems it is instructive to consider the mutual information given by (4.1) as consisting of a combination of entropies, as

I = H(X) - H(X|C).    (4.18)

From (4.18) one can see that maximizing I involves maximizing H(X) and minimizing H(X|C). In other words, maximizing I simultaneously results in flattening the unconditional distribution and peaking each of the class-conditional distributions. Various other types of entropies besides the Shannon entropy have been proposed in pattern recognition [32] and information theory [122]. One of these is Vajda's quadratic entropy [32], discussed earlier in chapter III and given by

Q(X) = \int P(X) [ 1 - P(X) ] dX.    (4.19)

Just as Shannon's entropies are combined in the form of (4.18) to form the mutual information I, it is proposed here to combine Vajda's quadratic entropies to form the quadratic mutual information Q given by

Q = Q(X) - Q(X|C).    (4.20)

Algebra reduces (4.20) to

Q = \sum_{k=1}^{M} P(C_k) \int P(X|C_k) [ P(X|C_k) - P(X) ] dX.    (4.21)

One advantage that Q has over I for feature evaluation is that it is easier to compute. The Bhattacharyya coefficient \rho(C_i:C_j) is given by (4.5).
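For a discrete feature the entropy-based criteria of this subsection, I of (4.18), I(X:C) of (4.16), H(X:C) of (4.17), and Q of (4.20), all reduce to finite sums. The following is an illustrative sketch (Python, not part of the original text); the distributions are hypothetical and assumed strictly positive.

```python
import math

def info_criteria(cond_pdfs, priors):
    """Mutual information I of (4.18), the proposed I(X:C) of (4.16),
    H(X:C) of (4.17), and the quadratic mutual information Q of (4.20),
    for one discrete feature with strictly positive PDFs."""
    M, A = len(cond_pdfs), len(cond_pdfs[0])
    # unconditional distribution P(x)
    px = [sum(priors[k] * cond_pdfs[k][x] for k in range(M)) for x in range(A)]
    H_X = -sum(p * math.log(p) for p in px)
    H_XgC = -sum(priors[k] * sum(pk * math.log(pk) for pk in cond_pdfs[k])
                 for k in range(M))
    H_XC = -sum(priors[k] * sum(px[x] * math.log(cond_pdfs[k][x])
                                for x in range(A))
                for k in range(M))                       # (4.17)
    I = H_X - H_XgC                                      # (4.18)
    I_XC = H_XC - H_X                                    # (4.16)
    Q = (sum(p * (1.0 - p) for p in px)
         - sum(priors[k] * sum(pk * (1.0 - pk) for pk in cond_pdfs[k])
               for k in range(M)))                       # (4.20): Q(X) - Q(X|C)
    return I, I_XC, H_XC, Q
```

Note that Q avoids logarithms entirely, which is the computational advantage claimed for it above.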
For the multiclass problem, Kailath [79] suggested that the expected Bhattacharyya coefficient given by

\bar{\rho} = \sum_{i=1}^{M} \sum_{j=1}^{M} P(C_i) P(C_j) \rho(C_i:C_j)    (4.22)

be used for signal selection. The fourth multiclass post-evaluation criterion proposed here is the M-class Bhattacharyya coefficient \rho_M, first proposed by Bhattacharyya [114], [115] himself, and given by

\rho_M = \int [ P(X|C_1) P(X|C_2) \cdots P(X|C_M) ]^{1/M} dX.    (4.23)

This measure was also later and independently proposed by Matusita [123], [124] as a measure of the affinity of several distributions.

4.3 Properties and Inequalities

4.3.1 Properties of the Expected Divergence

Theorem 4.1:

\bar{J} = 2 [ I(C:X) + I(X:C) ].    (4.24)

Proof:

\bar{J} = 2 \sum_{i=1}^{M} P(C_i) \sum_{j=1}^{M} P(C_j) \int P(X|C_i) \log[ P(X|C_i) / P(X|C_j) ] dX.    (4.25)

Expanding (4.25) and adding and subtracting the term

2 \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log P(X) dX

yields

\bar{J} = 2 \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log[ P(X|C_i) / P(X) ] dX
        + 2 \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log P(X) dX
        - 2 \sum_{i=1}^{M} P(C_i) \sum_{j=1}^{M} P(C_j) \int P(X|C_i) \log P(X|C_j) dX.    (4.26)

In (4.26) it can be shown that

2 \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log P(X) dX = 2 \sum_{i=1}^{M} P(C_i) \int P(X) \log P(X) dX,    (4.27)

and

- 2 \sum_{i=1}^{M} P(C_i) \sum_{j=1}^{M} P(C_j) \int P(X|C_i) \log P(X|C_j) dX = -2 \sum_{i=1}^{M} P(C_i) \int P(X) \log P(X|C_i) dX.    (4.28)

Substituting (4.28) and (4.27) into (4.26) yields (4.24).

It is interesting to note that while \bar{J} is an interset post-evaluation measure, it actually consists of a linear combination of two combined-evaluation measures, I(C:X) and I(X:C).
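In the discrete case both Bhattacharyya coefficients, \bar{\rho} of (4.22) and \rho_M of (4.23), are finite sums. The following is an illustrative sketch (Python, not part of the original text); the distributions in the test are hypothetical.

```python
import math

def bhatt_coeff(p, q):
    """Pairwise Bhattacharyya coefficient rho(Ci:Cj) of (4.5), discrete case."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def expected_bhatt(cond_pdfs, priors):
    """Expected Bhattacharyya coefficient rho-bar of (4.22); the sum runs
    over all (i, j), with rho(Ci:Ci) = 1 on the diagonal."""
    M = len(cond_pdfs)
    return sum(priors[i] * priors[j] * bhatt_coeff(cond_pdfs[i], cond_pdfs[j])
               for i in range(M) for j in range(M))

def mclass_bhatt(cond_pdfs):
    """M-class Bhattacharyya coefficient rho_M of (4.23): the integral
    (here a sum over feature values) of the geometric mean of the
    class-conditional PDFs."""
    M = len(cond_pdfs)
    return sum(math.prod(col) ** (1.0 / M) for col in zip(*cond_pdfs))
```

Both coefficients attain their maximum value of one exactly when the class-conditional distributions coincide, i.e., when the feature carries no discriminatory information.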
Since \bar{J} is a desirable measure from the pattern class discrimination point of view, an examination of (4.24) shows that the use of only part of it is an arbitrary choice. Hence, one can justify using the proposed feature evaluation measure I(X:C) just as much as Shannon's mutual information. It is of interest to find out the differences in the features selected by both of these measures.

Theorem 4.2:

\bar{J} \geq 2 I(X:C).    (4.29)

Proof: It is well known that the mutual information is a positive semi-definite function, i.e.,

I(C:X) \geq 0.    (4.30)

Substituting (4.30) into (4.24) yields (4.29).

Theorem 4.3: The expected divergence is bounded below by twice the mutual information, i.e.,

\bar{J} \geq 2 I(C:X).    (4.31)

A proof analogous to that of theorem 4.2 has been suggested for theorem 4.3 in [101], but an alternate proof is presented here.

Proof: Applying algebra to (4.25) yields

\bar{J} = 2 \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log \prod_{j=1}^{M} [ P(X|C_i) / P(X|C_j) ]^{P(C_j)} dX,    (4.32)

which can be modified as follows:

\bar{J} = 2 \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log[ P(X|C_i) / \prod_{j=1}^{M} [P(X|C_j)]^{P(C_j)} ] dX.    (4.33)

Now, Shannon's mutual information can be written as

I(C:X) = \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log[ P(X|C_i) / \sum_{j=1}^{M} P(X|C_j) P(C_j) ] dX.    (4.34)

Note that in (4.33) and (4.34) the denominator of the argument of the logarithm is the geometric mean and arithmetic mean, respectively, of the class-conditional PDF's. It is known from the arithmetic-mean-geometric-mean inequality that, for any value of X,

\prod_{j=1}^{M} [P(X|C_j)]^{P(C_j)} \leq \sum_{j=1}^{M} P(X|C_j) P(C_j).    (4.35)

Applying (4.35) to (4.33) and (4.34) yields (4.31).

Theorem 4.4: The expected divergence is equal to twice the expected symmetric divergence between the class-features and the general-features, i.e.,

\bar{J} = 2 \bar{J}(C:X).    (4.36)

Proof: Substituting (4.28) into (4.25) and rearranging terms yield

\bar{J} = 2 \sum_{i=1}^{M} P(C_i) \int [ P(X|C_i) - P(X) ] \log P(X|C_i) dX.    (4.37)

Rearranging (4.27) yields

2 \sum_{i=1}^{M} P(C_i) \int [ P(X|C_i) - P(X) ] \log P(X) dX = 0.    (4.38)

Subtracting (4.38) from (4.37) yields (4.36).

It should be noted that \bar{J}(C:X) has appeared in the literature before. Joshi [125] has proposed a channel capacity measure based on \bar{J}(C:X) and presents some of its properties. Also, it should be noted that \bar{J}(C:X) can be written as

\bar{J}(C:X) = \sum_{k=1}^{M} \int [ P(X,C_k) - P(X)P(C_k) ] \log[ P(X,C_k) / (P(X)P(C_k)) ] dX.    (4.39)

In the joint feature-class space \bar{J}(C:X) is thus the symmetric divergence between two distributions, P(X,C) and P(X)P(C). Hence, \bar{J}(C:X) is a measure of probabilistic dependence between features and classes. In [107] Vilmansen suggested that there was a close relation between feature-class dependence and discrimination or distance between class-conditional distributions. Since \bar{J} is a distance measure and \bar{J}(C:X) is a dependence measure, theorem 4.4 adds strong support to Vilmansen's conjecture. This type of dependence should not be confused with measurement dependence as considered in [126], [127].

Theorem 4.5: The expected divergence is equal to twice the proposed feature evaluation criterion of (4.17) less the average conditional entropy, i.e.,

\bar{J} = 2 [ H(X:C) - H(X|C) ].    (4.40)

Proof: It can easily be shown that

I(C:X) = H(X) - H(X|C),    (4.41)

and

I(X:C) = H(X:C) - H(X).    (4.42)

Substituting (4.41) and (4.42) into (4.24) yields (4.40). Equation (4.40) shows that, in the past, when researchers minimized H(X|C) they were maximizing only a necessary condition for the maximization of the expected divergence.
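The identities and bounds of theorems 4.1 through 4.5 can be checked numerically in the discrete case. The following is an illustrative sketch (Python, not part of the original derivation); the distributions are hypothetical and assumed strictly positive.

```python
import math

def check_divergence_identities(cond_pdfs, priors, tol=1e-9):
    """Numerically verify theorems 4.1-4.5 for discrete, strictly
    positive class-conditional PDFs (natural logarithms throughout)."""
    M, A = len(cond_pdfs), len(cond_pdfs[0])
    px = [sum(priors[k] * cond_pdfs[k][x] for k in range(M)) for x in range(A)]
    # expected divergence J-bar of (4.13)
    J_bar = sum(priors[i] * priors[j] * (cond_pdfs[i][x] - cond_pdfs[j][x])
                * math.log(cond_pdfs[i][x] / cond_pdfs[j][x])
                for i in range(M) for j in range(M) for x in range(A))
    # mutual information I(C:X) of (4.1) and the proposed I(X:C) of (4.15)
    I_CX = sum(priors[k] * cond_pdfs[k][x] * math.log(cond_pdfs[k][x] / px[x])
               for k in range(M) for x in range(A))
    I_XC = sum(priors[k] * px[x] * math.log(px[x] / cond_pdfs[k][x])
               for k in range(M) for x in range(A))
    # entropies entering (4.40)-(4.42)
    H_X = -sum(p * math.log(p) for p in px)
    H_XgC = -sum(priors[k] * cond_pdfs[k][x] * math.log(cond_pdfs[k][x])
                 for k in range(M) for x in range(A))
    H_XC = I_XC + H_X                                    # (4.16)
    # expected symmetric divergence J-bar(C:X) entering theorem 4.4
    J_CX = sum(priors[k] * (cond_pdfs[k][x] - px[x])
               * math.log(cond_pdfs[k][x] / px[x])
               for k in range(M) for x in range(A))
    assert abs(J_bar - 2 * (I_CX + I_XC)) < tol          # theorem 4.1
    assert J_bar >= 2 * I_XC - tol                       # theorem 4.2
    assert J_bar >= 2 * I_CX - tol                       # theorem 4.3
    assert abs(J_bar - 2 * J_CX) < tol                   # theorem 4.4
    assert abs(J_bar - 2 * (H_XC - H_XgC)) < tol         # theorem 4.5
    return J_bar, I_CX, I_XC
```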
It is interesting to note that, while \bar{J} is an interset measure, it is a linear functional of an intraset measure and a combined measure. Using only H(X|C) for feature evaluation is an arbitrary choice from the point of view of discrimination. Hence, one can justify using the proposed feature evaluation measure H(X:C) just as much as the average conditional entropy.

4.3.2 Properties of the M-class Bhattacharyya Coefficient

Theorem 4.6: The M-class Bhattacharyya coefficient is bounded above by a functional of the pairwise Bhattacharyya coefficients through the following inequality:

\rho_M \leq [ 2/(M(M-1)) ]^{1/2} \sum_{i<j} \rho(C_i:C_j).    (4.43)

Proof: The integrand of \rho_M can be considered to be the geometric mean of P(X|C_1), P(X|C_2), ..., P(X|C_M). From the inequality of symmetric means [128] it follows that

[ P(X|C_1) P(X|C_2) \cdots P(X|C_M) ]^{1/M} \leq [ (2/(M(M-1))) \sum_{i<j} P(X|C_i) P(X|C_j) ]^{1/2},    (4.44)

from which it follows that

\rho_M \leq [ 2/(M(M-1)) ]^{1/2} \int [ \sum_{i<j} P(X|C_i) P(X|C_j) ]^{1/2} dX.    (4.45)

For any nonnegative real numbers a_i, i = 1, 2, ..., n, we have the inequality [129]

\sum_{i=1}^{n} \sqrt{a_i} \geq [ \sum_{i=1}^{n} a_i ]^{1/2},    (4.46)

from which it follows that

[ \sum_{i<j} P(X|C_i) P(X|C_j) ]^{1/2} \leq \sum_{i<j} [ P(X|C_i) P(X|C_j) ]^{1/2}.    (4.47)

Substituting (4.47) into (4.45) and interchanging the summation and integration signs yields (4.43).

Matusita [123] has shown that \rho_M is related to the generalized distance d_r, where

d_r(C_i:C_j) = { \int | [P(X|C_i)]^{1/r} - [P(X|C_j)]^{1/r} |^r dX }^{1/r},    (4.48)

by the relationship \rho_M \geq 1 - (M-1)\delta, where, for any pair (i,j), i, j = 1, 2, ..., M, we have d_M(C_i:C_j) \leq \delta. In the following theorem it is shown that \rho_M is bounded from above by a function of the pairwise distances d_M(C_i:C_j).

Theorem 4.7:

\rho_M \leq [ 2/(M(M-1)) ]^{1/2} \sum_{i<j} { 1 - (1/4)[ d_M(C_i:C_j) ]^{2M} }^{1/2}.    (4.49)

Proof: For r = 1, d_r(C_i:C_j) reduces to the well known Kolmogorov variational distance, and the following relation [79] holds:

d_1(C_i:C_j) \leq 2 [ 1 - \rho^2(C_i:C_j) ]^{1/2}.    (4.50)

Also, Matusita [123] showed that

d_1(C_i:C_j) \geq [ d_r(C_i:C_j) ]^r.    (4.51)

Combining (4.50) and (4.51) and using the result of theorem 4.6 yields (4.49).

4.3.3 Some Inequalities Between the Expected Divergence and the M-class Bhattacharyya Coefficient

Theorem 4.8: For equiprobable pattern classes, the expected divergence is related to the M-class Bhattacharyya coefficient by the following inequality:

\bar{J} \geq 2 ( 1 - \rho_M ).    (4.52)

Proof: The expected divergence can be written as in (4.33), which can be written as follows:

\bar{J} = 2 \sum_{i=1}^{M} P(C_i) \int P(X|C_i) \log P(X|C_i) dX - 2 \sum_{i=1}^{M} P(C_i) K(X,i),    (4.53)

where

K(X,i) = \int P(X|C_i) \log G(X) dX,    (4.54)

and

G(X) = \prod_{j=1}^{M} [ P(X|C_j) ]^{P(C_j)}.    (4.55)

Since \int P(X) dX = 1, it follows that

\int [ \sum_{j=1}^{M} P(X|C_j) P(C_j) ] dX = 1.    (4.56)

From (4.35) and (4.56) it follows that

G \leq 1,    (4.57)

where

G = \int \prod_{j=1}^{M} [ P(X|C_j) ]^{P(C_j)} dX.    (4.58)

Let M correction functions C(X,i), i = 1, 2, ..., M, be defined such that G(X) can be written as

G(X) = P(X|C_i) + C(X,i),    i = 1, 2, ..., M,
     = P(X|C_i) [ 1 + C(X,i)/P(X|C_i) ].    (4.59)

Integrating (4.59) and using (4.57) yields

\int C(X,i) dX = G - 1 \leq 0.    (4.60)

Substituting (4.59) into (4.54) yields

K(X,i) = \int P(X|C_i) \log P(X|C_i) dX + \int P(X|C_i) \log[ 1 + C(X,i)/P(X|C_i) ] dX.    (4.61)

Now, it can easily be verified that, for any real z, the following inequality [92] holds:

z \geq \log( 1 + z ).    (4.62)

Applying (4.62) to (4.61), where z = C(X,i)/P(X|C_i), yields

K(X,i) \leq \int P(X|C_i) \log P(X|C_i) dX + \int C(X,i) dX.    (4.63)

Substituting (4.54) and (4.60) into (4.63) yields

\int P(X|C_i) \log[ P(X|C_i)/G(X) ] dX \geq 1 - G,    (4.64)

for i = 1, 2, ..., M. Substituting (4.64) into (4.33) yields

\bar{J} \geq 2 \sum_{i=1}^{M} P(C_i) ( 1 - G ),    (4.65)

where G is a function of P(C_i), i = 1, 2, ..., M. Now, when P(C_i) = 1/M for all i, G reduces to \rho_M, the M-class Bhattacharyya coefficient, and (4.65) reduces to (4.52).

Theorem 4.9: For equiprobable pattern classes,

\bar{J} \geq -2 \log \rho_M.    (4.66)

Proof: The expected divergence, for equiprobable pattern classes, can be written as

\bar{J} = -(2/M) \sum_{i=1}^{M} \int P(X|C_i) \log[ \prod_{j=1}^{M} [P(X|C_j)]^{1/M} / P(X|C_i) ] dX.    (4.67)

The remainder of the proof follows via Jensen's inequality [129]. Let E{·} denote expected value taken with respect to P(X|C_i). Then,

\bar{J} = -(2/M) \sum_{i=1}^{M} E{ \log[ G(X)/P(X|C_i) ] } \geq -(2/M) \sum_{i=1}^{M} \log[ \int G(X) dX ].    (4.68)

Equation (4.66) follows directly from (4.68).

4.3.4 A Generalization of the M-class Bhattacharyya Coefficient and its Relation to the Expected Divergence

Bhattacharyya's M-class coefficient defined by (4.23) has no provisions for taking a priori probabilities into account; i.e., it assumes pattern classes are equiprobable. In this subsection a generalization of \rho_M is proposed which takes a priori probabilities into account, and it is related to the expected divergence. The generalized M-class Bhattacharyya coefficient \rho_M^* is defined as

\rho_M^* = \int \prod_{i=1}^{M} [ P(X|C_i) ]^{P(C_i)} dX.    (4.69)

With proofs similar to those of theorems 4.8 and 4.9 it can be shown, respectively, that

\bar{J} \geq 2 [ 1 - \rho_M^* ]    (4.70)

and

\bar{J} \geq -2 \log \rho_M^*.
(4.71)

4.3.5 Lower Bounds on the Expected Divergence in Terms of Pairwise Distance Measures

It can easily be shown that the expected divergence \bar{J} is uniquely related to the expected directed class-feature divergence E_{ij}{ D(C_i:C_j) }, where D(C_i:C_j) is given by (4.9), by the relation

\bar{J} = 2 E_{ij}{ D(C_i:C_j) }.    (4.72)

By substituting P(X|C_i) and P(X|C_j) for P_1(X) and P_2(X), respectively, in (3.21) and (3.22), and using the relations (3.23)-(3.26), we obtain, respectively, (4.73)-(4.76):

D(C_i:C_j) \geq V(C_i:C_j) - \log[ 1 + V(C_i:C_j) ],    (4.73)

D(C_i:C_j) \geq (1/2) [ V(C_i:C_j) ]^2 + (1/36) [ V(C_i:C_j) ]^4,    (4.74)

D(C_i:C_j) \geq -\log{ 1 - (1/4) [ V(C_i:C_j) ]^2 },    (4.75)

D(C_i:C_j) \geq \log[ (2 + V(C_i:C_j)) / (2 - V(C_i:C_j)) ] - 2 V(C_i:C_j) / (2 + V(C_i:C_j)).    (4.76)

Substituting (4.73)-(4.76) into (4.72), we obtain lower bounds on \bar{J} in terms of the pairwise Kolmogorov variational distances. Using the relations [79]

V(C_i:C_j) \geq 2 [ 1 - \rho(C_i:C_j) ],    (4.77)

V(C_i:C_j) \leq 2 { 1 - [ \rho(C_i:C_j) ]^2 }^{1/2},    (4.78)

yields lower bounds on \bar{J} in terms of the pairwise Bhattacharyya coefficients. For equiprobable classes, the author [130] has shown that V(C_i:C_j) is uniquely related to the pairwise error probability P_e(C_i:C_j) by

V(C_i:C_j) = 2 [ 1 - M P_e(C_i:C_j) ].    (4.79)

Substituting (4.79) into (4.72) and (4.73)-(4.76) yields lower bounds on \bar{J} in terms of the pairwise error probabilities. In the interest of brevity the reader is referred to [130] for the details of the derivations of this subsection.

4.3.6 Lower Bounds on Error Probability in Terms of the Expected Divergence

For a finite set of nonnegative numbers a_1, a_2, ..., a_m such that \sum_{k=1}^{m} a_k = 1, let

H(a_1, a_2, ..., a_m) = - \sum_{k=1}^{m} a_k \log a_k.    (4.80)

From the lemma of Chu and Chueh [87] it follows that

H(C) \geq U(k_c p_c),

where

U(k_c p_c) = H(k_c p_c, 1 - k_c p_c) + k_c p_c \log k_c,    (4.81)

where p_c = max_i{ P(C_i) } and k_c is the largest integer smaller than 1/p_c. Fano's inequality [131] is given by

H(C|X) \leq U(P_e),

where

U(P_e) = H(P_e, 1 - P_e) + P_e \log( M - 1 ).    (4.82)

From theorem 4.3 it follows that

\bar{J} \geq 2 [ H(C) - H(C|X) ].    (4.83)

Substituting (4.82) and (4.81) into (4.83) gives a lower bound on \bar{J} in terms of P_e as

\bar{J} \geq 2 [ U(k_c p_c) - U(P_e) ].    (4.84a)

For equiprobable pattern classes k_c = M - 1, p_c = 1/M, and (4.84a) reduces to

\bar{J} \geq 2 [ \log M - U(P_e) ].    (4.84b)

Consider feature number one used in the experiments. The estimated error probability is P_e = 0.925, and the analytical evaluation using the training data yields P_e = 0.910. Substituting P_e = 0.910 and M = 26 in (4.84b) yields \bar{J} \geq 0.052; the actual value of \bar{J} was calculated as 0.821 using the training data. For the two-class problem (4.84b) reduces to

J(C_i:C_j) \geq 4 [ \log 2 - H(P_e, 1 - P_e) ].    (4.84c)

For the example considered in section 3.4, P_e = 0.2 and J(C_i:C_j) = 1.663. Substituting for P_e in (4.84c) yields J(C_i:C_j) \geq 0.780. It should be noted that the bound given by (4.84a) is attainable for any value of M, and a sufficient condition for attainability is that \bar{J} = 0 and P(C_i) = 1/M for all i.

The next theorem presents two crude lower bounds on P_e in terms of \bar{J}.

Theorem 4.10: For equiprobable pattern classes the error probability is related to the expected divergence by the following inequalities:

P_e \geq \xi ( 1 - \bar{J}/2 )^M,    (4.85)

P_e \geq \xi \exp( -(M/2) \bar{J} ),    (4.
86)

where \xi = 1/(2M \cdot 2^{M-1}).

Proof: From theorems 4.8 and 4.9 it follows, respectively, that

\rho_M \geq 1 - \bar{J}/2    (4.87)

and

\rho_M \geq \exp( -\bar{J}/2 ).    (4.88)

Also, Matusita [124] has shown that

P_e \geq \xi ( \rho_M )^M.    (4.89)

Substituting (4.87) and (4.88) into (4.89) yields, respectively, (4.85) and (4.86).

To the author's knowledge these are the only available bounds on error probability in terms of the expected divergence for the distribution-free case. Unfortunately these bounds are rather loose. For example, (4.85) is not attainable, and while (4.86) is attainable when P_e = 0, it is not necessarily attainable when \bar{J} = 0, and it becomes very loose as M becomes large because of the dominating influence of \xi. Consider the example of section 3.4, where P_e = 0.2 and J(C_i:C_j) = 1.663. Substituting for \bar{J} in (4.85) and (4.86) yields, respectively, P_e \geq 0.0426 and P_e \geq 0.0545. Hence these bounds are of theoretical rather than practical interest.

4.3.7 An Upper Bound on Error Probability in Terms of the M-class Bhattacharyya Coefficient

Theorem 4.11:

P_e \leq (M/2) [ 1 + ( P(C_1) P(C_2) \cdots P(C_M) )^{1/M} \rho_M ] - 1.    (4.90)

Proof: Let n \geq 2, 0 < q_1 \leq q_2 \leq ... \leq q_n, x_i \geq 0, i = 1, 2, ..., n, and \sum_{i=1}^{n} q_i = 1. Then Kober's inequality [132] is given by

q_1 (n-1)^{-1} \leq A_n / \sum_{i<j} [ \sqrt{x_i} - \sqrt{x_j} ]^2 \leq q_n,    (4.91)

where

A_n = \sum_{i=1}^{n} q_i x_i - x_1^{q_1} x_2^{q_2} \cdots x_n^{q_n}.

Letting x_i = P(X,C_i), n = M, and q_1 = ... = q_n = 1/M, the righthand side of (4.91) reduces to

P(X) - M [ P(X,C_1) P(X,C_2) \cdots P(X,C_M) ]^{1/M} \leq \sum_{i<j} [ \sqrt{P(X,C_i)} - \sqrt{P(X,C_j)} ]^2.    (4.92)

Integrating both sides of (4.92) yields

1 - M [ P(C_1) P(C_2) \cdots P(C_M) ]^{1/M} \rho_M \leq \sum_{i<j} \int [ \sqrt{P(X,C_i)} - \sqrt{P(X,C_j)} ]^2 dX.    (4.93)

Expanding the righthand side of (4.93), it can be shown that it reduces to

M - 1 - 2 \sum_{i<j} \sqrt{ P(C_i) P(C_j) } \rho(C_i:C_j) \equiv B.    (4.94)

Furthermore, Lainiotis [133] has shown that

P_e \leq \sum_{i<j} \sqrt{ P(C_i) P(C_j) } \rho(C_i:C_j) = (1/2)( M - 1 - B ).    (4.95)

Combining (4.95), (4.94) and (4.93), and rearranging terms, yields (4.90).

To the author's knowledge the bound given by (4.90) is the only upper bound on error probability in terms of the M-class Bhattacharyya coefficient available in the literature. Unfortunately the bound becomes very loose as M increases. In fact, for M \geq 4 the bound is useless. For example, consider the first feature of the experiments, where M = 26, P_e = 0.910, and \rho_M = 0.799. Substituting for \rho_M in (4.90) yields P_e \leq 12.4. Hence the bound is of value only for M = 2, 3. It should be noted that for two equiprobable pattern classes the bound given by (4.90) reduces to

P_e \leq (1/2) \rho(C_i:C_j),

which is a well known bound first derived by Kailath [79]. Hence (4.90) is not only proof that the minimization of \rho_M minimizes the upper bound on P_e, but represents a generalization of Kailath's bound to the M-class problem. Consider the two-class example of section 3.4, where P_e = 0.2 and \rho(C_i:C_j) = 0.8. Substituting in (4.90) for \rho(C_i:C_j) yields P_e \leq 0.4. This result suggests that for M = 2 (4.90) is tighter than (3.49b) (P_e \leq 0.43) but not as tight as (3.46) (P_e \leq 0.36).

The upper bound given by (4.90) can be made independent of a priori probabilities. From the arithmetic-mean-geometric-mean inequality it follows that

[ P(C_1) P(C_2) \cdots P(C_M) ]^{1/M} \leq (1/M) \sum_{i=1}^{M} P(C_i) = 1/M.    (4.96)

Substituting (4.96) into (4.
90) yields the desired bound

P_e \leq (1/2)( M - 2 + \rho_M ).

Additional upper bounds, for the special case of Gaussian distributions, can be obtained by combining the results of Babu [121], [134] with the results of theorems 4.8 and 4.9. The details, however, will not be included here in the interest of brevity.

4.4 Experiments and Results

Experiments were performed in the same manner as in chapter III in order to compare the various feature evaluation criteria, and to determine relationships with each other in order to better understand them.

Experiment 1: This experiment was performed in order to compare the expected divergence with the Bayes recognition criterion and with other criteria of interest, such as the expected Kolmogorov variational distance, the expected Bhattacharyya coefficient, and the mutual information.

Experiment 2: Theorem 4.1 shows that the expected divergence is given by (4.24), where I(C:X) is the mutual information and I(X:C) is the multiclass feature evaluation measure given by (4.15). This experiment was performed in order to evaluate I(X:C) as a multiclass feature effectiveness criterion and to determine which of the two components of the expected divergence, I(C:X) or I(X:C), dominates.

Experiment 3: Theorem 4.5 shows that the expected divergence is given by (4.40), where H(X|C) is the average conditional entropy and H(X:C) is the multiclass feature evaluation measure given by (4.17). This experiment was performed in order to evaluate H(X:C) as a multiclass feature effectiveness criterion and to determine which of the two components of the expected divergence, H(X|C) or H(X:C), dominates.
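Throughout these experiments the feature orderings produced by two criteria are compared with the Spearman rank correlation coefficient r_s. A minimal sketch of its computation (Python, not part of the original text; the classical no-ties formula is assumed, with hypothetical orderings in the test):

```python
def spearman_rs(order_a, order_b):
    """Spearman rank correlation r_s between two feature orderings,
    each a list of the same feature labels from best (rank 1) to worst.
    Uses the no-ties formula r_s = 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(order_a)
    rank_a = {f: r for r, f in enumerate(order_a, start=1)}
    rank_b = {f: r for r, f in enumerate(order_b, start=1)}
    d2 = sum((rank_a[f] - rank_b[f]) ** 2 for f in order_a)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Identical orderings give r_s = 1 and exactly reversed orderings give r_s = -1.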
Experiment 4: This experiment was performed in order to evaluate the M-class Bhattacharyya coefficient given by (4.23) and to compare it with other measures of interest, particularly the expected Bhattacharyya coefficient. The results of the four experiments described above are shown in Table 4.1.

Experiment 5: This experiment was performed in order to evaluate the proposed quadratic mutual information Q given by (4.21) as a feature evaluation criterion with respect to the Bayes recognition criterion and to compare it to Shannon's mutual information. The Spearman rank correlation coefficients between the orderings are given in Table 4.2. Since it was of interest to determine whether Q could be used as a substitute for I (calculation of Q requires less computation than calculation of I), the actual feature orderings were obtained to get an intuitive feel for r_s = 0.947. Table 4.3 shows the ordering of the features with I and Q. For example, feature number 3 was ranked first by both I and Q, and feature number 5 was ranked 22nd by both I and Q. Careful observation shows that the rank position of any feature in one of the columns differs by at most 3 from the position in the other column, except for feature number 25, which was ranked 14th by I and 7th by Q.

          I(X:C)  H(X:C)  \rho_M  H(X|C)  I      \bar{J}  \bar{V}  \bar{\rho}
P_e       0.687   0.647   0.628   0.471   0.712  0.690    0.672    0.693
I(X:C)    -       0.613   0.992   0.863   0.982  0.992    0.977    0.993
H(X:C)    -       -       0.597   0.192   0.582  0.598    0.508    0.587
\rho_M    -       -       -       0.895   0.975  0.985    0.983    0.989
H(X|C)    -       -       -       -       0.885  0.874    0.924    0.882
I         -       -       -       -       -      0.993    0.981    0.993
\bar{J}   -       -       -       -       -      -        0.976    0.997
\bar{V}   -       -       -       -       -      -        -        0.986

SYMBOL        NAME OF MEASURE
P_e* **       Estimated Error Probability
I(X:C)        First Proposed Measure
H(X:C)        Second Proposed Measure
\rho_M        Third Proposed Measure
H(X|C)*       Average Conditional Entropy
I             Mutual Information
\bar{J}       Expected Divergence
\bar{V}       Expected Kolmogorov Variational Distance
\bar{\rho}    Expected Bhattacharyya Coefficient

For 23 degrees of freedom and the two-tailed test in [95], r_s must be greater than or equal to 0.396, 0.505, and 0.621 in order for it to be significant to at least the 0.05, 0.01, and 0.001 levels, respectively.
* Measures marked thus were minimized to produce an ordering; all other measures were maximized.
** Obtained from experiments with training and testing data. (See section 3.5.)

Table 4.1: Spearman rank correlation coefficients between feature orderings with combined and post-evaluation criteria.

        I      Q
P_e     0.712  0.630
I       1.0    0.947

Table 4.2: Spearman rank correlation coefficients between feature orderings with I and Q.

Rank   Ordering with I   Ordering with Q
1      3                 3
2      11                15
3      23                11
4      15                23
5      13                24
6      10                13
7      24                25
8      4                 20
9      20                10
10     22                14
11     16                4
12     14                19
13     12                16
14     25                22
15     19                12
16     2                 2
17     9                 18
18     18                9
19     6                 21
20     21                1
21     8                 6
22     5                 5
23     1                 8
24     7                 17
25     17                7

Table 4.3: Feature orderings with I and Q.

4.5 Discussion and Conclusions

Experiment 1: It is noted from Table 4.1 that the expected divergence, the mutual information, the expected Kolmogorov variational distance, and the expected Bhattacharyya coefficient are very similar measures, as evidenced by the fact that r_s[\bar{J};I] = 0.993, r_s[\bar{J};\bar{V}] = 0.976, and r_s[\bar{J};\bar{\rho}] = 0.997. Furthermore, \bar{J} performs about as well as I, \bar{V}, and \bar{\rho} with respect to the Bayes recognition criterion.
The combined-evaluation measure I, with r_s[P_e;I] = 0.712, is slightly better than the post-evaluation measures J, V, and ρ, with r_s[P_e;J] = 0.690, r_s[P_e;V] = 0.672, and r_s[P_e;ρ] = 0.693, respectively.

Experiment 2: The results show that Ī(X:C) is very similar to J, V, and ρ, and is most similar to ρ, since r_s[Ī(X:C);ρ] = 0.993. Also, Ī(X:C) is a good measure from the point of view of the Bayes recognition criterion, with r_s[Ī(X:C);P_e] = 0.687. Comparing Ī(X:C) with I, the mutual information, it is noted that they are very similar measures, r_s[Ī(X:C);I] = 0.982, and both measures, which are linear components of the expected divergence, are equally similar to J: r_s[Ī(X:C);J] = 0.992 and r_s[I;J] = 0.993. Furthermore, the proposed combined-evaluation measure Ī(X:C) performed just as well as J, since r_s[Ī(X:C);P_e] = 0.687 and r_s[J;P_e] = 0.690, and almost as well as the mutual information, r_s[I;P_e] = 0.712.

Experiment 3: The second proposed combined-evaluation measure is H(X:C), given by (4.15). This measure has the form of an entropy measure, such as H(X|C), and requires the same amount of computation as H(X|C) once the unconditional PDF's have been estimated or calculated from the estimates of the class-conditional PDF's. Nevertheless, the experimental results show that H(X:C) is a measure quite different from all the others, and especially from H(X|C), since r_s[H(X|C);H(X:C)] = 0.192. Both H(X:C) and H(X|C) are equally weighted components of the expected divergence, as shown in theorem 4.5, but the experimental results show that H(X|C) is the dominant component of J, since r_s[H(X|C);J] = 0.874 and r_s[H(X:C);J] = 0.598.
However, from the Bayesian recognition point of view, the proposed measure seems much better than H(X|C), as indicated by the fact that r_s[H(X:C);P_e] = 0.647 while r_s[H(X|C);P_e] = 0.471. In addition, H(X:C) compares favourably with J, V, and ρ. This latter observation, together with the fact that H(X:C) is a very simple measure and requires much less computation than J, V, or ρ for large M, makes it a very attractive measure for multiclass feature evaluation.

Experiment 4: The results of this experiment showed that although the M-class Bhattacharyya coefficient performed quite well, r_s[ρ_M;P_e] = 0.628, and is indeed suitable for feature evaluation, it did not perform quite as well as the other interset post-evaluation measures. Nevertheless, it still performed significantly better than the intraset post-evaluation criterion H(X|C). It was of particular interest to compare ρ_M with ρ. Although ρ_M is not quite as good as ρ from the performance point of view, r_s[ρ_M;P_e] = 0.628 and r_s[ρ;P_e] = 0.693, both measures are very similar for ordering features, r_s[ρ;ρ_M] = 0.989, and ρ_M is more economical to use than ρ, depending on the relative cost of taking an Mth root versus taking a square root. For example, calculation of ρ requires M(M−1)λ^n/2 multiplications and square roots, and [M(M−1)λ^n/2] − 1 additions, while calculation of ρ_M requires (M−1)λ^n multiplications, λ^n Mth roots, and λ^n − 1 additions. Therefore, even if one Mth root requires as much computation as M(M−1)/2 square roots, ρ_M still requires less computation than ρ, especially as the number of pattern classes increases. It is of interest to note from Table 4.1 that the best measure is the mutual information, while the worst is the average conditional entropy.
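As a rough illustration of these operation counts, the following sketch tallies the work for both criteria (the counting functions and the sample values of M, λ and n are ad hoc assumptions for illustration, not from the experiments):

```python
# Operation counts for the expected Bhattacharyya coefficient rho
# (all class pairs) versus the M-class coefficient rho_M, for
# M classes and lam**n discrete feature states.

def ops_rho(M, states):
    pairs = M * (M - 1) // 2
    return {"mults": pairs * states,    # one product per state per pair
            "roots": pairs * states,    # one square root each
            "adds":  pairs * states - 1}

def ops_rho_M(M, states):
    return {"mults": (M - 1) * states,  # chain of M-fold products
            "roots": states,            # one M-th root per state
            "adds":  states - 1}

M, lam, n = 26, 2, 10
states = lam ** n
print(ops_rho(M, states))
print(ops_rho_M(M, states))
```

Even charging each Mth root at the cost of M(M−1)/2 square roots makes the root costs equal, so ρ_M wins on the multiplication and addition counts alone, and increasingly so as M grows.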
Also, the two most similar measures are the expected divergence and the expected Bhattacharyya coefficient, while the two least similar measures are the second proposed measure and the average conditional entropy.

Experiment 5: The results of Tables 4.2 and 4.3 suggest that Q and I are two very similar measures of information. With respect to the estimated Bayes error probability P_e, the feature ordering produced by Q is not quite as highly correlated as that produced by I. Nevertheless, for all practical purposes, Q seems to be as effective as I for feature evaluation in pattern recognition, and has the added nice property that it requires less computation.

Besides subsections 4.3.2, 4.3.4, and 4.3.5, which are also found in [130] and [135], most of the remainder of this chapter can be found in [136], except for the work concerning the quadratic mutual information, which is found in [137], and some work on the two-class problem, which can be found in [138].

CHAPTER V

THE GENERALIZED DISTANCE APPROACH TO THE M-CLASS PROBLEM

5.1 Introduction

Although the expected divergence is an important and useful feature evaluation criterion in the multiclass statistical pattern recognition problem, as is demonstrated in chapter IV, in certain circumstances it is not totally satisfactory as a multiclass criterion. The disadvantage of the expected divergence is that the operation of averaging causes a loss in the true meaning of the divergence among a group of more than two probability distributions.
This is manifested in the fact that although the expected divergence, being a multiclass criterion, should measure the global characteristics of a group of distributions, it is in fact overly sensitive to local characteristics. For example, the divergence has the property that when two distributions are non-overlapping, perfect discrimination is possible and the divergence is equal to infinity. Therefore, when any two out of a group of pattern-class distributions are non-overlapping for a particular feature, the feature will be labelled as excellent no matter how poorly or how well it discriminates among the remainder of the pattern classes. Clearly, a better measure is sensitive to the discriminating capability of a feature on all the pattern class distributions even when two of them are non-overlapping. In this chapter a multiclass interset feature evaluation criterion is derived which not only has the desirable properties discussed above, but which is, in fact, a generalization of the divergence to the multiclass problem. The generalization of the divergence suggests similar generalizations of all probabilistic distance measures. Some of the more important measures are considered in detail. Inequalities and error bounds are derived, and experiments for comparison and evaluation are performed.

5.2 Proposed Generalized Distance Measures

In this section an information-theoretic derivation of the generalized divergence is presented, based on an interpretation of information which is more convenient, when treating pattern recognition problems, than Shannon's measure of information. The divergence for two classes C_i
and C_j, and for discrete distributions, can be written as

    J_ij = Σ_{k=1}^{λ^n} [P(X_k|C_i) − P(X_k|C_j)] log [P(X_k|C_i) / P(X_k|C_j)] .    (5.1)

Expanding (5.1) yields

    J_ij = Σ_{k=1}^{λ^n} P(X_k|C_i) log P(X_k|C_i) − Σ_{k=1}^{λ^n} P(X_k|C_i) log P(X_k|C_j)
         + Σ_{k=1}^{λ^n} P(X_k|C_j) log P(X_k|C_j) − Σ_{k=1}^{λ^n} P(X_k|C_j) log P(X_k|C_i) .    (5.2)

Let I_i be defined as the information contained in the feature vectors of class C_i, and let I_P be the maximum amount of information contained in the distribution P(X|C_i). The maximum amount of information contained in the distribution can be calculated by assuming each state of the feature vector to be equiprobable. Then

    I_P = n log λ .    (5.3)

Now, the information contained in the feature vectors of the ith class is given by

    I_i = n log λ + Σ_{k=1}^{λ^n} P(X_k|C_i) log P(X_k|C_i) .    (5.5)

Note that the expression given by (5.5) is not the information measure used when information theory is applied to communication channels; rather, it is Lewis' information measure defined in [139], which is very useful for the study of pattern recognition. When a sequence of symbols represents information to be transmitted over a communication channel, the usual information measure has the property that the flatter the distribution, the more information is contained in the sequence, on the average, and the more peaked the distribution, the more redundant is the sequence. In information transmission problems it is desired to send messages efficiently, and hence a flat distribution is desirable. On the other hand, when a sequence of symbols represents the output of a feature extractor, i.e.,
a feature vector, then the information measure of (5.5) has the property that the more the distribution of a pattern class is peaked, the more information it gives about the feature vectors of the pattern class in question. Lewis' measure of information implies that there is a fixed amount of information inherent in a feature extractor that generates a finite set of feature vectors, and that some of this information is contained in the probability distribution while the rest is contained in the feature vectors themselves. This measure also implies that a "good" feature extractor is one that, for a given pattern class, generates redundant feature vectors representing a highly peaked distribution. Let I_i^j be defined as the information that the distribution P(X|C_j) gives about the distribution P(X|C_i). Using similar notation as above, I_i^j can be written as

    I_i^j = n log λ + Σ_{k=1}^{λ^n} P(X_k|C_i) log P(X_k|C_j) .    (5.6)

Using (5.5) and (5.6) in (5.2), the divergence can be written as

    J_ij = [I_i + I_j] − [I_i^j + I_j^i] .    (5.7)

Equation (5.7) expresses the divergence as the sum of the amounts of information contained in the distributions less the sum of the amounts of information that the distributions give about each other. Equation (5.7) suggests a generalization of the divergence to the multiclass problem. Let I_k^k̄ be the information that all the pattern class distributions, excluding P(X|C_k), give about P(X|C_k). In other words, in a class space C_1,...,C_M, the complement of the kth class, denoted by k̄, is given by C_1 ∪ ... ∪ C_{k−1} ∪ C_{k+1} ∪ ... ∪ C_M. Then the generalized divergence is defined as

    J_G = Σ_{k=1}^{M} I_k − Σ_{k=1}^{M} I_k^k̄ .    (5.8)

Equation (5.6) suggests an analogous expression for I_k^k̄ in (5.8).
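The two-class decomposition (5.7) can be checked numerically against the defining sum (5.1); a minimal sketch, with arbitrary assumed discrete distributions:

```python
import math
import random

# Check of (5.7): J_ij equals [I_i + I_j] - [I_i^j + I_j^i], with
# I_i = n log(lambda) + sum_k P(X_k|C_i) log P(X_k|C_i)   (5.5)
# I_i^j = n log(lambda) + sum_k P(X_k|C_i) log P(X_k|C_j) (5.6)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

random.seed(1)
states = 8                    # lambda**n states, e.g. lambda = 2, n = 3
n_log_lam = math.log(states)  # n log(lambda) = log(lambda**n)
p = normalize([random.random() + 0.1 for _ in range(states)])
q = normalize([random.random() + 0.1 for _ in range(states)])

J = sum((a - b) * math.log(a / b) for a, b in zip(p, q))       # (5.1)
I_i  = n_log_lam + sum(a * math.log(a) for a in p)             # (5.5)
I_j  = n_log_lam + sum(b * math.log(b) for b in q)
I_ij = n_log_lam + sum(a * math.log(b) for a, b in zip(p, q))  # (5.6)
I_ji = n_log_lam + sum(b * math.log(a) for a, b in zip(p, q))

assert abs(J - ((I_i + I_j) - (I_ij + I_ji))) < 1e-12          # (5.7)
print(J)
```

The n log λ terms cancel in (5.7), which is why the divergence itself does not depend on the fixed information I_P of the feature extractor.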
The expression is given by

    I_k^k̄ = n log λ + Σ_{j=1}^{λ^n} P(X_j|C_k) log P(X_j|C_k̄) ,    (5.9)

where P(X|C_k̄) = P(X | C_1 ∪ ... ∪ C_{k−1} ∪ C_{k+1} ∪ ... ∪ C_M). In other words, P(X|C_k̄) is a mixture distribution formed from all, but the kth, pattern class distributions. It is easy to show that

    P(X|C_k̄) = (1/(M−1)) Σ_{j≠k} P(X|C_j) ,    (5.10)

for k = 1,2,...,M. Substituting (5.10) and (5.9) into (5.8) yields the following expression for the generalized divergence in terms of known quantities:

    J_G = Σ_{k=1}^{M} Σ_{j=1}^{λ^n} P(X_j|C_k) log [ P(X_j|C_k) / ((1/(M−1)) Σ_{ℓ≠k} P(X_j|C_ℓ)) ] .    (5.11)

Equation (5.11) can be modified to account for continuous features and a priori probabilities as follows:

    J_G = Σ_{i=1}^{M} P(C_i) ∫ P(X|C_i) log [ P(X|C_i) P(C_i) / ((1/(M−1)) Σ_{j≠i} P(X|C_j) P(C_j)) ] dX .    (5.12)

Let the mixture distribution μ(X,C_i) be defined as

    μ(X,C_i) = (1/(M−1)) Σ_{j≠i} P(X,C_j) .    (5.13)

Noting that, in the joint space of features and classes, the expression given by (5.12) is Kullback's discrimination information between P(X,C_i) and μ(X,C_i), we can similarly generalize all the other distance measures found in the literature to the M-class problem. The distance measures to be considered in this chapter are, in addition to J_G, the generalized Kolmogorov variational distance, given by

    V_G = Σ_{i=1}^{M} ∫ |P(X,C_i) − μ(X,C_i)| dX ,    (5.14)

the generalized Matusita distance, given by

    M_G^2 = Σ_{i=1}^{M} ∫ [√P(X,C_i) − √μ(X,C_i)]^2 dX ,    (5.15)

and the generalized Bhattacharyya coefficient, given by

    ρ_G = Σ_{i=1}^{M} ∫ √(P(X,C_i) μ(X,C_i)) dX .    (5.16)

5.3 Properties and Inequalities

The following properties can be derived for the generalized distance measures:

    0 ≤ J_G ≤ ∞ ,
    0 ≤ V_G ≤ 2 ,
    0 ≤ M_G ≤ √2 ,
    0 ≤ ρ_G ≤ 1 .

When all the class-conditional distributions are identical, i.e.,
features are useless, J_G, V_G and M_G take on their minimum values and ρ_G takes on its maximum value. On the other hand, when all the class-conditional distributions are disjoint from each other, J_G, V_G and M_G take on their maximum values and ρ_G takes on its minimum value. Furthermore, J_G takes on its maximum value when any one class-conditional distribution is disjoint from all the others. In the interest of brevity the proofs will not be given. Also, it should be noted that M_G, V_G and ρ_G reduce, for the two-class case, respectively, to M(C_1:C_2), V(C_1:C_2) and ρ(C_1:C_2). Since the generalized distance measures can be considered as distances between two distributions, namely P(X,C_i) and μ(X,C_i), inequalities between the generalized distances can be obtained by mere substitution of P(X,C_i) and μ(X,C_i) into the inequalities between ordinary distance measures. Hence, from (3.23)-(3.26) it follows that

    J_G ≥ V_G − log(1 + V_G) ,    (5.17)

    J_G ≥ (1/2) V_G^2 + (1/36) V_G^4 ,    (5.18)

    J_G ≥ −log[1 − (1/4)(V_G)^2] ,    (5.19)

and

    J_G ≥ log[(2 + V_G)/(2 − V_G)] − 2V_G/(2 + V_G) .    (5.20)

From results in [115] it follows that

    J_G ≥ −2 log ρ_G .    (5.21)

Similarly, from [79] it follows that

    V_G ≥ 2(1 − ρ_G) ,    (5.22)

and

    V_G ≤ 2[1 − (ρ_G)^2]^{1/2} .    (5.23)

Expanding M_G it can easily be shown that

    M_G^2 = 2(1 − ρ_G) .    (5.24)

It should be remarked that for continuous density functions the generalized distance measures have a drawback.
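The generalized measures (5.12)-(5.16) and the relations (5.17)-(5.24) can be spot-checked numerically; a minimal sketch on an assumed three-class discrete example with equiprobable priors (the class-conditional probabilities are arbitrary toy values):

```python
import math

# Evaluate J_G, V_G, M_G^2, rho_G for a toy discrete 3-class problem,
# then spot-check the inequalities (5.17)-(5.23) and identity (5.24).

P = [[0.8, 0.15, 0.05],        # class-conditional P(X | C_i)
     [0.1, 0.8, 0.1],
     [0.05, 0.15, 0.8]]
M = len(P)
PJ = [[px / M for px in row] for row in P]   # joints P(X, C_i), priors 1/M

JG = VG = MG2 = rhoG = 0.0
for i in range(M):
    # mixture mu(X, C_i) of (5.13): average of the other classes' joints
    mu = [sum(PJ[j][x] for j in range(M) if j != i) / (M - 1)
          for x in range(len(PJ[0]))]
    JG   += sum(p * math.log(p / m) for p, m in zip(PJ[i], mu))
    VG   += sum(abs(p - m) for p, m in zip(PJ[i], mu))
    MG2  += sum((math.sqrt(p) - math.sqrt(m)) ** 2 for p, m in zip(PJ[i], mu))
    rhoG += sum(math.sqrt(p * m) for p, m in zip(PJ[i], mu))

assert JG >= VG - math.log(1 + VG)                            # (5.17)
assert JG >= 0.5 * VG ** 2 + VG ** 4 / 36                     # (5.18)
assert JG >= -math.log(1 - 0.25 * VG ** 2)                    # (5.19)
assert JG >= math.log((2 + VG) / (2 - VG)) - 2 * VG / (2 + VG)  # (5.20)
assert JG >= -2 * math.log(rhoG)                              # (5.21)
assert 2 * (1 - rhoG) <= VG <= 2 * math.sqrt(1 - rhoG ** 2)   # (5.22),(5.23)
assert abs(MG2 - 2 * (1 - rhoG)) < 1e-12                      # (5.24)
print(JG, VG, MG2, rhoG)
```

Note that both P(X,C_i) and μ(X,C_i), summed over i and X, carry total mass one, which is what allows the two-class inequalities to be carried over by substitution.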
Even if the P(X|C_i), i = 1,2,...,M, are nice parametric functions, such as Gaussian densities, μ(X,C_i) loses this nice property, and hence the criteria cannot be given by explicit mathematical expressions, implying that numerical integrations are required in the continuous case. Hence, in practice, the generalized distance measures may find more applications in data analysis and representation problems where the distributions are discrete, as in biometric problems [7], [8], [19], [21].

5.4 Error Bounds

One reason for the popularity of V(C_1:C_2) and ρ(C_1:C_2) as feature evaluation criteria is that, even for the general distribution-free case, V(C_1:C_2) is uniquely related to error probability, and there exists a good upper bound on error probability in terms of ρ(C_1:C_2). It can easily be shown, [141] and [11, p. 269], that for equiprobable classes

    P_e = 1/2 − (1/4) V(C_1:C_2) .    (5.25)

Also, it can easily be shown, [79], [93], that

    P_e ≤ (1/2) ρ(C_1:C_2) .    (5.26)

In this section, upper bounds on error probability are derived in terms of V_G and ρ_G for the most general, distribution-free, case.

Theorem 5.1:

    P_e ≤ (M−1)/M − (1/2M) V_G .    (5.27)

Proof: For any set of positive real numbers a_1, a_2, ..., a_m, the following inequality holds for any i:

    max{a_1, a_2, ..., a_m} ≥ max{ (a_1 + ... + a_{i−1} + a_{i+1} + ... + a_m)/(m−1) , a_i } .    (5.28)

Also, for any two positive real numbers a, b, the following relation holds:

    max{a, b} = (a+b)/2 + |a−b|/2 .    (5.29)

Now, the probability of a correct decision P_c is given by

    P_c = ∫ max{P(X,C_1), P(X,C_2), ..., P(X,C_M)} dX .
(5.30)

Applying (5.28) to (5.30) yields

    P_c ≥ ∫ max{ (1/(M−1)) Σ_{j≠i} P(X,C_j) , P(X,C_i) } dX .    (5.31)

Applying (5.29) to (5.31) yields

    P_c ≥ (1/2) ∫ [ (1/(M−1)) Σ_{j≠i} P(X,C_j) + P(X,C_i) ] dX
        + (1/2) ∫ | P(X,C_i) − (1/(M−1)) Σ_{j≠i} P(X,C_j) | dX .    (5.32)

Expanding terms, averaging over i, and rearranging (5.32) yields (5.27).

Theorem 5.2:

    P_e ≤ (M−2)/M + (1/M) ρ_G .    (5.33)

Proof: The proof follows directly from theorem 5.1 and (5.22).

Theorem 5.3:

    P_e ≤ (M−1)/M − (1/2M) M_G^2 .    (5.34)

Proof: The proof follows directly from theorem 5.2 and (5.24).

It should be noted that for the equiprobable two-pattern-class case (5.27) and (5.33) reduce, respectively, to (5.25) and (5.26). Hence the bounds derived in theorems 5.1 and 5.2 are generalizations of well-known bounds for the two-class case. The bounds given by (5.27), (5.33) and (5.34) are attractive because, as M increases, they become progressively tighter, over an increasing interval, than Lainiotis' bound [133], [142], in terms of the pairwise Bhattacharyya coefficients. The Lainiotis bound is given by

    P_e ≤ Σ_{i<j} [P(C_i) P(C_j)]^{1/2} ρ(C_i:C_j) .    (5.35)

Let P(C_i) = 1/M for all i, and let ρ(C_i:C_j) = ρ for all i,j, i ≠ j. Then (5.35) reduces to

    P_e ≤ ((M−1)/2) ρ .    (5.36)

Since the bound given by (5.33) is no tighter than those given by (5.27) and (5.34), it is sufficient to show that (5.33) is tighter than (5.36). It is interesting to note that the equality in (5.36) holds when ρ = 0, while the equality in (5.33) holds when ρ_G = 1. Now, from (5.33) we have

    lim_{M→∞} { (M−2)/M + (1/M) ρ_G } = 1 ,

and from (5.36) we have, for ρ > 0,

    lim_{M→∞} { ((M−1)/2) ρ } = ∞ .

Let the bounds of (5.33) and (5.36) be written, respectively, as P_e ≤ U(ρ_G) and P_e ≤ U(ρ).
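Writing U_G(x) = (M−2)/M + x/M from (5.33) and U_L(x) = (M−1)x/2 from (5.36), and setting ρ = ρ_G = x, the crossover of the two bounds can be checked numerically; a sketch (the sampled values of M are arbitrary):

```python
# Compare the bound (5.33) with the equiprobable Lainiotis bound (5.36)
# at a common coefficient value x; they cross at x = 2/(M+1).

def U_G(x, M):
    return (M - 2) / M + x / M      # (5.33)

def U_L(x, M):
    return (M - 1) * x / 2          # (5.36)

for M in (3, 5, 10, 26):
    x = 2 / (M + 1)
    assert abs(U_G(x, M) - U_L(x, M)) < 1e-12   # crossover point
    # Lainiotis' bound is tighter below the crossover, looser above it
    assert U_L(0.9 * x, M) < U_G(0.9 * x, M)
    assert U_L(1.1 * x, M) > U_G(1.1 * x, M)

# For M = 26 and x = 0.947 (feature one in the experiments), U_G stays
# below 1 while U_L is vacuous:
print(U_G(0.947, 26), U_L(0.947, 26))
```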
The relationship between U(ρ_G) and U(ρ) can be seen more clearly in Fig. 5.1, where both bounds are plotted for several values of M. The points where the two bounds intersect are given by the equation ρ = ρ_G = 2/(M+1). From Fig. 5.1 it can be seen that the bound given by (5.33) is tighter than that given by (5.36) when ρ = ρ_G > 2/(M+1). On the other hand, the reverse holds when ρ = ρ_G < 2/(M+1). An obvious, more complicated bound, which combines the advantages of both bounds above, can be formed by combining (5.33) and (5.36) to yield

    P_e ≤ min{U(ρ_G), U(ρ)} ,

where the equality holds for both ρ = ρ_G = 0 and ρ = ρ_G = 1. Consider feature number one from the experiments, for which P_e = 0.910. For M = 26 the error probability must be less than (M−1)/M = 25/26 = 0.962. For this feature ρ_G = 0.947. Substituting these values in (5.36) (Lainiotis' bound) and (5.33) yields, respectively, P_e ≤ 11.8 and P_e ≤ 0.959. Hence, for a large number of pattern classes, say M = 26, (5.33) is useful over the entire range of ρ_G, whereas (5.36) is useful only for ρ < 0.074. For this same example V_G = 0.496; substituting for V_G in (5.27) yields P_e ≤ 0.952, which experimentally suggests that (5.27) is tighter than (5.33).

Fig. 5.1 Comparison of the upper bounds on error probability as a function of ρ and ρ_G as the number of pattern classes varies.

5.5 Experiments and Results

Although the bounds derived in section 5.4 guarantee an error probability below a certain level for the feature being considered for selection or discarding, it is of interest to compare the measures with respect to ordering a set of features.
Furthermore, since no upper bounds are available for J and J_G for the distribution-free case, it is of interest to compare these criteria with the rest of the criteria considered in this chapter. Therefore, experiments similar to those of chapters III and IV were performed. M_G and M were not used, since they yield orderings identical to those of ρ_G and ρ, respectively. The results are given in Table 5.1.

5.6 Discussion and Conclusions

The main results, from Table 5.1, can be summed up in the statement that all the measures are very similar to each other. The two most similar measures are J and ρ_G, with r_s = 0.999, while the two most dissimilar measures are J_G and V_G, with r_s = 0.968. From the Bayes-recognition-criterion point of view, J_G appears to be the best, with r_s = 0.709, while V_G lags behind, with r_s = 0.651. Some of the results of this chapter can also be found in [143] and [144].

        J      J_G    V      V_G    ρ      ρ_G
P_e     0.690  0.709  0.672  0.651  0.693  0.694
J       1.0    0.992  0.976  0.975  0.997  0.999
J_G            1.0    0.972  0.968  0.990  0.992
V                     1.0    0.993  0.986  0.978
V_G                          1.0    0.983  0.976
ρ                                   1.0    0.998

Table 5.1 Spearman rank correlation coefficients between feature orderings with generalized distance measures. (See Table 4.1 for additional explanation.)

CHAPTER VI

PROBABILITY OF ERROR BOUNDS

6.1 Introduction

In this chapter error bounds are derived for various two-class and M-class measures proposed as feature evaluation criteria for pattern recognition problems involving both correctly, and incorrectly, labelled training data. In section 6.2 upper bounds on P_e are derived in terms of well known distance and information measures for the two-class problem with incorrectly labelled patterns.
In section 6.3 a family of distance measures, analogous to the Bhattacharyya distance, is proposed, and upper bounds on P_e are derived for the family. In section 6.4 upper bounds on error probability for the M-class problem are derived in terms of pairwise distances. In section 6.5 the bounds derived in chapter V in connection with the generalized distance measures are generalized further. Finally, in section 6.6 a very general class of measures is proposed which unify several of the measures found in the literature, and upper and lower bounds on P_e are derived in terms of these general measures.

6.2 Upper Bounds on Error Probability in Terms of Distance and Information Measures for Feature Evaluation with Incorrectly Labelled Training Data: Two-Class Problem

In the usual feature evaluation problem considered so far in chapters III-V, the probability distributions are estimated from training data consisting of pattern samples which are assumed to be labelled correctly. In practice, however, in many cases the labels may not be entirely correct [145], due to human or other errors introduced during the labelling process. In the recent literature considerable interest has been shown in pattern recognition applications with incorrectly labelled training data [146]-[151]. In this section the applicability of various feature evaluation criteria for the case of incorrectly labelled training data is examined for the two-class problem. Let C_i, i = 1,2, and Ĉ_i,
i = 1,2, denote, respectively, the pattern classes with correct and incorrect labelling, according to the probabilities

    P(Ĉ_i|C_i) = β ,  i = 1,2 ,
    P(Ĉ_j|C_i) = 1−β ,  i ≠ j ,  i,j = 1,2 ,    (6.1)

where 1/2 < β ≤ 1. It is assumed that

    P(X|Ĉ_i, C_j) = P(X|C_j) ,  i,j = 1,2 .    (6.2)

Let the class-conditional distributions estimated from the incorrectly labelled training data be denoted by P(X|Ĉ_i), which can be written as

    P(X|Ĉ_i) = (1/P(Ĉ_i)) [P(X, Ĉ_i, C_i) + P(X, Ĉ_i, C_j)]
             = (1/P(Ĉ_i)) [P(C_i) P(Ĉ_i|C_i) P(X|Ĉ_i, C_i) + P(C_j) P(Ĉ_i|C_j) P(X|Ĉ_i, C_j)] ,    (6.3)

where P(Ĉ_i) is the estimate of the a priori probability of the ith class from the incorrectly labelled training data. From (6.1)-(6.3), relations between P(X,Ĉ_i) and P(X,C_i) can be obtained as follows:

    P(X, Ĉ_i) = β P(X, C_i) + (1−β) P(X, C_j) ,    (6.4)

    P(X, C_i) = (1/(2β−1)) [β P(X, Ĉ_i) − (1−β) P(X, Ĉ_j)] ,    (6.5)

for i,j = 1,2, i ≠ j. These relations will be used below in deriving upper bounds on error probability in terms of the Bhattacharyya coefficient, the Matusita distance, the Shannon equivocation, the distance criteria of Patrick and Fisher [152] and Ito [93], and the Kolmogorov variational distance, evaluated on the incorrectly labelled training data. For two classes i,j the measures are defined below.

The Bhattacharyya coefficient:

    ρ̂ = ∫ [P(X|Ĉ_i) P(Ĉ_i) P(X|Ĉ_j) P(Ĉ_j)]^{1/2} dX .    (6.6)

The Matusita distance:

    M̂^2 = ∫ [√(P(X|Ĉ_i) P(Ĉ_i)) − √(P(X|Ĉ_j) P(Ĉ_j))]^2 dX .    (6.7)

The Shannon equivocation:

    Ĥ = −∫ P(X) [P(Ĉ_i|X) log P(Ĉ_i|X) + P(Ĉ_j|X) log P(Ĉ_j|X)] dX .    (6.8)

The distance criterion of Patrick and Fisher:

    d̂^2 = ∫ [P(X|Ĉ_i)
P(Ĉ_i) − P(X|Ĉ_j) P(Ĉ_j)]^2 dX .    (6.9)

The distance criterion of Ito:

    Q̂_0 = ∫ 2 P(X|Ĉ_i) P(Ĉ_i) P(X|Ĉ_j) P(Ĉ_j) / P(X) dX .    (6.10)

The Kolmogorov variational distance:

    V̂ = ∫ |P(X|Ĉ_i) P(Ĉ_i) − P(X|Ĉ_j) P(Ĉ_j)| dX .    (6.11)

Theorem 6.1:

    P_e ≤ Q̂_0 / (2β−1)^2 − 2β(1−β) / (2β−1)^2 .    (6.12)

Proof: Substituting (6.4) into (6.10), expanding, and rearranging terms yields

    Q̂_0 = [β^2 + (1−β)^2] Q_0 + 2β(1−β) ∫ [P^2(X,C_i) + P^2(X,C_j)] / P(X) dX .    (6.13)

For any nonnegative real numbers a, b there holds the relation

    a^2 + b^2 = (a+b)^2 − 2ab .    (6.14)

Using (6.14) in (6.13) and the fact that P(X,C_i) + P(X,C_j) = P(X) yields, after some algebra,

    Q̂_0 = (2β−1)^2 Q_0 + 2β(1−β) .    (6.15)

From [93] we have

    P_e ≤ Q_0 .    (6.16)

Solving (6.15) for Q_0 and substituting in (6.16) yields (6.12).

Corollary 6.1:

    P_e ≤ (1/(2 log 2)) Ĥ / (2β−1)^2 − 2β(1−β) / (2β−1)^2 .    (6.17)

Proof: Ito [93] has shown that H ≥ 2 log 2 · Q_0, from which it follows that

    Q̂_0 ≤ Ĥ / (2 log 2) .    (6.18)

Substituting (6.18) into (6.12) yields (6.17).

Corollary 6.2:

    P_e ≤ ρ̂ / (2β−1)^2 − 2β(1−β) / (2β−1)^2 .    (6.19)

Proof: Ito [93] has shown that H ≤ 2 log 2 · ρ, from which it follows that

    Ĥ ≤ 2 log 2 · ρ̂ .    (6.20)

Substituting (6.20) into (6.17) yields (6.19).

An upper bound in terms of ρ̂ does exist in the literature, [150], and is given by

    P_e ≤ ρ̂ / (2β−1) + (1−β) / (2β−1) .    (6.21)

Theorem 6.2:

    P_e ≤ 1/2 − (1/2) K̂_n / (2β−1)^{p(n)} ,    (6.22)

where

    K̂_n = ∫ |P(X|Ĉ_i) P(Ĉ_i) − P(X|Ĉ_j) P(Ĉ_j)|^{p(n)} dX ,

p(n) = 2(n+1)/(2n+1), n a nonnegative integer, and where P(X|Ĉ_i) and P(X|Ĉ_j) are discrete distributions.

Proof: Substituting (6.4) into the quantity â = |P(X,Ĉ_i)
− P(X,Ĉ_j)| yields

    â = (2β−1) |P(X,C_i) − P(X,C_j)| = (2β−1) a .    (6.23)

It follows that, for discrete probability distributions,

    a^{p(n)} = [ â / (2β−1) ]^{p(n)} .    (6.24)

Now, from [141],

    P_e = 1/2 − (1/2) ∫ a dX .    (6.25)

Since 0 ≤ a ≤ 1 and p(n) ≥ 1, we have a ≥ a^{p(n)}; substituting (6.24) into (6.25) therefore yields (6.22).

It should be noted that for n = 0, (6.22) provides an upper bound in terms of the distance criterion of Patrick and Fisher, i.e., K̂_0 = d̂^2. The Kolmogorov variational distance is also a special case of K̂_n, namely K̂_∞ = V̂. Furthermore, for n = ∞, (6.22) holds also for the case of continuous distributions.

Theorem 6.3:

    P_e ≤ Q̂_0 / (2β−1) − (1−β) / (2β−1) .    (6.26)

Proof: For n = ∞, (6.22) holds with equality and reduces to

    P_e = 1/2 − V̂ / (2(2β−1)) .    (6.27)

Ito [93] has shown that V ≥ 1 − 2Q_0, from which it follows that

    V̂ ≥ 1 − 2 Q̂_0 .    (6.28)

Substituting (6.28) into (6.27) and rearranging terms yields (6.26).

Corollary 6.3:

    P_e ≤ (1/(2 log 2)) Ĥ / (2β−1) − (1−β) / (2β−1) .    (6.29)

Proof: The proof follows directly from (6.18) and theorem 6.3.

Corollary 6.4:

    P_e ≤ ρ̂ / (2β−1) − (1−β) / (2β−1) .    (6.30)

Proof: The proof follows directly from (6.20) and (6.29).

It should be noted that the bound given by (6.30) is tighter than the available bound in the literature, given by (6.21). Both (6.21) and (6.30) describe straight lines with the same slope 1/(2β−1), but the intercept on the axis is nonnegative in (6.21) while nonpositive in (6.30). To illustrate how these bounds vary with β, consider (6.26), which can be written as P_e ≤ U(Q̂_0, β) and is plotted in Fig. 6.1 as a function of Q̂_0 for various values of β. From (6.15) and the properties of Q_0 it can easily be verified that

    2β(1−β) ≤ Q̂_0 ≤ 1/2 .
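A numerical spot-check of this label-noise machinery, on assumed toy joint distributions: the mixture relation (6.4), its inverse (6.5), the identity (6.15), and the bound (6.12):

```python
import random

# Toy discrete two-class problem: P1, P2 are the true joints
# P(X, C_1), P(X, C_2); Ph1, Ph2 the mislabelled joints via (6.4).

def ito_Q0(A, B):
    return sum(2 * a * b / (a + b) for a, b in zip(A, B))

random.seed(3)
beta, states = 0.85, 10
w = [random.random() for _ in range(2 * states)]
s = sum(w)
P1, P2 = [x / s for x in w[:states]], [x / s for x in w[states:]]

# (6.4): mislabelled joints are a beta-mixture of the true ones
Ph1 = [beta * a + (1 - beta) * b for a, b in zip(P1, P2)]
Ph2 = [beta * b + (1 - beta) * a for a, b in zip(P1, P2)]

# (6.5): the true joints are recovered exactly from the mislabelled ones
R1 = [(beta * h1 - (1 - beta) * h2) / (2 * beta - 1)
      for h1, h2 in zip(Ph1, Ph2)]
assert all(abs(r - p) < 1e-12 for r, p in zip(R1, P1))

Q0, Qh0 = ito_Q0(P1, P2), ito_Q0(Ph1, Ph2)
assert abs(Qh0 - ((2 * beta - 1) ** 2 * Q0
                  + 2 * beta * (1 - beta))) < 1e-12          # (6.15)

Pe = sum(min(a, b) for a, b in zip(P1, P2))                  # Bayes error
bound = (Qh0 - 2 * beta * (1 - beta)) / (2 * beta - 1) ** 2  # (6.12)
assert Pe <= bound + 1e-12
print(Pe, bound)
```

The check of (6.15) works because the mislabelled joints sum pointwise to the same mixture P(X), so Ito's criterion transforms by the exact affine relation rather than merely an inequality.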
The dashed vertical lines intercept the abscissa at Q̂_0 = 2β(1−β), and hence, for a given value of β, they determine the range of Q̂_0. The lowest error probability can be realized when the class-conditional distributions P(X|C_i) and P(X|C_j) are disjoint. Under these conditions it can easily be verified that P_e = 1−β. Substituting Q_0 = 0 and β = 1−P_e into (6.15) yields

    Q̂_0 = 2 P_e (1 − P_e) ,

which yields, for any given distributions P(X,C_i) and P(X,C_j),

    P_e ≥ 1/2 − (1/2) √(1 − 2 Q̂_0) ,

where the equality holds when Q̂_0 takes on its minimum value. The dashed horizontal lines intercept the ordinate at P_e = 1/2 − (1/2)√(1 − 4β(1−β)), and hence, for a given value of β, they determine the range of P_e. For example, for β = 0.75 in Fig. 6.1, the minimum possible error probability (when P(X,C_i) and P(X,C_j) are disjoint) is given by P_e = 0.25, which corresponds to a value of Q̂_0 = 0.375. Hence the bound given by (6.26) is attainable at both extremes of Q̂_0 for any value of β.

Fig. 6.1 Upper bound on error probability U(Q̂_0, β) as a function of Q̂_0 for various values of β. Dashed lines meet the ordinate at (1−β) and the abscissa at 2β(1−β).

In the discussion so far, obtaining ρ̂ involves estimation of all probabilities or parameters of assumed distributions. However, bounds similar to (6.30) can be obtained for nonparametric estimates of the distributions and as functions of potential functions evaluated directly on the incorrectly labelled samples. Just as nonparametric estimates can be made of ρ, [153], they can be made of ρ̂ and substituted in (6.30). Also, these estimates are bounded above by a set of Bashkirov-Braverman-Muchnik potential functions, as shown in [154] and [155].
Hence by substitution in (6.30) and (6.19), bounds can be obtained in terms of potential functions. In the interest of brevity the details are omitted.

6.3 Upper Bounds on Error Probability Through Function Approximation: A Family of Distance Measures for the Two-Class Problem

In this section a family of distance measures similar to the Bhattacharyya distance is proposed. Upper bounds on error probability are derived for the general distribution-free case, and it is shown that these bounds are much tighter than the Bhattacharyya bounds. The proposed distance measures are given by

$$T_n = -\log_2 \tau_n, \qquad (6.31)$$

where $\tau_n$ is a measure of affinity analogous to the Bhattacharyya coefficient ρ considered in section 6.2 and is given by

$$\tau_n = \sqrt[n]{2} \int \frac{P(X,C_i)\,P(X,C_j)}{\bigl[P^n(X,C_i) + P^n(X,C_j)\bigr]^{1/n}}\, dX, \qquad (6.32)$$

where n = 1, 2, ..., ∞. As shown in the next theorem, $\tau_n$ can be considered to be the integration of an approximation to the function $\min\{P(X,C_i), P(X,C_j)\}$.

Theorem 6.4:

$$P_e \;\le\; \tau_n \;\le\; Q_0, \qquad (6.33)$$

where $Q_0$ is given by (3.62).

Proof: When n = 1, $\tau_n$ reduces to Ito's distance criterion, which is equivalent to the harmonic mean of the joint distributions. Hence $\tau_1 = Q_0$. Now, $\tau_n$ can be written as

$$\tau_n = \int f_n(X)\, dX, \qquad (6.34)$$

where $\{f_n(X)\}$ is a sequence of integrable functions and

$$f_n(X) = \sqrt[n]{2}\; \frac{P(X,C_i)\,P(X,C_j)}{M_n(X,C_i,C_j)}.$$

The term $M_n(X,C_i,C_j)$ is a sum of order n which has the property that

$$\lim_{n\to\infty} M_n(X,C_i,C_j) = \max\{P(X,C_i),\, P(X,C_j)\}. \qquad (6.35)$$

Since $M_n(X,C_i,C_j) \ne 0$ it follows, using limit theorems, that

$$\lim_{n\to\infty} f_n(X) = \min\{P(X,C_i),\, P(X,C_j)\} = g(X).$$

From Lebesgue's theorem on dominated convergence [161] it follows that for any integrable g(X),

$$\lim_{n\to\infty} \int f_n(X)\, dX = \int \lim_{n\to\infty} f_n(X)\, dX,$$

which implies that $\lim_{n\to\infty} \tau_n = P_e$.
(6.36)

Also, from (3.55) it follows that for any n such that n > 1,

$$\tau_{n+1} \;\le\; \tau_n \;\le\; \tau_{n-1}. \qquad (6.37)$$

From (6.36) and (6.37) it follows that

$$P_e = \tau_\infty \;\le\; \cdots \;\le\; \tau_{n+1} \;\le\; \tau_n \;\le\; \tau_{n-1} \;\le\; \cdots \;\le\; \tau_1 = Q_0,$$

which includes (6.33). It is obvious that, since $Q_0 \le H \le \rho$, $\tau_n$ provides tighter bounds than Shannon's equivocation and the Bhattacharyya coefficient. The properties of $\tau_n$ and its relations to other measures will not be derived, in the interest of brevity. For additional comments concerning H and ρ the reader is referred to [17] and [156].

6.4 Upper Bounds on Error Probability for the M-Class Problem: The Pairwise Distance Approach

Theorem 6.5:

$$P_e \;\le\; \frac{M-1}{2} - \frac{1}{2}\sum_{i<j}^{M} V_{ij}, \qquad (6.38)$$

where $V_{ij}$ is the pairwise Kolmogorov variational distance given by

$$V_{ij} = \int \bigl|P(X|C_i)P(C_i) - P(X|C_j)P(C_j)\bigr|\, dX. \qquad (6.39)$$

Proof: Chu and Chueh [157] have shown that

$$P_e \;\le\; \sum_{i<j}^{M} P_e(i,j), \qquad (6.40)$$

where $P_e(i,j)$ is the pairwise error probability given by

$$P_e(i,j) = \int \min\{P(X,C_i),\, P(X,C_j)\}\, dX. \qquad (6.41)$$

Using the relation (5.29) in (6.40) and (6.41) one obtains

$$P_e \;\le\; \frac{1}{2}\sum_{i<j}^{M}\Bigl\{P(C_i) + P(C_j) - \int \bigl|P(X,C_i) - P(X,C_j)\bigr|\, dX\Bigr\}, \qquad (6.42)$$

which in turn yields (6.38). It can be shown that the bound given by (6.38) is tighter than the Lainiotis bound of (5.36). Furthermore, for equiprobable pattern classes (6.38) reduces to the bound derived by the author in [158].

Corollary 6.5:

$$P_e \;\le\; \frac{M-1}{2} - \frac{1}{2}\sum_{i<j}^{M} V_{ij}^{n}, \qquad (6.43)$$

where $V_{ij}^{n}$ is the generalized pairwise distance given by

$$V_{ij}^{n} = \int \bigl|P(X|C_i)P(C_i) - P(X|C_j)P(C_j)\bigr|^{\rho(n)}\, dX, \qquad (6.44)$$

where $\rho(n) = 2(n+1)/(2n+1)$, n = 0, 1, 2, ..., ∞, and the $P(X|C_i)$, i = 1, 2, ..., M, are discrete distributions.
Proof: For discrete distributions,

$$V_{ij} \;\ge\; V_{ij}^{n}. \qquad (6.45)$$

Combining (6.38) with (6.45) yields (6.43). For n = 0, (6.43) reduces to an upper bound in terms of the distance criterion of Patrick and Fisher [152] for the M-class problem. Furthermore, for M = 2, n = 0, (6.43) reduces to the bound derived by the author in [159].

Corollary 6.6:

$$P_e \;\le\; \frac{M-1}{2} - \frac{1}{2}\sum_{i<j}^{M} D_{ij}^{r}, \qquad (6.46)$$

where $D_{ij}^{r}$ is the generalized pairwise distance given by

$$D_{ij}^{r} = \int \Bigl|\bigl[P(X|C_i)P(C_i)\bigr]^{1/r} - \bigl[P(X|C_j)P(C_j)\bigr]^{1/r}\Bigr|^{r}\, dX, \qquad (6.47)$$

where r is a positive integer.

Proof: For any two real numbers a, b such that $a \ge b \ge 0$, the following inequality holds [105]:

$$a^r - b^r \;\ge\; (a - b)^r, \qquad (6.48)$$

from which it follows that

$$|a - b| \;\ge\; \bigl|a^{1/r} - b^{1/r}\bigr|^{r}, \qquad (6.49)$$

which yields, in turn,

$$V_{ij} \;\ge\; D_{ij}^{r}. \qquad (6.50)$$

Substituting (6.50) into (6.38) yields (6.46). Setting r = 2 in (6.46), expanding terms, and rearranging yields

$$P_e \;\le\; \sum_{i<j}^{M} \sqrt{P(C_i)P(C_j)}\;\rho(C_i, C_j), \qquad (6.51)$$

which is the Lainiotis [133] bound. Hence the upper bound of (6.46) can be considered to be a generalization of the Lainiotis bound.

Theorem 6.6:

$$P_e \;\le\; \sum_{i<j}^{M} \tau_{ij}^{n}, \qquad (6.52)$$

where $\tau_{ij}^{n}$ is the pairwise distance measure proposed for the two-class problem in section 6.3 and is given by

$$\tau_{ij}^{n} = \sqrt[n]{2} \int \frac{P(X|C_i)P(C_i)\,P(X|C_j)P(C_j)}{\bigl[P^n(X|C_i)P^n(C_i) + P^n(X|C_j)P^n(C_j)\bigr]^{1/n}}\, dX. \qquad (6.53)$$

Proof: Using a proof similar to that of theorem 6.4 it can be shown that

$$P_e(i,j) \;\le\; \tau_{ij}^{n}. \qquad (6.54)$$

Substituting (6.54) into (6.40) yields (6.52).

6.5 Upper Bounds on Error Probability for the M-Class Problem: The Generalized Distance Approach

Corollary 6.7:

$$P_e \;\le\; \frac{M-1}{M} - \frac{M-1}{2M}\, V_G^{n}, \qquad (6.55)$$

where $V_G^{n}$ is a generalization of $V_G$ of (5.
14) and is given by

$$V_G^{n} = \sum_{i=1}^{M} \int \bigl|P(X,C_i) - \mu(X,C_i)\bigr|^{\rho(n)}\, dX, \qquad (6.56)$$

where $\rho(n) = 2(n+1)/(2n+1)$, n a nonnegative integer, and where the $P(X,C_i)$, i = 1, 2, ..., M, are discrete distributions.

Proof: From (3.30) it follows that

$$V_G \;\ge\; V_G^{n}. \qquad (6.57)$$

Noting that $V_G^{\infty} = V_G$ and combining (6.57) with (5.27) yields (6.55). It should be noted that for n = 0, (6.55) reduces to an upper bound in terms of a generalized version of the 2-class distance criterion of [152].

Corollary 6.8:

$$P_e \;\le\; \frac{M-1}{M} - \frac{M-1}{2M}\, D_G^{r}, \qquad (6.58)$$

where $D_G^{r}$ is a generalization of $D_{ij}^{r}$ of (6.47) and is given by

$$D_G^{r} = \sum_{i=1}^{M} \int \Bigl|\bigl[P(X,C_i)\bigr]^{1/r} - \bigl[\mu(X,C_i)\bigr]^{1/r}\Bigr|^{r}\, dX, \qquad (6.59)$$

where r is a positive integer.

Proof: From (6.49) it follows that

$$V_G \;\ge\; D_G^{r}. \qquad (6.60)$$

Setting n = ∞ in (6.55) and substituting (6.60) yields (6.58). Using (5.24) it can be shown that for r = 2, (6.58) reduces to (5.34). Hence (6.58) is a generalized version of the bound given by (5.34).

6.6 Error Bounds for the M-Class Problem: The Function Approximation Approach

In this section a very general class of distance criteria is proposed; these criteria can also be considered as information measures, but they essentially measure the affinity or dispersion of a set of distributions. The proposed measures are given by

$$R_r^{M} = \int \Bigl[\frac{1}{M}\sum_{i=1}^{M} P^r(X,C_i)\Bigr]^{1/r}\, dX, \qquad (6.61)$$

where r is a nonzero real number. It is of interest to comment on some of the special cases of $R_r^{M}$. For example, for M = 2 and r = -n, n a positive integer, $R_r^{M}$ reduces to the distance measures proposed in section 6.3, i.e., $R_{-n}^{2} = \tau_n$. It follows that for r = -1, M = 2, $R_r^{M}$ reduces to the distance criterion of Ito, i.e., $R_{-1}^{2} = Q_0$. Also, $R_r^{M}$ can be considered as a generalization of $Q_0$ to the M-class problem.
The limit as r → 0 is equal to the weighted M-class Bhattacharyya coefficient, i.e.,

$$\lim_{r\to 0} R_r^{M} = \int \bigl[P(X,C_1)\,P(X,C_2)\cdots P(X,C_M)\bigr]^{1/M}\, dX.$$

Theorem 6.7: For r > 0,

$$1 - M^{1/r} R_r^{M} \;\le\; P_e \;\le\; 1 - R_r^{M}. \qquad (6.62)$$

Proof: For any positive values $x = (x_1, x_2, \ldots, x_n)$ and positive weights $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_n)$, $\sum_{i=1}^{n}\alpha_i = 1$, and any real t ≠ 0, the t norm is defined [160] as

$$M_t(x, \alpha) = \Bigl(\sum_{i=1}^{n} \alpha_i x_i^{t}\Bigr)^{1/t}. \qquad (6.63)$$

Furthermore, the t norm has the property that

$$\lim_{t\to\infty} M_t(x, \alpha) = \max(x). \qquad (6.64)$$

The probability of error can be written as

$$P_e = 1 - \int \max\{P(X,C_1), \ldots, P(X,C_M)\}\, dX. \qquad (6.65)$$

Setting $x_i = P(X,C_i)$, n = M, $\alpha_i = 1/M$, and t = r yields, from (6.63) and (6.64),

$$\max\{P(X,C_1), \ldots, P(X,C_M)\} = \lim_{r\to\infty}\Bigl\{\frac{1}{M}\sum_{i=1}^{M}\bigl[P(X,C_i)\bigr]^{r}\Bigr\}^{1/r}. \qquad (6.66)$$

Integrating (6.66), substituting in (6.65), and invoking Lebesgue's theorem on dominated convergence yield

$$P_e = 1 - \lim_{r\to\infty} \int \Bigl\{\frac{1}{M}\sum_{i=1}^{M}\bigl[P(X,C_i)\bigr]^{r}\Bigr\}^{1/r}\, dX = 1 - \lim_{r\to\infty} R_r^{M}. \qquad (6.67)$$

Furthermore, from (3.55) it follows that

$$\Bigl\{\frac{1}{M}\sum_{i=1}^{M}\bigl[P(X,C_i)\bigr]^{r_1}\Bigr\}^{1/r_1} \;\ge\; \Bigl\{\frac{1}{M}\sum_{i=1}^{M}\bigl[P(X,C_i)\bigr]^{r_2}\Bigr\}^{1/r_2} \qquad (6.68)$$

for $r_1 \ge r_2 > 0$. Substituting (6.68) into (6.67) yields the right-hand side of (6.62). The sum of order t is defined [160] as

$$S_t(x) = \Bigl(\sum_{i=1}^{n} x_i^{t}\Bigr)^{1/t}, \qquad (6.69)$$

and has the property that for $t_2 \ge t_1 > 0$,

$$S_{t_2}(x) \;\le\; S_{t_1}(x). \qquad (6.70)$$

Now, $R_r^{M}$ can be written as

$$R_r^{M} = \Bigl(\frac{1}{M}\Bigr)^{1/r} \int \Bigl\{\sum_{i=1}^{M}\bigl[P(X,C_i)\bigr]^{r}\Bigr\}^{1/r}\, dX, \qquad (6.71)$$

where the quantity inside the integral is a sum of order r. The sum of order r has the property that

$$\lim_{r\to\infty}\Bigl\{\sum_{i=1}^{M}\bigl[P(X,C_i)\bigr]^{r}\Bigr\}^{1/r} = \max\{P(X,C_1), \ldots, P(X,C_M)\}. \qquad (6.72)$$

Integrating (6.72), substituting in (6.
65), and invoking Lebesgue's theorem on dominated convergence yield

$$P_e = 1 - \lim_{r\to\infty} \int \Bigl\{\sum_{i=1}^{M}\bigl[P(X,C_i)\bigr]^{r}\Bigr\}^{1/r}\, dX = 1 - \lim_{r\to\infty} M^{1/r} R_r^{M}. \qquad (6.73)$$

Using (6.70) in (6.73) yields the left-hand side of (6.62). Hence, by increasing r in (6.62) the upper and lower bounds can be made arbitrarily tight, and as r → ∞ both the upper and the lower bound converge to the probability of error. For r = 1, (6.62) reduces to $0 \le P_e \le 1 - 1/M$.

Let RMS(X,C) denote the root-mean-square of the joint distributions, and let $P_c$ denote the probability of correct classification.

Corollary 6.9:

$$\frac{P_c}{\sqrt{M}} \;\le\; \mathrm{RMS}(X,C) \;\le\; P_c. \qquad (6.74)$$

Proof: For r = 2, $R_2^{M} = \mathrm{RMS}(X,C)$. Inequalities (6.74) follow directly from (6.62).

From (6.62) it is noted that for r < 1 the bounds are useless. However, a measure similar to $R_r^{M}$ can be defined which has analogous bounds useful for r < 0. The probability of error given by (6.65) can be written as

$$P_e = \int \min\bigl\{[P(X) - P(X,C_1)], \ldots, [P(X) - P(X,C_M)]\bigr\}\, dX, \qquad (6.75)$$

which in turn can be written as

$$P_e = (M-1) \int \min\{\mu(X,C_1), \ldots, \mu(X,C_M)\}\, dX, \qquad (6.76)$$

where $\mu(X,C_i)$ is the mixture distribution used in chapter 5, defined in (5.13). Using the properties of t norms and t sums that

$$\lim_{t\to-\infty} M_t(x, \alpha) = \min(x) \quad\text{and}\quad \lim_{t\to-\infty} S_t(x) = \min(x),$$

with a proof similar to that of theorem 6.7 the following theorem can be proved.

Theorem 6.8: For r < 0,

$$M^{1/r}(M-1)\,\tilde{R}_r^{M} \;\le\; P_e \;\le\; (M-1)\,\tilde{R}_r^{M}, \qquad (6.77)$$

where

$$\tilde{R}_r^{M} = \int \Bigl[\frac{1}{M}\sum_{i=1}^{M} \mu^{r}(X,C_i)\Bigr]^{1/r}\, dX. \qquad (6.78)$$

One interesting property of $\tilde{R}_r^{M}$ is that

$$\lim_{r\to 0} \tilde{R}_r^{M} = \int \bigl[\mu(X,C_1)\cdot\ldots\cdot\mu(X,C_M)\bigr]^{1/M}\, dX, \qquad (6.79)$$

which is the M-class Bhattacharyya coefficient of the mixture distributions. Furthermore, as r → 0 the upper bound on $P_e$ in (6.77) reduces to the bound given in [93].
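Theorem 6.7 can be illustrated numerically: the sketch below computes $R_r^M$ for a small invented 3-class discrete joint distribution and shows the bounds of (6.62) closing in on $P_e$ as r grows.

```python
# Sketch of the bounds (6.62): 1 - M**(1/r) * R_r <= P_e <= 1 - R_r,
# where R_r = sum_x [ (1/M) * sum_i P(x,C_i)**r ]**(1/r) in the discrete
# case.  The 3-class joint distribution below is made up for illustration.

M = 3
joint = [
    [0.25, 0.05, 0.02, 0.01],   # P(x, C1)
    [0.03, 0.20, 0.05, 0.05],   # P(x, C2)
    [0.02, 0.05, 0.20, 0.07],   # P(x, C3)
]
cells = list(zip(*joint))                 # per-cell tuples (P(x,C1),...,P(x,CM))
pe = 1.0 - sum(max(c) for c in cells)     # Bayes error probability, (6.65)

def R(r):
    """Discrete version of the measure R_r^M of (6.61)."""
    return sum((sum(p ** r for p in c) / M) ** (1.0 / r) for c in cells)

for r in (1, 2, 4, 8, 32):
    lower, upper = 1 - M ** (1.0 / r) * R(r), 1 - R(r)
    print(r, lower, pe, upper)   # both bounds close in on P_e as r grows
```

For r = 1 the bounds degenerate to $0 \le P_e \le 1 - 1/M$, as the text notes, while for large r both sides approach the Bayes error.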
Hence, for any r < 0 the upper bound of (6.77) is tighter than the bound in [93]. Another very interesting property of $\tilde{R}_r^{M}$ is that for M = 2 and r = -1 it reduces to Ito's distance criterion [93], Vajda's quadratic equivocation [32], and the asymptotic nearest neighbour error rate [91].

CHAPTER VII

DISCUSSION AND CONCLUSION

7.1 Feature Evaluation Criteria Revisited

In part I of this thesis several distance, overlap, and information criteria have been proposed for pattern recognition problems. Together with the well known criteria, the proposed criteria were analyzed and compared from theoretical and experimental points of view with regard to the feature evaluation problem. In chapter III a class of certainty measures for pre-evaluation and intraset post-evaluation was proposed. For the pre-evaluation case no meaningful results could be obtained from theory, and hence the conclusions come from intuition and experiments. The experimental results showed that this approach to pre-evaluation, i.e., using the unconditional distribution, is quite promising, especially in light of the small computational effort required compared to analogous post-evaluation methods. The proposed certainty measures take on added significance when they are put in the form of Shannon's equivocation, in which case they are related to error probability by the upper and lower bounds of theorem 3.7. Furthermore, it has been shown in corollary 3.4 that for the two-class problem the proposed certainty measures provide a tighter upper bound on error probability than Shannon's equivocation. In chapter IV the combined-evaluation and interset post-evaluation problems were considered for the multiclass case.
The well known expected value measures and some newly proposed measures were analyzed and compared from the theoretical and experimental points of view. Some of the measures in the thesis, such as the M-class Bhattacharyya coefficient, do not quite fit the above categorization, since they measure the overlap, rather than the distance, between distributions. The M-class Bhattacharyya coefficient is an attractive measure since error bounds have been derived for it in theorem 4.11. However, the feature ordering experiments show that it performs about as well as the expected value measures with respect to the Bayes recognition criterion. Although the expected divergence has received a great deal of attention in the past, no upper bounds on the error probability are available for the general distribution-free case. However, theorem 4.10 gives lower bounds on the error probability in terms of the expected divergence. The experimental results show that, as far as feature ordering is concerned, the interset post-evaluation criteria of chapter IV are all very similar. Hence the selection of a criterion must be based more on computational effort than on resulting performance. Of the several interset criteria, the expected Kolmogorov variational distance is one of the easiest to compute in the discrete case. One disadvantage of the expected value approach is that the distance measure has to be averaged over all pattern-class pairs. This operation can require a great deal of computation, especially as the number of pattern classes increases.
The proposed combined evaluation measures I(X:C), H(X:C), and Q, given respectively by (4.15), (4.17), and (4.21), and the M-class Bhattacharyya coefficient do not suffer from this disadvantage and perform equally well for feature ordering. In chapter V the well known distance measures were generalized to the multiclass problem. Although these generalized criteria are in fact interset criteria, they involve averaging only over the number of classes, not the class pairs, and hence require less computation than the interset measures of chapter IV. From the feature selection point of view they are attractive since upper bounds on error probability have been obtained for some of them (theorems 5.1 and 5.2). From the feature ordering point of view, the experiments showed that the generalized measures are very similar to the expected value measures. Comparing the results of chapters III, IV, and V, one can see that, as far as feature ordering with respect to the Bayes recognition criterion is concerned, the combined-evaluation and the interset post-evaluation criteria tend to be superior to the intraset post-evaluation criteria which, in turn, tend to be superior to the pre-evaluation criteria. This order is in agreement with the order of decreasing computational effort required. However, there are some notable exceptions to the above tendency. For example, the proposed certainty measure M(X) for pre-evaluation is at least as good as Vajda's average quadratic entropy Q(X|C) for post-evaluation, and requires far less computational effort. The generalized distance measures of chapter
V and the combined measures of chapter IV are as good as the interset distance measures and yet require less computational effort in the discrete case. Hence the former can be considered as more efficient criteria, at least in the distribution-free case. In chapter VI error bounds were derived for various two-class and M-class measures proposed as feature evaluation criteria for pattern recognition problems involving both correctly and incorrectly labelled training data. It was shown that most of the bounds derived in chapter VI in terms of well known measures are tighter than the available bounds in the literature. Furthermore, it was shown that some of the general measures and bounds are generalizations of well known measures and bounds found in the literature. Some of the measures, such as $\tau_n$ (6.32) and $R_r^M$ (6.61), have the added flexibility that by choosing the value of n or r one can obtain a criterion which reflects overlap between distributions, error probability, or a compromise between the two. The measures $R_r^M$ and $\tilde{R}_r^M$ (6.78), being so general and reducing to many of the well known measures as special cases, serve also to tie the various measures together in an underlying manner. All the measures considered in this thesis depend on an estimate of the probability densities or distributions. Due to the data in the experiments it was convenient to estimate the cell probabilities. However, in other applications it may be better to estimate the parameters of some well known density functions. The criteria proposed in this thesis have not yet been investigated for specific density functions.
If the well known density functions are not satisfactory, the estimation can always be accomplished nonparametrically by potential function methods [117, Ch. 6]. While this thesis has tended to present a general, i.e., application-independent, comparison and analysis of feature evaluation criteria, in a particular application the researcher is faced with the nontrivial task of choosing one of the criteria considered. Several guidelines can be given. For data analysis problems related to distance between populations in biometric data, where the variables are usually discrete, any of the measures considered in this thesis would be quite appropriate, although most of them have not yet been investigated on biometric problems. Some measures may be more suited than others for the specific problem at hand. In other pattern recognition problems it may be desired to evaluate features with respect to class-feature dependence, distance between class-conditional densities, or overlap. In the first case measures such as the mutual information and I(X:C), (4.15), would be more appropriate. In the second case a distance measure such as the Kolmogorov variational distance might be used, while the Bhattacharyya coefficient might serve as a good measure of overlap. Various other factors, such as the computational effort and the tightness of the error bounds, would also dictate the choice of criterion. Measures such as $R_r^M$, (6.61), would strongly reflect the Bayes error probability for large values of r. If a well known density function is assumed for a particular problem, some of the criteria would be more efficient and suitable than others.
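As an illustration of the three kinds of criteria just mentioned, the sketch below computes a dependence measure (the mutual information), a distance (the Kolmogorov variational distance), and an overlap measure (the Bhattacharyya coefficient) for one discrete feature; the distributions are invented:

```python
# Three kinds of discrete feature evaluation criteria for a two-class
# problem: class-feature dependence (mutual information), distance
# (Kolmogorov variational distance V), and overlap (Bhattacharyya
# coefficient rho).  All numbers are made up for illustration.

import math

p_c = [0.5, 0.5]                       # priors (equiprobable classes)
p_x_given_c = [
    [0.6, 0.3, 0.1],                   # P(x | C1)
    [0.1, 0.3, 0.6],                   # P(x | C2)
]

joint = [[p_c[i] * p for p in p_x_given_c[i]] for i in range(2)]
p_x = [joint[0][j] + joint[1][j] for j in range(3)]

# Mutual information I(X;C) in bits: class-feature dependence.
mi = sum(joint[i][j] * math.log2(joint[i][j] / (p_x[j] * p_c[i]))
         for i in range(2) for j in range(3) if joint[i][j] > 0)

# Kolmogorov variational distance between the joint distributions.
v = sum(abs(joint[0][j] - joint[1][j]) for j in range(3))

# Bhattacharyya coefficient of the class-conditionals:
# 1 = complete overlap, 0 = disjoint classes.
rho = sum(math.sqrt(p_x_given_c[0][j] * p_x_given_c[1][j]) for j in range(3))

print(mi, v, rho)
```

A feature that separates the classes well scores high on the dependence and distance criteria and low on the overlap criterion.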
For example, for Gaussian density functions with equal covariance matrices the expected divergence becomes very simple to calculate and upper bounds on error probability exist. If no simple density functions can be assumed and the researcher resorts to nonparametric methods, the distance criterion of Patrick and Fisher [152] is quite suitable. If the potential function approach is used, several measures are suitable, in particular the pairwise Bhattacharyya coefficients for which error bounds were derived by the author in [155]. An approach that seems efficient when the initial set of features is very large is the hierarchical approach incorporating pre-evaluation, intraset post-evaluation, and interset post-evaluation criteria sequentially. Since these three types of criteria respectively increase in computational effort but also in performance, the size of the initial feature set might be decreased using pre-selection. The size of the resulting set might be decreased further using intraset post-selection and, finally, the desired features would be chosen using interset post-selection or by incorporating a criterion such as $R_r^M$.

A summary of the overall general results of interest is presented in Table 7.1.

Table 7.1 Summary of general results for measures of interest.

MEASURE    CATEGORY           EQ. NO.  CORR. WITH Pe  NO. OF PDF'S  UPPER BOUNDS  LOWER BOUNDS  COMPLEXITY
M∞(X)      PRE                3.13     0.376          1             NO            NO            1
M∞(X|C)    INTRASET           3.14     0.497          M             NO            NO            2
M∞(C|X)    COMBINED           3.15     0.649          M             YES           YES           3
H(X)       PRE                3.3      0.221          1             NO            NO
H(X|C)     INTRASET           3.2      0.471          M             NO            NO
M0(X)      PRE                3.13     0.149          1             NO            NO
M0(X|C)    INTRASET           3.14     0.370          M             NO            NO            2
V          INTERSET           4.3      0.672          M             YES†          NO            4
J          INTERSET           4.13     0.690          M             NO*           YES           4
ρ          INTERSET           4.5      0.693          M             YES†          NO            4
V_G        MODIFIED INTERSET  5.14     0.651          M             YES           NO            3
J_G        MODIFIED INTERSET  5.12     0.709          M             NO            NO            3
ρ_G        MODIFIED INTERSET  5.16     0.694          M             YES           NO            3
I          COMBINED           4.1      0.712          M             YES           YES           3
Q          COMBINED           4.21     0.630          M             NO            NO            3
I(X:C)     COMBINED           4.15     0.687          M             NO            NO            3
H(X:C)     COMBINED           4.17     0.647          M             NO            NO            3
ρ_M        AFFINITY           4.23     0.628          M             YES           YES           3

† For equiprobable pattern classes only.
* Upper bounds exist for the case of Gaussian distributions.

For each of the criteria used in the experiments in the thesis, Table 7.1 gives its category; the number of the equation in the thesis which defines the criterion; the value of the rank correlation $r_s$ between a feature ordering with $P_e$ and one with the criterion in question; the number of distributions to estimate (one when P(X) is involved and M when the $P(X|C_i)$ are involved); whether upper and lower error bounds exist; and, finally, the amount of computational complexity involved in the discrete case. It should be realized that for certain "nice" distributions some of the criteria require less computation. In Table 7.1 the computational complexity was quantized into 4 values for convenience; the least amount of computation is given the value one, and so on. For example, in the discrete case J and V both have a computational complexity value of 4. However, for Gaussian distributions with equal covariance matrices J becomes very simple to compute whereas V does not. Furthermore, Table 7.1 gives results for $M_\infty(C|X)$ for comparison with $M_\infty(X|C)$ and $M_\infty(X)$. In [109] Vilmansen performed experiments with a measure of probabilistic dependence given by

$$K_\infty(X,C) = \frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{n} \bigl|P(X_j, C_i) - P(X_j)P(C_i)\bigr|$$

and obtained a value of $r_s[P_e : K_\infty(X,C)] = 0.649$ when he assumed $P(C_i) = 1/M$, i = 1, 2, ..., M.
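A direct computation of Vilmansen's dependence measure as written above, on an invented two-class discrete problem with equiprobable classes:

```python
# Vilmansen's probabilistic dependence measure quoted above:
# K(X,C) = (1/M) * sum_i sum_j |P(x_j, C_i) - P(x_j) P(C_i)|.
# The class-conditionals below are made up for illustration.

M = 2
p_c = [0.5, 0.5]
p_x_given_c = [[0.7, 0.2, 0.1],
               [0.2, 0.2, 0.6]]

joint = [[p_c[i] * p for p in p_x_given_c[i]] for i in range(M)]
p_x = [sum(joint[i][j] for i in range(M)) for j in range(3)]

# Sum of absolute deviations of the joint from the independent product.
k = sum(abs(joint[i][j] - p_x[j] * p_c[i])
        for i in range(M) for j in range(3)) / M
print(k)
```

The measure is zero exactly when feature and class are statistically independent, which is why it serves as a dependence criterion.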
From the above equation one can see that for equiprobable pattern classes $K_\infty(X,C)$ is uniquely related to $M_\infty(C|X)$, and hence the latter would yield the same ordering of features as the former. Hence $r_s[P_e : M_\infty(C|X)] = 0.649$. While it is hoped that the work presented in this thesis will help the researcher with the difficult task of choosing a feature evaluation criterion, much work remains to be done. Some of this work is outlined in section 7.2.

7.2 Suggestions for Further Research

1. In chapter III feature ordering experiments were carried out in relation to pre-evaluation and intraset post-evaluation with $M_k(X)$ and $M_k(X|C)$ for k = 0, ∞. Similar experiments should be carried out for values of k such that 0 < k < ∞, to determine the optimum value of k. Study of the resulting metric may help in understanding the relation between the unconditional distribution and class separability.

2. It was shown in chapter III that for the two-class problem with equal a priori probabilities, (3.50) provides, for k = 0, a tighter upper bound on error probability than Shannon's equivocation. It should be determined whether this result also holds for the general M-class problem.

3. The experimental results of chapter V showed that Q is an attractive measure of mutual information. However, no theoretical results are available for Q. Attempts should be made at finding error bounds on Q and comparing them with the bounds on I.

4. In chapter VI several error bounds are derived in terms of the Bhattacharyya coefficient and the equivocation for the two-class problem with imperfectly identified samples.
While it was shown that (6.30) is a tighter upper bound than (6.21), it still has not been determined whether (6.19) is also.

5. In chapter VI several classes of feature evaluation criteria, such as $\tau_n$ (6.32), $R_r^M$ (6.61), and $\tilde{R}_r^M$ (6.77), have been proposed and studied from a theoretical point of view. Feature ordering experiments, similar to those performed in the earlier chapters, should be performed to compare these criteria with the other criteria considered in this thesis and to determine what are reasonable values for n and r in a real-world problem.

6. Just as the substitution of P(X) and U(X) into a class of distance measures yielded the class of certainty measures considered in chapter III, one can substitute P(X) and U(X) into the class of distance measures given by (6.47) to obtain the class of certainty measures given by

$$\hat{D}_r(X) = \sum_{j=1}^{n} \Bigl|\bigl[P(X_j)\bigr]^{1/r} - \bigl[U(X_j)\bigr]^{1/r}\Bigr|^{r}. \qquad (7.1)$$

Remembering that $\tau_r$ in (6.32) measures the similarity between two distributions, we can form a class of uncertainty measures by substituting P(X) and U(X) into (6.32) to yield

$$\hat{T}_r(X) = \sqrt[r]{2} \sum_{j=1}^{n} \frac{P(X_j)\,U(X_j)}{\bigl[P^r(X_j) + U^r(X_j)\bigr]^{1/r}}. \qquad (7.2)$$

These measures should be compared to the certainty measures of chapter III from the theoretical and experimental points of view. Feature ordering experiments should be performed in order to determine their suitability for pre-evaluation and to determine the optimum values of r.

7. Just as in chapter III, the measures of (7.1) and (7.2) can be modified to $\hat{D}_r(X|C)$ and $\hat{T}_r(X|C)$. Feature ordering experiments should be performed to determine the suitability of these measures for intraset post-evaluation and to determine the optimum values of r.

8.
The measures given by (7.1) and (7.2) can be written in the form of equivocation measures as

$$\hat{D}_r(C|X) = \sum_{j=1}^{n} P(X_j) \sum_{i=1}^{M} \Bigl|\bigl[P(C_i|X_j)\bigr]^{1/r} - M^{-1/r}\Bigr|^{r}, \qquad (7.3)$$

and

$$\hat{T}_r(C|X) = \sqrt[r]{2} \sum_{j=1}^{n} P(X_j) \sum_{i=1}^{M} \frac{P(C_i|X_j)\,(1/M)}{\bigl[P^r(C_i|X_j) + M^{-r}\bigr]^{1/r}}. \qquad (7.4)$$

Error bounds should be derived for these measures and compared to the bounds derived in chapter III and to the bounds on Shannon's equivocation.

9. In this thesis the analysis and comparison of the feature evaluation criteria tended to be from the distribution-free point of view. The measures proposed should also be analyzed and compared for specific distributions and density functions such as Gaussian, Laplacian, etc. Some measures which appear to be more complex than others for the distribution-free case actually turn out to be simpler for certain distributions. Also, an attempt should be made at tightening the various error bounds derived in this thesis for particular distributions.

10. All the criteria proposed in this thesis were compared with the Bayes recognition criterion by calculating the Spearman rank correlation coefficients between the corresponding orderings. Hence one can produce an ordering of the measures according to decreasing values of the correlation coefficient. Similar orderings should be obtained by comparing the measures with other error criteria such as the error resulting from the nearest neighbour rule, the distance-from-the-means classifier, and the correlation classifier, just to name a few. A rank correlation analysis of the resulting measure orderings should suggest which are the best measures from a very general point of view, i.e., independent of decision rule.

11. The measure ordering described above was obtained by using a particular feature extraction scheme on a particular data set or pattern recognition application. Similar measure orderings should be obtained for different feature extraction schemes and different pattern recognition applications. A rank correlation analysis of the resulting measure orderings should suggest which are the best measures independent of feature extraction scheme and application.

PART II

CONTEXTUAL DECODING ALGORITHMS

CHAPTER VIII

INTRODUCTION

8.1 Using Contextual Constraints in Pattern Classification

In a general pattern recognition problem, patterns are presented sequentially to a system which extracts features; these features are then used, together with some a priori information, to make a decision on the unknown patterns. In many of these problems there exist statistical dependencies among the patterns. In part II of this thesis a general class of contextual decoding algorithms is proposed for using statistical dependencies among patterns to improve recognition system performance and/or reduce system cost. The goal of the algorithms is to make decisions as efficiently as possible. In one problem of particular interest, the recognition of English text, the characters (letters) appear in the general form of natural language dependence [162]. In this form there are three levels of context representing different types of abstraction. In order of difficulty of abstraction they are syntactics, semantics, and pragmatics. Syntactics deals with signs and their relations to other signs, where a sign can be a character or a group of characters such as a trigram or a word. Semantics refers to signs and their meaning.
Pragmatics is concerned with signs and their relations to users. In 1950 Shannon [163] developed a method of estimating the entropy and redundancy of a natural language. This method exploited the knowledge of syntactics, semantics, and pragmatics possessed by the speaker of the language. Furthermore, he showed that ordinary literary English is at least 75% redundant. The algorithms proposed in this thesis exploit part of this redundancy at the syntactic level. Since the proposed algorithms rest on the theoretical foundations of compound decision theory [164], [165], [166], [167], the basic assumptions usually made will be described in Section 8.2.

8.2 Compound Decision Theory

Let P(X | Λ = θ) denote the probability of the vector sequence X = X^1, X^2, ..., X^N, conditioned on the sequence of identities Λ = λ_1, λ_2, ..., λ_N taking on the values θ = θ_1, θ_2, ..., θ_N, where X^k is the feature vector for the k-th pattern, and where each θ_k takes on M values (the number of pattern classes). Also, let P(Λ = θ) be the a priori probability of the sequence Λ taking on the values θ. The probability of correctly identifying an unknown pattern sequence is maximized by maximizing any monotonic function of

R = P(X | Λ = θ) P(Λ = θ)    (8.1)

over all the values of θ. In other words, a decision is arrived at on a particular pattern by using information from all the previous and subsequent patterns. In a less complicated approach, sequential compound decision theory, a decision is made on a pattern by using information from the present and all the previous patterns only.
It is clear, however, that both of these approaches are practically unrealizable without further simplifying assumptions, because the amount of storage and data needed for the estimation of the probability distributions in (8.1) is astronomical. The following assumption is a giant step in the right direction:

P(X | Λ = θ) = Π_{n=1}^{N} P(X^n | λ_n = θ_n).    (8.2)

The assumption (8.2), known as the assumption of conditional independence, implies that the manner in which the character sample in question is formed depends only on the character class to which the sample belongs, and not on the classes of, or measurements on, any previous or subsequent character samples. While this assumption does not hold for certain pattern recognition problems such as speech and cursive-script recognition, it is a reasonable assumption for machine and handprinted character recognition. For example, in printed text the letter t is usually formed in a manner independent of what letter appeared before it, whereas in cursive script the t is formed one way when it follows the letter o and another way when it follows the letter a. Although (8.2) does not necessarily hold in general, it reduces enormously the storage capacity required for the probabilities on the left-hand side of (8.2) and the amount of data required to estimate them. Even so, it is still necessary to store P(Λ = θ) for all M^N pattern sequences, and this required storage capacity increases exponentially with N. What is needed, in addition to (8.2), are techniques which make effective use of contextual constraints without storing excessive amounts of data or using excessive computation.
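The storage argument above is easy to make concrete. The sketch below contrasts the M^N entries needed to tabulate the full joint prior with the per-pattern factorization of assumption (8.2); the numbers and function names are illustrative, not from the thesis.

```python
# Illustration of the storage argument: tabulating the full joint prior
# P(Lambda = theta) over an N-character sequence with M classes needs
# M**N entries, while under assumption (8.2) the class-conditional
# likelihood factors into one term per pattern.
M = 26  # e.g. upper-case English letters

def joint_prior_entries(M, N):
    """Entries needed to store P(Lambda = theta) for every sequence."""
    return M ** N

def factored_likelihood(cond_probs):
    """Assumption (8.2): P(X | Lambda = theta) is the product over n of
    P(X^n | lambda_n = theta_n); cond_probs lists those per-pattern
    conditional probabilities."""
    p = 1.0
    for q in cond_probs:
        p *= q
    return p

bigram_table = joint_prior_entries(M, 2)     # 676 entries
ten_gram_table = joint_prior_entries(M, 10)  # about 1.4e14 entries
```

For N = 10 the table already exceeds 10^14 entries, which is the "astronomical" storage the text refers to.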
Two approaches to the problem of storing P(Λ = θ) have been taken in the past; they are described in Section 8.3.

8.3 Previous Research

The first attempts to use contextual constraints to improve the performance of character recognition systems were limited to degarbling English text, usually by dictionary look-up techniques [168], [169], [170]. These methods make use of only the decisions on the identities of the characters. As Raviv [171] has pointed out, such methods discard a substantial amount of information contained in the measurements of the characters. The dictionary look-up methods are based on the assumption that every word in the text can be found in a dictionary table which must be stored. When the classifier puts out a group of characters making up a potential word, all the words of the same length are searched in the dictionary to find the best match. This approach is satisfactory when the dictionary is small, such as a table of legal Morse-code symbols [172]. One of the first experiments using the dictionary look-up methods was the application by Bledsoe and Browning [173] to handprinted character recognition. More recently, Alter [174] made use of the syntax and a dictionary of the English language in an algorithm for determining which of the many possible sequences is, in some sense, most likely to have been intended, given the sequence actually received. The obvious disadvantage of these approaches is that a dictionary of reasonable size takes up too much storage. In the more recent past, attempts have been made at using information from the measurements and utilizing contextual information in the form of bigram or trigram statistics rather than dictionaries.
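The dictionary look-up idea just described can be sketched in a few lines. The four-word dictionary and the function name `degarble` are our own illustration; real systems of the period used much larger stored tables.

```python
def degarble(word, dictionary):
    """Dictionary look-up correction: among stored words of the same
    length as the classifier output, return the one agreeing with it in
    the most character positions (the 'best match')."""
    candidates = [w for w in dictionary if len(w) == len(word)]
    if not candidates:
        return word  # no same-length entry: keep the raw decision
    return max(candidates,
               key=lambda w: sum(a == b for a, b in zip(w, word)))

dictionary = ["THE", "AND", "CAT", "HAT"]
fixed = degarble("TQE", dictionary)  # classifier garbled 'THE'
```

Note that only the hard decisions on the characters enter the search; the measurement information Raviv refers to has already been thrown away, which is exactly the criticism of these methods.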
These approaches fall under the label of Markov approaches because they usually involve the assumption that the language is a low-order Markov chain. At any rate, these approaches are rooted in the assumption that the true identity of a given character is dependent on the true identities of the neighbouring characters, and they rest on the theoretical foundations of compound decision theory. One of the first attempts along these lines was made by Edwards and Chambers [175], who used bigrams in an attempt to improve the performance of a character recognition scheme. First, using information from the measurements only, the algorithm narrows the choice down to two or more letters. It then compares the conditional bigrams of these letters, conditioned on the character which was decided upon previously, and selects the character which results in the bigram taking on the highest value. Very little information from the measurements is used in this approach because most of it is discarded during the decision process, and in fact the authors report poor results. Only recently, Abend [164] and Raviv [171] independently derived the formal sequential decision-theoretic solution for the optimum use of contextual information at the syntactic level, under the assumption that the natural language to be recognized is a Markov source. Furthermore, Raviv showed that to make an optimal decision on a character it is not necessary to store the a posteriori probability vectors of all previous characters; at each stage the a posteriori probability vector can be calculated recursively. Raviv also demonstrated the importance of appropriately balancing the information obtained from contextual considerations with that obtained from the characters.
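The recursive computation that Raviv's result permits can be sketched for a first-order Markov source. This is our paraphrase of the idea, not Raviv's notation; the two-class transition and likelihood values are made up.

```python
def posterior_update(prev_posterior, transition, likelihood):
    """One recursive step of the a posteriori probability vector for a
    first-order Markov source: the new posterior over the M classes is
    formed from the previous posterior, the bigram transition
    probabilities P(lambda_n = j | lambda_{n-1} = i), and the
    measurement likelihoods P(X^n | lambda_n = j)."""
    M = len(likelihood)
    # Predict the class of the current character from context alone.
    pred = [sum(prev_posterior[i] * transition[i][j] for i in range(M))
            for j in range(M)]
    # Weight the prediction by the measurement likelihoods and normalize.
    unnorm = [pred[j] * likelihood[j] for j in range(M)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two-class illustration with made-up numbers.
post = posterior_update([0.5, 0.5],
                        [[0.9, 0.1], [0.2, 0.8]],
                        [0.3, 0.7])
```

Only the previous posterior vector is carried forward, so the storage per step is M numbers rather than the whole history of the sequence.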
In another approach based on compound decision theory, Duda and Hart [176] proposed proceeding in blocks of L characters, making a decision on each block. This method assumes that any one block of L characters is independent of another block. Duda and Hart discuss the combinatorial explosion related to this approach. For example, if L = 10 and four alternatives for each character are considered, then over one million strings of characters must be enumerated and compared. Experiments with an algorithm based on this approach were performed by Toussaint [56] and Donaldson and Toussaint [177], in which only the "most" probable bigrams and trigrams were searched, significantly reducing the computation without reducing the recognition accuracy. The results of [56] and [177] suggest that the optimal approach to the use of context in real-world problems is very inefficient and that, to obtain an efficient solution, the algorithms should be suboptimal at least at some levels. In a recent paper by Riseman and Ehrich [178], the dictionary look-up method and the Markov approach are combined and applied to the recognition of 350 words without using measurement information. In this algorithm the bigram probabilities were quantized into two levels (legal or illegal). Hence this represents another level at which suboptimality can yield more efficient decisions. In all the approaches discussed so far, a nonsequential classifier is used. However, when features vary in cost and are expensive, as in medical diagnosis, a sequential classifier is preferable. Hussain [179], [180] recently developed the theory for, and performed experiments on, using contextual constraints in an optimum way when the classifier is a sequential one.
All the methods discussed above belong to the class of supervised learning schemes, in which the classifier is informed, during the training phase, of the class to which the pattern samples belong. In two recent papers [181], [182] Hilborn and Lainiotis discuss the use of contextual constraints in unsupervised classification from the theoretical point of view. The algorithms proposed in this thesis lie along the lines of the approach taken by Duda and Hart [176], and in fact represent a generalization and formalization of their approach. Finally, along purely theoretical lines, research is being done on finding error bounds that are easy to evaluate compared to the expression for error probability in dependent-hypothesis problems. Not much headway has been made in this difficult area. In a recent paper by Chu [183], upper bounds were obtained for the average error probability of an optimal recognition procedure based on compound decision functions for the special case where the context is generated by a two-state stationary Markov chain. However, it was shown by the author [184] that in many cases these bounds are too loose to be of any value in practical applications.

8.4 Scope of the Thesis (Part II)

Part II of this thesis is a preliminary experimental analysis and comparison of a proposed general class of algorithms for the efficient use of contextual constraints, at the syntactic level, in the decision process of statistical pattern recognition systems. The general class of algorithms is described in Chapter IX. The algorithms can be broken down into several categories according to several criteria.
The manner in which the algorithm proceeds through a piece of text divides the algorithms into three categories, referred to as sequential decoding, block decoding, and hybrid decoding. Some general comments applying to this categorization are given in Section 9.1. The proposed algorithms are of particular interest because of their flexibility. The algorithms depend on a set of parameters which determine the amount of contextual information to be used, how it is to be used, and the computation and storage needed. Some of these parameters are defined in Section 9.2. One of the most important parameters in question is referred to as the depth of search. Section 9.3 presents two general types of algorithms in which the depth of search values are prespecified and remain fixed. The efficiency of these algorithms can be increased by incorporating a threshold and effecting a variable depth of search depending on how noisy the pattern is. Several types of algorithms incorporating the threshold are described in Section 9.4. An approach to storage reduction, based on approximating the ℓ-gram distribution in terms of lower-order marginals, is presented in Section 9.5.

The experiments and their results are described in Chapter X. Section 10.1 is devoted to the description of the data base, preprocessing, feature extraction algorithm, formation of English texts, and system evaluation. All the experiments were performed using a set of 7 texts totalling 7938 characters, in which the character samples were taken from Munson's multiauthor upper-case handprinted data file. An experimental evaluation of the manner in which the pattern-class alternatives are ordered is presented in Section 10.2. In Section 10.3 a study is presented in which the usefulness of using contextual information is determined as a function of the initial system performance by varying the number of features used in the classifier. An experimental evaluation and comparison of the contextual decoding algorithms with fixed depth of search are presented in Section 10.4. Experiments were performed in order to compare block decoding with sequential decoding using bigrams. Further experiments were performed to assess and compare the performance of various sequential decoding algorithms in the look-ahead-look-back mode of decision which use bigrams in such a way as to approximate trigrams. Finally, a study is made of these algorithms with respect to efficiency. An experimental evaluation and comparison of the sequential decoding algorithms incorporating a threshold are presented in Section 10.5 for bigrams in the look-ahead mode of decision. Three types of algorithms are investigated from the points of view of error probability and computational complexity. Some concluding remarks are included in Chapter XI. A general discussion of the efficiency of the algorithms considered in the thesis is given in Section 11.1. Some general comments with regard to using contextual information in the algorithms are presented in Section 11.2. Finally, some suggestions for further research are included in Section 11.3.

CHAPTER IX

CONTEXTUAL DECODING ALGORITHMS

9.1 Introduction

In this chapter a very general contextual decoding technique is proposed for taking dependencies among patterns into account. The technique will be referred to as generalized decoding. Some of the procedures mentioned in Section 8.3 based on the Markov approach, including the suboptimal block decoding method of Donaldson and Toussaint [177], are special cases of generalized decoding.
In addition, many other interesting suboptimal techniques, which will be described below, are also special cases of generalized decoding. Of particular interest are a sequential decoding method inspired by the Fano algorithm for convolutionally encoded symbol sequences [185], [186], and a combined block and sequential method which will be referred to as hybrid decoding. The proposed decoding technique is of particular interest because of its flexibility. The procedure depends on a set of parameters, to be defined below, which determine not only the complexity of decoding, but also the amount of contextual information to be used and how it is to be processed. Let

ω_1, ω_2, ..., ω_n, ω_{n+1}, ..., ω_{n+m}, ω_{n+m+1}, ..., ω_{n+ℓ}, ω_{n+ℓ+1}, ..., ω_k

be a sequence of k patterns representing a portion of text, i.e., k is very large, where a decision has already been made on patterns ω_1 through ω_{n−1}. Each of these patterns may belong to any one of M pattern classes. Let λ_n be the identity of the n-th pattern in the sequence. Let X^n be the feature vector resulting from ω_n, and let P(X^n | λ_n = i) and P(λ_n = i) be, respectively, the conditional probability that a feature vector X^n results from the n-th pattern in the sequence given that the true identity of that pattern is i, and the a priori probability that the true identity of ω_n is i. Generalized decoding makes a decision on a sequence of m+1 patterns, ω_n, ω_{n+1}, ..., ω_{n+m}, by using information from a sequence of ℓ+1 patterns, ω_n, ω_{n+1}, ..., ω_{n+ℓ}, where m ≤ ℓ. One can consider viewing the sequence of patterns through an "observation window" of length ℓ+1.
When a decision has been made on the sequence of m+1 patterns, the "window" is advanced by m+1 patterns and the decision process begins anew. When m = 0 < ℓ, generalized decoding will be referred to as sequential decoding, and when m = ℓ > 0, as block decoding. When ℓ > m > 0, generalized decoding will be referred to as hybrid decoding. When no contextual information is to be used, the classifier output consists of decisions based on the feature vector X^n extracted from the n-th pattern ω_n. In some of the methods incorporating contextual information described in the review of previous research, use is made of only those decisions in making a further, final decision on ω_n. Before a Bayes classifier makes a decision on ω_n, however, it may have available an a posteriori probability vector of dimensionality M, or any vector monotonic to it, such as a joint probability vector. The joint probability vector for the n-th pattern in a sequence whose feature vector is X^n is given by

P_n = {P(X^n, λ_n = 1), P(X^n, λ_n = 2), ..., P(X^n, λ_n = M)}.    (9.1)

There are two pieces of information contained in every element of P_n: the value of the probability, and the pattern class which yields that value. In a Bayes classifier which uses no contextual information, a decision is made by selecting the pattern class associated with the element of P_n for which the value of the joint probability is highest. Let R_n be the joint probability vector for ω_n in which the pattern classes and their associated probabilities are reordered according to decreasing values of those probabilities, and let L_n be a list of pattern classes for ω_n, without their associated probabilities, which contains ℓ of the M possible elements.
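The vector P_n of (9.1), its ranked form R_n, and the truncated class list just introduced can be sketched as follows; the class indices, probabilities, and function names are illustrative, not the thesis's notation.

```python
def ranked_joint_vector(likelihoods, priors):
    """Form the joint probability vector P_n of (9.1) and rank it to get
    R_n: pairs (class, P(X^n, lambda_n = class)) sorted by decreasing
    probability.  Classes are numbered from 1, as in the text."""
    joint = [(c + 1, lik * pri)
             for c, (lik, pri) in enumerate(zip(likelihoods, priors))]
    return sorted(joint, key=lambda pair: pair[1], reverse=True)

def top_list(R_n, depth):
    """The truncated list for omega_n: the classes attached to the
    first `depth` elements of R_n, probabilities dropped."""
    return [c for c, _ in R_n[:depth]]

# Three-class illustration.
R = ranked_joint_vector([0.1, 0.6, 0.3], [0.5, 0.2, 0.3])
L2 = top_list(R, 2)
```

The Bayes decision without context is simply the class in the first element of R_n; the contextual decoders below retain the whole ranked vector instead.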
The ℓ elements of L_n represent the pattern classes associated with the first ℓ elements of R_n. Some researchers in the past have incorporated contextual information by feeding the list vector L_n into the contextual decoder. In the proposed scheme, the entire ranked joint probability vector R_n is made use of in the decoder. This means that the decoder is "looking" through an "observation window" at a matrix of dimensionality M by ℓ+1. The generalized decoder essentially traces a path from an element in the (ℓ+1)st column to an element in the 1st column. The output of the contextual decoder consists of those pattern classes associated with the elements lying on the traced path from the 1st column to the (m+1)st column. If no context at all were used, the traced path would coincide with the first row of the observation matrix. The observation matrix can be considered as a buffer between the classifier output and the contextual decoder. As patterns are presented to the classifier the buffer gets filled. Once filled, its contents are shifted to the decoder.

9.2 Definition of Parameters

Before presenting in detail the manner in which the contextual decoding algorithms make a decision, three sets of parameters used to trace a path through the observation matrix will be defined. Let Q_n be a likelihood vector formed by dividing the i-th element of R_n by P(λ_n = i) for i = 1, 2, ..., M, i.e.,

Q_n = [P(X^n | λ_n = 1), P(X^n | λ_n = 2), ..., P(X^n | λ_n = M)],    (9.2)

where it is assumed for simplicity of notation, without loss of generality, that the first element of Q_n is associated with class one, the second with class two, etc. Let D(x) be a "depth of search" function, where x takes on the integer values of the subscript of ω, i.e., x = n, n+1, ..., n+ℓ.
The function D(x) defines ℓ+1 values of depth of search. For example, D(k) = d_k, where d_k is an integer such that 1 ≤ d_k ≤ M which determines the depth of search for the k-th pattern. In making a decision on this pattern, use is to be made of only the first d_k elements of Q_k. Let H(x) be a "holding" function, where x takes on integer values of the subscript of ω, i.e., x = n+1, n+2, ..., n+ℓ−1. The function H(x) defines ℓ−1 holding values h_{n+1}, h_{n+2}, ..., h_{n+ℓ−1}, where

1 ≤ h_{n+k} ≤ (d_{n+k})(h_{n+k+1})  for 1 ≤ k ≤ ℓ−2,

and

1 ≤ h_{n+ℓ−1} ≤ (d_{n+ℓ−1})(d_{n+ℓ}).

The use of the holding values will become obvious in the description of the decoding process. Let T(x) be a "threshold" function which defines ℓ threshold values t_n, t_{n+1}, ..., t_{n+ℓ−1}, where t_i is a real number on the interval 0 ≤ t_i ≤ 1 for i = n, n+1, ..., n+ℓ−1. The role of the threshold values will become evident below. The generalized decoding algorithms can be broken down into two main groups: those that incorporate the threshold function and those that do not. Furthermore, the algorithms that do not incorporate the threshold function can be divided into two groups: those that incorporate the depth of search function only, and those that also incorporate the holding function. In addition, the algorithms incorporating the threshold function can also use the depth of search function as a default option to save on computation and thus increase the efficiency of the algorithms.

9.3 Algorithms with Fixed Depth of Search

Let Ω_{d_j} be the set of pattern classes, for the j-th pattern in the sequence, consisting of those classes that are ranked as the d_j highest by using P(X^j | λ_j = a) P(λ_j = a) for a = 1, 2, ..., M, where d_j is the depth of search associated with ω_j. In the algorithms proposed by Duda and Hart [176] a decision is made on patterns ω_n, ω_{n+1}, ..., ω_{n+ℓ} by maximizing the quantity

max { P(X^n | λ_n = α) P(X^{n+1} | λ_{n+1} = β) ··· P(X^{n+ℓ} | λ_{n+ℓ} = ε) P(λ_n = α, λ_{n+1} = β, ..., λ_{n+ℓ} = ε) },    (9.3)

where α ∈ Ω_{d_n}, β ∈ Ω_{d_{n+1}}, ..., ε ∈ Ω_{d_{n+ℓ}}, and selecting those pattern classes associated with the maximum value. It is obvious that, in practical applications, d_n, d_{n+1}, ..., d_{n+ℓ} and ℓ would be small. In particular, ℓ would not be greater than 2 or 3 when (9.3) is maximized. However, it will be shown in Section 9.5 that ℓ can be made larger when (9.3) is suitably modified.

Let Ω^{h_j}_{d_j d_k} be the set of paired pattern-class alternatives, for the j-th and k-th patterns in the sequence, consisting of those pairs of classes which are ranked as the h_j highest pairs by using P(X^j | λ_j = α) P(X^k | λ_k = β) P(λ_j = α, λ_k = β), for α, β = 1, 2, ..., M such that α ∈ Ω_{d_j} and β ∈ Ω_{d_k}. Similarly, let Ω^{h_i}_{d_i d_j d_k} be the set of pattern-class alternatives grouped in 3-tuples, for the i-th, j-th, and k-th patterns in the sequence, consisting of those 3-tuples which are ranked as the h_i highest 3-tuples by using P(X^i | λ_i = α) P(X^j | λ_j = β) P(X^k | λ_k = γ) P(λ_i = α, λ_j = β, λ_k = γ), for α, β, γ = 1, 2, ..., M such that α ∈ Ω_{d_i}, β ∈ Ω_{d_j}, and γ ∈ Ω_{d_k}, and so on. The algorithms incorporating the depth of search function and the holding function can be described by an ℓ-stage process as shown below and can be grouped into two classes of algorithms: Type A and Type B.

Type A Algorithms

Let Max(h){·} indicate the operation of selecting the h highest elements of a set. The decoding procedure is described by the following ℓ-stage process.

Stage 1:

F_{n+ℓ−1}(α, β) = Max(h_{n+ℓ−1}) { P(X^{n+ℓ} | λ_{n+ℓ} = α) P(X^{n+ℓ−1} | λ_{n+ℓ−1} = β) P(λ_{n+ℓ} = α, λ_{n+ℓ−1} = β) },

where α ∈ Ω_{d_{n+ℓ}} and β ∈ Ω_{d_{n+ℓ−1}}.

Stage 2:

F_{n+ℓ−2}(α, β, γ) = Max(h_{n+ℓ−2}) { [ P(X^{n+ℓ−2} | λ_{n+ℓ−2} = γ) P(λ_{n+ℓ} = α, λ_{n+ℓ−1} = β, λ_{n+ℓ−2} = γ) / P(λ_{n+ℓ} = α, λ_{n+ℓ−1} = β) ] f_{n+ℓ−1}(α, β) },

where γ ∈ Ω_{d_{n+ℓ−2}} and (α, β) ∈ Ω^{h_{n+ℓ−1}}_{d_{n+ℓ} d_{n+ℓ−1}}, and where f_{n+ℓ−1}(α, β) is an element of the set F_{n+ℓ−1}(α, β).

Stage ℓ:

D(α, β, ..., δ, ε) = Max(1) { [ P(X^n | λ_n = ε) P(λ_{n+ℓ} = α, λ_{n+ℓ−1} = β, ..., λ_n = ε) / P(λ_{n+ℓ} = α, λ_{n+ℓ−1} = β, ..., λ_{n+1} = δ) ] f_{n+1}(α, β, ..., δ) },

where ε ∈ Ω_{d_n} and (α, β, ..., δ) ∈ Ω^{h_{n+1}}_{d_{n+ℓ} d_{n+ℓ−1} ··· d_{n+1}}, and f_{n+1}(α, β, ..., δ) is a member of the set F_{n+1}(α, β, ..., δ) calculated at stage ℓ−1, and where D(α, β, ..., δ, ε) represents a path through the observation matrix which connects element ε of column one with element δ of column two, and so on, up to element α of column ℓ+1. In other words, D(α, β, ..., δ, ε) is a string of decisions that can be made on patterns ω_n, ω_{n+1}, ..., ω_{n+ℓ}. If a decision is made on ω_n that λ_n = ε (sequential decoding), then the "observation window" is shifted to ω_{n+1}, ω_{n+2}, ..., ω_{n+ℓ+1}. If a decision is made that λ_n = ε and λ_{n+1} = δ (hybrid decoding), then the "observation window" is shifted to ω_{n+2}, ω_{n+3}, ..., ω_{n+ℓ+2}. If a decision is made on ω_n, ω_{n+1}, ..., ω_{n+ℓ} (block decoding) that λ_n = ε, λ_{n+1} = δ, ..., λ_{n+ℓ} = α, then the "observation window" is shifted to ω_{n+ℓ+1}, ω_{n+ℓ+2}, ..., ω_{n+2ℓ+1}. Every time the "observation window" is shifted, the decision process returns to stage one. While these algorithms are attractive for small values of ℓ, storage of the contextual constraints increases exponentially with ℓ, since at stage ℓ, (ℓ+1)-grams are needed; i.e., for a two-stage process, trigrams are required.
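The stagewise process above is essentially a pruned search through the observation matrix, with the holding values bounding how many partial paths survive each stage. The sketch below is a simplified scalar version of that idea (it ignores the conditional-probability normalization of the stage equations and uses made-up scores and bigram constraints), not the thesis's exact procedure.

```python
def staged_decode(columns, bigram, hold):
    """Stagewise pruned search in the spirit of the Type A/B algorithms:
    proceed through the observation window one column per stage and,
    after each stage, keep only the `hold` highest-scoring partial
    paths (the role of the holding values h_x).  `columns` holds
    per-pattern class scores (e.g. ranked joint probabilities) and
    `bigram` the pairwise contextual constraints."""
    # Stage 1 seeds the paths from the rightmost column of the window.
    paths = [((c,), s) for c, s in sorted(columns[-1].items())]
    for col in reversed(columns[:-1]):
        extended = []
        for path, score in paths:
            for c, s in col.items():
                # Extend the partial path leftward, weighting it by the
                # class score and the bigram constraint.
                extended.append(((c,) + path,
                                 score * s * bigram[(c, path[0])]))
        extended.sort(key=lambda p: p[1], reverse=True)
        paths = extended[:hold]          # the Max(h){.} operation
    return paths[0][0]                   # the final Max(1){.} operation

# Two patterns, two classes, holding value 2.
best = staged_decode([{0: 0.6, 1: 0.4}, {0: 0.3, 1: 0.7}],
                     {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.5},
                     hold=2)
```

Here the classifier alone would favour class 0 for the first pattern, but the bigram constraint pulls the decision to the string (1, 1), which illustrates how context can override the measurement ranking.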
The Type B algorithms are more suitable for larger values of ℓ, since the storage of the contextual constraints remains constant as ℓ increases.

Type B Algorithms

Stage 1:

F_{n+ℓ−1}(α, β) = Max(h_{n+ℓ−1}) { P(X^{n+ℓ} | λ_{n+ℓ} = α) P(X^{n+ℓ−1} | λ_{n+ℓ−1} = β) P(λ_{n+ℓ} = α, λ_{n+ℓ−1} = β) },

where α ∈ Ω_{d_{n+ℓ}} and β ∈ Ω_{d_{n+ℓ−1}}.

Stage 2:

F_{n+ℓ−2}(β, γ) = Max(h_{n+ℓ−2}) { [ P(X^{n+ℓ−2} | λ_{n+ℓ−2} = γ) P(λ_{n+ℓ−1} = β, λ_{n+ℓ−2} = γ) / ( P(X^{n+ℓ} | λ_{n+ℓ} = α) P(λ_{n+ℓ} = α, λ_{n+ℓ−1} = β) ) ] f_{n+ℓ−1}(α, β) },

where γ ∈ Ω_{d_{n+ℓ−2}} and (α, β) ∈ Ω^{h_{n+ℓ−1}}_{d_{n+ℓ} d_{n+ℓ−1}}, and where f_{n+ℓ−1}(α, β) is an element of the set F_{n+ℓ−1}(α, β).

Stage ℓ:

D(ε, γ) = Max(1) { [ P(X^n | λ_n = γ) P(λ_{n+1} = ε, λ_n = γ) / ( P(X^{n+2} | λ_{n+2} = δ) P(λ_{n+2} = δ, λ_{n+1} = ε) ) ] f_{n+1}(δ, ε) },

where γ ∈ Ω_{d_n} and (δ, ε) ∈ Ω^{h_{n+1}}_{d_{n+2} d_{n+1}}, and where f_{n+1}(δ, ε) is a member of the set F_{n+1}(δ, ε).

When ℓ = m (block decoding) it is intuitively satisfying to choose the depth of search values such that d_n = d_{n+1} = ... = d_{n+ℓ}. When m = 0 (sequential decoding) it is intuitively satisfying to choose a monotonically decreasing depth of search function such that d_n > d_{n+1} > ... > d_{n+ℓ}. The value of d_n should be large enough so that the correct alternative is observed most of the time, since a decision is made on ω_n. The patterns ω_{n+ℓ} have a decreasing influence on ω_n as ℓ increases, and hence one would expect a more efficient decision by decreasing d_{n+ℓ} as ℓ increases. Similarly, in hybrid decoding one would like to choose the depth of search values such that d_n = d_{n+1} = ... = d_{n+m} > d_{n+m+1} > ... > d_{n+ℓ}. For m = ℓ = 1, 2 these algorithms reduce to those used by Toussaint [56] and Donaldson and Toussaint [177], except for the fact that in [56] and [177] the alternatives were initially ranked by using P(X^j | λ_j = α) rather than P(X^j | λ_j = α) P(λ_j = α). It is easy to see that the algorithms described above effect a "look-ahead" (LA) mode of decision. For example, when m = 0 and ℓ = 1, a decision is made on pattern ω_n by using information from ω_n and the look-ahead neighbour ω_{n+1}. The modification of these, and subsequent algorithms in the thesis, to the look-back (LB) and the look-ahead-and-look-back (LA-LB) modes of decision is obvious. One can also see that, in the above description of the Type B algorithms, bigram contextual constraints are used throughout the process even as ℓ increases. The modifications necessary to incorporate higher-order constraints are also obvious.

9.4 Algorithms with Variable Depth of Search Incorporating a Threshold

In this section the generalized decoding algorithms that incorporate the threshold function will be described. When the threshold function is used, the depth of search and holding functions are not needed, although the depth of search function can still be used for learning optimum threshold values and as a default option to make more efficient decisions. In the algorithms of Section 9.3 the depth of search function determines the amount of search involved in the decision process before a decision is made. However, this depth of search function is fixed and hence effects an equal amount of search upon every decision to be made, even though many decisions might be correctly made with much less search. A more efficient decision process (less search on the average) would incorporate a variable depth of search such that it would be small for easily recognizable patterns and progressively larger for increasingly noisy patterns.
One way of achieving this goal is to calculate a measure of recognizability and compare it to a threshold. By using a threshold function, the easy decisions will not demand much search on the average, because chances are that the threshold will be exceeded early in the decision process. On the other hand, the difficult decisions will demand a greater amount of search before exceeding the threshold. Unfortunately, while terms such as

P(X^n | λ_n = i) P(X^{n+1} | λ_{n+1} = j) P(λ_n = i, λ_{n+1} = j)    (9.4)

do measure the recognizability of the patterns represented by X^n and X^{n+1}, they do so in a relative way, and hence it is meaningless to compare them to a threshold. One way to circumvent this problem is to normalize the terms in (9.4) as done below, where B_{ij} and T_{ijk} denote the normalized terms for the bigram and trigram cases, respectively. To simplify notation, let (9.2) be written as Q_n = {Q_n^1, Q_n^2, ..., Q_n^M}. Define

B_{ij} = (Q_n^i)(Q_{n+1}^j) P(λ_n = i, λ_{n+1} = j) / Σ_{α=1}^{M} Σ_{β=1}^{M} (Q_n^α)(Q_{n+1}^β) P(λ_n = α, λ_{n+1} = β)    (9.5)

and

T_{ijk} = (Q_n^i)(Q_{n+1}^j)(Q_{n+2}^k) P(λ_n = i, λ_{n+1} = j, λ_{n+2} = k) / Σ_{α=1}^{M} Σ_{β=1}^{M} Σ_{γ=1}^{M} (Q_n^α)(Q_{n+1}^β)(Q_{n+2}^γ) P(λ_n = α, λ_{n+1} = β, λ_{n+2} = γ),    (9.6)

where it is assumed as before, for simplicity of notation and without loss of generality, that Q_n^1 is associated with class one, Q_n^2 with class two, etc. The algorithms proposed in this section can be divided into three basic groups that will be referred to as Types I, II, and III. In addition, the Type I algorithms can be broken down into three groups that will be referred to as Types I-A, I-B, and I-C. Furthermore, the search in all the above algorithms can be done in various modes.
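The normalization (9.5) can be sketched directly; after dividing by the double sum, the B_{ij} sum to one, which is what makes a comparison against a fixed threshold meaningful. The two-class numbers below are made up.

```python
def normalized_bigram(Q_n, Q_n1, bigram_prior):
    """The normalization of (9.5): each raw term
    Q_n[i] * Q_{n+1}[j] * P(lambda_n = i, lambda_{n+1} = j) is divided
    by the same quantity summed over all class pairs, so the resulting
    B_ij sum to one."""
    M = len(Q_n)
    raw = [[Q_n[i] * Q_n1[j] * bigram_prior[i][j] for j in range(M)]
           for i in range(M)]
    z = sum(sum(row) for row in raw)
    return [[v / z for v in row] for row in raw]

# Two-class illustration: likelihood vectors for two successive
# patterns and a bigram prior table.
B = normalized_bigram([0.8, 0.2], [0.4, 0.6],
                      [[0.5, 0.1], [0.1, 0.3]])
```

The trigram normalization (9.6) is identical in form, with a triple product in the numerator and a triple sum in the denominator.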
Two search modes will be considered in this thesis, referred to as Search Modes I and II. Only the sequential decoding algorithms will be described since the modifications necessary to obtain the block and hybrid decoding algorithms will become evident. The algorithms are best described by giving some examples in the form of block diagrams. Figure 9.1 illustrates the sequential decoding algorithms of Type I-A with Search Mode I for ℓ = 2, where t is the threshold such that 0 ≤ t ≤ 1 and d_n^D, d_{n+1}^D, and d_{n+2}^D are the default depth of search values. The purpose of the default depth of search values is to reduce the unnecessary computation in a decision. When t = 0 no contextual information is used. When t = 1.0 and ℓ = 1 the algorithms reduce to those of section 9.3.

Figure 9.2 illustrates the sequential decoding algorithms of Type I-B with Search Mode I for ℓ = 2, which require less computation than the Type I-A algorithms. The drawback of both Type I-A and Type I-B algorithms is that as ℓ increases the contextual constraints require storage which grows exponentially with ℓ. The Type I-C algorithms are more suitable for large values of ℓ since the storage of the contextual constraints remains constant as ℓ increases. The Type I-C algorithms can be obtained by substituting B_ij for T_ijk in Fig. 9.2. For ℓ = 1 the Type I-A, I-B and I-C algorithms are all equivalent.

One aspect of the Type I algorithms which contributes toward their computational complexity is the fact that when the search is completed without the threshold being exceeded, say in Fig. 9.1, a decision is made on ω_n based on the highest T_ijk encountered.

Fig. 9.1 Sequential decoding algorithm for ℓ = 2 of Type I-A with Search Mode I.

Fig. 9.2 Sequential decoding algorithm for ℓ = 2 of Type I-B with Search Mode I.
To implement this type of decision the decoder must keep track of the highest T_ijk encountered throughout the decision process by continually making comparisons, where each decision involves d_n·d_{n+1}·d_{n+2} comparisons. In the Type II algorithms these computations are eliminated by deciding λ^n = 1; in other words, when the search is completed without the threshold being exceeded a decision is made without the use of context. Hence no contextual information is used when either t = 0 or t = 1.0, and the most contextual information will be used for some t such that 0 < t < 1.0.

Another level at which the computational complexity of the Type I and Type II algorithms can be markedly decreased, thus enhancing the efficiency of the decisions made, is in the computation of the denominator of terms such as B_ij and T_ijk in (9.5) and (9.6). In the above algorithms, before the decision process begins, the denominator of terms such as B_ij and T_ijk must be computed. This requires, in the case of B_ij, 2 × M^2 multiplications and (M^2 - 1) additions, which can be considerable when M is large as in English text. In the Type III algorithms these computations can be drastically reduced by using an approximation of B_ij and T_ijk as follows:

    B_ij = (Q_n^i)(Q_{n+1}^j) P(λ^n = i, λ^{n+1} = j) / Σ_α Σ_β (Q_n^α)(Q_{n+1}^β) P(λ^n = α, λ^{n+1} = β)        (9.7)

where α ∈ Ω_{d_n} and β ∈ Ω_{d_{n+1}}, and

    T_ijk = (Q_n^i)(Q_{n+1}^j)(Q_{n+2}^k) P(λ^n = i, λ^{n+1} = j, λ^{n+2} = k) / Σ_α Σ_β Σ_γ (Q_n^α)(Q_{n+1}^β)(Q_{n+2}^γ) P(λ^n = α, λ^{n+1} = β, λ^{n+2} = γ)        (9.8)

where α ∈ Ω_{d_n}, β ∈ Ω_{d_{n+1}}, and γ ∈ Ω_{d_{n+2}}. To differentiate d_n, d_{n+1}, and d_{n+2} in the decision process from d_n, d_{n+1}, and d_{n+2} as used in the calculation of B_ij and T_ijk, let d_n = φ_n, d_{n+1} = φ_{n+1}, and d_{n+2} = φ_{n+2} when referring to the latter.
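The saving in (9.7) can be illustrated directly: restricting the denominator sums to the top-φ alternatives replaces the 2 × M^2 multiplications by 2 × φ_n × φ_{n+1}. A sketch with made-up numbers (note that with φ_n = φ_{n+1} = 1 the first, highest-likelihood, pair scores exactly 1.0):

```python
def bigram_score(i, j, Qn, Qn1, bigram, phi_n=None, phi_n1=None):
    """Bigram score with the Type III truncated denominator of (9.7): sum
    over only the phi_n (resp. phi_{n+1}) highest-likelihood alternatives
    instead of all M classes.  phi = None gives the full denominator of
    (9.5), as used by the Type I and II algorithms."""
    M = len(Qn)
    top_a = sorted(range(M), key=lambda a: -Qn[a])[:phi_n or M]
    top_b = sorted(range(M), key=lambda b: -Qn1[b])[:phi_n1 or M]
    denom = sum(Qn[a] * Qn1[b] * bigram[a][b] for a in top_a for b in top_b)
    return Qn[i] * Qn1[j] * bigram[i][j] / denom

Qn, Qn1 = [0.9, 0.1], [0.4, 0.6]            # hypothetical likelihoods
bigram  = [[0.5, 0.1], [0.1, 0.3]]          # hypothetical bigram table
exact  = bigram_score(0, 0, Qn, Qn1, bigram)        # full denominator
first  = bigram_score(0, 1, Qn, Qn1, bigram, 1, 1)  # top pair, phi = 1: 1.0
```

With φ_n = φ_{n+1} = 1 the single retained denominator term equals the numerator of the top-ranked pair, so its truncated score is always 1.0; this degenerate case matters for threshold selection, as the experiments of section 10.5 show.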
Obviously, when φ_n = φ_{n+1} = φ_{n+2} = M as in Fig. 9.1 the Type III algorithms reduce to the Type II algorithms.

For large values of depth of search, which may be necessary for some applications, the Search Mode I has a slight drawback. For example, let d_n = d_{n+1} = 15 for a sequential decoding algorithm incorporating bigram contextual constraints. In the Search Mode I the first 15 pattern-class-pair alternatives considered are associated with d_n = 1, 2, ..., 15 and d_{n+1} = 1. It seems intuitively clear that an alternative pair such as that associated with d_n = d_{n+1} = 2 should be considered earlier than one associated with d_n = 15 and d_{n+1} = 1. The algorithms incorporating the Search Mode II are more suitable when larger depth of search values are required since d_n - d_{n+1} ≤ 1. Fig. 9.3 illustrates the sequential decoding algorithm of Type I-A with Search Mode II for ℓ = 1.

Fig. 9.3 Sequential decoding algorithm for ℓ = 1 of Type I-A with Search Mode II.

9.5 Approximating the A Priori Distributions

In several of the algorithms described so far the storage of the contextual constraints increases exponentially with ℓ. These algorithms include the Type A with fixed depth of search, Type I-A and I-B with variable depth of search, and the block decoding algorithms proposed by Duda and Hart [176]. The Type B and I-C algorithms represent one approach to storage reduction. In this section a method is described to reduce the storage of the contextual constraints by means of approximating the higher-order a priori distribution in terms of lower order marginals. Considerable work has been devoted to the problem of approximating probability distributions
to reduce storage requirements [187]-[194]. The algorithms proposed in [187]-[194] can be used to determine suitable optimal approximations. In this thesis the concern is with efficiency rather than optimality, and hence any approximation or pseudo-approximation will do. For example, P(λ^n = α, λ^{n+1} = β, λ^{n+2} = γ, ..., λ^{n+ℓ-1} = δ, λ^{n+ℓ} = ε) might be approximated in (9.3) by

    P(λ^n = α, λ^{n+1} = β) P(λ^{n+1} = β, λ^{n+2} = γ) ··· P(λ^{n+ℓ-1} = δ, λ^{n+ℓ} = ε).

CHAPTER X

EXPERIMENTS AND RESULTS

10.1 Data Base, Preprocessing, Feature Extraction, Formation of English Texts, and System Evaluation

In order to assess the performance of, and compare, the various contextual decoding algorithms proposed in this thesis, the problem of recognizing English handprinted text was simulated on an IBM-360, Model 67, Duplex digital computer. The same 3822 upper case alphabetic characters of Munson's data set used in the feature ordering experiments were used here. Prior to the extraction of features the characters were subjected to the size normalization transformation described in chapter II. As was stated in chapter II, the shearing algorithm (test for slantness) did not appear to be very useful for systems designed to recognize English text. Accordingly, it was not incorporated into the context experiments. The same 25 features used in the feature ordering experiments were used here and they were assumed statistically independent. In other words, in the algorithms of chapter IX it is assumed that

    P(X^n | λ^n = i) = Π_{k=1}^{25} P(x_k^n | λ^n = i).        (10.1)

It was also necessary to estimate the a priori statistics P(λ^n = θ_n), P(λ^n = θ_n, λ^{n+1} = θ_{n+1}) and P(λ^n = θ_n, λ^{n+1} = θ_{n+1}, λ^{n+2} = θ_{n+2}), i.e., a prioris, bigrams and trigrams.
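Since the bigram and a priori tables are marginals of the trigram table (cf. (10.2)-(10.3) below), they can be computed directly from it. A toy sketch with a hypothetical 2-symbol alphabet in place of the 27-symbol one (the reconstructed marginalization is the obvious one; function names are illustrative):

```python
def marginals_from_trigrams(tri):
    """Lower-order statistics from the trigram table: the bigram table is
    the trigram table summed over its last index, and the a priori
    probabilities are the bigram table summed over its second index."""
    M = len(tri)
    bi = [[sum(tri[i][j]) for j in range(M)] for i in range(M)]
    pri = [sum(row) for row in bi]
    return bi, pri

# Toy 2-symbol trigram distribution (hypothetical values summing to 1).
tri = [[[0.10, 0.10], [0.05, 0.05]],
       [[0.20, 0.20], [0.15, 0.15]]]
bi, pri = marginals_from_trigrams(tri)
```

For this toy table the bigram marginal is [[0.2, 0.1], [0.4, 0.3]] and the a prioris are [0.3, 0.7], which sum to one as they must.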
Accordingly, these statistics were estimated from an English text [195] containing approximately 340,000 characters. The trigram statistics of the 27 characters (including the blank) were obtained using a character handling routine developed by Dykema [196]. The bigram and a priori probabilities were then obtained as follows:

    P(λ^n = i, λ^{n+1} = j) = Σ_{k=1}^{27} P(λ^n = i, λ^{n+1} = j, λ^{n+2} = k),   i, j = 1, 2, ..., 27        (10.2)

and

    P(λ^n = i) = Σ_{j=1}^{27} P(λ^n = i, λ^{n+1} = j),   i = 1, 2, ..., 27.        (10.3)

Since Bayes estimates [57] were obtained in each case, all the entries of the bigram and trigram tables were nonzero.

The performance of the various contextual decoding algorithms was estimated on the basis of classification experiments with seven different handprinted English texts [179], [197] coming from a wide variety of sources and representing a total of 7938 characters including blanks. The seven texts were formed using the disjoint sets of alphabets put aside for the purpose of testing. Thus the first text was formed using characters from alphabets 1-21, the second text was formed using alphabets 22-42, and so on. Therefore, a total of 21 samples for each character-class, excluding the blank, were available for forming each text. Unless some character-class had more than 21 occurrences in a text, each character sample was used only once for each piece of text; otherwise the samples were recycled. Punctuation marks other than the period were either removed or replaced by a period depending on their place of occurrence in the text. The seven texts and associated statistics are given in Appendix A.

During the actual process of classifying a piece of text, periods were ignored and blanks were assumed perfectly recognizable. Thus, whenever a blank appeared in the "observation" matrix its identity was always used in making a decision. For example, consider Type A sequential decoding with bigrams in the look-ahead mode.
If ω_n is a blank and ω_{n+1} one of the 26 letters, a correct decision on the blank is assumed and the algorithm proceeds to make a decision on ω_{n+1}. If, on the other hand, ω_n is one of the 26 letters and ω_{n+1} is a blank, then a decision is made on ω_n such that

    max_α {P(X^n | λ^n = α) P(λ^n = α, λ^{n+1} = blank)}        (10.4)

where α ∈ Ω_{d_n}. Maximizing (10.4) implies that effectively it is assumed that for blanks the likelihood vector is given by

    Q_{n+1} = {1.0, 0.0, 0.0, ..., 0.0},        (10.5)

where the first element is associated with the blank. This procedure yielded an estimate of the error probability for each of the seven texts. Unless otherwise stated, whenever reference is made to the system error probability P_e, this refers to the error probability averaged over all seven texts.

10.2 Ordering the Alternatives in the Decision Process

Consider block decoding with bigram contextual constraints. In an ideal situation one is interested in obtaining an ordering of all pairs of alternatives obtained from maximizing

    P(X^n | λ^n = θ_n) P(X^{n+1} | λ^{n+1} = θ_{n+1}) P(λ^n = θ_n, λ^{n+1} = θ_{n+1})

so that one can choose the θ_n and θ_{n+1} at the top of the list. Ordering the alternatives of the individual patterns ω_n and ω_{n+1}, and searching over a selected number of alternatives determined by d_n and d_{n+1}, represents an approximation to the above process. The manner in which the alternatives are ordered and the values of d_n and d_{n+1} determine the accuracy of the approximation.
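The truncated pair search just described can be sketched as follows (toy values; the ranking here uses the plain likelihoods, as was done in [177], [56]):

```python
def block_decode_pair(Qn, Qn1, bigram, d):
    """Block decoding with bigrams over a truncated search: keep the top-d
    classes of each pattern (ranked by likelihood), then maximize the joint
    term over the d*d retained pairs.  d = M recovers the exhaustive search
    over all M*M pairs."""
    rank = lambda Q: sorted(range(len(Q)), key=lambda a: -Q[a])[:d]
    return max(((i, j) for i in rank(Qn) for j in rank(Qn1)),
               key=lambda p: Qn[p[0]] * Qn1[p[1]] * bigram[p[0]][p[1]])

# Hypothetical 3-class example.
Qn, Qn1 = [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]
bigram = [[0.20, 0.10, 0.00],
          [0.05, 0.30, 0.05],
          [0.10, 0.10, 0.10]]
best_full  = block_decode_pair(Qn, Qn1, bigram, 3)  # exhaustive: 9 pairs
best_trunc = block_decode_pair(Qn, Qn1, bigram, 2)  # 4 pairs, same answer here
```

In this toy case the truncated search over 4 pairs already finds the same maximizing pair as the exhaustive search over 9, which is the behaviour the experiments below quantify for English text.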
The larger d_n and d_{n+1} are, the more accurate the approximation is and the lesser is its sensitivity to the manner in which the alternatives are ordered. Conversely, the smaller d_n and d_{n+1} are, the more crucial is the ordering. In [177], [56] the ordering was made using the likelihoods P(X^n | λ^n = θ_n). Therefore, when d_n = d_{n+1} = M an optimal joint decision is made on ω_n and ω_{n+1}. However, when d_n = d_{n+1} = 1 suboptimal individual decisions are made on ω_n and ω_{n+1}. To correct this deficiency the author suggested in [56] that the alternatives be ordered with the weighted likelihoods P(X^n | λ^n = θ_n) P(λ^n = θ_n).

Experiment 10.1: This set of experiments was performed in order to determine the effect of ordering the alternatives using the weighted likelihoods and to compare the results with those obtained using only the likelihoods. Block decoding with bigrams was chosen for the experiments.

The results of Experiment 10.1 are illustrated in Figures 10.1, 10.2, and 10.3. In Figures 10.1 and 10.2 the number of the text used is inscribed on each curve. Before discussing the results it is convenient to define two parameters. Let d^Q be defined as the quiescent depth of search, i.e., the depth of search value at which the error probability remains essentially constant if the depth of search value is increased any further. In more formal terms,

    P_e(d) = P_e(d^Q)  for  d ≥ d^Q
    and                                        (10.6)
    P_e(d) ≠ P_e(d^Q)  for  d < d^Q,

where P_e(d) is the error probability for depth of search value d, and where d is used when d_n = d_{n+1} = ... = d_{n+ℓ} = d. Let d° be defined as the optimum depth of search in the sense that P_e(d) is minimized. In more formal terms,

    P_e(d°) ≤ P_e(d)  for  d ≠ d°.        (10.7)
Furthermore, if P_e(d) is a monotonic function of d then d° = d^Q.

Fig. 10.1 Probability of error as a function of depth of search for each of seven texts for block decoding with bigrams in which the alternatives are ordered with the likelihoods.

Fig. 10.2 Probability of error as a function of depth of search for each of seven texts for block decoding with bigrams in which the alternatives are ordered with the weighted likelihoods.

Fig. 10.3 Comparison of results obtained from likelihood ordering with those obtained from weighted likelihood ordering (P_e averaged over 7 texts).

The results of Fig. 10.3 show that d^Q is equal to 4 and 7, respectively, for the weighted and non-weighted likelihood orderings. Hence for small values of d it is preferable to order the alternatives with the weighted likelihoods since a lower P_e would be obtained for the same d. It is interesting to note that P_e(d=4) = P_e(d=26) for the weighted likelihood ordering. In other words, although in theory d must be equal to 26 in order to obtain an optimal joint decision on ω_n and ω_{n+1}, the results show that in a practical application an optimal joint decision can be made with d as small as 4 if the 4 alternatives having the highest weighted likelihoods are selected for each pattern. In the case of bigrams this represents searching 16 alternative pairs as opposed to 676, an obvious increase in the efficiency of the decision process. In the remaining experiments in this thesis the orderings are made with the weighted likelihoods.
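The two depth parameters defined above can be read off a P_e(d) curve mechanically; a sketch with made-up error values (not thesis data):

```python
def depth_parameters(pe):
    """Given pe[d-1] = P_e at depth d for d = 1..len(pe), return (d_Q, d_o):
    d_Q is the smallest d at which P_e has already reached its final value
    (the quiescent depth), d_o is the d minimizing P_e (the optimum depth)."""
    d_q = len(pe)
    while d_q > 1 and pe[d_q - 2] == pe[-1]:
        d_q -= 1
    d_o = min(range(len(pe)), key=lambda i: pe[i]) + 1
    return d_q, d_o

flat = depth_parameters([20.0, 18.0, 17.0, 17.0, 17.0])  # monotonic curve
dip  = depth_parameters([20.0, 16.0, 17.0, 17.0])        # minimum before quiescence
```

The first curve is monotonic, so d° = d^Q = 3; the second dips before settling, giving d^Q = 3 but d° = 2, the same phenomenon as the d° < d^Q behaviour noted below for individual texts.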
¹ It should be noted that in practical problems it is not necessary that d° = d^Q. The results for texts 1 and 5 in Figures 10.1 and 10.2 show that for text 1, d° = 4 and d^Q = 5, while for text 5, d° = 3 and d^Q = 4. However, this behaviour may merely be the result of having a small sample, since it disappears when the results are averaged over the seven texts. A similar behaviour was observed in [177], [56], where the sample size was small and where d° = 4 as above.

10.3 Usefulness of Context as a Function of Initial System Performance

In designing a pattern recognition system, efforts are first made at obtaining low error probability by improving the classifier and the feature extractor. When patterns are very noisy one reaches a stage at which improvements on the feature extractor do not decrease P_e further. Alternatively, the law of diminishing returns comes into play, so that for a fixed decrease in P_e it is cheaper to apply simple contextual constraints than to further improve the feature extractor. Since the error probability at such a point varies, it is important to determine the effects of using contextual constraints as a function of initial system performance without context. Furthermore, in these experiments the error probability without using context is 19%. For character recognition problems the consensus generally found in the literature is that a system's error probability should, if possible, be less than 10% before relying on context. Studying the behaviour of the algorithms as a function of initial system performance will thus help one to extrapolate the results to such a level. In order to vary the initial system performance, experiments were carried out with different numbers of features.
Experiment 10.2: This set of experiments was performed in order to study how parameters such as d°, d^Q, and P_e with context and specified d vary as a function of P_e without context. Sequential decoding with bigrams in the look-ahead mode was chosen. First, P_e without context was plotted as a function of the number of features used. The number of features was increased in a natural order on the 20 × 20 grid; in other words, from left to right and from top row to bottom row. The results are shown in Fig. 10.4. From the curve in Fig. 10.4 it was chosen to do experiments with 1, 5, 8, 10, 12, 15, 20, and 25 features to determine P_e as a function of the depth of search. Some of these results are illustrated in Fig. 10.5. It should be noted that when the number of features is small the behaviour of P_e is quite erratic. As the number of features increases, a decreasing asymptotic behaviour becomes progressively more pronounced. Two parameters useful for describing the behaviour of these curves are the optimum and quiescent depth of search values, d° and d^Q. These two parameters are plotted as a function of error probability without using context in Fig. 10.6.

Fig. 10.4 Probability of error P_e as a function of the number of features used when no contextual information is used.

Fig. 10.5 Probability of error P_e as a function of depth of search for sequential decoding with bigrams in the look-ahead mode with 1, 8, 10, and 25 features.
The results show that d^Q decreases fairly linearly as a function of P_e without context until d^Q = d° = 5 for P_e = 19%.

Another parameter of interest when evaluating contextual decoding algorithms is the percent decrease in P_e due to context as a function of P_e without context. Fig. 10.7 illustrates these results for the two extreme values of d, namely 2 and 26. Several comments concerning these results are in order. First, the results show that when the initial system error probability is greater than about 50% one cannot expect any improvements. In fact, when d is very small one can even expect deterioration in performance. Second, when d = 26 the percent decrease decreases monotonically as a function of P_e without context. However, for small values of d this behaviour can be altered and a maximum decrease can be obtained.

10.4 Evaluation and Comparison of Contextual Decoding Algorithms with Fixed Depth of Search

Experiment 10.3: This set of experiments was performed in order to evaluate and compare block and sequential decoding with respect to error probability and computational complexity as a function of depth of search. Fig. 10.8 illustrates the results for bigram contextual constraints. The solid line refers to sequential decoding, where both d_n and d_{n+1} are varied independently, and the broken line refers to block decoding, where d_n = d_{n+1}. In sequential decoding P_e decreases at a diminishing rate when d_n is increased and d_{n+1} is held constant, and reaches the lower limit when d_n = 4. For d_n = 1, increasing d_{n+1} obviously does not change P_e. However, for d_n > 1, an increase in d_{n+1} effects a diminishing decrease in P_e until it reaches a constant at d_{n+1} = 5.
Fig. 10.6 Quiescent depth of search d^Q and optimum depth of search d° as a function of P_e without context for sequential decoding with bigrams in the look-ahead mode.

Fig. 10.7 Percent decrease in P_e due to context as a function of P_e without context for sequential decoding with bigrams in the look-ahead mode of decision with d_n = d_{n+1} = 2 and d_n = d_{n+1} = 26.

In order to compare block and sequential decoding meaningfully, a measure of computational complexity ξ was defined as the number of multiplications made within the domain searched by the algorithm per decision per pattern. For example, if bigrams are used with ℓ = 1 and d_n = d_{n+1} = 4, the term

    P(X^n | λ^n = α) P(X^{n+1} | λ^{n+1} = β) P(λ^n = α, λ^{n+1} = β)

is calculated 16 times per pattern in sequential decoding and 16 times per two patterns in block decoding. Since the term involves two multiplications, ξ = 32 for sequential decoding and ξ = 16 for block decoding. In Fig. 10.8 the points marked A^S, A^B, B^S, B^B, and C^S, C^B indicate points at which block decoding and sequential decoding have an equal ξ, where the superscripts S and B stand for sequential and block decoding, respectively. Observing points B and C in the region of interest (low error probability and low computational complexity), the results suggest that since the sequential decoding algorithms yield lower P_e they are more efficient than the block decoding algorithms, although the difference is small. The remaining experiments in this thesis were performed with sequential decoding algorithms.
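The complexity measure ξ for the bigram example above can be sketched as:

```python
def xi_sequential(d_n, d_n1, mults_per_term=2):
    """xi for sequential decoding with bigrams (l = 1): the joint term,
    which costs two multiplications, is evaluated d_n * d_{n+1} times for
    EACH pattern decided."""
    return mults_per_term * d_n * d_n1

def xi_block(d_n, d_n1, mults_per_term=2):
    """xi for block decoding: the same d_n * d_{n+1} evaluations produce a
    joint decision on TWO patterns, halving the per-pattern count."""
    return mults_per_term * d_n * d_n1 // 2

seq, blk = xi_sequential(4, 4), xi_block(4, 4)   # 32 and 16, as in the text
```

The factor-of-two gap explains why equal-ξ comparison points (the A, B, C pairs of Fig. 10.8) pair a smaller sequential depth with a larger block depth.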
Experiment 10.4: This set of experiments was performed in order to assess and compare the performance of various sequential decoding algorithms in the look-ahead-look-back mode of decision which use bigrams in such a way as to approximate trigrams. In other words, these algorithms make a decision on ω_n using information from ω_n as well as ω_{n-1} and ω_{n+1}. Three different approximations to trigrams were tried:

    Type I:    P(λ^{n-1} = α, λ^n = β) P(λ^n = β, λ^{n+1} = γ) P(λ^{n-1} = α, λ^{n+1} = γ)

    Type II:   P(λ^{n-1} = α, λ^n = β) P(λ^n = β, λ^{n+1} = γ) / P(λ^n = β)

    Type III:  P(λ^{n-1} = α, λ^n = β) P(λ^n = β, λ^{n+1} = γ)

where α ∈ Ω_{d_{n-1}}, β ∈ Ω_{d_n}, and γ ∈ Ω_{d_{n+1}}.

Fig. 10.8 Comparison of block decoding and sequential decoding with respect to error probability and computational complexity as a function of depth of search with bigram contextual constraints.

Figure 10.9 illustrates the results of sequential decoding in the look-ahead-look-back mode of decision with the Type III approximation, where d_{n-1} = d_{n+1} and d_n are varied independently. The results are similar to those of Fig. 10.8 except for the fact that at each point in Fig. 10.9 the error probability is lower. The trade-offs between the two algorithms are clearly evident. The broken-line arrows in Fig. 10.9 indicate the path of increasing computational complexity along points of interest. From the point of view of efficiency not much can be gained by having d_n > 4 and d_{n-1} = d_{n+1} greater than 2 or 3. The results of the three types of algorithms of Experiment 10.4 are presented for comparison in Fig. 10.11, where P_e is plotted as a function of d_{n-1} = d_n = d_{n+1}.
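Taking the three product approximations at face value (the Type II and Type III forms are reconstructions from a partly illegible original, consistent with the surrounding discussion), a quick numerical check shows that the Type II form is exact for a genuinely first-order Markov source:

```python
def trigram_approx(bigram, prior, a, b, c, kind):
    """Bigram-product approximations to the trigram P(a, b, c):
    Type 1: P(a,b) P(b,c) P(a,c);
    Type 2: P(a,b) P(b,c) / P(b)  (the first-order Markov form);
    Type 3: P(a,b) P(b,c)."""
    if kind == 1:
        return bigram[a][b] * bigram[b][c] * bigram[a][c]
    if kind == 2:
        return bigram[a][b] * bigram[b][c] / prior[b]
    return bigram[a][b] * bigram[b][c]

# A 2-symbol first-order Markov source (hypothetical numbers).
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.5, 0.5]]                      # P(next | current)
bigram = [[prior[i] * trans[i][j] for j in range(2)] for i in range(2)]
true_tri = prior[0] * trans[0][1] * trans[1][0]       # P(0, 1, 0)
markov   = trigram_approx(bigram, prior, 0, 1, 0, 2)  # Type 2 recovers it exactly
```

For English text, which is not first-order Markov, none of the three forms is exact, which is why their relative merit has to be settled experimentally.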
The results show that the Type III approximation, which is the simplest of the three, yields the lowest error probability. It should be noted that the Type II approximation is the result of assuming first-order Markov dependence. However, the English language is obviously not a first-order Markov source. Hence the Type II approximation is not necessarily the optimum approximation. Ku and Kullback [193] present an algorithm for iteratively determining the optimum approximation, in the minimum discrimination information sense, as a function of the prescribed marginals. The Type II approximation is only the first term of Ku and Kullback's iterative approximation to P(λ^{n-1} = α, λ^n = β, λ^{n+1} = γ).

Fig. 10.9 Probability of error as a function of depth of search for sequential decoding with a bigram product approximation to trigrams in the look-ahead-look-back mode of decision.

The results suggest that the Type III approximation is a better approximation to trigrams of English text from both the computation and the resulting error probability points of view.

Experiment 10.5: This set of experiments was performed in order to assess and compare the performance of sequential decoding with trigrams in the look-ahead-look-back mode of decision. The results are given in Figs. 10.10 and 10.11. In Fig. 10.10 P_e is plotted as a function of depth of search, where d_{n-1} = d_{n+1} and d_n are varied independently.
The results are similar to those of Fig. 10.9 except that the error probabilities are all slightly lower and the path of increasing computational complexity is a little different. Comparing Fig. 10.9 with Fig. 10.10 brings out the trade-offs between bigrams and trigrams with respect to P_e and d_{n-1}, d_n, and d_{n+1}. For example, using bigrams with d_{n-1} = d_{n+1} = 2 and d_n = 3 yields a lower error probability than using trigrams with d_{n-1} = d_n = d_{n+1} = 2. The results of Fig. 10.11 show that the Type III approximation gives results very close to the trigram results.

In order to compare the three algorithms of most interest from the point of view of efficiency, in Fig. 10.12 P_e is plotted as a function of ξ. It should be stated that the lines joining the points in Fig. 10.12 are included for the sake of joining the points belonging to the same algorithm, but other than that they are meaningless. The results clearly show that sequential decoding in the LA-LB mode with the Type III approximation is more efficient than sequential decoding in the LA mode and requires the same amount of storage (bigrams). While the former is slightly less efficient than SD-LA-LB with trigrams with respect to ξ, this deficiency is offset by requiring 27 times less storage. Some of these results are also

Fig. 10.10 Probability of error P_e as a function of depth of search for sequential decoding with trigrams in the look-ahead-look-back mode of decision.
Fig. 10.11 Comparison of contextual decoding algorithms with fixed depth of search with regard to error probability as a function of depth of search. Unless stated, all algorithms incorporate bigram contextual constraints.

Fig. 10.12 Probability of error P_e as a function of computational complexity ξ for three cases of interest.

presented in [198].

10.5 Evaluation and Comparison of Sequential Decoding Algorithms Incorporating Threshold

Experiment 10.6: While it is fairly obvious that, for sequential decoding incorporating threshold of Type I with large default depth of search values d^D, search mode II is more efficient than search mode I, it is of interest to compare the two search modes for small values of d^D. Accordingly, d_n^D = d_{n+1}^D was set equal to 5 in the experiments performed. These and the subsequent experiments were performed with bigram contextual constraints in the look-ahead mode with ℓ = 1. The results of Experiment 10.6 are shown in Fig. 10.13, where P_e and ANBS (the average number of bigrams searched before a decision is made) are plotted as a function of the threshold t. The results show that P_e behaves about the same for both search modes, decreasing as t increases until t ≈ 0.5, after which P_e remains constant. The behaviour of ANBS is about the same for both search modes except for t < 0.2. In the region bounded by 0.0 < t < 0.2 search mode II requires less computation (search) than search mode I.
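A sketch of the threshold mechanism underlying these curves: alternatives are examined in ranked order until a normalized score exceeds t, and ANBS is simply the searched count averaged over decisions (all values below are hypothetical):

```python
def threshold_search(scores, t, default_depth):
    """Variable-depth search with threshold: examine normalized bigram
    scores in ranked order and stop as soon as one exceeds t; if the
    default depth is exhausted first, fall back to the top-ranked
    (no-context) decision, as in the Type II algorithms.
    Returns (decision, number of bigram alternatives searched)."""
    for n, (label, s) in enumerate(scores[:default_depth], start=1):
        if s > t:
            return label, n
    return scores[0][0], min(default_depth, len(scores))

# Ranked (class, normalized score) alternatives for one decision (made up).
scores = [('A', 0.30), ('B', 0.60), ('C', 0.05)]
easy = threshold_search(scores, 0.5, 3)   # threshold exceeded at the 2nd bigram
hard = threshold_search(scores, 0.7, 3)   # search exhausted: no-context decision
```

Raising t makes more decisions fall into the "hard" branch, which is exactly the exponential growth of ANBS with t seen in Fig. 10.13.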
Hence, even f o r small values of default depth of search such as d^=5 search mode II i s more e f f i c i e n t i f one i s able to t o l e r a t e a smaller decrease i n P g by using context. For example, f o r t = 0.05 both search modes y i e l d P g ~ 18.0% but search mode II requires only 50% of the search required by search mode I. In the remaining experiments i n t h i s t h e s i s , search mode I i s used. The r e s u l t s of F i g . 10.13 also c l e a r l y show the diminishing returns of using added, contextual information. As t increases more contextual information i s made use of i n the decision process. As would be expected, P g decreases, but at a diminishing rate while the cost (computation or ANBS) increases exponentially. OT g pj cn P y P-M H i-b o hi rn CD CD c+ H-PJ H H O H CO •T) & H H-<<\ o CD O O |Zi OP (D O i-3 <<; CD H c+ I'M I cn H-c+ 13-CD H O o I V J pj" CD P P-B O PJ n> o Hj PJ CD o H* 0) H' o ts c+ |3J cn CD p PJ CD hi P en CD l=S I, o CD O a' (Kl '-S I cn CD P h! o & CD P tx) C O p OT Hj O d-H* o O H i C+ CD OT P J o H P c+ PROBABILITY OF ERROR PQ IN PERCENT AVERAGE NUMBER OF M Q j l O -* b b b b ^ BIGRAMS SEARCHED b (ANBS) 2,91 168 Experiment 10.7. These experiments were performed i n order to study.the behaviour of the sequential decoding algorithms, incorporating threshold, of Type II with respect to P g and ANBS as a function of the threshold t for various values of default depth of search. The r e s u l t s are shown i n F i g . 10 . 14 . As mentioned i n section 9 . 4 concerning the Type II algorithms, when the search i s completed without the threshold being exceeded a decision i s made without using context. Hence one would expect the same er r o r p r o b a b i l i t y f or t=0.0 as for t= 1 . 0 . The r e s u l t s of F i g . 1 0 . l 4 v e r i f y these expectations and. show that i s a minimum f o r t= 0 . 4 . At t h i s value of t P =l6.2%. For the Type I algorithms the minimum P i s 16.0% but t h i s occurs for t> 0 . 
At t = 0.4 the Type I algorithms give P_e = 16.2%. Hence, by choosing the right threshold value the Type II algorithms can be more efficient than the Type I algorithms. The results also clearly show the advantages of using a threshold if they are compared with the results from sequential decoding with fixed depth of search. For example, the least amount of context that can be used with the algorithms having fixed depth of search occurs for d_n = 2 and d_{n+1} = 1, i.e., ANBS = 2. In Fig. 10.8, for d_n = 2 and d_{n+1} = 1, P_e = 17.15%, whereas in Fig. 10.14, for ANBS = 2, d_n^D = 4, d_{n+1}^D = 2, and t = 0.52, P_e = 16.3%. Conversely, in Fig. 10.14 a P_e of 17.15% can be realized with d_n^D = 2, d_{n+1}^D = 1, and t = 0.3, which implies ANBS = 1.2. This example indicates that the Type II algorithm can realize a P_e of 17.15% with 40% less search than that required by the algorithm with fixed depth of search. The results of Fig. 10.14 also show the diminishing returns of using added contextual information by increasing the default depth of search values. Let t = 0.5. When d_n^D = d_{n+1}^D = 1, no context is used and P_e = 18.9%, but when d_n^D = 2 and d_{n+1}^D = 1, P_e = 17.0%, resulting in a marked decrease in P_e.

Fig. 10.14 Probability of error P_e in percent and average number of bigrams searched (ANBS) as a function of the threshold t for sequential decoding with bigrams in the look-ahead mode incorporating threshold of Type II.

On the other hand, when d_n^D and d_{n+1}^D are increased, respectively, from 4 and 2 to 5 and 5, there is no further decrease in P_e.

Figure 10.15 illustrates the average depth of search values, on the pattern being decided ω_n and the look-ahead neighbour ω_{n+1}, at the time of decision for the Type II algorithms as a function of the threshold t for d_n^D = 4 and d_{n+1}^D = 2. Let d*_n be defined as the depth of search value on ω_n at the instant the decision is made. The results of Fig. 10.15 verify that lim_{t→1.0} d*_n = d_n^D and lim_{t→1.0} d*_{n+1} = d_{n+1}^D,
and lim_{t→0.0} d*_n = lim_{t→0.0} d*_{n+1} = 1.0.

Experiment 10.8. This set of experiments was performed in order to study the behaviour of the sequential decoding algorithms, incorporating thresholds, of Type III with respect to P_e and ANBS as a function of the threshold t for various values of φ_n and φ_{n+1}. Figure 10.16 illustrates these results for the case d_n^D = 4 and d_{n+1}^D = 2. These default depth of search values were chosen because, as shown in Fig. 10.14, P_e does not decrease any further when d_n^D > 4 and/or d_{n+1}^D > 2.

Consider B_ij as given by (9.7). For φ_n = φ_{n+1} = 1, B_ij = 1.0 for the first bigram alternative considered in the decision process. Hence, for any t, B_ij > t and the decision process terminates at the start without contextual information being utilized. For φ_n > 1 and/or φ_{n+1} > 1, B_ij = η < 1.0, and hence for t > η the decision process will not terminate prematurely. However, for smaller values of φ_n and φ_{n+1}, η will be larger and a larger value of t will be necessary to make optimum use of contextual information under the circumstances.

Fig. 10.15 Average depth of search values at time of decision as a function of the threshold t for sequential decoding with bigrams in the look-ahead mode of decision of Type II with d_n^D = 4 and d_{n+1}^D = 2.

These expectations are verified by the results in Fig. 10.16, where for φ_n = 3, φ_{n+1} = 2 the minimum P_e occurs at t = 0.45, for φ_n = 3, φ_{n+1} = 1 the minimum P_e occurs at t = 0.5, and finally for φ_n = 2, φ_{n+1} = 1 the minimum P_e occurs at t = 0.65.
Furthermore, even if t = 1.0, for small values of φ_n and φ_{n+1}, for some alternative-pairs under consideration B_ij will be greater than t, and hence the decision process may not terminate without the threshold being exceeded. Therefore, unlike the results of Fig. 10.14, for t = 1.0, P_e < 18.9%. Since the approximate B_ij approaches the exact B_ij as φ_n = φ_{n+1} → 27, it follows that for t = 1.0, P_e → 18.9%, i.e., the results of Fig. 10.16 reduce to those of Fig. 10.14 as φ_n = φ_{n+1} → 27. It should be realized that the computation of the denominator of B_ij involves 27² = 729 additions and 2 × 729 multiplications, and this denominator must be calculated before the decision process begins. However, with φ_n = 3 and φ_{n+1} = 2 only 6 summations and 12 products are required; this means about 150 times less computation. Clearly, if the proper parameter values are chosen, the Type III sequential decoding algorithms can be much more efficient than the Type II algorithms. For example, in Fig. 10.14, where φ_n = φ_{n+1} = 27, a P_e of 16.4% can be realized with d_n^D = 3 and d_{n+1}^D = 1 at t = 0.4, resulting in ANBS = 1.3. On the other hand, in Fig. 10.16, for φ_n = 3, φ_{n+1} = 2, the same P_e of 16.4% can be realized with t = 0.4, resulting in approximately the same ANBS = 1.4. The results of Fig. 10.16 also clearly indicate the diminishing returns of using additional contextual information by increasing φ_n and φ_{n+1}.

Small values of φ_n and φ_{n+1} also have an effect on the average depth of search values at the time of decision, i.e., d*_n and d*_{n+1}, as is observed by comparing the results of Fig. 10.17 with those of Fig. 10.15. Although d*_n and d*_{n+1} increase as t increases, the increase in Fig. 10.15 is much more rapid.
For example, at t = 0.825, d*_n = 2.0 in the Type II algorithm but d*_n = 1.35 in the Type III algorithm.

Fig. 10.16 Probability of error P_e and average number of bigrams searched (ANBS) as a function of threshold t for sequential decoding with bigrams in the look-ahead mode of decision of Type III with d_n^D = 4, d_{n+1}^D = 2.

Fig. 10.17 Average depth of search values at time of decision as a function of threshold t for sequential decoding with bigrams in the look-ahead mode of decision of Type III with d_n^D = 4, d_{n+1}^D = 2, φ_n = 2 and φ_{n+1} = 1.

Also, in Fig. 10.17, d*_{n+1} = 0 for 0 < t < 0.5, whereas in Fig. 10.15 this is not the case. In addition, in the Type III algorithm the limit of d*_n and d*_{n+1}, respectively, as t → 1.0 does not approach d_n^D and d_{n+1}^D as it does in the Type II algorithm. This means that in the Type III algorithms a decision is reached earlier in the decision process, by an amount inversely proportional to the values of φ_n and φ_{n+1}.

Experiment 10.9. This set of experiments was performed in order to study the behaviour of P_e and ANBS as a function of φ_n and φ_{n+1} for the sequential decoding algorithms with bigrams in the look-ahead mode of decision incorporating threshold of Type III. Figures 10.18 and 10.19 illustrate the type of results obtained for the case of d_n^D = 4, d_{n+1}^D = 2, and t = 0.4. These parameter values were chosen because they are of interest, since for φ_n = φ_{n+1} = 27 they lead to near-optimum results with respect to efficiency. The results of Fig.
10.18 show that for a fixed value of φ_n, P_e decreases slowly and fairly linearly as φ_{n+1} is increased. On the other hand, for a fixed value of φ_{n+1}, P_e decreases substantially at first and quickly reaches an asymptotic value as φ_n is increased. Hence, these results indicate the existence of a trade-off between φ_n and φ_{n+1}, and a relationship between φ_n and φ_{n+1} that will yield more efficient decisions. As an example, consider the black points marked A and B in Figures 10.18 and 10.19. Point A represents φ_n = 4 and φ_{n+1} = 1, whereas point B represents φ_n = 2 and φ_{n+1} = 4. Hence, at point A calculation of the denominator of B_ij involves one half the computation required at point B, and yet a lower error probability is obtained at point A. Clearly, point A represents a more efficient point of operation than point B.

Fig. 10.18, 10.19 Probability of error P_e in percent and average number of bigrams searched (ANBS) as functions of φ_n and φ_{n+1} for sequential decoding with bigrams in the look-ahead mode incorporating threshold of Type III with d_n^D = 4, d_{n+1}^D = 2, and t = 0.4.

Furthermore, during the decision process ANBS = 1.14 at point A and 1.49 at point B. Hence, much less search is involved at point A than at point B, even though a lower P_e results. Clearly, the results show that the parameters φ_n and φ_{n+1} not only determine the average number of bigrams searched (the amount of contextual information used) but also which bigrams (alternative pairs) are to be searched, or how the contextual information is to be used. Furthermore, the results show that in order to search the proper alternative pairs, and thus use context efficiently, φ_n and φ_{n+1} should be approximately proportional to d_n^D and d_{n+1}^D, i.e.,
if d_n^D < d_{n+1}^D then φ_n should be less than φ_{n+1}.

CHAPTER XI

CONCLUDING REMARKS

11.1 Efficiency of the Sequential Decoding Algorithms

In a realistic comparison of any of the contextual decoding algorithms proposed in this thesis the error probabilities must be compared in relation to cost. The algorithm of least cost that yields the desired error probability is considered the most efficient. In these algorithms two factors contribute to cost: storage and computation. There are two types of information that require storage: the n-gram statistics and the likelihood vectors within the "observation" window or buffer. In general the buffer storage is much less than the storage required by the n-gram statistics and can be neglected. The n-gram statistics require M^n storage elements, where M is the number of pattern classes, whereas the buffer requires (ℓ+1)M elements. If n is small, such as n = 2 (bigrams), and ℓ = M-1, then the buffer storage required is as great as that required to store bigrams. In this section only bigrams are used and ℓ is always equal to one. Therefore, the buffer storage and that of the n-gram statistics are held constant and do not enter into the discussion of efficiency.

The amount of computation can also be broken down into two quantities: the computation required to calculate the denominator of B_ij, in the algorithms incorporating the threshold, and the computation required by the searching process through alternative pairs. The computation required by the searching process can be characterized by two parameters: the average number of bigrams searched, ANBS, and the computation associated with keeping track of the maximum value encountered during the decision process, CKTM.
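The storage and operation counts above can be checked in a few lines. M = 27 is an illustrative value (consistent with the 27 × 27 = 729 term count quoted in section 10.5), and the per-term accounting (two multiplications and one addition per joint term) is an assumption matching that section:

```python
def ngram_storage(M, n):
    # n-gram statistics: one element per ordered n-tuple of classes
    return M ** n

def buffer_storage(M, window):
    # likelihood vectors for the pattern plus `window` neighbouring patterns
    return (window + 1) * M

def denominator_ops(phi_n, phi_n1):
    # cost of the normalizing denominator of B_ij: a sum over the first
    # phi_n x phi_n1 alternative pairs, with two multiplications and one
    # addition counted per joint term
    terms = phi_n * phi_n1
    return 2 * terms + terms

M = 27
print(ngram_storage(M, 2))        # bigram statistics: 729 elements
print(buffer_storage(M, 1))       # l = 1 buffer: 54 elements, negligible
print(buffer_storage(M, M - 1))   # l = M-1: 729, as great as bigram storage
print(denominator_ops(M, M))      # full normalization: 2187 operations
print(denominator_ops(3, 2))      # phi_n = 3, phi_n+1 = 2: 18 operations
```

Under this accounting the full normalization costs roughly a hundred times more than the φ_n = 3, φ_{n+1} = 2 case, of the same order as the "about 150 times" quoted in section 10.5.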
Let C(φ) denote the computation required to calculate the denominator of B_ij. In order to compare the sequential decoding algorithms from the point of view of efficiency, P_e was plotted as a function of ANBS for various algorithms of interest in Fig. 10.20. The black circles denote points for sequential decoding with fixed depth of search that represent the minimum P_e obtained when ANBS is held constant and d_n and d_{n+1} are varied. The straight lines joining these points serve to unite the results of a common algorithm, and other than that are meaningless, since ANBS can only take on certain integer values when d_n and d_{n+1} are fixed.

The results of Fig. 10.20 show that the algorithms of Type I incorporating the threshold require far less search than the algorithm with fixed d_n and d_{n+1}, particularly in the interval determined by 16.0 < P_e < 17.0. For example, for P_e = 16.1, ANBS = 12 for the algorithm with fixed depth of search, whereas ANBS = 2.6 for the Type I algorithms incorporating the threshold. However, these facts are not sufficient for concluding that the Type I algorithms are more efficient, since they require the additional calculation of the denominator of B_ij. In other words, a trade-off exists between the calculation of B_ij and the decrease in the amount of search. It is difficult to quantitatively compare the algorithmic complexity involved in search with the arithmetic complexity of the denominator of B_ij. However, it seems clear that for a large savings in search, the Type I algorithms are at least competitive. Comparing the two versions of Type I algorithms, i.e.
, those involving search modes I and II, the results show that for ANBS < 1.5 the algorithms incorporating search mode II are more efficient, even with low values of default depth of search such as d_n^D = d_{n+1}^D = 5. The Type II algorithms are very similar to the Type I algorithms with regard to computational requirements, except that CKTM vanishes, and hence they are more competitive with the algorithms with fixed depth of search.

Fig. 10.20 Comparison of the various sequential decoding algorithms with fixed depth of search and incorporating threshold, with respect to efficiency.

Only in the Type III algorithms, where φ takes on low values such as φ_n = 3 and φ_{n+1} = 2, does it become clear that C(φ) is insignificant compared to the savings obtained by eliminating CKTM and reducing ANBS by a factor of two or three. Hence, for bigrams with ℓ = 1, the sequential decoding algorithms of Type III with low values of d^D such as d_n^D = 4, d_{n+1}^D = 2, and low values of φ such as φ_n = 3 and φ_{n+1} = 2, are the most efficient algorithms considered in this thesis. As an example, consider in Fig. 10.20 the algorithm with fixed depth of search. For ANBS = 4, P_e = 16.45%. For P_e = 16.45% the Type III algorithm has ANBS = 1.35. This represents a savings in search by a factor of 3.
Furthermore, the storage and other computational requirements of the two algorithms are approximately the same, assuming that C(φ) for φ_n = 3 and φ_{n+1} = 2 balances out CKTM.

11.2 Some General Comments

In chapter IX various algorithms for using contextual constraints in pattern recognition problems were proposed. In chapter X a preliminary experimental investigation was made to determine which class of algorithms looks most promising and deserves more attention in the future. The effect of ordering the alternatives in the decision process was considered in section 10.2, and it was concluded that an ordering of alternatives with the joint probabilities P(X_n|λ_n=θ_n)P(λ_n=θ_n) was considerably more effective than using P(X_n|λ_n=θ_n). In fact, with the former the optimal depth of search was 4 (16 bigrams searched in block decoding), whereas with the latter d_n = 7 (49 bigrams searched); a savings in search by a factor of 3 while obtaining the same error rate. In section 10.3 the usefulness of using contextual information was studied as a function of initial system performance. The results suggest that context becomes more useful as the initial system performance improves. Furthermore, the results show that the algorithms become more efficient as the initial system performance improves. In other words, the lower P_e is without using context, the smaller the depth of search values that can be tolerated, while at the same time decreasing P_e by a greater percentage. Obviously, it would be of interest to know whether the optimum depth of search values encountered in this thesis are unique to the data base and feature extractor used or whether they imply some more general behaviour.
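The two orderings compared in section 10.2 can be sketched as follows; the class labels, likelihoods, and priors below are illustrative stand-ins, not the thesis data:

```python
# Hypothetical classifier output for one pattern: for each class theta,
# a likelihood P(X | lambda = theta) and a prior P(lambda = theta).
likelihood = {"A": 0.20, "B": 0.35, "C": 0.15, "D": 0.30}
prior      = {"A": 0.60, "B": 0.05, "C": 0.30, "D": 0.05}

# Ordering by likelihood alone...
order_lik = sorted(likelihood, key=likelihood.get, reverse=True)
# ...versus ordering by the joint probability P(X|theta) P(theta).
order_joint = sorted(likelihood,
                     key=lambda c: likelihood[c] * prior[c], reverse=True)

print(order_lik)    # the priors play no role here
print(order_joint)  # the priors pull A and C forward

d = 2  # depth of search: only the top-d alternatives enter the decoder
print(order_joint[:d])
```

With a small depth of search only the top-d alternatives reach the contextual decoder, so an ordering that pulls the correct class forward permits a smaller d, and hence fewer bigrams searched, for the same error rate.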
At the moment the results of this thesis can only be compared with those in [177], [56]. Comparison of the block decoding algorithms with bigrams shows that, as in this thesis, in [177] and [56] the optimum depth of search values are in the neighbourhood of 4 and the initial system error probabilities are in the neighbourhood of 20%. Furthermore, in [177] and [56] a different data base was used and a different feature extraction scheme (contour tracing) was implemented. Hence, these facts add confidence to the hypothesis that these results are fairly independent of the feature extraction scheme.

In section 10.4 the contextual decoding algorithms with fixed depth of search were experimentally evaluated and compared. Although it was shown that the sequential decoding algorithms were more efficient than the block decoding algorithms, the difference is small, and before any final conclusions are made as regards these algorithms further experiments should be carried out with ℓ > 1 and with contextual constraints of higher order than bigrams. The results of the experiments concerning the various bigram approximations to trigrams suggest that there are better ways of using bigrams than by assuming first-order Markov dependence among the pattern classes. Furthermore, using a suitable approximation to trigram statistics yields an error rate comparable to that obtained with trigrams and yet requires far less storage capacity. The most important conclusion that can be drawn from the results obtained in section 10.5 is that the sequential decoding algorithm of Type III is the most efficient of the algorithms considered, and involves far less computation and search than the algorithms with fixed depth of search when suboptimality is adhered to.
One of the recurring themes in the results and discussions in chapter X is concerned with suboptimality and diminishing returns. The most efficient algorithms are suboptimal at almost all possible levels. For example, consider sequential decoding with bigrams of Type III with ℓ = 1 in the look-ahead mode of decision. The first level of suboptimality is introduced in the simplifying assumption of (8.2), i.e., the feature vectors are assumed conditionally independent. The second level of suboptimality is introduced with the assumption of conditionally independent features. Both of the above assumptions decrease enormously the storage needed. The third level of suboptimality is due to the assumption that the contextual dependence of a pattern only extends to the nearest neighbours. Using an approximation to the contextual constraints, as in section 9.5, represents a fourth level of suboptimality. Using a threshold to effect a limited depth of search represents a fifth level of suboptimality. Finally, a sixth level is introduced by approximating B_ij. The results show that, at the various possible levels, as the approximation becomes closer to the optimal case, very soon no improvement in performance is observed but a great increase in computation and/or storage is noted, rendering the algorithms inefficient. It should be remarked that if the simplifying assumptions at any level are valid then optimality is obviously upheld at that level.

It is interesting to compare some of the results obtained in this thesis with some available results using Raviv's algorithms.
Hussain [199] performed some experiments using sequential compound decision theory, in which a decision is made on ω_n using information from the entire past (all previous patterns) under the assumption of first-order Markov dependence among the pattern classes (Raviv's algorithm with bigrams). In these experiments Hussain used approximately the first half of each of the seven texts included in appendix A and obtained error rates of 17.8% and 14.7% without and with context, respectively. These figures represent a decrease in P_e of 17.4% attributed to the use of context. In this thesis, the sequential decoding algorithm with fixed depth of search incorporating bigrams in the LA-mode with d_n = d_{n+1} = 4 reduces the error probability from 18.9% to 16.0%. This represents a decrease in P_e of 15.3% attributed to context. Hence, although this algorithm is suboptimal in the sense that d_n = d_{n+1} = 4 rather than 26, and although information is used only from one nearest neighbour rather than the entire past, it is almost equally effective as Raviv's algorithm from the point of view of P_e, and does not require the involved recursive computation of the a posteriori probabilities taking the entire past into consideration. Furthermore, the sequential decoding algorithm with bigram approximation to trigrams in the LA-LB mode of decision with d_{n-1} = d_{n+1} = 3 and d_n = 4 reduces P_e from 18.9% to 14.5%. This represents a decrease of 23.3% in P_e attributed to context. Hence, using contextual information in a suboptimal way from the immediate previous and subsequent patterns, in the manner done in this thesis, yielded better performance than that obtained using the entire past in an optimal way under the assumption of first-order Markov dependence.
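The percentage decreases quoted above are relative decreases in P_e; a quick arithmetic check of the three figures:

```python
def relative_decrease(p_before, p_after):
    """Percentage decrease in error probability attributed to context."""
    return round(100.0 * (p_before - p_after) / p_before, 1)

print(relative_decrease(17.8, 14.7))  # Hussain, Raviv's algorithm: 17.4
print(relative_decrease(18.9, 16.0))  # fixed depth, bigrams, LA-mode: 15.3
print(relative_decrease(18.9, 14.5))  # bigram approx. to trigrams, LA-LB: 23.3
```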
From this preliminary experimental investigation one can conclude that the algorithms that incorporate the threshold and use approximations to the contextual constraints, as well as to terms such as B_ij, are the most promising, and hence deserve further attention, particularly with respect to their performance and efficiency for ℓ > 1 and in relation to the suggestions for further research discussed in section 11.3.

11.3 Suggestions for Further Research

1. All the algorithms proposed in this thesis are based on a classifier output consisting of an ordered set of alternatives. In section 10.2 a comparison was made of orderings using the likelihoods P(X_n|λ_n=θ_n) and the joint probabilities P(X_n|λ_n=θ_n)P(λ_n=θ_n). The ordering of these alternatives allows one to obtain good results with d_n as low as 4 instead of the maximum of 26. The cost of this saving in depth of search is obviously the computation involved in ordering the alternatives. In some applications d_n may be fairly high even after ordering the alternatives. In such a situation, ordering the alternatives for every pattern to be recognized may not be profitable. One solution to this problem is to avoid having to order the alternatives for each pattern by presenting the alternatives to the contextual decoder in a predetermined fixed order. One predetermined fixed order might be that determined by the order of decreasing values of P(λ_n=θ_n). Experiments should be performed to evaluate and compare these fixed-order algorithms with the variable-order algorithms considered in this thesis. The algorithms incorporating threshold values should be especially suited for predetermined fixed-order alternatives.

2.
Consider sequential decoding with bigrams in the look-back mode of decision with fixed depth of search, ℓ = 1, and d_{n-1} < d_n. Every decision is made using some information from the look-back neighbour only. With a slight modification the decision on each pattern can be made using information from the look-back neighbour as well as from all the previous patterns that have occurred in the text. This modification, which philosophically resembles Raviv's [171] algorithm and is in fact some sort of realizable approximation to sequential compound decision theory, is obtained without any additional storage needed and with a minimum of additional computation required. The algorithm works as follows. In the algorithms considered in chapter IX the terms given by P(X_1|λ_1=α)P(X_2|λ_2=β)P(λ_1=α, λ_2=β) are computed for α ∈ Ω_{d_1} and β ∈ Ω_{d_2}, and a decision is made on ω_2 based on the highest term encountered. One then proceeds to make a decision on ω_3 using the same ordering of alternatives for ω_2 that was used in making the decision on ω_2. In the algorithms proposed here, the ordering of the alternatives associated with ω_2 is modified after making a decision on ω_2, so that when a decision is made on ω_3 using information from ω_2, this decision has been affected by ω_1 in a favourable way. This modification is realized by substituting the first d_2 likelihoods associated with the first d_2 alternatives of ω_2 by the first d_2 likelihoods associated with the first d_2 alternative-pairs of ω_1 and ω_2 such that the d_2 alternatives for ω_2 are all different. The algorithm then proceeds to make a decision on ω_3 using the first d_2 (i.e., d_{n-1}) alternatives of ω_2. One can see
One can see 3 f 2 n-1 d that as t h i s process continues information from a l l the previous patterns i s used i n any decision. Of course, the further back one gets from the pattern i n question the less information i s used. However, t h i s i s as i t should be since patterns f a r away from each other i n the sequence w i l l not be as contextually r e l a t e d as nearby patterns. These types of algorithms should be experimentally evaluated and compared to the algorithms considered i n t h i s t h e s i s and to those of Raviv [171']. 3 . When using a threshold i n the contextual decoding algorithms i t was shown i n section 9 . 4 that the terms given "by ( 9 . 4 ) must be normalized i n some way. One obvious way of normalizing ( 9 . 4 ) i s i n the manner done i n section 9 .4 as shown i n ( 9 . 5 ) . Another way of achieving the same end r e s u l t , i . e . , making the depth of search values v a r i a b l e , i s by d e f i n i n g new parameters which have the same function as the o r i g i n a l f i x e d t h r e -shold. Let T be the new parameter. Then the depth of search value d n - n i s determined for each pattern by Q 3-Q M 0. n where i s defined i n (9-5) and 0 < T < 1. For example, i n the case of block decoding with bigrams the terms given by (9-4) would be maximized such that i (r ft, and j £• ft where d and d ,, are, r e s p e c t i v e l y , d 0 v d ,, n n+1 ' n n+1 determined by the number of a l t e r n a t i v e s associated with u and to ,, such J n n+1 that they have l i k e l i h o o d s greater than or equal to f (T,Q) and f ,_,(T,Q). . n n+1 I f T = 0 . 0 then d = M. I f T = 1 . 0 then d = 1 . For 0 < T < 1 the n n n n n value of d^ i s determined by the c l u s t e r i n g of the l i k e l i h o o d s . In other words, i f the l i k e l i h o o d s are very s i m i l a r (uncertainty about the class) d^ w i l l be l a r g e , and i f one or a few l i k e l i h o o d s are much higher than the remaining ones then d w i l l be low. 
4. As was shown in section 11.1, the algorithms proposed in this thesis can perform better, and be more efficient, than Raviv's [171] optimum algorithms when the parameters are properly chosen and when the contextual constraints are used in ways other than that determined by the Markov assumption. However, the algorithms of Raviv can be modified to incorporate thresholds also, and can be made suboptimal at various levels for the sake of efficiency. Hence, before any final conclusions are made concerning the relationship between Raviv's algorithms and those proposed here, a thorough evaluation and comparison should be made of the two approaches, including the modifications of Raviv's algorithms.

5. In experiment 10.4 it was shown that the bigram approximation of Type III yielded the lowest values of P_e. Also, it was stated that Ku and Kullback [193] have developed an algorithm for iteratively determining the optimum approximation, in the minimum discrimination information sense, as a function of the prescribed marginals. Experiments should be performed to determine whether using the optimum approximation to contextual constraints significantly reduces the probability of error over the suboptimum and pseudo-approximations.

6. It was concluded in section 11.1 that, for overall efficiency in a contextual decoder, suboptimality at all possible levels is desirable in order to make best use of the contextual information before the diminishing returns come into play. One level at which suboptimality was neither exploited nor explored is concerned with the storage allocated or used by the contextual constraints.
In this thesis the a priori statistics such as bigrams and trigrams were estimated from a large text and stored as a matrix of elements where each element represented a 32-bit word. If these probabilities are quantized so that they use up only 16, 8, or even 4 bits, one can realize a decrease in storage by a factor of eight. Experiments should be performed to determine the performance of the algorithms in this thesis as a function of the quantization of the n-gram probabilities.

APPENDIX A

THE TEXTS USED IN THE CLASSIFICATION EXPERIMENTS AND THEIR ASSOCIATED STATISTICS

TEXT NUMBER 1

THE PHILOSOPHICAL DIFFERENCE BETWEEN HUMAN PERCEPTION AND COMPUTER RECOGNITION OF PRINTED CHARACTERS SEEMS TO DEPEND ON THE RESEARCHERS BIAS. THE COGNITIVE PSYCHOLOGIST MAINTAINS THAT THE DISTINGUISHING CHARACTERISTICS OF THE ENTIRE STIMULUS GIVE RISE TO THE PERCEPT WITH NO SINGLE ATTRIBUTE BEING EITHER NECESSARY OR SUFFICIENT. IN CONTRAST THE ENGINEER VIEWS PATTERN RECOGNITION AS EXTRACTING VARIOUS MATHEMATICAL MEASURES FROM THE STIMULUS BY USING VARIOUS TRANSFORMATIONS. ALTHOUGH THESE TWO APPROACHES ARE NOT NECESSARILY ANTITHETICAL THERE IS A FUNDAMENTAL DISAGREEMENT ABOUT WHETHER THE ULTIMATE PATTERN RECOGNITION ALGORITHMS SHOULD BE BASED ON PSYCHOLOGICAL FEATURES WHATEVER THEY MAY BE OR ON MATHEMATICAL OPERATIONS THAT PERFORM THE DESIRED TASK WITH HIGHEST STATISTICAL ACCURACY. THE COMPUTER PROGRAMMER IS MORE THAN WILLING TO INCORPORATE THE KNOWLEDGE OF THE PSYCHOLOGIST INTO THE PROGRAMS IF HE CAN EXPLAIN IN STEP BY STEP FASHION WHICH STIMULUS FEATURES OR THEIR RELATIONSHIP GIVE RISE TO THE PSYCHOLOGICAL LABEL.

SOURCE: "Character Recognition: ..." by B. Blesser and E. Peper, M.I.T. Quarterly Progress Report, No. 98, 1970, p. 185.

TOTAL NUMBER OF CHARACTERS 1031
NUMBER OF PERIODS ...
NUMBER OF BLANKS 145
TEXT FORMED USING MUNSON'S ALPHABET SETS 1-21

TEXT NUMBER 2

FROM A MATHEMATICAL STATISTICAL POINT OF VIEW THE ASSUMPTION OF RANDOM SAMPLING MAKES IT POSSIBLE TO DETERMINE THE SAMPLING DISTRIBUTION OF A PARTICULAR STATISTIC GIVEN SOME PARTICULAR POPULATION DISTRIBUTION. IF THE VARIOUS VALUES OF THE STATISTIC CAN ARISE FROM SAMPLES HAVING UNDETERMINED OR UNKNOWN PROBABILITIES OF OCCURRENCE THEN THE STATISTICIAN HAS NO WAY TO DETERMINE THE SAMPLING DISTRIBUTION OF THAT STATISTIC. ON THE OTHER HAND IF EACH AND EVERY DISTINCT SAMPLE OF N OBSERVATIONS HAS EQUAL PROBABILITY OF OCCURRENCE AS IN SIMPLE RANDOM SAMPLING THEN THE SAMPLING DISTRIBUTION OF ANY GIVEN STATISTIC IS RELATIVELY EASY TO STUDY GIVEN OF COURSE THE POPULATION SITUATION.
S T l C A L A C C U R A C Y » ~TH E COMPUTER PROGRAMMER I S MORE T.HAN W I L L I N G TO I N C O R P O R A T E THE KNOWLEDGE OF THE P S Y C H O L O G I S T INTO THE PROGRAMS I F HE CAN E X P L A I N IN S T E P BY S T E P F A S H I O IM WHICH S T I M U L U S F E A T U R E S OR T H E I R R E L A T I O N S H I P GI VE R I S E TO THE P S Y C H O L O G I C A L L A B E L . SOURCE. . . . . . "Character Recognition: 1?er~foniraffc~e~~D^ by B. Blesser and E. Peper, M.I.T. Quarterly Progress Report,'No." 9 8 , 1970, p . ' l 8 5 . TOTAXTJUMLBER-OF"''CH"/\IffCCTERS- T03T NUMBER OF' PERIODS 'NUI':ffiER^ OF"BIJMIvS' 145 TEXT FORMED USING MUNSON'S ALPHABET" SETS. 1-21 191 T E X T N U M B ER F R O M A M A T H E M A T I C A L S T A T I S T I C A L P O I N T OF V I E W T H E A S S U M P T I O N OF R A N CO M S A M P L I N G M A K E S I T P O S S I B L E T O D E T E R M I N E T H E S A M P L I N G 01 S T R I B U T I ON OF A P A R T I C U L -A R S T A T I S T I C G I V E N S O M E P A R T I C U L A R P O P U L A T I O N M I S T R I B 0 T 1 0 N . I F T H E V A R I O U S V A L U E S OF T H E S T A T I S T I C C A N A R I S E F R O M S A M P L E S H A V I N G U N D E T E R M I N E D . OR. O N K N O WN PR OB A Li I C I T I E S OF O C C U R R E N C E T H E N T H E S T A T I S T I C I A N HA S N 0 W A Y TO DE T E R M I N E T H E S A Pi P L I H G D I S T R I B U T I ON OF T H A T S T A T I S T I C . ON T H E O T H E R H A N D I F E A C H AN D E V E R Y D I S T I N C T SA P P L E O F N O B S E R V A T I O N S H A S .EWUA L P R O B A B I L I T Y OF O C C O R R E N C E A S I N S I M P L E R A N D O M S A M P L I N G T H E N T H E S A M P L I N G DJ S T R I 1UJTI ON OF A N Y G I V E N S T A T I S T I C I S R E L A T I V E L Y E A S Y T O S T U D Y G I V E N OF C O U R S E T H E P O P U L A T I O N S I T U A T I O N . 
IT IS A SAD FACT THAT IF ONE KNOWS NOTHING ABOUT THE PROBABILITY OF OCCURRENCE FOR PARTICULAR SAMPLES OF UNITS FOR OBSERVATION HE CAN USE VERY LITTLE OF THE MACHINERY OF STATISTICAL INFERENCE. THIS IS WHY THE ASSUMPTION OF RANDOM SAMPLING IS NOT TO BE TAKEN LIGHTLY. ALL THE TECHNIQUES AND THEORY THAT WE WILL DISCUSS APPLY TO RANDOM SAMPLES AND DO NOT NECESSARILY HOLD FOR ANY DATA COLLECTED IN ANY WAY

SOURCE . . . Statistics, by W. L. Hays, Holt, Rinehart and Winston, Inc., 1963, p. 216.
TOTAL NUMBER OF CHARACTERS . . . 1087
NUMBER OF PERIODS . . . 6
NUMBER OF BLANKS . . . 173
TEXT FORMED USING MUNSON'S ALPHABET SETS

** TEXT NUMBER 3 **

IN SEARCHING FOR REGIONS OF THE CORTEX THAT MIGHT CONTRIBUTE TO THE PERFORMANCE OF THE HIGHER INTELLECTUAL PROCESSES WE WOULD NOT BE INCLINED TO PAY TOO MUCH ATTENTION TO THOSE CORTICAL AREAS THAT HAVE ALREADY BEEN ESTABLISHED TO BE TERMINAL POINTS FOR THE PERIPHERAL NERVES AND THAT THEREFORE ARE KNOWN TO BE PRIMARILY ENGAGED IN THE RECEIPT OF SENSORY INFORMATION OR THE TRANSMISSION OF MOTOR COMMANDS TO THE OUTLYING REGIONS OF THE BODY. OF THE SEVERAL HUNDRED SQUARE INCHES OF SURFACE AREA IN THE CEREBRAL CORTEX ONLY ABOUT ONE FOURTH IS USED FOR THESE SENSORIMOTOR PROCESSES.
INCLUDED ARE THE VISUAL CORTEX AT THE EXTREME REAR OF THE BRAIN THE SENSORY AND MOTOR STRIPS RUNNING DOWN THE SIDES OF THE BRAIN AND A SMALL REGION AT THE TOP EDGE OF THE TEMPORAL LOBE WHICH SERVES AS THE TERMINAL AREA FOR AUDITORY SENSATIONS. OF THE REMAINING THREE FOURTHS OF THE CORTICAL SURFACE NEARLY HALF IS IN THE FRONTAL LOBES THE BULBOUS PROTRUSIONS THAT LIE AHEAD OF THE AUDITORY RECEIVING AREA AND THE SENSORIMOTOR STRIPS. IN THE EVOLUTIONARY DEVELOPMENT OF THE BRAIN FROM THE LOWER TO THE HIGHER ANIMALS NOWHERE HAS MORE GROWTH OCCURRED THAN IN THESE FRONTAL LOBES. THE EXPANSION OF THE SKULL TO ACCOMMODATE THEM HAS GIVEN MAN THE HIGH FOREHEAD OF WHICH HE IS SO PROUD.

SOURCE . . . The Machinery of the Brain, by E. Wooldridge, McGraw-Hill, 1963, p. 145.
TOTAL NUMBER OF CHARACTERS . . . 1264
NUMBER OF PERIODS . . . 6
NUMBER OF BLANKS . . . 214
TEXT FORMED USING MUNSON'S ALPHABET SETS . . . 43-63

** TEXT NUMBER 4 **

DEAR SIR. I HAVE REVIEWED YOUR IDEA FOR DESIGN ENTRY SWEEP GENERATOR AND AM HAPPY TO ACCEPT IT FOR PUBLICATION IN ELECTRONIC DESIGN. UNFORTUNATELY PUBLICATION WILL BE DELAYED FOR A FEW MONTHS DUE TO THE LARGE BACKLOG OF IDEAS FOR DESIGN WE NOW HAVE ON HAND. PRIOR TO PUBLICATION THOUGH YOU WILL RECEIVE A COPY OF THE FINAL EDITED VERSION OF YOUR MATERIAL FOR APPROVAL.
IN ADDITION YOU WILL RECEIVE PAYMENT FOR YOUR ENTRY SHORTLY. YOU ARE NOW ELIGIBLE FOR THE MOST VALUABLE OF ISSUE AWARD AS DETERMINED BY OUR READERS. THE ANNOUNCEMENT OF THE WINNING ENTRY APPEARS APPROXIMATELY SIX TO EIGHT WEEKS AFTER ITS INITIAL PUBLICATION. ALL WINNING ENTRIES BECOME ELIGIBLE FOR THE IDEA OF THE YEAR AWARD. THANK YOU AGAIN FOR SENDING US YOUR IDEA FOR DESIGN AND PLEASE KEEP US IN MIND WHEN YOU HAVE ANOTHER. CORDIALLY.

SOURCE . . . A letter from the editor of a publishing company.
TOTAL NUMBER OF CHARACTERS . . . 810
NUMBER OF PERIODS . . . 10
NUMBER OF BLANKS . . . 139
TEXT FORMED USING MUNSON'S ALPHABET SETS . . . 64-84

** TEXT NUMBER 5 **

THE SHARE OF TOTAL MANUFACTURING ASSETS HELD BY THE TOP COMPANIES MAY BE INCREASING BUT PROBABLY NOT NEARLY AS MUCH AS THE GOVERNMENT SUGGESTS. AND WHAT SEEMS LIKE A DRAMATIC RISE IN SALES CONCENTRATION WITHIN A MAJOR CATEGORY OF CONSUMER GOODS INDUSTRIES VANISHES ALMOST ENTIRELY WHEN THE RATHER PRIMITIVE STATISTICAL TECHNIQUES USED IN THE GOVERNMENT STUDIES ARE SET ASIDE IN FAVOR OF METHODS THAT ARE MORE SOPHISTICATED THOUGH QUITE COMMONLY EMPLOYED BY ECONOMISTS. BUT MORE IMPORTANT THERE ARE SERIOUS QUESTIONS AS TO WHAT THE VARIOUS SETS OF NUMBERS PUBLISHED BY THE GOVERNMENT IMPLY FOR PUBLIC POLICY. AS ONE ECONOMIST SAID TO A CORPORATION LAWYER INQUIRING ABOUT THE ACCURACY OF THE OFFICIAL FIGURES ON CONCENTRATION YOU ARE ASKING THE WRONG QUESTION. THE CENTRAL ISSUE IS NOT ACCURACY BUT RELEVANCE. THE FIGURES ARE SUSPECT. BUT RIGHT OR WRONG THEY ARE NOT ESPECIALLY SIGNIFICANT. ECONOMISTS USE THE TERM CONCENTRATION IN TWO QUITE DISTINCT SENSES. ONE OF THEM THE KIND HAS BEEN TALKING ABOUT MIGHT BE CALLED MEGACORP CONCENTRATION. IT FOCUSES ON THE RATIO OF THE SALES OR ASSETS OF THE TOP OR MANUFACTURING COMPANIES TO THOSE OF THE ENTIRE MANUFACTURING SECTOR.

SOURCE . . . "Bigness Is a Numbers Game," by S. Rose, FORTUNE, Vol. LXXX, No. 6, Nov. 1969, p. 112.
TOTAL NUMBER OF CHARACTERS . . . 1171
NUMBER OF PERIODS . . . 10
NUMBER OF BLANKS . . . 188
TEXT FORMED USING MUNSON'S ALPHABET SETS . . . 85-105

** TEXT NUMBER 6 **

GEORGE C. SCOTT COMES AS CLOSE TO FITTING HIS DEFINITION OF THE IDEAL ACTOR AS ONE MAN CAN WITHOUT BREAKING APART INTO THREE DISPARATE INDIVIDUALS. IN HIS LIFE OFFSTAGE HE HAS BEEN STUBBORNLY EVEN VIOLENTLY INDIVIDUAL. WHEN HE IS ACTING HE CREATES A CHARACTER AND HIDES HIS INDIVIDUALITY WITH SINGULAR SUCCESS. AS THE MAN IN ROW TEN HE IS A PERFECTIONIST CRITIC MORE DEMANDING OF HIMSELF THAN OF THOSE AROUND HIM. IN MORE THAN A DOZEN STAGE AND SCREEN ROLES IN A STEADILY GROWING CAREER SCOTT HAS DEMONSTRATED THAT HE IS ONE OF THE BEST OF CONTEMPORARY ACTORS. HIS TALENT IS BOTH SUBTLE AND OBVIOUS. IT MAKES HIS ART AT ONCE UNSETTLINGLY REAL YET LARGER THAN LIFE. IT IS NO ACCIDENT THAT SCOTTS TRIPARTITE IDEAL IS A HUMAN BEING FIRST.
HIS OWN LIFE AND HIS INTUITIVE ABILITY TO USE IT AT THE RIGHT TIME IN THE RIGHT ROLE IS HIS FUNDAMENTAL RESOURCE. AS A GREAT ACTOR HE ACHIEVES SOMETHING NEW IN EVERY PART SOMETHING OF HIMSELF REBORN FATHERED BY INSIGHT NURTURED BY SKILL AND IMAGINATION. SCOTT ALSO OFFERS SOMETHING MORE. ALWAYS JUST BELOW THE SURFACE THERE IS AN INCESSANT DRUMBEAT OF ANGER.

SOURCE . . . TIME, March 22, 1971, p. 51.
TOTAL NUMBER OF CHARACTERS . . . 1093
NUMBER OF PERIODS . . . 10
NUMBER OF BLANKS . . . 191
TEXT FORMED USING MUNSON'S ALPHABET SETS . . . 106-126

** TEXT NUMBER 7 **

THE RANGELANDS LIE IN THE WESTERN UNITED STATES. THESE VAST TRACTS ARE MAINLY UTILIZED FOR THE GRAZING OF CATTLE. THEY COULD BE MUCH MORE PRODUCTIVE NOT ONLY FROM THE ECONOMIC STANDPOINT BUT ALSO IN THE RECREATIONAL AND THE AESTHETIC SENSE. THE CENTURY OLD WAR OVER THE USE OF THE WIDE OPEN SPACES OF THE UNITED STATES IS NOW ESCALATING YEAR BY YEAR IN SCOPE AND COMPLEXITY. IN THE NINETEENTH CENTURY IT WAS A RELATIVELY SIMPLE CONFLICT MAINLY BETWEEN THE CATTLE RANCHERS AND THE SHEEP RAISERS. TODAY IT HAS BECOME A HUGE CONTEST AMONG MANY INTERESTS VYING FOR THE SPACE RESOURCES AND BEAUTIES OF OUR WESTERN RANGELANDS. TO THE TRADITIONAL CLAIMS ON THESE LANDS BY RANCHERS LUMBERMEN MINERS HUNTERS CAMPERS AND NATURE LOVERS THERE ARE NOW ADDED THE DEMANDS ARISING FROM POPULATION GROWTH AND THE CONSEQUENT MARCH OF URBANIZATION. A CREEPING ENCROACHMENT BY
BURGEONING CITIES AND NEW TOWNS RECREATION RESORTS HIGHWAYS COMMUNICATION FACILITIES INDUSTRIES AND INTENSIVE FARMING DISPLACED BY THE SPREADING URBANIZATION OF THE EAST AND MIDDLE WEST IS PUSHING INTO THE RANGES AND WILDS OF THE WEST. THE RESULTING PRESSURES NOT ONLY FOR SPACE BUT ALSO FOR COROLLARY NECESSITIES SUCH AS WATER OBVIOUSLY MAKE IT ESSENTIAL TO SEEK A STRATEGY FOR MANAGEMENT OF THE RANGELANDS THAT WILL RECONCILE AND ACCOMMODATE THE MANY DIFFERENT INTERESTS. THE RANGELAND REGION COMPRISES MOST OF THE AREA OF THE SEVENTEEN WESTERN STATES ROUGHLY THE PORTION OF THE CONTINENTAL UNITED STATES WEST OF DENVER.

SOURCE . . . "The Rangeland of Western United States," by R. M. Love, Scientific American, Vol. 222, No. 2, Feb. 1970, p. 89.
TOTAL NUMBER OF CHARACTERS . . . 1476
NUMBER OF PERIODS . . . 10
NUMBER OF BLANKS . . . 237
TEXT FORMED USING MUNSON'S ALPHABET SETS . . . 127-147

REFERENCES

[1] G. D. Nelson and D. M. Levy, "A dynamic programming approach to the selection of pattern features," IEEE Trans. Systems Science and Cybernetics, Vol. SSC-4, July 1968, pp. 145-151.

[2] G. T. Toussaint and A.B.S. Hussain, "Comments on 'A dynamic programming approach to the selection of pattern features,'" IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-1, April 1971, pp. 186-187.

[3] S. Watanabe, "A method of self-featuring information compression in pattern recognition," Proc. of 1966 Bionics Meeting, Gordon and Breach, 1968, pp. 697-707.

[4] Y. T. Chien and K. S. Fu, "Selection and ordering of feature observations in a pattern recognition system," Information and Control, Vol.
12, 1968, pp. 395-414.

[5] Z. J. Nikolic and K. S. Fu, "On the selection of features in pattern recognition," Proc. of the Second Annual Princeton Conference on Information Sciences and Systems, IEEE-68-CT-1, 1968, pp. 434-438.

[6] G. Nagy, "Feature extraction on binary patterns," IEEE Trans. Systems Science and Cybernetics, Vol. SSC-5, October 1969, pp. 273-278.

[7] V. Balakrishnan and L. D. Sanghvi, "Distance between populations on the basis of attribute data," Biometrics, Vol. 24, December 1968, pp. 859-865.

[8] A.W.F. Edwards, "Distances between populations on the basis of gene frequencies," Biometrics, Vol. 27, December 1971, pp. 873-881.

[9] S. Watanabe, "A unified view of clustering algorithms," Proc. Intnat. Fed. Info. Proc. Soc., Czechoslovakia, 1971.

[10] W. S. Meisel, Computer-Oriented Approaches to Pattern Recognition, Academic Press, New York and London, 1972.

[11] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York and London, 1972.

[12] E. H. Ruspini, "A new approach to clustering," Information and Control, Vol. 15, July 1969, pp. 22-32.

[13] Yu. L. Barabash, "On properties of symbol recognition," Engineering Cybernetics, No. 5, Sept.-Oct. 1965, pp. 71-77.

[14] S. Watanabe et al., "Evaluation and selection of variables in pattern recognition," Computer and Information Sciences-II, J. T. Tou, Ed., New York, Academic Press, 1967, pp. 91-122.

[15] W. S. Mohn, Jr., "Two statistical feature evaluation techniques applied to speaker identification," IEEE Trans. Comput., Vol. C-20, Sept. 1971, pp. 979-987.

[16] A. N. Mucciardi and E. E. Gose, "A comparison of seven techniques for choosing subsets of pattern recognition properties," IEEE Trans. Comput., Vol. C-20, Sept. 1971, pp. 1023-1031.

[17] C. H.
Chen, "Theoretical comparison of a class of feature selection criteria in pattern recognition," IEEE Trans. Computers, Vol. C-20, September 1971, pp. 1054-1056.

[18] K. S. Fu, P. J. Min, and T. J. Li, "Feature selection in pattern recognition," IEEE Trans. Systems Science and Cybernetics, Vol. SSC-6, Jan. 1970, pp. 33-39.

[19] T. W. Kurczynski, "Generalized distance and discrete variables," Biometrics, Vol. 26, No. 3, Sept. 1970, pp. 525-534.

[20] J. M. Weiner and O. J. Dunn, "Elimination of variates in linear discrimination problems," Biometrics, Vol. 22, June 1966, pp. 268-275.

[21] W. J. Krzanowski, "A comparison of some distance measures applicable to multinomial data, using a rotational fit technique," Biometrics, Vol. 27, December 1971, pp. 1062-1068.

[22] S. Watanabe, "Automatic feature extraction in pattern recognition," Automatic Interpretation and Classification of Images, Academic Press, 1969, Chapter 5, pp. 131-136.

[23] S. Watanabe, "Karhunen-Loeve expansion and factor analysis," Trans. Fourth Prague Conf. on Information Theory, etc., 1965, Czechoslovak Academy of Sciences, Prague, 1967, pp. 635-660.

[24] S. Watanabe, Knowing and Guessing, John Wiley and Sons, New York, 1969.

[25] T. Kaminuma, T. Takekawa and S. Watanabe, "Reduction of clustering problem to pattern recognition," Pattern Recognition, Vol. 3, 1969, pp. 195-205.

[26] S. Watanabe, "Feature compression," Advances in Information Systems Science, Vol. 3, J. T. Tou, Ed., Plenum Press, New York-London, 1970, pp. 63-111.

[27] Y. T. Chien and K. S. Fu, "On the generalized Karhunen-Loeve expansion," IEEE Trans. Information Theory, Vol. IT-13, July 1967, pp. 518-520.

[28] J. T. Tou and R. P. Heydorn, "Some approaches to optimum feature extraction," in Computer and Information Sciences-II, J. T.
Tou, Ed., Academic Press, New York, 1967, pp. 57-89.

[29] J. T. Tou, "Feature selection for pattern recognition systems," in Methodologies of Pattern Recognition, S. Watanabe, Ed., Academic Press, New York and London, 1969, pp. 493-508.

[30] J. T. Tou, "Feature extraction in pattern recognition," Pattern Recognition, Vol. 1, No. 1, July 1968, pp. 3-11.

[31] J. T. Tou, "Engineering principles of pattern recognition," Advances in Information Systems Science, Vol. 1, Plenum Press, New York, 1969, pp. 173-249.

[32] I. Vajda, "A contribution to the informational analysis of pattern," Methodologies of Pattern Recognition, S. Watanabe, Ed., Academic Press, 1969, pp. 509-519.

[33] J. H. Munson, "Experiments in the recognition of hand-printed text," 1968 Fall Joint Computer Conf., AFIPS Proc., Vol. 33, pt. 2, Washington, D.C.: Thompson, 1968, pp. 1125-1138.

[34] J. H. Munson, "The recognition of hand-printed text," Pattern Recognition, L. Kanal, Ed., Thompson Book Co., Washington, D.C., 1968, pp. 115-140.

[35] A. L. Knoll, "Experiments with 'characteristic loci' for recognition of hand-printed characters," IEEE Trans. Computers, Vol. EC-18, April 1969, pp. 366-372.

[36] J. H. Munson, "Characteristics of SRI hand-printed data," Stanford Research Institute, Technical Note, October 19, 1967.

[37] L. Kanal and B. Chandrasekaran, "On dimensionality and sample size in statistical pattern classification," National Electronics Conference, Chicago, 1968, pp. 2-7.

[38] D. C. Allais, "The selection of measurements for prediction," Stanford Electron. Labs., Tech. Rep. 6103-9, Nov. 1964.

[39] G. F. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Trans. Information Theory, Vol. IT-14, Jan. 1968, pp. 55-63.

[40] R. Y.
Kain, "The mean accuracy of pattern recognizers with many pattern classes," IEEE Trans. Information Theory, May 1969, pp. 424-425.

[41] G. F. Hughes, "Number of pattern classifier design samples per class," IEEE Trans. Information Theory, September 1969, pp. 615-618.

[42] K. Abend, T. J. Harley, B. Chandrasekaran, and G. F. Hughes, "Comments on 'On the mean accuracy of statistical pattern recognizers,'" IEEE Trans. Information Theory, Vol. IT-15, May 1969, pp. 420-423.

[43] B. Chandrasekaran, "Independence of measurements and the mean recognition accuracy," IEEE Trans. Information Theory, Vol. IT-17, July 1971, pp. 452-456.

[44] B. Chandrasekaran, "Quantization of independent measurements and recognition performance," presented at the 1972 IEEE International Symposium on Information Theory.

[45] M. Minsky, "Steps toward artificial intelligence," Proc. IRE, Vol. 49, January 1961, pp. 8-30.

[46] R. O. Winder, "Threshold logic in artificial intelligence," Artificial Intelligence, IEEE Special Publication S-142, January 1963, pp. 107-128.

[47] C. K. Chow, "Statistical independence and threshold functions," IEEE Trans. on Computers, Vol. EC-14, February 1965, pp. 66-68.

[48] N. J. Nilsson, Learning Machines: Foundations of Trainable Pattern-Classifying Systems, McGraw-Hill, 1965, Chapter 2.

[49] H. Kazmierczak and K. Steinbuch, "Adaptive systems in pattern recognition," IEEE Trans. on Elec. Comp., Vol. EC-12, December 1963, pp. 822-835.

[50] G. T. Toussaint, "On a simple Minkowski metric classifier," IEEE Trans. on Systems Science and Cybernetics, Vol. SSC-6, October 1970, pp. 360-362.

[51] J. W. Carl and G. T. Toussaint, "Historical note on Minkowski metric classifiers," IEEE Trans. Syst.
, Man, Cybern., Vol. SMC-1, Oct. 1971, pp. 387-388.

[52] P. A. Devijver, "A general order Minkowski metric pattern classifier," AGARD Conference Proceedings No. 94 on Artificial Intelligence, Technical Editing and Reproducing, Ltd., September 1971, pp. 18-1 to 18-11.

[53] G. T. Toussaint, "Polynomial representation of classifiers with independent discrete-valued features," IEEE Trans. Computers, Vol. C-21, February 1972, pp. 205-208.

[54] G. T. Toussaint, "Note on optimal selection of independent binary-valued features for pattern recognition," IEEE Trans. Inform. Theory, Vol. IT-17, September 1971, p. 618.

[55] G. T. Toussaint and R. W. Donaldson, "Algorithms for recognizing contour-traced handprinted characters," IEEE Trans. Comput. (Short Notes), Vol. C-19, June 1970, pp. 541-546.

[56] G. T. Toussaint, "Machine Recognition of Independent and Contextually Constrained Contour-Traced Handprinted Characters," M.A.Sc. Thesis, University of British Columbia, December 1969.

[57] I. J. Good, The Estimation of Probabilities, Research Monograph 30, Cambridge, Mass.: M.I.T. Press, 1965.

[58] W. H. Highleyman, "The design and analysis of pattern recognition experiments," Bell System Technical Journal, March 1962, pp. 723-744.

[59] P. A. Lachenbruch and M. R. Mickey, "Estimation of error rates in discriminant analysis," Technometrics, Vol. 10, February 1968, pp. 1-11.

[60] M. Hills, "Allocation rules and their error rates," J. Royal Stat. Soc., Series B, Vol. 28, 1966, pp. 1-31.

[61] K. Fukunaga and D. L. Kessell, "Estimation of classification error," IEEE Trans. Computers, Vol. C-20, December 1971, pp. 1521-1527.

[62] L. E. Larsen, et al., "On the problem of bias in error rate estimation for discriminant analysis," Pattern Recognition, Vol. 3, No. 3, October 1971, pp. 217-223.
[63] O. J. Dunn and P. R. Varady, "Probabilities of correct classification in discriminant analysis," Biometrics, Vol. 22, 1966, p. 908.

[64] P. A. Lachenbruch, "On expected probabilities of misclassification in discriminant analysis, necessary sample size, and a relation with the multiple correlation coefficient," Biometrics, Vol. 24, 1968, p. 823.

[65] M. J. Sorum, "Estimating the Probability of Misclassification," Ph.D. Dissertation, University of Minnesota, 1968.

[66] N. Sedransk, "Contributions to Discriminant Analysis," Ph.D. Dissertation, Iowa State University, 1969.

[67] N. Sedransk and M. Okamoto, "Estimation of the probabilities of misclassification for a linear discriminant function in the univariate normal case," Ann. Inst. Stat. Math., Vol. 23, No. 3, 1971.

[68] K. Fukunaga and D. Kessell, "Nonparametric Bayes error estimation using unclassified samples," Computer Group Repository, R-72-128, 33 pp.

[69] J. H. Munson, "Some views on pattern-recognition methodology," Methodologies of Pattern Recognition, S. Watanabe, Ed., Academic Press, New York, London, 1969, pp. 417-436.

[70] R. Bakis, N. M. Herbst, and G. Nagy, "An experimental study of machine recognition of handprinted numerals," IEEE Trans. Syst. Sci. Cybernetics, Vol. SSC-4, July 1968, pp. 119-132.

[71] R. G. Casey, "Moment normalization of handprinted characters," IBM J. Res. Develop., Sept. 1970, pp. 548-557.

[72] W. C. Naylor, "Some studies in machine imitation using character recognition," Ph.D. dissertation, University of Wisconsin, June 1971.

[73] W. C.
Naylor, "Some studies in the interactive design of character recognition systems," IEEE Trans. on Computers, Vol. C-20, September 1971, pp. 1075-1086.

[74] G. Nagy and N. Tuong, "Normalization techniques for handprinted numerals," Communications of the ACM, Vol. 13, No. 8, August 1970, pp. 475-481.

[75] K. Ugawa, J. Toriwaki, and K. Sugino, "Normalization and recognition of two-dimensional patterns and linear distortion by moments," Electron. Comm. Japan, June 1967, pp. 34-36.

[76] B. Chandrasekaran and L. Kanal, "On linguistic, statistical, and mixed models for pattern recognition," Technical Report OSU-CISRC-TR-71-3, Computer and Information Science Research Center, Ohio State University, March 1971.

[77] A.B.S. Hussain, G. T. Toussaint, and R. W. Donaldson, "Results obtained using a simple character recognition procedure on Munson's handprinted data," IEEE Trans. Computers, Vol. C-21, February 1972, pp. 201-205.

[78] K. Fukunaga and W.L.G. Koontz, "Application of the Karhunen-Loeve expansion to feature selection and ordering," IEEE Trans. on Computers, Vol. C-19, April 1970, pp. 311-318.

[79] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. Communication Technology, Vol. COM-15, February 1967, pp. 52-60.

[80] S. Kullback, Information Theory and Statistics, New York, John Wiley, 1959.

[81] H. Kobayashi and J. B. Thomas, "Distance measures and related criteria," Proceedings of the Fifth Annual Allerton Conference on Circuit and System Theory, October 1967, pp. 491-500.

[82] V. A. Volkonsky and Y. A. Rosanov, "Some limit theorems for random functions, pt. I," in Theory of Probability and its Applications (English trans.), Vol. IV, p. 180.

[83] S.
Kullback, "A lower bound for discrimination information in terms of variation," IEEE Trans. Information Theory (Correspondence), Vol. IT-13, January 1967, pp. 126-127.

[84] S. Kullback, "Correction to 'A lower bound for discrimination information in terms of variation,'" IEEE Trans. Information Theory (Correspondence), Vol. IT-16, September 1970, p. 652.

[85] I. Vajda, "Note on discrimination information and variation," IEEE Trans. Information Theory (Correspondence), Vol. IT-16, November 1970, pp. 771-773.

[86] R. G. Gallager, Information Theory and Reliable Communication, New York: J. Wiley & Sons, 1968, pp. 82-91.

[87] J. T. Chu and J. Chueh, "Inequalities between information measures and error probability," J. Franklin Inst., Vol. 282, August 1966, pp. 121-125.

[88] D. L. Tebbe and S. J. Dwyer III, "Uncertainty and the probability of error," IEEE Trans. Information Theory, May 1968, pp. 516-518.

[89] V. A. Kovalevskij, "The problem of character recognition from the point of view of mathematical statistics," in Character Readers and Pattern Recognition, V. A. Kovalevskij, Ed., Spartan Books, New York, 1968.

[90] M. E. Hellman and J. Raviv, "Probability of error, equivocation and the Chernoff bound," IEEE Trans. Information Theory, Vol. IT-16, July 1970, pp. 368-372.

[91] T. M. Cover and P. E. Hart, "Nearest Neighbour Pattern Classification," IEEE Trans. Information Theory, Vol. IT-13, January 1967, pp. 21-27.

[92] D. S. Mitrinovic, Elementary Inequalities, P. Noordhoff Ltd., Groningen, The Netherlands, 1964.

[93] T. Ito, "On approximate error bounds in pattern recognition," presented by K.
Fukunaga at the Fifth Hawaii International Conference on Systems Science, January 1972, Honolulu, Hawaii.

[94] S. E. Estes, "Measurement selection for linear discriminants used in pattern classification," Ph.D. Thesis, Electrical Engineering Dept., Stanford University, Stanford, California, 1965.

[95] S. Siegel, Nonparametric Statistics for the Behavioural Sciences, McGraw-Hill, New York, 1956, Chapter 9.

[96] G. T. Toussaint, "A certainty measure for feature evaluation in pattern recognition," Proc. of the Fifth Hawaii International Conference on Systems Science, Honolulu, Hawaii, Jan. 1971, pp. 37-39.

[97] C. W. Swonger, "Property learning in pattern recognition systems using information content measurements," Pattern Recognition, L. N. Kanal, Ed., Washington, D.C., Thompson Book Company, 1968, pp. 329-347.

[98] H. F. Ryan, "The information content measure as a performance criterion for feature selection," IEEE Proc. of Seventh Symposium on Adaptive Processes, Los Angeles, Calif., 16-18 Dec. 1968, pp. 2-C-1 to 2-C-11.

[99] L. A. Kamentsky and C. N. Liu, "Computer-automated design of multifont print recognition logic," IBM Journal of Research and Development, Vol. 7, No. 1, Jan. 1963, pp. 2-13.

[100] R. C. Ahlgren, H. F. Ryan, and C. W. Swonger, "A character recognition application of an iterative procedure for feature selection," IEEE Trans. Computers, Vol. C-20, September 1971, pp. 1067-1075.

[101] J. A. Lebo and G. F. Hughes, "Pattern recognition preprocessing by similarity functionals," Proc. NEC, Vol. 22, 1966, pp. 551-556.

[102] P. M. Lewis II, "The characteristic selection problem in recognition systems," IRE Trans. Information Theory, Vol. IT-8, February 1962, pp. 171-178.
[103] E. Rasek, "A contribution to the problem of feature selection with similarity functionals in pattern recognition," Pattern Recognition, Vol. 3, April 1971, pp. 31-36.

[104] D. Kozlay, "Feature extraction in an optical character recognition machine," IEEE Trans. Computers, Vol. C-20, Sept. 1971, pp. 1063-1067.

[105] J. E. Paul, Jr., A. J. Goetze, and J. B. O'Neal, Jr., "A modified figure of merit for feature selection in pattern recognition," IEEE Trans. Information Theory, Vol. IT-16, July 1970, pp. 504-507.

[106] G. T. Toussaint, "Comments on 'A modified figure of merit for feature selection in pattern recognition,'" IEEE Trans. Information Theory, Vol. IT-17, September 1971, pp. 618-620.

[107] T. R. Vilmansen, "On dependence and discrimination in pattern recognition," IEEE Trans. Computers, September 1972.

[108] T. R. Vilmansen, "A comparison of the feature selecting capabilities of several measures of probabilistic dependence," Proc. of the Annual Canadian Computer Conference, Session '72, Montreal, Canada, June 1-3, 1972, pp. 422201-422213.

[109] T. R. Vilmansen, "Feature evaluation with measures of probabilistic dependence," submitted for publication to IEEE Trans. Computers.

[110] K. Matusita, "Decision rule based on distance for the classification problem," Ann. Inst. Stat. Math., Vol. 8, 1956, pp. 67-77.

[111] K. Matusita, "Distance and decision rules," Ann. Inst. Stat. Math., Vol. 16, 1964, pp. 305-315.

[112] K. Matusita, "A distance and related statistics in multivariate analysis," Multivariate Analysis, P. R. Krishnaiah, Ed., New York, Academic Press, 1966, pp. 187-200.

[113] H. Kobayashi and J. B.
Thomas, "Distance measures and related criteria," Proc. Fifth Annual Allerton Conference on Circuit and System Theory, October 1967, pp. 491-500.
[114] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Calcutta Math. Soc., Vol. 35, 1943, pp. 99-109.
[115] A. Bhattacharyya, "On a measure of divergence between two multinomial populations," Sankhya, Vol. 6, 1946, pp. 401-406.
[116] A. Caprihan and R. J. P. de Figueiredo, "On the extraction of pattern features from continuous measurement," IEEE Trans. Systems Science and Cybernetics, Vol. SSC-6, April 1970, pp. 110-115.
[117] K. Fukunaga and T. F. Krile, "An approximation to the divergence as a measure of feature effectiveness," Proc. National Electronics Conference, Vol. 23, 1967, pp. 57-61.
[118] R. L. Hoffman and K. Fukunaga, "Pattern recognition signal processing for mechanical diagnostics signature analysis," IEEE Trans. Computers, Vol. C-20, Sept. 1971, pp. 1095-1100.
[119] P. J. Min, D. A. Landgrebe, and K. S. Fu, "On feature selection in multiclass pattern recognition," Proc. of the Second Annual Princeton Conference on Information Sciences and Systems, IEEE-68-CT-1, 1968, pp. 453-457.
[120] T. L. Grettenberg, "Signal selection in communication and radar systems," IEEE Trans. Information Theory, Vol. IT-9, October 1963, pp. 265-275.
[121] C. C. Babu and S. N. Kalra, "On feature extraction in multi-class pattern recognition," Int. J. Control, Vol. 15, No. 3, 1972, pp. 595-601.
[122] A. Renyi, "On measures of entropy and information," Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol.
I, 1960, University of California Press, Berkeley, California, 1961, pp. 547-561.
[123] K. Matusita, "On the notion of affinity of several distributions and some of its applications," Ann. Inst. Stat. Math., Vol. 19, 1967, pp. 181-192.
[124] K. Matusita, "Some properties of affinity and applications," Ann. Inst. Stat. Math., Vol. 23, 1971, pp. 137-155.
[125] A. K. Joshi, "A note on a certain theorem stated by Kullback," IEEE Trans. Information Theory (Correspondence), Vol. IT-10, January 1964, p. 93.
[126] S. K. Das, "Feature selection with a linear dependence measure," IEEE Trans. Comput., Vol. C-20, Sept. 1971, pp. 1106-1109.
[127] G. T. Toussaint and T. R. Vilmansen, "Comments on 'Feature selection with a linear dependence measure,'" IEEE Trans. Comput., Vol. C-21, April 1972, p. 408.
[128] E. Beckenbach and R. Bellman, An Introduction to Inequalities, Random House, Yale University, 1961, p. 76.
[129] G. H. Hardy, J. E. Littlewood, and G. Polya, Inequalities, 2nd ed., London, New York: Cambridge Univ. Press, 1952.
[130] G. T. Toussaint, "Some functional lower bounds on the expected divergence for multihypothesis pattern recognition, communication, and radar systems," IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-1, October 1971, pp. 384-385.
[131] R. M. Fano, Transmission of Information, Cambridge, Mass.: M.I.T. Press, 1961.
[132] H. Kober, "On the arithmetic and geometric means and on Holder's inequality," Proc. Am. Math. Soc., Vol. 9, 1958, pp. 452-459.
[133] D. G. Lainiotis, "A class of upper bounds on probability of error for multihypothesis pattern recognition," IEEE Trans. Inform. Theory, Vol. IT-15, Nov. 1969, pp. 730-731.
[134] C. C.
Babu, "On feature extraction in pattern recognition," Proceedings Fifth Hawaii International Conference on Systems Science, Jan. 1972, pp. 20-23.
[135] G. T. Toussaint, "Some properties of Matusita's measure of affinity of several distributions," Ann. Inst. Stat. Math., submitted for publication.
[136] G. T. Toussaint, "On the expected divergence and other multiclass measures for feature evaluation in pattern recognition," IEEE Trans. Systems, Man, and Cybernetics, submitted for publication.
[137] G. T. Toussaint, "Feature evaluation with quadratic mutual information," Information Processing Letters, Vol. 1, No. 4, June 1972, pp. 153-156.
[138] G. T. Toussaint, "Some inequalities between distance measures for feature evaluation," IEEE Trans. Comput., Vol. C-21, April 1972, pp. 409-410.
[139] P. M. Lewis II, "Approximating probability distributions to reduce storage requirements," Information and Control, Vol. 2, No. 3, September 1959, pp. 214-225.
[140] W. Hoeffding and J. Wolfowitz, "Distinguishability of sets of distributions," Ann. Math. Stat., Vol. 29, 1958, pp. 700-718.
[141] G. T. Toussaint, "Comments on 'The divergence and Bhattacharyya distance measures in signal selection,'" IEEE Trans. Comm. Technol., Vol. COM-20, June 1972, p. 485.
[142] D. G. Lainiotis and S. K. Park, "Feature extraction criteria: comparison and evaluation," presented by G. T. Toussaint at the Fifth Hawaii International Conf. on Systems Sciences, January 1972.
[143] G. T. Toussaint, "Feature evaluation with a proposed generalization of Kolmogorov's variational distance and the Bhattacharyya coefficient," Proc. of the Annual Canadian Computer Conference, Session '72, Montreal, June 1-3, 1972, pp. 422401-422413.
[144] G. T.
Toussaint, "A proposed generalized divergence measure for more than two probability density functions," IEEE Trans. Information Theory, submitted for publication.
[145] R. L. Kashyap, "Algorithms for pattern classification," in Adaptive, Learning and Pattern Recognition Systems, J. M. Mendel and K. S. Fu, Eds., New York: Academic Press, 1970.
[146] R. O. Duda and R. C. Singleton, "Training a threshold logic unit with imperfectly classified patterns," presented at the WESCON Conv., Los Angeles, Calif., Aug. 1964.
[147] A. W. Whitney and S. J. Dwyer, "Performance and implementation of the k-nearest neighbour decision rule with incorrectly identified samples," Proc. 4th Annual Allerton Conf. Circuit and System Theory, 1966, pp. 96-106.
[148] K. Shanmugam and A. M. Breipohl, "An error correcting procedure for learning with an imperfect teacher," IEEE Trans. Syst., Man, Cybern., Vol. SMC-1, July 1971, pp. 223-229.
[149] A. K. Agrawala, "Learning with a probabilistic teacher," IEEE Trans. Inform. Theory, Vol. IT-16, July 1970, pp. 373-379.
[150] C. C. Babu, "On the extraction of pattern features from imperfectly identified samples," IEEE Trans. Comput., Vol. C-21, April 1972, pp. 410-411.
[151] C. C. Babu, "On the application of divergence for the extraction of features from imperfectly labelled patterns," IEEE Trans. Syst., Man, Cybern., Vol. SMC-2, April 1972, pp. 290-292.
[152] E. A. Patrick and F. P. Fisher II, "Nonparametric feature selection," IEEE Trans. Information Theory, Vol. IT-15, Sept. 1969, pp. 577-584.
[153] R. P. Heydorn, "An upper bound estimate on classification error," IEEE Trans. Inform. Theory, Vol. IT-14, September 1968, pp. 783-784.
[154] C. C. Babu and S. N.
Kalra, "On the application of Bashkirov, Braverman and Muchnik potential function for feature selection in pattern recognition," Proc. IEEE (Letters), March 1972, pp. 333-334.
[155] G. T. Toussaint, "Comments on 'A class of upper bounds on probability of error for multihypothesis pattern recognition,'" IEEE Trans. Information Theory, submitted for publication.
[156] G. T. Toussaint, "Comments on 'Theoretical comparison of a class of feature selection criteria in pattern recognition,'" IEEE Trans. Comput., Vol. C-21, June 1972, pp. 615-616.
[157] J. T. Chu and J. C. Chueh, "Error probability in decision functions for character recognition," J. Assoc. Computing Machinery, Vol. 14, April 1967, pp. 273-280.
[158] G. T. Toussaint, "Some upper bounds on error probability for multiclass pattern recognition," IEEE Trans. Computers, Vol. C-20, August 1971, pp. 943-944.
[159] G. T. Toussaint, "An improved error bound on the distance criterion of Patrick and Fisher," IEEE Trans. Information Theory, submitted for publication.
[160] E. F. Beckenbach and R. Bellman, Inequalities, second revised printing, Springer-Verlag, Berlin, Heidelberg, New York, 1965.
[161] E. R. Phillips, An Introduction to Analysis and Integration Theory, Intext Educational Publishers, Scranton, Toronto, London, 1971, p. 191.
[162] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, Urbana, Ill., The University of Illinois Press, 1963.
[163] C. E. Shannon, "Prediction and entropy of printed English," Bell Syst. Tech. J., Vol. 30, January 1951, pp. 50-64.
[164] K. Abend, "Compound decision procedures for unknown distributions and for dependent states of nature," Pattern Recognition, L.
Kanal, Ed., Thompson Book Co., Washington, D. C., 1968, pp. 207-249.
[165] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, in press.
[166] R. O. Duda, "Elements of pattern recognition," in Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications, Eds. J. M. Mendel and K. S. Fu, Academic Press, New York, 1970.
[167] K. Abend, "Compound decision procedures for pattern recognition," Proc. NEC, Vol. 22, 1966, pp. 777-780.
[168] G. Carlson, "Techniques for replacing characters that are garbled on input," AFIPS Conference Proceedings, Vol. 28, 1966 Spring Joint Computer Conference, pp. 189-192.
[169] R. W. Cornew, "A statistical method of error correction," Information and Control, No. 2, February 1966.
[170] C. M. Vossler and N. M. Branston, "The use of context for correcting garbled English text," Proc. ACM 19th National Conference, 1964, pp. D2.4-1 to D2.4-13.
[171] J. Raviv, "Decision making in Markov chains applied to the problem of pattern recognition," IEEE Trans. on Information Theory, Vol. IT-13, October 1967, pp. 536-551.
[172] B. Gold, "Machine recognition of hand-sent Morse code," IRE Trans. on Information Theory, Vol. IT-5, March 1959, pp. 17-24.
[173] W. W. Bledsoe and J. Browning, "Pattern recognition and reading by machine," Pattern Recognition, L. Uhr, Ed., New York, Wiley, 1966, pp. 301-316.
[174] R. Alter, "Use of contextual constraints in automatic speech recognition," IEEE Trans. on Audio and Electroacoustics, Vol. AU-16, March 1968, pp. 6-11.
[175] A. W. Edwards and R. L. Chambers, "Can a priori probabilities help in character recognition?" J. Assoc. Computing Machinery, Vol. 11, October 1964, pp. 465-470.
[176] R. O. Duda and P. E.
Hart, "Experiments in the recognition of hand-printed text: Part II - context analysis," 1968 Fall Joint Computer Conf., AFIPS Proc., Vol. 33, Pt. 2, Washington, D. C., Thompson, 1968, pp. 1139-1149.
[177] R. W. Donaldson and G. T. Toussaint, "Use of contextual constraints in recognition of contour-traced handprinted characters," IEEE Trans. on Computers, Vol. C-19, November 1970, pp. 1096-1099.
[178] E. M. Riseman and R. W. Ehrich, "Contextual word recognition using binary digrams," IEEE Trans. Computers, Vol. C-20, April 1971, pp. 397-403.
[179] A. B. S. Hussain, "Sequential Decision Schemes for Statistical Pattern Recognition Problems with Dependent and Independent Hypotheses," Ph.D. Dissertation, University of British Columbia, June 1972.
[180] A. B. S. Hussain and R. W. Donaldson, "Some sequential sampling techniques for multicategory dependent hypotheses pattern recognition," Proceedings of the Fifth Hawaii International Conference on System Sciences, January 1972, pp. 40-43.
[181] C. G. Hilborn, Jr. and D. G. Lainiotis, "Unsupervised learning minimum risk pattern classification for dependent hypotheses and dependent measurements," IEEE Trans. Systems Science and Cybernetics, Vol. SSC-5, April 1969, pp. 109-115.
[182] C. G. Hilborn, Jr. and D. G. Lainiotis, "Optimal unsupervised learning multicategory dependent hypotheses pattern recognition," IEEE Trans. Information Theory, Vol. IT-14, May 1968, pp. 468-470.
[183] J. T. Chu, "Error bounds for a contextual recognition procedure," IEEE Trans. Comput. (Short Note), Vol. C-20, Oct. 1971, pp. 1203-1207.
[184] G. T. Toussaint, "Comments on 'Error bounds for a contextual recognition procedure,'" IEEE Trans. Comput., Sept. 1972.
[185] J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering, New York: J.
Wiley and Sons, 1965, Ch. 6.
[186] R. G. Gallager, Information Theory and Reliable Communication, New York: J. Wiley and Sons, 1968, Ch. 6.
[187] P. M. Lewis II, "Approximating probability distributions to reduce storage requirements," Information and Control, Vol. 2, No. 3, September 1959, pp. 214-225.
[188] D. T. Brown, "A note on approximations to discrete probability distributions," Information and Control, Vol. 2, No. 4, December 1959, pp. 386-392.
[189] C. K. Chow and C. N. Liu, "An approach to structure adaptation in pattern recognition," IEEE Transactions on Systems Science and Cybernetics, Vol. SSC-2, No. 2, December 1966, pp. 73-80.
[190] C. K. Chow, "A class of nonlinear recognition procedures," IEEE International Convention Record, Vol. 14, Pt. 6, March 1966, pp. 40-50.
[191] C. K. Chow, "The maximum likelihood estimate of a dependence tree," Pattern Recognition, Ed. L. N. Kanal, Washington, D. C., Thompson, 1968, pp. 323-328.
[192] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, Vol. IT-14, No. 3, May 1968, pp. 462-467.
[193] H. H. Ku and S. Kullback, "Approximating discrete probability distributions," IEEE Transactions on Information Theory, Vol. IT-15, No. 4, July 1969, pp. 444-447.
[194] J. J. Freeman, "Note on approximating discrete probability distributions," IEEE Transactions on Information Theory, Vol. IT-17, July 1971, pp. 491-492.
[195] B. Russell, Education and the Social Order, London, G. Allen and Unwin Ltd., 1933.
[196] C. A.
Dykema, "A computer simulation study of the application of contextual constraints to character recognition," M.A.Sc. Thesis, University of British Columbia, September 1970.
[197] A. B. S. Hussain, "Sequential decision schemes with on-line feature ordering," Proceedings of the Annual Canadian Computer Conference, Session '72, Montreal, Canada, June 1-3, pp. 422301-422317.
[198] G. T. Toussaint and R. W. Donaldson, "Some simple contextual decoding algorithms applied to recognition of handprinted text," Proceedings of the Annual Canadian Computer Conference, Session '72, Montreal, Canada, June 1-3, pp. 422101-422115.
[199] A. B. S. Hussain, personal communication.
PUBLICATIONS
G. T. Toussaint, "Ultralinear ramp generator uses UJT to drive Darlington," Electronic Design, Vol. 17, No. 21, October 1969, pp. 117-119.
G. T. Toussaint and R. W. Donaldson, "Algorithms for recognizing contour-traced handprinted characters," IEEE Trans. Comput., Vol. C-19, June 1970, pp. 541-546.
G. T. Toussaint, "On a simple Minkowski metric classifier," IEEE Trans. Systems Science and Cybernetics, Vol. SSC-6, October 1970, pp. 360-362.
R. W. Donaldson and G. T. Toussaint, "Use of contextual constraints in recognition of contour-traced handprinted characters," IEEE Trans. Comput., Vol. C-19, November 1970, pp. 1096-1099.
G. T. Toussaint and A. B. S. Hussain, "Comments on 'A dynamic programming approach to the selection of pattern features,'" IEEE Trans.
Systems, Man, and Cybernetics, Vol. SMC-1, April 1971, pp. 186-187.
G. T. Toussaint, "Some upper bounds on error probability for multiclass pattern recognition," IEEE Trans. Comput., Vol. C-20, August 1971, pp. 943-944.
G. T. Toussaint, "Note on optimal selection of independent binary-valued features for pattern recognition," IEEE Trans. Information Theory, Vol. IT-17, September 1971, p. 618.
G. T. Toussaint, "Comments on 'A modified figure of merit for feature selection in pattern recognition,'" IEEE Trans. Inform. Theory, Vol. IT-17, September 1971, pp. 618-620.
J. W. Carl and G. T. Toussaint, "Historical note on Minkowski metric classifiers," IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-1, October 1971, pp. 387-388.
G. T. Toussaint, "Some functional lower bounds on the expected divergence for multihypothesis pattern recognition, communication, and radar systems," IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-1, October 1971, pp. 384-385.
A. B. S. Hussain, G. T. Toussaint, and R. W. Donaldson, "Results obtained using a simple character recognition procedure on Munson's handprinted data," IEEE Trans. Comput., Vol. C-21, February 1972, pp. 201-205.
G. T.
Toussaint, "Polynomial representation of classifiers with independent discrete-valued features," IEEE Trans. Comput., Vol. C-21, February 1972, pp. 205-208.
G. T. Toussaint, "Some inequalities between distance measures for feature evaluation," IEEE Trans. Comput., Vol. C-21, April 1972, pp. 409-410.
G. T. Toussaint and T. R. Vilmansen, "Comments on 'Feature selection with a linear dependence measure,'" IEEE Trans. Comput., Vol. C-21, April 1972, p. 408.
G. T. Toussaint, "A certainty measure for feature evaluation in pattern recognition," Proceedings of the Fifth Hawaii International Conf. on Systems Sciences, January 1972, pp. 37-39.
G. T. Toussaint, "Comments on 'Theoretical comparison of a class of feature selection criteria in pattern recognition,'" IEEE Trans. Comput., Vol. C-21, June 1972, pp. 615-616.
G. T. Toussaint, "Comments on 'The divergence and Bhattacharyya distance measures in signal selection,'" IEEE Trans. Comm. Technol., Vol. COM-20, June 1972, p. 485.
G. T. Toussaint, "Feature evaluation with quadratic mutual information," Information Processing Letters, Vol. 1, No. 4, June 1972, pp. 153-156.
G. T. Toussaint and R. W. Donaldson, "Some simple contextual decoding algorithms applied to recognition of handprinted text," Proc. Ann. Can. Comp. Conf., Session '72, Montreal, Canada, June 1-3, pp. 422101-422115.
G. T. Toussaint, "Feature evaluation with a proposed generalization of Kolmogorov's variational distance and the Bhattacharyya coefficient," Proc. Ann. Can. Comp. Conf., Session '72, Montreal, Canada, June 1-3, pp. 422401-422413.
G. T.
Toussaint, "Comments on 'Error bounds for a contextual recognition procedure,'" IEEE Trans. Comput., Vol. C-21, Sept. 1972, p. 1027.
G. T. Toussaint, "Comments on 'A comparison of seven techniques for choosing subsets of pattern recognition properties,'" IEEE Trans. Comput., Vol. C-21, Sept. 1972, pp. 1028-1029.
Item Metadata
Title | Feature evaluation criteria and contextual decoding algorithms in statistical pattern recognition
Creator | Toussaint, Godfried Theodore Patrick
Publisher | University of British Columbia
Date Issued | 1972
Description | Several probabilistic distance, information, uncertainty, and overlap measures are proposed and considered as possible feature evaluation criteria for statistical pattern recognition problems. A theoretical and experimental analysis and comparison of these and other well known criteria are presented. A class of certainty measures is proposed that is well suited to pre-evaluation and intraset post-evaluation. Two approaches to the M-class problem are considered: the well known expected value approach and a proposed generalized distance approach. In both approaches the main distance measures considered are Kullback's divergence, the Bhattacharyya coefficient, and Kolmogorov's variational distance. In addition a generalization of the Bhattacharyya coefficient to the M-class problem is considered. Properties and inequalities are derived for the above measures. In addition, error bounds are derived whenever possible, and it is shown that some of these bounds are tighter than existing ones while others represent generalizations of well known bounds found in the literature. Feature ordering experiments were carried out to compare the above measures with each other and with the Bayes error probability criterion. Finally, some very general measures are proposed which reduce to several of the measures considered above as special cases and error bounds are derived in terms of these measures. A general class of algorithms is proposed for the efficient utilization of contextual dependencies among patterns, at the syntactic level, in the decision process of statistical pattern recognition systems. The algorithms are dependent on a set of parameters which determine the amount of contextual information to be used, how it is to be processed, and the computation and storage required. One of the most important parameters in question, referred to as the depth of search, determines the number of alternatives considered in the decoding process.
Two general types of algorithms are proposed in which the depth of search is pre-specified and remains fixed. The efficiency of these algorithms is increased by incorporating a threshold and effecting a variable depth of search that depends on the amount of noise present in the pattern in question. Several general types of algorithms incorporating the threshold are also proposed. In order to experimentally analyze, compare, and evaluate the various algorithms, the recognition of handprinted English text was simulated on a digital computer. |
Genre | Thesis/Dissertation
Type | Text
Language | eng
Date Available | 2011-03-28
Provider | Vancouver : University of British Columbia Library
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
DOI | 10.14288/1.0101465
URI | http://hdl.handle.net/2429/32987
Degree | Doctor of Philosophy - PhD
Program | Electrical and Computer Engineering
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of
Degree Grantor | University of British Columbia
Campus | UBCV
Scholarly Level | Graduate
AggregatedSourceRepository | DSpace
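The contextual decoding scheme summarized in the description above (a fixed depth of search limiting the alternatives considered per pattern, with an optional threshold that reduces the depth for reliably classified patterns) can be illustrated with a minimal sketch. This is not the thesis's algorithm verbatim: the function name `contextual_decode`, the use of letter digram probabilities as the syntactic-level context, and the Viterbi-style search over the retained alternatives are illustrative assumptions.

```python
import math

def contextual_decode(posteriors, digram, depth=3, threshold=None):
    """Sketch of depth-limited contextual decoding (illustrative, not the thesis code).

    posteriors: list, one dict per character position, {letter: P(letter | measurement)}
    digram:     dict {(prev, cur): P(cur | prev)}, the syntactic-level context
    depth:      number of alternatives retained per position (the "depth of search")
    threshold:  if given, a position whose best posterior meets it is treated as
                reliable and searched at depth 1 (a variable depth of search)
    """
    # Keep only the top-`depth` alternatives at each position.
    cands = []
    for post in posteriors:
        ranked = sorted(post, key=post.get, reverse=True)
        d = 1 if threshold is not None and post[ranked[0]] >= threshold else depth
        cands.append(ranked[:d])

    # Viterbi-style search over the retained alternatives: maximize the sum of
    # log-posteriors plus log digram transition probabilities.
    best = {c: math.log(posteriors[0][c]) for c in cands[0]}
    back = [dict()]
    for i in range(1, len(cands)):
        new, bp = {}, {}
        for c in cands[i]:
            score, arg = max(
                (best[p] + math.log(digram.get((p, c), 1e-9)), p) for p in best
            )
            new[c] = score + math.log(posteriors[i][c])
            bp[c] = arg
        best, back = new, back + [bp]

    # Trace back the highest-scoring letter sequence.
    last = max(best, key=best.get)
    out = [last]
    for bp in reversed(back[1:]):
        out.append(bp[out[-1]])
    return "".join(reversed(out))
```

With `depth=1` this degenerates to character-by-character maximum-posterior decoding; larger depths let a strong digram overrule a weak classifier decision, at the cost of searching more alternatives, which is the trade-off the parameter controls.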