13223 SEQUENTIAL DECISION SCHEMES FOR STATISTICAL PATTERN RECOGNITION PROBLEMS WITH DEPENDENT AND INDEPENDENT HYPOTHESES by ABUL BASHAR SHAHIDUL HUSSAIN B. Sc., U n i v e r s i t y of Engineering and Technology, Dacca, Bangladesh, 1966 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY i n the Department of E l e c t r i c a l Engineering We accept t h i s thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA June 1972 In p r e s e n t i n g t h i s t h e s i s in p a r t i a l f u l f i l m e n t o f the requirements f o r an advanced degree at the U n i v e r s i t y of B r i t i s h Columbia , I agree that the L i b r a r y s h a l l make i t f r e e l y a v a i l a b l e fo r re ference and study . I f u r t h e r agree t h a t p e r m i s s i o n f o r e x t e n s i v e copying o f t h i s t h e s i s f o r s c h o l a r l y purposes may be granted by the Head of my Department or by h i s r e p r e s e n t a t i v e s . I t i s understood that copying o r p u b l i c a t i o n o f t h i s t h e s i s f o r f i n a n c i a l ga in s h a l l not be a l lowed without my w r i t t e n p e r m i s s i o n . Department The U n i v e r s i t y o f B r i t i s h Columbia Vancouver 8, Canada ABSTRACT Sequential decision schemes for the purpose of both pattern c l a s s i f i c a t i o n and feature ordering are investigated. An optimum compound sequential p r o b a b i l i t y r a t i o test (OCSPRT) fo r recognition problems with memory i n the source as w e l l as the obser-vation medium i s developed. The r e s u l t s of t h e o r e t i c a l analysis and computer simulation of the OCSPRT for a two-class problem with f i r s t order Markov dependence among the pattern classes are presented. For m u l t i c l a s s recognition problems the s u i t a b i l i t y of Bayes sequential decision schemes based on one-state ahead truncation approxi-mation, with and without on-line feature ordering, i s assessed from the points of view of computational complexity and expected cost of a terminal de c i s i o n . The Bayes sequential d e c i s i o n scheme f o r dependent hypothesis problems i s formulated and i t s performance i s compared with that of the optimum compound nonsequential d e c i s i o n scheme. For dependent hypothesis recognition problems, compound sequen-t i a l pattern recognition schemes (CSPRS) are formulated. In CSPR schemes the required a d d i t i o n a l feature i s observed e i t h e r on the pattern to be decided, as i n the c l a s s i c a l sequential schemes, or on any one of the neighbouring patterns. The pattern selected and the feature a c t u a l l y observed are the ones which provide the maximum amount of a d d i t i o n a l i n f o r -mation. The r e s u l t s of a n a l y t i c a l and experimental evaluation of the CSPR schemes are.presented. The s u i t a b i l i t y of the suboptimal sequential d e c i s i o n scheme with on-line ordering of features as a feature evaluation and ordering c r i t e r i o n i s discussed. A modified on-line sequential (MOLS) d e c i s i o n scheme based on l i m i t e d length of search i s proposed as a compromise between the a d d i t i o n a l computational complexity and improvement i n the recognition performance r e s u l t i n g from the on-line ordering of features. The advantage of incorporating such l i m i t e d length of search over a v a i l a b l e features i n t o sequential decision schemes using a set of preordered features i s also examined. For the purpose of experimental evaluation of the various de-c i s i o n schemes, recognition•of handprinted E n g l i s h text as a p a r t i c u l a r example of a pattern recognition problem was simulated on a d i g i t a l computer. The handprinted characters were obtained from Munson's multiauthor data f i l e prepared at Stanford Research I n s t i t u t e . TABLE OF CONTENTS Page ABSTRACT i TABLE OF CONTENTS i i i LIST OF PRINCIPAL SYMBOLS v LIST OF ILLUSTRATION i x LIST OF TABLES x i i ACKNOWLEDGEMENT , x i i i b I. INTRODUCTION 1 1.1 S t a t i s t i c a l Pattern C l a s s i f i c a t i o n and Sequential Decision Schemes 1 1.2 Previous Research 3 1.3 Scope of the Thesis. 6 II . OPTIMUM COMPOUND SEQUENTIAL PROBABILITY RATIO TEST FOR DEPENDENT HYPOTHESIS PROBLEMS 9 2.1 Introduction 9 2.2 Statement of the Problem. 9 2.3 Derivation of the Decision Function 12 2.4 Recursive Solution of v 15 2.5 Generalized CSPRT for M u l t i c l a s s Problems 20 I I I . COMPOUND SPRT FOR FIRST ORDER MARKOV DEPENDENT HYPOTHESES.. 23 3.1 Description of the Model 23 3.2 Nature of the Stopping Boundaries 26 3.3 Expected Number of Features Per Pattern 29 3.4 Comparison of OCSPRT and Standard SPRT 33 3.5 Performance Evaluation by Simulation Experiments 37 IV. . DATA BASE, PREPROCESSING, FEATURE EXTRACTION, AND SYSTEM EVALUATION 44 4.1 Description of Handprinted Character Set 44 4.2 Preprocessing of the Characters 45 4.3 Feature Extraction Scheme 46 4.4 Estimation of P r o b a b i l i t i e s and System Evaluation Procedure 52 4.5 Preliminary Recognition Experiments 56 4.6 Test Procedure and Organization of Results 62 V. BAYES SEQUENTIAL DECISION SCHEMES FOR MULTICLASS PATTERN RECOGNITION PROBLEMS 66 i i i Page 5.1 Introduction , 66 5.2 Formulation of the Decision Rule 63 5.3 A Suboptimal Decision Rule Based on One-State Ahead Truncation 71 5.4 Compound Sequential Decision Schemes f o r F i r s t Order Markov Dependent Hypotheses 74 5.5 Simple Sequential Decision Scheme 77 5.6 Experiments and Results 78 5.7 Discussion of the Experimental Results 79 VI. SUBOPTIMAL COMPOUND SEQUENTIAL PATTERN RECOGNITION SCHEMES FOR DEPENDENT HYPOTHESIS PROBLEMS 88 6.1 Introduction 88 6.2 Statement of the Problem 90 6.3 Formulation of the Suboptimal Sequential Decision Rule.... 92 6.4 Sp e c i a l Cases-CSPRS Type 1, Type 2, and Type 3 96 VII. ANALYTICAL AND EXPERIMENTAL EVALUATION OF THE CSPR SCHEMES 101 7.1 Introduction 101 7.2 A n a l y t i c a l Evaluation Using Two-Class F i r s t Order Markov Problem 101 7.3 Results and Discussion f o r the Two-Class, F i r s t Order Markov Problem 108 7.4 Experimental Evaluation of CSPR Schemes Using M u l t i c l a s s Problems 113 7.5 The CSPR Schemes and Recognition Systems with Variable Dimensional Feature Vectors 114 VIII. FEATURE ORDERING AND ON-LINE SEQUENTIAL DECISION SCHEMES 127 8.1 Introduction 127 8.2 Formulation of the Sequential Decision Scheme with On-Line Ordering of Features 128 8.3 Results and Discussion of Recognition Experiments Using . the On-Line Sequential Decision Scheme * 131 8.4 Feature Evaluation Using the On-Line Sequential Decision Scheme 137 8.5 Modified On-Line Sequential Decision Scheme with Limited Length of Search 148 8.6 Sequential Recognition Scheme with Limited Length of Search Over Preordered Features... 152 IX. CONCLUSION AND DISCUSSION 156 9.1 General Discussion 156 9.2 Suggestion f o r Future Research 158 APPENDIX-A 161 APPENDIX-B 163 APPEND IX-C 165 REFERENCES 172 LIST OF PRINCIPAL SYMBOLS Vectors Symbol Dimension v k - ^ k k \ A — (,X , . . . , X ) 1 n k —k k 1 3 r = ( x K , . . . , x x ) or,....,x K t + 1 ) , k i t k 1, (X ,. . . ,X ) , kst k th feature vector Set of feature vectors Set of t features X +!>•••>XD) k Set of features yet to be observed at state n, (D-n k) (X ,...,X ), k i r L(x ,... ,X ) k<r Set of feature vectors with r th order Markov dependence • A k = { X k , . . . , - X k ~ t } W={Wk, 1=1,2,.. . ,5 k} n ( N ^ ; C k) Set of pattern classes Set of l i k e l i h o o d vectors Set of state vectors Set of states to which t r a n s i t i o n from N*r i s possible M k ^,-k r k - t 9 S , ' (6 ,...,<$ 2 ) , k i t . k 1 ^(6 , . . . ,6 ) , k<t. Set of optimum terminal decisions { l u > = ( i , j , . . . , A ) Set of indice s of the state vectors combining the pattern cl a s s wk u n b ( q ) = ( i , j , . . . , 4 ) Set of q in d i c e s of feature components n b(q) Set of (D-q) indice s of features (D-q) yet to be observed at state q n b (q)en b(q) Set of indices of f i r s t j^(D-q) features yet to be observed at state q v Symbol d={d o,d 1 5... , < y Set of decisions Dimension M+1 Discrete Variables Symbol D Dimensionality of a feature vector . M T o t a l number of pattern classes UK i th pattern c l a s s r Order of Markov dependence among the feature vectors m Order of Markov dependence among the pattern classes t Number of feature vectors under observation t ^ Amount of dependency on the future patterns y Compound p r o b a b i l i t y r a t i o X. Likelihood r a t i o 1 k W. State vector 1 Wl =^'°l'" ' " ' U l m ^ State vector with element belonging to pattern c l a s s oj^ only Decision state N^ Decision state x N^ (j) State to which t r a n s i t i o n from state N^ i s possible 1 through the observation of xJ ... feature n .+1 J n. Saving i n the average number of features obtained 1 using the OCSPRT r e l a t i v e to SPRT c(n^.) Cost of n^ th feature component C To t a l cost of feature observation i n continuing the n k decision process to state n, k v i Symbol k k k d. . i=l,2,... >M A terminal decision denoting X ~ OJ. I ; 1 d A nonterminal decision o £(a) k;d k) Loss involved i n making the decision d^ when a c t u a l l y 1 3 v k J x <a> Largest integer not exceeding a N^(i) Average number of features required per pattern for a terminal d e c i s i o n based on the t e s t i n g of i th text Np Expected value of the average number of features required per pattern for a terminal d e c i s i o n based on a l l seven texts x Best feature to be observed at any d e c i s i o n state Standard deviation of the features' variance •S'(n^_^) Number -of states i n which a feature from the previous pattern i s preferable to the one from the pattern under consideration f . ( i ) j th feature component (according to some conven-3 ient l a b e l l i n g system) at i th state U.. T o t a l usage of the j th feature at i th state I[x.;fi ] Mutual information between the i th feature and set of pattern classes Q. J(x ) Expected divergence for I th feature component LS Length of search S (a) Computational complexity involved due to on-line ordering of features i n sequential d e c i s i o n schemes S(a,3) Complexity associated with MOLS dec i s i o n schemes due to l i m i t e d length of search vxi P r o b a b i l i t i e s P(x) ' Unconditional p r o b a b i l i t y of event x P(x|y) Conditional p r o b a b i l i t y of event x conditioned on event y k+11 k Pj^=p(uK |UK) State t r a n s i t i o n p r o b a b i l i t y P ( i ) Error p r o b a b i l i t y based on the i th text P„ Expected error p r o b a b i l i t y based on a l l seven texts P ^ ( i ) P r o b a b i l i t y of forced decision based on the f i r s t text Pp Expected p r o b a b i l i t y of forced d e c i s i o n based on a l l seven texts e. .=P(X~to. |x~w. ) Error p r o b a b i l i t y of type I and type II p^(e) Bayes err o r p r o b a b i l i t y •E(a.g) Increase i n err o r p r o b a b i l i t y due to l i m i t e d length of search r e l a t i v e to on-line sequential d e c i s i o n scheme R (a) Reduction i n err o r p r o b a b i l i t y due to on-line ordering of features v i i i LIST OF ILLUSTRATIONS Figure Page 3.1 Block diagram of a sequential recognition system with two-class Markov dependent source 25 3.2 The stopping boundaries of OCSPRT, and the regions of 90 saving and increase i n the average number of features 3.3 Saving i n the average number of features per pattern obtained using OCSPRT r e l a t i v e to SPRT as a function of er r o r p r o b a b i l i t y e and state t r a n s i t i o n p r o b a b i l i t y p... 3.4 Comparison of OCSPRT, SPRT, optimal simple nonsequential, and optimal compound nonsequential schemes^ a = 1.7 40 3.5 Comparison of OCSPRT, SPRT, optimal simple nonsequential, and optimal compound nonsequential d e c i s i o n schemes. a = 1.0 41 3.6 E f f e c t of t, the length of the sequence of feature vectors a v a i l a b l e at any instant, on the performance of OCSPRT 43 4.1 The o r i g i n a l and the size-normalized character samples 47 4.2 The (h x w) pattern matrix and the (h' x w') mask 49 4.3 The n a t u r a l ordering of the 25 feature components 52 5.1 Comparison of simple suboptimal sequential and optimal non-sequential d e c i s i o n schemes 80 5.2 Comparison of simple suboptimal sequential and optimal nonsequential decision schemes with preordered sets of features 82 5.3 Comparison of simple sequential, simple nonsequential, compound sequential, and compound nonsequential d e c i s i o n schemes 83 7.1 System error p r o b a b i l i t y f o r various combinations of features i x Figure Page from the present and the previous patterns. The arrows show the path of minimum error p r o b a b i l i t y . The features variance i s uniformly d i s t r i b u t e d . . . . 109 7.2 System error p r o b a b i l i t y f o r various combinations of features from the present and the previous patterns. The arrows show the path of minimum error p r o b a b i l i t y . The features variance i s exponentially d i s t r i b u t e d 110 7.3 System error p r o b a b i l i t y f o r various combinations of features from present and previous patterns. Features' variance i s l i n e a r l y d i s t r i b u t e d I l l 7.4 E f f e c t of standard deviation of the features' variance on the t o t a l number of states i n which a feature from the previous pattern i s preferable to one from the present pattern 114 7.5 Comparison of CSPRS Type 1, compound sequential, compound nonsequential, simple sequential, and simple nonsequential d e c i s i o n schemes... 116 7.6 Comparison of CSPRS Type 1, CSPRS Type 2, compound sequential, and compound nonsequential decision schemes 119 7.7 Comparison of CSPRS Type 1, CSPRS Type 3, compound sequential, and compound nonsequential d e c i s i o n schemes... 121 7.8 A n a l y t i c a l and experimental evaluation of the d i s c r i m i n a t i o n c a p a b i l i t y of the 25 feature components.... 122 7.9 E f f e c t of observing features from the previous pattern, on er r o r p r o b a b i l i t y and the feature requirement from the present pattern i n CSPRS Type 1 126 Figure . Page 8.1 Comparison of suboptimal on-line sequential, usual sequential, and optimal nonsequential d e c i s i o n schemes 132 8.2 Comparison of computational complexity and the reduction i n the e r r o r p r o b a b i l i t y due to on-line ordering of features 135 8.3 Frequency of usage of the feature components i n on-line sequential decision scheme 141 8.4 Comparison of feature ordering c r i t e r i a on the basis of the performance of the simple suboptimal sequential decision scheme 146 8.5 Evaluation of MOLS decision scheme with various lengths of search 150 8.6 Comparison of computational complexity and the increase i n the e r r o r p r o b a b i l i t y due to the l i m i t e d length of search i n the MOLS decision scheme 153 8.7 Comparison of performance of suboptimal simple sequential scheme using a set of preordered features with and without l i m i t e d search f a c i l i t y 155 x i LIST OF TABLES Table Page 4.1 Results of preliminary recognition experiments with Munson's multiauthor data f i l e 57 4.2 Recognition r e s u l t s obtained by others using Munson's multiauthor data f i l e 58 4.3 Recognition r e s u l t s showing the advantage of size-normaliza-t i o n preprocessing 61 5.1 Recognition r e s u l t s of a l l seven texts using suboptimal simple sequential and optimal simple nonsequential d e c i s i o n schemes 81 5.2 Recognition r e s u l t s of a l l seven texts using suboptimal compound sequential and optimal compound nonsequential decision schemes 84 5.3 Comparison of simple sequential and simple nonsequential decision schemes on the basis of computational complexity... 86 7.1 Recognition r e s u l t s of a l l seven texts using CSPRS Type 1... 117 7.2 Performance of CSPRS Type 1 f o r various amounts of features from the neighbouring patterns 125 8.1 Recognition r e s u l t s of a l l seven texts using suboptimal simple sequential decision scheme with on-line ordering of features 133 8.2 The usage of the feature components at various d e c i s i o n states of an on-line sequential d e c i s i o n scheme. The row and column corresponding to a e n c i r c l e d number i n d i c a t e the feature number and i t s rank r e s p e c t i v e l y 139 x i i Table . Page 8.3 Spearman rank c o r r e l a t i o n c o e f f i c i e n t s between various sets of ordered features 145 8.4 Recognition r e s u l t s of a l l seven texts using suboptimal simple sequential decision scheme with sets of preordered features 147 ACKNOWLEDGEMENT I would l i k e to express my sincere appreciation to Professor Robert W. Donaldson f o r h i s generous counsel and constant encouragement. I am g r a t e f u l to my colleagues Fred Toussaint, Jim Yan and Don A l l a n for many stim-u l a t i n g discussions. I also wish to thank Dr. J . Munson of Stanford Research I n s t i t u t e f or making his multiauthor hand-printed data set a v a i l a b l e . I express my appreciation to Ms. Linda Morris for assuming the onerous burden of typing the mathematical thesis and Ms. Beverly Harasymchuk and Ms. Norma Duggan f o r helping me prepare the f i n a l manuscript. The patience and perseverence of Ms. Margaret Lockwood, who did a wonderful job of proof-reading through several sleepless nights, i s greatly appreciated. I also wish to thank Mr. Herb Black for h i s t e c h n i c a l assistance. Support of t h i s study by the A s s o c i a t i o n of U n i v e r s i t y and Colleges of Canada i n the form of a Commonwealth Scholarship, the National Research Council of Canada under Grant NRC A-3308 and the Defence Research Board of Canada under Grant DRB 2801-30 i s g r a t e f u l l y acknowledged. vi i "i V\ 1 CHAPTER I INTRODUCTION 1.1 S t a t i s t i c a l Pattern C l a s s i f i c a t i o n and Sequential Decision Schemes Given an unknown pattern (e.g., noisy waveform, English character, ..blood c e l l ) , xtfhieh b.elongs to one of M classes (categories, population, or source s t a t e s ) , the problem of a pattern c l a s s i f i e r i s to "categorize" i t c o r r e c t l y . A set of 13 features (measurement, a t t r i b u t e s , or des-c r i p t o r s ) from a pattern i s a D-tuple of numbers (a point i n a D-dimensional feature space) and i s c a l l e d the feature vector for that pattern. The c l a s s i f i c a t i o n process i s the p a r t i t i o n i n g of the feature space i n t o M mutually exclusive regions, \> , i = l , 2, M, such that i f an unknown pattern point i s within a region, $ , the pattern i s c l a s s i f i e d to be a member of class O J . . x The pattern c l a s s i f i c a t i o n problem thus involves the considera-t i o n of three fundamental aspects, namely c h a r a c t e r i z a t i o n , abstraction, and g e n e r a l i z a t i o n [1]. The c h a r a c t e r i z a t i o n involves the s e l e c t i o n of the set of features that characterizes the pattern classes and Is perhaps the most arduous work i n . the area of pattern recognition. Once the features are a v a i l a b l e , i t i s necessary to choose a decision r u l e (or decision scheme) using a l l the information i n hand i n order to be able to c l a s s i f y the pattern samples of unknown clas s e s . This process of obtaining the decision r u l e i s defined as abstraction. The a b i l i t y of the d e c i s i o n r u l e to c o r r e c t l y c l a s s i f y the patterns i s termed g e n e r a l i z a t i o n . A major portion of t h i s thesis i s concerned with the a b s t r a c t i o n problem and the g e n e r a l i z a t i o n a b i l i t y of the corresponding schemes.. A set of features h e u r i s t i c a l l y selected i s assumed a v a i l a b l e . The 2 a v a i l a b l e information for the abstraction part i s a set of patterns whose true " i d e n t i t i e s " are known, the so c a l l e d design or t r a i n i n g set. For the g e n e r a l i z a t i o n portion of the problem there i s another set of patterns with unknown i d e n t i t i e s , the so c a l l e d t e s t i n g set. Depending on how the features from a pattern sample are processed, the decision r u l e can be of two types — nonsequential and•sequential. In nonsequential decision schemes a set of features i s processed a l l at one time and the pattern sample of unknown class i s then c l a s s i f i e d . The number of features to be used i s known i n advance. In sequential d e c i s i o n schemes the features are observed at d i f f e r e n t stages of the c l a s s i f i c a t i o n process rather than a l l at one time. At every stage an a d d i t i o n a l feature i s observed only i f the previously observed feature components are not enough ( i n some sense) to c l a s s i f y the pattern with s u f f i c i e n t confidence. Thus, in a sequential decision scheme j u s t enough features are observed to enable a pattern sample to be " i d e n t i f i e d " with desired confidence. Cost i s always associated with the features i n pattern recogni-t i o n problems. I f t h i s cost i s considerable due to elaborate equipment, excessive time, involved computation, or r i s k y or dangerous operation ( i n biomedical and i n d u s t r i a l applications), one has to resort to the sequential decision scheme, since a trade-off between the system e r r o r ( m i s c l a s s i f i c a t i o n ) and the number of features required f o r c l a s s i f i c a t i o n i s always po s s i b l e i n sequential decision schemes. In some recognition problems (detection of radar s i g n a l s , communication networks) the features are a v a i l a b l e only sequentially, i n which case a sequential scheme may be p a r t i c u l a r l y appropriate. The human perception of patterns i n a person's surroundings i s considered to be based on sequential process. We look at a pattern, and based on what information we have,we look f o r another "cue" and t h i s process continues u n t i l we can i d e n t i f y the pattern with s u f f i c i e n t accuracy. A sequential decision scheme seems to be a na t u r a l way to c l a s s i f y patterns. In sequential decision schemes the optimality cannot be reckoned only i n terms of the p r o b a b i l i t y of m i s c l a s s i f i c a t i o n , as i n the nonsequential case. One has to consider the cost of feature observation as w e l l . The g e n e r a l i z a t i o n a b i l i t y i s , t h e r e f o r e , stated i n terms of the t o t a l expected cost of c l a s s i f i c a t i o n which includes the cost of feature observations and the cost (or loss) of terminal d e c i s i o n - e r r o r ( m i s c l a s s i f i c a t i o n ) . If some r e l a t i v e costs are associated with each type of mis-c l a s s i f i c a t i o n , then an optimum decision scheme i s obviously the one which minimizes the t o t a l expected cost of c l a s s i f i c a t i o n . I f , f o r example, the costs of each type of m i s c l a s s i f i c a t i o n are equal, the optimum de c i s i o n scheme minimizes the p r o b a b i l i t y of error . Such a de-c i s i o n scheme minimizing the cost of c l a s s i f i c a t i o n . i s a r e a l i z a t i o n of Bayes decision r u l e , and w i l l be defined the optimum i n the sequel. 1.2 Previous Research Sequential decision theory appears i n many d i f f e r e n t forms i n a v a r i e t y of d i s c i p l i n e s . Even within the areas of pattern recognition and s i g n a l detection there i s ample a p p l i c a t i o n of the sequential schemes i n feature s e l e c t i o n , parameter estimation, adaptive learning and c l a s s i -f i c a t i o n . This present o u t l i n e of previous research i s r e s t r i c t e d to what seems most relevant and e s s e n t i a l to our present i n v e s t i g a t i o n i n v o l v i n g pattern recognition. Most previous work i n sequential decision theory i s based on Wald's sequential p r o b a b i l i t y r a t i o test (SPRT) [2] and on Blackwell and Girshick's work on the theory of games and s t a t i s t i c a l decisions [3]. A p p l i c a t i o n of SPRT has been l i m i t e d to the binary hypotheses case. With t h i s r e s t r i c t i o n the d e c i s i o n r u l e consists of determining two stopping boundaries p a r t i t i o n i n g the feature space i n t o three mutually exclusive regions; the region of acceptance of the f i r s t ( n u l l ) hypothesis, region of acceptance of second (alternate) hypothesis and the region of i n d e c i -sion. The process of feature observation i s continued as long as the pattern point is i n the region of i n d e c i s i o n . I t has been shown [4] that Wald's SPRT i s optimum f o r two hypothesized pattern classes. In other words there i s no other procedure with at l e a s t as low an e r r o r probab-i l i t y and with a smaller average number of features required for the terminal d e c i s i o n than SPRT. Much work based on both SPRT and deferred decision theory has been c a r r i e d out i n the area of s i g n a l detection. Bussgang and Middelton 15] applied Wald's [2] sequential analysis to the detection of s i g n a l i n noise. B i r d s h e l l and Robert [6] made an extensive study of s i g n a l d e t e c t a b i l i t y and sequential d e c i s i o n theory under the constraint- of bounded observation time. The composite hypotheses case and design and evaluation of optimum receiver for signals of unknown amplitude were considered by Roberts [7]. Fu [8] applied Wald's SPRT to pattern recognition. Chen and Fu [9]-[ll] used the sequential d e c i s i o n approach to both pattern recognition and machine learning. Recognition of handprinted English characters using SPRT was also examined by Chen and Fu [ 9 ] - [ l l ] . A modified SPRT using time-varying stopping boundaries proposed by Anderson, Bussgang, 5 and Marcus [12], [13], to have a f i n i t e sequential decision scheme, was s u c c e s s f u l l y applied to the recognition of patterns by Fu and Chien [14]. The SPRT has also been considered for the recognition of shapes using an i n t e g r a l geometry approach by Wong and Steppe [15]. Attempts have been made to generalize Wald's SPRT to m u l t i -category problems. Based on the theory of SPRT, Reed [16] formulated the GSPRT (generalized SPRT) for m u l t i c l a s s recognition problems. The GSPRT and i t s modified versions have been inve s t i g a t e d by Fu, Chien and Chen [9], [14]. An optimum sequential test for multi-hypothesis case has also.been proposed by Robert and M u l l i s [17], which i s based on the property c a l l e d subspace s e p a r a b i l i t y . Recently, Palmer [18] derived a multi-hypothesis sequential t e s t from the measure associated with the SPRT. The resultant stopping rules are simpler than those of the former mu l t i c l a s s sequential test procedures. A g l i n t s e v and Ter-Saakov [19] have also proposed a method f o r d i s t i n g u i s h i n g multiple hypotheses, using sequential a n a l y s i s . The optimum (Bayes) m u l t i c l a s s sequential decision scheme i s e s s e n t i a l l y a backward procedure [3] and i t minimizes the expected cost of a terminal decision. The use of dynamic programming as a f e a s i b l e computational technique for the c l a s s of optimal, f i n i t e sequential decision schemes was f i r s t proposed by Bellman, Kalaba,and Middleton [20], [21] . I t s a p p l i c a t i o n i n the area of pattern recognition i s mainly due to Fu, C a r d i l l o and Chien [11], [22]-[25]. These authors used computer-simulated character recognition experiments to demonstrate the advantage of the sequential decision schemes over the nonsequential ones. Both on-line ordering of features and o f f - l i n e s e l e c t i o n of feature subsets using the sequential d e c i s i o n schemes were also considered by Fu, Chien, C a r d i l l o , and N i k o l i c [25], [27]. Lasker [28] considered a sequential d e c i s i o n procedure which i s h e u r i s t i c i n nature and a suboptimal version of the proposed scheme of Fu et a l . [11], [22]-[25]. 1.3 Scope of the Thesis S t a t i s t i c a l dependence of some form among the pattern classes and feature vectors of a sequence of pattern samples i s frequently observed i n recognition problems and has previously been considered i n designing recognition schemes [29]— [41]. The consideration of such dependence, however, i s prevalent only i n the area of nonsequential c l a s s i f i c a t i o n process and p r a c t i c a l l y nonexistent i n the area of sequential c l a s s i f i c a t i o n process. A sequential test based on Wald's theory of SPRT [2] and Bayes compound decision theory [42], [43] i s proposed i n Chapter I I . The optimum compound sequential p r o b a b i l i t y r a t i o test (OCSPRT) thus developed considers memory both i n the source states and the observation medium. Formulation of the d e c i s i o n function i n general form and i t s s o l u t i o n i n a recursive manner are presented i n Chapter I I . For the purpose of d e t a i l analysis of the OCSPRT, a two-class problem with f i r s t order Markov dependence among the pattern classes i s considered i n Chapter I I I . Also, included i n Chapter III are the r e s u l t s and discussions of computer -simulated recognition experiments. In most s i t u a t i o n s a Bayes sequential decision scheme (based on dynamic programming) which minimizes the t o t a l expected cost of terminal decisions are impracticable from the points of view of computation and storage requirement. However, a suboptimal decision algorithm based on one-state ahead truncation approximation i s easy to implement and can be quite a t t r a c t i v e as a f i n i t e sequential decision scheme, e s p e c i a l l y for m u l t i c l a s s problems. Various sequential schemes based on one-state ahead approximation for the purpose of both c l a s s i f i c a t i o n and feature ordering are formulated i n t h i s t h e s i s . Recognition of handprinted E n g l i s h texts as a p a r t i c u l a r example of a pattern recognition problem was simulated on a d i g i t a l computer i n order to evaluate the sequential schemes. The d e s c r i p t i o n of the data set, the feature e x t r a c t i o n scheme, the preprocessing operation, and the experimental methodology are included i n Chapter IV 144]. Formulation of the suboptimal sequential d e c i s i o n scheme based on the one-state ahead truncation approximation, f o r r e c o g n i t i o n problems with dependence among the pattern classes as w e l l as the feature vectors i s included i n Chapter V. In order to assess the s u i t a b i l i t y of the suboptimal sequential d e c i s i o n scheme over the optimal nonsequential one, simple ( s t a t i s t i c a l l y independent pattern classes) sequential and non-sequential schemes were simulated on the d i g i t a l computer. The r e s u l t s are presented i n Chapter V. The performance of compound sequential d e c i s i o n scheme with f i r s t order Markov dependent pattern classes are also discussed i n Chapter V. Chapter VI extends the d e c i s i o n scheme developed i n Chapter V by allowing required a d d i t i o n a l features to be observed e i t h e r on the pattern sample to be c l a s s i f i e d or on any one of the neighbouring patterns. Thus, i n the presence of dependence among the pattern classes, only the best features from each pattern sample are observed with the r e s u l t that observation process terminates more quickly. For the purpose of a n a l y t i c a l evaluation of t h i s d ecision scheme a two-class, f i r s t order Markov problem i s considered. The derived r e s u l t s and t h e i r implications are presented i n Chapter VII. Also shown i n Chapter VII are the r e s u l t s of experimental simulation of three s p e c i a l cases of the proposed m u l t i c l a s s sequential d e c i s i o n schemes based on simple but r e a l i s t i c assumptions. In sequential decision schemes the order i n which the a v a i l a b l e features are observed i s usually predetermined. However, i t i s p o s s i b l e to formulate a sequential decision scheme which would always observe a feature which, along with the already observed features, would provide the maximum amount of a d d i t i o n a l information. The suboptimal sequential d e c i s i o n scheme with such on-line ordering of features i s considered i n Chapter VIII. I t s a d m i s s i b i l i t y from the points of view of recognition performance and computational complexity i s c a r e f u l l y assessed. The s u i t a b i l i t y of the suboptimal sequential decision scheme with on-line ordering of features as a feature evaluatuion and ordering c r i t e r i o n i s also discussed i n Chapter VIII. A modified on-line sequential (MOLS) decision scheme based on l i m i t e d length of search i s proposed as a com-promise between the a d d i t i o n a l computational complexity and the improvement i n the recognition performance r e s u l t i n g from the on-line ordering of features. The advantage of incorporating such l i m i t e d length of search over a v a i l a b l e features i n t o the sequential d e c i s i o n schemes using a set of perordered features i s also examined i n Chapter VIII. F i n a l l y , a few concluding remarks and a b r i e f i n d i c a t i o n of possible future research are presented i n Chapter IX. 9 CHAPTER II OPTIMUM COMPOUND SEQUENTIAL PROBABILITY RATIO TEST FOR DEPENDENT HYPOTHESIS PROBLEMS 2.1 Introduction In t h i s chapter the theory of Wald's SPRT [2] and optimum (Bayes) compound decision theory [34], [42], [43] are combined to develop the optimum compound sequential p r o b a b i l i t y r a t i o test (OCSPRT) for recognition problems with memory i n the source as w e l l as the observation medium. The process of feature observation i s continued as long as the information from the pattern to be c l a s s i f i e d and the f i n i t e number of previous patterns are not enough f o r a f i n a l d e c i s i o n with desired c o n f i -dence. A binary hypothesis case i s f i r s t considered f o r the formulation of the OCSPRT. The test i s then generalized f o r the case of multih.ypoth.esis problems at the end of the chapter. 2.2 Statement of the Problem -L 1 2 L Let X = (X , X , ...,X) represent a sequence of L feature vectors a v a i l a b l e at instant 1, 2, L, r e s p e c t i v e l y , of the c l a s s i -f i c a t i o n process. Let n, denote the dimensionality of the k th feature k vector; thus 10 k X x„ 1, 2, D, x where x~l, i = 1, 2i ..., D, corresponds to the i t h feature of the j th feature vector.. A maximum of D features are assumed to be a v a i l a b l e from each pattern sample. Corresponding to these feature vectors i s a series of source states or pattern classes f o r each k, k = 1, 2, ..., L. Each input pattern, therefore, l i e s i n the set fi^ = {UK; i = 1, 2, ..., M}. At any instant k of the sequential decision process, a decision state i s defined i n terms of the number of features observed on the pattern under consideration at that i n s t a n t . Thus, f o r a pattern r e q u i r i n g n^ features for a f i n a l d ecision, the sequential decision process has to go through n^ d i s c r e t e decision states. Let d = (d^, d^, d^} be a set of de-k c i s i o n s at instant k, where each d^, i = 1, 2, . . ., M, represents that the hypothesis H_^ : X ~ co^ (the pattern represented by feature vector k X i s i n cla s s OK) i s accepted. Associated with each state of the d e c i -k k k sion process i s an optimum decision 6 , such that 6 = d_^ , i = 1, 2, M i f a terminal decision i s possible at that state of the process or k k 6 = d Q , a nonterminal decision*, i f a d d i t i o n a l features are necessary before a terminal decision i s possi b l e . The recognition problem under consideration i s assumed to have dependency structure of the following forms: 1. Markov dependence of any f i n i t e order m may e x i s t among the pattern classes such that , k, k-1 P(u). to. , k-m. , ,o>A ), k > m ki k-1 1. , P(w.u). , .... in j , k £ i J P m i,j,..,£, p = 1,2,..., M, (2.1-a) where P(« ]•) and P(*) are used to denote the c o n d i t i o n a l and the unconditional p r o b a b i l i t i e s . 2. For any f i n i t e r, the feature vectors may have r th order Markov dependence among themselves, such that at any instant k p ( x k | x k _ 1 , X 1) = ' P ( X k | x k 1 , X k r ) , k > r v-P(X k | x k 1 , X 1 ) , k £ r . (2.1-b) 3. The feature vectors may also be s t a t i s t i c a l l y dependent on the i d e n t i t i e s of neighbouring patterns, and therefore ,ki k k-1. , „,„k. k P(XK|W«;, ^ _ ± ) f P(X k|co k). (2.1-c) S t a t i s t i c a l dependence of these types are not uncommon i n pattern recognition problems and have been considered i n the past i n one form or the other [ 3 2 ] - [ 4 l ] . nonterminal decision d Q i s e i t h e r n u l l or consists of r e j e c t s i n case.of nonsequential d e c i s i o n schemes [45]. 2.3 Derivation of the Decision Function In t h i s s ection the OCSPRT i s developed f o r the binary pattern-class case and i s based on the assumption that the dependencies extend only into the f i n i t e past. In a f i x e d sample s i z e (nonsequential) compound c l a s s i f i c a t i o n scheme a pattern sample at any instant k i s c l a s s i f i e d on the basis of the information a v a i l a b l e from a l l the pattern samples of the sequence. However, when the d e c i s i o n on the k th pattern cannot be delayed for a d d i t i o n a l information from the subsequent patterns, only the f i r s t k patterns of the sequence are u s e f u l f o r the k th d e c i s i o n . In the absence of a l l forms of dependencies i n the recognition system, the com-pound c l a s s i f i c a t i o n scheme reduces to the simple c l a s s i f i c a t i o n scheme, where the information from the pattern under consideration alone i s u s e f u l . I f cost (or loss) can be assigned to each type of m i s c l a s s i -k k f i c a t i o n , that i s , i f £(OK; d_.) i s an element of an M x M l o s s matrix and k k k k denotes the l o s s due to the d e c i s i o n X ~ w. when X ~ w. i s true, where i J k k k k £(OK; d.) £ £(uu; d^) , then a Bayes de c i s i o n r u l e i s one which makes a decision 6 to minimize the average cost of m i s c l a s s i f i c a t i o n . Let R^ be the expected cost or r i s k of making the optimum decision at any i n s t a n t k. Then f M R, = I JL((ok; d k) P ( X k | u K ) P(u> k)dX k (2.2) K _• i i J i i i = l where the i n t e g r a t i o n i s over the k - f o l d c a r t e s i a n product of the feature space. I f Pj^(j) i s defined as M P v ( j ) = I A(uf; d k) P ( X k | c o J ) P ( A , j=l,2,...M, (2.3) k i = l 1 X then the r i s k R^ i s minimized by minimizing (2.3). Thus the Bayes d e c i s i o n 6 k = d k when p k(H) = Min{p k(j)}, %, j = 1, 2, ..., M, (2.4) 3 13 and the corresponding r i s k involved i s given by R* = Min{ I Uw k; d k) P(X k|oj k) P(w k)} J 1=1 d x k . The decision r u l e defined i n (2.4) i s not unique, since there could be more than one pattern class s a t i s f y i n g (2.4). However, the a l t e r -natives are equivalent since they r e s u l t i n the same r i s k . I f the elements of the loss matrix are such that k k f1' ±f ^j d p = , i,j=l,2,...M (2.5) J ^0, i f i=j the Bayes de c i s i o n r u l e i s equivalent to 6 k = d k i f : P(X k|o) k) P ( W k ) > P(X k|o) k) P(a) k), (2.6) for a l l i = 1, 2, M The d e c i s i o n rule (2.6) for the two-class case reduces to 6 k = d k : i f P(X k; oj k)/P(X k; cok) < T q ^k _ d^: otherwise , where TQ i s the threshold. The optimum nonsequential decision rule, therefore, reduces to the l i k e l i h o o d r a t i o t e s t . Let y be defined as the compound p r o b a b i l i t y r a t i o at any state n^ of the sequential d e c i s i o n process. Then P(X k|o) k)P(o) k) Y n = = ¥ T k k~ ' "k = 1» 2 ' D - ( 2 ' 7 ) n k P(X k|a) k)P(a) k) k In a sequential c l a s s i f i c a t i o n scheme two thresholds ( c a l l e d stopping boundaries) are required, d e f i n i n g a zone i n the feature space co r r e s -ponding to the nonterminal decision d Q . Thus, i f T^ and > ^1 > ^ are the two stopping boundaries, then an optimum OCSPRT consists of forming 14 the compound p r o b a b i l i t y r a t i o y at dec i s i o n state n- and accept n k k . (4 i f V * T 2 k k ( 2 ' 8 ) 6 - < d o " T2 > V > T l k k The two stopping boundaries T^ and T^ can be shown [2] to be re l a t e d to the two types of erro r p r o b a b i l i t i e s , ( t v P e _ I ) a n d (type-II), where e = P(X k - to |x k ~ u) ), k = 1, 2, i , j = 1, 2, ... , M. i j J k k The feature observation process i s continued as long as 6 = d . o Note that the OCSPRT defined i n (2.8) i s n o n f i n i t e i n nature, that i s , the number of features required f o r a terminal decision i s un-bounded. In most recognition -problems only a f i x e d number of features i s a v a i l a b l e such that a terminal decision has to be made at the f i n a l state (n^ = D). One can, of course, define a f i n i t e OCSPRT by incorporating either the truncated scheme of Wald [2] or the time-varying stopping boundaries proposed by Anderson, Bussgang and Marcus [12]-[14]. In the truncated scheme, the dec i s i o n r u l e at the f i n a l state of the process i s modified to accept ' v - T i ' ? K - T 2 i d^, otherwise. In the l a t t e r scheme, the stopping boundaries are modified at every state of the dec i s i o n process such that both T^ and T^ monotonically converge. Thus the observation process i s always interrupted at the end of the desired s t a t e . 15 2.4 Recursive Solution of Y n k Let at any instant k, X k ( t ) = (X k, X k - 1 , x k ~ t + 1 ) be the set of t feature vectors a v a i l a b l e , on the basis of which the k th pattern has to be c l a s s i f i e d . Let W = (Wk; i = 1, 2, ..., ? k} be a set of state vectors, where each element W. of the set W f o r m th I order dependence among the pattern classes w., i s given by w k l l /• i k k-m., . (a) b, . . . , OJ ) , k i m (wK, . . . , u)g) , k < m, b, . . .q, ..., s = 1, ..., M. m lc Thus there are M possible values of i i f k > M and M 'possible values of i i f k < M, with the r e s u l t that •win / M , k $ m v. M , k < m . For any m i l , i t may be shown that the W^'s are f i r s t order Markov i n the sense that p ( w k l w k _ : L , wj) = p (w k |w k x ) k -F i n a l l y , l e t the element W ^ G W be always such that (w , ..., to ), k i m / k Is (co , .. ., w.^ , k < m. The following two r e l a t i o n s , • t h e d e t a i l e d d e r i v a t i o n of which i s presented i n Appendix-A, are assumed to be a v a i l a b l e , -k I f X (t) i s the set of feature vectors a v a i l a b l e at instant k and state n^ of the decision process, and dependence relations. (2.1-a), (2.1-b) and (2.1-c) apply, then f or any r >, m P ( X k | x k " 1 , X k " t + 1 ; Wk) = P ( X k | Y k ^ ; wj) (2.9-a) where 1 Y k ( (X k, X k 1 , ..., X k _ r + 1 ) , k * r k (X k, X k - 1 , ..., X 1 ) , k < r and Let {I.} be the set of indic e s of Wk such that J i ie{I.} i f f u)keWk, i = 1, 2, M. 3 3 i At anv decision state n. , -we have the compound p r o b a b i l i t y r a t i o K , . -Y ., based on the f i n i t e set of t feature vectors P(X k, X k _ 1 , X k~ t + 1;co k) v = —• • , • 'n, D / V k v k - l v k - t + l ks k P(X , X , ..., X ; u^) The above equation can be rewritten as follows: rk „k-l „k'-t+l „k I P(X K, X K " \ X 1 ™ 1 ; WK) ie{I 2> 1 \ ' " '-.I. P(X k, X*"1, X k- t + 1;W k) where the use of (2.9-a) y i e l d s I P ( X k | Y k " 1 ; W k ) P ( X k - 1 ( t - l ) ; W k ) U{I } 1 ' I Y n k ^ l k =k=l k ' ( 2 ' 1 0 ) \ I P(X R|Y k X; W K)P(X R 1 ( t - l ) ; W k ) ue{l } U U Since W's are f i r s t order Markov, and (2.9-b) applies, (2.10) can be rewritten as follows: 17 ? k - l I [P(X k|Y k 1;Wk) I P(X k 1 ; W k - 1 ) P ( X k | w k " 1 ) } i e { I 2 ) 1 j = l ^ 1 J k I [ P C X ^ Y * - 1 ; ^ ) ^ P ( X k ^ jwJj-^PCW^W^ 1)] uell^} v=l Equation (2.11) a f t e r s u i t a b l e rearrangements of terms reduces to I P(X k|Y k- 1;W k) • a k X " I P(X k|Y k- 1;W k) • a k ' ue{I 1} U U where ? k _ 1 a k = P(Wk|Wk h + I p r f l w ^ " 1 ) . B. [ X k _ 1 ( t - l ) ; W ] (2.12) i i ' l , L„ i ' l 1 3=2 J J a n d p ( x k - \ . . . , x ^ 1 ^ " 1 ) 6..[Xk 1(t-l);W]=^ J p , v k - l k-t+1 k-1,' P(X k _ 1,...,X 1;W k 1) k i t P(X k 1,...,X1;Wk~1) , k < t; j=2,3, . . . ,C k 1 3 Id-k-1 k The p r o b a b i l i t y density P(X |Y ;VL) and the state t r a n s i t i o n p r o b a b i l i t y P(W |w ) are a l l assumed known. Therefore the compound p r o b a b i l i t y r a t i o y i s computable from the known qua n t i t i e s only i f n k -k-1 3.[X (t-l);W] i s a v a i l a b l e at every instant k of the d e c i s i o n process. -k-1 It w i l l now be shown that 6 [X (t-l);W] i s r e c u r s i v e l y computable from known information and that the required computation increases only l i n e a r l y with t. The r e s u l t s are stated i n the following theorem-THEOREM 2~1: I f at any state n^ of the sequential d e c i s i o n process X k ( t ) = (X k, X k \ ..., X k t +^) i s the set of t feature vectors able and the dependence r e l a t i o n s stated i n (2.1-a), (2.1-b) and (2.1-c) apply, then f o r any r ^ m k-q-1 k-q-1 P(Wk-q|Wk-q_1)+ J PCW^IW^-^g.fXCt-q-l) , r x k- q( t-q) ; w] = x k " q < — : — ' i l " v^ * 1"" J A i k-q-1 P C W ^ I w ^ - V 5 I P C W ^ I ^ ^ e i X ( r - q - l ) f ( t + l ) , k ^ t for, q > I { 1, and k < t B ±[X k- q(t-q);W] « A k" q r ( t + 1 ) , k > t for q = < k 1 , k < t k-q,-k-q-1 k-q where X k _ q = ' . , I , i - 2, ^ k-q (-k-q-1 k-q PROOF: f o r the sake of s i m p l i c i t y consider k < t; thus P ( X k - q , . X k - q - 1 , . . . f X 1 ; W k - < l ) 3.[X k q(t-q);W] = — i , p ( x k - q , x k- q- 1 >...,x 1 Jw5- q) q = 1,2,...., (t+1); i = 2, £ k " q which can be rewritten i n the following form: k-q-1 k _ q P(X k- q|X k- q- 1,...,X 1;W k- q)^ Xk-q71...,X1;Wk-q,Wk-q-1) 3.[X(t-q) ;W] = — — ^k - q - l ~ ~ ^ ~ p ( x k - q | x k - q - l } ^ ^ m ^l.wk-q) 5 j p r x ^ " 1 , . . . .X^W^.W^" 1) 1=1 The W 's forma f i r s t order Markov chain and (2.9-a) and (2.9-b) apply. Thus k-q p ( x k - q | - k - q - l ; W k - q ) B.[X(t-q);W] = — — ± 1 k-q,-k-q-1 „k-q. p ( x K - q | - K q X ; W K - q ) k-q-1 (2.13) 3=1 P ( X k - q ' 1 , . . . ,X 1;W k- q- 1)P(W k- q IVl^-1) 3 1 I J .k-q-1 ' I P(X k- q- 1,...,X 1;W k- q- 1)P(W k- q| W k- q- 1) £=1 1 1 1 -A f t e r rearrangement of terms (2.13) can be expressed as follows: k-q P(X k- q|Y k- q- 1;W k- q) .6. [X(t-q) ;W] = r r p • p ( x k - q | - k - q - l ; W k - q ) ^k-q-1 k _ 1 'PCW^IW^-1) + I P(W k - q | w k . - q - 1 ) B.[X(t-q-l);W] 1 1 i = 2 1 J J .k-q-1 k-q-1 _P(W k- q|w k- q- 1) + I P(W k- q|w k - q - 1 ) B j l[X(t-q-l);W] , (2.14) _ .k-q 1=2 ^k-q-1 k _ 1 ' P C ^ I ^ " 1 ) + I P(W k- q| W k- q- 1) • B.[X(t-q-l);W]' 1 .1=2 1 3 3 ^k-q-1 k _ 1 _ P ( W k - q | w k - q - 1 ) + I PCW^IwJ"^ 1) • B £[X(t-q-l);W]_ One can s i m i l a r l y obtain (2.14) f or any k 5 t. Due to the _k-q recurrence nature of (3[X(t-q):W], a l l that i s needed to be stored at any any instant k £ t i s a set (A k q ; q = 1, (t-1)} of l i k e l i h o o d r a t i o vectors, where each vector 20 In a r e c o g n i t i o n problem the amount of memory needed to s t o r e the k , - ^ 1 k " • p r o b a b i l i t y d e n s i t i e s P(X |Y ;W_.) i s dependent on the values of parameters m and r and the amount of dependence among the f e a t u r e com-ponents , but not on t . However, the OCSPRT would r e q u i r e a d d i t i o n a l memory to s t o r e the l i k e l i h o o d r a t i o v e c t o r s and the s t a t e t r a n s i t i o n p r o b a b i l i t i e s . For k i t , t h i s amounts to a t o t a l of ( 2 m - l ) • t + 1 _k-q numbers. A l s o , g.[X(t-q);W] has to be computed f o r each q = 1, 2, k-q ( t - l ) and j = 2, 3, , which r e q u i r e s storage of an a d d i t i o n a l ( 2 m - l ) numbers. Therefore, the t o t a l requirement i s ( 2 m + l ) • ( t + l ) + l , and f o r a f i n i t e m, i t incr e a s e s only l i n e a r l y x^ith the in c r e a s e i n the value of t , the length of the a v a i l a b l e set of feat u r e v e c t o r s . 2.5 Generalized CSPRT f o r M u l t i c l a s s Problems The OCSPRT developed i n S e c t i o n 2.3 i s only f o r the b i n a r y hypo-t h e s i s case, where each p a t t e r n c l a s s oj. i s a member of the set Q = ' X CO {co^;i = 1, 2}. I t i s p o s s i b l e however to develop such a t e s t f o r the general case of M hypotheses, s t a r t i n g from the GSPRT (Generalized SPRT) of Reed [16] i n s t e a d of the SPRT of Wald [ 2 ] . In the GSPRT, the g e n e r a l i z e d p r o b a b i l i t y r a t i o f o r p a t t e r n c l a s s oj at s t a t e n^ i s computed as f o l l o w s : P(X k|coJ) G n ( i ) = — , i = 1, 2, . . ., M. k [ n P ( x k | c k ) ] 1 / M j = i 2 This p r o b a b i l i t y r a t i o i s compared with, the stopping boundary of the i th. p a t t e r n c l a s s T_^ and the s e q u e n t i a l d e c i s i o n r u l e i s to r e j e c t th.e k p a t t e r n c l a s s co. from f u r t h e r c o n s i d e r a t i o n ; i n other words, X does not 21 belong to class to_^ i f G ( i ) < T., i = 1, 2 M. n. I k The stopping boundaries are r e l a t e d to the error p r o b a b i l t i e s as follows: 1-e. . T. = 1 M 1/M [ n ( l - e . . ) ] 1 / M J = l 1 J A f t e r the r e j e c t i o n of pattern c l a s s UK, the t o t a l number of classes i s reduced by one. The feature observation process i s continued u n t i l a l l but one of the pattern classes are re j e c t e d . Let the generalized compound p r o b a b i l i t y r a t i o at state n, _k k based on the set X(t) of t feature vectors be k •P(X(t)!cok r n , ( ± ) " l i k 1 1 / M > 1 " 1. 2 . M -k [ n p ( x ( t ) | o ) . ) ] i / M j = i J If the dependence r e l a t i o n s (2.9-a), (2.9-b) and (2.9-c) apply, then the generalized compound p r o b a b i l i t y r a t i o can be expressed as follows: ^ ~ [ n P C x < t > : « O J 1 / M ' p<»i> j = i 3 i f = l t n I P(X(t);W k)] 1 / M P< ui> q £{l.} q By following the same steps as i n (2.10) and (2.11), we obtain I P(X k|Y k- 1;W k) • a k [ n p ( u p ] f = l 1 k N,1/M lei. r ( i ) = \ i [ n I P ( X k | Y k L ; W k ) • j = l qe{I } q k where i s defined i n (2.12) and can be expressed i n terms of $ [•]. Since the ]s' are a v a i l a b l e r e c u r s i v e l y from known information, as proved i n Theorem 2.1, the generalized compound p r o b a b i l i t y r a t i o f o r class ok i s e a s i l y computable at every state of the dec i s i o n process. shown to be optimal f o r M > 2. The designing of the stopping boundaries f o r proper convergence of the r e j e c t i o n c r i t e r i o n i s sometimes d i f f i c u l t . Also the performance of the scheme deteriorates i f some modification of the stopping boundaries i s required to ensure termination of the feature observation process after a desired number of states ( f i n i t e scheme). Therefore, f o r mult i c l a s s recognition problems Bayes sequential d e c i s i o n schemes (Chapter V) seem more s u i t a b l e than those incorporating the GSPRT and the GCSPRT. Unfortunately, the GSPRT (consequently the GCSPRT) has not been CHAPTER I I I COMPOUND SPRT FOR FIRST ORDER MARKOV DEPENDENT HYPOTHESES 3.1 Description of the Model To gain further i n s i g h t i n t o the optimum sequential p r o b a b i l i t y r a t i o test a simpler recognition system i s considered. The pattern classes i n the set = {w., i = 1, 2} are assumed to form a stationary f i r s t order Markov chain. Thus, at any insta n t k P C ^ I ^ 1 . . . , ^ ) = P(o) k|a) k _ 1), i , j = 1, 2 . (3.14) At every instant k, k = 1, 2, a set of t feature vectors i s assumed to be a v a i l a b l e . The k th pattern has to be decided on the basis of t h i s set of feature vectors and the a v a i l a b l e a p r i o r i information. The feature vectors are assumed c o n d i t i o n a l l y independent of the neighbouring feature vectors and pattern i d e n t i t i e s . Thus P(X k|X k _ 1,..., X 1; cok, ok) (3.15) = P(X k|to k) Let P j i = P ^ + 1 | w q ) , q = 1, 2, i , j = 1, 2, (3.16) therefore, the a p r i o r i p r o b a b i l i t y of the pattern classes are given by A block diagram of the recognition system under consideration i n t h i s section i s shown i n Figure 3.1. The compound p r o b a b i l i t y r a t i o y f o r the system under consider-k at i o n now reduces to y = X P12 + P22 _ k - l p[X(t-l);co] _ k - l p l l + P21 ' B ^ t - 1 ) * (3.17) where, P(Xk|cok) p ( x k | w k ) and, ^ P ( x k 1,...,x k t + 1 ; o ) k -1) _ k - l g[X(t-l);w] =< P(X k- 1,...,X k- t + 1;. k- 1) P(X k 1,...,X 1;o J k h P(X k-1 , f or k £ t Y l k - l v ,X ;u)1 ) fo r k < t The process of feature observation continues as long as y i s such that k T„ > v > T, and ah a d d i t i o n a l feature i s a v a i l a b l e from the pattern 2 IL 1 k under consideration. It may be pointed out here that f o r k = t, that i s , at every instant information from the e n t i r e past i s assumed to be a v a i l a b l e , the compound p r o b a b i l i t y r a t i o of (3.17) reduces to P12 + P 2 2 Y n k-1 (3.18) P l l + P21^n k-1 However, source classifier output Figure 3.1 Block diagram of a sequential r e c o g n i t i o n system with two-class Markov dependent source. ro ' T , i f at i n s t a n t (k-1) 6 k 1 = d^" 1 V i a (3-19) ^ T 2, i f at instant (k-1) 6 k _ 1 = d k _ 1 Thus, the use of the e n t i r e past information i n t h i s case i s equivalent to using the past d e c i s i o n alone. The storage and computation involved, for t = k i s thus l e s s than what i s involved when t < k. 3.2 Nature of the Stopping Boundaries Without any loss of generality the Markov chain formed by the pattern classes can be assumed to be homogenous. Thus P i i = P j j " P ' P i j = P j i = ( 1~ p )> ±> J = 1» 2 ' 1 * J and consequently p^ = p 2 - 0.5. Equation (3.17) therefore, reduces to Y = A k • (1-P) + p . P [ X k t ^ l ) ; to]_ K p + (1-p). 3 [ X ( t - l ) ; to] and can be considered to have two separate parts. At any state n^, X summarizes the information of n^ feature components from the k th pattern under consideration and the quantity within the braces summarizes the information of the past ( t - l ) patterns of the sequence. It i s only the value of X that changes from one d e c i s i o n state to another; therefore, the decision r u l e defined i n (2.8) can be expressed as follows: d k , i f A k * T£(k) d k , i f Tj(k) > A k > Tj_(k) d k , i f A k v< Tj(k) where the new set of stopping boundaries i s given by and T 2(k) = T 2 _ k - l p + (l-p)-3[X(t-l) ; a>] _ k - l (1-p) + p.g[X(t-l);to] k-1 T ' f k ^ - T • P + (1-P) * 8tX(t-l);(o] . 1K ' 1 _ k - l (1-p) + p • 3[X(t-l);u] The OCSPRT i s now s i m i l a r to the standard SPRT of Wald, where the decision r u l e i s to accept d 2 , i f i f \ , i f X k 5 T, T 2 > X >T± (3.21) X k S T, The set of stopping boundaries T|(k) and T^(k) i n the case of the OCSPRT however, besides being dependent on the err o r p r o b a b i l i t y e „ , are also dependent on the state t r a n s i t i o n p r o b a b i l i t y p and the past observations. The stopping boundaries are nonlinear i n nature except when p = 0 and p = 1. A set of such stopping boundaries, with e^ 2 = = 0.1 are presented i n Figure 3.2. Regions of increase and decrease i n feature r e -quirement f or a terminal decision r e l a t i v e to a standard SPRT, f o r various amount of information from the past observations, are also, shown Figure 3.2 The stopping boundaries of.OCSPRT, and the regions of saving and increase i n the average number of features. 03 i n Figure 3.2. For p = 0.5, the stopping boundaries are constant and the OCSPRT reduces to the Wald's [2] standard SPRT, which i s only a s p e c i a l case of OCSPRT. 3.3 Expected Number of Features Per Pattern The expected number of features required per pattern f o r a terminal decision using OCSPRT i s derived i n th i s section. The d e r i v a t i o n i s based on the assumption that the e n t i r e past information i s a v a i l a b l e at every instant of the sequential c l a s s i f i c a t i o n process. Without any loss of generality i t may be assumed that = e2l ~ e' w ^ t : ^ the r e s u l t that T 2 = 1/T = T = ( l - e ) / e . Taking logarithm of (3.18) y i e l d s _(1-p) + p y Zn y In \ + in 'k-1 P + (1-p) Y n _, k (3.22) Taking the expected value of (3.22) y i e l d s (1-p) + p • Y n E U n y n ] = E[£n A k] + E[Jtn { p + ( ^ ^ — } ] , k n k - l (3.23) where E[*] i s used to denote the expected value. For s t a t i s t i c a l l y i n -dependent feature components and for a large value of D, i t i s pos s i b l e to show [2], [46] that 'e^(Y), when X k ~ to^ E[ln y ] = ^e„(Y), when X where e.. (y) = (-1) 1 • (l-2e) £ n ( - ~ ) ; i = 1, 2,and E[J£n A K] =• e 9(n) • e 0 ( f ) , i f X* e 1(n) • e 1 ( f ) , i f X ~ w where e^(n), i = 1, 2, denotes the expected number of features required per pattern for a terminal d e c i s i o n when X k . and P(x k|u 9) ki - i ' 2 e.(f) = E[£n — ] , i = 1, 2; j = 1, 2, . . ., D. P ( x K | W l ) The d e r i v a t i o n of the above r e s u l t s appear i n Appendix B. As noted e a r l i e r , use of the e n t i r e past information i n the case of OCSPRT, i s equivalent to using the past d e c i s i o n . Thus, at any instant k (neglecting any excess over the stopping boundaries*) T , i f at i n s t a n t (k-1) 6 k ^ = d k n k-1 W l / T , i f at instant (k-1) 6 k-1 , k - l However, / T , with p r o b a b i l i t y 'k-1 » 1/T, with p r o b a b i l i t y , (1-e), i f X k-1 k-1 i f X i f X k 1 - a). (1-e) , i f X k 1 - co. This i s assumed for the purpose of s i m p l i c i t y of d e r i v a t i o n . I f the expected number of features required per pattern i s large, i . e . y takes n k a long time to reach one of the stopping boundaries, then the increment Ay for each a d d i t i o n a l feature would be small and the assumption i s v a l i d . Let ( 1 " P ) + P \ z = in {——j- r } p + (1-p) y_ V - l k-1 such that when X ~ u^, one can write (1-p) + P Y n V z > ' E ^ { P + (1-p) T k ' 1 } ' n k - l = ( 1 _ e ) £ n {(1-P) + PT,} + £ {(1-P)T + p ( 2 e _ i ) ta {^-P* 1 + P> U JJ Jin l p T + ( 1 _ p ) L S i m i l a r l y , when X^ ^ ~ to^ Let: e (-0 - ( 1 - ^ " r r ( l - p ) T + p, e l U ; _ U * n lpT + (1-p)' e (n. . .) be the expected number of features required k i , 3 y ' • • > * per pattern when the sequence of input patterns k k-1 1 i s such that, X ~ w., X ~ to., X ~ io„ I 3 & and or k i k-1 1 p.(i) denote the p r o b a b i l i t y P(to.|co ,M ) of a sequence of j , • • • , & 1 J * patterns. Equation (3.23) can now be rewritten i n the following form e.(Y) - \ ( \ t i ) . . , t Z > ' e ± ( f ) + e.(z) e.(n. ) = te,(y) - e ( z ) ] / e , ( f ) , (3.24) However, = I e k ( n • £) . P. ( i ) . (3.25) J , . . . , X/ Combining (3.24) and (3-25) y i e l d s e.( Y) - e (z) >.(n) = _ I i 7 ( f ) p ( i ) j , • • , & i j ,..., £ . 2 e (y) - e.(z) j= i e i ( f ) J 1 where p.. i s defined i n (3.16). On s u b s t i t u t i o n and s i m p l i f i c a t i o n , (3.26) f i n a l l y reduces to e ±(n) = {e.( Y) - (-l)*(2x - l H n [ x / ( l - x ) ] } / e . ( f ) , i = 1, 2 where, x = e + p - 2pe. (3.26) (3.27) It i s i n t u i t i v e l y s a t i s f y i n g that the expected number of features required per pattern f o r a terminal decision i s a function of the p r o b a b i l i t y d i s t r i b u t i o n of the feature components, the e r r o r p r o b a b i l i t y e, and the state t r a n s i t i o n p r o b a b i l i t y p. For p = 0.5, that i s , when the source i s memoryless, (3.27) reduces to the expression for the average number of features f o r standard SPRT and i s given by e (y) ei ( n ) = TJf) ' 1 = 1* 2» ( 3 , 2 8 ) where e^(n) denotes the expected number of features required per pattern for standard SPRT, when X ~ co.. I 33 3.4 Comparison of the OCSPRT arid the Standard SPRT It i s now possible to compare the performance of OCSPRT with that of an equally r e l i a b l e standard SPRT. Combining (3.27) and (3.28) y i e l d s n. = • (1 - 2T) £n[(l - T)/x]/e.( Y) (3.29) where n. = {e!(n) - e.(n)}/e!(n). 1 x x i Based on (3.29) we can state the following theorem. THEOREM 3.2 For any 1 £ p £ 0 and 1 > e > 0, the OCSPRT using the e n t i r e past information requires no more features per pattern than the equally r e l i a b l e standard SPRT. The following lemmas need to be proved before proving the Theorem 3.2. Lemma 3.1: The r e a l valued function f ( x , y) = x + y - 2xy (3.30) fo r 1 i x i 0 and 1 5 y £ 0 i s such that 1 £ f(x,y) $ 0. Proof: The proof of the lemma consists of two separate parts: 1. Lower Bound: the i n e q u a l i t y of arithmetic and geometric means for any nonnegative numbers a and b i s given by [47] 1/2 a + b i 2(ab) ' on s u b s t i t u t i o n i n (3.30) we obtain f(x,y) i 2 [ ( x y ) 1 / 2 - xy] . However, for any 1 £ x i 0 and 1 i y 5 0 (xy) >, xy. 34 Thus, f(x,y) £ 0, where eq u a l i t y holds only when x = y. 2. Upper Bound: l e t the new v a r i a b l e x' = (x - 1/2). Thus, f o r 1 £ x £ 0, x' i s such that 1/2 * x' * - 1/2 On s u b s t i t u t i o n i n (2.30) we obtain f ( x f , y) = x ' ( l - 2y) + 1/2, since f o r any r e a l numbers a and b,|a| + |b| $ |a + b|, | f ( x \ y ) | * | x ' ( l - 2y)| . + 1/2. (3.31) I t i s also possible to show that Max |a • b| £ Max |a| • Max |b|. Therefore, (3.31) can be expressed as follows: Max |f ( x ' , y ) | £ Max |x'| • Max | l - 2y| + 1/2. Since, Max |x'| = 1/2 and Max | l - 2y| = 1 we obtain Max | f ( x 1 , y ) | $ 1 or Max |f(x, y ) | £ 1 Q.E.D. Lemma 3.2: For a l l values of 1 5 x $ 0, the r e a l valued function f(x) = (1 - 2x) SLn(~) (3.32) i s always p o s i t i v e . Proof: For any 1 £ x 5 0.5, we have and 0 * (1 - 2x) * -1, 1 >. (^p) * 0. Equ a l i t y sign holds only when x = 1 or x = 0.5. Therefore, f(x) £ 0. For any 0.5 >, x £ 0, We have 1 >, (1 - 2x) > J and i again the equality sign holds only when x = 0.5 or x = 0 Therefore, f(x) i 0 Q.E.D. In order to prove Theorem 3.2, one has to show that 5 0, 1 = 1 , 2. Proof: We have from (3.29) where and (1-2T) „ ,1-T\ 2 e (y) T e 2(Y) = (1 " 2e) Jin T = e + p - 2pe. Since 1 i e £ 0, we have from Lemma 3.2, e 2(y) i 0. Also, the state t r a n s i t i o n p r o b a b i l i t y 1 i p i 0, therefore, from Lemma 3.1 we know x i s bounded by 1 and 0. I t thus d i r e c t l y follows from Lemma 3.2 that n 2 i 0 f o r a l l values of 1 £ p $ 0 and 1 i e 5 0. It can similarly, be shown that £ 0. Q.E.D. In Figure 3.3 i s pl o t t e d f o r various values of p and e. Except when p = 0.5, i n which case the OSPRT reduces to the standard SPRT^ the saving i n the average number of features due to OCSPRT i s always p o s i t i v e and reaches 100 percent when the channel i s completely d e t e r m i n i s i t i c . The higher the desired r e l i a b i l i t y of the recogni t i o n wao 90.0 cu fe! CD a Figure 3.3 Saving i n the average number of features per pattern obtained using OCSPRT r e l a t i v e to SPRT as a function of error p r o b a b i l i t y e and state t r a n s i t i o n p r o b a b i l i t y system, the lower i s the amount of saving, since i n such a s i t u a t i o n both SPRT and OCSPRT require a r e l a t i v e l y large number of features per pattern for terminal decisions. 3.5 Performance Evaluation by Simulation Experiments In order to assess the performance of the OCSPRT, a recognition problem s i m i l a r to the one discussed i n s e c t i o n 3.1 was simulated on an IBM system /360 Model 67 d i g i t a l computer. A sequence of 5000 patterns, with approximately* 2500 of them being a member of each one of the two pattern classes, was generated. The sequence was generated i n such a way that at any instant k, k = 1, 2, the state t r a n s i t i o n p r o b a b i l i t y k+1 k {p(uK |to^), i = 1, 2} = p ± Ap, 0.002 £ Ap £ 0, and consequently p(w^) - pCo^)* Each pattern sample of the sequence was represented by a feature vector of 40 s t a t i s t i c a l l y independent feature components generated i n the computer using a random number generator. The feature components were generated i n such a fashion that P(x^| , i = 1, 2, 40; j = 1, 2, was a univariate normal d i s t r i b u t i o n with mean m_. and 2 variance a . Thus P(x.|w.) = ( 2 T T ) " 1 / 2 a" 1. exp[-(x. - m.) 2/2a 2] . 1 3 l 3 The dimensionality of the generated feature vectors was found to be large enough to be able to avoid the necessity of any form of truncation scheme i n the sequential decision processes f o r the range of threshold values used i n the experiments. Besides the OCSPRT, three other parametric c l a s s i f i e r s were *There were 2504 pattern samples under pattern c l a s s and 2496 pattern samples under pattern class to . simulated to c l a s s i f y the generated sequence of patterns. The c l a s s i -f i e r s were as follows: 1. Simple Nonsequential Bayes C l a s s i f i e r : with s t a t i s t i c a l l y inde-pendent feature components. The pattern classes were also assumed to be s t a t i s t i c a l l y independent. Thus, at any instant k the decision r u l e was , d k , i f P(X k|u) k) * P ( X k | o 3 k ) (3.33) ^ d k , otherwise fo r normally d i s t r i b u t e d feature components the decision r u l e (3.33) was equivalent to D I i = l k D 1 d K , i f V x =j 0.5 D.y 2 . i i where otherwise (m2 - m 1), m2 > m1 and D i s the dimensionality of the feature vector. 2.. Compound Nonsequential Bayes C l a s s i f i e r : with s t a t i s t i c a l l y independent feature components. The pattern classes were assumed to form a f i r s t order stationary homogenous Markov chain with p = 0.8 5. Thus , d K , i f P(X k|u) 2) * P(X k|co 1) 6 k = A 4.1, ^ d 1 , otherwise where 2 P(Xk|co.) = P(X k|o ) . ) I P(X k- 1|o J k- 1)P(co k|co k- 1), 39 and was obtained i t e r a t i v e l y at every instant of the d e c i s i o n process. 3. Standard SPRT: with constant stopping boundaries. The err o r p r o b a b i l i t i e s a n ^ w e r e assumed to be equal such that T 2 = 1/T^ = T. The de c i s i o n r u l e i s presented i n (3.21) and for normally d i s t r i b u t e d features reduces to \ 2 ( d k , i f I x k >, -2- In T + 0.5 n ky S k = < 2 k 2 d , i f — £n T + 0.5 n, u > Y x > In T + 0.5n. y o' y k f; i y k ^ d k , 2 $ - — Sn T + 0. 5n, y - y k k Y k . , i f > x. 1 . , i where T = ( — ) . e In case of OCSPRT the pattern classes were assumed to form a f i r s t order stationary, homogenous Markov chain with p = 0.85. Also the erro r p r o b a b i l i t i e s e^ and e^^ were assumed equal. In the f i r s t set of experiments t = 2 was considered f o r the OCSPRT case,thus information was assumed to be a v a i l a b l e from the immediate past pattern only. Figures 3.4 and 3.5 show the e r r o r p r o b a b i l i t y of the four d i f f e r e n t c l a s s i f i c a t i o n schemes as a function of the average number of features required per pattern. Two sets of curves are presented i n order to compare the c l a s s i f i c a t i o n schemes at various ranges of e r r o r p r o b a b i l i t y . I t i s observed that the OCSPRT f o r a s p e c i f i e d recognition accuracy requires a s i g n i f i c a n t l y fewer number of features (on the average) per pattern than the other three c l a s s i f i c a t i o n schemes. In comparison with an equally r e l i a b l e standard SPRT, savings i n the Figure 3.4 Comparison of OCSPRT, SPRT, optimal simple nonsequential, and optimal compound nonsequential schemes, a = 1.7. 41. average number of features per pattern Figure 3.5 Comparison of OCSRPT, SPRT, optimal simple nonsequential, and optimal compound nonsequential d e c i s i o n schemes, a = 1.0. average number of features as high as 31 percent are observed using the OCSPRT. The OCSPRT requires approximately 60 percent fewer features per pattern than the equally r e l i a b l e compound nonsequential c l a s s i f i c a t i o n scheme. A second set of experiments using the same data set was conducted i n order to observe the e f f e c t of varying t, the length of the set of feature vectors used i n -the -OCSPRT. Four d i f f e r e n t values of t, i n c l u d i n g t = k ( e n t i r e past) were considered. The r e s u l t s are presented i n Figure 3.6. As shown i n the f i g u r e , the performance of the OCSPRT improves with the increase i n value of t. However, the rec o g n i t i o n accuracy increases so l i t t l e beyond t = 2 that t h i s s l i g h t improvement does not j u s t i f y the i n t r o d u c t i o n of a d d i t i o n a l storage and computational complexity. An OCSPRT with a r e l a t i v e l y small value of t would be s a t i s -f actory. We have pointed out before that, using the information from the e n t i r e past (t = k) i n the OCSPRT i s equivalent to using the past decision. For a recognition system with overlapping class c l u s t e r s i n the feature space, the method of using the d e c i s i o n alone tends to propagate e r r o r from one instant to the next. Thus, one would expect the OCSPRT with t = k to perform worse than the OCSPRT with t < k (using past information) and i n f a c t i t does perform worse as we have shown i n Figure 3.6. For a c e r t a i n range of average numbers of features per pattern, the OCSPRT with t = k requires more features than the equally r e l i a b l e OCSPRT with t = 2,3 and 4 and v i c e versa. However, for h i g h l y r e l i a b l e systems, that i s , when the class c l u s t e r s are quite d i s j o i n t i n the feature space, the use of the previous decision would be as e f f i c i e n t as using the past information as far as the expected cost of terminal decisions i s concerned [ 28.0 average number of features per pattern Figure 3.6 E f f e c t of t, the length of the sequence of feature vectors a v a i l a b l e at any instant, on the performance of OCSPRT. CHAPTER IV DATA BASE, PREPROCESSING, FEATURE EXTRACTION, AND SYSTEM EVALUATION 4.1 Description of the Handprinted Character Set In order to assess the performances of the various decision schemes considered i n the rest of the t h e s i s , a p a r t i c u l a r pattern recognition problem was simulated on the IBM systems /360 Model 67 d i g i t a l computer. The pattern classes considered were handprinted, uppercase English alphabetic characters obtained from Munson's m u l t i -author data f i l e [51]. The o r i g i n a l data f i l e was prepared at Stanford Research I n s t i t u t e , and contains the e n t i r e 46 Fortran characters. The characters were c o l l e c t e d from ordinary Fortran coding forms, prepared by 49 d i f f e r e n t authors, with each author cont r i b u t i n g 3 sets of alphabet. Thus there were 147 samples of each Fortran character. Although the authors were i n s t r u c t e d to slash the l e t t e r Z and to put crossbars on the l e t t e r I, we noticed during v i s u a l examination of the data set that several authors f a i l e d to follow these i n s t r u c t i o n s . Some lowercase characters were also encountered i n the data set, however they were not removed from the data set. Each character was a v a i l a b l e i n the form of a 24 x 24 matrix of 0 and 1 corresponding to white and black parts of the character image. There are two main reasons f o r using Munson's data set for the purpose of simulating our recognition experiments. F i r s t l y , t h i s was the only w e l l prepared data set of adequate s i z e that was r e a d i l y 45 a v a i l a b l e during the period when the design of our experiments was i n progress. Secondly, thel.E.E.E. Technical Committee on pattern recognition has organized a subcommittee on Reference Data sets [511 i n the hope of convincing researchers to use standardized data sets i n t h e i r experiments; Munson's multiauthor data f i l e i s one of these reference data sets and recognition r e s u l t s obtained by others have already been published using •this data set [51]-[53-]. Thus one can compare our experimental r e s u l t s with those of others. 4.2 Preprocessing of the Characters P r i o r to the e x t r a c t i o n of the features, each character was subjected to a simple size-normalization preprocessing operation. The purpose of the preprocessing was to reduce the s e n s i t i v i t y of the feature e x t r a c t i o n scheme (described i n Section 4.3) to the v a r i a t i o n i n the s i z e of the character samples. The manner i n which the characters were size-normalized i s described i n the following paragraphs. The width W and height H of each character was f i r s t defined by l o c a t i n g the rectangle defined by the two rows and columns c l o s e s t to the edges of the character array and containing at l e a s t two black points. The r a t i o H/W was then c a l c u l a t e d . I f H/W 3 the o r i g i n a l character was s i z e normalized to 20 x 20 points by adding to, or d e l e t i n g from, rows and/or columns at regular s p a t i a l i n t e r v a l s i n the o r i g i n a l 24 x 24 array. Let NHA and NHD equal the number of rows to be added and deleted, r e s p e c t i v e l y , to make the f i n a l height equal 20. Then NHD = H-20 and NHA = 20-H. Let <x> denote the l a r g e s t integer not . exceeding x. Let i be any p o s i t i v e integer. I f H > 20 then the NHD rows numbered i • <(H + NHD) / (NHD+1)> from the top were deleted from the o r i g i n a l 24 rows. I f H < 20 then NHD. = H/(NHA+lY was obtained and then set equal to the nearest integer to y i e l d NDI. A new row was in s e r t e d immediately below row i • (NDI) i n the o r i g i n a l 24 x 24 array. Each inserted row was made i d e n t i c a l to the row above i t i n the nex^ array. The f i r s t 20 rows were retained as those defining the size-normalized character. A s i m i l a r procedure was used to i n s e r t or delete columns. I f H/W > 3 .then -the original..width W was not a l t e r e d . Instead, the character was centered h o r i z o n t a l l y on a g r i d 20 columns wide. The height was normalized to 20 rows, as explained i n the above paragraph. This procedure prevented characters such as " I " , which may be pri n t e d e i t h e r with or without s e r i f s , from being converted i n t o a v i r t u a l l y a l l - black 20 x 20 array when s e r i f s were absent. Figure 4.1 shows two characters i n t h e i r o r i g i n a l and preprocessed forms. 4.3 Feature E x t r a c t i o n Scheme A good set of features r e s u l t s i n the w e l l d i s c r i m i n a t i o n of the pattern classes and i s , at the same time, r e l a t i v e l y i n s e n s i t i v e to normal v a r i a t i o n s among the patterns of the same c l a s s . The question of s i m p l i c i t y of the feature e x t r a c t i o n algorithm and the dimensionality of the generated feature vector i s of considerable importance. Various feature extraction algorithms, e s p e c i a l l y designed f o r character recognition, based on designer's ingenuity and i n t u i t i o n have been reported from time to time [5 1]-[55], The few outstanding of which are n-tuple, contour t r a c i n g , t o p o l o g i c a l and geometrical d e s c r i p t o r s , c h a r a c t e r i s t i c l o c i , and mask matching. Most of these methods r e s u l t i n a feature vector of r e l a t i v e l y high dimensionality, require extremely complicated operations, or s u f f e r from both of the above disadvantages. *<* MMMMMM M M * * MMMMMMMMMMMMMMM T MMMMM M M MMMMMMMM MMMMMMM MMMMMMMMMMMMM V MMMMMM MMM M M M M M M MMMMMMMMMMMMM MMMMMMM MMM * MMMMMM MMM MMMMM f .4. t- MMMM * MMMMM MMMM V MMMMM -I" MMMMMMMMMMM -r MMMMMMMMMMMMM MMMMMMMMMMMMM **- * MMMMMMMMMMMMMMM MMMMMMMMMMMMM MMMMMMMMMMMMMMM • MMMMMM M =!' * MMMMMMMMMMMMMMM =!' * MMMMM 'f MMMMMMM M MMMMM MM # MMMMMM MMMMMMMMMMMMMM -}* MMMMMM MM * MMMMMMMMMMMMMM -I* MMMMMM MM * MMMMMMMMMMMMMM -i» MMMMMMMMMMMMMMMM * MMMMMMMMMM «o MMMMMMMMMMMMMM MM * * MMMMMMMMMMMMM MMM * * * MMMMMMMMMMMMMMMM •V V *i» *•* 'N 'i» *»- *!» 'I- -r - i * *•* *I> •!* *r 'I5 V *!* *#* •• * 'r- *}• A •J- Ju -J* -v <J> ^ i.t, «.v % v. i» ir 1 f v v i * T J c v ' i " *«* - i - *r •v 'i- -,*> * i - * i * f -1* • * *r 'J* ^ # '!: ^ ^ 'i' ^ *!> ^> '.^ ^ ^ ^ ^ MMMM >•« *<*• MMMMMMM 1^- «jLr %.Lr .1— .1. •' f .V ,J.r ^ *V «-'* »*> Os-J l C ^ ^ ^ v V ^ ^ v ? i- A- -r- '>- t- '<- f -r t» MMMMMMMMMM * MMMMM MMMM MMMMM * - * MMMMMMMMM • MMMM MMMM * * MMMMMMMMMMMMMM *»* MMMM MMM * * MMMMMM MMMMMMM 'r MMMM MMMM * MMMMM MMMMM V »»» MMMMMMMM MMMM *MMMMM MMMM ,o MMM MMMM MMMM •MMMMMMMMMM MMMM V MMM MMMMM MMM •MMMM MMMMM MMMM MMMM MMMM MMMM V •MMMM MMMMMMM MMM MMMM MMMM MMMM T> •MMMM MMMMMM MMMM *v * MMMMMMMMMMMMMM •MMMMMMMMMMMMMMMMMM MMMMMMMMMMMM • MMMMMMMMMMMMMMMM •J^ 1* MMMMMMMMMM * • MMMMMMMMMMMMMM • »V MMMMM * MMMMMM * MMMM * MMMMM ,u -** * MMMM * MMMMM MMMM • MMMMMM * MMMM f • MMMMMM * * MMMM V * MMMMMM •V MMMM * MMMM * "1" MMM .1, .1, » >, ,i# O- .^ «' 'I* 't- ''- *l" 'I* '»* 1" 1" "»"• -»* 1- -t* *l « ' 1- • • *> • »*, 1* *»" »*" »*- -V »•» -1' »•» *** <••. >>r »l, »«, .1* .1* «.» 1- -1" f T *t- -V <1- 1' -i- -i» f ' t - ',- 'i- -i-original s/ze -norm a/t'zed Figure 4.1 The o r i g i n a l and the size-normalized character samnles These schemes are unrealizable i n many p r a c t i c a l s i t u a t i o n s because of the p r o h i b i t i v e system complexity and because of the large amounts of t r a i n i n g data needed f or proper estimation of the required parameters. In this s e c t i o n of the thesis a feature e x t r a c t i o n method i s proposed which, i n s p i t e of i t s s i m p l i c i t y and r e l a t i v e l y low dimensionality, has been found to y i e l d performance comparable with that of other character recognition "schemes. Let G ( i , j ) , i = 1, 2, h ; j = 1, 2, w be a point on a (h x w) dimensional pattern g r i d on which each pattern image i s placed as i l l u s t r a t e d i n Figure 4.2. Each point on the g r i d may be allowed to assume one of q values . depending on the i n t e n s i t y of the image at that point. A mask i s defined to be a g r i d of s i z e (h* x w'), h' £ h,w1 $ w, such that through the mask only a po r t i o n of the pattern g r i d i s observable. A l l possible configurations of the mask within the pattern g r i d are assumed to be v a l i d . The points of the mask are so l a b e l l e d that the lower leftmost point, c a l l e d the o r i -gin, i s subscripted (0, 0). Let a convenient weighting procedure associate weight g ( i , j ) , i = 0, 1, h' - 1; j =0, 1, w' - 1, with each point ( i , j) of the mask. A feature component i s generated by superimposing the mask on the pattern g r i d and scanning only that portion of the pattern g r i d which i s observable through the mask. D i f f e r e n t feature components are generated f o r d i f f e r e n t configurations of the mask on the pattern g r i d . I f one of the configurations i s such that the o r i g i n of the mask i s given by co-ordinate (Jo, k) on the pattern g r i d , then the value of the corresponding feature component i s expressed as follows: Figure 4.2 The (h -x w) pattern matrix and the (h'-x w') mask h ' - l w'-l x ={ y . y g d , j ) . { G[(£ + i>, (k + j>] >. i^O j=0 Even though the dimensionality of the feature vector i s dependent on the s i z e of the pattern g r i d and the mask, as w e l l as on the allowable overlap between neighbouring mask p o s i t i o n s , the actual number of com-ponents i s governed by the number of allowable mask configurations. I f N represents the maximum number of features a v a i l a b l e , then one can max show that N max r (h - h'+l)(w - w' + 1); i f the configurations of the mask are not mutually exclusive, <h/h'> • <w/w'>; otherwise where <x> again denotes the la r g e s t integer not exceeding the value of x. For our experiments the mask i n i t s simplest form was used. The mask used was of s i z e 4 x 4 and the points on i t were a l l equally weighted, such that g ( i , j) =1, i , . j = 0, 1, 3. The pattern g r i d of s i z e 20 x 20 had only two levels of quantization, 0 and 1 depending on whether the p a r t i c u l a r point on the g r i d was white or black. A l l the mutually exclusive configurations of the mask on the pattern g r i d were allowed, thus 25 d i f f e r e n t feature components, each one taking on 17 d i s -crete values, were obtained. The feature e x t r a c t i o n scheme described above i s f a i r l y general and may be u s e f u l f o r various other pattern recognition problems. For character recognition the scheme i s found to have the following d e s i r a b l e pro p e r t i e s : 1. The feature vector i s obtainable from the pattern g r i d by a r e p e t i t i v e procedure i n v o l v i n g only simple l o g i c a l operations; thus the method i s very simple compared to most of the schemes used f o r handprinted characters. 2. The feature vector's dimensionality i s s i g n i f i c a n t l y lower than that of many other schemes and the feature components themselves assume only a r e l a t i v e l y small number of values. 3. D i s c o n t i n u i t i e s i n the character matrix i n c l u d i n g s a l t and pepper noise can be t o l e r a t e d with t h i s feature e x t r a c t i o n s cheme. 4. The scheme i s e s p e c i a l l y s u i t e d f o r studying sequential recog-n i t i o n processes where an a d d i t i o n a l feature i s observed only i f further information regarding the unknown pattern i s nec-essary. This scheme allows the parts of the character to be examined se q u e n t i a l l y so that a new region of the character i s examined only i f those previously examined do not provide enough information f o r a f i n a l d e c i s i o n . The-proposed scheme, l i k e many others, s u f f e r s from the draw-back that i t i s not i n v a r i a n t under s i z e , r o t a t i o n , and thickness trans-formations on the patterns. One may circumvent the d i f f i c u l t i e s through simple preprocessing operations such as size-normalization and shearing of the patterns,and s t i l l maintain the important a t t r i b u t e s of the feature e x t r a c t i o n method. The p o t e n t i a l s e n s i t i v i t y of the scheme to large deviations i n s i z e among the characters of any given class motivated us to use the size-normalization preprocessing. The 25 feature components were s e r i a l l y numbered, s t a r t i n g from the top leftmost configuration of the mask on the pattern g r i d and moving to the r i g h t . Figure 8.3 shows the s e r i a l number of various feature components. This ordering of the feature components i s defined as the n a t u r a l ordering of features. 7 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Figure 4.3 The n a t u r a l ordering of the 25 feature components 4.4 Estimation of P r o b a b i l i t i e s and System Evaluation Procedure The important question that i n v a r i a b l y arises i n the s t a t i s -t i c a l approach to pattern recognition i s : what i s the best way to use a set of f i n i t e number of pattern samples to design a c l a s s i f i c a t i o n scheme and evaluate i t s performance? The r e l i a b i l i t y of the performance estimate i s influenced by the s i z e of both the t e s t i n g and the t r a i n i n g sets of samples. " S u f f i c i e n t c y " of the s i z e of the sets f or r e l i a b l e estimation i s a function of the number of parameters involved i n the r e -cognition system. I t i s usually an arduous job to obtain a " s u f f i c i e n t " s i z e sample due to the l i m i t a t i o n s imposed by cost, d i f f i c u l t y , and computation. The problem e s s e n t i a l l y i s to make "optimum" use of the a v a i l a b l e f i x e d data set to maximize the accuracy of the estimate. The problem has received considerable a t t e n t i o n i n the l a s t several years [44], [48]-[5o], [56 ] > [57 ]• !n most of the e a r l i e r work i n pattern c l a s s i f i c a t i o n , the same set of data was used both f o r the purpose of t e s t i n g and t r a i n i n g the c l a s s i f i e r s . I t i s now w e l l known that t h i s method y i e l d s an over-optimistic estimate of the performance. Results thus obtained are, therefore, of l i m i t e d value. In 1962 Highleyman [48 ] suggested p a r t i t i o n i n g of the data set i n t o two d i s j o i n t sets and using one as the t r a i n i n g set and the other as the t e s t i n g set. This method i s r e f e r r e d to as the H method (Holdout method) i n the pattern recognition l i t e r a t u r e . Highleyman presented a n a l y t i c a l r e s u l t s concerning the optimum p a r t i t i o n i n g to minimize the variance of the estimated error rate. Kanal and Chanrasekaron [50]-however, showed that Highleyman's p a r t i t i o n i n g i s v a l i d only when the t o t a l sample .size .is large. His analysis breaks down.in the small sample case. When the data set i s small compared to the number of parameters that has to be estimated, the H method y i e l d s an overly p e s s i m i s t i c estimate of the system performance. In 1968 Lachenbruch and Mickey [49] made an emhirical comparison of various methods of estimating performances of recognition systems. They observed that one of t h e i r methods, r e f e r r e d to as theU method, y i e l d s the best estimate of performance when the s i z e of the data set i s small r e l a t i v e to the number of parameters involved, and when normality of the underlying p r o b a b i l i t y d i s t r i b u t i o n s i s questionable. For a data set with N samples, the U method consists of conducting N r e c o g n i t i o n e x p e r i -ments by successively t e s t i n g each one of the N samples when the para-meters are estimated with the remaining (N-l) samples. The d r a w -back with t h i s method i s that, unless the data set i s r e l a t i v e l y small, 54 a very large amount of computation i s required. In t h i s t h e s i s , a method e a r l i e r developed by Toussaint and Donaldson [58], [59], combining the important aspects of both the U and the H methods i s used. The R (rotation) method, as we s h a l l c a l l i t , would provide a bet t e r estimate of the performance than the H method but does not require as much computation as the U method. In t h i s method the t o t a l data i s p a r t i t i o n e d i n t o J sets, each c o n s i s t i n g of (N/J) samples. Out of the t o t a l N samples, (N/J) of them are used f o r t e s t i n g purposes, while the t r a i n i n g operation i s performed on the remaining (J-1) • N/J samples. The performance of the system i s estimated i n turn J times, each time using a d i f f e r e n t set f o r t e s t i n g , and the f i n a l estimate of the per-formance i s obtained by averaging the r e s u l t s of each t e s t i n g operation. Both H and U methods can be considered to be the s p e c i a l cases of t h i s R method. In our experiments the values of N and D are 147 and 7 r e s p e c t i v e l y . Set 1 contained alphabets 1 to 21. Set 2 contained alphabet 22 to 42, and so on, with the r e s u l t that each of the seven d i s j o i n t sets contained three alphabets from each of the seven d i f f e r e n t authors. The components of the 25-dimensional feature vector within each cla s s were assumed to be s t a t i s t i c a l l y independent, i n order to minimize the complexity of the problem*. Therefore, f o r each feature component, The features were assumed to be s t a t i s t i c a l l y independent on an ad hoc ba s i s . Unfortunately, the process of checking the v a l i d i t y of such assumption involves laborious computation [60]. Recently t h i s author [60] has proposed a sequential t e s t to determine whether a set of binary valued feature components are f i r s t order Markov or s t a t i s t i c a l l y independent. The test requires only the feature values and therefore can be performed before com-puting the feature l i k e l i h o o d s and the actual c l a s s i f i c a t i o n of the patterns. 17 d i f f e r e n t parameters per pattern c l a s s , each one corresponding to the p r o b a b i l i t y P(x. = j/u> ), i = 1, 2, ...., 25, j = 1, 2, 17, X Jo % = 1, 2, 26, had to be estimated. The Baye's estimate [61], with square error l o s s , was obtained f o r each parameter using the s p e c i f i c set of t r a i n i n g data. Thus P(x. = j / u ) was set equal to (n.. + 1)/ 1 J6 IJ (N + a^) , where n „ denotes the number of samples out of the t o t a l of N pattern samples from class to i n which the i th component of the feature vector takes on the i th value and a. i s the number of values that the i th l feature component can take. It was also necessary to estimate the a p r i o r i s t a t i s t i c s P(co^), and for c l a s s i f i c a t i o n schemes with contextual constraints, the bigram and the trigram s t a t i s t i c s . These s t a t i s t i c s were estimated from an E n g l i s h text [62] containing approximately 340,000 characters. The trigram s t a t i s t i c s of the 27 .characters ( i n c l u d i n g blank) were obtained using a character handling routine developed by Dykama [6.3]. The bigram and the a p r i o r i s t a t i s t i c s were then obtained as follows: 27 , k k-1. v , k k-1 k-2. . . P(.u}.,to. ) = I P(.0).,0). , 0) 1,3=1,2,... ,27, 3 1=1 2 Z and P(u)k) = £ P(cok, OK 1 ) , k = 1, 2, .... 1 3=1 1 J The Bayes estimate was obtained i n each case such that a l l the e n t r i e s of the bigram and trigram tables were nonzero. ' 56 4.5 Preliminary Recognition Experiments In order to assess the usefulness of the feature extraction and the size-normalization preprocessing algorithms, two sets of pre-liminary experiments were conducted. The experimental r e s u l t s also c l e a r l y i n d i c a t e why we advocate the R method f o r s i g n i f i c a n t e s t i -mation of the performance of any recognition system. Experiment 4.1: In t h i s set of experiments, the e n t i r e 3822 characters were recognized using a simple Bayes nonsequential c l a s s i f i e r , f i r s t assuming equiprobable characters (P(w ) = 1/26) and again when the character p r o b a b i l i t i e s equalled those of the English text. In each one •of the J (J=7) t e s t i n g operations, the p r o b a b i l i t y of e r r o r P ( e / u K ) , conditioned on class OK was c a l c u l a t e d and from these the t o t a l e r r o r p r o b a b i l i t y P(e) was obtained as follows: 26 •• P(e) = I P(E/CO.) P(O J . ) . i = l Averaging the r e s u l t s of J experiments y i e l d e d an o v e r a l l average error p r o b a b i l i t y while the standard deviation provided a measure of confidence associated with i t . Table 4.1 shows the er r o r p r o b a b i l i t i e s obtained using each of the seven d i s j o i n t sets f o r t e s t i n g . Also shown are the mean value of the er r o r p r o b a b i l i t y and the corresponding standard deviation. Table 4.2 summarizes r e s u l t s obtained by others who have used Munson's multiauthor data f i l e . Exact comparisons between these r e s u l t s and those obtained using our recognition system are not p o s s i b l e because A P r i o r i Character P r o b a b i l i t i e s E r r o r P r o b a b i l i t i e s Average E r r o r P r o b a b i l i t y P(e) Standard Deviation of P(e) Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Equiprobable Characters 22.8 25.6 30.9 24.5 24.5 33.3 30.9 27.5 3.7 English A P r i o r i P r o b a b i l i t i e s 16.0 23.2 24.4 18.6 17.5 27.0 26.7 21.9 4.1 Table 4.1 Results of preliminary recognition experiments with Munson's multiauthor data f i l e . D e s c r i p t i o n of Testing and T r a i n i n g Samples Type of C l a s s i f i e r Used Feature E x t r a c t i o n . Method P r o b a b i l i t y of Correct Recognition Reject P r o b a b i l i t y E r r o r P r o b a b i l i t y 2340 Samples f o r t r a i n i n g Testing set uns p e c i f i e d Munson [52]* Table lookup Contour Tracing 58 19 23 Alphabet sets 1-96 f o r t r a i n i n g , alphabet sets 97-147 f o r t e s t i n g Munson [51] Adaptive Piecewise L i n e a r Feature-template (PREP. 9-VIEW). 78 - 22 Same as above** Same as above To p o l o g i c a l d e s c r i p t o r s (TOPO) 77 - 23 Alphabets 1-30 for both t r a i n i n g and t e s t i n g [53]** Table lookup C h a r a c t e r i s t i c l o c i 75.9"^ 11.4 12.7 Alphabets 31-60 f o r both t r a i n i n g and t e s t i n g [53]t Same as above Same as above 69.5"^ 15.1 15.4 Alphabets 1-60 for t r a i n i n g and 1, 4, 7, . . ., 58 f o r t e s t i n g f Same as above Same as above -H-73.9 13.0 13.1 *26 upper case l e t t e r s only. **Samples include the 10 numerals, the 26 upper case l e t t e r s , and the 10 Fortran symbols. +25 upper case samples - l e t t e r 0 i s excluded. I++The r e s u l t s have been modified by us to include h a l f of the ambiguities i n [53] as correct r e c o g n i t i o n and h a l f as e r r o r . Table 4.2 Recognition r e s u l t s obtained by others using Munson's multiauthor data f i l e . Ol co 59 of differences i n performance evaluation procedures and p a r t i t i o n i n g of alphabet sets. Comparison of the performance c a p a b i l i t i e s of the various feature extraction schemes i s even more d i f f i c u l t because of differences i n c l a s s i f i c a t i o n procedures. Munson's r e s u l t s [51] were obtained using some of the alphabet sets f o r t r a i n i n g and some or a l l of the remaining ones for t e s t i n g . Knoll's r e s u l t s [53] were obtained by doing most of the t r a i n i n g on alphabet sets 1-30. However, since human t r i a l - a n d - e r r o r was the learning approach used f o r designing the table lookup contents, modifications were introduced a f t e r a pre-liminary recognition run was performed on alphabets 31-60. Recognition runs were then performed on a l l 60 alphabets to give the r e s u l t s shown i n Table 4.2. F u l l 46-character FORTRAN alphabets were used by Munson to obtain recognition accuracies of 77% and 78%*, while the l a s t three sets .of r e s u l t s i n Table 4.2 were obtained using 25-character alphabets which contained only upper case characters, the l e t t e r 0 being excluded. I f one bears i n mind that the f u l l 46-character alphabet would normally be more d i f f i c u l t to recognize than the 26-character upper case alphabet, i t would be n a t u r a l to conclude that Munson's PREP. 9-VIEW and TOPO systems with adaptive piecewise c l a s s i f i e r s would y i e l d somewhat lower er r o r p r o b a b i l i t i e s than our own i f a l l three systems were evaluated using i d e n t i c a l test data and test procedures. Our system would l i k e l y out-perform the contour tracing-table lookup system of Munson [51] and might outperform the table l o o k u p - c h a r a c t e r i s t i c l o c i system of K n o l l [53]. This l a t t e r system was evaluated using some of the same data f o r t r a i n i n g and te s t i n g ; i f d i s j o i n t t r a i n i n g and te s t data were used the * Munson [51] achieved a recognition accuracy of 83 percent combining the c l a s s i f i c a t i o n responses of PREP 9-VIEW and TOPO. recognition accuracies i n l i n e s 4-6 of Table 4.2 would l i k e l y decrease [48] -[50],[59]. One might argue that the recognition accuracies of the table lookup systems are lower than they would be i f r e j e c t c r i t e r i a were not used. Rejects, however, tend to be inherent i n table lookup systems, and avoiding r e j e c t s i s d i f f i c u l t . In conclusion, our feature e x t r a c t i o n scheme i s much simpler than a l l those mentioned i n Table 4.2 while the recognition r e s u l t s compare favourably with those obtained by others who have used Munson's data base. Experiment 4.2: In Experiment 4.2, the 42 samples of each upper case character which appeared to vary most i n s i z e and l i n e thickness were f i r s t eliminated using v i s u a l examination. The remaining characters were used to form 105 alphabets, and the t r a i n i n g and t e s t i n g procedure described e a r l i e r was then used with N = 105 and M = 7 (see Table 4.3). In the f i r s t group of recognition tests an algorithm s i m i l a r to the one described i n Section 4.2 was used to size-normalize the o r i g i n a l 24 x 24 character matrix to 20 wide x 21 high. The normalized matrix was divided i n t o 28 non-overlapping regions 5 wide x 3 high, and a 28 dimensional feature vector was then obtained for use with the simple Bayes c l a s s i f i e r . In the second group of t e s t s , s i z e - n o r m a l i z a t i o n was not used. The feature vector was obtained by the use of 36 non-overlapping 4 x 4 rectangles i n the o r i g i n a l character matrix. In both groups of tests i n Experiment 4.2 the characters' p r o b a b i l i t i e s equalled those of the E n g l i s h text. One concludes, from Table 4.3 that the s i z e normalization pre-processing algorithm described i n section 4.2 i s very a t t r a c t i v e f o r Character Preprocessing Err o r P r o b a b i l i t i e s Average Er r o r P r o b a b i l i t y P(e) Standard Deviation of P(e) Set 1 Set 2 Set 3 Set 4 Set 5 Set 6 Set 7 Size Normalized 20.8 25.8 21.7 22.3 25 iO 38.1 28.0 26.0 5.4 Not Size Normalized 27.8 41.4 26.0 32.9 31:8 39.4 26.4 32.3 5.7 Table 4.3 Recognition r e s u l t s showing the advantage of size - n o r m a l i z a t i o n preprocessing the proposed feature extraction scheme. Table 4.1 also supporting the notion that a combination of h e u r i s t i c and s t a t i s t i c a l procedures should be incorporated i n pattern recognition systems [64], instead of over stretching e i t h e r of them; r e s u l t i n g i n the complicated or c o s t l y scheme. Tables 4.1and 4.3 show c l e a r l y that the recognition r e s u l t s depend quite strongly on both the way i n which the data are p a r t i t i o n e d into t r a i n i n g and t e s t i n g sets, and on the system evaluation procedure. If only one of the seven sets of alphabets had been used f o r t e s t i n g and the other s i x for t r a i n i n g , then the e r r o r p r o b a b i l i t y i n Experiment 4.1 would have varied between 16 and 27 percent depending on the test set. Even the r e l a t i v e comparison of two d i f f e r e n t systems, based on such a r b i t r a r y p a r t i t i o n i n g of the data set may be quite misleading. In Experiment 4.2, the improvement i n system .performance due to the s i z e normalization preprocessing varied between land 16 percent, again depending on the set of alphabets used f o r t e s t i n g . In f a c t , from Table 4i3 one can observe that i t i s possible to a r r i v e at e n t i r e l y contradictory conclusions by basing the r e s u l t s on d i f f e r e n t p a r t i t i o n i n g s . The advantage of using the r o t a t i o n method f o r the estimation of system performance and the reason why the recognition r e s u l t s using standardized data set be accompanied by d e t a i l e d des-c r i p t i o n of how the data set was p a r t i t i o n e d f o r t r a i n i n g and t e s t i n g purposes are c l e a r l y i n d i c a t e d . 4.6 Test Procedures and Organization of Results The performance of various d e c i s i o n schemes was estimated on the basis of the recognition of seven d i f f e r e n t handprinted E n g l i s h texts containing a t o t a l of 3570 characters. The texts were from seven d i f f e r e n t sources and were formed using the d i s j o i n t sets of alphabets put aside f o r the purpose of t e s t i n g . Thus the f i r s t text was formed usin characters from alphabets 1-21, the second text was formed using alphabets 22-43, and so on. Therefore, a t o t a l of 21 samples f o r each character-class (excluding blank, f o r which there was no r e s t r i c t i o n ) were a v a i l a b l e for forming each Text. Unless some character-class had more than 21 occurrences i n a text, each character sample was used only once for each piece of text, otherwise the samples were recycled. Punctuation marks other than the period were either removed or replaced by a period depending on t h e i r place of occurrence i n the text. During the process of c l a s s i -f i c a t i o n , the blanks and periods were assumed to be p e r f e c t l y recognizable Thus, i n c l a s s i f i c a t i o n schemes with contextual c o n s t r a i n t , the i d e n t i t y of a blank was always used i n c l a s s i f y i n g -the subsequent (and the previous sample i f the de c i s i o n was delayed) character sample. I t follows, there-in fore, that at any instant k, the s t r i n g of c o n d i t i o n a l feature vectors X , k-1 1 X , ... does not run back to X , but terminates at the f i r s t blank en-countered. The text materials and t h e i r various s t a t i s t i c s appear i n Appendix C. Let the i th component err o r p r o b a b i l i t y P g ( i ) , of a de c i s i o n scheme represent the r a t i o of the number of character samples m i s c l a s s i -f i e d to the t o t a l number of characters i n the i th, i = 1, 2, 7, text The expected er r o r p r o b a b i l i t y P of the d e c i s i o n scheme, averaged over the seven texts and the corresponding standard deviation a are then given P E = 7 * I P e ( i ) ' a = { 7 * I L P e ( ± ) ] 2 - [P E1 2> (4.1) i = l i = l 64 The i t h component average number of features per pattern (ANFP) N ^ ( i ) , associated with a decision scheme denotes the r a t i o of the t o t a l number of features required i n c l a s s i f y i n g the character samples of the i th text to the t o t a l number of characters i n that text. The expected ANFP, N of the decision scheme, averaged over the seven texts and the r corresponding standard deviation are given by a s i m i l a r set of equations as i n (4.1) . The sequential d e c i s i o n schemes simulated throughout the r e s t of the thesis were assumed to be f i n i t e i n nature, that i s , a maximum of 25 features a v a i l a b l e from a character sample. Thus, at the end of the 25th d e c i s i o n state, i r r e s p e c t i v e of whether enough information was a v a i l a b l e or not, a f i n a l d e c i s i o n was always forced on the character sample*. The number of such forced decisions should be considered as a s i g n i f i c a n t f a c t o r i n .evaluating the performance of f i n i t e sequential decision schemes. In the absence of any upper l i m i t on the number of features a v a i l a b l e , the patterns on which terminal decisions x^ere forced would require observation of more features before the normal termination of the observation process. The r e s u l t i s that the average number of features required per pattern would a c t u a l l y be higher than what i s obtained with a f i n i t e number of features. The error p r o b a b i l i t y would also decrease due to the observation of a d d i t i o n a l features. As an example, l e t us consider two equally r e l i a b l e recognition schemes R^ and R 2« Out of 10 patterns to be c l a s s i f i e d , l e t R^ require 10 features f o r each one of the 5 patterns and a l l the 25 features f o r each * or a feature from the neighbouring patterns i s observed i n case of CSPR schemes (Chapters VI and VII ). 65 one of the remaining 5 patterns. Thus, the ANFP and the amount of forced d e c i s i o n associated with are 17.5 percent and 50.0 percent r e s p e c t i v e l y . However, R^ requires 20 features f o r each one of the 5 patterns and 15 features f o r each one of the remaining 5 patterns. The ANFP and amount of forced decision associated with R^ are,therefore, 17.5 percent and 0.0 percent r e s p e c t i v e l y . Based on only the error p r o b a b i l i t y and the ANFP one would tend to conclude that both R-^ and R^ are of equal e f f e c t i v e n e s s . In f a c t R^ should be considered to be more e f f e c t i v e than R^, since R^ would have observed a d d i t i o n a l features on 50 percent of the t o t a l patterns i f larger dimensional feature vectors had been a v a i l a b l e from the patterns. A more accurate approach i n evaluating the performance of the f i n i t e sequential schemes i s » therefore, to take into account the number of forced decisions i n -conjunction with the e r r o r p r o b a b i l i t y and ANFP. The recognition r e s u l t s of sequential decision schemes throughout the r e s t of the thesis are always accompanied by the component forced decision P^(i) and (or) the expected forced d e c i s i o n P , averaged over a l l the seven texts. r In order to reduce the cost and time of performing a large number of recognition experiments, the r e s u l t s are presented i n the thesis i n a s p e c i a l way. To observe the trend of performance of the various decision schemes, only the f i r s t text with t o t a l of 538 character samples was used for the i n i t i a l evaluation. Thus, P e ( l ) (and PfCl) for sequential schemes) f o r various values of N^(l) are presented i n most cases. To add s i g n i f i c a n c e to the r e s u l t s , tests were then conducted with a l l seven texts using a s p e c i f i c number of features per pattern (or cost per feature i n case of sequential scheme). The expected values P„, P„ and N and the corresponding standard deviations are therefore a v a i l a b l e f o r meaningful comparisons df the schemes. CHAPTER V BAYES SEQUENTIAL DECISION SCHEMES FOR MULTICLASS PATTERN RECOGNITION PROBLEMS 5.1 Introduction A few previous attempts [16]-[19] have been made to formulate sequential decision for m u l t i c l a s s recognition problems. However, the schemes are e i t h e r suboptimal i n nature, e s p e c i a l l y when M > 2 or of l i m i t e d use due to the computational d i f f i c u l t y . The Bayes sequential d e c i s i o n scheme which minimizes the expected cost of terminal decision, i n c l u d i n g the cost of feature obser-vation, i s e s s e n t i a l l y a backward procedure. It i s quite a t t r a c t i v e for m u l t i c l a s s problems. The a p p l i c a t i o n of such optimal sequential d e c i s i o n schemes based on the technique of dynamic programming [20], i n the area of pattern recognition i s due to Fu, C a r d i l l o and Chien [23]-[26]. In order to maintain the true optimality of the d e c i s i o n scheme i t i s necessary to consider the e n t i r e future at every d e c i s i o n state. In other words, i t i s needed to work backward from optimal future to optimal present behaviour and so on back into the past. D i f f i c u l t y i n storage requirement and computation are therefore of major concern i n designing an optimal sequential decision scheme. A recog n i t i o n system, for example, with D s t a t i s t i c a l l y indpendent features per pattern, each one taking on ? D q d i s c r e t e values, would require a storage of 2. ( Q ) numbers for the r i s k 1=1 functions alone. The optimal d e c i s i o n scheme i s therefore, impracticable i n most s i t u a t i o n s . One way to overcome t h i s d i f f i c u l t y i s to devise a suboptimal 67 decision process. A suboptimal procedure, c a l l e d the one-state ahead truncation approximation [65] or limited-depth lookahead mode [66], brings about a considerable reduction i n the storage requirement and s t i l l maintains the important a t t r i b u t e s of sequential schemes. At every decision state the suboptimal procedure assumes the next state to be the f i n a l s t a t e . Thus the r i s k both of continuing and of terminating the feature observation process are computed on the basis of the behaviour of the next state alone. The fac t that the states beyond the immediate next one are dropped from consideration allows the r i s k s to be computed i n a forward manner. I t i s however, necessary to know the " e f f e c t i v e n e s s " of the sequential decision schemes based on the one-state ahead tr u n c t i o n approxi-mation. Unfortunately a general a n a l y t i c a l assessment i s not a v a i l a b l e and no e x p l i c i t experimental assessment e x i t s . C a r d i l l o and Fu's [65] computer simulated experiments to i l l u s t r a t e the effectiveness of the* suboptimal s o l u t i o n involve the on-line ordering of features. Also the experiments were designed to compare the suboptimal s o l u t i o n with the optimal one, which i s of l i m i t e d i n t e r e s t . A comparison between the suboptimal sequential and the optimal nonsequential schemes would be more meaningful, because then one would know whether the saving i n the cost of feature observation provided by the sequential scheme i s s u f f i c i e n t to warrant i t s use i n place of the nonsequential scheme. Moreover, due to the o v e r s i m p l i c i t y of the recognition problem simulated and experimental methodology used [65], the r e s u l t s are of l i m i t e d value. In f a c t , the conclusion based on such experimental procedures can be misleading [44] (also Chapter IV). In t h i s chapter the simple sequential d e c i s i o n scheme based on 68 one-state ahead approximation and the optimal simple nonsequential decision scheme are c a r e f u l l y compared from the points of view of both expected cost of a terminal decision and computational d i f f i c u l t y . The comparison i s based on the computer simulated recognition experiments using the data set described i n Chapter IV. A general compound sequential d e c i s i o n scheme for recognition problems with various forms of dependency i s f o r -mulated. The compound sequential d e c i s i o n scheme with f i r s t order Markov dependent hypotheses i s considered l a t e r and i s simulated on the computer. Its performance i s again compared with that of optimal compound nonsequential and suboptimal simple sequential d e c i s i o n schemes. 5.2 Formulation of the Decision Rule We have at any instant k, k = 1, 2, .., of the sequential -k 1 2 k-1 k decision process a set X = X , X , X , X of feature vectors, on the basis of which the k th pattern sample has to be c l a s s i f i e d . A very general multicategory recognition system i s again considered and i s assumed to s a t i s f y the dependence r e l a t i o n s (2.1-a), (2.1-b) and (2.2-c). Let: f [ X k , X"*"; n^] be the minimum expected r i s k at state n^ of the e n t i r e sequential decision process, having observed k k the sequence of features (x,, ..., x ) on the k th n k k-1 1 pattern and having a v a i l a b l e the set {X ,...,X } of feature vectors from previous (k-1) patterns. k 1 k R [X ,...,X ;n ;6 ] be the r i s k involved i n making an optimal d e c i s i o n s ic 6 at state n^ of the decision process having a v a i l a b l e k k the set {x., x } of features from the k th 1 n k k-1 1 pattern and having the set {X , ..., X } of feature vectors from previous (k-^1) patterns. C be the t o t a l cost of feature observations i n continuing n k the sequential d e c i s i o n process to the n^ th state. c(n^) be the cost associated with the n^ th feature component. The r i s k involved i n making a f i n a l d e c i s i o n 6 at any state n. of the K. sequential d e c i s i o n process amounts to, R g [ x \ X 1; n k ; 6 k ] , n f e = 1, 2, D. I f however, i t i s decided to continue the d e c i s i o n process through the observation of the x k ... th feature component, then the r i s k involved n, +1 i s given by {C + c(n. + 1)} + n, k k f [ ( X k , x k X 1 ; n k + l ] . Q k "k + 1 P(x k |x k)dx k x k ..ex* n k+l' n k + l n k + l where 0 k denotes the feature subspace corresponding to the ( n ^ r l j s t n, +1 k ~k k k feature component and X denotes the set (x x^) of features yet k to be observed at state n, . k Hence by the p r i n c i p l e of optimality, the basic f u n c t i o n a l k 1 equation, governing the i n f i n i t e sequence of expected r i s k f[X , X ; n^]; n = 1, 2, ..., i s [20] (STOP: R [X k, .... X 1; n,; 6 k ] S K. f [ x \ . , X ; n k] = Min< V CONTINUE: {C + c(n,+l)} + k f [ (X , x^ ) ,. . . ,X k n, +1 k (5.1) IC K K. In case of a f i n i t e ( f i x e d number of features) sequential d e c i s i o n process the optimum stopping r u l e can be computed backwards, s t a r t i n g from the known r i s k function of the f i n a l s t ate. At the f i n a l state D, we have f [ X k , X^D] = R g[X k, X 1; D; 6 k ] . I t i s now possible to compute the r i s k f o r each and every state n^ < D, through the f u n c t i o n a l equation 5.1. More s p e c i f i c l y , s t a r t i n g with the k 1 known value of f f X , X'; D], we can write f o r state (D-l) f[X k,...,X 1;(D-l)]=Min ) ^STOP: R s[X k,...,X 1;(D-l); 6 k ] CONTINUE: {C^_^+c(D)}+ f[(X k,x k),...,X 1;D] • P(x k|x k)dx k, xjj eX k. Continuing i n the s i m i l a r fashion we can write f o r the f i r s t state of the decision process (STOP: R g[X k, .. . , X 1 ; l ; 6 k ] f ( X k , ..., X1;!} = Min L CONTINUE: C +c(2) + f [ ( X k , x k ) X 1 ^ ] a k x 2 P ( x k | x k ) d x k , x k £ X k 71 k k 1 where the r i s k f [ ( X , -x.^) , . .., X ; 2] i s a v a i l a b l e from the second state of the decision process. It i s necessary at every state to compute the terminal d e c i s i o n k 1 k r i s k R g[X , X ; n^; S ], n^ = 1, 2, ..., D, employing an optimal k decision r u l e to form 6 . Using compound decision r u l e , the r i s k of deciding k k X ~ UK, at any state n^ i s given by M k " l k) P(X k; A, i = 1, 2, M p ( i ) = l u<v dh k j = l J 1 The Bayes de c i s i o n r u l e i s the one f o r which 6 k = d k i f p (q) = Min{p. (i)} i = 1, 2, ..., M. q K i K The corresponding r i s k involved i n making the decision i s k 1 k R g[X , X ; n j c J o ] - Min{p k(i)} , i = 1, 2, M; 11^=1,2,..., D. Formulation of the optimal sequential d e c i s i o n process being complete, the storage and the computation d i f f i c u l t y associated with the procedure should now be apparent. In the next section a suboptimal sequential decision scheme based on one-state ahead truncation scheme i s presented. The suboptimal procedure i s e a s i l y implementable and requires fa r l e s s storage than the optimal procedure. 5.3 A Suboptimal Decision Rule Based on One-State Ahead Truncation In a suboptimal sequential d e c i s i o n scheme based on one-state ahead truncation approximation, the next p o s s i b l e state i s always assumed to be the f i n a l state of the observation process. Thus, the r i s k of e i t h e r continuing or terminating the feature observation process i s computable using a forward procedure and requires very l i t t l e or no a d d i t i o n a l memory beyond that i s required for the nonsequential d e c i s i o n scheme. Let R [X k, X^ "; n. ] be the minimum expected r i s k at state n. involved C K. K. in continuing the feature observation process; assuming the next state to be a terminal state, k k having H V c t i l a b l e the sequence ^ - ^ T > • • •» ^ " ) of k _ k _ x features from k th pattern and having the set X of (k-1) feature vectors from the past. Based on the one-state ahead truncation approximation i t i s possible to express the r i s k of continuing the observation process to (n^+1) st state as f o l l o w s : R t -X k , • . . y X 1 ; ^ ] = {C + cC-n.+ l ) } + c k n , k ^ M min{£ £(u> k;d k)P(X k,x k . . . X 1 - ; - ^ ) }• i j = l J 1 u k " r l J k X n +1 k P ( x k J , | x k ) d x k , i = 1, 2, M; x k , 1eX k. (5.2) n, +11 n, +1 n*+1 k k £ Using the same mathematical notations as i n Chapter I I , the j o i n t p r o b a b i l i P(X k, x k ,.,, X k \ . . . jX"*";cjk) can be rewritten as n k 3 P(x\ x k X k - \ X 1; o)k) = I P ( X k , x k X k _ 1,..., X^W*). (5.3) n k 1 2 qe{I.} n k q J Since the dependence r e l a t i o n s (2.1-2), (2.1-b) and (2.1-c) are assumed to be s a t i s f i e d and (2.9-a) and (2.9-b) apply, one can express (5.3) as f o l l o w s : P ( X k , x k ...X^ 1,...^ 1;^) = { P(W"|w; x)}, k > l (5.4) k-1 ( I P(X k,x k |Yk_1;Wk) I P ^ " 1 ^ " 1 ) ' q E { i j } n k q £=1 k l t T k - l v n,+l' >•••>" . - j k J q' £ ^ P ( X k , x k , Ja> k)P(w k), k = 1 . n k+r 2 3 k k i-k-1 k The p r o b a b i l i t y density P(X , x^ > ^ ) a n < ^ t n e t r a n s i t i o n prob-ki k-1 k a b i l i t y P(W^ |W^ ) are a l l assumed a v a i l a b l e such that (5.4) i s recur-s i v e l y computable at every instant of the decision process. Thus, the minimum expected r i s k (5.2) of continuing the feature observation process i n a sequential scheme can be computed at every instant of the decision process from a continuously growing set of information without* however, any increase i n the memory requirement above what i s needed f o r making decisions i n nonsequential schemes. The r i s k of terminating the sequential process can s i m i l a r l y be expressed i n terms of the known q u a n t i t i e s . Thus, M -^k—1 R [x\... 1X 1;n k-;6 k] - Min[ I J £ ( . k ; d k ) P ( X k | Y k - 1 ; W k ) •{ I S k 1 qe{I.} j = l J 1 q s-1 3 P(X k _ 1;W k 1).P(Wk|Wk i = 1, 2, M . A sequential decision r u l e based on the l i m i t e d depth look-ahead mode may now be formulated as follows; at state n is. R [X k,. .. .X^n, ] < R [X k,...,X 1;n ; 6 k ] : CONTINUE: the observation C K. S K. i f process, k 1 k 1 k R [X ,...,X ;n ] > R [X ,...,X ;n ;6 ]: STOP: the observation process, C K. S rC n k = 1, 2, ..., D- l . (5.5) The a d d i t i o n a l memory needed to store the costs of the feature components i n suboptimal sequential d e c i s i o n schemes i s usually very small. I t i s therefore possible at every state to compute the r i s k , both of continuing and of terminating the observation process i n a forward manner, using the one-state ahead truncation approximation. 5.4 Compound Sequential Decision Scheme f o r F i r s t Order Markov Dependent Hypotheses •A -simpler recognition system i s considered i n t h i s section, with the following dependence r e l a t i o n s , 1. The pattern classes are only f i r s t order (m=l) Markov such that.(3.14) applies f o r a l l i , j = 1, 2, ..., M. 2. The feature vector at any instant k i s c o n d i t i o n a l l y independent of the neighbouring feature vectors and the pattern i d e n t i t i e s , such that (3.15) applies f o r a l l i , j = 1, 2, M. 3. The feature components within each class are s t a t i s t i c a l l y independent such that n k P(x k, x k . . , x k | W k ) = n P(x k|co k), (5.6) k j = l J 1 k 1^ 2j xj j 1^ 2j iM• Equation (5.4) now reduces to M {B / k k I k u r _ , - k - l k - l N T ) / , ki k-1.-, ( X , X n + l ! a ) j ) { I P ( X ;oJ£ )p(ujlu£, >}» k £ = 1 k > 1 (5.7) P(X k,x k |a>k)P(cok), k = 1 k k i k-1 where P(io.|io ), j , I = 1, 2, .. ., M, c a l l e d the bigram t r a n s i t i o n J p r o b a b i l i t y , i s assumed to be known. The r i s k of continuing the sequential decision process, at any state n^, i s 75 k „1 , r„ . ,• , I r ? -/ k . ,k N„,„k k , kj_ R [X K,...,X ;n. ] = {C + c(n.+l)} + Min{ \ £(a).;d.)P(X;x K L . c k n. k J J i n. +11 j (5.8) V l M I P(a) k|oo k" 1)P(X k- 1;/" 1)} P ( x k ^ | X k) dx k ^ ; i=l,2,...,M; x k u 1 e £ k 3=1 J q ^ n k + l1 ' n k + l ' »"» n k + l c For d i s c r e t e valued feature .components the i n t e g r a t i o n sign i n (5.8) i s replaceable by a summation sign. i k The co n d i t i o n a l p r o b a b i l i t y P ( X N + T J X ) i s obtainable at every k state from the known p r o b a b i l i t y functions of the observed features. In pattern recognition problems the s t a t i s t i c a l independence among feature components e i t h e r implies the class c o n d i t i o n a l independence of (5.6) or independence over a l l pattern classes such that "k P ( x k ...,x k ) = n P(x k) . (5.9) k 1=1 1 Both (5.6) and (5.9) can not i n general be s a t i s f i e d simultaneously [67], [68]. In computing the r i s k i n (5.8), the class c o n d i t i o n a l independence (5.6) i s assumed. Therefore, p r o b a b i l i t y P ( x k . , l x k ) has n T +1 k to be expressed based on the assumption that only (5.6) applies [68]. Thus M P ( x k + 1 | x k ) = I P(x | U k)P(<A,X k), n k=l,2,...,D k j = l k J J 1 k where the a p o s t e r i o r i p r o b a b i l i t y P(u\.|X ) i s given by . . P ( X n J W j ) P ( w i l X n - l " - - ' x i > P(a ) k | x k ) = * , j=l,2,..., M I P ( x k |o)k) • P ( U k | x k x k) 1=1 k k 76 which can be computed i t e r a t i v e l y from the known l i k e l i h o o d s of the feature components. The r i s k of terminating the sequential decision process becomes M M R [X k,...,X 1;n v;6 k] = Min{ £ £ ( w k ; d k ) P ( X k | w k ) • I P C X * " 1 ; / " 1 ) S K 1 i=l 3 q=l q P(tok|cok 1 ) } , i = 1, 2, .... M. j i .q For the loss matrix defined i n (2.5), the optimum terminal d e c i s i o n at state n^ i s to accept 6 k = d k i f P(X k|o) k)P(o) k) = Min{P(X k|o) k)P(w k)}, j = l , 2, M. (5.10) i 1 1 j 3 3 A suboptimal procedure i n compound decision schemes i s to use -the previous decision instead of the information from the past patterns [34]. At insta n t k, we have P(X k;co k) = P(X k|w k) • P ( u k ; X k _ 1 ) . (5.11) k-1 k-1 -k-1 I f 6 = d^ , corresponding to the set X of feature vectors, then using the past decision (5.11) reduces to P(X k; u k) = P(X k|o) k)P(a) k;6 k" 1) 1 1 1 (5.12) = P(X k|a) k)P(co k;ao k" 1), i,q=l,2,...M. The r i s k both of continuing and of terminating the observation process using the past decision alone can now be obtained using (5.12) instead of (5.7) 77 5.5 Simple Sequential Decision Scheme In simple (as opposed to compound) sequential schemes, the pattern classes are assumed to be s t a t i s t i c a l l y independent such that P(w k|to k 1,...,oi^) = P(w k), k = 1, 2, i , j , q = 1,2,. . ,M. Therefore, the information from the neighbouring pattern i s of no use as f a r as the c l a s s i f i c a t i o n of the present pattern i s concerned. Thus, deleting the set of feature vectors from the past patterns i n our notations, one can now write f o r any state n, k M R J X ; n k ; 6 K] = Min{ £ £.(co ; d p P ( X K | S 1 j = l 2 1 o).)P(a).) J J and R c[X k; n k ] = { C + C ( n k + i ) } + k M Min { I £(co k,d k)P(X k,x k lo) k)P ( A } n k J _ 1 k \ +i k P ( x k + 1 | x k ) d x k k = 1,2,... (5.13) The stopping r u l e based on the one-state ahead truncation approximation i s the same as i n (5.5). The optimum terminal d e c i s i o n at any state n k i s to accept 6 k = d k i f P(X k|oj k)P((o k) = Min{P(X k|to k)P(u k)} , j = 1, 2, M. (5.14) The set of r i s k s derived i n (5.13) i s the same as defined i n [11], and [25], where only the simple sequential d e c i s i o n schemes have been considered. 78 5.6 Experiments and Results Several sets of experiments were conducted to compare the performance of optimal nonsequential and suboptimal (one-state ahead truncation) sequential decision schemes. The experiments described below were performed on Munson's data set, as reported i n Chapter IV. Experiment 5.1: The suboptimal simple sequential decision scheme of Section 5.5 and the optimal simple nonsequential schemes were considered i n t h i s set of experiments. The features were a v a i l a b l e i n t h e i r n a t u r a l order and were assigned equal amount of cost i n case of the sequential scheme. Each character's a p r i o r i p r o b a b i l i t y equalled that of English text. Quantities P (1) and P^(l) f o r various values of N^(l) are shown in Figure 5.1. The r e s u l t s of recognition of a l l seven texts appear i n Table 5.1, with the expected values P„, P and N and the corresponding ii r r standard deviations being presented i n Figure 5.1. Experiment 5.2: The same set of d e c i s i o n schemes as i n Experiment 5.1 were considered, except that the features were presented i n three d i f f e r e n t orders. The purpose of t h i s set of experiments i s to compare the per-formance of the decision schemes when the features are preordered according to t h e i r "goodness". The features, i n case of sequential schemes were again assumed equally c o s t l y . Three feature evaluation c r i t e r i a were used to obtain the ordering of the feature components. These are l i s t e d below. 1. Bayes expression f o r err o r p r o b a b i l i t y with E n g l i s h a p r i o r i p r o b a b i l i t i e s (Chapter V I I I ) . 2. On-line ordering technique using suboptimal sequential d e c i s i o n scheme (Chapter VIII). 3. Expected divergence between the class c o n d i t i o n a l d e n s i t i e s . The feature components are assumed s t a t i s t i c a l l y independent within each cl a s s (Chapter V I I I ) . The r e s u l t s are presented i n Figure 5.2. Experiment 5.3: The optimal compound nonsequential and suboptimal compound sequential (Section 5.4) d e c i s i o n schemes were considered. The characters of the texts were assumed to be f i r s t order Markov, such that only bigram s t a t i s t i c s were used and the dependency i n pattern classes extended only to the f i n i t e past. In sequential schemes a l l the features were assigned equal amount of cost. Figure 5.3 shows the quan t i t i e s P £ ( l ) , P^(l) f o r various values of N ^ ( l ) . The r e s u l t s of recognition of a l l seven texts appear i n Table 5.2 with the expected values P , P and Nr . i n c l u d i n g the corresponding standard deviations, being presented i n Figure 5.3. 5.7 Discussion of the Experimental Results The. r e s u l t s compiled i n Figure 5.1, Figure 5.2 and Table 5.1 c l e a r l y i n d i c a t e that the simple sequential decision scheme based on the one state-ahead truncation approximation performs s i g n i f i c a n t l y better than the optimal nonsequential d e c i s i o n scheme. We no t i c e from Figure 5.1 that at 20 percent er r o r rate, the suboptimal sequential d e c i s i o n scheme requires approximately 15.4 percent fewer features (on the average) per pattern sample than the optimal nonsequential d e c i s i o n scheme. This may mean a considerable saving i n cost and time of feature observation 40.01 a. ,£ 30.0 C L Q J 20,0 • a o -a o a o QJ 70.0 0.0 10 A *' error prob. fence d simple nonseqn. 0 0 — simple seqn. ± A A W.O 77 72 7J 74 /5 75 77 7d9 75 2 0 average number of features per pattern [ N^(1),NF] 27 0.0 Figure 5.1 Comparison of simple suboptimal sequential and optimal non-sequential d e c i s i o n schemes. TYPE OF DECISION SCHEME Text Set of Alphabets Used f or Nonsequential j Sequential Avg. No. of Features Pattern Er r o r Prob. (i n percent) Avg. No. of Features Pattern E r r o r Prob. ( i n percent) Forced Dec. ( i n percent) No. Size Training Testing 1 538 22-147 1-21 13.8 31.8 13.5 25.0 5.7 2 420 1-21 43-147 22-42 . 13.6 38.3 12.8 36.1 6.9 3 579 1-42 64-147 43-63 13.3 33.2 12.6 28.6 5.9 4 516 1-63 85-147 64-84 13.1 28.1 12.6 24.7 6.0 5 467 1-84 106-147 85-105 13.5 30.6 12.9 26.1 5.5 6 557 1-105 127-147 106-126 13.1 37.9 12.5 33.9 5.7 7 486 1-126 127-143 13.1 26.1 12.7 24.3 3.9 Expected Values 13.4 32.3 12.8 28.4 5.7 Standard Deviations 0.274 4.25 0.31 4.41 0.83 Table 5.1 Recognition r e s u l t s of a l l seven texts using suboptimal simple sequential and optimal simple nonsequential d e c i s i o n schemes. oo r—1 OA \30.0 £ ^ CL .C 20.0! 0 a> ID .O (7) £?ayes error probability On-line seqn. procedure Q) Expected divergence nonseqn. seqn. fore, dec. 10,0 10 // 12 13 14 15 16 17 average number of features per pattern, - 18 Nfd), A/ 19 20 21 ao F. Figure 5.2 Comparison of simple suboptimal sequential and optimal nonsequential d e c i s i o n schemes with preordered sets of features. 00 to C to n 3 o P. 3 T) O P O i-t H-cn O 3 g o c 3 &. O 3 o cn 3 cn CD 13 JO C fD 3 fD cn fD H- JO (U C M fD 3 O. rt fD H-O 0> H- t—4 cn -cn H-' 3 cn T J o i—1 3" CD (D 3 o 3 cn fD JO c CD 3 rt H-0) O O 3 T3 O 3 3 a. CO CD .a c ro 3 rt H-P c 3 Cr -> - * N o Cr, c - I ft) CO tD 03 - 1 ro P CD S I -LT" <O f\J CD C J 0 CD error probability, p^0),f^J (in percent) P CD CD CD liA \ * \ \ \ \ o o o o 3 3 o o c c a a CO io to ro S CO Co o ro ro 3 -Q -Q Co 3 3 ro ' -Q 3 f t O • <6> 6 t!> forced UJ .CD CD fin percent) •IN .CD CD Text Set of Alphabets Used f o r TYPE OF DECISION SCHEME Nonsequential Sequential Avg. No. of Features Pattern Error Prob. ( i n percent) Avg. No. of Features Pattern E r r o r Prob. ( i n percent) Forced Dec. ( i n percent) No. Size Training Testing 1 538 22-147 1-21 12.9 24.9 13.8 18.0 10.0 2 420 1-21 43-147 22-42 12,7 36.6 13.7 28.3 12.8 3. 579 1-42 64-147 43-63 12.5 - 30.1 13.2 21.2 10.9 4 516 1-63 85-147 64-84 12.2 23.6 13.2 19.4 10.1 5 467 1-84 106-147 85-105 12.6 27.4 13.4 20.3 11.3 6 557 1-105 127-147 106-126 12.3 35.5 13.6 31.1 13.3 7 486 1-126 127-143 12.3 24.7 . 13.2 18.8 11.4 Expected Values 12.5 30.0 13.4 22.46 11.4 Standing Deviations 0.25 4.9 0.27 4.73 1.84 Table 5.2 Recognition r e s u l t s of a l l seven texts using suboptimal compound sequential and optimal compound nonsequential decision schemes. during the c l a s s i f i c a t i o n of pattern samples. However, the sequential decision scheme, due to the nature of the stopping r u l e , requires more computation than the nonsequential scheme. In order to have an estimate of the computational complexity of the two schemes, Table 5.3 was prepared, which indicates the number of mathematical operations involved i n imple-menting the nonsequential and the sequential decision r u l e s . The t o t a l time required by the c e n t r a l processing unit (CPU) of the IBM system/360 Model 67 d i g i t a l computer i n recognizing the seven texts were also noted. Two d i f f e r e n t programs, one for the sequential and the other f o r the non-sequential decision schemes, written i n Fortran-IV, were used. No attempt was made to optimize the programs. Using a FORTRAN H compiler, the sequential decision scheme, observing 12.8 features on the average per pattern was found to require approximately 650 seconds of CPU time, while the'nonsequential decision scheme using 13.4 features on. the average per pattern needed 165 seconds of CPU time to c l a s s i f y a l l seven texts. In t h i s p a r t i c u l a r case, the amount of computation f o r the sub-optimal sequential scheme i s higher by a f a c t o r of four than the optimal nonsequential scheme. Considering the f a c t that the sequential scheme requires s i g n i f i c a n t l y fewer number of features (on the average) for c l a s s i f i c a t i o n , the scheme may be quite a t t r a c t i v e for pattern recogni-t i o n and s i g n a l detection i n s p i t e of i t s extra computational complexity. I t i s possible that through some computational technique or approximation [25] the amount of computation required f o r the suboptimal sequential decision scheme can f u r t h e r be reduced without much s a c r i f i c e i n the performance, making i t even more a t t r a c t i v e . The advantage of using contextual information i n sequential decision schemes i s evident from Figure 5.3. The compound sequential NONSEQUENTIAL SCHEME To t a l Complexity for D States MATHEMATICAL OPERATIONS m u l t i p l i c a t i o n d i v i s i o n a ddition substraction maximization minimization D.M - - - 1 -SEQUENTIAL SCHEME Complexity Per State Process involved MATHEMATICAL OPERATIONS m u l t i p l i c a t i o n d i v i s i o n addition subtraction maximization minimization r i s k of continuing M + 1 - 3M + 1 1 1 -r i s k of terminating - - M 1 1 -d e c i s i o n - - - - - 1 updating M - 1 - - -TOTAL/STATE 2M + 1 - 4M -:- l 2 2 1 Table 5.3 Comparison of simple sequential and simple nonsequential. decision scheme with bigram information performs s i g n i f i c a n t l y better than the simple sequential and the compound nonsequential schemes. At 17.5%error rate f o r example, the compound sequential decision scheme based on the one-state ahead truncation approximation, requires approxi-mately 15 percent fewer features (on the average) per pattern than the suboptimal simple sequential scheme and approximately 19 percent fewer than the optimal compound nonsequential scheme. The amount of computa-t i o n f o r the compound sequential d e c i s i o n scheme, based on the CPU time. needed f or the c l a s s i f i c a t i o n of seven texts, was found to be higher by a f a c t o r of three than the compound nonsequential decision scheme. Thus, the compound sequential d e c i s i o n scheme, which i s easy to implement and requires no more storage than the compound nonsequential d e c i s i o n scheme,may be quite a t t r a c t i v e f o r recognition problems with memory i n •source and observation medium. CHAPTER VI SUBOPTIMAL COMPOUND SEQUENTIAL PATTERN RECOGNITION SCHEMES FOR DEPENDENT HYPOTHESIS PROBLEMS 6.1 Introduction In many recognition problems, a subset of a v a i l a b l e features may contain very l i t t l e discriminatory information or some pattern samples may be so "noisy" that the a v a i l a b l e features may not s u f f i c i e n t l y characterize the patterns. In such circumstances the sequential d e c i s i o n schemes discussed so f a r i n the thesis always observe some features which provide very l i t t l e or no a d d i t i o n a l information f o r d i s c r i m i n a t i o n but increase the expected cost of a terminal decision. Because the process of feature observation i n a sequential d e c i s i o n scheme i s continued u n t i l s u f f i c i e n t confidence f o r a f i n a l d e c i s i o n i s achieved or the cost of feature observation becomes excessive. In recognition problems with dependent hypotheses, t h i s form of redundancy i n the feature observation can be avoided i f the required a d d i t i o n a l feature i s allowed to be observed on any one of the pattern samples of the sequence. The feature may be observed e i t h e r on the pattern under consideration (one to be decided) as i n the c l a s s i c a l case (compound sequential decision scheme of Chapter V), or any one of the neighbouring patterns. The pattern selected and the feature a c t u a l l y observed on i t i s the one which provides the maximum amount of a d d i t i o n a l information. In t h i s chapter a compound sequential pattern recognition (CSPR) scheme [69] f o r dependent hypothesis problems i s proposed. The scheme, during the process of c l a s s i f i c a t i o n of a pattern sample of a sequence 89 may observe the required a d d i t i o n a l feature on any one of the patterns of the sequence. A f i n a l decision on the pattern under consideration i s drawn on the basis of the t o t a l a v a i l a b l e information. The scheme therefore, observes the most useful features from each pattern sample, r e s u l t i n g i n a termination of the feature observation process e a r l i e r than the equally r e l i a b l e c l a s s i c a l sequential and nonsequential decision schemes. Another obvious merit of the CSPR scheme i s that i t permits the look-ahead mode [33] of decision i n the sequential recognition scheme. That i s , a d d i t i o n a l contextual information from the subsequent patterns may also be a v a i l a b l e f o r the c l a s s i f i c a t i o n of the pattern sample under consideration. Such a sequential d e c i s i o n scheme using the knowledge of contextual dependencies both from the previous and the succeeding patterns of the sequence would be quite e f f e c t i v e f o r dependent hypothesis problems from the point of view of minimizing the expected cost of a terminal d e c i s i o n . The CSPR scheme may be p a r t i c u l a r l y a t t r a c t i v e f o r recognition problems with v a r i a b l e dimensional feature vectors [39], [52], [54]. I t i s possible that even a f t e r observing the e n t i r e set of a v a i l a b l e features from a pattern sample, e s p e c i a l l y i f the set i s of r e l a t i v e l y small dimensionality, the information may not be adequate to make a f i n a l d e c ision with s u f f i c i e n t confidence. I f the pattern classes are dependent, the CSPR scheme may gain the a d d i t i o n a l confidence f o r a f i n a l d e cision through the observation of features from neighbouring patterns. For the same reason, the CSPR scheme w i l l be useful f o r f i n i t e sequential recog-n i t i o n problems where a terminal d e c i s i o n i s forced to be made at the end of the f i n a l state i r r e s p e c t i v e of whether s u f f i c i e n t information i s av a i l a b l e or not. The suboptimal sequential decision rule f o r the CSPR scheme i n i t s general form i s formulated i n Section 6.3. Three s p e c i a l cases based on simple but r e a l i s t i c assumptions are then formulated i n Section 6.4. 6.2 Statement of the Problem The decision function f or the CSPRS i s derived on the basis of the following assumptions: 1. Memory both i n the source and observation medium may e x i s t , such that the dependence r e l a t i o n (2.1-a), (2.1-b) and (2.1-c) are a l l s a t i s f i e d . However, a de c i s i o n can be delayed u n t i l a d d i t i o n a l information from the subsequent patterns i s a v a i l a b l e . Thus, the dependencies extend also to the f i n i t e future. 2. Only one feature may be observed at any decision state. 3. One state-ahead truncation procedure i s used to approximate the optimum backward procedure f o r implementing the decision r u l e . At any instant k, l e t a s t r i n g of t patterns be under observation and l e t r k < _ t i k k - t 2 (X \..., X , X i f k 5 t 2 k + t l k 1 (X , X , X ), i f k < t 2 be the corresponding set of feature vectors, where t^ denotes the extent of dependency i n the future. That i s , the number of patterns looked ahead at every instant of the de c i s i o n process, thus, t 2 = (t - t x + 1) we s h a l l denote t = (t» + 1) . o L The r e q u i r e d a d d i t i o n a l f e a t u r e c a n b e o b s e r v e d on any one o f t h e s e t p a t t e r n s . W h i l e t r y i n g t o d e c i d e on t h e k t h p a t t e r n o f t h e s e q u e n c e , t h e d e c i s i o n s o f t h e p r e v i o u s p a t t e r n s a r e assumed t o b e t e n t a t i v e . The d e c i s i o n s , t h e r e f o r e , may be a l t e r e d i f n e c e s s a r y due t o t h e o b s e r v a t i o n o f a d d i t i o n a l f e a t u r e s . L e t ' ( n k + t 1 , n k + t 1 - r V •••» n k - t 2 } ' k * h ( " k + t j / V - t j - r •••> V •••» n i > » k < h d e n o t e a d e c i s i o n s t a t e o f t h e s e q u e n t i a l p r o c e s s a n d l e t N ^ , i = 1, 2, z ( t ) , w h e r e z ( t ) ( t ) , k * t ( k ) D , k < t be any c o n v e n i e n t way o f s u b s c r i p t i n g t h e d e c i s i o n s t a t e s . L e t n ( N ^ ; = ' { N j ( j ) ; j = k - t 2 , k , k + t 1 } k t be a s e t o f t, d e c i s i o n s t a t e s w h e r e e a c h e l e m e n t N ^ ( j ) d e n o t e s a s t a t e t o w h i c h t r a n s i t i o n f r o m s t a t e i s p o s s i b l e t h r o u g h t h e o b s e r v a t i o n o f t h e n^ t h f e a t u r e on t h e j t h p a t t e r n o f t h e s e q u e n c e . S i n c e o n l y one f e a t u r e i s a l l o w e d t o be o b s e r v e d a t any d e c i s i o n s t a t e , t h e number o f e l e m e n t s i n t h e s e t n ( N ^ ; £ k ) a t a n y i n s t a n t k i s g i v e n b y , t , k * t 2 ft , k ^ ^ k , k < tr t k The set n ( N ^ ; t, ) Is the n u l l set when no further t r a n s i t i o n i s pos s i b l e from state N ! " . I Associated with each de c i s i o n state i s a set 92 f(6 K, 6 K \ 6 K t 2 ) , k * t„ A ° = | l(6 K, 6 K _ 1 k < t 2 of t optimum (Bayes) terminal decisions, where 6^ , j = ( k - t 2 ) , . ••> k, represents the optimum decision associated with the j th pattern. F i n a l l y , l e t : r^[6] be the component r i s k involved i n forming an optimal terminal decision on the j th, j = (k-t„), k pattern of the - j + t l sequence based on the set X of feature vectors, t t R [A :N.] be the minimum expected compound r i s k of forming the set A of optimum decisions at state of the decision process based on the a v a i l a b l e set of pattern vectors. $.[N!"(n )] be the expected component r i s k involved with the j th, J i q j = ( k - t 2 ) , ..., k, pattern of the sequence i n observing an a d d i t i o n a l feature x on the q th, q = ( k - t 2 ) , k, . .., (k+t^), pattern t k R^[ n(j>L ; £ ):x] be the minimum expected compound r i s k of observing the <s k best a d d i t i o n a l feature x from one of the C patterns at state of the de c i s i o n process. C(N^) be the t o t a l cost of feature observation i n continuing the decision process up to the state N^. 6 . 3 Formulation of the Suboptimal Sequential Decision Rule A suboptimal sequential d e c i s i o n r u l e based on one s t a t e -ahead truncation approximation, where the next d e c i s i o n state i s considered to be the f i n a l ' state may be expressed at any instant k, as follows: i f : f Rg[A °:N^ ] > Rc[n(N^:i;k) ;x] : CONTINUE the decision process Rg[A °:N^ ] < Rc[n(NSck);x]: • STOP the decision process (6.1) The problem now is to express the risks in terms of known quanti-ties requiring a minimum amount of computation and storage. At any instant k, the component risk involved i n the decision d 3 i.e. the i i decision that XJ ~ to. is M . . _j+t J.A2<i P C Y X ; i = 1, 2, M . u . • p.(i) = I Z(J;dh P(X J i u x J u=l If dependence relations (2.1-a), (2.1-b) and (2.1-c) apply and the dependencies may extend to the f i n i t e future, then _j+t, . _ J + t i P(X X; or1) = I P(X ±; W3) U ie{I } 1 u = l l P(X 1 Y 1 ; W l) • P(X 1 ;W ±,W3) ie{I } q=l q q 1 u n ie{I } q=l s=l v=l q u j+t -1 j+t--2 j+t -1 . . _. . j+t j+t.,-1 P(X 1 |Y 1 ;W 1 )...P(XJ IYJ ;W3) • P(XJ;WJJ} • {P(W 1|W 1 ) 1 ' s 1 v l q ' s ...P(WJ+1|wi)} (6.2) where P(Xj;wj) = P(Xj|Yj_1;WJ) I P(wi|wj-1) • P(X j~ 1;W 3" 1), (6.3) —X (the superscript of E, i s neglected for the sake of s i m p l i c i t y ) . Thus j + t l i P(X ; or) i s computable r e c u r s i v e l y s t a r t i n g with the j o i n t p r o b a b i l i t y _ k - t 2 - l k - t 2 - l P(X ; W ) a v a i l a b l e at instant ( k - t 2 ~ l ) and the known state t r a n s i t i o n p r o b a b i l i t i e s . The decision d^ on the j th pattern of the sequence i s optimum, that i s , 6 j = d2, i f p. (A) = Min{p.(q); q = 1, 2, ... M} *>* K J q M J and the corresponding r i s k involved i s given by M . . _j+t. . r.[6] = Min{ I I £(w 3;d J)P(X 2 v ie{I } u=l U u v = 1, 2,..., M. (6.4) t Thus, the expected compound r i s k of forming the set A of optimum decisions at state N^ " i s x t k R [ A ° : N J ] = f- I r . [ 6 ] . (6.5) S 1 fco j = ( k - t 2 ) 2 However, at state the r i s k involved with j th pattern of the sequence, i f an a d d i t i o n a l feature i s desired from the q th pattern, may be expressed as follows: *.[N?(n +1] = C(Nfc) + c(n +1) + 3 i q x q M . . Min{ J l(o>2;d2) " 11 \T U q v u=l n +1 q P ( X J + t l , * n + 1 ^ u ) } P(xJ + 1 | x ^ ) d x 2 + 1 ; x q ..e X q. (6.6) n +1 Following our assumptions, (6.6) can be rewritten as follows: r M E».[N.(n +1)] = C(N.) + c(n +1) + J x q x q Min{ I I H^3;d3) v i e { i } u = i u v q u n +1 q P U 3 + \ x Q + 1 ; W J ) } P ( x q + . 1 | x q ) d x q + 1 (6.7) q q q where P(X \ x q +1;W-J) = P(X \...,X J,...[X q,x q +1],...,X1;W^) q q can be expressed i n terms of known quantities as i n (6.2). Since n +1 q (6.7) for any k ~> j ~> (q + t ^ + 1) reduces to the following expression: M j+t . *.[N'Cn +D] = C(N ) + cCn +1) + 1 1 Uu3.[;d3)V(X ;U3) . j i q i q i e { 1 } u = 1 u x. x u Thus, the expected compound r i s k associated with the set of t patterns of observing the a d d i t i o n a l feature x q + ^ at state i s given n q 1 by -T- I $.[N t(n +1)], q = k-t , k, k+t ; N t(n +1)en.CN*; / ) . o j = k-t 2 3 1 q x x q x The minimum expected r i s k of continuing the feature observation process at state N_^ , through the observation of on a d d i t i o n a l feature on any one of the £^ patterns under observation i s then, R h ( N ^ C k ) : x ] = Min{-r- I $.[N t(n +1)]}, (6.8) c 1 1 ^ j=k-t, J 1 q • cj ™ lc } * • } • • • ^ l c~ l~ t^ • I f the feature observation process i s to be continued, then the best feature to be observed at state i s any s o l u t i o n to (6.8). That i s , at state N^, x = x q ... i f x n +1 k . k o j=k-l- J H * o j=k-t (6.9) q, % — lc-~t2> • • • j lc j • • • y k+t^. 6.4 S p e c i a l Cases — CSPRS Type 1, Type 2 and Type 3 The compound sequential pattern recognition scheme proposed i n Section 6.3 i s very general i n form. Therefore, a large v a r i e t y of sequential feature "sampling" and c l a s s i f i c a t i o n schemes may be developed using d i f f e r e n t types of dependence r e l a t i o n s , cost a l l o c a t i o n s and decision functions. In t h i s s e c t i o n , three s p e c i a l cases based on f a i r l y simple but r e a l i s t i c assumptions are formulated. The cases are as follows CSPRS Type 1; In the CSPRS Type 1 system i t i s assumed that the dependencies 97 extend only to the f i n i t e past ( t ^ = 0). The pattern classes are assumed to be f i r s t order Markov with known state t r a n s i t i o n p r o b a b i l i t y . The feature vectors are considered c o n d i t i o n a l l y independent as i n (3.15). Under these circumstances, t = + 1, and (6.4) reduces to M . . . . r.(6) = Min{ Y l(u3:d3) P ( X J ; w J ) } , q = 1, 2, M, 3 q u q u where P ( X 3 ; O J 3 ) can be expressed i n the form shown i n (6.3) and for f i r s t order Markov dependence among the pattern classes reduces to the following form: M P ( X j ; t o j ) = P ( X j L j ) Y P ^ " 1 ; / 1 ) P ( a ) j L j _ 1 ) . ' U 1 U i . V U 1 V v=l The expected compound r i s k R^A1";!^] i s now given by R [ A ^ N J ] = i I r.(6) . (6.10) S 1 C j=k-t+l J S i m i l a r l y , the minimum expected r i s k of continuing the d e c i s i o n process may be expressed as follows: k R t n ( N t : 0 ; x ] = Min{^ • £ $ . [ N t ( n +1)]}, q = k-t+1, k c i q t j = kr t +i J 1 q • 1 k f M = Min{C(N.) + c(n +1) + - Y U M i n ( J l(u3; d 1 ) q 1 q t • i J v i u v i=k-t+l i. u:=l n +1 q P ( X J ; x q P ( x q + 1 U q ) d x q x q + 1 e X q ; q = k-t+1, k. q q q q Besides the memory required for s t o r i n g the p r o b a b i l i t y d i s t r i b u t i o n of the feature components and the bigram s t a t i s t i c s , some a d d i t i o n a l memory i s needed to store the set of l i k e l i h o o d s , { P C X ' ' | u)~?) ; i = 1, 2, M}, i = C k-t+1), (k-1). The set { P ( X k _ t ; w k - t ) ; q = 1, 2, M} of j o i n t p r o b a b i l i t i e s are assumed a v a i l a b l e at every ins t a n t k and can be obtained r e c u r s i v e l y from one instant to the other. Thus, the t o t a l a d d i t i o n a l memory requirement at any instant k i s (t.M) numbers, which grow l i n e a r l y with t, the number of patterns under observation. CSPRS Type 2: The CSPRS Type 2 i s equivalent to the CSPRS Type 1, with the exception that the expected compound r i s k e i t h e r of continuing or of terminating the feature observation process i s equal to the r i s k involved only with the pattern under consideration. Thus, at any instant k an a d d i t i o n a l feature can be observed i f necessary, on any one of the t patterns; however, the r i s k of the k th pattern alone i s assumed to be e f f e c t e d . Therefore, the expected compound r i s k of forming the set A ° of terminal decisions i s given by t R [A °:NT] = r, (6) s L i J k M = Min{ J A(a>k;dk) P(X k;u> k)}. 3 u=l J S i m i l a r l y , the minimum expected r i s k of continuing the d e c i s i o n process at state N!" can be expressed as follows: R c [n (Nj:C k ) : x ] = Min{y Nj(n q+1) ]} , q = k- t+1, k f M = Min{C(N t) + c(n +1) + Min{ J i(w ;d ) 9. q n +1 P ( x \ x q . , ;/ ) } P ( x q , Jx q ) d x q } , n +1 u n + 1 ' n , , q q q+1 X n + 1 E X < 1 ; j = 1 » 2 ' M ; q = k " 1 + l f ' " ' k The scheme i s , the re fo re , s impler i n s t ruc tu re and requi res l e s s computation than the CSPRS Type 1. CSPRS Type 3: In CSPRS Type 3, the d e c i s i o n on a pat te rn may be delayed to have a d d i t i o n a l in format ion from the subsequent pa t te rns . Thus, the dependencies extend to the f i n i t e fu ture as w e l l . I t i s assumed that the pat tern c lasses are f i r s t order Markov and the feature vectors are c o n d i t i o n a l l y independent. Let t^ = 1, therefore t = and (6.4) reduces to M . . r . (S ) = Min{ Y Uar^d- 3) P (X J ;a)J)}, v = 1, 2, . . . , M. 1 v •, u v u J u=l i+1 i The j o i n t p r o b a b i l i t y P(X 5 W U ) 1 S exp ress ib le i n the form shown i n (6.2) and f o r f i r s t order Markov dependence among the pat te rn c lasses reduces to M P ( X J + 1 ; w j ) = P(X J ; co j ) Y P ( X j + 1 | w j + 1 ) • P(co J + 1 ,co : i ) ' u u S q q ' u q=l M ^ where the p r o b a b i l i t y P(X 3;io 3) i s given by (6.3). The expected compound r i s k R [A °:N^] i s the same as i n (6.10). However, at any state n, the S X K r i s k of continuing the decision process i s R c k [n(NSr):x] = Minf-^- I $ . [N^n +1) ] }, 1 q fc0 i = k _ t ^ J 1 q q = k-t„, ..., k, k+1 k r M Min{C ( N b + c(n +1) + — T [J • Min{ Y £ ( w j ; d J ) P ( X j + 1 , x q ± 1 ; J) } q i q t . , L ., U v ^, u v n +1 u o j=k-t+l ft q u=l X n +1 q • q q The advantage with the CSPRS Type 3 i s that i t permits look-ahead mode [33] of decision i n sequential c l a s s i f i e r s . Thus, the a d d i t i o n a l information from the subsequent patterns may r e s u l t i n the e a r l i e r termination of the feature observation process i n sequential decision schemes. 101 CHAPTER VII ANALYTICAL AND EXPERIMENTAL EVALUATION OF THE CSPR SCHEMES 7.1 Introduction In t h i s chapter, the s u i t a b i l i t y and the performance c a p a b i l i t y of the CSPR schemes proposed i n Chapter VI are assessed both a n a l y t i c a l l y and experimentally. For the a n a l y t i c a l part, a recognition problem with binary hypotheses and f i r s t order Markov dependence i s considered. The e f f e c t of observing the a d d i t i o n a l feature e i t h e r on the pattern under consideration or the neighbouring one, on the system error prob-a b i l i t y i s then analyzed. For the experimental part, m u l t i c l a s s problems are considered. The r e s u l t s of simulating the three CSPR schemes d i s -cussed i n Chapter VI are presented. F i n a l l y , the advantage of using the CSPR scheme i n recognition problems with v a r i a b l e dimensional feature vectors i s pointed out. 7.2 A n a l y t i c a l Evaluation Using Two-Class, F i r s t Order Markov Problem The two pattern classes CJ^ and to^ of the recognition problem are assumed to be f i r s t order Markov such that at any insta n t k k+11 k. k+11 k s P(w |w2) = P(to 2 1^) = p. Therefore, PCu^) = P(io 2) = 0.5. k i k Let the features be s t a t i s t i c a l l y independent with ,P(x^|w ), i = 1, 2, D; j =1, 2, a un i v a r i a t e Gaussian d i s t r i b u t i o n function with mean 102 2 u. and variance a.. Thus 3 i P ( x k | W k ) = [<27r)~ 1 / 2 -a. - 1] exp[-(x k - M j) 2/2aJ]. Without any loss of generality i t may be assumed that u„ = u and y = 0. ^ 1 F i n a l l y , l e t the feature vectors be c o n d i t i o n a l l y independent. At any instant k, the k th component conditional r i s k f o r the decision d. 3 t M R = I Jl(to k;d k) P(X k|co k) P(to k)dX k J i = i 1 J i i For the type of loss matrix defined i n (2.5) the r i s k becomes the p r o b a b i l i t y of error and, for k = 2, may be w r i t t e n as follows: 2 v ( n , ;n, ) = 1 - I j P(X k, X k _ 1 1 A P ( / ) d X k d X k _ 1 , (7.1) k k k _ 1 j=l n,(j) 3 2 where 9. (j) denotes the feature subspace corresponding to the decision K. d^ at instant k, and n^ and n j c _ - ^ are the dimensionality of feature vectors k k-1 X and X r e s p e c t i v e l y . The er r o r p r o b a b i l i t y i s minimized i f the decision r u l e i s such that 6k = d k , i f P(X k;k.) = Max{P(X k;co k); i , j = 1, 2} 3 3 i -1-i n which case M j ) = {Xk:P(Xk;(ok) * P(X k;co k), i = 1, 2} . (7.2) K. J 1 Since, 2 P(Xk;u)k) = P(X k|a) k) I P(X k- 1;co k- 1) P ^ l / " 1 ) 3 3 i = 1 i J (7.2) i s given by 103 2 , k - l k-l x„, k, k-1, v . , P ( X k | A I P (X ; ) P (<*>* | wq ^ fi. (j) = {X^.X1'"1: r * : , i , j = 1, 2} , P(XK|w ) r B , v k - 1 k - 1 ^ , k, k - l v 1 i ) P(X ;OJ )P(a). u ) i ^ i s=l s J s (7.3) Following the assumptions of the Markov dependence of the pattern classes, we now can express (7.1) as follows: f , , 2 \ ( 2 ) P(X k|to k) • T P(X k 1 L k " 1 ) - P ( c o k L k " 1 ) d X k . dX k 1 (7.4) 1 1 u . 1 q l q q=l n For s t a t i s t i c a l l y independent feature components within each class we can write J c , k N ™ k 'ki k, P ( X K U K ) = n K P(x. to.) 1 i = l J ' 1 -n /2 k k . 0 . [ ( 2 l T ) k - r . „ r , k , -2, n a X ] • exp[-0.5 I (x K-y.)a7 ] 3=1 3 j = l 1 1 J 2 Let 6^ = (o/a^) , where a i s a constant. Therefore, P(X k|co k) . \ n k . . — = e x P [ - {y 2 I G - 2y £ x* • 6 }/2oZ] P(X k < ) j = l J j = l J 3 Taking logarithm on both sides y i e l d s n, P(Xk|o)k) "k A n { — ¥ T T - } = C - T ) ^ e ( x k - o . 5 y ) P ( X k | o 3 k ) a 2 j = l J J Define, 104 f ( x K n = 2 . 1 nxk-q = l -1 k- k-1 q 2 . 1 p ( x k " s=l -1 k-" 1 ) p ( ^ | k-1 to s Let (1-p) • P(X k 1 | u k X) + p • P(X k 1|a, k X) p • P(X k" 1| W k"" 1) + (1-p) • P(X k" 1|a) k" 1) Equation (7.3) therefore, reduces to n k n k 2 «. (2) = {X k,X k _ 1: I 6 0x k $ 0.5 X 6 .+ — £n f ( X k _ 1 ) } R A=l • q=l q W n k n k 2 v = I ex k-0.5u£ 6 - ^ £n f ( X ^ 1 ) 1=1 * * q=l q P k Since x , Jl = 1, 2, n are Gaussian random v a r i a b l e s , v i s a Gaussian random v a r i a b l e with mean and variance obtained as follows: m = E[v] v \ V 2 •E[ i QA] - ° - 5 y i %-Tin f ( x k _ 1 ) • 4=1 q=l q U k k k I f x ~ w , then the random v a r i a b l e x i s a zero mean process and therefore, n k 2 m = -0.5u I 6 - — In f ^ " " 1 ) , x k ~ u>k v ^ q y ' £ 1 S i m i l a r l y , i t can be shown that a = v E [ v 2 ] - E [ v ] 2 n, 2 V A k 0 L V X£ £=1 * * Equation (7.4) now becomes e k ( n k ; n k ]_) = J [•(l-p>-P(Xk 1 | a ) k X) + p • P-(Xk 1 | a ) k ^JdX 1*" x k' x [ (2TT) 1 / 2 • a~ 1]exp[-0.5(v-m v) 2/a 2 7]dv v>0 (1-p) f P(X k X | u ) k X ) • exp[-0.5(v-m v) 2/a 2]dvdX k 1 v v k - l v>0 2ir a P(X k X|u k X)exp[-O.SCv-n^)2/aJ]dvdXk X. v v k - l v>0 x k-11 At t h i s stage l e t us assume that p << 1 and since 1 $ P(X | u/) i = 1, 2,we can write p(x k- x| u k- x) f(x K x) 1 ^ , - J c - l k-1, P(X to2 ) and, therefore, £n f ( X k X) = 'k-1 -JL J e ^ x f 1 - 0.5u) a i = l Let y = (v-m )/a . Then 106 e k ( n l c ; n k - l ) <1^ - f f P ^ " 1 ! k ^ . J2~n J 2-, , ,„k-l (X | OJ ) • exp[-0.5y ]dydX k-1 b sin exp[-0.5y 2]dydX k 1 r k - l b (7.5) where b = -m /a v v n n n k k _ 1 v - i k i io {0.5y I Q - I 6 . ( x f - 0.5y)}/a[ J 6 q=l q 1=1 • £=1 * Consider separately the parts of (7.5) ,.(.a) = O r P l | | p . ( x k - l j u k - l } e x p [ _ 0 . . S y 2 } d y d x k - l ^ x k " l b "(1-P) k-1 1-1 3 r=° n. k-1 ( 2 , ) n k - 1 / 2 ( a ) n k - 1 "~ J exp[-0.5 f i ( x k - 1 ) 2 0 /a 2]dX k- 1{ 1 u=l U " /2T'b -0.5y^]dy} exp[ (7.6) Let Therefore, z. - x k - x . e 1 / 2 I X X u k - l b = y z.b! + b . , x x o i = l where b! = - ^ i a 1/2 > 6 .3 = 1 107 and n. n. k-1 b = 0.5y[ I 6 + I 6 ]/c[ I 8.1 1/2 1=1 j=l 1=1 Equation (7.6) then reduces to the following expression n k-1 (a) = (1-p) • n 1/2 'k-1 exp[-0.5( I z 2 + y 2 ) / a 2 ] dydz k 1. (7.7) q=l q "k-1/2 . "k-1 -» b (2ir) • (a) . A f t e r proper coordinate transformation (7.7) can be expressed as follows (a) (1-P) ,„ , k-1/2 "k (2TT) .a -°° -°° — ... exp[-0.5( I z 1 )/o ]dz. 1 k J J q=l /?r-/2TT b exp[0.5y , 2]dy (7.8) where z' and y' are now independent random varia b l e s . Equation (7.8) f i n a l l y reduces to (a) = 0.5(l-p)[l - e r f { ( u / 2 a ) U / 2 ) 1 / 2 } ] W l where \ V i ^ = I e . + I V V i I-I 1 • j-i and erf(x) = exp[-y ]dy. I t can s i m i l a r l y be shown that the second part of (7.5), denoted (b), reduces to (b) = 0.5p[l - erf{(y/2a)B •(2A V k-1 W l ) - 1 / 2 n where 'k-1 n k ' V l i = l j = l 108 Thus at any decision state with n, and n, .. number of features k k-1 on the k th and the (k-1) st pattern r e s p e c t i v e l y , the k th component erro r p r o b a b i l i t y i s given by e k ( n k ; n k _ 1 ) = 0.5 - 0.5[ ( l - p ) e r f { (y/2cr) ( A q r / 2 ) l / 2 } k' k-1 + p erf{(u/2a)B .(2A ) ~ 1 / 2 } ] (7.9) V n k - i V n k - i 7.3 Results and Discussion for the Two-Class, F i r s t Order Markov Problem In Figures 7.1, 7.2, and 7.3 the error p r o b a b i l i t y e k(n k;n k_^) fo r various combinations of n k and n-^_-^ a r e presented, the d i s t r i b u t i o n of the variance of the feature components being d i f f e r e n t i n each case. The following three cases were considered. Case 7.1: The feature components were a l l of uniform variance, thus o\ = a, i = 1, 2, . . ., D • Case 7.2: The variance of the feature components increased exponentially. Thus, the variance o\ of the i th feature component was given by a = o • a^ exp[i • a2-!> = •*•» 2» •••» D where a^ and were p r e s p e c i f i e d constants. Case 7.3: The variance of the feature components increased l i n e a r l y . Thus, the variance o\ of the i th feature component i s given by o\ = a«(a| + a 2 . i ) , i = 1, 2, .. ., D 109 C D -. trH U=1.8 p=0.001 N0= % + nk_ 1 N=10 n, (NUMBER OF FEf l fURES OBSERVED ON THE K TH PATTERN) Figure 7.1 System error p r o b a b i l i t y f o r various combinations of features from the present and the previous patterns. The arrows show the path of minimum err o r p r o b a b i l i t y . The features' variance i s uniformly d i s t r i b u t e d . 110 CD-ro-CM • •N(f' H = 1.8 p = 0.001 Oj O.Qj .exp[i. Op] AL - n. * n. . 0 /r /f-7 ,-*N=10 (NUMBER OF FEATURES OBSERVED ON THE K TH PATTERN) Figure 7.2 System error p r o b a b i l i t y f o r various combinations of features from the present and the previous patterns. The arrows show the path of minimum err o r p r o b a b i l i t y . The features' variance is exponentially d i s t r i b u t e d . Figure 7.3 System err o r p r o b a b i l i t y f or various combinations of features from present and previous patterns. Features' variance i s l i n e a r l y d i s t r i b u t e d . 112 where a| and a^ are two p r e s p e c i f i e d constants. A maximum of ten features were assumed a v a i l a b l e per pattern sample (D = 10). Also, f o r each one of the three cases u = 1.8, p = 0.008 and <j = 1 were assumed. Note the way i n which the erro r p r o b a b i l i t i e s are plo t t e d i n Figure 7.1, 7.2, and 7.3. A curve on each fi g u r e corresponds to the v a r i a t i o n of the e r r o r p r o b a b i l i t y f o r various combinations of n, and n, ,, the features from the k th and k-1 st k k-1 patterns, with the constraint that n, + n, , = a constant = N • k k-1 o One can view the curve corresponding to N Q = i to be the locus of the error p r o b a b i l i t y associated with the i th state. The erro r p r o b a b i l i t y associated with a state i s dependent on how the state i s ar r i v e d at. T r a n s i t i o n from one state (except the f i n a l ) to the next i s pos s i b l e through the observation of an a d d i t i o n a l feature e i t h e r on the k th or (k-1) st pattern. The arrows i n the figures show the locus of minimum error p r o b a b i l i t y and therefore, the optimum way of sampling the pattern features. When the features are equally e f f e c t i v e as i n Figure 7.1 we notice that at every decision state ( u n t i l features from the k the. pattern are exhausted) a feature from the pattern under consideration (k th one) i s preferable t o ; one from the neighbouring (k-1 st) pattern. However, when the features are not a l l equally e f f e c t i v e , as i n Figures 7.2 and 7.3, we observe that at some states the error p r o b a b i l i t y i s minimized by s e l e c t i n g the a d d i t i o n a l feature from the neighbouring pattern rather than the pattern to be decided. Thus, when the features are of unequal q u a l i t y , having the choice to s e l e c t an a d d i t i o n a l feature from e i t h e r 113 the pattern under consideration or one of the neighbouring patterns may mean an e a r l i e r termination of the feature observation process. This i n turn r e s u l t s i n a reduction i n the required number of features per pattern f o r terminal decisions. The t o t a l number of states i n which a feature from the neighbouring pattern i s preferable over a feature from the pattern under consideration, denoted by s ( n k _ ^ ) , depends on the d i s t r i b u t i o n of the effectiveness of the various feature components. The standard deviation of the features'variance was found to be higher i n Case 7.3 (a^ = 0.287) than i n Case 7.2 (o^ = 0.188) and as a r e s u l t we notice from Figures 7.2 and 7.4 that the value of sC1 1^.^) 1 S higher i n Case 7.3 than i n Case 7.2. Pl o t t e d i n Figure 7.4 i s the s(n. n) f o r several values of standard K—1 deviation, of the features' variance. The means of the features' variance were the same i n each case.. As shown i n Figure 7.4, the higher i s o^, that i s , the more the feature components d i f f e r from each other i n t h e i r e f f e c t i v e n e s s , the larger i s the value of s ( n ^ _ ^ ) . This i n d i c a t e s that i n s i t u a t i o n s where the features are quite d i s s i m i l a r i n t h e i r effectiveness, the CSPR schemes would be more e f f i c i e n t than the c l a s s i c a l sequential and nonsequential schemes. 7.4 Experimental Evaluation of CSPR Schemes Using M u l t i c l a s s Problems In order to assess the performance improvements obtainable using the three CSPR schemes discussed i n Section 6.4, a set of recogni-t i o n experiments were simulated on the d i g i t a l computer. The experiments and t h e i r implications are discussed below. 2 QL . . i I I I I J 1 1 ' J . 1 1 1 1 1 1 1 V/ 0.007 0.07 OJ U af Figure 7.4 E f f e c t of standard deviation of the features' variance on the t o t a l number of states i n which a feature from the previous pattern i s preferable to one from the present pattern . 115 Experiment 7.1: The CSPRS Type 1 with t = 2 was considered. The cost of ob-serving any one of the feature components from a pattern sample was assumed to be the same but dependent on the p o s i t i o n of the pattern on which the feature was observed,. A feature, i f observed on the pattern under consideration was assumed to be l e s s c o s t l y than i f i t was observed on one of the neighbouring patterns. At instant k the t o t a l cost of a feature component from the j th pattern was c(n^ + 1) + K ( k - J)-c(n.. + 1), j =' (k - t + 1), k (7.10) where K was a p r e s p e c i f i e d constant and c(n. + 1) was the cost of a J feature component from the pattern under consideration. The performance of CSPRS Type 1 i s i l l u s t r a t e d i n Figure 7.5. The r e s u l t s of a l l seven texts are shown i n Table 7.1. For comparison purposes the performance of simple sequential, simple nonsequential, compound sequential and compound nonsequential decision schemes are also i l l u s t r a t e d i n Figure 7.5. The f i g u r e i n d i c a t e s that the CSPRS Type 1 requires s i g n i f i c a n t l y fewer features (on the average) per pattern than equally r e l i a b l e simple sequential and simple nonsequential decision schemes. Using the same dependence r e l a t i o n s , the CSPRS Type 1 also performs s i g n i f i c a n t l y b e t t e r than the compound nonsequential scheme. At 20 percent error rate, f o r example,the CSPRS Type 1 requires about 21.0 percent fewer features on the average per pattern than the compound nonsequential scheme. Compared to a compound sequential d e c i s i o n scheme, the CSPRS Type 1 using the same bigram dependence r e l a t i o n among the pattern classes performs better, although the percentage improvement i n performance i s not large. At 20.0 percent error rate, a 4.6 percent 40.0 | 30.0 20.0 -Q O -Q O i -Q. o £ 10.0 0.0 8 Figure forced dec simple nonseqn. o-compound nonseqn. • • •• ®-simple seqn. compound seqn <>-CSPRS Type 7 A. -0 - • n _L 40.0 300 i$ "r—. 20.0 § o QJ t3 QJ O 70.0 70 77 72 73 74 75 75 77 78 average number of features per pattern, [N^(1),NFJ 19 20 21 0.0 7.5 Comparison of CSPRS Type 1, compound sequential, compound nonsequential,simple sequential, and simple nonsequential decision schemes. Text Set of Alphabets Used for Avg. No. of . Features Pattern E r r o r Prob. ( i n percent) Avg. No. of Features Observed ( i n percent) Forced Decision ( i n percent) No. Size Training Testing On k th Pattern On (k-1) s t Pattern 1 538 22-147 1-21 13.5 18.2 ' 71.9 28.1 5.6 2 420 1-41 43-147 22-42 13.2 30.1 73.6 26.4 6.2 3 579 1-42 64-147 43-63 12.9 20.3. 73.4 26.6 5.7 4 516 1-63 85-147 64-84 12.9 18.5 75.8 24.2 3.9 5 467 1-84 106-147 85-105 13.3 22.0 72.6 27.4 7.3 6 557 1-105 127-147 106-126 13.0 30.1 72.5 27.5 7.7 7 489 1-126 127-147 12.8 20.8 73.8 26.2 6.7 Expected Values 13.1 22.8 73.4 26.6 . 6.2 Standard Deviations 0.23 4.73 1.18 1.18 1.17 Table 7.1 Recognition r e s u l t s of a l l seven texts using CSPRS Type 1 118 saving i n the average number of features per pattern occurs f o r the Type 1 r e l a t i v e to the compound sequential scheme. Table 7.1 shows that the CSPRS Type 1 r e q u i r i n g 13.08 features on the average per pattern, observed 26.6 percent of the t o t a l used features on the previous pattern and the remaining 73.4 percent on the pattern under consideration. The number of features observed from the previous patterns i s , i n general, dependent on the parameter K of (7.10). The .more expensive i t was to observe a feature on the previous patterns the l e s s frequently were such features observed. Experiment 7.2: The CSPRS Type 2 with t = 2 was implemented. The feature components were assumed equally c o s t l y i r r e s p e c t i v e of the p o s i t i o n of the pattern on which they were observed. The recognition r e s u l t s based on the f i r s t text appear i n Figure 7.6. Also included i n Figure 7.6 are the r e s u l t s of CSPRS Type 1 and the sequential compound dec i s i o n scheme. Figure 7.6 shows that the CSPRS Type 2 requires s l i g h t l y more features on the average than the equally r e l i a b l e CSPRS Type 1. Compared to the compound sequential decision scheme, the CSPRS Type 2 performed better as long as the average number of features required per pattern was below 75 percent of the t o t a l a v a i l a b l e features. About 18 to 20 percent of the t o t a l features used were found to be observed from the previous pattern sample and the r e s t from the pattern under consideration. error probability, !~P(l)7 (in percent) — Nj L e 1 Co CJ O CD CD 120 Experiment 7.3: The CSPRS Type 3 with t = 3 was considered. Thus, a d d i t i o n a l information from the following as well as the previous patterns was a v a i l a b l e for the c l a s s i f i c a t i o n of the present one. The features were a l l equally c o s t l y as i n Experiment 7.2. The recognition r e s u l t s based on the f i r s t text are presented i n Figure 7.7. Also included i n Figure 7.7 are the. r e s u l t s of CSPRS Type 1, Type 2 and the ordinary compound sequential d e c i s i o n scheme. I t i s evident from Figure 7.7 that the CSPRS Type 3 performs be t t e r than the CSPRS Type 1, Type 2 and the compound sequential decision scheme. However, the percentage improvement i n performance r e l a t i v e to CSPRS Type 1 i s not large. As i t i s pointed out i n Section 6.4, the CSPRS Type 3 permits decisions to be delayed u n t i l some information from the subsequent patterns are a v a i l a b l e . The CSPR scheme r e q u i r i n g on the average 7.5 features per pattern was found to observe 60.5 percent and 16.7 percent of the t o t a l used features on the subsequent and the previous patterns r e s p e c t i v e l y . The feature requirement from the previous and the sub-sequent patterns can be c o n t r o l l e d by making the cost of features dependent on the pattern's p o s i t i o n as i n Experiment 7,1. In Figure 7.8 the recognition c a p a b i l i t y of each feature component i s i l l u s t r a t e d . Two sets of r e s u l t s are presented, one a n a l y t i c a l and the other experimental. The a n a l y t i c a l r e s u l t s were obtained using the Bayes expression for the p r o b a b i l i t y of e r r o r as given by (8.8). The seven l i k e l i h o o d matricies obtained from the t r a i n i n g samples provided the knowledge of the p r o b a b i l i t y d i s t r i b u t i o n s of the feature components. The experimental one i s based on the a c t u a l r e c o g n i t i o n of the seven English texts, each feature component being used se p a r a t e l y . average number of features per pattern, [NfO)] Figure 7.7 Comparison of CSPRS Type 1, CSPRS Type 3, compound sequential, and compound nonsequential decision schemes. 20.0 30.0 12.0V 10.01 -J_. JL _ l _ _ l _ _1_ 22.0 20.0 10 11 12 13 14 15 16 17 feature number 18 19 20 21 22 23 24 25 Figure 7.8 A n a l y t i c a l and experimental evaluation of the d i s c r i m i n a t i o n c a p a b i l i t y of the 25 feature components. to 123 The a p r i o r i p r o b a b i l i t i e s of the pattern classes i n both cases equalled those of the characters of English text. The discrepancy between the a n a l y t i c a l and the experimental r e s u l t s as observed i n Figure 7.8, i s due to the f a c t that i n the experimental procedure the blanks and the periods were included and assumed to be p e r f e c t l y recognizable. We notice from the fig u r e that the majority of the 25 feature components have a recognition c a p a b i l i t y within a narrow range of 14 to 17 percent when evaluated a n a l y t i c a l l y and 25 to 28 percent when evaluated experim-e n t a l l y . Therefore, the 25 features used i n simulating the CSPR schemes ai"e more or less of equal effectiveness as f a r as the recognition c a p a b i l i t y i s concerned. If the variance of the features' effectiveness i s low, very l i t t l e or nothing i s gained by observing the a d d i t i o n a l feature on the neighbouring patterns rather than the pattern under consideration. It therefore, i n t u i t i v e l y appears that i n a s i t u a t i o n l i k e t h i s where the features are equally e f f e c t i v e the CSPR schemes may not perform any better than the ordinary compound sequential decision scheme. We arr i v e d at a s i m i l a r conclusion by analyzing the r e s u l t s of the two-class problem i n Section 7.3. That the three CSPR schemes did not perform s i g n i f i c a n t l y better than the ordinary sequential decision scheme i s most l i k e l y due to the s i m i l a r e f f e c t i v e n e s s of the features. It i s ant i c i p a t e d that the more d i s s i m i l a r the feature components are i n t h e i r effectiveness the more e f f i c i e n t and u s e f u l the CSPR schemes would be as seuqential decision schemes for pattern recognition problems with dependent hypotheses. 7.5 The CSPR Schemes and Recognition Systems with Variable Dimensional Feature Vectors For recognition systems with v a r i a b l e or r e l a t i v e l y low d i -mensional feature vectors,, the CSPR schemes may be quite a t t r a c t i v e as sequential c l a s s i f i e r s . I t i s pos s i b l e that a pattern sample having a low dimensional feature vector can not be c l a s s i f i e d with s u f f i c i e n t confidence even a f t e r observing the e n t i r e set of features from the pattern. One way to increase the confidence i n such a s i t u a t i o n i s to observe a d d i t i o n a l features from the neighbouring patterns, as i n the CSPR schemes. Table 7.2 and Figure 7.9 i l l u s t r a t e s the r e l a t i o n between the system er r o r p r o b a b i l i t y , and the number of features observed on the pattern under consideration and the neighbouring patterns. We note from the table and the fig u r e that i t i s possible to maintain the feature requirement from the pattern under consideration more or l e s s constant and s t i l l reduce the error p r o b a b i l i t y of the system by observing a d d i t i o n a l features from the. neighbouring patterns. With t = 2, for example, the error p r o b a b i l i t y was reduced from 24.1 percent to 17.1 percent by observing on the average only 3.4 more features from the previous pattern, while the number of features from the pattern under consideration was maintained approximately constant. Thus, by s e l e c t i n g a s u i t a b l e value f o r the average number of features from the pattern under consideration (depending on how low the dimensionality of the feature vectors goes), the feature requirement from the neighbouring patterns can be adjusted u n t i l desired accuracy i s achieved. CSPRS TYPE - 1 t = 2, Bigram S t a t i s t i c s , Unordered Features Expt. No. Text Av. No. of Feats. Observed on the Er r o r P r o b a b i l i t y ( i n percent) Forced Decision ( i n percent) No. Size (chars.) k th Pat. ( k - l ) s t Pat. 1 1 538 9.9 1.3 24.1 1.3 2 1 538 9.6 3.8 18.2 5.6 3 1 538 9.7 4.5 16.9 8.9 4 1 538 9.6 4.7 17.1 8.7 t = 3, Bigram S t a t i s t i c s , Unordered Features Expt. No. Text Av. No. of Feats. Observed on the Er r o r P r o b a b i l i t y ( i n percent) Forced Decision ( i n percent) No. Size (chars.) k th Pat. (k-l)st Pat. (k-2)nd Pat. 1 1 538 9.2 2.2 1.2 19.5 2.2 Table 7.2 Performance of CSPRS Type 1 f o r various amounts of features from the neighbouring patterns, error probability fin percent) 9ZT 127 CHAPTER VIII FEATURE ORDERING AND ON-LINE SEQUENTIAL DECISION SCHEMES 8.1 Introduction The expected cost of a terminal d e c i s i o n i n sequential d e c i s i o n schemes i s strongly dependent on the order i n which the features are observed. The sequential schemes discussed so f a r i n the thesis are always provided with a preordered set of features. Thus, at every state the feature to be observed i s known i n advance. I t i s , however, pos s i b l e to formulate a sequential decision scheme which at every decision state would s e l e c t the feature which i s best suited for that state [25], [26]. Therefore, a feature which together with the already observed features on the pattern sample provides the maximum amount of a d d i t i o n a l information i s observed at every state. Such an on-line ordering of features r e s u l t s i n an e a r l i e r termination of the feature observation process. Unfortunately, the implementation of the optimal on-line sequential decision scheme* (based on dynamic programming) i s impracticable i n most si t u a t i o n s due to the amount of computation and storage involved. One amenable s o l u t i o n i s to use the suboptimal decision r u l e based on one-state ahead truncation approximation. However, the question i s how e f f e c -t i v e i s the suboptimal on-line sequential decision scheme both from the points of view of computational d i f f i c u l t y and recognition performance. C a r d i l l o and Fu's [65] r e s u l t s of simulated recognition experiments even •k From now on the sequential decision scheme with on-line feature ordering i s c a l l e d the on-line sequential decision and the sequential decision scheme without any feature ordering i s c a l l e d usual sequential decision scheme. 128 though they are meant to determine the e f f e c t of one-state ahead truncation approximation, do not enable one to s i g n i f i c a n t l y evaluate the s u i t a b i l i t y of the suboptimal on-line sequential decision scheme. More over, as i t i s pointed out i n Chapter V, the o v e r s i m p l i c i t y of the recognition problem simulated and the experimental methodology used, the r e s u l t s are of very l i m i t e d value. In t h i s chapter the performance of the suboptimal on-line sequential decision scheme i s evaluated on the basis of a set of s i g n i -f i c a n t experiments simulated on a d i g i t a l computer. I t s computational complexity i s analyzed and the s i t u a t i o n s i n which the on-line ordering of features may be useful are discussed. The c a p a b i l i t y of the on-line sequential d e c i s i o n scheme as a feature ordering c r i t e r i o n i s assessed and compared with those of three other feature evaluation c r i t e r i a . A modified on-line sequential (MOLS) decision scheme based on l i m i t e d length (LS) of search over a v a i l a b l e features i s proposed. The MOLS decision scheme requires less computation than the on-line sequential decision scheme and performs be t t e r than the usual sequential decision scheme. The advantage of incorporating l i m i t e d length of search into the sequential decision schemes using a set of preordered features i s also pointed out. 8.2 Formulation of the Sequential Decision Scheme with On-Line Ordering of Features At any decision state i , l e t fj(i ) denote the j th component feature of a set of D features a v a i l a b l e i n t h e i r n a t u r a l order. Let n b(q) = { i , j , £}, i , j , £ = 1, 2, D; M i ^ j i ••., 4 nJ b = 1, 2, (q) 129 be a set of q i n d i c i e s of the features f, (1), f . ( 2 ) , ..., f (q),observed at 1, 2, q th states r e s p e c t i v e l y . Therefore, Il^(q), the complement of the set n^(q) i s the set of i n d i c i e s of the features yet to be observed at state q. Let: k k R [X ; n, ; 6 ILCn. )] be the r i s k involved i n forming an optimal terminal S K. D K. decision 6 K at state n, of the decision process k having a v a i l a b l e the sequence of features k k (x^, . .., x^ ) selected i n such a way that the k feature i n d i c i e s fQrm the set II, (n, ) . b k k i R [X ; n. IIL (n )] be the minimum expected r i s k involved at state n of continuing the feature observation process having a v a i l a b l e the sequence of features ( x k , .... x ) from the k th pattern and selected i n n k such a way that the feature indicies form the set I f at state n, i t i s decided to terminate the feature observa-k t i o n process, then the r i s k involved i n optimum (Bayes) de c i s i o n 6 i s given by M R [ X k ; i i K ; 6 k | n b ( n k ) ] = Min{ £ £ ( / ; d K ) P ( X K | U K ; ) P ( O K ) } , X j = 2. J J J i = 1, 2, M. (8.1) If the elements of the loss-matrix i s given by (2.5), then the optimum decision 6 K = d k , i f q P(X k|w k,n, (n. ))P(co k) = Max{P(Xk|o)k;n, (n. ))P(co k)}. 1 q b k q x ' i b k x 130 I f however, an a d d i t i o n a l feature i s desired at state n^, then the r i s k involved i n continuing the observation process can be expressed as follows: R c [ X k ; n k | n b ( n k ) ] M C + n. Min {C(IL+1) + f Min[ £ £(co k;d k) k qe]T b(n k) n, +1 k (8.2) p ( x k > < + i > v v i } i * k ) p ( w k ) ] • p ( < + v y \ + i ) i x k > v \ ) ) d < + i > -k M J J k ^ k The de c i s i o n r u l e f o r the on-line sequential d e c i s i o n scheme based on the one-state ahead truncation approximation i s : T? ryk .^ I TT f n 1 <• *R r y k . „ -(S^ITT ('y, N "1 • C O N T I N U E R c [ X k ; n k | n b ( n k ) ] > R g [ X k ; n k ; 6 k | ^ ( t ^ ) j : STOP (8.3) I f the observation process i s decided to be continued to the (n. +1) s t state then the best feature to be observed i s any s o l u t i o n to K. (8.2). In other words, for the (n k+l) s t state x M f (n,+l); i f c ( i y f l ) + J Mint I U < V d k ) P ( X k , x k , f (n +1) | a>k; n (n ) ) q tc K ^ i - ^ - L J k " J n.+l k P(o) k)] P ( < + r f q ( n k + 1 ) l x k ' n b ( V 5 d X n 1 + l k k Min {c(n,+l) + Min[ \ •n, (n, ) k J , i .1=1 s e b k ) M I £(w k;d k) fik 1 J n, +1 k P ( x k > V +!• V \ + 1 > I V V V ) P ( u k ) 3 P ( X n , + r f s ( n q + 1 ) ^ W k J J k k qefi b ( n k ) ; i = 1, 2, . .., M. In case only a p r e s p e c i f i e d feature component i s a v a i l a b l e f o r observation at state n, , the r i s k s defined i n (8.1) and (8.2) reduce to those of the usual sequential decision scheme of Section 5.5. 8.3 Results and Discussion of Recognition Experiments Using the On-Line Sequential Decision Scheme The simple suboptimal on-line sequential d e c i s i o n scheme of Section (8.2) was simulated on the d i g i t a l computer using the data set described i n Chapter IV. The feature components were assumed equally ..costly and .also s t a t i s t i c a l l y independent within each c l a s s . The per-formance of the decision scheme for various values of average number of features per pattern i s shown i n Figure 8.1. The r e s u l t s of recognition of a l l seven texts are presented i n Table 8.1 and the expected r e s u l t s with corresponding standard deviations are also shown i n Figure 8.1. For the ease of comparison the performance of the simple usual sequential and the optimal nonsequential d e c i s i o n schemes also appear i n Figure 8.1. It i s evident from Figure 8.1 that i f a l l the features a v a i l a b l e from a pattern sample are not observed, the on-line sequential decision scheme i s d e f i n i t e l y more e f f e c t i v e than the usual sequential d e c i s i o n scheme from the point of view of expected cost of a terminal d e c i s i o n . At 17.5 percent error r a t e , f o r example, the on-line sequential scheme requires 44.6 percent and 33.6 percent fewer features on the average per pattern than the optimal nonsequential and suboptimal usual sequential , x „- ' error forced dec. ^ A simple nonseqn. o simple seqn. A A A -A simple on-line seqn. •——-a a 0 Q-_ 70 77 72 13 U 15 16 17 18 19 20 21 average number of features per pattern, f ^ f d ) , ^ ] Figure 8.1 Comparison of suboptimal on-line sequential, usual sequential, and optimal nonsequential decision schemes. Text Alphabet Set Used For Avg. No. of Features Pattern E r r o r Prob. ( i n percent) Forced Decision ( i n percent) NO. Size Training Testing 1 538 22-147 1-21 11.9 16.5 5.2 2 420 1-21 43-147 22-42 11.8 22.1 6.2 3 579 1-42 64-147 43-63 12.0 19.5 6.2 4 516 1-63 85-147 64-84 12.0 18.4 6.8 5 467 1-84 106-147 85-105 12.7 18.2 7.7 6 557 1-105 106-147 106-126 13.0 24.4 9.7 7 489 1-126 127-143 12.1 20.2 6.96 Expected Values 12.2 19.9 6.9 Standard Deviations 0.42 2.45 1.32 Table 8.1 Recognition r e s u l t s of a i l seven texts using suboptimal simple sequential decision scheme with on-line ordering of features. 134 deci s i o n schemes, r e s p e c t i v e l y . At the same error rate, the usual sequential decision scheme requires only 14.7 percent fewer features than the optimal nonsequential decision scheme. Thus, the on-line sequential decision scheme based on the one-state ahead truncation approximation may be quite a t t r a c t i v e f o r recognition problems where a trade-off between the cost of feature observation and r e l i a b i l i t y of the system i s desired. The only disadvantage with the on-line sequential d e c i s i o n scheme i s that at every decision state i t has to search f o r the best feature and therefore, requires a d d i t i o n a l computation. I t i s also necessary to remember the features observed at various states but the storage and computation involved f o r such process i s n e g l i g i b l e . The computational complexity due to the involved search i s of primary con-cern. At any state n^, the decision scheme needs to te s t (D-n^) feature components to determine the feature to be observed, i f necessary, at (n + l ) s t state. I f the on-line sequential d e c i s i o n scheme requires on K. the average I features per pattern f o r terminal decisions, then the average amount of searches N(£), involved per pattern i s N(£) = £(2D - I + l ) / 2 , I = 1, 2, D. (8.4) For 1 = D, that i s , i f a l l the D features are needed per pattern, (8.4) reduces to N(D) = D(D + l ) / 2 . (8.5) Considering (8.5) as the reference, we can express the average computational complexity due to the on-line ordering of features as follows: S c(a) = a(2a - Da + 1)/(D + 1), (8.6) where a = £/D. In Figure 8.2 S (a) for D = 25 i s p l o t t e d as a function of a. 135 Figure 8.2 Comparison of computational complexity and the reduction i n the error p r o b a b i l i t y due to on-line ordering of features. Also shown i n the Figure 8.2 i s RgCcO which denotes the reduction i n the er r o r p r o b a b i l i t y obtainable using the on-line sequential decision scheme r e l a t i v e to the usual sequential d e c i s i o n scheme. Formally, Re(o0 i s defined as follows: R (a) = (P - P ' ) /P e e e e where for some a P g denotes the error p r o b a b i l i t y due to -the simple suboptimal d e c i s i o n scheme without feature ordering, and P^ denotes the e r r o r p r o b a b i l i t y due to the on-line sequential d e c i s i o n scheme. Figure 8.2 shows that for any increase i n value of a beyond 0.45 the computational complexity S^Ccx) increases and the reduction i n the error p r o b a b i l i t y R (a) decreases. A range of values of a therefore, e x i s t s for which the on-line ordering of features i s most e f f e c t i v e from the points of view of computational d i f f i c u l t y and p r o f i c i e n c y i n reducing system error p r o b a b i l i t y . Figure 8.2 i n d i c a t e s that the on-line ordering of features i s s u i t a b l e when on the average around 45 percent of the t o t a l a v a i l a b l e features per pattern i s adequate for terminal decisions. The optimum range i n p r a c t i c e however, would depend on the cost of a d d i t i o n a l computation and the expected cost of a terminal d e c i s i o n . 8.4 Feature Evaluation Using the On-Line Sequential Decision Scheme Selec t i o n of a "good" set of features and c l a s s i f i c a t i o n of patterns based on the selected set of features are the two e s s e n t i a l aspects of a pattern recognition problem. The s i z e of such a feature set i s d i c t a t e d by the data storage and processing l i m i t a t i o n s and i t s 137 objective l i e s i n the minimization of the system error p r o b a b i l i t y . In general the minimization of error p r o b a b i l i t y f o r the s e l e c t i o n of a set of features f o r mult i c l a s s problems i s often impossible. An expression for the system error p r o b a b i l i t y i s not always a v a i l a b l e , and even i f i t may be a v a i l a b l e , the expression may be too cumbersome f o r a n a l y t i c a l minimization. For these reasons various suboptimal (optimal i n some cases) approaches from the information theory point of view which are easier to implement and have bound on the error p r o b a b i l i t y have been proposed f o r feature s e l e c t i o n [71] - [79]. Nevertheless, the computational part can s t i l l be tedious, and imp r a c t i c a l i n some s i t u a t i o n s . To sel e c t the best subset of n features from a set of D, one has to search over (^ ) combinations, except i n some s p e c i a l cases where the c r i t e r i o n function f o r a subset of features i s expressible i n terms of the function of the i n d i v i d u a l components. One simple technique i s to use an information measure to i n d i v i d u a l l y rank each feature component and eliminate those with low rankings. This technique i s by f a r the simplest and has been used by others f o r the s e l e c t i o n of a subset of features [77], [80]-[82]. I m p l i c i t i n t h i s technique i s the assumption that a subset of features containing the ones which are i n d i v i d u a l l y ranked to be good by some feature evaluation c r i t e r i o n , i s a good choice. The v a l i d i t y of such assumption i s however, highly questionable. In fac t i t has been shown t h e o r e t i c a l l y that the best subset may even be composed of the features which are worst when ranked on an i n d i v i d u a l basis [83], [84]. Ordering of features, as we have mentioned e a r l i e r , i s important for sequential decision schemes. The expected cost of terminal decisions i s highly dependent on the order i n which the features are observed. The 138 advantage of the sequential decision scheme with on-line feature ordering i s that i t i s capable of s e l e c t i n g the best features by i t s e l f during the c l a s s i f i c a t i o n process. But on-line ordering scheme may not always be desirable, i n which case we must decide on the ordering i n which a usual sequential d e c i s i o n scheme should observe the features. Several approaches for t h i s purpose are considered i n t h i s section. The frequency of usage of a p a r t i c u l a r feature component i n an on-line sequential decision scheme i s dependent on the effectiveness of that feature. Therefore, the on-line sequential decision scheme can be considered as a scheme f o r the evaluation and the ordering of features. The ordered set of features may be u s e f u l f o r sequential d e c i s i o n schemes with a preordered set of features and the s e l e c t i o n of subsets of features f o r nonsequential decision schemes. Notice that the evaluation of features using the on-line sequential d e c i s i o n scheme i s not on the basis of i n d i v i d u a l performance, because the s e l e c t i o n of a feature at every state i s dependent on the features already observed i n previous states. Table 8.2 shows the usage of the 25 feature components at various d e c i s i o n states of the sequential c l a s s i f i c a t i o n process. The r e s u l t s are based on the recognition of a l l seven texts using a simple on-line sequential d e c i s i o n scheme. On the average 12.25 features per pattern sample were observed. The preference f o r one feature component over others at d i f f e r e n t decision states i s evident from Table 8.2 which indicates that a ranking of the features i s possible using the on-line sequential d e c i s i o n scheme. Let denote the t o t a l usage of the j th feature component at i th state and l e t f . ( i ) represent that the j th ( i n n a t u r a l order) / 2 3 4 5 6 7 a' 9' w[ g ' 5 17 18 19~ 20 2l" 22' 23~ 24 25 total usage of feature components On percent) . 11 . 12 . 13 , U , 15 n.o CO c n 1.1 1 .s 0.8 1.1 0.7 o.n 0.0 0.0 0.6 0.6 0.5 0.7 O.H r 0.6 0.9 1.0 1.1 0. 0 0. C 10.6 1. 7 !*.. I 6. n i i ^ 1.2 1.5 1.1 1.0 1.0 1.1 1.5 1.2 0. ? 0.7 0.1 0. c 0.0 0.3 C O 0.0 0.2 1.1 2.7 2.8 3.8 0 . 1 1.1 0.3 1.7 3.9 0. 3 1.6 1. 2 1.0 1.0 0.6 0.6 0.1 C.7 0.2 0.0 0.0 0.0 10.1 0.6 0.2 9.2 9.2 ; 6.0 / 7. 7.8 fZTJ J 7 . 0 5. 5 0.7 3.0 3. 1 2.6 2. 1 2.0 0.8 1.0 0. 3 0.2 0.3 0.3 0.2 0.5 0.2 0. 1 0 6.2 6.9 6.6 0.9 0.3 3.2 2.3 1.5 1.7 0.9 0.9 0.5 0.3 0.3 0.1 0.1 0.0 0.0 0.0 0.0 0 0.0 | 0.0 | 0.0 | 0.3 j 0.3 j 0.0 j 1.0 j 1.2 I ,., i 1.5 j 1.0 j 0.9 J 1.6 j 2.5 j 2.0 ! , 2 J 3.0 | 1.0 0.0 0 £ 6 2 ^ 17.S 5.3 5.0 2.0 1.6 O.S 0.3 0.0 0.3 0.0 0.3 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.3 9.8 i 7.0 I 6.0 I" & 2.7 3.5 O.C 0.0 C O C O 0.5 1.5 3.0 3.5 3.6 3.7 3.7 0.0 2.8 I 2.9 3.6 2.2 3.0 2. 1 2.6 1.1 0.7 0. 6 0. 6 C O 0.2 0.0 0.0 I 2.8 I" j 2.8 1.7 \ C O 0.0 0.0 . 0.0 0.2 0.7 1.0 1.6 2.3 2.3 2.7 2.5 3.8 0. 0 3.9 3.8 2.5 1.8 1. 1 1.3 \ 0.8 I 0.6 vJ[iD' 0.0 0.9 CO 0.6 0.2 ' 0.7 CO 0.0 C O . C O 1.7 3.0 3.9 0.0 3.7 5. 1 0.5 5.0 5.2 5.1 j 0.3 I j 2.1 j 1.8 1.6 0.7 C.7 0.3 0.3 0.0 0.2 0.2 0. 1 0.0 0.0 0.0 0.0 C O CO 12.6 3. 1 1.0 7.9 5.3 0.6 5.7 6.5 1C9 5. 3 7.8 10.7 5.2 5.6 7.0 3.6 i 3 - * \ 5.7 6.5 0.0 j 0.0 j 0.0 • \ 3 , 3 2.6 1.8 3.8 j 0.3 3.6 3.0 2.6 1.3. 3.7 1.0 2.2 2.3 1.0 2.1 1.7 1.0 1.0 1.0 0.6 2.3 0.9 0.5 1.6 CO 0.6 1.0 0.3 0.5 0.5 0.2 0.2 0.3 0. 1 0.3 0.2 0. 1 0. 1 C O CO 0.1 0.0 0.2 0.1 f3> 1 3 . 1 1.9 1.C 1.1 0.9 0.5 0.6 0.7 0.3 1.2 i 2.0 1.2 ! 2.2 j0.8 0.7 0.3 c o 0.0 1.7 1.3 0.6 0. 7 , i c c j j 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1. 0 CO 0.0 \ 9 - s 0.0 2.6 0.0 C O \ 7.0 0.0 C O C O \ 6.9 0.8 8.5 0.0 0.0 C O 7.0 j 1.7 2.1 5. 2 6.0 C 2 0.3 C O 5.7 j 2.0 0. 3 0.0 0.0 1 5 - ' i 2.0 0.0 0.9 0.0 3.5 3.7 3.6 0.6 C O 3.6 3.0 3.6 0.9 C O 1. 9 0.0 3.0 C.7 C O 0.0 2.1 1. a 2w1 2.5 0.7 0.9 C O 2.1 1 J * S i 1 . 9 ! 0.7 0.0 1.1 I 2.8 2.01 1. 1 C O 1.3 j 1.6 2.0 j 0.9 0.0 1.0 | 2.0 1. 3 j 1.0 C O 0.0 1.2 1.2 1.1 i 0.9 ! 1.8 1.6 C O 0.7 1.3 0.5 I 2.1 0.0 0.0 1.0 0.5 | 2.6 C O 0.0 1.1 0.3 I 2.0 0.0 0. 1 1.0 0. 1 Q C O 0.0 0.8 0.0 0.8 Table 8.2 The usage of the feature components at various decision states of an on-line sequential d e c i s i o n scheme. The row and column corresponding to a e n c i r c l e d number i n d i c a t e the feature number and i t s rank r e s p e c t i v e l y . CO 140 feature i s the the i t h b e s t . A method of ranking the features based on Table 8.2 i s to decide f j ( i ) i f : I \ , * M|*{ I U }, for a l l i > q * 1 £=1 £ J g Jl=l % g 1=1 u g 1=1 1 6 5 i» -j» g. 1 = 1 , 2, . . . , D (8.7) The ordering of the features based on (8.7) i s shown i n Table 8.2 where the column and the row corresponding to a c i r c l e d number represent the feature number and i t s rank r e s p e c t i v e l y . We s h a l l r e f e r to t h i s procedure of ordering the features as the on-line ordering procedure no. 1. Figure 8.3 shows the frequency of use of the 25 feature components employing an on-line sequential d e c i s i o n scheme, where the frequency r e f e r s to the r a t i o of number of pattern samples i n which the feature component was used to the t o t a l number of pattern samples c l a s s i f i e d . The r e s u l t s are again based on the recognition of a l l seven texts and using 12.25 features per pattern on the average. Note that some features were used much more often than the others depending on t h e i r e f f e c t i v e n e s s . For example, feature number 21 was observed on each and every pattern sample c l a s s i f i e d , whereas feature number 25 was observed only on 20 percent of the t o t a l pattern samples. Thus, a ranking of the feature components on the basis of the frequency of usage of the features i s possible i n the on-line sequential decision scheme. We s h a l l r e f e r to this procedure of ordering the features as the on-line ordering procedure no. 2. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 feature number Figure 8.3 Frequency of usage of the feature components i n on-line sequential d e c i s i o n scheme. 142 In order to assess the c a p a b i l i t y of the on-line sequential decision scheme as a feature ordering c r i t e r i o n , three other standard feature evaluation c r i t e r i a were also considered f or the purpose of comparisons. The features were evaluated i n d i v i d u a l l y . Thus, the subset to be selected was assumed to contain only one feature. The three c r i t e r i a are as follows: 1. Bayes Error P r o b a b i l i t y : f o r s t a t i s t i c a l l y independent features within each c l a s s , the Bayes error p r o b a b i l i t y f or the i th feature component i s given by P ±(e) = 1 MaxiPCxJw )P(w )} dx i, (8.8) a X 1 i = 1, 2, D; j = 1, 2, M The the i th feature component i s considered to be more e f f e c t i v e than the £ th one, i f Pj.Ce) < P^Cc), i , £ = 1, 2, D. The Mutual Information Between the Features and the Pattern Classes: f or s t a t i s t i c a l l y independent features within each class the mutual information between the i th features and the set of pattern classes i s given by I[x.;fl ] = H(n ) - HCfi Ix.) (8.9) x to to u 1 x where M H ( n ) = - y p(<o ^ n p<w-) 3=1 J and the equivocation 143 M H(fi x.) co1 1 I P(c0 ) J P(x Ico )• An P(x. U).)P(u>.)" 1 3 -1 P(x.) dx • i = 1, 2, l D. Again the i th component feature i s considered to be more e f f e c t i v e than the I th one when I[x ; fi ] > I[x ; a ], i , £ = 1, 2, D. 1 CO X. CO The Expected Divergence Between the Class-Conditional Densities: the expression f o r the expected divergence f o r the % th feature component i s given by M M f J(x£) = I I P(co.)P(co ) J {P(xJco.) - P(xJco )}. i = l j = l J Q. J £n{P(xjl|coi)/P(x£|co )}, I = 1, 2, D. (8.10) The advantage of t h i s c r i t e r i o n i s that f o r s t a t i s t i c a l l y independent features within each c l a s s , the expected divergence for a subset of features i s expressable as a sum of expected divergences of the i n d i v i d u a l feature components. Thus, i f X = ( x n , x„ ... x ) i s a set of n, s t a t i s t i c a l l y independent 1 z. n, k. k features, then J(X) = I J(x ), n = 1, 2, D. 1=1 k A subset of features, containing the ones which are i n d i v i d u a l l y ranked to be good by expected divergence i s , therefore, the best subset. Spearman rank c o r r e l a t i o n c o e f f i c i e n t s , which provide a non-parametric q u a n t i t a t i v e measure of s i m i l a r i t y between a p a i r of orderings [85], [86], were computed to compare the orderings of features obtainable using on-line sequential decision scheme and the three feature evaluation c r i t e r i a . Table 8.3 shows the c o r r e l a t i o n c o e f f i c i e n t s between various p a i r s of orderings. It i s -evident from Table 8.3 that the on-line ordering proce-dures no. 1 and no. 2 r e s u l t i n almost the same ordering of the feature components. The orderings are however quite d i f f e r e n t from the ones obtainable using the three other measures. The c l o s e s t to the on-line ordering procedures i s the expected divergence. The c o r r e l a t i o n c o e f f i -c ients are 0.231 and 0.201 between the expected divergence and the on-l i n e ordering procedure no. 1, and between the expected divergence and the on-line ordering procedure no. 2 r e s p e c t i v e l y . In order to assess the s u i t a b i l i t y of the various feature ordering procedures f or sequential decision schemes without on-line ordering, a group of recognition experiments were conducted. The simple sequential c l a s s i f i e r was presented with features preordered with on-l i n e ordering procedures as w e l l as the three other feature evaluation c r i t e r i a . The r e s u l t s of recognition based on the f i r s t text are presented i n Figure 8.4. The r e s u l t s of a l l seven texts are included i n Table 8.4 the expected values and the corresponding standard deviations are also shown i n Figure 8.4. The r e s u l t s show that the sequential d e c i -sion scheme performs better when the features are presented a f t e r being preordered using the on-line ordering procedures than when nat u r a l ordering i s used. Procedure no. 1 seems to be more e f f e c t i v e than procedure no. 2 as f a r as the performance of the sequential d e c i s i o n scheme i s concerned. 145 bO a •rH U cu T3 O rH CO U 3 •u crj O 00 c •rH U 0) T> U O a) 0 .•rH ! fl o fl o •H •u e M O l+H fl M 3 3 •rl OO fl •rl rl Q^J •u o QJ O C OJ 00 t-t at > •rH T3 cu 4 - 1 u (U &-QJ 4 - 1 •rl Ss 00 C •H M <U T3 SH O O 53 00 C •rl r< CU T5 H o a) c •H •J I fl o -0.402 0.048 0.811 0.781 0.063 Ordered with E r r o r P r o b a b i l i t y 0.111 -0.245 -0.223 0.095 Natural Ordering -0.148 -0.201 0.953 On-line Ordering No. 2 0.994 -0.179 Ordered with Mutual Information -0.232 Ordered with Expected Divergence Table 8.9 Spearman rank c o r r e l a t i o n c o e f f i c i e n t s between various sets of ordered features. on-line ordering proc. nc.l A—* A online ordering proc.no-2 a——• a a - . error probability 0— 0 expected divergence cj —a " N ? SIMPLE SEQUENTIAL DECISION SCHEME 10 11 12 13 14 15 16 17 18 19 average number of features per pattern, ^(1),NFJ 20 21 gure 8.4 Comparison of feature ordering c r i t e r i a on the basis of the performance of the simple suboptimal sequential decision scheme. Feature Ordering C r i t e r i o n Text No. On-line Proce. No. 1 On-line Proce. No. 2 Error P r o b a b i l i t y Expected Divergence Av. No. of feats/Pat. E r r o r Prob. % Forced dec. % Av. no. of feats/Pat. Error Prob. % Forced dec. % Av. No. of feats/Pat. E r r o r Prob. % Forced dec. % Av. No. of feats/Pat. Er r o r Prob. % Forced dec. % 1 12.9 . 19.1 6.7 12.9 21.3 8.5 11.5 15.8 7.2 11.6 14.3 7.0 2 . 13.1 23.7 7.8 2.8 23.3 8.5 . . 11.8 . 21.4 9.5 11.9 20.2 9.2 3 12.4 22.0 5.3 12.3 23.1 4.3 11.3 18.8 6.9 11.2 19.1 7.2 4 12.4 20.5 6.4 12.5 20.3 7.3 11.1 19.1 6.0 10.9 16.4 • 6.6 5 12.7 20.0 5.8 13.0 20.5 6.4 11.5 20.5 9.0 11.8 . 18.6 8.7 6. 13.2 26.8 6.8 13.2 28.7 7.5 11.7 .24.2 7.7 12.0 24.2 9.5 7 12.7 21.6 4.3 12.6 23.1 4.1 11.1 18.6 4.9 10.8 19.4 6.1 Mean 12.8 22.0 6.1 12.8 22.9 7.3 11.4 19.8 6.7 11.5 18.9 7.7 Std. Dev. 0.29 2.43 1.05 0.28 2.63 1.48 0.26 2.43 1.71 0.428 2.86 1.25 Table 8.4 Recognition r e s u l t s of a l l seven texts using suboptimal simple sequential decision scheme with sets of preordered features. We note that the sequential decision scheme performs be t t e r when pre-sented with features preordered using e i t h e r error p r o b a b i l i t y or ex-pected divergence rather than the on-line procedures. The e r r o r prob-a b i l i t y and the expected divergence therefore, seem to be b e t t e r c r i t e r i a to obtain an ordering of features for sequential decision schemes with-out on-line ordering of features. From the computational point of view the error p r o b a b i l i t y and expected divergence are also better methods than the on-line procedures. 8.5 Modified On-Line Sequential Decision Scheme with Limited Length The p r i n c i p a l disadvantage of the on-line sequential decision scheme i s i t s computational complexity. In order to be able to determine the best feature at any state £, a search over (D-£) feature components i s required and t h i s can impose a serious l i m i t a t i o n on the d e s i r a b i l i t y of the on-line sequential scheme, e s p e c i a l l y when D i s large. One good compromise would be to permit the on-line s e l e c t i o n of features but only on the basis of a l i m i t e d search over the a v a i l a b l e features. Thus, at any d e c i s i o n state £, instead of searching over (D-£) features, the scheme may be allowed to search over q features only, where q i s a- pre-s p e c i f i e d constant and D $ q >, 1. The choice of q would depend on the costs of computation, feature components and terminal d e c i s i o n e r r o r s . modified on-line sequential (MOLS) dec i s i o n scheme based on l i m i t e d length of search (LS) q, can be expressed as follows: of Search The r i s k of continuing the observation process, involved with R c [ x k ' n k l n n ( V ] = C + Min ( 0 ( ^ + 1 ) + \ uell' (n. ) b k n k Min i [l£(co k;d k)P(X kx k k + 1;f J n ^ l ) I^VV )P(co k)] p (WV\ + 1 )l x k ;VV ) d x n V ' ( 8 - n ) k k 1 = 1, 2, . . . , M where n'(n, ) efi, (n, ), i s the set of i n d i c i e s of the f i r s t q feature b k b k components yet to -be -observed at dec i s i o n state n^. For q = 1, (8.11) reduces to (5.13) and the MOLS decision scheme reduces to the usual sequential decision scheme with a set of preordered features. However, for q = (D - n^) f o r every state n^, (8.11) i s equivalent to (8.2) and the MOLS dec i s i o n scheme reduces to the on-line sequential decision scheme of subsection 8.2. For any other value of q between these two extremes, the MOLS dec i s i o n scheme requires less computation than the on-line sequential d e c i s i o n scheme and performs better than the usual sequential d e c i s i o n scheme. In order to assess the s u i t a b i l i t y of the proposed MOLS decision scheme a set of recognition experiments were simulated. The re s u l t s of recognition based on the f i r s t text are presented i n Figure 8.5. Seven d i f f e r e n t lengths of search, i n c l u d i n g the cases of on-line sequential and usual sequential d e c i s i o n schemes, were considered. As shown i n Figure 8.5 the higher the length of search, the lower i s the expected cost of a terminal d e c i s i o n . However, considering a length of search of o'nly 5, the recognition performance i s improved by more than 50 percent of what i s achieved by using the f u l l length of search (LS = 25), as i n the case of on-line sequential d e c i s i o n scheme. The computation involved with LS = 25 i s f o f course,far more than what i s involved with LS = 5. By implementing the MOLS scheme, one can make l i m i t e d use of the advantages of the on-line ordering of features without i n c u r i n g excessive computation. The trade-off between the computational complexity (or length of search) and the expected cost of a terminal d e c i s i o n of the MOLS decision scheme i s analyzed as follows. Let q be the desired length of search and £ be the average number of features per pattern required •for a terminal decision using the MOLS scheme. Therefore, the average amount of search involved with each pattern i s N(£, q) = {q • (2D - q + 1) - (D - £) • (D -£ + l)}/2. (8.12) For f u l l length of search, i . e . q = D, as i n the on-line sequential decision scheme, (8.12) reduces to (8.5). Combining (8.5) and (8.13), the computational complexity f o r the MOLS scheme r e l a t i v e to the on—line sequential d e c i s i o n scheme may be expressed as follows: S(a, 6) - ^ - \ + (l-«>a-°+*) ( 8 < 1 2 ) (a - 2 C X - K ) where a = £/D 3 = q/D K = 1/D. The computational complexity S(a, g) i s pl o t t e d i n Figure 8.6 f o r various values of a and 3. Also shown i n Figure 8.6 i s E(a, 3) which denotes the increase i n the error p r o b a b i l i t y r e l a t i v e to the on-line sequential decision scheme due to l i m i t e d length of search. Formally, E(a, 3) i s defined as follows: E(a, 3) = (p^ - P;)/P; where, for some values of a and B P^ denotes the error p r o b a b i l i t y due to the on-line sequential d e c i s i o n scheme. P^ denotes the error p r o b a b i l i t y due to the suboptimal MOLS decision scheme. The trade-off between the computational complexity and expected cost of a terminal decision i s c l e a r l y i n d i c a t e d i n Figure 8.6. The question of a trade-off arises only when the average number of features per pattern for a terminal d e c i s i o n i s l e s s than approximately 70 percent of the number of t o t a l features a v a i l a b l e from each pattern sample. Figure 8.6 shows that for a = 0.56, f o r example, the increase i n the length of search from 25 (B = 1) to 10 ( 6 = 0.4) i s only 12.5 percent. However, the saving i n the a d d i t i o n a l computation i s about 50 percent. Thus, the MOLS decision scheme with s u i t a b l e length of search allows an appropriate balance of computational complexity and expected cost of a terminal d e c i s i o n i n sequential d e c i s i o n schemes with on-line ordering of features. 8.6 Sequential Recognition Scheme with Limited Length of Search Over Preordered Features By r e s t r i c t i n g the sequential c l a s s i f i e r to observe a s u i t a b l e set of preordered features, the t o t a l expected cost of terminal decisions can be considerably reduced and at the same time the a d d i t i o n a l computation involved with on-line ordering of features can be avoided. I t i s however, d i f f i c u l t to obtain an ordering of features that would guarantee the minimum error p r o b a b i l i t y i n a sequential scheme. Moreover, i n sequential schemes the best feature at every state i s always dependent on the i n f o r -mation already a v a i l a b l e from the pattern and can not be p r e s p e c i f i e d . Figure 8.6 Comparison of computational complexity and the increase i n the e r r o r p r o b a b i l i t y due to the l i m i t e d length of search i n the MOLS de c i s i o n scheme. Thus, by r e s t r a i n i n g the order of observing the features one i s l i a b l e to introduce redundancy i n the observed features and therefore, increase the expected cost of a terminal d e c i s i o n . One way of avoiding such redundancy xrould be to provide the sequential c l a s s i f i e r with a s u i t a b l e set of preordered features and at the same time permit the on-line ordering of features on the basis of a l i m i t e d length of search. Thus, the a d d i t i o n a l computation involved i s f a i r l y small but the de c i s i o n scheme has the option to choose from a few feature components. The li m i t e d search f a c i l i t y allows the sequential decision scheme to make the best use of a set of preordered features. I t i s of course, obvious that i f f u l l length of search (LS = 25) i s employed, no matter how the features are ordered, the computational complexity and the performance of the decision scheme would be equal to those of the on-line sequential d e c i s i o n scheme discussed before. Figure 8.7 i l l u s t r a t e s the performance of the simple suboptimal sequential d e c i s i o n scheme using a set of preordered features with and without the f a c i l i t y of the l i m i t e d length of search. Three d i f f e r e n t orderings.of features (shown i n the figure) were considered and LS = 5 was used i n each case. I t i s c l e a r from Figure 8.7 that the sequential d e c i s i o n scheme with LS = 5 performs be t t e r i n each case than the usual sequential d e c i s i o n scheme, observing the p r e s p e c i f i e d feature at every state. Limited amount of search incorporated i n t o sequential d e c i s i o n schemes requires only a l i t t l e a d d i t i o n a l computation but may mean a considerable saving i n expected cost of terminal decisions. I t i s therefore, recommended that sequential d e c i s i o n schemes using a set of preordered features should always have the f a c i l i t y of l i m i t e d length of search. 4 0.0 natural ordering c tf ^ 30.0 20.0 •o D -Q O a o * 10.0 0.0 simple seqn. (preordered feats.) • MOLS scheme' error prob forced dec. 19 20 21 10 11 12 13 14 15 16 17 18 average number of features per pattern, [N(1),NF1 Figure 8.7 Comparison of performance of suboptimal simple sequent/al scf/eme using a set of pre-ordered features with and without l i m i t e d search f a c i l i t y . 156 CHAPTER IX CONCLUSIONS AND DISCUSSION 9.1 General Discussion Various sequential schemes f o r both pattern c l a s s i f i c a t i o n and feature ordering have been considered i n t h i s t h e s i s . The sequential decision schemes require a fewer features on the average than the equally r e l i a b l e nonsequential decision schemes. When the features are c o s t l y , e i t h e r because of elaborate computations or because of danger and r i s k associated (biomedical and i n d u s t r i a l applications) with the measurement operation, the sequential d e c i s i o n schemes become more a t t r a c t i v e than the nonsequential ones. The sequential d e c i s i o n schemes may also be su i t a b l e i n s i t u a t i o n s where i t i s necessary to minimize the time needed for a terminal decision ( r e a l time a p p l i c a t i o n s ) . The computation and storage required f o r the implementation of the sequential d e c i s i o n schemes are, of course, more than those needed for nonsequential schemes. From the computational point of view sequential decision schemes may some times be d i f f i c u l t or even impossible to implement. However, the sub-optimal sequential decision schemes, based on the one-state ahead trunc-ation approximation considered i n t h i s t h e s i s , r e t a i n the important a t t r i b u t e s of the sequential process and are quite p r a c t i c a b l e . U t i l i -z a t ion of contextual information i n nonsequential d e c i s i o n schemes i s quite common and i t reduces the system er r o r p r o b a b i l i t y . Use of such contextual information, i n sequential d e c i s i o n schemes i s possible through the OCSPRT and the compound sequential decision scheme formulated i n Chapters II and V r e s p e c t i v e l y . For dependent hypotheses recognition 157 problems the OCSPRT and the compound sequential decision scheme r e s u l t i n an expected cost of terminal decisions that i s lower than that of simple sequential (including SPRT) and the compound nonsequential decision schemes. The CSPR schemes permit the observation of the required addi-t i o n a l features e i t h e r on the pattern under consideration or any one of the neighbouring patterns. In dependent hypothesis r e c o g n i t i o n problems the schemes,therefore observe only the best features from each pattern sample. As a r e s u l t the expected cost of terminal decisions i s lower than that of c l a s s i c a l compound sequential and nonsequential decision schemes. The more d i s s i m i l a r the feature components are i n t h e i r e f f e c t -iveness the more e f f i c i e n t the CSPR scheme tend to be i n comparison with the compound sequential and compound nonsequential d e c i s i o n schemes. The CSPR schemes also allow the delaying of a decision i n sequential schemes for a d d i t i o n a l information from the subsequent patterns to be av a i l a b l e . Because the CSPR schemes need to search, at every d e c i s i o n state, f o r the pattern on which the desired a d d i t i o n a l feature i s to be observed, these schemes are computationally more involved than the simple and the compound sequential decision schemes of Chapter V. I t may be desirable to employ some e f f i c i e n t computational approximation i n order to reduce the complexity and the time required f o r a terminal d e c i s i o n . The on-line ordering of features i s p e c u l i a r to sequential d e c i s i o n schemes and r e s u l t s i n a considerable reduction i n the expected cost of terminal decision compared to usual sequential d e c i s i o n schemes. Although a d d i t i o n a l computation i s required f o r on-line ordering,a compromise between such a d d i t i o n a l computation and expected cost of terminal d e c i s i o n i s possible using the MOLS dec i s i o n scheme. In MOLS decision scheme on-line ordering of features i s a t t a i n a b l e but only on the basis of a l i m i t e d length of search over the a v a i l a b l e features. F i n a l l y , i t i s recommended that the sequential decision scheme using a set of preordered features should always have the f a c i l i t y of l i m i t e d length of search over the a v a i l a b l e features. 9.2 Suggestions for Future Research Some suggestions for further work follow n a t u r a l l y from the present i n v e s t i g a t i o n . These are summarized below: 1. The d e r i v a t i o n of the expression for the average number of features per pattern required for a terminal d e c i s i o n (Chapter I i s based on the assumption that at any instant k the information from the e n t i r e past i s a v a i l a b l e (t = k). It w i l l be of i n t e r e s t to obtain such an expression for any value of k > t > 1. 2. A wide v a r i e t y of CSPR schemes i s possible to be studied but only three of them have been considered i n Chapter VI. D e t a i l study of the e f f e c t s of a l l o c a t i n g costs to the feature com-ponents, and using various forms of dependence r e l a t i o n s among the pattern classes and feature vectors i s needed f o r further understanding of the CSPR schemes. The CSPRS Type 1 and Type 2 with larger value of t should also be considered. 3. It w i l l be i n t e r e s t i n g to know how the CSPR scheme performs i n s i t u a t i o n s where the feature components are quite d i s s i m i l a r i n t h e i r e f f e c t i v e n e s s . Usefulness of the CSPR schemes i n recognition problems with v a r i a b l e dimensional feature vectors should also be determined. 159 4. The CSPR schemes considered i n Chapters VI and VII are always provided with a set of preordered features. A possible genera-l i z a t i o n i s to allow the on-line ordering of features such that the decision schemes have the choice of observing any one of the remaining features from any one of the patterns of the sequence. 5. Simulation of the sequential d e c i s i o n schemes formulated for dependent hypotheses recognition problems (Chapters I I I , V and VII) i s based on the assumption that the pattern classes are f i r s t order Markov such that only bigram p r o b a b i l i t i e s are used i n the t h e s i s . Higher order dependence, such as second order Markov dependence (trigram) should also be considered to determine the change i n the expected cost of terminal de-c i s i o n s . 6. During the t r a i n i n g phase, the l i k e l i h o o d s P(x. = j | OJ ) j =0, 1, 16; £ = 1, 2, 26, f o r each feature i was estimated from a f i x e d number of pattern samples. In sequential schemes e s p e c i a l l y with on-line ordering, some features are observed more often than the others. Thus, i n -stead of using a f i x e d number of t r a i n i n g samples f o r each feature, one should use more t r a i n i n g samples f o r the features which are more frequently used. An adaptive unsupervised sequential decision scheme i s therefore needed where as soon as a pattern sample i s c l a s s i f i e d the s t a t i s t i c s of only the used feature components are updated. Such a sequential scheme would not only reduce the cost of terminal decisions during the t e s t i n g phase but also reduce the cost of estimating the r e -quired parameters during the t r a i n i n g phase. The feature e x t r a c t i o n scheme proposed i n Chapter IV promises to be a simple but e f f e c t i v e feature e x t r a c t i o n method for pattern recognition and should be investigated, The e f f e c t of considering masks of a s i z e d i f f e r e n t than 4 * 4 and poss i b l y nonsquare should be determined. Use of fewer regions, and/or quantization of the number of black points i n a region to a few values, would reduce required computation time and storage capacity. E f f e c t of d i f f e r e n t weighting systems of the points of the mask should also be determined. F i n a l l y , a l o g i c a l extension would be to implement i n p r a c t i c a l problems the various sequential d e c i s i o n schemes formulated i n t h i s t h e s i s . Possible areas of a p p l i c a t i o n s are speech recog-n i t i o n , biomedical engineering [87], [88], phased array radar, and weather p r e d i c t i o n . APPENDIX-A Detailed Derivation of Equation'2.9 In the following derivations the subscripts are omitted f o r the sake of convenience. For k $ t, we can write i i i i i r,/-vk „k-t+l k k-mN P(X k|x k-\ . . . ,Xk-t+1;Wk) = P ( X, ^ ••>X , V " , M . ~ 1 T, /„k-l ,.k-t+l k k-nu P(X , . . . ,X ;to , . . . ,io ) P(X k X k - r ; M k t • . . . / ^ I X 1 ^ X k- t + 1) " P ( X k _ \ . . . ,X k - r ; co\ . . . ,03 k- m|x k- r,. .. ,X k- t + 1) (A-l) For any r :> m and no delay i n the system (A-l) reduces to „k-r k k-nk p ( x k | x k - \ . . . , x k - t + 1 ; w k ) - p ( x r - , > x u ; " r " > a ) i } „,„k-l „k-r k k-m. ) (A-2) p(xIV "L,...>xIV L ;ur,... , to"* m ) = P(X k|Y k 1; Wk) . -k-1 k-1 k-r where Y = (X , X ), k ^ t . It can s i m i l a r l y be shown that (A-2) holds f o r k < t. Thus (2.9-a) follows immediately. To obtain (2.9-b), note that f o r any k we can write P(Wk|wk-1; Y k _ 1) = P ( ? k " V , ^ " W ; W^ 1 ) ( A _ 3 ) P(Wk Y k b for system without any delay (A-3) reduces to p ( w k | w k - l ; - k - l } = P(Y k 1;Wk~1)P(Wk, Wk-1) P(Wk"1;Yk~1) 162 P(WR ; Y R X) P(Y k 1]w k 1)P(W k|w k V ( W k 1 ) P(Y k" 1|w k _ 1)P(W k~ 1) = p(w k|w k 1) . APPENDIX B Derivation of E[£n Y ] and E[£n X ] — f o r Use i n Chapter I II nk- ; The expected values E[£n y ] a n d E[£n X ] are derived here. The r e s u l t s are analogus to Wald's r e s u l t s [2], [46]. At any instant k the sequential observation process i s t e r -minated when the p r o b a b i l i t y r a t i o y exceeds one of the two stopping n k boundaries. Neglecting any excess over the stopping boundaries, we can write 2' ( T 2 with p r o b a b i l i t y (1-e) Y n = | k ^T^ with p r o b a b i l i t y e . Then E[£n y ] = (1-eHnT. + e£n T, , X k ~ u o J n. / 1 L k however, f or = e2± = e T2 = 1 / T 1 = T = ( 1 - e ) / e Thus E[£n y n ] = e 2 ( Y ) = (l-2e) n ( - ^ ) , X k ~ ^ k k One can s i m i l a r l y show that f o r X . ~ to. E[£n Y n ]'= e 1 ( Y ) = (2e-l)£n(~^) k Let The feature vectors are assumed to be of s u f f i c i e n t l y large dimensionality such that the observation process terminates before the f i n a l state with p r o a b i l i t y one*. For c l a s s - c o n d i t i o n a l l y independent feature components, we have P(X k|eo k) = Jl P ( x k | u ) k ) , j = 1, 2 J i = l J therefore, the f ^ ' s form s t a t i s t i c a l l y independent random v a r i a b l e s . I f the process of feature observation i s terminated at state n, we obtain k D n k D I f. = I f, + I t t n = 1, 2, 1=1 1 j = l 1 £=n,+l * k J k k D = In A K + I f9 . (B.l) £=n, +1 K. The random v a r i a b l e n^ i s dependent only on the f i r s t n^ feature components and i s independent of a l l the f ^ , I > n^. The expected value of (B.l) therefore, y i e l d s D • E [ f ] = E[A k] + {D - E[n ]}-E[f] or E[A k] = E[n k] • E [ f ] k 2 W -2VJ"/» " " "2 f e 0 ( n ) • e 0 ( f ) , i f X - to ^e ] [(n) • e 1 ( f ) , i f X - to A more rigorous streatment requires taking the l i m i t as D approaches i n f i n i t y [2]. APPENDIX-C THE TEXTS FOR RECOGNITION EXPERIMENTS AND THEIR ASSOCIATED STATISTICS ** TEXT NUMBER 1 ** THE PHILOSOPHICAL DIFFERENCE BETWEEN HUMAN PERCEPT ION AND COMPUTER RECOGIUT-ION CF PRINTED CHARACTERS SEEMS TO D TIP END Civ1 THE RESEARCHERS BIAS. THE COGN-I T I V E PSYCHOLOGIST HA IN TAIN'S THAT THE D I STING CIS HI NG CHARACTERISTICS OF THE ENTIRE STIMULUS GIVE RIS E TO THE PERCEPT KITH HO'SINSLE ATTRIBUTE BEING EI TH ER NECESSARY CR SUFFICIENT IN CONTRAST THE UN GIN E'ER VIEWS PATTERN RECOGNITION AS EXTRACTING VARIOO S MATHEMATICAL MEASURES FRO K THE STIMULUS EY USING VARIOUS TRANSFORMATIONS. ALTHOUGH THESE TWO APPRO ACHES ARE NOT NECESSARILY ANTITHETICAL S O U K C E . ................... ".Character Recognition: Performance of Human Computers II by B.'Blesser and E. Peper M.I.T. Quarterly Progress Report, No. 98, 1970, pp. 185 TOTAL NUMBER OF CHARACTERS 538 NUMBER OF CHARACTERS WITHOUT BLANKS AND PERIODS 465 SET OF ALPHABETS USED IN FORMING THE TEXT 1 - 2 1 ** TEXT NUMBER 2 ** FROM A MATHEMATICAL S T A T I S T I C A L POINT OF VIEW THE ASSUMPTION OF RANDOM SAMPLING MAKES IT POSSIBLE TO DETERMINE THE SAMPLING DISTRIBUTION OF h PARTICUL AR S T A T I S T I C GIVES SOME PARTICULAR POPULATION EIST RIBUTION. IF THE VARIOUS VALUES OF THE STATISTIC C AN ARISE FROM SAMPLES HAVING U N C ET I: RM IN ED OR UNKUO WM PROBABILITIES OF OCCURRENCE THEii THE S T A T I S T I C I AN HAS NO WAY TC DETERMINE THE SAMPLING DISTRIBOTI ON OF THAT STATISTIC-SOURCE. S t a t i s t i c s by W. L. Hays Holt, Rinehart and Winston, Inc. 1963, pp. 216 TOTAL NUMBER OF 420 CHARACTERS NUMBER OF CHARACTERS WITHOUT BLANKS AND PERIODS ... 421 SET OF ALPHABETS USED IN FORMING THE TEXT 22 - 42 167 * * T E X T N U M B E R 3 * * I N S E A R C H I N G F O R R E G I O N S O F T H E C O R T E X T H A T M I G H T C O N T R I B U T E T O T H E F E R F O RM A l l C E O F T H E HIGHER I M T E L L E C T U A L P R O C E S S E S WE W O U L D N O T BE INCLINED T O P A Y T OO M U C H A T T E N T I O N T O T H O S E C O R T I C A L A R E A S T H A T HA V E A L R E A D Y B E S N E S T A B L I S H E D T C B E T E R M I N A L P O I N T S F OR T H E PERIPHERAL NERVES A N D T H A T T H E R E F O R E A R E K N OWN T O B E P R I M A R I L Y Erf G A G E D I K T H E R E C E I P T . O F S E N S O R Y I N F O R M A T I O N OR T H E T R A N S M I S S I O N O F M O T O R C O M M A N D S T O T H E O U T L Y I N G R E G I O N S O F T H E B O D Y . O F T H E S E V E R A L H U N D R E D S Q U A R E I N C H E S O F S U R F A C E A R E A I N T H E C E R E B R A L C O R T E X C ML Y A B O U T O N E F O U R T H I S U S E E F O R T H E S E S E N S O R I M O T O R P R O C E S S E S a. SOURCE . The Machinery of the Brain by E. Wooldridge McGraw-Hill, 1963 pp. 145 b. TOTAL NUMBER OF CHARACTERS 580 c. NUMBER OF CHARACTERS WITHOUT BLANKS AND PERIODS 483 d. SET OF ALPHABETS USED IN FORMING THE TEXT 43 - 63 168 * * T E X T N U M B E R DEAR SIR. I HAVE REVIEWED YOUR IDEA FOR DESIGN ENT RY SWEEP GENERATOR AND AM HAPPY TO ACCEPT IT FOR P UBLICATION IN ELECTRONIC DESIGN. 0 HFOfi T U N ATEL Y PUB LIGATION WILL BE DELAYED FOR A FEW MONTHS DUE TO T HE LARGE BACKLOG CF IDEAS FOR DESIGN WE NOW HAVE 0 N HAND. PRIOR TO PUBLICATION THOUGH YOU WILL R EC EI VE A COPY OF THE FINAL EDIT EC VERSION OF YOUR MATE RIAL FOR APPROVAL. IN ADDITION YOU WILL RECEIVE PA YK ENT FOR YOUR ENTRY SHORTLY. YCU ARE MOW E L I G I B L E FOR THE MOST VALUABLE OF I S S U E AWARD AS DETER DINE D BY OUR READERS a. SOURCE... A L e t t e r from the E d i t o r of a Publishing Company b. TOTAL NUMBER OF CHARACTERS 516 c. NUMBER OF CHARACTERS WITHOUT BLANKS AND PERIODS 421 d . SET OF ALPHABETS USED IN FORMING THE TEXT 64 - 84 169 ** TEXT NUMBER 5 ** THE SHARE 0? TOTAL MANUFACTURING ASSETS HELD BY TH E TOP COMPANIES MAY BE INCREASING BUT PROBABLY NOT NEARLY AS MUCH AS THE GOVERNMENT SUGGESTS. AND WH-AT SEEMS L I K E A DRAMATIC RISE I N SALES CO N CENT RAT I ON WITHIN A MAJOR CATEGORY OF CONSUMER GOODS INDUS TRIES VANISHES ALMOST ENTIRELY WHEN TEE RATHER PRI MITIVE ST A T I S T I C A L TECHNIQUES USED IN THE GOVERNME NT STUDIES ARE SET ASIDE IN FAVCR OF METHODS THAT ARE MORE SOPHISTICATED THOUGH QUITE COMMONLY EMPLO YED BY ECONOMISTS a. SOURCE "Bigness i s a Number Game" by S = Rose Fortune v o l . LXXX, No. 6, November, 1969 pp. 112-115 b. TOTAL NUMBER OF CHARACTERS 467 c. NUMBER OF CHARACTERS WITHOUT BLANKS AND PERIODS 394 d. SET OF ALPHABETS USED IN FORMING THE TEXT 85 - 105 * * T E X T N U M B E R 6 ** G E O R G E C . S C O T T COMES AS C L O S E T O F I T T I N G H I S EE I I N I T I O N O F T H E I D E A L A C T O R A S O N E MAN C A N W I T H O U T B R EA K I N G A P A R T I ETC T H R E E D I S P A R A T E - I N D I V I D U A L S . I N H I S L I F E O F F S T A G E HE HAS B E E N S T U B B O R N L Y E V E N VIO L E N T L Y I N D I V I D U A L WHEN HE I S A C T I N G HE C R E A T E S A C H A R A C T E R AND H I D E S H I S I N D I V I D U A L I T Y W I T H S I N G U L A R S U C C E S S AS T H E MAM I N R3 W T E N H E I S A P E R F E C T I O N I S T C R I T I C MORE D E M A N D I N G OF H I MS E L F T E AN O F T H O S E A R O U N D H I M . I N MORE THAN A D O Z E N S T A G E AND S C R E E N R O L E S I M A S T E A D I L Y G R O W I N G C A R E E R S C O T T HAS DEMON S T R A T E D T H A T H E I S O N E OF T H E B E S T O F C O N T E M P O R A R Y A C T O R S SOURCE Time March 22, 1971 pp. 51 TOTAL NUMBER OF CHARACTERS 557 NUMBER OF CHARACTERS WITHOUT BLANKS AND PERIODS 457 SET OF ALPHABETS USED IN FORMING THE TEXT 106 - 126 ** T E X T N U M E E R 7 ** THE RANGE! AtiDS CF THE WESTERN UNITED STATES. THESE VAST TRACTS ARE MAINLY U T I L I Z E D FOR THE GRAZING 0 F CATTLE. THEY COULD DE MUCH MORE PRODUCTIVE NOT 0 NLY FROM THE ECONOMIC STANDPOINT BUT ALSO IN THE R ECREATIOM A L AND THE AESTHETIC SENSE. THE CENTURY 0 LC WAR OVER THE USE OF THE WIDE OPEN SPACES OF THE UNITED STATES IS NCW ESCALATING YEAR BY YEAR IN S COPE AND COMPLEXITY. IN THE NINETEENTH CENTURY IT WAS A RELATIVELY SIMPLE CONFLICT MAINLY BETWEEN TH E CATTLE RANCHERS AND THE SHEEP RAISERS SOURCE "The Rangeland of Western United States I I by R. M. Love S c i e n t i f i c American v o l . 222, No. 2, pp. 89 February, 1970 TOTAL NUMBER CHARACTERS 490 NUMBER OF CHARACTERS WITHOUT BLANKS AND PERIODS 402 SET OF ALPHABETS USED IN FORMING THE TEXT 127 - 147 172 REFERENCES 1. R. L. Kashyap, "Algorithm for Pattern C l a s s i f i c a t i o n " , Adaptive, Learnings, and Pattern Recognition Systems Theory and A p p l i c a t i o n s , J . M. Mandle and K. S. Fu Ed., Academic Press, New York, 1970. 2. A. Wald, "Sequential A n a l y s i s " , John Wi l l e y , New York, 1947. 3. D. Blackwell and M. A. G i r s h i c k , "Theory of Games and S t a t i s t i c a l Decisions", John Willey, New York, 1954. 4. A. Wald and J . Wolfwitz, "Optimum Character of the Sequential P r o b a b i l i t y Ratio Test", Ann. Math. S t a t i s t . Vol. 19, pp. 326-339, 1948. 5. J . J . Bussgang and D. Middleton, "Optimum Sequential Detection of Signals i n Noise", IRE Trans. Info. Theory, Vol. IT-1, pp. 5-19, December, 1955. 6. T. G. B i r d s a l l and R. A. Roberts, "Theory of Signal D e t e c t a b i l i t y : Deffered Decision Theory", The J. of Acoust. Soc. of Am., Vol. 37, No. 6, pp. 1064-1074, June, 1965. 7. R. A. Roberts, "Sequential Detection and Composite Hypotheses with A p p l i c a t i o n to a Signal of Uncertain Amplitude", IEEE Trans. Info. Theory, Vol. T.T-13, pp- 175-183. A p r i l , 1967. 8. K. S. Fu, "A Sequential Decision Model for Optimum Recognition", B i o l o g i c a l Prototypes and Synthetic Systems, Bernard and Kare Ed., Vol. 1, Plenum Press, New York, 1962. 9. C. H. Chen,"A Study of Pattern Recognition Systems with a Sequential Learning Procedure", Ph.D. d i s s e r t a t i o n , School of E l e c t r i c a l Engineering, Pardue U n i v e r s i t y , 1965. 10. , "A Note on Sequential Decision Approach to Pattern Recognition and Machine Learning", Information and Control, Vol. 9, pp. 549-562, 1966. 11. K. S. Fu, "Sequential Methods i n Pattern Recognition and Machine Learning", Academic Press, New York, 1968. 12. T. W. Anderson, "A M o d i f i c a t i o n of the sequential P r o b a b i l i t y Ratio Test to Reduce the Sample Size", Ann. Math. Stat., Vol. 31, March, 1960. 13. J . J . Bussgang and M. B. Marcus, "Truncated Sequential P r o b a b i l i t y Ratio Test", IEEE Trans. Info. Theory, Vol. IT-13, pp. 512-516, July, 1965. 14. Y. T. Chien and K. S. Fu, "A Modified Sequential Machine Using Time Varying Stopping Boundaries", IEEE Trans. Info. Theory, Vol. IT-12, pp. 206-214, A p r i l , 1965. 173 15. E. Wong and A. Steppe, "Invarient Recognition of Geometric Shapes", Methodologies of Pattern Recognition", S. Watanabe Ed., Academic Press, New York, pp. 535-5461969. 16. F.C. Reed, "A Sequential M u l t i d e c i s i o n Procedure", Proc. of Symp. on Decision Theory and A p p l i . to Equip. Devp., Rome A i r Development Center, V o l. 1, pp. 42-69, May, 1960. 17. R.A. Robert and C.T. M u l l i s , "A Bayes Sequential Test of m-Hypotheses", IEEE Trans. Info. Theory, Vol. IT PP. 91-94, January, 1970 18. L.C. Palmer, Sequential Test to Select Among M-Hypotheses", IEEE Trans. Info. Theory, Vol. IT-18, pp. 211-214, January, 1972. 19. A. A. Agl i n t s e v and A. P. Ter-Saakov, "Many A l t e r n a t i v e Sequential Analysis", Engineering Cybernetics, No. 1, pp. 160-168, 1970. 20. R. Bellman, R. Kalba, and D. Middleton, "Dynamic Programming, Se-quential Estimation and Sequential Detection Processes", Proc. Nat. Acad. Sciences, V o l. 47, pp. 338-341, 1961. 21. R. Bellman, "Dynamic Programming", Princeton Univ. Press, Princeton, New Jersey. 22. K. S. Fu and G. P. C a r d i l l o , "An Optimum F i n i t e Sequential Procedure f o r Feature S e l e c t i o n and Pattern C l a s s i f i c a t i o n " , IEEE Trans. Auto. Control.Vol. .12., pp. 588-591, October., 1.967. 23. G.P. C a r d i l l o and K. S. Fu, "A dynamic Programming Procedure f o r Sequentila Pattern C l a s s i f i c a t i o n and Feature S e l e c t i o n " , Intern. J . Math. Biosciences, V o l. 1, No. 3, pp. 463-491, 1967. 24. Y. T. Chien and K. S. Fu, "An Optimal Pattern C l a s s i f i c a t i o n System Using Dynamic Programming", Intern. J . Math. Biosciences, V o l. 1, No. 3, pp. 439-461, 1961. 25. K. S. Fu, Y. T. Chien and G. P. C a r d i l l o , "A Dynamic Programming Approach to Sequential Pattern Recognition", IEEE Trans. E l e c t . Comp., V o l . EC-16, pp. 790-803, Dec. 1967. 26. Z. J . N i k o l i c and K. S. Fu, "On the Se l e c t i o n of Features i n S t a t i s t i c a l Pattern Recognition", Proc. Ann. Princeton Conf. Inform. S c i , and Syst., pp. 434-438,Mar. 1968. 27. Y. T. Chien, "A Sequential Decision Model f o r Sele c t i o n of Feature Subsets i n Pattern Recognition", IEEE Trans. Comp. V o l . C-20, No. 3, pp. 282-290, March,1971. 28. G. E. Lasker, " M u l t i v a r i a t e Sequential Pattern Recognition Technique", Cybernetica, V o l. XII, No. 2, pp. 67-80, 1969-29. R. A l t e r , " U t i l i z a t i o n of Contextual Constraints i n Automatic Speech Recognition", IEEE Trans. Audio and E l e c t r o a c o u s t i c , V o l. AV-16, No. 1. pp. 6-11, March, 1968. 174 30. G. Carlson, "Technique f o r Replacing Character that are Garbled on Input", AFIPS Conf. P r o c , Vol. 128, pp. 189-192, 1966. 31. W.W. Bledsoe and J . Browning, "Pattern Recognition and Reading by Machine", Pattern Recognition, L. Uhr Ed. }pp. 301-316, Wiley, New York, 1966. 32. J. Raviv, "Decision Making i n Markov Chains Applied to the Problem of Pattern Recognition", IEEE Trans. Info. Theory, V o l . IT-13, pp. 536-551, Oct. 1967. 33. K. Abend, "Compound Decision Procedures for Unknown D i s t r i b u t i o n s and for Dependent States of Nature", Pattern Recognition, L. Kanal Ed. } Thomson Book Co., Washington, D.C, -1968. 34. K. Abend, "Compound Decision Procedure f o r Pattern Recognition", Proc. National E l e c t . Confer., Vol. 22, pp. 777-779, 1966. 35. R. W. Chang and J . C. Hancock, "On Receiver Structure of Channel having Memory", IEEE Trans. Inform. Theory, Vol. IT-12, pp. 463-468, Oct. 1966. 36. R. 0. Duda and P. E. Hart, "Experiments i n the Recognition of Hand-Printed Text Part II-Context A n a l y s i s " , AFIPS Conf. P r o c , pp. 1139-1149, 1968. 37. A. W. Drake, "Observation of a Markov Source Through Noisy Channel", IEEE Symp. Signal Trans, and Processing, pp. 12-18, 1965. 38. R. W. Donaldson and G. T. Toussaint, "Use of Contextual Constraint i n Recognition of Contour-Traced Hand-Printed Characters, IEEE Trans. Computers, Vol. C-19j.No.22, pp. 1096-1099, Novem. 19 70. 39. E. M. Riseman and R. W. E h r i c h , "Contextual Word Recognition Using Binary Diagrams", IEEE Trans. Comp., Vol. C-20, No. 4, pp. 397-403, A p r i l 1971. 40. C. G. Hilborn, J r . and D. G. L a i n i o t i s , "Unsupervised Learning Minimum Risk Pattern C l a s s i f i c a t i o n f o r Dependent Hypotheses and Dependent Measurements", IEEE.Trans. System, Sciences and Cyber , Vol. SSC-5, pp. 109-115, A p r i l , 1969. 41. C. G. Hilborn and D. G. L a i n i o t i s , "Optimal Unsupervised Learning Multicategory Dependent Hypotheses Pattern Recognition", IEEE Trans. Info. Theory,Vol. IT-14, pp. 468-470, May, 1968. 42. E. Samuel, "On Simple Rules f o r the Compound Decision Problem", Journal of the Royal S t a t i s t i c a l Society, Series B., Vol. 27, No. 2, pp. 238-240, 1965. 43. E. Samuel, "Asymptotic Solution of the Sequential Compound Decision Problem", Ann. of Math. S t a t i s t . V o l. 34, No. 3, pp. 1079-1094, 1963. 44. A. B. S. Hussain, G. T. Toussaint and R. W. Donadlson, "Results 0b-175 tained Using a Simple Character Recognition Procedure on Munson's Handprinted Data", IEEE Trans. Computers, Vol. C-21, pp. 201-205, February, 1972. 45. C. K. Chow, "An Optimum Character Recognition System Using Decision Functions", IRE Trans. E l e c t . Computers, Vol. 6, pp. 247-254, Dec-ember 1957. 46. J . C. Hancock and P. A. Wintz, "Signal Detection Theory", McGraw-H i l l , New York, 1966. 47. E.. E. Beckenback and R. Bellman, " I n e q u a l i t i e s " , 2nd Ed., Springer-Verlag, New York, 1965. 48. W. H. Highleyman, "The Design and Analysis of Pattern Recognition Experiments", B e l l Syst. Tech. J . , pp. 723-744, March, 1962. 49. P. A. Lachenbruch and M. R. Mickey, "Estimation of E r r o r Rates i n Discriminant Analysis", Technometrics, V o l . 10, pp. 1-11, Feb., 1968. 50. L. Kanal and B. Chandasekaran, "On Dimensionality and Sample Size i n S t a t i s t i c a l Pattern C l a s s i f i c a t i o n " , Proc. Nat. E l e c t r o n i c s Conf., pp. 2-7, 1968. 51. J . H. Munson, "Experiments i n the Recognition of Hand-Printed Texts", 1968 F a l l J o i n t Computer Conference, AFIPS P r o c , V o l . 33, pt. 2, •Wash. D. C . Thompson, pp. 1-125-11-38, 1968. 52. J . H. Munson, "The Recognition of Hand-Printed Text", Pattern Recog- n i t i o n , L. Kanal, Ed., Thompson Book Co., Wash. D. C , pp. 115-140, 1968. 53. A. L. K n o l l , "Experiments with ' C h a r a c t e r i s t i c L o c i ' f or Recognition of Hand-Printed Characters", IEEE Trans. Comp., Vol. EC-18, pp. 366-372, A p r i l , 1969. 54. J . K. Clemens, "Optical Character Recognition for Reading Machine Appl i c a t i o n s " , Ph.D. Thesis, Dept. of E l e c t . Engg., MIT, 1965. 55. M. D. Levine, "Feature E x t r a c t i o n : A Survey", Proc. of IEEE, V o l . 57, No. 8, pp. 1391-1407, August, 1969. 56. G. F. Hugehes, "On the Mean Accuracy of S t a t i s t i c a l Pattern Recog-n i z e r s " , IEEE Trans. Info. Theory, Vol. IT-14, pp. 55-63, Jan., 1968. 57. B. Chandrasekaran, "Independence of Measurements and Mean Recognition Accuracy", IEEE Trans. Info. Theory, Vol. IT-17, pp. 452-456. 58. G. T. Toussaint, "Machine Recognition of Independent and Contextually Constrained Contour-Traced Hand-Printed Characters", M.A.Sc. Thesis, Dept. of E l e c t r i c a l Engineering, Univ. B r i t i s h Columbia, 1969. 59. G. T. Toussaint and R. W. Donaldson,"Algorithm for Recognizing 176 Contour-Traced Hand-Printed Characters", IEEE Trans. Comp., Vol. C-19, pp. 541-546, June, 1970. 60. A. B. S. Hussain, "A Sequential Test for Determining the Dependency Structure of Binary Features", submitted for P u b l i c a t i o n i n IEEE Trans. System, Man, and Cyber. 61. I. J . Good, The Estimation of P r o b a b i l i t i e s , Research Monograph 20, Cambridge, Mass. MIT Press, 1965. 62. B. Ru s s e l l , "Education and S o c i a l Order", London, G. A l l e n and Unwin Ha., 1933. 63. C. A. Dykema, "A Computer Simulation Study of the A p p l i c a t i o n of Contextual Constraints to Character Recognition", M.A.Sc. Thesis, Dept. of E l e c t . Engg., U n i v e r s i t y of B r i t i s h Columbia, September, 1968. 64. B. Chandrasekaran and L. Kanal, "On L i n g u i s t i c , S t a t i s t i c a l , and Mixed Models for Pattern Recognition", Technical Report OSU-CSRC-TR-71-3, Computer and Information Sciences Research Center, Ohio State U n i v e r s i t y , March, 1971. 65. G. P. C a r d i l l o and K. S. Fu, "On Suboptimal Sequential Pattern Recognition", IEEE Trans. Comp., Vol. C-17, No. 8,pp. 789-792, August, 1968. 66. B. E. Boyle, "Model f o r Therapeutic Diagnosis of Disease", MIT Quat. Report, QPR No. 97, pp. 158-165, A p r i l , 1969. 67. Y. L. Barabash, "On Properties of Symbol Recognition", Eng. Cybern. (USSR), pp. 71-77, September-October, 1965. 68. A. B. S. Hussain, "On the Correctness of Some Sequential C l a s s i -f i c a t i o n Schemes i n Pattern Recognition", IEEE Trans. Comp., Vol. C-21, pp. 318-320, March, 1972. 69. A. B. S. Hussain and R. W. Donaldson, "Some Sequential Sampling Techniques f o r Multicategory Dependent Hypotheses Pattern Recognition", Proc. F i f t h Hawaii Inter. Conf. on System Sciences, pp. 40-43, January, 1972. 70. A. B. S. Hussain, "Sequential Decision Schemes with On-Line Feature Ordering", Proc. Ann. Canadian Computer Conference, SESSION'72, June 1-3, 1972. 71. A. Bhattacharyya, "On a Measure of Divergence Between Two Multinomial Populations", Sankhya, V o l . 6, pp. 401-406, 1964. 72. T. K a i l a t h , "The Divergence and Bhattacharyya Distance Measure i n Signal S e l e c t i o n " , IEEE Trans. Comm. Tech.,Vol. COM-15, pp. 52-60, Febuary, 1967. 73. J . T. Tou and R. P. Heydorn, "Some Approaches to Optimum Feature 177 E x t r a c t i o n " , Computer and Information Sciences I I , J . T. Tou, Ed. New York, Academic Press, pp. 57-89, 1967. 74. H. F. Ryan, "The Information Content Measure as a Performance C r i -t e r i o n f or Feature S e l e c t i o n " , IEEE Proc. of Seventh Symposium on Adaptive Processes, Los Angles, C a l i f . , 16-18 December, 1968. 75. P.M. Lewis, "The C h a r a c t e r i s t i c S e l e c t i o n Problem i n Recognition Systems", IRE Trans. Inform. Theory, IT-8, pp. 171-178, Feb., 1962. 76. P. J . Min, D. A. Landgrebe and K. S. Fu, "On Feature S e l e c t i o n i n Mu l t i c l a s s Pattern Recognition", Proc. 2nd Ann. Princeton Confer, on Info., Sciences and Systems, IEEE 68, CT-1, pp. 453-457, 1968. 77. A.-N. Mucciardi and E. E. Gose, "A Comparison of Seven Techniques for Choosing Subsets of Pattern Recognition Properties", IEEE Trans. Comp. Vol.,C-20, pp. 1023-1031, September, 1971. 78. G. D. Nelson and D. M. Levy, "A Dynamic Programming Apporach to the Selec t i o n of Pattern Features", IEEE Trans. Syst. S c i . and Cyber., Vol. SSC-4, No. 2, pp. 145-151, July 1968. 79. G. T. Toussaint and A. B. S. Hussain, "Comment on 'A Dynamic Pro-gramming Approach to the Se l e c t i o n of Pattern Features", IEEE Trans. •Syst. -Man and Cyber, Vol. SMC-1, -No. -2,..pp.. -186.,-April, -19.71,. 80. H. V. Pipeberger and F. W. Stallman, "computation of D i f f e r e n t i a l Diagnosis i n Electrocardigraphy", Ann. N.Y. Acad. S c i . , V o l . 115, pp. 1115-1125, July, 1964. 81. P. S. R. S. Rao, "On Selecting Variables f o r Pattern C l a s s i f i c a t i o n " , Tech. Report No. 11, Information Research Associates, Inc., Lexington, Massachusetts, June, 1967. 82. G. S. Sebestyen and J . Edie, "Pattern Recognition Research", L i t t o n Systems, Waltham, Mass., F i n a l Rep., Contract AF 19 (628)-1604 AD620236, July 1965. 83. ' J . D. El a s h o f f , R. M. Elashoff and G. E. Goldman, "On the Choice of Variables i n C l a s s i f i c a t i o n Problem with Dichotomans Variables", Biometrika, Vol. 54, pp. 668-670, 1967. 84. G. T. Toussaint, "Note on Optimal S e l e c t i o n of Independent Binary valued Features for Pattern Recognition", IEEE Trans. Info. Theory pp. 618,- September 1971. 85. S. Si e g e l , "Nonparametric S t a t i s t i c s f o r the Behaviourial Sciences", McGraw-Hill, New York, 1956, Chapter 9. 86. W. L. Hays, " S t a t i s t i c s " , Holt, Rinehart and Winston, Inc., New York, 1963, Chapter 18. E. Persoon and K. S. Fu, "Medical Diagnosis Using Dynamic Sequential Pattern Recognition", Proc. F i f t h Hawaii Int. Conf. on System Sciences, Suppl. Computer i n Biomedicine, pp. 161-163, January, 1972. K. S. Fu, and M. H. Loew, "Automatic Medical Diagnosis Using Non-parametric Sequential C l a s s i f i c a t i o n Procedures", AGARD Avionics 21st Technical Symposium on A r t i f i c i a l I n t e l l i g e n c e , May 24-28, 1971, Rome, I t a l y .
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Sequential decision schemes for statistical pattern...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Sequential decision schemes for statistical pattern recognition problems with dependent and independent… Hussain, Abul Bashar Shahidul 1972
pdf
Page Metadata
Item Metadata
Title | Sequential decision schemes for statistical pattern recognition problems with dependent and independent hypotheses |
Creator |
Hussain, Abul Bashar Shahidul |
Publisher | University of British Columbia |
Date Issued | 1972 |
Description | Sequential decision schemes for the purpose of both pattern classification and feature ordering are investigated. An optimum compound sequential probability ratio test (OCSPRT) for recognition problems with memory in the source as well as the observation medium is developed. The results of theoretical analysis and computer simulation of the OCSPRT for a two-class problem with first order Markov dependence among the pattern classes are presented. For multiclass recognition problems the suitability of Bayes sequential decision schemes based on one-state ahead truncation approximation, with and without on-line feature ordering, is assessed from the points of view of computational complexity and expected cost of a terminal decision. The Bayes sequential decision scheme for dependent hypothesis problems is formulated and its performance is compared with that of the optimum compound nonsequential decision scheme. For dependent hypothesis recognition problems, compound sequential pattern recognition schemes (CSPRS) are formulated. In CSPR schemes the required additional feature is observed either on the pattern to be decided, as in the classical sequential schemes, or on any one of the neighbouring patterns. The pattern selected and the feature actually observed are the ones which provide the maximum amount of additional information. The results of analytical and experimental evaluation of the CSPR schemes are presented. The suitability of the suboptimal sequential decision scheme with on-line ordering of features as a feature evaluation and ordering criterion is discussed. A modified on-line sequential (MOLS) decision scheme based on limited length of search is proposed as a compromise between the additional computational complexity and improvement in the recognition performance resulting from the on-line ordering of features. The advantage of incorporating such limited length of search over available features into sequential decision schemes using a set of preordered features is also examined. For the purpose of experimental evaluation of the various decision schemes, recognition of handprinted English text as a particular example of a pattern recognition problem was simulated on a digital computer. The handprinted characters were obtained from Munson's multiauthor data file prepared at Stanford Research Institute. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2011-03-21 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0101265 |
URI | http://hdl.handle.net/2429/32648 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-UBC_1972_A1 H88.pdf [ 8.85MB ]
- Metadata
- JSON: 831-1.0101265.json
- JSON-LD: 831-1.0101265-ld.json
- RDF/XML (Pretty): 831-1.0101265-rdf.xml
- RDF/JSON: 831-1.0101265-rdf.json
- Turtle: 831-1.0101265-turtle.txt
- N-Triples: 831-1.0101265-rdf-ntriples.txt
- Original Record: 831-1.0101265-source.json
- Full Text
- 831-1.0101265-fulltext.txt
- Citation
- 831-1.0101265.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0101265/manifest