Open Collections

UBC Theses and Dissertations


Investigation of time-domain measurements for analysis and machine recognition of speech Ito, Mabo Robert 1971


INVESTIGATION OF TIME-DOMAIN MEASUREMENTS FOR ANALYSIS AND MACHINE RECOGNITION OF SPEECH

by

MABO ROBERT ITO
B.Sc. (E.P.), University of Manitoba, 1960
M.Sc. (E.E.), University of Manitoba, 1963

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in the Department of Electrical Engineering

We accept this thesis as conforming to the required standard

Research Supervisor
Members of the Committee
Head of the Department

THE UNIVERSITY OF BRITISH COLUMBIA
May, 1971

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of
The University of British Columbia
Vancouver 8, Canada

ABSTRACT

At present in speech analysis and mechanical speech recognition work, spectral measurements are the conventional form of signal representation and acoustical descriptions of speech sounds are usually given in terms of this form of representation. In this thesis, certain time-domain measurements are investigated as an alternative form of signal representation and as a basis for acoustical characterization of speech sounds. The primary measurements studied are the short-time averages of the zero-crossing rate of the acoustic waveform and the distribution patterns of the time intervals between zero-crossings. These measurements are found to be easy to implement with digital techniques and are implemented through digital computer simulation.
Other advantages of these measurements include effectiveness in handling the large intensity range of speech sounds and ability to track rapid transient phenomena such as the release of unvoiced stops. Computer software for an interactive graphics facility was developed for acquisition, presentation, manipulation and analysis of the acoustic speech data. One of the pattern analysis programs, for the display of time-interval distribution data, yielded a visual presentation which could be compared to frequency spectrograms. Theoretical expressions are developed to relate the time-domain and spectral representation for some phone types and these relationships are compared with experimental results. The above theoretical expressions show that important spectral characterization features are accounted for. These findings, combined with empirical observation of the utility of the time-domain signal representation in phonetic characterization, indicate that this form of representation is a useful alternative to the spectral representation.

The speech materials employed were selected to study temporal structures and contextual variations of acoustic properties and to provide quantitative data useful for word recognition applications. The vowels, fricatives and stops were the main phoneme classes studied. Quantitative data on the acoustic properties of the selected phonemes is presented and discussed in terms of i) our own spectral data, ii) other data reported in the literature and iii) simple production models. The time-domain signal representation was found to provide an effective means of analyzing and characterizing the acoustically complex stops and voiced fricatives.
For the vowels and unvoiced fricatives, which are well suited to spectral analysis, the time-domain measurements were found to yield very simple and direct characterization features. Some limited phonemic decomposition and machine recognition work is described which demonstrates the design of useful characterization features and provides a basis for further work.

TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF ILLUSTRATIONS
LIST OF TABLES
ACKNOWLEDGEMENT

CHAPTER I INTRODUCTION
1.1 Problem Area
1.2 General Discussion of Past Work
1.3 Scope of Thesis

CHAPTER II DATA BASE AND MEASUREMENTS
2.1 Data Base and Data Processing
2.1.1 The data base
2.1.2 Data acquisition and processing
2.2 Spectral Analysis of Speech
2.2.1 Frequency versus temporal resolution
2.2.2 Identification of signal type
2.2.3 Implementation of short-time spectral analysis
2.2.4 Loss of information and data reduction
2.2.5 A computer-controlled, filter-bank spectrum analyzer
2.3 Time-Domain Measurements
2.3.1 Types of measurements and general advantages
2.3.2 Relevance of previous work on zero-crossing analysis
2.3.3 Theoretical relationship between mean zero-crossing rates and the frequency spectrum
2.3.4 Measurement of zero-crossing rates
2.3.5 Measurement and presentation of zero-crossing distances
2.3.6 Relative intensity measurement

CHAPTER III ANALYSIS AND RECOGNITION OF VOWELS
3.1 Nature of Vowels and Their Characterization
3.1.1 General acoustical properties
3.1.2 Vowel production
3.1.3 Spectral measurement techniques
3.1.4 Machine classification of vowels
3.2 Reference Vowel Formant Data
3.3 Z-D Measurements for Vowels
3.4 Theoretical Relationship Between Formant Structure and Z-D
3.5 Comparison of Theoretical and Measured Z-D Values
3.6 Zero-Crossing Distance Analysis of Vowels
3.7 Vowel Form Factor Analysis
3.8 Duration of Vowels
3.9 Machine Isolation of Vowels
3.10 Machine Classification of Vowels

CHAPTER IV ANALYSIS AND RECOGNITION OF FRICATIVES
4.1 General Characteristics of Fricatives
4.1.1 Properties of unvoiced fricatives
4.1.2 Properties of the voiced fricatives
4.1.3 Functional burdening of the fricatives
4.2 Z-D Analysis of Unvoiced Fricatives
4.2.1 Z-D measurements
4.2.2 Theoretical relationship between spectral structure and the Z and D values
4.2.3 Calculated Z and D values
4.2.4 Estimation of Z and D from a simple spectral model
4.3 Other Measurements for Unvoiced Fricatives
4.3.1 Zero-crossing distances
4.3.2 Duration measurement
4.3.3 Intensity measurement
4.4 Machine Isolation and Classification of Unvoiced Fricatives
4.4.1 Machine isolation of the unvoiced fricatives
4.4.2 Machine classification of the unvoiced fricatives
4.5 Analysis of the Voiced Fricatives
4.5.1 A theoretical relationship between spectral structure and Z-D
4.5.2 Acoustic properties of the voiced fricatives
4.6 Machine Isolation and Classification of the Voiced Fricatives
4.6.1 On the separability of the isolation and classification functions
4.6.2 Identification of /z/
4.6.3 Identification of /j/
4.6.4 Identification of /v/

CHAPTER V ANALYSIS AND RECOGNITION OF STOPS
5.1 General Characteristics of Stops
5.2 Analysis of the Stop Consonants
5.2.1 The stop-gap or quiet interval
5.2.2 Vowel formant transitions
5.2.3 Aspiration region and voicing onset
5.2.4 Stop release and frictional noise regions
5.3 Machine Isolation and Classification of Unvoiced Stops
5.3.1 Isolation of unvoiced stops
5.3.2 Classification of the unvoiced stops
5.4 Machine Isolation and Classification of Voiced Stops
5.4.1 Isolation of voiced stops
5.4.2 Classification of the voiced stops

CHAPTER VI FURTHER DISCUSSION AND CONCLUSIONS
6.1 Acoustic Complexity of Phonemes
6.2 Conclusions
6.3 Suggestions for Further Work

APPENDICES
APPENDIX A SPEECH DATA PROCESSING FACILITY
A.1 System Composition
A.2 Primary Data Input
A.3 Secondary Data Input
A.4 General Operational Procedures
APPENDIX B THE FILTER-BANK SPECTRUM ANALYZER
APPENDIX C TIME-DOMAIN SIGNAL ANALYSIS PROGRAMS
APPENDIX D SPECTRUM DISPLAY AND ANALYSIS PROGRAMS
APPENDIX E PHONETIC TRANSCRIPTION SYSTEM AND WORD LISTS
BIBLIOGRAPHY

LIST OF ILLUSTRATIONS

Fig. 2-1 Sonagraphs of "fear" and "zealous"
Fig. 2-2 Time Graphs of (a) Intensity, I, (b) D and (c) Z for the Word "sit"
Fig. 2-3 Spectrograms and Time-Interval Histograms for "fear" and "zealous"
Fig. 3-1 Measured F1 and F2 for Vowels Contained in a Variety of Consonantal Environments
Fig. 3-2 Comparison of F-Values for Vowels from Published Data
Fig. 3-3 Measurements of Z and D for Vowel Nuclei
Fig. 3-4 A Vowel Production Model
Fig. 3-5 Effect of Variation of Equivalent Formant Amplitudes on Z and D for Vowels
Fig. 3-6 Estimates of Z and D for Vowels Calculated from Formant Parameters
Fig. 3-7 T-I-H Analysis for Unfiltered Vowels
Fig. 3-8 L_m and H_m Measurements for Vowels
Fig. 3-9 Form-Factor Measurements for Vowels
Fig. 3-10 Observed Duration Ranges for Vowels
Fig. 3-11 Results of the Operation of a Simple Vowel Isolation Algorithm Employing Only Intensity Information (top time graph of each set). Vowel region is region of high intensity. Words shown are: (a) "sit" (b) "kun" (c) "tool" and (d) "geel"
Fig. 3-12 Z' and D' Results from Machine Isolation of Vowels
Fig. 3-13 Vowel Classification Algorithm
Fig. 4-1 Z-D Values for Unvoiced Fricatives - Showing Measured and Calculated Values
Fig. 4-2 Simple UVF Production Model
Fig. 4-3 Variation of Z and D with Variation in Parameter Values of the Spectral Model
Fig. 4-4 Observed Frictional Noise Duration Ranges for UVFs
Fig. 4-5 UVF Classification Algorithm
Fig. 4-6 Simplified Production Model for Voiced Fricatives
Fig. 4-7 (a) Time Graphs of I, D, Z and X for "has" (b) Merged HP and LP-TIHs for "zealous" (c) Time Graphs of I, D, Z and X_ZD for "zebra" (d) D versus Z Plot for /z/ in "zebra"
Fig. 5-1 Stop-Gap Detection for "sit"
Fig. 5-2 Stop-Gap Detection for "sid"
Fig. 5-3 Voicing Latency for /g/ in "gun"
Fig. 5-4 Release of /b/ in "bun"
Fig. 5-5 Effect of Context on Stop-Gap Durations
Fig. 5-6 HP - T-I-H for "kool"
Fig. 5-7 HP - T-I-H for "kun"
Fig. 5-8 HP - T-I-H for "tool"
Fig. 5-9 HP - T-I-H for "pun"
Fig. 5-10 HP - T-I-H for "log"
Fig. 5-11 HP - T-I-H for "lod"
Fig. 5-12 HP - T-I-H for "sig"
Fig. 5-13 HP - T-I-H for "sib"
Fig. 5-14 Peak Values of Z and D for the Unvoiced Stops
Fig. 5-15 Contextual Dependency of Peak Values of D for the Unvoiced Stops
Fig. 5-16 Comparison of Measured Values and Estimates of Z and D for the Frictional Noise Region of the Unvoiced Stops
Fig. 5-17 Measured Values of H_m and σ_H for the Frictional Noise Region of Unvoiced Stops
Fig. 5-18 Flow Chart of Procedure for Isolation and Classification of Unvoiced Stops
Fig. 5-19 Flow Chart of Procedure for Isolation and Classification of Voiced Stops
Fig. 5-20 Algorithm for Isolating Initial Voiced Stops
Fig. A-1 Speech Data Processing System
Fig. B-1 Block Diagram of the Filter-Bank Spectrum Analyzer
Fig. B-2 Typical Attenuation Characteristics of Two of the Bandpass Filters
Fig. B-3 Attenuation Characteristics of the 100 Hz and 200 Hz Lowpass Filters
Fig. B-4 Circuit Diagram of a Typical Spectrum Analyzer Channel
Fig. C-1 Functions of Intensity and Zero-Crossing Rates for the Word "sit"
Fig. C-2 Functions of Intensity and Zero-Crossing Rates for the Word "sid"
Fig. C-3 (a) HP - Time-Interval Histogram, (b) UF - Time-Interval Histogram, (c) LP - Time-Interval Histogram and (d) Composite Time-Interval Histogram
Fig. D-1 Isometric Spectral Display of the Word "zealous"
Fig. D-2 Logarithmic Amplitude Display
Fig. D-3 High Density Display with High-Frequency Pre-Emphasis
Fig. D-4 High Density Display of Unweighted Spectrum

LIST OF TABLES

Table 2-1 Time-Interval Histogram Channels
Table 3-1 Comparison of Z and D, as Measured for Vowels, with Corresponding Values Calculated Using Eqns. (4.27) and (4.28) with N = 3
Table 3-2 Feature Definitions for Vowel Classification
Table 3-3 Vowel Confusion Matrix
Table 4-1 Relative Frequency of Occurrence of English Fricatives
Table 4-2 U_m and H_m for the Unvoiced Fricatives
Table 4-3 Relative Amplitudes of Unvoiced Fricatives
Table 4-4 Feature Definitions for UVF Classification
Table 5-1 Relative Frequency of Occurrence of English Stops
Table 5-2 Vowel-Stop H_m Transitions
Table B-1 Spectrum Analyzer Channel Specifications
Table D-1 Spectrum Display Routines and Options
Table E-1 Phonetic Transcription System for English Phonemes
Table E-2 Word List 1
Table E-3 Word List 2

ACKNOWLEDGEMENT

The work described in this thesis was supported by the National Research Council of Canada under Grant NRC 3308 and by the Defence Research Board of Canada under Grant DRB 2801-26. Acknowledgement is also gratefully made to the National Research Council for financial assistance under an educational leave arrangement. I am thankful to Mr. J.P. Steger for assistance in the design and construction of the spectrum analyzer and in the development of the spectrum display programs; to Mr. M. Koombes for assistance in the development of the spectrum sampling and digitization program; and to Mr. G. O'Connell for assistance in computer operation and illustration preparation. I also wish to thank my colleagues at the University of British Columbia for many informative and stimulating discussions. I am especially grateful to my research supervisor, Dr. R.W. Donaldson, for his valuable suggestions and constant encouragement which were freely given over the course of the research.
Finally, I wish to thank the National Research Council of Canada for support services provided in the preparation of the manuscript, and in particular Mrs. M. Saver for the typing of the thesis.

CHAPTER I INTRODUCTION

1.1 Problem Area

The basic underlying aim of work in speech recognition is to provide a sufficiently economic description of the recognition process to be practical for machine implementation. What constitutes a sufficient description has not yet been satisfactorily determined for any but the simplest problems that can be handled by brute-force methods.

For the recognition of the complex messages contained in continuous, and perhaps imperfect, speech, it is becoming increasingly clear that considerable knowledge of higher level linguistic information such as semantic and syntactic factors is required [1-3]. Such knowledge is still quite primitive, as indicated by the impasse of work on machine translation of languages [4]. Moreover, work in the area can be regarded as directed more toward modelling of human perception and recognition processes than to the less complex problem of recognition of self-contained units such as isolated spoken words [5].

For problems such as the recognition of a few words uttered by a few speakers, decision-theoretic procedures have yielded some success but are regarded as brute-force solutions [1,6,7]. Although decision procedures are important, sophistication of decision procedures does not, per se, guarantee acceptable performance [8-10]. That is, achievable performance is limited by the quality of the pattern characterization [11,5,12]. Furthermore, the optimum or near optimum decision procedures are impractical to implement [13,14]. For example, one near optimum decision procedure [15] would require the storage of all possible samples of all possible utterances.
Two vital aspects of the speech recognition problem, which are the subject of this thesis, are the areas of suitable acoustical representation and adequate acoustical characterization. Acoustical representation is concerned with the form of the output of transducers or sensors which transform the acoustic signal into a conveniently usable form. Characterization involves the specification of significant features of speech signals in terms of the chosen acoustical representation. For machine work, it is desirable that both the representation and the extraction of features be convenient for machine implementation.

Both the representation and characterization problems usually involve a data reduction process, resulting in the loss of information or perhaps transformation into a less usable form. In the representation problem, this loss of information is hopefully unintentional, whereas in the characterization problem the loss of information is intentional through appropriate design.

The representation problem is important in two respects. One reason is the desirability of minimizing the loss of significant information. The other reason is that alternative representations may be more convenient to implement and may lead to simpler specification of significant features for machine processing.

The characterization problem deals with the design of acoustic features which provide an economic but adequate specification of the sounds of speech. The problem is of importance for a number of reasons.
1) It is the quality of the features which limits ultimate recognition performance.
2) Practicality of implementation is dependent on economic specification.
3) Utilization and expansion of our knowledge of the speech process.
4) Adaptability to a range of problems.

1.2 General Discussion of Past Work

In past work in speech analysis and recognition, very little attention has been paid to the problem of appropriate signal representation. Signal representation in the form of spectral composition has been generally used in speech recognition [14,16-23] because of its traditional use in manual analysis of speech [24-26]. Some speech sounds, notably the vowels in steady state, lend themselves to simple characterization in terms of the frequency spectrum, particularly when human interpretation of spectral displays is involved [26-28]. However, the characterizations developed from human interpretation of visual displays have proved awkward for machine implementation [14,16,29-30]. Furthermore, there is some doubt about the adequacy of these characterizations [7,25,32]. Most other speech sounds, particularly those involving dynamic or transient parts, are more difficult to specify because of intrinsic limitations [24,25,33] of the spectral representation and its practical implementation. Spectral measurements are also relatively difficult to implement.
The limitations of spectral analysis are discussed more fully in chapter II.

The difficulties with spectral analysis have recently led to interest in alternative acoustic representations, particularly more direct representations. Teacher et al. [34] have proposed a time-domain measurement called a single equivalent formant. Other work has dealt with measurements of the zero-crossing structure. Reddy [35] has employed zero-crossing rates of the acoustic signal as an aid in segmentation of the acoustic wave. Bezdel et al. [36] and Ewing et al. [37] have demonstrated the utility of measurements of zero-crossing rates in the recognition of a very small vocabulary. Some limited theoretical and experimental analysis of the zero-crossing structure of vowel-like signals has also been reported [38,39]. Further discussion of this work is contained in section 2.3.2.

The characterization problem has received some attention in speech recognition and a greater amount in speech analysis but surprisingly little quantitative data is available [25,7]. Traditional speech analysis has, however, been aimed at extreme economy of description and generality over all possible situations at the expense of completeness and consistency [25,40-44,165]. For example, the original foundation of the distinctive feature model of the phonemes [45,46] discounted the effects of context on the acoustic properties of phonemes and did not take into account the time-varying aspects of these properties. Fant, one of the original proponents of the distinctive feature theory of phoneme characterization, has recently discussed these limitations of the original theory [40]. The early acoustic models of the phonemes thus assumed that they could be described by stable, concurrent and context-independent acoustic properties.
Attempts at speech recognition using such a model achieved very limited success [47,17,16,14,164,165].

Later work shows the development of more complex models. Hughes et al. [14] and Reddy [48,19] have attempted to account for allophonic or contextual variations of the acoustic properties of the phonemes. Sholtz et al. [22] and Martin et al. [49] have attempted to incorporate knowledge of the temporal structure of speech sounds in their recognition schemes. The improved recognition performance shown in this work supports the usefulness of further studies of allophonic variations and temporal organizations in acoustic specification of phonemes.

The importance of more complete and consistent models of the acoustic structure of the phonemes is also reflected in current work in speech analysis. Allophonic variations of vowels [50-52], fricatives [53,54], stops [55], liquids and semi-vowels [56,57] have been studied. Likewise, some studies of temporal structure have been reported [41,58-62,137].

In the face of incomplete knowledge of phonetic features for phoneme identification, characterization of entire words has been employed in some restricted word recognition schemes [166-175]. The main drawbacks are the large storage requirements, signifying an inefficient characterization, and lack of generality in adapting to different or larger vocabularies. The limited success of this work is largely due to the sparseness of the vocabulary and the high dimensionality of the decision space rather than to the quality of the characterization [18,63,64].

Two types of features have been employed in the full word characterizations. One type is the arbitrary or problem-independent feature [5,12].
Such features incorporate little or no knowledge of the speech process; consequently, the resultant models have been criticized as being "ignorance" models [1,7]. Examples of such features are the individual coefficients of an orthogonal function expansion and the raw outputs of acoustic transducers. The most common arbitrary features are the outputs of a hardware spectrum analyzer. Use of such features stems from interest in the application of decision-theoretic classification procedures [63-68,37,39]. However, arbitrary features are known to be generally inefficient pattern descriptors [8,12] and often lead to poor recognition performance [9,10,69,138].

The other type of features employed are phonetic features usually based on partial phonemic specification [18,20-23,34,36,94]. For example, Bobrow et al. [18,94] define phonetic features related to steady-state vowel and fricative properties. King et al. [20] and Chiba [23] have defined phonetic features related to the dynamic structure of vowel-consonant boundaries. The superiority of such phonetic features over arbitrary features has been experimentally demonstrated by the work of King et al. [20] and Bobrow et al. [18]. King et al. studied different feature designs in the implementation of a limited word recognizer and found that, for comparable recognition performance, one quarter to one half as many phonetic features were required, relative to the number of arbitrary features. In another study of limited word recognition, Bobrow et al. found a reduction of about one half. Since the design of these phonetic features is closely related to the phonetic specification of phonemes, studies of the latter are bound to be useful for the above application.
1.3 Scope of Thesis

The main objectives of this thesis are:
1) The investigation of time-domain forms of signal representation, particularly zero-crossing measurements.
2) Quantitative specification of the acoustic properties of selected speech sounds in terms of these measurements.
3) Study of contextual variations and temporal organization of the acoustic properties of these selected speech sounds.
4) Interpretation of the presented data in terms of spectral data.
5) Consideration of application of this data to limited word recognition.

The time-domain representations studied include: 1) mean zero-crossing rates, 2) distribution of time-intervals between zero-crossings, 3) form factors, and 4) intensity. Their collective usefulness as an alternative to spectral representation is established by theoretical and experimental interpretation in terms of the spectra for those signals which are well suited to spectral representation. Ease of implementation for machine processing is also demonstrated.

The selected speech sounds include the vowels, fricatives, and stops. These phonemic groups exhibit a range of static and dynamic characteristics as well as contextual variations and are thus useful for exploring the effectiveness of the time-domain measurements. Quantitative specifications of the acoustic properties of these phonemes are presented and interpreted in terms of existing spectral data whenever possible. Partial machine recognition of the selected phonemes in an isolated word environment is attempted.

The characterization problem is an open-ended one and is unmanageable without some restrictions. The restrictions imposed on our work do limit the scope of the work, but do retain useful generality.
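
The time-domain representations listed earlier in this section (mean zero-crossing rate, distribution of time intervals between zero-crossings, intensity) can be illustrated with a minimal Python sketch. This is not the thesis's own implementation: the function names, the histogram bin edges and the choice of mean absolute amplitude as the intensity measure are assumptions made only for illustration, and form factors are left out. Only the 16 kHz sampling rate is taken from the thesis (section 2.1.2).

```python
import math


def zero_crossing_indices(samples):
    """Indices i at which the sign differs between samples[i-1] and samples[i]."""
    indices = []
    for i in range(1, len(samples)):
        # Exact zeros are treated as non-negative so a crossing is counted once.
        if (samples[i - 1] >= 0) != (samples[i] >= 0):
            indices.append(i)
    return indices


def zero_crossing_rate(samples, fs_hz):
    """Mean number of zero-crossings per second over the whole frame."""
    if not samples:
        return 0.0
    return len(zero_crossing_indices(samples)) * fs_hz / len(samples)


def interval_histogram(samples, fs_hz, edges_ms):
    """Count intervals between successive zero-crossings per bin [edge_k, edge_k+1)."""
    idx = zero_crossing_indices(samples)
    counts = [0] * (len(edges_ms) - 1)
    for a, b in zip(idx, idx[1:]):
        dt_ms = (b - a) * 1000.0 / fs_hz
        for k in range(len(counts)):
            if edges_ms[k] <= dt_ms < edges_ms[k + 1]:
                counts[k] += 1
                break
    return counts


def mean_abs_intensity(samples):
    """Assumed intensity measure: mean absolute amplitude of the frame."""
    return sum(abs(s) for s in samples) / len(samples) if samples else 0.0


# Example: 10 ms of a 1 kHz tone sampled at 16 kHz. The half-sample phase
# offset keeps every sample away from exact zero.
FS_HZ = 16000
tone = [math.sin(2 * math.pi * 1000 * (n + 0.5) / FS_HZ) for n in range(160)]
print(zero_crossing_rate(tone, FS_HZ))
print(interval_histogram(tone, FS_HZ, [0.0, 0.25, 0.75, 2.0]))
```

For the test tone, every interval between crossings is 0.5 ms, so all counts fall in the middle bin. The measured rate over a short frame falls slightly below the ideal 2000 crossings per second because the frame edges cut off one crossing.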
One restriction is that the phonemes studied were contained in isolated spoken words. Our motivation here was to provide data useful for the solution of problems in machine recognition of spoken words.

Another restriction is that only a limited number of phonemes and allophones of these phonemes were studied. Our choice of phonemes was based on several factors, including convenience of comparison with known spectral data and the study of contextual effects and temporal structures. For example, the vowels have been the most widely studied phonemes and often have stable but context-dependent nuclei [58]. The stops, on the other hand, exhibit transient characteristics which are difficult to specify spectrally. Likewise, very little is known about the temporal structure of voiced fricatives.

A third restriction is that only the data from a single speaker was studied. The constraint of manageability was a prime motivation for this restriction. The single speaker restriction is also characteristic of the work of several other researchers [14,18-20]. Furthermore, our work was more concerned with the study of contextual biases and temporal structures than with inter-speaker biases. Nevertheless, comparisons with other data are made to establish that our data are reasonably representative.

CHAPTER II DATA BASE AND MEASUREMENTS

2.1 Data Base and Data Processing

2.1.1 The Data Base

Our speech materials were prepared from the word lists used by Hughes et al. [70] and Halle et al. [71], which were designed to study the static spectral properties of the fricative [70] and stop [71] consonants. There were several reasons for the selection of these word lists.
1) The material contained sizable samples of at least three important phonemic groups, namely, the vowels, the fricatives and the stops.
2) The original material was designed to study the effects of contextual environment and position on the properties of the fricative and stop consonants. Such effects were of considerable interest to us.
3) The material provided a useful sample of vowels in different consonantal contexts. Vowels uttered in different contextual environments are known [50,58,72] to differ from vowels spoken in isolation or in a fixed, neutral environment [27].
4) The data presented in the two source papers serve as a convenient, although limited, spectrographic reference material for the fricative and stop consonants and are the best known such data.

The two word lists, comprising a total of seventy words, are shown for reference in Appendix E. The fricative consonants studied were the unvoiced group, /s, ʃ, f/, and the voiced group, /z, ʒ, v/. The stop consonants studied were the unvoiced group, /p, t, k/, and the voiced group, /b, d, g/. The vowels studied included /i, I, ɛ, æ, ʌ, o, U, u/. The phonetic transcription system employed in this thesis is shown in Table E-1 of Appendix E.

2.1.2 Data Acquisition and Processing

All the spoken words used in the study were produced by a 29-year-old male having a western Canadian accent. The words were recorded in a sound-proof room with a high quality audio tape recording system comprising an Altec-Lansing type 681A microphone and a Tandberg type 64X tape recorder, which was located outside the sound-proof room. During recording, care was taken to obtain natural but well articulated words. Word-final stop consonants, for example, were all required to be released. Several samples of each word were recorded and the perceptually most natural samples were selected for further analysis and processing.
Nearly all subsequent processing of the speech data was performed with the aid of a computer-based, speech processing facility. This facility, which provided on-line, interactive digital processing and real-time data acquisition, is described in Appendix A. Particularly useful features of the facility were the interactive graphics and the computer-controlled, audio-monitoring capability.

The primary data input to the computer system consisted of the previously recorded speech samples which were lowpass filtered at 8 kHz, sampled at 16 kHz and uniformly quantized to nine bits. Secondary analog data inputs comprised reconstructed analog replicas of the digital images subjected to: (i) lowpass filtering at 1 kHz, (ii) highpass filtering at 1 kHz and (iii) spectrum analysis. The spectrum analyzer was a real-time, filter-bank type of instrument and is described in Appendix B.

The primary data was subjected to two basic types of analysis: direct time-domain analysis and frequency-domain or spectral analysis. A summary of the computer programs for time-domain signal analysis is given in Appendix C. In section 2.3, we discuss the time-domain measurement techniques and describe the relevant algorithms. Appendix D contains a brief description of the computer programs written for spectrum display and analysis. The work on spectral analysis played only a supporting role for the main work, which was concerned with time-domain measurement techniques. For speech work, spectral analysis has certain limitations which are discussed in section 2.2.

2.2 Spectral Analysis of Speech

2.2.1 Frequency Versus Temporal Resolution

Short-time spectral estimation has been the principal analysis tool for speech [24-26,73].
One reason often cited for its popularity is evidence that the ear makes a crude frequency analysis at an early stage in its processing [24,74]. Many valuable contributions to the acoustic specification of speech sounds have been made with the aid of this analysis technique. However, short-time spectral analysis has certain limitations which have led to consideration of alternative measurement techniques such as those investigated in this thesis. Since an understanding of the limitations of short-time spectral estimation lends support to our study of time-domain measurements, we present a brief discussion of this subject here. This discussion stems from reports of others as well as first-hand experience with a filter-bank type of spectrum analyzer and also with a Sonagraph instrument.

In short-time spectral analysis, it is known [24,33] that the analyzer bandwidth, BW, and the analysis time window, $\Delta t$, are approximately related by:

$$\Delta t \approx 1/BW \tag{2.1}$$

This constraint means that in designing the analyzer, a compromise must be made between frequency and temporal resolution. As Gold and Rader [75] have pointed out, the different sounds of speech have different spectral and temporal structures and consequently, procedures for specifying an optimum analyzer bandwidth are as yet unknown. A similar problem arises in vocoder design [6,75]. In most conventional analyzers, a fixed time window in the range from 6.7 to 40 msecs. is typical, corresponding to analyzing bandwidths of from 300 to 50 Hz, respectively.
The narrowest bandwidths permit resolution of the harmonics of the voicing frequency, but this tends to be a disadvantage in general speech analysis [25,78]. Although it is known that the ear does perform a crude spectral analysis, the evidence [74,24] also shows that the ear provides better temporal resolution than frequency resolution and furthermore that the only harmonic of the voicing frequency that is usually resolved is the fundamental. Flanagan [24] also shows that the duration of the analysis time window is a function of frequency. Thus, it is questionable whether a fixed duration time window can be optimum.

There are several other problems involved in the use of short-time spectral analysis such as:
1) Distinction between transient and broadband noise signals
2) Practical implementation
3) Loss of information
These topics are discussed in the following sections.

2.2.2 Identification of Signal Type

Associated with the problem of temporal versus frequency resolution, discussed in the preceding section, is the problem of distinguishing transient behaviour of the signal from noise-like behaviour, since both are often characterized by broad frequency spectra. For example, from the output of conventional analyzers it is often not possible to determine whether the excitation was a step function transient or a short burst of noise with an appropriately shaped frequency spectrum. This case is discussed further in Chapter V in connection with the analysis of the voiced stop /b/. In most conventional spectrum analyzers, the relatively long time windows employed tend to obscure the transient effects and, therefore, important knowledge of discrete speech events [76,33].

Even with quasi steady-state signals, spectral or sinusoidal analysis does not simply represent important, speech related signals such as damped sinusoids [7].
Although vowels are usually considered to be spectrally characterized by a series of resonances or formants, actual measurement of important formant parameters such as centre frequency and bandwidth is not considered to be an easy task [6,24,25,73,77,78,141]. This subject is discussed further in the work on vowels described in Chapter III.

2.2.3 Implementation of Short-Time Spectral Analysis

Short-time spectral analysis can be implemented with either analog or digital techniques. Most work in speech analysis and speech recognition has employed the former technique, either in the form of a bank of fixed filters or else in the form of a single scanning filter, such as the Sonagraph instrument [24-26].

Much of the early work in speech spectral analysis (e.g. Potter et al. [26]) was performed with the aid of the Sonagraph type of instrument. The Sonagraph is a convenient but limited instrument for producing visible records of the short-time or running speech spectrum versus time. Fig. 2-1 shows Sonagraph records for the words "fear" and "zealous" as produced with a 300 Hz analyzing bandwidth. High frequency pre-emphasis was also used in producing the record. Spectral intensities are represented on the record by different degrees of darkness. One of the disadvantages of the Sonagraph record is that the intensity scale has only a 20 db dynamic range.

The filter-bank type of analyzer has been the preferred instrument for most speech recognition work [14,16,18,21] because it provides real-time, parallel processing as well as convenient outputs for further machine processing. However, the need to fix the number of channels and their bandwidths poses some yet unanswered questions regarding the adequacy of the resultant spectrum sampling. Typical past designs have employed from seven to thirty-five channels.

Fig. 2-1 Sonagraphs of "fear" and "zealous"

In addition to the spectrum sampling problem, the dynamic range problem is sometimes an acute one in speech analysis. The weakest sounds of speech have intensities typically 30 to 40 db below that of the most intense sounds, all contained in the same utterance. Often only a small portion of the total energy falls within the range of a particular filter channel. These two phenomena combine to make satisfactory measurement of the spectral properties of the weaker sounds difficult. Further discussion of this problem is contained in Chapters IV and V.

The spectrum sampling problem and the dynamic range problem can be circumvented to some extent by digital processing techniques. To handle the spectrum sampling problem, several workers [79,19,31] have employed pitch-synchronous Fourier analysis. This method, however, is susceptible to pitch estimation errors and furthermore, it is not adequate for unvoiced sounds and sounds with mixed excitation modes. Prior to the development of Fast Fourier Transform techniques, the computation time required made digital processing extremely unattractive. Even with the Fast Fourier Transform algorithms, the computation times are still slower than real-time with most general purpose digital computers.

2.2.4 Loss of Information and Data Reduction

Speech analysis and recognition implicitly involve the reduction of the very large information content of high quality speech. In other words, the abstraction processes discard much of the supposedly irrelevant information. A valid question is therefore, "At what stage should this reduction take place?"
With conventional filter-bank spectrum analyzers, data reduction is inherent in the design because of the fixed number of channels, the relatively long smoothing time constants, and the loss of phase information. Compression of the data rates by an order of magnitude is typical. From sections 2.2.1 and 2.2.2, it is evident that this data reduction at the preprocessing stage is not always desirable. This leads to the question of data reduction in later processing stages and also alternative pre-processing techniques.

From work in other areas of pattern characterization and recognition, it is well known [12,80] that general mathematical functions, such as the Fourier series representation or other orthogonal functions, are not, per se, efficient pattern descriptors in general. That is, a very large number of parameters, which are usually not independent, are required. Hence, formation of more selective and powerful descriptors or features based on complex functions of the above parameters is required. Alternative measurements are therefore worth considering for characterization purposes since they may lead to more direct results and simpler processing. The time-domain measurements investigated in this thesis represent one such alternative.

2.2.5 A Computer-Controlled, Filter-Bank Spectrum Analyzer

In spite of the limitations and implementation difficulties of short-time spectral analysis, it was considered desirable to have a spectrum analyzer so that data comparisons could be made. Firstly, it was desirable to compare our spectral data with that of others, simply to establish that our data was representative. Secondly, since some of our time-domain measurements can be theoretically related to spectral measurements, it would be possible to verify the relationships with experimental and calculated data.
For our work, a filter-bank type of analyzer was designed and constructed. The design was an improved, solid-state version of a recent design by Hughes and Hemdal [14], whose work is widely referred to in the literature. Appendix B contains details of the design and specifications of the analyzer.

2.3 Time-Domain Measurements

2.3.1 Types of Measurements and General Advantages

Measurements made directly on the acoustic signal, which we shall call time-domain measurements, are of interest as an alternative or a complementary means of acoustical description and are very useful in computer-based, speech analysis and recognition work [154,155]. The measurements examined in this investigation are:
1) Zero-crossing rate of the ordinary speech signal, x(t)
2) Zero-crossing rate of the differentiated speech signal, dx(t)/dt
3) Distribution of time intervals between zero-crossings of x(t)
4) Distribution of time intervals between zero-crossings of x(t) subjected to 1 kHz lowpass filtering
5) Distribution of time intervals between zero-crossings of x(t) subjected to 1 kHz highpass filtering
6) Direct signal intensity measurement
7) Signal form-factor measurements

For convenience of later discussions, the term "time intervals between zero-crossings" shall sometimes be referred to as "zero-crossing distances". The form-factor measurements are mainly relevant to vowel analysis and are discussed in section 3.7. The other measurements are more general and are discussed in the following sections and throughout the remainder of the thesis.

One of the advantages of the above measurements compared to short-time spectral measurements is the elimination of complex external hardware or alternately, complex, time-consuming calculations.
Other advantages are:
a) Good temporal resolution
b) Less sensitivity to signal intensity fluctuations
c) Less sensitivity to dynamic range problems
d) Ease of digital implementation
e) More effective form of data reduction

Better temporal resolution results from the absence of narrow-band filters. Other distortions added by practical filters used in spectral measurements, such as spurious oscillations and non-ideal time-window skirts, are also avoided. The greater freedom from dynamic range problems results from the absence of energy partitioning filters. Ease of digital implementation for measurements (1) to (5), inclusive, results from the fact that the signal can be reduced to binary values.

The insensitivity to signal intensity fluctuations also applies to measurements (1) through (5). This last phenomenon can be easily established by noting that the zero-crossing points of an arbitrary signal are not affected if the signal is multiplied by a time varying gain factor. A similar advantage results for FM communication systems over AM communication systems. In FM signals, the transmitted information is entirely contained in the zero-crossing structure and it is known ([81], pp. 502-503) that FM provides better discrimination against amplitude fluctuations, noise and interfering signals than AM signals.

Speech signals, like wideband FM signals, occupy a much larger bandwidth than the actual message bandwidth. In speech, as with wideband FM and other spread-spectrum modulation techniques, the large signal-to-message bandwidth serves as a noise combatting measure and can be regarded as a form of redundancy encoding.
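The insensitivity of the zero-crossing points to intensity fluctuations can be checked numerically. The following Python sketch is illustrative only and is not part of the processing system used in this work: the two-component test signal, the gain function and the helper name zero_crossing_indices are assumptions. It presumes that the gain factor remains strictly positive, since a gain which changes sign would introduce additional crossings. The same check is applied to a version of the signal reduced to binary values.

```python
import numpy as np

def zero_crossing_indices(x):
    """Indices at which the sign of x changes (zero is treated as positive)."""
    s = np.where(np.asarray(x) >= 0, 1, -1)
    return np.flatnonzero(s[1:] != s[:-1]) + 1

fs = 16000                                   # sampling rate in Hz, as in section 2.1.2
t = np.arange(fs // 10) / fs                 # 100 msec of signal

# Hypothetical test signal: two sinusoids standing in for a speech-like waveform.
x = np.sin(2 * np.pi * 300 * t + 0.3) + 0.5 * np.sin(2 * np.pi * 2100 * t + 1.1)

# Hypothetical slowly varying gain, strictly positive (between 0.1 and 1.9).
gain = 1.0 + 0.9 * np.sin(2 * np.pi * 7 * t)

# Binary (two-level) version of the signal.
binary = np.where(x >= 0, 1.0, -1.0)

same_after_gain = np.array_equal(zero_crossing_indices(x),
                                 zero_crossing_indices(gain * x))
same_after_binary = np.array_equal(zero_crossing_indices(x),
                                   zero_crossing_indices(binary))
print(same_after_gain, same_after_binary)
```

Both comparisons hold for any strictly positive gain, so that measurements (1) to (5) require no amplitude normalization.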
From this viewpoint and from experiments [82] which have shown that infinitely-clipped speech remains quite intelligible, it can be argued that the data reduction involved in zero-crossing analysis is a more effective form of redundancy reduction than the data reductions resulting from arbitrary spectrum sampling.

2.3.2 Relevance of Previous Work on Zero-Crossing Analysis

Motivation for investigating zero-crossing rates and zero-crossing distances arises from psychoacoustic, theoretical and experimental grounds as well as from consideration of the limitations of short-time spectral analysis. It is of interest to first briefly summarize the chronological history of zero-crossing analysis and then to discuss specific developments in more detail.

Very early interest in zero-crossing analysis stemmed from the classic results of Licklider and Pollack [26] on the perception of infinitely-clipped speech. This work led to consideration of the use of zero-crossing rates to estimate formant frequencies, which in the early history of speech analysis, were important speech parameters. Lack of success in accurate formant frequency estimation resulted in a sharp decline of interest in zero-crossing analysis. Recently, however, the value of spectral analysis has been questioned and there has been a de-emphasis on spectral measurements. As a consequence, renewed interest in zero-crossing measurements has occurred, but only a very limited amount of work has yet been reported.
The principal conclusions of the work of Licklider and Pollack were:
i) Differentiation of speech improved its intelligibility.
ii) Partial peak clipping (12 db) perceptually emphasized the consonants relative to the vowels.
iii) Infinite peak clipping of the speech signal only slightly impaired its intelligibility.
iv) Speech subjected to differentiation and then infinite clipping was more intelligible and of better subjective quality than the undifferentiated version.
v) Information related to speaker identification was diminished by infinite clipping of the speech signal.

The improved intelligibility of speech with differentiation indicates that important information is contained in the upper frequency regions. The insensitivity of speech to peak clipping indicates that a large amount of information important to recognition is contained in the zero-crossings of the normal and differentiated speech signals. Conclusions (i), (ii) and (v) emphasize the fact that the consonants tend to be the dominant information bearing elements of speech, whereas the vowels tend to contain more of the speaker identification information [183,184]. More recent work on different transforms of clipped speech has been reported by Ainsworth [83]. Thus, the results of Licklider and Pollack provide good psychoacoustic grounds for further investigation of zero-crossing structures.

Very little theoretical work exists on the analysis of the distribution of zero-crossing distances.
Recently, Scarr [39] has attempted to theoretically analyze the zero-crossing distances for vowel-like sounds modelled as a completely deterministic signal. However, the severe restrictions of the analysis lead to specialized results which are not generally valid for real speech.

Two approaches have been employed in the theoretical analysis of mean zero-crossing rates. In one approach [84,39], completely deterministic signals were analyzed. Peterson [84], for example, treats the case of a single damped sinusoid (i.e., a single formant or spectral resonance) but the assumption of perfect periodicity leads to the very special result that the mean zero-crossing rate is harmonically related to the fundamental repetition frequency. Likewise, Scarr [39] has used a simplified, two formant model in which each formant is represented by a single sinusoid. Again special results, which reflect the deterministic nature of the assumed signals, were obtained.

The second theoretical approach, based on the work of Chang et al. [85], models speech-like signals as sample functions from a stationary, ergodic process. This model leads to general relationships between the expected density of zeros of various derivatives of the sample functions and the moments of the power density spectrum. From these results, it was conjectured by Chang et al. that the mean zero-crossing rate of the ordinary signal would yield a good measure of the first formant of vowels and that the mean zero-crossing rate of the differentiated vowel signal would yield a good measure of the second formant. Subsequent experimental work by Peterson [77] showed that formant estimation and tracking by this method was not satisfactory. Our experimental results, described in section 3.3, also confirm this conclusion.
Our theoretical work extends the work of Chang et al. to treat the specific cases of vowels, vowel-like sounds and fricatives.

As indicated, the earliest experimental investigations of zero-crossing rates for speech signals were concerned with vowel formant tracking and led to unsatisfactory results [77,86-88,91]. However, the recent de-emphasis of formant tracking and other spectral measurements has led to renewed interest in experimental zero-crossing analyses and applications. Reddy [19], for example, has used zero-crossing rates of the non-differentiated signal as an aid in machine segmentation of the speech signal. Bezdel [89,36] has demonstrated the potential usefulness of zero-crossing distance measurements for machine recognition of small fixed vocabularies of six and fourteen words. Scarr [39] and Bezdel et al. [38] have studied zero-crossing distances for unfiltered, non-differentiated vowels. This latter work is discussed further in section 3.6. The experimental work reported in this thesis treats a more general and larger set of English phonemes.

2.3.3 Theoretical Relationship Between Mean Zero-Crossing Rates and the Frequency Spectrum

Let x(t) be a stationary, ergodic process with autocorrelation function

$$R_x(\tau) = E[x(t)\,x(t-\tau)] \tag{2.2}$$

and power density spectrum

$$S_x(f) = \int_{-\infty}^{\infty} R_x(\tau)\, e^{-j2\pi f\tau}\, d\tau \tag{2.3}$$

where expectation E is an average over the ensemble of waveforms in the process. The expected mean zero-crossing rate of x(t) can be shown to be given by [85]
$$z_0 = 2\phi_0 \left[\frac{M(2)}{M(0)}\right]^{1/2} \tag{2.4}$$

where:

$$M(k) = \int_{-\infty}^{\infty} f^{k} S_x(f)\, df \tag{2.5}$$

is the k-th moment of $S_x(f)$. The proportionality constant, $\phi_0$, depends on the statistical characteristics of the process and is given by

$$\phi_0 = \pi \int_{-\infty}^{\infty} \left|\frac{x'}{\sigma_1}\right| \, p\!\left(\frac{x}{\sigma_0}, \frac{x'}{\sigma_1}\right)\Bigg|_{x=0} \frac{dx'}{\sigma_1} \tag{2.6}$$

where:
$x' = dx/dt$
$p(x/\sigma_0,\, x'/\sigma_1)$ = joint amplitude probability density of x(t) and x'(t)
$\sigma_0^2,\ \sigma_1^2$ = variances of x(t) and x'(t), respectively

Similarly, the expected mean zero-crossing rate of the n-th derivative of x(t), which we shall denote as $z_n$, is given by

$$z_n = 2\phi_n \left[\frac{M(2n+2)}{M(2n)}\right]^{1/2} \tag{2.7}$$

where:

$$\phi_n = \pi \int_{-\infty}^{\infty} \left|\frac{x^{(n+1)}}{\sigma_{n+1}}\right| \, p\!\left(\frac{x^{(n)}}{\sigma_n}, \frac{x^{(n+1)}}{\sigma_{n+1}}\right)\Bigg|_{x^{(n)}=0} \frac{dx^{(n+1)}}{\sigma_{n+1}} \tag{2.8}$$

$\sigma_n^2$ = variance of $x^{(n)}(t)$

The proportionality constants, $\phi_n$, depend on the statistical characteristics of the process and can be readily calculated for some simple processes. For a stationary, Gaussian process, $\phi_n = 1$ for all n, including n = 0 [85,90]. For a random process with x(t) given by

$$x(t) = A \sin(2\pi a t + \theta) \tag{2.9}$$

where A = const. and $\theta$ = a random variable uniformly distributed in the interval $0 \le \theta < 2\pi$, $\phi_n$ is also equal to unity for all n. Thus, $\phi_n$ may be equal to or close to unity for other non-Gaussian random processes, and a value close to unity is a reasonable hypothesis for speech signals [92,93].

2.3.4 Measurement of Zero-Crossing Rates

The zero-crossing rates of the direct acoustic signal and its first time derivative appear to be the most useful of the $z_n$ functions and these were studied in detail. Let Z and D represent the expected mean zero-crossing rates of x(t) and dx/dt, respectively, as given by (2.7) with n = 0 and n = 1. Short-time estimates of Z and D, denoted by Z(j) and D(j), were computed from the following equations:

$$Z(j) = \tfrac{1}{2} \sum_{k=1}^{N} \left[1 - \mathrm{sgn}(x_{k+j+1}) \cdot \mathrm{sgn}(x_{k+j})\right] \tag{2.10}$$

$$D(j) = \tfrac{1}{2} \sum_{k=1}^{N} \left[1 - \mathrm{sgn}(\Delta_{k+j+1}) \cdot \mathrm{sgn}(\Delta_{k+j})\right] \tag{2.11}$$

where:

$$\Delta_{k+j} = x_{k+j+1} - x_{k+j-1} \tag{2.12}$$

and j = 0, N, 2N, ...

The quantity, $x_k$, is the k-th element of the uniformly sampled time series for x(t). If $\delta t$ is the time between adjacent samples, then $\Delta_k / 2\delta t$ forms an estimate of dx/dt at the time corresponding to the sample $x_k$. For our work $\delta t$ = 1/16 msec, corresponding to a sampling rate of 16 kHz. Also, a value of N = 160 was employed, corresponding to a 10 msec time window.

To illustrate typical time graphs obtained, Fig. 2-2 shows Z and D for the word "sit". The temporal region where Z and D are large corresponds to the unvoiced fricative /s/. The following low values of Z and D correspond to the vowel /i/. The silent interval preceding the release of the unvoiced stop /t/ is marked by zero values of Z and D. Lastly, the release of /t/ is marked by the pulse-like behaviour of Z and D.

2.3.5 Measurement and Presentation of Zero-Crossing Distances

The distributions of zero-crossing distances provide a more detailed indication of the zero-crossing structure of a signal than the mean zero-crossing rates. For our work, the short-time distributions of time intervals between successive positive-going zero-crossing points were studied. The distributions were computed over a 40 msec. time window with 20 msec. overlap in the time window for successive computations.

The distributions were computed as follows. Twenty-one channels were established corresponding to the twenty-one time ranges shown in Table 2-1. Over the given time window, the frequency of occurrence of zero-crossing distances corresponding to the indicated time ranges was computed. These event frequencies then formed the required distribution.

For sinusoidal signals the channel outputs are not uniformly weighted but are an increasing function of the sinusoidal frequency.
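The short-time estimates of equations (2.10) and (2.11) can be sketched in Python as follows. This is a minimal illustration and not the program summarized in Appendix C: the function names, the convention sgn(0) = +1, the use of a central difference for the derivative estimate, and the 1 kHz test sinusoid are all assumptions. For a sinusoid of frequency a, equation (2.4) with a unity proportionality constant gives 2a crossings per second, that is, 20 crossings in each 10 msec window when a = 1 kHz, and the same figure applies to its derivative.

```python
import numpy as np

def sgn(v):
    """Signum with sgn(0) taken as +1 (an assumed convention)."""
    return np.where(np.asarray(v) >= 0, 1, -1)

def short_time_rates(x, N=160):
    """Zero-crossing counts of x and of its difference signal in blocks of N samples."""
    x = np.asarray(x, dtype=float)
    delta = np.zeros_like(x)
    delta[1:-1] = x[2:] - x[:-2]        # assumed central difference, x[k+1] - x[k-1]
    k = np.arange(1, N + 1)
    Z, D = [], []
    j = 0
    while j + N + 2 < len(x):           # keeps every index inside the valid part of delta
        Z.append(0.5 * np.sum(1 - sgn(x[k + j + 1]) * sgn(x[k + j])))
        D.append(0.5 * np.sum(1 - sgn(delta[k + j + 1]) * sgn(delta[k + j])))
        j += N                          # j = 0, N, 2N, ...
    return np.array(Z), np.array(D)

fs = 16000                              # 16 kHz sampling, so N = 160 spans 10 msec
n = np.arange(fs)                       # one second of samples
x = np.sin(2 * np.pi * 1000 * n / fs + 0.3)   # hypothetical 1 kHz test sinusoid

Z, D = short_time_rates(x)
print(Z[:3], D[:3])
```

Dividing each count by the 10 msec window duration converts it to a rate in crossings per second.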
For display purposes, the channel outputs were weighted to yield a uniform distribution.

For convenience of visual analysis, a bar graph type of display was employed. This type of presentation, which we shall call a time-interval histogram, yielded displays similar to frequency spectrum displays, such as the Sonagraph type of record. Figs. 2-3c and 2-3d, for example, show time-interval histograms for the words "fear" and "zealous". The twenty-one channels of the analyzer are placed along the vertical axis, with time along the horizontal axis. The channels are arranged in order of increasing time duration from top to bottom. For any given channel, the magnitude of the event frequencies or channel activity is represented by the length of the short vertical bars.

The time-interval histograms shown are merged data sets obtained by combining the data which resulted from separate analyses of the speech signal, subjected to high-pass and low-pass filtering. A cut-off frequency of 1 kHz was employed for both filters.

Table 2-1 Time-Interval Histogram Channels

Logical        Physical       Sample Index
Channel No.    Channel No.    Range
 1             64             64 ≤ k
 2             58             58 ≤ k < 64
 3             52             52 ≤ k < 58
 4             47             47 ≤ k < 52
 5             42             42 ≤ k < 47
 6             38             38 ≤ k < 42
 7             34             34 ≤ k < 38
 8             30             30 ≤ k < 34
 9             26             26 ≤ k < 30
10             23             23 ≤ k < 26
11             20             20 ≤ k < 23
12             17             17 ≤ k < 20
13             15             15 ≤ k < 17
14             13             13 ≤ k < 15
15             11             11 ≤ k < 13
16              9              9 ≤ k < 11
17              7              7 ≤ k < 9
18              5              5 ≤ k < 7
19              3              3 ≤ k < 5
20              2              2 ≤ k < 3
21              1              0 < k < 2

Fig. 2-3 Spectrograms and Time-Interval Histograms for "Fear" and "Zealous": (a), (b) spectrograms; (c), (d) time-interval histograms

For comparison purposes, Figs. 2-3a and 2-3b show frequency spectrograms for the words "fear" and "zealous". These spectrograms were obtained from the 32 channel spectrum analyzer described in section 2.2.5 and in Appendix B. The channel centre-frequencies are logarithmically spaced and increase from bottom to top. The linear time scale, which is along the horizontal axis, covers a total duration of 0.5 sec. Intensity of the spectrum is represented by the length of the vertical bars associated with each channel. The resultant display is similar in appearance to the Sonagraph record shown in Fig. 2-1.

On the spectrogram, the numbers shown along the vertical axis are the channel centre-frequencies of the spectrum analyzer. The numbers shown along the vertical axis of the time-interval histograms are also frequencies and are centre-frequencies resulting from calibration of the analyzer with sinusoids. This calibration permits comparisons to be made between the spectrograms and the time-interval histograms. That is, comparison of general structural patterns, at least, is possible. For example, consider the word "fear". The phoneme /f/ is a very low intensity sound and appears very weak on the spectrogram even with high frequency pre-emphasis. On the time-interval histogram, /f/ shows up more distinctly and demonstrates the greater dynamic range capability of this type of measurement. For the word "zealous", the voicing component and the noise component of /z/ are clearly resolved on the time-interval histogram.
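The computation of the zero-crossing distance distributions described in this section can be sketched in Python as follows. The sketch is illustrative only and is not the analysis program itself: the channel boundaries are the sample-index limits of Table 2-1 with the lower limit of each channel taken as inclusive, which is an assumption, as are the function names and the 1 kHz test sinusoid. The display weighting and the separate high-pass and low-pass analyses are not reproduced.

```python
import numpy as np

# Lower sample-index limits of the 21 channels of Table 2-1, listed from
# logical channel 21 (shortest distances) to logical channel 1 (longest).
LOWER_LIMITS = np.array([1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 20,
                         23, 26, 30, 34, 38, 42, 47, 52, 58, 64])

def positive_going_crossings(x):
    """Indices of samples at which x passes from negative to non-negative."""
    x = np.asarray(x, dtype=float)
    return np.flatnonzero((x[:-1] < 0) & (x[1:] >= 0)) + 1

def interval_histogram(x, start, width=640):
    """Channel counts for one 40 msec window (640 samples at 16 kHz).

    Element 0 of the result is logical channel 1, element 20 is channel 21.
    """
    segment = np.asarray(x)[start:start + width]
    distances = np.diff(positive_going_crossings(segment))
    # Position of the last lower limit that does not exceed each distance.
    position = np.searchsorted(LOWER_LIMITS, distances, side="right") - 1
    counts = np.bincount(position, minlength=len(LOWER_LIMITS))
    return counts[::-1]

fs = 16000
n = np.arange(fs)
x = np.sin(2 * np.pi * 1000 * n / fs + 0.3)   # hypothetical 1 kHz test sinusoid

hist = interval_histogram(x, start=0)
print(hist)
```

A 1 kHz sinusoid sampled at 16 kHz has positive-going crossings 16 samples apart, so all of its events fall in logical channel 13. Successive distributions would be obtained by advancing start in steps of 320 samples, corresponding to the 20 msec overlap.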
For some feature abstraction applications, a simple geometric mean or centre of gravity parameter was defined for the histogram distributions, using logical channel numbers as "distance" and channel event frequencies as "weight". Likewise, a standard deviation parameter, σ, was defined using the same variables, for the purpose of indicating the compactness of the histogram distributions.

2.3.6 Relative Intensity Measurement

Relative intensity is known to be a useful feature in the phonetic characterization of phonemes. Several workers [21,19] have, in fact, employed signal intensity as a primary speech segmentation feature. Other workers [16,14] have used acoustic intensity as an aid in locating vowel-sonorant boundaries and in distinguishing amongst the unvoiced fricatives. For our purposes, the signal envelope was used as a measure of signal intensity. The signal envelope, I(k), was defined as follows:

$$I(k) = \max_{i} \left| x_{i+160k} \right|, \qquad i = 1, 2, \ldots, 160; \quad k = 0, 1, 2, \ldots \tag{2.13}$$

A similar definition has been used by Reddy [19]. Fig. 2-2a shows a time graph of I(k) for the word "sit". The central, large amplitude region corresponds to the vowel /i/. The initial and final low amplitude regions correspond to the consonants /s/ and /t/, respectively.

CHAPTER III

ANALYSIS AND RECOGNITION OF VOWELS

3.1 Nature of Vowels and Their Characterization

3.1.1 General Acoustical Properties

The vowels have been the most extensively studied sounds of speech. The books by Fant [73] and Flanagan [24] are the most comprehensive references on the vowels although their emphasis is mainly on vowel production models.
Much of the work on acoustical characterization of vowels has been concerned with static properties; that is, properties of the sustained vowel or else the vowel embedded in a fixed, relatively neutral contextual environment [27,72]. This work showed that the formant pattern or vocal-tract resonance pattern was of primary importance in characterizing the static properties of vowels. In fact, the frequency positions of the first two formants, denoted by $F_1$ and $F_2$, are often considered to be sufficient parameters for distinguishing the vowels from each other [27,72,29,14,50].

Studies of secondary static parameters have been reported by many workers. Peterson et al. [27] and Peterson [28] have shown that the formant frequencies are somewhat speaker dependent. Peterson [28] and Flanagan [95] have found that formant amplitudes are less important than the formant frequency positions. Fant [191] has suggested that formant amplitudes can be predicted from the formant frequencies. The secondary role of formant bandwidth has been examined by several workers [96,97,50].

Vowels contained in isolated spoken words or in continuous speech are more complex than static vowels. There are several temporal regions in the acoustical signal that can be associated with dynamic vowels [62,50,58,72,52,98,99]. These regions are [62] the onglide, the reduced target and the offglide. The onglide is the region of transition from a preceding consonant to the central or reduced target region. Similarly, the offglide is the region of transition to a following consonant. The central vowel region is called the reduced target region [50,52,99] because the properties of this region are affected by the identity of the adjacent consonants.

For vowel identification purposes, the reduced target region is generally considered to be the most important.
Stevens and House [50] have recently made a quantitative examination of the reductions in the central vowel region for symmetrical consonantal environments. Relatively small but systematic reductions of the first two formants were observed. Their data suggest that, for sophisticated vowel classification procedures, this context dependent reduction might profitably be used to resolve vowel confusions. For example, their data suggest that vowel target reductions may lead to confusions between /ɪ/ and /ɛ/. For simple machine classifiers, it is convenient to consider the target reductions as a small perturbation or form of noise.

The vowel onglides and offglides are quite highly context dependent. These transition regions are generally considered to be well characterized in terms of formant motions or transitions [62,58,98,61,26,100]. Lehiste et al. [62] and Stevens et al. [58] have presented quantitative data on the duration and extent of the formant transitions, particularly that of $F_2$, for vowels in different consonantal environments. The context dependent vowel onglides and offglides are considered to be more useful as an aid in identifying the adjacent consonants [26,100,61], particularly complex consonants like the voiced stops, than in identifying the vowel. Nevertheless, such information would be quite useful in analysis-by-synthesis speech recognition schemes [100].

Another aspect of the temporal structure of vowels is vowel duration. House [102] and Peterson et al. [103] have presented data which show that the vowel durations are strongly and systematically affected by the consonantal environment.
One result is that the state of voicing of the adjacent consonants is as important a factor in determining vowel duration as the intrinsic linguistic classification of the vowels as long or short vowels.

3.1.2 Vowel Production

Models of vowel production have been extensively studied [104,105,73,24,152,153]. These studies have been mainly concerned with establishing the origin and nature of vowel spectra from simple analytical models of the human vocal tract. Fant [73], for example, has clarified several misconceptions related to vowel production. One misconception that is clarified is that there is not a simple one-to-one correspondence between the formant pattern of the vowel spectra and various vocal-tract cavities, as has been popularly believed. He also shows that the point of maximum constriction of the vocal tract for vowels is not always coincident with the highest point of the tongue, as traditional articulatory models suggest.

The analytical studies of vowel production have achieved considerable success with relatively few parameters. A simple but effective electrical circuit equivalent that results is shown to be a cascaded resonant filter system excited by a periodic pulse train corresponding to the glottal pulse source. To a good approximation, these excitation pulses are found to be triangular in shape. The essential parameters of the filter are shown to be the centre frequencies, the gains and the bandwidths of the resonant circuits. These parameters are related to the vocal-tract parameters in a complex manner. This equivalent circuit has also formed the basis of some accurate acoustical vowel analysis methods [30,106] and is frequently used in speech synthesis studies [107-113].
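The equivalent circuit described above can be illustrated with a short digital sketch. It is only an illustration under stated assumptions, not the circuit of any cited study: the two-pole digital resonator, the 16 kHz sampling rate, the 125 Hz voicing frequency, the pulse rise and fall proportions and the three bandwidths are all hypothetical choices. The three centre frequencies are the mean formant values quoted for /ʌ/ in section 3.5.

```python
import math

def triangular_pulse_train(n_samples, period, rise, fall):
    """Periodic triangular pulses: linear rise, linear fall, then zero."""
    out = []
    for n in range(n_samples):
        m = n % period
        if m < rise:
            out.append(m / rise)
        elif m < rise + fall:
            out.append(1.0 - (m - rise) / fall)
        else:
            out.append(0.0)
    return out

def resonator(x, centre_hz, bandwidth_hz, fs):
    """Two-pole digital resonator, scaled for unity gain at zero frequency."""
    r = math.exp(-math.pi * bandwidth_hz / fs)
    b1 = 2.0 * r * math.cos(2.0 * math.pi * centre_hz / fs)
    b2 = -r * r
    gain = 1.0 - b1 - b2
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y = gain * sample + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants_hz, bandwidths_hz, f0_hz=125.0, fs=16000,
                     duration_s=0.3):
    """Pass a triangular pulse train through a cascade of resonators."""
    period = round(fs / f0_hz)          # samples per voicing period
    n = int(duration_s * fs)
    signal = triangular_pulse_train(n, period, period // 4, period // 8)
    for centre, bandwidth in zip(formants_hz, bandwidths_hz):
        signal = resonator(signal, centre, bandwidth, fs)
    return signal

# Formant frequencies from section 3.5; bandwidths are hypothetical.
vowel = synthesize_vowel([650.0, 1400.0, 2546.0], [60.0, 90.0, 120.0])
```

Once the initial transient has decayed, the output repeats with the voicing period, which is the property relied upon when a stable vowel region is treated as periodic.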
In section 3.4, this vowel production model is used in developing theoretical relationships between the formant structure and the zero-crossing rates of the normal and differentiated acoustic signals.

3.1.3 Spectral Measurement Techniques

Many methods have been developed for formant frequency estimation and tracking [78]. Analysis-by-synthesis methods using the linear filter model [30,106] are the most accurate but require matching of the formant frequency positions, amplitudes and bandwidths. Consequently, considerable iterative computation is required.

Two spectral peak-picking methods are in common use. One method is the selection of the local maxima in the output of a fixed filter-bank spectrum analyzer. This method has been extensively used in speech recognition work [16,14,29]. A second method, called the maximum harmonic selection method, has been used by several workers [19,79]. This latter method, however, requires calculation of a Fourier series representation of the vowel spectrum. Both of the peak-picking methods are subject to spectrum envelope sampling errors [84,6,78].

The cepstrum technique [31] attempts to avoid errors caused by sampling of the vowel spectrum by the voicing harmonics by separating the harmonic structure and the spectrum envelope. A great deal of computation is, however, required by the cepstrum method.

3.1.4 Machine Classification of Vowels

In speech recognition work, vowel classification has been largely based on measurements of the first and second formant frequencies, $F_1$ and $F_2$, respectively [16,14,19,114,139]. Duration measurements have also been shown to be useful [14,19,29]. Forgie et al. [29] have also used a measurement of the fundamental voicing frequency, $F_0$, to aid in resolving confusions caused by speaker dependence of the formant frequencies.
Although the formant frequencies are acknowledged to be important in vowel characterization, Fant ([73], p. 206) suggests that the overall spectrum shape may be more important than the formant frequencies alone, particularly if the formants group together as for $F_2$, $F_3$ and $F_4$ of the high front vowels and as for $F_1$ and $F_2$ of the back vowels. Such a characterization would necessarily incorporate formant frequencies, amplitudes and bandwidths. A similar viewpoint is suggested by the distinctive feature theory of Jakobson et al. [45]. A partial attempt to incorporate this idea is reported by Forgie et al. [29], who have used $F_1$ and $F_2$ to effect a preliminary classification and then have used functions related to the area under the spectrum envelope to resolve confusions. Bobrow et al. [18] and Gold [21] have employed simple functions of the output of a filter-bank spectrum analyzer in an attempt to identify vowel-like sounds but their machines do not explicitly classify the vowels.

The vowel classification procedures reported in this thesis do not rely on explicit formant frequency estimation and tracking. In fact, the features used for vowel classification take into account the overall characteristics of the vowel spectrum.

Other work in machine classification of vowels includes that of Teacher et al. [24] and Bezdel et al. [38]. Teacher et al. have shown that a single parameter, which they call a single equivalent formant, is useful in classifying vowel-like sounds. This parameter is extracted from a waveform measurement on speech processed with high frequency pre-emphasis. The very recent work of Bezdel et al. has dealt with the classification of five vowels using zero-crossing distances of normal, unfiltered speech.
In machine vowel classification work, the most frequently occurring confusions are the vowel pairs: /ɪ-ɛ/, /ɛ-æ/, /æ-ʌ/, /ɑ-ʌ/, /ɑ-ɔ/, /u-ʊ/ and /ʊ-ɔ/.

3.2 Reference Vowel Formant Data

A form of spectrum peak-picking was used in our work to determine the vowel formant frequencies. In cases where there was a well defined relative maximum in the expected formant frequency range, the centre frequency of the appropriate filter-bank channel was selected as the formant frequency. In other cases, several problems were encountered. In the $F_1$ frequency range, there was a tendency for the spectrum analyzer to resolve the voicing harmonics. This problem was overcome by estimating the formant frequency position from the output of several analyzer channels. In the $F_3$ and upper $F_2$ frequency ranges, the relatively wide spacing of the channel centre frequencies sometimes made resolution of $F_2$ and $F_3$ difficult, particularly for the vowel /i/. Although our filter-bank spectrum analyzer is similar in design to that used by others [16,14,29], we have concluded that considerable care must be exercised in its use.

Fig. 3-1 shows an $F_1$ vs $F_2$ plot for the vowels studied in our work. The measurements were made manually on the most stable part of the vowel target region. The dispersion of the data points is attributed mainly to the effects of consonantal environment. The data points were checked from Sonagraph records of the speech data.

The mean values of $F_1$ and $F_2$ for vowels in a number of different consonantal contexts from a study by Stevens and House [50] are also shown for comparison in Fig. 3-1. There is reasonably good correspondence between the cluster centres for our data and the means of Stevens and House. The differences are attributable to differences in measurement technique [77,78] and in speakers.
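The simple case of the peak-picking rule described above, in which a well defined relative maximum exists in the expected formant range, might be sketched as follows. This is a minimal illustration and not the procedure actually programmed for this work: the function name, the channel centre frequencies, the channel levels and the search ranges are all hypothetical, and the harder cases (resolved voicing harmonics, widely spaced upper channels) are not handled.

```python
def pick_formant(levels, centres_hz, lo_hz, hi_hz):
    """Return the centre frequency of the strongest relative maximum whose
    channel lies in [lo_hz, hi_hz], or None if no such maximum exists.
    The first and last channels are never selected because they have
    only one neighbour."""
    best = None
    for ch in range(1, len(levels) - 1):
        if not (lo_hz <= centres_hz[ch] <= hi_hz):
            continue
        is_peak = levels[ch] > levels[ch - 1] and levels[ch] >= levels[ch + 1]
        if is_peak and (best is None or levels[ch] > levels[best]):
            best = ch
    return None if best is None else centres_hz[best]

# Hypothetical channel centre frequencies (Hz) and one frame of levels.
centres = [300, 400, 500, 650, 800, 1000, 1300, 1600, 2000, 2500]
frame = [2, 5, 9, 6, 3, 2, 4, 7, 5, 1]

f1 = pick_formant(frame, centres, 250, 900)
f2 = pick_formant(frame, centres, 900, 2500)
```

Returning `None` when no relative maximum is found makes the difficult cases explicit, so that they can be passed to a separate estimate based on several channels.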
To further emphasize the effects of measurement techniques, contexts and speakers, we have plotted in Fig. 3-2 the mean values of $F_1$ and $F_2$ for the various vowels obtained from several published sources [27,50,14]. Hughes et al. [14] used a peak-picking method and a filter-bank spectrum analyzer very similar to ours. Their data is also from a single speaker and exhibits dispersion patterns and overlap similar to that observed in our data.

[Fig. 3-1 Measured $F_1$ and $F_2$ for Vowels Contained in a Variety of Consonantal Environments. Circled points are mean values of $F_1$ and $F_2$ in consonantal environments from Stevens and House [50]. Horizontal axis: $F_1$ (kHz).]

[Fig. 3-2 Comparison of $F_1$ - $F_2$ Values for Vowels from Published Data. Subscript 1 = Peterson and Barney [27] data for fixed environment /h-d/. Subscript 2 = Stevens and House [50] data for consonantal environments. Subscript 3 = Stevens and House [50] data for fixed neutral environment /h-d/. Subscript 4 = Hughes and Hemdal [14] data, peak-picking techniques. Horizontal axis: $F_1$ (kHz).]

Fig. 3-2 illustrates some of the problems involved in attempting to specify formant frequencies precisely. Nonetheless, comparison of Figs. 3-1 and 3-2 indicates that the formant data for our vowels is self-consistent and typical.

The formant frequency measurements provide a form of control for our study of time-domain measurements in characterizing vowels. That is, inasmuch as most previous data on vowels is in the form of spectral (i.e. formant) characteristics, our time-domain measurements are interpreted in terms of spectral characteristics and the spectral data is used as a control to ensure that our limited data is typical.
3.3 Z - D Measurements for Vowels

Fant's remarks on the importance of the overall spectral shape and the recent de-emphasis on formant tracking procedures [7,33,34,38,41,49,115] have stimulated interest in vowel characterizations which do not explicitly rely on formant tracking. The zero-crossing rates of the normal and differentiated acoustical signals, denoted by Z and D, respectively, are potentially useful measurements of the above mentioned type.

In what follows, Z and D denote values of these rates averaged over a 10 msec interval. Fig. 3-3 shows Z vs D measurements obtained for vowels contained in a variety of consonantal contexts. The point of measurement was selected manually in the vowel target region; that is, at the most stable part of the vowel nucleus.

[Fig. 3-3 Measurements of Z and D for Vowel Nuclei. Circled points represent Chang's conjectured mapping of formant data. Horizontal axis: Z (crossings / 10 msec).]

In section 3.3.2, it was mentioned that Chang et al. [85] had conjectured that Z might be a good estimate of $F_1$ and that D might be a good estimate of $F_2$. It is of interest to evaluate the quality of the conjectured estimates. This is most easily done by mapping the $F_1$ and $F_2$ data onto the Z-D plane by means of the following equations:

$$Z = 0.02\,F_1 \qquad (3.1)$$

$$D = 0.02\,F_2 \qquad (3.2)$$

The expected positions of three vowels, /i/, /ʌ/ and /u/, are shown as circled points in Fig. 3-3. For the vowel /i/, it is seen that D yields a reasonably good estimate of $F_2$ but Z yields a poor estimate of $F_1$. For the vowel /ʌ/, neither Z nor D yields a good estimate of the first two formant frequencies. For the vowel /u/, Z and D yield only fair estimates of $F_1$ and $F_2$.
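The two quantities compared above can be sketched in a few lines. The sketch is illustrative only: the 16 kHz sampling rate and the use of a first difference as the differentiator are assumptions, and the function names are hypothetical. The formant values passed to the conjectured mapping are the mean $F_1$ and $F_2$ quoted for /i/ in section 3.5.

```python
import math

def crossings_per_10ms(x, fs):
    """Average number of zero crossings per 10 msec over the whole of x."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a < 0.0) != (b < 0.0))
    duration_s = (len(x) - 1) / fs
    return crossings / duration_s / 100.0

def measure_z_and_d(x, fs):
    """Z from the signal itself, D from its first difference."""
    diff = [b - a for a, b in zip(x, x[1:])]
    return crossings_per_10ms(x, fs), crossings_per_10ms(diff, fs)

def conjectured_z_and_d(f1_hz, f2_hz):
    """Mapping of eqns. (3.1) and (3.2): formants in Hz to crossings/10 msec."""
    return 0.02 * f1_hz, 0.02 * f2_hz

fs = 16000  # assumed sampling rate
tone = [math.sin(2.0 * math.pi * 500.0 * n / fs + 0.3) for n in range(fs)]
z, d = measure_z_and_d(tone, fs)

z_conj, d_conj = conjectured_z_and_d(350.0, 2350.0)
```

For a single 500 Hz sinusoid both measured rates come out close to 10 crossings per 10 msec, which is the value the factor 0.02 in (3.1) and (3.2) assigns to 500 Hz. For a signal with several formants the two rates differ, and the comparison in the text concerns how far they then depart from $0.02F_1$ and $0.02F_2$.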
Thus, it can be concluded that Z and D do not yield good direct estimates of $F_1$ and $F_2$ as was suggested by Peterson [77]. In section 3.5, a theoretical explanation for the above phenomena is given.

The lack of direct formant estimation from Z and D, however, does not detract from their usefulness in vowel characterization. For example, it is observed that there is clustering of the data points. For the data shown, there are at least five distinct regions corresponding to the vowels or vowel groups: /i/, /ɪ,ɛ/, /ʌ,æ/, /ɔ/ and /u/. This clustering suggests that these two measurements are useful in classifying at least five types of vowels.

It is observed in Fig. 3-3 that there are different types of dispersion patterns for different vowel types. For example, the points of the vowel /i/ are horizontally dispersed; the points of /u/ are vertically dispersed; and the points of /ɪ/ are diagonally dispersed. This dispersion of the data is due primarily to the different consonantal environments involved for the different points. That is, the consonantal context affects the formant frequencies and, to a larger extent, the formant amplitudes. The effects of context on the formant frequencies were previously shown in Fig. 3-1.
The variety of dispersion patterns encountered means that for recognition applications, the decision regions are not simple ones. This subject is discussed further in section 3.5.

Our discussion of $F_1$-$F_2$ estimation from Z and D has shown that the relationship is not a simple linear one as suggested by eqns. (3.1) and (3.2). An analytical relationship between the formant structure and Z and D is of value for establishing possible decision regions for recognition work. Furthermore, such a relationship would allow us to establish the generality of the method using formant data as a control. In the next section such a relationship is developed.

3.4 Theoretical Relationship Between Formant Structure and Z-D

In this section theoretical relationships between Z and D and the vowel formant structure are developed by extending the work of Chang et al. [85] to deal with a realistic vowel model. The restrictions and limitations of the work of Scarr [39] and Peterson [84] are avoided by modelling segments of the speech waveform as sample functions from an ergodic process rather than from a completely periodic, deterministic process.

Let

$$x(t) = \sum_{k=0}^{\infty} \left[ x_k \cos(2\pi k \alpha t + \theta) + x_{-k} \sin(2\pi k \alpha t + \theta) \right] \qquad (3.3)$$

where $x_k$ and $x_{-k}$ are real constants and $\theta$ is a random variable uniformly distributed between 0 and $2\pi$. Appropriate choices of $x_k$ and $x_{-k}$ make $x(t)$ a good model for the stable region of many voiced sounds such as vowels and semivowels, provided that $\alpha$ equals the fundamental voicing frequency, $F_0$.
The assumption that $\theta$ is a uniformly distributed random variable enables stable regions of the acoustic output signal to be regarded as segments from sample functions of an ergodic random process. (A stable region of a random process is one in which the statistics of the process are virtually constant over the region.) This assumption is realistic because the phase angle $\theta$ is not essential for listener identification.

Consider the linear system shown in Fig. 3-4. This system is the form of the model used in most vowel production studies [24,73,116] and speech synthesis studies [108-113].

[Fig. 3-4 A Vowel Production Model: the excitation $p(t)$, with spectrum $P(f)$, drives a linear filter $H(f)$ whose output is $x(t)$, with spectrum $X(f)$.]

The excitation signal $p(t)$ has the same form as (3.3). If we let the coefficients of $p(t)$, corresponding to $x_k$ and $x_{-k}$, be denoted by $p_k$ and $p_{-k}$, respectively, then the power spectral density of $p(t)$ is given by (see [90], pp. 342-343):

$$S_p(f) = \sum_{k=-\infty}^{\infty} P_k^2\, \delta(f - k\alpha) \qquad (3.4a)$$

where:

$$P_k^2 = \begin{cases} \left( p_{|k|}^2 + p_{-|k|}^2 \right)/4, & k \neq 0 \\ p_0^2/2, & k = 0 \end{cases} \qquad (3.4b)$$

and $\delta(f)$ is the Dirac delta function.

For vowels and vowel-like sounds, the coefficients $P_k^2$ are usually considered [73,24] to be approximately given by

$$P_k^2 \approx \left[ \frac{\Gamma}{(k\alpha)^2} \right]^2 \qquad (3.5)$$

where $\Gamma$ = const.

Over a stable region of the vowel, the transfer function of the vocal-tract filter, $H(f)$, is given by (see [73], ch. 2, and [24], chs. 3 and 5):

$$H(f) = \prod_{i=1}^{N} \frac{1}{\left[ \sigma_i + 2\pi j (f - f_i) \right] \left[ \sigma_i + 2\pi j (f + f_i) \right]} \qquad (3.6)$$

where $f_i$ and $\sigma_i$ are the centre frequency and bandwidth parameters, respectively, of the i-th resonance of the vocal tract. The number of resonances, N, is usually taken to be three [24,73].

The power spectral density for the acoustic signal, $x(t)$, is given by (see [90], p. 348):

$$S_x(f) = |H(f)|^2\, S_p(f) \qquad (3.7)$$

The factor $|H(f)|^2$ can be written as

$$|H(f)|^2 = H(f) \cdot H(-f) \qquad (3.8a)$$

which, because of the quadrantal symmetry of the poles, can be expanded in partial fractions as

$$|H(f)|^2 = \sum_{i=1}^{N} \left[ \frac{B_i}{\sigma_i + 2\pi j (f - f_i)} + \frac{B_i^*}{\sigma_i - 2\pi j (f - f_i)} + \frac{B_i^*}{\sigma_i + 2\pi j (f + f_i)} + \frac{B_i}{\sigma_i - 2\pi j (f + f_i)} \right] \qquad (3.8b)$$

where:

$$B_i = \lim_{f \to \left( f_i + j\sigma_i / 2\pi \right)} \left[ \sigma_i + 2\pi j (f - f_i) \right] H(f) \cdot H(-f) \qquad (3.8c)$$

and $B_i^*$ is the complex conjugate of $B_i$.

It is convenient to write (3.7) in the form

$$S_x(f) = \sum_{i=1}^{N} S_{x,i}(f) \qquad (3.9a)$$

where:

$$S_{x,i}(f) = \sum_{k=-\infty}^{\infty} P_k^2\, \delta(f - k\alpha) \left[ \frac{B_i}{\sigma_i + 2\pi j (f - f_i)} + \frac{B_i^*}{\sigma_i - 2\pi j (f - f_i)} + \frac{B_i^*}{\sigma_i + 2\pi j (f + f_i)} + \frac{B_i}{\sigma_i - 2\pi j (f + f_i)} \right] \qquad (3.9b)$$

For all typical vowels [73,24], $\sigma_i \ll f_i$; consequently, a good approximation to the residue, $B_i$, is (see [24], p. 186):

$$B_i \approx \left[ \sigma_i + 2\pi j (f - f_i) \right] H(f) \cdot H(-f) \Big|_{f = f_i} = \sigma_i\, |H(f_i)|^2 = \sigma_i K_i^2 \qquad (3.10)$$

where $K_i^2 = |H(f_i)|^2$ is a real constant. Thus a good approximation to $S_{x,i}(f)$ is

$$S_{x,i}(f) \approx \sum_{k=-\infty}^{\infty} 2 P_k^2\, \delta(f - k\alpha)\, (\sigma_i K_i)^2 \left[ \frac{1}{\sigma_i^2 + \left[ 2\pi (f - f_i) \right]^2} + \frac{1}{\sigma_i^2 + \left[ 2\pi (f + f_i) \right]^2} \right] \qquad (3.11)$$

We ultimately wish to calculate the q-th spectral moment of $S_x(f)$, which from (3.5) can be calculated from

$$M(q) = \int_{-\infty}^{\infty} f^q\, S_x(f)\, df \qquad (3.12)$$

$$M(q) = \begin{cases} 0, & q \text{ odd} \\ 2 \int_0^{\infty} f^q\, S_x(f)\, df, & q \text{ even} \end{cases} \qquad (3.13)$$

Eqn. (3.13) obtains because $S_x(f)$ is an even function of f. Eqn. (3.13) can be written in a form analogous to (3.9a), namely

$$M(q) = \sum_{i=1}^{N} M_i(q) \qquad (3.14)$$

where:

$$M_i(q) = 2 \int_0^{\infty} f^q\, S_{x,i}(f)\, df \qquad (3.15)$$

Consider the evaluation of (3.15). Since $f_i \gg \sigma_i$, a further approximation to $S_{x,i}(f)$ for $f \geq 0$ is:

$$S_{x,i}(f) \approx \sum_{k=0}^{\infty} \frac{2 (\sigma_i K_i)^2 P_k^2\, \delta(f - k\alpha)}{\sigma_i^2 + \left[ 2\pi (f - f_i) \right]^2} \qquad (3.16)$$

For the i-th resonance, the most significant terms in (3.16) which affect the evaluation of $M_i(q)$ occur in the vicinity of $f = f_i$. Furthermore, since the bandwidth of the i-th resonance is small compared to the overall extent of the vowel spectrum, it is reasonable to assume that for the i-th resonance, $P_k^2$ is approximately constant and equal to $P_\beta^2$, where $\beta$ is the nearest integer to $(f_i/\alpha)$. If we let

$$A_i^2 = \left[ K_i P_\beta \right]^2 \qquad (3.17)$$

then from (3.16) and (3.17), $M_i(q)$ can be approximated as:

$$M_i(q) \approx 2 \int_0^{\infty} \sum_{k=0}^{\infty} \frac{2 A_i^2 \sigma_i^2 f^q\, \delta(f - k\alpha)}{\sigma_i^2 + \left[ 2\pi (f - f_i) \right]^2}\, df \qquad (3.18)$$

$$= 2 \sum_{k=0}^{\infty} \frac{2 A_i^2 \sigma_i^2 (k\alpha)^q}{\sigma_i^2 + \left[ 2\pi (k\alpha - f_i) \right]^2} \qquad (3.19)$$

Note that $A_i^2$ is approximately equal to the amplitude of the harmonic of $S_x(f)$ closest to $f = f_i$; that is, with reference to (3.3), $A_i^2 \approx (x_\beta^2 + x_{-\beta}^2)/4$.

The main contribution to the RHS of (3.19) occurs in the region where $(k\alpha - f_i)$ is a minimum, that is, in the region of the resonance peak. Over the region of the resonance peak the factor $(k\alpha)^q$ in the numerator can be considered to be approximately constant and equal to $f_i^q$. Thus, approximating the summation by an integration, we obtain:

$$M_i(q) \approx 2 \int_0^{\infty} \frac{2 A_i^2 \sigma_i^2 f_i^q}{\sigma_i^2 + \left[ 2\pi (f - f_i) \right]^2}\, df \qquad (3.20)$$

$$\approx A_i^2\, \sigma_i\, f_i^q \qquad (3.21)$$

where constant factors common to all resonances have been dropped, since only ratios of spectral moments are required in what follows. The frequency $f_i$ is approximately equal to the frequency of the i-th formant, $F_i$, and so we can write (3.21) as

$$M_i(q) \approx A_i^2\, \sigma_i\, F_i^q \qquad (3.22)$$

The parameter $A_i$ can be directly associated with the voltage amplitude of the i-th formant and $2\sigma_i$ can be directly associated with the 6 dB bandwidth (radians) of the i-th formant. Finally, the q-th spectral moment can be written from (3.22) and (3.14) as:

$$M(q) \approx \sum_{i=1}^{N} A_i^2\, \sigma_i\, F_i^q \qquad (3.23)$$

The validity of (3.23) can be established heuristically as follows. Consider the power spectral density function of $x(t)$ to consist of N formants as suggested by (3.6).
Now let the i-th formant be replaced by a simpler structure consisting, for $f \geq 0$, of a single impulse located at each formant frequency, $F_i$, and with weight $A_i^2 \sigma_i$. The function

$$V_i = A_i \sqrt{\sigma_i / 2} \qquad (3.24)$$

can be considered to be the equivalent formant amplitude and takes into account the amplitude and bandwidth of the original formant as we might expect. Geometrically, $V_i^2$ is a measure of the area under the formant curve. For this heuristic model we can therefore write

$$S_{x,i}(f) \approx A_i^2\, \sigma_i\, \delta(f - F_i), \qquad f \geq 0 \qquad (3.25)$$

Hence, from (3.15) and (3.14) we obtain (3.23) directly.

From (2.7), the mean zero crossing rate of the n-th derivative of $x(t)$ is given by:

$$\left( \frac{z_n}{2} \right)^2 = \frac{M(2n+2)}{M(2n)} \approx \frac{\sum_{i=1}^{N} A_i^2\, \sigma_i\, F_i^{2n+2}}{\sum_{i=1}^{N} A_i^2\, \sigma_i\, F_i^{2n}} \qquad (3.26)$$

In particular, for N = 3 and n = 0 and n = 1, we obtain the normalized forms:

$$Z^2 \approx (z_0)^2 = 4 F_1^2\, \frac{1 + (R_2 \gamma_2)^2 + (R_3 \gamma_3)^2}{1 + R_2^2 + R_3^2} \qquad (3.27)$$

$$D^2 \approx (z_1)^2 = 4 F_1^2\, \frac{1 + \gamma_2^2 (R_2 \gamma_2)^2 + \gamma_3^2 (R_3 \gamma_3)^2}{1 + (R_2 \gamma_2)^2 + (R_3 \gamma_3)^2} \qquad (3.28)$$

where:

$$R_2 = \frac{A_2 \sqrt{\sigma_2}}{A_1 \sqrt{\sigma_1}}, \qquad R_3 = \frac{A_3 \sqrt{\sigma_3}}{A_1 \sqrt{\sigma_1}}, \qquad \gamma_2 = \frac{F_2}{F_1} > 1, \qquad \gamma_3 = \frac{F_3}{F_1} > 1$$

The normalized forms for Z and D are convenient because the number of parameters is reduced by one due to the irrelevance of absolute values of the equivalent formant amplitudes. Z and D are thus functions of the five parameters $F_1$, $R_2$, $R_3$, $\gamma_2$, $\gamma_3$.

Clearly, Z and D are not simply and directly related to the formant frequencies, as was indicated by the work in the preceding section. However, there are situations in which certain formants dominate the expressions for Z and D. For example, suppose that the first formant is dominant for both Z and D. Then we have approximately that $Z \approx D \approx 2F_1$. A similar case is a single sinusoid at the frequency $f = F_1$.

In the next section we attempt to evaluate the quality of (3.27) and (3.28) by comparing calculated and measured values of Z and D.
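As a worked illustration of (3.27) and (3.28), the short sketch below evaluates Z and D, in crossings per 10 msec, from the five parameters. The function name is hypothetical, the formant frequencies are the mean values quoted in section 3.5, and the amplitude ratios $R_2$ and $R_3$ in the second call are arbitrary illustrative numbers, not measured data.

```python
import math

def z_and_d_from_formants(f1, f2, f3, r2, r3):
    """Evaluate eqns. (3.27) and (3.28) for N = 3.
    Frequencies are in Hz; results are in crossings per 10 msec."""
    g2, g3 = f2 / f1, f3 / f1                       # gamma_2, gamma_3
    m0 = 1.0 + r2 ** 2 + r3 ** 2                    # proportional to M(0)
    m2 = 1.0 + (r2 * g2) ** 2 + (r3 * g3) ** 2      # proportional to M(2)
    m4 = 1.0 + g2 ** 2 * (r2 * g2) ** 2 + g3 ** 2 * (r3 * g3) ** 2  # M(4)
    z_per_sec = 2.0 * f1 * math.sqrt(m2 / m0)
    d_per_sec = 2.0 * f1 * math.sqrt(m4 / m2)
    return z_per_sec / 100.0, d_per_sec / 100.0

# First formant dominant (R2 = R3 = 0): Z = D = 2 F1.
z_dom, d_dom = z_and_d_from_formants(350.0, 2350.0, 3480.0, 0.0, 0.0)

# Hypothetical amplitude ratios for the /ʌ/ formant frequencies.
z_ex, d_ex = z_and_d_from_formants(650.0, 1400.0, 2546.0, 0.5, 0.2)
```

With $R_2 = R_3 = 0$ the sketch reproduces the limiting case $Z \approx D \approx 2F_1$ noted above. For any non-zero ratios the expressions give $D \geq Z \geq 2F_1$, which is why raising the higher formant amplitudes moves a vowel upward and to the right on the Z-D plane.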
3.5 Comparison of Theoretical and Measured Z - D Values

The relationships for Z and D developed in the preceding section indicate a dependence on the formant amplitudes and bandwidths as well as the formant frequencies. It is known [72,28,62,50] that the formant frequencies for a particular vowel are more stable parameters than the formant amplitudes and bandwidths. This means that for any particular vowel, the parameters $F_1$, $\gamma_2$, $\gamma_3$ are more stable than the parameters $R_2$ and $R_3$. It is of interest therefore to assess the effects of variations of $R_2$ and $R_3$ on the Z and D values for typical vowels.

Consider the three vowels /i/, /ʌ/ and /u/. From spectral measurements similar to those used in plotting Fig. 3-1, the mean values of $F_1$, $F_2$ and $F_3$ for these three vowels were found to be:

/i/: $F_1$ = 350 Hz; $F_2$ = 2350 Hz; $F_3$ = 3480 Hz
/ʌ/: $F_1$ = 650 Hz; $F_2$ = 1400 Hz; $F_3$ = 2546 Hz
/u/: $F_1$ = 350 Hz; $F_2$ = 925 Hz; $F_3$ = 2400 Hz

[Fig. 3-5 Effect of Equivalent Formant Amplitudes on Z and D of Vowels.]

Fig. 3-5 shows selected curves of constant $R_2$ or $R_3$ for the above vowels. The values of $R_2$ and $R_3$ shown represent somewhat larger than typical ranges found in our data and the data of others [28,72]. The enclosed regions thus represent regions where different variants of the three vowels can be expected to be found.

For a vowel like /i/, the large values of $\gamma_2$ and $\gamma_3$ cause the higher formant terms to dominate the expression for D and to strongly affect the expression for Z. These conditions therefore cause the dispersion pattern for /i/ to be horizontal, as has been previously observed.

For the vowel /u/, the smaller values of $\gamma_2$ and $\gamma_3$ combined with the relatively weak amplitudes of the higher formants cause the higher formants to mainly affect D and not Z. Consequently, the dispersion pattern for /u/ tends to be vertical, as has been observed.
A similar analysis confirms the horizontal dispersion pattern observed for /ʌ/. Comparison of Figs. 3-3 and 3-5 shows that the shape and size of the dispersion patterns of the vowels in question are well correlated.

A more explicit comparison of calculated and measured values is possible. Fig. 3-6 shows estimates of Z and D for the same vowels as shown in Fig. 3-3 computed from the formant parameters of frequency, amplitude and bandwidth. These parameters were obtained from filter-bank spectrograms of the speech data. Table 3-1 gives a numerical comparison between measured and calculated values.

[Fig. 3-6 Estimates of Z and D for Vowels Calculated from Formant Parameters.]

Table 3-1 Comparison of Z and D, as Measured for Vowels, with Corresponding Values Calculated Using Eqns. (3.27) and (3.28) with N = 3

Utterance     Z (/10 ms)               D (/10 ms)
              Measured   Calculated    Measured   Calculated
g i l            8          11            55         57
d i l           12          14            50         51
b i l            9          16            50         51
t i l           17          20            53         56
n i s           17          21            49         50
k i l           13          24            49         60
p i l           24          24            53         58
f i r           13          26            49         58
l i ʃ           15          26            50         52
r i f           17          29            53         60
f ɪ t           15          14            31         36
s ɪ d           13          15            31         35
s ɪ p           13          16            32         34
w ɪ s t         12          15            31         33
w ɪ s k         12          15            28         31
v ɪ l           13          17            31         37
s ɪ t           17          19            35         39
s ɪ g           19          20            42         41
s ɪ k           15          21            36         37
s ɪ b           19          21            39         40
w ɪ s p         14          21            37         38
z ɛ l           14          16            30         24
s ɛ k t         14          16            25         30
h æ z           19          17            32         30
s æ v           21          24            35         36
s æ v           18          23            28         34
? k             26          26            38         34
b ʌ n           23          20            31         25
g ʌ n           23          21            34         30
p ʌ n           24          21            33         27
l ʌ s           20          21            31         26
d ʌ n           25          22            31         29
t ʌ n           24          22            32         28
k ʌ f           25          22            30         27
b ʌ s           22          23            31         27
v ɔ n           17          16            22         22
f ɔ l           17          16            21         19
l ɔ g           20          17            26         24
l ɔ d           21          17            24         21
l ɔ b           21          17            24         20
l ɔ k           23          19            24         21
l ɔ t           23          21            24         23
b ʊ ʃ           12          14            16         18
ʃ u r            8           8            12         12
t u l            9           9            13         14
k u l            9          10            16         15
p u l            9          10            16         16
d u l            9           9            18         18
d u l            9           9            18         18
r u ʒ            6          10            20         23
l u z           10          10            32         28
f u d            8          11            16         22
z u m            9          11            20         26
g u l            8          12            14         23

Comparison of Figs. 3-3 and 3-6 and inspection of Table 3-1 shows that there is reasonably good agreement between measured and estimated points for most of the data. The most noticeable discrepancy between measured and calculated values occurs for the Z values of the vowel /i/. The major part of this error is due to measurement error in $F_1$ caused by the fact that part of the first formant fell below the lowest frequency channel (280 - 310 Hz) of our filter-bank spectrum analyzer. Other sources of error for the vowel /i/ and also for the other vowels are the approximations involved in the development of (3.27) and (3.28) as well as errors [78] in the measurement of the formant parameters. Nevertheless, the generally good agreement between measured and calculated values of Z and D, along with the similarity between our measured formant parameters and those reported by others [27,28,50], indicates that Z and D values are useful parameters for vowel characterization.

From our data, it is evident that the Z - D values do not permit complete separation of all the vowels. The overlap of the regions of /ɪ/ and /ɛ/ and between /ʌ/ and /æ/ are similar to the overlap in $F_1$-$F_2$ positions reported by others [14,29,50] and also present in our data shown in Fig. 3-1.

In the distinctive feature classification system proposed by Jakobson et al. [45], one of the feature oppositions that is formulated is the opposition compact/diffuse.
According to this system, one attribute of compact vowels is that they have their spectral energy concentrated in a relatively narrow, central region on the frequency scale. Accordingly, the vowels /ɛ, æ, ʌ, ɔ/ are considered to be compact vowels, whereas the vowels /i, I, u, U/ are considered to be diffuse vowels. The ratio, D/Z, appears to be a measure of the compact/diffuse opposition; that is, a low value of D/Z indicates a compact vowel and a large value of the ratio indicates a diffuse vowel. On Fig. 3-3 a D/Z ratio of 1.5 effects the separation of the two groups of vowels except for the vowel /ɛ/ which falls into the "diffuse" group.

3.6 Zero-Crossing Distance Analysis of Vowels

Zero-crossing distances or time-intervals between zero-crossings were analyzed for vowels under different pre-processing conditions. These conditions were: i) no filtering, ii) lowpass filtering with a 1 kHz cut-off frequency and iii) highpass filtering with a 1 kHz cut-off frequency. The condition "no filtering" means that no additional filtering apart from the 8 kHz lowpass pre-sampling filter was used. The raw data for conditions ii) and iii) were obtained using the speech data processing facility described in Appendix A. That is, the original digitized speech samples corresponding to condition i) were reconverted to analog form, filtered in high-quality external analog hardware and then resampled, digitized and stored in digital form.

It was found that the data structures obtained under condition i) were more variable than those obtained under conditions ii) and iii). Fig. 3-7 shows some results obtained for the unfiltered case, which has been studied to a limited extent by Scarr [39] and Bezdel et al. [38].
The graph attempts to show the primary and secondary activity regions that were observed on the T-I histograms in the temporal regions corresponding to the vowel target. Primary activity regions showed more intense activity or channel tallies than the secondary regions. The primary activity points are enclosed with circles and the secondary activity points are uncircled. The points enclosed in squares are reference points which represent a mapping of a single sinusoid at the first vowel formant frequency, $F_1$, onto the graph.

[Fig. 3-7]

For some vowels, such as /$/, / and /&/, all the activity on the histogram was concentrated in one region and thus, there were no distinct primary and secondary regions. For other vowels, such as /i/ and /u/, the secondary regions tended to be widely dispersed. For vowel /ʌ/, the primary and secondary regions were nearly equal in importance. It is of interest to note that the "compact" vowels /ɛ, æ, ʌ, ɔ/ [45] either have only one activity region or else two regions close together.
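The reference points of Fig. 3-7 rest on the fact that a single sinusoid of frequency F crosses zero every 1/(2F) seconds. The following minimal Python sketch illustrates the underlying zero-crossing distance measurement. The function name, the 16 kHz sampling rate (inferred from the 160-sample, 10 msec. window used in section 3.7) and the 500 Hz test tone are assumptions made for illustration; they are not details of the original implementation, and no histogram channel mapping is attempted.

```python
import math


def zero_crossing_intervals(samples, fs):
    """Return the time intervals, in msec, between successive zero-crossings."""
    if not samples:
        return []
    crossings = []
    previous_positive = samples[0] >= 0.0
    for n in range(1, len(samples)):
        positive = samples[n] >= 0.0
        if positive != previous_positive:
            crossings.append(n)  # sign changed between samples n-1 and n
        previous_positive = positive
    return [(b - a) * 1000.0 / fs for a, b in zip(crossings, crossings[1:])]


FS = 16000   # assumed sampling rate in Hz (160 samples per 10 msec.)
F1 = 500.0   # hypothetical first-formant frequency in Hz

# A small phase offset keeps the samples away from exact zeros.
tone = [math.sin(2.0 * math.pi * F1 * n / FS + 0.1) for n in range(800)]
intervals = zero_crossing_intervals(tone, FS)
expected_ms = 1000.0 / (2.0 * F1)  # half a period of the sinusoid
```

For a real vowel the intervals are spread over a range of values, and it is the distribution of these intervals over time that the T-I histogram displays.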
A direct association of the primary activity regions with the formant frequencies cannot be made in general because all the formants affect the zero-crossing distances to some extent. However, the high-frequency pre-emphasis, which is an intrinsic factor in determining the average zero-crossing rates Z and D, is reduced by the frequency weighting applied to the T-I histograms, as discussed in section 2.3.5. Consequently, there is a closer correspondence between the T-I-H results and the formant frequencies than occurs for the Z and D values. Comparison of the primary activity regions for the various vowels with the reference $F_1$ positions shows that there is a fairly good correspondence between the $F_1$ mapping and the primary activity regions except for the vowels /ʌ/ and /ɔ/. For the latter two vowels, it is postulated that the relatively strong intensity of the second formant causes the primary activity region to be close to the position of $F_2$ mapped onto the graph.

The T-I histograms obtained under conditions ii) and iii) were found to be much simpler in structure than those obtained under condition i); that is, in general, only one activity region was present in the temporal region corresponding to the vowel target on the resultant T-I histograms. Figs. C-3a and C-3c of Appendix C show typical LP and HP time-interval histograms obtained for the word "sid".
The initial part of /s/ is missing from the pictures but the vowel /I/ is located in the central temporal region. Fig. C-3d shows the LP and HP time-interval histograms merged into one display. In Fig. C-3b a T-I histogram for "sid" obtained under condition i) is shown for comparison. Figs. 2-2c and 2-2d show merged LP and HP time-interval histograms for the words "fear" and "zealous". These figures indicate the relatively simple activity patterns obtained for the vowel target regions under conditions ii) and iii).

The relatively simple appearance of the LP and HP TIHs suggested that a simple centre-of-gravity type of measurement could be used to characterize the data. The centre-of-gravity measurement is based on the direct geometrical properties of the display as discussed in section 2.3.5. Fig. 3-8 shows a plot of $L_m$, the centre-of-gravity for the lowpass analysis, versus $H_m$, the HP centre-of-gravity.

[Fig. 3-8 $L_m$ and $H_m$ Measurements for Vowels (horizontal axis: Low Pass Channel Number).]

On the graph both $L_m$ and $H_m$ have been interpreted in terms of the physical channel numbers. The occurrence of data points between physical channels results because the calculation of the centre-of-gravity was not restricted to integral numbers. This procedure allows us to overcome some problems associated with the relatively coarse quantization of the equivalent frequency scale associated with the physical channels and hence allows us to achieve finer distinctions in the data.

Inspection of Fig. 3-8 shows that the data clusters quite well. Comparison of Fig. 3-8 with the $F_1$-$F_2$ plot of Fig.
3-1 indicates that the clustering of similar vowels and the separation of different vowels is as good or better than that obtained from the formant frequency representation. Comparison of Fig. 3-8 with the Z - D data of Fig. 3-3 shows that the TIH analysis yields more compact clusters than does the Z - D analysis. This phenomenon is to be expected, of course, because the combination of broad-band prefiltering and high-frequency de-emphasis employed in the TIH analysis diminishes the interaction of different regions of the vowel frequency spectrum. Nevertheless, from a complexity viewpoint, the Z - D analysis is simpler and would be easier to implement with simple hardware.

3.7 Vowel Form Factor Analysis

Form factor analysis gives a measure of the vocal tract damping, which is related to the bandwidths of the formants [117]. Two form factor measurements were defined by the equations:

$$FF_1(k) = \frac{\left[ M \sum_{j=1}^{M} x_{k+j}^{2} \right]^{1/2}}{\sum_{j=1}^{M} |x_{k+j}|}, \qquad k = 0, M, 2M, \ldots \qquad (3.30)$$

$$FF_2(k) = \frac{2 M^{1/2}\, p_k}{\left[ \sum_{j=1}^{M} x_{k+j}^{2} \right]^{1/2}}, \qquad k = 0, M, 2M, \ldots \qquad (3.31)$$

where:

$$p_k = \max_{j} |x_{k+j}|, \qquad j = 1, 2, \ldots, M \qquad (3.32)$$

and where $x_j$ is the j-th sample in the time series for the unfiltered acoustic signal. For our work a value of M = 160, corresponding to a 10 msec. time window, was selected. The form factor $FF_1(k)$ is a measure of the RMS to average rectified value of the acoustic waveform and $FF_2(k)$ is a measure of the peak-to-peak to RMS value of the waveform. For a pure sine wave, it can be shown that:

$$FF_1(k) = 1.11, \qquad FF_2(k) = 2.82 \qquad (3.33)$$

For a sinusoid with 100% sawtooth amplitude modulation it can be shown that

$$FF_1(k) \approx 1.33, \qquad FF_2(k) \approx 3.0 \qquad (3.34)$$

For a single formant signal, the relationship between the 6 db bandwidth, $B_w$, of the formant and the damping constant of the corresponding acoustic wave is given by:¹

$$\alpha = \pi B_w \qquad (3.35)$$

and can be obtained from (3.6).
Thus, $2\alpha$ is a measure of the radian bandwidth of the formant as previously discussed.

Previous work on formant bandwidth and vocal-tract damping [96,117,50] has shown consistent differences in formant bandwidths for different vowels. House [97] has performed perceptual experiments with synthetic vowels which indicate that formant bandwidth plays a significant role in vowel perception. In many speech synthesizers of the so-called "terminal analog" type [108,110,113], control of the formant bandwidths has been incorporated.

Precise measurement of formant bandwidth is difficult even with spectral analysis techniques [50,96] because the bandwidths are relatively small and because of the harmonic sampling of the spectral envelope. Typical formant bandwidths are between 40 and 200 Hz. In as much as the formant bandwidths tend to be more speaker dependent than the formant frequencies, precise measurements of formant bandwidth are not useful as a primary measurement in vowel classification.

¹ [24], p. 151.

It was felt that coarse measurements of the formant bandwidths or vocal-tract damping are useful, however. Observations of the direct acoustic signal for the vowels suggested that at least two degrees of vowel-dependent damping could be established. Fig. 3-9 shows form factor measurements made for the vowels in our study. The data is observed to cluster into two main regions: one, in the lower left corner, corresponding to the close vowels and the other corresponding to the open vowels. Close vowels such as /i/ are produced with the greatest constriction of the vocal tract; whereas the open vowels such as /a/ are produced with the least constriction of the vocal tract.
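The form factors of (3.30) to (3.32) can be sketched in Python as below. This is an illustration under stated assumptions rather than the original program: the function name is hypothetical, windows are indexed from zero instead of from j = 1, silent windows are reported as `None`, and the 500 Hz test tone sampled at 16 kHz (160 samples per 10 msec.) is chosen only to check the sine-wave values quoted in (3.33).

```python
import math


def form_factors(x, M=160):
    """Return a list of (FF1, FF2) pairs, one per non-overlapping M-sample window."""
    results = []
    for k in range(0, len(x) - M + 1, M):
        window = x[k:k + M]
        rms = math.sqrt(sum(v * v for v in window) / M)
        average_rectified = sum(abs(v) for v in window) / M
        peak = max(abs(v) for v in window)          # p_k of (3.32)
        if rms == 0.0:
            results.append((None, None))            # silent window: ratios undefined
            continue
        ff1 = rms / average_rectified               # RMS to average rectified value
        ff2 = 2.0 * peak / rms                      # peak-to-peak to RMS value
        results.append((ff1, ff2))
    return results


# Two 10 msec. windows of a pure 500 Hz sine wave sampled at an assumed 16 kHz.
sine = [math.sin(2.0 * math.pi * 500.0 * n / 16000.0) for n in range(320)]
sine_factors = form_factors(sine)
```

Both windows give values close to 1.11 and 2.83, in line with (3.33); a waveform whose envelope decays within the window gives larger values of both factors.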
The location of the cluster for the close vowels suggests from (3.33) that the damping of the acoustic wave was small and indeed that effect was observed in the acoustic signal. Likewise, the location of the cluster for the open vowels suggests from (3.34) that the short-time envelope decayed or was damped at even a faster rate than a linear decay. This effect was also observed in the acoustic signal.

[Fig. 3-9 Form Factor Measurements for Vowels (horizontal axis: RMS/AVG; vertical axis: 2 x peak/RMS).]

Correlation of our formant bandwidth data and that of Stevens et al. [50] with the form factor measurements suggests that the form factors mainly indicate the bandwidth of the first formant. The data of Stevens et al., for example, shows that the first formant bandwidth of /i/ and /i/ is less than half that of /æ/ or /ʌ/. This phenomenon is reasonable in as much as the first formant is invariably the most intense component of the acoustic signal.

3.8 Duration of Vowels

Duration measurements of vowels are to some extent arbitrary in that the notion of a vowel boundary is not well defined [102,103]. Nevertheless, natural acoustic boundaries can be employed to give a quantitative measure of duration which is consistent for some comparison purposes. The acoustic boundaries suggested by Peterson et al. [103] were employed for our work. That is, the vowel transition regions to and from consonants were included in the duration measurement except where the transition was a rapid glide to or from a semivowel or liquid.
As House [102] has pointed out, this definition of duration tends to create a bias toward longer durations for vowels in certain environments, such as those following a voiced stop and particularly in cases where the vowel is contained in a symmetrical consonantal environment.

Fig. 3-10 shows the ranges of durations encountered for the vowels of our study, measured according to the above convention. The vowel environments were generally unsymmetric.

[Fig. 3-10 Observed Duration Ranges for Vowels (horizontal axis: Duration (msec), 0 to 400).]

The duration ranges are similar to those reported by Forgie et al. [29] and House [102]. In general, without specific context information, it is not possible to associate a precise duration with a particular vowel. The data shows, however, that the vowels /i, æ, ɔ, U/ are generally longer than the vowels /I, ɛ, ʌ/. A similar distinction between intrinsically long and short vowels has been made by House [102].

For vowel classification work, the vowel duration information should be considered to be a secondary measurement. In our vowel classification work, the duration information was found to be useful for separating /æ/ from /ɛ/ and /ʌ/.

3.9 Machine Isolation of Vowels

For vowels which are neither sustained nor which have been manually isolated from a speech environment, it is desirable to have machine procedures which will isolate the vowel regions and extract relevant features from the measurements to be passed onto the classification stage. Isolation of vowels from arbitrary environments is known to be a difficult problem [101,103,21,35].

The difficulty of isolation of a vowel region depends upon a number of factors including the vowel itself, the contextual environment and the word stress pattern.
Our data shows, for example, that word final /l/ and /r/ adjacent to the close vowels /i/ and /u/ are particularly troublesome. Unfortunately, the liquids /l/ and /r/ were not strictly part of our general study. Their presence, however, does point out some of the difficulties that can occur. Other workers [14,16,18,103] have encountered similar problems with the liquids and semivowels. Nevertheless, our work indicates that with suitable restrictions on the phonemic composition of words, relatively simple procedures are adequate for isolating the vowel regions from consonant regions. The algorithms developed in our work were based in part on previously reported methods.

Traditional methods for vowel isolation have been almost exclusively based on the measurement of signal intensity [14,16,21,35]. That is, vowel-consonant segmentation was attempted under the assumption that vowels are always more intense sounds than the consonants. Inspection of the time graphs of signal amplitude, I, showed that in a majority of cases the vowels were clearly of larger amplitude than the consonants.

Fig. 3-11 shows some examples which resulted from the operation of a simple vowel isolation algorithm employing only intensity information. The words shown are: "sit", "kun", "tool", and "geel". The lowest curve in each figure depicts the time graph of I, measured over successive 10 msec. intervals. The middle curve depicts the smoothed envelope function,

$$I_{2s}(j) = [\,I(j) + I(j+1)\,]/2 \qquad (3.36)$$

where I(j) is the j-th sample of the time series for I.

[Fig. 3-11 Results of the Operation of a Simple Vowel Isolation Algorithm Employing Only Intensity Information (in the top time graph of each set, the vowel region is the region of large intensity).]
The top graph in each figure attempts to show regions of high, medium and low intensity, normalized with respect to the maximum intensity of the utterance. The regions of high and low intensity are displayed normally and the regions of medium intensity are shown as a line on the baseline, as are regions of zero intensity. There should be no confusion between regions of medium and zero intensity on the graphs shown because the regions of medium intensity occur between a region of high intensity and a region of low intensity. To prevent erratic behaviour near the boundary regions, threshold hysteresis was employed between adjacent intensity levels. The high intensity regions displayed in the top graphs for "sit", "kun" and "tool" correspond to the regions of the vowels /I/, /ʌ/ and /u/, respectively. In the top graph of Fig. 3-11d, which corresponds to the word "geel", it is seen that the most intense portion of the word occurs during /l/ rather than during /i/. This word illustrates the problem of separating final liquids from a preceding /i/ or /u/. However, the other figures show that the stop and fricative consonants /s/, /t/, /k/ and /g/ are all of low intensity and readily separated from the vowels.

Several workers [19,21] have used formant parameter measurements made at the point of maximum signal intensity to characterize the vowel for later machine classification. Our data indicated that the point of maximum intensity and the centre of the vowel target region did not generally coincide. Other workers [14,16] have employed schemes in which each small section, typically of 11 to 17 msec. duration, of the high intensity region is classified independently.
Our data suggests that such a method is subject to biasing errors due to the vowel onglide and offglide regions.

The simple vowel isolation algorithm used to obtain Fig. 3-11 was found to be effective in isolating a majority of the vowels in our data. Exceptions occurred in words containing a final /l/ or /r/ adjacent to the close vowels /i/ and /u/ and also for some unstressed vowels in multi-syllable words. Other exceptions occurred because of very intense unvoiced fricatives in some words. Because of the latter problem an improved vowel isolation algorithm was developed using the measurements I, Z, and D.

Let $I_{max}$ denote the maximum value of $I_{2s}$ over the entire utterance. Let $V_1$, $V_2$ and $V_3$ be three Boolean variables defined by:

$$V_1(j) = \begin{cases} 1, & \text{if } [\,I(j) > 0.5\, I_{max}\,] \\ 0, & \text{otherwise} \end{cases} \qquad (3.37)$$

$$V_2(j) = \begin{cases} 1, & \text{if } [\,Z(j) < 30\,] \\ 0, & \text{otherwise} \end{cases} \qquad (3.38)$$

$$V_3(j) = \begin{cases} 1, & \text{if } [\,D(j) < 60\,] \\ 0, & \text{otherwise} \end{cases} \qquad (3.39)$$

and let $V_s(j)$ be the Boolean function

$$V_s(j) = V_1(j) \cdot V_2(j) \cdot V_3(j) \qquad (3.40)$$

where the index j denotes the j-th element in the respective time series. A temporal region was declared to be vowel-like if:

$$V_s(j) \cdot V_s(j+1) \cdot V_s(j+2) = 1 \qquad (3.41)$$

where the operator [·] is the Boolean AND operator. The above condition implies that a vowel-like region must be at least 30 msec. in duration. Similarly, a temporal region was declared to be consonantal-like if:

$$\bar{V}_s(j) + \bar{V}_s(j+1) + \bar{V}_s(j+2) = 1 \qquad (3.42)$$

where the operator [+] is the Boolean OR operator and $\bar{V}$ denotes the complement of V.

For the above algorithm, the intensity scale was quantized into two regions rather than three as in the previous algorithm. This tactic served to improve the isolation of unstressed vowels. The conditions on Z and D served to enhance the separation of unvoiced fricatives from the vowels.
That is, the condition

$$V_2(j) \cdot V_3(j) = 1 \qquad (3.43)$$

is a necessary feature for vowel-like sounds and the condition

$$V_2(j) \cdot V_3(j) = 0 \qquad (3.44)$$

is a necessary feature for the unvoiced fricatives. The above criteria for Z and D can be considered to be similar to the use of the ratio of high frequency spectral energy to low frequency spectral energy which has been used by several workers [22,14] as an aid in vowel-consonant segmentation.

The above vowel isolation algorithm correctly separated all the unvoiced fricatives from the vowels. However, the problems with final /l/ and /r/ were made slightly worse because of the relaxation on the condition on intensity. The algorithm isolated about 75% of the vowels correctly in the sense that more than two-thirds of the region that was isolated could be associated with the vowel in question. The remaining cases mostly involved /i/ and /u/ followed by /l/ or /r/. In these cases, the vowel was not considered to be correctly isolated because the indicated vowel region contained a significant portion of the /l/ or /r/ region. In spite of this problem, the middle third of the indicated region could be associated mainly with the vowels /i/ or /u/ because these vowels were longer in duration than the liquids.

In as much as part of the vowel isolation procedure is to extract suitable features to be passed onto the classification stage, a simple feature extraction scheme based on the above isolation scheme was evaluated. The features extracted were the values of Z and D averaged over the middle third of the indicated vowel region. The reasoning behind this choice was to eliminate, as far as possible, the adverse effects of the vowel transition regions on the features to be used for classification.
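The isolation rule of (3.37) to (3.41) and the middle-third averaging can be sketched as follows. This is a minimal illustration, not the original program: the function names, the handling of the last frame in the smoothed envelope of (3.36), the integer rounding of the "middle third" and the frame values in the example are all assumptions. The example utterance is hypothetical and is built so that an intense fricative exceeds the intensity threshold but is rejected by the conditions on Z and D.

```python
def isolate_vowel_regions(I, Z, D, min_frames=3):
    """Return (start, end) frame ranges (end exclusive) that satisfy (3.41)."""
    n = len(I)
    if n == 0:
        return []
    # Smoothed envelope of (3.36); the last frame has no successor and is
    # kept unchanged (an assumption).
    I2s = [(I[j] + I[j + 1]) / 2.0 for j in range(n - 1)] + [I[-1]]
    i_max = max(I2s)
    # V_s(j) of (3.40): AND of the conditions (3.37), (3.38) and (3.39).
    Vs = [I[j] > 0.5 * i_max and Z[j] < 30 and D[j] < 60 for j in range(n)]
    # (3.41): mark frames covered by three consecutive frames with V_s = 1.
    vowel_like = [False] * n
    for j in range(n - min_frames + 1):
        if all(Vs[j:j + min_frames]):
            for m in range(j, j + min_frames):
                vowel_like[m] = True
    # Collect runs of marked frames into regions.
    regions, start = [], None
    for j, flag in enumerate(vowel_like):
        if flag and start is None:
            start = j
        elif not flag and start is not None:
            regions.append((start, j))
            start = None
    if start is not None:
        regions.append((start, n))
    return regions


def middle_third_average(values, region):
    """Average a measurement over the middle third of a region."""
    start, end = region
    third = (end - start) // 3
    segment = values[start + third:end - third]
    return sum(segment) / len(segment)


# Hypothetical 10 msec. frames: intense fricative, vowel, weak stop release.
I = [6, 7, 7, 6, 8, 10, 10, 9, 8, 7, 1, 0, 1]
Z = [60, 65, 62, 58, 14, 13, 12, 13, 14, 15, 5, 0, 40]
D = [70, 72, 71, 69, 33, 31, 30, 31, 32, 35, 20, 0, 65]

regions = isolate_vowel_regions(I, Z, D)
z_avg = middle_third_average(Z, regions[0])
d_avg = middle_third_average(D, regions[0])
```

In this example the first four frames pass the intensity test alone, so an intensity-only rule would merge the fricative with the vowel; the conditions on Z and D remove them.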
Let the above averages of Z and D be denoted by Z' and D', respectively. Fig. 3-12 shows a plot of Z' versus D'. For completeness, the figure includes samples of /i/ and /u/ which were not properly isolated, but for which the middle third of the indicated vowel region corresponded to the appropriate vowel.

[Fig. 3-12 Z' and D' Results from Machine Isolation of Vowels (horizontal axis: Z', crossings/10 msec).]

There are several observations that we can make about this graph. One observation is that the vowels are shifted in position, particularly for the vowels /u/ and /æ/, compared to Fig. 3-3, which pertains to manually selected measurements on the vowel target region. A second observation is the greater dispersion of the data points. Both of these effects are attributable to the biasing effect of the vowel formant transitions. For the vowels /u/ and /æ/, the transition regions were found to exhibit particularly strong state changes and to be long in duration compared to the total vowel duration. The $F_2$ transition for /u/ is generally in a direction downward [26,58] towards the low $F_2$ value for the /u/ target; consequently, the $F_2$ transition tends to bias D' towards a higher value. For the vowel /æ/, there is often a strong transition of $F_1$ upwards to the high $F_1$ target for /æ/. Consequently, the $F_1$ transition for /æ/ biases Z' towards lower values.

The problem of separating vowels from liquids was beyond the scope of the present thesis.
The difficulty arises because the liquids are vowel-like in nature and any criterion based on signal amplitude is not a sufficient condition for separating the vowel and liquids. Rather than attempt elaborate segmentation procedures, a simple but effective procedure suitable at least for our data was developed. This procedure was based on the following observations:

1) the combined duration of /i/ or /u/ and /l/ or /r/ was very long and greater than 250 msec.
2) final /l/ and /r/ tend to have sustained regions [56,62] and these regions are distinct from those of /i/ and /u/.
3) the vowels /i/ and /u/ have quite distinctive properties compared to the other vowels.

The procedure developed essentially combines the vowel isolation process with the vowel classification process for the special cases in question.

The first step in the procedure was to locate potential vowel plus liquid regions by means of the condition:

$$V_2(j) \cdot V_3(j) \cdot V_4(j) = 1 \qquad (3.45)$$

where $V_2(j)$ and $V_3(j)$ are defined by (3.38) and (3.39), respectively, and

$$V_4(j) = \begin{cases} 1, & \text{if } [\,I(j) > 0.3\, I_{max}\,] \\ 0, & \text{otherwise.} \end{cases} \qquad (3.46)$$

The condition expressed in (3.45) caused the vowel-liquid regions to be lumped together.

The second step in the procedure was to search for /i/ and /u/ in any temporal region where (3.45) prevailed for more than 250 msec. Two sufficiency conditions were defined corresponding to /i/ and /u/, respectively. Let

$$w_1(j) = V_5(j) \cdot V_6(j) \cdot V_7(j) \cdot V_8(j) \qquad (3.47)$$

where $V_5(j)$, $V_6(j)$, $V_7(j)$ and $V_8(j)$ are binary variables for which the "1" states are defined by:

$$V_5(j) = 1, \text{ iff } [\,6 < Z(j) < 30\,] \qquad (3.48)$$
$$V_6(j) = 1, \text{ iff } [\,45 \le D(j) < 60\,] \qquad (3.49)$$
$$V_7(j) = 1, \text{ iff } [\,38 < L_m(j) < 61\,] \qquad (3.50)$$
$$V_8(j) = 1, \text{ iff } [\,3 \le H_m(j) \le 7\,] \qquad (3.51)$$

If $w_1(j) = 1$ was satisfied for at least 8 contiguous samples (80 msec.) of the candidate region, then /i/ was declared to be present. Likewise, the sufficiency condition for /u/ was:

$$w_2(j) = V_9(j) \cdot V_{10}(j) \cdot V_{11}(j) \cdot V_{12}(j) \qquad (3.52)$$

where $V_9(j)$, $V_{10}(j)$, $V_{11}(j)$ and $V_{12}(j)$ are binary variables defined by

$$V_9(j) = 1, \text{ iff } [\,6 \le Z(j) < 15\,] \qquad (3.53)$$
$$V_{10}(j) = 1, \text{ iff } [\,10 \le D(j) < 45\,] \qquad (3.54)$$
$$V_{11}(j) = 1, \text{ iff } [\,31 \le L_m(j) \le 58\,] \qquad (3.55)$$
$$V_{12}(j) = 1, \text{ iff } [\,8 \le H_m(j) \le 20\,] \qquad (3.56)$$

If $w_2(j) = 1$ was satisfied for at least eight successive values of j (i.e. 80 msec.), then /u/ was declared to be present. If neither /i/ nor /u/ was detected, no classification was made at this stage.

The above procedure correctly identified /i/ and /u/ in all the available test words except for one case of /u/ + /l/. It must be emphasized, however, that the above process was not intended to be a general procedure but rather an expedient procedure within the scope of the thesis. It is evident that the above procedure essentially ignores final /l/ and /r/ by invoking sufficiency conditions for the identification of a preceding /i/ or /u/. A more general procedure for separating vowel regions from liquid regions might take advantage of the strong $L_m$ and $H_m$ transitions that often occur between vowels and liquids. These transitions can be
associated with the formant transitions that are known to occur between vowels and liquids [56,62].

The remaining problem of locating the vowel nucleus or vowel target region was handled by searching within the vowel regions isolated by means of (3.41) for the most stable region, using minimum rate of change of $L_m$ and $H_m$ as a criterion. Let $\Delta L_m(j)$ and $\Delta H_m(j)$ be defined by

$$\Delta L_m(j) = L_m(j+1) - L_m(j) \qquad (3.57)$$

$$\Delta H_m(j) = H_m(j+1) - H_m(j) \qquad (3.58)$$

$\Delta L_m(j)$ and $\Delta H_m(j)$ are approximations to the time rate of change of $L_m$ and $H_m$. In a normal vowel region we expect $|\Delta L_m(j)|$ and $|\Delta H_m(j)|$ to be zero or at least a minimum in the vowel target region since the formant parameters change most slowly in this region [52]. Thus, measurements made in the region of minimum time rate of change of $L_m$ and $H_m$ were expected to be close to those used in plotting Figs. 3-3, 3-8 and 3-9.

Using the above search procedure, we developed the following final form of the vowel isolation algorithm.

1) Isolate and classify /i/ or /u/ followed by a liquid by using (3.45), (3.47) and (3.52).
2) Isolate other vowels using (3.41).
3) Calculate the duration, T, of the vowel region isolated in (2).
4) In the vowel region isolated by (2), define the vowel target region as the longest duration subregion where both $|\Delta L_m(j)|$ and $|\Delta H_m(j)|$ are zero or a minimum. In cases where the minima of $\Delta L_m(j)$ and $\Delta H_m(j)$ do not coincide, define the region between the two minima as the centre of the target region.
5) Determine the vowel target features Z, D, $L_m$, $H_m$, $FF_1$ and $FF_2$, averaged over a minimum region of 20 msec. and over a maximum region determined by zero time rate of change of $L_m$ and $H_m$.

The duration measurements obtained in step (2) were found to be close to those reported in section 3.8 and resulted from the fact that (3.41) essentially located the same natural acoustic boundaries.

In summary, the vowel isolation algorithms investigated in our work were mainly concerned with determining natural acoustic boundaries between vowels and the stop and fricative consonants. Within this context, the simple algorithm based on (3.41) and (3.42) was found to be adequate for the task.

3.10 Machine Classification of Vowels

The features employed for vowel classification were defined in terms of the Boolean functions $y_1$ to $y_{21}$ which are shown in Table 3-2.

Table 3-2 Feature Definitions for Vowel Classification

Feature   Iff condition for y_k = 1
y_1       [6 < Z < 30] · [45 ≤ D < 60]
y_2       [15 < Z ≤ 20] · [38 < D < 45]
y_3       [11 < Z < 17] · [30 < D < 38]
y_4       [10 < Z < 17] · [23 < D < 30]
y_5       [20 < Z ≤ 27] · [38 < D < 45]
y_6       [17 < Z < 21] · [29 < D < 38]
y_7       [21 < Z < 28] · [28 < D < 35]
y_8       [17 < Z ≤ 25] · [20 < D < 29]
y_9       [10 < Z < 13] · [11 < D < 21]
y_10      [6 ≤ Z ≤ 10] · [10 ≤ D < 45]
y_11      [38 < L_m ≤ 61] · [3 < H_m < 7]
y_12      [25 < L_m < 38] · [3 < H_m < 9]
y_13      [15 < L_m < 25] · [3 < H_m < 8]
y_14      [15 < L_m < 26] · [7 < H_m < 12]
y_15      [15 < L_m < 26] · [11 < H_m < 16]
y_16      [26 < L_m < 34] · [11 < H_m < 16]
y_17      [34 < L_m < 58] · [8 < H_m < 20]
y_18      [1.05 < FF_1 < 1.28] · [1.5 < FF_2 ≤ 3.2]
y_19      [1.28 < FF_1 < 1.60] · [2.4 < FF_2 < 5.0]
y_20      [180 < T < 350]
y_21      [50 < T < 220]

Fig. 3-13 shows a flow-chart of the vowel classification algorithm that was employed in our work. The above feature definitions and classifier structure were designed to take into account:

i) unequal data cluster sizes
ii) different cluster shapes
iii) irrelevant or "don't care" features

The structural form of the classifier is that of a piecewise linear classifier [64] with binary features and binary weights. Minimum Euclidean-distance-from-the-means classifiers [64,63] were also considered but it was found that they did not satisfactorily handle the data characteristics described above. Likewise, a simpler, decision tree classifier [165] was not found to be suitable because simple successive dichotomies of the decision space were not found to be possible.

The binary features were designed to simplify the classifier structure and basically represent sufficiency conditions associated with the different vowels. It should be noted that the domains of the binary features do not correspond to completely disjoint regions of the measurement subspaces such as the Z - D plane. In other words, the features defined on a particular measurement subspace are not sufficient to permit complete separation of all the vowels.
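The way binary features combine into sufficiency tests can be sketched in Python as below. Only the first two tests of the classifier, $y_1 \cdot y_{11} \cdot y_{18} = 1$ and $y_{10} \cdot y_{17} \cdot y_{18} = 1$, are shown; assigning them to /i/ and /u/ follows the sufficiency conditions (3.47) and (3.52), and the remaining tests are omitted, so every other input is returned as the rejection class "R" in this sketch. The function name is hypothetical, the strictness of some inequalities is uncertain in the source, and the measurement values in the example are invented for illustration.

```python
def classify_vowel(Z, D, Lm, Hm, FF1, FF2):
    """Apply the first two sufficiency tests; return 'i', 'u' or 'R' (rejection)."""
    # Binary features on the Z - D plane.
    y1 = (6 < Z < 30) and (45 <= D < 60)
    y10 = (6 <= Z <= 10) and (10 <= D < 45)
    # Binary features on the L_m - H_m plane.
    y11 = (38 < Lm <= 61) and (3 < Hm < 7)
    y17 = (34 < Lm < 58) and (8 < Hm < 20)
    # Binary feature on the form factor plane (lightly damped waveform).
    y18 = (1.05 < FF1 < 1.28) and (1.5 < FF2 <= 3.2)

    if y1 and y11 and y18:
        return "i"
    if y10 and y17 and y18:
        return "u"
    return "R"  # the remaining tests of the classifier are not reproduced here


# Hypothetical target-region measurements.
label_a = classify_vowel(Z=8, D=55, Lm=50, Hm=5, FF1=1.10, FF2=2.8)
label_b = classify_vowel(Z=9, D=16, Lm=45, Hm=12, FF1=1.15, FF2=2.5)
label_c = classify_vowel(Z=23, D=31, Lm=20, Hm=10, FF1=1.40, FF2=3.5)
```

Each test requires agreement across several measurement subspaces, which is how overlapping regions in any one subspace, such as the Z - D plane, can still lead to a unique decision.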
Furthermore, it should be noted that for the less distinctive vowels such as /æ/, /I/ and /ɛ/, there is more than one set of sufficient conditions for each vowel type.

Fig. 3-13 Vowel Classification Algorithm (flow-chart: from START, one Boolean test on the features y_1 to y_21 is applied per vowel in turn; a vowel is assigned at the first test that equals 1, and a REJECTION ERROR is declared if every test fails).

In Fig. 3-13, the vowels are ranked approximately in observed order of complexity; that is, /i/ is considered to be one of the less complex or more distinctive vowels and /ɛ/ is considered to be one of the more complex vowels. Duration information was found to be quite useful in separating /æ/ from /ʌ/ and /ɛ/ and in separating /o/ from /ʌ/. Forgie et al. [29] have similarly used duration information to distinguish /æ/ from /I/.

Table 3-3 summarizes the performance of the combined machine isolation and classification algorithms.

Table 3-3 Vowel Confusion Matrix
(columns: intended vowel, in the order i, I, ɛ, æ, ʌ, o, U, u; rows: machine classification; only the non-zero entries of each row are listed, in column order)

Machine classification   Non-zero entries
i                        13
I                        7, 1
ɛ                        1, 3
æ                        3
ʌ                        9, 1
o                        7
U                        2
u                        12
R                        1, 1, 1, 2

There are two types of error that occur: (i) substitution errors and (ii) rejection errors. A substitution error occurs when an intended vowel is misclassified as another vowel.
A rejection error, denoted by the class "R" in the above table, occurs when no vowel affiliation is made by the classification algorithm. For our algorithm, the rejection errors were more numerous than the substitution errors. It is possible to reduce the number of rejection errors but it is known [118] that this reduction occurs at the expense of an increase in the number of substitution errors. The sources of both errors are in part due to the feature definitions and the classifier structure and in part due to errors in the feature extraction process.

The limited amount of data examined in our study prevents us from drawing strong conclusions. However, most other work on vowel classification [14,19,29] has also been based on sparse data and so a comparison with that work can be made. Our overall results are found to be quite comparable to the above mentioned work, which basically has used only formant frequency information.

The confusion of /I/ and /ɛ/ is a common one and occurs even in ordinary human perception. It is known that vowels are perceived along a continuous perceptual scale; that is, the perception is not categorical as it is with consonants [119,120]. This fact, along with the common confusions that occur with context-independent vowel classification, suggests that context information may be required to achieve significant improvements in vowel classification. Nevertheless, within the framework of context-independent vowel classification, our results indicate that time-domain measurements yield comparable performance to schemes which use spectral measurements.

CHAPTER IV

ANALYSIS AND RECOGNITION OF FRICATIVES

4.1 General Characteristics of Fricatives

4.1.1 Properties of Unvoiced Fricatives

The fricatives can be divided into two linguistic groups according to a voiced/unvoiced distinction. As Lisker [121] and others [119,120] have pointed out, this distinction is primarily a linguistic one rather than an acoustic one. The implications of this statement shall become more evident in this chapter.

The unvoiced fricatives are acoustically simpler than their voiced counterparts and have been more extensively studied [70,122,123,54,73,24]. These studies have been mainly concerned with properties of the sustainable form of the fricative.

In our work, the unvoiced fricatives /s/, /ʃ/ and /f/ were studied. Our interest was primarily in unvoiced fricatives contained in spoken words and especially, adjacent to vowels. For the sake of convenience, the acronyms UVF and VF shall be used to denote the terms "unvoiced fricative" and "voiced fricative", respectively.

In the environment of vowels, there are three temporal regions in the acoustic signal that can be associated with UVFs. These regions are:
i) vowel transition to a noise dominant region,
ii) the frictional-noise dominant region and
iii) vowel transition from the noise dominant region.
Traditionally, the temporal region associated with UVFs contained in speech has been considered to encompass only the strictly noise-like region corresponding to turbulent or frictional noise sources and defined as region (ii) above. The central subregion of (ii) is considered to be the fricative nucleus and has been the main object of studies involving spectral analysis [123,70], UVF production [122,73,24] and UVF synthesis [107,108,110,113].

The spectral structure of the nucleus appears to be relatively simple. Harris [123] has suggested that the lower cut-off frequency of the noise spectrum is an important parameter. Hughes et al. [70] indicate that the shape of the noise spectrum is also important. However, several workers have shown that the spectral characteristics of the nucleus are weakly context dependent [70,101,62].

Secondary but also important characteristics of temporal region (ii) are the duration and intensity. It is known that the frictional noise is of longer duration for UVFs than for other phonemes which have frictional noise sources [124,17]. The fricative /f/ is known to be generally much weaker in intensity than either /s/ or /ʃ/ [14,54,70,123].

The question of whether the temporal regions of vowel transition to adjacent consonants (i.e. regions (i) and (iii)) belong to the vowel or the consonant has long been a contentious point [119,120,122,2]. The difficulty arises from attempts to assign exclusive temporal regions in the acoustic signal to each of the sequence of phonemes in an utterance. From a classification viewpoint, it is preferable and more convenient to adopt the idea of non-exclusive associations of certain temporal regions such as the vowel formant transitions.
Such a concept is particularly useful in dealing with the voiced fricatives and stop consonants. It is known [58,26,103,120] that the characteristics of the vowel formant transitions depend upon the vowel as well as the consonant. Thus, the transition region contains useful information regarding the identity of both phonemes in the sequence.

For the UVFs, temporal region (ii) is by far the most important. Fortunately, the region is also relatively easy to isolate because of the occurrence of well defined, natural acoustic boundaries in the acoustic signal. Our work, as well as that of others [70,101,14,16], indicates that for the three UVFs being considered, the properties of the noise dominant region are simple and consistent enough to allow a simple classification scheme. That is, there appears to be no need to use contextual information, such as the vowel formant transitions, to distinguish amongst /s/, /ʃ/ and /f/. However, context independent classification may be inadequate for identifying /θ/ and /h/ [54,56,125], which were not studied in this thesis.

Attempts to model the production mechanism of UVFs have been less successful than for the vowels [73,24]. The problems result from difficulties in modelling the nature and location of the noise sources as well as difficulties connected with direct measurement of the vocal-tract configuration and the size of the constrictions (1/).
Nevertheless, from a terminal or acoustic output point of view, simple equivalent sources and linear models [122,73,24] have been developed which form an incomplete but acceptable basis for study. That is, the general shape and frequency position of the output frequency spectra can be explained. In this model, the excitation is assumed to be a white Gaussian noise source and the vocal-tract is modelled as a two-pole, one-zero bandpass filter. The electrical equivalent of this model is often used for UVF synthesis [107,108,110,113].

Previous work on machine classification of unvoiced fricatives [14,16,17,19] has been based on the spectral analysis results of Harris [123] and Hughes et al. [70]. The work reported in this thesis shows that the UVFs can be relatively easily classified by machine using only properties of the UVF nucleus. The measurements Z, D, I and the duration, T_n, of the noise dominant region were found to be sufficient for isolation and classification except for some word-initial /f/.

1/ [24], p. 48.

4.1.2 Properties of the Voiced Fricatives

The voiced fricatives (VFs) are not as well understood as the UVFs. The former are more complex in temporal structure and in the short-time spectral structure than the latter.
Ideally, the voiced/unvoiced distinction between the two groups might be attributed to the presence of a concurrent, vowel-like voicing component added to the voiceless counterpart. Such a simple model has been used to synthesize the VFs [107,113]. There is evidence to suggest, however, that concurrency of voicing is not crucial to the voiced/unvoiced perceptual distinction. It is known, for example, that in word final position the VFs are often partially devoiced [17,126]. On the other hand, Fant (1/) suggests that the role of frictional noise has been overemphasized with respect to certain of the VFs such as /v/. The above discussion thus indicates that the VFs are quite variable and that several factors may contribute either singly or jointly to the voiced/unvoiced distinction.

There are other differences between the idealized model and VFs in actual speech. Fant [73,25] points out that the noise intensity in VFs is usually less than for the unvoiced counterpart. Several workers [17,73,126] have also observed that the duration of the frictional noise component is generally shorter than for the unvoiced counterpart. It is further known (2/) that the frictional noise is modulated in intensity by the glottal pulses. This latter effect has been incorporated into recent sophisticated speech synthesizers [108,110].

1/ [73], p. 170
2/ [24], p. 49

Voiced fricatives contained in isolated spoken words and in continuous speech have at least as many temporal regions associated with them as with the UVFs.
Typical regions that may be found are:
i) transition from a preceding vowel,
ii) region of occluded-voicing dominance (1/),
iii) region of increasing frictional noise intensity,
iv) region of frictional-noise dominance,
v) region of decreasing frictional noise intensity and
vi) transition to a following vowel.

These regions are not necessarily found in every VF. The occurrence of any of these regions is dependent upon a number of factors including: the particular VF, the contextual environment and the position in the word. For example, regions of clear frictional noise dominance are not always present, particularly for /v/. In other words, the VFs are complex and variable phonemes.

Our work was concerned with a study of the VFs /z/, /ʒ/ and /v/ in the contextual environment of vowels. In our data, it was found that there was generally some frictional noise associated with all the sample VFs although the noise was weaker in intensity and shorter in duration than for the unvoiced counterparts. It was also found that although voicing and frictional noise were not always fully concurrent, there were elements of both in the temporal structure. Sections 4.5 and 4.6 present further discussions of the temporal structure of the VFs studied in our work.

1/ Occluded-voicing occurs when the vocal tract is severely constricted at some point and results in a low first formant frequency.

The VFs were found to require more complex features for isolation and classification than the UVFs. Explicit use was not made of the temporal regions corresponding to the transitions to and from adjacent vowels in our voiced fricative identification algorithms. However, such information may be required to identify the VF /ð/, which was not studied in our work.
Very little previous work has been reported on machine isolation and classification of the VFs. What work has been reported [14,19] has been based on the primitive acoustical model in which a voiced/unvoiced distinction is made on the basis of presence/absence of concurrent voicing.

4.1.3 Functional Burdening of the Fricatives

Several studies [127-129] have been made of the relative frequency of occurrence of the phonemes of the English language. Table 4-1 gives excerpts from the work of Dewey [127] and Denes [128] related to the fricatives studied in this thesis. The table entries represent relative frequency of occurrence of the indicated phonemes as a percentage of the occurrence of all phonemes in representative written material. The differences in values for the two investigators are attributed in part to the data base and in part to the different phonemic transcription systems employed.

Table 4-1 Relative Frequency of Occurrence of English Fricatives

Investigator   /s/    /ʃ/    /f/    /z/    /ʒ/    /v/
Dewey          4.6    0.84   1.9    3.0    0.05   2.3
Denes          5.1    0.70   1.7    2.5    0.05   1.9

Amongst the UVFs, /s/ is the most important in terms of frequency of occurrence and /ʃ/ is the least frequently occurring. Amongst the VFs, /z/ is the most frequently occurring and /ʒ/ is the least frequently occurring. In fact, /ʒ/ is the least frequently occurring phoneme in the English language and furthermore, never occurs in word initial position. Dewey's data further indicates that the most attention should be paid to: /s/ in word final and initial positions, /z/ in word final position, /f/ in word initial position and /v/ in word final position.
4.2 Z - D Analysis of Unvoiced Fricatives

4.2.1 Z - D Measurements

Z - D measurements were made for the UVFs /s/, /ʃ/ and /f/ in word initial and word final positions. The contextual environments included a variety of vowels with different places of articulation. Use of vowels with different places of articulation was intended to indicate the effect of vowel context on the fricative nucleus properties.

In what follows, Z and D denote values averaged over a 10 msec interval, as for the vowels. The non-enclosed points on Fig. 4-1 are Z - D measurements obtained for the sample UVFs. The position and contextual environment of the UVFs are also indicated on the graph. The point of measurement of Z and D was manually selected in the most stable region of the fricative nucleus.

The measured data points are observed to cluster into three distinct regions corresponding to the three fricative types under study. The dispersion pattern for /f/ is somewhat larger than that for /s/ and /ʃ/ and indicates that the acoustic properties of /f/ were more variable than those of /s/ and /ʃ/. The data also suggests that initial /f/ are more variable than final /f/. This phenomenon is attributable to the fact that initial /f/ were shorter in duration and more plosive-like than word final /f/. The plosive quality of initial /f/ has also been observed by Reddy [19,48].

There are other small contextual effects that can be observed in the data. Initial /s/ appear to have slightly greater energy in the higher frequencies than final /s/, as indicated by the small, upwardly diagonal shift of the points for initial /s/. For /ʃ/ before the vowel /u/, it is observed that the anticipatory lip-rounding for /u/ results in lowered Z and D values. Likewise, the vowel /i/ exerts a noticeable influence on a preceding /ʃ/, which probably results from the fact that the places of articulation of these two phonemes are physically close together.

Fig. 4-1 Z - D Values for UVFs. Uncircled points are measured values. Points enclosed with squares are estimates calculated from measured spectra. Circled points are estimates calculated from a simple spectral model. (Axis label: Z, crossings/10 msec.)

4.2.2 Theoretical Relationship Between Spectral Structure and the Z and D Values

The spectra of UVFs have been analyzed by modelling the production mechanism as a linear filter excited by white Gaussian noise [122,73,25,24]. This model is depicted in Fig. 4-2. Unvoiced fricatives have also been adequately synthesized with an electrical equivalent of the above model [107,108,110,113]. In simplest terms, the linear filter can be described as a bandpass filter [122,75] for which the inband attenuation characteristics, the bandwidth and the position along the frequency scale differ according to the particular UVF involved.

Fig. 4-2 Simple UVF Production Model (a white Gaussian noise source feeding a linear bandpass filter F(f), with output x(t), X(f)).

Since the output of a linear filter is Gaussian if the input is Gaussian [90], it is reasonable to model stable (1/) regions of the unvoiced fricatives as a portion of a sample function from an ergodic process. Thus from 2.7, we can write for the UVFs the following relationships for Z and D:

\[ Z^2 = \frac{4 \int_0^\infty f^2 S_x(f)\, df}{\int_0^\infty S_x(f)\, df} \tag{4.1} \]

\[ D^2 = \frac{4 \int_0^\infty f^4 S_x(f)\, df}{\int_0^\infty f^2 S_x(f)\, df} \tag{4.2} \]

where S_x(f) is the power spectral density of the fricative output signal.
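As a numerical illustration of the two relationships above, the following sketch evaluates them for a sampled power spectral density. It assumes that (4.1) and (4.2) are the usual spectral-moment ratios, Z^2 = 4 m_2 / m_0 and D^2 = 4 m_4 / m_2, where m_k is the integral of f^k S_x(f). The function names, the trapezoidal integration and the flat test spectrum are all assumptions made for the example and are not taken from the thesis.

```python
import math

def spectral_moment(freqs, psd, order):
    """Trapezoidal estimate of the integral of f**order * S(f) over freqs."""
    total = 0.0
    for k in range(len(freqs) - 1):
        a = (freqs[k] ** order) * psd[k]
        b = (freqs[k + 1] ** order) * psd[k + 1]
        total += 0.5 * (a + b) * (freqs[k + 1] - freqs[k])
    return total

def z_and_d(freqs, psd):
    """Return (Z, D) from the assumed moment-ratio forms of (4.1) and (4.2)."""
    m0 = spectral_moment(freqs, psd, 0)
    m2 = spectral_moment(freqs, psd, 2)
    m4 = spectral_moment(freqs, psd, 4)
    return 2.0 * math.sqrt(m2 / m0), 2.0 * math.sqrt(m4 / m2)

# Hypothetical test case: a flat spectrum between 3.7 and 8.0 kHz.
# With f in kHz, the results are in crossings per msec.
freqs = [3.7 + 0.001 * k for k in range(4301)]
psd = [1.0] * len(freqs)
z, d = z_and_d(freqs, psd)
print(round(10 * z), round(10 * d))  # scaled to crossings per 10 msec
```

Because the numerators weight the spectrum by f^2 and f^4, energy in the upper channels dominates both values; this is the high frequency emphasis that the discussion of the calculated values relies on.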
While (4.1) and (4.2) can be written in terms of the source and filter characteristics, it is more convenient to leave the equations in the above form since our interest is in the overall spectral shape of the output signal rather than in details of the filter.

In the next section, Z and D values calculated from (4.1) and (4.2) using actual spectral measurements are described. In section 4.2.4, we show how a very simple mathematical model of the UVF spectra can be used to explain the nature of the observed Z and D measurements.

1/ Supra p. 52

4.2.3 Calculated Z and D Values

Estimates of Z and D for the UVFs were calculated from (4.1) and (4.2) using spectral measurements obtained from our filter-bank spectrum analyzer. The output of the spectrum analyzer yields an approximation to |X(f)|, where X(f) is the Fourier transform of the spectrum analyzer input signal, x(t). When x(t) is ergodic,

\[ |X(f)|^2 \approx S_x(f) \tag{4.3} \]

Results of calculations using these spectral measurements are shown as the data points enclosed in squares on Fig. 4-1. For /s/ and /ʃ/, which are relatively intense sounds, the agreement between measured and calculated values is very good. This comparison establishes that (4.1) and (4.2) are appropriate and also that the spectral measurement technique was satisfactory.

The displacement of the calculated points for /f/ is attributable to two factors. Firstly, /f/ is a very weak sound (see Table 4-3) and results in low signal-to-noise ratios in the filter-bank output. Secondly, the relationship between Z and the spectrum contains a high frequency, pre-emphasis factor. This means that any noise in the high frequency channels of the filter-bank tends to bias Z and D to higher than normal values.
Thus, the displacement of the calculated points for /f/ can be ascribed to the compounded effect of low signal-to-noise ratio in measuring the spectrum and the high frequency pre-emphasis implicit in (4.1) and (4.2).

The spectra of /s/, /ʃ/ and /f/ observed in our data were similar to those reported by others [122,70]. All the spectra had very rapid attenuation below the lower cut-off frequency. For /s/ the spectra were generally flat and typically covered the range from 3.7 to 8.0 kHz. The spectra for /ʃ/ consisted mainly of a single peak in the vicinity of 2.5 kHz. The spectra of /f/ were more variable. The lower cut-off frequency for /f/ varied from 1.0 to 2.0 kHz and the spectrum, X(f), was either flat or decreased in amplitude with frequency approximately as f^-1. These observations served as a guide for simple models of the UVF spectra described in the next section.

4.2.4 Estimation of Z and D Values from a Simple Spectral Model

Previous work on spectral characterization of the UVFs indicates that relatively few parameters are required for describing the spectra for the purpose of classification of the UVFs. Harris [123] suggests that the lower cut-off frequency of the fricative noise is an important parameter and gives the following lower cut-off frequencies, f_LC:

/s/ : f_LC = 3.6 kHz
/ʃ/ : f_LC = 2.0 kHz        (4.4)
/f/ : f_LC = 1.2 kHz

The data of several other workers [70,73,122] indicates that spectral shape is also an important factor. For example, Hughes et al. [70] and Fant [73] point out that the spectrum of /ʃ/ generally has one strong peak located below 4 kHz and also that the spectrum drops off rapidly on either side of the peak. They describe the spectrum of /s/ as being approximately flat above the lower cut-off frequency. The spectrum of /f/ is known to be very wide.
Fant (1/) suggests that an approximate fit to the spectrum of /f/ between f_LC and 8 kHz is a curve that falls off as (freq.)^-1. It is known [70] that a very high frequency spectral peak above 10 kHz sometimes occurs in /f/ spectra. In our data, any such peaks were eliminated by the 8 kHz lowpass pre-sampling filter. The above spectral characteristics agree with observations on our data.

1/ [73], p. 202

The above evidence suggested that a simple mathematical model would suffice to describe the important characteristics of the UVF spectra. This model comprises a bandpass spectrum with a fixed upper cut-off frequency, a variable lower cut-off frequency and a variable roll-off rate. The fixed upper cut-off frequency corresponds to the cut-off frequency of the lowpass pre-sampling filter used to bandlimit the signal.

Using this spectral model, values of Z and D were calculated from (4.1) and (4.2). Fig. 4-3 shows curves of Z vs. D scaled to correspond to the Z - D scales of Fig. 4-1. The dashed lines on the figure represent contours of constant f_LC and the solid lines represent contours for a fixed spectrum roll-off rate. The curve labelled "1", for example, corresponds to the case of no roll-off or a flat frequency spectrum. The curve labelled "5" represents a roll-off rate for |X(f)| proportional to f^-4.

Fig. 4-3 Variation of Z and D with Variation in Parameter Values of the Spectral Model.

Consider some typical models for the spectra of /s/, /ʃ/ and /f/. For /s/ the spectrum was found to be fairly flat and with a typical f_LC of 3.7 kHz.
For /ʃ/, suitable parameters were f_LC = 2.5 kHz and a frequency roll-off proportional to f^-4. For /f/, typical parameters were f_LC = 1.25 kHz and a spectrum roll-off proportional to f^-1. Z and D values calculated from the above parameter values are shown as circled points on Fig. 4-1.

Comparison of the estimates obtained from our simple spectral model and the actual measured values of Z and D yields very good agreement. From this result we can conclude that two parameters, either in the frequency domain or else in the time domain, are sufficient to characterize the UVFs /s/, /ʃ/ and /f/.

Another purpose of the above spectral modelling is to allow us to relate gross characteristics of the spectral structure to the Z and D values. For example, when the spectrum is relatively compact, as for /ʃ/ and /s/, we observe that the Z vs D points lie close to the Z = D locus. On the other hand, for a frequency spectrum which is relatively diffuse, as for /f/, the Z vs D points lie away from the Z = D locus.
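The compact versus diffuse behaviour described above can be reproduced with a short calculation. The sketch below is an assumption-laden illustration rather than the computation used for Fig. 4-3. It takes |X(f)| proportional to f^-n between f_LC and a fixed 8 kHz upper cut-off, so that the power spectral density goes as f^-2n, and it evaluates Z and D as the moment ratios Z^2 = 4 m_2 / m_0 and D^2 = 4 m_4 / m_2. The function names and the dictionary of parameter values are hypothetical; the parameter values themselves are the typical ones quoted in the text.

```python
import math

def band_moment(order, f_lo, f_hi, rolloff):
    """Integral of f**order * S(f) over [f_lo, f_hi] for S(f) = f**(-2*rolloff)."""
    p = order - 2 * rolloff
    if p == -1:
        return math.log(f_hi / f_lo)
    return (f_hi ** (p + 1) - f_lo ** (p + 1)) / (p + 1)

def model_z_d(f_lc_khz, rolloff, f_uc_khz=8.0):
    """Return (Z, D) in crossings per 10 msec for the band-limited model."""
    m0 = band_moment(0, f_lc_khz, f_uc_khz, rolloff)
    m2 = band_moment(2, f_lc_khz, f_uc_khz, rolloff)
    m4 = band_moment(4, f_lc_khz, f_uc_khz, rolloff)
    # Frequencies in kHz give crossings per msec; scale by 10 for 10 msec.
    z = 10.0 * 2.0 * math.sqrt(m2 / m0)
    d = 10.0 * 2.0 * math.sqrt(m4 / m2)
    return z, d

# Typical parameters quoted in the text: (f_LC in kHz, roll-off exponent n).
typical = {"s": (3.7, 0), "sh": (2.5, 4), "f": (1.25, 1)}
results = {name: model_z_d(*params) for name, params in typical.items()}
for name, (z, d) in results.items():
    print(name, round(z, 1), round(d, 1), round(d / z, 2))
```

Under these assumptions the ratio D/Z stays close to 1 for the /s/ and /ʃ/ parameters and is noticeably larger for the /f/ parameters, which is the compact versus diffuse distinction drawn in the text.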
4.3 Other Measurements for Unvoiced Fricatives

4.3.1 Zero-Crossing Distances

Zero-crossing distances were analyzed for the UVFs under the same preprocessing conditions as for the vowels, namely:
i) no filtering
ii) lowpass filtering and
iii) highpass filtering

For the UVFs there was virtually no activity on the LP - TIHs, as one would expect, since the UVFs have almost no low frequency energy. By the same token, the activity patterns on the UF and HP time-interval histograms were quite similar. Small differences between the two were found to occur due to the non-ideal HP filter characteristics.

Table 4-2 summarizes the centre-of-gravity measurements, U_m and H_m, made on the fricative nucleus. The U_m and H_m values are interpreted in terms of the physical channel numbers. Context and position information is also included in the table.

Table 4-2 shows that the activity patterns for /s/ and /ʃ/ were quite consistent and largely independent of context or position. For /f/, however, U_m and H_m were found to be more variable. In fact, the U_m and H_m values overlap those of /ʃ/. This phenomenon compares with the overlap of Z values for /f/ and /ʃ/ which is observed in Fig. 4-1.

Table 4-2 U_m and H_m for the Unvoiced Fricatives

VC or CV   U_m   H_m     VC or CV   U_m   H_m     VC or CV   U_m   H_m
/fi/       4     4       /sɛ/       2     2       /ʃi/       4     4
/fI/       5     7       /sæ/       2     2       /ʃæ/       5     5
/fʌ/       9     11      /su/       2     2       /ʃu/       5     5
/fU/       4     5       /[?]s/     2     2       /[?]ʃ/     4     4
/fu/       4     5       /[?]s/     2     2       /[?]ʃ/     4     4
/if/       3     4       /us/       2     2       /uʃ/       4     4
/ʌf/       4     5
/uf/       6     7

Note: VC = vowel + consonant and CV = consonant + vowel; [?] marks a vowel symbol that is illegible in the table.

One additional parameter was obtained from the HP - TIHs. It was observed that the activity pattern for /f/ was spread over a greater number of channels than for /s/ and /ʃ/.
This is indicative of a greater noise bandwidth for /f/ compared to /s/ and /ʃ/. For /s/ and /ʃ/, the standard deviation of the HP - TIH activity pattern, σ_H, was less than or equal to the distance between adjacent channels. For /f/, σ_H was greater than one channel unit. The above dispersion effect was found to be a particularly useful feature for distinguishing /f/ from those initial unvoiced stops which had long duration noise bursts.

4.3.2 Duration Measurement

The duration of the noise dominant region of the UVFs has been shown by several other workers to be a useful feature for distinguishing the UVFs from voiced fricatives [17,126], affricates [124] and stops [19]. That is, the duration of the noise dominant region is considered to be longest for the UVFs. This observation was found to be true for our data. Fig. 4-4 shows a plot of the range of durations observed in our data.

The duration measurements were determined by means of the following algorithm. Let μ_1(j) and μ_2(j) be Boolean variables defined as follows:

\[ \mu_1(j) = 1, \ \text{iff} \ [Z \ge 30] \tag{4.5} \]

\[ \mu_2(j) = 1, \ \text{iff} \ [D \ge 60] \tag{4.6} \]

and let V_n(j) be a Boolean function defined as:

\[ V_n(j) = \mu_1(j) \cdot \mu_2(j) \tag{4.7} \]

Then the duration, T_n, is defined as the maximal length sequence for which

\[ V_n(j) \cdot V_n(j+1) = 1 \tag{4.8} \]

for all j in the sequence.

In our data, the word final UVFs were found to be of longer duration than the word initial UVFs. The durations of initial /f/, in fact, were found to be relatively short and comparable in duration to the noise burst of some initial unvoiced stops. This agrees with an observation by Reddy [48], who considered that some /f/ were more stop-like than UVF-like.

Fig. 4-4 Observed Frictional Noise Duration Ranges for UVFs (ranges shown for initial and final /s/, /ʃ/ and /f/; horizontal axis: duration in msec).
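The duration rule of (4.5) to (4.8) amounts to finding the longest unbroken run of frames in which both threshold tests hold. A minimal sketch is given below. It assumes one (Z, D) pair per 10 msec frame and reports the run length in frames multiplied by the frame interval; both the frame interval and the function names are assumptions made for the illustration, not details confirmed by the text.

```python
def longest_run(flags):
    """Length of the longest unbroken run of True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def noise_duration_msec(z_frames, d_frames, frame_msec=10.0,
                        z_min=30, d_min=60):
    """Duration of the longest run of frames with Z >= z_min and D >= d_min."""
    # v_n mirrors the product of the two Boolean variables in (4.7).
    v_n = [z >= z_min and d >= d_min for z, d in zip(z_frames, d_frames)]
    return longest_run(v_n) * frame_msec

# Hypothetical frame values: three consecutive frames pass both tests.
z_frames = [10, 40, 50, 60, 20, 70]
d_frames = [20, 70, 80, 90, 95, 30]
duration = noise_duration_msec(z_frames, d_frames)
print(duration)
```

Setting z_min to 0 leaves a test on D alone, which can only lengthen the run that is found.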
In Fig. 4-4, the dashed line represents duration measurements based on the maximal length sequence where

    $\mu_2(j) \cdot \mu_2(j+1) = 1$    (4.9)

for all j in the sequence. The dashed line corresponds to initial /f/. The longer durations obtained under (4.9) compared to those obtained under (4.8) result from the fact that the initial part of the frictional noise contained more low frequency components than later regions.

4.3.3 Intensity Measurements

Table 4-3 presents data on the observed intensities of the UVFs relative to the vowel intensities. The table shows that some samples of /s/ and /ʃ/ were very intense. As mentioned in section 3.9, this phenomenon hindered attempts to isolate the vowels from the UVFs strictly on the basis of intensity. The table also shows that the intensity of /f/ was quite low. As previously mentioned in section 4.2.3, the low intensity for /f/ resulted in difficulties in analyzing /f/ with the filter-bank spectrum analyzer.

Table 4-3 Relative Amplitudes of Unvoiced Fricatives

    UVF    Position in Word    Relative Amplitude (dB)
                               Max.      Min.
    /f/    Initial             -23       -31
    /f/    Final               -19       -30
    /s/    Initial             -5        -15
    /s/    Final               -8        -16
    /ʃ/    Initial             -3        -10
    /ʃ/    Final               -2.2      -12

4.4 Machine Isolation and Classification of Unvoiced Fricatives

4.4.1 Machine Isolation of the Unvoiced Fricatives

The UVFs were relatively easy to isolate from vowel environments by virtue of the distinctive properties of the noise dominant region. That is, the frictional noise has acoustical characteristics distinctly different from those of the vowels. Consequently, there were well defined natural acoustical boundaries in the signal.
The presence of a noise dominant temporal region is, however, not sufficient to distinguish the UVFs, as a group, from VFs and the noise burst of stops. A useful feature for this latter discrimination task is the duration of the frictional noise, which is longest for the UVFs.

Candidate consonantal regions were obtained from the algorithms of (3.41) and (3.42). Since the conditions for a consonantal region as expressed by (3.42) were not strong, a more stringent set of conditions was formulated to detect and isolate the UVF group. Let the following Boolean variables be defined:

    $\mu_3(j) = 1$, iff $[D \ge 50]$      (4.10)
    $\mu_4(j) = 1$, iff $[U_m \le 11]$    (4.11)
    $\mu_5(j) = 1$, iff $[L_m > 64]$      (4.12)
    $\mu_6(j) = 1$, iff $[T_n > 80]$      (4.13)

The condition $[L_m > 64]$ is a null condition for the LP-TIH and indicates the absence of any activity. The duration, $T_n$, was determined by means of the algorithm described in section 4.3.2. Let $V_f$ be the Boolean function defined as

    $V_f(j) = \mu_1(j) \cdot \mu_3(j) \cdot \mu_4(j) \cdot \mu_5(j)$    (4.14)

If the condition

    $V_f(j) = 1$    (4.15)

prevailed for eight consecutive values of j, then a fricative-like region was declared.

The above algorithm successfully isolated all the sample UVFs as well as two allophones of the VF /z/ and the noise burst of some initial unvoiced stops preceding the vowel /i/.

As with the vowel isolation algorithm, one of the functions of the isolation algorithm was to extract suitable features to be passed on to the classification stage. Since the work described in the preceding sections indicated that simple static measurements suffice to describe the UVF nucleus, suitable features were extracted in the following way. The isolated, noise dominant region was first divided into four equal parts and then the measurements of Z, D, $H_m$ and $\sigma_H$ were averaged over the two middle parts.
This procedure removed less stable regions near the boundaries. Let the corresponding averages be denoted by $Z_m$, $D_m$, $H_m$ and $\sigma_H$. One additional measurement, $T_n$, as determined by the algorithm described in section 4.3.2, was made. From these resultant measurements, binary features were defined as described in the next section.

4.4.2 Machine Classification of the Unvoiced Fricatives

The features employed for UVF classification were defined in terms of Boolean variables as shown in Table 4-4. The binary features described in the table were designed: (i) to simplify classifier design, (ii) to ensure separability of the UVF classes from each other and (iii) to discriminate against other consonants such as VFs and unvoiced stops.

Fig. 4-5 shows a flow chart of the UVF classification algorithm. All the available samples were classified correctly and no other consonants were misclassified as a UVF. Discrimination of the UVFs from VFs and the noise burst of unvoiced stops was accomplished mainly by the duration criterion.

Table 4-4 Feature Definitions for UVF Classification

    Feature      Iff condition for feature = 1
    $\mu_7$      $D_m > 115$
    $\mu_8$      $85 < D_m \le 115$
    $\mu_9$      $50 < D_m < 85$
    $\mu_{10}$   $Z_m > 95$
    $\mu_{11}$   $75 < Z_m < 95$
    $\mu_{12}$   $45 < Z_m < 75$
    $\mu_{13}$   $30 < Z_m < 45$
    $\mu_{14}$   $H_m < 3$
    $\mu_{15}$   $3 < H_m < 6$
    $\mu_{16}$   $H_m < 13$
    $\mu_{17}$   $T_n > 120$
    $\mu_{18}$   $80 < T_n < 120$
    $\mu_{19}$   $\sigma_H \le 1$
    $\mu_{20}$   $\sigma_H > 1$

Ambiguities arose, however, which corresponded to the condition $\mu_{17} = 1$. The ambiguities were caused by: (a) devoicing in final /z/, (b) the shortness of some initial /f/ and (c) the relatively long noise burst of the unvoiced stops preceding the vowel /i/.

[Fig. 4-5 Flowchart for UVF Classification Algorithm]

The right hand side of the flow chart shows the procedures used to handle the ambiguous cases.
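The isolation rule of section 4.4.1 (condition $V_f(j) = 1$ holding for eight consecutive frames) and the averaging over the two middle quarters of the isolated region can be sketched as follows. This is an illustrative sketch only: the dictionary keys, the function names and the convention that an empty LP-TIH is reported as a value of $L_m$ above 64 are assumptions, and regions whose length is not a multiple of four are split approximately.

```python
def fricative_like_regions(frames, min_run=8):
    """Return (start, end) frame index pairs, end exclusive, where
    V_f(j) = 1 holds for at least min_run consecutive frames.

    frames: list of dicts with hypothetical keys "Z", "D", "U_m", "L_m".
    """
    regions, start = [], None
    for j, fr in enumerate(list(frames) + [None]):  # sentinel closes a trailing run
        on = fr is not None and (
            fr["Z"] >= 30          # mu_1, from (4.5)
            and fr["D"] >= 50      # mu_3, from (4.10)
            and fr["U_m"] <= 11    # mu_4, from (4.11)
            and fr["L_m"] > 64     # mu_5, from (4.12): no LP-TIH activity
        )
        if on and start is None:
            start = j
        elif not on and start is not None:
            if j - start >= min_run:
                regions.append((start, j))
            start = None
    return regions


def middle_half_mean(values):
    """Mean over the two middle quarters of a non-empty region."""
    q = len(values) // 4
    core = values[q:len(values) - q]
    return sum(core) / len(core)
```

For example, `middle_half_mean` applied to the per-frame Z values of an isolated region gives a value that plays the role of $Z_m$ in Table 4-4.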
4.5 Analysis of the Voiced Fricatives

4.5.1 A Theoretical Relationship Between Spectral Structure and Z - D

The voiced fricatives were found to be considerably more complex and variable than the unvoiced counterparts. This is partially a consequence of the mixed source modes involved in the production of VFs. To clarify some of the complexities involved, it is convenient to establish a simplified theoretical model for reference.

[Fig. 4-6 Simplified Production Model for Voiced Fricatives. Blocks: glottal pulse source p(t) driving the formant filter H(f) with output w(t); white Gaussian noise source v(t) driving a modulator with output v(t)u(t), followed by the fricative filter G(f) with output y(t); w(t) and y(t) combine to give the voiced fricative signal x(t).]

Consider the model shown in Fig. 4-6, which is a simple production model for VFs [24,73] and which is also the form of the model used for synthesis of VFs [108,110]. The model is a reasonably realistic one in that voicing modulation of the frictional noise is considered. The excitation, v(t), can be considered to be a sample function from a zero-mean, Gaussian random process, which is uncorrelated with the voicing signal, p(t). Let u(t) be a periodic signal, usually considered to be a rectangular wave, synchronized with p(t). The fricative filter is a linear, time-varying filter, but since the variation with time is slow, it is convenient to assume that the filter is time-invariant over the time interval of interest. The resultant transfer function of the filter is then G(f) and the impulse response is g(t). Therefore, from Fig. 4-6 the output signal from the fricative filter is given by:

    $y(t) = g(t) * [v(t)\,u(t)]$    (4.16)

The power spectrum of y(t) is given by:

    $S_y(f) = \sum_{k=0}^{\infty} d_k^2 \, |G(f)|^2 \, S_v(f - k\alpha)$    (4.17)

where:

    $\alpha$ = voicing frequency
    $d_k^2 = (u_k^2 + \hat{u}_k^2)/2$ for $k > 0$, and $d_k^2 = u_0^2$ for $k = 0$    (4.18)

The coefficients $u_k$ and $\hat{u}_k$ are the Fourier coefficients of u(t) defined in a similar manner to (3.3). $S_v(f)$ is the power spectral density of v(t). From the equations developed in section 3.4, we can show that the power spectrum of w(t) can be written in the form:

    $S_w(f) = \sum_{k=0}^{\infty} c_k^2 \, |H(f)|^2 \, \delta(|f| - k\alpha)$    (4.19)

Thus, the power spectrum of the final output signal is given by:

    $S_x(f) = S_w(f) + S_y(f)$    (4.20)

since $S_{yw}(f) = 0$, where $S_{yw}(f)$ is the cross power spectral density of y(t) and w(t). From (2.7), the mean zero-crossing rate of the n-th derivative of x(t) is given by the relationship:

    $\left(\frac{Z_n}{2}\right)^2 = \frac{\int_0^{\infty} f^{2n+2} S_x(f)\,df}{\int_0^{\infty} f^{2n} S_x(f)\,df}$

which, for the VF model, becomes

    $\left(\frac{Z_n}{2}\right)^2 = \frac{\sum_{k=0}^{\infty} \left[ c_k^2 |H(k\alpha)|^2 (k\alpha)^{2n+2} + d_k^2 \int_0^{\infty} f^{2n+2} |G(f)|^2 S_v(f - k\alpha)\,df \right]}{\sum_{k=0}^{\infty} \left[ c_k^2 |H(k\alpha)|^2 (k\alpha)^{2n} + d_k^2 \int_0^{\infty} f^{2n} |G(f)|^2 S_v(f - k\alpha)\,df \right]}$    (4.21)

The power spectrum of v(t), $S_v(f)$, can usually be assumed to be constant over the audio range and so (4.21) can be simplified somewhat.

There are some special cases of interest which often occur with VFs in real speech. The case of $d_k^2 \ll c_k^2$ for all k corresponds to very weak frictional noise, or equivalently, to strong voicing dominance. For this case, (4.21) reduces to a form similar to (3.26), which corresponds to a vowel-like sound. The case of $d_k^2 = 0$ for all $k > 0$ corresponds to a situation in which the frictional noise is not voicing modulated. The case of $d_k^2 = 0$ for $k > 0$ and $c_k^2 = 0$ for all k corresponds to the absence of voicing and, as we would expect, (4.21) reduces to the relationship for UVFs.
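The behaviour of relation (4.21) in the special cases above can be explored numerically with a small sketch. It is written under simplifying assumptions that are not part of the thesis: the voiced part is given as a list of (frequency, power) lines, the noise part is not voicing modulated (only the k = 0 noise term is kept) and the product $|G(f)|^2 S_v(f)$ is replaced by a constant density over one frequency band. The function and parameter names are hypothetical.

```python
import math


def zero_crossing_rate(harmonics, noise_density, noise_band, n=0):
    """Mean zero-crossing rate (crossings/sec) of the n-th derivative.

    harmonics     : list of (frequency_hz, power) spectral lines (voiced part)
    noise_density : assumed constant noise power density inside noise_band
    noise_band    : (f_lo, f_hi) in Hz, an assumed rectangular noise band
    """
    f_lo, f_hi = noise_band

    def moment(order):
        # Spectral moment: line terms plus the closed-form band integral.
        lines = sum(p * f ** order for f, p in harmonics)
        band = noise_density * (f_hi ** (order + 1) - f_lo ** (order + 1)) / (order + 1)
        return lines + band

    return 2.0 * math.sqrt(moment(2 * n + 2) / moment(2 * n))
```

With no noise and a single 100 Hz line the rate is 200 crossings per second, the vowel-like limit; with noise only, the rate depends on the noise band alone; adding even weak high frequency noise to the line pulls the rate sharply upward, and the n = 1 (derivative) rate responds more strongly than the n = 0 rate.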
With respect to the above simplified model, the complexity of VFs in real speech arises from several factors including: (i) variations of the ratios $\beta_k = d_k^2 / c_k^2$ with time, (ii) position and context dependency of $\beta_k$, (iii) slow variation of H(f) with time. Temporal variations of G(f) also occur but can be ignored since the variations occur when $d_k^2 \approx 0$ for all k. The variation of H(f) accounts for the region of transition from a vowel to the VF or from the VF to the vowel. It is the above time varying parameters which give rise to the numerous temporal regions associated with the VFs, which were described in section 4.1.2.

4.5.2 Acoustic Properties of the Voiced Fricatives

In the preceding section, a simplified production model for the voiced fricatives was described in order to give some insight into the complexity and variability of the acoustic properties of these phonemes. As discussed earlier in section 4.1.2, VFs in real speech tend to have a multiplicity of sequential temporal regions associated with them. For the UVFs, the relatively stable and consistent nucleus allowed us to ignore secondary temporal regions and permitted use of "static" features for characterization. By "static" features, we mean features essentially all measured at the same point in time.
For the VFs, such a simple static characterization, although desirable, was found to be inadequate for the effective characterization of the VFs in relation to discrimination from other phoneme types.

In this section, we describe more explicitly some of the observed acoustic properties of the VFs. The conventional grouping of /z/, /ʒ/ and /v/, supposedly as a consistent group, was found to be rather inconvenient from an acoustical viewpoint. Consequently, some of the more phoneme-specific acoustical properties are discussed in the later sections dealing with machine identification of the VFs. As a general observation, /z/ was found to be the least variable of the three VFs and /v/ the most variable.

Inspection of spectrographic data showed that the noise intensities, particularly for /ʒ/ and /v/, were considerably less than for the unvoiced counterparts. This agrees with the observations of others. Fant¹, for example, suggests that for /v/, the noise component is of secondary importance. The above evidence indicated that isolation and classification of the VFs, particularly /ʒ/ and /v/, should not be critically dependent upon detection of the frictional noise component.

Several other phenomena were observed in our data which concur with the reports of others.
It was found, for example, that in word final position, the VFs tended to be partially devoiced and sometimes completely devoiced at the end of the utterance. Several other workers [17,126] have observed a similar effect. Another phenomenon was that the duration of the noise component was less than that for the unvoiced counterpart [130,126].

In analyzing the VFs, two types of descriptive features were sought. One type were those features common to the entire VF group which would result in effective discrimination against other phoneme types. The second type of features were those which would permit separation of the classes within the VF group.

Relatively few consistent features of the first type were found. This was attributed to the mixed excitation modes during production and the large range of noise intensity encountered amongst the VFs. Detection of the presence of the frictional noise component, for example, was simplified by designing the features used for noise detection to be dependent on the particular VF in question. In other words, it was generally found to be best to use slightly different features for characterizing the different VFs. The results of this design philosophy are summarized in the feature designs described in sections 4.6.2, 4.6.3 and 4.6.4.

1/ [73], p. 170.

Nevertheless, several group-common features were discovered.
One feature was a maximum in the lowest channels¹ of the LP-TIH activity pattern for a significant time interval during the production of the phoneme. In most cases the dominant activity was in channel 64, but in all cases, the range was from channels 52 to 64, inclusive. This feature corresponds to a very low first formant frequency. A very low value of $F_1$ is common to sounds which are produced with a high degree of vocal-tract constriction [73,50,58].

A complementary feature to the above was the motion of $L_m$ to or from a higher position for an adjacent vowel to the very low position for the VF. This feature was most prominently visible for word final VFs. The above two features relate to the specification of H(f) in the model described in the preceding section.

1/ Lowest channels means those with the largest zero-crossing

Another feature found to be common to many of the VF samples was the occurrence of a high D-to-Z ratio, particularly in temporal regions corresponding to increasing or decreasing noise intensity. Typical D/Z ratios of 3 to 14 were observed. The D/Z ratio in this latter situation can be regarded as a measure of the ratio of the energy of the noise component to the energy in the voicing components.
In terms of the production model of section 4.5.1, the D/Z ratio gives an indirect measure of the β parameters.

The dynamical behaviour of the acoustical signal, alluded to above, can be described more explicitly as follows. For many of the VF samples, a temporal region corresponding to occluded-voicing dominance occurred early in the sound pattern. Within this temporal region, the D/Z ratio was low and typically in the range from 1.0 to 2.0. In the next temporal region, the intensity of the noise component became stronger. Since D is a more sensitive indicator of the noise intensity, its value increased prior to any increase in Z. The resultant effect was a high D/Z ratio in this temporal region. For /z/, the noise component invariably became strong enough to affect Z strongly and consequently caused the D/Z ratio to drop to lower values again. In section 4.1.1, this region of strong noise intensity was designated as a region of frictional noise dominance. For /v/, the noise component often remained relatively weak and consequently Z remained at low values and the D/Z ratio remained at high values.

The above described behaviour of D/Z is most easily illustrated on a D versus Z plot, but can also be deduced from inspection of the time graphs for Z and D. Figs. 4-7a and 4-7c show time graphs of Z and D for the word "zebra" and the last half of the word "has". Fig. 4-7d shows a D vs Z plot for the /z/ in "zebra". In Figs. 4-7a and 4-7c, it is clearly evident that Z commences to increase at a later time than D does. In Fig. 4-7d, large values of D/Z are indicated by the location of data points far removed from the diagonal, which represents the D = Z locus.
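The lag of Z behind D and the peak D/Z ratio can be measured from the two time graphs with a few lines of code. The sketch below is only one possible reading of the behaviour described above: it takes the lag as the difference between the first frames at which D and Z reach a fixed threshold, and it assumes a threshold of 60 counts per frame and a 10 msec frame spacing. The function names and list inputs are hypothetical.

```python
def first_crossing(values, threshold):
    """Index of the first frame whose value reaches the threshold, or None."""
    for j, v in enumerate(values):
        if v >= threshold:
            return j
    return None


def z_lag_ms(z, d, threshold=60, frame_ms=10):
    """Delay (msec) between D and Z first reaching the threshold."""
    j_d = first_crossing(d, threshold)
    j_z = first_crossing(z, threshold)
    if j_d is None or j_z is None:
        return None  # one of the measurements never reaches the threshold
    return (j_z - j_d) * frame_ms


def max_dz_ratio(z, d):
    """Largest per-frame D/Z ratio, skipping frames where Z is zero."""
    ratios = [d_j / z_j for z_j, d_j in zip(z, d) if z_j > 0]
    return max(ratios) if ratios else None
```

A weak-noise sound in which Z never reaches the threshold returns `None` for the lag while still showing a large `max_dz_ratio`, which mirrors the contrast drawn above between /z/ and /v/.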
The time delay between the increase in D and the increase in Z was found to be a useful feature for characterizing /z/ for the purposes of distinguishing /z/ from the unvoiced counterpart /s/ and also from the plosive noise burst of the stop consonants. An effective measure for this feature was the time difference between the instants where D and Z crossed a fixed threshold. With thresholds of Z = D = 60/10 msec, typical delays of 30 to 40 msec were observed. The bottom time graphs in Figs. 4-7a and 4-7c illustrate the algorithmic measurement of this time delay. The pedestal of the pulse-like waveform displayed represents the region where $D \ge 60$ and $Z < 60$. The baseline represents regions where $D < 60$ and $Z < 60$, and the pulse-top level corresponds to regions where $D \ge 60$ and $Z \ge 60$.

The detection of the presence of the noise component of a VF and measurement of its duration were found to be best accomplished by means of features related to specific phonemes rather than the entire group in order to cope with the wide range of noise intensities encountered.

[Fig. 4-7b Merged HP and LP-TIHs for "zealous"]

[Fig. 4-7d D versus Z Plot for /z/ in "zebra"]

For /z/, the noise component was quite intense and strongly affected both Z and D. In fact, the peak values of Z and D approached the corresponding values for the unvoiced counterpart /s/. Furthermore, on the HP time-interval histogram, the activity pattern was similar to that for /s/.
These phenomena made the detection of the noise component of /z/ relatively easy but, on the other hand, it is evident that other features will be required to distinguish /z/ from /s/. For /z/, noise duration measurements were made using the same algorithm employed for the UVFs. On this basis, the measured duration range for initial and medial /z/ was from 50 to 100 msec. In word final position, the measured durations ranged from 70 to 110 msec.

The noise component of /ʒ/ was more difficult to detect and characterize for two reasons. One was that the noise intensity was weaker than for either /z/ or the unvoiced counterpart /ʃ/. The consequence of this condition was that although D approached the comparable value for /ʃ/, the value of Z did not generally attain its respective target value. This phenomenon was particularly true for medial /ʒ/, as they exhibited very weak frictional noise intensities. The second reason was that the main noise component of /ʒ/ occurred in the same spectral frequency region as the second formant of the vowel /i/, with the result that the D values differed from those of some /i/ samples by only a small margin.

The two best features for characterizing the frictional noise of /ʒ/ were the features $[D \ge 50]$ and $[4 \le U_m \le 5]$. The latter feature was particularly useful for distinguishing /ʒ/ from /i/, since the range for /i/ was much higher. The Z measurement was not a good indicator for the frictional noise of /ʒ/.

For /v/, the intensity of the noise component was found to be stronger in the word initial position than in the word final position. The latter case is known to be fairly common with /v/, particularly if the preceding vowel is a rounded one such as /u/ or /o/ [26,73].
For word initial /v/, a useful feature for indicating the presence of frictional noise was found to be the conjunction of the conditions $[D \ge 50]$ and $[Z < 40]$. The latter condition served to distinguish /v/ from other phonemes such as /z/, the UVFs and the unvoiced stops. For word final /v/, a useful indicator of the noise component was the feature $[\sigma_H > 1]$, which indicates a spread out HP-TIH activity pattern similar to that of /f/.

Other phoneme-specific descriptive features for characterization and discrimination of the VFs are given in the following sections.

4.6 Machine Isolation and Classification of the Voiced Fricatives

4.6.1 On the Separability of the Isolation and Classification Functions

The voiced fricatives were found to be quite complex sounds in real speech. In contrast, the vowels and particularly the UVFs were found to be relatively simple sounds with consistent and static features. The process of phoneme isolation for the vowels and UVFs implies partial phonemic-group identification. Ease of isolation of a phoneme group member depends on two premises [35,45,175-177]. One is the existence of consistently common, or necessary, features for the entire phoneme group which are not jointly common to other phonemes. The second condition is the existence of well defined acoustical boundaries enclosing unique temporal regions with consistent and stable properties. An example of this latter condition is the frictional noise dominant region of the UVFs. Provided that the above form of isolation or partial identification of a phoneme can be implemented, the final classification stage is then merely concerned with the separation of the members of the phoneme group.
The above two step process was found to be very difficult to apply to the recognition of VFs. Firstly, in real speech, the VFs, as a group, had relatively few, simply defined features common to the group. For example, the definite presence of high frequency frictional noise, which is a necessary feature of the UVFs, is not a necessary feature of the VF /v/. Secondly, simple acoustic boundaries do not generally exist for the VFs and furthermore, simple, static characterizations are inadequate for the VFs.

Evidently then, more complex procedures, which more closely match the complexity of the VFs, are required to identify the VFs. The procedure that was developed essentially treats each phoneme separately rather than as a member of a larger group. This procedure allowed more powerful, phoneme-specific features to be designed. Such a procedure compares in a limited way to the specific sound detectors employed by Bobrow et al. [18,94]. In addition, our identification procedures involve use of dynamic features and are consequently primitive forms of the analysis-by-synthesis recognition model proposed by Halle and Stevens [101].

The initial steps in the VF identification procedure are outlined below and the final identification stages are described in the following sections. The vowel-consonant separation algorithm, previously described in section 3.9, provided appropriate consonantal candidates. It was found that the threshold on the intensity measurement, which is contained in (3.37), was sufficiently high to include as part of the VFs almost all the temporal regions described in section 4.1.2, except the full vowel transition regions. This deficiency was not important in our work since explicit use was not made of vowel transition information.
Furthermore, the vowel transition regions cannot be uniquely assigned to a particular phoneme, as was discussed in section 4.1.2.

Following the above step, the UVF isolation and classification algorithms were next used to eliminate the consonantal candidates that were unambiguous UVFs. The non-initial consonantal candidates were then grouped as stop-like or non-stop-like according to whether a stop-gap interval was detected. The procedures for detecting a stop-gap interval are described in the next chapter. The resultant non-stop-like candidates were passed on to the VF identification algorithms described in the following sections. The remaining initial consonantal candidates were passed on to only the /z/ and /v/ identification algorithms since /ʒ/ never occurs in word initial position.

4.6.2 Identification of /z/

The identification of /z/ required the discrimination of /z/ from three other types of phonemes: (1) the other VFs, (2) the UVFs and (3) word initial stop bursts. Accordingly, the features for identifying a consonantal candidate as /z/ were designed with the above discrimination tasks in mind. The design procedure involved evaluation of features on the basis of intra-class categorization power and inter-class discrimination power. The following binary features were ultimately designed:

    $y_1 = 1$, iff $[D \ge 100] \cdot [Z > 50] \cdot [H_m \le 3]$    (4.22)
    $y_2 = 1$, iff $[\Lambda_D > 50$ msec$]$    (4.23)
    $y_3 = 1$, iff $[\lambda_{ZD} \ge 30$ msec$] \cdot [(D/Z)_{max} \ge 3]$    (4.24)
    $y_4 = 1$, iff $[52 \le L_m \le 64] \cdot [\Lambda_L > 40$ msec$]$    (4.25)

The y defined here are not related to the y defined in chapter III. The parameters $\Lambda_D$ and $\Lambda_L$ are duration measurements of the maximal length regions where $[D \ge 100]$ and $[52 \le L_m \le 64]$, respectively.
The parameter $\lambda_{ZD}$ is the time lag of the region of increasing Z with respect to that of D, which was described in section 4.5.2, and $(D/Z)_{max}$ is the maximum value of D/Z in the measurement region for $\lambda_{ZD}$.

The candidate consonantal region was declared to be /z/ if the condition:

    $y_1 \cdot y_2 \cdot y_3 \cdot y_4 = 1$    (4.26)

was satisfied. The importance of temporal features is clearly evident in the above algorithm and implies a dynamic characterization model. The feature, $y_1 = 1$, is a static feature which compares to the types used for vowel and UVF classification. This feature serves to distinguish /z/ from /ʒ/ and /v/ and from the UVFs /ʃ/ and /f/. The features, $y_2 = 1$ and $y_3 = 1$, are dynamic conditions which serve to distinguish /z/ from the plosive burst of word initial unvoiced stops. The second feature also aids in distinguishing /z/ from /s/. The last feature, $y_4 = 1$, is a hybrid feature which also aids in distinguishing /z/ from /s/.

The above algorithm correctly identified all the samples of /z/ and rejected all other consonant samples. It should be remembered that the above algorithm works in conjunction with the UVF classification algorithm depicted in Fig. 4-5 to properly handle devoiced final /z/. The samples of /z/ comprised six contextual and positional allophones. Although the above algorithm is a minimal, context-independent form, the range of allophones treated indicates that the algorithm is a suitable one for identifying other allophones of /z/, since /z/ appeared to be the least variable of the VFs.

4.6.3 Identification of /ʒ/

No word initial forms of /ʒ/ occur in the English language. Consequently, any consonantal candidate in word initial position was automatically excluded from consideration as /ʒ/.
This meant that discrimination against initial VFs, initial UVFs and initial unvoiced stops was not required and resulted in a less complex identification algorithm for /ʒ/ than for /z/. The identification of the other positional allophones of /ʒ/ required discrimination against the other VFs and the UVF /ʃ/.

The following binary features were used in identifying /ʒ/:

    $y_5 = 1$, iff $[40 < D < 100] \cdot [4 \le H_m \le 5]$    (4.27)
    $y_6 = 1$, iff $[52 \le L_m \le 64] \cdot [\Lambda_L \ge 40]$    (4.28)
    $y_7 = 1$, iff $[\sigma_H \le 1]$    (4.29)

A non-initial, consonantal candidate region was declared to be a /ʒ/ if the condition:

    $y_5 \cdot y_6 \cdot y_7 = 1$    (4.30)

was satisfied.

The feature, $y_5 = 1$, served to distinguish /ʒ/ from /z/ and /s/. The feature, $y_6 = 1$, discriminated against the UVFs, particularly /ʃ/. The last feature, $y_7 = 1$, served to separate /ʒ/ from /v/. The design of the last feature arose from the fact that the frictional noise caused a very compact HP-TIH activity pattern for /ʒ/ and a spread out pattern for /v/. This effect compares with the previously described HP-TIH activity dispersion patterns for the UVFs /ʃ/ and /f/.

Only three allophonic samples of /ʒ/ were available for testing the above algorithm and these were correctly identified. A possible weakness in the procedures for identifying /ʒ/ is perhaps the algorithm for isolating the consonantal regions from the vowel regions. In cases where the frictional noise of /ʒ/ is weak, /ʒ/ has acoustic properties rather similar to the vowel /i/ and the main distinction between the two then comes from the lesser intensity of /ʒ/. This was the situation for /i/ and /ʒ/ in the word "seizure" and the identification of /ʒ/ was critically dependent upon the lower intensity of /ʒ/ compared to /i/.
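The decision rule of (4.27) to (4.30) amounts to a conjunction of three binary features evaluated on one candidate region. The sketch below illustrates that structure only: the dictionary keys are hypothetical, each measurement is assumed to be a single representative value for the region, and the inclusive or strict form of each inequality is an assumption because the printed signs are not fully legible in the source.

```python
def looks_like_zh(region):
    """Apply the three-feature test for /ʒ/ to one candidate region.

    region: dict with hypothetical keys
        "word_initial" (bool), "D", "H_m", "L_m",
        "lambda_L_ms" (duration of low-L_m activity, msec), "sigma_H"
    """
    if region["word_initial"]:
        return False  # /ʒ/ does not occur word initially in English

    y5 = 40 < region["D"] < 100 and 4 <= region["H_m"] <= 5        # (4.27)
    y6 = 52 <= region["L_m"] <= 64 and region["lambda_L_ms"] >= 40  # (4.28)
    y7 = region["sigma_H"] <= 1                                     # (4.29)
    return y5 and y6 and y7                                         # (4.30)
```

Keeping the three features as separate named booleans makes it easy to report which discrimination (against /z/ and /s/, against the UVFs, or against /v/) rejected a given candidate.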
4.6.4 Identification of /v/

Machine recognition of /v/ was found to be simplified by acknowledging the existence of two different acoustic variants: (a) type I, which has a noise component of significant intensity, and (b) type II, which has a very weak noise component. This decomposition reflects the fact that /v/ is quite a variable phoneme acoustically and one of the more difficult phonemes to identify in the acoustic signal.

The samples of word initial /v/ were found to be well articulated and of the type I variant. The samples of word final /v/ were observed to be of the type II variant. The type II variant is known to be a common form in non-initial position [21,73]. For this type of /v/, there is evidence [131,73] which suggests that perceptually, the noise is of secondary importance and that the voicing characteristics, such as a very low value of $F_1$ and the formant transitions from adjacent vowels, are of primary importance. In our work, the vowel formant transitions were not used in the identification procedure, mainly because not enough design data was available. However, the presence of a very low value of $F_1$, expressed by the feature $[52 \le L_m \le 64]$, was used. It should be noted that this latter feature was also used in identifying /z/ and /ʒ/.

The VF /v/ is known [19,132,133] to have some spectral properties similar to the nasal consonants [140], and so discriminators to reject the few nasal consonants contained in our data were incorporated into our algorithm.
Also, because of the vowel-like quality of the type II variant of /v/ in word final position, it was necessary to distinguish between a natural vowel ending and an intentional /v/, since both types of regions were transmitted as consonantal candidates.

For the identification of the type I variant of /v/, the following binary features were designed:

$$y_8 = 1 \ \text{iff}\ [D < 100] \tag{4.31}$$

$$y_9 = 1 \ \text{iff}\ [D > 60] \tag{4.32}$$

$$y_{10} = 1 \ \text{iff}\ [\Delta_D > 50\ \text{msec}] \tag{4.33}$$

$$y_{11} = 1 \ \text{iff}\ [52 \le L_m \le 64] \cdot [\Delta_L > 40] \tag{4.34}$$

The parameter $\Delta_D$ is a duration measurement of the maximal length region where $y_9 = 1$. The joint occurrence of all the above conditions, that is,

$$y_8 \cdot y_9 \cdot y_{10} \cdot y_{11} = 1 \tag{4.35}$$

signified the presence of a type I /v/. The feature $y_8 = 1$ is a discriminator to reject the UVFs and the VF /z/. The feature $y_9 = 1$ is a discriminator to reject the nasal consonants. The features $y_{10} = 1$ and $y_{11} = 1$ served to distinguish initial /v/ from the initial stops /b,d,g/ and /p,t,k/, respectively.

For the identification of the type II variant of /v/, the following additional features were defined:

$$y_{12} = 1 \ \text{iff}\ [\sigma_H > 1] \tag{4.36}$$

$$y_{13} = 1 \ \text{iff}\ [\,\bar{I}\ \ldots\ I_{max}\,] \tag{4.37}$$

The parameter $I_{max}$ is a measure of the maximum value of the signal amplitude and was previously defined in (3.37). The parameter $\bar{I}$ is the mean value of I averaged over the consonantal candidate region.
With the above features and the features defined in (4.31) and (4.33), a type II /v/ was declared if

$$y_8 \cdot y_{11} \cdot (y_{12} \cdot y_{13}) = 1 \tag{4.38}$$

The feature $y_8 = 1$ effected discrimination against /z/. A second feature served to distinguish final /v/ from natural vowel endings, and a further feature served to discriminate /v/ from /ʒ/. The features $y_{12} = 1$ and $y_{13} = 1$ were designed to reject the nasal consonants in final position.

The above algorithms correctly identified five of the six allophones of /v/ that were tested and successfully rejected other consonantal types. However, the algorithms are considered to be incomplete since /v/ appears to be a more variable phoneme than our limited data could account for.

CHAPTER V

ANALYSIS AND RECOGNITION OF STOPS

5.1 General Characteristics of Stops

The stop or plosive consonants, /p,t,k,b,d,g/, like the fricative consonants, can be divided into two linguistic groups according to a traditional voiced/unvoiced distinction. Halle et al. [71] and Lisker [121] point out, however, that concurrent voicing is not crucial to the distinction. In other words, several acoustic features may contribute to the distinction.

The stops are amongst the most complex of speech sounds and have proven to be amongst the most difficult phonemes for speech recognition machines to identify. The main reasons for poor recognition performance can be traced to an inability to handle the temporal characteristics of the stops and the lack of simple, stable, static features for describing the stops. In order to appreciate some of the difficulties, we present in this chapter a fairly comprehensive description and discussion of the properties of stops, particularly in relation to time-domain measurements.
For convenience, we shall sometimes use the acronym "UVP" to refer to the term "unvoiced stop" and the acronym "VP" to refer to the term "voiced stop".

The temporal structure of the stops is as important as, or perhaps more important than, the specific acoustical characteristics at a particular point in time. For our purposes, it is convenient to consider seven temporal regions which may not always be present and which may overlap. These temporal regions are similar to those considered by Fant [25,73]. In approximate time sequence, the regions are:

i) transition from a preceding sound such as a vowel
ii) a relatively quiet interval corresponding to complete occlusion of the vocal-tract
iii) the stop release or initial transient of the explosion
iv) a frictional noise region
v) an aspirated sound region
vi) voicing onset
vii) transition to a following vowel.

For the VPs, region (v) is absent and region (vi) may overlap with regions (iii) and (iv). The characteristics of these regions are described in detail in the sections which follow.

Halle et al. [71] state that the quiet interval or stop-gap, corresponding to region (ii) above, is the only necessary feature of the stops as a group. In other words, this means that several sets of sufficient features may lead to the identification of a particular stop or a grouping of stops. As an example, consider the question of the voiced/unvoiced distinction. Acoustically, the voiced/unvoiced distinction cannot in general be made on the simple basis of presence or absence of concurrent voicing throughout the sound pattern. Several sufficient features have, however, been suggested by several different workers for establishing this distinction.
One feature is the greater latency, or delay in voicing onset from the instant of stop release, for the unvoiced or tense group /p,t,k/ over that of the voiced or lax group /b,d,g/ [134,60,73,151]. For the tense stops, this latency has been found to be typically 50 to 110 msec, whereas for the lax group, the typical latency has been found to be less than 20 msec.

A second feature, suggested by Halle et al. [71], is that the plosive force of production is less for the lax group than for the tense group. As an acoustic consequence, they report that there is a de-emphasis in the high frequency spectral components of the stop burst regions corresponding to (iii) and (iv) above for the lax group over the tense group.

A third feature [73,134] is the presence of a very low frequency, low intensity voicing signal during the stop-gap region [i.e. region (ii) above] of the lax group, which does not usually occur for the tense group. Since this feature is not consistently present in all samples of voiced stops, it is a sufficient rather than a necessary feature.

Still another feature, suggested by several workers [102,103], is that preceding vowels are generally longer before the members of the lax group than before the members of the tense group. Unfortunately, the best available data [102] considers only a symmetrical consonantal environment and is subject to a measurement bias towards shorter durations for the tense group due to the exclusion of any region of aspiration of the consonant that is in initial position. Nevertheless, if compensation is made for this bias, the vowels before VPs are still between 15% and 45% longer in duration than those preceding the UVPs. As a comparison, in our data, the difference for the vowel /ɪ/ was 25% and for the vowel /ɔ/ it was 50%.
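The latency criterion described above can be sketched as a simple threshold test. The following Python is a minimal illustration, assuming that the typical ranges quoted in the text (50 msec and above for the tense group, below 20 msec for the lax group) are used directly as decision bounds. That choice of bounds, the function name and the "undecided" outcome are assumptions made here for illustration and are not a recognition procedure taken from the cited work.

```python
def stop_voicing_from_latency(latency_ms: float) -> str:
    """Classify a stop as 'unvoiced', 'voiced' or 'undecided' from voicing latency.

    latency_ms is the delay, in milliseconds, from the instant of stop
    release to voicing onset.
    """
    if latency_ms < 0:
        raise ValueError("latency must be non-negative")
    if latency_ms >= 50.0:   # typical of the tense group /p,t,k/
        return "unvoiced"
    if latency_ms < 20.0:    # typical of the lax group /b,d,g/
        return "voiced"
    return "undecided"       # between the two typical ranges
```

Leaving an explicit "undecided" band reflects the point made in the text that no single feature is necessary for the voiced/unvoiced distinction, so other sufficient features would be consulted in such cases.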
It is apparent from the above discussion that the stops are apt to be difficult phonemes to characterize acoustically. In comparison with the other phonemes studied in this thesis, it can be said that the stops are the most highly organized in the time domain. The vowels and the UVFs, on the other hand, are highly organized in the frequency domain (i.e. static characterization). The VFs have elements of both temporal and spectral organization.

The stops are amongst the most important of the consonantal phonemes. In fact, statistics compiled by Denes [128] indicate that /t/ is the most frequently occurring consonant in English text. Table 5-1 shows the relative frequency of occurrence of the stop consonants, as published by Denes [128] and Dewey [127]. If all phonemes were equiprobable, the relative frequency of each phoneme in the English language would be about 2.38%, which corresponds to an inventory of 42 phonemes. From the following table, it is seen that amongst the stops, /t/ and /d/ are the most important in terms of functional burdening and /g/ is the least important.

Table 5-1 Relative Frequency of Occurrence of English Stops (%)

Investigator    /t/     /k/     /p/     /d/     /b/     /g/
Dewey           7.13    2.71    2.04    4.31    1.81    0.74
Denes           8.40    2.90    1.77    4.18    2.08    1.16

Dewey's data, although not shown here, further indicates that for both /t/ and /d/, the word final position is the most frequently occurring.
For /k/ and /g/, word initial position is more frequently occurring than the word final position by a factor of three for /k/ and thirteen for /g/. For /p/ and /b/, the word initial position is more important than the word final position, in terms of functional burdening, by a factor of six for /p/ and approximately a hundred for /b/.

5.2 Analysis of the Stop Consonants

5.2.1 The Stop-Gap or Quiet Interval

Two basic types of stop-gaps were observed in our data, which we shall designate as:

I) complete silence, that is, absence of any acoustic energy
II) a very low frequency, very low intensity voicing signal.

The signal present during the type II stop-gap is sometimes referred to in the literature [26,73] as a "voice bar", in reference to its position and appearance on a Sonagraph record.

For the UVPs in our data, the stop-gaps were found to be all of the first type. This, of course, includes the quiet intervals preceding stops in word initial position. For the VPs, the stop-gaps were mostly of the second type. The exceptions occurred for some /b/ and /g/ in word initial position.

The type I stop-gaps were found to be characterized by the following features:

$$I < 0.04\, I_{max} \tag{5.1}$$

$$Z < 2 \tag{5.2}$$

$$D < 2 \tag{5.3}$$

$$U_m = \text{null condition} \tag{5.4}$$

$$H_m = \text{null condition} \tag{5.5}$$

By "null condition", we mean no activity on the corresponding TIH. The features described by (5.1), (5.2) and (5.3) were a set of sufficient conditions for characterization of the type I stop-gap. Fig. 5-1 shows a CRT display of the silent intervals preceding and following the stop release of /t/ in the word "sit", as detected using the above three conditions. Type I quiet intervals are shown as a large negative level on the time graph marked "Q".

Fig. 5-1 Stop-Gap Detection for "sit"

Fig. 5-3 Voicing Latency for /g/ in "gun"
The algorithm using the above three conditions was successful in detecting the stop-gaps for all twenty-one UVP samples. Also, all the silent regions preceding and following words were correctly detected as type I silent intervals. The features described by (5.4) and (5.5) were an alternate set of sufficient conditions for detecting a type I quiet interval.

The type II stop-gaps were characterized by the following features:

$$I < 0.10\, I_{max} \tag{5.6}$$

$$Z < 10 \tag{5.7}$$

$$D < 25 \tag{5.8}$$

$$58 \le U_m \le 64 \tag{5.9}$$

$$58 \le L_m \le 64 \tag{5.10}$$

$$H_m = \text{null condition} \tag{5.11}$$

Fig. 5-2 shows a CRT display depicting the detection of the type II stop-gap preceding the stop release of /d/ in the word "sid", using only the above features involving I, Z and D. On the figure, the type II quiet interval is represented by a smaller magnitude negative level than that used for a type I quiet interval on the Q versus time graph. Note that following the release region of /d/, there is a type I quiet interval corresponding to end-of-word silence. Of the nineteen stop-gaps associated with the VPs, sixteen were classified by the above algorithm as type II; the remainder were classified as type I.

Several workers [135,137] have suggested that the duration of the stop-gap differs for different stops. Fig. 5-5 shows the observed stop-gap durations for our data, as measured using the Q time graphs. Of particular interest are the consistently longer durations for /p/ over that of /k/ in comparable contextual environments. A similar effect for these two phonemes has been reported by Menon et al. [137] and Scharf [135]. In word final position, the stop-gap duration of /g/ was consistently shorter than that of /b/ and /d/ for our data. It is not possible to make a comparison with other data because of lack of published data.
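The stop-gap detection described in this section can be sketched as a per-frame threshold test on I, Z and D, followed by a run-length measurement to obtain gap durations. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, the measurements are assumed to be available once per analysis frame, and the 10 msec frame spacing is an assumed value chosen only for the example. Type I is tested first because conditions (5.1) to (5.3) are strictly tighter than (5.6) to (5.8).

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def quiet_interval_type(i: float, i_max: float, z: float, d: float) -> int:
    """Return 1 for a type I quiet frame, 2 for type II, 0 otherwise."""
    if i < 0.04 * i_max and z < 2 and d < 2:      # (5.1), (5.2), (5.3)
        return 1
    if i < 0.10 * i_max and z < 10 and d < 25:    # (5.6), (5.7), (5.8)
        return 2
    return 0


def stop_gap_durations(labels: Sequence[int],
                       frame_ms: float = 10.0) -> List[Tuple[int, float]]:
    """Collapse per-frame labels into (type, duration in msec) for each quiet run.

    frame_ms is the assumed spacing between successive frames.
    """
    gaps = []
    for label, run in groupby(labels):
        length = len(list(run))
        if label != 0:
            gaps.append((label, length * frame_ms))
    return gaps
```

For instance, a label sequence in which three type I frames are followed, after one active frame, by two type II frames yields one gap of 30 msec and one of 20 msec at the assumed frame spacing.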
5.2.2 Vowel Formant Transitions

The temporal regions of transition to and from adjacent vowels and the stops are known to be important in stop identification [71,136,142,143]. Work of others [100,136] with synthetic speech stimuli has indicated that systematic specification of the vowel formant transitions to and from the stops, particularly for $F_2$, is possible. Recent work by Stevens et al. [58] with real speech confirms this postulate to some extent, but the work of others [62,99,137] cautions against attempts at overly precise specification of absolute formant frequencies at the apparent vowel boundaries in real speech. Speech recognition machines at present make no direct attempts to use the vowel formant transitions in identifying the stops. The use of transition information in our work was restricted by insufficient design data. Nevertheless, our limited data was well correlated with that of Stevens and House [102] in direction and magnitude.

Fig. 5-5 Effect of Context on Stop-Gap Durations. (Horizontal axis: stop-gap duration in msec; one row per stop and context.)

As Halle et al. [71] have pointed out, the vowel transitions are particularly important when the stop is in word final position because such stops are sometimes not released in ordinary, continuous speech. For our work, care was taken to ensure that the final stops were fully released. This is not an unreasonable condition for words spoken in isolation [21].

For our work, the manifestation of $F_2$ transitions was most readily evident on the HP time-interval histograms.
On this type of presentation, a feature closely related to the $F_2$ transition was the motion of the centre of gravity, $H_m$, of the TIH activity pattern. To illustrate this phenomenon, Figs. 5-6 to 5-13, inclusive, show HP-TIHs for the words "kool", "kun", "tool", "pun", "log", "lod", "sig", and "sib". These figures serve to illustrate the variety of transitions encountered. It is observed that for some words like "tool" and "sib" the transitions are quite pronounced, whereas for other words such as "pun" and "sig" the motion of $H_m$ is minimal. In passing, it should be noted that the stop-gaps of the word-final stops are clearly indicated by a null or no activity condition, as discussed in the previous section. Also shown on the figures by means of horizontal bars are the calculated positions of $H_m$ and the sum of $H_m$ and $\sigma_H$. At any instant in time, the lower of the two bars is $H_m$.

Table 5-2 gives a summary of the observed motion of $H_m$ for various stop-vowel and vowel-stop combinations. In the table, a plus sign is used to indicate that the boundary position of $H_m$ was above the $H_m$ position for the vowel nucleus. Likewise, a minus sign indicates that the boundary position of $H_m$ was below that of the vowel nucleus. The magnitude of the numbers indicates the observed change in $H_m$ from the boundary to the vowel nucleus, quantized to units given by the number of TIH channels crossed. By "boundary", we mean the effective point of vowel-like voicing onset or termination [103,58]. For example, the transition between /t/ and /u/ in "tool" as shown in Fig. 5-6 covers six rows or TIH channels.
The data is presented in this way because we are more interested in the relative magnitude and direction of the changes than in the absolute values of the changes [100]; also, it makes comparison with published formant transition data easier. For example, consider the combination /stop + u/. The large magnitudes of the $H_m$ transitions for /t/ and /d/ and the small magnitudes for /p/ and /b/ compare to the magnitudes of the $F_2$ transitions reported by Stevens and House [102]. The relatively small magnitudes of the $H_m$ transitions for the combinations of /stop + i/ are also consistent with published $F_2$ transition data. Likewise, the large negative transitions observed for /ɪp/ and /ɪb/ are also consistent with the reference data.

Table 5-2 Vowel-Stop $H_m$ Transitions

CV or VC    H_m Transition    CV or VC    H_m Transition
/ku/        +3                /gu/        +2
/pu/         0                /bu/         0
/tu/        +6                /du/        +5
/kʌ/        +2                /gʌ/        +3
/pʌ/         0                /bʌ/        -0.5
/tʌ/        +2                /dʌ/        +1
/ki/        +1                /gi/        +1
/pi/         0                /bi/         0
/ti/        +1                /di/         0
/ɔk/        +2                /ɔg/        +2
/ɔp/         0                /ɔb/        -1
/ɔt/        +1                /ɔd/        +3
/ɪk/         0                /ɪg/         0
/ɪp/        -2                /ɪb/        -4
/ɪt/        -1                /ɪd/        -1

The $H_m$ transitions from word initial /p/ and /b/ to a following vowel were found to be small and generally equal to zero on our quantized scale.
The reason for this phenomenon is that the boundary value of $F_2$, as observed on the spectrograms and Sonagraph records, was quite variable and tended to be not much different from that of the vowel nucleus; and of course, the $F_2$ of the vowel nucleus is generally different for different vowels. A physiological explanation for this effect, due to Stevens and House [58], is that in producing the bilabial consonants /p/ and /b/, the position of the tongue can be adjusted to anticipate the following vowel, and so upon release of the stop, the vocal-tract is already in approximately the appropriate configuration for the vowel.

The reference $F_2$ transition data [58] also indicates that for initial /k/ and /g/, the starting frequencies of the $F_2$ transitions to the following vowel are such that the sign of the transitions is positive, according to our convention, and that the magnitude does not vary a great deal from vowel to vowel. With respect to our data, this means that the magnitude of the $H_m$ transitions from initial /k/ and /g/ to a following vowel can be expected to have a small range, and this is, in fact, observed.
5.2.3 Aspiration Region and Voicing Onset

In English, aspiration is a useful feature for establishing the voiced/unvoiced distinction of the stops. For the VPs, aspiration is generally not present, whereas for most¹ UVPs, aspiration is present [121]. Our work considers only unvoiced stops which are aspirated, since these are the most common allophonic variants.

The aspirated region of UVPs comes after the initial transient of the stop release and after a short temporal region of frictional noise which is often present for well articulated UVPs. As Fant [73] points out, the boundary between frictional noise and aspiration is seldom clearly discernible in the acoustic signal. Instead, the frictional noise region can be said to be the temporal region where the spectral energy is loosely organized in the frequency domain, whereas for the aspirated region, the spectral energy is more highly structured in the frequency domain and is also weaker in the higher frequencies. In other words, the aspirated region has a formant-like spectral structure similar to that of a whispered vowel, whereas the frictional noise region has properties similar to an unvoiced fricative.
For our data, certain characteristics of the Z and D measurements appear to be related to properties of the frictional noise and the aspiration. For /t/, a common feature was an approximately linear decrease in D and Z values with time over an interval of about 50 to 80 msec. Inspection of sonagraphs of /t/ showed that there was a gradual decrease in the amount of frictional noise and a gradual increase in aspiration over the above time interval. Thus, under these circumstances, we expect from (4.21) that the D and Z values would decrease from high to lower values.

¹ Unvoiced stops following /s/ are usually not aspirated in English.

For /p/ and /k/, the region of frictional noise dominance was found to be shorter than for /t/. In fact, for the /p/ in the word "pool", there was very little frictional noise at all and the region between the instant of release and voicing onset was mainly aspiration. In such cases, where there was a relatively long period of aspiration (50 to 80 msec), the Z and D values remained fairly constant or decreased only slowly with time.

Because the region of aspiration resembles part of a whispered vowel, the spectral energy of the aspiration is usually concentrated in the same frequency regions as the vowel formants. However, the relative intensities of each formant region differ from that of a fully voiced vowel and the first formant region is usually much weaker in intensity. One consequence of this vowel-like structure is that the Z and D values are dependent on the following vowel.
Another consequence is that although the Z and D values are generally higher than those of the following vowel due to the weak $F_1$ region, they tend to fall in or near the same overall range as the vowel range. Both of these phenomena make context-independent detection of the aspiration difficult.

A feature related to the presence or absence of significant aspiration, and which is useful in establishing the voiced/unvoiced distinction, is the delay or latency between the instant of stop release and the beginning of vowel-like voicing onset. Clearly, if aspiration is present, this delay is longer. For our data, the delay was found to be between 50 and 110 msec for the UVPs, which agrees with the data of others [60,134,144]. With respect to the VPs, Lisker et al. [134] report that for /b/ the observed latency is virtually zero; for /d/ it is typically 5 msec and for /g/ it is typically 21 msec or less. These values agree with those found in our data.

To illustrate the nature of voicing latency phenomena, consider Fig. 5-3, which shows I, Q, D and Z time graphs for the word "gun". The Q time graph indicates that the pre-release region of /g/ is a type II quiet interval. The end of the quiet interval is signalled by an abrupt increase in the value of D. Next, we observe the rapid increase in I, which can be taken to indicate the vowel-like voicing onset and which begins approximately 20 msec after the end of the quiet interval. This can be compared with Fig. 5-4, where it is observed that for /b/, the rapid increase in I is concurrent with the end of the preceding quiet region.

For our work, a useful measure of voicing latency was the time interval from the end of the quiet region to the half amplitude point on the intensity curve for the following vowel.
This measure of voicing latency, denoted by $T_L$, yielded longer durations than the true values but was an easier measure to implement. For initial /d/, $T_L$ was found to be between 20 and 40 msec, whereas for initial /g/, $T_L$ was found to be greater than 40 msec in general. The $T_L$ measurement is not strictly a valid measurement for final stops because there is no following vowel. However, since final VPs that are fully released do have a short interval of vowel-like voicing associated with them, similar results were obtained by defining the latency to be from the end of the stop-gap to the point of maximum amplitude of the stop burst.

The VP /b/ had properties of special interest. For /b/, very little frictional noise was observed. Furthermore, instead of the region of aspiration that occurs with the unvoiced counterpart /p/, the release of /b/ was immediately followed by a vowel-like region. As discussed earlier in section 5.2.2, in the production of /b/ and /p/, the tongue position can usually be adjusted to anticipate the following vowel. As a consequence, the Z and D values following the stop release were approximately equal to those of the vowel nucleus or else started at slightly lower values than that of the vowel. To illustrate this phenomenon, consider Fig. 5-4, in which time graphs of I, Q, D, and Z for the word "bun" are depicted. In the figure, the time graph of D is seen to jump abruptly from a baseline value to the value for the following vowel immediately upon release of the stop. The value of Z is observed to start at a slightly lower value than for the vowel.

Another significant and common feature of the release of /b/ was the step-like change in the signal intensity graph. This feature is clearly evident in the I time graph of Fig. 5-4.
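The latency measure $T_L$ defined in this section can be sketched as follows. The Python below is a minimal illustration, not the implementation used in this work: it assumes that the intensity I is available as a list of per-frame values, that the index of the first frame after the quiet region is already known, and that the peak of I after that point belongs to the following vowel. The function names are hypothetical, and the three-way grouping simply restates the $T_L$ ranges reported in this section for word initial /b/, /d/ and /g/.

```python
from typing import Optional, Sequence


def voicing_latency_ms(intensity: Sequence[float],
                       quiet_end: int,
                       frame_ms: float) -> Optional[float]:
    """Time from the end of the quiet region to the half amplitude point of I.

    quiet_end is the index of the first frame after the quiet region and
    frame_ms is the spacing between frames (both supplied by the caller).
    """
    segment = list(intensity[quiet_end:])
    if not segment:
        return None
    half_amplitude = 0.5 * max(segment)   # assumes the peak is the vowel peak
    for offset, value in enumerate(segment):
        if value >= half_amplitude:
            return offset * frame_ms
    return None


def initial_voiced_stop_hint(t_l_ms: float) -> str:
    """Map T_L to the word initial voiced stop whose reported range it falls in."""
    if t_l_ms <= 20.0:
        return "b"
    if t_l_ms <= 40.0:
        return "d"
    return "g"
```

Because the half amplitude point is reached some time after true voicing onset, this measure overestimates the latency, which is consistent with the remark that $T_L$ is longer than the true value but easier to implement.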
A concomitant feature to this step-like change in I was the zero latency of voicing onset. The measurement of $T_L$ accounts for both of these effects. It was found that for word initial /b/, $T_L$ was less than or equal to 20 msec. Furthermore, by defining latency for word final /b/ in the same way as was done for /g/ and /d/, a latency duration of less than 21 msec was observed.

In summary, it can be said that the absence or presence of aspiration is an important feature for establishing the voiced/unvoiced distinction of stops. A useful measure for this purpose was the duration of voicing latency. Voicing latency was also found to be a useful feature for distinguishing the members of the VP group.

5.2.4 Stop Release and Frictional Noise Regions

(a) General Observations

The characteristics of these two regions are at present not well understood because of their variable and transient nature. A simple electrical model that has been employed with some success [24,73,145] to explain some of the properties of these two regions is a simple source-filter model similar to that used in studying fricative production (see Figs. 4-2 and 4-6). The filter is, of course, a simplified model of the vocal-tract and is time varying. For the stop release region, the excitation can be modelled as a step function of voltage, while for the frictional noise region, the source can be modelled as a Gaussian white noise source, as was done for the UVFs. The duration of the release or initial transient region has been shown [73] to be related to the filter bandwidth and is usually considered to be less than about 12 msec. Because of the narrower bandwidths involved for /k/ and /g/, the duration of the release for these two stops is generally longer than for the other stops.
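The dependence of transient duration on filter bandwidth can be made concrete with a single-resonance idealization. For a second-order resonator of bandwidth B in Hz, the envelope of the transient response decays as exp(-πBt), so the time taken to decay to a fraction r of its initial value is ln(1/r)/(πB). The Python sketch below assumes this idealization, which is a textbook simplification and not the model of [73]; the function name, the 10% decay criterion and the bandwidth values used in the example are illustrative assumptions.

```python
import math


def transient_duration_ms(bandwidth_hz: float, residual: float = 0.1) -> float:
    """Time in msec for a single-resonance envelope to decay to `residual`.

    Assumes an envelope of the form exp(-pi * B * t), where B is the
    resonance bandwidth in Hz.
    """
    if bandwidth_hz <= 0 or not 0 < residual < 1:
        raise ValueError("bandwidth must be positive and residual in (0, 1)")
    return 1000.0 * math.log(1.0 / residual) / (math.pi * bandwidth_hz)


# Illustrative bandwidths only: the narrower resonance rings for longer.
wide = transient_duration_ms(100.0)    # about 7.3 msec
narrow = transient_duration_ms(60.0)   # about 12.2 msec
```

Under this idealization, halving the bandwidth doubles the duration of the transient, which is the qualitative relation invoked above for /k/ and /g/.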
In speech analysis, the stop release region is usually not well handled by most conventional speech spectrum analyzers because of its short duration. That is, conventional spectrum analyzers can neither resolve the transient nor distinguish it from a spectrally shaped noise excitation. In the best known work on spectral analysis of stops [71], a time window of 20 msec was used for the analysis, whereas the stop release is usually considered to be shorter than this. Thus, most often, the data obtained from conventional spectral analysis relates to the frictional noise and/or aspiration regions. The time-domain measurements investigated in this thesis had better temporal resolution and, in addition, permitted distinctions to be made between transient signals and noise-like signals.

For example, in our data we found that in most occurrences of /b/ there was very little, if any, frictional noise. As discussed in the previous section, this situation was clearly reflected in the Z and D time graphs. Furthermore, the transient nature of the release of /b/ was clearly reflected by the abrupt rise in signal intensity, as shown in Fig. 5-4.

(b) Unvoiced Stops

Since the properties of the temporal region following the stop release were simpler for the UVPs than for the VPs, we shall discuss the former first. Amongst all the stop consonants, the frictional noise region was found to be most consistent for the UVPs, especially /t/. Since the vocal-tract configurations for /t/ and the UVF /s/ are similar, we would expect the values of Z and D for /t/ to approach those of /s/. This can be established by means of Fig. 5-14, which shows peak values of Z and D for the overall temporal region of the UVPs.
The peak values of Z and D can be associated with the frictional noise if it is present and dominant; otherwise they are associated with the aspiration region. Comparison of Fig. 5-14 with Fig. 4-1 shows that the Z-D values for /t/ do approach those for /s/. The Z-D values for /p/ and /k/ are seen to exhibit considerable variations with vowel context.

[Fig. 5-14 Peak Values of Z and D for the Unvoiced Stops. Horizontal axis: Z (crossings/10 ms); the D = Z locus is shown.]

This finding corroborates that of others [71,137,100] in the following sense. Cooper et al. [100], using synthetic speech stimuli, have shown that a high noise-burst frequency (> 2.8 kHz) is invariably associated with /t/ in perceptual tests. They found that the optimum synthetic noise-burst frequency for /k/ was strongly dependent upon the following vowel and particularly on the second formant of the vowel. Their data in fact show that the optimum noise frequency was slightly above the vowel $F_2$. This phenomenon is reflected in our Z-D data. For example, for the vowels /i/ and /u/, the $F_2$ values are high and low, respectively. From Fig. 5-14, it is seen that /k/ preceding /i/ has much larger Z and D values than the /k/ preceding /u/. Cooper et al. also found the optimum burst frequency for /p/ to be quite variable but always below that of /t/ in comparable environments. This phenomenon is also reflected in the Z and D values shown in Fig. 5-14. Another interesting result in their data is that preceding the vowel /i/, all the UVPs have relatively high noise-burst frequencies, with that of /k/ only slightly below that of /t/. Inspection of Fig. 5-14 shows a similar effect.
With respect to UVPs in real speech, Fant [25] makes the following observations about the distribution of spectral energy in the noise bursts for /k/ and /p/. For /k/, the spectral energy is more concentrated and positioned in the regions of the vowel formants which are above $F_{?}$. The /p/ noise is weaker than that of /k/, is more spread out in the frequency dimension, and the spectrum has a centre of gravity lower than that of /t/. Fant's observations also concur with those made on our data. The diffuse spectral energy distribution for /p/ shall be discussed further later.

To better illustrate the contextual dependencies of the Z-D measurements for the UVPs, a plot of D versus context is shown in Fig. 5-15. A similar figure results for Z versus context. It is observed that the D values for all three UVPs are to some extent context-dependent. It is this variability with context and the resultant range overlap that makes machine recognition of the UVPs difficult.

In sections 4.2.2 and 4.2.3, we showed how Z and D estimates for the UVFs could be calculated from the measured spectra. A similar study was attempted for the frictional noise region of the UVPs. In Fig. 5-16, the points enclosed in squares are estimates of Z and D calculated from the measured spectra using (4.1) and (4.2).
The point of spectral measurement was manually selected in the temporal region immediately following the release region. The non-enclosed points on Fig. 5-16 are the same as the measured points shown on Fig. 5-14. Comparison of measured and estimated points shows that the agreement for /p/ and /k/ is reasonably good. For /t/, it is observed that the calculated estimates are all higher than the measured values and are generally even closer to the /s/ values than the measured values.

[Fig. 5-15 Contextual Dependency of D for Unvoiced Stops.]

[Fig. 5-16 Comparison of Measured Values and Estimates of Z and D for the Frictional Noise Region of the Unvoiced Stops. Horizontal axis: Z (crossings/10 ms); the D = Z locus is shown.]

A possible explanation for this discrepancy is the following. The frictional noise region, as we discussed earlier, is generally short in duration and overlaps the region of aspiration. In measuring Z and D, an average value over a 10 msec time window was determined. On the other hand, the effective time window used in calculating the estimates was of the order of 5 msec and corresponds to the smoothing time constant of the high-frequency channels of the filter-bank spectrum analyzer. Now, since the Z and D values of the aspirated region are lower than for any frictional noise region, it follows that averaging over a longer time window yields values of Z and D less than the true peak values.
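The window-length argument above can be illustrated numerically. The following is only a sketch under stated assumptions, not the measurement procedure of the thesis: the 40 kHz sampling rate, the two sine tones standing in for a short frictional burst and the following lower-frequency region, and the helper names `tone` and `crossings_per_10ms` are all hypothetical. Only Z is computed; D is defined in an earlier chapter and is not reproduced here.

```python
import math

RATE_HZ = 40_000  # assumed sampling rate, for this illustration only


def tone(freq_hz, dur_s, phase=0.1):
    """Sine segment; the small phase offset keeps samples off exact zeros."""
    n = int(round(dur_s * RATE_HZ))
    return [math.sin(2 * math.pi * freq_hz * k / RATE_HZ + phase) for k in range(n)]


def crossings_per_10ms(samples, window_s):
    """Count sign changes in the first window_s seconds, scaled to a 10 ms rate."""
    seg = samples[: int(round(window_s * RATE_HZ))]
    count = sum(1 for a, b in zip(seg, seg[1:]) if (a >= 0) != (b >= 0))
    return count * (0.010 / window_s)


# A 5 ms high-frequency burst (4 kHz) followed by 15 ms of a lower-frequency
# signal (1 kHz), standing in for frictional noise followed by aspiration.
signal = tone(4000, 0.005) + tone(1000, 0.015)

z_short = crossings_per_10ms(signal, 0.005)  # window covers the burst only
z_long = crossings_per_10ms(signal, 0.010)   # window also spans the following region

print(f"Z over a 5 ms window:  {z_short:.0f} crossings/10 ms")
print(f"Z over a 10 ms window: {z_long:.0f} crossings/10 ms")
```

With these synthetic inputs the longer window reports a markedly lower rate than the shorter one, which is the direction of the discrepancy described for /t/.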
Another factor which may contribute to the discrepancy is the high-frequency noise enhancement properties of (4.1) and (4.2). This latter effect has been discussed more fully in section 4.2.3.

Earlier, we mentioned Fant's observation that the noise spectral distribution for /p/ was more diffusely spread in the frequency domain than that of /k/, or conversely, that the spectral distribution for /k/ was more compact. Halle et al. [71] report a similar effect. In our measurement data, a suitable indicator of this effect was the spread of the HP-TIH activity pattern. Figs. 5-7 and 5-8 show HP-TIHs for the words "kun" and "pun". The stop release is at the leftmost part of the display. It is seen from these figures that the TIH activity pattern is more spread out for the /p/ burst than for the /k/ burst. These patterns indicate that the spectral distribution for /k/ was more compact than for /p/; this was confirmed from Sonagraph records and from spectrograms obtained from our filter-bank spectrum analyzer. A similar effect was observed for other cases of /p/ and /k/.

A useful measure of the above spread of the HP-TIH activity pattern was the standard deviation of the pattern, which we denote by $\sigma_H$. Fig. 5-17 shows a plot of $H_m$, the centre of gravity of the HP-TIH pattern, versus $\sigma_H$. The units of $H_m$ correspond to the physical TIH channels and the units of $\sigma_H$ correspond to logical channel units. The calculated values of $H_m$ and $\sigma_H$ were not restricted to integer values. Two interesting observations can be made about the resultant graph. One is the high position of /t/ and the other is the compactness of /k/ as indicated by low values of $\sigma_H$.

(c) The Voiced Stops

The preceding descriptions and discussion related mainly to the properties of the UVPs. We now turn to consideration of the VPs.
The VPs were considerably more difficult to study than the UVPs because of the short combined duration of the release and frictional noise regions, typically less than 30 msec, and concomitantly, the rapidity of voicing onset following the stop release. The voicing component often tended to mask the frictional noise component, partly because of the intrinsically greater intensity of the voicing component and partly because of the weakness of the frictional noise associated with the VPs [71]. The net result of the above effects was a more transient and variable acoustic signal for the VPs compared to the UVPs.

[Fig. 5-17 Measured Values of $H_m$ and $\sigma_H$ for the Frictional Noise Region of Unvoiced Stops. Horizontal axis: standard deviation of H (no. of channels).]

In the absence of a large raw data base, our work was guided to a large extent by the findings of others [24,59,71,73,144,146]. Halle et al. [71] state that the VPs have less frictional energy than the unvoiced counterparts and, furthermore, that /b/ and VPs in word-final position have the least amount of frictional energy. Stevens [59] suggests that the temporal pattern of the shift in frequency region of spectral energy concentration following the stop release is an important feature in distinguishing the different VPs. Lisker et al.
[134] present data which show that the voicing latency, although small, is greatest for /g/ and least for /b/.

These findings can be related in the following way. First, consider the VP /b/. The very small voicing latency associated with /b/ implies that a vowel-like voiced region immediately follows the stop release. This directly concurs with the statement by Halle et al. that /b/ has nearly negligible frictional noise. Similarly, Stevens' finding of an upward frequency shift in the spectral energy can be associated with the transition from a state of complete occlusion of the vocal tract prior to release to a more open one for the following vowel [131].

Next, consider /d/ and /g/. For /d/, Stevens found a downward frequency shift in the spectral energy with time. This phenomenon can be attributed to the greater delay of voicing onset for /d/ compared to /b/ [134] and the expected presence of some /s/-like frictional noise [146]. For /g/, Stevens reports that in the region following the release, the spectral energy is concentrated in a narrow band contiguous to that of the second formant of the following vowel-like region. Consequently, he states that there is little, if any, shift in the spectral time pattern. The region of noise energy concentration is similar to the /k/-like noise reported by Hoffman [146] and the degree of voicing latency is consistent with that reported by Lisker et al. [134].
Fant's work on models of VP production [73] is also relevant to the discussion. For /b/, Fant suggests that a suitable source model for the release region is a step-modulated pulse train, whereas for /g/ and /d/, he suggests a gated white-noise source. These source models are consistent with the acoustic phenomena described above.

The phenomena described above were reflected in our time-domain measurement data. The short frictional noise region of /g/ and /d/ appeared as short, small-amplitude peaks in the Z and D measurements. Unfortunately, the peak values of Z and D were similar for both /g/ and /d/ and varied considerably according to contextual environment. Also, the ranges of the peak values tended to overlap that of the vowels. The latter effect is attributable to the masking effect of the vowel-like voicing, which was discussed earlier. In addition, this masking effect was greater for /d/ than for /g/ because of the shorter voicing-onset latency for /d/, and thus caused the peak Z and D values to be lower than for the noise-only case.

A more useful quantitative indicator of the presence of frictional noise for word-initial /g/ and /d/ was derived from the UF time-interval histograms, which are less susceptible to the masking effect of the voicing component. For word-initial /g/ and /d/, it was found that during the frictional noise region the condition

$U_m + \sigma_U < 13$    (5.12)

prevailed, where $U_m$ is the centre of gravity of the UF-TIH activity pattern and $\sigma_U$ is the standard deviation of the activity pattern.

Another quantitative measurement related to the release and frictional noise regions of the VPs was the measure of voicing-onset latency, which has been previously described in section 5.2.3.
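As an illustration of condition (5.12), the centre of gravity and standard deviation of a histogram activity pattern can be computed as count-weighted moments over the channel index. This is a minimal sketch under assumptions that are not taken from the thesis: channels are numbered from 1, the pattern is a plain list of per-channel counts, and the function names `pattern_stats` and `satisfies_5_12` are hypothetical.

```python
import math


def pattern_stats(counts):
    """Return (centre of gravity, standard deviation) of a channel histogram.

    counts[i] is the activity in channel i + 1 (1-based numbering is an
    assumption made for this sketch).
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty activity pattern")
    centre = sum(ch * c for ch, c in enumerate(counts, start=1)) / total
    variance = sum(c * (ch - centre) ** 2 for ch, c in enumerate(counts, start=1)) / total
    return centre, math.sqrt(variance)


def satisfies_5_12(counts, limit=13.0):
    """Check U_m + sigma_U < limit, the form of condition (5.12)."""
    u_m, sigma_u = pattern_stats(counts)
    return u_m + sigma_u < limit


# Hypothetical 16-channel patterns: activity centred on channel 9 and on channel 13.
low_pattern = [0] * 16
low_pattern[7] = low_pattern[9] = 4     # channels 8 and 10
high_pattern = [0] * 16
high_pattern[11] = high_pattern[13] = 4  # channels 12 and 14

print(pattern_stats(low_pattern), satisfies_5_12(low_pattern))
print(pattern_stats(high_pattern), satisfies_5_12(high_pattern))
```

The same two moments, applied to the HP-TIH pattern, would give quantities of the kind denoted $H_m$ and $\sigma_H$ in the text.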
For the purpose of context-independent characterization of the release and frictional noise regions, the values of the voicing latency were the most consistent features found.

The stop releases for the initial VPs were found to be more strongly articulated than those in word-final position. The consequence of this was that the releases of word-final VPs exhibited less evidence of frictional energy than did the initial stop releases. This observation agrees with the statement of Halle et al. [71] concerning the variation of frictional noise intensity with position.

In the course of our work, some limited perceptual experiments involving deletion of parts of the voiced stop were performed under computer control. These experiments indicated that the stop release and frictional noise regions were very crucial to the correct perception of word-initial VPs and less important for the correct perception of final VPs. For the word-final VPs, the transition region from the preceding vowel appeared to be of primary importance. This latter phenomenon is well known [100,120], since stops in final position are sometimes not released in ordinary speech.

5.3 Machine Isolation and Classification of Unvoiced Stops

5.3.1 Isolation of Unvoiced Stops

The isolation process essentially involves affiliation of certain temporal regions of the acoustic signal with a phoneme group type, such as UVP. For our work, candidate stop regions were those consonantal segments that were transmitted through the hierarchy of previous phonemic decomposition operations which yielded the vowels and fricatives. As we have discussed earlier, the candidate regions do not delimit exact phoneme boundaries, since these do not in general exist.
Rather, the candidate regions contain natural phonetic segments which hopefully one can associate with a unique phoneme or phoneme group.

Fig. 5-18 shows a flow chart of the procedure for isolation and classification of the UVPs. An important function of the isolation process is the extraction of suitable features that can be used for final classification. Unfortunately, time did not permit complete implementation of the feature extraction process and the final classification stage. However, good results were obtained with the initial part of the isolation process and this is reported here.

[Fig. 5-18 Flowchart for Isolation and Classification of Unvoiced Stops.]

As discussed in section 5.1, a stop-gap or quiet interval preceding release is a necessary feature of a stop. In medial position, it is known [147,148] that the presence of a moderately long silent interval is a necessary and sufficient condition for identifying the stops as a phoneme group type. This is also true of final released stops [21]. In our study, all word-final stops were released and consequently, the silent interval preceding the release was employed as a primary feature in isolating word-final UVPs.

Let $T_S$ denote the duration of the stop-gap for final UVPs. $T_S$ was determined by means of the algorithm described in section 5.2.1. In evaluating $T_S$, any region that appeared to be spurious, was not greater than 20 msec in duration, and was surrounded by quiet regions was treated as part of the overall quiet region.

Let $T_F$ denote the fall time of the signal envelope curve, I, at the vowel stop-gap boundary.
For our work, $T_F$ was defined to be the time required for I to fall from $0.5\,I_{max}$ to $0.1\,I_{max}$, where $I_{max}$ is the peak envelope amplitude for the entire word. An abrupt drop in envelope amplitude to a quiet region indicated abrupt termination of a preceding vowel and rapid transition to the occluded state of the vocal tract. With reference to the production model, the rapid drop in intensity can be regarded as a negative step-function excitation, in the same way that the release of /b/ can be associated with a positive step-function excitation.

The algorithm for isolating final UVPs was context-independent and can be described as follows. Let the following binary features be defined.

$\eta_1 = 1$, iff [Type I quiet interval present]    (5.13)
$\eta_2 = 1$, iff [40 msec $\le T_S <$ 250 msec]    (5.14)
$\eta_3 = 1$, iff [$T_B \le$ 150 msec]    (5.15)
$\eta_4 = 1$, iff [$T_F <$ 50 msec]    (5.16)

$T_B$ is the duration of the word-final stop-burst, as measured from the end of the preceding stop-gap to the beginning of the end-of-word silence. Let $P_1$ be a Boolean function defined as

$P_1 = \eta_1 \cdot \eta_2 \cdot (\eta_3 + \eta_4)$    (5.17)

A candidate consonantal region in final position was declared to be a final UVP if $P_1 = 1$. All the samples of final UVPs were successfully detected by the above algorithm. For most of our data, both of the features, $\eta_3 = 1$ and $\eta_4 = 1$, were satisfied. Exceptions occurred with the UVPs which followed /s/. However, in these cases, the feature $\eta_3 = 1$ was still valid. The feature $\eta_4 = 1$ is generally a more powerful feature than the feature $\eta_3 = 1$, and would be useful in distinguishing final, non-released UVPs from some VFs such as /v/ and also from natural vowel endings.

The initial UVPs were considerably more difficult to isolate than the final UVPs.
Although the presence of a preceding silence is still a necessary feature for the initial UVPs, the duration measurement, $T_S$, is no longer meaningful. It was not possible to isolate all the initial UVPs using only context-independent features. Nevertheless, it was of interest to see what could be done with simple context-independent features. For this purpose, let the following context-independent binary features be defined.

$\eta_5 = 1$, iff [preceding Type I quiet interval]    (5.18)
$\eta_6 = 1$, iff [$I_L < 0.25\,I_{max}$]    (5.19)
$\eta_7 = 1$, iff [$T_L >$ 60 msec]    (5.20)
$\eta_8 = 1$, iff [$Z \ge 30$]    (5.21)
$\eta_9 = 1$, iff [$D \ge 60$]    (5.22)
$\eta_{10} = 1$, iff [$U_m < 11$]    (5.23)
$\eta_{11} = 1$, iff [$T_n <$ 120 msec]    (5.24)

The parameter $T_L$ is the machine measure of voicing latency described in section 5.2.3. The feature $\eta_7 = 1$ served to distinguish initial UVPs from the initial VPs. The parameter $T_n$ is a measurement of frictional noise duration and is defined by (4.8). The feature $\eta_{11} = 1$ merely repeats the UVP/UVF distinction feature. Let $P_2$ be a Boolean function defined as

$P_2 = \eta_5 \cdot \eta_6 \cdot \eta_7 \cdot \eta_8 \cdot (\eta_9 + \eta_{10}) \cdot \eta_{11}$    (5.25)

If the condition $P_2 = 1$ was satisfied, the consonantal region was declared to be an initial UVP. This algorithm detected all the initial samples of /t/ and about half the word-initial samples of /p/ and /k/. The failure of the algorithm to detect the remaining samples of /p/ and /k/ was due to the fact that one of the conditions, $\eta_8 = 1$ or $\eta_9 + \eta_{10} = 1$, was not satisfied. This problem was due to the weakness of the frictional noise and the vowel-like properties of the aspiration region, which have been discussed in sections 5.2.3 and 5.2.4. This problem was resolved by the use of context-dependent features.
Consider the following binary features:

$\eta_{12} = 1$, iff $(D > 40) \cdot (\text{vowel} \ne /i/)$    (5.26)
$\eta_{13} = 1$, iff $(Z > 20 + U_m < 15) \cdot (\text{vowel} = /u/)$    (5.27)
$\eta_{14} = 1$, iff $(Z > 30 + U_m < 13) \cdot (\text{vowel} = /ʌ/)$    (5.28)

Let $P_3$ be a Boolean function defined by

$P_3 = \eta_5 \cdot \eta_6 \cdot \eta_7 \cdot \eta_{11} \cdot \eta_{12}$    (5.29)

The condition $P_3 = 1$ resulted in the detection of all the remaining initial UVPs. A stronger condition for this purpose was

$P_3 \cdot (\eta_{13} + \eta_{14}) = 1$    (5.30)

While (5.30) represents a more powerful algorithm for our data, it is less general, since the features $\eta_{13}$ and $\eta_{14}$ are strongly context-dependent.

5.3.2 Classification of the Unvoiced Stops

Time limitations and the complexity of the stop characteristics prevented development and testing of complete algorithms for classifying all the UVPs. A simple algorithm was, however, developed for identifying /t/, which was the easiest of the three UVPs to classify and which could be classified with a context-independent algorithm. The following binary features were designed for this purpose:

$\eta_{15} = 1$, iff [$Z > 50$]    (5.31)
$\eta_{16} = 1$, iff [$D > 70$]    (5.32)
$\eta_{17} = 1$, iff [$T <$ 100 msec]    (5.33)

Let $P_4$ be the Boolean function

$P_4 = \eta_{15} \cdot \eta_{16} \cdot \eta_{17}$    (5.34)

A candidate UVP was declared to be /t/ if $P_4 = 1$. This algorithm correctly identified all initial and final samples of /t/. However, the sample of /p/ preceding the vowel /i/ was misclassified as /t/. Without the feature $\eta_{17}$, the sample of /k/ preceding /i/ was also misclassified as /t/.

For /p/ and /k/, the word-final allophonic variants appear to be easier to differentiate than the word-initial ones because of less contextual dependency of the properties of the stop-burst on the adjacent vowel and also because of a measurable stop-gap duration.
The analysis data presented in section 5.2 suggest that a suitable algorithm for classifying word-final /p/ and /k/ could be developed using the measurements of $T_S$, $\sigma_H$ and the $H_m$ transitions from a preceding vowel. Fig. 5-5, for example, shows that $T_S$ is consistently longer for /p/ than /k/ in the same contextual environment. Fig. 5-17 shows that $\sigma_H$ for /k/ is generally less than that of /p/. Table 5-2 suggests that if the sign of the $H_m$ transition from a preceding vowel is positive, then the candidate UVP is most likely /k/ rather than /p/. Cooper et al. [100] have similarly suggested the use of the sign of the transition in classifying stops.

From the analysis data, it is evident that classification of initial /p/ and /k/ is more difficult than the final ones because of the greater contextual dependency and because of the absence of a valid stop-gap duration measurement. The analysis data indicate that the measurement $\sigma_H$, the $H_m$ transition to the following vowel and knowledge of the following vowel would be very useful in classifying the initial UVPs. The analysis data further indicate that UVPs preceding the vowel /i/ should be treated separately from the other initial UVPs because of the acoustic similarity of the initial UVPs before /i/. This, of course, means that knowledge of the adjacent vowel would be required as an input feature to the UVP classification algorithm.

5.4 Machine Isolation and Classification of Voiced Stops

5.4.1 Isolation of the Voiced Stops

The voiced stops were more difficult phonemes to isolate than the unvoiced cognates. It appears that rather subtle features are required to isolate and classify the VPs.
The candidate acoustic regions treated were those not previously declared to be a vowel, fricative or unvoiced stop. Fig. 5-19 shows a general flow chart of the procedure for isolation and classification of the VPs. Time did not permit implementation of the feature extraction and classification stages.

[Fig. 5-19 Flow Chart for Isolation and Classification of Voiced Stops.]

In this section, we describe the work on the initial part of the isolation process. We shall consider the word-final VPs first. As discussed earlier in section 5.1, the presence of a quiet interval preceding the stop release is a necessary feature of both voiced and unvoiced stops. The presence of a type II quiet interval is a sufficient feature for a VP but is not a necessary feature. In our data, most of the VPs had a preceding type II quiet interval except for some initial voiced stops.

Let $T'_S$ and $T'_B$ be the duration parameters corresponding to $T_S$ and $T_B$ for the UVPs, respectively, except that they are defined in terms of the type II quiet interval. Let $T_F$ be the envelope fall time previously defined in section 5.2.1. The following binary features were defined in a similar manner to (5.13) through to (5.18).

$\mu_1 = 1$, iff [type II quiet interval present]    (5.35)
$\mu_2 = 1$, iff [50 $\le T'_S \le$ 200 msec]    (5.36)
$\mu_3 = 1$, iff [$T'_B \le$ 170 msec]    (5.37)
$\mu_4 = 1$, iff [$T_F \le$ 50 msec]    (5.38)

Let $P_5$ be the Boolean function

$P_5 = \mu_1 \cdot \mu_2 \cdot (\mu_3 + \mu_4)$    (5.39)

A consonantal candidate region in final position was declared to be a final VP if $P_5 = 1$. All the samples of word-final VPs were successfully isolated by the above algorithm.
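The final-VP rule above can be sketched as a small decision function. This is a hedged illustration only. The thresholds follow (5.35) to (5.38), but the inequality directions and the combination of the four features in (5.39) are read here as "type II quiet interval AND stop-gap in range AND (short burst OR fast envelope fall)", which is an assumption about the source. The helper `fall_time_ms`, its search from the envelope peak, the frame period, and all function and argument names are likewise hypothetical.

```python
def fall_time_ms(envelope, frame_ms):
    """Time for the envelope to fall from 0.5*I_max to 0.1*I_max after its peak.

    envelope is a list of envelope amplitudes, one per frame. Searching from
    the global peak is a simplification assumed for this sketch. Returns None
    if the envelope never falls to the required levels.
    """
    i_max = max(envelope)
    peak = envelope.index(i_max)
    start = next((k for k in range(peak, len(envelope))
                  if envelope[k] <= 0.5 * i_max), None)
    if start is None:
        return None
    end = next((k for k in range(start, len(envelope))
                if envelope[k] <= 0.1 * i_max), None)
    if end is None:
        return None
    return (end - start) * frame_ms


def is_final_voiced_stop(type2_quiet, gap_ms, burst_ms, fall_ms):
    """Assumed reading of (5.39): mu1 AND mu2 AND (mu3 OR mu4)."""
    mu1 = bool(type2_quiet)                       # (5.35)
    mu2 = 50 <= gap_ms <= 200                     # (5.36), T'_S
    mu3 = burst_ms <= 170                         # (5.37), T'_B
    mu4 = fall_ms is not None and fall_ms <= 50   # (5.38), T_F
    return mu1 and mu2 and (mu3 or mu4)


# Hypothetical envelope sampled every 10 ms: a vowel followed by an abrupt drop.
env = [1.0, 1.0, 0.8, 0.4, 0.05, 0.0]
t_f = fall_time_ms(env, frame_ms=10)
print(t_f, is_final_voiced_stop(True, gap_ms=80, burst_ms=60, fall_ms=t_f))
```

The final-UVP rule of (5.13) to (5.17) has the same shape with its own thresholds, so the same function structure would apply with different constants.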
As constructed, the above algorithm is dependent upon the final VPs being released. For non-released final VPs, a different algorithm, making use of [...] and perhaps the presence of strong transitions, would be required.

The algorithm developed for isolating the initial VPs is shown in Fig. 5-20. The algorithm essentially implements sufficiency conditions corresponding to:

1) initial VP preceded by type II silence
2) initial /g/ and /d/ preceded by type I silence
3) initial /b/ preceded by type I silence.

This algorithm correctly isolated all samples of initial VPs except one initial /g/ preceding the vowel /i/. However, a second sample of the same word was correctly treated.

[Fig. 5-20 Flowchart of Algorithm for Isolating Initial VPs.]

5.4.2 Classification of the Voiced Stops

Algorithms for classification of the VPs were not implemented. The analysis data of section 5.2 indicate, however, that the VPs, particularly in word-initial position, are very difficult phonemes to characterize acoustically and consequently would be difficult to classify. Fig. 5-5 suggests that the stop-gap duration might be useful in separating final /b/ from /g/ and /d/. Table 5-2 indicates that the $H_m$ transitions are potentially very useful features for classifying the VPs. This latter feature corresponds to the use of transitions in classifying the stops, which has been proposed by several workers [19,100]. The measure of voicing latency, which was described in section 5.2.3, appears to be a very useful feature for distinguishing the VPs from each other as well as from the UVPs.
CHAPTER VI

FURTHER DISCUSSION AND CONCLUSIONS

6.1 Acoustic Complexity of Phonemes

Speech signals can be regarded as an encoding of the linguistically discrete symbols known as phonemes [120]. Communication theory shows that efficient source coding results when the lengths of the code words are inversely related to the probability of occurrence of the source symbols, ideally in proportion to the negative logarithm of that probability [149,150]. As applied to speech, this condition suggests that for the more frequently occurring phonemes, fewer features should be required to adequately specify the acoustic properties, including contextual variations and temporal structure. This hypothesis, in turn, suggests that the more probable phonemes should be more distinct and less complex than the less probable ones.

The characterization data presented in the preceding chapters provide some evidence of this code construction. For example, the unvoiced stop /t/, which is the most frequently occurring member of its articulatory group, is also the most distinct in terms of acoustical properties. Similarly, amongst the voiced and unvoiced fricatives, /s/ and /z/, which are the more frequently occurring group members, are observed to be acoustically less variable than other group members.

Related to the above concept of complexity ranking of the phonemes is the observation of non-homogeneous acoustic properties. For example, the presence of frictional noise appears to be a more important feature in characterizing the voiced fricative /z/ than the voiced fricative /v/. Likewise, the initial release transient in the voiced stops has been found to differ markedly for /b/ compared to /g/ and /d/. Such non-homogeneous features can be considered to be commensurate with variations in acoustic complexity.
In applying the results of this thesis to problems in spoken word recognition, simplification of the recognition system results if some freedom exists in the selection of the vocabulary. That is, the vocabulary should be chosen to take advantage of the form of signal representation and knowledge of the acoustic properties of the constituent phonemes. One useful policy in the design of the vocabulary would be to select words composed of the more distinct and easily characterized phonemes. Clearly, not all combinations of such phonemes form acceptable English words. However, since the acoustically simpler phonemes tend to be the more common, a greater number of acceptable words can be formed than is possible with less common phonemes.

6.2 Conclusions

As a form of speech signal representation, the time-domain measurements studied here are concluded to have a number of useful properties for speech analysis and recognition. The algorithms described in Chapter II are easily implemented. The absence of energy-partitioning filters enhances the dynamic range capability and allows convenient analysis of acoustically weak phonemes such as /f/. The absence of narrow-band filters allows more effective tracking of rapid transient phenomena, such as the release of unvoiced stops, than is possible with conventional spectral measurements.
This latter advantage, combined with the simpler and more direct characterization of sustainable phonemes, indicates that the implicit data reduction involved in the signal representation is not detrimental.

The quantitative data presented in this thesis supports our experimental and theoretical study of the properties of the time-domain measurements and their utility in characterizing the acoustic properties of selected phonemes. In addition, the data is of interest as a different form of acoustic specification and has applications to recognition work. The usefulness of our data is further extended through the interpretation of the theoretical relationships between the time-domain and spectral representations. Spectral data has also been used as a control to establish that our data was not abnormal.

Theoretical relationships were developed between the zero-crossing rate measurements and the frequency spectrum for those phonemes whose acoustic properties are well suited to spectral analysis. Comparison of direct measurements with theoretical estimates calculated from the experimentally measured spectrum for the nuclei of vowels and unvoiced fricatives shows good agreement and validates the assumptions made.
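As an illustration of how zero-crossing rates can be estimated from a measured spectrum, the sketch below uses Rice's formula for a stationary Gaussian signal. This is a standard result and is offered here as an assumption; the derivation used in the body of the thesis is not reproduced in this chapter and may differ in detail. The function name and the band-power input format are hypothetical.

```python
import math


def rice_zero_crossing_rates(freqs_hz, power):
    """Estimate Z and D (crossings per second) from a sampled power spectrum.

    Assumes a stationary Gaussian signal (Rice's formula). `power[i]` is the
    power in a band centred on `freqs_hz[i]`.
    """
    m0 = sum(power)                                          # total power
    m2 = sum(p * f ** 2 for f, p in zip(freqs_hz, power))    # second moment
    m4 = sum(p * f ** 4 for f, p in zip(freqs_hz, power))    # fourth moment
    z = 2.0 * math.sqrt(m2 / m0)   # zero-crossing rate of the signal
    d = 2.0 * math.sqrt(m4 / m2)   # zero-crossing rate of its derivative
    return z, d


# A single spectral line at 1 kHz gives 2000 crossings per second for both.
z_line, d_line = rice_zero_crossing_rates([1000.0], [1.0])

# Two equal-power lines at 500 Hz and 1500 Hz (hypothetical values).
z_pair, d_pair = rice_zero_crossing_rates([500.0, 1500.0], [1.0, 1.0])
```

For the two-line case, Z comes out near 2236 crossings per second, which corresponds to neither line: under this model Z and D reflect moments of the whole spectrum rather than the positions of individual spectral peaks.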
For the vowels, the theoretical results explain why the Z and D measurements do not provide good estimates of the first and second formant frequencies.

The effectiveness of the time-domain measurements in characterizing the vowels and unvoiced fricatives, which are phonemes suited to spectral analysis, is demonstrated in several ways. Firstly, the measurements suggest that relatively simple empirical features can be designed to characterize these phonemes. Such features are demonstrated in the work on identification of vowels and unvoiced fricatives. Secondly, the theoretical relationships show that features known to be important in the spectral characterization of vowels and unvoiced fricatives are accounted for. Extraction of these important spectral features, such as formant frequencies for the vowels and noise cut-off frequency for the unvoiced fricatives, is difficult to implement. Part of this difficulty is due to the loss of information resulting from arbitrary spectrum sampling. Thus, the time-domain measurements lead to an economical and more direct characterization of the vowels and unvoiced fricatives.

Variations in the acoustic properties of vowels, fricatives, and stops due to position and contextual environment were studied. For the unvoiced fricatives, the properties of the fricative nucleus did vary with context and position, but the design of suitable features for categorical identification was found to be possible. The Z and D measurements yielded particularly simple features for characterization. The nuclei of vowels exhibited a relatively greater variation in acoustic properties with context than the unvoiced fricatives.
The effects of context were more pronounced for the Z-D measurements than for the zero-crossing distance measurements of the vowel signals subjected to broadband prefiltering. Using relatively simple feature designs, vowel recognition performance comparable to that achieved by others using spectral features was obtained. For more general vowel recognition work, compensation for contextual bias may be helpful.

The acoustic properties of the release of the unvoiced stops were found to be quite context dependent, particularly for /p/ and /k/, as has been noted by others. Duration of the stop-gap was also observed to be context dependent. Context dependent transitions, similar to formant transitions, were observed between vowel nuclei and the stop-gap or release. While categorical identification of /t/ appears to be possible, identification of /k/ and /p/ most likely requires knowledge of the context.

Voiced fricatives in word-initial position were observed to be more stable with context than the non-initial allophones. Features subject to contextual variation included duration and intensity of the frictional noise as well as the temporal organization of the signal. For the voiced stops the most consistent contextual effects were voicing transition features in the zero-crossing distance measurements, which were similar to the well-known formant transitions of spectral analysis.
Temporal structures studied included the transition regions between the vowels and consonants and the rapid transients following stop release. Features similar to formant transitions were observed in the zero-crossing distance measurements. For the study of the complex, vowel-voiced fricative transitions, a useful feature was the ratio of the D and Z measurements. The rapid transients following stop release were well tracked. The Z-D measurements and the intensity measurement, in particular, yielded simple features.

Partial machine recognition of the phonemes studied was attempted to demonstrate the implementation of characterizations based on the time-domain form of signal representation. The unvoiced fricatives were easily identified. Recognition performance with vowels compares with that achieved by others with more cumbersome formant feature extraction. The work on voiced fricatives points out the non-homogeneous character of phonemes grouped according to a simple articulatory model. A brief examination of the source coding aspect of the speech symbols suggests that the phonemes are efficiently encoded in terms of acoustic features.
In summary, the time-domain measurements studied in this thesis are concluded to be an effective form of speech signal representation and provide a useful alternative to the conventional spectral representation. The analysis work shows that the data can be interpreted in terms of spectral data and in terms of production models. The limited work on recognition indicates that phonetic characterizations in terms of the time-domain representation can be developed which are suitable for machine recognition applications.

6.3 Suggestions for Further Work

Much more analysis and recognition work is possible. One direction is relaxation of some of the constraints imposed on the present work. Thus, other phonemes, other contextual environments, and additional speakers could be considered. Another direction is the application of the data to the design of some limited word-recognition systems. Finally, the source encoding concept could be further explored. That is, it appears that fewer features are required to acoustically specify the more frequently occurring phonemes.
More complete knowledge of such ranking of acoustic complexity would be very useful in vocabulary design for machine recognition.

APPENDICES

APPENDIX A
SPEECH DATA PROCESSING FACILITY

A.1 System Composition

Fig. A-1 shows a block diagram of the speech data processing system used for our work. The spoken words to be processed were recorded in an Industrial Acoustics Co. model 1205-A quiet room, using an Altec-Lansing type 681-A microphone and a Tandberg type 64X tape recorder which was located outside the quiet room. The signal-to-noise ratio of the tape recorder was specified to be in excess of 60 dB.

The PDP-9 computer had 16,384 words of core storage of 18 bit length. Of the 16K words of core storage, 8K words were reserved for raw and processed data storage, while the remaining 8K words were reserved for programs. Because of speed and core size limitations, virtually all programs were written in assembly language. However, when speed was not crucial and core size permitted, some auxiliary programs were written in FORTRAN and linked to the assembly language programs. Modular program units were developed so that combinations of program modules could be linked at load time. Total core space for all programs greatly exceeded the available core space. Likewise, only a limited amount of data could be held in core at any one time. Data files and program modules were stored on magnetic tape (DECTAPE) and were read into core as required. A description of the spectrum analyzer is given in Appendix B and descriptions of the functions of other system blocks are given in the following sections and Appendices C and D.

[Fig. A-1 Speech Data Processing System. Block labels: tape recorder (Tandberg 64X); primary input through an 8 kHz low pass filter; Krohn-Hite 3342R 1 kHz low pass filter; Krohn-Hite 3342R 1 kHz high pass filter; 32-channel spectrum analyzer; 32-channel analog multiplexor; A to D converter (9 bits, 16 kHz word rate); Radiation Inc. 5510; PDP-9 computer (16K); D to A converter (9 bits, 16 kHz word rate); 3 DEC tapes; magnetic tape; CRT display with light pen; teletype and sense switches; X-Y plotter; audio monitor.]

A.2 Primary Data Input

The data sampling rate used was 16 kHz. To avoid aliasing errors in sampling the signal, a low pass filter with an 8 kHz bandwidth and a 96 dB/octave roll-off rate was used to prefilter the acoustic signal prior to sampling and digitization. Linear quantization was employed, with a quantization accuracy of one part in 512 (9 bits). Nine bit quantization is typical of that used by others [19,14]. Because of limited core storage space, two data samples were packed per 18 bit computer word. The core storage space reserved for the raw data was 4,096 words. Even with the sample packing scheme described above, the available core space permitted storage of only a one-half second segment of the speech record. Longer speech records were processed piecemeal. The short time-window made indexing of the audio tape rather difficult.
To overcome this problem, the acoustic signal was sampled continuously and stored in a wrap-around or circular buffer. Using the audio monitoring facility and a sense switch, sampling was terminated just after the desired speech segment was heard. That is, this procedure defined the end of the time-window rather than the beginning of the window. To ascertain whether the correct segment had been obtained, the captured segment was visually inspected on the CRT display and also by means of playback through the audio monitoring facility. Acceptable segments were then stored in labelled data files on DECTAPE.

A.3 Secondary Data Input

Secondary data input refers to the input of speech data which has been subjected to broadband lowpass and highpass filtering and to narrow-band spectral analysis. Because of the difficulties in capturing appropriate segments of the spoken words, secondary input was performed using reconverted samples. That is, an analog replica of the original acoustical signal was obtained by reconversion of the digitized record held in core. By this means, lowpass filtered, highpass filtered, and spectrally analyzed records corresponding to the original, captured speech segments were obtained. A sampling rate of 16 kHz was again used during this process.

A.4 General Operational Procedures

On-line, interactive, digital processing was used extensively. The CRT display, light pen, sense switches, teletype and audio monitoring facility were found to be very useful.
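The sample packing and wrap-around capture described in Section A.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PDP-9 assembly code: the ordering of the two samples within the 18 bit word, the treatment of samples as unsigned 9 bit codes, and all function names are hypothetical.

```python
SAMPLE_BITS = 9            # quantization stated in Section A.2
SAMPLE_RATE_HZ = 16_000    # sampling rate stated in Section A.2
RAW_WORDS = 4_096          # core words reserved for raw data
SAMPLES_PER_WORD = 2       # two samples packed per 18 bit word
MASK = (1 << SAMPLE_BITS) - 1


def pack(first, second):
    """Pack two 9 bit samples into one 18 bit word (ordering is assumed)."""
    return ((first & MASK) << SAMPLE_BITS) | (second & MASK)


def unpack(word):
    """Recover the two 9 bit samples from an 18 bit word."""
    return (word >> SAMPLE_BITS) & MASK, word & MASK


def capture_until_stop(samples, capacity, stop_after):
    """Write samples into a circular buffer and stop after `stop_after`
    samples (standing in for the sense switch). Returns the retained
    samples in time order, so the stop defines the END of the window."""
    buffer = [0] * capacity
    write = 0
    written = 0
    for sample in samples:
        if written == stop_after:
            break
        buffer[write] = sample
        write = (write + 1) % capacity
        written += 1
    if written < capacity:
        return buffer[:written]
    return buffer[write:] + buffer[:write]


capacity_samples = RAW_WORDS * SAMPLES_PER_WORD
window_seconds = capacity_samples / SAMPLE_RATE_HZ
```

With the figures given in the text, 4,096 words hold 8,192 samples, or 0.512 second at 16 kHz, which agrees with the one-half second segment quoted above.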
For example, it was very easy to remove sections of the acoustic signal, such as part or all of a stop burst, in order to study the effects on the perception of the rest of the utterance. In other situations, it was convenient to adjust parameters or examine selected data in greater detail on the CRT display. Unfortunately, on-line data analysis tends to be rather costly in terms of computer time, and so hard-copy records were generated by several means including an X-Y plotter, a strip chart recorder and photographs of the CRT display. The resultant copies were then available for off-line analysis.

A summary of the programs for time-domain signal analysis is contained in Appendix C. Similarly, the programs for spectrum display and analysis are briefly described in Appendix D. Other more specific programs for analysis and recognition of the acoustic signals are contained in the main body of the thesis.

APPENDIX B
THE FILTER-BANK SPECTRUM ANALYZER

Specifications:
  No. of channels:       32
  Frequency range:       280 - 9000 Hz
  Channel spacing:       Logarithmic with frequency
  Channel bandwidths:    10% of channel centre frequency
  Spectrum weighting:    0.645 dB/octave increasing with frequency
  Input voltage range:   ±5 volts peak for ch. 0
                         ±2 volts peak for ch. 15
                         ±1.1 volts peak for ch. 31
  Output voltage range:  0 to +5 volts maximum

Fig. B-1 shows a block diagram of the overall spectrum analyzer configuration. Table B-1 summarizes the following characteristics of the spectrum analyzer:
i) centre frequencies of the bandpass filters,
ii) upper and lower cut-off frequencies of the bandpass filters,
iii) 6 dB bandwidths of the bandpass filters, and
iv) gain of each channel.

[Fig. B-1 Block Diagram of the Filter-bank Spectrum Analyzer. Signal path: input voltage, isolation amplifier, band-pass filter, rectifier, low-pass filter, output amplifier, output voltage.]

The bandpass filters were third order Tchebyscheff filters which were designed to give more than 100 dB/octave attenuation rate outside the passband. Fig. B-2 shows the attenuation characteristics of two of the bandpass filters. The lowpass filters consisted of a single π-section designed to give 20 dB/octave attenuation outside the passband and less than 1 dB ripple within the passband. For the channels covering the frequency range from 280 to 800 Hz, a 100 Hz lowpass filter bandwidth was used, whereas for the remaining channels, a 200 Hz lowpass filter bandwidth was used. These bandwidths correspond to smoothing time constants of approximately 10 msec and 5 msec, respectively. Fig. B-3 shows the attenuation characteristics of the two types of lowpass filters. A longer smoothing time constant was used for the low frequency channels in order to avoid problems with voicing-frequency ripple on the output. The shorter time constant was selected to improve temporal resolution of the high frequency components of sounds.

Table B-1 Spectrum Analyzer Channel Specifications

Channel   Centre Freq.   Bandpass Filter    6 dB Bandwidth   Overall
No.       (Hz)           Passband (Hz)      (Hz)             Channel Gain
 0         295            280 -  310          30              1.00
 1         327            310 -  345          35              1.08
 2         362            345 -  380          35              1.16
 3         399            380 -  420          40              1.25
 4         442            420 -  465          45              1.35
 5         492            465 -  520          55              1.45
 6         549            520 -  580          60              1.56
 7         612            580 -  645          65              1.68
 8         681            645 -  720          75              1.81
 9         759            720 -  800          80              1.95
10         844            800 -  890          90              2.10
11         939            890 -  990         100              2.26
12        1044            990 - 1100         110              2.44
13        1163           1100 - 1230         130              2.63
14        1298           1230 - 1370         140              2.83
15        1444           1370 - 1530         160              3.05
16        1613           1530 - 1700         170              3.28
17        1797           1700 - 1900         200              3.54
18        2021           1900 - 2150         250              3.82
19        2272           2150 - 2400         250              4.10
20        2546           2400 - 2700         300              4.42
21        2864           2700 - 3000         300              4.76
22        3170           3000 - 3350         350              5.13
23        3544           3350 - 3750         400              5.52
24        3969           3750 - 4200         450              5.95
25        4443           4200 - 4700         500              6.40
26        4967           4700 - 5250         550              6.90
27        5524           5250 - 5850         600              7.43
28        6166           5850 - 6500         650              8.00
29        6865           6500 - 7250         750              8.62
30        7663           7250 - 8100         850              9.30
31        8538           8100 - 9000         900             10.00

[Fig. B-2 Typical Attenuation Characteristics of Two of the Bandpass Filters (channel 5 and channel 26). Horizontal axis: frequency (Hz).]

[Fig. B-3 Attenuation Characteristics of the 100 Hz and 200 Hz Lowpass Filters. Horizontal axis: frequency (Hz).]

Fig. B-4 shows a circuit diagram for a typical analyzer channel. The rectifier circuit is a special design that greatly reduces the voltage offset characteristics of semiconductor diodes. For a typical germanium diode, the dc offset is about 0.3 volts and for a typical silicon diode, the dc offset is about 0.7 volts. With the circuit shown, the dc offset of the diode is reduced by a factor equal to the ratio of the open loop to closed loop gain of the amplifier. Silicon diodes were used in our circuit and the dc offset voltage was reduced to values typically less than 5 mvolts.

A problem common to dc amplifier applications is dc drift in the amplifier, which creates an error in the output. This problem was controlled through use of stable power supplies, a temperature-controlled environment and computer-controlled testing of the channel outputs for indications of excessive dc drift.

At a late stage in our work, the highest frequency channel was deleted and an additional low frequency channel covering the frequency range from 200 to 280 Hz was installed.

[Fig. B-4 Circuit Diagram of a Typical Spectrum Analyzer Channel]

APPENDIX C
TIME-DOMAIN SIGNAL ANALYSIS PROGRAMS

A large number of program modules were developed to aid in signal analysis and feature design. The programs can be divided into three broad groups comprising:
i) those concerned with display, playback and manipulation of the direct acoustic waveform,
ii) those concerned with simulation of the transducers for producing the reduced time-domain signal representations,
iii) those concerned with display and processing of the reduced acoustic representation.

The first group of programs included routines for:
1) CRT selection and display of any 128 msec or 32 msec section of the speech waveform.
2) Visual comparison of any two 64 msec sections from the same or different speech records.
3) Real-time audio playback of any selected section of the speech record.
4) Real-time audio playback of any two sections of the same or different speech records, in sequence, for perceptual comparison tests.
5) Reduced rate analogue output for strip-chart recording of the entire speech record.
6) Hard-copy output of any selected 128 msec section of the speech waveform on an analogue X-Y plotter.

The second group of programs simulated time-domain transducers through implementation of the algorithms described in Chapter II. In addition, auxiliary routines provided the following facilities:
1) Display of the axis-crossing points of the non-differentiated signal superimposed on the original waveform.
2) Display of the axis-crossing points of the differentiated signal superimposed on the original waveform.
3) Display of all local waveform extrema superimposed on the original waveform.
4) Display of signal envelope superimposed on the original waveform.
5) Calculation of the Z and D functions with time-window duration as a variable.
6) Calculation of form factors with time-window duration as a variable.

Extensive use was made of the light pen, sense switches and the teletype for selection and manipulation of the data and in specifying the required operations.

The third group of program modules provided routines for:
1) Display of time graphs of I, D and Z.
2) Calculation and display of smoothed versions of I, D and Z.
3) Calculation and display of functions of I, D and Z such as D/Z and I/Z.
4) Display of D versus Z for any selected segment of the time graphs.
5) Bar graph display of the time-interval histograms.
6) Application of different channel weight patterns to the time-interval histograms.
7) Calculation and display of the centre of gravity, its time derivative and the standard deviation of the T-I-H pattern.
8) Display of other functions computed from the time-interval histograms.
9) Superposition of the LP and HP time-interval histograms.
10) Hard-copy output of various CRT displays on an analogue X-Y plotter.

Other routines were developed to display and study the effects of different thresholds, decision boundaries and parameter variations.
These routines were used as an aid in design of characterization features, in studying segmentation algorithms and in establishing decision regions for classification.

The accompanying figures illustrate a few of the display programs developed. Figs. C-1 and C-2 show time graphs of the following functions for the words "sit" and "sid":
i) The signal envelope function, I
ii) The zero-crossing rate, D, of the differentiated signal
iii) The zero-crossing rate, Z, of the ordinary signal
iv) The function D/Z
v) The function I/Z
vi) The function dI_s/dt
vii) An envelope function, I_s, obtained by smoothing the function, I, over three adjacent time samples

Only a portion of the data for the fricative /s/ is shown and this data is shown at the left-most side of the graph. The temporal region for /s/ is denoted by large values of D and Z. The temporal region associated with the vowel /i/ is the region of high intensity following /s/. Following the vowel region is the stop-gap of /d/ and /t/, which is marked by low values of I, D and Z. The release of /d/ and /t/ is most clearly marked by the pulse-like waveform at the end of the D graph.

[Fig. C-1 Functions of Intensity and Zero-Crossing Rates for the Word "sit"]

[Fig. C-2 Functions of Intensity and Zero-Crossing Rates for the Word "sid"]

Fig. C-3 shows time-interval histograms for the word "sid". The results of processing of the signal prefiltered with 1 kHz highpass and lowpass filters are shown in Figs. C-3a and C-3c, respectively. The results of processing of the unfiltered signal are shown in Fig. C-3b. Fig. C-3d shows a display obtained by merging the results from Figs. C-3a and C-3c. Also shown in Fig. C-3d is a frequency scale obtained by calibrating the transducer with sinusoids. The lower of the two loci of short horizontal lines shown in Figs. C-3a, b and c is a time graph of the centre of gravity of the channel distributions. The upper locus is a time graph of the sum of the centre of gravity and the standard deviation of the pattern distribution.

[Fig. C-3 Time Interval Histograms for "sid". Panels: (a) HP - Time Interval Histogram; (b) UF - Time Interval Histogram; (c) LP - Time Interval Histogram; (d) Composite T-I-H.]

The temporal regions in the time-interval histograms associated with the three phonemes in "sid" are most easily described in terms of Fig. C-3d. Not all of the region associated with /s/ is shown, but the portion that is shown is at the left-hand side of the figure. The temporal region associated with /i/ corresponds to the central region containing two horizontal bands. The band at the bottom of the figure represents the quiet interval or stop-gap of the stop consonant. The region of the voiced-stop release is indicated by the short horizontal bands at the right-hand side of the figure.
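The measurements handled by the Appendix C programs can be sketched in a few lines. The algorithms of Chapter II are not reproduced in this appendix, so the code below is only an assumed, simplified version: differentiation is approximated by a first difference, the time window is the whole input, histogram bins are linear in samples, and all names are hypothetical.

```python
import math


def _sign(value):
    return (value > 0) - (value < 0)


def crossing_indices(x):
    """Sample indices at which the sign differs from the last non-zero sign."""
    indices, previous = [], 0
    for n, value in enumerate(x):
        s = _sign(value)
        if s == 0:
            continue
        if previous != 0 and s != previous:
            indices.append(n)
        previous = s
    return indices


def first_difference(x):
    """Simple stand-in for differentiation of the sampled signal."""
    return [b - a for a, b in zip(x, x[1:])]


def z_and_d(x, sample_rate_hz):
    """Zero-crossing rates (per second) of the signal (Z) and of its
    first difference (D) over the whole window."""
    duration_s = len(x) / sample_rate_hz
    z = len(crossing_indices(x)) / duration_s
    d = len(crossing_indices(first_difference(x))) / duration_s
    return z, d


def interval_histogram(x, n_bins):
    """Histogram of intervals (in samples) between successive crossings."""
    hist = [0] * n_bins
    idx = crossing_indices(x)
    for a, b in zip(idx, idx[1:]):
        hist[min(b - a, n_bins - 1)] += 1
    return hist


def centre_and_spread(hist):
    """Centre of gravity and standard deviation of a histogram pattern."""
    total = sum(hist)
    mean = sum(i * h for i, h in enumerate(hist)) / total
    var = sum(h * (i - mean) ** 2 for i, h in enumerate(hist)) / total
    return mean, math.sqrt(var)


FS = 16_000  # sampling rate from Appendix A
tone = [math.sin(2 * math.pi * 1000 * n / FS + 0.1) for n in range(512)]
z, d = z_and_d(tone, FS)
hist = interval_histogram(tone, 32)
cog, spread = centre_and_spread(hist)
```

For the 1 kHz test tone, Z and D both come out close to 2000 crossings per second and the histogram concentrates in the bin for a half period of 8 samples. For speech-like signals with energy at several frequencies, D generally exceeds Z.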
APPENDIX D

SPECTRUM DISPLAY AND ANALYSIS PROGRAMS

A number of spectrum display routines were developed which provided options for:
1) Form of visual presentation
2) Amplitude scale
3) Amplitude normalization
4) Time-axis sample density
5) Overall size
6) Frequency-dependent weighting of amplitudes.

Table D-1 summarizes the display types and options available.

TABLE D-1
SPECTRUM DISPLAY ROUTINES AND OPTIONS

DISPLAY                   AMPL.        AMPL.  TIME-AXIS
NAME     REPRESENTATION   SCALE        NORM.  DENSITY                         SIZE                  FREQ. WEIGHTING
CAVDIS   Isometric        Linear       No     1 : 8                           Medium                No
LINDI    Bar              Linear       Yes    1 : 8                           Small, Medium, Large  Normal, HF Incr., LF Incr.
LGDIS    Bar              Logarithmic  Yes    1 : 8                           Small, Medium, Large  Normal, HF Incr., LF Incr.
DENDI    Bar              Linear       Yes    1 : 4 plus 3-sample smoothing   Small, Medium, Large  Normal, HF Incr., LF Incr.

NOTES: 1 : 4 means every 4th time sample (8 msec) displayed.
       HF Incr. = + 0.645 db octave increase and LF Incr. = + 0.645 db octave decrease, both in addition to the spectrum analyzer high frequency pre-emphasis.

Two types of visual presentation were developed. One type was a three-dimensional, isometric display with frequency along the conventional y direction, time along the x direction and linear spectral amplitude along the z direction. Fig. D-1 shows an example of the isometric display for the word "zealous". This type of display was found to be visually interesting but difficult to use for detailed analysis.

[Fig. D-1: Isometric Spectral Display of the Word "zealous"]

The second type of display was a planar form similar in appearance to the familiar SONAGRAPH spectral display. There are, however, several differences. One is that spectral amplitude is represented by the length of a bar in the vertical direction rather than by intensity variation, as used in the SONAGRAPH display. In other words, density modulation rather than intensity modulation is used. When viewed from a distance the effect is quite similar. The other differences are due to the sampling and quantization along the frequency and time axes.

The short-time spectrum of speech has a very large range of amplitudes along both the frequency and time dimensions. To handle this problem, several different methods were attempted. One method was to use a logarithmic spectral-amplitude scale. Fig. D-2 shows an example of this type of display for the word "zealous". With this type of display, the desired dynamic range compression was achieved but it became difficult to pick out extrema in the envelope of the running spectrum. Another method used was a linear amplitude scale with additional high frequency pre-emphasis. This frequency weighting was applied on top of the pre-emphasis already incorporated into the hardware design of the spectrum analyzer. Fig. D-3 shows a sample display with this additional high frequency pre-emphasis, amounting to 20 db between the lowest and highest frequency channels. For comparison, Fig. D-4 shows the completely unweighted spectrum, that is, with the spectrum analyzer frequency pre-emphasis removed. A third method was to use a linear amplitude scale with saturation and then to adjust the overall spectrum amplitude.

[Fig. D-2: Logarithmic Amplitude Display]
[Fig. D-3: High Density Display with High-Frequency Pre-emphasis]
[Fig. D-4: High Density Display of Unweighted Spectrum]

In working with a CRT display, flicker arises because of the finite plotting rate allowable. Consequently, the time needed to redraw the picture, and hence the visible flicker, grows in direct proportion to the number of points displayed. Since display flicker makes viewing inconvenient, low and high density displays were provided in conjunction with the linear amplitude mode. Although the low density mode gave a less tiring display, the high density display was found to be more useful for detailed analysis work. Figs. D-3 and D-4 are examples of the high density display.

Another way of reducing the apparent flicker was to change the overall size of the picture. Three size options were provided. The large and medium sizes were found to be the most useful for detailed analysis work, but the small size was found to be quite useful for observing global structural characteristics.

The linear amplitude, high density type of spectral presentation (DENDI) was found to be the most useful display for detailed analysis of the spectrograms.
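The processing implied by the DENDI entry of Table D-1 (3-sample smoothing, every 4th time sample, amplitude normalization, optional frequency weighting, bar presentation) can be illustrated with a minimal sketch. The original routines are not listed in this appendix, so everything below is an assumption made for illustration only: the function names, the 3-point moving average used for smoothing, the choice to smooth before dropping samples, the weighting that rises linearly in db from the lowest to the highest channel, and the five-character ramp that stands in for bar length on the CRT.

```python
from typing import List

Spectrum = List[List[float]]  # spectrum[t][k]: amplitude at time sample t, channel k

GLYPHS = " .:!|"  # hypothetical 5-level ramp: taller character = longer bar


def smooth3(frames: Spectrum) -> Spectrum:
    """3-sample moving average along time (assumed kernel; edges use fewer samples)."""
    n = len(frames)
    out = []
    for t in range(n):
        window = frames[max(0, t - 1):min(n, t + 2)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def decimate(frames: Spectrum, step: int = 4) -> Spectrum:
    """Keep every step-th time sample (the 1 : 4 density of Table D-1)."""
    return frames[::step]


def weight(frames: Spectrum, total_db: float) -> Spectrum:
    """Tilt the spectrum by a gain rising linearly in db from 0 at the lowest
    channel to total_db at the highest (assumed shape; negative favours LF)."""
    n_ch = len(frames[0])
    span = max(n_ch - 1, 1)
    gains = [10.0 ** (total_db * k / span / 20.0) for k in range(n_ch)]
    return [[a * g for a, g in zip(frame, gains)] for frame in frames]


def normalize(frames: Spectrum, full_scale: int = len(GLYPHS) - 1) -> List[List[int]]:
    """Scale so the largest amplitude maps to full_scale, then quantize."""
    peak = max(max(frame) for frame in frames)
    if peak <= 0:
        return [[0] * len(frame) for frame in frames]
    return [[round(full_scale * max(a, 0.0) / peak) for a in frame] for frame in frames]


def render(levels: List[List[int]]) -> str:
    """One text row per frequency channel (highest on top), one column per time sample."""
    n_ch = len(levels[0])
    rows = ["".join(GLYPHS[frame[k]] for frame in levels) for k in reversed(range(n_ch))]
    return "\n".join(rows)


def dendi_sketch(frames: Spectrum, step: int = 4, total_db: float = 0.0) -> str:
    """Smooth, decimate, weight, normalize and draw a short-time spectrum."""
    return render(normalize(weight(decimate(smooth3(frames), step), total_db)))
```

As a usage example, eight identical two-channel frames `[1.0, 2.0]` passed to `dendi_sketch` reduce to two displayed time samples; the upper channel is drawn with the full-length character `|` and the lower channel, at half the peak amplitude, with `:`. Calling `dendi_sketch(frames, total_db=20.0)` would apply a 20 db rise between the lowest and highest channels, comparable in size to the additional pre-emphasis described for Fig. D-3, although the actual weighting curve used in the thesis may differ.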
For use with the DENDI display, a number of auxiliary routines were developed to display the frequency and the weighted and unweighted amplitude of any data point. These routines were used to determine features such as:
i) Formant frequencies, amplitudes and bandwidths
ii) Estimates of Z and D
iii) Noise spectrum cut-off frequency and roll-off rate
iv) Spectrum envelope extrema.

APPENDIX E

PHONETIC TRANSCRIPTION SYSTEM AND WORD LISTS

TABLE E-1
ENGLISH PHONEMES

Phonetic  Key      Phonetic  Key
Symbol    Word     Symbol    Word
I         fit      b         bad
i         feet     d         dive
ɛ         let      g         give
æ         bat      p         pot
ʌ         but      t         toy
a         not      k         cat
ɔ         law
U         book     m         may
u         boot     n         now
ɜ         bird     ŋ         sing
[?]       Bert     z         zero
e         pain     ʒ         vision
o         go       v         very
aU        house    ð         that
aI        ice      h         hat
ɔI        boy      f         fat
IU        few      θ         thing
j         you      ʃ         shed
w         we       s         sat
l         late     tʃ        church
r         rate     dʒ        judge

TABLE E-2
WORD LIST 1 *

fit sect sheep fear fussy salve shack fall food soothe sure reef niece leash cuff bus lush aloof moose bush village zebra Vaughn zealous vogue zoom peeve busy seizure salve has measure cove lose rouge

* After Hughes and Halle [70]

TABLE E-3
WORD LIST 2 *

kool    pool    tool
kun     pun     tun
keel    peel    teel
lock    lop     lot
whisk   whisp   whist
sick    sip     sit
gool    bool    dool
gun     bun     dun
geel    beal    deal
log     lob     lod
sig     sib     sid

* After Halle, Hughes and Radley [71]

BIBLIOGRAPHY

Note: JASA = J. Acoustical Soc. of America

1. Pierce, J.R., "Whither Speech Recognition", JASA, vol. 46, pp. 1049-1051, 1969.
2. Lindgren, N., "Machine Recognition of Human Language: Part II", IEEE Spectrum, vol. 2, pp. 44-59, 1965.
3. Chomsky, N. and Miller, G.A., "Introduction to the Formal Analysis of Natural Languages", in Handbook of Mathematical Psychology, Ed: R.D. Luce, R.R. Bush, E. Galanter, John Wiley and Sons, Inc., New York, pp. 269-321, 1963.
4. Pierce, J.R., et al, Languages and Machines: Computers in Translation and Linguistics, Nat. Academy of Sciences-Nat'l Res. Council, Washington, D.C., Publ. No. 1416, 1966.
5. Kanal, L. and Chandrasekaran, B., "Recognition, Machine 'Recognition' and Statistical Approaches", Methodologies of Pattern Recognition, edited by S. Watanabe, Academic Press Inc., New York and London, pp. 493-508, 1969.
6. Schroeder, M.R., "Parameter Estimation in Speech: A Lesson in Unorthodoxy", Proc. IEEE, vol. 58, pp. 707-712, 1970.
7. Lea, W.A., "Towards Versatile Speech Communication with Computers", Inter. J. Man-Machine Studies, vol. 2, pp. 107-156, 1970.
8. Kanal, L. and Chandrasekaran, B., "On Dimensionality and Sample Size in Statistical Pattern Classification", Proc. Nat. Elect. Conf., vol. 24, pp. 2-7, 1968.
9. Ryan, H.F., "The Information Content Measure as a Performance Criterion for Feature Selection", Proc. of the Seventh IEEE Symp. on Adaptive Processes, Los Angeles, USA, pp. 2c1-2c11, December, 1968.
10. Hughes, G.F., "On the Mean Accuracy of Statistical Pattern Recognizers", IEEE Trans. on Info. Theory, vol. IT-14, pp. 55-63, 1968.
11. Fu, K.S., Min, P.J. and Li, T.J., "Feature Selection in Pattern Recognition", IEEE Trans. on System Sci. and Cyb., vol. SSC-6, pp. 33-39, 1970.
12. Swonger, C.W., "Property Learning in Pattern Recognition Systems Using Information Content Measures", Pattern Recognition, L. Kanal, ed., Thompson Book Co., Washington, D.C., 1968.
13. Ho, Y.C. and Agrawala, A.K., "On Pattern Classification Algorithms: Introduction and Survey", Proc. IEEE, vol. 56, pp. 2101-2114, 1968.
14. Hughes, G.W. and Hemdal, J.F., Speech Analysis, Purdue Res. Foundation, Lafayette, Ind., Rept. TR-EE65-9, July, 1965.
15. Cover, T.M. and Hart, P.E., "Nearest Neighbour Pattern Classification", IEEE Trans. on Info. Theory, vol. IT-13, pp. 21-27, 1967.
16. Hughes, G.W., The Recognition of Speech by Machine, MIT Res. Lab. of Electronics, MIT, Cambridge, Mass., Rept. TR-395, 1961.
17. Fry, D.B. and Denes, P., "The Solution of Some Fundamental Problems in Mechanical Speech Recognition", Language and Speech, vol. 1, Pt. 1, pp. 35-58, 1958.
18. Bobrow, D.G. and Klatt, D.H., "A Limited Speech Recognition System", AFIPS Proc. 1968 Fall Joint Computer Conf., vol. 33, pp. 305-318, 1968.
19. Reddy, D.R., "Computer Recognition of Connected Speech", JASA, vol. 42, pp. 329-346, 1967.
20. King, J.H. and Tunis, C.J., "Some Experiments in Spoken Word Recognition", IBM J. Res. and Dev., pp. 65-79, 1966.
21. Gold, B., "Word Recognition Computer Program", M.I.T. Res. Lab. of Electronics, Rept. TR-452, 1966.
22. Sholtz, P.N. and Bakis, R., "Spoken Digit Recognition Using Vowel-Consonant Segmentation", JASA, vol. 34, pp. 1-5, 1962.
23. Chiba, S., "Machines That Hear the Spoken Word", New Scientist, pp. 706-708, June 22, 1967.
24. Flanagan, J.L., Speech Analysis, Synthesis and Perception, Academic Press Inc., New York, 1965.
25. Fant, G., "Analysis and Synthesis of Speech Processes", Manual of Phonetics, B. Malmberg, Ed., North Holland Publishing Co., The Netherlands, 1969.
26. Potter, R.K., Kopp, G.A. and Kopp, H.G., Visible Speech, Dover Publications, New York, 1961.
27. Peterson, G.E. and Barney, H.L., "Control Methods Used in a Study of the Vowels", JASA, vol. 24, pp. 175-184, 1952.
28. Peterson, G.E., "Parameters of Vowel Quality", J. Speech and Hearing Res., vol. 4, pp. 10-29, 1961.
29. Forgie, J.W. and Forgie, C.D., "Results Obtained from a Vowel Recognition Computer Program", JASA, vol. 31, pp. 1480-1489, 1959.
30. Bell, C.G., Fujisaki, H., Heinz, J.M., Stevens, K.N. and House, A.S., "Reduction of Speech Spectra by Analysis-by-Synthesis Techniques", JASA, vol. 33, pp. 1725-1736, 1961.
31. Schafer, R.W. and Rabiner, L.R., "System for Automatic Formant Analysis of Voiced Speech", JASA, vol. 47, pp. 634-642, 1970.
32. Mundie, J.R. and Moore, T.J., "Speech Analysis as the Ear Sees It: Aural Topography", 1967 Conf. on Speech Communication and Processing, Cambridge, Mass., 1967, pp. 206-214.
33. Bremermann, H.J., "Pattern Recognition, Functionals and Entropy", IEEE Trans. on Bio-Medical Eng., vol. BME-15, pp. 201-207, 1968.
34. Teacher, C.F., Kellett, H.G. and Focht, L.R., "Experimental, Limited Vocabulary Speech Recognizer", IEEE Trans. on Audio and Electroacoustics, vol. AU-15, pp. 127-130, 1967.
35. Reddy, D.R., "Segmentation of Speech Sounds", JASA, vol. 40, pp. 307-312, 1966.
36. Bezdel, W. and Bridle, J.S., "Speech Recognition Using Zero-Crossing Measurements and Sequence Information", Proc. IEE, vol. 116, pp. 617-623, 1969.
37. Ewing, G.D. and Taylor, J.F., "Computer Recognition of Speech Using Zero Crossing Information", IEEE Trans. on Audio and Electroacoustics, vol. AU-17, pp. 37-40, 1969.
38. Bezdel, W. and Chandler, H.J., "Results of an Analysis and Recognition of Vowels by Computer Using Zero-Crossing Data", Proc. IEE, vol. 112, pp. 2060-2066, 1965.
39. Scarr, R.W., "Zero-Crossings as a Means of Obtaining Spectral Information in Speech Analysis", IEEE Trans. on Audio and Electroacoustics, vol. AU-16, pp. 247-255, 1968.
40. Fant, G., "The Nature of Distinctive Features", Speech Trans. Lab., Dept. of Speech Communication, Royal Inst. of Technology, Stockholm, Sweden, Rept. STL-QPSR 4, 1966.
41. Babcock, M., Atwood, W., Cohen, J.R. and Hoffman, M.P., "Search and Evaluation of Significant Event Sequences in Automated Speech Analysis", Univ. of Illinois, Urbana, Ill., Biological Computer Lab., BCL Rept. No. NASA - 1, 1969.
42. Bastian, J. and Abramson, A.S., "Identification and Discrimination of Phonemic Vowel Duration", JASA, vol. 34, pp. 743-744, 1962.
43. Otten, K.W., "Automatic Recognition of Continuous Speech - Vol. I: Problems and an Approach", Air Force Avionics Lab., Wright-Patterson Air Force Base, Ohio, Rept. TR-AFAL-TR-66-408, 1966.
44. Twaddell, W.F., "Phonemes and Allophones in Speech Analysis", JASA, vol. 24, pp. 607-611, 1952.
45. Jakobson, R., Fant, C.G.M. and Halle, M., Preliminaries to Speech Analysis, Tech. Rept. 13, Acoustics Lab., MIT, Cambridge, Mass., 1952.
46. Jakobson, R. and Halle, M., Fundamentals of Language, Mouton and Co., 's Gravenhage, Netherlands, 1956.
47. Dudley, H. and Balashek, S., "Automatic Recognition of Phonetic Patterns in Speech", JASA, vol. 30, pp. 721-732, 1958.
48. Reddy, D.R., "Phoneme Grouping for Speech Recognition", JASA, vol. 41, pp. 1295-1300, 1967.
49. Martin, T.B., Nelson, A.L. and Zadell, H.J., "Speech Recognition by Feature Abstraction Techniques", R.C.A. Labs., Rept. AL-TDR 64-176, 1964.
50. Stevens, K.N. and House, A.S., "Perturbation of Vowel Articulations by Consonantal Context: An Acoustical Study", J. Speech and Hearing Res., vol. 6, pp. 111-128, 1963.
51. Stevens, K.N. and Klatt, M.M., "Study of Acoustic Properties of Speech Sounds", Rept. BBN-1669, 99 pp., Bolt, Beranek and Newman, Cambridge, Mass., 1968.
52. Lindblom, B.B., "Spectrographic Study of Vowel Reduction", JASA, vol. 35, pp. 1773-1781, 1963.
53. Bush, C.N., Phonetic Variation and Acoustic Distinctive Features, Mouton and Co., The Hague, 1968.
54. Harris, K., "Cues for the Discrimination of American English Fricatives in Spoken Syllables", Language and Speech, vol. 1, pp. 1-7, 1958.
55. Schatz, C.D., "The Role of Context in the Perception of Stops", Language, vol. 30, pp. 47-56, 1954.
56. Lehiste, I., "Acoustical Characteristics of Selected English Consonants", Inter. J. of Amer. Linguistics, vol. 30, Pt. IV, 1964.
57. Lisker, L., "Minimal Cues for Separating /w,r,l,y/ in Intervocalic Position", WORD, vol. 13, pp. 257-267, 1957.
58. Stevens, K.N., House, A.S. and Paul, A.P., "Acoustical Description of Syllabic Nuclei: An Interpretation in Terms of a Dynamic Model of Articulation", JASA, vol. 40, pp. 123-132, 1966.
59. Stevens, K.N., "Acoustic Correlates of Certain Consonantal Features", Proc. of the 1967 Conf. on Speech Communication and Processing, Cambridge, Mass., pp. 177-185, 1967.
60. Abramson, A.S. and Lisker, L., "Voice Onset Time in Stop Consonants: Acoustic Analysis and Synthesis", Haskins Labs, New York, Rept. SR-3, pp. 1.1-1.8, 1965.
61. Broad, D.J. and Fertig, R.H., "Formant Frequency Trajectories in Selected CVC Syllable Nuclei", JASA, vol. 47, pp. 1572-1582, 1970.
62. Lehiste, I. and Peterson, G.E., "Transitions, Glides and Diphthongs", JASA, vol. 33, pp. 268-277, 1961.
63. Sebestyen, G.S., Decision Making in Pattern Recognition, The MacMillan Co., New York, 1962.
64. Nilsson, N., Learning Machines, McGraw-Hill, New York, 1965.
65. Chiba, S., "Spoken Word Recognition by Multiple Linear Separation", Proc. 6th Inter. Congress on Acoustics, Tokyo, Japan, pp. B123-B126, Aug. 1968.
66. Hiramatsu, K. and Kotoh, K., "A Spoken Digit Recognition System Using Error Correcting Procedure", Proc. Sixth Inter. Congress on Acoustics, Tokyo, Japan, pp. B119-B122, August, 1968.
67. Washizawa, S., Okamoto, M., Shiga, T. and Inomata, S., "A Spoken Digit Recognition Scheme and its Computer Implementation", Proc. Sixth Inter. Congress on Acoustics, Tokyo, Japan, pp. B135-B138, August 1968.
68. Miller, J.C., Ross, P.W. and Wine, C.M., "An Adaptive Speech Recognition System Operating in a Remote Time-Sharing Computer Environment", IEEE Trans. on Audio and Electroacoustics, vol. AU-18, pp. 26-31, 1970.
69. Kain, R.Y., "The Mean Accuracy of Pattern Recognizers with Many Pattern Classes", IEEE Trans. on Info. Theory, vol. 15, pp. 424-425, 1969.
70. Hughes, G.W. and Halle, M., "Spectral Properties of Fricative Consonants", JASA, vol. 28, pp. 303-310, 1956.
71. Halle, M., Hughes, G.W. and Radley, J.P., "Acoustic Properties of Stop Consonants", JASA, vol. 29, pp. 107-116, 1957.
72. Peterson, G.E., "The Information-Bearing Elements of Speech", JASA, vol. 24, pp. 629-637, 1952.
73. Fant, G., Acoustic Theory of Speech Production, Mouton and Co., The Hague, 1960.
74. Bekesy, G.V., Experiments in Hearing, McGraw-Hill, New York, 1960.
75. Gold, B. and Rader, C.M., "Systems for Compressing the Bandwidth of Speech", IEEE Trans. on Audio and Electroacoustics, vol. AU-15, pp. 131-136, 1967.
76. Sakai, T. and Inoue, S., "New Instruments and Methods for Speech Analysis", JASA, vol. 32, pp. 441-450, 1960.
77. Peterson, E., "Frequency Detection and Speech Formants", JASA, vol. 23, pp. 668-674, 1951.
78. Lindblom, B.B., "Accuracy and Limitations of Sona-Graph Measurements", Proc. of the Fourth International Congress of Phonetic Sciences, Helsinki, 1961, pp. 188-202.
79. Mathews, M.V., Miller, J.E. and David, E.E., "Pitch Synchronous Analysis of Voiced Sounds", JASA, vol. 33, pp. 179-186, 1961.
80. Watanabe, S., Lambert, P.F., Kulikowski, C.A., Buxton, J.L. and Walker, R., "Evaluation and Selection of Variables in Pattern Recognition", Computer and Information Sciences II, J. Tou, ed., Academic Press, New York-London, 1967.
81. Lathi, B.P., Signals, Systems and Communication, John Wiley & Sons, Inc., New York, 1965.
82. Licklider, J. and Pollack, I., "Effects of Differentiation, Integration and Infinite Peak Clipping on the Intelligibility of Speech", JASA, vol. 20, pp. 42-51, 1948.
83. Ainsworth, W.A., "Relative Intelligibility of Different Transforms of Clipped Speech", JASA, vol. 41, pp. 1272-1276, 1967.
84. Peterson, G.E. and Hanne, J.R., "Examination of Two Different Formant Estimation Techniques", JASA, vol. 38, pp. 224-228, 1965.
85. Chang, S., Pihl, G.E. and Essigmann, M.W., "Representation of Speech Sounds and Some of Their Statistical Properties", Proc. IRE, vol. 39, pp. 147-153, 1951.
86. Chang, S.H., "Two Schemes of Speech Compression System", JASA, vol. 28, pp. 565-572, 1956.
87. Mathews, M.V., "Extremal Coding for Speech Transmission", IRE Trans. on Info. Theory, vol. IT-5, pp. 129-136, 1959.
88. Cherry, E.C. and Phillips, V.J., "Some Possible Uses of Single Sideband Signals in Formant Tracking Systems", JASA, vol. 33, pp. 1067-1077, 1961.
89. Bezdel, W., "Some Problems in Man-Machine Communication Using Speech", Inter. J. Man-Machine Studies, vol. 2, pp. 157-168, 1970.
90. Papoulis, A., Probability, Random Variables and Stochastic Processes, McGraw-Hill, New York, 1965.
91. Chang, S.H., Pihl, G.E. and Wiren, J., "The Intervalgram as a Visual Representation of Speech Sounds", JASA, vol. 23, pp. 675-679, 1951.
92. Chang, S.H., "Portrayal of Some Elementary Statistics of Speech Sounds", JASA, vol. 22, pp. 768-769, 1950.
93. Davenport, W.B., "An Experimental Study of Speech-Wave Probability Distributions", JASA, vol. 24, pp. 390-399, 1952.
94. Bobrow, D.G., Hartley, A.K. and Klatt, D.H., "A Limited Speech Recognition System II: Final Report", Bolt, Beranek and Newman, Inc., Cambridge, Mass., Rept. BBN 1819, 1969.
95. Flanagan, J.L., "Relative Insensitiveness of Listeners to Change in Formant Amplitude", J. Speech and Hearing Disorders, vol. 22, pp. 205-212, 1957.
96. Dunn, H.K., "Methods of Measuring Vowel Formant Bandwidths", JASA, vol. 33, pp. 1737-1746, 1961.
97. House, A.S., "Formant Bandwidth and Vowel Preference", J. Speech and Hearing Res., vol. 3, pp. 3-8, 1960.
98. Lindblom, B.E. and Studdert-Kennedy, M., "On the Role of Formant Transitions in Vowel Recognition", JASA, pp. 830-843, 1967.
99. Ohman, S.E.G., "Coarticulation in VCV Utterances: Spectrographic Measurements", JASA, vol. 39, pp. 151-168, 1966.
100. Cooper, F.S., Delattre, P.C., Liberman, A.M., Borst, J.M. and Gerstman, L.J., "Some Experiments on the Perception of Synthetic Speech Sounds", JASA, vol. 24, pp. 597-607, 1952.
101. Halle, M. and Stevens, K.N., "Speech Recognition: A Model and Program for Research", IRE Trans. on Info. Theory, vol. IT-8, pp. 155-159, 1962.
102. House, A.S., "On Vowel Duration in English", JASA, vol. 33, pp. 1174-1178, 1961.
103. Peterson, G.E. and Lehiste, I., "Duration of Syllable Nuclei in English", JASA, vol. 32, pp. 693-703, 1960.
104. Dunn, H.K., "The Calculation of Vowel Resonances and an Electrical Vocal Tract", JASA, vol. 22, pp. 740-753, 1950.
105. Stevens, K.N. and House, A.S., "An Acoustical Theory of Vowel Production and Some of Its Implications", J. Speech and Hearing Res., vol. 4, pp. 303-320, 1961.
106. Paul, A.P., House, A.S. and Stevens, K.N., "Automatic Reduction of Vowel Spectra: An Analysis-by-Synthesis Method and its Evaluation", JASA, vol. 36, pp. 303-308, 1964.
107. Rosen, G., "Dynamic Analog Speech Synthesizer", MIT Res. Lab. of Electronics, Cambridge, Mass., Tech. Rept. 353, 1962.
108. Rabiner, L., "Speech Synthesis by Rule: An Acoustic Domain Approach", Bell Sys. Tech. J., vol. 47, pp. 17-37, 1968.
109. Dixon, N.R. and Maxey, H.D., "Terminal Analog Synthesis of Continuous Speech Using the Diphone Method of Segment Assembly", IEEE Trans. on Audio and Electroacoustics, vol. AU-16, pp. 40-50, 1968.
110. Tomlinson, R.S., "SPASS - An Improved Terminal Analog Speech Synthesizer", MIT Res. Lab. of Elect., Cambridge, Mass., QPR-80, pp. 198-205, 1966.
111. Holmes, J., Mattingly, I. and Shearme, J., "Speech Synthesis by Rule", Language and Speech, vol. 7, pp. 127-143, 1964.
112. Haggard, M.P. and Mattingly, I.G., "A Simple Program for Synthesizing British English", IEEE Trans. on Audio and Electroacoustics, vol. AU-16, pp. 95-98, 1968.
113. Liljencrants, J.C.W.A., "The OVE III Speech Synthesizer", IEEE Trans. on Audio and Electroacoustics, vol. AU-16, pp. 137-140, 1968.
114. Foulkes, J.D., "Computer Identification of Vowel Types", JASA, vol. 33, pp. 7-11, 1961.
115. Lindgren, N., "Machine Recognition of Human Language: Part I", IEEE Spectrum, vol. 2, pp. 114-136, 1965.
116. Fant, G., "Acoustic Analysis and Synthesis of Speech with Applications to Swedish", Ericsson Technics, vol. 15, pp. 3-108, 1959.
117. House, A.S. and Stevens, K.N., "Estimation of Formant Bandwidths from Measurement of Transient Response of the Vocal Tract", J. Speech and Hearing Res., vol. 1, pp. 309-315, 1958.
118. Chow, C.K., "On Optimum Recognition Error and Reject Tradeoff", IEEE Trans. on Info. Theory, vol. IT-16, pp. 41-46, 1970.
119. Liberman, A.M., Cooper, F.S., Harris, K.S., MacNeilage, P.F. and Studdert-Kennedy, M., "Some Observations on a Model for Speech Perception", Haskins Labs., New York, pp. 2.1-2.27, 1965.
120. Liberman, A.M., et al, "Perception of the Speech Code", Psychological Review, vol. 74, pp. 431-461, 1967.
121. Lisker, L., "Closure Duration and the Voiced-Unvoiced Distinction in English", Language, vol. 33, pp. 42-49, 1957.
122. Heinz, J.M. and Stevens, K.N., "On the Properties of Voiceless Fricative Consonants", JASA, vol. 33, pp. 589-593, 1961.
123. Harris, K., "Some Acoustic Cues for the Fricative Consonants", JASA, vol. 28, p. 160 (A), 1956.
124. Gerstman, L.J., "Noise Duration as a Cue for Distinguishing Among Fricative, Affricate and Stop Consonants", JASA, vol. 28, p. 160 (A), 1956.
125. Forgie, J.W. and Forgie, C.D., "A Computer Program for Recognizing the English Fricatives /f/ and /θ/", Proc. Fourth Inter. Congress on Acoustics, Copenhagen, Denmark, pp. G11-1 - G11-4, August, 1962.
126. Sholes, G.N., "Synthesis of Final /z/ Without Voicing", JASA, vol. 31, p. 1568 (A), 1959.
127. Dewey, G., Relativ Frequency of English Speech Sounds, Cambridge-Harvard University Press, 1923.
128. Denes, P., "On the Statistics of Spoken English", JASA, vol. 35, pp. 892-904, 1963.
129. Herdan, G., "The Relationship Between the Functional Burdening of Phonemes and the Frequency of Occurrence", Language and Speech, vol. 1, pp. 8-13, 1958.
130. Denes, P., "Effect of Duration on the Perception of Voicing", JASA, vol. 27, pp. 761-764, 1955.
131. Delattre, P., "First Formant Transitions as a Cue to Place of Articulation", JASA, vol. 46, p. 110 (A), 1969.
132. Fujimura, O., "Analysis of Nasal Consonants", JASA, vol. 34, pp. 1865-1875, 1962.
133. Nakata, K., "Synthesis and Perception of Nasal Consonants", JASA, vol. 31, pp. 661-666, 1959.
134. Lisker, L. and Abramson, A.S., "Voice Onset Time in the Production and Perception of English Stops", Haskins Labs., New York, Rept. SR-1, pp. 3.1-3.15, 1965.
135. Sharf, D.J., "Duration of Post-Stress Intervocalic Stops and Preceding Vowels", Language and Speech, vol. 5, Pt. 1, pp. 26-30, 1962.
136. Delattre, P.C., Liberman, A.M. and Cooper, F.S., "Acoustic Loci and Transitional Cues for Consonants", JASA, vol. 27, pp. 769-773, 1955.
137. Menon, K.M.N., Jensen, P.J. and Dew, D., "Acoustic Properties of Certain VCC Utterances", JASA, vol. 46, pp. 449-457, 1969.
138. Abend, K., Harley, T.J. and Chandrasekaran, B., "Comments on 'The Mean Accuracy of Statistical Pattern Recognizers'", IEEE Trans. on Info. Theory, vol. IT-15, pp. 420-423, 1969.
139. Gerstman, L.J., "Classification of Self-Normalized Vowels", IEEE Trans. on Audio and Electroacoustics, vol. AU-16, pp. 78-80, 1969.
140. House, A.S., "Analog Studies of Nasal Consonants", J. Speech and Hearing Disorders, vol. 22, pp. 190-204, 1957.
141. Schroeder, M.R., "On the Separation and Measurement of Formant Frequencies", JASA, vol. 28, p. 159 (A), 1956.
142. Gerstman, L.J., Liberman, A.M., Delattre, P.C. and Cooper, F.S., "Rate and Duration of Change in Formant Frequency as Cues for the Identification of Speech Sounds", JASA, vol. 26, p. 952 (A), 1954.
143. Suzuki, H., "Mutually Complementary Effect of Rate and Amount of Formant Transition in Distinguishing Vowel, Semivowel, and Stop Consonants", MIT Res. Lab. of Electronics, QPR-96, pp. 164-172, 1970.
144. Lisker, L. and Abramson, A.S., "A Cross-Language Study of Voicing in Initial Stops: Acoustical Measurements", WORD, vol. 20, pp. 384-422, 1964.
145. Stevens, K.N., "Acoustic Correlates of Place of Articulation for Stop and Fricative Consonants", MIT Res. Lab. of Electronics, QPR 89, pp. 199-205, 1968.
146. Hoffman, H.S., "Study of Cues in the Perception of Voiced Stop Consonants", JASA, vol. 30, pp. 1035-1040, 1958.
147. Bastian, J., Delattre, P. and Liberman, A.M., "Silent Interval as a Cue for the Distinction Between Stops and Semivowels in Medial Position", JASA, vol. 31, p. 1568 (A), 1959.
148. Bastian, J., Eimas, P. and Liberman, A.M., "Identification and Discrimination of a Phonemic Contrast Induced by Silent Interval", JASA, vol. 33, p. 842 (A), 1961.
149. Huffman, D.A., "A Method for the Construction of Minimum Redundancy Codes", Proc. IRE, vol. 40, pp. 1098-1101, 1952.
150. Gallager, R.G., Information Theory and Reliable Communication, John Wiley and Sons, Inc., New York, 1968.
151. Liberman, A.M., "Some Cues for the Distinction Between Voiced and Voiceless Stops in Initial Position", Language and Speech, vol. 1, Pt. 3, pp. 153-162, 1958.
152. Stevens, K.N., Kasowski, S. and Fant, C.G.M., "An Electrical Analog of the Vocal Tract", JASA, vol. 25, pp. 734-742, 1953.
153. Stevens, K.N. and House, A.S., "Development of a Quantitative Description of Vowel Articulation", JASA, vol. 27, pp. 484-493, 1955.
154. Ito, M.R. and Donaldson, R.W., "Time Domain Measurements for Speech Analysis and Recognition", paper presented at the 79th Meeting of the Acoustical Soc. of Amer., Atlantic City, N.J., April, 1970, abstracted in JASA, vol. 48, p. 130, 1970.
155. Ito, M.R. and Donaldson, R.W., "Time Domain Analysis and Recognition of Speech Sounds", paper submitted for publication in the IEEE Trans. on Audio and Electroacoustics.
156. Susuki, J. and Nakatsui, M., "Formant Frequency Extraction by the Cancel Method", J. Radio Res. Labs, Japan, vol. 13, pp. 39-56, 1966.
157. Stevens, K.N., "The Quantal Nature of Speech: Evidence from Articulatory-Acoustic Data", in Human Communication: A Unified View, McGraw-Hill, New York, 1969.
158. Liberman, A., Ingeman, F., Lisker, L., Delattre, P. and Cooper, F.S., "Minimal Rules for Synthesizing Speech", JASA, vol. 31, pp. 1490-1499, 1959.
159. Mattingly, I.G., "Experimental Methods for Speech Synthesis by Rule", IEEE Trans. on Audio and Electroacoustics, vol. AU-16, pp. 198-202, 1968.
160. Mattingly, I.G., "Synthesis by Rule as a Tool for Phonological Research", Proc. 43rd Meeting of Linguistic Soc. of America, New York, December, 1968.
161. Estes, S.E., Kerby, H.R., Maxey, H.D. and Walker, R.M., "Speech Synthesis from Stored Data", IBM J. Res. and Dev., vol. 8, pp. 2-12, 1964.
162. Rabiner, L.R., "A Model for Synthesizing Speech by Rule", IEEE Trans. on Audio and Electroacoustics, vol. AU-17, pp. 7-13, 1969.
163. Fischer-Jorgenson, E., "The Phonetic Basis for Identification of Phonemic Elements", JASA, vol. 24, pp. 611-617, 1952.
164. Peterson, G.E., "Automatic Speech Recognition Procedures", Language and Speech, vol. 4, pp. 200-219, 1961.
165. Wiren, J. and Stubbs, H.L., "Electronic Binary Selection System for Phoneme Classification", JASA, vol. 28, pp. 1082-1091, 1956.
166. Davis, K.H., Biddulph, R. and Balashek, S., "Automatic Recognition of Spoken Digits", JASA, vol. 24, pp. 637-642, 1952.
167. Denes, P. and Mathews, M.V., "Spoken Digit Recognition Using Time-Frequency Pattern Matching", JASA, vol. 32, pp. 1450-1455, 1960.
168. Denes, P. and Von Keller, T.G., "Articulatory Segmentation for Automatic Recognition of Speech", Proc. 6th Inter. Congress on Acoustics, Tokyo, Japan, pp. B143-B146, Aug. 1968.
169. Fatechand, R., "Machine Recognition of Spoken Words", Advances in Computers, vol. 1, edited by F.L. Alt, Academic Press Inc., New York, pp. 193-229, 1960.
170. Hill, D.R., "An ESOTerIC Approach to Some Problems in Automatic Speech Recognition", Inter. J. Man-Machine Studies, vol. 1, pp. 101-121, 1969.
171. Medress, M.F., "Computer Recognition of Single Syllable Words Spoken in Isolation", MIT Res. Lab. of Electronics, Cambridge, Mass., QPR 92, pp. 338-351, 1969.
172. Ohlendorf, R.C. and Coates, C.L., "Recognition of Spoken Digits Utilizing Sequential Patterns", Texas Univ., Austin, Texas, Rept. TR-57, 88 pp., 1968.
173. Hill, D.R., "Automatic Speech Recognition: A Problem for Machine Intelligence", Machine Intelligence, edited by N.L. Collins and D. Michie, Oliver and Boyd, Edinburgh and London, 1967.
174. Sakai, T. and Doshita, S., "The Automatic Speech Recognition System for Conversational Sound", IEEE Trans. on Elect. Comp., vol. EC-12, pp. 835-846, 1963.
175. Scarr, R.W., "Normalization and Adaptation of Speech Data for Automatic Speech Recognition", Inter. J. Man-Machine Studies, vol. 2, pp. 41-60, 1970.
176. Reddy, D.R. and Vincens, P.J., "A Procedure for the Segmentation of Connected Speech", J. Audio Eng. Soc., vol. 16, pp. 404-411, 1968.
177. Sitton, G.A., "Acoustic Segmentation of Speech", Inter. J. Man-Machine Studies, vol. 2, pp. 61-102, 1970.
178. Nagy, G., "State of the Art in Pattern Recognition", Proc. IEEE, vol. 56, pp. 836-862, 1968.
179. Tou, J.T., "Feature Extraction in Pattern Recognition", Pattern Recognition, vol. 1, Pergamon Press, Great Britain, pp. 3-11, 1968.
180. Wee, W.G., "A Survey of Pattern Recognition", Proc. of the Seventh IEEE Symp. on Adaptive Processes, Los Angeles, USA, pp. 2e1-2e13, December, 1968.
181. Fant, C.G.M., "On the Predictability of Formant Levels and Spectrum Envelopes from Formant Frequencies", For Roman Jakobson, 's Gravenhage: Mouton and Co, pp. 109-120, 1956.
182. Philipp, C., "Some Aspects of Vowel Recognition by the Waveform", Proc. Sixth Inter. Congress on Acoustics, Tokyo, Japan, pp. B139-B142, August 1968.
183. Wolf, J.J., "Acoustic Measurements for Speaker Recognition", MIT Res. Lab. of Elect., Cambridge, Mass., QPR-94, pp. 216-222, 1969.
184. Wolf, J.J., "Choice of Speaker Recognition Parameters", MIT Res. Lab. of Elect., Cambridge, Mass., QPR-97, pp. 125-132, 1970.
185. Miller, G.A. and Nicely, P.F., "An Analysis of Perceptual Confusions Among Some English Consonants", JASA, vol. 27, pp. 338-352, 1955.
186. Peterson, G.E., "The Speech Communication Process", Manual of Phonetics, B. Malmberg, Ed., North Holland Pub. Co., Netherlands, 1969.
