Open Collections: UBC Theses and Dissertations

Development of tests and preprocessing algorithms for evaluation and improvement of speech recognition units. Wasmeier, Hans, 1986.

DEVELOPMENT OF TESTS AND PREPROCESSING ALGORITHMS FOR EVALUATION AND IMPROVEMENT OF SPEECH RECOGNITION UNITS

by

HANS WASMEIER
B. Eng., Memorial University of Newfoundland, 1981

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
December 1986
© Hans Wasmeier, 1986

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at The University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada
V6T 1W5

Date: December 1986

ABSTRACT

This study considered the evaluation of commercially available isolated word, speaker dependent, speech recognition units, and preprocessing techniques which may be used for improving their performance. The problem was considered in three separate stages.

A series of tests were designed to exercise an isolated word, speaker dependent, speech recognition unit. These tests provided a sound basis for determining a given unit's strengths and weaknesses. This knowledge permits a more informed decision on the best recognition device for a given price range. As well, this knowledge may be used in the design of a robust vocabulary, and creation of guidelines for best performance.
The test vocabularies were based on the forty English phonemes identified by Rabiner and Schafer [28], and the test variations were representative of common variations which may be expected in normal use.

A digital archive system was implemented for storing the voice input of test subjects. This facility provided a data base for an investigation of preprocessing techniques. As well, it permits the testing of different speech recognition units with the same voice input, providing a platform for device comparison.

Several speech preprocessing and performance improvement techniques were then investigated. Specifically, two types of time normalization, the enhancement of low energy phonemes, and a change in training technique were investigated. These techniques permit a more accurate analysis of the failure mechanism of the speech recognition unit. They may also provide the basis for a speech preprocessor design which could be placed in front of a commercial speech recognition unit.

A commercially available speech recognition unit, the NEC SR100, was used as a measure of the effectiveness of the tests and of the improvements. Results of the study indicated that the designed tests and the preprocessing and performance improvement techniques investigated were useful in identifying the speech recognition unit's weaknesses. Also, depending on the economics of implementation, it was found that preprocessing may provide a cost effective solution to some of the recognition unit's shortcomings.

TABLE OF CONTENTS

ABSTRACT  ii
LIST OF TABLES  vi
LIST OF FIGURES  viii
ACKNOWLEDGEMENTS  ix
1. INTRODUCTION  1
  1.1. Outline of Thesis  3
2. THE SPEECH TESTS  5
  2.1. BACKGROUND ON DESIGN OF THE TESTS  5
  2.2. IMPLEMENTATION OF THE TESTS  13
  2.3. RESULTS  15
    2.3.1. General Comments  15
    2.3.2. Comments on Performance of the Various Tests  16
  2.4. INTERPRETATION OF THE RESULTS  19
    2.4.1. Evaluation of the NEC SR100  19
    2.4.2. Evaluation of the Speech Tests  21
3. ARCHIVAL OF THE TESTS  22
  3.1. HARDWARE CONFIGURATION  22
  3.2. IMPLEMENTATION  24
4. PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES  26
  4.1. TIME NORMALIZATION  26
    4.1.1. Synchronized Overlap-Add of Temporal Signals  27
    4.1.2. SOLA as Implemented in this Study  33
      4.1.2.1. Mode 1: Rectangular Windowing  35
      4.1.2.2. Mode 2: The Hamming Window  37
  4.2. THE USE OF TWO REFERENCE TEMPLATES  40
  4.3. NONLINEAR VOLUME NORMALIZATION  42
  4.4. COMMENTS ON THE TEST RESULTS  46
    4.4.1. Mode 0: Recorded Unaltered Data  47
    4.4.2. Mode 1: Time Normalization Using a Rectangular Window  49
    4.4.3. Mode 2: Time Normalization Using a Hamming Window  50
    4.4.4. Mode 3: Use of Two Reference Templates per Word  51
    4.4.5. Mode 4: Nonlinear Volume Normalization  52
5. CONCLUSIONS AND RECOMMENDATIONS  53
  5.1. THE TESTS  53
  5.2. THE RECORDING CONFIGURATION  56
  5.3. PREPROCESSING ALGORITHMS AND RECOGNITION RATE IMPROVEMENT TECHNIQUES  57
    5.3.1. Mode 1: Time Normalization Using a Rectangular Window  57
      5.3.1.1. Performance of the Algorithm  57
      5.3.1.2. Information Learned About the SR100 from Application of the Algorithm  57
      5.3.1.3. Applicability for Real-Time Implementation as a Preprocessing Technique  58
    5.3.2. Mode 2: Time Normalization Using a Hamming Window  58
      5.3.2.1. Performance of the Algorithm  58
      5.3.2.2. Information Learned About the SR100 from Application of the Algorithm  59
      5.3.2.3. Applicability for Real-Time Implementation as a Preprocessing Technique  59
    5.3.3. Mode 3: Use of Two Reference Templates Per Word  60
      5.3.3.1. Information Learned About the SR100 from Application of the Technique  60
      5.3.3.2. Applicability of the Technique as a Modification to the Training Method  60
    5.3.4. Mode 4: Nonlinear Volume Normalization  61
      5.3.4.1. Performance of the Algorithm  61
      5.3.4.2. Information Learned About the SR100 from Application of the Algorithm  62
      5.3.4.3. Applicability for Real-Time Implementation as a Preprocessing Technique  62
  5.4. PERFORMANCE OF THE NEC SR100 SPEECH RECOGNITION UNIT  62
    5.4.1. Rules for Best Operation of the NEC SR100  63
  5.5. FORMAT OF THE RESULTS  64
  5.6. AREAS FOR FURTHER STUDY  65
BIBLIOGRAPHY  68
APPENDIX 1: THEORY OF OPERATION OF THE NEC SR100  71
APPENDIX 2: RESULTS OF THE LIVE TESTS  75
APPENDIX 3: RESULTS OF THE RECORDED AND MODIFIED TESTS  87
APPENDIX 4: PRELIMINARY STUDY ON THE EFFECTS OF A NOISY ENVIRONMENT ON RECOGNITION PERFORMANCE  113
  1. Recording Configuration  115
  2. Comments on the Test Results  117
APPENDIX 5: AN EXAMPLE OF RULES FOR SELECTION OF A ROBUST VOCABULARY  119

List of Tables

Table 2.1: Vowel and Diphthong Phonemes of the English Language and the Associated Test Vocabulary  7
Table 2.2: Semi-Vowel and Consonant Phonemes of the English Language and the Associated Test Vocabulary  8
Table 5.1: Summary of the Test Results  54
Table A1.1: Frequency Response of the Front End Filters on the SR100  72
Table A2.1: Statistics for Test 1 - Vowels Said Normally  76
Table A2.2: Statistics for Test 2 - Vowels With Mike Moved  76
Table A2.3: Statistics for Test 3 - Vowels Said Slowly  76
Table A2.4: Statistics for Test 4 - Vowels Said Quickly  77
Table A2.5: Statistics for Test 5 - Vowels With Interrogative Intonation  77
Table A2.6: Statistics for Test 6 - Consonants (Group 1)  77
Table A2.7: Statistics for Test 7 - Consonants (Group 2)  78
Table A2.8: Confusion Matrix for Test 1 - Vowels Said Normally  79
Table A2.9: Confusion Matrix for Test 2 - Vowels With Mike Moved  79
Table A2.10: Confusion Matrix for Test 3 - Vowels Said Slowly  80
Table A2.11: Confusion Matrix for Test 4 - Vowels Said Quickly  80
Table A2.12: Confusion Matrix for Test 5 - Vowels With Interrogative Intonation  81
Table A2.13: Confusion Matrix for Tests 1 Through 5 Combined  81
Table A2.14: Confusion Matrix for Test 6 - Consonants (Group 1)  82
Table A2.15: Confusion Matrix for Test 7 - Consonants (Group 2)  83
Table A2.16: Confusion Matrix for Tests 6 and 7 Combined  84
Table A2.17: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels  85
Table A2.18: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants  86
Table A3.1: Statistics for Test 1 - Vowels Said Normally  88
Table A3.2: Statistics for Test 2 - Vowels With Mike Moved  89
Table A3.3: Statistics for Test 3 - Vowels Said Slowly  90
Table A3.4: Statistics for Test 4 - Vowels Said Quickly  91
Table A3.5: Statistics for Test 5 - Vowels With Interrogative Intonation  92
Table A3.6: Statistics for Test 6 - Consonants (Group 1)  93
Table A3.7: Statistics for Test 7 - Consonants (Group 2)  94
Table A3.8: Confusion Matrix for Tests 1 Through 5 Combined - Mode 0  95
Table A3.9: Confusion Matrix for Tests 1 Through 5 Combined - Mode 1  95
Table A3.10: Confusion Matrix for Tests 1 Through 5 Combined - Mode 2  96
Table A3.11: Confusion Matrix for Tests 1 Through 5 Combined - Mode 3  96
Table A3.12: Confusion Matrix for Tests 1 Through 5 Combined - Mode 4  97
Table A3.13: Confusion Matrix for Tests 6 and 7 Combined - Mode 0  98
Table A3.14: Confusion Matrix for Tests 6 and 7 Combined - Mode 1  99
Table A3.15: Confusion Matrix for Tests 6 and 7 Combined - Mode 2  100
Table A3.16: Confusion Matrix for Tests 6 and 7 Combined - Mode 3  101
Table A3.17: Confusion Matrix for Tests 6 and 7 Combined - Mode 4  102
Table A3.18: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 0  103
Table A3.19: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 1  104
Table A3.20: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 2  105
Table A3.21: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 3  106
Table A3.22: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 4  107
Table A3.23: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 0  108
Table A3.24: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 1  109
Table A3.25: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 2  110
Table A3.26: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 3  111
Table A3.27: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 4  112
Table A4.1: Results of Test 1, Mode 0 for RF with Noisy Background Data  116
Table A4.2: Results of Test 1, Mode 0 with Corruption by Noise Type 3  117

List of Figures

Figure 2.1: Basic Configuration of the NEC SR100  14
Figure 3.1: Configuration of the Recording Environment  23
Figure 4.1: Creation of a 2-D Function Using a Rectangular Window  28
Figure 4.2: Reconstruction of a 1-D Function By Arbitrary Placement  30
Figure 4.3: Reconstruction of a 1-D Function By Maximizing the Cross-Correlation  32
Figure 4.4: An Example of Distortion of a Periodic Function when the Hamming Window is Applied  38
Figure 4.5: An Example of Nonlinear Smoothing Using the Algorithm of Rabiner, Sambur and Schmidt  44
Figure A4.1: The Caterpillar D215 Knuckle Excavator  114

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Dr. M.R. Ito, for his support and guidance throughout the course of this work.
As well, I would like to thank Real Frenette for the software support and consultation he provided, and Stephen Wu for aid in the implementation of many of the algorithms and support programs utilized in this thesis. Also, I would like to express my gratitude to the volunteers for the speech tests.

Financial support was provided by Northern Telecom through a direct sponsorship to the author, and by the Natural Sciences and Engineering Research Council (NSERC), through both an Industrial Post-Graduate Scholarship awarded to the author and a Research grant to the Tele-operator Group in the Electrical Engineering Department at the University of British Columbia.

1. INTRODUCTION

The concept of speech for control of machines or, in general, speech input as a replacement for dextoral input, is by no means a new or small field. Many speech recognition units exist on the market, ranging in price from a few hundred dollars to beyond $30,000. It is reasonable to assume that their performances are as varied as their prices.

A problem exists, though, when attempting to select a unit for a given application. Each manufacturer attempts to sell his product, and naturally states that his product is the best value for the money. Test results are often inflated and/or are based on incomplete data, making comparisons between the performances of different units very difficult.

As well, recognition units aren't perfect. It follows, then, that each one must have certain weaknesses. For example, a given unit may not be able to discern well between the 'n' and the 'm' sound, or it may fail to recognize a word if the speaker changes the speed at which it is uttered. If one can identify these weaknesses, then it may be possible to reduce their effects. This might include educating the user on avoiding the particular phenomena, or selecting a vocabulary in which the weaker recognition capabilities are never solely relied upon. Also, if one knows the weaknesses of the recognition units being considered, then a more informed decision is possible for the selection of the best unit for a given application.

Another possibility exists as well. When a given unit's weaknesses are identified, it may be possible to design a speech preprocessor to compensate for them, improving the recognizer's performance. This scheme may provide the most cost effective solution for a given application.

This study addresses some of these questions for speaker dependent, isolated word, speech recognition units. Its purpose is threefold:
1. to design a series of tests which will exercise a speech recognition unit thoroughly enough to obtain a good idea of its strong and weak points.
2. to implement an archival system for digital storage of voice input, and to store input of various subjects performing the tests previously designed. This facility permits the testing of various recognition units with the same voice input, providing a sound basis for comparison between units.
3. to preprocess the speech input so as to improve the recognition rate of a speech recognition unit. This performs two functions. First, the manipulations point the way for possible preprocessing techniques to be implemented. Second, through manipulation of the speech input, one can more accurately determine the failure mechanism of a given speech recognition unit.

This study, then, is intended to provide a basis for comparative testing of speaker dependent, isolated word, speech recognition units, and is directed towards an accurate appraisal of each unit's weaknesses. As well, it is intended to provide the basis for possible algorithms for real-time preprocessing of voice input before presentation to a given speech recognition unit.
1.1. OUTLINE OF THESIS

Chapter 2 discusses the background and reasoning behind the design of the vocabularies, and the variations in enunciation utilized in the various tests. The tests were applied on a commercial speech recognition unit, the NEC SR100, and the results interpreted both with respect to the effectiveness of the tests and with respect to the performance of the speech recognition unit.

Chapter 3 describes the digital voice storage system designed for archiving of the voice test signals. The methodology used in recording the data is also discussed in this chapter.

Chapter 4 discusses the four different speech preprocessing and recognition improvement techniques investigated, describing the reason for the selection of each of the techniques, and providing a description of the specific algorithms utilized. This chapter also discusses the results of applying the various techniques to the stored voice data and replaying it into the NEC SR100.

Chapter 5 presents conclusions and recommendations with respect to the effectiveness of the tests, the recording environment, the preprocessing and recognition improvement techniques, and the performance of the NEC SR100. The preprocessing and recognition improvement techniques are evaluated in terms of their performance, the information learned about the NEC SR100, and their applicability for implementation as a method for improving the recognition rate in real-time.

The appendices contain a brief description of the theory of operation of the NEC SR100 (Appendix 1), the results of the live tests (Appendix 2), as well as the recognition results for the recorded and modified tests (Appendix 3). A brief introduction to the effects of a noisy environment is presented in Appendix 4, and Appendix 5 describes an example of the selection of a robust vocabulary based on the results of this study.

2. THE SPEECH TESTS

2.1. BACKGROUND ON DESIGN OF THE TESTS

Design of the speech recognition tests can be thought of as two separate subproblems. The first is the problem of selecting a vocabulary or vocabularies, and the second is the question of what variations in the input should be examined. Each of these problems will be addressed separately.

First, the question of vocabulary selection. Components of the English language can be considered on a number of different levels of resolution. On the coarsest level, one can consider the conveyance of ideas, thoughts, desires, and concepts to be a construct of language. Clearly this is the main purpose of language. These conveyances are constructed of one or more sentences, grouped in some logical order. Sentences are composed of phrases (e.g. subject, predicate, etc.), which, in turn, are composed of words.

On a finer scale, words are constructed by a concatenation of syllables, and these syllables are in turn constructed of one or more basic linguistic sounds, termed phonemes. These phonemes are defined by one or more letters of the alphabet, and the many rules which govern their interaction.

Automated interpretation of speech at all degrees of resolution is a very complex and much studied problem. This study is concerned with the most primitive constructs of the spoken language, and the ability of a given speech recognition unit to discern between similar constructs. The most basic construct of the spoken language, then, may be considered to be the phoneme.

The tests put forth in the following chapter are based on forty phonemes of the English language identified by Rabiner and Schafer [28]. These phonemes are listed in Tables 2.1 and 2.2. For the purposes of the tests, the phonemes are considered in two subgroups:
1. Vowels and diphthongs. For the sake of brevity, this subgroup will be referred to simply as vowels throughout this work.
2. Semi-vowels and consonants. Again for brevity, this subgroup will be referred to simply as consonants.

These two subgroups form the basis of the three test vocabularies. The reason for the formation of separate vocabularies for each of these groups is twofold. First, the two groups are phonetically quite different, and presumably a recognition device would have little difficulty discerning between representatives of the two groups. Second, the division makes creation of the test vocabularies more reasonable, and analysis of the results more manageable. Note that the labels 'vowels' and 'consonants' don't strictly refer to the letters in the alphabet. Some letters have more than one pronunciation, and many phonemes are determined by more than one letter.

To create the vocabulary, each phoneme was embedded in a carrier word consisting of two static phonemes, in addition to the phoneme under test. For the vowels, the test words are of the form 'consonant-vowel-consonant' (CVC), with the vowel being the non-static phoneme. The consonant-based vocabularies are of two forms:
1. 'consonant-vowel-consonant' (CVC), with the first consonant being the non-static phoneme.
2. 'vowel-consonant-vowel' (VCV), with the consonant again being the non-static phoneme, but its position now being in the centre of the word.
The three vocabularies generated are listed in Tables 2.1 and 2.2. The reason for creating two consonant-based vocabularies will be discussed in a moment.

Table 2.1: Vowel and Diphthong Phonemes of the English Language and the Associated Test Vocabulary

  Phoneme Type  Sub-Group  Ref. No.  Symbol  Example  Test Word
  Vowels        Front      0         IY      beat     beam
                           1         I       bit      bim
                           2         E       bet      bem
                           3         AE      bat      bam
                Mid        4         A       hot      bomb
                           5         ER      bird     berm
                           6         UH      but      bum
                           7         OW      bought   balm
                Back       8         OO      boot     boom
                           9         U       foot     buum
  Diphthongs               10        AI      buy      bime
                           11        OI      boy      boym
                           12        AU      how      baum
                           13        EI      bay      bame
                           14        OU      boat     boam
                           15        JU      you      bume

Table 2.2: Semi-Vowel and Consonant Phonemes of the English Language and the Associated Test Vocabulary

  Phoneme Type  Sub-Group              Ref. No.  Symbol  Example  Test Word  Test Word
                                                                  (Group 1)  (Group 2)
  Semi-Vowels   Liquids                0         W       wit      wem        awa
                                       1         L       let      lem        ala
                Glides                 2         R       rent     rem        ara
                                       3         Y       you      yem        aya
  Consonants    Nasals                 4         M       met      mem        ama
                                       5         N       net      nem        ana
                                       6         NG      sing     ngem       anga
                Stops (Voiced)         7         B       bet      bem        aba
                                       8         D       debt     dem        ada
                                       9         G       get      gem        aga
                Stops (Unvoiced)       10        P       pet      pem        apa
                                       11        T       ten      tem        ata
                                       12        K       kit      kem        aka
                Whisper                13        H       hat      hem        aha
                Affricates             14        DZH     judge    gem        adja
                                       15        TSH     church   chem       acha
                Fricatives (Voiced)    16        V       vat      vem        ava
                                       17        TH      that     them       a-the
                                       18        Z       zoo      zem        aza
                                       19        ZH      azure    jhem       ajha
                Fricatives (Unvoiced)  20        F       fat      fem        afa
                                       21        THE     thing    them       atha
                                       22        S       sat      sem        asa
                                       23        SH      shut     shem       asha

These vocabularies, then, consist of parametrically very similar words, basically forming rhyme/alliteration tests.
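The carrier-word construction described above can be sketched in code. This is an illustration only: the frames (a static B_M shell for vowels, the static "-em" rhyme for Group 1, and the static "a_a" shell for Group 2) are taken from Tables 2.1 and 2.2, the phoneme symbols follow those tables, and the function names are invented for the sketch; it is not the program used in this study.

```python
# Illustrative sketch of the carrier-word frames of Section 2.1.
# The phoneme under test is embedded between static phonemes, so every
# test word differs from its neighbours in exactly one phoneme.

VOWELS = ["IY", "I", "E", "AE", "A", "ER", "UH", "OW", "OO", "U"]   # Table 2.1 (subset)
CONSONANTS = ["W", "L", "R", "Y", "M", "N", "NG", "B", "D", "G"]    # Table 2.2 (subset)

def vowel_item(v):
    """CVC frame: static B and M around the vowel under test (e.g. B-IY-M, 'beam')."""
    return ("B", v, "M")

def consonant_item_group1(c):
    """Group 1 CVC frame: the test consonant leads the static '-em' rhyme."""
    return (c, "E", "M")

def consonant_item_group2(c):
    """Group 2 VCV frame: the test consonant sits between two static 'a' vowels."""
    return ("A", c, "A")

vowel_vocab = [vowel_item(v) for v in VOWELS]
group1_vocab = [consonant_item_group1(c) for c in CONSONANTS]
group2_vocab = [consonant_item_group2(c) for c in CONSONANTS]
```

Because only the embedded phoneme changes from word to word, any misrecognition within one of these vocabularies can be attributed to confusion between the test phonemes rather than to the carrier.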
Thus, any identification or misidentification by a speech recognition device should largely be a function of a given device's ability to discriminate between similar phonemes.

The second question, of which parameters to vary, is now addressed. The pronunciation of a given phoneme can vary in any number of ways and, as with the resolution of language, can be viewed on a number of different levels.

On a global level, the pronunciation of a given phoneme is simply affected by the phonemes surrounding it. That is, a phoneme is not a static phenomenon; it shares a transitional time with adjacent phonemes and is therefore affected by them. These variations are termed allophones.

The speaker to speaker variations are even more complex. Each speaker has certain unique characteristics by which one can identify his or her voice. These characteristics manifest themselves through variations in pronunciation of the various phonemes. They are determined by any number of factors, such as his or her vocal tract configuration (e.g. size of oral cavities) or upbringing (e.g. accents). The unique vocal features by which one may identify a given speaker are, of course, the basis of the study of automatic speaker verification.

Considering any individual speaker, he or she may vary the pronunciation of a given phoneme in any number of ways. For example, any number of the following may alter:
- speed of pronunciation (i.e. phoneme duration).
- pitch. The person could say the phoneme in an inquiring tone, an imperative tone, etc.
- volume.
- parameters of the vocal tract not under the control of the speaker. For example, changes due to a cold, smoking, etc.

In addition to changes within the phoneme, there is also the possibility of the addition of auxiliary sounds not part of the phoneme. For example, a person may add a 'click' at the beginning of a word, or a heavy exhalation at the termination of a word.

As well, the equipment used in registering the voice must be considered. Specifically, the type and placement of the microphone will affect the input.

As illustrated in the previous paragraphs, there are a great many possible variables which may be addressed. If the test size is to be reasonable, one must make some arbitrary decisions on limiting the parameters to be examined.

There are three important points to be kept in mind when determining which parameters are to be varied:
- The tests are being designed to exercise speaker dependent isolated word speech recognition units. This implies that the tests should not specifically be concerned with inter-speaker variations in speech, but rather intra-speaker variations.
- The tests must be easily controllable so that the results are consistent. Conclusions can then be made from them with a greater degree of confidence. This eliminates from the study all the parameters which neither the subject nor the tester can control.
- Given the limits of the previous points, the parameters to be studied should reflect those most commonly expected in normal use of a speech recognition unit.

Using these three criteria, the following tests were decided upon:
1. Vowels said normally. This provides an indication of a recognition unit's ability to discern between the vowel phonemes. As well, it provides a baseline for comparing the degradation of the recognition rate due to the variations in input.
2. Vowels with mike moved. The mike is positioned away from the mouth at somewhere other than the recommended distance, but still at a reasonable location. In this study, a Shure SM10 close talking mike was used, and its recommended distance from the mouth is one thumbwidth. The distance used for the mispositioning test was about 4 to 5 centimetres, or approximately three
In this study, a  and its recommended The  4 to 5 centimetres  distance  used  distance for  the  or approximately three  THE SPEECH TESTS / 12 fingerwidths. 3.  Vowels said slowly. This  test,  and  susceptability  the  test  that  follows,  determine  a  recognition  device's  to variations in duration of the input. Subjects are asked to  repeat the vowel vocabulary at approximately 2/3 the original speed. 4.  Vowels said quickly. Again  the  vowels are repeated,  but now  the  subjects are asked  to say  them at about 1.5 times the original speed. 5.  Vowels with interrogative intonation. The  vowel  vocabulary is  repeated  in a questioning  tone.  This  gives an  indication of the device's susceptability to changes in pitch. 6.  Consonants (Group 1). Group 1 of the consonants  are said normally. This gives an indication of  the device's ability to discern between the various consonants. 7.  Consonants (Group 2). Group 2 of the consonants over Group  are said normally. Any changes in performance  1 will be an indication of the device's preference  as  to the  position of the consonant.  These tests provide the basis for evaluation of a speech recognition device, and should be useful in identifying a device's weaknesses.  THE SPEECH TESTS / 13 2.2. I M P L E M E N T A T I O N  The  tests  speech  of the  OF T H E  previous  recognition  device;  TESTS  section the  were  used  NEC SR100  in evaluating  (The  theory  a medium priced  of  operation  for  the  SR100 is briefly discussed in Appendix 1).  The  tests were initially applied live, with the test subjects providing voice input  directly to the speech recognition unit (i.e. input was  fed directly into the  performed for two reasons. of  no storage was performed; the mike  speech recognition unit). 
The live testing  was  First, it was done to insure that neither redirection  the mike output, nor digitization of the data, nor any other aspect of the  archiving process  appreciably affected  the recognition results.  If any of this did  affect the results, it would manifest itself as a reduction in the recognition rate from the live results to the results using the archived voice data. Second, it was done  to  determine  how  interpretation  of  the  data  would be  affected  by  not  having direct access to the speech input. Presumably interpretation of the results should be more accurate if one can replay and/or alter the input.  The  SR100 is not a stand-alone  unit,  and is  controlled via an RS-232 port.  Through this port one can set the device to training or recognition mode, or can load or offload recognition templates to and from the host device. The SRIOO's basic configuration is shown in Figure 2.1.  Control  programs  were  written  so  that  the  SR100  could  be  accessed from  C-language programs on an HP9050 minicomputer. Programs were then written which:  THE SPEECH TESTS / 14  Mike  c  SR-100  RS-232C interface  Host o r dedicated unit  Figure 2.1: Basic Configuration of the NEC SR100  provided access to the training mode of the SR100, controlled loading and offloading of the reference templates and, administered the tests.  The  program which administered the tests required as input the initials of the  subject,  the test number to be executed, and the attempt number for that test.  Based on this input, the program prompted the subject with the word to be said (on the screen), and then recorded the recognition result of the SR100. The order in which the words were tested was pseudo-random (based on the input), as to eliminate  any dependency  on test order. The results were then stored for later  analysis.  A total of twenty attempts were recorded per word per test, consisting of four attempts per word per test by five test subjects. 
The subjects consisted of four males, The  one of whom was familiar with the SR100 (RF), and one female (MA).  attempts were recorded on two different occasions for each subject, the first  occasion also being the time at which the SR100 was trained.  THE SPEECH TESTS / 15 Although the magnitude of these tests was not sufficient to make any statistical conclusions, they were of sufficient size to base some general observations.  2.3.  RESULTS  2.3.1. General Comments The  results for the tests are tabulated in Appendix 2 and are presented  number of formats.  in a  In the first format, the statistics per test per person are  tabulated, as well as the cumulative average. This yields general information on the performance for each subject with respect to each of the tests. It provides an idea as to the effect of each of the variations in speech on the recognition rate.  The  second  phonemes  format  is  the  confusion matrix. This  or groups of phonemes  that presented  yields  information  as  to  the greatest difficulty for the  speech recognition unit, and the ones most commonly mistaken for one another.  The  results  consonants  of the  five  tests  for  the  phonemes  and the  two  tests  for  the  were combined, to form two confusion matrices. The reason for the  combination was to provide a larger data base on which to interpret the matrix results. This combination was considered valid, because the variations presented in the  tests  were  indicative  of  possible  scenarios  which  could  occur  while  the  recognition unit was in normal use.  Another way of analyzing which phonemes a speech recognition unit has difficulty  THE SPEECH TESTS / 16 discerning,  is  by  error  grouping.  The  groupings  are  based  on  the  confusion  matrices generated by combining the results from tests 1 through 5 (the vowel tests), and tests 6 and 7 (the consonant tests). 
They are generated by grouping the phonemes most commonly misrecognized, until each phoneme within a group has an error rate better than the required amount. As the required error rate is decreased, the groupings get larger. This indicates the decreased resolution of the unit as the maximum error limits become more stringent.

The described grouping performs two functions. First, as a method of analyzing the data, it yields an excellent pictorial representation of the phonemes the speech recognition unit has trouble discerning, indicating to what degree it can resolve groups of phonemes. Second, it may be used to determine rules of vocabulary selection for best operation. That is, based on the results of the groupings, one may design a vocabulary which attempts to have the best phonetic contrast, for a given unit, between words.

2.3.2. Comments on Performance of the Various Tests

The following are comments and observations on the performance of each of the tests and of the results. These comments, along with the results in Appendix 2, provide the basis for the conclusions presented in the next section.
1. Vowels at Normal Speed.
The words with the lowest recognition rate were:
a. #7 (balm) with nine incorrect.
b. #1 (bim) and #4 (bomb) with eight incorrect.
c. #8 (boom) with six incorrect.
Word #4 for DR and word #1 for EC weren't correctly recognized at all (four misses out of four attempts). This tends to indicate some inconsistency between these templates and the words later spoken.
MA (the female subject) had by far the highest number of rejected inputs. This may have been due to MA's high pitch (as compared to the pitch of the other subjects), but no subsequent tests were performed to verify this possibility.
The words with the highest recognition rates were #3 (bam) and #5 (berm), with rates of 95%. As well, RF had an overall rate of 93.75%.
2. Vowels with Mike Moved.
There was a 22% decrease in the recognition rate relative to the normal case. As well, the SR100 had difficulty in triggering on some of the words, particularly #8. This could have something to do with the energy distribution of the phoneme (see Appendix 1 for background on the SR100).
3. Vowels Said Slowly.
The results of this test varied widely, depending on the speed which the subject considered slow. JR's results were actually better than when the words were spoken normally. MA's results were the same. For DR, EC and RF, there was a drastic reduction in the recognition rate. These subjects were also the ones whose reduction in speed of speech seemed most pronounced.
4. Vowels Said Quickly.
There was an overall average reduction of 16% in the recognition rate for the 5 subjects as compared to the results for test 1. This ranged from 9 to 23% individually.
5. Vowels with Interrogative Intonation.
Again the results varied drastically for each subject. EC and RF had large reductions in their recognition rates (20 and 15% respectively). They also seemed to have the greatest change in pitch. MA and DR remained approximately the same, and JR had a drastic increase in the recognition rate (18%).
6. Consonants (Group 1).
The recognition rate was much worse than for the vowel based vocabulary. Dividing the words into their phonetic subgroups, only the semi-vowels were above the 50% recognition rate. In fact, the semi-vowels had a recognition rate of over 80% for four of the five subjects. The words with the worst recognition rate were:
a. #10 (pem) with fourteen incorrect.
b. #'s 5 (nem), 17 (them [sounds like 'that']), 20 (fem), and 21 (them [sounds like 'thing']).
Also, every subject had at least two words which weren't recognized at all in the recognition mode, again indicating that a bad template was registered. No pattern was apparent in these words.
7. Consonants (Group 2).
The purpose of this test was to reveal whether the recognition device could more readily discern between the consonant phonemes if they were located in the middle of the word. Again the results varied widely within the test group. Comparing the results to those for group 1, DR's recognition rate dropped 5%, EC's dropped 1%, JR's rose 9%, MA's rose 24% and RF's rose 25%. Four of the five subjects registered a recognition rate of over 80% for the semi-vowels.

2.4. INTERPRETATION OF THE RESULTS

As previously stated, no statistical conclusions may be made from such a small test group, but some general observations may be made. With this in mind, the results were analyzed using two different criteria. First, the data was interpreted with respect to an evaluation of the SR100. Second, the data was evaluated with respect to the design and implementation of the tests.

2.4.1. Evaluation of the NEC SR100

At 76%, the SR100's ability to discern between similar vowel phonemes is not extremely impressive. From the results for individual subjects, however, it's apparent that good scores are possible, presumably if one is consistent in enunciation of the input. From the results of tests 2 through 5, the recognition rate of the SR100 is affected to some degree by all the variations in speech examined, and especially by the slowing of speech input.

The SR100 is less able to discern consonants, achieving recognition rates of 52.5% for group 1 and 62.9% for group 2. This could be due to the lack of energy in a consonant sound, or perhaps to the SR100's bandlimited analysis of the input (see Appendix 1). Both of these possibilities are supported by the fact that the semi-vowels had a much better recognition rate than the remainder of the consonant vocabulary.
It is these phonemes within this vocabulary which have the highest energy in the lower frequency bands. The difference between the results of groups 1 and 2 indicates that the SR100 is better able to discern consonants within a word than at the beginning of a word. In either case, the recognition rates are not very good.

In considering the confusion matrices and error groupings, one would hope that the misrecognitions would be consistent. This would imply that the speech recognition unit had difficulty in discerning phonemes within certain sub-groups, but was able to separate these sub-groups from any remaining phonemes. Although this seemed to be the case for the vowels, it was not so for the consonants. In fact, it can be seen from the confusion matrices and error groupings of the consonants that their failure modes were not well behaved at all. That is, when a phoneme failed to be recognized, it did not tend to be misrecognized as any single other phoneme or small group of phonemes, but rather was misrecognized as any of a large number of other phonemes. This severely reduces a consonant's value for discerning between similar words when using the SR100.

Finally, it is obvious that the SR100 suffers in using only one training input as the reference template. No training averaging is performed, so there is no inherent compensation for a bad template.

2.4.2. Evaluation of the Speech Tests

The designed speech tests seem to address pertinent speech parameters, particularly with respect to the SR100. They did identify a number of problems with the SR100, flagging certain items to beware of. As well, they did indicate which phonemes are most commonly misrecognized. As a tool for comparison of various speech recognizers, however, the approach does have certain drawbacks. Some of the drawbacks are as follows:
1.
It is very difficult to control quantitative measures (e.g. pitch, word duration, etc.).
2. The subjects' normal pronunciation of a given word varied sufficiently that simply performing the same live tests on different speech recognition units would not provide adequate control for comparison tests.
3. Because the tests are applied live, one can't be sure of the specific reason a given word was not recognized correctly. That is, one doesn't have a clear idea of the failure mechanism, and this might lead to misinterpreted data.

The main problems, then, were that there was not enough control and consistency in the test input, and that one can't access the input to analyse the exact reason for misrecognition. These problems alone justified the implementation of digital recording of the speech input.

3. ARCHIVAL OF THE TESTS

3.1. HARDWARE CONFIGURATION

Certain basic criteria had to be met by the configuration to be used in the archival of the data. They were as follows:
1. It had to provide sufficient audio quality so that archival/retrieval didn't affect the performance of the speech recognition devices to be tested.
2. It had to have easy access to a mass storage device.
3. It had to have sufficient computing power so that speech preprocessing could be performed in a reasonable amount of time.
No single computer in the Electrical Engineering Department met all of the requirements, so a combination of computers was used. The HP9050 provided the computing power and access to the mass storage device, and a resident PDP-11 provided the Analog-to-Digital (A/D) and Digital-to-Analog (D/A) facilities. Programs were written so that the A/D and D/A facilities on the PDP, as well as data transfer from the PDP to the HP, could be controlled by C-language programs on the HP.

The complete hardware configuration is shown in Figure 3.1.
As in the live tests, the Shure SM10 close-talking microphone was used. The output from the microphone was fed into the microphone amplifier portion of a Scully 280 series high quality tape recorder (recorded signal-to-noise ratio of greater than 65 dB [unweighted]). This signal was then fed to a Krohnite model 3342 filter, set in lowpass RC mode (8 pole), with the cutoff frequency set at 9 kHz. The filtering was performed to remove the possibility of aliasing while sampling.

Figure 3.1: Configuration of the Recording Environment

The output of the filter was input to the A/D facilities of the PDP, which sampled at 20 kHz (its maximum rate).

For playback and monitoring of the input, the D/A output was fed into another channel of the Krohnite filter. This channel was also set to 9 kHz lowpass mode. The filter output was used to drive the speech recognition unit (which was under simultaneous control from the HP9050), or headphones.

3.2. IMPLEMENTATION

The hardware was configured as described in 3.1, with the D/A monitor attached to the SR100. The SR100 acted as a validator for voice input, ensuring that voice input was present and that it was spoken loudly enough to trigger the SR100. This was done in order to better simulate the conditions of the original test.

The subject was presented, via a terminal connected to the HP, the vocabulary of each of the seven tests (see 2.1) in a pseudo-random order.
Depression of the 'break' button on the terminal initiated the recording of 1.5 sec. of voice data, during which time the subject was required to pronounce the word in the prescribed manner. If the input was acknowledged by the SR100 (i.e. the input was loud enough), and was not labelled an erroneous input for any reason, the file was transferred to the HP and stored on a hard disk for later transfer to magnetic tape.

A total of twenty-five attempts per word per test were recorded, consisting of five attempts per word per test by five subjects. This is one more attempt than in the live tests, but it included the attempt which was to be used as the training input. All but one of the subjects were the same as in the live tests. MA was unavailable for the recording sessions, and was replaced by HW (a male).

The attempts were recorded on either two or three different occasions, depending on availability of the subject. The files were replayed immediately after the session to ensure that no words were cut off due to the limited recording time. If any were cut off, they were immediately redone.

These files provided the basis for general analysis, and for an investigation into preprocessing algorithms.

4. PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES

As stated previously, improvement of the recognition rate serves two purposes. First, it provides better insight into the failure mechanism of a given recognition unit. Second, it provides the basis for a speech preprocessor which may be placed in front of a commercial recognition unit, perhaps providing a cost effective solution for a given application.

In this chapter, the following modifications are discussed: time normalization; the use of two reference templates; and nonlinear volume normalization.
Comments on the test results, tabulated in Appendix 3, are made at the end of this chapter and are further interpreted in Chapter 5.

4.1. TIME NORMALIZATION

The objective of the time normalization implemented in this study was to make all input words (both the training and recognition words) of the same duration: .75 seconds. The reason for this was that, although the SR100 did perform dynamic time warping internally, its performance was limited. This limitation manifested itself in a large number of misrecognitions and rejections in tests 2 and 3. It was reasoned, then, that the recognition unit might perform better if all speech inputs were of the same duration. It was recognized that this implies the loss of a certain amount of discriminatory information pertaining to word duration. If one and two syllable words are used in a vocabulary (as is often the case), though, the discriminatory information lost may be negligible.

The algorithm used was a modified form of one presented by Roucus and Wilgus [31], which in turn was based on one by Griffin and Lim [12]. Griffin and Lim claimed that their algorithm appeared to be the best method at that time (1984). Roucus and Wilgus claimed that their algorithm provided at least as high quality as [12], and was computationally much simpler. Both claims were left unverified, but comments will later be made as to the quality of the algorithm implemented. The algorithm of [31] will now be presented.

4.1.1. Synchronized Overlap-Add of Temporal Signals

Consider a discrete time signal, y(n). Using a windowing function, w(m), of length L, one may define a doubly indexed (or 2-D) function, y_w(b,n), such that:

y_w(b,n) = w(b - n) y(n)    (4.1)

Now, let b be defined as:

b = m S_a    (4.2)

where m is the window number and S_a is the interval between the commencement of consecutive windows.
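The construction in equations 4.1 and 4.2 amounts to slicing y(n) into length-L windows whose start points are S_a samples apart. A minimal numpy sketch (the function name is illustrative, not from the thesis):

```python
import numpy as np

def windowed(y, w, S_a):
    """Build the doubly indexed function y_w(b, n) of Eqs. 4.1/4.2.

    Row m holds the m-th window of y, taken every S_a samples: for a
    symmetric w (rectangular, Hamming) this is w weighting the L samples
    of y starting at b = m * S_a.  Rows overlap whenever S_a < len(w).
    """
    L = len(w)
    n_windows = 1 + (len(y) - L) // S_a
    y_w = np.zeros((n_windows, L))
    for m in range(n_windows):
        b = m * S_a                 # Eq. 4.2: b = m * S_a
        y_w[m] = w * y[b:b + L]     # Eq. 4.1: w(b - n) y(n) on its support
    return y_w
```

With S_a smaller than L, consecutive rows share samples, which is the overlap exploited by the reconstruction that follows.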
If S_a is less than the window length L, then consecutive windows overlap (i.e. any given point in y(n) is represented in more than one window). See Figure 4.1 for an illustration of the windowing process.

Figure 4.1: The Windowing Process (y(n) and the windows y_w(m_1 S_a, n) through y_w((m_1 + 3) S_a, n))

If each window of y_w is then shifted such that the interval between the commencement of consecutive windows is S_s points, and the result is resummed to form a singly indexed function, the resulting function would have a new duration of S_s/S_a times the original function. Thus, given a desired change in duration, one may arbitrarily select either S_s or S_a and calculate the value of the other required to attain this change. This is the basic principle of overlap-add modification of a signal.

Naturally, one can't simply sum the shifted windows of y_w and hope to attain reasonable quality speech. This would destroy or alter many of the characteristics of the speech input other than its duration (see Figure 4.2). Griffin and Lim used an iterative technique which was based on minimizing the distance between the Fourier transform of the resulting signal, X(m S_s, w), and that of the original input, Y(m S_a, w). They claimed to achieve excellent results, but only after 100 iterations of their algorithm.

Roucus and Wilgus argued that arbitrarily positioning consecutive windowed functions S_s points apart ignored the fact that most speech is periodic, and that this periodicity is altered by the arbitrary placement (illustrated in Figure 4.2).
They reasoned that if the initial placement of consecutive windows is not restricted to exactly every S_s points, but rather is allowed to fall within a range of values, with the exact point of placement being determined by that which best matches the periodic waveforms of consecutive windows, then few or no iterations would be required. The signal could then be reconstructed using equation 4.3.

x(n) = [ Σ_{m=-∞}^{∞} w(m S_s - n) y[n - m(S_s - S_a) - k(m)] ] / [ Σ_{m=-∞}^{∞} w²(m S_s - n) ]    (4.3)

Figure 4.2: Reconstruction of a 1-D Function by Arbitrary Placement (e.g. First Iteration of Algorithm by Griffin and Lim)

The variable k(m) is the deviation from the nominal value for the positioning of consecutive windows. Note that if k(m) = 0, this would be the same as equation 3 of the Least Squares Error Estimation (LSEE) described in Griffin and Lim. The shift k(m) is chosen to maximize the cross-correlation function R_{x x_w} between the signal reconstructed thus far and the next window to be added, defined by equation 4.4.

R_{x x_w}(k) = [ Σ_{n=m S_s}^{m S_s + L} x(n) x_w(m S_s, n+k) ] / [ ( Σ_{n=m S_s}^{m S_s + L} x²(n) ) ( Σ_{n=m S_s}^{m S_s + L} x_w²(m S_s, n+k) ) ]^{1/2}    (4.4)

where:
x(n) = the reconstructed signal,
x_w = the 2-D windowed function,
m = the window number,
L = the window length,
S_s = the new nominal repetition interval for consecutive windows.

This method, then, attempts to determine the best synchronization of consecutive windows, and is termed the synchronized overlap and add (SOLA) algorithm. An illustration of this reconstruction technique is shown in Figure 4.3.

The algorithm is formally specified as follows:
1.
Define a 2-D signal x_w consisting of the successive windows, taken every S_a samples, of the original signal, y(n):

x_w(m S_s, n) = w(m S_s - n) y[n - m(S_s - S_a)]    (4.5)

2. Initialize x(n) and c(n) using the first window of x_w:

x(n) = w(n) x_w(0,n)    (4.6)
c(n) = w²(n)    (4.7)

3. Do steps a and b for m = 1 to the total number of frames.
a. Maximize the cross-correlation of equation 4.4 with respect to k(m) between the signal constructed thus far and the next window to be added.
b. Extend the estimate by incorporating this window into x(n):

x(n) = x(n) + w(m S_s + k - n) x_w(m S_s, n + k)    (4.8)
c(n) = c(n) + w²(m S_s + k - n)    (4.9)

4. Normalize the waveform for all n:

x(n) = x(n)/c(n)    (4.10)

Figure 4.3: Reconstruction of a 1-D Function by Maximizing Cross-Correlation (e.g. First Iteration of Algorithm by Roucus and Wilgus)

4.1.2. SOLA as Implemented in this Study

Two variations of Roucus and Wilgus' algorithm were implemented in this study, labelled Modes 1 and 2. First, the features common to both modes will be discussed, and then each mode will be discussed separately.

The endpoints of the words were determined using an energy threshold algorithm, ensuring that the energy remained above a preset threshold for a minimum of 100 ms at the beginning of a word, and remained below the threshold for a minimum of 200 ms at the end of a word. The endpoints were also manually verified for words whose endpoints were for some reason (e.g. unusual duration, bad recognition, etc.) questionable. Of course, a more reliable endpoint detection algorithm would have to be developed if the time normalization algorithm were to be incorporated into a real-time preprocessor, but that was beyond the scope of this study.
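The energy threshold rule just described (100 ms of sustained energy to open a word, 200 ms of sustained quiet to close it) can be sketched as follows. This is a reconstruction, not the thesis code: the function name, frame size and threshold value are assumptions for illustration.

```python
import numpy as np

def find_endpoints(y, fs=20000, frame=256, threshold=1e-3):
    """Energy-threshold endpoint detection (a sketch of the rule in the text).

    A word opens once frame energy stays above `threshold` for >= 100 ms and
    closes at the first later point where it stays below it for >= 200 ms.
    Returns (start, end) in samples, or None if no word is found.
    """
    n = len(y) // frame
    above = [float(np.mean(y[i * frame:(i + 1) * frame] ** 2)) > threshold
             for i in range(n)]
    min_on = max(1, int(0.100 * fs / frame))    # ~100 ms of speech to open
    min_off = max(1, int(0.200 * fs / frame))   # ~200 ms of quiet to close
    start = next((i for i in range(n - min_on + 1) if all(above[i:i + min_on])),
                 None)
    if start is None:
        return None
    end = n * frame
    for i in range(start + min_on, n - min_off + 1):
        if not any(above[i:i + min_off]):       # a long enough quiet run
            end = i * frame
            break
    return start * frame, end
```

Requiring a sustained quiet run before closing the word is what lets short intra-word pauses (e.g. before a plosive) survive without truncating the word.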
Both modes utilized a window length of 512 points. The window length had to be of sufficient length to capture an adequate number of pitch periods, but not so long as to cover too many, thus losing temporal resolution.

The interval between the commencement of consecutive windows in the unnormalized function, S_a, was the parameter varied to determine the amount and type of time warping (e.g. compression or expansion) to be performed. It was calculated such that the nominal interval between the commencement of consecutive windows in the time normalized function, S_s, was 256 points, and that the total normalized duration was .75 seconds. This approach ensured that there was always a reasonable amount of overlap between consecutive windows for construction of the normalized function, x(n). It did, however, limit the maximum amount of time compression to 1/2. This presented no difficulty in this study, because the maximum compression required was 1/2 (normalized voice data had a duration of .75 sec. and the maximum unnormalized duration was 1.5 sec.).

The allowable range of k(m) had to cover at least one pitch period of the lowest voice expected. This was necessary so that consecutive windows were able to align correctly for all voice input, the lowest pitch being the limiting factor. Obviously, if the range is too restricted, then the periodicity within the windowed function will not always align with the periodicity of the signal constructed thus far. Too large a range would result in unnecessary computation. The range, therefore, was chosen to be ±120 points from the nominal position.
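Using the parameters above (L = 512, S_s = 256, k(m) within ±120 points), the SOLA reconstruction of equations 4.4 through 4.10 can be sketched in Python. This is a simplified illustration, not the thesis implementation: it uses a rectangular window throughout, places a poorly correlated window back at its nominal position (the .6 correlation threshold and the exact fallback rule appear in the mode descriptions that follow), and the function name is an assumption.

```python
import numpy as np

L, S_s, K = 512, 256, 120      # window length, output hop, k(m) search range
CORR_MIN = 0.6                 # below this, the window is not time-warped

def sola(y, target_len):
    """Synchronized overlap-add time normalization (a sketch of Eqs. 4.4-4.10)."""
    S_a = max(1, round(len(y) * S_s / target_len))  # input hop for the desired duration
    n_frames = (len(y) - L) // S_a
    x = np.zeros(target_len + L + K)                # Eq. 4.8 accumulator
    c = np.zeros_like(x)                            # Eq. 4.9 accumulator
    x[:L], c[:L] = y[:L], 1.0                       # Eqs. 4.6/4.7, rectangular w(n)
    for m in range(1, n_frames + 1):
        win = y[m * S_a : m * S_a + L]              # next window of the input
        best_k, best_r = 0, -1.0
        for k in range(-K, K + 1):                  # maximize Eq. 4.4 over k(m)
            seg = x[m * S_s + k : m * S_s + k + L]
            norm = np.sqrt(seg @ seg) * np.sqrt(win @ win)
            r = (seg @ win) / norm if norm > 0 else 0.0
            if r > best_r:
                best_k, best_r = k, r
        if best_r < CORR_MIN:                       # transient/unvoiced: no warping
            best_k = 0
        pos = m * S_s + best_k
        x[pos : pos + L] += win                     # Eq. 4.8 (w = 1 on its support)
        c[pos : pos + L] += 1.0                     # Eq. 4.9 (w^2 = 1)
    c[c == 0] = 1.0
    return (x / c)[:target_len]                     # Eq. 4.10
```

With the study's 20 kHz sampling rate, `sola(y, 15000)` would map any input word onto the .75 sec. (15000 sample) normalized duration.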
The method described in Roucus and Wilgus is based on the premise that there is always a certain amount of periodicity present in speech signals, and that construction of a time normalized signal is aided by the utilization of this periodicity. For the most part this is a valid assumption, but sometimes it does not apply. Certain vocal phenomena are transients (e.g. plosive sounds), while other phenomena have little or no periodicity (e.g. unvoiced phonemes). These phenomena simply do not benefit from positioning according to the correlation function, and in fact may be degraded as a result of it. The lack of periodicity manifests itself as a low maximum cross-correlation. In this study, if the maximum cross-correlation was below .6 for a given window, no time warping was performed and the window was allowed to return to its original position.

4.1.2.1. Mode 1: Rectangular Windowing

Mode 1 utilized a rectangular window for w(n). The strength of rectangular windowing is that it allows for integer arithmetic in the calculations, thus making computations much easier. This, in turn, lends itself more readily to real-time applications.

Mode 1 is formally defined as follows:
1. Find the endpoints of the word.
2. Given the desired total time of .75 sec. and S_s = 256 points, determine what S_a must be.
3. Define x_w using equation 4.5, where w(n) is a 512 point rectangular window.
4. Initialize x(n) and c(n) using equations 4.6 and 4.7.
5. Do steps a, b, c and d for m = 1 to the total number of frames.
a. If the original position of the window is within the allowable range of k(m), use this as the optimum position. It is reasonable to assume that this would yield the highest cross-correlation, pre-empting the requirement for computing the cross-correlation function R_{x x_w} from equation 4.4.
If this is the case, skip b and c.
b. Maximize the cross-correlation of equation 4.4 with respect to k(m) between the signal constructed thus far and the next window to be added.
c. If the maximum cross-correlation is less than .6, then presume that it is a bad match. That is, no periodicity is present, or the periodicity is different between the signal constructed thus far and the next window to be added. If this is the case, then allow the window to return to its original position, with its starting point being S_a ahead of the starting point of the previous window. This return in effect eliminates any time warping on the non-periodic and/or rapidly changing portions of a given word.
d. Extend the estimate by incorporating this window into x(n) using equations 4.8 and 4.9.
6. Normalize the waveform using equation 4.10.

4.1.2.2. Mode 2: The Hamming Window

In the course of the enunciation of a given word, the characteristics of the speech waveform are continually changing. Even the 'periodic' portions of the waveform aren't purely periodic. Consider the rectangular windowing function described in the previous section. When a given window x_w is aligned with the x(n) constructed thus far, the gross periodicity will be aligned, but the finer detail may not be. This can cause a small discontinuity in the waveform where a new window commences, manifesting itself as a 'crackle' in playback. Or it may cause a hollow quality to be introduced into the reconstructed voice data. This is caused by the summing of signals equal in magnitude, but slightly phase shifted.

If, however, a smoothing function such as a Hamming window is used before the matching process, then discontinuities where a new window commences should be decreased.
Also, the hollow quality shouldn't be as apparent, because at any given point one window is contributing the majority of the weighting to the result.

When implemented, however, this algorithm proved not to be so well behaved. What occurred was that, after the Hamming window was applied to y(n), the periodicity of the original function was sometimes adversely affected (see Figure 4.4 for an example). This led to incorrect synchronization with the periodic function present in x(n), and manifested itself as an 'echo' effect in the result.

Figure 4.4: An Example of Distortion of a Periodic Function When the Hamming Window is Applied

The phenomenon may be explained in another manner. Since both the tail of the function x(n) and x_w are weighted by the Hamming window, the auto-correlation of the Hamming window itself weights the resulting cross-correlation calculations.

As a result of these observations, the following technique was implemented. All alignment calculations were performed with a 2-D function utilizing a rectangular window, but construction of the data file to be output was performed using a Hamming window. This yielded excellent periodicity alignment while significantly reducing crackle and hollowness.

Mode 2 is formally defined as follows:
1. Find the endpoints of the word.
2. Given the desired total time of .75 sec. and S_s = 256 points, determine what S_a must be.
3. Define x_w using equation 4.5, where w(n) is a 512 point rectangular window.
4. Initialize x(n), x_h(n) and c_h(n) using equations 4.6, 4.11 and 4.12 (c(n) is not required).

x_h(n) = w_h(n) x_w(0,n)    (4.11)
c_h(n) = w_h(n)    (4.12)

where w_h(n) is a 512 point Hamming window.
5. Do steps a, b, c and d for m = 1 to the total number of frames.
a.
to total number of frames  If the original position of the window is within the allowable range of k(m), use this as the optimum position and skip b & c.  b.  Maximize cross-correlation  with  respect  to  k(m)  between  the  signal  PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES / 40 constructed thus far with the rectangular window, x(n), and the next window to be added using equation 4.4. If maximum cross-correlation is less than .6, then presume that it is  c.  a bad match. If this is the case, then allow the window to return to its original position, with its starting point being S  ahead of start a  point of the previous window. d.  Extend the estimate by incorporating this window into x(n), x^(n) and c^Cn) using equations 4.8, 4.13 and 4.14. x, (n)  =  x, (n)  +  w,(mS  c,(n)  =  c, (n)  +  w,(mS  h  n  6.  h  h  h  n  +  k - n) x (mS ,n + k)  (4.13)  +  k - n)  (4.14)  s s  w  s  Normalize the output waveform, x^(n), using equation 4.15. x (n)  (4.15)  = x (n)/c (n)  h  h  h  4.2. THE USE OF TWO REFERENCE TEMPLATES Although the  technique  for improving the  test results  discussed  in this  section  does not provide the basis for any preprocessing algorithms, it does aid in better defining the reason for misrecognition or failure for given words. As well, it may provide a training technique that will result in a better recognition rate.  This method for improving the recognition rate was prompted by the fact that occasionally, subsequent  a  bad  template  recognition mode  was  registered.  input was  In  that  situation,  none  of  the  correctly identified, implying that there  was some inconsistency between the input during training and during recognition. It  was  reasoned  that  if  two  separate  attempts  per  word  were  utilized for  training, then the chances of having a bad template would be severely reduced.  
As well, it was hoped that even the recognition rate of the words with good templates would be aided. It was thought that if two attempts per word were used for training, this would provide a broader base for comparison of recognition data.

Two options are possible for processing multiple training input:
1. One could average the training input, such that the single resulting template is the median of all training input.
2. One may have multiple templates, with each training word resulting in a template.
A number of problems exist with the first option. Averaging the templates presumes that one has access to these templates, and that they are in a format that may be averaged. In the case of the SR100, although the templates could be accessed for storage/retrieval purposes, the format was proprietary. This didn't prove to be insurmountable, but the templates consist of compressed data. Each vector within a given template represents a variable number of vectors in the uncompressed data. The number of vectors represented by each stored vector was unknown; only the total uncompressed length was known. This made averaging multiple templates very difficult.

The method chosen was to generate two templates for each word using two different input utterances. In the recognition mode, if the given speech recognition unit matched an incoming utterance to either of the two reference templates for a given word, the input was identified as that word. This was the simplest and most universal alternative (with respect to other speech recognition units). The only possible drawback of this technique is that the recognition unit must consider a larger number of templates.
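The chosen scheme (match against either of two templates per word) can be sketched generically. This is an illustration, not the SR100's internal matching: the function names are assumptions, and a plain Euclidean distance stands in for whatever proprietary distance the unit computes between an utterance and a template.

```python
import numpy as np

def recognize(utterance, templates, distance):
    """Two-reference-template recognition (a sketch of the method chosen):
    every word keeps one template per training utterance, and an input is
    labelled with the word owning the single closest template.

    `templates` maps word -> list of template vectors; `distance` is any
    dissimilarity measure (a real unit would compare feature frames).
    """
    best_word, best_d = None, np.inf
    for word, refs in templates.items():
        for ref in refs:                 # matching either template suffices
            d = distance(utterance, ref)
            if d < best_d:
                best_word, best_d = word, d
    return best_word

def euclid(a, b):
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
```

Because the winner is the closest template overall, a single bad template for a word no longer blocks recognition: the word's other template can still win.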
The technique of using two reference templates was applied to time normalized data (Mode 2) and the results were labelled Mode 3. Two attempts at each of the three vocabularies, said normally, were utilized.

4.3. NONLINEAR VOLUME NORMALIZATION

The intention of modes 1 and 2 was to make all word inputs the same duration. This included the training templates as well as the recognition input. One of the parameters available in the template data of the SR100 was the length of the training word input. It was through this data, for the templates of modes 1 and 2, that it became apparent the unit was not always registering the complete word. This was particularly true in the tests utilizing the group 1 consonant vocabulary, where the registered word was shorter than the actual input. As well, in the templates for the vocabulary utilizing group 2 of the consonants, silent periods were being declared in the centre of words where none were present. These two facts indicated that the recognition unit was overlooking certain low energy consonant phenomena.

This insight led to the implementation of a nonlinear volume normalization scheme. This scheme did not affect the higher energy phenomena, but did boost those with low energy. The specific phonemes to be addressed were the static low energy consonants. Conversely, those consonants with transients (e.g. plosives) were not to be affected.

In other words, the volume normalization algorithm had to be fairly stable, in that it had to leave short time changes in volume unaffected, as these are often pertinent in recognition of the phoneme. It had to, however, respond quickly to valid alterations in volume representing the enunciation of a new phoneme. This was necessary to ensure that no portion of a given phoneme was unnecessarily amplified or attenuated.
With these points in mind, the following scheme was devised. The total RMS energy, e(r), was calculated for consecutive 256 point windows (every 12.8 ms) of the input data, y(n). Before it was used to compute the amount of gain to be applied, this data was smoothed using a nonlinear scheme proposed by Rabiner, Sambur and Schmidt [29]. The smoothing algorithm consisted of a 3 point median smoother, followed by a three point Hanning smoother. The algorithm was well suited to this application in that it smoothed local transients and yet responded quickly to sharp discontinuities which were not transients. An example of the results of the smoothing algorithm is shown in Figure 4.5.

This smoothed energy file provided the basis for the nonlinear volume normalization. A factoring file f(n), of the same length as the data file, was created. The smoothed energy data was placed in f(n) every 256 points, starting at the midpoint of the first 256 point window. The remainder of f(n) was linearly interpolated from this data.

If f(n) was below f_min and above f_silence, then volume normalization was implemented. The magnitude was boosted by the square root of the ratio between f_min and f(n). The lower bound on the volume normalization algorithm, f_silence, was effectively a silence detector, which was included in case a true silent period was encountered (e.g. if a subject paused in the enunciation of a word). No volume normalization was applied below this threshold. Obviously, a more robust silence detector would be required if this technique were to be used in a preprocessor, but it was beyond the scope of this study to investigate this.

Energy Profile Before Smoothing

Figure 4.5: An Example of Nonlinear Smoothing Using the Algorithm of Rabiner, Sambur and Schmidt
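The median-plus-Hanning smoother cited from Rabiner, Sambur and Schmidt can be sketched as follows. The endpoint handling here (edge values passed through unchanged) is an assumption; the cited scheme's exact boundary treatment is not reproduced.

```python
# Sketch of the 3-point median smoother followed by a 3-point Hanning
# smoother (weights 0.25, 0.5, 0.25) applied to the energy contour e(r).
def smooth_energy(e):
    n = len(e)
    med = list(e)
    for r in range(1, n - 1):
        med[r] = sorted((e[r - 1], e[r], e[r + 1]))[1]   # 3-point median
    out = list(med)
    for r in range(1, n - 1):
        out[r] = 0.25 * med[r - 1] + 0.5 * med[r] + 0.25 * med[r + 1]
    return out
```

A single-sample spike is removed entirely by the median stage, while a genuine step in level passes through with only a short transition, which is exactly the behaviour described above.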
The algorithm was applied to data already time normalized (Mode 2), and two reference templates per word were utilized. This allowed the results for this mode, labelled Mode 4, to be compared to those for Mode 3.

The algorithm is formally defined as follows:
1. For r = 0 to 59, calculate the RMS energy, e(r), of the incoming file, y(n), according to equations 4.16 and 4.17.

   i = 256r                                                 (4.16)

   e(r) = SUM(n = i to i + 255) y^2(n)                      (4.17)

2. Filter e(r) using equations 4.18 and 4.19.

   e_s1(r) = median[ e(r-1), e(r), e(r+1) ]                 (4.18)

   e_s(r) = 0.25 e_s1(r-1) + 0.5 e_s1(r) + 0.25 e_s1(r+1)   (4.19)

3. Initialize the first 128 points of the factoring array, f(n).
   If f_silence < e_s(0) < f_min
   then f(n) = e_s(0), n = 0 to 127,
   else f(n) = f_min, n = 0 to 127.
   Repeat for e_s(59) and the last 128 points of f(n).

4. For n = 128 to (end of file - 128), calculate f(n) by interpolating between successive values of e_s(r) utilizing equations 4.20 to 4.22.

   r = INTEGER[ (n - 128)/256 ]                             (4.20)

   i = (n - 256r - 128)/256                                 (4.21)

   f(n) = e_s(r) + i [ e_s(r+1) - e_s(r) ]                  (4.22)

5. For n = 0 to end of data:
   If f_silence < f(n) < f_min
   then x(n) = x(n) * SQRT( f_min / f(n) )                  (4.23)
   else x(n) remains unchanged.

4.4. COMMENTS ON THE TEST RESULTS

Results of the preprocessing are tabulated in Appendix 3. The mode numbers correspond to the following:
0 : the original data.
1 : time normalized data utilizing the technique described in 4.2.2.1 (Rectangular Windowing).
2 : time normalized data utilizing the technique described in 4.2.2.2 (The Hamming Window).
3 : Mode 2 plus two training templates per word.
4 : Mode 3 plus nonlinear volume normalization, described in 4.4.

As with the live tests, the results are presented in a number of formats. Tables
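A compact sketch of steps 1 to 5 (equations 4.16 to 4.23) follows, using NumPy. The thresholds f_min and f_silence are free parameters with illustrative values only, and the edge handling of step 3 is approximated by constant extrapolation plus clamping out-of-range factors to f_min, which yields the same unity gain outside the boost range.

```python
# Sketch of the nonlinear volume normalization of equations 4.16-4.23.
# f_min and f_silence are illustrative assumptions, not thesis values.
import numpy as np

def median3(a):
    out = a.copy()
    out[1:-1] = np.median(np.stack([a[:-2], a[1:-1], a[2:]]), axis=0)
    return out

def volume_normalize(x, f_min, f_silence, win=256):
    n_win = len(x) // win
    # eqs 4.16-4.17: per-window energy
    e = np.array([np.sum(x[r * win:(r + 1) * win] ** 2) for r in range(n_win)])
    e_s1 = median3(e)                                            # eq 4.18
    e_s = e_s1.copy()
    e_s[1:-1] = 0.25 * e_s1[:-2] + 0.5 * e_s1[1:-1] + 0.25 * e_s1[2:]  # eq 4.19
    # factoring file f(n): window-midpoint samples, linearly interpolated
    mids = win // 2 + win * np.arange(n_win)
    f = np.interp(np.arange(len(x)), mids, e_s)
    # only energies between f_silence and f_min are boosted; elsewhere gain = 1
    f = np.where((f > f_silence) & (f < f_min), f, f_min)
    return x * np.sqrt(f_min / f)                                # eq 4.23
```

A quiet segment with smoothed energy e below f_min is boosted by SQRT(f_min/e), while loud segments and silence pass through unchanged.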
Tables  PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES / 47 A3.1  to A3.7 present  the  statistics per test per person for each of the  five  modes, providing an indication of the performance of each mode for each subject and test.  In Tables A3.8 to A3.17, the confusion matrices for tests 1 through 5 combined, and  tests  6  &  7  combined,  indication of the effect of each  are  presented  for  each  mode.  mode on which phonemes  This  gives  an  are most commonly  misrecognized.  Tables  A3.18  to  A3.27  provide error  grouping data,  described  in Chapter 2,  based on the confusion matrices of A3.8 to A3.17. These tables provide another method  of  viewing  those  groups  of  phonemes  that  are  most  commonly  misrecognized, and provides a possible basis for vocabulary design.  The  results of the live tests will first be compared with those of the unaltered  data, Mode 0. In subsequent  sections, the results of each of the modes will be  compared with those of the preceding mode. The intent is to comment on the effect of each mode on the recognition rate, the confusion matrix, and the error groupings.  4.4.1. Mode 0: Recorded Unaltered Data  Ideally, one would want the recording process to be completely transparent to the results. That is, it should not affect the recognition tests in any manner (altering speech parameters, adding noise within the analysis bandwidth or whatever).  PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES / 48 Results of the two types of tests (i.e. Live versus Recorded), however, will not be  exactly  occasions  the  would  same,  just  not be  as  the  results  same.  for live  A certain  tests amount  performed at  different  of variation may be  expected. As well, it was necessary to change one of the subjects; this will alter the total statistics somewhat. With these points in mind, the results of the two types of tests were compared.  
For each of the tests requiring normal enunciation of a given vocabulary (i.e. tests 1, 6 and 7), the average results of the live tests were very similar to those of Mode 0 (within 4%). The recognition rate for test 2 (vowels with mike moved) improved by 9% from the live to the recorded mode. The rates for tests 3 (vowels said slowly), 4 (vowels said quickly), and 5 (vowels with interrogative intonation) decreased by 25%, 17% and 9% respectively.

Although individual test results varied widely, it was clearly evident that tests 3 and 4 fared the worst in the use of recorded data. Because the other tests did not show the same degradation, it is postulated that the reduction in recognition rate was not due to the quality of recording. Rather, the degradation was attributed to what one may call the 'button' phenomenon. In the recording environment used, the subject pressed a button and then had 1.5 seconds to say the word. Although the time constraints were liberal, the mere presence of this constraint may have affected the duration of a subject's utterance. This was not confirmed, however, and the results for tests 3 and 4 performed live were quite low (35% and 60% respectively) in any case.

Comparing the confusion matrices and error groupings of the tests for the vowel vocabulary, one can see that the clustering was quite similar between the live and recorded data. The groupings did not occur in the same order for both the live and recorded tests, nor at the same error rates, but the trends were obviously similar.

The same was not completely true for the consonant tests. Although the general trends were somewhat similar, there were a number of differences. In the recorded tests, the semi-vowels (w, l, r, y) didn't perform as strongly as in the live tests, whereas the 'ch/sh' consonants (words 14, 15, 19 and 23) performed better.
Also, templates for words 17 and 18 accumulated more misrecognized input for the recorded data than for the live tests, and this affected the clustering in the error rate tables.

All of these results indicated that recording of the data may have had some effects on the quality of the speech, but those effects were not major.

4.4.2. Mode 1: Time Normalization Using a Rectangular Window

As was expected, the technique of time normalization markedly improved the results of tests 3 (vowels said slowly) and 4 (vowels said quickly), increasing the recognition rates by 74% and 42% respectively. As well, it improved the results of the other tests to varying degrees. The recognition rate of the tests utilizing normal intonation (1, 6 and 7), as well as test 2 (vowels with mike moved), increased by 4 to 6%, and the rate for test 5 (vowels said with interrogative intonation) increased by 14%.

The confusion matrices for the vowel tests indicate that not only was the recognition rate per word higher, but the amount of scatter was considerably less. ('Scatter' is the number of different words a given word is misrecognized as.) As well, it is apparent from the error groupings that some words benefitted more from the time normalization than others. In particular, the error rates for 11 (boym) and 12 (boam) decreased more than those of the other words. This altered the grouping clusters somewhat, leaving 11 and 12 ungrouped for the error rates considered. Note that although the remaining groupings were similar to those for Mode 0, the error rates at which they occurred were smaller.

The confusion matrices for the consonant tests indicate that the reduction in scatter was not nearly as dramatic as that for the vowel tests.
The groups which benefitted the most from time normalization were the semi-vowels (0, 1, 2, 3) and the nasals (4, 5, 6). The error groupings were a bit different from those of Mode 0, in that the nasals had improved enough to remain separate from the single super-group, as did the sub-groups 8 & 11 and 9 & 12. However, if one were to specify a maximum error rate per word of, say, 35%, all of these phonemes and sub-groups would fall into the one super-group. Also, the order of combination changed somewhat.

4.4.3. Mode 2: Time Normalization Using a Hamming Window

The statistics of Mode 2 and Mode 1 were quite similar. The average recognition rate differed by no more than 2% for any given test.

There were minor changes in the confusion matrices which changed the error groupings slightly. For the vowels, the recognition rate for word 6 improved enough to remain alone at a maximum error rate = 20% per word, and 8 & 9 did the same as a group for a maximum error rate of 15% per word. For the consonants, the groupings turned out slightly differently. The only major differences were that words 15 & 22 joined the super-group at a maximum error rate = 45% and 50% per word respectively.

4.4.4. Mode 3: Use of Two Reference Templates per Word

This technique improved the recognition rates for all of the tests, but the improvement for tests 1, 3 and 4 was only marginal (0.2%, 1% and 2% respectively). Tests 2, 5, 6 and 7 improved by 7%, 6%, 7% and 10% respectively.

For the vowel tests, the confusion matrix and error grouping altered slightly. Words 1 and 2 remained ungrouped at the error rates considered. As well, words 8 & 9 remained a separate subgroup for these error rates.
For the consonants, the amount of scatter in the confusion matrix was reduced considerably, particularly for certain words (4, 5 and 13 were the most improved with respect to scatter). The improvement was not universal, however, as some words did worse. Regarding the error grouping, the improvement of the recognition rates and the reduction in scatter forestalled the formation of the single super-group by about 5%. In the interim, smaller sub-groups formed at a maximum error = 45% per word. As with the other modes, the collapse into the super-group occurred in a slightly different order.

4.4.5. Mode 4: Nonlinear Volume Normalization

Mode 4 was targeted at the consonant tests, and improved the recognition rate for these tests by 4% (test 6) and 6% (test 7). However, it also improved the recognition rates for four of the five vowel tests, increasing the results for tests 1, 2 & 5 by 2%, and for test 3 by 5%.

For the vowels, though, this had little effect on the scatter within the confusion matrix and the error grouping. Word 6 remained on its own at a maximum error of 20% per word, whereas 14 was grouped with 8 & 9.

For the consonants, the reduction in scatter in the confusion matrices was again minimal, except for word 7 (whose scatter was reduced by 5). The overall improvement in recognition rate, however, again moved the grouping phenomena to a more stringent maximum error rate.

5. CONCLUSIONS AND RECOMMENDATIONS

In this section, conclusions and recommendations are made on the following:
- the tests and their ability to exercise a speech recognition unit.
- the recording configuration.
- each of the modes, both in terms of their effectiveness in achieving what was desired and in terms of the results, as well as comments on their compatibility with real-time applications.
- the performance of the NEC SR100.
- the format of the test results.
As well, recommendations are made as to topics for future areas of research. A summary of the test results is given in Table 5.1 for quick reference.

5.1. THE TESTS

The phonetic basis for the test vocabularies worked quite well, providing a logical method for identification of phonetic weaknesses with respect to a given speech recognition unit. Obviously, the vocabulary was not complete, in that it did not wholly address the question of allophones. (Allophones are variations of a phoneme according to the phonemes preceding and following it.) As well, it did not address the question of an extended phoneme set due to variations in pronunciation (e.g. rolling r's).

Addressing all possible variations would, however, yield a vocabulary of enormous size, and one must temper the selection of a vocabulary with the logistics of applying the actual tests (and with the amount of extra knowledge gained about the speech recognition unit by expansion of the tests).

Table 5.1: Summary of the Test Results (In Percent)

                                Mode
Test Number    Live      0      1      2      3      4
     1         76.6   80.6   86.6   88.1   88.3   90.0
     2         54.1   63.3   68.5   68.0   75.3   77.5
     3         34.7   10.0   83.8   84.0   84.8   89.5
     4         60.0   42.8   84.3   84.3   86.5   85.8
     5         73.4   64.3   78.0   76.0   82.0   84.5
     6         52.5   50.6   56.5   56.1   63.1   66.9
     7         62.9   63.1   67.5   65.6   75.0   81.1

Test Number:                               Mode Number:
1 = Vowels said normally                   Live = Unrecorded data
2 = Vowels with mike moved                 0 = Recorded unaltered data
3 = Vowels said slowly                     1 = Time normalized data using a rectangular window
4 = Vowels said quickly                    2 = Time normalized data using a Hamming window
5 = Vowels with interrogative intonation   3 = Mode 2 plus two training templates per word
6 = Consonants (Group 1)                   4 = Mode 3 plus nonlinear volume normalization
7 = Consonants (Group 2)

The same is true for the variations in pronunciation of the vocabulary.
The variations addressed in this study provided pertinent information in flagging the speech recognition unit's pitfalls, but as with the vocabulary, they were simply a small logical subgroup of all possible variations present in speech.

One problem with the test vocabulary as used was that a few of the subjects had difficulty pronouncing some of the nonsense words, particularly word 9 of the vowel vocabulary (buum) and word 6 of the two consonant vocabularies (the words exercising the consonant 'ng'). Either these words could be dropped from the test vocabulary, or one could attempt to train the subjects more thoroughly in pronunciation of the vocabulary. Alternatively, one could investigate other static phonemes for construction of the vocabulary.

As well, a problem existed with the variations in pronunciation. The variations were not controlled enough to determine that they were always the cause of failure in given tests. It was also not possible to determine the degree of any given variation tolerated by the speech recognition unit. The technique of recording the data and normalizing it by preprocessing aided in reducing the first of these problems, although it did not give a clear idea of the amount of variation tolerable. An alternative would be to utilize the algorithms presented here (and develop other ones) to artificially alter speech input with respect to a specific parameter in gradations. This would permit the determination of the exact point of failure for each recognition device.

Finally, although the test vocabulary and test variations identified the weaknesses of the NEC SR100 quite well, their merits with regard to providing a sound basis for comparison of different speech recognition units were not tested.

5.2.
THE RECORDING CONFIGURATION

The recording configuration provided excellent results, in that the recording quality was quite good, as was the computing power available for analyzing and preprocessing the resulting data. There were certain drawbacks, however.

First, as was pointed out in 4.4.1, the press of the button and the hard limit on the amount of time available in which to input an utterance may have affected the input itself. This could be eliminated, though, by making initialization of the digitization and the limits on digitizing time transparent to the subject. For example, some sort of triggering device could be utilized in conjunction with a temporary buffer to initialize digitization, ensuring the capture of the complete utterance. As well, the data files could be of variable lengths, so that the time limit would be eliminated.

A second problem was the mass storage device utilized for archiving the data (i.e. the tape drive on the HP9050). Data for a single subject occupied about 38 Mbytes of memory, or one 2400 ft. tape, and the access time for any file on this tape, regardless of its location, was 25 minutes. This retrieval time caused a number of problems while working on the preprocessing algorithms, and would be a limiting factor in the ease of testing future speech recognition devices. The solution, of course, would be to install a mass storage device with faster access.

5.3. PREPROCESSING ALGORITHMS AND RECOGNITION RATE IMPROVEMENT TECHNIQUES

5.3.1. Mode 1: Time Normalization Using a Rectangular Window

5.3.1.1. Performance of the Algorithm

The Synchronized Overlap and Add (SOLA) method of time normalization, utilizing a rectangular window, proved to be a very stable and effective method for altering the duration of an utterance. It performed extremely well and was not too computationally intensive.
As was previously stated, however, it did generate a bit of crackle and hollowness in the resulting output. Also, occasionally, when expansion of an utterance was required, the algorithm repeated phonetic transitions (for example, between the vowel and the following consonant). This was effectively perceived as a stutter-type sound.

5.3.1.2. Information Learned About the SR100 from Application of the Algorithm

Test results after the application of the Mode 1 algorithm prove that, without a doubt, the NEC SR100 has limited time warping capabilities. This limitation degrades its recognition performance, increasing both the misrecognition and rejection rates. The limited ability manifested itself not only in tests 3 and 4 (vowels said slowly and vowels said quickly), but also in all of the other tests. Particularly, in test 5 (vowels said with interrogative intonation), it became evident that a significant portion of the misrecognition and rejection was due to the change in duration of the utterance (not to the tonal change).

5.3.1.3. Applicability for Real-Time Implementation as a Preprocessing Technique

This algorithm is not too computationally demanding, in that it does not require any iteration or massive calculations. The only major calculation required is that of a cross-correlation function, which is easily within the realm of real-time execution.

The major problem with the implementation of this algorithm, as it stands at present, is that it requires input of the complete word before it can calculate the amount of time warping required. This would introduce a time delay of at least the duration of the utterance in recognition of the word. In many applications this may be unacceptable, so the algorithm could not be applied in its present format.

5.3.2. Mode 2: Time Normalization Using a Hamming Window

5.3.2.1.
Performance of the Algorithm

This modification of the SOLA method of time normalization performed quite well. It was as effective in altering the duration of an utterance as the algorithm used in Mode 1, but had no perceptible crackle or hollowness. It did, however, still have the occasional stutter-type sound. The source of this stutter should be investigated in future work if either the algorithm of Mode 1 or that of Mode 2 is to be utilized.

The increase in quality obtained in utilizing the Hamming window, however, did have a cost. Unlike the algorithm of Mode 1, it required the use of floating point arithmetic, making it a bit more computationally intensive.

5.3.2.2. Information Learned About the SR100 from Application of the Algorithm

From the Mode 2 results, it was apparent that the SR100 was not noticeably affected by either the crackle or the hollowness. This implies that the SR100's resolution, either in terms of time or in terms of frequency, was not fine enough to register the presence or absence of these perturbations.

5.3.2.3. Applicability for Real-Time Implementation as a Preprocessing Technique

The explicit computation required by this algorithm is much the same as that for Mode 1, with the exception that the preprocessor must be able to handle floating point data. Note that the cross-correlation calculations still utilize fixed point data; it is only the reconstruction which requires floating point data. Therefore, the algorithm may still be implemented in real-time, but the preprocessor must be somewhat more powerful (and therefore more expensive).

The algorithm has the same major drawback as that for Mode 1, in that it would introduce a delay of at least the duration of the utterance into the identification process.
In choosing between the algorithms of Modes 1 and 2, one must consider the effect on performance versus the cost of implementation. For example, if the preprocessor is to be used solely with the SR100, where no noticeable improvement in performance was obtained by implementing the more complicated algorithm, the choice is obvious. The results utilizing other speech recognition units may, of course, be different.

5.3.3. Mode 3: Use of Two Reference Templates Per Word

5.3.3.1. Information Learned About the SR100 from Application of the Technique

From the Mode 3 data, it can be seen that the use of a single training input per word for creation of the reference template hinders performance of the SR100, particularly in the case of the consonants. This supported the earlier postulation that anomalies in the training input adversely affected the recognition rate, and that a single example of an utterance may not provide as wide-ranging a basis for identification of input as would multiple training inputs.

5.3.3.2. Applicability of the Technique as a Modification to the Training Method

Changing the training technique for a speech recognition unit of similar technology to the SR100 would incur two costs: first, the training time would be doubled and, second, the allowable vocabulary size would be halved. If these factors are relatively unimportant, then the implementation of this change is simple.

One may attempt to extend this technique to include three or more templates per utterance, but presumably at some point the recognition rate no longer improves. In fact, improvement utilizing two templates was not universal for all subjects in all tests. This implies that increasing the number of templates may not always increase the recognition rate.
Some speech recognition units already require multiple utterances of a word in the training mode (e.g. Votan, Interstate and Threshold all have products requiring multiple training input). For these products, the technique proposed in 4.2 would not be implemented; rather, their own facilities would be utilized. One would presume that, through whatever internal averaging occurs within these machines, or however they utilize the multiple input, their recognition rate is correspondingly improved, as it would be if this technique were implemented.

5.3.4. Mode 4: Nonlinear Volume Normalization

5.3.4.1. Performance of the Algorithm

The algorithm performed well, in that it increased the volume of the static low energy consonants in the utterances while having no noticeable effect on the high energy portions, the transients, or the silent periods. This behaviour was, of course, the objective of the algorithm.

The algorithm did, however, have one minor problem. Often, the last phoneme in a word attenuates naturally towards the end of the utterance. This slow attenuation was removed if volume normalization was active, yielding an abrupt termination to the utterance. This may have an adverse effect on some speech recognition units, although it did not appear to have one on the SR100.

5.3.4.2. Information Learned About the SR100 from Application of the Algorithm

From the results of Mode 4, it can be concluded that the SR100 did indeed miss information due to lack of volume in certain portions of utterances. The fact that the vowel vocabulary tests improved by increasing the volume of these phonemes implied that the trailing 'm' phoneme in these words sometimes played a role in the misrecognition (as these words often had a volume too low to register for the full duration of their enunciation).

5.3.4.3.
Applicability for Real-Time Implementation as a Preprocessing Technique

The algorithm would be very simple and straightforward to implement. However, in its present format, it would not be very resilient to noise: it would amplify noise wherever the noise was mistaken for part of a word. As such, more work on noise identification would be required before it could be implemented in a real application.

5.4. PERFORMANCE OF THE NEC SR100 SPEECH RECOGNITION UNIT

In the documentation for the NEC SR100, a recognition rate of more than 99% (typical) is claimed, but no test conditions are described. Obviously, the rate is dependent on the vocabulary, speech variations, etcetera. Based on the results of this study, however, one would tend to disbelieve that this sort of accuracy could be consistently achieved with any realistic vocabulary. This inconsistency highlights one of the major reasons for undertaking this study: the fact that manufacturers' claimed recognition rates may be inflated and/or based on questionable data.

Based on the results of the tests and subsequent preprocessing, one can see that the SR100 has a number of weaknesses:
- its ability to accommodate variations in utterance duration is limited.
- the recognition rates are susceptible to mike placement.
- the unit does not have fine temporal and/or frequency resolution. This may be viewed as a strength as well, as this feature makes it more immune to spurious input (e.g. crackle, etc.).
- it misses pertinent recognition data due to a high silence/voice triggering threshold.
- it has particular difficulty discerning consonants.

5.4.1. Rules for Best Operation of the NEC SR100

The following are a number of rules to aid in yielding the best performance from the SR100. They are based on the results of the tests and the preprocessing algorithms discussed in this thesis.
1.
if possible, make two training templates of each utterance.
2. try to say each word consistently, particularly in terms of duration.
3. enunciate well, making sure to say the phonemes in each word clearly.
4. be consistent in the placement of the microphone, checking it regularly to ensure that it hasn't moved.

In addition, an example of a method for selection of a robust vocabulary (one which will achieve the best recognition results for a given application) is given in Appendix 5.

5.5. FORMAT OF THE RESULTS

The statistics were quite informative in terms of the amount of change with respect to each test type and each mode. As well, the confusion matrices gave an excellent pictorial view of how a given word performed (in terms of which words it was misrecognized as, and how often). The groupings according to maximum error rate per word, however, left something to be desired. A number of reasons were responsible for their shortcomings.

One reason was that not all utterances failed in a consistent manner. In many cases for the consonants, if a word failed to match correctly, it was misrecognized as any number of other templates (i.e. there was a high degree of scatter); no strong preference for any other single template was observed. Also, many different words, when misrecognized, tended to get dumped to the same template (e.g. words 16, 17 and 18 in Table A3.13). Both of these situations made grouping based on reducing error rates somewhat questionable, as the groupings tended to interlink (forming one super-group).

Also, because of the interlinking, error groupings tended to be misleading. For example, in Mode 0 of the vowel tests (with the maximum error rate = 25% per word), word 8 was placed in the same group as 4, 6 and 7. This would tend to make one believe that either word 8 was commonly mistaken as 4, 6 and/or 7, or vice versa.
This, however, was not the case. Neither 4, 6 nor 7 was ever mistaken as word 8, and word 8 was mistaken as one of them only once. The problem was that word 8 was often misrecognized as word 9, and word 9 was often misrecognized as word 6; thus they were grouped together.

Lastly, the error grouping results were not very stable. Minor variations in the confusion matrix resulted in significant changes in the error groupings. This was particularly true with respect to the order in which phonemes were grouped as the required error rate was made more stringent.

An alternative to the concept of error groupings as an aid in the selection of a robust vocabulary would be to create some sort of restriction matrix. This would involve associating with each phoneme the other phonemes which may not be reliably discriminated from it. Each addition to the vocabulary, then, may be checked against existing members to determine if enough reliable differences are present for dependable discrimination.

5.6. AREAS FOR FURTHER STUDY

The time normalization algorithms as implemented in this study would cause a fair amount of delay in a real-time application. This could be reduced, however, if the algorithm were to address smaller subsections of an utterance. Specifically, the algorithm could address the question of normalization on a per-phoneme or per-phoneme-type (i.e. discerning only between consonants and vowels) basis. In addition to reducing delay, this would eliminate any problems associated with variations in the number of phonemes within each member of the vocabulary. Of course, the nontrivial subject of phoneme identification would then have to be addressed.

Also, before either the time normalization or volume normalization algorithms may actually be implemented in a real-time application, reliable word endpoint/silence detection algorithms must be investigated.
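As a point of reference for such an investigation, the crudest form of energy-gated endpoint detection can be sketched as follows. This is a hypothetical illustration in the spirit of classical energy-threshold schemes, not an algorithm from this thesis, and the threshold value is arbitrary:

```python
def find_endpoints(frames, threshold):
    """Locate the first and last frame whose short-time energy exceeds
    `threshold`: a deliberately crude energy-gate endpoint detector
    (a minimal sketch of the classical energy-threshold idea, not an
    algorithm from this thesis).

    Returns (start, end) frame indices, or None if no speech is found.
    """
    energies = [sum(s * s for s in frame) / len(frame) for frame in frames]
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None
    return voiced[0], voiced[-1]

# Three near-silent frames around one louder frame:
frames = [[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [0.0, 0.0]]
print(find_endpoints(frames, threshold=0.2))  # → (1, 1)
```

A practical detector would also have to consider zero-crossing rates and hangover rules for weak fricatives, which is precisely where the high triggering threshold of the SR100 loses data.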
On a closely related topic, a study of the effects of noise on recognition rates and on endpoint detection algorithms should be undertaken, as well as an investigation of possibilities for noise reduction. This will be a critical area of study, particularly if the speech recognition unit is destined for a noisy environment. Some preliminary results on the effects of noise on the performance of the SR100 are presented and discussed in Appendix 4.

As well, the possibility of normalizing other speech parameters should be investigated. For example, it may be possible to normalize speech input with respect to pitch.

Beyond all this, if the algorithms are to be implemented in real-time, the hardware required for their implementation must be designed.

Also, one possibility for improved comparative testing of various speech recognition devices is the implementation of a speech system which could vary some specific parameter of a voice input by a quantifiable amount. This variation could be used to determine a given unit's failure point with respect to the parameter being addressed, and the result could then be used as a measure for comparison. The algorithms discussed in this thesis could in fact be altered to perform this task for the parameters which they address.

As well, a more rigorous method for the selection of a robust vocabulary should be devised. For example, one based on a restriction matrix as described previously would be a more helpful and reliable tool in aiding the design of a robust vocabulary for a given speech recognition device and a given application.

Finally, the test methods and data should be applied to other speech recognition devices, to ascertain their ability to yield relative measures of performance.

APPENDIX 1: THEORY OF OPERATION OF THE NEC SR100

In this section, an overview of the theory of operation of the NEC SR100 speech recognition unit is given. The intention is simply to provide background on the type of unit used in these tests. If a more in-depth understanding of the machine is required, one should refer to the literature accompanying it.

The front end of the analysis system of the SR100 consists of a series of 16 bandpass and smoothing filters nonlinearly spaced from 200 to 5000 Hz. The frequency response of the filters is shown in Table A1.1. The outputs of the sixteen filters are sampled and digitized every 16 ms and are normalized with respect to the sum of the magnitudes of the outputs of all sixteen filters. These 16 values then form the basic vector used by the speech recognition unit.

There are two distinct modes in the SR100: training mode and recognition mode. Training mode will be described first. When in training mode, the vector series of the given training word is accumulated and data compression is performed on it.
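The framing and normalization just described can be sketched as follows. This is a hypothetical stand-in for the SR100's front end, not its actual firmware, and the filter magnitudes used in the example are invented:

```python
def make_frame_vector(filter_magnitudes):
    """Normalize the 16 band-pass filter outputs for one 16 ms frame.

    Each component is expressed relative to the sum of all 16
    magnitudes, so the vector describes spectral shape rather than
    absolute loudness (a hypothetical stand-in for the SR100's front
    end, not its actual code).
    """
    total = sum(filter_magnitudes)
    if total == 0:                     # silent frame: leave it all zeros
        return [0.0] * len(filter_magnitudes)
    return [m / total for m in filter_magnitudes]

# One invented frame of 16 filter outputs:
frame = [3, 1, 0, 2, 8, 4, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
vector = make_frame_vector(frame)
print(round(sum(vector), 6))  # → 1.0
```

Because the components are ratios, a louder repetition of the same word yields (ideally) the same vector series, which is what makes the subsequent template matching volume-insensitive.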
Compression is executed by simply representing from 2 to 7 consecutive vectors by the centre vector of the series. The exact grouping is determined by finding which one yields the least error per omitted vector. This calculation is performed recursively while vectors for the word are still incoming.

The resulting template, then, is typically compressed to approximately 1/3 of the uncompressed length. The original number of vectors (the uncompressed length) is stored in the machine, along with the error per unit vector due to compression, the remaining number of vectors and the remaining vectors themselves. Compression performs two functions: first, it reduces the amount of memory required to store the reference data; second, it reduces the amount of calculation to be performed when the matching process for recognition is executed. Note that the number of vectors each compressed vector represents is not stored in the template, only the total uncompressed length.

Table A1.1  Frequency Response of the Front End Filters on the SR100
[relative response of each of the 16 filters, tabulated at frequencies from 250 Hz to 5000 Hz; the entries are not legible in the scanned original]

When the SR100 is in recognition mode, the vectors of the incoming word are compared dynamically to the vectors of each of the reference templates while the remainder of the incoming word is being input. If the accumulated vector distance from any given template to the input becomes too great, then that template is no longer considered. The distance of each input vector is computed for a range of allowable reference vectors within each template, and the reference vector resulting in the smallest distance is selected. The range is determined by the uncompressed length of the template and by the position of the incoming vector. This effectively performs time warping.
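The centre-vector grouping described above can be sketched as follows. This is a simplified greedy stand-in, not the SR100's actual algorithm (which searches recursively for the grouping with the least error per omitted vector), and `max_err` is a hypothetical tuning value:

```python
def dist(a, b):
    """City-block distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def compress(series, max_group=7, max_err=0.1):
    """Collapse runs of up to `max_group` consecutive vectors into
    their centre vector, extending each run only while the average
    distance of the omitted vectors from the centre stays below
    `max_err` (a greedy sketch, not the SR100's recursive search)."""
    out, i, n = [], 0, len(series)
    while i < n:
        g = 1
        # Try to extend the current group while the omitted vectors
        # remain close to the group's centre vector.
        while g < max_group and i + g < n:
            centre = series[i + (g + 1) // 2]
            omitted = [v for k, v in enumerate(series[i:i + g + 1])
                       if k != (g + 1) // 2]
            if sum(dist(v, centre) for v in omitted) / len(omitted) > max_err:
                break
            g += 1
        out.append(series[i + g // 2])   # keep only the centre vector
        i += g
    return out

# Ten identical frames collapse to two centre vectors (runs of 7 and 3):
print(len(compress([[1.0, 2.0]] * 10)))  # → 2
```

On real speech, where successive frames differ, the achievable grouping is smaller, which is consistent with the roughly 3:1 compression quoted above.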
Note that if the distance calculations were to be performed using the uncompressed data, then to cover the same allowable range, the distance between each input vector and the omitted vectors of each reference template within that range would also have to be computed. This is where the computational saving mentioned in the previous paragraph is realized.

The total vector distance between the input word and each reference template is accumulated, and when the input word is completed, the total error is divided by the total number of vectors input, yielding a distance per unit vector for each template.

In order to reduce the error introduced by omitting reference vectors, the error per unit vector for each template is subtracted from its distance measure per unit vector. The result is then used for selecting between the various reference templates; the one with the smallest number is matched to the input word.

APPENDIX 2: RESULTS OF THE LIVE TESTS

Table A2.1  Statistics for Test 1: Vowels Said Normally

  Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR          82.812        10.938            6.250          3.125
  EC          76.562        21.875            1.562         12.500
  JR          64.062        29.688            6.250          6.250
  MA          65.625        10.938           23.438          0.000
  RF          93.750         6.250            0.000          3.125
  Average     76.562        15.938            7.500          5.000

Table A2.2  Statistics for Test 2: Vowels With Mike Moved

  Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR          62.500        20.312           17.187          4.688
  EC          53.125        26.562           20.312          7.812
  JR          42.188        28.125           29.688          7.812
  MA          39.063         4.688           56.250          0.000
  RF          73.438        20.312            6.250         10.938
  Average     54.062        20.000           25.938          6.250

Table A2.3  Statistics for Test 3: Vowels Said Slowly

  Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR          28.125        17.187           54.688          3.125
  EC           7.812         6.250           85.938          0.000
  JR          71.875         9.375           18.750          6.250
  MA          65.625         6.250           28.125          1.562
  RF           0.000         6.250           93.750          0.000
  Average     34.688         9.062           56.250          2.188

Table A2.4  Statistics for Test 4: Vowels Said Quickly

  Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR          73.438        25.000            1.562         10.938
  EC          53.125        45.312            1.562         15.625
  JR          45.312        35.937           18.750          4.688
  MA          46.875        20.312           32.812          3.125
  RF          81.250        18.750            0.000         10.938
  Average     60.000        29.062           10.938          9.062

Table A2.5  Statistics for Test 5: Vowels With Interrogative Intonation

  Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR          82.812        15.625            1.562          9.375
  EC          56.250        23.438           20.312          4.688
  JR          82.812        15.625            1.562          7.812
  MA          67.188        14.062           18.750          3.125
  RF          78.125        17.187            4.688         12.500
  Average     73.438        17.187            9.375          7.500

Table A2.6  Statistics for Test 6: Consonants (Group 1)

  Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR          62.500        33.333            4.167          9.375
  EC          59.375        40.625            0.000         16.667
  JR          52.083        47.917            0.000         11.458
  MA          38.542        41.667           19.792          7.292
  RF          50.000        50.000            0.000         15.625
  Average     52.500        42.708            4.792         12.083

Table A2.7  Statistics for Test 7: Consonants (Group 2)

  Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR          57.292        41.667            1.042         18.750
  EC          58.333        40.625            1.042         15.625
  JR          61.458        34.375            4.167          7.292
  MA          62.500        30.208            7.292          9.375
  RF          75.000        23.958            1.042         10.417
  Average     62.917        34.167            2.917         12.292

Tables A2.8 through A2.16 are confusion matrices (spoken word in the rows, recognized word in the columns, with an additional column for rejections); their entries are not legible in the scanned original:

  Table A2.8   Confusion Matrix for Test 1: Vowels Said Normally (All Subjects)
  Table A2.9   Confusion Matrix for Test 2: Vowels with Mike Moved (All Subjects)
  Table A2.10  Confusion Matrix for Test 3: Vowels Said Slowly (All Subjects)
  Table A2.11  Confusion Matrix for Test 4: Vowels Said Quickly (All Subjects)
  Table A2.12  Confusion Matrix for Test 5: Vowels with Interrogative Intonation (All Subjects)
  Table A2.13  Confusion Matrix for Tests 1 Through 5 Combined (All Subjects)
  Table A2.14  Confusion Matrix for Test 6: Consonants (Group 1) (All Subjects)
  Table A2.15  Confusion Matrix for Test 7: Consonants (Group 2) (All Subjects)
  Table A2.16  Confusion Matrix for Tests 6 and 7 Combined (All Subjects)

Table A2.17  Word Groupings According to Maximum Allowable Error Rate, Using Tests 1 through 5, for the Vowels
(each entry gives the words grouped together, followed by the group error rate in %)

  max. error per word = 25%
    0 (IY) : 18.00;  1 (I) : 16.00;  2 (E) : 11.00;  3 (AE) : 9.00;
    4 6 7 12 (A UH OW AU) : 4.50;  5 (ER) : 1.00;  8 9 (OO U) : 12.50;
    10 (AI) : 10.00;  11 (OI) : 8.00;  13 (EI) : 18.00;  14 (OU) : 17.00;
    15 (JU) : 5.00

  max. error per word = 20%
    0 : 18.00;  1 : 16.00;  2 : 11.00;  3 : 9.00;  4 6 7 12 : 4.50;  5 : 1.00;
    8 9 14 : 7.33;  10 : 10.00;  11 : 8.00;  13 : 18.00;  15 : 5.00

  max. error per word = 15%
    0 15 : 6.50;  1 2 13 : 7.67;  3 : 9.00;  4 6 7 8 9 12 14 : 2.43;  5 : 1.00;
    10 : 10.00;  11 : 8.00

Table A2.18  Word Groupings According to Maximum Allowable Error Rate, Using Tests 6 and 7, for the Consonants
(each entry gives the words grouped together, followed by the group error rate in %)

  max. error per word = 45%
    0 (W) : 5.00;  1 (L) : 27.50;  2 (R) : 12.50;  3 (Y) : 7.50;  4 (M) : 42.50;
    5 (N) : 42.50;  6 (NG) : 17.50;  7 10 16 17 20 21 (B P V TH F THE) : 16.25;
    8 (D) : 45.00;  9 (G) : 37.50;  11 (T) : 37.50;  12 (K) : 45.00;
    13 (H) : 30.00;  14 (DZH) : 42.50;  15 23 (TSH SH) : 30.00;  18 (Z) : 35.00;
    19 (ZH) : 40.00;  22 (S) : 40.00

  max. error per word = 40%
    0 : 5.00;  1 : 27.50;  2 : 12.50;  3 : 7.50;
    4 5 7 8 10 11 16 17 20 21 : 11.75;  6 : 17.50;  9 12 : 23.75;  13 : 30.00;
    14 15 23 : 13.33;  18 : 35.00;  19 : 40.00;  22 : 40.00

  max. error per word = 35%
    0 : 5.00;  1 : 27.50;  2 : 12.50;  3 : 7.50;
    4 5 7 8 10 11 16 17 20 21 : 11.75;  6 : 17.50;  9 12 : 23.75;  13 : 30.00;
    14 15 19 23 : 3.75;  18 22 : 15.00

APPENDIX 3: RESULTS OF THE RECORDED AND MODIFIED TESTS

Table A3.1  Statistics for Test 1: Vowels Said Normally

  Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR        0       79.688        12.500            7.812          4.688
  DR        1       81.250        15.625            3.125          6.250
  DR        2       81.250        15.625            3.125          4.688
  DR        3       83.333        16.667            0.000         12.500
  DR        4       85.417        14.583            0.000         10.417
  EC        0       64.062        31.250            4.688          9.375
  EC        1       85.938        12.500            1.562         12.500
  EC        2       87.500        10.938            1.562          9.375
  EC        3       81.250        16.667            2.083         14.583
  EC        4       87.500        10.417            2.083         10.417
  HW        0       92.188         6.250            1.562          4.688
  HW        1       92.188         7.812            0.000          4.688
  HW        2       93.750         6.250            0.000          3.125
  HW        3       93.750         6.250            0.000          4.167
  HW        4       97.917         2.083            0.000          0.000
  JR        0       79.688        20.312            0.000         18.750
  JR        1       82.812        17.187            0.000         12.500
  JR        2       85.938        14.062            0.000          7.812
  JR        3       93.750         6.250            0.000          2.083
  JR        4       91.667         8.333            0.000          6.250
  RF        0       87.500        12.500            0.000          6.250
  RF        1       90.625         7.812            1.562          4.688
  RF        2       92.188         6.250            1.562          4.688
  RF        3       89.583         8.333            2.083          6.250
  RF        4       87.500        10.417            2.083          4.167
  Average   0       80.625        16.562            2.812          8.750
  Average   1       86.562        12.188            1.250          8.125
  Average   2       88.125        10.625            1.250          5.938
  Average   3       88.333        10.833            0.833          7.917
  Average   4       90.000         9.167            0.833          6.250

Table A3.2  Statistics for Test 2: Vowels With Mike Moved

  Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR        0       85.000        12.500            2.500          6.250
  DR        1       86.250        11.250            2.500          6.250
  DR        2       81.250        15.000            3.750         10.000
  DR        3       82.500        13.750            3.750          8.750
  DR        4       91.250         8.750            0.000          8.750
  EC        0       43.750        32.500           23.750         12.500
  EC        1       62.500        23.750           13.750         13.750
  EC        2       62.500        22.500           15.000         11.250
  EC        3       71.250        17.500           11.250          6.250
  EC        4       68.750        21.250           10.000         10.000
  HW        0       47.500        13.750           38.750          5.000
  HW        1       47.500        12.500           40.000          6.250
  HW        2       48.750        11.250           40.000          3.750
  HW        3       53.750        13.750           32.500          7.500
  HW        4       55.000        15.000           30.000          8.750
  JR        0       67.500        16.250           16.250          8.750
  JR        1       63.750        16.250           20.000          8.750
  JR        2       66.250        13.750           20.000          6.250
  JR        3       85.000         8.750            6.250          5.000
  JR        4       88.750         8.750            2.500          5.000
  RF        0       72.500        21.250            6.250         15.000
  RF        1       82.500        13.750            3.750          7.500
  RF        2       81.250        13.750            5.000          7.500
  RF        3       83.750        16.250            0.000          8.750
  RF        4       83.750        16.250            0.000          6.250
  Average   0       63.250        19.250           17.500          9.500
  Average   1       68.500        15.500           16.000          8.500
  Average   2       68.000        15.250           16.750          7.750
  Average   3       75.250        14.000           10.750          7.250
  Average   4       77.500        14.000            8.500          7.750

Table A3.3  Statistics for Test 3: Vowels Said Slowly

  Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR        0       13.750         3.750           82.500          1.250
  DR        1       76.250         8.750           15.000          3.750
  DR        2       73.750        11.250           15.000          7.500
  DR        3       77.500        10.000           12.500          7.500
  DR        4       95.000         5.000            0.000          5.000
  EC        0        1.250         2.500           96.250          0.000
  EC        1       71.250        28.750            0.000         12.500
  EC        2       71.250        28.750            0.000         16.250
  EC        3       72.500        27.500            0.000         12.500
  EC        4       77.500        22.500            0.000          8.750
  HW        0        2.500        12.500           85.000          0.000
  HW        1       96.250         3.750            0.000          1.250
  HW        2       97.500         2.500            0.000          0.000
  HW        3       97.500         2.500            0.000          2.500
  HW        4       96.250         3.750            0.000          3.750
  JR        0       12.500        17.500           70.000          0.000
  JR        1       86.250        13.750            0.000         10.000
  JR        2       86.250        13.750            0.000         10.000
  JR        3       88.750        11.250            0.000         10.000
  JR        4       90.000        10.000            0.000          8.750
  RF        0       20.000         8.750           71.250          1.250
  RF        1       88.750        11.250            0.000          8.750
  RF        2       91.250         8.750            0.000          5.000
  RF        3       87.500        12.500            0.000          8.750
  RF        4       88.750        11.250            0.000          6.250
  Average   0       10.000         9.000           81.000          0.500
  Average   1       83.750        13.250            3.000          7.250
  Average   2       84.000        13.000            3.000          7.750
  Average   3       84.750        12.750            2.500          8.250
  Average   4       89.500        10.500            0.000          6.500

Table A3.4  Statistics for Test 4: Vowels Said Quickly

  Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR        0       41.250        27.500           31.250          2.500
  DR        1       91.250         8.750            0.000          6.250
  DR        2       90.000        10.000            0.000          8.750
  DR        3       91.250         8.750            0.000          6.250
  DR        4       91.250         8.750            0.000          6.250
  EC        0       38.750        60.000            1.250          7.500
  EC        1       68.750        31.250            0.000         12.500
  EC        2       68.750        31.250            0.000         15.000
  EC        3       68.750        31.250            0.000         16.250
  EC        4       70.000        30.000            0.000         12.500
  HW        0       23.750        53.750           22.500          3.750
  HW        1       85.000        15.000            0.000          5.000
  HW        2       83.750        16.250            0.000          6.250
  HW        3       92.500         7.500            0.000          5.000
  HW        4       92.500         7.500            0.000          7.500
  JR        0       55.000        33.750           11.250         11.250
  JR        1       83.750        16.250            0.000         11.250
  JR        2       87.500        12.500            0.000          7.500
  JR        3       90.000        10.000            0.000          5.000
  JR        4       86.250        13.750            0.000          8.750
  RF        0       55.000        40.000            5.000          7.500
  RF        1       92.500         7.500            0.000          6.250
  RF        2       91.250         8.750            0.000          6.250
  RF        3       90.000        10.000            0.000          6.250
  RF        4       88.750        11.250            0.000          7.500
  Average   0       42.750        43.000           14.250          6.500
  Average   1       84.250        15.750            0.000          8.250
  Average   2       84.250        15.750            0.000          8.750
  Average   3       86.500        13.500            0.000          7.750
  Average   4       85.750        14.250            0.000          8.500

Table A3.5  Statistics for Test 5: Vowels With Interrogative Intonation

  Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR        0       70.000        15.000           15.000         10.000
  DR        1       76.250         6.250           17.500          2.500
  DR        2       77.500         8.750           13.750          5.000
  DR        3       78.750        10.000           11.250          8.750
  DR        4       96.250         3.750            0.000          3.750
  EC        0       36.250        42.500           21.250         11.250
  EC        1       62.500        32.500            5.000         17.500
  EC        2       62.500        33.750            3.750         16.250
  EC        3       72.500        25.000            2.500         12.500
  EC        4       67.500        30.000            2.500         17.500
  HW        0       78.750        13.750            7.500          3.750
  HW        1       91.250         8.750            0.000          6.250
  HW        2       90.000        10.000            0.000          6.250
  HW        3       92.500         7.500            0.000          7.500
  HW        4       91.250         8.750            0.000          7.500
  JR        0       63.750        25.000           11.250          7.500
  JR        1       81.250        16.250            2.500         10.000
  JR        2       81.250        15.000            3.750          7.500
  JR        3       92.500         6.250            1.250          0.000
  JR        4       91.250         8.750            0.000          5.000
  RF        0       72.500        22.500            5.000         10.000
  RF        1       78.750        20.000            1.250         11.250
  RF        2       68.750        22.500            8.750         16.250
  RF        3       73.750        23.750            2.500          7.500
  RF        4       76.250        21.250            2.500          3.750
  Average   0       64.250        23.750           12.000          8.500
  Average   1       78.000        16.750            5.250          9.500
  Average   2       76.000        18.000            6.000         10.250
  Average   3       82.000        14.500            3.500          7.250
  Average   4       84.500        14.500            1.000          7.500

Table A3.6  Statistics for Test 6: Consonants (Group 1)

  Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR        0       45.833        50.000            4.167          8.333
  DR        1       62.500        36.458            1.042          9.375
  DR        2       61.458        37.500            1.042         11.458
  DR        3       56.944        41.667            1.389         26.389
  DR        4       69.444        30.556            0.000         19.444
  EC        0       37.500        55.208            7.292         12.500
  EC        1       50.000        44.792            5.208          6.250
  EC        2       46.875        47.917            5.208         15.625
  EC        3       48.611        48.611            2.778         15.278
  EC        4       52.778        43.056            4.167         18.056
  HW        0       56.250        43.750            0.000         21.875
  HW        1       66.667        31.250            2.083         16.667
  HW        2       65.625        33.333            1.042         17.708
  HW        3       72.222        27.778            0.000         18.056
  HW        4       79.167        20.833            0.000          8.333
  JR        0       59.375        40.625            0.000          9.375
  JR        1       57.292        41.667            1.042         14.583
  JR        2       58.333        40.625            1.042         17.708
  JR        3       76.389        23.611            0.000          6.944
  JR        4       76.389        22.222            1.389          6.944
  RF        0       54.167        45.833            0.000          9.375
  RF        1       45.833        51.042            3.125         15.625
  RF        2       47.917        50.000            2.083         15.625
  RF        3       61.111        38.889            0.000          9.722
  RF        4       56.944        43.056            0.000         15.278
  Average   0       50.625        47.083            2.292         12.292
  Average   1       56.458        41.042            2.500         12.500
  Average   2       56.042        41.875            2.083         15.625
  Average   3       63.056        36.111            0.833         15.278
  Average   4       66.944        31.944            1.111         13.611

Table A3.7  Statistics for Test 7: Consonants (Group 2)

  Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
  DR        0       72.917        27.083            0.000         14.583
  DR        1       71.875        28.125            0.000         13.542
  DR        2       73.958        26.042            0.000         15.625
  DR        3       86.111        13.889            0.000          5.556
  DR        4       84.722        15.278            0.000          8.333
  EC        0       42.708        55.208            2.083         19.792
  EC        1       65.625        34.375            0.000         14.583
  EC        2       54.167        45.833            0.000         23.958
  EC        3       68.056        31.944            0.000         12.500
  EC        4       83.333        16.667            0.000         11.111
  HW        0       77.083        22.917            0.000         18.750
  HW        1       77.083        22.917            0.000         12.500
  HW        2       76.042        23.958            0.000         14.583
  HW        3       83.333        16.667            0.000          9.722
  HW        4       87.500        12.500            0.000          8.333
  JR        0       77.083        21.875            1.042         15.625
  JR        1       80.208        18.750            1.042         12.500
  JR        2       80.208        18.750            1.042         13.542
  JR        3       83.333        16.667            0.000          8.333
  JR        4       90.278         9.722            0.000          8.333
  RF        0       45.833        54.167            0.000         18.750
  RF        1       42.708        57.292            0.000         12.500
  RF        2       43.750        55.208            1.042         15.625
  RF        3       54.167        45.833            0.000         22.222
  RF        4       59.722        40.278            0.000         13.889
  Average   0       63.125        36.250            0.625         17.500
  Average   1       67.500        32.292            0.208         13.125
  Average   2       65.625        33.958            0.417         16.042
  Average   3       75.000        25.000            0.000         11.667
  Average   4       81.111        18.889            0.000         10.000

Tables A3.8 and A3.9 are confusion matrices for Tests 1 through 5 combined (all subjects), for Mode 0 and Mode 1 respectively (spoken word in the rows, recognized word in the columns, with an additional column for rejections); their entries are not legible in the scanned original, which ends partway through Table A3.9.
Table A3.7 Statistics for Test 7: Consonants (Group 2)

Subject  Mode   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups

DR       0       72.917      27.083             0.000      14.583
DR       1       71.875      28.125             0.000      13.542
DR       2       73.958      26.042             0.000      15.625
DR       3       86.111      13.889             0.000       5.556
DR       4       84.722      15.278             0.000       8.333

EC       0       42.708      55.208             2.083      19.792
EC       1       65.625      34.375             0.000      14.583
EC       2       54.167      45.833             0.000      23.958
EC       3       68.056      31.944             0.000      12.500
EC       4       83.333      16.667             0.000      11.111

HW       0       77.083      22.917             0.000      18.750
HW       1       77.083      22.917             0.000      12.500
HW       2       76.042      23.958             0.000      14.583
HW       3       83.333      16.667             0.000       9.722
HW       4       87.500      12.500             0.000       8.333

JR       0       77.083      21.875             1.042      15.625
JR       1       80.208      18.750             1.042      12.500
JR       2       80.208      18.750             1.042      13.542
JR       3       83.333      16.667             0.000       8.333
JR       4       90.278       9.722             0.000       8.333

RF       0       45.833      54.167             0.000      18.750
RF       1       42.708      57.292             0.000      12.500
RF       2       43.750      55.208             1.042      15.625
RF       3       54.167      45.833             0.000      22.222
RF       4       59.722      40.278             0.000      13.889

Average  0       63.125      36.250             0.625      17.500
Average  1       67.500      32.292             0.208      13.125
Average  2       65.625      33.958             0.417      16.042
Average  3       75.000      25.000             0.000      11.667
Average  4       81.111      18.889             0.000      10.000

Table A3.8 Confusion Matrix for Tests 1 Through 5 Combined (All Subjects) Mode 0

Word # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  0  1  Was Recognized as Word # 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej  56 6 5 60 6 24 54 4 1 11 69 *  • • •  •  •  •  •  44 . . 73 10 . 24 . 1  1  •  1  .  1  .  1  .  .  • 1 • 3 • 5 1 7 16 14 1 5 3 • 1 3 2 68 3 6 3 1 2 1 1 1 10 10 41 1 • 48 27 2 12 • 9 65 3 1 8 • • 1 75 2 1 • 6 16 58 1 9 4 • 1 2 1 69  .  .  .  . .  4 7 2  . 10 2 4 11 7 3  .  . .  .  3 7  14 •  14  •  4  5  1  •  •  •  9 1  • •  6 1  4  62 *  •  3 1 . . . . . . 4 2 . . . . . 66  51 41 35 30 23 36 23 21 31 23 34 34 26 34 28 38

Table A3.9 Confusion Matrix for Tests 1 Through 5 Combined (All Subjects) Mode 1

Word #  0  1  Was Recognized as Word # 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej  0 91 1 1101 6 . 12 93 7 2 .10107 3 . . 80 4 5 4 6' 7 44 8 2 9 • 10 11 12 1 1 13 14 15 3  . 25 1 10 . 8  1 . 12 14  3  7  •  •  1 . 99 5 5 2 .
3 . 1 57 6 . 2 . 78 29 • 1 6 1 8 92 1 . . .109 4 • 7102 . 1 .107 . 2 1  3 9 3 7  •  •  3 12  1  2  6  3  .  •  •  .  6 •  1  93  1 7 . 3 . 6 . 11 . 2 . 4 L06 10  Table A3.10 Confusion Matrix for Tests 1 Through 5 Combined ( A l l Subjects) Mode 2 Word #  0 1' 2 3 4 5 6 7 8 9 10 11 12 13 14 15  0  1  Was Recognized as Word # 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej  89 1102 12 92 7 11106  6  .  2 23 . . . .  9 3 1 5  77 . 8 20 1 1 2 4 6 .115 2 4 8 . 97 4 • 2 1 1 1 2 43 . 59 6 8 1 , 80 27 • 4 . 8 3 1 4 3 9 91 • 6 . 3 1 • . 7 .108 4 . 11 8101 • 2 . 1 1 • 113 3 .100 . 12 13 1 5 • 1 95 . 5 2 L06 10  . .  3 2  .  . .  Table A3.11 Confusion Matrix for Tests 1 Through 5 Combined ( A l l Subjects) Mode 3 Word #  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  0  1  Was Recognized as Word # 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej 7 3  89 1104 4 95 9 11102  . .  74 . 6 25 .110 5 .101 3 28 . 1 72  1  1 1  .  1 1  2  4  2  •  1  4 7  1 5 2 8  4  1  . .  7 2  4  .  1  2 • • 76 33 2 7 84 • .104 4 • • 6106 • • • .110 •  4  .  3  5  1  .  4  •  1  4  .  2 17 . 6 .  2  .  3  . . . . . .  4 2 5 3  .  4  . . 89  Table A3.12 Confusion Matrix for Tests 1 Through 5 Combined ( A l l Subjects) Mode 4 Word #  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  0  1  Was Recognized as Word # 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej  99 1105 4 96 8 8105  6 2  1 1  9 4  . 81 . 8 16 3 2 1 3 1 .112 1 • • . 1 1 1 5 2 8 . 95 2 30 . 1 73 1 • 1 • 4 5 1 , 79 32 1 2 • 7 . 4 1 8 1 5 88 1 4 . • • .109 . 9105 2 . .113 . .105 4 . 3 7 1 1 94 • 5 .106  2  .  .  . .  .  .  1  . «  .  . . 2  . . 1 1 1  5 8  / 98 Table A3.13 Confusion Matrix for Tests 6 and 7 Combined ( A l l Subjects) Mode 0 Wrd # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  0  1  2  3  29 1 3 1 23 1 3 2 25  1  .  . •  .  •  6  Was Recognized as Word # 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej  •  •  1  •  •  •  •  •  •  1  •  1  •  •  •  5  5 4 1 4 4 2 5 4  4 2 4 1 1  •  •  •  •  •  •  . .  1 1  • • 1 • • 1 • • • • • 1 3 • • 7 • . 1 1 . 
• • • • • 23 5 1 3 • • 2 * « 1 « • • • • • 2 20 6 • 1 • 2 1 1 1 2 • , 7 1 3 24 • 1 • • • • • . 2 1 18 1 • 2 • • 4 3 . 3 1 . 1 1 21 4 1 1 1 1 1 3 . 2 • • 3 23 • 1 10 1 • . 1 4 2 • 21 3 . • 2 • 2 . 3 2 1 1 1 20 4 1 1 1 . 2 7 1 5 * 1 26 • 2 • • . 1 3 1 1 1 5 3 25 • • • • . 2 • . 1 1 • . 2 • 26 8 • • . 1 • . 2 • . 1 12 26 1 • • • • # • 3 • • • • • 12 7 4 . 8 4 1 2 3 1 1 3 17 3 . 2 5 t> • 1 • 2 • • • 1 • 1 3 25 • • 5 1 1 • • • • 1 4 • • • 31 . • 4 1 • 1 1 • 1 • • 3 . 24 7 2 • • 1 1 4 . 1 2 • 1 4 4 . 7 12 2 . 1 • • 1 • • 9 . 2 • 26 . 2 • • • • • • 1 6 • • • 7 . • • 24 2 • • • • • • • • •  .  .  5  •  . . 25  2 1  4  •  •  .  . . .  .  .  .  .  .  #  #  #  .  #  / 99 Table A3.14 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 1 Wrd #  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  0  1  2  3  4  5  6  37 2 1 25 1 2 * 31  3  «  •  •  •  Was Recognized as Word # 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej  •  •  «  •  29  «  •  •  •  «  1  •  •  •  •  6 25 2 1 2 25 3 3 3 30 •  •  •  1  1  • •  *  1  1  •  •  17  2 18  2 2 1 1 1  •  •  •  2 1  1 1  . .  1  •  •  •  •  9  •  •  #  •  *  •  •  •  1  •  7  •  •  •  •  •  •  •  •  1  •  •  •  1  •  «  «  •  •  1  1  «  •  3  4  •  •  •  3 3 1 1 1  3 2  1  •  •  1 5 5  1 3 2  6 1  1 2  # #  * •  •  •  2  1 2  17 5 3 2 19 5 3 28  2  1  •  2  * 2 . 1 7 . 1 1 32 1 2 2 1 1 1 17 4 1 » 24 3 3 2 6 5 22 * 2 3 2 1 2 1 1 29 . 2 . 1 1 . 1 24 7 10 26 •  1 1  •  3 3 2 6  1  1  •  2  1  •  •  1 2  1 3  1 1  1  •  1  1 2 5  •  «  •  8  • • •  •  2 1 1  5 3  •  • #  8 2  4 4  •  •  25 4 5 13  •  6  •  •  •  5 2 4 22  1 2  1 2  . 1 2  1 . .  
•  •  •  1 • 27  *  6 2  / 100  Table A3.15 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 2 Wrd  «  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  0  1  Was Recognized as Word # 2  3  35 • 1 1 26 1 2 1 30 29  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej  4  5  6  7  1  •  •  1  •  •  1  •  •  •  •  •  •  •  •  •  •  •  2  •  •  •  1  •  •  •  •  •  3 2 2 3 2 1 1 6  4 27 2 1 2 28 1 5 29 •  •  •  •  8  •  1 4 4  • 4 • • * • 1 1 1 1 1 1 1 1 1 2 1 « * 4 • 1 • « • • 2 4 18 2 2 1 1 • 5 1 16 1 1 8 • 3 1 3 . 4 27 • 1 4 . . 1 1 • 1 6 • • 21 2 • • » 1 3 4 3 • 3 • 1 20 2 2 3 2 3 2 • 6 3 22 1 5 2 1 1 1 2 1 30 • 1 1 1 1 • 1 1 1 27 7 2 1 • . . • • • 14 24 2 • . 17 5 3 • 11 2 • 1 1 20 7 2 5 • 3 • 3 29 • 1 « 3 1 . . 28 . • * • 4 4 2 2 2 2 2 23 4 3 • 1 1 2 3 2 • 8 10 5 1 • 1 1 21 2 8 • • • • • • 7 • • • 4 • • • 27 •  . 1 •  6 2  /  101  Table A3.16 Confusion Matrix for Tests 6 and 7 Combined ( A l l Subjects) Mode 3 Wrd #  0  0  28  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  4  •  1  •  •  •  •  •  21  «  •  1  1  2  •  •  •  •  •  22  •  27  6  Was Recognized as Word # 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej  2  3  5  7  1  . .  .  . 1  .  2 1 1  .  .  1 1 • •  •  •  •  •  •  1  25 2 . • 2 25 . • . 1 • 2 25 1 1 . 13 1 1 2 . 13 2 . 1 . 23  1 1  1  •  •  •  •  •  •  •  • •  2 1  2  *  1 1 2 2 ^ 2  9  *  9  •  4 2  2  1 4  1 1 3 4  • %  ^  m  •  •  #  •  •  •  •  •  «  •  •  •  •  •  •  ^  «  •  #  2 1  2 1 • • . » 1 1 • * • 2 • 1 • • 22 1 1 • • • 1 21 2 1 2 1 1 • • 8 . . 19 1 1 1 • • • . 1 3 24 . • 1 1 • • 1 24 4 • • 1 * 1 7 17 1 • 1 1 1 1 • 13 4 3 3 * • • 3 16 4 • 3 1 . 2 25 • « 2 . . 2 • • 23 • 1 • 1 1 20 4 l . 3 3 2 2 4 11 4 . . 1 6 • 1 1 20 . 4 5 * • . 20 • • • • • #  a  #  • • • • *  • • 4 •  #  #  #  #  •  . . . . 1 1  /  102  Table A3.17 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 4 Wrd # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  2  3  4  28 . 1 . 
24 30  •  •  0  1  •  •  22 1 . 26 •  .  5  6  Was Recognized as Word # 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej  •  •  1  «  •  •  1  1  •  •  •  •  m  •  •  •  •  •  •  m  •  1  *  •  #  # m  •  •  •  1  •  2  •  •  •  1  #  #  .  9  9  #  •  •  •  •  •  #  v  • • •  m  #  9  •  •  •  m  #  •  •  •  m  #  •  •  •  1 . 1 • • 2 • • « • 23 . 2 1 1 2 1 • • • 1 . 27 1 • • 1 • « « • 22 1 3 • 1 1 2 • • • 1 . 4 15 1 • 3 1 1 3 • • • • 1 25 • 3 1 • • • • 2 1 • 20 1 • 2 1 • 1 . . 1 1 23 2 2 • 1 . . 5 * 3 21 • 1 • • • 1 1 • 25 • 1 . . • 1 1 21 5 • • • 1 1 3 23 • • X 3 1 • 16 4 3 • • • 1 . 1 1 « 3 15 2 5 . . • • • • 1 27 1 • • • • • 1 24 • • • 5 1 1 1 • • 18 7 . . 2 2 1 1 2 • • 5 15 . . • • • 6 • 2 1 19 . « • 4 • • • 1 • . . 24 • • • • • •  .  •  •  * • •  . . •  . •  I •  . •  • . . 2 1  / 103 Table A3.18 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 f o r the Vowels Mode 0 max. error per word = 30% Grp #  words grouped  max. error per word = 25%  max. error per word =20%  group err.(%)  words grouped  group err.(%)  words grouped  group err.(%)  0  0 (IY)  10.83  0  10.83  0  10.83  1  1 (I)  15.83  1 2  8.33  1 2  8.33  2  2 (E)  25.83  3  17.50  3  17.50  3  3 (AE)  17.50  4 6 7 8 9 14  7.36  4 6 7 8 9 12 14  3.81  4  4 67 (A UH OW)  17.50  5  9.17  5  9.17  9.17  10  9.17  10 11  8.75  15.42  11  23.33  13  12.50  9.17  12  20.83  15  13.33  -  5  5 (ER)  6  8 9 (00 U)  7  10 (Al)  8  11 (01)  23.33  13  12.50  -  9  12 (AU)  20.83  15  13.33  -  10  13 (EI)  12.50  -  -  11  14 (0U)  25.00  -  -  12  15 (JU)  13.33  -  -  -  -  / 104 Table A3.19 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 f o r the Vowels Mode 1 max. error per word = 25% Grp #  words grouped  max. error per word =20%  max. 
error per word = 15%  group err.(%)  words grouped  group err.(%)  words grouped  group err.(%)  0  0 (IY)  3.33  0  3.33  0  3.33  1  1 (I)  7.50  1  7.50  1 2  4.17  2  2 (E)  15.83  2  15.83  3  8.33  3  3 (AE)  8.33  3  8.33  4 6 7 8 9 14  2.92  4  4 7 (A OW)  18.75  4 67  12.22  5  0.00  5  5 (ER)  0.00  6  6 (UH)  17.50  7  8 9 (00 U)  8  5  0.00  10  4.17  8 9  9.58  11  5.83  9.58  10  4.17  12  9.17  10 (Al)  4.17  11  5.83  13  6.67  9  11 (01)  5.83  12  9.17  15  3.33  10  12 (AU)  9.17  13  6.67  -  11  13 (EI)  6.67  14  19.17  -  12  14 (0U)  19.17  15  3.33  13  15 (JU)  3.33  -  -  -  Table A3.20 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 f o r the Vowels Mode 2 max. error per word = 25% Grp #  words grouped  group err.(%)  max. error per word =20% words grouped  group err.(%)  max. error per word = 15% words grouped  group err.(%)  0  0 (IY)  6.67  0  6.67  0  6.67  1  1 (I)  6.67  1  6.67  1 2  4.17  2  2 (E)  15.83  2  15.83  3  9.17  3  3 (AE)  9.17  3  9.17  4  4 7 (A OW)  5  5 (ER)  0.00  5  0.00  6  6 (UH)  19.17  6  7  8 9 (00 U)  9.17  8  10 (Al)  9  4 6 7 14  6.46  5  0.00  8 9  9.17  19.17  10  4.17  8 9  9.17  11  6.67  4.17  10  4.17  12  5.83  11 (01)  6.67  11  6.67  13  6.67  10  12 (AU)  5.83  12  5.83  15  3.33  11  13 (EI)  6.67  13  6.67  -  -  12  14 (OU)  16.67  14  16.67  -  -  13  15 (JU)  3.33  15  3.33  -  -  16.67  4 7  16.67  / 106 Table A3.21 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 f o r the Vowels Mode 3 max. error per word = 25% Grp #  words grouped  max. error per word = 20%  max. 
error per word = 15%  group err.(%)  words grouped  group err.(%)  words grouped  group err.(%)  0  0 (IY)  7.83  0  7.83  0  7.83  1  1 (I)  4.35  1  4.35  1  4.35  2  2 (E)  11.30  2  11.30  2  11.30  3  3 (AE)  9.57  3  9.57  3  9.57  4  4 7 (A OW)  4 6 7 8 9 14  3.48  5  5 (ER)  1.74  5  1.74  5  1.74  6  6 (UH)  12.17  6  12.17  10  5.22  7  8 9 (00 U)  10.43  8 9 14  8.99  11  5.22  8  10 (Al)  5.22  10  5.22  12  4.35  9  11 (01)  5.22  11  5.22  13  2.61  10  12 (AU)  4.35  12  4.35  15  0.87  11  13 (EI)  2.61  13  2.61  -  -  12  14 (0U)  19.13  15  0.87  -  -  13  15 (JU)  0.87  -  -  -  -  13.48  4 7  13.48  / 107 Table A3.22 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 for the Vowels Mode 4 max. error per word =25% Grp #  words grouped  group err.(%)  max. error per word = 20% words grouped  group err.(%)  max. error per word = 15% words grouped  group err.(%)  0  0 (IY)  6.09  0  6.09  0  6.09  1  1 (I)  5.22  1  5.22  1  5.22  2  2 (E)  10.43  2  10.43  2  10.43  3  3 (AE)  6.96  3  6.96  3  6.96  4  4 7 (A OW)  4 6 7 8 9 14  3.62  5  5 (ER)  0.87  0.87  5  0.87  6  6 (UH)  17.39  6 89  10.14  10  4.35  7  8 9 (00 U)  10.87  10  4.35  11  7.83  8  10 (Al)  4.35  11  7.83  12  1.74  9  11 (01)  7.83  12  1.74  13  4.35  10  12 (AU)  1.74  13  4.35  15  0.87  11  13 (EI)  4.35  14  18.26  -  12  14 (OU)  18.26  15  0.87  13  15 (JU)  0.87  -  -  13.04  4 7 5  13.04  -  -  / 108 Table A3.23 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants Mode 0 max. error per word = 50% Grp  max. error per word = 45%  max. 
error per word = 40%  #  words grouped  group err.(%)  words grouped  group err.(%)  words grouped  group err.(%)  0  0 (W)  25.00  0  25.00  0  25.00  1  1 (L)  40.00  1  40.00  1  40.00  2  2 (R)  37.50  2  37.50  2  37.50  3  3 (Y)  37.50  3  37.50  3  37.50  4  4 (M)  42.50  4  42.50  4 56  27.50  5  5 (N)  50.00  5 6  31.25  7 8 9 10 11 12 16 17 18 20 21  7.27  6  6 (NG)  35.00  7 8 9 10 11 16 17 18 20 21  11.50  13  35.00  7  7 16 17 20 21 (B V TH F THE)  23.50  12  35.00  14  30.00  8  8 (D)  47.50  13  35.00  15  35.00  9  9 (G)  42.50  14  30.00  19  22.50  10  10 (P)  47.50  15  35.00  22  30.00  11  11 (T)  50.00  19  22.50  23  35.00  12  12 (K)  35.00  22  30.00  -  13  13 (H)  35.00  . 23  35.00  -  14  14 (DZH)  30.00  -  -  -  15  15 (TSH)  35.00  -  -  -  16  18 (Z)  35.00  -  -  -  -  17  19 (ZH)  22.50  -  -  -  18  22 (S)  30.00  -  -  -  -  19  23 (SH)  35.00  -  -  -  -  -  / 109 Table A3.24 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 f o r the Consonants Mode 1 max. error per word =50%  max. error per word = 45%  max. 
error per word =40%  Grp #  words grouped  0  0 (W)  7.50  0  7.50  0  7.50  1  1 (L)  37.50  1  37.50  1  37.50  2  2 (R)  22.50  2  22.50  2  22.50  3  3 (Y)  27.50  3  27.50  3  27.50  4  4 (M)  37.50  4  37.50  4  37.50  5  5 (N)  37.50  5  37.50  5  37.50  6  6 (NG)  22.50  6  22.50  6  22.50  7  7 10 17 21 (B P TH THE)  32.50  7 10 16 17 18 20 21  14.29  7 10 16 17 18 20 21  14.29  8  8 11 (D T)  38.75  9  9 (G)  20.00  10  12 (K)  11  group err.(%)  words grouped  8 11  group err.(%)  words grouped  group err.(%)  38.75  8 11  38.75  9  20.00  9 12  22.50  45.00  12  45.00  13  27.50  13 (H)  27.50  13  27.50  14  37.50  12  14 (DZH)  37.50  14  37.50  15  30.00  13  15 (TSH)  30.00  15  30.00  19  17.50  14  16 20 (V F)  36.25  19  17.50  22  30.00  15  18 (Z)  30.00  22  30.00  23  27.50  16  19 (ZH)  17.50  23  27.50  -  -  17  22 (S)  30.00  -  -  -  -  18  23 (SH)  27.50  -  -  -  -  / 110 Table A3.25 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants Mode 2 max. error per word =50%  max. error per word = 45%  max. 
error per word =40%  Grp #  words grouped  group err.(%)  words grouped  group err.(%)  words grouped  group err.(%)  0  0 (W)  12.50  0  12.50  0  12.50  1  1 (L)  35.00  1  35.00  1  35.00  2  2 (R)  25.00  2  25.00  2  25.00  3  3 (Y)  27.50  3  27.50  3  27.50  4  4 (M)  32.50  4  32.50  4  32.50  5  5 (N)  30.00  5  30.00  5  30.00  6  6 (NG)  25.00  6  25.00  6  25.00  7  7 16 20 21 22 (B V F THE S)  27.00  7 8 10 11 15 16 20 21 22  19.72  7 8 10 11 15 16 17 18 20 21 22  8.18  8  8 11 (D T)  41.25  9  32.50  9 12  26.25  9  9 (G)  32.50  12  45.00  13  25.00  10  10 (P)  47.50  13  25.00  14  32.50  11  12 (K)  45.00  14  32.50  19  27.50  12  13 (H)  25.00  17 18  26.25  23  27.50  13  14 (DZH)  32.50  19  27.50  -  -  14  15 (TSH)  35.00  23  27.50  -  -  15  17 (TH)  50.00  -  -  -  -  16  18 (Z)  27.50  -  -  -  -  17  19 (ZH)  27.50  -  -  -  -  18  23 (SH)  27.50  -  -  -  -  Table A3.26 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 f o r the Consonants Mode 3 max. error per word =45%  max. error per word = 40%  / 111  max. 
error per word = 35%  Grp #  words grouped  0  0 (W)  6.67  0  6.67  0  6.67  1  1 (L)  30.00  1  30.00  1  30.00  2  2 (R)  10.00  2  10.00  2  10.00  3  3 (Y)  26.67  3  26.67  3  26.67  4  4 (M)  16.67  4  16.67  4  16.67  5  5 (N)  16.67  6  6 (NG)  16.67  6  16.67  6  16.67  group err.(%)  words grouped  5 7 8 9 10 11 13 16 17 18 20 21 22  group err.(%)  5.64  words grouped  5 7 8 9 10 11 13 16 17 18 20 21 22  group err.(%)  5.64  7  7 10 (B P)  31.67  12  36.67  12  36.67  8  8 17 (D TH)  43.33  14  20.00  14 15  11.67  9  9 (G)  23.33  15  40.00  19  23.33  10  11 (T)  30.00  19  23.33  23  30.00  11  12 (K)  36.67  23  30.00  -  12  13 (H)  20.00  -  -  -  -  -  13  14 (DZH)  20.00  -  -  -  -  14  15 (TSH)  40.00  -  -  -  -  15  16 20 21 22 (V F THE S)  27.50  -  -  -  -  16  18 (Z)  16.67  -  -  -  -  17  19 (ZH)  23.33  -  -  -  18  23 (SH)  30.00  -  -  -  -  / 112 Table A3.27 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 f o r the Consonants Mode 4 max. error per word = 35% Grp  max. error per word =30%  max. 
error per word = 25%  #  words grouped  group err.(%)  words grouped  group err.(%)  words grouped  group err.(%)  0  0 (W)  6.67  0  6.67  0  6.67  1  1 (L)  20.00  1  20.00  1  20.00  2  2 (R)  0.00  2  0.00  2  0.00  3  3 (Y)  26.67  3  26.67  3  26.67  4  4 (M)  13.33  4  13.33  4  13.33  5  5 (N)  23.33  5  23.33  5  23.33  6  6 (NG)  10.00  6  10.00  6  10.00  7  7 8 (B D)  31.67  8  9 (G)  16.67  9  16.67  9 12  10.00  9  10 (P)  33.33  12  30.00  14 15  11.67  10  11 (T)  23.33  14  30.00  19  20.00  11  12 (K)  30.00  15  20.00  23  16.67  12  13 (H)  16.67  19  20.00  -  -  13  14 (DZH)  30.00  22  30.00  -  -  14  15 (TSH)  20.00  23  16.67  -  -  7 8 10 11 13 16 17 18 20 21  5.00  7 8 10 11 13 16 17 18 20 21 22  4.55  15  16 17 20 21 (V TH F THE)  20.83  -  -  -  -  16  18 (Z)  10.00  -  -  -  -  17  19 (ZH)  20.00  -  -  -  -  18  22 (S)  30.00  -  -  -  -  19  23 (SH)  16.67  -  -  -  -

APPENDIX 4: PRELIMINARY STUDY ON THE EFFECTS OF A NOISY ENVIRONMENT ON RECOGNITION PERFORMANCE

Often the environment in which a speech recognition unit is required to operate is far from perfect. Conversations may be present in the background, machinery may be operating, etc. In situations such as these, the performance of a given recognition unit can and will degrade.

The tele-operator group in the E.E. department at UBC was interested in the performance of a speech recognition unit in a heavy duty machinery environment; specifically, in the cab of a Caterpillar D215 Knuckle Excavator. The D215 is illustrated in Figure A4.1. Towards the end of the term of this study, one of these machines was made available for use by the tele-operator group through Caterpillar and MacMillan-Bloedel Research.
Although not enough time remained to allow for an in-depth study on the effects of this environment on the performance of a speech recognition unit and of possible preprocessing techniques to reduce the effects, some noise data files were recorded on the HP9050. The format and gain settings were compatible with the existing voice files such that the two could be summed, yielding a simulation of voice files recorded in a noisy environment. Some results were tabulated, but the main purpose of this exercise was to provide a preliminary data base for future studies into speech recognition in a noisy environment.

Unfortunately the test site was limited in that no digging was allowed and no auxiliary activities such as would be present on a typical work site were being performed. This had the benefit, though, that any degradation in the results could be directly attributed to noise generated by the machine.

In fact, this test configuration touches on a very difficult problem which will have to be addressed when the noisy environment is studied. That is, the problem of isolating and defining specific noise types. One can imagine that a background hum may have a much different effect than a speaker in the background. Regardless, this will not be addressed here.

1. RECORDING CONFIGURATION

A Shure SM10 close-talking mike was worn by the operator of the machine and the mike output was recorded on a Yamaha commercial cassette deck. Three different conditions were recorded:
1.  the machine at full power idle. The engine of the D215 operates at more or less a constant rpm and one selects the nominal engine speed with a separate throttle.
After this is selected, the pitch of the engine will change slightly when different hydraulic loads are placed on the machine (all circuits, including the tracks, are hydraulically driven), but the noise level due to the engine itself remains approximately constant. Full power idle, then, is with the throttle completely open and with no load. Open throttle, or near open throttle, is the normal operating speed.
2.  moving branches back and forth.
3.  The operator performing the following operation:
    a.  Rotate the cab.
    b.  Raise arm using joint closest to cab.
    c.  Counter-rotate the cab.
    d.  Lower arm using joint closest to cab.
    e.  repeat from step a.

This task introduced not only changes in pitch in the engine, but also a transient clanging noise at the termination of each rotation due to joint arms contacting. These tasks were executed until enough data was collected to yield eighty data files of 1.5 seconds each. This data was then replayed in the lab and digitized, forming the background noise templates.

Mode 0 (unmodified data) of a single user's voice data for 'vowels said normally' (Test 1) was corrupted with each of the noise files. The results, shown in Table A4.1, are the results of corrupting both the training mode and the recognition mode data.

Noise Number   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
No Noise        87.500      12.500             0.000        6.250
1               82.812      10.938             6.250        3.125
2               84.375       7.812             7.812        3.125
3               75.000      14.062            10.938        1.562

Table A4.1: Results of Test 1, Mode 0 for RF with Noisy Background Data
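The corruption procedure used here (summing a background-noise file into a voice file recorded at compatible gain settings) can be sketched as follows. This is only a minimal illustration, not the actual HP9050 code; the function names and the optional noise-gain parameter are assumptions:

```python
import math

def mix_noise(voice, noise, noise_gain=1.0):
    """Additively corrupt a voice signal with background noise.

    Both signals are lists of 16-bit PCM samples recorded at compatible
    gain settings, so corruption is a simple sample-by-sample sum.
    """
    n = min(len(voice), len(noise))  # trim to the shorter recording
    mixed = []
    for v, w in zip(voice[:n], noise[:n]):
        s = int(round(v + noise_gain * w))
        mixed.append(max(-32768, min(32767, s)))  # guard against 16-bit overflow
    return mixed

def snr_db(voice, noise):
    """Signal-to-noise ratio in dB, useful when tabulating recognition
    performance versus noise level."""
    signal_power = sum(v * v for v in voice) / len(voice)
    noise_power = sum(w * w for w in noise) / len(noise)
    return 10.0 * math.log10(signal_power / noise_power)
```

To reproduce the experiment above, both the training-mode and recognition-mode utterances would be passed through `mix_noise` with the same set of noise files.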
The amount of degradation varied with the type of noise, with type 3 decreasing the recognition performance the most. It is interesting  to note,  though, that the  amount of decrease in recognition was no  worse than any of the other phenomena examined in this thesis. However, one would expect the degradation of the performance to be more marked when the D215 is on site.  Regardless,' the results  indicate that the effects of noise is a significant  and should be examined further. If the only interest  % Subject  Correct  %  is to obtain results  %  factor on a  Misrecognized  Rejected  % Correct Runner-ups  DR EC HW JR RF  82.812 25.000 82.812 70.312 82.812  14.062 45.312 4.688 21.875 10.938  3.125 29.688 12.500 7.812 6.250  7.812 4.688 1.562 9.375 3.125  Average  68.750  19.375  11.875  5.312  Table A4.2: Results of Test 1 , Mode 0 with Corruption by Noise Type 3  / 118  speech recognition unit's susceptability  to noise for comparison purposes, then it  would be relatively straightforward to implement a test whereby the performance v.s. noise level may be tabulated.  Preprocessing algorithms aimed at dynamically reducing the noise from incoming utterances,  however,  is  quite  a  more  complicated  proposition.  Many  different  algorithms have been investigated, but, according to Lim and Oppenheim [17], the ones which show most promise  are those involving multi-microphone input (e.g.  13,  & 39  14,  20,  22,  35,  36,  37  in  Bibliography) These,  algorithms of choice for future noise preprocessing work.  then,  are  the  A P P E N D I X 5: A N E X A M P L E OF R U L E S FOR S E L E C T I O N OF A ROBUST VOCABULARY In any given application for a speech recognition device, one often has leeway in selection of the exact words to be used for the vocabulary. Synonyms for a task name or required input may be substituted (e.g. 'start' may be used instead of 'go').  
This option may be used to select a vocabulary which best utilizes  the  discriminatory strengths of a given speech recognition unit.  The  following is a series of rules which provide one example of a method for  selection of a robust vocabulary. This example is specifically for the NEC SRlOO and  uses the phonetic error grouping results from this study. The technique is  based  on  the  subdivision  of  words  into  their  phonetic  constructs  and  the  determination of whether a given candidate is phonetically unique enough to be a reliable vocabulary member.  The specific phonetic error groupings utilized were those of Mode 3 (Tables A3.21 and  A3.26) using the most stringent maximum error rates tabulated (i.e.  per  word for the vowels and 35% for the consonants). The selection of Mode 3  seemed justified, in that any misidentification eliminated in the  15%  transition from  Mode 0 to Mode 3 was more than likely due to variations in duration and/or bad reference templates (rather than similarities in phonemes).  The proposed method is as follows: 1.  Divide each word of the proposed vocabulary into its phonetic constructs.  2.  Compare  each  phoneme  in  each  word with  119  the  phoneme  in  the  same  / 120 position in each of the other vocabulary members. If the phonemes a  different  error  group,  they  may  device. In other words, phonemes considered  discriminatory  (when  be  considered  are in  a valid discriminatory  within the following groups may not be compared  to  other  phonemes  within  the  same group.): a.  4, 6, 7, 8, 9 and 14 (A, U H , OW, OO, U and OU) for the vowels.  b.  5, 7, 8, 9, 10, 11, 13, 16, 17, 18, 20, 21 and 22 (N, B, D, G, P, T, H , V, T H , Z, F, T H E and S) for the consonants.  c. 3.  If  14 and 15 (DZH and TSH) also for the consonants. there  are  more  phonemes  in one  word than  in another,  then  each  additional phoneme may also be considered a valid discriminatory device. 4.  
For a word to be an acceptable vocabulary member, it must have at least one valid discriminatory vowel or two valid discriminatory consonants when compared to each other vocabulary member.

If the statistics were based on a larger test group, then it may be possible to state with some statistical certainty that the resulting vocabulary would have a general error rate of better than 15%. That is not possible with this small a test group. All one may say is that this should yield a more robust vocabulary than simply arbitrarily selecting vocabulary members.
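The four rules above lend themselves directly to automation. The sketch below is one possible encoding, not code from the thesis: phonemes are labelled "v&lt;N&gt;" for vowel number N and "c&lt;N&gt;" for consonant number N, and the error groups are the Mode 3 groups listed in rule 2:

```python
# Mode 3 error groups (rules 2a-2c): phonemes within one group are too
# easily confused to discriminate between two words.
ERROR_GROUPS = [
    {"v4", "v6", "v7", "v8", "v9", "v14"},                      # A, UH, OW, OO, U, OU
    {"c5", "c7", "c8", "c9", "c10", "c11", "c13",
     "c16", "c17", "c18", "c20", "c21", "c22"},                 # N, B, D, G, P, T, H, V, TH, Z, F, THE, S
    {"c14", "c15"},                                             # DZH, TSH
]

def confusable(p, q):
    """Identical phonemes, or phonemes in the same error group, do not
    count as valid discriminatory devices."""
    return p == q or any(p in g and q in g for g in ERROR_GROUPS)

def discriminators(word_a, word_b):
    """Count the valid discriminatory (vowels, consonants) between two
    words, each given as a sequence of phoneme labels (rules 1-3)."""
    longer, shorter = (word_a, word_b) if len(word_a) >= len(word_b) else (word_b, word_a)
    vowels = consonants = 0
    for i, p in enumerate(longer):
        # Rule 3: phonemes beyond the shorter word's length always count.
        if i >= len(shorter) or not confusable(p, shorter[i]):
            if p.startswith("v"):
                vowels += 1
            else:
                consonants += 1
    return vowels, consonants

def acceptable_pair(word_a, word_b):
    """Rule 4: at least one discriminatory vowel or two discriminatory
    consonants are required between every pair of vocabulary members."""
    v, c = discriminators(word_a, word_b)
    return v >= 1 or c >= 2

def weak_pairs(vocabulary):
    """Return the pairs of candidate words that fail rule 4."""
    names = list(vocabulary)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if not acceptable_pair(vocabulary[a], vocabulary[b])]
```

For instance, two candidates differing only in an initial B versus P (consonants 7 and 10, both members of the large consonant group) would be flagged as a weak pair, since neither the shared vowel nor the shared final consonant discriminates.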
