DEVELOPMENT OF TESTS AND PREPROCESSING ALGORITHMS FOR EVALUATION AND IMPROVEMENT OF SPEECH RECOGNITION UNITS

by

HANS WASMEIER
B.Eng., Memorial University of Newfoundland, 1981

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
December 1986
© Hans Wasmeier, 1986

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at The University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my Department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
2075 Wesbrook Place
Vancouver, Canada V6T 1W5
Date: December 1986

ABSTRACT

This study considered the evaluation of commercially available isolated word, speaker dependent speech recognition units, and preprocessing techniques that may be used to improve their performance. The problem was considered in three separate stages.

First, a series of tests was designed to exercise an isolated word, speaker dependent speech recognition unit. These tests provided a sound basis for determining a given unit's strengths and weaknesses. This knowledge permits a more informed decision on the best recognition device for a given price range; it may also be used in the design of a robust vocabulary and in the creation of guidelines for best performance. The test vocabularies were based on the forty English phonemes identified by Rabiner and Schafer [28], and the test variations were representative of common variations which may be expected in normal use.

Second, a digital archive system was implemented for storing the voice input of test subjects. This facility provided a data base for an investigation of preprocessing techniques. As well, it permits the testing of different speech recognition units with the same voice input, providing a platform for device comparison.

Third, several speech preprocessing and performance improvement techniques were investigated: two types of time normalization, the enhancement of low energy phonemes, and a change in training technique. These techniques permit a more accurate analysis of the failure mechanism of the speech recognition unit. They may also provide the basis for a speech preprocessor design which could be placed in front of a commercial speech recognition unit.

A commercially available speech recognition unit, the NEC SR100, was used as a measure of the effectiveness of the tests and of the improvements. Results of the study indicated that the designed tests and the preprocessing and performance improvement techniques investigated were useful in identifying the speech recognition unit's weaknesses. Also, depending on the economics of implementation, it was found that preprocessing may provide a cost effective solution to some of the recognition unit's shortcomings.

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
1. INTRODUCTION
   1.1. Outline of Thesis
2. THE SPEECH TESTS
   2.1. BACKGROUND ON DESIGN OF THE TESTS
   2.2. IMPLEMENTATION OF THE TESTS
   2.3. RESULTS
      2.3.1. General Comments
      2.3.2. Comments on Performance of the Various Tests
   2.4. INTERPRETATION OF THE RESULTS
      2.4.1. Evaluation of the NEC SR100
      2.4.2. Evaluation of the Speech Tests
3. ARCHIVAL OF THE TESTS
   3.1. HARDWARE CONFIGURATION
   3.2. IMPLEMENTATION
4. PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES
   4.1. TIME NORMALIZATION
      4.1.1. Synchronized Overlap-Add of Temporal Signals
      4.1.2. SOLA as Implemented in this Study
         4.1.2.1. Mode 1: Rectangular Windowing
         4.1.2.2. Mode 2: The Hamming Window
   4.2. THE USE OF TWO REFERENCE TEMPLATES
   4.3. NONLINEAR VOLUME NORMALIZATION
   4.4. COMMENTS ON THE TEST RESULTS
      4.4.1. Mode 0: Recorded Unaltered Data
      4.4.2. Mode 1: Time Normalization Using a Rectangular Window
      4.4.3. Mode 2: Time Normalization Using a Hamming Window
      4.4.4. Mode 3: Use of Two Reference Templates per Word
      4.4.5. Mode 4: Nonlinear Volume Normalization
5. CONCLUSIONS AND RECOMMENDATIONS
   5.1. THE TESTS
   5.2. THE RECORDING CONFIGURATION
   5.3. PREPROCESSING ALGORITHMS AND RECOGNITION RATE IMPROVEMENT TECHNIQUES
      5.3.1. Mode 1: Time Normalization Using a Rectangular Window
         5.3.1.1. Performance of the Algorithm
         5.3.1.2. Information Learned About the SR100 from Application of the Algorithm
         5.3.1.3. Applicability for Real-Time Implementation as a Preprocessing Technique
      5.3.2. Mode 2: Time Normalization Using a Hamming Window
         5.3.2.1. Performance of the Algorithm
         5.3.2.2. Information Learned About the SR100 from Application of the Algorithm
         5.3.2.3. Applicability for Real-Time Implementation as a Preprocessing Technique
      5.3.3. Mode 3: Use of Two Reference Templates Per Word
         5.3.3.1. Information Learned About the SR100 from Application of the Technique
         5.3.3.2. Applicability of the Technique as a Modification to the Training Method
      5.3.4. Mode 4: Nonlinear Volume Normalization
         5.3.4.1. Performance of the Algorithm
         5.3.4.2. Information Learned About the SR100 from Application of the Algorithm
         5.3.4.3. Applicability for Real-Time Implementation as a Preprocessing Technique
   5.4. PERFORMANCE OF THE NEC SR100 SPEECH RECOGNITION UNIT
      5.4.1. Rules for Best Operation of the NEC SR100
   5.5. FORMAT OF THE RESULTS
   5.6. AREAS FOR FURTHER STUDY
BIBLIOGRAPHY
APPENDIX 1: THEORY OF OPERATION OF THE NEC SR100
APPENDIX 2: RESULTS OF THE LIVE TESTS
APPENDIX 3: RESULTS OF THE RECORDED AND MODIFIED TESTS
APPENDIX 4: PRELIMINARY STUDY ON THE EFFECTS OF A NOISY ENVIRONMENT ON RECOGNITION PERFORMANCE
   1. Recording Configuration
   2. Comments on the Test Results
APPENDIX 5: AN EXAMPLE OF RULES FOR SELECTION OF A ROBUST VOCABULARY

LIST OF TABLES

Table 2.1: Vowel and Diphthong Phonemes of the English Language and the Associated Test Vocabulary
Table 2.2: Semi-Vowel and Consonant Phonemes of the English Language and the Associated Test Vocabulary
Table 5.1: Summary of the Test Results
Table A1.1: Frequency Response of the Front End Filters on the SR100
Table A2.1: Statistics for Test 1 - Vowels Said Normally
Table A2.2: Statistics for Test 2 - Vowels With Mike Moved
Table A2.3: Statistics for Test 3 - Vowels Said Slowly
Table A2.4: Statistics for Test 4 - Vowels Said Quickly
Table A2.5: Statistics for Test 5 - Vowels With Interrogative Intonation
Table A2.6: Statistics for Test 6 - Consonants (Group 1)
Table A2.7: Statistics for Test 7 - Consonants (Group 2)
Table A2.8: Confusion Matrix for Test 1 - Vowels Said Normally
Table A2.9: Confusion Matrix for Test 2 - Vowels With Mike Moved
Table A2.10: Confusion Matrix for Test 3 - Vowels Said Slowly
Table A2.11: Confusion Matrix for Test 4 - Vowels Said Quickly
Table A2.12: Confusion Matrix for Test 5 - Vowels With Interrogative Intonation
Table A2.13: Confusion Matrix for Tests 1 Through 5 Combined
Table A2.14: Confusion Matrix for Test 6 - Consonants (Group 1)
Table A2.15: Confusion Matrix for Test 7 - Consonants (Group 2)
Table A2.16: Confusion Matrix for Tests 6 and 7 Combined
Table A2.17: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels
Table A2.18: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants
Table A3.1: Statistics for Test 1 - Vowels Said Normally
Table A3.2: Statistics for Test 2 - Vowels With Mike Moved
Table A3.3: Statistics for Test 3 - Vowels Said Slowly
Table A3.4: Statistics for Test 4 - Vowels Said Quickly
Table A3.5: Statistics for Test 5 - Vowels With Interrogative Intonation
Table A3.6: Statistics for Test 6 - Consonants (Group 1)
Table A3.7: Statistics for Test 7 - Consonants (Group 2)
Table A3.8: Confusion Matrix for Tests 1 Through 5 Combined - Mode 0
Table A3.9: Confusion Matrix for Tests 1 Through 5 Combined - Mode 1
Table A3.10: Confusion Matrix for Tests 1 Through 5 Combined - Mode 2
Table A3.11: Confusion Matrix for Tests 1 Through 5 Combined - Mode 3
Table A3.12: Confusion Matrix for Tests 1 Through 5 Combined - Mode 4
Table A3.13: Confusion Matrix for Tests 6 and 7 Combined - Mode 0
Table A3.14: Confusion Matrix for Tests 6 and 7 Combined - Mode 1
Table A3.15: Confusion Matrix for Tests 6 and 7 Combined - Mode 2
Table A3.16: Confusion Matrix for Tests 6 and 7 Combined - Mode 3
Table A3.17: Confusion Matrix for Tests 6 and 7 Combined - Mode 4
Table A3.18: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 0
Table A3.19: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 1
Table A3.20: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 2
Table A3.21: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 3
Table A3.22: Word Groupings According to Maximum Allowable Error Rate Using Tests 1 Through 5 for the Vowels - Mode 4
Table A3.23: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 0
Table A3.24: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 1
Table A3.25: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 2
Table A3.26: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 3
Table A3.27: Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants - Mode 4
Table A4.1: Results of Test 1, Mode 0 for RF with Noisy Background Data
Table A4.2: Results of Test 1, Mode 0 with Corruption by Noise Type 3

LIST OF FIGURES

Figure 2.1: Basic Configuration of the NEC SR100
Figure 3.1: Configuration of the Recording Environment
Figure 4.1: Creation of a 2-D Function Using a Rectangular Window
Figure 4.2: Reconstruction of a 1-D Function By Arbitrary Placement
Figure 4.3: Reconstruction of a 1-D Function By Maximizing the Cross-Correlation
Figure 4.4: An Example of Distortion of a Periodic Function when the Hamming Window is Applied
Figure 4.5: An Example of Nonlinear Smoothing Using the Algorithm of Rabiner, Sambur and Schmidt
Figure A4.1: The Caterpillar D215 Knuckle Excavator

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Dr. M.R. Ito, for his support and guidance throughout the course of this work. As well, I would like to thank Real Frenette for the software support and consultation he provided, and Stephen Wu for his aid in the implementation of many of the algorithms and support programs utilized in this thesis. Also, I would like to express my gratitude to the volunteers for the speech tests.

Financial support was provided by Northern Telecom through a direct sponsorship to the author, and by the Natural Sciences and Engineering Research Council (NSERC) through both an Industrial Post-Graduate Scholarship awarded to the author and a research grant to the Tele-operator Group in the Electrical Engineering Department at the University of British Columbia.

1. INTRODUCTION

The concept of speech input for control of machines or, more generally, as a replacement for manual input, is by no means a new or small field. Many speech recognition units exist on the market, ranging in price from a few hundred dollars to beyond $30,000. It's reasonable to assume that their performances are as varied as their prices. A problem exists, though, when attempting to select a unit for a given application. Each manufacturer attempts to sell his product, and naturally states that his product is the best value for the money. Test results are often inflated and/or based on incomplete data, making comparisons between the performances of different units very difficult.

As well, recognition units aren't perfect; it follows that each one must have certain weaknesses. For example, a given unit may not be able to discern well between the 'n' and the 'm' sound, or it may fail to recognize a word if the speaker changes the speed at which it is uttered. If one can identify these weaknesses, then it may be possible to reduce their effects. This might include educating the user to avoid the particular phenomena, or selecting a vocabulary in which the weaker recognition capabilities are never solely relied upon. Also, if one knows the weaknesses of the recognition units being considered, then a more informed decision is possible in selecting the best unit for a given application.

Another possibility exists as well. When a given unit's weaknesses are identified, it may be possible to design a speech preprocessor to compensate for them, improving the recognizer's performance. This scheme may provide the most cost effective solution for a given application.
This study addresses some of these questions for speaker dependent, isolated word speech recognition units. Its purpose is threefold:

1. To design a series of tests which will exercise a speech recognition unit thoroughly enough to obtain a good idea of its strong and weak points.
2. To implement an archival system for digital storage of voice input, and to store the input of various subjects performing the tests previously designed. This facility permits the testing of various recognition units with the same voice input, providing a sound basis for comparison between units.
3. To preprocess the speech input so as to improve the recognition rate of a speech recognition unit. This performs two functions. First, the manipulations point the way to possible preprocessing techniques to be implemented. Second, through manipulation of the speech input, one can more accurately determine the failure mechanism of a given speech recognition unit.

This study, then, is intended to provide a basis for comparative testing of speaker dependent, isolated word speech recognition units, and is directed towards an accurate appraisal of each unit's weaknesses. As well, it is intended to provide the basis for possible algorithms for real-time preprocessing of voice input before presentation to a given speech recognition unit.

1.1. OUTLINE OF THESIS

Chapter 2 discusses the background and reasoning behind the design of the vocabularies, and the variations in enunciation utilized in the various tests. The tests were applied to a commercial speech recognition unit, the NEC SR100, and the results interpreted both with respect to the effectiveness of the tests and with respect to the performance of the speech recognition unit.

Chapter 3 describes the digital voice storage system, designed for archiving the voice test signals. The methodology used in recording the data is also discussed in this chapter.

Chapter 4 discusses the four different speech preprocessing and recognition improvement techniques investigated, describing the reason for selecting each technique and providing a description of the specific algorithms utilized. This chapter also discusses the results of applying the various techniques to the stored voice data and replaying it into the NEC SR100.

Chapter 5 presents conclusions and recommendations with respect to the effectiveness of the tests, the recording environment, the preprocessing and recognition improvement techniques, and the performance of the NEC SR100. The preprocessing and recognition improvement techniques are evaluated in terms of their performance, the information learned about the NEC SR100, and their applicability for implementation as a method for improving the recognition rate in real time.

The appendices contain a brief description of the theory of operation of the NEC SR100 (Appendix 1), the results of the live tests (Appendix 2), and the recognition results for the recorded and modified tests (Appendix 3). A brief introduction to the effects of a noisy environment is presented in Appendix 4, and Appendix 5 describes an example of selecting a robust vocabulary based on the results of this study.

2. THE SPEECH TESTS

2.1. BACKGROUND ON DESIGN OF THE TESTS

Design of the speech recognition tests can be thought of as two separate subproblems. The first is the problem of selecting a vocabulary or vocabularies, and the second is the question of what variations in the input should be examined.
Each of these problems will be addressed separately; first, the question of vocabulary selection.

Components of the English language can be considered on a number of different levels of resolution. On the coarsest level, one can consider the conveyance of ideas, thoughts, desires and concepts to be a construct of language; clearly this is the main purpose of language. These conveyances are constructed of one or more sentences, grouped in some logical order. Sentences are composed of phrases (e.g. subject, predicate, etc.), which in turn are composed of words. On a finer scale, words are constructed by a concatenation of syllables, and these syllables are in turn constructed of one or more basic linguistic sounds, termed phonemes. These phonemes are defined by one or more letters of the alphabet, and the many rules which govern their interaction.

Automated interpretation of speech at all degrees of resolution is a very complex and much studied problem. This study is concerned with the most primitive constructs of the spoken language, and the ability of a given speech recognition unit to discern between similar constructs. The most basic construct of the spoken language may be considered to be the phoneme. The tests put forth in the following chapter are based on the forty phonemes of the English language identified by Rabiner and Schafer [28]. These phonemes are listed in Tables 2.1 and 2.2. For the purposes of the tests, the phonemes are considered in two subgroups:

1. Vowels and diphthongs. For the sake of brevity, this subgroup will be referred to simply as vowels throughout this work.
2. Semi-vowels and consonants. Again for brevity, this subgroup will be referred to simply as consonants.

These two subgroups form the basis of the three test vocabularies. The reason for forming separate vocabularies for each of these groups is twofold. First, the two groups are phonetically quite different, and presumably a recognition device would have little difficulty discerning between representatives of the two groups. Second, the division makes creation of the test vocabularies more reasonable, and analysis of the results more manageable. Note that the labels 'vowels' and 'consonants' don't strictly refer to the equivalent letters of the alphabet: some letters have more than one pronunciation, and many phonemes are determined by more than one letter.

To create the vocabulary, each phoneme was embedded in a carrier word consisting of two static phonemes in addition to the phoneme under test. For the vowels, the test words are of the form 'consonant-vowel-consonant' (CVC), with the vowel being the non-static phoneme.
The consonant-based vocabularies are of two forms:

1. 'consonant-vowel-consonant' (CVC), with the first consonant being the non-static phoneme;
2. 'vowel-consonant-vowel' (VCV), with the consonant again being the non-static phoneme, but its position now being in the centre of the word.

The three vocabularies generated are listed in Tables 2.1 and 2.2. The reason for creating two consonant-based vocabularies will be discussed in a moment.

Table 2.1: Vowel and Diphthong Phonemes of the English Language and the Associated Test Vocabulary

    Phoneme Type  Sub-Group   No.  Symbol  Example  Test Word
    Vowels        Front        0   IY      beat     beam
                               1   I       bit      bim
                               2   E       bet      bem
                               3   AE      bat      bam
                  Mid          4   A       hot      bomb
                               5   ER      bird     berm
                               6   UH      but      bum
                               7   OW      bought   balm
                  Back         8   OO      boot     boom
                               9   U       foot     buum
    Diphthongs                10   AI      buy      bime
                              11   OI      boy      boym
                              12   AU      how      baum
                              13   EI      bay      bame
                              14   OU      boat     boam
                              15   JU      you      bume

Table 2.2: Semi-Vowel and Consonant Phonemes of the English Language and the Associated Test Vocabulary (test words for Groups 1 and 2)

    Phoneme Type  Sub-Group              No.  Symbol  Example  Group 1  Group 2
    Semi-Vowels   Liquids                 0   W       wit      wem      awa
                                          1   L       let      lem      ala
                  Glides                  2   R       rent     rem      ara
                                          3   Y       you      yem      aya
    Consonants    Nasals                  4   M       met      mem      ama
                                          5   N       net      nem      ana
                                          6   NG      sing     ngem     anga
                  Stops (Voiced)          7   B       bet      bem      aba
                                          8   D       debt     dem      ada
                                          9   G       get      gem      aga
                  Stops (Unvoiced)       10   P       pet      pem      apa
                                         11   T       ten      tem      ata
                                         12   K       kit      kem      aka
                  Whisper                13   H       hat      hem      aha
                  Affricates             14   DZH     judge    gem      adja
                                         15   TSH     church   chem     acha
                  Fricatives (Voiced)    16   V       vat      vem      ava
                                         17   TH      that     them     athe
                                         18   Z       zoo      zem      aza
                                         19   ZH      azure    jhem     ajha
                  Fricatives (Unvoiced)  20   F       fat      fem      afa
                                         21   THE     thing    them     atha
                                         22   S       sat      sem      asa
                                         23   SH      shut     shem     asha

These vocabularies, then, consist of parametrically very similar words, basically forming rhyme/alliteration tests. Thus, any identification or misidentification by a speech recognition device should largely be a function of the device's ability to discriminate between similar phonemes.

The second question, of which parameters to vary, is now addressed. The pronunciation of a given phoneme can vary in any number of ways and, as with the resolution of language, can be viewed on a number of different levels. On a global level, the pronunciation of a given phoneme is simply affected by the phonemes surrounding it. That is, a phoneme is not simply a static phenomenon; it shares a transitional time with adjacent phonemes and is therefore affected by them. These variations are termed allophones.

The speaker-to-speaker variations are even more complex. Each speaker has certain unique characteristics by which one can identify his or her voice. These characteristics manifest themselves through variations in the pronunciation of the various phonemes. They are determined by any number of factors, such as vocal tract configuration (e.g. size of the oral cavities) or upbringing (e.g. accents). The unique vocal features by which one may identify a given speaker are, of course, the basis of the study of automatic speaker verification.

Considering any individual speaker, he or she may vary the pronunciation of a given phoneme in any number of ways. For example, any of the following may alter:

- speed of pronunciation (i.e. phoneme duration);
- pitch (the person could say the phoneme in an inquiring tone, an imperative tone, etc.);
- volume;
- parameters of the vocal tract not under the control of the speaker (for example, changes due to a cold, smoking, etc.).
In addition to changes within the phoneme, there is also the possibility of the addition of auxiliary sounds not part of the phoneme. For example, a person may add a 'click' at the beginning of a word, or a heavy exhalation at the termination of a word. As well, the equipment used in registering the voice must be considered; specifically, the type and placement of the microphone will affect the input.

As illustrated in the previous paragraphs, there are a great number of possible variables which may be addressed. If the test size is to be reasonable, one must make some arbitrary decisions to limit the parameters to be examined. There are three important points to be kept in mind when determining which parameters are to be varied:

- The tests are being designed to exercise speaker dependent, isolated word speech recognition units. This implies that the tests should not specifically be concerned with inter-speaker variations in speech, but rather with intra-speaker variations.
- The tests must be easily controllable, so that the results are consistent and conclusions can be made from them with a greater degree of confidence. This eliminates from the study all parameters which neither the subject nor the tester can control.
- Within the limits of the previous points, the parameters to be studied should reflect those most commonly expected in normal use of a speech recognition unit.

Using these three criteria, the following tests were decided upon:

1. Vowels said normally. This provides an indication of a recognition unit's ability to discern between the vowel phonemes. As well, it provides a baseline for comparing the degradation of the recognition rate due to the variations in input.
2. Vowels with mike moved. The mike is positioned somewhere other than the recommended distance from the mouth, but still at a reasonable location. In this study, a Shure SM10 close talking mike was used; its recommended distance from the mouth is one thumbwidth. The distance used for the mispositioning test was about 4 to 5 centimetres, or approximately three fingerwidths.
3. Vowels said slowly. This test, and the test that follows, determine a recognition device's susceptibility to variations in the duration of the input. Subjects are asked to repeat the vowel vocabulary at approximately 2/3 the original speed.
4. Vowels said quickly. Again the vowels are repeated, but now the subjects are asked to say them at about 1.5 times the original speed.
5. Vowels with interrogative intonation. The vowel vocabulary is repeated in a questioning tone. This gives an indication of the device's susceptibility to changes in pitch.
6. Consonants (Group 1). Group 1 of the consonants is said normally. This gives an indication of the device's ability to discern between the various consonants.
7. Consonants (Group 2). Group 2 of the consonants is said normally. Any change in performance relative to Group 1 is an indication of the device's preference as to the position of the consonant.

These tests provide the basis for evaluation of a speech recognition device, and should be useful in identifying a device's weaknesses.

2.2. IMPLEMENTATION OF THE TESTS

The tests of the previous section were used in evaluating a medium priced speech recognition device, the NEC SR100 (its theory of operation is briefly discussed in Appendix 1).
The tests were initially applied live, with the test subjects providing voice input directly to the speech recognition unit (i.e. no storage was performed; the mike input was fed directly into the speech recognition unit). The live testing was performed for two reasons. First, it was done to insure that neither redirection of the mike output, nor digitization of the data, nor any other aspect of the archiving process appreciably affected the recognition results. Any such effect would manifest itself as a reduction in the recognition rate from the live results to the results using the archived voice data. Second, it was done to determine how interpretation of the data would be affected by not having direct access to the speech input. Presumably, interpretation of the results should be more accurate if one can replay and/or alter the input.

The SR100 is not a stand-alone unit, and is controlled via an RS-232 port. Through this port one can set the device to training or recognition mode, or can load or offload recognition templates to and from the host device. The SR100's basic configuration is shown in Figure 2.1.

[Figure 2.1: Basic Configuration of the NEC SR100 — mike into the SR100, with an RS-232C interface to a host or dedicated unit]

Control programs were written so that the SR100 could be accessed from C-language programs on an HP9050 minicomputer. Programs were then written which:

- provided access to the training mode of the SR100;
- controlled loading and offloading of the reference templates; and
- administered the tests.

The program which administered the tests required as input the initials of the subject, the test number to be executed, and the attempt number for that test. Based on this input, the program prompted the subject with the word to be said (on the screen), and then recorded the recognition result of the SR100. The order in which the words were tested was pseudo-random (based on the input), so as to eliminate any dependency on test order; a sketch of one way to derive such an ordering is given below. The results were then stored for later analysis.

A total of twenty attempts were recorded per word per test, consisting of four attempts per word per test by five test subjects. The subjects consisted of four males, one of whom was familiar with the SR100 (RF), and one female (MA). The attempts were recorded on two different occasions for each subject, the first occasion also being the time at which the SR100 was trained. Although the magnitude of these tests was not sufficient to make any statistical conclusions, it was sufficient to base some general observations on.
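The ordering routine itself is not given in the text. As a minimal sketch, in the spirit of the C-language control programs mentioned above, the following shows how a repeatable pseudo-random order could be derived from the subject's initials and the test and attempt numbers, so that a session can be reproduced exactly. The seed derivation, the PRNG and all names are assumptions for illustration, not the original HP9050 code.

    /* Hypothetical sketch: repeatable pseudo-random word order from
     * (initials, test, attempt); not the original HP9050 program. */
    #include <stdio.h>

    /* Derive a repeatable seed from the test-administration inputs. */
    static unsigned seed_from(const char *initials, int test, int attempt)
    {
        unsigned s = 5381;
        while (*initials)
            s = s * 33u + (unsigned)*initials++;
        return s * 33u + (unsigned)(test * 100 + attempt);
    }

    /* Simple linear congruential generator; any fixed PRNG would do. */
    static unsigned lcg(unsigned *s) { return *s = *s * 1103515245u + 12345u; }

    /* Fisher-Yates shuffle of the word indices 0..n-1. */
    static void shuffle(int *order, int n, unsigned seed)
    {
        int i, j, t;
        for (i = 0; i < n; i++) order[i] = i;
        for (i = n - 1; i > 0; i--) {
            j = (int)(lcg(&seed) % (unsigned)(i + 1));
            t = order[i]; order[i] = order[j]; order[j] = t;
        }
    }

    int main(void)
    {
        int order[16], i;                           /* 16-word vowel vocabulary */
        shuffle(order, 16, seed_from("RF", 1, 2));  /* e.g. test 1, attempt 2 */
        for (i = 0; i < 16; i++)
            printf("prompt word #%d\n", order[i]);
        return 0;
    }

Because the seed is a pure function of the inputs, rerunning a given subject, test and attempt reproduces the same word order, while different attempts see different orders.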
2.3. RESULTS

2.3.1. General Comments

The results for the tests are tabulated in Appendix 2 and are presented in a number of formats. In the first format, the statistics per test per person are tabulated, as well as the cumulative average. This yields general information on the performance of each subject with respect to each of the tests, and provides an idea of the effect of each variation in speech on the recognition rate.

The second format is the confusion matrix. This yields information on the phonemes, or groups of phonemes, that presented the greatest difficulty for the speech recognition unit, and the ones most commonly mistaken for one another. The results of the five tests for the vowels and the two tests for the consonants were combined to form two confusion matrices; the combination was made to provide a larger data base on which to interpret the matrix results. This combination was considered valid because the variations presented in the tests were indicative of possible scenarios which could occur while the recognition unit was in normal use.

Another way of analyzing which phonemes a speech recognition unit has difficulty discerning is by error grouping. The groupings are based on the confusion matrices generated by combining the results from tests 1 through 5 (the vowel tests) and tests 6 and 7 (the consonant tests). They are generated by grouping the phonemes most commonly misrecognized, until each phoneme within a group has an error rate better than the required amount. As the required error rate is decreased, the groupings get larger, indicating the decreased resolution of the unit as the maximum error limits become more stringent.

The described grouping performs two functions. First, as a method of analyzing the data, it yields an excellent picture of the phonemes the speech recognition unit has trouble discerning, indicating to what degree it can resolve phonemes and groups of phonemes. Second, it may be used to determine rules of vocabulary selection for best operation. That is, based on the results of the groupings, one may design a vocabulary which attempts to have the best phonetic contrast between words for a given unit.
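The grouping procedure is described only in prose, so the following C sketch is one plausible reading of it rather than the thesis's exact method: every phoneme starts in its own group, and the two groups that confuse each other most are merged until each group's out-of-group error rate meets the limit. The confusion counts, the merge rule and the 5% limit are placeholder assumptions.

    /* Hedged sketch of error grouping from a confusion matrix. */
    #include <stdio.h>

    #define N 16   /* e.g. the 16 vowel test words */

    int main(void)
    {
        static int conf[N][N];   /* conf[i][j]: word i recognized as word j */
        int group[N], i, j, gi, gj;
        double limit = 0.05;     /* maximum allowable error rate (assumed) */

        for (i = 0; i < N; i++) group[i] = i;   /* each word its own group */

        for (;;) {
            /* group error rate = inputs recognized outside the group
               divided by total inputs for the group */
            double worst = 0.0; int wg = -1;
            for (gi = 0; gi < N; gi++) {
                long out = 0, tot = 0;
                for (i = 0; i < N; i++) {
                    if (group[i] != gi) continue;
                    for (j = 0; j < N; j++) {
                        tot += conf[i][j];
                        if (group[j] != gi) out += conf[i][j];
                    }
                }
                if (tot && (double)out / tot > worst) {
                    worst = (double)out / tot; wg = gi;
                }
            }
            if (wg < 0 || worst <= limit) break;   /* all groups within limit */

            /* merge the worst group with the group it is most confused with */
            long best = -1; int merge = -1;
            for (gj = 0; gj < N; gj++) {
                long c = 0;
                if (gj == wg) continue;
                for (i = 0; i < N; i++)
                    for (j = 0; j < N; j++)
                        if ((group[i] == wg && group[j] == gj) ||
                            (group[i] == gj && group[j] == wg))
                            c += conf[i][j];
                if (c > best) { best = c; merge = gj; }
            }
            if (merge < 0) break;
            for (i = 0; i < N; i++)
                if (group[i] == merge) group[i] = wg;
        }

        for (i = 0; i < N; i++)
            printf("word %2d -> group %d\n", i, group[i]);
        return 0;
    }

Lowering the limit forces more merges, reproducing the behaviour noted above: the groupings grow as the error requirement becomes more stringent.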
2.3.2. Comments on Performance of the Various Tests

The following are comments and observations on the performance of each of the tests and on the results. These comments, along with the results in Appendix 2, provide the basis for the conclusions presented in the next section.

1. Vowels at normal speed. The words with the lowest recognition rates were:
   a. #7 (balm), with nine incorrect;
   b. #1 (bim) and #4 (bomb), with eight incorrect;
   c. #8 (boom), with six incorrect.
   Word #4 for DR and word #1 for EC weren't correctly recognized at all (four misses out of four attempts). This tends to indicate some inconsistency between these templates and the words later spoken. MA (the female subject) had by far the highest number of rejected inputs. This may have been due to MA's high pitch (as compared to the pitch of the other subjects), but no subsequent tests were performed to verify this possibility. The words with the highest recognition rates were #3 (bam) and #5 (berm), with rates of 95%. As well, RF had an overall rate of 93.75%.
2. Vowels with mike moved. There was a 22% decrease in the recognition rate from the normal case. As well, the SR100 had difficulty triggering on some of the words, particularly #8. This could have something to do with the energy distribution of the phoneme (see Appendix 1 for background on the SR100).
3. Vowels said slowly. The results of this test varied widely, depending on the speed each subject considered slow. JR's results were actually better than when the words were spoken normally; MA's were the same. For DR, EC and RF there was a drastic reduction in the recognition rate. These subjects were also the ones whose reduction in speed of speech seemed most pronounced.
4. Vowels said quickly. There was an overall average reduction of 16% in the recognition rate for the five subjects as compared to the results for test 1, ranging from 9 to 23% individually.
5. Vowels with interrogative intonation. Again the results varied drastically for each subject. EC and RF had large reductions in their recognition rates (20 and 15% respectively); they also seemed to have the greatest change in pitch. MA and DR remained approximately the same, and JR had a drastic increase in the recognition rate (18%).
6. Consonants (Group 1). The recognition rate was much worse than for the vowel-based vocabulary. Dividing the words into their phonetic subgroups, only the semi-vowels were above a 50% recognition rate; in fact, the semi-vowels had a recognition rate of over 80% for four of the five subjects. The words with the worst recognition rates were:
   a. #10 (pem), with fourteen incorrect;
   b. #5 (nem), #17 (them, as in 'that'), #20 (fem) and #21 (them, as in 'thing').
   Also, every subject had at least two words which were never recognized in recognition mode, again indicating that a bad template was registered. No pattern was apparent in these words.
7. Consonants (Group 2). The purpose of this test was to reveal whether the recognition device could more readily discern between the consonant phonemes if they were located in the middle of the word. Again the results varied widely within the test group. Comparing the results to those for Group 1, DR's recognition rate dropped 5%, EC's dropped 1%, JR's rose 9%, MA's rose 24% and RF's rose 25%. Four of the five subjects registered a recognition rate of over 80% for the semi-vowels.

2.4. INTERPRETATION OF THE RESULTS

As previously stated, no statistical conclusions may be drawn from such a small test group, but some general observations may be made. With this in mind, the results were analyzed using two different criteria. First, the data was interpreted with respect to an evaluation of the SR100. Second, the data was evaluated with respect to the design and implementation of the tests.

2.4.1. Evaluation of the NEC SR100

At 76%, the SR100's ability to discern between similar vowel phonemes is not extremely impressive. From the scores of individual subjects, however, it's apparent that good results are possible, presumably if one is consistent in enunciation of the input. From the results of tests 2 through 5, the recognition rate of the SR100 is affected to some degree by all the variations in speech examined, and especially by the slowing of speech input.

The SR100 is less able to discern consonants, achieving recognition rates of 52.5% for Group 1 and 62.9% for Group 2. This could be due to the lack of energy in a consonant sound, or perhaps to the SR100's bandlimited analysis of the input (see Appendix 1). Both of these possibilities are supported by the fact that the semi-vowels had a much better recognition rate than the remainder of the consonant vocabulary; it is these phonemes within this vocabulary which have the highest energy in the lower frequency bands. The difference between the results of Groups 1 and 2 indicates that the SR100 is better able to discern consonants within a word than at the beginning of a word. In either case, the recognition rates are not very good.

In considering the confusion matrices and error groupings, one would hope that the misrecognitions would be consistent. This would imply that the speech recognition unit had difficulty discerning phonemes within certain subgroups, but was able to separate these subgroups from the remaining phonemes. Although this seemed to be the case for the vowels, it was not so for the consonants. In fact, it can be seen from the confusion matrices and error groupings of the consonants that their failure modes were not well behaved at all.
That is, when a phoneme failed to be recognized, it did not tend to be misrecognized as any single other phoneme or small group of phonemes, but rather as any of a large number of other phonemes. This severely reduces a consonant's value for discerning between similar words when using the SR100.

Finally, it is obvious that the SR100 suffers from using only one training input as the reference template. No training averaging is performed, so there is no inherent compensation for a bad template.

2.4.2. Evaluation of the Speech Tests

The designed speech tests seem to address pertinent speech parameters, particularly with respect to the SR100. They did identify a number of problems with the SR100, flagging certain items to beware of, and they did indicate which phonemes are most commonly misrecognized. As a tool for comparison of various speech recognizers, however, the live procedure has certain drawbacks:

1. It is very difficult to control quantitative measures (e.g. pitch, word duration, etc.).
2. The subjects' normal pronunciation of a given word varied sufficiently that simply performing the same live tests on different speech recognition units would not provide adequate control for comparison tests.
3. Because the tests are applied live, one can't be sure of the specific reason a given word was not recognized correctly. That is, one doesn't have a clear idea of the failure mechanism, and this might lead to misinterpreted data.

The main problems, then, were that there was not enough control and consistency in the test input, and that one can't access the input to analyse the exact reason for misrecognition. These problems alone justified the implementation of digital recording of the speech input.

3. ARCHIVAL OF THE TESTS

3.1. HARDWARE CONFIGURATION

Certain basic criteria had to be met by the configuration used in archiving the data:

1. It had to provide sufficient audio quality that archival and retrieval didn't affect the performance of the speech recognition devices to be tested.
2. It had to have easy access to a mass storage device.
3. It had to have sufficient computing power that speech preprocessing could be performed in a reasonable amount of time.

No single computer in the Electrical Engineering Department met all of the requirements, so a combination of computers was used. The HP9050 provided the computing power and access to the mass storage device, and a resident PDP-11 provided the analog-to-digital (A/D) and digital-to-analog (D/A) facilities. Programs were written so that the A/D and D/A facilities on the PDP, as well as data transfer from the PDP to the HP, could be controlled by C-language programs on the HP. The complete hardware configuration is shown in Figure 3.1.

[Figure 3.1: Configuration of the Recording Environment — microphone and headphones, tape recorder microphone amplifier, lowpass filter, PDP-11 digitizer, HP9050 with HPIB (instrument bus), and the NEC SR100, with the command, speech and recognition-data paths between them]

As in the live tests, the Shure SM10 close talking microphone was used. The output from the microphone was fed into the microphone amplifier portion of a Scully 280 series high quality tape recorder (recorded signal-to-noise ratio greater than 65 dB, unweighted). This signal was then fed to a Krohn-Hite model 3342 filter, set in lowpass RC mode (8 pole) with the cutoff frequency set at 9 kHz. The filtering was performed to remove the possibility of aliasing while sampling.
V A ^ ^ ^ m i c r o p h o n j filtered tape analog f &I5M HP9050 digital f speech i i A otrgitiEeJ speech recognition results o o o o o o o o . 0 0 0 o c o . LZZJ commands to TOP-II &. SR-IOO processed speech FILTER DIGITIZER • • • • PDP-11 Ctwnmands from HP9050 digital iignjLj " digitized, speech analog speech / \ N^EC >g-iOC O J commands from H P 9 0 5 0 recognition data HPIB (instrument bus) Figure 3.1: Configuration of the Recording Environment ARCHIVAL OF THE TESTS / 24 the filter was input to the A/D facilities of the PDP, which sampled at 20 kHz (its maximum rate). For playback and monitoring of the input, the D/A output was fed into another channel of the Krohnite filter. This channel was also set to 9 kHz lowpass mode. The filter output was used to drive the speech recognition unit (which was under simultaneous control from the HP9050), or headphones. 3.2. IMPLEMENTATION The hardware was configured as described in 3.1, with the D/A monitor attached to the SR100. The SR100 acted as a validator for voice input, insuring that the voice input was present, and that it was spoken loudly enough to trigger the SR100. This was done in order to better simulate the conditions of the original test. The subject was presented, via a terminal connected to the HP, the vocabulary of each of the seven tests (see 2.1) in a pseudo-random order. Depression of the 'break' button on the terminal initiated the recording of 1.5 sec. of voice data, during which time the subject was required to pronounce the word in the prescribed manner. If the input was acknowledged by the SR100 (i.e. the input was loud enough), and was not labelled an erroneous input for any reason, the file was transferred to the HP and stored on a hard disk for later transfer to magnetic tape. A total of twenty-five attempts per word per test were recorded, consisting of ARCHIVAL OF THE TESTS / 25 five attempts per word per test by five subjects. This is one more attempt than the live tests, but it included the attempt which was to be used as the training input. All but one of the subjects were the same as the live tests. MA was unavailable for the recording sessions, and was replaced by HW (a male). The attempts were recorded on either two or three different occasions, depending on availability of the subject. The files were replayed immediately after the session to insure that no words were cut off, due to the limited recording time. If any were cut off, they were immediately redone. These files provided the basis for general analysis, and for an investigation into preprocessing algorithms. 4. PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES As stated previously, improvement of the recognition rate serves two purposes. First, it provides better insight into the failure mechanism of a given recognition unit. Second, it provides the basis for a speech preprocessor which may be placed in front of a commercial recognition unit, perhaps providing a cost effective solution for a given application. In this chapter, the following modifications are discussed: time normalization. the use of two reference templates. nonlinear volume normalization. Comments on the test results, tabulated in Appendix 3, are made at the end of this chapter and are further interpreted in Chapter 5. 4.1. TIME NORMALIZATION The objective of the time normalization implemented in this study was to make all input words (both the training and recognition words) of the same duration; .75 seconds. 
The reason for this was that, although the SR100 did perform dynamic time warping internally, its performance was limited. This limitation manifested itself in a large number of misrecognitions and rejections in tests 2 and 3. It was reasoned, then, that the recognition unit might perform better if all speech inputs were of the same duration. It was recognized that this implies the loss of a certain amount of discriminatory information pertaining to word duration; if one and two syllable words are used in a vocabulary (as is often the case), though, the discriminatory information lost may be negligible.

The algorithm used was a modified form of one presented by Roucus and Wilgus [31], which in turn was based on one by Griffin and Lim [12]. Griffin and Lim claimed that their algorithm appeared to be the best method at that time (1984). Roucus and Wilgus claimed that their algorithm provided at least as high quality as [12], and was computationally much simpler. Both claims were left unverified, but comments will later be made as to the quality of the algorithm implemented. The algorithm of [31] will now be presented.

4.1.1. Synchronized Overlap-Add of Temporal Signals

Consider a discrete time signal y(n). Using a windowing function w(m) of length L, one may define a doubly indexed (or 2-D) function y_w(b, n) such that:

    y_w(b, n) = w(b − n) y(n)    (4.1)

Now, let b be defined as:

    b = m S_a    (4.2)

where m is the window number and S_a is the interval between the commencement of consecutive windows. If S_a is less than the window length L, then consecutive windows overlap (i.e. any given point in y(n) is represented in more than one window number). See Figure 4.1 for an illustration of the windowing process.

[Figure 4.1: Creation of a 2-D Function Using a Rectangular Window — successive windows y_w(m1·S_a, n), y_w((m1+1)S_a, n), y_w((m1+2)S_a, n), y_w((m1+3)S_a, n) taken from y(n)]

If each window of the function y_w is then shifted such that the interval between the commencement of consecutive windows is S_s points, and the result is somehow resummed to form a singly indexed function, the resulting function will have a new duration of S_s/S_a times that of the original function. Thus, given a desired change in duration, one may arbitrarily select either S_s or S_a and calculate the value of the other required to attain this change. This is the basic principle of overlap-add modification of a signal.

Naturally, one can't simply sum the shifted windows of y_w and hope to attain reasonable quality speech; this would destroy or alter many characteristics of the speech input other than its duration (see Figure 4.2). Griffin and Lim used an iterative technique based on minimizing the distance between the Fourier transform of the resulting signal, X(m S_s, ω), and that of the original input, Y(m S_a, ω). They claimed to achieve excellent results, but only after 100 iterations of their algorithm.
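As a concrete illustration of the S_s/S_a duration relation (anticipating the parameter choices of Section 4.1.2), this small C sketch computes the analysis interval S_a that maps an input word onto the 0.75 s target at the 20 kHz sampling rate used in Chapter 3; the input length is a made-up example.

    /* Illustration only: choosing S_a for a desired output duration. */
    #include <stdio.h>

    int main(void)
    {
        const int fs = 20000;              /* sampling rate, Hz */
        const int Ss = 256;                /* synthesis hop S_s, points */
        const int n_target = 3 * fs / 4;   /* 0.75 s = 15000 points */
        int n_in = 24000;                  /* e.g. a 1.2 s input word */

        /* output duration = input duration * Ss/Sa, so Sa = Ss*n_in/n_target */
        int Sa = (int)((long)Ss * n_in / n_target);

        printf("n_in = %d -> Sa = %d (duration ratio %.3f)\n",
               n_in, Sa, (double)n_target / n_in);
        return 0;
    }

Here a 1.2 s word gives S_a = 409: each output hop of 256 points consumes roughly 410 input points, compressing the word to 0.75 s.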
Roucus and Wilgus argued that arbitrarily positioning consecutive windowed functions S_s points apart ignores the fact that most speech is periodic, and that this periodicity is altered by the arbitrary placement (illustrated in Figure 4.2). They reasoned that if the initial placement of consecutive windows is not restricted to exactly every S_s points, but rather is allowed to fall within a range of values, with the exact point of placement determined by that which best matches the periodic waveforms of consecutive windows, then few or no iterations would be required. The signal can then be reconstructed using equation 4.3:

    x(n) = [ Σ_{m=−∞}^{∞} w²(m S_s − n) y(n − m(S_s − S_a) − k(m)) ] / [ Σ_{m=−∞}^{∞} w²(m S_s − n) ]    (4.3)

[Figure 4.2: Reconstruction of a 1-D Function by Arbitrary Placement (e.g. the first iteration of the algorithm by Griffin and Lim)]

The variable k(m) is the deviation from the nominal value, S_s, for the positioning of consecutive windows. Note that if k(m) = 0, this is the same as equation 3 of the Least Squares Error Estimation (LSEE) described in Griffin and Lim. The shift k(m) is chosen to maximize the cross-correlation function R_{x,x_w}(k) between the signal reconstructed thus far and the next window to be added, defined by equation 4.4:

    R_{x,x_w}(k) = [ Σ_{n=mS_s}^{mS_s+L} x(n) x_w(m S_s, n+k) ] / [ Σ_{n=mS_s}^{mS_s+L} x²(n) · Σ_{n=mS_s}^{mS_s+L} x_w²(m S_s, n+k) ]^{1/2}    (4.4)

where:

    x(n) = the reconstructed signal;
    x_w  = the 2-D windowed function;
    m    = the window number;
    L    = the window length;
    S_s  = the new nominal repetition interval for consecutive windows.

This method, then, attempts to determine the best-match synchronization of consecutive windows, and is termed the synchronized overlap-add (SOLA) algorithm. An illustration of this reconstruction technique is shown in Figure 4.3.

[Figure 4.3: Reconstruction of a 1-D Function by Maximizing the Cross-Correlation (e.g. the first iteration of the algorithm by Roucus and Wilgus)]

The algorithm is formally specified as follows:

1. Define a 2-D signal consisting of the successive windows, taken every S_a samples, of the original signal y(n), defined by:

       x_w(m S_s, n) = w(m S_s − n) y(n − m(S_s − S_a))    (4.5)

2. Initialize x(n) and c(n) using the first window of x_w:

       x(n) = w(n) x_w(0, n)    (4.6)
       c(n) = w²(n)    (4.7)

3. Do steps a and b for m = 1 to the total number of frames.
   a. Maximize the cross-correlation (equation 4.4) with respect to k(m) between the signal constructed thus far and the next window to be added.
   b. Extend the estimate by incorporating this window into x(n):

       x(n) = x(n) + w(m S_s + k − n) x_w(m S_s, n + k)    (4.8)
       c(n) = c(n) + w²(m S_s + k − n)    (4.9)

4. Normalize the waveform for all n:

       x(n) = x(n)/c(n)    (4.10)
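The following is a minimal C sketch of this reconstruction loop for a rectangular window (so w = 1 across the window and c(n) accumulates in steps of one). The buffer sizes, the synthetic test signal and the poor-match fallback are assumptions for illustration, not the thesis implementation; in particular, the fallback here simply keeps the nominal position, whereas Modes 1 and 2 below return the window to its unwarped position.

    /* Sketch of SOLA (eqs. 4.4-4.10) with a rectangular window. */
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    #define L    512    /* window length */
    #define KMAX 120    /* allowed deviation of k(m) */

    /* Normalized cross-correlation (eq. 4.4) over the overlap of the
     * signal built so far and a frame placed at output position pos+k. */
    static double xcorr(const double *x, const double *frame,
                        int pos, int k, int built)
    {
        double num = 0, ex = 0, ew = 0;
        int n;
        for (n = 0; n < L; n++) {
            int xi = pos + k + n;
            if (xi < 0 || xi >= built) continue;   /* overlap region only */
            num += x[xi] * frame[n];
            ex  += x[xi] * x[xi];
            ew  += frame[n] * frame[n];
        }
        return (ex > 0 && ew > 0) ? num / sqrt(ex * ew) : 0.0;
    }

    /* Warp y (len points) into x with hops Sa (analysis) and Ss
     * (synthesis).  x and c must be zero-initialized by the caller. */
    static int sola(const double *y, int len, double *x, double *c,
                    int Sa, int Ss)
    {
        int m, n, frames = (len - L) / Sa + 1, built = L;
        double frame[L];

        memcpy(x, y, L * sizeof *x);            /* eq. 4.6 (w = 1) */
        for (n = 0; n < L; n++) c[n] = 1.0;     /* eq. 4.7 */

        for (m = 1; m < frames; m++) {
            int pos = m * Ss, k, kbest = 0;
            double best = -2.0;
            memcpy(frame, y + m * Sa, L * sizeof *frame);   /* eq. 4.5 */
            for (k = -KMAX; k <= KMAX; k++) {    /* maximize eq. 4.4 */
                double r = xcorr(x, frame, pos, k, built);
                if (r > best) { best = r; kbest = k; }
            }
            /* Poor match (little periodicity): keep the nominal position.
             * (The thesis instead returns the window to its unwarped
             * position; see step 5c of Mode 1 below.) */
            if (best < 0.6) kbest = 0;
            for (n = 0; n < L; n++) {            /* eqs. 4.8, 4.9 */
                x[pos + kbest + n] += frame[n];
                c[pos + kbest + n] += 1.0;
            }
            if (pos + kbest + L > built) built = pos + kbest + L;
        }
        for (n = 0; n < built; n++) x[n] /= c[n];   /* eq. 4.10 */
        return built;
    }

    int main(void)
    {
        static double y[20000], x[40000], c[40000];
        int n, outlen;
        for (n = 0; n < 20000; n++)   /* 100 Hz test tone, 1.0 s at 20 kHz */
            y[n] = sin(2.0 * 3.141592653589793 * 100.0 * n / 20000.0);
        outlen = sola(y, 20000, x, c, 341, 256);   /* 1.0 s -> ~0.75 s */
        printf("normalized length: %d samples\n", outlen);
        return 0;
    }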
4.1.2. SOLA as Implemented in this Study

Two variations of Roucus and Wilgus' algorithm were implemented in this study, labelled Modes 1 and 2. First, the features common to both modes will be discussed; then each mode will be discussed separately.

The endpoints of the words were determined using an energy threshold algorithm, insuring that the energy remained above a preset threshold for a minimum of 100 ms at the beginning of a word, and below the threshold for a minimum of 200 ms at the end of a word. The endpoints were also manually verified for words whose endpoints were for some reason (e.g. unusual duration, bad recognition) questionable. Of course, a more reliable endpoint detection algorithm would have to be developed if the time normalization algorithm were to be incorporated into a real-time preprocessor, but that was beyond the scope of this study.

Both modes utilized a window length of 512 points. The window length had to be long enough to capture an adequate number of pitch periods, but not so long as to cover too many, losing temporal resolution.

The interval between the commencement of consecutive windows in the unnormalized function, S_a, was the parameter varied to determine the amount and type of time warping (e.g. compression or expansion) to be performed. It was calculated such that the nominal interval between the commencement of consecutive windows in the time normalized function, S_s, was 256 points, and the total normalized duration was 0.75 seconds. This approach insured that there was always a reasonable amount of overlap between consecutive windows for construction of the normalized function x(n). It did, however, limit the maximum amount of time compression to 1/2. This presented no difficulty in this study, because the maximum compression required was 1/2 (normalized voice data had a duration of 0.75 s and the maximum unnormalized duration was 1.5 s).

The allowable range of k(m) had to cover at least one pitch period of the lowest voice expected. This was necessary so that consecutive windows were able to align correctly for all voice input, the lowest pitch being the limiting factor. Obviously, if the range is more restricted, then the periodicity within the windowed function will not always align with the periodicity of the signal constructed thus far; too large a range would result in unnecessary computation. The range, therefore, was chosen to be ±120 points from the nominal position.

The method described in Roucus and Wilgus is based on the premise that there is always a certain amount of periodicity present in speech signals, and that construction of a time normalized signal is aided by the utilization of this periodicity. For the most part this is a valid assumption, but sometimes it does not apply. Certain vocal phenomena are simply transients (e.g. the plosive sounds), while others have little or no periodicity (e.g. unvoiced phonemes). These phenomena do not benefit from positioning according to the correlation function, and in fact may be degraded as a result of it. The lack of periodicity manifests itself as a low maximum cross-correlation. In this study, if the maximum cross-correlation was below 0.6 for a given window, no time warping was performed and the window was allowed to return to its original position.
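Returning to the endpoint rule described at the start of this subsection, a rough C sketch of such an energy-threshold detector follows. The 256-point frames, the threshold value and the consecutive-frame counts (chosen to approximate the 100 ms and 200 ms hold times at 20 kHz) are all assumptions.

    /* Sketch of an energy-threshold endpoint detector. */
    #include <stdio.h>

    #define FRAME 256   /* 12.8 ms at 20 kHz */

    static double frame_energy(const short *s, int n)
    {
        double e = 0; int i;
        for (i = 0; i < n; i++) e += (double)s[i] * s[i];
        return e / n;   /* mean-square energy of the frame */
    }

    /* Fills *start and *end (sample indices); returns 0 on success.
     * Word starts after ~100 ms of loud frames, ends after ~200 ms
     * of quiet frames. */
    int find_endpoints(const short *s, int len, double thresh,
                       int *start, int *end)
    {
        int nf = len / FRAME, f, run = 0;
        int need_on = 8;    /* 8 frames  ~ 100 ms */
        int need_off = 16;  /* 16 frames ~ 200 ms */

        *start = -1; *end = len;
        for (f = 0; f < nf; f++) {
            int loud = frame_energy(s + f * FRAME, FRAME) > thresh;
            if (*start < 0) {
                run = loud ? run + 1 : 0;
                if (run == need_on) *start = (f - need_on + 1) * FRAME;
            } else {
                run = loud ? 0 : run + 1;
                if (run == need_off) { *end = (f - need_off + 1) * FRAME; break; }
            }
        }
        return (*start >= 0) ? 0 : -1;
    }

    int main(void)
    {
        static short s[20000]; int i, a, b;
        for (i = 5000; i < 12000; i++)           /* fabricated 'word' burst */
            s[i] = (short)((i % 50) * 100);
        if (find_endpoints(s, 20000, 1000.0, &a, &b) == 0)
            printf("word from sample %d to %d\n", a, b);
        return 0;
    }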
if the original position of the window is within the allowable range of k(m), use this as the optimum position. It is reasonable to assume that this would yield the highest cross-correlation, pre-empting the requirement for computing R (the cross-correlation function from w Equation 4.4). If this is the case, Skip b. and c. b. Maximize cross-correlation with respect to k(m) between the signal constructed thus far and the next window to be added using equation 4.4. c. If maximum cross-correlation is less than .6, then presume that it is a bad match. That is, no periodicity is present or periodicity is different between the signal constructed thus far and the next window to be added. If this is the case, then allow the window to return to its original position, with its starting point being S ahead of the a starting point of the previous window. This return in effect eliminates any time warping on the non-periodic and/or the rapidly changing portion of a given word. d. Extend the estimate by incorporating this window into x(n) using equations 4.8 and 4.9. 6. Normalize the waveform using equation 4.10. PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES / 37 4.1.2.2. Mode 2: The Hamming Window In the course of the enunciation of a given word, the characteristics of the speech waveform are continually changing. Even the 'periodic' portions of the waveform aren't purely periodic. Consider the rectangular windowing function described in the previous section. When a given window x is aligned with x(n) w constructed thus far, the gross periodicity will be aligned, but the finer detail may not be aligned. This can cause a small discontinuity in the reconstructed waveform where a new window commences, manifesting itself as a 'crackle' in playback. Or it may cause a hollow quality to be introduced into the reconstructed voice data. This is caused by the summing of signals equal in magnitude, but slightly phase shifted. If, however, a smoothing function such as a Hamming window is used before the matching process, then discontinuities where a new window commences should be decreased. Also, the hollow quality shouldn't be as apparent because at any given point, one window is contributing the majority of the weighting to the given result. When implemented, however, this algorithm proved not to be so well behaved. What occurred was that, after the Hamming window was applied to y(n), the periodicity of the original function was sometimes adversely affected (see Figure 4.4 for an example). This lead to incorrect synchronization of with the periodic function present in x(n) and manifested itself as an 'echo' effect in the result. Before the Hamming Window i s Applied / 38 Figure 4 .4 : An Exrmple of Distortion of a Periodic Function When the Hamming Window i s Applied PREPROCESSING ALGORITHMS AND RECOGNITION IMPROVEMENT TECHNIQUES / 39 The phenomena may be explained in another manner. Since both the tail of the function x(n) and x are weighted by the Hamming window, the auto-correlation w of the Hamming window itself weights the resulting cross-correlation calculations. As a result of these observations, the following technique was implemented. All alignment calculations were performed with a 2-D function utilizing a rectangular window, but construction of the data file to be output was performed using a Hamming window. This yielded excellent periodicity alignment while significantly reducing crackle and hollowness. Mode 2 is formally defined as follows: 1. Find endpoints of the word. 2. 
4.1.2.2. Mode 2: The Hamming Window

In the course of the enunciation of a given word, the characteristics of the speech waveform are continually changing; even the 'periodic' portions of the waveform are not purely periodic. Consider the rectangular windowing function described in the previous section. When a given window x_w is aligned with the x(n) constructed thus far, the gross periodicity will be aligned, but the finer detail may not be. This can cause a small discontinuity in the reconstructed waveform where a new window commences, manifesting itself as a 'crackle' in playback. It may also introduce a hollow quality into the reconstructed voice data, caused by the summing of signals equal in magnitude but slightly phase shifted.

If, however, a smoothing function such as a Hamming window is used before the matching process, then discontinuities where a new window commences should be decreased. The hollow quality should also be less apparent, because at any given point one window contributes the majority of the weighting to the result.

When implemented, however, this algorithm proved not to be so well behaved. After the Hamming window was applied to y(n), the periodicity of the original function was sometimes adversely affected (see Figure 4.4 for an example). This led to incorrect synchronization of x_w with the periodic function present in x(n), and manifested itself as an 'echo' effect in the result.

[Figure 4.4: An example of distortion of a periodic function when the Hamming window is applied, showing the function before and after windowing.]

The phenomenon may be explained in another manner. Since both the tail of the function x(n) and x_w are weighted by the Hamming window, the auto-correlation of the Hamming window itself weights the resulting cross-correlation calculations.

As a result of these observations, the following technique was implemented. All alignment calculations were performed using a rectangular window, as in Mode 1, but construction of the data file to be output was performed using a Hamming window. This yielded excellent periodicity alignment while significantly reducing crackle and hollowness. Mode 2 is formally defined as follows:

1. Find the endpoints of the word.
2. Given the desired total time of 0.75 s and S_s = 256 points, determine what S_a must be.
3. Define x_w using equation 4.5, where w(n) is a 512-point rectangular window.
4. Initialize x(n), x_h(n) and c_h(n) using equations 4.6, 4.11 and 4.12 (c(n) is not required):

      x_h(n) = w_h(n) x_w(0,n)    (4.11)
      c_h(n) = w_h(n)             (4.12)

   where w_h(n) is a 512-point Hamming window.
5. Do steps a, b, c and d for m = 1 to the total number of frames:
   a. If the original position of the window is within the allowable range of k(m), use this as the optimum position and skip b and c.
   b. Maximize the cross-correlation with respect to k(m) between the signal constructed thus far with the rectangular window, x(n), and the next window to be added, using equation 4.4.
   c. If the maximum cross-correlation is less than 0.6, presume that it is a bad match. If this is the case, allow the window to return to its original position, with its starting point S_a ahead of the starting point of the previous window.
   d. Extend the estimate by incorporating this window into x(n), x_h(n) and c_h(n) using equations 4.8, 4.13 and 4.14:

      x_h(n) = x_h(n) + w_h(mS_s + k - n) x_w(mS_s, n + k)    (4.13)
      c_h(n) = c_h(n) + w_h(mS_s + k - n)                     (4.14)

6. Normalize the output waveform, x_h(n), using equation 4.15:

      x_h(n) = x_h(n) / c_h(n)    (4.15)
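The essential change from Mode 1 is that two pairs of accumulators are carried in parallel: the rectangular pair drives alignment, while the Hamming pair builds the audible output. A minimal sketch, extending the hypothetical sola_rect routine above; the per-sample indexing is simplified relative to equations 4.13 and 4.14.

    import numpy as np

    W = 512
    W_H = np.hamming(W)      # Hamming weights, used only for reconstruction

    def add_frame(x, c, xh, ch, win, p):
        """Accumulate one aligned 512-point frame at output position p.
        x, c follow Mode 1 and feed the cross-correlation search; xh, ch
        carry the Hamming-weighted output (eqs. 4.13 and 4.14)."""
        x[p : p + W] += win
        c[p : p + W] += 1.0
        xh[p : p + W] += W_H * win
        ch[p : p + W] += W_H

    # After the last frame the output is xh / ch (eq. 4.15), while the
    # alignment search only ever sees x / c, so the window never distorts
    # the periodicity being matched.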
4.2. THE USE OF TWO REFERENCE TEMPLATES

Although the technique discussed in this section does not provide the basis for any preprocessing algorithm, it does aid in better defining the reason for misrecognition or failure of given words. It may also provide a training technique that results in a better recognition rate.

This method for improving the recognition rate was prompted by the fact that, occasionally, a bad template was registered. In that situation, none of the subsequent recognition mode input was correctly identified, implying that there was some inconsistency between the input during training and during recognition. It was reasoned that if two separate attempts per word were utilized for training, the chances of having a bad template would be severely reduced. It was hoped that the recognition rate of words with good templates would be aided as well: two attempts per word should provide a broader base for comparison of recognition data.

Two options are possible for processing multiple training input:
1. One could average the training input, such that the single resulting template is the median of all training input.
2. One could keep multiple templates, with each training word resulting in a template.

A number of problems exist with the first option. Averaging the templates presumes that one has access to the templates, and that they are in a format that may be averaged. In the case of the SR100, although the templates could be accessed for storage/retrieval purposes, the format was proprietary. This did not prove to be insurmountable, but the templates consist of compressed data: each vector within a given template represents a variable number of vectors in the uncompressed data. The number of vectors represented by each stored vector was unknown; only the total uncompressed length was known. This made averaging multiple templates very difficult.

The method chosen was to generate two templates for each word, using two different input utterances. In the recognition mode, if the given speech recognition unit matched an incoming utterance to either of the two reference templates for a given word, the input was identified as that word. This was the simplest and most universal alternative (with respect to other speech recognition units). The only real drawback of this technique is that the recognition unit must consider a larger number of templates.

The technique of using two reference templates was applied to time-normalized data (Mode 2) and the results were labelled Mode 3. Two attempts at each of the three vocabularies, said normally, were utilized.

4.3. NONLINEAR VOLUME NORMALIZATION

The intention of Modes 1 and 2 was to make all words input of the same duration. This included the training templates as well as the recognition input. One of the parameters available in the template data of the SR100 was the length of the training word input. It was through this data, for the templates of Modes 1 and 2, that it became apparent the unit was not always registering the complete word. This was particularly true in the tests utilizing the group 1 consonant vocabulary, where the registered word was shorter than that input. As well, in the templates for the vocabulary utilizing group 2 of the consonants, silent periods were being declared in the centre of words where none were present. These two facts indicated that the recognition unit was overlooking certain low energy consonant phenomena.

This insight led to the implementation of a nonlinear volume normalization scheme. The scheme did not affect the higher energy phenomena, but did boost those with low energy. The specific phonemes to be addressed were the static low energy consonants; conversely, those consonants with transients (e.g. the plosives) were not to be affected. In other words, the volume normalization algorithm had to be fairly stable, in that it had to leave short-time changes in volume unaffected, as these are often pertinent in recognition of the phoneme. It had, however, to respond quickly to valid alterations in volume representing the enunciation of a new phoneme. This was necessary to ensure that no portion of a given phoneme was unnecessarily amplified or attenuated.

With these points in mind, the following scheme was devised. The total RMS energy, e(r), was calculated for consecutive 256-point windows (every 12.8 ms) of the input data, y(n). This data was smoothed using a nonlinear scheme proposed by Rabiner, Sambur and Schmidt [29] before it was used to compute the amount of gain to be applied. The smoothing algorithm consisted of a 3-point median smoother followed by a 3-point Hanning smoother. The algorithm was well suited to this application, in that it smoothed local transients and yet responded quickly to sharp discontinuities which were not transients. An example of the results of the smoothing algorithm is shown in Figure 4.5.

[Figure 4.5: An example of nonlinear smoothing using the algorithm of Rabiner, Sambur and Schmidt, showing an energy profile before and after smoothing.]

This smoothed energy file provided the basis for the nonlinear volume normalization. A factoring file f(n), of the same length as the data file, was created. The smoothed energy data was placed in f(n) every 256 points, starting at the midpoint of the first 256-point window, and the remainder of f(n) was linearly interpolated from this data.
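A sketch of the smoother follows, in Python. The median-then-Hanning structure is from [29]; the edge handling (holding the end values) is our assumption.

    import numpy as np

    def smooth_energy(e):
        """Nonlinear smoother of Rabiner, Sambur and Schmidt [29]: a 3-point
        running median followed by a 3-point Hanning (0.25, 0.5, 0.25) filter."""
        e = np.asarray(e, dtype=float)
        padded = np.concatenate(([e[0]], e, [e[-1]]))
        med = np.array([np.median(padded[i : i + 3]) for i in range(len(e))])
        padded = np.concatenate(([med[0]], med, [med[-1]]))
        return 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]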
If f(n) was below f_min and above f_silence, volume normalization was implemented: the magnitude was boosted by the square root of the ratio between f_min and f(n). The lower bound of the volume normalization algorithm, f_silence, was effectively a silence detector, included in case a true silent period was encountered (e.g. if a subject paused in the enunciation of a word); no volume normalization was applied below this threshold. Obviously, a more robust silence detector would be required if this technique were to be used in a preprocessor, but investigating this was beyond the scope of this study.

The algorithm was applied to data that had already been time normalized (Mode 2), and two reference templates per word were utilized. This allowed the results for Mode 3 to be compared with the results for this mode, labelled Mode 4. The algorithm is formally defined as follows:

1. For r = 0 to 59, calculate the RMS energy, e(r), of the incoming file, y(n), according to equations 4.16 and 4.17:

      i = 256r                                 (4.16)
      e(r) = Σ y²(n),  n = i to i + 255        (4.17)

2. Filter e(r) using equations 4.18 and 4.19:

      e_s1(r) = median[ e(r-1), e(r), e(r+1) ]                     (4.18)
      e_s(r) = 0.25 e_s1(r-1) + 0.5 e_s1(r) + 0.25 e_s1(r+1)       (4.19)

3. Initialize the first 128 points of the factoring array, f(n):
   if f_silence < e_s(0) < f_min, then f(n) = e_s(0), n = 0 to 127;
   else f(n) = f_min, n = 0 to 127.
   Repeat for e_s(59) and the last 128 points of f(n).

4. For n = 128 to (end of file - 128), calculate f(n) by interpolating between successive values of e_s(r), utilizing equations 4.20 to 4.22:

      r = INTEGER[ (n - 128)/256 ]               (4.20)
      i = (n - 256r - 128)/256                   (4.21)
      f(n) = e_s(r) + i [ e_s(r+1) - e_s(r) ]    (4.22)

5. For n = 0 to the end of the data:
   if f_silence < f(n) < f_min, then

      x(n) = x(n) · SQRT( f_min / f(n) )    (4.23)

   else x(n) remains unchanged.
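Putting the pieces together, the following is a Python sketch of steps 1 to 5, reusing smooth_energy from the sketch above. The thresholds f_min and f_silence are placeholders rather than the settings used in this study, and the edge handling of step 3 is simplified to holding the end values.

    import numpy as np

    def volume_normalize(y, f_min, f_silence):
        """Mode 4 sketch: boost static low-energy stretches of y by
        sqrt(f_min / f(n)), leaving loud portions and silence untouched."""
        n_win = len(y) // 256
        e = np.array([np.sum(y[256 * r : 256 * (r + 1)].astype(float) ** 2)
                      for r in range(n_win)])          # eqs. 4.16, 4.17
        e = smooth_energy(e)                           # eqs. 4.18, 4.19
        mids = 128 + 256 * np.arange(n_win)            # window midpoints
        f = np.interp(np.arange(len(y)), mids, e)      # factoring file f(n),
                                                       # held flat at both ends
        x = y.astype(float)
        boost = (f > f_silence) & (f < f_min)          # eq. 4.23 condition
        x[boost] *= np.sqrt(f_min / f[boost])
        return x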
4.4. COMMENTS ON THE TEST RESULTS

The results of the preprocessing are tabulated in Appendix 3. The mode numbers correspond to the following:

0: the original data.
1: time-normalized data, utilizing the technique described in 4.1.2.1 (rectangular windowing).
2: time-normalized data, utilizing the technique described in 4.1.2.2 (the Hamming window).
3: Mode 2, plus two training templates per word (section 4.2).
4: Mode 3, plus the nonlinear volume normalization described in 4.3.

As with the live tests, the results are presented in a number of formats. Tables A3.1 to A3.7 present the statistics per test per person for each of the five modes, providing an indication of the performance of each mode for each subject and test. In Tables A3.8 to A3.17, the confusion matrices for tests 1 through 5 combined, and for tests 6 and 7 combined, are presented for each mode; these give an indication of the effect of each mode on which phonemes are most commonly misrecognized. Tables A3.18 to A3.27 provide error grouping data, described in Chapter 2, based on the confusion matrices of A3.8 to A3.17. These tables provide another method of viewing the groups of phonemes that are most commonly misrecognized, and provide a possible basis for vocabulary design.

The results of the live tests will first be compared with those of the unaltered recorded data, Mode 0. In subsequent sections, the results of each mode will be compared with those of the preceding mode. The intent is to comment on the effect of each mode on the recognition rate, the confusion matrix, and the error groupings.

4.4.1. Mode 0: Recorded Unaltered Data

Ideally, one would want the recording process to be completely transparent to the results; that is, it should not affect the recognition tests in any manner (altering speech parameters, adding noise within the analysis bandwidth, or the like). Results of the two types of tests (live versus recorded), however, will not be exactly the same, just as the results of live tests performed on different occasions would not be the same: a certain amount of variation may be expected. As well, it was necessary to change one of the subjects, which alters the total statistics somewhat.

With these points in mind, the results of the two types of tests were compared. For each of the tests requiring normal enunciation of a given vocabulary (i.e. tests 1, 6 and 7), the average results of the live tests were quite similar to those of Mode 0 (within 4%). The recognition rate for test 2 (vowels with mike moved) improved by 9% from the live to the recorded mode. The rates for tests 3 (vowels said slowly), 4 (vowels said quickly) and 5 (vowels with interrogative intonation) decreased by 25%, 17% and 9% respectively. Although individual test results varied widely, it was clearly evident that tests 3 and 4 fared the worst in the use of recorded data. Because the other tests did not show the same degradation, it is postulated that the reduction in recognition rate was not due to the quality of the recording. Rather, the degradation was attributed to what one may call the 'button' phenomenon. In the recording environment used, the subject pressed a button and then had 1.5 seconds in which to say the word. Although the time constraints were liberal, their mere presence may have affected the duration of a subject's utterance. This was not confirmed, however, and in any case the results for tests 3 and 4 performed live were quite low (35% and 60% respectively).

Comparing the confusion matrices and error groupings of the tests for the vowel vocabulary, one can see that the clustering was quite similar between live and recorded data. The groupings did not occur in the same order for both the live and recorded tests, nor at the same error rates, but the trends were clearly similar. The same was not completely true for the consonant tests. Although general trends were somewhat similar, there were a number of differences. In the recorded tests, the semi-vowels (w, l, r, y) did not perform as strongly as in the live tests, whereas the 'ch/sh' consonants (words 14, 15, 19 and 23) performed better. Also, the templates for words 17 and 18 accumulated more misrecognized input for the recorded data than for the live tests, and this affected the clustering in the error rate tables. All of these results indicated that recording of the data may have had some effect on the quality of the speech, but the effects were not major.

4.4.2. Mode 1: Time Normalization Using a Rectangular Window

As was expected, the technique of time normalization remarkably improved the results of tests 3 (vowels said slowly) and 4 (vowels said quickly), increasing the recognition rates by 74% and 42% respectively.
It also improved the results of the other tests by varying degrees. The recognition rates of the tests utilizing normal intonation (1, 6 and 7), as well as test 2 (vowels with mike moved), increased by 4 to 6%, and the rate for test 5 (vowels said with interrogative intonation) increased by 14%.

The confusion matrices for the vowel tests indicate that not only was the recognition rate per word higher, but the amount of scatter was considerably less. ('Scatter' is the number of different words a given word is misrecognized as.) It is also apparent from the error groupings that some words benefitted more from the time normalization than others. In particular, the error rates for 11 (boym) and 12 (boam) decreased more than those of the other words. This altered the grouping clusters somewhat, leaving 11 and 12 ungrouped for the error rates considered. Note that although the remaining groupings were similar to those of Mode 0, the error rates at which they occurred were smaller.

The confusion matrices for the consonant tests indicate that the reduction in scatter was not nearly as dramatic as that for the vowel tests. The groups which benefitted most from time normalization were the semi-vowels (0, 1, 2, 3) and the nasals (4, 5, 6). The error groupings were a bit different from those of Mode 0, in that the nasals had improved enough to remain separate from the single super-group, as did the sub-groups 8 & 11 and 9 & 12. However, if one were to specify a maximum error rate per word of, say, 35%, all of these phonemes and sub-group phonemes would fall into the one super-group. The order of combination also changed somewhat.

4.4.3. Mode 2: Time Normalization Using a Hamming Window

The statistics of Mode 2 and Mode 1 were quite similar; the average recognition rate differed by no more than 2% for any given test. There were minor changes in the confusion matrices, which changed the error groupings slightly. For the vowels, the recognition rate for word 6 improved enough for it to remain alone at a maximum error rate of 20% per word, and words 8 & 9 did the same as a group at a maximum error rate of 15% per word. For the consonants, the groupings turned out slightly differently. The only major differences were that words 15 & 22 joined the super-group at maximum error rates of 45% and 50% per word respectively.

4.4.4. Mode 3: Use of Two Reference Templates per Word

This technique improved the recognition rates for all of the tests, but the improvements for tests 1, 3 and 4 were only marginal (0.2%, 1% and 2% respectively). Tests 2, 5, 6 and 7 improved by 7%, 6%, 7% and 10% respectively.

For the vowel tests, the confusion matrix and error groupings altered slightly. Words 1 and 2 remained ungrouped at the error rates considered. As well, words 8 & 9 remained a separate subgroup for these error rates. For the consonants, the amount of scatter in the confusion matrix was reduced considerably, particularly for certain words (4, 5 and 13 were the most improved with respect to scatter). The improvement was not universal, however, as some words did worse. Regarding the error grouping, the improvement in recognition rates and the reduction in scatter forestalled the formation of the single super-group by about 5%. In the interim, smaller sub-groups formed at a maximum error of 45% per word. As with the other modes, the collapse into the super-group occurred in a slightly different order.
4.4.5. Mode 4: Nonlinear Volume Normalization

Mode 4 was targeted at the consonant tests, and improved the recognition rates for these tests by 4% (test 6) and 6% (test 7). However, it also improved the recognition rates for four of the five vowel tests, increasing the results for tests 1, 2 & 5 by 2%, and for test 3 by 5%. For the vowels, though, this had little effect on the scatter within the confusion matrix and on the error grouping. Word 6 remained on its own at a maximum error of 20% per word, whereas 14 was grouped with 8 & 9. For the consonants, the reduction in scatter in the confusion matrices was minimal, except for word 7 (whose scatter was reduced by 5). The overall improvement in recognition rate, however, again moved the grouping phenomena to a more stringent maximum error rate.

5. CONCLUSIONS AND RECOMMENDATIONS

In this section, conclusions and recommendations are made on the following:
- the tests and their ability to exercise a speech recognition unit.
- the recording configuration.
- each of the modes, both in terms of their effectiveness in achieving what was desired and in terms of their results, together with comments on their suitability for real-time applications.
- the performance of the NEC SR100.
- the format of the test results.

As well, recommendations are made as to topics for future research. A summary of the test results is given in Table 5.1 for quick reference.

Table 5.1
Summary of the Test Results (in percent)

Test                        Mode
        Live       0       1       2       3       4
1       76.6    80.6    86.6    88.1    88.3    90.0
2       54.1    63.3    68.5    68.0    75.3    77.5
3       34.7    10.0    83.8    84.0    84.8    89.5
4       60.0    42.8    84.3    84.3    86.5    85.8
5       73.4    64.3    78.0    76.0    82.0    84.5
6       52.5    50.6    56.5    56.1    63.1    66.9
7       62.9    63.1    67.5    65.6    75.0    81.1

Test number: 1 = vowels said normally; 2 = vowels with mike moved; 3 = vowels said slowly; 4 = vowels said quickly; 5 = vowels with interrogative intonation; 6 = consonants (group 1); 7 = consonants (group 2).
Mode number: Live = unrecorded data; 0 = recorded unaltered data; 1 = time-normalized data using a rectangular window; 2 = time-normalized data using a Hamming window; 3 = Mode 2 plus two training templates per word; 4 = Mode 3 plus nonlinear volume normalization.

5.1. THE TESTS

The phonetic basis for the test vocabularies worked quite well, providing a logical method for identifying phonetic weaknesses with respect to a given speech recognition unit. Obviously, the vocabulary was not complete, in that it did not wholly address the question of allophones (variations of a phoneme according to the phonemes preceding and following it). As well, it did not address the question of an extended phoneme set due to variations in pronunciation (e.g. rolled r's). Addressing all possible variations would, however, yield a vocabulary of enormous size, and one must temper the selection of a vocabulary with the logistics of applying the actual tests (and with the amount of extra knowledge gained about the speech recognition unit by expanding the tests).

The same is true for the variations in pronunciation of the vocabulary. The variations addressed in this study provided pertinent information in flagging the speech recognition unit's pitfalls but, as with the vocabulary, they were simply a small logical subset of all possible variations present in speech.
One problem with the test vocabulary as used was that a few of the subjects had difficulty pronouncing some of the nonsense words, particularly word 9 of the vowel vocabulary (buum) and word 6 of the two consonant vocabularies (the words exercising the consonant 'ng'). Either these words could be dropped from the test vocabulary, or one could attempt to train the subjects more thoroughly in pronunciation of the vocabulary. Alternatively, one could investigate other static phonemes for construction of the vocabulary.

A problem also existed with the variations in pronunciation: they were not controlled tightly enough to establish that they were always the cause of failure in given tests. It was also not possible to determine the degree of any given variation tolerated by the speech recognition unit. The technique of recording the data and normalizing it by preprocessing reduced the first of these problems, although it did not give a clear idea of the amount of variation tolerable. An alternative would be to utilize the algorithms presented here (and to develop others) to artificially alter speech input in gradations with respect to a specific parameter. This would permit determination of the exact point of failure for each recognition device.

Finally, although the test vocabulary and test variations identified the weaknesses of the NEC SR100 quite well, their merits in providing a sound basis for comparison of different speech recognition units were not tested.

5.2. THE RECORDING CONFIGURATION

The recording configuration provided excellent results, in that the recording quality was quite good, as was the computing power available for analyzing and preprocessing the resulting data. There were certain drawbacks, however. First, as was pointed out in 4.4.1, the press of the button and the hard limit on the amount of time available in which to input an utterance may have affected the input itself. This could be eliminated by making initialization of the digitization, and the limits on digitizing time, transparent to the subject. For example, some sort of triggering device could be utilized in conjunction with a temporary buffer to initialize digitization, ensuring capture of the complete utterance. As well, the data files could be of variable length, so that the time limit would be eliminated.

A second problem was the mass storage device utilized for archiving the data (i.e. the tape drive on the HP9050). The data for a single subject occupied about 38 Mbytes, or one 2400 ft. tape, and access to any file on the tape, regardless of its location, took 25 minutes. This retrieval time caused a number of problems while working on the preprocessing algorithms, and would be a limiting factor in the ease of testing future speech recognition devices. The solution, of course, would be to install a mass storage device with faster access.

5.3. PREPROCESSING ALGORITHMS AND RECOGNITION RATE IMPROVEMENT TECHNIQUES

5.3.1. Mode 1: Time Normalization Using a Rectangular Window

5.3.1.1. Performance of the Algorithm

The Synchronized Overlap and Add (SOLA) method of time normalization, utilizing a rectangular window, proved to be a very stable and effective method for altering the duration of an utterance. It performed extremely well and was not too computationally intensive.
As was previously stated, however, it did generate a bit of crackle and hollowness in the resulting output. Also, when expansion of an utterance was required, the algorithm occasionally repeated phonetic transitions (for example, between a vowel and the following consonant). This was perceived as a stutter-like sound.

5.3.1.2. Information Learned About the SR100 from Application of the Algorithm

The test results after application of the Mode 1 algorithm prove that, without a doubt, the NEC SR100 has limited time warping capabilities. This limitation degrades its recognition performance, increasing both the misrecognition and rejection rates. The limited ability manifested itself not only in tests 3 and 4 (vowels said slowly and vowels said quickly), but in all of the other tests as well. In test 5 (vowels said with interrogative intonation) in particular, it became evident that a significant portion of the misrecognition and rejection was due to the change in duration of the utterance, not to the tonal change.

5.3.1.3. Applicability for Real-Time Implementation as a Preprocessing Technique

This algorithm is not too computationally demanding, in that it requires no iteration or massive calculations. The only major calculation required is that of a cross-correlation function, which is easily within the realm of real-time execution. The major problem with implementing this algorithm as it stands at present is that it requires input of the complete word before it can calculate the amount of time warping required. This would introduce a delay of at least the duration of the utterance into recognition of the word. In many applications this may be unacceptable, so the algorithm could not be applied in its present form.

5.3.2. Mode 2: Time Normalization Using a Hamming Window

5.3.2.1. Performance of the Algorithm

This modification of the SOLA method of time normalization performed quite well. It was as effective in altering the duration of an utterance as the algorithm used in Mode 1, but had no perceptible crackle or hollowness. It did, however, still exhibit the occasional stutter-like sound. The source of this stutter should be investigated in future work if either the Mode 1 or the Mode 2 algorithm is to be utilized.

The increase in quality obtained by utilizing the Hamming window did have a cost, however. Unlike the algorithm of Mode 1, it required the use of floating point arithmetic, making it somewhat more computationally intensive.

5.3.2.2. Information Learned About the SR100 from Application of the Algorithm

From the Mode 2 results, it was apparent that the SR100 was not noticeably affected by the crackle or hollowness. This implies that the SR100's resolution, either in time or in frequency, was not fine enough to register the presence or absence of these perturbations.

5.3.2.3. Applicability for Real-Time Implementation as a Preprocessing Technique

The explicit computation required by this algorithm is much the same as that for Mode 1, with the exception that the preprocessor must be able to handle floating point data. Note that the cross-correlation calculations still utilize fixed point data; it is only the reconstruction which requires floating point. This algorithm may therefore still be implemented in real time, but the preprocessor must be somewhat more powerful (and therefore more expensive).
The algorithm has the same major drawback as that of Mode 1, in that it would introduce a delay of at least the duration of the utterance into the identification process.

In choosing between the algorithms of Modes 1 and 2, one must weigh the effect on performance against the cost of implementation. For example, if the preprocessor is to be used solely with the SR100, where no noticeable improvement in performance was obtained by implementing the more complicated algorithm, the choice is obvious. The results with other speech recognition units may, of course, be different.

5.3.3. Mode 3: Use of Two Reference Templates Per Word

5.3.3.1. Information Learned About the SR100 from Application of the Technique

From the Mode 3 data, it can be seen that the use of a single training input per word for creation of the reference template hinders the performance of the SR100, particularly in the case of the consonants. This supported the earlier postulation that anomalies in the training input adversely affected the recognition rate, and that a single example of an utterance may not provide as wide-ranging a basis for identification of input as would multiple training inputs.

5.3.3.2. Applicability of the Technique as a Modification to the Training Method

Changing the training technique for a speech recognition unit of similar technology to the SR100 would incur two costs: the training time would be doubled, and the allowable vocabulary size would be halved. If these factors are relatively unimportant, then the implementation of this change is simple. One might attempt to extend this technique to three or more templates per utterance, but presumably at some point the recognition rate would no longer improve. In fact, the improvement from utilizing two templates was not universal over all subjects in all tests, which implies that increasing the number of templates may not always increase the recognition rate.

Some speech recognition units already require multiple utterances of a word in the training mode (e.g. Votan, Interstate and Threshold all have products requiring multiple training input). For these products, the technique proposed in 4.2 would not be implemented; rather, their own facilities for multiple input would be utilized. One would presume that, however these machines internally average or otherwise utilize the multiple input, their recognition rates are correspondingly improved, as they would be if this technique were implemented.

5.3.4. Mode 4: Nonlinear Volume Normalization

5.3.4.1. Performance of the Algorithm

The algorithm performed well, in that it increased the volume of the static low energy consonants in the utterances while having no noticeable effect on the high energy portions, the transients, or the silent periods. This was, of course, the objective of the algorithm.

The algorithm did, however, have one minor problem. Often, the last phoneme in a word attenuates naturally towards the end of the utterance. This slow attenuation was removed if volume normalization was active, yielding an abrupt termination to the utterance. This may have an adverse effect on some speech recognition units, although it did not appear to have one on the SR100.
5.3.4.2. Information Learned About the SR100 from Application of the Algorithm

From the results of Mode 4, it can be concluded that the SR100 did indeed miss information due to lack of volume in certain portions of utterances. The fact that the vowel vocabulary tests improved when the volume of these phonemes was increased implies that the trailing 'm' phoneme in these words sometimes played a role in the misrecognition, as these words often had a volume too low to register for the full duration of its enunciation.

5.3.4.3. Applicability for Real-Time Implementation as a Preprocessing Technique

The algorithm would be very simple and straightforward to implement. In its present form, however, it would not be very resilient to noise: it would amplify noise wherever the noise was mistaken for part of a word. As such, more work on noise identification would be required before it could be implemented in a real application.

5.4. PERFORMANCE OF THE NEC SR100 SPEECH RECOGNITION UNIT

In the documentation for the NEC SR100, a recognition rate of more than 99% (typical) is claimed, but no test conditions are described. Obviously, the rate is dependent on the vocabulary, the speech variations, and so on. Based on the results of this study, however, one would tend to disbelieve that this sort of accuracy could be consistently achieved with any realistic vocabulary. This inconsistency highlights one of the major reasons for undertaking this study: the fact that manufacturers' claimed recognition rates may be inflated and/or based on questionable data.

Based on the results of the tests and subsequent preprocessing, one can see that the SR100 has a number of weaknesses:
- its ability to accommodate variations in utterance duration is limited.
- its recognition rates are susceptible to mike placement.
- the unit does not have fine temporal and/or frequency resolution. (This may be viewed as a strength as well, as it makes the unit more immune to spurious input, e.g. crackle.)
- it misses pertinent recognition data due to too high a silence/voice triggering threshold.
- it has particular difficulty discerning consonants.

5.4.1. Rules for Best Operation of the NEC SR100

The following rules aid in obtaining the best performance from the SR100. They are based on the results of the tests and preprocessing algorithms discussed in this thesis.
1. If possible, make two training templates of each utterance.
2. Try to say each word consistently, particularly in terms of duration.
3. Enunciate well, making sure to say the phonemes in each word clearly.
4. Be consistent in the placement of the microphone, checking it regularly to ensure that it has not moved.

In addition, an example of a method for selecting a robust vocabulary (one which will achieve the best recognition results for a given application) is given in Appendix 5.

5.5. FORMAT OF THE RESULTS

The statistics were quite informative in terms of the amount of change with respect to each test type and each mode. As well, the confusion matrices gave an excellent picture of how a given word performed (in terms of which words it was misrecognized as, and how often). The groupings according to maximum error rate per word, however, left something to be desired, for a number of reasons. One reason was that not all utterances failed in a consistent manner.
In many cases for the consonants, if a word failed to match correctly, it was misrecognized as any of a number of other templates (i.e. there was a high degree of scatter), with no strong preference for any single template. Also, many different words, when misrecognized, tended to be dumped to the same template (e.g. words 16, 17 and 18 in Table A3.13). Both of these situations made grouping based on reducing error rates questionable, as the groupings tended to interlink, forming one super-group.

Because of this interlinking, the error groupings also tended to be somewhat misleading. For example, in Mode 0 of the vowel tests (with a maximum error rate of 25% per word), word 8 was placed in the same group as 4, 6 and 7. This would lead one to believe that word 8 was commonly mistaken for 4, 6 and/or 7, or vice versa. This, however, was not the case: neither 4, 6 nor 7 was ever mistaken for word 8, and word 8 was mistaken for any of them only once. The problem was that word 8 was often misrecognized as word 9, and word 9 was often misrecognized as word 6; thus they were grouped together.

Lastly, the error grouping results were not very stable. Minor variations in the confusion matrix resulted in significant changes in the error groupings. This was particularly true with respect to the order in which phonemes were grouped as the required error rate was made more stringent.

An alternative to the concept of error groupings as an aid in selecting a robust vocabulary would be to create some sort of restriction matrix. This would involve associating with each phoneme the other phonemes which may not be discriminated reliably from it. Each addition to the vocabulary could then be checked against the existing members to determine whether enough reliable differences are present for dependable discrimination.
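As a sketch of how such a check might work (in Python; the structure is hypothetical and the confusable sets shown are purely illustrative, not measured from the test data):

    # Hypothetical restriction matrix: for each phoneme, the phonemes that
    # cannot be discriminated from it reliably. Entries are illustrative only.
    RESTRICTED = {
        "m": {"n"}, "n": {"m"},        # e.g. nasals grouped in the tests
        "f": {"th"}, "th": {"f"},
    }

    def admissible(candidate, vocabulary, restricted=RESTRICTED):
        """A candidate word (a tuple of phonemes) is admissible if, against
        every existing vocabulary member, at least one phoneme position
        provides a reliable contrast."""
        for word in vocabulary:
            reliable = any(p != q and q not in restricted.get(p, set())
                           for p, q in zip(candidate, word))
            if not reliable:
                return False
        return True

    # Example: with the sets above, ("m", "a", "th") is rejected against an
    # existing ("n", "a", "f"), but accepted against ("s", "a", "f").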
5.6. AREAS FOR FURTHER STUDY

The time normalization algorithms as implemented in this study would cause a fair amount of delay in a real-time application. This could be reduced if the algorithm were to address smaller subsections of an utterance. Specifically, the algorithm could address normalization on a per-phoneme or per-phoneme-type basis (i.e. discerning only between consonants and vowels). In addition to reducing delay, this would eliminate any problems associated with variations in the number of phonemes within each member of the vocabulary. Of course, the nontrivial subject of phoneme identification would then have to be addressed.

Also, before either the time normalization or the volume normalization algorithm could actually be implemented in a real-time application, reliable word endpoint/silence detection algorithms must be investigated. In a closely related topic, a study of the effects of noise on recognition rates and endpoint detection should be undertaken, along with an investigation of possible algorithms for noise reduction. This will be a critical area of study, particularly if the speech recognition unit is destined for a noisy environment. Some preliminary results on the effects of noise on the performance of the SR100 are presented and discussed in Appendix 4.

As well, the possibility of normalizing other speech parameters should be investigated; for example, it may be possible to normalize speech input with respect to pitch. Beyond all this, if the algorithms are to be implemented in real time, the hardware required for their implementation must be designed.

Also, one possibility for improved comparative testing of various speech recognition devices is the implementation of a system which could vary some specific parameter of a voice input by a quantifiable amount. Such variation could determine a given speech recognition unit's failure point with respect to the parameter being addressed, and the result could then be used as a measure for comparison. The algorithms discussed in this thesis could in fact be altered to perform this task for the parameters which they address.

As well, a more rigorous method for selecting a robust vocabulary should be devised. For example, a method based on a restriction matrix, as described previously, would be a more helpful and reliable tool for aiding in the design of a robust vocabulary for a given speech recognition device and a given application.

Finally, the test methods and data should be applied to other speech recognition devices, to ascertain their ability to yield relative measures of performance.

BIBLIOGRAPHY

[1] M. Berouti, R. Schwartz and J. Makhoul, "Enhancement of Speech Corrupted by Acoustic Noise," in Proc. 1979 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 208-211, April 1979.
[2] S. F. Boll and D. C. Pulsipher, "Suppression of Acoustic Noise in Speech Using Two Microphone Adaptive Noise Cancellation," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 752-753, Dec. 1980.
[3] S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 113-120, April 1979.
[4] M. Chung, W. M. Kushner and J. N. Damoulakis, "Word Boundary Detection and Speech Recognition of Noisy Speech by Means of Iterative Noise Cancellation Techniques," in Proc. 1982 IEEE Int. Conf. Acoust., Speech, Signal Processing, p. 1838, May 1982.
[5] T. E. Eger, J. C. Su and L. W. Varner, "A Nonlinear Processing Technique for Speech Enhancement," in Proc. 1984 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 18A.1.1-18A.1.4, March 1984.
[6] Y. Ephraim and D. Malah, "Combined Enhancement and Adaptive Transform Coding of Noisy Speech," IEE Proc., vol. 133, pp. 81-86, Feb. 1986.
[7] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 443-445, April 1985.
[8] Y. Ephraim and D. Malah, "Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1109-1121, Dec. 1984.
[9] Y. Ephraim and D. Malah, "Speech Enhancement Using Optimal Non-Linear Spectral Amplitude Estimation," in Proc. 1983 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1118-1121, April 1983.
[10] J. B. Font, "Cochlear Modelling," IEEE Acoust., Speech, Signal Processing Magazine, pp. 3-28, Jan. 1985.
[11] T. Gardiner, J. G. McWhirter and T. J. Shepherd, "Noise Cancellation Studies Using a Least-Squares Lattice Filter," in Proc. 1985 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1173-1176, March 1985.
[12] D. W. Griffin and J. S. Lim, "Signal Estimation from Modified Short-Time Fourier Transform," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 236-243, April 1984.
[13] W. A. Harrison, J. S. Lim and E. Singer, "Adaptive Noise Cancellation in a Fighter Cockpit Environment," in Proc. 1984 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 18A.4.1-18A.4.4, March 1984.
[14] J. W. Kim and C. K. Un, "Enhancement of Noisy Speech by Forward/Backward Adaptive Digital Filtering," in Proc. 1986 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 89-92, April 1986.
[15] J. Lim, "Speech Restoration and Enhancement," Trends and Perspect. Sig. Proc., vol. 2, pp. 4-7, April 1982.
[16] J. S. Lim, Editor, Speech Enhancement. Englewood Cliffs, NJ: Prentice-Hall, 1983.
[17] J. S. Lim and A. V. Oppenheim, "Enhancement and Bandwidth Compression of Noisy Speech," Proc. IEEE, vol. 67, pp. 1586-1604, Dec. 1979.
[18] J. S. Lim, "Spectral Root Homomorphic Deconvolution System," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 223-233, June 1979.
[19] R. J. McAulay and M. L. Malpass, "Speech Enhancement Using a Soft-Decision Noise Suppression Filter," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 137-145, April 1980.
[20] O. M. M. Mitchell, C. A. Ross and G. H. Yates, "Signal Processing for a Cocktail Party Effect," The Journal of the Acoustical Society of America, vol. 50, pp. 656-660, Aug. 1971.
[21] S. H. Nawab, T. F. Quatieri and J. S. Lim, "Signal Reconstruction from Short-Time Fourier Transform Magnitude," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 986-998, Aug. 1983.
[22] J. C. Ogue, T. Saito and Y. Hoshiko, "A Frequency Domain Adaptive Noise Cancellation in Speech Signal," Technology Reports, Tohoku Univ., vol. 48, no. 2, pp. 181-197, 1983.
[23] A. V. Oppenheim and J. S. Lim, "The Importance of Phase in Signals," Proc. IEEE, vol. 69, pp. 529-541, May 1981.
[24] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[25] J. E. Porter and S. F. Boll, "Optimal Estimators for Spectral Restoration of Noisy Speech," in Proc. 1984 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 18A.2.1-18A.2.4, March 1984.
[26] M. R. Portnoff, "Short-Time Fourier Analysis of Sampled Speech," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 364-373, June 1981.
[27] M. R. Portnoff, "Time-Scale Modification of Speech Based on Short-Time Fourier Analysis," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 374-390, June 1981.
[28] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[29] L. R. Rabiner, M. R. Sambur and C. E. Schmidt, "Applications of a Nonlinear Smoothing Algorithm to Speech Processing," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 552-557, Dec. 1975.
[30] L. R. Rabiner and M. R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utterances," Bell Syst. Tech. J., vol. 54, pp. 297-315, Feb. 1975.
[31] S. Roucos and A. M. Wilgus, "High Quality Time-Scale Modification for Speech," in Proc. 1985 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 493-496, March 1985.
[32] R. W. Schafer and J. D. Markel, Speech Analysis. New York, NY: IEEE Press, 1979.
[33] R. J. Scott and S. E. Gerber, "Pitch-Synchronous Time-Compression of Speech," in Proc. 1972 IEEE Conf. for Speech Communications Processing, pp. 63-65, April 1972.
[34] S. Seneff, "System to Independently Modify Excitation and/or Spectrum of Speech Waveform Without Explicit Pitch Extraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, pp. 566-578, Aug. 1982.
[35] H. W. Strube, "Separation of Several Speakers Recorded by Two Microphones (Cocktail-Party Processing)," Signal Processing, North-Holland Publishing Co., vol. 3, pp. 355-364, Oct. 1981.
[36] V. R. Viswanathan et al., "Evaluation of Multisensor Speech Input for Speech Recognition in High Ambient Noise," in Proc. 1986 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 85-88, April 1986.
[37] V. R. Viswanathan, K. F. Karnofsky, K. N. Stevens and M. N. Alakel, "Multisensor Speech Input for Enhanced Immunity to Acoustic Background Noise," in Proc. 1984 IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 18A.3.1-18A.3.4, March 1984.
[38] D. L. Wang and J. S. Lim, "The Unimportance of Phase in Speech Enhancement," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, pp. 679-681, Aug. 1982.
[39] B. Widrow et al., "Adaptive Noise Cancelling: Principles and Applications," Proc. IEEE, vol. 63, pp. 1692-1716, Dec. 1975.

APPENDIX 1: THEORY OF OPERATION OF THE NEC SR100

In this section, an overview of the theory of operation of the NEC SR100 speech recognition unit is given. The intention is simply to provide background on the type of unit used in these tests. If a more in-depth understanding of the machine is required, one should refer to the literature accompanying it.

The front end of the analysis system for the SR100 consists of a series of 16 bandpass and smoothing filters nonlinearly spaced from 200 to 5000 Hz. The frequency response of the filters is shown in Table A1.1. The outputs of the sixteen filters are sampled and digitized every 16 ms, and are normalized with respect to the sum of the magnitudes of the outputs of all the filters. These 16 values then form the basic vector used by the speech recognition unit.

There are two distinct modes in the SR100: training mode and recognition mode. Training mode will be described first. When in training mode, the vector series for the given training word is accumulated and data compression is performed on it. Compression is executed by simply representing from 2 to 7 consecutive vectors by the centre vector of the series; determination of the exact grouping is done by finding which grouping yields the least error per omitted vector. This calculation is performed recursively while vectors for the word are still incoming. The resulting template is typically compressed to approximately 1/3 of the uncompressed length. The original number of vectors (the uncompressed length) is stored in the machine, along with the error per unit vector due to compression, the remaining number of vectors, and the remaining vectors themselves.
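The exact grouping recursion is proprietary and is not published. The following Python sketch shows one greedy reading of the description above; the squared-error measure and the always-compress policy are our assumptions.

    import numpy as np

    def compress_template(vectors):
        """Greedy sketch of SR100-style template compression: replace runs of
        2 to 7 consecutive 16-band vectors (an array of shape (n, 16)) by the
        run's centre vector, choosing at each step the run length with the
        least error per omitted vector."""
        out, i = [], 0
        while i < len(vectors):
            best_len, best_err, best_centre = 1, np.inf, vectors[i]
            for run_len in range(2, 8):
                if i + run_len > len(vectors):
                    break
                run = vectors[i : i + run_len]
                centre = run[run_len // 2]
                err = np.sum((run - centre) ** 2) / (run_len - 1)
                if err < best_err:
                    best_err, best_len, best_centre = err, run_len, centre
            out.append(best_centre)
            i += best_len
        # the uncompressed length is returned alongside the compressed vectors,
        # since the unit stores it for use during recognition
        return np.array(out), len(vectors)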
Table A1.1
Frequency Response of the Front End Filters on the SR100

Freq.                              Filter number
(Hz)     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
 250   100  100   49    0    0    0    0    0    0    0    0    0    0    0    0    0
 300   100  100   49    2    0    0    0    0    0    0    0    0    0    0    0    0
 400    97   97   48    7    0    0    0    0    0    0    0    0    0    0    0    0
 500    47   81   82   40    0    0    0    0    0    0    0    0    0    0    0    0
 600    19   81   82   66    1    0    0    0    0    0    0    0    0    0    0    0
 700     8   46   74   74   37    1    2    0    0    1    0    2    1    1    1    2
 800     4   17   78   78   39    2    1    3    0    3    6    1    3    5    4    4
 900     3    7   39   76   74    7    2    3    4    1    9    9    2    2    9    3
1000     1    3   15   68   68   34    3    1    5    3    2   18   10    2    7    9
1100     1    1    3   14   50   68   34    5    2    2    4    1   15   36    9    4
1200     0    0    1    6   22   55   55   27    2    0    3    3    3   38   31    2
1300     0    0    1    4   10   49   64   37    6    2    1    5    2   14   48    7
1400     0    0    0    1    4   20   54   54   27    2    0    3    3    4   54   21
1500     0    0    0    1    2    8   47   53   34    5    1    1    4    1   45   46
1600     0    0    0    1    2    5   19   56   56   28    2    1    3    2   24   49
1700    26    0    0    0    1    2    8   55   55   27    3    1    1    3   11   55
1800    24    0    0    0    0    1    4   29   51   51   25    1    1    4    5   51
1900     0    0    0    0    0    1    2   18   67   67   33    3    1    3    4   50
2000     0    0    0    0    0    1    2   10   75   75   42    6    2    2    5   28
2100     0    0    0    0    0    0    1    5   37   74   74   37    2    1    6   13
2200     0    0    0    0    0    0    1    3   21   82   82   41    4    1    6    8
2300     0    0    0    0    0    0    0    2   12   87   87   43    6    2    5    6
2400     0    0    0    0    0    0    1    2    7   51   84   79   11    3    4    6
2500     0    0    0    0    0    0    0    1    4   28   82   82   41    3    3    6
2600     0    0    0    0    0    0    0    1    3   17   86   86   43    5    4    7
2700     0    0    0    0    0    0    0    1    3   10   82   86   51    8    5    5
2800     0    0    0    0    0    0    0    0    2    7   51   86   82   11    6    4
2900     0    0    0    0    0    0    0    0    1    4   29   83   83   41    7    3
3000     0    0    0    0    0    0    0    0    1    3   19   86   86   43   10    3
3100     0    0    0    0    0    0    0    0    1    2   12   88   88   44   14    3
3200     0    0    0    0    0    0    0    0    1    2    8   58   90   67   19    4
3300     0    0    0    0    0    0    0    0    0    1    6   36   81   81   40    4
3400     0    0    0    0    0    0    0    0    0    1    4   26   86   86   43    5
3500     0    0    0    0    0    0    0    0    0    1    3   17   88   88   48    7
3600     0    0    0    0    0    0    0    0    0    1    2   11   83   83   61    8
3700     0    0    0    0    0    0    0    0    0    0    2    8   61   84   84   11
3800     0    0    0    0    0    0    0    0    0    0    1    6   39   81   81   40
3900     0    0    0    0    0    0    0    0    0    0    1    5   29   86   86   43
4000     0    0    0    0    0    0    0    0    0    0    1    4   21   90   90   45
4100     0    0    0    0    0    0    0    0    0    0    1    3   15   93   93   46
4200     0    0    0    0    0    0    0    0    0    0    1    3   11   94   96   48
4300     0    0    0    0    0    0    0    0    0    0    1    2   10   72  103   60
4400     0    0    0    0    0    0    0    0    0    0    1    2    8   53  104   79
4500     0    0    0    0    0    0    0    0    0    0    1    2    6   38  103  100
4600    42    0    0    0    0    0    0    0    0    0    1    1    4   23   89   89
4700    44    0    0    0    0    0    0    0    0    0    0    1    3   18   93   93
4800    44    0    0    0    0    0    0    0    0    1    1    1    3   11   93   93
4900    46    0    0    0    0    0    0    0    0    0    0    1    2   10   96   96
5000    46    0    0    0    0    0    0    0    0    0    0    0    2    8   98   98

Compression performs two functions: first, it reduces the amount of memory required to store the reference data; second, it reduces the amount of calculation to be performed when the matching process for recognition is executed. Note that the number of vectors each compressed vector represents is not stored in the template; only the total uncompressed length is.

When the SR100 is in recognition mode, the vectors for the incoming word are compared to the vectors of each of the reference templates dynamically, while the remainder of the incoming word is still being input. If the accumulated vector distance from any given template to the input becomes too great, that template is no longer considered. The distance of each input vector is computed for a range of allowable reference vectors within each template, and the reference vector yielding the smallest distance is selected. The range is determined by the uncompressed length of the template and by the position of the incoming vector; this effectively performs time warping. Note that if the distance calculations were performed using the uncompressed data, then to cover the same allowable range, the distance between each input vector and the omitted vectors of each reference template within the allowable range would also have to be computed.
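A sketch of this matching scheme follows, in Python. The band width, the absolute-difference distance measure and the omission of early pruning are our assumptions, since the unit's actual parameters are proprietary.

    import numpy as np

    def template_distance(inp, tmpl, comp_err, band=0.2):
        """Score an input vector series against one compressed template:
        each input vector takes the smallest distance to a reference vector
        inside a band around its time-aligned position, and the template's
        stored compression error per unit vector is subtracted at the end."""
        n_in, n_t = len(inp), len(tmpl)
        w = max(1, int(band * n_t))            # warping range in compressed vectors
        total = 0.0
        for j, v in enumerate(inp):
            pos = (j * n_t) // n_in            # nominal time-aligned position
            lo, hi = max(0, pos - w), min(n_t, pos + w + 1)
            total += min(np.sum(np.abs(v - r)) for r in tmpl[lo:hi])
        return total / n_in - comp_err         # distance per unit vector, corrected

    # The input word is assigned to the template with the smallest corrected
    # distance; with two templates per word (Mode 3), a word's score is simply
    # the better of its two templates.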
This is where the computational saving mentioned in the previous paragraph is realized.

The total vector distance between the input word and each reference template is accumulated and, when the input word is completed, the total error is divided by the total number of vectors input, yielding a distance per unit vector for each template. In order to reduce the error introduced by omitting reference vectors, the error per unit vector for each template is subtracted from this distance measure. The result is then used for selecting between the various reference templates: the template with the smallest number is matched to the input word.

APPENDIX 2: RESULTS OF THE LIVE TESTS

Table A2.1
Statistics for Test 1: Vowels Said Normally

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR          82.812        10.938           6.250            3.125
EC          76.562        21.875           1.562           12.500
JR          64.062        29.688           6.250            6.250
MA          65.625        10.938          23.438            0.000
RF          93.750         6.250           0.000            3.125
Average     76.562        15.938           7.500            5.000

Table A2.2
Statistics for Test 2: Vowels With Mike Moved

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR          62.500        20.312          17.187            4.688
EC          53.125        26.562          20.312            7.812
JR          42.188        28.125          29.688            7.812
MA          39.063         4.688          56.250            0.000
RF          73.438        20.312           6.250           10.938
Average     54.062        20.000          25.938            6.250

Table A2.3
Statistics for Test 3: Vowels Said Slowly

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR          28.125        17.187          54.688            3.125
EC           7.812         6.250          85.938            0.000
JR          71.875         9.375          18.750            6.250
MA          65.625         6.250          28.125            1.562
RF           0.000         6.250          93.750            0.000
Average     34.688         9.062          56.250            2.188

Table A2.4
Statistics for Test 4: Vowels Said Quickly

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR          73.438        25.000           1.562           10.938
EC          53.125        45.312           1.562           15.625
JR          45.312        35.937          18.750            4.688
MA          46.875        20.312          32.812            3.125
RF          81.250        18.750           0.000           10.938
Average     60.000        29.062          10.938            9.062

Table A2.5
Statistics for Test 5: Vowels With Interrogative Intonation

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR          82.812        15.625           1.562            9.375
EC          56.250        23.438          20.312            4.688
JR          82.812        15.625           1.562            7.812
MA          67.188        14.062          18.750            3.125
RF          78.125        17.187           4.688           12.500
Average     73.438        17.187           9.375            7.500

Table A2.6
Statistics for Test 6: Consonants (Group 1)

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR          62.500        33.333           4.167            9.375
EC          59.375        40.625           0.000           16.667
JR          52.083        47.917           0.000           11.458
MA          38.542        41.667          19.792            7.292
RF          50.000        50.000           0.000           15.625
Average     52.500        42.708           4.792           12.083

Table A2.7
Statistics for Test 7: Consonants (Group 2)

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR          57.292        41.667           1.042           18.750
EC          58.333        40.625           1.042           15.625
JR          61.458        34.375           4.167            7.292
MA          62.500        30.208           7.292            9.375
RF          75.000        23.958           1.042           10.417
Average     62.917        34.167           2.917           12.292

Table A2.8
Confusion Matrix for Test 1: Vowels Said Normally (All Subjects)
(Rows: word #; columns: recognized word # 0-15 and rejections.)
[The grid alignment of this and the following confusion matrices did not survive in this copy; entries are reproduced in row order.]

0 15 1 3 1 1 12 4 • . . . 1 3 2 17 2 • . . . 1 . 3 19 . . . . 1 4 12 2 2 . . • • 5 19 . . . . . 1 6 15 . . . 1 . 4 7 2 11 • . 2 . . . 8 14 3 . . 1 . 2 9 1 13 . . 3 . 3 10 1 . 17 . . 2 11 • . 1 17 • . 2 12 4 • • • 15 • . 1 13 15 • . . 14 1 . . 1 17 . 1 15 • • • 17 3

Table A2.9
Confusion Matrix for Test 2: Vowels with Mike Moved (All Subjects)
(Rows: word #; columns: recognized word # 0-15 and rejections.)

0 10 • . . . . . 1 2 7 1 13 1 • • • 2 4 2 1 13 • • • 2 4 3 2 15 . . • • 3 4 • 9 • 5 3 1 . 2 5 . . 15 • • • • 5 6 1 . • 11 . . 1 1 .
6 7 . 2 . 5 6 . . . . 6 8 • • . 1 . 3 4 1 . 9 9 • . • 2 . . 13 1 . 4 10 . . . 1 . • • 13 . 6 11 . . . . . • • 4 9 7 12 . 1 . 3 1 2 . 6 13 . . . . . . 12 1 3 14 . 3 . . 1 12 • 4 15 1 . . . • • • • • 12 7

Table A2.10
Confusion Matrix for Test 3: Vowels Said Slowly (All Subjects)
(Rows: word #; columns: recognized word # 0-15 and rejections.)

0 6 . . . . . . . 14 1 7 . . . . . 1 2 10 2 1 9 1 . . . 9 3 . 5 . . . . 2 . 13 4 6 . . 4 3 . 7 5 . 7 . . . . 13 6 1 . 5 1 3 . 8 7 2 . . 6 1 . 11 8 9 . . 11 9 1 . 10 10 . 14 11 . . 13 12 9 . 11 13 . 12 . 8 14 9 . 11 15 . . 3 17

Table A2.11
Confusion Matrix for Test 4: Vowels Said Quickly (All Subjects)
(Rows: word #; columns: recognized word # 0-15 and rejections.)

0 8 6 . . . . . . . . 4 2 1 15 2 . . . . . . . 2 1 2 16 . . . . . . . 1 3 3 3 15 1 . . . . . . . 1 4 11 8 1 . . . . . . . 5 . 15 . . . . . . . . 5 6 1 15 . . 1 . . . . 3 7 7 4 6 . 1 . . . . 2 8 . . . 0 8 . . . 2 . 2 9 . 15 . . . 3 . 2 10 . 7 . . . 8 . . . . 5 11 . 1 . . . 1 14 . . 4 12 2 10 2 . . . 1 5 . . . 13 4 . . . . . . . . 12 2 2 14 . 1 2 1 . . . 1 . 13 . 2 15 2 1 . . . . . . . . . 16 1

Table A2.12
Confusion Matrix for Test 5: Vowels with Interrogative Intonation (All Subjects)
(Rows: word #; columns: recognized word # 0-15 and rejections.)

0 13 . . . . 1 . 6 1 16 1 . . . 3 2 17 2 . . 1 3 19 . . 1 . . . 4 . 10 5 1 . . 1 5 17 . 2 6 19 . 1 7 10 . . 8 11 6 1 . 2 9 3 3 10 1 . 3 10 1 16 . . . 3 11 . 1 17 . . 2 12 3 14 . . 1 13 1 . 16 . 2 14 1 3 . . 13 . 1 15 1 . . . . . 17 2

Table A2.13
Confusion Matrix for Tests 1 Through 5 Combined (All Subjects)
(Rows: word #; columns: recognized word # 0-15 and rejections.)

0 52 6 1 2 9 30 1 63 8 1 7 21 2 2 72 5 4 17 3 5 73 1 . . 1 . 2 . 18 . 4 48 20 15 2 2 . 3 . 10 5 73 . 26 6 2 65 1 2 3 3 1 . 22 7 21 11 39 1 . 3 6 . . 19 8 1 . 45 21 1 5 1 26 9 5 . 5 56 1 . . 11 . 22 10 . 10 . 60 . . . . 30 11 . 1 . . 7 64 . . . 28 12 4 20 4 2 1 50 . . 19 13 3 11 67 . 3 15 14 . . 5 3 3 4 . 2 64 . 19 15 1 2 2 65 30

Table A2.14
Confusion Matrix for Test 6: Consonants (Group 1) (All Subjects)
(Rows: word #; columns: recognized word # 0-23 and rejections.)

0 15 . 2 1 2 1 2 16 3 . . . 17 1 . 2 4 . 1 . 1 9 . 1 1 1 1 . 1 . . . 1 2 5 . 1 . 2 5 7 1 1 6 3 14 . 1 7 . 1 . . 2 . . 8 2 1 2 2 8 . . . 2 . . 1 1 9 2 1 1 . . 1 1 . 1 9 . . . 3 . . . . 1 12 10 . . . 1 2 6 5 . . . . 3 1 11 . 1 2 . 2 12 . . . 1 . . . . 1 . . . 1 12 4 . 3 10 13 . 1 1 1 2 1 1 1 12 14 12 5 . . . 1 1 . . . . 15 1 . 6 8 . . . 1 1 . . 3 . 16 4 1 . . . 1 . . . 6 3 . . 2 . . . 1 17 . 1 1 1 . . 1 . . 1 1 2 7 . . 3 2 . . . 18 1 13 . 1 1 . . 3 19 4 2 . . . 11 . . . 1 2 20 1 1 . . . 2 . 1 . 2 4 . 1 7 . . . 1 21 . 2 1 1 . . . . 1 3 . . 5 7 . . . 22 1 1 9 . 3 23 1 5 . . . 4 . . 2 8 .

Table A2.15
Confusion Matrix for Test 7: Consonants (Group 2) (All Subjects)
(Rows: word #; columns: recognized word # 0-23 and rejections.)

0 18 . . . . . . . . . . . . . . 2 1 12 . . . . 1 3 . 1 . . . 3 . . . . . 2 18 . . . . . . . 1 . . 1 . . 3 . 20 . . . . . . . . . . . . 4 . . 13 2 . . . . . . . . 2 . . . . 5 . . . 15 1 . . 1 1 . . . . 1 . . 1 6 . . . . 14 . . 2 . . . . . 4 7 . . . . 6 . 1 . 4 1 . . 2 . . 8 . 1 . . 11 1 1 . . 1 3 . . . . 2 9 . 1 . . . 13 . 2 1 . . 1 1 . . . . . 10 . . . 15 . . . 1 1 . . . 2 . . . 11 . 1 . . . 1 11 . 1 1 . . . 3 . . 1 12 . . . . . 5 2 12 . . . . . 1 . . . 13 . . . . . . 16 . . . . 1 1 . . . 14 . . . . . . 11 8 . . 1 . . . . . 15 . . . . . . . . 4 15 . . 1 . . . . . 16 . 1 . . . . 1 . . . 9 . . 1 1 . .
1 17 . . . 2 . 1 2 . • 1 1 9 • « • 4 • • 18 . . . . . . . . 1 • • 3 9 • • * V . 1 19 . . . . . . 3 6 • • . i i . • • « • 20 . . . . . . 1 . • • 1 1 . . 10 7 . • • 21 . . . . 1 . • 1 1 3 . . . 14 . « • 22 . . . . . . • • • 2 6 . . . 11 . 1 23 • • . . . • • . . . • • • • 1 7 • . . 2 . • * 9 1 / 84 Table A2.16 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Wrd Was Recognized as Word # # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej 0 33 • 2 1 27 • 2 • • . 1 3 2 2 34 3 • 37 1 • 2 4 1 1 22 2 1 1 1 1 2 1 1 2 # 2 # . 1 5 1 2 5 22 2 1 • 1 • 3 . 1 • • # 1 • m . 1 6 3 28 • 1 2 7 1 • 2 . • 14 2 2 6 6 3 • 2 # . 2 8 • 3 1 1 20 3 1 2 . . 1 1 4 9 • 4 1 • 1 25 • • 5 • 2 * 1 # 10 • 1 1 . • 21 5 • 4 2 • • 2 . 2 11 1 1 • 2 1 2 23 2 1 • 1 3 # . 2 12 9 • 5 22 . 3 • • • 1 13 1 1 1 2 3 1 1 28 1 1 • • 14 23 13 • 2 1 • • • 15 1 . 10 23 2 1 • 3 • 16 • 4 1 1 1 . * • 17 3 • 9 1 . 2 17 1 3 1 3 • « 1 2 3 16 m 3 6 « • 16 4 22 * 1 1 4 19 7 8 • • 22 • • 1 2 20 1 1 1 2 . 1 • 3 5 1 17 7 . 1 21 2 1 1 • • • 1 2 6 • 5 21 # m # 22 1 1 2 12 • • • 20 . 4 23 2 12 • • • 6 • • 2 17 1 / 85 Table A2.17 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 for the Vowels max. error per word =25% max. error per word =20% max. error per word = 15% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 ( IY) 18.00 0 18.00 0 15 6.50 1 1 ( I ) 16.00 1 16.00 1 2 13 7.67 2 2 (E) 11.00 2 11.00 3 9.00 3 3 (AE) 9.00 3 9.00 4 6 7 8 9 12 14 2.43 4 4 6 7 12 (A UH OW AU) 4.50 4 6 7 12 4.50 5 1.00 5 5 (ER) 1.00 5 1.00 10 10.00 6 8 9 (00 U) 12.50 8 9 14 7.33 11 8.00 7 10 (Al) 10.00 10 10.00 - -8 11 (01) 8.00 11 8.00 - -9 13 (EI) 18.00 13 18.00 - -10 14 (0U) 17.00 15 5.00 - -11 15 (JU) 5.00 - - - -/ 86 Table A2.18 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants max. error per word =45% max. error per word = 40% max. error per word =35% Grp # words grouped group err. (%) words grouped group err.(%) words grouped group err.(%) 0 0 (W) 5.00 0 5.00 0 5.00 1 1 (L) 27.50 1 27.50 1 27.50 2 2 (R) 12.50 2 12.50 2 12.50 3 3 (Y) 7.50 3 7.50 3 7.50 4 4 (M) 42.50 4 5 7 8 10 11 16 17 20 21 11.75 4 5 7 8 10 11 16 17 20 2 11.75 5 5 (N) 42.50 6 17.50 6 17.50 6 6 (NG) 17.50 9 12 23.75 9 12 23.75 7 7 10 16 17 20 21 (B P V TH F THE) 16.25 13 30.00 13 30.00 8 8 (D) 45.00 14 15 23 13.33 14 15 19 23 3.75 9 9 (G) 37.50 18 35.00 18 22 15.00 10 11 (T) 37.50 19 40.00 - -11 12 (K) 45.00 22 40.00 - -12 13 (H) 30.00 - - - -13 14 (DZH) 42.50 - - - -14 15 23 (TSH SH) 30.00 - - - -15 18 (Z) 35.00 - - - -16 19 (ZH) 40.00 - - - -17 22 (S) 40.00 - - - -APPENDIX 3: RESULTS OF THE RECORDED AND MODIFIED TESTS 87 / 88 Table A3.1 Statistics for Test 1 Vowels Said Normally % % % % Correct Subject Mode Correct Misrecognized Rejected Runner-ups DR 0 79.688 12.500 7.812 4.688 1 81.250 15.625 3.125 6.250 2 81.250 15.625 3.125 4.688 3 83.333 16.667 0.000 12.500 4 85.417 14.583 0.000 10.417 EC 0 64.062 31.250 4.688 9.375 1 85.938 12.500 1.562 12.500 2 87.500 10.938 1.562 9.375 3 81.250 16.667 2.083 14.583 4 87.500 10.417 2.083 10.417 HW 0 92.188 6.250 1.562 4.688 1 92.188 7.812 0.000 4.688 2 93.750 6.250 0.000 3.125 3 93.750 6.250 0.000 4.167 4 97.917 2.083 0.000 0.000 JR 0 79.688 20.312 0.000 18.750 1 82.812 17.187 0.000 12.500 2 85.938 14.062 0.000 7.812 3 93.750 6.250 0.000 2.083 4 91.667 8.333 0.000 . 
6.250 RF 0 87.500 12.500 0.000 6.250 1 90.625 7.812 1.562 4.688 2 92.188 6.250 1.562 4.688 3 89.583 8.333 2.083 6.250 4 87.500 10.417 2.083 4.167 Average 0 80.625 16.562 2.812 8.750 1 86.562 12.188 1.250 8.125 2 88.125 10.625 1.250 5.938 3 88.333 10.833 0.833 7.917 4 90.000 9.167 0.833 6.250 / 89 Table A3.2 Statistics for Test 2 Vowels With Mike Moved % % % % Correct Subject Mode Correct Misrecognized Rejected Runner-ups DR 0 85.000 12.500 2.500 6.250 1 86.250 11.250 2.500 6.250 2 81.250 15.000 3.750 10.000 3 82.500 13.750 3.750 8.750 4 91.250 8.750 0.000 8.750 EC 0 43.750 32.500 23.750 12.500 1 62.500 23.750 13.750 13.750 2 62.500 22.500 15.000 11.250 3 71.250 17.500 11.250 6.250 4 68.750 21.250 10.000 10.000 HW 0 47.500 13.750 38.750 5.000 1 47.500 12.500 40.000 6.250 2 48.750 11.250 40.000 3.750 3 53.750 13.750 32.500 7.500 4 55.000 15.000 30.000 8.750 JR 0 67.500 16.250 16.250 8.750 1 63.750 16.250 20.000 8.750 2 66.250 13.750 20.000 6.250 3 85.000 8.750 6.250 5.000 4 88.750 8.750 2.500 5.000 RF 0 72.500 21.250 6.250 15.000 1 82.500 13.750 3.750 7.500 2 81.250 13.750 5.000 7.500 3 83.750 16.250 0.000 8.750 4 83.750 16.250 0.000 6.250 Average 0 63.250 19.250 17.500 9.500 1 68.500 15.500 16.000 8.500 2 68.000 15.250 16.750 7.750 3 75.250 14.000 10.750 7.250 4 77.500 14.000 8.500 7.750 / 90 Table A3.3 Statistics for Test 3 Vowels Said Slowly % % % % Correct Subject Mode Correct Misrecognized Rejected Runner-ups DR 0 13.750 3.750 82.500 1.250 1 76.250 8.750 15.000 3.750 2 73.750 11.250 15.000 7.500 3 77.500 10.000 12.500 7.500 4 95.000 5.000 0.000 5.000 EC 0 1.250 2.500 96.250 0.000 1 71.250 28.750 0.000 12.500 2 71.250 28.750 0.000 16.250 3 72.500 27.500 0.000 12.500 4 77.500 22.500 0.000 8.750 HW 0 2.500 12.500 85.000 0.000 1 96.250 3.750 0.000 1.250 2 97.500 2.500 0.000 0.000 3 97.500 2.500 0.000 2.500 4 96.250 3.750 0.000 3.750 JR 0 12.500 17.500 70.000 0.000 1 86.250 13.750 0.000 10.000 2 86.250 13.750 0.000 10.000 3 88.750 11.250 0.000 10.000 4 90.000 10.000 0.000 8.750 RF 0 20.000 8.750 71.250 1.250 1 88.750 11.250 0.000 8.750 2 91.250 8.750 0.000 5.000 3 87.500 12.500 0.000 8.750 4 88.750 11.250 0.000 6.250 Average 0 10.000 9.000 81.000 0.500 1 83.750 13.250 3.000 7.250 2 84.000 13.000 3.000 7.750 3 84.750 12.750 2.500 8.250 4 89.500 10.500 0.000 6.500 / 91 Table A3.4 Statistics for Test 4 Vowels Said Quickly % % % % Correct Subject Mode Correct Misrecognized Rejected Runner-ups DR 0 41.250 27.500 31.250 2.500 1 91.250 8.750 0.000 6.250 2 90.000 10.000 0.000 8.750 3 91.250 8.750 0.000 6.250 4 91.250 8.750 0.000 6.250 EC 0 38.750 60.000 1.250 7.500 1 68.750 31.250 0.000 12.500 2 68.750 31.250 0.000 15.000 3 68.750 31.250 0.000 16.250 4 70.000 30.000 0.000 12.500 HW 0 23.750 53.750 22.500 3.750 1 85.000 15.000 0.000 5.000 2 83.750 16.250 0.000 6.250 3 92.500 7.500 0.000 5.000 4 92.500 7.500 0.000 7.500 JR 0 55.000 33.750 11.250 11.250 1 83.750 16.250 0.000 11.250 2 87.500 12.500 0.000 7.500 3 90.000 10.000 0.000 5.000 4 86.250 13.750 0.000 8.750 RF 0 55.000 40.000 5.000 7.500 1 92.500 7.500 0.000 6.250 2 91.250 8.750 0.000 6.250 3 90.000 10.000 0.000 6.250 4 88.750 11.250 0.000 7.500 Average 0 42.750 43.000 14.250 6.500 1 84.250 15.750 0.000 8.250 2 84.250 15.750 0.000 8.750 3 86.500 13.500 0.000 7.750 4 85.750 14.250 0.000 8.500 / 92 Table A3.5 Statistics for Test 5 Vowels With Interogative Intonation % % % % Correct Subject Mode Correct Misrecognized Rejected Runner-ups DR 0 70.000 15.000 15.000 10.000 1 76.250 6.250 17.500 2.500 2 77.500 8.750 13.750 5.000 3 78.750 10.000 11.250 
8.750 4 96.250 3.750 0.000 3.750 EC 0 36.250 42.500 21.250 11.250 1 62.500 32.500 5.000 17.500 2 62.500 33.750 3.750 16.250 3 72.500 25.000 2.500 12.500 4 67.500 30.000 2.500 17.500 HW 0 78.750 13.750 7.500 3.750 1 91.250 8.750 0.000 6.250 2 90.000 10.000 0.000 6.250 3 92.500 7.500 0.000 7.500 4 91.250 8.750 0.000 7.500 JR 0 63.750 25.000 11.250 7.500 1 81.250 16.250 2.500 10.000 2 81.250 15.000 3.750 7.500 3 92.500 6.250 1.250 0.000 4 91.250 8.750 0.000 5.000 RF 0 72.500 22.500 5.000 10.000 1 78.750 20.000 1.250 11.250 2 68.750 22.500 8.750 16.250 3 73.750 23.750 2.500 7.500 4 76.250 21.250 2.500 3.750 Average 0 64.250 23.750 12.000 8.500 1 78.000 16.750 5.250 9.500 2 76.000 18.000 6.000 10.250 3 82.000 14.500 3.500 7.250 4 84.500 14.500 1.000 7.500 / 93 Table A3.6 Statistics for Test 6 Consonants (Group 1) % % % % Correct Subject Mode Correct Misrecognized Rejected Runner-ups DR 0 45.833 50.000 4.167 8.333 1 62.500 36.458 1.042 9.375 2 61.458 37.500 1.042 11.458 3 56.944 41.667 1.389 26.389 4 69.444 30.556 0.000 19.444 EC 0 37.500 55.208 7.292 12.500 1 50.000 44.792 5.208 6.250 2 46.875 47.917 5.208 15.625 3 48.611 48.611 2.778 15.278 4 52.778 43.056 4.167 18.056 HW 0 56.250 43.750 0.000 21.875 1 66.667 31.250 2.083 16.667 2 65.625 33.333 1.042 17.708 3 72.222 27.778 0.000 18.056 4 79.167 20.833 0.000 8.333 JR 0 59.375 40.625 0.000 9.375 1 57.292 41.667 1.042 14.583 2 58.333 40.625 1.042 17.708 3 76.389 23.611 0.000 6.944 4 76.389 22.222 1.389 6.944 RF 0 54.167 45.833 0.000 9.375 1 45.833 51.042 3.125 15.625 2 47.917 50.000 2.083 15.625 3 61.111 38.889 0.000 9.722 4 56.944 43.056 0.000 15.278 Average 0 50.625 47.083 2.292 12.292 1 56.458 41.042 2.500 12.500 2 56.042 41.875 2.083 15.625 3 63.056 36.111 0.833 15.278 4 66.944 31.944 1.111 13.611 / 94 Table A3.7 Statistics for Test 7 Consonants (Group 2) % % % % Correct Subject Mode Correct Misrecognized Rejected Runner-ups DR 0 72.917 27.083 0.000 14.583 1 71.875 28.125 0.000 13.542 2 73.958 26.042 0.000 15.625 3 86.111 ' 13.889 0.000 5.556 4 84.722 15.278 0.000 8.333 EC 0 42.708 55.208 2.083 19.792 1 65.625 34.375 0.000 14.583 2 54.167 45.833 0.000 23.958 3 68.056 31.944 0.000 12.500 4 83.333 16.667 0.000 11.111 HW 0 77.083 22.917 0.000 18.750 1 77.083 22.917 0.000 12.500 2 76.042 23.958 0.000 14.583 3 83.333 16.667 0.000 9.722 4 87.500 12.500 0.000 8.333 JR 0 77.083 21.875 1.042 15.625 1 80.208 18.750 1.042 12.500 2 80.208 18.750 1.042 13.542 3 83.333 16.667 0.000 8.333 4 90.278 9.722 0.000 8.333 RF 0 45.833 54.167 0.000 18.750 1 42.708 57.292 0.000 12.500 2 43.750 55.208 1.042 15.625 3 54.167 45.833 0.000 22.222 4 59.722 40.278 0.000 13.889 Average 0 63.125 36.250 0.625 17.500 1 67.500 32.292 0.208 13.125 2 65.625 33.958 0.417 16.042 3 75.000 25.000 0.000 11.667 4 81.111 18.889 0.000 10.000 Table A3.8 Confusion Matrix for Tests 1 Through 5 Combined (All Subjects) Mode 0 Word # 0 1 Was 2 3 Recognized as Word 4 5 6 7 8 9 # 10 11 12 13 14 15 Rej 0 56 6 4 3 51 1 5 60 6 7 1 41 2 24 54 4 • • • • 1 . 2 . 35 3 1 11 69 • • 3 . • 5 1 . . 30 4 * 44 . 16 14 1 5 . 7 10 . 23 5 . 73 3 • 1 3 2 . 2 . 36 6 • 10 . 68 3 6 3 1 2 4 . 23 7 • 24 . 10 41 . 1 1 1 10 11 . 21 8 1 • 48 27 . 2 . 7 4 31 9 • 1 1 12 • 9 65 . 3 1 3 2 23 10 • 8 • • 1 75 2 . . . 34 11 1 1 • 6 16 58 1 3 . 34 12 . 1 . 9 4 • 1 2 1 69 7 . 26 13 14 . 34 14 • 1 . 4 5 1 9 • 6 4 62 . 28 15 14 • • • • 1 • 1 • * 66 38 Table A3.9 Confusion Matrix for Tests 1 Through 5 Combined (All Subjects) Mode 1 Word Was Recognized as Word # # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej 0 91 . 25 1 1101 6 . 
1 1 10 2 12 93 7 . 8 3 .10107 4 . . 80 . 12 14 1 3 3 7 • • 5 6 ' 4 . 99 5 5 2 1 3 • • 7 44 . 1 57 . 3 6 9 8 2 . 78 29 • . 3 1 7 9 2 1 6 1 8 92 7 . 3 10 • • 1 . . .109 4 . . 6 11 7102 . 11 12 1 . 2 1 . 1 .107 6 . 2 13 1 14 3 12 2 6 . • • • 93 . 4 15 3 1 L06 10 Table A3.10 Confusion Matrix for Tests 1 Through 5 Combined (All Subjects) Mode 2 Word Was Recognized as Word # # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej 0 89 6 . 2 23 1' 1102 2 12 92 7 . 9 3 11106 . 3 4 77 . 8 20 1 1 2 4 6 . 1 5 .115 . 5 6 8 . 97 4 • 2 1 2 4 7 43 . 1 59 1 2 6 8 8 1 , 80 27 • 4 . 8 9 3 1 4 3 9 91 • 6 . 3 10 . . 1 • .108 4 . 7 11 8101 • . 11 12 2 . 1 1 . • 113 3 . . 13 3 .100 . 12 14 13 1 5 • 1 95 . 5 15 2 2 L06 10 Table A3.11 Confusion Matrix for Tests 1 Through 5 Combined (All Subjects) Mode 3 Word Was Recognized as Word # # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej 0 89 7 . 2 17 1 1104 3 . 6 2 4 95 9 3 11102 . 2 4 74 . 6 25 1 1 2 4 2 5 .110 . . 1 1 . 3 6 5 . 101 3 . • 1 4 1 7 28 . 1 72 2 • 7 5 . . 8 76 33 • 2 . 4 9 4 1 7 2 7 84 • 8 . 2 10 . . 2 • .104 4 • . . 5 11 6106 • . 3 12 4 . 1 • • • .110 . . 13 14 4 . 3 5 1 4 • 1 4 89 . 4 15 1 Table A3.12 Confusion Matrix for Tests 1 Through 5 Combined (Al l Subjects) Mode 4 Word Was Recognized as Word # # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Rej 0 99 6 1 9 1 1105 2 1 4 2 4 96 8 3 8105 . 2 4 81 . 8 16 3 2 1 3 1 . . 5 .112 • . 1 . . • . . . 2 6 8 . 95 2 . 1 1 1 5 2 7 30 . 1 73 1 • 1 • 4 5 8 1 , 79 32 . 1 « 2 . . 9 4 1 8 1 5 88 • 7 . 1 10 • • 1 . .109 4 . 1 11 9105 . 1 12 2 . .113 13 .105 . 5 14 4 . 3 7 1 1 • 5 94 15 1 .106 8 / 98 Table A3.13 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 0 Wrd # 0 1 2 3 4 5 6 Was Recognized as Word # 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej 0 29 1 3 • • • 1 • • 1 • • • 4 • • • . 1 1 1 23 1 1 • • • • 1 • 5 5 2 • • • . 1 2 3 2 25 • • • • 1 • • 1 • • • 4 4 • • 3 . . 25 • 1 3 • • 7 • . 1 • 1 1 1 . • # • • 4 • 23 5 1 3 • • 2 * « 1 4 1 « • • • • 5 • 2 20 6 • 1 • 2 1 1 2 4 • 1 # # 6 7 1 3 24 • . 1 • • • • 2 • # , . 2 7 1 . . . 18 1 • 2 • • . 4 5 3 . 3 1 . 1 8 . 1 21 4 1 1 1 1 1 4 3 . 2 • 9 • 3 23 • 1 10 1 . • . 1 10 4 2 • 21 3 . • 2 • 2 . 3 2 11 1 1 1 20 4 1 1 1 . 2 7 12 1 5 * 1 26 • 2 • • . 1 3 13 1 1 1 5 3 25 • • • • . 2 • . 1 14 • 1 • . 2 26 8 • • . 1 • . 2 15 . . . • . 1 12 26 • • • • 1 . # 16 2 . 3 • • • • • • 12 7 4 . 8 4 17 1 1 2 3 1 1 3 17 3 . 2 5 t> • 18 . 1 • 2 • • • 1 • 1 3 25 • • 5 1 1 19 . • . • • • 1 4 • • • 31 . • 4 20 1 • 1 1 • 1 • • 3 . 24 7 2 • • 21 1 1 4 . 1 2 • 1 4 4 . 7 12 2 . 1 22 • • • • • 1 • • 9 . 2 • 26 . 2 23 • • • • • • • • • • • • • 1 6 • • • 7 . • • 24 2 / 99 Table A3.14 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 1 Wrd Was Recognized as Word # # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej 0 37 • 2 1 1 25 1 3 « • • 1 1 2 3 3 2 2 * 31 3 2 1 3 • « • 29 • • 6 • 2 • • • • 9 • 2 • • 4 « • • • 25 2 1 • * 1 2 1 . 2 6 5 « 1 • • 2 25 3 1 1 1 1 1 . 1 6 • • • 3 • 3 30 • • 1 • • • • • 1 1 7 17 2 • 1 2 . • * 3 5 3 6 8 18 1 1 7 . 1 1 # 3 5 2 1 9 32 • 1 2 2 1 1 1 10 * • • • 1 • 7 1 • 17 4 1 • • » 1 1 # 2 5 11 24 3 • 3 2 1 2 # 1 3 12 • • • 1 • • • 6 • 5 22 * 2 3 * 1 13 • 1 • « « 2 1 2 1 1 29 . 2 • • 1 • • 14 • « • • . 1 1 . 1 24 7 • 2 2 # 1 1 15 10 26 2 2 16 17 5 3 • 8 4 17 1 1 3 • 2 2 19 5 • 2 4 . 1 18 3 28 • • • 5 1 . 19 2 . 
20 2 1 1 1 1 • 25 4 2 • • 21 4 • 1 1 1 • • • 1 8 2 5 13 4 • * 22 1 1 • 1 5 • • 22 1 6 23 2 3 • « • 6 • • • 27 2 / 100 Table A3.15 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 2 Wrd Was Recognized as Word # « 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej 0 35 • 1 1 • • 1 • • 1 • • • 1 1 1 26 1 • • • • • • • • 2 3 4 2 2 1 30 • • • 1 • • • • • • 2 4 3 29 • • 4 • 4 • • * • 2 • 4 27 2 1 1 1 1 1 1 1 3 1 5 2 28 1 1 1 2 1 « 2 * 6 5 29 • 1 • « • 4 1 • 7 18 2 2 1 1 • 2 1 4 5 8 1 16 1 1 8 • 3 6 3 1 9 . 4 27 • 1 4 1 1 • 1 . . 10 6 • • 21 2 • • » 3 1 3 4 11 • 3 • 1 20 2 2 3 2 2 3 12 • 6 3 22 1 5 2 1 13 1 1 2 1 30 • 1 1 1 1 14 • 1 1 1 27 7 • 2 1 15 • . 14 24 . • • 16 2 • . 17 5 3 • 11 2 17 • 1 1 20 7 2 5 18 • 3 • 3 29 • 1 « 3 1 . 19 . 4 2 . 28 . • * • 4 1 20 2 2 2 2 23 4 3 • • 21 1 1 2 3 2 • 8 10 5 22 1 • 8 1 1 21 2 6 23 • • • • • • • • 7 • • • 4 • • • 27 2 / 101 Table A3.16 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 3 Wrd Was Recognized as Word # # 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej 0 28 • 1 • • • • • 1 • • • • 1 21 « • 1 1 2 • • • 2 1 1 % m • • « 2 27 • • • • • 1 2 ^ # • • • 3 22 • 1 2 * 2 ^ • • « 4 25 2 . • 9 ^ 5 2 25 . • * 9 1 2 • 6 • 2 25 . 1 • 1 # • • • 7 1 1 . 13 1 1 4 3 2 • • • 8 2 . 13 2 2 2 1 2 1 4 1 # • • • 9 . 1 . 23 . » 4 1 1 a • * • • 10 . . • 22 2 • 1 • 1 1 • • • 11 • 1 21 2 1 2 1 1 • • • 12 8 . . 19 1 1 1 • • • * 13 . 1 3 24 . • 1 • • 4 14 1 • • 1 24 4 #  • • 15 . . 1 * 1 7 17 # 1 • 1 1 1 16 1 2 1 • 13 3 3 4 * • • 17 1 1 . 1 • 3 16 4 • 3 1 . . 18 . 1 # # 2 25 • « 2 . . 19 . . 2 • • 23 • 20 1 1 • 1 1 20 4 l . . 21 1 3 3 # 2 2 4 11 4 . . 22 • . 1 • 6 1 1 20 . 1 23 • • • • • • • • • • • • 5 * • 4 • • . 20 1 / 102 Table A3.17 Confusion Matrix for Tests 6 and 7 Combined (All Subjects) Mode 4 Wrd # 0 1 2 3 4 5 6 Was Recognized as Word # 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Rej 0 28 . 1 • • • • 1 « • • • • • 1 . 24 1 1 • • • • m # 1 • 2 # # • • • • 2 30 • • • • • • • • m m • • • • • • 3 22 1 • 1 * • # • 1 . • • • • 4 . 26 1 . 1 • • 9 2 9 • • « • 5 • • 23 . 2 1 1 # 2 1 m • • • • 6 . 1 . 27 1 • • 1 • m « « • • 7 22 1 3 • 1 # 1 2 • • • * 8 1 . 4 15 1 • 3 1 1 3 • • • • • 9 . 1 25 v • 3 1 m • • • • • 10 2 1 • 20 1 • 2 # 1 • 1 . . . 11 1 1 23 2 2 • 1 . . . 12 5 * 3 21 • 1 • • • • 13 1 1 • 25 # • 1 . . . 14 • 1 1 21 5 • • • • 15 1 1 3 23 • • X I 16 3 1 • 16 4 3 • • • • 17 1 . 1 1 « 3 15 2 # 5 . . . 18 • • • • 1 27 1 • • • • 19 • • 1 • 24 9 • • 5 • 20 1 1 2 1 • • 18 7 . . . 21 2 • 1 1 2 • 5 15 . . . 22 • • • 6 • 2 1 19 . 2 23 • • • • • • • « • • • • 4 • • • 1 • . . 24 1 / 103 Table A3.18 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 for the Vowels Mode 0 max. error per word = 30% max. error per word = 25% max. error per word =20% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (IY) 10.83 0 10.83 0 10.83 1 1 (I) 15.83 1 2 8.33 1 2 8.33 2 2 (E) 25.83 3 17.50 3 17.50 3 3 (AE) 17.50 4 6 7 8 9 14 7.36 4 6 7 8 9 12 14 3.81 4 4 6 7 (A UH OW) 17.50 5 9.17 5 9.17 5 5 (ER) 9.17 10 9.17 10 11 8.75 6 8 9 (00 U) 15.42 11 23.33 13 12.50 7 10 (Al) 9.17 12 20.83 15 13.33 8 11 (01) 23.33 13 12.50 - -9 12 (AU) 20.83 15 13.33 - -10 13 (EI) 12.50 - - - -11 14 (0U) 25.00 - - - -12 15 (JU) 13.33 - - - -/ 104 Table A3.19 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 for the Vowels Mode 1 max. error per word = 25% max. error per word =20% max. 
error per word = 15% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (IY) 3.33 0 3.33 0 3.33 1 1 (I) 7.50 1 7.50 1 2 4.17 2 2 (E) 15.83 2 15.83 3 8.33 3 3 (AE) 8.33 3 8.33 4 6 7 8 9 14 2.92 4 4 7 (A OW) 18.75 4 6 7 12.22 5 0.00 5 5 (ER) 0.00 5 0.00 10 4.17 6 6 (UH) 17.50 8 9 9.58 11 5.83 7 8 9 (00 U) 9.58 10 4.17 12 9.17 8 10 (Al) 4.17 11 5.83 13 6.67 9 11 (01) 5.83 12 9.17 15 3.33 10 12 (AU) 9.17 13 6.67 -11 13 (EI) 6.67 14 19.17 -12 14 (0U) 19.17 15 3.33 -13 15 (JU) 3.33 - - -Table A3.20 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 for the Vowels Mode 2 max. error per word = 25% max. error per word =20% max. error per word = 15% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (IY) 6.67 0 6.67 0 6.67 1 1 (I) 6.67 1 6.67 1 2 4.17 2 2 (E) 15.83 2 15.83 3 9.17 3 3 (AE) 9.17 3 9.17 4 6 7 14 6.46 4 4 7 (A OW) 16.67 4 7 16.67 5 0.00 5 5 (ER) 0.00 5 0.00 8 9 9.17 6 6 (UH) 19.17 6 19.17 10 4.17 7 8 9 (00 U) 9.17 8 9 9.17 11 6.67 8 10 (Al) 4.17 10 4.17 12 5.83 9 11 (01) 6.67 11 6.67 13 6.67 10 12 (AU) 5.83 12 5.83 15 3.33 11 13 (EI) 6.67 13 6.67 - -12 14 (OU) 16.67 14 16.67 - -13 15 (JU) 3.33 15 3.33 - -/ 106 Table A3.21 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 for the Vowels Mode 3 max. error per word = 25% max. error per word = 20% max. error per word = 15% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (IY) 7.83 0 7.83 0 7.83 1 1 (I) 4.35 1 4.35 1 4.35 2 2 (E) 11.30 2 11.30 2 11.30 3 3 (AE) 9.57 3 9.57 3 9.57 4 4 7 (A OW) 13.48 4 7 13.48 4 6 7 8 9 14 3.48 5 5 (ER) 1.74 5 1.74 5 1.74 6 6 (UH) 12.17 6 12.17 10 5.22 7 8 9 (00 U) 10.43 8 9 14 8.99 11 5.22 8 10 (Al) 5.22 10 5.22 12 4.35 9 11 (01) 5.22 11 5.22 13 2.61 10 12 (AU) 4.35 12 4.35 15 0.87 11 13 (EI) 2.61 13 2.61 - -12 14 (0U) 19.13 15 0.87 - -13 15 (JU) 0.87 - - - -/ 107 Table A3.22 Word Groupings According to Maximum Allowable Error Rate Using Tests 1 through 5 for the Vowels Mode 4 max. error per word =25% max. error per word = 20% max. error per word = 15% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (IY) 6.09 0 6.09 0 6.09 1 1 (I) 5.22 1 5.22 1 5.22 2 2 (E) 10.43 2 10.43 2 10.43 3 3 (AE) 6.96 3 6.96 3 6.96 4 4 7 (A OW) 13.04 4 7 13.04 4 6 7 8 9 14 3.62 5 5 (ER) 0.87 5 0.87 5 0.87 6 6 (UH) 17.39 6 8 9 10.14 10 4.35 7 8 9 (00 U) 10.87 10 4.35 11 7.83 8 10 (Al) 4.35 11 7.83 12 1.74 9 11 (01) 7.83 12 1.74 13 4.35 10 12 (AU) 1.74 13 4.35 15 0.87 11 13 (EI) 4.35 14 18.26 -12 14 (OU) 18.26 15 0.87 -13 15 (JU) 0.87 - - -/ 108 Table A3.23 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants Mode 0 max. error per word = 50% max. error per word = 45% max. error per word = 40% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (W) 25.00 0 25.00 0 25.00 1 1 (L) 40.00 1 40.00 1 40.00 2 2 (R) 37.50 2 37.50 2 37.50 3 3 (Y) 37.50 3 37.50 3 37.50 4 4 (M) 42.50 4 42.50 4 5 6 27.50 5 5 (N) 50.00 5 6 31.25 7 8 9 10 11 12 16 17 18 20 21 7.27 6 6 (NG) 35.00 7 8 9 10 11 16 17 18 20 21 11.50 13 35.00 7 7 16 17 20 21 (B V TH F THE) 23.50 12 35.00 14 30.00 8 8 (D) 47.50 13 35.00 15 35.00 9 9 (G) 42.50 14 30.00 19 22.50 10 10 (P) 47.50 15 35.00 22 30.00 11 11 (T) 50.00 19 22.50 23 35.00 12 12 (K) 35.00 22 30.00 -13 13 (H) 35.00 . 
23 35.00 -14 14 (DZH) 30.00 - - - -15 15 (TSH) 35.00 - - - -16 18 (Z) 35.00 - - - -17 19 (ZH) 22.50 - - - -18 22 (S) 30.00 - - - -19 23 (SH) 35.00 - - - -/ 109 Table A3.24 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants Mode 1 max. error per word =50% max. error per word = 45% max. error per word =40% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (W) 7.50 0 7.50 0 7.50 1 1 (L) 37.50 1 37.50 1 37.50 2 2 (R) 22.50 2 22.50 2 22.50 3 3 (Y) 27.50 3 27.50 3 27.50 4 4 (M) 37.50 4 37.50 4 37.50 5 5 (N) 37.50 5 37.50 5 37.50 6 6 (NG) 22.50 6 22.50 6 22.50 7 7 10 17 21 (B P TH THE) 32.50 7 10 16 17 18 20 21 14.29 7 10 16 17 18 20 21 14.29 8 8 11 (D T) 38.75 8 11 38.75 8 11 38.75 9 9 (G) 20.00 9 20.00 9 12 22.50 10 12 (K) 45.00 12 45.00 13 27.50 11 13 (H) 27.50 13 27.50 14 37.50 12 14 (DZH) 37.50 14 37.50 15 30.00 13 15 (TSH) 30.00 15 30.00 19 17.50 14 16 20 (V F) 36.25 19 17.50 22 30.00 15 18 (Z) 30.00 22 30.00 23 27.50 16 19 (ZH) 17.50 23 27.50 - -17 22 (S) 30.00 - - - -18 23 (SH) 27.50 - - - -/ 110 Table A3.25 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants Mode 2 max. error per word =50% max. error per word = 45% max. error per word =40% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (W) 12.50 0 12.50 0 12.50 1 1 (L) 35.00 1 35.00 1 35.00 2 2 (R) 25.00 2 25.00 2 25.00 3 3 (Y) 27.50 3 27.50 3 27.50 4 4 (M) 32.50 4 32.50 4 32.50 5 5 (N) 30.00 5 30.00 5 30.00 6 6 (NG) 25.00 6 25.00 6 25.00 7 7 16 20 21 22 (B V F THE S) 27.00 7 8 10 11 15 16 20 21 22 19.72 7 8 10 11 15 16 17 18 20 21 22 8.18 8 8 11 (D T) 41.25 9 32.50 9 12 26.25 9 9 (G) 32.50 12 45.00 13 25.00 10 10 (P) 47.50 13 25.00 14 32.50 11 12 (K) 45.00 14 32.50 19 27.50 12 13 (H) 25.00 17 18 26.25 23 27.50 13 14 (DZH) 32.50 19 27.50 - -14 15 (TSH) 35.00 23 27.50 - -15 17 (TH) 50.00 - - - -16 18 (Z) 27.50 - - - -17 19 (ZH) 27.50 - - - -18 23 (SH) 27.50 - - - -/ 111 Table A3.26 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants Mode 3 max. error per word =45% max. error per word = 40% max. error per word = 35% Grp # words grouped group err.(%) words grouped group err.(%) words grouped group err.(%) 0 0 (W) 6.67 0 6.67 0 6.67 1 1 (L) 30.00 1 30.00 1 30.00 2 2 (R) 10.00 2 10.00 2 10.00 3 3 (Y) 26.67 3 26.67 3 26.67 4 4 (M) 16.67 4 16.67 4 16.67 5 5 (N) 16.67 5 7 8 9 10 11 13 16 17 18 20 21 22 5.64 5 7 8 9 10 11 13 16 17 18 20 21 22 5.64 6 6 (NG) 16.67 6 16.67 6 16.67 7 7 10 (B P) 31.67 12 36.67 12 36.67 8 8 17 (D TH) 43.33 14 20.00 14 15 11.67 9 9 (G) 23.33 15 40.00 19 23.33 10 11 (T) 30.00 19 23.33 23 30.00 11 12 (K) 36.67 23 30.00 - -12 13 (H) 20.00 - - - -13 14 (DZH) 20.00 - - - -14 15 (TSH) 40.00 - - - -15 16 20 21 22 (V F THE S) 27.50 - - - -16 18 (Z) 16.67 - - - -17 19 (ZH) 23.33 - - - -18 23 (SH) 30.00 - - - -/ 112 Table A3.27 Word Groupings According to Maximum Allowable Error Rate Using Tests 6 and 7 for the Consonants Mode 4 max. error per word = 35% max. error per word =30% max. 
APPENDIX 4: PRELIMINARY STUDY ON THE EFFECTS OF A NOISY ENVIRONMENT ON RECOGNITION PERFORMANCE

Often the environment in which a speech recognition unit is required to operate is far from perfect. Conversations may be present in the background, machinery may be operating, and so on. In situations such as these, the performance of a given recognition unit can and will degrade. The tele-operator group in the E.E. department at UBC was interested in the performance of a speech recognition unit in a heavy-duty machinery environment; specifically, in the cab of a Caterpillar D215 Knuckle Excavator. The D215 is illustrated in Figure A4.1. Towards the end of the term of this study, one of these machines was made available for use by the tele-operator group through Caterpillar and MacMillan-Bloedel Research. Although not enough time remained to allow an in-depth study of the effects of this environment on the performance of a speech recognition unit, or of possible preprocessing techniques to reduce those effects, some noise data files were recorded on the HP9050. The format and gain settings were compatible with the existing voice data files, so that the two could be summed, yielding a simulation of voice files recorded in a noisy environment (a sketch of this mixing step is given below). Some results were tabulated, but the main purpose of this exercise was to provide a preliminary data base for future studies of speech recognition in a noisy environment.

Unfortunately, the test site was limited in that no digging was allowed and no auxiliary activities, such as would be present on a typical work site, were being performed. This had the benefit, though, that any degradation in the results could be directly attributed to noise generated by the machine. In fact, this test configuration touches on a very difficult problem which will have to be addressed when the noisy environment is studied: the problem of isolating and defining specific noise types. One can imagine that a background hum may have a much different effect than a speaker in the background. Regardless, this will not be addressed here.
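The summation just described is simple to reproduce. What follows is a minimal sketch of the mixing step, assuming both records are stored as raw 16-bit linear PCM at a common sampling rate and gain; the file names and the optional noise scale factor are illustrative and are not part of the actual archive format.

import numpy as np

def mix_voice_with_noise(voice_path, noise_path, noise_scale=1.0):
    """Sum a voice record with a background-noise record.

    Assumes both files hold raw 16-bit linear PCM recorded at the same
    sampling rate and gain, so that sample-by-sample addition is valid.
    """
    voice = np.fromfile(voice_path, dtype=np.int16).astype(np.float64)
    noise = np.fromfile(noise_path, dtype=np.int16).astype(np.float64)
    n = min(len(voice), len(noise))              # guard against a length mismatch
    mixed = voice[:n] + noise_scale * noise[:n]
    # Clip to the 16-bit range so the sum is again a valid PCM record.
    mixed = np.clip(mixed, -32768, 32767)
    return mixed.astype(np.int16)

# Hypothetical usage: corrupt one utterance with one of the eighty
# 1.5-second noise records, then write the result back out.
# noisy = mix_voice_with_noise("vowel_00.pcm", "noise_type3_07.pcm")
# noisy.tofile("vowel_00_noisy.pcm")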
1. RECORDING CONFIGURATION

A Shure SM10 close-talking mike was worn by the operator of the machine, and the mike output was recorded on a Yamaha commercial cassette deck. Three different conditions were recorded:

1. The machine at full power idle. The engine of the D215 operates at a more or less constant rpm; one selects the nominal engine speed with a separate throttle. After this is selected, the pitch of the engine will change slightly when different hydraulic loads are placed on the machine (all circuits, including the tracks, are hydraulically driven), but the noise level due to the engine itself remains approximately constant. Full power idle, then, is with the throttle completely open and with no load. Open throttle, or near open throttle, is the normal operating speed.

2. Moving branches back and forth.

3. The operator performing the following operation:
a. Rotate the cab.
b. Raise the arm using the joint closest to the cab.
c. Counter-rotate the cab.
d. Lower the arm using the joint closest to the cab.
e. Repeat from step (a).

This task introduced not only changes in the pitch of the engine, but also a transient clanging noise at the termination of each rotation, due to the joint arms contacting. These tasks were executed until enough data was collected to yield eighty data files of 1.5 seconds each. This data was then replayed in the lab and digitized, forming the background noise templates.

Mode 0 (unmodified data) of a single user's voice data for 'vowels said normally' (Test 1) was corrupted with each of the noise files. The results, shown in Table A4.1, are the results of corrupting both the training mode and the recognition mode data.

Noise Number   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
No Noise       87.500      12.500            0.000        6.250
1              82.812      10.938            6.250        3.125
2              84.375      7.812             7.812        3.125
3              75.000      14.062            10.938       1.562

Table A4.1: Results of Test 1, Mode 0 for RF with Noisy Background Data

The noise data for the third situation (rotate-lift-rotate-lower) was added to all five subjects' data for Test 1, Mode 0, and the results are presented in Table A4.2.

Subject   % Correct   % Misrecognized   % Rejected   % Correct Runner-ups
DR        82.812      14.062            3.125        7.812
EC        25.000      45.312            29.688       4.688
HW        82.812      4.688             12.500       1.562
JR        70.312      21.875            7.812        9.375
RF        82.812      10.938            6.250        3.125
Average   68.750      19.375            11.875       5.312

Table A4.2: Results of Test 1, Mode 0 with Corruption by Noise Type 3

2. COMMENTS ON THE TEST RESULTS

As would be expected, all background noise degraded the performance of the SR100, even the steady-state noise. The amount of degradation varied with the type of noise, with type 3 decreasing the recognition performance the most. It is interesting to note, though, that the decrease in recognition performance was no worse than that caused by any of the other phenomena examined in this thesis. However, one would expect the degradation to be more marked when the D215 is on site. Regardless, the results indicate that the effect of noise is a significant factor and should be examined further.

If the only interest is to obtain results on a speech recognition unit's susceptibility to noise for comparison purposes, then it would be relatively straightforward to implement a test whereby the performance vs. noise level may be tabulated (a sketch of such a test is given at the end of this appendix). Preprocessing algorithms aimed at dynamically reducing the noise in incoming utterances, however, are a much more complicated proposition. Many different algorithms have been investigated, but, according to Lim and Oppenheim [17], the ones which show the most promise are those involving multi-microphone input (e.g. references 13, 14, 20, 22, 35, 36, 37 and 39 in the Bibliography). These, then, are the algorithms of choice for future noise preprocessing work; a two-microphone example is sketched below.
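Such a noise-susceptibility test could be implemented along the following lines. This is a sketch only: the digitized records are assumed to be available as arrays, recognize() is a hypothetical stand-in for feeding an utterance to the unit under test and reading back its decision, and the SNR levels are illustrative.

import numpy as np

def scale_noise_to_snr(voice, noise, snr_db):
    """Scale `noise` so that voice + noise has the requested
    signal-to-noise ratio in dB, measured over the whole record."""
    p_voice = np.mean(voice.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)
    target = p_voice / (10.0 ** (snr_db / 10.0))
    return noise.astype(np.float64) * np.sqrt(target / p_noise)

def tabulate_performance(voice_files, noise, recognize, snr_levels):
    """Return {snr_db: fraction correct}.  `voice_files` maps each true
    word to its samples; `noise` must be at least as long as any record;
    `recognize` stands in for the unit under test."""
    table = {}
    for snr_db in snr_levels:
        correct = 0
        for true_word, voice in voice_files.items():
            scaled = scale_noise_to_snr(voice, noise[:len(voice)], snr_db)
            if recognize(voice + scaled) == true_word:
                correct += 1
        table[snr_db] = correct / len(voice_files)
    return table

# Hypothetical sweep from nearly clean to very noisy conditions:
# results = tabulate_performance(vocab, machine_noise, recognize,
#                                snr_levels=[30, 20, 10, 5, 0])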
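As a concrete starting point for that future work, the following is a minimal sketch of the classical two-microphone adaptive noise canceller with a normalized LMS weight update, one representative of the multi-microphone family cited above; the tap count and step size are illustrative, not tuned values.

import numpy as np

def lms_noise_canceller(primary, reference, taps=32, mu=0.05, eps=1e-8):
    """Two-microphone adaptive noise cancellation (normalized LMS).

    `primary` carries speech plus noise (e.g. the close-talking mike);
    `reference` carries noise correlated with the noise in `primary`
    (e.g. a second mike mounted away from the talker's mouth).  An FIR
    filter is adapted so that its output predicts the noise component
    of the primary channel; the prediction error is the enhanced speech.
    """
    primary = np.asarray(primary, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    w = np.zeros(taps)
    out = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]            # most recent sample first
        e = primary[n] - np.dot(w, x)              # error = speech estimate
        w += (mu / (eps + np.dot(x, x))) * e * x   # normalized LMS update
        out[n] = e
    return out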
APPENDIX 5: AN EXAMPLE OF RULES FOR SELECTION OF A ROBUST VOCABULARY

In any given application for a speech recognition device, one often has leeway in the selection of the exact words to be used for the vocabulary. Synonyms for a task name or required input may be substituted (e.g. 'start' may be used instead of 'go'). This option may be used to select a vocabulary which best utilizes the discriminatory strengths of a given speech recognition unit.

The following is a series of rules which provide one example of a method for selecting a robust vocabulary. This example is specifically for the NEC SR100 and uses the phonetic error grouping results from this study. The technique is based on the subdivision of words into their phonetic constructs and the determination of whether a given candidate is phonetically unique enough to be a reliable vocabulary member.

The specific phonetic error groupings utilized were those of Mode 3 (Tables A3.21 and A3.26), using the most stringent maximum error rates tabulated (i.e. 15% per word for the vowels and 35% for the consonants). The selection of Mode 3 seemed justified, in that any misidentification eliminated in the transition from Mode 0 to Mode 3 was more than likely due to variations in duration and/or bad reference templates, rather than to similarities in phonemes.

The proposed method is as follows (a sketch implementing this check is given at the end of this appendix):

1. Divide each word of the proposed vocabulary into its phonetic constructs.

2. Compare each phoneme in each word with the phoneme in the same position in each of the other vocabulary members. If the two phonemes are in different error groups, they may be considered a valid discriminatory device. In other words, phonemes within the following groups may not be considered discriminatory when compared to other phonemes within the same group:
a. 4, 6, 7, 8, 9 and 14 (A, UH, OW, OO, U and OU) for the vowels.
b. 5, 7, 8, 9, 10, 11, 13, 16, 17, 18, 20, 21 and 22 (N, B, D, G, P, T, H, V, TH, Z, F, THE and S) for the consonants.
c. 14 and 15 (DZH and TSH), also for the consonants.

3. If there are more phonemes in one word than in another, then each additional phoneme may also be considered a valid discriminatory device.

4. For a word to be an acceptable vocabulary member, it must have at least one valid discriminatory vowel or two valid discriminatory consonants when compared to each other vocabulary member.

If the statistics were based on a larger test group, then it might be possible to state with some statistical certainty that the resulting vocabulary would have a general error rate of better than 15%. That is not possible with this small a test group. All one may say is that this should yield a more robust vocabulary than simply arbitrarily selecting vocabulary members.
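For a machine check of rules 2 through 4, the error groupings above can be encoded directly. The following is a minimal sketch: the phonetic transcriptions (rule 1) must still be supplied by hand, and the handling of a vowel aligned against a consonant is a simplifying assumption of this sketch, not something the rules specify.

# Non-discriminatory groupings from Tables A3.21 and A3.26 (Mode 3, most
# stringent error rates).  Phonemes absent from this map sit in singleton
# groups and therefore always discriminate.
SAME_GROUP = {}
for group in [("A", "UH", "OW", "OO", "U", "OU"),          # vowel group
              ("N", "B", "D", "G", "P", "T", "H", "V",
               "TH", "Z", "F", "THE", "S"),                # consonant group
              ("DZH", "TSH")]:                             # consonant group
    for ph in group:
        SAME_GROUP[ph] = group

VOWELS = {"IY", "I", "E", "AE", "A", "ER", "UH", "OW",
          "OO", "U", "AI", "OI", "AU", "EI", "OU", "JU"}

def discriminatory(p, q):
    """Rule 2: two phonemes discriminate unless they are identical or
    fall within the same error group."""
    return not (p == q or (p in SAME_GROUP and q in SAME_GROUP[p]))

def acceptable_pair(word_a, word_b):
    """Rules 2-4 for one pair of words, each given as a phoneme list."""
    n = min(len(word_a), len(word_b))
    aligned = list(zip(word_a[:n], word_b[:n]))
    extra = word_a[n:] + word_b[n:]        # rule 3: unmatched phonemes
    vowels = sum(1 for p, q in aligned
                 if discriminatory(p, q) and (p in VOWELS or q in VOWELS))
    vowels += sum(1 for p in extra if p in VOWELS)
    consonants = sum(1 for p, q in aligned
                     if discriminatory(p, q)
                     and p not in VOWELS and q not in VOWELS)
    consonants += sum(1 for p in extra if p not in VOWELS)
    return vowels >= 1 or consonants >= 2  # rule 4

def robust_vocabulary(vocab):
    """`vocab` maps each word to its hand-derived phoneme list (rule 1);
    returns the words acceptable against every other member."""
    return [w for w in vocab
            if all(acceptable_pair(vocab[w], vocab[v])
                   for v in vocab if v != w)]

As a worked illustration, 'go' (G OU) against 'no' (N OU) fails the check: G and N fall within the same consonant error group, and the vowels are identical, so the pair offers no valid discriminatory device and the two words should not coexist in one vocabulary.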
