SPEECH RECOGNITION IN A HARSH ENVIRONMENT

Susan Pullman
B.Sc., Wilfrid Laurier University

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
December 1988
(c) Susan Pullman, 1988

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Abstract

Speech recognition is a rapidly expanding field with many useful applications in man-machine interfacing. One of the main benefits of speech control is the flexibility and ease of use it allows an operator for any number of specific applications. Speech recognition units (SRUs) currently achieve a high level of accuracy for user-dependent, pretrained, isolated word recognition. However, if uncontrollable noise is added to the speech input, recognition degrades rapidly. If the application requires a vast set of control words to be used by many operators, there can also be inconsistencies in recognition. The specific application of this study is the secondary control (i.e., non-critical control) of heavy machinery (in particular a Caterpillar tractor) using an operator-speech interface. The inherent problem of this application is the environmental background noise due to the tractor. It is also important that a robust vocabulary is selected, so that no misrecognition occurs between critical control words. In order to add speech input for control of machines in a harsh environment there are two considerations:

1. The reduction of noise from the input speech signal.

2. The selection of a robust vocabulary dependent upon the specific operator and the specific SRU.

This study investigates many different types of noise reduction filters, including traditional Wiener, Power Spectral Subtraction and Gaussian filters. The results show that the best types of noise reduction filters are adaptive optimization filters which use two input signals, or the Power Spectral Subtraction (PSS) filter. It is possible to reduce the noise to a level within the range of the SRU's capacity for noise. An algorithm for selecting an accurate vocabulary is also proposed. This algorithm determines weaknesses for the specific SRU, vocabulary and speaker, and selects the control words around those weaknesses. Testing of this algorithm showed that it was possible to achieve close to 98% recognition and 0% misrecognition.
Table of Contents

Abstract
List of Tables
List of Figures
Acknowledgements

1 Introduction

2 Robust Word Selection Package
  2.1 Control By Speech
  2.2 Algorithm For Selecting A Robust Vocabulary
  2.3 Distance Measurements
  2.4 Testing The Algorithm
    2.4.1 Results

3 End Point Detector
  3.1 Results

4 Noise Reduction Filters
  4.1 Characteristics of the Signals
    4.1.1 Speech
    4.1.2 Characteristics of Noise
    4.1.3 Categories of Filters Investigated
  4.2 Noise Reduction Filters Based on PSE of the Noise
    4.2.1 Power Spectral Subtraction
    4.2.2 Wiener Noise Filter
    4.2.3 Implementation and Testing
  4.3 Adaptive Noise Cancelling Filters
    4.3.1 LMS Adaptive Filters
    4.3.2 Recursive Least Squares Adaptive Noise Reduction Filter
  4.4 Enhancement Filters Based On A Simple Model of Speech
    4.4.1 Linear Predictor Corrector (LPC) Method
    4.4.2 Gaussian Smoothing Filtering
    4.4.3 Implementation
  4.5 Data Acquisition and Testing of Filters
    4.5.1 Correlation Between Microphones

5 Testing of Noise Filters
  5.1 Evaluation
  5.2 Data Set
  5.3 Improving Recognition By Manipulation of VTR6050
  5.4 Filter Test Results

6 Discussion
  6.1 Vocabulary Selection
  6.2 End Point Detector
  6.3 Filters
    6.3.1 Analysis of the Estimate of the Power Spectrum
    6.3.2 Wiener Filter
    6.3.3 Power Spectral Subtraction
    6.3.4 Gaussian Smoothing
    6.3.5 Linear Predictor Corrector
    6.3.6 Adaptive Filtering
    6.3.7 LMS
    6.3.8 RLS
  6.4 Comparison of Filters

7 Conclusions and Recommendations

References

A Previous Work in Speech Recognition at UBC
  A.1 Recognition Results for the VOTAN VTR6050
    A.1.1 Confusion Matrices for VTR6050 for All Subjects
  A.2 Results for NEC SR100
    A.2.1 Combined Confusion Matrices for All Subjects for SR100

B Operation of the VOTAN VTR6050
  B.1 Additional Features
  B.2 Memory Space
  B.3 Configuration for the Operation of VTR6050

C Results for Vocabulary Selection
  C.1 Vocabulary Selection Results - No Distances Set
  C.2 Vocabulary Selection Results - Distance Measurements Used

D Results of Filtering Tests
  D.1 Results For Noise Mode 1
    D.1.1 Results When Trained With One Template Only
    D.1.2 Results When Trained With Two Templates
  D.2 Results For Noise Mode 2
    D.2.1 Results When Trained With One Template Only
    D.2.2 Results When Trained With Two Templates

List of Tables

2.1 Confusion Matrix
2.2 Distance Matrix
2.3 Results of Vocabulary Selection
4.4 Algorithm To Implement the LMS Filter
4.5 Algorithm To Implement the RLS Filter
5.6 Recognition When Noise Is Added At Different Gains
5.7 Results of Filtering With Mode 1 Noise
5.8 Results of Filtering With Mode 2 Noise
5.9 SNR Results for Filtering Methods
6.10 Test 1: Results of Selection When No Distance Score Is Used
6.11 Test 2: Results of Selection When Minimum Distance Score Is Used
6.12 Comparison Between Filters
A.13 Vowels and Diphthong Phonemes
A.14 Semi-Vowel and Consonant Phonemes
A.15 Statistics for Test 1 - Vowels Said Normally
A.16 Statistics for Test 2 - Vowels With Mike Moved
A.17 Statistics for Test 3 - Vowels Said Slowly
A.18 Statistics for Test 4 - Vowels Said Quickly
A.19 Statistics for Test 5 - Vowels with Interrogative Intonation
A.20 Statistics for Test 6 - Consonants (Group 1)
A.21 Statistics for Test 7 - Consonants (Group 2)
A.22 Confusion Matrix for Test 1 - Vowels Said Normally
A.23 Confusion Matrix for Test 2 - Vowels With Mike Moved
A.24 Confusion Matrix for Test 3 - Vowels Said Slowly
A.25 Confusion Matrix for Test 4 - Vowels Said Quickly
A.26 Confusion Matrix for Test 5 - Vowels With Interrogative Intonation
A.27 Confusion Matrix for Test 6 - Consonants (Group 1)
A.28 Confusion Matrix for Test 7 - Consonants (Group 2)
A.29 Statistics for Test 1 - Vowels Said Normally
A.30 Statistics for Test 2 - Vowels With Mike Moved
A.31 Statistics for Test 3 - Vowels Said Slowly
A.32 Statistics for Test 4 - Vowels Said Quickly
A.33 Statistics for Test 5 - Vowels with Interrogative Intonation
A.34 Statistics for Test 6 - Consonants (Group 1)
A.35 Statistics for Test 7 - Consonants (Group 2)
A.36 Confusion Matrix for Test 1 - Vowels Said Normally
A.37 Confusion Matrix for Test 2 - Vowels With Mike Moved
A.38 Confusion Matrix for Test 3 - Vowels Said Slowly
A.39 Confusion Matrix for Test 4 - Vowels Said Quickly
A.40 Confusion Matrix for Test 5 - Vowels With Interrogative Intonation
A.41 Confusion Matrix for Test 6 - Consonants (Group 1)
A.42 Confusion Matrix for Test 7 - Consonants (Group 2)

List of Figures

1.1 Caterpillar Excavator
2.2 Control Words for the Caterpillar Tractor
2.3 Expanded Set of Words for the Caterpillar Tractor
3.4 Absolute Value Score Threshold with Z-score of 2.0
3.5 AVS Separation of Speech from Background Noise
3.6 ZC Separation of Noise and Speech
3.7 End Of Word Found From AVS Threshold
3.8 End Of Word Found From ZC Threshold
4.9 Ensemble PS of Noise 1
4.10 Ensemble PS of Noise 2
4.11 Short Time PS of Noise 1
4.12 Filters Based on PSE of Noise
4.13 PSS Filter
4.14 Wiener Filter Frequency Weights
4.15 Wiener Filter
4.16 Filter With PSE #2 of Noise
4.17 Filter With PSE #3 of Noise
4.18 General Form of an Adaptive Filter
4.19 Adaptive Linear Combiner
4.20 Gaussian Filter Impulse Response
4.21 Signal filtered with a Gaussian of Size 4
4.22 Signal filtered with a Gaussian of Size 8
4.23 Signal filtered with a Gaussian of Size 16
4.24 Data Acquisition Hardware for Two Microphones
4.25 Typical Frequency Response of Both Microphones
4.26 Digitized Microphone Signals
4.27 Canonic Form of Homomorphic System
4.28 D_* System
4.29 Wrapped and Unwrapped Phase
4.30 Cepstrum of Mike 2
4.31 Linear filter to remove reverberation
4.32 Inverse System D_*^{-1}
B.33 Configuration for the Operation of VTR6050

Acknowledgements

I would like to thank my supervisor, Dr. Ito, for his support and guidance. I would also like to thank Rial Frenette for his help with the data acquisition and Henry Law for his assistance with the testing stage of the research. As well, I would like to express my gratitude to the volunteers who tested the vocabulary selection and filtering methods.
Lastly, I would like to thank my parents for their constant encouragement and support.

Chapter 1

Introduction

The robotics group at the University of British Columbia (UBC) is currently working on a project to automate an excavator. A diagram of the excavator used in this project is shown in Figure 1.1. The purpose of this project is to improve the flexibility, safety, and ease of operation of the excavator. It is desirable to have speech as one of the man-machine interfaces, since speech allows another degree of freedom for the operator, in that his/her hands are free to perform other tasks.

Commercially available Speech Recognition Units (SRUs) can be very accurate in "ideal" conditions, that is, noise-free environments with a small pretrained vocabulary for a specific user. However, for control of an excavator, speech recognition must be accurate in a "harsh", "critical" environment. Harsh here means that there is a lot of environmental background noise and many users; "critical" means that one misrecognized word by an SRU could have harmful or undesirable effects.

Figure 1.1: Caterpillar Excavator

In order for speech control to be feasible in an excavator environment, a front-end processor is necessary. Two front-end processors for the SRU were investigated. The first was a noise reduction filter; the second, a robust vocabulary selection process for the specific SRU and speaker. The noise reduction filter was required to ensure that the background noise didn't affect the operation of the SRU. The vocabulary selection process was needed to reduce the possibility of misidentifying words. The two processors together would ideally obtain 100% word recognition.

The research outlined in this thesis produced a set of rules for selection of a robust vocabulary (Chapter 2). The method investigated determined a "best" subset of words, given a number of words each with multiple synonyms, and used a heuristic search to determine this subset. The second stage of this research investigated different types of noise reduction filters (Chapter 5), including Wiener, Power Spectral Subtraction (PSS), enhancement filters and adaptive filters. The results suggest that Adaptive Noise Cancelling (ANC) filters (much like Widrow's) or the Power Spectral Subtraction method obtain the best, most consistent results when dealing with a non-stationary broadband noise signal like the one associated with the operation of an excavator.

Given the constraints of the problem - namely that the signal to be removed is a broadband non-stationary signal, and the signal we wish to keep intact, speech, is a complicated non-periodic signal - a method to accomplish this task had to be found. Methods of reducing noise from speech which have been previously investigated can be categorized into the following three groups:

1. Algorithms based on the short-time spectral amplitude estimate of the signal, requiring prior knowledge of the noise signal. Methods falling into this category include Power Spectral Subtraction filtering [6] and Wiener filtering [9].

2. Algorithms that exploit the periodicity of voiced speech. This category includes methods such as two-dimensional spectral smoothing and spectrum amplitude transformation [3], comb filtering [15] and adaptive noise cancellers [35][36].
3. Algorithms based on a source-system model of speech production, which use speech analysis followed by speech synthesis to enhance the speech. Methods falling into this category are a multi-pulse excited linear prediction system [22] and Linear Predictor Coefficient methods [26].

Other work done in the area of noise reduction from speech includes noise-immune systems which use multiple sensors, such as noise-cancelling pressure-gradient microphones and accelerometers which measure vibrations [32]. There has also been work done specifically in the area of speech recognition units for a noisy environment [18]. Since the SRU in that work bases its calculations on the assumption that there is some noise (i.e., it directly manipulates distance measurements or Linear Predictor Coefficients to compensate for noise), the compensation is inherent to the recognition process and therefore is not feasible as a pre-processor algorithm. Neither of these two approaches was investigated, due to the constraints of our solution: the noise-reduced signal is sent to the SRU, which operates as a black box, so the signal must be "cleaned up" before recognition is attempted, since the SRU parameters cannot be changed.

Since most of the filtering methods assume that the noise is of some known form (specifically white Gaussian), it was decided to look at methods from each of the three categories mentioned above, but with variations in order to compensate for the changing nature of the noise. From the first class, Wiener filters and Power Spectral Subtraction methods were investigated. From the second class, two types of adaptive filters were examined, namely Least Mean Squares (LMS) and Recursive Least Squares (RLS). The filters analyzed from the third category were a Linear Predictor Corrector (LPC) method and a Gaussian filter.

Little past work has been done in the area of vocabulary selection. The research in this area seems to be restricted to the classification or selection of a particular word from a very large vocabulary [17]. Previous work is more concerned with preclassification (based on LPC, energy, spectral distances, etc.) and fast searching routines.

Chapter 2

Robust Word Selection Package

Currently, commercially available SRUs have not reached the stage at which they are 100% accurate. When selecting a vocabulary for a specific application it is helpful to know where the SRU's weaknesses lie; specifically, a method or set of rules for selecting an appropriate vocabulary would be convenient. The question arises as to what level of speech should be used to form these rules. The possible choices are phonetic components, word or phrase components, or sentences.

Phonemes are the basic building blocks of any language. Within the English language there are approximately 40 phonemes, or linguistic sounds. For a detailed explanation of human speech production and the characteristics of the speech signal, see [26]. One problem with basing a word selection process upon phonemes is that when they are placed together to form words, their structure tends to change slightly, depending upon the combination of phonemes. Coarticulation and slurring are prime examples of changing phoneme sounds. An additional problem is that people tend to pronounce sounds differently; what might be an easily recognizable phoneme said by one speaker may be constantly misrecognized when uttered by another user. This could be due to accents or any kind of speech impediment of the speaker.
Another problem depends upon what type of processing the SRU does on the speech; what one SRU has no difficulty with could be a major problem for another SRU. Similarly, words that do not sound at all alike to human perception may sound very similar to an SRU.

A phoneme-based set of words for testing purposes (like the one designed by Hans Wasmeier [33]) can be a good indication of the accuracy of any particular SRU, in that it rigorously tests all possible sounds the unit might be exposed to. However, in our case it proved to be a poor test when used for selection of appropriate application words, due to problems arising when phonemes are joined to form words. Appendix A contains the results of testing two different speech recognition units, the NEC SR100 and the VOTAN VTR6050, with a phoneme-based vocabulary. As seen from the results, there was little consistency in the misrecognition of phoneme pairs across users, SRUs or test modes. The only pair of phonemes that was consistently confused, for each user and both SRUs, is the IY-I pair.

Since different words are not made up of completely different phonemes, rules based upon the location of the same sound in different words would have to be devised. Other factors that have an effect on recognition include duration, pitch and volume. Even if it were possible to come up with a vast set of problem phoneme pairs and rules, it would be difficult to select an application vocabulary given these variables.

Words, or short phrases, are the natural level for man-machine communication. Human factors tests [39] have shown that people tend to be more comfortable using "buzz" or computer-type words when interacting with a machine, as opposed to a natural language. The algorithm developed in this thesis for selecting a vocabulary is based upon words or short phrases. Sentences or longer phrases were avoided, since it tends to be hard for an operator to pronounce these consistently.

2.1 Control By Speech

To use speech as an operator interface for control of any application it is necessary to know the types of commands and options that will be needed. The method proposed in this thesis requires that the set of possible control words be organized into a suitable tree structure. This tree structure can be defined as having three different components: levels, branches, and choices. Figure 2.2 shows the preliminary set of control words for the Caterpillar tractor project.

The first level contains all of the basic tasks required for the specific application. Each subsequent level contains options to the connected task at the previous level; an example of a level is PATH and LOCATION, at level 2. A branch is defined as the path of connected options for a specific task. Progressing along a branch is equivalent to adding to the details of the command; an example of a branch would be MEMORIZE - PATH - START. Each new level in the branch allows the operator to specify more exactly the operation to be performed. A choice is any one of the options at a specific level. At level one the possible choices are MEMORIZE, GOTO, TELL POSITION, SPEED, CAMERA, and MODE.

There are two basic stages to implementing speech control. The first stage is to set up a control structure like the one shown in Figure 2.2.
When initializing the tree structure of commands, the number of choices per level is unconstrained and the number of levels per branch is also variable. This tree structure allows a great deal of flexibility, since more branches or more choices at each level can be added. The second stage of implementing the speech control is to use the control structure to manipulate the robot. At this stage each level constrains the possible choices allowed the user, and the number of levels (or details of the command) is also constrained by the predefined control structure of stage one.

The manipulation of the robot using the SRU proceeds as follows. Upon initialization, all choices at level 1 are loaded as templates into the SRU's voice card memory. The operator says the task he/she would like performed. The SRU recognizes the word, swaps the next level of control words (those words that are connected to the word recognized at the first level) into the voice card memory, and waits for the operator to enunciate one of the choices at the second level. This process continues until the last level in a branch is reached, at which point control is passed back to the first level.

The following provides an explanation of the control structure used in the excavator project. Upon initialization, the operator can tell the excavator to MEMORIZE the motion or location of the bucket; to GOTO a previously memorized location or motion of the bucket; to change the SPEED of response of the joystick or of convergence of the end point of the actuator (i.e., the bucket or grapple); to move the CAMERA; or to change the MODE of operation of the excavator to either position or velocity.

An example of a path of control that could occur with this structure is as follows. If the operator's first speech command is MEMORIZE, then the SRU recognizes this word, proceeds along the branch to the next level (namely PATH and LOCATION) and waits for the operator to say one of these words. If LOCATION is selected, the task is performed and control is passed back to level one. If PATH is selected, the SRU waits for the operator to say START or STOP before performing the corresponding task and passing control back to the first level. Once a PATH or LOCATION is memorized, the operator can command the excavator to GOTO the memorized PATH or LOCATION. (A sketch of one possible in-memory representation of this tree follows Figure 2.2.)

Level 1          Level 2               Level 3
Memorize         Path                  Start, Stop
                 Location
Goto             Memorized Location
                 Memorized Path
Tell Position    Bucket
                 Stick
                 Boom
Speed            Path                  Quicker, Slower, Minimum, Medium, Most
                 Arm                   Quicker, Slower, Minimum, Medium, Most
Camera           Up, Down, Left, Right, On, Off
Mode             Speed, Position

Figure 2.2: Control Words for the Caterpillar Tractor
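Section 2.4 notes that this tree was implemented with linked-list structures. A minimal C sketch of one such representation is given below; the field and function names are illustrative, not taken from the thesis code.

    #include <stdlib.h>
    #include <string.h>

    /* One node of the command tree: a choice at some level.  "sibling"
       chains the choices available at the same level, "child" points to
       the options at the next level of this branch, and "synonym" chains
       the alternative words for the same control option (Section 2.2). */
    typedef struct node {
        char         word[32];   /* word or short phrase for the SRU */
        struct node *synonym;
        struct node *sibling;
        struct node *child;
    } Node;

    static Node *make_node(const char *word)
    {
        Node *n = calloc(1, sizeof *n);
        if (n) strncpy(n->word, word, sizeof n->word - 1);
        return n;
    }

    /* Build the MEMORIZE branch of Figure 2.2:
       MEMORIZE -> { PATH -> { START, STOP }, LOCATION }. */
    static Node *build_memorize_branch(void)
    {
        Node *memorize = make_node("memorize");
        Node *path     = make_node("path");

        memorize->child      = path;
        path->sibling        = make_node("location");
        path->child          = make_node("start");
        path->child->sibling = make_node("stop");
        return memorize;
    }

With this layout, the templates loaded into the SRU's voice card at any moment are simply the sibling chain (and synonyms) of the current node's children; recognizing a word moves the current pointer to the matching child, and reaching a node with no children passes control back to level one.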
2.2 Algorithm For Selecting A Robust Vocabulary

Given an initial set of control words, like the one defined in the previous section, it is essential that accurate recognition of this set is obtained by the SRU. The first step to ensuring a robust vocabulary is to select a number of synonyms for each word in the control structure. As can be seen in Figure 2.3, three synonyms were chosen for each control word for the excavator. The synonyms can be words or phrases, but it is best to keep the number of words per phrase small (less than three).

Memorize (Remember, Record)
Goto (Repeat, Send Bucket To)
  Memorized Location (Point, Position)
  Memorized Path (Course, Track)
Tell Position (Determine Position, Give Position)
  Bucket (Endpoint, Grapple)
  Stick (Arm, Link 1)
  Boom (Shoulder, Link 2)
Speed
  Path (Handcontroller, Joystick)
    Quicker (Faster, Increase)
    Slower (Retard, Decrease)
    Minimum (Slowest, Least)
    Medium (Average, Intermediate)
    Most (Maximum, Fastest)
  Arm
    Quicker (Faster, Increase)
    Slower (Retard, Decrease)
    Minimum (Slowest, Least)
    Medium (Average, Intermediate)
    Most (Maximum, Fastest)
Camera (Optical Device, Photograph)
  Up (Elevate, Higher)
  Down (Descend, Lower)
  Left (West, Leftward)
  Right (East, Right side)
  On (Lock onto, Lock)
  Off (Unlock, Take off)
Mode (State, Condition)
  Speed (Velocity, Rate)
  Position (Point, Location)

Figure 2.3: Expanded Set of Words for the Caterpillar Tractor (synonyms in parentheses; indentation shows levels)

To select a robust vocabulary from the expanded set of control words, the process follows a heuristic approach. A software package first trains and then tests all the choices at any one particular level, for each possible user, with the designated SRU. The purpose of the training and testing process is to determine any words that would be a high risk (i.e., a high rate of misidentification, a low recognition rate or a high rejection rate). The end result should be a control structure with only one word (the optimum word) for each control option.

The algorithm for selecting the optimum set of control words proceeds as follows. A confusion matrix is initialized based upon the results of the user saying each of the words at the current level in the control structure five times. For example, if the application requires only four words (0-3) at the current level and it is possible to produce two synonyms (A and B) for each control word, then a potential confusion matrix could be:

Said    Recognized as:
        0A   0B   1A   1B   2A   2B   3A   3B   Reject
0A     0.7  0.0  0.0  0.0  0.1  0.0  0.0  0.1  0.1
0B     0.0  0.4  0.0  0.0  0.2  0.0  0.0  0.0  0.4
1A     0.0  0.0  0.9  0.1  0.0  0.0  0.0  0.0  0.0
1B     0.0  0.2  0.0  0.8  0.0  0.0  0.0  0.0  0.0
2A     0.0  0.0  0.1  0.0  0.9  0.0  0.0  0.0  0.0
2B     0.0  0.0  0.0  0.0  0.0  0.8  0.0  0.1  0.1
3A     0.0  0.1  0.0  0.0  0.1  0.0  0.8  0.0  0.0
3B     0.1  0.0  0.0  0.0  0.0  0.0  0.0  0.9  0.0

Table 2.1: Confusion Matrix

Table 2.1 is read as follows: the word 0A was recognized as 0A 70% of the time, as 2A 10% of the time, as 3B 10% of the time, and was rejected (not recognized as anything) 10% of the time.

The strategy for selecting the optimum subset of control words has two stages. The first stage determines all possible subsets of words with no conflicts. Conflicts could be: misidentification between words; a high rejection rate for a specific word; or a low recognition rate. The second stage is to select the optimum subset from the subsets determined at stage one, based upon a weighting factor. The selection strategy is illustrated below using the previous confusion matrix as an example.

To determine possible subsets at the current level, start at word 0, alternative A, and divide the words into branches. Add words to the same branch only if there is no misrecognition with another word already in that branch. For example, starting with word 0A, select a choice for word one that is not misrecognized with 0A, which in this case is 1A.
To determine if there is any misrecognition between these two words, look in column 1, row 3 and row 1, column 3 to ensure that the percentage is zero. Continue on to the next word to find an appropriate alternative. A suitable choice for word 2 would be any word that is not misrecognized as word 0A or word 1A. The first branch is as follows:

0A - 1A - 2B

Misrecognizing word 0A as word 2A is just as significant as misrecognizing word 2A as word 0A; therefore these two words should never be placed in the same solution set. To make implementation of the word selection package easier, the confusion matrix can be reduced to half its size by adding any percentage value in row X, column Y to row Y, column X, where Y > X. A lower triangular matrix is created, making it unnecessary to look at the upper triangle of the matrix to determine conflict (misrecognized) pairs. For example, to ensure that 0A is never confused with 1A it is now only necessary to make sure that column 1, row 3 is zero. The rejection column must be saved, since the matrix has no corresponding rejection row.

If at any point in the selection process there is no appropriate alternative for a word, then the branch being formed is terminated, and no possible subset of the vocabulary has been found along it. (In the given example, 0A - 1B - 2A - / (no match for word three) is such a case.)

The rejection rate is an important factor. If a word is constantly rejected (greater than a 25% rejection rate), it is a good indication that the word would be a risk to use in the vocabulary. Similarly, if the recognition rate of a word is low (less than 50%), even if there are no conflicts with other words in the branch, it should be assumed that this is not a reliable word. In the example given, 0B is such a word (40% rejection and 40% recognition), and hence no branch is formed using this word.

No significance is given to two synonyms of the same control word being misrecognized (or confused) by the SRU. Obviously, two synonyms of the same word will never be used together, and even if they were, they have the same meaning, so no harm would come of misrecognition.

In the example given, the only possible solutions are:

0A - 1A - 2B - 3A
0A - 1B - 2B - 3A

To determine which of these two solutions is the optimum, weights must be assigned to each, with a weight of 1.0 being the maximum obtainable. The weight is based upon the average recognition rate of the words in the vocabulary subset. For example, the rating for the solution subset 0A - 1A - 2B - 3A would be 0.8 ((0.7 + 0.9 + 0.8 + 0.8)/4), while the subset 0A - 1B - 2B - 3A would have a rating of 0.775. Therefore the optimum subset to select for the application vocabulary in this example would be 0A - 1A - 2B - 3A, since it has the maximum weight value of the two.
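The two-stage search described above is straightforward to implement. The sketch below, in C (the language the selection package was written in, per Section 2.4), forms branches over the folded lower-triangular confusion matrix and keeps the subset with the highest average recognition rate. The data layout, names and in-code thresholds are illustrative, not the thesis code.

    #define NWORDS 4                  /* control words 0..3            */
    #define NSYN   2                  /* synonyms A,B per control word */
    #define NC     (NWORDS * NSYN)    /* total choices                 */

    /* conf[i][j] (j <= i) is the folded misrecognition rate between
       choices i and j (upper triangle added onto the lower); conf[i][i]
       is the recognition rate and rej[i] the rejection rate, both
       filled in from the five-repetition test. */
    static double conf[NC][NC];
    static double rej[NC];

    static double best_weight = 0.0;
    static int    best[NWORDS], pick[NWORDS];

    /* Two choices conflict if either was ever recognized as the other.
       Synonyms of the same control word are exempt. */
    static int conflict(int a, int b)
    {
        if (a / NSYN == b / NSYN) return 0;
        return (a > b ? conf[a][b] : conf[b][a]) > 0.0;
    }

    /* Depth-first branch formation: pick one reliable, conflict-free
       synonym for each control word in turn; a branch with no viable
       choice for some word simply terminates. */
    static void search(int word, double sum)
    {
        if (word == NWORDS) {
            double w = sum / NWORDS;        /* average recognition rate */
            if (w > best_weight) {
                best_weight = w;
                for (int i = 0; i < NWORDS; i++) best[i] = pick[i];
            }
            return;
        }
        for (int s = 0; s < NSYN; s++) {
            int c = word * NSYN + s;
            if (conf[c][c] < 0.50 || rej[c] > 0.25)   /* unreliable word */
                continue;
            int ok = 1;
            for (int i = 0; i < word && ok; i++)
                ok = !conflict(c, pick[i]);
            if (!ok) continue;
            pick[word] = c;
            search(word + 1, sum + conf[c][c]);
        }
    }

Calling search(0, 0.0) after the matrix is filled leaves the optimum subset in best[]. Ties at the same weight could instead be collected and offered to the user, as the testing program of Section 2.4 does.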
2.3 Distance Measurements

When an SRU is used as a speech control interface for any type of application, whether it is the control of an excavator or placing orders over the telephone, it is critical that no misrecognition occur. It is important to reject words that do not match well to any of the templates in memory; in this way the user, or operator, can repeat the word until it is certain that a good match has been obtained. Most speech recognition units calculate a distance score for each spoken word based on how well the spoken word matches the templates in memory. The SRU used for testing the vocabulary selection process (the VOTAN VTR6050) has such a score, which can be used to further decrease the rate of misrecognition of the selected vocabulary.

To approach 0% misrecognition between words in the vocabulary, the distance score was recorded as the testing was being done. For each recognized word the minimum distance score was recorded and stored in a separate confusion-type matrix. If there was no confusion between a pair, the value of the distance was set to a default distance score of 50. For each of the possible solution subsets, a minimum distance score is determined, based upon the minimum distance score of all the words in the subset. It is then possible, upon installation of the optimum subset, to reject any word with a distance score greater than this minimum, thereby decreasing misrecognition. Generally, the minimum distance score occurs for the word said (i.e., word recognized = word said), thereby setting up a very rigorous test for recognition of the selected vocabulary: each word the operator says must match as well as the best match in the initial tests or else it will be rejected.

       0A   0B   1A   1B   2A   2B   3A   3B
0A     25   50   50   50   32   50   50   28
0B     50   27   50   50   30   50   50   50
1A     50   50   23   34   50   50   50   50
1B     50   36   50   31   50   50   50   50
2A     50   50   27   50   26   50   50   50
2B     50   50   50   50   50   27   50   30
3A     50   28   50   50   32   50   25   50
3B     30   50   50   50   50   50   50   24

Table 2.2: Distance Matrix

Table 2.2 shows the distance matrix corresponding to the confusion matrix of the example in Section 2.2. From this example it is easy to see that the minimum distance score for the optimum subset 0A - 1A - 2B - 3A should be set to 23 upon installation of this vocabulary.
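In code, the installed threshold and the run-time rejection test are one comparison each. The C sketch below uses illustrative names (the VTR6050 command interface itself is not shown) and assumes dist[][] holds the recorded scores, with 50 as the no-confusion default.

    #define DEFAULT_DIST 50   /* VTR6050 default distance score */

    /* Minimum distance score over all pairs in the selected subset. */
    int subset_min_distance(const int *subset, int n, const int dist[][8])
    {
        int min = DEFAULT_DIST;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (dist[subset[i]][subset[j]] < min)
                    min = dist[subset[i]][subset[j]];
        return min;
    }

    /* At run time a recognized word is kept only if its score is at
       least as good (small) as the best match seen during testing. */
    int accept_word(int score, int min_dist)
    {
        return score <= min_dist;
    }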
2.4 Testing The Algorithm

To test the word selection algorithm, the VOTAN VTR6050 speech recognition device was used. The preliminary set of words, along with their alternatives, used for the application vocabulary was shown in Figure 2.3. The testing of the word selection algorithm was performed with nine subjects providing voice input directly to the SRU on cue. The VOTAN was controlled through an RS232 port connected to the host computer, the HP9050. The programs to control the VTR6050 and implement the word selection algorithm were written in the C language. Programs were written which perform the following functions:

1. Allow the user to set up a new tree structure of control words, with an unlimited number of branches of control words, or use the previously defined set. It was possible to set up this tree with the use of linked-list structures.

2. Train the VTR6050 for each level of the control structure. The training is equivalent to creating reference templates for use in recognition.

3. Test each level and branch of the tree separately.

4. Select the optimum subset of words for each level separately, based on the previously defined algorithm.

5. Retest the selected optimum subsets separately to confirm an increase in recognition and decrease in misrecognition.

The program that performed these tasks required little user input once the appropriate control structure was initialized (which could be done ahead of time). The program first prompts the subject to train the SRU with each word and each of its synonyms at the current level. The SRU was trained with only one template per word. The next step was to test the current set of words. The program prompted the subject five times for each word in order to obtain the confusion matrix. The algorithm then determined possible subsets and printed them on the screen, sorted in order of decreasing weight value. Sometimes there were many subsets with weight 1.0; in this case the user was allowed to select which of these subsets he/she would prefer. Finally, the selected optimum subset was verified, using only the words in the selected subset as possible reference templates for recognition and testing each of these words five more times. This training - testing - selection - retesting procedure was repeated for each level in the tree of control words.

Each subject completed this process on two different occasions. Due to the availability of some of the subjects, four new users had to be added for the second test, but five subjects completed both testing sessions. At the first testing session a word was rejected, in both the initial testing and the retesting of the optimum subset, if its distance score was greater than the default value for the VTR6050, which is 50. For the second session, a word was rejected in the retesting stage if its distance score was greater than the minimum distance score of the optimum subset.

2.4.1 Results

The first time the tests were performed the distance score was always set at its default value of 50. The results show an average increase in recognition of 5.2%, from 91.5% to 96.7%, over all branches and all users. There was a decrease of 2.21% in misrecognition, from 3.00% to 0.79%, and a decrease of 3.07% in rejection, from 5.56% to 2.49%. As can be seen, the VTR6050 gives very good results even without a selection process (91.5% recognition), and the word selection package still achieves a significant improvement.

On the second testing of the vocabulary selection process, the minimum distance calculation was added in an attempt to drive the misrecognition to 0%. The results show a recognition gain of 2.39%, from 95.08% to 97.47%, with misrecognition decreased to 0.25% from 1.52%, and rejection decreased by 1.16%, from 3.43% to 2.27%.

                 % Recognized      % Misrecognized     % Rejected
Distance Score   Before   After    Before   After      Before   After
Default          91.5     96.7     3.00     0.79       5.56     2.49
Minimum          95.1     97.5     1.52     0.25       3.43     2.27

Table 2.3: Results of Vocabulary Selection ("Before" and "After" refer to the selection process)

Chapter 3

End Point Detector

The end point detector (EPD) is an algorithm which determines where a word starts and stops, using a cumulative absolute value threshold and a zero crossing rate threshold. If there were no noise this task would be easy; however, the task becomes harder as the signal-to-noise ratio decreases. A possible use of this algorithm is to chop out the surrounding background noise, leaving only the word; the resulting signal could be sent to the SRU for recognition. This application would only be useful if the SNR was large enough that little distortion of the speech occurred. The EPD's real value is that it allows an estimate of the background noise to be made for filtering purposes, as will be discussed in the next chapter.

The EPD works under the assumptions that voiced speech sounds (or vowels) and plosives have a significantly higher absolute value content than noise, and that fricative or unvoiced sounds have significantly lower zero crossing rates than the background noise.
The short-time absolute value score is defined as:

    AVS_n = \sum_{m=n-N+1}^{n} |x(m)|    (3.1)

This can also be written as a windowed value, to emphasize that only part of the absolute value of the signal is examined at a time:

    AVS_n = \sum_{m=-\infty}^{\infty} |x(m)\, w(n-m)|    (3.2)

where

    w(n) = \begin{cases} 1 & 0 \le n \le N-1 \\ 0 & \text{otherwise} \end{cases}

An absolute value calculation was used instead of an energy calculation so that the results would not be as greatly affected by a short spike of high energy.

The short-time zero-crossing score is defined as:

    ZC_n = \sum_{m=-\infty}^{\infty} |\,\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]\,|\; w(n-m)    (3.3)

where

    \mathrm{sgn}[x(m)] = \begin{cases} 1 & x(m) \ge 0 \\ -1 & x(m) < 0 \end{cases}

Within the first 100 ms of the signal the EPD determines the statistics of the absolute value score and the zero-crossing rate of the background noise; it is assumed that there is no speech in this interval. Within this time interval it is possible to calculate 2000 overlapped short-time absolute value scores, each of 1 ms duration. Similarly, 2000 zero crossing rates can be determined.

The absolute value score threshold is based upon the mean and standard deviation determined over the first 100 ms, and is given as:

    AVS_{Thres} = \overline{AVS_n} + 2.0\, \sigma_{AVS_n}    (3.4)

It was assumed that the short-time absolute value score has a Gaussian distribution, so the population z-score is given as:

    z = \frac{AVS_n - \overline{AVS_n}}{\sigma_{AVS_n}}    (3.5)

The z-score was chosen to be 2.0, which corresponds to a percentage of 95.44%. The significance of equation (3.4) is that the absolute value score threshold has been set at the 95.44% point on the Gaussian curve of the absolute value score distribution of the background noise.

Figure 3.4: Absolute Value Score Threshold with Z-score of 2.0

Since the absolute value score in the speech signal should be significantly higher than the absolute value score of the noise, the signal has crossed this bound when the 4.56% (z-score = -2.0) point on the absolute value score distribution curve of the signal passes the 95.44% mark on the absolute value score distribution of the background noise.

Figure 3.5: AVS Separation of Speech from Background Noise

By using absolute value score distributions (instead of a straight absolute value threshold) it can be assured that no single spike of background noise will trigger the threshold level. Once the statistics of the background noise have been determined, the absolute value score distribution of the signal is averaged over 10 ms intervals, with short-time absolute value score calculations done every 1 ms; 200 samples of the short-time absolute value score go into the estimate of the statistics.

If the word spoken starts with a voiced sound, the absolute value score threshold alone will determine the start of the word. However, if the word starts with a fricative or unvoiced sound, absolute value score calculations alone are not enough, and zero crossing calculations need to be considered. If the word begins with a plosive sound, a spike of high energy will occur, and the threshold will be triggered at this point if the plosive is followed by a vowel sound. In the English language it is usually the case that a plosive is followed by a vowel sound; therefore the plosive will be detected as the start of the word.
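Equations (3.1) and (3.3) translate directly into code. A small C sketch follows, assuming 16-bit samples; at the 20 kHz sampling rate used for the recordings in Chapter 4, N = 20 gives the 1 ms windows described above (the caller must ensure n >= N so the window stays inside the buffer).

    #include <math.h>

    /* Short-time absolute value score, equation (3.1): the sum of
       |x(m)| over the N samples ending at n. */
    double short_time_avs(const short *x, int n, int N)
    {
        double sum = 0.0;
        for (int m = n - N + 1; m <= n; m++)
            sum += fabs((double)x[m]);
        return sum;
    }

    /* Short-time zero-crossing score, equation (3.3): each sign change
       contributes |sgn[x(m)] - sgn[x(m-1)]| = 2 to the sum. */
    int short_time_zc(const short *x, int n, int N)
    {
        int sum = 0;
        for (int m = n - N + 1; m <= n; m++)
            if ((x[m] >= 0) != (x[m - 1] >= 0))
                sum += 2;
        return sum;
    }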
The threshold for the zero crossing rate is determined in a similar manner to the threshold for the absolute value score. It is given as:

    ZC_{Thres} = \overline{ZC_n} - 1.0\, \sigma_{ZC_n}    (3.6)

By experimentation it was found that the zero crossing rate does not change quite as dramatically as the absolute value score. This is because fricatives are very similar in structure to noise, so there will not be a lot of change between the zero crossing rates. A z-score of -1.0 corresponds to only 31.74%. The point at which 68.26% of the distribution of the zero crossing rate of the signal drops below the 31.74% point of the distribution of the zero crossing rate of the background noise is the start of the fricative; see Figure 3.6.

Figure 3.6: ZC Separation of Noise and Speech

The zero crossing rate fluctuates a lot, even just within the background-noise part of the signal. To be sure that the part of the signal that has just triggered the ZC threshold is actually a fricative, and not some change due to the noise, the ZC distribution must not go back above the ZC threshold more than two times. The third time that the zero crossing rate goes above the ZC threshold, it is assumed that what initially triggered it was not the beginning of a fricative but some change due to noise. In this case the position of the beginning of a fricative is reset, and the process of finding the drop in ZC rate starts over.

If the position of the absolute value score increase is found but no ZC threshold has been triggered, it can be assumed that the word begins with a voiced sound and that the beginning of the word is the position at which it crossed that threshold. If a ZC threshold is found but no absolute value score threshold, then the process must continue looking for the absolute value score threshold. The reason is that ZC thresholds are less reliable than absolute value score thresholds; a zero crossing threshold with no absolute value score threshold only means that the background noise has unstable statistics. If both ZC and absolute value score thresholds are found, then the beginning of the word is the earlier location of the two.

To find the location of the end of the word the same process is applied, but in reverse. The process now tries to find the point at which the 95.44% point of the absolute value score distribution of the signal drops below the absolute value score threshold of the noise.

Figure 3.7: End Of Word Found From AVS Threshold

Similarly, the point at which the 31.74% point of the zero crossing rate distribution of the signal goes above the ZC threshold, and stays above this threshold for at least three consecutive 10 ms intervals, is the end of the fricative. At the end of the word, the point at which the absolute value score drops below the threshold is found first. Once that point is found, we search ahead for the point at which the ZC rate increases.

Figure 3.8: End Of Word Found From ZC Threshold

The position of this threshold can occur only up to a maximum distance of three 10 ms intervals after the initial ZC threshold is determined. The reason is simple: if the word ends in a fricative, it will immediately follow the previous phoneme (either voiced, unvoiced, or fricative), so the algorithm should determine only the point at which the ZC rate rises initially, with allowances for slight fluctuations.
If the algorithm were allowed to find other ZC thresholds a long distance away from the end of the word, it might find a ZC threshold which is noise and not a correct end point. Therefore, a ZC threshold will always be found at the end of the word, but it will be in the same location as the absolute value score threshold if the word ends with a voiced sound. If the word ends in a fricative, we allow for only slight fluctuations in the ZC rate after the end of the absolute value score drop.

3.1 Results

The End Point Detector was tested on a number of recorded speech files. The accuracy was determined by cutting out the part of the signal which the algorithm designates as the word, padding zeroes onto either end of the signal, and writing this new signal to a file for subsequent listening tests. It was found that out of 160 speech files the EPD failed only twelve times, yielding an error rate of 7.5%.
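For concreteness, the start-of-word decision rules of this chapter might be sketched in C as follows. The thresholds come from the noise statistics of the first 100 ms (equations (3.4) and (3.6)); avs[] and zc[] are assumed to hold, for each 10 ms interval, the 4.56% (z = -2.0) point of the signal's AVS distribution and the 68.26% (z = -1.0) point of its ZC distribution. This is an illustrative sketch, not the thesis implementation.

    /* Returns the 10 ms interval at which the word starts, or -1 if
       no start is found. */
    int find_word_start(const double *avs, const double *zc, int nint,
                        double avs_thresh, double zc_thresh)
    {
        int zc_start = -1;   /* candidate fricative start          */
        int zc_rises = 0;    /* excursions back above ZC threshold */

        for (int i = 0; i < nint; i++) {
            /* AVS crossing: voiced (or plosive) onset.  If a fricative
               start was already seen, the word begins there instead. */
            if (avs[i] > avs_thresh)
                return (zc_start >= 0) ? zc_start : i;

            if (zc[i] < zc_thresh) {
                if (zc_start < 0) zc_start = i;   /* possible fricative */
            } else if (zc_start >= 0 && ++zc_rises >= 3) {
                zc_start = -1;                    /* noise; start over  */
                zc_rises = 0;
            }
        }
        return -1;   /* a ZC drop alone is not sufficient evidence */
    }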
Chapter 4

Noise Reduction Filters

4.1 Characteristics of the Signals

One of the major impediments to speech recognition is the effect of background noise on accuracy. So that no ill effects occur in speech recognition due to the noisy environment, the noise should be reduced from the signal before it is sent to the SRU for identification. Ultimately, the criterion for enhancement or noise reduction relates to the accuracy of recognition obtained by the SRU. It would be easier to suppress the noise if something were known about both of the signals involved, namely the speech and the background noise.

4.1.1 Speech

The speech part of the signal should be kept intact. To do this it is important to know some of the features of a speech signal. Speech sounds can be divided into three general categories:

1. Voiced Sounds. Voiced sounds are created by relaxed oscillation of the vocal cords, producing quasi-periodic pulses. An example of a voiced sound is the phoneme "oo" as in boot.

2. Fricatives or Unvoiced Sounds. Fricatives are created by forming a constriction at some point in the vocal tract and forcing air through the constriction, producing a broad-spectrum noise source. An example of a fricative is the phoneme "th" as in that.

3. Plosive Sounds. Plosive sounds are created by completely closing the vocal tract, building up pressure and releasing it quickly, thereby producing a short burst of noise-like sound. An example of a plosive is "b" as in bet.

The frequency content of each phoneme of speech is different; therefore, from one phoneme to another, the properties of speech are changing. Since the speed of enunciation of a word is limited by the physiological aspects of the vocal tract, the properties of the signal change relatively slowly, and speech can be modeled as the response of a slowly varying linear system. If a short-time sample of speech is windowed, it can be processed as a semi-periodic wave. However, over the complete signal (word, phrase, or sentence) the spectral components of speech are changing, and therefore some consideration must be given to the non-stationary nature of speech.

The vowel segments of the signal tend to have well-defined frequency components, or formant frequencies as they are generally called. The fricative components do not have well-defined frequency components; fricative sounds produce a static turbulent sound much like a noise signal, and it is therefore hard to separate these two signals.

As well as different sounds producing different frequency components, different speakers produce varying spectra. For example, speech tends to be in the frequency range of 0 to 5000 Hz; however, for male speakers this range can be as low as 2000 Hz, and for children as high as 7000 Hz. When processing a speech signal it is important to cover the full range of frequency components.

4.1.2 Characteristics of Noise

The noise components of the signal should be removed, or at least reduced. In order to separate the noise from the speech, it is desirable to know some of the features of the noise and how the noise is being produced. The noise in the speech signal was due mainly to machine sounds: engine sounds, noise from the hydraulics, and clanging sounds due to the movement of joints, namely the bucket, boom and stick. There are also external environmental sounds due to the movement of the excavator in its surroundings - for example, the bucket banging into surfaces, digging holes, lifting objects, etc. The external environmental noises are more variable than the machine sounds.

For analysis purposes, noise samples were recorded on site in the cab of the excavator. Since the excavator was tethered to an offboard computer it was not possible to do any digging or other auxiliary activities, so the noise recorded was strictly machine noise. Two different types of noise were recorded:

Noise 1: The noise produced when the excavator is stationary but in full-power idle.

Noise 2: The noise created by random motion of the branches of the excavator and some movement of the cab.

A power spectrum estimate of each noise type was calculated as one means of characterizing the noise signal. Since the power spectrum was an average over many noise records, any highly transient components of the signal were averaged out; this explains why the power spectra for the two types of noise, shown in Figures 4.9 and 4.10, are similar. These two ensemble power spectra can be compared to a power spectrum of noise calculated over a single 12.8 ms interval, as shown in Figure 4.11. An accurate short-time estimate of the noise components is preferable, due to the constantly changing environmental noise.

Figure 4.9: Ensemble PS of Noise 1

Another factor to consider is the physical constraints (or effects) on the noise signal. Since all recorded or input signals were collected in a small enclosed environment (inside the cab of the excavator), it was expected that the signal would be reverberated; it is very possible that the signal at the microphone is an overlapped and reflected version of the actual signal. There is no single noise source: the microphone is recording a single reference template from a diffuse noise field. The signal-to-noise ratio was also used to characterize the noise. The average SNR for all noisy speech files was found to be 12 dB.

Figure 4.10: Ensemble PS of Noise 2
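As a point of reference, the 12 dB figure is an average-power ratio. A simple way of computing such a figure is sketched below in C; the thesis does not state exactly how its value was derived, so the method here is an assumption.

    #include <math.h>

    /* SNR in dB between a speech segment (e.g., located by the EPD of
       Chapter 3) and a noise-only segment of the same recording. */
    double snr_db(const short *speech, int ns, const short *noise, int nn)
    {
        double ps = 0.0, pn = 0.0;
        for (int i = 0; i < ns; i++) ps += (double)speech[i] * speech[i];
        for (int i = 0; i < nn; i++) pn += (double)noise[i] * noise[i];
        return 10.0 * log10((ps / ns) / (pn / nn));
    }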
4.1.3 Categories of Filters Investigated

Based upon what was known or could be determined about either of the signals, a filter had to be designed that would separate the two. It was possible to obtain both the noisy speech signal and a reference signal of the noise from separate microphones. There are a number of ways to obtain an estimate of the power spectrum of the noise, as will be further explained in Section 4.2. If an estimate of the power spectrum of the noise is known, then there is the potential of removing the noise components from the speech components. The first class of filters studied incorporates the estimated noise spectrum into the filtering; this class includes Wiener and PSS filtering.

Figure 4.11: Short Time PS of Noise 1

Given a model of speech, the noise could be filtered based upon any deviation from this model. The second class of filters explores the possibility of modeling the speech as a slowly varying signal whose successive sample points are highly correlated, and the noise components as any values deviating from this model. The two filters in this class were the Linear Predictor Corrector (LPC) method and the Gaussian smoothing filter.

If an estimate of both of the signals (speech and noise) were available, it should be very easy to separate the components. The third class of filters studied obtains an estimate of the noise and an estimate of the speech-plus-noise signal simultaneously, from two different microphones. Even though a "clean" speech signal is not available, this class of filters is able to remove the noise by adaptively adjusting the filter weights to produce an optimum noise-reduced output signal. The filters in this class assume that there is enough periodicity in speech that the filter can adjust its weights to the signal. The two adaptive methods modified and tested for this application were LMS and RLS.

Generally, the criterion for evaluation of a filtering process is the improved intelligibility or quality of the filtered speech. For this application, the evaluation of the filter ultimately lies in the improved recognition by the speech recognition unit. The SRU used (VOTAN VTR6050) recognizes words based upon spectral analysis, so the main consideration for filtering is to remove the spectral components of the noise from the signal without greatly affecting the speech's spectral components. For this application it is possible to sacrifice some of the quality of the speech if the recognition rate is increased.

4.2 Noise Reduction Filters Based on PSE of the Noise

If it is possible to obtain an estimate of the frequency components of the noise in the signal, then this knowledge can be used to remove the noise components while leaving the speech components unaltered. It is essential that an accurate estimate of the noise signal be acquired, since any inaccuracies could: 1) increase the amount of noise in the filtered output signal instead of reducing it; or 2) distort the speech signal by removing some of the speech components.

In order to be able to separate the two signals with a linear filter, it must be assumed that the speech and noise are uncorrelated and additive. The problem is further simplified if it is assumed that the speech is a slowly varying signal, stationary over short time intervals and characterized by a set of resonances. The general form of the noise reduction filters based upon knowledge of the power spectrum of the noise is shown in Figure 4.12.

Figure 4.12: Filters Based on PSE of Noise

It is essential that a good representation of the magnitude of the short-time spectrum be obtained; the phase, however, is relatively unimportant. Enhancing only the short-time spectral amplitude leads to a zero-phase filter: the phase of the filtered speech is identical to that of the original noisy speech.

The filter weights of H(ω) are based upon the values estimated for the power spectrum of the noise. Two possibilities were examined for the filter. The first was to subtract the power spectrum of the noise from the power spectrum of the noisy speech; this method is called Power Spectral Subtraction (PSS). The second approach was to factor the frequency components of the noise out of the noisy speech; this method is generally referred to as Wiener filtering. Since both of these filters are magnitude-only, the phase information is passed unchanged.

As mentioned, it is essential that an accurate estimate of the power spectrum of the noise be derived. The question arises as to how this estimate can be obtained. In our case there were three possibilities, numbered below for further reference:

1. Noise estimated during the silent part of the signal. This estimate makes use of the end-point detector, described in Chapter 3, to determine the beginning and ending of the word.

2. Noise estimated a priori from previously recorded noise signals.

3. Noise estimated from a second microphone input while the operator is speaking into the first microphone.

The first estimate has the advantage that it gives a good short-term estimate of the noise just before the word is spoken. Since the noise is recorded through the same microphone as the signal, there is no need to worry about reverberation effects causing the estimated magnitude of the noise to be mismatched to the magnitude of the noise in the signal. A problem with estimating the noise in this way arises if the background noise changes while the operator is speaking; if this occurs, the estimate will be inaccurate. Since there was a good chance that the noise components were non-stationary, it was expected that method 1 might not give a good estimate.

The second method is an ensemble average; it is therefore less accurate over a short time interval, but is not as greatly affected as method 1 by inaccurate transient frequency components. This estimate is good at determining stationary noise components, but it has problems estimating the temporal noise components.
Enhancing only the short-time spectral amplitude leads to a zero phase filter. The phase of the filtered speech is identical to that of the original noisy speech. The filter weights of H(u>) are based upon the values estimated for the power spectrum of the noise. Two possibilities were examined for the filter. The first was the method of subtracting the amount of the power spectrum of the noise from the power spectrum of the noisy speech.  This method is called Power Spectrum Subtraction  (PSS). The second approach was to factor out the frequency components of the noise from the noisy speech. This method is generally referred to as Wienerfiltering.Since  Chapter 4. Noise Reduction Filters  39  both of these filters axe magnitude only, the phase information is passed unchanged. As mentioned, it is essential that an accurate estimate of the power spectrum of the noise is derived. The question arises as to how this estimate can be obtained. In our case there were three possibilities, numbered below for further reference. 1. Noise estimated during silent part of the signal. This estimate makes use of the end-point detector, described in Chapter 3, to determine the beginning and ending of the word. 2. Noise estimated a priori from previously recorded noise signals. 3. Noise estimated from a second microphone input while operator is speaking into thefirstmicrophone. The first estimate has the advantage that it gives a good short term estimate of the noise just before the word is spoken. Since the noise is recorded through the same microphone as the signal there is no need to worry about effects of reverberation causing the estimate of the magnitude of the noise to be mis-matched to the magnitude of the noise in the signal. A problem with estimating the noise in this way arises if the background noise changes during the time that the operator is speaking. If this occurs then the estimate will be inaccurate. Since there was a good chance that the noise components were non-stationary it was expected that method 1 might not be a good estimate. The second method is an ensemble average, therefore it is less accurate over a shorttime interval, but is not as greatly effected as method one is by inaccurate transient frequency components. This estimate is good at determining stationary noise components however, it will have problems estimating the temporal noise components. The third method of estimating the noise components of the signal continuously from a second microphone, has an advantage over the previous two methods because  Chapter 4. Noise Reduction Filters  40  it can track changes in the noise. The disadvantage of this noise estimate is matching the magnitude and phase between the two microphones. Since the two microphones are separated by a distance of about 50 cm, the magnitude of the estimate of the noise could be larger or smaller then the magnitude of the estimate of the noise in the signal. There is the added complexity of matching magnitude between the two estimates of the power spectrum with this method. For the first method of estimating noise (ie. at the beginning of the signal), the number of sections (K) was set equal to INT (number of samples before the beginning of word/256). The noise was estimated in sections of 512 (each section has 256 data points and 256 zeroes) sample points for K sections before the start of the word. The second estimate of the noise was determined from an ensemble average of prerecorded noise files. There were 29952 (N) sample points for each record. 
Each record was divided into 117 sections of 256 points each, zero padded with 256 points for an FFT size of 512. Since the sampling rate was 20 kHz, the resultant power spectral estimate was calculated at equally spaced frequencies of 39.0625 Hz. For each noise type there were 80 files; therefore the average was estimated over a total of 9360 (80 × 117) sections of noise data.

The third method, estimating the noise continuously, calculated the noise spectrum for each 256 sample points. Each section was padded with 256 zeroes. The power spectrum of the noise was re-estimated for each section of the signal from the second microphone.

Due to the magnitude difference between the microphones, the third estimate of the power spectrum had to determine a scaling factor equal to the magnitude difference. The scale factor was determined by calculating the sum of the magnitude of the power spectrum at all frequency components for both microphones during the first section of the signal (time = 0). For example, the magnitude of microphone number 1 is equal to Σ_{k=0}^{M−1} |X_{m1}(ω_k)|². The scale value was equal to the magnitude of microphone 1 divided by the magnitude of microphone 2. For each successive filtered section of data, the magnitude of the power spectrum of the noise over that window is multiplied by the scaling value. This scaling factor adjusts the magnitude to be similar between the two microphones.

Note that for all methods of noise power spectrum estimation, the section size used was 256 data points, windowed with a Hamming window and padded with 256 zeroes due to the overlap and add method. Since a 512-point FFT gives a maximum resolution of 39.0625 Hz, each sample of the power spectrum is a smoothed version of the actual resolution.

4.2.1 Power Spectral Subtraction

Power Spectral Subtraction (PSS) makes an estimate of the power spectrum of the noise and subtracts this value from the short time power density spectrum of the signal. Consider:

x(t) = s(t) + n(t)    (4.7)

where x(t) is the noisy speech, s(t) is the speech and n(t) is the noise at time t. Hence:

P_x(ω) = P_s(ω) + P_n(ω)    (4.8)

where P_x(ω), P_s(ω) and P_n(ω) are the power density spectrums of x(t), s(t) and n(t) respectively. If P_n(ω) is known or can be determined, then it is possible to subtract that value from P_x(ω) and obtain a noise free estimate P_s(ω) of the power spectrum of the speech.

This method only works if speech is a stationary signal, which it is not. However, speech can be assumed stationary over short time intervals. The PSS method is therefore derived using windowed versions of s(t), n(t) and x(t), namely x_w(t), n_w(t) and s_w(t), with their Fourier transforms being X_w(ω), N_w(ω) and S_w(ω), respectively:

x_w(t) = s_w(t) + n_w(t)    (4.9)

and:

X_w(ω) = S_w(ω) + N_w(ω)    (4.10)

The power spectrum of the signal is calculated as:

|X_w(ω)|² = |S_w(ω)|² + |N_w(ω)|² + S*_w(ω)N_w(ω) + N*_w(ω)S_w(ω)    (4.11)

A close approximation to equation (4.11) is:

|X_w(ω)|² = |S_w(ω)|² + E[|N_w(ω)|²] + E[S*_w(ω)N_w(ω)] + E[N*_w(ω)S_w(ω)]    (4.12)

The expected value terms ( E[·] ) are required since it is not possible to determine exact values for these quantities. If n(t) has zero mean and is uncorrelated with s(t), as assumed in the previous discussion, then E[S*_w(ω)N_w(ω)] and E[N*_w(ω)S_w(ω)] are zero.
The speech is estimated as:

|Ŝ_w(ω)|² = |X_w(ω)|² − E[|N_w(ω)|²]    (4.13)

where both quantities |X_w(ω)|² and E[|N_w(ω)|²] are easily obtained. There is a chance that |X_w(ω)|² − E[|N_w(ω)|²] could be less than zero, since the noise spectrum estimate allows for a margin of error. In this case the estimate of the speech power spectrum is set to zero.

In order to determine the estimate of s_w(t), the phase information from the original signal X_w(ω) is retained, and the magnitude information is obtained from |Ŝ_w(ω)|. Hence:

Ŝ_w(ω) = |Ŝ_w(ω)| e^{j arg X_w(ω)}    (4.14)

and:

ŝ_w(t) = F⁻¹[Ŝ_w(ω)]    (4.15)

Figure 4.13 shows a block diagram of the Power Spectral Subtraction method. The values that are known or can be calculated are |X_w(ω)|² and E[|N_w(ω)|²]. As mentioned in the last section, the estimate of the noise power spectrum was determined in three ways.

Figure 4.13: PSS Filter

4.2.2 Wiener Noise Filter

Wiener filtering is a method of frequency weighting. An optimum fixed filter is determined based upon an estimate of the noise signal. The signals x(t), s(t) and n(t) must be windowed due to the nonperiodic nature of speech. The estimate of the speech signal is obtained by filtering the input signal x(t) with the optimum filter H(ω):

Ŝ(ω) = H(ω)X(ω)    (4.16)

The optimum filter H(ω) is derived based on Wiener's optimum noise filtering method, which is briefly outlined as follows.

Consider the noisy signal as comprised of speech and noise such that x(t) = s(t) + n(t). s(t) and n(t) are uncorrelated stationary random processes, and their power density spectrums are P_s(ω) and P_n(ω), respectively. Therefore P_x(ω) = P_s(ω) + P_n(ω). Wiener's approach uses a linear estimator of s(t) which minimizes the mean square error. The noncausal Wiener filter frequency response is given as:

H(ω) = P_s(ω) / (P_s(ω) + P_n(ω))    (4.17)

Since speech is not stationary and the spectrum P_s(ω) cannot be assumed known, the filter must be derived based on an estimate of the signals over a short time window:

H_w(ω) = E[|S_w(ω)|²] / (E[|S_w(ω)|²] + E[|N_w(ω)|²])    (4.18)

Once again, the estimate of the noise N_w(ω) is obtained in one of the three ways mentioned in Section 4.2. Assuming that E[|S_w(ω)|²] has no zero values, equation (4.18) may be rewritten as:

H_w(ω) = 1 / (1 + E[|N_w(ω)|²] / E[|S_w(ω)|²])    (4.19)

Since E[|S_w(ω)|²] was unknown, it was assumed to be broadband and fairly evenly distributed (which can only be true in the long term case). Hence the estimate of |S_w(ω)|² for all frequencies was set equal to 1. The optimum filter transfer function then becomes:

H_w(ω) = 1 / (1 + E[|N_w(ω)|²])    (4.20)

and the estimate of the speech can be found using equation (4.16). The Wiener filter is a zero phase filter, and hence the phase of the output signal is the same as the phase of the input signal:

ŝ_w(t) = F⁻¹[ |Ŝ_w(ω)| e^{j arg X_w(ω)} ]    (4.21)

The weights of a typical Wiener filter, based upon an estimate of the power spectrum of the noise during the silent part of the signal, are shown in Figure 4.14.

Figure 4.14: Wiener Filter Frequency Weights
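To make the two rules concrete, the following C fragment applies the PSS rule of equation (4.13) and the Wiener rule of equation (4.20) to one windowed section. It is a minimal sketch rather than the thesis implementation: it assumes the 512-point FFT has already been taken, so the inputs are simply arrays holding |X_w(ω)|² and E[|N_w(ω)|²] for each frequency bin, and the array names and bin count are invented for the example.

    #include <math.h>

    #define NBINS 257   /* bins 0..256 of a 512-point FFT (one side) */

    /* PSS rule (4.13): subtract the noise power spectrum estimate,
       setting any negative result to zero. */
    void pss_frame(const double px[], const double pn[], double ps[])
    {
        for (int k = 0; k < NBINS; k++) {
            double d = px[k] - pn[k];
            ps[k] = (d > 0.0) ? d : 0.0;
        }
    }

    /* Wiener rule (4.20): gain H = 1/(1 + Pn); since |S| = H|X|,
       the power spectrum is scaled by H squared. */
    void wiener_frame(const double px[], const double pn[], double ps[])
    {
        for (int k = 0; k < NBINS; k++) {
            double h = 1.0 / (1.0 + pn[k]);
            ps[k] = h * h * px[k];
        }
    }

In both cases the square root of ps[k] gives the enhanced magnitude, which is recombined with the phase of X_w(ω) before the inverse FFT, as in equations (4.14) and (4.21).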
Figure 4.15: Wiener Filter

An intuitive feel for this filter is that components of the input X(ω) in ranges where the noise frequency components are large are reduced by a large amount in the output estimate of the speech, whereas noise frequency components that have small values (≪ 1) hardly change the value of that frequency in the output speech estimate.

4.2.3 Implementation and Testing

Since the noise was estimated in three different ways, three different implementations of the Wiener and PSS filters were needed.

If the noise is estimated during the silent part of the signal, the end point detector is used to determine the point at which the speech starts. Then an estimate of the noise spectrum before this point is made, and the values of the filter weights are stored until the end of the word is detected. During the time that the word is being spoken, the filter weights are fixed. As soon as the end of the word is detected, the method starts determining a new estimate of the power spectrum of the noise. Figure 4.16 shows the block diagram for this filtering method.

If the noise estimate is based on an ensemble average calculated from some prior knowledge of the noise, then the filter values are simply stored in memory; hence the filter is fixed. Figure 4.12 shows the block diagram for this filtering method.

Figure 4.16: Filter With PSE # 2 of Noise

When the power spectrum of the noise is estimated from a second microphone input, the filter weights must be continuously updated. The filter weights and the power spectrum of the noisy speech were recalculated every M (256) samples. Figure 4.17 shows the set up for this method.

Implementation of all filters used the overlap and add method to process the continuous speech signal by parts, followed by an FFT routine to convert the signal to the frequency domain so that linear filtering can be applied. A Hamming window was selected to window the speech before processing. It is essential that the speech signal is windowed: if a finite stretch of speech is isolated and Fourier transformed, without regard to the pitch of the speech, then the periodicity assumptions of the FFT will be violated, and discontinuities at the edges of the isolated speech signal will cause considerable distortion of the speech spectrum. Distortion can be minimized by using a smooth window, hence the Hamming window was chosen.

Figure 4.17: Filter With PSE # 3 of Noise

Welch's method [26] was used to estimate the noise power spectrum in all cases. Briefly, the derivation for Welch's method is as follows. Each complete noise record (or signal), consisting of N points, is divided into K sections of M samples each. Therefore, K power spectrums are calculated for each noise record. Let:

X_w^(i)(ω) = Σ_{n=0}^{M−1} x^(i)(n) w(n) e^{−jωn},  i = 1, 2, …, K    (4.22)

And:

U = (1/M) Σ_{n=0}^{M−1} w²(n)    (4.23)

Then the spectral estimate is:

B_x(ω_k) = (1/(K M U)) Σ_{i=1}^{K} |X_w^(i)(ω_k)|²,  k = 0, 1, …, M−1    (4.24)

To estimate the power spectrum using FFTs, calculate:

X_w^(i)(k) = Σ_{n=0}^{M−1} x^(i)(n) w(n) e^{−j2πkn/M},  k = 0, 1, …, M−1    (4.25)

for the K sections. The FFTs are computed at equally spaced frequencies ω_k = (2π/M)k. The value |X_w^(i)(k)|² is calculated for each section, the results are summed over all K sections and averaged by dividing by K M U. The result is an estimate of the smoothed power spectrum of the signal.
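The following C sketch shows the shape of this computation for one noise record. It is an illustration rather than the thesis code: to keep it self-contained, a direct O(M²) DFT stands in for the FFT routine, there is no zero padding, and the section length is deliberately tiny.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define M 8   /* section length (256 in the thesis) */

    /* Welch estimate over K = nsamples/M sections of x[], Hamming
       windowed, accumulated into psd[0..M-1]. */
    void welch_psd(const double x[], int nsamples, double psd[])
    {
        double w[M], U = 0.0;
        int K = nsamples / M;

        for (int n = 0; n < M; n++) {            /* window and U, eq. (4.23) */
            w[n] = 0.54 - 0.46 * cos(2.0 * M_PI * n / (M - 1));
            U += w[n] * w[n] / M;
        }
        for (int k = 0; k < M; k++) psd[k] = 0.0;

        for (int i = 0; i < K; i++) {            /* one DFT per section, eq. (4.25) */
            for (int k = 0; k < M; k++) {
                double re = 0.0, im = 0.0;
                for (int n = 0; n < M; n++) {
                    double v = x[i * M + n] * w[n];
                    re += v * cos(2.0 * M_PI * k * n / M);
                    im -= v * sin(2.0 * M_PI * k * n / M);
                }
                psd[k] += re * re + im * im;     /* accumulate |X|^2 */
            }
        }
        for (int k = 0; k < M; k++) psd[k] /= K * M * U;   /* eq. (4.24) */
    }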
4.3 Adaptive Noise Cancelling Filters

Adaptive noise cancelling filters attempt to estimate a signal which has been corrupted with additive noise by adjusting their filter weights. An adaptive filter differs from a fixed filter in that it automatically adjusts its own impulse response. The basic structure of an adaptive noise cancelling filter is shown in Figure 4.18.

Figure 4.18: General Form of an Adaptive Filter

There are two categories of adaptive filters, (1) stochastic and (2) exact.

1. Stochastic filters are derived using a statistical error measure but implemented using the exact data available. The statistical measurement is the ensemble average of the squared prediction error. The optimum weights of the filter are W*_N = R_NN⁻¹ P_N, where:

• W_N is the filter weight vector of N values.
• R_NN is the autocorrelation matrix of the input signal (n₁).
• P_N is the cross-correlation vector between the desired signal (s + n₀) and n₁.

The solution to this equation is found using a gradient measurement to estimate the value R⁻¹P; no actual computation of R⁻¹ is ever done. For complete details on the derivation of this equation see [35], [36]. An important feature of this filter is that the optimum weights (W*_N) should converge to a single value. For this filter to be effective, the noise signal must be stationary, or varying slowly enough that the filter can track the changing statistics.

In particular, this study examined the LMS (Least Mean Squares) adaptive filter, which uses the approach of steepest descent for solving the normal equations. The LMS algorithm was first derived by Widrow and Hoff in 1959 at Stanford University. The first adaptive noise cancelling system using the LMS algorithm was also designed at Stanford University, by two students, in 1965 [36].

2. An exact adaptive filter predicts the noise reduced output signal based upon error measurements derived from the exact data signals acquired. It turns out that the weights of the filter can be found from the equation:

W_N(n) = R_NN⁻¹(n) P_N(n)

where the value in brackets (n) signifies the weights, correlation, or autocorrelation at the exact time n (or sample number n). This method re-estimates the inverse autocorrelation matrix at each time interval; the filter is therefore able to follow the actual data signals and any transient components of them. The family of exact methods that determine the inverse correlation matrix in order to predict the weights of the filter are known as RLS (Recursive Least Squares) filters. A convenient feature of the RLS is a parameter called the "forgetting factor". This parameter is a data weighting factor which allows recent data to be weighted more heavily in the calculations.

Initial work in the area of exact adaptive filters can be attributed to Kalman, who in 1960 developed a state space stochastic filter. RLS filters have been used for such applications as channel equalization, system identification and speech analysis. More information on the RLS adaptive filter can be found in [1].

4.3.1 LMS Adaptive Filters

This section briefly outlines the LMS algorithm, first derived by Widrow, but adapted for our work.
Referring back to Figure 4.18 it can be seen that the adaptive filter changes based on the feedback error value. The adaptive filter is attempting to minimize this error by producing an output z = s + n₀ − y that is a best fit, in the least squares sense, to the signal s.

Assume that s, n₀, n₁ and y are statistically stationary and have zero means. No knowledge of the characteristics of s, n₀ or n₁ is required. However, it is assumed that s and n₀ or n₁ are uncorrelated, and that n₀ and n₁ are correlated, but in some unknown way. For our purposes the signal s is the speech signal, n₀ is the noise in the input signal and n₁ is the reference noise recorded by the second microphone.

The adaptive filter must transform n₁ to n₀ in order to produce an output z that is the best estimate of s. Adjustment of the filter is based on the filter output z, which serves as an error signal for the adaptive process: the filter adjusts itself to minimize the output error. To obtain maximum performance the output signal should be s, in which case the minimum error output would be s, with n₀ − y = 0.

The key equations for the LMS filter are as follows. The output of the filter is defined as:

z = e = s + n₀ − y    (4.26)

The output of the adaptive filter, y_j, is computed as the sum of N filter weights times N input noise components. Figure 4.19 shows the diagram of the adaptive linear combiner. The combiner is linear only in the sense that the weights and the input components are multiplied and summed; the weights, however, are not fixed. They change with the characteristics of the inputs, and therefore an adaptive filter is in fact non-linear.

The output of the adaptive linear combiner is equal to the inner product of W (the filter weights) and X_j (the reference noise signal):

y_j = X_jᵀ W = Wᵀ X_j    (4.27)

The error component can now be rewritten as:

e_j = d_j − X_jᵀ W    (4.28)

where d_j is the desired signal or training signal. For the problem of reducing noise from speech, the desired signal is (s + n₀), since this is the only signal available; d_j is defined as (s + n₀)_j, which is a scalar quantity.

Adjusting the weights of the W vector is necessary to minimize the error. When the minimum error is obtained, the filter weights have converged to a single optimum value. The gradient method is used to minimize the error term: by setting the gradient to zero to determine the minimum, W* = R⁻¹P is obtained.

The LMS algorithm finds close approximations to the optimum weight vector without measuring the correlation vector and without any inverse matrix calculations. It does this by using an iterative steepest descent method, where the next weight is equal to the last weight plus the change due to the negative gradient:

W_{j+1} = W_j − μ∇_j    (4.29)

The value μ is a stability factor and ∇_j is the true gradient at the jth iteration. It can be found that an estimate of the true gradient is:

∇̂_j = −2e_j X_j    (4.30)

Substituting this equation back into (4.29) yields:

W_{j+1} = W_j + 2μe_j X_j    (4.31)

The LMS adaptive noise cancelling filter was implemented in software using the programming language C. The algorithm is outlined in Table 4.4.

Initialize:
    w₀(0) = 1
    w_n(0) = 1/N  (1 ≤ n < N)
Operation:
    j = j + 1
    Acquire d_j, X_j
    Calculate new output value: y_j = Σ_{n=0}^{N−1} x_{j−n} w_n
    Form prediction error: e_j = d_j − y_j
    Update LMS predictor: W_{j+1} = W_j + 2μ e_j X_j

Table 4.4: Algorithm To Implement the LMS Filter
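A minimal C rendering of Table 4.4 is given below. It is an illustrative sketch, not the thesis program: the tap count, the step size μ, and the synthetic signals in main() are all invented for the example. In the real system d would come from the speech-plus-noise microphone and x from the reference noise microphone.

    #include <stdio.h>
    #include <math.h>

    #define N  8        /* number of filter taps (assumed for the example) */
    #define MU 0.01     /* step size mu (assumed) */

    int main(void)
    {
        double w[N], x[N] = {0};   /* weights and reference-noise delay line */
        int n, j;

        w[0] = 1.0;                             /* initialize as in Table 4.4 */
        for (n = 1; n < N; n++) w[n] = 1.0 / N;

        for (j = 0; j < 1000; j++) {
            /* Synthetic data: a slow "speech" sinusoid plus correlated noise. */
            double noise = sin(0.9 * j);              /* n1: reference noise */
            double d = sin(0.03 * j) + 0.8 * noise;   /* d = s + n0          */

            for (n = N - 1; n > 0; n--) x[n] = x[n - 1];
            x[0] = noise;

            double y = 0.0;                     /* y_j = sum of x_{j-n} w_n   */
            for (n = 0; n < N; n++) y += w[n] * x[n];

            double e = d - y;                   /* z = e = s + n0 - y         */
            for (n = 0; n < N; n++)             /* W_{j+1} = W_j + 2 mu e X_j */
                w[n] += 2.0 * MU * e * x[n];

            if (j % 200 == 0) printf("j=%4d  output z=%f\n", j, e);
        }
        return 0;
    }

As the weights converge, y cancels the n₀ component and the error output z approaches the speech term s.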
4.3.2 Recursive Least Squares Adaptive Noise Reduction Filter

The Recursive Least Squares (RLS) prediction filter is based on error measurements derived from the exact data acquired. In comparison, LMS filtering is based on a statistical error measure, the long-term accumulated squared error of the actual data. To derive the optimum weight values for the RLS method, the actual data vector [x(1) ⋯ x(n)]ᵀ is used in the predictor. Therefore, a new criterion is recomputed and optimized at every time iteration n. The RLS filter produces results that are exactly optimum for the acquired data, not statistically optimum for a class of data, as with the LMS method. This suggests that the RLS filter requires intensive computation, which is true if it is computed directly. However, fast Kalman filter and vector space algorithms exist which can significantly reduce the complexity.

Briefly, the derivation and notation for the exact error measurement is as follows. W_N(n) is the N length weight vector of the filter at time n. The error is:

e(i|n) = d(i) − X_Nᵀ(i) W_N(n)    (4.32)

e(i|n) is the error at the output of the filter at the ith sample, given the N sample input data vector X_N(i), the desired signal d(i) at time i and the N optimum filter weights W_N(n). The RLS method attempts to find a set of filter weights W_N(n) such that the cumulative squared error measurement:

ε(n) = Σ_{i=1}^{n} λ^{n−i} e²(i|n)    (4.33)

is minimized.

An important feature of the last equation is the variable λ. This quantity is known as the "forgetting factor", a data-weighting factor. λ acts like a window, causing recent data to have more of an emphasis in the error analysis. As λ decreases, the window size decreases, and the cumulative error function is then based only on recent data points. If the data is stationary, λ can be set equal to 1 for good results; in this case the RLS method obtains results very similar to the LMS method. If the statistics of the data signal are changing, λ should be set to a value less than 1. Normally this value is in the range 0.95 to 0.9995, depending upon how quickly the data is changing.

Equation (4.34) shows that the optimum filter weights at time n can be obtained with a time recursion plus a gain term due to the exact error measurement obtained from the actual data acquired. Hence, the error measurement changes at each time iteration n, and the filter weights are a function of the current data and of the window size set by λ:

W_N(n) = W_N(n − 1) + g_N(n) e(n|n − 1)    (4.34)

The RLS adaptive noise cancelling filter was implemented in software, using the programming language C. The algorithm is given in Table 4.5.

Initialize:
    W_N(0) = X_N(0) = 0
    C_NN(0) = δ I_NN  (δ ≫ 1)
Operation:
    n = n + 1
    Acquire d(n), X_N(n)
    Form prediction error: e(n|n−1) = d(n) − X_Nᵀ(n) W_N(n−1)
    Calculate new gain vector:
        ρ(n) = X_Nᵀ(n) C_NN(n−1) X_N(n)
        g_N(n) = C_NN(n−1) X_N(n) / (λ + ρ(n))
    Update RLS predictor: W_N(n) = W_N(n−1) + g_N(n) e(n|n−1)
    Update matrix inverse: C_NN(n) = (1/λ) [ C_NN(n−1) − g_N(n) X_Nᵀ(n) C_NN(n−1) ]

Table 4.5: Algorithm To Implement the RLS Filter
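The C sketch below fills in Table 4.5 for a small filter. As with the LMS example, it is only a sketch under assumed parameters — the tap count, λ, δ and the synthetic signals are invented — and the O(N²) matrix update is written out directly rather than through a fast algorithm.

    #include <stdio.h>
    #include <math.h>

    #define N      4        /* filter length (assumed for the example) */
    #define LAMBDA 0.995    /* forgetting factor lambda */
    #define DELTA  1000.0   /* large initial value: C(0) = delta * I  */

    int main(void)
    {
        double w[N] = {0}, x[N] = {0};   /* W_N(0) = X_N(0) = 0 */
        double C[N][N] = {{0}};          /* inverse correlation matrix C_NN */
        int i, j, n;

        for (i = 0; i < N; i++) C[i][i] = DELTA;

        for (n = 0; n < 1000; n++) {
            double noise = sin(0.9 * n);              /* n1: reference noise */
            double d = sin(0.03 * n) + 0.8 * noise;   /* d = s + n0          */
            double u[N], rho = 0.0, g[N], e;

            for (i = N - 1; i > 0; i--) x[i] = x[i - 1];
            x[0] = noise;

            e = d;                                    /* e(n|n-1) = d - X'W  */
            for (i = 0; i < N; i++) e -= x[i] * w[i];

            for (i = 0; i < N; i++) {                 /* u = C(n-1) X        */
                u[i] = 0.0;
                for (j = 0; j < N; j++) u[i] += C[i][j] * x[j];
            }
            for (i = 0; i < N; i++) rho += x[i] * u[i];            /* rho = X'u */
            for (i = 0; i < N; i++) g[i] = u[i] / (LAMBDA + rho);  /* gain      */

            for (i = 0; i < N; i++) w[i] += g[i] * e; /* W(n) = W(n-1) + g e */

            /* C(n) = (1/lambda)[C(n-1) - g X'C(n-1)]; since C is symmetric,
               X'C(n-1) equals u', so the outer product is g[i]*u[j]. */
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    C[i][j] = (C[i][j] - g[i] * u[j]) / LAMBDA;

            if (n % 200 == 0) printf("n=%4d  output z=%f\n", n, e);
        }
        return 0;
    }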
4.4 Enhancement Filters Based On A Simple Model of Speech

This class of noise filters requires very little prior information; however, some basic assumptions are made about the speech signal. The first assumption is that speech is periodic over short time periods, whereas noise has a vicissitudinous nature. The second assumption is that speech is slowly varying over a few data samples, and therefore adjacent speech samples are highly correlated. This class of filters attempts to track the input signal s + n₀, removing or averaging out any quick changes.

The first method investigated was a linear predictor corrector method, described in detail in Section 4.4.1. The predictor uses the information from the last n samples to determine if the current sample is correct. If the current sample is out of the range of the last n samples, then it is assumed that the new data sample contains noise. The corrector uses an Nth order Taylor series expansion to estimate the proper sample value. This method removes any value that is out of range and estimates a correct value based on previous data.

The second method is a common Gaussian convolution routine, described in detail in Section 4.4.2. This method redetermines the current sample based upon a weighted average of the windowed noisy input data. It is a weighted average method which removes any quick changes in the data and produces a "smoothed" version of the input signal.

The advantage of this class of filters is its simplicity. No second microphone input is needed. No scaling or estimation of the noise spectrum needs to be done. Also, the methods can be applied in the time domain, thereby avoiding the windowing effects that can be a problem with the spectrum estimation filters. Of course, with this simplicity comes a gross underestimation of the complexity of the signal.

4.4.1 Linear Predictor Corrector (LPC) Method

The Linear Predictor Corrector attempts to track the incoming signal, re-estimating by extrapolation any point that is out of range. The premise of this method is to follow the signal very closely, re-estimating only the samples that appear incorrect. As the amount of noise in the signal increases, it becomes harder for this method to accurately track the "proper signal".

LPC is basically a two stage method. The first stage determines if the next sample is an acceptable value. The predictor is based upon the range of the absolute value of the first derivative of the last N samples; N in our case was set to four. Since it is assumed that adjacent samples are highly correlated, the first derivative should change slowly. If the absolute value of the first derivative of the current sample is greater than the maximum value of the last four first derivatives, then the current sample is assumed incorrect.

The first derivative is a measurement of the slope. Since speech is modeled as a linear system, it is valid to assume that between correlated samples of speech the slope varies slowly. The amount by which the signal can change is restricted by targeting data samples outside the current slope range as incorrect.
By using the absolute value of all first derivatives, a range of plus or minus the maximum slope is allowed for the current sample. Instead of finding the minimum and maximum separately, this approach necessarily allows the slope to change direction. The absolute value of the first derivative acts as an error range for the current sample. The error range can be decreased by setting the maximum value of the first derivative to the second (or third) largest value (in sorted order) in the range of the first derivatives. Lowering this range forces the output signal to be less varying, but the output then does not necessarily follow the incoming signal as closely.

The second stage - the corrector - is based upon a Taylor series expansion. The order of the Taylor series is estimated by determining the Nth derivative of the numerical data that is equal to zero. Determining the order of a polynomial equation is done by finding the derivative that is equal to zero; the order of the polynomial is then the order of that derivative minus 1. To find or predict the order of the numerical data is slightly more difficult, since there is some noise involved with the actual data.

Denote the noisy speech input as y(n), where n is the sample number. The output (predicted) noise free speech is denoted s(n). To determine the order of the Taylor series needed to predict sample n, find the numerical derivative of s(n) at which the actual value approaches zero and hovers around it. Numerically the first derivative is:

s′(n) = s(n − 1) − s(n − 2)

The second derivative is:

s″(n) = s′(n) − s′(n − 1)    (4.35)
      = s(n − 1) − s(n − 2) − [s(n − 2) − s(n − 3)]
      = s(n − 1) − 2s(n − 2) + s(n − 3)

And so on. Since there is noise associated with the actual data, very rarely will the derivative be exactly zero. Therefore it is necessary to check subsequent derivatives to determine if they are also close to zero.

Once the order of the last m sample points has been determined, Taylor series expansion is used to correct the current sample, s(n). Taylor's series is defined as:

f(x) = Σ_{k=0}^{n} [f^(k)(x₀)/k!] (x − x₀)^k + [f^(n+1)(ξ)/(n+1)!] (x − x₀)^(n+1)    (4.36)

The second term is the error term E_{n+1}. Therefore:

f(x) = Σ_{k=0}^{n} [f^(k)(x₀)/k!] (x − x₀)^k + E_{n+1}    (4.37)

But in this case we wish to determine the next sample point a distance (or time) h ahead, where h is the spacing between sample points and in this case is constant. Therefore, by replacing x by x+h and x₀ by x, the proper form is obtained, namely:

f(x + h) = Σ_{k=0}^{n} [f^(k)(x)/k!] h^k + E_{n+1}    (4.38)

This equation can be interpreted as follows: at the next time interval, x+h, an estimate of the function can be obtained by summing the derivatives of the current function (at location x) times powers of the interval. Since this is an extrapolation method, there is no need to look at subsequent samples; all predicted values can be determined from previous samples. However, extrapolation tends to be less accurate than interpolation, due to the uncertainty of the order of the equation.

There are a few points to consider with this method. The first is the assumption that the waveform being tracked, namely speech, can be modelled as a continuous linear Nth order polynomial equation. This assumption is true for voiced speech, but does not hold as well for unvoiced or plosive sounds. Due to this assumption the corrected output waveform s(n) has been forced to be continuous or close to continuous.
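The following C sketch illustrates the predict-and-correct step. It is a simplification of the method just described, not the thesis routine: rather than detecting the polynomial order on the fly, it uses a fixed second order Taylor extrapolation (with the spacing h absorbed into the backward differences), and the out-of-range test follows the four-sample slope rule described above.

    #include <stdio.h>
    #include <math.h>

    /* Track signal y[] of length len into s[]; a sample whose slope jumps
       outside the range of the last four slopes is replaced by a Taylor
       extrapolation from the corrected samples before it. */
    void lpc_track(const double y[], double s[], int len)
    {
        int n, k;
        for (n = 0; n < len && n < 5; n++) s[n] = y[n];

        for (n = 5; n < len; n++) {
            double maxd = 0.0, dcur, d1, d2;
            for (k = 1; k <= 4; k++) {           /* max |first derivative| */
                double d = fabs(s[n - k] - s[n - k - 1]);
                if (d > maxd) maxd = d;
            }
            dcur = fabs(y[n] - s[n - 1]);

            if (dcur <= maxd) {
                s[n] = y[n];                     /* sample accepted */
            } else {                             /* corrector, eq. (4.38): */
                d1 = s[n - 1] - s[n - 2];                    /* f' h      */
                d2 = s[n - 1] - 2.0 * s[n - 2] + s[n - 3];   /* f'' h^2   */
                s[n] = s[n - 1] + d1 + 0.5 * d2;
            }
        }
    }

    int main(void)
    {
        double y[64], s[64];
        int n;
        for (n = 0; n < 64; n++)                 /* sine wave with one spike */
            y[n] = sin(0.2 * n) + (n == 30 ? 0.8 : 0.0);
        lpc_track(y, s, 64);
        printf("y[30]=%6.3f  corrected s[30]=%6.3f\n", y[30], s[30]);
        return 0;
    }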
The predicted order of the Taylor series expansion could be out by a few orders due to error in the signal. When determining the order of the signal at time n, the Nth derivative is not always exactly zero; therefore an error range of ±5 (out of a possible maximum resolution of 4096) is allowed.

It has been assumed that the order for the Taylor series expansion is equal to the predicted order of the output polynomial waveform s(n). If the predicted order of the polynomial waveform s(n) is correct, then this assumption will hold. To validate this claim take the case where f(x) is an actual curve of order 3, say f(x) = x³ + x² + x + 1. Then any derivative of f(x) greater than f³(x) will be zero. For real numerical data there could be small error terms associated with the values of the derivatives greater than order 3. Taylor's Theorem states that f(x) = Σ_{k=0}^{n} [f^(k)(x₀)/k!] (x − x₀)^k + E_{n+1}; therefore any derivative of f(x) greater than f^n(x), where n is the order of the curve, should be zero, and hence the summation need only be done up to the Nth derivative.

Another point that must be considered is the validity of using Taylor's series to extrapolate to a point outside the closed interval I. This method tries to predict the value x+h outside of the interval I using Taylor's theorem when, in fact, Taylor's theorem states that h should be a value such that x+h is in I. In the present circumstances this extension of the theory should not be a problem, for two reasons. The first is that the value of h is very small (200 microseconds); therefore the value of x+h can be considered to be at the boundary. The second reason that Taylor's series can still be applied is that sequential points are so close together that they should be correlated; hence (noise free) data will have similar characteristics (ie. slope, curvature, etc.).

4.4.2 Gaussian Smoothing Filtering

The Gaussian filter can be considered as a weighted averaging function. Its purpose is to reduce sensitivity to noise. The impulse response of a Gaussian filter is shown in Figure 4.20. Upon convolution of any signal with the Gaussian function, the center point in the convolution has the most weight; surrounding points in the signal figure less significantly into the weighted output average. The effect of Gaussian convolution is to retain the slowly varying characteristics of the signal and reduce the quick changes, which is somewhat like a least squares fit. The Gaussian function is defined as:

f(x) = exp(−x²/2σ²)    (4.39)

The Gaussian probability density function is defined as:

p(x) = [1/(σ√(2π))] exp(−(x − x̄)²/2σ²)    (4.40)

where σ² is the variance of the curve (relating to its width) and x̄ is the point about which the curve is centered.

Figure 4.20: Gaussian Filter Impulse Response

The Fourier transform of a Gaussian is another Gaussian, but with a different variance (or width): a broad Gaussian in the time domain produces a narrow Gaussian in the frequency domain, and vice versa. In the frequency domain, Gaussian filtering can be thought of as a frequency enhancement filter, where the frequency being enhanced is dependent upon the width of the Gaussian. In effect a Gaussian filter tends to "blur" or smooth the signal. If the Gaussian is large, it tends to enhance gross features of the signal, and if it is small, it enhances finer features. Therefore, smoothing of a signal amounts to isolating magnitude changes at a particular scale, and hence filtering signal noise at different degrees of coarseness.
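In the time domain the filter is simply a convolution with a sampled Gaussian. The C sketch below builds such a kernel for a chosen variance and applies it. The kernel length and variance here are placeholders (the sizes actually used are discussed next), and the weights are normalized to sum to one so that the local average is preserved — a convenience the sketch adds.

    #include <math.h>

    #define KLEN 9          /* kernel length (placeholder) */
    #define VAR  4.0        /* Gaussian variance (placeholder) */

    /* Smooth x[] of length len into y[] by convolution with a Gaussian,
       f(x) = exp(-x^2 / 2 var) as in equation (4.39). */
    void gauss_smooth(const double x[], double y[], int len)
    {
        double g[KLEN], sum = 0.0;
        int half = KLEN / 2, n, i;

        for (i = 0; i < KLEN; i++) {
            double d = i - half;
            g[i] = exp(-d * d / (2.0 * VAR));
            sum += g[i];
        }
        for (i = 0; i < KLEN; i++) g[i] /= sum;   /* unity-gain weights */

        for (n = 0; n < len; n++) {
            y[n] = 0.0;
            for (i = 0; i < KLEN; i++) {
                int m = n + i - half;
                if (m >= 0 && m < len) y[n] += g[i] * x[m];
            }
        }
    }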
If the noise in the signal (ie. excavator noise) had a Gaussian frequency response, then the Gaussian filter would work well as a noise reduction filter. However, the specific noise in this problem is broadband; therefore the best that can be done is to filter certain components of the noise.

Different sizes of Gaussians were chosen for filtering the noise. In particular, filter sizes of 4, 8, 16 and 32 points were chosen, with variances of 4, 16, 64 and 256 respectively, to ensure that the Gaussian response was spread over the complete range of the filter. It was assumed that the major frequency components of the noise were in the range of 0 to 500 Hz, hence the choice of Gaussians. (The sampling rate equals 20 kHz; therefore 40 samples in the frequency domain equal 500 Hz.) It should be noted that there is a significant speech component in this range.

4.4.3 Implementation

Since the size of the Gaussian was small, it was applied as a convolution in the time domain, as opposed to a multiplication in the frequency domain. The processing time was not greatly affected by this step, and any windowing effects associated with frequency domain analysis were avoided. Figures 4.21, 4.22 and 4.23 show plots of the signal before and after filtering with Gaussians of size 4, 8 and 16 respectively. Notice the smoothing effect each of the filters has on the data.

Figure 4.21: Signal filtered with a Gaussian of Size 4

Figure 4.22: Signal filtered with a Gaussian of Size 8

Figure 4.23: Signal filtered with a Gaussian of Size 16

4.5 Data Acquisition and Testing of Filters

It was necessary to acquire two channels of data simultaneously in order to test the two microphone input filters (the RLS and LMS methods, and the PSS and Wiener methods that estimate the power spectrum of the noise from a second microphone). The data was acquired on site. The complete data acquisition hardware is shown in Figure 4.24.

Figure 4.24: Data Acquisition Hardware for Two Microphones

Two microphones, placed inside the cab of the excavator, were connected to two channels of a low-pass filter at 9 kHz, then sampled by an A/D converter on the IRONICS work station. These two channels were sampled every two microseconds and stored in buffers. However, the data was only stored in the data file every fifty microseconds (ie. at a sampling rate of 20 kHz), so the time delay between the two samples is insignificant.

In order to prompt the speaker to say the next word, a light inside the cab of the excavator was turned on. After about 1.5 seconds this light would shut off, at which point the speaker should be done saying the word. A playback feature was included so that the accuracy of the speech could be checked by listening to the data through a headset connected to the D/A and a LPF.

The subject was allowed 1.5 seconds to say each word. With a sampling rate of 20 kHz, each data file contained 29952 samples times 2 channels, for a total of 59904 samples.
The A/D has a resolution of 12 bits; therefore each data file contains 119808 bytes. Each data file contained alternating samples of noisy voice and noise data. These data files were stored on tape cartridges for transfer to the HP9050, where the filtering was done.

The placement of the microphones is a key consideration. Microphone 2, containing a reference pattern of the noise, must be located in such a way that it obtains a version of the noise well correlated with the noise in microphone 1. It should be placed far enough away from mike 1 that no speech is picked up by this microphone. In particular, it is important to be aware of any barrier which would cause reflection and interference (constructive or destructive) of n₂ differently than n₁. It was decided that the best location for mike 2 was attached to the roof of the cab, directly above the first microphone, which is attached to the operator's headset.

The frequency response of each microphone must be the same. A Shure close talking noise reduction SM10A microphone was already available; therefore a second SM10A was purchased. The response of both these microphones is shown in Figure 4.25.

Figure 4.25: Typical Frequency Response of Both Microphones (at 8 mm [5/16 in.], 50 to 15,000 Hz)

4.5.1 Correlation Between Microphones

Analysis of the data in the lab determined that there was indeed some type of reverberant effect between the two microphones, since the noise in the two microphones has different patterns, as can be seen in Figure 4.26. Both signals were acquired in a small enclosed environment (ie. the cab of the excavator) which acts as a reverberator. Due to the placement of the microphones, the reverberation effect for each microphone will be different. The reverberant effect should be reduced so that the outputs of the two microphones are correlated.

Figure 4.26: Digitized Microphone Signals

A reverberation can be thought of as a distortion of a signal causing the resultant signal to be the sum of a number of overlapped, delayed and decayed replicas. If s(n) is the signal, then the reverberant signal x(n) would be:

x(n) = s(n) + Σ_k a_k s(n − n_k)    (4.41)

which can be represented as x(n) = s(n) ⊛ p(n), where p(n) can be thought of as an impulse train:

p(n) = δ(n) + Σ_{k=1}^{M} a_k δ(n − n_k)    (4.42)

But within any type of reverberant environment the reverberation time (the time at which another delayed version is added to the signal) may not be constant. As well, there is a loss factor or decay involved with the reverberant signal. Therefore, the train of impulses may not be equally spaced, and the a_k will likely be decreasing.

The most commonly used method to separate signals that are convolved is homomorphic deconvolution; for a more detailed explanation of this process see reference [20]. The basic premise of homomorphic processing is to take two convolved signals and reduce them to additive signals, so that basic linear filtering can be performed. The canonic representation of such a system is shown in Figure 4.27.

Figure 4.27: Canonic Form of Homomorphic System

To reduce convolution to addition, the first step is to convert from the time domain to the frequency domain, thus reducing convolution to multiplication.
Multiplication can then be reduced to addition by taking the complex log of the Fourier transform of the input signal x(n). The D® system can be represented as shown in Figure 4.28.  Figure 4.28: D® System The signal i, at the output is sometimes termed the complex cepstrum. If x{n) is a speech signal then the cepstrum x(n), will be a signal representing the separated glottal and pitch components of speech. The glottal components will occupy the low end of the time scale whereas the pitch components will occupy the high end. If the signal is a reverberated noise signal then the cepstrum should contain impulses at the time that the echoes or reverberations occur. To illustrate, consider the signal y(n) which  73  Chapter 4. Noise Reduction Filters  is a reverberated version of x(n), namely y(n) = x(n) ®p(n). p(n) is the impulse train p(n) = S(n) + E t l l  1  -  then y(n) can be written as: M-l  y(n) = x(n) + £ a x(n - n ) Jb=l fc  t  (4.43)  The complex cepstrum of y(n) is: y(n) = x(n) + p(n)  (4.44)  =  \og(X(z)P(z))  (4.45)  =  log(X(*)) + log(P(*))  Since: Y(z)  Then: y(n)  =  F - ^ l o g X ^ J + F-^logP^)]  y(n)  = x(n)+p(n)  (4.46)  In this form it is possible to use a linear frequency invariantfilterto separate the signals x(n) and p(n). Characteristic of the system D$ must be considered. Thefirstand most important is that the complex log must be unique so that if X(z) = S(z)P(z) then: log[X{z)} = log[S(z)P(z)]  (4.47)  = log[S(z)}+log[P(z))  The first problem arises here. Since X(z) can be represented as: X(z) =  X (e' )+jX (e^) u  R  I  (4.48)  Chapter 4. Noise Reduction Filters  74  or equivalently as: X(z) = \X(e>»)\e s M iaz  (4.49)  x  then: X(z)  =  log[\X{e! )\er"* W]  =  u  (. )  x  4  50  log\X{e^)\+jaigX[z)  therefore: X (z)  =  log\X{e>")\  Xi{z)  =  argX(z)  R  (4.51)  XR(Z) is continuous and unique as long as X(z) has no zeroes on the unit circle. Xj(z) however, is generally noncontinuous and not unique since the value of aigXj(z) is usually determined by computer routines which return values in the range —7r to 7r. In order to remedy this fact the phase component of the z transform must be unwrapped to produce a continuous unique value for the phase curve. This is achieved by adding an appropriate multiple of 2n to the original value of A R G X ( c ) . ,w  Other important characteristics of the complex cepstrum are that X(z) must be a valid z-transform, and z(n) must be real. Therefore Xjt{t? ) must be an odd function u  of u. As well, X(e ) must be a periodic function of u with period 2n. ,u>  The complex cepstrum should give us some indication of the reverberant effect. To estimate where the reverberation will occur in the signal note that sound travels at a speed of approximately 331 m/sec depending upon the humidity. If the shortest distance of the microphone from any reverberant surface is 1 meter, then the time it takes for the original signal to return to the microphone is 3 ms. Of course, the noise  Chapter 4. Noise Reduction  75  Filters  tr tr «  TT -  Tr w  Figure 4.29: Wrapped and Unwrapped Phase source does not originate at the microphone so we must take into consideration the interference of the signal before the signal arrives at the microphone. To determine the reverberation based on a model of the cab and placement of the microphones would be a very difficult problem, since there are so many reflecting surfaces. 
The method used to determine where the reverberations occurred, with microphone two attached to the roof of the cab and microphone one worn by the operator, is as follows. Correlation between the impulse of the cepstrum (the lowest eight samples) and the rest of the cepstrum was performed for each microphone input. In ideal echoing conditions, spikes would be seen at the locations of the reverberations. Our signal is slightly more complicated, but it appears that there are spikes in microphone two approximately every 20 samples (which equals 1 ms). A difference was also calculated between the cepstrums of the two microphones; large differences occurred between the cepstrums at approximately 20 sample intervals. This suggests that microphone two has reverberant components at intervals of 20 samples which microphone one does not have.

If it is possible to adjust the noise signal so that it matches the noise in the speech signal, then the reverberation components that occur in one signal and not the other will be removed, but the reverberation components that occur in both will be left unfiltered.

Figure 4.30: Cepstrum of Mike 2

Microphone 2's cepstrum was filtered with a comb filter containing stop bands every 20 samples, each with a width of 5 samples. The filter l(n) (shown in Figure 4.31) was designed to remove the reverberant effect while leaving the speech signal undistorted. Microphone 1's cepstrum was left unfiltered.

Figure 4.31: Linear filter to remove reverberation

To recover the signal y(n) from ŷ(n) the inverse system D⊛⁻¹ is used; the D⊛⁻¹ system is shown in Figure 4.32. This system reconstructs the speech signal to its original form, minus the effects of reverberation. EXP is the complex exponential; this function has no uniqueness problem like the complex log. Also, y(n) and ŷ(n) are stable sequences, since both x(n) and x̂(n) are assumed to be stable. Since the complex cepstrum separates the glottal and pitch period components of speech, the inverse system D⊛⁻¹ basically reconstructs the speech by convolving the two components.

Figure 4.32: Inverse System D⊛⁻¹

Chapter 5

Testing of Noise Filters

5.1 Evaluation

The noise reduction filters were evaluated by three methods. A listening test was used as an initial estimate of how well a filter performs. Listening tests can be used to determine if the reconstructed filtered speech retains a "natural" sound and if the word sounds less noisy.

The second method of evaluation was the signal to noise ratio of the filtered speech as compared to the signal to noise ratio of the noisy speech. It is desirable that the level of the noise decrease and therefore that the SNR increase. The signal to noise ratio is given by the formula:

SNR = σ_s² / σ_n²    (5.52)

The following derivation shows how the SNR was determined, given that only σ²_{s+n} and σ_n² can be calculated. The only signals available for evaluation were s+n and n; therefore the measurements of σ²_{s+n} and σ_n² were determined with:

σ²_{s+n} = (1/(n−1)) Σ [ (s+n) − (s̄+n̄) ]²
         = (1/(n−1)) [ Σ (s − s̄)² + Σ (n − n̄)² + 2 Σ (s − s̄)(n − n̄) ]
         = σ_s² + σ_n² + (2/(n−1)) Σ (s − s̄)(n − n̄)    (5.53), (5.54)

But s̄ ≈ n̄ ≈ 0, since the D.C. offset has been removed. Therefore:

σ²_{s+n} = σ_s² + σ_n² + (2/(n−1)) Σ s n    (5.55)

Also, it is assumed that s and n are uncorrelated, therefore Σ s n ≈ 0. So:

σ²_{s+n} = σ_s² + σ_n²    (5.56)

Thus it is possible to calculate the SNR as:

SNR = σ_s²/σ_n² = σ²_{s+n}/σ_n² − 1    (5.57)

In decibels this is 10 log₁₀( σ²_{s+n}/σ_n² − 1 ).
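In C this measurement is only a few lines. The sketch below assumes the boundary between the noise-only segment and the speech-plus-noise segment has already been found (by the end point detector, as described next); the array and index names, and the synthetic test signal, are illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Sample variance of x[a..b-1] with the mean (D.C. offset) removed. */
    double variance(const double x[], int a, int b)
    {
        double mean = 0.0, var = 0.0;
        int i;
        for (i = a; i < b; i++) mean += x[i];
        mean /= (b - a);
        for (i = a; i < b; i++) var += (x[i] - mean) * (x[i] - mean);
        return var / (b - a - 1);
    }

    /* SNR in dB, equation (5.57): 10 log10( var(s+n)/var(n) - 1 ). */
    double snr_db(const double x[], int word_start, int word_end)
    {
        double vn = variance(x, 0, word_start);          /* noise only   */
        double vsn = variance(x, word_start, word_end);  /* speech+noise */
        return 10.0 * log10(vsn / vn - 1.0);
    }

    int main(void)
    {
        double x[2000];
        int i;
        for (i = 0; i < 2000; i++) {
            double noise = 0.1 * sin(1.3 * i);
            x[i] = noise + (i >= 1000 ? sin(0.05 * i) : 0.0);
        }
        printf("SNR = %.1f dB\n", snr_db(x, 1000, 2000));
        return 0;
    }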
In order to find the part of the input data file that contains only noise and the part that contains speech plus noise, the end point detector was used. Once the boundaries were found, it was easy to calculate the variance within the noise-only segment of the file, and again within the signal-plus-noise segment. It was found that the average SNR is around 12 dB for signals recorded on site.

The SRU was tested with different levels of noise, using simulated noisy speech, to determine how much the recognition degrades with noise. The simulated version was obtained by adding the noise signal, at different gain levels, to the noise free speech files.

Noise Gain | % Recognition, Noise Type 1 | % Recognition, Noise Type 2 | Estimated SNR
No noise   |   —  | 90.6 | 31.7 dB
1          | 83.8 | 83.8 | 22.3 dB
2          | 78.8 | 81.3 | 16.6 dB
3          | 50.0 | 76.3 | 13.2 dB
4          | 47.5 | 53.8 | 10.7 dB

Table 5.6: Recognition When Noise Is Added At Different Gains

The true test of a noise reduction filter is the SRU's recognition rate. Even if the first two tests, listening and SNR, obtain poor results, it is possible that recognition by the SRU may increase. In the final analysis, therefore, what really counts is the increase in recognition.

There are a few reasons why the first two evaluation methods may obtain poor results while the SRU still achieves increased recognition rates. Certain noise reduction filters, such as PSS, can sometimes distort the speech components by making them sound flat. The resultant speech has an unnatural sound but still contains all the main speech components, minus the frequency components attributed to the noise; therefore the SRU is able to obtain accurate results. Similarly, the filtered speech can sometimes sound less noisy to human perception but obtain bad results on the SRU. Some methods, such as Gaussian smoothing, reduce the static sound of the noise but also dull or smooth the speech; the SRU then gets poor results, since it seems to depend upon the variable nature of speech to make a recognition. For other filters, such as the LMS filter, the SNR may remain about the same, but the filtered speech has enhanced fricative and plosive components.

5.2 Data Set

The data used to test the filters was collected on site in the cab of the caterpillar tractor. The operation of the tractor while recording the data was limited to two modes, since the data acquisition required the tractor to be tethered to an offboard computer. The conditions in which the data sets were recorded shall be labeled for further reference:

Mode 1: tractor in full power idle but not moving.
Mode 2: tractor moving the boom in some random fashion.

It is best for mode 2 to be random, not repetitious, in order to best test the transient nature of the filters. It was realized that this is perhaps not the most rigorous test of the filters; however, at this stage of the caterpillar tractor project it was the best that could be done, and it should give some indication of the effectiveness of the filters.

In order to compare recognition results of the SRU with noise filtered data to recognition results for noise free data, it was required that one of the original subjects be used. In our case the only available choice at the time was speaker EC.
The words used to test the filtering methods were the vowel sounds used in the phoneme based SRU tests (see Appendix A). The reasons for using these words were:

1. Noise free versions of the phonemes were readily available on tape in digitized form.

2. In order to have a basis for comparison of recognition results by the SRU between filtered speech and ideal conditions, the same set of words had to be used.

3. In order for the noise filters to be effective they must be able to remove noise from all possible phonemes.

4. Only vowels were used, and not the consonants, since consonants obtained poor recognition results even in ideal conditions. It would be difficult to determine any degradation due to noise, or improvement due to filtering, for consonant sounds. Also, a full test of filtering all phonemes would require too much time.

To obtain enough data to do some statistical analysis, it was decided that the subject would repeat each of the 16 words 5 times for each of the two modes of operation mentioned above, for a total of 160 data files. For the noise filters that need only one signal input (s+n), the data from Channel 1 can be used. For the noise filters that need an a priori ensemble average (Power Spectral Subtraction and Wiener filter with type 2 noise estimates), all the data collected over Channel 2 (the noise signal) can be used.

5.3 Improving Recognition By Manipulation of VTR6050

In addition to filtering the noise there are other methods for increasing the recognition rate of the SRU. Since the noise is changing, it is preferable to train the SRU with noise free speech and test it with filtered speech: we do not want any noise components in the reference template that will not be the same in all cases. The combination used for testing the filtered speech was therefore to train the SRU with one template of noise free speech and then test the recognition with the filtered speech.

It is possible to store more than one reference template in the SRU for each word. By training the SRU with two templates the recognition rate will increase. The first template contains a pattern of the noise free speech; the second template contains a reference pattern of the filtered speech. In this way, if the filtering process removes any part of the speech component (due to noise components in the same frequency range), then the SRU will have a pattern of these changes. As well, people tend to change the way in which they talk in a noisy environment: the pitch and duration are altered. It is advisable that a filtered version of the noisy speech be used as a template, so that any such changes in the speech will be represented.

5.4 Filter Test Results

The recognition rate of the SRU was only 42.5% for data sets recorded in Mode 1 type conditions. This rate of recognition was obtained by testing the SRU with the noisy speech (no filtering at all), trained with noise free speech. For the same test of the SRU when the data set was recorded in Mode 2 type noise, the recognition rate was only 37.5%. When a second template was trained, using the same type of noisy speech as the SRU was tested with, the recognition increased to 71.4% for Mode 1 noise and 70.0% for Mode 2 noise.

From previous tests it was determined that in an ideal (noise free) environment, using the same set of words, the same SRU and the same subject EC, an average recognition of 90.6%, misrecognition of 9.4% and a rejection rate of 0.00% was obtained.
These results were the best obtained for this particular user. In order to achieve these results, preprocessor routines of volume and time normalization had to be applied to the speech. (See Appendix A for more information.) The goal of the filtering methods is to reach this rate of recognition; anything higher would be desirable but possibly not obtainable.

A summary of the results of all the filtering methods is given in Tables 5.7 and 5.8; the complete results are shown in Appendix D. The first column lists the type of filtering applied to the noisy speech. The second column shows how the noise was estimated: for those filters requiring a power spectral estimate of the noise, the number refers to the noise estimate as defined in Section 4.2; for the LMS method the number refers to the value of μ, and for the RLS method to the value of λ. The third column, labeled "Trained with 1 template", gives the results obtained when the SRU was trained with only one noise free template. The fourth column, labeled "Trained with 2 templates", gives the results obtained when the SRU was trained with one template of noise free speech plus one template of the speech filtered with the same filtering method being tested.

The recognition results can be compared to the SNR for the best filtering methods (PSS, LMS and RLS). It can be seen that the amount of increase (or decrease) of the SNR does not necessarily correspond to the increase in recognition. For example, the recognition seems to be best when the noise is estimated a priori, followed by noise estimated during silence, then noise estimated from the second microphone; however, the maximum SNR increase occurs when the noise is estimated from the beginning of the signal. According to the SNR rating, the PSS method with noise estimated during silence obtains the greatest improvement, yet the best recognition is obtained for the PSS method when the noise is estimated a priori or from the second microphone.

Filter Type | Noise Estimate | % Recognized (1 template) | % Correct Runners Up (1 template) | % Recognized (2 templates) | % Correct Runners Up (2 templates)
None     | —       | 42.5 | 21.5 | 71.4 | 14.3
Wiener   | 1       | 20.0 | 10.0 | 45.0 | 23.8
Wiener   | 2       | 32.5 | 13.8 | 76.5 | 13.8
Wiener   | 3       | 12.5 | 8.75 | 14.1 | 10.9
P.S.S.   | 1       | 42.5 | 18.8 | 83.8 | 12.5
P.S.S.   | 2       | 53.8 | 15.0 | 87.5 | 8.75
P.S.S.   | 3       | 52.5 | 17.5 | 87.5 | 6.25
Gaussian | Var 4   | 36.7 | 23.4 | 73.0 | 12.5
Gaussian | Var 16  | 9.81 | 20.3 | 75.8 | 6.25
Gaussian | Var 64  | 3.75 | 17.2 | 70.3 | 12.5
Gaussian | Var 256 | 6.6  | 15.6 | 25.1 | 17.2
LPC      | —       | 48.6 | 19.3 | 60.1 | 24.6
LMS      | 0.005   | 51.3 | 18.8 | 80.0 | 13.8
RLS      | 0.995   | 50.0 | 16.3 | 80.0 | 15.0

Table 5.7: Results of Filtering With Mode 1 Noise
Filter Type | Noise Estimate | % Recognized (1 template) | % Correct Runners Up (1 template) | % Recognized (2 templates) | % Correct Runners Up (2 templates)
None     | —       | 37.5 | 15.0 | 70.0 | 18.8
Wiener   | 1       | 21.3 | 11.3 | 67.5 | 18.8
Wiener   | 2       | 51.3 | 8.75 | 80.0 | 15.0
Wiener   | 3       | 7.50 | 5.00 | 37.5 | 7.50
P.S.S.   | 1       | 55.0 | 7.50 | 77.5 | 17.5
P.S.S.   | 2       | 46.8 | 10.0 | 81.3 | 7.50
P.S.S.   | 3       | 41.3 | 16.3 | 73.8 | 15.0
Gaussian | Var 4   | 36.6 | 23.4 | 71.2 | 7.81
Gaussian | Var 16  | 9.81 | 20.3 | 63.3 | 10.9
Gaussian | Var 64  | 3.69 | 17.2 | 47.8 | 12.5
Gaussian | Var 256 | 9.25 | 15.6 | 64.6 | 17.2
LPC      | —       | 47.5 | 18.8 | 53.8 | 22.5
LMS      | 0.005   | 46.3 | 15.0 | 76.3 | 16.3
RLS      | 0.995   | 43.8 | 11.3 | 77.5 | 18.8

Table 5.8: Results of Filtering With Mode 2 Noise

Mode | SNR Before Filtering | PSS 1 | PSS 2 | PSS 3 | LMS  | RLS
1    | 9.815                | 11.55 | 9.175 | 10.03 | 10.54 | 9.975
2    | 13.50                | 18.28 | 13.56 | 13.60 | 13.82 | 13.58

Table 5.9: SNR Results for Filtering Methods

Chapter 6

Discussion

6.1 Vocabulary Selection

Table 6.10 shows the results for all 9 subjects when testing and retesting of the selected subset used the default distance score of 50.

SPEAKER | % Recognized (Before / After) | % Misrecognized (Before / After) | % Rejected (Before / After)
DG      | 97.3 / 98.6 | 0.000 / 0.40  | 2.71 / 1.00
HL      | 91.4 / 96.9 | 4.11 / 1.60   | 4.53 / 1.53
ML      | 91.4 / 95.7 | 4.36 / 0.730  | 4.29 / 3.60
RH      | 93.4 / 98.0 | 0.84 / 0.000  | 5.76 / 2.00
KC      | 93.5 / 97.6 | 4.65 / 1.40   | 1.82 / 1.00
WL      | 89.4 / 98.6 | 2.29 / 1.00   | 8.36 / 0.40
FN      | 89.0 / 93.5 | 3.24 / 0.400  | 7.71 / 6.07
PN      | 87.7 / 98.1 | 1.60 / 1.60   | 10.7 / 0.330
SD      | 90.0 / 93.5 | 5.87 / 0.000  | 4.13 / 6.47
AVERAGE | 91.5 / 96.7 | 3.00 / 0.790  | 5.56 / 2.49

Table 6.10: Test 1: Results of Selection When No Distance Score Is Used

The before column refers to the recognition rate for all of the words (all application words and their synonyms). The after column refers to the recognition rate after selection of the optimum subset has occurred. Test 1 was performed with a default distance score of 50 in both cases (before and after selection). Test 2 was performed using a default distance score of 50 for the initial tests, but this was replaced by the minimum distance score of the optimum subset for the retesting of the optimum subset (the after results).
The reason the tests were performed live was because the SRU often requires a second pronunciation of the word being tested. If the recognition was  Chapter 6. Discussion  89  "bad" (ie. the word has to be said louder or quieter) then a clearer pronunciation of the word was required. Since the tests were performed in the robotics lab there was often a small amount of background noises (printers, people coming and going, etc) that would cause the SRU to request a repronunciation of the word. As you can see from the results, the average recognition for all subjects increased on the second test. I believe that this is somewhat due to the subjects improving with practice. However, not all subjects were able to repeat the procedure the second time. The new subjects still obtained very low misrecognition rates. When investigating possible criterion for selecting a vocabulary of application words, it was initially thought that a phonetic based set of rules would be possible. As it turns out, it was difficult to come up with a fixed set of rules based upon misrecognition between phonetic components. The phonetic based tests on the SRU produced little, if any, consistencies across misrecognized pairs, for specific users or specific SRUs. There were a few problems recognizing certain phonetic sounds, such as IY and IE, but in general there were not enough consistencies between misrecognized pairs to obtain a complete set of rules. As well, when there were any changes to the pronunciation of the phonetic set of words, such as volume, intonation etc., then the misrecognition between phonemes was effected. The method proposed by this research, has a number of advantages. The first and most important is that the method is equally effective for any speech recognition unit and is not dependent upon the speaker. The reason being that a test is first performed to determine what words are recognized for the specific user and by the specific SRU. Based upon the results an accurate vocabulary can be selected. Since the method is word based there is no need to worry about the effects of phonemes placed together to form words.  Things such as coarticulation between  phonemes, or having the same phoneme in approximately the same location in two  Chapter 6. Discussion  90  different words, need not be considered. This method requires no knowledge by the user. Since the method allows the user to select any convenient set of words (along with their synonyms) for his/her application, and then automatically selects the best subset, any end user (with no knowledge of phonetics or speech recognition) can use this method. The method requires that each user train and test the pre-selected set of application words. Therefore, this approach makes it possible for users with accents or speech impediments to get equally good recognition rates. A disadvantage of the method would be that for certain applications it may be difficult to determine more than one word for a certain command. An inconvenience of the method is that each operator is required to test the set of application words. Therefore there is a preprocessing time involved (which could be up to an hour, depending upon the size of the vocabulary) before a usable set of words is produced.  6.2  E n d Point Detector  The results show that the EPD is about 92.5% accurate (it failed 12 times out of 160 tests). The result is based upon the number of times it could not find the beginning or end of the word. 
However, if a detection occured at all then the correct boundaries were found. The boundary detection was verified by writing the signal between the end points to a file, setting surrounding points to silence (zero), and listening to the resulting files. On occasion, part of a trailing consonant would be cut short (if it continued on for a long time after the threshold for the AVS was found). The values of the thresholds for the ZC rate and the AVS are dependent upon the SNR of the speech files. If the SNR is very small the end point detector is not as accurate, which is only reasonable since more noise makes it harder to differentiate the  Chapter 6. Discussion  91  word from the noise. In general the EPD is dependable. If on occasion it misses the boundaries of the word the consequence is that the speaker will have to repeat the word.  6.3  Filters  6.3.1  Analysis of the Estimate of the Power Spectrum  For the filters that required a knowledge of the power spectrum of the noise, namely the Wiener filter and the Power Spectral Subtraction filter, three methods were used to determine the power spectrum. The first method utilized the EPD to determine the noise power spectrum in the part of the signal not containing the speech. This method is effective if the noise stays the same during the time the noise is estimated and the word is spoken. The advantage of this method is that the noise and speech are recorded through the same microphone therefore, there is no problem with magnitude matching between microphones. The disadvantage of the method is any problem that might occur if the noise changes during the time that the word is being spoken. In this case the estimate of the noise becomes inaccurate and the filtering is not effective. Also, the method is very dependent upon the accuracy of the EPD. The second method, using a prior estimate of the noise, is good at getting rid of any stationary noise such as the idling sound. However, the method misses any clanging transient noise. Since this method uses afixedfilter(either Wiener or PSS) the weights of the filter must be determined before hand. In our case two noise environments were considered, as mentioned in Section 5.2. Therefore, a disadvantage of this method is that the SRU has to determine what noise is in the background and then select the filter based upon the background noise.  Chapter 6. Discussion  92  Another point to consider when estimating the PS of the noise a priori is the magnitude of the estimate. The ensemble average tends to give a fairly accurate estimate of the magnitude. However, problems can occur when the average magnitude of the PS is too small for the current speech window, so that the filtering method doesn't remove enough of the noise components. Alternatively, the magnitude could be too large and therefore distort the speech slightly by removing too much of the frequency components of the speech. The third method, of estimating the PS continuously from a second microphone input, has the problem of matching the magnitude between microphones. The method suggested for estimating the magnitude difference by calculating a scaling value in the first window of the signal is not a very sensitive method. As the results indicate, the filtering is very sensitive to the magnitude of the estimate. The magnitude difference between signals appears to change, therefore the second microphone estimate does not always obtain good results due to sensitivity of magnitude adjustments.  
6.3.2  Wiener Filter  Since Wiener filtering is a magnitude only filter (zero phase), it is sensitive to the magnitude gain between the noise estimate power spectrum and the noisy speech power spectrum. If the magnitude of the estimate of the noise spectrum is not exactly matched to the magnitude of the noise in the noisy speech thenfilteringcould remove significant components of the speech. One of the weaknesses of the Wiener filter is that it requires some knowledge of the signal being filtered. In this case the signal is speech, which is a complicated non-stationary broadband signal. The assumption made about the power spectrum of speech (ie. P,(w) = 1 for all u ) over simplifies the signal and possibly leads to less accurate results for this method.  Chapter 6. Discussion  93  Wiener filtering assumes that speech is a stationary signal and that the mean squared error is a good error criterion to use. However, it has been shown that the mean square error, which is the error criterion on which Wiener Filtering is based, is not strongly correlated with perception and therefore not a particularly effective error criterion to apply to speech processing.  6.3.3  Power Spectral Subtraction  The PSS method is also a magnitude only filter and hence is sensitive to the magnitude gains between the noise estimate and the actual signal. The PSS has an advantage over the Wiener filter in that it does not require an initial estimate of the signal being filtered. Instead, the PSS method, assumes that the filtered output speech is equal to the power spectrum of the speech in the input signal. This can be shown by starting with the Wiener filter, namely: P {u) t  =  (6.58)  H{u)Y{u)  P.(w) P.(w) +P„(w) (6.59) By setting P,(w) equal to P,(w) then:  A P,(w) + P „ ( w )  P.(«) + P»(W) = P„H P.(w)  =  P„(w)-P„(u;) (6.61)  This is still an estimate of the speech components, but it is slightly more accurate then the estimate made for the Wiener filter.  Chapter 6. Discussion  94  The major problem with the last two methods, frequency weighting (Wiener) and frequency subtraction (PSS), for enhancement for speech recognition is that even though they may increase the SNR they tend to smear or blur some of the frequencies components. One of the problems with F F T methods is that they average frequencies over the intervals (in our case each interval was 39.06 Hz) and therefore may not have high enough resolution.  6.3.4  Gaussian Smoothing  As the name suggests, this filter smooths speech to such an extent that the filtered speech becomes dull or is "blurred". This effect is not desirable for speech recognition since the SRU seems to depend somewhat upon the changing nature of speech to make a recognition. This can be verified by examining the effect of recognition as the Gaussian size becomes larger. As the Gaussian size increases the filtered speech becomes more smoothed or dulled. Since speech components are being averaged together the recognition rate decreases dramatically. We can conclude that the ideal noise reduction filter for speech recognition should remove all of the noise but not effect the speech components. It is difficult to both filter noise and enhance the speech components using a Wiener, PSS or Gaussian, since the speech components overlap the noise components.  6.3.5  Linear Predictor Corrector  The best recognition obtained with the LPC filtering method was only 60.1%. Tests show that the LPC method is very good at tracking the signal, however it is not as good at removing the noise. 
The model of the speech used for the LPC method perhaps is not complex enough to determine in all cases, the difference between the noise and the speech. The more specifically we attempt to model the speech the more potential there  Chapter 6. Discussion  95  is for removing it from the background noise at the same time however, the system becomes more sensitive to the inaccuracies or deviations from the model. In order to improve this filtering method more work could be done to obtain a better, more complicted model of speech.  6.3.6  Adaptive Filtering  Both the LMS and the RLS noise reduction filters are very dependent upon obtaining an accurate reference of the noise. The noise in the second microphone may be a scaled or time delayed version of the noise in the first microphone and accurate results will still be obtained. However, in our case, we had to contend with reverberation between signals due to the closed environment that both signals were collected in. The suggested method for reducing these effects, namely homomorphic processing, works fairly well. A disadvantage of using homomorphic processing to remove reverberation between microphones is that it is very dependent upon the positioning of the microphones. The position of the reverberation in the cepstrum will depend upon how far from a reverberating surface the microphone is. Therefore, upon initial installation the system will have to be calibrated and the second microphone should befixedin a given position. In our case, there was a significant component of reverberation at intervals of 20 samples in microphone 2 that did not appear in microphone 1. Twenty sample points (with a sampling rate of 20 kHz) corresponds to 33 centimeters if we assume that sound travels at 330 m/sec. This indicates that the roof of the cab is causing the main reverberation in the second microphone. The second disadvantage of this method is that the operator is wearing the first microphone and since his head is not stationary, it is not possible to remove all differences between microphones. The initial results however, have shown that a lot of the differences between the two microphones can be removed.  Chapter 6. Discussion  96  Our adaptivefilteringmethods assume that there are no speech components in the second microphone. If there are any speech components in the second microphone then some of the speech will be removed in the filtered output. Previous tests of noise reductionfiltershave mainly been in the cockpits offighterplanes. The primary microphone can be placed inside the pilot's helmet thereby isolating any speech from the second microphone, which is attached to the outside of the helmet. This solution is not feasible for our purposes since the operator does not wear any kind of inclosed helmet. In our tests some speech was detected in the second microphone, but not enough to drastically effect the filtered speech.  6.3.7 L M S The Least Mean Square Adaptive noise reduction filter has the advantage that it is very fast. For each sample point the LMS method does 2N multiplication and N - l additions and subtractions, where N is the number of weights in the filter. In our case 16 filter weights were used. A disadvantage of the LMS filter is that it takes longer to converge on the correct filter weights. Because of this slow convergence, the LMS is not good at removing non-stationary noise, such as the clanging noises that might be encountered. 
The LMS method bases it weights upon the statistics for a group of noise signals not the exact data obtained at time n, therefore the transient noise sounds will not be filtered out. In our limited tests however, the LMS still produced quite good results (80.0 % for noise Mode 1 and 76.25% for noise Mode 2). Based on experimentation it was found that the best value for /x, the convergence factor, was 0.005. At any higher value the filter tends to become unstable. At any lower value the convergence of thefilterweights takes too long and therefore, not much of the noise is reduced.  Chapter 6.  Discussion  97  6.3.8 RLS The Recursive Least Squares Adaptive noise reduction filter has the advantage that the filter weights can converge very quickly and therefore remove any clanging, changing non-stationary sounds. As the results indicate RLSfilteringobtained slightly worse recognition for Mode 1 noise (full power idle) but slightly better results for Mode 2 noise as compared to the LMS method. This can be attributed to the ability of the RLS filter to quickly adapt to the changing noise signal. Since RLS filtering bases its statistics upon the exact data obtained, the method is more dependent on an exact second microphone input. The pattern of noise received from the second microphone (strictly noise) is not exactly the same as the noise in the first microphone. When initial tests were performed on the RLS and LMS method using simulated noisy speech (ie. speech added to noise recorded separately), the RLS filtering method obtained very clean noise free speech, whereas the LMS filter had a residue of noise left in the output filtered speech. Therefore, we can conclude that the accuracy of the RLS method is directly effect by the accuracy of the reference template of noise. The number of operations for each sample point that is filtered is very high. The RLS method requires 4iV + 4N multiplication or divisions and SN + N — 1 additions 2  2  or subtractions. As with LMSfilteringthe implemented RLS filter has 16filterweights.  6.4  Comparison of Filters  Chapter 6. Discussion  Characteristics Wiener Requires An Yes A Priori PSE of Speech ?  No  PSS  Gaussian No  No  IPC  Requires An A Priori PSE of Noise? No. of Mikes No. of  Yes  Yes  No  No  1/2  1/2  1  1  2 FFTs  2 FFTs  Depends on  Depends upon  Operations  2 x for filter  2 x for filter  Size of Window  2 x for window 1 +  2x for window 1+  -either 2,4,8  Yes  Yes  Yes  order of extrapolation and whether sample needs to be corrected Yes  Frequency  Frequency  Time  Time  Yes  No  Yes  Yes  Assumes Speech is  Assumes that speech is highly correlated between samples  Works Better For Stationary Noise? Implemented in Frequency or Time Domain Requires An A Priori Guess of the Characteristics of Speech?  Estimate of the PS of speech  or 16  slowly varying system with harmonics  Chapter 6. Discussion  Wiener Characteristics YesRequires An A Priori PS of noise Guess of the Noise?  PSS Yes-  Gaussian Yes-  LPC Yes-  PS of noise  Noise is transient more quickly changing then speech  Noise is any derivation from the model of speech  Problems With Reverberation  No/Yes  No  No  No/Yes  100  Chapter 6. Discussion  Requires A Priori PSE of Noise? No. of Mikes No. of Operations  No  No  No  2  2  4N  3N  A  Works Better For Stationary Noise? Operations in Frequency  LMS  RLS  Characteristics Requires A No Priori PSE of Speech?  a  + 4JV  +N -  2N xor-r N-l  +or-  Time  Yes - over short time. No -over long time. 
Time  No  No  Requires An A Priori Guess of the Noise?  No  Problems With Reverberation  Yes  Yes -assumes noise is stationary over short time intervals Yes  or Time Domain? Requires An A Priori Guess of the Characteristics of Speech?  No  1  xor -5+or-  Table 6.12: Comparison Between Filters  Chapter 7 Conclusions and Recommendations  To analyze the possibility of using a Speech Recognition Unit in a noisy environment this thesis studied methods of noise reduction and robust vocabulary selection. The use of SRUs until this time has been in close to ideal conditions. The addition of loud clanging background noise significantly degrades the operation of the SRU. There were two problems to overcome to make automated speech recognition feasible. The first problem is the background noise; the second is the inherent inadequacies of the particular SRU used. Findings show that it is possible to increase recognition rates by using the vocabulary selection process outlined in this thesis. An acceptable level of recognition is achieved if the operator can use speech as the controller without experiencing distractions due to misrecognition of words. This is accomplished if recognition is in the range of 97-100% and if misrecognition is 0%. The final results for the vocabulary selection process achieved 97.5% recognition and 0.25% misrecognition. This process is an unique approach to increasing recognition. No previous work has been done to determine ways to increase the recognition of a given SRU by selecting appropriate application words. The advantage of the vocabulary selection process is that it works equally well for any SRU or speaker since it takes a heuristic approach. The results indicate that the three most promising noise reduction filters are PSS, RLS and LMS. Even though the PSS method produces the best results for our tests it is believed that the RLS filter will achieve the best results as the noise degrades the 101  Chapter 7. Conclusions and Recommendations  102  speech further due to increased complexities of the noise environment. The PSS filter was applied to noisy speech recorded in a real environment with a non-stationary broadband noise source. The results indicate that byfilteringthe noisy speech with the PSS method it is possible to increase the recognition almost to the level of accuracy obtained by the SRU in ideal conditions (87.5% as compared to 90.25%). Previously the adaptivefilteringmethods (RLS and LMS) have been applied only to simulated noisy data. However, our tests of the adaptive filters used data recorded in a real environment, therefore the second microphone contains a pattern of a reverberant diffuse noise field. It was necessary to apply homomorphic processing to equalize the two channels of noise data. The homomorphic method was able to remove some of the reverberations in the reference input not contained in the primary input. The results show an increase of 8.6% for both the RLS and LMS filter over the noisy speech if the reverberations in the reference input are first removed. Homomorphic processing is some what successful in our case since the second microphone is in a fixed position and there is a known enclosure. However, it is clear that the differences between the microphones are not due to a single reverberation. Further research should be undertaken to determine a more flexible noise channel equalizer. Possible areas to explore include magnitude and phase matched microphone pairs or multiple noise sensors. 
A feature included in some of the filters - the end point detector - proved to be a reliable method for determining where the speech starts and stops. The improvement to this EPD over previously designed EPDs is the method of determining the AVS and ZC threshold. Both are based upon a Gaussian distribution score, therefore the thresholds depend upon windows of data and not just single sample points. It is less likely that a short spike of background noise will trigger this EPD. Further work needs to be done in the area of hardware implementation of the noise  Chapter 7. Conclusions and Recommendations  103  reduction filter. For eventual implementation of speech recognition in the excavator, noise reduction must be done in real time. Therefore, hardware filters are necessary. This research has shown that it is indeed possible to use speech as an operator interface in a noisy environment if the two proposed preprocessors are employed. Filtering with a PSS filter overcomes the problem of degradation of the speech due to environmental noises. The vocabulary selection process transcends any inadequacies of the SRU. Therefore, these two processors together, will make speech recognition feasible for use in an excavator environment.  References  [1] S.Thomas Alexander, Adaptive Signal Processing, Springer-Verlag Inc., New York, 1986. [2] S.T. Alexander, "Adaptive Reduction of Interfering Speaker Noise Using the Least Mean Squares Algorithm", ICASSP '85, vol. 2, pp. 728-731. [3] T. Ariki, K. Kajimoto, T . Sakai, "Acoustic Noise Reduction by Two Dimensional Spectral Smoothing and Spectral Amplitude Transformation", ICASSP '86, vol. 1, pp. 97-100. [4] Janet M . Baker, David F. Pinto, "Optimum and Suboptimum Strategies For Automatic Speech Recognition In Noise, and The Effect of Adaptation On Performance", IEEE 1986 ASSP, Vol. 1, pp 745-748, 1986. [5] R. Billi, G. Mossia, F.Nesti, "Word Preselection for Large Vocabulary Speech Recognition", ICASSP 1986, vol 1, pp 65-69. [6] Steven F. Boll, "Suppresion of Acousitc Noise in Speech Using Spectral Subtraction", IEEE Trans. Acoust., Speech and Signal Processing, Vol ASSP-27, No. 2, pp.113-120, April 1979. [7] P. Darlington, Wheeler, Powell, "Adaptive Noise Reduction in Aircraft Communication Systems", ICASSP '85. [8] G.A. Powell, P. Darlington, P.D. Wheeler, "Practical Adaptive Noise Reduction In  104  References  105  The Aircraft Cockpit Environment", Proceedings ICASSP '87IEEE 1987 International Conference on Accoustics, Speech and Signal Processing, Vol 1, pp 173-176, April 1987. [9] Johannes P.C. DeWeerd, "Facts and Fancies About A Posteriori "Wiener" Filtering", IEEE Transactions on Biomedical Engineering, Vol. BME-28 No. 3, March 1981 pp 252-257. [10] Y . Ephrain, D. Malsh, "Combined Enhancement and Adaptive Transform Coding of Noisy Speech", IEE Proceedings F - Communications Radar and Signal Processing, Vol. 133, pp 81-86, 1986. [11] C.W.K. Gritton, D.W. Lim, "Echo Cancellation Algorithms", IEEE ASS Magazine, April 1984, pp 30-37. [12] Conrad J . Hermond Jr., Engineering Acoustics and Noise Control, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1983. [13] William A. Harrison, Jae. S. Lim, Elliot Singer, "Adaptive Noise Cancellation in a Fighter Cockpit Environment", Proc. 1984 IEEE International Conference on Acoustics, Speech and Signal Processing, March 1984, p.l8A.4.1 - 18A.4.4. [14] J.W. Kim, C.K. 
Un, "Enhancement of Noisy Speech by Backward/Forward Adaptive Digital Filtering", Proceedings of 1984 IEEE International Conference ASSP, April 1984, pp. 89-92. [15] Jae. S. Lim, Alan V . Oppenhiem, "Enhancement and Bandwidth Compression of Noisy Speech", Proceedings of the IEEE, Vol. 67, No. 12, December 1979, pp 1586-1604. [16] Jae. S. Lim "Speech Enhancement", ICASSP '86 vol. 4, pp 3135-3142.  References  106  [17] Stephen E . Levinson, "Structural Methods in Automatic Speech Recognition", Proceedings of the IEEE, November 1985. [18] B. Patrick Landell, Robert E . Wohlford, Lawerence G. Bakler, "Improved Speech Recognition in Noise", IEEE 1986 ASSP, Vol. 1, pp 749-751. [19] Steven L. Martin, "Wave of Advances Carry DSPs to New Horizons", Computer Design, September 15,1987, pp 69-83. [20] Alan V . Oppenheim, Ronald W. Schafer, Digital Signal Processing, Prentice-Hall Inc., Englewood Cliffs, New Jersey, 1975. [21] A.V. Oppenheim, J.S. Lim, "The Importance of Phase in Signals", Proc. IEEE, vol. 69, pp. 529-544, May 1981. [22] K.K. Palival, "Speech Enhancement Using Multi-Pulse Exicited Linear Prediction System", ICASSP '86, Tokyo. [23] K.K. Paliwal, Anjan Basu, "A Speech Enhancement Method Based On Kalman Filtering", Proceedings ICASSP '87 IEEE 1987 International Conference on Accoustics, Speech and Signal Processing, Vol 1, pp 177-180, April 1987. [24] J.M. Pardo, "On the Determination of Speech Boundaries: A Tool for Providing Anchor Time Points In Speech Recognition", ICASSP '86, pp. 2267, vol. 3. [25] D.B. Pisoni, R.H. Bernacki, "Some Acoustic-Phonetic Correlates of Speech Produced in Noise", ICASSP '85, Vol. 4, pp. 1581-1584. [26] L.R. Rabiner, and R.W. Schafer, Digital Processing of Speech Signals, PrenticeHall Inc., Englewood Cliffs, New Jersey, 1978.  References  107  [27] L.R. Rabiner, M.R. Sambur, "An Algorithm for Determining the Endpoints of Isolated Utteranaces", Bell Syst. J., Vol. 54., No. 1, pp. 297-315, February 1975. [28] Jeffrey J . Rodriquez, Jae S. Lim, "Adaptive Noise Reduction In Aircraft Communication Systems", Proceedings ICASSP '87 IEEE 1987 International Conference on Accoustics, Speech and Signal Processing, Vol 1, pp 169-172, April 1987. [29] Ronald W. Schafer, "Echo Removal By Discrete Generalized Linear Filtering", Technical Report 466, Research Laboratory of Electronics, MIT, Cambridge, Massachusetts, February 28, 1969. [30] Sondhi and Berkley, "Silencing Echoes on Telephones", Proceedings of the IEEE, Vol. 68, No. 8, August 1980, pp 948-963. [31] Steven Tretter, "Estimating the Frequency of a Noisy Sinusoid by Linear Regression", Trans, on Info. Theory, Vol. 31, pp. 832-835, 1985. [32] Viswanathan, Henry and Deri, "Noise Immune Speech Transduction Using Multiple Sensors", ICASSP '85, pp. 712-715. [33] Hans Wasmeier, Preprocessing Algorithms, M.A.Sc. Thesis, Department of Electrical Engineering, University of British Columbia, Vancouver, B.C., 1986. [34] R.G. White, J.G. Walker, Noise and Vibration, John Wiley and Sons Inc., Rexdale, Ontario, 1982. [35] Bernard Widrow, Samuel D. Stearns, Adaptive Signal Processing, Prentice-Hall Inc., Englewood Cliffs, New Jersey, Alan V . Oppenheim, Series Editor, 1983. [36] Bernard Widrow, et al., "Adaptive Noise Cancelling: Principles and Applications", Proceedings of the IEEE, Vol. 63, No. 12, December 1975, pp 1692-1716.  References  108  [37] I.H. Witten, Principles of Computer Speech, Academic Press, Toronto, 1982. 
[38] "Signal Processing for a Cocktail Party Effect", The Journal of the Acoustical Society of America, vol. 50, pp. 656-660 August 1971. [39] ASI Robotics Symposium, Canada Trade and Convention Centre, Vancouver, BC, Canada, February 25,26, 1988.  Appendix A  Previous Work in Speech Recognition at U B C  Previous work in the area of speech recognition at UBC was completed by Hans Wasmeier. His work involved setting up test and preprocessing algorithms to evaluate and improve speech recognition units. The words used to evaluate the filtering methods were taken from this work. The vocabulary was created by embedding each phoneme in the English language in a carrier word consisting of two static phonemes. For the vowels the test words were of the form consonant-vowel-consonant. The consonant sounds were of two forms: consonant-vowel-consonant and vowel-consonant-vowel. The following tables list the test words selected.  109  Appendix A. Previous Work in Speech Recognition at UBC  Phoneme Type Vowels  Sub-Group Reference Number Front  Mid  Back Diphthongs  0 1 2 3 4 5 6 7 8 9  10 11 12 13 14 15  Symbol for Example Phoneme IY I E AE A ER UH OW OO u Al OI AU EI ou JU  beat bit bet bat hot bird but bought boot foot buy boy how bay boat you  Table A.13: Vowels and Diphthong Phonemes  110  Test Word beam bim bem bam bomb berm bum balm boom buum bime boym baum bame boam bume  Appendix A. Previous Work in Speech Recognition at UBC  Phoneme Type  Sub-Group Reference Number  SemiVowels  Liquids  Glides Consonants Nasals  Stops (Voiced)  Stops (Unvoiced)  Whisper Affricative Fricatives (Voiced)  Fricatives (Unvoiced)  111  Symbol for Example Test Word Test Word (Group 1) (Group 2) Phoneme  0  W  wit  wem  awa  1 2 3 4 5 6 7  L R Y M N NG B  let rent you met net sing bet  lem rem yem mem nem ngem bem  ala ara aya ama ana anga aba  8 9 10  D G P  debt get pet  dem gem pern  ada aga apa  11 12 13  T K H  ten kit hat  tern kem hem  ata aka aha  14 15 16  DZH TSH V  judge church vat  gem chem vem  adja acha ava  17 18 19 20  TH Z ZH F  that zoo azure fat  them zem jhem fern  a-the aza ajha afa  21 22 23  THE S SH  thing sat shut  them sem shem  atha asa asha  Table A.14: Semi-Vowel and Consonant Phonemes  Appendix  A. Previous  Work in Speech Recognition  at UBC  112  As well as testing the different phonemes, there are a number of ways that each word can be said. The following variations in pronunciation were tested: Test 1. Vowels said normally Test 2. Vowels with mike moved Test 3. Vowels said slowly Test 4. Vowels said quickly Test 5. Vowels said with interrogative intonation Test 6. Consonants (Group l) Test 7. Consonants (Group 2) In order to improve the recognition rate of the SRU, a number of preprocessing algorithms were tested. The preprocessing algorithms were: Mode 0: Normal speech - unprocessed Mode 1: Time Normalized Mode 2: The Use of Two Reference Templates Mode 3: Nonlinear Volume Normalization Complete recognition results are listed in the following tables for all changes in variation and preprocessing algorithms and for both of the SRUs ( NEC SR100 and V O T A N VTR6050). The confusion matrices are also listed in Tables A.24 to A.30 and A.38 to A.44. Note that these confusion matrices aren't given in percentages but as the number of times that the word was confused.  Appendix A. 
Previous Work in Speech Recognition at UBC  A.l  113  Recognition Results for the VOTAN VTR6050  SUBJECT  MODE  DR  0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  EC  HW  JR  RF  AVERAGE  %  CORRECT 76.6 90.6 90.6 93.8 73.4 85.9 87.5 90.6 71.9 82.8 84.4 85.9 85.9 82.8 92.2 89.1 90.6 93.8 93.8 95.3 79.7 87.2 89.7 90.9  % MISRECOGNIZED 12.5 9.38 7.81 6.25 20.3 14.1 10.9 9.38 15.6 12.5 10.9 12.5 14.1 17.2 7.81 10.9 7.81 6.25 6.25 4.69 14.1 11.9 8.75 8.75  %  REJECTED 10.9 0.00 1.56 0.00 6.25 0.00 1.56 0.00 12.5 4.69 4.69 1.56 0.00 0.00 0.00 0.00 1.56 0.00 0.00 0.00 6.25 0.938 1.56 0.312  Table A.15: Statistics for Test 1 - Vowels Said Normally  % CORRECT RUNNERS UP 7.81 9.38 4.69 6.25 15.6 9.38 10.9 7.81 10.9 6.25 6.25 7.81 9.38 12.5 6.25 7.81 6.25 6.25 4.69 1.56 10.0 8.75 6.56 6.25  Appendix A. Previous Work in Speech Recognition at UBC  MODE % CORRECT  SUBJECT DR  EC  HW  JR  RF  AVERAGE  0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  81.3 89.6 91.7 93.8 59.4 64.1 70.3 70.3 65.6 79.7 76.6 70.3 79.7 84.4 81.3 82.8 92.2 87.5 93.8 87.5 75.6 80.6 82.2 80.9  114  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 6.25 9.38 9.37 0.000 8.33 10.42 8.33 0.000 8.33 0.000 6.25 6.25 6.25 20.3 34.4 18.8 34.4 1.5 0.000 15.6 29.7 1.56 14.1 28.1 17.2 14.1 20.3 17.2 20.3 0.000 20.3 0.000 23.4 21.9 1.56 28.1 7.81 17.2 3.13 9.38 15.6 0.000 1.56 10.9 17.2 10.9 0.000 17.2 1.56 3.12 4.69 0.000 12.5 12.5 4.67 6.25 0.000 0.000 6.25 12.5 10.6 17.2 7.2 0.329 13.5 19.1 0.329 12.2 17.4 11.8 18.4 0.658  Table A.16: Statistics for Test 2 - Vowels With Mike Moved  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  84.4 89.1 90.6 90.6 65.6 78.1 78.1 78.1 65.6 70.3 73.4 67.2 79.7 84.4 84.4 85.9 87.5 95.3 92.2 89.1 76.6 83.4 83.8 82.2  115  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 6 . 2 0 7.81 9.40 1.56 6.25 9.38 7.81 1.56 7.81 1.56 7.81 7.81 29.7 4.67 20.3 0 . 0 0 0 21.9 15.6 21.9 0.000 15.6 0.000 21.9 12.5 10.9 15.6 23.4 25.0 4.69 18.8 21.9 4.69 12.5 26.6 6.25 15.6 3.13 17.2 14.1 0.000 15.6 14.1 15.6 0.000 15.6 0.000 14.1 12.5 9.40 3.13 7.81 0.000 4.69 4.69 7.81 0.000 3.13 10.9 0.000 7.81 17.8 5.63 13.1 15.3 1.25 11.9 1.25 10.9 15.0 16.3 1.56 11.3  Table A.17: Statistics for Test 3 - Vowels Said Slowly  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT  DR  EC  HW  JR  RF  AVERAGE  MODE  0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  %  CORRECT  64.1 92.2 87.5 95.3 75.0 75.0 81.3 78.1 65.6 71.9 81.3 82.8 70.3 90.6 90.6 87.5 79.7 89.1 92.2 92.2 70.9 83.8 86.6 87.2  % MISRECOGNIZED  17.2 7.81 12.5 4.67 18.8 25.0 18.8 21.9 28.1 26.6 17.2 14.1 23.4 9.38 9.38 12.5 17.2 10.9 7.81 7.81 20.9 15.9 13.1 12.2  %  REJECTED  116  % CORRECT RUNNERS UP  18.8 0.000 0.000 0.000 6.25 0.000 0.000 0.000 6.25 1.56 1.64 3.13 6.25 0.000 0.000 0.000 3.13 0.000 0.000 0.000 8.13 0.312 0.312 0.625  Table A.18: Statistics for Test 4 - Vowels Said Quickly  10.9 7.81 12.5 3.13 14.1 21.9 12.5 18.8 14.1 21.9 10.9 9.38 14.1 6.25 3.13 9.38 12.5 9.38 1.56 1.56 13.1 13.4 8.13 8.44  Appendix A. 
Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  70.3 79.7 81.3 84.4 62.5 81.3 75.0 68.8 57.8 60.9 65.6 67.2 76.6 85.9 79.7 79.7 81.3 87.5 85.9 81.3 69.7 79.1 77.5 76.3  117  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 21.9 7.81 20.3 20.3 0.000 17.2 18.8 18.8 0.000 0.000 15.6 15.6 34.4 3.13 20.3 0.000 15.6 18.8 17.2 25.0 0.000 0.000 23.5 31.3 9.42 15.6 32.8 6.25 17.2 32.8 6.25 17.2 28.1 20.3 26.6 6.25 17.2 6.25 9.38 0.000 10.9 14.1 0.000 12.5 20.3 0.000 12.5 20.3 15.6 3.12 9.38 7.81 12.5 0.000 0.000 7.81 14.1 18.8 0.000 12.5 24.4 5.94 15.0 19.7 1.25 13.8 1.25 14.7 21.3 1.25 16.9 22.5  Table A.19: Statistics for Test 5 - Vowels with Interrogative Intonation  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  59.4 61.5 70.8 71.9 57.3 57.3 74.0 77.1 57.3 55.2 67.7 70.8 59.4 52.1 68.8 69.8 59.4 57.3 74.0 81.2 58.5 42.8 71.0 74.2  % MISRE- % REJECTED COGNIZED 36.5 4.17 0.000 38.5 0.000 29.2 0.000 28.1 41.7 1.04 42.7 0.000 26.0 0.000 22.9 0.000 37.5 5.21 41.7 3.13 28.1 4.17 27.1 2.08 38.5 2.08 47.9 0.000 31.2 0.000 30.2 0.000 40.6 0.000 42.7 0.000 26.0 0.000 18.8 0.000 39.0 2.50 0.556 57.7 28.1 0.833 25.4 0.417  118  % CORRECT RUNNERS UP 13.5 13.5 8.33 10.4 14.6 10.4 10.4 9.42 18.8 13.5 12.5 10.4 12.5 15.6 12.5 12.5 14.6 16.7 10.4 5.23 14.8 18.6 10.8 9.58  Table A.20: Statistics for Test 6 - Consonants (Group 1)  119  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  65.6 76.0 89.6 92.7 66.7 74.0 85.4 84.4 79.2 83.3 93.8 86.5 76.0 84.4 91.7 89.6 61.5 67.7 86.5 87.5 69.8 69.4 89.4 88.1  % MISRE- % REJECTED COGNIZED 3.12 31.3 0.000 23.9 0.000 10.4 0.000 7.29 0.000 33.3 0.000 26.0 0.000 14.6 0.000 15.6 18.8 2.08 0.000 16.7 0.000 6.2 13.5 0.000 20.8 3.13 0.000 15.6 0.000 8.33 0.000 10.4 35.4 3.13 0.000 32.3 0.000 13.5 12.5 0.000 2.29 27.9 30.6 0.000 0.000 10.6 11.9 0.000  % CORRECT RUNNERS UP 13.5 10.4 3.13 4.17 18.8 12.5 8.33 8.33 8.33 6.25 3.13 7.29 9.38 11.5 6.25 9.38 11.5 10.4 7.29 5.21 12.3 13.6 5.63 6.88  Table A.21: Statistics for Test 7 - Consonants (Group 2)  Appendix A. Previous Work in Speech Recognition at UBC  A.1.1  Confusion Matrices for VTR6050 for All Subjects  Appendix A. Previous Work in Speech Recognition at UBC  Word 0 No. 0 4 3 1 2 S  1  2  4 12  1  S  4  5  Was Recognized AsWord No. 6 7 8 9 10 11 12  IS  Rej  5 20 2 15 3 1 19  4  5 6 7 8 9 10 11 12 IS  18 2 . 20 5 15 , 4 14  •  1  1  1  1  •  •  .  .  •  •  •  2 1  15 2 3 13 . 20 . 2 18  1  •  3  1  •  .  18  15  .  2 1  18  1  H  15  •  •  1 16  Table A.22: Confusion Matrix for Test 1 - Vowels Said Normally  Word No. 0 0 3 3 1 2 S  1  2  S  Was Recognized AsWord No. 6 7 8 9 10 11 12  •  19 5 13  •  •  •  1  4  16 13 2 2  •  14 1 1  15  •  2 1 19 15  •  •  •  •  •  1  2 12 . 20 2 17  1  U  15 1 1  19 4 6 4  3  Rej  l 20 1  •  15  4  15  •  IS  2 12  4  5 6 7 8 9 10 11 12 IS  4  5  •  2  •  3  •  •  15  Table A.23: Confusion Matrix for Test 2 - Vowels With Mike Moved  2  Appendix A. Previous Work in Speech Recognition at UBC  Word No. 0 0 1 1 6 2 S  1  2  S  5 10  .  4  5  Was Recognized AsWord No. 9 10 11 12 IS 6 7 8  14 15 Rej 14 3 l  20 3 17  4  20  5 6 7 8 9 10 11 12 IS  •  U  2  •  .  •  8 5  18 . . 12  . . . .  1 13 14  3  1  .  1  4 14 4 1 13 20 1 3  ,  . . 16  2 2 2 2 . .  17  2  .  
17 3  15  1  17  •  Table A.24: Confusion Matrix for Test 3 - Vowels Said Slowly  Word No. 0 1 0 7 3 1 7 9 2 S  4  •  2  15  4  IS  16 2 13 5 1 19 16  5 6 7 8 9 10 11 12 IS  14  S  Wae Recognized AsWord No. 7 8 9 10 11 12 5 6  1  14 15 Rej 10 . 2 2 4 1  2  20 4 4 3  12  1  .  13  1  1  .  10  1 1 3 3  4  .  2  2  4  .  8 11 19 3  .  1  .  1  2  .  2 3 . 16 . 16  1 16  •  •  .  14  Table A.25: Confusion Matrix for Test 4 - Vowels Said Quickly  1  123  Appendix A. Previous Work in Speech Recognition at UBC  Word No. 0 1 2 S  0  1  2  3 7  2  2  10  1  IS  14 15  Rej 13 3  4 •  5 6 7 8 9 10 11 12 IS 15  4  Was Recognized As Word No. 9 10 11 12 6 7 8 5  19  4  U  S  10 1  1  5 19 3  17  . 11 7  19 ,  . 7 ,  2  1  .  1 10 3 1  1 •  14 4  .  •  •  i  ••  1 9 19 3  •  .  1  3 4  •  1 1  16  1  . 20  3  1  15  1 1  •  1  1  1  •  •  16  •  Table A.26: Confusion Matrix for Test 5 - Vowels With Interrogative Intonation  124  Appendix A. Previous Work in Speech Recognition at UBC  Word Was Recognized A« Word No. No. 0 1 B 8 4 5 6 7 8 9 10 11 IB 18 14 15 16 17 18 19 20 21 0 1 66 1 911 2 2 . 13 8 1 . 17 4  5 6 7 8 9 10 11 12 18  16  22 28 Rej . 1 1  . 3 . 14 . . 2  2  1  .  12 1 1  1  1  1  1  .  . 13 . 1  .10  4  1  1 . 3 10  1  3 5  15  .  11  2  ,  2  3  1  1  .  15  1  •  .  1  3  .  1  11  1  14  ,  .  3  ,  11  15 16 17 18 19 20 21 22 28  2  1  1  1  1  1 2 2  3  .  3  .  12  4  .  3  16  .  1  5  5  1  .  3  9  1  .  2  3  8  .  2  2 2  1  3  ,  1  18  .  . .  3 .  1  1  .  10  1  1  3 .  1  1  .  3  9  2  1 .  •  1  .  2  4  9  Table A.27 Confusion Matrix for Test 6 - Consonants (Group l)  Appendix A. Previous Work in Speech Recognition at UBC  125  Word Was Recognized As Word No. 0 1 2 8 4 5 6 7 8 9 10 11 12 IS 14 15 16 17 18 19 20 21 22 28 Rej No. 8 0 10 2 1 6 13 1 . . 18 1 1 2 . . . 19 1 S  4  5 6 7 8 9 10 11 12 IS U  15 16 17 18 19 20 21 22 2S  . . . . 19 19 . . . . 16 2 . 1 . . . . . . . 19 . . 11 . 11 1 2 13 3. . .1 1 • 3 . .1 1. 1. 1 . .3. . .1  1 . 1 1  . 1 . . 12 . 1 1 . .  3 3.  . .1  . . . .  . 2 1 1 1  . 1  . . . 3  2 1 . 1 . . 2 . . . 2 12 . . 15 . . . . . .  . . 15 . .  1 . . . . . 15 4 6 12  2 3 1 3  .  •  1 1  1 2  3 1 2 1 1 1 •  1 1  12 2 1 1 . 16 3 1 3 10 1 1 1 . . . . 1 2 17 . . 2 1 • 9 1 2 . 1 2 12 1 . . 1 . . 1 . . 1 . • 2 3 • • 2 10 .  Table A.28: Confusion Matrix for Test 7 - Consonants (Group 2)  Appendix A. Previous Work in Speech Recognition at UBC  A.2  Results for N E C SR100  126  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  79.7 81.3 83.3 85.4 64.1 87.5 81.3 87.5 92.2 93.8 93.8 97.9 79.7 85.9 93.8 91.7 87.5 92.2 89.6 87.5 80.6 88.1 88.3 90.0  127  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 7.81 4.69 12.5 15.6 3.13 4.69 12.5 16.7 0.000 14.6 0.000 10.4 4.69 9.38 31.3 9.38 10.9 1.56 2.08 14.6 16.7 10.4 2.08 10.4 6.25 1.56 4.69 6.25 0.000 3.13 0.000 4.17 6.25 2.08 0.000 0.000 0.000 18.8 20.3 0.000 7.81 14.1 6.25 0.000 2.08 0.000 6.25 8.33 12.5 0.000 6.25 1.56 4.69 6.25 6.25 8.33 2.08 2.08 4.17 10.4 16.6 2.80 8.75 1.25 5.94 10.6 0.833 7.92 10.8 0.833 6.25 9.17  Table A.29: Statistics for Test 1 - Vowels Said Normally  Appendix A. 
Previous Work in Speech Recognition at UBC  SUBJECT  MODE  % CORRECT  DR  0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  85.0 81.3 82.5 91.3 43.8 62.5 71.3 68.8 47.5 48.8 53.8 55.0 67.5 66.3 85.0 88.8 72.5 81.3 83.8 83.8 63.3 68.0 75.3 77.5  EC  HW  JR  RF  AVERAGE  128  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 2.50 6.25 12.5 10.0 15.0 3.75 8.75 13.8 3.75 0.000 8.75 8.67 12.5 32.5 23.7 11.3 22.5 15.0 6.25 17.5 11.2 10.0 21.2 10.0 5.00 13.8 38.7 3.75 11.2 40.0 7.50 13.7 32.5 30.0 8.75 15.0 8.75 16.3 16.2 20.0 6.25 13.7 6.25 5.00 8.75 5.00 8.75 2.50 6.25 15.0 21.3 7.50 13.7 5.00 8.75 0.000 16.2 0 . 0 0 0 6 .25 16.2 17.5 9.50 19.3 7.75 15.3 16.7 7.25 14.0 10.7 7.75 8.50 14.0  Table A.30: Statistics for Test 2 - Vowels With Mike Moved  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  MODE % CORRECT  AVERAGE  % MISRE- % REJECTED % CORRECT COGNIZED RUNNERS UP  0 1  13.8  3.75  82.5  1.25  73.8  11.3  15.0  7.50  2 3  77.5 95.0  10.0 5.00  12.5 0.000  7.50 5.00  0  1.25  2.50  96.3  0.000  1  71.3  28.7  0.000  16.3  2  72.5  27.5  0.000  12.5  3  77.5  22.5  0.000  8.75  0 1  2.50  85.0  0.000  97.5  12.5 2.50  0.000  0.000  2 3  97.5 96.3  2.50 3.75  0.000 0.000  2.50 3.75  0  12.5  17.5  70.0  0.000  1 2  86.3  13.7 11.2  0.000  10.0  0.000  10.0  11.2  0.000  10.0  71.3 0.000  1.25 5.00  0.000 0.000  8.75 6.25  81.0 3.00 2.50 0.000  0.500 7.75 8.25 6.50  3 RF  129  84.8 88.8  0 1  20.0  8.75  91.3  8.75  2 3  87.5 88.8  0 1 2 3  10.0 84.0 84.8  12.5 11.2 9.00 13.0 12.7 10.5  89.5  Table A.31: Statistics for Test 3 - Vowels Said Slowly  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  41.3 90.0 91.3 91.3 38.8 68.8 68.8 70.0 23.8 83.8 92.5 92.5 55.0 87.5 90.0 86.3 55.0 91.3 90.0 88.8 42.8 84.3 86.5 85.8  130  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 31.2 2.50 27.5 0.000 8.75 10.0 0.000 6.25 8.75 6.25 0.000 8.75 7.50 60.0 1.25 15.0 31.3 0.000 0.000 16.3 31.2 30.0 0.000 12.5 22.5 3.75 53.7 16.2 0.000 6.25 7.50 0.000 5.00 0 . 0 0 0 7.50 7.50 11.2 11.3 33.8 0.000 7.50 12.5 10.0 0.000 5.00 0.000 8.75 13.7 40.0 5.00 7.50 8.75 0.000 6.25 0.000 6.25 10.0 7.50 11.2 0.000 14.3 6.50 43.0 0.000 8.75 15.7 0.000 7.75 13.5 8.50 14.3 0.000  Table A.32: Statistics for Test 4 - Vowels Said Quickly  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  70.0 77.5 78.8 96.3 36.3 62.5 72.5 67.5 78.8 90.0 92.5 91.3 63.8 81.3 92.5 91.3 72.5 68.7 73.8 76.3 64.3 76.0 82.0 84.5  131  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 10.0 15.0 15.0 5.00 13.8 8.75 8.75 10.0 11.2 3.75 0.000 3.75 16.3 33.7 3.75 3.75 16.3 33.8 2.50 12.5 25.0 17.5 30.0 2.50 7.50 3.75 13.7 10.0 0.000 6.25 7.50 7.50 0.000 8.75 0.000 7.50 7.50 25.0 11.2 3.75 7.50 15.0 0.000 6.25 1.25 5.00 8.75 0.000 5.00 10.0 22.5 8.75 16.3 22.5 2.50 7.50 23.8 2.50 3.75 21.2 8.50 23.7 12.0 10.3 18.0 6.00 7.25 14.5 3.50 1.00 7.50 14.5  Table A.33: Statistics for Test 5 - Vowels with Interrogative Intonation  Appendix A. 
Previous Work in Speech Recognition at UBC  SUBJECT DR  EC  HW  JR  RF  AVERAGE  MODE % CORRECT 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  45.8 61.5 56.9 69.4 37.5 46.9 48.6 52.8 56.3 65.6 72.2 79.2 59.4 58.3 76.4 76.4 54.2 47.9 61.1 56.9 50.6 56.0 63.1 66.9  132  % MISRE- % REJECTED % CORRECT RUNNERS UP COGNIZED 4.17 8.33 50.0 11.5 1.04 37.5 1.39 26.4 41.7 0.000 19.4 30.6 12.5 7.29 55.2 15.6 47.9 5.21 15.3 2.78 48.6 18.1 43.1 4.17 21.9 0.000 43.8 17.7 1.04 33.3 0.000 18.1 27.8 8.33 20.8 0.000 9.38 0.000 40.6 1.04 17.7 40.6 6 .94 0.000 23.6 1.39 6.94 22.2 9.38 0.000 45.8 15.6 2.08 50.0 0.000 9.72 38.9 15.3 0.000 43.1 12.3 2.29 47.1 2.08 15.6 41.9 15.3 36.1 0.833 13.6 1.11 31.9  Table A.34: Statistics for Test 6 - Consonants (Group 1)  Appendix A. Previous Work in Speech Recognition at UBC  SUBJECT  MODE  DR  0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3  EC  HW  JR  RF  AVERAGE  %  CORRECT  72.9 74.0 86.1 84.7 42.7 54.2 68.1 83.3 77.1 76.0 83.3 87.5 77.1 80.2 83.3 90.3 45.8 43.8 54.2 59.7 63.1 65.6 75.0 81.1  133  % MISRECOGNIZED  % REJECTED  % CORRECT RUNNERS UP  27.1 26.0 13.9 15.3 55.2 45.8 31.9 16.7 22.9 24.0 16.7 12.5 21.9 18.8 16.7 9.7 54.2 55.2 45.8 40.3 36.3 33.9 25.0 18.9  0.000 0.000 0.000 0.000 2.08 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.04 1.04 0.000 0.000 0.000 1.04 0.000 0.000 0.625 0.417 0.000 0.000  14.6 15.6 5.56 8.33 19.8 12.5 12.5 11.1 18.8 14.6 9.72 8.33 15.6 13.5 8.33 8.33 18.8 15.6 22.2 13.9 17.5 16.0 11.7 10.0  Table A.35: Statistics for Test 7 - Consonants (Group 2)  Appendix A. Previous Work in Speech Recognition at UBC  A.2.1  Combined Confusion Matrices for All Subjects for SR100  134  Appendix A. Previous Work in Speech Recognition at UBC  Word No. 0 0 15 1 2 8  1 •  12  2  1 4 17  8  4  135  Was Recognized AsWord No. 5 6 7 8 9 10 11 12 18  U  3 1 1  2 19  4  15 Rej  12  2  4  2  4  15 2  11  .  l 14 1  1  3 13 ,  1  •  2  1  •  •  15  1  17  1  17  15  Table A.36: Confusion Matrix for Test 1 - Vowels Said Normally  Word No. 0 0 10 1 2 8  1  2  13 1  1 13 2  4  5 6 7 8 9 10 11 12 18  14  15  8  4  Was Recognized AsWord No. 5 6 7 8 9 10 11 12 18  1  15 .  9  5  1  •  •  •  1  .  15 1  11 5 1 2 1  6  .  1  1  •  .  ,  3 4  3  1 13 13 4 2  1 9 7 12  2 . 1  1  2 3 2 2 1  .  15 1  U  .  •  17 17  4 4  1 3  .  1 1 1 4  19  5 6 7 8 9 10 11 12 18  1 3  14 15 Rej 2 7 2 4 2 4 3 2 5 6 6 9 4 6 7 6 1 3 4 12 . 12 7  Table A.37: Confusion Matrix for Test 2 - Vowels With Mike Moved  1 3  Appendix A. Previous Work in Speech Recognition at UBC  Word No. 0 1  £ 8  4 5 6 7 8  136  Was Recognized As Word No.  0  6 . . . . .. ..  10 11 12 18  14  15  4 5 6 7 8 9 10 11 1£ 18 14 15 Rej . 1 14 7 1 2 10 1 9 1 9 . . 5 2 . . . 13 . . . 6 . . 4 . . . . 3 . . . 7 7 13 33 .. . . . . 1 1 . . 5 5 1 1 . . 22 . . . 8 11 . .. .. 22 .. . . 66 . . . . . . . 11 99 . 11 4 1 5 10 6 14 7 . . . 13 9 . . . 11 12 . . 8 9 . 11 3 17 1  £  8  Table A.38: Confusion Matrix for Test 3 - Vowels Said Slowly  Word No. 0 0 8 1 £ 8  1  6 15  £  2 16 3  4  5 6 7 8 9 10 11 1£ 18 U  15  8  4  Was Recognized As Word No. 7 8 9 10 11 1£ 5 6  15  18 14  15 Rej  4 2 1  1 8  11 15 l 7  15 4  2  7 1 10  1  2  •  . 6 , .  . 1 . 1 8 8 . 15  2 3 8 1  14 1  4  .  2  5 12  1  1  2 13  .  16  Table A.39: Confusion Matrix for Test 4 - Vowels With Quickly  2 1 3 1 5 3 2 2 2 5 4 . 2 2 1  Appendix A. Previous Work in Speech Recognition at UBC  Word 0 No. 0 13 1 2 S  U  15  2  S  4  IS  U  15  16  1 17  2 19 . 17  5  .  1 1  3  .  • •  1 2  •  1 19  6  1  •  ,  10 11 6 3 10  3 1  ,  3  1  •  1  2  3  1  3  •  •  •  .  
l l  16 1 17 14 16  1 1  Table A.40 Confusion Matrix for Test 5  .  Rej  6 3 1  1  10  4 5 6 7 8 9 10 11 12 IS  1  Was Recognized 8A Word No. 5 6 7 8 9 10 11 12  137  13 17  2 3 3 1 1 2 1 2  Vowels With Interrogative Intonation  138  Appendix A. Previous Work in Speech Recognition at UBC  Was Recognized 8A Word No. Word 0 1 2 8 4 5 6 7 8 9 No. 10 11 12 IS 14 15 16 17 18 19 20 21 22 . 15 . 2 . .• • • 0 . 1 . 15 . 2 . • * • 1 . 1 2 16 . . • • • 2 • . . . 17 1. 2 . S 4 5 6 7 8 9 10 11 12 18  14 15 16 17 18 19 20 21 22 28  . 1 . 1 9 . 1 11 1 . 1 . 2 5 7 1 1. . 3 14 .1 . . 1 . . 2. . 82 1 . . . 2 . . 1 19 2 1 1 12 • . . .3. • • ... 12 • • . . , 6 • . • 2 . 2 . 4 , • > . . 11 2 1 • • . 1 •  .  .  ,  ,  . . 41 . . . 11 . . . . .  • •  . . 11 .  1  1 2 • •  1 1  1 ,  5 12 3 10 1 1 12 . . 1 . 1 • . 1 . 1 . 2 . • 1  . . 11 . .  .  .  • •  •  •  .  2 . 1 .  . 3 . 2 . 1  .  . 1  . .  . 1 . 2  . .  . 2 . 1  .  1 .  1 3 . . 12 5 6 8  28 Rej  1 1 1 1 . 2 3 2 1 1 11 . . 1 7 5 7  . . 8 3 . 1 1 2 7 . , 1 13 . 4 2 2 4 . 1 , 1 3 . . 6 • 4 . 1 5 • •  . •  •  3 . . 1 . . .  . 3 1 2 . 1  9 . 3 2 8 .  Table A.41: Confusion Matrix for Test 6 - Consonants (Group 1)  Appendix A. Previous Work in Speech Recognition at UBC  139  8 Word No. Word Was Recognized A 0 1 2 8 4 5 6 7 8 9 1 0 11 12 18 14 15 16 1 7 18 1 9 20 21 22 28 Rej No. 2 0 18 3 . 1 .12 1 3 . . 1 1 1 2 . . 18 . 8 . . . 20  4  5 6 7 8 9 10 11 12 18 14  15 16 17 18 19 20 21 22 28  1 . . .  13 2 . . . . 15 1 . . 1 14 . . 2  2  2  . .  4  6 . 1 6  . . . 1 . . . . 11 1 . . . . 1 . . 1 . . 13 . 1 . . 15 . . . 1 1 . 11 5 . 2 12 . . 16 2 . . . . . . . 1 1  Jt t \ . • • •  •  •  •  •  •  •  •  •  •  • • X *  •  .  1  •  2  1  • • •  4 1  1 3  . . 1  1  .  .  1 1 . .  .  1  •  . .  . .  .  .  •  •  •  •  •  •  •  .  .  1  1 3  •  1 1 1  *  9 1  6  .  3  9  1  1 1  7  •  1 3  . .  2  6  •  2  1  .  11  .  2 3 1 1 .  7 9  ,  2  •  11 8 4 15  >  2  1 4  1 .  6  .  1  11 .  . 9  1 1  .  10  7 14  •  •  Table A.42: Confusion Matrix for Test 7 - Consonants (Group 2)  Appendix B  Operation of the V O T A N VTR6050  The VOTAN speech recognition unit is a speaker dependent, isolated word recognition unit. The unit must be pretrained with all possible words for all possible users. The pretrained words are stored as templates in the memory. Recognition is accomplished by spectral template matching. The V O T A N boasts connected word recognition, although limited tests of our own, have shown that at best it can recognize only every second word in a sentence, depending upon the size of the sentence and the length of pauses between words.  B.l  Additional Features  The V O T A N VTR6050 has some extra features. The first is a floppy disk drive which can be used to store additional templates. The advantages of the floppy disk is that each user can train the SRU before hand, store his/her set of words on their own floppy disk and then later use their pre-trained set of application words to initialize the system. An other feature of the VTR6050 is its ability to interface with the telephone system. There are built in functions which allow the SRU to monitor the telephone and answer it after a given number of rings.  B.2  Memory Space  The VOTAN VTR6050 has the following amount of memory:  140  Appendix B. 
Operation of the VOTAN VTR6050  V T R System Memory  500  Kbytes  Floppy Disk Memory  760  Kbytes  22  Kbytes  1282  Kbytes  Voice Card Total + Host Memory  141  Unlimited  Each template requires 200 - 250 bytes of memory, therefore 1282 Kbytes can hold approximately 5250 - 6500 words. The Voice Card can hold only one set of words at a time. The 22 Kbytes of memory on the Voice Card corresponds to 75 seconds of speech. All recognition takes place on the Voice Card. There is a maximum of 150 words per set if each word is trained with only one template, however the number of words decrease if more than one template is trained per word (eg. if there are 2 templates/word then the maximum number of words/set is 75). The V O T A N can swap sets between the V T R System Memory and the Voice Card instantaneously, between the Host memory and the Voice Card (time depends on the maximum baud rate of 9600 for the RS232 connector), or between the floppy and the Voice Card.  B.3  Configuration for the Operation of VTR6050  The configuration for the operation of the V O T A N VTR6050 is shown in the Figure B.33. The test of the word selection method was done live, therefore the speech went direct to the VTR6050. The HP9050 acted as the host computer controlling the VTR6050 by an RS232 connector to the HPD3 bus. The subject was prompted from the HP9050 for the appropriate word. Recognition data was returned to the HP9050 from the  142  Appendix B. Operation of the VOTAN VTR6050 analog speech/ f I Herd  Qyw< f  O O O O • e r  O O O C l e c  e  micPophtne.  FILTER  . I  j1  Ono/eo,  I  speech  1  UUU|  5a\ PDP-11  POP-M  VTiieet?  Hf>90fO  HPIOSO  recognition  Processed sp*^ SPeecH  Figure B.33: Configuration for the Operation of VTR6050 VTR6050 by the RS232 connector. For thefilteringtest, the data was acquired as explained in Section 5.4 and transferred by tape to the HP9050. Thefilteringwas done on the HP9050 in software. To test the recognition rate, thefilteredspeech was sent to the framegrabber board of the PDP-11 via the HPIB bus. The digital speech was sent to the D/A port of the PDP-11 at a rate of 20000 samples per second. The D/A output was connected to a Khonite low passfilterwith a cut off of 9 KHz. The analog speech was then input direct to the VTR6050. The VOTAN SRU was controlled by the HP9050 and the VTR6050 returned the recognition data back to HP9050 ( by the RS232 port to the HPIB bus). The HP9050 kept track of the recognition results and did all analysis of the data.  Appendix C Results for Vocabulary Selection  The following tables show the complete results for each subject. Column 3 gives the results that were obtained before any selection had taken place. Column 4 show the results obtained after an optimum subset was selected. The level number refers to the levels of the specific application subset that was used and is defined as follows: Go  Begin  End  Quit  Course  Track  Location  Position  Remember  Record  Goto  Repeat  Send Bucket To  Tell Position  Determine Position Give Position  Speed  Velocity  Rate  Camera  Optical Device  Photograph  Mode  State  Condition  Level 1 Start Stop Level 2 Path Point Level 3 Memorize  Level 4 Memorized Location Point  Position  Course  Track  Bucket  Grapple  Stick  Arm  Link 1  Boom  Shoulder  Link 2  Faster  Increase  Memorized Path Level 5 Endpoint  Level 6 Quicker  143  144  Appendix C. 
Results for Vocabulary Selection  Slower  Retard  Decrease  Minimum  Slowest  Least  Medium  Average  Intermediate  Most  Maximum  Faster  Level 7 Handcontroler  Path  Joystick  Arm  Joints  Bucket  Faster  Increase  Slower  Retard  Decrease  Minimum  Slowest  Least  Medium  Average  Intermediate  Most  Maximum  Faster  Elevate  Higher  Down  Descend  Lower  Left  West  Leftward  Right  East  Right side  On  Lock Onto  Lock  Off  Unlock  Take Off  Speed  Rate  Position  Location  Level 8 Quicker  Level 9 Up  Level 10 Velocity Point  C.l  Vocabulary Selection Results - No Distances Set  Appendix C. Results for Vocabulary Selection  Subject Level  DG  FN  Before Selection 4fter Selection % Mis% Mis% % % % RecognizedRecognizedRejected RecognizedRecognizedRejected  1  100.00  0.00  0.00  100.00  0.00  0.00  2 3  100.00 100.00  0.00 0.00  100.00 97.78  0.00 2.22  100.00 100.00 100.00 93.33  0.00 0.00 0.00  0.00 0.00  4 5  0.00 0.00 0.00 0.00  6 7  97.78 83.33  2.67 16.67  96.00 100.00  100.00  0.00  100.00  4.00 0.00 0.00  0.00 0.00  8  0.00 0.00 0.00  9 10  0.00 0.00  5.56  100.00  0.00  96.67 100.00  0.00 0.00  3.33 0.00  Average  97.33  0.00  2.71  98.60  0.40  1.0  1  93.33  6.67  0.00  100.00  0.00  0.00  2 3  96.67 80.00  0.00 7.78  100.00 93.33  0.00 0.00  4 5  86.67 88.89  0.00 4.44  3.33 12.22 13.33 6.67  70.00 100.00  0.00 0.00  0.00 6.67 30.00 0.00  6 7 8 9 10  86.67 86.67 96.00 78.89 96.67 89.04  0.00 0.00  13.33 13.33 2.67 8.89 3.33  96.00 90.00  0.00 0.00  96.00 100.00 90.00 93.53  4.00 0.00 0.00 0.40  100.00 100.00 96.67 100.00 100.00 92.00 100.00 80.00 100.00  0.00 0.00 0.00 0.00 0.00 8.00 0.00 8.00 0.00  0.00 0.00 0.00 0.00 12.00 0.00  100.00 96.87  0.00 1.60  0.00 1.53  Average HL  145  1 2 3 4 5 6 7 8 9 10 Average  94.44  1.33 12.22 0.00 3.24 3.33  7.71  96.67 96.67 94.44 100.00 95.56 84.00 100.00 84.00 85.56  3.33 2.22 0.00 2.22 10.67 0.00 9.33 10.00  0.00 0.00 3.33 0.00 2.22 5.33 0.00 6.67 4.44  76.67 91.36  0.00 4.11  23.33 4.53  0.00  0.00 6.67  0.00  4.00 10.00 0.00 0.00 10.00 6.07 0.00 0.00 3.33  Appendix C. Results for Vocabulary Selection  Subject  KC  ML  % Mis% Mis% % % % Recognized Recognized Rejected Recognized Recognized Rejected 0.00 0.00 0.00 100.00 93.33 6.67  2 3  100.00 96.67  0.00 1.11  0.00 2.22  90.00 100.00  0.00 0.00  10.00 0.00  4 5 6 7  100.00 91.11 92.00 100.00  0.00 0.00  0.00 8.89  73.33  0.00 0.00 2.67  100.00  0.00 0.00 4.00 0.00 0.00  0.00 0.00 0.00 0.00  8  8.00 0.00 24.00  100.00 100.00 96.00 100.00  9  88.89  6.67  4.44  90.00  10.00  0.00  10  100.00  0.00  0.00  100.00  0.00  0.00  1.40  1.00 0.00 0.00 0.00 0.00 6.67 16.00 0.00  Average  93.53  4.65  1.82  97.60  1  93.33 100.00 85.56 100.00 93.33 88.00 86.67  6.67 0.00 7.78 0.00 0.00  0.00  100.00  0.00  0.00 6.67 0.00 6.67  10.67 3.33  1.33 10.00  100.00 100.00 100.00 93.33 84.00 100.00  0.00 0.00 0.00 0.00 0.00 0.00  2.67 2.22 13.33  96.00 83.33 100.00  4.00 3.33  2 3 4 5 6 7 8 9 10  PN  After Selection  Before Selection  Level  1  146  0.00  86.67  10.67 4.44 0.00  0.00  0.00 13.33 0.00  Average  91.36  4.36  4.29  95.67  0.73  3.60  1  100.00  0.00 0.00 0.00 3.33  100.00 100.00 100.00  0.00 0.00  0.00 0.00 0.00  4  96.67 98.89 16.67  0.00 3.33 1.11  0.00  2 3 5  95.56  0.00  80.00 4.44  100.00 100.00  0.00 0.00  0.00 0.00  6 7  98.67 86.67  1.33 0.00  0.00 13.33  84.00 100.00  16.00 0.00  0.00 0.00  8 9  89.33 94.44 100.00 87.69  8.00 3.33 0.00 1.60  2.67 2.22 0.00 10.71  100.00 96.67 100.00 98.07  0.00 0.00 0.00 1.60  0.00 3.33 0.00 0.33  10 Average  86.67 93.33  Appendix C. 
C.2  Vocabulary Selection Results - Distance Measurements Used

Subject  Level   Before Selection                             After Selection
                 % Recognized  % Mis-Recognized  % Rejected   % Recognized  % Mis-Recognized  % Rejected

DG
1 2 3 4 5 6 7 8 9 10 Average
100.00 0.00 50.00 50.00 0.00 0.00
100.00 0.00 100.00 0.00 0.00 0.00
0.00 96.67 3.33 100.00 0.00 0.00
100.00 0.00 0.00 0.00 100.00 0.00
100.00 0.00 93.33 0.00 0.00 6.67
92.00 5.33 2.67 100.00 0.00 0.00
100.00 0.00 0.00 0.00 100.00 0.00
98.67 0.00 1.33 92.00 0.00 8.00
97.78 0.00 2.22 100.00 0.00 0.00
96.67 3.33 0.00 100.00 0.00 0.00
98.18 1.09 0.73 93.53 6.47 0.00

DH
1 2 3 4 5 6 7 8 9 10 Average
100.00 100.00 100.00 100.00 95.56 100.00 100.00 100.00 97.78 100.00 99.33
0.00 0.00 0.00 0.00 2.22 0.00 0.00 0.00 0.00 0.00 0.22
0.00 0.00 0.00 0.00 2.22 0.00 0.00 0.00 2.22 0.00 0.44
100.00 100.00 100.00 100.00 100.00 100.00 100.00 96.00 100.00 100.00 99.60
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.40

GG
1 2 3 4 5 6 7 8 9 10 Average
76.67 100.00 65.56 100.00 95.56 72.00 100.00 94.67 90.00 100.00 89.45
0.00 0.00 7.78 0.00 0.00 5.33 0.00 5.33 2.22 0.00 2.07
23.33 0.00 26.67 0.00 4.44 22.67 0.00 0.00 7.78 0.00 8.49
90.00 100.00 100.00 100.00 100.00 92.00 90.00 100.00 100.00 100.00 97.20
0.00 0.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.40
10.00 0.00 0.00 0.00 0.00 4.00 10.00 0.00 0.00 0.00 2.40
Subject  Level   Before Selection                             After Selection
                 % Recognized  % Mis-Recognized  % Rejected   % Recognized  % Mis-Recognized  % Rejected

HL
1 2 3  96.67 100.00 98.89  3.33 0.00 1.11  0.00 0.00 0.00  100.00 100.00 100.00
4 5  90.00 95.56  0.00 0.00  10.00 4.44
6  84.00 100.00  5.33  10.67  100.00 100.00 92.00  0.00 0.00 0.00 0.00 0.00 0.00  0.00 5.33
7 8  0.00  100.00  0.00  8.00 0.00  9.34  88.00  0.00  12.00
9  92.22  4.44 3.33  3.33 20.00  96.67  0.00
10  76.67  90.00  0.00  3.33 10.00
Average  91.93  2.29  5.78  96.67  0.00  3.33

KC
1  100.00  0.00  0.00  100.00  0.00  0.00
2 3 4 5  100.00 100.00 100.00 100.00  0.00 0.00 0.00 0.00  0.00 0.00 0.00 0.00  100.00 100.00 100.00 100.00  0.00 0.00 0.00 0.00  0.00 0.00 0.00 0.00
6 7  94.67 96.67  4.00 0.00  1.33 3.33  100.00 100.00  0.00 0.00  0.00 0.00
8 9 10  97.33 93.33 96.67  2.67 1.11 3.33  0.00 5.56 0.00  92.00 100.00 100.00  8.00 0.00 0.00  0.00 0.00 0.00
Average  97.87  1.11  1.02  99.20  0.80  0.00

ML
1 2  96.67 93.33 100.00  3.33  0.00  90.00 100.00 100.00  0.00 0.00 0.00 0.00 0.00  10.00
3 4  100.00 97.78  0.00 0.00  0.00 0.00 0.00 2.22  0.00 0.00 0.00  6.67 0.00  100.00 100.00  0.00 0.00  85.33
5 6 7 8 9 10  85.33 90.67 96.67 86.67  12.00 0.00 8.00 0.00 0.00 3.00  2.67 3.33 1.33 3.33  96.00 100.00 96.00 93.33 100.00  4.00 0.00 13.33 2.62  0.00 3.33 0.00 0.73  0.00 0.00 0.00 0.00  0.00 0.00 0.00 0.00 0.00 0.00 4.00 3.33 0.00 1.73
Average  94.38  97.53

RH
1 2  83.33 0.00 16.67 100.00 0.00 0.00  100.00 90.00  0.00 0.00 0.00 2.22
3 4 5  100.00 100.00 97.33  0.00 0.00 2.67  0.00 0.00  83.33 90.67  0.00 8.00  7.78  0.00 16.67 1.33 0.00  100.00 96.67  0.00 0.00  0.00  100.00 100.00 100.00  0.00 0.00 0.00  0.00 0.00
6 7 8  90.00 96.00  0.00 0.00  10.00 4.00  93.33  3.33  3.33  3.33  0.00
9 10  100.00 96.67  0.00 0.00  3.33  90.00  0.00  10.00
Average  94.13  1.29  4.58  96.60  0.33  3.07

RS
1  100.00  0.00  0.00  100.00  0.00  0.00
2 3 4 5 6 7  100.00 81.11 80.00 100.00 98.67  0.00 7.78 0.00 0.00 0.00 0.00  0.00 11.11 20.00 0.00 1.33 0.00  100.00 96.67  0.00 0.00 0.00 0.00 0.00 0.00  0.00 0.00 3.33 0.00 0.00 0.00
8 9 10  100.00 95.56 86.67  0.00 0.00 3.33  0.00 4.44 10.00 2.22  96.00 96.67 100.00  0.00 0.00 0.00  0.00 4.00 3.33 0.00
Average  94.20  3.58  98.93  0.00  1.07

SP
1 2  100.00 83.33 98.89 100.00 100.00 98.67  0.00  90.00 100.00 100.00 100.00 100.00  0.00  10.00
3 4 5 6 7 8 9 10  0.00 1.11 0.00 0.00 0.00  0.00 0.00 0.00 0.00  90.00 98.67 96.67  3.33 0.00 3.33  0.00 0.00 0.00 0.00  0.00 0.78  100.00 100.00 100.00 100.00 90.00 98.00  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00  96.67 96.29  0.00 16.67 0.00 0.00 0.00 1.33 6.67 1.60 0.00 4.00 3.03  0.00 0.00  10.00 2.00
Average  96.29  3.03  98.00  0.00  2.00
Appendix D

Results of Filtering Tests

D.1  Results For Noise Mode 1

D.1.1  Results When Trained With One Template Only

Filtering Method (PS)   Attempt   % Recognized   % Mis-Recognized   % Correct Runners Up

No filtering (-), Wiener (1), Wiener (2), Wiener (3), PSS (1), PSS (2), PSS (3); attempts 1 2 3 4 5 Ave for each method:
31.25 50.00 50.00 62.50 43.75 37.50 18.75 42.50 31.25 12.50 12.50 31.25 12.50 20.00 31.25 18.75 56.25 31.25 25.00 32.50 12.50 12.50 18.75 6.25 12.50 12.50 31.25 37.50 43.75 37.50 62.50 42.50 62.50 62.50 56.25 37.50 50.00 53.75 50.00 68.75 50.00 50.00 43.75 52.50
37.50 56.25 62.50 71.25 67.50 68.75 77.50 77.50 68.75 87.50 80.00 68.75 81.25 43.75 68.75 75.00 67.50 87.50 87.50 81.25 93.75 87.50 87.50 68.75 62.50 56.25 62.50 37.50 57.50 37.50 37.50 43.75 62.50 50.00 46.25 50.00 31.25 50.00 50.00 56.25 47.50
6.25 18.75 31.25 18.75 21.25 12.50 6.25 18.75 6.25 6.25 10.00 18.75 18.75 0.00 18.75 12.50 13.75 12.50 0.00 6.25 18.75 6.25 8.75 37.50 12.50 18.75 12.50 12.50 18.75 6.25 12.50 18.75 18.75 18.75 15.00 25.00 12.50 18.75 12.50 18.75 17.50

LMS (-), RLS (-), LPC (-) with attempts 1 2 3 4 5 Ave; Gaussian (4), Gaussian (16), Gaussian (64), Gaussian (256) with attempts 1 2 3 4 Ave:
56.25 25.00 43.75 18.75 56.25 43.75 50.00 50.00 12.50 50.00 50.00 25.00 43.75 12.50 56.25 18.75 51.25 48.75 31.25 50.00 50.00 62.50 47.50 18.75 12.50 50.00 50.00 6.25 56.25 43.75 31.25 68.75 12.50 16.25 50.00 50.00 50.00 50.00 18.75 61.54 38.46 15.38 18.75 50.00 50.00 43.75 56.25 31.25 37.50 62.50 12.50 19.33 48.56 61.34 35.50 25.00 64.50 31.25 18.75 68.75 40.00 25.00 60.00 25.00 40.00 60.00 36.69 63.31 23.44 9.25 31.25 90.75 12.50 9.25 90.75 11.50 25.00 88.50 9.25 12.50 90.75 9.813 20.31 90.19 12.50 5.250 94.75 18.75 5.250 94.75 1.500 98.50 31.25 2.750 97.25 6.25 17.19 3.750 96.25 12.50 11.00 89.00 12.50 6.500 93.50 25.00 8.750 91.25 12.50 11.00 89.00 15.63 6.563 93.44
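The PSS rows above refer to the Power Spectral Subtraction filter; the numbers 1 - 3 index parameter settings whose values are given in the body of the thesis, not here. As a point of reference, the basic operation can be sketched in a few lines of Python. This is a generic single-channel magnitude-squared subtraction with half-wave rectification and the noisy phase retained, not the exact implementation run on the HP9050:

    import numpy as np

    def pss_frame(noisy, noise_psd, alpha=1.0):
        # Power spectral subtraction on one windowed frame.
        # noisy     : time-domain samples of speech + noise
        # noise_psd : estimate of |N(f)|^2 (e.g. averaged over pauses)
        # alpha     : over-subtraction factor (a tunable parameter)
        spec = np.fft.rfft(noisy)
        power = np.abs(spec) ** 2 - alpha * noise_psd
        power = np.maximum(power, 0.0)        # half-wave rectification
        mag = np.sqrt(power)
        phase = np.angle(spec)                # keep the noisy phase
        return np.fft.irfft(mag * np.exp(1j * phase), n=len(noisy))

In practice the frames would be windowed and overlap-added, and the noise power spectrum re-estimated during the pauses between words.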
D.1.2  Results When Trained With Two Templates

Filtering Method (PS)   Attempt   % Recognized   % Mis-Recognized   % Correct Runners Up

No filtering (-), Ave only; Wiener (1), Wiener (2), Wiener (3), PSS (1), PSS (2), PSS (3) with attempts 1 2 3 4 5 Ave:
Ave  71.4  28.60  14.30
43.75 43.75 50.00 37.50 50.00 45.00 81.25 56.25 81.25 68.75 93.75 76.25 6.25 18.75 18.75 12.50 14.06 93.75 75.00 87.50 93.75 68.75 83.75 100.00 87.50 81.25 93.75 75.00 87.50 87.50 87.50 81.25 100.00 81.25 87.50
56.25 56.25 50.00 62.50 50.00 55.00 18.75 43.75 18.75 31.25 6.25 23.75 93.75 81.25 81.25 87.50 85.94 6.25 25.00 12.50 6.25 31.25 16.25 0.00 12.50 18.75 6.25 25.00 12.50 12.50 12.50 18.75 0.00 18.75 12.50
25.00 25.00 31.25 6.25 31.25 23.75 12.50 12.50 12.50 25.00 6.25 13.75 6.25 6.25 12.50 18.75 10.94 6.25 18.75 12.50 0.00 25.00 12.50 0.00 12.50 6.25 6.25 18.75 8.75 6.25 6.25 6.25 0.00 12.50 6.250

LMS (-), RLS (-), LPC (-) with attempts 1 2 3 4 5 Ave; Gaussian (4), Gaussian (16), Gaussian (64), Gaussian (256) with attempts 1 2 3 4 Ave:
81.25 81.25 81.25 81.25 75.00 80.00 87.50 68.75 75.00 81.25 87.50 80.00 75.00 69.23 50.00 62.50 43.75 60.10 78.25 67.75 73.00 73.00 73.00 79.75 74.50 74.50 74.50 75.81 76.75 71.50 61.25 71.50 70.25 42.00 22.75 13.00 22.75 25.13
18.75 18.75 18.75 18.75 25.00 20.00 12.50 31.25 25.00 18.75 12.50 20.00 25.00 30.77 50.00 37.50 56.25 39.90 21.75 32.25 27.00 27.00 27.00 20.25 25.50 25.50 25.50 24.19 23.25 28.50 38.75 28.50 29.75 58.00 77.25 87.00 77.25 74.88
18.75 18.75 6.25 6.25 18.75 13.75 12.50 25.00 12.50 12.50 12.50 15.00 25.00 23.08 31.25 25.00 18.75 24.62 6.25 18.75 12.50 12.50 12.50 6.250 6.250 6.250 6.250 6.250 6.250 6.25 25.00 12.50 12.50 0.00 18.75 25.00 25.00 17.19
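The LMS and RLS rows are the two-input adaptive cancellers, which use a second microphone picking up mainly the machine noise as a reference. A minimal LMS update is sketched below, again in Python and only as a generic illustration; the filter length and step size here are arbitrary placeholders, not the values used in these tests:

    import numpy as np

    def lms_cancel(primary, reference, taps=32, mu=0.01):
        # Two-input LMS noise canceller.
        # primary   : speech + noise at the operator's microphone
        # reference : noise-only signal from a second microphone
        # Returns the error signal e, i.e. the noise-reduced speech.
        w = np.zeros(taps)
        e = np.zeros(len(primary))
        for n in range(taps, len(primary)):
            x = reference[n - taps:n][::-1]   # most recent sample first
            y = w @ x                         # adaptive noise estimate
            e[n] = primary[n] - y             # speech estimate
            w += 2 * mu * e[n] * x            # LMS weight update
        return e

The RLS variant replaces this gradient update with a recursive least-squares one, converging faster at a higher cost per sample.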
D.2  Results For Noise Mode 2

D.2.1  Results When Trained With One Template Only

Filtering Method (PS)   Attempt   % Recognized   % Mis-Recognized   % Correct Runners Up

No filtering (-), Wiener (1), Wiener (2), Wiener (3), PSS (1), PSS (2), PSS (3); attempts 1 2 3 4 5 Ave for each method:
25.00 75.00 0.00 37.50 50.00 43.75 31.25 37.50 18.75 25.00 31.25 18.75 12.50 21.25 50.00 56.25 56.25 50.00 43.75 51.25 6.25 12.50 0.00 6.25 12.50 7.50 50.00 62.50 56.25 50.00 56.25 55.00 46.67 43.75 62.50 50.00 31.25 46.83 37.50 43.75 43.75 56.25 25.00 41.25
62.50 50.00 56.25 68.75 62.50 81.25 75.00 68.75 81.25 87.50 78.75 50.00 43.75 43.75 50.00 56.25 48.75 93.75 87.50 100.00 93.75 87.50 82.50 50.00 37.50 43.75 50.00 43.75 45.00 53.33 56.25 37.50 50.00 68.75 53.12 62.50 56.25 56.25 43.75 75.00 58.75
18.75 25.00 25.00 6.25 15.00 18.75 6.25 6.25 12.50 12.50 11.25 12.50 6.25 18.75 6.25 0.00 8.75 0.00 0.00 6.25 6.25 12.50 5.00 12.50 0.00 12.50 6.25 6.25 7.50 12.50 12.50 0.00 18.75 6.25 10.00 25.00 6.25 25.00 12.50 12.50 16.25

LMS (-), RLS (-), LPC (-) with attempts 1 2 3 4 5 Ave; Gaussian (4), Gaussian (16), Gaussian (64), Gaussian (256) with attempts 1 2 3 4 Ave:
6.25 43.75 56.25 12.50 56.25 43.75 18.75 37.50 62.50 25.00 56.25 43.75 12.50 62.50 37.50 15.00 53.75 46.25 12.50 37.50 62.50 68.75 18.75 31.25 32.25 0.00 68.75 12.50 56.25 43.75 12.50 25.00 75.00 11.25 43.75 56.25 12.50 50.00 50.00 25.00 50.00 50.00 43.75 25.00 56.25 18.75 50.00 50.00 68.75 12.50 31.25 52.50 18.75 47.50 25.00 64.50 35.50 18.75 69.00 31.00 25.00 40.00 60.00 25.00 40.00 60.00 36.63 63.38 23.44 90.75 31.25 9.25 12.50 9.25 90.75 25.00 11.50 88.50 12.50 9.25 90.75 9.813 90.19 20.31 12.50 94.75 5.25 94.75 18.75 5.25 31.25 98.50 1.50 97.25 6.25 2.75 17.19 3.688 96.31 12.50 89.00 11.00 12.50 6.50 93.50 25.00 8.75 91.25 12.50 10.75 89.25 15.63 9.250 90.75

D.2.2  Results When Trained With Two Templates

Filtering Method (PS)   Attempt   % Recognized   % Mis-Recognized   % Correct Runners Up

No filtering (-), Wiener (1), Wiener (2), Wiener (3), PSS (1), PSS (2), PSS (3); attempts 1 2 3 4 5 Ave for each method:
6.25 93.75 6.25 75.00 62.50 68.75 50.00 70.00 81.25 62.50 62.50 62.50 68.75 67.50 93.75 87.50 75.00 81.25 62.50 80.00 93.75 25.00 18.75 18.75 31.25 37.50 93.75 87.50 75.00 62.50 68.75 77.50 100.00 87.50 81.25 75.00 62.50 81.25 87.50 87.50 75.00 68.75 50.00 73.75
25.00 37.50 31.25 50.00 30.00 18.75 37.50 37.50 37.50 31.25 32.50 6.25 12.50 25.00 18.75 37.50 20.00 6.25 75.00 81.25 81.25 68.75 72.50 6.25 12.50 25.00 37.50 31.25 22.50 0.00 12.50 18.75 25.00 37.50 18.75 12.50 12.50 25.00 31.25 50.00 26.25
12.50 25.00 18.75 31.25 18.75 12.50 12.50 25.00 31.25 12.50 18.75 6.25 6.25 18.75 18.75 25.00 15.00 0.00 6.25 6.25 12.50 12.50 7.50 6.25 12.50 18.75 25.00 25.00 17.50 0.00 12.50 6.25 6.25 12.50 7.50 12.50 12.50 12.50 18.75 18.75 15.00

LMS (-), RLS (-), LPC (-) with attempts 1 2 3 4 5 Ave; Gaussian (4), Gaussian (16), Gaussian (64), Gaussian (256) with attempts 1 2 3 4 Ave:
6.25 93.75 6.25 81.25 18.75 18.75 75.00 25.00 12.50 68.75 31.25 18.75 62.50 37.50 25.00 76.25 23.75 16.25 93.75 6.25 6.25 81.25 18.75 18.75 62.50 37.50 25.00 81.25 18.75 18.75 68.75 31.25 25.00 77.50 22.50 18.75 37.50 62.50 18.75 68.75 31.25 18.75 62.50 31.50 25.00 62.50 37.50 25.00 37.50 62.50 25.00 53.75 46.25 22.50 81.25 18.75 0.00 61.00 39.00 18.75 71.25 28.75 6.25 71.25 28.75 6.25 71.19 28.81 7.813 70.75 29.25 0.00 60.75 39.25 12.50 65.75 34.25 6.25 55.75 44.25 25.00 63.25 36.75 10.94 51.25 48.75 12.50 32.75 67.25 12.50 51.25 48.75 18.75 56.00 44.00 6.25 52.19 47.81 12.50 77.50 22.50 6.25 62.00 38.00 18.75 56.75 43.25 25.00 62.00 38.00 18.75 64.56 35.44 17.19
