Speech enhancement during BiPAP use for persons living with ALS Chua, Samuel D. 2012

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata

Download

Media
24-ubc_2013_spring_chua_samuel.pdf [ 823.29kB ]
Metadata
JSON: 24-1.0073383.json
JSON-LD: 24-1.0073383-ld.json
RDF/XML (Pretty): 24-1.0073383-rdf.xml
RDF/JSON: 24-1.0073383-rdf.json
Turtle: 24-1.0073383-turtle.txt
N-Triples: 24-1.0073383-rdf-ntriples.txt
Original Record: 24-1.0073383-source.json
Full Text
24-1.0073383-fulltext.txt
Citation
24-1.0073383.ris

Full Text

SPEECH ENHANCEMENT DURING BiPAP USE FOR PERSONS LIVING WITH ALS

by

SAMUEL D. CHUA

B.A.Sc., University of British Columbia, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in The Faculty of Graduate Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

November 2012

© Samuel D. Chua, 2012

ABSTRACT

Speaking from behind a face mask while on Bilevel Positive Airway Pressure (BiPAP) ventilation is extremely difficult for persons living with Amyotrophic Lateral Sclerosis (ALS). The inability to communicate verbally while on ventilation causes frustration and feelings of isolation from loved ones, and decreases quality of life. A system is proposed, implemented and tested that integrates with the face mask, captures speech, removes ventilator wind noise, and both outputs and recognizes the de-noised speech. The system is evaluated on a dataset with digitally added noise as well as on recordings from a single patient with ALS. Automated machine recognition of the words is then performed and the results analyzed. A subjective listening test is also conducted, in which individuals listen to the noisy and filtered speech samples, and those results are analyzed as well. Although intelligibility does not seem to improve for human listeners, there appears to be some improvement in machine recognition scores. In addition, feedback from the ALS community reports an improvement in quality of life simply because patients are able to use their own voice and be heard by loved ones.

PREFACE

Ethics approval (certificate H10-01703) for the project "Speech Enhancement During BiPAP Use for Persons Living with ALS" was obtained through the Clinical Research Ethics Board.

TABLE OF CONTENTS

ABSTRACT
PREFACE
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ACKNOWLEDGEMENTS
DEDICATION
1 INTRODUCTION
  1.1 ALS and BiPAP
  1.2 The Effects of ALS on Speech and Quality of Life
  1.3 Research Goals
  1.4 Organization of the Thesis
  1.5 Contributions of the Thesis
2 BACKGROUND
  2.1 Speech: Our Preferred Method of Communication
  2.2 Speech Dysarthria
  2.3 Speech in Patients with ALS
  2.4 Measuring Speech Intelligibility
  2.5 Increasing Speech Intelligibility
  2.6 Three Difficulties in Increasing Speech Intelligibility
3 RELATED WORK – HISTORY AND PRESENT
  3.1 The Capture Problem
  3.2 The Noise Problem
  3.3 Automated Speech Recognition
  3.4 Automated Speech Recognition in Persons with ALS
4 AUTOMATIC SPEECH RECOGNITION AND ENHANCEMENT SYSTEM
  4.1 System Overview and Setup
  4.2 Microphone Selection
  4.3 Calibration and Positioning
  4.4 Microphone Powering Circuit
  4.5 Spectral Subtraction
  4.6 Speech Extraction
  4.7 Mel Frequency Cepstral Coefficients
  4.8 Dynamic Time Warping
5 USER INTERFACE AND SYSTEM USAGE
  5.1 User Interface Description
  5.2 System Usage
  5.3 Initial Training
6 EXPERIMENTS
  6.1 Experimental Setup
    6.1.1 Goal
    6.1.2 Setup
    6.1.3 Hypothesis
  6.2 Experimental Results
    6.2.1 Digital Addition of Noise to Nemours Subject BB
    6.2.2 Person Living with ALS – RG
  6.3 Phoneme Analysis
  6.4 Discussion
    6.4.1 Summary of Phoneme Analysis
    6.4.2 Summary of ASRES Results
    6.4.3 Effectiveness of ASRES
    6.4.4 Feedback from ALS Community
  6.5 Validity
7 CONCLUSIONS AND FUTURE WORK
  7.1 Research Goals Summary
  7.2 Contributions of this Work
  7.3 Strengths and Limitations
    7.3.1 Strengths
    7.3.2 Weaknesses
  7.4 Potential Applications
  7.5 Future Work
  7.6 Conclusion
BIBLIOGRAPHY
APPENDICES
  Appendix A: Procedure for Collecting Speech Samples of a PALS
  Appendix B: Phoneme Analysis Data Sheets

LIST OF TABLES

Table 1 - Noisy DTW of 5 Words by BB
Table 2 - Post-Filtered DTW of 5 Words by BB
Table 3 - Noisy DTW of 5 Words by BB with +3dB Noise
Table 4 - Post-Filtered DTW of 5 Words by BB with +3dB Noise
Table 5 - Noisy DTW of 5 Words by BB with +7dB Noise
Table 6 - Post-Filtered DTW of 5 Words by BB with +7dB Noise
Table 7 - Noisy DTW of 5 Words by BB with +13dB Noise
Table 8 - Post-Filtered DTW of 5 Words by BB with +13dB Noise
Table 9 - Noisy DTW of 5 Words by BB with +13dB Noise and Alternate Noise
Table 10 - Post-Filtered DTW of 5 Words by BB with +13dB Noise and Alternate Noise
Table 11 - DTW of 6 Words by RG
Table 12 - Post-Filtered DTW of 6 Words by RG
Table 13 - Real-Time Filtered DTW of 4 Words by RG
Table 14 - Noisy DTW of 5 Words by RG
Table 15 - Post-Filtered DTW of 5 Words by RG
Table 16 - Real-Time Filtered DTW of 5 Words by RG
Table 17 - DTW of 3 Phrases by RG
Table 18 - Real-Time Filtered DTW of 3 Phrases by RG
Table 19 - Control 5 Sentences and Phoneme Divisions
Table 20 - An Example Phoneme Analysis Trial Explained
Table 21 - Percentage of Correctly Identified Phonemes
Table 22 - Phoneme Analysis of MB
Table 23 - Summary of Phoneme Analysis
Table 24 - Summary of BB Datasets
Table 25 - Summary of RG Datasets
Table 26 - Phoneme Analysis of RS
Table 27 - Phoneme Analysis of EC
Table 28 - Phoneme Analysis of JW

LIST OF FIGURES

Figure 1 - Cross-Sectional View of Kang's Microphone Array
Figure 2 - System Block Diagram
Figure 3 - Panasonic Noise Cancelling Microphone Cartridge (WM-55D103)
Figure 4 - Frequency Response
Figure 5 - Airflow and Optimal Microphone Placement
Figure 6 - Electret Microphone Circuit
Figure 7 - Normalized WAV
Figure 8 - Original and Truncated WAV
Figure 9 - Distance Map of Different Words
Figure 10 - Distance Map of Two Identical Words
Figure 11 - DTW Minimum Cost Path Equation
Figure 12 - DTW Scoring for the First Row
Figure 13 - DTW Scoring for the Second Row
Figure 14 - DTW Scoring for the Entire Grid
Figure 15 - Minimum Cost Path for Two Different Words
Figure 16 - Path with a Score of Zero for Identical Words
Figure 17 - Cost Map of a Word Spoken Slowly
Figure 18 - Path with a Zero Score for a Word Spoken Slowly
Figure 19 - Euclidean Distance in 3-D
Figure 20 - ASR Tool
Figure 21 - SnR vs. Recognition Rate of BB for Noisy Signals Subjected to Post-Filtering
Figure 22 - SnR vs. Recognition Rate of RG for Noisy, Post-Filtered and Real-Time Filtering

ABBREVIATIONS

AWG    American Wire Gauge
ALS    Amyotrophic Lateral Sclerosis
ASRES  Automated Speech Recognition and Enhancement System
BiPAP  Bi-level Positive Airway Pressure
DTW    Dynamic Time Warping
HMM    Hidden Markov Model
MFCC   Mel Frequency Cepstral Coefficients
MMSE   Minimum Mean Squared Error
LPC    Linear Predictive Coding
PALS   Person living with Amyotrophic Lateral Sclerosis
SnR    Signal-to-Noise Ratio

ACKNOWLEDGEMENTS

To my supervisor, Philippe, I am grateful for your continued support throughout the duration of this project. If it were not for your patience and guidance, this project would never have come to fruition. To the members of PROP BC and the ALS Society of BC, thank you for your willingness to support this project and to offer your feedback. And finally, to my dear wife, Esther, words simply cannot express my gratitude for your support and sacrifices along the way.

DEDICATION

Dedicated to those who have fought ALS courageously, and to my God who made me and strengthened me for this work.
1 INTRODUCTION

1.1 ALS and BiPAP

Amyotrophic lateral sclerosis (ALS) is a progressive neurological disease that destroys the nerve cells associated with voluntary muscle control [1]. Although the initial symptoms of the disease vary from person to person, as the disease progresses all persons with ALS eventually begin to lose their mobility and ability to speak, and have trouble breathing due to the weakening of the respiratory muscles. At some point, assisted breathing in the form of mechanical ventilation is required to ease the strain on the weakened muscles. Usage may initially be nocturnal, followed by increasing daytime usage as the disease progresses. Eventually, ventilation is required on a full-time basis, when the respiratory muscles are no longer able to maintain appropriate oxygen and carbon dioxide levels [1].

One assisted breathing method common in North America is the Bilevel Positive Airway Pressure (BiPAP) breathing apparatus. Unlike a mechanical ventilator, it does not replace the normal breathing mechanism; rather, it allows patients with neuromuscular diseases such as ALS to breathe normally while reducing the amount of effort required of the patient. This is accomplished by fitting a mask to the patient's face and applying a higher positive air pressure upon inhalation and a lower pressure upon exhalation. Although BiPAP cannot slow the progression of the disease, it has been shown to improve patients' quality of life [2].

One major problem caused by BiPAP use is its interference with speech, caused by the muffling of vocalization by the BiPAP mask and the associated airflow noise. For patients using BiPAP several hours a day, this problem greatly limits their ability to communicate verbally.

1.2 The Effects of ALS on Speech and Quality of Life

Mixed dysarthria associated with ALS involves imprecise consonants, hypernasality, slowed speech rate, harsh vocal quality, breathiness, slurred speech and low pitch [3].
Acoustic studies show differences in vowel duration, fundamental frequency and vowel space, with some variation between individuals [4].

The effects of dysarthria on speech alone have a considerable impact on a patient's quality of life. The inability to communicate effectively by voice leads to feelings of isolation, frustration, anxiety, loss of control and increased sadness: isolation from the reduced amount of communication, frustration from not being understood, fear and anxiety from failed communication attempts, loss of control as opinions are ignored or misunderstood, and sadness due to the isolation and frustration experienced by both caregivers and patient [5]. In addition, BiPAP users who wear face masks feel even more distant and isolated from their friends and primary caregivers. The combined effect of the reduced ability to speak and the BiPAP face mask considerably reduces the quality of life of persons living with ALS.

1.3 Research Goals

The primary goal of this thesis is to develop a prototype Automatic Speech Recognition and Enhancement System (ASRES) that will allow Persons Living with ALS (PALS) to communicate clearly with their voices while on BiPAP ventilation. This goal can be separated into three sub-goals:

• Identify the problems associated with capturing and filtering speech by PALS who are on BiPAP ventilation
• Design and implement a working prototype that PALS can use
• Validate the effectiveness of the system by examining the recognition rate of ASRES when used with PALS, and also by objectively measuring increases in intelligibility through listening tests

1.4 Organization of the Thesis

This thesis consists of seven chapters. Chapter 1 introduces the problem, while Chapters 2 and 3 discuss its background as well as related work.
Chapter 4 presents the proposed system, including an overview of the theory and the details of the implementation. Chapter 5 covers the user interface and describes the usage of the system, including training. Chapter 6 presents the data obtained from the different experiments as well as a summary of that data. Lastly, Chapter 7 presents the conclusions and suggestions for future work.

1.5 Contributions of the Thesis

The contributions of this thesis include:

• A discussion of the unique problem that PALS on BiPAP ventilation face with regard to communicating using their voices
• An implementation and analysis of ASRES that fits the needs of PALS on BiPAP
• Validation of the effectiveness of the system through experiments with noise digitally added to samples of dysarthric speech, as well as with real subject data from a PALS
• A final analysis and evaluation of the prototype

2 BACKGROUND

The purpose of this chapter is to clearly articulate the communication problems caused by the face masks of BiPAP users and why these problems reduce the quality of life of PALS.

2.1 Speech: Our Preferred Method of Communication

Speech is the primary method of communication between people. Although it is a common means of communicating our intent, many factors come into play in producing highly intelligible speech. For example, pitch, tone and inflection all contribute to the semantic meaning of a phrase. A subtle variation, such as a slight rise in tone at the end of a phrase, can change a statement to be recorded into a question to be answered. Accents and variations in localized pronunciation also pose a challenge to both human and machine recognition. The ability to communicate is simply part of our nature, and surveys and studies have shown that there is a value in person-to-person verbal communication in patient care that cannot be replaced by simply giving written instructions [6].
A study of laryngectomees (persons who have had their larynx removed due to illness) shows that their quality of life improves after the restoration of verbal communication through either a voice prosthesis or a tracheo-oesophageal puncture [7]. Since the loss of the ability to speak is considered a loss in quality of life, any improvements or enhancements we can make to restore the intelligibility of speech will help improve an individual's quality of life.

2.2 Speech Dysarthria

Speech dysarthria refers to a group of motor speech disorders that result from either central or peripheral nervous system damage [8]. The disruption to the control of the muscles used in producing speech can result in differing levels of intelligibility. Some patients with mild dysarthria, when subjected to a Frenchay Dysarthria test, can produce scores good enough to pass as normal speakers. The Nemours database [9] contains sound samples of 11 different male speakers with varying degrees of dysarthria. Some of the speakers have a greater than 80% intelligibility rate, while others score 60% or lower, to the point of being virtually unintelligible. Although the degree of dysarthria varies from person to person, imprecise articulation [10] is characteristic of all dysarthric speakers. Research has shown that for dysarthric speakers, vowels are easy to produce whereas consonants are difficult to enunciate. The speech of patients with dysarthria is often characterized as being either very nasal or distorted.
2.3 Speech in Patients with ALS

Persons living with ALS have a kind of mixed dysarthria characterized by, among other things, defective articulation, slow speech, and imprecise consonant and vowel formation. Documentation of their speech is not extensive, but a few studies have examined their speech rate, vowel space and variance in speech intelligibility [11]. Although their symptoms are similar to those of other individuals with dysarthria, they face a unique problem: unlike other conditions that affect speech, such as cerebral palsy and multiple sclerosis, ALS causes the patient's ability to speak to degenerate over time. This poses a distinct challenge when developing solutions to improve the quality of speech for an individual with ALS, as the specifics of their dysarthria degrade over time, rendering a potentially helpful system less effective or ineffective altogether.

2.4 Measuring Speech Intelligibility

Intelligibility can be defined as "how well a speaker's acoustic signal can be accurately recovered by a listener" [12]. Although the quality of a speech signal affects comprehension, it is important to note that many other nonverbal factors are involved in listener comprehension, such as the length of a message, its predictability, context, the relationship to the listener, and facial cues. The measurement of intelligibility is no simple task either, as it can be done in multiple ways. One way is orthographic transcription, in which a listener hears a speech sample and then attempts to reproduce in writing what they heard. A percentage score is obtained by dividing the number of words correctly identified by the total number of words. Although this objectively measures the number of words correctly perceived, it is important to design the experiment in such a way that the context of the sentence in which the words appear does not heavily influence recognition of the words. A stronger method for testing intelligibility is to ask the listener questions to see how well they were able to understand the speaker's meaning.
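Before turning to comprehension questions, note that the transcription-based percent-correct score just described can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, and matching here is by word counts (order- and case-insensitive), whereas formal studies score aligned transcriptions.

```python
from collections import Counter

def percent_words_correct(reference, transcription):
    """Orthographic transcription score: the number of words correctly
    identified divided by the total number of reference words, as a
    percentage. Words are matched by multiset intersection of counts."""
    ref = Counter(w.lower() for w in reference)
    heard = Counter(w.lower() for w in transcription)
    # A word counts as correct at most as many times as it was spoken.
    correct = sum(min(count, heard[w]) for w, count in ref.items())
    return 100.0 * correct / len(reference)
```

For example, a listener who writes down two of three spoken words correctly scores 66.7%.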
Carefully designed general comprehension questions can be used to determine how well a listener understood the speaker overall, while specific factual questions can help shed light on the intelligibility of particular words.

2.5 Increasing Speech Intelligibility

Many methods have been developed over the years to improve the speech intelligibility of dysarthric patients. They fall into two categories: modifying the speech signal to enhance its actual acoustic characteristics, and, alternatively, manipulating speech-complementary information or nonverbal cues in order to increase a listener's comprehension or perception of the speech. Studies have shown that speech intelligibility can be improved simply by offering these additional sources of information to listeners. Semantic cues, first-letter cues and word-class cues [13] have been shown to improve intelligibility by 10-24% among listeners. Topical cues and key words are communication strategies often employed by PALS [14]. These strategies are currently used by partners of PALS, who give anecdotal evidence that being able to understand one key word can allow a listener to understand the rest of the sentence. Although speech intelligibility can be increased by other, non-speech-related means, these go beyond the scope of this project; we will concentrate on improving speech intelligibility from behind a face mask primarily through speech processing.

2.6 Three Difficulties in Increasing Speech Intelligibility

One obvious difficulty with capturing intelligible speech from a patient wearing a face mask is that, because the mask is placed over the face and forms a complete seal, it is difficult to hear any intelligible speech whatsoever, even with an ear placed very close to the mask itself.
We will refer to this as the Capture Problem in the following section (Section 3, RELATED WORK – HISTORY AND PRESENT). Secondly, patients on BiPAP not only have generally weaker voices, and thus cannot speak as loudly as a typical person, but also struggle to articulate their words with sufficient regularity and precision to be intelligible. This is a significant challenge when trying to consistently capture and process a good speech sample from behind the mask. We will refer to this as the Articulation Problem. Thirdly, since BiPAP machines operate with rushing air, attempts to capture the sound also pick up the "wind noise" within the mask, further decreasing the intelligibility of any captured speech. We will refer to this as the Noise Problem.

As the face mask is an enclosed system, little can be done in terms of opening the mask to allow for sound capture, as this would let air escape and nullify any respiratory benefits. Coupled with the fact that patients have weak voices, capturing sound outside the mask, whether by opening the mask or asking patients to shout, is impractical. Therefore, we must turn our efforts to improving speech intelligibility by capturing sound not externally but from within the mask, and mitigating the effects of the rushing air.

3 RELATED WORK – HISTORY AND PRESENT

At present, there are no published works on improving speech from behind a ventilator face mask and filtering BiPAP-induced wind noise. The present work is among the first of its kind and, as a result, has very little to draw on in terms of specifics in this field. In the previous section, we identified three problems, the Capture Problem, the Articulation Problem and the Noise Problem, that contribute to the decreased intelligibility of speech in PALS on BiPAP.
Although intelligibility could be increased by addressing all three of these, the scope of this project would be enormous if we were to tackle all of them. For example, improvements to the Articulation Problem have been attempted and met with varying degrees of success. Older solutions for persons with ALS involved using speech synthesizers when articulation became too poor or when the patient became nonverbal. Although these systems provided functionality, machine voices are still widely regarded as impersonal, and effort is presently being expended to add emotion, among other things, to make machine voices sound more human [15]. There is research in the area of capturing dysarthric speech and identifying and re-synthesizing the badly pronounced phonemes [16] so that the larger portion of a person's own speech is retained. However, nothing has yet been made available commercially and the technology exists primarily in labs. As this problem is under active research and is a large enough problem in and of itself for a thesis, we will not attempt to improve intelligibility through the re-synthesis of poorly articulated phonemes. We will focus our attention on the Capture Problem and the Noise Problem, and then explore current work in the field of machine recognition of the cleaned speech in order to select a mechanism for validating any claims we might make of increased intelligibility.  3.1 The Capture Problem The first problem that we need to address is how to obtain speech for processing from within a full face mask. If the quality of our signal is poor, then any attempts that we make at improving intelligibility may be hampered not by our algorithms, but simply by the fact that our input is poor to begin with. Therefore, choosing an appropriate solution for capturing sound from within the mask is important.
The majority of speech processing methods focus on processing speech after it has been recorded by a microphone; however, any gain that we can achieve by choosing a specific microphone configuration is advantageous. A survey of recent literature shows that although dual-microphone solutions for noise suppression are becoming more popular, they add complexity in terms of weight, size of the array, power consumption and more complex processing [17]. Furthermore, the advantage of dual-microphone noise reduction techniques seems to be most pronounced when filtering non-stationary noise. As we are dealing with the removal of stationary or very slowly varying noise, the two-microphone solution is less appealing: though it would offer a benefit, its increased computational complexity and additional cost in terms of space within a face mask encourage the use of a single microphone. Very little work seems to have been done on the problem of audio capture from within a full face mask. A large part of the reason is that there is very little need for a typical BiPAP user to speak from behind a mask while the ventilator is running. For those who are hospitalized due to acute respiratory failure, speech is not an option, and therefore the need for verbal communication is nil. For persons who suffer from sleep apnea and use BiPAP to aid them in getting a good night's rest, there is no reason to want the ability to communicate while sleeping. In the event that someone with sleep apnea or some other respiratory ailment who uses BiPAP awakens or needs to communicate with their voice, it is a simple matter for them to switch off the machine, loosen the mask slightly, communicate, and then return to sleep. This, however, is not the case for PALS, as their loss of motor coordination can make the task of loosening or refastening the Velcro straps of a full face mask extremely difficult.
Therefore, any solution allowing them to speak without having to fumble with a mask is helpful. The main category of people who use BiPAP and are not unconscious, suffering from acute respiratory failure or asleep is persons living with ALS. As ALS is a progressive disease, the window in which an individual is on BiPAP and also sufficiently verbal to communicate is narrow and does not last indefinitely. Coupled with the fact that the life expectancy of persons living with ALS is normally 2 to 5 years, and the fact that the disease affects only about 6 to 8 people per 100,000, with approximately 2 new cases per 100,000 diagnosed each year [18], compared with cancer, which will see 186,000 new cases this year alone [19], it is understandable why very little attention and research has been put into this particular problem of allowing a person living with ALS to speak while on BiPAP. The closest published work relating to the problem of capturing speech from behind a mask comes from military research on fighter pilot oxygen masks. A work by Kang featuring a four-microphone array [20] attempts to perform noise reduction within a mask, not by addressing ambient noise, but rather by focusing on the problems associated with reverberations within the face mask. The system constructed by Kang makes use of four collinearly spaced microphones with a sound duct.  Figure 1 - Cross-Sectional View of Kang's Microphone Array  This design allows reverberations in the mask to be removed by adding and subtracting the individual microphone outputs. An absorption material such as wool, which absorbs sound at 4 kHz, helps to further reduce the reverberation effects. Kang reports that this array works well at restoring lost high-frequency components and minimizing the reverberation effect. Although this setup is good, size is a factor to consider when working with BiPAP masks, as they are smaller than fighter pilot masks.
In addition, test recordings made within BiPAP masks do not sound extremely muffled, as was the case with the oxygen masks Kang used in his tests. It is possible that the material of the BiPAP full face mask, together with its CO2 exchange holes, reduces the reverberation effect; fighter pilot masks, by contrast, are airtight and perform their oxygen and carbon dioxide exchange through a single-hose system. There are certainly differences between the two kinds of masks; however, the extent to which fighter pilot masks differ acoustically from BiPAP full face masks has not been experimentally verified. Although Kang reports success with his microphone array, which performs some noise reduction in addition to its reverberation reduction, it achieves this not only because of the array but also because of its proximity to the mouth. In Kang's tests, the array is placed a mere 1/4 inch from the mouth, a distance that would be difficult to calibrate and maintain for PALS, and that might also interfere with their breathing or their lips as they struggle to articulate sounds. Although Kang's work is the closest to our problem, it is not by itself a solution.  3.2 The Noise Problem The problem of removing stationary noise from a signal is not a new one and has been explored and reviewed rather extensively [21]. In fact, more recent papers have begun to attack problems such as filtering non-stationary wind noise with a single microphone [22]. Work on filtering undesired sound in noisy environments such as helicopters [23], oxygen masks [20] and vehicles [24] has been published and solutions proposed. These solutions are helpful to us in that they are geared towards solving a specific ambient noise problem, such as the noise from helicopter rotors or vehicle engines. For the most part, this type of noise is unvarying and can be classified as stationary noise.
If the noise from the airflow in the mask can be filtered using a similar method and a clean speech signal captured and output, this would be of immense benefit to the quality of life of patients, as they would retain the ability to communicate with others using their own voices while on ventilation. Techniques to attenuate the noise include Wiener filtering, log-MMSE (minimum mean-square error) estimation and spectral subtraction. Although spectral subtraction was first developed by Boll in 1979 [23] and improved upon by Ephraim and Malah with the elimination of the musical noise phenomenon [25], it continues to have traction in the academic world. Variations have been proposed for non-stationary noise [26], as well as modifications that take advantage of human auditory characteristics [27]. As it is simple to implement, it remains a popular choice for performing noise reduction. The other aforementioned techniques offer similar performance in terms of their ability to reduce noise.  3.3 Automated Speech Recognition In order to validate our claims of improving intelligibility through speech processing, it is necessary to explore methods for validation. As discussed in Section 2.4 - Measuring Speech Intelligibility, increased human recognition of words or sentences is an indication that intelligibility has increased. We need not, however, limit ourselves to human recognition. If a computer recognition system were able to match words against some library of words, and recognition increased as a result of noise filtering, then we would have another objective method for measuring the improvement of the system. Therefore, it is also necessary to explore automated speech recognition methods to use for validation purposes. Automated speech recognition is a field under heavy research, as machines able to understand the subtle nuances of human speech would revolutionize the way we interface with machines.
Work in this field is varied and has been very successful over the last few decades. The first generation of speech recognizers focused primarily on phonemes and was very limited in its ability to recognize commands. This was followed by a second generation of speech recognizers that made use of linear predictive coding (LPC) and dynamic time warping (DTW) [28]. In the 1980s, Hidden Markov Models (HMMs) and statistical analysis became the de facto standard for automated recognition of continuous speech, and also allowed for a major increase in the vocabulary sizes of new systems. Although Dynamic Time Warping is older and has since been replaced by Hidden Markov Models for the bulk of speech recognition, it is still useful for working with small datasets and isolated words. The bulk of research today focuses on optimizing DTW, finding new ways to apply it to larger datasets [29], or using it to enhance HMMs [30].
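To make the DTW approach concrete, the sketch below implements the classic DTW recurrence used for isolated-word template matching. It is a minimal illustration on 1-D sequences with an absolute-difference cost; a real recognizer, including the one described later in this thesis, would instead compare sequences of MFCC vectors with a vector distance. The function name is illustrative, not taken from any published system.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences.

    Fills the standard (n+1) x (m+1) cost table where each cell takes
    the local cost plus the cheapest of the three allowed predecessor
    steps (match, insertion, deletion)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

Because the warping path can absorb duplicated samples at zero cost, a time-stretched version of a template scores far lower than an unrelated sequence, which is exactly the property that makes DTW attractive for speakers whose word durations vary from utterance to utterance.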
Studies performed on dysarthric speakers often exclude patients with neurodegenerative conditions, as research projects spanning several years cannot continuously obtain consistent data samples while the patient's disease progresses. Projects like STARDUST [8], which examined isolated word recognition for dysarthric speakers, specifically excluded these types of patients and focused on patients with a more stable disease. The problem with training classifiers for these patients is that, due to the rapid degeneration of their speech, possibly within months, retraining of a speech recognition system becomes necessary. This is extremely time consuming and tedious for PALS, who must then re-record samples for a database of words that has expanded with use. ASR studies in which PALS tested new speech recognition software have shown that during the usually repetitive recording sessions, the PALS grow tired and are no longer able to articulate their words clearly. This implies that systems with low training and retraining requirements are in PALS' best interests. Currently, there are no products or open source initiatives available to the public that are capable of performing automatic speech recognition of dysarthric speakers, let alone persons living with ALS. Efforts have been made to improve the intelligibility of dysarthric speech by replacing sections of speech with re-synthesized speech [31]; however, that study is careful to explain that its methods would only work for a specific subgroup of dysarthric speakers. As persons with ALS at different stages of disease progression will have wildly different speech characteristics, it is impossible to use a one-size-fits-all method for automatically detecting and re-synthesizing speech.
Although this method of improving intelligibility seems promising, constructing such a system and making it operate in real time would prove challenging and goes beyond the scope of this project.  4  AUTOMATIC SPEECH RECOGNITION AND ENHANCEMENT SYSTEM This section explains a prototype system designed to address the Capture and Noise Problems detailed in the previous section. The proposed solution, ASRES, is an automated system that captures and filters noisy speech from behind a face mask, outputs the processed speech, and performs recognition. The details of the system's implementation and assumptions are explained in this chapter. Figure 2 - System Block Diagram shows the setup of the system and how a speech signal is passed to the different subsystems.  Figure 2 - System Block Diagram  4.1 System Overview and Setup A person living with ALS wearing a face mask is the start point of the diagram. A small electret microphone (WM-61A), with two 30 AWG wires attached to its leads, is inserted into the patient's face mask via the small ventilation holes and hung from the top of the mask. It is important that the microphone be placed out of the direct path of the air that flows through the flexible hose connecting the mask to the BiPAP, in order to maximize the Signal-to-Noise Ratio (SnR) of the speech. The microphone must also be adjusted so that it is not pressed against the nasal bridge of patients with larger noses, and is not in line with the patient's nostrils, as deep inhalation and exhalation can create large amounts of noise that further reduce the SnR. Once the microphone is properly inserted, the patient attaches the full face mask to their BiPAP ventilator. The full face masks tested with this setup are the Resmed Ultra Mirage Full Face Mask and the Fisher & Paykel FlexiFit 432.
It is conceivable that the system would work well with any full face mask that makes a complete seal over the patient's face and has CO2 ventilation holes no smaller in diameter than the 30 AWG wire (approximately 0.255mm) used with the microphone. As the device makes use of existing CO2 ventilation holes, it is important to consider whether the insertion of a microphone would impair normal operation of the mask. In a Resmed Ultra Mirage Full Face Mask, CO2 is vented through 6 holes that are approximately 1.8mm in diameter, so the total surface area through which CO2 can pass is 15.268 mm². As each 30 AWG wire has a cross-sectional area of only 5.107×10⁻² mm², the total area obstructed by the two leads is 0.669% of the vent area. This means that the mask's ability to vent CO2 still operates at approximately 99.3%, which was not deemed a risk when reviewed by the UBC Clinical Research Ethics Board. Once the face mask is securely in place, two alligator clips are fastened to the exposed leads, connecting the microphone in the patient's mask to the rest of the system. The leads pass through a small circuit that supplies power to the electret microphone via the 3.5mm jack of the TMS320C6713 DSP board. After the noise-corrupted speech signal is processed by the filtering software on the board, the filtered speech signal is split and sent to two different places. It is sent to a set of speakers that output the sound so that a listener in the room can hear the patient's voice from behind the mask in real time. It is also sent, via a 3.5mm aux cable connected to the microphone jack, to a laptop running the ASRES software. A laptop is used as it avoids the cost and difficulty of developing a GUI in embedded software for this prototype.
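The vent-obstruction arithmetic above can be double-checked in a few lines of Python. The 0.255mm figure used here is the standard bare diameter of 30 AWG wire, which is consistent with the per-wire cross-sectional area quoted in the text:

```python
import math

# Six 1.8 mm CO2 vent holes on the Resmed Ultra Mirage Full Face Mask
hole_area = 6 * math.pi * (1.8 / 2) ** 2       # total vent area, mm^2

# Two 30 AWG leads (bare diameter ~0.255 mm) pass through the vents
wire_area = 2 * math.pi * (0.255 / 2) ** 2     # area blocked by both leads

blocked_pct = 100 * wire_area / hole_area
print(round(hole_area, 3))    # total vent area, ~15.268 mm^2
print(round(blocked_pct, 3))  # obstructed fraction, ~0.669 %
```

This reproduces the 15.268 mm² vent area, the 0.669% obstruction, and hence the approximately 99.3% remaining vent capacity cited to the ethics board.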
Once the system is properly set up, the patient can begin speaking during normal operation of the BiPAP.  4.2 Microphone Selection Microphone selection is important for this application, as the environment from which speech is to be extracted is quite noisy. As the results of the digital signal processing and the speech recognition software depend on the quality of the input signal, it is beneficial to maximize the amount of speech captured and minimize the noise before the signal is subjected to digital signal processing. Two ways to accomplish this at the electret microphone stage are to choose the best electret microphone or microphone array, and to optimize the placement of the microphone or microphones. For this project, a Panasonic Noise Cancelling Back Electret Condenser Microphone Cartridge (WM-55D103) is used, chosen for being small and non-intrusive.  Figure 3 - Panasonic Noise Cancelling Microphone Cartridge (WM-55D103)  The microphone has a frequency range of 100-10,000Hz, which covers the range of human speech, approximately 200-8000Hz, quite well.  Figure 4 - Frequency Response  The frequency response of this microphone is fairly flat for sources close to the microphone, which is useful since there is little variance in response across the different frequencies of speech. The noise-cancelling microphone has a black porous covering over the sensitive electret components that protects the microphone from saliva, since it is placed fairly close to the speaker's mouth. As dysarthric speakers and PALS often have difficulty controlling their speech muscles and saliva, this covering is useful. In addition, the covering helps to reduce some of the wind noise that passes over the microphone.
As a noise-cancelling electret microphone attenuates signals that originate farther away from the microphone, it is possible, through placement of the microphone in the mask, to increase the SnR simply by ensuring that the microphone is sufficiently close to the mouth and as far from the noise source as possible. Although the full potential of this characteristic is not exploited, since the distance between the noise source and the patient's mouth is only on the order of centimeters, it is still a beneficial quality and makes the noise-cancelling electret a better choice than other types of microphones. An omnidirectional microphone, for instance, has the undesirable quality of picking up sound equally well in all directions and as a result collected more noise in trials. A unidirectional microphone proved slightly less useful than the noise-cancelling electret, though not as poor as the omnidirectional microphone.  4.3 Calibration and Positioning Calibration was done empirically. Although there is perhaps merit in mathematically modelling the mask to achieve maximal SnR, this is very difficult, as different masks have different shapes and sizes and any optimization algorithm would need an accurate model of each one. Furthermore, modelling of the optimal microphone position is complicated by the fact that skin does not have the reflective acoustic properties of other solid surfaces and is in fact absorptive. In addition, variations in patients' nose structures, lips and even facial hair further complicate modelling. Therefore, as it is mathematically and computationally difficult, not to mention different for each individual BiPAP user, mathematical modelling for the purpose of optimizing microphone placement was rejected in favour of deductive and empirical testing.
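The empirical SnR comparisons used throughout this calibration can be reproduced with a simple measurement routine. The sketch below is illustrative only; the function name and the idea of passing separately recorded speech-plus-noise and noise-only segments are assumptions for the example, not a description of the original test setup:

```python
import numpy as np

def snr_db(speech_segment, noise_segment):
    """Estimate SnR as the ratio of average speech power to average
    noise power, expressed in decibels."""
    p_speech = np.mean(np.square(np.asarray(speech_segment, dtype=float)))
    p_noise = np.mean(np.square(np.asarray(noise_segment, dtype=float)))
    return 10.0 * np.log10(p_speech / p_noise)
```

Comparing `snr_db` for each candidate microphone position, with the BiPAP off and on, yields the kind of 30dB versus 3-9dB figures reported below.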
Figure 5 - Airflow and Optimal Microphone Placement  Figure 5, above, shows an arrow indicating the initial flow of air from the BiPAP as it enters the mask. Empirically, it was determined that placing a microphone anywhere along that path, where the wind struck the microphone directly, resulted in very poor SnR values of 0dB and below. The two rectangles show the areas in the mask that are out of the direct path of the wind and are candidates for placement of the microphone. Although the bottom rectangle in Figure 5 is out of the direct path of the wind, it is still sufficiently close that empirical tests showed very little improvement in the SnR. Therefore, the only viable placement for the microphone is the top rectangle. Further empirical testing of microphones placed within this rectangle and away from the nostrils shows an SnR of up to 30dB when the BiPAP is not in operation and approximately 3-9dB when the BiPAP is on. This large variation in the SnR is due to variation in the strength of the speech signal: an average person who is able to speak normally would be at the higher end, whereas a patient with ALS who has low lung capacity might be closer to the lower end.  4.4 Microphone Powering Circuit The diagram below shows the microphone circuit that powers the electret microphone and interfaces it with the TMS320C6713 DSP board used for this project.  Figure 6 - Electret Microphone Circuit  The electret microphone is connected to a 3.5mm TRS connector in a standard configuration. As it is plugged into the DSP board, a 5V bias is supplied to the microphone through a 2.2kΩ resistor to limit the current.  4.5 Spectral Subtraction The speech signal that comes from the microphone suffers from noise caused by air rushing over the microphone and the general hum of the BiPAP.
This noise corrupts the signal and makes it difficult to perform any speech processing unless the noise is either attenuated or removed. If the noise and the speech signal cannot be separated by using multiple microphones, it is necessary to identify the noise from the same source. One of the simplest ways to do this is to perform power spectral subtraction. This is a simple but effective way to estimate how strong the noise is, and it takes advantage of the fact that while a BiPAP is operating, a person is not speaking 100% of the time. The non-speech segments can be used to recalculate the noise estimate, which in our case should not vary much, as the BiPAP is fairly consistent. As what is needed is a sample of the stationary noise, it is possible on startup of the system to capture a 250ms sample of the stationary sound and then use this sample for the spectral subtraction calculations. This algorithm is simple to implement and can run in real time on a DSP board or a computer. The following derivation is largely based on the paper by Boll [23], whose work paved the way for subsequent spectral-subtraction-based works. Let

y(m) = x(m) + n(m)

where y(m) is the corrupted signal, x(m) is the speech signal and n(m) is the uncorrelated, additive noise signal. In the frequency domain, let the same signal be represented as

Y(f) = X(f) + N(f)

where Y(f), X(f) and N(f) are the Fourier transforms of the corrupted, speech and noise signals respectively, with f as a frequency variable. In order to process the original signal, the signal must be divided up into chunks, or windowed. In order to alleviate the effect of discontinuities at the endpoints of each segment, it is necessary to choose an appropriate window. For our application we use a Hanning window:

w(n) = 0.5 (1 − cos(2πn / (N − 1))),  0 ≤ n ≤ N − 1

In the frequency domain, applying a window to a signal is a convolution operation.
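As a quick sanity check, the Hanning window formula above matches numpy's built-in implementation exactly:

```python
import numpy as np

N = 256
n = np.arange(N)
# w(n) = 0.5 * (1 - cos(2*pi*n / (N - 1))), 0 <= n <= N - 1
w = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (N - 1)))

assert w[0] == 0.0 and w[-1] < 1e-12   # tapers to zero at both ends
assert np.allclose(w, np.hanning(N))   # same definition as np.hanning
```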
Y_w(f) = W(f) ∗ Y(f) = W(f) ∗ (X(f) + N(f))

so that

Y_w(f) = X_w(f) + N_w(f)

Rearranging terms, we can express the original windowed signal as the noise signal subtracted from the corrupted signal:

X_w(f) = Y_w(f) − N_w(f)

If it were possible to obtain an exact representation of the noise signal, we could completely remove it and restore the original speech signal. Since that is impossible, spectral subtraction is used to reconstruct an approximation of the original signal by subtracting a time-averaged noise spectrum from the corrupted signal. For simplicity, the windowing subscript, w, is dropped:

X̂(f) = Y(f) − N̄(f)

The result is an approximation of the original signal. The accuracy of the reconstructed signal is directly dependent on the accuracy of the measured approximation N̄(f) of the noise spectrum. As it is necessary to calculate the average noise spectrum from a period of non-speech activity, this can be done upon the initial startup of the device with the following formula:

|N̄(f)| = (1/K) Σ_{i=0}^{K−1} |N_i(f)|

In the above equation, the average noise spectrum is calculated by summing the magnitude spectra of the first K frames (indexed 0 to K−1) and then dividing by K. K is determined by the sampling rate and the length of the captured noise sample. In order for this to work, we assume that the first K frames are pure noise with no speech. Averaging the noise spectra over a 250-300ms window provides a fairly good estimate of stationary noise and presents no difficulty to the user. In the event that there is speech during the first quarter of a second, the system can always be reset, or the noise mean recalculated during another interval of silence. After calculating the noise mean, we can create a simple voice activity detector by setting a threshold 2-3dB above the noise mean. If 6-8 consecutive frames of the signal fall below this threshold, we can safely assume that speech has ended and that we are now looking at noise.
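A minimal sketch of this threshold-based voice activity detector might look as follows. The function name, the per-frame dB inputs, and the default parameter values are illustrative assumptions chosen from the 2-3dB margin and 6-8 frame run described above, not a transcription of the original implementation:

```python
def detect_speech_end(frame_db, noise_mean_db, margin_db=3.0, run_len=6):
    """Return the index of the first frame in a run of `run_len`
    consecutive frames whose level falls below (noise mean + margin),
    i.e. the point at which speech is assumed to have ended.
    Returns None if no such run occurs."""
    threshold = noise_mean_db + margin_db
    run = 0
    for i, level in enumerate(frame_db):
        if level < threshold:
            run += 1
            if run >= run_len:
                return i - run_len + 1
        else:
            run = 0
    return None
```

For example, with five loud frames followed by silence, the detector marks frame 5 (the first quiet frame) once six quiet frames have accumulated.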
The spectral error that remains is therefore approximately the difference between the estimated and actual speech signals, which is also roughly the difference between the actual noise in that frame and the noise average that we have calculated:

ε(f) = X̂(f) − X(f) = N(f) − N̄(f)

In order to help reduce the spectral error, a few modifications are made to the reconstructed signal. As the spectral error is the difference between the noise in the frame and the calculated noise average, it follows that the mean of the noise over M frames approaches the calculated noise average:

(1/M) Σ_{i=0}^{M−1} N_i(f) − N̄(f) ≈ 0,  M < K

where K is still the total number of frames over which the noise average is computed. Therefore, performing magnitude averaging of the noisy signal over M frames before subtracting the noise average helps decrease the error:

|Ȳ(f)| = (1/M) Σ_{i=0}^{M−1} |Y_i(f)|

According to Boll [23], this averaging is not a problem as long as the number of frames, M, does not exceed a certain amount. Based on results from his DRT test, as long as the averaging does not exceed 3 half-overlapped windows with a total duration of 38.4ms, intelligibility does not decrease. There is still, of course, the risk that for short, plosive sounds, averaging can cause smearing. However, this risk should be weighed against the benefit of less spectral error overall. Upon completion of magnitude averaging, the next step is to perform half-wave rectification. In the event that, at a particular frequency, subtracting the average noise N̄(f) from |Ȳ(f)| produces a negative value, because the average noise spectrum is greater than the noisy signal, it is necessary to floor these values to 0. The benefit of doing this is that, overall, the noise floor can be reduced by N̄(f). The disadvantage, however, arises when the sum of the noise and speech at that particular frequency is less than N̄(f).
When the output is set to 0, any speech information that was contained in that frequency is lost, and the result is possibly a loss of intelligibility. Once half-wave rectification is complete, speech and noise above the N̄(f) threshold remain. This residual noise will have a value somewhere between zero and a maximum that is measured during the non-speech activity periods. In the case where the noise in the frame at a particular frequency is equal to the average noise spectrum, the amount of residual noise will be zero, or very close to it. The residual noise will consist of frequencies randomly scattered throughout the spectrum that exist for the duration of one window, or approximately 25ms. The result is what is known as "musical tones," as it sounds like a number of tone generators being flipped on and off at the residual noise frequencies. To help alleviate this effect, additional residual noise reduction can be performed. There are three cases to examine after the average noise spectrum is subtracted from the noisy signal. In the first case, if the amplitude of X̂(f), the signal after spectral subtraction, is below the maximum threshold for noise and fluctuates rapidly between adjacent frames, then there is a high probability that this is not speech but residual noise. Therefore, we can replace that particular frequency in the frame with the minimum value found by examining the adjacent frames as well. In the second case, if the value of X̂(f) is below the maximum noise threshold but remains fairly constant between frames, then there is a high probability that this is not noise, but low-energy speech. As the values are fairly similar, taking the minimum again preserves the speech. In the third case, if X̂(f) is greater than the maximum noise threshold, then nothing else needs to be done, as what remains is most likely a speech signal. After this, all that remains is to restore the signal to the time domain.
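The minimum-over-adjacent-frames rule that covers the first two cases can be sketched as follows. This is a simplified illustration; the (frames × bins) array layout and the function name are assumptions made for the example:

```python
import numpy as np

def reduce_residual_noise(mag, max_residual):
    """Boll-style residual noise reduction: wherever the subtracted
    magnitude at a given frequency bin is below the maximum residual
    noise measured during non-speech activity, replace it with the
    minimum over the previous, current, and next frame.

    `mag` is a (frames, bins) array of post-subtraction magnitudes;
    `max_residual` is a (bins,) array of per-bin residual maxima."""
    out = mag.copy()
    for t in range(1, mag.shape[0] - 1):
        below = mag[t] < max_residual
        nbhd_min = np.minimum(np.minimum(mag[t - 1], mag[t]), mag[t + 1])
        out[t, below] = nbhd_min[below]
    return out
```

Bins above the residual maximum (the third case) pass through untouched, while bins below it are pulled down to their local minimum, which suppresses the one-frame tonal bursts that cause the musical-noise effect.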
Restoration of the time-domain signal is achieved by taking the magnitude spectrum estimate |X̂(k)|, combining it with the phase of the original noisy signal, and performing an inverse discrete Fourier transform:

x̂(m) = (1/N) Σ_{k=0}^{N−1} |X̂(k)| e^{jθ_Y(k)} e^{j2πkm/N}

In the above equation, θ_Y(k) is the phase of the noisy signal. The estimated magnitude of the original signal can be recombined with the phase information of the noisy signal because it is assumed that noise distortions lie primarily in the magnitude spectrum and that phase distortion is, for the most part, inaudible. After the signal frames have been restored to the time domain, the overlapping frames are overlap-added in order to reconstruct our approximation of a noise-free signal. This technique of magnitude spectral subtraction works well for stationary and slowly varying noise. In the case of patients on BiPAP, there is very little variation in the noise once the ventilator is in operation. Empirical tests show that the amount of noise depends on the placement of the microphone and, once the microphone is placed, does not vary. Therefore, spectral subtraction is an ideal candidate for this particular problem. Although spectral subtraction is fairly simple and effective, a remaining problem is the residual "musical noise" left behind by isolated patches of energy in the time-frequency domain. Despite best efforts to alleviate this, these artifacts persist. Depending on the accuracy of the noise estimate and the SnR, there can be large fluctuations in the final output. In general, the less noise that needs to be removed and the stronger the speech signal, the fewer artifacts appear in the final output. Ephraim and Malah's method [25] does not suffer from this and could be used as an alternative; we will address this in Chapter 7 - CONCLUSIONS AND FUTURE WORK.
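The processing chain described in this section, noise estimation from a leading noise-only interval, magnitude subtraction, half-wave rectification, and resynthesis with the noisy phase, can be sketched in a few dozen lines. This is an offline illustration with assumed parameter values (256-sample frames, half overlap), not the real-time implementation that runs on the TMS320C6713:

```python
import numpy as np

def spectral_subtract(noisy, fs, noise_ms=250, frame=256):
    """Magnitude spectral subtraction sketch: estimate the average noise
    spectrum from the first `noise_ms` milliseconds of the recording
    (assumed speech-free), subtract it from each half-overlapped
    Hanning-windowed frame, half-wave rectify, and resynthesize with
    the noisy phase by overlap-adding the inverse transforms."""
    hop = frame // 2
    win = np.hanning(frame)
    n_noise = max(1, int(fs * noise_ms / 1000) // hop - 1)

    starts = range(0, len(noisy) - frame, hop)
    spectra = [np.fft.rfft(noisy[i:i + frame] * win) for i in starts]

    # time-averaged noise magnitude from the leading noise-only frames
    noise_mag = np.mean([np.abs(s) for s in spectra[:n_noise]], axis=0)

    out = np.zeros(len(noisy))
    for i, s in zip(starts, spectra):
        mag = np.abs(s) - noise_mag
        mag[mag < 0] = 0.0                       # half-wave rectification
        clean = mag * np.exp(1j * np.angle(s))   # keep the noisy phase
        out[i:i + frame] += np.fft.irfft(clean, frame)
    return out
```

Feeding in a signal whose first quarter second is ventilator-like noise alone leaves the speech portion largely intact while the noise-only regions are strongly attenuated; the residual-noise and magnitude-averaging refinements described above would be layered on top of this core loop.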
4.6 Speech Extraction

This section explains the process of extracting commands and words from the de-noised and reconstructed speech signal. To capture a person's speech, the de-noised speech is output from the DSP into the 3.5mm jack of a laptop where the ASRES software is running. The software listens to the microphone channel and captures the clip of speech. Although this may be done automatically by setting certain sound thresholds, for the purpose of explanation, we will describe the case in which the record button of the software is toggled on and off manually in order to acquire a clip of speech.

Upon capturing a clip, the next step is to normalize the WAV so that we can make an accurate comparison with the others stored in the database. In order to do this, we first iterate through all the samples and examine the absolute value of each sample. If the absolute maximum is less than 1, then we divide the entire signal by this absolute maximum. An example of a resulting WAV is shown in Figure 7.

Figure 7 - Normalized WAV

Major errors in isolated word recognition are the result of inaccurate detection of the beginning and end points of a speech sample. As the output of the DSP is the de-noised signal and we do not have access to the average noise spectrum calculations, we need an algorithm to determine the endpoints of the speech sample. To extract the endpoints, we use an energy-based approach. We begin by segmenting the entire WAV into 30 ms frames. Then, the energy of each frame is calculated according to the following formula:

E_j = \sum_{n} S_n^2

The first two frames are examined with the following formula and a value for the noise at the front of the signal is calculated:

E_F = \begin{cases} (E_1 + E_2)/2 & \text{if } 0.5 \le E_1 / E_2 \le 2 \\ \min(E_1, E_2) & \text{otherwise} \end{cases}

After the noise at the front is computed, the noise at the back end is also calculated using the last two frames.
E_B = \begin{cases} (E_{M-1} + E_M)/2 & \text{if } 0.5 \le E_{M-1} / E_M \le 2 \\ \min(E_{M-1}, E_M) & \text{otherwise} \end{cases}

Finally, the average background noise level is computed from these two values:

E_N = \begin{cases} (E_F + E_B)/2 & \text{if } 0.5 \le E_F / E_B \le 2 \\ \text{rejected} & \text{otherwise} \end{cases}

Rejection occurs on the basis that the value of E_N should be between the two limits and that the noise in the front and end frames should not be drastically different. A rejection can also take place if E_N exceeds an experimentally determined threshold. If the sound sample has very heavy background noise, i.e. the values of E_F and E_B exceed a certain threshold, the sample should also be discarded.

Once we have determined that background noise is not a factor and that the first two frames and the last two frames contain similar residual noise, we can then compute the average power of each frame:

P = \frac{1}{n} \sum_{n} S_n^2

Once the power in a frame exceeds a certain threshold, we can assume that this frame contains speech. To determine the starting frame of speech, we examine the frames sequentially from the first frame until we encounter a frame that exceeds the threshold. This is a simple method of determining the start frame; however, it is susceptible to noise. A better way to determine the start frame is by examining not just the frame, but also the adjacent frames. If two consecutive frames in a set of three exceed the average, then there is a greater likelihood that this represents a speech frame. Once this occurs and the average power of the frame surpasses an experimentally determined threshold, that frame is marked as the starting frame of speech. We repeat this process in reverse, beginning with the last frame, to determine the ending frame. Once these two frames have been determined, we can extract this portion of the WAV as speech. The frames between the designated starting and ending points are then written to a new WAV, which is then converted into its Mel Frequency Cepstral Coefficients for Dynamic Time Warping analysis.
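A minimal sketch of this energy-based endpoint search follows. It assumes 30 ms frames at an 8 kHz sampling rate (240 samples) and an experimentally chosen threshold; the names and the exact form of the two-of-three consecutive-frame test are illustrative, not the ASRES implementation.

```python
import numpy as np

def find_endpoints(samples, frame_len=240, threshold=0.01):
    """Locate speech start/end in a normalized signal by frame energy.

    A frame marks the start of speech when it is loud and two of the three
    frames centred on it exceed the energy threshold; the end point is found
    the same way scanning backwards. Returns (start, end) sample indices,
    or None when no frame qualifies.
    """
    n_frames = len(samples) // frame_len
    energy = np.array([np.sum(samples[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    loud = energy > threshold

    def first_hit(order):
        for i in order:
            window = loud[max(0, i - 1):i + 2]
            if loud[i] and window.sum() >= 2:   # two of three consecutive
                return i
        return None

    start = first_hit(range(n_frames))
    if start is None:
        return None
    end = first_hit(reversed(range(n_frames)))
    return start * frame_len, (end + 1) * frame_len
```

The two-of-three requirement is what gives the method its robustness: a single noisy frame above threshold, with silent neighbours, is not accepted as a speech onset.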
Figure 8 - Original and Truncated WAV

Figure 8 is an example of truncation of a voice clip. The original sample is almost 2.5 seconds in length, while the truncated WAV is only 0.9 seconds in length. If truncation is not performed correctly, or no truncation is performed at all, serious problems arise in the later stages when trying to compare two speech signals using dynamic time warping.

4.7 Mel Frequency Cepstral Coefficients

Although utterances of the same phrase differ drastically in the time domain, they are not so different in the frequency domain, so spectral analysis is an excellent way to take advantage of this fact. In order to quantitatively measure sound samples against each other, we need a way of representing them numerically. Mel Frequency Cepstral Coefficients (MFCCs) provide us with a method for comparing these WAVs to each other. Calculating the MFCCs is a process that can be summarized as follows:

i. Convert to frames and calculate energy
ii. Take the Fourier transform
iii. Take the log of the amplitude spectrum
iv. Perform mel-scaling and smoothing
v. Take the discrete cosine transform

Since taking the Fourier transform of a windowed waveform can cause spectral leakage, the first step is once again to apply a Hanning window to the signal before taking the discrete Fourier transform and calculating the energy spectrum:

X(k) = \sum_{n=0}^{N-1} x(n) \, w(n) \, e^{-j 2\pi k n / N}, \qquad 0 \le k \le N-1

The energy spectrum is given by

|X(k)|^2

The energy E_j is then calculated for each of the triangular mel-scale filters \varphi_j:

E_j = \sum_{k} \varphi_j(k) \, |X(k)|^2

where J is the number of triangular filters used. Finally, the discrete cosine transform is taken of the mel log-amplitudes, and only the first 13 coefficients are kept for our purposes:

c_m = \sum_{j=0}^{J-1} \log(E_j) \cos\!\left( \frac{m (j + 1/2) \pi}{J} \right)

As MFCC calculations were performed using a DLL, we will not go into further detail regarding the calculation of MFCCs. An MFCC vector is calculated for each frame of the input signal and an array of 13-coefficient MFCC vectors is built.
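Since the thesis relied on a DLL for this step, the following is only a generic sketch of the standard MFCC recipe (window, DFT, mel-spaced triangular filter bank, log, DCT). The filter count and the mel formula m = 2595 log10(1 + f/700) are conventional choices for illustration, not necessarily those used by the DLL.

```python
import numpy as np

def mfcc_frame(frame, sample_rate=8000, n_filters=26, n_coeffs=13):
    """Compute MFCCs for one frame (simplified sketch of the usual recipe)."""
    n = len(frame)
    # Hanning window, DFT, energy spectrum
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2
    # Mel-spaced triangular filter edges between 0 Hz and the Nyquist rate
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = mel_inv(np.linspace(0.0, mel(sample_rate / 2), n_filters + 2))
    bins = np.floor((n + 1) * edges / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for j in range(n_filters):
        lo, mid, hi = bins[j], bins[j + 1], bins[j + 2]
        for k in range(lo, mid):
            fbank[j, k] = (k - lo) / max(mid - lo, 1)   # rising slope
        for k in range(mid, hi):
            fbank[j, k] = (hi - k) / max(hi - mid, 1)   # falling slope
    log_energy = np.log(fbank @ spectrum + 1e-10)
    # DCT of the mel log-amplitudes, keeping the first n_coeffs coefficients
    m = np.arange(n_coeffs)[:, None]
    j = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * m * (j + 0.5) / n_filters)
    return dct @ log_energy
```

Applying this to every 25 ms frame of an utterance yields the array of 13-coefficient vectors that the next section feeds into Dynamic Time Warping.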
This array can then be compared against other arrays of MFCC vectors stored in a library.

4.8 Dynamic Time Warping

ASRES makes use of Dynamic Time Warping (DTW) to perform recognition, due to its ease of implementation and its good performance on small datasets. The following is an explanation of Dynamic Time Warping as used by ASRES.

In order to compare a speech sample using its MFCC array, we need to perform a frame-by-frame comparison of the speech sample with each individual command stored in the library. As the signals differ in length, the problem becomes an alignment problem. Let Q and C be two different signals of lengths n and m respectively:

Q = q_1, q_2, \ldots, q_n
C = c_1, c_2, \ldots, c_m

The first thing that needs to be done is to compare the two signals by creating an n-by-m distance matrix.

Y   1   1   1   1   1   1   1
E   1   0   1   1   0   1   1
N   1   1   1   1   1   1   1
R   1   1   1   1   1   0   1
A   1   1   0   1   1   1   1
B   0   1   1   1   1   1   1
    B   E   A   V   E   R   S

Figure 9 - Distance Map of Different Words

As seen in Figure 9, a 7x6 matrix is created to compare the frames in each of the words. In this example, "BARNEY" is stored in the database, while the word "BEAVERS" has been spoken. If the frames at the intersection of a row and a column are the same, we say that this is a match and assign it a distance score of 0. If the frames at the intersection are different, we assign it a positive score, in this case the value 1. If the two words were completely identical, then the distance score along the diagonal would be 0, as shown in Figure 10.

S   1   1   1   1   1   1   0
R   1   1   1   1   1   0   1
E   1   1   1   1   0   1   1
V   1   1   1   0   1   1   1
A   1   1   0   1   1   1   1
E   1   0   1   1   1   1   1
B   0   1   1   1   1   1   1
    B   E   A   V   E   R   S

Figure 10 - Distance Map of Two Identical Words

Once we have the distance map, we then compute the cost to reach the top right-hand corner of the grid by traversing it according to the following recurrence:

D(i,j) = \min\big( D(i-1, j-1),\; D(i-1, j),\; D(i, j-1) \big) + d(i,j)

Figure 11 - DTW Minimum Cost Path Equation

where d(i,j) is the value of the point (i,j) in the distance map. We define a path W to the top-right corner to be a connected set of K elements, with each element designated as w_k:

W = w_1, w_2, \ldots, w_K, \qquad \max(m, n) \le K \le m + n - 1

The shortest path would be a diagonal, hence max(m,n) as the minimum bound, and the longest path would involve traversing both edges, hence m+n-1. Traversal of the grid is subject to three rules:

i. Boundary conditions: w_1 = (1,1) and w_K = (m,n). The path must start at the bottom-left corner and finish at the top-right corner.
ii. Monotonicity: the path cannot go backwards; therefore i_k - i_{k-1} >= 0 and j_k - j_{k-1} >= 0.
iii. Continuity: the path cannot jump and is restricted to adjacent cells.

In order to determine the minimum cost of reaching the top-right corner, we begin by building a path map of the entire grid, starting at the bottom left-hand corner and then filling each subsequent row. Each D(i,j) in the bottom row can be determined by adding the value of the cell to its immediate left to the value of d(i,j) found in the distance map in Figure 9. After summing the 1's on the bottom row, we can determine the total cost D(i,j) to any point on the bottom row, as shown in Figure 12.

Y
E
N
R
A
B   0   1   2   3   4   5   6
    B   E   A   V   E   R   S

Figure 12 - DTW Scoring for the First Row

For the second row, we use the distance map in Figure 9 and apply the formula found in Figure 11 to calculate the next row of the grid, as seen in the following figure.

Y
E
N
R
A   1   1   1   2   3   4   5
B   0   1   2   3   4   5   6
    B   E   A   V   E   R   S

Figure 13 - DTW Scoring for the Second Row

This continues until the entire grid is completely filled with values, as seen in the following figure.
Y   5   4   5   5   4   4   5
E   4   3   4   4   3   4   5
N   3   3   3   3   3   4   4
R   2   2   2   2   3   3   4
A   1   1   1   2   3   4   5
B   0   1   2   3   4   5   6
    B   E   A   V   E   R   S

Figure 14 - DTW Scoring for the Entire Grid

As seen in the above figure, the minimum cost to get from the bottom left-hand corner to the top right-hand corner is 5 in this example. The following figure, Figure 15, shows one of the possible minimum paths that can be taken through the grid to arrive at the top right-hand corner (path cells marked with brackets).

Y   5   4   5   5   4   4  [5]
E   4   3   4   4  [3] [4]  5
N   3   3   3  [3]  3   4   4
R   2   2  [2]  2   3   3   4
A   1  [1]  1   2   3   4   5
B  [0]  1   2   3   4   5   6
    B   E   A   V   E   R   S

Figure 15 - Minimum Cost Path for Two Different Words

Although there are alternative paths that can be taken, following the three rules of grid traversal, the minimum cost remains at 5. In the case where the spoken word is identical to the word stored in the database, a path with a score of zero is produced, as seen in Figure 16.

S   1   1   1   1   1   1  [0]
R   1   1   1   1   1  [0]  1
E   1   1   1   1  [0]  1   1
V   1   1   1  [0]  1   1   1
A   1   1  [0]  1   1   1   1
E   1  [0]  1   1   1   1   1
B  [0]  1   1   1   1   1   1
    B   E   A   V   E   R   S

Figure 16 - Path with a Score of Zero for Identical Words

This case does not occur under real conditions, as it is impossible for the exact same utterance to be repeated more than once; however, if there is a match, the score of the minimum cost path should be fairly low. When the sounds of words are stretched, dynamic time warping provides us with a way of showing that the two are actually identical.

Figure 17 - Cost Map of a Word Spoken Slowly

In Figure 17, the word stored in the database is "BEAVERS"; however, the speaker has enunciated the word slowly, placing heavy emphasis on the vowels. Although the lengths are no longer equal, dynamic time warping allows us to calculate a zero-cost path through the grid, showing that despite the elongated vowels, the command is the same.

Figure 18 - Path with a Zero Score for a Word Spoken Slowly

This path minimization is what makes dynamic time warping extremely useful for comparing samples of speech.
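The grid-filling recurrence above can be reproduced in a few lines. This is an illustrative implementation, not the ASRES code; the distance function is pluggable, and with the 0/1 letter distance it reproduces the "BARNEY" versus "BEAVERS" example, yielding the same minimum cost of 5.

```python
import numpy as np

def dtw_cost(query, reference, dist):
    """Total cost of the best DTW path between two sequences.

    Fills the grid with D(i,j) = min(D(i-1,j-1), D(i-1,j), D(i,j-1)) + d(i,j);
    the boundary, monotonicity and continuity rules are enforced by the
    recurrence itself, and the answer is the value in the final corner.
    """
    m, n = len(query), len(reference)
    D = np.full((m, n), np.inf)
    D[0, 0] = dist(query[0], reference[0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best_prev = min(D[i - 1, j - 1] if i and j else np.inf,
                            D[i - 1, j] if i else np.inf,
                            D[i, j - 1] if j else np.inf)
            D[i, j] = best_prev + dist(query[i], reference[j])
    return D[m - 1, n - 1]

# 0/1 distance for the letter-grid example in the figures above
letter_dist = lambda a, b: 0 if a == b else 1
```

Swapping `letter_dist` for a Euclidean distance between 13-coefficient MFCC vectors turns this letter toy into the comparison actually performed between speech frames.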
In Figure 9, the distance map compares letters with letters; therefore logical values, 1 and 0, are sufficient to determine the cost of an intersection. Since we have MFCC vectors, we need a way to calculate the distance between two vectors. If the two vectors are very similar, i.e. they contain the same segment of a word, the cost to move to that intersection should be low. To compare two vectors, we take the Euclidean distance between them, which serves as d(i,j).

Figure 19 - Euclidean Distance in 3-D

In Figure 19, the Euclidean distance in three dimensions is the square root of the sum of the squares of the differences in each of the component directions. Since our MFCC vectors have 13 components, we extend the Euclidean distance formula to 13 dimensions:

d(i,j) = \sqrt{ (y_1 - x_1)^2 + (y_2 - x_2)^2 + \cdots + (y_{13} - x_{13})^2 }

where y and x are the components of the MFCCs calculated for each frame of speech to be examined. A distance map is then computed and the minimum cost path through the grid is calculated. In the word example given above, the final score was identically zero; however, as stated before, in real life this is not the case, since no two frames will ever precisely match up. Therefore, in order to determine what word is actually being said, we need to calculate the minimum cost path for every word or phrase stored in the database. Once this is done, and provided that the speaker has spoken a phrase that is contained in the database, the lowest score should indicate what phrase was spoken.

5 USER INTERFACE AND SYSTEM USAGE

The previous section described in detail the algorithms behind ASRES. This section documents and explains the user interface.

5.1 User Interface Description

The user interface of ASRES consists of a main window from which all operations proceed.

Figure 20 - ASR Tool

(1), in the figure above, is the button that sets the user interface into capturing mode.
When this button is toggled, the system goes into a listening mode in which the input signal is analyzed for speech according to the algorithms described in Chapter 4. (2), on the right-hand side, contains quick buttons that put the system into a mode to capture one of three phrases that were commonly used to test the effectiveness of the automated speech recognition and to store them in the database. Additional custom phrases can also be captured by pressing the "New" button below the three phrases. Theoretically, it is possible to capture any number of user-defined phrases and store them in the database. However, it is important to remember that scoring a sound clip against the others in the database with DTW is an operation of order O(n) in the number of stored phrases, so the performance of the system decreases linearly as the database grows. (3) is a text field that provides text feedback. For example, if a user was captured as saying "Help me!", the system would process the phrase, score it against the phrases stored in the database, and then display in the text field the name of the phrase that the system recognized. In this case, the text "Help me!" would be displayed. Although a visual display would be of little use to a BiPAP user who was actually crying for help, this output is primarily for those setting up the system. If the command is correctly recognized, further appropriate action can then be taken.

5.2 System Usage

The "Start cap" button is pressed and de-noised speech is captured by the microphone. The captured speech is compared against the entire database of phrases and a match is located. Because the system is able to recognize different kinds of words, commands and actions can be tied to each of these and stored in a database. For example, if the "Help me!" command is recognized, an appropriate reaction could be to trigger an alarm that is connected to the computer via a USB port.
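This phrase-to-action wiring can be kept in a small lookup table. The sketch below is hypothetical: the phrases and command strings are placeholders of the kind a caregiver would configure, not values from ASRES, and the `runner` argument is injected so the example can be exercised without actually spawning a process.

```python
import shlex
import subprocess

# Hypothetical configuration table mapping recognized phrases to shell
# command lines; in ASRES such entries would be set up by the caregiver.
ACTIONS = {
    "Help me!": "alarm.exe --trigger",
    "Open Browser": "firefox",
}

def dispatch(phrase, runner=subprocess.run):
    """Look up and run the shell action tied to a recognized phrase.

    Returns None when the phrase has no configured action; otherwise
    returns whatever the runner returns for the tokenized command line.
    """
    command = ACTIONS.get(phrase)
    if command is None:
        return None
    return runner(shlex.split(command))
```

Keeping the mapping as plain command lines is what makes the scheme extensible: any command-line tool, including a PC-to-SMS utility, can be attached to a phrase without changing the recognizer.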
Or perhaps, if the caregiver is not in the general vicinity, a message could be sent to their phone via a specific app. A two-word command such as "Lights on" could be interfaced with the room lights and allow a patient to have control over the lights. One of the commands that was implemented is "Open Browser." Upon accurate recognition of this command, the default browser of the operating system, e.g. Firefox, Chrome, or Internet Explorer, is opened to its home page. From there, a user can navigate from page to page by using an extension such as "numberedlinks" for Firefox, which assigns numbers to every hyperlink on the page. Thus, a user would be able to say "Link 15" and the link would be clicked and the new page loaded. To add more functionality, recording commands like "Back" or "Forward" could also improve the browsing experience. The possibilities are endless; however, they all rely on accurate detection of the words.

5.3 Initial Training

In order to train the system, a user selects either one of the quick buttons on the side or the "New" button and records his or her command. These samples become the unique user sound set with which all future voice commands are compared and aligned using Dynamic Time Warping. In order to make the system as robust as possible, any "New" command that is recorded can be set to execute a particular command-line command. For example, if we wish to implement something beyond "Open Browser" and navigating through a web interface, we can specify a command to run. For example, on a Windows-based system, it is possible to record a clip "Adjust pillow" and set the executed action to run the command-line string "msg.exe SamChua Can you please come over and adjust my pillow?" Then, when the ASRES system is running and the words "Adjust pillow" are spoken, the system would execute the above shell command, which would cause a message with the above text to be sent to the user
In this way, provided that the system is setup in advance and these commands are typed in by the caregiver, a PAL could send numerous kinds of commands to particular people. Since the commands are shell-based, any type of command line based custom software could be written or bought and used with the ASRES system. For example, another even more useful application would be to purchase a proprietary PC-to-SMS software and use it to send an SMS to a caregiver. For example, the recorded command could be “Emergency!” and the executed action could be “SMS.exe /u:SamChua /m:Come home immediately, I need help!” One important thing to note however is the degradation of speech over time in individuals with ALS. In some individuals, their ability to enunciate words clearly declines rapidly within a year. For others, the loss of speech is a slow process that drags on for several years. This is problematic as it means that speech samples in the database will no longer be accurate and need to be retrained over time. Because the amount of time differs from person to person, either a variable can be set in the XML configuration file to automatically replace old records with new spoken ones every few months or a user can manually retrain their entire database when they start to notice error rates in recognition increasing substantially. The automatic deprecation of old records and addition of new ones would be ideal, however, due to the fact that dysarthrics struggle to control their facial muscles, there is no clever way to automatically tell whether an utterance is acceptable or not. Although tedious, the only way to accurately re-train is to record, allowing the user to listen to what they recorded, and verify that they would like to keep it.  33  6  EXPERIMENTS In this section, two experiments involving human subjects are explained, conducted and the results  discussed.  
6.1 Experimental Setup

6.1.1 Goal

The goal of the following experiment was to capture several predetermined speech samples from a PALS in order to provide ASRES with data that could be used to determine whether spectral subtraction improves speech intelligibility.

6.1.2 Setup

In order to determine whether filtered sound captured from within a mask improves intelligibility, multiple recordings of sentences and a paragraph from within the mask with the noise filter deactivated were taken to establish a baseline. Once these recordings were complete, the airflow noise filter was activated and the same sentences and paragraph were read and recorded.

Test sentences were constructed from a subset of the list of 74 monosyllabic nouns and 37 disyllabic verbs found in the Nemours Database of Dysarthric Speech [9]. Two factors went into the creation of the subset. The first was to remove words that differed from each other by only a single phoneme, such as "cob" and "cop". Dysarthric speakers have been shown to have difficulty with certain vowels and syllables [10], and removal of these helps ensure that incorrect recognition of a word is not due to the speaker being unable to articulate a particular sound, resulting in two words being pronounced the same. Words were therefore selected which were at least two phonemes apart. The second factor addressed the issue of context. In order to ensure that context was not being used to recognize the speaker's words, the constructed sentences are nonsensical and are in the form "The X is Ying the Z", where X and Z are nouns randomly selected without replacement from the reduced set of 11 nouns and Y is a verb selected from the reduced set of 8 verbs listed below. An example sentence is "The fade is leaping the bin".
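The construction rule can be expressed directly. The sketch below uses the reduced sets of 11 nouns and 8 verbs listed below; the function name and the injectable random generator are illustrative conveniences, not part of the experimental tooling.

```python
import random

NOUNS = ["cob", "bad", "bait", "fade", "fight", "dime",
         "dew", "bin", "rot", "pat", "bet"]
VERBS = ["wading", "leaping", "licking", "bearing",
         "stewing", "sipping", "going", "surging"]

def make_sentence(rng=random):
    """Build a nonsense test sentence of the form 'The X is Ying the Z'.

    The two nouns are drawn without replacement so X and Z always differ,
    matching the sentence-construction rule described in the text.
    """
    x, z = rng.sample(NOUNS, 2)   # without replacement
    y = rng.choice(VERBS)
    return f"The {x} is {y} the {z}"
```

Because the nouns are sampled without replacement and the template carries no semantic content, a listener cannot use sentence context to guess a garbled word, which is exactly the property the test sentences need.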
Set of nouns:
• cob
• bad
• bait
• fade
• fight
• dime
• dew
• bin
• rot
• pat
• bet

Set of verbs:
• wading
• leaping
• licking
• bearing
• stewing
• sipping
• going
• surging

Finally, the paragraph to be read by the speaker was a standard paragraph in the area of speech sciences called "My Grandfather":

You wished to know all about my grandfather. Well, he is nearly ninety-three years old; he dresses himself in an ancient black frock coat, usually minus several buttons; yet he still thinks as swiftly as ever. A long, flowing beard clings to his chin, giving those who observe him a pronounced feeling of the utmost respect. When he speaks, his voice is just a bit cracked and quivers a trifle. Twice each day he plays skillfully and with zest upon our small organ. Except in the winter when the ooze or snow or ice prevents, he slowly takes a short walk in the open air each day. We have often urged him to walk more and smoke less, but he always answers, "Banana oil!" Grandfather likes to be modern in his language.

For the full details of the experimental procedure, see Appendix A: Procedure for Collecting Speech Samples of a PALS. The subject population consisted of a single PALS.

6.1.3 Hypothesis

Prior to the start of this experiment, it was hypothesized that performing spectral subtraction filtering on speech captured from a PALS would increase intelligibility.
6.2 Experimental Results

In this subsection, the experimental results of subject RG are discussed. In addition, a set of data with digitally added noise was also tested on ASRES and the results examined in order to test the effectiveness of spectral subtraction on BiPAP noise. Two types of tests were conducted. The following is an examination first of the recognition rates achieved on the created dataset and then on the actual dataset produced by a PALS.

6.2.1 Digital Addition of Noise to Nemours Subject BB

In the first set of tests, a sample of BiPAP noise was recorded and then digitally added to the voice of a dysarthric speaker; the results were Post-Filtered and the DTW algorithm applied. The following datasets were created from words spoken by subject BB recorded in the Nemours Database of Dysarthric Speech. Subject BB has a form of cerebral palsy that results in speech dysarthria, although not severe. His level of dysarthria, compared to others in the Nemours database, is fairly low. BB has a score of 8/8 in sentence intelligibility and conversation intelligibility. In the area of word intelligibility, however, his score is only 4/8, which implies that his words are not as well formed as those of a typical speaker. BB's level of dysarthria would be comparable to that of a person in the earlier stages of ALS who is just starting to use BiPAP on a more regular basis. This makes him a fairly good candidate for a test.

In the following results, the recording of the paragraph "My Grandfather" was trimmed until only five words, "Grandfather", "Coat", "Buttons", "Trifle", and "Organ", remained. These five words were then added to the library for DTW matching. A sample of noise recorded from inside a face mask with BiPAP operating was then digitally added to the five words and the recordings saved. Finally, the sample with digitally added noise was Post-Filtered with the spectral subtraction algorithm and the result again segmented into five different words to test against ASRES. The following tables show the effect on machine recognition as the SnR is gradually decreased.
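Generating the noisy datasets amounts to scaling the recorded BiPAP noise so that the mixture sits at a chosen SnR. The helper names below are hypothetical; the dB arithmetic is the standard definition.

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB, given separate signal and noise tracks."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def mix_at_snr(signal, noise, target_db):
    """Scale the noise track so signal + noise sits at target_db SnR.

    Amplitude scale = 10^((current_snr - target_snr) / 20), since doubling
    the noise amplitude lowers the SnR by 20*log10(2) dB.
    """
    scale = 10.0 ** ((snr_db(signal, noise) - target_db) / 20.0)
    return signal + scale * noise
```

Stepping `target_db` down in 3 to 4 dB increments, as in the tables that follow, produces a family of noisy recordings of controlled difficulty from one clean utterance and one noise sample.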
SnR = 11.32 dB

NOISY                           Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather          399.51   429.72   434.85   432.59    417.8
Coat                 451.74   288.45   331.68   329.96   302.84
Buttons              404.85   292.41   253.06   305.79   274.45
Trifle               409.06   291.11   313.92   250.89   279.04
Organ                416.57   295.18    310.4   291.18   242.36

Table 1 - Noisy DTW of 5 Words by BB

SnR = 22.31 dB

POST-FILTERED                   Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather           207.7   325.05   320.68   332.67   301.89
Coat                 308.03   143.18   233.05   251.14   203.18
Buttons              287.12    220.4   119.45   239.64   195.19
Trifle               327.51   255.81   220.71   131.69   176.51
Organ                 310.5   223.61    203.6   204.06   105.44

Table 2 - Post-Filtered DTW of 5 Words by BB

In the results above, the SnR is fairly high, and every word in both the five samples of noise-corrupted speech and the five samples of filtered speech was matched with the correct sample in the library. Therefore, there is no apparent benefit to performing filtering at this SnR, as it does not improve the already excellent recognition rate.

SnR = 8.30 dB

NOISY                           Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather           493.5   552.65   503.71   444.91   458.73
Coat                 533.74   389.01   406.92   337.97   361.62
Buttons              545.93   460.63   373.06   371.81      393
Trifle               516.98   420.78   388.74   276.45    307.9
Organ                512.09   427.72   392.32   299.84    264.7

Table 3 - Noisy DTW of 5 Words by BB with +3dB Noise

SnR = 19.77 dB

POST-FILTERED                   Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather          232.81   404.65   388.57   345.31   305.99
Coat                 352.28   202.96   322.72   279.87   250.41
Buttons              337.51   294.85   214.69   267.62   238.48
Trifle                343.2   270.32   279.81   130.83   192.42
Organ                 328.6   297.51   292.26   204.67   126.06

Table 4 - Post-Filtered DTW of 5 Words by BB with +3dB Noise

In the results above, the SnR has been reduced by 3 dB. It is still fairly good, but lower than in the previous example.
The Post-Filtered data was 100% correctly matched with the library; however, the DTW matching for the noisy signal is down to 40%, with only two of the words, "Trifle" and "Organ", being correctly recognized. It is also worth noting that the scores in the Noisy table are much higher than those in the Post-Filtered table. This implies that the alignment between the noisy speech and the samples stored in the library is more distant, and therefore the system would be more prone to making errors as the size of the dataset increased.

SnR = 4.31 dB

NOISY                           Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather          532.67   591.07   527.93   503.11   497.42
Coat                 558.27   437.84   424.23   403.53   395.78
Buttons              571.66   496.93   403.22   439.36   424.89
Trifle               548.45   465.51    418.6   360.08   367.13
Organ                543.42   467.74    405.6   376.96   312.55

Table 5 - Noisy DTW of 5 Words by BB with +7dB Noise

SnR = 16.64 dB

POST-FILTERED                   Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather          258.76   389.99   389.29    336.2   293.33
Coat                 361.26   222.17   352.16    269.2   252.84
Buttons              328.83   309.38   228.11    264.2    229.4
Trifle               366.76    279.3   312.32   180.62    219.4
Organ                334.95   291.74   309.11   232.14   152.78

Table 6 - Post-Filtered DTW of 5 Words by BB with +7dB Noise

In the results above, the SnR has been reduced by an additional 4 dB from the previous set. Once again the Post-Filtered data was 100% correctly matched with the library; however, the DTW matching for the noisy signal is at 60%, with three words, "Buttons", "Trifle" and "Organ", being correctly recognized. In addition, the winning scores for each of the noisy samples have increased on average by 54.6 points. This implies that the alignment between the noisy speech and the samples stored in the library is becoming even more distant and more error-prone.
SnR = -1.70 dB

NOISY                           Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather           688.6   714.46   646.58   594.55   633.91
Coat                 694.65   535.93    503.4   469.85   489.28
Buttons              715.26   581.55   498.33   503.39   515.92
Trifle                686.7   550.37   501.58   427.74    447.1
Organ                677.34   552.47   488.53   435.51   400.54

Table 7 - Noisy DTW of 5 Words by BB with +13dB Noise

SnR = 12.03 dB

POST-FILTERED                   Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather          345.96   413.07   379.34   373.28   388.06
Coat                 417.85   265.57   338.53   285.58   316.99
Buttons              438.26   390.25   286.55   341.79   352.28
Trifle               368.66   297.25   297.42    213.9   261.48
Organ                338.42   278.12   253.08   244.22   211.48

Table 8 - Post-Filtered DTW of 5 Words by BB with +13dB Noise

In the results above, the SnR is close to 0 dB, which means that the signal and noise strength are almost one-to-one. Once again the Post-Filtered data was 100% correctly matched with the library, with the DTW matching for the noisy signal remaining at 60%, with three words, "Buttons", "Trifle" and "Organ", being correctly recognized. Although three words are recognized, "Buttons" is but 5.1 points, or 1.0%, away from "Trifle", the next closest match. If this were repeated, it is quite possible that the recognition rate would go down to 40%.
SnR = -1.68 dB

NOISY                           Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather          720.91   730.43   665.13   580.12   616.28
Coat                 725.14    552.9      537    456.3   473.26
Buttons              748.82   598.96   537.99   487.08   499.22
Trifle               718.73    567.4   536.62   415.58   432.93
Organ                708.46   570.11   529.54   410.74   385.86

Table 9 - Noisy DTW of 5 Words by BB with +13dB Noise and Alternate Noise

SnR = 12.07 dB

POST-FILTERED                   Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ
Grandfather          361.95   423.18    389.1   368.06   383.22
Coat                 433.81   271.72   335.86   275.71   315.09
Buttons              453.11   396.59   311.83   330.23   345.35
Trifle               385.53   302.63   303.94   212.35   256.86
Organ                352.25   286.78   277.62   236.44   209.23

Table 10 - Post-Filtered DTW of 5 Words by BB with +13dB Noise and Alternate Noise

For this particular set of data, a different sample of BiPAP noise was digitally added to set the SnR close to 0 once again. In Table 9, it is observed that the scores for "Buttons" and "Trifle" are extremely close together and, as a result of the noise change, the word "Buttons" is incorrectly recognized as "Trifle". Once again the DTW matching for the noisy signal has dropped to 40%. From the data collected in the tables above, there is reason to believe that spectral subtraction aids in the recognition of words by DTW.

6.2.2 Person Living with ALS – RG

In the second set of tests, a subject with ALS tested the system by performing both experiments on the procedure sheet as detailed in the experimental setup. In order to validate the theoretical findings of the previous experiment, it was necessary to conduct the same experiment with a real person. Subject RG, a person living with ALS, provided the sound clips from which the following results were derived. The following tables follow the same format and analysis as the digitally-added-noise experiment in the previous section.
My Grandfather

SnR = 6.89 dB

NOISY                           Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ  Language
Grandfather          402.15   352.16   342.22   351.12   361.03    382.26
Coat                  381.8    299.5   322.35   332.86   351.08     372.5
Buttons              352.32   251.09   267.91   274.62    296.1    306.51
Trifle               426.21   349.65   348.13   349.99   384.98    371.54
Organ                333.79   223.15   191.05   232.95   220.72    278.42
Language             423.91   333.44   321.54   344.45   348.17     371.2

Table 11 - DTW of 6 Words by RG

The data in the table above were generated from a real-time noisy dataset cut from RG reading the paragraph "My Grandfather". At a SnR of 6.89 dB, and with this particular dataset, DTW performed poorly and was only able to match a single word, "Coat", correctly.

SnR = 6.79 dB

POST-FILTERED                   Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ  Language
Grandfather           704.2   742.46   681.23   831.07   762.76    750.65
Coat                 690.74   693.53   663.34   784.44   757.45    763.74
Buttons              605.99   586.83   570.24    675.2   643.36    646.96
Trifle               755.59   782.55   764.12    873.5   821.56    784.11
Organ                535.35   497.03   459.74   573.25   513.91    553.19
Language             690.45   711.89   689.49   815.71   757.53    739.21

Table 12 - Post-Filtered DTW of 6 Words by RG

In the above table, the same real-time noisy dataset was subjected to Post-Filtering. The Post-Filtering in this case performed extremely poorly, and the SnR actually went down by 0.1 dB. As a result, again, only one word was correctly recognized and no improvement was made.

SnR = 9.12 dB

REAL-TIME FILTERED              Library
Spoken          Grandfather     Coat  Buttons   Trifle    Organ  Language
Grandfather          342.82   327.89   310.02    402.9   411.42    389.79
Coat                 249.44   188.02   196.17   312.27    268.7    236.22
Buttons              304.91   287.97   218.99   407.06   376.74    401.37
Language             439.06   381.67   363.23    545.7   512.62    526.47

Table 13 - Real-time Filtered DTW of 4 Words by RG

The data in the table above were generated from a real-time filtered dataset cut from RG reading the paragraph "My Grandfather".
The filtering has improved the SnR by just over 2dB, and as a result two words, "Coat" and "Buttons", are correctly recognized. Because RG was not able to articulate the words "Trifle" and "Organ" during this recording, both words were removed from the spoken set for this dataset. From this dataset we see that Post-Filtering had no benefit in this case; however, the Real-Time Filtered dataset shows recognition of two of the words, which is better than the single word recognized in the noisy dataset.

Five Isolated Words

SnR = 12.5 dB

NOISY (rows: library template; columns: spoken word)

        Fight    Dime     Dew      Bin      Bait
Fight   418.85   150.99   201.16   140.37   165.79
Dime    426.74   161.13   190.26   165.62   158.91
Dew     392.86   198.93   178.62   184.3    149.09
Bin     388.54   174.78   162.93   143.62   132.76
Bait    384.9    177.01   162.06   163.56   118.38

Table 14 - Noisy DTW of 5 Words by RG

In the above dataset, words from each of the five sentences spoken by RG were cut from two recordings. The first recording, which had no BiPAP and no filtering enabled, served as the baseline and became the library files. The second recording, which had the BiPAP enabled and filtering off, was segmented and tested against the library with DTW. As seen in the above table, the result at a SnR of 12.5dB was still fairly poor, with only a single word, "Bait", being recognized.

SnR = 18.2 dB

POST-FILTERED (rows: library template; columns: spoken word)

        Fight    Dime     Dew      Bin      Bait
Fight   116.36   403.95   428.75   360.9    389.88
Dime    134.62   422.62   418.36   374.19   404.29
Dew     192      386.68   372.09   338.11   377.12
Bin     144.25   381.5    386      332.7    372.12
Bait    142.73   389.63   392.44   326.03   361.63

Table 15 - Post-Filtered DTW of 5 Words by RG

This dataset was created by taking the noisy dataset found in Table 14 and applying spectral subtraction. This increased the SnR by 5.7dB, and as a result the software was able to recognize 3 out of the 5 words, or 60%.
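The Post-Filtering applied to these datasets is spectral subtraction in the style of Boll [23]. A minimal sketch, in illustrative Python rather than the thesis's Matlab/DSP implementation: estimate an average noise magnitude spectrum from a speech-free stretch of BiPAP noise, subtract it from each frame of the noisy signal, half-wave rectify, and resynthesize with the noisy phase.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, frame=256, hop=128):
    """Minimal magnitude spectral subtraction. noise_est is a
    speech-free stretch of the BiPAP noise; its average magnitude
    spectrum is subtracted from every frame of the noisy signal,
    which is then rebuilt by overlap-add using the noisy phase."""
    win = np.hanning(frame)
    # Average noise magnitude per FFT bin
    noise_frames = [np.abs(np.fft.rfft(noise_est[i:i + frame] * win))
                    for i in range(0, len(noise_est) - frame, hop)]
    noise_mag = np.mean(noise_frames, axis=0)
    out = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame, hop):
        spec = np.fft.rfft(noisy[i:i + frame] * win)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # half-wave rectify
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)
        out[i:i + frame] += clean
    return out
```

This basic form assumes the noise is stationary, which is a reasonable approximation for the steady wind noise of a BiPAP ventilator; the musical-noise artifacts of plain subtraction are what the improved variants cited in the background chapter attempt to suppress.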
SnR = 23.45 dB

REAL-TIME FILTERED (rows: library template; columns: spoken word)

        Fight    Dime     Dew      Bin      Bait
Fight   114.5    169.87   179.09   167.08   173.05
Dime    129.81   143.39   189.34   165.34   164.29
Dew     178.28   148.45   144.14   133.46   117.96
Bin     137      141.73   147.22   128.87   114.96
Bait    133.61   128.47   148.89   129.54   111.12

Table 16 - Real-Time Filtered DTW of 5 Words by RG

In the above dataset, a real-time spectral subtraction recording of the five words yielded an extremely high SnR. As a result, the recognition rate in this case was 4/5, or 80%. From this dataset we see that there appears to be better recognition of words after performing filtering.

Three Commands

SnR = 8.56 dB

NOISY (rows: library template; columns: spoken phrase)

               Help Me   Open Browser   Close Window
Help Me        522.5     784.6          868.68
Open Browser   1146.6    1042.66        1150.93
Close Window   1098.57   1150.65        1151.85

Table 17 - DTW of 3 Phrases by RG

In the above dataset, three phrases were recorded with the BiPAP off for the library, then 3 recordings were made with the BiPAP running, and then DTW was applied. At a SnR of 8.56dB the results are once again poor, with only one phrase being correctly recognized.

SnR = 18.7 dB

FILTERED (rows: library template; columns: spoken phrase)

               Help Me   Open Browser   Close Window
Help Me        325.58    721.43         474.74
Open Browser   749.32    637.4          702.31
Close Window   657.25    793.79         505.64

Table 18 - Real-Time Filtered DTW of 3 Phrases by RG

In this dataset, a recording of the 3 phrases was made with the BiPAP running and spectral subtraction enabled. In this case, the filter performed fairly well, boosting the SnR to 18.7dB and allowing for a recognition rate of 2 out of 3, or 67%.

6.3 Phoneme Analysis

To further validate whether there is any increase in intelligibility, an experiment was conducted in which individuals listened to the recordings of subject RG and wrote down what they perceived was said. Table 19 below shows the 5 control sentences with the two nouns and the verb in bold.
Both of the nouns in the five sentences have exactly 3 phonemes each, while the verbs all have 5 phonemes. Therefore, the two nouns and the verb in each sentence have a total of 11 phonemes.

Sentence                        1   2   3   Total
The fight is wading the cob.    3   5   3    11
The dime is bearing the bet.    3   5   3    11
The dew is leaping the pat.     3   5   3    11
The bin is stewing the fade.    3   5   3    11
The bait is licking the rot.    3   5   3    11

Table 19 - Control 5 Sentences and Phoneme Divisions (columns 1-3 are the phoneme counts of the first noun, the verb, and the second noun)

The experiment was conducted in two phases consisting of 3 trials each. In each trial, the subject would listen to a sentence and then write down what they perceived RG to be speaking. Subjects understood in advance that the sentences spoken would be in the form of "The X is Ying the Z". In phase one, three trials were conducted with the subjects listening to unfiltered speech with a SnR of 12.5dB. In phase two, the trials were conducted with the subjects listening to the filtered speech with a SnR of 23.45dB.

No Filtering (SnR = 12.5 dB)

Trial 1                          1    2    3   Score      %
The fight is wearing the cob     0   -2    0     -2    81.8%
the drive is bearing the dirt   -2    0   -2     -4    63.6%
the view is leaving the path    -1   -1   -1     -3    72.7%
the bend is spewing the pain    -2   -1   -2     -5    54.5%
the bate is licking the vase     0    0   -4     -4    63.6%

Table 20 - An Example Phoneme Analysis Trial Explained

In Table 20 above, an example trial is explained. In this trial, the subject perceived sentence number one as "The fight is wearing the cob." As the first word was perceived correctly, no penalty is applied under column 1. The verb "wading", however, was perceived as "wearing". As these two words differ in exactly 2 phonemes, a score of minus 2 is applied under column 2. The third word, "cob", was correctly identified, so no penalty is applied. In total, only 2 phonemes were incorrectly perceived, which means that 9 out of 11 were correctly perceived, resulting in a recognition percentage of 81.8%.
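The per-word penalty used in this scoring is, in effect, an edit distance over phoneme sequences. A small sketch in Python; note that the ARPAbet-style transcriptions in the example below are my own illustrative choices, not taken from the thesis.

```python
def phoneme_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences: the number
    of substituted, inserted, or deleted phonemes, used as the
    per-word penalty (e.g. 'wading' vs 'wearing' differ by 2)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[len(ref)][len(hyp)]

def sentence_score(ref_words, hyp_words):
    """Percentage of the scored phonemes (11 per sentence here)
    perceived correctly."""
    total = sum(len(w) for w in ref_words)
    penalty = sum(phoneme_distance(r, h) for r, h in zip(ref_words, hyp_words))
    return 100.0 * (total - penalty) / total
```

Scoring sentence one of the example trial ("wading" perceived as "wearing", both nouns correct) with such transcriptions reproduces the 81.8% of Table 20.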
This scoring process is then repeated for the remaining 4 sentences, and this is performed for all 3 trials from the unfiltered set. From this we can then calculate the average phoneme recognition rate for each sentence over the three trials. Finally, a single number, the overall percentage of correctly identified phonemes, can be computed, as seen below in Table 21.

Mean % of Correctly Identified Phonemes per Sentence
Sentence #1   75.8%
Sentence #2   81.8%
Sentence #3   69.7%
Sentence #4   48.5%
Sentence #5   66.7%
Overall % of Correctly Identified Phonemes   68.5%

Table 21 - Percentage of Correctly Identified Phonemes

This process is then repeated for the second phase so that an overall percentage of correctly identified phonemes can also be determined for the filtered dataset, and the two results can then be compared. Table 22 below is the phoneme analysis of one of the four subjects who participated in the study.

No Filtering (SnR = 12.5 dB)

Trial 1                          1    2    3   Score      %
The fight is wearing the cob     0   -2    0     -2    81.8%
the drive is bearing the dirt   -2    0   -2     -4    63.6%
the view is leaving the path    -1   -1   -1     -3    72.7%
the bend is spewing the pain    -2   -1   -2     -5    54.5%
the bate is licking the vase     0    0   -4     -4    63.6%

Trial 2                          1    2    3   Score      %
the pipe is wearing the cob     -2   -2    0     -4    63.6%
the dime is bearing the bent     0    0   -1     -1    90.9%
the view is reaching the path   -1   -2   -1     -4    63.6%
the bend is spewing the thing   -2   -1   -4     -7    36.4%
the bait is licking the vase     0    0   -4     -4    63.6%

Trial 3                          1    2    3   Score      %
The fight is wearing the cob     0   -2    0     -2    81.8%
the dime is bearing the bent     0    0   -1     -1    90.9%
the view is leaving the path    -1   -1   -1     -3    72.7%
the bend is spewing the pain    -2   -1   -2     -5    54.5%
the bait is raking the rock      0   -2   -1     -3    72.7%

RT Filtered (SnR = 23.45 dB)

Trial 1                          1    2    3   Score      %
The fight is waving the cob      0   -1    0     -1    90.9%
the dime is wearing the death    0   -1   -2     -3    72.7%
the view is leaking the path    -1   -1   -1     -3    72.7%
the bend is stewing the shade   -2    0   -1     -3    72.7%
the bait is making the rock      0   -2   -1     -3    72.7%

Trial 2                          1    2    3   Score      %
The fight is waving the cob      0   -1    0     -1    90.9%
the dime is wearing the bent     0   -1   -1     -2    81.8%
the Jew is sweeping the path    -1   -2   -1     -4    63.6%
the bend is stewing the shade   -2    0   -1     -3    72.7%
the bait is making the run       0   -2   -2     -4    63.6%

Trial 3                          1    2    3   Score      %
The fight is waving the cob      0   -1    0     -1    90.9%
The dime is wearing the death    0   -1   -2     -3    72.7%
The Jew is sweeping the path    -1   -2   -1     -4    63.6%
The bend is stewing the shade   -2    0   -1     -3    72.7%
The bait is making the run       0   -2   -2     -4    63.6%

Mean % of Correctly Identified Phonemes per Sentence

               No Filtering   RT Filtered
Sentence #1       75.8%          90.9%
Sentence #2       81.8%          75.8%
Sentence #3       69.7%          66.7%
Sentence #4       48.5%          72.7%
Sentence #5       66.7%          66.7%
Overall           68.5%          74.5%

Table 22 - Phoneme Analysis of MB

The phoneme analysis of subject MB in Table 22 above shows a slight improvement, about 6%, in recognition of the phonemes. The details of the other three phoneme analyses are not included here but can be found in Appendix B: Phoneme Analysis Data Sheets. The summary results will be discussed in 6.4.1 - Summary of Phoneme Analysis.

6.4 Discussion

In this subsection, the results of the different experiments are summarized and discussed.

6.4.1 Summary of Phoneme Analysis

A summary of the phoneme analysis for all four subjects is found in Table 23 below.

Overall % of Correctly Identified Phonemes

Subject   No Filtering (SnR = 12.5 dB)   RT Filtered (SnR = 23.45 dB)   +/- %
MB               68.5%                          74.5%                    6.1%
RS               69.1%                          70.3%                    1.2%
EC               67.3%                          67.9%                    0.6%
JW               74.5%                          73.3%                   -1.2%

Table 23 - Summary of Phoneme Analysis

As seen in the data, apart from one subject who showed a 6.1% increase in the number of phonemes recognized when listening to the filtered speech samples, the other three individuals did not show any significant increase in the number of phonemes recognized. In one case, JW, the individual actually had a slight decrease in the number of phonemes accurately recognized.
Although there is a possibility that the intelligibility of phonemes has improved slightly, no firm conclusions can be reached from the dataset in Table 23 above.

6.4.2 Summary of ASRES Results

The results from the tables of data for both BB and RG are summarized below in a table, and a graph of SnR vs. recognition rate is constructed, in order to verify whether or not spectral subtraction of a noisy BiPAP signal provides any benefit to ASR technology.

SnR     Type            Correct   Percent   Subj.   Set
11.32   Noisy           5/5       100%      BB      Grandfather
8.3     Noisy           2/5       40%       BB      Grandfather
4.31    Noisy           3/5       60%       BB      Grandfather
-1.68   Noisy           2/5       40%       BB      Grandfather
-1.7    Noisy           3/5       60%       BB      Grandfather
22.31   Post-Filtered   5/5       100%      BB      Grandfather
19.77   Post-Filtered   5/5       100%      BB      Grandfather
16.64   Post-Filtered   5/5       100%      BB      Grandfather
12.07   Post-Filtered   5/5       100%      BB      Grandfather
12.03   Post-Filtered   5/5       100%      BB      Grandfather

Table 24 - Summary of BB Datasets

The table above was created by summarizing the data found in Table 1 to Table 10. The data was first sorted by type and then by SnR. The "Noisy" type refers to signals that were unfiltered, whereas "Post-Filtered" refers to the noisy signals that had spectral subtraction applied to them in post-processing. The "Correct" column is a fractional total of the number of words that were correctly recognized by the ASR, and the "Percent" column simply expresses that fraction as a percentage.

Figure 21 - SnR vs. Recognition Rate of BB for Noisy Signals Subjected to Post-Filtering (scatter plot of SnR (dB) against Automated Speech Recognition Rate (%), with Noisy and Post-filtered series)

From Table 24 and Figure 21, there is reason to believe that Post-Filtering a noisy signal results in a significantly higher recognition rate by ASR. All of the signals that were post-processed with spectral subtraction were recognized without fail.
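For reference, the SnR values tabulated here can be computed directly whenever a time-aligned clean signal is available, as it is for the digitally-added-noise datasets. A sketch in illustrative Python, not the thesis code; real BiPAP recordings instead require estimating the noise power from speech-free segments.

```python
import numpy as np

def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB, treating noisy - clean as the
    noise. Valid only when a time-aligned clean reference exists."""
    noise = noisy - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))
```

For example, a constant signal of amplitude 1 with additive noise of amplitude 0.1 gives a power ratio of 100, i.e. 20dB.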
SnR     Type               Correct   Percent   Subj.   Set
12.5    Noisy              1/5       20%       RG      5 Sentences
8.56    Noisy              1/3       33%       RG      3 Commands
6.89    Noisy              1/6       17%       RG      Grandfather
18.2    Post-Filtered      3/5       60%       RG      5 Sentences
6.79    Post-Filtered      1/6       17%       RG      Grandfather
23.45   Real-Time Filter   4/5       80%       RG      5 Sentences
18.7    Real-Time Filter   2/3       67%       RG      3 Commands
9.12    Real-Time Filter   2/4       50%       RG      Grandfather

Table 25 - Summary of RG Datasets

Figure 22 - SnR vs. Recognition Rate of RG for Noisy, Post-Filtered and Real-time Filtering (scatter plot of SnR (dB) against Automated Speech Recognition Rate (%), with Noisy, Post-filtered and Real-Time Filter series)

From Table 25 and Figure 22, there is reason to believe that spectral subtraction increases the ASR rate under real-world circumstances. The three data points in the "Noisy" series show that signals under 15dB have a low recognition rate. However, the very same signals subjected to Post-Filtering once again show a marked improvement. The real-time filtering shows that if spectral subtraction results in a SnR that is over 20dB, the recognition rate increases once again.

6.4.3 Effectiveness of ASRES

ASRES seems to show some effectiveness in improving recognition in both of the cases that were explored. In the case where noise is digitally added to speech samples and then Post-Filtered, the improvement is large. Under real-world circumstances there is still a noticeable improvement; however, the results are not nearly as dramatic as the 100% recognition rates achieved in the digitally-added-noise case. From the intelligibility analysis summarized in Section 6.4.1 - Summary of Phoneme Analysis, there is little reason to believe that spectral subtraction at the specific SnRs offers anything more than a marginal improvement in terms of intelligibility.
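The SnR threshold pattern observed above can be made concrete with a few lines over the rows of Table 25 (illustrative Python; the 15dB split simply restates the observation, it is not a tuned parameter):

```python
# (SnR in dB, words correct, words total) for the RG datasets of Table 25
rg_results = [
    (12.5, 1, 5), (8.56, 1, 3), (6.89, 1, 6),    # Noisy
    (18.2, 3, 5), (6.79, 1, 6),                  # Post-Filtered
    (23.45, 4, 5), (18.7, 2, 3), (9.12, 2, 4),   # Real-Time Filter
]

def mean_rate(rows):
    """Average per-dataset recognition rate."""
    return sum(correct / total for _, correct, total in rows) / len(rows)

low = [row for row in rg_results if row[0] < 15.0]
high = [row for row in rg_results if row[0] >= 15.0]
print(f"below 15 dB:     {100 * mean_rate(low):.0f}%")
print(f"15 dB and above: {100 * mean_rate(high):.0f}%")
```

Averaged this way, the datasets below 15dB recognize roughly 27% of words, versus roughly 69% at or above 15dB, which is the pattern visible in Figure 22.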
However, from Section 6.4.2 - Summary of ASRES Results, it is clear that performing spectral subtraction on the noisy speech signals of a patient with ALS can improve the automatic recognition of their speech when the Post-Filtered SnR is greater than 8dB. Therefore, in discussing the effectiveness of the ASRES system, it would seem that the enhancement the system offers is not primarily one of increased intelligibility for human listeners, but rather of improved automated machine recognition of speech. It would not be fair, however, to say that the system is of no value for human listeners. Although no discernible increase in intelligibility from filtering was achieved, PALS still gain several immediate benefits from using ASRES. For one, as discussed in Section 1.2 - The Effects of ALS on Speech and Quality of Life, allowing PALS to speak from behind a mask while on BiPAP is already a huge step forward in combatting loneliness and struggles with isolation.

6.4.4 Feedback from ALS Community

The feedback from ALS caregivers and subjects has been very positive. In May 2009, ASRES was presented at the ALS Society of BC's Engineering Design Competition, and a live demonstration of its capabilities was performed. Following the successful live demonstration, the system won the Principal Award of $5000 and was selected for a platform presentation, "C75 – Enhancing Speech During BiPAP Use", at the 20th International Symposium on ALS/MND in Berlin, for which a travel bursary was also awarded. The findings were also reported during the keynote speaker's address at the conference closing as a novel attempt at a problem that had not previously been explored. In addition, subject RG, who tested the system, was very pleased with it, as were the caregivers and professionals who were able to observe the system.
6.5 Validity

When considering the validity of the experiments conducted for this thesis, there are two outstanding issues that need to be considered:

•  Sample size: Due to time, lifespan, disease-progression, geographical and language constraints, the number of subjects for the experiment was limited to one subject, who was located with the aid of ALS BC. Finding local subjects was difficult due not only to the rarity of the disease, but also to the fact that the subject had to have some ability to speak left while on BiPAP ventilation. One subject was not able to participate due to poor command of the English language. Other subjects who were considered during the time that ASRES was in development did not live to see it come to fruition.

•  Size of datasets: Although some results were achieved, only a handful of datasets were used, and the size of the library was very small as well. Larger datasets would help determine whether or not it was sheer coincidence that resulted in the machine recognition rates being higher for Post-Filtered and real-time filtered data.

7  CONCLUSIONS AND FUTURE WORK

In this section the work of the thesis is discussed. The goals of the research are reviewed, the results are summarized, strengths and weaknesses are assessed, and future work is discussed.

7.1 Research Goals Summary

The goal of this thesis is to develop and evaluate a prototype Automatic Speech Recognition and Enhancement System (ASRES) that will allow Persons Living with ALS (PALS) to communicate clearly with their own voices while on BiPAP ventilation. This problem is a significant one, and a solution to it would greatly aid in improving the overall quality of life of PALS by allowing them to communicate naturally and by allowing loved ones to hear their voices.
Although there are some published works on related topics in the area of noise reduction and speech enhancement, there are no published works that offer a solution to this particular problem. As a result, it is not possible to evaluate ASRES against any existing system. In this thesis, the problems associated with capturing and filtering the speech of PALS are explained; accordingly, specific hardware was selected and software written to capture PALS speech and to filter the wind noise that corrupts signal samples taken from within the mask. A working and mobile prototype was designed and implemented using Matlab, a TMS320C6713 DSP board, and an interface written in C# for Windows. The effectiveness of ASRES was evaluated by digitally adding noise to speech samples from the Nemours Database of Dysarthric Speech, filtering them with spectral subtraction, and then running the resulting samples through ASRES in order to determine their scores. The effectiveness of the system was also evaluated by testing with a subject with ALS, who recorded a number of different speech samples under different conditions, including no filtering with BiPAP on and filtering with BiPAP on. The resulting sound samples were listened to, the perceived sentences were written down by human subjects, and the accuracy of their transcription was evaluated using phoneme analysis.

7.2 Contributions of this Work

In the introduction, it was stated that this thesis would attempt to improve the intelligibility of the speech of persons living with ALS by, first, identifying the problems associated with capturing the speech of PALS; second, designing and implementing a working prototype that addresses these problems; and third, validating the effectiveness of the system by conducting experiments and analyzing their results.
An explanation was given for the "wind noise" problem that causes difficulties in capturing speech from behind a full face mask while a subject is on BiPAP ventilation, and an appropriate capture solution involving a single microphone using spectral subtraction for stationary noise was designed and implemented. An automated speech recognition system using DTW was also implemented to validate our hypothesis of ASRES providing improvements to the intelligibility of PALS. An evaluation of the results shows that spectral subtraction offers some improvement to machine recognition of noisy speech. The benefits of being able to accurately recognize the speech of PALS are many, and the system can be used to trigger alarms, operate a browser, or launch custom software that could contact a caregiver. All of these applications stand to improve the quality of life of PALS.

A final experiment was conducted to evaluate whether or not there is an increase in intelligibility, as perceived by human beings, due to filtering. Although one case showed a slight improvement, the other three cases did not exhibit this improvement, and the best conclusion is that there is no major improvement in terms of intelligibility for human listeners over that of noisy speech. Although no improvement was made with regard to intelligibility, the very fact that PALS can speak with their own voices while on BiPAP is already an improvement, as their speech is completely muffled by the masks if there is no amplification. The results of this thesis are significant in that not only is there reason to believe from the data that there is some benefit to performing filtering of noisy speech, but also that persons with ALS, caregivers and professionals see this device as a step forward in improving the quality of life for persons with ALS.
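The digitally-added-noise evaluation described in the research goals summary amounts to scaling a BiPAP noise recording so that the mix lands at a chosen SnR. A sketch of that step (illustrative Python, not the thesis's Matlab code):

```python
import numpy as np

def add_noise_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that speech + scaled noise has the requested
    SnR in dB, then mix. This is one way BiPAP noise can be digitally
    added to clean samples at, for example, a SnR close to 0."""
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    return speech + gain * noise
```

Because the gain is chosen analytically, the resulting mix hits the requested ratio exactly, which makes the SnR axis of the evaluation reproducible.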
7.3 Strengths and Limitations

7.3.1 Strengths

•  Low hardware requirements: As a single-microphone solution was chosen, only a basic DSP that reads and samples from one channel is required. All the components for a simple system, including the microphone and DSP chip, are easily available.

•  Improvements to automated speech recognition: This is perhaps the single largest strength of the system: filtered speech is more accurately recognized by a machine. The implications of this are significant, as voice-activated alarms and the triggering of custom software all hinge on accurate recognition of the command words that activate them.

•  Improvement of quality of life of PALS: As the system allows PALS to do what they could never do before, namely use their voice while on BiPAP, there is already an immediate improvement in the quality of life for PALS.

7.3.2 Weaknesses

•  System training is difficult for PALS: As PALS often have difficulty with fine motor coordination, having them use a Windows GUI to configure the system would prove difficult. All training would have to be done with the aid of a caregiver, who would not only set up the system but also record samples of the command words.

•  Results are largely based on a few datasets obtained from one subject with ALS: Although this was due to the limitations already mentioned, more tests with other PALS would help confirm the validity of the results.

•  The performance of the automated speech recognition degrades as the user's ability to articulate speech degrades: A mechanism for retraining on a regular basis is necessary in order to avoid decreased accuracy over time.

•  The noise filtering algorithm is not the strongest: The noise filtering could be improved by changing to a different algorithm, or by using an improved version of spectral subtraction that is more up to date with current research.
7.4 Potential Applications

There are a number of potential applications that could result from this prototype. The primary one would be the development of this prototype into a completely portable, self-powered system that could be permanently integrated with the face mask, with the speakers and computers perhaps attached to the BiPAP unit itself. It is also possible that the device could be integrated into the construction of existing masks; however, this would be a difficult procedure, as extensive testing would need to be conducted with mask manufacturers to determine whether or not the introduction of a microphone system would compromise either the integrity or the function of the mask.

7.5 Future Work

There are many areas of the project that would need to be enhanced to make this project a reality for persons with ALS. The following is a list of several major points that would still need to be solved in order to move ASRES from a prototype into a useable product.

•  Permanent integration of the microphone with a full-face mask – As the system is currently a prototype, the microphone dangles within the mask cavity and occupies some CO2 ventilation holes. For long-term use, it would be essential to have microphones built into the walls of the mask itself. This would reduce the risk of the microphone ever becoming dislodged and also decrease the time it takes to secure a microphone in the mask each time it is taken off or washed. The addition of a microphone to the mask is a more complex problem that would involve mask design companies, and approval would also have to be sought to ensure that the classification of the mask remains a Type I or II medical device.

•  Development of a completely self-contained prototype that would not need an electrical plug-in or a laptop to configure it.
Currently the system is rather difficult to set up and involves a fair number of wires, connections, and DIP switches that cannot be easily managed and are prone to being broken or unplugged if sudden movements stretch the wires.

•  Improved training interface – The current user interface is hardly user-friendly and does not offer very many features to users.

•  Improving the ASR algorithm by exploring HMMs and building a more sophisticated automated speech recognition system that would be able to handle larger sets of data.

•  Exploring the effect of reverberations in the mask – Although the literature indicates that reverberations decrease intelligibility within fighter-pilot masks, this has yet to be confirmed with the full face masks worn by those on BiPAP.

7.6 Conclusion

This thesis has identified the problems associated with capturing and recognizing the speech of PALS who are on BiPAP ventilation. A prototype was designed, implemented, and tested, with human subjects and DTW used to evaluate its improvements to speech intelligibility. Although no conclusive improvement in intelligibility for human listeners was demonstrated, the ALS community has affirmed that there is real value in being able to communicate from behind a face mask, and that in and of itself improves the quality of life of persons living with ALS. In addition, there is reason to believe that spectral subtraction filtering of noisy speech enhances automated machine recognition of PALS speech. The work of this thesis is a first step towards improving the quality of life for PALS.
Although the life expectancy of PALS is approximately 2-5 years, and the length of time that a device like this would be useful would therefore be fairly short, perhaps on the order of months or at most a year or two, there is still an immeasurable benefit that such a device can offer: its value is determined not by its useable lifespan, but by the impact it has on caregivers and families during the time in which it can be used.

BIBLIOGRAPHY

[1] NIH, "Amyotrophic Lateral Sclerosis Fact Sheet: National Institute of Neurological Disorders and Stroke (NINDS)," National Institute of Neurological Disorders and Stroke, 2003. [Online]. Available: http://www.ninds.nih.gov/disorders/amyotrophiclateralsclerosis/detail_amyotrophiclateralsclerosis.htm?css=print.
[2] L. S. Aboussouan, S. U. Khan, M. Banerjee, A. C. Arroliga, and H. Mitsumoto, "Objective measures of the efficacy of noninvasive positive-pressure ventilation in amyotrophic lateral sclerosis," Muscle & Nerve, vol. 24, no. 3, pp. 403–409, Mar. 2001.
[3] E. R. Klasner and K. M. Yorkston, "Speech intelligibility in ALS and HD dysarthria: the everyday listener's perspective," Journal of Medical Speech-Language Pathology, Jun. 2005.
[4] J. F. Kent, R. D. Kent, J. C. Rosenbek, G. Weismer, R. Martin, R. Sufit, and B. R. Brooks, "Quantitative Description of the Dysarthria in Women With Amyotrophic Lateral Sclerosis," Journal of Speech & Hearing Research, vol. 35, no. 4, p. 723, 1992.
[5] "A Guide to ALS Patient Care for Primary Care Physicians." ALS Society of Canada.
[6] M. Shoeb, S. E. Merel, M. B. Jackson, and B. D. Anawalt, "'Can we just stop and talk?' patients value verbal communication about discharge care plans," J. Hosp. Med., vol. 7, no. 6, pp. 504–507, Aug. 2012.
[7] L. Giordano, S. Toma, R. Teggi, F. Palonta, F. Ferrario, S. Bondi, and M. Bussi, "Satisfaction and Quality of Life in Laryngectomees after Voice Prosthesis Rehabilitation," Folia Phoniatr. Logop., vol. 63, no. 5, pp. 231–236, 2011.
[8] M. Parker, S. Cunningham, P. Enderby, M. Hawley, and P. Green, "Automatic speech recognition and training for severely dysarthric users of assistive technology: The STARDUST project," Clinical Linguistics & Phonetics, vol. 20, no. 2–3, pp. 149–156, 2006.
[9] X. Menendez-Pidal, J. B. Polikoff, S. M. Peters, J. E. Leonzio, and H. T. Bunnell, "Nemours database of dysarthric speech," in Proceedings of the 1996 International Conference on Spoken Language Processing (ICSLP), Oct. 3–6 1996, Piscataway, NJ, USA, 1996, vol. 3, pp. 1962–1965.
[10] A. B. Kain, J.-P. Hosom, X. Niu, J. P. H. van Santen, M. Fried-Oken, and J. Staehely, "Improving the intelligibility of dysarthric speech," Speech Communication, vol. 49, no. 9, pp. 743–759, Sep. 2007.
[11] B. Tomik and R. J. Guiloff, "Dysarthria in amyotrophic lateral sclerosis: A review," Amyotroph. Lateral Scler., vol. 11, no. 1–2, pp. 4–15, 2010.
[12] K. C. Hustad, "The Relationship Between Listener Comprehension and Intelligibility Scores for Speakers With Dysarthria," J. Speech Lang. Hear. Res., vol. 51, no. 3, pp. 562–573, Jun. 2008.
[13] W. Jones, P. Mathy, T. Azuma, and J. Liss, "The Effect of Aging and Synthetic Topic Cues on the Intelligibility of Dysarthric Speech," AAC: Augmentative and Alternative Communication, vol. 20, no. 1, pp. 22–29, 2004.
[14] J. Murphy, "Communication strategies of people with ALS and their partners," Amyotrophic Lateral Sclerosis and Other Motor Neuron Disorders, vol. 5, no. 2, pp. 121–126, Jun. 2004.
[15] I. R. Murray and J. L. Arnott, "Synthesizing emotions in speech: is it time to get excited?," in Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP 96), 1996, vol. 3, pp. 1816–1819.
[16] M. S. Yakoub, S.-A. Selouani, and D.
O'Shaughnessy, "Improving dysarthric speech intelligibility through re-synthesized and grafted units," in Canadian Conference on Electrical and Computer Engineering (CCECE 2008), 2008, pp. 001523–001526.
[17] N. Yousefian and P. C. Loizou, "A Dual-Microphone Speech Enhancement Algorithm Based on the Coherence Function," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 599–609, 2012.
[18] "ALS Facts," Amyotrophic Lateral Sclerosis Society of Canada.
[19] "Canadian Cancer Statistics 2012." Canadian Cancer Society, 2012.
[20] G. S. Kang and T. M. Moran, "Speech enhancement in noise and within face mask (microphone array approach)," in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, vol. 2, pp. 1017–1020.
[21] P. Goel and A. Garg, "Review of Spectral Subtraction Techniques for Speech Enhancement," IJECT, vol. 2, no. 4, 2011.
[22] E. Nemer and W. Leblanc, "Single-microphone wind noise reduction by adaptive postfiltering," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '09), 2009, pp. 177–180.
[23] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[24] S. Ahn and H. Ko, "Background noise reduction via dual-channel scheme for speech recognition in vehicular environment," IEEE Transactions on Consumer Electronics, vol. 51, no. 1, pp. 22–27, 2005.
[25] Y. Ephraim and D. Malah, "Speech enhancement using optimal non-linear spectral amplitude estimation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '83), 1983, vol. 8, pp. 1118–1121.
[26] Lu-ying Sui, Xiong-wei Zhang, Jian-jun Huang, and Bin Zhou, "An improved spectral subtraction speech enhancement algorithm under non-stationary noise," in 2011 International Conference on Wireless Communications and Signal Processing (WCSP), 2011, pp. 1–5.
[27] Bing-yin Xia, Yan Liang, and Chang-chun Bao, "A modified spectral subtraction method for speech enhancement based on masking property of human auditory system," in 2009 International Conference on Wireless Communications & Signal Processing (WCSP), 2009, pp. 1–5.
[28] B. H. Juang and L. R. Rabiner, "Automatic Speech Recognition – A Brief History of the Technology Development," in Elsevier Encyclopedia of Language and Linguistics, 2nd ed., 2005.
[29] E. J. Keogh and M. J. Pazzani, "Scaling up Dynamic Time Warping to Massive Datasets," in Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery, 1999, pp. 1–11.
[30] T. Oates, L. Firoiu, and P. R. Cohen, "Using Dynamic Time Warping to Bootstrap HMM-Based Clustering of Time Series," in Sequence Learning: Paradigms, Algorithms, and Applications, 2001, pp. 35–52.
[31] H. Tolba and A. S. El Torgoman, "Towards the improvement of automatic recognition of dysarthric speech," in 2nd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2009), 2009, pp. 277–281.

APPENDICES

Appendix A: Procedure for Collecting Speech Samples of a PALS

The following is a step-by-step outline of the procedure to be conducted at the subject's home.

Introduction – 5 minutes

•  Welcome and thank patients for their participation

•  Explain the purpose of the study, their right to terminate it at any time, and the outline of the procedure

•  Sign consent form

Prototype Setup – 2 minutes

Setup of the prototype involves insertion of the microphone into the mask for recording.

Experiment #1 – 4 minutes

Patient speaks the same 3 sentences and paragraph with the mask and inserted microphone.
Samples are recorded.

Experiment #2 – 4 minutes
The sound filter is enabled via a toggle switch, and the patient speaks the same 3 sentences and paragraph with the mask and inserted microphone. Samples are recorded.

Debrief – 5 minutes
The debrief involves answering any questions the subject may have. The subject is then thanked for participating in the study.

Summary
Total visits: 1
Total duration: 20 minutes

Appendix B: Phoneme Analysis Data Sheets

In each sheet, columns 1–3 give the number of misidentified phonemes in each of the three scored words of a sentence (as transcribed by the listener), followed by the per-sentence total and the percentage of the sentence’s 11 phonemes identified correctly.

No Filtering (SNR = 12.5 dB)
Words                                  1    2    3  Total   Score

Trial 1
The fight is waiting the cop.          0   -1   -1    -2    81.8%
The dime is burying the bet.           0   -1    0    -1    90.9%
The view is meeting the pack.         -1   -2   -1    -4    63.6%
The bin is doing the fin.              0   -2   -2    -4    63.6%
The page is licking the vine.         -2    0   -3    -5    54.5%

Trial 2
The fight is waiting the cop.          0   -1   -1    -2    81.8%
The dime is burying the bat.           0   -1   -1    -2    81.8%
The view is meeting the pet.          -1   -2   -1    -4    63.6%
The bin is doing the thing.            0   -2   -3    -5    54.5%
The beat is making the rock.          -1   -2   -1    -4    63.6%

Trial 3
The fight is winning the cop.          0   -2   -1    -3    72.7%
The dime is burying the vet.           0   -1   -1    -2    81.8%
The view is meeting the cat.          -1   -2   -1    -4    63.6%
The bin is doing the fig.              0   -2   -2    -4    63.6%
The meat is making the rock.          -2   -2   -1    -5    54.5%

Mean % of Correctly Identified Phonemes per Sentence:
Sentence #1 78.8%, #2 84.8%, #3 63.6%, #4 60.6%, #5 57.6%
Overall % of Correctly Identified Phonemes: 69.1%

RT Filtered (SNR = 23.45 dB)
Words                                  1    2    3  Total   Score

Trial 1
The fight is waving the towel.         0   -1   -3    -4    63.6%
The dime is wearing the vest.          0   -1   -2    -3    72.7%
The zoo is soothing the pet.          -2   -3   -1    -6    45.5%
The bin is doing the shade.            0   -2   -1    -3    72.7%
The beat is making the rock.          -1   -2   -1    -4    63.6%

Trial 2
The fight is waving the tub.           0   -1   -2    -3    72.7%
The guide is wearing the pet.         -2   -1   -1    -4    63.6%
The do is zooting the pack.            0   -3   -1    -4    63.6%
The bin is doing the fade.             0   -2    0    -2    81.8%
The beat is making the rock.          -1   -2   -1    -4    63.6%

Trial 3
The fight is waving the tub.           0   -1   -2    -3    72.7%
The dine is wearing the bet.          -1   -1    0    -2    81.8%
The do is zooting the pat.             0   -3    0    -3    72.7%
The bin is stewing the fade.           0    0    0     0   100.0%
The date is making the rock.          -1   -2   -1    -4    63.6%

Mean % of Correctly Identified Phonemes per Sentence:
Sentence #1 69.7%, #2 72.7%, #3 60.6%, #4 84.8%, #5 63.6%
Overall % of Correctly Identified Phonemes: 70.3%

Table 26 - Phoneme Analysis of RS

No Filtering (SNR = 12.5 dB)
Words                                  1    2    3  Total   Score

Trial 1
the fight is waiting the car           0   -1   -2    -3    72.7%
the dime is baring the dame            0    0   -3    -3    72.7%
the view is leading the pack          -1   -1   -1    -3    72.7%
the bend is chewing the food          -2   -2   -1    -5    54.5%
the bait is making the bard            0   -2   -4    -6    45.5%

Trial 2
the fight is waiting the car           0   -1   -2    -3    72.7%
the dime is bearing the bird           0    0   -3    -3    72.7%
the dew is leaving the pack            0   -1   -1    -2    81.8%
the bend is chewing the food          -2   -2   -1    -5    54.5%
the bait is making the rod             0   -2   -1    -3    72.7%

Trial 3
the fight is waiting the car           0   -1   -2    -3    72.7%
the dime the bearing the debt          0    0   -3    -3    72.7%
the dew is leading the pack            0   -1   -1    -2    81.8%
the bend is chewing the food          -2   -2   -1    -5    54.5%
the bait is making the vine            0   -2   -3    -5    54.5%

Mean % of Correctly Identified Phonemes per Sentence:
Sentence #1 72.7%, #2 72.7%, #3 78.8%, #4 54.5%, #5 57.6%
Overall % of Correctly Identified Phonemes: 67.3%

RT Filtered (SNR = 23.45 dB)
Words                                  1    2    3  Total   Score

Trial 1
the fight is waiting the cub           0   -1   -1    -2    81.8%
the die is wearing the bat            -1   -1   -1    -3    72.7%
the dew is leading the past            0   -1   -2    -3    72.7%
the den is stealing the spade         -2   -2   -2    -6    45.5%
the bait is making the run             0   -2   -2    -4    63.6%

Trial 2
the fight is waiting the tub           0   -1   -2    -3    72.7%
the dye is wearing the pants          -1   -1   -4    -6    45.5%
the dew is sleeping the pack           0   -1   -1    -2    81.8%
the den is stewing the fade           -2    0    0    -2    81.8%
the bait is making the run             0   -2   -2    -4    63.6%

Trial 3
the fight is waiting the top           0   -1   -2    -3    72.7%
the die is wearing the pants          -1   -1   -4    -6    45.5%
the dew is sleeping the past           0   -1   -2    -3    72.7%
the den is stewing the fade           -2    0    0    -2    81.8%
the bait is making the run             0   -2   -2    -4    63.6%

Mean % of Correctly Identified Phonemes per Sentence:
Sentence #1 75.8%, #2 54.5%, #3 75.8%, #4 69.7%, #5 63.6%
Overall % of Correctly Identified Phonemes: 67.9%

Table 27 - Phoneme Analysis of EC

No Filtering (SNR = 12.5 dB)
Words                                  1    2    3  Total   Score

Trial 1
the fight is waiting the cob           0   -1    0    -1    90.9%
the dine is                           -1   -5   -3    -9    18.2%
the dew is leaving the path            0   -1   -1    -2    81.8%
the bin is spewing the                 0   -1   -3    -4    63.6%
the fate is making the rock           -1   -2   -1    -4    63.6%

Trial 2
the fight is waiting the cob           0   -1    0    -1    90.9%
the dine is bearing the bear          -1    0   -1    -2    81.8%
the dew is leaving the path            0   -1   -1    -2    81.8%
the bin is spewing the fish            0   -1   -2    -3    72.7%
the bait is making the rock            0   -2   -1    -3    72.7%

Trial 3
the fight is waiting the cob           0   -1    0    -1    90.9%
the dime is bearing the bear           0    0   -1    -1    90.9%
the dew is leaving the path            0   -1   -1    -2    81.8%
the bin is spewing the fish            0   -1   -2    -3    72.7%
the fate is making the rock           -1   -2   -1    -4    63.6%

Mean % of Correctly Identified Phonemes per Sentence:
Sentence #1 90.9%, #2 63.6%, #3 81.8%, #4 69.7%, #5 66.7%
Overall % of Correctly Identified Phonemes: 74.5%

RT Filtered (SNR = 23.45 dB)
Words                                  1    2    3  Total   Score

Trial 1
The fight is waiting the tub           0   -1   -2    -3    72.7%
The dine is wearing the bear          -1   -1   -1    -3    72.7%
The dew is reaping the path            0   -2   -1    -3    72.7%
The bin is spewing the spade           0   -1   -2    -3    72.7%
The bait is making the rock            0   -2   -1    -3    72.7%

Trial 2
The fight is waiting the tub           0   -1   -2    -3    72.7%
The die is wearing the bear           -1   -1   -1    -3    72.7%
The dew is reaping the path            0   -2   -1    -3    72.7%
The bin is stewing the spade           0    0   -2    -2    81.8%
The bait is making the rock            0   -2   -1    -3    72.7%

Trial 3
The fight is waiting the tub           0   -1   -2    -3    72.7%
The dine is wearing the bear          -1   -1   -1    -3    72.7%
The stew is reaping the path          -1   -2   -1    -4    63.6%
The bin is stewing the sage            0    0   -2    -2    81.8%
The bait is making the rock            0   -2   -1    -3    72.7%

Mean % of Correctly Identified Phonemes per Sentence:
Sentence #1 72.7%, #2 72.7%, #3 69.7%, #4 78.8%, #5 72.7%
Overall % of Correctly Identified Phonemes: 73.3%

Table 28 - Phoneme Analysis of JW
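The arithmetic behind these sheets is simple: each test sentence carries 11 scored phonemes (the percentages are all multiples of 1/11), the per-word entries are non-positive counts of misidentified phonemes, and a sentence's score is the fraction of its phonemes identified correctly. A minimal sketch of that computation, using one trial's error counts from the data above as illustrative input:

```python
# Reproduce the per-sentence scores in the phoneme analysis sheets.
# Each test sentence contains 11 scored phonemes; the three per-word
# entries are non-positive counts of misidentified phonemes.

PHONEMES_PER_SENTENCE = 11

def percent_correct(word_errors):
    """Percentage of a sentence's phonemes identified correctly."""
    total = sum(word_errors)  # e.g. (0, -1, -1) -> -2
    return 100.0 * (PHONEMES_PER_SENTENCE + total) / PHONEMES_PER_SENTENCE

# Illustrative values: Table 26 (RS), no filtering, trial 1
trial = [(0, -1, -1), (0, -1, 0), (-1, -2, -1), (0, -2, -2), (-2, 0, -3)]
scores = [round(percent_correct(w), 1) for w in trial]
print(scores)  # -> [81.8, 90.9, 63.6, 63.6, 54.5]
```

The per-sentence means reported in each sheet are then averages of a sentence's score across the three trials, and the overall figure is the mean of the five sentence means.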