UBC Theses and Dissertations

An evaluation of adaptive noise cancellation as a technique for enhancing the intelligibility of noise-corrupted… Neufeld, Leona Arlene 1992

AN EVALUATION OF ADAPTIVE NOISE CANCELLATION AS A TECHNIQUE FOR ENHANCING THE INTELLIGIBILITY OF NOISE-CORRUPTED SPEECH FOR THE HEARING IMPAIRED

by

Leona Arlene Neufeld

B.Sc., The University of Calgary, 1983

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE

in

THE FACULTY OF GRADUATE STUDIES
DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

April 1992

© Leona Arlene Neufeld, 1992

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada

Date: April 22, 1992

Abstract

The speech-to-noise ratio required under noisy conditions so that intelligibility is comparable to that in quiet is significantly higher for a hearing-impaired individual than a normal-hearing person. In this research a simple, efficient, real-time implementation of a single-input speech-enhancement scheme, designed to increase the intelligibility of speech corrupted by additive noise using adaptive noise cancellation (ANC) techniques was evaluated with speech-discrimination tests using hearing-impaired subjects.
Although many single-input speech-enhancement schemes have been shown to provide insignificant gains for the normal-hearing person, this work found that a hearing-impaired person’s ability to understand the speech processed using ANC can improve noticeably, as compared to results obtained with unprocessed speech.

The algorithm takes advantage of a novel speech-detection algorithm recently developed at the University of British Columbia to classify individual signal segments as silence or speech; the coefficients of an adaptive filter are updated during signal periods classified as ‘silence’. Two variants of the least-mean square (LMS) algorithm, the leaky- and normalized-LMS, are integrated in the weight-update process to produce a noise-reduction scheme which adapts itself to the power of the input signal and also corrects for ill-conditioned inputs. The resulting speech-enhancement scheme is most effective with narrowband, quasi-stationary noises and robust under changing noise conditions. Although often used individually, to our knowledge the incorporation of both LMS variants into a single algorithm is unique to this research.

Speech Perception in Noise (SPIN) tests were used on eight hearing-impaired subjects to evaluate the algorithm’s performance. Two types of noise background were used. Of the four subjects who heard motor noise, raw scores increased by 20 percent for the enhanced speech. These results were shown to be statistically significant using a two-way analysis of variance with repeated measures.
However for those who heard babble or ‘cafeteria’ noise, the speech-enhancement process evidenced little improvement over scores from unprocessed speech. Further clinical evaluation is necessary, with a larger sample base, since the limited size of the study sample precluded assessing the algorithm with respect to specific subject factors such as degree or type of hearing loss.

Contents

Abstract
List of Tables
List of Figures
List of Abbreviations
Acknowledgments
Chapter 1 Introduction
  1.1 Speech Modeling
  1.2 Design Constraints
  1.3 Thesis Outline
Chapter 2 Background
  2.1 Survey of Speech Enhancement Techniques
  2.2 Adaptive Noise Cancellation
  2.3 Single-Input Adaptive Noise Cancellation
Chapter 3 LMS Algorithm
  3.1 Convergence Properties
  3.2 Performance Issues
    3.2.1 Nonstationary Inputs
    3.2.2 Ill-Conditioned Inputs
    3.2.3 Finite-Precision Effects
  3.3 Leaky-LMS Algorithm
  3.4 Final Algorithm
Chapter 4 Speech / Silence Detection
  4.1 Silence Detection Algorithm
  4.2 Algorithm Modifications
Chapter 5 Implementation
  5.1 Processing Environment
  5.2 Parameter Selection
    5.2.1 Stimuli
    5.2.2 LMS Algorithm
      TAPSL
      BETAO
      GAMMA7
    5.2.3 Speech Detection Algorithm
      Implementation Details
      Low Threshold ZLow
      Average Threshold ZCrit
      High Threshold ZHigh
  5.3 Final Parameter Values
Chapter 6 Clinical Evaluation
  6.1 Revised SPIN Test
  6.2 Experimental Conditions
  6.3 Presentation
  6.4 Procedure
  6.5 Subjects
  6.6 Results
    6.6.1 Statistical Evaluation
      Subject Reliability
    6.6.2 Comparison of High and Low Predictability Scores
      Subject Reliability
  6.7 Subjective Comments
Chapter 7 Discussion
  7.1 Validity of Subjective Tests
    7.1.1 Generalizations
    7.1.2 Comments on R-SPIN Presentation
  7.2 Performance Limitations
  7.3 Summary
  7.4 Proposed Research
References
Appendix A Alpha Dependence
Appendix B Speech Threshold Tests
Appendix C Selected C Code Fragments
Appendix D Instructions
Appendix E Subject Characteristics

List of Tables

Table I Computer-Generated Noise Composition
Table II Transient Performance: Computer-Generated Noise, Condition (1)
Table III Steady-State Characteristics: Computer-Generated Noise, Condition (1)
Table IV Final Parameter Values
Table V Raw Scores
Table VI ANOVA Results: Motor Noise
Table VII ANOVA Results: Babble Noise
Table VIII Comparison of High and Low Predictability Scores
Table IX High Predictability ANOVA Results: Motor Noise
Table X Low Predictability ANOVA Results: Motor Noise
Table XI High Predictability ANOVA Results: Babble Noise
Table XII Low Predictability ANOVA Results: Babble Noise
Table XIII Subject Characteristics

List of Figures

Figure 1 Conventional Adaptive Noise Cancellation
Figure 2 Adaptive Line Enhancer
Figure 3 Adaptive Predictor
Figure 4 Adaptive Transversal Filter
Figure 5 Leaky-Normalized-LMS Algorithm
Figure 6 Speech Detection Algorithm Flow Chart
Figure 7 System Configuration
Figure 8 Signal Characteristics of Computer-Generated Noise, Condition (1)
Figure 9 Signal Characteristics of Vacuum-Cleaner Noise
Figure 10 Signal Characteristics of Speech-Babble Noise
Figure 11 Filter Transfer Function: Computer-Generated Noise, Condition (1)
Figure 12 Convergence Speed versus Filter Length: Computer-Generated Noise, Condition (1)
Figure 13 Steady-State Error versus Filter Length: Computer-Generated Noise, All Conditions
Figure 14 Comparison between Conventional and Normalized LMS Algorithms
Figure 15 Effect of Beta on Convergence Speed: Computer-Generated Noise, Condition (1)
Figure 16 Time Constant versus Convergence Factor Beta: Computer-Generated Noise, Condition (1)
Figure 17 Misadjustment versus Convergence Factor Beta: Computer-Generated Noise, Condition (1)
Figure 18 Effect of Alpha on Transfer Function Shape: Computer-Generated Noise, Condition (1)
Figure 19 Alpha Dependence of Leaky-LMS Algorithm: All Conditions
Figure 20 Gamma Dependence of Leaky-Normalized LMS Algorithm: All Conditions
Figure 21 Basic Speech-Detection Algorithm: Vacuum Motor Noise
Figure 22 Basic Speech-Detection Algorithm: Computer-Generated Noise, Condition (1)
Figure 23 Gamma Correction: Computer-Generated Noise, Condition (1)
Figure 24 Basic Speech-Detection Algorithm: Babble Noise
Figure 25 Experimental Setup
Figure 26 Basic Speech-Detection Algorithm: Computer-Generated Noise, Condition (2)
Figure 27 Basic Speech-Detection Algorithm: Computer-Generated Noise, Condition (3)
Figure 28 Basic Speech-Detection Algorithm: Computer-Generated Noise, Condition (4)

List of Abbreviations

ALE adaptive line enhancer
ANC adaptive noise cancellation
ANOVA analysis of variance
ANSI American National Standards Institute
ASHA American Speech and Hearing Association
CCF cross-correlation coefficient
CG computer-generated
CSA Canadian Standards Association
dB decibel
DSP digital signal processing
ECG electrocardiography
HL hearing level
HP high-predictability
Hz Hertz
kb kilobyte
kHz kilo-Hertz
LMS least-mean square
LP low-predictability
LPC linear-predictive coding
ms milli-second
MSCCF mean-segmented cross-correlation coefficient
MSE mean-square error
R-SPIN revised speech perception in noise
s second
S.D. standard deviation
SNR signal-to-noise ratio
SPIN speech perception in noise
SPL sound pressure level
SSSA short-time spectral amplitude
TCI telephone / computer interface
VU volume unit

Acknowledgments

I would like to express my sincere gratitude to my supervisor, Dr. Charles Laszlo, for his guidance and encouragement during the course of this research. I would also like to thank Chris Rose and Dr. R. Donaldson for the use of their speech-detection algorithm and the accompanying technical advice.

Audiologist Suzanne Thom, who generously lent her time and expertise to the clinical evaluation portion of this research, deserves special acknowledgment. Thanks must also be extended to the School of Audiology and Speech Sciences at UBC, and in particular Dr. Andre-Pierre Benguerel and Noelle Lamb, for the use of their facilities and equipment. To the volunteers who participated in the experiment, your help was invaluable and greatly appreciated.

Lastly, I would like to thank my mother, family and friends whose constant encouragement did not go unnoticed; to Pat Chaba, for his faith, inspiration and steadfast support, I am especially indebted.

This work was financed by a grant from the Natural Sciences and Engineering Research Council of Canada.

In memory of my father, William Neufeld.

Chapter 1 Introduction

An individual’s capacity to discern sound can become impaired through prolonged exposure to loud noise, disease, age-related processes, damage to the inner ear, or other means. Regardless of the cause the result is the same: a reduced ability to understand speech and hence communicate effectively. Communication is further impaired by the introduction of noise into the listening environment: the speech-to-noise ratio required under noisy conditions so that intelligibility is comparable to that in quiet is significantly higher for a hearing-impaired individual than a normal-hearing person [1].
Amplification alone is not an effective solution in that the relationship of speech and noise remains constant; hearing-assistive devices which amplify noise and speech indiscriminately become increasingly inadequate as background noise increases.

One motivation for this research was the difficulty that hearing-impaired individuals experience when communicating over a noisy telephone connection. Interfering sounds may emanate from the environment of the calling or receiving party on either side of the telephone connection, or derive from line disturbances in the telephone circuit itself. Crosstalk from energy in adjacent circuits, as well as neighboring power lines, broadcast transmissions and electrical machinery can worsen line conditions. Although telephone systems are specifically designed for low transmission distortion as well as high immunity to crosstalk and noise, the amplification required by the hearing-impaired may raise the levels of such noise above their hearing threshold, where it can interfere with the speech signal.

This research is concerned with enhancing the intelligibility of speech corrupted by additive noise through the use of adaptive noise cancellation techniques.
In particular, the algorithm is implemented as part of a telephone handset, reducing noise in the incoming speech signal. Evaluation of algorithm performance is specifically targeted towards hearing-impaired listeners. Although many single-input speech-enhancement schemes have been found to provide insignificant gains for the normal-hearing person, it is shown that a hearing-impaired person’s ability to understand speech processed using adaptive noise cancellation can improve noticeably as compared to results obtained with unprocessed speech.

1.1 Speech Modeling

Noise-reduction techniques directed towards speech enhancement rely to varying extents on the characteristics of the model used in representing the mechanics of speech production; hearing-loss effects can be interpreted in relation to this model. Therefore it is important to first understand the process by which speech is generated, as well as the perceptual cues that contribute to speech intelligibility.

The principal component of the speech-production mechanism is the vocal tract. It is essentially an acoustic tube terminated at one end by the vocal cords and at the other end by the lips. An auxiliary tube, the nasal tract, can be connected or disconnected by means of a small flap of tissue, the velum. Motion of the tongue, lips, jaw and velum changes the shape of the vocal tract, affecting the resulting sound quality as well as introducing additional sounds or interruptions.
Sounds can be broadly classified in terms of vocal-tract shape and excitation [2]:
• VOICED: the vocal tract is excited with quasi-periodic pulses of air pressure caused by a vibration of the vocal cords.
• FRICATIVE: a constriction is formed within the vocal tract; forcing air through this constriction generates quasi-random pressure variations which stimulate the vocal tract.
• PLOSIVE: the vocal tract is completely closed; pressure is built up and then quickly released, exciting the vocal tract.

The vocal-tract response is commonly modeled as a linear time-varying filter, characterized by its natural frequencies, called formants, which correspond to resonances in the sound-transmission characteristics of the vocal tract. Voiced sounds are produced by exciting the filter with a periodic pulse train; non-periodic sounds are generated using wideband noise as the filter excitation. The spectral characteristics of the resulting speech vary with changes in the shape and excitation of the vocal tract: the shape determines the envelope of the output speech spectrum, while the fine structure of the spectrum is dictated by the excitation. Since motion of the tongue, lips and vocal cords is constrained in time by mechanical and physiological limits, research has demonstrated that it is reasonable to represent speech as the output of a linear system with slowly-varying properties that remain relatively invariant over short intervals of between 10 and 30 ms [3].

Speech perception can be analyzed in terms of the sound cues derived from the distinctive components of a given language. Vowels are generally quasi-periodic, lower in frequency, and higher in intensity in comparison to consonants. They provide power and energy to speech. In contrast, consonants represent a smaller proportion of signal energy but are more important to intelligibility.
Unvoiced consonants in particular exhibit still lower energy and demonstrate random characteristics. The peak amplitude of an unvoiced segment is typically much lower than that of a voiced interval; therefore, in terms of energy, the overall dynamic range of speech is very large. Notably, the less intense segments of speech that contribute the most to intelligibility are the first to be masked by noise [2, 3].

Speech perception is also influenced by the short-time spectrum of the speech waveform. In particular, the short-time spectral magnitude contributes significantly to speech perception, whereas phase has been found to be comparatively unimportant. Although a normal-hearing person perceives sounds in a frequency range which spans from 16 Hz to 16 kHz, most of the spectral detail important to intelligibility is contained in the variation of the first three formants, typically located below 3 kHz. Moreover the second formant appears to provide more information than the first. Approximately sixty percent of speech power is located below 500 Hz; most of this power derives from vowels which contribute minimally to intelligibility. Speech sounds having energy distributed above 1000 Hz contribute approximately five percent of total speech power but approximately sixty percent of intelligibility cues [4].

The most tangible effect of hearing loss is the elevation of the thresholds where an individual first perceives sound. This is compounded by a phenomenon termed ‘loudness recruitment’, often experienced by persons with cochlear impairments. The sensation of loudness grows more rapidly with increasing sound level than in the normal ear; a slight increase in intensity is accompanied by an enormous increase in perceived loudness. A combination of increased threshold and recruitment reduces the dynamic range over which sounds can be tolerated.
If speech is amplified to bring the high frequencies to comfortable perception levels, the amplitude of low frequency components may be greater than the minimum discomfort level. Amplification with amplitude compression may not suffice since the intensity-discrimination ability of a hearing-impaired person is no better than that of a normal-hearing person while compression can actually reduce intensity changes [5].

Sensorineural-impaired individuals often suffer from a decrease in speech discrimination attributed in part to masking of speech sounds in one frequency region by those in another portion of the speech spectrum. Some types of hearing loss can result in an upward spread of the energy from the first formant into the region of the second, masking the more important perceptual information. This effect is most pronounced in persons whose hearing loss at higher frequencies is more severe than that at lower frequencies [6]. One explanation for this effect is related to the frequency selectivity of the inner ear, which is composed of sensory cells that are excited by sound-induced vibrations. These cells can be divided into regions sensitive to a specific frequency range or ‘critical bands’ over which the ear integrates energy. In sensorineural impairments the critical bands become enlarged; a pure tone will excite a larger area of the cell membrane. Energy is smeared over more than one band, reducing an individual’s ability to resolve the spectral components of complex sounds. In noisy conditions, the difference between the momentary spectra of noise and speech may be insufficient to prevent confusion; competing sounds can thus partially or completely obscure important perceptual cues. The masking noise induces an elevation in the individual’s sound threshold proportional to the noise energy within the critical band associated with the masking noise.
This threshold is generally higher in the hearing impaired than in normal-hearing persons [7]. Narrowband interference such as pure tones is more adverse to impaired listeners than equally intense wideband stimuli. Narrowband stimuli are also perceived as more unpleasant to listen to than broadband noise. For this reason, broadband noise has been used to mask narrowband noise to improve the pleasantness of speech [3, 5].

Hearing-impaired persons also suffer from reduced temporal discrimination, which affects their ability to resolve auditory events in the time domain. Since hearing loss is generally more pronounced at higher frequencies, rapid speech can cause low-frequency, high-power vowels to swamp consonants. The ability to separate sounds spatially is also affected since temporal cues are exploited in localizing sound [5].

1.2 Design Constraints

The primary goal of this research is to determine a practical method of improving the intelligibility of noisy speech for the hearing-impaired where the only input available is noise-corrupted speech. Although there are numerous instances where multi-microphone noise-cancellation techniques have proven successful, there are many applications where a second sensor input is unavailable, such as the telephone or hearing aid.

Our objective is to develop a single-input algorithm for enhancing speech that is easily realizable, making both cost and implementation complexity important factors. The algorithm should be robust under diverse noise conditions, operate in real-time, and require a minimum of storage and computational facilities. In addition, the algorithm should be general enough that it has possible application in a variety of areas. However the algorithm is not intended for use in extremely harsh noise environs; rather, a SNR of better than 6 dB is assumed.
Neither the characteristics of speech nor noise are of a fixed nature; therefore to ensure that the noise-cancellation technique will work under the greatest possible number of signal conditions, the algorithm must use a minimum of a priori information about signal conditions, and be able to adapt to changes in signal characteristics in real-time.

An improvement in speech quality, usually defined as an increase in signal-to-noise ratio (SNR), does not necessarily imply an increase in intelligibility. Single-input noise cancellation methods which are designed strictly to improve SNR have been shown to have little effect on speech intelligibility and in some cases to actually decrease discrimination scores [8–12]. Therefore the final noise-reduction algorithm must prove effective in terms of clinical speech-discrimination tests using hearing-impaired subjects.

1.3 Thesis Outline

A survey of noise cancellation techniques is presented in the following chapter, focusing on the specific method selected for the purposes of this research: adaptive noise cancellation. A filter that is capable of modifying its own parameters with respect to changes in the noise characteristics is proposed.

Chapter 3 discusses the least-mean-square (LMS) algorithm used to iteratively adapt the filter weights with respect to the noise background. The convergence properties of this algorithm applicable to its use in this implementation are explained in detail.
Two variants of the LMS algorithm, the normalized-LMS and leaky-LMS, are presented that rectify some of the problems related to the use of the LMS algorithm when applied to ill-conditioned signals.

Single-input adaptive noise cancellation applies knowledge of signal characteristics determined during periods when speech is absent from the input to formulate the noise-reduction filter. The method used to discriminate between noise and speech is described in Chapter 4.

The subsequent chapter details aspects of the algorithm implementation in both hardware and software. The procedures used to determine individual parameters for both the noise-reduction and speech-detection algorithms are discussed as well as the performance improvements provided by the leaky-LMS and normalized-LMS variants.

An assessment of the performance of the noise cancellation algorithm in terms of improvement in speech intelligibility is only possible through clinical trials with hearing-impaired persons. Subjective evaluation procedures and results using the Revised Speech Perception in Noise (R-SPIN) test are found in Chapter 6.

This paper concludes with a discussion of the validity of the R-SPIN test results, limitations of the proposed algorithm, possible applications, conclusions and proposed research.

Chapter 2 Background

In the context of this research, the classical method of noise reduction which applies Wiener techniques resulting in filters with fixed coefficients is not viable in that it assumes that the signals are stationary and that an accurate description of the signal statistics is known prior to processing. Both speech and noise have nonstationary properties, and a priori knowledge of the noise characteristics in particular cannot be assumed given the design constraints.
Therefore other known aspects of speech production and speech perception must be exploited, as well as the properties of the noise environment which can be estimated from the corrupted speech.

2.1 Survey of Speech Enhancement Techniques

A comprehensive review of methods specific to the problem of enhancing speech corrupted by additive noise is found in a paper by Lim and Oppenheim [3]. The methods discussed were restricted to those employing only a single sensor input in the noise-cancellation process. It was found that “almost all of these systems in fact reduce intelligibility and those that do not tend to degrade the quality”. Notably, most of these algorithms were evaluated based on subjective tests by people with normal hearing. For our purposes, a method is desired that increases the ability of a hearing-impaired individual to understand noisy speech; the quality of the filtered speech is significant only in how it relates to speech intelligibility. Simplicity, efficiency and real-time performance are also important considerations.

A number of related algorithms capitalize on the importance of the short-time spectral amplitude (SSSA) in speech perception. They rely on the fact that when speech and noise are statistically independent, the power spectral density of the noise-corrupted signal is the sum of the spectra of the individual signals. A windowed segment of the noisy signal is Fourier transformed to the frequency domain where it is used to form an estimate of the power spectral density of the speech alone. Researchers differ in how the speech spectral estimate is expressly calculated [13, 14]. Boll, for example, manipulates the spectral estimates directly [8]; an estimate of the interference SSSA is subtracted from the spectrum of the noise-corrupted speech to produce an estimate of the speech SSSA.
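The subtraction step described above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only, not Boll's implementation or the code used in this thesis: the function names, the Hann window, the averaging of noise-only frames and the clamping of negative amplitudes to zero are all assumptions made for the example. The sketch reuses the phase of the noisy input when returning to the time domain.

```python
import numpy as np

def estimate_noise_magnitude(noise_frames):
    """Average the windowed magnitude spectra of frames assumed to hold noise only.

    noise_frames: 2-D array, one frame per row (hypothetical input layout).
    """
    window = np.hanning(noise_frames.shape[1])
    return np.abs(np.fft.rfft(noise_frames * window, axis=1)).mean(axis=0)

def spectral_subtract(noisy_frame, noise_magnitude):
    """Return one enhanced (windowed) frame using magnitude spectral subtraction."""
    window = np.hanning(len(noisy_frame))
    spectrum = np.fft.rfft(noisy_frame * window)
    # Subtract the noise amplitude estimate from the noisy amplitude spectrum.
    # Clamping negative results to zero is an assumption of this sketch.
    magnitude = np.maximum(np.abs(spectrum) - noise_magnitude, 0.0)
    # Reuse the phase of the noisy input for the inverse transform.
    enhanced = magnitude * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(enhanced, n=len(noisy_frame))
```

In use, `estimate_noise_magnitude` would be called on frames gathered while speech is absent, and `spectral_subtract` on each subsequent frame; overlap-add reconstruction of a continuous signal is left out to keep the sketch short.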
Since it is known that phase is relatively unimportant to intelligibility, phase information from the original corrupted signal is used to transform the estimate back into the time domain. The noise SSSA estimate is obtained and updated during the absence of speech. This assumes that the background noise is locally stationary, that there exists an interval prior to speech where the SSSA of the noise can be estimated accurately, and that these characteristics remain approximately invariant during the following speech segment. The enhanced speech has been shown to sound more pleasant and less noisy in subjective listening tests when the degrading source is additive noise. Although Boll found that spectral subtraction improves intelligibility in speech compression applications, no significant improvement in intelligibility was demonstrated when this strategy was used strictly for speech enhancement. Spectral-subtraction methods tend to emphasize large spectral components relative to those of smaller amplitude; higher formants, which contribute more significant intelligibility cues, do not receive sufficient emphasis [8, 9].

The periodicity of voiced sounds is exploited in a technique developed by Shields [15]. Since vowels and voiced consonants are distinguished in the frequency domain by their characteristic formant frequencies and related harmonics, a comb filter can be derived which only passes these significant frequencies. Interfering noise is reduced in the frequency regions between the harmonics, while the fundamental periodicity of the desired signal is preserved. Frazier further suggests an adaptive structure which adjusts itself to variations in the fundamental frequency [16]. These methods require an accurate estimation of speech parameters; any error can cause distortion or a significant loss of speech cues. In addition, unvoiced sounds such as fricatives and plosives which do not exhibit periodic characteristics are not accounted for.
Studies have shown that although the enhanced speech output sounds less noisy, speech intelligibility is actually decreased [10, 11]. However an extension of this method has been applied to the problem of distinguishing two competing speakers with some success [17].

Thomas and Niederjohn suggested a processing scheme to enhance the intelligibility of clean speech transmitted in harsh noise environments [18]. The signal is first highpass filtered to reduce the effect of the first formant, which contributes little to intelligibility and tends to mask more significant high-frequency formants. This is followed by infinite-clipping of large peaks to enable an increase in signal amplitude, improving the SNR without surpassing the dynamic range of the transmission system. Thomas and Ravindran reported a noticeable improvement when this same technique was applied to speech degraded by wideband random noise [19]. However a study of hearing-impaired persons by Young found that peak-clipping decreased the intelligibility of speech in the presence of a competing talker [12]. Lim observes that the improvement in speech intelligibility reported by Thomas may be attributed more to the filtering process than to the clipping operation itself. By reducing the masking of the high frequency formants, highpass filtering increases the amplitude of low-energy events such as consonants as compared to higher-energy vowels.

Another class of enhancement techniques exploits the underlying model for speech production. The speech model parameters are first estimated from the corrupted signal and a clean speech signal is then synthesized from these parameters. Various methods have been developed to determine the parameters for the model of the vocal-tract function [20–23].
Althoughan improvement with respect to speech quality in the context of bandwidth compression and inparticular Linear Predictive Coding (LPC) has been reported, the performance of LPC vocodersis susceptible to noise and degrades quickly as the SNR decreases [24]. Source parametersare not easily acquired in the presence of noise; they require the classification of speechsegments as voiced, unvoiced or silence and in the case of voiced speech, necessitate pitchdetermination. According to Lim and Oppenheim, “essentially all algorithms for determination ofexcitation parameters with non-degraded speech become seriously degraded with even moderatesignal-to-noise ratios” [3].The method selected for the purposes of this investigation is a variation of adaptive noisecancellation (ANC). Adaptive filters are simple, efficient and have the ability to modify their ownparameters as the characteristics of the input signal change. Their design requires little a prioriknowledge of signal or noise characteristics; the filter response is adjusted with respect to an errorsignal which is derived in some way from the filter output. Conventional two-input ANC has beenused in a variety of applications which include: cancelling 60 Hz interference in electrocardiography(EGG), reducing the maternal heartbeat in fetal ECG, cancelling antenna sidelobe interference [25],improving voice communication at noisy sites such as aircraft cockpit or automobile environments[26, 27], restoring degraded audio signals [28], as well as removing echoes in long distancetelephone lines [29]. Research has also shown success in the practicality of two-channel ANC for9BackgroundFigure 1 Conventional Adaptive Noise Cancellationhearing-aid use [30—32]. 
Recently a single-input adaptive noise-cancelling hearing aid has been implemented commercially and proven effective in increasing the intelligibility of received speech for some types of hearing loss: the coefficients of an adjustable Wiener filter are modified according to the spectral and temporal characteristics of the noise background, which are determined using pattern-recognition techniques [33].

2.2 Adaptive Noise Cancellation

Conventional ANC employs two input signals to reduce the noise at the output of the system, as illustrated in Figure 1. The primary input is the noise-corrupted signal $s + n_0$. The desired waveform $s$ is assumed to be uncorrelated with the noise $n_0$. The second or reference input $n_1$ is a measure of the background noise alone, which is in some way correlated with the noise in the primary. In simpler terms, there is a linear relation between the noise in both inputs, making it possible to filter the reference input such that the output $y$ is a close approximation to the noise signal $n_0$. The filter output is subtracted from the primary signal, producing an error signal $\epsilon$. It can be shown that by adjusting the filter coefficients to minimize the total power of the output error signal, the error signal $\epsilon$ will be a best fit in the least-squares sense to the signal $s$ [25, 34].

[Figure 1 diagram labels: primary input $s + n_0$, reference input, error output $\epsilon$]

To derive this result mathematically, it is assumed for reasons of simplicity and tractability that $s$, $n_0$, and $n_1$ are statistically stationary, zero-mean processes. The noise signals are uncorrelated with the desired waveform but correlated with each other. The system output, $\epsilon = s + n_0 - y$, is squared to obtain the instantaneous squared error,

$$\epsilon^2 = s^2 + (n_0 - y)^2 + 2s(n_0 - y). \tag{1}$$

Taking expectations of both sides of (1) and utilizing the prior assumption that $s$ is uncorrelated with $n_0$ and $y$, equation (1) becomes,

$$E[\epsilon^2] = E[s^2] + E[(n_0 - y)^2] + 2E[s(n_0 - y)] = E[s^2] + E[(n_0 - y)^2]. \tag{2}$$

If the filter is adjusted such that the output power $E[\epsilon^2]$ is minimized, the signal power $E[s^2]$ will be unaffected and $E[(n_0 - y)^2]$ will be minimized. The filter output $y$ becomes a best least-squares estimate of the noise signal $n_0$. Since $\epsilon - s = n_0 - y$, it also follows that when $E[(n_0 - y)^2]$ is minimized, so is $E[(\epsilon - s)^2]$, and $\epsilon$ becomes a best least-squares estimate of the signal $s$. Theoretically, if $y$ can be made an exact replica of $n_0$, then $\epsilon = s$ and the output is free of noise.

The degree of noise cancellation obtained depends on whether the input signals are stationary or exhibit slowly time-varying characteristics, the correlation between the reference and primary noise signals, as well as the filter structure and adaptive ability of the algorithm used for adjusting the filter coefficients. Filter algorithms vary in both complexity and performance. One of the simplest and most tractable is the least-mean-square (LMS) algorithm, to be described in detail in Chapter 3.

However, in our application an auxiliary reference signal which is free of the desired waveform is not available. Other information must be used to extract the noise from the primary input signal.

2.3 Single-Input Adaptive Noise Cancellation

When only a single input signal is available, the characteristics of either or both the noise and signal are exploited in the noise cancellation process. A noise canceller structure can be formed by using a delayed version of the primary input as the reference signal. This structure, illustrated in Figure 2, is commonly referred to as the Adaptive Line Enhancer (ALE). It is effective in instances where the interference can be assumed to be broadband white noise, while the desired signal is a narrowband periodic signal.
The delay $\Delta$ is chosen long enough to effectively de-correlate the noise components of the primary and reference inputs. The delay, however, must be short enough that the periodic components of both inputs are related by a simple phase shift. The delayed signal is filtered and then subtracted from the instantaneous input. As the weights of the filter are adjusted to minimize the error power, the adaptation process compensates for the phase shift of the periodic signal but cannot track the de-correlated noise components. At the summing junction, the components of the reference that are correlated with the primary input are removed from the error signal. Hence the output of the adaptive filter is an estimate of the periodic components only, and the converged filter has a bandpass transfer function with passbands centered at the frequencies of the periodic signal.

[Figure 2 diagram labels: broadband interference, periodic signal, periodic output]

This method has been applied to the problem of speech signals degraded by additive noise by Sambur [35]. Using the fact that speech sounds such as vowels and voiced consonants can be considered quasi-periodic over short intervals, a delay equivalent to one or two pitch periods of the speech input is used in the reference path. The primary and reference speech signals will therefore be highly correlated with each other and uncorrelated with the broadband noise. However, unvoiced speech exhibits very weak periodic tendencies, hence filter adaptation is only performed during voiced speech. During frames classified as unvoiced or silence, either the filter coefficients are held constant or no filtering is done at all. Sambur reported an improvement in both speech quality and SNR using this method but did not claim any improvement in intelligibility.

A study by Chabries et al., however, found that this method fails to improve intelligibility when using the LMS algorithm for adaptation [36].
The LMS algorithm minimizes the mean-square error with respect to the highest-energy (low-frequency) components of the signal first before concentrating on lower-energy signals; high-frequency, low-energy signals which contribute the most to speech perception are lost at the filter output. Unvoiced sounds may be removed completely if the reference pitch delay is too long. In addition, the filter cannot adapt during unvoiced speech or silence, introducing reverberation when speech sounds change significantly. It also fails with correlated or narrowband interference.

Figure 2 Adaptive Line Enhancer

An alternative structure is feasible when the noise in the primary signal is relatively stationary, narrowband, and present separate from the desired signal over short time intervals, as is the case for speech. In the adaptive predictor structure of Figure 3, the reference input is formed by delaying the noise-corrupted speech input by a single sample period; the periodic components of the primary will then be highly correlated with those in the reference. A speech / silence discriminator determines when the input signal is free of speech. When a silence decision is made, the adaptive algorithm adjusts the weights to minimize the error signal, essentially predicting the noise from past characteristics. When a speech decision is made, adaptation is turned off and the filter weights remain constant. As the algorithm converges, the frequency response of the adaptive predictor network approximates the inverse power spectrum of the noise. A notch filter is formed with stopbands centered at frequencies which correspond to periodic components in the noise [37].

Figure 3 Adaptive Predictor [diagram labels: periodic noise, speech, output speech]

One drawback to this approach is that noise is rarely stationary.
There must also be intervals when speech is not present, of sufficient length and frequency that the characteristics of the noise can be estimated, and these characteristics must not change significantly when speech is present. When these conditions are not satisfied, algorithm performance will deteriorate. Speech enhancement using this method has been demonstrated by Hoy [37] with some success in terms of overall SNR improvement; however, no evidence was found of research into its effect on speech intelligibility.

In comparing the two alternative methods of single-input ANC, it is evident that Sambur’s method is more effective with broadband noise, while the adaptive predictor removes only narrowband noise. However, as noted in Section 1.1, narrowband noise has been shown to be more deleterious to the hearing-impaired. Both algorithms require accurate speech / silence detection; Sambur’s additionally requires an accurate method of pitch detection as well as an expanded decision process which includes a complex voiced / unvoiced classification. Silence decision schemes are never error-free; as noted previously with Sambur’s algorithm, decision error can lead to distortion and missing perceptual cues. Distortion in the adaptive predictor occurs when speech is incorrectly classified as silence, which engenders the possibility of removing speech sounds; however, this error would generally occur at the start and end of words, when the energy level of the incoming speech is below the noise floor. During these intervals, noise is the dominant input and the influence of low-level unvoiced speech, which is typically random in nature, should be minimal. With these considerations, this research focuses on determining whether speech intelligibility is enhanced using the adaptive predictor for noise cancellation.

Chapter 3 LMS Algorithm

Up to this point little has been said about the actual algorithm employed to adjust the coefficients of the adaptive filter.
One of the simplest and most tractable is the least-mean-square (LMS) algorithm, derived from the method of steepest descent. The LMS algorithm is iterative in nature; at each step the individual filter coefficients are adapted such that the mean-square power of the error signal is minimized. This algorithm is attractive in that it is simple, efficient and has proven to be robust in applications where the inputs are stationary or slowly time-varying processes. Storage requirements are minimal. Complex calculations such as matrix inversion or differentiation are not required [34].

The LMS algorithm is typically implemented using a transversal filter architecture, a block diagram of which is shown in Figure 4. Input signal $u_k$ is passed through an L-element tapped delay line whose outputs are samples of the input taken at times $k, k-1, \ldots, k-L$. Accordingly, the tap outputs can be represented as a column vector,

$$U_k = [u_k \; u_{k-1} \; \cdots \; u_{k-L}]^T, \tag{3}$$

where the superscript $T$ indicates the matrix transpose operation. Each tap output is multiplied by a variable filter weight $w_{nk}$, where $n = 0, \ldots, L$ is the weight index and $k$ indicates explicitly the time-varying nature of the filter. The weights,

$$W_k = [w_{0k} \; w_{1k} \; \cdots \; w_{Lk}]^T, \tag{4}$$

are adjusted to produce an output $y_k$ which is the best estimate in the mean-square sense of the desired response $d_k$. In ANC terminology, the desired response corresponds to the primary signal, and the filter input to the reference signal.

The adaptive process adjusts the filter coefficients such that the mean-square value of the difference or error signal $\epsilon_k$ is minimized. The error signal, $\epsilon_k = d_k - y_k = d_k - U_k^T W_k$, is squared to obtain the instantaneous squared error,

$$\epsilon_k^2 = d_k^2 + W_k^T U_k U_k^T W_k - 2 d_k U_k^T W_k. \tag{5}$$

Figure 4 Adaptive Transversal Filter

If $\epsilon_k$, $d_k$, and $U_k$ are statistically stationary and ergodic, the mean-square error (MSE) can be derived from the expected value of equation (5) holding $W_k$ constant,

$$\text{MSE} \equiv E[\epsilon_k^2] = E[d_k^2] + W_k^T E[U_k U_k^T] W_k - 2E[d_k U_k^T] W_k = E[d_k^2] + W_k^T R W_k - 2P^T W_k. \tag{6}$$

The above expression is simplified by replacing the input correlation matrix $E[U_k U_k^T]$ with the symbol $R$, and substituting the symbol $P$ for the cross-correlation between the desired signal and the input components, $E[d_k U_k]$.

From (6), it can be seen that the MSE is a quadratic function of the components of $W_k$. Expressed in geometrical terms, the error surface is a multidimensional paraboloid, concave upward with a single global minimum. The minimum MSE, or analogously the bottom of the ‘bowl’, is determined by first differentiating equation (6) with respect to the filter coefficients to obtain an expression for the gradient of the error surface,

$$\nabla_k = \frac{\partial E[\epsilon_k^2]}{\partial W_k} = -2P + 2RW_k. \tag{7}$$

Since the $U_k$ are real and stationary, by definition $R$ must be symmetric, positive semidefinite and, in most cases, positive definite. If $R$ is non-singular, setting the above gradient to zero produces the optimal filter $W^*$ required to minimize the MSE,

$$W^* = R^{-1}P. \tag{8}$$

Hence the steady-state response of the adaptive noise canceller with stationary, stochastic inputs approximates that of a fixed Wiener filter. The minimum attainable MSE, $\xi_{\min}$, is obtained by substituting (8) in equation (6),

$$\xi_{\min} = E[d_k^2] - P^T W^*. \tag{9}$$

In practice, calculation of the input correlation matrix $R$ and cross-correlation vector $P$ is prohibitive, since an infinite time history is necessary for an exact solution. An estimate of the optimal filter coefficients is instead obtained using iterative gradient search techniques which avoid a direct inversion of matrix $R$. One such technique is the method of steepest descent.
Each iteration the gradient of the bowl surface is first calculated directly from the sampled data; the filter coefficients are then updated with a value proportional to the negative gradient,

$$W_{k+1} = W_k + \mu(-\nabla_k). \tag{10}$$

If the precise value of $\nabla_k$ is known at each step, this adjustment always results in a better filter; the MSE decreases from step $k$ to $k+1$. When the minimum MSE solution is found, the gradient is zero and the coefficients have reached their optimum values. Parameter $\mu$ controls stability and rate of convergence.

Since the true gradient is difficult to determine in practice, the LMS algorithm uses as an estimate the gradient of the instantaneous squared error, $\hat{\nabla}_k$. This estimate can be shown to be unbiased if the inputs are statistically stationary and ergodic. For tractability, individual input vectors $U_k$ are assumed uncorrelated in time¹. Under this condition, the weight vector $W_k$ will be independent of $U_k$ and the expected value of the derivative of equation (5),

$$E[\hat{\nabla}_k] = E\left[\frac{\partial \epsilon_k^2}{\partial W_k}\right] = -2E[\epsilon_k U_k] = -2E[d_k U_k - U_k U_k^T W_k] = -2P + 2RW_k = \nabla_k, \tag{11}$$

is equivalent to the true gradient. Accordingly, the gradient estimate is used in place of the true gradient in equation (10). This yields Widrow’s well-known LMS algorithm,

$$W_{k+1} = W_k + 2\mu\epsilon_k U_k. \tag{12}$$

¹ $E[U_j U_k^T] = 0$ for $j \neq k$. See Section 3.1.

The expected value of the weight vector under the above assumptions is then,

$$E[W_{k+1}] = E[W_k] + 2\mu E[d_k U_k - U_k U_k^T W_k] = E[W_k] + 2\mu P - 2\mu R E[W_k] = (I - 2\mu R)E[W_k] + 2\mu P, \tag{13}$$

which at steady state converges in the mean to the optimal Wiener filter,

$$E[W_\infty] = R^{-1}P. \tag{14}$$

3.1 Convergence Properties

The behavior of the LMS algorithm has been analyzed extensively in the literature [38-42]. This discussion of the convergence properties of the LMS algorithm will focus on aspects that pertain specifically to this application. An explicit relationship exists between the primary and reference signals of the adaptive predictor structure; $U_k$ is simply a delayed version of the desired input $d_k$.
Replacing $d_k$ and $U_k$ with $x_k$ and $X_{k-1}$ respectively, the error signal to be minimized can be expressed as,

$$\epsilon_k = x_k - W_k^T X_{k-1}, \tag{15}$$

and the LMS algorithm for the adaptive predictor rewritten to reflect this relationship,

$$W_{k+1} = W_k + 2\mu\epsilon_k X_{k-1}. \tag{16}$$

Moreover, the correlation matrices are now $R_x = E[X_{k-1}X_{k-1}^T]$ and $P_x = E[x_k X_{k-1}]$, and the expected value of the weight vector becomes,

$$E[W_{k+1}] = (I - 2\mu R_x)E[W_k] + 2\mu P_x. \tag{17}$$

The convergence process of the LMS algorithm depends to a large extent on the eigenvalues $\lambda_p$ and corresponding eigenvectors of the correlation matrix $R_x$. Analysis is generally performed by transforming the LMS algorithm to coordinates which are functions of these same eigenvalues and eigenvectors. This de-couples the individual filter weights so that the convergence properties of each uncoupled weight can be analyzed separately. A necessary and sufficient condition for the LMS algorithm to converge and remain stable, assuming $x_k$ is stationary and ergodic and the $X_k$ are uncorrelated over time, is that the convergence factor $\mu$ remain within the bound,

$$0 < \mu < \frac{1}{\lambda_{\max}}, \tag{18}$$

where $\lambda_{\max}$ is the largest eigenvalue of the matrix $R_x$. If $R_x$ is positive definite, $\lambda_{\max} \le \mathrm{tr}[R_x]$, where $\mathrm{tr}[A]$ indicates the sum of the diagonal components of matrix $A$. In practice, signal power is easier to determine than the matrix eigenvalues, hence a computationally simpler sufficient condition for convergence is given by,

$$0 < \mu < \frac{1}{\mathrm{tr}[R_x]}. \tag{19}$$

As long as the above conditions are satisfied, the weight vector mean $E[W_k]$ will converge to $W^*$ as $k$ approaches infinity; that is, the weight estimate is asymptotically unbiased [34].

In the LMS algorithm an estimate of the gradient is used in place of the true gradient, therefore a single update calculation of the vector coefficients can contain considerable noise even as the weight vector converges in time toward the ideal value. The bound in (19), however, does not guarantee that the covariance of the weight vector will converge to a finite value.
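The predictor form of the LMS update in equations (15) and (16), together with the power-based step-size bound (19), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-tap filter, the pure sinusoidal test input, the choice of one tenth of the bound for $\mu$, and the function name `lms_predictor` are all hypothetical and are not taken from the thesis implementation.

```python
import math

TAPS = 4                 # number of weights (L + 1); hypothetical choice
N = 2000                 # number of samples to process
OMEGA = 0.1 * math.pi    # normalised frequency of the test sinusoid (assumed)


def lms_predictor(x, taps, mu):
    """One-step adaptive predictor using equations (15) and (16)."""
    w = [0.0] * taps
    errors = []
    for k in range(len(x)):
        # X_{k-1} = [x_{k-1}, ..., x_{k-taps}], zero-padded before the signal starts
        past = [x[k - 1 - i] if k - 1 - i >= 0 else 0.0 for i in range(taps)]
        y = sum(wi * xi for wi, xi in zip(w, past))
        eps = x[k] - y                                             # equation (15)
        w = [wi + 2.0 * mu * eps * xi for wi, xi in zip(w, past)]  # equation (16)
        errors.append(eps)
    return w, errors


x = [math.sin(OMEGA * k) for k in range(N)]
power = sum(v * v for v in x) / len(x)   # sample estimate of E[x_k^2]
trace_r = TAPS * power                   # tr[R_x] = (L + 1) E[x_k^2]
mu = 0.1 / trace_r                       # one tenth of the bound in (19)

w, errors = lms_predictor(x, TAPS, mu)
early = sum(e * e for e in errors[:50]) / 50
late = sum(e * e for e in errors[-50:]) / 50
print(f"mu = {mu:.4f}, early error power = {early:.4f}, late error power = {late:.2e}")
```

Because a sinusoid is perfectly predictable from its past samples, the error power in this toy case falls toward zero; with real noise the error would instead settle near the minimum MSE plus the excess error discussed below.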
A set of two necessary and sufficient conditions, assuming stationary, uncorrelated, Gaussian inputs, were derived by Horowitz and Senne to confine the adaptation constant within narrower bounds:

$$0 < \mu < \frac{1}{3\lambda_{\max}}, \tag{20}$$

and

$$\eta(\mu) = \sum_{p=0}^{L}\frac{\mu\lambda_p}{1 - 2\mu\lambda_p} < 1. \tag{21}$$

These conditions assure a finite weight-vector variance at convergence [43]. Similar bounds were also derived by Feuer and Weinstein [44].

The MSE cannot reach the theoretical minimum $\xi_{\min}$ at steady state due to weight-vector noise. A measure of the actual performance achieved versus the optimal is given by the misadjustment, defined as the ratio of the average excess MSE to the minimum MSE,

$$M = \frac{\text{average excess MSE}}{\xi_{\min}}. \tag{22}$$

The misadjustment derived by Horowitz and Senne for the LMS algorithm is defined in terms of the variable $\eta(\mu)$,

$$M = \frac{\eta(\mu)}{1 - \eta(\mu)}. \tag{23}$$

If $\mu$ is sufficiently small and $\eta(\mu) \ll 1$, the above equation simplifies to,

$$M \approx \mu\,\mathrm{tr}[R_x], \tag{24}$$

which agrees with the results presented by Widrow [34]. Misadjustment as defined above is thus a monotonically increasing function of $\mu$; at sufficiently small values, misadjustment and $\mu$ are directly proportional.

The filter coefficients converge towards a steady-state mean $W^*$ along a learning curve which is approximated by a sum of exponential functions, one per mode $p$, with time constants,

$$\tau_p \approx \frac{1}{2\mu\lambda_p}, \tag{25}$$

where $\lambda_p$ is the $p$th eigenvalue of the input autocorrelation matrix $R_x$. If there is no noise in the weight vector, the MSE of the adaptive process converges to $\xi_{\min}$ following a related geometric curve. An approximate time constant for the $p$th mode of MSE convergence is given by,

$$(\tau_{\mathrm{MSE}})_p \approx \frac{1}{4\mu\lambda_p}. \tag{26}$$

It is evident that faster convergence is achieved by increasing $\mu$. When $\mu$ is large, however, the iterative process may never converge to a finite solution. As $\mu$ decreases, the effects of inaccuracies in the gradient estimate tend to average out and the coefficient vector will eventually converge if given sufficient time.
However, if $\mu$ is too small, adaptation may be so slow that in nonstationary environments the process may never converge.

If the range of eigenvalues is small, the learning curve can be approximated by a single exponential with time constant,

$$\tau_{\mathrm{avg}} \approx \frac{1}{4\mu\lambda_{\mathrm{avg}}}, \quad \text{where} \quad \lambda_{\mathrm{avg}} = \frac{1}{L+1}\sum_{p=0}^{L}\lambda_p. \tag{27}$$

Using the fact that $\mathrm{tr}[R_x] = \sum_{p=0}^{L}\lambda_p$ and substituting (27) in equation (24),

$$M \approx \mu\sum_{p=0}^{L}\lambda_p = \mu(L+1)\lambda_{\mathrm{avg}} = \frac{L+1}{4\tau_{\mathrm{avg}}}, \tag{28}$$

misadjustment is shown to be directly dependent on both average signal power $\lambda_{\mathrm{avg}}$ and convergence factor $\mu$, and inversely proportional to the average time constant $\tau_{\mathrm{avg}}$. A tradeoff therefore exists between speed and the quality of the estimate: increasing $\mu$ leads to faster convergence but noise in the adaptive process is also increased. Moreover, for a fixed $\mu$, misadjustment increases with longer filter lengths [34].

In practice individual input vectors $X_{k-1}$ are correlated, in violation of assumptions made in the preceding convergence analysis. The assumption of mutual independence is commonly made to make the analysis more tractable; however, when the data are correlated the weight vector $W_k$ can no longer be considered independent of past data samples $X_{k-1}$. Various authors have examined the behavior of the LMS algorithm under more realistic assumptions of M-dependence or asymptotic independence. Both excess MSE and $\tau_p$ have been shown to increase as correlation increases. The mean error of the weight coefficients converges to a finite limit, generally nonzero; $E[W_k]$ does not approach $W^*$. The mean-square deviation of the weights is bounded by a constant multiple of the adaptation constant, a finite value which can be made arbitrarily small as $\mu$ is decreased [45-47].
However, convergence of the LMS algorithm has yet to be proven for the general case [48].

When convergence factor $\mu$ is suitably small, the temporal changes in $W_k$ are far slower than those in $X_k$, and the approximation of independence between $W_k$ and $X_k$ has proven adequate. Moreover, simulations in other work have shown that even for highly correlated, quasi-stationary inputs and values of $\mu$ approaching the convergence bounds, equation (13) provides a reliable description of LMS behavior [38, 48].

3.2 Performance Issues

The effects of several unfavorable characteristics of the LMS algorithm become significant when it is applied in a more realistic environment where the assumptions of stationary, uncorrelated inputs and infinite-precision arithmetic are no longer applicable.

3.2.1 Nonstationary Inputs

When the input is nonstationary, the orientation of the hyper-paraboloid surface is no longer fixed and the adaptive process must track the location of the bottom of the bowl as it varies. From equation (24), it is known that misadjustment varies directly with the signal power. Moreover, a step size that was originally within the convergence bounds of (19) does not guarantee convergence if the power is increased. Various authors have derived values for optimal $\mu$ based on different assumptions about the input characteristics [49-51].

One such solution is to use a variable convergence factor, proportional to the reciprocal of the instantaneous power of the input vector. The result is the normalized-LMS algorithm,

$$W_{k+1} = W_k + \frac{2\beta}{X_{k-1}^T X_{k-1}}\epsilon_k X_{k-1}, \tag{29}$$

where $\beta$ is a factor controlling convergence. The new bounds for convergence are,

$$0 < \beta < 1, \tag{30}$$

assuming signals are zero-mean, stationary, Gaussian and uncorrelated in time. For an input where all eigenvalues are equal, $\beta$ can be selected to achieve a convergence speed that is faster than that possible with the conventional LMS algorithm, with a reciprocal increase in misadjustment.
If a small $\beta$ is selected to regulate $M$, convergence time increases. Hence this solution improves stability at the expense of a corresponding increase in steady-state error [52].

Assuming stationary, ergodic inputs, a crude analysis can be performed by applying the approximation $X_{k-1}^T X_{k-1} \approx \mathrm{tr}[R_x]$. Variable convergence factor $\mu_k$ reduces to,

$$\mu_k = \frac{\beta}{X_{k-1}^T X_{k-1}} \approx \frac{\beta}{\mathrm{tr}[R_x]}. \tag{31}$$

Substituting the above in equation (24), it follows that for small values of $\beta$, steady-state misadjustment is approximately equal to the new convergence factor, $M \approx \beta$, and hence relatively independent of signal power.

A modification to (29) is to instead estimate the convergence factor using the instantaneous signal power smoothed by a single-pole lowpass filter, which removes the effect of singular values which might otherwise corrupt the estimate [53]. This latter approach has been adopted in this research. The power estimate is updated each iteration using,

$$\sigma_k^2 = (1 - \nu)\sigma_{k-1}^2 + \nu x_k^2, \tag{32}$$

which amounts to forgetting past samples of $x_k$ exponentially, with time constant $1/\nu$ controlling the memory length of the averaging process. This variant of the normalized-LMS algorithm is then expressed as,

$$W_{k+1} = W_k + \frac{2\beta}{(L+1)\sigma_k^2}\epsilon_k X_{k-1}, \tag{33}$$

with a variable convergence factor,

$$\mu_k = \frac{\beta}{(L+1)\sigma_k^2}. \tag{34}$$

With a suitable choice of $\nu$, equations (33) and (29) have similar properties and performance. However, without making strict assumptions about input signal characteristics, it is difficult to analytically derive convergence bounds for the latter variant, and likewise an optimal value for $\beta$. For this application $\beta$ is determined empirically in Section 5.2.2.

3.2.2 Ill-Conditioned Inputs

A second issue concerning the use of the LMS algorithm in a realistic environment is that it is possible for the noise signal to have high energy in one or two frequency bands and relatively low energy in the remaining spectrum.
The adaptive process is implicitly estimating the inverse of matrix $R$, which is in this case spectrally deficient and therefore ill-conditioned; there will be a wide spread between its smallest and largest eigenvalues. The time constants for weight-vector adaptation will be equally disparate. Although LMS stability bounds are dominated by the largest eigenvalue, $\lambda_{\max}$, convergence speed is limited by the size of the smallest eigenvalue, $\lambda_{\min}$.

An analysis of this problem with respect to the adaptive line enhancer was conducted by Treichler under the assumption that the input signal is composed of one or more stationary components. The signal is separated into ‘relatively coherent’ components with correlation times greater than the delay $\Delta$ and ‘noncoherent’ components with correlation times less than $\Delta$. The latter are then modeled together as having a white spectrum with power $\sigma^2$. Accordingly, the matrix $R_x$ can be replaced with $R_x = R_s + \sigma^2 I$, where $R_s$ and $\sigma^2 I$ represent the coherent and noncoherent components respectively [48].

Under these assumptions, the time constant of the $p$th mode of the weight-vector learning curve becomes,

$$\tau_p \approx \frac{1}{2\mu(\lambda_p^s + \sigma^2)}, \tag{35}$$

where $\lambda_p^s$ is an eigenvalue of $R_s$, proportional to the power of the $p$th uncoupled coherent component. The ratio $\phi_p = \lambda_p^s/\sigma^2$ can be considered a ‘signal-to-noise’ ratio relating the coherent and noncoherent components of $x_k$. At large $\phi_p$, or $\lambda_p^s \gg \sigma^2$, the weight-vector time constant can be approximated by,

$$\tau_p \approx \frac{1}{2\mu\lambda_p^s}; \tag{36}$$

the time constant is inversely proportional to the power of the coherent component. Conversely, at low $\phi_p$, or $\lambda_p^s \ll \sigma^2$, the time constant is approximated with,

$$\tau_p \approx \frac{1}{2\mu\sigma^2}, \tag{37}$$

and the noncoherent component dominates convergence.

Hence the time constant for the growth of the weights associated with a given coherent component is not equal to that associated with their decay.
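The asymmetry between growth and decay can be made concrete by evaluating equation (35) for a few cases. The sketch below is purely illustrative: the values chosen for $\mu$, $\lambda_p^s$ and $\sigma^2$ are hypothetical, and the helper name `weight_time_constant` is introduced here only for the example.

```python
import math


def weight_time_constant(mu, lam_s, sigma2):
    """Equation (35): tau_p ~ 1 / (2 mu (lambda_p^s + sigma^2)), in iterations."""
    denom = 2.0 * mu * (lam_s + sigma2)
    return math.inf if denom == 0.0 else 1.0 / denom


MU = 0.01   # hypothetical convergence factor

# Strong coherent component present: growth is governed by lambda_p^s, as in (36).
tau_growth = weight_time_constant(MU, lam_s=10.0, sigma2=0.01)
# Coherent component has ceased (lambda_p^s = 0): decay is governed by sigma^2, as in (37).
tau_decay = weight_time_constant(MU, lam_s=0.0, sigma2=0.01)
# No noncoherent input either: the weights have no tendency to decay at all.
tau_stuck = weight_time_constant(MU, lam_s=0.0, sigma2=0.0)

print("growth:", tau_growth, "decay:", tau_decay, "no input:", tau_stuck)
```

With these assumed numbers the decay time constant is roughly three orders of magnitude longer than the growth time constant, and it becomes unbounded when the noncoherent power vanishes.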
In the adaptive predictor, a dominant coherent noise input will cause a notch to form in the overall transfer function at the frequency of interest. If the noncoherent component of the input is near zero when this coherent noise input ceases, $\tau_p \to \infty$; the notch will persist since there is no strong tendency for the uncoupled weights associated with the notch to decay.

3.2.3 Finite-Precision Effects

LMS performance is also affected by finite-precision effects related to quantization and round-off errors in a digital implementation. Due to physical limitations, both internal algorithmic quantities and inputs are quantized to a certain limited precision. Weight-vector updates are sensitive to the accumulation of these errors over long periods of time. This sensitivity is magnified by small step sizes and ill-conditioned inputs. The effect of finite-precision arithmetic on the LMS algorithm was analyzed by Cioffi [54] and Perry [55].

In two’s-complement arithmetic, multiplication of two single-precision values creates a double-precision number that is normally rounded or truncated back to single precision to minimize storage requirements and reduce the complexity of further calculations. The second term of the iterative calculation of the weight vector in equation (16),

$$\mu\hat{\nabla}_k = -2\mu\epsilon_k X_{k-1},$$

requires at least two multiplications, producing round-off error terms $\Delta_{1,k}$ and $\Delta_{2,k}$. In addition, quantization error related to the analog-to-digital conversion process will induce errors $\Delta\epsilon_k$ and $\Delta X_{k-1}$ in the signals $\epsilon_k$ and $X_{k-1}$ respectively. The gradient estimate including these error effects becomes,

$$\mu\tilde{\nabla}_k = \left(-2\mu(\epsilon_k + \Delta\epsilon_k) + \Delta_{1,k}\right)(X_{k-1} + \Delta X_{k-1}) + \Delta_{2,k}. \tag{38}$$

Although the individual errors are independent, it is possible to analyze their effects by lumping them together into a single quantity,

$$b_k = \mu\hat{\nabla}_k - \mu\tilde{\nabla}_k, \tag{39}$$

describing the deviation of the estimate from the infinite-precision gradient value.
The LMS algorithm including finite-precision effects can be rewritten as,

$$W_{k+1} = W_k - \mu\hat{\nabla}_k + b_k. \tag{40}$$

The mean value of the weight vector, assuming the inputs are stationary and uncorrelated in time, is then,

$$E[W_{k+1}] = E[W_k] - \mu E[\hat{\nabla}_k] + E[b_k] = E[W_k] + 2\mu E[x_k X_{k-1} - X_{k-1}X_{k-1}^T W_k] + E[b_k] = (I - 2\mu R_x)E[W_k] + 2\mu P_x + E[b_k]. \tag{41}$$

At steady state the converged weight vector including finite-precision effects is,

$$E[W_\infty] = R_x^{-1}P_x + \frac{1}{2\mu}R_x^{-1}\bar{b}, \tag{42}$$

where $\bar{b}$ is the mean vector of $b_k$. The second term represents the deviation from the optimal vector due to precision effects. The deviation is inversely proportional to $\mu$, so that as $\mu$ decreases the magnitude of the limited-precision effects increases. This term can be further decomposed,

$$\frac{1}{2\mu}R_x^{-1}\bar{b} = \frac{1}{2\mu}\sum_{p=0}^{L}\frac{q_p q_p^T\bar{b}}{\lambda_p}, \tag{43}$$

where $q_p$ is the eigenvector corresponding to eigenvalue $\lambda_p$ of matrix $R_x$. Small eigenvalues dominate the magnitude of the above term. When $R_x$ is ill-conditioned, some eigenvalues may be extremely small and the magnitude of (43) will become appreciably large.

Although infinite-precision theory suggests that misadjustment decreases as convergence factor $\mu$ is decreased, the preceding analysis demonstrates that $\mu$ can only be reduced to the point where the finite-precision effects of the second term in equation (42) start to dominate. Spectral deficiencies in the input data allow those filter weights associated with missing frequency components to grow unchecked, resulting in weight divergence. Tens of millions of iterations may be required for these effects to become noticeable, and they are therefore not generally observed in theoretical and simulated studies; however, research has shown that their effect can be significant and that coefficient magnitudes may become so large that they will saturate in a finite-precision implementation. Numerical problems become more pronounced as the ill-conditioning of matrix $R_x$ increases [56, 57].

3.3 Leaky-LMS Algorithm

One solution to these problems is the leaky-LMS algorithm [34, 48, 54].
A loss factor is introduced into the weight calculation to compensate for the deleterious effects due to eigenvalue spread and limited-precision effects. Derived analytically, a cost function is formed by imposing the additional constraint that the total energy of the filter coefficients must be minimized in addition to minimizing the output power of the error signal. The new cost function is therefore,

$$C_k = \epsilon_k^2 + \alpha W_k^T W_k = x_k^2 - 2x_k W_k^T X_{k-1} + W_k^T X_{k-1}X_{k-1}^T W_k + \alpha W_k^T W_k, \tag{44}$$

where $\alpha$ is a positive factor, $0 < \alpha \ll 1/2\mu$, which controls the relative strength of the new constraint in the cost function. As in deriving the LMS algorithm, the derivative of (44) with respect to the filter coefficients is taken,

$$\frac{\partial C_k}{\partial W_k} = -2x_k X_{k-1} + 2X_{k-1}X_{k-1}^T W_k + 2\alpha W_k = -2X_{k-1}(x_k - X_{k-1}^T W_k) + 2\alpha W_k = -2\epsilon_k X_{k-1} + 2\alpha W_k. \tag{45}$$

Using this as the gradient estimate in (10), the update equation becomes,

$$W_{k+1} = W_k + 2\mu\epsilon_k X_{k-1} - 2\mu\alpha W_k = (1 - 2\mu\alpha)W_k + 2\mu\epsilon_k X_{k-1} = \gamma W_k + 2\mu\epsilon_k X_{k-1}, \tag{46}$$

where $\gamma = 1 - 2\mu\alpha$. The factor $\gamma$ essentially forces the weight vector to adapt to ‘stay alive’; if $\gamma < 1$ and the adaptive term $2\mu\epsilon_k X_{k-1}$ is zero, the weights will eventually decay to zero following a curve with time constant proportional to $1/(1 - \gamma)$.

The leaky-LMS algorithm essentially increases the stability of the convergence process at the expense of an increase in steady-state misadjustment and algorithm complexity. A correction term $\alpha I$ is implicitly added to the input correlation matrix $R_x$, reducing the possibility that $R_x$ will be near-singular in low-SNR conditions.²

The behavior of the leaky-LMS algorithm can thus be crudely analyzed by replacing $R_x$ in the preceding convergence analysis with $R_\alpha = R_x + \alpha I$. Individual eigenvalues increase by $\alpha$: $\lambda_p' = \lambda_p + \alpha$. Equation (43) becomes,

$$\frac{1}{2\mu}\sum_{p=0}^{L}\frac{q_p q_p^T\bar{b}}{\lambda_p + \alpha}. \tag{47}$$

As $\lambda_p$ approaches zero, the magnitude of the above term can only increase to a finite value limited by $1/\alpha$, bounding finite-precision error.
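The ‘stay alive’ behaviour of equation (46) is easy to see in a toy case where the input has ceased entirely, so that only the leakage factor acts on the weights. The sketch below assumes hypothetical values for $\mu$ and $\alpha$ and a made-up pair of leftover weights; `leaky_lms_step` is a name introduced here for illustration only.

```python
def leaky_lms_step(w, past, eps, mu, gamma):
    """Equation (46): W_{k+1} = gamma * W_k + 2 mu eps_k X_{k-1}."""
    return [gamma * wi + 2.0 * mu * eps * xi for wi, xi in zip(w, past)]


MU = 0.01
ALPHA = 0.5                       # hypothetical leakage strength, well below 1/(2 mu)
GAMMA = 1.0 - 2.0 * MU * ALPHA    # leakage factor of equation (46)

w_plain = [1.0, -0.5]             # weights left over from an earlier notch (assumed)
w_leaky = list(w_plain)
silent_input = [0.0, 0.0]         # the input has ceased entirely

for _ in range(300):
    # With a zero input the error term vanishes and only the leakage acts.
    w_plain = leaky_lms_step(w_plain, silent_input, 0.0, MU, gamma=1.0)
    w_leaky = leaky_lms_step(w_leaky, silent_input, 0.0, MU, gamma=GAMMA)

print("plain LMS:", w_plain)
print("leaky LMS:", w_leaky)
```

Setting `gamma=1.0` recovers the conventional LMS update, whose weights stay exactly where they were, while the leaky weights shrink geometrically as $\gamma^k$.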
The mean value of the weight vector is now given by,

$$E[W_{k+1}] = (I - 2\mu(R_x + \alpha I))E[W_k] + 2\mu P_x, \tag{48}$$

which converges to a steady-state vector,

$$E[W_\infty] = [R_x + \alpha I]^{-1}P_x, \tag{49}$$

biased away from the known optimum weight vector $R_x^{-1}P_x$ by an amount,

$$\Delta W = E[W_\infty] - W^* = -\alpha(R_x + \alpha I)^{-1}R_x^{-1}P_x. \tag{50}$$

The excess error about the new converged weight vector is given by,

$$\text{excess MSE}_b \approx \mu\,\mathrm{tr}(R_x + \alpha I)\,\xi_{\min,\alpha}, \tag{51}$$

where $\xi_{\min,\alpha}$ is the minimum attainable MSE,

$$\xi_{\min,\alpha} = E[x_k^2] - P_x^T(R_x + \alpha I)^{-1}P_x. \tag{52}$$

² In fact, a variant of the leaky-LMS approach is to add a white-noise component of power $\alpha$ to the input signal [37, 58].

The total excess MSE is the sum of the contributions from bias error and excess error around the biased solution,

$$\text{excess MSE} \approx \text{excess MSE}_b + \Delta W^T R_x \Delta W. \tag{53}$$

Steady-state misadjustment therefore increases through a combination of terms dependent on the parameter $\alpha$. If $\alpha$ is kept small, these effects are minimized [54].

The effect of the leaky-LMS algorithm on convergence speed is seen in the revised time constant for weight adaptation,

$$\tau_p \approx \frac{1}{2\mu(\lambda_p + \alpha)}. \tag{54}$$

The introduction of $\alpha$ increases the speed of convergence, particularly when the input signal power is near zero. When a coherent component ceases, there will be a tendency for the uncoupled weights associated with it to decay even if the noncoherent component of the input is near zero.

No evidence of a detailed convergence analysis was found in the literature for the leaky-LMS variant. However, a quantitative analysis of the effect of $\alpha$ was set forth in a study by Kaneda and Ohga concerning ANC in multi-sensor arrays [58]. The mean value of the cost function in equation (44) is a linear combination of two measures. The quantity,

$$D_1 = E[W_k^T W_k], \tag{55}$$

can be considered a measure of the distortion by which filter $W_k$ would degrade a signal with white spectrum and unit power. The quantity,

$$D_2 = E[(d_k - y_k)^2], \tag{56}$$

represents a measure of the output noise power.
The weight vector which minimizes the mean cost function,

$$E[C_k] = D_2 + \alpha D_1, \tag{57}$$

is a function of $\alpha$. Both $D_1$ and $D_2$ are functions of this vector and therefore also functions of $\alpha$. As is shown in Appendix A, if stationary inputs are assumed, $D_1$ is a monotonically decreasing function of $\alpha$, while $D_2$ is a monotonically increasing function of $\alpha$. As $\alpha$ is increased, there is a greater tendency for the weight vector to decay to zero, at the expense of increased noise power. Conversely, as $\alpha$ is decreased, the effects of ill-conditioning become more pronounced. An empirical analysis of the effect of $\alpha$ on filter performance is found in Section 5.2.2.

3.4 Final Algorithm

This implementation of the LMS algorithm integrates the leaky- and normalized-LMS algorithms to produce a noise-reduction scheme which adapts itself to the power of the input signal and also corrects for ill-conditioned inputs. Although these variants have often been used individually, the incorporation of both into a single algorithm is unique to this research. The normalized-LMS algorithm is robust to varying signal conditions, as required by our design constraints, while the use of leaky-LMS was inspired by similar work done by Kaneda [58] and Hoy [37] in the field of speech enhancement.

The algorithms were not combined literally. The convergence parameter $\mu_k$ of the normalized LMS varies with signal power, and by precise definition this would also cause the leakage constant $\gamma_k = 1 - 2\alpha\mu_k$ to be a function of input power. To avoid this complication, parameter $\gamma$ was selected as a constant independent of $\mu_k$, and an appropriate value was determined from empirical work in Section 5.2.2.

Figure 5 Leaky-Normalized-LMS Algorithm
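The combination described above can be sketched as follows. This is a hedged illustration rather than a transcription of Figure 5: it assumes the normalized step $\mu_k = \beta / ((L+1)\sigma_k^2)$ with a single-pole power estimate $\sigma_k^2$, and a constant leakage factor $\gamma$. The function name `leaky_nlms_predict`, the test signal and all parameter values are hypothetical.

```python
import math


def leaky_nlms_predict(x, taps, beta, gamma, nu, sigma2_init):
    """Adaptive predictor with normalized step size and constant leakage.

    taps        : number of weights (L + 1)
    beta        : normalized convergence factor
    gamma       : constant leakage factor (slightly below 1)
    nu          : smoothing constant of the single-pole power estimate
    sigma2_init : initial power estimate (must be positive)
    Returns (final weights, list of prediction errors).
    """
    w = [0.0] * taps
    past = [0.0] * taps              # X_{k-1}, most recent sample first
    sigma2 = sigma2_init
    errors = []
    for xk in x:
        sigma2 = (1.0 - nu) * sigma2 + nu * xk * xk   # running power estimate
        mu_k = beta / (taps * sigma2)                 # power-normalized step
        y = sum(wi * pi for wi, pi in zip(w, past))   # prediction
        eps = xk - y
        w = [gamma * wi + 2.0 * mu_k * eps * pi for wi, pi in zip(w, past)]
        errors.append(eps)
        past = [xk] + past[:-1]                       # shift the delay line
    return w, errors


# Hypothetical demonstration: a pure sinusoid is fully predictable, so the
# prediction error should shrink to a small residual set by the leakage.
signal = [math.sin(2.0 * math.pi * 0.1 * k) for k in range(4000)]
weights, err = leaky_nlms_predict(signal, taps=8, beta=0.1,
                                  gamma=0.9999, nu=0.0125, sigma2_init=0.5)
tail_mse = sum(e * e for e in err[-200:]) / 200.0
```

Keeping `gamma` constant, rather than recomputing it from `mu_k` on every iteration, mirrors the design choice stated in the text.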
The final version of the iterative noise-reduction algorithm is shown in Figure 5.

Chapter 4 Speech / Silence Detection

Essential to the predictive noise-canceller structure presented in Section 2.3 is that weight adaptation occur solely during intervals when speech is absent, so that the filter adapts only to the characteristics of the background noise. A method of discriminating between intervals where speech is present and intervals of noise only (silence) is required. Indeed, most of the techniques discussed in Section 2.1 entail some sort of speech / silence classification. A decision algorithm which can determine when speech occurs as opposed to 'silence' is not trivial, and a number of approaches have been suggested. For our application, the detection method must be simple, fast, and efficient in order to work in real-time. It must also be able to adapt continuously to the background noise level.

Sophisticated methods use a pattern-recognition approach to categorize a segment of a signal as silence or speech. A vector of measurable features which are known to vary consistently between classes is exploited in the decision process. The features are selected based on their ability to differentiate the desired signal types. In this application, they can include any of: 1) energy of the signal, 2) zero-crossing rate of the signal, 3) autocorrelation coefficient at unit sample delay, 4) first predictor coefficient from a linear predictive coding analysis, 5) energy of the prediction error, as well as others. Using classical hypothesis-testing procedures, each segment of the signal is assigned to the class for which the distance between the set of measured parameters and the class vector is minimized.
To simplify computations, the decision function usually requires that the vector parameters have approximately normal distributions [59-61].

Typically these methods are used in speech- and speaker-recognition algorithms, where precise delineation is needed not only between silence and speech but between voiced and unvoiced speech as well. Word-recognition techniques must also identify specific syllables and other speech characteristics. Consequently the same features required in silence discrimination also prove useful in the speech-recognition process, justifying the complexity involved in computing the full vector of parameters.

A simpler approach relies on only two measures of speech: short-time energy and zero-crossing rate. Its success depends both on the SNR and on the assumptions that can be made about the characteristics of the background noise. An estimate of the location of the speech endpoints is first determined using short-time energy measures. This procedure assumes that the average energy during speech intervals is greater than that during silence. In a high-SNR environment, speech can generally be differentiated from silence using these measures. In particular, high-energy speech sounds such as vowels and vowel-like consonants are readily discerned above the interference background. As the SNR decreases, however, short-time energy measures lose the ability to distinguish weakly articulated sounds which are lost in the background noise; hence the original estimate is revised by taking into account the frequency of zero-crossings during the same interval. The zero-crossing rate further differentiates high-frequency sounds such as unvoiced fricatives and stop consonants from low-frequency interference.
Notably, at low SNRs the performance of this strategy deteriorates.

Various versions of this technique exist [62-64]; a recently developed algorithm, originally proposed to delete non-speech acoustic material from recorded media, has been adapted for our purposes. This algorithm was developed by Gan and Donaldson [65, 66] and later implemented in real-time by Rose [67, 68]. It meets the requirements of speed, adaptability and efficiency.

4.1 Silence Detection Algorithm

The detection algorithm proceeds as follows. The input signal is segmented into a sequence of non-overlapping frames consisting of $J$ samples. For each frame $F_i$, an estimate of the short-term average energy is calculated using the short-term average magnitude,

$$\mathrm{AVGMAG}_i = \frac{1}{J}\sum_{j=1}^{J} |x_j|, \tag{58}$$

where $x_j$ is the input sample at time $j$ in frame $F_i$. Note that the classic definition of power would require the calculation of the squared signal amplitude, but this is overly sensitive to large signal levels [63]. Instead the magnitude of the input signal is averaged to de-emphasize large-amplitude speech variations, producing a smoother energy function as well as decreasing computational requirements. The zero-crossing rate is also calculated each frame,

$$\mathrm{ZCR}_i = \sum_{j=1}^{J} S(x_j), \tag{59}$$

where

$$S(x_j) = \begin{cases} 0 & \text{if } \mathrm{sgn}(x_j) = \mathrm{sgn}(x_{j-1}) \\ 1 & \text{otherwise.} \end{cases}$$

The function $S(x_j)$ detects a sign change from past sample $x_{j-1}$ to present sample $x_j$ [67].

$\mathrm{AVGMAG}_i$ and $\mathrm{ZCR}_i$ are compared each frame to threshold values which reflect the characteristics of the background noise. If either $\mathrm{AVGMAG}_i$ or $\mathrm{ZCR}_i$ exceeds its specified threshold, the segment is classified as speech. The noise intensity generally varies with time; consequently the algorithm does not rely on fixed thresholds but adapts to the average background noise level. Moreover, since speech utterances typically exhibit higher energy at their onset, a larger threshold is specified for a silence-to-speech context change than for a speech-to-silence change.
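The two per-frame measures in equations (58) and (59) are simple enough to sketch directly. The following is an illustrative pure-Python version; the function names are hypothetical, and treating a zero-valued sample as positive is an assumption, since the text does not define $\mathrm{sgn}(0)$.

```python
def avg_magnitude(frame):
    """Short-term average magnitude of one frame, as in eq. (58)."""
    return sum(abs(s) for s in frame) / len(frame)


def zero_crossing_count(frame, prev_sample):
    """Number of sign changes in one frame, as in eq. (59).

    prev_sample is the last sample of the preceding frame, so that a sign
    change across the frame boundary is also counted.
    Assumption: a sample equal to zero is treated as positive.
    """
    count = 0
    last = prev_sample
    for s in frame:
        if (s >= 0) != (last >= 0):   # S(x_j) = 1 on a sign change
            count += 1
        last = s
    return count


frame = [1.0, -1.0, 2.0, -2.0]
frame_avgmag = avg_magnitude(frame)
frame_zcr = zero_crossing_count(frame, prev_sample=1.0)
```

Averaging magnitudes rather than squares avoids one multiplication per sample, which matches the computational motivation given above.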
This introduces a hysteresis factor into the algorithm which eliminates rapid context switches at speech amplitudes near the threshold value. The frequency of context switches between classes can also be curtailed by defining a minimum duration for speech and silence segments. The detection of stop consonants at word endings, which are easily masked by the background noise, can be facilitated by extending the period of time, or 'hangover', that the lower threshold value remains valid after a speech-to-silence context switch is made.

Rose's implementation of this algorithm presents some noteworthy enhancements. In previous work, the energy thresholds adapted only to changes in the level of the background noise. In high-SNR environments, this level can be quite low. A threshold value based on this level will be low in turn, resulting in a smaller percentage of the waveform classified as noise. Rose therefore based threshold calculations on the maximum of either the long-term average of the background noise (silence) energy SilAvg, or the long-term average of the speech energy SpeechAvg,

$$\mathrm{EAvg}_i = \max\left(\mathrm{SilAvg},\ \mathrm{SpeechAvg}/13\right). \tag{60}$$

The factor 13 was selected based on SNR specifications for a long-distance telephone link. Threshold levels were then calculated as fixed multiples of the maximum average,

$$\mathrm{EThreshold} = K(\mathrm{EAvg}_i), \tag{61}$$

where $K$ is a threshold multiplier reflecting the classification of the preceding segment. If this segment was classified as silence, $K$ is set to its maximum value $K_{max}$ in anticipation of a silence-to-speech context change. Conversely, if the preceding segment was classified as speech, a speech-to-silence context change is expected and $K$ is set to its minimum value $K_{min}$. The full threshold test for classification of a signal segment as speech is expressed as,

$$\mathrm{AVGMAG}_i > \mathrm{EThreshold} = K(\mathrm{EAvg}_i). \tag{62}$$

A segment is classified as silence when $\mathrm{AVGMAG}_i$ is less than the calculated threshold value.
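The hysteresis in equations (60) to (62) can be sketched as below. This is an illustration only: the multiplier values `k_min` and `k_max` are hypothetical placeholders (the text does not state them here), and the divisor 13 is taken from equation (60) as printed.

```python
def energy_threshold(sil_avg, speech_avg, prev_was_speech,
                     k_min=1.5, k_max=2.5, speech_factor=13.0):
    """Adaptive energy threshold following eqs. (60) and (61).

    k_min and k_max are hypothetical example multipliers.
    """
    e_avg = max(sil_avg, speech_avg / speech_factor)   # eq. (60)
    k = k_min if prev_was_speech else k_max            # hysteresis
    return k * e_avg                                   # eq. (61)


def is_speech(avgmag, sil_avg, speech_avg, prev_was_speech):
    """Threshold test of eq. (62)."""
    return avgmag > energy_threshold(sil_avg, speech_avg, prev_was_speech)


# The same frame energy is classified differently depending on the
# classification of the preceding frame.
after_silence = is_speech(0.2, sil_avg=0.1, speech_avg=0.65,
                          prev_was_speech=False)
after_speech = is_speech(0.2, sil_avg=0.1, speech_avg=0.65,
                         prev_was_speech=True)
```

In this example the frame is rejected when it follows silence (threshold 0.25) but accepted when it follows speech (threshold 0.15), which is the behavior that suppresses rapid context switching near the threshold.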
With this approach, the fraction of signal classified as speech becomes less dependent on SNR conditions.

Long-term averages for speech and background noise energy are updated each frame by averaging the $N$ most recent values assigned to the respective decision class,

$$\mathrm{Avg} = \frac{1}{N}\sum_{n=1}^{N} \mathrm{AVGMAG}_n. \tag{63}$$

$N$ is chosen such that the average is over a length of time appropriate to the signal characteristics. Speech signals are inherently nonstationary: in computing SpeechAvg, Rose used an interval of approximately one second to obtain a good estimate of the average speech energy. Noise signals were considered relatively constant, and a period of approximately 125 ms was deemed sufficient to determine the background noise SilAvg.

Near the silence-to-speech transition, however, there is the possibility that speech samples may be misclassified as silence, corrupting the long-term SilAvg. Thus a third threshold multiplier ECrit was introduced; a value classified as silence is not included in the calculation of SilAvg unless it is less than the threshold,

$$\mathrm{AVGMAG}_i < \mathrm{ECrit}\,(\mathrm{SilAvg}). \tag{64}$$

Without this comparison, misclassified speech samples would corrupt SilAvg such that the resulting average would be greater than the actual background noise level.

4.2 Algorithm Modifications

In the majority of noise environments the energy distribution is roughly bell-shaped; hence, by the central-limit theorem, the average energy estimate should have a distribution which is approximately Gaussian, assuming enough samples are used in deriving the average. Decision thresholds can be expressed as energy levels a fixed number $Z$ of standard deviations above the mean background noise energy,

$$\mathrm{EThreshold} = \mu_{avg} + Z\sigma_{avg}, \tag{65}$$

where $\mu_{avg}$ and $\sigma_{avg}$ are the estimated long-term mean and standard deviation of the average energy respectively. Gan assumed that the ratio of the standard deviation to the mean,

$$A = \frac{\sigma_{avg}}{\mu_{avg}}, \tag{66}$$

would remain relatively constant.
If this is the case then the threshold value is independent of absolute signal level and can be expressed as a multiple of the mean,

$$\mathrm{EThreshold} = (1 + AZ)\mu_{avg}. \tag{67}$$

The factor $(1 + AZ)$ corresponds to the threshold multiplier $K$ of equation (61).

To improve algorithm performance in nonstationary noise environments, the assumption that ratio $A$ remains constant was relaxed in our implementation, and $A$ was allowed to vary depending on the characteristics of the background noise. EThreshold is then a function of both the mean and standard deviation of the average energy. The threshold test for classification of a signal segment as speech becomes,

$$\mathrm{AVGMAG}_i > \mathrm{EThreshold} = \mathrm{EAvg}_i + Z_L\sigma_{avg}. \tag{68}$$

$\mathrm{EAvg}_i$ is the maximum defined in equation (60), that is, the larger of the long-term silence average and the scaled long-term speech average, while $Z_L$ is a positive number representing the number of standard deviations the threshold level is above the long-term mean. As $Z_L$ increases, the probability of misclassifying a silence segment as speech decreases, while the probability of misclassifying speech as silence increases. If the preceding segment is classified as silence, $Z_L$ is set to a high value $Z_{max}$ in anticipation of a silence-to-speech context change. Conversely, if the preceding segment is classified as speech, a speech-to-silence context change is expected and the value $Z_L$ is lowered to $Z_{min}$. The parameter $\sigma_{avg}$ represents an estimate of the long-term standard deviation of the background noise energy.

In practice the calculation of standard deviation requires a square-root operation. This operation can be avoided by instead calculating the statistical variance,

$$\sigma_{avg}^2 = \frac{1}{N-1}\left(\sum_{n=1}^{N}\mathrm{AVGMAG}_n^2 - N(\mathrm{SilAvg})^2\right), \tag{69}$$

and expressing the threshold test as,

$$\left((\mathrm{AVGMAG}_i - \mathrm{EAvg}_i) > 0\right) \ \text{and}\ \left((\mathrm{AVGMAG}_i - \mathrm{EAvg}_i)^2 > Z_L^2\sigma_{avg}^2\right). \tag{70}$$

If one examines equation (70), it can be seen that the complexity of the test can be reduced by defining a new variable,

$$\mathrm{ZLev} = \frac{Z_L^2}{N-1}, \tag{71}$$

and simplifying the variance calculation,

$$\mathrm{SilVar} = \sum_{n=1}^{N}\mathrm{AVGMAG}_n^2 - N(\mathrm{SilAvg})^2. \tag{72}$$

The test for speech classification becomes,

$$\left((\mathrm{AVGMAG}_i - \mathrm{EAvg}_i) > 0\right) \ \text{and}\ \left((\mathrm{AVGMAG}_i - \mathrm{EAvg}_i)^2 > \mathrm{ZLev}\cdot\mathrm{SilVar}\right). \tag{73}$$

The value ZLev changes with each context switch; ZLev is set to ZLow after a silence-to-speech transition and raised to ZHigh once a silence classification is made.

The $N$ values of AVGMAG exploited in the calculation of SilAvg and $\sigma_{avg}^2$ correspond to the $N$ most recent segments classified as silence that fall within the critical range defined by,

$$\left((\mathrm{AVGMAG}_i - \mathrm{SilAvg}) < 0\right) \ \text{or}\ \left((\mathrm{AVGMAG}_i - \mathrm{SilAvg})^2 < Z_C^2\sigma_{avg}^2\right), \tag{74}$$

which minimizes the possibility that misclassified speech samples near the silence-to-speech transition will corrupt the long-term silence average and variance. Expressed statistically, AVGMAG must be within $Z_C$ standard deviations of the long-term silence average. If a new variable ZCrit is defined as,

$$\mathrm{ZCrit} = \frac{Z_C^2}{N-1}, \tag{75}$$

the threshold test can be reduced in complexity to,

$$\left((\mathrm{AVGMAG}_i - \mathrm{SilAvg}) < 0\right) \ \text{or}\ \left((\mathrm{AVGMAG}_i - \mathrm{SilAvg})^2 < \mathrm{ZCrit}\cdot\mathrm{SilVar}\right). \tag{76}$$

Equations (73) and (74) define the threshold tests for silence detection. Appropriate ranges for variables ZHigh, ZLow and ZCrit suitable for this application will be determined in Section 5.2.3.

A simplified flow chart of the revised algorithm is given in Figure 6. Note that zero-crossing information is not exploited in this implementation. Rose found that at small frame sizes the use of zero-crossing information was counterproductive; misclassification of noise as speech occurs due to the large variability in zero-crossing rate measurements when using small sample sizes. Indeed, for telephone-quality speech the filter cutoff frequency is typically 3400 Hz, removing much of the significant information.

Several other parameters not included in the flow chart may be used to fine-tune the performance of the algorithm with respect to a specific application. In some applications, such as data compression, frequent context changes between speech and silence can be deleterious and should be avoided.
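The square-root-free test of equations (71) to (73) can be sketched as follows. The function names and the sample history are hypothetical; the sketch only shows that comparing squared deviations against `ZLev * SilVar` gives the same decision as comparing against $Z_L$ standard deviations.

```python
def silence_statistics(history):
    """SilAvg and SilVar (eq. 72) from the N most recent silence frames."""
    n = len(history)
    sil_avg = sum(history) / n
    sil_var = sum(v * v for v in history) - n * sil_avg * sil_avg
    return sil_avg, sil_var


def classify_speech(avgmag, e_avg, sil_var, z_lev):
    """Speech test of eq. (73); no square root is required."""
    diff = avgmag - e_avg
    return diff > 0 and diff * diff > z_lev * sil_var


# Hypothetical silence history of N = 5 average-magnitude values.
history = [0.80, 0.82, 0.78, 0.81, 0.79]
sil_avg, sil_var = silence_statistics(history)

z_l = 3.0                                  # threshold in standard deviations
z_lev = z_l * z_l / (len(history) - 1)     # eq. (71)

loud_frame = classify_speech(0.86, sil_avg, sil_var, z_lev)
quiet_frame = classify_speech(0.84, sil_avg, sil_var, z_lev)
```

For this history the sample standard deviation is about 0.0158, so three standard deviations above the mean is about 0.847; a frame at 0.86 passes the test and a frame at 0.84 does not.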
Specifying minimum frame lengths for silence and/or speech duration decreases the frequency of context switches, as does specifying a minimum hangover interval before ZLow is switched to ZHigh. The incorporation of these parameters into the algorithm is discussed further in Section 5.2.3.

Figure 6 Speech Detection Algorithm Flow Chart
At initialization:
  Calculate SilAvg and SilVar based on first 128 ms of data
  State = Silence
  ZLev = ZHigh
  EAvg = SilAvg
  EThreshold = ZLev * SilVar
  SpeechAvg = suitable value
  Count = 0
  SumMag = 0

Chapter 5 Implementation

The following chapter discusses aspects of both the hardware and software implementation of the speech enhancement algorithm. A brief description of the physical development environment is presented first, followed by a detailed discussion of the procedures undertaken in selecting the more salient parameter values.

5.1 Processing Environment

Algorithm development and evaluation were performed on the system illustrated in Figure 7. The TRAIN-ON-PHONE, developed by the Clinical Engineering Program and the Department of Electrical Engineering at the University of British Columbia, allows a standard telephone to be operated outside the commercial telephone network. Two modes of operation are possible: individual telephones may be connected for conversational purposes, or a single telephone may be interfaced to an audiometer or tape-recorder in playback mode. The telephone / computer interface (TCI), situated between the telephone base and telephone handset, provides the necessary amplification and anti-aliasing filters required for digitization. A sampling rate of 8 kHz was selected as suitable for telephone-quality voice. In the TCI, the speech signal from the telephone base is first input to a differential amplifier followed by a bandpass filter stage consisting of a 6-pole, 3400-Hz lowpass and a 3-pole, 100-Hz highpass Butterworth filter. Noise-reduction processing is then performed on a 386 AT-compatible personal computer.
Upon completion, the enhanced signal is attenuated and smoothed by a 6-pole, 3400-Hz lowpass filter in the TCI and the result output to the telephone handset in real-time.

The 386 AT serves as the host platform for algorithm development and data storage purposes. Two plug-in commercial products execute the required conversions between digital and analog domains as well as all digital signal processing (DSP) functions. The SPECTRUM I/O board supports four input channels with individual sample-and-hold amplifiers, multiplexed to a common 12-bit analog-to-digital converter. Two 12-bit analog output channels are also provided. The SPECTRUM Processor Board employs a 20-MHz MOTOROLA 56001 digital signal processor whose standard word size is 24 bits, while its internal accumulators compute results to 56-bit precision. There are 32 kb of data memory and 16 kb of program memory available on board. MOTOROLA provides both a C compiler and cross assembler for the 56001 processor that work within the MS-DOS operating system of the 386 AT. Compiled programs are downloaded from the host computer to the processor board, where noise reduction is performed on the telephone input in real-time. Alternatively, digitized signals can first be stored on disk and later downloaded and processed for further analysis.

The algorithm was initially written in C and evaluated at floating-point precision on the 386 AT. Recorded data files were used to allow specific experiments to be repeated. Later, the algorithm was translated to the 56001 processor, with crucial routines rewritten in assembly language to increase processing speed.

5.2 Parameter Selection

Parameter selection for the speech enhancement algorithm is an unwieldy process in that the effects of individual variables are interrelated; however, it is possible to select parameters that work well in tandem.
The subsequent sections discuss the selection process followed in determining parameter values for both the LMS and speech detection algorithms.

5.2.1 Stimuli

Both computer-generated (CG) data with known statistical characteristics and recordings of actual sounds were used to evaluate the effects of varying critical algorithm parameters. The first set of data consists of zero-mean white Gaussian noise to which periodic interference was added at four ratios of coherent to noncoherent noise components $\Theta$, as given in Table I.

Figure 7 System Configuration (Telephone Base, Telephone Handset)

Condition   800 Hz   1700 Hz   2300 Hz   White Noise   Theta = coherent / noncoherent
(1)         .05      .025      .045      .002          .12/.002    17.8 dB
(2)         .025     .0125     .0225     .004          .06/.004    11.8 dB
(3)         .005     .0025     .0045     .004          .012/.004   4.77 dB
(4)         .005     .0025     .0045     .012          .012/.012   0.0 dB

Table I Computer-Generated Noise Composition

The power spectrum and average-energy histogram for condition (1) are displayed in Figure 8. Average signal power has been normalized to 1.0 in the plots for comparison purposes. The optimal Wiener filter to remove such noise has a magnitude transfer function roughly proportional to the inverse of the power spectrum in Figure 8.

Digitized recordings of the motor sounds of a common household vacuum cleaner were used to demonstrate the algorithm's performance on a realistic quasi-stationary signal typical of machine environments. As can be seen from the plots in Figure 9, the average energy histogram has a bell-shaped distribution. The spectral plot indicates that the motor noise is predominantly white, with strong frequency components at 268 and 1585 Hz and weaker components at 683, 2918, 3181, and 3445 Hz.

A third set of data consists of speech 'babble' obtained from the revised SPIN test used for clinical evaluation in Chapter 6, created by superimposing the recordings of twelve talkers reading aloud from continuous text. The energy histogram, illustrated in Figure 10, is again roughly bell-shaped.
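The decibel values in the last column of Table I follow directly from the listed component powers. The short check below recomputes them; the dictionary layout and function name are illustrative choices, and the numbers are those given in the table.

```python
import math

# Component powers from Table I: (800 Hz, 1700 Hz, 2300 Hz), white noise.
conditions = {
    1: ((0.05, 0.025, 0.045), 0.002),
    2: ((0.025, 0.0125, 0.0225), 0.004),
    3: ((0.005, 0.0025, 0.0045), 0.004),
    4: ((0.005, 0.0025, 0.0045), 0.012),
}


def coherent_ratio_db(tone_powers, white_power):
    """Ratio of coherent to noncoherent noise power, in decibels."""
    return 10.0 * math.log10(sum(tone_powers) / white_power)


ratios_db = {c: coherent_ratio_db(t, w) for c, (t, w) in conditions.items()}
```

For condition (1) the three tones sum to 0.12, giving a ratio of 60, or about 17.8 dB; condition (4) has equal coherent and noncoherent power and therefore a ratio of 0 dB.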
One of the requirements for effective performance of the adaptive-predictor structure is that noise characteristics must not change significantly when speech is present. However, as discussed in the speech model presented in the introduction, speech can generally be considered stationary only over short intervals of under 30 ms. Efficient cancellation of nonstationary speech babble with this filter structure is therefore not possible. Plotted in Figure 10 are both the average power spectral density for a 5-second sample of speech babble and a single curve for a 32-ms sample. It is evident that babble noise is rich in low-frequency components; intuitively one would expect that the adaptive predictor would form a highpass filter to remove these frequencies. The nature of the output speech would intrinsically change, stressing the high-frequency components while removing those of lower frequency. Highpass filtering has been shown to reduce the masking of high-frequency components of speech important to perception by components lower in frequency but with a higher energy content.

Figure 8 Signal Characteristics of Computer-Generated Noise, Condition (1). Panels: Average Energy Histogram (versus Average Magnitude); Power Spectral Density (versus Frequency; 1024-point FFT, no overlap, 5.0 s average).

Figure 9 Signal Characteristics of Vacuum-Cleaner Noise. Panels: Average Energy Histogram (versus Average Magnitude); Power Spectral Density (versus Frequency).

Figure 10 Signal Characteristics of Speech-Babble Noise. Panels: Average Energy Histogram (versus Average Magnitude; J=320); Power Spectral Density (versus Frequency; 128-point FFT, no overlap, 5.0 s average and 32 ms average).
Hence there is some indication that the intelligibility of speech corrupted by this type of noise would be enhanced by the filtering process for individuals with some types of hearing loss.

The final data used in algorithm evaluation is a high-quality recording of an oral reading of the following article,

    The rat-a-tat-tat and vroom of children's toys are doing more than driving parents crazy: they're damaging their kids' hearing. The Canadian Hearing Society says some toys take the same long-term toll on ears as the roaring industrial drill ripping up your sidewalk, and it wants toy manufacturers to muffle their products.

The speaker was a male adult with a western-Canadian accent. The noise data described above were added to the clean speech signal at several different SNRs. The corrupted speech was then used to select parameters for the speech-detection algorithm.

5.2.2 LMS Algorithm

The critical parameters with respect to the leaky-normalized LMS algorithm include: $L$, the number of taps in the transversal filter; $\beta$, the factor governing convergence; and $\gamma$, the loss factor regulating filter 'leakage'. The following analyses were executed in non-real-time on the 386 AT; all intermediate values were stored in floating-point format, with both input and final output values truncated to 12 bits.

TAPS L In most cases, the ideal Wiener filter is characterized by an impulse response which extends infinitely along the time axis. A filter with a finite number of weights can therefore only approximate this ideal response; performance improves as the number of weights is increased. The ability to resolve adjacent frequencies increases with filter length; an increase in resolution implies that the filter's ability to selectively reduce interference in specific frequency bands improves while the effect on the remaining spectrum is minimized.
However, computational requirements also increase with filter size, and steady-state misadjustment becomes an important factor.

The conventional-LMS algorithm as presented in equation (12) was applied to CG noise, condition (1), with convergence parameter $\mu = 0.015$. The algorithm was considered to have converged after 5 seconds of adaptation. Since the adaptive predictor is only able to adapt to correlated inputs, the algorithm will attempt to remove the periodic interference but cannot adapt to the uncorrelated noise. The transfer function of the converged filter for three different filter lengths is shown in Figure 11. At a sampling rate of 8 kHz, tap lengths $L$ of 15, 31 and 63 provide resolutions of 500, 250 and 125 Hz respectively. Note that the stopband notches at the frequencies of the periodic components are narrowest for $L = 63$.

Using the same noise file, the squared-error outputs from an ensemble of fifty experimental runs are averaged and the result is plotted versus iteration in Figure 12. Three different filter lengths are used; it can be seen that as the number of weights increases, the filter adapts faster to the noise characteristics. The corresponding smooth curves are derived from equations (6) and (13), starting with initial weight vector $W_0 = 0$ and using known values for $R_X$ and $\xi_{min}$. The mean behavior predicted by theory agrees closely with that achieved experimentally.

Transient characteristics for this noise file are listed in Table II. Convergence speed is dominated by the smallest eigenvalue of $R_X$ which corresponds to a coherent component, $\lambda_{min}'$. By substituting $\lambda_{min}'$ into equation (26), the dominant time constant can be calculated from theory.
The actual settling time determined experimentally is slightly greater than that predicted by theory.³

Figure 11 Filter Transfer Function (versus Frequency). Computer-Generated Noise, Condition (1)

Figure 12 Convergence Speed versus Filter Length (versus Iteration). Computer-Generated Noise, Condition (1)

Due to the weight-vector noise in an LMS implementation, steady-state MSE is larger than $\xi_{min}$, which, using an ideal filter, would be of the same order as the input power of the white Gaussian noise component of the CG noise samples. The deviation from the ideal is measured in terms of steady-state misadjustment, defined in equation (22). For condition (1), Table III lists experimental values for steady-state MSE for each filter length and the corresponding misadjustment.⁴ Included for comparison are theoretical values for misadjustment calculated from equation (23), as well as the theoretical steady-state MSE defined as,

$$\text{steady-state MSE} = (1 + M)\,\xi_{min}. \tag{77}$$

There is little disagreement between experimental and theoretical results; much of the error can be attributed to the fact that individual input vectors are correlated in time, in violation of assumptions used in deriving the theoretical equations. Another source of error derives from the quantization of input and output values to 12 bits. In addition, the need to perform ergodic averages would be mitigated if more runs were used in the ensemble average, smoothing the experimental curves.

³ Settling time is estimated experimentally as the iteration where a 200-point ergodic average of the ensemble MSE falls to within two percent of the steady-state value. The corresponding time constant is calculated as roughly one-quarter of this value.

⁴ Experimental steady-state MSE is calculated after 5 seconds of adaptation as an ergodic average of the final 1000 points of the ensemble MSE. $\xi_{min}$ is computed directly from equation (9) using known statistics $P_X$, $R_X$ and $E[x_k^2]$. Excess MSE is the difference between steady-state MSE and $\xi_{min}$.

                                         Dominant Time Constant
Tap Length   lambda_max   lambda_min'    Theoretical     Experimental
L=15         .433         .181           92 (11.5 ms)    100 (12.5 ms)
L=31         .851         .372           45 (5.6 ms)     58 (7.25 ms)
L=63         1.630        .792           21 (2.6 ms)     30 (3.75 ms)

Table II Transient Performance. Computer-Generated Noise, Condition (1)

                            Steady-State MSE              Misadjustment
Tap Length   Minimum MSE    Theoretical   Experimental    Theoretical   Experimental
L=15         .00279         .00288        .00289          3.1%          3.6%
L=31         .00242         .00257        .00263          6.4%          8.8%
L=63         .00221         .00251        .00253          13.9%         14.9%

Table III Steady-State Characteristics. Computer-Generated Noise, Condition (1)

Although minimum MSE decreases with filter length, misadjustment is known to increase. Because of this trade-off, minimal improvement in filter performance is achieved once tap length is increased beyond a limit set by the input signal characteristics. Figure 13 illustrates the theoretical relationship between steady-state MSE and filter length computed from equations (23) and (77) for each CG noise file. Conditions (1) and (2) exhibit the greatest decrease in steady-state MSE as tap length increases, due to the fact that the ratio of coherent to noncoherent components, $\Theta$, is largest for these files.
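The theoretical columns of Tables II and III can be cross-checked with a few lines of arithmetic. The sketch below assumes that the dominant MSE time constant of equation (26) has the form $1/(4\mu\lambda)$; that form is not shown in this section and is inferred only because it reproduces the tabulated values with $\mu = 0.015$. Function names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000.0


def mse_time_constant(mu, eigenvalue):
    """Dominant MSE time constant in iterations.

    Assumption: eq. (26) has the form 1 / (4 * mu * lambda).
    """
    return 1.0 / (4.0 * mu * eigenvalue)


def iterations_to_ms(iterations):
    """Convert a number of iterations to milliseconds at 8 kHz."""
    return 1000.0 * iterations / SAMPLE_RATE_HZ


def steady_state_mse(min_mse, misadjustment):
    """Theoretical steady-state MSE from eq. (77)."""
    return (1.0 + misadjustment) * min_mse


# Values taken from Table II (lambda_min') and Table III (L = 15 and 31).
tau_15 = mse_time_constant(0.015, 0.181)
tau_31 = mse_time_constant(0.015, 0.372)
tau_63 = mse_time_constant(0.015, 0.792)
mse_15 = steady_state_mse(0.00279, 0.031)
mse_31 = steady_state_mse(0.00242, 0.064)
```

Under that assumption the time constants round to 92, 45 and 21 iterations, matching the theoretical column of Table II, and 92 iterations at 8 kHz corresponds to 11.5 ms.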
Theory predicts that two filter weights are required to cancel each sinusoidal noise component [48]; therefore at tap lengths $L \gg 5$, performance tapers off towards the predicted theoretical misadjustment for these noise sources.

In summary, the benefits of using a longer filter length are increased resolution and convergence speed; however, steady-state error increases with the number of weights, as does the cost of implementation. Filter lengths in the range $15 \le L \le 63$ have been implemented in related LMS applications involving speech [35, 37, 69]. An additional point to be made is that the speech enhancement scheme requires that weight adaptation occur only during periods of 'silence'. When a speech-to-silence transition occurs, noise corrupted by speech may be present in vector $X_{k-1}$, and hence used in updating the filter coefficients. This problem is more prevalent as filter length increases. With these considerations, a tap length of $L = 31$ was selected for subsequent analyses.

Figure 13 Steady-State Error versus Filter Length (versus Tap Length). Computer-Generated Noise, All Conditions

BETA $\beta$ The normalized-LMS algorithm adapts itself continuously to variations in the power of the input noise signal. It employs a variable convergence factor $\mu_k$, which is inversely proportional to the power of the input vector $X_k$,

$$\mu_k = \frac{\beta}{(L+1)\,\sigma_k^2}. \tag{34}$$

Parameter $\beta$ governs the speed by which the algorithm approaches its steady-state value. The factor $\sigma_k^2$ in (34) is an estimate of the input-signal power obtained by filtering the instantaneous power with a single-pole filter,

$$\sigma_k^2 = (1 - \nu)\,\sigma_{k-1}^2 + \nu x_k^2. \tag{32}$$

Parameter $1/\nu$ defines the number of iterations before an input sample is 'forgotten' in the averaging process. By definition $1/\nu$ should be greater than one filter length. Its upper limit depends on the characteristics of the input signal; if $1/\nu$ is too large, the power estimate will not be able to track nonstationary inputs. Since the purpose of (32) is simply to smooth the estimate, $\nu$ was set to 0.0125 in this application, corresponding to a time constant of 80 iterations or 2.5 filter lengths. The effect of an input sample is forgotten in roughly four time constants, which is equivalent to 10 filter lengths or 40 ms when sampling at 8 kHz.

Figure 14 Comparison between Conventional and Normalized LMS Algorithms (misadjustment versus Signal Power; L=31; Conventional LMS, Mu=0.05; Normalized LMS, Beta=0.025; conditions (1) to (4) marked)

The main benefit of the normalized-LMS algorithm is that steady-state characteristics do not change significantly with a change in input signal power. If the conventional LMS algorithm, $\mu = 0.05$, is applied to each of the noise files of Table I, experimental steady-state misadjustment is seen to increase linearly with input signal power, as illustrated in Figure 14. When the normalized-LMS algorithm of equation (33) with $\beta = 0.025$ is applied to the same noise files, misadjustment remains below 3.6%, independent of input signal power.

As $\beta$ increases, the convergence rate of the adaptive process increases. In Figure 15, the normalized-LMS algorithm is applied to CG noise, condition (1). An ensemble average of fifty experimental runs is shown for two different values of $\beta$.⁵ The mean curves predicted by theory are also plotted, replacing estimate $\sigma_k^2$ with the known input signal power $E[x_k^2]$ in equation (34). The convergence rate for $\beta = 0.01$ is approximately 30 times slower than that of $\beta = 0.35$, as expected.

⁵ $\sigma_0^2$ was initialized to $E[x_k^2]$ of the noise file for this analysis.

Figure 16 illustrates the relationship between $\beta$ and the time constant of MSE adaptation as derived experimentally. A theoretical curve is also plotted for comparison purposes using equation (26) and the dominant eigenvalue of $R_X$, $\lambda_{min}'$. Both theoretical and experimental performance tapers off for values of $\beta$ greater than 0.25.
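The memory of the single-pole power estimate in equation (32) can be illustrated numerically. The sketch below is illustrative only; the function name and the step-input test signal are hypothetical, while $\nu = 0.0125$, the 32-weight filter length ($L = 31$) and the 8 kHz sampling rate come from the text.

```python
def power_estimate(samples, nu, sigma2_init=0.0):
    """Single-pole running estimate of input power, as in eq. (32)."""
    sigma2 = sigma2_init
    for x in samples:
        sigma2 = (1.0 - nu) * sigma2 + nu * x * x
    return sigma2


nu = 0.0125
time_constant = 1.0 / nu               # about 80 iterations
filter_lengths = time_constant / 32.0  # about 2.5 lengths of 32 weights
forget_ms = 1000.0 * 4.0 * time_constant / 8000.0   # four time constants

# Step response: unit-power input applied for four time constants,
# starting from a zero estimate.
estimate = power_estimate([1.0] * 320, nu)
```

After 320 iterations the estimate has reached roughly 98% of the true power, consistent with the statement that a sample is effectively forgotten after about four time constants, or 40 ms.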
However, as convergence speed increases with β, steady-state misadjustment also increases. Figure 17 demonstrates this relationship. Misadjustment as predicted by Horowitz (23) and Widrow (24) is plotted along with experimental results. Note that Widrow’s results apply only for very small values of β.

It is interesting to note that if the weight-vector convergence condition in (20) is simplified, substituting tr[R_x] for λ_max and roughly approximating signal power as tr[R_x] ≈ (L+1)σ², it follows that a finite weight-vector variance is ensured if β is within the range 0<β<0.33. This is substantiated by the fact that for β≥0.33, both theoretical and experimental misadjustment are greater than 60 percent. In general, β is selected as large as possible to maximize convergence speed within the limits defined for misadjustment. If the maximum steady-state error desired is less than 10%, the valid range becomes 0<β<0.05 for these noise characteristics. Selecting β=0.025 places steady-state MSE within 2.6 percent of the predicted minimum MSE.

Figure 15 Effect of Beta on Convergence Speed (Computer-Generated Noise, Condition (1); horizontal axis: iteration)

Figure 16 Time Constant versus Convergence Factor Beta (Computer-Generated Noise, Condition (1))

Figure 17 Misadjustment versus Convergence Factor Beta (Computer-Generated Noise, Condition (1))

Figure 18 Effect of Alpha on Transfer Function Shape (Computer-Generated Noise, Condition (1); horizontal axis: frequency)

GAMMA The leaky variant of the LMS algorithm, defined in equation (46), is designed to reduce the ill effects caused by spectrally-deficient inputs. A leakage factor, γ=1−2μα, determines the degree of loss or ‘leakage’ that the weights experience during the weight-update process. The transfer function of the converged weight vector for this variant applied to CG noise, condition (1), is provided in Figure 18 for three different values of α and μ=0.015.
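The role of the leakage factor can be sketched in a few lines of Python. Equation (46) is not reproduced in this section, so the update below assumes the commonly used leaky-LMS form w ← γw + 2μ e X; the function names `leakage_factor` and `leaky_lms_step` are hypothetical and serve only as an illustration.

```python
def leakage_factor(mu, alpha):
    """Leakage factor gamma = 1 - 2*mu*alpha, as defined in the text."""
    return 1.0 - 2.0 * mu * alpha


def leaky_lms_step(w, past, sample, mu, gamma):
    """One leaky-LMS update of a one-step predictor (assumed standard form).

    w      : current weights
    past   : previous input samples, newest first (same length as w)
    sample : current input sample
    Returns the updated weights and the prediction error.
    """
    prediction = sum(wi * xi for wi, xi in zip(w, past))
    e = sample - prediction
    # Every weight is scaled by gamma before the usual gradient correction,
    # so weights that are not reinforced by the input decay towards zero.
    new_w = [gamma * wi + 2.0 * mu * e * xi for wi, xi in zip(w, past)]
    return new_w, e
```

Setting γ=1 (α=0) recovers the conventional LMS update. With zero input, each step simply multiplies the weights by γ, which is the ‘leakage’ the text refers to.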
As α increases, the depths of the stopband notches decrease and the filter response is flatter. In contrast to the effect of varying tap length L, notch widths remain constant while depths increase or decrease.

To investigate the effects of α (or alternatively γ), equation (46) is applied to condition (1), holding the convergence factor constant at μ=0.05 and allowing α to vary over 0.0001<α<0.01. Figure 19 illustrates how steady-state misadjustment is affected at different values of α. Condition (1) exhibits substantial deviation as α is varied. This is due in part to the fact that for these data, μ=0.05 is much closer to the boundaries of the convergence limits in (20). However, this input is also the most spectrally deficient; the effects of ill-conditioning should be more noticeable, as discussed in Sections 3.2.2 and 3.2.3. Misadjustment is minimized at an α of approximately 0.003, which corresponds to γ=0.9997. At larger values of α, the transfer function notches are too shallow to effectively cancel the noise input. As α approaches zero, the effects of ill-conditioning become more significant.

Figure 19 Alpha Dependence of Leaky-LMS Algorithm (All Conditions; μ=0.05, L=31; horizontal axis: alpha)

Figure 20 Gamma Dependence of Leaky-Normalized LMS Algorithm (All Conditions; β=0.025, L=31; horizontal axis: 1−gamma)

In Figure 20, the final leaky-normalized-LMS algorithm is applied to all four noise conditions. Steady-state misadjustment is plotted versus 1−γ to allow comparison with Figure 19. Misadjustment is relatively independent of γ for (1−γ)≤0.0001, or γ≥0.9999.
Therefore, selecting a γ in the range 0.9999<γ<0.999999 will not affect output misadjustment significantly, but will correct for some finite-precision effects which may occur after long periods of adaptation.

5.2.3 Speech Detection Algorithm

The critical values in the silence detection algorithm include the speech-detection thresholds ZHigh and ZLow, and the average threshold below which a sample is included in the long-term mean, ZCrit. These thresholds affect the percentage of the speech signal which is classified as noise. Intuitively, when the thresholds are low, only a small fraction of the speech waveform is misclassified; the noise-reduction scheme adapts solely to the background noise. However, the intervals classified as silence are also very short; weight adaptation rarely occurs, degrading algorithm performance in nonstationary environments. At higher thresholds, a significant portion of the speech signal is misclassified; the noise canceler attempts to remove this speech, distorting the output waveform.

Rose selected threshold levels that optimized algorithm performance with respect to maximizing silence compression while minimizing information loss. In this application, the critical issue is how these levels affect the performance of the noise-cancellation algorithm. Subjective listening tests can be used to establish a valid operating range; however, an analytical criterion which would measure this effect was also desired. With stationary noise sources, performance is often gauged by first ascertaining the value of the filter weights at convergence, and then processing the speech and noise individually with this filter. The resulting outputs are then used to calculate the overall SNR improvement. However, the proposed noise-reduction scheme continuously adapts to segments of the signal classified as ‘silence’, which may or may not include actual speech.
Hence the filter weights do not converge in the mean, but vary continuously with changes in the speech signal; the filter transfer function depends on the particular words spoken in a given time interval. Therefore a different approach was undertaken.

For stationary noises, the optimal Wiener filter required to minimize MSE was defined in an earlier section as

$$W^{*} = R^{-1}P. \qquad (10)$$

Using known noise characteristics, W* can be calculated and used to filter the corrupted speech x to obtain a ‘best’ speech estimate q,

$$q = x \circledast W^{*}, \qquad (78)$$

where ⊛ denotes convolution. This estimate can then be used as a basis of comparison in correlation analysis.

The cross-correlation coefficient (CCF) is a measure of the similarity of two signals q and x,

$$\mathrm{CCF} = \frac{q^{T}x}{\sqrt{(q^{T}q)(x^{T}x)}}. \qquad (79)$$

The CCF between the ‘best’ speech estimate and the enhanced speech waveform, when calculated over the entire sequence length, emphasizes signal fragments of larger amplitude and provides little information regarding variations in filter performance during silence intervals, where the average amplitude is smaller. Therefore signals are divided into non-overlapping segments of length H, and the CCF is calculated for each segment. The average, or mean-segmented CCF (MSCCF), is then used as a measure of algorithm performance,

$$\mathrm{MSCCF} = \frac{H}{T}\sum_{i=1}^{T/H}\mathrm{CCF}_i. \qquad (80)$$

The summation is over the length of the test sequence T, divided by H. A segment length of H=256, or 32 ms, is used.

Implementation Details Rose found that small frame sizes are more sensitive to variations in the speech waveform. Shorter frame sizes are better able to detect short low-level consonants which would otherwise be classified as silence [67]. However, when the frame size is shorter than tap length L, the probability increases that noise corrupted by speech will be present in vector X_{k-1}, which is used to update the filter coefficients. Therefore a frame size J=L+1 was selected, where L=31.
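The mean-segmented CCF of (79) and (80) can be sketched in Python as follows. This is an illustrative sketch rather than the thesis code: it assumes the CCF is the normalized inner product of the two segments, treats a segment with zero energy as contributing a CCF of zero, and discards any trailing samples that do not fill a complete segment. The function names `ccf` and `msccf` are hypothetical.

```python
import math


def ccf(q, x):
    """Normalized cross-correlation coefficient of two equal-length segments."""
    num = sum(a * b for a, b in zip(q, x))
    den = math.sqrt(sum(a * a for a in q) * sum(b * b for b in x))
    # Assumption: a silent (zero-energy) segment contributes zero.
    return num / den if den > 0.0 else 0.0


def msccf(q, x, H=256):
    """Mean of the CCF over non-overlapping segments of H samples."""
    n_segments = min(len(q), len(x)) // H
    if n_segments == 0:
        raise ValueError("signals are shorter than one segment")
    total = 0.0
    for i in range(n_segments):
        start, stop = i * H, (i + 1) * H
        total += ccf(q[start:stop], x[start:stop])
    return total / n_segments
```

Because each segment is normalized by its own energy, low-amplitude segments carry the same weight in the average as high-amplitude ones, which is the motivation given in the text for segmenting the signals.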
A minimum silence length of two frames was defined to minimize rapid context switches; two consecutive frames at silence levels are required before a segment is classified as silence. Average energy levels SpeechAvg and SilAvg were estimated over time intervals of 1.024 s and 128 ms respectively. SpeechAvg was initialized to 0.122 on a ±1.0 scale (250/2048 in 12-bit, two’s-complement representation). Initial values for SilAvg and SilVar were determined during the first 128 ms of the test signal, which was assumed to be noise only. Normally during this period, the noise-reduction algorithm would also be updating the filter coefficients. However, for the following trials, weight updates do not start until after this period, in order to emphasize the effects of the individual parameters.

Low Threshold ZLow For the following sections, the noise-cancellation algorithm is used with parameters L=31, β=0.025 and γ=0.99995. Both the MSCCF and subjective listening tests are used to evaluate the performance of the speech-enhancement algorithm. Noise data are added to the clean speech signal at several different SNRs, and the MSCCF is calculated for various combinations of ZHigh, ZLow and ZCrit. For this section, SNR is defined as

$$\mathrm{SNR} = 10\log_{10}\frac{\sum_{k=1}^{T} s_k^2}{\sum_{k=1}^{T} n_k^2}, \qquad (81)$$

where s is the speech waveform and n the additive noise signal. The summation is over the length of the test sequence T.

A rudimentary speech-detection algorithm is achieved if all threshold parameters are set equal, i.e. Z=ZLow=ZHigh=ZCrit. Using this basic algorithm, processing was performed on the speech test file to which motor noise was added at SNRs of 8 and 15 dB. Figure 21 illustrates the change in MSCCF as the parameters are varied in unison from 0.0 to 5.0 S.D. (see footnote 6). At low values of Z, there are fewer intervals classified as ‘silence’; the weight-update process does not create an efficient noise filter, resulting in a low MSCCF.
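The SNR definition in (81), and the step of adding noise to clean speech at a chosen SNR, can be sketched as below. The sketch assumes the noise is scaled by a single gain before being added; the function names `snr_db` and `mix_at_snr` are hypothetical, and the thesis does not state how its test files were actually scaled.

```python
import math


def snr_db(speech, noise):
    """SNR in dB: ratio of summed squared speech to summed squared noise."""
    speech_energy = sum(s * s for s in speech)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10(speech_energy / noise_energy)


def mix_at_snr(speech, noise, target_db):
    """Scale the noise so that snr_db(speech, scaled) equals target_db.

    Returns the noisy signal and the scaled noise, both truncated to the
    shorter of the two inputs.
    """
    n = min(len(speech), len(noise))
    speech, noise = speech[:n], noise[:n]
    # Scaling the noise by g lowers the SNR by 20*log10(g) dB.
    gain = 10.0 ** ((snr_db(speech, noise) - target_db) / 20.0)
    scaled = [gain * v for v in noise]
    noisy = [s + v for s, v in zip(speech, scaled)]
    return noisy, scaled
```

For the two conditions used in the text, `mix_at_snr(speech, noise, 8.0)` and `mix_at_snr(speech, noise, 15.0)` would produce the corresponding noisy test signals under these assumptions.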
Note that since the algorithm assumes the first 128 ms of signal are noise, an estimate of SilAvg and SilVar is always obtained; hence Z=0 does not imply that enhancement is not taking place, but rather that adaptation to the background noise characteristics rarely occurs. As Z increases, the filter transfer function is a better fit to the characteristics of the noise and MSCCF increases. However, once Z becomes too large, MSCCF starts to decrease; speech misclassified as silence is used in the update process, producing a filter which distorts the output speech. MSCCF peaks between 1.75<Z<2.25 S.D.

Footnote 6: For the following discussion, thresholds will be referred to in terms of standard deviations (S.D.). Actual parameter values are proportional to these values. See equations (71) and (75).

Figure 21 Basic Speech-Detection Algorithm (Vacuum Motor Noise; MSCCF versus ZHigh=ZLow=ZCrit at SNRs of 8 dB and 15 dB)

Figure 22 Basic Speech-Detection Algorithm (Computer-Generated Noise, Condition (1); MSCCF versus ZHigh=ZLow=ZCrit at SNRs of 8 dB and 15 dB)

This procedure was repeated using CG noise, condition (1), to corrupt the clean speech file. Figure 22 illustrates the variation in MSCCF as Z varies from 0 to 5. MSCCF peaks between 0.75<Z<1.75 S.D. Similar results for other CG conditions may be found in Appendix B. However, as the ratio of coherent to noncoherent noise components decreases, the ability of the MSCCF to successfully define a suitable range for Z diminishes. Since the adaptive predictor can only affect the coherent component of the noise, the degree of noise cancellation possible decreases as its proportion in the signal decreases.
Therefore, for conditions (3) and (4), the variation in MSCCF is minimal until Z reaches about 1.5, where speech misclassified as noise starts to cause a significant distortion in the output speech waveform.

One interesting discovery was that leakage factor γ actually corrects for some of the distortion error that occurs when Z is chosen too large. As shown in Figure 23, when Z=ZLow=ZHigh=ZCrit=4.0 and μ=0.05, MSCCF peaks for γ≈0.9995 (1−γ=0.0005); the leakage effect removes the distortion caused by misclassified speech more rapidly than the conventional LMS with γ=1.0 (1−γ=0).

For the sake of interest, babble noise was also added at SNRs of 8 and 15 dB to the clean speech file. However, babble is inherently nonstationary, which precludes the calculation of W*. Therefore the MSCCF was instead calculated between the enhanced output for the noisy speech and the original uncorrupted speech file. The results shown in Figure 24 are quite different from those for CG or motor noise. The peak is smaller, on the order of 2-4%, and seems to depend on SNR. In addition, MSCCF does not drop off significantly as Z increases.

Although MSCCF provides some indication of the range in which Z may be operated, results vary with both SNR and noise type. Therefore the same selection process was also performed subjectively by Neufeld, using informal listening tests for each of the possible noise backgrounds. At higher Z, the enhanced speech was noticeably distorted; at lower Z, background noise appeared louder.

Figure 23 Gamma Correction (Computer-Generated Noise, Condition (1); horizontal axis: 1−gamma)

Figure 24 Basic Speech-Detection Algorithm (Babble Noise; MSCCF versus ZHigh=ZLow=ZCrit at SNRs of 8 dB and 15 dB)
The range 1.0<Z<2.0 was perceived to produce the best compromise between noise reduction and output sound quality.

Average Threshold ZCrit The parameter ZCrit prevents speech samples near a silence-to-speech context change from corrupting the long-term average SilAvg. To determine a valid operating range, ZHigh and ZLow were held constant at the mid-point of the range determined in the preceding tests, Z=1.5 S.D. ZCrit was then varied over 0<ZCrit≤1.5 S.D. It was found that MSCCF could not serve as an adequate measure of performance, since its maximum variation over this parameter range was in most cases less than 1%. ZCrit was therefore chosen using only subjective listening tests. Sound quality improved as ZCrit was raised to 0.75 S.D., but there was minimal change once the parameter was larger than this value.

Although ZCrit is designed to stop low-level speech near a silence-to-speech transition from corrupting the long-term noise average, it also limits the algorithm’s ability to adapt to sudden, large changes in the average background level. A step increase in average noise level greater than ZCrit S.D. will completely stop the process of adaptation. This limits the algorithm’s effectiveness to noise environments with slowly-changing characteristics. ZCrit should therefore be as large as possible, while still respecting its original purpose. For the ensuing tests, a ZCrit of 1.0 S.D. was selected.

High Threshold ZHigh Parameter ZHigh was varied from 1.5 to 6.0 S.D., while ZLow and ZCrit were held constant at 1.5 and 1.0 S.D. respectively. Subjective listening tests were performed on each of the possible noise backgrounds. Speech quality improved as ZHigh was increased, with the best compromise between output speech quality and noise reduction achieved for ZHigh between 3.0 and 4.5 S.D.

5.3 Final Parameter Values

Table IV lists the parameter values used in the clinical assessment of the speech-enhancement algorithm in the following chapter.
Fragments of the actual C code may be found in Appendix C. Minor implementation details not included in the preceding discussion are also included with the program.

[Table IV: final parameter values]

Chapter 6 Clinical Evaluation

The tests used to assess an individual’s ability to understand speech are commonly drawn from criteria applied by audiologists in evaluating hearing performance. The listener is presented with a list of words, and the percentage recognized and repeated correctly by the listener is measured. Ideally the word lists include a distribution of phonemes in proportion to their natural frequency of occurrence in the native language of the subjects under test. Examples of such tests include the Northwestern University Auditory Test No. 6 [70], the Diagnostic Rhyme Test [71], and the Modified Rhyme Test [72]. Notwithstanding, the revised Speech Perception in Noise (R-SPIN) test as presented by Bilger, et al. [73] has many desirable qualities that recommend its use in the context of this research.

6.1 Revised SPIN Test

The original SPIN test as devised by Kalikow, et al. [74] was standardized by Bilger specifically for use with hearing-impaired rather than normal-hearing subjects. In other tests, word lists are often presented using live voice at a comfortable loudness level in a relatively noise-free environment. The R-SPIN speech material is presented in sentence form and administered in a background of speech ‘babble’ to simulate more realistic day-to-day communication environments. Further, the material is recorded so that in repeated trials the presentation of the material remains consistent.

There are eight psychometrically-equivalent forms of the test, which have been standardized with respect to difficulty, variability, and reliability using individuals with sensorineural hearing loss. Each form consists of 50 sentences spoken by an adult male with a General American accent.
The listener’s task is to recognize and repeat the final monosyllabic word of each sentence. Special care was taken in selecting sentence vocabulary; words whose frequencies of usage in everyday English are neither high nor low were selected and distributed between the different forms such that the lists are balanced according to syllable, vowel and consonant types. In each form, there are 25 words that are highly predictable from contextual cues, e.g. “Mr. Brown carved the roast beef.” The remaining 25 are contextually neutral, e.g. “Harry might consider the beef.” The words are ordered such that no more than two words of the same type occur in sequence. A second matched form utilizes the same group of words but in opposite contextual environments. This design permits separate assessment of how well the listener’s auditory system processes speech, as opposed to how well the individual capitalizes on linguistic information from phonological, lexical, syntactic and semantic cues [75].

The babble was created by superimposing the recordings of twelve talkers reading aloud from continuous text. Selective adjustment of the speech-to-babble ratio is made possible by recording the sentence material and babble at comparable sound levels on different tape tracks. A calibration tone precedes each form on the tape, indicating the recorded level of the following material. Care was taken to align sentences and babble so that temporal fluctuations in the babble amplitude would not degrade the speech material. The babble level was reduced by 10 dB during the number cue prior to the start of each utterance and then returned to its original level, assisting the listener in identifying the onset of a sentence.

Bilger cautions that for valid results, the test should be administered under certain restrictive conditions. These conditions derive from the nature of the development and standardization procedures.
In particular, “the SPIN test is what is recorded on test tapes and is not any other utterance of the written sentences. Second the temporal alignment of the signal (sentences) track and the babble track is also a crucial part of the definition of the SPIN test.” The speech and babble should be mixed electrically and presented through a single transducer at a speech-to-babble ratio of +8 dB, typical of an everyday noise environment. In addition, the test is standardized only for adults for whom American English is their native language. For other experimental conditions, the test cannot be presumed to be reliable [73].

The advantages of the R-SPIN material are that:

• The test forms are recorded on tape; the material presentation remains consistent through repeated trials.
• The material is in sentence form and sentences are of both high and low predictability; therefore the speech is similar to that experienced in everyday situations.
• The material is presented in a background of noise, as opposed to a noise-free environment, which is not indicative of everyday life.

6.2 Experimental Conditions

In drafting the conditions and procedures for the following clinical evaluation, careful consideration was given to the fact that the resulting data should indicate in some way whether the speech-enhancement algorithm improves the discrimination ability of hearing-impaired persons. Further, if some benefit is shown, the results should indicate whether it varies with the speech-to-noise ratio and/or the predictability of the speech material.
The experiment was patterned after recent research by Hanusaik [76] into the effect of inductive coupling and amplification on the speech perception ability of hearing-impaired persons.

Four experimental conditions were chosen:

• C1 +8 dB speech-to-noise ratio, no enhancement
• C2 +8 dB speech-to-noise ratio, enhanced
• C3 +15 dB speech-to-noise ratio, no enhancement
• C4 +15 dB speech-to-noise ratio, enhanced

The conditions were ordered such that the poorest speech-to-noise ratio conditions were always experienced first to minimize learning effects; hence there are four possible orders:

#1    #2    #3    #4
C1    C1    C2    C2
C2    C2    C1    C1
C3    C4    C3    C4
C4    C3    C4    C3

Copies of the original speech materials were obtained from the School of Audiology and Speech Sciences at UBC. Four forms, 1, 3, 5 and 7, selected from the R-SPIN data base were assigned to the four possible condition orders:

#1             #2             #3             #4
C1 / Form 5    C1 / Form 3    C2 / Form 1    C2 / Form 7
C2 / Form 3    C2 / Form 5    C1 / Form 7    C1 / Form 1
C3 / Form 7    C4 / Form 1    C3 / Form 5    C4 / Form 3
C4 / Form 1    C3 / Form 7    C4 / Form 3    C3 / Form 5

The second test was readministered at the end of each order using the same experimental conditions and form to measure the test/retest reliability of the subject.

An amplified telephone set, equipped with a telecoil for the hearing impaired, functioned as transducer for these tests. The materials were presented to the ear most commonly used by the subject for telephone communication. The subject was asked to adjust the level of both hearing-aid and telephone-handset volume controls to a comfortable listening level. Two different background noises were used. The first noise was the original babble noise recorded on the R-SPIN tapes. The second was a recording of motor noise from a common vacuum cleaner. Due to equipment limitations, each subject could only experience one type of noise.
Four subjects were presented the speech materials with babble noise in the background while the remaining four subjects heard motor noise.

Hence there were several departures from Bilger’s procedures in the eventual clinical presentation. Reproductions of the speech materials produced by Bilger were used and not originals. The materials were presented at a different speech-to-noise ratio for two of the conditions. Tests were administered over a different transducer at a level that was found most comfortable by the subject instead of a fixed dB level. Four of the subjects experienced motor noise as opposed to babble. However, the most critical difference between this presentation and Bilger’s is that two of the conditions present speech as processed by the enhancement algorithm, instead of the original materials. Although this presentation of the R-SPIN tests does not follow the recommended procedures verbatim, it was felt that R-SPIN materials could provide usable results that would prove effective in evaluating the performance of the algorithm with respect to enhancing speech intelligibility.

6.3 Presentation

Each subject was seated in the center of a two-room sound-treated booth suitable for ears-uncovered testing, which meets ANSI specification S3.1-1977. A diagram of the equipment setup is found in Figure 25. The R-SPIN sentence material was delivered from one track of a high-quality cassette player to a MADSEN GSI 16 audiometer. Babble noise was available on a second track of the same tape. Recorded motor noise was routed from a second tape recorder. Both tape inputs were calibrated for a 0-VU presentation level at the audiometer using the 1000-Hz tone available at the beginning of each test form. Calibration was repeated for each form. The speech and noise material were channelled to the INSERT PHONE transducer output of the audiometer.

Figure 25 Experimental Setup (Examiner’s Booth and Anechoic Chamber)
This signal was then delivered to the system previously described in Section 5.1 through the TRAIN-ON-PHONE interface. The enhanced output from the noise-reduction filter is passed through an insulating wall to a Northern Telecom amplified handset, Model NTOCO9AF, distributed by BC Tel. The speech channel of the audiometer output was calibrated using the 1000-Hz tone so that the acoustic output at the telephone receiver without amplification was 86 dB SPL (sound pressure level), which is a typical presentation level for telephone speech [77]. The noise channel output was adjusted to a level 8 or 15 dB lower than the speech channel, as the condition order demanded. The subject was instructed to adjust the volume control to a comfortable loudness level during practice sentences presented before each test. With the volume control at the maximum setting, an additional 21 dB of amplification is possible, for a maximum of 107 dB SPL at the handset output. Subject responses were picked up by a microphone, delivered to the experimenter’s headset, and also recorded on tape.

The handset output was calibrated prior to the onset of subject testing using a Quest Model 1800 precision impulse integrating sound level meter. A Model 7023, 1-inch pressure, 200-volt polarization microphone was used during this procedure. The rubber ear cup from a model TDH 50 earphone is used to acoustically interface the handset receiver with a 6 cc brass coupler. The receiver is placed on top of the coupler and a calibrated weight, consisting of two brass masses attached to a crosspiece in a yoke-like fashion, is balanced on top. An appropriately-sized prop supports the receiver handle. The audiometer output is adjusted until the sound level meter reading is 86±3 dB using the 1000-Hz tone from the test tape as the stimulus.
For this system configuration, 90 dB HL (hearing level) produced the necessary 86 dB SPL at the telephone. An additional 21 dB of amplification is possible with the volume control at the maximum setting, which meets the minimum 17 dB specified by CSA standard CAN3-T515-M85 for amplified telephone handsets. Calibration was repeated at regular intervals and the output remained within ±3 dB limits.

6.4 Procedure

Prior to commencing the speech perception tests, the subjects’ ears were first otoscopically examined to determine the ear canal condition and check for blockages. A tympanometric screening of each subject was then performed using a MADSEN GSI 33, version 2, middle-ear analyzer. Next the subject was seated in the sound-treated booth and pure-tone air and bone conduction thresholds were obtained in accordance with ASHA guidelines [78]. The subject’s most comfortable listening level was determined with live voice using a bracketing procedure outlined by Hemeyer [79]. A speech discrimination test was administered at this level using Northwestern University Test No. 6, Form D, List 1 as recorded by Auditec of St. Louis. The subject’s hearing aid battery was also replaced if required. A trained audiologist assisted in these procedures.

Instructions explaining the test procedure were presented to the subject in both oral and written form prior to commencement. A copy of these instructions may be found in Appendix D. The subject was informed that he or she would hear several sets of sentences over the telephone handset, as well as a noise in the background behind the speaking voice. The subject was told that the task would be to repeat the last word of each sentence, and that it was important to guess even if unsure of the word.
Practice sentences were administered first to familiarize the individual with the procedure. The subject was asked, if possible, to adjust the hearing aid (T-switch) of the ear normally used for the telephone, so that telephone and aid were magnetically coupled. Any aid in the opposite ear was turned off. The subject was then instructed to adjust the volume control of both the hearing aid and telephone receiver, as well as the position of the receiver against the hearing aid, while listening to the practice sentences. They were to find the volume and receiver position which made the speech seem the clearest and try to keep the receiver in that same position for the remainder of the tests.

Once the subject felt comfortable with the practice sentences, they were presented with five sets of sentences which conformed to the condition orders described previously. The subjects were instructed to repeat the last word of each sentence and told that a response was required for each test item. The subject responses were scored by hand on forms provided in the R-SPIN manual, which were later validated against a recording taken during the test. In instances where the individual failed to respond to a sentence, the tape was paused and the subject prompted for a guess of what he or she heard. The subjects were encouraged to take a rest period between each test.

6.5 Subjects

Potential subjects were sent a letter of introduction outlining the project. Eight subjects replied, five female and three male. Appendix E tabulates the profile of each subject, derived from questionnaires completed before tests began. The subjects’ ages ranged from 24 to 76 with a median of 58.5 years. One subject’s mother tongue was Hungarian; however, all had a native command of the English language. The degree of hearing loss measured for each subject ranged from mild-to-moderate to profound, and originated from diverse sources, including age-, noise-, and disease-related causes.
Unaided discrimination scores for those able to undergo the test ranged from 40 to 82 with a median of 72 correct answers. No valid estimate of speech discrimination was possible for two of the subjects, since audiometer limits precluded presentation of the test words at an adequate sensation level.

Most rated their own telephone communication ability as ‘Good most of the time’. Three subjects categorized themselves as ‘Fair’ to ‘Poor’ users of the telephone. Each subject was an experienced hearing aid user; years of use ranged from 2.5 to 32 with a median of 10.5 years. Only two used a single hearing aid; the remaining subjects used hearing aids in both ears. Of the eight subjects, three used both the T-switch and handset amplifier regularly, two used only the amplifier, and one used a combination of T-switch at home and amplifier at work.

Subjects were randomly assigned the type of background noise that would be presented during the R-SPIN tests, four subjects to each type of noise. Four subjects, V1-V4, listened to motor noise, where the numbers 1-4 correspond to the particular condition order followed. Babble noise was presented to subjects B1-B4.

Due to equipment problems during the presentation, one subject (V2) was presented materials out of order such that the highest SNR conditions were heard first, contrary to the experimental design.

6.6 Results

Raw scores expressed in percentages for each condition C1-C4 are provided in Table V for both babble and motor noise tests. Retest scores are located in the final column; the score marked by an asterisk in each row corresponds to the repeated test for that subject.

An informal review of these results shows that for motor noise, there is a consistent improvement in R-SPIN scores for the enhanced speech: mean enhanced scores are 20% higher than those obtained with the non-enhanced speech.
The relative improvement does not change as SNR increases: there is a 20% improvement in scores at both 8 and 15 dB. A similar review of the babble noise results indicates that overall, the speech-enhancement algorithm failed for this type of noise. Although two subjects (B1, B2) showed some improvement in R-SPIN scores for both SNR conditions, the remaining two subjects actually experienced a decrease in discrimination scores at the low SNR condition, and no improvement at all for high SNR.

Subject    C1: 8 dB,         C2: 8 dB,    C3: 15 dB,        C4: 15 dB,    *Retest
           no enhancement    enhanced     no enhancement    enhanced
V1         64                *78          82                92            74
V2         20                18           *36               44            38
V3         *18               24           40                64            40
V4         *56               70           78                84            62
Average    39.5              47.5         59.0              71.0
S.D.       23.9              30.9         24.4              21.5
B1         16                *24          30                40            24
B2         24                *44          36                52            44
B3         *66               54           78                78            62
B4         *54               50           64                64            58
Average    40.0              43.0         52.0              58.5
S.D.       23.8              13.3         22.8              16.3

Table V Raw Scores

6.6.1 Statistical Evaluation

To determine if these results were statistically significant, a two-way analysis of variance (ANOVA) with repeated measures was performed for both noise conditions [80]. The results for motor noise are presented in Table VI. Both SNR and enhancement show statistical significance (p < 0.05), from which it can be inferred that it is highly improbable that the improvement in scores can be attributed to experimental error alone. However, the interaction between SNR and enhancement is not significant. This implies that SNR does not affect enhancement scores; it cannot be predicted that the benefit in R-SPIN scores will decrease or increase as SNR changes.

Table VII presents the ANOVA results for the babble-noise test. Only SNR can be considered significant from these results. Enhancement effects are not significant; therefore an increase or decrease in scores cannot be statistically attributed to the enhancement process.
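The sums of squares and F ratios of the two-way repeated-measures ANOVA can be reproduced from the raw scores of Table V. The Python sketch below is an illustration, not the procedure of [80]: it uses the fact that in a 2×2 within-subject design each effect has one degree of freedom and reduces to a per-subject contrast. The names `MOTOR`, `BABBLE`, `CONTRASTS` and `rm_anova_2x2` are hypothetical; the scores are copied from Table V in the order C1, C2, C3, C4. P values are omitted because they require the F distribution, which is not in the Python standard library.

```python
# Raw R-SPIN scores (percent) from Table V, ordered (C1, C2, C3, C4).
MOTOR = {"V1": (64, 78, 82, 92), "V2": (20, 18, 36, 44),
         "V3": (18, 24, 40, 64), "V4": (56, 70, 78, 84)}
BABBLE = {"B1": (16, 24, 30, 40), "B2": (24, 44, 36, 52),
          "B3": (66, 54, 78, 78), "B4": (54, 50, 64, 64)}

# Contrast coefficients over (C1, C2, C3, C4) for each one-d.f. effect.
CONTRASTS = {
    "SNR": (-1, -1, 1, 1),           # 15 dB conditions minus 8 dB conditions
    "Enhancement": (-1, 1, -1, 1),   # enhanced minus unprocessed
    "Interaction": (1, -1, -1, 1),
}


def rm_anova_2x2(scores):
    """Return {effect: (SS_effect, SS_error, F)} for a 2x2 within-subject design."""
    n = len(scores)
    table = {}
    for name, coeffs in CONTRASTS.items():
        # One contrast value per subject.
        c = [sum(k * v for k, v in zip(coeffs, row)) for row in scores.values()]
        mean_c = sum(c) / n
        ss_effect = sum(c) ** 2 / (4 * n)
        ss_error = sum((ci - mean_c) ** 2 for ci in c) / 4
        ms_error = ss_error / (n - 1)        # error term has n - 1 d.f.
        table[name] = (ss_effect, ss_error, ss_effect / ms_error)
    return table
```

Calling `rm_anova_2x2(MOTOR)` and `rm_anova_2x2(BABBLE)` yields the effect and error sums of squares for SNR, enhancement and their interaction; each effect is tested against its own subject-by-effect error term with 3 degrees of freedom.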
There is also no significant interaction between SNR and enhancement effects for babble noise.

Source          df        SS         MS         F        P
SNR              1     1849.00    1849.00     41.71    0.008
  Error          3      133.00      44.33
Enhancement      1      400.00     400.00     15.39    0.029
  Error          3       78.00      26.00
Interaction      1       16.00      16.00      0.44    0.556
  Error          3      110.00      36.67

Table VI ANOVA Results - Motor Noise

Source          df        SS         MS         F        P
SNR              1      756.25     756.25     61.74    0.004
  Error          3       36.75      12.25
Enhancement      1       90.25      90.25      0.76    0.447
  Error          3      354.75     118.25
Interaction      1       12.25      12.25      1.12    0.367
  Error          3       32.75      10.92

Table VII ANOVA Results - Babble Noise

Subject Reliability. A measure of subject and test reliability is determined by repeating the second test for each individual verbatim and calculating the Pearson product-moment correlation coefficient between the original scores and the results for the repeated test [81]. For the motor-noise background, this correlation coefficient is 0.941. The Bartlett chi-square test was then used to test whether the calculated correlation was statistically significant (p < 0.05). For the motor-noise case (χ² = 3.246, df = 1, p = 0.072), the correlation coefficient is not significant. This is due in part to the fact that subject V3 scored much higher on the retest than on the original test. For the babble noise background, the coefficient is 0.983, which is significant (χ² = 5.097, df = 1, p = 0.024) using Bartlett’s test. It can be inferred that although retest correlation coefficients for both background noises are quite high, only the babble test results can be considered repeatable to a high degree of significance.

6.6.2 Comparison of High and Low Predictability Scores

Sentences where the final words are highly predictable with respect to contextual cues, as well as contextually neutral sentences, are included in each form of the R-SPIN tests; therefore separate assessment of results for both high- and low-predictability scores is possible.
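
The retest statistics quoted in Section 6.6.1 can be checked from the starred and retest entries of Table V. The following sketch computes the Pearson product-moment coefficient and then a Bartlett chi-square statistic. The thesis does not spell out the formula it used; the sketch assumes the standard form χ² = −[n − 1 − (2p + 5)/6] ln|R| with p = 2 variables, for which |R| = 1 − r² and there is one degree of freedom. The function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bartlett_chi2(r, n):
    """Bartlett chi-square for a single correlation (assumed form, p = 2, df = 1)."""
    chi2 = -(n - 1 - (2 * 2 + 5) / 6.0) * math.log(1.0 - r * r)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 d.f.
    return chi2, p_value


# Starred (repeated) scores and retest scores from Table V, subjects 1..4.
motor_original, motor_retest = [78, 36, 18, 56], [74, 38, 40, 62]
babble_original, babble_retest = [24, 44, 66, 54], [24, 44, 62, 58]

if __name__ == "__main__":
    for label, orig, retest in (("motor", motor_original, motor_retest),
                                ("babble", babble_original, babble_retest)):
        r = pearson_r(orig, retest)
        chi2, p = bartlett_chi2(r, len(orig))
        print(f"{label}: r={r:.3f} chi2={chi2:.3f} p={p:.3f}")
```

Under these assumptions the sketch returns r = 0.941 for motor noise and r = 0.983 for babble noise, with chi-square values close to the 3.246 and 5.097 quoted in the text. With only four score pairs per noise type, even a correlation above 0.94 can fail the 0.05 criterion.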
Table VIII lists the subject scores for each type of sentence individually. For the motor noise case, raw scores for the high-predictability (HP) and low-predictability (LP) sentences improve by 11% and 38% respectively over unprocessed speech. However, it is difficult to discern from this table any noticeable trends in the data for the babble noise case. A statistical analysis of the predictability scores was therefore completed to obtain further information.

            C1: 8 dB          C2: 8 dB        C3: 15 dB         C4: 15 dB
            no enhancement    enhanced        no enhancement    enhanced        Retest
Subject      HP      LP        HP      LP      HP      LP        HP      LP      *HP    **LP
V1           44      20       *44    **34      46      36        50      42       48      26
V2           14       6        16       2     *24    **12        32      12       24      14
V3          *14     **4        16       8      32       8        40      24       22      18
V4          *44    **12        46      24      46      32        50      34       44      18
Average      29.0    10.5      30.5    17.0    37.0    22.0      43.0    28.0
S.D.         17.3     7.2      16.8    14.7    10.9    14.0       8.7    13.0

B1           10       6       *16     **8      22       8        28      12       18       6
B2           22       2       *34    **10      34       2        40      12       38       6
B3          *48    **18        40      14      44      34        44      34       46      16
B4          *32    **22        34      16      42      22        36      28       38      20
Average      28.0    12.0      31.0    12.0    35.5    16.5      37.0    21.5
S.D.         16.1     9.5      10.4     3.6    10.0    14.4       6.8    11.2

Table VIII Comparison of High and Low Predictability Scores

ANOVA results for high- and low-predictability scores with motor-noise background are presented in Tables IX and X respectively. For HP sentences, enhancement has a significant effect on R-SPIN test scores while SNR does not. In addition, there is a significant interaction between SNR and enhancement: the gain from enhancement is larger at the higher SNR. Conversely, for LP sentences, only SNR is significant.
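
The 11% and 38% figures quoted for Table VIII are relative gains in the mean scores. The text does not state how they were pooled over the two SNR conditions; the sketch below assumes the enhanced means (C2, C4) are summed and compared with the sum of the unprocessed means (C1, C3), which reproduces the quoted values. The names are hypothetical.

```python
# Motor-noise mean scores from Table VIII for conditions C1..C4.
HP_MEANS = {"C1": 29.0, "C2": 30.5, "C3": 37.0, "C4": 43.0}
LP_MEANS = {"C1": 10.5, "C2": 17.0, "C3": 22.0, "C4": 28.0}


def relative_gain(means):
    """Percent gain of enhanced (C2, C4) over unprocessed (C1, C3), pooled over SNR."""
    unprocessed = means["C1"] + means["C3"]
    enhanced = means["C2"] + means["C4"]
    return 100.0 * (enhanced - unprocessed) / unprocessed


def absolute_gains(means):
    """Gain in percentage points at 8 dB and at 15 dB."""
    return means["C2"] - means["C1"], means["C4"] - means["C3"]


if __name__ == "__main__":
    for label, means in (("HP", HP_MEANS), ("LP", LP_MEANS)):
        print(label, round(relative_gain(means)), absolute_gains(means))
```

The absolute gains are 1.5 and 6.0 points for HP sentences and 6.5 and 6.0 points for LP sentences. The large relative LP figure therefore reflects the low unprocessed LP baseline rather than a larger gain in points, which is one reason the relative figures and the ANOVA results need to be read together.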
The enhancement algorithm improves the intelligibility of HP sentences, while having no statistically significant effect on LP sentences; it can be inferred that the enhancement algorithm improves the speech quality such that more contextual information is heard by the listener, however the improvement is not sufficient to increase the intelligibility of the low-predictability sentences, which offer no contextual cues.

Source          df        SS         MS         F        P
SNR              1      420.25     420.25      5.87    0.094
  Error          3      214.75      71.58
Enhancement      1       56.25      56.25     24.98    0.015
  Error          3        6.75       2.25
Interaction      1       20.25      20.25     22.09    0.018
  Error          3        2.75       0.92

Table IX High Predictability ANOVA Results - Motor Noise

Source          df        SS         MS         F        P
SNR              1      506.25     506.25     56.776   0.005
  Error          3       26.75       8.92
Enhancement      1      156.25     156.25      4.85    0.115
  Error          3       96.75      32.25
Interaction      1        0.25       0.25      0.01    0.929
  Error          3       80.75      26.92

Table X Low Predictability ANOVA Results - Motor Noise

A review of the ANOVA results in Tables XI and XII for babble noise shows that neither SNR nor enhancement has a significant effect on the HP or LP R-SPIN scores.

Source          df        SS         MS         F        P
SNR              1      182.25     182.25      6.94    0.078
  Error          3       78.75      26.25
Enhancement      1       20.25      20.25      0.52    0.523
  Error          3      116.75      38.92
Interaction      1        2.25       2.25      0.17    0.704
  Error          3       38.75      12.92

Table XI High Predictability ANOVA Results - Babble Noise

Source          df        SS         MS         F        P
SNR              1      196.00     196.00      3.38    0.163
  Error          3      174.00      58.00
Enhancement      1       25.00      25.00      1.09    0.374
  Error          3       69.00      23.00
Interaction      1       25.00      25.00      4.41    0.127
  Error          3       17.00       5.67

Table XII Low Predictability ANOVA Results - Babble Noise

Subject Reliability. The retest scores for high- and low-predictability sentences are found in the final column of Table VIII. Corresponding repeated tests are designated in each row by asterisks. For the motor-noise background, the retest correlation coefficient for the high-predictability sentences is 0.970 and significant (χ² = 4.236, df = 1, p = 0.04). The low-predictability retest coefficient is 0.832 and is not significant (χ² = 1.767, df = 1, p = 0.184). Subject scores for the high-predictability sentences are repeatable to a much higher level of significance. It may be inferred that the correlation coefficient for the overall result (0.941) is not significant due to the variability in the LP original and retest scores of subject V3, as opposed to her HP scores.

For the babble noise background, the HP retest correlation coefficient is 0.967 and significant (χ² = 4.102, df = 1, p = 0.043). The LP retest correlation coefficient is also high, 0.992, and significant (χ² = 6.234, df = 1, p = 0.013). Therefore the scores for both high- and low-predictability sentences can be considered reliable or repeatable to a high degree.

6.7 Subjective Comments

A back-to-back subjective evaluation between unprocessed and enhanced speech was not performed; however, a number of comments were made by subjects regarding the quality of the output speech. Of those who heard motor noise in the background, only one subject indicated that the processed speech sounded ‘less noisy’, and would be less irritating to listen to over long periods of time. The remaining subjects did not discern any apparent difference between the processed and unprocessed speech.
It is interesting to observe that although the hearing-impaired subjects did not differentiate between the enhanced and original speech, their scores improved consistently for the enhanced output.

Of those who heard babble noise, two noted that the enhanced output sounded ‘tinnier’; low frequency components were lacking. Both these subjects showed a decrease in R-SPIN scores with the enhanced speech. It appears that the effective suppression of low-frequency components removed important perceptual cues, which decreased the intelligibility of the speech for these individuals.

Chapter 7 Discussion

This chapter first discusses the validity of the results obtained during clinical evaluation. Specific limitations of the proposed speech detection algorithm are then described. The chapter concludes with a summary of the results as well as suggestions for further research.

7.1 Validity of Subjective Tests

The significance of the ANOVA test results in the preceding chapter must be balanced against all factors which might contaminate the final subject scores. In this investigation, strict control was maintained with respect to the experimental environment, so that it was felt that the subjective tests provided valid results for this specific study sample. To ensure that variations in subject scores were primarily due to the enhancement algorithm and not outside factors, a concerted effort was made to minimize error that could arise from changes in the clinical presentation itself:
• The experiment was designed such that the R-SPIN forms were balanced across all test conditions to minimize sequence effects. Each subject was randomly assigned a treatment order and all subjects experienced all treatments; therefore, if the test forms were not equivalent, the effects would be distributed equally among all subjects.
• Practice effects were minimized by ordering the conditions such that the low SNRs were administered first. Improvement in scores due to practice effects would therefore be included in the significance results for SNR, which were not as meaningful to this study. However, it must be noted that for one subject (V2), the test conditions were presented in reverse order (high SNR first), contrary to this design.
• The test session was kept as short as possible (2.5 hours) to minimize fatigue and boredom. However, from comments expressed by the subjects and observations made directly by the author, the difficulty in holding and maintaining the telephone handset in one position for the entire test, as well as the concentration required to discern speech in a background of noise, placed considerable strain on the subject.
• The acoustic output of the telephone was calibrated regularly as described in Section 6.3 to ensure that a consistent test environment was maintained for each subject.
• Subject retest reliability was verified by repeating the second test for each individual. The Pearson product-moment correlation coefficient was calculated to validate the repeatability of the experiment. For babble noise, this coefficient was large and highly significant. However, for the motor noise case, the retest score for one subject (V3) was considerably higher than the original score. Therefore motor-noise test results could not be said to be repeatable to a high degree of significance.

7.1.1 Generalizations

Although it was felt that the small study sample size limited the extent to which the results could be generalized to the hearing-impaired population as a whole, the results do indicate that further research is necessary and desirable.
The speech-enhancement algorithm exhibited positive results for all four test subjects experiencing motor noise and partial success for two of the subjects listening to babble noise, using speech materials and noise background which simulated a realistic everyday environment rather than a clinical setting. The results obtained were from a sample group that was diverse enough to suggest that the effects were not specific to one type of hearing loss, etiology, or other subject factor.

The ideal experiment should be capable of distinguishing effects due to factors outside the experiment from treatment effects and experimental error. For this work the study sample was quite small, effectively four subjects per background noise type. Although all the subjects were sampled from a group that could be considered homogeneous, in that each experienced a bilateral hearing loss which required a hearing aid to compensate, in terms of categories such as age, degree of hearing loss, etiology, hearing aid type, brand, etc., the subjects were too diverse for effects due to these factors to be quantified. Since these factors could not be removed, they were included as part of the experimental error, which decreased the sensitivity of the experiment. A larger study sample would allow specific subject factors to be categorized; results could then be assessed with respect to a model which took these factors into account. For example, since two out of four subjects experienced positive results in the babble noise case, it is possible that this type of speech enhancement might be effective for one type of hearing loss and not another.

7.1.2 Comments on R-SPIN Presentation

This experiment was patterned after recent research by Hanusaik [76] whose goal was to evaluate the effect of inductive coupling and receiver amplification on the speech-perception ability of hearing-impaired persons.
To that end, Hanusaik desired that the experimental conditions simulate the real world as far as the customary use of these devices was concerned. Therefore a telephone was employed as transducer and subjects used their own hearing aids. Each subject was also allowed to locate and maintain his own optimal listening position for the telephone receiver as well as adjust hearing-aid gain and receiver amplification to the most comfortable settings.

The main goal of this research was to evaluate the efficacy of the speech-enhancement algorithm in terms of improving speech intelligibility. The telephone represents only one specific application of this algorithm. Therefore, in hindsight, it was felt that algorithm evaluation would have been better served using earphones rather than the telephone as transducer. Fatigue effects would have been greatly reduced, improving the sensitivity of the experiment, although at the same time reducing the extent to which the results could be generalized to the telephone problem.

7.2 Performance Limitations

An attempt was made during development of the speech-enhancement algorithm to minimize assumptions made about the noise environment in order to facilitate its use in related applications. However, as noted in the introductory discussions, noise-reduction algorithms must in general make some assumptions about the noise environment.
In particular, the adaptive predictor structure requires that the following conditions be met:
• noise signal is additive in nature with slowly time-varying properties.
• components of the noise signal are periodic.
• non-speech intervals exist of sufficient length and frequency that the statistics of the background noise can be estimated.
• the above characteristics must not change significantly when speech is present.
Further, the speech-detection algorithm requires that:
• speech-to-noise ratio is positive.
• noise has a quasi-Gaussian signal distribution.
• noise signal is relatively stationary with no sharp discontinuities.

The above assumptions define the algorithm performance limits. It is not unreasonable to assume that there exist many environments where noise is additive with a quasi-Gaussian signal distribution. However, the constraints of stationarity and periodicity are not so easily addressed. Factory and automobile environments are examples of milieus where the background noise will have some periodic properties, with characteristics that change slowly over time. This scheme is capable of adapting to these noise environments, creating a filter which removes the correlated noise components. However, any broadband white noise component will still remain.

On the other end of the scale, cafeteria or babble noise as well as street noise exhibit random properties that can only be considered stationary for very short time intervals. The adaptive predictor forms a filter which suppresses only the slowly-changing, usually low-frequency components of the noise, which yields improvement only for a limited class of hearing-impaired individuals. As demonstrated by the clinical trials, this speech enhancement scheme is inadequate in such environs. Moreover, the filter will have no effect at all on impulsive noise.

The proposed speech enhancement scheme will also break down at low SNRs.
Under these conditions, the short-term energy measures used in the speech detection algorithm no longer suffice to differentiate speech from noise. In this work, an SNR greater than 6 dB was assumed. However, in other applications where SNR nears zero or becomes negative, the more sophisticated algorithms noted in Chapter 4 could prove more successful.

7.3 Summary

A simple, efficient, real-time implementation of a single-input speech-enhancement scheme, designed to increase the intelligibility of speech corrupted by additive noise, was evaluated with speech-discrimination tests using hearing-impaired subjects. The algorithm takes advantage of a novel speech-detection algorithm, recently developed at the University of British Columbia, to classify individual signal segments as silence or speech. The coefficients of an adaptive filter are updated during signal periods classified as ‘silence’. Two variants of the LMS algorithm, the leaky and normalized-LMS, are integrated in the weight-update process to produce a noise-reduction scheme which adapts itself to the power of the input signal and also corrects for ill-conditioned inputs. Although often used individually, to our knowledge the incorporation of both variants into a single algorithm is unique to this research. The resulting speech-enhancement algorithm adapts continuously to its environment and assumes a minimum of a priori information about signal characteristics, making it robust under changing noise conditions. However, there are limitations to its use: noise must be quasi-stationary, have some periodic nature and must be available for characterization in intervals where speech is not present.

Subjective evaluation with hearing-impaired persons demonstrated that this scheme is effective in increasing the intelligibility of corrupted speech for those noises with well-defined characteristics.
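
The weight-update scheme summarized above can be illustrated with a short sketch. This is not the thesis implementation: the update equation is not reproduced in this chapter, so the sketch assumes the textbook leaky normalized-LMS form w ← leak·w + μ·e·u / (ε + ‖u‖²) applied to an adaptive predictor, with adaptation frozen whenever a segment is flagged as speech. The function name, the parameters `taps`, `delay`, `mu`, `leak` and `eps`, and the externally supplied `is_silence` flags are all hypothetical.

```python
import math


def leaky_nlms_predictor(x, is_silence, taps=8, delay=1, mu=0.5, leak=0.999, eps=1e-6):
    """Adaptive predictor with a leaky, normalized LMS update (assumed form).

    x          -- input samples (speech plus additive noise)
    is_silence -- per-sample flags from a speech detector (assumed given)
    Returns (enhanced output, final weights).
    """
    w = [0.0] * taps
    out = []
    for n in range(len(x)):
        # Reference vector built from past samples only.
        u = [x[n - delay - k] if n - delay - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * uk for wk, uk in zip(w, u))  # predicted (periodic) noise
        e = x[n] - y                              # prediction error = enhanced output
        out.append(e)
        if is_silence[n]:
            # Normalized step size tracks input power; the leak factor pulls
            # the weights toward zero to counter ill-conditioned inputs.
            step = mu / (eps + sum(uk * uk for uk in u))
            w = [leak * wk + step * e * uk for wk, uk in zip(w, u)]
    return out, w


if __name__ == "__main__":
    # Purely periodic "noise" with no speech: the predictor should cancel most of it.
    noise = [math.sin(2 * math.pi * 0.05 * n) for n in range(2000)]
    enhanced, _ = leaky_nlms_predictor(noise, [True] * len(noise))
    p_in = sum(v * v for v in noise[-500:]) / 500
    p_out = sum(v * v for v in enhanced[-500:]) / 500
    print(f"input power {p_in:.3f}, residual power {p_out:.6f}")
```

In this toy case the input is a single sinusoid, which is exactly the kind of periodic component the predictor can model; a broadband or rapidly changing noise would leave a much larger residual, consistent with the limitations listed in Section 7.2.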
Specifically, there was a statistically significant increase in R-SPIN scores for filtered vacuum-cleaner noise, indicative of machinery or motor noises which have strong periodic components. However, little or no improvement was realized for babble or cafeteria noise. This is in agreement with other studies which have shown that single-input noise cancellation techniques are ineffective in removing babble noise without decreasing speech intelligibility [82].

Only a small study sample was available for the subjective evaluation, limiting the generalization of the results to the hearing-impaired population as a whole. Further clinical evaluation is necessary, with a larger sample base, since the limited size of the study sample precluded assessing the algorithm with respect to specific subject factors such as degree or type of hearing loss. It is interesting to note that although the enhanced speech did not have a statistically significant effect on R-SPIN scores in the babble-noise case, two of the tested individuals did show some improvement. This could be attributed either to experimental error, or to the fact that the speech enhancement algorithm may improve the intelligibility of speech only for individuals with a particular type of hearing loss.

One of the contributions made by this study is the precise definition of a clinical method of evaluating the effect of speech-enhancement algorithms on speech discrimination for the hearing impaired. In most other investigations, simple monosyllabic words have been used to evaluate intelligibility improvement and tests have been applied only to normal-hearing individuals. In this work evaluation was performed on hearing-impaired subjects using both high- and low-predictability sentence material from the R-SPIN tests in a background of noise.
It was felt that this presentation was more realistic, simulating an everyday communication environment.

7.4 Proposed Research

This research incorporated both the normalized- and leaky-LMS variants in the final speech-enhancement algorithm. Although often used in practice [34, 54, 56, 57], the convergence properties of the leaky-LMS variant are not well defined. In this work, a constant value for parameter γ was selected empirically from sample noise files. Intuitively it seems that there must be a way of adapting γ such that when the incoherent component of the noise input is small, γ would likewise be small. Similarly, when the incoherent component is large, γ would approach 1.0. Treichler [48] also suggests that it may be possible for γ to be allowed to be larger than 1.0 (α < 0), effectively reducing the power of the incoherent component. However, stability then becomes a problem and convergence time increases. An investigation of the convergence properties of both the leaky-LMS and normalized-leaky-LMS algorithms is one possible area of further research.

This work focused on an evaluation of adaptive noise cancellation as a technique for enhancing speech intelligibility for the hearing impaired. However, there are many other speech enhancement algorithms which have been suggested, such as Boll’s manipulation of the short-time spectral amplitude (SSSA) discussed in Section 2.1. Although in general these techniques have been tested with negligible success with respect to increasing discrimination scores with normal-hearing persons, this does not preclude more positive results with hearing-impaired individuals.

This particular implementation of the LMS algorithm is based in the time domain; however, frequency domain versions of the LMS algorithm are also practical using block transform methods [83, 84].
In the frequency domain it is possible to exploit additional aspects of speech perception. In particular, it has been suggested that the hearing-impaired individual’s audiogram, which varies significantly from the norm, may be somehow incorporated into the noise-cancellation process [36]. This idea has been studied by Peterson in relation to Boll’s algorithm with some success: spectral-subtraction techniques were used to suppress noise in a simulated perceptual domain that represents the frequency response of the ear [85].

A final area of possible research would be the incorporation of the algorithm into hearing assistive devices such as the hearing aid, which would require the development of custom integrated circuitry. For such non-telephone applications, the bandwidth is not restricted to a frequency range of 300-3400 Hz, therefore an increase in both sampling rate and upper cut-off frequency for the speech input is possible. Research has demonstrated that increasing signal bandwidth can have a significant effect on the intelligibility of transmitted speech for the hard of hearing [4]. The frequency range between 2500 and 6300 Hz has been shown to provide perceptual cues which have a significant effect on word recognition scores, and is particularly important for reducing consonant confusions in noise.

One problem, unique to hearing-aid wearers, results from the interaction between the telephone and hearing aid. Most hearing aids are equipped with a telecoil (T-coil) which magnetically couples the telephone output to the user’s aid. This minimizes the portion of noise from the surrounding environment that is heard directly, whereas the sidetone component is still transmitted.⁷ However, when a telephone is answered within approximately two feet of a computer or similar machine, the T-coil picks up radiated energy, mostly 60 Hz power, significantly deteriorating the resulting speech signal.
Telephone noise-reduction will not solve this problem.Sidetone refers to the portion of the speaker’s signal that is heard by the speaker himself at the receiver of the sametelephone.82References[1] R. Plomp, “Auditory handicap of hearing impairment and limited benefit of hearing aids,”JASA, vol. 63, PP. 533—549, Feb. 1978.[2] R. Schafer and L. Rabiner, “Digital representations of speech signals,” Proc. IEEE, vol. 63,pp. 662—677, Apr. 1975. Invited Paper.[3] J. Lim and A. Oppenheim, “Enhancement and bandwidth compression of noisy speech,”Proc. IEEE, vol. 67, pp. 1586—1604, Dec. 1979. Invited Paper.[4] 0. Schwartz and R. Surr, “High-pass and conventional high frequency hearing aids forlisteners with high-frequency sensorineural hearing loss,” in Auditory and Hearing ProstheticsResearch (V. Larson, ed.), pp. 313—328, Grune and Stratton, 1979.[5] B. Sharf and M. Florentine, “Psychoacoustics of elementary sounds,” in The VanderbiltHearing-Aid Report (G. Studebaker and F. Bess, eds.), pp. 3—15, Upper Darby,Pennsylvania: Monographs in Contemporary Audiology, 1982.[6] E.M. Danaher, J.J. Osberger and J.M. Pickett, “Discrimination of formant frequencytransitions in synthetic vowels,” Journal of Speech and Hearing Research, vol. 16, pp. 439—451, Sept. 1973.[7] P.M. Peterson et al., “Multi-microphone adaptive beamforming for interference reduction inhearing aids,” Journal of Rehabilitation Research and Development, vol. 24, pp. 103—110,FaIl 1987.[8] S. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans.Acoustics, Speech and Signal Processing, vol. ASSP-27, pp. 113—120, Apr. 1979.[9] J. Lim, “Evaluation of a correlation subtraction method for enhancing speech degraded byadditive white noise,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-26,pp. 471—472, Oct. 1978.[10] Y.M. Perlmutter, et al., “Evaluation of a speech enhancement system,” in Int. Conf. onAcoustics, Speech and Signal Processing, pp. 
212—215, IEEE, 1977.[11] J.S. Lim, A.V. Oppenheim and L.D. Braida, “Evaluation of an adaptive comb filtering methodfor enhancing speech degraded by white noise addition,” IEEE Trans. Acoustics, Speechand Signal Processing, vol. ASSP-26, pp. 354—358, Aug. 1978.[12] L. Young and J. Goodman, “The effects of peak clipping on speech intelligibility in thepresence of a competing message,” in Int. Conf. on Acoustics, Speech and SignalProcessing, pp. 21 6—21 8, IEEE, 1977.[13] M.R. Weiss, et al, “Study and development of the INTEL technique for improving speechintelligibility,” Report NSC-F/4023, NICOLET Scientific Corp., Dec. 1974.[14] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error shorttime spectral amplitude estimator,” IEEE Trans. Acoustics, Speech and Signal Processing,vol. ASSP-32, pp. 1109—1121, Dec. 1984.[15] V.C. Schields, Jr., “Separation of added speech signals by digital comb filtering,” S. M.Thesis, MIT, Cambridge, 1975.[16] R.H. Frazier, et al., “Enhancement of speech by adaptive filtering,” in mt. Conf. on Acoustics,Speech and Signal Processing, pp. 251—253, IEEE, 1976.83[17] T. Parsons, “Separation of speech from interfering speech by harmonic selection,” JASA,vol. 60, pp. 911—918, Oct. 1976.[18] I. Thomas and R. Niederjohn, “The intelligibility of filtered clipped speech in noise,” JAES,vol. 18, pp. 299—303, Jun 1970.[19] I. Thomas and A. Ravindran, “Intelligibility enhancement of already noisy speech signals,”JAES, vol. 22, pp. 234—236, May 1974.[20] J. Lim and A. Oppenheim, “All-pole modeling of degraded speech,” IEEE Trans. Acoustics,Speech and Signal Processing, vol. ASSP-26, pp. 197—210, Jun. 1978.[21] B. Musicus and J. Lim, “Maximum likelihood parameter estimation of noisy data,” in mt. Conf.on Acoustics, Speech and Signal Processing, pp. 224—227, IEEE, 1979.[22] A. Oppenheim and R. Schafer, “Homomorphic analysis of speech,” IEEE Trans. Audio andElectroacoustics, vol. AU-16, pp. 221—226, Jun. 1968.[23] J. 
Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, pp. 561 —580, Apr. 1975.[24] C. Teacher and D. Coulter, “Performance of LPC vocoders in a noisy environment,” in mt.Conf. on Acoustics, Speech and Signal Processing, pp. 216—219, IEEE, 1979.[25] B. Widrow, et al., “Adaptive noise cancelling: Principles and applications,” Proc. IEEE, vol. 63,pp. 1692—1716, Dec. 1975.[26] G. Kang and L. Fransen, “Experimentation with an adaptive noise-cancellation filter,” IEEETrans. Circuits and Systems, vol. CAS-34, pp. 753—758, Jul. 1987.[27] W.A. Harrison, J.S. Lim and E. Singer, “A new application of adaptive noise cancellation,”IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-34, pp. 21—27, Feb. 1986.[28] S. Vaseghi and P. Rayner, “The effects of non-stationary signal characteristics on theperformance of adaptive audio restoration systems,” in mt. Conf. on Acoustics, Speechand Signal Processing, pp. 377—380, IEEE, 1989.[29] M.Sondhi, “An adaptive echo canceller,” BSTJ, vol. 46, pp. 497—511, Mar. 1967.[30] T. Schwander and H. Levitt, “Effect of two-microphone noise reduction on speech recognitionby normal hearing listeners,” Journal of Rehabilitation Research and Development, vol. 24,pp. 87—92, Fall 1987.[31] M. Weiss, “Use of an adaptive noise canceler as an input preprocessor for a hearing aid,”Journal of Rehabilitation Research and Development, vol. 24, pp. 93—102, Fall 1987.[32] D. Chazan, Y. Medan and U. Shivadron, “Noise cancellation for hearing aids,” IEEE Trans.Acoustics, Speech and Signal Processing, vol. ASSP-36, pp. 1697—1705, Nov. 1988.[33] D. Graupe, J.K. Grosspietsch and S.P. Basseas, “A single-microphone-based self-adaptivefilter of noise from speech and its performance evaluation,” Journal of RehabilitationResearch and Development, vol. 24, pp. 119-126, Fall 1987.[34] B. Widrow and S. Stearns, Adaptive Signal Processing. New Jersey: Prentice-Hall, 1985.[35] M. Sambur, “Adaptive noise canceling for speech signals,” IEEE Trans. 
Acoustics, Speechand Signal Processing, vol. ASS P-26, pp. 419—423, Oct. 1978.84[36] D.M. Chabries, et al., “Application of adaptive digital signal processing to speech enhancement for the hearing impaired,” Journal of Rehabilltation Research and Development, vol. 24,pp. 65—74, Fall 1987.[37] L. Hoy, et al., “Noise suppression methods for speech applications,” in mt. Conf. onAcoustics, Speech and Signal Processing, pp. 1133—1136, IEEE, 1983.[38] B. Widrow, et aL, “Stationary and nonstationary learning characteristics of the LMS adaptivefilter,” Proc. IEEE, vol. 64, pp. 1151—1162, Aug. 1976.[39] B. Widrow and E. Walach, “On the statistical efficiency of the LMS algorithm,” IEEE Trans.Information Theory, vol. IT-30, pp. 211—221, Mar. 1980.[40] J. Foley, “Comparison between steepest descent and LMS algorithms,” lEE Proceedings,Part F, vol. 134, pp. 283—289, Jun. 1987.[41] W. Gardner, “Nonstationary learning effects of the LMS algorithm,” IEEE Trans. Circuits andSystems, vol. CAS-34, pp. 1199—i 207, Oct. 1987.[42] J. T. Rickard and J. Zeidler, “Second-order output statistics of the adaptive line enhancer,”IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-27, pp. 31—39, Feb. 1979.[43] L. Horowitz and K. Senne, “Performance advantage of complex LMS for controlling narrow-band adaptive arrays,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-29,pp. 722—735, Jun. 1981.[44] A. Feuer and E. Weinstein, “Convergence analysis of LMS filters with uncorrelated gaussiandata,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-33, pp. 222—229,Feb. 1985.[45] A. Krieger and E. Masry, “Convergence analysis of adaptive linear estimation for dependentstationary processes,” IEEE Trans. Information Theory, vol. 34, pp. 642—654, Jul. 1988.[46] J. Kim and L. Davisson, “Adaptive linear estimation for stationary M-dependent processes,”IEEE Trans. Information Theory, vol. lT-21, pp. 23—31, Jan. 1975.[47] T. 
Daniell, “Adaptive estimation with mutually correlated training sequences,” IEEE Trans.Systems Science and Cybernetics, vol. SSC-6, pp. 12—19, Jan. 1970.[48] J. Treichier, ‘Transient and convergent behavior of the adaptive line enhancer,” IEEE Trans.Acoustics, Speech and Signal Processing, vol. ASSP-27, pp. 53—62, Feb. 1979.[49] J. Chao, H. Perez and S. Tsujii, “A fast adaptive filter algorithm using eigenvalue reciprocalsas stepsizes,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 38, pp. 1343—1351, Aug. 1990.[50] W.B. Mikhael et al., “Adaptive filters with individual adaptation of parameters,” IEEE Trans.Circuits and Systems, vol. CAS-33, pp. 677—685, Jul. 1986.[51] R.W. Harris, D.M. Chabries and F.A. Bishop, “A variable step (VS) adaptive filter algorithm,”IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-34, pp. 309—316, Apr.1986.[52] M. Tarrab and A. Feuer, “Convergence and performance analysis of the normalized LMSalgorithm with uncorrelated gaussian data,” IEEE Trans. Information Theory, vol. 34, pp. 680—691, Jul. 1988.85[53] S. Stearns and R. David, Signal Processing Algorithms, ch. 12, pp. 254—262. New Jersey:Prentice-Hall, 1985.[54] J. Cioffi, “Limited-precision effects in adaptive filtering,” IEEE Trans. Circuits and Systems,vol. CAS-34, pp. 821—833, Jul. 1987.[55] K. Perry, “A distributed-u implementation of the LMS algorithm,” IEEE Trans. Acoustics,Speech and Signal Processing, vol. ASSP-29, pp. 753—762, Jun. 1981.[56] R.D. Gitlin, A.C. Meadors Jr. and S.B Weinstein, ‘The tap-leakage algorithm: an algorithmfor the stable operation of a digitally implemented, fractionally spaced adaptive equalizer,”BSTJ, vol. 61, pp. 1817—1837, Oct. 1982.[57] G. Ungerboeck, “Fractional tap-spacing equalizer and consequences for clock recovery indata modems,” IEEE Trans. Communications, vol. COM-24, pp. 856—864, Aug. 1976.[58] Y. Kaneda and J. Ohga, “Adaptive microphone array system for noise reduction,” IEEETrans. 
Acoustics, Speech and Signal Processing, vol. ASSP-34, pp. 1391–1400, Dec. 1986.
[59] B. Atal and L. Rabiner, “A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-24, pp. 201–212, Jun. 1976.
[60] L.F. Lamel, et al., “An improved endpoint detector for isolated word recognition,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-29, pp. 777–785, Aug. 1981.
[61] P. de Souza, “A statistical approach to the design of an adaptive self-normalizing silence detector,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-31, pp. 678–684, Jun. 1983.
[62] L. Rabiner and M. Sambur, “An algorithm for determining the endpoints of isolated utterances,” BSTJ, pp. 297–315, Feb. 1975.
[63] P.G. Drago, A.M. Molinari and F.C. Vagliani, “Digital dynamic speech detectors,” IEEE Trans. Communications, vol. COM-26, pp. 141–145, Jan. 1978.
[64] H. Lee and C. Un, “A study of on-off characteristics of conversational speech,” IEEE Trans. Communications, vol. COM-34, pp. 630–637, Jun. 1986.
[65] C. Gan and R. Donaldson, “Adaptive silence deletion for speech storage and voice mail applications,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-36, pp. 925–927, Jun. 1988.
[66] C. Gan, “Efficient speech storage via compression of silence periods,” Master’s thesis, University of British Columbia, Vancouver, Dec. 1984.
[67] C. Rose and R. Donaldson, “Real-time implementation and evaluation of an adaptive silence deletion algorithm for speech compression,” in Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 461–468, IEEE, May 1991.
[68] C. Rose, “Real-time implementation and evaluation of an adaptive silence deletion algorithm for speech compression,” Master’s thesis, University of British Columbia, Vancouver, Aug. 1991.
[69] I. Lecomte, et al., “Car noise processing for speech input,” in Int. Conf.
on Acoustics, Speech and Signal Processing, pp. 512–515, IEEE, 1985.
[70] T. Tillman and R. Carhart, “An expanded test for speech discrimination utilizing CNC monosyllabic words,” Northwestern University Auditory Test No. 6, USAF School of Aerospace Medicine Technical Report, Brooks Air Force Base, Texas, 1966.
[71] T. Parsons, Voice and Speech Processing, ch. 13, pp. 345–364. New York: McGraw-Hill, 1987.
[72] A.S. House et al., “Articulation testing methods: Consonantal differentiation with a closed response set,” JASA, vol. 37, pp. 158–166, Jan. 1965.
[73] R. Bilger, Manual for the Clinical Use of the Revised Spin Test. Department of Speech and Hearing Science, University of Illinois, 1984.
[74] D.N. Kalikow, K.N. Stevens and L.L. Elliot, “Development of a test of speech intelligibility in noise using sentence materials with controlled word predictability,” JASA, vol. 61, pp. 1337–1351, 1977.
[75] R.C. Bilger, et al., “Standardization of a test of speech perception in noise,” Journal of Speech and Hearing Research, vol. 27, pp. 32–48, Mar. 1984.
[76] L. Hanusaik, “An evaluation of user performance with inductive coupling of hearing aids and telephone receivers incorporating receiver amplification,” Master’s thesis, University of British Columbia, Vancouver, Apr. 1991.
[77] A. Holmes and T. Frank, “Telephone listening ability for hearing-impaired,” Ear and Hearing, vol. 5, pp. 96–100, Mar.-Apr. 1984.
[78] American Speech and Hearing Association, “Guidelines for manual pure-tone audiometry,” Asha, vol. 20, pp. 297–301, Apr. 1978.
[79] T. Hemeyer, “Speech protocols in determining dynamic range,” in Speech Protocols in Audiology (R. Rupp and K. Stockdell, eds.), New York: Grune and Stratton, 1980.
[80] B. Winer, Statistical Principles in Experimental Design, ch. 6, pp. 290–291. New York: McGraw-Hill, 1962.
[81] L. Wilkinson, SYSTAT: The System for Statistics, ch. 3, pp. 59–63. Evanston, Il.: SYSTAT Inc., 1990.
[82] J. Lim, “Speech enhancement,” in Int. Conf.
on Acoustics, Speech and Signal Processing, pp. 3135–3142, IEEE, 1986.
[83] M. Amin, “Adaptive noise cancelling in the spectrum domain,” in Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2959–2962, IEEE, 1986.
[84] D. Mansour and A.H. Gray, Jr., “Unconstrained frequency-domain adaptive filter,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-30, pp. 726–734, Oct. 1982.
[85] T. Petersen and S. Boll, “Acoustic noise suppression in the context of a perceptual model,” in Int. Conf. on Acoustics, Speech and Signal Processing, pp. 1086–1088, IEEE, 1982.

Appendix A Alpha Dependence

This quantitative analysis was presented in research by Kaneda and Ohga concerning adaptive noise reduction in multi-sensor arrays [58]:

Assume two factors $\alpha_1$ and $\alpha_2$ where $\alpha_1 > \alpha_2 > 0$. The weight vector that minimizes the cost function,

$$C_1 = \alpha_1 D_1(\alpha_1) + D_2(\alpha_1),$$ (A.82)

is $W(\alpha_1)$. Similarly, the weight vector $W(\alpha_2)$ minimizes the cost function,

$$C_2 = \alpha_2 D_1(\alpha_2) + D_2(\alpha_2).$$ (A.83)

If $W(\alpha_1)$ minimizes $C_1$, the cost can only increase using weight vector $W(\alpha_2)$. In explicit terms,

$$\alpha_1 D_1(\alpha_2) + D_2(\alpha_2) \ge \alpha_1 D_1(\alpha_1) + D_2(\alpha_1).$$ (A.84)

Similarly, if $W(\alpha_2)$ minimizes $C_2$, the cost can only increase using weight vector $W(\alpha_1)$,

$$\alpha_2 D_1(\alpha_1) + D_2(\alpha_1) \ge \alpha_2 D_1(\alpha_2) + D_2(\alpha_2).$$ (A.85)

Rearranging these equations gives,

$$\alpha_1 \left( D_1(\alpha_2) - D_1(\alpha_1) \right) \ge D_2(\alpha_1) - D_2(\alpha_2)$$ (A.86)

$$D_2(\alpha_1) - D_2(\alpha_2) \ge \alpha_2 \left( D_1(\alpha_2) - D_1(\alpha_1) \right).$$ (A.87)

Combining the above relations produces,

$$\alpha_1 \left( D_1(\alpha_2) - D_1(\alpha_1) \right) \ge \alpha_2 \left( D_1(\alpha_2) - D_1(\alpha_1) \right).$$ (A.88)

Since $\alpha_1 > \alpha_2 > 0$, it follows that $D_1(\alpha_2) - D_1(\alpha_1) \ge 0$ or,

$$D_1(\alpha_2) \ge D_1(\alpha_1).$$ (A.89)

Therefore $D_1$ increases as $\alpha$ decreases. In addition, from equations (A.86) and (A.87), it follows that the right side of (A.86) must be non-negative, therefore $D_2(\alpha_1) - D_2(\alpha_2) \ge 0$ and,

$$D_2(\alpha_1) \ge D_2(\alpha_2).$$
(A.90)

Hence $D_2$ increases as $\alpha$ increases.

Appendix B Speech Threshold Tests

Figure 26 Basic Speech-Detection Algorithm, Computer-Generated Noise, Condition (2) [plot; horizontal axis: ZHigh = ZLow = Zcrit]

Figure 27 Basic Speech-Detection Algorithm, Computer-Generated Noise, Condition (3) [plot; horizontal axis: ZHigh = ZLow = Zcrit]

Figure 28 Basic Speech-Detection Algorithm, Computer-Generated Noise, Condition (4) [plot; horizontal axis: ZHigh = ZLow = Zcrit]

Appendix C Selected C Code Fragments

spmain.c    real-time main control program for speech enhancement algorithm
spmain2.c   non-real-time control program for speech enhancement algorithm
filter.h    constant definition file
speech.c    speech detection algorithm subroutines: init_spch_var, speech_detect
lms.c       normalized-leaky-LMS subroutines: init_lms_var, update_mu, update_filter, update_taps, filterc

/* spmain.c
 *   - real-time main speech enhancement program
 *   - interrupt-driven
 *   - if flag FILT set then output is enhanced
 *   - continues until flag DONE is set
 */
#include "filter.h"
#ifdef dsp56k
#include "dsp.h"                /* if 56000 version */
#else
#include "dspf.h"               /* if PC version */
#endif

dspint  x_in;                   /* input xk */
dspint  x_out;                  /* output */
dspint  status;                 /* flag set on A/D interrupt */
dspint  done;                   /* flag set to stop processing */
dspint  filt;                   /* flag set if filter is on */
dspint  silence;                /* flag set if silence state */
dspint  index;                  /* index to count data */
dspintf gammax;                 /* GAMMA/2 */
dspintf mfactx;                 /* (2*BETA*ek)/((NTAPS+1)*sigma)/2 */
dspintf *xstart;                /* pointer to start of buffer x_input */
dspintf filt_out;               /* output of filter - yk */
dspintf errx;                   /* errx = x_in - filt_out (ek = xk - yk) */
extern dspintf Wcoef[];         /* coefficient vector - Wk */
extern dspintf xinput[];        /* circular input vector Xk-1 */
extern dspint  buffer[];        /* data transfer buffer to PC */

void initdsp1();
void initdsp2();
void stopdsp();
void init_lms_var();
void init_spch_var();
dspint speech_detect();
dspintf update_mu(dspintf, dspintf);
void update_filter(dspintf, dspintf);
void update_taps();
void main();

void main()
{
    initdsp1();                         /* initialize DSP board */
    init_lms_var();                     /* initialize vars. for filter */
    init_spch_var();                    /* initialize vars. for speech det. */
    index = 0;
    initdsp2();                         /* initialize DSP board interrupts */
    set_ready_status();                 /* clear interrupts */
    get_done_flag();                    /* check if done */

    while (!done) {
        wait_data();                    /* wait for interrupt */
        clearbits();                    /* clear low 12-bits of 24-bit word */
        ++index;
        silence = speech_detect();      /* determine silence state */
        filter(xstart, Wcoef, NTAPS);   /* filt_out = Wk*Xk-1 */
        cal_err(x_in, filt_out);        /* errx = x_in - filt_out */
        if (silence) {
            mfactx = update_mu(x_in, errx);   /* 2*BETA*ek/((NTAPS+1)*sigma)/2 */
            update_filter(mfactx, gammax);    /* Wk+1 = 2(gammax*Wk + mfactx*Xk-1) */
        }
        x_out = errx;
        update_taps();                  /* xk -> xk-1 */
        setupx();
        check_filt_flag();              /* if enhancement turned off */
        if (!filt) x_out = x_in;        /* do not send filtered data */
        send_data(x_out);               /* output data */
        get_done_flag();                /* check if done */
        set_ready_status();             /* clear interrupts */
    }
    stopdsp();
}

/* spmain2.c - non real-time main LMS program
 *   - waits for block from PC
 *   - filters data in block and signals PC
 *   - continues until flag DONE is set
 */
#include "filter.h"
#ifdef dsp56k
#include "dsp.h"                /* if 56000 version */
#else
#include "dspf.h"               /* if PC version */
#endif

dspint  x_in;                   /* input xk */
dspint  x_out;                  /* output */
dspint  status;                 /* flag set on A/D interrupt */
dspint  done;                   /* flag set to stop processing */
dspint  filt;                   /* flag set if filter is on */
dspint  silence;                /* flag set if silence state */
dspint  index;                  /* index to count data */
dspintf gammax;                 /* GAMMA/2 */
dspintf mfactx;                 /* (2*BETA*ek)/((NTAPS+1)*sigma)/2 */
dspintf *xstart;                /* pointer to start of buffer x_input */
dspintf filt_out;               /* output of filter - yk */
dspintf errx;                   /* errx = x_in - filt_out (ek = xk - yk) */
extern dspintf Wcoef[];         /* coefficient vector - Wk */
extern dspintf xinput[];        /* circular input vector Xk-1 */
extern dspint  buffer[];        /* data transfer buffer to PC */

void initdsp1();
void initdsp2();
void stopdsp();
void init_lms_var();
void init_spch_var();
dspint speech_detect();
dspintf update_mu(dspintf, dspintf);
void update_filter(dspintf, dspintf);
void update_taps();
void main();

void main()
{
    initdsp1();                         /* initialize DSP board */
    init_lms_var();                     /* initialize vars. for filter */
    init_spch_var();                    /* initialize vars. for speech det. */
    clear_processing();
    signal_PC();                        /* notify PC we're ready */
    get_done_flag();                    /* check if done */

    while (!done) {
        wait_ready_set();               /* wait for PC handshake */
        set_processing();               /* tell PC we're busy */
        wait_ready_off();               /* wait for PC handshake off */
        for (index = 0; index < BLOCKSIZE; ++index) {
            x_in = buffer[index];           /* get data */
            clearbits();                    /* clear lower 12-bits of 24-bit word */
            silence = speech_detect();      /* determine silence state */
            filter(xstart, Wcoef, NTAPS);   /* filt_out = Wk*Xk-1 */
            cal_err(x_in, filt_out);        /* errx = x_in - filt_out */
            if (silence) {
                mfactx = update_mu(x_in, errx);   /* 2*BETA*ek/((NTAPS+1)*sigma)/2 */
                update_filter(mfactx, gammax);    /* Wk+1 = 2(gammax*Wk + mfactx*Xk-1) */
            }
            buffer[index] = errx;           /* output data */
            update_taps();                  /* xk -> xk-1 */
            setupx();
        }
        clear_processing();             /* notify PC we're ready */
        get_done_flag();                /* check if done */
    }
    stopdsp();
}

/* filter.h   Filter constant definition file */
#define NTAPS           31
#define BLOCKSIZE       1024
#define TRUE            1
#define FALSE           0
#define ONE             0x7FFFFF    /* 1.0 */
#define SILENCE         TRUE
#define SPEECH          FALSE
#define ZHIGH           .3929065    /* NORMALIZED (zhi*zhi)/n-1 */
#define ZLOW            .0725806    /* NORMALIZED (zlo*zlo)/n-1 */
#define ZCRIT           .0322581    /* NORMALIZED (zcr*zcr)/n-1 */
#define SPCH_AVG_INIT   0x000E10    /* 225/2048  9/15 Mixed Format */
#define FRAME           32          /* Frame size for averages */
#define FRAME_2         5           /* Log2(WINDOW) */
#define SIL_AVG_LEN     32          /* 1024/WINDOW = 0.128s @ 8kHz */
#define SIL_AVG_LEN_2   5           /* Log2(SIL_AVG_LEN) */
#define SPCH_AVG_LEN    256         /* 8096/WINDOW = 1.012s @ 8kHz */
#define SPCH_AVG_LEN_2  8           /* Log2(SPCH_AVG_LEN) */
#define MINSIL          3           /* Minimum # silence frames */
#define HANGOVER        1           /* # of frames for hangover */
#define BETA            0.025       /* Convergence factor */
#define MUFACT          0.0015625   /* (2*BETA)/(NTAPS+1) */
#define NU              0.0125      /* 1/Time Const. for average */
#define ONE_NU          0.9875      /* 1-NU */
#define SIG_INIT        0x000490    /* init power est. 0.00014 */
#define SIG_MIN         0x00000A    /* Minimum Sigma 0.000001 */
#define GAMMA           0.99995/2   /* Gamma in 2/22 Mixed Format */
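The constants in filter.h parameterize the normalized leaky LMS update described in the lms.c comments: sigma = (1-NU)*sigma + NU*xk*xk, floored at SIG_MIN, and Wk+1 = GAMMA*Wk + (2*BETA*ek/((NTAPS+1)*sigma))*Xk-1. The following is a minimal floating-point sketch of one such step, not the thesis implementation. It assumes double arithmetic in place of the 24-bit fractional format, so the factor-of-two halving of GAMMA and mfact and the 8388608 scaling are dropped. The names nlms_state, nlms_init, nlms_step and nlms_demo_ratio, the _F-suffixed constants and the synthetic test tone are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NTAPS       31         /* filter order, as in filter.h */
#define BETA        0.025      /* convergence factor, as in filter.h */
#define NU          0.0125     /* 1/time constant of the power average */
#define GAMMA_F     0.99995    /* leakage factor (assumed: filter.h stores it halved) */
#define SIG_INIT_F  0.00014    /* initial power estimate as a real value */
#define SIG_MIN_F   0.000001   /* floor on the power estimate */

typedef struct {               /* hypothetical container for the filter state */
    double w[NTAPS + 1];       /* coefficient vector Wk */
    double x[NTAPS + 1];       /* past inputs Xk-1, newest first */
    double sigma;              /* running estimate of the input power */
} nlms_state;

void nlms_init(nlms_state *s)
{
    int i;
    assert(s != NULL);
    for (i = 0; i <= NTAPS; ++i) {
        s->w[i] = 0.0;
        s->x[i] = 0.0;
    }
    s->sigma = SIG_INIT_F;
}

/* Process one sample and return ek = xk - yk; weights adapt only if adapt != 0. */
double nlms_step(nlms_state *s, double xk, int adapt)
{
    double yk = 0.0, ek, mu;
    int i;
    assert(s != NULL);
    for (i = 0; i <= NTAPS; ++i)            /* yk = Wk * Xk-1 */
        yk += s->w[i] * s->x[i];
    ek = xk - yk;
    if (adapt) {
        s->sigma = (1.0 - NU) * s->sigma + NU * xk * xk;
        if (s->sigma < SIG_MIN_F)           /* avoid division by zero */
            s->sigma = SIG_MIN_F;
        mu = 2.0 * BETA * ek / ((NTAPS + 1) * s->sigma);
        for (i = 0; i <= NTAPS; ++i)        /* leaky weight update */
            s->w[i] = GAMMA_F * s->w[i] + mu * s->x[i];
    }
    for (i = NTAPS; i > 0; --i)             /* Xk -> Xk-1 */
        s->x[i] = s->x[i - 1];
    s->x[0] = xk;
    return ek;
}

/* Illustration only: ratio of late to early squared error for a synthetic tone. */
double nlms_demo_ratio(void)
{
    nlms_state s;
    double prev2 = 0.0, prev1 = 0.1, early = 0.0, late = 0.0;
    int k;
    nlms_init(&s);
    for (k = 0; k < 4000; ++k) {
        double xk = 1.9 * prev1 - prev2;    /* two-term sinusoid recurrence */
        double ek = nlms_step(&s, xk, 1);
        prev2 = prev1;
        prev1 = xk;
        if (k < 200)   early += ek * ek;
        if (k >= 3800) late  += ek * ek;
    }
    return late / early;
}
```

Calling nlms_demo_ratio() feeds a periodic signal with adaptation always enabled; because such a signal is predictable from its own past samples, the squared error in the last 200 samples should be a small fraction of that in the first 200. In the listings, adaptation is instead gated by the silence flag returned by speech_detect.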
/*
 * speech.c   C implementation of ROSE'S speech detection algorithm
 *            in subroutine form
 *
 * NOTE: Both the A/D and D/A are 12-bit. The data is left-justified
 *       in the 24-bit word of the 56000. The 56000 treats all numbers
 *       as fractions between +/-1. An attempt is made to retain
 *       an equivalent representation in the C version of the algorithm
 *       so that direct comparisons can be made to the 56000 assembly
 *       code implementation.
 */
#include "filter.h"
#include "dspf.h"
#include <math.h>

extern dspint  x_in;            /* input xk */
extern dspintf sil[];           /* buffer of SIL_AVG_LEN most recent silence averages */
extern dspintf spch[];          /* buffer of SPCH_AVG_LEN most recent speech averages */
extern dspintl squsil[];        /* buffer of SIL_AVG_LEN most recent squared silence averages */

dspintf sil_avg;                /* current silence average */
dspintf spch_avg;               /* current speech average */
dspintf zhigh;                  /* high threshold */
dspintf zlow;                   /* low threshold */
dspintf zcrit;                  /* critical threshold */
dspintl e_thresh;               /* energy threshold for silence class. */
dspintl sil_squ;                /* sum of squared silence averages */
dspintl sil_var;                /* current silence variance */

static dspint  sil_state;       /* flag set if silence state */
static dspint  first_sil;       /* count to allow initialization of sil_avg and sil_var */
static dspint  count;           /* index to count frames */
static dspint  sil_len;         /* count for sil. interval length */
static dspintf sum_mag;         /* sum of FRAME amplitude values */
static dspintf sil_tot;         /* sum of silence averages */
static dspintf spch_tot;        /* sum of speech averages */
static dspintf *sil_start;      /* pointer to start of sil buffer */
static dspintf *spch_start;     /* pointer to start of spch buffer */
static dspintl *squ_start;      /* pointer to start of squ_sil buf. */

void init_spch_var();
dspint speech_detect();

/*
 * init_spch_var - initializes speech detection variables
 */
void init_spch_var()
{
    static dspint i;

    count = 1;
    first_sil = 0;
    sil_state = SILENCE;
    sil_state = SPEECH;         /* FOR TEST PURPOSES ONLY */
    sil_len = 0;
    sil_start = sil;
    spch_start = spch;
    squ_start = squsil;
    sum_mag = 0;
    sil_tot = 0;
    sil_squ = 0;
    spch_tot = (double)SPCH_AVG_INIT * SPCH_AVG_LEN;
    spch_avg = (double)FRAME * SPCH_AVG_INIT;
    for (i = 0; i < SPCH_AVG_LEN; ++i)
        spch[i] = SPCH_AVG_INIT;
}

/*
 * speech_detect - detects speech
 *   - returns silence TRUE/FALSE
 *   - see flowchart for algorithm details
 *
 * NOTE: Variable x_in is a 12-bit number left justified in 24-bit
 * word. To get actual +/-1 fractional representation, divide
 * by 2**23 or 8388608.
 *
 * All values are then shifted right by 8 bits - we can hence
 * sum at least 256 values without fear of overflow. However
 * if FRAME size is decreased below 32 then there is the
 * possibility that speech_avg will overflow and this would
 * have to change.
 *
 * Therefore to obtain fractional representation for variables:
 * x_value, avg_mag, sum_mag, sil_tot and spch_tot, as well as
 * data in buffers sil and spch, divide by 8388608/256 = 32768.
 *
 * Variables: sil_avg, spch_avg, e_avg, and diff have
 * implicitly been multiplied by FRAME to retain more
 * significant bits. Likewise squared values (long): sil_squ,
 * e_thresh, e_crit, sil_var and data in buffer
 * squsil are implicitly multiplied by FRAME*FRAME. Therefore
 * to obtain fractional representations, divide by 32768*FRAME
 * and 32768*32768*FRAME*FRAME respectively.
 *
 * In a fixed-point implementation, numbers less than -8388608
 * and greater than 8388607 would overflow. Numbers less than
 * +/-1 would truncate to zero. Therefore the accuracy of the
 * individual variables assuming a 24-bit fixed representation
 * for short integers, and a 48-bit representation for long
 * integers, is noted below where |x| indicates significant bits.
 *
 *   x_value    |.8|11|4|
 *   sum_mag    |.3|16|4|
 *   avg_mag    |.8|15|     lose 1 bit of significance
 *   squ_mag    |.6|32|9|
 *   sil_squ    |.1|37|9|
 *   sil_tot    |.3|20|
 *   sil_avg    |.3|20|
 *   sil_var    |.1|37|9|
 *   spch_tot   |.|23|
 *   spch_avg   |.3|20|     lose 3 bits of significance
 *   diff       |.3|20|
 *   diff*diff  |.6|40|1|
 *   e_crit     |.1|46|     lose 17 bits of significance
 *   e_thresh   |.1|46|     lose 17 bits of significance
 */
dspint speech_detect()
{
    static dspintf zlev;        /* threshold mult. zhigh or zlow */
    static dspintf e_avg;       /* max of spch_avg or sil_avg */
    dspintf diff;               /* diff between avg_mag and e_avg */
    dspintf x_value;            /* right-shifted abs. val. of x_in */
    dspintf avg_mag;            /* average of FRAME most recent x_values */
    dspintl squ_mag;            /* squared avg_mag */
    dspintl e_crit;             /* critical energy threshold */
    dspint i;                   /* index for averaging */

    x_value = labs(x_in >> 8);          /* Shift right to avoid overflow */
    sum_mag += x_value;                 /* Sum FRAME amplitude values */
    if (count < FRAME)
        ++count;
    else {
        count = 1;
        avg_mag = sum_mag / FRAME;      /* Calculate average magnitude */
        squ_mag = sum_mag * sum_mag;    /* and squared magnitude */
        if (first_sil < SIL_AVG_LEN) {  /* Assume first 128 ms are */
            sil[first_sil] = avg_mag;   /* silence and calculate */
            sil_tot += avg_mag;         /* statistics */
            squsil[first_sil] = squ_mag;
            sil_squ += squ_mag;
            first_sil++;
            if (first_sil == SIL_AVG_LEN) {
                sil_avg = sil_tot * (double)FRAME / SIL_AVG_LEN;
                sil_var = sil_squ - sil_tot * sil_tot * (FRAME * FRAME / SIL_AVG_LEN);
                zlev = zhigh;
                sil_state = SILENCE;
            }
        }
        else {
            diff = sum_mag - e_avg;
            if ((diff > 0) && (diff * diff > e_thresh)) {
                sil_state = SPEECH;
                sil_len = 0;
                zlev = zlow;
                spch_tot += avg_mag - *spch_start;
                *spch_start = avg_mag;
                spch_avg = spch_tot * (double)FRAME / SPCH_AVG_LEN;
                if ((dspint)(++spch_start - spch) == SPCH_AVG_LEN)
                    spch_start = spch;
            }
            else {
                sil_len++;
                if (sil_len >= MINSIL) {
                    sil_state = SILENCE;
                    if (sil_len >= HANGOVER)
                        zlev = zhigh;
                    e_crit = zcrit * sil_var;
                    diff = sum_mag - sil_avg;
                    if ((diff < 0) || (diff * diff < e_crit)) {
                        sil_tot += avg_mag - *sil_start;
                        *sil_start = avg_mag;
                        sil_avg = sil_tot * (double)FRAME / SIL_AVG_LEN;
                        if ((dspint)(++sil_start - sil) == SIL_AVG_LEN)
                            sil_start = sil;
                        sil_squ += squ_mag - *squ_start;
                        *squ_start = squ_mag;
                        sil_var = sil_squ - sil_tot * sil_tot * (FRAME * FRAME / SIL_AVG_LEN);
                        if ((dspint)(++squ_start - squsil) == SIL_AVG_LEN)
                            squ_start = squsil;
                    }
                }
            }
        }
        e_avg = max(sil_avg, spch_avg / 13);
        e_thresh = zlev * sil_var;
        sum_mag = 0;
    }
    return (sil_state);
}

/* lms.c   C implementation of LMS algorithm */
#include "filter.h"
#include "dspf.h"
#include <math.h>

extern dspint  x_in;            /* input xk */
extern dspintf Wcoef[];         /* coefficient vector - Wk */
extern dspintf xinput[];        /* circular input vector Xk-1 */
extern dspintf *xstart;         /* pointer to start of buffer x_input */
dspintf sigma;                  /* average power estimate */
dspintf mufact;                 /* (2*BETA)/(NTAPS+1) */
dspintl muf;                    /* (2*BETA)/((NTAPS+1)*sigma) */

void init_lms_var();
dspintf update_mu(dspintf, dspintf);
void update_filter(dspintf, dspintf);
void update_taps();

/*
 * init_lms_var - initialize LMS filter variables
 */
void init_lms_var()
{
    static dspint i;

    xstart = xinput;
    for (i = 0; i < NTAPS + 1; ++i) {
        Wcoef[i] = 0;
        xinput[i] = 0;
    }
    sigma = SIG_INIT;
}

/*
 * update_mu - calculate convergence factor
 *   - update sigma = (1-NU)*sigma + NU*xk*xk
 *     (NOTE: sigma not allowed to go below SIG_MIN to avoid
 *     division by zero)
 *   - the initial value for sigma is set to SIG_INIT in init_lms_var
 *   - mufact = 2*BETA/(NTAPS+1)
 *   - mfact = mufact*err/sigma/2
 *   - mfact is 2-bit integer/22-bit fraction (I.E. divided by two)
 *     to allow a larger value to be represented
 *   - muf = mufact/sigma calculated for interest's sake
 *   - note that xk, sigma are stored as values between +/-8388608
 *     (2**23). To obtain actual fractional value between +/-1
 *     must divide by 8388608.
 *   - when multiplying or dividing two numbers like xvar (xk) or err
 *     and sigma there is an implicit factor of 8388608 involved
 *     which must be accounted for as seen in the following code.
 */
dspintf update_mu(xvar, err)
dspintf xvar, err;
{
    dspintf mfact;

    xvar = (xvar * xvar) / 8388608.;
    mfact = (dspintf)ONE_NU * sigma;
    sigma = mfact + (dspintf)NU * xvar;
    if (sigma < SIG_MIN) sigma = SIG_MIN;
    mfact = (mufact * err * 8388608.) / (2. * sigma);
    /* TRUNCATION SIMULATION
    if (mfact > (dspintf)ONE) mfact = (dspintf)ONE;
    else if (mfact < -(dspintf)ONE) mfact = -(dspintf)ONE;
    */
    /* For interest's sake lets calculate muf */
    muf = (mufact * 8388608.) / sigma;
    return (mfact);
}

/*
 * update_filter - calculate lms filter coefficients from noise-added
 *                 data
 *   - Xk-1, Wcoef are stored as values between +/-8388608
 *     (2**23). To obtain actual fractional value between
 *     +/-1 must divide by 8388608.
 *   - when multiplying or dividing two numbers like gamma
 *     and Wcoef[i] or mfact and xinput there is an implicit
 *     8388608 involved which must be accounted for as seen
 *     in the following code.
 *   - mfact, gamma are 2-bit integer/22-bit fraction
 *     (I.E. divided by two) to allow a larger value
 *     to be represented; therefore a multiplication by
 *     two is required in weight update equation
 *   - Wk+1 = 2*(gamma*Wk + mfact*Xk-1)
 */
void update_filter(mfact, gamma)
dspintf mfact, gamma;
{
    dspint i;

    for (i = 0; i < NTAPS + 1; ++i) {
        Wcoef[i] = (gamma * Wcoef[i] + mfact * xinput[i]) / 4194304.;
    }
}

/*
 * update_taps - update filter taps
 *   - Xk -> Xk-1
 *   - Xk-1, Wcoef are stored as values between +/-8388608
 *     (2**23). To obtain actual fractional value between
 *     +/-1 must divide by 8388608.
 */
void update_taps()
{
    dspint i;

    for (i = NTAPS; i > 0; --i)
        xinput[i] = xinput[i - 1];
}

extern dspintf filt_out;
void filterc(dspintf *, dspintf *, dspint);

/*
 * filterc - perform FIR filter
 *   - filt_out = Wcoef*xinput (yk = Wk*Xk-1)
 */
void filterc(dataptr, coefptr, taps)
dspintf dataptr[], coefptr[];
dspint taps;
{
    static dspint i;

    for (i = 0, filt_out = 0; i < taps + 1; ++i)
        filt_out += (dataptr[i] * coefptr[i]) / 8388608.;
}

Appendix D Instructions

You are going to listen to several sets of sentences on the telephone handset. At the same time you will also hear a babbling noise in the background behind the speaking voice. For every test sentence you hear, your task will be to repeat the last word of the sentence. For example, if the sentence you hear is: “We shipped the furniture by truck”, you should say “truck” into the telephone receiver, just as if you were speaking on the phone. Some of the sentences you will be hearing make more sense than others. Don’t let this concern you. Simply say what you think you heard at the end of each sentence. It is important that you guess, even if you are not sure of the word.
If you have any questions, please ask me.

We will start with some practice sentences first. Switch the hearing aid which you normally use for the telephone to the “T” position and turn off the other hearing aid, if you have one. While you are listening to these practice sentences, 1) adjust the volume control on your hearing aid, 2) place the receiver against your hearing aid and 3) adjust the volume control on the telephone receiver. You want to find the volume and the receiver position which makes the man’s speech seem the clearest to you. You may need to move the receiver around a fair bit. Once you have found the best volume settings on both your hearing aid and the telephone receiver and found the best receiver position, do not touch the volume wheel on your hearing aid or the telephone receiver and try your best to keep the receiver in the “best” position. Repeat only the last word of each sentence and take a guess when you are unsure of what you have heard. When you feel that you have adjusted your hearing aid and the telephone receiver to clearly understand the man’s voice and are ready for the test sentences, please tell me.

When the practice test is over, you will hear 5 sets of 50 sentences. There will be a period of rest between each sentence set. Repeat the last word of each sentence, remembering to guess if you have to.
Please tell me if you have any questions.

Thanks for your help.

Appendix E Subject Characteristics

Table XIII Subject Characteristics (one column per subject)

Mother Tongue: English | Hungarian | English | English | English | English | English | English
Age: 61 | 56 | 71 | 26 | 24 | 76 | 37 | 62
Sex: M | M | F | F | F | M | F | F
Hearing Loss: Moderate to Profound mixed Mild to Mild to Severe to Moderate to Mild to moderate moderate to severe high moderately moderate profound severe mid & sensorineural profound frequency severe sensorineural sensorineural high frequency sensorineural sensorineural sensorineural sensorineural
Etiology: Noise related | Otosclerosis | Age related | Birth | Unknown | Noise related | Birth/Disease | Hereditary/Age
Discrimination Scores: 76 | N/A | 40 | 82 | N/A | 76 | 68 | 46
Number of Aids: One (right) | Two | Two | Two | Two | Two | One (left) | Two
Years Wearing Aid: 35328821132522
Aid Brand: Unitron | Phonic Ear | Siemens | Widex | Siemens | CE8 | Siemens | Danavox
Type: Behind the Ear | Behind the Ear | In the Ear | Behind the Ear | Behind the Ear | In the Ear | Behind the Ear | Behind the Ear
T Switch Available: Yes | Yes | No | Yes | Yes | Yes | Yes | Yes
Usual Telephone Ear: Right | Left | Left | Right | Left | Left | Left | Left
Self-rated Telephone: Good most of the time | Good most of the time | Fair | Good most of the time | Poor to fair | Poor | Good most of the time | Good most of the time
Telephone Use: T switch only | Both | Amplifier only | Amplifier only | Both | Both | T switch at home | T switch only
Comments: British accent; Amplifier only in quiet; Amplifier only at work; Amplifier only in quiet
