UBC Theses and Dissertations

Identification of invariant acoustic cues in stop consonants using the Wigner distribution Garudadri, Harinath 1987

Full Text

Identification of Invariant Acoustic Cues in Stop Consonants Using the Wigner Distribution

by Harinath Garudadri

B.Tech., S.V. University, Tirupathi, India, 1978.
M.Tech., Indian Institute of Technology, Bombay, India, 1980.

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, Department of Electrical Engineering

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
December, 1987
© Harinath Garudadri, 1987

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Electrical Engineering
The University of British Columbia
Vancouver, Canada
Date: December 1987

Abstract

It is a common belief that there are invariant acoustic patterns in speech signals which can be related to their phonetic description. These patterns are expected to remain invariant, independent of the language, speaker, phonetic context, etc. Although many investigations based on short-time spectral analysis have established the feasibility of extracting invariant cues in certain contexts, they could not provide a set of invariant cues in any given phonetic context. In this thesis, the Wigner distribution (WD) was used to analyze speech signals for the first time, to investigate acoustic invariance. The WD, like the spectrogram, provides a time-frequency description of the signal.
Unlike the spectrogram, it provides correct marginals in the time and frequency domains, but it is not a positive distribution. It is demonstrated here that the partially smoothed WD, in which both the properties of positivity and correct marginals are sacrificed to some extent, provides better time-frequency resolution than short-time spectral analysis methods. An implementation and an interpretation of the partially smoothed WD are presented. The choice of smoothing parameters and the nature of cross-term suppression in a partially smoothed WD are discussed in detail. It is shown that the cross-terms in a partially smoothed WD do not mask the underlying nature of a signal in the time-frequency plane.

A partially smoothed WD was used to investigate acoustic invariance in voiceless, unaspirated stop consonants spoken by native speakers of English, Telugu and French. Contrary to reports in the literature, it was shown that the "diffuse-rising" and "compact" spectral shapes were not unique to the alveolar and velar places of articulation, respectively, but depended on the vowel context. The resulting ambiguities when specifying the place of articulation were resolved using Formant Onset Duration (the time taken for the steady-state formants to occur in the vocal tract after the consonantal release) and F2 of the following vowel. The place of articulation was specified correctly for 86% of the tokens. Unlike in other investigations, the errors in specifying the place of articulation were uniformly distributed over all vowel contexts.

Contents

Abstract
List of Tables
List of Figures
Acknowledgements

1 Introduction
1.1 Motivation
1.2 Statement of the Problem
1.3 Statement of the Results
1.4 Organization of the Thesis
2 Acoustic Invariance in Stop Consonants: Earlier Investigations
2.1 Overview
2.2 Background
2.3 Earlier Studies for Acoustic Invariant Cues
2.3.1 Context Dependencies in Earlier Studies
2.4 Drawbacks of the Short-Time LP Smoothed Spectra
2.5 Summary

3 Smoothed Wigner Distribution
3.1 Overview
3.2 Introduction
3.2.1 Background
3.2.2 Literature Review
3.3 Analytic Signals
3.4 Positivity and Resolution Considerations
3.4.1 Positivity of a Smoothed WD
3.4.2 Interpretation of the Smoothed WD
3.5 Computing the Smoothed WD
3.6 Nature of Cross-terms
3.7 Choice of the Smoothing Parameters
3.7.1 Choice of Ω
3.7.2 Choice of T
3.7.3 Cross-term Suppression in Smoothed WD
3.7.4 Other Considerations Specific to Speech
3.8 Summary

4 Study of Acoustic Invariance: A Cross-Language Study
4.1 Overview
4.2 Preliminaries
4.2.1 Stop Consonants
4.2.2 Organization of Stop Consonants in English, French and Telugu
4.3 Invariant Patterns in the Wigner Plots
4.3.1 Pilot Study
4.3.2 Feature Definitions
4.3.3 Place of Articulation
4.3.4 Comparison of Invariant Patterns from WD Plots and Short-Term Spectra
4.3.5 Voiced and Aspirated Stop Consonants
4.4 Development of the Program FEATURE
4.4.1 Measurement Section
4.4.2 Logic Section
4.5 The Experiment
4.5.1 Training of Judges
4.5.2 Modification to the Program FEATURE
4.5.3 Data for the Cross-Language Study
4.5.4 Running the Program FEATURE
4.6 Summary

5 Results and Discussions
5.1 Overview
5.2 Overall Performance
5.2.1 Performance by Language
5.2.2 Performance by Place of Articulation
5.2.3 Performance by Vowel Context
5.2.4 Overall Scores
5.3 Performance at Each Node
5.4 Relevance of the Acoustic Features Measured in WD Plots
5.4.1 Spectral Shape
5.4.2 Formant Onset Duration and Burst Duration Statistics
5.5 Summary

6 Summary and Suggestions for Future Research
6.1 Summary
6.1.1 Wigner Distribution
6.1.2 Acoustic Invariance
6.2 Suggestions for Future Research

A Data Acquisition
A.1 Speech Signals
A.1.1 Data for Pilot Study
A.1.2 Data for the Experiment
A.2 Instrumentation
A.2.1 Recording Setup
A.2.2 Digitization
A.2.3 Computing and Plotting the Wigner Distribution
A.2.4 Data Acquisition for the Cross-Language Study

B The Program for Computing the Wigner Distribution

C Instructions for Identifying Stop Consonants from the Wigner Plots
C.1 Preamble
C.2 Overview of the Task
C.3 Feature Definitions
C.4 Addendum

Bibliography

List of Tables

2.1 Results of Blumstein and Stevens (1979) for voiced CV syllables
2.2 Results of Kewley-Port (1983)
3.1 Examples of some kernel functions and the resulting representations
4.1 Organization of the stops in the languages studied
4.2 Comparison of the analysis methods
4.3 FOD and FOD−BD values from the pilot data
4.4 The choice of original thresholds
4.5 The choice of new thresholds
5.1 Overall scores by language
5.2 Overall performance by place of articulation
5.3 Performance by vowel context
5.4 Performance by place of articulation and language
5.5 Performance at each node in the decision tree
5.6 Distribution of the feature "spectral shape" by place of articulation

List of Figures

3.1 Spectrogram of an exponential
3.2 Wigner distribution of an exponential
3.3 Spectrogram of a chirp
3.4 Wigner distribution of a chirp
3.5 G(t,ω) for a pseudo-WD
3.6 G(t,ω) for a narrowband spectrogram
3.7 G(t,ω) for a wideband spectrogram
3.8 G(t,ω) for a smoothed WD with TΩ = 0.25
3.9 Subroutine for computing the WD
3.10 Cause of spectral wrap in a pseudo-WD
3.11 Spectral wrap in a chirp signal
3.12 Effective cross-term suppression as the spectral distance between self-terms decreases
3.13 WD of English phone [de] with no smoothing, TΩ = 0.01
3.14 WD of English phone [de] with partial smoothing, TΩ = 0.25
3.15 WD of English phone [de] with complete smoothing, TΩ = 1.0
3.16 Pseudo-WD of two chirps showing negligible cross-terms during silence
3.17 Smoothed WD of two chirps, TΩ = 1.0
3.18 Smoothed WD of an alveolar depicting negligible cross-terms during FOD
3.19 Smoothed WD of a velar depicting negligible cross-terms during FOD
4.1 WD of the beginning of the syllable [ka] (Telugu speaker)
4.2 WD of the beginning of the syllable [ga] (English speaker)
4.3 WD of the beginning of the syllable [ta] (Telugu speaker)
4.4 WD of the beginning of the syllable [pa] (French speaker)
4.5 Examples of typical spectral shapes observed in the WD plots
4.6 WD of the beginning of the syllable [ge] (English speaker)
4.7 WD of the beginning of the syllable [du] (English speaker)
4.8 WD of the beginning of the syllable [di] (Telugu speaker)
4.9 WD of the syllable [kʰa] from English
4.10 Flowchart for measuring the visual patterns
4.11 Flowchart for assigning the place of articulation
4.12 Template used for measuring the visual patterns
5.1 WD of the beginning of the syllable [be] (English speaker)
5.2 WD of the beginning of the syllable [da] (English speaker DG)
5.3 WD of the beginning of the syllable [te] (Telugu speaker)
5.4 Distribution of diffuse non-falling and compact spectral shapes in alveolars and velars
5.5 FOD−BD vs. FOD scatter plot for English tokens
5.6 FOD−BD vs. FOD scatter plot for Telugu tokens
5.7 FOD−BD vs. FOD scatter plot for French tokens
B.1 Help Screen of WIG4C
B.2 Default Wigner Parameters
B.3 Default Plot Parameters

Acknowledgements

I am very grateful to my advisors, Dr. Michael Beddoes and Dr. John Gilbert, for their continual help, supervision and support. I would like to thank Dr. Andre-Pierre Benguerel for taking an active interest in this dissertation and providing me with the necessary background information in phonetics. I am very thankful to Dr. Peter Lawrence for his help in acquiring the speech data and especially for the encouragement he has offered throughout this work.

The topic of this dissertation originated at the Indian Institute of Technology, Kanpur, in 1981. I would like to thank Dr. Satish Mullick for introducing the Wigner distribution to me and encouraging me to pursue the topic of acoustic invariance.

I thank all my friends who contributed directly and indirectly to this thesis. Special thanks are due to Prem Lakshmanan, Subroto Bhattacharya, Sudhakar Cherukupalli, Vasudha Vipat, Darrell Wong and John Nicol. I would also like to thank Carol McGeough, Teresa Ukrainetz, Mike Lee and Dr. Jack Bodolec for participating in the study. I would like to thank Maureen Beddoes, Joanna and Fritz Lehmann, and Dr. Susheela Reddy for making my stay in Vancouver very pleasant.

It would be futile to try to express my feelings towards John Gilbert and Carolyn Johnson. I am indeed grateful for everything they did for me and I will always remember Vancouver with fond memories.

Dedicated to Amma and Bhaskar.

Chapter 1
Introduction

1.1 Motivation

From the dawn of civilization, speech has played an important role in society.
Evolutionists agree that the development of a speech centre in the brain led to the evolution of homo sapiens, and it is speech which sets us apart from primates. Speech is the most effective and natural means of communication among humans; as Edward Sapir said, it is "the best show that man puts on."

Over the past 50 years, many advances have been made in understanding speech, both its production and its perception. It has become possible to rehabilitate people with speech and hearing disorders by providing them with aids to cope with their impairments. With the advent of computers, many and varied applications have become feasible, such as reading machines for the blind, speech understanding systems, voice-operated remote machines to replace humans in hazardous environments or to operate in inaccessible areas, computer-generated speech in information retrieval, security systems, etc. However, there are still many aspects of speech production and perception that are not well understood. A thorough understanding of the nature of speech is essential in order to achieve a desirable level of success in speech-related applications, and to assure a quality and reliability of speech technology comparable with that of, say, home electronics or automobiles.

This study is directed towards understanding the nature of one class of speech signals, i.e. stop consonants. Stop consonants are the sounds produced by a complete closure of the vocal tract, e.g., [k] and [t] in words like "cat." Unlike vowels, which are steady-state signals, stop consonants are non-stationary and are among the hardest signals to tackle in signal processing. A new signal processing tool, called the "Wigner distribution" (WD), is used here in a cross-language study of unaspirated stop consonants from English, Telugu and French. The WD has not previously been used extensively to analyze speech signals. As will be shown, the WD is particularly suited to the study of stop consonants.
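To make the tool concrete, the discrete pseudo-Wigner distribution that underlies this kind of analysis can be sketched as follows. This is a generic illustration, not the thesis program WIG4C; the use of the analytic signal (cf. Section 3.3) and the lag window length `n_lags` are conventional choices, not the author's exact parameters.

```python
import numpy as np

def pseudo_wigner(x, n_lags=128):
    """Discrete pseudo-Wigner distribution of a real signal.

    Generic sketch: the analytic signal avoids aliasing from the
    frequency doubling in the WD kernel; `n_lags` is illustrative.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)

    # Analytic signal via the FFT (suppress negative frequencies)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0
    if N % 2 == 0:
        h[N // 2] = 1.0
    xa = np.fft.ifft(X * h)

    W = np.zeros((n_lags, N))
    half = n_lags // 2
    for n in range(N):
        # Instantaneous correlation kernel r(m) = x(n+m) x*(n-m),
        # truncated to the lags available around sample n
        mmax = min(n, N - 1 - n, half - 1)
        r = np.zeros(n_lags, dtype=complex)
        m = np.arange(mmax + 1)
        r[m] = xa[n + m] * np.conj(xa[n - m])
        r[-m[1:]] = np.conj(r[m[1:]])      # Hermitian symmetry in the lag
        # FFT over the lag variable; the result is real up to roundoff
        W[:, n] = 2.0 * np.real(np.fft.fft(r))
    return W
```

For a pure tone at digital frequency f, the energy concentrates near frequency bin 2·f·n_lags, reflecting the factor-of-two frequency scaling of the WD kernel relative to an ordinary DFT.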
1.2 Statement of the Problem

The concept of "acoustic invariance" can be explained as follows: is there some unique acoustic aspect of a given speech segment that remains invariant, independent of language, speaker, phonetic context, etc., and that can be related to the phonetic description of that speech segment? Acoustic invariance hypothesizes that such unique acoustic properties are present in speech signals. It is a common belief that these properties (called "invariant acoustic cues") are used by humans in perceiving speech. Hence, it should be possible to extract them from the signal using signal processing methods. When these cues refer to the location in the vocal tract where the airflow is momentarily stopped during speech production, they are called "invariant acoustic cues for place of articulation" in stop consonants.

Identification of invariant acoustic cues for place of articulation has been a topic of active interest for many years. The signal processing tools used previously to extract acoustic invariance were essentially short-time spectral estimation methods. Various time-varying spectral properties have been investigated as possible correlates for place of articulation in stop consonants. Although the feasibility of extracting some invariant properties from speech has been unequivocally established, these investigations have not succeeded in providing a set of necessary and sufficient acoustic properties of the signal to specify the place of articulation. As a consequence, the concept of acoustic invariance is still viewed with considerable reservation [53].

The WD is a relatively new signal processing tool. It overcomes some of the drawbacks of conventional Fourier techniques. Over the past few years, it has received much attention in applications involving non-stationary and transient signals.
Although the technique has proved to be very useful for analyzing monocomponent signals, when used to analyze multicomponent signals it is circumscribed by the presence of negative energy regions and cross-terms. As a consequence, the WD is not widely accepted as an analysis tool for multicomponent signals such as speech [18]. The use of the partially smoothed WD to analyze speech signals is a novel aspect of this thesis. The partially smoothed WD was used to investigate acoustic invariance in stop consonants.

1.3 Statement of the Results

The contributions of this study were in: (i) developing the WD as an analysis tool for processing multicomponent signals such as speech, and (ii) presenting evidence which supports the concept of acoustic invariance in speech signals.

Wigner Distribution

The WD is a bilinear time-frequency distribution. It has the property of correct marginals, and hence provides an accurate description of the time-varying spectral properties of a signal. Unlike short-time spectral analysis techniques, the WD is not a positive function. The presence of cross-terms, due to its non-linear nature, poses problems in many applications. In this study, the two most desirable properties of a signal processing tool, (i) positivity and (ii) correct marginals, are both sacrificed to some extent by considering a partially smoothed WD. Extremely satisfactory results were obtained when the partially smoothed WD was used to analyze stop consonants and other transient signals.

• The WD can be smoothed to overcome the drawbacks of negative regions and cross-terms. It was shown that the WD smoothed for positivity (to suppress the cross-terms completely) is identical to a spectrogram computed using Gaussian windows. The partially smoothed WD was shown to provide greater time-frequency resolution than short-time Fourier techniques [28,30].
• An interpretation of the partially smoothed WD is presented, and the choice of smoothing parameters for multicomponent signals, with particular reference to speech, is discussed [29]. An implementation of the partially smoothed WD is discussed in detail.

• The nature of cross-terms is discussed with illustrative examples. It was shown that non-zero contributions in the WD, at points where the speech signal itself is zero, do not mask useful information in the WD.

Acoustic Invariance in Stop Consonants

The partially smoothed WD was used in a cross-language study to investigate invariant acoustic properties for place of articulation in 200 unaspirated consonant-vowel (CV) syllables, spoken by native speakers of English, Telugu and French [31].

• The place of articulation was identified correctly for 78% of the bilabials, 81% of the coronals and 98% of the velars, based on visual features extracted from the WD plots. The distribution of errors in bilabials and coronals was not uniform across all speakers. Identification of velars was best for all speakers in all vowel contexts.

• Temporal characteristics of the burst and the second formant of the vowel were used to reduce the context dependency in specifying the place of articulation. Unlike in other investigations, no context dependencies with respect to the following vowel could be found in the results.

• Usually, the acoustic feature "diffuse non-falling spectral shape" is associated with an alveolar burst and the acoustic feature "compact spectral shape" is associated with a velar burst [68,2]. It was shown that, depending on the following vowel, a diffuse non-falling spectral shape can occur in velar bursts and a compact spectral shape can occur in alveolar bursts.

• Dentals and retroflexed alveolars of Telugu were shown to have similar spectral shapes during the burst in the WD plots. This result supports the theoretical predictions of Blumstein and Stevens [2] that dentals and alveolars should have similar acoustic properties, since they both belong to the class of coronals [2,33,12,51].

1.4 Organization of the Thesis

Chapter 2 describes some of the earlier studies which used the LP method for investigating acoustic invariance. The drawbacks of the LP method for analyzing stop consonants are discussed.

Chapter 3 concerns the Wigner distribution. The computation of the partially smoothed WD and its interpretation when applied to multicomponent signals are presented. Examples are presented to illustrate the nature of cross-terms, and it is argued that they do not prohibit extraction of useful information from the WD.

Chapter 4 describes the experiment conducted to investigate acoustic invariance in stop consonants. Section 4.2 introduces some of the linguistic terms used throughout this thesis. The invariant patterns observed in the pilot study are explained and compared with those reported in the literature. A program called FEATURE was developed, which interactively obtains the relevant visual features of a WD plot from the judges and assigns a place of articulation to the WD plot.

Chapter 5 discusses the results of the cross-language study in detail.

Chapter 6 presents the conclusions and some suggestions for future work.

Appendix A describes the hardware and software used in acquiring the speech data for the pilot study and the final experiment.

Appendix B is intended as a manual for using the program WIG4C.F to compute and plot the WD. The program is available on the UBC "General" computer system in a form that can be readily applied to other signals.

Appendix C contains the instructions given to the judges to identify invariant visual patterns in the WD plots.
Chapter 2
Acoustic Invariance in Stop Consonants: Earlier Investigations

2.1 Overview

This chapter describes some of the earlier investigations into specifying the place of articulation in stop consonants based on their acoustic properties. Section 2.2 presents a brief background of the concept of acoustic invariance. Section 2.3 summarises three recent investigations, based on the LP method, that are pertinent to the present work. The context dependencies in these investigations are summarised in Section 2.3.1. Section 2.4 presents the drawbacks of using the LP technique for studying stop consonants.

2.2 Background

The subject of invariant acoustic patterns as cues to the perception of phonemes has been of active interest for many years. Several investigators have examined various features as possible invariant cues for place of articulation [20,22,68,2,3,49,51]. These investigators have essentially used two basic approaches in an attempt to establish whether a given set of acoustic properties of the signal can function as invariant acoustic cues. The first method was to analyze a corpus of reasonable size and show that the said acoustic properties were present in the corpus to a statistically reliable extent. The second approach was to synthesize speech segments such that the given acoustic properties were present in them. The perception of the synthetic segments was then investigated to assess the importance of these properties in specifying the place of articulation.

Examples of acoustic properties investigated as possible correlates for place of articulation include the spectral shape of the acoustic signal during consonantal release, the formant loci (or target frequencies), and the direction of spectral change within a short time interval after the closure release. To some extent, most of these features proved to be perceptually relevant in specifying place of articulation. On the whole, however, they failed to provide a set of necessary and sufficient acoustic properties to specify the place of articulation of a stop consonant independent of the phonetic context. This context dependency is often quoted to argue that it is not possible to extract invariant acoustic patterns from a speech segment that would correlate with its phonetic description [52,53].

A possible reason for the failure of these investigations to provide invariant acoustic cues in all phonetic contexts is that the signal processing tools used were not suitable to analyze, or synthesize, the acoustic properties under consideration. This is true to some extent when short-time spectral analysis methods are used to analyze stop consonants, and when such information is used to synthesize stop consonants. The WD appears to be a more suitable technique for time-varying spectral analysis of transient signals such as stop consonants. The WD was used in this study to investigate acoustic invariance in stop consonants. It was desired to see (i) if it was possible to obtain a better representation of the time-varying spectral properties of stop consonants using the WD, and (ii) if it was possible to extract invariant acoustic cues for place of articulation, independent of the vowel context, using the WD.

Recently, acoustic invariance in stop consonants was examined by Blumstein and Stevens [2], Kewley-Port [49] and Lahiri, Gewirth and Blumstein [51] using short-time Linear Prediction (LP) smoothed spectra. Their findings are described briefly, and some drawbacks of the LP method when analyzing stop consonants are discussed, in the rest of this chapter, since they are pertinent to the present investigation.
2.3 Earlier Studies for Acoustic Invariant Cues

Blumstein and Stevens

Based on theoretical considerations and perceptual experiments, Stevens and Blumstein [68] predicted the spectral shape of the burst for different places of articulation during the production of stop consonants, as follows:

• diffuse-falling or diffuse-flat spectral shape for bilabials,
• diffuse-rising spectral shape for alveolars, and
• compact spectral shape for velars.

In order to provide a quantitative measure of the degree to which these spectral shapes correspond to a given place of articulation, Blumstein and Stevens [2] considered 1800 CV and VC syllables, spoken by six (four male and two female) native speakers of English. They developed three templates, each designed to reflect the diffuse-falling or flat, diffuse-rising and compact spectral shapes. A 14th order LP filter was used to compute the smoothed spectra of the first 26 ms of the burst of each syllable in the corpus. These spectra were tested against each of the three templates for both correct acceptance and correct rejection. About 85% of CV syllables were correctly accepted or rejected by each template, depending on the place of articulation. The results for 450 initial voiced stop consonants ([b, d, g] in the context of vowels [i, e, a, o, u]) reported in table I in [2, p. 1008] are reproduced in table 2.1.

Kewley-Port

Kewley-Port [49] suggested that the invariant acoustic cues might be present in the time-varying spectra, rather than in a static spectrum sampled during the burst and the onset of the vowel, as proposed by Blumstein and Stevens. She computed time-varying LP smoothed spectra using a 20 ms or a 25 ms window, starting from the burst and extending into the following vowel. The window was shifted every 5 ms. The time-varying spectra of the first 40 ms of each token were plotted on a 3-D display.
Table 2.1: Results of Blumstein and Stevens (1979) for voiced CV syllables.

Template          Correct acceptance    Correct rejection
Diffuse-falling   [b] 82.5              [d] 80.7, [g] 90.0
Diffuse-rising    [d] 84.0              [b] 86.0, [g] 88.5
Compact           [g] 86.7              [b] 91.3, [d] 82.7

Kewley-Port proposed three features for the displays: (i) tilt of the spectrum at burst onset, (ii) late onset of low-frequency energy, and (iii) mid-frequency peaks extending over time, based on which the place of articulation was assigned as shown in table I in [49]. A total of 162 CV syllables, spoken by 3 native speakers of English (two male and one female), were analyzed, using a 14th order LP filter for the male speakers and a 12th order LP filter for the female speaker. Four fewer coefficients were used to analyze the voiceless frames, since the spectral sections calculated with fewer coefficients had smoother peaks.

The displays were analyzed by three judges independently, and a place of articulation was assigned to each display. A wrong place of articulation was assigned to 46 of the 162 displays by one or more judges. In order to resolve errors which may have occurred due to careless judgement, the judges were given "additional instructions" and asked to analyze the 46 plots again in a collaborative session. The results in percentage for each place of articulation by vowel context, given in table IV in [49, p. 372], are reproduced here as table 2.2. About 88% of the stop consonants were assigned the correct place of articulation in this study.

Table 2.2: Results of Kewley-Port (1983)

Vowel  (N)     b     d     g   Total
i      (81)   67    93    56    72
e      (81)   70    81    89    80
a      (81)  100    93   100    96
o      (81)   93    96   100    96
u      (81)   78    96   100    91
I      (27)   89   100   100    96
e      (27)  100   100    67    89
ae     (27)  100   100    67    89
Total (486)   84    93    87    88

Although Kewley-Port stressed the importance of considering time-varying spectral properties to specify the invariant acoustic cues, the time domain did not play any important role in her study. The late onset feature, a direct measure of voice onset time, was used by the judges in only 0.6% of the judgements to assign the correct place of articulation [49, p. 328]. The features (i) tilt of the spectrum and (ii) mid-frequency peaks proposed by Kewley-Port are closely related to Blumstein and Stevens' static spectral templates, and do not take time domain variations into account.

Lahiri et al.

Lahiri et al. [51] studied the time-varying spectral properties of bilabials, alveolars and dentals from Malayalam, French and English for invariant acoustic cues for place of articulation. In a pilot study, they considered the Blumstein and Stevens spectral templates to determine if the spectral shape alone was adequate to differentiate bilabials from non-bilabials. They also examined the claim, based on theoretical predictions, that dentals and alveolars share similar spectral properties during the burst (Blumstein and Stevens [2]; Halle and Stevens [33]). Lahiri et al. concluded that the spectral shape during the burst (A) cannot differentiate bilabial stops from dental stops, and (B) cannot group dental stops and alveolar stops into one class, contrary to predictions based on phonological and acoustic theories.

Lahiri et al. then computed short-time LP smoothed spectra using a 10 ms window, with a 5 ms shift for each spectral line. Rather than estimating the spectral shape or spectral tilt as Blumstein and Stevens and Kewley-Port did, they focussed on the relative changes in the distribution of energy from the burst release to the onset of voicing. Based on further pilot studies, Lahiri et al. showed that the bilabial stop consonants were characterized by either (i) less difference in the energy at low frequencies than at high frequencies, or (ii) the same difference in the energy at low and high frequencies, from the burst to the onset of voicing.
Dental and alveolar stop consonants were characterized by a greater difference in the energy at low frequencies than at high frequencies. For "low" and "high" frequencies, they chose 1500 Hz and 3500 Hz respectively. The energy measurements were made on a straight line joining the F2 and F4 peaks. The above metric was used in a cross-language study investigating 493 CV syllables spoken by 3 French speakers (298 tokens), 2 Malayali speakers (95 tokens) and 1 English speaker(100 tokens). Lahiri et al. were able to specify the place of articulation correctly for 89% of the bilabials, 94% of the dentals and 92.5% of the alveolars. It should be noted that except for the 198 CV syllables spoken by two French speakers, the rest of the corpus (a total of 295 CV syllables) was analyzed in the pilot experiments to observe and define the invariant acoustic properties reported in their study. 2.3.1 C o n t e x t d e p e n d e n c i e s i n E a r l i e r S t u d i e s In Blumstein and Stevens [2], most of the errors in specifying the place of articula-tion for initial stop consonants were in the context of high front vowel [i] and high back vowel [u]. Bilabials in the context of [i] were classified as alveolars and velars. Alveolars in the context of [u] were classified as velars. A large number of velars in the context of [i] and [e] were classified as alveolars. Most of the errors reported in Kewley-Port [49] were also in the context of high front and back vowels. Bilabials in the contexts of [i, e, u] and velars in the contexts of [i, er, ae] were incorrectly identified (see table 2.2). No such context effects were reported by Lahiri et al. [51]. Context dependencies of similar nature were encountered by others in speech perception studies, based on stimuli constructed using tape-splicing techniques (eg., Cole and Scott [21]) and synthetic speech stimuli (eg., Cooper et al. [20], Blumstein and Stevens [3]). 
Thus, it appears that the acoustic information for place of articulation is different in different vowel contexts. This is considered to weaken the notion of acoustic invariance in speech signals. Many researchers have argued that it is not possible to extract invariant acoustic cues from speech signals (Lindblom [53]; Liberman, Cooper, Shankweiler and Studdert-Kennedy [52]).

2.4 Drawbacks of the Short-Time LP Smoothed Spectra

The burst region of the stop consonants is highly transient. As the closure at the place of articulation is released, the energy in the speech signal increases. The characteristics of the resulting plosion and of the ensuing frication noise undergo a continuous change, till the opening is wide enough to allow the steady state excitation of the vocal tract resonances corresponding to the vowel sound. The duration of the transient part can vary from 5 to 100 msec, depending on place and manner of articulation, language and speaker. In Section 2.3, it was seen that the typical window lengths used in analyzing stop consonants vary from 10 to 26 msec. In the LP method, the signal s(n) is assumed to be short-time stationary over the analysis window of length N. For each window, the short-time spectrum is computed using an all-pole model (Makhoul, [55]; Markel and Gray, [57]). The smoothed spectrum for a given window can be written as

    H(e^{jω}) = G / A(e^{jω})    (2.1)

where

    A(e^{jω}) = 1 + Σ_{k=1}^{p} a_k e^{−jkω}.    (2.2)

In Eqn 2.1, G is a constant. The LP parameters a_k are computed such that the total squared error in approximating the input signal s(n) with a predicted signal ŝ(n) is minimum [57]. The signal spectrum |S(e^{jω})|² can be represented by |H(e^{jω})|² with an arbitrarily small error, by increasing the filter order p.
    |S(e^{jω})|² = lim_{p→∞} |H(e^{jω})|².    (2.3)

The relationship between the signal, the error and the LP filter in the frequency domain can be written as

    |S(e^{jω})|² = |E(e^{jω})|² |H(e^{jω})|² / G².    (2.4)

In the present context, it is of interest to see the nature of the above approximation for nonstationary signals and finite values of p. The relationship between the signal spectrum |S(e^{jω})|² and the LP spectrum |H(e^{jω})|², for finite values of p, can be seen by comparing Eqn 2.4 and Eqn 2.1. |E(e^{jω})|² is being modelled by a flat spectrum G². As the all-pole filter order p is increased, |E(e^{jω})|² becomes white noise and |S(e^{jω})|² becomes identical to |H(e^{jω})|². For finite values of p, |E(e^{jω})|² represents the portion of the signal that does not fit the all-pole model. This information is ignored in computing the LP smoothed spectra. The value of p is chosen based on the number of resonances expected in the signal. For computing the time-varying spectra in speech, usually 10 parameters are used to represent the formant structure, and 4 additional parameters are used to approximate the glottal filter. In the burst region of stop consonants, there is no formant structure, as there are no steady state resonances in the vocal tract. Further, the glottis is associated with the laryngeal action, e.g. voicing and aspiration, and not the place of articulation. Thus, although a 14th order LPC filter gives satisfactory results for vowels, it is not ideal for analyzing the burst of stop consonants. Kewley-Port [49] used four fewer coefficients to analyze the frames containing frication, to avoid "rippled peaks" in the spectra. This would further increase the approximations introduced by the LP method. Kewley-Port suggested that the smoother peaks obtained by fewer coefficients were "closer to the underlying fricative spectrum." This is clearly not valid since steady state signals like vowels have smooth peaks, and to study the nature of stop consonants, their differences from vowels should be preserved in the analysis method.
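The point that the LP residual carries the part of the signal that does not fit the all-pole model can be illustrated numerically. The sketch below is an invented toy example, not the thesis's analysis: it fits a 2nd-order LP model (autocorrelation method, Levinson-Durbin recursion) to a resonant AR background, then shows that a superimposed transient appears almost entirely in the prediction error signal.

```python
import numpy as np

def lp_coeffs(x, p):
    """LP analysis via the autocorrelation method (Levinson-Durbin)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[:i + 1] = a[:i + 1] + k * a[:i + 1][::-1]
        err *= 1.0 - k * k
    return a, err

rng = np.random.default_rng(0)
# resonant "steady state" background: AR(2), poles at radius 0.98
pole = 0.98 * np.exp(2j * np.pi * 0.1)
ar = np.poly([pole, np.conj(pole)]).real        # [1, a1, a2]
x = np.zeros(2000)
e = rng.standard_normal(2000)
for n in range(2, 2000):
    x[n] = e[n] - ar[1] * x[n - 1] - ar[2] * x[n - 2]
x[1200] += 30.0                                  # a transient the model cannot fit

a, _ = lp_coeffs(x[:1000], 2)                    # fit on the clean background
resid = np.convolve(x, a, mode='same')           # prediction error signal
sigma = resid[:1000].std()
# resid stays near sigma over the background, but jumps by tens of
# sigma at the transient: the information an LP spectrum discards
```

The smoothed LP spectrum describes only the all-pole part; everything the model cannot follow, such as the burst of a stop consonant, ends up in `resid`.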
When overlapping windows are used to compute successive spectra, adjacent spectra are smeared in time. In conclusion, when short-time spectral estimation techniques which assume an all-pole model for the signal are used to analyze stop consonants, two sources of error can be identified:

1. A smear between the non-stationary burst region and the steady state vowel, due to overlapping windows.

2. Ignoring the portion of the signal that does not conform to an all-pole model.

Hence, the LP parameters cannot represent the nonstationarities occurring within a window. This property, in fact, is used to detect epileptic spikes in EEG signals. For a proper choice of the filter order, the LP parameters of the EEG background activity do not reflect the properties of the nonstationary spikes. By monitoring the prediction error, epileptic spikes are detected. In conclusion, LP smoothed spectra give excellent results when the signals can be modelled by an all-pole system. When the nature of the signal deviates from this ideal, as in stop consonants, much of the useful information in the error signal is sometimes ignored.

2.5 Summary

Many investigations in the past failed to yield a complete set of necessary and sufficient acoustic properties to specify the place of articulation in stop consonants. This is partially due to inadequacies in the signal processing tools used. Three recent investigations for invariant acoustic cues in stop consonants using the LP method were presented. Drawbacks inherent in the LP method for analyzing stop consonants were discussed.

Chapter 3

Smoothed Wigner Distribution

3.1 Overview

In this chapter, the Wigner distribution (WD) is presented as an alternative to short-time spectral analysis of signals that are essentially transient in nature. Section 2 presents the concept of WD analysis. The literature on its properties and applications is introduced.
Section 3 documents the way in which analytic signals were obtained from real signals in this study. Section 4 considers the compromise between positivity and enhanced resolution in WD analysis. It is shown that the WD smoothed for positivity is equivalent to a spectrogram. An interpretation of the partially smoothed WD is presented. Section 5 presents the computation of the smoothed WD. A FORTRAN subroutine is included, which requires as inputs the speech signal, the time domain window, and the frequency domain window, and provides the smoothed WD as output. Section 6 explains the nature of cross-terms in the WD of multicomponent signals. Section 7 presents some of the factors to be considered in choosing the smoothing parameters. The effect of partial smoothing is demonstrated, and it is shown that the partially smoothed WD is a useful alternative to short-time spectral analysis.

3.2 Introduction

3.2.1 Background

The WD is a powerful signal processing tool. It was first introduced by Wigner for quantum mechanics in 1932 [70], and by Ville for communication theory in 1948 [69]. Over the past few years, the WD has generated much discussion about its potential for the analysis of transient signals. The WD is similar to short-time spectra, in the sense that it describes the time-varying spectral properties of a signal. For a steady state exponential signal, the WD and the short-time FFT are equivalent, as shown in Figures 3.1 and 3.2. For time-varying signals, however, the WD provides a much better resolution than a short-time FFT. This is illustrated in Figures 3.3 and 3.4, which show the spectrogram and the WD of a chirp signal, respectively. Both displays have been computed using the same resolution in the frequency domain. Note the wide main lobe in Figure 3.3, due to temporal smear. Figure 3.4 shows that the WD preserves the time domain characteristics of the signal better than a spectrogram.
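The behaviour in Figures 3.1 and 3.2 can be reproduced with a small numerical sketch. The function below evaluates one time-slice of a discrete pseudo-WD with a rectangular window; the function name, signal and parameters are mine, chosen for illustration. For a complex exponential on an exact frequency bin, all of the energy lands in that bin, and the slice is (numerically) real.

```python
import numpy as np

def wd_slice(f, n, K):
    """One time-slice of a rectangular-window discrete pseudo-WD:
    W(n, m) = 2 * sum_k f[n+k] conj(f[n-k]) e^{-j 2 pi k m / (2K)},
    evaluated with an N = 2K point FFT over the lag index k."""
    N = 2 * K
    kern = np.zeros(N, dtype=complex)
    for k in range(-K, K):
        kern[k % N] = f[n + k] * np.conj(f[n - k])
    return 2.0 * np.fft.fft(kern)

K = 32                        # 64 frequency bins covering [0, pi)
w0 = np.pi * 16 / (2 * K)     # tone exactly on bin 16
n = np.arange(200)
f = np.exp(1j * w0 * n)
W = wd_slice(f, 100, K)
# the slice concentrates at bin 16 and is real, as a WD should be
```

Note the lag product f(n+k)f*(n−k): the signal enters quadratically, which is what gives the WD its sharp localization for a single component, and (later in the chapter) its cross-terms for multicomponent signals.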
The WD of a signal f(t) is defined as

    W_f(t, ω) = ∫_{−∞}^{+∞} f(t + τ/2) f*(t − τ/2) e^{−jωτ} dτ.    (3.1)

For two signals f₁(t) and f₂(t), the cross WD is defined as

    W_{f₁,f₂}(t, ω) = ∫_{−∞}^{+∞} f₁(t + τ/2) f₂*(t − τ/2) e^{−jωτ} dτ.    (3.2)

Some of the useful properties of the WD which make it attractive for transient signal analysis are:

• The WD is a real function of time and frequency.

• The WD gives correct marginal distributions in time and frequency, i.e., integrating the WD at a given time over all frequencies yields the instantaneous power, and integrating at a given frequency over all time yields the power spectral density.

    (1/2π) ∫_{−∞}^{+∞} W_f(t, ω) dω = |f(t)|²    (3.3)

    ∫_{−∞}^{+∞} W_f(t, ω) dt = |F(ω)|²    (3.4)

[Figure 3.1: Spectrogram of an exponential. Figure 3.2: Wigner distribution of an exponential. Figure 3.3: Spectrogram of a chirp. Figure 3.4: Wigner distribution of a chirp.]

• The WD preserves the shifts in time and frequency domains. For signals shifted in time and/or frequency, the WD is also shifted in time and/or frequency by the same amount.

    f₁(t) = f(t + t₁) e^{jω₁t}    (3.5)

    W_{f₁}(t, ω) = W_f(t + t₁, ω + ω₁)    (3.6)

• The WD of time limited signals is also limited to the same time duration. Similarly, the WD of band limited signals is also limited to the same bandwidth.

Unlike conventional spectral estimation methods, the WD is periodic in the frequency domain with a period of π rather than 2π. This makes it necessary to sample the signal at twice the Nyquist rate, or to consider the analytic version of the input signal, in order to avoid aliasing in the discrete WD. While the WD is excellent for analyzing monocomponent signals (as shown in Figures 3.2 and 3.4), it has some drawbacks for analyzing multicomponent signals.

• For a given Fourier transform pair f(t) ↔ F(ω), the WD is not always positive for all values of t and ω.

• The WD can be non-zero when the signal f(t) itself is zero. Similarly, the WD can have non-zero contributions for frequencies where the spectrum F(ω) is zero.
These non-zero contributions have been called "cross-terms" or "interference terms." They occur in the WD because it is bilinear in nature. Consider the signal f(t) defined as f₁(t) + f₂(t). The WD of f(t) can be shown to be

    W_f(t, ω) = W_{f₁}(t, ω) + W_{f₂}(t, ω) + 2ℜ[W_{f₁,f₂}(t, ω)].    (3.7)

The first two terms in the above equation will be called self-terms, since they correspond to the signal itself. The third term is the cross-term, and its support in the time-frequency plane is generally midway between the self-terms. The presence of negative regions and cross-terms is frustrating in many applications of the WD. It is, however, shown that these drawbacks need not deter one from taking advantage of its useful properties. Higher order distributions with correct marginals and positivity have been proposed in the literature [16,17]. Discussion in this thesis will, however, be limited to the WD and its applications.

3.2.2 Literature Review

Over the last few years, many advances have been made in applying WD techniques to the processing of non-stationary signals. Some of these, pertinent to the present topic, are summarized in this section.

Properties of WD

The properties of the WD in the context of signal processing were discussed extensively by Claasen and Mecklenbrauker [13] and Flandrin and Escudie [24]. The dual nature in time and frequency of the WD and the ambiguity function (AF) was studied by Hlawatsch [36,37]. An excellent tutorial review of the WD and its relationship to other time-frequency distributions was given by Boudreaux-Bartels [7]. Cohen gave a generalized equation for generating all possible bilinear distributions, based on a kernel function [15]. These distributions, known as Cohen's class of generalized bilinear distributions, can be written as

    C_f(t, ω; Q) = W_f(t, ω) * G(t, ω)    (3.8)

i.e., any desired distribution C_f(t, ω; Q) can be obtained by convolving the WD of the signal, W_f(t, ω), with an arbitrary kernel function G(t, ω). The convolution is in both t and ω.
G is related to the kernel Q by a two-dimensional Fourier transform. If the kernel function G(t, ω) is chosen as 2πδ(t)δ(ω), then C_f(t, ω) becomes the WD. Thus, the kernel function also allows the comparison of the WD with other distributions in Cohen's class, with respect to spread in time and frequency.¹

A serious drawback to using the WD in signal analysis was the presence of negative regions and cross-terms. Their presence is particularly annoying in speech research, since the spectrogram, a popular time-frequency analysis tool in speech for nearly four decades, does not have this disadvantage. The signals for which the WD is always positive were presented by Hudson in 1974 [40]. These signals are chirp-like in nature, but the WD of most of the signals encountered in real life (speech, biomedical signals, etc.) does not belong to this class. de Bruijn in 1967 [9] and Janssen in 1981 [45] have shown that smoothing the WD in time and frequency with appropriate Gaussian windows will always result in a positive distribution.² For the bilinear distributions of Cohen's class, the conditions of positivity and correct marginals are incompatible [71,14]. However, in this class of distributions, the WD has been shown to be superior to other distributions in terms of minimum spread in time and frequency [46]. Further, Janssen and Claasen [47] showed that there is no distribution in Cohen's class which becomes positive for all signals after smoothing with TΩ < 1, and that the WD is the only distribution in this class which becomes positive for all signals after smoothing with TΩ = 1. Janssen and Claasen also pointed out that when the signal deviates from the subclass of signals given by Hudson in [40], considerable smoothing is required to get a non-negative distribution.

¹See Table 3.1 for examples of some kernel functions.
²The smoothing operation is as shown in Equations 3.10 and 3.11. In Section 3.4.1, this smoothing is parameterized by the time-bandwidth product TΩ of the Gaussian windows used in smoothing, so that TΩ = 1 results in a positive distribution.
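The effect of Gaussian smoothing on positivity can be reproduced in a few lines. In the sketch below (illustrative window values, not calibrated to TΩ = 1), the raw discrete WD of a two-tone signal has a strongly negative-going cross-term midway between the tones; a truncated Gaussian smoothing along the time direction attenuates the negative excursions by roughly an order of magnitude while leaving the self-terms untouched, yet does not remove them entirely, consistent with the remark that considerable smoothing is needed for full non-negativity.

```python
import numpy as np

def wd_slice(f, n, K):
    """One time-slice of a rectangular-window discrete pseudo-WD."""
    N = 2 * K
    kern = np.zeros(N, dtype=complex)
    for k in range(-K, K):
        kern[k % N] = f[n + k] * np.conj(f[n - k])
    return np.real(2.0 * np.fft.fft(kern))

K = 32
nax = np.arange(400)
w1 = np.pi * 8 / (2 * K)                 # self-term at bin 8
w2 = np.pi * 24 / (2 * K)                # self-term at bin 24
f = np.exp(1j * w1 * nax) + np.exp(1j * w2 * nax)

# raw WD slices: the cross-term at bin 16 oscillates along time
# and swings negative
W = np.array([wd_slice(f, n, K) for n in range(100, 200)])

# truncated Gaussian smoothing along the time direction only
T = 4.0
i = np.arange(-12, 13)
g = np.exp(-(i / T) ** 2)
g /= g.sum()
Wsm = np.apply_along_axis(lambda c: np.convolve(c, g, mode='valid'), 0, W)
```

Increasing T suppresses the oscillating cross-term further; because the window is truncated, some negative residue always survives, as argued later in Section 3.7.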
In later sections of this chapter, it is shown that smoothing with TΩ = 1 makes the WD and spectrograms equivalent, and that a smoothing value TΩ much less than 1 is adequate in many applications for making meaningful interpretations regarding the nature of the signal.

Analytic input for WD

For real signals, the WD is symmetric with respect to the ω = 0 line in the frequency domain. Consequently, contributions at positive and negative frequencies interact to give rise to cross-terms at ω = 0 [13,4]. In applications where the information at low frequencies is important, the analytic version of the real signal should be used to compute the WD [4,5]. Examples of applications in which analytic signals were used to compute the WD can be found in [4,5,26,29,42,44,58,60,56]. It is particularly important to consider analytic signals in computing the WD of speech, since the high amplitude cross-terms at ω = 0 not only mask the low frequency information, but also obliterate the low amplitude higher formant structure of speech. In two recent applications of the WD in speech analysis ([10,11] and [1]), however, the real signal was used in computing the WD. The authors did not comment on the cross-terms at ω = 0 due to the real nature of the signal. Applications of the WD in which analytic versions of speech signals were used can be found in [28,29,65,72].

Cross-terms in WD

For signals with fairly simple structure in the time-frequency plane, it is not difficult to interpret the WD, in spite of cross-terms and negative regions.
Examples of this are chirp-like signals with fairly localized support in time and frequency, encountered by Boashash and others in seismic prospecting and oceanography [4,5,6,42]. For speech and many biological signals, however, cross-terms and negative regions pose many problems in making meaningful interpretations regarding the nature of the signal. Even though it has been argued that any kind of smoothing will destroy the desirable properties of the WD [18], mild smoothing was found inevitable in order to get interpretable results for many types of signals [25,29,31,59,60,61,72]. Flandrin discussed the nature of cross-terms in multicomponent signals in general, and suggested some explicit ways of smoothing [25]. Hlawatsch classified cross-terms as inner and outer cross-terms, and discussed methods to distinguish cross-terms from self-terms [35]. Smoothing the WD for positivity eliminates all the cross-terms in the distribution [28]. This, however, results in a distribution similar to a spectrogram. Partial smoothing was used by Flandrin and Martin [58,60] for biological signals, and by Garudadri, Beddoes, Gilbert and Benguerel [28,29,30,31] and Wokurek et al. [72] for speech. Riley [65] and Atlas et al. [1] smoothed the WD for positivity in analyzing speech signals. Such smoothing makes the WD and spectrogram equivalent. Indeed, in their examples of WD plots and spectrograms, the formants can be seen equally well in both. The WD plots did not display any additional information. It was pointed out by Cohen [18] that the WD is not necessarily zero in the regions in time or frequency where the signal itself is zero. It was remarked that these non-zero regions could mask some relevant information about the signal (e.g. silences in speech). It is shown later that these regions contain the cross-terms in the time domain, which are smoothed when computing the discrete WD using finite windows.
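The "analytic input" requirement discussed above is easy to satisfy in a sketch. The thesis builds ŝ(n) with a 112-point FIR Hilbert transformer; the frequency-domain construction below (function name and test tone are mine) produces the same one-sided spectrum for illustration: zero the negative-frequency half of the DFT, double the positive half, and invert.

```python
import numpy as np

def analytic(s):
    """FFT-based analytic signal: keep DC and Nyquist, double the
    positive frequencies, zero the negative ones, then invert."""
    N = len(s)                # N assumed even here
    S = np.fft.fft(s)
    h = np.zeros(N)
    h[0] = 1.0
    h[N // 2] = 1.0
    h[1:N // 2] = 2.0
    return np.fft.ifft(S * h)

n = np.arange(256)
s = np.cos(2 * np.pi * 0.1 * n)
f = analytic(s)
# the real part reproduces s exactly; the spectrum of f is one-sided,
# so a WD computed from f has no mirror component to beat against
# and no spurious cross-terms along the omega = 0 line
```

Because f has no negative-frequency energy, the interaction between positive and negative frequencies that produces the ω = 0 cross-terms for real input simply cannot occur.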
Applications of WD

The WD has been successfully applied to many signal analysis problems. The WD of the impulse response of loudspeakers was found to be useful in characterising them [44]. In studying seismic signals and the microstructure of underwater temperature gradients, the WD was found to be a useful tool by Boashash and others [4,6,42]. Martin and Flandrin used the WD to study the nature of biological signals by defining generalized spectral estimators [59,61]. The problem of synthesizing the time domain signal from an arbitrary description in the time-frequency plane using the WD was studied by Boudreaux-Bartels [7,8], Yu and Cheng [73], and Krattenthaler et al. [48]. The applications of the WD in analyzing somatosensory evoked potentials for early detection of multiple sclerosis, EEG signals, harmonic analysis in power systems, etc., are being investigated [32].

3.3 Analytic Signals

In the previous section, the importance of using analytic signals to compute the WD was discussed. The analytic version of a real signal s(n) is defined as

    f(n) = s(n) + jŝ(n)

where ŝ(n) is the Hilbert transform of s(n). A Hilbert transformer of length 112 points and passband from 100 Hz to 4.9 kHz was designed using the IEEE signal processing program EQFIR [41,64]. The program FASTFILT [41] was modified to read the signal s(n) in blocks of 256 samples and output the complex analytic signal f(n) directly. For a signal of duration N, the output of the FIR filter was of length N + 112. The first 112 points were rejected when forming the complex signal f(n).

3.4 Positivity and Resolution Considerations

From Equation 3.1, it can be seen that the signal f(t) is needed from −∞ to +∞ to compute the WD. In practice, the signal will be windowed in the time domain using a window h(t). The windowed WD, known as the pseudo WD (PWD), was given by Claasen and Mecklenbrauker [13] as follows:

    W̃_f(t, ω) = (1/2π) ∫_{−∞}^{+∞} W_f(t, ξ) W_h(0, ω − ξ) dξ
(3.9)

where W_h(t, ω) is the WD of the window h(t). The above equation is a convolution between W_f(t, ω) and W_h(0, ω) in the frequency variable ω. As a consequence, the window introduces bias or smear parallel to the frequency axis only, and the property of correct marginals with respect to time is not lost. Hence, there is no smear parallel to the time axis, even when successive WD spectra are computed with overlapping windows. In most applications, the presence of cross-terms and negative regions tends to mask the underlying nature of the signal. In practice, it is necessary to apply some sort of smoothing in the time and frequency domains in order to suppress the negative regions and the cross-terms [25,29]. The smoothing process is a 2-D convolution of the WD with a function G(t, ω) given by

    W̃_f(t, ω) = (1/2π) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} W_f(τ, ξ) G(t − τ, ω − ξ) dτ dξ.    (3.10)

de Bruijn showed that using G(t, ω) defined as

    G(t, ω) = (2/(TΩ)) e^{−t²/T² − ω²/Ω²}    (3.11)

with TΩ ≥ 1 yields a positive WD [9]. G(t, ω) acts as a low pass filter in time and frequency, and filters out the oscillating cross-terms. Here, T and Ω define the amount of smear in the time and frequency domains respectively. As the range of smoothing is increased by increasing the values of T and Ω, the amplitude of the cross-terms decreases. For TΩ ≥ 1, there are no cross-terms in the WD and positivity can be assured. The positivity of the WD smoothed with TΩ ≥ 1 was shown by de Bruijn [9] and Janssen [45] using Hermite polynomials. A simpler proof, based on the fact that a smoothed WD computed with TΩ = 1 and a spectrogram computed using a Gaussian window are identical, is now presented.

3.4.1 Positivity of a Smoothed WD

Consider the spectrogram of the signal f(t), given by

    S_f(t, ω) = | ∫_{−∞}^{+∞} f(τ) h(t − τ) e^{−jωτ} dτ |².    (3.12)

Here, h(t) is the time domain window used in computing the short-time Fourier transform.
The spectrogram is related to the WD by a 2-D convolution, as shown in Equation 3.13 [13].

    S_f(t, ω) = (1/2π) ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} W_f(τ, ξ) W_h(t − τ, ω − ξ) dτ dξ    (3.13)

where W_h(t, ω) is the WD of the window h(t), given by

    W_h(t, ω) = ∫_{−∞}^{+∞} h(t + τ/2) h*(t − τ/2) e^{−jωτ} dτ.    (3.14)

For a Gaussian window h(t), defined as

    h(t) = (πT²)^{−1/4} e^{−t²/(2T²)},    (3.15)

Eqn 3.14 can be shown to be

    W_h(t, ω) = 2 e^{−t²/T² − ω²T²}.    (3.16)

If we choose TΩ = 1 in Equation 3.11, the smoothing function in Equation 3.10 and the spectrogram in Equation 3.13 will be identical to each other. Thus, the WD smoothed according to Equation 3.10 with TΩ = 1 is identical to a spectrogram computed using a Gaussian window. Since the spectrogram is a positive function, we conclude that the smoothed WD is also positive for TΩ = 1. Spectrograms computed using other windows (Hamming, Hanning, etc.) are similar to WD plots smoothed with TΩ > 1, since Gaussian windows have the smallest time-bandwidth product.

    Kernel G(t, ω)                      Representation
    2πδ(t)δ(ω)                          Wigner distribution
    δ(t) W_h(0, ω)                      Pseudo Wigner distribution
    W_h(t, ω)                           Spectrogram
    (2/(TΩ)) e^{−t²/T² − ω²/Ω²}         Smoothed Wigner distribution

Table 3.1: Examples of some kernel functions and the resulting representations

3.4.2 Interpretation of the Smoothed WD

An interpretation of the PWD for monocomponent signals was given by Flandrin [27]. Since the WD of monocomponent signals does not have any cross-terms, there is no necessity for smoothing. For multicomponent signals, partial smoothing can be viewed as a compromise between the desired resolution and cross-term suppression in the distribution. Such a compromise sacrifices both properties, correct marginals and positivity, to some extent. An interpretation of the partially smoothed WD is considered here. An insight into the smoothed WD of multicomponent signals like speech can be gained by considering Cohen's class of generalized bilinear time-frequency distributions [15]. All members in Cohen's class can be described in terms of the WD and a kernel function, as shown in Equation 3.10.
By choosing different functions for G(t, ω), different time-frequency representations of Cohen's class can be obtained. A few of the choices for G(t, ω), relevant to this discussion, are shown in Table 3.1. The function G(t, ω) is plotted for a pseudo-WD, for a narrowband spectrogram, a wideband spectrogram, and a smoothed WD in Figures 3.5 - 3.8. For the pseudo-WD, it can be seen from Figure 3.5 that the kernel G(t, ω) does not introduce any temporal smear. For a narrowband spectrogram, on the other hand, the kernel produces substantial smear in the time domain, as shown in Figure 3.6. For a spectrogram, the choice of T (or Ω) implies the choice of Ω (or T), based on Heisenberg's inequality. Thus, reducing T for better time resolution in turn increases the frequency domain smear, as shown in Figure 3.7. The kernel for a smoothed WD with TΩ = 0.25 is shown in Figure 3.8.

[Figure 3.5: G(t, ω) for a pseudo-WD. Figure 3.6: G(t, ω) for a narrowband spectrogram. Figure 3.7: G(t, ω) for a wideband spectrogram. Figure 3.8: G(t, ω) for a smoothed WD with TΩ = 0.25.]

The option of specifying both T and Ω explicitly and independently of each other makes the partially smoothed WD an attractive tool. Since the WD has correct marginals, it may be viewed as the perfect time-frequency description of the signal. The kernel function may be considered as an aperture through which the signal is being observed. As the kernel deviates from δ(t)δ(ω), the WD is smeared in time and frequency. While the area of this aperture is constant and defined by the uncertainty relations of Fourier analysis, it can be explicitly defined in the partially smoothed WD. Besides introducing a smear in the time and frequency domains, smoothing causes a loss of the phase information. This can be seen from Example I in Section 3.6, where the phase terms φ_i and φ_k are present only in the cross-term in Equation 3.25.
Thus, smoothing for positivity eliminates all the phase information in the WD; however, the partially smoothed WD has proved to be a very useful tool. A priori information about the nature of the signal is often helpful in choosing the smoothing parameters.

3.5 Computing the Smoothed WD

In practice, large computational savings can be achieved by computing the smoothed WD directly, rather than by computing the WD and then performing the 2-D convolution for smoothing. The formulation for computing the partially smoothed WD is given below. By choosing a separable smoothing function such as in Equation 3.11, Equation 3.10 can be rewritten as

    W̃_f(t, ω) = [W_f(t, ω) * G(ω)] * g(t)    (3.17)

where G(ω) = (1/(Ω√π)) e^{−ω²/Ω²} and g(t) = (1/(T√π)) e^{−t²/T²}. The bracketed term is the PWD in Equation 3.9, computed using a time window h(t) such that

    W_h(0, ω) = G(ω).    (3.18)

Substituting Equation 3.14 into Equation 3.18 and taking the inverse Fourier transform on both sides, we have

    h²(τ/2) = (1/2π) e^{−(Ωτ/2)²}.    (3.19)

We can now rewrite Equation 3.17 as

    W̃_f(t, ω) = Ŵ_f(t, ω) * g(t)    (3.20)

where

    Ŵ_f(t, ω) = ∫_{−∞}^{+∞} f(t + τ/2) f*(t − τ/2) h²(τ/2) e^{−jωτ} dτ.    (3.21)

Substituting for g(t) in Equation 3.20 and interchanging the order of integration, we have

    W̃_f(t, ω) = ∫_{−∞}^{+∞} [ ∫_{−∞}^{+∞} f(t − u + τ/2) f*(t − u − τ/2) g(u) du ] h²(τ/2) e^{−jωτ} dτ.    (3.22)

The bracketed term is the convolution of f(t + τ/2) f*(t − τ/2) with g(t). This term is windowed using h²(τ/2), and its Fourier transform is computed to yield the smoothed WD at time t. In discrete form, Equation 3.22 is given by:

    W̃_f(n, m) = 2 Σ_{k=−K}^{K−1} [ Σ_{i=−(I−1)/2}^{(I−1)/2} f(n + k − i) f*(n − k − i) g(i) ] h²(k) e^{−j2πkm/M}    (3.23)

where M = 2K. This equation is the same as the one given by Flandrin [25,26], except that the smoothing windows have been chosen as Gaussian and defined in terms of T and Ω. This enables one to specify the amount of smoothing in a smoothed WD, relative to that of a spectrogram. Figure 3.9 shows an implementation of Equation 3.23 for analytic signals, as a FORTRAN callable subroutine WIG. The input arrays F, GT and GF are the input signal, the time window and the frequency window, respectively.
Array F should be dimensioned at least IT + IF in the calling routine, where IT and IF are the lengths of the arrays GT and GF respectively. A cyclic shift of (IF − 1)/2 points is performed on GF before passing it to WIG. Each time WIG is called, a WD of length IF points corresponding to the time point (IF + IT − 1)/2 is returned in RL. WIG calls the subroutine FFT described in [64]. This may be replaced by a call to a local FFT program for faster execution.

C F(IF+IT) : Input/Output data array.
C GT(IT)   : Time smoothing window. IT is odd.
C GF(IF)   : Freq. smoothing window. IF is a power of 2.
C
C DO 10 loop performs time smoothing. DO 20 and DO 30
C loops form the product for FFT computation. IPOWER(IF)
C returns IZ such that IF=2**IZ. IZ is used in CALL FFT.
C
      SUBROUTINE WIG(F, GT, GF, IF, IT, RL)
      COMPLEX F(1), S(256)
      REAL GT(1), GF(1), RL(1)
C
      IFBY2 = IF / 2
      IF2M1 = IFBY2 - 1
      IF2P1 = IFBY2 + 1
      IZ = IPOWER(IF)
C
      DO 20 K = 1, IF2P1
        KPLUS  = IF2P1 + K - 1
        KMINUS = IF2P1 - K + 1
        S(K) = (0.,0.)
        DO 10 I = 1, IT
   10   S(K) = S(K) + F(KPLUS + I - 1)
     +        * CONJG(F(KMINUS + I - 1)) * GT(I)
   20 S(K) = 2. * S(K) * GF(K)
      DO 30 I = 1, IF2M1
   30 S(IF2P1 + I) = CONJG(S(IFBY2 - I + 1))
C
      CALL FFT(S, IZ, IF)
      DO 40 I = 1, IF
   40 RL(I) = REAL(S(I))
C
      RETURN
      END

Figure 3.9: Subroutine for computing the WD

3.6 Nature of Cross-terms

An understanding of the nature of cross-terms is essential if they are to be handled successfully and if useful information is to be extracted from the WD. Generally speaking, cross-terms oscillate in the time-frequency plane and occur midway between the self-terms. Unfortunately, analytic evaluation of these terms is lengthy and cumbersome [19]. An insight into the nature of the cross-terms, however, can be gained by considering simple cases. Three tutorial examples are considered here to provide such insight.
The approach adopted is to express the signal in terms of simple signals for which the cross-terms can be readily written.

Example I: f(t) in terms of harmonics

Consider a quasi-stationary analytic signal f(t), with the duration of stationarity at least T′ seconds. f(t) can be expanded in terms of its Fourier coefficients as

    f(t) = Σ_{i=0}^{n} A_i e^{j(ω_i t + φ_i)},   t₁ ≤ t ≤ t₁ + T′    (3.24)

where n can be made sufficiently large to make the right hand side arbitrarily close to f(t). The WD of this signal can be shown to be

    W_f(t, ω) = 2π Σ_{i=0}^{n} A_i² δ(ω − ω_i)
              + 4π Σ_{i<k} A_i A_k cos((ω_i − ω_k)t + (φ_i − φ_k)) δ(ω − (ω_i + ω_k)/2).    (3.25)

The first sum in the above equation corresponds to the self-terms, the direct contributions, which are always positive. The second sum corresponds to the cross-terms due to the components at frequencies ω_i and ω_k. These cross-terms occur midway between ω_i and ω_k, and are modulated parallel to the time direction at a frequency ω_i − ω_k. These terms will be called "frequency domain cross-terms." The frequency domain cross-terms are suppressed by smoothing parallel to the time direction. The factor e^{−t²/T²} in Equation 3.11 performs this smoothing. As the value of T is increased, the amount of suppression increases.

Example II: f(t) in terms of impulses

Consider a time limited signal

    f(t) = Σ_{i=0}^{n} A_i δ(t − t_i).

Self-terms in the WD of f(t) are of the form 2π|A_i|² δ(t − t_i), for any ω. The cross WD due to the terms at t_i and t_j can be written as

    W_{i,j}(t, ω) = A_i A_j* ∫_{−∞}^{+∞} [δ(t + τ/2 − t_i) δ(t − τ/2 − t_j)] e^{−jωτ} dτ.    (3.26)

The bracketed term is a function of both t and τ. Solving for t and τ at t + τ/2 − t_i = 0 and t − τ/2 − t_j = 0, the above can be rewritten as

    W_{i,j}(t, ω) = A_i A_j* ∫_{−∞}^{+∞} [δ(t − (t_i + t_j)/2) δ(τ − (t_i − t_j))] e^{−jωτ} dτ    (3.27)

                  = A_i A_j* δ(t − (t_i + t_j)/2) e^{−jω(t_i − t_j)}    (3.28)

    2ℜ[W_{i,j}(t, ω)] = 2|A_i A_j*| δ(t − (t_i + t_j)/2) cos(ω(t_i − t_j)).    (3.29)

Hence, the cross-term due to components at t_i and t_j occurs at (t_i + t_j)/2 and is modulated parallel to the frequency direction. The frequency of modulation is a function of t_i − t_j.
Cross-terms of this nature will be called "time domain cross-terms." This example is useful for understanding the nature of non-zero terms in the WD when the signal is zero. It will be used to counter the argument given by Cohen [18] that these cross-terms mask useful information in the signal. Equation 3.29 shows the structure of these cross-terms. The time domain cross-terms are suppressed by the factor e^{−ω²/Ω²} in Equation 3.11. In Section 3.4.1, it was seen that this smoothing is inherently present in computing the WD. The frequency response of the lowpass filter in this smoothing is G(t, ω) at t = 0 in Figures 3.5 and 3.8.

Example III: General signals

The classification of cross-terms as time domain and frequency domain cross-terms has been adopted here simply as a convenience. For a given signal, the cross-terms can be modulated parallel to both time and frequency directions simultaneously. This can be seen from the example considered by Boudreaux-Bartels [7] and Hlawatsch [35]. Let h(t) be a signal concentrated at (0,0). The cross WD of the signals f₁(t) and f₂(t), defined as

    f₁(t) = A₁ h(t − t₁) e^{jω₁t}
    f₂(t) = A₂ h(t − t₂) e^{jω₂t},

is given by

    W_{f₁,f₂}(t, ω) = A₁A₂* e^{j[ω_d t − (ω − ω_c)t_d]} W_h(t − t_c, ω − ω_c)    (3.30)

where

    ω_c = (ω₁ + ω₂)/2;   ω_d = ω₁ − ω₂
    t_c = (t₁ + t₂)/2;   t_d = t₁ − t₂.

Thus, the cross-terms due to f₁ and f₂ occur at their geometric center; the frequency of modulation parallel to the time direction is a function of ω_d, and that parallel to the frequency direction is a function of t_d.

3.7 Choice of the Smoothing Parameters

The smoothing parameters T and Ω determine the amount of cross-term suppression parallel to the time and frequency directions, respectively. In a digital implementation, it is necessary to truncate the Gaussian windows in Equation 3.11. Since the Fourier transform of a time-limited signal cannot be band-limited, the spectra of the windows exist for all frequencies.
The smoothing operation shown in Equation 3.10 is essentially a spatial lowpass filtering operation, in both the time and frequency directions. Hence, it follows that it is not possible to generate a positive distribution using truncated windows, even with T\Omega > 1. A WD smoothed for positivity is not of much interest, since the resolution is lost. By applying partial smoothing over small regions, it is possible to limit the spread of components in time and frequency to regions defined by T and \Omega. The factors considered in choosing the values of \Omega and T for computing the smoothed WD of speech signals are discussed below. They should also be applicable to other multicomponent signals.

3.7.1 Choice of \Omega

The length l_F of the window G_F in Figure 3.9 is normally chosen as a power of 2 for efficient computation. For 60 dB sidelobe suppression, l_F should span at least 6 times the standard deviation of the window [34]. This is achieved by choosing \Omega' = l_F/3, where \Omega' is \Omega in number of samples. For this choice of \Omega, the effective smear is given by \Omega = 3 f_s / (2 l_F), where f_s is the sampling frequency.

Spectral Wrap

The periodicity of the WD in the frequency domain is \pi rather than 2\pi. Because of this, a peculiar phenomenon called "spectral wrap" occurs in the WD [30]. The effect of spectral wrap is to make very high frequencies appear at very low frequencies, and vice versa. The way spectral wrap occurs in the WD is illustrated in Figure 3.10. The range of smear in the frequency domain is shown schematically by the window (perpendicular to the paper). Because of the finite width of the main lobe of this window, components that are close to 0 or \pi spill over to the other end of the spectrum. Spectral wrap can be minimized by making the width of the main lobe as small as possible. This is achieved by increasing the value of l_F, at the cost of an increase in computation. For a given value of l_F, the width of the main lobe can be further reduced by making \Omega' > l_F/3.
This, of course, causes an increase in the relative amplitude of the side lobes. Figure 3.11 illustrates the spectral wrap in a chirp signal.

In this study, speech signals were sampled at 10 kHz. For this sampling frequency, l_F = 128 and \Omega' = l_F/3 were found to keep spectral wrap to a satisfactory level. The effective smear \Omega in the frequency domain was 117 Hz. In the next section, it is demonstrated that for this choice of \Omega, the time domain cross-terms did not mask the presence of silence durations in the signal.

Figure 3.10: Cause of spectral wrap in a pseudo-WD.

Figure 3.11: Spectral wrap in a chirp signal.

3.7.2 Choice of T

From Equation 3.25, it can be seen that the cross-term of two components that are (\omega_i - \omega_k) apart is modulated parallel to time with a frequency of (\omega_i - \omega_k). To eliminate this cross-term, the time domain smoothing T should extend over more than one period of (\omega_i - \omega_k). As the frequency (\omega_i - \omega_k) decreases, the value of T should be increased for reasonable cross-term suppression. In the limiting case when (\omega_i - \omega_k) = \Omega, the value of T is 1/\Omega, such that T\Omega = 1. In this situation, \Omega can be further reduced by choosing a longer window for G_F, so that satisfactory cross-term suppression is ensured with T\Omega < 1. Figure 3.12 demonstrates frequency domain cross-term suppression in two chirps of opposite slopes. The smoothed WD was computed with T\Omega = 0.06. As the spectral separation (\omega_i - \omega_k) between self-terms decreases, the amplitude of the cross-terms increases, indicating the need for larger values of T for satisfactory cross-term suppression.

In the present work, the value of T was based on the choice of \Omega and the desired value of T\Omega. After experimenting briefly with various values, T\Omega = 0.25 was chosen for the entire corpus. For the choice of \Omega as selected in the previous section, this corresponded to T = 2.1 msec, or 21 sample points.
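The numbers quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes the smear relation \Omega = 3 f_s / (2 l_F), which is implied by (but not stated explicitly alongside) the values given in the text; the variable names are invented:

```python
fs = 10_000                    # sampling frequency, Hz
lF = 128                       # length of the frequency-domain window G_F, samples
omega = 3 * fs / (2 * lF)      # effective frequency smear Omega, Hz (about 117)
t_omega = 0.25                 # chosen smoothing product T * Omega
T = t_omega / omega            # time-domain smoothing extent T, seconds
T_samples = T * fs             # ... and in sample points (about 21)
```

With l_F = 128 and f_s = 10 kHz this yields \Omega ≈ 117 Hz and T ≈ 2.1 msec (21 samples), matching the values used for the corpus.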
This choice was judged satisfactory since the duration of the burst in stop consonants generally varies from 5 to 40 msec. Ideally, a value of T that is smaller than T' in Equation 3.24 is desired. The length of the Gaussian window G_T in Figure 3.9 was chosen as l_T = 2T rather than 3T, in order to save computational costs. For the WD plots with T\Omega = 0.25, the time domain resolution is roughly equal to that of a wideband spectrogram and the frequency domain resolution is roughly equivalent to that of a narrowband spectrogram.

3.7.3 Cross-term Suppression in Smoothed WD

It is frequently argued in the literature that the presence of negative regions and cross-terms limits the usefulness of the WD for multicomponent signals. In the course of this work, the WD was applied extensively to synthetic and real multicomponent signals. As a result, it was concluded that the problem is not as grave as it is usually portrayed to be (see [18]). A few examples are presented here to illustrate the effectiveness of partial smoothing.

Figure 3.12: Effective cross-term suppression as the spectral distance between self-terms decreases.

In many situations, surprisingly little smoothing is necessary to bring out the structure of a signal. The smoothing in Figure 3.12 occurs over 5 sample points, or 0.5 msec. This smoothing can be seen to be adequate when the self-terms are more than 2 kHz apart. For most multicomponent signals, the frequency domain cross-terms can completely mask the signal structure. This is illustrated in Figure 3.13, depicting the pseudo-WD of the syllable [de] in English. Two versions of the WD of this syllable, smoothed with T\Omega = 0.25 and T\Omega = 1, are shown in Figures 3.14 and 3.15. Figure 3.15 is equivalent to a spectrogram. The formant structure can be seen clearly in both 3.14 and 3.15.
Figure 3.14, however, provides much better temporal resolution than its spectrographic equivalent.

The time domain cross-terms are generally easier to handle than the frequency domain cross-terms, because of the inherent smoothing present in the computation. In a recent paper, Cohen [18] pointed out that the WD has non-zero contributions when the signal is zero. He implied that these contributions might mask important information about the signal, for example, silences in speech. In light of this, a closer examination of the time domain cross-terms, vis-à-vis the seriousness of these objections, is in order. The time domain cross-terms are suppressed depending on the choice of \Omega (or l_F in Figure 3.9). From Examples II and III, it can be seen that the frequency of the time domain cross-terms increases as the (temporal) distance between the self-terms increases. As the distance between the self-terms increases, the cross-term due to them is attenuated more. The effect of these terms is most pronounced when the self-terms are separated by a very short silence. In these situations, the WD is a better choice, since the temporal detail is rather poor in a spectrogram. This is illustrated in Figures 3.16 and 3.17, which depict the pseudo-WD and the spectrogram of a synthetic signal with a small silence duration. It is clear that the presence of a silence can be easily identified in the WD. It should also be noted that the time domain cross-terms appear only when the silence duration is smaller than the window length l_F. This has some important implications in the WD analysis of speech signals, since silences occur in speech for

Figure 3.13: WD of English phone [de] with no smoothing. T\Omega = 0.01.

Figure 3.14: WD of English phone [de] with partial smoothing. T\Omega = 0.25.

Figure 3.15: WD of English phone [de] with complete smoothing.
T\Omega = 1.0.

Figure 3.16: Pseudo-WD of two chirps showing negligible cross-terms during silence.

Figure 3.17: Smoothed WD of two chirps. T\Omega = 1.0.

many different reasons, for example:

• Silences may be due to breath pauses, which are used to mark grammatical boundaries in speech; these are usually greater than 50 msec in duration.

• Silences may also occur because of particular phonetic sequences in a given speech segment. An example of this is the closure of voiceless stop consonants, such as the closure corresponding to [p] in words like "speech" and "April." These silences are usually of the order of 100 ms or more. They do not produce any contributions in the WD, since the windows used in speech analysis are generally much shorter.

• Silences that are shorter than the window length occur in speech between the burst and the onset of the vowel in stop consonant-vowel sequences. They vary from 5 to 50 ms. A visual feature related to this duration, called "formant onset duration" (FOD), is defined on page 58. Figures 3.18 and 3.19 show the FOD in an alveolar and a velar stop consonant. It can be seen that the time domain cross-terms do not mask the presence of silence during the FOD.

Results presented in Chapter 5 establish that the FOD carries useful information regarding the place of articulation in stop consonants, making it clear that the time domain cross-terms do not inhibit the extraction of useful information from the WD of speech.

3.7.4 Other Considerations Specific to Speech

As in spectrographic analysis, it was necessary to pre-emphasize the speech signal prior to computing the WD. In order to observe the higher formants in the WD, some form of compression (e.g., logarithmic) of the data had to be performed.
The compression equation used was:

    W(t,\omega) = 10 \log_{10}\big[ W'(t,\omega) \cdot A + 1 \big]    (3.31)

where W'(t,\omega) is the half-rectified WD, normalized with respect to the maximum value of the WD. The term A was the scaling factor. In order to keep the logarithmic datum at zero, unity was chosen as the offset.

Figure 3.18: Smoothed WD of an alveolar depicting negligible cross-terms during FOD.

3.8 Summary

In this chapter, the concept of the WD was introduced and a brief review of the literature dealing with the properties and applications of the WD was presented. The cross-terms in the WD mask the underlying nature of speech signals. It was shown that smoothing the cross-terms completely makes the WD and the spectrogram equivalent. The partially smoothed WD was found to be a useful alternative in many applications. The extent of partial smoothing was parameterized by T\Omega, using truncated Gaussian windows. An interpretation of the partially smoothed WD, and its computational aspects, was presented. The nature of the cross-terms was discussed in detail, as were the factors involved in choosing the smoothing parameters. It was shown that it is possible to obtain better time-frequency resolution (or localization) by using the partially smoothed WD than by using short-time spectral analysis methods.

Chapter 4

Study of Acoustic Invariance: A Cross-Language Study

In Chapter 2, it was suggested that, in earlier investigations, the analysis methods were partly responsible for the context dependency in specifying invariant cues for place of articulation. In Chapter 3, it was demonstrated that the partially smoothed WD overcomes the drawbacks of short-term spectral analysis techniques. The next task was to determine whether the WD would be more successful in specifying invariant acoustic cues for place of articulation in stop consonants, with less context dependency. A cross-language study was conducted to establish this.
The experimental procedure is presented in this chapter.

4.1 Overview

Section 4.2 introduces some of the linguistic terms used throughout this dissertation.

Section 4.3 presents the invariant patterns observed in the WD plots of the pilot study. These patterns were compared with the invariant acoustic cues proposed in the literature [2,49,51].

Section 4.4 concerns the program FEATURE, which interactively obtains data regarding the invariant patterns of a WD plot and assigns a place of articulation to it.

Section 4.5 deals with the final experiment, in which three judges visually identified the invariant patterns in the WD plots. The program FEATURE assigned a place of articulation to each WD plot based on the responses of the judges.

4.2 Preliminaries

4.2.1 Stop Consonants

Stop consonants can be classified articulatorily in terms of (i) the vocal tract parameter, place of articulation, and (ii) the source parameters, voicing and aspiration. Formal definitions of place of articulation, voicing and aspiration can be found in Peterson and Shoup [63]. These terms are described below:

Place of Articulation is the place along the vocal tract where the airflow is stopped completely for a brief period and suddenly released in articulating the stop consonant. The stop consonants used in this study have four different places of articulation:

• Bilabial — airflow is stopped at the lips, as in the phones [p, b].

• Dental — airflow is stopped behind the teeth, as in the phones [t̪, d̪].¹ Dental fricatives occur in English, as in the words "thin" and "the." The dental stop consonants analyzed in this study were spoken by native speakers of Telugu and French.

• Alveolar — airflow is stopped at the alveolar ridge, as in the phones [t, d].

• Velar — airflow is stopped at the velum, as in the phones [k, g].

Voicing occurs when the vocal folds are set into a quasi-periodic motion during articulation, giving rise to a low frequency resonance in the vocal tract.
This low frequency resonance in a time-varying spectral representation of the speech signal is sometimes called the "voice bar." For stop consonants, the duration between the instant of release and the onset of voicing is called "Voice Onset Time" (VOT) [54]. If the onset of voicing occurs prior to or at the release, VOT has a negative or zero value, and the stop consonant is said to be voiced. If the onset of voicing occurs at least 15 ms after the release, VOT has a positive value and the stop consonant is said to be voiceless.

¹The diacritic will be used to represent dental stops throughout this thesis.

Aspiration occurs when a turbulence noise source at the glottis excites the vocal tract following the release. This excitation usually lasts for at least 30 ms after the release [63, p. 59].

"Invariant acoustic cues for place of articulation in stop consonants" can be defined as those unique properties of a speech segment, corresponding to a given place of articulation, which can be extracted from the signal using signal processing methods. Given these properties, it should be possible to determine the place of articulation in a speech segment using signal processing methods. It is a common belief that some form of these properties is used by the auditory system in transducing speech signals.

4.2.2 Organization of Stop Consonants in English, French and Telugu

Stop consonants used in this study were spoken by native speakers of three different languages — English, French and Telugu. (Telugu is a Dravidian language, spoken by about 50 million people in the state of Andhra Pradesh, India.) These languages were chosen because of their different linguistic origins.
A brief description of the organization of stop consonants in these languages follows. For a given place of articulation, there are four different combinations of the source parameters, voicing and aspiration, that can excite the vocal tract. Not all combinations of these source parameters are utilised by the native speakers of a given language. For example, voiced aspirated stop consonants occur in Telugu, but not in English and French. In addition, native speakers of different languages group the consonants with similar articulatory features into different phonemic categories. The stop consonants of English, French and Telugu are described below in terms of articulatory parameters.

English stop consonants can be grouped into two different phonemic categories — /p, t, k/ and /b, d, g/. These groups are contrasted primarily by aspiration and secondarily by voicing: /p, t, k/ are aspirated and voiceless, while /b, d, g/ are unaspirated. Usually, /b, d, g/ have no pre-voicing, though some English speakers tend to pre-voice these stops.² Phonemically, the above two groups in English have been termed voiceless and voiced stops. There are three places of articulation in English: bilabial /p, b/, alveolar /t, d/ and velar /k, g/.

French stop consonants can also be grouped into two different phonemic categories — /p, t, k/ and /b, d, g/. But unlike in English, the primary contrast is voicing and the secondary contrast is aspiration: /p, t, k/ have a positive VOT, while /b, d, g/ have a negative VOT, i.e., they are prevoiced. Aspiration in French depends on the syllable context: /p, t, k/ are unaspirated in the word-initial position and aspirated in the word-final position, while /b, d, g/ are never aspirated. Phonemically, these two groups in French are also termed voiceless and voiced stops. There are three places of articulation in French: bilabial /p, b/, dental /t̪, d̪/ and velar /k, g/.

Telugu uses all four modes of source excitation [50].
Thus, the stops of Telugu are grouped in four different phonemic categories — voiceless unaspirated /p, t̪, ṭ, k/, voiceless aspirated /pʰ, t̪ʰ, ṭʰ, kʰ/, voiced unaspirated /b, d̪, ḍ, g/, and voiced aspirated /bʰ, d̪ʰ, ḍʰ, gʰ/. Telugu has four places of articulation: bilabial /p, pʰ, b, bʰ/, dental /t̪, t̪ʰ, d̪, d̪ʰ/, post-alveolar retroflex /ṭ, ṭʰ, ḍ, ḍʰ/ and velar /k, kʰ, g, gʰ/. Retroflexed alveolar stops are produced with the tip of the tongue curled backwards while making the constriction at the alveolar ridge. For brevity, these sounds will be referred to as Telugu alveolars throughout this dissertation. Dentals, alveolars, retroflexed post-alveolars, etc. are referred to as coronals [12].

Table 4.1 summarizes the organization of word-initial stop consonants in these three languages. The signs '+' and '−' before voicing and aspiration denote their presence or absence in the realization of that phoneme.

             +Asp, −Voice   −Asp, −Voice                          −Asp, +Voice                          +Asp, +Voice
  Bilabial   Telugu /pʰ/    English /p/, Telugu /p/, French /p/   English /b/, Telugu /b/, French /b/   Telugu /bʰ/
  Dental     Telugu /t̪ʰ/    Telugu /t̪/, French /t̪/                Telugu /d̪/, French /d̪/                Telugu /d̪ʰ/
  Alveolar   Telugu /ṭʰ/    English /t/, Telugu /ṭ/               English /d/, Telugu /ḍ/               Telugu /ḍʰ/
  Velar      Telugu /kʰ/    English /k/, Telugu /k/, French /k/   English /g/, Telugu /g/, French /g/   Telugu /gʰ/

Table 4.1: Organization of the stops in the languages studied.

It should be noted that the voiced stops of English are articulatorily similar to the voiceless stops of French and the voiceless unaspirated stops of Telugu.

²The English informants who participated in this study had no such pre-voicing.

4.3 Invariant Patterns in the Wigner Plots

4.3.1 Pilot Study

In the pilot study, three native speakers of English, French and Telugu (JG, AP and HG, respectively) served as informants.
The corpus for this study contained each of the consonants shown in Table 4.1, except the Telugu dentals, followed by the vowels [i a u]. This gave 18 syllables each from English and French, and 36 syllables from Telugu. The informants were provided with the list of syllables on a typewritten page. They were requested to pause before each CV syllable, in order to capture each syllable in isolation. Two sets of the CV syllables were recorded, and the second list was chosen for digitization.

The analog recordings were made in a sound-proof booth on a Revox tape recorder. An AKG microphone was used for recording the pilot data. A monotone signal at 120 Hz was played through a pair of headphones, and the subjects were requested to try to maintain a constant pitch as given by the signal. The data was digitized using a 12-bit A/D converter. The WD of the discrete data was computed and plotted. Further hardware and software details of this setup are presented in Appendix A.

The WD plots of the pilot data were examined for unique patterns for each place of articulation. The invariant patterns observed were similar to those reported in the literature [2,49,51] in many respects, with some essential differences. In this section, these invariant patterns are described and compared with those reported in the literature.

4.3.2 Feature Definitions

In order to see how well the invariant patterns of the WD plots characterized a given place of articulation, a set of visual features of the WD plots was defined; these features are described below.

Vowel Pattern

The vowel pattern was characterized by 2 to 5 peaks distributed in frequency and extending in time. These peaks correspond to the formants of the steady state vowel. The formant structure was connected by a wide spectral line at the beginning of every pitch period. In Figure 3.14, the WD of a steady state vowel can be seen from time 24 ms onwards.
Spectral Shape

The spectral shape of the burst region was broadly classified into three classes.

a) Compact Pattern

The compact pattern was characterized by the presence of a single prominent spectral peak above 700 Hz. The compact pattern was sometimes characterized by several successive peaks in the time domain, as shown in Figure 4.1, which depicts the WD of a velar from Telugu in the context of [a]. This was named a "compact spectrum with multiple peaks." The compact spectrum was usually centered around the frequency of the F2 of the following vowel. When there was more than one peak in frequency at a given time, and the number of contours in the tallest peak was more than twice the number of contours in the other peaks, the spectral shape was also classified as compact. This is illustrated in Figure 4.2, which shows a velar from English in the context of [a].

b) Diffuse Non-Falling Pattern

A diffuse non-falling pattern was characterized by the presence of two or more significant peaks during the burst. These peaks usually covered a range of more than 1.5 kHz, and their amplitude decreased prior to the start of the vowel. A diffuse non-falling spectral shape is illustrated in Figure 4.3, which shows the WD of a Telugu alveolar in the context of [a]. The diffuse non-falling spectral shape was a typical characteristic of the bursts of most alveolars in this study. The bursts of velars in the context of high front vowels also displayed a diffuse non-falling spectral shape. There were, however, many differences between the diffuse non-falling spectral shapes of alveolar and velar bursts that were not reflected in the gross spectral shape.

c) Diffuse Falling Pattern

The diffuse falling pattern was characterized by an energy concentration in the low frequency region, which rapidly increased over 5 to 10 ms to its maximum value. This was followed by the vowel pattern. The diffuse falling pattern was typical of bilabials in the pilot data.
Sometimes, the burst was completely absent in bilabials, and the WD plot started with a typical vowel pattern. This spectral shape was named "diffuse falling" rather than "falling," partly to be consistent with the terminology used in the literature and partly to include the diffuse vowel pattern when the burst was not noticeable. When the burst was present, it did not excite any higher formants. Thus, the diffuseness of the bilabial burst previously reported in the literature was essentially due to the diffuseness of the following vowel. A typical diffuse falling pattern is illustrated in Figure 4.4, depicting a French bilabial in the context of [a].

Figure 4.1: WD of the beginning of the syllable [ka] (Telugu speaker). (Annotations: start of the burst at 7 ms; burst duration 8 ms; start of the vowel at 65 ms; formant onset duration 58 ms; second formant = 1.3 kHz; spectral shape = compact with multiple peaks.)

Figure 4.2: WD of the beginning of the syllable [ga] (English speaker).

Figure 4.3: WD of the beginning of the syllable [ta] (Telugu speaker).

Figure 4.4: WD of the beginning of the syllable [pa] (French speaker).

The voice bar was usually continuous with the vowel in the WD plots, and there was no decrease in its energy prior to the start of the vowel. The energy in the burst, however, usually decreased prior to the start of the vowel. The energy due to voicing, in the range 0 to 700 Hz, was ignored when determining the diffuse non-falling and compact spectral shapes. Some examples of typical spectral shapes observed in the WD plots are shown as schematic energy density diagrams in Figure 4.5. The vowel region is represented by a mesh.
The horizontal lines represent the formant structure and the vertical lines represent the speech spectra at each pitch pulse. The burst is represented by dark areas.

Formant Onset Duration

Formant Onset Duration (FOD) was defined as "the duration from the start of the burst to the start of the vowel." The start of the vowel was taken as the time instant where a formant structure could be observed in the WD plots. FOD may be related to the time needed for resonances to become established in the vocal tract after the consonantal release. It is always positive. FOD is different from VOT (see page 48) as defined by Lisker and Abramson [54], which reflects the duration between the start of the burst and the start of voicing. In the pilot data, FOD was observed to become progressively longer as the place of articulation moved from the lips to the velum. Typical values for the FOD are shown in Figures 4.1 and 4.2.

Burst Duration

Burst duration (BD) was defined as "the duration from the start of the burst to the end of the burst." For bursts with diffuse non-falling and compact spectral shapes, there was a decrease in the amplitude of the burst prior to the start of the vowel. For these spectral shapes, the end of the burst was taken as that point in time at which the number of contours was half the maximum. For the pseudo-log compression described in Section 3.7.4, each contour corresponded to approximately a 3 dB difference. This strategy of measuring BD was adopted because it was a convenient visual pattern to measure. The BD in compact and diffuse non-falling bursts is shown in Figures 4.3 and 4.2. For bursts with diffuse falling spectral
b) Examples of diffuse non-falling spectral shape during the burst. (No multiple peaks (No multiple peaks (Compact with in time. No in time. Pre-voicing multiple peaks.) pre-voicing.) is present.) c) Examples of compact spectral shape during the burst. (Compact with multiple peaks.) "VOWEL BURST Figure 4.5: Examples of typical spectral shapes observed in the WD plots. shape, the burst was usually connected with the vowel pattern. There was no noticeable decrease in the burst amplitude. The end of the burst was then taken to coincide with the start of the vowel. See Figure 4.4. This definition of BD ignores subsequent peaks in a compact spectrum with multiple peaks in time as shown in Figure 4.1. It is suggested that these peaks may also have a role in the perception of velars. Blumstein and Stevens [2] have mentioned that for a stimulus with an abrupt onset to give rise to an auditory representation containing a narrow spectral peak, the spectral peak must persist for a few tens of milliseconds following the onset. In a task which requires identification of velars from bilabials and coronals, however, a spectral shape with multiple peaks was felt to be a sufficient visual feature, since, it was observed that this pattern occurred only in velars in the pilot study. The results presented in Chapter 5 also indicate that a compact spectral shape with multiple peaks was typical to only velars in English, Telugu and French. Second Formant Reliable second formant (F2) measurements are not possible in the vicinity of the burst, because of formant transitions. Originally, it was intended to study only the burst region of the stop consonants for invariant acoustic cues, and the use of F 2 was not anticipated at the time of computing the WD plots. Consequently, the WD plots included only 2 to 3 pitch periods from the start of the vowel. An approximate value of F 2 was nevertheless obtained for each token, as explained on page 118. 
4.3.3 Place of Articulation

In terms of the visual patterns described in Section 4.3.2, the invariant visual patterns observed in the WD plots of the pilot data are described below.

Bilabials

Bilabials were characterized by diffuse falling spectra during the burst. The BD was typically less than 10 ms, and the burst was usually connected with the vowel pattern. The FOD was also typically less than 10 ms. Figure 4.4 shows an example of a typical bilabial in which the low frequency burst was noticeable. When the steady state vowel started within 2 to 3 ms of the release, the burst pattern of bilabials was not noticeable and the WD plot appeared to start directly with the vowel pattern.

Alveolars

Alveolars were mainly characterized by a diffuse non-falling spectral shape. The BD was typically 5 to 15 ms, after which there was a noticeable attenuation in the energy of the burst. The FOD was typically between 5 and 20 ms. Figure 4.3 shows an alveolar, spoken by a Telugu speaker, in the context of [a]. Sometimes, alveolars in the context of [o] or [u] displayed a compact spectral shape.

Velars

Velars were characterized by a diffuse non-falling spectral shape when followed by front vowels, and a compact spectral shape when followed by back vowels. The BD for velars was around 5 to 15 ms, and the FOD was around 30 to 50 ms. Figures 4.1 and 4.2 show, respectively, the WD of the syllable [ka] in Telugu and the syllable [ga] in English. Figure 4.2 depicts a compact spectrum, and Figure 4.1 depicts a compact spectrum with multiple peaks. Compact spectra with multiple peaks usually occurred in the context of [o] and [u] in French and Telugu velars. They were less frequent in the English velars.

It appears that the bursts of velars are most affected by the following vowel. This can be appreciated by considering that the shape of the front cavity has a relatively larger degree of freedom than the shape of the back cavity. Since the front cavity is longest for a velar place of articulation, the coarticulation effects are correspondingly greatest. In the WD plots, the spectral shape of velar bursts
Since, for a velar place of articulation the length of the front cavity is the maximum, so then are the coarticulation effects. In the WD plots, the spectral shape of velar bursts 61 Method Window length Window shift Blumstein and Stevens 2] LP spectra 26 ms — Kewley-Port 49] LP spectra 20 ms, 25 ms 5 ms Lahiri et al. 51] LP spectra 10 ms 5 ms Garudadri et al. [31] Smoothed WD 4.2 ms 1 ms Table 4.2: Comparison of the analysis methods. was seen to vary progressively from diffuse non-falling to compact and compact spectrum with multiple peaks (in the time domain), depending on the vowel context and the language. 4.3 .4 C o m p a r i s o n o f I n v a r i a n t P a t t e r n s f r o m W D P l o t s a n d S h o r t - T e r m S p e c t r a In this section, the invariant patterns obtained using the WD are compared with those obtained using short-time spectral estimation techniques (summarised in Sec-tion 2.3). The form of acoustic invariance observed in the speech signal depends to a large extent on the signal processing tool used. In the investigations considered in Section 2.3, LP smoothed spectra were computed using different window sizes. In the present work, the smoothed WD was used to study the time-varying spectral properties of the signal.3 Table 4.2 compares the equivalent temporal resolutions in each of these of these methods. The invariant acoustic cues for place of articulation reported in [2,49,51] can be considered to be time-averaged versions of the invariant patterns described in Section 4.3.3. By collapsing these patterns in the time domain, the invariant cues in [2,49,51] mat be obtained, are obtained. The enhanced temporal detail in the WD plots serves to explain some of the context dependencies reported in literature (see Section 2.3.1). These context dependencies appear to be mainly due to (i) time domain smear in the analysis methods, and (ii) coarticulation effects. 
In [2,49], it was not possible to specify, using the LP method, invariant acoustic properties in the phonetic contexts where coarticulation may be expected.

Footnote 3: Use of the term "smoothing," as in "smoothed WD" and "LP smoothed spectra," is unfortunate, since there is nothing in common between the two usages in these contexts. In the LP method, "smoothing" corresponds to ignoring the portion of the signal that does not fit the all-pole model (see page 12). In smoothing the WD, local averages are computed to suppress the cross-terms (see Section 3.4.2).

Effects of Temporal Smear

The effects of temporal smear were greatest when short-time spectral analysis methods were used to study bilabials, since bilabials have a weak burst which usually lasts for about 5 to 10 ms. In the WD plots, the energy in this burst was seen to be concentrated in the low-frequency region. Thus, averaging the spectrum of a bilabial burst with the diffuse spectrum of a vowel results in a diffuse spectral line with falling slope. This characteristic of bilabials is adequate to distinguish bilabials from alveolars and velars in many vowel contexts, since the high amplitude burst associated with alveolar and velar release [74] would make the averaged spectral shape non-falling. For bilabials in the context of high front vowels, however, the high frequency F2 can dominate the low amplitude burst when the two are smeared across each other. This was perhaps the reason for the more frequent errors in identifying bilabials in the context of [i, e] in Blumstein and Stevens [2] and Kewley-Port [49]. The relatively short window (10 ms) used by Lahiri et al. limits the smear across the burst and the vowel to some extent. The important innovation in their work, however, was to consider energy gradients over time, at low and high frequencies. By doing so, they were able to specify the place of articulation, independent of the vowel context.
It thus appears that bilabials can be characterised by low frequency energy in the burst. Typical values of BD and FOD are in the range of 0 to 10 ms. The diffuse nature of bilabials reported in the literature can be attributed to the temporal smear across the burst and the vowel.

Coarticulation Effects

In Section 4.3.3, it was mentioned that the spectral shapes of the velar bursts were seen to vary most, depending on the vowel context. It was also suggested that the shape of the front cavity varies more than the shape of the back cavity when articulating velars in different vowel contexts. A compact spectral shape is known to occur when the cavities posterior and anterior to the place of articulation resonate at approximately the same frequency (Stevens, [66]). For velars in the context of back vowels, the resonance frequency of the back cavity coincides with the F2 of the following vowel, giving rise to a compact spectrum. For high front vowels, however, the frequency of F2 is quite high [62]. Thus, for velars in the context of high front vowels, it would be expected that the resonance frequency of the front cavity would be high in anticipation of the following vowel. The spectral shape of velar bursts in the context of high front vowels then becomes diffuse rather than compact, since the front and back cavities do not resonate at the same frequency. Figure 4.6 shows the WD of the beginning of the syllable [ge], spoken by an English speaker. The spectral shape during the burst can be seen to be diffuse non-falling, instead of compact. In the WD plots, the spectral shapes of alveolar bursts were observed to be less diffuse in the context of back vowels than in the context of front vowels. The above explanation regarding the diffuse non-falling spectral shape of velars in the context of front vowels can be extended to explain this.
The lower resonance frequency of the front cavity (in anticipation of a back vowel) enhances the resonance frequency of the back cavity, giving rise to a compact spectral shape rather than a diffuse non-falling one. Blumstein and Stevens have reported that a large number of alveolars in the context of [u] were incorrectly accepted as velars by the compact template [2, p. 1011]. Figure 4.7 shows the WD of the beginning of the syllable [du], spoken by an English speaker. The spectral shape can be seen to be compact, rather than diffuse non-falling. This was the only alveolar token of this speaker that was classified as a velar in this study. It thus appears that the spectral shapes "diffuse non-falling" and "compact" are not unique to alveolars and velars, respectively. Both alveolars and velars share these two features, depending on the vowel context. Given this ambiguity in the spectral shapes, it is possible to explain many error patterns encountered by researchers when specifying the place of articulation information in the acoustic signal. In Blumstein and Stevens [2], a large number of velars in the context of front vowels were incorrectly accepted as alveolars by the diffuse-rising template and incorrectly rejected by the compact template. Alveolars in the context of [u] were incorrectly accepted by the compact template as velars. In Kewley-Port [49], a large number of velars in the context of front vowels [i, er, ae] were identified as alveolars. Blumstein and Stevens have explained such context dependencies as coarticulation effects of the tongue body in anticipation of the following vowel [2].

Figure 4.6: WD of the beginning of the syllable [ge] (English speaker).

Figure 4.7: WD of the beginning of the syllable [du] (English speaker).
Kewley-Port [49] used the feature "late onset of low-frequency energy," a measure of VOT, to distinguish velars from bilabials and alveolars. The judges in her study used this feature for only 0.6% of the corpus, indicating that the place of articulation was assigned based on the spectral shape alone in her study. She reported that the feature definition system she used did not capture the importance of this feature. The late onset feature is probably even less suitable for (pre)voiced velars. Voicing can occur at any time with respect to the consonantal release, since the place of articulation is associated with the vocal tract while voicing corresponds to the source (glottis) driving the system. Thus, a measure of the onset of the vowel, rather than the onset of voicing, would serve better to disambiguate velars from non-velars, since a steady state formant pattern cannot occur until the release is complete. The formant onset duration defined in Section 4.3.2 is related to the time required for the steady state resonances to occur in the vocal tract after the consonantal release. Unlike VOT, the value of FOD is always positive. In the pilot study, FOD was found to increase as the place of articulation moved from the lips to the velum. While the spectral shape during the burst is determined mainly by the vocal tract configuration, FOD appears to be related to the inertia of the articulators. Hence, the FOD was considered a good choice as a secondary feature for specifying invariant acoustic patterns for place of articulation. Section 4.4.2 presents the manner in which FOD and F2 were used to disambiguate (i) alveolars in the context of back vowels from velars, and (ii) velars in the context of front vowels from alveolars.
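The distinction drawn above between VOT and FOD can be made explicit with a small sketch; the event times below are invented for illustration.

```python
def vot(release_ms, voicing_ms):
    """Voice onset time: onset of voicing relative to the release.
    Negative for (pre)voiced stops, where voicing precedes release."""
    return voicing_ms - release_ms

def fod(release_ms, vowel_ms):
    """Formant onset duration: time from the release to the steady
    state formant pattern.  Always non-negative, since a steady state
    pattern cannot occur before the release is complete."""
    assert vowel_ms >= release_ms
    return vowel_ms - release_ms

# A prevoiced velar: voicing begins 20 ms before the release,
# yet the steady state vowel pattern appears only 35 ms after it.
print(vot(100, 80))    # -20
print(fod(100, 135))   # 35
```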
4.3.5 Voiced and Aspirated Stop Consonants

The invariant patterns described in the previous section were more readily observable in the WD plots of voiceless, unaspirated stop consonants than in the WD plots of voiced and aspirated stop consonants. When voice onset occurred prior to release, the voice bar was superimposed on the invariant patterns described above. An example of this is depicted in Figure 4.8, which shows the WD of the syllable [di], spoken by a Telugu speaker. The voice bar superimposed on the burst with diffuse non-falling spectral shape can be observed. In the WD plots of aspirated stops, it was relatively more difficult to identify the invariant patterns described in Section 4.3.3. The burst region in an aspirated CV syllable consists of a sudden increase in energy associated with the release, fricative noise at the place of articulation, and aspiration. The energy due to aspiration can be fairly wideband. Figure 4.9 shows the WD of the beginning of the syllable [kha], spoken by an English speaker. The noisy structure in this plot is perhaps due to aspiration. A compact spectral shape, corresponding to the velar place of articulation, can be seen superimposed on a noisy background. In Table 4.1, it may be seen that only unaspirated voiceless consonants occur in all three languages. Consequently, the cross-language study envisaged here was possible only for the unaspirated voiceless stops of English, Telugu and French. Further, the source for voicing and aspiration is at the glottis, while the source for the place of articulation is at the location of the stop in the vocal tract. Since the invariant cues for the place of articulation were the focus of this study, voiced and aspirated consonants were not considered.
4.4 Development of the Program FEATURE

In the previous section, it was shown that the WD describes the time-varying spectral properties of the burst with much greater precision than short-time spectral analysis methods. In addition to the visual patterns described in Section 4.3.3, there were many other interesting aspects of the signal displayed in the partially smoothed WD plots, which were lost when the smoothing was made equivalent to that in a spectrogram, i.e. Tfi = 1.0, as illustrated in Figures 3.14 and 3.15. From these figures, and other WD plots presented in this dissertation, it is clear that the partially smoothed WD is a powerful tool for investigating acoustic invariance in stop consonants and analyzing other transient signals.

Figure 4.8: WD of the beginning of the syllable [di] (Telugu speaker).

Figure 4.9: WD of the syllable [kha] from English.

In this dissertation, the extent to which the invariant patterns described in Section 4.3.3 specified the place of articulation was investigated. A program called FEATURE was developed to assign a place of articulation based on a set of values for the visual features defined in Section 4.3.2. Three judges performed the visual pattern matching and measured the values of the visual features. This procedure was similar to the one used by Kewley-Port [49]. It is reasoned that, if human judges can find invariant patterns that lead to correct identification of the place of articulation, then developing sophisticated pattern matching algorithms for recognition by a computer would be justified. The program FEATURE included two procedures: (i) a measurement section, in which values corresponding to the visual patterns were obtained interactively from the judges, and (ii) a logic section, in which the place of articulation was assigned to the WD plot based on an assessment of the visual patterns. These sections are described below.
4.4.1 Measurement Section

In this procedure, judges were asked to indicate: (i) the start of the burst, (ii) the start of the vowel, (iii) the F2 of the vowel, and (iv) the spectral shape within the burst region. Simple ad hoc tests were then performed to determine whether the burst region corresponded to the release of a stop, or to some acoustic artefact not relevant to the task of assigning place of articulation in this experiment. Weak bursts of less than 5 ms duration, with spectral shapes clearly different from those of the bilabials, alveolars and velars discussed in Section 4.3.3, were candidates for these tests. Such bursts were observed in 4 of the 9 bilabials in the pilot data. Because of equipment limitations, it was not feasible to thoroughly investigate the nature of these bursts. It was, however, noted that ignoring these bursts resulted in a typical bilabial pattern. The strategy of ignoring these bursts was adopted whenever their duration was found to be less than 5 ms and there was no spectral peak in the burst at the F2 of the following vowel. Until it is possible to study these bursts extensively, they are presumed to be due to some acoustic artefact. When an artefact was detected, the judges were requested to ignore this region of the plot and asked to identify another region for the burst. New values for the start of the burst, the start of the vowel, and the spectral shape were obtained interactively. This process was repeated until reliable burst cues were identified, or until the judges decided to evaluate the next plot. If a reliable burst region was identified, the judges were prompted for the burst duration. The place of articulation was then assigned as explained in the next section. The flowchart of the measurement section of the program is shown in Figure 4.10.

4.4.2 Logic Section

The flowchart for the logic section is depicted in Figure 4.11.
Decisions regarding the place of articulation were made at ten different nodes in the flowchart, based on: (i) the spectral shape, (ii) FOD, (iii) BD, and (iv) the second formant F2 of the following vowel, as shown in Figure 4.11. This section has four time thresholds (TT2, TT2B, TT3 and TT4) and two frequency thresholds (FT1 and FT2) against which the parameter values were compared in assigning the place of articulation. The place of articulation was assigned as bilabial at node 1 for tokens with diffuse falling spectral shape, and as velar at node 10 for tokens with compact spectral shape with multiple peaks in the time domain. In the pilot data, FOD, BD and F2 were redundant for the tokens with the foregoing spectral properties. Alveolars in the context of front vowels and velars in the context of back vowels are said to have a homorganic place of articulation with the following vowel. For such tokens, the parameters measured from the WD plots do not have conflicting values; these tokens are assigned the place of articulation at nodes 2, 3, 8 and 9. Specifically, tokens with diffuse non-falling spectral shape and FOD-BD less than TT2, or FOD less than TT3, are assigned the alveolar place of articulation at nodes 2 and 3, respectively. Similarly, tokens with compact spectral shape and FOD-BD greater than TT2B, or FOD greater than TT4, are assigned the velar place of articulation at nodes 9 and 8, respectively. As noted earlier (Section 4.3.4), alveolars in the context of back vowels and velars in the context of front vowels can have compact and diffuse non-falling spectral shapes, respectively, because of coarticulation in segments with a non-homorganic place of articulation.
For such tokens with conflicting cues, i.e., diffuse non-falling spectral shape and a long FOD, or compact spectral shape and a relatively short FOD, the ambiguity was resolved using the F2 of the following vowel. Thus, for tokens with compact spectral shape and FOD less than TT3, if F2 was less than FT1, indicating a back vowel, then an alveolar place of articulation was assigned at node 4. Similarly, for tokens with diffuse non-falling spectral shape and FOD greater than TT4, if F2 was greater than FT2, indicating a front vowel, then a velar place of articulation was assigned at node 7.

Figure 4.10: Flowchart of the measurement section of the program FEATURE.

Figure 4.11: Flowchart of the logic section of the program FEATURE.

              FOD (ms)       FOD-BD (ms)
English  [b]   8,  6, 15      0,  0,  0
         [d]  17, 12, 22      7,  5, 15
         [g]  32, 28, 25     23, 19, 19
Telugu   [p]  11, 12,  6      0,  0,  0
         [t]  28, 25, 24     16, 15, 15
         [k]  42, 63, 65     37, 55, 59
French   [p]  12, 10,  8      0,  0,  0
         [t]  24, 14, 22     18,  7, 10
         [k]  55, 24, 40     46, 18, 34

Table 4.3: FOD and FOD-BD values from the pilot data.
For tokens in which ambiguities of the visual features could not be resolved, the place of articulation was assigned on the basis of the spectral shape alone, at nodes 5 and 6: tokens with diffuse non-falling spectral shapes were assigned the alveolar place of articulation, and tokens with compact spectral shape were assigned the velar place of articulation, as in [2,49]. The choice of the time domain thresholds was based on the FOD and FOD-BD values of the 27 voiceless unaspirated stop consonants of the pilot data, measured by the experimenter. These values are shown in Table 4.3; each entry corresponds to a given phone in the context of [i], [a] and [u], respectively. The frequency thresholds FT1 and FT2 were selected from the formant data given in Peterson and Barney [62]; most of the informants in [62] spoke General American English. The choice of thresholds is shown in Table 4.4.

       Pass 1    Pass 2
TT1    5 ms      —
TT2    16 ms     —
TT2B   26 ms     —
TT3    16 ms     20 ms
TT4    28 ms     24 ms
FT1    1.0 kHz   1.2 kHz
FT2    1.7 kHz   1.5 kHz

Table 4.4: The choice of original thresholds.

There were too few tokens in the pilot study to arrive at statistically reliable values for the thresholds, and it was not feasible to consider a larger corpus, since computation costs were prohibitive. Therefore, a strategy of adjusting the thresholds for tokens with conflicting cues was adopted. For the tokens that were not classified at one of the nodes 3, 4, 7 or 8 in the first pass, the thresholds TT3, TT4, FT1 and FT2 were relaxed (as shown in Table 4.4) and classification was attempted in a second pass. If classification remained impossible in the second pass, the place of articulation was assigned based on the spectral shape alone, as explained earlier. In principle, it should be possible to further investigate the acoustic signal for additional invariant patterns when the ambiguities cannot be resolved using the simple features considered here.
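The decision logic described above can be sketched as follows. This is a hypothetical re-implementation, not the original FEATURE code; the shape labels and the simplified two-pass handling are assumptions, and the thresholds are the English pass-1 and pass-2 values from Table 4.4 (times in ms, frequencies in kHz).

```python
# Pass 1 and pass 2 thresholds (English values from Table 4.4).
PASS1 = dict(TT2=16, TT2B=26, TT3=16, TT4=28, FT1=1.0, FT2=1.7)
PASS2 = dict(PASS1, TT3=20, TT4=24, FT1=1.2, FT2=1.5)

def classify(shape, fod, bd, f2):
    """Assign a place of articulation from the visual features.

    shape: 'falling', 'non-falling', 'compact' or 'compact-multi'.
    Returns (place, node), following the ten decision nodes of
    Figure 4.11.
    """
    if shape == 'falling':
        return 'bilabial', 1                      # node 1
    if shape == 'compact-multi':
        return 'velar', 10                        # node 10
    t = PASS1
    if shape == 'non-falling':
        if fod - bd < t['TT2']:
            return 'alveolar', 2                  # node 2
        if fod < t['TT3']:
            return 'alveolar', 3                  # node 3
    else:                                         # 'compact'
        if fod - bd > t['TT2B']:
            return 'velar', 9                     # node 9
        if fod > t['TT4']:
            return 'velar', 8                     # node 8
    # Conflicting cues: resolve with F2, relaxing thresholds in pass 2.
    for t in (PASS1, PASS2):
        if shape == 'compact' and fod < t['TT3'] and f2 < t['FT1']:
            return 'alveolar', 4                  # node 4 (back vowel)
        if shape == 'non-falling' and fod > t['TT4'] and f2 > t['FT2']:
            return 'velar', 7                     # node 7 (front vowel)
    # Unresolved: fall back on the spectral shape alone.
    return ('alveolar', 5) if shape == 'non-falling' else ('velar', 6)
```

For tokens analyzed with the per-language thresholds introduced later (Table 4.5), PASS1 and PASS2 would simply hold the Telugu or French values instead.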
The program FEATURE has five check points to trace the various branches traversed in assigning the place of articulation for each token. At the conclusion of analyzing each token, the values of the visual features and the check points were recorded in a file for future analysis. Considerable effort was expended to make this program easy to use and resistant to user errors.

4.5 The Experiment

4.5.1 Training of Judges

Three judges participated in this study. Two of them (CM and TU), graduate students in the School of Audiology and Speech Sciences, were conversant with the terminology of acoustic phonetics, and one of them (ML), an undergraduate student in Electrical Engineering, was familiar with signal processing terminology. It was possible to run the program and measure the parameter values from the WD plots as prompted by the program without knowledge of either discipline. All the judges were paid for their participation in the study. Judges were provided with instructions for measuring the visual patterns in the WD plots and for using the program FEATURE (Appendix C), and were familiarized with the procedure in an oral presentation and a demonstration by the experimenter. A set of templates xeroxed onto transparencies was provided to facilitate measurements (see Figure 4.12). The 27 plots of the pilot study were randomised and given to each of the judges in a loose leaf binder. When the judges performed the experiment independently, the program was able to assign the correct place of articulation to 80% of the tokens. When the experimenter performed the same task, a score of 96% was obtained. A collaborative judgement was then performed, in which the experimenter also participated. Since the collaborative session was also part of the training, a unanimous agreement was requested for each parameter value measured from the plots. A score of 92% was obtained in this session.
The syllable [ti] by the French informant and the syllable [gu] by the English informant were unanimously assigned velar and alveolar places of articulation, respectively. The syllable [ti] had a very weak burst, with only one conspicuous peak in the burst. As a result, the spectral shape of the burst in this token was judged as compact, and the token was assigned the velar place of articulation at node 6. For the token [gu], the voicing started 10 ms after the release and the vowel started 25 ms after the release. Based on the instructions given to the judges, it was unanimously decided that the vowel started 10 ms after the burst, making the value of FOD 10 ms rather than 25 ms. As a result, this token was classified as an alveolar at node 4 instead of a velar at node 8. Based on feedback from the judges, some changes were made in the program, which are summarised in the next section. Some clarifications regarding the original instructions were also provided (see Section C.4 in Appendix C).

Figure 4.12: Template used for measuring the visual patterns.

4.5.2 Modification to the Program FEATURE

A cursory analysis of the data files containing responses from the judges revealed that the parameter values measured by the different judges were not very consistent. The program FEATURE is designed such that small measurement errors in the parameters do not drastically affect its performance. Of the five parameters measured for each plot, the F2 of the vowel does not correspond to the burst region of the CV syllable. As noted earlier, the use of F2 was not anticipated prior to computing and plotting the WD (see page 60). When analyzing data from the pilot study, it was found that measuring F2 from the WD plots (as explained on page 118) was quite time consuming and not very accurate. Measurement of F2 is not a difficult task, and many techniques for its accurate measurement have been proposed in the literature [57].
Since it is not proposed that the WD be used to measure F2, it was decided to keep the F2 errors uniform across all the judges. In order to achieve this uniformity, the experimenter measured the F2's of the corpus and an F2 database was created. This database was then expanded to include the maximum time duration in each WD plot. The program was modified to prompt the judges with the proper template to be used in the measurements, and with the F2 of the vowel, for each token. The errors made by the judges in the training session were mainly in recognizing the various spectral shapes. A series of schematic WD plots was made to illustrate various forms of diffuse falling, diffuse non-falling and compact spectral shapes in the WD plots, and provided to the judges.

4.5.3 Data for the Cross-Language Study

In the final experiment, recorded tokens were obtained from six informants: JG and DG for English, AP and JB for French, and HG and CS for Telugu. Tokens for the final experiment were recorded approximately a year after recording the tokens for the pilot study. The informants were provided with a list of CV syllables containing the nasal consonant /m/, all the consonants of their language shown in Table 4.1, and the nasal consonant /n/, followed by the vowels /i, e, a, o, u/. The recording conditions were similar to those used for preparing the pilot set. Each informant recorded three sets of repetitions; the last two sets were chosen for digitization. WD plots were generated for the unaspirated voiceless stops, as explained in Appendix A.

Dental Tokens from Telugu

Based on theoretical considerations and spectral analysis of Swedish dental stops, Fant showed that dentals have a diffuse spectral shape with a predominance of high-frequency energy during the burst [23].
Blumstein and Stevens suggested that alveolars, dentals and other phonetic categories broadly termed "coronals" share similar spectral properties during the burst, namely a diffuse-rising spectral shape, and that they would differ in other attributes of the acoustic signal, such as the direction of formant transitions or the onset frequencies of particular formants [2]. Phonologically, alveolars and dentals are also described with the same feature, i.e., [coronal] [12,33]. These observations prompted Lahiri et al. [51] to investigate alveolar and dental stops in various languages to determine whether they displayed the diffuse-rising spectral shape. Unfortunately, their prediction that the spectral shape was diffuse-rising for both alveolars and dentals was not substantiated. They did show, however, that the relative change in the distribution of energy from the burst release to the onset of voicing was the same for these two phonetic categories. Since dentals and alveolars are phonologically contrasted in Telugu, the dentals of Telugu were included in the corpus. The program FEATURE was modified such that if it assigned an alveolar place of articulation to a dental stop, the classification was considered correct. The final data set consisted of tokens spoken by three new speakers (DG, CS, JB), two new vowel contexts ([e, o]), and a new place of articulation (the dentals of Telugu), none of which were considered in the pilot study. This resulted in 60 plots for English, 80 plots for Telugu (including the plots for 20 dental stops) and 60 plots for French. The WD plots were randomized within each language and kept in loose-leaf binders for analysis by the judges. These plots were not analyzed by the experimenter prior to finalizing the program FEATURE.
         Telugu               French
         Pass 1    Pass 2     Pass 1    Pass 2
TT1      5 ms      —          5 ms      —
TT2      20 ms     —          18 ms     —
TT2B     40 ms     —          30 ms     —
TT3      30 ms     34 ms      20 ms     24 ms
TT4      44 ms     40 ms      30 ms     26 ms
FT1      1.0 kHz   1.2 kHz    1.0 kHz   1.2 kHz
FT2      1.7 kHz   1.5 kHz    1.7 kHz   1.5 kHz

Table 4.5: The choice of new thresholds.

4.5.4 Running the Program FEATURE

The final experiment was conducted in three stages. In the first stage, the judges spent an average of 7 hours each and analyzed the data independently. In the tokens that were incorrectly classified in the independent judgement, a distinct pattern was observed in the five visual features measured by the judges. A closer examination of the parameter values showed that the single set of thresholds used in the program FEATURE was not optimum for all three languages. For instance, the FOD's of alveolars in Telugu were much longer than those in English. In order to establish the extent to which the parameter values measured by the judges were invariant for a given place of articulation, a different set of thresholds for each language was arrived at by re-examining the pilot data, this time independently for each language. The second stage of the experiment involved running the program with different threshold values for each language, but with the same parameter values obtained in the independent judgement. The thresholds for English were the same as in the independent judgement; the new thresholds used for the tokens of Telugu and French are shown in Table 4.5. In the third stage, the three judges participated in a collaborative judgement and analyzed the 50 WD plots for which the assignment of spectral shape had not been unanimous. Unlike the collaborative judgement reported in [49], the corpus did not include the tokens for which the spectral shape was unanimous and a wrong place of articulation was assigned to the consonant.
Similarly, tokens for which the same place of articulation was assigned by the three judges, in spite of different spectral shapes, were also analyzed in the collaborative judgement. No additional instructions were given to the judges for their collaborative judgement. This judgement mainly served the purpose of correcting errors due to oversight and/or careless judgement. The results of the three stages of this experiment are presented in the next chapter.

4.6 Summary

In this chapter, a cross-language study of stop consonants from English, Telugu and French, conducted to investigate acoustic invariance, was presented. Some of the linguistic terms used throughout this dissertation and the organization of stop consonants in the languages considered were explained. The invariant patterns for each place of articulation, observed in the WD plots of the pilot data, were explained. A comparison of these patterns with the invariant cues reported in the literature indicated that the difficulties reported in [2,49,51] in specifying the place of articulation (independent of the vowel context) were (i) partly due to temporal smear in the analysis techniques, and (ii) partly due to coarticulation effects. The WD was useful in preserving the temporal detail and identifying some coarticulation effects. A formal study was conducted to establish whether it was possible to specify the place of articulation, independent of the vowel context, based on the invariant patterns observed in the WD plots. The results of this study are described in Chapter 5.

Chapter 5

Results and Discussions

In this chapter, the results of a cross-language study to investigate acoustic invariance in stop consonants are presented. The purpose of the experiment was to identify visually unique patterns in the WD plots of CV syllables spoken by native speakers of different languages.
The corpus consisted of a total of 200 utterances: 60 from English, 80 from Telugu and 60 from French, spoken by two native speakers of each language. Five vowel contexts ([i, e, a, o, u]) were considered for each place of articulation. The English tokens had bilabial, alveolar and velar places of articulation; the Telugu tokens had bilabial, dental, retroflexed post-alveolar and velar places of articulation; and the French tokens had bilabial, dental and velar places of articulation. As noted earlier, for brevity, the retroflexed post-alveolar stop consonants from Telugu will be referred to as Telugu alveolars. Three judges analyzed the WD plots and provided the values for the 5 visual parameters defined in Section 4.3.2. A place of articulation was assigned to each plot by the program FEATURE, based on the responses of the three judges. The performance of this program in assigning the correct place of articulation is discussed.

5.1 Overview

Section 5.2 presents the overall performance in identifying the invariant patterns. The data is first collapsed to allow discussion of the results in each language and for each place of articulation independently; the complete data is then presented to allow discussion by (i) language, and (ii) place of articulation.

Section 5.3 presents the performance at each of the ten nodes in FEATURE where a place of articulation was assigned: the number of tokens that were assigned the correct place of articulation, and the number of tokens that were assigned a wrong place of articulation, at each node. Section 5.4 presents the role played by the various visual parameters measured from the WD plots in arriving at a correct place of articulation.

        Ind. Sess.   New Thrs.   Col. Sess.
Eng.    81.2         81.2        86.7
Tel.    75.3         85.3        87.5
Fre.    76.1         76.7        83.3

Table 5.1: Overall scores by language.
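The collapsing of per-token results into the percentage scores of the following sections amounts to row-normalizing count matrices. A sketch with invented counts (not the actual experimental data):

```python
import numpy as np

# Rows: intended place (labial, coronal, velar);
# columns: place assigned by the program.  Counts are illustrative only.
counts = np.array([[45,  7,  8],
                   [10, 60, 27],
                   [ 1,  2, 87]], dtype=float)

# Percent confusion matrix: each row sums to 100.
percent = 100.0 * counts / counts.sum(axis=1, keepdims=True)

# Overall score: correctly classified tokens (the diagonal)
# as a percentage of all tokens.
overall = 100.0 * np.trace(counts) / counts.sum()
```

Collapsing by language instead of by place would follow the same pattern, with rows grouped by the informants' language before normalizing.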
5.2 Overall Performance

The overall performance of the program FEATURE in assigning the correct place of articulation to a WD plot has been arranged as a confusion matrix in Table 5.4. In the following two sections, the performance is explained by language and by place of articulation, by collapsing the confusion matrix accordingly. Since the performance of the judges was fairly consistent in measuring parameter values from the WD plots, results for individual judges are not presented.

5.2.1 Performance by Language

Table 5.1 shows the overall performance in English, Telugu and French for the independent session, the experiment with new thresholds, and the collaborative session. The 10% and 0.6% increases in the scores for the Telugu and French tokens from the independent session to the session with new thresholds show that the choice of original thresholds was particularly poor for the Telugu tokens. Although only 75% of the Telugu tokens displayed invariance in the independent session, 85% of them nevertheless shared invariant acoustic properties in the parameters measured by the judges, as demonstrated by the experiment with new thresholds.

        Ind. Sess.             New Thrs.              Col. Sess.
        Lab.   Cor.   Vel.     Lab.   Cor.   Vel.     Lab.   Cor.   Vel.
Lab.    75.0   11.7   13.3     75.0   13.9   11.1     78.3   10.0   11.7
Cor.    10.3   60.9   28.8     10.3   72.0   17.7      5.8   80.8   13.3
Vel.     1.1    2.2   96.7      1.1    2.7   96.1      0.0    1.7   98.3

Table 5.2: Overall performance by place of articulation.

For the collaborative judgement, there was an improvement of 5.5%, 2.2% and 6.6% in the results for English, Telugu and French respectively, resulting in an overall improvement of 4.8%. The small improvement in the results of the collaborative judgement implies that the distribution of errors was fairly random in the independent session, indicating that the training of the judges was good enough to enable them to make reliable measurements from the WD plots.
The 2.2% increase in the results for Telugu indicates that there were relatively fewer errors in the independent judgement of Telugu tokens.

5.2.2 Performance by Place of Articulation

Table 5.2 shows the confusion matrices for place of articulation in: (i) the independent session, (ii) the experiment with new thresholds and (iii) the collaborative session. The scores for the alveolar and dental tokens have been combined as coronals in this table. Correct scores are indicated by bold face. From the flowchart in Figure 4.11, it may be deduced that the choice of new thresholds did not affect the classification of bilabials. There was a relatively small improvement in the classification of bilabials in the collaborative session (only 3.3%, corresponding to 2 tokens), indicating that errors in bilabials were not due to careless judgement.

Most of the improvements in the session with new thresholds and in the collaborative session were in the identification of coronal tokens. From Tables 5.1 and 5.2, it may be seen that for the session with new thresholds, improvements were in the identification of coronal tokens of Telugu. Most of the errors in the independent session were associated with identification of the alveolars of English and dentals of French. About 8.8% of these were corrected in the collaborative session. The identification of velars was uniformly good across the three languages (greater than 96%). Their classification was robust with respect to the threshold values and there were relatively fewer errors in the independent session.

5.2.3 Performance by Vowel Context

Table 5.3 shows the results of the collaborative session for each place of articulation, with respect to vowel context.

       Bilabial   Coronal   Velar
[i]       75         75       92
[e]       83         81      100
[a]       75         75      100
[o]       92        100      100
[u]       67         75      100

Table 5.3: Performance by vowel context.
For the limited corpus considered in this study, no context dependencies could be found in the performance of the program FEATURE. Errors were distributed relatively uniformly across front and back vowels.

5.2.4 Overall Scores

Table 5.4 shows the performance by both language and place of articulation, for each speaker. The salient points to be noted from this table are discussed here briefly.

                 Ind. Sess.            New Thrs.             Col. Sess.
                 Lab.  Cor.  Vel.      Lab.  Cor.  Vel.      Lab.  Cor.  Vel.
Eng.  b   JG     66.7  16.7  16.7      66.7  16.7  16.7       90     0    10
          DG     96.7   3.3   0.0      96.7   3.3   0.0      100     0     0
      d   JG      0.0  76.7  23.3       0.0  76.7  23.3        0    90    10
          DG      6.7  60.0  33.3       6.7  60.0  33.3       10    50    40
      g   JG      0.0  10.0  90.0       0.0  10.0  90.0        0    10    90
          DG      0.0   3.3  96.7       0.0   3.3  96.7        0     0   100
Tel.  p   HG     56.7  16.7  26.7      56.7  20.0  23.3       60    20    20
          CS     76.7  16.7   6.7      76.7  23.3   0.0       70    30     0
      t   HG      0.0  46.7  53.3       0.0  86.7  13.3        0   100     0
          CS      3.3  50.0  46.7       3.3  83.3  13.3        0   100     0
      t   HG      0.0  80.0  20.0       0.0  100    0.0        0   100     0
          CS     13.3  60.0  26.6      13.3  86.7   0.0       10    90     0
      k   HG      0.0   0.0  100        0.0   0.0  100         0     0   100
          CS      0.0   0.0  100        0.0   0.0  100         0     0   100
Fre.  p   AP     90.0   0.0  10.0      90.0   0.0  10.0       90     0    10
          JB     63.3  16.7  20.0      63.3  20.0  16.7       60    10    30
      t   AP     36.7  53.3  10.0      36.7  53.3  10.0       20    80     0
          JB     10.0  56.7  33.3      10.0  63.3  27.7        0    70    30
      k   AP      6.7   0.0  93.3       6.7   3.3  90.0        0     0   100
          JB      0.0   0.0  100        0.0   0.0  100         0     0   100

Table 5.4: Performance by place of articulation and language.

Bilabials

The results from the collaborative session indicate that identification of bilabials was relatively poor for Telugu informants HG and CS and French informant JB. There was a weak burst, a few tens of milliseconds prior to the diffuse falling pattern, in the WD plots of 26 bilabial tokens out of a total of 60 tokens. The strategy of ignoring these regions (as explained in Section 4.4.1) was effective for 13 of these tokens. For the other 13 tokens, the program was unable to ignore these bursts, even though the spectral shape of these bursts was quite different from those of alveolars and velars.
It was not possible to study these bursts in detail at this stage, due to limitations in the A/D, D/A and waveform editing facilities. Further investigations are needed to establish the perceptual relevance of these bursts and to incorporate this phenomenon into the set of invariant patterns. From Table 5.4, it can be seen that this characteristic of bilabials could be specific to language and/or speaker, since these bursts were predominant in the bilabials of both Telugu informants and French informant JB. Figure 5.1 shows the WD of the syllable [be] spoken by a native English speaker. This figure presents an example of a weak burst, prior to the diffuse falling pattern.

[Figure 5.1: WD of the beginning of the syllable [be] (English speaker).]

Dentals and Alveolars

The identification of alveolars was rather poor for tokens spoken by English speaker DG. All errors for DG (5 out of 30 tokens) occurred only in alveolars, and 4 of these tokens occurred in one set of CV syllables. Two of these tokens had weak bursts, leading to incorrect classifications of spectral shape. For the other tokens, the start of the vowel was identified as being a couple of pitch periods later by all judges, even though the WD plots displayed clear formant structure. Figure 5.2 shows the beginning of the syllable [da], spoken by DG. For this token, the judges identified the start of the vowel at 40 ms, instead of 12 ms. Ambiguities of this nature were not present in the pilot data.

[Figure 5.2: WD of the beginning of the syllable [da] (English speaker DG).]

Lahiri et al. [51] studied the dentals and alveolars of French and Malayalam and showed that the spectral shape of dentals during the burst was not diffuse-rising, across all vowel contexts. In Chapter 2, it was suggested that the LP technique was not suitable for studying the nature of stop consonants.
In the present study, all dental tokens of Telugu were found to have a diffuse non-falling spectral shape during the burst. This is extremely encouraging, since the program was designed for alveolars and neither the judges nor the experimenter had analyzed the WD of a dental token. This suggests that the original predictions of Fant [23] and Blumstein and Stevens [2] were correct, and that the problems encountered by Lahiri et al. might be attributed to the signal processing tool. Figure 5.3 shows the WD of the beginning of a Telugu dental in the context of [e]. The spectral shape during the burst can be seen to be diffuse non-falling.

The FOD was much longer for the alveolars and dentals of Telugu than for the alveolars of English and dentals of French, which led to a number of Telugu coronals being classified as velars. It was apparent that a single set of thresholds was not appropriate for all languages. A new set of thresholds, different for each language, was obtained from Table 4.3. With the new thresholds, there was an improvement of 30% in the results for coronals of Telugu. Thus, although the original thresholds failed to demonstrate invariance in the coronals of Telugu, there were invariant patterns in the parameters obtained in the independent session in approximately 89% of Telugu coronals. About 8% of errors occurred due to careless judgement; these were corrected in the collaborative session. From Table 5.4, it can be seen that the experiment with new thresholds mainly served to improve the scores of Telugu coronals. The improvement for the rest of the tokens was fairly marginal.

Performance for French dentals was poor, with an average score of 55%. In the pilot study, dental tokens from French were found to have weak bursts, leading to incorrect classification of the spectral shape. Similar problems were also encountered with the French dentals in the final experiment.
A more extensive pilot study might have helped in obtaining a better performance for the French dental tokens. It was encouraging to note that not all errors were due to improper choice of thresholds or to tokens of poor quality. About 20% of the errors were due to oversight and careless judgement; these were corrected in the collaborative session, ultimately resulting in an overall performance of 75% for French dentals.

[Figure 5.3: WD of the beginning of the syllable [te] (Telugu speaker).]

Velars

Performance for velars was excellent in all sessions (greater than 90%). After the collaborative session, there was only one velar ([gi] of English speaker JG) that was classified as an alveolar. The following conclusions can be made concerning invariant acoustic patterns in velars:

• Velars can have both diffuse non-falling and compact spectral shapes during the burst, depending on the vowel context.

• The FOD following the burst and the F2 of the vowel can be used along with spectral shape to specify the invariant acoustic patterns for velars, independent of the following vowel.

In Blumstein and Stevens [2] and Kewley-Port [49], considerable context dependency was reported in specifying place of articulation for velars. Using the WD, it was possible to resolve these context dependencies using the FOD and F2 information.

5.3 Performance at Each Node

Table 5.5 shows the number of tokens assigned a place of articulation at each of the ten nodes in the flowchart in Figure 4.11. Each entry corresponds to the number of times a correct place of articulation was assigned at a given node. The quantity in parentheses indicates the number of times an incorrect place of articulation was assigned at that node. A large number of non-bilabial French tokens were classified as bilabials at node 1.
In assigning the bilabial place of articulation, the program FEATURE used only the spectral shape and ignored the FOD-BD and FOD information. It is possible to use the parameters FOD-BD and FOD together with the diffuse falling spectral shape in the flowchart in Figure 4.11, to prevent non-bilabials from being classified as bilabials. The pilot data, however, indicated that such measures were unnecessary.

Alveolars in the context of back vowels and velars in the context of front vowels tend to have compact and diffuse non-falling spectral shapes, respectively. Nodes 4 and 7 were provided to allow the program FEATURE to assign the correct place of articulation to these tokens by using FOD and F2 information. A very small number of alveolar tokens arrived at node 4, in all three languages. This is partly due to the fact that coarticulation effects are greater for velars than for alveolars and partly due to a poor choice for the value of threshold FT1. For tokens that arrived at nodes 5 and 6, conflicting visual cues could not be resolved and a decision about place of articulation was based on the spectral shape alone. For these tokens, the set of visual patterns defined in Section 4.3.2 was not sufficient to assign a reliable place of articulation.

              1       2      3      4      5      6       7       8      9      10
Ind. Sess.
  Eng.      49 (2)  17 (3) 21 (1)  0 (4)  4 (2) 12 (6)   9 (4)  12 (0)  0 (0) 24 (2)
  Tel.      40 (5)  15 (0) 13 (1)  1 (1) 42 (8)  0 (1)  12 (38)  2 (7) 10 (2) 35 (6)
  Fre.      46 (16)  2 (0)  4 (1)  0 (3) 27 (1)  0 (2)  22 (10)  7 (2)  4 (1) 25 (7)
New Thrs.
  Eng.      49 (2)  17 (3) 21 (1)  0 (4)  4 (2) 12 (6)   9 (4)  12 (0)  0 (0) 24 (2)
  Tel.      40 (5)  84 (2)  9 (3)  1 (2) 13 (6)  2 (9)  12 (0)   5 (0)  6 (0) 35 (6)
  Fre.      46 (16)  2 (0) 13 (4)  1 (3) 19 (0)  2 (2)  21 (7)   8 (2)  1 (1) 25 (7)
Col. Sess.
  Eng.      57 (3)  18 (0) 20 (3)  0 (0)  4 (0) 11 (12)  9 (6)  14 (0)  0 (0) 24 (0)
  Tel.      39 (3)  91 (0)  8 (6)  1 (0) 17 (9)  2 (3)  12 (0)   5 (0)  6 (0) 35 (3)
  Fre.      45 (6)   2 (0) 19 (1)  1 (3) 20 (0)  3 (4)  21 (6)  12 (3)  1 (3) 23 (5)

Table 5.5: Performance at each node in the decision tree.
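The decision logic described above can be illustrated with a highly simplified sketch. This is a hypothetical condensation of the flowchart in Figure 4.11, not the actual FEATURE program; the function name, threshold values and string labels are ours.

```python
def assign_place(shape, fod_ms, f2_hz, fod_max_coronal=30.0, f2_front=1500.0):
    """Toy FEATURE-like rules: spectral shape first, then FOD and F2
    to resolve the coronal/velar ambiguities caused by vowel context.
    Thresholds are illustrative only."""
    if shape == "diffuse-falling":
        return "bilabial"  # cf. node 1: shape alone decides
    if shape == "diffuse-nonfalling":
        # a long FOD before a front vowel (high F2) suggests a velar
        # whose burst merely looks diffuse (cf. node 7)
        if fod_ms > fod_max_coronal and f2_hz >= f2_front:
            return "velar"
        return "coronal"
    if shape == "compact":
        # a short FOD before a back vowel (low F2) suggests a coronal
        # whose burst merely looks compact (cf. node 4)
        if fod_ms <= fod_max_coronal and f2_hz < f2_front:
            return "coronal"
        return "velar"
    return "unclassified"

print(assign_place("diffuse-falling", 8, 2000))      # bilabial
print(assign_place("diffuse-nonfalling", 45, 2200))  # velar
print(assign_place("compact", 15, 900))              # coronal
```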
For the English tokens, nodes 2, 3 and 8, 10 appear to be quite robust for alveolars and velars respectively. For Telugu and French velars, nodes 7, 8, 9 and 10 all seem to be quite reliable, with a large number of judgements favouring nodes 7 and 10. With the new thresholds, node 2 was found to be the most favoured for Telugu alveolars.

5.4 Relevance of the Acoustic Features Measured in WD Plots

In the past, many acoustic features have been proposed as invariant acoustic cues for place of articulation in stop consonants. To some extent, all these features have been shown to be perceptually relevant, i.e. by synthesizing stimuli containing these features and investigating the extent to which they are perceived as a given stop consonant [20,22,68,49,51]. This suggests that there might be a set of properties in the acoustic speech signal which ultimately conveys perceptual information of a linguistic nature. The goal is to discover all the elements of such a set and to specify the extent to which each member conveys information about place of articulation. Experiments to establish the perceptual relevance of spectral shape and the combination of FOD, BD and F2 were not attempted as part of this dissertation. The role played by these features in determining place of articulation was nevertheless assessed by studying the judges' responses, as described below.

5.4.1 Spectral Shape

Table 5.6 shows the distribution of the feature "spectral shape," by place of articulation, in each language.

           Independent Session       Collaborative Session
           Fall.  NonF.  Comp.       Fall.  NonF.  Comp.
Eng.  b      49     2      9           57     0      3
      d       2    45     13            3    48      9
      g       0    13     47            0    12     48
Tel.  p      40    12      8           39    15      6
      t       1    51      8            0    60      0
      t       4    55      1            3    56      1
      k       0    12     48            0    12     48
Fre.  p      46     3     11           45     0     15
      t      14    42      4            6    51      3
      k       2    22     36            0    21     39

Table 5.6: Distribution of the feature "spectral shape" by place of articulation.
For each place of articulation, in each language, a total of 20 tokens were analyzed by the three judges.

Diffuse Falling

From Table 5.4, it can be seen that 95% of bilabials in English, 65% of bilabials in Telugu and 75% of the bilabials in French were judged to have a diffuse falling spectral shape, after errors due to careless judgement were corrected. Thus, out of a total of 20 bilabial tokens from each language, 1 token in English, 7 tokens in Telugu and 4 tokens in French were found not to have a diffuse falling spectral shape. As discussed in Section 5.2.4, these tokens displayed a brief burst prior to the diffuse falling spectrum and were assigned an alveolar or velar place of articulation only by default. Thus, the absence of a diffuse falling spectral shape was not a sufficient cause to reject a token as bilabial. The presence of a diffuse falling spectral shape during the burst was nevertheless found to be a reliable cue for bilabials, since only 2.9% of non-bilabials were judged to have a diffuse falling spectral shape.

[Figure 5.4: Distribution of diffuse non-falling and compact spectral shapes in alveolars and velars (collaborative session).]

Diffuse Non-Falling vs Compact Spectral Shapes

The distribution of diffuse non-falling and compact spectral shapes in coronals and velars is shown in Figure 5.4. It can be seen that the number of velars judged to have a diffuse non-falling spectral shape was substantially greater than the number of coronals that were judged to have a compact spectral shape. About 20% of velars in English and Telugu and 35% of velars in French were judged to have a diffuse non-falling spectral shape. Similarly, 6.7%, 1.7% and 5% of the coronals in English, Telugu and French were judged to have a compact spectral shape. All dental tokens of Telugu were judged to have a diffuse non-falling spectral shape.
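The 2.9% figure for non-bilabials judged diffuse falling can be recovered directly from the collaborative-session column of Table 5.6, since each row represents 20 tokens times 3 judges, i.e. 60 judgements. The sketch below is our own tally; the labels "t1"/"t2" merely distinguish the two Telugu coronal rows of the table.

```python
# "Diffuse falling" judgement counts, collaborative session, Table 5.6
# (60 judgements per consonant: 20 tokens x 3 judges).
falling = {
    ("Eng", "b"): 57, ("Eng", "d"): 3, ("Eng", "g"): 0,
    ("Tel", "p"): 39, ("Tel", "t1"): 0, ("Tel", "t2"): 3, ("Tel", "k"): 0,
    ("Fre", "p"): 45, ("Fre", "t"): 6, ("Fre", "k"): 0,
}
bilabial = {"b", "p"}

# Tally the non-bilabial rows only.
non_bilabial_falling = sum(v for (lang, c), v in falling.items() if c not in bilabial)
non_bilabial_total = 60 * sum(1 for (lang, c) in falling if c not in bilabial)
rate = 100.0 * non_bilabial_falling / non_bilabial_total
print(round(rate, 1))  # 2.9
```

Twelve of 420 non-bilabial judgements (seven rows of 60) were "diffuse falling", which rounds to the 2.9% quoted above.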
From Figure 5.4, it may also be deduced that co-articulation effects were more prominent for velars than for coronals. Such context dependencies with respect to the following vowel were not observed during the bursts in bilabials. In Section 4.3.3, it was suggested that the spectral shape of the velar burst was most affected by the vowel context. For the velar place of articulation, it has been shown that the cavities posterior and anterior to the constriction have roughly the same resonance frequencies [66]. Since the shape of the front cavity has a greater degree of freedom than the shape of the back cavity, its resonance frequency varies more than the resonance frequency of the back cavity. Velars in the context of back vowels displayed a compact spectrum during the burst in the WD plots. For velars followed by front vowels, the front cavity has a higher resonance frequency than the back cavity. This difference in resonance frequencies appears to cause a diffuse spectral shape in the WD plots of velars in the context of front vowels. The FOD might be related to the inertia of the articulators, rather than to the shape of the vocal tract. Hence, spectral shape along with FOD should specify the invariant cues for the place of articulation in velars. Results of the present study indicate that this is indeed the case. Figure 4.6 shows the WD of the beginning of the syllable [ge], spoken by a native English speaker. The spectral shape is diffuse non-falling, rather than compact, because of the vowel context. The FOD for this token was 32 ms.

5.4.2 Formant Onset Duration and Burst Duration Statistics

Figures 5.5, 5.6 and 5.7 show, respectively, scatter plots of FOD-BD vs FOD for English, Telugu and French. The value of BD is around 10 ms in most situations. These figures show that there is considerable overlap in the values of FOD for coronals and velars.
It may also be deduced that FOD increases as the place of articulation moves from the lips to the velum. FOD is believed to be a more robust measure for distinguishing different phonetic categories than the "late onset of low-frequency energy" feature proposed by Kewley-Port. The late onset of low-frequency energy is a direct measure of the VOT [49, pp. 325] and would not work for pre-voiced velars. For a given place of articulation, voicing can occur prior to the release, along with the release or after the release, depending on the phonemic category [54,12]. For this reason, the FOD was defined such that it is related to the time taken for the release, rather than the onset of voicing. Unlike VOT, the value of FOD is always positive. Its value is related to the time taken for the opening at the place of articulation to reach sufficient area to allow steady state resonances in the vocal tract.

[Figure 5.5: FOD-BD vs FOD scatter plot for English tokens.]

[Figure 5.6: FOD-BD vs FOD scatter plot for Telugu tokens.]

[Figure 5.7: FOD-BD vs FOD scatter plot for French tokens.]

5.5 Summary

In this chapter, the results of a cross-language study to investigate acoustic invariance in stop consonants were presented. It was possible to specify the place of articulation for about 87% of the English tokens, 88% of the Telugu tokens and 83% of the French tokens, based on invariant patterns in the WD plots. The performance was best for velars in all three languages (average of 98.3%) and the alveolars and dentals of Telugu (average of 97.5%). There were no context dependencies in specifying the place of articulation, i.e. the errors in specifying the place of articulation were uniformly distributed over all vowel contexts.
Further investigations, with larger corpora, are suggested for the coronals of English and French and the bilabials of Telugu and French. The results presented in this chapter clearly indicate that the WD is better suited than conventional short-time spectral estimation methods to studying the nature of stop consonants.

Chapter 6

Summary and Suggestions for Future Research

6.1 Summary

In this investigation, the partially smoothed WD was used in a cross-language study of stop consonants. The purpose of the investigation was to identify invariant acoustic patterns in the WD plots which might specify the place of articulation in stop consonants. Contributions of this study include: (i) developing the partially smoothed WD as a tool for transient signal analysis, and (ii) demonstrating evidence in support of acoustic invariance in stop consonants. A summary of both topics is presented in this chapter, and some suggestions for extending this work are presented.

6.1.1 Wigner Distribution

The WD is a new signal processing tool that is better suited for analyzing transient signals than conventional short-time spectral analysis methods. It has been successfully used in many areas, such as loudspeaker analysis, biological signal processing, seismic signal analysis, etc. [44,59,61,4,6,42,43]. For speech signal analysis, however, reports on applications of the WD in the literature have been limited [10,28,29,31,1,65,72]. In Chester's work [10], real signals were used to compute the WD and the problem of cross-terms in the WD was not addressed. Atlas et al. [1] also used real signals to compute the WD. In the works of Atlas et al. and Riley [1,65], the WD analysis was equivalent to that of a spectrogram, since the WD was smoothed for positivity. For bilinear distributions of Cohen's class, the requirements of positivity and correct marginals are incompatible.
The spectrogram and related short-time spectral estimation techniques are positive, but do not have the property of correct marginals. Consequently, they fail to localize time or frequency events in the signal to their respective time-frequency coordinates in the distribution. The WD, on the other hand, has the property of correct marginals. It has, nevertheless, other undesirable properties such as cross-terms and negative regions. The requirements of positivity and correct marginals are often emphasized in the literature when proposing time-frequency distributions for signal analysis. By sacrificing both these properties, it is possible to obtain a distribution that has: (a) better time-frequency localization properties than short-time spectral estimation techniques, and (b) cross-terms of small enough amplitude as to have a negligible influence on the extraction of useful information. The partially smoothed WD presented in this dissertation is such a compromise. Such approaches are not generally accepted in the literature (see Cohen's comments in [18]). The above compromise has, nevertheless, proved to be very useful in understanding the nature of stop consonants in speech signals. The salient points regarding the development of the partially smoothed WD as a tool for analyzing transient signals are summarised below:

• It was shown that the LP smoothed spectra were not suitable for studying the nature of the burst in stop consonants. This method not only introduces a bias in the time direction between successive spectra, but also ignores the portion of the signal that does not correspond to a typical steady state vowel (Chapter 2, page 12).

• The WD smoothed for positivity and the spectrogram were shown to be equivalent, in terms of resolution in the time domain and the frequency domain (page 24). An interpretation of the partially smoothed WD for multicomponent signals was presented (page 25).
• The implementation of the partially smoothed WD was discussed in detail. A FORTRAN callable subroutine for computing the partially smoothed WD was also included (page 27 and Appendix B).

• In order to overcome the drawbacks of the cross-terms in a partially smoothed WD, it is essential to have an understanding of the nature of the cross-terms. The nature of the cross-terms in the WD of multicomponent signals was illustrated in Section 3.6, by expressing the signal in terms of simpler signals for which the cross-terms could be readily written.

• The choice of time-domain and frequency-domain smoothing parameters was discussed in Section 3.7. Spectral wrap was demonstrated in the WD of analytic signals. The choice of the frequency-domain smoothing parameter H should take the spectral wrap into consideration. It was pointed out that a priori knowledge of the signal is useful in choosing these parameters and that the choice essentially depends on the temporal and/or spectral separation between regions of interest in the signal. This was demonstrated using two chirps of opposite slopes in Figure 3.12. As the separation between the self-terms decreases in the frequency domain, the effect of cross-terms is seen to become more prominent.

• In a recent paper, Cohen pointed out that the WD is not necessarily zero when the signal itself is zero [18]. It was shown that these terms do not inhibit extraction of useful information from the WD. The time-domain window used to compute the WD on a digital computer suppresses these contributions in the WD where the signal is zero, to an acceptable level (see Section 3.7.3).

6.1.2 Acoustic Invariance

According to the notion of "acoustic invariance," there are invariant acoustic patterns present in the speech signal, which are related to the phonetic description of the signal in an abstract way.
Many researchers have studied the time-varying spectral properties of stop consonants in order to identify these patterns [20,22,68,2,49,51]. These investigations were essentially based on short-time spectral analysis methods: spectrographic analysis, or time-varying LP smoothed spectra. Although these investigations unequivocally established the feasibility of extracting invariant cues in certain contexts from speech using signal processing methods, they failed to provide a set of necessary and sufficient invariant cues in any given phonetic context. This context dependency is often quoted to argue that it is not possible to extract invariant patterns in the acoustic domain that would correlate with the phonetic description of a speech segment [53]. From the discussions in Chapter 2 (page 12), it is clear that short-time spectral estimation techniques are not suitable for investigating invariant acoustic patterns in stop consonants. These methods not only introduce a smear across the burst and the vowel regions (giving rise to context dependencies), but also fail to provide sufficient information about the signal which might explain the coarticulation effects. The partially smoothed WD was used in a cross-language study of 200 stop consonants, spoken by native speakers of English, Telugu and French. The salient results of this cross-language study are as follows:

• The context dependency reported in the literature for bilabials (in the context of [i]) was partly due to the temporal smear between the burst and the following vowel, introduced by the analysis method. The context dependencies reported for alveolars and velars (in the context of [u] and [i] respectively) are attributed to the inability of those methods to provide sufficient detail with which to characterize the coarticulation effects.

• The temporal information in the WD was used to reduce the context dependency in specifying the invariant cues for place of articulation.
The term "Formant Onset Duration" was defined to quantify the time taken for the steady state formants to set in after the release.

- It was shown that the "diffuse" and "compact" spectral shapes occur in both alveolars and velars, depending on the vowel context. The FOD and F2 of the following vowel were used to resolve ambiguities in the spectral shape during the burst.

- About 78% of the bilabials, 81% of the alveolars and 98% of the velars were assigned the correct place of articulation. Approximately 87% of the English tokens, 88% of the Telugu tokens, and 83% of the French tokens were identified correctly.

- Unlike in other investigations, the errors in specifying the place of articulation were uniformly distributed over all vowel contexts.

• Both dentals and alveolars of Telugu were shown to have a diffuse non-falling spectral shape during the burst, which supports the theoretical predictions of Blumstein and Stevens [2]. That Lahiri et al. [51] were unable to demonstrate the same finding for stop consonants from French and Malayalam is attributed to the inadequacy of the LP analysis.

Lahiri et al. [51] obtained correct identification for 89.3% of the bilabials, 94% of the dentals and 92.5% of the alveolars using time-varying LP smoothed spectra. It is, however, not possible to compare their results with the results presented in this dissertation, for two reasons: (i) they did not consider velars, and (ii) the constraints on their experiment were quite different from those used in the present study. The corpus for the final experiment in [51] included 100 English tokens, 100 French tokens, and 95 of the 175 Malayalam tokens that were analyzed in earlier pilot studies; and 198 additional tokens from French that were not analyzed in earlier pilot studies. Hence, the 493 tokens (on which their results were based) included 295 tokens that were used to determine the invariant patterns.
6.2 Suggestions for Future Research

The discussions in Chapter 3 suggest that the WD is a useful technique for analyzing transient signals, in spite of cross-terms and negative regions. By sacrificing the properties of correct marginals and positivity to some extent, the partially smoothed WD can be used to extract more information from the signal than is possible with short-time spectral analysis techniques. The partially smoothed WD can be used to analyze many other nonstationary signals in addition to speech signals. All the applications involving short-time spectra of transient signals can be considered as suitable candidates for the WD analysis techniques, e.g., EEG and EMG signal analysis for early detection of abnormal pathological conditions, harmonic analysis of transients in high voltage DC transmission systems, etc. [32].

Boudreaux-Bartels [7,8] and Yu and Cheng [73] presented techniques for synthesizing the signal from a modified WD. Boudreaux-Bartels pointed out the importance of including the cross-terms in the time-frequency description, in order to obtain a good approximation to the synthesised signal. This information is lost when the WD is partially smoothed. A signal synthesis technique from the partially smoothed WD would be very useful in many applications. For example, speech signals with desired properties in the time and frequency domains could be synthesised and their perception studied. This would be similar to the Haskins Lab's pattern playback machine, which proved to be invaluable in many investigations.

Simple peak picking routines can be written to detect and measure the visual parameters described in Section 4.3.2. This would enable the experiments (described in Chapter 4) to be conducted on a large data base. The study could be extended to nasals, fricatives and other phonetic categories, and would allow the investigation of the properties of aspiration.
In speech research, narrowband and wideband spectrograms are used to study different aspects of the signal. Narrowband spectrograms suffer from poor temporal resolution and wideband spectrograms suffer from poor frequency resolution. The WD plots used in the experiment have a frequency resolution equivalent to that of a narrowband spectrogram and a temporal resolution equivalent to that of a wideband spectrogram. It is possible to alter the smoothing parameters dynamically, depending on some arbitrary parameter specifying the non-stationarity of the signal. It seems reasonable to assume that the auditory system analyzes transient signals with fine temporal resolution and stationary signals with fine spectral resolution. A similar strategy in computing and displaying the WD plots might enable the visualization of invariant acoustic patterns in speech with greater facility, and thus a closer match with peripheral auditory analysis.

Appendix A

Data Acquisition

In this appendix, the setup used for acquiring speech data in this thesis is presented.

A.1 Speech Signals

A.1.1 Data for Pilot Study

In the pilot study, three native speakers of English, French and Telugu (JG, AP and HG respectively) participated in recording the data. The informants spoke a list of CV syllables from a typewritten page. The list contained each of the consonants shown in Table 4.1, except the dentals of Telugu, followed by the vowels [i, a, u]. There were 18, 18 and 36 stop consonants from English, French and Telugu. The subjects were asked to pause before each CV syllable in order to minimize the co-articulation effect. Two sets of the CV syllables were recorded and the second list was chosen for digitization.

A.1.2 Data for the Experiment

In the final experiment, six informants participated in recording the data: JG and DG for English, AP and JB for French and HG and CS for Telugu.
The informants were provided with a list of CV syllables containing the nasal consonant /m/, all the consonants of their language shown in Table 4.1, and the nasal consonant /n/, each followed by the vowels [i, e, a, o, u]. The list was recorded in triplicate for each informant. The last two sets were chosen for digitization. The unaspirated, voiceless tokens from this set were chosen for final analysis. This resulted in 60 tokens each for English and French and 80 tokens for Telugu (including 20 dental stop consonants).

A.2 Instrumentation

A.2.1 Recording Setup

The analog recordings were made in a sound-proof room on a Revox tape recorder. An AKG microphone was used for recording the pilot data. A monotone signal at 120 Hz was played through the headphones, and the subjects (informants) were requested to try to maintain a constant pitch throughout the recording.

A.2.2 Digitization

The data (i.e., speech segments) were low-pass filtered to 4.8 kHz using a Krohn-Hite filter and digitized using an LPA-11 I/O peripheral on a VAX-11/750 system. The data were digitized at a 10 kHz sampling frequency. A rudimentary waveform editing program was developed to capture the burst and the beginning of the vowel. The program was capable of scanning the data for the burst, plotting different sections of the data, and playing them through a D/A converter. The digitized data were edited to extract 153.2 msec of the signal covering the burst and the vowel. The edited data were transferred to the UBC General main frame computer on magnetic tape.

A.2.3 Computing and Plotting the Wigner Distribution

On the main frame, the discrete-time signal of each token was passed through a Hilbert transformer, as explained in Chapter 3, to form a complex analytic signal. The names of the complex data files were formed from the initials of the speaker, the session number and the intended utterance.
For instance, the data file JG1GA referred to speaker JG, session #1, and the intended utterance [ga]. An extensive menu-driven FORTRAN program called WIG4C.F was developed in order to conveniently compute and plot the WD. The program has features which modify the WD parameters and the plotting parameters interactively, and then plots the results on a terminal or a laser printer, or stores the output for later use. The program can also be used in batch mode for computing the WD of large corpora economically. The usage of WIG4C.F and the parameters used in computing the WD of the speech data in this experiment are discussed in Appendix B.

A.2.4 Data Acquisition for the Cross-Language Study

The data was digitized on a PDP 11/23 and transferred to an HP9000 computer over an HPIB link for editing. A basic waveform editing facility was developed on the HP9000 to locate the burst. The digitized data was transferred back and forth between the PDP and the HP9000 for audio feedback during the editing to locate the burst. Owing to space constraints on the hard disk of the HP9000, the edited data of each utterance was stored on tape before proceeding to the next utterance. The voiced and voiceless unaspirated consonants were digitized and edited. The data was transferred from the HP9000 to the VAX and converted from UNIX format to ASCII. The ASCII data were then transferred to the main frame for computing and plotting the WD. The procedure for preprocessing the data using a Hilbert transformer and computing the WD was similar to that used in preparing the pilot data. It should be noted that a lack of proper A/D and D/A facilities made consideration of a larger corpus impossible. The procedures described above, i.e., transferring the data between the PDP 11 and the HP9000 many times for each token and then converting the data to VAX format for exporting to the UBCG main frame, were extremely time-consuming and costly.
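The preprocessing pipeline used in both studies (a Hilbert transformer to form the complex analytic signal, followed by the WD computation of Chapter 3) can be sketched in a few lines of modern code. This is a minimal numpy reconstruction for illustration only, not the WIG4C.F program: the FFT-based analytic signal and the rectangular-windowed pseudo-WD below are standard textbook forms, and all names are assumptions.

```python
import numpy as np

def analytic_signal(x):
    # FFT-based Hilbert transformer: suppress negative frequencies
    # (assumes an even-length, real-valued input).
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = h[N // 2] = 1.0
    h[1:N // 2] = 2.0
    return np.fft.ifft(X * h)

def pseudo_wd_slice(z, n, M=64):
    # One time slice of the windowed (pseudo-)Wigner distribution of the
    # analytic signal z at sample n, using 2*M lags; returns the magnitude
    # over 2*M frequency bins.
    kernel = np.zeros(2 * M, dtype=complex)
    for i in range(2 * M):
        tau = i - M
        if 0 <= n + tau < len(z) and 0 <= n - tau < len(z):
            kernel[i] = z[n + tau] * np.conj(z[n - tau])
    return np.abs(np.fft.fft(kernel))
```

For a pure tone, the slice concentrates at a single bin; sliding n and stacking the slices gives the kind of time-frequency image that WIG4C.F contoured and plotted in perspective.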
Appendix B

The Program for Computing the Wigner Distribution

An extensive menu-driven FORTRAN program WIG4C.F was developed to compute and plot the WD conveniently. The WIG4C.F program can be run either interactively, to experiment with the various parameter values, or in batch mode, to compute the WD of large corpora economically. The program was developed based on the principles presented in Chapter 3. This appendix concerns the usage of WIG4C.F to compute and plot the WD of a given signal. The choice of the various parameter values used in computing the WD of the corpus is also explained. This appendix is intended as a manual for the version of WIG4C.F on the UBC MTS main frame computer.

The program has three major sections: the shell, the WD section, and the plotting section. The shell forms the interface between the user and the other sections of the program. The prompt CMD> is displayed in the shell. Figure B.1 depicts the help screen of the shell. The help command displays all the commands available to the user.

    #run wig4c.o+*disspla
    #Execution begins   14:47:53
    Wigner Distribution. Version 4.c    Type H for Help
    CMD>help
    To DISPLAY the Wigner parameters        type DW
    To DISPLAY the Plot parameters          type DP
    To MODIFY the Wigner parameters         type MW
    To MODIFY the Plot parameters           type MP
    To COMPUTE the Wigner Distribution      type C
    To PLOT 3-D Perspective view            type PP
    To PLOT the Contours                    type PC
    To PLOT the Input signal                type PI
    To PLOT All the three plots             type PA
    To STOP the execution of the program    type S
    To get this HELP facility               type H
    To go to MTS                            type # or E or Q.
    CMD>

    Figure B.1: Help Screen of WIG4C

The current values of the Wigner and the plot parameters can be displayed using the DW (display Wigner) and DP (display plot) commands respectively. Figures B.2 and B.3 show the default values of the various parameters when the program is started.

    CMD>dw
     1) Input file                      = -1*
     2) Length of Segment               = 384 points
     3) Starting point                  = 1
     4) WD starts at                    = 65 (window center)
     5) Computing WD till               = 320 (window center)
     6) WD computed every               = 10 points
     7) Frequency range spec            = 128 points
     8) High-shaping (Y/N)              = Y
     9) Time smoothing (Y/N)            = Y
    10) Output file                     = -WIG1
    11) Sampling Frequency              = 10000
    Smear in the frequency direction..   117.19 Hz.
    Smear in the time direction.......   2.1 ms.
    OMEGA * T.........................   0.25
    Smoothing occurs over 21 data points
    The identification code for this computation is
    -1# #384/1/65-320:10# 128X26/266* 117/21/26X* 200(A)*
    CMD>

    Figure B.2: Default Wigner Parameters

    CMD>dp
     1) Input array name                        = -WIG1
     2) Array dimensions (f x t)                = (128, 26)
     3) Total signal duration (points)          = 256
     4) Rectify the data? (Y or N)              = Y
     5) Log of the data? (Y or N)               = Y
     6) Scaling factor (Normalized)             = 100.0
     7) Azimuth viewing angle                   = -60.0
     8) Elevation viewing angle                 = 60.0
     9) Number of contours                      = 13
    10) Output on Term/File (T or C or Q or P). = T
    11) Negative Freq. axis (Y or N)            = N
    12) Time axis Fitting? (Y or N)             = Y
    13) Legend for the plot is
        -1* #384/1/65-320:10# 128X26/256S 117/21/25X* 200(A)*
    14) Suppress parts of WD? (Y or N)          = N
    15) Scaling factor (Absolute)               = 200.0
    16) Log mode (Absolute or Normalized)       = A
    CMD>

    Figure B.3: Default Plot Parameters

The commands MW (modify Wigner) and MP (modify plot) can be used to modify the 11 Wigner parameters and the 16 plot parameters. During the modification of the Wigner and plot parameters, the prompt changes to MODWIG> and MODPLT> respectively. In order to change a parameter, the user types in the serial number of that parameter, upon which the system prompts for the desired value. The display command D can be used to check the current status of the parameters within the modify routines.
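The smear figures in the default display of Figure B.2 (117.19 Hz, 2.1 ms, OMEGA * T = 0.25, smoothing over 21 data points) are mutually consistent under one simple reading, sketched below. Both the 1.5·fs/WL form of the frequency smear and the interpretation of Ω·T as (frequency smear in Hz) × (time smear in seconds) are assumptions reverse-engineered from the displayed numbers, not a description of the actual WIG4C.F code.

```python
FS = 10_000   # sampling frequency in Hz (Wigner parameter 11)

def smear(window_len, omega_t=0.25, fs=FS):
    """Hypothetical reconstruction of the smear figures reported by the
    DW display: frequency smear proportional to the window mainlobe
    width, time smear chosen so that freq_smear_hz * time_smear_s
    equals omega_t."""
    freq_smear_hz = 1.5 * fs / window_len
    time_smear_s = omega_t / freq_smear_hz
    return freq_smear_hz, time_smear_s, int(time_smear_s * fs)

# Under these assumptions, the defaults (window length 128, Omega*T = 0.25)
# reproduce 117.19 Hz, about 2.1 ms, and smoothing over 21 data points.
```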
A brief description of these parameters, along with the typical values used in computing the WD of the corpus, is given here.

Wigner Parameters

1. The file which holds the complex analytic version of the signal for which the WD is to be computed.

2. The time duration of the signal (in number of samples).

3. The starting point of the data in the file.

4. The first point in the data file for which the WD is desired. This is normally chosen as IF/2 + 1, where IF is the window length used in computing the WD, since to compute the WD at a given point, IF/2 past samples and IF/2 future samples are needed. If the past and future samples are not available, the program pads the data with zeros.

5. The last point in the data file for which the WD is desired. This is normally chosen as the duration of the signal, less IF/2.

6. This parameter specifies the decimation of the WD in time. Generally, its value should not exceed the larger of

   • the duration of quasi-stationarity in the signal, or
   • the amount of time-domain smoothing in computing the smoothed WD.

   In the present study, one WD was computed for every 10 samples (i.e., 1 msec for a sampling frequency of 10 kHz).

7. The length of the time-domain window IF, in number of samples (usually chosen as a power of 2). This parameter relates to the extent of smear in the frequency domain. In the present study, IF was chosen as 128.

8. A Boolean variable to specify pre-emphasis on the data. The speech signal in the corpus was pre-emphasized prior to computing the WD.

9. A Boolean variable to switch the time-domain smoothing on or off. In order to change the current level of smoothing, the parameter has to be reset to 'Y'. The program then computes the time duration required for Ω·T = 1 smoothing and prompts the user for the desired amount of smear in the time domain. In computing the WD of the corpus, 2.1 msec of smear in the time domain was chosen, corresponding to Ω·T = 0.25.

10. The output file to store the WD.
The program provides -WIG1, -WIG2, ... as default files for every new computation.

11. The sampling frequency. In the present study, the signal was sampled at 10 kHz.

Plot Parameters

1. The file containing the WD of the input signal as a 2-D array. The default is the file in which the results of the most recent WD computation were stored.

2. The dimensions of the array in the input file (updated every time a new WD computation is performed).

3. The time duration of the signal (in number of samples). This information is necessary to set up the correct axes in plotting the WD. It holds the value corresponding to the default input file after a new computation.

4. A Boolean variable to delete the negative regions in the WD.

5. A Boolean variable to switch on or off a pseudo-log operation on the data prior to plotting.

6. A constant for scaling the normalised WD data prior to computing the log. Values in the range of 100 to 500 were found useful in amplifying the higher formants, while keeping the level of cross-terms low.

7. The azimuth viewing angle of the camera for the 3-D plot. The default value is -60°.

8. The elevation angle for the camera. The default value is 60°.

9. The number of contours to be used in the contour map. The default value is 13.

10. The output device for the plots. The plot output can be produced on one of the following devices:

   • T — Terminal (default)
   • C — Compressed plot file
   • Q — QMS-1200 Lasergraphics laser printer
   • P — PostScript file for plotting on the QMS-ps800 laser printer

11. A Boolean variable to get the frequency axis from −π to π instead of from 0 to π.

12. A Boolean variable set to Y. Retained for historic reasons.

13. Legend for the plot. When a new WD is computed, the Wigner section sets the legend with the values of some of the parameters. This legend can be overwritten by the user. There are six fields of entry in the computed legend, each separated by a '#' sign.
The entries within a field are separated by a '/'. Each field of entry in the legend is explained below.

   (a) Name of the file containing the time domain signal.
   (b) Wigner parameters 2, 3, 4, 5 and 6.
   (c) Plot parameters 2 and 3.
   (d) Frequency smear in Hz, time smear in number of samples, and Ω·T in percentage.
   (e) The log scaling factor followed by plot parameter 16.
   (f) The peak value in the WD, prior to scaling and the log operation.

14. A Boolean variable to invoke an editor. This editor was used to set given regions in the time-frequency plane to 0. This parameter was built in to analyze somatosensory evoked potentials using the WD. It was set to N (for 'no' suppression) in speech research.

15. A constant to scale the WD data without prior normalization. This enables one to compare the plots with one another, since the absolute amplitude relationships within a corpus are preserved. A value of 200 was used in computing the WD of both the pilot data and the final data.

16. A character variable to choose between normalised log and absolute log. The absolute log was used in this study.

After modifying the Wigner and plot parameters to the desired values, the WD can be computed using the command C. This command also sets the plot parameters for plotting the current WD. There are four commands available for plotting the 3-D perspective view, the contour map, the input signal, or all three plots on a single plot page. The command PA was used in plotting the WD of the corpus. The input signal is plotted in the top frame, the contour map in the middle frame, and the perspective view in the bottom frame. The computed legend is printed at the bottom of the page. See Figure 4.3 for an example. The user can go to MTS without unloading the program from memory by typing #, E or Q. The command S is used to stop the execution of the program.

Appendix C

Instructions for Identifying Stop Consonants from the Wigner Plots

C.1 Preamble

It is believed that there are "invariant acoustic features" present in the speech signal which correspond to a given phone. These features are expected to manifest themselves in the speech signal in the form of unique time-frequency patterns, across all languages and in all phonetic contexts. This research is directed towards establishing whether such features can be identified for stop consonants in the Wigner plots. The Wigner distributions (WD) of the bilabial, alveolar and velar stop consonants [b, p], [d, t] and [g, k], followed by the vowels [i], [e], [a], [o] and [u], were computed. The CV syllables were uttered by native speakers of English, French and Telugu. The Wigner plots you will be analyzing have three subplots, referred to as Plot A, Plot B and Plot C. Plot A shows the speech wave in the top frame; Plot B and Plot C depict the WD of this signal as a contour map and a perspective plot, in the middle and bottom frames respectively. The signal duration in the plots is 51.2 ms, 78.6 ms, or 115.2 ms.

C.2 Overview of the Task

You will be requested to make rough estimates of the visual features of the WD plots. These features, comprising mainly boundaries and shapes, are described below. Three templates are provided to facilitate the analysis. The program prompts you for each of these features and stores your answers. Based on your estimates, the program then classifies the graph according to the place of articulation of the consonant as bilabial, alveolar or velar. The experiment begins with your logging onto the PDP 11/73 and running the program FEATURE. The program keeps track of your earlier sessions, so that you can start and stop the task whenever you want and work on it at your convenience.

C.3 Feature Definitions

All the CV syllables presented here have a small segment of silence followed by a burst and the vowel region.
In the first stage of the program, you will be identifying the start of the burst, the start of the vowel, and the frequency of the second formant of the vowel. In the second stage, you will be requested to identify the spectral shape within the burst region. The program will then determine whether this burst region corresponds to the release of a stop or to some irrelevant acoustic artefact. If it corresponds to an artefact, you will be requested to ignore this region of the plot and identify another region in the burst. You will then be requested to estimate new values for the start of the burst, the start of the vowel and the spectral shape again. This process will be repeated until reliable burst cues are identified, or until you decide to move to the next plot. If a reliable spectral shape has been identified, you will be prompted for the burst duration. The program will then identify the place of articulation and proceed to the next plot. The features that you will be measuring are defined below. Do not hesitate to refer back to these definitions whenever you feel the necessity.

1) Start of the Burst

The start of the burst is the time at which peaks start to appear in the contour and 3-D plots. The beginning of the burst can also be seen in the speech plot (Plot A), where the signal departs from zero and becomes noise-like. Generally, it is easiest to measure the start of the burst on the contour map (Plot B). See Figures 1, 2, 3 and 4 (these figures are included in Chapter 4 as Figures 4.1, 4.3, 4.4 and 4.2).

2) Start of the Vowel

The start of the vowel is characterized by the appearance of a well-defined formant structure in Plots B and C. Typically, the formant structure has the following features:

• At the beginning of each pitch period, a spectral line appears in which the troughs are not as deep. This can usually be observed in both Plot B and Plot C.

• 2 to 5 peaks distributed in frequency and extending in time (formants).
Sometimes, the burst can excite the vocal tract, resulting in a formant-like structure in the WD plots. This formant-like structure looks a little different from the formant structure excited by phonation alone. If you notice this in the WD plots, take the beginning of the formants due to phonation as the onset of formant structure. At this time instant, a large downward deflection can normally be observed in Plot A. See Figures 1, 2, 3 and 4.

3) Second Formant

The second formant frequency can be measured either in Plot B or in Plot C using the appropriate template. In Plot B, you should align the grid on the successive peaks corresponding to the second formant and read the corresponding frequency. If you use Plot C, superimpose the vertical axis of the template (on Plot C) on its homologue on the graph until the last spectral line on the template is at a tangent to the top of the second formant on the last spectral line of the graph. The second formant frequency can then be read off the frequency axis of Plot C. This will become clearer once it has been demonstrated to you in the training session. (Note: the second formant measurement can sometimes be tricky in both B and C. In Plot B, it is not always easy to differentiate the valleys from the peaks. In Plot C, the second formant peak in the last spectral line can be hidden by another peak in front of it. If the second formant measurement is ambiguous in both B and C, your best approximation is adequate.)

4) Spectral Shape

The spectral shape of the burst region can be classified into three classes. When you are prompted as to whether the spectral shape of the burst falls into one of these three categories, please answer "Y" for yes and "N" for no. For the three possible spectral shapes, the typical visual cues are listed below.

a) Compact Pattern

A compact pattern can be recognized by the presence of one or more of these visual features.

• A single major peak, above 700 Hz in the frequency domain.
Usually, but not always, occupying a region of less than 1.5 kHz. Ignore the peak below 700 Hz, if there is one.

• More than one peak in the frequency domain, but one of these peaks is much larger in amplitude than the others, making the spectral shape appear more compact than diffuse. In this case, the number of contours in the tallest peak is at least twice that of the other peaks in Plot B.

• Several successive peaks in the time domain, near the frequency of the second formant of the vowel. These peaks need not have less than 1.5 kHz spread in the frequency domain. This feature alone is sufficient for you to classify the spectral shape as compact.

• Sometimes, there is some energy in the low frequency region (0 to 700 Hz) due to pre-voicing. This is usually referred to as the pre-voicing bar. Unlike the peaks of the burst, the amplitude of the pre-voicing bar does not decrease prior to the start of the vowel. Sometimes, the pre-voicing bar can make the compact spectral shape look like a diffuse pattern. The pre-voicing bar should not be considered in determining the compact spectral shape. See Figures 1 and 4.

b) Diffuse Non-Falling Pattern

A diffuse non-falling pattern can be recognized by the presence of one or more of these visual features.

• Two or more significant peaks in the burst spectrum, covering a range of 1.5 kHz or more.

• A decrease in the amplitude of the burst peaks in time, prior to the start of the vowel.

• Sometimes, there is some energy in the low frequency region (0 to 700 Hz) due to pre-voicing. This is usually referred to as the pre-voicing bar. Unlike the peaks of the burst, the amplitude of the pre-voicing bar does not decrease prior to the start of the vowel. In general, the amplitude of the pre-voicing bar is not large enough to make a diffuse pattern look like a falling pattern. The pre-voicing bar should not be considered in determining the diffuse non-falling spectral shape. See Figure 2.
c) Diffuse Falling Pattern

A diffuse falling pattern can be recognized by the presence of one or more of these visual features.

• A rapid increase in energy in the low frequency region (0 to 700 Hz) over time.

• Often, no other peaks are present at higher frequencies.

• If there are peaks at higher frequencies, their amplitude is smaller than those of the low frequency peaks. Thus, the line joining the peaks in the frequency domain is always falling (negative slope).

• There is no noticeable burst region. The signal appears to go without transition from silence into the vowel. See Figure 3.

Once the spectral shape has been determined, the program asks a few questions to establish that the region being analyzed is the relevant portion of the burst and not an acoustic artefact.

5) Burst Duration (BD)

BD is defined as the duration from the start to the end of the burst. Measurement of BD for each of the spectral shapes is outlined below:

• For compact spectra with a single peak in the time domain, determine the tallest peak (choose the one with the largest number of contours in Plot B). The end of the burst is the time at which the number of contours is down from the maximum by half. When there are multiple peaks in the time domain, the duration of the first peak is taken as the BD. See Figures 1 and 4.

• For diffuse non-falling spectra, there is a decrease in the energy of the burst prior to the start of the vowel. Some of the peaks could be connected with the formants of the vowel. Choose the tallest peak in Plot B. The end of the burst is the time instant where the number of contours is down from the maximum by half. In situations where it is difficult to identify the end of the burst, it can be taken to be the time at which more than two peaks in the diffuse spectra disappear completely. See Figure 2.

• For diffuse falling spectra, the burst blends with the vowel. There is no noticeable decrease in the energy of the burst prior to the start of the vowel.
For this case, the end of the burst coincides with the start of the vowel. The BD is taken to be the duration between the start of the burst and the start of the vowel. See Figure 3.

After reading the value of BD, the program proceeds to identify the place of articulation. At this point, you are given the option of entering a comment to be stored in the data-base along with your analysis. The program then starts to analyze another plot, or quits processing, depending on your choice.

In the training session, you will be familiarized with some typical time-frequency patterns in the WD of stop consonants. Running the program FEATURE and measuring the various features of the WD plots using the templates will also be demonstrated. There are approximately 200 WD plots for which the above analysis is to be carried out.

C.4 Addendum

Some additional instructions and tips for using the program FEATURE are given here. These were compiled based on the analysis of the training session data. Modifications to FEATURE which affect the user interaction are also described.

• The maximum time reading of each Wigner plot and the second formant of all the utterances are stored in a file that can be accessed by FEATURE. This eliminates the need for the user to input this information to the program.

• In order to decide whether a peak aligns with the second formant, a tolerance of about ±200 Hz is permissible. A larger tolerance can be considered if the peak under consideration migrates into the second formant over this frequency range.

• Compact and diffuse non-falling spectral shapes generally have a frequency spread of less than 1.5 kHz and greater than 1.5 kHz, respectively. However, it is not necessary that they strictly conform to these ranges. You are encouraged to eyeball the first few milliseconds of the burst and decide on the spectral shape, unless the burst happens to belong to a borderline situation.
• There were many errors in the training session due to wrong identification of the Start of the Vowel (VST). In addition to the cues described in the original instructions, VST can also be identified by

  — an abrupt increase of energy in the low-frequency region (< 500 Hz);
  — the appearance of the second formant (see Plot #22 of the training session).

• Another source of errors in the training session was wrong classification of the spectral shape. Some hand-drawn pictures are provided on the next page to illustrate the various spectral shapes (this figure has been reproduced as Figure 4.5 in Chapter 4).

In general, you are requested to refer to the instructions whenever in doubt. It is a good idea to review the instructions before starting a session, especially if it has been more than a couple of days since your earlier session.

Bibliography

[1] Atlas, L. E., Zhao, Y., and Marks II, R. J., "Application of The Generalized Time-Frequency Representation to Speech Signal Analysis," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pp. 517-520, (June 1987).

[2] Blumstein, S. E., Stevens, K. N., "Acoustic Invariance in Speech Production: Evidence from Measurements of the Spectral Characteristics of Stop Consonants," J. Acoust. Soc. Am., 66, pp. 1001-1017, (1979).

[3] Blumstein, S. E., Stevens, K. N., "Perceptual Invariance and Onset Spectra for Stop Consonants in Different Vowel Environments," J. Acoust. Soc. Am., 66, pp. 648-662, (1979).

[4] Bouachache, B., "Wigner Analysis of Time Varying Signals: An Application in Seismic Prospecting," in Signal Processing II: Theories and Applications, H. W. Schüssler (Ed.), Elsevier Science Publishers B.V. (North-Holland), EURASIP, pp. 703-706, (1983).

[5] Boashash, B., "On the Anti-Aliasing and Computational Properties of the Wigner-Ville Distribution," in Applied Signal Processing, M. H. Hamza, editor, Acta Press, Anaheim, (1985).

[6] Boashash, B., White, L., and Imberger, J.
, "Wigner-Ville Analysis of Non-Stationary Random Signals (with Application to Turbulent Microstructure Signals," in Proc. IEEE ICASSP-86, Tokyo, pp. 2323-2326, (1986). [7] Boudreaux-Bartels, G., "Time-Frequency Signal Processing Algorithms: Anal-ysis and Synthesis using Wigner Distributions," Ph. D. Dissertation, Rice University, Houston, Texas, (1983). 123 [8] Boudreaux-Bartels, G., and Parks, T. W., "Time-Varying Filtering and Signal Estimation Using Wigner Distribution Synthesis Techniques," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 442-451, June 1986. University, Houston, Texas, (1983). [9] de Bruijn, N.G., "Uncertainty Principles in Fourier Analysis," in Inequalities, O. Shisha (ed), Academic Press, New York, pp. 57-71, (1967). [10] Chester, D. B., "The Wigner Distribution and its Application to Speech Recog-nition and Analysis," Ph. D. Dissertation, University of Cincinnati, Ohio, (1982). [11] Chester, D., and Wilbur, J., "Time and Spatial Varying C A M and A l Signal Analysis using the Wigner Distribution," in Proc. IEEE ICASSP-85, Tampa, FI, pp. 1045-1048, (1985). [12] Chomsky, N., and Halle, M. , The Sound Pattern of English, Harper and Row, New York, (1968). [13] Claasen, T. A. C. M. , and Mecklenbrauker, W. F. G., "The Wigner Distri-bution — A Tool for Time-Frequency Signal Analysis," Parts I, II and III, Philips Journal of Research, Vol. 35, pp. 217-250, pp. 276-300 and pp. 372-389, (1980). [14] Claasen, T. A. C. M. , and Mecklenbrauker, W. F. G., "On Time-Frequency Discrimination of Energy Distributions: Can They Look Sharper Than Heisen-berg?," in Proc. IEEE ICASSP-84, (1984). [15] Cohen, L., "Generalized Phase-Space Distributions," J. Math. Phys., vol.7, pp. 781-786, (1966). [16] Cohen, L., and Zaparovanny, Y. I., "Positive Quantum Joint Distributions," J. Math. Phys., 21 (4), pp. 794-796, (1980) [17] Cohen, L., and Posche, T. E. , "Positive Time-Frequency Distributions," IEEE Trans. Acoust., Speech, Signal Processing, vol. 
ASSP-33, pp. 31-38, (Feb. 1985). 124 [18] Cohen, L., "On a Fundamental Property of the Wigner Distribution," IEEE Trans. Acoust, Speech. Signal Processing, vol. ASSP-35, No. 4, pp. 559-561, (Apr. 1987). [19] Cohen, L., "Wigner Distribution for Finite Duration or Band-Limited Signals and Limiting Cases," IEEE Trans. Acoust., Speech. Signal Processing, vol. ASSP-35, No. 6, pp. 796-806, (June 1987). [20] Cooper, F. S., Delattre, P. C , Liberman, A. M. , Borst, J. M. , Gerstman, L. J. , "Some Experiments on the Perception Synthetic Speech Sounds," J. Acoust. Soc. Am., 24, pp. 597-606, (1952). [21] Cole, R. A., and Scott, B., "The Phantom in the Phoneme: Invariant Cues for Stop Consonants," Perception &; Psychophysics, Vol. 15, No. 1, pp. 101-107, (1974). [22] Delattre, P. C , Liberman, A. M. , Cooper, F. S., "Acoustic Loci and Transi-tional Cues for Consonants," J. Acoust. Soc. Am., 27, pp. 613-617, (1955). [23] Fant, G. Acoustical Theory of Speech Production, Mouton, The Hague, The Netherlands, (1960). [24] Flandrin, P., and Escudie, B., "Time and Frequency Representation of Finite Energy Signals: A Physical Property as a Result of an Hilbertian Condition," Signal Processing (2), pp. 93-100, (1980). [25] Flandrin, P., "Some Features of Time-Frequency Representations of Multi-component Signals," in Proc. IEEE ICASSP-84, 41B.4, (1984). [26] Flandrin, P., and Martin, W., "A general Class of Estimators for the Wigner-Ville Spectrum of Non-Stationary Processes," in Lecture Notes in Control and Information Sciences, vol. 62. Berlin: Springer-Verlag, pp. 15-23, (1984). [27] Flandrin, P., and Escudie, B., "An Interpretation of the Pseudo-Wigner-Ville Distribution," Signal Processing (6), pp. 27-36, (1984). [28] Garudadri, H., Beddoes, M.P., Gilbert, J.H.V., and Benguerel, A-P., "Iden-tification of Invariant Acoustic Cues in Stop Consonants using the Wigner 125 Distribution," in Applied Signal Processing, M. H. Hamza, editor, Acta Press, Anaheim, (1985). 
[29] Garudadri, H., Benguerel, A.-P., Gilbert, J. H. V., and Beddoes, M. P., "Application of Smoothed Wigner Distribution (WD) to Speech Signals," J. Acoust. Soc. Am. Suppl. 1, 79, S94, (1986).

[30] Garudadri, H., Beddoes, M. P., Benguerel, A.-P., and Gilbert, J. H. V., "On Computing the Smoothed Wigner Distribution," in Proc. IEEE ICASSP-87, 35.12, (1987).

[31] Garudadri, H., Gilbert, J. H. V., Beddoes, M. P., and Benguerel, A.-P., "Invariant Acoustic Cues in Stop Consonants: A Cross-Language Study using the Wigner Distribution," J. Acoust. Soc. Am. Suppl. 1, 82, S55, (1987).

[32] The software developed in this thesis for computing and plotting the WD is being used by Dr. Eisen of the Faculty of Medicine at UBC to analyze EMG signals, and by Dr. Begleiter of State University of New York Medical College to analyze EEG signals. The author is also collaborating with Dr. Bhattacharya of ASEA, Milwaukee, to analyze the transients in high-voltage DC transmission lines. Dr. Clapton of the University of Michigan, Ann Arbor, and Dr. Rhode of the University of Wisconsin have expressed interest in obtaining a portable version of this software.

[33] Halle, M., and Stevens, K. N., "Some Reflections on the Theoretical Bases of Phonetics," in Frontiers of Speech Communication Research, edited by B. Lindblom and S. Ohman, Academic, London, (1979).

[34] Harris, F. J., "On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform," Proc. IEEE, Vol. 66, No. 1, pp. 51-83, (1978).

[35] Hlawatsch, F., "Interference Terms in the Wigner Distribution," presented at the International Conference on Digital Signal Processing, Florence, Italy, September 5-8, (1984).

[36] Hlawatsch, F., "Transformation, Inversion and Conversion of Bilinear Signal Representations," in Proc. IEEE ICASSP-85, 27.5, (1985).

[37] Hlawatsch, F., "Duality of Time-Frequency Signal Representations: Energy Density Domain and Correlation Domain," in Applied Signal Processing, M. H. Hamza, editor, Acta Press, Anaheim, (1985).

[38] Hlawatsch, F., "Unitary Time-Frequency Signal Representations," in Signal Processing III: Theories and Applications, I. T. Young et al. (Eds.), Elsevier Science Publishers B.V. (North-Holland), EURASIP, pp. 33-36, (1986).

[39] Hlawatsch, F., and Krattenthaler, W., "Unitary Time-Frequency Signal Representations," in Signal Processing III: Theories and Applications, I. T. Young et al. (Eds.), Elsevier Science Publishers B.V. (North-Holland), EURASIP, pp. 37-40, (1986).

[40] Hudson, R. L., "When is the Wigner Quasi-Probability Density Non-Negative?," Rep. Math. Phys., vol. 6, No. 2, pp. 249-252, (1974).

[41] IEEE Programs for Digital Signal Processing, IEEE Acoust., Speech, Signal Processing Group, New York, (1979).

[42] Imberger, J., and Boashash, B., "Application of the Wigner-Ville Distribution to Temperature Gradient Microstructure: A New Technique to Study Small-Scale Variations," J. Phys. Oceanogr., 16, pp. 1997-2012, (1986).

[43] Iyer, V., Ramamoorthy, P., and Ploysongsang, Y., "Autoregressive Modeling of the Wigner Spectrum," in Proc. IEEE ICASSP-87, 35.9, (April 1987).

[44] Janse, C. P., and Kaiser, A. J. M., "Time-Frequency Distributions of Loudspeakers: The Application of the Wigner Distribution," J. Audio Eng. Soc., 31(4), pp. 198-223, (Apr. 1983).

[45] Janssen, A. J. E. M., "Positivity of Weighted Wigner Distributions," SIAM J. Math. Anal., vol. 12, pp. 752-758, (1981).

[46] Janssen, A. J. E. M., "On the Locus and Spread of Pseudo-Density Functions in the Time-Frequency Plane," Philips Journal of Research, vol. 37, No. 3, pp. 79-110, (1982).

[47] Janssen, A. J. E. M., and Claasen, T. A. C. M., "On Positivity of Time-Frequency Distributions," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, No. 4, pp. 1029-1033, (1985).

[48] Krattenthaler, W., Hlawatsch, F., and Mecklenbrauker, W., "An Iterative Algorithm for Signal Synthesis from Modified Pseudo Wigner Distributions," presented at the IEEE 1986 Digital Signal Processing Workshop, Chatham, MA, (Oct. 1986).

[49] Kewley-Port, D., "Time Varying Features as Correlates of Place of Articulation in Stop Consonants," J. Acoust. Soc. Am., 73, pp. 322-335, (1983).

[50] Kostic, D., Mitter, A., and Krishnamurti, Bh., "A Short Outline of Telugu Phonetics," Indian Statistical Institute, Calcutta, (1977).

[51] Lahiri, A., Gewirth, L., and Blumstein, S. E., "A Reconsideration of Acoustic Invariance for Place of Articulation in Diffuse Stop Consonants: Evidence from a Cross-Language Study," J. Acoust. Soc. Am., 76, pp. 391-404, (1984).

[52] Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy, M., "Perception of the Speech Code," Psychol. Rev., 74, pp. 431-461, (1967).

[53] Lindblom, B., "Adaptive Variability and Absolute Constancy in Speech Signals: Two Themes in the Quest for Phonetic Invariance," in Proc. XI Int. Congress of Phonetic Sciences, Vol. 3, Tallinn, Estonia, U.S.S.R., pp. 9-18, (Aug. 1987).

[54] Lisker, L., and Abramson, A. S., "A Cross-Language Study of Voicing in Initial Stops: Acoustical Measurements," Word, 20, pp. 384-422, (1964).

[55] Makhoul, J., "Linear Prediction: A Tutorial Review," Proc. IEEE, vol. 63, No. 4, (April 1975).

[56] Marinovic, N. M., and Eichmann, G., "An Expansion of Wigner Distribution and its Applications," in Proc. IEEE ICASSP-85, 27.3, (1985).

[57] Markel, J. D., and Gray, Jr., A. H., Linear Prediction of Speech, Springer-Verlag, New York, (1976).

[58] Martin, W., and Flandrin, P., "Analysis of Non-Stationary Processes: Short-Time Periodograms versus a Pseudo-Wigner Estimator," in H. Schussler, Ed., EUSIPCO-83, Amsterdam: North-Holland, pp. 455-458, (1983).

[59] Martin, W., "Measuring the Degree of Non-Stationarity by Using the Wigner-Ville Spectrum," in Proc. IEEE ICASSP-84, 41B.3, (1984).
[60] Martin, W., and Flandrin, P., "Detection of Changes of Signal Structures by using the Wigner-Ville Spectrum," Signal Processing, vol. 8, No. 2, pp. 215-233, (1985).

[61] Martin, W., and Flandrin, P., "Wigner-Ville Spectral Analysis of Nonstationary Processes," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-33, pp. 1461-1470, (Dec. 1985).

[62] Peterson, G. E., and Barney, H. L., "Control Methods Used in a Study of the Vowels," J. Acoust. Soc. Am., Vol. 24, No. 2, pp. 175-184, (1952).

[63] Peterson, G. E., and Shoup, J. E., "A Physiological Theory of Phonetics," J. Speech Hearing Res., 9, pp. 5-67, (March 1966).

[64] Rabiner, L. R., and Gold, B., Theory and Applications of Digital Signal Processing, Prentice-Hall, Inc., Englewood Cliffs, N. J., (1975).

[65] Riley, M. D., "Beyond Quasi-Stationarity: Designing Time-Frequency Representations for Speech Signals," in Proc. IEEE ICASSP-87, 15.9, (1987).

[66] Stevens, K. N., "The Quantal Nature of Speech: Evidence from Articulatory-Acoustic Data," in Human Communication: A Unified View, E. E. David, Jr., and P. B. Denes, Eds., New York: McGraw-Hill, (1972).

[67] Stevens, K. N., and Blumstein, S. E., "Quantal Aspects of Consonant Production and Perception: A Study of Retroflex Consonants," J. Phonet., 3, pp. 215-234, (1975).

[68] Stevens, K. N., and Blumstein, S. E., "Invariant Cues for Place of Articulation in Stop Consonants," J. Acoust. Soc. Am., 64, pp. 1358-1368, (1978).

[69] Ville, J., "Théorie et Applications de la Notion de Signal Analytique" ["Theory and Applications of the Notion of the Analytic Signal"], Cables et Transmission, vol. 2 A(1), pp. 61-74, (1948).

[70] Wigner, E., "On the Quantum Correction for Thermodynamic Equilibrium," Phys. Rev., vol. 40, pp. 749-759, (1932).

[71] Wigner, E., "Quantum-Mechanical Distribution Functions Revisited," in Perspectives in Quantum Theory, edited by W. Yourgrau and A. van der Merwe, M.I.T., Cambridge, (1979).
[72] Wokurek, W., Hlawatsch, F., and Kubin, G., "Wigner Distribution Analysis of Speech Signals," presented at the Int. Conf. on Digital Signal Processing, Florence, Italy, (Sept. 1987).

[73] Yu, Kai-Bor, and Cheng, S., "Signal Synthesis from Wigner Distribution," in Proc. IEEE ICASSP-85, 27.7, (1985).

[74] Zue, V., "Acoustic Characteristics of Stop Consonants: A Controlled Study," Sc.D. thesis, MIT (unpublished), (1976).
