ESTIMATIONS OF GLOTTAL WAVES AND VOCAL-TRACT AREA FUNCTIONS FROM SPEECH SIGNALS

by

HUI QUN DENG

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Electrical Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA

April 2005

© Hui Qun Deng, 2005

Abstract

This study estimates glottal waves and vocal-tract area functions (VTAFs) from vowel sounds. Traditional estimations assume that glottal waves are zero over closed glottal phases, and that glottises and lips are terminated with constant impedances. In reality, these assumptions are invalid: glottal waves can hardly be zero, because incomplete glottal closures are common and acoustic disturbances occur during vocal-fold collisions; glottal impedances are time-varying during phonation; and lip radiation impedances are frequency-dependent. Consequently, traditional estimations yield biased and distorted estimates.

In this study, a method is developed which, for the first time, obtains unbiased vocal-tract filter (VTF) estimates from sustained vowel sounds over closed glottal phases. It assumes that glottal waves for such sounds are periodically stationary random processes, allowing non-zero glottal waves to exist over closed glottal phases. A new method for detecting glottal phases using vowel sounds is also developed. The effects of glottal and lip terminal impedances on VTF estimates are modeled realistically using high-pass and low-pass filters, respectively. The VTF estimates are used to obtain glottal waves from the vowel sounds. Moreover, a new method for deriving VTAFs from the VTF estimates over closed glottal phases is developed. It eliminates the distortion effects of lip radiation impedances on the VTAF estimates, assuming the glottises are completely closed. The effects of glottal losses on the estimates obtained using our methods are investigated.
It is shown that estimates from large-lip-opening vowel sounds are less affected by glottal losses than those from small-lip-opening vowel sounds. Theoretically, eliminating the degrading effects of glottal losses on the estimates requires that the lip-opening areas be known.

Glottal phases, glottal waves and VTAFs estimated using our methods from vowel sounds produced by male and female subjects contain detailed information. The obtained glottal phases were validated using electroglottograph signals. The obtained glottal waves increase during rapid vocal-fold collisions, and decrease or even increase during vocal-fold parting. The differences between the glottal waveforms of the two genders are explained by physiological differences in their larynxes. The VTAFs obtained from large-lip-opening vowel /a/ sounds of these subjects are very similar to the VTAF measured from an unknown subject's magnetic resonance image. Such detailed results cannot be obtained using traditional methods.

Contents

Abstract
List of Tables and Figures
Acknowledgments
1 Introduction
2 Background and Literature Review
  2.1 Production of Glottal Waves
    2.1.1 Larynx
    2.1.2 Glottal Waves and Glottal Phases
    2.1.3 Vibrations of Vocal Folds
    2.1.4 Relationship between the Glottal Wave and the Fold Contact Area
    2.1.5 Descriptive Parametric Models of Glottal Waves
    2.1.6 Source-Tract Interaction
    2.1.7 Aspiration Noise in Glottal Waves
  2.2 Estimating Glottal Waves Using Non-Speech Signals
    2.2.1 Mechanical Model Approach
    2.2.2 Pneumotachograph Mask Approach
    2.2.3 Reflectionless Tube Approach
  2.3 Estimating Glottal Waves from Speech Signals
    2.3.1 Inverse Filter Approach
    2.3.2 Using a Descriptive Parametric Model of Glottal Waves
  2.4 Vocal-Tract Area Function Measurements and Estimations
    2.4.1 Vocal Tract
    2.4.2 VTAF Measurements Using MRI
    2.4.3 VTAF Estimation Using Formant Frequencies
    2.4.4 VTAF Estimation Using Lip Input Acoustic Impedance
    2.4.5 VTAF Estimation Using Vocal-Tract Filters
  2.5 Summary
3 Transfer Functions of Vocal-Tract Filters
  3.1 Equivalent Acoustic Systems for Producing a Vowel Sound
  3.2 Transfer Functions for Producing a Vowel Sound
  3.3 Glottal Impedance
  3.4 The Lip Radiation Impedance
  3.5 The Signal Flow Diagram of VTF
  3.6 The Discrete-Time Model for the Glottal Reflection Coefficient
  3.7 The Discrete-Time Model for the Lip Reflection Coefficient
  3.8 The Discrete-Time Transfer Functions of VTFs and GVTFs
  3.9 Vocal-Tract Driving Point Impedance
  3.10 Summary
4 Vocal-Tract Filters and Their Estimates
  4.1 Calculating the Parameters of the Lip Reflection Coefficient
  4.2 Calculating VTF Frequency Responses
  4.3 Features of VTFs
  4.4 Calculating the Glottal Impedance
  4.5 Calculating the Glottal Reflection Coefficient over Closed Glottal Phases
  4.6 Calculating the Driving-Point Impedance
  4.7 Calculating the Difference Between a VTF and Its Estimate
  4.8 Calculating Transfer Functions of VTF Estimates
  4.9 Features of VTF Estimates Corresponding to Incomplete Glottal Closures
  4.10 Summary
5 A New Method for Estimating Vocal-Tract Filters and Glottal Waves from Vowel Sounds
  5.1 Introduction
  5.2 Transfer Functions for Producing a Vowel Sound
  5.3 Detecting Glottal Phases from a Vowel Sound
  5.4 Estimating the Vocal-Tract Filter
  5.5 Locating the Signal Segments for the VTF Estimation
  5.6 Obtaining the Glottal Waveform
  5.7 Summary
6 Estimating Vocal-Tract Area Functions from Vowel Sounds
  6.1 Introduction
  6.2 VTAF Estimation Assuming Boundary Condition 1
  6.3 VTAF Estimation Assuming Boundary Condition 2
  6.4 Comparing VTAF Estimations Assuming Different Boundary Conditions
  6.5 A New Method for Obtaining VTAFs from VTFs
  6.6 Distortion Effects of Incomplete Glottal Closures on VTAF Estimates
  6.7 Summary
7 Results and Discussions: Glottal Waves and Vocal-Tract Area Functions from Vowel Sounds
  7.1 Introduction
  7.2 Recording and De-Noising Speech Signals
  7.3 Steps for Obtaining Glottal Waves and VTAFs from Speech Signals
  7.4 Results
  7.5 Validation of the Estimates of Glottal Phases
  7.6 Discussion of the VTF and VTAF Estimates
  7.7 Discussion of the Estimates of Glottal Waves
  7.8 Sensitivity of the Estimates to the Estimated Vocal-Tract Lengths
  7.9 Summary
8 Conclusions and Future Work
  8.1 Contributions of This Thesis
  8.2 Future Work
References
Appendix A
Appendix B
Appendix C
Appendix D

List of Tables and Figures

Table 4.1. Parameters of r_lip(z)
Table 4.2. Formant frequencies of VTFs and their estimates corresponding to A_g = 1 mm²
Table 6.1. Comparisons between VTAF estimations based on different boundary conditions

Fig. 2.1. The superior view of the larynx [Titze, 1994]
Fig. 2.2. The comparison of larynxes of the male and the female [Titze, 1989]
Fig. 2.3. The vocal fold of the adult male (solid line) and of the female in a coronal view [Titze, 1994, and 1989]
Fig. 2.4. The glottal waveform and the movements of the vocal folds [Rubin, 1995]
Fig. 2.5. The layers of vocal fold (the coronal view) [Titze, 1994]
Fig. 2.6. Rothenberg model of the relationship between an EGG waveform and the phases of the vocal-fold vibratory cycle [Rubin, et al., 1995]
Fig. 2.7. The Liljencrants-Fant parametric model of the glottal wave [Quatieri, 2001]
Fig. 2.8. A sagittal view of the airway through the larynx and vocal tract [Rubin, et al., 1995]
Fig. 2.9. A method for determining the central line of a vocal tract [Takemota, et al., 2001]
Fig. 2.10. The lengths and positions of the epilarynx, pharynx, and oral cavity [Story, 2004]
Fig. 2.11. The VTAFs (left) and the VTFs (right) obtained in [Wakita, 1973]
Fig. 3.1. The Thevenin equivalent circuit for producing glottal waves
Fig. 3.2. The Norton equivalent circuit for producing the glottal wave
Fig. 3.3. (a) The frequency responses of the normalized radiation resistance R (--), reactance X (-.-) and impedance Z (-) of a 5-cm² piston in an infinite baffle; (b) those of a lip opening of 5 cm²
Fig. 3.4. The acoustic tube model of the vocal tract
Fig. 3.5. The signal flow diagram from the glottal source to the lip volume velocity
Fig. 3.6. The discrete-time signal flow diagram from the glottal source to the lip volume velocity
Fig. 4.1. The block diagram of the calculation of GVTFs or VTFs
Fig. 4.2. r_lip(f) (broken line) and r_lip(z) (solid line) of an adult lip opening for /a/
Fig. 4.3. r_lip(f) (broken line) and r_lip(z) (solid line) of an adult lip opening for /i/
Fig. 4.4. r_lip(f) (broken line) and r_lip(z) (solid line) of an adult lip opening for /u/
Fig. 4.5. r_lip(f) (broken line) and r_lip(z) (solid line) of an adult lip opening for /e/
Fig. 4.6. r_lip(f) (broken line) and r_lip(z) (solid line) of an adult lip opening for /o/
Fig. 4.7. The VTAF, the frequency response of H_VTF(z), the frequency response of the numerator of H_VTF(z), and the poles of H_VTF(z) for /a/ by a male subject
Fig. 4.8. The VTAF, the frequency response of H_VTF(z), the frequency response of the numerator of H_VTF(z), and the poles of H_VTF(z) for /i/ by a male subject
Fig. 4.9. The VTAF, the frequency response of H_VTF(z), the frequency response of the numerator of H_VTF(z), and the poles of H_VTF(z) for /u/ by a male subject
Fig. 4.10. The VTAF, the frequency response of H_VTF(z), the frequency response of the numerator of H_VTF(z), and the poles of H_VTF(z) for /e/ by a male subject
Fig. 4.11. The VTAF, the frequency response of H_VTF(z), the frequency response of the numerator of H_VTF(z), and the poles of H_VTF(z) for /o/ by a male subject
Fig. 4.12. (a) The time-varying glottal area, (b) the time-varying glottal resistances, and (c) the glottal reactance over the closed glottal phase
Fig. 4.13. (a) The glottal impedance for A_g = 1 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /a/
Fig. 4.14. (a) The glottal impedance for A_g = 1 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /i/
Fig. 4.15. (a) The glottal impedance for A_g = 1 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /u/
Fig. 4.16. (a) The glottal impedance for A_g = 1 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /e/
Fig. 4.17. (a) The glottal impedance for A_g = 1 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /o/
Fig. 4.18. (a) The glottal impedance for A_g = 2 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /a/
Fig. 4.19. (a) The glottal impedance for A_g = 2 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /i/
Fig. 4.20. (a) The glottal impedance for A_g = 2 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /u/
Fig. 4.21. (a) The glottal impedance for A_g = 2 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /e/
Fig. 4.22. (a) The glottal impedance for A_g = 2 mm² and l_g = 18 mm, (b) the frequency responses of r_g (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance Z_VT, (d) the frequency responses of the VTF (solid line) and of 1/(1 + Z_VT/Z_g) (dotted line), and (e) the frequency response of the GVTF for /o/
Fig. 5.1. A p_mic(n) segment that does not contain the effect of the open glottis
Fig. 5.2. (a) The designed derivative glottal waveform, (b) the simulated estimate of the derivative glottal waveform when A_g = 2 mm², (c) the spectrum of the designed derivative glottal waveform, (d) the spectrum of the difference between (a) and (b)
Fig. 5.3. (a) The designed derivative glottal waveform, (b) the simulated estimate of the derivative glottal waveform when A_g = 1 mm², (c) the spectrum of the designed derivative glottal waveform, (d) the spectrum of the difference between (a) and (b)
Fig. 6.1. The tube model used in the VTAF estimation based on boundary condition 1
Fig. 6.2. The signal flow diagram for the tube model in Fig. 6.1
Fig. 6.3. The equivalent signal flow diagram from u_{M+1}^{+}(t) to u_lip(t)
Fig. 6.4. Vowel /a/: (a) the frequency responses of r_lip (dotted line) and its IIR model (solid line); (b) the frequency responses of the synthetic VTF (solid line) and of GVTFs with r_g = 0.99 (dotted line) and 0.95 (broken line); (c) the VTAF from MRI (dots), and its estimates from the VTF (•) and from GVTFs with r_g = 0.99 (x) and 0.95 (o)
Fig. 6.5. Vowel /i/: (a) the frequency responses of r_lip (dotted line) and its IIR model (solid line); (b) the frequency responses of the synthetic VTF (solid line) and of GVTFs with r_g = 0.99 (dotted line) and 0.95 (broken line); (c) the VTAF from MRI (dots), and its estimates from the VTF (•) and from GVTFs with r_g = 0.99 (x) and 0.95 (o)
Fig. 6.6. Vowel /u/: (a) the frequency responses of r_lip (dotted line) and its IIR model (solid line); (b) the frequency responses of the synthetic VTF (solid line) and of GVTFs with r_g = 0.99 (dotted line) and 0.95 (broken line); (c) the VTAF from MRI (dots), and its estimates from the VTF (•) and from GVTFs with r_g = 0.99 (x) and 0.95 (o)
Fig. 6.7. Vowel /e/: (a) the frequency responses of r_lip (dotted line) and its IIR model (solid line); (b) the frequency responses of the synthetic VTF (solid line) and of GVTFs with r_g = 0.99 (dotted line) and 0.95 (broken line); (c) the VTAF from MRI (dots), and its estimates from the VTF (•) and from GVTFs with r_g = 0.99 (x) and 0.95 (o)
Fig. 6.8. Vowel /o/: (a) the frequency responses of r_lip (dotted line) and its IIR model (solid line); (b) the frequency responses of the synthetic VTF (solid line) and of GVTFs with r_g = 0.99 (dotted line) and 0.95 (broken line); (c) the VTAF from MRI (dots), and its estimates from the VTF (•) and from GVTFs with r_g = 0.99 (x) and 0.95 (o)
Fig. 7.1. The results from /a/ by female subject M
Fig. 7.2. The results from /a/ by female subject H
Fig. 7.3. The results from /a/ by female subject W
Fig. 7.4. The results from /a/ by female subject K
Fig. 7.5. The results from /a/ by male subject L
Fig. 7.6. The results from /a/ by male subject Y
Fig. 7.7. The results from /a/ by male subject D
Fig. 7.8. The results from /a/ by male subject G
Fig. 7.9. The results from /a/ by male subject A
Fig. 7.10. The results from /a/ by male subject R
Fig. 7.11. The results from /a/ by female subject Z
Fig. 7.12. The results from /i/ by male subject D
Fig. 7.13. The results from /i/ by male subject G
Fig. 7.14. The results from /i/ by female subject H
Fig. 7.15. The results from /a/ by male subject D, given M+1 = 43
Fig. 7.16. The results from /a/ by male subject D, given M+1 = 48
Fig. 7.17. The results from /a/ by male subject R, given M+1 = 43
Fig. 7.18. The results from /a/ by male subject R, given M+1 = 48

Acknowledgments

The work "Estimations of Glottal Waves and Vocal-Tract Area Functions from Speech Signals" would not have been accomplished without direct and indirect support and inspiration from many people.

My greatest indebtedness is to my supervisory committee for their important support throughout the research. Professor Beddoes initiated this interesting research project, and discussed and recognized its milestones throughout the four years, from the beginning of my MSc program until the end of my PhD program. Professor Ward worked very hard commenting on all my papers and my thesis, and provided me with a good working environment. Her accurate thinking, her strong spirit in dealing with difficult problems, and her commitment to excellence make her a role model for me. Professor Hodgson provided me with the facilities for recording speech sounds and building vocal-tract models, and with opportunities to lecture on speech acoustics. His positive comments on my research encouraged me, and his corrections of my manuscripts saved me from many errors.

My second debt is to Professor Bryan Gick, who provided me with equipment and instructive comments when I validated my new method for detecting glottal phases using vowel sound signals. My third debt goes to the professors who taught my courses and inspired me. My fourth debt is to my colleagues in the Image Laboratory and in the Electrical and Computer Engineering Department, who created a friendly and supportive environment in which I could work efficiently.

I was fortunate to receive from Professor Linda Rammage several important reference papers that were not easy for me to obtain. I thank Shaffiq Rahemtulla for his technical support in recording synchronized speech and EGG signals at the Speech Laboratory in the University of British Columbia. I also thank the volunteer students for their cooperation in recording the speech sounds used in this study.
I am very grateful to the University of British Columbia for giving me access to learning and research resources in diverse areas. I will benefit from my study here for the rest of my life.

Last but not least, my special and deep gratitude goes to my parents and husband for their consistent support and their understanding of my academic pursuits.

Hui Qun Deng
April 2005

1 Introduction

From speech sounds, we perceive linguistic information, as well as the gender, age, laryngeal health, identity, and even emotion of the speaker. A speech sound contains information about the glottal wave and the vocal tract. The vocal tract is the airway from the upper surface of the vocal folds to the lip opening. The glottal wave is the airflow passing through the glottis (the space between the two vocal folds) and entering the vocal tract. The vocal tract modulates the glottal wave. This modulation effect is determined by the cross-sectional area of the vocal tract along its length, which is referred to as the vocal-tract area function (VTAF). Glottal waves and VTAFs are speaker-dependent.

Glottal waves and VTAFs are important in many applications in different fields. In acoustic phonetics, they are needed to describe features of speech sounds [Stevens, 1998]; in speech synthesis, they are needed to obtain natural-sounding speech; in speech pathology, glottal waves are used to aid the diagnosis of voice disorders [Baken and Orlikoff, 2000]; in speech recognition, VTAFs are used to recognize vowel sounds; in speaker identification, glottal waves are used to reduce the error rate [Plumpe, et al. 1999]; in second-language learning, VTAFs are converted to vocal-tract shapes to help correct pronunciation [Dowd, et al. 1998]; in helping deaf people with pronunciation, vocal-tract shapes based on VTAFs are used as visual feedback [Rissiter, et al. 1994, Mahdi, 2003]; in synthesizing singing, both glottal waves and VTAFs are needed [Lu, 2002].
Methods for directly obtaining glottal waves and VTAFs from speech sounds have long been desired, since they do not interfere with normal speech production.

This thesis estimates glottal waves and vocal-tract area functions (VTAFs) from vowel sounds. A vowel sound signal is the convolution of the glottal wave and the impulse response of the vocal-tract filter (VTF). The glottal wave can be obtained by inverse filtering the sound, given the VTF. The challenge is how to obtain the VTF from the sound without knowing the glottal wave. Another challenge is to obtain the VTAF from a vowel sound without knowing the glottal and lip boundary conditions. These are two ill-defined inverse problems.

Existing methods for obtaining VTFs from speech signals involve one of the following assumptions about glottal waves: 1) glottal waves are zero over closed glottal phases [Miller, 1959, Veenman, et al. 1985]; 2) glottal waves can be modeled using a few descriptive parameters [Fujisaki and Ljungqvist, 1987; Kasuya, et al., 1999; Bozkurt, et al. 2004]; 3) glottal waves have smooth waveforms [Milenkovic, 1986; Moore, et al. 2004]. Existing methods for obtaining a unique VTAF from a speech signal involve one of the following assumptions about vocal-tract boundary conditions: 1) the glottis is completely closed, and the lip opening is terminated with some characteristic impedance [Atal, 1971]; 2) the lip radiation impedance is zero, and the glottis is terminated with some characteristic impedance [Wakita, 1973].

In reality, however, these assumptions about glottal waves and vocal-tract boundary conditions are not always valid. Actual glottal waves are much more complicated than these assumptions allow. During phonation, glottises close and open periodically, so the glottal termination is not constant. Also, incomplete glottal closures are common. Lip radiation impedances vary with frequency. These discrepancies between the imposed assumptions and reality lead to biased estimates of glottal waves and of VTAFs.
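The convolution relationship stated above, and the recovery of the glottal wave by inverse filtering when the VTF is known, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration only: the sampling rate, the two-resonance all-pole filter standing in for a VTF, the raised-cosine pulse train standing in for a glottal wave, and the function name `all_pole_from_resonances`. Lip radiation is omitted. It is not the estimation method developed in this thesis, which addresses the harder case where the VTF is unknown.

```python
import numpy as np
from scipy.signal import lfilter

FS = 8000  # assumed sampling rate in Hz (illustrative)


def all_pole_from_resonances(freqs_hz, bandwidths_hz, fs):
    """Build the denominator a(z) of an all-pole filter 1/a(z).

    Each (frequency, bandwidth) pair contributes one complex-conjugate
    pole pair. The values used below are hypothetical, not measured.
    """
    a = np.array([1.0])
    for f, bw in zip(freqs_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs)       # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs       # pole angle from frequency
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a


# Hypothetical stand-in for a vocal-tract filter with two resonances.
a = all_pole_from_resonances([700.0, 1200.0], [80.0, 100.0], FS)

# Synthetic stand-in for a glottal wave: raised-cosine pulses separated
# by an idealized closed phase (zeros), repeated over five periods.
period, open_len = 80, 50
pulse = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(open_len) / open_len))
glottal = np.tile(np.concatenate([pulse, np.zeros(period - open_len)]), 5)

# Forward model: the sound is the glottal wave filtered by 1/a(z),
# i.e. convolved with the filter's impulse response.
speech = lfilter([1.0], a, glottal)

# Inverse filtering: apply the FIR filter a(z) to undo 1/a(z).
recovered = lfilter(a, [1.0], speech)

print("max reconstruction error:", np.max(np.abs(recovered - glottal)))
```

In this sketch the recovery is exact up to rounding because the same coefficients are used in both directions; any error in the assumed filter would appear directly in the recovered waveform, which is why the accuracy of the VTF estimate matters.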
This study aims to develop methods for estimating glottal waves and VTAFs from vowel sounds that are more accurate than previous approaches. We overcome the problems with existing methods by using more realistic assumptions about glottal waves and vocal-tract boundary conditions. We begin by clarifying concepts related to glottal waves and VTFs.

In Chapter 2, background knowledge related to the production of glottal waves, vocal-fold vibrations, and vocal-tract area functions is briefly described. Then, existing methods for measuring glottal waves and vocal-tract area functions are reviewed.

Chapter 3 clarifies the concepts related to vocal-tract filters (VTFs). Then, it models the effects of glottal and lip impedances on the VTF, and formulates the transfer functions of VTFs and of VTF estimates.

To gain quantitative knowledge about VTFs and VTF estimates, Chapter 4 calculates VTFs corresponding to different vowels using the concepts and formulae developed in Chapter 3. The effects of incomplete glottal closures on VTF estimates are also revealed.

In Chapter 5, a new method for obtaining unbiased VTF estimates from vowel sounds is developed, assuming that glottal waves for sustained vowel sounds are periodically stationary processes. In addition, a new method for detecting glottal phases using vowel sound signals only is developed. Moreover, the effects of glottal losses on glottal-wave estimates obtained by inverse filtering the vowel sounds using the VTF estimates are analyzed and simulated.

Chapter 6 first reformulates and compares existing methods for estimating VTAFs from vowel sounds. Then, it develops a new method for obtaining VTAFs from VTF estimates obtained over closed glottal phases, assuming the glottises are completely closed. In addition, the effects of incomplete glottal closures on the VTAF estimates are investigated via simulations.
In Chapter 7, using the methods developed in this study, glottal waves and VTAFs are estimated from vowel sounds produced by 11 subjects. The glottal waves and VTAFs obtained contain detailed information that cannot be obtained from speech signals using existing methods. In addition, the glottal phases estimated using our method are validated using synchronized EGG (electroglottograph) signals.

Finally, in Chapter 8, the advances made in this study, and future work, are summarized.

The new concepts, methods, findings and results of this study are:

1. VTFs are now distinguished from glottal-vocal-tract filters (GVTFs). VTFs are determined by VTAFs only, whereas GVTFs are determined by VTAFs as well as glottal impedances (Ch. 3);
2. The difference between the GVTF and the VTF is determined by the ratio of the driving-point impedance of the vocal tract to the glottal impedance (Ch. 3);
3. There exists a glottal closing resistance, which is positive when the glottal area is decreasing, and negative when the glottal area is increasing (Ch. 3);
4. The lip radiation impedance is modeled as the product of a frequency-dependent weighting factor and the radiation impedance of a piston in an infinite baffle (Ch. 3, 4);
5. The formant frequencies of a GVTF are always higher than those of the corresponding VTF (Ch. 4);
6. A new method for detecting glottal phases using vowel sounds only is developed (Ch. 5);
7. A new method for obtaining unbiased VTF estimates from vowel sounds is developed (Ch. 5);
8. A new method for obtaining high-resolution VTAF estimates from vowel sounds is developed (Ch. 6);
10. It is found that, to eliminate the effects of glottal loss on estimates of VTFs, glottal waves and VTAFs, lip-opening areas must be known (Ch. 6);
11. Analysis and simulations show that estimates of glottal waves and VTAFs corresponding to small-lip-opening vowel sounds are more affected by incomplete glottal closures than those corresponding to large-lip-opening vowel sounds (Ch. 4, 5, 6);
12. VTAF estimates obtained from large- and small-lip-opening vowel sounds produced by several subjects agree with the simulated results (Ch. 7);
13. Normalized VTAFs estimated using our method from /a/ sounds by several subjects are similar to the VTAF measured from an unknown subject's magnetic resonance image (Ch. 7);
14. The glottal waves obtained are non-zero over closed glottal phases: during vocal-fold collision, the glottal waves increase; during vocal-fold parting, some decrease monotonically, while others remain level or even increase (Ch. 7).

The concepts and methods developed in this study contribute to the above-mentioned applications, such as natural-sounding speech synthesis, speech pathology, speaker identification, and the study of speech production.

2 Background and Literature Review

Knowledge about the vocal tract and vocal-fold vibrations can help us estimate glottal waves and VTAFs from speech signals, and can also help us explain the estimates obtained. This chapter provides fundamental knowledge related to glottal waves and VTAFs, reviews existing methods for estimating glottal waves and VTAFs, and points out problems with the existing methods.

2.1 Production of Glottal Waves

2.1.1 Larynx

The larynx is a biomechanical system in which glottal waves are generated, as shown in Fig. 2.1. It controls the vocal-fold tension and shape by using muscles to move the two arytenoid cartilages relative to the cricoid cartilage in the medial-lateral and anterior-posterior directions. The tension and shape of the vocal folds determine the fundamental frequency of the glottal flow.
The glottal flow is also referred to as the glottal wave, or the glottal volume velocity: it is the airflow that comes from the trachea, passes through the glottis (the space between the vocal folds) and enters the vocal tract during phonation. The larynx descends or rises when producing different vowel sounds, resulting in changes in the length of the vocal tract. The larynx descends to produce /a/ and /o/, and rises to produce /i/, /u/ and /e/ [Story and Titze, 1996, Honda, 2001]. For example, the vocal tract of an adult Japanese speaker is 9 mm longer when producing /a/ than when producing /e/.

[Figure 2.1 labels: thyroarytenoid muscle (thyrovocalis, thyromuscularis), vocal process, arytenoid cartilage, pharynx, thyroid cartilage, vocal ligament, glottis, cricoid cartilage, posterior cricoarytenoid muscle.]

Fig. 2.1. The superior view of the larynx [Titze, 1994].

In analyzing the glottal waveforms of people of different genders, it is useful to know how their larynxes differ. The comparison between the larynxes of the male and the female is shown in Fig. 2.2. The male larynx is not a proportionally scaled version of the female larynx. The overall larynx size of the male speaker is 1.2 times that of the female, but the membranous fold length of the male speaker (forming a protruding Adam's apple) is 1.6 times that of the female, and the fold thickness of the male is 1.2 times that of the female [Titze, 1989]. In addition, the medial surface of the vocal fold of the adult male bulges out, making the coronal section of the glottis more rectangular than wedge-shaped, as shown in Fig. 2.3. As a result, the divergent space between the contacting folds (6-7 in Fig. 2.4) is larger for the male than for the female. It is known that the fundamental frequency of a glottal wave is mainly determined by the tension and length of the membranous vocal folds, whereas the mean airflow and vibration amplitude of a glottal wave are related to the overall larynx size [Titze, 1989].

Fig. 2.2. The comparison of larynxes of the male and the female [Titze, 1989].

Fig. 2.3. The vocal fold of the adult male (solid line) and of the female in a coronal view [Titze, 1994, and 1989].

2.1.2 Glottal Waves and Glottal Phases

During phonation, under the lung pressure and the Bernoulli effect, the tensed vocal folds are periodically forced apart and brought together, resulting in the opening and closing of the glottis and in pulses of airflow entering the vocal tract. Fig. 2.4 shows the schematic glottal waveform and the corresponding vocal-fold movements. In Fig. 2.4, the interval between 1 and 4 is the opening glottal phase (when the glottal area is increasing); the interval between 4 and 6 is the closing glottal phase (when the glottal area is decreasing); and the interval between 6 and 10 is the closed glottal phase (when the glottal area is zero).

Fig. 2.4. The glottal waveform and the movements of the vocal folds [Rubin, 1995].

Incomplete glottal closures are very common [Cranen, and Schroeter, 1995]. Glottal closures during phonation can be observed through laryngostroboscopy. It is reported that the percentage of glottal closure (the ratio of the fold-contact length in the anterior-posterior direction to the total fold length) varies from 68.9% to 99.3% in men and from 45.1% to 96.5% in women. The lowest value of 45.1% is obtained from high-pitched, low-intensity phonations by female subjects. The highest value of 99.3% is obtained from low-pitched, loud phonations by male subjects. Most incomplete glottal closures are located in the arytenoid (posterior) region, especially for women; some are located in the anterior region for men [Suiter and Albers, 1996].

2.1.3 Vibrations of Vocal Folds

To better understand glottal waves, a close look at the vibrations of the vocal folds is helpful. The complicated vibration patterns of the vocal folds are facilitated by their complicated structure. A vocal fold has a layered structure.
As shown in Fig. 2.5, a vocal fold consists of three layers: 1) the epithelium; 2) the lamina propria (which consists of three sub-layers: superficial, intermediate and deep), and 3) the muscle. The structure of the vocal fold is also roughly divided into the cover layer (epithelium, superficial, intermediate) and the body layer (deep layer, muscle). During the vocal-fold vibration, the cover layer of the fold supports a surface wave with compressive and rarefactive phases [Berke, 1993]. This surface wave is referred to as the mucosal wave. The mucosal waves propagate from the lower portion of the folds towards and along the upper surface to the lateral boundaries of the folds, at a speed of 0.29-1.18 m/s [Wenokur, Berke, Kreiman, and Ye, 1993]. The effect of the mucosal wave produces the vertical phase difference: when the lower portion is in touch, the upper portion remains apart, as shown in 6-7 of Fig. 2.4; when the lower portion is parting, the upper portion is still in touch, as shown in 9-10 of Fig. 2.4. Simulations using a finite-element model of the vocal folds and the observation using high-speed quantitative imaging [Berry, 2001, Titze, 2002] have found that the vibration of the vocal fold can be described as the superposition of the mucosal wave movement along the folds, and the lateral movement of the folds. The lateral movements of the folds are responsible for modulating the glottal airway and generating the glottal wave signal. The mucosal wave is responsible for forming the glottal shape, and plays a supportive role in the vocal fold vibration. The convergent and divergent glottal shapes produce favourable pressure conditions such that the 10 Epithelium — Lamina propria Superficial layer -Intermediate layer Deep layer — Fig. 2.5. The layers of vocal fold (the coronal view) [Titze, 1994]. intraglottal pressure is in-phase with the net lateral velocity of the folds. The convergent glottal shape (9,10,1,2 in Fig. 
2.4) creates a higher pressure, helping the folds to separate; in contrast, the divergent shape (4,5,6,7 in Fig. 2.4) creates a lower pressure, helping the folds to come together. Our study notes that, when the rarefaction wave of the tissue travels over the vertical extent of the contacting vocal folds (see 6-8 in Fig. 2.4), the air between the colliding folds must be squeezed into the vocal tract. Although the squeezed airflow may be very small compared to the total glottal airflow, its time derivative should not be ignored, because it is the derivative of the glottal wave, not the glottal wave itself, that contributes to the speech sound.

2.1.4 Relationship between the Glottal Wave and the Fold Contact Area

EGG (electroglottograph) signals are commonly used as an indirect means of observing the vocal-fold vibration. During phonation, a pair of electrodes is held in firm contact with the neck on the left and right sides of the thyroid cartilage. A harmless (< 1 mA) high-frequency electrical current is emitted from the electrodes and conducted across the larynx. The amplitude of the EGG signal reflects changes in the impedance along the path of the electrical current, and hence the relative vocal-fold contact area can be detected, as shown in Fig. 2.6 [Rubin et al., 1995], where an increasing fold contact area is displayed as a decreasing EGG waveform. An EGG signal can be divided into distinct phases. In the interval 1-2 (see Fig. 2.6), the fold contact area becomes maximal for a short period. In the interval 2-3, the fold contact area decreases as the folds part from the lower margins toward the upper margins. In the interval 3-4-5, the fold contact area continues to decrease as the upper margins continue to open. In the interval 5-6, the fold contact area remains unchanged while the folds are separated.
In the interval 6-7, the two folds start closing along their lower margins in a zipper-like fashion, and in the interval 7-1, the contact area of the folds increases rapidly in the vertical direction toward the upper margins [Kay, 1999].

Fig. 2.6. Rothenberg model of the relationship between an EGG waveform and the phases of the vocal-fold vibratory cycle [Rubin et al., 1995].

EGG signals can be used to detect the glottal phases, but they cannot indicate clearly when the glottis starts closing or opening, i.e., the time instants 3 and 7 cannot be clearly determined using the EGG signal. In contrast, the glottal wave signal can provide more information about the fold vibration than the EGG signal can. The glottal wave signal can indicate the time instant 3 (i.e., the instant when the glottal wave starts increasing), the interval when the glottis is opening (the derivative of the glottal wave is positive), the interval when the glottis is closing (the derivative of the glottal wave is negative), the time instant 7 (i.e., when the derivative of the glottal wave becomes maximally negative), and the interval when the glottis is closed. Comparing the phases of the glottal wave with those of the EGG signal, it can be seen that the instant when the glottal airflow decreases most rapidly shortly precedes the time when the fold contact area increases most rapidly (within the interval 7-1 of Fig. 2.6). In Chapter 5, this time-phase relationship between the glottal wave and the EGG signal in the Rothenberg model is used to validate the glottal phases estimated from speech signals. Our study finds that during the short interval 7-1-2 in Fig. 2.6, the glottal wave derivative is not zero and may even exhibit positive peaks, which we explain by the air squeezed out by the colliding folds into the vocal tract (see also 6-7-8 of Fig. 2.4). Thus, the constant glottal waveform over the closed glottal phase (7-1-2) in the Rothenberg model is a simplification of actual waveforms.
2.1.5 Descriptive Parametric Models of Glottal Waves

In the literature, glottal waveforms are often described using parametric models. The most widely used descriptive parametric model of glottal waves is the Liljencrants-Fant (LF) model [Fant 1986], illustrated in Fig. 2.7. The LF model describes the shape of the glottal flow derivative using a number of parameters representing the closed phase, the opening phase, and the return phase of the glottis. The seven-parameter LF model is [Quatieri, 2001]:

$$
v_{LF}(t) =
\begin{cases}
0, & 0 \le t < T_o \\
E_0\, e^{\alpha (t - T_o)} \sin\left[\Omega_o (t - T_o)\right], & T_o \le t < T_e \\
-E_1 \left[ e^{-\beta (t - T_e)} - e^{-\beta (T_c - T_e)} \right], & T_e \le t < T_c
\end{cases}
$$

In the Liljencrants-Fant model, the time of glottal opening, the time of glottal closing, the start time of the return phase, three parameters describing the shape of the glottal wave during the open phase, and two parameters describing the shape of the glottal wave during the return phase, are used to describe the derivative of a glottal wave. It should be noted that actual glottal waveforms contain more information than is described by parametric models. For example, actual glottal waves contain source-tract interactions and aspiration noise, as discussed in the following sections, as well as gender differences, which can be heard but remain to be identified. Moreover, glottal waves over closed glottal phases may not be zero.

Fig. 2.7. The Liljencrants-Fant parametric model of the glottal wave, showing the closed phase, open phase and return phase of the glottal pulse over time [Quatieri, 2001].

Non-zero glottal waves over closed glottal phases can be caused by three different mechanisms [Cranen and Schroeter, 1996]: 1) abduction (the opening that is connected to the membranous glottis); 2) glottal chink (the opening in the cartilaginous portion of the glottis when the vocal folds are adducted); and 3) vertical tissue movements. The abduction and vertical motions of the vocal folds change the derivative of a glottal wave, whereas a glottal chink does not.
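As an illustration of the piecewise LF expression given earlier in this section, the following minimal Python sketch evaluates the glottal flow derivative over one cycle. It is only a sketch under stated assumptions: the function names, the parameter values, and the rule that picks the return-phase amplitude so that the open and return phases meet at the start of the return phase are hypothetical choices made for this example, not values or procedures taken from the cited references.

```python
import math

def lf_flow_derivative(t, T_o, T_e, T_c, E_0, alpha, omega_o, E_1, beta):
    """Piecewise LF-style glottal flow derivative at time t within one cycle."""
    if t < T_o:                      # closed phase: model output is zero
        return 0.0
    if t < T_e:                      # open phase: growing sinusoid
        return E_0 * math.exp(alpha * (t - T_o)) * math.sin(omega_o * (t - T_o))
    if t < T_c:                      # return phase: exponential recovery to zero
        return -E_1 * (math.exp(-beta * (t - T_e)) - math.exp(-beta * (T_c - T_e)))
    return 0.0

def return_amplitude_for_continuity(T_o, T_e, T_c, E_0, alpha, omega_o, beta):
    """Assumption of this sketch: choose E_1 so both branches agree at t = T_e."""
    open_end = E_0 * math.exp(alpha * (T_e - T_o)) * math.sin(omega_o * (T_e - T_o))
    return -open_end / (1.0 - math.exp(-beta * (T_c - T_e)))

# Hypothetical parameter values for one 8 ms cycle; times are in seconds.
T_o, T_e, T_c = 0.003, 0.007, 0.008
E_0, alpha, omega_o, beta = 1.0, 300.0, 2.0 * math.pi * 150.0, 3000.0
E_1 = return_amplitude_for_continuity(T_o, T_e, T_c, E_0, alpha, omega_o, beta)

fs = 16000                           # hypothetical sampling rate in Hz
num_samples = round(T_c * fs)
cycle = [lf_flow_derivative(n / fs, T_o, T_e, T_c, E_0, alpha, omega_o, E_1, beta)
         for n in range(num_samples)]
```

By construction the model output is exactly zero before the opening instant, which is the property that the surrounding text questions for real glottal waves.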
2.1.6 Source-Tract Interaction

During open glottal phases, the glottal wave is affected not only by the glottal area and the lung pressure, but also by the vocal-tract system, i.e., the glottal waveform is phoneme and speaker dependent. In the literature, the effect of the vocal tract on the glottal wave during open glottal phases is called source-tract interaction. Simulations and analysis show that, due to the inductance of the first-formant load of the vocal tract, the glottal wave rises more gradually but falls more rapidly than the glottal area waveform [Ananthapadmanabha and Fant, 1982]. Simulations also show that the first formant of the vocal tract may introduce a "double peak" structure in the derivative of the glottal waveform during the open glottal phase. The double-peak structure is also observed in the derivative of the glottal waveforms obtained using our method in Chapter 7.

2.1.7 Aspiration Noise in Glottal Waves

Parametric models of glottal waves only describe the periodic components of glottal waves. However, actual glottal waves contain not only periodic but also aperiodic components, namely aspiration noise, i.e., the turbulence noise generated by air friction at the glottis. Analysis of a sustained vowel sound /a/ produced by a female subject shows that the turbulence noise produced by the glottis is 10-40 dB smaller than the periodic components at frequencies lower than 3 kHz, but becomes comparable above 3 kHz [Jackson, 2001]. The aspiration noise is not stationary, but varies with the glottal area and the glottal volume velocity, i.e., it is pitch synchronized [Coker et al., 1996]. Because of the pitch-synchronized aspiration noise in the glottal wave, the measurements of a sustained vowel sound differ from one pitch period to another. We use this knowledge to obtain an unbiased estimate of the vocal-tract filter from a sustained vowel sound in Chapter 5.
2.2 Estimating Glottal Waves Using Non-Speech Signals

2.2.1 Mechanical Model Approach

One approach for obtaining glottal waveforms is to model the vibration of the vocal folds using a mechanical system consisting of resistances, masses and springs. The glottal area time function and the glottal airflow are then calculated from this model [Flanagan 1972, Vries 2002]. It is difficult to measure the mechanical parameters of the vocal folds of real subjects. Also, the calculated glottal waveform is determined by a small number of mechanical parameters, and thus lacks detail.

2.2.2 Pneumotachograph Mask Approach

This method uses a circumferentially vented pneumotachograph mask to sense the volume velocity at the mouth opening during speech [Rothenberg, 1973; Baken and Orlikoff, 2000]. The output of the mask is inverse filtered to provide an estimate of the glottal volume velocity. Although this method can provide a reliable indication of zero airflow, its limitation is that its upper frequency limit is as low as 2000 Hz. Thus, fine detail in the volume-velocity waveforms cannot be obtained [Hillman, 1981].

2.2.3 Reflectionless Tube Approach

This method uses a uniform pipe, 6 feet long with a 1-inch inner diameter, as a "pseudo-infinite" termination of the vocal tract to dampen the vocal-tract resonances, and the pressure wave sensed by a microphone within the tube is taken as an approximation of the glottal volume velocity [Sondhi, 1975]. However, a vocal tract is obviously not a uniform tube, and thus the sound pressure in the pipe contains the filtering effect of the vocal tract [Hillman, 1981]. Moreover, the interaction between the glottal source and the vocal tract when the tract is not connected to the pipe is different from that when it is connected.
2.3 Estimating Glottal Waves from Speech Signals

Obtaining glottal wave signals from speech signals is more advantageous than using the above-mentioned methods, in that glottal wave signals can be estimated during natural speech production. A vowel sound signal is the convolution of the glottal wave and the impulse response of the vocal-tract filter. The problem is how to obtain the glottal wave without knowing the vocal-tract filter. To solve this ill-posed problem, vocal-tract filters and glottal waves in the literature are estimated based on the assumption that the glottal waves are zero over closed glottal phases, or that the glottal waves can be represented using a small number of parameters. Accordingly, existing approaches for obtaining glottal waves from speech signals fall into two categories: those using inverse filtering, and those using a descriptive parametric model of glottal waves.

2.3.1 Inverse Filter Approach

In this approach, the parameters of the vocal-tract filter (VTF) are first obtained, the time derivative of the glottal wave is then obtained by inverse filtering the vowel sound signal, and the glottal wave signal is obtained by integrating the glottal wave derivative. Some researchers use the discrete all-pole (DAP) method [El-Jaroudi and Makhoul, 1991] to obtain more accurate VTF estimates [Alku et al., 1998, 2002]. Unfortunately, the DAP method ignores the influence of non-linear and time-varying glottal impedances on the VTF estimates, and estimates the VTF over an interval covering both open and closed glottal phases. Consequently, the VTF estimates used contain the effects of open glottises, and the resulting glottal waves are inaccurate. In some approaches, VTFs are estimated over the closed glottal phase, over which the glottal wave is assumed to be zero [Miller, 1959; Wong, 1979; Veenman, 1986]. However, this assumption is not always true.
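The three steps of the inverse filter approach (obtain an all-pole VTF, inverse filter the signal, integrate the result) can be sketched in a few lines of Python. This is a minimal sketch, assuming the VTF coefficients are already known; the function names, the coefficient values and the toy excitation are hypothetical and serve only to show that inverse filtering with the exact filter recovers the excitation. It does not address how the coefficients are estimated, which is the hard part discussed in this section.

```python
def all_pole_filter(excitation, a):
    """Synthesis through 1/A(z): s[n] = e[n] - sum_k a[k] * s[n-1-k]."""
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc -= ak * s[n - 1 - k]
        s.append(acc)
    return s

def inverse_filter(speech, a):
    """Analysis through A(z): e[n] = s[n] + sum_k a[k] * s[n-1-k]."""
    e = []
    for n, sn in enumerate(speech):
        acc = sn
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * speech[n - 1 - k]
        e.append(acc)
    return e

def integrate(derivative):
    """Running sum, standing in for integration of the glottal wave derivative."""
    total, out = 0.0, []
    for d in derivative:
        total += d
        out.append(total)
    return out

# Hypothetical stable two-pole filter (pole radius about 0.89) and toy excitation.
a = [-1.3, 0.8]
glottal_derivative = [0.0, 1.0, 0.5, -1.5, 0.0, 0.0, 0.0, 0.0]
speech = all_pole_filter(glottal_derivative, a)
recovered_derivative = inverse_filter(speech, a)
recovered_wave = integrate(recovered_derivative)
```

When the coefficients passed to `inverse_filter` differ from those that produced the signal, the recovered derivative absorbs the mismatch, which is why biased VTF estimates lead to inaccurate glottal waves.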
As mentioned in previous sections, during phonation a glottis may never be completely closed, or the colliding vocal folds (after the glottis closes) may squeeze and/or push air out into the vocal tract. As a result, the VTF estimates used for obtaining glottal waves contain the influences of the non-zero glottal waves, and the resulting glottal waves are not accurate. To identify closed glottal phases, electroglottographic (EGG) signals are usually used [Krishnamurthy and Childers, 1986; Childers and Ahn, 1995]. When only speech signals are available, closed glottal phases are identified using the linear prediction residual error produced by a sliding covariance analysis of the speech signal [Wong et al., 1979]. However, the residual error of linear prediction is a mathematically minimized quantity, and has no physical equivalent. It is not surprising that the glottal phases detected using the residual error are unreliable. To overcome this problem, another approach detects the closed glottal phases using the formant frequencies obtained from the sliding covariance analysis, assuming that glottal waves are always zero over closed glottal phases [Plumpe et al., 1999]. Obviously, the VTF estimates are degraded by non-zero glottal waves over closed glottal phases. Some approaches estimate glottal waves by assuming that glottal waves are smooth [Milenkovic, 1986; Moore and Clements, 2004]. These approaches are not convincing, because the smoothness of an unknown glottal wave is unknown.

2.3.2 Using a Descriptive Parametric Model of Glottal Waves

In this approach, the parameters of the glottal-wave model and of the VTF are jointly estimated from a cycle of the vowel-sound signal [Kasuya, Maekawa and Kiritani, 1999; Fujisaki and Ljungqvist, 1987; Lu, 1999]. For this approach, errors arise from two causes. The first is the parametric model of the unknown glottal wave.
As mentioned previously, parametric models are smoothed approximations of the unknown glottal wave, and are unable to capture all details of the glottal waveform. Also, parametric models ignore the influence of the vocal-tract resonance on the glottal wave. Consequently, the glottal waveform obtained lacks detail, and the resulting VTF estimate is biased by the difference between the actual glottal wave and its parametric model. The second cause is that, since the VTF parameters are estimated over one glottal cycle, the effect of the time-varying glottal impedance corresponding to the open glottal phase is included in the VTF estimate. Thus, this approach cannot produce accurate VTF estimates.

It should be noted that, if an estimation of glottal waves is based on assumptions about glottal waves, then it involves a problem of "circular logic". As criticized in [Hillman and Weinberg, 1981], it relies on a priori assumptions about the nature of the glottal wave, which are, in turn, used as criteria for adjusting the VTF estimate. Our study estimates glottal waves based on a more realistic assumption about glottal waves: for a sustained vowel sound, the glottal wave is a combination of periodic components and turbulence noise.

2.4 Vocal-Tract Area Function Measurements and Estimations

2.4.1 Vocal Tract

The vocal tract is the airway (or the acoustic tube) starting from the upper margins of the vocal folds and ending at the lips, as shown in Fig. 2.8. The cross-sectional area of the vocal tract changes with the vocal "gesture", which is determined by the configurations of the tongue, jaw, soft palate, and lips.

Fig. 2.8. A sagittal view of the airway through the larynx and vocal tract, with the hard palate, tongue, epiglottis, laryngeal ventricle, thyroid cartilage and trachea labeled [Rubin et al., 1995].

2.4.2 VTAF Measurements Using MRI

Nowadays, a three-dimensional vocal tract can be imaged using the MRI (magnetic resonance imaging) technique. Before measuring the VTAF, a vocal-tract centerline is needed.
The ideal centerline lies along the transmission direction of plane sound waves in the vocal tract, so that the length of the centerline equals the product of the travel time of the sound wave and the sound speed. There are several methods for extracting the centerline of a three-dimensional vocal-tract image [Mermelstein, 1973; Story and Titze, 1996; Takemoto et al., 2001]. One method is described here. Due to the left-right symmetry of the vocal tract, the centerline of the vocal tract is in the midsagittal plane (the plane that divides the left and the right of the body). First, several points are set on the image of the vocal-tract wall intersecting the midsagittal plane (see Fig. 2.9). Then, from each of these points, the shortest line reaching the opposite side of the vocal tract in the midsagittal plane is drawn. The middle point of the line is taken as a point on the centerline. At the lip opening, the center of the lip opening is selected as the middle point. A line is drawn passing through the center of the lip opening and perpendicular to the surface of the lip opening, and two points along this line, one inside and one outside the lip opening, are selected as points on the centerline. At the glottal end, the center of the glottis is selected as a point on the centerline. These centerline points are connected using a smooth interpolation function to form a smooth centerline [Takemoto and Honda, 2001]. The vocal-tract cross-sectional area at a given distance from the glottis is the area of the intersection of the 3D vocal-tract volume image and the plane that is perpendicular to the centerline at that distance. In this way, a three-dimensional vocal-tract volume is described using a two-dimensional VTAF. The VTAF obtained from MRI, and the positions of the pharynx, oral cavity and other parts of the vocal tract for the sound /a/, are shown in Fig. 2.10 [Story, 2004]. There are several factors that affect the accuracy of the VTAF measurements using MRI methods.
First, the MRI method cannot detect the existence of teeth and bones. To fill the space occupied by the teeth, measurements of the teeth are needed. Moreover, an erroneous vocal-tract centerline introduces further errors in the VTAF measurements. In Chapter 7, VTAFs obtained from MRI data will be compared with VTAFs estimated from speech signals.

Fig. 2.9. A method for determining the centerline of a vocal tract [Takemoto et al., 2001].

Fig. 2.10. The lengths and positions of the epilarynx, pharynx, and oral cavity, plotted against distance from the glottis (cm) [Story, 2004].

2.4.3 VTAF Estimation Using Formant Frequencies

The estimation of the VTAF from the speech signal is an inverse problem. The possibility of obtaining VTAFs from speech signals was first explored by [Mermelstein, 1967]. The non-uniform vocal-tract cross-sectional area is approximated using several Fourier functions. Applying Webster's acoustic horn equation to the approximated vocal tract, it is found that, if the logarithm of the VTAF is band limited, preserving only the first 2n Fourier components, if the glottal impedance is infinite, and if the vocal tract is lossless, then the lowest n pole and n zero frequencies of the admittance function measured at the lip opening uniquely determine the VTAF. The problem with this approach is that a speech signal can only provide the pole frequencies, which correspond to the resonance frequencies of the vocal-tract filter, but not the zero frequencies of the admittance at the lips. Therefore, the formant frequencies of the vocal-tract filter alone cannot determine a unique VTAF, i.e., there are an infinite number of VTAFs corresponding to one set of formant frequencies of a vocal-tract filter. Moreover, this approach is based on the condition that the shape of the vocal tract is only "slightly perturbed" from a uniform shape. Consequently, this approach cannot obtain correct VTAFs for vocal tracts whose shapes change sharply.
In spite of these limitations, a great amount of effort has been expended in the literature on obtaining VTAFs using formant frequencies. It is believed that a speech signal does not contain enough information for determining a unique VTAF [Sondhi, 1979; Schroeter and Sondhi, 1994], and that prior information about human vocal tracts is needed in addition to the formant frequencies. Some researchers combine morphological constraints with the formant frequencies obtained from speech signals [Yehia and Itakura, 1996; Dang and Honda, 2002] to estimate VTAFs. Some researchers use an articulatory codebook, which is a lookup table of corresponding acoustic and geometric vectors, to find the VTAF corresponding to given acoustic parameters [Atal, 1978; Larar, Schroeter and Sondhi, 1988]. This approach requires a large codebook that contains many acoustic parameters and VTAFs. However, the VTAFs in the codebook are limited, and cannot correctly represent an unknown VTAF.

2.4.4 VTAF Estimation Using Lip Input Acoustic Impedance

To avoid the above problems, a method was developed that uses not speech signals but the acoustic input impedance looking from the lip opening into the vocal tract [Sondhi and Gopinath, 1971; Yehia, Honda and Itakura, 1995]. This method requires that the speaker articulate a vowel without producing sound, with the glottis completely closed, while a volume-velocity excitation is generated from a uniform tube connected to the mouth opening. The vocal-tract sound-pressure response to the external excitation is transmitted back into the tube, and is then analyzed to obtain the zeros and poles of the lip input impedance, from which the VTAF is then derived. The limitations of this method are that a well-designed acoustic measurement system is not available in most cases, and that the subject is not in the natural condition for speech production. Moreover, the resulting VTAF estimates are still not accurate [Devaney and Goodyear, 1994].
2.4.5 VTAF Estimation Using Vocal-Tract Filters

As LP (linear prediction) theory evolved [Atal and Hanauer, 1971; Itakura and Saito, 1968], two methods that are more advantageous than those using formant frequencies were developed. A vocal tract is acoustically modeled as a lossless M-section tube, with each section having the same length and a different cross-sectional area. The acoustic transfer function derived from the tube model is compared with the vocal-tract filter estimated from the speech signal using LP. Assuming that glottal waves have a spectrum with a -12 dB/oct slope, and that the speech signal is compensated for the glottal wave, it is shown [Atal, 1971] that, if the glottal reflection coefficient is one and the lip end is terminated with some characteristic impedance, which is referred to as boundary condition 2 in the literature, a unique VTAF can be derived from a speech signal. In contrast, it is shown [Wakita, 1973] that if the lip reflection coefficient is one and the glottal end is terminated with some characteristic impedance, which is referred to as boundary condition 1 in the literature, a unique VTAF can be derived from a speech signal. It is reported that VTAFs obtained based on boundary condition 2 are not reasonable [Wakita, 1973]. However, it is also reported that VTAF estimates based on boundary condition 1 are at times not reasonable [Ray, 1995]. Our study shows that both boundary conditions 1 and 2 can lead to reasonable results, if the speech signals correspond to the assumed boundary conditions [Deng, CAA 2003].

VTAF estimation based on boundary condition 1 requires that the speech signals used for the estimation be limited to a low frequency range (0-4 kHz) to satisfy the assumed boundary conditions. VTAF estimates obtained from such signals have low resolution: each section of the tube model is about 2.5 cm long. Such low-resolution VTAFs cannot describe vocal tracts in detail, especially short vocal tracts.
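The step from a lossless-tube model to an area function can be made concrete with the junction relation between adjacent sections. The sketch below is illustrative only: it assumes the sign convention k_m = (A_m - A_{m+1}) / (A_m + A_{m+1}) for the reflection coefficient at the junction between sections m and m+1, and it treats the first area as an arbitrary reference, since only area ratios are determined. Which end of the tube the first section corresponds to depends on the boundary condition adopted; the function names and the example values are hypothetical.

```python
def areas_from_reflection(reflection, first_area=1.0):
    """Relative section areas from junction reflection coefficients.

    Assumed convention: k_m = (A_m - A_{m+1}) / (A_m + A_{m+1}),
    which rearranges to A_{m+1} = A_m * (1 - k_m) / (1 + k_m).
    """
    areas = [first_area]
    for k in reflection:
        if not -1.0 < k < 1.0:
            raise ValueError("reflection coefficients must lie strictly in (-1, 1)")
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas

def reflection_from_areas(areas):
    """Inverse mapping under the same assumed convention."""
    return [(a - b) / (a + b) for a, b in zip(areas[:-1], areas[1:])]

# Hypothetical three-junction example: four sections, areas relative to the first.
example_k = [0.5, -0.2, 0.0]
example_areas = areas_from_reflection(example_k)
```

Because each junction contributes only a ratio, the number of tube sections fixes the spatial resolution of the resulting area function, which is the limitation noted above for band-limited signals.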
For example, the VTAF estimates obtained in [Wakita, 1973] display large cross-sectional areas near the glottal end (see Fig. 2.11) compared to those obtained using the MRI method (see plots (a) of Figs 4.7 to 4.11). Some researchers interpolate the obtained low-resolution VTAFs using polynomials to obtain smoothed VTAFs [Mahdi, 2003], but the details of the VTAFs still cannot be recovered in this way.

Fig. 2.11. The VTAFs (left, plotted against distance from the glottis in cm) and the VTFs (right, plotted against frequency in kHz) obtained in [Wakita, 1973].

This study points out three problems in the existing methods [Atal and Hanauer, 1971; Wakita, 1973]. First, the vocal-tract filter estimates used in deriving the VTAFs contain the effects of glottal waves, since the speech signals of different speakers are compensated based on the same assumption that glottal waves have a spectrum with a -12 dB/oct slope. This simplified assumption about glottal waves may be adequate in a very low frequency range for some speakers, but is not appropriate in a higher frequency range for all speakers. Secondly, the assumed constant lip boundary conditions cannot be true, since the lip radiation impedance is frequency dependent. Thirdly, the assumed glottal boundary conditions are invalid, because glottal areas vary periodically during phonation.

This study aims to obtain more accurate and higher-resolution VTAF estimates from vowel sounds than previous approaches. We understand that to obtain accurate estimates of VTAFs from vowel sounds, VTF models should be realistic, and that VTF estimates should not contain the effects of glottal waves and of time-varying glottal boundary conditions. Based on this understanding, we first develop more realistic VTF models that include the frequency-dependent lip boundary conditions, as shown in Chapter 3. To avoid the effects of time-varying glottises, we obtain VTF estimates from vowel sounds over closed glottal phases.
To minimize the effects of glottal waves on the VTF estimates, we estimate VTFs from sustained vowel sounds by taking advantage of the fact that the glottal waves for sustained vowel sounds are periodically stationary random processes [Deng et al., IEEE Trans. 2005]. Finally, we eliminate the effects of frequency-dependent lip boundary conditions from the VTF estimates, so that VTAF estimations are free of the effects of frequency-dependent lip boundary conditions [Deng et al., ICASSP 2005]. Our approaches are presented in the following chapters.

2.5 Summary

Previous methods for estimating glottal waves and VTAFs from speech signals are based on simplified assumptions about glottal waves and about glottal and lip boundary conditions. As a result, the resulting glottal waves are not accurate and also lack information corresponding to closed glottal phases; the resulting VTAF estimates lack detail or are unreasonable. This study does not invoke these simplified assumptions about the glottal waves and the glottal and lip boundary conditions, and obtains more accurate estimates of the glottal waves and VTAFs than previous approaches.

3 Transfer Functions of Vocal-Tract Filters

Knowledge of VTF transfer functions is key to obtaining accurate estimates of the glottal waves and VTAFs from speech signals. This chapter clarifies concepts related to VTF estimates, and investigates the factors that affect the transfer function of a VTF estimate, namely the time-varying and non-linear glottal impedance, the frequency-dependent lip radiation impedance, and the vocal-tract area function (VTAF). Finally, transfer functions of VTF estimates that contain the effects of incomplete glottal closures and frequency-dependent lip boundary conditions are formulated.

3.1 Equivalent Acoustic Systems for Producing a Vowel Sound

The acoustic system for producing vowel sounds is shown in Fig. 3.1 [Flanagan, 1972].
The subglottal acoustic system is represented using the constant lung pressure PL and the trachea input impedance ZT. The airflow passing through the glottis is represented using ug(t), and the effect of the glottis on the glottal volume velocity is modeled using the time-varying glottal resistance Rg(t) and the glottal inductance Lg(t). ptg(t) is the trans-glottal pressure. The supraglottal acoustic system is represented using the vocal-tract filter; ZVT is the vocal-tract driving-point impedance looking from the back end of the vocal tract into the vocal tract, and pi(t) is the total sound pressure at the back end of the vocal tract. ulip(t) is the volume velocity at the lip opening, and ZL is the lip radiation impedance, which converts volume velocity to sound pressure in space. The system looking from the back end of the vocal tract into the lung can also be represented using an equivalent volume-velocity source and source impedance, as shown in Fig. 3.2.

Fig. 3.1. The Thevenin equivalent circuit for producing glottal waves.

Fig. 3.2. The Norton equivalent circuit for producing the glottal wave.

The equivalent volume-velocity source usc(t) is the short-circuit volume velocity obtained by forcing pi(t)=0 in Fig. 3.1. Simulations [Fant 1982] show that the effects of the viscous glottal resistance and of the glottal reactance on usc(t) can be neglected, and the short-circuit current is approximated by:

$$u_{sc}(t) = A_g(t)\sqrt{\frac{2P_L}{k\rho}} \tag{3.1}$$

where Ag(t) is the glottal area. The equivalent volume-velocity source usc(t) is referred to as the glottal source. The glottal source usc(t) is a linear function of the glottal area. This knowledge is used in detecting glottal phases using usc(t) in Chapter 5. The excitation source of a vowel sound is the time-varying glottal source usc(t) acting through the source impedance, or equivalently the time-varying glottal wave ug(t), not the constant lung pressure PL.
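The statement that the glottal source is a linear function of the glottal area can be checked numerically. The sketch below assumes the orifice-flow form u_sc = A_g * sqrt(2 * P_L / (k * rho)) for the short-circuit volume velocity; the default values of the coefficient k and the air density, the lung pressure, and the area samples are hypothetical illustration values in SI units, not measurements from this study.

```python
import math

def glottal_source(area_m2, lung_pressure_pa, k=0.875, rho=1.14):
    """Short-circuit volume velocity (m^3/s) for a given glottal area.

    Assumed form: u_sc = A_g * sqrt(2 * P_L / (k * rho)).
    k (dimensionless) and rho (kg/m^3) are illustrative defaults.
    """
    return area_m2 * math.sqrt(2.0 * lung_pressure_pa / (k * rho))

# Hypothetical lung pressure of 800 Pa and a rising-then-falling area pattern.
lung_pressure = 800.0
area_samples = [0.0, 0.5e-5, 1.0e-5, 0.5e-5, 0.0]   # m^2
source_samples = [glottal_source(a, lung_pressure) for a in area_samples]
```

Since the square-root factor is constant for a fixed lung pressure, the source waveform is simply a scaled copy of the area waveform, so its zero and non-zero intervals coincide with those of the glottal area.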
3.2 Transfer Functions for Producing a Vowel Sound

The transfer function from the glottal wave to the lip volume velocity is defined as the vocal-tract filter (VTF):

$$H_{VTF}(f) = \frac{U_{lip}(f)}{U_g(f)} \tag{3.2}$$

where Ulip(f) and Ug(f) are the Fourier transforms of the volume velocity at the lips ulip(t) and the volume velocity at the back end of the vocal tract ug(t), respectively. The transfer function of the vocal tract contains only the effect of the vocal tract, with no effect of the glottal impedance Zg(t). This thesis distinguishes VTFs from glottal-vocal-tract filters (GVTFs), and defines the transfer function of a GVTF as:

$$H_{GVTF}(f) = \frac{U_{lip}(f)}{U_{sc}(f)} \tag{3.3}$$

where Usc(f) is the Fourier transform of usc(t). The GVTF transfer function contains not only the effect of the vocal tract, but also the effect of the non-linear and time-varying glottal impedance Zg(t). Thus, GVTFs are non-linear and time-varying. HGVTF(f) becomes equal to HVTF(f) when the glottis is completely closed. A VTF estimate can be obtained when the glottis is closed. Since incomplete glottal closures are very common, a VTF estimate may be equal to a GVTF corresponding to the incomplete glottal closure. It is common not to distinguish VTFs from GVTFs [Rabiner, 1978]. However, in estimating glottal waves from speech signals, VTFs, not GVTFs, must be used. The VTF and the GVTF are related. According to the equivalent circuit shown in Fig. 3.2, the following relationship holds:

$$\frac{U_g(f)}{U_{sc}(f)} = \frac{Z_g + Z_T}{Z_g + Z_T + Z_{VT}} \tag{3.4}$$

Therefore, the relationship between the GVTF and the VTF is:

$$\frac{H_{GVTF}(f)}{H_{VTF}(f)} = \frac{1}{1 + Z_{VT}/(Z_g + Z_T)} \tag{3.5}$$

From the above relationship, it can be seen that:
1. The spectral difference between a GVTF and the corresponding VTF is determined by the ratio ZVT/(Zg+ZT);
2. When Zg >> ZVT, the transfer function of the GVTF equals that of the VTF;
3. At frequencies where ZVT becomes large, the magnitude of ZVT/(Zg+ZT) is large, and the frequency response of the GVTF is weaker than that of the VTF;
4.
As shown below, the glottal resistance is dominated by the non-linear kinetic resistance when the glottal area is large, and by the linear viscous resistance when the glottal area is small. Thus, a GVTF is time varying and non-linear when the glottis is open and the glottal area is large, but is nearly linear and time invariant when the glottal area is very small.

3.3 Glottal Impedance

When passing through the glottis, the glottal wave encounters viscous and kinetic glottal resistances, and glottal reactance. Both the glottal resistance Rg(t) and the glottal inductance Lg(t) are functions of the time-varying glottal area. This section gives their quantitative formulae.

The glottal resistance is calculated from the relationship between the pressure drop and the volume velocity through the glottis. Aerodynamic theory and experiments using a glottal model show that, given a static glottal volume velocity Ug and a constant glottal area Ag, the pressure drop through the glottis is [Flanagan, 1972]:

$$P_{tg} = \frac{0.875\,\rho\,U_g^2}{2A_g^2} + \frac{12\,\mu H l^2\,U_g}{A_g^3} \tag{3.6}$$

where l is the length of the glottis, H is the glottal depth (thickness), μ is the air viscosity coefficient, and ρ is the air density. The above relationship between Ptg and Ug is non-linear. The dynamic non-linear glottal resistance is determined by the derivative of the pressure drop with respect to the glottal volume velocity:

$$R_g = \frac{dP_{tg}}{dU_g} = \frac{0.875\,\rho\,U_g}{A_g^2} + \frac{12\,\mu H l^2}{A_g^3} = R_k + R_v \tag{3.7}$$

where Rk is the kinetic resistance, and Rv is the viscous resistance. The kinetic glottal resistance is proportional to the glottal volume velocity and inversely proportional to Ag^2. The viscous glottal resistance is proportional to the viscosity coefficient and the depth of the glottis, and is inversely proportional to Ag^3. If the glottal area is large, the glottal resistance is dominated by the kinetic glottal resistance Rk. If the glottal area is small, the glottal resistance is dominated by the viscous glottal resistance Rv.
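The claim that the kinetic term dominates for large glottal areas and the viscous term for small ones follows from the different powers of the area in the two terms of Eq. (3.7). The following sketch evaluates both terms, assuming R_k = 0.875 * rho * U_g / A_g^2 and R_v = 12 * mu * H * l^2 / A_g^3. All numerical values (air density, viscosity, glottal depth and length, and the operating points) are hypothetical illustration values in SI units, chosen only to make the two regimes visible.

```python
RHO = 1.14        # air density, kg/m^3 (assumed value)
MU = 1.86e-5      # air viscosity, Pa*s (assumed value)
DEPTH = 3.0e-3    # glottal depth H, m (assumed value)
LENGTH = 18.0e-3  # glottal length l, m (assumed value)

def kinetic_resistance(u_g, a_g):
    """R_k = 0.875 * rho * U_g / A_g^2, in Pa*s/m^3."""
    return 0.875 * RHO * u_g / a_g ** 2

def viscous_resistance(a_g):
    """R_v = 12 * mu * H * l^2 / A_g^3, in Pa*s/m^3."""
    return 12.0 * MU * DEPTH * LENGTH ** 2 / a_g ** 3

def crossover_area(u_g):
    """Area at which R_k equals R_v for a given volume velocity."""
    return 12.0 * MU * DEPTH * LENGTH ** 2 / (0.875 * RHO * u_g)

# Hypothetical operating points: a wide-open glottis and a nearly closed one.
wide = {"u_g": 2.0e-4, "a_g": 1.0e-5}      # 200 cm^3/s through 0.1 cm^2
narrow = {"u_g": 2.0e-6, "a_g": 1.0e-7}    # 2 cm^3/s through 0.001 cm^2

wide_ratio = kinetic_resistance(**wide) / viscous_resistance(wide["a_g"])
narrow_ratio = kinetic_resistance(**narrow) / viscous_resistance(narrow["a_g"])
```

With these assumed values `wide_ratio` is well above one and `narrow_ratio` well below one; `crossover_area` gives the area at which the two terms are equal for a chosen volume velocity.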
The glottal resistance given in (3.7), which is obtained from the relationship between a static pressure drop and the flow through a time-invariant circular orifice, is used to approximate the glottal resistance when the glottal area is time-varying and non-uniform in shape. The time-varying and non-linear glottal resistance can be represented, to a first-order approximation, by its averaged value over a glottal cycle:

$$\bar{R}_g = \frac{0.875\,\rho\,\bar{U}_g}{\bar{A}_g^2} + \frac{12\,\mu H l^2}{\bar{A}_g^3} \qquad (3.8)$$

where $\bar{A}_g$ is the averaged value of the time-varying glottal area, and $\bar{U}_g$ is the averaged value of the time-varying glottal volume velocity.

While passing through the glottis, the time-varying glottal volume velocity also encounters the inertance of the air in the glottis. This inertance is measured using the glottal inductance $L_g(t)$ [Flanagan, 1972]:

$$L_g(t) = \frac{\rho H}{A_g(t)} \qquad (3.9)$$

The glottal inductance is time-varying and linear.

Given a time-varying glottal area and glottal wave, this thesis derives the total glottal impedance from the total pressure drop across the glottal inductance and the glottal resistance:

$$p_{tg}(t) = \frac{d}{dt}\left[L_g(t)\,u_g(t)\right] + R_g(t)\,u_g(t)$$
$$\phantom{p_{tg}(t)} = u_g(t)\frac{dL_g(t)}{dt} + L_g(t)\frac{du_g(t)}{dt} + R_g(t)\,u_g(t)$$
$$\phantom{p_{tg}(t)} = -\rho H\frac{A_g'(t)}{A_g(t)^2}\,u_g(t) + \frac{\rho H}{A_g(t)}\frac{du_g(t)}{dt} + R_g(t)\,u_g(t) \qquad (3.10)$$

Thus, at angular frequency $\omega$, the equivalent glottal impedance at time $t$ is:

$$Z_g(t) = -\rho H\frac{A_g'(t)}{A_g(t)^2} + \frac{12\,\mu H l^2}{A_g(t)^3} + \frac{0.875\,\rho\,u_g(t)}{A_g(t)^2} + j\omega\frac{\rho H}{A_g(t)} \qquad (3.11)$$

Clearly, when $A_g(t) = 0$, both the glottal resistance and the glottal inductance are infinite. According to Eq. (3.11), this thesis finds that the time-varying glottal inductance introduces a resistance $-\rho H A_g'(t)/A_g(t)^2$ in addition to the kinetic and viscous glottal resistances. This resistance is negative when the glottis is opening ($A_g'(t) > 0$), and is positive when the glottis is closing ($A_g'(t) < 0$). This thesis names $-\rho H A_g'(t)/A_g(t)^2$ the glottal closing resistance $R_c$. From Eq.
(3.11), we understand that right before the glottal-closure instant, when $A_g(t)$ is small and $A_g'(t)$ is near its negative maximum, the total glottal resistance can be greater than that corresponding to an incomplete glottal closure, for which $A_g(t)$ is non-zero and $A_g'(t)$ is zero. This is illustrated in Fig. 4.12 in Section 4.4.

As to the effect of the time-varying and non-linear glottal resistance on the GVTF, we note that [Fant, 1982; Plumpe et al., 1999] draw two incorrect conclusions: that the time-varying glottal area introduces an equivalent "hypothetical inductance", and that the resonant frequency of a GVTF is lower than that of the VTF if $A_g'(t)$ is negative, and higher than that of the VTF if $A_g'(t)$ is positive. These conclusions are drawn from a "pseudo-Laplace transform" [Fant, 1982; Plumpe et al., 1999], in which the Laplace transform is applied to the derivative of a differential equation that relates the time-varying glottal resistance to the sound pressure at the back end of the vocal tract, and the non-zero time derivative of the glottal area is allowed to remain in the coefficients of the transformed equation (see Appendix A). For the Laplace transform to be applicable to a differential equation, the coefficients of the equation must be time-invariant, i.e., the parameters of the system must be time-invariant. The Laplace transform is therefore not applicable to a time-varying system, and it is incorrect to allow the non-zero time derivative of the glottal area to appear in the coefficients of the transformed equation. Thus, the conclusions drawn from the "pseudo-Laplace transform" are incorrect. Our conclusions are that the time-varying glottal area introduces a glottal closing resistance, and that the resonant frequency of a GVTF is always higher than that of the corresponding VTF, as shown in Appendix B.
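The sign behaviour of the glottal closing resistance can be illustrated with a few lines of code. This is a sketch under stated assumptions, not the thesis procedure: $\rho$ and $H$ take the values quoted in Section 4.4, the area slope is a hypothetical average obtained by assuming the area falls from 20 to 4 mm$^2$ during a closing phase of $0.1T_0$ at 200 Hz, and the function name is illustrative.

```python
# Sketch of the closing-resistance term of Eq. (3.11):
#   R_c = -rho * H * A_g'(t) / A_g(t)^2
# The slope used in the demo is a hypothetical average, not thesis data.

RHO = 1.14   # air density, kg/m^3 (value quoted in Section 4.4)
H = 3e-3     # glottal depth, m (value quoted in Section 4.4)


def closing_resistance(a_g, da_g_dt):
    """R_c: negative while the glottis opens, positive while it closes."""
    return -RHO * H * da_g_dt / a_g ** 2


if __name__ == "__main__":
    t0 = 1.0 / 200.0                         # pitch period, s
    slope = (4e-6 - 20e-6) / (0.1 * t0)      # assumed average closing slope, m^2/s
    print("closing:", closing_resistance(4e-6, slope))
    print("opening:", closing_resistance(4e-6, -slope))
    print("static :", closing_resistance(4e-6, 0.0))
```

With a static, incompletely closed glottis ($A_g'(t) = 0$) the term vanishes, so only the viscous and kinetic resistances remain.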
3.4 The Lip Radiation Impedance

During phonation, the vibrating air plug in the lip opening acts as a sound source. A vibrating surface transmits sound waves to the air by overcoming the reaction of the air on the vibrating surface. The radiation impedance of the vibrating surface is the complex ratio of the average sound pressure on the surface to the volume velocity produced by the vibrating surface. Therefore, sound-pressure waves that are radiated and then reflected back to the sound source from nearby reflectors make the source radiation impedance greater than that in free space.

In the literature, one approximation for the lip radiation impedance is that of a circular piston in an infinite baffle. The radiation impedance of a piston in an infinite baffle, normalized by $\rho c/A$, is [Flanagan, 1972]:

$$Z_p = \left[1 - \frac{J_1(2ka)}{ka}\right] + j\,\frac{K_1(2ka)}{2(ka)^2} \qquad (3.15)$$

where $k = \omega/c$, $\omega$ is the angular frequency, $c$ is the sound speed, $a$ is the radius of the piston, $a = (A/\pi)^{1/2}$, $A$ is the area of the piston, $\rho$ is the density of the air, $J_1(x)$ is the first-order Bessel function, and $K_1(x)$ is a related Bessel function:

$$K_1(x) = \frac{2}{\pi}\left[\frac{x^3}{3} - \frac{x^5}{3^2\cdot 5} + \frac{x^7}{3^2\cdot 5^2\cdot 7} - \cdots\right] \qquad (3.16)$$

The normalized acoustic radiation resistance and reactance for a piston with 5 cm$^2$ area in a baffle are plotted in Fig. 3.3(a).

Fig. 3.3. (a) The frequency responses of the normalized radiation resistance R (--), reactance X (-.-) and impedance Z (-) of a 5-cm$^2$ piston in an infinite baffle; (b) those of a lip opening of 5 cm$^2$.

However, according to the definition of an infinite baffle, a boundary can be considered an infinite baffle only if the dimensions of the boundary are much greater than a wavelength of the sound [Kinsler, 2000]. The wavelength of the sound is $\lambda = 3.5$ cm at 10,000 Hz, and the radius of the head is about 9 cm.
Therefore, at frequencies lower than 10 kHz, the head cannot reflect as strongly as an infinite baffle does, and the lip radiation impedance at low frequencies is smaller than that of a piston with the same area as the lip opening in an infinite baffle. Another approximation for the lip radiation impedance is the radiation impedance of a piston in a sphere [Flanagan, 1972]. However, this approximation does not take into account reflections from the body of a speaker, and thus the lip radiation impedance is under-estimated.

In this study, the lip radiation impedance at low frequencies is approximated using the radiation impedance of an unflanged pipe with the same opening as the lip opening and, at high frequencies, using that of a piston in an infinite baffle. The radiation impedance of an unflanged pipe is approximately half that of the piston in a baffle [Kinsler, 2000]. Considering that the reflection effect of the head and body becomes stronger as frequency increases, this study approximates the lip radiation impedance using a frequency-dependent weighting of $Z_p$, i.e., the normalized lip radiation impedance at frequency $f$ is represented as:

$$\frac{Z_L S_M}{\rho c} = \left(0.5 + \frac{f}{F_s}\right) Z_p \qquad (3.17)$$

where $S_M$ is the lip-opening area, and $F_s$ is the sampling rate of the speech signal. For a 5-cm$^2$ lip opening, the frequency responses of the lip radiation resistance, reactance and total impedance given by Eq. (3.17) are shown in Fig. 3.3(b).

3.5 The Signal Flow Diagram of VTF

To understand the modulation effect of the vocal tract on the glottal wave signal, the signal flow diagram of sound signals in the vocal tract is needed. In the discrete-time domain, the speech sound signal is sampled at discrete time instants, and the modulation effect of the vocal tract can be modeled using a cylindrical tube with M equal-length sections, as shown in Fig. 3.4 [Rabiner, 1978]. In Fig.
3.4, $S_m$ is the cross-sectional area of the vocal tract at distance $x = mL_{VT}/M$ from the glottis, $u_m^+(t)$ and $u_m^-(t)$ are the positive-going volume velocity and negative-going volume velocity, respectively, at the left end of the $m$-th section at time $t$, and the arrows indicate the transmission directions of the positive-going and negative-going sound waves, not their reference directions. The literature on speech signal processing uses the convention that the reference directions of the positive-going and negative-going volume velocities are opposite. This study uses the convention that the reference directions of the positive-going and negative-going volume velocities are the same. The convention used here conforms to the linear superposition principle, but the other convention does not [Deng et al., Eurospeech 2003]. It is shown that the final transfer functions derived under both conventions are the same.

Fig. 3.4. The acoustic tube model of the vocal tract.

The glottal wave (total volume velocity at the back end of the vocal tract) is represented by $u_g(t) = u_1^+(t) + u_1^-(t)$, and the total volume velocity at the lip opening is represented by $u_{lip}(t) = u_{M+1}(t)$, as shown in Fig. 3.4. The transfer function of the vocal-tract filter is then $U_{M+1}(f)/(U_1^+(f) + U_1^-(f))$, where $U_{M+1}(f)$ is the Fourier transform of $u_{M+1}(t)$, and $U_1^+(f) + U_1^-(f)$ is the Fourier transform of $u_1^+(t) + u_1^-(t)$.

At the glottal boundary, assuming the glottal impedance is much larger than the trachea input impedance, the continuity of volume velocity and of sound pressure leads to the following relationship, as shown in Fig. 3.2:

$$u_1^+(t) + u_1^-(t) = u_{sc}(t) - \left[u_1^+(t) - u_1^-(t)\right]\frac{\rho c}{Z_g S_1} \qquad (3.18)$$

Define the glottal reflection coefficient as:

$$r_g = \frac{Z_g - \rho c/S_1}{Z_g + \rho c/S_1} \qquad (3.19)$$

Then, from Eqs.
(3.18) and (3.19), we get:

$$u_1^+(t) = 0.5\,(1 + r_g)\,u_{sc}(t) - r_g\,u_1^-(t) \qquad (3.20)$$

At the boundary of the $m$-th and $(m+1)$-th sections, the continuity of volume velocity must be satisfied, i.e.:

$$u_m^+(t-D) + u_m^-(t+D) = u_{m+1}^+(t) + u_{m+1}^-(t) \qquad (3.21)$$

Also, the continuity of sound pressure must be satisfied, i.e.:

$$\left[u_m^+(t-D) - u_m^-(t+D)\right]\frac{\rho c}{S_m} = \left[u_{m+1}^+(t) - u_{m+1}^-(t)\right]\frac{\rho c}{S_{m+1}} \qquad (3.22)$$

Define the reflection coefficient at the left side of the boundary of the $m$-th and $(m+1)$-th sections as:

$$r_m = \frac{S_{m+1} - S_m}{S_{m+1} + S_m} \qquad (3.23)$$

From Eqs. (3.21) and (3.22), we get:

$$u_{m+1}^+(t) = (1 + r_m)\,u_m^+(t-D) - r_m\,u_{m+1}^-(t) \qquad (3.24)$$

$$u_m^-(t+D) = r_m\,u_m^+(t-D) + (1 - r_m)\,u_{m+1}^-(t) \qquad (3.25)$$

At the lip boundary, the continuity of sound pressure leads to:

$$\left[u_M^+(t-D) - u_M^-(t+D)\right]\frac{\rho c}{S_M} = Z_L\,u_{lip}(t) \qquad (3.26)$$

and the continuity of volume velocity leads to:

$$u_M^+(t-D) + u_M^-(t+D) = u_{lip}(t) \qquad (3.27)$$

Define the lip reflection coefficient as:

$$r_{lip} = \frac{\rho c/S_M - Z_L}{\rho c/S_M + Z_L} \qquad (3.28)$$

Then, from Eqs. (3.26) and (3.27), we get:

$$u_{lip}(t) = (1 + r_{lip})\,u_M^+(t-D) \qquad (3.29)$$

and

$$u_M^-(t+D) = r_{lip}\,u_M^+(t-D) \qquad (3.30)$$

The relationships in Eqs. (3.18) to (3.30) are summarized in the volume-velocity signal flow diagram shown in Fig. 3.5.

Fig. 3.5. The signal flow diagram from the glottal source to the lip volume velocity.

Fig. 3.6. The discrete-time signal flow diagram from the glottal source to the lip volume velocity.

In discrete-time signal processing, the Z-domain signal flow diagram is used. The signals and elements in Fig. 3.5 are substituted with their Z-domain equivalents. The frequency-dependent glottal reflection coefficient $r_g$ and lip reflection coefficient $r_{lip}$ are represented using their Z transforms $r_g(z)$ and $r_{lip}(z)$, respectively.
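The junction relations (3.23) to (3.25) can be verified against the continuity conditions (3.21) and (3.22) from which they are derived. The following is a minimal sketch. The function names and the numeric inputs are hypothetical, and the wave values stand for $u_m^+(t-D)$ and $u_{m+1}^-(t)$ at a single time instant.

```python
# Sketch of one tube junction under the same-reference-direction convention.
# Inputs:  u_plus_in  = u_m^+(t - D)   (wave arriving from the glottal side)
#          u_minus_in = u_{m+1}^-(t)   (wave arriving from the lip side)
# Outputs: u_{m+1}^+(t) and u_m^-(t + D), following Eqs. (3.24) and (3.25).


def reflection_coefficient(s_m, s_next):
    """r_m = (S_{m+1} - S_m) / (S_{m+1} + S_m), Eq. (3.23)."""
    return (s_next - s_m) / (s_next + s_m)


def scatter(u_plus_in, u_minus_in, s_m, s_next):
    """Return (u_{m+1}^+(t), u_m^-(t + D)) for one junction."""
    r_m = reflection_coefficient(s_m, s_next)
    u_plus_out = (1.0 + r_m) * u_plus_in - r_m * u_minus_in
    u_minus_out = r_m * u_plus_in + (1.0 - r_m) * u_minus_in
    return u_plus_out, u_minus_out


if __name__ == "__main__":
    s_m, s_next = 2.0, 6.0          # hypothetical areas, cm^2
    a, e = 1.0, 0.2                 # hypothetical incoming waves
    c, b = scatter(a, e, s_m, s_next)
    # Continuity of volume velocity, Eq. (3.21): both sides should match.
    print(a + b, c + e)
    # Continuity of pressure, Eq. (3.22), with the common factor rho*c dropped.
    print((a - b) / s_m, (c - e) / s_next)
```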
If the number of sections and the sampling rate are related as:

$$M = \frac{2L_{VT}F_s}{c} \qquad (3.31)$$

then $D = L_{VT}/(Mc) = 0.5/F_s$, and the time delay unit $D$ in Fig. 3.5 is transformed to $z^{-1/2}$ in the Z domain. The discrete-time volume-velocity signal flow diagram of the system for producing a vowel sound is shown in Fig. 3.6.

Before deriving the formulae for the discrete-time transfer functions of the VTF and of the GVTF, discrete-time models of the frequency-dependent glottal reflection coefficient $r_g$ and lip reflection coefficient $r_{lip}$ are needed. The following two sections derive these models.

3.6 The Discrete-Time Model for the Glottal Reflection Coefficient

It can be shown that the glottal reflection coefficient is a high-pass filter (see Section 4.5). In this study, the glottal reflection coefficient corresponding to a small glottal area is modeled as a first-order FIR filter in the Z domain:

$$r_g(z) = \gamma + \theta z^{-1} \qquad (3.32)$$

The parameters $\gamma$ and $\theta$ can be determined using the $r_g$ values at $f = 0$ and at $f = F_s/2$:

$$r_g(z=1) = r_g(f=0) = \gamma + \theta \qquad (3.33)$$

$$r_g(z=-1) = r_g(f=F_s/2) = \gamma - \theta \qquad (3.34)$$

Thus,

$$\gamma = \left(r_g(0) + r_g(F_s/2)\right)/2 \qquad (3.35)$$

$$\theta = \left(r_g(0) - r_g(F_s/2)\right)/2 \qquad (3.36)$$

This first-order FIR model of $r_g$ closely approximates the calculated $r_g$ corresponding to the small glottal areas of incomplete glottal closures, as shown in Section 4.5.

3.7 The Discrete-Time Model for the Lip Reflection Coefficient

As shown in Section 3.4, the lip radiation impedance is modeled using a weighted $Z_p$. Substituting the lip radiation impedance $Z_L$ of Eq. (3.17) into Eq. (3.28), one obtains:

$$r_{lip} = \frac{1 - (0.5 + f/F_s)\,Z_p}{1 + (0.5 + f/F_s)\,Z_p} \qquad (3.38)$$

Since the lip radiation impedance is a function of the product of frequency and lip-opening radius, as shown in Eq. (3.15), $r_{lip}$ decreases from 1 more rapidly with increasing frequency for a larger lip opening than for a smaller one.
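The first-order FIR model of Section 3.6 is fixed entirely by the two end-point values in Eqs. (3.33) and (3.34). The sketch below makes that explicit. It assumes, as Eqs. (3.35) and (3.36) do, that real values of $r_g$ at $f = 0$ and $f = F_s/2$ are available. The numeric values and function names are hypothetical.

```python
import cmath


def fit_glottal_fir(r_dc, r_nyquist):
    """Return (gamma, theta) of r_g(z) = gamma + theta * z^-1, Eqs. (3.35)-(3.36)."""
    gamma = 0.5 * (r_dc + r_nyquist)
    theta = 0.5 * (r_dc - r_nyquist)
    return gamma, theta


def fir_response(gamma, theta, freq_hz, fs_hz):
    """Evaluate gamma + theta * z^-1 on the unit circle at freq_hz."""
    z_inv = cmath.exp(-2j * cmath.pi * freq_hz / fs_hz)
    return gamma + theta * z_inv


if __name__ == "__main__":
    fs = 44100.0
    # Hypothetical end-point values of a high-pass reflection coefficient.
    gamma, theta = fit_glottal_fir(r_dc=0.94, r_nyquist=0.98)
    for f in (0.0, 5000.0, 11025.0, fs / 2):
        print(f, abs(fir_response(gamma, theta, f, fs)))
```

Because $r_g$ is high-pass, $r_g(0) < r_g(F_s/2)$ and the fitted $\theta$ is negative.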
For mathematical simplicity, the lip radiation impedance at low frequencies can also be approximated using the radiation impedance of a vibrating spherical source in free space with the same surface area as the lip opening. The normalized radiation impedance of a spherical source is [Flanagan, 1972]:

$$Z_{sp} = \frac{jka_s}{1 + jka_s} = \frac{0.5\,jka_M}{1 + 0.5\,jka_M} \qquad (3.39)$$

where $k = \omega/c$, $c$ is the sound speed, $a_s$ is the radius of the sphere, $a_s = (S_M/4\pi)^{1/2} = 0.5\,a_M$, $S_M$ is the area of the lip opening, and $a_M$ is the radius of the lip opening. Therefore, at low frequencies, the lip reflection coefficient can be expressed as:

$$r_{lip} = \frac{1 - j0.5ka_M/(1 + j0.5ka_M)}{1 + j0.5ka_M/(1 + j0.5ka_M)} = \frac{1}{1 + jka_M} \qquad (3.40)$$

In the Z domain, $r_{lip}$ is represented using its Z transform, which can be obtained using the bilinear transformation [Oppenheim and Schafer, 1999; Rahim, 1994]:

$$j\omega = 2F_s\,\frac{1 - z^{-1}}{1 + z^{-1}} \qquad (3.41)$$

where $\omega$ is the angular frequency, and $F_s$ is the sampling rate of the signal. Inserting Eq. (3.41) into Eq. (3.40), we get the Z transform of $r_{lip}$ as:

$$r_{lip}(z) = \frac{1}{1 + \dfrac{2(1 - z^{-1})F_s}{1 + z^{-1}}\,\dfrac{a_M}{c}} = \frac{1 + z^{-1}}{1 + 2a_MF_s/c + z^{-1}\,(1 - 2a_MF_s/c)} \qquad (3.42)$$

The above $r_{lip}(z)$ represents the $r_{lip}$ values over a low frequency range. For the frequency range 0 to $F_s/2$, it is modified as:

$$r_{lip}(z) = \mu\,\frac{1 + \beta z^{-1}}{1 + \alpha z^{-1}} \qquad (3.43)$$

where $\mu = (1 + \alpha)/(1 + \beta)$, because $r_{lip}(z=1) = r_{lip}(f=0) = 1$. The pole of $r_{lip}(z)$ is set to be the same as the pole in Eq. (3.42), i.e.,

$$\alpha = \frac{1 - 2a_MF_s/c}{1 + 2a_MF_s/c} \qquad (3.44)$$

and $\beta$ is determined so that $r_{lip}(z=-1)$ equals the value of $r_{lip}(f=F_s/2)$, i.e.,

$$r_{lip}(z=-1) = \frac{(1 + \alpha)(1 - \beta)}{(1 + \beta)(1 - \alpha)} = r_{lip}(f = F_s/2) \qquad (3.45)$$

Therefore,

$$\beta = \frac{1 - r_{lip}(F_s/2) + \alpha\,(1 + r_{lip}(F_s/2))}{1 + r_{lip}(F_s/2) + \alpha\,(1 - r_{lip}(F_s/2))} \qquad (3.46)$$

The frequency response of $r_{lip}(f)$ in Eq. (3.38) and that of $r_{lip}(z)$ in Eq. (3.43) are presented in Section 4.1 for different lip-opening areas.
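Equations (3.43) to (3.46) amount to a short recipe for the three parameters of $r_{lip}(z)$. The following sketch implements that recipe under two assumptions that are not stated at this point of the text: the sound speed is taken as 35000 cm/s (the value implied by the section length and sampling rate given in Chapter 4), and $r_{lip}(F_s/2)$ is supplied as a real number. The function names are hypothetical.

```python
import math

C_SOUND_CM = 35000.0  # assumed sound speed, cm/s


def lip_filter_parameters(s_m_cm2, r_nyquist, fs_hz, c_cm=C_SOUND_CM):
    """Return (alpha, beta, mu) of r_lip(z) = mu (1 + beta z^-1) / (1 + alpha z^-1)."""
    a_m = math.sqrt(s_m_cm2 / math.pi)            # lip-opening radius, cm
    x = 2.0 * a_m * fs_hz / c_cm
    alpha = (1.0 - x) / (1.0 + x)                 # Eq. (3.44)
    beta = ((1.0 - r_nyquist + alpha * (1.0 + r_nyquist))
            / (1.0 + r_nyquist + alpha * (1.0 - r_nyquist)))  # Eq. (3.46)
    mu = (1.0 + alpha) / (1.0 + beta)             # from r_lip(z = 1) = 1
    return alpha, beta, mu


def lip_reflection(z, alpha, beta, mu):
    """Evaluate Eq. (3.43) at a (possibly complex) point z."""
    return mu * (1.0 + beta / z) / (1.0 + alpha / z)


if __name__ == "__main__":
    # Lip-opening area and r_lip(Fs/2) as listed for /a/ in Table 4.1.
    alpha, beta, mu = lip_filter_parameters(5.03, 0.08, 44100.0)
    print(alpha, beta, mu)
    print(lip_reflection(1.0, alpha, beta, mu), lip_reflection(-1.0, alpha, beta, mu))
```

With the /a/ entries of Table 4.1 as input, the computed $\alpha$ and $\beta$ land close to the tabulated values. Small differences are expected because the tabulated $r_{lip}(F_s/2)$ is rounded to two decimals.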
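3.8 The Discrete-Time Transfer Functions of VTFs and GVTFs

In the discrete-time domain, the transfer function of the VTF is:

$$H_{VTF}(z) = \frac{U_{lip}(z)}{U_g(z)} \qquad (3.47)$$

where $U_{lip}(z)$ and $U_g(z)$ are the Z transforms of $u_{lip}(n)$ and $u_g(n)$, the discrete-time signals of $u_{lip}(t)$ and $u_g(t)$, respectively. The transfer function of the GVTF is:

$$H_{GVTF}(z) = \frac{U_{lip}(z)}{U_{sc}(z)} \qquad (3.48)$$

The transfer function of the GVTF can be derived from the Z transforms of Eqs. (3.18) to (3.30). From Eq. (3.20), $U_{sc}(z)$ can be expressed as:

$$U_{sc}(z) = \begin{pmatrix} \dfrac{2}{1 + r_g(z)} & \dfrac{2r_g(z)}{1 + r_g(z)} \end{pmatrix} \begin{pmatrix} U_1^+(z) \\ U_1^-(z) \end{pmatrix} \qquad (3.49)$$

From Eqs. (3.21) and (3.22), we get:

$$U_m^+(z) = \frac{1}{1 + r_m}\,U_{m+1}^+(z)\,z^{1/2} + \frac{r_m}{1 + r_m}\,U_{m+1}^-(z)\,z^{1/2} \qquad (3.50)$$

$$U_m^-(z)\,z^{1/2} = \frac{r_m}{1 + r_m}\,U_{m+1}^+(z) + \frac{1}{1 + r_m}\,U_{m+1}^-(z) \qquad (3.51)$$

Combining Eqs. (3.50) and (3.51), we get:

$$\begin{pmatrix} U_m^+(z) \\ U_m^-(z) \end{pmatrix} = \frac{z^{1/2}}{1 + r_m} \begin{pmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{pmatrix} \begin{pmatrix} U_{m+1}^+(z) \\ U_{m+1}^-(z) \end{pmatrix} \qquad (3.52)$$

Thus,

$$\begin{pmatrix} U_1^+(z) \\ U_1^-(z) \end{pmatrix} = \prod_{m=1}^{M-1} \left\{ \frac{z^{1/2}}{1 + r_m} \begin{pmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{pmatrix} \right\} \begin{pmatrix} U_M^+(z) \\ U_M^-(z) \end{pmatrix} \qquad (3.53)$$

From Eqs. (3.29) and (3.30), we get:

$$\begin{pmatrix} U_M^+(z) \\ U_M^-(z) \end{pmatrix} = \frac{z^{1/2}}{1 + r_{lip}(z)} \begin{pmatrix} 1 \\ r_{lip}(z)\,z^{-1} \end{pmatrix} U_{lip}(z) \qquad (3.54)$$

Substituting Eqs. (3.53) and (3.54) into Eq. (3.49) gives:

$$U_{sc}(z) = \begin{pmatrix} \dfrac{2}{1 + r_g(z)} & \dfrac{2r_g(z)}{1 + r_g(z)} \end{pmatrix} \prod_{m=1}^{M-1} \left\{ \frac{z^{1/2}}{1 + r_m} \begin{pmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{pmatrix} \right\} \frac{z^{1/2}}{1 + r_{lip}(z)} \begin{pmatrix} 1 \\ r_{lip}(z)\,z^{-1} \end{pmatrix} U_{lip}(z) \qquad (3.55)$$

The transfer function of the GVTF is then:

$$H_{GVTF}(z) = \frac{U_{lip}(z)}{U_{sc}(z)} = \frac{z^{-M/2}\,0.5\,(1 + r_g(z))\,(1 + r_{lip}(z)) \prod_{m=1}^{M-1}(1 + r_m)}{\begin{pmatrix} 1 & r_g(z) \end{pmatrix} \prod_{m=1}^{M-1} \begin{pmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{pmatrix} \begin{pmatrix} 1 \\ r_{lip}(z)\,z^{-1} \end{pmatrix}} \qquad (3.56)$$

Denote:

$$\begin{pmatrix} A_{M-1}(z) & B_{M-1}(z) \\ C_{M-1}(z) & D_{M-1}(z) \end{pmatrix} = \prod_{m=1}^{M-1} \begin{pmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{pmatrix} \qquad (3.57)$$

Then:

$$H_{GVTF}(z) = \frac{0.5\,(1 + r_g(z))\,(1 + r_{lip}(z))\,z^{-M/2} \prod_{m=1}^{M-1}(1 + r_m)}{\begin{pmatrix} A_{M-1}(z) + r_g(z)\,C_{M-1}(z) & B_{M-1}(z) + r_g(z)\,D_{M-1}(z) \end{pmatrix} \begin{pmatrix} 1 \\ r_{lip}(z)\,z^{-1} \end{pmatrix}}$$

$$\phantom{H_{GVTF}(z)} = \frac{0.5\,(1 + r_g(z))\,(1 + r_{lip}(z))\,z^{-M/2} \prod_{m=1}^{M-1}(1 + r_m)}{A_{M-1}(z) + r_g(z)\,C_{M-1}(z) + \left[B_{M-1}(z) + r_g(z)\,D_{M-1}(z)\right] z^{-1}\,r_{lip}(z)} \qquad (3.58)$$

Substituting $r_{lip}(z)$ of Eq. (3.43) and $r_g(z)$ of Eq. (3.32) into Eq. (3.58), the GVTF transfer function is:

$$H_{GVTF}(z) = \frac{0.5\,(1 + \gamma + \theta z^{-1})\left[1 + \mu + (\alpha + \beta\mu)\,z^{-1}\right] z^{-M/2} \prod_{m=1}^{M-1}(1 + r_m)}{1 + \sum_{m=1}^{M+2} b_m z^{-m}} \qquad (3.59)$$

where the $b_m$ are functions of $r_g$, the $r_m$ and $r_{lip}$. Since the glottal impedance is time-varying, the GVTF transfer function is also time-varying.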
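Equation (3.58) can be evaluated directly at a single frequency by multiplying out the section matrices of Eq. (3.57). The sketch below does this in plain Python. It is an illustration under simplifying assumptions rather than the thesis implementation: the caller supplies the values of $r_g$ and $r_{lip}$ at the evaluation frequency, and the area function in the demo is hypothetical. For a uniform lossless tube with $r_g = r_{lip} = 1$ the result reduces to $1/\cos(\pi M f/F_s)$, which gives a convenient check.

```python
import cmath


def mat_mul(p, q):
    """Product of two 2x2 matrices stored as nested tuples."""
    (a, b), (c, d) = p
    (e, f), (g, h) = q
    return ((a * e + b * g, a * f + b * h), (c * e + d * g, c * f + d * h))


def gvtf_response(areas, r_g, r_lip, freq_hz, fs_hz):
    """Evaluate Eq. (3.58) at freq_hz.

    areas        : S_1 .. S_M, section areas from glottis to lips
    r_g, r_lip   : reflection coefficients at this frequency (may be complex)
    """
    m_sections = len(areas)
    z_inv = cmath.exp(-2j * cmath.pi * freq_hz / fs_hz)
    chain = ((1.0, 0.0), (0.0, 1.0))
    gain = 1.0
    for s_m, s_next in zip(areas[:-1], areas[1:]):
        r_m = (s_next - s_m) / (s_next + s_m)          # Eq. (3.23)
        chain = mat_mul(chain, ((1.0, r_m), (r_m * z_inv, z_inv)))
        gain *= 1.0 + r_m
    (a, b), (c, d) = chain                              # A, B, C, D of Eq. (3.57)
    delay = cmath.exp(-1j * cmath.pi * m_sections * freq_hz / fs_hz)  # z^(-M/2)
    numerator = 0.5 * (1.0 + r_g) * (1.0 + r_lip) * gain * delay
    denominator = a + r_g * c + (b + r_g * d) * z_inv * r_lip
    return numerator / denominator


if __name__ == "__main__":
    fs = 44100.0
    areas = [1.0, 2.0, 4.0, 3.0]          # hypothetical area function, cm^2
    for f in (0.0, 1000.0, 3000.0):
        print(f, abs(gvtf_response(areas, 0.95, 0.7, f, fs)))
```

Passing `r_g = 1` evaluates the closed-glottis case, i.e., the VTF.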
The transfer function of the VTF can be obtained by setting $r_g(z) = 1$ in Eq. (3.58):

$$H_{VTF}(z) = \frac{\left[1 + \mu + (\alpha + \beta\mu)\,z^{-1}\right] z^{-M/2} \prod_{m=1}^{M-1}(1 + r_m)}{\left[A_{M-1}(z) + C_{M-1}(z)\right](1 + \alpha z^{-1}) + \left[B_{M-1}(z) + D_{M-1}(z)\right] z^{-1}\,(1 + \beta z^{-1})\,\mu}$$

$$\phantom{H_{VTF}(z)} = \frac{\left[1 + \mu + (\alpha + \beta\mu)\,z^{-1}\right] z^{-M/2} \prod_{m=1}^{M-1}(1 + r_m)}{1 + \sum_{m=1}^{M+1} a_m z^{-m}} \qquad (3.60)$$

where the $a_m$ are functions of the $r_m$ and $r_{lip}$.

3.9 Vocal-Tract Driving Point Impedance

As shown in Eq. (3.5), the driving-point impedance $Z_{VT}$, which is the acoustic impedance looking from the back end of the vocal tract into the vocal tract, plays an important role in determining the difference between the VTF and the GVTF. There are two consequences when $Z_{VT}$ becomes non-negligible compared to $Z_g$. First, the interaction between the glottal source and the vocal tract becomes stronger, i.e., the difference between the glottal waveform $u_g(t)$ and the glottal source waveform $u_{sc}(t)$ increases. Second, the difference between the GVTF and the VTF also increases.

The vocal-tract driving point impedance can be calculated from the transfer impedance [Kinsler, 2001] of the M-section tube model of the vocal tract. The acoustic impedance observed from the left end of the $M$-th section (Fig. 3.4) is:

$$Z_M = \frac{\rho c}{S_M}\,\frac{Z_L + j\,\dfrac{\rho c}{S_M}\tan(kL/M)}{\dfrac{\rho c}{S_M} + jZ_L\tan(kL/M)} \qquad (3.61)$$

where $S_M$ is the cross-sectional area of the $M$-th section of the tube model of the vocal tract, and $L$ is the vocal-tract length $L_{VT}$. The acoustic impedance observed from the left end of the $(m-1)$-th section is:

$$Z_{m-1} = \frac{\rho c}{S_{m-1}}\,\frac{Z_m + j\,\dfrac{\rho c}{S_{m-1}}\tan(kL/M)}{\dfrac{\rho c}{S_{m-1}} + jZ_m\tan(kL/M)} \qquad (3.62)$$

where $m = M, M-1, \ldots, 2$, and $Z_m$ is the acoustic impedance observed from the left side of the $m$-th tube. $Z_1$ is then the vocal-tract driving point input impedance $Z_{VT}$.

The vocal-tract driving point impedance is related to the transfer function of the VTF as:

$$Z_{VT} = \frac{P_1}{U_g} = \frac{P_1}{U_{lip}}\,\frac{U_{lip}}{U_g} = \frac{P_1}{U_{lip}}\,H_{VTF} = \frac{P_1}{P_{lip}}\,Z_L\,H_{VTF} \qquad (3.63)$$

where $P_1$ is the sound pressure at the back end of the vocal tract. $Z_{VT}$ resonates at the VTF formant frequencies.
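The recursion of Eqs. (3.61) and (3.62) translates into a short loop that starts from the lip radiation impedance and works back to the glottis. The following sketch assumes SI units, the air density quoted in Section 4.4, and a sound speed of 350 m/s. The function name and the test geometry are hypothetical. For a uniform tube terminated by an ideal open end ($Z_L = 0$) the recursion must reproduce $j(\rho c/S)\tan(kL)$, which is used as a sanity check.

```python
import math

RHO = 1.14        # air density, kg/m^3 (value quoted in Section 4.4)
C_SOUND = 350.0   # assumed sound speed, m/s


def driving_point_impedance(areas_m2, z_lip, freq_hz, section_length_m):
    """Z_VT from Eqs. (3.61)-(3.62).

    areas_m2         : S_1 .. S_M in m^2, glottis to lips
    z_lip            : lip radiation impedance Z_L at freq_hz (acoustic ohms, SI)
    section_length_m : L_VT / M in m
    """
    k = 2.0 * math.pi * freq_hz / C_SOUND
    t = math.tan(k * section_length_m)
    z = z_lip
    for s in reversed(areas_m2):           # sections M, M-1, ..., 1
        z0 = RHO * C_SOUND / s             # characteristic impedance rho*c/S
        z = z0 * (z + 1j * z0 * t) / (z0 + 1j * z * t)
    return z


if __name__ == "__main__":
    areas = [5e-4] * 10                    # uniform 5 cm^2 tube, 10 sections
    for f in (100.0, 300.0, 450.0):
        print(f, driving_point_impedance(areas, 0.0, f, 0.017))
```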
$Z_{VT}$ resonates more strongly if the lip opening is smaller, because a smaller lip opening has a smaller damping effect on the VTF. The frequency responses of $Z_{VT}$ for different VTAFs are presented in Section 4.6.

3.10 Summary

VTFs are different from GVTFs. VTFs contain only the effects of vocal tracts, and are linear and time-invariant for sustained vowel sounds. GVTFs contain not only the effects of vocal tracts but also the effects of time-varying glottal impedances, and are time-varying and non-linear. A GVTF equals the VTF when the glottis is completely closed. The transfer function of a VTF estimate containing the effects of incomplete glottal closures and frequency-dependent lip impedances is formulated. The difference between a VTF estimate and the corresponding VTF is determined not only by the glottal impedance but also by the vocal-tract driving point impedance, which resonates at the VTF resonant frequencies. These concepts and formulae are used in the simulations of VTFs and GVTFs in Chapters 4 and 6, to reveal the effects of incomplete glottal closures and lip radiation impedances on the estimates of VTFs and VTAFs for different vowel sounds.

4 Vocal-Tract Filters and Their Estimates

In the previous chapter, the general concepts and formulae for calculating glottal impedances, lip radiation impedances, vocal-tract driving point impedances, and transfer functions of VTFs and GVTFs were developed. This chapter calculates transfer functions of VTFs and VTF estimates to provide quantitative knowledge about them. The calculated VTFs and GVTFs do not contain the noise effects found in real speech signals, and thus reveal clear features of VTFs and GVTFs. If a glottis is never completely closed during phonation, the transfer function of a VTF estimate is actually equal to that of a GVTF estimate. Thus, the features of GVTFs are the features of VTF estimates containing the effects of incomplete glottal closures.
The procedure for calculating VTF and GVTF transfer functions for a given VTAF and glottal area is shown in Fig. 4.1. The VTAFs measured from an unknown adult male subject's magnetic resonance image [Story and Titze, 1996] are used in the VTF and GVTF calculations. The sectional length of the VTAFs is $L_{VT}/M = 0.396825$ cm for all vowel sounds. Thus, the sampling rate for the VTF and GVTF responses is $F_s = 0.5Mc/L_{VT} = 44.1$ kHz, and the observable frequency range of the VTFs or GVTFs is up to $F_s/2 = 22.05$ kHz.

4.1 Calculating the Parameters of the Lip Reflection Coefficient

Given a VTAF, the lip-opening area $S_M$ and radius $a_M$ are known, and $r_{lip}(f = F_s/2)$ is calculated using Eq. (3.38). Then, $\alpha$ and $\beta$ of $r_{lip}(z)$ are calculated using Eqs. (3.44) and (3.46), respectively. The calculated $\alpha$ and $\beta$ for /a/, /i/, /u/, /e/ and /o/ are summarized in Table 4.1.

Fig. 4.1. The block diagram of the calculation of GVTFs or VTFs. The diagram comprises the following steps: the VTAF $(S_1, S_2, \ldots, S_M)$ from MRI gives the lip-opening area $S_M$, the reflection coefficients $r_1, \ldots, r_{M-1}$ (Eq. (3.23)) and $Z_{VT}(f)$ (Eq. (3.62)); the lip-opening area gives $r_{lip}(f)$ (Eq. (3.38)) and $r_{lip}(z)$ (Eqs. (3.43)-(3.46)); the glottal area $A_g$ gives $Z_g(f)$ (Eq. (3.11)), $r_g(f)$ (Eq. (3.19)) and $r_g(z)$ (Eqs. (3.32)-(3.36)); these yield $H_{GVTF}(z)$ (Eq. (3.59)) or $H_{VTF}(z)$ (Eq. (3.60)), from which the frequency response, formant frequencies and poles of GVTFs or VTFs are obtained.

From Table 4.1, one can see that as the lip-opening area $S_M$ gets larger, the parameter $\alpha$ in $r_{lip}(z)$ becomes closer to -1, i.e., the pole of $r_{lip}(z)$ becomes closer to 1. The parameter $\beta$ does not change as much across lip openings as the parameter $\alpha$ does. This gives a clue for choosing the initial values when solving the non-linear equations in $\alpha, \beta, r_1, \ldots, r_{M-1}$ using Newton's method in Chapter 6.

The frequency responses of $r_{lip}(z)$ corresponding to different lip-opening areas for vowel sounds /a/, /i/, /u/, /e/ and /o/ are plotted in Figs. 4.2-4.6 using solid lines.
The frequency responses of $r_{lip}(f)$, in which the lip radiation impedance is $(0.5 + f/F_s)Z_p$, are presented in the corresponding figures using broken lines. $r_{lip}(z)$ approximates $r_{lip}(f)$ well at most frequencies.

Table 4.1. Parameters of $r_{lip}(z)$

Vowel   S_M (cm^2)   r_lip(f=Fs/2)   alpha     beta
/a/     5.03         0.08            -0.5225   0.5939
/i/     1.58         0.12            -0.2824   0.6468
/u/     0.86         0.25            -0.1374   0.5042
/e/     1.60         0.12            -0.2853   0.6450
/o/     0.14         0.65             0.3055   0.4861

Fig. 4.2. $r_{lip}(f)$ (broken line) and $r_{lip}(z)$ (solid line) of an adult lip opening for /a/. Panels: the frequency responses and the phases of $r_{lip}(f)$ and $r_{lip}(z)$ versus frequency (kHz).

Fig. 4.3. $r_{lip}(f)$ (broken line) and $r_{lip}(z)$ (solid line) of an adult lip opening for /i/.

Fig. 4.5. $r_{lip}(f)$ (broken line) and $r_{lip}(z)$ (solid line) of an adult lip opening for /e/.

Fig. 4.6. $r_{lip}(f)$ (broken line) and $r_{lip}(z)$ (solid line) of an adult lip opening for /o/.

4.2 Calculating VTF Frequency Responses

From a given VTAF, the reflection coefficients $r_1, r_2, \ldots, r_{M-1}$ at the $M-1$ boundaries of the tube model are calculated using Eq. (3.23). Then, by substituting the reflection coefficients and the parameters of $r_{lip}(z)$ into Eq. (3.60), the coefficients of $H_{VTF}(z)$ are obtained. The VTAF, the frequency response of $H_{VTF}(z)$, the frequency response of the numerator of $H_{VTF}(z)$, and the poles of $H_{VTF}(z)$ for /a/, /i/, /u/, /e/ and /o/ are shown in plots (a), (b), (c) and (d) of Figs. 4.7-4.11, respectively. The frequencies of the poles are the formant frequencies of each VTF.
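Reading a formant frequency from a pole of $H_{VTF}(z)$ is a one-line conversion from the pole angle. The sketch below shows it, together with the usual all-pole approximation of the 3-dB bandwidth from the pole radius. The bandwidth formula is a standard approximation supplied here for illustration and is not taken from the thesis. The pole value and the function name are hypothetical.

```python
import cmath
import math


def pole_to_formant(pole, fs_hz):
    """Return (frequency_hz, bandwidth_hz) for one complex pole.

    frequency : pole angle mapped to Hz
    bandwidth : -ln|p| * Fs / pi, the common narrow-band approximation
    """
    frequency = abs(cmath.phase(pole)) * fs_hz / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs_hz / math.pi
    return frequency, bandwidth


if __name__ == "__main__":
    fs = 44100.0
    # Hypothetical pole placed at 800 Hz with radius 0.98.
    pole = 0.98 * cmath.exp(2j * math.pi * 800.0 / fs)
    print(pole_to_formant(pole, fs))
```

A pole radius closer to 1 gives a narrower bandwidth, which is how the pole plots relate to the sharpness of the formant peaks.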
Fig. 4.7. The VTAF, the frequency response of $H_{VTF}(z)$, the frequency response of the numerator of $H_{VTF}(z)$, and the poles of $H_{VTF}(z)$ for /a/ by a male subject. Panels: (a) VTAF versus distance from the glottis (cm); (b) the frequency response of $H_{VTF}(z)$ (kHz); (c) the frequency response of the numerator of $H_{VTF}(z)$ (kHz); (d) the poles of $H_{VTF}(z)$ (kHz).

Fig. 4.8. The VTAF, the frequency response of $H_{VTF}(z)$, the frequency response of the numerator of $H_{VTF}(z)$, and the poles of $H_{VTF}(z)$ for /i/ by a male subject.

Fig. 4.9. The VTAF, the frequency response of $H_{VTF}(z)$, the frequency response of the numerator of $H_{VTF}(z)$, and the poles of $H_{VTF}(z)$ for /u/ by a male subject.

Fig. 4.10. The VTAF, the frequency response of $H_{VTF}(z)$, the frequency response of the numerator of $H_{VTF}(z)$, and the poles of $H_{VTF}(z)$ for /e/ by a male subject.

Fig. 4.11.
The VTAF, the frequency response of $H_{VTF}(z)$, the frequency response of the numerator of $H_{VTF}(z)$, and the poles of $H_{VTF}(z)$ for /o/ by a male subject.

4.3 Features of VTFs

From the results obtained above, the following features of VTFs are observed:
1. A VTF is stable, i.e., the norms of its poles are not greater than 1;
2. The formant bandwidths of a VTF increase (or the norms of its poles decrease) as frequency and the lip-opening area increase. This is because the lip radiation resistance increases with the product of frequency and the lip-opening area (see Section 3.4), and thus its damping effect on the VTF resonance increases with frequency and with the lip-opening area;
3. The filtering effect of the numerator $1 + \mu + (\alpha + \beta\mu)z^{-1}$ of $H_{VTF}(z)$ is not significant.

4.4 Calculating the Glottal Impedance

To gain quantitative knowledge about the time-varying glottal impedance, we calculate it using Eq. (3.11):

$$Z_g(t) = -\rho H\frac{A_g'(t)}{A_g(t)^2} + \frac{12\,\mu H l^2}{A_g(t)^3} + \frac{0.875\,\rho\,u_g(t)}{A_g(t)^2} + j\omega\frac{\rho H}{A_g(t)}$$

The volume velocity $U_g$ is approximated from the subglottal pressure $P_s$ following [Flanagan, 1972]. The glottal resistance is then calculated, with this approximation substituted for $U_g$, as:

$$R_g = \frac{12\,\mu H l^2}{A_g^3} + \frac{0.875\,\rho\,U_g}{A_g^2} - \rho H\frac{A_g'}{A_g^2} = R_v + R_k + R_c$$

where $R_v$ denotes the glottal viscous resistance, $R_k$ denotes the glottal kinetic resistance, and $R_c$ denotes the glottal closing resistance caused by the time-varying glottal inductance.

Assume that the glottal area varies from 4 to 20 mm$^2$, the fundamental frequency of a vowel sound is 200 Hz, the pitch period is $T_0 = 1/200$ s, the opening glottal phase lasts $0.4T_0$, the closing glottal phase lasts $0.1T_0$, and the closed glottal phase lasts $0.5T_0$. The time function of the glottal area is plotted in Fig. 4.12 (a), where the sampling rate is 44.1 kHz. The following values are used [Flanagan, 1972]: $\mu = 1.86 \times 10^{-5}$ N·s/m$^2$ (air viscosity coefficient), $\rho = 1.14$ kg/m$^3$ (the air density in the mouth), $H = 3$ mm (the depth of the glottis), $l = 18$ mm (the length of the glottis), and a typical subglottal pressure of $P_s = 10$ cm H$_2$O $= 980$ N/m$^2$. In Fig.
4.12 (b) are the calculated time functions of the glottal viscous resistance $R_v$ (dashed line), the kinetic resistance $R_k$ (dot-dashed line), and the glottal closing resistance $R_c$ (solid line). Fig. 4.12 (c) shows the glottal reactance $X_g$ corresponding to a 4 mm$^2$ opening. Fig. 4.12 (b) illustrates that, due to the glottal closing resistance $R_c$ (solid line), the total glottal resistance (dotted line) immediately before the glottal closure instant is larger than that corresponding to the incomplete glottal closure.

Fig. 4.12. (a) The time-varying glottal area, (b) the time-varying glottal resistances, and (c) the glottal reactance over the closed glottal phase.

4.5 Calculating the Glottal Reflection Coefficient over Closed Glottal Phases

Given the glottal resistance and reactance corresponding to an incomplete glottal closure, the frequency-dependent glottal reflection coefficient $r_g$ can be calculated using Eq. (3.19). The $\gamma$ and $\theta$ parameters of $r_g(z)$ are calculated using Eqs. (3.35) and (3.36). For glottal areas of $A_g = 1$ and $A_g = 2$ mm$^2$, the glottal resistance and reactance are shown in plots (a) of Figs. 4.13-4.22, and the frequency responses of $r_g$ and of its first-order FIR model $r_g(z) = \gamma + \theta z^{-1}$ are plotted using solid lines and broken lines, respectively, in plots (b) of Figs. 4.13-4.22. For $A_g = 1$ mm$^2$, $r_g(z)$ is very close to $r_g$. For larger $A_g$, $r_g$ can be approximated better using a higher-order filter in the discrete-time domain, at a higher computational cost.

4.6 Calculating the Driving-Point Impedance

Given the lip-opening area, the lip radiation impedance is calculated from Eq. (3.17). Then, for the given VTAF, the vocal-tract driving-point impedance $Z_{VT}$ is calculated iteratively using Eqs. (3.61) and (3.62). For vowels /a/, /i/, /u/, /e/ and /o/, the frequency-dependent $Z_{VT}$ are plotted in (c) of Figs. 4.13-4.22.
From these $Z_{VT}$ frequency responses, it can be seen that $Z_{VT}$ resonates at the VTF formant frequencies, and that it resonates more strongly if the lip opening is smaller.

4.7 Calculating the Difference Between a VTF and Its Estimate

VTF estimates are obtained over closed glottal phases. If a glottis is never completely closed, the VTF estimate contains the effect of the incomplete glottal closure, and is actually equal to a GVTF estimate. As shown in Eq. (3.5), the ratio of the transfer function of the GVTF to that of the VTF is $1/(1 + Z_{VT}/(Z_g + Z_T))$. The trachea input impedance $Z_T$, which is of the order of 50 cgs acoustic ohms [Ishizaka, 1976], can be omitted if the glottal area is less than 3 mm$^2$. The frequency responses of $1/(1 + Z_{VT}/Z_g)$ corresponding to glottal areas of $A_g = 1$ and 2 mm$^2$ and to different VTAFs are shown using broken lines in plots (d) of Figs. 4.13-4.22.

4.8 Calculating Transfer Functions of VTF Estimates

Assume the glottis is incompletely closed over closed glottal phases. Given a constant glottal area and the VTAF of a vowel, the transfer function of a VTF estimate can be calculated by substituting $r_g(z)$, $r_{lip}(z)$ and $r_m$ into Eq. (3.59). The frequency responses of the VTF estimates corresponding to $A_g = 1$ and 2 mm$^2$ and to the VTAFs of /a/, /i/, /u/, /e/ and /o/ are plotted in (e) of Figs. 4.13-4.22.

4.9 Features of VTF Estimates Corresponding to Incomplete Glottal Closures

From the results shown in Figs.
4.13 to 4.22, we find that:

1) $Z_{VT}$ (the vocal-tract driving-point impedance) resonates at the formant frequencies of the VTF, with greater magnitude if the lip-opening area is smaller;

2) For the same incomplete glottal closure, the difference between a VTF estimate and the VTF is greater for a vowel with a smaller lip-opening area than for a vowel with a larger lip-opening area; for example, given the same glottal area, the difference between the response of the VTF estimate and that of the VTF is greater for the small-lip-opening vowel /o/ than for the larger-lip-opening vowels /a/ and /e/;

3) The formant bandwidths of a VTF estimate are wider than those of the corresponding VTF, especially under 5-6 kHz;

4) The formant frequencies of a VTF estimate containing the effect of an incomplete glottal closure are always slightly higher than those of the VTF, as shown in Table 4.2. This is proved in Appendix B. The effect of the glottal impedance in increasing the formant frequencies of the VTF estimate is also illustrated using a uniform tube with 5 cm$^2$ cross-sectional area and 17 cm length [Flanagan, 1972].

Table 4.2. Formant frequencies of VTFs and their estimates corresponding to $A_g=1$ mm$^2$.

Vowel   Transfer function   F1 (kHz)   F2 (kHz)   F3 (kHz)   F4 (kHz)
/a/     VTF                 0.7858     1.1531     2.8149     3.4089
/a/     VTF estimate        0.7865     1.1536     2.8163     3.4227
/i/     VTF                 0.2235     2.4856     3.3759     3.9024
/i/     VTF estimate        0.2235     2.4877     3.3761     3.9135
/u/     VTF                 0.2555     1.1456     2.4466     3.6987
/u/     VTF estimate        0.2555     1.1456     2.4481     3.7071
/e/     VTF                 0.6298     2.0265     2.6148     3.5124
/e/     VTF estimate        0.6300     2.0313     2.6252     3.5341
/o/     VTF                 0.3773     0.8626     2.4877     3.7504
/o/     VTF estimate        0.3774     0.8627     2.4903     3.7585

[Figures 4.13-4.21: each figure has five panels plotted against frequency (0-20 kHz), titled: (a) glottal resistance (solid) and reactance (dash-dotted); (b) the glottal reflection coefficient (solid) and its FIR model (dash-dotted); (c) the frequency responses of $Z_{VT}$; (d) the frequency responses of the VTF (solid) and of $1/(1+Z_{VT}/Z_g)$ (dotted); (e) the frequency response of the VTF estimate.]

Fig. 4.13. (a) The glottal impedance for $A_g=1$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /a/.

Fig. 4.14. (a) The glottal impedance for $A_g=1$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /i/.

Fig. 4.15. (a) The glottal impedance for $A_g=1$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /u/.

Fig. 4.16. (a) The glottal impedance for $A_g=1$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /e/.

Fig. 4.17. (a) The glottal impedance for $A_g=1$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /o/.

Fig. 4.18. (a) The glottal impedance for $A_g=2$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /a/.

Fig. 4.19. (a) The glottal impedance for $A_g=2$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /i/.

Fig. 4.20. (a) The glottal impedance for $A_g=2$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /u/.

Fig. 4.21. (a) The glottal impedance for $A_g=2$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /e/.
[Figure 4.22: five panels plotted against frequency (0-20 kHz), titled: (a) glottal resistance (solid) and reactance (dash-dotted); (b) the glottal reflection coefficient (solid) and its FIR model (dash-dotted); (c) the frequency responses of $Z_{VT}$; (d) the frequency responses of the VTF (solid) and of $1/(1+Z_{VT}/Z_g)$ (dotted); (e) the frequency response of the VTF estimate.]

Fig. 4.22. (a) The glottal impedance for $A_g=2$ mm$^2$ and $l_g=18$ mm, (b) the frequency responses of $r_g$ (solid line) and its model (dotted line), (c) the vocal-tract driving-point impedance $Z_{VT}$, (d) the frequency responses of the VTF (solid line) and of $1/(1+Z_{VT}/Z_g)$ (dotted line), and (e) the frequency response of the GVTF for /o/.

Note: In this study, the vocal tract is modeled as a lossless tube. This is justified because the viscous resistance in the vocal tract is much smaller than the reactance of the tube. According to [Flanagan, 1972], the viscous resistance per unit length of a tube is:

$$R_a = \frac{2\pi a\sqrt{0.5\,\omega\rho\mu}}{A^2} \tag{4.1}$$

where $a$ is the radius and $A$ is the cross-sectional area of a section of the vocal tract, $\omega$ is the angular frequency of the sound, $\rho$ is the air density, and $\mu$ is the viscosity coefficient of the air in the mouth. On the other hand, the reactance per unit length of the vocal tract is:

$$X_a = \omega L_a = \omega\rho/A \tag{4.2}$$

Thus:

$$\frac{R_a}{X_a} = \frac{2\pi a}{A}\sqrt{\frac{\mu}{2\omega\rho}} \tag{4.3}$$

Given a cross-sectional area of 0.1 cm$^2$ (the minimum cross-sectional area of the VTAF for /i/ produced by an adult male), $\rho=1.14$ kg/m$^3$, $\mu=1.86\times10^{-5}$ N·s/m$^2$, and frequency $f=100$ Hz, the ratio of the viscous resistance to the inductive reactance is 0.129. The viscous resistance is thus much smaller than the reactance of the air in the tube, and its influence on the transfer function of the VTF can be neglected.

4.10 Summary

From VTAFs of different vowels and different glottal areas, we observed the features in the frequency responses of the lip reflection coefficients, the glottal impedances, the vocal-tract driving-point impedances, the ideal VTFs, and the VTF estimates containing the effects of incomplete glottal closures. The glottal impedances increase the formant frequencies and bandwidths of the VTF estimate. The vocal-tract driving-point impedances become large at the VTF formant frequencies. Given the same glottal area, the VTF estimate differs more from the corresponding ideal VTF if the vowel corresponds to a smaller lip opening. The parameter a of a lip reflection coefficient is in the range of -0.53 to 0.35, and the parameter p of a lip reflection coefficient is in the range of 0.48 to 0.65.

5 A New Method for Estimating Vocal-Tract Filters and Glottal Waves from Vowel Sounds

5.1 Introduction

As mentioned in Chapter 2, previous methods for estimating vocal-tract filters and glottal waves from vowel sounds imposed either the invalid assumption that glottal waves are zero over closed glottal intervals, or parametric models for glottal waves. The results were biased vocal-tract-filter estimates and glottal-wave estimates lacking information over closed glottal intervals. In this chapter, we develop a new method for obtaining unbiased VTF estimates from vowel sounds without using those assumptions about glottal waves. Moreover, a new method for detecting glottal phases from vowel sounds is developed. The glottal wave estimate is then obtained by filtering the sound signal using the inverse filter of the VTF estimate. The validation of the estimated glottal phases, and the estimates of the VTFs and the glottal waves obtained using our methods, are presented in Chapter 7.

Accurate VTF estimates should contain as little as possible of the effects of open glottises and of glottal waves. To overcome the difficulty of not knowing the glottal-wave signals, this study proposes to estimate VTFs from sustained vowel sounds. Since the pitch, the loudness, and the vocal tract are kept unchanged for such a sound, the glottal wave can be characterized as a periodically stationary random process in the VTF estimation.
The periodic components of the glottal wave can be eliminated from the VTF estimation. It is shown that unbiased VTF estimates can be obtained from sustained non-nasalized vowel-sound signals. The glottal waves can be obtained by inverse filtering the vowel sounds. The main contribution of this chapter is to transform the ill-defined inverse problem into an over-determined parameter-estimation problem, based on the knowledge about the acoustic source-tract system. In the following, we start from the relationship between the VTF and the speech signal.

5.2 Transfer Functions for Producing a Vowel Sound

The sound pressure at a microphone placed at a distance from the lips is related to the lip volume velocity by a derivative factor [Flanagan, 1972; Deng, PACRIM 2003]:

$$P_{mic}(z)=\frac{\rho}{4\pi r}\,z^{-rF_s/c}\,(1-z^{-1})\,U_{lip}(z) \tag{5.1}$$

where $\rho$ is the air density, $r$ is the distance from the lips to the microphone, $F_s$ is the sampling rate of the speech sound, and $c$ is the sound speed. According to Eqs. (3.60) and (5.1), we relate the sound pressure to the glottal wave signal by:

$$P_{mic}(z)=\frac{\rho}{4\pi r}\,z^{-rF_s/c}\,(1-z^{-1})\,U_g(z)\,H_{VTF}(z)=\frac{K_1\,[1+\mu+(\alpha+\beta\mu)z^{-1}]\,z^{-M/2-\Delta}\,(1-z^{-1})\,U_g(z)}{1-\sum_{m=1}^{M+1}a_m z^{-m}} \tag{5.2}$$

where $K_1=\rho\prod_{m=1}^{M-1}(1+r_m)/(4\pi r)>0$, $\Delta=rF_s/c$, and $-a_m$ is the coefficient of $z^{-m}$, $m=1,\ldots,M+1$, in the denominator of $H_{VTF}(z)$. The numerator of Eq. (5.2) can be viewed as the delayed and filtered derivative of the glottal wave.

The speech sound signal can also be represented using the glottal source signal. Combining Eqs. (3.59) and (5.1), we get the vowel-sound signal at the microphone:

$$P_{mic}(z)=\frac{\rho}{4\pi r}\,z^{-rF_s/c}\,(1-z^{-1})\,U_{sc}(z)\,H_{GVTF}(z)=\frac{0.5K_1\,(1+\gamma+\theta z^{-1})\,[1+\mu+(\alpha+\beta\mu)z^{-1}]\,z^{-M/2-\Delta}\,(1-z^{-1})\,U_{sc}(z)}{1-\sum_{m=1}^{M+2}b_m z^{-m}} \tag{5.3}$$

The numerator of Eq. (5.3) can be viewed as the delayed and filtered derivative of the glottal source. In the following, we develop a new method for estimating glottal phases using the derivative of the glottal-source signal obtained from the speech signal $p_{mic}(n)$ according to Eq. (5.3).

5.3 Detecting Glottal Phases from Vowel Sounds

Glottal phases are usually detected using electroglottographic (EGG) signals or high-speed cameras. However, in many situations, these are not available. In this section, we present a method for detecting the glottal phases from vowel sounds.

As shown in Eq. (3.1), when the glottis is open, $u_{sc}(t)$ is a linear function of the time-varying glottal area $A_g$. Therefore, the opening glottal phases (when the glottal area is increasing) can be detected as the intervals where the time derivative of $u_{sc}(t)$ remains positive; the closing glottal phases (when the glottal area is decreasing) can be detected as the intervals where the derivative of $u_{sc}(t)$ remains negative. The glottal closure instants can be detected as the instants where the time derivative of $u_{sc}(t)$ reaches its maximum negative peaks. Within a glottal cycle, the interval after the glottal closure instant and before the glottal opening is identified as the closed glottal phase. In the following, we estimate the time derivative of $u_{sc}(n)$ from the speech signal.

Let the time-domain equivalent of $0.5K_1(1+\gamma+\theta z^{-1})[1+\mu+(\alpha+\beta\mu)z^{-1}]\,z^{-M/2-\Delta}(1-z^{-1})U_{sc}(z)$ in Eq. (5.3) be $u'_{scf}(n-M/2-\Delta)$, which is the derivative of the glottal source signal delayed by $M/2+\Delta$ and filtered by $(1+\gamma+\theta z^{-1})[1+\mu+(\alpha+\beta\mu)z^{-1}]$. Then, the time-domain speech signal can be represented according to Eq. (5.3):

$$p_{mic}(n)=u'_{scf}(n-M/2-\Delta)+b_1p_{mic}(n-1)+\cdots+b_{M+2}\,p_{mic}(n-M-2) \tag{5.4}$$

Let us obtain the $u'_{scf}(n-M/2-\Delta)$ signal by inverse filtering the speech signal using the time-averaged version of the GVTF. Let $\bar{b}_1,\ldots,\bar{b}_{M+2}$ be the time-averaged versions of $b_1,\ldots,b_{M+2}$. Let the $I$th glottal cycle correspond to the time interval $[n_I, n_I+1, \ldots, n_I+T_I-1]$, where $T_I$ is the number of samples in the pitch period. From Eq. (5.4), $\bar{b}_1,\ldots,\bar{b}_{M+2}$ should satisfy:

$$\underbrace{\begin{bmatrix}u'_{scf}(n_I-M/2-\Delta)\\ \vdots\\ u'_{scf}(n_I-M/2-\Delta+T_I-1)\end{bmatrix}}_{U'_{scfI}}+\underbrace{\begin{bmatrix}p_{mic}(n_I-1)&\cdots&p_{mic}(n_I-M-2)\\ \vdots& &\vdots\\ p_{mic}(n_I+T_I-2)&\cdots&p_{mic}(n_I+T_I-3-M)\end{bmatrix}}_{P_I}\underbrace{\begin{bmatrix}\bar{b}_1\\ \vdots\\ \bar{b}_{M+2}\end{bmatrix}}_{\bar{B}}=\underbrace{\begin{bmatrix}p_{mic}(n_I)\\ \vdots\\ p_{mic}(n_I+T_I-1)\end{bmatrix}}_{p_I}+E_I \tag{5.5a}$$

or:

$$U'_{scfI}+P_I\bar{B}=p_I+E_I \tag{5.5b}$$

where $U'_{scfI}$ and $p_I$ are $T_I\times1$ vectors, $P_I$ is a $T_I\times(M+2)$ matrix, and $\bar{B}$ is an $(M+2)\times1$ vector.
In (5.5b), $\bar{B}$ and $U'_{scfI}$ are unknowns, $p_I$ is the measurement in the interval $[n_I, n_I+1, \ldots, n_I+T_I-1]$, $P_I$ is the measurement in the interval $[n_I-M-2, n_I-M-1, \ldots, n_I+T_I-2]$, and $E_I$ is the approximation error vector resulting from representing the time-varying GVTF by its averaged coefficients $\bar{b}_m$, i.e., $E_I=P_I(\bar{B}-B)$, where $B$ is the actual time-varying $(M+2)\times1$ vector. Eq. (5.5b) represents an under-determined system of linear equations, since the number of unknowns is larger than the number of equations, i.e., $M+2+T_I>T_I$.

To overcome this difficulty, we use a "sustained" vowel sound since, for such a sound, the glottal source and the vocal-tract shape remain almost unchanged. We use $U'_{scfI}$ to represent the glottal source signal in the adjacent $J$th cycle. Denote the starting point of the $J$th cycle as $n_J$, which has the same relative position in the $J$th cycle as $n_I$ has in the $I$th cycle. Then, more equations about $U'_{scfI}$ and $\bar{B}$ are obtained:

$$\underbrace{\begin{bmatrix}u'_{scf}(n_I-M/2-\Delta)\\ \vdots\\ u'_{scf}(n_I-M/2-\Delta+T_I-1)\end{bmatrix}}_{U'_{scfI}}+\underbrace{\begin{bmatrix}p_{mic}(n_J-1)&\cdots&p_{mic}(n_J-M-2)\\ \vdots& &\vdots\\ p_{mic}(n_J+T_I-2)&\cdots&p_{mic}(n_J+T_I-M-3)\end{bmatrix}}_{P_J}\bar{B}=\underbrace{\begin{bmatrix}p_{mic}(n_J)\\ \vdots\\ p_{mic}(n_J+T_I-1)\end{bmatrix}}_{p_J}+E_J \tag{5.6a}$$

or:

$$U'_{scfI}+P_J\bar{B}=p_J+E_J \tag{5.6b}$$

where $p_J$ is a $T_I\times1$ vector consisting of $T_I$ samples of $p_{mic}(n)$ in the interval $[n_J, n_J+1, \ldots, n_J+T_I-1]$, $P_J$ is a $T_I\times(M+2)$ matrix consisting of samples of $p_{mic}(n)$ in the interval $[n_J-M-2, n_J-M-1, \ldots, n_J+T_I-2]$, and $E_J=P_J(\bar{B}-B)+(U'_{scfI}-U'_{scfJ})$ is the approximation error. $P_J$ and $p_J$ are different measurements from $P_I$ and $p_I$, since the speech signal $p_{mic}(n)$ is not exactly periodic due to the randomness of the turbulence noise contained in the glottal wave. Subtracting (5.5b) from (5.6b), we get the following equation:

$$\underbrace{(P_J-P_I)}_{X}\,\bar{B}=\underbrace{(p_J-p_I)}_{Y}+\underbrace{(E_J-E_I)}_{E} \tag{5.7}$$

where $X$ is a $T_I\times(M+2)$ matrix, $Y$ is a $T_I\times1$ vector, and $E$ is a $T_I\times1$ error vector to be minimized. $\bar{B}$ can be obtained by taking the least-squares error solution of Eq. (5.7). It represents an over-determined system of linear equations, since the number of equations is larger than the number of unknowns, i.e., $T_I>M+2$. Also, the column vectors in $X$ are linearly independent.
Then, the least-squares error estimate of $\bar{B}$ is [Hogben, 1987]:

$$\hat{B}=(X^TX)^{-1}X^TY \tag{5.8}$$

The all-pole filter $1/\bigl(1-\sum_{m=1}^{M+2}\hat{B}(m)z^{-m}\bigr)$ obtained from Eq. (5.8) may be unstable. This problem can be solved by shifting $n_I$ and $n_J$ (the selected starting points of the two cycles) by the same number of samples, and using different measurements to construct $X$ and $Y$ in Eqs. (5.7)-(5.8).

The $u'_{scf}(n-M/2-\Delta)$ estimate is then obtained by inverse filtering the speech signal $p_{mic}(n)$ using the filter $1-\sum_{m=1}^{M+2}\hat{B}(m)z^{-m}$. Simulations in Chapter 4 show that the filtering effects of $(1+\gamma+\theta z^{-1})$ and $[1+\mu+(\alpha+\beta\mu)z^{-1}]$ are not significant. Therefore, the $u'_{scf}(n-M/2-\Delta)$ estimate can be viewed as the delayed derivative of the glottal source signal, $u'_{sc}(n-M/2-\Delta)$. Using this signal, we detect the glottal phases, as described at the beginning of this section. It should be noted that, since the actual GVTF is time-varying, the $u'_{sc}(n-M/2-\Delta)$ signal cannot be "accurately" recovered from the speech signal using a time-invariant filter. Nevertheless, as validated using EGG signals (see Chapter 7), the glottal phases detected using the $u'_{scf}(n-M/2-\Delta)$ estimates are correct. After the glottal phases are detected, we estimate the VTF parameters using the speech signal sub-segments produced over closed glottal phases.

5.4 Estimating the Vocal-Tract Filter

As mentioned in Chapter 2, existing methods for obtaining VTF estimates from vowel sound signals involve simplifying assumptions, such as that the unknown glottal wave is zero over closed glottal phases, or that it can be described using a parametric model, or that it is smooth. In this section, we obtain unbiased VTF estimates without using these simplifications.

Let $P(z)=z^{\Delta}P_{mic}(z)$. Then, from Eq. (5.2),

$$P(z)=\frac{K_1\,[1+\mu+(\alpha+\beta\mu)z^{-1}]\,z^{-M/2}\,(1-z^{-1})\,U_g(z)}{1-\sum_{m=1}^{M+1}a_m z^{-m}} \tag{5.9}$$

The time-domain equivalent of $P(z)$ is $p(n)=p_{mic}(n+\Delta)$. For convenience, it is referred to as the sound pressure at the lips. In Eq. (5.9), let $u'_{gf}(n-M/2)$ be the time-domain equivalent of $K_1[1+\mu+(\alpha+\beta\mu)z^{-1}]\,z^{-M/2}(1-z^{-1})U_g(z)$. Then, from Eq. (5.9), $p(n)$ can be expressed as:

$$p(n)=u'_{gf}(n-M/2)+a_1p(n-1)+\cdots+a_{M+1}\,p(n-M-1) \tag{5.10}$$

In estimating the coefficients $a_m$ of the VTF, only the $p(n)$ samples that do not contain the influence of the open glottis, i.e., those corresponding to closed glottal phases, should be used. Let these samples be in the interval $[n_{ci}-M-1, \ldots, n_{ci}, \ldots, n_{ci}+L-1]$ within the $i$th cycle. Then, according to Eq. (5.10), the following relationship must be satisfied:

$$\underbrace{\begin{bmatrix}u'_{gf}(n_{ci}-M/2)\\ \vdots\\ u'_{gf}(n_{ci}+L-1-M/2)\end{bmatrix}}_{U'_{gfci}}+\underbrace{\begin{bmatrix}p(n_{ci}-1)&\cdots&p(n_{ci}-M-1)\\ \vdots& &\vdots\\ p(n_{ci}+L-2)&\cdots&p(n_{ci}+L-2-M)\end{bmatrix}}_{P_{ci}}\underbrace{\begin{bmatrix}a_1\\ \vdots\\ a_{M+1}\end{bmatrix}}_{A}=\underbrace{\begin{bmatrix}p(n_{ci})\\ \vdots\\ p(n_{ci}+L-1)\end{bmatrix}}_{p_{ci}} \tag{5.11a}$$

or:

$$U'_{gfci}+P_{ci}A=p_{ci} \tag{5.11b}$$

where $p_{ci}$ consists of samples of $p(n)$ in the interval $[n_{ci}, \ldots, n_{ci}+L-1]$, and $P_{ci}$ consists of samples of $p(n)$ in the interval $[n_{ci}-M-1, \ldots, n_{ci}+L-2]$. In Eq. (5.11), $U'_{gfci}$ and $A$ are the unknowns, whereas $p_{ci}$ and $P_{ci}$ are measurements. Eq. (5.11) is an under-determined system of linear equations, in which the number of unknowns is larger than the number of equations, i.e., $M+1+L>L$. To solve for $a_m$, $m=1,\ldots,M+1$, more equations are needed.

For a sustained vowel sound, the glottal wave is nearly periodic. Thus, the glottal waves corresponding to two intervals that have the same relative positions in two periods are nearly the same. Let the interval $[n_{cj}-M-1, \ldots, n_{cj}, \ldots, n_{cj}+L-1]$ in the $j$th cycle have the same relative position as $[n_{ci}-M-1, \ldots, n_{ci}, \ldots, n_{ci}+L-1]$ in the $i$th cycle. Then, more equations about $U'_{gfci}$ and $A$ are obtained:

$$U'_{gfci}+\underbrace{\begin{bmatrix}p(n_{cj}-1)&\cdots&p(n_{cj}-M-1)\\ \vdots& &\vdots\\ p(n_{cj}+L-2)&\cdots&p(n_{cj}+L-2-M)\end{bmatrix}}_{P_{cj}}A=\underbrace{\begin{bmatrix}p(n_{cj})\\ \vdots\\ p(n_{cj}+L-1)\end{bmatrix}}_{p_{cj}}+\varepsilon \tag{5.12a}$$

or:

$$U'_{gfci}+P_{cj}A=p_{cj}+\varepsilon \tag{5.12b}$$

where $p_{cj}$ consists of samples of $p(n)$ in the interval $[n_{cj}, \ldots, n_{cj}+L-1]$, $P_{cj}$ consists of samples of $p(n)$ in the interval $[n_{cj}-M-1, \ldots, n_{cj}+L-2]$, and $\varepsilon$ is the difference between the actual derivatives of the glottal wave corresponding to the two sub-segments in the $i$th and the $j$th cycles. Subtracting Eq. (5.11b) from Eq. (5.12b), we get:

$$\underbrace{(P_{cj}-P_{ci})}_{Q}\,A=\underbrace{(p_{cj}-p_{ci})}_{S}+\underbrace{\varepsilon}_{e} \tag{5.13}$$

where $Q$ is an $L\times(M+1)$ matrix, and $S$ and $e$ are $L\times1$ vectors.
Now, the influence of the periodic components of the glottal wave is removed from the system of Eq. (5.13). The vector $e$ consists of the stochastic components originating from the randomness of the turbulence noise in the glottal wave. If $L>M+1$, then Eq. (5.13) is an over-determined system. A larger $L$ helps reduce the effect of noise on the estimate. The least-squares error solution [Hogben, 1987] of Eq. (5.13) is then taken as the estimate of $A$:

$$\hat{A}=(Q^TQ)^{-1}Q^TS \tag{5.14}$$

The estimate of the denominator of $H_{VTF}(z)$ is then:

$$\hat{H}(z)=1-\sum_{m=1}^{M+1}\hat{A}(m)z^{-m} \tag{5.15}$$

It is clear that the estimate given by Eq. (5.14) contains the influence of the unknown $e$ as:

$$A-\hat{A}=(Q^TQ)^{-1}Q^Te \tag{5.16}$$

It is shown (in Appendix C) that for a sustained vowel sound, the estimator given by (5.14) is unbiased. The accuracy of the VTF estimate can be improved by averaging many such estimates obtained from different cycles of the sound.

For some speakers, the duration of the closed glottal phase is very short, and the number of equations in Eq. (5.13) is less than the number of unknowns, i.e., $L<M+1$. In such a case, an additional $L$ equations are constructed from two other closed glottal phases, such as those in the $m$th and $n$th cycles:

$$\underbrace{(P_{cn}-P_{cm})}_{R}\,A=\underbrace{(p_{cn}-p_{cm})}_{V}+\varphi \tag{5.17}$$

The estimate of $A$ can then be obtained as the least-squares error solution of the over-determined equations formed by Eqs. (5.13) and (5.17):

$$\underbrace{\begin{bmatrix}Q\\ R\end{bmatrix}}_{\Omega}A=\underbrace{\begin{bmatrix}S\\ V\end{bmatrix}}_{\Phi}+\underbrace{\begin{bmatrix}e\\ \varphi\end{bmatrix}}_{\sigma} \tag{5.18}$$

and

$$\hat{A}=(\Omega^T\Omega)^{-1}\Omega^T\Phi \tag{5.19}$$

In the following section, we provide a method for finding the $p(n)$ signal sub-segments required in the VTF estimation from the speech signal $p_{mic}(n)$ recorded at a distance from the lips.

5.5 Locating the Signal Segments for the VTF Estimation

In estimating the parameters of the VTF, the $p_{mic}(n)$ signal recorded at a distance from the lips is used instead of $p(n)$. The $p_{mic}(n)$ samples that do not contain the influence of the open glottis must be used in estimating the VTF. This section shows how to determine the $p_{mic}(n)$ segments required.
First, identify the time interval when the glottis is closed using the estimate of the signal $u'_{scf}(n-M/2-\Delta)$ obtained in Section 5.3. Assume that within the $i$th cycle, at instant $n_{0i}$, the $u'_{scf}(n-M/2-\Delta)$ signal reaches its maximal negative peak, then returns to zero gradually or with fluctuations, and finally remains positive after crossing zero at instant $n_{0i}+N_c-1$. The signal $u'_{scf}(n)$ undergoes the same process during the interval $[n_{0i}-M/2-\Delta, \ldots, n_{0i}+N_c-1-M/2-\Delta]$ as does $u'_{scf}(n-M/2-\Delta)$ during the interval $[n_{0i}, n_{0i}+1, \ldots, n_{0i}+N_c-1]$. The interval $[n_{0i}-M/2-\Delta, \ldots, n_{0i}+N_c-1-M/2-\Delta]$ is identified as the closed glottal phase in the $i$th cycle, and is denoted as $[n_{close}, \ldots, n_{close}+N_c-1]$, where $n_{close}=n_{0i}-M/2-\Delta$ is the glottal closure instant, and $N_c$ is the duration of the closed glottal phase.

Second, locate the $p(n)$ samples that are produced over the closed glottal phase. According to Eq. (5.10), at the instant $n=n_{close}+3M/2+1$ (i.e., $3M/2+1$ sampling periods after the glottal closure instant), the sound pressure at the lips is $p(n_{close}+3M/2+1)=u'_{gf}(n_{close}+M+1)+a_1p(n_{close}+3M/2)+\cdots+a_{M+1}\,p(n_{close}+M/2)$, where the sound pressure signal $p(n_{close}+M/2)$ contains the glottal reflection at $n_{close}$ (recall Eq. (3.31); the time for the sound wave to travel from the glottis to the lips is $L_{VT}F_s/c=M/2$ sampling periods), and all other sound pressure signals contain the glottal reflections after the glottal closure instant. At the instant $n_{close}+N_c+M/2-1$, the sound pressure at the lips is $p(n_{close}+N_c+M/2-1)=u'_{gf}(n_{close}+N_c-1)+a_1p(n_{close}+N_c+M/2-2)+\cdots+a_{M+1}\,p(n_{close}+N_c+M/2-M-2)$, where the sound pressure signal $p(n_{close}+N_c+M/2-1)$ contains the glottal reflection at $n_{close}+N_c-1$ (i.e., the last instant of the closed glottal phase), and the other sound pressure signals contain the glottal reflections after the glottis closes and before the glottis opens.
Therefore, the $p(n)$ samples in the interval $[n_{close}+M/2, \ldots, n_{close}+3M/2+1, \ldots, n_{close}+N_c+M/2-1]$ do not contain the influence of the open glottis, and are used in constructing $(P_{ci}, p_{ci})$ in Eq. (5.11). Specifically, $p_{ci}=[p(n_{close}+3M/2+1), \ldots, p(n_{close}+N_c+M/2-1)]^T$, and the length of $p_{ci}$ is $L=N_c-M-1$.

Third, translate the above $p(n)$ samples into $p_{mic}(n)$ samples. Since $p(n-\Delta)=p_{mic}(n)$ and $n_{close}=n_{0i}-M/2-\Delta$, the $p(n)$ samples in the interval $[n_{close}+M/2, \ldots, n_{close}+3M/2+1, \ldots, n_{close}+N_c+M/2-1]$ are the $p_{mic}(n)$ samples in the interval $[n_{0i}, \ldots, n_{0i}+M+1, \ldots, n_{0i}+N_c-1]$. Specifically, $p_{ci}=[p_{mic}(n_{0i}+M+1), \ldots, p_{mic}(n_{0i}+N_c-1)]^T$. The $p_{mic}(n)$ samples needed to construct Eq. (5.11) are illustrated in Fig. 5.1. Similarly, $(P_{cj}, p_{cj})$ in the $j$th cycle required in Eq. (5.12) are taken from the interval $[n_{0j}, \ldots, n_{0j}+N_c-1]$, where $n_{0j}$ is the instant when the $u'_{scf}(n-M/2-\Delta)$ signal reaches its negative peak in its $j$th cycle. Having constructed $(P_{ci}, p_{ci})$ and $(P_{cj}, p_{cj})$, the all-pole parameters of the VTF can be solved from Eq. (5.14).

As shown above, the $p_{mic}(n)$ segments that do not contain the effect of the open glottis can be located according to the $u'_{scf}(n-M/2-\Delta)$ signal obtained in Section 5.3, without the need to know the distance from the lips to the microphone or to use other signals, such as EGG.

[Figure 5.1: time axis of $p_{mic}(n)$ with marks at $n_{0i}-1$, $n_{0i}$, $n_{0i}+M+1$, $n_{0i}+N_c-2$, and $n_{0i}+N_c-1$. The closing glottal phase precedes $n_{0i}$; the $N_c$ samples from $n_{0i}$ to $n_{0i}+N_c-1$ lie in the closed glottal phase. The $p_{mic}(n)$ samples for $P_{ci}$ span $L+M=N_c-1$ samples, and the $p_{mic}(n)$ samples for $p_{ci}$ span $L=N_c-M-1$ samples.]

Fig. 5.1. A $p_{mic}(n)$ segment that does not contain the effect of the open glottis.

5.6 Obtaining the Glottal Waveform

As shown in Eq. (5.2), the time-domain equivalent of the numerator can be obtained by filtering the speech signal using $\bigl(1-\sum_{m=1}^{M+1}\hat{a}_m z^{-m}\bigr)$. Since the filtering effect of $[1+\mu+(\alpha+\beta\mu)z^{-1}]$ is not significant, the result of the inverse filtering is viewed as the scaled and delayed derivative of the glottal wave signal $u'_g(n-M/2-\Delta)$.
The glottal waveform $u_g(n-M/2-\Delta)$ is then obtained by integrating its derivative using the filter $1/(1-z^{-1})$. The zero line of the glottal wave cannot be recovered from the sound pressure signal due to the factor $(1-z^{-1})$ in the transfer function from the lip volume velocity to the sound pressure at the microphone, as shown in (5.1). Since the glottal wave is never negative, its zero line is set to its minimum value in this study. The method for estimating $K_1$ is under research.

It is noted that a glottis can hardly be completely closed during closed glottal phases. Thus, the VTF estimate may contain the effect of an incomplete glottal closure. The difference between the estimate of the derivative of the glottal wave and the actual one can be analyzed as follows. Let the transfer function of such a VTF estimate be $\hat{H}_{VTF}$. In fact, it equals an $H_{GVTF}$ corresponding to the incomplete glottal closure. The estimate of the delayed derivative of the glottal wave obtained by filtering the speech signal using the inverse filter of the VTF estimate is:

$$\hat{U}'_g(f)=P_{mic}(f)/\hat{H}_{VTF}(f) \tag{5.22a}$$

$$\hat{U}'_g(f)=K_2\,U'_g(f)\,H_{VTF}(f)/H_{GVTF}(f) \tag{5.22b}$$

$$\hat{U}'_g(f)=K_2\,U'_g(f)\,(1+Z_{VT}/Z_g) \tag{5.22c}$$

where $K_2=e^{-j2\pi f\Delta/F_s}\rho/(4\pi r)$, and $\hat{U}'_g(f)$, $P_{mic}(f)$, and $U'_g(f)$ are the Fourier transforms of the estimate of the delayed derivative of the glottal wave, the speech signal, and the actual derivative of the glottal wave, respectively. Eq. (5.22c) means that, if the incomplete glottal closure during closed glottal intervals is not small enough, then the corresponding glottal impedance $Z_g$ cannot be much greater than $Z_{VT}$, and the estimate of the derivative of the glottal wave contains the extra components $K_2U'_gZ_{VT}/Z_g$. It is known from Section 4.6 that $Z_{VT}$ becomes large at the resonant frequencies of the VTF. Thus, the extra components may become comparable with the true one at the VTF resonance frequencies if the incomplete glottal closure is not small enough.

To see the features of $K_2U'_gZ_{VT}/Z_g$ contained in the estimate of the glottal wave derivative, we simulate the time-domain estimate of the derivative of the glottal wave according to Eq. (5.22b). The original $u'_g$ signal is designed according to the LF model (see Section 2.1.5), as shown in plot (a) of Fig. 5.2. The $u'_g$ signal is first convolved with the filter $H_{VTF}(z)$ corresponding to the VTAF of /a/ given in [Story and Titze, 1996]. Then, the convolution result is inverse filtered using the $H_{GVTF}(z)$ that corresponds to the same VTAF and an incomplete glottal closure of $A_g=2$ mm$^2$. The simulated estimate of the derivative of the glottal wave is plotted in (b) of Fig. 5.2. The spectrum of the original derivative of the glottal wave is shown in plot (c) of Fig. 5.2. The spectrum of the difference between the estimate of the derivative of the glottal wave and the original one is plotted in (d) of Fig. 5.2. The extra (residual) components at the first and the fourth formant frequencies (F1=590 Hz, F4=3420 Hz) of the VTF are 54 dB and 62 dB, respectively, and they are greater than those of the original derivative of the glottal wave (52 dB and 37 dB). In the time domain, the extra (residual) components due to the incomplete inverse filtering exhibit ripples over the whole pitch period (see plot (b) of Fig. 5.2). A method for eliminating the effect of an incomplete glottal closure contained in the VTF estimate is needed. As shown in Section 6.6, this requires the measurement of the lip-opening area.

The simulation results for an incomplete glottal closure of $A_g=1$ mm$^2$ are shown in Fig. 5.3. One can see that the ripples caused by the incomplete inverse filtering (plot (b) of Fig. 5.3) are smaller than the ripples caused for $A_g=2$ mm$^2$ in plot (b) of Fig. 5.2.
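The inverse-filtering and integration steps of this section can be sketched numerically. The following is a minimal Python sketch, not the thesis implementation: the function name `glottal_wave_from_speech`, the two-resonance all-pole filter, the raised-cosine test pulse, and the 0.3 leakage offset are all assumptions chosen for illustration. It applies the FIR inverse filter $1-\sum_m \hat{a}_m z^{-m}$, integrates with $1/(1-z^{-1})$, and sets the zero line to the minimum of the result.

```python
import numpy as np
from scipy.signal import lfilter

def glottal_wave_from_speech(p, a_hat):
    """Recover a scaled glottal-wave estimate from a pressure signal p.

    a_hat holds all-pole coefficients a_1..a_{M+1} (hypothetical input,
    e.g. taken from a closed-phase least-squares fit).
    """
    inverse_fir = np.concatenate(([1.0], -np.asarray(a_hat, dtype=float)))
    d_ug = lfilter(inverse_fir, [1.0], p)     # estimate of the derivative
    ug = lfilter([1.0], [1.0, -1.0], d_ug)    # integrator 1/(1 - z^-1)
    return d_ug, ug - ug.min()                # zero line set to the minimum

# Synthetic check with assumed values (illustration only).
poles = 0.95 * np.exp(1j * np.pi * np.array([0.15, 0.5]))
denom = np.real(np.poly(np.concatenate([poles, poles.conj()])))
a_true = -denom[1:]                           # p(n) = u(n) + sum a_m p(n-m)

period = 100
one_cycle = np.zeros(period)
one_cycle[:60] = 0.5 * (1.0 - np.cos(2 * np.pi * np.arange(60) / 60))
ug_true = np.tile(one_cycle, 4) + 0.3         # 0.3 = constant leakage flow
d_true = np.diff(ug_true, prepend=0.0)        # (1 - z^-1) applied to u_g
p = lfilter([1.0], denom, d_true)             # all-pole "vocal tract"

d_est, ug_est = glottal_wave_from_speech(p, a_true)
```

In this idealized setting the derivative is recovered exactly, but the constant 0.3 offset is lost once the zero line is set to the minimum, which mirrors the remark that the zero line cannot be recovered from the pressure signal. With real recordings a pure integrator may also accumulate drift; that case is not handled here.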
Fig. 5.2. (a) The designed derivative glottal waveform (LF model), (b) the simulated estimate of the derivative glottal waveform obtained via inverse filtering using $1/H_{GVTF}$ when $A_g = 2$ mm², (c) the spectrum of the designed derivative glottal waveform, (d) the spectrum of the difference between (a) and (b). Spectra are plotted against frequency in kHz.

Fig. 5.3. (a) The designed derivative glottal waveform (LF model), (b) the simulated estimate of the derivative glottal waveform obtained via inverse filtering using $1/H_{GVTF}$ when $A_g = 1$ mm², (c) the spectrum of the designed derivative glottal waveform, (d) the spectrum of the difference between (a) and (b). Spectra are plotted against frequency in kHz.

5.7 Summary

This chapter developed a new method for obtaining unbiased estimates of VTFs from sustained vowel sounds, without imposing the over-simplified assumptions about glottal waves made in previous methods. Also, this chapter provided a new method for determining the glottal phases from vowel sounds, without the need to know the distance between the microphone and the subject, or to use other signals such as EGG. Moreover, the effects of incomplete glottal closures on the estimate of the glottal wave were analyzed and simulated. The estimates of glottal phases, glottal waves and VTFs obtained from vowel sounds produced by several subjects will be presented and discussed in Chapter 7.
6 Estimating Vocal-Tract Area Functions from Vowel Sounds

6.1 Introduction

Existing methods for estimating a vocal-tract area function (VTAF) from a speech signal first obtain an estimate of the VTF or GVTF from the signal, and then derive the VTAF from the VTF or GVTF estimate, assuming the vocal-tract boundary conditions satisfy one of the following conditions:

• Boundary condition 1: the glottal end is terminated with some characteristic impedance (i.e., the glottal reflection coefficient is a real constant), and the lip end is terminated with zero radiation impedance (i.e., the lip reflection coefficient is one);

• Boundary condition 2: the glottis is completely closed (i.e., the glottal reflection coefficient is one), and the lip radiation impedance is a constant resistance (i.e., the lip reflection coefficient is a real constant).

The above two sets of vocal-tract boundary conditions assume that either the lip or the glottal reflection coefficient is one, and that the other end of the vocal tract has a constant reflection coefficient. In reality, however, neither of the two sets of boundary conditions is exact: the glottal boundary condition is time-varying as the glottis opens and closes periodically during phonation; moreover, the lip reflection coefficient is frequency-dependent, as shown in Chapters 3 and 4. Even if one can obtain VTAF estimates [Wakita, 1973] from signals over a low frequency range (<3.5 kHz), over which boundary condition 1 can be approximately satisfied, speech signals in such a low frequency range lack information about the vocal-tract filters over the higher frequency range, and thus the resulting VTAF estimates lack details about the vocal-tract shape. For example, from a speech signal sampled at a rate of $F_s$ = 7 kHz, one can obtain a VTAF estimate with a sectional length of $L_{VT}/M = 0.5c/F_s = 0.5 \times 350/7000$ m = 2.5 cm. This resolution is not good enough to describe the details of VTAFs, especially for the short vocal tracts of children and female subjects.
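The resolution argument above is plain arithmetic and can be checked directly. In the sketch below the helper name `section_length_cm` is hypothetical, and the sound speed of 350 m/s is the value used in the example above.

```python
def section_length_cm(fs_hz, c_m_per_s=350.0):
    """Sectional length L_VT/M = 0.5 * c / Fs of the tube model, in cm."""
    return 0.5 * c_m_per_s / fs_hz * 100.0

low_rate = section_length_cm(7000.0)    # band-limited signal, about 2.5 cm
high_rate = section_length_cm(44100.0)  # wide-band signal, about 0.4 cm
```

At 44.1 kHz the sectional length is about 0.4 cm, which is why wide-band signals are needed for detailed area functions.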
To obtain high-resolution and accurate VTAF estimates, speech signals covering a wide frequency range must be used. However, over wide frequency ranges, the lip reflection coefficients are frequency-dependent, and as a result, existing VTAF estimations based on either assumed boundary condition 1 or 2 yield distorted and unreasonable estimates. To avoid the effects of time-varying open glottises on VTAF estimates, we derive VTAFs from VTF estimates obtained over closed glottal phases, assuming the glottises are completely closed. The distortion effects of the frequency-dependent lip reflection coefficients on the VTAF estimates are eliminated using the method developed in this chapter. Considering that incomplete glottal closures are common, the effects of incomplete glottal closures on VTAF estimates obtained assuming complete glottal closures are also revealed in this chapter. First of all, based on the concepts of VTFs and GVTFs established in Chapter 3, we reformulate in Sections 6.2 and 6.3, and compare in Section 6.4, the existing methods for estimating VTAFs from speech signals.

6.2 VTAF Estimation Assuming Boundary Condition 1

It is shown that an optimal estimate of the GVTF can be obtained from a vowel sound signal for which the effect of the glottal source has been compensated by 6 dB/oct. [Wakita, 1973]. In solving for the GVTF from the autocorrelation coefficients of the speech signal, Robinson's M-step recursive equation is found to be analogous to the M-matrix product representing the transfer function of the M-sectional tube model of the vocal tract satisfying boundary condition 1. Then, the coefficients $k_0, \ldots, k_{M-1}$ in Robinson's recursive equation are found to be equal to the reflection coefficients $r_1, \ldots, r_M$ of the tube model [Wakita, 1973]. In this section, we reformulate the derivation for obtaining the VTAF from a GVTF, for which $r_{lip} = 1$ and $r_g$ is a real constant.
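Let the GVTF transfer function be

$$H_{GVTF}(z) = \frac{\eta}{H_M(z)} \tag{6.1}$$

in which $\eta$ is a constant. It is shown that if the impulse response of $H_{GVTF}(z)$ is known, the above $H_M(z)$ can be obtained as below [Wakita, 1973]:

$$\begin{bmatrix} H_{m+1}(z) \\ z^{-(m+2)}H_{m+1}(z^{-1}) \end{bmatrix} = \begin{bmatrix} 1 & k_m \\ k_m z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} H_m(z) \\ z^{-(m+1)}H_m(z^{-1}) \end{bmatrix}, \quad m = 0, 1, \ldots, M-1 \tag{6.2}$$

in which

$$\begin{bmatrix} H_0(z) \\ z^{-1}H_0(z^{-1}) \end{bmatrix} = \begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} \tag{6.3}$$

$$k_m = -\frac{\sum_{i=0}^{m} a_i^{(m)} R_{m+1-i}}{\sum_{i=0}^{m} a_i^{(m)} R_i}, \quad m = 0, \ldots, M-1 \tag{6.4}$$

Here $a_i^{(m)}$ is the coefficient of $z^{-i}$ in $H_m(z)$, $a_0^{(m)} = 1$, and $R_i$ is the autocorrelation coefficient of the $H_{GVTF}(z)$ impulse response. $R_i$ is the autocorrelation coefficient of the speech signal if the glottal source signal is compensated for. The above recursive algorithm can be expressed in the form of a matrix chain:

$$\begin{bmatrix} H_M(z) \\ z^{-(M+1)}H_M(z^{-1}) \end{bmatrix} = \begin{bmatrix} 1 & k_{M-1} \\ k_{M-1} z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & k_0 \\ k_0 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} \tag{6.5}$$

The above shows that $H_{GVTF}(z)$ can be obtained from a speech signal. In contrast, $H_{GVTF}(z)$ can also be derived from the acoustic tube model shown in Fig. 6.1, where the sections of the tube model are numbered from the lip end to the subglottal end, and $u_m^+(t)$ and $u_m^-(t)$ are the positive-going and negative-going volume velocities at the left end of section $m$. At the boundary of the $m$th and $(m+1)$th sections, according to the continuity of volume velocity and the continuity of sound pressure, the following two relationships hold:

$$u_{m+1}^+(t-D) + u_{m+1}^-(t+D) = u_m^+(t) + u_m^-(t) \tag{6.6}$$

and

$$\left[u_{m+1}^+(t-D) - u_{m+1}^-(t+D)\right]\rho c/S_{m+1} = \left[u_m^+(t) - u_m^-(t)\right]\rho c/S_m \tag{6.7}$$

where $D$ is the time delay for the sound wave to travel through one section of the tube model. The reflection coefficient at the left side of the boundary of the $m$th and $(m+1)$th sections is defined as:

$$r_m = \frac{S_m - S_{m+1}}{S_m + S_{m+1}}, \quad m = 0, \ldots, M \tag{6.8}$$

Thus,

$$u_{m+1}^+(t-D) = \frac{1}{1+r_m}\,u_m^+(t) + \frac{r_m}{1+r_m}\,u_m^-(t) \tag{6.9}$$

and

$$u_{m+1}^-(t+D) = \frac{r_m}{1+r_m}\,u_m^+(t) + \frac{1}{1+r_m}\,u_m^-(t) \tag{6.10}$$

Fig. 6.1. The tube model used in the VTAF estimation based on boundary condition 1. The sections, with areas $S_{M+1}, \ldots, S_1$, run from $x_{glottis} - L_{VT}/M$ to $x_{lips}$; $u_{M+1}^{\pm}(t)$, $u_M^{\pm}(t)$ and $u_{lip}^{\pm}(t)$ are the volume velocities at the section boundaries.

Let the sampling period of the speech signal be $2D$. Then, the Z transforms of Eqs. (6.9) and (6.10) lead to:

$$\begin{bmatrix} U_{m+1}^+(z) \\ U_{m+1}^-(z) \end{bmatrix} = \frac{z^{1/2}}{1+r_m} \begin{bmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} U_m^+(z) \\ U_m^-(z) \end{bmatrix} \tag{6.11}$$

Assume that the glottis is terminated with a reflection-less tube with cross-sectional area $S_{M+1}$. Let $u_{M+1}^+(t)$ and $u_{M+1}^-(t)$ be the positive-going and negative-going volume velocities at the distance $L_{VT}/M$ from the lower side of the glottis; then from Eq. (6.11):

$$\begin{bmatrix} U_{M+1}^+(z) \\ U_{M+1}^-(z) \end{bmatrix} = \frac{z^{(M+1)/2}}{\prod_{i=0}^{M}(1+r_i)} \begin{bmatrix} 1 & r_M \\ r_M z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 & r_0 \\ r_0 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} U_{lip}^+(z) \\ U_{lip}^-(z) \end{bmatrix} \tag{6.12}$$

Substitute $u_{lip}^-(t) = 0$ and $u_{lip}^+(t) = u_{lip}(t)$ in Eq. (6.12), and assume that the lip reflection coefficient is $r_{lip} = r_0 = 1$; then Eq. (6.12) becomes:

$$\begin{bmatrix} U_{M+1}^+(z) \\ U_{M+1}^-(z) \end{bmatrix} = \frac{z^{(M+1)/2}}{\prod_{i=0}^{M}(1+r_i)} \begin{bmatrix} 1 & r_M \\ r_M z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 & 1 \\ z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} U_{lip}(z) \\ 0 \end{bmatrix} \tag{6.13}$$

Thus,

$$\begin{bmatrix} U_{M+1}^+(z)/U_{lip}(z) \\ U_{M+1}^-(z)/U_{lip}(z) \end{bmatrix} = \frac{z^{(M+1)/2}}{\prod_{i=0}^{M}(1+r_i)} \begin{bmatrix} 1 & r_M \\ r_M z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} \tag{6.14}$$

Denote:

$$\begin{bmatrix} V_M^+(z) \\ V_M^-(z) \end{bmatrix} = \begin{bmatrix} 1 & r_M \\ r_M z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 & r_{M-1} \\ r_{M-1} z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} \tag{6.15}$$

According to Eqs. (6.14) and (6.15), we get:

$$\frac{U_{lip}(z)}{U_{M+1}^+(z)} = \frac{z^{-(M+1)/2}\prod_{i=0}^{M}(1+r_i)}{V_M^+(z)} = \frac{z^{-(M+1)/2}\prod_{i=0}^{M}(1+r_i)}{[1,\,0]\begin{bmatrix} V_M^+(z) \\ V_M^-(z) \end{bmatrix}} = \frac{z^{-(M+1)/2}\prod_{i=0}^{M}(1+r_i)}{[1,\,r_M]\begin{bmatrix} 1 & r_{M-1} \\ r_{M-1} z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} 1 \\ z^{-1} \end{bmatrix}} \tag{6.16}$$

Note that $r_M, r_{M-1}, \ldots, r_1$ and $r_0$ in Eq. (6.16), defined for the tube model in Fig. 6.1, correspond to $r_g, r_1, \ldots, r_{M-1}$ and $r_{lip}$ in Eq. (3.56), defined for the tube model in Fig. 3.4, respectively. The volume-velocity signal flow diagram of the tube model shown in Fig. 6.1 is shown in Fig. 6.2. The equivalent system from $U_{M+1}^+$ to $U_{lip}$ is shown in Fig. 6.3. Comparing Fig. 6.3 with Fig. 3.5, it can be seen that $U_{M+1}^+$ plays the same role as $2u_{sc}(t)$ in Fig. 3.2. Thus, Eq. (6.16) equals two times $H_{GVTF}(z)$ in Eq. (3.56). Because both $V_M^+(z)$ and $H_M(z)$ represent the normalized denominators (with leading term of unity) of the transfer functions from the glottal source to the lip opening of the same vocal tract, $V_M^+(z) = H_M(z)$.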
Also, it can be shown that:

$$V_M^-(z) = z^{-(M+1)}V_M^+(z^{-1}) \tag{6.17}$$

Hence,

$$\begin{bmatrix} V_M^+(z) \\ V_M^-(z) \end{bmatrix} = \begin{bmatrix} H_M(z) \\ z^{-(M+1)}H_M(z^{-1}) \end{bmatrix} \tag{6.18a}$$

i.e.,

$$\begin{bmatrix} 1 & r_M \\ r_M z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 & r_{M-1} \\ r_{M-1} z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} = \begin{bmatrix} 1 & k_{M-1} \\ k_{M-1} z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 & k_{M-2} \\ k_{M-2} z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & k_0 \\ k_0 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} \tag{6.18b}$$

Since the corresponding coefficients on the two sides must be equal,

$$r_m = k_{m-1}, \quad m = M, \ldots, 1 \tag{6.19}$$

where $m$ increases from the lip end to the glottal end. Having obtained the $r_m$'s from the speech signal, one can calculate the area ratios of a VTAF using Eq. (6.8).

Fig. 6.2. The signal flow diagram for the tube model in Fig. 6.1.

Fig. 6.3. The equivalent signal flow diagram from $u_{M+1}^+(t)$ to $u_{lip}(t)$.

The above VTAF estimation requires that the GVTF correspond to vocal-tract boundary condition 1. This method is criticized, first, because the results depend on the 6 dB/oct. compensation for glottal sources, which differ between subjects. Secondly, to have unity lip reflection coefficients, the speech signals have to be band-limited to a very low frequency range, which leads to low-resolution VTAF estimates. Moreover, this method often results in unreasonable VTAFs [Ray, 1995].

Note that Eqs. (6.2) and (6.5) are different from those in [Wakita, 1973], in which negative signs are inserted in front of $k_m$, $m = 0, 1, \ldots, M-1$. In [Wakita, 1973], these negative signs are inserted because of the negative sign in his equation $z^{-(M+1)}H_m(z^{-1}) = -H_m(z)$, which is inserted in order to conform to the convention that the reference direction for negative-going volume velocities is opposite to that for positive-going volume velocities. In our study, we use the convention that the reference directions of the positive-going volume velocity and the negative-going volume velocity are the same. Under this convention, there is no need to insert a negative sign in the equation $z^{-(M+1)}H_m(z^{-1}) = H_m(z)$, nor in front of $k_m$.
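The procedure of this section (recursion (6.2) to (6.4), then $r_m = k_{m-1}$, then area ratios from Eq. (6.8)) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the lip-end section area is arbitrarily normalized to 1, and the minus sign in $k_m$ follows the sign convention of Eq. (6.2), in which $k_m$ is added rather than subtracted.

```python
def reflection_from_autocorr(R, M):
    """Recursion of Eqs. (6.2)-(6.4).

    R: autocorrelation coefficients R[0..M] of the (source-compensated) signal.
    Returns (k, a): k[m] = k_m for m = 0..M-1, and a = coefficients of H_M(z).
    """
    a = [1.0]                      # H_0(z) = 1
    k = []
    for m in range(M):
        num = sum(a[i] * R[m + 1 - i] for i in range(m + 1))
        den = sum(a[i] * R[i] for i in range(m + 1))
        km = -num / den
        k.append(km)
        padded = a + [0.0]
        # H_{m+1}(z) = H_m(z) + k_m z^-(m+1) H_m(1/z)
        a = [x + km * y for x, y in zip(padded, reversed(padded))]
    return k, a

def areas_from_lip(k, s1=1.0):
    """Areas S_1..S_{M+1} from r_m = k_{m-1} and Eq. (6.8),
    i.e. S_{m+1} = S_m (1 - r_m) / (1 + r_m); S_1 is the lip-end section."""
    S = [s1]
    for rm in k:
        S.append(S[-1] * (1.0 - rm) / (1.0 + rm))
    return S

# Toy autocorrelation R_i = 0.5**i (illustrative, not speech data)
k, a = reflection_from_autocorr([1.0, 0.5, 0.25], 2)
S = areas_from_lip(k)
```

For this toy input the second coefficient is zero, so the second and third areas coincide, as expected for a first-order process.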
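6.3 VTAF Estimation Assuming Boundary Condition 2

This section reformulates the derivation in [Atal, 1971] such that the resulting equations have expressions similar to those in Section 6.2, and thus can easily be compared with them. The VTAF can be derived from the coefficients of the VTF transfer function $H_{VTF}(z)$, assuming the lip reflection coefficient is a real constant, i.e., $r_{lip} = r_M$ [Atal, 1973]. In this derivation, the sections of the vocal-tract tube model are numbered from the glottal end to the lip end, as shown in Fig. 3.4. Given $r_{lip} = r_M$, according to Eq. (3.60), the VTF transfer function becomes:

$$H_{VTF}(z) = \frac{z^{-M/2}\prod_{m=1}^{M}(1+r_m)}{[1,\,1]\begin{bmatrix} A_M(z) & B_M(z) \\ C_M(z) & D_M(z) \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix}} = \frac{z^{-M/2}\prod_{m=1}^{M}(1+r_m)}{A_M(z) + C_M(z)} \tag{6.20}$$

in which:

$$\begin{bmatrix} A_M(z) & B_M(z) \\ C_M(z) & D_M(z) \end{bmatrix} = \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 & r_2 \\ r_2 z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_M \\ r_M z^{-1} & z^{-1} \end{bmatrix} \tag{6.21}$$

Denote:

$$\begin{bmatrix} A_m(z) & B_m(z) \\ C_m(z) & D_m(z) \end{bmatrix} = \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 & r_2 \\ r_2 z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{bmatrix} \tag{6.22}$$

and

$$[1,\,1]\begin{bmatrix} A_m(z) & B_m(z) \\ C_m(z) & D_m(z) \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = A_m(z) + C_m(z), \quad m = 1, \ldots, M \tag{6.23}$$

It can be shown [Atal, 1971] that for $m = 1, \ldots, M$, the components of the A-B-C-D matrix defined in Eq. (6.22) have the following four properties:

1) $A_m(z) + C_m(z)$ is an $m$th-order polynomial in $z^{-1}$ with a leading term of unity:

$$A_m(z) + C_m(z) = G_m(z) = 1 + g_1 z^{-1} + \cdots + g_m z^{-m} \tag{6.24}$$

2) Reciprocal polynomial relationships, for $m = 1, \ldots, M$:

$$B_m(z) = z^{-m}C_m(z^{-1}) \tag{6.25}$$

$$D_m(z) = z^{-m}A_m(z^{-1}) \tag{6.26}$$

$$B_m(z) + D_m(z) = z^{-m}\left(A_m(z^{-1}) + C_m(z^{-1})\right) = g_m + g_{m-1}z^{-1} + \cdots + z^{-m} \tag{6.27}$$

3) The coefficient of $z^{-m}$ in $A_m(z) + C_m(z)$ is equal to $r_m$. We prove this property as follows. From Eqs. (6.22), (6.24) and (6.27), it follows that:

$$A_m(z) + C_m(z) = [1,\,1]\begin{bmatrix} A_{m-1}(z) & B_{m-1}(z) \\ C_{m-1}(z) & D_{m-1}(z) \end{bmatrix}\begin{bmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = A_{m-1}(z) + C_{m-1}(z) + \left[B_{m-1}(z) + D_{m-1}(z)\right]r_m z^{-1} = G_{m-1}(z) + z^{-(m-1)}G_{m-1}(z^{-1})\,r_m z^{-1} \tag{6.28}$$

According to Eq. (6.24), the coefficient of $z^{-(m-1)}$ in $z^{-(m-1)}G_{m-1}(z^{-1})$ equals one. Thus, the coefficient of $z^{-m}$ in $z^{-(m-1)}G_{m-1}(z^{-1})\,r_m z^{-1}$ is $r_m$. Because there is no $z^{-m}$ term in $G_{m-1}(z)$, the coefficient of $z^{-m}$ in $A_m(z) + C_m(z)$ equals the reflection coefficient $r_m$.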
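4) From $A_M(z) + C_M(z)$, which is the denominator of the VTF transfer function shown in Eq. (6.20), $A_m(z) + C_m(z)$, $m = M-1, M-2, \ldots, 1$, can be derived. We prove this as follows. From Eqs. (6.24) and (6.28) we have:

$$G_m(z) = G_{m-1}(z) + z^{-m}G_{m-1}(z^{-1})\,r_m \tag{6.29}$$

Multiplying both sides of the above equation by $z^{m+1}$, we get:

$$z^{m+1}G_m(z) = z^{m+1}G_{m-1}(z) + z\,G_{m-1}(z^{-1})\,r_m \tag{6.30}$$

Substituting $z$ with $z^{-1}$ in the above equation, we then get:

$$z^{-(m+1)}G_m(z^{-1}) = z^{-(m+1)}G_{m-1}(z^{-1}) + z^{-1}G_{m-1}(z)\,r_m \tag{6.31}$$

Combining Eqs. (6.29) and (6.31):

$$\begin{bmatrix} G_m(z) \\ z^{-(m+1)}G_m(z^{-1}) \end{bmatrix} = \begin{bmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} G_{m-1}(z) \\ z^{-m}G_{m-1}(z^{-1}) \end{bmatrix}, \quad m = 2, \ldots, M \tag{6.32}$$

Thus,

$$\begin{bmatrix} G_{m-1}(z) \\ z^{-m}G_{m-1}(z^{-1}) \end{bmatrix} = \begin{bmatrix} 1 & r_m \\ r_m z^{-1} & z^{-1} \end{bmatrix}^{-1}\begin{bmatrix} G_m(z) \\ z^{-(m+1)}G_m(z^{-1}) \end{bmatrix} = \frac{1}{1-r_m^2}\begin{bmatrix} 1 & -r_m z \\ -r_m & z \end{bmatrix}\begin{bmatrix} G_m(z) \\ z^{-(m+1)}G_m(z^{-1}) \end{bmatrix} \tag{6.33}$$

Therefore,

$$G_{m-1}(z) = \frac{1}{1-r_m^2}\left(G_m(z) - r_m z^{-m}G_m(z^{-1})\right) \tag{6.34}$$

As shown above, $r_M$ can be obtained from the coefficient of $z^{-M}$ in $A_M(z) + C_M(z)$. Then, $A_{M-1}(z) + C_{M-1}(z)$ can be obtained using Eq. (6.34). The coefficient of $z^{-(M-1)}$ in $A_{M-1}(z) + C_{M-1}(z)$ is $r_{M-1}$. Similarly, the other $r_m$'s can be obtained.

We find that under boundary condition 2, the $r_m$'s can also be derived from the impulse response of $H_{VTF}(z)$, in the same fashion as that for deriving $r_m$ from the impulse response of $H_{GVTF}(z)$. According to Eq. (6.23),

$$\begin{bmatrix} G_1(z) \\ z^{-2}G_1(z^{-1}) \end{bmatrix} = \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} \tag{6.35}$$

From Eqs. (6.32) and (6.35):

$$\begin{bmatrix} G_M(z) \\ z^{-(M+1)}G_M(z^{-1}) \end{bmatrix} = \begin{bmatrix} 1 & r_M \\ r_M z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} 1 & r_{M-1} \\ r_{M-1} z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & r_1 \\ r_1 z^{-1} & z^{-1} \end{bmatrix}\begin{bmatrix} 1 \\ z^{-1} \end{bmatrix} \tag{6.36}$$

Since $G_M(z) = A_M(z) + C_M(z)$ is the denominator of $H_{VTF}(z)$, $G_M(z)$ can also be represented using Eq. (6.5) in terms of $k_m$, which is determined by the impulse response of $H_{VTF}(z)$. Comparing Eq. (6.5) and Eq. (6.36), we get:

$$r_m = k_{m-1}, \quad m = 1, \ldots, M \tag{6.37}$$

where $m$ increases from the glottal end to the lip end. Eq. (6.37) has been validated in the VTAF estimations based on boundary condition 2 [Deng, CAA2003]. Having obtained the $r_m$'s, one can calculate the area ratios of the VTAF using Eq. (3.23). As shown above, under boundary condition 2, the VTAF can be derived from the coefficients of the VTF, or from the impulse response of the VTF.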
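Properties 3) and 4) amount to a step-up recursion, Eq. (6.29), and its inverse, Eq. (6.34). The sketch below is illustrative (the names `step_up` and `step_down` are hypothetical); it builds the denominator polynomial from a set of reflection coefficients and then recovers them again from its coefficients.

```python
def step_up(r):
    """Build G_M(z) from r_1..r_M using Eq. (6.29).
    Returns the coefficients [1, g_1, ..., g_M] of powers z^0 .. z^-M."""
    G = [1.0]
    for rm in r:
        padded = G + [0.0]
        # G_m(z) = G_{m-1}(z) + r_m z^-m G_{m-1}(1/z)
        G = [a + rm * b for a, b in zip(padded, reversed(padded))]
    return G

def step_down(G):
    """Recover r_1..r_M from the coefficients of G_M(z).
    r_m is the coefficient of z^-m in G_m(z) (property 3), and
    G_{m-1}(z) = (G_m(z) - r_m z^-m G_m(1/z)) / (1 - r_m^2)  (Eq. 6.34)."""
    G = list(G)
    r = []
    while len(G) > 1:
        rm = G[-1]
        G = [(a - rm * b) / (1.0 - rm * rm) for a, b in zip(G, reversed(G))][:-1]
        r.append(rm)
    return r[::-1]

# Round trip with arbitrary illustrative coefficients (|r_m| < 1)
r_true = [0.3, -0.2, 0.5]
r_back = step_down(step_up(r_true))
```

The recursion requires $|r_m| \neq 1$ at every step, since Eq. (6.34) divides by $1 - r_m^2$.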
The VTAF estimation based on boundary condition 2 requires the VTF to be known, and the lip reflection coefficient to be a constant.

We note that the definition of $r_m$ in Eq. (6.8) and in [Atal, 1971] is $r_m = (S_m - S_{m+1})/(S_m + S_{m+1})$, which is the negative of that defined in Eq. (3.23) and used in this section. If we use the definition $r_m = (S_m - S_{m+1})/(S_m + S_{m+1})$, then $r_m$ should be equal to the negative of the coefficient of $z^{-m}$ in $G_m(z)$, which is independent of the reference directions of the positive-going and negative-going volume velocities. In [Atal, 1971], $r_m$ is directly assigned the coefficient of $z^{-m}$ in $G_m(z)$, for $m = 1, \ldots, M$, which is a mistake or a misprint. In Appendix D, we illustrate this mistake by an example.

6.4 Comparing VTAF Estimations Assuming Different Boundary Conditions

As shown above, from a speech signal, one can obtain two different VTAF estimates using different equations, assuming the speech sound is produced under two different vocal-tract boundary conditions. We compare the different VTAF estimations in Table 6.1, so that one can correctly apply the different equations given different vocal-tract boundary conditions.

It is reported in [Wakita, 1973] that boundary condition 2 cannot lead to reasonable VTAF estimates. This conclusion is incorrect. We believe that in [Wakita, 1973], boundary condition 2 was not satisfied, because a 20-ms speech-signal segment, which covers several glottal cycles, was used for the VTAF estimation, and the open-glottis portions of that segment do not satisfy the assumed boundary condition 2. Good VTAF estimates can be obtained only if the speech signal segments used for the estimations are produced under the assumed vocal-tract boundary conditions. In reality, however, the assumed glottal and lip boundary conditions cannot always be satisfied: glottal impedances are time-varying, and lip radiation impedances are frequency-dependent.
To avoid the effects of time-varying glottal boundary conditions on VTAF estimates, we derive VTAFs from the VTFs estimated over closed glottal phases using the method in Chapter 5. However, a VTF transfer function contains the effect of the frequency-dependent lip reflection coefficient. If one assumes $r_{lip}$ is a frequency-independent constant, and derives the VTAF from the VTF using the existing method shown in Section 6.3, then the resulting VTAF estimate contains distortions. In the next section, a new method for deriving the VTAF from the VTF is developed.

Table 6.1. Comparisons between VTAF estimations based on different boundary conditions

| VTAF estimations | Based on boundary condition 1 | Based on boundary condition 2 |
|---|---|---|
| Transfer functions required | GVTF | VTF |
| Boundary conditions required | The glottis is terminated with a reflection-less tube, and the lip opening is terminated with zero acoustic impedance. | The glottis is closed, and the lip opening is terminated with a reflection-less tube. |
| Sectional numbers of the tube model | $m$ increases from the lips to the glottis. | $m$ increases from the glottis to the lips. |
| Definition of reflection coefficient $r_m$ | $r_m = (S_m - S_{m+1})/(S_{m+1} + S_m)$, as in Eq. (6.8). | $r_m = (S_{m+1} - S_m)/(S_{m+1} + S_m)$, as in Eq. (3.23). |
| Relationship between $r_m$ and the coefficients of the transfer function | Unknown. | $r_m$ equals the coefficient of $z^{-m}$ in $A_m(z) + C_m(z)$. But $r_m$ equals the negative* coefficient of $z^{-m}$ in $A_m(z) + C_m(z)$ if the definition of $r_m$ is the same as in Eq. (6.8). |
| Relationship between $r_m$ and the impulse response of the transfer function | $r_m = k_{m-1}$, where $k_{m-1}$ is determined by the autocorrelation coefficients of the GVTF impulse response. | $r_m = k_{m-1}$, where $k_{m-1}$ is determined by the autocorrelation coefficients of the VTF impulse response. But $r_m = -k_{m-1}$ if the definition of $r_m$ is the same as in Eq. (6.8). |

*Note: the word "negative" is missing in the corresponding sentence in [Atal and Hanauer, 1971].
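6.5 A New Method for Obtaining VTAFs from VTFs

This section shows how to derive the VTAF from a VTF that contains the effect of the frequency-dependent lip reflection coefficient. Denote the estimate of the denominator of $H_{VTF}(z)$ obtained from a speech signal as:

$$\hat{H}(z) = 1 + \sum_{i=1}^{M+1} h_i z^{-i} \tag{6.38}$$

From the coefficients $h_1, \ldots, h_{M+1}$, we first solve for the parameters $\alpha$ and $\beta$ of $r_{lip}$, and the reflection coefficients $r_m$ of the tube model of the vocal tract. Then, we derive the VTAF from the $r_m$'s using Eq. (3.23). According to Eq. (3.60), the VTF transfer function expressed in terms of $r_m$ and $r_{lip}$ is:

$$H_{VTF}(z) = \frac{\left[(1+\mu) + (\alpha + \mu\beta)z^{-1}\right]z^{-M/2}\prod_{m=1}^{M-1}(1+r_m)}{\left[A_{M-1}(z) + C_{M-1}(z)\right](1+\alpha z^{-1}) + \left[B_{M-1}(z) + D_{M-1}(z)\right]z^{-1}(1+\beta z^{-1})\mu} \tag{6.39}$$

where $\mu = (1+\alpha)/(1+\beta)$.

Equating the denominator of the above $H_{VTF}(z)$ and the estimate of the denominator in Eq. (6.38), we get:

$$\left[A_{M-1}(z) + C_{M-1}(z)\right](1+\alpha z^{-1}) + \left[B_{M-1}(z) + D_{M-1}(z)\right]z^{-1}(1+\beta z^{-1})\mu = 1 + \sum_{i=1}^{M+1} h_i z^{-i} \tag{6.40}$$

Expressing the above $A_{M-1}(z) + C_{M-1}(z)$ and $B_{M-1}(z) + D_{M-1}(z)$ in terms of $G_{M-1}(z)$ as in Eqs. (6.28) and (6.27), Eq. (6.40) becomes:

$$G_{M-1}(z)(1+\alpha z^{-1}) + G_{M-1}(z^{-1})\,z^{-M}(1+\beta z^{-1})\mu = 1 + \sum_{i=1}^{M+1} h_i z^{-i} \tag{6.41}$$

i.e.,

$$\left(1 + g_1 z^{-1} + g_2 z^{-2} + \cdots + g_{M-1}z^{-(M-1)}\right)(1+\alpha z^{-1}) + \mu\left(g_{M-1} + g_{M-2}z^{-1} + \cdots + g_1 z^{-(M-2)} + z^{-(M-1)}\right)(1+\beta z^{-1})z^{-1} = 1 + h_1 z^{-1} + \cdots + h_{M+1}z^{-(M+1)} \tag{6.42}$$

Since $\mu = (1+\alpha)/(1+\beta)$, and since the corresponding coefficients of the two sides are equal, $g_1, g_2, \ldots, g_{M-1}$, $\alpha$, $\beta$ and $\mu$ have the following non-linear relationships:

$$\begin{aligned}
f_1 &= \alpha - \mu(1+\beta) + 1 = 0 \\
f_2 &= \alpha + g_1 + \mu g_{M-1} - h_1 = 0 \\
f_3 &= \alpha g_1 + g_2 + \mu g_{M-2} + \mu\beta g_{M-1} - h_2 = 0 \\
&\;\;\vdots \\
f_{M+1} &= \alpha g_{M-1} + \mu + \mu\beta g_1 - h_M = 0 \\
f_{M+2} &= \beta\mu - h_{M+1} = 0
\end{aligned} \tag{6.43}$$

In order to obtain $g_1, g_2, \ldots, g_{M-1}$, $\alpha$, $\beta$ and $\mu$ from $h_1, \ldots, h_{M+1}$, we use Newton's method [Fausett, 2003] to solve the above non-linear equations, as shown below. The above equations can be represented in a matrix form:

$$(A + \mu B)\begin{bmatrix} 1 \\ g_1 \\ \vdots \\ g_{M-1} \end{bmatrix} = \begin{bmatrix} -1 \\ h_1 \\ \vdots \\ h_{M+1} \end{bmatrix} \tag{6.44}$$

where $A$ and $B$ are $(M+2)\times M$ matrices whose entries, involving $\alpha$ and $\beta$ respectively, follow from Eq. (6.43). The above equation can be represented in blocked matrices as:

$$\left(\begin{bmatrix} \alpha & 0_{1\times(M-1)} \\ A_L & A_R \end{bmatrix} + \mu\begin{bmatrix} -(1+\beta) & 0_{1\times(M-1)} \\ B_L & B_R \end{bmatrix}\right)\begin{bmatrix} 1 \\ G \end{bmatrix} = \begin{bmatrix} -1 \\ H \end{bmatrix} \tag{6.45}$$

where:

$$A_R = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ \alpha & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & \alpha & 1 \\ 0 & 0 & \cdots & 0 & \alpha \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix}_{(M+1)\times(M-1)}, \quad B_R = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 0 & 0 & \cdots & 1 & \beta \\ \vdots & & & & \vdots \\ 1 & \beta & \cdots & 0 & 0 \\ \beta & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cdots & 0 & 0 \end{bmatrix}_{(M+1)\times(M-1)}$$

$$A_L = \begin{bmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{bmatrix}_{(M+1)\times 1}, \quad B_L = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ \beta \end{bmatrix}_{(M+1)\times 1}, \quad G = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_{M-1} \end{bmatrix}_{(M-1)\times 1}, \quad \text{and} \quad H = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_{M+1} \end{bmatrix}_{(M+1)\times 1}$$

If the initial values of $\alpha$, $\beta$ and $\mu$ are known, then the initial values of $g_1, \ldots, g_{M-1}$ can be solved from the following equation:

$$(A_R + \mu B_R)\,G = H - A_L - \mu B_L \tag{6.46}$$

Step 1: initialize $\beta^{(0)}$ in the range of 0.5 to 0.65, which is observed in Table 4.1. Note that to converge to a solution, for different lip-opening areas, it may be necessary to try different $\beta^{(0)}$ values. Then, from the last equation in Eq. (6.43), we get $\mu^{(0)} = h_{M+1}/\beta^{(0)}$; and from the first relationship in Eq. (6.43), we get $\alpha^{(0)} = \mu^{(0)}(1+\beta^{(0)}) - 1$. Then, $g_1^{(0)}, \ldots, g_{M-1}^{(0)}$ are solved for using Eq. (6.46), with $A_R$ and $B_R$ evaluated at $\alpha^{(0)}$ and $\beta^{(0)}$:

$$\left(A_R + \mu^{(0)}B_R\right)\begin{bmatrix} g_1^{(0)} \\ g_2^{(0)} \\ \vdots \\ g_{M-1}^{(0)} \end{bmatrix} = \begin{bmatrix} h_1 - \alpha^{(0)} \\ h_2 \\ \vdots \\ h_M - \mu^{(0)} \\ h_{M+1} - \mu^{(0)}\beta^{(0)} \end{bmatrix} \tag{6.47}$$

Equivalently, $g_1^{(0)}, \ldots, g_{M-1}^{(0)}$ can be solved for from the following equations, obtained by omitting the last row in the above equation (the matrices $\bar{A}_R$ and $\bar{B}_R$ are $A_R$ and $B_R$ without their last, all-zero rows, and are of size $M\times(M-1)$):

$$\left(\bar{A}_R + \mu^{(0)}\bar{B}_R\right)\begin{bmatrix} g_1^{(0)} \\ g_2^{(0)} \\ \vdots \\ g_{M-1}^{(0)} \end{bmatrix} = \begin{bmatrix} h_1 - \alpha^{(0)} \\ h_2 \\ \vdots \\ h_M - \mu^{(0)} \end{bmatrix} \tag{6.48}$$

Step 2: construct the Jacobian matrix:

$$J = \begin{bmatrix} \dfrac{\partial f_1}{\partial\alpha} & \dfrac{\partial f_1}{\partial\beta} & \dfrac{\partial f_1}{\partial\mu} & \dfrac{\partial f_1}{\partial g_1} & \cdots & \dfrac{\partial f_1}{\partial g_{M-1}} \\ \vdots & & & & & \vdots \\ \dfrac{\partial f_{M+2}}{\partial\alpha} & \dfrac{\partial f_{M+2}}{\partial\beta} & \dfrac{\partial f_{M+2}}{\partial\mu} & \dfrac{\partial f_{M+2}}{\partial g_1} & \cdots & \dfrac{\partial f_{M+2}}{\partial g_{M-1}} \end{bmatrix} \tag{6.49}$$

According to Eq. (6.43):

$$J = \begin{bmatrix}
1 & -\mu & -(1+\beta) & 0 & 0 & \cdots & 0 & 0 \\
1 & 0 & g_{M-1} & 1 & 0 & \cdots & 0 & \mu \\
g_1 & \mu g_{M-1} & g_{M-2}+\beta g_{M-1} & \alpha & 1 & \cdots & \mu & \mu\beta \\
\vdots & \vdots & \vdots & & & & & \vdots \\
g_{M-2} & \mu g_2 & g_1+\beta g_2 & \mu & \mu\beta & \cdots & \alpha & 1 \\
g_{M-1} & \mu g_1 & 1+\beta g_1 & \mu\beta & 0 & \cdots & 0 & \alpha \\
0 & \mu & \beta & 0 & 0 & \cdots & 0 & 0
\end{bmatrix} \tag{6.50}$$

Step 3: given the values of $\alpha^{(k)}$, $\beta^{(k)}$, $\mu^{(k)}$, and $g_1^{(k)}, \ldots, g_{M-1}^{(k)}$, $k = 0, 1, \ldots, N$, their increments are calculated:

$$\begin{bmatrix} \Delta\alpha \\ \Delta\beta \\ \Delta\mu \\ \Delta g_1 \\ \vdots \\ \Delta g_{M-1} \end{bmatrix} = -J^{-1}\begin{bmatrix} f_1\left(\alpha^{(k)}, \beta^{(k)}, \mu^{(k)}, g_1^{(k)}, \ldots, g_{M-1}^{(k)}\right) \\ f_2\left(\alpha^{(k)}, \beta^{(k)}, \mu^{(k)}, g_1^{(k)}, \ldots, g_{M-1}^{(k)}\right) \\ \vdots \\ f_{M+2}\left(\alpha^{(k)}, \beta^{(k)}, \mu^{(k)}, g_1^{(k)}, \ldots, g_{M-1}^{(k)}\right) \end{bmatrix} \tag{6.53}$$

Step 4: update the estimates of $\alpha$, $\beta$, $\mu$ and $g_1, \ldots, g_{M-1}$:

$$\begin{bmatrix} \alpha^{(k+1)} \\ \beta^{(k+1)} \\ \mu^{(k+1)} \\ g_1^{(k+1)} \\ \vdots \\ g_{M-1}^{(k+1)} \end{bmatrix} = \begin{bmatrix} \alpha^{(k)} \\ \beta^{(k)} \\ \mu^{(k)} \\ g_1^{(k)} \\ \vdots \\ g_{M-1}^{(k)} \end{bmatrix} + \begin{bmatrix} \Delta\alpha \\ \Delta\beta \\ \Delta\mu \\ \Delta g_1 \\ \vdots \\ \Delta g_{M-1} \end{bmatrix} \tag{6.54}$$

Step 5: repeat Steps 3 and 4 until $k = N$, where $N$ is a number large enough to obtain accurate solutions. Now, the coefficients $g_1, \ldots, g_{M-1}$ and the parameters $\alpha$ and $\beta$ are obtained.

Step 6: construct the polynomial $G_{M-1}(z) = 1 + g_1 z^{-1} + \cdots + g_{M-1}z^{-(M-1)}$, and derive the reflection coefficients $r_1, \ldots, r_{M-1}$ of the tube model from $G_{M-1}(z)$, as shown in Section 6.3. The area ratios of the VTAF can then be derived using Eq. (3.23). In the following, VTAF estimates are normalized to the maximal area of $S_1, \ldots, S_M$.

The above solution is based on the assumption that $r_g = 1$. As mentioned in Section 2.1.2, it is common that a glottis never completely closes during phonation. Thus, an estimate of the $H_{VTF}(z)$ denominator obtained over closed glottal phases may contain the effect of $r_g < 1$, and is in fact an $H_{GVTF}(z)$ denominator. In the next section, the distortion effect of incomplete glottal closures on VTAF estimates obtained using the method developed here is investigated.

6.6 Distortion Effects of Incomplete Glottal Closures on VTAF Estimates

This section investigates via computer simulations the distortion effects of incomplete glottal closures on VTAF estimates derived from VTF estimates using the method developed in Section 6.5. Given a VTAF and $r_g$, $H_{GVTF}(z)$ and $H_{VTF}(z)$ can be constructed using Eqs. (3.59) and (3.60), respectively. From the constructed $H_{GVTF}(z)$ and $H_{VTF}(z)$, VTAF estimates are derived using the method in Section 6.5. The difference between the VTAF obtained from $H_{VTF}(z)$ and that obtained from $H_{GVTF}(z)$ is due to the effect of the incomplete glottal closure. VTAFs for /a/, /i/, /u/, Id and /o/ measured using the MRI method [Story and Titze, 1996] were used for constructing the VTFs and GVTFs.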
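Returning briefly to the solver of Section 6.5, Steps 1 to 5 can be sketched numerically as below. This is a sketch under stated assumptions, not the implementation used in this study: the function names are hypothetical, NumPy is assumed to be available, the residuals follow Eq. (6.43), the initial $g$ values are obtained by least squares from the overdetermined system (6.46), and the test values ($M = 2$, $\alpha = 0.2$, $\beta = 0.6$, $g_1 = 0.3$) are purely illustrative.

```python
import numpy as np

def residuals(x, h):
    """f_1..f_{M+2} of Eq. (6.43); x = [alpha, beta, mu, g_1..g_{M-1}], h = [h_1..h_{M+1}]."""
    alpha, beta, mu = x[:3]
    G = np.concatenate(([1.0], x[3:]))                    # G_{M-1}(z)
    left = np.convolve(G, [1.0, alpha])                   # G(z)(1 + alpha z^-1)
    right = mu * np.convolve(G[::-1], [0.0, 1.0, beta])   # mu z^-M G(1/z)(1 + beta z^-1)
    poly = np.append(left, 0.0) + right                   # powers z^0 .. z^-(M+1)
    f1 = alpha - mu * (1.0 + beta) + 1.0
    return np.concatenate(([f1], poly[1:] - h))

def jacobian(x):
    """Analytic Jacobian of the residuals, columns ordered as x (cf. Eq. 6.50)."""
    alpha, beta, mu = x[:3]
    n = len(x)
    M = n - 2
    G = np.concatenate(([1.0], x[3:]))
    J = np.zeros((n, n))
    J[0, :3] = [1.0, -mu, -(1.0 + beta)]
    J[1:M + 1, 0] = G                                     # d/d alpha
    J[2:, 1] = mu * G[::-1]                               # d/d beta
    J[1:, 2] = np.convolve(G[::-1], [1.0, beta])          # d/d mu
    for j in range(1, M):                                 # d/d g_j
        J[j, 2 + j] += 1.0
        J[j + 1, 2 + j] += alpha
        J[M - j, 2 + j] += mu
        J[M - j + 1, 2 + j] += mu * beta
    return J

def solve_newton(h, beta0=0.6, iters=20):
    """Steps 1-5: initialize from beta0, then iterate Newton updates."""
    h = np.asarray(h, dtype=float)
    M = len(h) - 1
    mu0 = h[-1] / beta0
    alpha0 = mu0 * (1.0 + beta0) - 1.0
    x = np.concatenate(([alpha0, beta0, mu0], np.zeros(M - 1)))
    # The residuals are affine in g, so the initial g is a least-squares solve (Eq. 6.46).
    Jg = jacobian(x)[1:, 3:]
    x[3:] = np.linalg.lstsq(Jg, -residuals(x, h)[1:], rcond=None)[0]
    for _ in range(iters):
        x = x + np.linalg.solve(jacobian(x), -residuals(x, h))
    return x

# Synthetic check: build h from known parameters, then recover them.
x_true = np.array([0.2, 0.6, 1.2 / 1.6, 0.3])
h = residuals(x_true, np.zeros(3))[1:]
x_est = solve_newton(h, beta0=0.58)
```

As noted in Step 1, convergence depends on the initial $\beta^{(0)}$; the sketch makes no attempt to retry with other starting values or to detect a singular Jacobian.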
The values of the glottal reflection coefficient were set to be 1, 0.99, and 0.95. In the MRI measurements, the sectional length of the tube model is 0.396825 cm. Therefore, the signal sampling rate for GVTFs and VTFs is $F_s = Mc/(2L_{VT})$ = 44.1 kHz. Thus, the observable frequency range of GVTFs and VTFs is from 0 to 22.05 kHz.

From the given VTAFs of /a/, /i/, /u/, Id and /o/, the frequency responses of the lip reflection coefficients and of their corresponding IIR models $r_{lip}(z)$ are calculated, and are plotted using broken and solid lines in (a) of Figs. 6.4-6.8, respectively. The frequency responses of the constructed VTFs, and of the GVTFs with $r_g$ = 0.99 and 0.95, are plotted in solid, dotted and broken lines in plot (b) of Figs. 6.4-6.8, respectively. For /a/, /i/, /u/, Id and /o/, the VTAFs from MRI are plotted using dots, and the VTAF estimates derived from the VTFs, and from the GVTFs with $r_g$ = 0.99 and 0.95, are plotted using squares, crosses, and diamonds in plot (c) of Figs. 6.4-6.8, respectively. From these simulations we find that:

• For all vowels, the VTAFs can be correctly recovered from the VTFs using the method in Section 6.5, as shown by the co-centered squares and dots. This means that the algorithm in Section 6.5 is mathematically correct.

• For vowels with large lip-opening areas, such as /a/ and Id, reasonable VTAF estimates can be recovered from GVTFs using the method developed in Section 6.5 if $r_g$ is greater than 0.95.

• For vowels with smaller lip-opening areas, such as /i/, /u/ and /o/, to obtain reasonable VTAF estimates from GVTFs using the method in Section 6.5, $r_g$ needs to be greater than 0.9995.

• The difference in the $r_g$ required to obtain reasonable VTAF estimates of different sounds indicates that VTAF estimates corresponding to smaller lip openings are more distorted by incomplete glottal closures than those corresponding to large lip openings. This is explained as follows. As shown in Eq. (3.5), assuming the trachea impedance is negligible compared to the glottal impedance, the GVTF and its corresponding VTF are related as:

$$H_{GVTF}(f) = H_{VTF}(f)/(1 + Z_{VT}/Z_g)$$

It can be seen that the difference between $H_{GVTF}$ and $H_{VTF}$ is determined by the ratio $Z_{VT}/Z_g$. Equation (3.64) and the simulations in Section 4.6 tell us that $Z_{VT}$ resonates more strongly when the lip opening is smaller, because a smaller lip opening has a smaller radiation resistance, which leads to smaller damping in the resonances of a VTF. Thus, for the same glottal area, the smaller the lip opening, the greater $Z_{VT}/Z_g$ can become, and the more $H_{GVTF}$ differs from $H_{VTF}$. As a result, the VTAF estimate derived from $H_{GVTF}$ differs more from that derived from $H_{VTF}$.

We note that to obtain the VTAF estimate from a vowel sound without the distortion effects of an incomplete glottal closure and of the frequency-dependent lip reflection coefficient, one needs to know the lip-opening area. This is explained in the following. Corresponding to an incomplete glottal closure, the estimate of the $H_{VTF}(z)$ denominator shown in Eq. (6.38) becomes an estimate of the $H_{GVTF}(z)$ denominator, for which $r_g, r_1, \ldots, r_{M-1}$, $\alpha$, and $\beta$ are unknown. Assume $r_g$ is simply a real number. Then the denominator of $H_{GVTF}(z)$ in Eq. (3.59) is an $(M+1)$th-order polynomial in $z^{-1}$ with a leading term of unity. Equating the denominator in Eq. (3.59) and the denominator estimate in Eq. (6.38), one can construct $M+1$ equations relating $h_1, \ldots, h_{M+1}$ to $r_g, r_1, \ldots, r_{M-1}$, $\alpha$, and $\beta$ in Eq. (3.59). Clearly, the number of unknowns is larger than the number of equations, i.e., $M+2 > M+1$. To determine the $M+2$ unknowns, and then the VTAF, one more constraint is needed. Obviously, measuring the lip-opening area to determine $r_{lip}$ is more feasible than measuring $r_g$ or any of $r_1, \ldots, r_{M-1}$. Measuring the lip-opening area and eliminating the effects of both the unknown incomplete glottal closure and the frequency-dependent $r_{lip}$ on VTAF estimates are left for future work.
108 6.7 S u m m a r y Obtaining high-resolution and accurate VTAF estimates from speech signals requires that: 1) the speech signals used cover a wide frequency range, 2) the effects of open glottises on the VTAF estimates be avoided, and 3) the effects of frequency-dependent lip boundary conditions on the VTAF estimates be eliminated. In this chapter, based on the concepts of VTFs and GVTFs established in chapter 3, we reformulated and compared existing methods for obtaining VTAFs from speech signals, and developed a new method for deriving VTAFs from VTF estimates obtained over completely closed glottal phases. If the VTF estimates contain the effects of incomplete glottal closures, this method can still obtain reasonable VTAFs from the VTF estimates that correspond to large lip openings and sufficiently strong glottal reflection coefficients. To obtain more accurate VTAF estimates from VTF estimates that contain the effects of incomplete glottal closures, both distortion effects of incomplete glottal closures and of frequency-dependent lip reflection coefficients on VTAF estimates need to be eliminated. In such cases, acoustic signals alone are not enough for determining VTAFs, because the number of unknowns is lager than the number of constraints. Therefore, measurements of lip-opening areas are required. 109 Frequency responses of r | i p(f) (..)and r | | p (z) (-) 0 -20 co -o -40 -60 kHz F requency responses of H V T F ( z ) (-), H G V T F ( z ) with r =0.99 (..), 0.95 (- -) (a) Jj j 1 1 1 1 l i t ! « A l 1 —' / ~- - ~ V _ 1 1 1 -i 1 1 1 -(b) 8 10 12 14 16 18 20 22 kHz T h e original V T A F and its est imates from V T F and G V T F s (c) 6 8 10 12 Dis tance from the glottis (cm) 14 16 Fig. 6.4. 
Vowel lal: (a) the frequency responses of riip(dotted line) and its IIR model (solid line); (b) the frequency responses of synthetic VTF (solid line) and GVTF with rg=0.99 (dotted line), 0.95 (broken line); (c) VTAF from MRI (dots), its estimates from VTF (•) and GVTFs with rg=0.99 (x), 0.95 (0). 110 Frequency responses of r | j p(f) (..)and r | i o (z) (-) kHz F requency responses of H V T F ( z ) (-), H Q V T F ( z ) with r a =0.99 (..), 0.95 ( (a) (b) T h e original V T A F and its est imates from V T F and G V T F s 0.8 0.6 0.4 0.2 . s a E H ® B B $ 0 O 0 0 0 $ O O O O 0 Q O O < > i B o 0<> 0 , B a IF 0 0 0 B ft BBm [^iinBasiS5i??o<^^i, (c) 6 8 10 12 Dis tance from the glottis (cm) 14 16 Fig. 6.5.Vowel (a) the frequency responses of riiP(dotted line) and its IIR model (solid line); (b) the frequency responses of synthetic VTF (solid line) and GVTF with rg=0.99 (dotted line), 0.95 (broken line); (c) VTAF from MRI (dots), its estimates from VTF (•) and GVTFs with rg=0.99 (x), 0.95 (0). i l l Frequency responses of r | j p(f) (..)and r | i p (z) (-) (a) kHz Frequency responses of H (z) (-), H (z) with r =0.99 (..), 0.95 (- -(b) T h e original V T A F and its est imates from V T F and G V T F s 0.8 0.6 0.4 0.2 B B N , 0 0 0 ° agn 1 1 S o . o° ° 3 o^o ° 0 < > 0 o ^ B (c) 6 8 10 12 Dis tance from the glottis (cm) 14 16 18 Fig. 6.6. Vowel lul: (a) the frequency responses of riiP(dotted line) and its IIR model (solid line); (b) the frequency responses of synthetic VTF (solid line) and GVTF with rg=0.99 (dotted line), 0.95 (broken line); (c) VTAF from MRI (dots), its estimates from VTF (•) and GVTFs with rg=0.99 (x), 0.95 (0). 112 Frequency re sponses of r | i p(f) (..)and r | j p (z) (-) kHz F requency responses of H (z) (-), H (z) with r =0.99 (..), 0.95 (- -(a) T h e original V T A F and its est imates from V T F and G V T F s V (c) 6 8 10 Dis tance from the glottis (cm) Fig. 6.7. 
Vowel /e/: (a) the frequency responses of r_lip (dotted line) and its IIR model (solid line); (b) the frequency responses of the synthetic VTF (solid line) and the GVTF with r_g=0.99 (dotted line), 0.95 (broken line); (c) VTAF from MRI (dots), its estimates from the VTF (•) and the GVTFs with r_g=0.99 (x), 0.95 (◊).

[Figure 6.8: panels (a)-(c); see caption.]

Fig. 6.8. Vowel /o/: (a) the frequency responses of r_lip (dotted line) and its IIR model (solid line); (b) the frequency responses of the synthetic VTF (solid line) and the GVTF with r_g=0.99 (dotted line), 0.95 (broken line); (c) VTAF from MRI (dots), its estimates from the VTF (•) and the GVTFs with r_g=0.99 (x), 0.95 (◊).

7 Results and Discussions: Glottal Waves and Vocal-Tract Area Functions from Vowel Sounds

7.1 Introduction

In the previous chapters, the problems with existing methods for estimating glottal waves and VTAFs from speech signals have been elucidated, the related concepts have been clarified, and new methods for estimating glottal phases, VTFs, glottal waves, and VTAFs from speech signals have been developed. This chapter applies these concepts and methods to estimate glottal phases, glottal waves, and VTAFs from vowel sounds. The main purposes of this chapter are to present the results obtained using our methods, to discuss and explain the results, and to demonstrate the effectiveness of our methods.

7.2 Recording and De-Noising Speech Signals

Six male and five female adult subjects were invited to produce non-nasalized sustained vowel sounds, each lasting 2-3 seconds, with the vocal tract, loudness and pitch remaining unchanged.
The sounds were recorded in a sound-controlled booth in the UBC Interdisciplinary Speech Research Lab. To validate the glottal phases and the glottal waves estimated from these sounds, synchronous EGG signals were also recorded. The speech sound and the synchronized EGG signals were digitized using a Kay Elemetrics CSL 4400 and were recorded using a computer. The procedure for recording sounds using a computer is described in [Liu and Wang, 2002]. The distance from the lips of the subjects to the microphone was 30.5 cm (12 inches). The sampling rate of each signal was Fs=44.1 kHz. Thus, the time delay from the lips to the microphone is Δ=38 sampling periods. Before calculating the estimates, the recorded signals were first de-noised using a wavelet-transform-based method [Deng, 2002]. The speech and EGG signals were decomposed into 8 layers in terms of wavelet coefficients. For a sampling rate of 44.1 kHz, the wavelet coefficients in layer A8 are in the frequency range 0-86 Hz. The wavelet coefficients in the frequency ranges lower than the fundamental frequencies of the speakers are due to the background noise, and are set to zero. The de-noised signals are then obtained by inverse transformation of the remaining wavelet coefficients.

7.3 Steps for Obtaining Glottal Waves and VTAFs from Speech Signals

Our methods for estimating glottal phases, glottal waves and VTAFs from speech signals have been described in detail in previous chapters, and are not repeated in this chapter. This section describes the main steps for obtaining glottal waves and VTAFs from speech signals.

Step 1: Record sustained vowel sounds produced by different subjects.

Step 2: De-noise the recorded signal and obtain the clean signal p_mic(n) [Deng, 2002].

Step 3: Measure the pitch period Ti of the speech signal p_mic(n), and estimate the derivative of the glottal source signal u'_sc(n-M/2-Δ), and the GVTF, as shown in Section 5.3.
The order of the GVTF is determined according to the average length of the vocal tract. For the sound /a/, the length of the vocal tract is approximately 14.5 cm for female and 17.5 cm for male adult subjects. Thus, the order of the GVTFs is M+2 = 2 L_VT Fs/c + 2 = 39 for the female subjects, and 46 for the male subjects.

Step 4: Identify the glottal closure instants, and the duration of the closed glottal phase N_c, according to the derivative of the glottal source signal u'_sc(n-M/2-Δ) obtained in Step 3, as described in Section 5.5.

Step 5: Calculate the formant frequencies of the GVTF (i.e., the frequencies of the poles of the GVTF), and determine the vocal-tract length using the formula L_VT = (2i-1)c/(4F_i), where F_i is the i-th formant frequency of the GVTF [Wakita, 1977]. The average of the lengths estimated from the 3rd through the 17th formant frequencies of the GVTF is taken as the vocal-tract length L_VT.

Step 6: Determine the order of the VTF: M+1 = 2 L_VT Fs/c + 1.

Step 7: Estimate the VTF over closed glottal phases:
a) Construct the vectors and matrices of Eq. (5.13) using the speech signal samples in 2 adjacent closed glottal intervals, as shown in Fig. 5.1; if the closed glottal interval is short, then 2 more sub-segments are used to construct the corresponding vectors and matrices of Eq. (5.19);
b) Solve for A^(k) = [a_1, ..., a_{M+1}]^T using Eq. (5.14) or Eq. (5.20);
c) If the filter 1/(1 - a_1 z^{-1} - ... - a_{M+1} z^{-(M+1)}) is stable, save A^(k) for later averaging; otherwise, discard it;
d) Repeat a) to c) for the next 2 adjacent closed glottal intervals, if available;
e) Average the obtained estimates: A = (A^(1) + A^(2) + ... + A^(K))/K;
f) Obtain the estimate of the denominator of H_VTF(z): H(z) = 1 - A(1)z^{-1} - ... - A(M+1)z^{-(M+1)}.

Step 8: Obtain the derivative of the glottal waveform u'_g(n-M/2-Δ) by filtering p_mic(n) using the filter H(z) = 1 - A(1)z^{-1} - ... - A(M+1)z^{-(M+1)}, as shown in Section 5.6.
Step 9: Obtain the glottal waveform u_g(n-M/2-Δ) by integrating the above-obtained derivative of the glottal wave using the filter 1/(1 - z^{-1}).

Step 10: Derive the VTAF from H(z) = 1 - A(1)z^{-1} - ... - A(M+1)z^{-(M+1)}, using the algorithm developed in Section 6.5.

7.4 Results

The glottal waves and the VTAFs obtained for the large-lip-opening vowel sounds /a/ produced by 5 female and 6 male subjects are shown in Figs. 7.1-7.11. To show the distortion effects of glottal losses on the estimates of glottal waves and VTAFs from small-lip-opening vowel sounds, the results for the vowel sounds /i/ produced by 2 male subjects and 1 female subject are plotted in Figs. 7.12-7.14. In each figure, there are 6 plots (a)-(f).

Plot (a): the de-noised speech signal p_mic(n), labeled as pmic(n); the bold solid lines mark the signal segments used for estimating the VTF.

Plot (b): solid line: the estimate of the delayed derivative of the glottal source signal u'_sc(n-M/2-Δ), labeled as dusc(n-M/2-d); dotted line: the delayed EGG signal, labeled as egg(n-M/2-d); the bold solid lines mark the closed glottal intervals.

Plot (c): the estimate of the delayed derivative of the glottal wave u'_g(n-M/2-Δ), labeled as dug(n-M/2-d).

Plot (d): the estimate of the glottal waveform u_g(n-M/2-Δ), labeled as ug(n-M/2-d).

Plot (e): solid line: the frequency response of the all-pole part of the VTF estimate, 1/(1 - Σ_{m=1}^{M+1} A(m) z^{-m}); dotted line: the frequency response of the all-pole part of the GVTF estimate, 1/(1 - Σ_{m=1}^{M+2} B(m) z^{-m}).

Plot (f): solid line: the VTAF estimate derived from the VTF estimate; dotted line: the VTAF measured using MRI for an unknown male subject.
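The numeric choices in Sections 7.2-7.3 (the lip-to-microphone delay, the filter orders of Steps 3 and 6, the length formula of Step 5) and the filtering in Steps 8 and 9 can be illustrated with a minimal sketch. It is not the thesis implementation. The speed of sound c = 350 m/s is an assumption (it is not stated in this chapter, but it reproduces the quoted values 38, 39 and 46), all function names are hypothetical, and the delays M/2 and Δ in the signal indices are ignored.

```python
# Sketch of the numeric choices in Sections 7.2-7.3 and of Steps 8-9.
# ASSUMPTION: C = 350 m/s is not given in this chapter; it is used here
# because it reproduces the quoted delay (38) and GVTF orders (39, 46).
C = 350.0      # assumed speed of sound, m/s
FS = 44100.0   # sampling rate, Hz (Section 7.2)


def lip_to_mic_delay(distance_m, fs=FS, c=C):
    """Delay from the lips to the microphone, in whole sampling periods."""
    return round(distance_m * fs / c)


def filter_order(tract_length_m, extra, fs=FS, c=C):
    """Order 2*L*Fs/c + extra (extra=2 for the GVTF, extra=1 for the VTF)."""
    return round(2.0 * tract_length_m * fs / c + extra)


def tract_length_from_formants(formants_hz, first=3, last=17, c=C):
    """Step 5: average of L = (2i-1)c/(4F_i) over formants first..last."""
    lengths = [(2 * i - 1) * c / (4.0 * formants_hz[i - 1])
               for i in range(first, last + 1)]
    return sum(lengths) / len(lengths)


def inverse_filter(p_mic, a):
    """Step 8: apply H(z) = 1 - a[0]z^-1 - ... - a[-1]z^-len(a) to p_mic."""
    out = []
    for n in range(len(p_mic)):
        acc = p_mic[n]
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:          # samples before the start are taken as zero
                acc -= a_k * p_mic[n - k]
        out.append(acc)
    return out


def integrate(dug):
    """Step 9: running sum, i.e. the filter 1/(1 - z^-1)."""
    ug, total = [], 0.0
    for x in dug:
        total += x
        ug.append(total)
    return ug
```

Under the assumed c, `lip_to_mic_delay(0.305)` gives 38, and `filter_order(0.145, 2)` and `filter_order(0.175, 2)` give 39 and 46, matching the values quoted in Section 7.2 and Step 3.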
In order to compare the VTAF estimates obtained using our method with the VTAF obtained using MRI, the VTAF estimates are divided by their maximal cross-sectional areas; the VTAF from MRI is also divided by its maximal cross-sectional area, and is normalized to have the same lengths as those of the VTAF estimates for different subjects.

In the following sections, we will validate the estimated glottal phases using the synchronized EGG (electroglottograph) signals, discuss the glottal waves obtained, and compare the VTAFs derived from the VTF estimates using our method with those measured from the magnetic resonance images of an unknown male subject.

[Figure 7.1: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=39.]

Fig. 7.1. The results from /a/ by female subject M.

[Figure 7.2: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=39.]

Fig. 7.2. The results from /a/ by female subject H.
[Figure 7.3: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=41.]

Fig. 7.3. The results from /a/ by female subject W.

[Figure 7.4: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=39.]

Fig. 7.4. The results from /a/ by female subject K.

[Figure 7.5: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=43.]

Fig. 7.5. The results from /a/ by male subject L.
[Figure 7.6: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=46.]

Fig. 7.6. The results from /a/ by male subject Y.

[Figure 7.7: plots (a)-(f) as described in Section 7.4.]

Fig. 7.7. The results from /a/ by male subject D.

[Figure 7.8: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=46.]

Fig. 7.8. The results from /a/ by male subject G.

[Figure 7.9: plots (a)-(f) as described in Section 7.4.]

Fig. 7.9. The results from /a/ by male subject A.
[Figure 7.10: plots (a)-(f) as described in Section 7.4.]

Fig. 7.10. The results from /a/ by male subject R.

[Figure 7.11: plots (a)-(f) as described in Section 7.4.]

Fig. 7.11. The results from /a/ by female subject Z.

[Figure 7.12: plots (a)-(f) as described in Section 7.4.]

Fig. 7.12. The results from /i/ by male subject D.

[Figure 7.13: plots (a)-(f) as described in Section 7.4.]

Fig. 7.13. The results from /i/ by male subject G.
[Figure 7.14: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=39.]

Fig. 7.14. The results from /i/ by female subject H.

7.5 Validation of the Estimates of Glottal Phases

For each sound, the derivative of the glottal source signal u'_sc(n-M/2-Δ) discussed in Section 5.3 was first estimated, as shown by the solid curve labeled as dusc(n-M/2-d) in plot (b) of each figure. We identify the glottal phases using this signal. Within one pitch period, the glottal closure instant was identified as the time when the dusc(n-M/2-d) waveform reached its maximal negative peak; the opening glottal phase was identified as the interval when dusc(n-M/2-d) remained positive; the closing glottal phase was identified as the interval when dusc(n-M/2-d) was negative and decreasing; and the closed glottal phase was identified as the interval after the glottal closure instant and before the opening glottal phase.

The glottal phases identified using the dusc(n-M/2-d) waveform were validated using the EGG signal. The EGG signal was delayed by M/2+Δ samples to synchronize it with the dusc(n-M/2-d) signal. In this study, decreasing EGG waveforms display increasing vocal-fold contact. The EGG samples at the glottal closure instants identified using the dusc(n-M/2-d) waveform are marked by circles. These circles appear at the instants corresponding to instant 7 in the Rothenberg model (see Fig. 2.6). Thus, the glottal closure instants identified using our method are accurate.
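The phase-identification rules stated above can be sketched as a simple labelling routine. This is only an illustration of the stated rules and not the detection method of Section 5.5. The function name is hypothetical, the input is assumed to be exactly one pitch period of the dusc waveform starting at the onset of the opening phase, and noise handling is ignored.

```python
def label_glottal_phases(dusc):
    """Label each sample of ONE pitch period of the dusc waveform.

    ASSUMPTION: the window starts at the onset of the opening phase, so
    every sample after the closure instant belongs to the closed phase.
    """
    # Glottal closure instant: maximal negative peak of dusc.
    gci = min(range(len(dusc)), key=lambda n: dusc[n])
    labels = []
    for n, value in enumerate(dusc):
        if n == gci:
            labels.append("gci")
        elif n > gci:
            labels.append("closed")     # after the GCI, before the next opening
        elif value > 0:
            labels.append("opening")    # dusc remains positive
        elif n > 0 and value < 0 and value < dusc[n - 1]:
            labels.append("closing")    # dusc is negative and decreasing
        else:
            labels.append("unlabeled")  # not covered by the stated rules
    return labels
```

For a toy period such as `[1.0, 2.0, 1.0, -1.0, -3.0, -6.0, -0.5, -0.2, 0.0]`, the routine marks the first three samples as opening, the next two as closing, the sample with value -6.0 as the closure instant, and the rest as closed.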
From the egg(n-M/2-d) signal for each sound, one can see that, during the intervals corresponding to the closed glottal phases identified using the signal dusc(n-M/2-d), the amplitude of egg(n-M/2-d) first decreases rapidly, then remains at its lowest level for a short time, and then increases from its lowest level. During the intervals corresponding to the identified opening glottal phases, the egg(n-M/2-d) signal first keeps increasing, and then remains at its highest level. During the intervals corresponding to the identified closing glottal phases, the egg(n-M/2-d) signal first remains at its highest level and then decreases. The EGG samples at the instants when the vocal folds contact each other at the highest speed (i.e., when the derivative of the egg(n-M/2-d) signal reaches its negative peaks) are marked by stars. In each cycle, the glottal closure instant shortly precedes the instant when the most rapid vocal-fold contact occurs. The above-obtained time-line relationship between the identified glottal phases and the EGG signal reflects the theoretical model relating the glottal phases and the EGG signal given in the Rothenberg model (see Fig. 2.6). This confirms that the glottal phases are correctly identified using our method.

It is interesting to note that, in a glottal cycle, there is a time lag between the glottal closure instant and the instant when the most rapid vocal-fold contact occurs. The time lags are 0 to 6 sampling periods for different subjects. These time lags can be converted to the traveling distance of the mucosal waves along the vocal folds, given the speed of the mucosal wave and the sampling rate Fs of the speech signal. It is known that the mucosal wave speed is in the range of 0.29-1.18 m/s [Wenokur, et al. 1993]. Thus, time lags of 0 to 6 sampling periods correspond to distances from 0 up to about 39.5-160.5 µm along the vocal folds, depending on the mucosal wave speed.
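The conversion from a time lag to a travelling distance is a single multiplication, distance = (lag / Fs) x speed. A short worked check follows, using the sampling rate and the speed range quoted above; the function and constant names are illustrative only.

```python
FS = 44100.0                 # sampling rate, Hz (Section 7.2)
SPEED_RANGE = (0.29, 1.18)   # mucosal wave speed, m/s [Wenokur, et al. 1993]


def lag_to_distance_um(lag_samples, speed_m_per_s, fs=FS):
    """Distance in micrometres travelled during lag_samples sampling periods."""
    return lag_samples / fs * speed_m_per_s * 1e6


for lag in (0, 6):
    low, high = (lag_to_distance_um(lag, v) for v in SPEED_RANGE)
    print(f"lag = {lag} samples: {low:.1f} to {high:.1f} um")
```

For the largest observed lag of 6 sampling periods, this gives roughly 39.5 µm at 0.29 m/s and 160.5 µm at 1.18 m/s.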
7.6 Discussion of the VTF and VTAF Estimates

Comparing the frequency responses of the VTF estimates (plots (e) of Figs. 7.1-7.14) with those of the GVTFs for /a/ and /i/ corresponding to glottal areas A_g=1 and 2 mm^2 (see plots (e) of Figs. 4.13, 4.14, 4.18, and 4.19), one can see that the VTF estimates obtained for most of the subjects contain the effects of incomplete glottal closures from about A_g=1 mm^2 to A_g=2 mm^2. As simulated in Section 6.5, even if glottises are not 100% closed, if the glottal reflection coefficients are large enough, reasonable VTAF estimates can still be obtained for large-lip-opening vowel sounds. Comparing the solid lines with the dotted lines in plots (f) of Figs. 7.1 to 7.11, one can see that the VTAFs estimated from the sounds /a/ produced by the 5 female and 6 male subjects are similar to the VTAF of /a/ measured using MRI of a male subject [Story and Titze, 1996]. The VTF estimate (plot (e) of Fig. 7.11) for female subject Z contains the effect of a larger incomplete glottal closure than the others, and thus the resulting VTAF (broken line in plot (f) of Fig. 7.11) is more degraded.

It is also noted that for most subjects, the vocal-tract lengths, and hence the orders of the VTFs, estimated for vowel /a/ from the 3rd through 17th formant frequencies of the GVTFs lead to good VTAF estimates that are similar to that measured using MRI. However, the estimated vocal-tract length for male subject R is 1 cm longer than the average value, and leads to an unreasonable VTAF estimate. When the vocal-tract length for male subject R is set to the average value, a reasonable VTAF is obtained, as shown by the solid line in plot (f) of Fig. 7.10. This means that the method for estimating the vocal-tract length using formant frequencies [Wakita, 1977] works most of the time, but not always.

The results obtained from the vowel sound /i/ produced by male subject D, male subject G, and female subject H are shown in Figs. 7.12-7.14.
It can be seen that the first (lowest) formants of the two VTF estimates shown by the solid lines in plots (e) of Figs. 7.12-7.13 are flatter than that of the VTF for /i/ shown in plot (b) of Fig. 4.8, i.e., the two VTF estimates contain the effect of glottal losses. As a result, the VTAF estimates obtained for male subjects D and G are distorted, just as simulated in Section 6.6 and shown by the diamond plot in (c) of Fig. 6.5. In contrast, the first formant of the VTF estimate for female subject H (see the solid line in plot (e) of Fig. 7.14) is sharper than those for male subjects D and G, i.e., the VTF estimate for female subject H contains less glottal loss, resulting in a VTAF estimate similar to that of /i/ measured from the MRI of an unknown male subject (the dotted line in plot (f) of Fig. 7.14).

The agreement of the distorted results (the solid lines in plots (f) of Figs. 7.12-7.13) obtained from small-lip-opening vowel sounds with that (the diamond line in plot (c) of Fig. 6.5) obtained from the simulation implies that our VTF models are realistic. The agreement of the VTAFs obtained using our method from large-lip-opening vowel sounds with that measured from MRI implies that our methods for obtaining estimates of VTFs and of VTAFs from vowel sounds are accurate. To eliminate the distortions in the VTAF estimates corresponding to small-lip-opening sounds, it is necessary to eliminate the effects of glottal losses in the VTF estimates.

In deriving the VTAF, the viscous loss, heat conduction, and the vocal-tract wall loss are accounted for as part of the effect of the lip radiation impedance in the VTF estimate. Thus, the estimated lip-opening areas are slightly larger than the actual ones, especially if the VTF estimate contains the effect of glottal loss.
The differences between the VTAF estimate obtained using our method and the VTAF obtained using MRI can be mainly ascribed to the effect of the incomplete glottal closure, the individual differences between vocal-tract shapes, and how well the sound is sustained. To obtain good VTF estimates, it is desirable for the subjects to produce sustained vowel sounds at a constant low pitch and loudness. To obtain more accurate VTAF estimates from the VTF estimates, it is necessary to measure the lip-opening areas and eliminate the distortion effects of the incomplete glottal closures on the VTAF estimates.

7.7 Discussion of the Estimates of Glottal Waves

The VTF estimates in Figs. 7.1-7.10 contain very limited glottal losses and can yield good VTAF estimates. Thus, the glottal-wave estimates obtained via inverse filtering of the speech sounds contain very limited residual resonance of the VTFs. These estimates of glottal waves contain detailed information, especially over closed glottal phases, which cannot be correctly obtained using previous methods.

Over open glottal phases, some derivatives of the glottal waveforms (see plots (c) of Figs. 7.1-7.4) are similar in shape to the LF parametric model, whereas some (see plots (c) of Figs. 7.5-7.10) have double-peak structures, which are explained by the source-tract interaction [Fant, 1986].

Over closed glottal phases, none of the glottal waves or their derivatives is zero. We observed that each obtained derivative glottal waveform displays single or multiple positive peaks over the short interval of the vocal-fold collision (when the vocal-fold contact area was increasing, as indicated by the steeply decreasing EGG signal), as shown in plots (c) of Figs. 7.1-7.10. These positive peaks could be due to the glottal chink (an opening in the posterior glottis) and the vocal-fold collision.
It is known from simulations [Cranen and Schroeter, 1996] that a moderate glottal chink leads to source-tract interaction, resulting in ripples in the glottal waveform right after the glottal closure instant. It is also known that, during phonation, compression and rarefaction tissue waves propagate along the vibrating membranous vocal folds, and that when the lower margins of the vocal folds are in contact, their upper margins are still apart [Berke and Gerratt, 1993]. As the rarefaction of the tissue wave travels the vertical extent of the contacting vocal folds, the vocal-fold contact area increases rapidly (see the steeply decreasing EGG signals), and the air between the folds is squeezed into the vocal tract [Titze, 1986]. Although the squeezed airflow may be very limited, its derivative can have large positive values during the rapid fold collision. One might also be concerned that the VTF estimates contain the effects of finite glottal impedances, and that the derivative glottal waveforms contain resonance of the VTFs according to Eq. (5.22). However, simulations in Section 5.6 show that such effects result in ripples over the whole glottal cycle, not just in the interval of the vocal-fold collision. Therefore, for the cases shown in plots (c) of Figs. 7.1-7.10, the positive peaks during the vocal-fold collision cannot be purely the effect of the residual resonance of the VTF.

It is noted that the derivatives of glottal waves for male subjects exhibit relatively greater positive peaks during vocal-fold collisions than those of female subjects. This could be explained by the anatomic differences in the larynxes of different genders. As mentioned in Section 2.1.1, vocal folds of the male have more rectangular coronal sections than those of the female. Thus, the space between the colliding vocal folds (see 6-7 of Fig. 2.4) for the male is larger than that for the female.
Consequently, the derivatives of the glottal waves caused by the fold collisions of the male are greater than those of the female.

The glottal waveforms obtained are labeled as ug(n-M/2-d), as shown in plots (d) of Figs. 7.1-7.10. We observed that during the interval of vocal-fold collision, the glottal waves gained some increments. In contrast, as the vocal folds part from their lower margins toward the upper margins, as indicated by the increasing EGG signals, some glottal waveforms decrease monotonically, as shown in plots (d) of Figs. 7.2, 7.5-7.9, and some remain constant or even increase, as shown in plots (d) of Figs. 7.3 and 7.4. This means that, after the vocal-fold contact areas became maximal, the glottises were not completely closed: some became more and more closed, and some even started opening.

7.8 Sensitivity of the Estimates to the Estimated Vocal-Tract Lengths

As noted in Section 7.6, vocal-tract lengths estimated from formant frequencies sometimes may not be accurate enough to yield good VTAF estimates. However, more accurate estimation of vocal-tract lengths requires a neural network trained using a large number of training formant frequencies [Dusan, 2000]. This section shows, via experiments, the sensitivity of the estimates of glottal waves (and their derivatives) and VTAFs to the estimates of vocal-tract lengths (or the estimated VTF orders). We compare the results (Figs. 7.7 and 7.10) with the results (Figs. 7.15, 7.16 and Figs. 7.17, 7.18) obtained from the sounds /a/ produced by male subjects D and R, given M+1=43 and 48, which are smaller and larger than the average value 45 by about 5%. The comparison shows that for M+1=43 and 48, the obtained glottal waveforms (and their derivatives) for the same subject are almost the same as those for M+1=45, but the VTAF estimates exhibit truncations and extensions relative to those for M+1=45.
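To make the size of the perturbation concrete, the order relation of Step 6, M+1 = 2 L_VT Fs/c + 1, can be inverted to give the tube length implied by each VTF order. The sketch below is a worked check only. The speed of sound c = 350 m/s is an assumption not stated in this chapter, and the function names are hypothetical.

```python
C = 350.0      # ASSUMED speed of sound, m/s (not stated in this chapter)
FS = 44100.0   # sampling rate, Hz (Section 7.2)


def tube_length_cm(vtf_order, fs=FS, c=C):
    """Invert M+1 = 2*L*Fs/c + 1 for the tube length L, in centimetres."""
    return (vtf_order - 1) * c / (2.0 * fs) * 100.0


def relative_change(order, reference=45):
    """Relative change of the VTF order with respect to the reference order."""
    return (order - reference) / reference


for order in (43, 45, 48):
    print(order, round(tube_length_cm(order), 2), f"{relative_change(order):+.1%}")
```

Under this assumption, the orders 43 and 48 correspond to tube lengths of roughly 16.7 cm and 18.7 cm, against about 17.5 cm for M+1=45, i.e. order changes of about -4% and +7%.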
When M is decreased (which means the tube model is shorter than the actual vocal tract), the relative position of the estimated maximum cross section of the oral cavity is shifted toward the lip end, and the overall VTAF is almost unchanged. But when M is increased (which means the tube model is longer than the subjects' vocal tracts), the relative positions of the estimated maximum cross sections of their oral cavities are shifted toward the glottal ends, and additionally the estimated lip-opening areas become unreasonably large.

[Figure 7.15: plots (a)-(f) as described in Section 7.4.]

Fig. 7.15. The results from /a/ by male subject D, given M+1=43.

[Figure 7.16: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=48.]

Fig. 7.16. The results from /a/ by male subject D, given M+1=48.

[Figure 7.17: plots (a)-(f) as described in Section 7.4.]

Fig. 7.17. The results from /a/ by male subject R, given M+1=43.

[Figure 7.18: plots (a)-(f) as described in Section 7.4; plot (e) is annotated M+1=48.]

Fig. 7.18. The results from /a/ by male subject R, given M+1=48.

7.9 Summary

The estimates of glottal waves obtained using our method contain detailed information over closed glottal phases. During vocal-fold collision, the glottal waves increase; during vocal-fold parting, they decrease or even increase. Also, the glottal phases detected from the vowel sounds using our methods are correct. The VTAF estimates obtained from large-lip-opening vowel /a/ sounds produced by male and female subjects using our method are similar to that measured from the magnetic resonance image of an unknown male subject. This means that our estimation of VTAFs is accurate. The distorted VTAFs obtained for small-lip-opening vowel /i/ sounds are similar to the simulated results. This implies that our model for VTFs and our simulations are realistic. To eliminate the degrading effects of glottal losses on the VTAF estimates, the lip-opening areas must be known.

8 Conclusions and Future Work

This thesis develops methods for obtaining accurate estimates of glottal waves and VTAFs from vowel sounds. This study faces the challenges of two ill-defined inverse problems: 1) obtaining glottal waves from vowel sounds without knowing VTFs; and 2) deriving VTAFs from vowel sounds without knowing vocal-tract boundary conditions.
In the following two sections, we summarize the contributions of this study, and suggest future research directions.

8.1 Contributions of This Thesis

The concept of VTFs is now distinguished from that of GVTFs. A VTF contains the effects of the vocal tract and the lip radiation impedance, whereas a GVTF contains not only the effects of the vocal tract and lip radiation impedance, but also the effect of a glottal opening. Only the VTF should be used in obtaining the glottal wave and the VTAF. If the glottis is not completely closed over closed glottal phases, the VTF estimate obtained over closed glottal phases is in fact equal to a GVTF corresponding to the incomplete glottal closure. The difference between the VTF estimate and the actual VTF is determined by the ratio of the vocal-tract driving-point impedance to the glottal impedance. We also found that a time-varying glottal area introduces the glottal closing resistance, which becomes positive when the glottal area is decreasing, and becomes negative when the glottal area is increasing. We model the lip radiation impedance to be between that of a piston in an infinite baffle and that of a piston in an unflanged pipe. The transfer function of the VTF estimate containing the effects of an incomplete glottal closure and the frequency-dependent lip-radiation impedance is modeled using the GVTF transfer function. The above was shown in Chapter 3.

The effects of incomplete glottal closures on VTF estimates are revealed by comparing the simulated GVTF and VTF. They may not be observed from the actual estimates, which contain noise effects. We find that, given the same incomplete glottal closure, a VTF estimate differs more from the actual VTF if the vowel sound corresponds to a smaller lip-opening area. Also, we confirm that incomplete glottal closures increase formant frequencies and resonance bandwidths of VTF estimates, especially at low frequencies. These were shown in Chapter 4.
To overcome the difficulty of not knowing the glottal waveforms, we developed a method for obtaining accurate VTF estimates from sustained vowel sounds. We characterize the glottal waves for such sounds as periodically stationary processes. This assumption about glottal waves is more realistic than those used in previous methods. We convert the first ill-defined inverse problem into an over-determined system of linear equations in which the parameters of the VTF are the unknowns and the coefficients are constructed from the vowel-sound sub-segments over closed glottal phases. The estimates of the VTF parameters obtained from this system are proven to be unbiased. In addition, a new method for detecting glottal phases from sustained vowel sounds and a method for locating the sub-segments corresponding to the closed glottal phases are developed. Moreover, the effect of incomplete glottal closure contained in the VTF estimate on the estimate of the glottal wave is analyzed and simulated. These were presented in Chapter 5.

To overcome the difficulty of not knowing the vocal-tract boundary conditions in the VTAF estimation, we derive the VTAF from the VTF estimate obtained over closed glottal phases, so that the glottal boundary condition becomes known as r_g = 1. Given this condition, the parameters of the lip boundary condition and of the VTAF can then be derived simultaneously from the VTF estimate, and the VTAF estimate is free of the effect of the frequency-dependent lip boundary condition. If the VTF estimates contain the effects of incomplete glottal closures, our method can still obtain reasonable VTAF estimates if the vowel sound corresponds to a large lip opening and if the glottis is closed adequately. For speech sounds produced with incomplete glottal closures and small lip openings, the distortion effects of both the glottal opening and the lip opening on the VTAF estimates need to be eliminated to obtain reasonable VTAF estimates.
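The idea of stacking equations from several closed-phase sub-segments into one over-determined linear system can be sketched as follows. This is a minimal, generic closed-phase least-squares linear-prediction fit, assuming an all-pole filter and an excitation that is exactly zero inside the selected sub-segments. It is not the thesis's estimator of Eq. (5.14), which is not reproduced in this chapter and which allows non-zero glottal waves over closed phases. The function names, pole positions, pitch period and sub-segment positions are all hypothetical.

```python
import numpy as np

def all_pole_synthesis(excitation, a):
    """Filter `excitation` through 1 / (1 + a[0] z^-1 + ... + a[M-1] z^-M)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k * s[n - k]
        s[n] = acc
    return s

def fit_all_pole(s, segments, order):
    """Stack one prediction equation per sample of every sub-segment and
    solve the resulting over-determined system in the least-squares sense."""
    rows, targets = [], []
    for start, stop in segments:
        for n in range(max(start, order), stop):
            rows.append([-s[n - k] for k in range(1, order + 1)])
            targets.append(s[n])
    a_hat, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    return a_hat

# Hypothetical test signal: two resonances at 500 Hz and 1500 Hz, 8 kHz sampling.
fs = 8000.0
poles = [0.95 * np.exp(2j * np.pi * 500.0 / fs), 0.90 * np.exp(2j * np.pi * 1500.0 / fs)]
poles = poles + [np.conj(p) for p in poles]
a_true = np.real(np.poly(poles))[1:]          # coefficients a_1 ... a_4

period, n_periods = 80, 5
excitation = np.zeros(period * n_periods)
excitation[::period] = 1.0                    # one impulse per pitch period
speech = all_pole_synthesis(excitation, a_true)

# Assumed "closed-phase" sub-segments: samples 20..59 of every period.
segments = [(p * period + 20, p * period + 60) for p in range(n_periods)]
a_hat = fit_all_pole(speech, segments, order=len(a_true))

print("true coefficients:     ", np.round(a_true, 4))
print("estimated coefficients:", np.round(a_hat, 4))
```

Using several sub-segments rather than one simply adds rows to the system, which is what makes it over-determined even when each closed phase is short.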
In such cases, theoretically, the lip-opening areas must be known. These are presented in Chapter 6.

Experimental results in Chapter 7 show that the concepts, models and methods developed above for obtaining estimates of glottal waves and of VTAFs from vowel sounds are accurate. The VTAF estimates obtained using our method for the large-lip-opening vowel sounds /a/ produced by the 11 subjects are similar to the VTAF measured using the MRI method for an unknown male subject. The VTAFs derived from the VTF estimates of /i/ are similar to the simulated VTAF estimate distorted by an incomplete glottal closure. The differences between the VTAF estimates obtained using our method and the VTAF obtained using the MRI method for an unknown male subject can be ascribed to the differences between subjects, incomplete glottal closures, and measurement noise.

The estimates of glottal waves obtained by inverse filtering the vowel-sound signals using the VTF estimates obtained using our method contain more information than those obtained using previous methods. We found that the obtained glottal waves and their derivatives are not zero over closed glottal phases. In the short interval of the vocal-fold collision, the derivatives of some glottal waves exhibit positive peaks. As the vocal folds part from their lower margins toward their upper margins, some glottal waves monotonically decrease, while some start increasing. These non-zero glottal waves over closed glottal phases have been predicted in previous studies, but could not be correctly obtained using previous methods due to the limitations in the assumptions about the glottal waves.

Our methods for estimating glottal waves and VTAFs from speech signals are based on realistic assumptions about glottal waves and vocal-tract boundary conditions, and thus can yield more accurate results than existing methods.
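The inverse-filtering step mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the thesis's procedure: the filter is taken to be purely all-pole with known coefficients, the synthetic "speech" is the glottal-wave derivative passed through that filter, and a running sum stands in for the integration from glottal-wave derivative to glottal wave. The pulse shape and all numerical values are hypothetical.

```python
import numpy as np

def all_pole_synthesis(excitation, a):
    """Filter `excitation` through 1 / (1 + a[0] z^-1 + ... + a[M-1] z^-M)."""
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k * s[n - k]
        s[n] = acc
    return s

def inverse_filter(speech, a):
    """Apply the FIR inverse filter A(z) = 1 + a[0] z^-1 + ... + a[M-1] z^-M."""
    fir = np.concatenate([[1.0], np.asarray(a, dtype=float)])
    return np.convolve(speech, fir)[: len(speech)]

# Hypothetical glottal flow: a raised-cosine pulse over the first 50 samples
# of each 80-sample period, and zero for the rest of the period.
period, n_periods = 80, 5
n = np.arange(period)
flow_one_period = np.where(n < 50, 0.5 * (1.0 - np.cos(2.0 * np.pi * n / 50.0)), 0.0)
flow = np.tile(flow_one_period, n_periods)
flow_derivative = np.diff(flow, prepend=0.0)

# Hypothetical single-resonance all-pole filter (500 Hz at 8 kHz sampling).
r, theta = 0.95, 2.0 * np.pi * 500.0 / 8000.0
a = np.array([-2.0 * r * np.cos(theta), r * r])

speech = all_pole_synthesis(flow_derivative, a)
flow_derivative_hat = inverse_filter(speech, a)   # estimate of the derivative
flow_hat = np.cumsum(flow_derivative_hat)         # crude integration

print("max derivative error:", np.max(np.abs(flow_derivative_hat - flow_derivative)))
print("max flow error:      ", np.max(np.abs(flow_hat - flow)))
```

With the exact filter coefficients the excitation is recovered to numerical precision; with estimated coefficients, any error in the filter estimate appears directly in the recovered waveform, which is why the accuracy of the VTF estimate matters.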
8.2 Future Work

Although our methods can obtain more accurate estimates of glottal waves and VTAFs from speech signals than previous methods, further improvements are still needed. To obtain accurate estimates of VTAFs and glottal waves from vowel sounds produced with large incomplete glottal closures, a method for eliminating the effects of glottal loss contained in the VTF estimate is needed. This requires that the lip-opening areas be known. It is also highly desirable to obtain accurate glottal waveforms from vowel sounds produced by a larger number of subjects, in order to find the features in glottal waveforms and VTAFs related to gender, age, speaker identity, types of phonation, voice disorder, emotion, and so on. In addition, a method for blindly determining the vocal-tract length and the order of the VTF for a subject is needed.

Our methods have potential applications in many fields. In speaker identification, our methods can provide more details about the glottal waveforms and VTAFs than previous methods, and can improve the identification rate. In speech and singing synthesis, our methods can provide glottal waveforms and VTAFs of various speakers, making it possible to synthesize speech and singing sounds that resemble a particular subject. In speech and singing signal coding, knowing the features of the glottal waves and VTAFs can make the coding more efficient while preserving the naturalness of the original sounds. In speech pathology, our methods for obtaining glottal waves and VTAFs from speech signals are much more economical than those using MRI or X-ray. Moreover, the speech signal can be transmitted over the Internet, so that remote diagnosis of voice disorders becomes possible. In training deaf people or second-language learners to pronounce vowel sounds correctly, the estimated glottal waves and VTAFs can be presented by a computer as objective visual feedback to the subjects.
In the study of acoustical phonetics, VTAF estimates can help in understanding the oral configurations used in different languages.

References

[1] P. Alku, T. Bäckström, E. Vilkman, "Normalized amplitude quotient for parameterization of the glottal flow," The Journal of the Acoustical Society of America, vol. 112, no. 2, pp. 701-710, 2002.
[2] P. Alku, E. Vilkman, and A. Laukkanen, "Estimation of amplitude features of the glottal flow by inverse filtering speech pressure signals," Speech Communication, vol. 24, pp. 123-132, 1998.
[3] T. V. Ananthapadmanabha and G. Fant, "Calculation of true glottal flow and its components," Speech Communication, vol. 1, pp. 167-184, 1982.
[4] B. S. Atal and M. R. Schroeder, "Predictive coding of speech signals," The 6th International Congress on Acoustics, Tokyo, pp. C-5-4, 1968.
[5] B. S. Atal and L. Hanauer, "Speech analysis and synthesis by linear prediction of the speech wave," The Journal of the Acoustical Society of America, vol. 50, no. 2 (part 2), pp. 637-655, 1971.
[6] B. S. Atal, J. J. Chang, M. V. Mathews and J. W. Tukey, "Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer sorting technique," The Journal of the Acoustical Society of America, vol. 63, pp. 1535-1555, 1978.
[7] R. J. Baken and R. F. Orlikoff, Clinical Measurement of Speech and Voice. Singular Publishing Group, 2000.
[8] G. S. Berke and B. R. Gerratt, "Laryngeal biomechanics: An overview of mucosal wave mechanics," Journal of Voice, vol. 7, no. 2, pp. 123-128, 1993.
[9] D. Berry, "High-speed digital imaging of the medial surface of the vocal folds," The Journal of the Acoustical Society of America, vol. 110, pp. 2539-2547, 2001.
[10] B. Bozkurt, B. Doval, C. D'Alessandro, and T. Dutoit, "Zeros of Z-Transform (ZZT) decomposition of speech for source-tract separation," International Conference on Spoken Language Processing, Oct. 2004.
[11] D. G. Childers and C.
Ahn, "Modeling the glottal volume-velocity waveform for three voice types," Journal of the Acoustical Society of America, vol. 97, no. 1, pp. 505-519, 1995.
[12] C. H. Coker, M. H. Krane, B. Y. Reis and R. A. Kubli, "Search for unexplored effects in speech production," ICSLP 1996.
[13] B. Cranen and J. Schroeter, "Modeling a leaky glottis," Journal of Phonetics, vol. 23, pp. 165-177, 1995.
[14] B. Cranen and J. Schroeter, "Physiologically motivated modeling of the voice source in articulatory analysis/synthesis," Speech Communication, vol. 19, pp. 1-19, 1996.
[15] J. Dang and K. Honda, "Estimation of vocal tract shapes from speech sounds with physiological articulatory model," Journal of Phonetics, vol. 30, pp. 511-532, 2002.
[16] H. Deng, "Modeling the glottal wave and vocal-tract filter with speech signals de-noised using wavelet-based method," EECE 571q course project, University of British Columbia, April 2002.
[17] H. Deng, M. P. Beddoes, R. K. Ward, M. Hodgson, "Obtaining the vocal-tract area function from the vowel sound," Proceedings of Canadian Acoustic Week, Edmonton, Canada, 2003, pp. 40-41.
[18] H. Deng, M. P. Beddoes, R. K. Ward, M. Hodgson, "Estimating the glottal waveform and the vocal-tract filter from a vowel sound signal," Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM03), Victoria, Canada, 2003, pp. 297-300.
[19] H. Deng, M. P. Beddoes, R. K. Ward, M. Hodgson, "Estimating the vocal-tract area function and the derivative of the glottal wave from a speech signal," Proceedings of Eurospeech, Geneva, Switzerland, 2003, pp. 2437-2440.
[20] H. Deng, R. K. Ward, M. P. Beddoes, M. Hodgson, "Estimating vocal-tract area functions from vowel sound signals over closed glottal phases," Proceedings of IEEE ICASSP, Montreal, Canada, 2004, Vol. I, pp. 589-592.
[21] H. Deng, R. K. Ward, M. P. Beddoes, M.
Hodgson, "Effects of glottal and lip boundary conditions on vocal-tract area function estimates from speech signals," Proceedings of IEEE ICASSP, Philadelphia, USA, 2005, Vol. I, pp. 901-903.
[22] H. Deng, R. K. Ward, M. P. Beddoes, M. Hodgson, "A new method for obtaining accurate estimates from vowel sounds," IEEE Transactions on Speech and Audio Processing, 2005 (accepted for publication, in press).
[23] J. W. Devaney and C. C. Goodyear, "A comparison of acoustic and magnetic resonance imaging techniques in the estimation of vocal tract area functions," 1994 International Symposium on Speech, Image Processing and Neural Networks, April 1994, pp. 575-578.
[24] A. Dowd, J. Smith, and J. Wolfe, "Learning to pronounce vowel sound in a foreign language using acoustic measurements of the vocal tract as feedback in real time," Language and Speech, vol. 41, pp. 1-20, 1998.
[25] S. V. Dusan, "Statistical estimations of articulatory trajectories from the speech signal using dynamic and phonology constraints," PhD thesis, Waterloo University, 2000.
[26] G. Fant, Acoustic Theory of Speech Production, Mouton, Hague, 2nd ed., 1970.
[27] G. Fant, "Glottal flow: models and interaction," Journal of Phonetics, vol. 14, pp. 393-399, 1986.
[28] G. Fant, "Some problems with voice analysis," Speech Communication, vol. 13, pp. 7-22, 1993.
[29] H. Fujisaki and M. Ljungqvist, "Estimation of voice source and vocal tract parameters based on ARMA analysis and a model for the glottal source waveform," IEEE ICASSP 1987, pp. 637-640.
[30] S. Furui, Digital Speech Processing, Synthesis, and Recognition, second edition, revised and expanded, Marcel Dekker, Inc., 2001.
[31] M. Frohlich, D. Michaelis and H. W. Strube, "SIM-simultaneous inverse filtering and matching of a glottal flow model for acoustic speech signals," The Journal of the Acoustical Society of America, vol. 110, no. 1, pp. 479-488, Jul. 2001.
[32] J. L. Flanagan, Speech Analysis Synthesis and Perception.
Springer-Verlag, 1972.
[33] L. V. Fausett, Numerical Methods Algorithms and Applications, Prentice Hall, Upper Saddle River, N.J., 2003.
[34] H. Fujisaki and M. Ljungqvist, "Estimation of voice source and vocal tract parameters based on ARMA analysis and a model for the glottal source waveform," Proceedings of IEEE ICASSP 1987, pp. 637-640.
[35] R. E. Hillman, "Estimation of glottal volume velocity waveform properties: a review and study of some methodological assumptions," Speech and Language: Advances in Basic Research and Practice, vol. 6, pp. 411-473, Academic Press, Inc., 1981.
[36] J. Holmes and W. Holmes, Speech Synthesis and Recognition, 2nd edition, New York: Taylor and Francis, 2001, pp. 13-14.
[37] L. Hogben, Elementary Linear Algebra, West Publishing Company, MN, 1987.
[38] K. Ishizaka, M. Matsudaira, T. Kaneko, "Input acoustic-impedance measurement of the subglottal system," The Journal of the Acoustical Society of America, vol. 60, no. 1, pp. 190-197, July 1976.
[39] F. Itakura, S. Saito, "Analysis Synthesis Technology based on the Maximum Likelihood Method," The 6th International Congress on Acoustics, Tokyo, Japan, August 21-28, 1968, pp. C-17-20.
[40] P. J. B. Jackson and C. H. Shadle, "Pitch-scaled estimation of simultaneous voiced and turbulence-noise components in speech," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 7, pp. 713-726, 2001.
[41] H. Kasuya, K. Maekawa and S. Kiritani, "Joint estimation of voice source and vocal tract parameters as applied to the study of voice source dynamics," ICPhS 99, San Francisco, pp. 2505-2512, 1999.
[42] Kay Elemetrics Corp., Instruction Manual, Electroglottograph Model 6103, September 1999.
[43] K. Krishnamurthy and D. G. Childers, "Two-channel speech analysis," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 4, pp. 730-743, 1986.
[44] L. E. Kinsler, A. Frey, and J. V. Sanders, Fundamentals of Acoustics. John Wiley & Sons, Inc., pp. 175, 274, 2000.
[45] A. K. Krishnamurthy and D. G. Childers, "Two-Channel Speech Analysis," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 4, pp. 730-743, 1986.
[46] J. N. Larar, J. Schroeter and M. M. Sondhi, "Vector quantization of the articulatory space," IEEE Trans. on Acoust., Speech, Signal Processing, vol. 36, no. 12, pp. 1812-1818, 1988.
[47] J. Liu and J. Wang, "Diagnostic Aid for the Vocal Tract," EECE 496 Project Report, April 2002.
[48] L. Ljung, System Identification: Theory for the User, second edition, Prentice Hall PTR, Prentice-Hall, Inc., Upper Saddle River, NJ, 1999.
[49] H. Lu, "Joint estimation of vocal tract filter and glottal source waveform via convex optimization," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 79-82, 1999.
[50] H. Lu, Toward a High-Quality Singing Synthesizer With Vocal Texture Control, PhD thesis, Stanford University, 2002.
[51] A. Mahdi, "Visualisation of the Vocal Tract Based on Estimation of Vocal Area Functions and Formant Frequencies," Proceedings of Eurospeech, Geneva, Switzerland, September 1-4, 2003.
[52] P. Mermelstein, "Articulatory Model for the study of speech production," The Journal of the Acoustical Society of America, vol. 53, no. 4, pp. 1070-1082, 1973.
[53] R. L. Miller, "Nature of the vocal cord wave," The Journal of the Acoustical Society of America, vol. 31, pp. 667-677, Jun. 1959.
[54] P. Milenkovic, "Glottal inverse filtering by joint estimation of an AR System with a linear input model," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 1, pp. 28-42, February 1986.
[55] E. Moore and M. Clements, "Algorithm for automatic glottal waveform estimation without the reliance on precise glottal closure information," Proceedings of IEEE ICASSP 2004, pp. I-101-104, May 2004.
[56] P. Moore, "A short history of laryngeal investigation," Journal of Voice, vol. 5, pp. 266-281, 1991.
[57] A. V. Oppenheim, R. W. Schafer with J. R.
Buck, Discrete-Time Signal Processing, Upper Saddle River, N.J., Prentice Hall, 1999.
[58] M. D. Plumpe, T. F. Quatieri, and D. A. Reynolds, "Modeling of the glottal flow derivative waveform with application to speaker identification," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 569-586, 1999.
[59] T. F. Quatieri, Discrete-time Speech Signal Processing. Prentice Hall, 2001.
[60] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Prentice-Hall, 1978.
[61] G. C. Ray, "Determination of the area-function of individual vocal tract from average for sustained vowels," Engineering in Medicine and Biology Society, 1996 and 14th Conference of the Biomedical Engineering Society of India, An International Meeting, Proceedings of the First Regional Conference, IEEE, 1995, pp. 2/78-2/79.
[62] M. G. Rahim, Artificial Neural Networks for Speech Analysis/Synthesis, Chapman & Hall, 1994.
[63] D. Rossiter, D. M. Howard, and M. Downes, "A real-time LPC-based vocal tract area display for voice development," Journal of Voice, vol. 8, no. 4, pp. 314-319, 1994.
[64] M. A. Rothenberg, "A new inverse filtering technique for deriving the glottal air flow during voicing," Journal of the Acoustical Society of America, vol. 53, pp. 1632-1645, 1973.
[65] J. S. Rubin, Diagnosis and Treatment of Voice Disorders, New York: Igaku-Shoin, 1995, pp. 290-311.
[66] J. Schroeter and M. M. Sondhi, "Techniques for estimating vocal-tract shapes from the speech signal," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, pp. 133-150, 1994.
[67] M. M. Sondhi and B. Gopinath, "Determination of vocal-tract shape from impulse response at the lips," Journal of the Acoustical Society of America, vol. 49, pp. 1867-1873, 1971.
[68] M. M. Sondhi, "Measurement of the glottal waveform," Journal of the Acoustical Society of America, vol. 57, pp. 228-232, 1975.
[69] M. M.
Sondhi, "Estimation of vocal-tract areas: the need for acoustical measurements," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 3, pp. 268-273, 1979.
[70] K. Stevens, Acoustic Phonetics, The MIT Press, 1998.
[71] B. H. Story, I. R. Titze, and E. A. Hoffman, "Vocal tract area functions from magnetic resonance imaging," Journal of the Acoustical Society of America, vol. 100, no. 1, July 1996.
[72] B. H. Story, I. R. Titze, and E. A. Hoffman, "The relationship of vocal tract shape to three voice qualities," Journal of the Acoustical Society of America, vol. 109, no. 4, pp. 1651-1667, 2001.
[73] B. H. Story, "An overview of the physiology, physics and modeling of the sound source for vowels," Acoustical Science and Technology, vol. 23, no. 4, pp. 195-206, July 2002.
[74] B. H. Story, "Vowel acoustics for speaking and singing," Acta Acustica united with Acustica, vol. 90, no. 4, pp. 629-640, 2004.
[75] M. Sulter and F. W. J. Albers, "The effects of frequency and intensity level on glottal closure in normal subjects," http://www.ub.rug.nl/eldoc/dis/medicine/a.m.sulter/c3.pdf, 1996.
[76] H. Takemoto, K. Honda, S. Masaki, I. Shimada, I. Fujimoto, S. Takano, K. Takeo, "Extraction of temporal patterns of vocal-tract area functions in a vowel sequence from a 3D MRI movie," Technical Report of IEICE, SP 2001-24 (2001-5), pp. 67-74.
[77] I. R. Titze, Principles of Voice Production, Prentice-Hall, Inc., 1994.
[78] I. R. Titze, "Physiologic and acoustic differences between male and female voices," Journal of the Acoustical Society of America, vol. 85, no. 4, pp. 1699-1707, April 1989.
[79] I. R. Titze, "Glottal flow models," Journal of Phonetics, vol. 14, pp. 405-406, 1986.
[80] D. E. Veenman and S. Bement, "Automatic glottal inverse filtering from speech and electroglottographic signals," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, pp. 369-377, 1985.
[81] M. P. de Vries, H. K.
Schutte, "Glottal flow through a two-mass model: comparison of Navier-Stokes solutions with simplified models," Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1847-1853, April 2002.
[82] H. Wakita, "Direct Estimation of the Vocal Tract Shape by Inverse Filtering of Acoustic Speech Waveforms," IEEE Transactions on Audio and Electroacoustics, vol. AU-21, pp. 417-427, 1973.
[83] H. Wakita, "Normalization of Vowels by Vocal-Tract Length and Its Application to Vowel Identification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25, no. 2, pp. 183-192, April 1977.
[84] D. Y. Wang and J. D. Mark, "Least squares glottal inverse filtering from the acoustic speech waveform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 4, pp. 350-355, 1979.
[85] R. Wenokur, G. S. Berke, B. R. Gerratt, J. Kreiman, and M. Ye, "In vivo measurement of laryngeal mucosal wave speed in humans," Journal of the Acoustical Society of America, vol. 93, p. 2295, Apr. 1993.
[86] D. Y. Wong, J. D. Markel, and A. H. Gray, "Least squares glottal inverse filtering from the acoustic speech waveform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-27, pp. 350-355, Aug. 1979.
[87] H. Yehia, M. Honda, and F. Itakura, "Acoustic measurements of the vocal-tract area function: sensitivity analysis and experiments," Proceedings of IEEE ICASSP 1995, Vol. 1, pp. 9-12, May 1995.
[88] H. Yehia, F. Itakura, "A method to combine acoustic and morphological constraints in the speech production inverse problem," Speech Communication, vol. 18, pp. 151-174, 1996.

Appendix A

We pointed out that incorrect conclusions are drawn from a "pseudo-Laplace transform" in [page 176, Ananthapadmanabha and Fant, 1982]. We reproduce that derivation in the following. According to Eq.
(12) in [page 176, Ananthapadmanabha and Fant, 1982]:

$$C\frac{dV}{dt} + \frac{V}{R} + \frac{1}{L}\int V\,dt + \frac{1}{2}V g_0(t) = U_{sc}(t)$$

where $C$, $R$, $L$ are parameters of the impedance looking from the back end of the vocal tract into the vocal tract, $V$ is the sound pressure at the back end of the vocal tract, $g_0(t)$ is a function of glottal area, and $U_{sc}(t)$ is the equivalent glottal source signal, shown in Fig. 3.2. Differentiating the above equation, the following holds:

$$\frac{d^2V}{dt^2} + \frac{1}{C}\left[\frac{1}{R} + \frac{1}{2}g_0(t)\right]\frac{dV}{dt} + \frac{1}{LC}\left[1 + \frac{1}{2}L\,\dot{g}_0(t)\right]V = \frac{1}{C}\frac{dU_{sc}(t)}{dt}$$

The coefficients in the above equation are time-varying. In [page 176, Ananthapadmanabha and Fant, 1982], the time variable in the coefficients is replaced by $\tau$ to assume that the equivalent glottal impedance is stationary, with the notation

$$\omega_1^2 = \frac{1}{LC}\left[1 + \frac{1}{2}L\,\dot{g}_0(\tau)\right] = \omega_0^2\left[1 + \frac{1}{2}L\,\dot{g}_0(\tau)\right]$$

and

$$2\alpha_1 = \frac{1}{RC}\left[1 + \frac{1}{2}R\,g_0(\tau)\right]$$

Then the Laplace transform is applied to the above differentiated equation as follows:

$$V(s) = \frac{s\,\omega_0^2 L\,U_{sc}(s)}{s^2 + 2\alpha_1 s + \omega_1^2}$$

From the differentiated equation, it is found that "the source conductance to be $0.5\,g_0(\tau)$, which causes bandwidth modulation", and that "an equivalent hypothetical inductance of value $2/\dot{g}_0(\tau)$, which causes frequency modulation of the resonant frequency of the load" [pages 176-177, Ananthapadmanabha and Fant, 1982].

Appendix B

This section proves that a formant frequency of a VTF estimate containing the effect of incomplete glottal closure is always higher than that of the corresponding VTF. Let $F$ be a formant frequency of the VTF. Then, the following holds:

$$\left.\frac{d|H_{VTF}(f)|}{df}\right|_{f=F} = 0 \qquad \text{(B1)}$$

First, since $|Z_{VT}|$ resonates at $F$ (see Section 3.9), and since $|Z_G|$ increases with frequency, $|Z_{VT}/Z_G|$ at $F-\Delta$ is larger than it is at $F$, where $\Delta > 0$ and $F-\Delta$ is in a vicinity of $F$. Second, it can be shown by circuit analysis that $Z_{VT}$, which is equivalently constituted by parallel resistance, inductance and capacitance, is resistive at the resonance frequency $F$, and is reactive at frequencies lower than the resonance frequency.
This is also found true from the calculated $Z_{VT}$. Then, the angle between 1 and $Z_{VT}/Z_G$ at $F-\Delta$ is smaller than at $F$, assuming that the angle of $Z_{VT}$ is small at $f = F-\Delta$. Therefore, $|1 + Z_{VT}/Z_G|$ is greater at $F-\Delta$ than it is at $F$, and the following holds:

$$\left.\frac{d\,|1 + Z_{VT}/Z_G|}{df}\right|_{f=F} < 0 \qquad \text{(B2)}$$

Since

$$\log|H_{GVTF}(f)| = \log\left|\frac{H_{VTF}(f)}{1 + Z_{VT}/Z_G}\right| = \log|H_{VTF}(f)| - \log|1 + Z_{VT}/Z_G| \qquad \text{(B3)}$$

then,

$$\frac{d\{\log|H_{GVTF}(f)|\}}{df} = \frac{1}{|H_{VTF}(f)|}\frac{d|H_{VTF}(f)|}{df} - \frac{1}{|1 + Z_{VT}/Z_G|}\frac{d\,|1 + Z_{VT}/Z_G|}{df} \qquad \text{(B4)}$$

According to Eqs. (B1) and (B2), then,

$$\left.\frac{d\{\log|H_{GVTF}(f)|\}}{df}\right|_{f=F} = -\left.\frac{1}{|1 + Z_{VT}/Z_G|}\frac{d\,|1 + Z_{VT}/Z_G|}{df}\right|_{f=F} > 0 \qquad \text{(B5)}$$

The above equation means that the GVTF formant frequency, at which $d\{\log|H_{GVTF}(f)|\}/df = 0$, i.e., at which the frequency response of the GVTF has a peak, is higher than $F$. Eq. (B2) tells us that it is the glottal inductance, $L_g = \rho H / A_g$, that causes the GVTF formant frequency to be higher than that of the VTF. The effect of the glottal inductance in increasing formant frequencies is also illustrated using a uniform tube with 5 cm² cross-sectional area and 17 cm length [Flanagan, 1972].

Appendix C

We prove that the estimator given by Eq. (5.14) is an unbiased estimator of $A$, i.e., the mathematical expectation of the estimate equals the true parameters of the VTF:

$$E\{\hat{A}\} = A \qquad \text{(C1)}$$

From Eq. (5.16), we get:

$$E\{\hat{A}\} = A - E\{(Q^TQ)^{-1}Q^T\varepsilon\} \qquad \text{(C2)}$$

Therefore, we want to prove

$$E\{(Q^TQ)^{-1}Q^T\varepsilon\} = 0 \qquad \text{(C3)}$$

Denote $(Q^TQ)^{-1}Q^T = W$, where $W$ is an $(M+1)\times L$ matrix. Let the $m$-th entry of $(Q^TQ)^{-1}Q^T\varepsilon$ be $e_m$, $m = 1, \ldots, M+1$. Then,

$$e_m = \sum_{l=1}^{L} W_{ml}\,\varepsilon_l \qquad \text{(C4)}$$

and the expectation of $e_m$ is

$$E\{e_m\} = \sum_{l=1}^{L} E\{W_{ml}\,\varepsilon_l\} = \sum_{l=1}^{L}\left\{E[W_{ml}\,u'_g(n_{ci} - M/2 + l - 1)] - E[W_{ml}\,u'_g(n_{cj} - M/2 + l - 1)]\right\} \qquad \text{(C5)}$$

Since the glottal wave signal is cyclostationary, and since $n_{ci} - n_{cj}$ equals one or multiple pitch periods, the joint probability density function of $(W_{ml}, u'_g(t + n_{ci}))$ is identical to that of $(W_{ml}, u'_g(t + n_{cj}))$. Thus,

$$E[W_{ml}\,u'_g(n_{ci} - M/2 + l - 1)] = E[W_{ml}\,u'_g(n_{cj} - M/2 + l - 1)] \qquad \text{(C6)}$$

Thus,

$$E\{e_m\} = 0 \qquad \text{(C7)}$$

and

$$E\{(Q^TQ)^{-1}Q^T\varepsilon\} = 0 \qquad \text{(C8)}$$

Therefore, the estimator given by Eq. (5.14) is an unbiased estimator of $A$.
Appendix D

As pointed out in Section 6.4, in [Atal and Hanauer, 1971], a negative sign is missing in relating the sectional reflection coefficients of the vocal-tract tube model to the coefficients of the VTF. Here, we illustrate this mistake in [Atal and Hanauer, 1971] by using an example.

According to Eq. (F19) in [Atal and Hanauer, 1971], $C_1(z)$ is the difference of two elements of the first-section matrix, and according to Eq. (F11) in [Atal and Hanauer, 1971], the elements $w_{11}^{(1)}(z)$, $w_{12}^{(1)}(z)$, $w_{21}^{(1)}(z)$ and $w_{22}^{(1)}(z)$ of that matrix are formed from $z^{1/2}$, $z^{-1/2}$ and $r_1$, with a common scalar factor. From the above two equations, we get, up to that scalar factor:

$$C_1(z) \propto z^{1/2} - r_1 z^{-1/2}$$

Thus, for $C_1(z)$, the ratio of the coefficients of $z^{-1/2}$ and $z^{1/2}$ is $-r_1$. Unfortunately, it is said [page 655, Atal and Hanauer, 1971] that "for each $C_n(z)$, the ratio of the coefficients of $z^{-n/2}$ and $z^{n/2}$ is $r_n$." The reason why the negative sign was left out cannot be seen from the original paper.
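A sign slip of this kind matters in practice because it turns every widening tube section into a narrowing one. The sketch below illustrates this with the standard step-up and step-down recursions between reflection coefficients and the coefficients of an all-pole polynomial. It is a generic illustration, not the formulation of [Atal and Hanauer, 1971] or of Chapter 6: the function names are hypothetical, and the convention r_m = (A_{m+1} - A_m) / (A_{m+1} + A_m) for relating a reflection coefficient to adjacent section areas is an assumption chosen for the example.

```python
import numpy as np

def step_up(refl):
    """Reflection coefficients -> polynomial [1, a_1, ..., a_p] (step-up recursion)."""
    a = np.array([1.0])
    for k in refl:
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]
    return a

def step_down(a):
    """Polynomial [1, a_1, ..., a_p] -> reflection coefficients (needs |k| < 1)."""
    a = np.asarray(a, dtype=float)
    refl = []
    while len(a) > 1:
        k = a[-1]
        refl.append(k)
        a = (a - k * a[::-1]) / (1.0 - k * k)
        a = a[:-1]
    return np.array(refl[::-1])

def areas_from_refl(refl, first_area=1.0):
    """Section areas under the ASSUMED convention
    r_m = (A_{m+1} - A_m) / (A_{m+1} + A_m), so A_{m+1} = A_m (1 + r_m) / (1 - r_m)."""
    areas = [first_area]
    for r in refl:
        areas.append(areas[-1] * (1.0 + r) / (1.0 - r))
    return np.array(areas)

refl = np.array([0.5, -0.3, 0.2])        # hypothetical reflection coefficients
poly = step_up(refl)
recovered = step_down(poly)              # round trip back to the reflection coefficients

areas = areas_from_refl(recovered)
areas_flipped = areas_from_refl(-recovered)   # same data with the sign slipped

print("recovered reflection coefficients:", np.round(recovered, 6))
print("areas:                 ", np.round(areas, 4))
print("areas with flipped sign:", np.round(areas_flipped, 4))
```

Under this convention, flipping the sign of every reflection coefficient replaces each area ratio by its reciprocal, so the two area functions are mirror images of each other on a logarithmic scale. A round-trip check like the one above is a cheap way to confirm which sign convention a given implementation uses.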
Estimations of glottal waves and vocal-tract area functions from speech signals Deng, Hui Qun 2004
Item Metadata
Title | Estimations of glottal waves and vocal-tract area functions from speech signals |
Creator | Deng, Hui Qun |
Date Issued | 2004 |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2009-12-22 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0064790 |
URI | http://hdl.handle.net/2429/17035 |
Degree | Doctor of Philosophy - PhD |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2005-11 |
Campus | UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |