3D Subject-Specific Biomechanical Modeling and Simulation of the Oral Region and Airway with Application to Speech Production

by

Negar Mohaghegh Harandi

B.Sc., AmirKabir University of Technology, 2005
M.Sc., Isfahan University of Technology, 2009

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate and Postdoctoral Studies (Electrical & Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

April 2016

© Negar Mohaghegh Harandi 2015

Abstract

The oropharynx is involved in a number of complex neurological functions, such as chewing, swallowing, and speech. Disorders associated with these functions, if not treated properly, can dramatically reduce the quality of life of the sufferer. When tailored to individual patients, biomechanical models can augment the imaging data to enable computer-assisted diagnosis and treatment planning.

The present dissertation develops a framework for 3D, subject-specific biomechanical modeling and simulation of the oropharynx. The underlying data consist of magnetic resonance (MR) images, as well as audio signals, recorded while healthy speakers repeated specific phonetic utterances in time with a metronome. Based on this data, we perform simulations that demonstrate motor-control commonalities and variations of the /s/ sound across speakers, in front and back vowel contexts. Results compare well with theories of speech motor control in predicting the primary muscles responsible for tongue protrusion/retraction, jaw advancement, and hyoid positioning, and in suggesting independent activation units along the genioglossus muscle.

We augment the simulations with real-time acoustic synthesis to generate sound. Spectral analysis of the resultant sounds vis-à-vis the recorded audio signals reveals discrepancies in the formant frequencies of the two. Experiments using 1D and 3D acoustical models demonstrate that such discrepancy arises from the low resolution of MR images, generic parameter tuning in acoustical models, and ambiguity in the 1D vocal-tract representation. Our models prove beneficial for vowel synthesis based on biomechanics derived from image data.

Our modeling approach is designed for time-efficient creation of subject-specific models. We develop methods that streamline delineation of the articulators from MR images and reduce expert interaction time significantly (≈ 5 minutes per image volume for the tongue). Our approach also exploits the muscular and joint information embedded in state-of-the-art generic models, while providing consistent mesh quality and the affordances to adjust mesh resolution and muscle definitions.

Preface

The research presented herein was approved by the UBC Clinical Research Ethics Board, certificate number H16-00016. Most of the contributions and ideas described in this dissertation have been presented previously in the publications listed below.

[P1, P3-6]: Negar M. Harandi was the primary author and main contributor to the design of these papers, under the supervision of Dr. Sidney Fels and Dr. Rafeef Abugharbieh. Negar M. Harandi developed the speaker-specific models, performed the simulations, and analyzed the results. Dr. Maureen Stone and Dr. Jonghye Woo provided and pre-processed the MRI data. Dr. Marek Bucki shared his code for FE registration; Dr. Claudio Lobos shared his implementation of FE meshing ([P1]). Dr. Ian Stavness provided the code for the inverse solver, and the VT skin mesh ([P1, P3, P4]). Dr. Kees van den Doel provided the implementation of the 1D acoustic synthesizer ([P1, P4]).
Dr. Maureen Stone assisted with the analysis of the results, and gave editorial feedback on [P3, P6]. The work in [P1] is reflected in Sections 3.1, 3.3, and 5.1. The work in [P3] is reflected in Chapter 4.

[P2, P7]: Negar M. Harandi was the primary author and main contributor to the design, implementation, and testing of the methods developed in these papers, under the supervision of Dr. Sidney Fels and Dr. Rafeef Abugharbieh. Dr. Maureen Stone and Dr. Jonghye Woo provided the MRI data. The implementation of the shape-matching algorithm (Gilles and Pai, 2008) was available through the SOFA open-source simulation framework. The work in these papers is reflected in Section 3.2.

[P8]: Negar M. Harandi was the primary author and main contributor to the design of this paper, under the supervision of Dr. Sidney Fels and Dr. Rafeef Abugharbieh. Dr. Daniel Aalto and Dr. Jarmo Malinen provided the VT geometries with resonance frequencies, and assisted with interpretation of the results. The work in this paper is reflected in Section 5.2.

• Published Journal Manuscripts:

[P1] Harandi NM, Stavness I, Woo J, Stone M, Abugharbieh R, Fels S. 2015. Subject-specific biomechanical modeling of the oropharynx: towards speech production. Comput Methods Biomech Biomed Eng: Imaging Vis, (ahead-of-print), 1-11.

[P2] Harandi NM, Abugharbieh R, Fels S. 2014. 3D segmentation of the tongue in MRI: a minimally interactive model-based approach. Comput Methods Biomech Biomed Eng: Imaging Vis. 3(4):178-188.

• Journal Manuscripts in Review:

[P3] Harandi NM, Woo J, Stone M, Abugharbieh R, Fels S. 2015. Articulation variability in English /s/: a biomechanical modeling approach. Submitted.

• Peer-Reviewed Conference Papers:

[P4] Harandi NM, Woo J, Farazi MR, Stavness I, Stone M, Fels S, Abugharbieh R. 2015. Subject-specific biomechanical modeling of the oropharynx with application to speech production. Proceedings of the International Symposium on Biomedical Imaging (ISBI); Brooklyn, USA.

[P5] Harandi NM, Woo J, Stone M, Abugharbieh R, Fels S. 2014. Inverse simulation of a subject-specific jaw-tongue-hyoid model from dynamic MR. Proceedings of the International Symposium on Computer Methods in Biomechanics and Biomedical Engineering (CMBBE); Amsterdam, Netherlands.

[P6] Harandi NM, Woo J, Stone M, Abugharbieh R, Fels S. 2014. Subject-specific biomechanical modeling of the tongue: analysis of muscle activations during speech. Proceedings of the International Seminar on Speech Production (ISSP); Cologne, Germany.

[P7] Harandi NM, Abugharbieh R, Fels S. 2014. Minimally interactive MRI segmentation for subject-specific modeling of the tongue. MICCAI Workshop on Bio-Imaging and Visualization for Patient-Customized Simulations (BIVPCS); Nagoya, Japan. Best Paper Award.

• Workshop Articles:

[P8] Harandi NM, Aalto D, Hannukainen A, Malinen J, Abugharbieh R, Fels S. Spectral analysis of the vocal tract in vowel synthesis: a comparison between 1D and 3D acoustic analysis. 3rd International Workshop on Biomechanical and Parametric Modeling of Human Anatomy (PMHA 2015); Montreal, Canada.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Dedication

1 Introduction
  1.1 Contributions

2 Background
  2.1 Data Acquisition and Measurement
    2.1.1 Magnetic Resonance Imaging
    2.1.2 Computed Tomography Imaging
    2.1.3 Electromyography
    2.1.4 Electromagnetic Articulometry
    2.1.5 Image Segmentation
  2.2 Biomechanical Modeling
    2.2.1 Generic Models
    2.2.2 Subject-Specific Modeling
    2.2.3 Inverse Simulation Methods
  2.3 Acoustics of Speech
    2.3.1 Vowel Phonemes
    2.3.2 Consonant Phonemes
    2.3.3 Coarticulation
    2.3.4 Articulatory Speech Synthesis
  2.4 Conclusions

3 Subject-Specific Modeling of the Oropharynx
  3.1 FE Tongue Modeling
    3.1.1 FE Registration
    3.1.2 FE Meshing
    3.1.3 Tongue Muscle Bundles
    3.1.4 Tongue Muscle Material
    3.1.5 Forward Simulation
  3.2 Tongue Segmentation from MRI
    3.2.1 Methods
    3.2.2 Results
    3.2.3 Discussion
    3.2.4 Summary
  3.3 Jaw and Hyoid Modeling
    3.3.1 Bone Segmentation from MRI
    3.3.2 Forward Simulation
  3.4 Conclusions

4 Data-Driven Simulation for Speech Research
  4.1 MRI Data
  4.2 Tissue Displacement
  4.3 Inverse Simulation
    4.3.1 Definition of the Control Points
  4.4 Results
    4.4.1 Tongue-Protruder Muscles
    4.4.2 Tongue-Retractor Muscles
    4.4.3 Other Muscles
  4.5 Discussion
    4.5.1 Apical vs. Laminal Speakers
    4.5.2 Mechanisms of Tongue Elevation
    4.5.3 Commonalities Across Speakers
  4.6 Conclusions
5 Acoustic Analysis for Speech Synthesis
  5.1 Synthesis of Vowels in Running Speech
    5.1.1 Biomechanical Model of Vocal Tract
    5.1.2 Time-Domain Acoustical Model
    5.1.3 Results and Discussion
  5.2 Synthesis of Sustained Vowels
    5.2.1 Helmholtz Resonances
    5.2.2 Webster Resonances
    5.2.3 Results and Discussion
  5.3 Conclusions

6 Conclusions
  6.1 Concluding Remarks
  6.2 Future Work

Bibliography

Appendices

A Oropharyngeal Muscles
  A.1 Tongue Muscles
    A.1.1 Extrinsic Muscles
    A.1.2 Intrinsic Muscles
  A.2 Jaw and Hyoid Muscles
    A.2.1 Jaw Closers
    A.2.2 Jaw Openers

B International Phonetic Alphabet

List of Tables

2.1 Max force and CSA for muscle bundles in the generic tongue model (Buchaillard et al., 2009)
2.2 Max force and CSA for the muscles in the generic jaw-hyoid model (Stavness, 2010)
3.1 Mesh quality measured for FEmesh in speakers A-D: number of elements, as well as average values of the quality measure (AR or JR), for each element type
4.1 Speaker information for this study: sex, age, and time frames associated with the individual sounds in the /ə-gis/ and /ə-suk/ utterances
4.2 Absolute tracking error (mm) for speakers A-D, averaged over all control points and all time frames in /ə-gis/ and /ə-suk/
5.1 Formant frequencies (F1, F2, F3) of /i/ and /u/ in speakers A-D, as computed from our simulations, audio signals, and cine MRI data (values in Hz)
5.2 Formant frequencies for the simulations using FEreg and FEmesh, compared to the audio and cine MRI data for /i/ of speaker B

List of Figures

1.1 A mid-sagittal diagram of a human upper airway
1.2 Designed work-flow for subject-specific, oropharyngeal modeling and simulation in speech research
2.1 Sustained vs. real-time speech articulation
2.2 Super-resolution tongue MRI volume
2.3 Generic tongue model
2.4 Generic jaw-hyoid model
2.5 Hill's muscle model
2.6 Failure in the Mesh-Matching methodology
2.7 Phonetics of English vowels
2.8 Formant frequencies of English vowels
2.9 Place of articulation for consonants
2.10 Fields of coarticulation
2.11 Two-mass glottal model
3.1 Cine MRI for biomechanical modeling
3.2 Pipeline for subject-specific FE tongue modeling
3.3 Elastic registration in the MMRep algorithm
3.4 Results of FE registration
3.5 Surface patterns for FE meshing
3.6 Results of FE meshing
3.7 Definition of tongue muscle bundles in high resolution
3.8 Effect of mesh resolution on the tongue deformation
3.9 GGP activation in the speaker-specific tongue models
3.10 Forward simulation of the speaker-specific tongue models
3.11 Activation of functionally-distinct muscle segments
3.12 Designed segmentation pipeline
3.13 Initialization without user guidance
3.14 Ground truth segmented by dental expert
3.15 Mesh representation during segmentation
3.16 Evaluation via Dice coefficient
3.17 Evaluation via modified Hausdorff distance
3.18 Sensitivity analysis in the male group
3.19 Sensitivity analysis in the female group
3.20 Atlas deformation for jaw segmentation
3.21 Mandible segmentation from MRI
3.22 The jaw-tongue-hyoid model for speaker B
3.23 ArtiSynth GUI for adjusting muscles
3.24 Activation of the jaw muscles for the speaker-specific models
3.25 Activation of the jaw muscles for speaker A
4.1 Schematic representation of MRI data
4.2 Mid-sagittal slice of cine MRI at /s/
4.3 Tissue displacement in tagged MRI
4.4 Setting up the FE tongue model for /ə-suk/
4.5 Estimated muscle activations for /ə-gis/
4.6 Estimated muscle activations for /ə-suk/
5.1 Area function assessment for the VT geometry
5.2 Deformed VT for vowels /i/ and /u/
5.3 Audio signal and spectra for /ə-gis/
5.4 Average formant frequencies for /i/ and /u/
5.5 Normalized area profile along the VT
5.6 VT geometries extracted from MRI data
5.7 Simulation results for the 1st and 2nd formant frequencies
A.1 Extrinsic muscles of the tongue
A.2 Intrinsic muscles of the tongue
A.3 Mandible anatomy
A.4 Illustration of the jaw muscles
A.5 Illustration of the hyoid muscles
List of Abbreviations

1D      1-dimensional
2D      2-dimensional
3D      3-dimensional
CBCT    Cone Beam Computed Tomography
CSA     Cross-Sectional Area
CT      Computed Tomography
DOF     Degree Of Freedom
E-IDEA  Enhanced Incompressible Deformation Estimation Algorithm
EMA     Electro-Magnetic Articulometry
EMG     Electromyography
FE      Finite Element
FEM     Finite Element Method
GE      Gradient Echo
GUI     Graphical User Interface
HARP    HARmonic Phase
Hz      Hertz
IDEA    Incompressible Deformation Estimation Algorithm
IPA     International Phonetic Alphabet
JR      Jacobian Ratio
MHD     Modified Hausdorff Distance
MMRep   Mesh-Match-and-Repair (algorithm)
MR      Magnetic Resonance
MRI     Magnetic Resonance Imaging
MSCT    Multi-Slice Computed Tomography
OPAL    Oral, Pharyngeal, and Laryngeal (complex)
PCA     Principal Component Analysis
PDE     Partial Differential Equation
POA     Place Of Articulation
RF      Radio Frequency
SE      Spin Echo
SNR     Signal-to-Noise Ratio
STD     Standard Deviation
TE      Echo Time
TF      Time Frame
TLM     Transmission Line-circuit Model
TMJ     Temporomandibular Joint
TR      Repetition Time
VT      Vocal Tract

Oropharyngeal Muscles:

AD      Anterior Digastric
AM      Anterior Mylohyoid
AT      Anterior Temporal
DM      Deep Masseter
GG      Genioglossus
GGA     Genioglossus Anterior
GGM     Genioglossus Medium
GGP     Genioglossus Posterior
GH      Geniohyoid
HG      Hyoglossus
IL      Inferior Longitudinal
IP      Inferior-lateral Pterygoid
MH      Mylohyoid
MP      Medial Pterygoid
MT      Medial Temporal
PD      Posterior Digastric
PM      Posterior Mylohyoid
PT      Posterior Temporal
SH      Stylo-Hyoid
SL      Superior Longitudinal
SM      Superficial Masseter
SP      Superior-lateral Pterygoid
STY     Styloglossus
TRANS   Transversus
VERT    Verticalis

Acknowledgements

My deepest appreciation goes to my supervisors, Dr. Sidney Fels and Dr. Rafeef Abugharbieh, for making this thesis possible. Your support, advice, and inspiration guided me throughout the past five years. It's been an honour working with you and learning from you.

I am indebted to Dr. Maureen Stone for generous data sharing, many fruitful discussions, and hosting me at her lab and in her own home during my research visits to the University of Maryland Dental School. I admire you as a great woman and academic scholar.

I have had the pleasure of working closely with a number of fine researchers during the course of this PhD. Thanks must go to Dr. Jonghye Woo for a rewarding collaboration, and for hosting my visit to Massachusetts General Hospital. Dr. Ian Stavness deserves special thanks for his noble collaboration with me on this project. Many thanks to Dr. Daniel Aalto and Dr. Jarmo Malinen for sharing their knowledge with me during several discussion sessions and lengthy emails. Finally, I would like to thank Dr. John Lloyd for his great support with ArtiSynth.

The positive experience I've had during my graduate studies is due, in part, to my fellow graduate students, with whom I have shared an office, building, or research project. Special thanks go to Antonio Sánchez and Andrew Ho for many useful conversations, and their assistance with the project.

I want to say a heart-felt thank you to my friends and family for their love and support throughout this journey.
I am grateful to my parents for passing on their life-long motivation to me, and to my sister for her cheer and laughter. Lastly, I thank my dear Matthew for his daily encouragement, endless love, and his editing eyes, which were invaluable to this manuscript.

Dedication

To my dad, who told me to learn; and my mom, who told me to pursue my dreams.

Chapter 1

Introduction

Speech production is a complex neuromuscular human function that involves coordinated interaction of the oropharyngeal structures (shown in Figure 1.1). Speech impairments are widespread and have an adverse effect on the sufferer's quality of life. Currently, articulation, fluency, and voice disorders afflict more than 7.5 million people in the United States. It has proven difficult to characterize the complex, nonlinear relationship between neurological activation units, articulatory motion, and sound generation. Such difficulty greatly hinders efforts in rehabilitation planning. A deeper understanding of speech production would, thus, be invaluable in a clinical setting.

Figure 1.1. A mid-sagittal diagram of a human upper airway, denoting the (oro)pharyngeal structures: lips, maxilla, mandible, tongue, hyoid, velum, epiglottis, nasal cavity, nasopharynx, oropharynx, and hypopharynx.

Medical imaging has enabled more detailed observation of the oropharynx; nevertheless, hidden variables, such as muscular forces and activations, remain mostly immeasurable. By predicting such variables, biomechanical models can improve the protocols of diagnosis and assist in treatment planning. In jaw reconstructive surgery, the study of pre- and post-operative models of the mandible and maxilla (constructed from computed tomography [CT] scans of the patients) has led to the selection of more suitable operational procedures, and a reduction in the risk of associated complications (Wolański et al., 2015).

Despite the clinical success of patient-specific bone models in jaw reconstructive surgery, biomechanical modeling of speech remains challenging. Firstly, articulatory motion of oropharyngeal soft tissue is rapid and difficult to capture. Magnetic resonance imaging (MRI) is better suited to depicting soft tissue, but still falls short due to low contrast and spatial sparsity. Such challenging data makes conventional methods of image processing inadequate; current methods for extracting the surface geometry of oropharyngeal structures from MRI data mostly focus on 2D slices, and fail to delineate the tongue at its region of contact with the surrounding tissue. The 3D reconstruction of surfaces from sparse 2D segmented contours is not straightforward, and hinders modeling efforts.

Secondly, the structure and physiology of oropharyngeal soft tissue, with its interwoven muscle and tendon fibers, remains a challenge to represent in a biomechanical model. Current models of articulatory movement are generic: that is, they represent an average human anatomy and function, thus failing to provide individualized information. In addition, the making of current generic models relies heavily on expert interaction – a process that is not cost-effective when dealing with many individual cases. These factors contribute to a gap in speech research, separating medical imaging from soft-tissue modeling and simulation.

The gap extends further when looking at biomechanical vs. acoustical models of speech. Sound production, the ultimate physical goal of speech, has mostly been addressed independently of biomechanics.
Articulatory speech synthesizers generate sound based on the geometry of the vocal tract, which is estimated directly from medical images. The search for an ideal model, one which represents both the acoustical and biomechanical characteristics of the oropharynx, continues to this day.

To summarize, subject-specific modeling and simulation of the oropharynx is clinically relevant and beneficial for speech research, but presents challenges due to the complexity of speech anatomy and function. The present dissertation investigates and develops methods to deal with some of these challenges. The underlying data include dynamic (cine and tagged) magnetic resonance (MR) images, which capture the motion of the articulators during the repetition of specific speech utterances.

Our approach to the biomechanical understanding of speech integrates the required constituent units (e.g., data processing, subject-specific biomechanical modeling, data-driven simulation, and acoustic synthesis) into a unified framework. In each unit, we develop methods to deal with the challenges that hinder subject-specific speech analysis. A time-efficient segmentation tool is introduced to accurately extract the 3D tongue surface from MRI data. A subject-specific finite element modeling scheme is designed to benefit from the biomechanical information embedded in the generic models, while providing adjustable spatial resolution and muscle topology for our tongue models.

Next, we demonstrate the benefits of our unified subject-specific framework as a complementary tool for studying inter- and intra-subject variability in speech production. By predicting the muscle activations that successfully reproduce certain measured speech motions for each speaker, our framework offers insight into the underlying biomechanics of speech. Our methods could potentially lead to the development, modification, and verification of theories of speech strategy across speakers of different gender, age, language, or pathology.

Finally, we enable acoustic analysis of our subject-specific simulations by using a biomechanical model of the vocal tract, which deforms along with the motion of the articulators. Using such a model, we investigate the challenges that both current acoustic synthesizers and biomechanical models face in generating accurate speech sounds based on subject-specific motion.

Each chapter of this dissertation details a component of the designed work-flow, as shown in Figure 1.2. In Chapter 2 we present a review, and identify the challenges facing standard approaches to the characterization of oropharyngeal structures and speech function – with a particular focus on data acquisition and measurement, as well as biomechanical and acoustical modeling.

Figure 1.2. Designed work-flow for subject-specific, oropharyngeal modeling and simulation in speech research: generic models, cine MRI, and tagged MRI feed subject-specific modeling (Chapter 3); data-driven simulation (Chapter 4) estimates muscle activations; and acoustic synthesis (Chapter 5) generates sound.

Chapter 3 details our approach to subject-specific modeling of the oropharynx. We develop methods to address the challenge of MRI segmentation for the articulators, and quantify the efficacy of these methods in regards to time-efficiency, accuracy, and parameter sensitivity.
Further, we incorporate standard generic models into our subject-specific modeling approach, to minimize remodeling efforts, before demonstrating the performance of our models in a forward simulation scheme.

In Chapter 4 we enable inverse simulation of our developed models, based on tissue trajectories extracted from the tagged MRI data of each speaker. Through this simulation, we investigate inter-speaker variability in the estimated muscle-activation patterns. Our results show consistency with published measurements of muscle activation for speech production.

In Chapter 5 our subject-specific models and corresponding data-driven simulations are coupled with a standard 1D acoustical model, in order to synthesize vowel sounds in both running and sustained speech configurations. Spectral analysis of our deformed vocal-tract model reveals discrepancies with that of the recorded audio. We identify the sources of discrepancy by comparing the performance of 1D and 3D acoustical models, which carry out their computations in the time and frequency domains.

In Chapter 6 we summarize the contributions of this dissertation, describe directions for future work, and provide concluding remarks.

1.1 Contributions

The primary, novel contributions of this dissertation are highlighted here.

1. Our work fills a gap between medical images of the oropharynx and speech analysis in the biomechanical and acoustical domains, by developing, incorporating, and validating methods that fit our designed framework in Figure 1.2. This is first achieved by efficient processing of dynamic MR images for modeling and simulation purposes (parts of Chapters 3 and 4). Following this, we facilitate the creation of subject-specific biomechanical models of the tongue, mandible, and hyoid, based on standard generic models (Chapter 3). We then simulate our models by solving the inverse problem for MRI-based tissue displacements (Chapter 4). Finally, we enable sound generation based on the predicted biomechanics, by integrating an articulatory acoustic synthesizer into our biomechanical system (Chapter 5). Components of this contribution have been published in [P1].

2. We significantly reduce the expert interaction time required for 3D segmentation of tongue tissue from MR images, by introducing a real-time, intuitive interaction scheme into a mesh-to-image registration technique (Section 3.2). Each segmentation task requires less than five minutes – compared with two or more hours using standard, semi-automatic tools. The quantitative results show accuracy comparable to human error. This contribution has been published in [P2].

3. We provide tools for adjusting the spatial resolution and muscle topology of our subject-specific, finite element (FE) tongue models (Chapter 3). By combining the benefits of FE meshing and registration techniques, our modeling approach leverages the biomechanical properties of a standard generic model without being constrained by its mesh configuration or muscle definition. This contribution has been partly published in [P1], and partly submitted for publication in [P3].

4. Using our data-driven simulation, we measure and quantify inter-speaker variability in the muscle-activation patterns responsible for the /s/ sound in front and back vowel contexts (Chapter 4). Results show consistency with theories of speech motor control in predicting the primary muscles involved in tongue protrusion/retraction, jaw advancement, and hyoid positioning.
Our findings compare well with published medical measurements in suggesting independent activation units along the genioglossus muscle. This contribution has been submitted for publication in [P3].

5. We identify the challenges of using, and demonstrate the trade-off between, standard 1D and 3D acoustical models for sound generation (Chapter 5). We show that the low resolution of dynamic MR images and generic parameter tuning in acoustical models degrade the accuracy of the computed formant frequencies. We also verify that the stability and accuracy of such formants are adversely affected by ambiguity in the 1D representation of the vocal tract, and by the simplification of the acoustic equations required to make 3D analysis possible in the frequency domain. This contribution has partly been published in [P1], and partly presented at the Parametric Modeling of Human Anatomy (PMHA) workshop (2015) in Montreal, Canada.

Chapter 2

Background

Current advancements in data acquisition techniques have created a great opportunity to observe articulatory movements. Data analysis and measurement methods enable quantitative assessment of these observations, and are an essential step in the transition to computational models. This chapter reviews the tools and techniques used for the analysis of speech production.

Section 2.1 provides a review of medical imaging methods, such as MRI, for depiction of the oropharyngeal structures. The section continues by addressing the potentials and challenges of physiological data acquisition techniques, such as electromyography. Finally, image segmentation methods are discussed, identifying the challenges in dealing with oropharyngeal medical images.

Despite current advances in medical imaging techniques, measuring the biomechanics of speech (such as the chronology and level of muscular force and activation) remains a challenge, mainly due to the structural and physiological complexity of the articulators. Biomechanical models, in particular, complement the imaging observations, furthering the understanding of speech production. Section 2.2 reviews previously reported generic biomechanical models of the articulators, and identifies the need to move to a subject-specific framework. Finally, inverse simulation techniques are discussed as a solution to the lack of physiological data.

The ultimate goal of speech mechanics is to generate sound. Established speech synthesis techniques, such as concatenative synthesis, involve the selection of units of speech based on statistical analysis of sound recorded from a population of speakers. These methods are successful in generating natural-sounding speech, but fail to explain the nature of speech production. Articulatory speech synthesis, on the other hand, attempts to simulate speech production by applying the governing rules of physics. Mathematical models of the vocal folds and vocal tract are built, in which sound waves propagate from the subglottal area to the exterior of the lips. Section 2.3 explains the acoustic characteristics of different speech units, and reviews current trends in articulatory speech synthesizers.

2.1 Data Acquisition and Measurement

Diverse medical data are acquired to provide insight into the morphology of the oropharynx and the physiology of speech. Static imaging techniques capture the tongue in a sustained posture, while dynamic imaging techniques strive to record a chronology of oropharyngeal motion in running speech.
The common challenges fall into the following categories: 1) motion artifacts caused by relatively long scan times; 2) insufficient spatial and temporal resolution; 3) low signal-to-noise ratio; and 4) poor soft-tissue contrast. In addition, the acceptable level of ionizing radiation dose is a matter of active discussion. This section briefly reviews each state-of-the-art imaging modality with respect to the goals of this thesis.

2.1.1 Magnetic Resonance Imaging

Magnetic resonance imaging (MRI) is capable of differentiating between body tissues by altering the magnetic alignment of hydrogen atoms, and measuring the released energy while the protons resume their previous alignment. T1-weighted images are well suited to depicting soft-tissue anatomy and fat, while T2-weighted images are optimal for showing fluid. The major limitation of MRI in depicting the oropharynx is motion artifacts. For years, the relatively long acquisition time of conventional MRI made recording of running speech impractical. Nowadays, shorter acquisition times have become possible due to advances in the technical aspects of MR scanners, which allow for parallel reconstruction and/or efficient k-space encoding.

High-resolution MR volumes require a long acquisition time, commonly leading to involuntary movement of the tongue and, hence, severe motion artifacts (Plenge et al., 2012). 2D acquisition can provide a refined depiction of the tongue in the acquired plane, but it results in low through-plane resolution, inadequate for most volumetric analyses. Some previous speech studies adjust the orientation of the acquisition plane so that it is orthogonal to the axis of the vocal tract (VT) (Badin et al., 2002; Engwall, 2003a; Takano and Honda, 2007). This facilitates tongue modeling for a sustained tongue posture; however, sustained articulations cause hyper-articulated configurations, which are not necessarily representative of running speech. A study by Engwall (2003b) suggests that jaw opening is more dominant in sustained vowels, whereas its contribution in running speech is taken over by larger alterations in tongue shape (see Figure 2.1).

Figure 2.1. Mid-sagittal contours of the articulators derived from sustained and real-time MRI for vowels /a/, /i/, and /u/. © Engwall (2003b), adapted with permission.

Fast MRI Acquisition

Improvements in magnet quality in recent MR scanners have enabled Gradient Echo (GE) sequences to capture higher-quality images – comparable to Spin Echo (SE) – resulting in faster acquisitions.

Parallel imaging, as a complementary reconstruction technique, has caused a revolutionary change in the length of MRI scans (Larkman and Nunes, 2007). Multi-array coil configurations augment the obtainable SNR, and provide the spatial information (in the form of receiver-coil sensitivity maps) needed to shorten phase-encoding time. In 2011, a 16-channel array receiver coil – custom-designed to provide localized sensitivity in the airway – was used to capture images of the VT at a maximal temporal resolution of 84 ms for a single slice (Kim et al., 2011a).

Traditional MRI acquires parallel lines of data in k-space (i.e., Cartesian sampling), mainly to provide simplicity in computations and robustness to noise in early MRI scanners. Under-sampling of k-space (in order to reduce scan time) is limited on the Cartesian grid, and introduces a jumping effect in consecutive images.
Radial encoding enables under-sampling of k-space by including both low- and high-resolution information in every scan profile. Using radial encoding, Uecker et al. (2010) capture dynamic MR images of the oropharynx at a rate of 55 ms per frame. By combining under-sampled radial GE MRI and serial image reconstruction (using parallel imaging), Iltis et al. (2015) quantify tongue movement in a single slice of real-time MRI acquired at a rate of 10 ms per frame.

Synchronization Methods

MR images may be acquired following an external stimulus (trigger) with an adjustable delay. This allows for the acquisition and reconstruction of pseudo-moving images in the case of repetitive periodic motion, thus permitting under-sampling. Performance can be limited by the reproducibility of the motion; however, it has been shown that good reproducibility can be obtained for tongue motions in speech – even in poly-syllabic utterances (Alvey et al., 2008). In a gated sequence, speakers are asked to repeat an utterance in time with a metronome (Alvey et al., 2008) or their heartbeat (Ventura et al., 2011), while the audio signal is recorded simultaneously with MR imaging. This allows a higher frame rate and signal-to-noise ratio compared to non-gated sequences. The utterance onset from the recorded audio signals is used for reconstruction of the MR images (Shimada et al., 2012). The high intensity of acoustic noise present during MRI acquisition makes the use of alternative systems, such as optical microphones and noise-cancellation methods, a necessity for simultaneous audio recording (NessAiver et al., 2006; Aalto et al., 2014).

Multi-planar Acquisition

Many MRI studies of speech production are limited to a single acquisition plane. The mid-sagittal plane depicts the dynamics of the articulators, but it is not sufficient for identifying some important features of speech, such as grooving/doming and anatomical asymmetries. Kim et al. (2011b) make use of a slice-interleaving technique to produce multi-slice, real-time MR scans of the VT during pronunciation of English fricatives.

Super-Resolution as Post-Processing

Super-resolution reconstruction techniques have been introduced to generate isotropic MR volumes from orthogonal slice stacks acquired sequentially (Peled and Yeshurun, 2001; Bai et al., 2004). The imaging process is formulated as an observation model, and the intensity value of each voxel is obtained using an optimization method such as maximum a posteriori (Bai et al., 2004) or least-squares (Plenge et al., 2012) estimation; a minimal least-squares sketch is given at the end of this subsection. Recently, Woo et al. (2012) apply an edge-preserving data combination technique, based on Markov Random Fields, to build super-resolution volumes of the human tongue. An isotropic resolution of 0.94 mm is reported. Figure 2.2 shows the reconstruction result, in comparison to in-plane high-resolution data.

Figure 2.2. Super-resolution volume reconstruction for static tongue MRI (Woo et al., 2012); original axial (a), sagittal (b), and coronal (c) stacks are shown after isotropic re-sampling. The reconstructed volume (d) is compared to the ideal case (e). © Woo et al. (2012), adapted with permission.
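To illustrate the observation-model formulation mentioned above, the sketch below reconstructs an isotropic volume from anisotropic stacks by plain least-squares gradient descent; each acquisition is modeled as a linear operator applied to the unknown volume. This is a minimal, hypothetical example (the function names and the toy slice-averaging operator are ours), not the cited MAP or edge-preserving methods, which add regularization and realistic slice-profile models.

```python
import numpy as np

def lsq_super_resolution(stacks, forward_ops, adjoint_ops, volume_shape,
                         n_iter=100, step=0.1):
    """Gradient-descent least-squares reconstruction of an isotropic volume.

    Each acquired stack is modeled as y_k = A_k(x) + noise, where A_k
    resamples the unknown isotropic volume x onto the k-th anisotropic
    grid (slice-profile blurring plus downsampling). We minimize
    sum_k ||A_k(x) - y_k||^2 by gradient descent.
    """
    x = np.zeros(volume_shape)
    for _ in range(n_iter):
        grad = np.zeros(volume_shape)
        for y_k, A, At in zip(stacks, forward_ops, adjoint_ops):
            grad += At(A(x) - y_k)          # A^T (A x - y)
        x -= step * grad
    return x

# Toy forward/adjoint pair: average pairs of slices along axis 0,
# a crude stand-in for the scanner's through-plane blurring.
def A_axial(x):
    return 0.5 * (x[0::2] + x[1::2])

def At_axial(r):
    return 0.5 * np.repeat(r, 2, axis=0)

# Usage sketch (in practice several orthogonal stacks constrain the volume):
truth = np.random.rand(8, 8, 8)
recon = lsq_super_resolution([A_axial(truth)], [A_axial], [At_axial], truth.shape)
```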
Upright MRI

Due to technical limitations of scanners, MR imaging is mostly performed in the supine position. However, gravitational force is believed to affect the dynamics and biomechanics of speech. A study by Kitamura et al. (2005) shows that tongue retraction exists in the supine position and is more severe in back vowels than in front vowels. In the case of the former, this may be due to stabilization of the tongue by pressing its sides against the hard palate. The study measures noticeable displacement (up to 20 mm) for the tongue tip, and identifies the effect of body posture on the larynx, lower jaw, lower lip, and posterior pharyngeal wall. The major drawback of upright MRI is that it is currently only available in open-MRI configurations that provide low-intensity magnetic fields (typically 0.5 T), due to safety issues. A lower magnetic field translates into longer acquisition times for images of sufficient SNR.

Tagged MRI

Dynamic acquisition fails to provide high image contrast within the soft tissue itself; therefore, internal tissue points are not distinguishable, and their motion is not quantifiable. MR tagging introduces temporary features inside the tissue by applying a sequence of RF pulses that spatially modulate the longitudinal magnetization of hydrogen protons prior to imaging (Kerwin et al., 2000). In subsequent images, the varying magnetization manifests itself as alternating light and dark tag patterns. The induced tags persist through the motion, and are visible in images acquired perpendicular to the tagging planes. Tagged MRI has been successfully applied to speech imaging – especially using gated cine MRI pulse sequences (NessAiver et al., 2006; Xing et al., 2013).

The harmonic phase (HARP) algorithm was introduced to estimate the 2D motion of the tag features from tagged MRI slices (Osman et al., 2000); a minimal sketch of the underlying phase-tracking idea is given at the end of this subsection. The 2D motion (estimated from slices acquired at different orientations and at different times) is then combined to produce a 3D tracking result. Liu et al. (2012) introduce an incompressible deformation estimation algorithm (IDEA) that incorporates a tongue-incompressibility constraint while imposing a smoothing, divergence-free vector spline to seamlessly interpolate HARP velocity fields across the tongue. The results are shown to be accurate for internal tissue points of the tongue; however, since HARP uses a bandpass filter (which blurs object boundaries), the motion estimated at the tongue surface remains inaccurate. Later, Xing et al. (2013) propose an enhanced version of the algorithm (E-IDEA) that improves the reliability of the displacement field at the tongue surface, by incorporating the 3D deformation of the tongue surface computed from cine MRI.
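The sketch below illustrates the phase-tracking idea behind HARP for a single 1D tag pattern and small frame-to-frame motion: a band-pass filter around the first harmonic peak yields a harmonic-phase image that acts as a material coordinate, and displacement follows from the wrapped phase difference. It is a simplified, hypothetical example (function names and the Gaussian band-pass are our own choices), not the published HARP or IDEA code, which handles large motions, multiple tag directions, and 3D interpolation.

```python
import numpy as np

def harmonic_phase(image, tag_freq, axis=0, bandwidth=0.03):
    """HARP phase of a 1D tag pattern along `axis`.

    Band-pass filters the spectrum around the first harmonic peak at
    +tag_freq (cycles/pixel) and returns the angle of the complex
    harmonic image, which moves with the tissue.
    """
    spectrum = np.fft.fftn(image)
    freqs = np.fft.fftfreq(image.shape[axis])
    band = np.exp(-0.5 * ((freqs - tag_freq) / bandwidth) ** 2)
    shape = [1] * image.ndim
    shape[axis] = -1
    harmonic = np.fft.ifftn(spectrum * band.reshape(shape))
    return np.angle(harmonic)

def small_displacement(phase_ref, phase_def, axis=0):
    """One linearised tracking step: solve phase_def(x + u) ~ phase_ref(x)."""
    dphi = np.angle(np.exp(1j * (phase_def - phase_ref)))   # wrapped difference
    grad = np.gradient(phase_def, axis=axis)                 # ~ 2*pi*tag_freq
    return -dphi / (grad + 1e-8)

# Toy demo: a 1D tag pattern shifted by 0.7 pixels between frames.
x = np.arange(256)
ref = np.cos(2 * np.pi * 0.125 * x)
defm = np.cos(2 * np.pi * 0.125 * (x - 0.7))
u = small_displacement(harmonic_phase(ref, 0.125), harmonic_phase(defm, 0.125))
print(np.median(u))  # ~0.7 pixels
```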
Diffusion Tensor Imaging

Diffusion tensor imaging (DTI) is an MRI method designed to detect the diffusion of water molecules in biological tissue. Such diffusion reflects the interaction of water molecules with many obstacles (such as muscle fibres), and hence reveals microscopic details of tissue architecture. DTI patterns are extracted from the images using tractography methods.

DTI tractography has been used to depict the structure of the human tongue. Gaige et al. (2007) demonstrate the geometric relationships between intrinsic and extrinsic myofiber populations, with a focus on the manner in which key extrinsic fibers merge with the longitudinal, transverse, and vertical intrinsic fibers. Mijailovich et al. (2010) incorporate these fiber structures to derive a finite-element model of lingual deformation during swallowing. Using DTI, Murano et al. (2010) show major changes in the structure of the inferior longitudinal muscle bundle in the tongue anatomy of one control volunteer and one glossectomy patient.

2.1.2 Computed Tomography Imaging

Computed tomography (CT) imaging combines multiple X-ray projections to reconstruct 3D images of tissue (with spatial resolution as high as 0.3 mm per voxel). Bone and airway have high contrast in CT, since X-rays are absorbed by dense tissue and pass through air. Multi-slice CT (MSCT) scanners have recently been equipped with multiple arrays of X-ray detectors that are able to reconstruct a 3D volume from a single rotation, and thus reduce exposure time significantly. Cone-beam CT (CBCT) is used in studies related to orthodontic treatment and maxillofacial surgery. Glupker et al. (2015) use CBCT to measure airway volume changes between open and closed jaw positions for patients with temporomandibular joint disorders. Fujii et al. (2011) use a 320-detector-row MSCT scanner to capture a single-phase 3D image volume of the oropharynx in less than 0.35 s. The imaging process is repeated for 29 phases at intervals of 0.1 s, to generate a fully three-dimensional film of swallowing in one volunteer. The study is unique in providing a full 3D movie of a fast oropharyngeal function. Very recently, Inamoto et al. (2015) use a similar single-phase imaging protocol to investigate the effects of age, gender, and height on the anatomy of the oropharynx in 55 volunteers. The high dosage of X-ray exposure is the main drawback of medical CT, which hinders its use on healthy volunteers.

2.1.3 Electromyography

Electromyography (EMG) involves transducing the electrical signals associated with muscle activations (Miyawaki et al., 1975). EMG assists with understanding speech motor control by identifying active muscles and their coordination during a speech gesture. However, the relationship between the EMG signal and the mechanical muscle forces is not straightforward, and can be influenced by many factors, such as muscle fiber type, muscle length, and muscle velocity (Sherif et al., 1983). Direct measurement of muscle force during EMG sessions has been performed to help investigate this relationship for musculoskeletal muscles, but is subject to technical challenges (Roberts and Gabaldón, 2008).

Fine-wire needle electrodes have been used to measure the activity of the lateral pterygoid muscle (the main jaw opener), mainly in breathing and chewing movements (Murray, 2012). In the early 1980s, Tuller et al. (1981) and Gentil and Gay (1986) compared EMG recordings of jaw muscles for speech and non-speech gestures, identifying the medial pterygoid and superior lateral pterygoid muscles as the jaw elevators, and the inferior lateral pterygoid and (anterior belly of the) digastric muscles as the jaw depressors during speech. Surface electrodes are used widely for recording facial EMG, in order to facilitate recognition of audible and silent speech; we refer to Wand (2015) for a recent review of the field. Surface EMG of jaw muscles, including the
submandibular, masseter, anterior temporalis, and (anterior belly of the) digastric muscles, has also been reported in conjunction with recordings of jaw movement in control subjects (Kawakami et al., 2012) and patients with related disorders (Ma et al., 2013).

EMG recording of the tongue during speech gestures was performed as early as the mid-1960s (MacNeilage and Sholes, 1964); however, the first study to record the activity of the extrinsic muscles of the tongue (including the anterior and posterior genioglossus, hyoglossus, and styloglossus) did not happen until two decades later (Baer et al., 1988). The process remains challenging: the moist surface and highly deformable body of the tongue prohibit excessive use of electrodes on its surface (Yoshida et al., 1982). Measured signals are difficult to interpret reliably, due to their high variability within subjects, across sessions, and across subjects. In addition, deep and small muscles are barely accessible, and their EMG recording is limited by the lack of suitable technology for dealing with crosstalk between adjacent channels.

2.1.4 Electromagnetic Articulometry

Electromagnetic articulometry (EMA) systems track the position of the articulators by using a set of markers attached to the surface of the tongue, jaw, teeth, and lips. Each marker is a small sensor coil that moves in a known (and varying) electromagnetic field. The electromagnetic field induces a weak current in the sensors, which is measured and mapped to the coordinates of the markers. A reference marker is often used to compensate for head motion. EMA systems are not invasive, do not require line-of-sight, and are able to capture their data in the upright position. Modern systems, such as the Carstens AG500 (www.articulograph.de) and NDI Wave (www.ndigital.com), can provide high temporal update rates (e.g., around 100 Hz) and are self-calibrating; however, their accuracy degrades in the presence of metallic objects (such as mercury tooth fillings) due to field distortion. Tracking is also limited to eight markers in simultaneous acquisition. Most speech studies place the markers on the mid-sagittal line of the articulators, assuming that speech is a bilaterally symmetric phenomenon (Engwall, 2003a; Narayanan et al., 2014).

2.1.5 Image Segmentation

Segmentation is the task of partitioning a medical image into semantically interpretable regions (i.e., organs of interest). The level of difficulty in image segmentation is related to the degree of intensity variation across tissue boundaries: while segmentation of the mandible from CT can be achieved by applying simple thresholding methods, segmentation of the tongue from MRI remains a challenge and requires intervention from a trained anatomist. Manual segmentation produces accurate results, but is prohibitively time-consuming and tedious – especially where it must be repeated for several datasets. General-purpose interactive tools can ease the task, but still require significant user interaction time.

Prior knowledge of shape and appearance, if incorporated effectively, can assist in dealing with soft-tissue inhomogeneities, noise, and low contrast in medical images (Heimann and Meinzer, 2009). In this regard, shape constraints have been embedded into the level-set framework (Leventon et al., 2000; Tsai et al., 2003; Foulonneau et al., 2009), and further equipped with trained distance maps (Bresson et al., 2006).
In addition, statistical methods, such as Active Shape Models (Cootes et al., 1995), have been widely explored to encompass intra- and inter-subject morphological discrepancies. Here, the cardinality of the training set is proportional to the degree of natural variability of the organ shape. For example, Heimann et al. (2007) select 32 of 86 datasets to train their 3D reference model for segmentation of the liver in CT volumes. Acquiring a sufficiently large dataset remains a challenge for MRI volumes of the upper airway.

As an alternative to statistical methods, prior information may be formulated in a single template, and registered to the target image. Saddi et al. (2007) use template matching as a complementary step in their liver segmentation process, in order to compensate for the limitations of their learning set. Somphone et al. (2008) transform their binary template subject to conformity constraints between local patches. In a different approach, Gilles and Pai (2008) use an explicit shape representation of the template to segment musculoskeletal structures from MR images. Their mesh deformation is regularized based on an expanded version of a computer animation technique called Shape Matching (Muller et al., 2005); a minimal numerical sketch of this step is given at the end of this subsection. The method is shown to efficiently approximate large, soft-tissue elastic deformations. We refer to Sotiras et al. (2012) for a detailed review of deformable medical image registration.

Despite the successful use of shape priors, automatic segmentation is still challenging in low-contrast medical images of soft tissue. Clinical applications demand expert supervision and control over the process and final results of the segmentation. Effective and minimal interactivity schemes would provide higher reliability, while lowering the cost of interaction. Freedman and Zhang (2005) combine shape priors and interactivity in a 2D graph-cut framework. Recently, Mory et al. (2012) incorporate user input, as inside/outside labelled points, to improve the robustness and accuracy of a non-rigid implicit template deformation.

Lee et al. (2013) propose a semi-automatic seeding method, based on the Random Walker algorithm (Grady, 2006), for 3D segmentation of the tongue in low-resolution dynamic MRI, where the tongue has a uniform intensity. Their dataset consists of stacks of 2D slices, captured and averaged over 26 time frames for two English speakers. The user provides seeds in some slice images (in space) and frames (in time), which propagate to other frames and slices using a deformable image-to-image registration technique. The seeds are further fed to the Random Walker algorithm (Grady, 2006) to obtain the final segmentations. This method is useful for speech analysis, but still requires an excessive amount of user interaction for accurate segmentation of a (high-resolution, static) MRI volume.

Other reported works on segmentation of the oropharyngeal structures from MRI data focus on 2D slices. Bresch and Narayanan (2009) propose an unsupervised regional technique (performed in the frequency domain), which captures the shape of the VT. Peng et al. (2010) use a shape-based variational framework for tracking the tongue contour at its surface in mid-sagittal dynamic MRI. Eryildirim and Berger (2011) manage to include physically corresponding surface curves of the tongue in their previously introduced PCA-guided segmentation method. Although these methods provide valuable information for speech studies, they ignore delineation of the tongue at its base, as well as its contact with the epiglottis, hyoid bone, and salivary glands. Segmentation algorithms tend to fail in these areas, due to the fusion of the tongue into neighbouring tissues of similar composition. In addition, 3D reconstruction of the tongue shape from its sparse 2D segmented contours is not straightforward.
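For concreteness, the sketch below shows the core of the rigid shape-matching step referenced earlier in this subsection: it finds the best-fit rotation and translation between rest and deformed vertex positions, and returns the goal positions toward which the mesh is pulled. This is a minimal illustration of the idea in Muller et al. (2005), assuming the simplest rigid variant rather than the clustered, elastic extension used by Gilles and Pai (2008); the function name is ours.

```python
import numpy as np

def shape_matching_goals(rest, current, masses=None):
    """One rigid shape-matching step (Muller et al., 2005).

    rest, current : (N, 3) arrays of vertex positions.
    Returns the (N, 3) goal positions obtained by rigidly fitting the
    rest shape to the current configuration in the least-squares sense.
    """
    n = len(rest)
    m = np.ones(n) if masses is None else np.asarray(masses, float)
    c0 = (m[:, None] * rest).sum(0) / m.sum()      # rest centroid
    c = (m[:, None] * current).sum(0) / m.sum()    # current centroid
    q = rest - c0
    p = current - c
    Apq = (m[:, None, None] * p[:, :, None] * q[:, None, :]).sum(0)
    # Polar decomposition via SVD: R is the rotation closest to Apq.
    U, _, Vt = np.linalg.svd(Apq)
    R = U @ Vt
    if np.linalg.det(R) < 0:                       # avoid reflections
        U[:, -1] *= -1
        R = U @ Vt
    return q @ R.T + c                             # goal positions

# Each iteration the vertices are nudged toward their goals,
# x_i += alpha * (g_i - x_i), which keeps the mesh close to its prior
# shape while it is attracted to image features during segmentation.
```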
2.2 Biomechanical Modeling

Recent improvements in speech data acquisition motivate the use of computational approaches to model speech phenomena (Vasconcelos et al., 2012; Ventura et al., 2009, 2013). Biomechanical models aim to simulate the dynamics of speech production following biological and physical assumptions about the articulators and motor control (Fang et al., 2009; Stavness et al., 2012a,b).

Simulation of soft-tissue deformation should be handled based on the physics of continuum mechanics. Finite element (FE) analysis divides the continuous domain into a set of discrete sub-domains, called elements, to approximate the solution of the related partial differential equations (PDEs). The accuracy and stability of the solution depend on the resolution, as well as on the shape and type of the elements (Nealen et al., 2006). FE models are adopted in the ArtiSynth toolkit (Lloyd et al., 2012) for simulation of the generic oral, pharyngeal, and laryngeal (OPAL) deformable soft tissues.

2.2.1 Generic Models

The Tongue

Generic models of the tongue, the main articulator in speech production, have been developed previously (Dang and Honda, 2004; Gerard et al., 2006; Buchaillard et al., 2009) and incorporated in the simulation of speech movements (Perrier et al., 2003; Stavness et al., 2012a,b). These models have been further enhanced through coupling to the jaw and hyoid (Stavness et al., 2011), as well as the face and skull (Badin et al., 2002; Stavness et al., 2014a).

The state-of-the-art FE tongue model was developed by Buchaillard et al. (2009), based on CT images of a single male subject. It consists of 946 nodes, 740 hexahedral elements, and 11 pairs of muscle bundles with bilateral symmetry (genioglossus anterior (GGA), medium (GGM), and posterior (GGP); hyoglossus (HG); styloglossus (STY); inferior longitudinal (IL); verticalis (VERT); transversus (TRANS); geniohyoid (GH); mylohyoid (MH); and superior longitudinal (SL); see Figure 2.3). The model is imported from the ANSYS environment (www.ansys.com) into ArtiSynth (Stavness, 2010). Each muscle bundle in the model is defined as a set of muscle fibers (indicating the direction of the force) and their corresponding elements. The force capacity of each muscle is a function of its cross-sectional area (CSA), and is distributed across its fibers. Table 2.1 shows the maximum force for each muscle bundle as used by Buchaillard et al. (2009).

Figure 2.3. The generic FE tongue model, designed by Buchaillard et al. (2009), and its muscle bundles (GGA, GGM, GGP, STY, HG, VERT, GH, MH, TRANS, IL, SL).

Table 2.1. Max force and CSA for muscle bundles in the generic tongue model (Buchaillard et al., 2009).

Intrinsic muscles:   GGA    GGM   GGP    VERT   TRANS   IL     SL
Max force (N):       32.8   22    67.2   36.4   90.8    16.4   34.4
CSA (mm^2):          82     55    168    91     227     41     86

Extrinsic muscles:   STY    HG    MH     GH
Max force (N):       43.6   118   35.4   32
CSA (mm^2):          109    295   88     80

The model uses a fifth-order Mooney-Rivlin tissue material, where the strain energy W is described as

W = C_{10}(I_1 - 3) + C_{20}(I_1 - 3)^2 + \kappa(\ln J)^2        (2.1)

where I_1 is the first invariant of the left Cauchy-Green deformation tensor, C_{10} and C_{20} are the Mooney-Rivlin material parameters, and the term \kappa(\ln J)^2 reinforces incompressibility. Values of C_{10} = 1037 Pa and C_{20} = 486 Pa are used, as measured by Gerard et al. (2006) from a fresh cadaver tongue and scaled by a factor of 5.4 to match in-vivo experiments (Buchaillard et al., 2009). The bulk modulus is set to \kappa = 100 x C_{10} to provide a Poisson's ratio close to 0.499. Tongue tissue density is set to 1040 kg/m^3, close to water density. The values of C_{10} and C_{20} increase linearly from (1037 Pa, 486 Pa) at no activation to (10370 Pa, 4860 Pa) at full activation.
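To make Equation 2.1 concrete, the short sketch below evaluates the strain-energy density for a given deformation gradient. It is a stand-alone numerical illustration of the constitutive law exactly as written above (taking I_1 directly from the full left Cauchy-Green tensor, with no deviatoric split), not the ArtiSynth implementation; the helper name is ours.

```python
import numpy as np

def mooney_rivlin_energy(F, c10=1037.0, c20=486.0, kappa=100 * 1037.0):
    """Strain-energy density of Eq. 2.1 for a 3x3 deformation gradient F.

    W = c10*(I1 - 3) + c20*(I1 - 3)**2 + kappa*(ln J)**2,
    with I1 = tr(B), B = F F^T the left Cauchy-Green tensor, J = det(F).
    Parameters are in Pa, so W is in J/m^3.
    """
    F = np.asarray(F, dtype=float)
    B = F @ F.T
    I1 = np.trace(B)
    J = np.linalg.det(F)
    return c10 * (I1 - 3.0) + c20 * (I1 - 3.0) ** 2 + kappa * np.log(J) ** 2

# Example: 1% uniaxial stretch at constant volume (J = 1), at rest stiffness.
F = np.diag([1.01, 1.0 / np.sqrt(1.01), 1.0 / np.sqrt(1.01)])
print(mooney_rivlin_energy(F))  # small positive energy, ~0.3 J/m^3
```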
(2006) from a fresh cadaver tongue and scaled by a factor of 5.4 to match the in-vivo experiments (Buchaillard et al., 2009). The bulk modulus is set to κ = 100 × C10 to provide a Poisson's ratio close to 0.499. Tongue tissue density is set to 1040 kg·m−3, close to water density. The values of C10 and C20 increase linearly from (1037 Pa, 486 Pa) at no activation, to (10370 Pa, 4860 Pa) at full activation.

Table 2.1. Max force and CSA for muscle bundles in the generic tongue model (Buchaillard et al., 2009).

Intrinsic muscles   GGA    GGM   GGP    VERT   TRANS   IL     SL
Max force (N)       32.8   22    67.2   36.4   90.8    16.4   34.4
CSA (mm2)           82     55    168    91     227     41     86

Extrinsic muscles   STY    HG    MH     GH
Max force (N)       43.6   118   35.4   32
CSA (mm2)           109    295   88     80

The Jaw and Hyoid

The generic jaw-hyoid model in ArtiSynth (Hannam et al., 2008) is coupled to the tongue FE model via multiple attachment points included in the constitutive equations of the system as bilateral constraints. Tongue-jaw attachments include the insertion of the genioglossus and geniohyoid onto the mandibular geniotubercle and the insertion of the mylohyoid along the mandibular mylohyoid ridge. Tongue-hyoid attachments include the entire region around the anterior-superior surface of the hyoid bone, including insertions of the geniohyoid, mylohyoid, and hyoglossus muscles. Eleven pairs of bilateral point-to-point Hill-type actuators are used to represent the associated muscles (mylohyoid: anterior [AM], posterior [PM]; temporal: anterior [AT], middle [MT], posterior [PT]; masseter: superficial [SM], deep [DM]; pterygoid: medial [MP], superior-lateral [SP], inferior-lateral [IP]; digastric: anterior [AD], posterior [PD]; stylohyoid [SH]), as shown in Figure 2.4. The temporomandibular joint (TMJ) is modeled by curvilinear constraint surfaces (see Figure 2.4).

Figure 2.4. The generic jaw-hyoid model, as proposed by Hannam et al. (2008), showing the jaw and hyoid attachments and the Hill-type muscle actuators.

Hill's muscle model provides a 1D mechanical model for skeletal muscles by defining a relationship between tension and shortening velocity. This relationship accounts for both active and passive forces in a tetanic muscle contraction (Martins et al., 1998). Hill's muscle model can be described by a 3-element representation, as shown in Figure 2.5. The contractile element (CE) is the active part of the muscle; it shortens when activated, but is freely extensible when unactivated. The series elastic element (SE) allows for rapid transition of the muscle state from inactive to active and works as an energy-storing unit. The parallel element (PE) is responsible for the passive behaviour of the muscle when stretched. The overall tension (F) and length (L) of the muscle relate to those of its elements as follows:

F = F_PE + F_SE,   F_SE = F_CE
L = L_SE + L_CE,   L = L_PE        (2.2)

where F_SE and F_PE are non-linear functions of the muscle stretch (change of length relative to the resting state).

Figure 2.5. Hill's model for skeletal muscles: force-length relationship (left) and the 3-element representation (right).

The instantaneous force-generating capacity of the muscles in the jaw model varies non-linearly with length, and linearly with shortening velocity.
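To make the Hill-type relationship above concrete, the following is a minimal sketch of how the tension of a single point-to-point actuator might be evaluated from its activation, length, and shortening velocity. The curve shapes and the parameters l_opt and v_max are generic textbook assumptions rather than the ArtiSynth implementation; only the 158 N maximum force is borrowed from the anterior temporalis entry of Table 2.2 below.

```python
import numpy as np

def hill_tension(activation, length, velocity,
                 f_max=158.0, l_opt=1.0, v_max=10.0):
    """Toy Hill-type actuator tension.

    activation : muscle activation in [0, 1]
    length     : current fibre length (same units as l_opt)
    velocity   : shortening velocity (positive = shortening)
    f_max      : maximum isometric force in N (AT value from Table 2.2)
    l_opt      : optimal fibre length (assumed)
    v_max      : maximum shortening velocity (assumed)
    """
    l_n = length / l_opt
    # Active force-length: inverted parabola about the optimal length.
    f_l = max(0.0, 1.0 - 4.0 * (l_n - 1.0) ** 2)
    # Force-velocity: linear de-rating with shortening speed, as the jaw
    # model scales force linearly with velocity.
    f_v = max(0.0, 1.0 - velocity / v_max)
    # Passive parallel-element force: engages only when stretched.
    f_pe = np.exp(5.0 * (l_n - 1.0)) - 1.0 if l_n > 1.0 else 0.0
    return f_max * (activation * f_l * f_v + f_pe)

print(hill_tension(activation=0.5, length=1.05, velocity=1.0))
```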
Table2.2 shows the maximum force and CSA used for the jaw and hyoid musclesin the generic model (Stavness, 2010).2.2.2 Subject-Specific ModelingIn order to be clinically relevant, the aforementioned generic models shouldbe simulated using neurological or kinematic measurements, such as EMG orEMMA recordings. Unfortunately, available data is often specific to certainsubjects that do not share the same geometry as the generic model. A212.2. Biomechanical ModelingTable 2.2. Max force and CSA for the muscles in the generic jaw-hyoid model(Stavness, 2010).Jaw Closers AT MT PT SM DM MPMax Force (N) 158.0 95.6 75.6 190.4 81.6 174.8CSA(mm2) 395 239 189 476 204 437Jaw Openers SP IP ADMax Force (N) 28.7 66.9 40.0CSA (mm2) 72 167 100similar issue manifests itself in the validation phase, prohibiting meaningfulcomparison of numeric simulation results with subject-specific measurements.To alleviate some of these issues, one option is to perform heuristic registra-tion of subject data to the generic model (Fang et al., 2009; Sa´nchez et al.,2013), or restrict comparisons to average speech data reported in literature(Stavness et al., 2012a,b). While these approaches are valuable in provid-ing a proof of concept, they are not suitable in a patient-specific medicalsetting. Subject-specific biomechanical modeling, on the other hand, wouldaddress these issues while simultaneously enabling the investigation of inter-and intra-subject variability in speech production. In addition, it facilitatesfurther development of a patient-specific platform for computer-assisted di-agnosis and treatment planning of speech disorders.A volumetric finite element is often represented in a tetrahedral or hexahedralconfiguration. Tetrahedral meshing is widely explored in the literature (Alliezet al., 2005; George et al., 2002). It is straightforward and applicable tounrestricted topologies; but linear tetrahedra tend to lock and become overlystiff for nearly incompressible materials such as muscles (Hughes , 2000).Hexahedra have better convergence, may vastly reduce the size of the linearsystem, and are preferable for non-linear analysis of anisotropic materials(Montagnat et al., 2000).Image-based subject-specific modeling algorithms tend to directly generatea hexahedra-dominant FE mesh from a dataset of stacked images. Keyaket al. (1990) introduce the popular voxel-based method in which each voxelbelonging to the organ of interest is identified by thresholding, and thentransformed into a cubic element. The method is fully automated, general,and robust; however, it lacks efficient descriptors for identifying soft-tissues.222.2. Biomechanical ModelingOther algorithms rely either on contours (Teo et al., 2007) or a surface mesh(Baghdadi et al., 2005; Bucki et al., 2010a), prior-extracted from the imagestacks or volume. The modeling process consists of two consecutive phases ofsegmentation and FE volume mesh generation; and the overall clinical utilitydepends greatly on the accuracy and the degree of automation of each phase.Current methods for creating subject-specific biomechanical meshes can beorganized into two categories: meshing and registration. FE meshing tech-niques tend to generate FE models based solely on a subject’s anatomy.The generated mesh often suffers from jagged boundaries, and low qualityelements that should be made smooth, and regular. 
Smoothing, in turn,may cause surface shrinkage and the generation of ill-conditioned elements.Different approaches, such as mesh untangling or interior mesh smoothing,are proposed to cope with the problem (Zhang et al., 2005; Livesu et al.,2015). Still, involved optimization procedures are computationally expen-sive. A mixed-element FE meshing method, designed by Lobos (2012), hasbeen shown to generate well-behaved meshes that approximate anatomicalboundaries effectively. FE meshing techniques provide an adjustable meshresolution to fit different needs for simulation time and accuracy; but theyfail to offer the biomechanical information included in current generic modelssuch as muscle definitions and coupling attachments. This, in turn, intro-duces prohibitive costs of redesigning these features for each subject model.The subject-specific FE volume mesh may be acquired by registering a genericFE model to the surface mesh of the anatomy. This will automatically con-vey the muscle attachments and mechanical properties of the generic modelto the volume mesh of a specific subject (Bucki et al., 2010a; Grosland et al.,2009; Sigal et al., 2008). The approach is also well-suited for complex mesheswhere re-meshing is a burden. Registration is performed using a 3D transfor-mation Map : x 7−→ y, which represents the mapping between the reference,x = (x1, x2, x3), and the actual, y = (y1, y2, y3), coordinate systems of theelements. In the Mesh-Matching approach (Couteau et al., 2000), Map forthe internal elements is estimated from the mapping between the source andtarget surfaces. Due to this estimation, Mesh-Matching methods may gener-ate invalid or poor quality elements, unsuitable for any further FE analysis(Luboz et al., 2005). Figure 2.6 explains the problem in a simple schematic2D case, where M-M methodology fails to provide high-quality elements forthe target configuration.232.2. Biomechanical ModelingFigure 2.6. Schematic 2D representation of failure in Mesh-Matching methodology,as described by Bucki et al. (2011): the source FE mesh(a), the target surface(b),their aligned configuration(c), and the final mesh(d) after applying the transfor-mation Map to the internal nodes. The circled area represents elements of poorquality. c©Bucki et al. (2011), adapted with permission.In order to enable FE analysis of hexahedra-dominant meshes, the elementsshould maintain their convexity and positive volume (Shepherd, 2007). Thisis assessed through the element regularity measure. The measure is definedas the value of the Jacobian J(x), which is the determinant of the matrix∂Map/∂x. The element e is considered regular if J(x) > 0 for all the ele-ment’s nodes, x in e.Hexahedral elements should also keep their shape conformity to prevent un-even discretization of the deformed domain (Knupp, 2000). The popular Ja-cobian ratio (JR) quality measure is defined as the ratio Jen/Jemax, where Jenis the value of the Jacobian, at node n, in element e and Jemax = Maxm∈eJem.JR ranges from 0 to 1 and gives an indication of the contribution of eachnode in the element distortion.To compensate for the aforementioned irregularities, relaxation procedureswere introduced, to repair the mesh after deformation. Mesh repair is per-formed through minimizing some validity and quality energy functions, whichare calculated based on the Jacobian value. The state-of-the-art mesh repairmethod is embedded into the Mesh-Match-and-Repair (MMRep) algorithm(Bucki et al., 2010a). 
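As a stand-alone illustration of these two measures, the sketch below evaluates the Jacobian of a single 8-node hexahedron at each of its corners, using standard trilinear shape functions, and reports the minimum-to-maximum quotient. The corner-only evaluation and the node ordering are simplifying assumptions; this is not the repair procedure of Bucki et al. (2010a).

```python
import numpy as np

# Parametric corner coordinates of a standard 8-node hexahedron
# (bottom face 0-3 counter-clockwise, top face 4-7 above them).
CORNERS = np.array([[-1,-1,-1],[ 1,-1,-1],[ 1, 1,-1],[-1, 1,-1],
                    [-1,-1, 1],[ 1,-1, 1],[ 1, 1, 1],[-1, 1, 1]], float)

def corner_jacobians(nodes):
    """Determinant of the trilinear map's Jacobian at each of the 8 corners.

    nodes : (8, 3) array of node coordinates, ordered as in CORNERS.
    """
    dets = []
    for xi, eta, zeta in CORNERS:
        # Gradients of the trilinear shape functions at this corner.
        grads = 0.125 * np.stack([
            CORNERS[:, 0] * (1 + eta  * CORNERS[:, 1]) * (1 + zeta * CORNERS[:, 2]),
            CORNERS[:, 1] * (1 + xi   * CORNERS[:, 0]) * (1 + zeta * CORNERS[:, 2]),
            CORNERS[:, 2] * (1 + xi   * CORNERS[:, 0]) * (1 + eta  * CORNERS[:, 1]),
        ], axis=1)                                   # shape (8, 3)
        dets.append(np.linalg.det(nodes.T @ grads))  # 3x3 Jacobian determinant
    return np.array(dets)

def jacobian_ratio(nodes):
    """JR = min corner Jacobian / max corner Jacobian (1 = ideal; <= 0 = irregular)."""
    d = corner_jacobians(nodes)
    return d.min() / d.max()

# A unit cube is perfectly regular, so JR should be 1.
cube = np.array([[0,0,0],[1,0,0],[1,1,0],[0,1,0],
                 [0,0,1],[1,0,1],[1,1,1],[0,1,1]], float)
print(jacobian_ratio(cube))
```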
Here, the mesh repair is performed as the follow-up toa multi-scale, iterative, and elastic Mesh-Matching registration process.The current oropharyngeal models are designed for normal adult speakers.Therefore, it is reasonable to suspect their fidelity (even as a reference forsubject-specific modeling) in cases where the anatomy and/or neurology devi-ates vastly from such norms. As an example, a study by Smith and Goffman(1998) suggests different underlying control processes for children, that causelarger, slower speech movements.242.2. Biomechanical Modeling2.2.3 Inverse Simulation MethodsForward simulation consists of tuning muscle-activation signals to producedesired kinematics. There are two main challenges associated with predictingthe muscle activations of oropharyngeal structures. Firstly, muscle forces aredifficult to measure, while performing complex motor tasks, such as runningspeech. Secondly, the associated biomechanics is redundant to the motionspace of the system. For example, a tongue model (such as Buchaillard etal. [2009]) contains several pairs of extrinsic and intrinsic muscles, varyingactivations of which may cause similar motion. This is mostly due to the com-putational limitations of the models for differentiating all possible kinematicDOFs of the soft-tissue. This phenomenon is called motor redundancy, andhas been confirmed for tongue configurations in speech movements (Stavness,2010). Direct tuning of tongue muscles has been reported previously in theliterature (Fang et al., 2009), but the process is mostly manual and reliesheavily on trial-and-error. As an alternative, data-driven methods estimatemuscle activations based on the measured kinematics, by solving an inverseproblem.Inverse methods tend to optimize the solution to a set of system equations.These equations incorporate the force and kinematic measurements, subjectto specific biomechanical constraints. Static optimization methods calculatethe net force of the system for a sustained configuration of the model, suchas point-to-point movements of the jaw (Silva and Ambrsio, 2004). In orderto break down the net force of the system to those of individual muscles,the process needs to be repeated for each instance of motion. An instanta-neous cost function, such as minimum excitation, is used to resolve muscleredundancy. Static optimization may magnify the recording error throughdifferentiation of velocity and acceleration (Erdemir et al., 2007).Dynamic optimization methods apply more biologically plausible assump-tions, such as minimizing metabolic energy expenditure per unit of distance(based on the start and end point of the motion). The approach tends toproduce accurate results, but is computationally expensive. Some studies(such as the one by Anderson et al. (2001) on joint movement) show prac-tically equivalent results for static and dynamic inverse optimization, andsuggest that, depending on the motor task, the dynamic optimization maynot be necessary. However, the dynamic scheme is still useful if, 1) accurate252.3. Acoustics of Speechexperimental data is not available, 2) activation dynamics is known and playsan important role in the task, or 3) the ability to predict novel movement isdesired (Anderson et al., 2001).The trajectory-tracking simulation is a popular inverse modeling techniquewidely used for musculoskeletal systems, where the muscles are mainly mod-eled as mass-less springs (Erdemir et al., 2007). The method is also ex-panded to quasi-static FE Models for the face (Sifakis et al., 2005). 
Eachinput of muscle activations and skeletal configuration is directly mapped tothe steady state expression it gives rise to. Stavness (2010) includes themuscular-hydrostatic properties of the tongue (such as incompressibility)to enable dynamic simulation of a FE tongue model. The method showspromise for simulation of soft-tissue (such as those of the tongue) whichare activated without the mechanical support of a rigid skeletal structure.It uses per-timestep static optimization (as opposed to optimizing over thefull time-varying trajectory), and is computationally efficient; nevertheless,it may lead to suboptimal muscle activations. Estimated activations are fedback to the forward simulation, and the error in the model’s trajectories isused to re-adjust the prediction.2.3 Acoustics of SpeechSpeech, the vocalized form of human communication, can be subdivided intosmall sound units called phonemes. Vowels are voiced phonemes, meaningthey are articulated through vibration of the vocal folds, while the VT re-mains mainly open. Vowels function as the core of speech syllables. Anexample is the vowel a (/æ/) in the English word cat, written in the Interna-tional Phonetic Alphabet (IPA) as /kæt/. In contrast to vowels, consonantsmostly serve as the beginning (onset) or end (coda) of a syllable, and arearticulated with complete or partial closure of the VT.2.3.1 Vowel PhonemesSeveral articulatory features, referred to as quality, are associated with eachvowel to distinguish it from others. From the vowel qualities common inphonetic studies, height and backness/frontness depend on the vertical and262.3. Acoustics of Speech Front Central Back Close Round Un-round Mid-close Mid-open Open beet bait bet bat boot book bought bot bit buh boat but Vowel /i/ Vowel /u/ Figure 2.7. IPA classification of monophonic English vowels based on their qualities(left) vs. a mid-sagittal schematic representation of the front vowel /i/ (middle)and back vowel /u/ (left).horizontal position of the tongue respectively; roundness relates to the con-figuration of the lips; nasality is associated with the position of the velum.Figure 2.7 shows the IPA classification of the monophonic English vowelsbased on their qualities. It also includes a schematic representation of thearticulators in the front vowel /i/ and the back vowel /u/ in the mid-sagittalplane.Vowels are periodic signals, so they are expected to introduce distinct har-monic peaks in the frequency domain. In particular, the first two peak fre-quencies, known as F1 and F2 formants, are used to define distinct vowels.The value of F1 is mainly determined by the height of the tongue body;it is higher for an open vowel (such as /a/) and lower for a closed vowel(such as /i/). On the other hand, the value of the F2 is effected by thebackness-frontness of the tongue body; it is higher for a front vowel (suchas /i/) and lower for a back vowel (such as /u/) (Ladefoged, 2001). Severalstudies measure the formant values of the vowels from audio signals acrossdifferent populations and languages. Figure 2.8 shows the formant diagramfor 10 English vowels as proposed in the early 1950’s by Peterson and Barney(1952). Later on, Hillenbrand et al. (1995) revisited that study, consideringdynamic aspects of vowel production, such as duration and spectral change.Their results suggest numerous differences with those of Peterson and Barney(1952), both in terms of average frequencies of F1 and F2, and the degree ofoverlap among adjacent vowels. 
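As an illustration of how such formant values are typically measured from recorded audio, the sketch below estimates the lowest resonances of a voiced frame with standard autocorrelation LPC. This is a generic procedure, not the analysis pipeline used for the data in this thesis; the model order, the 90 Hz cutoff, and the synthetic test signal are arbitrary choices.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_formants(frame, fs, order=12, n_formants=2):
    """Rough F1/F2 estimate for one voiced frame via autocorrelation LPC."""
    x = frame * np.hamming(len(frame))
    # Autocorrelation up to the LPC order.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Yule-Walker equations R a = r for the LPC coefficients.
    a = solve_toeplitz(r[:-1], r[1:])
    # Resonances are the pole angles of 1 / (1 - sum a_k z^-k).
    poles = np.roots(np.concatenate(([1.0], -a)))
    poles = poles[np.imag(poles) > 0]            # keep one of each conjugate pair
    freqs = np.sort(np.angle(poles) * fs / (2 * np.pi))
    freqs = freqs[freqs > 90]                    # drop near-DC poles
    return freqs[:n_formants]

# Synthetic check: damped resonances near 300 Hz and 2300 Hz (an /i/-like pattern).
fs = 16000
t = np.arange(0, 0.04, 1 / fs)
x = (np.exp(-60 * t) * np.sin(2 * np.pi * 300 * t)
     + 0.5 * np.exp(-80 * t) * np.sin(2 * np.pi * 2300 * t))
print(lpc_formants(x, fs))
```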
Other studies, such as that of Hillenbrand272.3. Acoustics of Speech Front Central Back Close Round Un-round Mid-close Mid-open Open beet bait bet bat boot book bought bot bit buh boat but Vowel /i/ Vowel /u/ 2000 1000 500 3000 4000 0 400 800 1200 Second formant frequency (F2) in Hz First formant frequency (F1) in Hz u U i I ɛ æ ʌ ɑ ɔ ə Figure 2.8. Regions of average formant frequencies for 10 English vowels across 76speakers, as described in Peterson and Barney (1952).(2003), yield different diagrams for phonetically distinct dialects. We referto Jacewicz and Fox (2013) for a recent thorough study of the change informant trajectories in North American English across dialects, gender, andage.2.3.2 Consonant PhonemesConsonants are classified by their manner and place of articulation, as wellas voiced or unvoiced qualities. By manner, consonants are categorized, as1) plosives or stops (such as /p/) where air is completely blocked and burstsafter release of the constriction, 2) fricatives (such as /s/) where a turbulentair-flow is generated at the point of constriction, 3) nasals (such as /m/ and/n/) generated by lowering the soft-palate, 4) liquids (such as /l/ and /r/)produced by raising the tip of the tongue, and 5) semi-vowels (such as y [/j/]in the word yes) which are phonetically similar to vowels, but function as asyllable boundary rather than its nucleus.Consonants are also categorized by place of articulation (POA), i.e., the pointof contact of the active (e.g., lower lip or tongue) and passive (e.g., upper lipor teeth) articulators. Figure 2.9 illustrates the POA of each category, whileincluding a corresponding example from English sounds. Most of the con-282.3. Acoustics of SpeechVocal cords Trachea Vocal tract 1 2 3 4 5 6 7 8 9 lip lip tongue velum hard palate alveolar ridge tip blade front center back epiglottis mandible uvula root 1. Bilabial 2. Labiodental 3. Dental 4. Alveolar 5. Postalveolar 6. Palatal 7. Velar 8. Uvular 9. Glottal /p/ /f/ /θ/ /s/ /ʃ/ /j/ /k/ /h/ Figure 2.9. Classification of consonants based on place of articulation, each accom-panied by an example from English phonetics. Adapted from 1994 EncyclopædiaBritannica.sonants do not have harmonic frequency spectra, making their acoustic cuesmore difficult to ascertain than vowels; however, spectral features (periodsof silence, voice bars, noise, and effects on adjacent phonemes) can be usedto distinguish consonants.2.3.3 CoarticulationThe term coarticulation, also referred to as assimilation, has been used todescribe the spreading of acoustic and/or articulatory features of a phonemeto its adjacent neighbours in running speech. In general, humans are ableto speak as many as five syllables per second: slightly faster than the ratearticulators are able to adjust to independent articulatory positions. Thebrain, however, has an intelligent way of planning an optimal compromisedtrajectory for adjacent phonemes, and saves time and energy. For example,although the nasal consonant /n/ is normally alveolar, it exhibits a dentalPOA in the word tenth (tEnT), aligning with the following dental sound /T/.Coarticulation explains the variation in phonetic manifestation of a givensound according to its nearby sounds (Ohala, 1993).Coarticulation has been hypothesized to occur both in anticipation of an ap-proaching phoneme (anticipatory), as well as when the effect of a phoneme is292.3. Acoustics of Speechanticipatory field of gesture 2 carryover field of gesture 2 gesture 1 2 3 Figure 2.10. 
Anticipatory and carryover fields for three overlapping phonetic ges-tures as explained in Fower and Satzman (1993).carried over to the following phoneme (perseverative). Figure 2.10 shows thiseffect for three overlapping phonetic gestures. The degree of overlap dependson number of variables, such as the distance of the gestures in articulatoryand perception space. Look-ahead models (based on anticipatory hypothe-sis) predict that some features of a given vowel (such as lip rounding), ora given consonant (such as nasality), affect its preceding phonemes in thespeech plan, until another vowel or consonant is reached. Contradicting ob-servations, however, gave rise to constrained-frame models, suggesting thatthe aforementioned features are time-locked to their associated phoneme,and that the effective field of spreading is smaller than suggested by thelook-ahead models (Fower and Satzman, 1993).2.3.4 Articulatory Speech SynthesisSpeech synthesizers focus on generating the sounds that resemble humanspeech. Articulatory speech synthesizers, in particular, use a representationof the vocal folds and tract to create the desired acoustics for an observedshape of the oral cavity (Doel et al., 2006; Birkholz et al., 2013). The acoustictheory of speech production suggests that both vowels and fricatives can begenerated using a source-filter system (Fant, 1960). For vowels, vibration ofthe vocal folds, under the expiatory pressure of the lungs, is the only sourcefor the system. The VT, consisting of the larynx, pharynx, oral and nasal302.3. Acoustics of Speechcavities, constitutes the filter where sound frequencies are shaped.Wave propagation in the vocal tractTraditionally, the acoustic system is approximated by a 1D wave equation,referred to as the Webster equation. It associates the slow varying CSA of arigid tube with the pressure wave for a low-frequency sound. For the soundparticle velocity v(x, t), at time t, in a tube with area function A(x), alongthe axis x, the Webster equation yields the following:1c2∂2v(x, t)∂2t=∂2v(x, t)∂2x+1A(x)∂A(x)∂x∂v(x, t)∂x(2.3)where c is the velocity of the sound propagation (Benade and Jansson, 1974).Unfortunately, it is not possible to solve the Webster equation analyticallyfor an arbitrary A(x). For a constant A, however, the equation simplifies to1c2∂2v(x, t)∂2t=∂2v(x, t)∂2x(2.4)which has a general solution in the form of waves propagating in oppositedirections. Using this simplification, Kelly and Lochbaum (1962) approxi-mate the VT with cylindrical segments of constant area. In each segment,the wave gains a propagation delay, and is partially reflected at its junctionwith the next segment. The Kelly-Lochbaum model can be expressed as aladder (or a waveguide) filter and, hence, is straightforward to implement. Anumber of improvements of this basic model have been proposed, to includeother features, such viscous and thermal losses along the path of propaga-tion, radiation at the lips, and time-varying nature of the VT (Va¨lima¨ki andKarjalainen, 1994; Doel and Ascher, 2008).The complex shape of the VT, with its side branches and asymmetry, has mo-tivated use of higher-dimensional acoustic analysis. The common 3D meth-ods (such as the boundary element method [BEM] [Kagawa et al., 1992], fi-nite element method [FEM] [Vampola et al., 2008a] and the finite-differencetime-domain method [FDTD][Takemoto et al., 2010]) produce more accu-rate results at the price of higher computational cost. 2D methods have312.3. 
Acoustics of Speechbeen proposed as a compromise, but they must overcome significant errorsin the spectra – especially for the first formants – related to the use of circu-lar cross-sections at the mid-sagittal slice of the 3D VT (Arnela and Gausch,2014). Finally, several studies find that the spectra yielded by 1D acous-tic analysis matches closely to those of 3D analysis for frequencies less than7KHz (Takemoto et al., 2014; Arnela and Gausch, 2014). The discrepancybetween the formant frequencies of the filter and the recorded audio is at-tributed to the insufficient boundary conditions – especially in case of openlips and/or velar port (Aalto et al., 2012).Coupling the models of vocal folds and tract is necessary for solving the acous-tic system. Popular techniques in the literature are based on direct numericalsimulation of either a transmission line circuit model (TLM) (Ishizaka andFlanigan, 1972) or a hybrid time-frequency system (Sondhi and Schroeter,1987). In particular, TLMs provide a descriptive analogy between acousti-cal and electrical circuits, and are widely adopted in modeling the complexturbulence noise sources for voiceless consonants (Birkholz et al., 2007).Excitation SourceA steady air-flow passes from the lungs into the trachea, and through the vo-cal folds (also known as vocal cords) to reach the VT. The primary source ofvocal excitation for voiced phonemes is the quasi-periodic vibration of the vo-cal folds, as explained in the myoelastic-aerodynamic theory. The vocal foldssuck together to close the air passage, due to negative supraglottal pressuregenerating a Bernoulli effect. Subglottal pressure builds up and bursts thevocal folds open, generating a pulse of air pressure at the glottal exit. Thecords oscillate before viscosity kills off their vibration. The elasticity of thevocal folds and, hence, the frequency of their vibration is actively controlledby the tension of the thyroarytenoid muscle.In the late 1960’s, Flanagan and Landgraf (1968) propose a simple singlemass model to describe the physics of glottal vibration. The preliminaryresults were satisfactory, but suffered from failure to permit non-uniform,out-of-phase movement of the tissue at the glottal entry and exit. Later on,Ishizaka and Flanigan (1972) introduced a more complex, two-mass modelof the vocal-fold vibrations, which was shown to be adequate for synthesis ofvoiced vowels. The glottis takes a convergent and a divergent shape during322.3. Acoustics of Speech1 2 3 4 5 6 7 8 9 lip lip tongue velum hard palate alveolar ridge tip blade front center back epiglottis mandible uvula root 1. Bilabial 2. Labiodental 3. Dental 4. Alveolar 5. Postalveolar 6. Palatal 7. Velar 8. Uvular 9. Glottal /p/ /f/ /θ/ /s/ /ʃ/ /j/ /k/ /h/ x1 x2 d1 d2 m1 m2 r1 s1 r2 s2 Trachea Vocal tract Ugs Ugs Psg P1 kc Figure 2.11. Two-mass glottal model, as proposed by Ishizaka and Flanigan (1972).each cycle, which creates a rise in the driving pressure and its asymmetry(Scherer et al., 1983), and leads to self-sustained oscillation of the vocal folds(Titze, 1988). Figure 2.11 shows the schematic of the model. The two-mass model is proven insufficient for study of glottal pathologies. Severaladvanced models have been proposed to provide a stronger correlation withbiology of the vocal folds (such as their layered structure) at the expense ofcomputational cost (Story and Titze, 1995). 
We refer to Cveticanin (2012)for a comprehensive review of these models.Voiceless excitation of the VT is the result of non-acoustic air motions (suchas turbulence at articulatory constrictions), and plays an essential role in pro-nouncing consonants. Accurate physical models to describe this phenomenoncurrently do not exist. Contemporary methods model the turbulence with arandom perturbation in either the pressure or velocity field, usually by in-serting lumped noise sources in the TLMs. Good results are achievable, butsubject to fine-tuning the parameters of noise sources, such as their numbers,place, level, and spectra (Birkholz et al., 2007).332.4. Conclusions2.4 ConclusionsRecent advances in acquisition technologies have made it possible to cap-ture abundant data during speech, in terms of audio, medical images andphysiological recordings. Fast MRI, in particular, has contributed vastly tounderstanding the anatomy and motion of the articulators. Computationalmodels complement such data in describing speech phenomena. Genericbiomechanical models of oropharyngeal structures are evolving into complexdescriptors of speech behaviour; methods for subject-specific modeling aregaining popularity for describing variability in the physiology of the humanbody. The acoustics of sound for speech production have been studied formany decades. Mathematical models, as well as their numerical implemen-tations, describe the complex physics of sound propagation in the VT. Artic-ulatory speech synthesizers, in particular, have shown promise in providingvaluable insight into understanding the speech process, rather than focusingon speech synthesis.This chapter has presented an overview of the ongoing research in the fieldsof speech data acquisition and measurement, biomechanical modeling, andacoustic analysis. We have identified areas of research that require furtherinvestigation. The available models of speech articulators are generic and ir-relevant for simulation and validation using speaker-specific data. Segmenta-tion of soft-tissue from MRI is challenging and time-consuming, and appearsto cause a bottle-neck in subject-specific modeling. There is also a need for aframework that enables acoustic analysis based on the biomechanics of eachindividual speaker. In the following chapters we describe our contributionsto these open research problems.34Chapter 3Subject-Specific Modeling ofthe OropharynxIn this chapter we demonstrate our design for developing subject-specific,biomechanical models of the oropharynx, according to MR images of indi-vidual speakers. We use biomechanical information provided by standardgeneric models, as available in the ArtiSynth simulation framework and de-scribed by Stavness et al. (2011, 2012a, 2014a,b). Our oropharyngeal modelincludes a FE model of the tongue, coupled with rigid-body bone structures(such as the mandible, maxilla and hyoid). Later, in Chapter 5, we includea deformable, air-tight model of the vocal tract to enable acoustic synthesis.The underlying data for construction of the models (for each speaker) is onecine MR image volume of the head-and-neck. This image is the first volumein a sequence of 26 time-frames that capture the utterance /@-gis/, acquiredby our collaborators Woo et al. (2012) at the University of Maryland DentalSchool, in Baltimore MD, USA. This image volume precedes the phoneme /@/and, thus, bears the most resemblance to a neutral tongue posture. 
We make our subject-specific models for four healthy English speakers, anonymously identified as A, B, C, and D. Figure 3.1 shows the mid-sagittal view of our image volumes. More information on this dataset is provided in Chapter 4, where we simulate our models based on MR images.

Figure 3.1. Mid-sagittal view of the 1st TF of cine MRI data for speakers A-D.

3.1 FE Tongue Modeling

To create our subject-specific tongue models, we design and follow the pipeline shown in Figure 3.2. As an input to our pipeline, we use the standard generic FE tongue model developed by Buchaillard et al. (2009), which is described in more detail in Chapter 2; this model provides 2493 DOFs (946 nodes and 740 elements), and consists of 11 pairs of muscle bundles with bilateral symmetry. We refer to this generic model as FEgen during the rest of the chapter.

Figure 3.2. Designed pipeline for generation of a high-resolution, subject-specific FE model of the tongue (FEfinal): the 1st TF of cine MRI is segmented into a surface mesh (S); FE registration and FE meshing produce FEreg and FEmesh; muscle bundles are defined to yield FEfinal; and inverse simulation uses tissue displacements tracked from tagged MRI.

We follow our framework by delineating the surface geometry of the tongue from cine MRI, using the methods we develop and validate later in Section 3.2. We refer to the surface mesh as S. Based on S, we create two versions of FE tongue models, using a registration as well as a meshing technique.

Our first tongue model (FEreg) is the result of the registration of FEgen to S. We use a multi-scale, iterative, and elastic registration method called Mesh-Match-and-Repair (Bucki et al., 2010a). The registration starts by matching the two surfaces, and then applies the 3D deformation field to the inner nodes of FEgen, via interpolation. A follow-up repair step compensates for potential irregularities of the elements. Note that the elements of FEreg – similar to FEgen – are aligned along muscle fibres. Because of this, the size of the elements depends directly on the density of the muscle fibres in each region of the model. This results in smaller elements at the anterior-inferior region of the tongue (where most fibres originate), and larger elements close to the dorsum of the tongue (where most fibres span into the tongue body). Unfortunately, low resolution elements are located in the region undergoing maximum deformation during speech.

To address our concerns about the resolution of FEreg, we generate our second tongue model (FEmesh) following the pipeline shown in Figure 3.2. We use a meshing technique proposed by Lobos (2012) to generate a regular, mixed-element FE mesh (referred to as FEmesh). The meshing algorithm starts with an initial grid (of hexahedral elements) that encloses the surface mesh S. The method then eliminates those elements that present little or no intersection with S, employing a set of mixed-element patterns to fill generated holes at the surface boundary. Further, the quality of the mesh is improved using the Smart Laplacian filter (Freitag and Plassmann, 2000). FEmesh bears our desired resolution and is well-behaved during simulation.

Finally, we augment FEmesh with the definition of muscle bundles available in FEreg; since both FE models are in the same spatial domain, we simply copy the bundle locations from FEreg to FEmesh, replacing the bundle's elements with those of FEmesh which fall into the bundle's spatial domain.
Ourapproach for generating FEfinal provides multiple fundamental advantagesover using FEreg. Firstly, the user has control over the mesh resolution. Sec-ondly, the muscle fibre definitions are no longer tied to the configuration ofthe elements; therefore, it is possible to modify the muscle fibres based ondifferent linguistic hypotheses and preferences.The rest of this section describes units of our modeling process in more detail.3.1.1 FE RegistrationTo create FEreg from FEgen (from Figure 3.2), we use a FE registrationmethod called Mesh-Match-and-Repair (MMRep), which performs iterativelyin two consecutive steps of elastic registration and mesh repair (Bucki et al.,2010a). In the mesh registration phase, we first coarsely align our genericsource mesh (FEgen) to the target surface (S), by matching the correspond-ing landmarks between the two. Next, FEgen is embedded in a deformablevirtual elastic grid. Local elastic registration is then performed by applying373.1. FE Tongue Modelingsuccessive elementary deformations in a coarse-to-fine grid scheme. At eachgrid level, the registration energy (E), is computed and minimized iteratively.Each iteration of the method consists of the following steps (see Figure 3.3for 2D illustration):1. Calculate E based on a geometrical similarity measure between corre-sponding local nodes of FEgen and S.2. Compute the gradient of E with respect to each grid node position,while other grid nodes are fixed.3. For each element, find the displacement vector that minimizes E.4. Convey such displacement vector (from step 3) to the source pointslocated in neighbouring cells, using a distance-based weight function.After each iteration, the virtual grid returns to its initial (regular) config-uration, embedding the newly deformed version of the FEgen. If no sub-stantial energy decrease is observed, the grid is refined by subdividing eachcell into eight smaller ones, and the algorithm moves to the next level. Theweight functions are designed to ensure C1 differentiability, bijection, andnon-folding property of the total deformation.To limit the space distortion, the method evenly distributes multiple controlpoints inside the source mesh, and computes a potential elastic energy atthem based on the Green Lagrange strain and stress tensors. The preferredelementary deformation at each iteration is the one which can provide thebest ratio between registration energy decrease and elastic energy increase.In the mesh-repair phase, the MMRep algorithm uses a two-fold process tocompensate for the regularity and quality of the distorted elements. Firstly,all elements are inspected for possible irregularities, and the nodal positionsadjusted to regularize the irregular elements. The regularity energy (ER) inregion A is defined as:ER =∑j∈A,φk(Jj) (3.1)where Jj is the Jacobian of the local deformation at node j, and φk(t) =1 − exp(−kt) defines a penalty function of strength k. ER is maximized tofind a regular configuration. Higher values of k favour the solution in whichall Jacobians are positive.383.1. FE Tongue ModelingFigure 3.3. Elastic registration in the MMRep algorithm (Bucki et al., 2010a):the source point-set at refinement level 1 (a); after deformation at level 1 (b); atrefinement level 2 (c); and after deformation at level 2 (d). c©Bucki et al. 
(2010a),adapted with permission.Secondly, the quality of the mesh is improved by finding a configurationthat maximizes the quality energy (EQ) in region A, defined as:EQ =∑j∈A,e∈Aψk(JRej) (3.2)where ψk(t) = 1− exp(−k(JRmin − t)), and JRmin is a predefined minimumsatisfactory level of the Jacobian ratio (JR). Bucki et al. use JRmin = 1/30in accordance with standard suggested by ANSYS FE analysis software.Both of the steps involved in the repair phase can alter and, hence, reducethe geometrical accuracy of the surface. To avoid this issue, the number ofrepair steps is limited to 50, with a maximum node displacement of 0.1mmin each iteration.Using the MMRep algorithm, we register FEgen into the tongue surfaces (S)we segmented from the cine MRI data of four healthy speakers. Figure 3.4shows the registration results (FEreg) in the mid-sagittal plane for speakers A-D. To assess the quality of each hexahedral mesh, we calculate the normalizedJacobian ratio (JR) of its elements as follows: we first compute the Jacobianfor each node of the element, and then calculate the quotient between theminimum to maximum Jacobian value found within the element. We averagethe values of JRe over all elements in each mesh, to obtain JR values of 0.34(for speakers A, B and D), and 0.36 (for speaker C). The values of JRe,however, range anywhere between 1/30 ≈ 0.03334 and 0.9.393.1. FE Tongue ModelingA B C D Figure 3.4. Results of FE registration using MMRep (Bucki et al., 2010a): elementconfiguration in the mid-sagittal plane of FEreg for speakers A-D.3.1.2 FE MeshingAs discussed in Section 2.2.2, hexahedral meshes are preferable for most FEmethods in a wide variety of simulation problems; however, because of thecubic shape of the elements, such meshes fail to achieve an adequate approx-imation of curved domains at the tongue surface. To deal with such issue,Buchaillard et al. (2009) designed an intricate element configuration for theirgeneric model, in which elements align with the direction of the surface. Togive a greater degree of freedom to our models (such as adjustable spatial res-olution), we adopt a more conventional FE hexaheral meshing developed byLobos (2012) that deals with the problem of curved domains through limiteduse of other types of elements (wedges, pyramids and/or tetrahedra). Themethod introduces a set of mixed-elements patterns – employed at the surfaceof the target domain – and conserves hexahedra elsewhere. These patternscan combine with any regular or non-regular hexahedral meshing technique,and achieve acceptable representation of the surface, while ensuring properconnectivity among elements.A hexahedral mesh, built for the surface S, includes elements that are com-pletely outside, completely inside, or at the boundary of S. Elements at theboundary are defined as those that have some nodes inside and others outsideS. One way to deal with outside nodes is to project them into S. Such asolution leads to two problems: firstly, a mesh can become tangled becauseif an element’s edges are crossed; secondly, element quality issues may arisedue to node proximity. In his method, Lobos (2012) replaces the boundaryelements – when necessary – with other type of elements.Lobos’s method uses the face subdivision rules, explained in Figure 3.5 (top),to handle a quadrilateral face that intersects S. The dashed lines are inserted403.1. FE Tongue ModelingFigure 3.5. Replacing boundary hexahedra with mixed elements: face consistencypatterns (top), and surface patterns (bottom). 
Dashed lines represent diagonals tobe inserted, and dots show inside nodes. c©Lobos (2012), adapted with permission.to split the face into two triangles, ensuring topological consistency betweenneighbor elements. Use of these basic rules leads to a set of mixed-elementpatterns shown in Figure 3.5 (bottom).The steps involved in the algorithm can be summarized as follows:1. Produce an initial grid with desired resolution that covers S.2. Eliminate elements that present little intersection with S; For any in-side node that lies close to the surface, project it to S.413.1. FE Tongue ModelingA B C D A B C D A B C D Figure 3.6. Results of FE meshing using mixed-element patterns (Lobos, 2012):element configuration in the mid-sagittal plane of FEmesh for speakers A-D.3. Replace all remaining boundary elements with mixed elements (Figure3.5 [bottom]).4. Project the outside nodes of new boundary elements to S.5. Improve the quality of the boundary elements.We follow Lobos (2012) in using the Smart Laplacian filter (Freitag andPlassmann, 2000) to improve the quality of the new mixed elements (as in-structed in step 5). We apply the 4th level of grid refinement in the algorithmto achieve our desired spatial resolution (with a typical element being approx.5mm-wide in each dimension). Figure 3.6 shows results of meshing (FEmesh)in the mid-sagittal plane for speakers A-D; Table 3.1 includes the number oftetrahedra, pyramids, wedges, and hexahedra in each model.At the end, we estimate the mesh quality of each element; for pyramids,wedges and hexahedra, we use the normalized Jacobian Ratio (JR) as for-mulated by Joe (2008). Since JR=1 for any tetrahedron, we use anotherquality measure called Aspect Ratio (AR) for the tetrahedral elements (Lo-bos et al., 2007):ARe =(16∑6i=1 l2i )238.47867V e(3.3)where V e denotes the volume of the element e, and li is the length of the i-thedge in e. AR is 1 for the quadrilateral tetrahedron and →∞ as e becomesincreasingly distorted. Table 3.1 also shows the average quality measure (ARor JR) for each type of element.423.1. FE Tongue ModelingTable 3.1. Mesh quality measured for FEmesh in speakers A-D. Number of ele-ments, as well as average values of the quality measure (AR or JR), are presentedfor each specific element type.Speaker Tetrahedra Pyramids Wedges HexahedraA1004 880 529 1414AR=2.38 JR=0.86 JR=0.84 JR=0.93B917 808 805 2061AR=2.37 JR=0.85 JR=0.88 JR=0.94C1026 962 757 2741AR=2.47 JR=0.86 JR=0.87 JR=0.95D1124 1040 626 1763AR=2.34 JR=0.84 JR=0.84 JR=0.943.1.3 Tongue Muscle BundlesThe top row of Figure 3.7 illustrates the process of defining the muscles inhigh resolution. The goal is to define a muscle bundle (Mfinal) in FEfinal thatcorresponds to a specific muscle bundle (Mreg) in FEreg. Since both FEregand FEfinal share the same coordinates, the fibers of Mreg (indicated in redin Figure 3.7) are simply copied to Mfinal. The elements of Mfinal, however,need to be redefined.Consider element e in FEfinal. In a simple and intuitive approach, we canassign e to Mfinal if e falls within a predefined distance (d) to the fibers ofMfinal. However, no single value of d yields satisfactory results. Firstly, inthe regions where fibers are very close to each other, their correspondingelements tend to overlap. Overlapping elements may introduce error in theinverse solver, where a wrong muscle may be considered responsible for anunrelated motion. 
Secondly, in the regions where fibers are relatively far fromeach other, elements in between fibers tend to fall out of the muscle definitionand create holes in the muscle. These holes may cause inhomogeneity in theforce-activation behaviour of the muscle.We assign e to a certain Mfinal if the elements of the corresponding Mregcontain the e. In addition, we incorporate adjacency relationships betweenthe tongue muscles – as in the generic tongue model – to avoid the overlapbetween non-overlapping bundles.433.1. FE Tongue ModelingProposed Method for FE final Using fixed distanceSegments of GG in FE regdUsing fixed distance Proposed M finalGGa GGb GGc GGd GGeM reg overlaid on FE meshFigure 3.7. Defining muscle elements in the high resolution FE tongue model(top row), as well as functional segments of the genioglossus muscle for speaker C(bottom row). Overlapping elements are shown in black, for the muscle elementsdefined using a fixed, predefined d distance.The method is explained in the following steps: For each element e in FEmesh1. Compute the distance (Distanceie) from e to Mireg (the ith muscle bun-dle in FEreg) for 0 < i < 21.2. Initiate a first-in first-out queue (Q); add the indices i to Q in ascendingorder of Distanceie; denote the nth value in Q with qn.3. Iterate until Q is empty:(a) Read q1 from Q, and add it to the list of bundles for e.(b) If qn (1 < n < 21) does not interdigitate with q1, remove qn fromQ.(c) Remove q1 from Q.443.1. FE Tongue ModelingDistanceie (in step 1) is computed as the minimum euclidean distance fromthe centroid of e to the centroids of all elements in the muscle bundle Mireg.The number 21 is inclusive of the left and right muscles. Step 3b benefitsfrom a predefined binary matrix that shows the interdigitation between themuscle bundles (e.g., the entry for the TRANS, and VERT is 1, since theyshare elements in FEgen). We refer to Appendix A for more information onthe anatomy of tongue muscles.The bottom row of Figure 3.7 shows muscle elements of the five functionalsegments of the genioglossus (GG) muscle for speaker C. The proposedmethod is compared with the approach using a predefined, fixed distance,while the muscle elements in FEreg serve as the ground truth. Note that theproposed method preserves the boundary of each segment, while preventingoverlaps and holes in muscle definition.The bone attachments in the tongue model (the FE nodes at which the modelis biomechanically coupled to the mandible and hyoid rigid bodies) are alsotransferred from FEreg to FEfinal. For each attachment node in FEreg, theclosest node (according to euclidean distance) in FEfinal is considered to bethe corresponding attachment.Each muscle bundle in the tongue can be further divided into functionally-distinct fibre groups (referred to as functional segments), which are believedto be controlled quasi-independently in a synergistic coordination (Stoneet al., 2004). We divide the muscle fibers of the GG, VERT and TRANSinto five functional segments (a: posterior to e: interior). This division wasinitially proposed based on EMG measurements from the GG (Miyawakiet al., 1975), and later reinforced using information from ultrasound imag-ing and tagged MRI (Stone et al., 2004). We also follow Fang et al. (2009)in dividing the STY into two functional segments: STYa (the anterior partwithin the tongue), and STYp (originating from the posterior tongue to thestyloid process). 
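A minimal sketch of the queue-based assignment in steps 1-3 above, for a single element, is given below. The bundle centroid arrays and the boolean interdigitation matrix are assumed inputs, and the containment test and distance threshold discussed earlier are omitted; this illustrates the rule rather than reproducing the actual implementation.

```python
import numpy as np
from collections import deque

def assign_element_to_bundles(elem_centroid, bundle_elem_centroids, interdigitates):
    """Assign one FEmesh element to generic-model muscle bundles.

    elem_centroid         : (3,) centroid of the FEmesh element e
    bundle_elem_centroids : list of (n_i, 3) arrays of element centroids per M_reg bundle
    interdigitates        : boolean (n_bundles, n_bundles) matrix; True if two bundles
                            share elements in the generic model (e.g. TRANS and VERT)
    Returns the indices of the bundles that e is assigned to.
    """
    # Step 1: distance from e to each bundle = nearest centroid-to-centroid distance.
    dists = np.array([np.linalg.norm(c - elem_centroid, axis=1).min()
                      for c in bundle_elem_centroids])
    # Step 2: FIFO queue of bundle indices, nearest first.
    queue = deque(np.argsort(dists))
    assigned = []
    # Step 3: keep the nearest bundle, discard queued bundles that do not
    # interdigitate with it, and repeat until the queue is empty.
    while queue:
        head = queue.popleft()
        assigned.append(head)
        queue = deque(q for q in queue if interdigitates[head, q])
    return assigned

# Toy check with three hypothetical bundles; only bundles 0 and 1 interdigitate.
rng = np.random.default_rng(0)
bundles = [rng.normal(loc=i, size=(20, 3)) for i in range(3)]
inter = np.eye(3, dtype=bool)
inter[0, 1] = inter[1, 0] = True
print(assign_element_to_bundles(np.zeros(3), bundles, inter))
```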
Note that FEgen includes only three functional segmentsfor the GG and one functional segment for each of TRANS, VERT or STY(Buchaillard et al., 2009).A more detailed analysis of the tongue’s myoarchitecture would require amultiscale conceptualization of tongue muscle mechanics, as in the approachby Mijailovich et al. (2010). Detailed fibre fields could potentially be digitizedfrom cadaver tissue, and registered to subject-specific muscle geometries, asin the method of Sa´nchez et al. (2014) for muscles in the forearm.453.1. FE Tongue Modeling3.1.4 Tongue Muscle MaterialThe muscles in FEgen (Buchaillard et al., 2009) have been modeled using ahyper-elastic (Mooney-Rivlin) material. Fibers pass through it, indicatinglines of action. Applying an external force through the fibers compressesthe muscle material according to the Monney-Rivlin constitutive model, asdescribed in Equation 2.1.For our tongue models, we borrow a term (WB) from the Blemker musclemodel, and add it into Equation 2.1 to modify the strain energy. WB dependson the fiber stretch (λ) and activation level (a) of the muscle. Using theformulation by Blemker et al. (2005), we compute WB(λ, a) by solving thefollowing:λ∂WB∂λ= σmaxffibertotalλ/λoptffibertotal = affiberactive(λ) + ffiberpassive(λ)(3.4)where σmax is the maximum isometric stress and λopt is the fiber stretchat its optimum length. ffiberactive and ffiberpassive correspond to normalized activeand passive force-length relationships of a muscle fiber, respectively). Theactivation level (a) can be any value between 0 (no activation) and 1 (maximalactivation). As in Blemker et al. (2005), we assume a piecewise exponentialform for the passive force, and a piecewise quadratic form for the active force:ffiberpassive =0 λ ≤ λoptP1(e(P2λ/λopt−1) − 1) λopt < λ < λ∗P3eλ/λopt + P4 λ∗ ≤ λffiberactive =9(λ/λopt − 0.4)2 λ ≤ 0.6λopt1− 4(1− λ/λopt)2 0.6λopt < λ < 1.4λopt9(λ/λopt − 1.6)2 1.4λopt ≤ λ(3.5)We follow Blemker et al. (2005) in setting P1 = 0.05, and P2 = 6.6 in ac-cordance with the measurements on muscle tissue (Zajac (1988)). We useσmax = 1× 105Pa, λopt = 1.1 and λ∗ = 1.5 to achieve stability for our mod-els. P3 and P4 are set so that ffiberpassive is C0 and C1 continuous at λ = λ∗.Parameters of the Mooney-Rivlin material are set as in Subsection 2.2.1.463.1. FE Tongue ModelingUsing the Blemker model with Mooney-Rivlin material, we account for thenon-linearity, incompressibility, and hyper-elasticity of the tongue muscletissue. In addition, we use Rayleigh damping coefficients β = 0.03s andα = 40s−1 to achieve critically damped response for the model.3.1.5 Forward SimulationUsing our methods for FE tongue modeling, we first generate a high reso-lution version (FEFinal) of the generic tongue model (FEgen). Both modelsuse the Blemker material, as described in Subsection 3.1.4. We activate eachmuscle individually and compare the deformation in the low and high resolu-tion models with each other. Figure 3.8 shows the results for 40% activationof the GGP, SL, and IL muscles, after reaching equilibrium. The tip of thetongue is on the left side of the figure; the tongue-jaw and tongue-hyoidattachment points are fixed, and the gravity is set to zero. In the case ofthe GGA, FEFinal shows superior ability to protrude, avoiding unnecessarycurling on the top surface of the tongue. In the case of the SL, the highresolution model reaches upper high and back, at the tip, resulting in a smallindent at the blade. 
In the case of the IL, FEFinal is able to curl (and de-press) the tip and blade to a greater extent. Note that FEFinal also enablesthe user to modify (add, delete and relocate) the muscle fibers, and theirelement description, to produce desired deformations.To evaluate the performance of our speaker-specific tongue models, we acti-vate each individual muscle and assess the corresponding deformation. Foreach speaker, we apply the same level of activation to our model as we doto the (high resolution) generic model, and compare the resulting deforma-tions. Figure 3.9 shows the results for 10% activation in the GGP muscle.Although bearing varying geometry and head posture, all speaker-specificmodels succeed in protruding.Figure 3.10 shows the impact of such forward simulation, with 10% muscleactivation, on the mid-sagittal contour of the tongue for speaker A. Themid-coronal plane is also included for the TRANS muscle which tends tonarrow the tongue from the lateral sides and result in a larger contour inthe mid-sagittal plane. After a series of experiments, we conclude that thespeaker-specific models exhibit similar muscular behaviour to the genericmodel.473.1. FE Tongue ModelingNeutral Posture GGP Activated SL Activated A B A A B B IL Activated A B Figure 3.8. Impact of muscle activation on deformation of the generic tonguemodel, in original resolution (A) vs. the high resolution (B), depicted in lateralview. The GGP, SL, and IL muscles are activated up to (40%) in each case.Red/blue dots show the tongue-jaw/hyoid attachment points.We finally activate each functionally-distinct segment of the posterior GG,VERT, TRANS, and STY muscles, to observe their individual impact onthe shape of our tongue models. Figure 3.11 shows the deformation in themid-sagittal plane, with a: most posterior to e: most anterior portion ofeach muscle. The STYa and STYb denote the intrinsic and extrinsic musclefibers, respectively.483.1. FE Tongue ModelingNeutral Posture GGP Activated A A C C D D Generic Generic B B Figure 3.9. Impact of the GGP activation (10%) in the generic, and speaker-specifictongue models. Red/blue dots show the tongue-jaw/hyoid attachment points.TRANSILGGPGGMGGASLHG MHSTYVERTTRANSGHFigure 3.10. Impact of muscle activations (solid) vs. neutral posture (dashed) onthe tongue contour in the mid-sagittal plane, for Speaker A. The mid-coronal planeis included for the TRANS muscle. Muscle are activated up to (10%) in each case.Red/blue dots show the tongue-jaw/hyoid attachment points.493.2. Tongue Segmentation from MRIGGa GGb GGc STYa STYbTRANSa TRANSb TRANSc TRANSd TRANSeVERTa VERTb VERTc VERTd VERTeFigure 3.11. Impact of activation of functionally-distinct muscle segments on thetongue contour (solid) vs. neutral posture (dashed) for speaker A in the mid-sagittal plane. Red/blue dots show the tongue-jaw/hyoid attachment points. Thesubscripts a to e range from the posterior to the anterior tongue. For the STY, aand b include the intrinsic and extrinsic fibers, respectively.3.2 Tongue Segmentation from MRIIn this section, we develop a real-time, force-based, user interaction plat-form, coupled with a mesh-to-image registration technique (Gilles and Pai,2008), to delineate tongue tissue from MR image volumes. The developedmethod expands the application of such methodology from musculoskeletalstructures to highly deformable soft-tissue. 
We also extend the state-of-the-art by considering delineation of the tongue from the epiglottis, hyoid bone,and salivary glands (depicted in high resolution static MRI).Both shape and intensity priors are incorporated in the form of a source imagevolume and its corresponding surface mesh, which is delineated by a dentalexpert. The choice of the source dataset is arbitrary. We use a discretesurface mesh representation to deal with regularity and shape constraints.The overall pipeline of the developed method is shown in Figure 3.12. Thecurrent position of the surface nodes is stored in the Mechanical State module.Loop 1 includes the modules that handle mesh-to-image registration. Themesh is deformed according to local intensity similarity between the source503.2. Tongue Segmentation from MRIFigure 3.12. Designed segmentation pipeline. Iterative loop (1) includes the mesh-to-image registration modules. Iterative loop (2) incorporates potential user-interactive boundary labelling. Both loops update the mechanical state of thedeforming mesh, simultaneously.and target volumes. The deformation is regularised using an extended versionof a shape matching algorithm (Gilles and Pai, 2008). In Loop 2, we deploy aneffective minimal user interaction mechanism to help attain higher clinicalacceptance. Both loops shown in Figure 3 have access to and are able toupdate the mechanical state of the mesh, simultaneously. This provides real-time visualisation of the surface evolution.Our developed method has been fully implemented under the SimulationOpen Framework Architecture (SOFA @www.sofa-framework.org), an open-source modular framework based on C++. This allows for the registrationalgorithm to be interpreted as a real-time simulation process, during whichthe source model iteratively deforms to match the target configuration, start-ing from its initial position.3.2.1 MethodsMesh-to-image registration (Loop 1 in Figure 3.12) is handled by applyingexternal and internal forces to the vertices of the mesh. The external forcef e(t) steers the mesh toward the target boundaries. The internal force f i(t)keeps the mesh regular and close to the prior shape. Instead of summingup the internal and external forces, we use a Pair-and-Smooth approach,513.2. Tongue Segmentation from MRIoriginally proposed by Cachier and Ayache (2001), in which the external andinternal forces are applied sequentially. This approach minimizes propagationof noise from image features to the final result. Let x(t) denote the vectorof positions x for all the vertices on the surface, at time t. We also definexr = x(0) as the reference vertex positions. The mesh is deformed in twosteps.1. Intensity profile registration: the image-based external force, f e(t), iscalculated for each node, as described later in this subsection; then, theposition of each vertex is augmented with the vertex’s external force,ina unitary time step (x(t) + f e(t)).2. Shape Matching Regularization: a smoothing internal force is appliedon the augmented position of the vertices, and results in the vector ofthe regularized goal positions, defined as x˜. The details of the smooth-ing process follow later in this subsection.The deformation is done by moving the vector of reference vertex positionsxr to the vector of current positions x(t). Afterwards, the deformation issmoothened by applying the relevant internal forces:f i(t) = αi(x˜− x(t))where x˜ is the vector of new regularized goal positions. The two steps involvean iterative search for x(t) and x˜ respectively. 
The details of each moduleare described in the following section.Intensity Profile RegistrationFor each node at position x on the surface, the external force at time t iscalculated byf e(t) = αe(x′ − x(t)) (3.6)where αe is the stiffness and x′denotes the new location of the node. Thesearch for x′is performed within a pre-defined range of inward and outwardsteps at the direction normal to the surface. At each iteration, x′is selectedto be the point which maximizes a local similarity measure between the sourceand target image volumes. Our algorithm matches the 1D gradient intensityprofiles of pre-defined length L in the direction normal to the surface. LetGtar(x) be the gradient profile of the target image, centred at point x, and523.2. Tongue Segmentation from MRIlet Gsrc denote the corresponding gradient profile in the source image. Theoptimum value of x in each time step, denoted by x′, is calculated using thenormalized cross correlation as the similarity metric:x′= argmaxx〈Gtar(x)−Gtar‖ Gtar −Gtar ‖,Gsrc −Gsrc‖ Gsrc −Gsrc ‖〉(3.7)where G is the average value of G and <,> and ‖ . ‖ denote the innerproduct and L2 norm respectively.Shape Matching RegularizationTo regularize the mesh deformation, we apply the extended version of theshape matching algorithm, previously introduced in the context of muscu-loskeletal structures (Gilles and Pai, 2008). The underlying mesh is subdi-vided into overlapping clusters of nodes, defined around each vertex (i) onthe surface. The cluster for vertex i is defined asζi = {j : d(xi, xj) < s} (3.8)where d is the Euclidean distance and s is the predefined cluster size (orradius). Then, for each cluster ζi, the algorithm approximates the localdeformation of the nodes with a rigid transform (Ti), applied on the referenceposition. The least square estimation of Ti is obtained byTi = argminT∑j∈ζimj ‖ Txrj − xj − f ej ‖2 (3.9)where mj represents the mass weight of particle j in the cluster; and fej iscomputed from Equation 3.6. This, in turn, will update the goal position ofeach node in the cluster from x˜ = Txr.Due to the overlapping nature of the clusters, each vertex may obtain differentgoal positions from the different clusters it belongs to. These goal positionsare subsequently combined into an average position for each vertex. Thefinal (goal) position is used to calculate the corresponding internal forces,which are then averaged and applied to all the vertices of each cluster. Here,shape matching acts as an elastic force that is proportional to the strain;533.2. Tongue Segmentation from MRIwhereas, updating the reference positions at each time step would simulateplastic deformations. We follow Gilles and Pai (2008) in using the simplegradient descent scheme with unitary time stepx(t+ dt) = f i + x(t) (3.10)To summarize, one iteration of the mesh-to-image registration (in Loop 1)involves the following steps:1. Calculate external forces f e using Equation 3.6 and 3.7.2. Calculate shape-matching forces f i.(a) For each cluster ζi, compute Ti from Equation 3.9.(b) For each vertex i, average goal positions as in x˜i =∑i(T)xri/|ζi|.3. Evolve node positions from x = x˜.4. Update reference positions to simulate plasticity from xr = x.For the initialization mode, we model the underlying mesh with one cluster.Hence, the bodily movement of the mesh would be purely rigid, containingthree translational and three rotational DOFs. If desired, we enable theuser to guide initialization towards what he may deem as a better position(in a simple mouse-click and drag). 
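The weighted rigid fit in Equation 3.9 has a closed-form, Kabsch/Procrustes-style solution via a singular value decomposition. The Python sketch below is illustrative rather than our SOFA code; applied with a single cluster that covers the whole mesh, it also realizes the purely rigid (6-DOF) initialization mode just described.

```python
import numpy as np

def cluster_rigid_fit(p_ref, q, m=None):
    """Weighted least-squares rigid transform (Eq. 3.9) mapping reference
    cluster positions p_ref (K, 3) onto targets q = x + f_e (K, 3),
    with per-node mass weights m."""
    m = np.ones(len(p_ref)) if m is None else np.asarray(m, float)
    w = m / m.sum()
    p_bar, q_bar = w @ p_ref, w @ q             # weighted centroids
    H = (w[:, None] * (p_ref - p_bar)).T @ (q - q_bar)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_bar - R @ p_bar
    return R, t

def cluster_goals(p_ref, q, m=None):
    """Goal positions T(x_ref) for one cluster of the shape-matching step."""
    R, t = cluster_rigid_fit(p_ref, q, m)
    return p_ref @ R.T + t
```

In the deformation mode, the same fit is repeated for every overlapping cluster, and the goal positions a vertex receives from its different clusters are averaged before the internal forces are applied.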
This step inserts a spring force fromthe mesh toward the cursor. The initialization scheme compensates for largedisplacements between the initial and final tongue positions (see Figure 3.13).At any time, the user can make the transition to the deformation mode byincreasing the number of clusters (through entering the desired number in adialogue box and clicking a button).User InteractionWe incorporate an effective minimal user interaction mechanism to guaranteea satisfactory result for the end user. The procedure is shown in Loop 2in Figure 3.12. At any time during the registration process, the user isfree to inspect the orthogonal cut-planes of the deforming mesh, overlaid onthe corresponding 2D sections of the target image. The user may provideadditional boundary labels by simply clicking in any area where automaticsegmentation is deemed inadequate.543.2. Tongue Segmentation from MRIFigure 3.13. Initialization mode. From left to right: the position of the tongue inthe volume, mid-axial, mid-sagittal, and mid-coronal plane, before (top) and after(bottom) initialization. No user guidance force was necessary here.Since boundary constraints are handled through forces in SOFA, our inter-action is also force-based. As soon as a new boundary voxel is clicked, thealgorithm searches for the closest surface node on the mesh and inserts aspring force between these two points. The closest points on the surfacewill update in each iteration of the mesh deformation. We empirically use apredefined stiffness of about 104 in all implementations. Stiffness values of ahigher order of magnitude may cause instability and, hence, are avoided.All parameters are fixed in all the experiments. We unify the number ofsurface nodes to number 2502, in order to capture the sub-millimetre detailsof the tongue’s shape. For the intensity profile registration module, thelength of the profiles is set to 50 pixels, centred on the investigated voxel.The search range is five voxels, inward and outward, in the normal directionto the mesh surface. The stiffness coefficient, αe, is set to 1. For shapematching regularization, we set the number of clusters to 300. To attainhigh flexibility, the radii of all clusters are set to 1.3.2.2 ResultsWe apply our segmentation method on MRI scans of 18 normal subjects.For each dataset, three sagittal, coronal, and axial stacks of MRI slices are553.2. Tongue Segmentation from MRIacquired, with the tongue at the rest position, using a T2-weighted TurboSpin Echo pulse sequence. A Siemens 3.0 T Tim Treo MRI scanner is usedwith an 8-channel head and neck coil. The size of each dataset is 256 ×256× z (z ranges from 10 to 24) with 0.94mm ×0.94mm in-plane resolutionand 3mm slice thickness. A MRF-based edge-preserving data combinationtechnique is applied to build super-resolution volumes of the tongue withisotropic resolution of 0.94mm. Details of data acquisition and reconstructiontechniques is described by Woo et al. (2012).The developed method is evaluated for 18 normal subjects (eight females, 10males). All 18 datasets are manually segmented under the supervision of ourdental expert collaborator, using the TurtleSeg interactive tool (Top et al.,2011). The results are used, both as the ground truth and as the sourcemodel in inter-subject registration, as described later in this subsection. 
Thesegmented surface for each volume includes all of the internal muscles of thetongue, as well as the digastric, geniohyiod, and hyoglossus muscles (see Fig-ure 3.14); it excludes the hyoid bone, mandible bone, epiglottis, and salivaryglands. We cut the mylohyoid, palatoglossus and styloglossus muscles, fol-lowing the contour of the tongue. In addition, tongue tissue above the linebetween the epiglottis and hyoid bone is included. The process takes aboutfive to seven hours for each dataset.Figure 3.15 shows 3D representation of the result for subject 5 as the sourceand subject 2 as the target.Figure 3.14. Ground truth segmented by a dental expert in TurtleSeg (Top et al.,2011). Axial (right), sagittal (middle), and coronal slices are shown in red, blue,and orange respectively. Salivary glands (labels 1-3), hyoid bone (label 4) andepiglottis (label 5) have been excluded.563.2. Tongue Segmentation from MRIAxial Sagittal Coronal BeforeinitializationAfterinitializationBefore userinteractionAfter userinteractionFigure 3.15. 3D representation of the mesh during the segmentation process. Redarrows show the areas that require extra, user-provided boundary labels.Measures of Volume Overlap. After categorizing the subjects into twogroups of (female and male) anatomy, we noticed that anatomy of one malesubject was a closer match with the female group; therefore, he was excludedfrom the male group and added to the female group (F9 in figures 3.16 and3.17). For each dataset in each group, the segmentation was repeated byiterating the source on other members of the corresponding group, resultingin (9 × 8) × 2 experiments in total. In each case, the dental expert wasasked to interact with the segmentation for 1-3 minutes. The distance and573.2. Tongue Segmentation from MRI80 82 84 86 88 90 92 M1 M2 M3 M4 M5 M6 M7 M8 M9 F1 F2 F3 F4 F5 F6 F7 F8 F9 Average Dice Coefficient Subjects: M(male)- F(female) Before Interaction After Interaction Figure 3.16. Average Dice coefficient for subjects in the male (M) and female (F)group, before (light gray) and after (dark gray) expert interaction time of 2 ± 1minutes.the volume overlap between result (A) and ground truth (B) were calculatedbefore and after the interaction. We used the Dice coefficient as a measureof the volume overlap, reported as a percentage:Dice(A,B) = 2|A ∩B||A|+ |B| × 100 (3.11)Figure 3.16 shows the Dice measure calculated before and after expert inter-action, for both the male and female groups. The average Dice, measuredon all the datasets in the male group, improved from 87.15 ± 1.65 to 90.37± 0.42 after expert interaction. In addition, the mean of the inter-subjectstandard deviation (STD) dropped from 1.02 ± 0.28 to 0.29 ± 0.07. Theaverage overlap in the female group is 87.23 ± 1.58, before interaction, and90.44 ± 0.42, after interaction. The mean of the measured STD also changesfrom 0.80 ± 0.16 to 0.29 ± 0.10.Measures of Surface Distance. For calculating the distance between thetwo surfaces, we used the Modified Hausdorff Distance (MHD) as the measure583.2. Tongue Segmentation from MRI0 0.5 1 1.5 2 2.5 M1 M2 M3 M4 M5 M6 M7 M8 M9 F1 F2 F3 F4 F5 F6 F7 F8 F9 Average Modified Hausdorff Subjects: M(male)- F(female) Before Interaction After Interaction Figure 3.17. 
Average Modified Hausdorff distance for subjects in the male (M)and female (F) groups, before (light gray) and after (dark gray) expert interactiontime of 2 ± 1 minutes.of object-matching (Dubuisson and Jain, 1994):MHD(A,B) = max(d(A,B), d(B,A))(A,B) =1Na∑a∈Ad(a,B)d(a,B) = minb∈Bd(a, b)(3.12)where Na is the set of all the nodes (a) on the surface A, and d(a, b) = ||a−b||is the simple Euclidean distance between vertices a and b. Figure 3.17 showsthe average MHD. The mean of the measure in the male group changes from2.06 ± 0.16 mm to 1.62 ± 0.10 mm after interaction. The mean of themeasured STD also changes from 0.15 ± 0.04 mm to 0.07 ± 0.02 mm. Forthe female group, the mean of MHD is 2.06 ± 0.21 mm, before, and 1.59 ±0.12, after interaction. The mean of the STD also decreases from 0.13 ± 0.04mm to 0.06 ± 0.01 mm.Sensitivity Analysis. In order to justify our choice of parameters, weperformed a sensitivity analysis on four main parameters of the algorithm:cluster number, cluster radius, search range, and stiffness coefficient. Foreach experiment, we changed the value of one of the parameters, while theothers were fixed to their default values. To minimize the control variables,593.2. Tongue Segmentation from MRI0 0.5 1 1.5 2 2.5 3 3.5 75 80 85 90 95 100 1 100 200 300 400 500 600 Modified Hausdorff (mm) Dice Coefficient (%) Number of Clusters Dice (M) MH (M) 00.511.522.533.575808590951000 5 10 15 20Modified Hausdorff (mm)Dice Coefficient (%)Radius of ClustersDice (M) MH (M)00.511.522.533.575808590951001 3 5 7 9 11 13 15Modified Hausdorff (mm)Dice Coefficient (%)Search RangeDice (M) MH (M)00.511.522.533.575808590951000 0.3 0.6 0.9 1.2 1.5Modified Hausdorff (mm)Dice Coefficient (%)Stiffness CoefficientDice (M) MH (M)Figure 3.18. Sensitivity analysis on four main parameters of the developed algo-rithm: Dice and modified Hausdorff measures for subject #7 in the male group.we excluded the initialization and user interaction stages, running the analy-sis only for the mesh-to-image registration. In each group (e.g. male/female),we chose the target image with the worst segmentation accuracy measured inthe previous experiments (e.g. M7 and F6). Other members of the group wereiteratively used as the source of registration. Hence, we conducted (1×8)×2experiments for each value of each parameter. The average and variance ofthe Dice and MH measures, for the male and female subjects, are shown infigures 3.18 and 3.19, respectively.• Number of clusters. Proposed default value: 300. Small numbersof clusters result in a more rigid model and decrease segmentation ac-curacy. More clusters results in a higher degree of freedom for thedeformation, but values greater than 600 cause convergence failure.• Radius of clusters. Proposed default value: 1. Radii of less than 10produce similar results. The deformable model becomes more rigid asthe radius (e.g. overlap) of the clusters increases, decreasing segmenta-tion accuracy.603.2. Tongue Segmentation from MRI0 0.51 1.5 2 2.5 3 3.5 75 80 85 90 95 100 1 100 200 300 400 500 600 Modified Hausdorff (mm) Dice Coefficient (%) Number of Clusters Dice (F) MH (F) 00.511.522.533.575808590951000 5 10 15 20Modified Hausdorff (mm)Dice Coefficient (%)Radius of ClustersDice (F) MH (F)00.511.522.533.575808590951001 3 5 7 9 11 13 15Modified Hausdorff (mm)Dice Coefficient (%)Search RangeDice (F) MH (F)00.511.522.533.575808590951000 0.3 0.6 0.9 1.2 1.5Modified Hausdorff (mm)Dice Coefficient (%)Stiffness CoefficientDice (F) MH (F)Figure 3.19. 
Sensitivity analysis on four main parameters of the developed algo-rithm: Dice and modified Hausdorff measures for subject #6 in the female group.• Search range. Proposed default value: 5. Greater search range willenable the model to deform more during each time step; however, forvalues greater than 10 pixels inward/outward, the model will have dif-ficulties in convergence, and will fluctuate back and forth close to theoptimum solution.• Stiffness Coefficient. Proposed default value: 1. Low values of stiff-ness decrease the effect of image-based forces, and are therefore inade-quate for matching the model to the target image. Values higher than1.5 will introduce instabilities in the model.3.2.3 DiscussionThe manual segmentation used as the ground truth is likely to be fuzzyand uncertain in problematic regions, such as at the boundary of the hyoidbone and salivary glands (see Figure 3.14). However, modeling requisites613.3. Jaw and Hyoid Modelingprohibit the segmentation to include bones and non-muscle soft tissues. Toassess such uncertainty, we designed an experiment in which the same dentalexpert was asked to repeat the same manual segmentation after 10 days,without referring to his first attempt. The result showed a volume overlapof Dice= 91% and a surface-to-surface MHD= 1.52 mm between the twomanually segmented volumes. We argue that this uncertainty imposes similarlimits (91% for the volume overlap, and 1.5 mm for the surface-to-surfacedistance) on the measures achievable by automated segmentation. In fact,while expert interaction resulted in about 7% improvement for Dice valuesas low as 83% (0.6 mm decrease in distance value), the improvement wasless than 1% for the Dice values as high as 90% (0.3 mm decrease in distancevalue). The same ambiguities cause the user interaction to be inefficient aftera certain time limit, justifying our choice of restricted interaction time forreporting the results.3.2.4 SummaryIn this section, we have tackled the challenging problem of semi-automated3D segmentation of the tongue from isotropic MRI volumes. Previous workshave included delineation of tongue contours at its surface in 2D MRI slices.We adapted an inter-subject registration framework using a shape matching-based regularization technique. This method was combined with an instantforce-based user interaction mechanism, which attracts the model towardsuser-provided boundary labels. We were able to achieve segmentation accu-racy with an average Dice coefficient of 90.4± 0.4%, and an average distance= 2 ± 0.2 mm, within an expert interaction time of 2 ± 1 minutes. Thus, weconclude that our human-in-the-loop approach, using a variation of the shapematching technique (Gilles and Pai, 2008), provides an effective method tosegment complicated soft tissue areas, like the tongue.3.3 Jaw and Hyoid ModelingOur speaker-specific model of the mandible and hyoid is similar to the Ar-tiSynth generic model (Stavness et al., 2011) in its biomechanics: it is cou-pled to the tongue FE model via multiple attachment points, included in the623.3. Jaw and Hyoid Modelingconstitutive equations of the system as bilateral constraints. Eleven pairsof bilateral point-to-point Hill-type actuators, as listed in Subsection 2.2.1,are used to represent the associated muscles, and the TMJ is modeled bycurvilinear constraint surfaces. The bone density is set to 2000 kg/m3, assuggested by Dang and Honda (2004). 
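For readers unfamiliar with point-to-point Hill-type actuators, the sketch below illustrates the general form of such a muscle force along the line between origin and insertion. The specific force-length, force-velocity, and passive curves shown are generic placeholders for illustration only, not the actuator model implemented in ArtiSynth.

```python
import numpy as np

def hill_actuator_force(origin, insertion, lengthening_vel, act,
                        f_max, l_opt):
    """Axial tension (N) of a generic point-to-point Hill-type actuator.
    act is the normalized activation in [0, 1]; l_opt the optimal length.
    All shape functions below are illustrative placeholders."""
    length = np.linalg.norm(np.asarray(insertion) - np.asarray(origin))
    l_n = length / l_opt                            # normalized length
    fl = np.exp(-((l_n - 1.0) / 0.45) ** 2)         # active force-length
    fv = np.clip(1.0 + 0.5 * lengthening_vel / l_opt, 0.0, 1.5)
    fp = 0.05 * (np.exp(4.0 * max(l_n - 1.0, 0.0)) - 1.0)   # passive term
    return f_max * (act * fl * fv + fp)             # acts along the line
```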
For each speaker, the geometries ofmandible and hyoid bone rigid bodies are replaced with the correspondingsurfaces, segmented from the first TF of cine MRI data (Subsection 3.3). Thebone-tongue attachment points are computed based on the generic tonguemodel, as described in Section 3.1.3.3.3.1 Bone Segmentation from MRITo build our speaker-specific models, we need to delineate the surface geom-etry of the articulators from cine MRI data. Unfortunately, cine MRI onlyprovides partial bone visibility, which makes the results of manual segmen-tation inadequate for detection of the location of muscle insertion sites andthe temporomandibular joint (TMJ).Static MRI, however, provides higher resolution and a better representationof bone surfaces. Woo et al. (2015) create a high resolution static MRI atlasthat includes speaker data used in the present study, as shown in Figure3.20. Di, in the figure, denotes the deformation from static MRI of speakeri onto the atlas space. We first build a segmentation mask for the mandiblein the atlas space, and then morph the mask onto the static MRI of thespeaker (using the inverse of Di [D−1i ]). Finally, we perform an image-basedelastic registration (Vercauteren et al., 2009) between the static and cineMRI images of each speaker, to generate the mask in the cine MRI (at firstTF). In the figure, this final registration is denoted by Ri.633.3. Jaw and Hyoid ModelingExtractMaskD1D2DND1-1DN-1R1R2RND2-1Static MRIfor SpeakersStatic-MRIAtlasJaw Mask in Static MRI for SpeakersJaw Mask in CineMRI for SpeakersJaw Mask inAtlas SpaceFigure 3.20. Atlas deformation for jaw segmentation. Di denotes the deforma-tion from static MRI of speaker i onto the atlas space; Ri stands for the elasticregistration from the static to cine MRI space. Jaw masks are shown in blue.Generic Model Partial Surface Final Segmentation Axial Sagittal Coronal Figure 3.21. Mandible segmentation (speaker A). The generic model is sculptedto match the partial surface, while its intersection with the image is inspected.Orange contours (bottom row) show the final result at mid-views of cine MRIdata.643.3. Jaw and Hyoid ModelingFigure 3.22. Geometries of the tongue, jaw and hyoid overlaid on the mid-axial,mid-sagittal, and mid-coronal slices of cine MRI data, for speaker B.The final mask (in the cine MRI space) yields a partial mandible surface,as shown in Figure 3.21 (for speaker A). We deploy this partial surface asthe guide for (manual) sculpting of a generic mandible mesh (available inArtiSynth). For sculpting, we use BlendSeg, a customized plug-in for theBlender mesh editing software (www.blender.org) that allows inspection ofthe mesh intersection with the image data, throughout the sculpting process(Ho et al., 2014). Figure 3.22 shows the final geometries of the jaw, hyoid,and tongue, overlaid on cine MRI data for speaker B.As mentioned earlier, cine MRI data does not provide sufficient resolution fordepicting the sites of muscle origin and insertion. The insertion sites on themandible are transformed from the generic jaw model to speaker geometry,through the sculpting process. Using the relative distance between the originand insertion sites in the generic model, we calculate rough coordinates ofthe origin sites in the speaker space. These coordinates are further fine-tuned according to the generic model/speaker MR data, in the ArtiSynth’sgraphical user interface (see Figure 3.23). 
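As a minimal illustration of this transfer (the helper and its uniform scale factor are assumptions, not part of our pipeline), a rough speaker-space origin can be obtained by reusing the generic model's origin-to-insertion offset:

```python
import numpy as np

def rough_origin(ins_speaker, ins_generic, org_generic, scale=1.0):
    """Rough speaker-space origin site: carry over the generic model's
    origin-to-insertion offset, optionally scaled to the speaker's head
    size; the result is then fine-tuned interactively in ArtiSynth."""
    offset = np.asarray(org_generic) - np.asarray(ins_generic)
    return np.asarray(ins_speaker) + scale * offset
```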
Making the image data visible,during the process, helps estimate the length of the muscles, ensuring theorigin sites do not lie outside of the speaker’s skull.653.3. Jaw and Hyoid ModelingModels/Mech/frameMarkers/Origin-PT-RightFigure 3.23. Using ArtiSynth GUI to adjust muscles of the jaw and hyoid. Config-uration of the muscles for speaker B (left) is adjusted based on the generic model(right). Origin and insertion sites of the muscles are defined as frame markers (reddots) that can be moved freely in space (using the transition handle). A matchingcine MRI image data can be made visible, if necessary.3.3.2 Forward SimulationFinally, we attach our tongue models to the mandible and hyoid throughthe attachment points calculated in Subsection 3.1.3. We further assess theimpact of jaw muscles in each speaker-specific jaw-tongue-hyoid model in aforward simulation scheme. Figure 3.24 shows the models after activationof the jaw-opener (SP, IP, AM, PM, AD), and jaw-closer (SM, DM, MT,PT, AT, MP) muscles, for speakers A-D. The active muscles that are visible,are depicted in darker shades of red. Less excitation is used for the jaw-closer muscles (compared to jaw openers), since the jaw-closers consist ofmore, larger muscles, and generate a higher magnitude of force per excitationunit. Figure 3.25 shows mid-sagittal contours of the tongue, mandible, hardpalate, and hyoid after activation of the jaw-closer and jaw-opener muscles,for speaker B. The maxilla is considered to be fixed, and we avoid collisionbetween the tongue and mandible, or maxilla, in our simulations.663.3. Jaw and Hyoid ModelingSpeaker A Speaker B Speaker C Speaker D Figure 3.24. Impact of activation of jaw-opener (middle) and jaw-closer (right)muscles vs. the neutral posture (left) for speakers A-D. The jaw-opener/closermuscles are activated linearly up to ten/five percent.673.4. ConclusionstonguemaxillajawhyoidFigure 3.25. Impact of activation of jaw-opener (middle) and jaw-closer (right)muscles vs. the neutral posture (left) for speaker A in the mid-sagittal plane. Thejaw-opener/closer muscles are activated linearly up to ten/five percent.3.4 ConclusionsIn this chapter we have detailed the development of our subject-specific,biomechanical models of the tongue-jaw-hyoid based on MR data. We startby delineating the articulators from the MR volume. Our method for tonguesegmentation benefits from a real-time user interaction that significantly im-proves time-efficiency. We also suggest methods to enable segmentation ofthe mandible and hyoid, notwithstanding the poor contrast of bone in MRI.Based on the delineated surfaces, we then create our models. Our approachfor creating tongue models combines meshing and registration techniques,to benefit from a state-of-the-art generic model (Buchaillard et al., 2009),while providing the opportunity to adjust the resolution and modify muscledefinitions. To conclude, we test the performance of our models in a for-ward simulation scheme by activating individual muscles and observing thecorresponding tissue motion/deformation.We believe that our approach for generating FEfinal offers benefits that canbe investigated further in future. Firstly, we suggest that a higher resolutiontongue model provides the opportunity to simulate more complex and longerspeech utterances that exhibit additional variability in tongue shape. 
Swal-lowing is another example where more local tongue motions are observed.Second, our proposed approach offers structural independence between theconfiguration of muscle fibres and elements. This enables the user to modify,add or delete individual muscle fibres, in order to accommodate more sub-683.4. Conclusionstlety in neural innervation (as suggested by Slaughter et al. (2005) for theIL and SL muscles and by Mu and Sanders (2000) for the GG). A finer fibrestructure is also useful in studying different languages where sounds are sim-ilar but not identical. In addition, ability to edit the fibres is beneficial forsimulation of speech, in disorders such as glossectomy, where the innervationpattern varies based on the missing tissue (Chuanjun et al., 2002). Finally,as resolution of dynamic MRI data improves, we will be able to capture finershapes of the tongue. Hence, our model should be positioned to present moredetails.69Chapter 4Data-Driven Simulation forSpeech ResearchIn this chapter, we enable data-driven simulation of our subject-specific mod-els, and demonstrate its application in investigating motor control to thetongue. We especially look at the fricative sound /s/ because it is madein a region of the vocal tract where minor changes in tongue position andshape are audible and can easily compromise the pronunciation of /s/. Forthis reason, /s/ is used as a test of articulatory accuracy by oral surgeons,prosthodontists, and speech pathologists. At least two prevalent /s/ gesturesare identified in the literature: the apical /s/, which uses the tongue tip tocontact the alveolar ridge; and the laminal /s/, which uses the tongue blade(Dart, 1991). A 2D tagged MRI study by Reichard et al. (2012) identifies atrend in the tongue tip moving faster during apical /s/, with no significanteffect observed in the tongue body. A similar study confirms a differencein tongue shape for the two gestures, reporting the occurrence of a shal-lower groove at the velopalatal juncture in apical /s/. A possible correlationbetween palate height and /s/-type is also noted (Stone et al., 2012).This study examines the motor control and motion patterns of /s/ in two backand front vowel contexts, in the utterances /@-gis/ and /@-suk/, to exploit thedifferential effects of neighboring sounds on /s/ realization. By simulatingspeaker-specific models – based on the MRI data of multiple speakers – thischapter explores possible answers to two questions: namely, What are the keymuscles responsible for the motion into the two laminal and apical gestures of/s/? and How do vowel context and /s/-type affect activation pattern amongdifferent speakers?We base our simulations on 3D tagged and cine MRI data, which capturesthe motion of the tongue’s tissue-points during the production of /s/. Us-ing this quantified, tissue-point motion, Xing et al. (2015) calculate internal704.1. MRI Datamotion patterns, as well as degree of shortening and lengthening of individ-ual muscles. However, the data alone provides an incomplete picture of themotor control to the tongue. For example, active and passive shortening ofa muscle can cause similar motion, and co-contraction of antagonist musclesmay result in no shortening. It is, therefore, difficult to disambiguate thecauses of muscle shortening from MRI alone. 
In this study, however, we sim-ulate our speaker-specific biomechanical models using tissue-point motion,extracted from tagged MRI, to first infer which muscles are actively shorten-ing (using an inverse model), and then to actively shorten those muscles topredict tissue-point motion (forward model). Following this, we compare theresults with the tagged MRI trajectories, in order to fine-tune the predictedmuscle activations. Our results (as presented in subsection 4.4 and discussedin subsection 4.5) supplement and enhance current knowledge of how muscleactivation is related to tongue motion patterns.Figure 4.1 shows a schematic representation of the MRI data used in thisstudy. The cine and tagged MRI slices are recorded during synchronizedrepetition of the desired speech utterance, and averaged over time to producea volumetric representation of the oropharynx for each time frame (TF), asdescribed in Section 4.1. The MR images are then fed into the work-flowpresented in Chapter 3. The internal tissue displacements are calculatedfrom tagged MRI and further enhanced with the tongue surface informationof cine MRI data (subsection 4.2). The biomechanical models of the tongue,mandible, hyoid, and maxilla are then constructed for each speaker based onthe surface geometries segmented from the first TF of cine MRI, as describedin Chapter 3. The speaker-specific models are then simulated based on thetissue displacements (subsection 4.3). We use the Artisynth platform, whichsupports both forward and inverse simulations. Forward simulation yieldskinematic trajectories of the model based on muscle activations, and theinverse simulation provides estimates of muscle activation patterns, based onthe tissue trajectories measured at specific control points from the data.4.1 MRI DataOur MRI data captures four normal, Caucasian American English speak-ers with mid-Atlantic dialect – two with apical /s/, and two with laminal714.1. MRI Data11 Tagged MRI Cine MRI Sagittal Coronal Axial TF 1 TF 26 Figure 4.1. Schematic representation of the MRI data used in this study. Stacksof tagged and cine MRI are acquired at each time-frame (TF) in sagittal, coronal,and axial views./s/. Each speaker repeats the utterances /@-gis/ and /@-suk/ in time witha metronome. Both cine and tagged MRI data is acquired using a Siemens3.0T Tim-Treo MRI scanner with a 16-channel head and neck coil. Thein-plane image resolution is 1.875mm × 1.875mm with a slice thickness of6 mm. Other sequence parameters include the following: repetition time(TR) 36 ms, echo time (TE) 1.47 ms, flip angle 6, and turbo factor 11.The axial, sagittal, and coral stacks of cine MRI slices are combined toform isotropic super-resolution volumes for 26 TFs, using a Maximum APosteriori-Markov Random Field method with an edge-preserving regular-ization scheme (Woo et al., 2012). Table 4.1 summarizes the information ofeach individual speaker. The phonemes of interest (/@/, /g/, /i/, /s/ and/@/, /s/, /u/, /k/) are identified in specific TFs in each utterance. Each724.1. MRI DataTable 4.1. Speaker information for this study: sex, age, and time frames associatedto individual sounds in /@-gis/ and /@-suk/ utterances.Speaker Sex Age /s/-type/@-gis/ TFs /@-suk/ TFs@ g i s @ s U kA M 23 apical 8 12 16 21 8 13 19 21B M 22 apical 6 10 18 20 7 10 16 19C F 43 laminal 8 10 14 23 4 9 15 18D F 21 laminal 5 9 13 19 7 10 17 19vowel is identified at the TF before the tongue begins to move toward thenext consonant, i.e., that is, the maximum vowel position. 
Each consonant isidentified at the TF when the tongue first contacts the palate, i.e., the initialframe, rather that the maximum consonant position.These frames were chosen because they were easily identifiable from the MRImovies. Figure 4.2 shows the mid-sagittal slice of cine MRI at the TF asso-ciated with /s/ for each speaker in both utterances.Speaker A Speaker B Speaker C Speaker D/s/ in /ə–sʊk/ /s/ in /ə–gɪs/Figure 4.2. Mid-sagittal slice of cine MRI at /s/ in /@-gis/and /@-suk/. SpeakersA and B show the apical /s/, and speakers C and D show the laminal /s/ gesture.734.2. Tissue DisplacementZoom HereHARP IDEA Surface Normal E-IDEASurface GeometryInferior -SuperiorCone ColorLeft-RightInf-SupAnt-PostAnterior - PosteriorFigure 4.3. Tissue displacements, calculated from tagged MRI, using HARP (Os-man et al., 2000), IDEA (Liu et al., 2012), and enhanced by surface normals fromcine MRI in E-IDEA (Xing et al., 2013). c©Xing et al. (2013). Adapted withpermission.4.2 Tissue DisplacementThe 2D motion of the tongue tissue points is estimated from tagged MRimage slices using the Harmonic Phase (HARP) algorithm (Osman et al.,2000). We further utilize the Enhanced Incompressible Deformation Estima-tion Algorithm (E-IDEA), to combine 2D motion data and produce the 3Dtracking result with an incompressibility constraint (Liu et al., 2012; Xinget al., 2013). E-IDEA imposes a smoothing, divergence-free, vector splineto seamlessly interpolate velocity fields across the tongue. In addition, itimproves the reliability of the displacement field at the tongue surface by in-corporating 3D deformation of the tongue surface computed from cine MRI.Figure 4.3 demonstrates the effectiveness of E-IDEA in improving the accu-racy of the tissue displacements at the tongue surface.Both HARP and E-IDEA calculate the displacement field (at each TF) withreference to the first TF (when tags were initially applied). However, inorder to simulate our models, we have to calculate displacements betweensuccessive TFs. To get from the nth to the (n+ 1)th TF, we first move fromthe nth to the first TF – via the inverse of the nth velocity field – and thenmove from the first to the (n+ 1)th TF by adding the (n+ 1)th velocity field.We adopt a simple fixed-point algorithm (Chen et al., 2008) to invert the744.3. Inverse SimulationE-IDEA velocity fields.In this study, we perform spatial and temporal regularization to reduce pos-sible noise in the estimated motion caused by registration errors or surfaceambiguities; in the spatial domain, the displacement vectors are averaged ina spherical region of predefined radius around each point of interest; in thetime domain, a cubic interpolation is performed between successive TFs tosmooth the trajectories and find the intermediate displacements.4.3 Inverse SimulationForward dynamic simulation requires fine-tuning of the muscle activationsover time. EMG recordings of the tongue have been used previously (Fanget al., 2009), but they suffer from a lack of suitable technology to deal withthe moist surface and the highly deformable body of the tongue (Yoshidaet al., 1982). Also, the relationship between EMG signals and muscle forcesis not straightforward. As an alternative, muscle activations can be predictedfrom available kinematics by solving an inverse problem. The results may befurther fed to a forward simulation system to provide the necessary feedbackto the inverse optimization process. 
The forward-dynamics tracking methodwas initially introduced for musculoskeletal systems (Erdemir et al., 2007);later on, Stavness et al. (2012a) expanded the method to the FE models(such as the tongue) with muscular hydrostatic properties that are activatedwithout the mechanical support of a rigid skeletal structure.In ArtiSynth, the system velocities (u) are computed in response to the activeand passive forces:Mu˙ =factive(q,u, a) + fpassive(q,u) (4.1)factive(q,u, a) = Λ(q,u)a (4.2)where M is the mass matrix of the system, and Λ denotes a nonlinear functionthat relates the system positions (q) and the system velocities (u) to theactive forces. In the case of Blemker muscles, the value of factive(q,u, a) iscalculated from Equation 3.5. The inverse solver uses a sub-space (v) of thetotal system velocities as its target:v = Jmu (4.3)754.3. Inverse Simulationwhere the target velocity sub-space (v) is related to the system velocities(u) via a Jacobian matrix Jm. The inverse solver computes the normalizedactivations (a) by solving a quadratic program subject to the condition 0 ≤a ≤ 1:a = argmin(‖(v −Ha)‖2 + α‖a‖2 + β‖a˙‖2) (4.4)Here ||a|| and a˙ denote the norm and time-derivative of the vector a; thematrix H summarizes the biomechanical characteristics of the system, suchas mass, joint constraints, and force-activation properties of the muscles;α and β are coefficients of the `2-norm regularization and damping terms,respectively. The regularization term deals with muscle redundancy in thesystem and opts for the solution that minimizes the sum of all activations.The damping term secures system stability by prohibiting sudden jumps inthe value of activations. We follow Erdemir et al. (2007) in opting for `2-norm(rather than `1-norm) as `2-norm is easier to implement. As a result, theregularization favors similar activation values over sparsity, across differentmuscles. The solution converges after iterating between inverse and forwarddynamics in a static per time-step process, where the system is made linearin each integration step. This method is computationally efficient comparedto the static methods; however, it may lead to sub-optimal muscle activations(Stavness et al., 2012a).4.3.1 Definition of the Control PointsAs mentioned above, the inverse solver in ArtiSynth uses a sub-space of thetotal system kinematics as its target. This means that the solver tracks thevelocities of certain points in the model (referred to as control points). In thisstudy, we define a control point as a marker that attaches to an element ofthe tongue model, at a desired location. The initial location of control pointsis defined according to a subset of FE nodes in the generic tongue model (andhence, in FEreg from Section 3.1). As a result, for all four speakers, controlpoints have similar location relative to their tongue geometries in the firstTF of cine MRI.The biomechanical models in this study are generated based on /@-gis/ data.Although both the /@-gis/ and /@-suk/ utterances were recorded consecu-tively in the same MRI session, their image data does not necessarily match764.3. Inverse Simulationfrom 1st TF of /ə–gɪs/ at 1st TF of /ə–sʊk/Figure 4.4. Initializing simulation of the /@-suk/ sequence for speaker B. Mid-sagittal view of the FE tongue model after rigid registration from the 1st TFof /@-gis/ (left) vs. result of inverse simulation to match the 1st TF of /@-suk/.Blue/green circles show target tracking points before/after inverse simulation. 
The mandible, hyoid, and maxilla are included in the model, but are not shown in the figure, for the sake of simplicity.
at the first TF. This can be due either to (1) repositioning of the head between the two acquisitions (rigid transform), or (2) a slightly different tongue posture at /@/ for the two utterances (non-rigid deformation). Therefore, the tagged MRI trajectories of /@-suk/ do not hold a direct association with the control points of the FE tongue model. Building a new model for /@-suk/ is not optimal, since it doubles the labour cost of modeling, and makes comparison of the muscle activations between the two utterances less meaningful. To deal with this issue, we compensate for the head motion by applying a rigid transformation on our model (Sharp et al., 2002), and then use the inverse solver to estimate the muscle activations that put the tongue in the correct position at the first TF of /@-suk/. Figure 4.4 shows this initialization process for the /@-suk/ sequence in speaker B. The mandible, hyoid, and maxilla are included in the model, but are not shown in the figure, for the sake of simplicity.
In this study, the tongue and bone models of the speakers fit the exact surface geometry extracted from the first TF of cine MRI (which is not perfectly symmetrical). However, to reduce the computational cost of the inverse problem, we assume bilateral symmetry in motion; the left and right muscles are activated together and with the same intensity. The control points (32 FE markers) are distributed in the left half of the tongue.

4.4 Results
We estimate the muscle activation patterns using the inverse simulation, with kinematic trajectories from the MRI data. Table 4.2 shows the tracking error averaged over all the control points, at 26 TFs, in the two utterances, /@-gis/ and /@-suk/.

Table 4.2. Absolute tracking error (mm) for speakers A-D, averaged over all control points and all time frames in /@-gis/ and /@-suk/.

                     A             B             C             D
/@-gis/  Mean   1.80 ± 0.68   1.95 ± 0.75   1.85 ± 0.64   1.70 ± 0.66
         Std    0.83 ± 0.35   0.71 ± 0.26   0.67 ± 0.22   0.73 ± 0.26
/@-suk/  Mean   1.90 ± 0.55   1.88 ± 0.81   1.94 ± 0.60   1.90 ± 0.69
         Std    0.49 ± 0.19   0.79 ± 0.27   0.97 ± 0.28   0.62 ± 0.21

Figures 4.5 and 4.6 show the muscle activation patterns. The four speakers are shown in columns A-D with TFs (1 to 26) along the x axis. Speakers A and B use an apical /s/, while speakers C and D use a laminal /s/. The muscles of the tongue include the following: genioglossus (GG), hyoglossus (HG), styloglossus (STY), verticalis (VERT), transversus (TRANS), geniohyoid (GH), mylohyoid (MH), and longitudinal (inferior [IL], superior [SL]). The GG, VERT and TRANS muscle bundles are further divided into five smaller functionally-distinct segments (a: posterior to e: anterior), as suggested by Miyawaki et al. (1975) and Stone et al. (2004). We also follow Fang et al. (2009) in dividing the STY muscle into two functional segments (p: posterior and a: anterior). The muscles of the jaw and hyoid include the following: temporal (anterior [AT], middle [MT], posterior [PT]), masseter (superficial [SM], deep [DM]), pterygoid (medial [MP], superior-lateral [SP], inferior-lateral [IP]), digastric (anterior [AD], posterior [PD]) and stylo-hyoid (SH).

Figure 4.5. Muscle activations estimated by the inverse solver during the utterance /@-gis/ for speakers A-D. (Each panel plots activation (%) against time frame; rows show the GGa-e, TRANSa-e, and VERTa-e segments, the remaining tongue muscles (SL, GH, MH, HG, IL, STYa, STYp), the jaw muscles (AT, MT, PT, SM, DM, MP), and the hyoid-positioner muscles (SH, IP, SP, AD, PD); columns show speakers A-D, with the /s/ time frame marked.)

Figure 4.6. Muscle activations estimated by the inverse solver during the utterance /@-suk/ for speakers A-D. (Same layout as Figure 4.5.)

4.4.1 Tongue-Protruder Muscles
Tongue protruder muscles include the posterior region of the genioglossus muscle (GGa/b/c), which pulls the tongue forward, as well as the TRANS and VERT muscles. The GGa/b/c and TRANS muscles also elevate the tongue body. The floor muscles GH and MH assist in tongue elevation and protrusion.
Our results, as demonstrated in Figures 4.5 and 4.6, show that for both utterances, the GGa/b became more active over time and was maximally active prior to the final consonant. The exception to this pattern was the upper pharyngeal region of the GG (GGb) for speakers C and D, which had little activity and no pulse toward the end of the utterance. The GGc pulse occurred during both vowels. Speaker B used the GGb more than the other speakers, to position the vowel. The TRANS tended to increase in activation throughout the course of the utterance, with slightly more activation for the velar stops (/g/ or /k/) than the alveolar /s/. The TRANSd/e increased activity just before /s/, consistent with local tongue tip protrusion, more so for the apical speakers (A and B). Speakers varied in terms of which region of the TRANS was more active, but the net effect was protrusion. The activation patterns across the five segments of the TRANS, and VERT, were more alike (though not identical) than the activation patterns across the five segments of the GG. The GH pulls the hyoid forward, and shows small activations local to the vowel for speaker A, and throughout the utterance for speakers C and D. Speaker B does not use it at all.
The MH activates moreduring the consonants, mainly to elevate the tongue.4.4.2 Tongue-Retractor MusclesThe tongue is retracted by the extrinsic muscles, STY and HG, which pullthe tongue backward/upward and backward/downward, respectively. Twointrinsic muscles, the SL and IL, also retract the tongue; they additionallyelevate (SL) and lower (IL) the tip. Finally, the anterior fibers of the ge-nioglossus (GGd/e) lower the upper body and blade of the tongue, causingbackward motion of the tongue body.More activation of the retractor muscles was expected for /suk/ than /gis/,since the tongue moves backwards during /suk/. For /suk/, the SL was814.4. Resultsactive for all subjects, with speakers B and C increasing the activation untilthe /u/ was reached. A continuous low-level activation of the SL was usedby speakers A and D during /suk/, and by all four subjects during /gis/.The IL was not used at all during /gis/; but it was used for the /uk/ bythree of the subjects, consistent with retracting the tongue tip. The largestactivations in both utterances were seen in the GGd/e for all four speakers (5-10% activation). The GGd muscle – the most active – lowers or stabilizes thetongue dorsum, and the GGe further lowers the tongue blade. For /@-gis/, thespeakers used the GGd throughout the utterance, with smaller activationsat /g/ than /i/ and /s/. During /@-suk/, the GGd was most active during/u/. The GGe was active for /@/ and the breath, with occasional activationfor the first consonant, irrespective of what it was.Of the two extrinsic retractors, the STY was fairly quiescent for both ut-terances, with speaker B using it during /gis/, and speaker C during /suk/.The HG, on the other hand, was active for 3 speakers (A, B and C), mostlyduring /@/ and /s/ in both utterances. Of the two intrinsic retractors, theSL was active for all subjects, mostly during /@/, /is/ (gis) and /uk/ (suk).During /suk/, speaker A and D had minimal SL activity. The IL was mostlyquiescent, with very slight, occasional activity during the velar stops /g/ and/k/.4.4.3 Other MusclesRow 5 in Figure 4.5 and 4.6 contains the jaw closing muscles, which globallyelevate the tongue. For /@-gis/, these muscles have large peaks of activityduring closure into /g/, and smaller ones during the motion into /s/ (con-sistent with tongue elevation for those sounds). For /@-suk/, the activationstend to be smaller than for /@-gis/, with only speakers C and D, having aclosure activation into /s/. The IP and SP in row 6 are jaw protruding andclosing muscles. In /@-gis/, they behave like the jaw closing muscles in row5, with two peaks of activity, one preparatory for /g/ and a second duringthe /i/ or /s/. The SH and PD pull the hyoid back and up; the AD pullsthe hyoid forward and tongue up. These muscles, like the jaw closers, aremost active during the /g/ and /s/, although speaker B had a fairly activeAD throughout. These activations could allow fine tuning of jaw position;They could also be related to pitch changes, as the position of the larynx824.5. Discussionvaries with pitch. During /@-suk/, the IP is active throughout the utterance(subjects A, D) or during the first half (subjects B, C). The SP has littleactivity in either utterance. The hyoid-positioner muscles again show signif-icant activity in the PD associated with /k/. 
Speaker C appears to use boththe jaw closers, openers, and hyoid-positioners in the transition from /@/ to/s/.4.5 DiscussionThis study investigates differences in the key muscle activations and theiroverall activation pattern, during apical vs. laminal /s/ production, and as afunction of differences in vowel context, using speaker-specific biomechanicalmodels. We discuss the results below.4.5.1 Apical vs. Laminal SpeakersSpeakers A, B used an apical /s/, and speakers C, D used a laminal /s/.Both the VERTd and TRANSd/e were more active for the apical /s/. Thisdifference is not seen in the GG data; however, it should be noted that forthe TRANS and VERT, region e extends into the tongue tip, whereas theGGe stops at the tongue blade. It is possible that these additional activa-tions create a very subtle difference in tongue positioning. The differencesinvolved in creating an apical vs. laminal /s/ may require less active effortthan one would expect. For example, Stone et al. (2012) find that palateshape has a strong effect on choice of /s/-type, and some of the difference intongue tip shape may reflect palate shape. Moreover, thus far, only a slightlyfaster tip motion in apical /s/ has been found to distinguish the two motions(Reichard et al., 2012). Perhaps the simultaneous activation of the VERTdand TRANSd/e protrudes the tip slightly more in apical /s/, and the palateconstraint reduces the overall activation needed.834.5. Discussion4.5.2 Mechanisms of Tongue ElevationOne of the original questions asked in the study was whether the /s/ soundin /gis/ and /suk/ would differ because of different neighboring sounds anddifferent locations in the utterance. Interestingly, a large difference was ob-served in muscle activation patterns related to the location of the /s/ in theutterance. This was due, however, to the vowel preceding the /s/ more thanthe /s/ position.Figures 4.5 and 4.6 show that the pharyngeal portions of the GG (a/b/c) areactive into the last consonant of each utterance, regardless of whether it is/s/ or /k/, while the jaw muscles appear more active at the beginning of theutterance. This can be explained by the context. The /@/ at the start of theutterance requires an open jaw. The jaw closure into the following consonantis large, as seen in the activation of the jaw closure muscles at or after the/@/, and may do the lion’s share of the tongue elevation/fronting needed forboth the /s/ and the /k/. When these same consonants appear at the endof the utterance, however, the jaw is already quite closed for the precedingvowel (/i/ or /u/), and so the tongue must internally elevate and front itself,increasing activation in the GGa/b/c.4.5.3 Commonalities Across SpeakersSince tongue muscle activity measured from EMG usually shows variabil-ity among subjects, it is not surprising to see individual differences amongspeakers in our simulation results. However, there are some similarities thatcan be observed among all speakers. The first commonality across speak-ers is the large amount of muscle activation in the largest tongue muscle,genioglossus (GGa/b/c/d/e), followed by the jaw advancement muscle (theinternal pterygoid [IP]), and the hyoid positioner muscles (the digastric [AD,PD] and the stylo-hyoid [SH]). The GGa/b/c was the most active muscle ofprotrusion/elevation for all subjects, with as much as 15% activation. TheGGd/e was the most active muscle of retraction/lowering, with up to 10%activation. 
The GGa was always activated during articulation of the conso-nants, to elevate the tongue to the palate absent jaw assistance. The GGdwas continually active in both utterances – possibly to stabilize the uppertongue surface so it did not hit the palate inadvertently. Equally active were844.6. Conclusionsthe IP, AD, PD and SH. IP was more active during the forward moving/gis/, but was still quite active in /suk/, where it was most active for /s/and tapered off for /k/.Jaw position is critical and inflexible for /s/. In both utterances, the IP wasquite active at or before the /s/. The hyoid positioning muscles, AD, PDand SH, were active in both utterances, often with pulses for the consonants.The hyoid is a particularly unstable bone, as it is the only bone in the humanbody that does not articulate with another bone. It is stabilized entirely bymuscles. The AD pulls it forward, The PD pulls it back and up, The SHpulls it down. The PD and SH were often active synchronously, sometimeswith AD and sometimes without. These muscles may be so active becausethey have three roles that occur simultaneously in speech. Firstly, theyposition the hyoid to allow anterior-posterior tongue body motion duringvowels. Secondly, they resist the anterior pull on the hyoid of the GGa during/s/ and /k/ or /g/. Thirdly, they assist in changing pitch, as hyoid/thyroidposition varies with pitch in speaking.The second commonality among speakers was a similar variety of activationpatterns across the GG regions (a/b/c/d/e), consistent with independentactivation of fibers throughout the GG. Stone et al. (2004) and Miyawakiet al. (1975) found independent regions of compression and activation in thismuscle. Anatomical studies have shown very high innervation of these andall muscles of the tongue (Sokoloff et al., 1992). The other muscles thatmake up a structural unit with the GG, namely the TRANS and VERT (seeTakemoto et al. (2001)), show considerably less activation (< 5%) and maybe used to fine-tune the position and surface shape of the tongue. Somebehavioral differences in these muscles were consistent with differences inapical vs. laminal /s/. The floor muscles, GH and MH, have little activationduring these utterances and may be more important for swallowing.4.6 ConclusionsIn this chapter we have enabled data-driven simulation of our speaker-specificbiomechanical models to investigate inter- and intra-subject variability inspeech motor control. We use MRI data of four normal subjects speakingthe /s/ sound in two vowel context. Two of the speakers use apical and854.6. Conclusionsthe other two use laminal /s/. The results indicate the dominant use ofthe genioglossus muscle over the other muscles of the tongue. As expected,the posterior portion of the genioglossus is found to be more active than itsanterior when /s/ follows /i/ in /@-gis/. The reverse is true when /s/ precedes/u/ in /@-suk/. The transverse muscle is also found to be moderately activemore at the anterior, but also at posterior for two of the speakers. Theactivations of other muscles of the tongue seem to be subtle, perhaps usedto fine-tune tongue posture. Our results also identifies the anterior-digastricmuscle, as well as the inferior-lateral portion of the pterygoid muscle, to bethe key active muscles of the jaw and hyoid during /s/. The results do notindicate any substantial difference between apical and laminal /s/ types. Itshould also be noted that our modeling and simulation experiments are basedon medical data from four English speakers. 
A larger database is needed inorder to confirm the validity/generality of our findings.86Chapter 5Acoustic Analysis for SpeechSynthesisDue to recent advances in acquisition technologies, speech data, including au-dio signals and medical images is abundant today. Such data motivates theuse of computational approaches for modeling speech phenomena (Vascon-celos et al., 2012; Ventura et al., 2009, 2013). On one hand, biomechanicalmodels of the oropharynx aim to simulate the dynamics of speech productionunder simplified – but biologically and physically valid – assumptions; on theother hand, articulatory speech synthesizers focus on generated sound as anend product, by designing a representation of the vocal tract and folds thatis capable of generating the desired acoustics of an observed shape of the oralcavity (Doel et al., 2006; Birkholz et al., 2013). The search for an ideal model– that represents both the acoustical and biomechanical characteristics of theoropharynx – continues to this date.5.1 Synthesis of Vowels in Running SpeechIn this section, we expand our subject-specific modelling and simulationframework from chapters 3 and 4 to include a real-time acoustic synthe-sizer. The biomechanical models are enhanced with an air-tight VT meshthat deforms along with the movement of the articulators. The VT geometryis then processed in real-time to extract the 1D representation required forsolving Navier-Stokes equations for vowel synthesis (Doel et al., 2006).The vocal folds oscillate during the articulation of vowels and voiced-consonants(e.g., /b/), but are wide open and have little effect on articulation of frica-tives (e.g. /s/) and stops (e.g. /k)/. Constrictions or obstructions at certainpoints in the tract create turbulence that generates the high frequency noise875.1. Synthesis of Vowels in Running Speechresponsible for making the fricatives and stops. The synthesis of fricativesdepends to a large extent on lung pressure and the noise characteristics of thesystem. Due to the lack of voicing information, we focus our acoustic analysissolely on the synthesis of the vowels, specifically /i/ in /@-gis/, and /u/ in/@-suk/. The reduced vowel /@/ is only used to help the subject maintain aneutral tongue posture at the start of the speech utterance.5.1.1 Biomechanical Model of Vocal TractWe model the VT as a deformable, air-tight mesh – referred to as geometricskin – which is coupled to the articulators, as proposed by Stavness et al.(2014b). Each point on the skin is attached to one or more master com-ponents, which can either be 3-DOF points, such as finite-element nodes,or 6-DOF frames, such as rigid body coordinates. The position of each skinvertex (qv) is calculated as a weighted sum of contributions from each mastercomponent:qv = qv0 +M∑i=1wifi(qm,qm0 ,qv0) (5.1)where qv0 is the initial position of the skinned point, qm0 is the collectiverest state of the masters, wi is the skinning weight associated with the ithmaster component, and fi is the corresponding blending function. For a pointmaster (such as a FE node) the blending function fi is the displacement ofthe point. For frames (such as rigid bodies) fi is calculated by linear, ordual-quaternion linear, blending. To provide two-way coupling between theskinned mesh and articulators, the forces acting on the skin points are alsopropagated back to their dynamic masters.To create the skin, we initially segment the shape of the VT from the firsttime-frame of cine MRI data (described in Section 4.1). 
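As a concrete reading of Equation 5.1, the following Python sketch evaluates the skinned vertex positions for the case where all masters are 3-DOF points (e.g. tongue FE nodes), for which the blending function reduces to the master displacement. Frame (rigid-body) masters, which require linear or dual-quaternion blending, are omitted; the sketch and its variable names are illustrative and are not taken from the ArtiSynth implementation.

```python
import numpy as np

def skin_positions(q_v0, q_m, q_m0, W):
    """Evaluate Eq. 5.1 for 3-DOF point masters only.
    q_v0: (V, 3) rest positions of the skin vertices;
    q_m, q_m0: (M, 3) current and rest master (FE node) positions;
    W: (V, M) skinning weights relating vertices to masters."""
    return q_v0 + W @ (q_m - q_m0)   # blending function = master displacement
```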
To create the skin, we initially segment the shape of the VT from the first time-frame of the cine MRI data (described in Section 4.1). The skin is attached to, and deforms along with, the motion of the mandible rigid body and the tongue FE model. We also restrict the motion of the VT to the fixed boundaries of the maxilla and pharyngeal wall.

5.1.2 Time-Domain Acoustical Model

For our 1D acoustic analysis, we describe the VT with an area function A(x, t), where 0 ≤ x ≤ L is the distance from the glottis along the tube axis and t denotes time. Following Doel and Ascher (2008), we define the variables u(x, t) and p(x, t) as scaled versions of the volume velocity û and air density ρ̂, respectively:

u(x,t) = A(x,t)\,\hat{u}/c    (5.2a)

p(x,t) = \hat{\rho}/\rho_0 - 1    (5.2b)

where ρ_0 is the mass density of air and c is the speed of sound. We solve for u(x, t) and p(x, t) in the tube using derivations of the linearized Navier-Stokes equation (5.3a) and the equation of continuity (5.3b), subject to the boundary conditions described in Equation 5.3c:

\frac{\partial (u/A)}{\partial t} + c\,\frac{\partial p}{\partial x} = -d(A)\,u + D(A)\,\frac{\partial^2 u}{\partial x^2}    (5.3a)

\frac{\partial (A p)}{\partial t} + c\,\frac{\partial u}{\partial x} = -\frac{\partial A}{\partial t}    (5.3b)

u(0,t) = u_g(t), \qquad p(L,t) = 0    (5.3c)

The right-hand side of Equation 5.3a is a damping term, added to model the frequency-dependent wall loss. Note that setting the damping term to zero and combining equations 5.3a and 5.3b yields the classic wave equation (for A = 1):

\frac{\partial^2 u}{\partial t^2} = c^2\,\frac{\partial^2 u}{\partial x^2}    (5.4)

For monochromatic waves of the form e^{i\omega(t - x/c)}, the damping term in Equation 5.3a reduces to

-\left[\,d(A) + D(A)\,\omega^2/c^2\,\right] u    (5.5)

We follow Doel and Ascher (2008) in using d(A) = d_0 A^{-3/2} and D(A) = D_0 A^{-3/2}, with wall-loss coefficients d_0 = 1.6 m/s and D_0 = 0.002 m^3/s, chosen to match previously reported hard-wall losses at 500 Hz and 2000 Hz; during implementation these damping coefficients were further scaled (by factors of 4 and 8, depending on the discretization factor) in order to yield the desired bandwidth. Figure 5.1 shows the intersecting planes used to calculate a discretized area-function representation of the VT geometry at time t = t_i.

Figure 5.1. Intersecting planes superimposed on the VT geometry at time t = t_i (left), vs. the analog area function A(x, t_i) and its schematic discretized representation using 20 segments of equal length (right).

In Equation 5.3c, u_g(t) is the source volume velocity at the glottis. We couple the VT to a two-mass glottal model (Ishizaka and Flanigan, 1972). We refer to Doel and Ascher (2008) for full details of the implementation.
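To make the time-domain route concrete, below is a deliberately simplified, illustrative finite-difference sketch of equations 5.3a-b for a static, uniform area function, keeping only the d(A) part of the damping and using a sinusoidal placeholder glottal source. The grid size, area values, and source are assumptions for illustration; the actual Doel-Ascher solver uses a more careful discretization and the two-mass glottal model.

```python
import numpy as np

# Illustrative explicit update for Eqs. 5.3a-b with a STATIC area function,
# keeping only the d(A) part of the wall loss (the D(A) viscosity term and
# the moving-boundary term -dA/dt are dropped for brevity).
c, d0 = 343.0, 1.6                 # speed of sound [m/s], wall-loss coefficient [m/s]
L, N = 0.17, 60                    # tract length [m], number of grid cells
dx = L / N
dt = 0.5 * dx / c                  # CFL-limited time step
A = np.full(N + 1, 3.0e-4)         # placeholder area function [m^2] (uniform tube)
u = np.zeros(N + 1)                # scaled volume velocity, Eq. 5.2a
p = np.zeros(N + 1)                # scaled pressure, Eq. 5.2b
lip_pressure = []                  # crude proxy for the radiated sound

for n in range(4000):
    u[0] = 1.0e-4 * np.sin(2.0 * np.pi * 100.0 * n * dt)   # glottal source u(0,t) = u_g(t)
    dpdx = np.gradient(p, dx)
    dudx = np.gradient(u, dx)
    # Eq. 5.3a with static A:  du/dt = A * (-c dp/dx - d(A) u),  d(A) = d0 * A**-1.5
    u[1:] += dt * A[1:] * (-c * dpdx[1:] - d0 * A[1:] ** -1.5 * u[1:])
    # Eq. 5.3b with static A:  dp/dt = -(c / A) du/dx
    p[:-1] += dt * (-c / A[:-1]) * dudx[:-1]
    p[-1] = 0.0                                             # open end, p(L,t) = 0
    lip_pressure.append(p[-2])
```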
5.1.3 Results and Discussion

In this section, we first use the muscle activations calculated in the previous chapter (Section 4.4) to run our models in the forward simulation scheme. This time we add the skinned VT mesh to our models, to find the deformed VT geometries in each time-frame. More information on the MRI data is presented in Section 4.1. Figure 5.2 shows the mid-sagittal cut of the models, superimposed on the image data in the time-frames that correspond to /i/ and /u/. The red arrows in the figure indicate the areas of mismatch, which often occur around the lips, velum, and epiglottis. The mismatch occurs primarily because our VT model is based on segmentations around the initial position of these structures, which, at times, is different from what we see in the /i/ or /u/ time-frames.

Figure 5.2. Deformed VT mesh (blue) superimposed on mid-sagittal cine MR images at /i/ (in /ə-gis/) and /u/ (in /ə-suk/) for speakers A-D. The red arrows indicate the areas of mismatch.

Next, we fit a center-line to our VT geometries, and set up 20 planes, oriented normal to the center-line, that start from the lips and end at the epiglottis. The intersection of these planes with the VT gives a discrete representation of A(x, t) at any time t during the simulation. These area functions are fed into the acoustic synthesizer in real-time, to generate sounds from which we then calculate the formant frequencies (as the peaks of the spectrum).

As for the ground truth, we first look at the recorded audio signals, which are available for three of our speakers (A, B, and D), recorded while they repeated /@-gis/ and /@-suk/, synced to the metronome in a lab environment. Figure 5.3 shows the acoustic profile and spectrum of a single repetition of /@-gis/, as spoken by speaker B. The formant frequency is calculated at the mid-point of the time interval associated with the vowel (here /i/). We average the results over all repetitions of the utterance.

Figure 5.3. The audio signal and spectra for one repetition of the speech utterance /@-gis/, as spoken by speaker B. The formants are shown as red dots associated with each time instant of audio, using the Praat phoneme analysis software (Boersma and Weenink, 2005).

In addition, we extract the airway from the associated TF of the cine MRI data (using the semi-manual segmentation tool ITK-SNAP by Yushkevich et al. [2006]) to obtain the corresponding area functions, which we later feed into the acoustic synthesizer. The resultant formants provide an upper limit on the accuracy obtainable in our simulations, given the contrast and resolution of the cine MRI data and the fidelity of the acoustic synthesizer.

Table 5.1 shows the first three formants for both /i/ and /u/ in speakers A-D, as calculated from our simulations, the audio signals, and the cine MRI images. Figure 5.4 presents the average values and standard deviations of these formants, to ease the overall comparison. Note that for both /i/ and /u/, F1 is often higher in our simulations than in the audio signals. The opposite happens for F2, where the simulations fall short. F3 shows a mixed behaviour: it is noticeably lower than the audio in /i/ of speaker A, and higher than the audio in /u/ of speakers B and D. Looking at the cine MRI measurements, we see that, for /i/, the values of F2 are also lower than the audio, with a large difference for /i/ of speaker D. For the same case, F1 is unusually low in cine MRI.
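The formant values compared here are read off as peaks of a magnitude spectrum, as described above for the synthesized sounds. A minimal, illustrative peak-picking sketch follows; the sampling rate, window, and three-peak test signal are hypothetical, and the recorded audio itself was analyzed with Praat rather than with this procedure.

```python
import numpy as np
from scipy.signal import find_peaks

def spectral_peaks(signal, fs, n_peaks=3, min_spacing_hz=300.0):
    """Return the frequencies of the n strongest spectral peaks, lowest first.
    This mirrors the simple 'formants as peaks of the spectrum' reading used
    for the synthesized sounds."""
    mag = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    spacing_bins = max(1, int(min_spacing_hz / (freqs[1] - freqs[0])))
    peaks, _ = find_peaks(mag, distance=spacing_bins)
    strongest = peaks[np.argsort(mag[peaks])[-n_peaks:]]   # keep the n largest peaks
    return np.sort(freqs[strongest])

# toy test signal with spectral energy near 300, 2300 and 3000 Hz
fs = 16000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * f * t) for f in (300.0, 2300.0, 3000.0))
print(spectral_peaks(x, fs))   # approximately [300., 2300., 3000.]
```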
Table 5.1. Formant frequencies (F1, F2, F3) of /i/ and /u/ in speakers A-D, as computed from our simulations, audio signals, and cine MRI data (values are in Hz).

  Vowel   Speaker   Simulation          Cine MRI            Audio
  /i/     A         (317, 1652, 2613)   (240, 1826, 3062)   (243, 2264, 3077)
          B         (256, 1905, 3086)   (267, 2055, 3012)   (268, 2272, 3032)
          C         (327, 1600, 3091)   (272, 1943, 3033)   Not Available
          D         (424, 1871, 2886)   (196, 2086, 2720)   (347, 2675, 3032)
  /u/     A         (379, 1516, 2321)   (373, 1699, 2771)   (300, 1636, 2541)
          B         (311, 1698, 2696)   (317, 1995, 2704)   (333, 1814, 2317)
          C         (405, 1633, 2899)   (371, 1859, 2842)   Not Available
          D         (430, 1990, 2921)   (321, 2440, 2836)   (380, 2022, 2674)

Figure 5.4. Average formant frequencies for /i/ (top row) vs. /u/ (bottom row), as computed from our simulations, audio signals, and cine MRI data.

We speculate that one reason for such discrepancies lies in the fact that our dynamic MRI images provide incomplete VT visibility. In particular, due to the fuzzy boundaries between the teeth and the airway, we tend to exclude some of the VT volume in the mouth opening (mostly between the teeth). The low spatial resolution of the images also prevents us from extracting the posterior pharyngeal side branches of the VT. In addition, we notice that the average VT length reported in the literature for adult speakers is around 17 cm, whereas the longest airway extracted from our MRI images does not exceed 14.5 cm, leaving out a few centimeters around and below the epiglottis. If these areas are not visible in the first TF of the cine MRI sequence, our models cannot recover them later. It is important to note that a synthetic increase of the VT length (i.e., as an independent parameter of the acoustic synthesizer) lowers the value of all three formants, with a greater effect on F2 and F3. Such scaling of the center-line does not provide a solution, though, since it remaps the position of the intersecting planes to an arbitrary position that no longer matches the image data.

The second hypothesis to explain the discrepancies in Table 5.1 is that the 1D implementation of the Navier-Stokes equations, chosen here for the sake of real-time simulation, does not capture the complexities of the 3D VT geometry and is not sufficient for accurate calculation of the formant frequencies. We will address this hypothesis in the next section, by comparing against the results of a 3D analysis.
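To make the size of these discrepancies concrete, the relative simulation-to-audio differences can be computed directly from the /i/ rows of Table 5.1 (speaker C has no audio recording). The snippet below is purely illustrative.

```python
# (F1, F2, F3) in Hz for /i/, copied from Table 5.1 (speakers with audio only).
sim   = {"A": (317, 1652, 2613), "B": (256, 1905, 3086), "D": (424, 1871, 2886)}
audio = {"A": (243, 2264, 3077), "B": (268, 2272, 3032), "D": (347, 2675, 3032)}

for spk in sim:
    rel = [100.0 * (s - a) / a for s, a in zip(sim[spk], audio[spk])]
    print(spk, [f"{r:+5.1f}%" for r in rel])
# F2 comes out roughly 16-30% below the audio value for every speaker, while F1
# is overestimated for speakers A and D, consistent with the trends noted above.
```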
In a separate experiment, we perform the acoustic analysis of our simulations using two different resolutions for our tongue models, as described in Chapter 3. FEreg is of lower resolution and matches the element configuration of a standard generic model (Buchaillard et al., 2009), while FEmesh provides a mixed-element alternative of higher resolution. We carry out this analysis on /i/ of speaker B, which, as shown in Table 5.1, demonstrates a better match in formants with the cine MRI and audio data. The cine MRI value of F2, however, remains 9.5% below that of the audio signal.

Figure 5.5 shows the normalized area profile along the VT at /i/ in our simulation, compared to the cine MRI data. Both the FEreg and FEmesh tongue models are able to capture the expected shape of the VT. Again, the noticeable mismatches occur in the areas influenced by the lips, velum, and epiglottis.

Figure 5.5. Simulation results for speaker B: normalized area profile along the VT (by plane number) for /i/, compared to the cine MRI at time-frame 17 (series: FE_high, FE_reg, cine MRI TF 17; the approximate locations of the lips, maxilla, soft palate and epiglottis are marked).

Table 5.2 compares the formant frequencies of our simulations using both FEreg and FEmesh for speaker B. Looking at F2, we notice that while FEmesh reduces the gap between the simulation and the ground truth (audio and cine MRI), a difference in values remains. This may be due to (1) the discrepancy between the manual segmentations performed at the 1st and 17th time-frames of the cine MRI, or (2) the arbitrary nature of centre-line extraction for 1D acoustic analysis. We discuss the latter in more detail in the next section.

Table 5.2. Formant frequencies for the simulations using FEreg and FEmesh, compared to the audio and cine MRI data for /i/ of speaker B.

            Audio   Cine MRI   FEreg   FEmesh
  F1 (Hz)   268     267        262     256
  F2 (Hz)   2272    2055       1905    1995

These quantitative results suggest that FEreg and FEmesh do not show an appreciable difference in acoustic performance for the simulation of the utterance /@-gis/ using our source-filter based speech synthesizer (Doel and Ascher, 2008). We therefore conclude that the ArtiSynth generic tongue model (Buchaillard et al., 2009) provides sufficient resolution for modelling this utterance at this level of acoustic fidelity and cine MRI resolution.

5.2 Synthesis of Sustained Vowels

In Section 5.1, we coupled a state-of-the-art 1D acoustic synthesizer to subject-specific biomechanical models of the oropharynx, in order to generate sound in real-time. As shown in Subsection 5.1.3, the formant frequencies computed from the resulting acoustic signals differ from those of the audio signal. One hypothesis to explain this difference is that 1D acoustic analysis cannot capture the complex three-dimensional shape of the VT, causing a discrepancy in the spectrum. In this section, we investigate this hypothesis further by running a series of experiments that compare the results of 1D and 3D acoustic analysis. We hope such a comparison helps clarify the limitations of our current models and/or speech synthesizer.

For the sake of comparison, we categorize the methods of acoustic analysis into two classes. The first is time-domain analysis, in which the wave equation is solved over time to generate sound. The finite Fourier transform (FFT) of the output sound signal is then divided by the FFT of the source signal (at the glottis) to yield the frequency spectrum of the filter (the VT). We refer to the peaks of this spectrum as formant frequencies. Examples of time-domain analysis are the acoustical models proposed by Doel and Ascher (2008) for 1D, and by Takemoto et al. (2014) for 3D analysis.

The second analysis is carried out fully within the frequency domain. Since time is absent from the equations, the VT is not moving, i.e. the vowels are sustained. The VT transfer function (in the frequency domain) is calculated by solving the frequency analogue of the wave equation. We refer to the peaks of the associated spectrum as resonance frequencies of the VT. Examples are the 1D and 3D models suggested by Aalto et al. (2014) and Kivelä (2015). Throughout the rest of this section, we explicitly use the terms formant and resonance to refer to the time-domain and frequency-domain analyses, respectively.
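A minimal illustration of the time-domain route just described (FFT of the output divided by FFT of the source, with peaks read off as formants) is given below. A hypothetical single-resonance filter stands in for the vocal tract, and an impulse stands in for the glottal source; in our experiments the source is the two-mass glottal model and the output is the synthesized audio.

```python
import numpy as np
from scipy.signal import lfilter

def filter_spectrum(source, output):
    """Spectrum of the 'filter' (VT) as the ratio of output to source spectra."""
    return np.abs(np.fft.rfft(output) / np.fft.rfft(source))

fs, n = 16000, 4096
source = np.zeros(n)
source[0] = 1.0                                    # impulse stands in for the glottal source
r, f0 = 0.98, 500.0                                # toy single resonance near 500 Hz
a = [1.0, -2.0 * r * np.cos(2.0 * np.pi * f0 / fs), r * r]
output = lfilter([1.0], a, source)                 # 'VT' response to the source
H = filter_spectrum(source, output)
freqs = np.fft.rfftfreq(n, 1.0 / fs)
print(freqs[np.argmax(H)])                         # the peak (the 'formant') lies near 500 Hz
```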
For our 3D analysis, we follow Aalto et al. (2014) in calculating the Helmholtz resonances of our VT geometries using 3D finite element method (FEM) analysis. The method is based solely on the shape of the VT and does not require any glottal source excitation; thus, far fewer parameters need to be adjusted compared to a full 3D time-domain analysis, such as the one by Takemoto et al. (2014). Solving the equations directly in the Fourier domain simplifies the acoustical model by discarding any time-dependent damping or loss (such as the effects of VT wall vibration). Further simplification of the wave equations is also inevitable, in order to make the mathematical derivations possible; however, the simulation runs close to real-time, whereas time-domain 3D methods can take up to several hours for the analysis of a single geometry (Takemoto et al., 2014).

5.2.1 Helmholtz Resonances

Assuming that the air flow in the VT is irrotational (∇ × v = 0), we can define a velocity potential Φ(s, t) with

v = -\nabla\Phi    (5.6)

where the vector s denotes the spatial coordinate in three dimensions. Solving the wave equation for either the pressure field P or the velocity field v does not necessarily provide a simple answer for the other. Solving for Φ, on the other hand, yields v directly, from Equation 5.6, and P from the (linearized) Bernoulli equation for irrotational and unsteady flow:

P = -\rho\,\frac{\partial\Phi}{\partial t}    (5.7)

The wave equation for the velocity potential is then summarized as follows:

\nabla^2\Phi = \frac{1}{c^2}\,\frac{\partial^2\Phi}{\partial t^2}    (5.8)

where ∇² is the Laplace operator. Making the velocity potential in Equation 5.8 time-harmonic, by setting Φ_λ(s, t) = Re{Φ_λ(s) e^{iλt}}, Equation 5.8 can be rewritten as

\nabla^2\Phi_\lambda(s) + \frac{\lambda^2}{c^2}\,\Phi_\lambda(s) = 0    (5.9)

an eigenvalue problem that is, mathematically, considerably easier to solve than the wave equation. Using Equation 5.9, the vowel resonances are calculated by finding the eigenvalues λ, and their corresponding velocity-potential eigenfunctions Φ_λ, from the Helmholtz resonance problem:

c^2\nabla^2\Phi_\lambda + \lambda^2\Phi_\lambda = 0 \quad \text{on } \Omega    (5.10a)

\Phi_\lambda = 0 \quad \text{on } \Gamma_1    (5.10b)

\alpha\lambda\Phi_\lambda + \frac{\partial\Phi_\lambda}{\partial\nu} = 0 \quad \text{on } \Gamma_2    (5.10c)

\lambda\Phi_\lambda + c\,\frac{\partial\Phi_\lambda}{\partial\nu} = 0 \quad \text{on } \Gamma_3    (5.10d)

where Ω ⊂ R³ is the air column volume and ∂Ω is its surface, comprising the boundary at the mouth opening (Γ_1), the air-tissue interface (Γ_2) and a virtual plane above the glottis (Γ_3). ∂Φ_λ/∂ν denotes the exterior normal derivative. Note that the Dirichlet boundary condition in Equation 5.10b simplifies the model by regarding the mouth boundary as an idealized open end of an acoustic tube, thereby neglecting lip radiation loss. The value of α regulates the energy dissipation through the tissue walls, and the case α = 0 corresponds to hard, reflecting boundaries. We compute the numerical solution of Equation 5.10 with the finite element method, using piecewise-linear shape functions and approximately 10^5 tetrahedral elements. The imaginary parts of the eigenvalues, taken in ascending order, yield the Helmholtz resonances (HR1, HR2, ...) of the VT air column in increasing order of frequency. We refer to Aalto et al. (2014) and Kivelä et al. (2013) for details of the implementation.
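The eigenvalue route of Equation 5.9 can be illustrated on a drastically simplified case: a 1D finite-difference analogue on a straight, uniform, hard-walled tube that is acoustically closed at the glottis and open (Φ = 0, as in Equation 5.10b) at the mouth. This is not the 3D FEM solver used here; the length, grid size and boundary treatment are illustrative, and the computed values simply approach the analytic quarter-wave resonances c(2n-1)/(4L).

```python
import numpy as np

c, L, N = 343.0, 0.17, 400          # speed of sound [m/s], tube length [m], grid cells
dx = L / N

# Second-difference Laplacian: node 0 is the closed (glottis) end with a
# zero-normal-velocity condition (half-cell stencil); the mouth node is
# eliminated by the Dirichlet condition Phi = 0 of Eq. 5.10b.
lap = (np.diag(-2.0 * np.ones(N)) +
       np.diag(np.ones(N - 1), 1) +
       np.diag(np.ones(N - 1), -1)) / dx**2
lap[0, 0] = -1.0 / dx**2

# Eigenvalues of -c^2 * Laplacian are the squared angular resonance frequencies.
omega_sq = np.linalg.eigvalsh(-c * c * lap)
print(np.sqrt(omega_sq[:4]) / (2.0 * np.pi))   # ~504, 1513, 2522, 3531 Hz
```

For this uniform 17 cm tube the lowest resonances sit near 500, 1500 and 2500 Hz, which is the familiar neutral-vowel pattern; the 3D FEM replaces the interval by the segmented air-column volume and adds the boundary conditions 5.10c-d.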
5.2.2 Webster Resonances

We also derive a 1D interpretation of Equation 5.8 in order to calculate what we will refer to as Webster resonances (WR). This allows us to investigate the acoustical differences between equations 5.3 and 5.10 independently of the dimensionality (1D vs. 3D) of the solution. To do so, we compute averages of the 3D solution over tube cross-sections. If Φ is a solution to Equation 5.8, then such an average is denoted by

\bar{\Phi}(s,t) = \frac{1}{A(s)}\int_{\Gamma(s)}\Phi\,dA    (5.11)

where the scalar s is the implicit parameter denoting points on the centre-line, and Γ(s) is the normal cross-section at s. For the sake of simplicity, this cross-section is considered to be circular, i.e., A(s) = πR(s)². The Webster equation is obtained after a lengthy derivation, as described by Lukkari and Malinen (2012):

\frac{1}{c^2\,\Sigma(s)^2}\,\frac{\partial^2\bar{\Phi}}{\partial t^2} + \frac{2\pi\alpha W(s)}{A(s)}\,\frac{\partial\bar{\Phi}}{\partial t} - \frac{1}{A(s)}\,\frac{\partial}{\partial s}\!\left(A\,\frac{\partial\bar{\Phi}}{\partial s}\right) = 0    (5.12)

Here Σ denotes the sound-speed correction factor, which depends on the curvature κ(s) of the VT, as follows:

\Sigma(s) = \left(1 + \tfrac{1}{4}\eta^2(s)\right)^{-1/2} \quad \text{with} \quad \eta(s) = R(s)\,\kappa(s)    (5.13a)

W(s) = R(s)\,\sqrt{R'(s)^2 + (\kappa(s) - 1)^2}    (5.13b)

By setting \bar{\Phi}_\lambda(s,t) = \mathrm{Re}\{\bar{\Phi}_\lambda(s)\,e^{i\lambda t}\} as before, the time-harmonic Webster equation can be written as follows:

\left(\frac{\lambda^2}{c^2}\,\frac{1}{\Sigma^2} + \lambda\,\frac{2\pi\alpha W}{A}\right)\bar{\Phi}_\lambda = \frac{1}{A}\,\frac{\partial}{\partial s}\!\left(A\,\frac{\partial\bar{\Phi}_\lambda}{\partial s}\right) \quad \text{on } [0, L]    (5.14a)

\lambda\bar{\Phi}_\lambda - c\,\bar{\Phi}'_\lambda = 0 \quad \text{at } s = 0    (5.14b)

\bar{\Phi}_\lambda = 0 \quad \text{at } s = L    (5.14c)

Note that equations 5.3 and 5.12 both solve the so-called Webster equation in 1D, but with different derivations: Equation 5.12 assumes that the centre-line is a smooth curve whose curvature affects the speed of sound in the VT, while Equation 5.3 assumes that the centre-line is piecewise linear. We refer to Kivelä (2015) for details of the implementation and parameter values.

The 1D approach suffers from ambiguity in the area functions, due to the lack of a uniquely definable centre-line for the 3D VT surface meshes. Different centre-lines have different acoustic lengths. This is a source of formant error that is not directly related to the Webster model itself, but to the geometry processing it requires. In an effort to better compare the Webster and Helmholtz resonances, we linearly scale the length of the VT centre-line so that the three lowest Webster resonances coincide, on average, with the corresponding Helmholtz resonances from the same VT configuration. We use only the three lowest resonances because they are purely longitudinal and, hence, accounted for by the Webster model. After scaling, the acoustic length is thus corrected based on the Helmholtz model. The resulting resonances (SR) are expected to remove the systematic discrepancy between the 1D and 3D acoustics.
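As a small illustration of the geometric quantities entering equations 5.12-5.14, the sketch below evaluates Σ(s) and W(s) of Equation 5.13 for a sampled centre-line. The function name, the toy tract profile, and the constant curvature value are assumptions for illustration; the actual implementation (Kivelä, 2015) couples these factors to the numerical solution of Equation 5.14.

```python
import numpy as np

def webster_corrections(s, R, kappa):
    """Evaluate the correction factors of Eq. 5.13 along a sampled centre-line.

    s     : (N,) arc-length samples along the centre-line [m]
    R     : (N,) equivalent circular radius, so that A(s) = pi * R(s)**2 [m]
    kappa : (N,) centre-line curvature at each sample [1/m]
    """
    eta = R * kappa                                  # eta(s) = R(s) * kappa(s)
    sigma = (1.0 + 0.25 * eta**2) ** -0.5            # Eq. 5.13a, sound-speed correction
    dR_ds = np.gradient(R, s)                        # R'(s)
    W = R * np.sqrt(dR_ds**2 + (kappa - 1.0) ** 2)   # Eq. 5.13b, dissipation weight
    return sigma, W

# toy tract: 17 cm long, gently curved, with a mild mid-tract constriction
s = np.linspace(0.0, 0.17, 50)
R = 0.012 - 0.004 * np.exp(-((s - 0.08) / 0.02) ** 2)
kappa = np.full_like(s, 8.0)                         # constant curvature [1/m]
sigma, W = webster_corrections(s, R, kappa)
print(sigma.min(), W.max())
```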
5.2.3 Results and Discussion

For the simulations in this section, we use static MRI images acquired with a Siemens Magnetom Avanto 1.5 T scanner. A 12-element Head Matrix Coil and a 4-element Neck Matrix Coil allow for the Generalized Autocalibrating Partially Parallel Acquisition (GRAPPA) acceleration technique. One speaker, a 26-year-old male, was imaged while he uttered four sustained Finnish vowels. The MRI data covers the vocal and nasal tracts, from the lips and nostrils to the beginning of the trachea, in 44 sagittal slices, with an in-plane resolution of 1.9 mm. Figure 5.6 shows the VT surface geometries extracted from the MRI data using an automated segmentation method (Aalto et al., 2013). Note the greater detail captured in these geometries compared to those extracted from dynamic MRI (in Section 5.1), especially below the epiglottis, at the posterior pharyngeal side branches, and between the teeth.

Figure 5.6. VT geometries extracted from MRI data for the vowels /a/, /e/, /o/ and /i/ (Aalto et al., 2013).

Figure 5.7 shows the first two formant/resonance frequencies computed for the four Finnish vowels. Webster formants (WF) are calculated by solving Equation 5.3, as suggested by Doel and Ascher (2008). Helmholtz (HR) and Webster resonances (WR) are obtained from equations 5.10 and 5.14, respectively (Aalto et al., 2014). SR denotes the scaled version of WR, as described in Subsection 5.2.2. The figure also includes the formant frequencies (AF) computed from audio signals recorded in an anechoic chamber (Aalto et al., 2014). The values are averaged over 10 repetitions of each vowel utterance.

Figure 5.7. Simulation results for the first and second formant/resonance frequencies (in Hz) of the vowels /i/, /e/, /o/ and /a/: Helmholtz resonances (HR), Webster resonances (WR) and their scaled version (SR), Webster formants (WF), and formants from the audio signal (AF).

As we can see in Figure 5.7, the resonance values (HR, WR and SR) lie close together for the vowels /i/ and /e/, with SR being closer to HR, as expected. For the vowels /o/ and /a/ there is more difference between the first resonances of HR and WR; for /o/, although SR lies closer to HR, its first resonance is surprisingly low. For all of the vowels in Figure 5.7, the second formant of the audio is lower than the computed results. This finding contradicts our results in Table 5.1, where the second Webster formants from the cine MRI of /i/ are consistently lower than the values from the audio signals. The vowel /i/ is expected to be very sensitive to the glottal end position, which, in turn, suggests the importance of adequate MRI resolution and accurate geometry processing for its spectral analysis.

Interestingly, the Webster formants (WF) remain closer to the audio formants (AF) than any of the resonances in the case of /i/, /e/, and /a/. For /o/, the distance to AF is almost equal for WF and HR, with both having similar values for the second formant/resonance; however, the first HR is lower, and the first WF is higher, than the first AF.

The time-domain Webster analysis (Doel and Ascher, 2008) accounts for the VT wall-vibration phenomenon that is missing in the resonance analysis. This is done by substituting A(x, t) in Equation 5.3 with A(x, t) + C(x, t) y(x, t), where C(x, t) is the slowly varying circumference and y(x, t) is the wall displacement, governed by a damped mass-spring system. Setting y(x, t) to zero, the Webster formants move along the arrows in Figure 5.7, their first formants decreasing. This moves WF closer to HR, as both acoustical models then ignore wall vibration. Meanwhile, WF moves away from the audio formants in the case of /i/, /e/, and /a/. The distance between WR and WF remains large, despite the fact that both acoustical models solve the Webster equation. The results imply that the 3D Helmholtz analysis is more realistic than its 1D Webster counterpart, as expected.

Overall, our experiments suggest that the time-domain interpretation of the acoustic equations provides more realistic results, even if it requires reducing from 3D to 1D. This may be partially due to the fact that time-domain analysis allows for more complexity in the acoustical model, such as the inclusion of lip radiation and wall loss. Certainly, unknown parameters always remain (such as those involved in the glottal flow, the coupling between fluid mechanics and acoustical analysis, etc.), which are estimated indirectly, based on observed behaviour in simulations.

It should be noted that our experiments are solely based on data from a single speaker.
A larger database – inclusive of more speakers from different genders and languages – is needed in order to confirm the validity/generality of our findings.

5.3 Conclusions

In this chapter, we have investigated the acoustics of speech production. In Section 5.1 we enable acoustical synthesis of the vowel phonemes, based on data-driven simulations of our speaker-specific biomechanical models. We model the VT as an air-tight skin mesh that deforms along with the articulators. We then use a standard acoustic synthesizer (Doel and Ascher, 2008) that generates sound based on a 1D representation of the VT geometries. Comparison of the formant frequencies of the generated sounds with those from the audio signals and cine MRI data reveals discrepancies. We speculate that the low contrast and resolution of the dynamic MR images contribute to these results. In addition, our deformed VT often suffers from a mismatch with the images in the regions around the epiglottis, soft palate, and lips (whose biomechanical models were not included).

In Section 5.2 we look at sustained vowels in order to experiment with a 3D, frequency-based analysis of the VT (Aalto et al., 2014). We mainly seek to investigate whether such an analysis (performed in 3D) would yield more accurate results than the 1D Doel-Ascher synthesizer. This time, we use the VT surface geometries extracted from high-resolution static MRI of one Finnish speaker uttering four sustained vowel phonemes. Our results suggest an advantage of time-domain over frequency-domain analysis, as the Webster formants (Doel and Ascher, 2008) lie closer to the formants of the audio signals than the Helmholtz resonances (Aalto et al., 2014).

We believe in the effective coupling of biomechanical and acoustical models of the oropharynx, in order to provide a better understanding of speech production. A system capable of providing a mapping from activation units to articulatory movements and sound would greatly assist with speech rehabilitation planning. To accomplish this, several challenges on each side should be addressed. The speaker-specific biomechanical models should include the epiglottis, velum, and face, to produce accurate VT deformations. In addition, the accuracy of the models (and their simulations) depends highly on the resolution and contrast of the dynamic medical images. On the acoustical side, 3D time-domain analysis of the VT is still impractical, but greatly anticipated, for real-time simulations. In addition, biomechanical and acoustical methods should move beyond generic parameter-tuning to meet the needs of speaker-specific models. This necessitates further advances in data measurement from the oropharyngeal structures; one example is audio and image recording from the subglottal area, in order to enable synthesis of consonant phonemes. A high-resolution MRI depiction of muscle fibers could also be used to adjust the tongue model, especially for pathological anatomy.

Chapter 6
Conclusions

This dissertation develops methods for subject-specific modeling and simulation of the oropharynx, with direct applications to speech research. Driven by a clinical need for a better understanding of speech biomechanics, we enable investigation of inter- and intra-speaker variability in speech production, based on medical imaging data.
To do so, we incorporate, develop and evaluate methods that address several challenges in data processing, mesh generation, data-driven simulation, and acoustical modeling.

In Chapter 3 we first design a real-time, click-based expert-interaction scheme for a mesh-to-image registration method, in order to delineate the tongue surface from volumetric MR images. To accomplish this, we use a shape-matching algorithm driven by gradients of intensity profiles in the direction normal to the surface mesh. Confined to the segmented surface, we then generate a hexahedral-dominant mesh that bears the desired spatial resolution and mesh quality. Using the Mesh-Match-and-Repair registration method, we augment our mesh with the biomechanical information embedded in a standard, generic tongue model, while permitting adjustments to the muscle fiber definition. We finally couple our FE tongue models to rigid-body models of the mandible, maxilla and hyoid, and successfully perform forward simulations that comply with the literature.

In Chapter 4 we perform data-driven simulations of our subject-specific models in order to investigate inter- and intra-speaker variability in muscle activation patterns. We drive our inverse simulations using tissue trajectories extracted from tagged and cine MRI data (subject to damping and regularization constraints that compensate for muscle redundancy and system instability). This is the first time that such comprehensive, quantified tissue-point motion has been used to drive an oropharyngeal biomechanical model. Our simulations lead to a novel interpretation of the data itself, by distinguishing between active and passive muscle shortening, and by identifying the co-contraction of antagonist muscles where no regional shortening is observed.

Next, we generate sound by coupling our biomechanical models with a standard 1D acoustic synthesizer. A vocal tract skin mesh translates the motion of the articulators into deformations of an airtight airway. Such deformations update the extracted centre-line and area profiles, and alter the synthesized sound accordingly. Using the recorded audio as our ground truth, we extend our experiments to 3D methods, and identify the sources of inaccuracy in the computed formant frequencies. Our attempts show promise in linking acoustical and biomechanical models for speech research.

6.1 Concluding Remarks

By moving from generic to speaker-specific models, we fill a gap that exists between oropharyngeal medical images and biomechanical models of articulatory motion. We demonstrate that the current technology for super-resolution reconstruction of cine and tagged MRI provides adequate information for building, and performing data-driven simulation of, speaker-specific oropharyngeal models. We carry out this transition (from data to individualized models) by facilitating MRI processing and FE meshing. In particular, we demonstrate a significant improvement in time efficiency using our tongue segmentation method, and allow adjustability in the mesh configurations of our FE tongue models.

Our inverse simulations prove beneficial in investigating motor control variability across speakers. Looking at the laminal and apical /s/ in four healthy English speakers, we find dominant use of the genioglossus muscle, with its posterior portion being more active than the anterior when /s/ follows /i/. The reverse is true when /s/ precedes /u/.
We also find the anterior digastric muscle, as well as the inferior-lateral pterygoid muscle, to be the key active muscles of the jaw and hyoid during /s/. Our results do not indicate any substantial difference between the activation patterns of apical and laminal /s/ types, supporting the hypothesis that the associated variations in tongue shape are more attributable to palate constraints. In addition, we are able to verify theories of motor control that suggest multiple functionally distinct segments throughout key muscles of the tongue. Our results show variety in activation mainly through the genioglossus, but less through the verticalis and transversus muscles. Such speaker-specific simulations provide abundant opportunities for testing, modifying, and validating theories of speech strategy across the population.

Finally, we introduce a platform to generate sound based on the predicted biomechanics. Accurate and realistic synthesis of speech phonemes proves challenging, and falls outside the scope of this study; nevertheless, our attempts set the stage for a unified framework in which the articulators are driven (based on the medical data) to generate a particular sound. We show that synthesized vowels suffer from ambiguity in the vocal tract representation for 1D acoustical models, and require higher MRI resolution for 3D methods. The results also show sensitivity to the generic – and often conventional – parameter-tuning performed for acoustical models in order to generate a realistic sound.

6.2 Future Work

I suggest potential improvements and future work at the end of each chapter, but would like to highlight a few directions from a global perspective here. To improve our subject-specific modeling and simulation framework, I recommend including models of other oropharyngeal organs, such as the velum, uvula, epiglottis and lips (upon which the shape of the vocal tract depends). Due to the small size of these organs, such modeling would require MR images of higher resolution and contrast. For larger-scale models, such as the face, I recommend implementing FE registration methods that work directly with MR images (rather than surface meshes), to eliminate the burden of segmentation.

Our modeling and simulation framework would certainly benefit from a more comprehensive set of medical data. CT images remove the complexity of bone segmentation; jaw optical tracking can increase the reliability of inverse simulations. Other biomechanical measurements, such as maximum jaw exertion force, could help with tuning each speaker-specific model. A more in-depth post-processing of the tagged MRI data could enable tracking of each individual muscle in the tongue. The results might be used for validation of the inverse simulations, or be incorporated into a muscle-based (as opposed to a point-based) inverse simulation scheme, ensuring a more meaningful averaging of the tagged MRI tracking data. In addition, higher-resolution MRI is deemed necessary for enabling 3D acoustical analysis.

The vocal tract skin mesh, which is used to link our biomechanical models to their acoustical counterparts, could benefit from methods that allow changes in the topology of the airway (such as shortening or discontinuity).
This feature might come in handy where biomechanical models of the teeth and face are present, and could prove essential for modeling highly deformed or, at times, isolated air cavities in the mouth.

Lastly, I recommend that a generic tongue model incorporate a higher-resolution representation of muscle fibers, using fibers extracted from digitization of cadaver tissue. Such a model, when adapted to speaker data, would require faster computation methods. Experiments with mesh-less elastic models, instead of FE, could reduce simulation time; however, some technical difficulties (such as ambiguity in the definition of a unique attachment site) need to be addressed accordingly. It would also be beneficial to compare the accuracy of such simulations against FE models and image data in speech production.

Bibliography

Aalto D, et al. 2014. Large scale data acquisition of simultaneous MRI and speech. J Appl Acoust. 83:64–75.
Aalto D, et al. 2013. Algorithmic Surface Extraction from MRI Data - Modelling the Human Vocal Tract. Proceedings of the 6th International Joint Conference on Biomedical Engineering Systems and Technologies; Barcelona, Spain.
Aalto D, et al. 2012. How far are vowel formants from computed vocal tract resonances? arXiv:1208.5963.
Alliez P, Cohen-Steiner D, Yvinec M, Desbrun M. 2005. Variational tetrahedral meshing. ACM Trans Graph. 24(3):617–625.
Alvey C, Orphanidou C, Coleman J, McIntyre A, Golding S, Kochanski G. 2008. Image quality in non-gated versus gated reconstruction of tongue motion using magnetic resonance imaging: a comparison using automated image processing. Int J Comput Assist Radiol Surg. 3(5):457–464.
Anderson FC, Pandy MG. 2001. Static and dynamic optimization solutions for gait are practically equivalent. J Biomech. 34(2):153–161.
Arnela M, Guasch O. 2014. Three-dimensional behavior in the numerical generation of vowels using tuned two-dimensional vocal tracts. Proceedings of the 7th Forum Acusticum; Krakow, Poland.
Badin P, Bailly G, Reveret L, Baciu M, Segebarth C, Savariaux C. 2002. Three-dimensional linear articulatory modelling of tongue, lips and face, based on MRI and video images. J Phonetics. 30(3):533–553.
Baer T, Alfonso PJ, Honda K. 1988. Electromyography of the tongue muscles during vowels in /apVp/ environment. Ann Bull RILP. 22:7–19.
Baghdadi L, Steinman DA, Ladak HM. 2005. Template-based finite-element mesh generation from medical images. Comput Meth Programs Biomed. 77(1):11–21.
Bai Y, Han X, Prince JL. 2004. Super-resolution reconstruction of MR brain images. Proceedings of the 38th Annual Conference on Information Sciences and Systems; Princeton, New Jersey, USA.
Benade AH, Jansson EV. 1974. On Plane and Spherical Waves in Horns with Nonuniform Flare I. Theory of Radiation, Resonance Frequencies, and Mode Conversion. ACTA ACUST UNITED AC. 31(2):80–98.
Birkholz P, Jackl D, Kroger BJ. 2007. Simulation of losses due to turbulence in the time-varying vocal system. IEEE Trans Audio Speech Language Process. 15(4):1218–1226.
Birkholz P. 2013. Modeling Consonant-Vowel Coarticulation for Articulatory Speech Synthesis. PLoS ONE 8(4):e60603. doi:10.1371/journal.pone.0060603.
Blemker SS, Pinsky PM, Delp SL. 2005. A 3D model of muscle reveals the causes of nonuniform strains in the biceps brachii. J Biomech. 38(4):657–665.
Boersma P, Weenink D. 2005. Praat: doing phonetics by computer (Version 4.3.01) [Computer program]. Retrieved from http://www.praat.org.
Bresch E, Narayanan S. 2009. Region segmentation in the frequency domain applied to upper airway real-time magnetic resonance images.
IEEE TransMed Imag. 28(3):323–338.Bresson X, Vandergheynst P, Thiran JP. 2006. A variational model for objectsegmentation using boundary information and shape prior driven by theMumford-Shah functional. Int J Comput Vis. 68(2):145–162.Buchaillard S, Perrier P, Payan Y. 2009. A biomechanical model of cardinalvowel production: Muscle activations and the impact of gravity on tonguepositioning. J Acoust Soc Am. 126:2033–2051.Bucki M, Lobos C, Payan Y. 2010. A fast and robust patient specific finiteelement mesh registration technique: application to 60 clinical cases. MedImage Anal. 13(3):303-317.109BibliographyBucki M, Lobos C, Payan Y, Hitschfeld N. 2011. Jacobian-based re-pair method for finite element meshes after registration. Eng Comput.27(3):285–297.Cachier P, Ayache N. 2001. Regularization in image non-rigid registration:I. Trade-off between smoothness and intensity similarity. Technical report,INRIA.Chen M, Lu W, Chen Q, Ruchala KJ, Olivera GH. 2008. A simple fixed-pointapproach to invert a deformation field. Med Phys. 35(1): 81–88.Chuanjun C, Zhiyuan Z, Shaopu G, Xinquan J, Zhihong Z. 2002. Speechafter partial glossectomy: a comparison between reconstruction and non-reconstruction patients. J Oral Maxillofac Surg. 60(4):404–407.Cootes TF, Taylor CJ, Cooper DH, Graham J. 1995. Active shape models:Their training and application. Comput Vis Image Understand. 61(1):38–59.Couteau B, Payan Y, Lavallee S. 2000. The mesh-matching algorithm: anautomatic 3D mesh generator for finite element structures. J Biomech.33(8):1005–1009.Cveticanin L. 2012. Review on mathematical and mechanical models of thevocal cord. J Appl Math. 2012: 1–18.Dang J, Honda K. 2004. Construction and control of a physiological articu-latory model. J Acoust Soc Am. 115:853-870.Dart S. 1991. Articulatory and acoustic properties of apical and laminalarticulations. UCLA Working Papers in Phonetics 79.Doel K van den, Vogt F , English RE, Fels S. 2006. Towards ArticulatorySpeech Synthesis with a Dynamic 3D Finite Element Tongue Model. Pro-ceeding of the 7th Intentional Seminar on Speech Production; Ubatuba,Brazil.Doel K van den, Ascher UM. 2008. Real-time numerical solution of Web-ster’s equation on a non-uniform grid. IEEE Trans Audio Speech LangProcessing. 16:1163–1172.110BibliographyDrake R, Vogl AW, Mitchell AW. 2010. Gray’s anatomy for students.Churchill Livingstone, Elsevier Inc., Philadelphia, PA, 2nd edition.Dubuisson MP, Jain AK. 1994. A modified Hausdorff distance for objectmatching. Proceedings of the 12th IEEE International Conference on Pat-tern Recognition; Jerusalem, Israel.Engwall O. 2003. Combining MRI, EMA and EPG measurements in a three-dimensional tongue model. Speech Comm. 41(2):303–329.Engwall O. 2003. A revisit to the Application of MRI to the Analysis ofSpeech Production-Testing our assumptions. Proceedins of 6th Interna-tional Seminar on Speech Production; Sydney, Australia.Erdemir A, McLean S, Herzog W, van den Bogert AJ. 2007. Model-basedestimation of muscle forces exerted during movements. Clin Biomech.22(2):131-154.Eryildirim A, Berger MO. 2011. A guided approach for automatic segmenta-tion and modeling of the vocal tract in MRI images. Proceedings of 19thEuropean Signal Processing Conference; Barcelona, Spain.Fang Q, Fujita S, Lu X, Dang J. 2009. A model-based investigation of ac-tivations of the tongue muscles in vowel production. Acoust Sci Tech.30(4):277–287.Fant G. 1960. Acoustic theory of speech production. The Hague, Netherlands:Mouton.Flanagan JL, Landgraf LL. 1968. 
Self-oscillating source for vocal-tract syn-thesizers. IEEE Trans Audio Electroacoust. 16(1): 57–64.Freedman D, Zhang T. 2005. Interactive graph cut based segmentation withshape priors. Proceeding of the IEEE Computer Society Conference onComputer Vision and Pattern Recognition 2005; San Diego, USA.Freitag L, Plassmann P. 2000. Local optimization-based simplicial mesh un-tangling and improvement. Int J Numer Meth. Eng 49(12):109125.Foulonneau A, Charbonnier P, Heitz F. 2009. Multi-reference shape priorsfor active contours. Int J Comput Vis. 81(1):68–81.111BibliographyFowler CA, Saltzman E. 1993. Coordination and coarticulation in speechproduction. Lang speech. 36(2-3): 171–195.Fujii N, et al.. 2011. Evaluation of swallowing using 320-detector-row mul-tislice CT. Part I: single-and multiphase volume scanning for three-dimensional morphological and kinematic analysis. Dysphagia. 26(2): 99–107.Gaige TA, Benner T, Wang R, Wedeen VJ, Gilbert RJ. 2007. Three dimen-sional myoarchitecture of the human tongue determined in vivo by diffusiontensor imaging with tractography. J Magn Reson Imaging. 26(3):654–661.Gentil M, Gay T. 1986. Neuromuscular specialization of the mandibularmotor system: speech versus non-speech movements. Speech Commun.5(1):69–82.George PL, Borouchaki H, Laug P. 2002. An efficient algorithm for 3D adap-tive meshing. Adv Eng Softw. 33(7): 377–387.Gerard JM, Wilhelms-Tricarico R, Perrier P, Payan Y. 2006. A 3D dynamicalbiomechanical tongue model to study speech motor control. Recent ResDevelop Biomech. 1:49–64.Gilles B, Pai D. 2008. Fast musculoskeletal registration based on shapematching. Proceedings of the 11th International Conference on MedicalImage Computing and Computer Assisted Intervention. New York, USA.Glupker L, Kula K, Parks E, Babler W, Stewart K, Ghoneima A. 2015. Three-dimensional computed tomography analysis of airway volume changes be-tween open and closed jaw positions. Am J Orthod Dentofacial Orthop.147(4): 426–434Grady, L. 2006. Random walks for image segmentation. IEEE Trans PatternAnal Mach Intell. 28(11):1768–1783.Grosland NM, Bafna R, Magnotta VA. 2009. Automated hexahedral mesh-ing of anatomic structures using deformable registration. Comput MethodBiomech. 12(1): 35–43.Hannam AG, Stavness I, Lloyd JE, Fels S. 2008. A dynamic model of jawand hyoid biomechanics during chewing. J Biomech. 41(5):1069-1076.112BibliographyHeimann T, Mu¨nzing S, Meinzer HP, Wolf I. 2007. A shape-guided de-formable model with evolutionary algorithm initialization for 3D soft tissuesegmentation. Proceedings of 20th International Conference on Informa-tion Processing in Medical Imaging. Kerkrade, The Netherlands.Heimann T, Meinzer HP. 2009. Statistical shape models for 3D medical imagesegmentation: A review. Med Image Anal. 13(4):543–563.Hillenbrand J, Getty LA, Clark MJ, Wheeler K. 1995. Acoustic characteris-tics of American English vowels. J Acoust Soc Am. 97(5): 3099-3111.Hillenbrand J. 2003. American English: Southern Michigan. J Int Phon As-soc. 33: 121–126.Ho AK, et al. 2014. 3D dynamic visualization of swallowing from multi-slice computed tomography. ACM SIGGRAPH 2014 Posters; Vancouver,Canada.Hughes TJR. 2000. The finite element method: linear static and dynamicfinite element analysis. Dover Publications.Iltis PW, Frahm J, Voit D, Joseph AA, Schoonderwaldt E, Altenmu¨llerE. 2015. High-speed real-time magnetic resonance imaging of fast tonguemovements in elite horn players. Quant Imaging Med Surg. 5(3): 374–381.Inamoto Y, et al.. 2015. 
Anatomy of the larynx and pharynx: effects of age,gender and height revealed by multidetector computed tomography. J OralRehabil.Ishizaka K, Flanigan JL. 1972. Synthesis of voiced sounds from a two-massmodel of the vocal cords. J Bell Syst Tech. 51: 1233–1268.Jacewicz E, Fox R A. 2013. Cross-dialectal differences in dynamic formantpatterns in American English vowels. In Vowel inherent spectral change.Springer Berlin Heidelberg; pp. 177–198.B. Joe. 2008. Shape measures for quadrilaterals, pyramids, wedges, and hex-ahedra. Technical Report. Retrieved from http://members.shaw.ca/bjoe.Kawakami S, et al.. 2012. Mechanomyographic activity in the human lat-eral pterygoid muscle during mandibular movement. J Neurosci Meth.203(1):157–162.113BibliographyKelly KL, Lochbaum CC. 1962. Speech Synthesis. Proceeding of Fourth In-ternational Congress on Acoustics; Copenhagen, Denmark.Kerwin W, Osman N, Prince J. 2000. Image processing and analysis in taggedcardiac MRI. In: Bankman I, editor. Handbook of Medical Imaging, chap-ter 24. Academic Press; p. 375–391.Keyak JH, Meagher JM, Skinner HB, Mote CDJ. 1990. Automated three-dimensional finite element modelling of bone: a new method. J BiomedEng. 12(5):389–397.Kim YC, Hayes CE, Narayanan SS, Nayak KS. 2011. Novel 16channel receivecoil array for accelerated upper airway MRI at 3 Tesla. J Magn Reson.65(6):1711–1717.Kim YC, Proctor MI, Narayanan SS, Nayak KS. 2011. Visualization of VocalTract Shape Using Interleaved Real-Time MRI of Multiple Scan Planes.Proceeding of 12th Annual Conference of the International Speech Com-munication Association; Florence, Italy.Kitamura, et al. 2005. Difference in vocal tract shape between upright andsupine postures: Observations by an open-type MRI scanner. Acoust scitechnol. 26(5):465–468.Kivela¨ A. 2015. Acoustics of the vocal tract: MR image segmentation formodelling, Master’s thesis, Aalto University School of Science.Kivela¨ A, Kuortti J, Malinen J. 2013. Resonances and mode shapes of thehuman vocal tract during vowel production. Proceedings of 26th nordicseminar on computational mechanics; Oslo, Norway.Knupp PM. 2000. Achieving finite element mesh quality via optimization ofthe Jacobian Matrix norm and associated quantities. Int J Numer MethEng. 48(8):1165–1185.Ladefoged P. 2001. Vowels and consonants. Phonetica 58:211-212.Larkman DJ, Nunes RG. 2007. Parallel magnetic resonance imaging. PhysMed Biol. 52(7):R15.114BibliographyLee J, Woo J, Xing F, Murano EZ, Stone M, Prince JL. 2013. Semi-automaticsegmentation of the tongue for 3D motion analysis with dynamic MRI.Proceedings of the 10th IEEE International Symposium on BiomedicalImaging; San Francisco, USA.Leventon ME, Grimson WEL, Faugeras O. 2000. Statistical shape influence ingeodesic active contours. Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition 2000; Hilton Head, USA.Liu X, Abd-Elmoniem K, Stone M, Murano E, Zhuo J, Gullapalli R, PrinceJL. 2012.Incompressible Deformation Estimation Algorithm (IDEA) fromTagged MR Images. IEEE Trans. Med. Imaging 31(2): 326-340.Livesu M, Sheffer A, Vining N, Tarini M. 2015. Practical hex-mesh optimiza-tion via edge-cone rectification. ACM Trans Graphic. 34(4): 141.Lloyd JE, Stavness I, Fels S. 2012. ARTISYNTH: A fast interactive biome-chanical modeling toolkit combining multibody and finite element simula-tion. In: Payan Y, editor. Soft Tissue Biomechanical Modeling for Com-puter Assisted Surgery. Springer Berlin Heidelberg; p. 355–394.Lobos C. 2012. 
A set of mixed-elements patterns for domain boundary ap-proximation in hexahedral meshes. Stud Health Technol Inform. 184:268–272.Lobos C, Bucki M, Hitschfeld N, Payan Y. 2007. Mixed-element mesh foran intra-operative modeling of the brain tumor extraction. Proceedings of16th International Meshing Roundtable; Seattle, USA.Luboz V, Chabanas M, Swider P, Payan Y. 2005. Orbital and maxillofacialcomputer aided surgery: patient-specific finite element models to predictsurgical outcomes. Comput Methods Biomech Biomed Eng. 8(4):259–265.Lukkari T, Malinen J. 2012. Webster’s equation with curvature and dissipa-tion. arXiv preprint arXiv:1204.4075.Ma SYL, Whittle T, Descallar J, Murray GM, Darendeliler M A, Cistulli P,Dalci O. 2013. Association between resting jaw muscle electromyographicactivity and mandibular advancement splint outcome in patients with ob-structive sleep apnea. Am J Orthod Dentofacial Orthop. 144(3): 357–367.115BibliographyMac Neilage PF, Sholes GN. 1964. An electromyographic study of the tongueduring vowel production. J SPEECH LANG HEAR R. 7(3): 209–232.Martins JAC, Pires EB, Salvado R, Dinis PB. 1998. A numerical modelof passive and active behavior of skeletal muscles. COMPUT METHODAPPL M. 151(3):419–433.Mijailovich SM, Stojanovic B, Kojic M, Liang A, Wedeen VJ, Gilbert RJ.2010. Derivation of a finite-element model of lingual deformation duringswallowing from the mechanics of mesoscale myofiber tracts obtained byMRI. J Appl Phys. 109(5):1500–1514.Miyawaki O, Hirose H, Ushijima T, Sawashima M. 1975. A preliminary reporton the electromyographic study of the activity of lingual muscles. Ann BullRILP. 9(91):406.Montagnat J, Delingette H, Scapel N, Ayache N. 2000. Representation,shape, topology and evolution of deformable surfaces: Application to 3Dmedical image segmentation. Technical Report, INRIA.Mory B, Somphone O, Prevost R, Ardon R. 2012. Real-Time 3d image seg-mentation by user-constrained template deformation. Proceedings of the15th International Conference on Medical Image Computing and Com-puter Assisted Intervention; Nice, France.Mu L, Sanders I. 2000. Neuromuscular specializations of the pharyngeal dila-tor muscles: II. Compartmentalization of the canine genioglossus muscle.Anat Rec. 260(3):308–325.Muller M, Heidelberger B, Teschner M, Gross M. 2005 July. Meshless defor-mations based on shape matching. In ACM Trans Graph. 24(3):471–478.Murano EZ, Shinagawa H, Zhuo J, Gullapalli RP, Ord RA, Prince JL, StoneM. 2010. Application of diffusion tensor imaging after glossectomy. Oto-laryngol Head Neck Surg. 143(2):304–306.Murray GM. 2012. The lateral pterygoid muscle: function and dysfunction.Semin Orthod. 18(1):44–50.Narayanan S et al.2014. Real-time magnetic resonance imaging and electro-magnetic articulography database for speech production research (TC). JAcoust Soc Am. 136(3): 1307–1311.116BibliographyNealen A, Mller M, Keiser R, Boxerman E, Carlson M. 2006. Physicallybased deformable models in computer graphics. In Comput Graph Forum.25(4):809–836.NessAiver MS, Stone M, Parthasarathy V, Kahana Y, Paritsky A. 2006.Recording high quality speech during tagged cineMRI studies using a fiberoptic microphone. J Magn Reson. 23(1):92–97.Ohala JJ. 1993. Coarticulation and phonology. Lang Speech. 36(2-3): 155–170.O’Kusky JR, Norman MG. 1995. Sudden infant death syndrome: increasednumber of synapses in the hypoglossal nucleus. J Neuropath Exp Neur. 54:627–634.Osman NF, McVeigh ER, Prince JL. 2000. Imaging Heart Motion UsingHarmonic Phase MRI. IEEE Trans Med Imaging. 
19(3): 186–202.Peled S, Yeshurun Y. 2001. Superresolution in MRI: application to humanwhite matter fiber tract visualization by diffusion tensor imaging. MagnReson Med. 45(1):29–35.Peng T, Kerrien E, Berger MO. 2010. A shape-based framework to segmen-tation of tongue contours from MRI data. Proceedings of the IEEE In-ternational Conference on Acoustics Speech and Signal Processing 2010;Dallas, USA.Perrier P, Payan Y, Zandipour M, Perkell J. 2003. Influences of tongue biome-chanics on speech movements during the production of velar stop conso-nants: A modeling study. J Acoust Soc Am. 114(3):1582–1599.Peterson GE, Barney HL. 1952. Control methods used in a study of thevowels. J Acoust Soc Am. 24(2):175–184.Plenge E, Poot DHJ, Bernsen M, Kotek G, Houston G, Wielopolski P, WeerdLVD, Niessen WJ, Meijering E. 2012. Super-resolution in MRI: better im-ages faster?. Proceedings of SPIE Confetrence on Medical Imaging: ImageProcessing 2012. San Diego, USA.Reichard R, Stone M, Woo J, Murano E, Prince J. 2012. Motion of apical andlaminal /s/ in normal and post-glossectomy speakers. Acoustical Societyof America, 3346.117BibliographyRoberts TJ, Gabaldn AM. 2008. Interpreting muscle function from EMG:lessons learned from direct measurements of muscle force. Integr CompBiol. 48(2): 312–320.Saddi KA, Rousson M, Chefdhotel C, Cheriet F. 2007. Global-to-local shapematching for liver segmentation in CT imaging. Proceedings of the 10thInternational Conference on Medical Image Computing and Computer As-sisted Intervention; Brisbane, Australia.Sa´nchez CA, Stavness I, Lloyd JE, Fels S. 2013. Forward dynamics trackingsimulation of coupled multibody and finite element models: Applicationto the tongue and jaw. Proceedings of the 11th International Symposiumon Computer Methods in Biomechanics and Biomedical Engineering; SaltLake City, USA.Sa´nchez CA, Lloyd JE, Fels S, Abolmaesumi P. 2014. Embedding digitizedfibre fields in finite element models of muscles. Comput Methods BiomechBiomed Eng: Imaging Vis. 2(4):223–236.Scherer RC, Titze IR, Curtis JF. 1983. Pressureflow relationships in twomodels of the larynx having rectangular glottal shapes. J Acoust Soc Am.73(2): 668–676.Sharp GC, Lee SW, Wehe DK. 2002. ICP registration using invariant fea-tures. IEEE Trans Pattern Anal 24(1):90–102.Shepherd J. 2007. Topologic and Geometric Constraint-Based HexahedralMesh Generation. Doctoral Dissertation, University of Utah.Sherif MH, Gregor RJ, Liu LM, Roy RR, Hager CL. 1983. Correlation ofmyoelectric activity and muscle force during selected cat treadmill loco-motion. J Biomech. 16(9):691–701Shimada Y, Nishimoto H, Kochiyama T, Fujimoto I, Mano H, Masaki S,Murase K. 2012. A technique to reduce motion artifact for externally trig-gered cine-MRI (EC-MRI) based on detecting the onset of the articulatedword with spectral analysis. Magn Reson Med Sci. 11(4):273–282.Sifakis E, Neverov I, Fedkiw R. 2005. Automatic determination of facialmuscle activations from sparse motion capture marker data. ACM TransGraph. 24(3)–417-425.118BibliographySigal IA, Hardisty MR, Whyne CM. 2008. Mesh-morphing algorithms forspecimen-specific finite element modeling. J Biomech. 41(7):1381–1389.Silva MP, Ambrsio JA. 2004. Sensitivity of the results produced by the inversedynamic analysis of a human stride to perturbed input data. Gait Posture.19(1): 35–49.Slaughter K, Li H, Sokoloff AJ. 2005. Neuromuscular Organization of the Su-perior Longitudinalis Muscle in the Human Tongue. Cells Tissues Organs.181:51–64.Smith A, Goffman L. 1998. 
Stability and patterning of speech movementsequences in children and adults. J Speech Lang Hear Res. 41(1):18–30.Sokoloff AJ, Deacon TW. 1992. Musculotopic organization of the hypoglossalnucleus in the cynomolgus monkey, Macaca fascicularis. J Comp Neurol324: 81-93.Sondhi MM, Schroeter J. 1987. A hybrid time-frequency domain articulatoryspeech synthesizer. IEEE Trans Acoust Speech Sinal Process. 35(7): 955–967.Stavness I. 2010. Byte your tongue: A computational model of humanmandibular-lingual biomechanics for biomedical applications. DoctoralDissertation, University of British Columbia.Stavness I, Lloyd JE, Payan Y, Fels S. 2011. Coupled hard-soft tissue simu-lation with contact and constraints applied to jaw-tongue-hyoid dynamics.Int J Numer Method Biomed Eng. 27(3):367–390.Stavness I, Lloyd JE, Fels S. 2012. Automatic Prediction of Tongue MuscleActivations using a Finite Element Model. J Biomech. 45(16):2841–2848.Stavness I, Gick B, Derrick D, Fels S. 2012. Biomechanical modeling of En-glish/r/variants. J Acoust Soc Am. 131(5):EL355–EL360.Stavness I, Nazari MA, Flynn C, Perrier P, Payan Y, Lloyd JE, Fels S.2014a. Coupled Biomechanical Modeling of the Face, Jaw, Skull, Tongue,and Hyoid Bone. In: Magnenat-Thalmann N, Ratib O, Choi HF, editors.3D Multiscale Physiological Human. Springer London. p. 253–274.119BibliographyStavness I et al.2014b. Unified Skinning of Rigid and Deformable Models forAnatomical Simulations. Proceeding of ACM SIGGRAPH Asia; Shenzhen,China.Stone M, Epstein MA, IskarousK. 2004. Functional segments in tongue move-ment. Clin Linguist Phons. 18(6):507–521.Stone M, Rizk S, Woo J, Murano EZ, Chen H, Prince JL. 2012. Frequencyof apical and laminal /s/ in normal and post-glossectomy patients. J MedSpeech Lang Pathol. 20(4):106–111.Story BH, Titze IR. 1995. Voice simulation with a bodycover model of thevocal folds. J Acoust Soc Am. 97(2): 1249–1260.Somphone O, Mory B, Makram-Ebeid S, Cohen L. 2008. Prior-BasedPiecewise-Smooth Segmentation by Template Competitive DeformationUsing Partitions of Unity. Proceedings of the 10th European Conferenceon Computer Vision; Marseille, France.Sotiras A, Christos D, Paragios N. 2012. Deformable Medical Image Regis-tration: A Survey. Technical Report, INRIA.Takano S, Honda K. 2007. An MRI analysis of the extrinsic tongue musclesduring vowel production. Speech Comm 49(1): 49–58.Takemoto, H. 2001. Morphological analyses of the human tongue musculaturefor three-dimensional modeling. J Speech Lang Hear R 44(1):95–107.Takemoto H, Mokhtari P, Kitamura T. 2014. Comparison of vocal tracttransfer functions calculated using one-dimensional and three-dimensionalacoustic simulation methods. Proceeding of 15th Annual Conference of theInternational Speech Communication Association; Singapore, Singapore.Teo JCM, Chui CK, Wang ZL, Ong SH, Yan CH, Wang SC, Wong HK, TeohSH. 2007. Heterogeneous meshing and biomechanical modeling of humanspine. Med Eng Phys. 29(2): 277–290.Titze IR. 1988. The physics of smallamplitude oscillation of the vocal folds.J Acoust Soc Am. 83(4): 1536–1552.120BibliographyTop A, Hamarneh G, Abugharbieh R. 2011. Active learning for interactive 3dimage segmentation. Proceedings of the 14th International Conference onMedical Image Computing and Computer Assisted Intervention; Toronto,Canada.Tsai A, Yezzi AJ, Wells W, Tempany C, Tucker D, Fan A, Grimson WE,Willsky A. 2003. A shape-based approach to the segmentation of medicalimagery using level sets. IEEE Trans Med Imag. 22(2):137–154.Tuller B, Harris KS, Gross B. 1981. 
Tuller B, Harris KS, Gross B. 1981. Electromyographic study of the jaw muscles during speech. J Phon. 9:175–188.
Uecker M, Zhang S, Voit D, Karaus A, Merboldt KD, Frahm J. 2010. Real-time MRI at a resolution of 20 ms. NMR Biomed. 23(8):986–994.
Välimäki V, Karjalainen M. 1994. Improving the Kelly-Lochbaum vocal tract model using conical tube sections and fractional delay filtering techniques. Proceedings of the 3rd International Conference on Spoken Language Processing; Yokohama, Japan.
Vasconcelos MJ, Ventura SM, Freitas DR, Tavares JMR. 2012. Inter-speaker speech variability assessment using statistical deformable models from 3.0 Tesla magnetic resonance images. P I Mech Eng G-J Aer. 226(3):185–196.
Ventura SR, Freitas DR, Tavares JMR. 2009. Application of MRI and biomedical engineering in speech production study. Comput Methods Biomech Biomed Eng. 12(6):671–681.
Ventura SR, Freitas DR, Tavares JMR. 2011. Toward dynamic magnetic resonance imaging of the vocal tract during speech production. J Voice. 25(4):511–518.
Ventura SR, Freitas DR, Ramos IMA, Tavares JMR. 2013. Morphologic differences in the vocal tract resonance cavities of voice professionals: an MRI-based study. J Voice. 27(2):132–140.
Vercauteren T, Pennec X, Perchant A, Ayache N. 2009. Diffeomorphic demons: efficient non-parametric image registration. Neuroimage. 45(1):S61–S72.
Wand M. 2015. Advancing electromyographic continuous speech recognition: signal preprocessing and modeling. KIT Scientific Publishing.
Wolański W, Gzik-Zroska B, Kawlewska E, Gzik M, Larysz D, Dzielicki J, Rudnik A. 2015. Preoperative planning of surgical treatment with the use of 3D visualization and finite element method. In: Developments in Medical Image Processing and Computational Vision. Springer International Publishing; p. 139–163.
Woo J, Murano EZ, Stone M, Prince JL. 2012. Reconstruction of high resolution tongue volumes from MRI. IEEE Trans Biomed Eng. 6(1):1–25.
Woo J, Lee J, Murano EZ, Xing F, Al-Talib M, Stone M, Prince JL. 2015. A high-resolution atlas and statistical model of the vocal tract from structural MRI. Comput Methods Biomech Biomed Eng: Imaging Vis. 3(1):47–60.
Xing F, Ye C, Woo J, Stone M, Prince J. 2015. Relating speech production to tongue muscle compressions using tagged and high-resolution magnetic resonance imaging. Proceedings of SPIE Medical Imaging; Munich, Germany.
Xing F, Woo J, Murano EZ, Lee J, Stone M, Prince JL. 2013. 3D tongue motion from tagged and cine MR images. Proceedings of the 16th International Conference on Medical Image Computing and Computer-Assisted Intervention; Nagoya, Japan.
Yoshida K, Takada K, Adachi S, Sakuda M. 1982. EMG approach to assessing tongue activity using miniature surface electrodes. J Dent Res. 61(10):1148–1152.
Yushkevich PA, Piven J, Hazlett HC, Smith RG, Ho S, Gee JC, Gerig G. 2006. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage. 31(3):1116–1128.
Zajac FE. 1988. Muscle and tendon: properties, models, scaling, and application to biomechanics and motor control. Crit Rev Biomed Eng. 17(4):359–411.
Zhang Y, Bajaj C, Xu G. 2005. Surface smoothing and quality improvement of quadrilateral/hexahedral meshes with geometric flow. Proceedings of the 14th International Meshing Roundtable; San Diego, USA.

Appendix A
Oropharyngeal Muscles

A.1 Tongue Muscles

The human tongue is unique in the body.
The structure that resembles it most closely is the heart, since both are composed entirely of soft tissue; but the tongue typically moves at 10 times the rate of the heart. Tongue muscle architecture is considerably more complex than heart muscle, which allows the tongue to be extremely versatile: chewing requires that it throw food onto the teeth; some languages utilize it in the pronunciation of over 150 phonetic sounds; and every time we inhale, the genioglossus posterior contracts in order to keep the pharynx open.

The tongue is a perfect muscular hydrostat, meaning it has no skeleton and no sac; it is composed entirely of soft tissue. There are two definitive features associated with muscular hydrostats: they have orthogonal muscle orientation, and they are volume preserving. Compression in one location means expansion in another, and motion is accompanied by deformation. The tongue consists of 100 laminae (alternating layers of orthogonal fibers), enabling complex deformations. The hypoglossal nerve (CN XII) supplies motor fibres to all the muscles of the tongue except the palatoglossus, which is innervated by the vagus nerve (CN X). These nerves arise from the hypoglossal nucleus (of the caudal brain stem), which includes about 13,000 motoneurons (O'Kusky and Norman, 1995). It is possible that every one of these has independent control, potentially enabling very complex deformations. The complex muscle fiber direction is compatible with localized innervation of muscle fibers; the implication is localized control, not control of muscle compartments.

Previously, the extrinsic muscles of the tongue were thought to move the tongue as a rigid body, while the intrinsic muscles would fine-tune tongue shape with minimal deformation. Today we know that both extrinsic and intrinsic muscles participate in the movement and deformation of the tongue.
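As a brief worked illustration of the volume-preservation property described above (a sketch added here; the numbers are illustrative and the notation is not part of the original text), near-incompressibility of soft tissue can be written as a constraint on the deformation gradient \mathbf{F}:

J \;=\; \det \mathbf{F} \;=\; \lambda_1 \lambda_2 \lambda_3 \;\approx\; 1,

where \lambda_1, \lambda_2, \lambda_3 are the principal stretches. A compression to \lambda_1 = 0.8 of the resting length in one direction must then be balanced by a combined expansion of \lambda_2 \lambda_3 \approx 1/0.8 = 1.25 in the orthogonal directions. This is why local compression of the tongue body necessarily produces bulging elsewhere, and why finite element tongue models typically enforce near-incompressibility.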
A.1.1 Extrinsic Muscles

The extrinsic muscles of the tongue originate on bones outside the tongue and insert into the tongue surface. Extrinsic muscles, as shown in Figure A.1, are named for origin and insertion: genioglossus (GG), hyoglossus (HG), styloglossus (STY) and palatoglossus (PG); the suffix glossus refers to the tongue. Fibers are located very laterally, back to front (STY, HG, PG), or very medially, front to back (GG), and mostly interdigitate with the intrinsic muscles.

Figure A.1. Lateral and sagittal cross-section views of the tongue, denoting extrinsic muscles (underlined). © Elsevier, Drake et al. (2010), adapted with permission.

The genioglossus (GG) originates from the mandibular symphysis and inserts into the mid-line tongue surface, excluding the tip, in a fan shape; the prefix genio refers to the Greek word for chin. The genioglossus anterior (GGA) depresses the anterior tongue, and the genioglossus posterior (GGP) pulls the posterior of the tongue forward. The GG is the major muscle responsible for protruding the tongue.

The hyoglossus (HG) originates from the hyoid bone (hence the prefix hyo) and inserts into the posterior tongue, interdigitating with the transverse muscle, and along the lateral margin to the tip, inserting along its length. It retracts and depresses the tongue.

The styloglossus (STY) arises from the styloid process of the temporal bone (hence the prefix stylo) and, like the HG, inserts into the posterior tongue, interdigitating with the transverse muscle, and along the lateral margin to the tip, inserting along its length. It retracts and elevates the tongue, pulling it up and back.

The palatoglossus (PG) originates from the oral surface of the soft palate (hence the prefix palato) and spreads into the lateral tongue, where some of its fibers interdigitate with the transverse muscle. The PG elevates the posterior tongue during swallowing, in order to close the oropharyngeal isthmus.

A.1.2 Intrinsic Muscles

The intrinsic muscles of the tongue originate and insert within the soft tissue of the tongue. Fibers course lengthwise or crosswise, and interdigitate mostly with other muscles. Intrinsic muscles are named for the direction of their fibers: superior longitudinal(is) (SL), inferior longitudinal(is) (IL), vertical(is) (VERT) and transvers(e/us) (TRANS). Figure A.2 shows the intrinsic muscles in mid-sagittal and mid-coronal cross-cuts of the tongue.

Figure A.2. Posterior views of tongue muscles with horizontal and vertical cut-away, denoting intrinsic tongue muscles (underlined). © Elsevier, Drake et al. (2010), adapted with permission.

The superior longitudinal (SL) originates at the tongue tip and inserts into the posterior surface, above the hyoid. Its activation shortens the tongue while elevating and curling the tip.

The inferior longitudinal (IL) originates from the tongue blade and inserts into the posterior surface, above the hyoid and below the SL. Its activation shortens the tongue while depressing and curling the tip.

The verticalis (VERT) arises at the upper surface of the tongue and inserts into the upper surface of the IL and the ventral surface of the tongue. Its activation flattens and widens the tongue; it also protrudes the tongue if activated alongside the TRANS.

The transversus (TRANS) originates from the median septum and inserts into the upper lateral surface of the tongue. It narrows and lengthens the tongue.
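To keep the abbreviations above easy to cross-reference when mapping anatomical muscle groups into a simulation setup, the following minimal Python sketch collects them in a single lookup table. The structure and names are assumptions made for this summary only; they are not the muscle definitions used in the thesis models.

# Illustrative summary of the tongue muscles described in Section A.1
# (reference sketch only; not the thesis model definition).
TONGUE_MUSCLES = {
    # extrinsic muscles: originate on bone, insert into the tongue
    "GGA":   ("extrinsic", "depresses the anterior tongue"),
    "GGP":   ("extrinsic", "pulls the posterior tongue forward (protrusion)"),
    "HG":    ("extrinsic", "retracts and depresses the tongue"),
    "STY":   ("extrinsic", "retracts and elevates the tongue"),
    "PG":    ("extrinsic", "elevates the posterior tongue"),
    # intrinsic muscles: originate and insert within the tongue
    "SL":    ("intrinsic", "shortens the tongue; elevates and curls the tip"),
    "IL":    ("intrinsic", "shortens the tongue; depresses and curls the tip"),
    "VERT":  ("intrinsic", "flattens and widens the tongue"),
    "TRANS": ("intrinsic", "narrows and lengthens the tongue"),
}

def muscles_in_group(group):
    """Return the abbreviations of all muscles in one anatomical group."""
    return sorted(name for name, (g, _) in TONGUE_MUSCLES.items() if g == group)

if __name__ == "__main__":
    print("extrinsic:", ", ".join(muscles_in_group("extrinsic")))
    print("intrinsic:", ", ".join(muscles_in_group("intrinsic")))

A table of this kind is convenient, for example, when grouping activation channels by muscle family while inspecting inverse-simulation results.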
A.2 Jaw and Hyoid Muscles

The human jaw (the bone structure at the entrance of the mouth) is articulated by the motion of its lower section (the mandible), while its upper section (the maxilla) is mostly fixed to the skull. The hyoid bone is a small, U-shaped bone distantly anchored to the skull; it supports tongue motion by providing attachments to the HG as well as to the mouth-floor muscles. The muscles of the jaw and hyoid insert into (or originate from) the mandible. Figure A.3 identifies the attachment sites of each muscle on the mandible.

Figure A.3. The insertion sites of the jaw and hyoid muscles on the mandible. Public domain; adapted from Health, Medicine and Anatomy Reference Pictures, 2013.

The jaw muscles (the masseter, temporalis, medial pterygoid and lateral pterygoid) are shown in Figure A.4. They are all innervated by the mandibular division (V3) of the fifth cranial nerve. The lateral pterygoid is the only one of the four to open the jaw, while bilateral activation of the others results in jaw closing.

A.2.1 Jaw Closers

The medial pterygoid (MP) is a thick quadrilateral muscle, originating from above the medial surface of the lateral pterygoid plate (deep head), as well as from the maxillary tuberosity and the pyramidal process of the palatine bone (superficial head). Passing downward, lateral and posterior, the fibers insert into the internal surface of the ramus and down to the angle of the mandible. When activated, the MP elevates the mandible and closes the jaw.

The masseter is a strong mastication muscle that parallels the medial pterygoid. The superficial head of the masseter (SM) originates from the zygomatic process of the maxilla and the zygomatic arch; it inserts into the angle and ramus of the mandible. The deep head of the masseter (DM) arises from the lower border and medial surface of the zygomatic arch; its fibers pass downward and forward to insert into the upper half of the ramus. At its insertion, the masseter joins the MP to form a common sling, allowing for powerful jaw elevation.

The temporalis closes the jaw and pulls the mandible back. It arises from the temporal fossa and the deep part of the temporal fascia, and inserts within the coronoid process of the mandible. If the entire muscle contracts, the main action is to elevate the mandible and raise the lower jaw; however, contraction of the posterior portion, by itself, retracts the mandible.

A.2.2 Jaw Openers

The lateral pterygoid is the only jaw muscle that protrudes the mandible and hence opens the jaw. Its superior head (SP) arises from the sphenoid bone and inserts onto the capsule of the temporomandibular joint; its inferior head (IP) arises from the lateral pterygoid plate and inserts onto the condyloid process.

Figure A.4. Illustration of the jaw muscles (underlined) inserting into the mandible. © Elsevier, Drake et al. (2010), adapted with permission.

Other jaw openers include the muscles of the mouth floor, as illustrated in Figure A.5. The geniohyoid (GH) originates from the back of the mandibular symphysis and inserts on the anterior surface of the hyoid body. It depresses the mandible and opens the jaw (if the hyoid is kept in position), or it pulls the tongue body forward and up by elevating the hyoid (if the hyoid is not stabilized).

The mylohyoid (MH) is a flat, triangular muscle arising from the mandible and inserting into the hyoid; the prefix mylo refers to the Greek word for molar. The MH forms the muscular floor of the oral cavity and has a similar action to the GH.

Figure A.5 also shows some of the suprahyoid muscles. The digastric is a narrow muscle that includes two bellies. The anterior belly of the digastric (AD) arises from the lower border of the mandible, close to the symphysis. It opens the jaw and pulls the tongue body forward and up, if the hyoid bone is not stabilized. The posterior belly of the digastric (PD) originates at one end from the mastoid process of the temporal bone; at the other end, it forms a tendon that attaches to the hyoid. The digastric muscle shows action similar to the GH.

Figure A.5. Frontal view of the submandibular and neck muscles, denoting muscles of the mouth floor (underlined). © Elsevier, Drake et al. (2010), adapted with permission.
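The jaw and hyoid muscles described above can likewise be summarized by their primary action on the mandible. The following minimal Python sketch is illustrative only: the abbreviations (including "TEMP" for the temporalis, which the text does not abbreviate) and the grouping are assumptions made for this summary, bilateral left/right pairs are not distinguished, and this is not the thesis model definition.

# Illustrative grouping of the jaw and hyoid muscles of Section A.2
# by their primary action on the mandible (reference sketch only).
JAW_CLOSERS = {
    "MP":   "medial pterygoid: elevates the mandible",
    "SM":   "masseter, superficial head: elevates the mandible",
    "DM":   "masseter, deep head: elevates the mandible",
    "TEMP": "temporalis: elevates the mandible; posterior part retracts it",
}
JAW_OPENERS = {
    "SP": "lateral pterygoid, superior head: protrudes the mandible",
    "IP": "lateral pterygoid, inferior head: protrudes the mandible",
    "GH": "geniohyoid: depresses the mandible when the hyoid is stabilized",
    "MH": "mylohyoid: mouth floor; similar action to the GH",
    "AD": "anterior digastric: opens the jaw",
    "PD": "posterior digastric: similar action to the GH",
}

def action_of(abbrev):
    """Look up a muscle abbreviation in either group (None if unknown)."""
    return JAW_CLOSERS.get(abbrev) or JAW_OPENERS.get(abbrev)

if __name__ == "__main__":
    for name in ("MP", "IP", "GH"):
        print(name, "->", action_of(name))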
The International Phonetic Alphabet (revised to 2005): pulmonic and non-pulmonic consonants, vowels, other symbols, diacritics, suprasegmentals, and tones and word accents. © 2005 IPA.