@prefix vivo: . @prefix edm: . @prefix ns0: . @prefix dcterms: . @prefix dc: . @prefix skos: . vivo:departmentOrSchool "Applied Science, Faculty of"@en, "Electrical and Computer Engineering, Department of"@en ; edm:dataProvider "DSpace"@en ; ns0:degreeCampus "UBCV"@en ; dcterms:creator "Vogt, Florian"@en ; dcterms:issued "2009-05-01T13:32:50Z"@en, "2009"@en ; vivo:relatedDegree "Doctor of Philosophy - PhD"@en ; ns0:degreeGrantor "University of British Columbia"@en ; dcterms:description """The human upper airway anatomy consists of the jaw, tongue, pharynx, larynx, palate, nasal cavities, nostrils, lips, and adjacent facial structures. It plays a central role in speaking, mastication, breathing, and swallowing. The interplay and correlated movements between all the anatomical structures are complex and basic physiological functions, such as the muscle activation patterns associated with chewing or swallowing, are not well understood. This work creates a modeling framework as a bridge between such disciplines as linguistics, dentistry, biomechanics, and acoustics to enable the integration of physiological knowledge with interactive simulation methods. This interactive model of the upper airway system allows better understanding of the anatomical structures and their overall function. A three-dimensional computational modeling framework is proposed to mimic the behavior of the upper airway anatomy as a system by combining biomechanic, parametric, and acoustic modeling methods. Graphical user interface components enable the interactive manipulation of models and orchestration of physiological functions. A three-dimensional biomechanical tongue model is modified as a reference model of the modeling framework to demonstrate integration of an existing model and to enable interactivity and validation procedures. Interactivity was achieved by introducing a general-purpose fast linear finite element muscle model. Feasible behavior of the biomechanical tongue model is ensured by comparison with a reference model and matching the model to medical image data. In addition to the presented generic tongue model, individual differences in shape and function are important for clinical applications. Different medical image modalities may jointly enable guidance of the creation on individuals’ anatomy models. Automatic methods to extract shape and function are investigated to demonstrate the feasibility of upper airway image-based modeling for this modeling framework. This work may be continued in many other directions to simulate the upper airway for speaking, breathing, and swallowing. For example, progress has already been made to develop a complete vocal tract model whereby the tongue model, jaw model, and acoustic airway are connected. Further, it is planned to apply the same tissue modeling methods to represent other muscle groups and model the interaction with other anatomical substructures of the vocal tract such as the face, lips and soft palate."""@en ; edm:aggregatedCHO "https://circle.library.ubc.ca/rest/handle/2429/7799?expand=metadata"@en ; dcterms:extent "4386325 bytes"@en ; dc:format "application/pdf"@en ; skos:note """Towards an Interactive Framework for Upper Airway Modeling Integration of Acoustic, Biomechanic, and Parametric Modeling Methods by Florian Vogt Dipl. 
Ing., Hamburg University of Applied Science, 1998 MSc, Stanford University, 2000 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Electrical and Computer Engineering) The University Of British Columbia (Vancouver) April 2009 c Florian Vogt 2009 Abstract The human upper airway anatomy consists of the jaw, tongue, pharynx, larynx, palate, nasal cavities, nostrils, lips, and adjacent facial structures. It plays a central role in speaking, mastication, breathing, and swallowing. The interplay and correlated movements between all the anatomical structures are complex and basic physiological functions, such as the muscle activation patterns associated with chewing or swallowing, are not well understood. This work creates a modeling framework as a bridge between such disciplines as linguistics, dentistry, biomechanics, and acoustics to enable the integration of physiological knowledge with interactive simulation methods. This interactive model of the upper airway system allows better understanding of the anatomical structures and their overall function. A three-dimensional computational modeling framework is proposed to mimic the behavior of the upper airway anatomy as a system by combining biomechanic, parametric, and acoustic modeling methods. Graphical user interface components enable the interactive manipulation of models and orchestration of physiological functions. A three-dimensional biomechanical tongue model is modi ed as a reference model of the modeling framework to demonstrate integration of an existing model and to enable interactivity and validation procedures. Interactivity was achieved by introducing a general-purpose fast linear nite element muscle model. Feasible behavior of the biomechanical tongue model is ensured by comparison with a reference model and matching the model to medical image data. In addition to the presented generic tongue model, individual di erences in shape and function are important for clinical applications. Di erent medical image modalities may jointly enable guidance of the creation on individuals' anatomy models. Automatic methods to extract shape and function are investigated to demonstrate the feasibility of upper airway image-based modeling for this modeling framework. This work may be continued in many other directions to simulate the upper airway for speaking, breathing, and swallowing. For example, progress has already been made to develop a complete vocal tract model whereby the tongue model, jaw model, and acoustic airway are connected. ii Table of Contents Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi Co-Authorship Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
1 1.2 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Contributions and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Demarcation of the Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.5 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 Chapter 2: Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 Physiology of the Upper Airway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Anatomical Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.2 Air Cavities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1.3 Materials and Couplings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Principles of Sound Production and Speech Synthesis . . . . . . . . . . . . . . . . . 16 2.2.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3 Underlying Methods for Articulatory Synthesis . . . . . . . . . . . . . . . . . . . . 22 2.3.1 Biomechanical Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 iii 2.3.2 Measurement-based Techniques . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.3 Sound Production Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.4 Modeling Vocal Tracts and Faces . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Imaging and Tracking Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.4.1 Comparison of Image Modalities . . . . . . . . . . . . . . . . . . . . . . . . 34 2.5 Image Extraction Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.5.1 Image Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 2.5.2 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 2.5.3 Mesh-based Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 2.5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.6 Existing Modeling Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.6.1 Framework A: Ptolemy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.6.2 Framework B: Real ow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 2.6.3 Framework C: ANSYS and Fluent . . . . . . . . . . . . . . . . . . . . . . . 44 2.6.4 Framework D: Software for Interactive Musculoskeletal Modeling (SIMM) . 44 2.6.5 Framework E: Simulation Open Framework Architecture (SOFA) . . . . . . 44 2.6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 Chapter 3: Creation of a Modeling Framework for the Vocal Apparatus . . . . . 46 3.1 Framework Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.1.2 Non-functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.1.3 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.1.4 Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.2 Vocal Tract Simulation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 
58 3.2.1 Graphical User interface Design for Model and Simulation Editing and Control 60 3.2.2 Model Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.2.3 Validation, Experimentation, and Control Modules . . . . . . . . . . . . . . 72 3.2.4 Library of Anatomy Components Default Model . . . . . . . . . . . . . . . 72 3.2.5 Model Building from Images . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.3 Matching Requirements with Realizations . . . . . . . . . . . . . . . . . . . . . . . 74 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 iv Chapter 4: Creation of a Tongue Model for the Complete Vocal Tract . . . . . . 77 4.1 Building Deformable Anatomical Models . . . . . . . . . . . . . . . . . . . . . . . . 78 4.1.1 Ecient Anatomical Tongue Model . . . . . . . . . . . . . . . . . . . . . . . 79 4.1.2 Other Upper Airway Anatomy . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.2 Validating Deformable Anatomical Models . . . . . . . . . . . . . . . . . . . . . . . 87 4.2.1 Comparison with Reference Simulation . . . . . . . . . . . . . . . . . . . . . 87 4.2.2 Matching Simulation Results to Measurement . . . . . . . . . . . . . . . . . 91 4.3 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Chapter 5: Data Acquisition and Extraction . . . . . . . . . . . . . . . . . . . . . . . 95 5.1 Creation and Structure of Vocal Tract Data Sets . . . . . . . . . . . . . . . . . . . 96 5.1.1 Important Existing Data Sets for Vocal Tract Modeling . . . . . . . . . . . 96 5.2 Extraction of the Tongue Shapes from MRI . . . . . . . . . . . . . . . . . . . . . . 98 5.2.1 Segmentation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.2.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.2.3 MR Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.2.4 Discussion of MR Segmentation Experiments . . . . . . . . . . . . . . . . . 107 5.2.5 Summary of MR Image Segmentation . . . . . . . . . . . . . . . . . . . . . 108 5.3 Registration of Tongue Shapes Across MRI Images . . . . . . . . . . . . . . . . . . 109 5.3.1 MR Image Registration Experiments . . . . . . . . . . . . . . . . . . . . . . 111 5.3.2 High Level Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.3.3 Discussion of MR Registration Experiments . . . . . . . . . . . . . . . . . . 116 5.3.4 Summary of MR Registration Experiments . . . . . . . . . . . . . . . . . . 118 5.4 Real Time Ultrasound Tongue Tracking . . . . . . . . . . . . . . . . . . . . . . . . 119 5.4.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.4.2 Tracking Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.4.3 Experiment: Vowel Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.4.4 Experiment: Driving Physics Synthesis Models . . . . . . . . . . . . . . . . 126 5.4.5 Discussion of Tongue and Groove Results . . . . . . . . . . . . . . . . . . . 128 5.4.6 Summary of Tongue and Groove . . . . . . . . . . . . . . . . . . . . . . . . 129 5.5 Discussion of Data Acquisition and Extraction Results . . . . . . . . . . . . . . . . 129 Chapter 6: Summary & Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 v 6.1.1 Modeling Framework . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . 132 6.1.2 Interactive Biomechanic Tongue Model . . . . . . . . . . . . . . . . . . . . . 132 6.1.3 Image Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.1.4 Validation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.2 Contributions and Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.4 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Appendix A: Authors Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Appendix B: Researcher Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 B.1 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165 vi List of Tables 2.1 List of Tongue Muscles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Comparison of existing speech synthesis . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Finite element formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4 Summary of Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.5 List of Simulation Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.1 Summary of point-based connection types . . . . . . . . . . . . . . . . . . . . . . . 70 3.2 Realization of Nonfunctional Requirements . . . . . . . . . . . . . . . . . . . . . . 75 3.3 Realization of Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . 75 4.1 Finite element tongue accuracy results . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.1 Segmentation Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.2 Segmentation Experiment Summary . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.3 Parameters for FEM based Registration . . . . . . . . . . . . . . . . . . . . . . . . 112 5.4 Demons Registration Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.5 Principle Component Analysis of Tongue Tracking Results . . . . . . . . . . . . . . 126 B.1 Researcher Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 vii List of Figures 1.1 Overview of Vocal Tract Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Integrated Physiological Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Four Main Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Schematic Voice Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Vocal Tract Structures and Bones of the Head and Neck . . . . . . . . . . . . . . . 12 2.3 Extrinsic and Intrinsic Tongue Muscles . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 The Oral Cavity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5 Head, Face, Neck, and Lip Muscles . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.6 Vocal Tract Con gurations for Fricative . . . . . . . . . . . . . . . . . . . . . . . . 17 2.7 One-dimensional Tube Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.8 Modeling Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.9 Mass-Spring and Finite Element Models . . . 
. . . . . . . . . . . . . . . . . . . . . 24 2.10 Boundary Element Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.11 MR and Ultrasound Examples for Speech . . . . . . . . . . . . . . . . . . . . . . . 28 2.12 Medical Image Data Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.13 Rigid Body Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.1 Vocal Tract-related Research Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2 Mock-ups of the Graphical User Interface . . . . . . . . . . . . . . . . . . . . . . . 56 3.3 Modeling Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.4 Block Diagram of ArtiSynth Framework . . . . . . . . . . . . . . . . . . . . . . . . 63 3.5 ArtiSynth Software Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 3.6 Anatomy Modeling Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 3.7 Mock-up of a Model and Probe Library Window . . . . . . . . . . . . . . . . . . . 74 4.1 Vowel Postures Composite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.2 Tongue Model Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 viii 4.3 FEM Sti ness warping Forces and Example . . . . . . . . . . . . . . . . . . . . . . 82 4.4 FEM Muscle Forces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.5 Integrated Jaw-Tongue-Hyoid Model . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.6 Tongue-Airway Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.7 Single Muscle Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.8 Tongue Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.9 Tongue-Airway Speech Postures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.1 Implemented Segmentation Pipeline Diagram . . . . . . . . . . . . . . . . . . . . . 100 5.2 E ect of Sigmoid lter Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 5.3 Interactive Segmentation Application . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4 Sample Slice from Tiede/Engwall MRI Volumes . . . . . . . . . . . . . . . . . . . . 102 5.5 Tongue Segmentation Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.6 2D Segmentation Pipeline Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.7 3D Segmentation Pipeline Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.8 FEM Registration Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.9 FEM Registration Test Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.10 Demons Registration Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.11 Demons Registration Test Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.12 FEM and Demons 3D Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.13 Registration Design Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.14 Tongue tracking System Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.15 Block Diagram of Tongue Tracking Algorithm . . . . . . . . . . . . . . . . . . . . . 121 5.16 Tongue Tracking Concept and Output . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.17 Filtered Ultrasound Tongue Contours . . . . . . . . . . . . . . . . . . . . . . . . . 
124 5.18 Un ltered Ultrasound Tongue Contours . . . . . . . . . . . . . . . . . . . . . . . . 125 5.19 Mean and Variance of Ultrasound Tongue Contours . . . . . . . . . . . . . . . . . 125 ix Acknowledgments Many people have helped me in di erent ways during my PhD studies. First I would like to thank my supervisor, Dr. Sidney Fels, who initiated this project and who gave me great support and guidance throughout my PhD work. I feel very privileged to have been mentored by someone with such insight and creativity. Without his encouragement and support, this thesis would never have been possible. I would also like to thank the ArtiSynth team, all of whom have been great to work with on such a diverse and dynamic project. In particular, I thank Dr. John Lloyd, whose expertise in software engineering and dynamics modeling was integral to the project; Ian Stavness for his contributions in modeling of the jaw and for many inspiring conversations; Kees van den Doel for his insight into acoustics modeling; Dr. Eric Vatikiotis-Bateson for his support and refreshing comments; and Dr. Alan Hannam for sparking enthusiasm and providing insights into anatomical modeling. To those past and present in the Human Communication Technologies Lab, I thank you all for your help and comradeship. In addition, I would like to thank my supervisory committee, Drs. Matthew Yedlin and Michiel van de Panne, for their valuable feedback and discussions. Further, I would like to thank Drs. Rafeef Abugharbieh, Tim Salcudean, Antony Hodgson, Perry Cook, Hani Yehia, and Keith Hamel for reviewing my work and providing their great comments. I would like to acknowledge the funding sources for my doctoral studies: the Advanced Telecommunications Research Institute (ATR) in Kyoto/Japan, the Peter Wall Institute for Advanced Studies at UBC, the Natural Sciences and Engineering Research Council of Canada (NSERC), and Green College. Finally, I would like to thank my parents and sister, who support me always. Florian Vogt x Dedication To Margot, Harald, and Gabriele Vogt xi Co-Authorship Statement This project was initiated by Dr. Sidney Fels (supervisor) as a framework for articulatory synthesis. It was investigated by Florian Vogt by literature review, as well as through discussions and interviews with medical and speech production researchers at UBC, ATR1 (Japan), and beyond to develop the project focus and concept. As a result, software requirements were developed by Florian Vogt in consultation with Dr. Sidney Fels. The rst system design, speci cations, and interaction models were created and validated by Florian Vogt with throw-away prototypes in consultation with Dr. Sidney Fels. This system design was implemented as a prototype and analyzed by PDF2 Dr. Oliver Gunther and Florian Vogt to demonstrate parametric and dynamic model interactions and their user interfaces. This prototype was used to test the system design by implementation of models from the reviewed literature and interactive components supervised by Florian Vogt: dynamic jaw model (PDF Dr. Oliver Gunther), parametric tongue model (URA3 Rahul Chander), parametric lip model (URA Justin Lam), dynamic face model (Florian Vogt and GRA4 Leah Vilhan), acoustic model (RA Carol Jaeger), 2D nite element tongue model (Florian Vogt), connections and model integration (Florian Vogt and PDF Dr. Oliver Gunther), timeline and property editor (PDF Dr. Oliver Gunther and SP5 Kalev Tait). The prototype performance and usage analysis by Florian Vogt in consultation with Dr. 
Sidney Fels resulted in findings that (a) dynamic models, in particular finite elements, are best suited to create a component-based complete upper airway; and (b) a more efficient implementation is needed to ensure the required interactivity. The second and current version of the system was designed by RA John Lloyd based on the experience obtained in creating the first prototype system. (Abbreviations: ATR, Advanced Telecommunications Research Institute; PDF, Postdoctoral Fellow; URA, Undergraduate Research Assistant; GRA, Graduate Research Assistant; SP, Software Programmer; RA, Research Associate; VR, Visiting Researcher.) Dr. Lloyd's contributions to this new system included: models built from a hierarchy of components, the ability to combine different model types in the same simulation (including rigid bodies, mass-spring systems, and deformable objects), bilateral and unilateral constraints, a general property framework that allows component attributes to be selected and interactively modified, redesign of the simulation engine with support for implicit integrators (using externally acquired direct sparse matrix solvers), interactive OpenGL rendering and manipulation, and implementation of input and output probes. Dr. Lloyd also supervised the creation of a Timeline widget for probe arrangement (implemented by Paul Gan, URA) and a graphical navigation widget (implemented by Chad Decker, URA). In parallel, the framework was evaluated, tested, and refined by the team by implementing models from the literature and their innovative extensions: Hill-type muscle model, jaw and laryngeal model (GRA Ian Stavness), finite element muscle model and biomechanical tongue model (Florian Vogt), vocal acoustics model (RA Dr. Kees van den Doel), collision detection and resolution (URA Elliot English), 2D parametric vocal tract model (URA Eric Lok), and parametric face model (Dr. Sidney Fels and VR Dr. Takaaki Kuratate). Much of the feedback for improvement and innovation was provided by Ian Stavness, Florian Vogt, Dr. Kees van den Doel, Elliot English, Dr. John Lloyd, and Dr. Sidney Fels alike. As part of a literature review, Florian Vogt selected the existing non-interactive biomechanical tongue model and designed and implemented a stand-alone 2D prototype, with assistance from Dr. John Lloyd, based on the "stiffness warping" techniques used for fast linear FEM simulation. He also selected the fast Pardiso direct solver for use in implicit integration. Dr. John Lloyd designed and implemented the 3D version of the stiffness warping method that was used for the 3D tongue model and other tissue simulation, along with the implicit integrator and the Java interface for Pardiso. Florian Vogt tested, analyzed, and validated this implementation for the tissue model and extended it to a finite element muscle model with support for the dynamics solver by Dr. John Lloyd. As a collaboration, the published non-interactive biomechanical tongue model's implementation was supplied by Drs. Stephanie Buchaillard, Pascal Perrier, Matthieu Chabanas, and Yohan Payan. Florian Vogt integrated, tested, and validated the finite element tongue model in the interactive framework. Dr. Stephanie Buchaillard modified the original implementation to support a jointly developed comparative metric. The presented Tongue-Jaw model results from a collaboration of Ian Stavness, Florian Vogt, and Dr. John Lloyd. Florian Vogt contributed to the connection formulation of unilateral connections, which are implicitly integrated, and bilateral connections, which are explicitly integrated.
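As an aside for readers unfamiliar with the technique mentioned above, the following is a minimal illustrative sketch (not taken from the thesis) of the idea behind stiffness warping for fast linear FEM: each element's linear stiffness matrix is assembled once, and at run time only a per-element rotation is re-estimated, so elastic forces remain plausible under large rotations while the system stays cheap to solve,

\[
\mathbf{f}_e = \mathbf{R}_e \mathbf{K}_e \left( \mathbf{R}_e^{\top} \mathbf{x}_e - \mathbf{x}_e^{0} \right),
\qquad
\left( \mathbf{M} + h\,\mathbf{D} + h^{2}\,\mathbf{K} \right) \Delta\mathbf{v} = h \left( \mathbf{f}(\mathbf{x}_t) - h\,\mathbf{K}\,\mathbf{v}_t \right),
\]

where R_e is the rotation extracted from the element's current deformation, K_e the precomputed linear stiffness, x_e^0 the rest coordinates, and the second equation is a standard semi-implicit (backward Euler) velocity update whose large sparse linear system is the kind of problem handed to a direct solver such as Pardiso. The exact formulation used in ArtiSynth may differ in detail.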
The presented Tongue-Airway model results from a collaboration by Dr. Kees van den Doel, Florian Vogt and Elliot English, and Dr. Sidney Fels. Florian Vogt contributed to the connection formulation and to MR image and acoustic validation. From the literature, Florian Vogt selected critical existing medical image and tracking data sets to support upper airway dynamics modeling. The concept of driving dynamics and acoustics models from a real-time ultrasound tongue tracker was innovated by Florian Vogt in consultation with Dr. Sidney Fels. This concept was implemented and tested by a team headed by Florian Vogt and consisting of GRAs Graeme McCaig and Adnan Ali. Its improved filter design was developed and analyzed by Florian Vogt in consultation with Dr. Hugo de Paula. The presented concept of automatically extracting and referencing tongue postures from MR images was developed by Florian Vogt. Further, the selection and adaptation of existing segmentation and registration methods to this problem was performed by Florian Vogt. The implementation and some preliminary numerical simulations were performed for registration (URA Rahul Chander) and segmentation (URA Charles Wilson) under close supervision by Florian Vogt, and experiments were performed and analyzed by Florian Vogt. For the thesis preparation, some portions were adopted from jointly prepared co-authored publications. In particular, portions of Chapter 3 originally appeared in Fels, Vogt, van den Doel, Lloyd & Guenter [80], Fels, Vogt, Gick, Jaeger & Wilson [83], and Vogt, Fels, Gick, Jaeger & Wilson [288]. Portions of Chapter 4 were published in van den Doel, Vogt, English & Fels [65], Fels, Lloyd, van den Doel, Vogt, Stavness & Vatikiotis-Bateson [82], Vogt [286], and Vogt, Lloyd, Buchaillard, Perrier, Chabanas, Payan & Fels [290]. Portions of Section 5.4 appeared first in Vogt, McCaig, Ali & Fels [291].

Chapter 1 Introduction

1.1 Motivation
The human upper airway is the anatomical complex associated with the mouth and vocal tract, including the jaw, tongue, pharynx, larynx, palate, airway, and adjacent facial structures, as shown in Figure 1.1. It is of primary importance in eating, breathing, and communicating. Disorders of upper airway anatomical structures can have detrimental effects on people's health and well-being. Figure 1.1: Schematic view of midsagittal vocal tract structures and cavities (underlined). Regrettably, disorders are widespread and may be caused by genetic predisposition, injury, or various medical and dental conditions. For example, in the US, obstructive sleep apnea (OSA) and dysphagia affect 1 in 10 people at some point in their lives. The complex interaction and correlated movements between all anatomical structures make diagnosis difficult and treatment challenging to prescribe, especially when fundamental physiological functions, such as the muscle activation patterns associated with chewing or swallowing, are not well understood. Other oral disorders, such as cleft palate or tongue cancer, often cause problems with feeding and speech. A common treatment is oral and maxillofacial surgery, which requires many iterations, since the surgical outcome needed to yield functional speech and sufficient jaw movement for feeding and aesthetics cannot be predicted sufficiently well. Thus, patients are exposed to many surgical procedures which could be avoided with better planning. Linguistics is another discipline which investigates the upper airway to understand speech production.
Linguists have contributed critical knowledge and insights into the anatomical functional behaviors and muscle and speech task organizations. Most linguistic research is based on observations of subject experiments using acoustic and image analysis. This field has many open research questions about underlying speech processes, such as coarticulation, speaker-to-speaker variations, consonant production, and dysfunctional behavior. Pure observational research is not able to answer these questions since the human experiments required are too invasive. Further, muscle and brain functions cannot easily be separated. Linguists have created computational models to perform virtual experiments to recreate and explain the observations obtained by measurements. However, these models would be more comprehensive and realistic if resources from other knowledge domains, such as medical simulation and software engineering, were used. Within the field of engineering, speech production has been investigated for communication applications such as speech coding for efficient data transmission and speech and facial synthesis as communication aids. The synthesis of natural human speech has a history of over 200 years [68], yet it is still an unsolved scientific question in our time. Other unsolved problems of speech synthesis are to create speaker-independent models that allow speaker parametrization and to create a synchronized vocal and facial model for character animation. Articulatory speech synthesis is a method which shows much promise in addressing these problems, but existing models remain too basic to answer these questions. The strength of computational models is the ability to modify and extend them to reflect newly gained knowledge. Thus, a comprehensive biomechanical and acoustic upper airway model could address many open research questions. The expertise needed to create such a model cuts across many research areas, including linguistics, medical and dental research, biomechanics, medical imaging and analysis, software engineering, aerodynamics, and acoustics. Thus, a comprehensive model has not yet been created. Currently, researchers from these different fields work on various structure models and aspects of the upper airway, for example the tongue, larynx, lips, and face. Each is very complex and is developed independently using different platforms, modeling approaches, and tool sets. Modeling the upper airway as a complete speech apparatus is especially difficult since the structures are interconnected, and this has therefore not yet been performed. However, some open research questions require only a subset of connected upper airway structures, such as the tongue and jaw, and thus the required complexity can be achieved stepwise. In summary, the motivation of this work is to computationally simulate the physiological processes of speaking, mastication, swallowing, and breathing. The upper airway is a wonderfully complex system that is challenging to model. There are many open questions about how the complete upper airway works. Current research primarily uses observations and data to measure the motions and shapes of articulators, but this does not distinguish the contributions of underlying physiology and motor control. This is the motivation to create physics-based models.

1.2 The Approach
Our approach is to create an integrated physiological upper airway model, which requires expertise from many different disciplines, as shown in Figure 1.2.
This includes, on the one side, cognitive and physiological disciplines and, on the other, physics and modeling disciplines. Combining expertise between disciplines allows knowledge to be transferred across domains. For example, dental researchers working on sleep apnea would gain better simulation algorithms from biomechanics and acoustics. In addition, researchers who extract anatomical shapes from images would be able to use anatomical models, created by dental researchers, as an atlas for image extraction methods. The key to bridging domains of knowledge and creating an integrated physiological upper airway model is to find the right abstractions. These abstractions represent data and knowledge from cognitive and physiological disciplines as well as algorithms from physics and modeling disciplines. Our chain of modeling representations is depicted in the center of Figure 1.2. In the case of speech, intents represent the forming of tasks and goals, which is studied in the field of articulatory phonetics. These intents are transmitted using neural signals and motor control programs to activate muscles. Activated muscles generate forces that deform articulators like the tongue. The tongue shape controls the airflow through the oral cavity, which modulates vocal sounds. These abstractions from intent to sound are not only useful in speech but for the other upper airway functions as well. Figure 1.2: Approach: integrated physiological model to bridge different disciplines using common abstractions (intent, neural signal, muscle activation, force, deformation, geometry, sound). This approach may answer open research questions: (1) How much physiological model accuracy is required to produce realistic speech? and (2) What speech behavior is due to basic physiology and what is neurally controlled? More specifically, this thesis work conceptualizes a computational modeling framework of the upper airway towards a complete, physics-based three-dimensional upper airway model. This computational model may enable researchers and practitioners to perform experiments to gain pure understanding, compare models as a function of parameters, and study perturbations of models. Virtual experiments performed with such a model will further understanding in linguistic and clinical applications. This modeling framework has been created to address the long-term goal of understanding the coupling of the brain to the upper airway. The work in this dissertation was carried out as part of the collaborative ArtiSynth project (the name is derived from Articulatory Synthesizer; http://artisynth.org). ArtiSynth is a project, evolved from this thesis work, that was established to develop an upper airway modeling framework. In order to address the various aspects of this modeling problem, the project draws on the knowledge of a diverse group of researchers from the University of British Columbia. This group consists of engineers, linguists, and computer and medical scientists having the expertise to address specifics of the modeling problem, including biomechanics, aerodynamics, acoustics, and anatomical modeling. This thesis lays the groundwork and conceptualization for the ArtiSynth project by defining modeling abstractions and framework components.
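As a purely illustrative sketch (not part of the thesis and not ArtiSynth's actual API; every type and method name here is an invented assumption), the abstraction chain described above could be expressed in code roughly as a pipeline of narrow interfaces:

// Hypothetical illustration of the intent -> activation -> force -> geometry -> sound chain.
// None of these types correspond to real ArtiSynth classes.
interface IntentPlanner   { double[] activations(String task, double t); }  // intent -> muscle activation
interface MuscleModel     { double[] forces(double[] activations); }        // activation -> force
interface DeformableModel { double[] deform(double[] forces, double dt); }  // force -> deformation/geometry
interface AcousticModel   { double[] sound(double[] geometry, double dt); } // geometry -> sound

final class AbstractionChain {
    private final IntentPlanner planner;
    private final MuscleModel muscles;
    private final DeformableModel tissue;
    private final AcousticModel acoustics;

    AbstractionChain(IntentPlanner p, MuscleModel m, DeformableModel d, AcousticModel a) {
        planner = p; muscles = m; tissue = d; acoustics = a;
    }

    // Advance one simulation step through the whole chain.
    double[] step(String task, double t, double dt) {
        double[] act   = planner.activations(task, t);
        double[] force = muscles.forces(act);
        double[] shape = tissue.deform(force, dt);
        return acoustics.sound(shape, dt);
    }
}

The point of the sketch is only that each discipline contributes behind one narrow interface, which is what allows parametric, biomechanical, and acoustic components to be combined in a single simulation.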
In summary, creating an integrated physiological model requires the definition of common abstractions so that experts from different disciplines can contribute their knowledge to an upper airway simulation framework.

1.3 Contributions and Impact
In this thesis work, the following contributions, depicted in Figure 1.3, have been made to the field of speech science and technology. Figure 1.3: The four main thesis contributions (modeling framework, tongue model, image modeling, and model validation) depicted with their interrelations.

Modeling Framework. The modeling framework describes how the abstraction chain is formulated and handled in a simulation system. The conceptual development to unite and connect diverse modeling approaches, including kinematics, biomechanics, and acoustics, is presented in Chapter 3. In order to bridge the different disciplines of linguistics, medical simulation, communication, dental and medical research, and acoustics, knowledge of the different domains is structured as described in Chapters 2 and 3. Requirements and conceptualizations were formulated from user feedback and domain knowledge. In order to manipulate the underlying models and their simulation processes through a user interface, seven complementary modeling concepts for the framework were formulated. This is published in ICPhS 2003 [83]. This information enables the creation of a framework that accommodates the needs of the diverse research areas. A taxonomy of models is developed to categorize properties and unify ordering to anticipate different modeling types in the framework design. My contribution to the design and proof of concepts enables integration of different modeling types, such as parametric and biomechanical models, in the framework. This is published in ICPhS 2003 [288].

Interactive Biomechanical Tongue Model. The tongue is a central and complex anatomical structure, which is challenging to model. The tongue model exercises the abstraction chain from muscle activation to geometry. Interactivity is important to enable intuitive understanding and exploration of models, and it requires a fast and accurate solution for deformable models. First, this thesis work presents a general solution for interactive muscle models using finite elements; second, it applies this interactive muscle formulation to the existing non-interactive tongue model from ICP (Institut de la Communication Parlee, Grenoble, France) created by Buchaillard [32], Gerard, Wilhelms-Tricarico, Perrier & Payan [94], and Gerard, Perrier & Payan [96], as presented by Vogt et al. [290]. This tongue model provides a proof of concept for a complete upper airway model, as described in Chapter 4.

Image Modeling. Different image modeling methods are investigated to determine how different types of image data may be used to model individuals. Current upper airway processing methods are very labor-intensive, which makes them infeasible for clinical applications. Image modeling may provide geometry in the abstraction chain. Three different automatic extraction methods are investigated on static and dynamic image data to test their performance for providing shapes for modeling. Segmentation and registration algorithms are created to extract and cross-reference 2D and 3D magnetic resonance image data, as described in Sections 5.2 and 5.3. Ultrasound is a non-invasive imaging method which allows tracking of the tongue surface contour.
Previously, this process was not possible to perform in real time, which would be very beneficial for clinical and experimental applications. The algorithm developed here tracks 2D tongue shapes of vowel postures from ultrasound images in real time, as described in Section 5.4.

Model Validation. Validation of computational models is important to ensure correct and plausible results. In the abstraction chain, the geometry output of a model is tested for a given muscle activation. This thesis work introduces two general validation methods for physiological models, one using reference models and the other using images, as presented in Section 4.2. First, the interactive biomechanical tongue model, described in Chapter 4, is validated by comparison with the reference tongue model [96]. The range of feasible tongue motions is exercised with different muscle activations. Second, image-based validation is introduced, where the 3D tongue model is manually matched to 2D images by using plausible muscle activations. This method allows us to determine whether there is sufficient control in the tongue model to reach a pose, and if so, what muscle activations are involved. Here, examples are shown for vowel postures. The impact of various aspects of this thesis work has been demonstrated in the publications listed in Appendix A. The feasibility of the modeling framework has been demonstrated by collaborations with researchers Dang & Honda [54], Gerard et al. [96], Kuratate, Munhall, Rubin, Vatikiotis-Bateson & Yehia [141], Peck, Langenbach & Hannam [196], Rubin, Saltzman, GoldStein, McGowan, Tiede & C [223], and Birkholz [25], who successfully integrated their modeling approaches in the framework.

1.4 Demarcation of the Work
The work presented in this thesis is a new upper airway modeling framework to simulate the physiological processes of speaking, mastication, swallowing, and breathing. The related work and examples in the design focus on speech synthesis as one of the processes with the largest scope. The framework design and the modeling abstractions are created for physiological structures represented by biomechanical, parametric, and acoustic modeling techniques, while the model implementation and experiments are limited in scope to the biomechanical tongue model, representing the abstractions shown in Figure 1.2 from muscle activation to geometry. Other representations, such as intent, motor control, and acoustics, have been conceptualized but have not been implemented and are left for further development by other team members. This work does not set out to create new physiological models. Neither aeroacoustic nor aerodynamic methods or simulations are used, since we are investigating only shape modeling using biomechanics. Further, the focus of this work is the upper airway excluding the larynx. Such components will be addressed in future work. In this thesis work, speech tasks have been modeled with their biomechanic physiology, and the speech synthesis methods have been discussed to show that articulatory methods are consistent with the design approach, rather than concatenative or formant-based synthesis methods. Additional physiological functions such as swallowing, feeding, and breathing are not covered. These tasks may be able to be handled within the framework, but this is not the focus of this work. Modeling of other anatomical structures such as the face, jaw, and airway is not covered in detail, and only insofar as to show how they fit into the framework design. Finally, intent activation and control of models was not investigated beyond the muscle activation, meaning that higher-level functions such as motor control, neurophysiology, and cognitive mapping were not examined, as shown in Figure 2.1. However, this thesis work provides a foundation where these concepts can be tested.

1.5 Thesis Structure
Chapter 2 provides background theory and presents the related modeling research. The presented background theory entails an anatomy overview, a discussion of biomechanical modeling, and speech synthesis techniques. Existing vocal anatomy models are presented in a taxonomy in relation to their properties. Finally, data acquisition methods and extraction techniques to create and validate anatomical models are discussed. Chapter 3 describes the development and details of the novel modeling framework to integrate different existing and new model types. Technical requirements for this framework are described and exemplified through scenarios. The design details of this framework are presented in their functions and their component structure. Chapter 4 presents the implementation of the biomechanical tongue, validating the framework design presented in Chapter 3 by integrating both an existing model and a new one. Two procedures to validate particular models and judge their quality in an overall context are introduced. Chapter 5 presents relevant image data sets for vocal anatomy and demonstrates the process of automatic extraction to create anatomical shapes. Further feasibility studies are presented of the automatic extraction of vocal shapes from magnetic resonance and ultrasound images. Chapter 6 concludes the work with a summary, a discussion of the research contributions, and an outline of directions for future work.

Chapter 2 Background and Related Work
This thesis describes the creation of a simulator framework for upper airway structures. These structures are represented by computational models, which incorporate a priori knowledge of the physical anatomical system to recreate its functional behavior. It is crucial to understand the anatomical structures and their functional behavior to create a suitable computational model. This chapter discusses upper airway physiology, modeling methods, and existing simulation systems, and focuses in particular on speech representation to give a basis for the framework development. The presentation of background anatomy will be followed by a discussion and contrast of speech synthesis methods: (1) discrete time-domain synthesis, (2) formant-based synthesis, and (3) articulatory synthesis. Our focus, physiological modeling of the upper airway, is closely related to articulatory speech synthesis; it allows an explicit representation of human anatomy and produces speech sounds based on the deformation caused by movements of articulators such as the tongue, lips, jaw, or hyoid bone, and their excitation with the glottis signal. Section 2.3 discusses the main existing shape modeling methods, biomechanic and geometric, which are surveyed and organized in a taxonomy according to their properties. Further, Section 2.3.3 presents an overview of existing approaches for acoustic modeling, including airflow and aeroacoustics such as source-filter or Navier-Stokes equations.
This is important to show how mixing and matching different modeling approaches into one simulation may be used in order to build up a more complex model formulation. A working simulation system will find application in various domains, namely speech synthesis, face animation, and surgical simulation; therefore, we review the domain-specific modeling approaches and alternative simulation systems. Finally, we will discuss how realistic model shapes and material properties are determined from real-life measurements, namely medical imaging or point tracking.

2.1 Physiology of the Upper Airway
The human voice production apparatus is comprised of the vocal tract, or upper airway, and the lungs. A holistic picture of human voice production is schematically represented in Figure 2.1, organized by locations in the body, brain, and consciousness, with their corresponding functions (boxes), information flow (lines), and interactions (dashed lines). For this thesis, only the voice production functions below the motor control system, which are modeled by vocal tract anatomy, will be addressed. The vocal tract anatomy is referred to as the rigid and deformable vocal tract structures and air-filled cavities, as shown in a mid-sagittal view in Figure 2.2a, above the vocal folds (larynx), which are involved in sound production. Vowel sounds are produced by vibrations of the vocal folds and filtered by the cavities. Other speech sounds are consonants, and these are mainly produced through other sound phenomena, for example turbulence at cavity constrictions. In addition to sound production, the vocal tract is shared for feeding and breathing tasks. Some parts of the anatomy are functionally used for one task (for example, the esophagus for swallowing) and others are shared between all tasks (for example, the tongue). This thesis presents modeling methods to form an approximation of the upper airway anatomy in order to mimic its mechanical behavior. The model is constructed by building up components that mimic each individual anatomical entity in the physical system. Therefore, the complexities of the anatomical components and of the overall system are important for informing our modeling decisions. The anatomical information provided in this section has been compiled from Gray [102], Sicher [244], and Hiiemae & Palmer [107]. The functions and properties of relevant anatomical substructures and air cavities will be reviewed. At the end of this section, various biomedical materials relevant for simulation will be covered.

2.1.1 Anatomical Structures
The anatomical structures are grouped by their dominant material type, since many contain more than one material type. The structures are presented in the order of their stiffness from hard to soft: bone, cartilage, muscle and ligament, and tissue. Figure 2.1: Schematic voice production (motor cortex, motor nuclei, higher-order centers, sensory centers and receptors, vocal tract muscles and anatomy, voice), adopted from [111].

Bones of the Head and Neck
There are three hard structures of the upper airway: the skull, the hyoid, and the vertebral column, as shown in Figure 2.2b. The skull functions to protect the brain and eyes, and in addition forms the scaffolding of the upper airway. The skull is supported by the vertebral column and is composed of a series of immovably jointed bones, including the maxilla. An exception to the jointed bones of the skull is the mandible, which is movable. The mandible articulates with the skull at two joints that enable the mandible's opening and closing.
Both maxilla and mandible act as sockets for a series of upper and lower teeth that allow biting and chewing. The hyoid is a U-shaped bone held by a set of muscles attached to the skull. The hyoid is attached to the base of the tongue and sits on top of the larynx. The motions of the hyoid and mandible are the main exterior influences that shape the tongue. In addition, the hyoid is responsible for larynx lifting and lowering. The vertebral column is a flexible structure made up of a series of bones called vertebrae. The vertebrae are grouped by location as cervical, thoracic, lumbar, sacral, and coccygeal. The upper cervical vertebrae form the neck and provide physical support to the skull. The cervical vertebrae shape the back of the pharyngeal wall as a central structure of the upper airway. While the flexibility of the vertebral column is important for head movement, it remains predominantly rigid during speaking and swallowing tasks. This is an important consideration for modeling representations. For dynamic modeling of the upper airway, these structures are assumed to be rigid, which reduces the computational complexity of their representation. Nevertheless, deformable models for skeletal structures could be added later where required. Figure 2.2: Illustrations of (a) midsagittal vocal tract structures and cavities (underlined) and (b) head and neck bones (cranium, maxilla, hyoid, mandible, vertebrae).

Cartilage
Cartilage is a tough, flexible connective tissue that covers the ends of bones to form a smooth, shock-absorbing surface for joints, and results in very low friction. It is therefore important for keeping bone structures flexible. In the upper airway, the nose and larynx are structures containing cartilage. The larynx is a vocal structure located between the trachea and the pharynx and beneath the esophagus. It is made up of walls of cartilage, ligaments, and muscle, and contains the vocal folds, making it the primary organ of voice production. The cartilage structures of the larynx are the thyroid, corniculate, cricoid, cuneiform, arytenoid, and epiglottis. The largest cartilage of the larynx is the thyroid, which consists of two laminae fused at a sharp angle to form a subcutaneous projection named the laryngeal prominence, otherwise known as the "Adam's apple". The cricoid cartilage is smaller in structure, but thicker and stronger than the thyroid. It forms the lower and posterior parts of the laryngeal wall. The epiglottis is a thin flap of cartilage that projects upward behind the back of the tongue, in front of the entrance to the larynx. It functions to cover the entrance to the larynx during swallowing, preventing food and liquid from entering the trachea, as described in more detail by Sicher [244].

Muscles and Ligaments
This section discusses the main anatomical substructures, namely the tongue, velum, lips, and facial tissue, which contain muscles and ligaments.

Tongue
The tongue is comprised of a large bundle of muscle groups located on the floor of the mouth that manipulate food for chewing and swallowing. The tongue, with its wide variety of possible movements and shapes, is the major articulator used to produce speech sounds. The intrinsic muscles lie entirely within the tongue, while the extrinsic muscles attach the tongue to surrounding bone and cartilage structures, as shown in Figure 2.3. At a first approximation, extrinsic muscles reposition the tongue, while intrinsic muscles change the shape of the tongue.
Tongue physiology is still only partially understood due to its complexity. Recent work by Takemoto [266] produced a compelling physiological static model of the tongue, created from cadaver dissections, which shows the elaborate interwoven intrinsic muscle fibers. Details about the role of the tongue in different tasks have been addressed by Hiiemae & Palmer [107]. However, dynamic modeling of the tongue to replicate hypotheses and assumptions has not been performed due to the lack of a high-fidelity tongue model. Table 2.1 and Figure 2.3 present the extrinsic and intrinsic muscles of the tongue. The intrinsic muscles can often be controlled in subsections, labeled according to their anatomical location as anterior and posterior. These muscle definitions are the basis for the ICP biomechanical tongue model created by Gerard et al. [96], with the exception of the palatoglossus [140, 153], which is implemented and improved in Chapter 4. Figure 2.3: (a) Lateral view of the human tongue, with some extrinsic muscle attachments, and (b) coronal section of the tongue, showing intrinsic muscles. (Reproduced from Gray [102]) Table 2.1: A list of extrinsic and intrinsic tongue muscles (name; type; attachment; function):
Genioglossus; extrinsic; mandible; protrudes the tongue and depresses its center.
Hyoglossus; extrinsic; hyoid bone; depresses.
Styloglossus; extrinsic; styloid process; elevates and retracts.
Palatoglossus; extrinsic; anterior palatine arch; elevates.
Superior longitudinal; intrinsic; -; assists retraction or deviates the tip.
Inferior longitudinal; intrinsic; -; bulges side to side.
Verticalis; intrinsic; -; located in the middle.
Transversus; intrinsic; -; runs along the sides.

Velum
The velum, or soft palate, is the soft tissue located at the back of the roof of the oral cavity, as shown in Figure 2.2. The soft palate consists of tissue and muscles, but does not contain bone like the hard palate at the front of the oral cavity. The lower portion of the soft palate, called the palatine velum, can be moved to close the connection between the nasal and oral cavities. Hanging off the center of the soft palate is a conical process called the palatine uvula, shown in Figure 2.4, which plays a role in the shaping of vocal sounds [16]. Figure 2.4: (a) Midsagittal anatomical atlas of the upper airway. (b) Frontal view of the oral cavity with view of the tongue, palatine velum, and uvula. (Both reproduced from Gray [102])

Lips and Facial Skin
The muscles and tissue of the face, in particular the lips, are of great importance to vocal tract modeling, since they form the end of the vocal tract and change its shape for different speech postures. The face muscles and skin are the cover stretched over the mandible and maxilla. The shape of the lips is highly correlated to the jaw opening, and some muscles are used jointly for the face and jaw, e.g., the masseter. The muscles of the head, face, and neck are shown in Figure 2.5 (a), and the lip muscles are shown in Figure 2.5 (b).

2.1.2 Air Cavities
The main cavities in the vocal tract influencing sound production are the connected oral, nasal, and laryngeal cavities. Their location is represented schematically in Figure 2.2. They are defined by the airspace not occupied by the surrounding tissues. The connection of the nasal cavity to the other cavities can be closed off by the velum.

2.1.3 Materials and Couplings
In order to replicate the functional behavior of the vocal tract anatomy in simulations, it is important to understand the material properties of the different substructures. A good starting point is the literature on biomechanical materials by Fung [92] and Duck [67]. For speech tasks, much bony material can be approximated as rigid. However, other structures, such as the tongue, have complex material properties that require special considerations for tissue measurements (see Gerard, Ohayon, Luboz, Perrier & Payan [95]). Figure 2.5: (a) Muscles of the head, face, and neck. (b) Arrangement of lip muscle fibers. (Reproduced from Gray [102])

2.2 Principles of Sound Production and Speech Synthesis
Fundamentally, speech is caused by air flowing through the vocal tract in the following process. When a person exhales, the airflow may cause turbulence if the passageway constricts against the obstruction of the glottis and articulators in the vocal tract. During voiced speech, as in the case of vowel production, the airflow through the vocal folds is pulsed, but laminar. In contrast, during unvoiced speech, as in the case of some consonants (see Jackson [122], Shadle [237]), the airflow may become turbulent near constrictions formed by articulators such as the palate, teeth, lips, and tongue. Examples of articulator configurations for consonants are shown in Figure 2.6 (see Stevens [255]). Such turbulence causes air pressure modulations which are filtered by the oral and nasal cavities, and become audible sound waves after they exit the mouth and nostrils. Figure 2.6: Vocal tract configurations for four fricative cognate pairs: (a) /f/, /v/; (b) /s/, /z/; (c) /θ/, /dh/; (d) /sh/, /yogh/. Speech synthesis can be categorized as: (i) concatenation, (ii) formant, and (iii) articulatory synthesis. All of these techniques may use time-domain and frequency-domain representations. For example, when an articulatory model is used to determine the vocal tract geometry, the generation and the propagation of sound can be represented in the time domain or in the frequency domain (see Sondhi & Schröter [249]). The three speech synthesis types are listed with their properties and applications in Table 2.2 and are discussed in more detail below. Table 2.2: A comparison of existing speech synthesis methods (concatenation, formant [112, 135], articulatory [120, 166, 224]) with their applications:
Text-to-speech: concatenation gives good quality for a single speaker [26, 35, 119]; formant synthesis gives intelligible speech [7]; articulatory synthesis gives intelligible vowels [44].
Speech coding [158, 234]: concatenation offers good compression; formant-based coding is an open problem; articulatory coding is currently not used.
Analysis of speech production: concatenation and formant methods are limited; articulatory 3D models have shown initial success.
Audio-video synchronization: concatenation and formant methods are a poor match; articulatory synthesis has potential for a joint model.
Surgical planning: not applicable to concatenation or formant synthesis; articulatory models are emerging for head models [24].
Speech models are used in four main application areas: (1) text-to-speech synthesis (TTS), (2) speech coding, (3) speech recognition (not discussed further here, since integrating speech production and recognition is an even longer-term goal; Rabiner (1993) presents an overview of speech recognition), and (4) talking head animation.

Text-to-speech synthesis
Text-to-speech synthesis represents text in its phoneme structure, where all individual phoneme sounds are stored in a database. (A phoneme is a speech sound or utterance: any of the abstract units of the phonetic system of a language that correspond to a set of similar speech sounds [255].) During the synthesis process, text may be transcribed into a sequence of phonemes (or phones), diphones, or non-uniform units, which become, along with the database look-up process, an audible synthesized speech signal (see online sound example [174]). The acoustic transitions between phonemes are critical for producing intelligible natural speech.
11Speech sound, utterance: any of the abstract units of the phonetic system of a language that correspond to a set of similar speech sounds [255]. 17 Currently, the best sounding speech synthesizer uses concatenative models and a large number of stored sound samples such as Campbell [36], FreeTTS [90], MBROLA [162], and FestVox [85]. Concatenative synthesis, as presented by Isard & Miller [119], for text-to-speech translation, uses segments of the actual recorded speech signal. These segments are variable and can range in length from a single phoneme to a complete word. Usually the unit is chosen a diphone, which spans from the center of a given phoneme to the center of the next phoneme within a word. For each individual speaker, diphones are stored in a separate database which can range in megabyte size. During the synthesis process, diphones are sequenced and blended together to form a continuous speech signal. A drawback of this method is that sounds that are not not prerecorded cannot be synthesized. Further, not all diphones blended together result in natural speech that can be improved with longer stored segments such as triphones, which result in larger databases. Triphones are units that span three phonemes to provide a context-dependent representation. In general, the required segmentation process to create such a database is performed by humans, and is a labor-intensive process, as shown by Black & Taylor [26, 27], Sproat [250]. An automatic phonetic segmentation technique was reported by Pellom [197], which may simplify the process. Formant-based synthesis represents speech by its characteristic spectral structure using lters, presented by Rabiner, Schafer & Flanagan [209]. There are two main con gurations used: rst, the cascade model by Klatt [135] considers the spectrum as a cascade of 2-pole resonant lters. Second, the simple parallel model by Holmes [113] acts as the partial fraction expansion of the cascade model. The simple parallel model su ers from hypersensitivity to the amplitude controls. By pairing a correction lter with each resonator in the parallel con guration, the sensitivity problem can be solved. Given the correct parameters, the parallel formant-based synthesizer produces high quality speech at 6000 b/s. Unfortunately, the determination of the correct parameters automatically remains an unsolved problem. Formant-based synthesis is widely used in text-to- speech systems such as DECtalk [58]. Commonly, one set of formant parameters is created for each phoneme and stored in a small database. Using simple phoneme look-up and overlap-add techniques to blend the formant transitions produces intelligible speech (though still robotic sounding). Articulatory speech synthesis based on 2D models produces speech sounds based on the deformation caused by movements of the articulators such as the tongue, lips, jaw, or hyoid bone, and their excitation with the glottis signal. Simple two-dimensional articulatory speech synthesis 18 representations12 of shapes are built from one-dimensional lters, and a two-mass model of the vocal cords is presented [166]. In this case, shape corresponds in an equivalent representation to the diameter of a tube model, as shown in Figure 2.7. In the two-dimensional articulatory speech synthesis, shapes are commonly represented as contours of the mid-sagittal cut plane of a vocal tract [224]. The articulatory approach has been since applied to the singing voice by Cook [46], Kob [136], Lu [152], using still 1-2D representations. 
Extensions to this approach are applying rule-based approaches by Rabiner [210] to articulatory speech synthesis [110, 303]. The waveguide vocal tract model by [171] presents an extension of shape and acoustics in two dimensions. Figure 2.7: One-dimensional Tube Model Articulatory speech synthesis using 3D models allows close correspondence to the human anatomy. The vocal tract shapes can be veri ed with medical image data, and knowledge within the elds of articulatory phonology and medicine can link physical observation to models directly. Furthermore, the decomposition of the model in anatomical substructures is very intuitive. Anatomical and articulation di erences are directly parametrized in the model to represent di erences, such as age, gender, and prosody. Recently articulatory methods introduced by Story & Titze [260, 261] and Badin, Bailly, Raybaudi & Segebarth [18], Birkholz [25] simulate three-dimensional shapes. There are multiple methods to represent anatomy and animate vocal tracts: (1) physical modeling methods, such as tissue and muscles, (2) data driven models, and (3) recorded frames. Articulatory synthesis models have the potential to produce natural sounding speech at low bit rates approaching the theoretical limit [268, page 45], since speakers' anatomy con gurations are xed and movement of the articulators change in the order of 10Hz. Estimates for the number of critical articulators vary from 3 to 10 [222], which set the limit given a 10 bit resolution in the range of 300-1000 bits per second (b/s). This synthesis method is computationally more expensive than the other synthesis methods, but with the increase of available computation it becomes a plausible choice. 12Historically, articulators is presented in a two-dimensional image, which refers to two-dimensional synthesis even if the underlying filter model is only a one-dimensional linear filter. 19 Speech Coding Speech coding is used to achieve a low-bandwidth representation of the speech signal for transmission and recording applications. Over time, speech coding methods evolved with improved compression factors. Here are some time-domain methods with increasing complexity. The simplest time-domain representation is the waveform data itself, which is essentially, a signal amplitude encoded over time. This can be encoded using Pulse Code Modulation (PCM), which directly represents a ltered and sampled digital voice signal. A more sophisticated encoding called Linear Prediction Coding (LPC) was introduced by Atal & Hanauer [13]. This represents speech as an auto-regressive (AR) process. The prediction error is called the residual. The LPC coecients together with the residual can be used to resynthesize original speech. The LPC coecients de ne the lter and can be interpreted using various transformations in many ways, including a ladder lter circuit, spectral envelope, partial correlation coecients (PARCOR), and tube area functions, as shown by Markel & Gray [158]. Taken as a source plus lter model, the residual represents the glottal source. If the linear model truly represents the underlying speech process, it is expected that the residual would be either a white noise source (that is, when the vocal cords are not vibrating) or a periodic pulse train (that is, when the vocal cords are vibrating). This decomposition provides for further compression of the residual that drives the LPC lter. 
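To make this decomposition concrete, the following minimal sketch (in Python with NumPy; the frame, model order, and sampling rate are illustrative placeholders rather than values used in this work) estimates the LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion, and obtains the residual by inverse filtering.

import numpy as np

def lpc_levinson(frame, order):
    # Autocorrelation method: r[0..order], then the Levinson-Durbin recursion.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection (PARCOR) coefficient
        a_new = a.copy()
        a_new[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)                # remaining prediction error power
    return a, err

# Toy frame: a decaying sinusoid standing in for a windowed speech frame.
fs = 8000
t = np.arange(240) / fs
frame = np.sin(2 * np.pi * 150 * t) * np.exp(-20 * t)
a, gain = lpc_levinson(frame, order=10)
residual = np.convolve(frame, a)[:len(frame)]   # inverse filtering with A(z) yields the residual

Resynthesis would run the residual, or a coded approximation of it, back through the all-pole filter 1/A(z), which is exactly the source-plus-filter reading described above.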
For very-low rate encodings of speech, switching between a white noise source and a periodic pulse train allows for speech coding at less than 1000 b/s as introduced by Atal & Remde [14]. Such speech encoding is intelligible, but not at all natural. A compromised encoding of the residual was proposed by Atal & Schr•oder [15]. Schr•oder & Atal [234] used a codebook of random waveforms to represent the residual. The representation of the residual is determined by the code index of the random waveform that matches, after resynthesis, most closely with the speech from the original residual. The encoding is called Codebook Excited Linear Prediction (CELP), which can synthesize high quality speech at 4800 b/s, and is used in such applications as cellular phone voice encoding. A speech coding method is Pitch Synchronous Overlap Add Method (PSOLA) by Valbret, Moulines & Tubach [282], which decomposes the speech signal into elementary waveforms of each one-pitch period. The original signal is reconstructed by these waveforms using a summed with overlap add methods. In contrast, the Waveform Similarity Overlap Add (WSOLA) algorithm by Verhelst & Roelands [284] used similar methods to PSOLA with short-time Fourier transform representation. 20 Talking Heads The modeling of vocal tracts has much in common with facial animation, even though face models mainly describe the curved surface of the skin with the goal of looking real, rather than sounding real. Both face and vocal tract models meet geometrically at the mouth, but have more in common through shared anatomical substructures. One research area that applies both speech and face models together is talking heads [226]. Here, primarily synthesized or recorded speech is combined with three-dimensional facial animations. For this application, audiovisual synchronization is very important since speech perception and understanding depends on it. The e ect of audiovisual misalignment goes so far that utterances can be misunderstood, which is shown in the McGurk e ect by McGurk & MacDonald [163]. One approach to solving the audiovisual requirement is to synchronize audio and video with algorithms implemented by Albrecht, Haber & Seidel [6], Cohen & Massaro [43], Hill, Pearce & Wyvill [109], Hill et al. [110], Uz, Gueduekbay & Ozguc [281]. The other approach, suited for the articulatory domain, is to combine speech and face models. Since the face and vocal anatomies are physically interconnected, the required synchronization happens automatically. Other research implemented additional control for nonverbal expressions of the face such as the ones by [5, 127, 147, 168]. Creating a combined model of face and vocal tracts allows, for example, for di erentiation with perception tests between coarticulation and emotions. 2.2.1 Summary In summary, articulatory synthesis shows great potential to gain better understanding of speech production and eventually to create natural sounding speech synthesis. The articulatory methods link anatomy and speech closely together with the opportunity of in-depth modeling of anatomical structures for areas where using computation is not a constraint. Current articulatory research primarily focuses on observations and data to describe the shape and motion of articulators. This approach does not separate the contributions of underlying physiology and motor control, which is the motivation to create physics-based models. This may be a way to answer many open research questions such as origin prosody and coarticulation. 
Concatenative synthesis is very suitable for low-bit-rate encoding and high-quality text-to-speech synthesis based on a single-speaker database. Formant-based synthesis is currently the leading method for low-complexity text-to-speech synthesis. The next chapter discusses the existing research, which serves as a basis to build a three-dimensional articulatory speech simulator.

Articulatory phonetics is the field of linguistics that describes the articulators and air cavities and their functions in the speech production process. Results from this field are important to model higher-level control for speech tasks. In the following we discuss and contrast suitable modeling techniques, mainly for shape modeling, but also for acoustic modeling.

2.3 Underlying Methods for Articulatory Synthesis

Articulatory synthesis handles both the shape and the acoustics of the vocal anatomy, which can be represented with a multitude of modeling techniques. The modeling taxonomy represents both model domain and model type. The model domain represents the physical phenomenon or material property. A subset of model domains, namely rigid, deformable, and fluid, can be found in the physics-based animation or engineering packages discussed in Section 2.6. A combination of these domains, including sound and hybrid models, does not yet exist, but is important for representing vocal tract phenomena. Model type, shown in Figure 2.8, refers to the mathematical or representational methods of the model. Some anatomy model representations, such as geometric, physics-based, and measurement-based models, have developed in related research areas such as computer animation. In contrast, other methods, such as acoustic and inverse models that are common in speech research, may have little application in other research areas. The creation of this taxonomy organizes the properties of modeling methods and enables placement of new methods in the context of existing ones. While the number of methods to choose from might be overwhelming, we will review the biomechanical (physics-based), measurement-based, and acoustic methods which have been applied to articulatory speech synthesis. A discussion of other methods can be found in the review literature by Mase & Mase [159], Mortenson [170], Nealen, Müller, Keiser, Boxerman & Carlson [178], Rienstra & Hirschberg [219], White [299].

Figure 2.8: Taxonomy of modeling methods for articulatory synthesis. The model types are geometric (line segments, contour, surface, spline, trajectory, reduced shape), physics-based or biomechanic (Lagrangian mesh-based: rigid-body, spring-mass, finite element, finite difference, finite volume, boundary element; Lagrangian mesh-free: loosely coupled particle, smoothed particle hydrodynamics; Eulerian and semi-Lagrangian: grid-based finite volume; reduced dynamics: modal analysis, linear, subspace integration), measurement-based (marker-based, image-based), acoustic (source-filter, aerodynamics, aeroacoustic), and inverse models.

2.3.1 Biomechanical Techniques

Biomechanical modeling techniques are commonly used to simulate the function of anatomical structures, in order to mimic their functional behavior and to condense knowledge in computational models. Biomechanical approaches have the advantage of providing a controlled environment to explore what-if scenarios of particular anatomical structures and, if correctly connected to adjacent biomechanical models of other anatomical structures, to represent the joint behavior.
Speech researchers have applied biomechanical techniques to the speech anatomy: to the larynx by Ishizaka & Flanagan [120], to the articulators by Perkell [199], and to faces [296]. Although these structures have historically been simulated separately, the modeling techniques are for the most part very similar. Primarily, bone, tissue, and muscle are represented by rigid body models and by the deformable modeling techniques: mass-spring models, finite element models, and boundary element models. These modeling methods and their tradeoffs are discussed next. This thesis work investigates mass-spring and finite element models, but provides an adequate interface for other models in the taxonomy to be easily added.

Mass-Springs

The mass-spring method, sometimes called the particle method, is a discrete method which approximates the properties of elastic bodies with point masses connected by a network of springs, as shown in Figure 2.9. Each spring in the network follows Hooke's law, which leads to a set of linear equations stating the relationship between external force and displacement [213]. The chosen discrete time step size determines a tradeoff between speed of computation, stability, and accuracy [198].

Figure 2.9: 2D mesh of (a) a mass-spring model and (b) a finite element model in the plane.

These are the advantages of mass-spring models:
- They provide easy and coherent construction of the spring network.
- They are real-time capable, that is, they update the state vector quickly due to their low complexity.

The following are the disadvantages of mass-spring models:
- Modeling behavior and material properties are very sensitive to the network topology and state.
- Stiffness of the mesh is limited by its density and the simulated time step.
- They are not stable for all initial values, which can be improved with the implicit Euler method.

In addition to tissues, mass-spring methods have been applied as point-to-point muscle models, which allow complex muscle behavior with nonlinear springs, for example the Hill-type muscle model [108]. These point-to-point muscles have been applied in combination with rigid bodies to model the jaw by Peck et al. [196], the tongue by Dang & Honda [54], and skeletal structures by Pandy [191].

Finite Elements

The finite element model (FEM) is one of various solid mechanics methods created to extend rigid body dynamics to handle internal deformation. FEMs, as shown in Figure 2.9 for a 2D example, have been applied to speech by Deng [60], Shiraki & Honda [243], Stone [256], de Vries [293] and to faces by Bro-Nielsen [29], Koch, Gross, Carls, von Buren, Fankhauser & Parish [137]. The strain energy E of a linear elastic body can be written as follows:

E = \int_\Omega \sigma^T(x)\, \epsilon(x)\, dx    (2.1)

where \sigma is the stress and \epsilon is the strain. A simulated body is divided into discrete elements, and the simulation results in a linear system of N equations which can be solved with a linear solver. More precisely, the time behavior is evaluated by integrating a second-order dynamics equation of the form:

M(x)\, \ddot{x} = f(x, \dot{x}, t)    (2.2)

where M is the combined mass matrix for all dynamical components, \ddot{x} is the combined computed acceleration, x is the combined dynamical component position, \dot{x} is the combined component velocity, f is the total generalized force acting on all the components, and t is time. Upper airway tissues and muscles, such as the tongue, face, and soft palate, behave largely nonlinearly under large deformations, and the use of the linear formulation therefore produces large errors in these cases, as formulated by Fung [92] and Duck [67].
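As an illustration of how lumped models of the form of Equation 2.2 are advanced in time, the sketch below steps a small one-dimensional mass-spring chain with semi-implicit Euler integration; the masses, stiffness, damping, and time step are arbitrary illustrative values, not parameters of any model in this thesis.

import numpy as np

def step_mass_spring(x, v, m, k, c, rest, dt, f_ext):
    # One semi-implicit Euler step for a 1D chain of point masses joined by springs.
    f = f_ext.copy()
    for i in range(len(x) - 1):
        stretch = (x[i + 1] - x[i]) - rest[i]   # Hooke's law on each spring
        fs = k * stretch
        f[i] += fs
        f[i + 1] -= fs
    f -= c * v                                  # simple viscous damping
    v = v + dt * f / m                          # update velocities first ...
    x = x + dt * v                              # ... then positions (semi-implicit Euler)
    return x, v

# Toy chain of five masses with the first mass pinned as a boundary condition.
n = 5
x, v = np.linspace(0.0, 4.0, n), np.zeros(n)
m, k, c, dt = 0.01, 50.0, 0.05, 1e-3
rest = np.full(n - 1, 1.0)
for _ in range(1000):
    f_ext = np.zeros(n)
    f_ext[-1] = 0.2                             # constant pull on the last mass
    x, v = step_mass_spring(x, v, m, k, c, rest, dt, f_ext)
    x[0], v[0] = 0.0, 0.0                       # pin the first mass

The choice of dt exposes the stability and accuracy tradeoff noted for mass-spring models above; an implicit Euler step would tolerate stiffer springs at the cost of solving a linear system at every step.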
Alternative finite element formulations with nonlinear stress-strain relationships require nonlinear solvers, which are slower to compute but more accurate. Some choices of finite element formulation are listed in Table 2.3. In particular, recent work by Müller & Gross [175] has introduced fast quasi-linear FEM as an interactive method which may be suitable for tissue modeling without a great loss of accuracy (see more in Chapter 4).

Type | Reference | Material | Solver | Interactive
Linear | [305, chap. 1] | lin | lin | yes
Warping | [175] | lin | lin | yes
Volume Pres. | [118] | lin | lin | yes
Large deform. | [305, chap. 3] | lin | nlin | no
Hyperelastic | [8, theory chap. 4.6] and [245] | nlin | nlin | no

Table 2.3: A selection of linear (lin) and nonlinear (nlin) finite element formulations for tissue modeling.

The advantages of finite element models in general are as follows:
- They allow for a continuous linear or nonlinear mechanics basis.
- They have a large dynamic range, as compared to mass-spring models.
- They represent material properties in physical quantities.

The following are the disadvantages of finite element models:
- They are computationally expensive, and therefore traditionally difficult to simulate in real time.

Boundary Elements

The Boundary Element Method (BEM) describes a deformable object by a boundary integral equation formulation of static linear elasticity. This method has good real-time performance, as shown by James & Pai [123], for homogeneous elastic objects by using a discrete surface description, as shown in Figure 2.10.

Figure 2.10: Boundary element methods provide an equivalent representation of a deformable free-space object using surface dynamics.

Three main steps are required to implement the BEM: (1) discretize the boundary elements, (2) apply the integral equations at each of the boundary nodes, and (3) apply the boundary conditions of the desired boundary value. The tradeoffs of this method, according to Bro-Nielsen [29], are as follows:
- There is very good real-time and low-latency performance in three-dimensional simulations.
- Only quasi-uniform body properties can be modeled.

Rigid Bodies

Rigid body models have been applied in speech tasks to model skeletal structures such as the jaw and skull [230]. Rigid models may be sufficient to mimic macroscopic behavior and are more efficient to compute than deformable models. In summary, there are a number of physics-based modeling methods which are suitable to model anatomical structures. In the following, we will discuss other common methods that are measurement-based, geometric, and acoustic.

2.3.2 Measurement-based Techniques

Measurement-based modeling techniques use data of the physical properties of the structures being modeled. Such data include sound recordings, medical images, and point tracking data. Since vocal tract behavior cannot be captured with a single measuring modality in its full spatial and temporal extent (as will be shown in Section 2.4.1), a variety of modalities have historically been used. Applying measurement-based models in simulations allows the study of shapes, motions, and sounds to validate other models by comparison. For example, Yehia & Tiede [304] reconstructed a three-dimensional vocal tract model from MRI data to validate vowel vocal tract shapes produced by an inverse model, as shown in Figure 2.11a. Similarly, using ultrasound imaging, Akgul, Kambhamettu & Stone [1] extracted three-dimensional tongue surfaces for vowels, providing geometry models.
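As a toy illustration of turning image measurements into geometry, the sketch below picks one surface point per image column at the strongest vertical intensity edge, assuming a NumPy image array; it is meant only to convey the measurement-to-geometry step and is not the extraction algorithm of the work cited above or of the one developed later in this thesis.

import numpy as np

def crude_surface_contour(img, min_row=0):
    # One surface row index per column: the largest vertical intensity gradient.
    grad = np.diff(img.astype(float), axis=0)   # difference between neighbouring rows
    grad[:min_row, :] = -np.inf                 # ignore the near field
    rows = np.argmax(grad, axis=0) + 1          # strongest dark-to-bright transition per column
    cols = np.arange(img.shape[1])
    return np.stack([rows, cols], axis=1)       # (row, column) point per column

# Synthetic test image: a dark background with one bright, slowly varying band.
img = np.zeros((128, 64))
for c in range(64):
    r = int(60 + 10 * np.sin(c / 10.0))
    img[r:r + 3, c] = 1.0
contour = crude_surface_contour(img, min_row=5)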
In the context of modeling anatomical structures, these extracted geometry may be match matched [50] to a dynamic model atlas that includes muscle details to create a model of an individual subject. As a new measurement-based technique developed in this thesis work, Section 5.4 describes a real- time 2D ultrasound tongue extraction algorithm to interactively drive computer-based models. This real-time extraction and interaction can be used to investigate whether ultrasound tongue imaging is sucient for vocal modeling. Perceptual measures allow the interactive comparison of synthesized and articulated speech. An example of the tracking results is shown in Figure 2.11b. Linking medical image data into the animation process enables the evaluation of two stages of production: (a) deformable shapes, and (b) an acoustic output signal. At each stage, data and modeled sources can be compared by creating perceptual test conditions for verifying models and parameters. 27 (a) (b) Figure 2.11: Vowel postures in (a) Magnetic Resonance Imaging (MRI) by Engwall & Badin [75] and (b) Real-time ultrasound tongue tracking by Vogt et al. [291] Key Frame Animations are techniques using manually created input, similar to measurement- based techniques. These techniques are used in computer animation of movies to define the pose of a geometric model for particular points in time. Intermediate poses are calculated by interpolation of adjacent key frames. In this way, it is easy for an animator to achieve a desired motion. Reduced shape models describe the collective behavior of measured data. Here, statistical techniques such as principle component analysis (PCA) and independent component analysis (ICA) reduce multidimensional data sets to lower dimensions of shape changes. Examples include principle component analyses by Badin et al. [18], active shapes by Cootes, Hill, Taylor & Haslam [48], active snakes by McInerney & Terzopoulos [164], and geometrical models by K¨ahler, Haber, Yamauchi & Seidel [144]. Most data-driven techniques have the advantage of being computationally fast, assuming that access to the data storage is fast. A disadvantage of these techniques, observed here in this thesis work, is that they do not allow linking of multiple independently-developed models, while physics-based models permit bidirectional interaction. However, the data can be used to infer the parameters used in physical models. 2.3.3 Sound Production Techniques Modeling acoustic phenomena from air flowing through the vocal tract requires integrating results from aerodynamics with acoustical properties of a soft-walled, time-varying tube that has multiple, 28 moving obstructions, some of which vibrate. Three general approaches model the aerodynamic processes beyond the methods described in Section 2.3.4, which are described below. First, Sinder [247] and de Vries [293] studied the aerodynamic properties of the vocal tract. Sinder [247] reports the simulation of fricative sound sources with a two-dimensional jet model. The numerical simulation used the Navier-Stokes equation with 20000 nodes, and matched a measured physical acoustic signal to 2500Hz. de Vries [293] modeled the glottis noise with a symmetric two mass-spring model for the vocal folds and a 3000 node aerodynamic model using the Navier-Stokes equation in order to match the glottis signal for either gender. 
Fedkiw, Stam & Jensen [79] reported a simpli cation of the Navier-Stokes using a special implicit Euler integration, which was developed for computer graphic simulations of virtual smoke, which shows the potential for being adapted for vocal simulations. Second, Pothou, Huberson, Voutsinas & Knio [205] and Shadle, Barney & Davies [238] simulated particle models to approximate the vocal tract using basic geometry. Using a better shape model, their work shows the potential for being extended more closely to the complete vocal tract. In addition, their approach may be complemented with a localized frication model to lead to a more complete speech model. Third, Vorl•ander [292] and Schmitz, Vorl•ander, Feistel & Ahnert [233] presented ray tracing algorithms to model sound propagation for room acoustics. The application of this approach to vocal tracts holds promise whereby pressure waves would be computed directly. However, scaling down of geometry may have near-far- eld implications. These methods describe aerodynamics models which produce pressure waves; to produce acoustic sound waves requires a second aeroacoustic model conversion. In addition, quasi linear sound sources described with wave equations could be replaced with nonlinear sources by Howe [114], Rienstra & Hirschberg [219]. Nonlinear sources can be approximated using noise production following the Lighthill analogy [148] or the Lighthill-Curle analogy [52, 149], which describes the noise production of sources of turbulence. Recent research presented promising results for sound e ect synthesis by Dobashi, Yamamoto & Nishita [61, 62] and vowel synthesis by van den Doel & Ascher [63] using these principles. 29 2.3.4 Modeling Vocal Tracts and Faces So far, we have described several modeling techniques. This section will describe particular techniques which are organized by the relevant anatomical structures. As mentioned earlier, the speech process comprises a source{the glottis, and a lter{the vocal tract. For this research, we extend the components of speech production further to include the face and the head. In the following, we discuss the most in uential research for both parts, along with the modeling techniques. In addition to the vocal tract, face models are discussed here, because they have a strong coupling with vocal tract motion due to the close link of anatomical structures of both the face and vocal tract. Articulatory models Articulatory models describe vocal tracts with their de ning anatomical structures with (i) acoustics, (ii) geometry, and (iii) biomechanics. These models can be parametric or non-parametric. Acoustic models are often one-dimensional where as shape descriptions using geometry and biomechanics models can be two-dimensional or three-dimensional. Generally, in order to model the resulting acoustics, these shapes are approximated as about 10 to 15 tube segments, each with a di erent cross-sectional area [70]. From this model, the area functions of the vocal tract are used to calculate its transfer function [77], which can excited with a source model to produce vowels. Fricative production was investigated by Jackson [122], Shadle [237] using 1D acoustics models. The source- lter model represents a one-dimensional wave propagation. The wave equation solutions are simulated using digital lters. 
Based on the conservation of momentum and mass, the governing equations are:

-a(x)\, \frac{\partial P(x,t)}{\partial x} = \rho\, \frac{\partial U(x,t)}{\partial t}    (2.3)

-\frac{\partial U(x,t)}{\partial x} = \frac{a(x)}{\rho c^2}\, \frac{\partial P(x,t)}{\partial t}    (2.4)

where a(x) is the cross-sectional area of the tube at position x, \rho is the density of air, P(x,t) is the pressure at point x at time t, c is the sound velocity in air, and U(x,t) is the volume velocity at point x at time t. From this, Webster's horn equation can be derived as:

\frac{\partial}{\partial x}\left( \frac{1}{a(x)}\, \frac{\partial U(x,t)}{\partial x} \right) = \frac{1}{c^2 a(x)}\, \frac{\partial^2 U(x,t)}{\partial t^2}    (2.5)

When a(x) is constant within a given section m of the tube, that is a_m(x) = a_m, the equation for each section reduces to:

\frac{\partial^2 U_m(x,t)}{\partial x^2} = \frac{1}{c^2}\, \frac{\partial^2 U_m(x,t)}{\partial t^2}    (2.6)

Using a decomposition of this equation into two traveling pressure waves allows us to formulate the junction scattering relations known as the Kelly-Lochbaum junction [130]. In the filter equivalent by Gray [101], the underlying shape drives the resulting tube diameter parameters. This model assumes quasi-static temporal behavior, since the tongue parameters change slowly relative to acoustic rates.

Many researchers use two-dimensional geometry models to determine the acoustic properties of the vocal tract, including work by Rubin et al. [224], Sondhi [248]. Rubin et al. [224] defined six articulatory parameters to control a mid-sagittal cross-section as a two-dimensional representation of the vocal tract. The resulting 2D vocal tract shape is sectioned into area functions that are one-dimensional for an acoustic model of sound propagation. This articulatory synthesis model is effectively a sequence of two-dimensional segments, and ignores many important properties of the vocal tract, such as the parallel nasal tract and 3D vocal tract shapes, for example for /L/ sounds. Maeda [156] based his two-dimensional statistical shape model on X-ray images. The resulting vocal tract shapes are approximated with area functions to model the acoustics. Similarly, to model singing, Cook [47] developed a 2D parametric geometry model of the vocal tract and used a 1D waveguide model to model the acoustics.

Three-dimensional geometry models have been used to describe vocal tract shapes. For example, Badin et al. [18], Engwall [71, 74] modeled the vocal tract using parametric shape descriptions, Stone [256] modeled vocal tract shape based on MRI reconstructions, and Engwall [71], Shirai & Honda [242], Stone, Davis, Douglas, NessAiver, Gullapalli, Levine & Lundberg [257] modeled parametric tongue shapes. Ultimately, these parameters are still used to determine an area function so that speech can be synthesized using 1D acoustic models. The articulators, in particular the tongue, have also been modeled using biomechanics. Modeling approaches for the tongue include 2D finite elements by Payan & Perrier [195], 3D mass-spring models by Dang & Honda [54], and 3D finite elements by Wilhelms-Tricarico [301] and Gerard et al. [96].

Representation of the Glottis

Overall, much speech research focuses on glottis models as a source. The most influential work models the glottis as an externally driven two mass-spring system by Flanagan & Ishizaka [87]. While this model gives good approximations to vocalized sounds, it ignores the turbulence effects of airflow which produce sound without vocal cord vibration, such as when a person whispers. Other airflow models use volume velocity representations, since they translate easily into area functions for one-dimensional models. Recent research by Titze & Story [276], de Vries [293] implements two different models for the glottis: mass-spring models and finite element models.
Furthermore, the research compares these in conjunction with a one-dimensional lter model for vowels. Another physics-based model by Drioli [66] reduces the model complexity by matching the ow waveform. Instead of modeling, there are other ways to represent the glottis in a simulator: (1) recorded data from a glottis electrogram or high speed video analysis by Sakakibara, Konishi, Kondo, Murano, Kumada, Imagawa & Niimi [228] can be useful sources, (2) analytical waveforms, such as Rosenberg [221], or a switched noise/pulse train by Atal & Remde [14]. Overall, there is a variety of glottal representations, and without comparison it is hard to say which glottis model is better for vowel and consonant production. The proposed framework allows for the investigation of the tradeo s between the di erent glottal models in the context of a complete vocal tract model. For example glottal models may be coupled and a ected by the tract. Face Models Di erent types of face models have developed over time: (1) control point animation using key frames, (2) parametric face models, and (3) physics-based models. All three model categories are still in use for di erent applications and t in the model taxonomy presented in Figure 2.8. The history of facial animation presented by Parke & Waters [194] only developed over the last thirty years. The rst computer-generated face images by Parke [192] used a polynomial representation for the head, eyes and lips using key frames, which were extended by Parke [193] into the rst complete parametrized face model. Using three-dimensional representations Platt [203] developed a physics-based muscle-controlled facial expression model. Waters [296] controlled the face by an underlying muscular structure. Hill et al. [109] reported the rst automatically synchronized speech and facial animation. The development of the Cyberware optical laser 32 scanner [53] provided a better face data acquisition. Lee, Terzopoulos & Waters [146] described techniques to map individual face scans onto a canonical representation known as facial attributes. Cohen & Massaro [43] presented a parametrized face and speech model. Much of the current research, such as work by Guenter, Grimm, Wood, Malvar & Pighin [103], K•ahler et al. [144], Turner [280], focus on the production of photo-realistic face animations in real-time. Another use for facial animations developed for preoperative simulations in the area of medicine. Researchers in this area achieved preoperative [60, 137] simulated incisions, while Pieper [202] investigated wound closures. Recently, work by Bettega et al. [24] developed a simulator to predict facial deformations after structural facial surgeries. The application of a simulator that includes speech would open up many possibilities for speech surgery. All the facial animation mentioned here uses a variety of approaches for tissue and muscle models, which are discussed in the next section. Articulatory Phonetics Articulatory phonetics describes the process of articulatory speech production of anatomical structures in linguistics, which provides a basis for the underlying model. Further, the resulting postures are applied as mental models in training of speaking and singing by LaBou [145]. In the speech production view, the vocal tract consists of a set of resonating air cavities which act like lters shown by Browman & Goldstein [30], Fujimura [91], as shown in Figure 2.2, bounded by anatomical structures comprising of bone, tissue, and muscles. 
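The filter view of these cavities can be made concrete with the concatenated-tube formulation from Section 2.3.4: an area function yields Kelly-Lochbaum reflection coefficients, and these define an all-pole filter whose resonances are the formants. The sketch below is a lossless, idealized version in Python with NumPy; the tract length, sampling rate, and uniform area function are illustrative values only.

import numpy as np

def tube_to_allpole(areas, a_rad=1e6):
    # Glottis-to-lips area function -> all-pole polynomial A(z) via reflection
    # coefficients and the step-up (Levinson) recursion.
    a = np.append(np.asarray(areas, float), a_rad)        # open lip end as a very large area
    k = (a[1:] - a[:-1]) / (a[1:] + a[:-1])               # Kelly-Lochbaum reflection coefficients
    A = np.array([1.0])
    for ki in k:
        A = np.append(A, 0.0) + ki * np.append(0.0, A[::-1])
    return A

fs, c, length = 8000.0, 350.0, 0.17                        # sampling rate, sound speed, tract length
n_sec = int(round(length / (c / (2 * fs))))                 # one tube section per half sample of travel
areas = np.ones(n_sec)                                      # uniform tube: a neutral, schwa-like tract
A = tube_to_allpole(areas)
w = np.linspace(0.01, np.pi, 2048)
H = 1.0 / np.abs(np.polyval(A[::-1], np.exp(1j * w)))       # magnitude of 1/A(e^{jw})
idx = np.where((H[1:-1] > H[:-2]) & (H[1:-1] > H[2:]))[0] + 1
formants_hz = w[idx] * fs / (2.0 * np.pi)                   # odd multiples of about 500 Hz for the uniform tube

Replacing the uniform areas with a measured area function moves these peaks to vowel-dependent formant positions.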
Sounds are produced in several locations and in multiple ways in the human vocal tract. The main sound sources are (1) the quasi-periodic vibrations of the vocal cords which act like a pulsing sound source and (2) the turbulent noise originating in narrow constriction of air passages [122, 237], mainly in the oral air cavity, called frication, as shown for four sounds in Figure 2.6. Less often, sounds originate from plosive pressure releases of air, which follow pressure increase due to a vocal tract obstruction. Implosions, which follow pressure decrease and clicking sounds, originate from lip separations and from the tongue separating from the palate. The oral cavity is bound by the tongue blade and the body on its lower surface and by the hard and soft palates on its upper surface. The pharyngeal cavity is bound by the root of the tongue ventrally, and by the pharynx wall dorsally. The nasal cavity can be coupled into the oral cavity when the velopharyngeal port is open. 33 The vocal tract is bounded by hard and soft tissue structures. These structures are either essentially rigid (such as the hard palate and teeth), or movable. The movable structures associated with speech production are also referred to as articulators. The tongue, lips, jaw, and velum are the primary articulators; movement of these articulators appears to account for most of the variation in vocal tract shape associated with speaking. However, additional structures are capable of motion as well. For instance, the larynx can be moved up or down to shorten or lengthen the vocal tract. A downward motion of the larynx, while the oral and nasal cavities are occluded, can produce the negative pressure in the vocal tract needed for the implosive sounds of some languages. The described speech production process is the basis for modeling the anatomical structures that produce speech. The next sections describe the modeling process and methods. 2.4 Imaging and Tracking Background In this section, we discuss rst the major medical imaging and tracking techniques which are applicable to vocal tract modeling. Second, we review image and mesh processing techniques which are relevant to process image and tracking data for vocal tract simulations. 2.4.1 Comparison of Image Modalities (a) (b) (c) Figure 2.12: Examples of medical images of the upper airway: (a) mid-sagittal MR image of head with high sampling rate/low resolution (b) ultrasound image of the mid-sagittal tongue surface, (c) mid-sagittal CT image of the author's head. Medical imaging allows the measurement of anatomical structures, their shapes, and motion. Imaging has been developed for clinical practice with noninvasive techniques being used for 34 diagnostics. Examples of medical images for the upper airway are shown in Figure 2.12. Existing medical imaging techniques perform di erently in terms of spatial resolution and imaging rate. In general, X-ray-based methods are superior for bone and cartilage imaging, magnetic- and sonic- based imaging are better at muscle and soft tissue imaging. Table 2.4 presents a summary of imaging techniques which are based on data from Prince & Links [206] and Whalen, Iskarous, Tiede, Ostry, Lehnert-LeHouillier, Vatikiotis-Bateson & Hailey [298]. Further, the table shows properties of motion and shape capture systems from manufacturer speci cations. 
Motion capture systems have become popular in speech research, using infrared tracking systems such as Vicon [285] or NDI Optotrak [187], as well as general-purpose magnetic tracking systems such as the Polhemus Fastrak [204], and custom speech solutions such as a magnetometer (EMA) [39]. A tracking solution custom-built for speech research is the X-Ray Microbeam system [297]. In manufacturing and computer graphics, 3D optical scanners have been developed to reconstruct surfaces, such as the Cyberware scanner [53]. These motion capture systems are used in speech research on their own and, recently, in combination with medical imaging methods, for example with ultrasound by Whalen et al. [298]. This combination may overcome some measurement limitations, such as the possible subject positions, and serve particular research needs, such as tracking the head position.

System | Dimensions | Sampling Frequency | Spatial Resolution
X-ray Imaging (film-based) | 2D | static | 0.05 mm
Video Fluoroscopy | 2D | 60 Hz | 1 mm
Computed Tomography | 3D | 0.1 Hz | 0.2 mm
Static Magnetic Resonance Imaging | 3D | 0.1-2 Hz | 1 mm
Cine Magnetic Resonance Imaging | 3D | 8-24 Hz | 3-5 mm
Ultrasound Imaging | 2D | 30-100 Hz | 1 mm
Ultrasound Imaging | 3D | 30-60 Hz | 1 mm
Motion Capture (Vicon) | 3D markers | 100-500 Hz | 0.5 mm
2D Magnetometer (EMA) | 2D markers (12 flesh pts) | 200-500 Hz | 0.1 mm
3D Magnetometer (EMA) | 3D markers (8 flesh pts) | 25 Hz | 1 mm
X-Ray Microbeam | 2D markers (5 flesh pts) | 30-150 Hz | 0.2 mm
Electropalatograph (EPG) | 2D (10x10 grid) | 200 Hz | palate contact
Cyberware scanner | 3D surface mesh and texture | 0.1 Hz | 2 mm

Table 2.4: Medical imaging and motion/shape capture systems to characterize anatomical structure.

2.5 Image Extraction Techniques

In order to create a realistic model or validate simulation data, it is important to use real-world image and tracking data of the types described in the last section to ensure realistic or plausible simulation results. In order to create a quantified comparison between simulation and measurements, or to create geometry for a model component, the images have to be processed to obtain geometry data. Medical image processing is a fast-growing research area which covers the required algorithms of image registration and segmentation. Maintaining a full and up-to-date suite of image and mesh processing algorithms takes significant effort. Therefore, it is a good strategy to apply and extend existing open source frameworks, in particular [117, 176, 235], in order to leverage the already existing and growing number of algorithms and to contribute and share other algorithms with these communities. In the following sections, we define the concepts of the algorithms and the related research for vocal tract modeling.

2.5.1 Image Registration

Registration is a method of aligning a pair of images that contain common information. In the discipline of medical imaging, registration may be performed on intramodal images as well as on multimodal images. Some definitions of terms for image registration follow. The alignment corresponds to the coordinate system of the physical world and not to the coordinate system used by the pixels of the image. The terms fixed and moving images describe the pair of images that ought to be registered.
Fixed image is the reference that remains stationary while the moving image is deformed, scaled, rotated, and translated in order to align it with respect to the fixed image. Registration may be classi ed in two categories, rigid registration and deformable or non-rigid registration. Rigid registration applies a translation and rotation translation to the moving image so that it aligns to the xed image. Rigid registration methods are limited to images which di er in the choice of the coordinate system. In a rigid registration, transform may also include scaling if the images are in a di erent scale. One major characteristic of these registration methods is 37 that the image has a regular pixel size and spacing after the application of spatial transforms, and therefore they are called rigid. Non-rigid registration includes a range of registration methods in which the pixel spacing and pixel size are no longer uniform throughout the image. Ane transformation and warping are two examples of spatial transformations that can be used to do non-rigid registrations. Both ane transformation and warping are general cases of spatial transformations, of which the rigid transformations form a special case. Ane transformation allows an application of shearing factors that enable it to eliminate the shear di erences between xed and moving images. Warping is an operation that stretches pixels in a selective and non-linear way. Warping transformations are non-linear and non-homogeneous. Ane and warping may allow registration of images involving anatomical deformations or motion. As an example, this thesis can be applied to process magnetic resonance images of 2D and 3D datasets of vocal postures. The vocal posture datasets are obtained while producing a particular sound. All images of the 2D datasets are referenced to the identical physical coordinate system. The 3D datasets also have the same coordinate system. As a result, no rigid registration is required for either the 2D or 3D datasets. Upon comparing any two images of di erent vocal postures, one can notice signi cant changes in the shape of the vocal tract. The tongue, lips, and air tract seem to undergo deformation and translations, whereas the lower jaw experiences rotation and translation about the hyoid pivot points. Clearly, deformable registration techniques must be applied to address inter-posture registration. Registration of datasets to each other, or to a universal model, allows one to compare and analyze changes in vocal tract posture. Di erent methods have been developed for the registration process, including point-based, surface-based, and intensity-based methods. These methods perform rigid body or non-rigid body transformations to achieve registration. Recent reviews by Brown [31], K•ahler et al. [144], Pham [201] provide a systematic overview. Most relevant to vocal tract modeling is the use of intensity-based methods using a non-rigid body transformation with a variational formulation, so that the transformation is constrained to maximize the correlation between xed and moving images, without any tears or gaps. Most non-rigid body registration precedes a rigid-body registration stage to align datasets [89]. Other registration techniques of relevance are the demons algorithm by Thirion [272] which uses segmented input images, nite element-based registration by Ferrant, War eld, Guttmann & 38 Figure 2.13: Rigid-body registration for one speaker: 1. L-sound; 2. R-sound; 3. di erence without registration 4. di erence with registration. 
The registered version is in significantly better correspondence. Mulkern [84], and multiresolution deformable registration by Mattes, Haynor, Vesselle, Lewellyn & Eubank [161].

Rigid Body Registration

Three-dimensional rigid-body registration requires three translation and three rotation parameters to form a transformation T, along with an intensity factor \lambda. Following Hajnal, Saeed, Soar, Oatridge, Young & Bydder [104], we minimize the error in the rigid-body registration, defined as follows:

E = \sum_x \left( B(x) - \lambda R(T[x]) \right)^2

where B is a baseline scan, and the image to be registered via the linear transformation T is the repeat scan R.

Non-Rigid Body Registration

In this case, we seek to identify a vector displacement field that minimizes an image mismatch cost function. While a spatial linear transformation can express translation, rotation, scaling, and shear, a vector displacement field is able to express whatever general deformation is required to bring a scan into exact structural correspondence with another. The cost of this generality is high computational complexity. Another difficulty lies in constraining the transformation such that undesirable transformations that do not preserve the topology of the image are avoided. One such method, known as fluid registration, presented by Freeborough & Fox [89], is described here. The algorithm seeks to minimize a mismatch cost function, defined as one minus the square of the cross-correlation between the fixed image (L-sound) and the moving image (R-sound) shown in Figure 2.13, which means we minimize

C = 1 - \frac{\left[ \sum_x B(x) R(x-U(x)) - (1/N) \sum_x B(x) \sum_x R(x-U(x)) \right]^2}{\left[ \sum_x B^2(x) - (1/N) \left( \sum_x B(x) \right)^2 \right] \left[ \sum_x R^2(x-U(x)) - (1/N) \left( \sum_x R(x-U(x)) \right)^2 \right]}

with respect to the displacement vector U(x), which is generally non-integer; C is a function of the input images and the displacement vector, C = C(B(x), R(x), U(x)). The transformation defined by the displacement vector is constrained using the Navier-Stokes equation describing the motion of a compressible viscous fluid. We start with the displacement vector U(x) set to zero.

1. Compute the partial derivatives of C with respect to U(x) in the direction of each of the three axes in the Cartesian coordinate system, and use these values as a force field b(x) acting on a fluid.

2. By solving the Navier-Stokes equation, we obtain, in this fluid analogy, the velocity v(x) at each point in the fluid. The present implementation uses a simple implicit method to solve this system:

\mu\, \nabla^2 v(x) + \beta\, \nabla (\nabla \cdot v(x)) + b(x) = 0

where \mu and \beta are viscosity constants of the fluid.

3. We then transform the velocities into the deformed space defined by the current value of U(x):

w(x) = v(x)\,(1 - \nabla U(x))

4. We then update U(x) by adding the amount each point in the space would move in a small time interval \delta, that is, the deformed velocities multiplied by a small value:

U(x) := U(x) + \delta\, w(x)

5. If the minimum of the Jacobian of U is smaller than an arbitrary fraction M, then we stop iterating. The Jacobian measures the differential volume change of a point being mapped through the transformation. If the minimum Jacobian is negative, the transformation is no longer a one-to-one mapping and, as a result, folds the image domain inside out (see Christensen, Miller, Marsch & Vannier [42]).

6. We accumulate the transformation in U_a(x), by defining P_a(x) = x - U_a(x) and then setting

U_a(x) = x - P_a(x - U(x))    (2.7)

2.5.2 Image Segmentation

Segmentation is the method of extracting a region of interest in a given image. Segmentation may be applied for diagnostic purposes to obtain an organ or a part of a tissue of interest in a medical image.
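A minimal sketch of such region-of-interest extraction, thresholding followed by keeping the connected component around a seed point, is given below; it assumes NumPy and SciPy's ndimage module, and the threshold, seed, and synthetic image are illustrative only.

import numpy as np
from scipy import ndimage

def seeded_threshold_segmentation(img, seed, low, high):
    # Threshold the image, then keep only the connected region containing the seed.
    mask = (img >= low) & (img <= high)
    labels, _ = ndimage.label(mask)             # connected-component labelling
    region_label = labels[tuple(seed)]
    if region_label == 0:
        return np.zeros_like(mask)              # seed fell outside the thresholded range
    return labels == region_label

# Synthetic example: a bright blob in a noisy background.
rng = np.random.default_rng(0)
img = rng.normal(0.0, 0.1, size=(64, 64))
img[20:40, 25:45] += 1.0
roi = seeded_threshold_segmentation(img, seed=(30, 30), low=0.5, high=2.0)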
A more complex segmentation use is the process of subdividing datasets to identify anatomical structures by separation in a set of regions of interest. One of the long-term objectives of this work is to display sections of vocal tract, tongue, teeth, etc. as part of the simulation. To display these anatomical features separately from each other, they must rst be segmented from images such as MRI. Di erent methods have been developed for segmentation of MR images, which vary in model complexity and degree of manual versus automatic processing. Cox provides a structured overview of image intensity segmentation techniques by Cox [51], Maintz & Viergever [157] including algorithms based on detecting edges, boundaries, contrast gradients, and seed-based region- growing. In the area of speech segmentation, algorithms applied to vocal tracts have been so far fairly basic and extract predominately vocal tract area functions [18, 73, 275] and tongue surface contours [2, 17, 72, 256, 258]. Most methods require a signi cant amount of manual intervention to tune parameters to make the segmentation work properly, and the majority of current approaches require the user to specify or correct cut-places and contours within a 3D drawing post-segmentation tool. For this work, it was important that the segmentation is working on 3D datasets, since the vocal 41 tract is a tube-like channel which changes from a few millimeters to 10 centimeters in cross-section diameter. One approach which goes in the direction of specifying minimal constraints is [207], who applied a morphology-based segmentation process for vocal tract reconstruction. In this section, we discussed image extraction techniques which will be applied in Chapter 5. The next section discusses similar processing techniques for meshes. 2.5.3 Mesh-based Processing Vocal tract simulation systems do not use images as internal representation, rather closed surface meshes for rigid-structures and volume meshes for deformable structures are common representations. Processing and manipulation of meshes are therefore important for a exible framework. Surface meshes are de ned as meshes that only contain vertices on the surface, while volume meshes have interior vertices in addition to surface ones. Mesh-based processing, as well known as computational geometry [23, 93], de nes algorithms for the conversion from images to meshes, as well as algorithms from one mesh to another mesh. The creation of volume meshes is a preprocessing step in order to prepare a shape or surface mesh for nite element analysis [22, 305]. Solutions are discussed in the research forums and are o ered as products [9, 229]. Mesh Creation The conversion from images to surface meshes is a common processing step in image processing software using algorithms such as Marching Cubes by Lorensen & Cline [151] and implicit surface techniques by Montani, Scateni & Scopigno [169]. These algorithms produce often very dense meshes (that is one vertex per pixel), which are not suitable for simulations. Therefore, mesh-based lter techniques or surface reconstruction techniques constitute a way to further process the dense mesh to get more suitable results for simulation. Mesh-based Registration We have discussed previously registration algorithms for images (see section 2.5.1), with their application for model creation and validation. Similar rigid and deformable algorithms exist as well for meshes, which o er an alternative to the image-based ones. 
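For the rigid mesh case, the least-squares alignment of two sets of corresponding vertices can be sketched with the standard SVD-based (Procrustes/Kabsch) solution below, assuming NumPy arrays of matched points; establishing the correspondences, which is the harder part in practice, is not shown.

import numpy as np

def rigid_align(src, dst):
    # Least-squares rotation R and translation t mapping src points onto dst points.
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)                          # cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Illustrative check: recover a known rotation about the z-axis plus a translation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
src = np.random.default_rng(1).normal(size=(50, 3))
dst = src @ R_true.T + np.array([1.0, 2.0, 3.0])
R, t = rigid_align(src, dst)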
Manual mesh registration can be performed with mesh editing software [165, 220], while automatic methods require special algorithms. Automatic rigid mesh registration is quite simple, since it results in a rigid-body transformation 42 that can be determined by projection. Automatic deformable registration algorithms have been applied to volume meshes in biomedical applications by Couteau et al. [50]. In computer graphics, solutions to similar problem references di erent surface meshes with parametrization algorithm by Kraevoy & She er [138]. Using mesh-based registration is analog to image-based registration, but is closer to the simulation representation since it allows to transfer of the simulation properties of the reference anatomical substructure onto a mesh which might come from the segmented image of a new anatomical shape. Mesh Comparison Another way of comparing meshing for validation is by measuring the error between surfaces using, for example, the Hausdor distance [12] which allow the display of the vertex or face distance in a color map or specify the overall mean error. 2.5.4 Summary There are a variety of image and mesh algorithms and techniques which make the work ow of vocal tract simulation easier and faster. These tasks in the work ow, could be for example: Import of new anatomical structures, validation with reference sets, and smoothing shapes. Since required algorithms are rapidly advanced in active research communities, it is useful to leverage existing frameworks such as [117, 176, 235]. Since the simulation of the vocal tract is the focus of this project, other frameworks can be integrated or import/export le formats would allow external use. 2.6 Existing Modeling Solutions Prior to designing and building the vocal tract simulation system, we reviewed relevant existing simulation frameworks in order to nd out if an existing system could t our requirements (see Chapter 3.1). We introduce the related simulation frameworks and continue with a section that discusses their attributes. 2.6.1 Framework A: Ptolemy Ptolemy [33] is a software framework designed for the modeling, simulating, and designing of concurrent, real-time, embedded systems. Using well-de ned models of computation that 43 de ne the interactions between components, it allows for heterogeneous mixtures of models of computation. 2.6.2 Framework B: Real flow Real ow [180] is commercial platform for dynamic and uid e ects geared toward lm production e ects. It provided many fast algorithms, which are mainly designed to produce realistic looking rather then numerically accurate results. 2.6.3 Framework C: ANSYS and Fluent ANSYS [9] is a commercial software package for engineering simulation solutions using nite element modeling, simulation, and validation. Since the maker of ANSYS recently acquired Fluent [10], a leading computational uid dynamics package, it extended ANSYS to handle complex uid dynamics scenarios as well. 2.6.4 Framework D: Software for Interactive Musculoskeletal Modeling (SIMM) SIMM [173] is a commercial product, that recently evolved to the open source OpenSIM/SimTK [59], developed at Stanford University, which targets biomechanical modeling for health care profession- als, scientists, engineers, and animators modeling interactive musculoskeletal systems. Using rigid body and muscle models enables the analysis of biomechanical systems and movement patterns as well as surgical planning. 
2.6.5 Framework E: Simulation Open Framework Architecture (SOFA)
SOFA [78] is an open source simulation framework for surgical planning. It includes rigid, deformable, and fluid model types in one framework. The SOFA project was made public in 2007 and has grown since. SOFA focuses on surgical simulation and is similar to ArtiSynth, except that it does not include connection formulations or a sound synthesis module.

2.6.6 Discussion
Important attributes of the five related frameworks are listed in Table 2.5. The closest related open source frameworks are SOFA and OpenSim, which did not exist at the time the ArtiSynth framework was designed. Both alternative frameworks lack design characteristics important for upper airway modeling, such as model connections and acoustic model formulations, which is the rationale for the development of the ArtiSynth framework. In summary, there is no framework that suits all the needs discussed in Section 3.1, since the model types or task-supporting interactive components might be missing, or the trade-off between simulation speed and accuracy is unsuitable for interactive simulations.

Framework   License      Source   Model types             Negative      Positive           Since
Ptolemy     free         yes      dis, cont, fin state    no ODEs       comp. model        1997
RealFlow    commercial   no       rb, part, fluids        lim. acc      -                  1998
ANSYS       commercial   no       rb, fem, fluids         no interact   high acc           1970
OpenSim     free         yes      rb, part, fem           no ac         acc fem            2006
SOFA        free         yes      rb, part, fem, fluids   no ac         active community   2006

Table 2.5: A list of existing simulation frameworks related to vocal tract simulation (rb=rigid body, part=particle system, rt=real time, dis=discrete, cont=continuous, fin state=finite state, ac=acoustics, acc=accurate/accuracy, lim=limited)

2.7 Summary
This chapter has presented work relating to articulatory speech synthesis. Understanding the process of speech from articulatory phonetics allows for a better understanding of the physical process of speech. In order to model this process, mathematical and computational models are required from the fields of facial animation and speech synthesis. This chapter has also included a discussion of the methods and presented arguments for the choice of supported models. The next chapters describe the framework and the implementation plans for the core simulator.

Chapter 3
Creation of a Modeling Framework for the Vocal Apparatus
This chapter describes a framework design and conceptualization to model vocal tracts and upper airways. This framework is the basis for a large project to construct the simulation software called ArtiSynth (the name is derived from Articulatory Synthesizer; http://artisynth.org). Parts of this chapter were published before. In particular, parts of Sections 3.1.3 and 3.1.4 originally appeared in Fels et al. [83], and parts of Section 3.2.2 have been published in Fels et al. [80] and Vogt et al. [288]. The framework allows for the creation and interconnection of various kinds of dynamic and parametric models to form a complete integrated biomechanical system, capable of articulatory speech synthesis, the study of physiological feeding and breathing activities, and talking head simulation. The ArtiSynth simulation platform addresses issues arising from modeling anatomical substructures. Existing models have been developed in isolation, and the goal of ArtiSynth is to provide an infrastructure that supplies mechanisms to glue these and other model types together in order to form hybrids and to allow comparisons between models.
This approach acknowledges the many open questions within this research area and the community effort of researchers and practitioners required to create a sufficient model of the complex upper airway structures. Since existing simulation platforms, as discussed in Section 2.6, are not sufficient for the needs of this application domain, the design and creation of the simulation platform, as well as the implementation of a basic model library and the graphical user interface (GUI), are essential. The framework requirements, as the basis for the system design, lay the foundation for the ArtiSynth group effort to develop an entire framework. These requirements are described in the following.

3.1 Framework Requirements
Before the design of a suitable modeling framework for the upper airway, technical and conceptual requirements [264] were determined as part of a user-centered design approach suggested by Nielsen [182] and Tidwell [274]. Key members of the research community, such as engineers, linguists, and medical scientists, were involved in helping with the design of the interface and providing feedback on designs, mock-ups, and prototype systems. This process was essential to create a system that would be widely used in this research domain. The feedback for the system design was gathered through informal interviews, questionnaires (see Appendix B.1), and discussions with leading researchers located at UBC and elsewhere, listed in Table B.1. Significant feedback was solicited at the following venues:
- International workshops (Gabfest 2003+2005, Ultrafest 2004+2005, Audio Visual Speech Processing 2005, and Int. Seminar on Speech Production 2006)
- Conferences (ICPhS 2003, SIGGRAPH 2006, Int. Symposium on Biomedical Simulation 2006, Vancouver Robotics in Society meeting 2007), and
- Institutional visits (Advanced Telecommunications Research 2002, NTT 2002, Stanford University CCRMA 2003, IBM Almaden research center 2003, ICP 2005/2006, Basel University 2006).

The findings from these discussions supported a number of high-level requirements for a simulation tool for combined anatomy animation and sound production, summarized below:
- Cross platform and connectivity to existing tools and libraries
- Open source simulation code, models, and data sets
- Both a graphical user interface and an application programming interface
- Support libraries
- Basic model type library: dynamic and parametric biomechanical and acoustical models
- Integration and connection of models
- Creation and editing support for models
- Time-based data sequencing
- Import and export of measured and synthesized data
- Easy exchange of alternative models for comparison
- Efficient computational methods for interactive simulations

Since existing simulation platforms, as discussed in Section 2.6, do not fulfill all these high-level requirements and are difficult to extend, we decided to create a new biomechanical simulation platform. To understand the impact and motivations of the requirements, they will be discussed in more depth in the following sections. Modeling the upper airway in a general application is a difficult research problem. There have been many recent advances in computation, graphics, data acquisition, and simulation algorithms which show much promise in creating an anatomically realistic model. However, evaluation and verification of a model by a single research group has not been sufficient to solve this problem on its own.
The research required to tackle upper airway modeling spans many disciplines, as shown in Figure 3.1, whose expertise is difficult to amalgamate in a single institution focused on this problem. Researchers mainly focus on particular problems, which often results in anatomical model components being developed in isolation. Most researchers in universities and research centers do not have the resources to create custom simulation software, since their focus is mainly geared towards answering research questions. The small market size, the specific needs of the problem domain, and fast-moving technology are also reasons why commercial products have not been suitable. In addition, most researchers do not recreate related research and compare it with their own advances, since this is very time consuming and takes time away from new publishable research. Since people use radically different systems, the exchange of models becomes very difficult and is seldom performed. Beyond the core group of researchers who model the anatomy of the upper airway in speech research, dentistry, and medicine, a larger community will benefit. The larger community may include scientific algorithm developers who look for example applications to exemplify their tools. Other examples include teachers who need a classroom tool, or experimenters and clinicians who would like to validate their hypotheses in patient-specific cases. In summary, this approach uses a software design for an integrated physiological model to bridge different disciplines, including cognitive and physiological disciplines as well as physics and modeling disciplines. Formulating the right abstractions to represent data and knowledge from the cognitive and physiological disciplines, and algorithms from the physics and modeling disciplines, is the key to solving the problem of modeling the upper airway. The following section describes the methodology used to determine the functional and non-functional requirements.

Figure 3.1: Vocal tract-related research fields: speech production, computer animation, medical simulation, and medical imaging.

3.1.1 Methodology
Specific knowledge from a diverse community is needed to make a breakthrough on long-term research questions by building a complete biomechanical upper airway model. The framework's intention is to serve as a vehicle to bring researcher expertise together in one system and to enable model and knowledge sharing. As an additional benefit, researchers would save time with this approach by contributing their expertise while leveraging the overall framework, including others' contributions. Researchers can contribute on different levels of the project: model implementation, data collection, and experimental validation. The success of this approach requires that researchers gain substantial advantages for their substructure model implementation and experimentation. Thus, it is important to establish an open source style of sharing models and results in order to provide complete and cutting-edge modeling tools. Openly sharing results can be very positive and fertile for the individual researcher and the research community alike, but trust between parties and an acceptable integration into the publishing cycle still need to be established. This dimension requires a social experiment for the project to be successful. There are many large-scale projects in the last decades that have shown success with open source development, as discussed by Raymond [212], in communities (e.g.
Linux, BSD, FSF, GNU), and their success in research areas [19, 34, 117, 235]. Most communities have found suitable rules and licensing terms for their members and facilitators. Structuring the related work in speech synthesis, facial animation, and biomechanic simulators allowed the creation of abstractions and representations to shape and organize a way to think about the modeling problem. In order to build a flexible and extensible three-dimensional upper airway simulator, a suitable matching system design is required. The process of creating software designs methodically has been formalized in the field of software engineering, as described by Summerville [264]. The magnitude of such a simulator with its various methods easily becomes a large-scale project. This requires new techniques and representations in order to generate an accepted framework within the research community. One important step in the design process is the development of representations for data and methods. The following sections describe the non-functional and functional requirements.

3.1.2 Non-functional Requirements
Systems engineering defines non-functional requirements [264] as requirements that specify criteria by which to judge the operation of a system, rather than specific behaviors. In contrast, functional requirements specify specific behaviors or functions. The following non-functional requirements come from the specifics of the modeling problem and were observed in the collected feedback mentioned earlier.

Availability is important to reach acceptance and penetration in academic and teaching communities. A common solution is to provide a free and open source system with minimal commercial license ties. Thus, sharing and deploying new developments should be made easy in order to strengthen the upper airway modeling community.

Cross-platform portability bridges the gap between the variety of computer platforms used by researchers. Smooth integration of ongoing research makes it important to provide a simulation solution across different operating systems. Thus, the primary platforms were identified: Microsoft Windows, Linux, and Apple OS X.

Interactivity in manipulating and simulating models speeds up the modeling process and enables new avenues in developing and validating models. Approximate modeling methods have been applied to gain interactivity, which allows for a better feel for the behavior and sensitivity of the model. To ensure validity, the approximate methods should be replaceable with accurate methods that simulate offline. These accurate methods could be deployed in commercial simulation packages by exporting the appropriate simulation files for a given model. Interactive simulations are those where actual time approximately matches simulation time, with a difference small enough to meet the user's expectations.

Extensibility of a simulation system is important for the development of future models. New model formulations that require specific implementations are an ongoing area of research and therefore require easy ways to extend the framework. Extensibility therefore needs to be addressed at both the programming and user-interface levels. Extension solutions such as plug-ins or patches enable contributions from outside the core developer team.

Usability allows non-programming researchers, who hold the domain knowledge, to create and modify models and to conduct experiments.
The solution is to provide a graphical user interface (GUI) that allows intuitive understanding of complex dynamic systems and enables control of modeling components. Working closely with non-programming researchers who model systems, and soliciting their feedback, helps to evolve and maintain an acceptable user interface.

Complexity of the code or implementation should be limited to allow new contributors to understand and extend the modeling system. Personnel, especially at academic institutions, have varying levels of programming skill. It should be easy for people to contribute at different levels, while maintaining consistency with the whole system and ease of understanding. In particular, the code complexity of a particular model should not be so high that a graduate or undergraduate student could not carry out a course project in the area by writing an extension or conducting an experiment.

Support for research is a challenge for open source development, since ongoing and unpublished research might not be shared with the entire community. This suggests that the mechanism to share code and data should address access control. Researchers are not necessarily good coders, and it takes significant work to transform a research prototype into stable production code. While it is important that the core code of a simulator is well maintained and stable, an extension might not follow the same requirements. Thus, extensions should not impede core functionality, and they should be labeled according to how thoroughly they have been tested.

Support for education requires consideration of the natural fluctuation of students in order to provide appropriate training material. Further, it is important to consider the non-programmer perspective, which requires sufficient functionality in the user interface. From a programmer perspective, the environment should be able to support a small-scale project within a coherent application programming interface (API).

Simulation speed and accuracy are important in creating an interactive, repeatable, and precise modeling environment. Simulation speed requirements may range from offline, through interactive, to real-time. While it is essential for haptic applications to maintain real-time simulation speeds in order to be effective, anatomical predictive applications like ArtiSynth do not have this strict real-time requirement. Thus, it is more important that the simulation system is accurate, and for practitioners interactive rates are often sufficient (refer to the interactivity requirement above and to Section 4.1.1).

All the requirements discussed above help to guide decisions about the software platform and the software design. In the following, we discuss the functional requirements for the framework using hierarchical task descriptions and scenarios.

3.1.3 Functional Requirements
This section describes functional requirements [264] from the user's point of view using task descriptions and scenarios presented by Fels et al. [83]. The scenarios focus on researchers who model the vocal tract by programming or by using graphical user interfaces. Using articulatory synthesis as the central application, it was found that biomechanical modeling of the upper airway for dentists and clinical applications has a large overlap in tasks and, for the most part, requires only a subset of the functionality, since these users do not need the sound production features.
Depending on the objectives of researchers and practitioners, there are many approaches, techniques, and tools used in articulatory speech synthesis and biomechanical modeling. Finding a common tool set that encompasses all needs is complicated. In 1981, Rubin, Baer & Mermelstein [225] developed an approach for an articulatory synthesis model and made the model available to researchers. Based on 2D parametric shape descriptions of the anatomy, which are approximated in a 1D tube model, speech is synthesized using this platform, limiting the types of sounds that are possible. Since midsagittal plane representations are applied, off-sagittal sounds such as /l/ and /r/, as well as asymmetries, are not represented. With the development of new imaging and modeling techniques, it is now possible to shift to a 3D model of the vocal tract to represent the equivalent virtual anatomy [56, 267]. Three-dimensional visualizations have been shown to be effective in teaching the complex interrelations of human anatomy, as demonstrated by the work of Tergan & Keller [270]. Commonly, for acoustic modeling, source-excited, area-function-based filter models are used for speech synthesis. However, this limits the types of sounds that can be produced. This framework approach lays a foundation for both 2D and 3D vocal tract shape models. The acoustics may be modeled with different techniques ranging from 1D waveform acoustics to 2-3D aerodynamic models [160]. Finally, it is important to address the open research question of how detailed vocal tract shape models need to be to appropriately model articulation processes, as discussed by Birkholz [25]. For this framework, it is important to support different representations in order to eventually uncover the answer to this question.

The hierarchical task decomposition developed here is derived from a survey of a representative set of researchers' work. Illustrative citations of some of the many research results investigated include the research areas of the tongue [257, 300], vocal tract [18, 75, 224, 304], face [21, 147, 214, 215, 283], and acoustics [121, 131, 200]. The researchers who provided feedback, described in Section 3.1, also provided important insights into functions for physiological modeling of the upper airway and, in particular, for the purpose of speech production. The survey provides a picture of some of the tasks that will need to be supported. The feedback process allowed the identification of several primary tasks that are performed by speech researchers using or exploring articulatory speech synthesis. These include:

1. Import and integration of new 2D or 3D models of vocal tract parts
(a) Import static geometry of vocal tract parts
(b) Import dynamic models
(c) Glue a new model into existing methods; e.g., attach a new lip model to a vocal tract
2. Import and integration of new excitation models
(a) Import time-domain glottal waveforms
(b) Import 2D or 3D vocal fold models
(c) Integrate excitation with vocal tract models and investigate both independent and dependent sources
3. Analysis of new vocal tract models in an articulatory speech synthesizer
(a) Measure deformation of the new model over time
(b) Adjust parameters of the new model
(c) Adjust parameters of the infrastructure to accommodate the new model, such as changing the integration method or model parameters
(d) Specify timelines for parameter values for animation of the vocal tract
(e) Compare speech output using different synthesis methods
4.
Comparison of different vocal tract models
(a) Monitor vocal tract geometry and identify differences
(b) Monitor speech output and identify differences in the time domain and frequency domain, both analytically and perceptually
(c) Specify timelines for animation of different models
5. Comparison of different data for driving vocal tract models
(a) Import data from MRI, ultrasound, EMA, and other data sources
(b) Link data sources to model parameters
(c) Specify intervals for driving model parameters from data, while specifying which model parameters are to be driven by simulation
(d) Compare both vocal tract shapes and acoustic output using both perceptual and objective measures
6. Synthesis of speech
(a) Specify time intervals for synthesizing speech from either data-driven or simulated articulatory parameters
(b) Concatenate and interpolate articulatory parameters using different methods for text-to-speech synthesis
(c) Alternate between simple and complex synthesis models, including 2D tube models, a simplified aeroacoustic model, and a complete aeroacoustic model
(d) Alternate excitation methods ranging from simple glottal waveforms to complex vocal fold models for pitch control
7. Integration of the vocal tract model with face models
(a) Explore integration with models ranging from geometric face models to complex dynamic face models
(b) Perform audio and/or visual perception tests
(c) Calibrate/compare with recordings

The majority of these tasks require model manipulations that could be supported by a tool-based graphical user interface (GUI), as discussed by Tidwell [274, chap. 8]. For example, tool-based GUIs can be found in image, vector-graphics, and 3D-model editors. The following scenario integrates some critical tasks to illustrate how particular graphical user interfaces provide suitable solutions.

3.1.4 Scenarios
We have attempted to specify the requirements of researchers in the field by looking at our own requirements and deriving the requirements of others from the literature and through personal communications. The requirements described in Section 3.1.2 give a high-order decomposition of some of the activities that need to be supported in the 3D articulatory speech synthesizer. This section presents a scenario, originally published in Fels et al. [83], that illustrates how a typical researcher might use the articulatory synthesizer. This approach provides a means of looking at the workflow and the types of interfaces that would be required. The following scenario was chosen to illustrate typical interactions with anatomical models, and it demonstrates the support of common tasks with graphical elements. It is of note that the paper prototypes for the GUIs are tools used to elicit requirements and are only partially carried over into the current framework design.

In this scenario, Bob is a hypothetical post-doctoral fellow working on a new tongue tip model. He has created a new 3D FEM-based tongue tip model in MATLAB. He has also extracted various parameters for his tongue tip model from ultrasound data for various speech sounds. He wants to try his tongue tip model in the 3D articulatory synthesizer so that he can hear the effects of different movement parameters on the sound output. Bob plans to use the ArtiSynth system, since he is aware that he would be able to test his model in a complete speech synthesis framework without having to write his own speech synthesis algorithms. He starts up the ArtiSynth system.
Figure 3.2: Mock-ups of the graphical user interface for (a) graphic and (b) timeline manipulations (reproduced from Fels et al. [83]).

Upon starting ArtiSynth, Bob sees the initial user interface appear, as illustrated in Figure 3.2. The left window, the model view, shows a 2D/3D rendering of the current vocal tract model. The right window, called the timeline window, shows a timeline of probes that are used to drive the synthesizer. At this point in the scenario, there are no input probes selected for driving the simulation, so the timeline window would normally be empty rather than containing the probes indicated in Figure 3.2b.

The first thing Bob wants to do is replace the default tongue tip model with his own tongue tip model, while leaving the remainder of the vocal tract model unchanged. In the graphics window, he zooms in and selects the region of the tongue in the vocal tract model that he wants to replace. The selected parts appear in the component window as well. He then cuts the section out. At this point, a new window appears with the names and positions of the end-points where the cut was made. This list is used to interface with the model that Bob wants to insert. Next, Bob imports his model. He loads his MATLAB code into the simulator. His code provides calls that are made as the simulation proceeds through the vocal tract model and reaches references to his model. The critical points are the places where his model attaches to the default model. A close-up of how the connections are made is displayed in a new window. Bob assigns links between his model's nodes and the nodes of the default vocal tract model at the points that were cut. He notes that he has successfully made the scale of his model the same as that of the rest of the components, so that connections are easily made. For each group of links, he specifies parameters for how each node at the boundary affects the attached nodes. Essentially, Bob specifies the flow of forces and the boundary conditions for the seam between his model and the default model. As Bob is not so concerned about the interaction between his model and the default model, he specifies that the links be treated as anchors. As such, they will only have geometric constraints, effectively providing a solid anchor for his model.

Next, Bob imports the probe values that he has extracted from his ultrasound images so that he can drive his tongue tip model. At this point in his research, he has only extracted the movement of a single location on the tongue tip. He notices that when he imported his model in the previous step, the timeline window indicated that additional probes were available. This is because, as part of the interface, Bob indicated which parameters of his model were available as probes. These probes have associated functions that are called at each time step to update them if data is specified for them in the timeline window. As he only has data for one point on the tongue tip, he leaves the other previously defined probes unspecified, so that the simulation will automatically calculate the values of these probes over time. When he imports his probe data, the probe appears in a probe clip window, much as a non-linear video-editing suite uses video clip windows. Bob drags the probe clip into the timeline so that it will be used to drive his model. At this point, Bob can press play and watch the probe data drive the FEM model he has created. However, his main goal is to produce speech using the whole vocal tract.
His ultrasound images are from a subject saying /la/. Thus, he needs some data for the rest of the vocal tract, along with a sound source for the excitation. He opens the data import window and looks in the default data directory. He finds a probe file for an /l/ and an /a/ for the default model and loads them in. These clips appear in the clip window. He drags them into the timeline window and adjusts their durations dynamically to fit with his probe data. He also adjusts the linear interpolation between the probe data to get a smooth transition for the default probes. He selects a simple acoustic glottal pulse for his excitation and an area-function sound synthesis method. He places a virtual microphone at the mouth of the vocal tract model. The microphone probe appears on the timeline. He presses play, and the synthesis engine steps through the timeline and synthesizes speech by updating the model parameters and calculating the vocal tract shape at each time step. The acoustic wave appears in the timeline, synchronized with the other input probes. Bob listens to the sound and is happy with it. He saves the sound file for further processing. He also saves the entire state of the simulator so that he can show it to his colleagues later.

This concludes a hypothetical scenario describing a possible modeling task that covers the interface functionality for the requirements. Interface elements include a 3D model editor paradigm, probes, a playback concept, and a timeline interface. These elements are realized as GUI components of the framework design, as described in Section 3.2.1. Other elements of the scenario, such as MATLAB dependencies, were optional for the core framework design. Further, the scenario illustrates mathematical modeling formulations such as model connections, input and output probes, acoustics, and biomechanic models, which will be discussed in Section 3.2.2.

3.2 Vocal Tract Simulation Framework
This section describes an upper airway simulation framework comprised of a model simulation core, graphical user interface components, and the application of the framework to experimental processes. The core framework enables the creation of upper airway shape models based on combinations of rigid body, mass-spring, finite element, kinematic, and acoustic models. The dynamical models, whose equations of motion are integrated numerically together with kinematic models, are combined in a single framework. This framework supports the creation of a complex, dynamic jaw model based on muscle models, a parametric tongue model, a face model, two lip models, and a source-filter based acoustic model linked to the vocal tract model via an airway model. The models have been connected together to form a complete vocal tract that produces speech and may be driven both by data and by dynamics. Each of these models is complex and is often developed independently of the other structures. The complex aeroacoustical processes that involve the interaction of these anatomical elements with airflow and pressure waves, and which eventually produce speech, have also been modeled [246, 247, 265]. This approach has become particularly relevant due to the recent interest in developing natural-looking talking heads [43, 143, 146, 194, 296]. The development of naturalistic talking heads, in fact, strengthens the need for articulatory speech synthesizers.
But their relevance goes beyond this reason, since it is believed that articulatory models are the most promising approach to generating realistic speech in a very compact representation. Further, since the integration of separately developed models is time consuming, comparisons between different modeling approaches have rarely been performed. For these reasons, to create a complete articulatory synthesizer, it is critical to be able to combine different models and modeling frameworks easily and to integrate them within a complete vocal tract model that can be validated geometrically as well as acoustically. Providing this functionality allows articulatory speech synthesis, dentistry, and other disciplines to build on existing research, and provides a platform for exploring and advancing integrated physiological modeling.

The focus of the framework for upper airway modeling is to combine animation techniques with dynamic simulation methods, along with baseline anatomical models, to eventually provide a complete vocal tract for researchers to extend. Within the framework, researchers may create new model components, compare and predict geometric and acoustic properties of the vocal tract, and explore the details of physiological functions such as speech production, all within an integrated upper airway model that may include acoustics. This research aims to accomplish the following goals: (1) formulate a novel framework for upper airway modeling (Section 3); (2) implement a core simulator for dynamic three-dimensional vocal tract models to synthesize speech (Section 4); (3) create a library of anatomical models (Section 3.2.2); (4) provide visual and acoustic rendering (Section 2.2); and (5) create a data-driven modeling and validation integration facility (Section 3.2.1).

Thus far, a core simulator has been created to provide implementation support for dynamical and parametric modeling frameworks, upon which some specific model instances have been integrated to provide a functional three-dimensional base model of the vocal tract. The main focus has been on creating components that have not been developed or integrated sufficiently by other groups. This includes: (a) an aeroacoustics module; (b) a framework for connecting different types of models to each other; and (c) methods for incorporating real speech production physiological data. One long-term goal of this project is to construct a complete 3D model of the vocal and facial articulators, and to use this model as the basis for a combined biomechanics and aerodynamics model. The new physiological model may produce, for speech production purposes, both verbal and non-verbal articulations that enhance computer-mediated communication. Further, the integrated physiological model may synthesize natural facial gestures to accompany synthesized speech. In contrast to most speech models, this approach has an intermediate 3D geometric representation of the vocal tract that allows speech production and verification with imaging data. Since this long-term goal is complex, this thesis work focuses on the overall concept creation for the framework for biomechanical modeling of anatomical structures, and modifies a reference model of the tongue to showcase its functionality. The tongue modeling work and the connection to an airway model are presented in Chapter 4.
3.2.1 Graphical User Interface Design for Model and Simulation Editing and Control
Abstracting from the hierarchical task descriptions of Section 3.1.2 and the scenarios of Section 3.1.4 enables the design of six complementary modeling concepts and core simulation frameworks, shown in Figure 3.3, which allow the tasks outlined above to be implemented with relative ease.

Figure 3.3: The orthogonal modeling concepts that support the articulatory synthesis framework: input/output probe abstraction, timeline interface, property inspector, instrumentation of modeling and simulation control, anatomy model library, and creation and editing of models.

Probe abstractions provide hooks into modeling representations that link a probe with particular model parts. Thus, a probe can be used with data-derived articulatory parameters to drive a specific set of model points over time. For example, biomechanical muscle models can be activated using probes. Much of the articulatory data suited to probes consists of n-dimensional time sequences, for example muscle activations and position parameters. These can be represented as graphs, providing an intuitive way to interact with them using a vector-graphics line tool. Other probe types can be defined and included in the model, such as the general measuring tools listed in the probe library in Figure 3.7b. The probe abstraction provides an interface for the wide variety of data sources and sinks that are important for upper airway modeling. An integration of the data extraction and analysis tools that many researchers have developed, and that are part of the normal process of speech research, is not planned here. Instead, the intention is to provide an abstraction of all data sources as probes. An input probe provides an input data source, such as a sequence of articulatory parameters derived from rules, or any other set of extracted data provided by the researcher. Likewise, an output probe is any data extracted from the articulatory speech synthesizer, specified by the researcher through the vocal tract model. These output probes include such things as the acoustic output or a specific articulatory parameter as it changes over time.

Timeline interfaces provide a common metaphor for manipulating time-based data in audio, video, and animation applications, as shown in Figure 3.7a. In the same way that video clips are manipulated in a video editing suite by arranging, scaling, interpolating, and transitioning data, probes such as muscle activations can be manipulated. The use of this timeline is illustrated in the previous scenario. The feedback from researcher discussions and experimentation with prototypes shows that both the timeline and the probes are very intuitive and cover the tasks for time-based manipulations. Since probes can be used for input and output, they allow data to be stored and read in a unified file format. In future work, this concept can be used to create feedback loops where output probes are fed back into input probes to allow control formulations.

Creation and Editing of Models
In addition to time-based manipulation, the actual anatomy shapes, that is, the geometry as well as the internal parameters of the model, need to be manipulated. One example would be the modification of the tongue tip in the previously discussed scenario. ArtiSynth should provide fundamental manipulation capabilities to create a smooth modeling workflow.
In cases where an external 3D editor such as Blender or Maya provides specialized mesh manipulations, it is important to ensure connectivity by supporting standard file formats. In addition, the graphical windows can also be used to improve visualization by allowing cut planes and traces, and to overlay visual aids, images, and video sequences for validation and registration.

Property Inspector
Many models have a number of properties for physics and modeling parameters and for visual appearance. A mechanism that unifies the manipulation of properties and automatically generates the corresponding graphical interface components achieves a unified way of interaction and takes this burden away from the model developer.

Instrumentation of Models and Simulation Control
Much of the functionality of controlling the model is already covered through the timeline and properties. To illustrate the connections and interactions between models, as well as their simulation stability state, it is useful to have a flow graph network of all models in the simulation environment. This simplified representation of the actual geometrical shapes allows a user to see easily how models are connected and constrained, and to indicate stability and accuracy problems. The way simulation systems treat the modeling of time is important for sound synthesis [208] and for heterogeneous simulation systems [33] in order to obtain repeatable and correct results. Data representations and simulation algorithms have varying processing rates and delay requirements for graphics rendering, acoustics rendering, and physical simulations. For physiological modeling, it is reasonable to allow non-real-time simulations in order to relax the system requirements. In this case, time may be represented as discrete ticks, the smallest common denominator of all time-step requirements. A given system would then be advanced by a multiple of this minimal time unit.

Anatomical Model Library
To enable practitioners of ArtiSynth to create their own models without programming, it is useful to provide a model library of different anatomical parts, which can be added and modified in a new formulation. We envision this like a "Tinker Toy Box" on three model abstraction levels: (1) basic components: particles, springs, elements, and rigid bodies; (2) anatomical materials: tissue, bone, and ligaments; and (3) anatomical structures: jaw bone, tongue, and velum. This feature encourages the creation of new models in the way graphics modeling packages work.

3.2.2 Model Components
The above scenario illustrates some of the complex operations that would be useful for researchers developing and using a 3D articulatory speech synthesizer. Interaction elements are outlined to show how these manipulations could be realized. Through multiple infrastructure designs and prototyping cycles, various parts of the system were grouped into five main components plus graphics rendering, as shown in Figure 3.4. The architecture is composed of: (1) a simulator engine containing models and constraints; (2) a scheduler; (3) a graphical user interface (GUI) and timeline module; (4) graphical rendering; (5) a numerics engine; and (6) acoustic rendering. Input probes change model parameters and output probes store simulation results. The data flow is indicated by horizontal arrows and the control flow by vertical arrows. This section has in part been previously published in Fels et al. [80] and Vogt et al. [288].
The model was developed collaboratively by myself, John Lloyd, and other group members. Most of the material described under the headings "Models and Constraints" and "Scheduler" was designed and implemented by John Lloyd. Most of the current implementation of the "Graphical User Interface" and "Numerics Engine" was done by John Lloyd. Graphical rendering was primarily implemented by John Lloyd, Elliot English, and Kees van den Doel, and acoustical rendering was implemented by Kees van den Doel.

Figure 3.4: Block diagram of the ArtiSynth design, showing input and output probe data, the numerics engine, graphical and acoustic rendering, the GUI and timeline interface, models and constraints, and the scheduler (adapted from Fels et al. [80]).

Models and Constraints
Three-dimensional vocal tract and face models consist of a hierarchical object-oriented structure which represents multiple levels of detail. Anatomical structures are represented at the top level by nodes, such as the tongue, lips, and vocal folds. These high-level structures consist of medium-level structures, such as muscle groups, bones, and tissue. The medium-level structures are composed of lower-level structures, such as geometry, muscle fibers, and ligaments. A set of libraries and modules allows the user to define deformable models, as well as the muscular activation of the model, in three dimensions. The framework supports models built from components arranged in a hierarchy. All components are based on the ModelComponent interface, as shown in Figure 3.5. This interface supplies methods to query a component's name, parent, index, and selection state. A CompositeComponent interface gives access to contained sub-components, enabling hierarchical structures. The combination of both interfaces enables the creation of model hierarchies which can, further, be manipulated in graphical user interfaces such as that of the biomechanical model in Section 3.2.2.

Figure 3.5: Core simulation interfaces and their essential methods: ModelComponent (getName(), getParent(), isSelected(), getIndex()), CompositeComponent (getComponents(), getComponent(index), getComponent(name)), DynamicComponent (getState(state), setState(state)), Model (initialize(t0), advance(t0, t1)), and Constraint (modifyInputs(t0, t1), modifyOutputs(t1), getModels(), isDependent(model)) (reproduced from Fels et al. [80]).

One building block of the framework is the pair of DynamicComponent and Model interfaces, which generalize state-based dynamic and parametric models in terms of their ability to initialize and advance their state over time. A central simulator is able to manage discrete time-based model constellations. Generalizing model-internal parameters as properties provides a unified interface for user interaction. Properties can be model-internal state, such as position, velocity, stiffness, or mass, as well as visualization and annotation information, such as color, texture, or anatomical description. This generalization of properties is particularly important for exposing the model interior for manipulation. This modeling framework enables the representation of functions of anatomical complexes of the upper airway. This has been demonstrated by implementing biomechanical structures such as the deformable tongue model by Vogt et al. [290], the rigid jaw model by Stavness [252], and acoustic models by van den Doel et al. [65]. The use of this interface supports many future models that will be able to be integrated into this framework, such as fluid dynamic models.
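To make the component hierarchy of Figure 3.5 concrete, the following Java sketch paraphrases the listed interfaces, together with the step-size query introduced under "Scheduler" below. The names follow the figure, but the parameter types, return types, and the use of long integers for time are illustrative guesses only and do not reproduce the actual ArtiSynth API.

    import java.util.List;

    interface ModelComponent {
        String getName();                       // component name within its parent
        CompositeComponent getParent();         // enclosing composite, or null at the root
        int getIndex();                         // position within the parent
        boolean isSelected();                   // selection state for GUI interaction
    }

    interface CompositeComponent extends ModelComponent {
        List<ModelComponent> getComponents();   // all contained sub-components
        ModelComponent getComponent(int index);
        ModelComponent getComponent(String name);
    }

    interface DynamicComponent extends ModelComponent {
        void getState(double[] state);          // copy dynamic state (e.g., positions, velocities)
        void setState(double[] state);
    }

    interface Model extends ModelComponent {
        void initialize(long t0);               // set the model state at time t0
        void advance(long t0, long t1);         // advance the state from t0 to t1
        long getMaxStepSize();                  // largest advancement step the model tolerates
    }

    interface Constraint extends ModelComponent {
        List<Model> getModels();                // the models coupled by this constraint
        boolean isDependent(Model m);
        void modifyInputs(long t0, long t1);    // adjust coupled model inputs before a step
        void modifyOutputs(long t1);            // reconcile coupled model outputs after a step
    }

With these interfaces, any model, from a finite element tongue to a parametric lip spline, can be placed into the same component tree and driven by the same scheduler and GUI machinery.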
Scheduler
The scheduler is a central component which calls models to advance over time. To enable models to be controlled by the scheduler, they must supply two principal methods:

    model.initialize (long t0);
    model.advance (long t0, long t1);

The first initializes the model to time t0, and the second advances it from time t0 to t1. Both methods are used by the scheduler to drive the simulation. Models are classified as either parametric or dynamic, depending on whether their advancement depends on an internal state. This approach is suitable for biomechanic and parametric models alike, as described in Section 3.2.2. Biomechanic models use force activations and underlying differential equations to calculate forces and positions evolving over time, in contrast to parametric models, which in general only have descriptions of position as trajectories. Because dynamic models contain state, their time advancement generally requires numerical integration from some known initial state, and there is usually a maximum step size at which this integration should be performed. Models make this step size known via the method

    model.getMaxStepSize();

and the scheduler then ensures that the step size used by advance never exceeds it. Models are advanced at these time steps in the same order to ensure repeatable results. The common alternative scheduling scheme of creating multiple threaded simulation loops would not give repeatable results, since the scheduler timing would be machine- and load-dependent, and this approach is therefore not pursued.

Graphical User Interface (GUI)
A primary aim of the framework is to allow a user to interactively control the simulation of the vocal tract model, using different sets of control inputs, and to record the resulting trajectory of specific observables. Thus, the interface concepts of Section 3.2.1 are applied to create the graphical user interface. To facilitate this process, the framework uses the concept of input probes and output probes. An input probe provides a stream of data which drives the simulation; examples include muscle activation levels, external forces, parameters for parametric models, glottal excitation waveforms, etc. An output probe supplies a stream of observable data resulting from the simulation; examples include the locations of specific marker points attached to a model (such as the tongue tip), interaction forces, or generated acoustic waveforms. Output probes can also supply functions of observable data, such as the distance or angle between marker points or cross-sectional areas of the airway mesh. Many of the users for whom the simulation architecture is intended are not primarily programmers. This is one property shared with computer game engines. Therefore, the core simulation module is separated from the user interface.

Numerics Engine
Most physics-based simulations have their numerical processes as a bottleneck. The numerics engine separates the numeric algorithms from the simulation engine to allow for flexibility in the implementation. This allows for central numerical optimization and alternative implementations, such as implicit versus explicit Euler integration. In order to have interchangeable integration schemes, we defined a numeric solver interface for ordinary differential equations (ODEs). It is of note that most interactive simulations use implicit integration schemes to provide more stable solutions. The numeric calculation for the state of the next time step is expressed as a sparse matrix. For linear dynamic components (e.g.
finite elements or springs), a linear solver is sufficient. In the case of nonlinear components (e.g. hyperelastic finite elements or nonlinear springs), the solver either needs to linearize the formulation so that a linear solver can be used, or use a nonlinear integration scheme (e.g. the Newton-Raphson method). Nonlinear schemes perform multiple iterations to converge on the exact solution [99].

Graphical and Acoustic Rendering
Graphical and acoustic rendering are separated to ensure functional independence and to address the differing timing requirements between acoustics and graphics. The motion of anatomical substructures can be modeled at a relatively coarse temporal resolution of around 50 ms. High quality audio synthesis occurs at a sampling rate of 44.1 kHz, which requires a temporal resolution of around 23 µs, about 2000 times finer. However, O'Brien, Cook & Essl [185] simulated the physics of surface vibrations for graphics applications at 10 times the audio sampling rate to obtain plausible results, which may have to be considered for bone vibration studies. In principle, the simulation capabilities of the framework may be used for the simulation of aeroacoustical phenomena as well, but this requires the entire simulation to run at the audio rate, which leads to extremely long run times. The time scale of aeroacoustic phenomena is between one and two orders of magnitude smaller than the time scale of vocal tract motion; aeroacoustic analysis can therefore be carried out by treating the vocal tract position as quasi-static. This section introduced the core modeling framework; the following sections address the models in more detail.

Modeling Methods
The framework design allows for the integration of vastly different modeling methods: biomechanical models, parametric models, and acoustic models. All of them share the ability to advance in time. Interaction between models is formulated as constraints; this is covered in more detail in Section 3.2.2.

Biomechanical Modeling
As originally described by Fels et al. [80], this framework supports the integration of biomechanical models in the form of rigid and articulated bodies, particles, springs, and finite elements. More specialized anatomical components, such as the piecewise linear and nonlinear muscle models of Hill [108], are supported. A biomechanical model created in this context contains all required components in a single model, which governs the overall advance by integrating the dynamics in the form

    M(x) ẍ = f(x, ẋ, t).    (3.1)

Here, M is the overall mass matrix for all components, x is the overall dynamical component state, f is the overall force acting on the components, and t is time. So far, dynamic components include rigid bodies, particles, and finite element nodes. Forces are created from springs, finite elements, gravity, external actuation, and contacts. Advancing the model can be performed with both explicit and implicit integrators such as forward Euler, backward Euler, or Runge-Kutta. In practice it has been found preferable to choose implicit methods to solve linear finite element systems, for their speed-stability trade-off. Solving such systems is based on the state-space Jacobian

    J = ∂f / ∂(x, ẋ).

The resulting sparse matrix solution is supported by interchangeable iterative and direct sparse solvers such as Pardiso [232], UMFPACK [57], or TAUCS [278]. In particular, the finite element tongue model described in Section 4.1.1 is implemented in this way.
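As an illustration of the implicit stepping just described, the following sketch applies the standard linearized backward-Euler update to a single damped spring, where equation (3.1) and the Jacobian J collapse to scalars and the sparse solve becomes a single division. It shows the general scheme only; it is not the ArtiSynth implementation, and the constants are arbitrary.

    public class ImplicitEulerSketch {
        public static void main(String[] args) {
            double m = 1.0;            // mass (M in equation 3.1)
            double k = 400.0;          // spring stiffness, so df/dx = -k
            double c = 2.0;            // damping coefficient, so df/dv = -c
            double h = 0.01;           // time step
            double x = 0.1, v = 0.0;   // initial stretch and velocity
            for (int step = 0; step < 5; step++) {
                double f = -k * x - c * v;             // current force f(x, v)
                double lhs = m + h * c + h * h * k;    // scalar form of (M - h df/dv - h^2 df/dx)
                double rhs = h * (f - h * k * v);      // scalar form of h (f + h (df/dx) v)
                double dv = rhs / lhs;                 // the "sparse solve" reduces to a division
                v += dv;                               // velocity update
                x += h * v;                            // backward-Euler position update
                System.out.printf("t=%.2f  x=%.5f  v=%.5f%n", (step + 1) * h, x, v);
            }
        }
    }

In the full 3D case the scalars m, k, and c become the sparse mass and Jacobian matrices assembled over all dynamic components, and the division becomes a call to one of the solvers named above.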
Parametric Models
In general, parametric models describe the shape of structures without containing any of the dynamic state information on which biomechanical models are based. Examples of such shapes are 2D/3D splines (for 3D lips [186, 214] and a 2D parametric vocal tract [224]) or a mesh controlled by basis reduction using component analysis (for the tongue [73] and for the face [141]). While it is convenient to create parametric anatomy models from images or scanned data sets, it remains difficult to formulate their interaction with other models, since local interaction behavior cannot easily be formulated to feed back into a parametric model. Nevertheless, it is possible to mix biomechanical and parametric models, but the interaction will be limited to unidirectional behavior, where the parametric model takes a master role and is not affected by interactions. Therefore, parametric models are limited in their use for building up complex models from independently developed sub-components. Still, parametric models are valuable as a reference or comparison for equivalent biomechanical models, which fits nicely into a common validation framework. Finally, it is important to mention that there are dynamic model formulations which use dimension reduction in the dynamic state space for solid [123, 124] and fluid dynamic [279] formulations. These are used to speed up the simulation, while reducing the result space.

Model Connections and Constraints
In addition to the internal material behaviors that describe materials, bilateral and unilateral constraints and connections between models are useful for formulating inter-model connection and constraint equations. As originally described by Fels et al. [80], this framework supports connections between different model types using loose coupling. One way to facilitate the coupling is the use of Constraint components, shown in Figure 3.5. Constraints enable the coupling of models using a dependency graph, so that the scheduler advances the models in the right order. To limit the complexity of the system, relaxation of some constraints is allowed. First, a quasi-static state of articulatory gestures is used. Keeping the vocal tract static over a period of time provides a simplification of the coupling between biomechanics and aeroacoustic or fluid dynamic scenarios. Second, the geometry or anatomy is the common ground for the simulations: this makes it difficult to abstract away from the geometry while keeping the models coherent for their interfacing. Dynamic components support their attachment to one another, in which case the state of the master component sets the slave components. This allows close coupling of models, which has been shown to connect finite element models to one another or to rigid models. Formulations for outer control and for connections between various kinds of dynamic and parametric models, as listed in the taxonomy of Section 2.3, provide a solution for the requirement of model integration of coupled systems. In particular, the concept of model connections may allow the building of greater complexity from existing substructures and the mixing of different modeling methods, such as a rigid body with finite elements. The motivation is to connect different anatomical structures in order to study their joint behavior. Outer control of models by means of constraint formulations may allow the imposition of limits such as joints, curvature, position, maximum deformation, and collision.
Our early work [288] used a generalized definition in which all inter- and intra-model formulations are constraints, as described in Section 3.2.2. An alternative representation, used in this section, is based on the fact that the mathematics and function of inter- and intra-model formulations are distinct; it therefore distinguishes between connections as inter-model formulations and constraints as intra-model formulations. The following section describes how much progress has been made in formulating model connections.

Connections can be formulated in two ways: unidirectional, where one "master" model influences another dependent "slave" model but not vice versa, and bidirectional, where two models influence each other both ways. Further, there are "loose" connections, in which coupled systems can be solved independently for each advancement, and "rigid" connections, which in general require solving combined system equations to ensure accurate and converging results. Experiments with the first ArtiSynth prototype (see the development history in the co-authorship statement) showed that connections for parametric models are limited to unidirectional ones, where the parametric model has to take the "master" role. Parametric models in "slave" roles in unidirectional connections, or in bidirectional connections, may be possible given an inverse model formulation. In contrast, biomechanical models allow unidirectional and bidirectional "loose" connections with model-independent formulations. Bidirectional "rigid" connections, as described earlier, may require the solution of combined system equations. Connection formulations are specific to their opposing paired model types, which is documented in the point-based connection matrix of Table 3.1. This matrix is limited to the description of unidirectional connection formulations. Bidirectional connection formulations may be formulated as, but are not limited to, a pair of two unidirectional connections, under the assumption that the connection directionality is orthogonal. Beyond the point-based case, connections may be extended to a higher order of spatial complexity, such as curve-based and surface-based connections, as well as to temporal complexity, such as time-window formulations.

From/To: P, RB, AT, S, FE, DP, SP, MV
Particle (P): 0 0 0 0 0 0 2 1
Rigid Body (RB): 0 J 1 0 0 0
Attachment (AT): 0 1 0 2 1
Spring (S): 0 0 0
Finite Element (FE): 0 3 0
Damper (DP): 0 0 2 2
Spline Point (SP): 2 2 1
Mesh Vertex (MV): 1 1 2 1 2 2 1

Table 3.1: Point-based connection matrix between different model types: 0=dynamic state, 1=position, 2=kinematic state, J=rigid body joint, 3=marker-based state projection (adapted from work by John Lloyd).

This work explores only a few of these connection formulations as prototypes and leaves the remaining ones for future work. Vogt, Chen, Hoskinson & Fels [287] demonstrated a unidirectional connection of a set of mesh vertices to particles of a mass-spring system by setting the position state of the particles. Vogt [286] showed two ways to create a unidirectional model connection to a finite element system: (a) imposing forces on finite element nodes using zero-length springs; and (b) setting position states on finite element nodes. The demonstration system was based on, but not limited to, 2D elements and was advanced with a forward Euler method.
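The two unidirectional coupling schemes of [286] can be sketched as follows. The FemNode and Point3 types are invented here for illustration; the real ArtiSynth classes and method names differ.

    class Point3 {
        double x, y, z;
        Point3(double x, double y, double z) { this.x = x; this.y = y; this.z = z; }
    }

    class FemNode {
        Point3 pos = new Point3(0, 0, 0);     // current node position
        Point3 force = new Point3(0, 0, 0);   // accumulated external force for this step
    }

    class UnidirectionalCoupling {
        // (a) impose forces on a finite element node via a zero-length spring to a master point
        static void applyZeroLengthSpring(FemNode slave, Point3 master, double stiffness) {
            slave.force.x += stiffness * (master.x - slave.pos.x);
            slave.force.y += stiffness * (master.y - slave.pos.y);
            slave.force.z += stiffness * (master.z - slave.pos.z);
        }

        // (b) set the position state of the finite element node directly from the master point
        static void setPositionState(FemNode slave, Point3 master) {
            slave.pos.x = master.x;
            slave.pos.y = master.y;
            slave.pos.z = master.z;
        }
    }

Scheme (a) keeps the slave node dynamic, with the spring stiffness controlling how rigidly it follows the master, while scheme (b) removes the node's own dynamics entirely at the connection point.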
All these connection methods are suitable for implementing in explicit integration schemes, where all model dynamics are integrated and then connections are updated by transferring forces and states to connected models. The use of a one-step explicit integration scheme has the issue that wave propagation e ects take place such that multiple advancement steps are needed to provide converging results. This nding led the choice of the implicit integration scheme described in Chapter 4 for the 3D 70 nite element muscle model. Future work directions include a nite element tongue bidirectional connection between a 3D nite element tongue and a rigid body laryngal model, for which an initial prototype was put together by John Lloyd, Ian Stavness, and myself. Acoustics Modeling and Coupling The acoustics modeling situation is similar to an integration of audio and motion in computer graphics. Running a detailed simulation using nite elements at audio rates was attempted [184], to calculate an animation with sound from a physical model of deformable bodies, which resulted in extremely long processing times. A di erent approach was taken [64], where the audio and motion simulators run their own specialized models in parallel at di erent rates. This allows for real-time high-quality interactive simulation with motion and sound. A linear audio model is currently connected to the airway model for the production of vowels. Information about the airway's 3D shape that is coupled to a subset of relevant anatomical parts, as described in Section 3.2.2, is used to generate a wave propagation channel which is excited by a glottal excitation. The glottal excitation is implemented as a special input probe (see Section 3.2.1). The probe can either read PCM data from a le, or generate the glottal wave algorithmically according to the Rosenberg model [221]. The wave propagation through the vocal tract can be modeled using the well-known Kelly & Lochbaum [130] tube segment lter model. V•alim•aki & Karjalainen [294], V•alim•aki [295] improved the model by using conical tube sections and a fractional delay ltering method. Carre [37, 38] proposes an improved model of the linear tube representation using distinct regions. The linearized Navier-Stokes equations may be approximated numerically, based on the related work by Stam [251] using real-time simulation of turbulence visual ow e ects on a 2D grid. Dobashi et al. [61, 62] demonstrated real-time sound synthesis for 3D structures. Richard, Liu, Snider, Duncan, Lin, Flanagan, Levinson, Davis & Slimon [216], Richard, Liu, Sinder, Duncan, Lin, Flanagan, Lin, Levinson, Davis & Slimon [217] created uid dynamics simulation for the vocal tract simulated vowel sounds and later fricatives [218]. An advantage of the latter approach is that the airway can be stretched continuously (when pursing the lips for example), which is not possible with the classical Kelly-Lochbaum model, which requires a xed grid size. Through careful use of Java interfaces, all details concerning the airway's coupling to the surrounding anatomical structures are hidden from the audio model. This allows for easy 71 modi cation to the airway model without requiring any changes in the audio code, and vice versa. The airway model also provides an arena for more sophisticated audio modeling techniques based on uid dynamics, which we are currently developing. 
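As an illustration of the tube-segment approach mentioned above, the following Java sketch implements a basic Kelly-Lochbaum scattering chain driven by a glottal sample stream. It assumes one common sign convention for pressure waves, lumps the glottal and lip terminations into constant reflection coefficients, and ignores the fractional-delay and loss refinements of Välimäki & Karjalainen, so it is an illustrative sketch rather than the audio code used in the framework.

// Minimal Kelly-Lochbaum tube-segment sketch (pressure-wave sign convention).
// Illustrative only; not the framework's audio implementation.
final class KellyLochbaumTract {
   private final double[] refl;       // reflection coefficients at interior junctions
   private final double[] fwd, bwd;   // right- and left-going pressure waves per section
   private final double glottisRefl = 0.97;   // nearly closed glottal end
   private final double lipRefl = -0.9;       // nearly open (inverting) lip end

   KellyLochbaumTract(double[] areas) {
      int n = areas.length;
      refl = new double[n - 1];
      for (int i = 0; i < n - 1; i++) {
         // pressure-wave reflection coefficient between sections i and i+1
         // (for volume-velocity waves the sign would flip)
         refl[i] = (areas[i] - areas[i + 1]) / (areas[i] + areas[i + 1]);
      }
      fwd = new double[n];
      bwd = new double[n];
   }

   /** Advance one audio sample: inject a glottal sample, return the lip output. */
   double step(double glottalSample) {
      int n = fwd.length;
      double[] nf = new double[n];
      double[] nb = new double[n];
      // scattering at the interior junctions (simultaneous update, one sample delay per section)
      for (int i = 0; i < n - 1; i++) {
         double r = refl[i];
         nf[i + 1] = (1 + r) * fwd[i] - r * bwd[i + 1];
         nb[i]     = r * fwd[i] + (1 - r) * bwd[i + 1];
      }
      // glottal end: excitation plus partial reflection of the returning wave
      nf[0] = glottalSample + glottisRefl * bwd[0];
      // lip end: partial inverting reflection; the transmitted part is the output
      nb[n - 1] = lipRefl * fwd[n - 1];
      double lipOutput = (1 + lipRefl) * fwd[n - 1];
      System.arraycopy(nf, 0, fwd, 0, n);
      System.arraycopy(nb, 0, bwd, 0, n);
      return lipOutput;
   }
}

Because the reflection coefficients are computed directly from the area function, the same structure can be rebuilt for each quasi-static frame as the biomechanical airway shape changes.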
3.2.3 Validation, Experimentation, and Control Modules
The validation, experimentation, and control modules are extensions to the core framework shown in Figure 3.4 that show how a high-level task can be performed with the framework. The process diagram in Figure 3.6 illustrates how this module is another key component of the architecture, handling the import and manipulation of various data sources. This module allows the manipulation of the state space for animation data by: (1) base pose geometry import; (2) key frame animation for articulatory gestures; and (3) validation through geometric data analysis. The use of import/export filters for geometric models allows the use of 3D animation editors, such as Maya, 3DMax or Blender, to exploit existing animator skills. Additional data may originate from video cameras, ultrasound scanners, Magnetic Resonance Imaging (MRI) scanners, and visual trackers to provide geometric comparison of simulation results. In addition, vocal recordings may enable spectral comparison with the acoustics simulation output. Further, the data representation allows for multiple levels of detail (LOD) to find the required techniques for physiological modeling. Established computer graphics techniques for subdivision surfaces and geometry, used to achieve more detail from coarse data sources for faces, have been demonstrated by Jeong, Kähler, Haber & Seidel [125]. Further, mesh mapping techniques such as those used by Couteau et al. [50] and Kraevoy & Sheffer [138] allow registration of generic modeling shapes to subject-specific shapes. This mesh registration enables transfer of the dynamic properties and clean meshes from generic models. This ability to cross-reference models significantly reduces the manual work for a patient-specific model and enables a convenient way to perform cross-subject comparisons.

Figure 3.6: Anatomy Modeling Architecture (Process Diagram).

3.2.4 Library of Anatomy Components
Default Model In order to allow non-programming modelers and practitioners to create their own models, it is important to supply, in addition to a graphical user interface which enables editing, a library of anatomy components as building blocks. This library would provide an anatomical "Tinker Toy" set on the different model abstraction levels, from high to low: anatomical substructures on a parametric and biomechanical basis such as the tongue, lips, and jaw, as illustrated in Figure 3.7a; biomechanical materials such as tissues, bone, and muscles; and elemental dynamic components such as finite elements, springs, and rigid bodies. Analogous to the model library is a library of probes, in particular for the measuring tools illustrated in Figure 3.7b.

3.2.5 Model Building from Images
For many models and simulation purposes, a generic or average model is sufficient to answer some research questions.
An average model may not be sufficient where modeling an individual subject or patient more closely is required. In this case, recent improved methods for fast and high-resolution imaging of internal organs, such as Magnetic Resonance Imaging (MRI), three-dimensional ultrasound, and Computed Tomography (CT), combined with automatic image extraction techniques, dynamic modeling techniques, and increased computation capacity, enhance the possibilities for creating realistic computational models. Methods and examples of how this can be accomplished are discussed in Chapter 5.

Figure 3.7: A mock-up of (a) the model library window showing different anatomy components and (b) the probe library window showing different probes for measurement and manipulation (Reproduced from Fels et al. [83]).

3.3 Matching Requirements with Realizations
To inform the framework design, target nonfunctional and functional requirements, discussed in Sections 3.1.2 and 3.1.3, were assessed in a user-centered design process. The following tables summarize how the requirements are addressed in the framework design. In summary, most of the nonfunctional requirements were addressed in the framework design. While the framework design concentrated primarily on the needs of non-programming users, room for improvement remains for the needs of programming users, specifically by reducing the complexity of implementing models. The functional requirements were addressed by the framework design, and this is demonstrated, together with its critical components, in Chapter 4.

Requirement: Realization
Availability: Free and open source distribution
Portability: Sun Java produces machine-independent code to maximize portability; optional native code for time-critical solvers maximizes speed
Interactivity: Extensive graphical user interface design, support, and components empower non-programming users to create, modify, and simulate models
Extensible: Modeling application programming interface (API), file formats, support library
Usability: Graphical user interface, abstract concepts and metaphors, e.g. timeline
Complexity: Limited success; complexity still high
Support for Research: Exchange of modeling methods, validation, support for data import, probe concept
Support for Education: Modeling library and probe concept for examples
Speed and Accuracy: Support of fast simulation methods, formulation of model creation and validation procedures

Table 3.2: A List of Nonfunctional Requirements with Matching Design Resolutions.

Requirement: Realization
Import and Integration: Support of file import formats and various biomechanical, parametric, and acoustic modeling methods; connection of models
Analysis and Comparison: Probe output and export of geometry, MATLAB integration
Comparison of different data: Limited internal support, relies on external tools; conceptualization stage for model creation and validation
Synthesis of speech: Limited; parametric articulatory synthesis model
Integration of vocal tract with face: Preliminary face model, no complete integration, but support available

Table 3.3: A List of Functional Requirements to Support Modeling Tasks and Design Resolutions.

3.4 Summary
This chapter introduces a simulation framework for upper airway modeling. Using a research community-centered design approach by Fels et al. [83] led to requirements, components, and an implementation-independent framework design.
Based on the model taxonomy, presented in Figure 2.8, and six complementary modeling concepts, the uni ed framework was created for di erent model types including biomechanical, parametric, and aeroacoustics. Interactions across and within models are formulated as connection and constraint concepts. The framework addresses the need for both non-programmers and programmers by creating graphical editing and creation support, which facilitates versatile manipulations with minimal burden for model developers. The framework design by Vogt et al. [288], Vogt, Guenther, Hannam, van den Doel, Lloyd, Vilhan, 75 Chander, Lam, Wilson, Tait, Derrick, Wilson, Jaeger, Gick, Vatikiotis-Bateson & Fels [289] and Fels, Vogt, van den Doel, Lloyd, Stavness & Vatikiotis-Bateson [81] includes interfaces for models, constraints, probes, numerics, graphical rendering, and acoustic rendering, as well as time control via a scheduler and graphical timeline. In order to obtain meaningful results, the process of model creation and validation is addressed. Within the overall ArtiSynth team, this thesis work focused on the conceptual development and framework design for this project. Further, this thesis work provided a proof of concept of the modeling framework to integrate di erent modeling types. The next chapter presents the proof of concept of the simulation framework by implementing a biomechanical tongue model as a key- component in creating a working vocal tract model. 76 Chapter 4 Creation of a Tongue Model for the Complete Vocal Tract This chapter describes an interactive biomechanical 3D tongue model using the simulation framework presented in Chapter 3. The interactive tongue model makes two main contributions to this thesis. First, the integration of an existing tongue model by Gerard et al. [96] in the simulation system demonstrates the framework's design. The tongue is a critical component of the upper airway due to its complexity and connectivity. In particular, the demonstration of connections to other models such as jaw and airway within the framework shows that models developed separately may be connected to build more complex biomechanical models towards a complete vocal tract. Second, the creation of a fast 3D nite element tongue model demonstrated interactive biomechanics. The motivation here is to gain interactivity for general soft tissue and muscle models as requirements discussed in Chapter 3.1. A nite element sti ness warping method is applied to accelerate the simulation with little loss of accuracy. Integrating an existing and established tongue model allows for focus on ecient modeling methods for building anatomical models. The nite element method builds upon the simulation framework described in Chapter 3 to allow easy replacement and interaction with other modeling types. The fast nite element solution is based on sti ness warping to provide an interactive simulation solution for other muscle and tissue structures. This chapter also discusses model validation techniques by example of the tongue to ensure its accuracy. Validation is an important step in the development process, which should be carefully performed for future physiological structures towards a complete vocal tract model. 77 4.1 Building Deformable Anatomical Models There is a rising need for interactive and accurate simulation of the human oral pharyngeal anatomy to assist research programs in medicine, dentistry, and linguistics. 
Relevance exists for the study of physiological functions (such as speech production, feeding, and breathing), dental and surgical training, and result prediction for clinical interventions. The application of a dynamical model is signi cantly dependent on the speed with which it can be simulated. For a system to be considered to have interactive rates, it needs to respond quickly enough to appear responsive to human interaction. Fast simulation methods, which may be approximate, are required to ensure interactivity. Where as interactive simulation does not necessarily need to run in real-time, it rather may run 2 to 5 times slower than real-time. This is sucient for a spectrum of possibilities for practitioners to explore what-if scenarios for training and preoperative planning. Figure 4.1: Vowel postures /a/ (red), /i/ (green), and /u/ (blue) composited from magnetic resonance images by Engwall & Badin [75]. This picture shows that tongue and lips are major deforming soft tissues in the upper airway. Figure 4.1 shows a typical midsagittal view of the head for three vowel postures. Di erences in colors show posture di erences, which illustrates deformed anatomical structures. The jaw, tongue, velum, and lips contribute to most deformations and are therefore good candidates for this investigation. Other than the jaw, which is often modeled as rigid, the tongue, velum, and lips have deformable attributes. Each of these deformable structures are important for realistic speech modeling. This section describes an ecient nite element tongue model, shown in Figure 4.2a/b 78 in rest pose/ deformed pose. In the future this tongue model will allow the interconnection with other models such as the jaw model by Stavness, Hannam, Lloyd & Fels [254] and the airway model by van den Doel et al. [65] described in Section 4.1.2. (a) (b) Figure 4.2: (a) 3D tongue model at rest showing surface mesh and (in cutaway) FEM edges corresponding to muscle bers (Reproduced from Vogt et al. [290]); (b) deformed tongue model caused by muscle activation. 4.1.1 Efficient Anatomical Tongue Model The tongue is the principal organ of the oral pharyngeal anatomy. In contrast to muscle-bone systems such as the jaw, where a rigid structure is de ned by the bones and forces acting on it are applied by the muscles, in the tongue, the muscles de ne, simultaneously, a exible structure and the forces acting on it. The background of tongue anatomy is described in Section 2.1.1. In this section originally published by Vogt et al. [290], it is shown that it is feasible to rapidly compute the dynamics of a nite element model of a muscle-activated tongue (Figure 4.2) with tolerable accuracy, using a sti ness-warping method such as that described by M•uller & Gross [175]. This builds on the work by Gerard et al. [96], who developed an FEM tongue model comprised of hexahedral elements and simulated it in ANSYS. ANSYS provides accurate FEM solutions, but at high computational cost: the ICP tongue model in ANSYS takes in the order of 105 times of event time to simulate. The accuracy of this faster approach is tested by comparing it to tongue and simple tissue and muscle models computed using ANSYS. 
The contributions of this research include: 79  Coalesce sti ness-warping with muscle forces projected on FEM edges to conceive a rapid model of muscle-activated tissue  Establishing that this kind of model can be combined with an implicit integrator that is solvable by a conjugate gradient solver  Creation of a test suite for simple tissue and muscle models to ensure correct implementation and accuracy  Determining the accuracy of fast method compared to a published reference tongue model It is anticipated that the techniques described here may be applied to many other types of muscle- activated tissue. The tongue anatomy has been modeled in many ways by a number of researchers. Parametric representations of the tongue shape have been devised based on statistical methods by Badin et al. [18], Engwall [72] and spline descriptions by King & Parent [134], Stone & Lundberg [259]. A physiological representation is described by Takemoto [266]. Dynamic models have been constructed using both mass-spring systems by Dang & Honda [55] and nite element methods by Gerard et al. [96], Payan & Perrier [195], Wilhelms-Tricarico [300]. A recent survey by Hiiemae & Palmer [107] describes these representations and applications in detail. Another requirement in addition to eciency is the ability to be compared with real measurements, thus an e ective tongue model must provide:  Representation of both tissue and muscle ber;  Large deformations, particularly at the tongue tip;  Incompressible (hyperelastic) and non-linear tissue deformation Finite element methods constitute a good solution for representing a combination of tissue and muscle ber, and are applied traditionally in engineering [22, 305]. FEM models also provide greater stability and accuracy than mass-spring models. However, current FEM solutions [54, 95, 195, 300] do not compute in real-time. Section 2.3.1 discusses di erent nite element formulations and their tradeo s, summarized in Table 2.3. In addition, formulations can be applied with di erent element types such as tetrahedron, hexahedron, or quadrilateral. Recent innovations in the area of physical-based animation [175, 269] and medical computation [49] suggest nite element algorithms that can run in real-time, but with decreased (or even unknown) 80 accuracy, to yield plausible results even for large deformations. The representation of the tongue's muscle activation is especially signi cant for simulating physiological functions such as speaking and swallowing. Most established methods for muscle tissue modeling combine Hill's non-linear spring model [108] with either mass-spring systems [55], nite elements [98, 269], or Cosserat models [189, 190, 263]. The tongue model described here follows Gerard et al. [94] and uses a force-based muscle model which is described below. The nite element modeling method developed by M•uller & Gross [175] has been implemented here in three dimensions providing a real-time and unconditionally stable solution for a wide range of tissue models. The method o ers real-time capabilities with limited loss of precision. In computer animation, linear elasticity models described by Equation 4.3, are popular for real-time simulations, but the drawback is that these models are not precise for large rotational deformations. 
Müller & Gross's stiffness warping algorithm is based on linear-displacement tetrahedra to solve the underlying partial differential equations; this removes the artifacts that linear elastic forces show while keeping the governing equation linear. The problem is solved by extracting rotations of elements rather than rotations of vertices. For a single tetrahedral element with stiffness matrix K_e, the forces f_e acting at its four vertices are

f_e = K_e (x − x_0) = K_e x + f_{0e},    (4.1)

where x contains the positions of the four vertices and f_{0e} contains force offsets. This method assumes that the rotational part R_e of the deformation of the tetrahedron is known. Then, using the warped stiffness method, the forces compute as

f_e = R_e K_e (R_e^{-1} x − x_0) = R_e K_e R_e^{-1} x − R_e K_e x_0 = K′_e x + f′_{0e}.    (4.2)

In this way, the same forces are obtained as though the regular linear elastic forces were computed in a rotated coordinate frame. In more detail, as shown in Figure 4.3a, forces are applied to the tetrahedral vertices as follows: the deformed coordinates x are rotated back to the original frame as R_e^{-1} x, the displacements R_e^{-1} x − x_0 are multiplied with the stiffness matrix, resulting in forces K_e(R_e^{-1} x − x_0), and finally the forces are rotated back to the frame of the deformed tetrahedron by multiplying them with R_e. In summary, this method separates the deformation into a rotational part and a linear part. The performance of the algorithm is similar to linear elasticity models, with a smaller loss of accuracy, as shown by Nesme, Payan & Faure [179].

Figure 4.3: (a) Elastic forces acting at the vertices of a tetrahedron are computed where its deformed coordinates x are rotated back to an unrotated frame R_e^{-1} x. Displacements R_e^{-1} x − x_0 are multiplied with the stiffness matrix, yielding the forces K_e(R_e^{-1} x − x_0), which are finally rotated back to the frame of the deformed tetrahedron by multiplying them with R_e (Figure adapted from Müller & Gross [175]); and the small deformation simulation results of a finite element beam (b) with and (c) without stiffness warping.

Adaptation of the Reference Tongue Model Geometry
For this work, the ICP 3D FEM tongue geometry developed by Gerard et al. [96] from medical image data has been used, as shown in Figure 4.2. This geometry contains 946 nodes, connected to form 740 hexahedral elements. These hexahedra are further subdivided into 3700 tetrahedra (using the Freudenthal tessellation with five tetrahedra per hexahedron), as the present implementation of the stiffness-warping algorithm requires tetrahedral geometry.

Dynamic Modeling of the Tissue
As originally described by Vogt et al. [290], the finite element tongue adapts the stiffness warping formulation by Müller & Gross [175]. This is a lumped-mass finite element model in which the vector of mass positions x follows the dynamic equation

M ẍ + C ẋ + R K Rᵀ x − R K x_0 = f_ext + f_m,    (4.3)

with M as the (diagonal) mass matrix, K as the stiffness matrix, C as the damping matrix, and x_0, f_ext, and f_m as the rest positions, external forces, and muscle forces. The overall linear finite element formulation corrects K and f_0 at each time step by factoring out the effect of elemental rotations; each element is referenced to its rest pose, which is referred to as stiffness warping. In this way, large rotations lead only to minimal volume distortion without adding much computation to the linear method.

Figure 4.4: Muscle forces are applied between finite element nodes as indicated by bold lines (Reproduced from Vogt et al. [290]).
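The per-element force of Equation 4.2 can be sketched directly with plain arrays. The Java fragment below assumes the element rotation R_e has already been extracted (for example by a polar decomposition of the element's deformation gradient) and is illustrative only; it is not the framework's production code.

// Sketch of the per-element warped-stiffness force of Eq. (4.2) for one
// tetrahedron: f_e = R_e K_e (R_e^{-1} x - x_0), assuming R_e is already known.
final class WarpedTetForce {

   /** Apply a 3x3 rotation (or its transpose) to each of the 4 vertex blocks. */
   private static double[] rotateBlocks(double[][] R, double[] v, boolean transpose) {
      double[] out = new double[12];
      for (int k = 0; k < 4; k++) {
         for (int i = 0; i < 3; i++) {
            double s = 0;
            for (int j = 0; j < 3; j++) {
               s += (transpose ? R[j][i] : R[i][j]) * v[3 * k + j];
            }
            out[3 * k + i] = s;
         }
      }
      return out;
   }

   /**
    * @param Re 3x3 element rotation
    * @param Ke 12x12 element stiffness matrix (rest configuration)
    * @param x  current vertex positions (4 vertices, stacked)
    * @param x0 rest vertex positions
    * @return elastic forces acting at the 4 vertices
    */
   static double[] elementForce(double[][] Re, double[][] Ke, double[] x, double[] x0) {
      double[] xUnrot = rotateBlocks(Re, x, true);    // R_e^{-1} x (R_e^{-1} = R_e^T)
      double[] d = new double[12];
      for (int i = 0; i < 12; i++) {
         d[i] = xUnrot[i] - x0[i];                     // displacement in the rest frame
      }
      double[] f = new double[12];
      for (int i = 0; i < 12; i++) {                   // K_e * d
         double s = 0;
         for (int j = 0; j < 12; j++) {
            s += Ke[i][j] * d[j];
         }
         f[i] = s;
      }
      return rotateBlocks(Re, f, false);               // rotate forces back: R_e (K_e d)
   }
}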
Dynamic Modeling of Muscle Activation
As originally described by Vogt et al. [290], the internal tongue muscles use finite elements with a common activation level in the unit N. In more detail, as exemplified in Figure 4.4, muscles are comprised of fibers, which are segments between nodes to which a uniform force is applied. Given that i and j are the nodes of an edge, with positions x_i and x_j, the muscle forces acting on nodes i and j are f_ij and −f_ij, with

f_ij = α_ij u_ij,    u_ij ≡ (x_j − x_i) / l_ij,    l_ij ≡ ‖x_j − x_i‖,    (4.4)

where α_ij is the muscle activation level times a mass node weighting. The motivation for this model is described in detail by Gerard et al. [95]. Boundary conditions are formulated in this lumped-mass model as attached nodes at the insertion points of bones such as the jaw and hyoid. For the stand-alone tongue described here, these attachment nodes are rigid. In the future, this will support integration with other models, such as the jaw/laryngeal model by Stavness et al. [254]. This muscle model may be extended in the future to allow attachment points that do not require co-location with FEM nodes. This would allow separate muscle components to activate tissue as systems rather than as a joint modeling characteristic. It would also allow future muscle systems to be placed inside tissue models independently of the FEM characteristics.

Dynamic Integration
As originally described by Vogt et al. [290], the dynamic integration of Equation 4.3 uses an implicit method chosen for its favorable speed-stability properties. This is analogous to the formulation by [175], except that ∂f_m/∂x is evaluated, since f_m depends on the force direction. With the integrator step size h, x_i and ẋ_i the node positions and velocities at step i, and F_m ≡ ∂f_m/∂x, the implicit formulation is

(M + hC + h²K − h²F_m) ẋ_{i+1} = M ẋ_i − h (K x_i + f_0 − f_m − f_ext),    (4.5)

where at each step we solve for ẋ_{i+1}. The computation of this solution is made easier by the fact that the matrix on the left-hand side of (4.5) is symmetric positive definite (SPD), as shown in [290].

4.1.2 Other Upper Airway Anatomy
This section reports on some preliminary work to showcase the strength of the research direction of building a simulation framework that focuses on interconnected upper airway anatomy. This connection illustrates a future path towards the creation of a complete human upper airway model.

Tongue-Jaw Connection
Figure 4.5: Initial results of integrating a deformable tongue model with (a, b) a kinematic jaw with unilateral connections and (c) the dynamic UBC jaw model by Stavness [252] with a bilateral connection.

The tongue model is developed to support the connection formulations implemented in the framework. Examples of finite element tongue to kinematic jaw mesh models are shown for unidirectional connections in 2D in Figure 4.5a, which is described in Section 3.2.2, and as a future work direction in 3D in Figure 4.5b. Independent of the tongue model, the framework also supports an implementation of the jaw and laryngeal biomechanics model by Stavness et al. [254]. The jaw model consists of a fixed rigid skull, a floating rigid mandible, two temporomandibular joints, eighteen muscles, and multiple bite points. To integrate the tongue into a complete upper airway simulation, the connection to other models is crucial. For instance, the tongue is relevant in chewing by exerting pressure, in conjunction with food particles, on the jaw. Further, the tongue presses against the teeth and palate during swallowing.
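The kind of coupling needed here can be sketched as a point attachment between an FEM node and a rigid body, in the spirit of the Attachment entries of Table 3.1. The Java fragment below is a minimal, hypothetical illustration (not the ArtiSynth API): the node is slaved to the body frame kinematically, and the node's elastic force is fed back to the body as a force and torque.

// Hypothetical sketch of a point attachment coupling an FEM node to a rigid body
// (for example a tongue node attached to the jaw). Names are illustrative only.
final class RigidBodyAttachment {
   // rigid body state: rotation R (3x3), position p, accumulated wrench
   double[][] R = { {1, 0, 0}, {0, 1, 0}, {0, 0, 1} };
   double[] p = new double[3];
   double[] force = new double[3];
   double[] torque = new double[3];

   private final double[] localPos;   // attached node position in the body frame

   RigidBodyAttachment(double[] localPos) {
      this.localPos = localPos.clone();
   }

   /** Master-to-slave: world position of the attached FEM node. */
   double[] nodeWorldPosition() {
      double[] w = new double[3];
      for (int i = 0; i < 3; i++) {
         w[i] = p[i];
         for (int j = 0; j < 3; j++) {
            w[i] += R[i][j] * localPos[j];
         }
      }
      return w;
   }

   /** Slave-to-master: apply the node's elastic force back onto the rigid body. */
   void applyNodeForce(double[] f) {
      double[] r = nodeWorldPosition();
      double[] arm = { r[0] - p[0], r[1] - p[1], r[2] - p[2] };
      for (int i = 0; i < 3; i++) {
         force[i] += f[i];
      }
      // torque += arm x f
      torque[0] += arm[1] * f[2] - arm[2] * f[1];
      torque[1] += arm[2] * f[0] - arm[0] * f[2];
      torque[2] += arm[0] * f[1] - arm[1] * f[0];
   }
}

In a loosely coupled scheme the kinematic update and the force feedback are applied in alternate phases of each advancement step, whereas a fully bilateral treatment generally requires the combined system equations discussed in Section 3.2.2.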
As future work, Figure 4.5c shows a fundamental dynamic bidirectional connection of the jaw and tongue models. This future work direction, despite current limitations, shows much potential in developing model connectivity solutions and studying composite dynamic systems. 85 Tongue-Airway Connection As future work direction, the tongue model be has been connected to a mesh airway to model shape-based sounds towards articulatory speech synthesis. The vocal tract shape is modeled using the tongue model in conjunction with xed rigid models of jaw, hyoid, and palate connected to a deformable mesh representing the airway. Actuation of the tissue model deforms the airway, providing a time-varying acoustic tube which is used for the synthesis of sound. The tongue- airway connection showcases the ability of the framework to create sound from biomechanical simulations. The tongue model, as described in Section 4.1.1, is used whereby the tongue muscles are attached by xing FEM nodes to the jaw, hyoid, and skull bones, which are modeled as xed rigid bodies. The tongue motions are constrained by the jaw, the hyoid, and the palate, which are xed static meshes depicted in Figure 4.6. Collision detection and response, which are outside of the thesis scope, is necessary to prevent the tongue from penetrating the jaw, hyoid, and, palate, described by van den Doel et al. [65]. The airway is abstracted into a mesh as an interface between biomechanical models and acoustics. The details about the acoustics formulation are discussed in [63] and Chapter 2. The contributions of this thesis work beyond tongue model are the rigid registration of the models and validation methods using MR images. (a) (b) Figure 4.6: (a) Airway mesh and (b) combined meshes showing the airway (yellow) is connected dynamically to the muscle activated nite element tongue model and palate (red). The palate, jaw, and hyoid constrain the tongue (Both images reproduced from van den Doel et al. [65]). 86 Lips, pharynx, face To create a complete vocal tract model, additional anatomical structures such as pharynx, lips, and face need to be added. The current framework and implementation is sucient to implement these additional models for the pharynx using the work of Stratemann, Miller, Hatcher, Huang & Lang [262] and for the biomechanical face and lips using the work of Chabanas & Payan [40] and Charbanas [41]. 4.2 Validating Deformable Anatomical Models The previous section described details of an ecient anatomical model and the simulation results pertaining to this model. Here, the process of validation is described by example of the nite element tongue model. This section discusses validation results of two methods: (1) tongue shape comparison with a reference model; (2) vocal tract shape comparison with medical images of the tongue-airway. These two validation methods are complementary, since they have di erent strengths and weaknesses. The comparison with a reference model allows quantitative testing of simulation methods and implementations, but requires the existence of an accurate reference model. The comparison with patient or average data veri es the realism of the model, but requires measurement error and model assumptions to be taken into consideration. In this case, quantitative comparisons may not be possible but rather plausible behavior is tested. Beyond these validation techniques, it is important to ensure that the underlying discrete solver gives correct results. 
Problems may arise from the solver not converging or from the solution of the differential equation lying outside the stability region. Many instabilities can be detected by observing the interactive graphical output of the animation. For mesh-based problems, it is good practice to test the mesh conditioning to avoid topological inaccuracies introduced by the ANSYS importer.

4.2.1 Comparison with Reference Simulation
In order to have trust in numerical simulations, the validity, correctness, and limitations of models need to be determined. A correct implementation can be ensured by comparison with results from a reference simulation. For the finite element tongue model implementation, the same geometry simulated in ANSYS is used as a reference, as described in Section 4.2.1. During the development process, various test cases were used to ensure correct behavior and, at the same time, to make sure that the approximations made to speed up the system do not reduce accuracy. The chosen test case, common in surgical simulation and animation, was a rectangular beam which could easily be varied in size and number of elements. This beam was tested under gravity and constant load using the Truth Cube [132] setup, comparing (1) small deformation results in ANSYS and (2) the stiffness warping method against the hyperelastic model in ANSYS. A simple beam muscle was further implemented with the same muscle formulation as in the tongue model and compared to the ANSYS analog. Our simple muscle model, shown in Figure 4.7, consisted of a 3x3x9 block of 81 hexahedral (or 405 tetrahedral) elements with muscle fibers connecting the middle nodes. A 2 Newton muscle activation was applied, and the resulting deformation errors were within 0.3 mm (10%).

Figure 4.7: Single muscle model, with the muscle fibers drawn in red, shown in the rest state (a), and under a muscle activation of 2 Newtons (b).

Tongue Model Results: Stability, Accuracy, and Speed
Results: Accuracy
This section, originally published by Vogt et al. [290], compares the accuracy of our stiffness-warping FEM method (denoted WRP) with two methods implemented using the commercial FEM package ANSYS: (1) a linear small-deformation model (LSD); and (2) a hyperelastic Mooney-Rivlin solid model (HYP). All models consist of the same tetrahedral mesh, except the HYP model, which used both hexahedral and tetrahedral meshing, as described in Section 4.1.1. The tissue elasticity configuration was derived from the experiments reported by Gerard et al. [95]. For both the WRP and LSD models, a Young's modulus of E = 6912 and a Poisson's ratio of ν = 0.49 is used. For HYP, C1 = 1152, C2 = 540, and ν = 0.49 were set. All models used Rayleigh damping (that is, C = αM + βK) with α = 6.22 and β = 0.11. With these models, five tasks were simulated, in which a constant activation was applied to one or more tongue muscles for 1.2 seconds (to reach steady state) and observed at a rate of 10 ms. The tasks are described in Table 4.1.

Task | Muscle activation | LSD max (mm) | LSD mean (mm) | WRP max (mm) | WRP mean (mm)
A | posterior genioglossus (2.0 N) | 3.5 | 2.3 | 2.1 | 1.3
B | anterior genioglossus (0.5 N) | 2.6 | 1.9 | 1.8 | 1.2
C | hyoglossus (2.0 N) | 3.5 | 2.2 | 2.7 | 1.7
D | transversalis (2.0 N) | 0.6 | 0.4 | 0.5 | 0.3
E | inferior longitudinalis (0.5 N) | 5.9 | 5.0 | 3.0 | 2.8

Table 4.1: Muscle activation tasks and end-task deformation errors of these tasks (compared to HYP), resulting from the methods LSD and WRP (analogous to published results by Vogt et al. [290] but repeated with ANSYS Version 10).
The HYP and LSD models were computed using a Newton-Raphson integrator in ANSYS, while our WRP model was computed using the single-step implicit integration scheme of (4.5) with a fixed time step of 10 ms. To judge the model accuracy, the resulting deformations from WRP and LSD were compared with those of HYP (which was assumed to be the most accurate and so used as a reference). In particular, the deformations u_i of ten nodes lying on the mid-sagittal plane of the tongue were compared to the reference deformations u_{ri} resulting from the HYP model. The deformation error e_i of each sample point was then computed as

e_i = ‖u_{ri} − u_i‖,    (4.6)

which is a disparity measured in mm. The mean and maximum of the error e_i were used to judge the overall deformation error. Table 4.1 shows these values for both WRP and LSD at the end points of each of the tasks. Figure 4.8(a) shows the unactivated tongue model, while Figure 4.8(b) shows the mid-sagittal plane nodes before activation and after activation for Task A, as modeled by both HYP and WRP.

Figure 4.8: (a) Tongue at initial rest before activation, showing the fibers of the posterior genioglossus muscle; (b) nodes in the tongue's mid-sagittal plane before activation (REST) and after activation of Task A, as modeled by both WRP and HYP (Reproduced from Vogt et al. [290]; for (b) the simulation was repeated with ANSYS Version 10).

Figure 4.8(b) suggests that the deformations produced by the stiffness-warping model do in fact adhere quite closely to those produced by the hyperelastic reference model. This is supported more quantitatively by Table 4.1, where the mean error for WRP is always within 3 mm, and the maximum 3D error is also close to this, which is 1.5 to 2 times better than the results for LSD. The WRP error of 3 mm may not be quite sufficient for speech modeling. The differences between the WRP and HYP models can be attributed to (1) volume preservation; (2) non-linear stiffness; and (3) locking of linear tetrahedral elements. The HYP model has implicit volume preservation in its formulation, which is missing in the WRP model. The cause of this can be explained by locking of the linear tetrahedral elements, described by Zienkiewicz & Taylor [305] in Volume 1, Section 11.3.2, and the lack of volume preservation. Higher-order elements with volume preservation have since been implemented by Ngan & Lloyd [181]. An inherent source of error is the quantization of time in the discrete integration scheme. This was found to be negligible in the models compared here. Element type differences between tetrahedral and hexahedral meshes did not indicate significant differences in the results.

Results: Speed and Stability
Computation times for the results reported above were markedly faster for the WRP model as compared with the ANSYS LSD and HYP models, with the former requiring only about 10 seconds per simulated second, while ANSYS required about 600 seconds. All tests were run on a 2.8 GHz Pentium IV single-processor computer. For the implicit integration step (4.5), the Pardiso sparse solver was used [231]. A conjugate gradient (CG) solver was also used, although this was slower than Pardiso because a preconditioner was not included, and so about 300 iterations were required to achieve equivalent accuracy. There is a linear dependency for linear finite elements, with or without warping methods, between the number of elements and the computation time, as shown by Müller & Gross [175] and Nesme et al.
[179]. Hyperelastic nite elements have a polynomial relationship between elements and compute time due to the nonlinear Newton-Raphson solver method, described in [8, Theory Reference Chap. 6.4, Structual Guide Chap. 8]. The per-step computation times were 100ms and 600ms for Pardiso and for unpreconditioned CG, respectively, with CG requiring about 2ms per iteration. About 2ms of both these times was required to update the K and f0 terms of (4.3), as required by the sti ness-warping algorithm, and so this overhead is small. Using the Pardiso solver with a 10ms step size resulted in a simulation speed of about 10 seconds to 1 real second, which is sucient for interactive rates. With regard to stability, the implicit solver was found stable at time steps of up to 20ms for passive tissue. However, in the case of muscle-activated tissue the stable step size was around 5ms. 4.2.2 Matching Simulation Results to Measurement The goal is to produce realistic tongue shapes for speech. This is achieved by varying the muscle activations of the biomechanical tongue model to match the tongue shapes in MR images. Fitting an MR image sequence of articulation postures allowed validation of the deformation range produced by muscle activation. This interactive process supports the display of images in conjunction with simulation and this was integrated into ArtiSynth, as shown in Figure 4.9. A way of validating plausible tongue motion is to leverage the interactivity of the system and include an anatomy expert. It might be believed that since the creation of a tongue model simulator, it is possible to fully validate it by recording medical images of an individual tongue motion and recreating this tongue motion with the model by minimizing error between imaged and simulated shape. However, this 91 Figure 4.9: Con gurations producing the vowels E (left) and u (right). We use the top gures to nd activation parameters to con gure the tongue according to MRI data (Images reproduced from van den Doel et al. [65]). is not possible since muscle activation for most tongue muscles cannot be directly measured, even where EMG measurements are possible, since the signal intensity is a function of the muscular electrical activity and of the relative position of the EMG electrode. Furthermore, EMG is an invasive procedure and rarely performed clinically for the tongue. Alternatively, an image matching procedure can be used to estimate muscle activation. Another issue is the formulation of the model and its parameters. The behavior of the model should not only test for particular experimental parameters, but in addition the overall model range. 92 4.3 Conclusion and Future Work This chapter described the development of a reference model for the critical components of the modeling framework described in Chapter 3. The reference model constitutes a fast and stable nite element model and has comparable performance to ANSYS FEM simulation using a previously published reference tongue model. The comparison required the adaptation of the hexahedral tessellation of the reference tongue model to a tetrahedral tessellation, whereas a future hexahedral sti ness warping model may eliminate this model adaptation. This method admits simulations speeds that are within a factor of 10 of real-time, at the expense of a small loss in model accuracy. Further, constraints and connections were demonstrated in preliminary experiments that connect the nite element tongue model to a jaw-laryngeal model, as well as to an airway model. 
The resulting tongue-airway model was validated with matching MR images and acoustic analysis. The accuracy of the MR matching process was limited by the image accuracy/resolution and further limited by how close the tongue model represents the imaged individual. Several ways to improve the speed and accuracy of the model are considered. The rst is to use a multithreaded version of Pardiso on a machine with several processor cores. Current sparse direct solvers are designed for large systems and result in suboptimal speeds for models like the FEM tongue model, since they utilize the CPU and Memory. One way to speed-up solvers is to deploy processors which solve faster in parallel with graphics or cell processors. This will allow speeding up the solver by an order of magnitude [28, 139]. Another possibility is to use a preconditioned CG method [227], which would greatly reduce the number of CG iterations required. Finally, reduced coordinate dynamic approaches, along the lines of Barbic & James [20], James & Pai [123] will be pursued. Improvements in accuracy non-linear element formulations such as volume-preservation as presented by Irving et al. [118] and piecewise linear sti ness formulation look promising. Three methods of validation were described in this chapter, and their general application to biomechanical models was also discussed. Both comparisons with reference models and measured data allowed the validation of a biomechanical tongue model. For models with acoustic output, a validation method was described to assess the overall model behavior. This work may be continued in many other directions to simulate the upper airway for speaking, breathing, and swallowing. For example, progress has already been made to develop a complete vocal tract model whereby the tongue model, jaw model, and acoustic airway are connected. 93 Further, it is planned to apply such tissue modeling methods to represent other muscle groups and model the interaction with other anatomical substructures of the vocal tract, such as the face, lips, and soft palate. The next chapter will discuss how to develop a speci c model, such as the tongue model, from acquired data. 94 Chapter 5 Data Acquisition and Extraction Work from parts of this chapter has previously been published or is now prepared for publication. In particular parts of sections 5.1 and 5.2 appear in a manuscript prepared for IEEE Transactions on Medical Imaging. Part of Section 5.4 has been published in Vogt et al. [291]. Part of the same section is also now prepared for publication in Speech Communication. Recorded data from actual subjects are vital for creating realistic anatomical models and simulating and validating their functional behavior. Acquiring medical image data is an active area of research and signi cant improvements have been made during the time of this thesis work. The upper airway is particularly dicult to image due to the speed of articulatory movements (in the order of 10Hz) and the diversity of materials present in the passage (air, tissue, and bone). As a result, no single imaging modality captures the spatiotemporal function of all tissues, as discussed in Section 2.4.1. Therefore, multiple image or data modalities are needed to measure articulatory phenomena. These modalities need to be processed to be incorporated into the simulation framework described in Chapter 3. Two important processing steps are registration and segmentation. 
This chapter presents a feasibility study of automatic segmentation and registration to support the creation of anatomical structures from image data. The goal is to evaluate whether automatic image processing technology is currently sucient to support model creation from medical image data. The focus of this work is to choose appropriate image types and thereby extract tongue shapes suited for model development. This provides a proof of concept for the initial processing step required for creation and validation of the model framework shown in Figure 3.6. Section 5.1 outlines the core data sets as a basis for the modeling algorithms. The feasibility of semi- and automated extraction techniques is discussed for MR in Sections 5.2 and 5.3 and for ultrasound in Section 5.4. 95 5.1 Creation and Structure of Vocal Tract Data Sets Image data sets are useful in many modeling tasks: for instance, to provide a basis to create static structural modeling templates, to create anatomical atlases as developed for brains [277], and to validate for dynamic tasks. The creation of upper airway structural modeling templates entails identifying the key structures and determining their geometry, parameters, and interconnections. Major structures include: the jaw, skull, tongue, velum, pharynx, lips and face, the hyoid, muscle and connective tissues, the laryngeal cartilages (thyroid, cricoid, and arytenoids), and the nasal passage and sinuses. Computed tomography (CT), magnetic resonance (MR), and ultrasound are able to provide the primary data for estimating structure geometry. Another use of image data sets is the creation of anatomical atlases that are used for registration of all the models being created. These will contain landmarks, reference frames, and muscle attachment points, which can be useful in morphing the generic models to patient-speci c data, integrating other researchers' models, and helping to establish a common vocabulary for medical practitioners, researchers, and biomedical industry. In order to validate dynamic vocal tract tasks, it is important to obtain fast imaging and tracking techniques as discussed in Section 2.4.1. Fast imaging lacks spatial resolution and needs therefore to be matched with slower more spatially rich modalities to obtain holistic measurements. A commonly used image set reference in medical image processing is The Visible Human Project [177], which provides data sets of complete human male and female cadavers informed by MR, CT, and higher resolution anatomical photographic images. Unfortunately, cadaver images have only limited validity since soft-tissues of the vocal tract areas show massive deformations. We therefore require additional datasets from living humans to obtain realistic soft tissues shapes. Existing suitable datasets are discussed in the following section. 5.1.1 Important Existing Data Sets for Vocal Tract Modeling There are a variety of existing databases for multiple subjects containing facial images [115, 167], 3D face shapes [142], voice recordings [150], and multimodal data [302] for speech tasks, including phonetic labeling and speech postures. Especially relevant for this work are the following datasets: Speech X-ray Videos from ATR and Strasbourg [11, 172] provide 2D-midsagittal image 96 projections of dynamic speech tasks at 60 images per second and acoustic data. This modality is not widely used in practice due to the rising health concerns. 
From X-ray images, extraction of articulators has been performed successfully in two dimensions by Fontecave & Berthommier [88], Thimm [271]. Speech Production X-ray Microbeam Database from the University of Wisconsin [297] contains articulatory and acoustic data from 26 female and 22 male speakers completing 116 linguistic and non-linguistic tasks. The articulatory data comprise the positions of eight gold pellets placed on tongue, mandible, and lips, sampled at 146 Hz. It is widely used in speech research as a reference, but has the disadvantage that no new stimuli can be recorded, since the unique recording facility is no longer functional. MOCHA MultiCHannel Articulatory database from Queen Margaret University College by Alan Wrench [302] aims to be a phonetically balanced dataset for training an automatic speech recognition system for three British speakers. It provides synchronized recordings of the voice, laryngograph at 16 kHz, 2D Electromagnetic Articulograph (EMA) at 500 Hz for eight articulators, electropalatograph (EPG) at 200Hz and video of the front view of the mouth area. CINE 2D/3D MR speech data set by Stone et al. [257] contains 2D image sequences using Cine-MR and Tagged Cine-MR (tMR) acquisition techniques allows to estimate the internal tongue deformation for speech tasks including vowel-vowel and vowel-consonant sequences. 2D/3D MRI Swedish/French data sets by Badin et al. [18], Engwall [72], Engwall & Badin [75] contain two- / three-dimensional static image sets, obtained at 11/43 seconds per image, for each a French and Swedish speaker for a corpus of vowel and vowel-consonant-vowel sequences. 2D/3D MRI English data set by Tiede [275], which contains static image sets, obtained at 20 seconds per image, for ve English speakers, for a corpus of vowels and fricatives. There is no single existing data set which can cover all speech modeling aspects to create and validate a biomechanical anatomical model such as the tongue model discussed in Chapter 4. In order to create and validate a future full vocal tract in the modeling framework, the described data sets will be important. The exclusively 2D dataset will become useful when a working model exists to drive a speci c task. The recent rise of inexpensive and fast image modality of ultrasound, as described in Section 2.4.1, 97 has been investigated in this work to capture and extract sucient real-time tongue postures. Further automatic image extraction and registration techniques were investigated for existing MR data sets by Tiede [275] and Engwall & Badin [75]. 5.2 Extraction of the Tongue Shapes from MRI This section describes the construction of 3D models of multiple component human vocal tracts from medical images. More speci cally, segmentation algorithms were developed to extract contours of desired anatomy from medical images. These contours can be used to create a 3D mesh representation of that anatomical structure. The tongue was chosen as a target for segmentation, since it is a large and central part of the upper airway and is variable enough over di erent vocal postures to provide a signi cant challenge for segmentation. Existing extraction methods by Engwall [72], Tiede [275], and Badin et al. [17] have used manual segmentation methods. Here the aim is to investigate automatic segmentation methods and test their suitability. Image segmentation is the process of extracting a speci c region from the background of an image. 
There are many di erent types of segmentation algorithms, ranging from simple threshold of pixel intensities to region growing based on partial di erential equations. Insight Toolkit alone has over 30 implementations of segmentation algorithms, and the capacity for many more. In order nd an adequate method, many segmentation algorithms were tested using the segmentation examples included in ITK. By examining the results of these preliminary tests, a segmentation method was chosen and all e orts were put towards tuning this method to get the best possible results. The method chosen was actually a two-stage segmentation algorithm combining a geodesic active contour method with a laplacian level-set method. We explored a level-set approach for segmentation and provide an overview of the theory. This method is used in a lter pipeline which is described in detail in Section 5.2.1. 5.2.1 Segmentation Methods Level-set based segmentation lters begin with two input images: a feature image and an initial level-set (basically a contour). For the geodesic active contour segmentation, the feature image is de ned as the gradient magnitude of the input image, and must be provided by the user. The 98 feature image for the Laplacian level-set segmentation is calculated using the second-derivative of the input image. The contour is implanted in a function (the level-set function) and evolved under the control of a di erential equation. This causes the contour to grow or shrink until it locks onto the feature image. At any given time, the active contour is represented by the zero level-set; that is where the output image of the segmentation lters is equal to zero. The segmented region lies within this contour, and is represented by pixels with negative intensities. Level-sets have the advantage that arbitrarily complex shapes can be modeled and topological changes such as merging and splitting are handled implicitly. A detailed mathematical description of level-set algorithms can be found in [117, 133, 188, 236]. From the segmentation evaluation, shown in Section 5.2.2, the most promising segmentation method was chosen. By visual examination of the preliminary results (Section 5.2.2), the combination of geodesic active contour lter followed by the Laplacian segmentation lter has the best performance. The process of applying the chosen segmentation method can be broken down into three layers. The lowest layer is the mathematical theory of the segmentation method. The middle layer consists of the lter pipeline, which is a cascade of the segmentation lters and the various preprocessing and postprocessing lters required for segmentation. Description of Segmentation Pipeline The chosen segmentation pipeline is controlled with the interactive GUI application shown in Figure 5.3. The complete pipeline is shown in Figure 5.1 and the resulting images of each pipeline stage are shown in Figure 5.6. The pipeline starts with the image input. The image is read using the ITK class ImageFileReader, which determines the proper le format from the speci ed le name. The input image is read into an anisotropic di usion lter provided by an instance of CurvatureAnisotropicDiffusionImageFilter. This lter is an edge-preserving smoothing lter used to reduce excess detail [117]. Following this lter, the image is passed through a gradient magnitude lter, provided by the GradientMagni- tudeRecursiveGaussianImageFilter, to nd the edges of the image using the rst derivative of a Gaussian. 
Figure 5.1: Implemented Segmentation Pipeline Diagram.

The pixel intensities of the edge image are then remapped to floating point values between 0.0 and 1.0 using a sigmoid filter. The sigmoid filter allows the application of nonlinear mappings to images and recalculates the pixel intensities using the following formula:

I′ = (Max − Min) · 1 / (1 + e^{−(I − β)/α}) + Min    (5.1)

In this equation, I is the intensity of the input pixel, I′ the intensity of the output pixel, Min and Max are the minimum and maximum values of the output image, α sets the width of the input intensity range, and β sets the intensity around which the range is centered. Figure 5.2 illustrates the significance of each parameter.

Figure 5.2: Effect of the parameters of the Sigmoid filter (Adapted from Ibáñez et al. [117]).

Pixel intensities are remapped using the sigmoid filter in order to produce the speed image, which is the input to the Geodesic Active Contour segmentation filter. The speed image works as follows: the level-set will grow fast in regions with higher pixel intensities and slower in regions with low pixel intensities [133]. Thus, it is desirable to have all edges represented by zero-intensity pixels and all smooth areas represented by pixels with intensities equal to one. The Geodesic Active Contour segmentation filter requires two inputs: the speed image and an initial level-set, which in the case of this application is produced by the FastMarchingImageFilter. The instance of FastMarchingImageFilter is given seed points and an initial distance as inputs. It produces a distance map from which the initial contour is located at the specified distance from the seeds. The Geodesic segmentation filter grows the initial level-set with the data provided by the speed image for a specified number of iterations. The output image will have regions of pixels with negative intensities that represent segmented regions. Therefore, a threshold must be applied to the output in order to represent the segmented region as a binary mask [133]. The output from the binary mask represents the initial segmentation model.
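The remapping of Equation 5.1 is simple enough to sketch directly. The Java fragment below applies the sigmoid to a flat pixel array using the negative α recommended later in Section 5.2.3, so that strong edges map towards zero and smooth regions towards one; it is an illustration of the formula, not ITK's SigmoidImageFilter implementation.

// Per-pixel sketch of the sigmoid remapping of Eq. (5.1), used to build the speed image.
final class SigmoidRemap {

   static double[] remap(double[] pixels, double min, double max,
                         double alpha, double beta) {
      double[] out = new double[pixels.length];
      for (int i = 0; i < pixels.length; i++) {
         // I' = (Max - Min) * 1 / (1 + exp(-(I - beta)/alpha)) + Min
         out[i] = (max - min) / (1.0 + Math.exp(-(pixels[i] - beta) / alpha)) + min;
      }
      return out;
   }

   public static void main(String[] args) {
      // Example: with alpha = -1 and beta = 1000, high edge intensities map
      // towards 0 (slow level-set growth) and low intensities map towards 1.
      double[] edges = { 0.0, 500.0, 1000.0, 2000.0, 4000.0 };
      double[] speed = remap(edges, 0.0, 1.0, -1.0, 1000.0);
      for (double s : speed) {
         System.out.printf("%.4f%n", s);
      }
   }
}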
Figure 5.3: Interactive Segmentation Application

This initial segmentation may produce a satisfactory segmentation result of the region of interest, in which case the user could stop at this stage. If this is not the case, the user can refine the segmentation using the Laplacian Level-Set segmentation filter pipeline, shown in the lower half of Figure 5.1. This segmentation filter attempts to grow the initial level-set to the second-derivative edges. The inputs to this filter are an input image and the initial segmentation. Since the segmentation filter is based on the second derivative of the input image, the image should be smoothed. This is accomplished using an instance of GradientAnisotropicDiffusionImageFilter. The Laplacian segmentation filter creates its own speed image and operates similarly to the Geodesic Active Contour segmentation filter. A binary threshold filter is used in an analogous fashion to the previous segmentation pipeline.

5.2.2 Experiments

We investigated two MR data sets, by Tiede [275] and by Engwall & Badin [75], for tongue extraction. Since we required an input set that does not need prior registration, Engwall & Badin's data set was discarded, because prior registration would have been needed to obtain a continuous sample dataset. We therefore chose the data set [275] of /r/ and /l/ sounds. A sample image from this set is shown in Figure 5.4a. The set contains many volumes of images, with most volumes containing between 25 and 30 image slices. The spacing in each image slice is 1 mm/pixel in both the x and y directions, and 4 mm in the z direction (between slices).

The Tiede MR data set requires preprocessing before it can be input into the segmentation pipeline. First, the sets of images were made into an image volume. This step is not required; however, it is convenient and preserves the z-spacing. Level-set segmentation filters require that the data spacing in the input images is isotropic, meaning that the pixel spacing is equal for each axis. This is not the case for the Tiede MR data sets, since the z-spacing is four times as large as the x- and y-spacing. Isotropic resampling was performed using the ResampleIsotropic example included in ITK. Figure 5.4b shows the image from Figure 5.4a after resampling. This example also remaps the pixel intensity of the image, which is beneficial as it allows one to increase the overall contrast of the image.

Figure 5.4: (a) coronal slice from the Tiede MRI volume, (b) the same slice after isotropic resampling, (c) sagittal 2D test input image

Experiment 1: 2D Method Selection

To find the best segmentation pipeline, preliminary testing was performed on a 2D test input image, shown in Figure 5.4c. One can see that the image is well detailed with low noise. Several segmentation methods were applied to this image using the segmentation examples included in ITK. A selection of segmentation results is shown in Figure 5.5. One can see from visual inspection that the Geodesic Active Contour segmentation filter followed by Laplacian segmentation yielded the best results.

Figure 5.5: Tongue Segmentation Test Results for (a) Connected Threshold Segmentation, (b) Neighborhood Connected Segmentation, (c) Shape Detection Level-Set Segmentation, (d) Fast Marching Level-Set Segmentation, (e) Geodesic Active Contour Level-Set Segmentation, and (f) Geodesic Active Contour Segmentation Followed by Laplacian Level-Set Segmentation.

Experiment 2: 2D Tongue Segmentation

After the choice of segmentation pipeline was made, the pipeline was implemented and applied to the 2D test input image.
The pipeline is shown in Figure 5.1. Figure 5.6 illustrates the output image of various stages in the Geodesic segmentation part of the pipeline. Each pipeline stage has parameters that can be modified to improve the results.

Figure 5.6: 2D Segmentation Pipeline Results in stages: (a) Input, (b) Output from Curvature Anisotropic Diffusion Filter, (c) Output from Gradient Magnitude Filter, (d) Output from Sigmoid Filter, (e) Geodesic Active Contour Level-Set Segmentation Output.

Experiment 3: 3D Tongue Segmentation

The segmentation application implements the pipeline shown in Figure 5.1 for three dimensions. Figure 5.7 shows the x, y, and z slices of the resampled input with the tongue segmentation overlaid in red. The volume used in this case is the Tiede MRI volume for subject mr and posture /r/.

Figure 5.7: 3D Segmentation Pipeline Results: (a) midcoronal, (b) midsagittal, (c) midtransverse, (d) perspective looking from the back of the head out the mouth.

5.2.3 MR Segmentation Results

As was previously stated, each stage in the segmentation pipeline has parameters for controlling the filter output. Some parameters produce outputs that will result in better segmentations; however, finding the optimal values for these parameters is an iterative task. Furthermore, different input images will require different parameters based on the properties of the image. This section will outline the tuned filter parameters in the pipeline and explain the rationale for choosing their values. The image filters used in the segmentation pipeline can be tuned to reduce adverse properties of the image, such as noise or excess detail. For reference, the complete pipeline can be seen in Figure 5.1. The user-supplied filter parameters for the segmentation pipeline are summarized in Table 5.1. The Anisotropic Gradient Magnitude filter and Laplacian Segmentation filter have parameters similar to the Anisotropic Diffusion filter and Geodesic Active Contour filter, respectively.

One must remember that the purpose of the Sigmoid filter is to remap the pixel intensities. A good value of β is about halfway between the edge intensity and the background intensity. α should be a negative value so that high-intensity edges are mapped to low intensities. In addition, it is often not desirable to run the Laplacian Segmentation filter until convergence. For instance, the 3D segmentation used only 30 iterations.

Stage | Parameter | Description | Value
Anisotropic Diffusion | Iterations | More iterations refine the result | 5
 | Time step | Time step < 1/(2N) | 0.0625
 | Conductance | Lower values → more diffused image | 2.0
Gradient Magnitude | σ | Standard deviation of the Gaussian smoothing kernel | 1.0
Sigmoid Filter | α | See Eq. 5.1 | -1
 | β | See Eq. 5.1 | 1000
Fast Marching | Seeds | Seed points for the initial level-set | 2D (3), 3D (5)
 | Init. distance | Distance of the initial level-set from the seed points | 5.0
Geodesic Active Contour | Propagation scaling | Weight term for the inflation (expansion) of the level-set | 1.2
 | Curvature scaling | Weight term for the curvature of the level-set (higher values give smoother contours) | 1.0
 | Advection scaling | Weight term for the advection (edge attraction) of the level-set | 1.0
 | Iterations | Number of iterations | 800
 | Max RMS change | Used to determine when the solution has converged | 0.02
Binary Threshold | Lower | When thresholding the output of the level-set segmentation | -1100
 | Upper | | 0

Table 5.1: Description of Filter Parameters for the 2D and 3D Image Segmentation Pipeline.
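For completeness, a sketch of the Laplacian level-set refinement stage described in Section 5.2.1 follows. It assumes that the signed level-set image produced by the geodesic stage is reused as the initial model; the function name and parameter values (including the 30 iterations noted above) are illustrative assumptions.

// Laplacian level-set refinement of an initial segmentation.
#include "itkImage.h"
#include "itkGradientAnisotropicDiffusionImageFilter.h"
#include "itkLaplacianSegmentationLevelSetImageFilter.h"
#include "itkBinaryThresholdImageFilter.h"

typedef itk::Image<float, 3>         ImageType;
typedef itk::Image<unsigned char, 3> MaskType;

MaskType::Pointer RefineLaplacian(ImageType::Pointer inputImage,
                                  ImageType::Pointer initialLevelSet)
{
  // Smooth the raw input; the Laplacian stage works on second derivatives.
  typedef itk::GradientAnisotropicDiffusionImageFilter<ImageType, ImageType> SmoothType;
  SmoothType::Pointer smooth = SmoothType::New();
  smooth->SetInput(inputImage);
  smooth->SetNumberOfIterations(5);
  smooth->SetTimeStep(0.0625);
  smooth->SetConductanceParameter(2.0);

  // Grow the initial level-set towards the second-derivative edges.
  typedef itk::LaplacianSegmentationLevelSetImageFilter<ImageType, ImageType> LaplacianType;
  LaplacianType::Pointer laplacian = LaplacianType::New();
  laplacian->SetInput(initialLevelSet);
  laplacian->SetFeatureImage(smooth->GetOutput());
  laplacian->SetPropagationScaling(1.0);
  laplacian->SetCurvatureScaling(1.0);
  laplacian->SetNumberOfIterations(30);   // deliberately short of convergence
  laplacian->SetMaximumRMSError(0.002);

  // Threshold the signed output into a binary mask, as in the geodesic stage.
  typedef itk::BinaryThresholdImageFilter<ImageType, MaskType> ThresholdType;
  ThresholdType::Pointer threshold = ThresholdType::New();
  threshold->SetInput(laplacian->GetOutput());
  threshold->SetLowerThreshold(-1100.0);
  threshold->SetUpperThreshold(0.0);
  threshold->SetInsideValue(255);
  threshold->SetOutsideValue(0);
  threshold->Update();

  MaskType::Pointer mask = threshold->GetOutput();
  mask->DisconnectPipeline();
  return mask;
}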
5.2.4 Discussion of MR Segmentation Experiments

Semi-automatic segmentation methods were able to extract contours of the tongue from multiple MR image sets in both two and three dimensions, and the experiments are summarized in Table 5.2.

No. | Description | Results | Limits
1 | Select the best of 6 segmentation algorithms | Level set works best |
2 | 2D segmentation of MR tongue images | Found stable values for the filters | Bleeding of the lower tongue surface
3 | 3D segmentation | Values of #2 are stable | Same problems as in #2

Table 5.2: Experiment Summary of MR Image Tongue Segmentation.

Segmentation and other image processing filters are provided by ITK. Initial tests were conducted to find a simple yet effective pipeline method. The method chosen was the result of two level-set based segmentation pipelines: a Geodesic Active Contour segmentation pipeline provided an initial segmentation, which was refined by a Laplacian Level-Set segmentation pipeline. Each pipeline consists of multiple image processing filters, each of which was implemented as a binary executable so that the intermediary filter outputs could be tested. An interactive user interface was designed to control the segmentation pipeline. The tongue was successfully segmented in 2D and 3D using the selected segmentation pipeline. However, the lower portion of the tongue was not well defined through gray scale levels, and human intervention is required to obtain consistent results in these areas. A few recommendations for further development:

- Automated segmentation of individuals may be achieved using an anatomical atlas that includes extraction parameters and structure morphology to improve and cross-reference extraction results.
- Improve segmentation verification techniques.

Automated Segmentation

The segmentation results are not sufficient for the tongue floor due to the lack of gradient information in this region, which may require manual landmarks. In addition, the presented technique is still a time-consuming process (30 s per image), which is not suitable for fully interactive applications given the need for manual parameter tuning. While filter parameters for similar images, such as different vocal tract postures, are stable, new image sets may require significant parameter tuning. An atlas-based segmentation method, as shown by Commowick [45], may improve upper airway extraction results. The atlas stores the morphology of structures from previous extractions or existing models such as the tongue model presented in Chapter 4. Linked to each morphology may be previous extraction parameters for different structure areas. This promises to (1) save time in determining suitable extraction parameters and (2) retarget the internal model structure, such as muscle organization, from one individual to another.

Segmentation Verification Techniques

Another future direction is the verification of segmented data, in which segmented regions are compared to reference extractions or to the medical definition of that region. For instance, the segmented tongue, shown in Figure 5.7, may be statistically compared to a manual reference segmentation by an expert or to a statistical model of the actual tongue. The comparison would allow quantifying the extraction quality of different algorithms and settings.

5.2.5 Summary of MR Image Segmentation

This work tested the feasibility of extracting anatomical structures from MR images for model development. In particular, 2D and 3D tongue shapes were segmented using automatic techniques.
A number of segmentation techniques used to extract tongue boundaries were analyzed for their effectiveness. This was judged by visual inspection of the smoothness and quality of the contours obtained. Using the Geodesic Active Contour/level-set method, segmentation of the tongue was performed. This technique very effectively extracted the tongue surface in areas of high edge detail of the image, such as the tongue-air surface. In contrast, the lower tongue muscle area was extracted with limited success due to insufficient contrast of boundaries. These segmented images will enable the development of models of specific patients. This will require further advancements in the extraction techniques and subsequent translation into geometry. For example, the combination of automatic extraction and human guidance with landmarks would result in better definitions of boundaries where image detail is limited.

5.3 Registration of Tongue Shapes Across MRI Images

This section presents a feasibility study of algorithms to register images containing vocal postures. The referencing of vocal postures may be used for the creation of vocal tract models. One automatic approach to find the transformation between vocal tract postures and subjects across 2D and 3D image sets is the application of deformable registration. This thesis work evaluates two registration methods provided by the Insight Toolkit [117] (ITK): (1) the finite element method (FEM) by Ferrant et al. [84] and (2) Thirion's demons method [272, 273], which is based on optical flow. The FEM-based approach is ideal when the images are segmented. The demons approach is more suitable for unsegmented images. In experimentation with the FEM-based and demons techniques, reference sets of 2D and 3D images were registered. In the 2D experiment, a heuristic method was applied to obtain optimal registration parameters. For the 3D experiment, the optimum parameters from the 2D experiments were applied and the resulting registered images were qualitatively evaluated.

FEM-based Algorithm

In this method, the imaged anatomical structures are modeled as elastic bodies. Material properties, such as elasticity, density, and Poisson's ratio, are applied to estimate the deformation field. The registration process starts by subdividing the moving image into small voxels. As a next step, the image is superimposed with a mesh that divides each object into elements. ITK supports different element types, including tetrahedra and hexahedra. Once the mesh is created, the registration process then determines node locations in order to assign force vectors. The FEM algorithms compute the deformation of the image based on the element stresses and strains caused by forces. Elastic deformations for all elements are solved by applying force effects, including element interactions, based on an estimate of the deformation field [84]. As this method applies physical properties of anatomical structures to estimate the deformations, the physical properties for each pixel may be provided or correspond to the pixel brightness. To distinguish anatomical structures during mesh creation, the image requires prior segmentation. Once the image is segmented, an image mask indicates the segmented portion of the image in white, while the rest of the image remains black. ITK provides metrics as part of its FEM-based registration framework to assess the result quality.

Demons Algorithm

In this method, the images are assumed to be from the same modality, and pixel intensities are homologous.
The latter implies that the same tissue properties in both images will be represented by the same brightness level. To account for Gaussian variations in maximum intensities, histogram matching is performed on the moving image. After equalization, isocontours are extracted based on the deformation field. This field is based on the optical flow equation. The deformation field selectively allows pixels to cross the isocontours in the image. The registration result is compared with the fixed image and the error is estimated. The demons algorithm does not allow for specifying elastic properties of the anatomical structures. As the pixels are not constrained by physical materials, they may produce unrealistic deformations. The resulting image may be very coarse. To address this issue, in the ITK implementation of the demons algorithm, an external smoothing factor is introduced to force smooth deformation fields. After each iteration, the deformation field is convolved with a Gaussian kernel with a user-selectable standard deviation. The convolution with the Gaussian kernel gives the image elastic properties. The experiments in later sections illustrate the effects that different values for this parameter have on the quality of registration.

Comparison of FEM and Demons Methods

The following analysis illustrates the relative merits and demerits of the two techniques:

1. FEM is a powerful method that allows multimodal registration, since it is independent of the pixel intensities of the image. The demons algorithm, on the other hand, assumes the homologous nature of the images and is restricted to intramodal registration, that is, images obtained from the same modality.
2. FEM is based on physical properties including Young's modulus, density, and Poisson's ratio. The demons algorithm does not take into account the elastic nature of the material to do the registration. As a result, the solutions to the registration problem using the demons algorithm are not unique. However, using an external Gaussian kernel, one may impose elastic behavior on the image.
3. The quality of registration in FEM is proportional to the number of pixels per element. The smaller the elements, the finer the registration, because the boundaries and tissue interfaces will be accurately resolved. For the demons algorithm the quality of registration depends on the number of iterations allowed and the smoothness of the gradient of the deformation field.
4. FEM is much slower than the demons algorithm for a large number of elements.
5. FEM requires segmentation of the images, or at least a priori information about the material properties represented in each voxel. The demons algorithm does not require segmentation, but assumes that the global affine properties and scaling have been matched between the two images.

5.3.1 MR Image Registration Experiments

Two independent experiments were performed for non-rigid registration of vocal postures. The experiments are based on the two deformable registration methods, FEM and demons. The first experiment is simply the registration of two 2D vocal postures. The results from the first experiment are applied to estimate the initial parameters for the second experiment, which performs the registration of two 3D vocal postures. The objective of the experiments is to do a performance analysis of the two registration schemes and study the effect of different parameters in each method.

Experiment 1: 2D Case Using FEM and Demons Methods

The objective is the estimation of optimum values for the 2D registration parameters of the two methods.
A 2D registration is a good start to get a first-hand feel for the registration algorithms. In addition, for numerical techniques, the 2D case is simpler than the 3D case due to fewer degrees of freedom and fewer image pixels.

FEM Case

A FEM-based registration was performed between the vocal postures /Ca/ (fixed image) and /Sha/ (moving image). The choice of vocal postures was random, to show a proof of concept. To use FEM-based registration, the vocal posture images had to be segmented. Using the method described in Section 5.2 allowed the segmentation of these images, the results of which are shown in Figure 5.8. After segmentation, the parameter file was edited to set the parameters of the registration. The results of varying a few selected parameters are listed in Table 5.3 and shown in Figure 5.9.

Demons Case

The demons registration was performed between the /A/ (fixed image) and /Sho/ (moving image) postures shown in Figure 5.10. For this portion of the experiment, /Sho/ was registered to /A/. The value of the standard deviation σ of the Gaussian smoothing kernel was initially selected as 1.0. Registered images were then produced for different values of the standard deviation, as presented in Table 5.4 and Figure 5.11.

Figure 5.8: FEM Registration 2D input images /Ca/ (fixed image) and /Sha/ (moving image).

TC | P/E | E | ρ | IP | ts | I
a | 2 | 15 | 100 | 1 | 1.0 | 20
b | 2 | 1 | 100 | 1 | 1.0 | 20
c | 2 | 1 | 100 | 1 | 1.0 | 40
d | 4 | 1 | 100 | 9 | 15 | 20
e | 4 | 1 | 100 | 10 | 15 | 20
f | 4 | 1 | 100 | 9 | 1.5 | 20
g | 2 | 1 | 100 | 10 | 1.5 | 20
h | 2 | 1 | 100 | 10 | 15 | 20

Table 5.3: Parameter variations for FEM-based registration of /Sha/ to /Ca/ for test cases (TC), with parameters Pixels/Element (P/E), Elasticity (E), Density x Capacity (ρ), number of integration points (IP), time steps (ts), and Iterations (I).

TC | a | b | c | d | e
σ | 0.5 | 0.75 | 1 | 2 | 5

Table 5.4: Parameter variations for demons-based registration of /Sho/ to /A/ for test cases (TC) with different standard deviations σ.

Figure 5.9: FEM Registration 2D test case results (a-h) for images /Ca/ (fixed image) and /Sha/ (moving image) for 8 test cases with parameters listed in Table 5.3.

Figure 5.10: Demons Registration 2D input images /A/ (fixed image) and /Sho/ (moving image).

Figure 5.11: Demons Registration 2D test case results for images /A/ (fixed image) and /Sho/ (moving image) for the test cases with parameters listed in Table 5.4. The last image shows the deformation field for σ = 0.75.

Experiment 2: 3D Data Set Using FEM and Demons Methods

The objective is the evaluation of the 2D registration parameters of the two methods for the 3D case. Using the optimum parameter values from the 2D case (except #pixels/element = 4 for FEM), both FEM and demons registration techniques were applied to 3D images. Even though snapshots of the 3D rendering of the images are provided below, for a true assessment of the results the reader is encouraged to use VolView to view the results. For both registrations the input and result are shown in Figure 5.12. The rationale behind the use of the parameters of the 2D case for the 3D case is developed from the fact that all the images are of the same tissue and use the same modality. A fresh search for the optimum values for the 3D case would take a long time due to the large number of computations involved per iteration and could simply not have been performed with the computing resources available and the limited timeframe of this project.
Even if the search for optimum parameter values for the 3D case is done in the future, the parameter values used for the second experiment would lead to better results than selecting random values as a start. In the FEM case, to reduce the time taken by the registration, the number of pixels per element was increased from the optimum value of 2 to 4. The resulting reduction in the number of elements used for the image significantly reduced the number of computations needed; however, it also resulted in a fairly coarse registration of the tongue. The demons registration yielded results that were smooth; however, some artifacts were also introduced, for example a secondary chin below the real chin (Figure 5.12).

Figure 5.12: 3D Registration with FEM (input a and b, result c) and Demons (input d and e, result f).

5.3.2 High Level Designs

There are two high level designs possible based on the choice of the deformable registration scheme used. At the time of writing, it is not clear which design would prove to be better, because they are not complete enough to be tested. The designs aim to satisfy the following objectives:

1. Generation of the inter-posture registration information
2. 3D visualization of various segmented portions of the vocal tract

The first design scheme is based on demons deformable registration and is similar to an atlas-based approach. The high level design is shown in Figure 5.13a. In this design scheme, vocal posture A is to be registered to vocal posture B. A demons-based registration will perform the registration on the unsegmented images of A and B. The final parameters of the registration can be stored as the deformation field. At the same time, a segmented image of A can be created using segmentation filters. If this deformation field is applied to the segmented vocal posture A, the resulting image will be a good approximation of segmented vocal posture B. Once the segmented image is obtained, it can be visualized using volume-rendering applications (a sketch of this scheme follows at the end of this subsection).

The second design scheme is based on the FEM deformable registration approach. The high level design is shown in Figure 5.13b. In this design scheme, vocal posture A is to be registered to vocal posture B. As a FEM registration filter is being used, the inputs require segmentation. Segmentation filters can be used to do the segmentation. The output of the registration filter will be a segmented vocal posture image and a deformation field. This image can be visualized using volume-rendering applications.

Each design has its own merits and demerits. The merits and demerits of the registration schemes will not be discussed again here, but they directly influence the performance of the respective designs. Some other points to note are:

1. For the second design, the quality of segmentation is an important limiting factor, as the registration is done after performing two segmentations. On the other hand, the first design requires segmentation only of the base posture. Therefore, it is less sensitive to the quality of segmentation.
2. If the segmentation routines segment one object in the image at a time, then the first design will be more efficient. The reason is that the two images have to be registered only once, in a global fashion, and then segmented separately for each object such as the tongue, lips, jaw, etc. For the second design, each time a new object is required to be registered, the images have to be re-segmented and then re-registered.
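A minimal sketch of the first design scheme follows, using ITK 3.x naming (where the warp filter exposes SetDeformationField). The function name, the iteration count, and the reuse of the 0.75 standard deviation from the 2D experiment are illustrative assumptions.

// Register unsegmented posture A (moving) to posture B (fixed) with the demons
// algorithm and apply the resulting deformation field to the segmented posture A.
#include "itkImage.h"
#include "itkVector.h"
#include "itkHistogramMatchingImageFilter.h"
#include "itkDemonsRegistrationFilter.h"
#include "itkWarpImageFilter.h"
#include "itkLinearInterpolateImageFunction.h"

typedef itk::Image<float, 3>      ImageType;
typedef itk::Vector<float, 3>     VectorType;
typedef itk::Image<VectorType, 3> FieldType;

ImageType::Pointer PropagateSegmentation(ImageType::Pointer fixedB,
                                         ImageType::Pointer movingA,
                                         ImageType::Pointer segmentedA)
{
  // Make the intensity histograms of the two postures comparable.
  typedef itk::HistogramMatchingImageFilter<ImageType, ImageType> MatchingType;
  MatchingType::Pointer matcher = MatchingType::New();
  matcher->SetInput(movingA);
  matcher->SetReferenceImage(fixedB);
  matcher->SetNumberOfHistogramLevels(1024);
  matcher->SetNumberOfMatchPoints(7);
  matcher->ThresholdAtMeanIntensityOn();

  // Demons registration; the Gaussian smoothing of the deformation field
  // imposes elastic-like behavior (0.75 worked well in the 2D experiment).
  typedef itk::DemonsRegistrationFilter<ImageType, ImageType, FieldType> DemonsType;
  DemonsType::Pointer demons = DemonsType::New();
  demons->SetFixedImage(fixedB);
  demons->SetMovingImage(matcher->GetOutput());
  demons->SetNumberOfIterations(150);
  demons->SetStandardDeviations(0.75);
  demons->Update();

  // Apply the stored deformation field to the segmented posture A.
  typedef itk::WarpImageFilter<ImageType, ImageType, FieldType> WarperType;
  typedef itk::LinearInterpolateImageFunction<ImageType, double> InterpolatorType;
  WarperType::Pointer warper = WarperType::New();
  warper->SetInput(segmentedA);
  warper->SetInterpolator(InterpolatorType::New());
  warper->SetOutputSpacing(fixedB->GetSpacing());
  warper->SetOutputOrigin(fixedB->GetOrigin());
  warper->SetDeformationField(demons->GetOutput());   // ITK 3.x naming
  warper->Update();

  ImageType::Pointer result = warper->GetOutput();
  result->DisconnectPipeline();
  return result;
}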
5.3.3 Discussion of MR Registration Experiments

In this work, the configuration of the vocal tract is different for different vocal postures, as it is deformed differently in each case. Non-rigid registration estimates this deformation, and it was therefore used in this project. The non-rigid registration techniques available in ITK are FEM and the demons algorithm. Each has its own merits and demerits. FEM is based on the physical properties of a material and can be used for multimodal registration. However, the number of computations required in its execution is very large. The demons algorithm is based on optical flow and does not require segmented images, but it does require homologous images.

Figure 5.13: Block diagrams for (a) Demons and (b) FEM registration design scheme.

Two experiments were conducted to study the success of registration of the two techniques and the influence of their control parameters. The experiments were for two and three dimensions, respectively. For the first experiment, the results were studied and qualitatively explained. The best possible registration case was identified for both deformable techniques. These parameters were used for the 3D case to assess the performance. The results of the experiments are discussed below for the FEM and demons algorithms.

In the FEM case, from the variation of parameters, the effect on the registration quality can be determined and in some cases explained qualitatively:

#Pixels/Element is an indication of the size of the elements. As the image area or volume is finite in size, if the number of pixels per element increases, then the total number of elements in the image will decrease. Consequently, the element size must be larger than previously to be able to represent the same area/volume. This can introduce quantization effects that can lead to jaggedness in the image. The jaggedness is especially noticeable at the boundaries of segmented portions of the image, as a large element size is a poor approximation of a fine boundary. Observing images 6 and 7, with all other parameters held the same, as the #pixels/element changes from 4 to 2, the image registration match improves. The change is noted in the finer boundaries of image 7.

Elasticity (E) is an indication of the deformability of the elements in the image. If E decreases, then the tendency of the elements to undergo deformation increases. This leads to a reasonable match in registration. Observing images 1 and 2, with all other parameters held the same, as the elasticity changes from 15 to 1, the image registration match improves. Image 2 is fuller than image 1 and is relatively more similar to /Ca/ than image 1.

#Integration points is the number of nodes to which force vectors will be associated. Observing images 2 and 7, with all other parameters held approximately the same, as the number of integration points changes from 1 to 10, the image registration match improves.

Time step: the registration is not very sensitive to this parameter. It is possible that over the variation of values specified in images 7 and 8, the effect is too small to be noticeable.

Iterations was set to 20 for most of the images. Observing images 2 and 3, as the maximum number of iterations ranges from 20 to 40, not much difference was observed.

It is a reasonable assessment that image 8 represents the best match to the /Ca/ posture. The values of the parameters of image 8 can then be assumed to be close to the optimum for FEM.
In the demons case, from the variation in the standard deviation of the Gaussian smoothing kernel, the variation in the elasticity/stiffness of the image could be studied. It was expected that the higher the value of the standard deviation, the higher the stiffness and the lower the ability of the image to undergo deformation and thus register. This is indeed the case for images 3 and 4, which look more similar to /Sho/ than to /A/. On the other hand, if the value of the standard deviation is too low, the deformation field is no longer smooth, leading to coarseness in the image. This is evident in image 1 near the chin region. From the experiment, a value of 0.75 for the standard deviation gives an optimum registration quality that is sufficiently similar to the fixed image and at the same time not too coarse.

5.3.4 Summary of MR Registration Experiments

This section investigates the feasibility of registration methods to compute inter-posture vocal tract deformation. In essence, the interrelationships between different tongue postures or between particular postures of individuals were examined. Two automatic non-rigid registration methods, FEM and the demons algorithm, were evaluated on 2D and 3D MR images. Using these evaluated methods, it was possible to compute the deformation for different tongue postures. The results show that these registration methods are only partially suitable to determine tongue deformation. In particular, small rigid motions and small deformations of the tongue shape yield tangible results. For larger deformations, the quality of the result deteriorates, since these algorithms, using only the image information, converge to local minima rather than the correct solution. This can be overcome by using multiple images of intermediate postures which have incremental deformations. These registration results interrelate different model states and thereby enable merging of different postures into one model. Interesting extensions of these registration methods can be made to incorporate user guidance with landmarks in order to enhance the tolerable deformation range. Further, the finite element registration may be accelerated by the fast methods presented in Chapter 4. These methods may be applied to additional problems between individuals and to merge information between different image modalities. In the larger context of this thesis, these methods have proven suitable for incorporation into the model creation and validation process.

5.4 Real Time Ultrasound Tongue Tracking

This section describes the "Tongue and Groove" system, a real-time 2D ultrasound tongue tracking system that can interactively drive computer-based models using ultrasound data. This real-time extraction and interaction allows us to study whether the system is sufficient for vocal modeling using perceptual measures (that is, whether it "feels right" to a person). In addition, the real-time tongue tracker enables new avenues such as image-based human interface controllers for sound synthesis or speech/singing learning tools. Ultrasound is the premier method for measuring tongue motion, as it is fast and non-invasive, as shown in Section 2.4. Following the segmentation and registration techniques, real-time ultrasound tongue tracking is a third extraction method described here to investigate a suitable component in the creation and validation process. Using this method, the feasibility of producing sufficient 2D tongue shapes from ultrasound measurements may be determined using a real-time extraction algorithm.
The feasibility criteria are the quality of the derived shapes and the results of actuating Perry Cook's Singing Physical Articulatory Synthesis Model (SPASM) [47] with these shapes. SPASM was used at a time when ArtiSynth was not yet available. Tongue imaging predominantly uses B-mode ultrasound, which is based on the reflection properties of the tissues. Despite the challenges, ultrasound imaging, with full or semi-manual segmentation and registration, is a viable approach. An example of an emerging tool is the ultrasound speech processing software Ultrax by Gick, Campbell, Oh & Tamburri-Watt [97], which uses the semiautomatic snake algorithm [128]. The existing automatic contour extraction algorithms [1-3, 154, 259, 298] do not extract tongue contours in real time but concentrate on system reliability.

Figure 5.14: Tongue tracking system diagram (reproduced from Vogt et al. [291])

This section discusses the real-time tongue contour tracking system "Tongue and Groove" by Vogt et al. [291], with an additional performance analysis of the algorithm using vowel sounds. Parts of this section have been reproduced from this publication. This work shows that real-time algorithms are feasible to use and able to distinguish tongue shapes of vowel sounds. One inspiration for this original work was to use the tongue contour as an input device for sound production, which is discussed in Section 5.4.4.

5.4.1 System Design

Figure 5.14 shows the components of the "Tongue and Groove" system. An Aloka SSD-900 ultrasound scanner is used with a small probe, similar in shape to a microphone. The subject presses the probe against the underside of the jaw. Sound-conductive gel may be used to lubricate the skin for better probe contact. The probe can be held in the hand or used with a fixed stand. The SSD-900 produces two-dimensional images of the tongue profile in analog NTSC video format. Thirty frames per second are obtained at 768x525 pixels and 8-bit gray scale resolution. The SSD-900 calibrates the ultrasound image so that image distances correspond to scaled real-world distances. The intensity in different parts of the image depends on the ultrasonic reflectivity of body parts. The tongue-air boundary layer on the upper surface of the tongue has high reflectivity and therefore creates the most intense region of the image. The ultrasound image is digitized using a Linux workstation with a video capture card. A video capture library written in C makes image data available to the tongue extraction algorithms. One algorithm calculates the amount of motion within each of 10 vertical bands of the tongue image; the other calculates a vector of vertical positions along the tongue surface. The tongue tracking system is intended to control realistic human voice sounds; it is therefore important to have accurate readings of tongue/hard-palate positions, which become visible on ultrasound during swallowing, to drive a vocal tract model. The output of the image-processing algorithm is used to drive sound synthesis algorithms and/or to provide constantly updated control parameters for audio-visual displays at the video frame rate. Alternatively, the image and output parameters are captured in a file for numerical analysis.
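The per-frame computation that turns a captured frame into such a vector of vertical positions is detailed in Section 5.4.2; the following minimal sketch shows only its core (brightest pixel per column, then a median over bands of columns), omitting background subtraction and temporal low-pass filtering. The frame layout, region-of-interest handling, and function name are illustrative assumptions.

// Extract a coarse tongue contour from one grayscale ultrasound frame.
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<int> ExtractTongueContour(const std::vector<uint8_t>& frame,
                                      int frameWidth,
                                      int roiX, int roiY, int roiSize,
                                      int numBins)
{
  // 1) Brightest pixel per column inside the region of interest; the
  //    tongue-air boundary is the most reflective (brightest) structure.
  std::vector<int> maxRow(roiSize);
  for (int x = 0; x < roiSize; ++x) {
    int bestY = 0;
    uint8_t best = 0;
    for (int y = 0; y < roiSize; ++y) {
      uint8_t v = frame[(roiY + y) * frameWidth + (roiX + x)];
      if (v > best) { best = v; bestY = y; }
    }
    maxRow[x] = bestY;
  }

  // 2) Median within each band of adjacent columns (e.g. 30 bins of 10
  //    columns each) to reduce the output size and improve noise robustness.
  std::vector<int> contour(numBins);
  const int bandWidth = roiSize / numBins;
  for (int b = 0; b < numBins; ++b) {
    std::vector<int> band(maxRow.begin() + b * bandWidth,
                          maxRow.begin() + (b + 1) * bandWidth);
    std::nth_element(band.begin(), band.begin() + band.size() / 2, band.end());
    contour[b] = band[band.size() / 2];
  }
  return contour;
}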
5.4.2 Tracking Algorithms

Figure 5.15: Block diagram of the tongue tracking algorithm

The tongue tracking algorithm, shown as a block diagram in Figure 5.15, processes the captured ultrasound image frame by frame with the following steps:

1. The captured image is cropped to a fixed region in which the tongue surface appears, such as 300x300 pixels, producing the image I1, shown as the white frame in Figure 5.16a. Since the probe is held under the chin, the tongue is approximately in the same position relative to the probe, regardless of the user.
2. To minimize noise problems, the background, derived from the mean of a series of calibration frames, is subtracted, resulting in image I2.
3. For each vertical column (Y direction), the pixel with the maximal brightness is selected, resulting in a vector I3. This vector I3, with dimension 1x300, contains the row (Y) position of the found maximum for each column.
4. Taking the median over bands of adjacent maximum locations, such as 30 bands of 10 locations each, reduces the number of outputs and improves noise robustness, giving the output I4 with dimension 1x30.
5. Finally, a low-pass filter by Rabiner & Schafer [211, p. 158-162] was selected to perform temporal smoothing, resulting in the output tongue contour. This filtering stage was added after the work presented by Vogt et al. [291].

Figure 5.16: Tongue tracking (a) concept image, and (b) output image display (reproduced from Vogt et al. [291]).

These output values correspond to the distance from the probe to the lower contour of the tongue. Since the hard palate is fixed, these values give all the required information to estimate the configuration of this portion of the vocal tract. Another noteworthy observation in the images is the shadow of the hyoid, shown in Figure 5.16a, which perturbs the view of the tongue root. The algorithm captures and processes 30 frames per second on an 800 MHz Pentium-II workstation. The output of the unfiltered stage I3 is shown in Figure 5.16b. At this stage, the algorithm is subject to a significant amount of noise-based error of about 10 pixels. By adding filtering stages, the output is improved significantly, as shown in Section 5.4.3.

5.4.3 Experiment: Vowel Analysis

This work introduces an automatic contour tracking method for ultrasound tongue images. In addition, a performance evaluation is presented for tongue tracking algorithms. The evaluation is based on posture discrimination using principal component analysis (PCA) [126] for the vowels /a/, /e/, /i/, /o/, and /u/. Analyses are performed on extracted parameters of ultrasound tongue image sequences for one native German speaker. In the experimental setup, speakers are seated in a fixed position, and ultrasound video and audio speech signals are recorded. The digitized video frames are manually separated into time intervals of stationary utterances based on the acoustic signal. The captured images are processed with two tracking algorithms: (1) the original algorithm and (2) a refined implementation of the extraction algorithm shown above. The output of these tracking algorithms is a set of thirty tongue contour parameters. With these parameters, two types of analyses are done: contour variance for each vowel, and vowel classification.
The contour parameters are first represented in their principal component analysis (PCA) form in order to identify the variation in tongue contour tracking. The principal components are also used for vowel classification, which is a measure of the quality of both the ultrasound imaging and the contour extraction algorithm. This defines a framework that can be used to further compare different contour tracking algorithms and ultrasound image capture settings.

Vowel Data Acquisition

Ultrasound and acoustic recordings are acquired for one German speaker, who produces each of the vowels /a/, /e/, /i/, /o/, and /u/, sustained for an interval of about 10 seconds, which corresponds to 300 ultrasound images per vowel. The vowel postures are the best case in voice production, since no motion from articulator dynamics is apparent, thus reducing image noise due to motion blur. The following figures contain extracted tongue contours for one speaker for the set of chosen vowels: with filtering (Figure 5.17) and without filtering (Figure 5.18).

Figure 5.17: Filtered contours for subject fv, vowels /a/, /e/, /i/, /o/, and /u/.

Analysis and Classifications

Three types of analysis are presented: mean and contour variance, PCA for variation in tongue contour tracking for one specific vowel, and PCA for the comparison of vowel postures in order to classify or recognize an unidentified vowel posture.

Mean and Contour Variance

Since contour shapes are acquired for sustained vowels, a mean and variance analysis over each posture allows inspection of differences between postures using the mean plot, and of the presence of noise using the variance. The mean and variance plots for one subject are shown in Figure 5.19. As a result, the mean contour shapes are distinct from each other, while the variance indicates constant variability at high frequencies, which is due to image noise. This analysis gives insight regarding contour variance, but it is only a one-dimensional analysis and therefore cannot be used for vowel classification, which requires more complex analysis methods. Therefore the complementary principal component analysis (PCA) was performed, as described in the following section.

Figure 5.18: Unfiltered contours for subject fv, vowels /a/, /e/, /i/, /o/, and /u/.

Figure 5.19: Mean and variance for vowels /a/, /e/, /i/, /o/, and /u/.

Principal Component Analysis for Variation

Principal component analysis (PCA) is identified as a suitable method for more detailed analysis within vowel posture sets. The data sets are decomposed into a set of principal components, which are orthogonal to each other and therefore capture the nature of a given contour. In this analysis, the weighted sum of a subset of components is used to reconstruct each vowel posture. Also, the more components, the smaller the confidence level. In addition, the components are ordered by their contribution to the variance. For example, the first component is responsible for most of the characteristic curve shape. An advantage of the PCA method is that for large data sets (where the number of components is greater than the number of samples), the principal components representation is more compact than the data itself. On the other hand, in the presence of noise, more components are required to reconstruct the shape. The first principal components contain only low frequency changes, which reflect the overall tongue curvature.
Higher order components mostly contain high frequency changes, which reflect the presence of image noise.

Principal Component Analysis for Vowel Classification

Principal components also find another use in our analysis of the distances between vowel postures. In this case, the principal components are used for a cross-vowel classification. This analysis is performed in order to establish how well vowels can be recognized and discriminated using tongue contour tracking, which could be used for automatic annotation of data sets or voiceless speech recognition. The classification results are summarized in Table 5.5. For all vowels, most test vowel shapes are classified in the correct category.

     | /a/ | /e/ | /i/ | /o/ | /u/
 /a/ |  91 |   0 |   3 |   0 |   0
 /e/ |   0 |  91 |   0 |   0 |   0
 /i/ |   0 |   0 |  88 |   0 |   0
 /o/ |   0 |   0 |   0 |  91 |   0
 /u/ |   0 |   0 |   0 |   0 | 100

Table 5.5: Classification table of vowels for each test sample: rows indicate the categories /a/, /e/, /i/, /o/, and /u/; columns indicate the test data.

5.4.4 Experiment: Driving Physics Synthesis Models

This section presents the main finding of the work by Vogt et al. [291], called "Tongue and Groove," which focused on an interactive controller for musical instruments by means of real-time tongue tracking. At the time this work was conducted, no implementation of the articulatory framework existed, and the real-time ultrasound tongue tracking system served as a sandbox for some framework concepts. A relevant aspect is the real-time interaction with a physical model, which enables a performer/user to embody the complete system, from data acquisition and extraction to model implication, and in this way to assess errors, correct the parameter mapping, and judge model fidelity. The assumption is transferred from the field of musical controllers, which states that a sufficient sense, mapping, and action space of a music controller will perceptually feel right and allow expression. Existing physical instruments that make use of the tongue as a control mechanism include reed instruments, the harmonica, and the mouth harp. In addition, instruments such as the Mouthesizer by Lyons, Hähnel & Tetsutani [155] and the talk box, based on the artificial larynx by Espenschied & Affel [76], use various elements of the human vocal tract to control or modulate sound. The Mouthesizer uses the lips as the sole means of input. The talk box utilizes a speaker placed in the performer's mouth, and the filtering effect of the mouth is recorded using an external microphone. The talk box became very popular in the 1970s and was played by many performers, such as Peter Frampton performing "Do You Feel Like We Do." Another related music controller is the Vocoder (voice and encoder) by Dudley [69], which extracts the formant frequencies from an acoustic voice signal. With the assumption of a single linear filter model, the formant frequencies would be the equivalent of the filter coefficients. The "Tongue and Groove" system is different in that, instead of acoustic measurement, it uses an articulatory model based on measurement of the physical configuration of the vocal tract in real time. These measurements are used in an active sense to control a digital instrument, rather than the more passive embodiment found in the talk box, where the interior of the mouth is used as a physical acoustic chamber. In the present project, the mapping of the vocal tract to the sound output is reconfigurable.
The goal of this study is not to directly model the vocal tract as used in everyday speech, but rather to explore how to leverage the fine motor control skills developed by the tongue for expressive music control. For experimentation, four different types of music synthesis algorithms were used as outputs for "Tongue and Groove," which can be read about in [291]. Most relevant to the speech project is the Tongue-SPASM instrument, which controls Perry Cook's Singing Physical Articulatory Synthesis Model (SPASM) [47]. SPASM simulates human voice sounds by modeling a vocal excitation function and filtering it through a virtual vocal tube with varying cross-section. Tongue-SPASM maps tongue heights to the radii of cylindrical segments in the virtual resonant tubes. The Tongue-SPASM algorithm is capable of reading in new control vectors and changing the sonic output at a rate of at least 30 signals/second, corresponding to the video frame rate of the ultrasound signal. We observed that this actually made the instrument more fun to play. This suggests that, as in the body, the tongue might be best used as a secondary controller that modifies a primary stream of musical information. Tongue-SPASM leverages the familiarity of filter-shaping with the tongue. For this reason, the instrument has a more "natural" feeling than the others do. However, with our rough mapping into vocal tract space, the familiarity actually causes the problem of unexpected behavior: a given tongue configuration does not make the sound a user would expect. This frustration can be removed with further refinement of the mapping from tongue measurements to vocal tract space.

5.4.5 Discussion of Tongue and Groove Results

The results from our experiments with "Tongue and Groove" are based on informal testing of our working prototype by a small number of users. Tongue tracking was achieved for multiple control points of the tongue contour at video frame rate. By improving noise robustness, we are able to track control points to within 5-pixel accuracy at NTSC resolution. Further, it was found that performers had difficulties in controlling multiple tongue points independently, which suggests that the tongue is not suitable to control independent parameters using a one-to-one mapping of tongue control points to independent sliders. A better way to think about the tongue is as a spatiotemporal contour controller; that is, many gestures of the tongue can be controlled accurately and reliably, as we observe in speech production. Gesture modeling and mapping seems to present a promising avenue for further investigation of the tongue as an intimate music controller. The main improvement of the algorithm is due to noise removal and limiting the bandwidth to achieve better contour tracking. These improvements allow real-time automatic contour extraction for vowels with few errors. In order to compare and test ultrasound acquisition setups and extraction algorithms, we presented an analysis framework with three types of analysis. First, the mean and variance analysis indicated how stable a set of postures is; for example, vowels varied within 5 pixels. Second, the principal component analysis within vowel postures allows more accurate measures. Third, the vowel classification measures how well tongue contours can be used to distinguish and recognize vowel postures, here with 95% accuracy. In the future, there are four directions for extension of this work.
First, postures from multiple speakers can be analyzed to study cross-speaker differences. Second, additional vocal sounds such as fricatives can be investigated. Third, more data sources in addition to ultrasound could be added to give a more complete representation of the vocal tract. These data should be recorded simultaneously and synchronously on a single medium. For example, a digital video camera could record the ultrasound images on the video track, the acoustic speech signal, recorded with a microphone, on the right audio track, and the glottis signal from a laryngograph on the left audio track. This would make it easier to analyze speech sounds that are not characterized by tongue postures alone. Fourth, a more holistic posture identification could be explored, such as the fast eigentongues method to determine the shape, by Akgul et al. [2] and Hueber, Chollet, Denby, Dreyfus & Stone [116].

5.4.6 Summary of Tongue and Groove

The aim of this study was to determine the suitability of automatic real-time tongue tracking for the model creation and validation process. An automatic extraction algorithm was formulated to produce tongue shapes from 2D ultrasound images. This algorithm achieved real-time tongue shapes that allow vowel posture tracking. The quality of the algorithm was measured with the mean and variance analysis in Figure 5.19 and the vowel classification in Table 5.5. Tracking was sufficient only for static postures such as vowels, since the signal-to-noise ratio was found to be too small for this particular instrumentation. While this approach is valid in the overall modeling process, the required algorithms will need further improvements to track all tongue postures. A possible direction is to devise a more suitable real-time implementation of algorithms such as those developed by Akgul, Kambhamettu & Stone [4]. Another direction is to constrain postures with a physical simulation, as presented in Chapter 4, to create more noise-robust tracking. In order to reconstruct the dynamics of the vocal tract, ultrasound images alone are not sufficient. One solution to this problem, in addition to having a constraining physical model, is to acquire more data, such as lip shapes using video and glottal vibration with a laryngograph, to resynthesize realistic speech.

5.5 Discussion of Data Acquisition and Extraction Results

This chapter presents data sets and three distinct feasibility studies for components of the modeling and validation process. Limits were determined for each of the components: segmentation, registration, and tongue tracking. Relevant upper airway data sets are presented with their spatiotemporal qualities and their suitability to model different anatomical structures. The developed 2D/3D automatic MR tongue segmentation using a two-stage level-set method produces clean tongue surface contours with sufficient parameter tuning. Since the lower part of the tongue has sparse features, its segmentation does not produce consistent results. In this case, other semi-automatic segmentation methods such as 3D live-wire [105] were identified as viable alternatives. In conclusion, both methods are superior to manual segmentation due to the reduction in required labor for clinical and research use. The introduced 2D/3D automatic MR tongue registration methods, using the demons algorithm and finite elements, are suitable to reference deformations across postures and subjects.
Compared to the geometry-based referencing algorithms shown in Section 2.5.3, the image-based methods are often slower but have the advantage of showing internal deformations. Finally, a 2D real-time ultrasound tracking algorithm enables measurement of the tongue surface. The algorithm is capable of extracting vowel postures reliably. However, it was not suitable for deducing tongue dynamics, as determined by visual inspection and by the activation of the singing synthesizer SPASM. The image extraction methods presented here were not used for the tongue model creation of Chapter 4, since that model builds on existing research. However, this work has developed methods and brought up issues for the modeling of individuals in the future.

Chapter 6 Summary & Conclusion

6.1 Summary

The integrated physiological model approach for the upper airway is suitable to further the understanding of many underlying processes; in the case of speech, examples are coarticulation, consonant production, speaker-to-speaker variations, and dysfunctional behavior. Current modeling systems provide separate solutions to model structures with the biomechanical, parametric, and acoustic properties of the vocal tract, including the tongue, larynx, lips, and face. However, current modeling systems do not provide solutions to couple these systems in order to construct complex models such as a full vocal tract. There are many open questions about how the complete upper airway works. Current research primarily uses observations and data to measure the motions and shapes of articulators. But this does not distinguish the contributions of the underlying physiology and motor control. This is the motivation to create physics-based models. The approach of this work examines the question of how support for a diverse and distributed research community must be formulated to share existing and future knowledge by means of a software tool. The software tool should enable researchers to integrate their specific work into a larger modeling context, to validate the work, and to share it with others. The solution approach for this research question is addressed with the areas of a unified framework, an interactive biomechanical tongue model, image modeling, and a validation facility. The premise is to support the diverse modeling processes of the research community and collaborative knowledge exchange. The following three high level areas were addressed in this thesis work:

A modeling framework organizes existing and future model types based on a taxonomy to allow mixing and matching of these approaches, including biomechanical, parametric, and acoustic models. The framework design addresses both graphical and programming users' needs and provides interface design components as solutions.

An interactive biomechanical tongue model was created to show a proof of concept towards a complete upper airway model. An existing non-interactive tongue model is integrated into the modeling framework using fast finite element methods for muscle tissue.

Image modeling is the process of extracting shapes from different types of static and dynamic images to model individuals. Current upper airway processing methods are labor intensive, which makes them infeasible for clinical applications. Three feasibility studies are conducted: for static MR segmentation and registration, and for dynamic ultrasound tongue shape tracking.
Model validation is demonstrated using two methods, one using reference models, the other using images. The first validation method compares the output of my interactive tongue model against the ICP tongue model as a reference to assess the accuracy of the fast finite element methods. The second, image-based validation is where a 3D model is manually matched to a 2D image by using plausible muscle activations.

6.1.1 Modeling Framework

This work introduced a modeling framework to simulate the upper airway. The involvement of the research community was essential to develop a user-centered design approach that allowed for formulating requirements and a design implementation independent of interface components. The developed model taxonomy further enabled the creation of a unified framework for different model approaches: biomechanical, parametric, and aeroacoustic. To allow more complex model interaction, connection and constraint concepts were designed to work across and within models. Both the graphical and the programmer interface were addressed to support general manipulation needs for editing and creation tasks.

6.1.2 Interactive Biomechanic Tongue Model

A finite element tongue model was created as a reference component for the upper airway to demonstrate the capabilities of the modeling framework. The tongue was created with a fast and stable finite element muscle formulation, and its performance is comparable to an ANSYS FEM simulation using a previously published reference tongue model. The simulation speeds achieved are within a factor of 10 of real time, at the expense of a small loss in model accuracy.

6.1.3 Image Modeling

Using selected MR and ultrasound image sets for the example of the tongue, two extraction methods and one registration method were created to support the model creation and validation. The presented 2D and 3D automatic MR tongue segmentation methods, using a two-stage level-set method, produce clean tongue surface contours. The presented method is superior to manual segmentation methods, and there is a saving in required labor for clinical and research use. The introduced 2D and 3D automatic MR tongue registration methods, using the demons algorithm and finite elements, show promise to reference deformations across postures and subjects. Finally, a 2D real-time ultrasound tracking algorithm enables tracking of the tongue surface. This real-time drive may be used in model validation or as a human interface controller.

6.1.4 Validation Process

Validation methods and their general application were demonstrated on the finite element tongue model. The validation process was shown in context with the framework and was demonstrated at different levels by comparison to a reference system or to measured data.
6.2 Contributions and Impact
With my thesis work, I have made the following contributions to the field:

Modeling framework creation
– Conceptualized the framework's focus by organizing related work
– Devised requirements and conceptualizations from researchers' feedback
– Developed a model taxonomy to categorize properties and unify ordering
– Contributed to the design and proof of concept to integrate different modeling types

Demonstration of framework feasibility with critical modeling tasks
– Proof of concept for a full vocal tract model by integration of the ICP biomechanical tongue [96]
– Extension of a biomechanical tongue [94] to allow interactive simulation speeds
– Contribution to the facility to interconnect models by connecting the tongue model to other anatomical structures, namely a kinematic jaw and the airway

Model validation and activation
– Demonstration of validation techniques for the biomechanical tongue [96] in the interactive modeling framework
– Demonstration of automatic segmentation and registration to support tongue model creation and validation from image data
– Real-time tracking of tongue shapes from ultrasound

Further, the impact of this work is demonstrated by the publications and presentations listed in Section 1.3.

6.3 Conclusion
In conclusion, the presented framework enables biomechanical modeling of the upper airway anatomy. Within this framework, the tongue, jaw, airway, and face have been integrated by different group members. I believe it is essential to collaborate with leading model researchers and practitioners from different institutions to tackle this complex problem. A key requirement is to create graphical user interfaces for interactive models to allow exploratory work and intuitive understanding. Biomechanical modeling is particularly important here, since it enables connections and forward dynamics simulations. For the continuation of this work, there are many new avenues for research. First is the connection of the tongue model to other anatomy models such as the jaw or airway. This enables the study of joint behavior for speech, mastication, and swallowing tasks. For many of these questions in clinical practice, modeling individuals becomes essential. Thus, solutions to create models from image data with top-down (anatomical atlas) and bottom-up (shape extraction) approaches become important. Second is the improvement of the tongue model with more sophisticated modeling methods. Lastly, there are many long-term goals towards which this project leads. One long-term goal is to develop an integrated physiological model for articulatory speech synthesis that produces natural-sounding speech. It is not clear which modeling approach achieves this goal, and therefore we provide a simulation framework for developing and evaluating different modeling approaches. The presented framework design and its proof of concept provide an important step towards modeling the upper airway.

6.4 Epilogue
The initial ArtiSynth project focus on speech production based on articulatory synthesis has expanded to include other functions of the upper airway such as chewing [106] and breathing, as well as clinical treatment of jaw reconstruction [253], sleep apnea, and tongue cancer. The requirements for the simulation of speech production tasks, from intent to sound production, have been found to be the most demanding compared to other upper airway functions.
Project members' collaborations with clinicians and medical researchers have been established to work towards a more complete upper airway model, in particular of individuals. The software framework was developed further, in particular for biomechanical modeling of deformable models, such as inverse dynamics modeling and efficient nonlinear [181] and reduced-coordinate finite elements. Much progress has been made by the ArtiSynth team to connect different model components such as tongue and airway [65], as well as tongue and jaw. In this context, new connection methods have been developed to produce accurate and stable results. Currently, the ArtiSynth team includes modeling researchers working on topics in biomechanics, aerodynamics, image processing, motor control, and acoustics. The cognitive and anatomical disciplines involved in the project include linguistics, dentistry, oral and facial surgery, and music.

Bibliography [1] Akgul, Yusuf ; Kambhamettu, Chandra ; Stone, Maureen: Automatic motion analysis of the tongue surface from ultrasound image sequences. In IEEE Workshop on Biomedical Image Analysis, 1998, pp. 126{132 [2] Akgul, Yusuf ; Kambhamettu, Chandra ; Stone, Maureen: Extraction and Tracking of The Tongue Surface from Ultrasound Image Sequences. In IEEE Conference on Computer Vision and Pattern Recognition, 1998, pp. 298 [3] Akgul, Yusuf ; Kambhamettu, Chandra ; Stone, Maureen: A task-specific contour tracker for ultrasound. In Proceedings. IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, 2000, pp. 135{142 [4] Akgul, Yusuf S. ; Kambhamettu, Chandra ; Stone, Maureen: Automatic extraction and tracking of the tongue contour. In IEEE Transactions on Medical Imaging 18 (1999), No. 10, pp. 1035{1045 [5] Albrecht, Irene ; Haber, Jörg ; Kähler, Kolja ; Schröder, Marc ; Seidel, Hans-Peter: May I talk to you? { Facial Animation from Text. In Proceedings Pacific Graphics, 2002, pp. 77 [6] Albrecht, Irene ; Haber, Jörg ; Seidel, Hans-Peter: Automatic Generation of Non-Verbal Facial Expressions from Speech. In Proceedings Computer Graphics International (CGI), 2002, pp. 283{293 [7] Allen, Jonathan ; Hunnicutt, Sharon ; Carlson, Rolf ; Granstrom, Bjorn: MITalk-79: The 1979 MIT text-to-speech system. In Journal of the Acoustical Society of America 65 (1979), pp. 130{ [8] ANSYS, Inc. (Veranst.): ANSYS LS-DYNA User's Guide. Release 10.0. 2005 [9] ANSYS Inc.: ANSYS - finite element analysis application [Computer Software]. 2006. { URL http://www.ansys.com [10] ANSYS Inc.: Fluent - computational fluid dynamics (CFD) simulation software [Computer Software]. 2007. { URL http://www.fluent.com [11] Arnal, Alain ; Badin, Pierre ; Brock, Gilbert ; Connan, Pierre-Yves ; Florig, Evelyne ; Perez, Noël ; Perrier, Pascal ; Simon, Pela ; Sock, Rudolph ; Varin, Laurent ; Vaxelaire, Beatrice ; Zerling, Jean-Pierre: An X-ray Database for French. In SPS 5 Proceedings, 2000 [12] Aspert, N. ; Santa-Cruz, D. ; Ebrahimi, T.: MESH: Measuring Error between Surfaces using the Hausdorff distance. In Proceedings of the IEEE International Conference on Multimedia and Expo 2002 (ICME) vol 1, 2002, pp. 705{708 [13] Atal, Bishnu S. ; Hanauer, Suzanne C.: Speech analysis and synthesis by linear prediction of the speech wave. In Journal of the Acoustical Society of America 50 (1971), pp. 637{655 [14] Atal, Bishnu S. ; Remde, Joel R.: A new model of LPC excitation for producing natural-sounding speech at low bit rates. In IEEE International Conference on Acoustics, Speech and Signal Processing, 1982, pp.
614{617 [15] Atal, Bishnu S. ; Schröder, Manfred R.: Stochastic coding of speech signals at very low bit rates. In Proceedings of International Conference on Communications, 1984, pp. 1610{1613 [16] Back, G. W. ; Nadig, S. ; Uppal, S. ; Coatesworth, A. P.: Why do we have a uvula?: literature review and a new theory. In Clinical Otolaryngology and Allied Sciences 29 (2004), No. 6, pp. 689 [17] Badin, P. ; Bailly, G. ; Reveret, L. ; Baciu, M. ; Segebarth, C.: Three-dimensional linear articulatory modeling of tongue, lips and face based on MRI and video images. In Journal of Phonetics 30 (2002), No. 3, pp. 533 [18] Badin, Pierre ; Bailly, Gerard ; Raybaudi, Monica ; Segebarth, Christoph: A Three-Dimensional Linear Articulatory Model Based on MRI Data. In Proceedings of the International Conference of Spoken Language (ICSLP), 1998, pp. 14{20 [19] Bailer, Werner: Writing ImageJ PlugIns? A Tutorial. 2006 [20] Barbic, Jernej ; James, Doug L.: Real-Time Subspace Integration for St.Venant-Kirchho Deformable Models. In ACM Trans on Graphics 24 (2005), pp. 982{990 137 [21] Basu, Sumit ; Oliver, Nuria ; Pentland, Alex: 3D lip shapes from video: A combined physical-statistical model. In Speech Communication 25 (1998), No. 12, pp. 131{148 [22] Bathe, Klaus-Juergen: Finite element procedures. Prentice Hall, 1996 [23] Berg, Mark de ; Krefeld, M. van ; Overmars, M. ; Schwarzkopf, O.: Computational Geometry : Algorithms and Applications (Hardcover). Springer, 2000 [24] Bettega, G. ; Payan, Yohan ; Mollard, B. ; Boyer, S. ; Raphael, A. ; Lavallee, B.: A simulator for maxillo-facial surgery integrating cephalometry and orthodontia. In Journal of Computer Aided Surgery 5 (2002), No. 3 [25] Birkholz, Peter: 3D-Artikulatorische Sprachsynthese, Uni Rostock, Germany, PhD thesis, 2006 [26] Black, Alan ; Taylor, Paul: CHATR: A Generic Speech Synthesis System. In Proceedings of COLING, the 15th International Conference on Computational Linguistics, 1994, pp. 983{ 986 [27] Black, Alan ; Taylor, Paul: Festival Speech Synthesis System / Human Communication Research Centre, University of Edinburgh, UK. 1997 (HCRC/TR-83). { Technical report [28] Bolz, Je ; Farmer, Ian ; Grinspun, Eitan ; Schröoder, Peter: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. In Proc. ACM SIGGRAPH, 2003, pp. 917{924 [29] Bro-Nielsen, Morten: Finite Element Modeling in Surgery Simulation. In Proceedings of IEEE (Special issue on surgery simulation) 86 (1998), No. 3, pp. 490{503 [30] Browman, Catherine P. ; Goldstein, Louis: Articulatory phonology: An overview. In Phonetica 49 (1992), pp. 155{180 [31] Brown, J.: A survey of image registation techniques / Columbia University, NY. 1992. { Technical report [32] Buchaillard, Stephanie: Muscle Activations and Lingual Movements: Modeling Natural and Pathological Speech, Universite Joseph Fourier, Grenoble-France, PhD thesis, 2007 [33] Buck, Joseph T. ; Ha, Soonhoi ; Lee, Edward ; Messerschmitt, David G.: Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems. In Int. Journal of Computer Simulation, special issue on Simulation Software Development 4 (1994), pp. 155{ 182 138 [34] Burger, Wilhelm ; Burge, Mark J.: Digitale Bildverarbeitung : Eine Einführung mit Java und ImageJ. 2., •uberarbeitete Au age. Berlin, Heidelberg : Springer-Verlag, 2005 [35] Campbell, W. N.: Prosodic encoding of English speech. In Proceedings of the International Conference of Spoken Language (ICSLP), 1992, pp. 663{666 [36] Campbell, W.N.: Processing a speech corpus for CHATR synthesis. 
In Proceedings of International Conference on Speech Processing (ICSP’97), URL http://feast.atr.jp/ chatr/, 1997, pp. 183{186 [37] Carré, R.: Linear correlates in the speech signal: Consequences of the speci c use of an acoustic tube? In The Behavioral and brain sciences 21 (1998), No. 2, pp. 261 [38] Carré, Rene: Dynamic properties of an acoustic tube: Prediction of vowel systems. In Speech Communication 51 (2009), No. 1, pp. 26 [39] Carstens Medizinelektronik: Electromagnetic Articulography (EMA) - Magnetic tracking system. 2001. { URL http://www.articulograph.de [40] Chabanas, Matthieu ; Payan, Yohan: A 3D Finite Element Model of the Face for Simulation in Plastic and Maxillo-Facial Surgery. In Springer Lecture Notes in Computer Science 1935 (2000), pp. 1068{1075 [41] Charbanas, Matthieu: Modélisation des tissus mous de la face pour la chirurgie orthognatique assistée par ordinateur., Universite Joseph-Fourier, Grenoble, PhD thesis, 2002 [42] Christensen, G. ; Miller, M. ; Marsch, J. ; Vannier, M.: Automatic analysis of medical images using a deformable textbook. In Computer Assisted Radiology (1995), pp. 146{151 [43] Cohen, Michael M. ; Massaro, Dominic W.: Modeling coarticulation in synthetic visual speech. pp. 141{155. In Models and Techniques in Computer Animation, D. Thalmann N. Magnenat-Thalmann, Springer-Verlag, 1993 [44] Coker, Cecil H.: A model of articulatory dynamics and control. In Proceedings of the IEEE vol 64, 1976, pp. 452{460 [45] Commowick, Olivier: Design and Use of Anatomical Atlases for Conformal Radiotherapy Planning, INRIA Sophia Antipolis in Computer Science, PhD thesis, 2007. { URL http: //olivier.commowick.org 139 [46] Cook, Perry R.: Identification of Control Parameters in an Articulatory Vocal Tract Model, Stanford University Department of Music, Stanford, CA, PhD thesis, 1990 [47] Cook, Perry R.: SPASM, a real-time vocal tract physical model controller and Singer, the companion software synthesis system. In Computer Music Journal 17 (1993), pp. 30{43 [48] Cootes, T.F. ; Hill, A. ; Taylor, C.J. ; Haslam, J.: The Use of Active Shape Models for Locating Structures in Medical Images. In Image and Vision Computing 12 (1994), No. 6, pp. 355{366 [49] Cotin, S. ; Delingette, H. ; Ayache, N.: Real-time elastic deformations of soft tissues for surgery simulation. In IEEE Transactions On Visualization and Computer Graphics 5 (1999), January-March, No. 1, pp. 62{73 [50] Couteau, Beatrice ; Payan, Yohan ; Lavallee, Stephane: The Mesh-Matching algorithm: an automatic 3D mesh generator for nite element structures. In Journal of Biomechanics 33 (2000), No. 8, pp. 1005{1009 [51] Cox, R.: Motion and functional MRI. In In Boston Workshop on Functional MRI, 1996 [52] Curle, N.: The in uence of solid boundaries upon aerodynamic sound. In Proceedings of the Royal Society of London, 1955 (A231), pp. 505{514 [53] Cyberware Inc.: 4020/PS 3D Scanner, 4020/RGB 3D Scanner with color digitizer. 8 Harris Court 3D, Monterey, California 93940. 1989. { URL http://cyberware.com [54] Dang, J. ; Honda, K.: A physiological articulatory model for simulating speech production process. In Journal of the Acoustical Society of Japan 22 (2001), No. 6, pp. 415{425 [55] Dang, Jianwi ; Honda, Kiyoshi: Construction and control of a physiological articulatory model. In Journal of the Acoustical Society of America 115 (2004), No. 2, pp. 853{870 [56] Dang, Jianwu ; Honda, Kiyoshi: Estimation of vocal tract shapes from speech sounds with a physiological articulatrory model. 
In Journal of Phonetics 30 (2002), pp. 511{532 [57] Davis, T. A.: A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. In ACM Transactions on Mathematical Software 30 (2004), No. 2, pp. 165{195 [58] DECtalk: DECTalk DC01 Owner’s Manual. Maynard, Mass: , 1984 [59] Delp, Scott: OpenSim - models of musculoskeletal structures. Website. March 2006. { URL http://simtk.org 140 [60] Deng, Xiao Q.: A finite element analysis of surgery of the human facial tissue, Columbia University, New York, PhD thesis, 1988 [61] Dobashi, Y ; Yamamoto, T ; Nishita, T: Real-time rendering of aerodynamic sound using sound textures based on computational uid dynamics. In ACM Transactions on Graphics (TOG) 22 (2003), No. 3, pp. 732{ [62] Dobashi, Yoshinori ; Yamamoto, Tsuyoshi ; Nishita, Tomoyuki: Synthesizing Sound from Turbulent Field using Sound Textures for Interactive Fluid Simulation. In Computer Graphics Forum 23 (2004), No. 3, pp. 539{545 [63] Doel, Kees van den ; Ascher, Uri: Real-time numerical solution of Webster's equation on a non-uniform grid. In IEEE Transactions on Audio, Speech and Language Processing 16 (2008), pp. 1163{1172 [64] Doel, Kees van den ; Kry, Paul G. ; Pai, Dinesh K.: FoleyAutomatic: Physically-based Sound E ects for Interactive Simulation and Animation. In Proc. ACM SIGGRAPH, 2001, pp. 537{544 [65] Doel, Kees van den ; Vogt, Florian ; English, R. E. ; Fels, Sidney S.: Towards Articulatory Speech Synthesis with a Dynamic 3D Finite Element Tongue Model. In Proc of ISSP, 2006, pp. 59{66 [66] Drioli, Carlo: A ow waveform matched low-dimensional glottal model based on physical knowledge. In Journal of the Acoustical Society of America 117 (2005), No. 5, pp. 3184{3195 [67] Duck, F.A.: Physical Property of Tissues: A Comprehensive Reference Book. London: Academic Press, 1990 [68] Dudley, H. ; Tarnoczy, T.H.: The Speaking Machine of Wolfgang von Kempelen. In Journal of the Acoustical Society of America 22 (1950), No. 2, pp. 151{166 [69] Dudley, Homer: Remaking speech. In Computer Music Journal 11 (1939), pp. 169{177 [70] Dunn, H. K.: The calculation of vowel resonances, and an electrical vocal tract. In Journal of the Acoustical Society of America 22 (1950), pp. 740{753 [71] Engwall, Olov: Modeling of the vocal tract in three dimensions. In eos, 1999, pp. 113{116 [72] Engwall, Olov: A 3D tongue model based on MRI data. In Proceedings of the International Conference of Spoken Language (ICSLP), 2000 141 [73] Engwall, Olov: Replicating three-dimensional tongue shapes synthetically / KTH Quarterly Progress and Status Report: Speech, Music and Hearing. 2000. { Technical report [74] Engwall, Olov: Making the Tongue Model Talk: Merging MRI & EMA Measurements. In Proc. of Eurospeech vol 1, 2001, pp. 261{264 [75] Engwall, Olov ; Badin, Pierre: Collecting and analysing two- and three-dimensional MRI data from Swedish / KTH Quarterly Progress and Status Report: Speech, Music and Hearing. 1999. { Technical report [76] Espenschied, Lloyd ; Affel, Herman: artificial larynx. US Patent by ATT Bell Labs. 1929 [77] Fant, Gunnar: Acoustic Theory of Speech Production. Mouton : The Hague, Netherlands: Mouton., 1960 [78] Faure, Francois: Simulation Open Framework Architecture (SOFA). 2007. { URL http: //www.sofa-framework.org [79] Fedkiw, Ronald ; Stam, Jos ; Jensen, Henrik W.: Visual simulation of smoke. 
In Proceedings Computer Graphics International (CGI), 2001 [80] Fels, Sidney ; Vogt, Florian ; Doel, Kees van den ; Lloyd, John ; Guenter, Oliver: Artisynth: Towards Realizing an Extensible, Portable 3D Articulatory Speech Synthesizer. In Int Workshop on Auditory Visual Speech Processing, 2005, pp. 119{124 [81] Fels, Sidney ; Vogt, Florian ; Doel, Kees van den ; Lloyd, John ; Stavness, Ian ; Vatikiotis-Bateson, Eric: ArtiSynth: A Biomechanical Simulation Platform for the Vocal Tract and Upper Airway / Computer Science Dept., Univ of British Columbia. 2006 (TR- 2006-10). { Technical report [82] Fels, Sidney S. ; Lloyd, John E. ; Doel, Kees van den ; Vogt, Florian ; Stavness, Ian ; Vatikiotis-Bateson, Eric: Developing Physically-Based, Dynamic Vocal Tract Models using ArtiSynth. In Proc of ISSP, 2006, pp. 419{426 [83] Fels, Sidney S. ; Vogt, Florian ; Gick, Bryan ; Jaeger, Carol ; Wilson, Ian: User- centered Design for an Open Source 3D Articulatory Synthesizer. In Proc Int Congress of Phonetic Science (ICPhS), 2003, pp. 179{182 142 [84] Ferrant, M. ; Warfield, S. K. ; Guttmann, C. R. G. ; Mulkern, R. V.: 3d image matching using nite element based elastic deformation model. In International Society and Conference Series on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 1999, pp. 202{209 [85] FestVox: Festival Speech Synthesis System. open source software. 1999. { URL http: //festvox.org [86] Fitzpatrick, J. M. ; Hill, D. L. G. ; Maurer, C. R.: Handbook of Medical Imaging. Bellingham, WA : SPIE Press, 2000 [87] Flanagan, James L. ; Ishizaka, Kenzo: Automatic generation of voiceless excitation in a vocal cord-vocal tract speech synthesizer. In IEEE Transactions on Acoustics, Speech and Signal Processing 24 (1976), pp. 163{170 [88] Fontecave, J ; Berthommier, F: Semi-Automatic Extraction of Vocal Tract Movements from Cineradiographic Data. In Proceedings of the International Conference on Spoken, 2006 [89] Freeborough, Peter A. ; Fox, Nick C.: Modeling Brain Deformations in Alzheimer Disease by Fluid Registration of Serial 3D MR Images. In J Comput Assist Tomogr 22 (1998), No. 5, pp. 583{590 [90] FreeTTS: A Java-based speech synthesizer. Open source software. 1998. { URL http: //freetts.sf.net [91] Fujimura, Osamu: Articulatory perspectives of speech organization. pp. 323{342. In Speech Production and Speech Modelling., (Eds.), 1990 (In W. J. Hardcastle and A. Marchal) [92] Fung, Y.C.: Biomechanics : mechanical properties of living tissues. 2nd. Springer-Verlag, 1993 [93] Gallier, James: Curves and Surfaces in Geometric Modeling: Theory and Algorithms. Morgan Kaufmann, 1999 [94] Gerard, J. M. ; Wilhelms-Tricarico, R. ; Perrier, P. ; Payan, Y.: A 3D dynamical biomechanical tongue model to study speech motor control. In Recent Research Developments in Biomechanics 1 (2003), pp. 49{64 [95] Gerard, J.M. ; Ohayon, J. ; Luboz, V. ; Perrier, P. ; Payan, Y.: Indentation for estimating the human tongue soft tissues constitutive law: application to a 3D biomechanical 143 model to study speech motor control and pathologies of the upper airways. In Lecture Notes in Computer Science, 3078 (2004), pp. 77{83 [96] Gerard, J.M. ; Perrier, P. ; Payan, Y.: 3D biomechanical tongue modelling to study speech production. pp. 85{102. In J. Harrington & M. Tabain (eds). 
Speech Production: Models, Phonetic Processes, and Techniques, Psychology Press: New-York, USA, 2006 [97] Gick, Bryan ; Campbell, Fiona ; Oh, Sunyoung ; Tamburri-Watt, Linda: Toward universals in the gestural organization of syllables: A cross-linguistic study of liquids. In JPHO 34 (2006), No. 1, pp. 49{72 [98] Gladilin, E. ; Zachow, S. ; Deuflhard, P. ; Hege., H.-C.: Virtual Fibers: A Robust Approach for Muscle Simulation. In Proc MEDICON, 2001, pp. 961{964 [99] Golub, Gene H. ; Loan, Charles F. V.: Matrix Computation. 3. JHU Press, 1996 [100] Gottschalk, S. ; Lin, M. C. ; Manocha, D.: OBBTree: A Hierarchical Structure for Rapid Interference Detection. In ACM Trans on Graphics 15 (1996), No. 3 [101] Gray, A.: Passive cascaded lattice digital lters. In Circuits and Systems, IEEE Transactions on 27 (1980), No. 5, pp. 337{344 [102] Gray, Henry: Anatomy of the Human Body. Philadelphia: Lea & Febiger, 2003 [103] Guenter, Brian ; Grimm, Cindy ; Wood, Daniel ; Malvar, Henrique ; Pighin, Frederic: Making faces [facial animation]. In Proc. ACM SIGGRAPH, 1998, pp. 55{66 [104] Hajnal, J. V. ; Saeed, N. ; Soar, E. J. ; Oatridge, A. ; Young, I. R. ; Bydder, G. M.: A registration and interpolation procedure for subvoxel matching of serially acquired MR images. In J Comput Assist Tomogr 19 (1995), No. 2, pp. 289{96 [105] Hamarneh, Ghassan ; Yang, Johnson ; McIntosh, Chris ; Langille, Morgan: 3D live- wire-based semi-automatic segmentation of medical images. In SPIE Medical Imaging 5747 (2005), pp. 1597{1603 [106] Hannam, Alan G. ; Stavness, Ian ; Lloyd, John E. ; Fels, Sidney: A Dynamic Model of Jaw and Hyoid Biomechanics during Chewing. In Journal of Biomechanics 41 (2008), No. 5, pp. 1069{1076 [107] Hiiemae, Karen M. ; Palmer, Je rey B.: Tongue Movements in Feeding and Speech. In Critical Reviews in Oral Biology and Medicine 14 (2003), No. 6, pp. 413{429 144 [108] Hill, A. V.: The Heat of Shortening and the Dynamic Constants of Muscle. In Philosophical Transactions of the Royal Society B: Biological Sciences 126 (1938), No. 843, pp. 136{195 [109] Hill, D. ; Pearce, A. ; Wyvill, B.: Animating speech: A automated approach using speech synthesised by rules. In the Visual Computer vol 3, 1988, pp. 277{287 [110] Hill, David R. ; Manzara, Leonard ; Schock, Craig: Real-time articulatory speech- synthesis-by-rules. In Proceedings of the 14th Annual International Voice Technologies Applications Conference of the American Voice I/O Society, San Jose 1995, 1995, pp. 27{44 [111] Hirano, Minoru: The vocal cord during phonation. In Igaku no Ayumi 80 (1968), No. 10 [112] Holmes, John N.: The in uence of glottal waveform on the naturalness of speech from a parallel formant sythesizer. In IEEE Transactions on Audio and Electroacoustics (1973), pp. 298{305 [113] Holmes, John N.: Formant Synthesizers: Cascade or Parallel? In Speech Communication 2 (1983), pp. 251{273 [114] Howe, Michael L.: Acoustics of Fluid-Structure Interaction, Cambridge Monographs on Mechanics. Cambridge University Press, New York, 1999 [115] Huber, Daniel: The Computer Vision Homepage. accessed on Apr. 12, 2006 2006. { URL http://www.cs.cmu.edu/~cil/vision.html [116] Hueber, T. ; Chollet, G. ; Denby, B. ; Dreyfus, G. ; Stone, M.: Continuous- Speech Phone Recognition from Ultrasound and Optical Images of the Tongue and Lips. In Interspeech, 2007, pp. 658{661 [117] Ibáñez, Luis ; Schroeder, Will ; Ng, Lydia ; Cates, Josh ; Insight Software Consortium the: The ITK Software Guide: The Insight Segmentation and Registration Toolkit. 
Kitware Inc., 2003 [118] Irving, G. ; Schroeder, C. ; Fedkiw, Ron: Volume Conserving Finite Element Simulation of Deformable Models. In Proc. ACM SIGGRAPH, ACM TOG 26, 2007 [119] Isard, Steve D. ; Miller, D. A.: Diphone Synthesis Techniques. In Proceeding of IEEE Conference on Speech Input/Output, 1986 (258), pp. 77{82 [120] Ishizaka, Kenzo ; Flanagan, James L.: Synthesis of voiced sounds from a two-mass model of the vocal tract. In Bell System Technical Journal 51 (1972), pp. 1233{1268 145 [121] Jackson, M. ; Espy-Wilson, Carol Y. ; Boyce, Suzanne E.: Verifying a vocal tract model with a closed side-branch. In Journal of the Acoustical Society of America 109 (2001), No. 6, pp. 2983{7 [122] Jackson, Philip J.: Characterisation of plosive, fricative and aspiration components in speech production, University of Southhampton, uk, PhD thesis, 2000 [123] James, Doug L. ; Pai, Dinesh K.: ARTDEFO: Accurate Real Time Deformable Objects. In Proc. ACM SIGGRAPH, 1999, pp. 65{72 [124] James, Doug L. ; Pai, Dinesh K.: BD-Tree: Output-Sensitive Collision Detection for Reduced Deformable Models. In ACM Trans on Graphics 23 (2004), No. 3 [125] Jeong, Won-Ki ; Kähler, Kolja ; Haber, J•org ; Seidel, Hans-Peter: Automatic Generation of Subdivision Surface Head Models from Point Cloud Data. In Proc. Graphics Interface, 2002, pp. 181{188 [126] Jolliffe, I. T.: Principal component analysis. Springer, 2002 [127] Kalra, P. ; Magnenat-Thalmann, N.: Simulation of facial skin using texture mapping and coloration. In IFIP Transactions B (Applications in Technology) B-9 (1993), pp. 365{74 [128] Kass, Michael ; Witkin, Andrew ; Terzopoulos, Demetri: Snakes: Active contour models. In International Journal of Computer Vision 1 (1987), pp. 321{331 [129] Kaufman, Danny M. ; Edmunds, Timothy ; Pai, Dinesh K.: Fast frictional dynamics for rigid bodies. In ACM Trans. Graph. 24 (2005), No. 3, pp. 946{956. { ISSN 0730-0301 [130] Kelly, K. L. ; Lochbaum, C. C.: Speech Synthesis. In Proc. Fourth ICA, 1962 [131] Kempelen, Wolfgang R. von: Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine. Stuttgart : Stuttgart-Bad Cannstatt F. Frommann, 1970 [132] Kerdok, A. E. ; Cotin, S. M. ; Ottensmeyer, M. P. ; Galea, A. M. ; Howe, R. D. ; ; Dawson, S. L.: Truth Cube: Establishing Physical Standards for Real Time Soft Tissue Simulation. In Medical Image Analysis 7 (2003), pp. 283{291 [133] Kimmel, R. ; Caselles, V. ; Shapiro, G.: Geodesic active contours. In International Journal on Computer Vision 22 (1997), No. 1, pp. 61{97 [134] King, Scott A. ; Parent, Richard E.: A 3D Parametric Tongue Model for Animated Speech. In JVCA 12 (2001), No. 3, pp. 107{115 146 [135] Klatt, Dennis H.: Software for a cascade/parallel formant synthesizer. In Journal of the Acoustical Society of America 67 (1980), pp. 971{995 [136] Kob, Malte: Physical Modeling of the Singing voice, University of Technology Aachen, PhD thesis, 2002 [137] Koch, R.M. ; Gross, M.H. ; Carls, F.R. ; Buren, D.F. von ; Fankhauser, G. ; Parish, Y.I.H.: Simulating facial surgery using nite element models. In Proc. ACM SIGGRAPH, 1996, pp. 421{8 [138] Kraevoy, Vladislav ; Sheffer, Alla: Cross-parameterization and compatible remeshing of 3D models. In ACM Trans. Graph. 23 (2004), No. 3, pp. 861{869. { ISSN 0730-0301 [139] Krüger, Jens ; Westermann, R•udiger: Linear algebra operators for GPU implementation of numerical algorithms. In Proc. ACM SIGGRAPH 2005 Courses, 2005, pp. 234 [140] Kuehn, D. P. 
; Azzam, N.A.: Anatomical Characteristics of Palatoglossus and the Anterior Faucial Pilar. In Cleft Palate Journal of Anatomy 15 (1978), No. 4, pp. 349{359 [141] Kuratate, Takaaki ; Munhall, Kevin G. ; Rubin, Philip E. ; Vatikiotis-Bateson, Eric ; Yehia, Hani C.: Audio-Visual Synthesis Of Talking Faces From Speech Production Correlates. In Proc. of Eurospeech vol 3, 1999, pp. 1279{1282 [142] Kuratate, Takaaki ; Vignali, Guillaume ; Vatikiotis-Bateson, Eric: Building a large scale 3d face database and applying it to face animation. In Proceedings of Visual Computing / Graphics & CAD Joint Symposium, 2003, pp. 105{110 [143] Kähler, K. ; Haber, J. ; Seidel, H.-P.: Geometry-based Muscle Modeling for Facial Animation. In Proceedings Graphics Interface, 2001, pp. 37{46 [144] Kähler, Kolja ; Haber, J•org ; Yamauchi, Hitoshi ; Seidel, Hans-Peter: Head shop: Generating animated head models with anatomical structure. In Proc. ACM SIGGRAPH, 2002, pp. 55{64 [145] LaBouff, Kathryn: Singing and Communicating in English. Oxford University Press, 2007 [146] Lee, Yuencheng ; Terzopoulos, Demetri ; Waters, Keith: Constructing physics-based facial models of individuals. In Proc. Graphics Interface, 1993, pp. 1{8 [147] Lee, Yuencheng ; Terzopoulos, Demetri ; Waters, Keith: Realistic modeling for facial animation. In Proc. ACM SIGGRAPH, 1995, pp. 55{62 147 [148] Lighthill, Michael J.: On sound generated aerodynamically 1. In Proceedings of the Royal Society of London, 1952 (A211), pp. 564{587 [149] Lighthill, Michael J.: On sound generated aerodynamically 2. In Proceedings of the Royal Society of London, 1954 (A222), pp. 1{32 [150] Linguistic Data Consortium: Online Catalog of speech and text databases - contains hundreds of corpora of language data. 1997. { URL http://www.ldc.upenn.edu [151] Lorensen, W. ; Cline, H.: Marching Cubes: A high resolution 3D surface construction algorithm. In Proc. ACM SIGGRAPH, 1987, pp. 163{169 [152] Lu, Hui-Ling: Toward a high-quality singing synthesizer with vocal texture control, Stanford University, PhD thesis, 2002 [153] Lubker, James: Palatoglossus Function in Normal Speech Production - Electromyographic Implications. In Journal of the Acoustical Society of America 53 (1973), No. 1, pp. 296 [154] Lundberg, Andrew J. ; Stone, Maureen: Three-dimensional tongue surface reconstruction: Practical considerations for ultrasound data. In Journal of the Acoustical Society of America 106 (1999), No. 5, pp. 2858{2867 [155] Lyons, Michael J. ; Hähnel, Michael ; Tetsutani, Nobuji: The Mouthesizer: A Facial Gesture Musical Interface. In Proc. ACM SIGGRAPH, 2001 [156] Maeda, Shinji: Improved articulatory model. In Journal of the Acoustical Society of America 84 (1988), No. S1, pp. 146pp [157] Maintz, J. B. A. ; Viergever, Max A.: A Survey of Medical Image Registration / Image Sciences Institute, Utrecht, Netherland. 1997. { Technical report [158] Markel, John D. ; Gray, Augustine H.: Linear Prediction of Speech. New York : Springer Verlag, 1972 [159] Mase, G. T. ; Mase, George E.: Continuum Mechanics for Engineers. Taylor and Francis, 1999 [160] Matsuzaki, H. ; Miki, N. ; Nagai, N. ; Hirohku, T. ; Ogawa, Y.: 3D FEM analysis of vocal tract model of elliptic tube with inhomogeneous-wall impedance. In Proceedings of the International Conference of Spoken Language (ICSLP) vol 2, 1994, pp. 635{638 148 [161] Mattes, David ; Haynor, David R. ; Vesselle, Hubert ; Lewellyn, Thomas K. ; Eubank, William: Non-rigid multimodality image registration. In Medical Imaging 4322 (2001), No. 186, pp. 
1609{1620 [162] MBROLA: Multilingual Speech Synthesizer. 1996. { URL http://tcts.fpms.ac.be/ synthesis/ [163] McGurk, Harry ; MacDonald, John: Hearing lips and seeing voices. In Nature 264 (1976), pp. 746{748 [164] McInerney, Tim ; Terzopoulos, Demetri: Topology Adaptive Deformable Surfaces for Medical Image Volume Segmentation. In IEEE Transactions on Medical Imaging 18 (1999), No. 10, pp. 840{850 [165] Mercury Computer Systems Inc., San Diego, California: Amira Medical Visualization Software [Computer Software]. 2006. { URL http://www.amiravis.com [166] Mermelstein, Paul: Determination of the Vocal-Tract Shape from Measured Formant Frequency. In Journal of the Acoustical Society of America 41 (1967), pp. 1283{1294 [167] Meyers, Ethan: Face Databases: a review. accessed on Apr. 11, 2006 2006. { URL http://web.mit.edu/emeyers/www/ [168] Mick, J.: Making faces [machine vision developments]. In Image Processing (1997), pp. 9{10 [169] Montani, Claudio ; Scateni, Riccardo ; Scopigno, Roberto: A modi ed look-up table for implicit disambiguation of Marching Cubes. In The Visual Computer 10 (1994), pp. 0178{ 2789 [170] Mortenson, Michael E.: Geometric Modeling 2nd Edition. John Wiley & Sons, 1997 [171] Mullen, J. ; Howard, D. M. ; Murphy, D. T.: Real-time dynamic articulations in the 2-D waveguide mesh vocal tract model. In EEE Transactions on Audio, Speech and Language Processing 15 (2007), No. 2, pp. 577 { 585 [172] Munhall, K.G. ; Vatikiotis-Bateson, E. ; Tohkura, Y.: Manual for the X-ray lm database / ATR, Kyoto, Japan. 1994 (TR-H-116). { Technical report [173] MusculoGraphics, Inc: SIMM - Software for Interactive Musculoskeletal Modeling [Computer Software]. 2001. { URL http://www.musculographics.com 149 [174] Möhler, Gregor: A collection of text-to-speech systems with sound examples. Website. 2001. { URL http://www.ims.uni-stuttgart.de/~moehler/synthspeech/ [175] Müller, Matthias ; Gross, Markus: Interactive virtual materials. In GI ’04: Proceedings of the 2004 conference on Graphics interface, Canadian Human-Computer Communications Society, 2004, pp. 239{246. { ISBN 1-56881-227-2 [176] National Institutes of Health: ImageJ - Image Processing and Analysis in Java [Computer Software]. online. 2004. { URL http://rsb.info.nih.gov/ij/ [177] National Library of Medicine: Visible Human Project (CBM 2007-1). Current Bibliographies in Medicine. 1987 [178] Nealen, Andrew ; Müller, Matthias ; Keiser, Richard ; Boxerman, Eddy ; Carlson, Mark: Physically Based Deformable Models in Computer Graphics. In Computer Graphics Forum 25 (2006), pp. 809{836 [179] Nesme, Matthieu ; Payan, Yohan ; Faure, Francois: Ecient, physically plausible nite elements. In Eurographics, 2005 [180] Next Limit Technologies: Real Flow - fluid and dynamics simulation [Computer Software]. 2006. { URL http://www.nextlimit.com/realflow/ [181] Ngan, Wayne ; Lloyd, John: Ecient Deformable Body Simulation using Sti ness-Warped Nonlinear Finite Elements. In Proc. Symposium Interactive 3D Games and Graphics, 2008 [182] Nielsen, Jakob: Usability Engineering (Interactive Technologies). Morgan Kaufmann, 1993 [183] Nikishkov, G.P.: Java Performance in Finite Element Computations. In Proc Appl Sim & Mod, 2003, pp. 410 [184] O’Brien, J. F. ; Cook, P. R. ; Essl, G.: Synthesizing Sounds from Physically Based Motion. In Proc. ACM SIGGRAPH, 2001, pp. 529{536 [185] O’Brien, J. F. ; Cook, P. R. ; Essl, G.: Synthesizing Sounds from Physically Based Motion. In Proc. ACM SIGGRAPH, 2001, pp. 529{536 [186] Ogata, Shin ; Murai, K. ; Nakamura, S. 
; Morishima, S.: Model-based lip synchronization with automatically translated synthetic voice toward a multi-modal translation system,. In Multimedia and Expo, 2001. ICME, 2001, pp. 28{31 150 [187] Optotrak Tracking System: Northern Digital Inc., Waterloo, Ontario. 1990. { URL http://www.ndigital.com [188] Osher, Stanley J. ; Fedkiw, Ronald P.: Level Set Methods and Dynamic Implicit Surfaces. Springer, 2002 [189] Pai, D. K. ; Sueda, S. ; Wei, Q.: Fast physically based musculoskeletal simulation. In Conditionally accepted to ACM Transaction on Graphics (2007) [190] Pai, Dinesh K. ; Sueda, Shinjiro ; Wei., Qi: Fast Physically Based Musculoskeletal Simulation. In ACM Trans Graph (2005) [191] Pandy, M. G.: Computer modeling and simulation of human movement. In Annual Review of Biomedical Engineering 3 (2001), No. 1, pp. 245{273 [192] Parke, Frederic I.: Computer generated animation of faces., University of Utah, Salt Lake City, MS thesis, 1972 [193] Parke, Frederic I.: A parametric model of human faces, University of Utah, Salt Lake City, PhD thesis, 1974 [194] Parke, Frederic I. ; Waters, Keith: Computer Facial Animation. A K Peters, 1996 [195] Payan, Yohan ; Perrier, Pascal: Synthesis of V-V sequences with a 2d biomechanical tongue model controlled by the Equilibrium Point Hypothesis. In Speech Communications 22 (1997), No. 2, pp. 185{205 [196] Peck, C. C. ; Langenbach, G. E. J. ; Hannam, A. G.: Dynamic Simulation of muscle and articular properties during human wide jaw opening. In Archives of Oral Biology 45 (2000), No. 11, pp. 963{982 [197] Pellom, Bryan L.: Enhancement, Segmentation, and Synthesis of Speech with Application to Robust Speaker Recognition. Department of Electrical and Computer Engineering, Duke University,Durham, NC, PhD thesis, 1998 [198] Pentland, A. ; Williams, J.: Good vibrations: model dynamics for graphics and animation. In Proc. ACM SIGGRAPH, 1989, pp. 207{214 [199] Perkell, J. S.: A physiologically oriented model of tongue activity in speech production, Massuchets Institute of Technology, PhD thesis, 1974 151 [200] Perkell, Joseph S.: Properties of the tongue help to de ne vowel categories : hypotheses based on physiologically-oriented modeling. In Journal of Phonetics 24 (1996), pp. 3{22 [201] Pham, D.: A survey on current methods on image segmentation / Image Sciences Institute, Utrecht, Netherland. 2001. { Technical report [202] Pieper, Steven D.: More than skin deep: Physical modeling of facial tissue, Massachusetts Institute of Technology, Media Arts and Sciences, Cambridge, MA, MS thesis, 1989 [203] Platt, Stephen M.: A system for computer simulation of the human face., University of Pennsylvania, Pittsburg, MS thesis, 1972 [204] Polhemus Inc.: Fastrack System. accessed on Apr. 11, 2002 1970. { URL http://www. polhemus.com/ [205] Pothou, K.P. ; Huberson, S.G. ; Voutsinas, Spyros G. ; Knio, O.M.: Application of 3D Particle Method to the Prediction of Aerodynamic Sound. In Proc. of European Series in Applied and Industrial Mathematics (ESAIM) vol 1, 1996, pp. 349{362 [206] Prince, J. L. ; Links, J. M.: Medical Imaging Signals and Systems. Upper Saddle River, NJ : Pearson Prentice Hall, 2006 [207] Pritchard, David: Vocal tract Visualization / UBC, Canada. 2002. { Technical report [208] Puckette, Miller: FTS: A Real-time Monitor for Multiprocessor Music Synthesis. In Computer Music Journal 15 (1991), No. 3, pp. 58{67 [209] Rabiner, L. R. ; Schafer, R. W. ; Flanagan, J. L.: Computer synthesis of speech by concatenation of formant coded words. 
In Bell System Technical Journal 50 (1971), No. 5, pp. 1541{1558 [210] Rabiner, Lawrence R.: Speech synthesis by rule: an acoustic domain approach, Massachusetts Institute of Technology, Cambridge MA, PhD thesis, 1967 [211] Rabiner, Lawrence R. ; Schafer, Ronald W.: Digital Processing of Speech Signals. Englewood Cli s, NJ : Prentice-Hall Inc., 1978 [212] Raymond, Eric S.: The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O'Reilly, 2001 [213] Reeves, W. T.: Particle systems|a technique for modeling a class of fuzzy objects. In Proc. ACM SIGGRAPH, 1983, pp. 359{375 152 [214] Reveret, Lionel ; Bailly, Gerard ; Badin, Pierre: MOTHER : a new generation of talking heads providing a exible articulatory control for video-realistic speech animation. In Proceedings of the International Conference of Spoken Language (ICSLP) vol 2, 2000, pp. 755{758 [215] Reveret, Lionel ; Benoit, Christian: A New 3D Lip Model for Analysis and Synthesis of Lip Motion in Speech Production. In Proc. of the Second ESCA Workshop on Audio-Visual Speech Processing, 1998, pp. 207{212 [216] Richard, G. ; Liu, M. ; Snider, D. ; Duncan, H. ; Lin, Qiguang ; Flanagan, James L. ; Levinson, Stephen ; Davis, D. ; Slimon, S.: Numerical simulations of uid ow in the vocal tract. In Proc EUROSPEECH, 1995 [217] Richard, Ga•el ; Liu, M. ; Sinder, Daniel J. ; Duncan, H. ; Lin, Qiguang ; Flanagan, James L. ; Lin, H. ; Levinson, S. ; Davis, Donald ; Slimon, S.: Vocal tract simulations based on uid dynamic analysis. In Journal of the Acoustical Society of America 97 (1995), No. 5, pp. 3245pp [218] Richard, Ga•el ; Liu, M. ; Sinder, Daniel J. ; Duncan, H. ; Lin, Qiguang ; Flanagan, James L. ; Lin, H. ; Levinson, S. ; Davis, Donald ; Slimon, S.: A Fluid Flow Approach to Speech Generation. In Speech Production Seminar, 1996 [219] Rienstra, Sjoerd W. ; Hirschberg, Avraham: An Introduction to Acoustics / Eindhoven University of Technology, Netherland. 2002. { Technical report [220] Robert McNeel and Associates, Seattle, Washington: Rhino Modeling Software [Computer Software]. 2005. { URL http://www.rhino.com [221] Rosenberg, Aaron E.: E ect of Glottal Pulse Shape on the Quality of Natural Vowels. In Journal of the Acoustical Society of America 49 (1971), No. 2, pp. 583{590 [222] Roweis, Sam ; Alwan, Abeer: Towards articulatory speech recognition: Learning smooth maps to recover articulator information. In Proc. of Eurospeech, 1997 [223] Rubin, P. ; Saltzman, E. ; GoldStein, L. ; McGowan, R. ; Tiede, M. ; C, Browman: CASY and Extensions to the task-Dynamic model. In Proc 4th Sp Prod Sem, 1996, pp. 125{ 128 [224] Rubin, P. E. ; Baer, T. ; Mermelstein, P: An articulatory synthesizer for perceptual research. In Journal of the Acoustical Society of America 70 (1981), pp. 321{328 153 [225] Rubin, Philip ; Baer, T. ; Mermelstein, Paul: An articulatory synthesizer for perceptual research. In Journal of the Acoustical Society of America 70 (1981), pp. 321{328 [226] Rubin, Philip ; Vatikiotis-Bateson, Eric: Webpage: Talking Heads. 1998 [227] Saad, Yousef: Iterative Methods for Sparse Linear Systems. SIAM, 2003 [228] Sakakibara, Ken-Ichi ; Konishi, Tomoko ; Kondo, Kazumasa ; Murano, Emi Z. ; Kumada, Masanobu ; Imagawa, Hiroshi ; Niimi, Seiji: Vocal fold and false vocal fold vibrations and synthesis of khoomei. In Proc. of the International Computer Music Conference, 2001, pp. 135{138 [229] Sandia National Labratories: Cubit - Geometry and Mesh Generation Toolkit. 2006. 
{ URL http://cubit.sandia.gov [230] Sanguineti, V. ; Laboissiere, R. ; Ostry, D. J.: A dynamic biomechanical model for neural control of speech production. In Journal of the Acoustical Society of America 103 (1998), pp. 1615{1627 [231] Schenk, O. ; Röllin, S. ; Hagemann, M.: Recent advances in sparse linear solver technology for semiconductor device simulation matrices. In IEEE SISPAD, 2003, pp. 103{ 108 [232] Schenk, Olaf ; Gärtner, Klaus: Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO. In Journal of Future Generation Computer Systems 20 (2004), No. 3, pp. 475{487 [233] Schmitz, Oliver ; Vorländer, Michael ; Feistel, Stefan ; Ahnert, Wolfgang: Merging software for sound reinforcement systems and for room acoustics. In Audio Engineering Society 110th Convention, 2001 [234] Schröder, Manfred R. ; Atal, Bishnu S.: Code-Excited Linear Prediction (CELP). In Proceedings of IEEE International Conference on Acoustics, Speech and Signal processing, 1985, pp. 937{940 [235] Schröder, W. ; Martin, K. ; Lorensen, Bill: The Visualization Toolkit: An Object- Oriented Approach to 3D Graphics (vtk). Prentice Hall, 1997 [236] Sethian, J. A.: Level Set Methods and Fast Marching Methods. Cambridge University Press, 1999 154 [237] Shadle, Christine H.: The acoustics of fricative consonants, Massachusetts Institute of Technology, Cambridge, USA., PhD thesis, 1985 [238] Shadle, Christine H. ; Barney, Anna ; Davies, P.O.A.L.: Fluid ow in a dynamic mechanical model of the vocal folds and tract. I. Measurements and theory. In Journal of the Acoustical Society of America 105 (1999), pp. 444{455 [239] Shiller, D. M. ; Ostry, D. J. ; Gribble, P. L.: E ects of gravitational load on jaw movements in speech. In Journal of Neuroscience 19 (1999), pp. 9073{9080 [240] Shiller, D. M. ; Ostry, D. J. ; Gribble, P. L. ; Laboissiere, R.: Compensation for the e ects of head acceleration on jaw movement in speech. In Journal of Neuroscience 21 (2001), pp. 6447{6456 [241] Shiller, D. M. ; R., Laboissiere ; J., Ostry D.: The relationship between jaw sti ness and kinematic variability in speech. In Journal of Neurophysiology 88 (2002), pp. 2329{2340 [242] Shirai, Katsuhiko ; Honda, Masaaki: Estimation of articulatory motion from speech waves and its application for automatic recognition. In Spoken Language Generation and Understanding. D. Reidel Publishing Company, 1980, pp. 87{99 [243] Shiraki, Yoshinao ; Honda, Masaaki: LPC speech coding based on variable-length segment quantization. In IEEE Transactions of Acoustics, Speech and Signal Processing 36 (1988), pp. 1437{1444 [244] Sicher, H.: Oral Anatomy. 4th. Saint Louis, MO : Mosby, 1965 [245] Simo, J.C. ; Hughes, T.J.R: Computational Inelasticity. Springer, 1997 [246] Sinder, D. J. ; Krane, M. H. ; Flanagan, J. L.: Synthesis of unvoiced speech from an aeroacoustic source model. In Journal of the Acoustical Society of America (2005) [247] Sinder, Daniel J.: Speech Synthesis Using an Aeroacoustic Fricative Model, Rutgers University, New Jersey, USA, PhD thesis, 1999 [248] Sondhi, Man M.: Model for Wave Propagation in a Lossy Vocal Tract. In Journal of the Acoustical Society of America 55 (1974), May, No. 5, pp. 1070{1075 [249] Sondhi, Man M. ; Schröter, J•urgen: A hybrid time-frequency domain articulatory speech synthesizer. In IEEE Transactions on Acoustics, Speech and Signal Processing AU-21 (1987), pp. 955{967 155 [250] Sproat, Richard: Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. 
Kluwer Academic Publishers, 1997 [251] Stam, Jos: Stable uids. In Proc. Computer graphics and interactive techniques, 1999, pp. 121 { 128 [252] Stavness, Ian: Computational Modeling of Human Jaw Biomechanics, University of British Columbia, MS thesis, 2006 [253] Stavness, Ian ; Hannam, Alan ; Lloyd, John ; Fels, Sidney: Towards Predicting Biomechanical Consequences of Jaw Reconstruction. In International Conference of the IEEE Engineering in Medicine and Biology Society, 2008, pp. 4567{4570 [254] Stavness, Ian ; Hannam, Alan G. ; Lloyd, John E. ; Fels, Sidney: An Integrated Dynamic Jaw and Laryngeal Model Constructed From CT Data. In 3rd International Symposium on Biomedical Simulation (Springer LNCS) (2006) [255] Stevens, Kenneth N.: Acoustic Phonetics. Massachusetts Institute of Technology Press, Cambridge MA, 2000 [256] Stone, Maureen: Toward a model of three-dimensional tongue movement. In Journal of Phonetics 19 (1991), pp. 309{320 [257] Stone, Maureen ; Davis, E. ; Douglas, A. ; NessAiver, M. ; Gullapalli, R. ; Levine, W. S. ; Lundberg, A.: Modeling the motion of the internal tongue from tagged cine-MRI images. In Journal of the Acoustical Society of America 109 (2001), No. 6, pp. 2974{2982 [258] Stone, Maureen ; Davis, Edward P. ; Douglas, Andrew S. ; Aiver, Moriel N. ; Gullapalli, Rao ; Levine, William S. ; Lundberg, Andrew J.: Modeling Tongue Surface Contours From Cine-MRI Images. In J Speech Lang Hear Res. 44 (2001), pp. 1026{1040 [259] Stone, Maureen ; Lundberg, Andrew: Three-dimensional tongue surfaces from ultrasound images. In SPIE Proceedings Vol. 2709, 1996, pp. 168{179 [260] Story, Brad H. ; Titze, Ingo R.: Parameterization of vocal tract area functions by empirical orthogonal modes. In Journal of Phonetics 26 (1998), pp. 223{260 [261] Story, Brad H. ; Titze, Ingo R.: A preliminary study of voice quality transformation based on modi cations to the neutral vocal tract area function. In Journal of Phonetics 30 (2002), pp. 485{509 156 [262] Stratemann, S. ; Miller, A. ; Hatcher, D. ; Huang, J. ; Lang, T.: 3D Craniofacial Imaging: Airway and Craniofacial Morphology / University of California, San Francisco. 2007. { Technical report [263] Sueda, S. ; Pai, D. K.: Strand-based Simulation of Biomechanical Systems with Tendons. In Journal of Biomechanics (submitted) (2007) [264] Summerville, Ian: Software Engineering, 8th addition. Addison Wesley, 2006 [265] Svancara, P. ; Horacek, J. ; Pesek, L.: Numerical modeling of production of czech vowel /a/ based on FE model of vocal tract. In Proc ICVPB, 2004 [266] Takemoto, Hironori: Morphological Analysis of the Human Tongue Muscularture for Three-Dimensional Modeling. In jslhr 44 (2001), pp. 95{107 [267] Takemoto, Hironori ; Honda kiyoshi: Measurement of temporal changes in vocal tract area function during a continuous vowel sequence using a 3D CINE-MRI technique. In isss, 2003 [268] Tatham, Mark ; Morton, Kathrine: Developments in Speech Synthesis. John Wiley and Sons, 2005 [269] Teran, J. ; Sifakis, E. ; Blemker, S. ; Hing, Ng T. ; V., Lau ; C. ; Fedkiw, R.: Creating and Simulating Skeletal Muscle from the Visible Human Data Set. In IEEE TVCG (in press)., 2005 [270] Tergan, Sigmar-Olaf ; Keller, Tanja: Knowledge and information visualization : searching for synergies. Springer, 2005 [271] Thimm, G.: Tracking Articulators in X-ray Movies of the Vocal Tract,. In 8th Int. Conf. Computer Analysis of Images and Patterns, 1999 [272] Thirion, J. P.: Image matching as a di usion process: an analogy with Maxwell's demons. 
In Medical Image Analysis 2 (1998), No. 3, pp. 243{260 [273] Thirion, Jean-Philippe: Non-Rigid Matching Using Demons. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1996, pp. 245 [274] Tidwell, Jenifer: Designing Interfaces: Patterns for Effective Interaction Design. O'Reilly, 2005. { URL http://designinginterfaces.com 157 [275] Tiede, Mark K.: MRI Toolbox: A MATLAB-Based System for Manipulation of MRI Data / ATR, Kyoto, Japan. 1999 (TR-H-255). { Technical report [276] Titze, Ingo R. ; Story, Brad H.: Acoustic interactions of the voice source with the lower vocal tract. In Journal of the Acoustical Society of America 101 (1997), No. 4, pp. 2234 { 2243 [277] Toga, Arthur W. ; Thompson, Paul M.: Maps of the Brain. In The anatomical Record 265 (2001), pp. 37{53 [278] Toledo, Sivan ; Chen, Doron ; Rotkin, Vladimir: TAUCS - A library of Sparse Linear Solvers. 2001 [279] Treuille, Adrien ; Lewis, Andrew ; Popovic, Zoran: Model reduction for real-time uids. In ACM Trans. Graph. 25 (2006), No. 3, pp. 826{834. { ISSN 0730-0301 [280] Turner, Russell: Interactive Construction and Animation of Layered Elastic Characters, Swiss Federal Institute of Technology, Lausanne, PhD thesis, 1993 [281] Uz, B. ; Gueduekbay, Ugur ; Ozguc, Buelent: Realistic speech animation of synthetic faces. In Proceedings Computer Animation, 1998, pp. 111{18. { (accessed Aug. 10, 2002) [282] Valbret, H. ; Moulines, E. ; Tubach, J. P.: Voice transformation using PSOLA technique. In Speech Communications of the ACM 11 (1992), No. 2-3, pp. 175{187 [283] Vatikiotis-Bateson, E. ; Ostry, D. J.: An analysis of the dimensionality of jaw motion in speech. In Journal of Phonetics 23 (1995), pp. 101{117 [284] Verhelst, Werner ; Roelands, Marc: An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale Modi cation of speech. In Conference proceedings IEEE Acoustics, Speech, and Signal Processing(ICASSP-93), 1993, pp. 554{557 [285] Video-Based Tracking System: Vicon, Los Angeles, California. 1986. { URL http: //www.vicon.com [286] Vogt, Florian: Finite Element Modeling of the Tongue. In Int Workshop on Auditory Visual Speech Processing, 2005, pp. 143{144 [287] Vogt, Florian ; Chen, Timothy ; Hoskinson, Reynald ; Fels, Sidney S.: A Malleable Surface Touch Interface. In Proc. ACM SIGGRAPH Sketches, 2004 158 [288] Vogt, Florian ; Fels, Sidney S. ; Gick, Bryan ; Jaeger, Carol ; Wilson, Ian: Extensible Infrastructure for a 3D Face and Vocal-Tract Model. In Proc Int Congress of Phonetic Science (ICPhS), 2003, pp. 2345{2349 [289] Vogt, Florian ; Guenther, Oliver ; Hannam, Alan ; Doel, Kees van den ; Lloyd, John ; Vilhan, Leah ; Chander, Rahul ; Lam, Justin ; Wilson, Charles ; Tait, Kalev ; Derrick, Donald ; Wilson, Ian ; Jaeger, Carol ; Gick, Bryan ; Vatikiotis-Bateson, Eric ; Fels, Sidney: ArtiSynth Designing a Modular 3d Articulatory Speech Synthesizer. In Journal of the Acoustical Society of America 117 (2005), No. 4, pp. 2542 [290] Vogt, Florian ; Lloyd, John E. ; Buchaillard, Stephanie ; Perrier, Pascal ; Chabanas, Matthieu ; Payan, Yohan ; Fels, Sidney S.: Investigation of Ecient 3D Finite Element Modeling of a Muscle-Activated Tongue. In Springer LNCS 4072 (2006), pp. 19{28 [291] Vogt, Florian ; McCaig, Graeme ; Ali, Adnan ; Fels, Sidney S.: Tongue 'n' Groove. In Int Conf on New Interfaces for Musical Expression (NIME02), 2002, pp. 
60{63 [292] Vorländer, Michael: Untersuchungen zur Leistungsfähigkeit des raumakustischen Schallteilchenmodells, Reinisch-Westf•alische Technische Hochschule (RTWH), Aachen, Germany, PhD thesis, 1989 [293] Vries, Marinus P. de: A new voice for the voiceless Design and in-vitro testing of a voice- producing element, Rijksuniversiteit Groningen, PhD thesis, 2000 [294] Välimäki, V. ; Karjalainen, M.: Improving the Kelly-Lochbaum vocal tract model using conical tube sections and fractional delay ltering techniques. In Proc. Int. Conf. Spoken Language Processing vol 2, 1994, pp. 615{618 [295] Välimäki, Vesa: Discrete-Time Modeling of Acoustic Tubes Using Fractional Delay Filters, Faculty of Electrical Engineering, Helsinki University of Technology, PhD thesis, 1995 [296] Waters, Keith: A Muscle Model for Animating Three-Dimensional Facial Expression. In Proc. ACM SIGGRAPH vol 21, 1987, pp. 17{24 [297] Westbury, John R.: X-RAY Microbeam Speech Production Database User's Handbook / University of Wisconsin Madison, WI. URL http://www.medsch.wisc.edu/ubeam/, 1994. { Technical report [298] Whalen, D. H. ; Iskarous, Khalil ; Tiede, Mark K. ; Ostry, David J. ; Lehnert- LeHouillier, Heike ; Vatikiotis-Bateson, Eric ; Hailey, Donald S.: The Haskins 159 Optically Corrected Ultrasound System (HOCUS). In Journal of Speech, Language, and Hearing Research 48 (2005), pp. 543{553 [299] White, Frank: Fluid Mechanics. McGraw-Hill Book Company, 1999 [300] Wilhelms-Tricarico, Reiner: Physiological modeling of speech production: methods for modeling soft-tissue articulators. In Journal of the Acoustical Society of America 97 (1995), No. 5, pp. 3085{98 [301] Wilhelms-Tricarico, Reiner: A biomechanical and physiologically-based vocal tract model and its control. In jpho 24 (1996), pp. 23{28 [302] Wrench, Alan: MOCHA MultiCHannel Articulatory database: english / Queen Margaret University College, UK. URL http://data.cstr.ed.ac.uk/mocha/, 1999. { Technical report [303] Wu, Changshiann: Articulary Speech Synthesizer, University Of Florida, Department of Electrical & Computer Engineering, PhD thesis, 1996 [304] Yehia, Hani C. ; Tiede, Mark: A Parametric Three-Dimensional Model of the Vocal-Tract Based on MRI data. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’97), 1997, pp. 1619{1625 [305] Zienkiewicz, O.C. ; Taylor, R.L.: The finite element method. Oxford, 2000 160 Appendix A Authors Publications During my doctoral studies I demonstrated the impact of various aspects of this thesis work by the following publications: [306] Fels, S., Vogt, F., van den Doel, K., Lloyd, J., and Guenter, O. Artisynth: Towards realizing an extensible, portable 3d articulatory speech synthesizer. In Int Workshop on Auditory Visual Speech Processing (2005), pp. 119{124. [307] Fels, S., Vogt, F., van den Doel, K., Lloyd, J., Stavness, I., and Vatikiotis- Bateson, E. Artisynth: A biomechanical simulation platform for the vocal tract and upper airway. Tech. Rep. TR-2006-10, Computer Science Dept., Univ of British Columbia, 2006. [308] Fels, S. S., Lloyd, J. E., Stavness, I., Vogt, F., Hannam, A., and Vatikiotis- Bateson, E. Artisynth: A 3d biomechanical simulation toolkit for modeling anatomical structures. Journal of the Society for Simulation in Healthcare 2, 2 (2007), 148. [309] Fels, S. S., Lloyd, J. E., van den Doel, K., Vogt, F., Stavness, I., and Vatikiotis- Bateson, E. Developing physically-based, dynamic vocal tract models using artisynth. In Proc of ISSP (2006), pp. 419{426. [310] Fels, S. 
S., Vogt, F., Gick, B., Jaeger, C., and Wilson, I. User-centered design for an open source 3d articulatory synthesizer. In Proc Int Congress of Phonetic Science (ICPhS) (2003), pp. 179{182. [311] van den Doel, K., Vogt, F., English, R. E., and Fels, S. S. Towards articulatory speech synthesis with a dynamic 3d nite element tongue model. In Proc of ISSP (2006), pp. 59{66. [312] Vogt, F. Finite element modeling of the tongue. In Int Workshop on Auditory Visual Speech Processing (2005), pp. 143{144. 161 [313] Vogt, F., Fels, S. S., Gick, B., Jaeger, C., and Wilson, I. Extensible infrastructure for a 3d face and vocal-tract model. In Proc Int Congress of Phonetic Science (ICPhS) (2003), pp. 2345{2349. [314] Vogt, F., Guenther, O., Hannam, A., van den Doel, K., Lloyd, J., Vilhan, L., Chander, R., Lam, J., Wilson, C., Tait, K., Derrick, D., Wilson, I., Jaeger, C., Gick, B., Vatikiotis-Bateson, E., and Fels, S. Artisynth designing a modular 3d articulatory speech synthesizer. Journal of the Acoustical Society of America 117, 4 (2005), 2542. [315] Vogt, F., Lloyd, J. E., Buchaillard, S., Perrier, P., Chabanas, M., Payan, Y., and Fels, S. S. An ecient biomechanical tongue model for speech research. In Proc ISSP (2006), pp. 51{58. [316] Vogt, F., Lloyd, J. E., Buchaillard, S., Perrier, P., Chabanas, M., Payan, Y., and Fels, S. S. Investigation of ecient 3d nite element modeling of a muscle-activated tongue. Springer LNCS 4072 (2006), 19{28. [317] Vogt, F., McCaig, G., Ali, A., and Fels, S. S. Tongue 'n' Groove. In Int Conf on New Interfaces for Musical Expression (NIME02) (2002), pp. 60{63. Beyond my thesis focus, I investigated new ways of interaction with physical models and the image sensors to support those interactions. This work carried out during my studies resulted in the following publications: . [318] Cavens, D., Vogt, F., Fels, S. S., and Meitner, M. Interacting with the Big Screen: Pointers to Ponder. In Proc ACM CHI (2002), pp. 678{679. [319] Fels, S. S., and Vogt, F. Tooka: Explorations of two person instruments. In Proc Int Conf on New Interfaces for Musical Expression (2002), pp. 116{121. [320] Shen, C., Wang, B., Vogt, F., Oldridge, S., and Fels, S. S. Remoteeyes: A remote low-cost position sensing infrastructure for ubiquitous computing. In Proc Int Workshop on Networked Sensing Systems (INSS2004) (2004), pp. 31{35. [321] Shen, C., Wang, B., Vogt, F., Oldridge, S., and Fels, S. S. Remoteeyes: A remote low-cost position sensing infrastructure for ubiquitous computing. Trans of the Society of Instrument and Control Engineers E-S-1 (2005), 85{90. 162 [322] Stavness, I., Vogt, F., and Fels, S. Cubee: a cubic 3d display for physics-based interaction. In Proc. ACM SIGGRAPH Sketches (2006), p. 165. [323] Stavness, I., Vogt, F., and Fels, S. Cubee: thinking inside the box. In Proc. ACM SIGGRAPH Emerging technologies (2006), p. 5. [324] Vogt, F., Chen, T., Hoskinson, R., and Fels, S. S. A malleable surface touch interface. In Proc. ACM SIGGRAPH Sketches (2004). [325] Vogt, F., Wong, J., Fels, S. S., and Cavens, D. Tracking multiple laser pointers for large screen interactioninteracting with the Big Screen: Pointers to Ponder. In Extended Abstracts of ACM UIST (2003), pp. 95{96. [326] Vogt, F., Wong, J., Po, B. A., Argue, R., Fels, S. S., and Booth, K. S. Exploring collaboration with group pointer interaction. In Proc of Computer Graphics Int (CGI2004) (2004), pp. 636{639. 
Appendix B
Researcher Feedback

Since this research work is focused on a community-based approach with user-centered development, many people gave feedback on this project. Some of them are listed in Table B.1 below, roughly organized by their main area of impact. The following questionnaire was used in feedback sessions with these researchers and practitioners.

Area                 Researcher (affiliation)
Dentistry            Alan Hannam (UBC), Arthur Miller (U San Francisco)
Medical simulation   Yohan Payan (TIMC), Francois Faure (INRIA)
Acoustics modeling   Hani Yehia (UFMG), Doug Whalen (Yale), Perry Cook (Princeton)
Speech modeling      Pascal Perrier (ICP), Ken-Ichi Sakakibara (NTT), Masaaki Honda (Waseda U), Gordon Ramsay (Haskins)
Speech analysis      Bryan Gick (UBC), Kiyoshi Honda (ATR), Philip Rubin (Haskins)
Face modeling        Eric Vatikiotis-Bateson (UBC), Demetri Terzopoulos (UT), Takaaki Kuratate (ATR), Matthieu Chabanas (ICP), Michael Cohen (UCSC)
Numerical analysis   Chen Greif (UBC), Olaf Schenk (UBasel)
Medical imaging      Maureen Stone (UMaryland), Olov Engwall (KTH), Mark Tiede (Haskins Labs), Shri Narayanan (USC)
Motor control        Dinesh Pai (UBC), David Ostry (McGill), Jianwu Dang (ATR)

Table B.1: A selection of the researchers who gave feedback on this work.

B.1 Questionnaire

Name:    Affiliation:    Email:    Date:

Please circle or underline the statements that apply to you. Feel free to give more detailed comments on the back of the page. Thank you.

1. What describes your role(s) best in the area of articulatory speech synthesis? Researcher, modeler, practitioner, educator, developer, end-user, or other (please state):
2. Which research area(s) are you most closely affiliated with? Linguistics, dentistry, facial animation / talking heads, surgical simulation, speech production, robotics, physics-based animation, or other (please state):
3. What task(s) are you investigating? Speaking, swallowing, chewing, nonverbal communication, audio-visual integration, disorders, individual behavior, average behavior, or other (please state):
4. What types of modeling / analysis methods do you use? Acoustics, biomechanics, parametric, mass-spring, finite-element, rigid-body, static geometry, statistical models, or other (please state):
5. What tools are you using for modeling / analysis? MATLAB and toolboxes, finite element package, acoustic analysis such as Praat, or other (please state):
6. How are the modeling / analysis tools you use distributed? Not shared, open source, commercial, in-house custom software, shared with key collaborators, or other (please state):
7. How many software developers do you have on your team on average? ( ) # of dedicated software developers, ( ) # of research assistants / undergraduate students, ( ) # of graduate students, ( ) # of researchers.
8. Which programming languages or graphical tools are your developers fluent in? None; mostly GUI applications, graphical programming such as SIMULINK or MAX/MSP, object-oriented programming (Java/C++ etc.), scripting languages (MATLAB, Python, Basic), or other (please state):
9. What data processing and analysis tools are you using? MATLAB, 3D model editor (Blender, 3DMax, Rhino, Maya, other), image processing and visualization software (Amira, ITK, VTK, ImageJ), or other (please state):
10. Do you acquire your own subject data and/or are you using existing data sets? What type of subject data? MR, CT, ultrasound, audio, EMG, video, EMA, EPG, magnetic or optical tracker, or other (please state):
11. How are the subject data and models you use distributed? Not shared, open source, commercial, shared with key collaborators, or other (please state):
12. What features are essential in ArtiSynth in order for you to use it in your research? Open source, free of charge, professional programming support, cross-platform ( ), compatible with software ( ), graphical user interface, application programming interface, community support, feature ( ), modeling methods ( ).
13. Other comments or suggestions?"""@en ; edm:hasType "Thesis/Dissertation"@en ; vivo:dateIssued "2009-05"@en ; edm:isShownAt "10.14288/1.0067236"@en ; dcterms:language "eng"@en ; ns0:degreeDiscipline "Electrical and Computer Engineering"@en ; edm:provider "Vancouver : University of British Columbia Library"@en ; dcterms:publisher "University of British Columbia"@en ; dcterms:rights "Attribution-NonCommercial-NoDerivatives 4.0 International"@en ; ns0:rightsURI "http://creativecommons.org/licenses/by-nc-nd/4.0/"@en ; ns0:scholarLevel "Graduate"@en ; dcterms:title "Towards an interactive framework for upper airway modeling : integration of acoustic, biomechanic, and parametric modeling methods"@en ; dcterms:type "Text"@en ; ns0:identifierURI "http://hdl.handle.net/2429/7799"@en .