The Voice Box: A Fast Coupled VocalFold Model for Articulatory SpeechSynthesisbyArvind VasudevanB.Tech, Electrical and Electronics Engineering, S.R.M. University, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF APPLIED SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Electrical and Computer Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)July 2017c© Arvind Vasudevan 2017AbstractSpeech is unique to human beings as a means of communication and manyefforts have been made towards understanding and characterizing speech.In particular, articulatory speech synthesis is a critical field of study as itworks towards simulating the fundamental physical phenomena that under-lines speech. Of the various components that constitute an articulatoryspeech synthesizer, vocal fold models play an important role as the sourceof the acoustic simulation. A balance between the simplicity and speed oflumped-element vocal fold models and the completeness and complexity ofcontinuum-models is required to achieve time-efficient high-quality speechsynthesis. In addition, most models of the vocal folds are seen in a vacuumwithout any coupling to the vocal tract model.This thesis aims to fill these lacunae in the field through two majorcontributions. We develop and implement a novel self-oscillating vocal-foldmodel, composed of an 1D unsteady fluid model loosely coupled with a2D finite-element structural model. The flow model is capable of handlingirregular geometries, different boundary conditions, closure of the glottisand unsteady flow states. A method for a fast decoupled solution of the flowequations that does not require the computation of the Jacobian matrix isprovided. The simulation results are shown to agree with existing data inliterature, and give realistic glottal pressure-velocity distributions, glottalwidth and glottal flow values. In addition, the model is more than orderof magnitude faster than comparable 2D Navier-Stokes fluid solvers whilebetter capturing transitional flow than simple Bernoulli-based flow models.Secondly, as an illustrative case study, we implement a complete articu-latory speech synthesizer using our vocal fold model. This includes bothlumped-element and continuum vocal fold models, a 2D finite-differencetime-domain solver of the vocal tract, and a 1D tracheal model. A clear workflow is established to derive model components from experimental data oruser-specified meshes, and run fully-coupled acoustic simulations. This leadsto one of the few complete articulatory speech synthesizers in literature anda valuable tool for speech research to run time-efficient speech simulations,and thoroughly study the acoustic outcomes of model formulations.iiLay AbstractBuilding a speech synthesizer that can generate natural-sounding high qual-ity speech has been a long-term goal. One of the critical components ofthese synthesizers is a model of the vocal folds, colloquially known as thevocal cords. These are two fleshy slits inside our larynx that vibrate quasi-periodically, and act as the sound source for speech. This buzzing sound(called phonation) then propagates through our vocal tract, which can takedifferent shapes, giving rise to speech sounds at the outlet of our mouth.Most previous models either made coarse approximations of the complexvocal fold structures (as rectangular interconnected masses) or were mathe-matically intricate and computationally difficult (taking days to simulate onesecond of speech). In this thesis, we implement and validate a 2-dimensionalvocal fold model combined with an 1-dimensional airflow model that aimsto find a balance between complexity and computational expense for speechsynthesis applications.iiiPrefaceThis thesis was part of the Oral, Pharyngeal and Laryngeal Complex (OPAL)project.Most of the contributions and results described in Chapter 3 have beenpreviously presented in the publication [P1]. I was the main author andcontributor to the design, implementation and testing of the vocal fold modeldescribed in the [P1], under the supervision of Dr. Sidney Fels. Dr. VictorZappi assisted with the implementation of the 2D finite-difference time-domain (FDTD) simulation that was used for the vocal tract coupling. Dr.Peter Anderson helped with the validation of the 1D unsteady fluid modelthat was used to drive the vocal folds.Chapter 4 has been partially published in the literature [P2]. I developedimplementations of two exemplar lumped-element vocal fold models, thetwo-mass and body-cover model, on the Graphical Processing Unit shader.This enabled coupling with the real-time GPU-based vocal tract solver de-signed by Dr. Victor Zappi, who is the primary author of this paper. Dr.Andrew Allen and Dr. Nikunj Raghuvanshi assisted with the GPU-basedimplementation of the 2D wave-equation solver.Peer-Reviewed Conference Paper Accepted forPublication[P1] Vasudevan A, Zappi V, Anderson P, Fels S, 2017. A Fast Robust 1DFlow Model for a Self-Oscillating Coupled 2D FEM Vocal Fold Simulation.INTERSPEECH. (Accepted)Journal Manuscripts Accepted for Publication[P2] Zappi V, Vasudevan A, Allen A, Raghuvanshi N, Fels S, 2067. To-wards real-time two-dimensional wave propagation for articulatory speechsynthesis. Proceedings of Meetings on Acoustics (POMA). (Accepted)ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . xivDedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Background and Previous Work. . . . . . . . . . . . . . . . . 42.1 Overview of Human Phonation . . . . . . . . . . . . . . . . . 52.1.1 Physiology of the Human Airway . . . . . . . . . . . 62.1.2 Vocal Fold Structure . . . . . . . . . . . . . . . . . . 72.1.3 Theories of Phonation . . . . . . . . . . . . . . . . . . 82.2 Lumped-Element Models of the Vocal Folds . . . . . . . . . 102.2.1 One-Mass Models . . . . . . . . . . . . . . . . . . . . 112.2.2 Multi-Mass Model . . . . . . . . . . . . . . . . . . . . 122.2.3 Body-Cover Models . . . . . . . . . . . . . . . . . . . 152.3 Continuum Models of the Vocal Folds . . . . . . . . . . . . . 162.3.1 Eigen-Analysis Models . . . . . . . . . . . . . . . . . 172.3.2 Self-Oscillating Continuum Models . . . . . . . . . . 17vTable of Contents2.4 Model Coupling . . . . . . . . . . . . . . . . . . . . . . . . . 222.4.1 Combination of Structural and Flow Models . . . . . 222.4.2 Collision Handling . . . . . . . . . . . . . . . . . . . . 242.4.3 Vocal-Tract Coupling . . . . . . . . . . . . . . . . . . 252.5 Articulatory Synthesizers . . . . . . . . . . . . . . . . . . . . 272.5.1 Structural Models . . . . . . . . . . . . . . . . . . . . 272.5.2 Flow/Acoustic Models . . . . . . . . . . . . . . . . . 302.6 Data Acquisition and Measurement . . . . . . . . . . . . . . 332.6.1 In-Vivo Studies . . . . . . . . . . . . . . . . . . . . . 332.6.2 Excised Laryngeal Studies . . . . . . . . . . . . . . . 352.6.3 Synthetic Laryngeal Studies . . . . . . . . . . . . . . 352.6.4 Vocal Tract Measurements . . . . . . . . . . . . . . . 362.7 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . 373 2D Continuum Model with 1D Flow Model . . . . . . . . . 383.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 383.2 Structural Model . . . . . . . . . . . . . . . . . . . . . . . . . 403.2.1 Numerical Implementation Procedure . . . . . . . . . 433.3 1D Fluid Model . . . . . . . . . . . . . . . . . . . . . . . . . 433.3.1 Model Formulation . . . . . . . . . . . . . . . . . . . 433.3.2 Numerical Solution Framework . . . . . . . . . . . . . 453.3.3 Fluid-Structure Coupling . . . . . . . . . . . . . . . . 463.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493.4.1 Fluid Model Validation . . . . . . . . . . . . . . . . . 493.4.2 Flow-Separation Experiment . . . . . . . . . . . . . . 523.4.3 Coupled Vocal Fold Simulation . . . . . . . . . . . . . 533.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614 Coupled Articulatory Synthesizer . . . . . . . . . . . . . . . 634.1 Model Formulation . . . . . . . . . . . . . . . . . . . . . . . 634.1.1 Vocal Tract Biomechanics . . . . . . . . . . . . . . . 634.1.2 Vocal Tract Acoustics . . . . . . . . . . . . . . . . . . 654.1.3 Trachea Model . . . . . . . . . . . . . . . . . . . . . . 684.1.4 Alternate Vocal Fold Models . . . . . . . . . . . . . . 704.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704.3 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . 76viTable of Contents5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 825.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . 84Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85viiList of Tables2.1 Summary of components in FRANK and component types.Rigid = rigid body, FE = finite element. Adapted from [112] 303.1 A list of the different material properties used for the vocalfold model. Parameters derived from [3][9] . . . . . . . . . . . 423.2 A list of the fluid properties used for the vocal fold model.Parameters derived from [3][9][12] . . . . . . . . . . . . . . . . 453.3 Performance of decoupled solver vs coupled solver. Simu-lation conducted for a time period of 2s with ∆t = 0.02s.Length of channel is 0.6m. . . . . . . . . . . . . . . . . . . . 523.4 A summary of the parameters used for the solution of theanalytical problem. Parameters derived from [12] . . . . . . . 533.5 A comparison of peak centerline velocities and subglottal pres-sures of the model presented in the paper (Vp, Psp) and fromAlipour et al [9] (Va, Psa). Note that the values from litera-ture are estimates from published graph data . . . . . . . . . 594.1 A list of the model parameters used for the 2D FDTD simulation 694.2 A comparison of the maximum and minimum output normal-ized pressures predicted by different vocal fold models . . . . 75viiiList of Figures2.1 Basic Anatomy of the Human Airway: Saggital view of thehead (left) and enlarged coronal view of the laryngeal complex(right), cWikimedia Commons, adapted from [102] . . . . . . 62.2 Coronal (left) and superior (right) view of the laryngeal com-plex (right), cWikimedia Commons, adapted from [102] . . . 72.3 Layers of the vocal fold, c©Wiley Publishing, Gick et al [51] . 82.4 Composition of the Vocal Folds according to Body-Cover the-ory [59] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.5 Idealized shapes of the vocal fold during a single cycle ofvibration. Note that the lower part of the vocal fold leads theupper part and the vertical travelling wave on the vocal foldsurface. Reproduced with permission from Story, [105] . . . . 102.6 Coronal view of the two-mass model. Reproduced with per-mission from Peter Birkholz, [26] . . . . . . . . . . . . . . . . 112.7 Coronal (left) and superior (right) view of the vocal folds withquantities highlighted. Ishizaka et al, [64] . . . . . . . . . . . 122.8 Bernoulli flow in the two-mass model in two cases a) Bernoulliflow in lower region and jet flow in upper region b) jet flowin both lower and upper regions . . . . . . . . . . . . . . . . . 132.9 Pseudo 3D view showing the inclined masses of the modi-fied two-mass model that can represent the triangular glottalopening, reproduced with permission from Birkholz et al [25] 152.10 Coronal view of the body-cover model of the vocal folds, re-produced with permission from Birkholz et al [25] . . . . . . 162.11 3D view of FEM model by De Vries et al, reproduced withpermission from [38] . . . . . . . . . . . . . . . . . . . . . . . 182.12 Trajectories of nodes in layer-8 of quasi-3D FEM model, re-produced with permission from [3] . . . . . . . . . . . . . . . 19ixList of Figures2.13 Visual depiction of the different combination of structural andflow models in the literature. Horizontal axis represents in-creasing spatial complexity from left to right. Vertical axisrepresents increasing flow complexity from bottom to top . . 232.14 View of the idealized 2D vocal tract shape for Czech vowel[a:] coupled with the 2D vocal tract model, c©Springer, re-produced with permission from Hajek et al [53] . . . . . . . . 262.15 Components of an Articulatory Synthesizer . . . . . . . . . . 282.16 2D midsaggital anatomical model of the vocal tract, repro-duced with permission from Texeira et al [114] . . . . . . . . 292.17 Biomechanical vocal tract model used to calculate the vocaltract shape, VocalTractLab, reproduced with permission fromBirkholz et al [24] . . . . . . . . . . . . . . . . . . . . . . . . . 292.18 FRANK: a Functional Reference ANatomical Knowledge [13]a) midsaggital view of the components b) hard componentsof the model c) soft components of the model . . . . . . . . . 312.19 High-speed laryngoscopy images (above) synced with EGGdata (below) for one cycle of vocal fold vibration, c©WileyPublishing, Gick et al [51] . . . . . . . . . . . . . . . . . . . . 343.1 Finite element mesh shown in the coronal plane. This in-cludes the body (dark grey), the ligament (white) and thecover (light grey) regions . . . . . . . . . . . . . . . . . . . . . 413.2 2D continuum vocal fold model in the coronal plane. Thevocal folds are assumed symmetric with the shaded out vocalfold (left) not simulated. The fluid model is calculated alongthe glottal centre-line . . . . . . . . . . . . . . . . . . . . . . . 473.3 Sample area warping for fluid model. Asafe is used as thearea function for the 1D fluid simulation to ensure stability. . 483.4 Mesh refinement study where numerical error p∗ is plotted asa function of the factor αmin for three levels of mesh quality.Coarse mesh (blue), Medium mesh (red) and Fine mesh (yellow) 513.5 Flow-separation prediction of the flow model when inferiorend of the vocal folds is fixed and superior is allowed to oscil-lated. The filled dots represent the point of flow separation,assuming flow from left to right . . . . . . . . . . . . . . . . . 543.6 Vocal fold model shapes in one cycle of vibration. Fundamen-tal frequency of oscillation is 146 Hz. Time difference betweeneach time step is approximately 0.57 ms . . . . . . . . . . . . 55xList of Figures3.7 Typical glottal waveforms including glottal volume flow (top)and subglottal pressure (bottom). . . . . . . . . . . . . . . . . 563.8 Selected vocal fold frames for the convergent, neutral anddivergent glottal shapes . . . . . . . . . . . . . . . . . . . . . 563.9 Centerline velocity predictions for the convergent, neutral anddivergent glottal shapes . . . . . . . . . . . . . . . . . . . . . 573.10 Centerline pressure predictions for the convergent, neutraland divergent glottal shapes . . . . . . . . . . . . . . . . . . . 584.1 Coupled articulatory synthesizer flowchart . . . . . . . . . . . 644.2 Area function representation of vocal tract . . . . . . . . . . . 654.3 Concatenated tube model used for 1D speech synthesis . . . . 664.4 Framework for biomechanically driven articulatory speech syn-thesis. The vocal tract airway mesh is extracted from theFRANK model over time and used as the domain for theacoustic simulation. It is coupled with the trachea model andvocal fold model to create a complete articulatory synthesizer. 664.5 Area function representation of trachea with concurrent vocaltract model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694.6 2D vocal tract domain for vowel /a/. The boundary includes6 Perfectly Matched Layers (PML) for absorption. The blackdot represents the listener and the left end of the symmetrictube contains the glottal inputs and feedback pressure cells. . 714.7 2D vocal tract domain for vowel /i/. The boundary includes6 Perfectly Matched Layers (PML) for absorption. The blackdot represents the listener and the left end of the symmetrictube contains the glottal inputs and feedback pressure cells. . 714.8 Normalized output pressure for a coupled 2D continuum vocalfold model used as glottal source to 2D FDTD simulation forvowel shape /a/. . . . . . . . . . . . . . . . . . . . . . . . . . 724.9 Normalized output pressure for a uncoupled 2D continuumvocal fold model used as glottal source to 2D FDTD simula-tion for vowel shape /a/. . . . . . . . . . . . . . . . . . . . . . 734.10 Normalized output pressure for a coupled body cover modelused as glottal source to 2D FDTD simulation for vowel shape/a/. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744.11 Normalized output pressure for a coupled two-mass modelused as glottal source to 2D FDTD simulation for vowel shape/a/. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75xiList of Figures4.12 Output pressure for a 3D FEM acoustic simulation for vowelshape /a/. Reproduced from [32] . . . . . . . . . . . . . . . . 765.1 Components of an Articulatory Synthesizer . . . . . . . . . . 82xiiList of Abbreviations1D: One Dimensional2D: Two Dimensional3D: Three DimensionalBC/BCs: Boundary Condition/Boundary ConditionsCFD: Computational Fluid DynamicsCFL: Courant—Friedrich—Levy ConditionCSA: Cross Sectional AreaCT: Computed TomographyEGG: ElectroglottographyFEM: Finite Element Method/ Finite Element ModelFVM: Finite Volume MethodFFT: Fast Fourier TransformFSI: Fluid Structure InteractionMRI: Magnetic Resonance ImagingOSA: Obstructive Sleep ApneaPa: Pascal (unit of pressure)UA: Upper AirwayVF: Vocal FoldsVT: Vocal TractxiiiAcknowledgementsI express my deepest gratitude to my supervisor, Dr. Sidney Fels; this the-sis would have just remained an idea without his guidance, mentorship andabove all, patience. His belief in my ability and the academic freedom thathe provided, are aspects of my Master’s that I will always cherish. I havelearned valuable lessons on staying hungry, being humble and having a senseof humour that I will take forward in my career; it has been an honour work-ing with you and learning from you.I’ve had the pleasure of working with a number of fine researchers overthe course of this degree, foremost being Dr. Victor Zappi, who was friend,guide and all-round amazing co-worker. I truly enjoyed the times we spentbreaking our heads over the most intractable of problems, and the lessonsI learned about perseverance and perspective from them. Many thanks toDr. Peter Anderson, who apart from being an extremely distinguished re-searcher, ranks among the nicest people I’ve met.The great memories I take of this degree is due, in part, to my fellow col-leagues at the Human Communication Technologies (HCT) Lab and theOPAL modelling group. I thank them for the great conversation and evenbetter company; they made every day (and some nights!) at the lab, trulyenjoyable.I’m lucky to have some truly amazing friends and family who’ve given meincredible love and support through my degree; you guys are truly one of akind. Finally, last but definitely not least, what more can I say about myparents except that, this isn’t my achievement so much as it’s ours! Thankyou for everything.xivDedicationTo my parents, for being youTo science, for being magicalxvChapter 1IntroductionSpeech is a form of communication that is unique to human beings, and hasbeen a topic of intensive study for many centuries. Through the years signif-icant efforts have been made towards building systems for speech recognitionand more importantly, speech synthesis. Of the different methods that havebeen proposed for speech synthesis, articulatory speech synthesis is one ofthe most challenging and promising. Articulatory synthesis attempts tosimulate the physiological processes and physics that occur in the humanbody to generate speech output. This involves creating anatomically accu-rate models of the upper airway, a biomechanical simulation of articulationmovements and an acoustic model representing the air wave production andpropagation through the vocal tract.The flow of pressurized air from the lungs to the mouth-opening gen-erates sound and, as a consequence, speech. To simulate speech, a criticalcomponent that all articulatory synthesizers require is a glottal model thatacts as the excitation source to the acoustic simulation. The glottis is theslit-like opening between the two fleshy vocal folds within the larynx. Thepassing of air from the trachea into the upper airway is controlled by thevocal folds that vibrate periodically in a process known as phonation. Thisis the source of sound into the vocal tract, and it’s position within the larynxgives the latter its colloquial name, the voice box.Initially, these models were based on the linear source-filter theory thatposits that speech production is a two-step process: a sound-source is gener-ated from the glottis, and articulators in the vocal tract shape this waveformsimilar to a resonant filter [42]. However, the underlying assumption thatthis model is predicated on, that the source and filter are independent ofeach other, has been challenged in recent work. It is now well-establishedthat the sound-source from the glottis and filtering by the vocal-tract artic-ulators are non-linearly coupled [119]. Thus, improved glottal models thatcan capture this phenomenon better are critical to achieving high qualityarticulatory speech synthesis.There are a multitude of vocal fold models in the literature, implementedwith different combinations of structural and flow models. Of these, there11.1. Contributionsare two main classes of models: lumped-element and continuum models.Lumped-element models are conceptually and computationally simple, rep-resenting the vocal fold structure through lumped-masses interconnectedthrough springs. Continuum models on the other hand, solve the internalcontinuum mechanics of the vocal fold structure through numerical methodssuch as the finite-element method. This provides a rich and complex charac-terization of vocal fold vibration, but are computationally extremely expen-sive and are often unstable when run as part of a larger acoustic simulation.Thus, articulatory synthesizers have been forced to using lumped-elementmodels, despite the possibilities of higher-quality synthesis and physical re-alism offered by continuum models. A balance between the simplicity andspeed of the lumped-element vocal fold models and the completeness andcomplexity of continuum models is required for speech synthesis.Equally, surveys of the field have noted that while multiple models ex-ist, each with their unique model formulations, there is a lack of under-standing of how these modelling decisions translate to acoustic outcomes[26]. Two issues in particular have exacerbated this problem: firstly, mostmodels are built using proprietary or commercial finite-element modellingtoolkits which makes it extremely difficult for speech researchers to comparemodels. Secondly, many of the continuum vocal fold models can take daysto simulate a short portion of acoustic output. When combined with anacoustic solver for the vocal tract, this makes it practically impossible forresearchers to run multiple simulations and do comparative analyses. Thepresent thesis investigates a solution to these issues through two main path-ways: firstly, through developing and implementing a novel self-oscillatingvocal fold model that attempts to find a balance between computationalcost and complexity. This model is then used to drive a 2D finite-differencetime-domain (FDTD) simulation of the vocal tract, as part of a coupledstable acoustic simulation to demonstrate the feasibility of the model as apotential tool for speech researchers.1.1 ContributionsThis thesis targets a specific lacunae in the field of articulatory speech syn-thesis, and vocal fold modelling in particular. The contributions of the thesisare given below:21.1. ContributionsImplemented and validated a time-efficient vocal fold modelfor articulatory speech synthesis• We identified that fluid simulation of glottal flow remains a major road-block towards creating time-efficient and complex continuum modelsof the vocal folds for articulatory speech synthesis.• We put forward a novel solution framework of the 1D unsteady Navier-Stokes equations, for a time-efficient decoupled numerical solution.This solution does not required the computation of the Jacobian, andis faster than the coupled simulation as well as and other 2D Navier-Stokes models.• A novel vocal fold model was built comprising of a 2D finite-elementbased structural model, loosely coupled with the 1D unsteady flowmodel.• The model performance was compared with published literature andexperimental data, to validate its efficacy for a set of standard tests.Demonstrated the potential of the proposed model inbuilding an articulatory speech synthesizerWe demonstrated the feasibility of our vocal fold model by building anillustrative articulatory speech synthesizer. The synthesizer included thevocal fold model, a 2D finite-difference time-domain vocal tract acousticsolver and a model of the trachea. A sample framework is included todrive the model using different geometries. Coupled acoustic simulationsare performed for two vowel shapes derived from literature as an illustrativeexample.3Chapter 2Background and PreviousWorkWith the increase in simulation and computational capabilities, articulatoryspeech synthesis has once again become an active area of research. Througharticulatory speech synthesis, researchers can gain insight into a more funda-mental understanding and characterization of speech that no other methodoffers [90]. Articulatory synthesis also allows for the possibility of buildingnatural looking graphical talking-head models, where the biomechanics candrive the acoustics of the system [23]. Finally, better models of the humanbody for speech will provide clinicians with insights to help predict patholo-gies and functional outcomes of surgical interventions [133] [116]. For thesereasons, articulatory speech synthesis is a challenging, but promising, areaof research.This chapter provides a bird’s-eye view of the field, reviewing advancesin vocal fold models in the context of developments in articulatory synthesis.Section 2.1 provides a self-contained overview of the physiology of humanphonation and a detailed description of the exemplar theories of phonation.As these theories facilitated a more thorough understanding of the processof human phonation, many models of the vocal folds were created. Thesestarted with lumped-element models that represented the vocal fold struc-ture through masses interconnected with springs and dampers. These massesare driven by airflow to give rise to the oscillation that is characteristic ofvocal fold vibration. Current advancements in computational capabilitieshave made it possible to create more complex models that solve the contin-uum mechanics of the vocal fold structure and the airflow driving it. Thesemodels give rise to Partial Differential Equations (PDE’s), that are solvedusing numerical techniques such as the Finite-Element Method (FEM) orFinite-Volume Method (FVM).Section 2.2 reviews the wealth of lumped-element models and identi-fies their strengths, limitations and areas of improvements. The followingSection 2.3 does the same for continuum models solved using PDE’s. Forboth groups of models, the different types of flow models used to drive the42.1. Overview of Human Phonationstructural models are highlighted and discussed. While the sheer numberof models are overwhelming, there is a clear pattern to the model combina-tions that have made in literature. This is explored in section 2.4, whichisolates exemplar structural and flow models. By looking at these models inisolation, we can better understand what gap exists in models of the fieldand how we can potentially fill it. A particularly interesting sub-problemin the field is the handling of collisions which is covered in subsection 2.4.2Finally, subsection 2.4.3 talks about the coupling between the vocal foldsand vocal tract, and how this is handled by models in literature.The ultimate goal of speech mechanics is to generate sound. Building anarticulatory speech synthesizer requires many different model components towork together in tandem: biomechanical models to represent the anatomy ofthe vocal tract and movement of articulators, acoustic solvers for the vocaltract and a glottal excitation to act as the source to the simulation. Whilethere are a range of individual components, complete articulatory synthe-sizers are rare due to the complexity inherent in building an entire system.Section 2.5 provides an overview of the different acoustic and biomechanicalmodels in literature, and reviews the current trend in articulatory synthesis.Finally, advancements in data acquisition techniques have enabled re-searchers to gain improved knowledge on observing and quantifying articu-latory processes. As a thumb rule, experimental measurements of the glottisare difficult due to the general inaccessibility of the vocal folds. This requiresspecialized instrumentation, or even fabrication studies in lieu of direct mea-surements. Medical imaging methods such as Magnetic Resonance Images(MRI) and Computed Tomography (CT) scans have enabled clearer visualdepiction of oropharyngeal structures in general, that complement otheracoustic recordings. Section 2.6 gives an insight into the experimental datain literature and the methods used to obtain them.2.1 Overview of Human PhonationHuman phonation is a complex multi-step process that requires an under-standing of the anatomy of the human airway and the physics that leads tophonation. In the following sections, we shall briefly look at the anatomyof the upper airway, with specific attention paid to the vocal folds. Thenwe shall understand the various theories of phonation, and what lessons welearn from them in designing better models of the vocal folds.52.1. Overview of Human PhonationFigure 2.1: Basic Anatomy of the Human Airway: Saggital view of the head(left) and enlarged coronal view of the laryngeal complex (right), cWiki-media Commons, adapted from [102]2.1.1 Physiology of the Human AirwayFigure 2.1 details the basic physiological structure of the human airway.The periodic expansion and contraction of the lungs leads to the expellingof pressurized air through the trachea, also known as the windpipe. Atthe end of the trachea is the larynx which connects it to the vocal tract(highlighted in blue in figure 2.1) and contains the vocal folds (commonlyknown as the vocal cords). The vocal tract splits at the soft palate to leadto the nasal tract above (highlights in dotted blue lines in figure 2.1). Thevocal fold regulates the flow of air from the trachea to the vocal tract in aperiodic manner. The upper airway contains multiple articulators that canbe individually controlled to take up multiple poses; this modifies the soundfrom the glottis.The vocal folds are contained inside the larynx complex (highlighted indotted red lines in figure 2.1), which includes various adjustable muscles,ligaments and cartilage. The larynx is suspended between the trachea be-low and pharyngeal complex leading to the vocal tract above [51]. It isimportant to note that the glottis forms a constriction in the airway tube.The ventricular folds, also known as the false vocal folds, can also be seenabove the true vocal folds. They are two thick folds of mucosal membrane62.1. Overview of Human PhonationFigure 2.2: Coronal (left) and superior (right) view of the laryngeal complex(right), cWikimedia Commons, adapted from [102]with narrow bands of fibrous tissues [102]. They generally do not play amajor role in primary phonation, but can be used for phonation in certainforms of singing and for patients who lose the ability to phonate with theirtrue vocal folds.2.1.2 Vocal Fold StructureFigure 2.2 shows a closeup coronal view and superior view of the larynx,with the glottis between the vocal fold highlighted. The vocal folds can begenerally assumed to be symmetric about the glottal mid-line (highlightedin the figure 2.2); this approximation is not valid in the case of vocalfold pathologies that causes polyps to form on individual vocal folds. Thevocal folds can be adducted (brought together) and abducted (taken apart),by the muscles of the larynx. The latter creates a triangular slit openingthat is called the glottis. The vocal folds are attached posteriorily to thearytenoid cartilages and anteriorily to the thyroid cartilage. The vocal foldsare stretched horizontally from back to front across the larynx, similar toa rubber band. Thus, when air flows over its surface after it is expelledfrom the lungs, it vibrates similar to a stretched membrane. This rhythmicopening and closing of the vocal folds create the buzzing sound known asphonation.72.1. Overview of Human PhonationFigure 2.3: Layers of the vocal fold, c©Wiley Publishing, Gick et al [51]Hirano et al showed that the vocal fold is comprised of multiple layers,each with unique structural and material properties [59]. As a general ruleof thumb, as we go from superficial outermost layers of the vocal folds to thecore, the material becomes less flexible and more rigid. Figure 2.3 shows thecoronal view of the internal structure of the vocal folds. Medially to laterally,the vocal fold is made of multiple layers including the epithelium and thelamina propia before leading to the vocalis muscle. The lamina propia itselfcan be divided into three constituent parts: superficial, intermediate anddeep. Hirano et al also put forward the Body-Cover Theory that classifiedthe layers into two main groups: the cover and the body. Figure 2.4 explainsthe correlation between the layers to the cover-body theory. The loose coverenables the propagation of the vertical travelling wave that is critical toself-sustained vocal fold oscillation and phonation. The cross-section of thevocal folds varies in the anterior-posterior direction and is not uniform.2.1.3 Theories of PhonationThe process of phonation has fascinated speech researchers from the late18th century, with a range of theories proposed to explain the self-oscillation82.1. Overview of Human PhonationFigure 2.4: Composition of the Vocal Folds according to Body-Cover theory[59]of the vocal folds. The most commonly accepted theory today is the Myoe-lastic Aerodynamic theory of Phonation [125], which was augmented by theBody-Cover Theory that was previously mentioned. The former theory is aamalgamation of two concepts: the myoelastic nature of the vocal folds, cou-pled with the aerodynamics of the flow passing through the glottis. With thecompression of the lungs, air flows through the trachea to the closed glottis.This continuously increases the subglottal pressure (Psg) that acts againstthe bottom surface of the vocal folds. At a certain threshold pressure calledthe onset pressure, the subglottal pressure overcomes the elastic propertiesof the vocal folds, and forces the vocal folds open.It is important to note here that the glottis still represents a constrictionin the upper airway. Bernoulli’s principle along with the laws of conservationof mass and conservation of energy tells us that the airflow through a con-striction is faster, with an associated drop in pressure inside the constriction.This negative air pressure difference combined with the elastic nature of thevocal folds pulls the individual folds back together, thus closing the glottalopening. Figure 2.5 shows the different poses taken up by the vocal foldsfrom the coronal view, during a cycle of phonation. This is opening-closingpattern is repeated continuously giving rise to a periodic glottal signal. Itis important to note the importance that the transfer of energy between theflow and the vocal fold structure plays in the self-oscillation process. Thevertical travelling wave also seen in the figure 2.5 is a representation of thisprocess.Later work, extended the myoelastic aerodynamic theory to investigatethe mucosa that coats the vocal fold surface, challenging the underlyinglaminar flow assumption of the theory. Another seminal work, was Ishizakaet al’s work on the flow separation theory, which looked at the important roleof turbulences and eddies in the phonation process. Flow models that cancapture these aspects of the flow can help to better characterize the glottal92.2. Lumped-Element Models of the Vocal FoldsFigure 2.5: Idealized shapes of the vocal fold during a single cycle of vi-bration. Note that the lower part of the vocal fold leads the upper partand the vertical travelling wave on the vocal fold surface. Reproduced withpermission from Story, [105]flow, and help with potentially achieving better acoustic output. Readers arereferred to [51] for a non-technical introduction to the theories of phonation,and different types of phonation.The vibratory behaviour of the vocal folds can be also modified by theactivation of the intrinsic and extrinsic muscles of the vocal folds. The prop-erties that affect vocal fold motion are the tissue mass, stiffness and viscosity.The constriction of the cricothyroid muscle can alter vocal fold tension whilethe thyroarytenoid muscle allows changing of the internal stiffness [58]. Thefundamental frequency of oscillation is a function of the current propertiesof the vocal folds; thus, creating models that connect muscle activation tothe material properties of the vocal folds will be useful in achieving differentvoice registers. For a more in-depth discussion of phonatory control andvocal fold physiology, the reader is refered to the paper by Jiang et al [67].2.2 Lumped-Element Models of the Vocal FoldsThe earliest models of the vocal folds were those that approximated the vocalfold structure as lumped masses interconnected by springs and dampers.The section looks at some seminal models that have shaped the field overthe past 50 years. For a more comprehensive survey of the different models,organized by mechanical design, aerodynamic simulation and application,we refer the reader to the survey by Birkholz et al [26].102.2. Lumped-Element Models of the Vocal FoldsFigure 2.6: Coronal view of the two-mass model. Reproduced with permis-sion from Peter Birkholz, [26]2.2.1 One-Mass ModelsThe earliest model of vocal folds was the simple one-mass block model in-troduced in 1968 by Flanagan and Landgraf [45]. This model consisted ofa single mass, that had 1 degree of freedom. The model could oscillate inthe medial-lateral direction in the coronal plane. However, to obtain a trueself-oscillating vocal fold model, the system requires the constant impart-ing of energy to compensate for internal friction losses. This is achievedthrough asymmetric pressure loading, something that a one-mass modelcannot do due to its time invariant glottal orientation. This model achievesself-oscillations through acoustic loading of the supraglottal tract; later stud-ies by Zanuartu et al [131] analyzed the importance of the acoustic loadingby using a non-square geometry for the vocal fold.The lack of time-varying geometry remained the major issue with thismodel along with the inability to replicate experimentally observed phasedifferences between the inferior and superior edge of the vocal folds. Therewere attempts made to introduce an extra degree of freedom, in the trans-lational and rotational directions [74][62] and more recently in the parallel-perpendicular direction of the airflow [1]. Almost all these models uses a flowmodel that based on a simple steady Bernoulli principle flow model. Theexception is the work by Horacek et al [62] which use the unsteady 1D Eulerequation to solve the pressure profile and flow. However, due to their sig-nificant drawbacks, one-mass models are no longer used in most applicationareas and are not promising as a potential model for speech synthesis.112.2. Lumped-Element Models of the Vocal FoldsFigure 2.7: Coronal (left) and superior (right) view of the vocal folds withquantities highlighted. Ishizaka et al, [64]2.2.2 Multi-Mass ModelMulti-Mass models added extra mass elements to each vocal fold to over-come some of the drawbacks of the single-mass model. The classic two-massmodel introduced by Ishizaka and Flanagan [64] added a second mass su-perior to the one-mass model (shown in fig.2.6), while restricting motion tomedial-lateral translation. While the airflow fluid loading occurs only on thelower mass, this configuration allows for the experimentally observed phasedifference between the inferior and superior edges of the vocal folds. Dueto the importance of this model in terms of understanding the field of vocalfold modeling better, let us briefly look at the mechanics behind this model.Two-Mass ModelFigure 2.7, shows the different quantities involved in the two-mass vocalfold structure in the coronal view. The displacements of the two vocalfolds define the vocal fold shape, and as a consequence, the flow at everytime instant. The lateral displacements x1 and x2 are the solution to thedifferential equation:m1x¨1 + r1x˙1 + s1 + kc(x1 − x2) = f1 (2.1)m2x¨2 + r2x˙2 + s2 + kc(x2 − x1) = f2 (2.2)122.2. Lumped-Element Models of the Vocal FoldsFigure 2.8: Bernoulli flow in the two-mass model in two cases a) Bernoulliflow in lower region and jet flow in upper region b) jet flow in both lowerand upper regionswherem1,m2 are the masses, r1, r2 are the damping values, s1, s2 are therestoring spring forces and kc is the stiffness of the coupling spring. f1, f2are the aerodynamic forces that act on the vocal folds and varying based onthe flow model used. The simplest model to represent flow is the well-knownBernoulli’s equation. For a steady, inviscid, incompressible flow it can bewritten as:p+ 12ρu2 + ρgz = const (2.3)where u is the flow speed in a point of the streamline, p is the pressureat the chosen point, ρ is the density of the fluid, g is acceleration due togravity and z is the elevation with respect to the reference 1D plane.Equation 2.3 can be seen as the statement of conservation of energy-momentum for a fluid, and implies the classic Bernoulli’s Effect when usedin conjunction with the statement of mass conservation (Au = const). Thiswould mean a reduction of pressure and increase in speed when a fluidgoes through a tube constriction. The most-common implementation ofthe flow differs from the original paper by [64], and is based on a modifiedversion of Bernoulli-based flow. As shown in figure 2.8, steady Bernoulliflow is assumed till the point of minimum glottal diameter at which flow isassumed to detach. A constant diameter jet exists in the region from theflow-separation point till the glottal exist, where pressure is assumed to beconstant. A pressure recovery after the glottal exit is also included in theequation. This simplifies to the following equation:Pb = Ps − (Ps − Pi)(am/a)2 (2.4)132.2. Lumped-Element Models of the Vocal FoldsPj = Pi (2.5)where Pb is the pressure in the Bernoulli regime and Pj is the pressure inthe jet regime. Here, Ps and Pi are the subglottal and supraglottal (epila-ryngeal) pressures respectively. am is the minimum glottal area and a is thearea at the location we are calculating the pressure at. The glottal flow Ugcan now be calculated based on the glottal shape and the pressures incidenton the vocal folds.This model has been one of the most used VF models and many varia-tions have been proposed [17] including models with added asymmetries inthe masses [103]. An additional vertical DOF was added to the model toaccount for inferior-superior vocal fold motion in later models [46][63]. How-ever, this model fails to simulate the difference in registers and transitionsbetween them; to this end, a third superior mass was added to solve thisissue [122]. While this helped the two-mass model better simulate differentvoice qualities, the inability of the model to simulate the relative closureand the smooth continuum between registers remains a drawback in it’s usefor speech synthesis. Finally, Titze et al [121][117] extended the model, al-lowing for control through muscle activation as well. However, while thelow-dimensional nature of the lumped-element model is attractive for con-trol, there is less physiological accuracy when assigning material propertiesto these models.A seminal model for articulatory synthesis was the triangular glottismodel was introduced by Birkholz et al [25]. The model used inclined masses(shown in fig. 2.9 to model glottal abduction and closure as part of thegoverning equations, to better capture different voice qualities. The modelhas been used as part of a complete articulatory synthesizer and shown tosynthesize breathy, normal and pressed phonation types. Finally with theincrease in computational capabilities, these systems have been extended to6-mass with 3D capabilites [98], to even a 25-mass model [130]; the lattermodel was designed to capture more intricate geometry details for videoendoscopy applications. However, the question arises whether continuummechanics structural model would be preferable instead at such an expensivecomputational cost.All the previous model use the Bernoulli-based flow model which is usu-ally enough to characterize the flow for the purpose of lumped-element mod-els. This requires the prior assumption or computation of the flow-separationpoint for the flow. The location of flow-separation has been shown to play142.2. Lumped-Element Models of the Vocal FoldsFigure 2.9: Pseudo 3D view showing the inclined masses of the modified two-mass model that can represent the triangular glottal opening, reproducedwith permission from Birkholz et al [25]a critical role in the self-oscillation process as it helps decide the pressuredistribution inside the glottis and, as a consequence, the aerodynamic forceson the masses. Generally models assume that the flow separation is eitherat the location of smallest diameter inside the glottis or at a ratio of it (Eg.1.2 ∗ amin). Pelorson et al [91] put forward a model to estimate the flowseparation model instead of a fixed location like most previous papers inliterature. In terms of flow modelling, an significant exception is the effortby LaMar et al [72] who use a quasi-one-dimensional Euler system to drivea symmetric two-mass model. This model is pertinent as it was shown tobe a more accurate treatment than the traditional Bernoulli-based systems,with a lower computational cost than full 2D Navier-Stokes solvers.2.2.3 Body-Cover ModelsAs mentioned in Section 2.1, Hirano et al [59] explored the structural differ-ences of the vocal fold layers. Story et al introduced the Body-Cover modelthat added an extra mass to represent the different layers and capture thecover vibration better [107]. As shown in figure 2.10, this differentiationallowed for defining varied structural properties to the body and cover, andremains one of the most popular models in literature till today. A classic152.3. Continuum Models of the Vocal FoldsFigure 2.10: Coronal view of the body-cover model of the vocal folds, repro-duced with permission from Birkholz et al [25]Bernoulli-flow model was used to drive the outer cover masses directly, andas a consequence the inner body mass. This was later extended to a 4-massmodel to study different voice registers and their transitions [123]. Finally,an extremely complex 128-mass model that explored 7 different layers withdivisions in the saggital plane was introduced [118]. This system providedextensive data on differing voice types and was effective in capturing smallvoice distinctions.2.3 Continuum Models of the Vocal FoldsWhile higher dimensional lumped-element models helped to better modelthe vocal folds, considerable doubt existed if the geometry and viscoelasticproperties of the vocal folds adequately. Thus, continuum models of the vo-cal folds have been a major goal of speech synthesis for the last two decades.As efficient computing capacity evolved, the possibility of highly complexPartial Differential Equation (PDE) based models became feasible. Thesemodels are based on the fundamental laws of continuum mechanics whereall the processes that contribute to phonation need to be integrated. Thisinclude the PDE equations of the aeroacoustic flow along with the biome-chanical model of the vocal fold structure. This enables a direct relationshipbetween these characteristics and the glottal waveform that is generated bythe system.162.3. Continuum Models of the Vocal Folds2.3.1 Eigen-Analysis ModelsWhile early continuum models failed due to improper application of bound-ary conditions to study of vocal-fold resonance [34], later work characterizedthe vibration of the vocal folds in terms of its eigenmodes and eigenfrequen-cies [19][20]. These empirical studies using two-dimensional models of thevocal folds. also threw up interesting findings where the eigenfunctionsin studies incorporating non-linearities of aerodynamic flow was similar tothose obtained from linear eigenmode systems. This suggested that, in con-trast to the prevailing school of thought at the time, linear eigenmodes wereof potentially greater importance in the vibration of the vocal folds thanthe modelling of the (possibly) non-linear aerodynamic flow. It also estab-lished the intrinsic role played by the vertical phasing eigenmode, pointingto a direct control over the opening and closing of the glottis. Structuraland visco-elastic parameters were also presented for normal vibration. Thelatter of these studies also matches data from existing in vivo studies buthowever, did not include aerodynamic forces.2.3.2 Self-Oscillating Continuum ModelsWhile the previous studies used the continuum models to study the physiol-ogy of phonation, DeVries et al [38] used a 3D Finite Element Model (FEM)of the vocal fold structure to determine realistic parameters for a lumpedtwo-mass model. A symmetric model with each vocal fold having 3000 ele-ments (shown in figure 2.11) was considered with detailed descriptions of thevocal fold’s material properties and geometry. By matching the dynamic be-haviour of the FEM model and lumped element models for a given pressureflow model, the parameters for the lumped element model was obtained.The first model that used the FEM for computation of vocal fold dy-namics was by Alipour et al [3]. This model adds the ligament as well to thebody-cover differentiation that was used by DeVries [38] when formulatingthe vocal fold tissue structure. The model has tissue mechanics modelledwith the finite element method combined with a Bernoulli-based fluid solver.The biomechanical model is quasi 3D with the vocal folds being divided into15 layers along the coronal plane, with figure 2.12 showing the trajectoriesof nodes in layer 8. The model by Alipour [3] made some simplifying as-sumptions:• Small deformation and linear-elasticity• Single plane deformation172.3. Continuum Models of the Vocal FoldsFigure 2.11: 3D view of FEM model by De Vries et al, reproduced withpermission from [38]• Transverse isotropy perpendicular to the tissue fibers• Fixed control volume for integrationSimilar to the two-mass model for the class of lumped-element models,this model also served as the template for other continuum models. We shalllook at it’s mechanics below. As part of the assumption of a linear materialwith transverse isotropy, a constitutive equation can be written as:σ = [S] (2.6)where σ is the stress tensor, is the strain tensor and [S] is a stiffnessmatrix. For transverse isotropy, we have 5 independent elastic consonantsout of 21 in the 6X6 symmetric matrix [S]. These are the Young’s modulusand Poisson’s ratio (E, ν) in the plane transverse to the fibers, as well asthe Young’s modulus, Poisson’s ratio and shear modulus (E′, ν ′, µ′). Therelationship in the transverse plane is written as:µ = E2(1 + ν) (2.7)When generalized for the transversely isotropic material in question.x =1E(σx − νσz)− ν′E′σy (2.8)z =1E(σz − νσx)− ν′E′σy (2.9)182.3. Continuum Models of the Vocal FoldsFigure 2.12: Trajectories of nodes in layer-8 of quasi-3D FEM model, repro-duced with permission from [3]192.3. Continuum Models of the Vocal Foldsy = − ν′E′(σx + σz) +1E′σy (2.10)γxy =1µ′τxy, γyz =1µ′τyz, γzx =1µτzx (2.11)In this model we are solving the displacement vector ψ in the x(medial-lateral) and z(inferior-superior) planes. The y plane is the longitudinalplane in the dorsal-ventral direction. Using the assumption of planar strain,a displacement vector is defined asψ = u(x, y, z, t)i + w(x, y, z, t)k (2.12)where u and w are the lateral(x) and vertical(z) components of thedisplacement vector. Using our other assumption of linear elasticity, wedefine the relationship between the strain vector and the displacement vectorasij =12(∂ui∂xj+ ∂uj∂xi) where (i, j = x, y, z) (2.13)where ux = u, uy = v, uz = w. The previous set of equations enable adirect relationship between the stress and the displacements. These equa-tions describe the continuum and the equations to be solved using numericalmethods. The method of choice here is the Finite-Element Method (FEM);the FEM is commonly used in solving PDE’s over complex domains by split-ting the domain up into smaller finite elements. The equivalent problem isnow solved over this element, where the field variable is interpolated insidethe element from the values at the nodes of the element. Applying the FEMto the above problem we get:Mψ¨ +Dψ˙ +Kψ = F (2.14)where M is the mass matrix, [D] is the damping matrix and [K] is thestiffness matrix all of the size 6X6. F is the forcing vector that is derivedfrom the aerodynamic forces on the surface of the FEM vocal fold mesh. Thispaper used the simple Bernoulli formulation similar to the model explainedin section 2.2.2.While the individual spatial problem is in 2D, multiple layers of 2Dlayers are stacked together to achieve a 3D solution. There is a string force,that accounts for the forces between the different layers. This enables astructural solution that captures maximum complexity without becoming202.3. Continuum Models of the Vocal Foldstoo computationally unwieldy. Cook et al [33], put forward a 2D/3D hybridstructural model of the vocal folds that preserves 3D effects of vocal foldlength and longitudinal stresses, while maintaining the 2D computationaldomain. Many of the assumptions mentioned above are used by other modelsin literature; however, recent work has suggested that the linear elasticityassumption might be sacrificing accuracy at the altar of computational cost[31].This formulation led to the development of multiple models with varia-tions to either the structural model and its dimensionality, the fluid solverused (ranging from Bernoulli to 2D Navier Stokes[10]) and the applicationor model-focus in question (phonation[3], pathologies [52], flow separationcomputation[8] and bulging factor[7]).While the finite-element model by Alipour et al [3] provided the firstFEM structural solution for vocal fold phonation modelling, it used an ex-tremely coarse and simple Bernoulli-based flow model. One of the firstmodels with coupled fluid-solid interaction was by Tao et al.; this includedFEM models of both the vocal folds and the airflow along with a propercollision model [113]. The first fully 3D model including the entire larynxgeometry, the false vocal folds and laryngeal ventricles was put forward byRosa et al [37]. The method used shared mesh nodes at the fluid-solid (ie.airflow-muscle) interface to have a coupled simulation. Despite the use ofcoarse meshes (2600 for airflow and 3000 for tissue) this method still hadsignificant computational complexity and demonstrated the self-oscillationof vocal folds.Later methods focused on completely resolved flow computation using animmersed-boundary method for the fluid-structure interface [79][41]. Luo etal [79] used two finite difference-based discretizations of the Navier-Stokesequation (for aerodynamics) and viscoelastic equations (for the solid me-chanics), which are coupled for solving the FSI problem. Another interest-ing field of research is capturing fluid-structure-acoustic interactions. Linket al [75] created a 2D FEM scheme where Lighthill’s analogy is used todescribe the fluid-acoustic interaction as well as capture the Coanda effect.Finally, while most of the previous models used 2D Navier-Stokes modelsand the small deformation assumption, Zheng et al[135][128], extended thework of Rosa [37] to a 3D unsteady Navier-Stokes flow solver coupled with a3D structural FEM solver. The simulation results were comprehensive, withinsights into the glottal waveform, pressure and velocity distributions insidethe vocal folds along with jet dynamics. However, a comparable resolved3D simulation of flow is more than 100 times more expensive that a 2D flowmodel.212.4. Model CouplingRecent work has also looked at the role that the false vocal folds play inthe phonation process [129]. Zheng et al, created a version of their previousmodel with 2D flow, to study the computational effects of the false vocal foldsin the phonation process [136]. Alipour et al, also extended their originalFEM model to include the false vocal folds [9]. The time-dependent pressureand velocity distributions inside the glottal region were reported as partof the results, which an extremely valuable tool to benchmark simulationsagainst.2.4 Model CouplingIn the previous section, we have looked at a wide range of vocal fold modelsin literature. Each of the models had a structural and a fluid component.While the number of models tailored to individual applications is vast, theycan be segmented by classifying these individual structural and fluid com-ponents. Subsection 2.4.1 gives an insight into what combinations existin literature, and the lacunae in the field that needs to be filled for ourapplication of articulatory speech synthesis. Subsection 2.4.2, talks aboutstandard collision models that are used across structural simulations. Fi-nally, Subsection 2.4.3 looks at which models are actually coupled to thevocal tract/trachea and the challenges with running a full simulation.2.4.1 Combination of Structural and Flow ModelsFrom the previous two sections, vocal fold models for articulatory speechsynthesis can be divided into few major categories.Structural Models (in increasing order of complexity)• 2-Mass Models• Body-Cover Models• High-dimensional Lumped Models• 2D/Quasi-3D Continuum Models• 3D Continuum ModelsFluid Models (in increasing order of complexity)• Bernoulli-Model• 1D Inviscid Euler222.4. Model CouplingFigure 2.13: Visual depiction of the different combination of structural andflow models in the literature. Horizontal axis represents increasing spatialcomplexity from left to right. Vertical axis represents increasing flow com-plexity from bottom to top• 2D Navier-Stokes• 3D Navier-StokesFigure 2.13, gives a visual illustration of the different model combinationsin literature. Going from left to right on the X-Axis we go from the 2-Mass to3D continuum models in increasing spatial complexity. Going from bottomto top along the Y-Axis, we go from Bernoulli-based flow models to a 3Dunsteady Navier-Stokes simulation. Images of sample models are given ineach quadrant with their references [62][38][64][44] with the quantity of eachtype of model gives next to it. Our proposed model is filling a lacunae seenat the centre of the graph. (Note: The graph is not meant to be exhaustive,but rather provide a visual depiction of the standard combinations that existin literature to better understand potential gaps in the field)Figure 2.13 leads to some striking observations:• There exists a multitude of lumped-element models in literature, mainly232.4. Model Couplingcoupled with a Bernoulli-based solver. These models have generallybeen the most common models used for speech synthesizers, for theirconceptual and computational simplicity. While there exist some com-binations of lumped-element models with an unsteady flow model,these have been to benchmark the performance of quasi-steady Bernoulli-based models with the unsteady flow models.• With increased computational capabilities, the combination of 2D/Quasi-3D continuum models driven by 2D Navier-Stokes models has beenapplied to different sub-problems. The main advantage of this ap-proach is a synergy of dimensionality between the flow and structuremodels and the computation of an entire velocity field. Most of the2D flow models are solved either using the Finite-Volume Method orthe Finite-Element Method, while 3D flow models use the latter.• There is a lacunae of 1D unsteady flow models used in conjunctionwith continuum vocal fold models. This could particularly be useful incases where the bulk pressure distribution is of more importance thana fully-resolved flow computation. In addition these models would becomputationally much cheaper than comparable 2D models. However,we need 1D flow models that can estimate the flow separation pointand the glottal flow properly, unlike previous Bernoulli models.2.4.2 Collision HandlingCollisions between the vocal folds presents a different prospect to bothlumped-element as well as continuum models. Since the lumped-elementscould not come to a sudden unnatural stop on contact (zero-glottis condi-tion), it was proposed to apply a non-linear spring at the time of impact toprevent further movement in conjunction with increasing the damping ratiosof the masses [64]. This remains an open problem to be solved in continuummodels with many different ways of handling collisions suggested. We willlook at some significant types of contact handling in literature.Many of the continuum models assumed mid-line symmetry between theright and left vocal folds, to reduce computational costs by simulating justa single vocal fold. This moves the plane of contact to the glottal midlinefor many models. Alipour et al [3] in their first model removed a degree offreedom when any node reached the plane of contact. In terms of full 3Dstructural models, Rosa et al [37] calculated a force to avoid interpenetra-tion of nodes that are in contact while Tao et al [113] use the AugmentedLagrangian Method. As mentioned previously, structural models which use242.4. Model Couplinga sharp-interface immersed boundary method are directly coupled to theflow solvers.An important drawback of these class of problems, especially in the caseof 2D or 3D fluid models, is the possibility of zero-area and zero-volumeelements. This is a major stability issue with regards to the overall systemwith the responsibility falling on the structural solver to ensure a minimumglottal diameter at all points. Thus, there is never complete glottal closurein many of these models; for example in quasi-3D models, not all the layerscan be completely closed at the same time [9]. This can be a drawback inusing these models when stability is important, in applications such as acoupling articulatory synthesizer.2.4.3 Vocal-Tract CouplingThrough the previous sections we have gained a bird’s eye view into thecomplexities and intricacies that goes into building a vocal fold model. Asmentioned previously, a lot of early models were based off the source-filtertheory [42]; this meant that early model were used as simple exciters to theresonant vocal tract. However, as the non-linear coupling of the vocal foldsand the vocal tract has been established in literature [119], it is importantto design vocal fold models that are coupled to the vocal tract load.The glottal flow (Ug) is the output of the vocal fold model that is fedinto the vocal tract. This is the source to the vocal tract simulation; thepressure recorded at the outlet of the vocal tract would be the equivalentsound pressure that a listener would hear. There is also feedback from thevocal tract to the vocal fold model: the supraglottal or epilaryngeal pressure(Pe) is acts as a feedback to the fluid simulation of the vocal folds. Similarlythe pressure output at the end of the trachea, the subglottal pressure (Psg),will act as the input pressure to the vocal fold model. The supraglottalpressure along with the subglottal pressure act as the boundary conditionsto the fluid simulation in the vocal folds. Also, the structural characteristicsof both the attached subglottal and supraglottal tracts have shown to affectthe vocal fold vibration [115].However, if we look at the vocal fold models in literature we can seethat the vast number of them are not actually coupled. In fact in literature,less than 5 continuum models have actually been coupled to a vocal tractfor simulation. Most models give a static boundary condition on each endbased on experimental values without running a vocal tract simulation aspart of a larger system. It is also pertinent to note that it is easier to couplelumped-element models with quasi-steady Bernoulli-based solvers than un-252.4. Model CouplingFigure 2.14: View of the idealized 2D vocal tract shape for Czech vowel[a:] coupled with the 2D vocal tract model, c©Springer, reproduced withpermission from Hajek et al [53]steady solvers. In fact, both the exemplar two-mass and body-cover modelshave been coupled as part of larger synthesizers [64][101][126][107]. Thereare two main potential reasons we can put forward for the lack of coupledmodels in literature: firstly, running a full synthesizer is extremely con-ceptually difficult to design and implement, and computationally expensiveto run. Till recently, the computing capabilities available made it difficultto run anything others than a 1D wave-reflection based vocal tract model.Secondly, the feedback from the vocal tract can change rapidly at certaintimes, despite the scale differences between the vocal folds and vocal tractsimulations. This changes the boundary conditions of the flow model of thevocal folds, often pushing the vocal folds into regions of instability.Recently, efforts have been made to couple more complex vocal foldsmodels to vocal tract models. Alipour et al have demonstrated a quasi-3Dvocal fold model, driven by an unsteady 2D Navier-Stokes flow model cou-pled to a 1D wave-reflection analog vocal tract and trachea [7][9]. Recently,efforts to combine more complex vocal tract models have been undertaken;the 2D Navier-Stokes equation was solved in a complete computational chan-262.5. Articulatory Synthesizersnel inspired by the vocal tract [94]. This included the vocal folds and a smalltrachea tract as well providing an insight into the flow-fields inside the chan-nel. Recent Hajek et al [53], created a 2D finite element solution of the self-oscillating vocal folds, connected to a trachea and vocal tract model (figure2.14. Idealized vocal tract shapes were created for standard Czech vowelsfrom MRI data and the flow was solved using the Navier-Stokes equations,using the Arbitary Lagrangian-Eulerian approach for boundaries. The firsttwo formants of the generated outlet pressure was compared with publisheddata and was found to be in relatively good agreement. The most complexmodel of the fluid-structure-acoustics interaction in the vocal tract was a3D model using realistic laryngeal and vocal tract geometries by Jiang et al[68]. This model achieved self-sustained oscillations and demonstrated thata range of voice-related quantities were within normal physiological ranges.The model also showed the likelihood of strong source-filter coupling fromits results; the main drawback of the model was the immense computationalcost to simulate the system.2.5 Articulatory SynthesizersOne of the major goals of vocal fold modelling is articulatory speech synthe-sis. Similar to vocal fold models, we can think of the articulatory synthesizerhaving two main components apart from the vocal fold model: the structuralmodel that represents the physiology of the upper airway and the biome-chanics of the different articulators; and the acoustic model solving the flowthrough the vocal tract channel. Figure 5.1 illustrates the different compo-nents that are part of an articulatory synthesizer. Since this is a massivefield in it’s own right, we shall only look specifically at important attempts inliterature to model different articulatory structures and acoustic phenomenain the vocal tract.2.5.1 Structural ModelsAs seen in 5.1, the structural model of the vocal tract includes two maincomponents: biomechanical models with articulator control, and extractionof the vocal fold mesh for the acoustic simulation. There is a bi-directionalcoupling between the upper airway structures and the aeroacoustic flow;however, very few models capture this phenomenon. As mentioned beforevery few articulators have all the components mentioned in 5.1, and oftenthe control of articulators to shape the structural vocal tract is not included.272.5. Articulatory SynthesizersFigure 2.15: Components of an Articulatory SynthesizerTeixeira et al implemented a 2D midsaggital anatomical model of the vocaltract (figure 2.16), where positions of individual articulators can be con-trolled using data files [114]. This was an improved implementation of twoseminal models of the field [84][95]. It is important to note that this is notstrictly a biomechanical model in the true sense of the word, since there areno physics that are associated with the simulation. Rather, the movement ofarticulators can be derived from existing data, most likely medical imagingdata of the vocal tract.Dang et al introduced FEM-based models of articulators to replicate 2Dmidsaggital regions of the vocal tract and simulate articulatory movements[36]. Fully 3D models of articulators of the vocal tract were created frommedical imaging data by Badin et al [16], with biomechanically interpretabledata used to control the system. In the same vein, a comprehensive speechsynthesizer with a 3D structural model was put forward by Birkholz et alas part of the VocalTractLab software system. Here a gestural score wascomputed for every utterance that the user wants to achieve; this gesturalscore was then used to compute the vocal tract shape to run the acousticsimulation (figure 2.17).One of the major advances in the field has been the development of the282.5. Articulatory SynthesizersFigure 2.16: 2D midsaggital anatomical model of the vocal tract, reproducedwith permission from Texeira et al [114]Figure 2.17: Biomechanical vocal tract model used to calculate the vocaltract shape, VocalTractLab, reproduced with permission from Birkholz et al[24]292.5. Articulatory SynthesizersName Value(s)Face FEMTongue FEMJaw, Hyoid, Maxilla RigidSoft-Palate FEMPharynx FEMLarynx FEMLarynx Cartilages RigidTable 2.1: Summary of components in FRANK and component types.Rigid = rigid body, FE = finite element. Adapted from [112]comprehensive biomechanical toolkit ArtiSynth by Lloyd et al [77], gearedtowards simulations of the upper airway of the human body. Based on anopen-source model, the system provides tools to build biomechanical modelsand simulate them using an internal physics engine. This engine is capable ofsimulating combine multibody and finite element simulations with collisionhandling, connectors and numerical solvers build-in. A particular distin-guishing feature of the system, is the ability to model both line and FEMmuscles, and control them using muscle activations for forward and inversesimulations. This is critical for building a complete speech synthesizer, asit provides a strong physiological link for control of articulators rather thangestural scores that are a function of medical imaging data. Many differ-ent models of upper-airway articulators have been put forward in literature[27][50][54][78][92][100][127] but often included only portions of the upper-airway complex. To enable a complete study of the upper-airways complexacross fidelities, a Functional Reference ANatomical Knowledge (FRANK)[13] biomechanical model of the head and neck has been implemented in theArtiSynth toolkit. Table 2.1 summarizes the different components involvedin the FRANK model and the sources from which they are obtained. Figure2.18 illustrates the hard and soft components of the model, and the airwaythat would be used for the acoustic simulation.2.5.2 Flow/Acoustic ModelsIn this subsection, we shall look at some exemplar flow/acoustic models thathave been suggested in literature for articulatory synthesis and other upper-302.5. Articulatory SynthesizersFigure 2.18: FRANK: a Functional Reference ANatomical Knowledge [13]a) midsaggital view of the components b) hard components of the model c)soft components of the modelairway related functions. In general there are a few major ways to simulatesound/flow propagation in a tube given below.• Digital wave-guide filters These models are based on the classicKelly-Lochbaum reflection-type [69] line model that models a non-uniform tube as a set of concatenated tube segments with varyingvalues of impedance. This model is computationally very efficient andhas been used in a variety of classic vocal tract models [85][74][107].However, the model is not particularly versatile and cannot includespecific turbulence and flow separation effects of the flow.• Transmission Line Circuit (TLM) models This model imple-ments the electrical circuit equivalent of the acoustic tube in question.Thus the model is converted to the form of resistors, capacitors andinductors with values that capture the impedances of the individualtube sections [64][81]. Birkholz et al illustrated how this model couldbe used as part of a complete articulatory synthesizer [24].• Hybrid time-frequency simulation These models attempt to takeadvantage of both classic approaches for a fast versatile simulation[11][101]. The impulse responses calculated for the vocal tract arecombined with the glottal input signal to find the radiated outputpressure.312.5. Articulatory Synthesizers• Direct numerical simulation of flow This method involves solvingthe characteristic acoustics equations (Eg. Webster equation in 1D[126], Wave equation in 2D [132], Navier-Stokes in 3D[68]) over thevocal tract domain. This method is the most physiologically relevantbut can be computationally expensive especially at higher dimensions.Of these approaches, the direct numerical simulation of flow is of partic-ular interest to us in terms of the overall goals of articulatory synthesis. Inaddition, these models can also guide us in choosing better flow models forthe vocal fold simulations as well. Apart from speech synthesis, there weremodels put forward in literature to study fluid-structure interaction in otherrelated biomedical and phonation problems of the upper airway. In general,a stable solution of the upper airway with colliding deformable bodies andclosure of the airway is extremely hard to achieve apart from being compu-tationally prohibitive. One of the major challenges models need to overcomeis being able to predict the recovery of pressure after a constriction in theairway, and recover stably when the airway opens up again. A lot of thesechallenges are similar to those faced by the flow models of the vocal folds.Many different models have been suggested in literature ranging fromlumped parameter models [21], 1D [28][65], 2D [66][76][80] and 3D models[55][56][57][82] of airflow. Of these models, two models stand out for theirapproaches to the modelling of the flow in the upper airway. Firstly, oneof the flow models that demonstrated the ability to predict the pressurerecovery, and stably handle closure and even collapse of the upper airway,was the 1D model suggested by Anderson et al [12]. Handling collapsing orclosure in tubes is an non-trivial tasks for most flow models, and can oftenlead to an unstable simulation. The model was experimentally validated,and the bulk pressure predictions were shown to agree with other comparable3D models. The model also required significantly lower computational costsand was used for modelling Obstructive Sleep Apnea (OSA) simulations.Secondly in terms of our goals of achieving faster, higher quality speechsynthesis, Zappi et al [132] provided the first real-time solution of the 2Dwave equation for interactive speech synthesis applications. This modelenables us to move beyond simplistic 1D representations of the vocal tract,to achieve higher quality synthesis by capturing the vocal tract geometrybetter without the associated increase in computation time.322.6. Data Acquisition and Measurement2.6 Data Acquisition and MeasurementIn this section we focus on the methods used to acquire data and measurephenomena associated with phonation. We focus on the vocal fold model,to highlight the experimental data and measurements that are available asa benchmark to compare the performance of computational models with.Readers are refered to the review of the field by Mittal et al [87], for a morecomprehensive tabulation of the different studies on phonation. We alsolook at the medical imaging techniques most commonly used to capture thevocal tract structure in subsection 2.6.4.Subsection 2.6.1, talks about in-vivo studies of the vocal folds to collectdata on phonation, and for standard measures of the vocal folds. However,the inaccessibility of the vocal folds has meant that in-vivo studies of thevocal folds have had strong ethical and logistical issues, limiting their scope.This has meant that speech researchers, have attempted to find alternativemethods to study and measure vocal fold characteristics. Subsection 2.6.2,explores the use of excised laryngeal from both humans and animals (eg.canine) in lieu of the actual vocal folds from a live subject. Subsection2.6.3, looks into synthetic fabricated models of the vocal folds that wereemployed to measure information. Finally, subsection 2.6.4 talks about themethods used to obtain the structure of the vocal tract.2.6.1 In-Vivo StudiesThere are three main methods used are given below:• Laryngoscopy: The laryngoscope is an endoscope that is particu-larly designed for observing the laryngeal complex from the supraglot-tal duct. A catheter is inserted into the throat, past the velum tojust above the larynx. A camera records the oscillation of the vocalfolds from the top [34]. Figure 2.19 contains sample images from alaryngoscope in top for a cycle of phonation. The opening and clos-ing of the glottis is clearly visible in the different stages of the cycle;however, this data needs to be buttressed with other data to gain aquantitative understanding. Since there is only a top down view, it’stough to make out the exact point of opening/closure, and thus thisdata would ideally be combined with electroglottography (EGG) data.• Electroglottography (EGG): The EGG is a non-invasive tool thatis enables speech researchers to get a high-temporal resolution readingof the degree of closure in the glottis. Electrodes are attached on332.6. Data Acquisition and MeasurementFigure 2.19: High-speed laryngoscopy images (above) synced with EGGdata (below) for one cycle of vocal fold vibration, c©Wiley Publishing, Gicket al [51]either side of the thyroid notch; the electrical resistance between theelectrodes is a function of the degree of opening and closure of theglottis [47][73]. A zero value or minimum value of the electrode readoutwould correspond to an open glottis and vice-versa. Figure 2.19 showsthe EGG signal in the bottom related to a specific laryngoscopy signal;this is often called a laryngograph.• Pneumotachography and Audio Recordings: The pneumota-chograph (or airflow meter) is a tool used to measure airflow and airpressure during speech. It constitutes the famous Rothenberg mask[96], that covers the mouth and nose, calculating the oral and nasalairflow. When used in conjuction with a microphone recording theoutput sound pressure, we can gain an understanding of the glottalwaveform [89][61].342.6. Data Acquisition and Measurement2.6.2 Excised Laryngeal StudiesThe focus of excised laryngeal studies were usually twofold: firstly, to findout the structural properties of the various vocal fold layers, and secondly,carry out experimental measurements using these excised larynxes placedin physiologically viable conditions. This topic is vast with many differentmodels in literature; readers are referred to the paper by Miri, [86] thatprovides an in-depth review of the various mechanical testing methods andthe constitute materials that are consequently used for modelling. The paperexplains the different methods that can be used for measurement (tractiontestings, shear rheometry, linear skin rheometry and indentation testing) andthe different values reported by technique. Both human and canine larynxeshave been used to calculate the material properties [5][2][71][70]. This wealthof data provides excellent starting points to build and compare our models;however, these studies have some drawbacks. In particular, the researchershave to work quickly since there are limited run times, difficulty in restoringthe proper tensioning and specific environmental conditions which need tobe maintained.2.6.3 Synthetic Laryngeal StudiesFabrication of synthetic larynxes for to study vibratory characteristics ofvocal folds has been a widely used tool, especially over past few years. Thefollowing sections looks at the different models in literature, and the datathey provide in validating our potential system.Static ModelsEarly synthetic models of the vocal folds were static models, that were gen-erally extruded 2D models progressing to more complex 3D models. Whilethese models did not oscillate, they gave insight into the relationship betweenthe pressure distribution inside the glottis and the glottal flow parameter.Scherer et al [97], looked at the difference in intra-glottal pressure distribu-tions for a symmetric and tilted static laryngeal model. Fulcher et al [49] dida similar study using a symmetric system with two vocal folds, establishingthe link between transglottal pressure, the geometry and the output flowrate. This was later performed with a hemilarynx model as well [48].352.6. Data Acquisition and MeasurementDriven ModelsIn some models an external control was given over the structure of thevocal folds; this was achieved using linear actuator to make the vocal foldsperiodically change their shapes. This enables us to see the synthetic vocalfold take up shapes that are seen during actual phonation with the caveatthat the motion is decouple from the airflow. There were both simple linearoscillatory models [40] as well as more complex waveform-based models inliterature [124].Self-Oscillating ModelsSelf-oscillating models achieve coupling between the structure as well as theairflow, enabling osciallations to naturally entrain. An important result forthe field that was achieved early-on when Titze et al [120], had been able toestablish the threshold pressure for phonation assumping a model that wasmade up of a stationary body layer, and fluid-filled oscillating cover layer.Similar models diving the idealized structure into two layers was also putforward by Chan et al [30][29]. Thomson et al [115] put forward a variationof the canonical M5 vocal fold model introduced by Scherer et al [97] thatwas mentioned previously. Later modifications of this model were also putforward; an important model was fabricated by Becker et al [18], who studiedthe fluid-structure interaction inside the glottis.2.6.4 Vocal Tract MeasurementsThe measurement of the various articulators of the upper-airway is a com-plete research field by itself. In general, for speech we shall assume thatairway is directly derived from MRI data or obtained from the biomechan-ical model as in the case of FRANK (section 2.5.1). As mentioned in thevocal-tract coupling section 2.4.3, Story et al [108] developed a set of areafunctions from Magnetic Resonance Imaging (MRI) data. Story [106] laterrevisited this data-set by carrying out new measurements of area functionsfrom the same patient, to better understand inter-speaker variability. An-other famous MRI data-set is the ’ATR MRI database for Japanese VowelProduction’ which contains male MRI vocal tracts data [109], obtained withthe phonation-synchronization method. Takemoto et al [110], provided amethod for extraction of area functions, commonly used in 1D vocal tractmodels, from MRI data. These data-sets allow us to build vocal tract do-mains over which we can solve the wave equations.362.7. Discussion and Conclusion2.7 Discussion and ConclusionThere has been a cornucopia of vocal fold models proposed over the last half-century covering a range of applications from articulatory speech synthesis,vocal fold pathologies and phonation dynamics. Early models made signifi-cant simplifying assumptions to both the structural and flow components ofthe model; the vocal fold structure was approximated as lumped rectangularmasses, and a Bernoulli-based flow model was applied. As computational ca-pabilities grew, the first continuum models of the vocal folds emerged withthe constitutive equations being solved using numerical methods such asFEM. This has led to the development of very complex and comprehensivemodels that have given speech researchers a better insight into phonationdynamics, jet dynamics and coanda effect as well as pathology treatment.This has been mirrored by the increase sophistication of data acquisitiontechniques used by scientists on the vocal folds. Finally, tremendous workhas been put into building vocal tract models for articulatory synthesis. Allthe factors combined, make the prospects for the field seem extremely ripe.To take the research forward into practical applications, there is a need forusable models of the vocal fold and vocal tract to be coupled; this includeconsiderations of speed, robustness and usability. Equally, models are cur-rently disparate, implemented across different proprietary and commercialtool-kits, thus pointing towards the need for integrated open-source models.This chapter has presented an overview of vocal fold modelling in thecontext of articulatory speech synthesis. Exemplar models of the field havebeen highlighted, and an attempt has been made to see if there exists certainlacunaes in the field, despite the wealth of models in literature. We haveidentified areas of research that require further investigation. In particular,there is a need to move beyond lumped-element models to include continuummodels of the vocal folds in articulatory synthesizers. However, there aretwo major issues with regards to continuum models: there is no appropriateflow model that strikes a balance between the computational and conceptualsimplicity of Bernoulli, and the significantly greater computing cost of using2D Navier-Stokes models. Secondly, the vast number of vocal fold modelsexist in a vacuum with static boundary conditions and no coupling to eitherthe vocal tract or trachea. Thus, a continuum vocal fold that combined a2D structural model with an unsteady 1D flow model would move the fieldtowards a practical, use able vocal fold model. Running this model as part ofa stable coupling articulatory synthesizer, would enable speech researchersto better study the acoustic outcomes of model decisions. In the followingchapters we describe our contributions to these open research problems.37Chapter 32D Continuum Model with1D Flow ModelIn this chapter, we introduce our 2D continuum model of the vocal folds,driven by an 1D unsteady flow model. We first start with defining thecharacteristics that we look for when choosing structural and flow modelsof the vocal folds in Section 3.1. In the following subsections of the chapterwe describe the model formulation, including the numerical implementationthat we use in solving the system. Finally, the coupling of the structuraland flow models and the treatment of contact is covered.3.1 IntroductionThe modelling of the vocal-fold phonation is an extremely complex fluid-structure interaction problem. The vocal fold structure is made of multiplelayers, the surface of which is acted on by aero-acoustic forces that arepredicted by the fluid model. In previous work, many assumptions havebeen made for the modelling of the airflow through the vocal folds. One ofthe classic assumptions, the quasi-steady assumption made for applying theBernoulli-equation, is demonstrably false as seen in previous work. However,compromises need to be made in model formulation based on current com-putational capabilities. Some of the major considerations when choosing anappropriate fluid model for the vocal folds are listed below:1. Instead of a fully-coupled approach, it would be suitable to have thestructural and flow equation solved separately from each other. Thiswould enable us to switch out the structural model for a more complex3D model later, without affecting the entire system, thus achivingmodularity. In this case, the area function of the glottal tube A(x, t)in question, can be treated as a known quantity.2. Comparison studies between Bernoulli-based solvers and 2D Navier-Stokes solvers for lumped-element and continuum-models, showed that,383.1. Introductionwhile Bernoulli models performs acceptably in predicting bulk intra-glottal pressures, they are unsatisfactory in computing the location ofseparation point and the glottal flow rate. The separation-point hasbeen shown to be critical to stable self-sustained oscillations as it playsan important role in the force loading of the vocal folds. Thus, theBernoulli-model is not sufficient for our formulation despite its manycomputational advantages.3. While 2D/quasi-3D and 3D models of flow have shown themselves ca-pable of reproducing many of the intricacies of phonation fluid dynam-ics, the computational cost involved in solving both the solid and fluidequations in 2D (or higher) dimensions continues to be prohibitive.This rings especially true in the case of articulatory synthesis, wherethe role of the Coanda effect for example, is not important to achievingmodal phonation which is our primary goal [75]. Thus, we choose tofocus on those features that are critical to generating better acousticoutcomes.4. Viscous losses along with turbulences play an important role in thephonation dynamics. The model that we choose should be able tohandle this automatically.5. Many 2D models and 3D flow models, do not allow the vocal foldsto collide in the structural domain as they need to have a minimumarea function diameter. This is done to ensure that there are nozero/negative-volume inverted elements in the flow mesh that will leadto instability. Thus, our flow model should handle closure and reopen-ing of the glottal tube in a numerically stable and physically realisticmanner.6. The vocal fold channel has sudden variations in area and an irregulargeometry. Our choice of spatial discretization should account for this.The natural conclusion that can be drawn from the review of literature inChapter 2 is that our ideal flow model will be an 1D unsteady fluid model.This model should be able to predict the bulk intraglottal pressure, flowseparation point, viscous losses and most importantly, glottal flow values.The model should be able to handle tube closure and reopening, apart fromirregular geometries. Finally, a loosely coupled framework would be helpfulin extending our structure model in the future without too much addedeffort. As mentioned before in section 2.5.2, the unsteady 1D fluid model393.2. Structural Modelsuggested by Anderson et al for modelling collapsing tubes for ObstructiveSleep Apnea (OSA) [12] contains a lot of the characteristics required inour system and showed comparable performance to 3D simulations. Thismodel will be used as a starting point around which our fluid model will bedesigned.We first start with the formulation of the structural component of thevocal fold, followed by the fluid model and its numerical implementation.Information about the fluid-structure coupling is then provided in the fol-lowing subsection. Section 3.4 first contains an analytical case to verifyour numerical solution framework. This is followed by experiments usingthe coupled model to examine the fluid model’s abilities to predict pressureand velocity distributions that are in-line with experimentally-measured andsimulation results in literature.3.2 Structural ModelWe use a 2D structural model based on the model by Alipour et al [3] witha focus on lower computational load. Of the models that are available inliterature this model made sense for a few major reasons: it is the simplest2D continuum model available and the logical step-up from using lumped-element models. While there are more complex structural models both interms of dimensionality as well as material properties, this model has beenshown to reproduce the major phonation characteristics in previous studieswhen coupled with a 2D Navier-Stokes model [7]. Potentially in futureiterations, the possibility of using a model-reduction technique to optimizethe system further can be looked at. We do not reproduce the mathematicsof deriving the matrix equations here for the sake of brevity. Readers arereferred to the original paper for the same [3].The structural mesh that we use for the finite-element method is shownin figure 3.1. There are three main regions of the mesh, each with differentmaterial properties as suggested by the body-cover model [59]. We chooseto assume symmetry across the midline plane and simulate a hemi-laryngealmodel instead of both vocal folds. The model uses a linear-elasticity as-sumption which is computationally cheaper to solve, and has been validatedin previous studies. A linear shape function is used for the finite-elementformulation and the material properties are taken from a recently updatedversion of the model [9] (Table 3.1). The vocal fold mesh was the samemesh used by [3] and the vocal fold structure was divided into three ma-403.2. Structural ModelFigure 3.1: Finite element mesh shown in the coronal plane. This includesthe body (dark grey), the ligament (white) and the cover (light grey) regionsterial regions: body, cover and ligament. The FEM solution of the spatialproblem yields a second-order matrix differential equation (3.1) in the timedomain:MΨ¨ +DΨ˙ +KΨ = F (3.1)The equation is discretized using the second-order central scheme centredat the nth time-step for stability as shown in equations 3.2 3.3:ψ˙ = ψn+1 − ψn−12.∆t +O(∆t)2 (3.2)ψ¨ = ψn+1 − 2ψn − ψn−1(∆t)2 +O(∆t)2 (3.3)The aerodynamic force and string forces are used to calculate forcingvector F, in equation (3.1). Since contact between the symmetric vocalfolds would happen at the glottal midline, a rigid plane is assumed to bepresent there. When the vocal fold reaches the midline, an impact force isapplied to prevent interpenetration. The contact force is normalized over acollision region defined by the Aclosed value associated with the fluid model,that explained in subsection 3.3.3.413.2. Structural ModelName Value(s)Body Longitudinal Young’s modulus 20 kPaCover Longitudinal Young’s modulus 15 kPaLigament Longitudinal Young’s modulus 30 kPaBody Transverse Young’s modulus 2 kPaCover Transverse Young’s modulus 1.5 kPaLigament Transverse Young’s modulus 3 kPaBody Longitudinal Shear modulus 12 kPaCover Longitudinal Shear modulus 11 kPaLigament Longitudinal Shear modulus 20 kPaBody Viscosity 6 poiseCover Viscosity 3 poiseLigament Viscosity 5 poiseLongitudinal Poisson’s ratio (All layers) 0.4Transverse Poisson’s ratio (All layers) 0.9Lung pressure 1.0 kPaFluid density 1.14 kg/m3Fluid dynamic viscosity 1.86e-5 Pa· sχmin 0.2Table 3.1: A list of the different material properties used for the vocal foldmodel. Parameters derived from [3][9]423.3. 1D Fluid Model3.2.1 Numerical Implementation ProcedureThe solution of the tissue mechanics is outline as follows:1. The simulation is started with the hemilaryngeal mesh at the prephona-tory position. The tissue properties of each layer are defined based onTable 3.1.2. The mass, stiffness and damping matrices (6X6) are calculated foreach element. The mass matrix is a function of the density ρ and areaA. The density and stiffness matrices are calculated using the finite-element interpolation shape function inside each element to integratethe strain energy and the viscoelastic properties of the different layersof the vocal folds.3. These element masses are assembled into the global matrices that areshown in Equation 3.1.4. The area function A(x, t) is extracted from the structural model andgiven to the fluid model. The pressure distribution is obtained throughsolving the fluid model.5. The nodal forces for each surface element in calculated and applied,along with the string forces across all elements. This gives the globalforce vector F .6. The matrix differential equations are solved to give us the displacementvector ψ for the current time step.7. The new nodal coordinates are calculated by adding the dynamic dis-placement vector to the previous coordinates.8. The updated geometry is used to calculate the area function A(x, t)for the next time step of the system. We repeat steps 2 to 7 for ourtime period of simulation.3.3 1D Fluid Model3.3.1 Model FormulationHere we attempt to apply the 1D version of the Navier-Stokes mass andmomentum equations for incompressible flow to our problem. Based on the433.3. 1D Fluid Modelideas of Cancelli and Pedley [28], the flow continuity and the momentumequations for the equation are written as:∂∂tA+ ∂∂xAu = 0 (3.4)ρu∂∂xu+ ρ ∂∂tu+ ∂p∂x− τ sA= 0 (3.5)τ − τfric − τχ = 0 (3.6)where s is the perimeter around the cross-section area A, u is averagevelocity, p is pressure and ρ is density. The term τ models the viscouslosses with two major components: τfric describing the laminar losses andτχ describing the losses due to flow separation. The equations are givenbelow:τfric = −2µ(s/A)u (3.7)τχ = (A/s)(1− χ)ρu( ∂∂xu) (3.8)While τ is written separately for clarity, it can be seen that the threemajor solution variables are u, p and τ . The flow separation term is ofparticular importance in our case; we can define flow separation at the pointbeyond which the pressure is effectively constant i.e. ∂p/∂x = 0. The flowseparation point also plays another critical role: after the drop in pressure atthe constriction, a pressure recovery takes place downstream in the channel.The flow separation point limits the pressure-recovery to match with theboundary conditions downstream. The χ term is defined as:χ ={1, for u ∂p∂x < 0χmin, for u ∂p∂x ≥ 0where bi-directional flow is accounted for and χmin is defined by the userbased on the problem under consideration. As suggested by Anderson et al[12], we use the inviscid approximation to calculate χ.∂p∂x≈ −ρu(∂u∂x)− ρ(∂u∂t) (3.9)This approximation is advantageous as the τ term is no longer dependenton p, and because the τ -term is generally small when ∂p/∂x = 0.443.3. 1D Fluid ModelName Value(s)Lung pressure 1.0 kPaFluid density 1.14 kg/m3Fluid dynamic viscosity 1.86e-5 Pa· sχmin 0.2Table 3.2: A list of the fluid properties used for the vocal fold model.Parameters derived from [3][9][12]3.3.2 Numerical Solution FrameworkAnderson et al [12] suggested using Newton’s method to solve equations(3.4), (3.5) and (3.6) since A(x, t) is a known quantity. The method wouldbe:J(X) ·∆X = −F (X) (3.10)where X is the solution vector, J is the Jacobian matrix and F is theresidual vector. The suggested solution framework is iterative, recalculat-ing X(n+1) until the residual F is below a certain tolerance. However, therequirement to compute the Jacobian makes this coupled solution more com-plicated. While the system is still quite fast, the possibility of a faster solverof the equations can be considered.In the case of a velocity-driven flow, we can see that the flow equa-tions can be solved sequentially. The velocity boundary condition can beapplied to (3.4), to calculate the velocity distribution u(x). This can beused to compute τ(x) using (3.6). Finally, equation (3.5) is solved using thecalculated u and τ values. This procedure is extremely fast, and computa-tionally efficient. On the other hand, for pressure-driven flows we requirea coupled-solution of equations (3.4)-(3.6), which can be solved using New-ton’s method as mentioned above. However, glottal flow is generally driventhrough pressure-pressure boundary conditions though there exists cases inwhich velocity-driven flow can be defined instead.Novel Decoupled Solution Framework:Thus, we suggest a method to convert pressure-pressure boundary condi-tions to the equivalent velocity-pressure boundary conditions, followed by adecoupled solve. The solution involves a bounded search, where we iterate453.3. 1D Fluid Modelthe system till we find uInlet-pInlet boundary conditions equivalent to thespecified pInlet-pOutlet boundary conditions. This solution would enableus to have an unsteady 1D numerical implementation that is significantlyfaster than the coupled solver. The model parameters are given in table 3.2Our solution procedure is as follows:1. Create two initial guesses for input velocity, u1i=0, u2i=0 for a givenpInletn+1−pOutletn+1 boundary condition. The superscript refers tothe simulation time, and the subscript refers to the iteration number.The previous time step’s input velocity uInletn and 1.1 ∗ uInletn aregood guesses to speed up convergence.2. Use the decoupled solver to find the outlet pressures p1i=0, p2i=0 forthe u1i=0 − pInlet and u2i=0 − pInlet systems respectively.3. Calculate the change in pressure with velocity dpdui = (p2−p1)/(u2−u1)4. Update u1i = u2i−1 and p1i = p2i−1 from the previous iteration, andcreate a new guess for u2i using dpdui.5. Solve the decoupled equations for the u2i−pInlet boundary conditions6. Calculate the difference between your target outlet pressure and thenew computed outlet pressure as diff=pOutlet− p2i7. Iterate steps 3-6 until diff is below a certain tolerance value3.3.3 Fluid-Structure CouplingThe structural and fluid models are coupled through the area-function thatis the input to the fluid simulation, and the aerodynamic pressure that isused to compute the forcing vector for the FEM solid mechanics. We chooseto loosely couple the solid and fluid models rather than have a combinedformulation; this enables us to treat the area function as a pre-computedquantity for solving the 1D fluid model sequentially. The structural modelis discretized in the coronal plane, with the fluid model computed alongthe centre-line/midline in Figure 3.2. Velocity components perpendicularto the midline are considered to be zero as we purely focus on 1D flow. Afourth-order asymmetric scheme is used for spatial discretization of the fluidmodel. At every discrete point, a cross-sectional area is extracted from the463.3. 1D Fluid ModelFluid ModelFigure 3.2: 2D continuum vocal fold model in the coronal plane. The vo-cal folds are assumed symmetric with the shaded out vocal fold (left) notsimulated. The fluid model is calculated along the glottal centre-linestructural model, which is the medial-lateral opening between the two vocalfolds, multiplied by the distance in the dorsal-ventral plane. The triangularnature of the glottal opening is taken into account when computing the cross-sectional area A and the associated perimeter s for the fluid model. Anotherimportant case, is that of collisions during the self-oscillation process. Atevery time-step, the updated area function is extracted from the structuralmodel, and passed to the fluid model. Since equation 3.5 has terms dividedby A, we choose to handle the fluid model area through warping as suggestedin [12] with a ’safe’ Area function:Asafe(x, t) = A(x, t) +Aclosed ∗ w(A(x, t)) (3.11)The transition function is defined as:w(A(x)) =0, for A(x) > AsmallA(x)−AsmallAclosed−Asmall , forAclosed ≤ A(x) ≤ Asmall1, forA(x) < Aclosed(3.12)where Aclosed is the smallest numerically stable area that is empirically473.3. 1D Fluid Modelx(m)0 0.05 0.1 0.15 0.2 0.25 0.3Area(m2 )00.20.40.60.81ArealAsafeAsmallAclosedFigure 3.3: Sample area warping for fluid model. Asafe is used as the areafunction for the 1D fluid simulation to ensure stability.determined, and Asmall is the area at which transition begins (shown inFigure 3.3). Anderson et al [12] suggested a value of Asmall = 2.5*Aclosed.The solid model is allowed to collide with the mid-line in the structuralsimulation (as shown in Figure 3.2) and a collision force is applied to thesurface nodes. Thus, we enable the model to have realistic behaviour in boththe structural and fluid domains. To ensure stability, the nodes are assumedto be in a state of collision, while the minimum glottal area is lower thanAclosed. The pressures at the surface nodes of the FEM mesh are assumedto be equal to the concurrent computed pressures at the glottal mid-line. Alinear interpolation is used to find the pressures at each surface node of theFEM body. The Force Vector on the node is then computed as follows:~Fnode = pnode ∗Anode ∗ ~n (3.13)where pnode is the pressure at the node, Anode is the effective nodalarea shared between the elements the node belongs to, and ~n is the unitnormal vector to the nodal surface. Both the fluid and structural modelsare temporally discretized using a central scheme with same time-stepping.483.4. Results3.4 ResultsOur model is implemented in MATLAB [83], a high-level matrix-based com-puting environment and programming language. This is done for two mainreasons: firstly, MATLAB provides us with an unified environment to buildour vocal fold models, and vocal tract implementations on. Since MATLABuses a Java-based API, it can also interface with ArtiSynth [77], enabling usto build complete articulatory synthesizers including biomechanical models.The second reason is to provide speech researchers with a toolkit that canbe used to test the different vocal fold and vocal tract models. Most speechresearchers do not come from an engineering background and thus MAT-LAB is the easiest starting point in terms of language complexity comparedto other lower-level languages such as C++ and Java.The 1D fluid model is applied to three test cases. First, the accuracyof our decoupled implementation is verified for a standard problem in thefield. Then it is applied to driving a vocal fold model for a static boundary-condition problem. Here, we look at it’s prediction of the flow-separationpoint as well as the pressure-velocity distribution that it predicts.3.4.1 Fluid Model ValidationWe choose the same problem defined by Anderson et al [12], which is anexample of the 2D starling resistor class of problems. This class of probleminvolves flow through a flexible tube connected to two rigid ends; readers arereferred to a review for the different types of these models [22]. We definethe following area function for the model:A(x, t) = A0 −Amsin(pix)sin(pit) (3.14)where A0 is the initial area and Am is the magnitude of the collapse ofthe tube (or more colloquially, the extent of constriction of the tube), whichare both constants. We define αmin = min(A(x, t))/A(0, 0) which impliesthat:Am = A0 · (1− αmin) (3.15)Thus lower values of αmin, implies a more constricted tube, with valuesclose to zero implying almost complete closure. We assumed mixed u − pinlet boundary conditions for the problem, and no viscous losses to enablean analytical solution τ(x, t) = 0. For these constraints we get the analyticalsolution for the mass (3.4) and momentum equations (3.5) as:493.4. Resultsu(x, t) = 1A(u0A0 +Amcos(pit)(1− cos(pix))) (3.16)p(x, t) = piρAm∫sin(pit)A(1 + (u2 − 1)cos(pix))dx− ρu2 + cp (3.17)Readers are referred to [14] for a derivation of the solution. Equation 3.17is integrated numerically, and the constant cp is defined numerically. Whilethis model undeniably does not include viscous losses that play a significantrole in vocal fold simulation, it serves as an useful tool to ensure that ourdecoupled fluid solver performs as well as the coupled solver that was origi-nally proposed [12]. Table 3.4 gives a summary of the values and boundaryconditions (BCs) used for the simulations. A non-dimensional pressure er-ror p∗ was defined comparing the analytical and numerical implementationsolutions:p∗ = max(|p(x, t)− pa(x, t)|)/max(ρ · u(x, t)2) (3.18)where pa is the analytical solution. The error is calculated as part ofa mesh-refinement study identical to Anderson et al [12]. The study isperformed for 0.01 ≤ αmin ≤ 0.99, for meshes of 3 different discretizations: acoarse mesh (∆x = 0.1, ∆t = 0.02), a medium mesh (∆x = 0.05, ∆t = 0.01)and a fine mesh (∆x = 0.025, ∆t = 0.005). The result of the experimentsare shown in the figure 3.4, where the pressure error (p∗) is plotted againstαmin.The main observations of the results from figure 3.4:• As expected, the error p∗ reduces with the improvement in mesh equal-ity. We see an illustration of the 2nd order accuracy of the discretiza-tion method, as we have an approximate reduction in the numericalerror by a factor of 4 when the mesh is refined by a factor of 2.• We see that the error increases rapidly for highly constricted geome-tries (αmin ≈ 0). Thus, defining a finite Asafe(x, t) in equation 3.11 isimportant to achieve a stable solution.• For the finest mesh, we get good pressure results up to Aclosed/A0 =0.05 or 95% closure. This serves as a valuable tool in choosing bothmesh equality and area warping for our vocal fold solver.503.4. Resultsαmin0 0.2 0.4 0.6 0.8 1p*00.010.020.030.040.050.06coarsemediumfineFigure 3.4: Mesh refinement study where numerical error p∗ is plotted asa function of the factor αmin for three levels of mesh quality. Coarse mesh(blue), Medium mesh (red) and Fine mesh (yellow)• Our results are practically identical to those of Anderson et al [12],with a general error difference (∆p∗) less than 10% between the meth-ods. This implies that our decoupled solution is almost identical tothe coupled solution despite potentially being much faster.To confirm our model’s computational advantages over the coupled so-lution of equations 3.4 and 3.5, we compare their time-based performances.We take the previous analytical equation 3.14, and solve it using both ourdecoupled solution scheme and the coupled solution scheme. We use MAT-LAB’s internally built tic−toc functions to achieve comparable values. Table3.3, shows that our model is almost 50 times faster than the coupled simu-lation when taking a ∆x value of 0.01 and 0.005 for the analytical problemin question.However, we are not able to achieve a relevant direct comparison to the2D Navier-Stokes equation. This is because of two main reasons: firstly, the2D Navier-Stokes is generally solved using dedicated finite-element toolk-its such as ADINA making it an unfair comparison to our MATLAB-basedsolvers. Secondly, the standard 2D Navier-Stokes implementations for MAT-LAB are usually quite primitive, only allowing for steady-state problems513.4. ResultsDiscretization ∆x (m) Solver Time (s)0.01 Decoupled 2.0460320.01 Coupled 91.3302720.005 Decoupled 4.0324970.005 Coupled 195.852746Table 3.3: Performance of decoupled solver vs coupled solver. Simulationconducted for a time period of 2s with ∆t = 0.02s. Length of channel is0.6m.or purely rectangular domains. They also suffer from instability arisingfrom lack of accurate turbulence estimations, non-convergence of the New-ton method for solvers for non-linear problems and lack of numerical sta-bilization [99][93][134]. These are often built into commercially availabletoolkits at the downside of computational cost. However, in general we cansee that our model even for a rectangular domain will be much faster thana 2D Navier-Stokes equation. Taking a standard finite-difference discretiza-tion (100X1 = 100 cells for the 1D model and 100X100 = 10000 cells forthe 2D model) , we can see that for a refinement of the mesh by a factor of2, would imply a 4 times increases in total number of cells for a 2D model(200X200 = 40000 cells). On the other hand, this would only equate to a2 times increase for 1D model (200X1 = 200 cells). This can quickly addup, as we attempt to find a compromise between the mesh quality and thepressure error as seen from figure 3.4. This is an even greater issue when us-ing the FVM or FEM for solving the fluid; achieving requisite mesh qualityfor FSI simulations is an extremely challenging problem. Thus, our modelimplementation represents an extremely fast solution of the flow equations,in comparison to other 1D model implementations and the 2D Navier-Stokesmodels.3.4.2 Flow-Separation ExperimentA critical feature for a robust vocal fold model is its ability to estimateflow separation points. Pelorson et al [91], showed the importance of flowseparation on the self-oscillation of the vocal folds and created a theoreticalmodel to augment the Bernoulli’s equation for estimating the flow-separationpoint. However, most 1D models still make an assumption based on arearatio’s, instead of estimating the actual flow separation location. (Eg. flow523.4. ResultsName Value(s)Inlet velocity BCs uinlet 1.0 m/sInlet pressure BCs pinlet 0.0 PaDensity (ρ) 1.2 kg/m3A0 0.2 m2Length of Domain (x) 1.0 mTable 3.4: A summary of the parameters used for the solution of theanalytical problem. Parameters derived from [12]separates when cross-sectional area is 1.2*amin, where amin is the minimumglottal area). Our fluid model directly accounts for the flow separationthrough the χ and τχ terms.Figure 3.5 shows an illustration of the model’s flow-estimation capacity.We use a modified vocal fold where the inferior edge of the glottis is fixed andthe superior edge is allowed to oscillate. We look closer at the upper surfaceto understand the results better. In general, the flow separation point isbetween 1.2 to 1.4 times of the minimum glottal area which is within therange given in literature [39]. The results are qualitatively similar to thoseobtained by Alipour and Scherer [8], when a computational flow model wasused to study flow separation further. Thus, we can now attempt to use themodel as part of a coupled structural fluid simulation.3.4.3 Coupled Vocal Fold SimulationWe now run a coupled vocal fold model simulation to validate the model’sphonatory response. To achieve a fair comparison, we create a set-up assimilar as possible to both experimental data as well as previously publishedmodels. The current gold-standard in the field is the quasi-3D finite elementmodel by Alipour et al [9], which is one of the few continuum models thathas been simulated in a full coupled acoustic simulation. Similar to thepaper, we implemented a trachea and vocal-tract acoustics model using the1D Kelly-Lochbaum digital filter method [69]. This enables a coupled light-weight acoustic simulation to validate our vocal fold model. Figure 3.6shows the vocal fold shapes over one cycle of phonation. It can immediatelybe seen that our model reproduces one of the major features of phonation,which is the vertical travelling wave. However, it can also be seen that thelower edge of the vocal folds, which leads the collision phase of the vocal533.4. ResultsX(cm)0.6 0.7 0.8 0.9 1Y(cm)0.80.850.90.9511.051.11.151.2Figure 3.5: Flow-separation prediction of the flow model when inferior endof the vocal folds is fixed and superior is allowed to oscillated. The filleddots represent the point of flow separation, assuming flow from left to rightfold oscillation, does not have a significant collision period.To truly validate the vocal fold model, we need to compare its time-dependent pressure and flow behaviour at various stages of the glottal cycle.In figure 3.8, the poses taken by the vocal fold structure during differenttime-steps of the phonation cycle are shown. Figure 3.9 gives the centrelinevelocity values along the axial distance for selected frames. Finally, figure3.10, gives the respective pressure values for the selected frames.As suggested by Alipour et al [9], we can see some clear patterns in thepressure-velocity distributions of the system. The selected frames displayimportant points in the overall phonation cycle, .i.e. a convergent, divergentand neutral glottis. There are specific characteristics we expect to see in thepressure and velocity distributions of each these frames; for example, weexpect negative pressures during vocal fold closing as this enables the vocalfolds to be pulled back together. This is followed by a pressure recoverywithin the glottis itself. These characteristics are clearly seen in the figure3.10, and will be discussed in further detail in section 3.5.543.5. DiscussionFigure 3.6: Vocal fold model shapes in one cycle of vibration. Fundamentalfrequency of oscillation is 146 Hz. Time difference between each time stepis approximately 0.57 msA particularly important quantity for speech synthesis is the glottal flow(Ug). The glottal flow is given as the output from the vocal folds into thevocal tract, and acts as the excitation source to the vocal tract filter. Figure3.7 shows the glottal flow waveform (above) and the time-varying sub-glottalpressure waveform (below). The vocal fold model is coupled to the tracheathrough the subglottal pressure; one of the challenges is to ensure modelstability even when the subglottal pressure varies rapidly in time. As wecan see, there are significant high frequency variations in the subglottalpressure seen in figure 3.7; this value is the boundary condition to the inputof our vocal fold fluid simulation. These variations are also visible on theleft extremity of the graph 3.10.3.5 DiscussionIn this section we look at the results in context of observations noticed inexperimental studies and other canonical papers of the field. The extreme553.5. Discussion0 10 20 30 40 50 60Ug(ml/s)0200400600F0 = 146 HzTime(ms)0 10 20 30 40 50 60P(Pa)50010001500Figure 3.7: Typical glottal waveforms including glottal volume flow (top)and subglottal pressure (bottom).X(cm)0 0.2 0.4 0.6 0.8 1 1.2Y(cm)00.20.40.60.811.2ConvergentNeutralDivergentFigure 3.8: Selected vocal fold frames for the convergent, neutral and diver-gent glottal shapes563.5. DiscussionX(cm)-1 0 1 2 3U(m/s)01020304050ConvergentNeutralDivergentFigure 3.9: Centerline velocity predictions for the convergent, neutral anddivergent glottal shapesdifficulty in accessing the vocal folds has meant that apart from the glottalflow data, directly comparable data is extremely sparse in the field. Thus,most of the comparisons to other models in literature and experimentaldata are qualitative. It should be noted that the paper by Alipour et al [9],included the false vocal-folds as well in their formulation; this would causesome natural deviation in the distributions especially towards the outletof the glottis. However, the earlier models by the group [7][3] are directcomparisons to our model.Vocal Fold MotionFigure 3.6 showed the different shapes taken by the vocal fold model. Theglottis’ behaviour was in line with previous values from literature. Thereis a convergent shapes during glottal opening and divergent shape duringglottal closing. The glottal angles ranged from 60 degrees divergent to 48degrees convergent. As noted previously [9], this implies the presence of arobust mucosal wave. One of the the drawbacks mentioned previously, waswith regards to the short collision time that was observed in the model.This is most likely a consequence of the area warping that is carried outfor the fluid simulation; flow in the glottis still exists but at significantly573.5. DiscussionX(cm)-1 0 1 2 3P(Pascals)-20002004006008001000 ConvergentNeutralDivergentFigure 3.10: Centerline pressure predictions for the convergent, neutral anddivergent glottal shapes583.5. DiscussionGlottal Configuration Peak Vp Peak Va Peak Psp Peak PsaConvergent 34 m/s 33 m/s 670 Pa 690 PaNeutral 33 m/s 32 m/s 520 Pa 515 PaDivergent 43 m/s 45 m/s 990 Pa 950 PaTable 3.5: A comparison of peak centerline velocities and subglottal pres-sures of the model presented in the paper (Vp, Psp) and from Alipour etal [9] (Va, Psa). Note that the values from literature are estimates frompublished graph datadiminished levels. This causes the vocal folds to rebound faster than theyotherwise might when the flow is zero. A potential improvement to the fluidmodel would be the addition of a τsmall term that artificially damps downthe pressures down even further when the area of the fluid model is warped.This will ensure that the main pressure that acts on the vocal folds is onlythe impact pressure, Pim.Velocity DistributionThe velocity distributions generally increased through convergent shapesand decreased through a divergent glottal shape. This is reflected in thehigher peak that the divergent glottis has, with a much quicker fall off. Thisalso points to flow-separation taking place earlier in the cycle. One of theinteresting observations we can make is a slight negative velocity that ispredicted by the model at the outlet of a strongly divergent glottis as shownin figure 3.9. This would correspond to the location of the downstreamvortex that has been shown to play a significant role in the 3D glottis. Thisis both a positive and a drawback for the fluid model: it is positive thatthe model is able to predict the likely rise of turbulence at the outlet of thedivergent glottis, however, a negative velocity is not a physically realisticcharacterization of turbulence in a 1D fluid model, especially when it isused to calculate the velocity flow. However, this does not affect the overallflow of the model, which is heavily predicated on the velocity at the pointof minimum glottal area. Table 3.5 shows that there is strong agreementin peak values predicted by our model in comparison to vocal fold modelsdriven by a 2D Navier-Stokes flow model in literature [9].593.5. DiscussionPressure DistributionAs mentioned by Alipour et al [9], the convergent glottal shapes have highpressures within the glottis and decreasing pressures towards the end. Thedivergent glottis on the other hand tends to show a dip in pressure near theminimum constriction area. This is seen as a negative pressure in the figure,that acts on the vocal folds to pull them together for collision. It is alsoclearly seen that the most open glottal shape in figure 3.8 has the lowestpressure distribution in figure 3.10 (neutral glottis: colour red). There isa noticeable pressure recovery that takes place at the outlet of the vocalfold for the divergent glottal shape; this can again be attributed to the flowseparation in the divergent glottis and associated pressure recovery. Againas seen previously, table 3.5 shows that there is again agreement in peakvalues predicted by our model in comparison to vocal fold models driven bya 2D Navier-Stokes flow model in literature [9]. Thus, the 1D model doesan impressive job of replicating a lot of features of higher-order models inthe results.Glottal WaveformThe glottal waveform is arguably the most important data point to evaluatethe vocal fold model as it acts as the output to the vocal tract simulation.There are a few characteristics that are readily apparent from 3.7:• The glottal flow is slightly skewed to the right but is generally quitesymmetric. This is similar to other continuum models in literature[3][113], but dissimilar to experimentally observed data. We expectedthe opening phase to be significantly more gradual than the closingphase of the model. This is since the negative pressures inside theglottis act quickly with the myoelastic muscle forces, to pull the vocalfolds back together. A possible reason for this disparity is the broaderand less concentrated negative glottal pressure for the divergent glottisshown in figure 3.10.• The average value of the flow is 241 mL/s and the peak flow is about473 mL/s. This is in good agreement with published data; Alipourin his canonical paper [9] reported an average flow value of 191 mL/sand a peak flow of just over 400 mL/s.• The ratio of the closed cycle to open cycle or open quotient is about0.9. This is significantly different from lumped-element models inliterature that usually report cycles of about 0.6-0.7 [26]. However,603.6. Summarythis is again similar to other continuum models in literature. We alsosee that complete closure is achieved for an extremely small part ofthe vocal fold phonation cycle. An obvious explanation, that was alsomentioned previously, is the role of the area-warping in reducing theclosed cycle to open cycle ratio. As an extension of the model, we canlook at including a τsmall term to add extra viscous losses to simulatecomplete closure.• The fundamental frequency of the flow is 146 Hz. This is identical tothe value reported by Alipour et al [3] in the paper containing identicalgeometry to our vocal folds (146 Hz).• The subglottal pressure (Psg) predicted by our model shows very strongresemblance to the inferior glottis pressure recorded by Alipour et al[6] during experimental studies of dynamic glottal pressures on ex-cised larynges. This shows that our model does an excellent job ofreplicating the pressure conditions inside the glottis during phonation.3.6 SummaryIn this chapter, we have presented a vocal fold model to meet the require-ments of articulatory speech synthesis. We started by enumerating the con-siderations that are there when choosing appropriate constituent models forour simulation in section 3.1. Section 3.2, introduced our structural model ofchoice with the numerical implementation procedure. The model is solved inthe structural domain using a 2D finite-element method procedure and wasdiscretized using a 2nd order central scheme. It was also identified in section3.1 that a major lacunae in the field, is the fluid models used to predict theaerodynamic pressures and velocities in the glottis. In particular, predictionof intraglottal pressures, separation point and overall glottal flow rate wereshown to be critical factors that determined the performance of the flowmodel. In addition, only higher dimensional models of the flow were cur-rently capable of estimating the viscous losses that play a critical role in thepressure recovery, and as a consequence the bulk intraglottal pressure dis-tribution. However, 2D and 3D Navier-Stokes solvers are computationallyprohibitive, forcing speech researchers to resort to low-dimensional modelsof the flow to build articulatory synthesizers.Based on these criteria, we introduced and validated a novel 2D finite-element continuum model loosely coupled with a 1D unsteady fluid model.This model is unique in a number of ways. It is the first unsteady 1D613.6. Summaryfluid model that has been used for vocal fold vibration dynamics and firstsuch combination of structural and flow models in literature. The modelis loosely coupled to the solid model, making it possible to use extremelysophisticated solvers in the future seamlessly. The model is presented andformulated in section 3.3. The flow model is capable of handling irregulargeometries, different boundary conditions and closure of the glottis. Wepropose a method for a fast decoupled solution of the flow equations thatdoes not require the computation of the Jacobian matrix. We create animplementation where the model is discretized with a 2nd order temporalaccuracy and 4th order spatial accuracy scheme.The numerical implementation of the fluid model is validated for a stan-dard problem with an analytical solution, and then combined with the struc-tural model. A coupled vocal fold simulation is performed with a tracheaand vocal tract model designed using the Kelly-Lochbaum wave-digital filtermethod. The simulation results are compared with data from literature andshown to be in good agreement. However, significant further work is stillrequired in improving vocal fold simulations. This is discussed in detail inthe final chapter.62Chapter 4Coupled ArticulatorySynthesizerThe main goal of our vocal fold model is articulatory speech synthesis. Inthis chapter we shall build an articulatory synthesizer using our vocal foldmodel as one of the components. In particular, we hope to illustrate thevocal fold model’s utility in the context of articulatory speech synthesis andthe effect of coupling on the system output. Section 4.1 describes the modelformulation and the different components that are involved in building themodel. In Section 4.2, preliminary results are given from the system inquestion. Section 4.3 discusses the significance of the results in the contextof the overall field of articulatory speech synthesis.4.1 Model FormulationBuilding an articulatory synthesizer includes a range of components: a vocalfold model (which includes inside it a structural and flow model), a biome-chanical model of the vocal tract and its articulators, and a vocal tractacoustics model that simulates the acoustic pressure field and gives us thefinal radiated pressure from the mouth. In Chapter 3, we explored a new 2Dstructural model of the vocal folds coupled with an 1D Navier-Stokes basedunsteady flow model. Figure 4.1 gives a look at the different componentsthat are part of our articulatory speech synthesizer. In further subsections4.1.1, 4.1.2 and 4.1.3 we explore the individual components in greater detail.4.1.1 Vocal Tract BiomechanicsThe vocal tract biomechanics is primarily involved with defining a domainfor the acoustic simulation. The vocal tract biomechanics can be definedin two major ways: meshes of the vocal tract can be directly derived frompublished data or from a connected biomechanical model. Most modelsin literature used vocal tracts that are either directly derived from meshes634.1. Model FormulationFigure 4.1: Coupled articulatory synthesizer flowchartor converted to area function notation as shown in figure 4.2. Fant [42]and Story’s [106] data sets, still remain extremely important in the contextof building 1D and 2D vocal tract models. There is a natural correlationbetween the area function description and 1D-tube representation of thevocal tract. The vocal tract is divided into concatenated tubular sectionsof different diameters as shown in figure 4.3. A point to note is that whilefigure 4.3 shows a nasal section, this is not required for our simulation aswe are only modelling vowel shapes. However, the 1D representation losesa lot of the asymmetric structural information that plays a critical rolein the perceptual quality of generated speech; thus, 2D and higher ordermodels have become more popular in literature. The representation of 3Ddata in 2D form is still an interesting open problem in the field that is nowbeing explored. Arnela et al [15] for example, have proposed methods torepresent 3D mesh data in 2D while simultaneously capturing as much of themodelling intricacies as possible. The focus of these systems is to preservethe 3rd dimensional spatial information, while reducing computational loadthat comes with the additional dimensionality.The vocal tract biomechanics can be seen as an additional step that isbuilt on top of this process. First, the airway mesh is extracted for the vocaltract pose at the time instant under consideration. This is then converted644.1. Model FormulationFigure 4.2: Area function representation of vocal tractto the dimensionality of the fluid solver (1D, 2D or even full 3D) and usedto build a structural domain. Finally, the acoustic equation is solved overthis domain. Recently, Anderson et al [13] introduced the latest edition ofthe FRANK model referenced in chapter 2. This model includes a compre-hensive list of hard and soft articulators that can be controlled for speechsynthesis using muscle activations in a physics simulation environment. Asimilar model was also created by Dabbaghchian et al [35]. The 3D airwaymesh can be extracted, and used as the domain for the pressure wave sim-ulation of the vocal tract as shown in figure 4.4. This is later rasterized totransform it to a 2D domain. As the shape of the airway changes from time-step to time-step, we extract the new airway mesh to define our structuraldomain. Thus, this enables the simulation of the vocal tract biomechanics.However, in our current simulation, we focus on preliminary results usingsimple symmetric tubes. This is because the conversion of the 3D mesh to a2D rasterized domain is a process that has not been completely validated inprevious papers [15][132]; therefore we focus on standard data in literaturefor our simulations instead of adding greater uncertainity to the process.4.1.2 Vocal Tract AcousticsAs mentioned in Chapter 2, there have been a range of models suggestedto simulate the acoustics of the vocal tract. They can generally be dividedinto 1D, 2D and 3D models in terms of dimensionality or into different cat-egories based on the type of simulation (eg. digital wave-guide filter, directnumerical flow simulation). For articulatory synthesis, the direct numeri-cal simulation model is particularly attractive. While the other models are654.1. Model FormulationFigure 4.3: Concatenated tube model used for 1D speech synthesisFigure 4.4: Framework for biomechanically driven articulatory speech syn-thesis. The vocal tract airway mesh is extracted from the FRANK modelover time and used as the domain for the acoustic simulation. It is coupledwith the trachea model and vocal fold model to create a complete articula-tory synthesizer.664.1. Model Formulationpotentially faster or easier to implement, we can gain a fundamental un-derstanding of the pressure propagation and velocity of airflow through thismethod. It also ties in logically with the rest of the components of thesystem, i.e. vocal fold models and biomechanical model.Faster simulations can be achieved by using a simplified representation ofthe vocal tract, consisting of a straight concatenation of cylindrical segments[104], and by bounding propagation in one dimension only. This approacheventually allows to reach real-time simulation rates and good results partic-ularly for the first formant of the tube [126][24][107]. However, this methodhas some drawbacks: the model doesn’t include any higher order modesdue to its straight symmetric geometries and struggles to naturally simulateforks/cavities in the vocal tract. While there have been some models thathave suggested remedies [88][60], these model produce loose simulations inthe upper end of the spectrum which affects the naturalness of the sound.On the other hand, 3D models, while very accurate take a prohibitivetime to simulate very short pieces of speech (e.g., 60 minutes [111], 44 min-utes [15] for 5ms of audio).2D Finite-Difference Time DomainWe base our vocal tract solver on the model suggested by Zappi et al [132].They suggested a GPU-based 2D Finite-Difference Time Domain (FDTD)simulation as a compromise for high-quality synthesis. The model achievedreal-time run rates and small positional errors in the first formants. However,the model calls for an extremely complex implementation scheme that en-ables the parallization of the solving of individual cells to achieve a real-timesolution. For our 2D continuum vocal fold model, a solution on the shaderwould be difficult to achieve in conjunction with the 2D FDTD solver. Thus,we create an alternative implementation of the model that runs on the CPUinstead, where computational speed is sacrificed for system inter-operabilityand robustness. There exists the future potential for a complete continuummodel implementation on the GPU in conjunction with the 2D FDTD sim-ulation; this would be a natural step forward after validation of the 2D VFmodel. We also create shader implementations of canonical lumped-elementmodels to act as the glottal excitation for the GPU-based 2D FDTD sys-tem. Details of these implementations are given in subsection 4.1.4. Theequations to be solved is an augmented version of the two-dimensional waveequations written as:∂p∂t+ (1− β)p = −ρc2∆.v (4.1)674.1. Model Formulationβ∂v∂t+ (1− β)v = −β2∆pp+ (1− β)vb (4.2)where p is the pressure at a discretized cell, and v is the velocity. Thisequation allows each point in space to transition between fully solid (wall)and fully open (air) states, via a scalar parameter field 0 ≤ β ≤ 1. Theparameter is changed smoothly at time scales larger than the main system’svibrational time-periods. Thus the equation amounts to linear interpolationbetween the standard wave equation when β = 1 (air), and enforcing someprescribed velocity boundary conditions of v = vb when β = 0 (boundary).For intermediate values of β, the affected region acts partially reflective andpartially transmissive.The above equations are discretized and solved numerically similar tostandard FDTD solvers, using second-order accurate spatial and temporalderivatives with velocities sampled on a staggered grid. The β field is sam-pled at cell centres and 6 Perfectly-Matched Layers (PML) are employedat the edges of the doman to absorb outgoing radiation. Since we use anexplicit scheme for time integration, the time step ∆t and spatial cell size∆x must obey the related Courant-Friedrichs-Levy (CFL) stability condi-tion in two dimensions: ∆t < ∆x/(c)2, where c is the speed of sound in themedium. The original paper [132] also validated the vocal tract system’sperformance by comparing it’s impulse response to formants published inliterature.The choice of this model is logical in the context of our vocal fold simu-lation; we expect that the synergy of dimensionality between the vocal folds(2D) and vocal tract (2D) models will enable more complex geometries to berepresented and their acoustic effects captured. This falls in the continuumbetween fast, but simplified 1D vocal tract models and complex, but com-putationally prohibitive 3D models. Equally, we avoid the dimensionalitymismatch of Alipour et al [9], where a 3D vocal fold model was combinedwith a 1D Kelly-Lochbaum vocal tract. It is unclear if any of the potentialbenefits of the accurate 3D vocal fold simulation, can actually be gleanedthrough a 1D simplistic wave-reflection analog system.4.1.3 Trachea ModelWe use a simple 1D representation of the trachea in building this articulatorysynthesizer; it is an implementation of the 1D digital wave-guide proposedby Kelly-Lochbaum [69]. As we saw in the previous chapter, the trachea684.1. Model FormulationDistance from glottis (cm)-12-10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18Area (cm2)02468Trachea Vocal tractFigure 4.5: Area function representation of trachea with concurrent vocaltract modelmodel plays a critical role in the speech synthesis. A time-varying subglottalpressure (Psg) signal drives the overall vocal fold phonation in addition tothe epilaryngeal pressure (Pe). A standard lung impedance pressure of 800Pa is assumed in this case, and the trachea geometry suggested by Story etal [107] is used. The area-function of the trachea that is used is shown infigure 4.5. Unlike the vocal tract, the trachea is a much more symmetric,cylindrical tube; this makes it an ideal candidate for an 1D representationwithout losing out on accuracy.Name Value(s)Speed of Sound 350 m/sFluid density 1.14 kg/m3Sampling Rate 220500 Hz∆t 4.535e-06 s∆s 0.002244783432338Table 4.1: A list of the model parameters used for the 2D FDTD simulation694.2. Results4.1.4 Alternate Vocal Fold ModelsTo compare the performance of our vocal fold model, we create implemen-tations of two canonical models in literature. Both the two-mass model [64]and the body-cover model [107] are implemented as part of the overall sys-tem. Thanks to the modular nature of the system, these can be switchedin for the 2D FEM vocal fold model with no extra changes required to theoverall functioning of the system. Both the models have shown that they arecapable of reproducing the glottal waveform to a reasonable extent. Theycan also be coupled to the vocal tract simulation easily without the stabilityissues associated with a CFD simulation.4.2 ResultsWe attempt to illustrate our vocal fold model’s ability to drive an entirearticulatory synthesis simulation in a stable and physically realistic manner.The model is coupled together as shown in figure 4.1: this includes the1D trachea, the vocal fold models (two-mass, body cover and continuum)and the 2D FDTD vocal tract. Table 4.1 gives the values used to drivethe 2D FDTD simulation. We used the area functions published by Storyet al [106], where vocal tract area functions were derived from MRI scansof the human airway. This data is a standard for the field, and providesan excellent starting point for our synthesizer. Potentially, meshes can bedirectly extracted from ArtiSynth models [13], as well. Figures 4.6 and 4.7show the vocal fold meshes that we use in the 2D FDTD simulation.The discretization in the spatial and time domains are related by theCourant—Friedrichs—Lewy (CFL) condition for convergence of finite-differencenumerical implementation: ∆s = ∆t ∗ c ∗ √2.0, where ∆s is the spatial dis-cretization, ∆t is the temporal discretization and c is the speed of sound inthe domain.Coupled vs Uncoupled Vocal FoldsOne of the major goals of our system is to drive a complete coupled articu-latory synthesizer. We take the 2D domain shown in figure 4.6 and use our2D continuum vocal fold model as the glottal input. The computed pressurewaveform at the listener is recorded over time; the simulation is run for atime period of 1s. The same experiment is carried out with an uncoupledversion of our 2D vocal fold model instead. This would imply that instead of704.2. Results10 20 30 40 50 60 70 80 90510152025303540Figure 4.6: 2D vocal tract domain for vowel /a/. The boundary includes 6Perfectly Matched Layers (PML) for absorption. The black dot representsthe listener and the left end of the symmetric tube contains the glottal inputsand feedback pressure cells.10 20 30 40 50 60 70 80 90510152025303540Figure 4.7: 2D vocal tract domain for vowel /i/. The boundary includes 6Perfectly Matched Layers (PML) for absorption. The black dot representsthe listener and the left end of the symmetric tube contains the glottal inputsand feedback pressure cells.714.2. ResultsTime[ms]0 10 20 30 40P norm-0.6-0.4-0.200.20.4Coupled FEM VFFigure 4.8: Normalized output pressure for a coupled 2D continuum vocalfold model used as glottal source to 2D FDTD simulation for vowel shape/a/.having time-varying boundary conditions thanks to trachea and vocal tractcoupling, this model will have static boundaries conditions instead. This canbe seen akin to a vocal fold model driven by the entire 800Pa lung pressure,acting at the subglottal duct. The wave-forms for both models are shownbelow in figure 4.8 and 4.9.We can immediately notice from the results the presence of a significantnon-linearity in the coupled simulation that is not present in the uncou-pled simulation. The coupled simulation has significant negative pressuretroughs that are not visible in the uncoupled model. To understand if thisis a function of the coupling or a simulation artifact, we also couple twolower-order models, the two-mass and the body-cover model to the simula-tion as a comparison. Since these models are quite primitive with simplersymmetric glottal flow (Ug) wave-forms, it is unlikely that they would beable to generate such a pressure non-linearity by themselves, thus suggest-ing that it would be a function of the non-linear coupling instead. Figures4.10 and 4.11 present the normalized pressure outputs at the listener forthe articulatory synthesizers driven by the two-mass model and body-cover724.2. ResultsTime[ms]0 10 20 30 40P norm-0.6-0.4-0.200.20.4Decoupled FEM VFFigure 4.9: Normalized output pressure for a uncoupled 2D continuum vocalfold model used as glottal source to 2D FDTD simulation for vowel shape/a/.734.2. ResultsTime[ms]0 10 20 30 40P norm-0.6-0.4-0.200.20.4Coupled BC VFFigure 4.10: Normalized output pressure for a coupled body cover modelused as glottal source to 2D FDTD simulation for vowel shape /a/.model respectively.While the individual wave-forms have different frequencies as expected,both the coupled two-mass and body-cover model also contain the identicalnon-linearity seen in the pressure waveform of our coupled 2D continuummodel. This validates the significant role played by coupling in an articu-latory synthesizer, and reaffirms the non-linear relationship that exists be-tween the source and filter models as suggested by Titze [119]. Table 4.2summarizes the maximum and minimum of the normalized output pressurein the different vocal fold models; we can clearly see that the uncoupledvocal fold model has significantly different results than even simple lumped-element models. It is important to note that the ratio shown in table 4.2has no physical relevance; it’s just a tool to understand the pretty signifi-cant differences between the waveform values. This is an interesting resultmoving forward; it is important to design continuum models that can alsobe coupled stably to ensure the maximum non-linearity of the source-filterrelationship is captured.Finally, as a reference we include a sample graph by Arnela et al [32] ofthe output pressure at the listener for a 3D vocal tract simulation. There744.2. ResultsTime[ms]0 10 20 30 40P norm-0.6-0.4-0.200.20.4Coupled 2M VFFigure 4.11: Normalized output pressure for a coupled two-mass model usedas glottal source to 2D FDTD simulation for vowel shape /a/.VF Model Max Pnorm Min Pnorm Max/Min(Pnorm)Coupled Continuum VF 0.5163 -0.5044 1.0236Uncoupled Continuum VF 0.4333 -0.1833 2.3639Coupled BC VF 0.5260 -0.3829 1.3737Coupled 2M VF 0.3648 -0.3850 0.9475Table 4.2: A comparison of the maximum and minimum output normalizedpressures predicted by different vocal fold models754.3. Discussion and ConclusionsFigure 4.12: Output pressure for a 3D FEM acoustic simulation for vowelshape /a/. Reproduced from [32]are a few significant differences which means that this is only an instructivegraph for reference and is not a complete comparison. Firstly, the modeluses a parametric vocal fold model suggested by Fant et al [43], with a fun-damental frequency (F0) of 110 Hz. This model is not coupled in the truesense; this is unlike self-oscillating models such as ours. Secondly, the im-plementation contains a head model of radiation that can significantly affectthe final pressure waveform. Figure 4.12 shows the pressure waveforms. Wecan see that our results look reasonably similar in terms of overall shape.Since the results from the /i/ simulation were qualitatively similar with thepresence of an identical negative pressure trough for coupled models, wechoose not to reproduce them for the sake of brevity.4.3 Discussion and ConclusionsBuilding an articulatory synthesizer is a non-trivial task. In this chapter,we have put forward one possible articulatory synthesizer utilizing our novelvocal fold model. While the individual components that constitute an ar-ticulatory synthesizer have been designed many times over in literature,764.3. Discussion and Conclusionsmaking these disparate components work together is both conceptually andcomputationally challenging. The difficulties in achieving coupled stablearticulatory synthesis has made speech researchers forgo the possibility ofusing more comprehensive models, for simplicity. In particular, the vocalfold model is often ignored as researchers prefer to focus their efforts onmodelling the vocal tract acoustics; this is driven by the belief that thepayoff is greater for the computational cost in vocal tract modelling.In this chapter, we have attempted to lay down a modular articulatorysynthesis framework, that uses a stable, coupled finite-element vocal foldmodel. Initial results are promising; the output pressures at the mouthseem in line with waveforms seen in literature. However, significant workneeds to be done in testing and validating the entire articulatory synthe-sizer. In general the main contributions of this illustrative case-study arethe following:• Coupling: As mentioned in Chapter 2, the vast majority of vocalfold models remain uncoupled in literature. This is especially truein the case of continuum models, where coupling the model meanstime-varying boundary conditions for the flow simulation. Many flowmodels struggle to handle these rapidly changing boundary conditionsin a stable, physically realistic manner. Our choice of 1D flow modelmakes it easier to achieve a stable coupled simulation as shown insection 4.2. Coupling also adds non-linearities that can significantlychange the final sound generated.• 2D Vocal Tract Model: Apart from Alipour et al [9], this is the onlycomplete articulatory synthesizer using a continuum vocal fold modelin conjunction with models of the trachea and vocal tract. In particu-lar, our model is the only model not using a 1D wave-reflection vocaltract, but solving the entire 2D wave-equation for the vocal tract aswell. Both the 2D FEM vocal fold model and the 2D FDTD model liein the unexplored space between 1D and 3D models, potentially actingas a bridge between the computational and conceptual advantages ofthe models respectively.• Accessibility and Modularity: Most continuum models of the vo-cal folds are either solved using commercial finite-element tool-kitssuch as ANSYS[113] or ADINA[39] or implemented using proprietarycode that belongs to labs [3][135] or even a combination of both [9].This has served as a major hindrance in two major ways: speech re-searchers have struggled to access and reproduce results [4], and more774.3. Discussion and Conclusionsimportantly, combining these models as part of a larger articulatorysynthesizer has remained an unachievable goal. In our model, the en-tire system is completely implemented in MATLAB (1D trachea + 2DFEM vocal folds + 2D FDTD vocal tract) and interfaced naturally. Alink is also provided to the biomechanical toolkit ArtiSynth to drivethe structural vocal tract models using biomechanics.However, there are some significant shortcomings to this study. Theresults are very preliminary; the coupled system needs to go through vasttesting to validate each component apart from just the simulation output.The vocal-tract model is extremely coarse by itself; it is still not clear howa 2D vocal tract should be represented unlike the simple 1D area functionmodel and the full 3D mesh models. Finally, while the possibility for user-specified meshes exists, it hasn’t been displayed yet in this system becauseof the above reasons.78Chapter 5ConclusionsThis thesis presents the definition and validation of a novel vocal fold modeltargeting the application of articulatory speech synthesis. Driven by theneed for vocal fold models that balance the competing priorities of com-putational cost and complexity, we enable speech researchers to potentiallymove beyond simple lumped-element models for building articulatory syn-thesizers. To do so, we propose a model comprising of a 2D structuralmodel loosely coupled with a 1D unsteady flow model; this enables us tocombine the completeness of higher dimensional structural models with thecomputational advantages of 1D fluid models.In Chapter 3, we presented a formulation and implementation of our vo-cal fold model. We start by taking a canonical 2D structural model that usesa linear-elasticity based constitutive equation. A hemilaryngeal continuummodel is considered, with symmetry assumed across the glottal mid-line.We start by writing our 1D fluid equations based on the implementationsof Cancelli et al [28] and Anderson et al [12]. The coupled solution of theseequations however, requires the computation of the Jacobian that could po-tentially reduce many of the computational advantages gleaned by using a1D model over a 2D Navier-Stokes based solver. We thus present a methodfor a fast decoupled solution of the flow equations that does not requirethe computation of the Jacobian matrix; this is achieved through an itera-tive bounded-search procedure that estimates the equivalent velocity-drivenboundary conditions We finally couple these models together, using an aero-dynamic force interpolation scheme.First, we demonstrate the fluid model’s performance for a standard ana-lytical problem in literature; the model shows low non-dimensional pressureerrors with the refinement of the mesh. These results are shown to be simi-lar to previously published results using the coupled Jacobian approach, ata computational cost that is more than an order of magnitude faster. Inaddition, the model can be shown to have an even greater computationaladvantage over 2D Navier-Stokes equations. Secondly, the model is used topredict the flow separation point for a forced-oscillating vocal fold surface.The model predicts that the flow separates at cross-sectional areas between795.1. Discussion1.2 to 1.4 times the minimum glottal area (amin); this is in line with resultsin literature and shows that our 1D fluid model does not suffer from themajor issues that plagues other 1D models based on Bernoulli’s equation.Finally, we validate our model’s performance for a full coupled simulation.This included validation of the vertical mucosal wave, velocity and pressuredistributions over time and values of the glottal flow (Ug).In Chapter 4, we looked at the feasibility of using our vocal fold modelto build a complete articulatory speech synthesizer. We implemented an1D wave-reflection analog of the trachea driven by a constant lung pressure.A 2D Finite-Difference Time-Domain (FDTD) solver based on the workof Zappi et al [132] was implemented in MATLAB to solve the 2D waveequation over the acoustic vocal tract domain. Apart from our vocal foldmodel, two canonical models of the field, the two-mass model and the body-cover model, are also implemented to provide a comparison. We showcasea framework to add user-specified meshes to the vocal tract solver; this caneither be through a biomechanical toolkit such as ArtiSynth or through datafrom literature. We choose to use the area functions from Story et al [106],to create our symmetric vocal tract. The output wave was shown to bephysically realistic for our vocal fold model, and in agreement with otherexemplar vocal fold models.5.1 DiscussionOur results demonstrate that the unique approach of coupling the 1D fluidmodel with a 2D structural model is appropriate for vocal fold phonation.The fluid-structure interaction of the vocal folds is dominated by the solidmodel, and is mainly driven by the bulk fluid pressure rather than minutevariations in the flow. The significant differences in density and viscosity ofthe two bodies also implies that we do not need a tightly-coupled regimeto achieve stability in the FSI. Equally, our method of using string energyto approximate the third dimensional behaviour works to our advantage;while the range of motion of the system is reduced, it is significant enoughto capture the standard cycle of vibration.In particular, we avoid the standard eigen/modal analysis that accom-panies many papers in the field, based on two major reasons. Firstly, pub-lished work in literature has shown that despite many of the model’s abilityto entrain at expected ranges of primary modes, they fail to produce a glot-tal waveform that is physically reasonable. This leads to our second reason:with our goal being articulatory speech synthesis, we focus on the flow model805.1. Discussionthat plays the most critical role in deciding the output of the vocal folds andas a consequence the output of the overall system.In terms of the glottal flow (Ug), there is no standard method in literatureto compute the quantity for 2D and 3D vocal fold models. In general, theglottal flow is understood to be the product of the glottal velocity andthe glottal area; usually the minimum glottal area or glottal area at thepoint of flow separation are considered for the latter. In our simulation, wemultiply the minimum 2D glottal cross-sectional area with the flow velocityat that point; this gives a clean glottal waveform without too many randomvariations. However, it would be interesting to see if other formulations ofglottal flow give radically different glottal waveforms and, as a consequence,different inputs to the vocal tract simulation.A general comment is the lack of watertight validation of our modelwith respect to the field; this arguably remains the biggest drawback ofthis study. This however remains a major issue for the entire field at large;data on the vocal fold is sparse at best, and misleading at worst. Dueto the inaccessibility of the vocal folds, speech scientists have resorted tostudies on excised larynges and fabricated models apart from computationalsimulations. Thus, we are forced to compare our models mainly to existinggraphs or to static laryngeal pressure distributions that do not have a strongresemblance to a dynamic simulation. Therefore, the validation of the modelis a patchwork of observations that are combined together to understand themodel performance. The pressing need in the field is an open computationaldata-set that presents a canonical validation to compare models to. At thispoint in time, the closest comparison remains the model of Alipour et al,recently validated in 2015 [9]. Thus we choose to compare our pressure andvelocity distributions to that model, attempting to correlate our system’spredictions to those seen in the overall model. It is pertinent to rememberthat we will have qualitatively similar and not quanitatively similar resultsto this model; a combination of the model’s slightly different geometry, andits significantly different computational set-up (quasi-3D structural modelwith 2D Navier-Stokes flow model), means that such a comparison wouldbe disingenuous. However, we show that we can achieve similar results, ata significantly lower computational cost, enabling FEM models to be seenas a feasible tool for articulatory synthesis applications.Finally, one of the major achievements of the model is the ability to runcoupled simulations. It is important to keep in mind that the manner ofthis coupling is open to debate. In an 1D flow model for the trachea/vocaltract, there is only a single value through which coupled is achieved makingit a natural coupling with our 1D vocal fold fluid model. However, when815.2. Future WorkFigure 5.1: Components of an Articulatory Synthesizerusing the 2D vocal tract solver, the question arises about which cells wouldbe appropriate for pressure feedback and how should the glottal output befed into the vocal tract solver. This is an open question that now arises withthe possibility of having higher order vocal-tract solvers coupled with vocalfold models, an issue which previously did not exist in the field. This is aconsequence of pushing the envelope of the state-of-the-art in the field.5.2 Future WorkPotential improvements and future work have been noted at the end of eachchapter; here, a few directions from the perspective of articulatory synthesisat large are highlighted. We refer back to figure 5.1 to explore future work,component by component.Trachea SystemThe exploration of the role of the trachea in speech synthesis remains quitenascent. Currently, the accepted wisdom in the field is that the role oftrachea is minimal apart from reflecting the time-varying lung pressure at thesubglottal duct. This is a reasonable assumption to make considering that825.2. Future Workmany models have altogether done away with the trachea and still managedto produce reasonable vocal fold vibrations and vocal tract propagationvalues. The lungs, as they expand or contract, behave like bellows suggestingvelocity or volumetric flow-rate BCs. However, since the lungs are alsolimited in strength from person to person, they behave like a pressure-drivensystem in the limiting simulation case. Potentially, our 1D fluid model couldalso be applied to seamlessly model the lungs as a velocity or pressure BCs,as well as calculating the flow through the trachea. While this might not haveany perceptible difference in modal phonation, as we aim for better voicequalities, it might help in augmenting the subglottal signal with gender-based (different lung capacities) and register-based (different fundamentalfrequence of subglottal signal) information.Glottal SystemWith regards to the glottal system, we shall divide potential future workinto two parts: improvements to our model and general improvements forthe field.One possibility to improve the overall performance of the system is to gofor a fully-coupled, implicit FSI solver to enhance simulation accuracy andstability. Overall, there remains the need to validate the system further;potentially we can look at a sensitivity analysis of our model to understandits performance over a range of values. Alternatively, we can look at using a2D Navier-Stokes model coupled with a 2D structural model and comparingits performance to our system. This will isolate the performance of our1D fluid model and help us better understand its shortcomings. One ofthe improvements mentioned previously was the inclusion of a τsmall termthat can ensure a more realistic collision time for the model. Equally, itwould be useful to understand the role that coupling plays in the overallperformance of the system; this can potentially be understood by runningthorough comparisons of coupled and non-coupled vocal fold models keepingall other parts of the system identical. Finally, the focus of this model wasarticulatory speech synthesis. To make this model usable for that purpose,it needs to be simulated in quasi real-time simulation rates. One potentialsolution is to follow the path suggested by Zappi et al [132] for the vocaltract; by parallelizing and optimizing the solver on the GPU we can ensure acomplete acoustic simulation at extremely fast simulation rates. This wouldbe a massive step-ahead for the entire field at large.In general, one of the main issues holding back vocal fold models is thelack of validation data in literature. One of the major achievements for the835.3. Concluding Remarksfield would be the generation of an open-source data store that providescomparison data for models to be validated against. Equally, there is aneed to have models that are easily accessible by speech researchers; mostcontinuum models are implemented in commercial or proprietary systemsmaking it practically impossible for speech researchers to reproduce thesesystems. This has severely hampered progress in the field, to the extent thatonly a handful of research groups are in any position to make contributionsin this area. We hope that by making our model available freely, we canhelp create systems to effectively compare vocal fold models.Vocal Tract SystemsThere are constant improvements to the state-of-the-art in vocal tract mod-elling. They can be mainly divided into two parts: 1) higher complexitymodels and 2) faster models. In terms of the latter, the use of real-time2D models can enable speech synthesis at usable simulation times, andwith the rise of high-quality GPU’s, even move towards interactive simu-lations. A potential extension of our work would be to generate speechusing biomechanically-driven vocal tract shapes. This would require an im-provement in methods of extracting the centreline and 2D contours from 3Dmeshes, as well as stable dynamic vocal tract simulations.5.3 Concluding RemarksTo conclude, this thesis has presented a new vocal fold model for articulatoryspeech synthesis. The model is significantly faster than existing continuummodels while predicting similar glottal waveforms as higher order vocal foldmodels. The model was used as part of a complete articulatory speechsynthesis simulation, where it produced physically realistic values. This workprovides a starting point for developing an accurate, efficient and interactivearticulatory synthesis toolkit which will eventually enable building betterspeech models and potentially lead to a clearer understanding of speechitself.84Bibliography[1] Seiji Adachi and Jason Yu. Two-dimensional model of vocal fold vi-bration for sound synthesis of voice and soprano singing. The Journalof the Acoustical Society of America, 117(5):3213–3224, 2005.[2] F Alipour, RC Scherer, and VC Patel. An experimental studyof pulsatile flow in canine larynges. Journal of fluids engineering,117(4):577–581, 1995.[3] Fariborz Alipour, David A Berry, and Ingo R Titze. A finite-elementmodel of vocal-fold vibration. The Journal of the Acoustical Societyof America, 108(6):3003–3012, 2000.[4] Fariborz Alipour, Christoph Brucker, Douglas D Cook, Andreas Gom-mel, Manfred Kaltenbacher, Willy Mattheus, Luc Mongeau, Eric Nau-man, Rudiger Schwarze, Isao Tokuda, et al. Mathematical models andnumerical schemes for the simulation of human phonation. CurrentBioinformatics, 6(3):323–343, 2011.[5] Fariborz Alipour and Ronald C Scherer. Pulsatile airflow duringphonation: an excised larynx model. The Journal of the AcousticalSociety of America, 97(2):1241–1248, 1995.[6] Fariborz Alipour and Ronald C Scherer. Dynamic glottal pressures inan excised hemilarynx model. Journal of Voice, 14(4):443–454, 2000.[7] Fariborz Alipour and Ronald C Scherer. Vocal fold bulging effectson phonation using a biophysical computer model. Journal of Voice,14(4):470–483, 2000.[8] Fariborz Alipour and Ronald C Scherer. Flow separation in a com-putational oscillating vocal fold model. The Journal of the AcousticalSociety of America, 116(3):1710–1719, 2004.[9] Fariborz Alipour and Ronald C Scherer. Time-dependent pressureand flow behavior of a self-oscillating laryngeal model with ventricularfolds. Journal of Voice, 29(6):649–659, 2015.85Bibliography[10] Fariborz Alipour and I Titze. Combined simulation of two dimensionalairflow and vocal fold vibration. Status and Progress Report, NationalCenter for Voice and Speech, 8:9–14, 1995.[11] Donald R Allen and William J Strong. A model for the synthesisof natural sounding vowels. The Journal of the Acoustical Society ofAmerica, 78(1):58–69, 1985.[12] Peter Anderson, Sidney Fels, and Sheldon Green. Implementationand validation of a 1d fluid model for collapsible channels. Journal ofbiomechanical engineering, 135(11):111006, 2013.[13] Peter Anderson, Negar M Harandi, Scott R Moisik, Ian Stavness, andSidney Fels. A comprehensive 3d biomechanically-driven vocal tractmodel including inverse dynamics for speech research. In Interspeech2015: 16th Annual Conference of the International Speech Communi-cation Association, pages 2395–2399, 2015.[14] Peter J. Anderson. Modeling the fluid-structure interaction of the up-per airway: towards simulation of obstructive sleep apnea. PhD thesis,University of British Columbia, 2014.[15] Marc Arnela and Oriol Guasch. Two-dimensional vocal tracts withthree-dimensional behavior in the numerical generation of vowels. TheJournal of the Acoustical Society of America, 135(1):369–379, 2014.[16] Pierre Badin, Gerard Bailly, Lionel Reveret, Monica Baciu, ChristophSegebarth, and Christophe Savariaux. Three-dimensional linear ar-ticulatory modeling of tongue, lips and face, based on mri and videoimages. Journal of Phonetics, 30(3):533–553, 2002.[17] Lucie Bailly, Xavier Pelorson, Nathalie Henrich, and Nicolas Ruty.Influence of a constriction in the near field of the vocal folds: Physicalmodeling and experimental validation. The Journal of the AcousticalSociety of America, 124(5):3296–3308, 2008.[18] Stefan Becker, Stefan Kniesburges, Stefan Müller, Antonio Delgado,Gerhard Link, Manfred Kaltenbacher, and Michael Döllinger. Flow-structure-acoustic interaction in a human voice model. The Journalof the Acoustical Society of America, 125(3):1351–1361, 2009.[19] David A Berry, Hanspeter Herzel, Ingo R Titze, and KatharinaKrischer. Interpretation of biomechanical simulations of normal and86Bibliographychaotic vocal fold oscillations with empirical eigenfunctions. The Jour-nal of the Acoustical Society of America, 95(6):3595–3604, 1994.[20] David A Berry and Ingo R Titze. Normal modes in a continuum modelof vocal fold tissues. The Journal of the Acoustical Society of America,100(5):3345–3354, 1996.[21] CD Bertram and TJ Pedley. A mathematical model of unsteady col-lapsible tube behaviour. Journal of Biomechanics, 15(1):39–50, 1982.[22] Christopher D Bertram. Flow-induced oscillation of collapsed tubesand airway structures. Respiratory physiology & neurobiology,163(1):256–265, 2008.[23] Jonas Beskow. Talking heads-Models and applications for multimodalspeech synthesis. PhD thesis, Institutionen för talöverföring ochmusikakustik, 2003.[24] Peter Birkholz, Dietmar Jackèl, and Bernd J Kroger. Simulation oflosses due to turbulence in the time-varying vocal system. IEEE Trans-actions on Audio, Speech, and Language Processing, 15(4):1218–1226,2007.[25] Peter Birkholz, Bernd J Kröger, and Christiane Neuschaefer-Rube.Synthesis of breathy, normal, and pressed phonation using a two-massmodel with a triangular glottis.[26] Peter Birkholz, BJ Kröger, and P Birkholz. A survey of self-oscillating lumped-element models of the vocal folds. Studientexte zurSprachkommunikation: Elektronische Sprachsignalverarbeitung, pages47–58, 2011.[27] Stéphanie Buchaillard, Pascal Perrier, and Yohan Payan. A biome-chanical model of cardinal vowel production: Muscle activations andthe impact of gravity on tongue positioning. The Journal of the Acous-tical Society of America, 126(4):2033–2051, 2009.[28] Claudio Cancelli and TJ Pedley. A separated-flow model forcollapsible-tube oscillations. Journal of Fluid Mechanics, 157:375–404,1985.[29] Roger W Chan and Ingo R Titze. Dependence of phonation thresholdpressure on vocal tract acoustics and vocal fold tissue mechanics. TheJournal of the Acoustical Society of America, 119(4):2351–2362, 2006.87Bibliography[30] Roger W Chan, Ingo R Titze, and Michael R Titze. Further studies ofphonation threshold pressure in a physical model of the vocal fold mu-cosa. The Journal of the Acoustical Society of America, 101(6):3722–3727, 1997.[31] Siyuan Chang, Fang-Bao Tian, Haoxiang Luo, James F Doyle, andBernard Rousseau. The role of finite displacements in vocal fold mod-eling. Journal of biomechanical engineering, 135(11):111008, 2013.[32] Marc Arnela Coll. Numerical production of vowels and diphthongsusing finite element methods. 2015.[33] Douglas Cook, Pradeep George, and Margaret Julias. 2d/3d hybridstructural model of vocal folds. Journal of biomechanics, 45(2):269–274, 2012.[34] Bert Cranen and Louis Boves. On the measurement of glottal flow.The Journal of the Acoustical Society of America, 84(3):888–900, 1988.[35] Saeed Dabbaghchian, Marc Arnela, Olov Engwall, Oriol Guasch, IanStavness, and Pierre Badin. Using a biomechanical model and artic-ulatory data for the numerical production of vowels. In Interspeech,8-12 Sep 2016, San Francisco, pages 3569–3573, 2016.[36] Jianwu Dang and Kiyoshi Honda. Construction and control of a phys-iological articulatory model. The Journal of the Acoustical Society ofAmerica, 115(2):853–870, 2004.[37] Marcelo de Oliveira Rosa, José Carlos Pereira, Marcos Grellet, andAbeer Alwan. A contribution to simulating a three-dimensional larynxmodel using the finite element method. The Journal of the AcousticalSociety of America, 114(5):2893–2905, 2003.[38] MP De Vries, HK Schutte, and GJ Verkerke. Determination of param-eters for lumped parameter models of the vocal folds using a finite-element method approach. The Journal of the Acoustical Society ofAmerica, 106(6):3620–3628, 1999.[39] Gifford Z Decker and Scott L Thomson. Computational simulations ofvocal fold vibration: Bernoulli versus navier–stokes. Journal of Voice,21(3):273–284, 2007.88Bibliography[40] Mickael Deverge, Xavier Pelorson, Coriandre Vilain, P-Y Lagrée,F Chentouf, J Willems, and A Hirschberg. Influence of collision onthe flow through in-vitro rigid models of the vocal folds. The Journalof the Acoustical Society of America, 114(6):3354–3362, 2003.[41] Comer Duncan, Guangnian Zhai, and Ronald Scherer. Modeling cou-pled aerodynamics and vocal fold dynamics using immersed bound-ary methods. The Journal of the Acoustical Society of America,120(5):2859–2871, 2006.[42] Gunnar Fant. Acoustic Theory of Speech Production. Mouton, TheHague, 1960.[43] Gunnar Fant, Johan Liljencrants, and Qi-guang Lin. A four-parametermodel of glottal flow. 1985.[44] Mehrdad H Farahani and Zhaoyan Zhang. Experimental validation ofa three-dimensional reduced-order continuum model of phonation. TheJournal of the Acoustical Society of America, 140(2):EL172–EL177,2016.[45] J Flanagan and Lois Landgraf. Self-oscillating source for vocal-tract synthesizers. IEEE Transactions on Audio and Electroacoustics,16(1):57–64, 1968.[46] JL Flanagan and K Ishizaka. Computer model to characterize the airvolume displaced by the vibrating vocal cords. The Journal of theAcoustical Society of America, 63(5):1559–1565, 1978.[47] Adrian J Fourcin and Evelyn Abberton. First applications of a newlaryngograph. Medical & biological illustration, 21(3):172–182, 1971.[48] Lewis P Fulcher, Ronald C Scherer, Kenneth J De Witt, PushkalThapa, Yang Bo, and Bogdan R Kucinschi. Pressure distributions ina static physical model of the hemilarynx: measurements and compu-tations. Journal of Voice, 24(1):2–20, 2010.[49] Lewis P Fulcher, Ronald C Scherer, Guangnian Zhai, and Zipeng Zhu.Analytic representation of volume flow as a function of geometry andpressure in a static physical model of the glottis. Journal of Voice,20(4):489–512, 2006.[50] Jean-Michel Gérard, Pascal Perrier, and Yohan Payan. 3d biomechan-ical tongue modeling to study speech production, 2006.89Bibliography[51] Bryan Gick, Ian Wilson, and Donald Derrick. Articulatory phonetics.John Wiley & Sons, 2012.[52] Heather E Gunter. Modeling mechanical stresses as a factor inthe etiology of benign vocal fold lesions. Journal of biomechanics,37(7):1119–1124, 2004.[53] Petr Hájek, Pavel Švancara, Jaromír Horáček, and Jan G Švec. Nu-merical simulation of the self-oscillating vocal folds in interaction withvocal tract shaped for particular czech vowels. In Recent GlobalResearch and Education: Technological Challenges, pages 317–323.Springer, Cham, 2017.[54] Negar M Harandi, J Woo, MR Farazi, L Stavness, M Stone, SidneyFels, and Rafeef Abugharbieh. Subject-specific biomechanical mod-elling of the oropharynx with application to speech production. InBiomedical Imaging (ISBI), 2015 IEEE 12th International Symposiumon, pages 1389–1392. IEEE, 2015.[55] Matthias Heil. Stokes flow in collapsible tubes: computation and ex-periment. Journal of Fluid Mechanics, 353:285–312, 1997.[56] Matthias Heil, Andrew L Hazel, and Jaclyn A Smith. The mechanicsof airway closure. Respiratory physiology & neurobiology, 163(1):214–221, 2008.[57] John Henry Heinbockel. Introduction to tensor calculus and continuummechanics, volume 52.[58] Minoru Hirano. Morphological structure of the vocal cord as a vibratorand its variations. Folia Phoniatrica et Logopaedica, 26(2):89–94, 1974.[59] Minoru Hirano and Yuki Kakita. Cover-body theory of vocal foldvibration. Speech science, pages 1–46, 1985.[60] Julio C Ho, Matías Zañartu, and George R Wodicka. An anatomi-cally based, time-domain acoustic model of the subglottal system forspeech production. The Journal of the Acoustical Society of America,129(3):1531–1547, 2011.[61] Eva B Holmberg, Robert E Hillman, and Joseph S Perkell. Glottalairflow and transglottal air pressure measurements for male and femalespeakers in soft, normal, and loud voice. The Journal of the AcousticalSociety of America, 84(2):511–529, 1988.90Bibliography[62] J Horáček, P Šidlof, and JG Švec. Numerical simulation of self-oscillations of human vocal folds with hertz model of impact forces.Journal of fluids and structures, 20(6):853–869, 2005.[63] K Ishizaka and JL Flanagan. Acoustic properties of longitudinal dis-placement in vocal cord vibration. Bell System Technical Journal,56(6):889–918, 1977.[64] Kenzo Ishizaka and James L Flanagan. Synthesis of voiced soundsfrom a two-mass model of the vocal cords. Bell system technical jour-nal, 51(6):1233–1268, 1972.[65] OE Jensen. Instabilities of flow in a collapsed tube. Journal of FluidMechanics, 220:623–659, 1990.[66] Oliver E Jensen and Matthias Heil. High-frequency self-excited os-cillations in a collapsible-channel flow. Journal of Fluid Mechanics,481:235–268, 2003.[67] Jack Jiang, Emily Lin, and David G Hanson. Vocal fold physiology.Otolaryngologic Clinics of North America, 33(4):699–718, 2000.[68] Weili Jiang, Xudong Zheng, and Qian Xue. computational model-ing of fluid–structure–acoustics interaction during voice production.Frontiers in bioengineering and biotechnology, 5, 2017.[69] John L Kelly and Carol C Lochbaum. Speech synthesis. 1962.[70] Sid Khosla, Shanmugam Murugappan, Raghavaraju Lakhamraju, andEphraim Gutmark. Using particle imaging velocimetry to mea-sure anterior-posterior velocity gradients in the excised canine larynxmodel. The Annals of otology, rhinology, and laryngology, 117(2):134,2008.[71] Sid Khosla, Shanmugam Muruguppan, Ephraim Gutmark, andRonald Scherer. Vortical flow field during phonation in an excisedcanine larynx model. Annals of Otology, Rhinology & Laryngology,116(3):217–228, 2007.[72] M Drew LaMar, Yingyong Qi, and Jack Xin. Modeling vocal foldmotion with a hydrodynamic semicontinuum model. The Journal ofthe Acoustical Society of America, 114(1):455–464, 2003.91Bibliography[73] FLE Lecluse, MP Brocaar, and J Verschuure. The electroglottographyand its relation to glottal activity. Folia Phoniatrica et Logopaedica,27(3):215–224, 1975.[74] Johan Liljencrants. A translating and rotating mass model of the vocalfolds. 1991.[75] Gerhard Link, M Kaltenbacher, Michael Breuer, and M Döllinger.A 2d finite-element scheme for fluid–solid–acoustic interactions andits application to human phonation. Computer Methods in AppliedMechanics and Engineering, 198(41):3321–3334, 2009.[76] HF Liu, XY Luo, ZX Cai, and TJ Pedley. Sensitivity of unsteady col-lapsible channel flows to modelling assumptions. International Jour-nal for Numerical Methods in Biomedical Engineering, 25(5):483–504,2009.[77] John E Lloyd, Ian Stavness, and Sidney Fels. Artisynth: A fast inter-active biomechanical modeling toolkit combining multibody and finiteelement simulation. In Yohan Payan, editor, Soft Tissue Biomechani-cal Modeling for Computer Assisted Surgery, pages 355–394. Springer,New York, 2012.[78] Jorge C Lucero and Kevin G Munhall. A model of facial biomechan-ics for speech production. The Journal of the Acoustical Society ofAmerica, 106(5):2834–2842, 1999.[79] Haoxiang Luo, Rajat Mittal, Xudong Zheng, Steven A Bielamow-icz, Raymond J Walsh, and James K Hahn. An immersed-boundarymethod for flow–structure interaction in biological systems with appli-cation to phonation. Journal of computational physics, 227(22):9303–9332, 2008.[80] XY Luo and TJ Pedley. The effects of wall inertia on flow in a two-dimensional collapsible channel. Journal of Fluid Mechanics, 363:253–280, 1998.[81] Shinji Maeda. A digital simulation method of the vocal-tract system.Speech communication, 1(3-4):199–229, 1982.[82] A Marzo, XY Luo, and CD Bertram. Three-dimensional collapse andsteady flow in thick-walled flexible tubes. Journal of Fluids and Struc-tures, 20(6):817–835, 2005.92Bibliography[83] The Mathworks, Inc., Natick, Massachusetts. MATLAB version8.5.0.197613 (R2015a), 2015.[84] Paul Mermelstein. Articulatory model for the study of speech produc-tion. The Journal of the Acoustical Society of America, 53(4):1070–1082, 1973.[85] Peter Meyer, Reiner Wilhelms, and Hans Werner Strube. A quasiartic-ulatory speech synthesizer for german language running in real time.The Journal of the Acoustical Society of America, 86(2):523–539, 1989.[86] Amir K Miri. Mechanical characterization of vocal fold tissue: a reviewstudy. Journal of Voice, 28(6):657–667, 2014.[87] Rajat Mittal, Byron D Erath, and Michael W Plesniak. Fluid dynam-ics of human phonation and speech. Annual Review of Fluid Mechan-ics, 45:437–467, 2013.[88] Parham Mokhtari, Hironori Takemoto, and Tatsuya Kitamura. Single-matrix formulation of a time domain acoustic model of the vocal tractwith side branches. Speech Communication, 50(3):179–190, 2008.[89] Randall B Monsen and A Maynard Engebretson. Study of variationsin the male and female glottal wave. The Journal of the AcousticalSociety of America, 62(4):981–993, 1977.[90] Pertti Palo. A review of articulatory speech synthesis. PhD thesis,Citeseer, 2006.[91] Xavier Pelorson, Avraham Hirschberg, RR Van Hassel, APJWijnands,and Yves Auregan. Theoretical and experimental study of quasisteady-flow separation within the glottis during phonation. application to amodified two-mass model. The Journal of the Acoustical Society ofAmerica, 96(6):3416–3431, 1994.[92] Pascal Perrier, Yohan Payan, Stéphanie Buchaillard, Mohammad AliNazari, and Matthieu Chabanas. Biomechanical models to studyspeech. Faits de langues, 37:155–171, 2011.[93] Per-Olof Persson. Implementation of finite element-based navier-stokes solver 2.094-project. 2002.[94] Petra Pořízková, Karel Kozel, and Jaromír Horáček. Numerical so-lution of compressible and incompressible unsteady flows in channel93Bibliographyinspired by vocal tract. Journal of Computational and Applied Math-ematics, 270:323–329, 2014.[95] Pedro Paulo Leite Do Prado. a target-based articulatory synthesizer.1991.[96] Martin Rothenberg. A new inverse-filtering technique for deriving theglottal air flow waveform during voicing. The Journal of the AcousticalSociety of America, 53(6):1632–1645, 1973.[97] Ronald C Scherer, Daoud Shinwari, Kenneth J De Witt, Chao Zhang,Bogdan R Kucinschi, and Abdollah A Afjeh. Intraglottal pressureprofiles for a symmetric and oblique glottis with a divergence an-gle of 10 degrees. The Journal of the Acoustical Society of America,109(4):1616–1630, 2001.[98] Raphael Schwarz, Ulrich Hoppe, Maria Schuster, Tobias Wurzbacher,Ulrich Eysholdt, and Jörg Lohscheller. Classification of unilateral vo-cal fold paralysis by endoscopic digital high-speed recordings and in-version of a biomechanical model. IEEE transactions on biomedicalengineering, 53(6):1099–1108, 2006.[99] Benjamin Seibold. A compact and fast matlab code solving the in-compressible navier-stokes equations on rectangular domains mit18086navierstokes. m. 2008.[100] Eftychios Sifakis, Andrew Selle, Avram Robinson-Mosher, and RonaldFedkiw. Simulating speech with a physics-based facial muscle model.In Proceedings of the 2006 ACM SIGGRAPH/Eurographics sympo-sium on Computer animation, pages 261–270. Eurographics Associa-tion, 2006.[101] Man Sondhi and Juergen Schroeter. A hybrid time-frequency do-main articulatory speech synthesizer. IEEE Transactions on Acous-tics, Speech, and Signal Processing, 35(7):955–967, 1987.[102] Susan Standring. Gray’s anatomy: the anatomical basis of clinicalpractice. Elsevier Health Sciences, 2015.[103] Ina Steinecke and Hanspeter Herzel. Bifurcations in an asymmetricvocal-fold model. The Journal of the Acoustical Society of America,97(3):1874–1884, 1995.94Bibliography[104] Kenneth N Stevens. Acoustic phonetics, volume 30. 2000.[105] Brad H Story. An overview of the physiology, physics and modelingof the sound source for vowels. Acoustical Science and Technology,23(4):195–206, 2002.[106] Brad H Story. Comparison of magnetic resonance imaging-based vo-cal tract area functions obtained from the same speaker in 1994 and2002. The Journal of the Acoustical Society of America, 123(1):327–335, 2008.[107] Brad H Story and Ingo R Titze. Voice simulation with a body-covermodel of the vocal folds. The Journal of the Acoustical Society ofAmerica, 97(2):1249–1260, 1995.[108] Brad H Story, Ingo R Titze, and Eric A Hoffman. Vocal tract areafunctions from magnetic resonance imaging. The Journal of the Acous-tical Society of America, 100(1):537–554, 1996.[109] Sayoko Takano, Kiyoshi Honda, and Keisuke Kinoshita. Measurementof cricothyroid articulation using high-resolution mri and 3d patternmatching. Acta acustica united with acustica, 92(5):725–730, 2006.[110] Hironori Takemoto, Kiyoshi Honda, Shinobu Masaki, Yasuhiro Shi-mada, and Ichiro Fujimoto. Measurement of temporal changes in vo-cal tract area function from 3d cine-mri data. The Journal of theAcoustical Society of America, 119(2):1037–1049, 2006.[111] Hironori Takemoto, Parham Mokhtari, and Tatsuya Kitamura. Acous-tic analysis of the vocal tract during vowel production by finite-difference time-domain method. The Journal of the Acoustical Societyof America, 128(6):3724–3738, 2010.[112] Keyi Tang. A feasibility study of template-based subject-specific mod-elling and simulation of upper-airway complex. PhD thesis, Universityof British Columbia, 2017.[113] Chao Tao, Jack J Jiang, and Yu Zhang. Simulation of vocal fold impactpressures with a self-oscillating finite-element model. The Journal ofthe Acoustical Society of America, 119(6):3987–3994, 2006.[114] António JS Teixeira, Roberto Martinez, Luís Nuno Silva, Luis MTJesus, Jose C Príncipe, and Francisco AC Vaz. Simulation of human95Bibliographyspeech production applied to the study and synthesis of european por-tuguese. EURASIP Journal on Applied Signal Processing, 2005:1435–1448, 2005.[115] Scott L Thomson, Luc Mongeau, and Steven H Frankel. Aerodynamictransfer of energy to the vocal folds. The Journal of the AcousticalSociety of America, 118(3):1689–1700, 2005.[116] Ingo R Titze. Mechanical stress in phonation. Journal of Voice,8(2):99–105, 1994.[117] Ingo R Titze. Regulating glottal airflow in phonation: Application ofthe maximum power transfer theorem to a low dimensional phonationmodel. The Journal of the Acoustical Society of America, 111(1):367–376, 2002.[118] Ingo R Titze. The Myoelastic Aerodynamic Theory of Phonation. Na-tional Center for Voice and Speech, 2006.[119] Ingo R Titze. Nonlinear source–filter coupling in phonation: Theory a.The Journal of the Acoustical Society of America, 123(4):1902–1915,2008.[120] Ingo R Titze, Sheila S Schmidt, and Michael R Titze. Phonationthreshold pressure in a physical model of the vocal fold mucosa. TheJournal of the Acoustical Society of America, 97(5):3080–3084, 1995.[121] Ingo R Titze and Brad H Story. Rules for controlling low-dimensionalvocal fold models with muscle activation. The Journal of the AcousticalSociety of America, 112(3):1064–1076, 2002.[122] Isao T Tokuda, Jaromir Horáček, Jan G Švec, and Hanspeter Herzel.Comparison of biomechanical modeling of register transitions andvoice instabilities with excised larynx experiments. The Journal ofthe Acoustical Society of America, 122(1):519–531, 2007.[123] Isao T Tokuda, Marco Zemke, Malte Kob, and Hanspeter Herzel.Biomechanical modeling of register transitions and the role of vocaltract resonators a. The Journal of the Acoustical Society of America,127(3):1528–1536, 2010.[124] M Triep, Ch Brücker, and W Schröder. High-speed piv measurementsof the flow downstream of a dynamic mechanical model of the humanvocal folds. Experiments in Fluids, 39(2):232–245, 2005.96Bibliography[125] Janwillem Van den Berg. Myoelastic-aerodynamic theory of voiceproduction. Journal of Speech, Language, and Hearing Research,1(3):227–244, 1958.[126] Kees van den Doel and Uri M Ascher. Real-time numerical solution ofwebster’s equation on a nonuniform grid. IEEE transactions on audio,speech, and language processing, 16(6):1163–1172, 2008.[127] Reiner Wilhelms-Tricarico. Physiological modeling of speech produc-tion: Methods for modeling soft-tissue articulators. The Journal ofthe Acoustical Society of America, 97(5):3085–3098, 1995.[128] Q Xue, R Mittal, X Zheng, and S Bielamowicz. Computational mod-eling of phonatory dynamics in a tubular three-dimensional model ofthe human larynx. The Journal of the Acoustical Society of America,132(3):1602–1613, 2012.[129] Qian Xue and Xudong Zheng. The effect of false vocal folds on la-ryngeal flow resistance in a tubular three-dimensional computationallaryngeal model. Journal of Voice, 2016.[130] Anxiong Yang, Jörg Lohscheller, David A Berry, Stefan Becker, UlrichEysholdt, Daniel Voigt, and Michael Döllinger. Biomechanical mod-eling of the three-dimensional aspects of human vocal fold dynamics.The Journal of the Acoustical Society of America, 127(2):1014–1031,2010.[131] Matías Zañartu, Luc Mongeau, and George R Wodicka. Influence ofacoustic loading on an effective single mass model of the vocal folds.The Journal of the Acoustical Society of America, 121(2):1119–1129,2007.[132] Victor Zappi, Arvind Vasudevan, Andrew Allen, Nikunj Raghuvanshi,and Sidney Fels. Towards real-time two-dimensional wave propaga-tion for articulatory speech synthesis. In Proceedings of Meetings onAcoustics 171ASA, volume 26. ASA, 2016.[133] SM Zeitels. Phonosurgery: Past present and future. OPERA-TIVE TECHNIQUES IN OTOLARYNGOLOGY HEAD AND NECKSURGERY, 10:1–1, 1999.[134] Tao Zhang, Amgad Salama, Shuyu Sun, and Hua Zhong. A compactnumerical implementation for solving stokes equations using matrix-vector operations. Procedia Computer Science, 51:1208–1218, 2015.97Bibliography[135] X Zheng, R Mittal, Q Xue, and S Bielamowicz. Direct-numerical simu-lation of the glottal jet and vocal-fold dynamics in a three-dimensionallaryngeal model. The Journal of the Acoustical Society of America,130(1):404–415, 2011.[136] Xudong Zheng, Steve Bielamowicz, Haoxiang Luo, and Rajat Mittal.A computational study of the effect of false vocal folds on glottalflow and vocal fold vibration during phonation. Annals of biomedicalengineering, 37(3):625–642, 2009.98
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- The voice box : a fast coupled vocal fold model for...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
The voice box : a fast coupled vocal fold model for articulatory speech synthesis Vasudevan, Arvind 2017
pdf
Page Metadata
Item Metadata
Title | The voice box : a fast coupled vocal fold model for articulatory speech synthesis |
Creator |
Vasudevan, Arvind |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | Speech is unique to human beings as a means of communication and many efforts have been made towards understanding and characterizing speech. In particular, articulatory speech synthesis is a critical field of study as it works towards simulating the fundamental physical phenomena that underlines speech. Of the various components that constitute an articulatory speech synthesizer, vocal fold models play an important role as the source of the acoustic simulation. A balance between the simplicity and speed of lumped-element vocal fold models and the completeness and complexity of continuum-models is required to achieve time-efficient high-quality speech synthesis. In addition, most models of the vocal folds are seen in a vacuum without any coupling to the vocal tract model. This thesis aims to fill these lacunae in the field through two major contributions. We develop and implement a novel self-oscillating vocal-fold model, composed of an 1D unsteady fluid model loosely coupled with a 2D finite-element structural model. The flow model is capable of handling irregular geometries, different boundary conditions, closure of the glottis and unsteady flow states. A method for a fast decoupled solution of the flow equations that does not require the computation of the Jacobian matrix is provided. The simulation results are shown to agree with existing data in literature, and give realistic glottal pressure-velocity distributions, glottal width and glottal flow values. In addition, the model is more than order of magnitude faster than comparable 2D Navier-Stokes fluid solvers while better capturing transitional flow than simple Bernoulli-based flow models. Secondly, as an illustrative case study, we implement a complete articulatory speech synthesizer using our vocal fold model. This includes both lumped-element and continuum vocal fold models, a 2D finite-difference time-domain solver of the vocal tract, and a 1D tracheal model. A clear work flow is established to derive model components from experimental data or user-specified meshes, and run fully-coupled acoustic simulations. This leads to one of the few complete articulatory speech synthesizers in literature and a valuable tool for speech research to run time-efficient speech simulations, and thoroughly study the acoustic outcomes of model formulations. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-07-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0348751 |
URI | http://hdl.handle.net/2429/62195 |
Degree |
Master of Applied Science - MASc |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2017-09 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2017_september_vasudevan_arvind.pdf [ 5.86MB ]
- Metadata
- JSON: 24-1.0348751.json
- JSON-LD: 24-1.0348751-ld.json
- RDF/XML (Pretty): 24-1.0348751-rdf.xml
- RDF/JSON: 24-1.0348751-rdf.json
- Turtle: 24-1.0348751-turtle.txt
- N-Triples: 24-1.0348751-rdf-ntriples.txt
- Original Record: 24-1.0348751-source.json
- Full Text
- 24-1.0348751-fulltext.txt
- Citation
- 24-1.0348751.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0348751/manifest