UBC Theses and Dissertations

Visual feedback during speech production Stelle, Elizabeth Leigh 2016

Visual feedback during speech production

by

Elizabeth Leigh Stelle

BA (Hons), The University of Queensland, 2006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Linguistics)

The University Of British Columbia
(Vancouver)

December 2016

© Elizabeth Leigh Stelle, 2016

Abstract

The visual speech signal has a well-established influence on speech perception, and there is growing consensus that visual speech also influences speech production. However, relatively little is known about the response to one’s own visual speech; that is, when it is presented as speech feedback. Since visual feedback is generated by the same speaking event that generates auditory and somatosensory feedback, it is temporally compatible with these typical sources of feedback; as such, it is predicted to influence speech production in comparable ways. This dissertation uses a perturbation paradigm to test the effect visual feedback has on production.

Two delayed auditory feedback experiments tested the effect of different types of visual feedback on two fluency measures: utterance duration and number of speech errors. Visual feedback was predicted to enhance fluency. When the presentation of static and dynamic visual feedback was randomized within a block, utterance duration increased with dynamic visual feedback but there was no change in speech errors. Speech errors were reduced, however, when the different types of visual feedback were presented in separate blocks. This reduction was only observed when dynamic visual feedback was paired with normal auditory feedback, and for those participants who were more verbally proficient.
These results suggest that consistent exposure to visual feedback may be necessary for speech enhancement, and also that the time-varying properties of visual speech are important in eliciting changes in speech production.

In the bite block experiment, participants produced monosyllabic words in conditions that differed in terms of the presence or absence of visual feedback and a bite block. Acoustic vowel contrast was enhanced and acoustic vowel dispersion was reduced with visual feedback. This effect was strongest at the beginning of the vowel and tended to be stronger during productions without the bite block. For a small subset of participants the magnitude of motion of the lower face increased in response to visual feedback, once again without the bite block.

The results of this dissertation provide evidence that visual feedback can enhance speech production, and highlight the multimodal nature of speech processing.

Preface

This dissertation is original work by the author, Elizabeth Leigh Stelle.

The experiments in this dissertation were run under approval of The University of British Columbia Behavioural Research Ethics Board (certificate no. H12-02559) in Canada, and The University of New Mexico Institutional Review Board (certificate no. 12-587) in the USA. The images in Figure 2.1, Figure 4.3, and Figure 4.5 are used with the consent of the participants.

Parts of Chapter 3 have been published as Stelle, E., Smith, C.L., & Vatikiotis-Bateson, E. Delayed auditory feedback with static and dynamic visual feedback. In Proceedings of The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing, 40-45. The lead author (Stelle) designed and implemented the experiment, analyzed the data, and wrote the manuscript. Writing of the manuscript involved input from the additional authors (Vatikiotis-Bateson, Smith).

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Feedback during speech production
    1.1.1 Feedback and feedforward motor control
    1.1.2 Altered speech feedback
    1.1.3 Visual speech feedback
  1.2 On the use of novel feedback
  1.3 Outline of the dissertation
2 Methods
  2.1 Overview
  2.2 Equipment
  2.3 Stimuli
  2.4 Procedure
  2.5 Measures
  2.6 Statistical analyses
    2.6.1 Generalized linear mixed effects models
    2.6.2 Random effects
    2.6.3 Coding schemes
    2.6.4 Evaluating the models
3 Visual feedback during a delayed auditory feedback task
  3.1 Overview
  3.2 Introduction
    3.2.1 Delayed auditory feedback
  3.3 Experiment 1
    3.3.1 Hypothesis and predictions
    3.3.2 Methods
    3.3.3 Results
    3.3.4 Discussion
  3.4 Experiment 2
    3.4.1 Hypothesis and prediction
    3.4.2 Methods
    3.4.3 Results
    3.4.4 Discussion
  3.5 General discussion
4 Visual feedback during a bite block task
  4.1 Overview
  4.2 Introduction
    4.2.1 The effect of visual information on phonological contrasts
    4.2.2 The effect of oral perturbations on vowel production
    4.2.3 Hypothesis and predictions
  4.3 Methods
    4.3.1 Participants
    4.3.2 Stimuli
    4.3.3 Procedure
    4.3.4 Measures
  4.4 Results
    4.4.1 Vowel contrast
    4.4.2 Vowel dispersion
    4.4.3 Lower face magnitude of motion
  4.5 Discussion
5 Discussion
  5.1 Overview
  5.2 Summary of experimental results
  5.3 Multimodal speech production
    5.3.1 Targets of production
    5.3.2 Multimodal feedback integration
    5.3.3 Summary
  5.4 Future work
Bibliography
A Stimuli

List of Tables

Table 3.1 Conditions presented to each participant.
Table 3.2 Disfluency transcription conventions from Brugos and Shattuck-Hufnagel (2012).
Table 3.3 Cross-tabulation of error categories from each coder.
Table 3.4 Means and standard errors of the means of the experimental measures for each condition.
Table 3.5 Fixed effects for the model of (log) utterance duration.
Table 3.6 Fixed effects for the model of speech errors.
Table 3.7 Fixed effects for the model of speech errors with the significant effect of DAF disruption.
Table 3.8 Conditions presented to each participant.
Table 3.9 Cross-tabulation of error categories from each coder.
Table 3.10 Means and standard errors of the means of the experimental measures for each condition.
Table 3.11 Fixed effects for the model of (log) utterance duration with a significant interaction of the predictors.
Table 3.12 Fixed effects for the model of speech errors with the significant auditory feedback effect.
Table 3.13 Fixed effects for the model of speech errors with the significant interaction between auditory feedback, visual feedback, and DAF disruption.
Table 3.14 Number of pauses per 100 sentence for each condition in Experiment 1 and Experiment 2.
Table 3.15 Means and standard errors of the means for the dependent variables in Experiment 1 and Experiment 2.
Table 4.1 The four conditions presented to each participant.
Table 4.2 The four condition orders used in the experiment.
Table 4.3 Summary of the dependent variables and effects structures used in the statistical models of vowel contrast.
Table 4.4 Fixed effects for the model of (cube root) AVS (25%) with the significant visual feedback effect.
Table 4.5 Fixed effects for the model of (cube root) AVS (50%) with a near-significant interaction of the predictors.
Table 4.6 Summary of the dependent variables and effects structures used in the statistical models of vowel dispersion.
Table 4.7 Fixed effects for the model of (cube root) Euclidean distance (25%) with the significant visual feedback and phoneme effects.
Table 4.8 Fixed effects for the model of (cube root) Euclidean distance (50%) with the significant interaction between type of oral perturbation and type of visual feedback.
Table 4.9 Fixed effects for the model of (cube root) Euclidean distance (75%) with the significant effect of phoneme.
Table 4.10 Fixed effects for the model of (log) magnitude of motion. The effect of oral feedback was close to significance.
Table A.1 Stimuli used in the DAF experiments (Chapter 3).
Table A.2 Stimuli used in the bite block experiment (Chapter 4).

List of Figures

Figure 2.1 An example of the visual feedback presented to a participant.
Figure 2.2 Simulated data demonstrating homoscedasticity (left) and heteroscedasticity (right).
Figure 3.1 An example of a complex disfluent event (error code: e.ps).
Figure 3.2 An example of a part-word prolongation.
Figure 3.3 Distributions of utterance duration for each condition.
Figure 3.4 Counts of each type of error per condition.
Figure 3.5 Density plots of speech error counts for each condition.
Figure 3.6 Effect of DAF disruption on speech errors.
Figure 3.7 Distributions of utterance duration for each condition.
Figure 3.8 Counts of each type of error per condition.
Figure 3.9 Density plots of speech error counts for each condition.
Figure 3.10 Interaction between DAF disruption, auditory feedback, and visual feedback.
Figure 4.1 Predicted effects of the experimental manipulations on the variables to be measured.
Figure 4.2 The canonical positions in the vowel space for the General American English monophthongs used in the experiment (adapted from Ladefoged (2006)).
Figure 4.3 Bite block configuration used in the oral perturbation conditions.
Figure 4.4 Mahalanobis distances from a token of the vowel [i] to the distribution of each vowel category.
Figure 4.5 A screenshot from the FlowAnalyzer software with a region of interest (ROI; black rectangle) specified for a particular participant.
Figure 4.6 Comparison of mouth and lower face magnitudes of motion from two ROIs.
Figure 4.7 Vowel space as measured at the 25% point in the vowels.
Figure 4.8 Vowel space as measured at the 50% point in the vowels.
Figure 4.9 Vowel space as measured at the 75% point in the vowels.
Figure 4.10 Means and standard errors for AVS distance for each condition at three points in the vowel.
Figure 4.11 Means and standard errors for Euclidean distance for each condition at three points in the vowel.
Figure 4.12 Means and standard errors for Euclidean distance for each phoneme in each condition at each measurement point in the vowel.
Figure 4.13 Means and standard errors for magnitudes of motion of the lower face for each condition.
Figure 4.14 Different patterns of magnitudes of motion of the lower face in response to visual feedback.
Figure 4.15 Non-high vowels: Magnitudes of motion of the lower face for each participant’s production in each condition.
Figure 4.16 High vowels: Magnitudes of motion of the lower face for each participant’s production in each condition.
Figure 4.17 Correlations between mean MM and mean AVS.
Figure 4.18 Results from Section 4.4.1 and Section 4.4.2 re-plotted to show how the relationship between vowel contrast (top) and vowel dispersion (bottom) changes throughout the vowel.

Acknowledgements

Without question, the best part of my graduate school experience has been the people I’ve met along the way.

My co-advisors, Eric Vatikiotis-Bateson and Caroline Smith, were in many ways a perfect complement to one another. Eric gave me free rein to pursue my research interests while Caroline kept me focused on actually finishing my dissertation. Many thanks go to Eric for being an unflappable advisor who offered forthright guidance and genuine kindness. I am immensely grateful to Caroline for taking me on as a student; not only was she generous with her time and thoughtful with her feedback, but she also made it possible for me to run my experiments at UNM. I couldn’t have finished this work without her. I was fortunate to have two other committee members for my dissertation: Molly Babel and Bryan Gick. Molly’s attention to detail helped me to think more carefully about my work and Bryan encouraged me to consider the bigger picture. My defense was a more energizing and exciting experience than I had anticipated. In addition to my committee, I’d like to thank my university examiners, Kathleen Curry Hall and Janet Werker, and my external examiner, Jeffery Jones, for their insightful questions and engaging discussion.

I had assistance from many grad student colleagues while running my experiments. Thanks go to Michael Fry for error coding, Michael McAuliffe for sharing his OFA Python script, and Alexis Black, Megan Keough, Michael McAuliffe, Ella Fund-Reznicek, and George Stelle for stimuli recordings. I’m not sure I can offer enough thanks to George Stelle for creating all of the experiment software. Thanks also go to the UNM Ling 101 instructors who helped recruit participants. And the UBC Linguistics administrators, Edna Dharmaratne and Shaine Meghji, helped me in so many ways, especially after I moved to Albuquerque; I offer them my sincerest thanks.

As luck would have it I ended up being part of the very best cohort, alongside Heather Bliss, Elizabeth Ferch, Raphael Girard, Murray Schellenberg, Anita Szakay, and Carmela Toews. After bonding over the intensity of coursework, we’ve remained close despite the fact that we’re now scattered around the globe, and I
Beyond mycohort, I’ve formed many cherished friendships with other grad students in mydepartment which I hold close to my heart. Thank you to Alexis Black, AmeliaReis Silva, and Sonja Thoma for generously welcoming me into their homes dur-ing my visits to Vancouver. Alexis was also a steadfast source of encouragementduring the final stretch of dissertation writing; I look forward to returning the fa-vor! Mark Scott and Anita Szakay are the kind of friends you meet as an adult butfeel like you’ve known your entire life. They are part of my fondest memories ofgrad school and continue to be important people in my life. Thank you to Mark fortalking shop, reading drafts, and listening to practice talks. And thank you to Anitafor admonishing me to just get it done!I’ve made many wonderful friends in Albuquerque, but special thanks go toJessica Gross, Haley Groves, and Meg Healy for sharing with me the highs andlows of grad school and parenting. I’m grateful for my family and friends in Aus-tralia; my parents and siblings have cheered me on through the long years of gradschool, as have my dear friends Cate Ryan and Michael O’Brien.Bertrand Russell described a good life as “one inspired by love and guided byknowledge.” While grad school has deepened my appreciation for critical thinkingand scientific inquiry, it is the love I have for my family–George, Adelaide, andSullivan–which inspires me daily. My children won’t remember any of this, but I’llremember their role in my dissertation with deep affection, and be forever in awethat we managed to get through it one piece. My husband... Where do I even begin?George, you have been unwavering in your love and support, and have maintaineda resolute confidence in me even when I lost confidence in myself. And you’vedone all this while working on your own dissertation. 
I can’t adequately express my love and admiration, except to say that, by Russell’s measure, we are living the very best life, and I am thankful for it.

Dedication

For George.
Always.

Chapter 1

Introduction

The flexibility we bring to the task of perceiving and producing speech has received considerable attention. Listeners can integrate information from a variety of sensory modalities. For example, conflicting sensory information can result in percepts from a different category than the inputs (e.g. McGurk and MacDonald, 1976) and ambiguous information from one sensory modality can be influenced by information from another modality (e.g. Bertelson et al., 2003). Speakers make use of several feedback channels during the production process, such as auditory (e.g. Houde and Jordan, 1998) and somatosensory (e.g. Tremblay et al., 2003). There is a high degree of redundancy in the speech production-perception system, a feature which is common to most biological systems and which allows for more robust operation, especially in difficult speaking and listening conditions. Such redundancy allows for variable integration of the sensory modalities in different situations.

This dissertation investigates this flexibility by testing speakers’ ability to incorporate atypical speech feedback signals. Specifically, it looks at real-time visual feedback of one’s own speech production. Visual feedback was chosen as a test case for the effects of atypical feedback since, as perceivers, we have extensive experience with this modality. This familiarity may make it more likely that speakers can make use of this information during production. Just as the addition of a visual speech signal can improve perception accuracy, especially in difficult listening conditions (e.g. Navarra and Soto-Faraco, 2007; Sumby and Pollack, 1954), visual feedback is hypothesized to improve speech production in terms of measures such as fluency and phonemic contrast.
The expectation is that visual feedback will have the strongest effect in difficult speaking conditions, in a parallel manner to the improvement found in perception when a visual signal is added to difficult listening conditions, such as speech in noise. Indeed, there is evidence to suggest that speech production stimulated by an audiovisual signal is more accurate in shadowing tasks than production stimulated by an audio-only signal (Reisberg et al., 1987; Scarbel et al., 2014) and more fluent for aphasic patients (Fridriksson et al., 2015, 2012). Finally, visual feedback has temporal properties that make compatibility with other sources of speech feedback likely, since all sources of feedback are generated from the same act of speaking.

This dissertation presents evidence in support of this hypothesized role for visual feedback during speech production. The experiments reported here use a perturbation paradigm to create difficult speaking conditions in which to compare the presence and absence of visual feedback. Two types of perturbations are used: delayed auditory feedback (auditory perturbation) and bite block speech (somatosensory perturbation). The visual feedback comparison is also made for non-perturbed speech in each experiment. The results of the delayed auditory feedback experiments suggest that it is the time-varying properties of visual feedback, as opposed to the static form properties, that can globally influence the production of whole utterances, but these changes are reduced when the presentation of dynamic and static visual feedback is randomized rather than presented in consistent blocks (Chapter 3). The results of the bite block experiment suggest that changes to production can also be observed in individual segments; vowels exhibit greater acoustic contrast and reduced dispersion when produced with visual feedback. Acoustic contrast is also modestly correlated with lower face movement (Chapter 4).

Before moving forward, a terminological clarification is needed.
The term visual feedback is used variably in the literature. It is often used to refer to the visualization of an acoustic or articulatory property of speech, such as formants (Kartushina et al., 2015), pitch (de Bot, 1984), or jaw position (Loucks and De Nil, 2006). Visual feedback can also be used to refer to biofeedback of non-visible articulators, such as the tongue, by way of ultrasound (Adler-Bock et al., 2007) or electromagnetic midsagittal articulography (Katz and Mehta, 2015). In this dissertation, visual feedback refers to the front view of one’s face, as one would see when looking in a mirror. This is the same view one typically has of an interlocutor, at least when talking face-to-face, and is also consistent with the type of visual stimuli commonly used in audiovisual perception research.

The rest of this chapter consists of a general overview of feedback, including visual feedback, during speech production, taking into account its interaction with feedforward control and considering evidence for the importance of feedback based on experiments using altered speech feedback.

1.1 Feedback during speech production

Feedback refers to information about the sensory consequences of an action that is fed back to the motor control system in order to monitor and, if necessary, change the action. Actions can be changed online or the feedback can be used to train the system and thus change subsequent actions. Three types of sensory feedback that are normally available during speech production include auditory, tactile, and proprioceptive feedback. Auditory feedback is received via both air- and bone-conducted sound. Tactile feedback includes contact between articulators as well as changes in air pressure that are registered as the sensation of touch by the vocal tract walls. Proprioceptive feedback provides information about the movements and position of speech-relevant body structures.
Interestingly, while the oral cavity and lips have very high tactile sensitivity (Miller, 2002), the oro-facial muscles are largely devoid of proprioceptors (Cattaneo and Pavesi, 2014). Instead, mechanoreceptors in the skin, responding to the stretch of skin due to the muscle activity beneath it, most likely provide proprioceptive feedback (Ito and Ostry, 2010).

An alternative term for proprioceptive feedback that is commonly used is (somato)sensory feedback. Early bite block studies referred to the feedback from articulators as “sensory feedback” (e.g. Fowler et al., 1980; McFarland and Baum, 1995; McFarland et al., 1996), with later studies using “somatosensory” (e.g. Lane et al., 2005) or “oro-sensory” (e.g. Namasivayam et al., 2009) as a cover term for both tactile and proprioceptive feedback. Perturbation studies which involve mechanically altering jaw trajectories often refer to this movement information as “somatosensory feedback” (e.g. Feng et al., 2011; Lametti et al., 2012; Nasir and Ostry, 2008; Tremblay et al., 2003). In this dissertation, feedback concerning the movement of and contact between articulators is particularly relevant to the bite block experiment in Chapter 4. In keeping with the terminological practices most common in these types of studies, somatosensory feedback will be used to refer to tactile and proprioceptive feedback.

As stated above, feedback can be used for the control of actions. Control systems are classically categorized as either open-loop or closed-loop control systems, a distinction based on the use of feedback (e.g. Hood, 1998). Open-loop control is not context-dependent; external events do not affect the way a goal is achieved and feedback is not used to ensure a successful end state. In contrast, closed-loop control is context-dependent.
External information is used to achieve the desired end-state, so the control system makes adjustments to its performance during the task.

A simple example is a clothes dryer, a device that can function using either type of control system. Using the timer setting on the dryer is an example of open-loop control. The duration is pre-set and the drying cycle will run to completion even if the clothes become dry before the cycle is complete. Using the moisture sensor setting on the dryer is an example of closed-loop control. The desired moisture level is set and the moisture sensors monitor the clothes (by monitoring for electrical resistance, for example). The dryer is turned off once the desired moisture level has been reached.

Open-loop control requires minimal computation; however, the more complex closed-loop control is better suited to the adaptive abilities critical to biological systems. An example of this is how the hypothalamus uses negative feedback, or error signals, to regulate body temperature (e.g. by activating sweat glands in response to high body temperature, thus initiating evaporative cooling). Closed-loop control has a problem, however: it takes time to generate and process feedback. Feedback from an action is not available immediately; for example, there is a delay of at least 80-100 ms between visual or proprioceptive feedback and its effect on movement, with the delay for reaching movements that have a visual goal being much longer (300-700 ms) (Desmurget and Grafton, 2000). For the motor control of actions to be closed-loop, the various stages involved must be completed within the time frame of a single movement goal, so that the feedback can play its part in ensuring the correct movement is made.
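
The dryer example above can be sketched in a few lines of Python. This is only an illustrative toy, not anything taken from the dissertation: the `Load` class, the whole-number moisture percentage, and the drying rate of 5 percentage points per minute are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class Load:
    """A load of laundry; moisture is a whole-number percentage (hypothetical unit)."""
    moisture: int

    def tumble_one_minute(self, drying_rate: int = 5) -> None:
        # Assumed drying rate: 5 percentage points per minute, floored at 0.
        self.moisture = max(0, self.moisture - drying_rate)


def open_loop_dryer(load: Load, timer_minutes: int) -> int:
    """Timer setting: run the pre-set duration without consulting the clothes."""
    for _ in range(timer_minutes):
        load.tumble_one_minute()
    return timer_minutes


def closed_loop_dryer(load: Load, target_moisture: int, max_minutes: int = 120) -> int:
    """Sensor setting: read the moisture each minute and stop once the target is met."""
    minutes = 0
    while load.moisture > target_moisture and minutes < max_minutes:
        load.tumble_one_minute()
        minutes += 1
    return minutes


if __name__ == "__main__":
    # The timer runs for the full 40 minutes even though the load is dry after 10.
    print(open_loop_dryer(Load(moisture=50), timer_minutes=40))
    # The sensor-driven cycle stops as soon as the target moisture is reached.
    print(closed_loop_dryer(Load(moisture=50), target_moisture=5))
```

The only structural difference between the two functions is whether the loop condition consults the state of the clothes, which is the sense in which closed-loop control is context-dependent.
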
The stages of closed-loop correction involve the registering of an error, calculation of corrective movements, issuing the motor commands for the corrective movements, and the actual implementation of the correction by the muscles (Perkell, 2012).

This temporal limitation is especially problematic for rapid, highly-skilled activities such as speech. For speech production a range of feedback delays have been reported. Some results suggest that closed-loop motor control may be viable for speech tasks. For example, Tiede et al. (2006) perturbed jaw motion in such a way that the acoustics were consequently changed. They found that evidence for compensation was seen in the acoustic domain before the kinematically perturbed vowel was completed, and beginning as early as 75 ms after the onset of the perturbation. However, longer delays are more commonly reported for speech feedback, with averages closer to 100 ms (e.g. Tourville et al., 2008), 200 ms (e.g. Burnett et al., 1998), or upwards of 400 ms (e.g. Purcell and Munhall, 2006). Thus, the possibility for closed-loop control most likely only arises during very controlled speaking tasks where the speech sounds are sustained, and not during running speech (Perkell, 2012).

Despite all this, there remains a role for feedback during speech production, in the context of feedforward motor control.

1.1.1 Feedback and feedforward motor control

The temporal limitation of closed-loop control systems described in the previous section can be addressed by a system that predicts the sensory consequences of an action in advance of actually receiving the feedback. This type of system is known as a feedforward model, and it involves defining a motor plan before an action is started (Desmurget and Grafton, 2000). Given an initial state, an internal model makes a prediction of the sensory consequences of an action, and this prediction can be used as feedback before the actual–or ‘reafferent’–sensory feedback is processed.
This essentially sidesteps the problem of having to wait for the reafferent signal before updating the motor plan, which is particularly useful for fast sequences of actions.

How, then, does peripheral sensory feedback fit with forward models of motor control? Three important roles for feedback that have been identified include: learning the internal model (i.e. the motor-sensory relationship), updating the internal model, and handling sudden, or unexpected, perturbations (Hickok, 2014).

Despite the inadequacies of feedback for online motor control, feedback may be important during development in establishing the sensory-motor relationship (Borden, 1979). The importance of this becomes clear when one of the sources of feedback is absent during development. For example, congenitally deaf infants have a considerable delay in the onset of babbling compared to hearing infants (Oller and Eilers, 1988) and the speech of older deaf children with some residual hearing tends to receive low intelligibility scores, the low scores due largely to phoneme production problems (Smith, 1975). In the case of typical development, once the internal model of speech control has reached a mature state, speech production can proceed primarily under the control of feedforward commands.

The next two roles for feedback identified above involve correcting either persistent or sudden errors. In the State Feedback Control model of speech processing, which builds on neurological evidence for sensory predictions of speech (Ventura et al., 2009), Hickok et al. (2011) propose an external monitoring stage. This step of the motor control process considers whether the actual sensory consequences of an articulation match the predicted sensory consequences. When error signals are generated, not only is the motor controller provided with corrective feedback, but the internal model is also fine-tuned so that future errors are minimized.
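
The general logic described above (predict the sensory consequence, compare it with the sensory signal that arrives later, and use the mismatch to fine-tune the internal model) can be illustrated with a deliberately simple sketch. It is not an implementation of the State Feedback Control model or of any published model: the one-parameter linear `InternalModel`, the `vocal_tract` stand-in, and the `learning_rate` value are hypothetical choices made purely for illustration.

```python
class InternalModel:
    """Toy forward model: predicted sensation = gain * motor command (assumed linear)."""

    def __init__(self, gain: float = 1.0, learning_rate: float = 0.5):
        self.gain = gain                    # current belief about the command-to-sensation mapping
        self.learning_rate = learning_rate  # hypothetical update step size

    def predict(self, command: float) -> float:
        # The prediction is available immediately, before any real feedback arrives.
        return self.gain * command

    def update(self, command: float, predicted: float, actual: float) -> float:
        # Compare predicted and actual consequences, then fine-tune the model.
        # Assumes a non-zero command.
        error = actual - predicted
        self.gain += self.learning_rate * error / command
        return error


def vocal_tract(command: float, true_gain: float = 0.8) -> float:
    """Stand-in for the real system, whose mapping the speaker does not know in advance."""
    return true_gain * command


def run_trials(model: InternalModel, command: float = 1.0, n_trials: int = 5) -> list:
    errors = []
    for _ in range(n_trials):
        predicted = model.predict(command)  # used in place of feedback during the action
        actual = vocal_tract(command)       # reafferent signal, available only later
        errors.append(model.update(command, predicted, actual))
    return errors


if __name__ == "__main__":
    # The prediction error shrinks in magnitude from trial to trial.
    print(run_trials(InternalModel()))
```

In this toy the error signal does only one of the two jobs mentioned in the text, namely tuning the internal model; correcting the ongoing movement itself is left out to keep the sketch short.
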
While the neurological evidence for these components of forward models of speech motor control is still evolving, there is considerable behavioral evidence for speakers adjusting their productions in response to errors. Many studies have used a perturbation paradigm to investigate these adjustments by intentionally causing errors. An overview of this research is presented in the next section.

1.1.2 Altered speech feedback

Perturbing a system is one way to investigate how the various parameters contribute to the overall performance. While this rationale has been the starting point for much research, it is not a universally held proposition. As Borden (1979, p. 312) argues, “Compensation under abnormal circumstances does not mean peripheral feedback is necessary under normal circumstances.” From this perspective, altering a component of the system alters the system entirely, thus the response to a perturbation may not provide much insight into the original system. Even with these reservations, it is nevertheless clear that perturbation studies have been a rich source of information on the role of feedback in speech production, with the potential to yield important insight into the coordination of multiple speech signals (e.g. Zimmermann et al., 1988). Numerous studies have explored the effects of altered speech feedback, and a review of acoustic (both spatial and temporal) and somatosensory perturbations is presented below.

In a classic study of this phenomenon in the acoustic domain, Houde and Jordan (1998) played speakers their auditory feedback through headphones while making incremental changes to the formant frequencies. For example, over the course of the training phase of the experiment, participants gradually heard their productions of pep sound more like pip, which was achieved by lowering F1. Participants compensated for this feedback shift by making their productions sound more like pap, which is produced with a higher F1.
In addition to sustained vowels, this effect has been found with longer utterances. Cai et al. (2011) shifted the F2 minimum of [u] in the word owe as produced in the phrase I owe you a yo-yo. Speakers compensated for the shift by changing their production in a similar fashion to those in the Houde and Jordan study, but to an even greater degree, suggesting that auditory feedback may play a greater role in time-varying gestures compared to sustained vowels.

The nature of this compensatory response to perturbations has been investigated in more detail. As in other studies, Katseff et al. (2012) found that compensation for formant perturbations was incomplete. Interestingly, they also observed that small formant perturbations resulted in greater compensation than large formant perturbations. The authors (and others, e.g. Perkell, 2012) argue that incomplete compensation is due to conflict between auditory and somatosensory feedback; the greater the mismatch between the two feedback signals, the more likely a speaker is to rely on the unperturbed feedback signal, thus minimizing compensation.

The studies just described perturbed spatial properties of the acoustic signal; however, it is also possible to introduce temporal perturbations. Cai et al. (2011) shifted the F2 minimum during the [u] of the word you as produced in the phrase I owe you a yo-yo, either shifting the minimum backward in time (by an average of 45.4 ms) or forward in time (by an average of 24.6 ms). The forward shift, which was described as the “deceleration” condition, resulted in longer durations for several intervals throughout the utterance; the effects of the local perturbation seemed to extend to the F2 landmarks for the whole utterance.
In contrast to the compensatory response to spatial perturbations in the acoustic domain, the compensation to the temporal perturbation was in the same direction as the temporal perturbation.

Larger scale temporal shifts can also be used, such as the delayed auditory feedback (DAF) paradigm. In this setup, auditory feedback is typically delayed by 180-200 ms, the effect of which is to impede fluent speech output (Yates, 1963). As with the more focused temporal shifts described in the Cai et al. (2011) study, the response to the perturbation is in the same direction as the perturbation; that is, auditory feedback is shifted forward in time, or ‘decelerated’, and speakers slow their speech in response. This is in contrast to spatial perturbations, which elicit compensations in the opposite direction to the perturbation, as in Houde and Jordan (1998). There is some speculation that the effects of DAF may be exacerbated in speakers who rely more strongly on feedback during production, whether this is due to instability in their feedforward control system (Chon et al., 2013), or a lack of experience with competing auditory inputs (Fabbro and Darò, 1995). This type of perturbation will be discussed in more detail in Chapter 3.

Altered feedback has also been explored in the somatosensory domain. In an early demonstration that perturbations to articulator trajectories elicit a compensatory response, Kelso et al. (1984) perturbed the jaw motion of a speaker by mechanically applying a constant load during the closing gestures of /bæb/ and /bæz/. One of the differences between the final consonants is the relevance of the upper lip; the upper lip is involved in the articulation of /b/ but not /z/. In response the upper lip was lower for the /b/ closure than it was on trials when the jaw was not perturbed.
This upper lip lowering was not found for the /z/ closure, which was interpreted as evidence that the jaw, lower lip, and upper lip were acting as a coordinative structure; that is, for the phoneme /b/, there was a “temporary marshaling of many degrees of freedom into a task-specific, functional unit” (Kelso et al., 1984, p. 828).

Tremblay et al. (2003) mechanically altered speakers’ jaw trajectories by applying force in the direction of jaw protrusion. Speakers adapted to this perturbation, shifting their jaw trajectory back to the pre-perturbation path. An after-effect was also produced which was in the opposite direction to the perturbation; that is, the jaw trajectory became retracted. Critically, these perturbations did not measurably affect the acoustic properties of speech, and the compensations were only found when the participants produced speech movements and not when they produced non-speech movements.

Altering the feedback that a speaker receives affects current and subsequent productions, providing evidence for a role for feedback in the control of speech production. This role may be limited in normal speaking conditions, when feedforward control is primarily relied upon, but is nonetheless important in the context of perturbed speech.

1.1.3 Visual speech feedback

The visual speech signal contains a wealth of information. For example, labial cues can be used for vowel identification, and jaw motion can be mapped onto syllables. But visually salient speech information extends beyond the lower face. Motion computed from muscle activity at the lips is correlated with activity at the outer regions of the face (Vatikiotis-Bateson et al., 1996). In an examination of the relationship between vocal tract configurations, facial motion, and speech acoustics, Yehia et al.
(1998) demonstrated that orofacial motion is highly correlated with movement of the lips, jaw, and tongue, and also that fairly precise estimates of acoustic parameters (RMS amplitude and line spectrum pairs [1]) can be made from orofacial motion. These correlations between acoustics and facial motion can be further increased when the utterances under examination are produced in the context of more extensive prose (Vatikiotis-Bateson and Yehia, 2002). Abstracting away from facial deformation, rigid body motion of the head has also been found to correlate with speech acoustics; as much as 88% of the variance of speech fundamental frequency can be predicted from head motion (Yehia et al., 2002).

[1] Line spectrum pairs, which are derived from LPC coefficients, map well to vocal tract shapes (Yehia et al., 1998).

Given these visual cues to speech, it is not surprising that visual information plays an important role in the perception of speech. In the first demonstration of this, Sumby and Pollack (1954) showed improved accuracy for word identification for audiovisually presented speech in noise compared to audio-only speech in noise. Audiovisual perception research subsequently expanded in many directions. Complementing the correlation between head motion and F0 described above, the perception of speech output from a talking head was found to improve when head motion was added (Munhall et al., 2004a). Audiovisual enhancement for perception was found even with spatially filtered visual signals (Munhall et al., 2004b). The combination of incongruent audio and visual signals was famously found to result in a type of fused percept that is categorized as something different to the auditory and visual signals (McGurk and MacDonald, 1976). The McGurk effect has proved to be very robust; for example, it can be induced with temporally and spatially dislocated audiovisual signals (e.g.
Jones and Munhall, 1997; Munhall et al., 1996) and with a gender mismatch between the auditory and visual signals (e.g. Green et al., 1991).

While there is still much that remains unclear in the field of audiovisual speech processing (see Vatikiotis-Bateson and Munhall (2015)), this brief tour was presented in order to establish that: 1) speech-relevant information is distributed over the whole face, and, 2) as perceivers, we have extensive experience using this information (although which aspects of the visual signal perceivers are using is an open question). Given this, visual speech information is an ideal candidate for the role of ‘atypical speech feedback’; speakers are already familiar with the relation that the visual signal has to speech output, thus making it more likely that it could be used in a novel context.

Visual speech feedback is not something speakers typically have access to. That being said, the possibility of experiencing this kind of feedback is certainly becoming more common due to video chat software, which usually shows each speaker a small video of their own face next to a large video of their interlocutor. [2] Visual speech feedback has been the topic of a small, but growing, body of research, which has looked at both non-clinical and disordered speech.

[2] The VoIP and videoconferencing software, Skype, is estimated to have generated 2 trillion minutes of video calls in the last 10 years (https://blogs.skype.com/2016/01/12/ten-years-of-skype-video-yesterday-today-and-something-new/).

Only a handful of studies have investigated the effects visual feedback has on the speech produced by non-clinical populations. They have all used a similar experimental paradigm involving difficult speaking conditions, created by delaying the auditory feedback. The first of these had an exploratory goal of looking for any effects there might be when visual information was presented during speech production.
Tye-Murray (1986) had subjects view their mouth in a mirror while speaking with delayed auditory feedback (DAF), but found no effect of visual feedback on utterance duration (the sole dependent variable that was measured). Using a very similar design, but this time measuring speech errors in addition to utterance duration, Jones and Striemer (2007) found that a subset of participants produced fewer speech errors when they had visual feedback. Most recently, Chesters et al. (2015) looked at both immediate and delayed visual feedback in the context of immediate and delayed auditory feedback. Measuring a range of phonetic properties (such as utterance duration and rhythm), the authors found more disruptive effects of delayed visual feedback than immediate visual feedback (i.e. durations and speech errors were increased). A similar paradigm is used for the experiments in Chapter 3, and these studies are discussed in more detail in that chapter.

In their study looking at speech perception of self and others, Sams et al. (2005) compared a number of conditions using both congruent and incongruent audiovisual stimuli. While this study was not looking at speech production per se, one of the conditions required participants to mouth a syllable while watching themselves in a mirror. This visual signal was paired with an audio signal from another speaker to create an audiovisual stimulus. This contrasted with a condition in which both the audio and visual signals came from another speaker. There was no significant difference in syllable identification accuracy between these two conditions; in both cases, participants’ identification of the auditorily presented syllable was adversely affected for incongruent audiovisual stimuli and enhanced for congruent audiovisual stimuli.
The authors suggest that the McGurk effect induced in these conditions involves a similar perceptual mechanism, despite the obvious difference that participants were articulating in one condition but not the other.

Building on research into choral speech effects for people who stutter, visual feedback has been proposed as another means for minimizing speech disruptions. Choral speech, or speaking in time with another person, reduces stuttering behaviors, even when a visual-only model (i.e. someone mouthing an utterance) is provided for stutterers to speak in time with (Kalinowski et al., 2000). Snyder et al. (2009) extended this by providing stutterers with both synchronous and asynchronous (i.e. delayed) visual feedback of their own productions. Both of these experimental conditions resulted in reduced stuttering frequency compared to a control condition with no visual feedback. While Snyder et al. (2009) reported no difference between immediate and delayed visual feedback, Hudock et al. (2011), using a similar experimental design, found a greater reduction in stuttering when visual feedback was delayed.

Visually stimulated speech production

Related to the notion of self-generated visual feedback is research looking at speech produced in response to a visual or audiovisual signal from another speaker.

One type of evidence that visual speech information can affect productions comes from congenitally blind speakers. Ménard and colleagues (2009; 2013) have documented the differences between blind and sighted French speakers in the production of vowels. Differences were found between the two groups in both the acoustic and articulatory domains. Blind speakers produce smaller acoustic contrast distances between vowel categories (2009; 2013), and greater dispersion within vowel categories (2013). In terms of articulation, blind speakers produced a smaller range of upper lip protrusion (i.e. a visible articulation) and a greater range of tongue backing and tongue curvature (i.e.
a non-visible articulation) (2013). These results suggest that visual speech information plays a role in the development of speech targets.

In addition to this evidence from visual deprivation, there is also more direct evidence from the provision of visual speech information. Shadowing tasks require participants to immediately repeat speech. By contrasting audio-only and audiovisual stimuli, these tasks are usually used to explore the role of visual information in speech perception. However, since such studies use speech production output as a dependent variable, they also suggest that visual speech information has a role to play in speech production, a connection also noted by Venezia et al. (2016). Reisberg et al. (1987) found an improvement in terms of percentage of words correctly shadowed, when participants shadowed an audiovisual model compared to an audio-only model. This improvement occurred when shadowing an L2 and a complex prose passage in the L1. Scarbel et al. (2014) also found an improvement with audiovisual shadowing when using a close-shadowing task, which requires participants to repeat speech as quickly as possible. The experiment compared close-shadowing in noise and clear speech conditions and found that reaction times were faster and responses were more accurate in the audiovisual noise conditions. This was especially so for /apa/ stimuli compared to /ata/ and /aka/.

Venezia et al. (2016) investigated sensorimotor integration of visual speech, looking at the neural regions involved when speech production is stimulated by audio-only, visual-only, and audiovisual inputs. Participants were presented four-syllable strings of CV syllables from the three different types of input, and then they engaged in covert rehearsal of the strings.
Covert speech production stimulated by visual-only or audiovisual stimuli was found to involve additional sensorimotor brain regions (the posterior superior temporal region, a multisensory processing area) compared to production stimulated by audio-only. Additionally, regions used during production stimulated by audio-only were activated to a slightly greater extent when the production triggers were visual-only or audiovisual. These results were interpreted as evidence for a dedicated visuomotor speech pathway. This is supported by Fridriksson and colleagues’ (2015; 2012) research showing that patients with damage to their auditory-motor network nevertheless experience enhancement to their speech output when shadowing audiovisual speech.

Summary

This foundational research demonstrates that visual speech information, in the form of either self-generated visual feedback or a visual signal from another speaker, can elicit changes in speech production. These changes range from a reduction in speech errors to faster reaction times, and might generally be thought of as enhancements to speech output. In the case of visual feedback, however, the reported effects are quite variable, and as is often the case in the early stages of a research program, more questions are raised than answered. For example, the effects have primarily been tested in the context of disfluent speech, whether artificially or developmentally induced, so it is unknown whether these effects would still be evident in other contexts, or with different stimuli (e.g. words rather than sentences). There is also the question of which properties of the visual feedback are responsible for the effects; is visual form or visual timing more important?
The experiments in this dissertation address these issues, bringing more data to bear on our understanding of multimodal speech processing.

1.2 On the use of novel feedback

One possible objection to this line of research is that it is not ecologically normal; we see other talking faces, but not typically our own. As such, there may be concerns that any changes to speech output that occur in the context of visual feedback tell us little about the more ecologically natural context of speaking without visual feedback. A similar concern was raised in Section 1.1.2 with regard to altered feedback. Two arguments for using visual feedback are presented here.

The use of visual feedback, a source of feedback that is not typically available, is motivated by the fact that speakers have extensive experience seeing their reflection. In addition, there is a precedent for using novel modality pairings in speech perception research, and this research demonstrates that information from the atypical modality can influence perception.

While it is certainly true that visual feedback of one’s own speech is not normal in the majority of communicative contexts, the modern world affords many opportunities for encountering visual speech feedback. Mirrors are ubiquitous, “so cheap and so common that [we] use them to wallpaper our bathrooms and dance floors” (Kleeman, 2016), with many a conversation taking place in their presence. As noted previously, video chat software can provide a view of both the interlocutor and oneself during a conversation. Thus, opportunities for visual speech feedback are perhaps more abundant than one might first think. But despite examples such as these, most people do not have extensive experience speaking at length while attending to visual feedback.
Part of the motivation for presenting the whole face, and not just the lower face, as visual feedback in the dissertation experiments, was to match the common experience of looking in a mirror.

Speech perception research sets a precedent for the novel pairing of modalities. This “novelty” can be in terms of the congruity of the signals or the naturalness of the particular modality combination. For example, as described in Section 1.1.3, the McGurk effect is observed in response to incongruent audio and visual signals, and some research has extended this by introducing additional spatial, temporal, and gender incongruence. In terms of atypical modality pairings, Fowler and Dekle (1991) tested whether haptic speech information, provided by placing a hand on the speaker’s face, could influence auditory speech perception. They found that “feeling” a /ba/ increased the percentage of /ba/ auditory judgements. Similarly, Gick and Derrick (2009) paired auditorily presented voiced and voiceless stops with an aero-tactile signal: mechanically generated air puffs. Perceivers were better at identifying the voiceless consonants when they were accompanied by a puff of air and worse at identifying the voiced consonants when they were paired with the air puff.

The latter two studies are presented within the framework of ecological perception, which emphasizes the role of the environment in structuring information about perceived events (see Fowler (1986) for a detailed outline of direct-realist theories of perception). According to this theory, properties of environmental events (e.g. speaking) can be jointly signaled by different types of environmentally structured energy (e.g. sound, light); in other words, “visible talking and audible talking... are the same event of talking” (Fowler and Dekle, 1991, p. 817).
This idea of joint specification in the environment is an appealing one, and can easily be extended to rationalize an exploration of novel feedback: visual feedback is structured by the same speaking event that structures auditory and somatosensory feedback, thus all sources of feedback jointly specify the speaking event. [3]

[3] Both Fowler and Dekle (1991) and Gick and Derrick (2009) explicitly tested the role of joint specification in multimodal perception by constructing modality pairings that varied in the amount of experience perceivers would have had with them (e.g. speech-related puffs of air on the hand (more experience) versus speech-related puffs of air on the neck (less experience)). In contrast, visual feedback was used in this dissertation because speakers have experience with it, just not experience with it as a source of feedback.

While direct-realism could be a useful starting point for framing predictions about novel speech feedback effects, it introduces theoretical issues that are not addressed in this dissertation; namely, the debate over the objects of perception (e.g. Diehl and Kluender (1989); Fowler (1986); Liberman and Mattingly (1985); and more recently in the context of mirror neurons: e.g. Hickok (2009); Rizzolatti et al. (2001)). A more general interpretation of these types of multimodal effects is offered by Vatikiotis-Bateson and Munhall (2012). They suggest that the source of information is not critical, so long as the signals have an appropriate temporal pairing and do not contradict ecologically valid multimodal pairings. Framed in this way, visual feedback is predicted to enhance speech output by virtue of the fact that its time-varying properties are compatible with those of the auditory and somatosensory feedback channels, since these different sources of feedback are all generated by the same event of speaking.

Seeing oneself while speaking may not be a normal experience, but the components of that experience are quite normal.
Speakers have extensive experience with visual speech from their simultaneous role as perceivers, and quite often have access to reflections of their own face. Even the use of visual feedback to guide actions is normal, albeit for limb motor control (e.g. Sober and Sabes, 2003). While there is the caveat that altering feedback or adding a new signal may limit our insights into the unperturbed system, these research paradigms have nevertheless enriched our understanding of speech processing. Visual feedback has temporal properties that could facilitate alignment with other feedback channels; as such it can be used to investigate speakers’ responses to novel multimodal feedback contexts.

1.3 Outline of the dissertation

This dissertation presents evidence that visual feedback can be used by speakers to enhance their speech output. This evidence comes from two types of perturbation experiments: delayed auditory feedback (DAF) and a bite block perturbation. The results of these experiments are consistent with a small but growing body of research showing that visual information is relevant not only to perception, but also production. Speech produced in the context of another talking face results in greater accuracy during shadowing tasks compared to shadowing an audio-only signal (Reisberg et al., 1987; Scarbel et al., 2014) and greater fluency for aphasic patients (Fridriksson et al., 2015, 2012) and people who stutter (Kalinowski et al., 2000). Visually stimulated speech production has also been shown to involve a dedicated visuo-motor neural pathway, which may underpin the production changes that occur in response to visual speech (Venezia et al., 2016). In the context of self-generated visual speech feedback, there are indications that speech production is more fluent for stutterers (Snyder et al., 2009) and non-stutterers (Jones and Striemer, 2007) compared to speech that is produced without visual feedback.
The experimental results reported in this dissertation provide further evidence for enhancement to speech produced with visual feedback, and also expand the role of visual speech in production by showing that visual feedback is not only relevant for disfluent speech, but also for the fluent production of speech targets. These findings highlight the need for models of speech production to be able to account for a more expanded notion of multimodality; speakers can make adaptive use of information from a range of modalities, including modalities that would not normally be considered ecologically natural, such as visual speech feedback.

A description of the methodological procedures that are common to the experiments in the dissertation is presented in Chapter 2. This chapter also includes an overview of (generalized) linear mixed effects models and details of the model specifications used in the analyses.

The two experiments in Chapter 3 use a DAF paradigm similar to the three previous studies (Chesters et al., 2015; Jones and Striemer, 2007; Tye-Murray, 1986) which have investigated visual feedback during speech production. Experiment 1 takes as its starting point the observation that speakers often use strategies, such as focusing on their articulations, in order to overcome the challenges of producing speech with DAF. The experiment was designed to minimize the opportunities for developing a strategy when producing speech under visual feedback conditions, by varying the order of stimulus presentation. The results suggest that visual feedback does not have a facilitative effect on production in this case, however.

Experiment 2 reinstates the consistent stimulus ordering used in previous research (Chesters et al., 2015; Jones and Striemer, 2007; Tye-Murray, 1986) and expands on the paradigm by contrasting different types of visual feedback: no visual feedback, static visual feedback, and dynamic visual feedback.
The time-varying component of the visual feedback, compared to just the static form of the face, was predicted to stimulate greater production changes. The results are somewhat conflicting. While dynamic visual feedback did elicit the greatest changes in production, the predicted facilitatory effect was observed when the dynamic visual feedback was paired with normal, rather than delayed, auditory feedback. When it was paired with DAF, speech output was slowed to a greater extent than with no visual feedback.

Chapter 4 reports on an experiment in which visual feedback is presented in a previously untested task. In this experiment, a bite block perturbation was introduced, and production measures were made at the level of individual segments, rather than at the level of whole utterances as in the DAF experiments. The visual feedback was predicted to enhance the production of vowel contrast and minimize the dispersion of the vowels. Overall, the acoustic results supported both of these predictions, but the effect of visual feedback was mostly evident at the beginning of vowels. Optical flow analysis of the lower face motion revealed considerable interspeaker variation, but a subset of speakers tended to produce greater magnitudes of motion for non-high vowels during unperturbed productions with visual feedback.

Chapter 5 places the experimental results in the context of more general, theoretical issues and presents future directions for this research.

Chapter 2

Methods

2.1 Overview

The experiments reported in this dissertation use a perturbation paradigm to assess the effect visual feedback has on speech production. Chapter 3 reports on delayed auditory feedback experiments and Chapter 4 reports on a bite block experiment. What follows are the methodological procedures these experiments have in common.
This chapter describes the equipment used during the experiment and for the recording of stimuli, a basic outline of the experimental procedure, and an overview of the statistical analysis technique.

2.2 Equipment

Stimuli were presented to participants either through Sony MDR-V6 circumaural closed back headphones for the delayed auditory feedback (DAF) experiments or over Altec Lansing ADA215 computer speakers for the bite block experiment. Participants repeated each stimulus, speaking into an Audio-Technica ATR2100-USB Cardioid Dynamic USB microphone, positioned approximately 1-2 inches from the mouth. Participants were video recorded at 30 frames per second (fps) with a Logitech HD Pro Webcam C920 (1080p widescreen), placed on top of a 21.5 inch high-definition LED monitor, approximately 20 inches in front of them. A custom-written program [1] played the stimuli, audiovisually recorded the speech, fed the audio back to the headphones (with or without a delay; DAF experiments only), and displayed different types of visual feedback on the monitor. In the relevant conditions, participants were able to see their whole face displayed on the monitor and were instructed to maintain this position throughout the experiment. Figure 2.1 shows an example of the visual feedback that participants saw.

Figure 2.1: An example of the visual feedback presented to a participant. Participants in the DAF experiments wore headphones, but those in the bite block experiment did not.
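The delayed feedback path described above can be illustrated with a fixed-length delay line. This is only a minimal sketch and not the program used in the experiments: the 44.1 kHz rate is borrowed from the stimulus recordings and assumed here for the feedback path, the 200 ms delay is taken from the typical DAF range cited in Chapter 1, and the names `delay_in_samples`, `DelayLine`, and `apply_delay` are hypothetical.

```python
from collections import deque


def delay_in_samples(delay_ms: float, sample_rate_hz: int) -> int:
    """Convert a feedback delay in milliseconds to a whole number of samples."""
    return round(delay_ms * sample_rate_hz / 1000)


class DelayLine:
    """Fixed-length FIFO that returns each input sample `delay` samples later."""

    def __init__(self, delay: int) -> None:
        self.delay = delay
        # Pre-fill with silence so the first `delay` outputs are zeros.
        self.buffer = deque([0.0] * delay)

    def process(self, sample: float) -> float:
        if self.delay == 0:
            return sample  # no delay: normal (immediate) auditory feedback
        self.buffer.append(sample)
        return self.buffer.popleft()


def apply_delay(samples, delay_ms, sample_rate_hz=44_100):
    """Run a whole signal through a delay line (assumed rate: 44.1 kHz)."""
    line = DelayLine(delay_in_samples(delay_ms, sample_rate_hz))
    return [line.process(s) for s in samples]


# Hypothetical example: a 200 ms delay at 44.1 kHz spans 8820 samples.
print(delay_in_samples(200, 44_100))
```

Under these assumptions the speaker hears each sample 8820 samples after producing it, while a delay of zero passes the signal through unchanged, which corresponds to the normal auditory feedback condition.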
The audio and video were recorded directly to a laptop computer with sufficient processing power and memory to simultaneously perform the necessary signal processing and also run the experiment.

[1] The program used to run the experiments can be accessed at https://github.com/stelleg/daf.

2.3 Stimuli

The lists of sentences and words used in the experiments are presented in Appendix A.

The sentences used as stimuli for the DAF experiments were produced by a female speaker of General American English (Midlands) in a sound attenuating booth with a Samson CO3U multi-pattern condenser microphone and recorded directly to a computer (16 bit, 44.1 kHz).

The words used as stimuli for the bite block experiment were produced by a male speaker of General American English (Pacific Northwest) in a quiet room with an Audio-Technica ATR2100-USB Cardioid Dynamic USB microphone and recorded directly to a computer (16 bit, 44.1 kHz).

2.4 Procedure

All of the experiments reported in this dissertation used a within-participants design with two independent variables. The primary variable of interest was the type of visual feedback participants received during speech production; visual feedback was either absent, static, or dynamic. The experiments used different subsets of these visual feedback options. When visual feedback was absent participants looked at a fixation point on the monitor. Static visual feedback was a still image of the participant’s face taken before the experiment, and dynamic visual feedback was a real-time video of the participant producing the utterance.

The second independent variable manipulated a non-visual source of feedback; either auditory feedback (Chapter 3: delayed auditory feedback) or somatosensory feedback (Chapter 4: bite block).
The specifics of these variables are discussed in the relevant chapters.

Prior to the bite block experiment, participants were shown the words that would be presented auditorily in the experiment, along with sample sentences containing each word (Appendix A). Since these words would be repeated throughout the experiment, a preview was provided to ensure that participants knew what the target form was. Each experiment started with a practice block in which participants repeated an auditorily presented set of sentences (Chapter 3) or words (Chapter 4) that were not part of the main experiment. Participants produced speech in the practice blocks while looking at a fixation point on the monitor, and without any auditory or somatosensory perturbations. The auditory presentation of the stimuli throughout the experiments was self-paced, and participants were given an opportunity to take a break between each condition. Participants were instructed to repeat each sentence or word as accurately as possible. For the DAF experiments they were also instructed to avoid re-starting a sentence if they made an error. For all of the experiments, participants were instructed to look at the monitor while repeating each sentence or word.

After data collection was complete, the video recordings from the DAF experiments revealed that many of the participants were not looking at the monitor for a substantial number of the stimuli. In order to track the effect this might have on the results, each utterance was coded for whether or not the participant was looking at the monitor while they produced the utterance. This coding was based solely on qualitative judgments that were made according to the following criteria.

1. Before the experiment began, participants were required to look at their face on the monitor so that a still image could be captured for use in the experiment.
The region that participants focused on when the still image of their face was taken was classified by the coder as the “looking region.”

2. For any given item, a participant was counted as looking at the monitor if their focus stayed in the looking region for at least half of the time they were speaking.

This information, which will be referred to as the Looking variable, was included in the statistical analyses for the DAF experiments to control for any effect it might have on the dependent variables.

A review of the videos from the bite block experiment showed that participants consistently looked at the monitor throughout the experiment, so the looking variable was not included in the analyses for that experiment.

2.5 Measures

Different measures were used across the experiments. The dependent variables for the DAF experiments were utterance duration and number of speech errors per utterance (Chapter 3). For the bite block experiment the dependent variables included acoustic measures (Euclidean distance and average vowel space, the latter calculated from Mahalanobis distances) and articulatory measures (mean magnitude of motion of the lower face) (Section 4.3.4). The specifics of each measure are discussed in the relevant chapters.

2.6 Statistical analyses

For each experiment, the statistical significance of the results was determined by (generalized) linear mixed effects analyses, a class of regression analysis that incorporates both fixed and random effects.² This type of statistical model is particularly useful for repeated measures experimental designs and has the advantage of being robust in the face of missing data points. For each dependent variable reported in this dissertation, a default model was first constructed, and on the basis of subsequent model criticism, any necessary changes were made. Model criticism (or model validation) assesses whether the model assumptions have been met, and the procedure for checking this is described below in Section 2.6.4.
The default model specifications are described here and any experiment-specific changes to this default are discussed in the relevant chapters.

The analyses were implemented in R (R Core Team, 2014) using either the lme4 package (Bates et al., 2014) or the glmmADMB package (Skaug et al., 2014). Models were constructed to investigate the relationship between each dependent variable and the two independent variables: type of visual feedback and type of non-visual feedback. The levels of the independent variables were treatment coded and entered into the model as fixed effects with an interaction term, and the looking control variable was also added. The maximal random effects structure justified by the design was used. Random effects included intercepts for participants and items,³ as well as by-participant and by-item random slopes for the effect of the interaction between type of visual feedback and type of non-visual feedback (more details on the random effects structure are given below in Section 2.6.2).

While there are several ways to compute p-values for the fixed effects, the simulations reported by Barr et al. (2013) suggest that likelihood ratio tests are one of the best methods. This model comparison approach was used for the analyses presented here. The log likelihood of the full model with the effect in question was compared to the log likelihood of a model without this effect, using the anova() function in R.

² Winter (2013) presents an accessible conceptual overview of mixed effects models.

³ For the analysis in the bite block experiment (Chapter 4), the phoneme variable was included as a random effect instead of the item variable, since phonemes and items were conflated in the stimulus set (i.e.
each phoneme category was represented by a single word).

2.6.1 Generalized linear mixed effects models

While a linear model predicts the dependent variable as a linear function of the independent variables, a generalized linear model takes this linear model and relates it to the dependent variable by way of a link function. A link function is “a continuous function that defines the response of variables to predictors in a generalized linear model, such as logit and probit links. Applying the link function makes the expected value of the response linear and the expected variances homogeneous” (Bolker et al., 2009, p. 128). This is necessary when the dependent variable has certain distributional properties; for example, when it follows a discrete probability distribution such as the binomial distribution. A generalized linear mixed effects model additionally allows for the specification of both fixed and random effects.

The counts of speech errors in the DAF experiments (Chapter 3) conform to a Poisson distribution, and as such are modeled using a generalized linear mixed effects model with a logarithmic link function. The log link provides the relationship between the linear predictors and the mean of the Poisson distribution. Since the data contained more zero counts than expected based on the Poisson distribution (Zuur et al., 2009), zero-inflation was specified in the model to avoid biased parameter estimates and standard errors. The fixed and random effects were the same as those outlined above in Section 2.6.

2.6.2 Random effects

Random effects are “factors with levels randomly sampled from a much larger population,” in comparison with fixed effects which are “factors with repeatable levels” (Baayen, 2008, p. 241). In linguistics experiments, the most common factors included as random effects are participants and stimuli, and that is the case in the experiments presented here. The random effects structure of the model can specify both an intercept and a slope.
The intercept allows the model to make adjustments to the dependent variable on the basis of each level of the random factor. For example, participants will vary in their baseline speech rate, and the model can capture this by adjusting the intercept for each participant. The slope allows the independent variable to have different effects on the levels of the random factor. For example, participants will vary in the effect that delayed auditory feedback has on their speech rate; some participants will only reduce their speech rate a little but others will exhibit substantial decreases in speech rate.⁴

It is possible to use a data-driven approach to the specification of the random effects structure. In this case, likelihood ratio tests are used to determine whether a particular random effects parameter improves the model fit; if it does not, then it is removed (Baayen et al., 2008). An alternative to this is the design-driven approach of Barr et al. (2013). In this case, the maximal random effects structure justified by the experimental design is specified, in keeping with the traditional standard for mixed-model ANOVAs to categorize both participants and items as random effects. For the experiments in this dissertation, that includes both random intercepts and random slopes, as specified in the default model described above. Using simulated data, Barr et al. (2013) showed that such fully specified models do not suffer from a significant loss of power compared to models in which the random effects have been simplified on the basis of model comparison, and they also have the advantage of minimizing Type I error rates. Given this, and the fact that linear mixed effects models are being used in this dissertation for confirmatory hypothesis testing rather than data exploration, the maximal approach to specifying random effects was used.

This approach is not without problems, however.
Sometimes such a model will not converge if there is insufficient data to support the number of parameters introduced by such a complex model. The strategy used in this dissertation to deal with this situation is to simplify the random effects structure in a stepwise manner until the model converges. For example, take as a starting point the maximal model in Equation 2.1, which has been presented using the lme4 package syntax. The first line includes the function call, lmer(), and specifies that the dependent variable (DV) is modeled as depending on an interaction between two independent variables (IV1 and IV2). This is the fixed effects portion of the model. The next two lines specify the random effects structure. To the right of the vertical line are the grouping factors (participant and item) and to the left of the vertical line are the specifications for the grouping factors. The 1 specifies that there is a random intercept for participant and item. The independent variables (IV1 * IV2) specify the way in which the random slope can vary.

    lmer(DV ~ IV1 * IV2 +
         (1 + IV1 * IV2 | participant) +
         (1 + IV1 * IV2 | item), data)        (2.1)

⁴ Note that the intercept and slope parameters that get included in the model are not the individual adjustments for each level of the random factors, but the overall variance of the random factors. The individual adjustments, known as Best Linear Unbiased Predictors, can be calculated from the random effects parameter estimates.

If this model fails to converge, the first simplification would be to remove the interaction term for the random slope. Since there is usually less variance in experimentally-designed stimuli compared to the variance for participants, simplifying the by-item random slope term is the first step.

    lmer(DV ~ IV1 * IV2 +
         (1 + IV1 * IV2 | participant) +
         (1 + IV1 + IV2 | item), data)        (2.2)

If this still fails then both random slope interaction terms can be simplified.
The random slope can be further reduced so that it specifies a single independent variable, or it can be removed completely if necessary.

    lmer(DV ~ IV1 * IV2 +
         (1 + IV1 + IV2 | participant) +
         (1 | item), data)        (2.3)

The models used in this dissertation retained the maximal random effects structure that would allow the model to converge. Any simplifications made to the default models are specified in the relevant sections. It should be noted that simplifying the random slope term affects the model’s interpretation of repeated measures. The only way to specify that a measure is repeated within-participants, for example, is to add the relevant fixed effect to the by-participant random slope. Simplifying the random slope can therefore increase the chance of Type I error. However, in the face of nonconvergence, this is the trade-off that must be made. In their evaluation of different model specifications, Barr et al. (2013) used a similar approach when confronted with convergence issues. They also included the simplified models in the assessment of the performance of the maximal model from which they were derived. They argued that since they were “evaluating analysis strategies rather than particular model structures” (p. 266), this was a legitimate decision which also reflected common practices among researchers when dealing with nonconvergence.

Problems with convergence were occasionally encountered during the model comparison stage of the analyses in this dissertation. Removing a fixed effect in the comparison model (in order to calculate the p-value of the fixed effect) sometimes caused nonconvergence. Barr et al. (2013) also noted this phenomenon. Following their suggestion, the random effects structure of the comparison model was simplified according to the procedure described above. The model including the fixed effect in question was then re-fitted with the same random effects structure so that the likelihood ratio could be calculated.
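
The model comparison step can be sketched in R as follows. This is a minimal illustration on simulated data, not the analysis code used in this dissertation: the data frame d, the effect sizes, and the deliberately simple intercept-only random effects structure are all assumptions made for the example. The point it shows is that the full and comparison models share one random effects structure and differ only in the fixed effect being tested.

```r
library(lme4)  # assumed to be installed; provides lmer()

set.seed(1)

# Hypothetical balanced design: 20 participants x 12 items x 2 x 2 conditions.
d <- expand.grid(
  participant = factor(1:20),
  item        = factor(1:12),
  IV1         = c("static", "dynamic"),
  IV2         = c("normal", "delayed")
)

# Simulated by-participant and by-item intercept adjustments (arbitrary values).
p_adj <- rnorm(20, sd = 0.3)
i_adj <- rnorm(12, sd = 0.2)

# Simulated dependent variable with a main effect of IV2 and an interaction.
d$DV <- 2 +
  0.5 * (d$IV2 == "delayed") +
  0.2 * (d$IV1 == "dynamic" & d$IV2 == "delayed") +
  p_adj[as.integer(d$participant)] +
  i_adj[as.integer(d$item)] +
  rnorm(nrow(d), sd = 0.5)

# Full model and comparison model: identical random effects, fitted with
# maximum likelihood; only the interaction term is dropped in the latter.
full    <- lmer(DV ~ IV1 * IV2 + (1 | participant) + (1 | item),
                data = d, REML = FALSE)
reduced <- lmer(DV ~ IV1 + IV2 + (1 | participant) + (1 | item),
                data = d, REML = FALSE)

# Likelihood ratio test for the interaction.
lrt <- anova(reduced, full)
print(lrt)
```

The p-value for the tested effect is in the "Pr(>Chisq)" column of the second row. If the comparison model had required a simpler random effects structure in order to converge, the same simplified structure would be written into both formulas before calling anova().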
Simplifications made during model comparison are specified in the relevant sections.

One final point concerns whether a random slope for the looking control variable in the DAF experiments should be included in the models. While Barr et al. (2013) acknowledge that there has been limited formal testing of this issue, they suggest that it is not necessary to include a random slope for a control variable if there is no interaction with the fixed effects specified in the model. As such, the models in Chapter 3 do not include random effects for the looking variable.

2.6.3 Coding schemes

The default coding scheme used in this dissertation (and the default in the statistical software used) is treatment coding, also known as dummy coding. This coding scheme compares the levels of the categorical independent variable to a reference level. In cases where collinearity (i.e. high correlations) among the fixed effects is an issue, sum coding can be used instead. This is a way of centering the variables, and compares the levels of the categorical independent variable to the grand mean. Note that collinearity is generally not an issue for experimental data since the independent variables are designed to be quite different and thus not likely to be correlated. Additionally, while coding choice affects the interpretation of the coefficients, it doesn’t affect how good a fit a model is. This means that model comparison, which is used in this dissertation to determine statistical significance, is not affected by coding choice.

Figure 2.2: Simulated data demonstrating homoscedasticity (left) and heteroscedasticity (right). Both panels plot residuals against fitted values.

2.6.4 Evaluating the models

Two assumptions of linear mixed effects models are that the residuals are normally distributed and homoscedastic. Homoscedasticity refers to the residuals having standard deviations that are constant across the range of fitted values.
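
The difference between constant and non-constant residual spread can be illustrated with a small base R simulation. This is only a sketch under assumed values: the sample size, the exponential form of the changing standard deviation, and the helper spread_ratio() are all invented for the example and are not the code used to produce any figure in this dissertation.

```r
set.seed(42)
n <- 500
fitted_vals <- rnorm(n)

# Homoscedastic residuals: the same standard deviation everywhere.
res_homo <- rnorm(n, mean = 0, sd = 1)

# Heteroscedastic residuals: the standard deviation grows with the fitted
# value (the exponential form is an arbitrary choice for illustration).
res_hetero <- rnorm(n, mean = 0, sd = exp(0.5 * fitted_vals))

# Crude numeric check (hypothetical helper): ratio of the residual SD above
# the median fitted value to the residual SD below it.
spread_ratio <- function(res, fit) {
  upper <- fit > median(fit)
  sd(res[upper]) / sd(res[!upper])
}

ratio_homo   <- spread_ratio(res_homo, fitted_vals)
ratio_hetero <- spread_ratio(res_hetero, fitted_vals)
print(c(homoscedastic = ratio_homo, heteroscedastic = ratio_hetero))
```

A ratio near 1 is what constant spread looks like numerically, while a ratio well above 1 reflects the fan shape that a call such as plot(fitted_vals, res_hetero) would show. In practice a visual check of the residual plot is usually more informative than a single summary number.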
A simulation of homoscedastic residuals is shown in the left-hand plot of Figure 2.2. If the residuals do not have constant standard deviations across the range of fitted values, they are described as being heteroscedastic. A simulation of this is shown in the right-hand plot of Figure 2.2.

For the linear mixed models constructed in this dissertation, residual plots were visually inspected to check for obvious deviations from homoscedasticity. If heteroscedasticity was observed, a number of options were tested to improve the model, including fitting the data to a different distribution and transforming the data. Any such changes are specified in the relevant chapters.

For generalized linear mixed models using a Poisson distribution, an important assumption is that the variance of the errors increases with the mean (Baayen, 2008). The output of the modeling function in the glmmADMB package does not include an estimate of the ratio of variance and mean, so an estimate was calculated using a function proposed by developers of some of the commonly used (generalized) linear model packages for R.⁵ To meet the model assumptions, the ratio should be close to 1.

⁵ A description of this overdispersion function can be found at http://glmm.wikidot.com/faq.

Chapter 3

Visual feedback during a delayed auditory feedback task

3.1 Overview

This chapter reports on two experiments that examine the effect visual feedback has on speech produced with delayed auditory feedback. The first experiment randomized the presentation of two types of visual feedback (static and dynamic), while the second experiment separated the presentation of these different types of visual feedback and compared their effects to speech produced without visual feedback. Dynamic visual feedback affected both utterance duration and the number of speech errors in an utterance, but in different ways.
Utterance durations, which were longer with delayed auditory feedback than normal auditory feedback, were further increased with dynamic visual feedback in both experiments. Small differences in speech errors were found in the second experiment only. While the number of speech errors was reduced when dynamic visual feedback was paired with normal auditory feedback, this reduction was only significant when a measure of participants’ disruption by delayed auditory feedback was included in the statistical model; speech errors were reduced with dynamic visual feedback for participants who were minimally disrupted by delayed auditory feedback. These results suggest that it is the time-varying information in visual feedback which has the potential to elicit changes in speech production, and also that speakers need sustained exposure to visual feedback in order for these changes to occur.

3.2 Introduction

3.2.1 Delayed auditory feedback

One method for investigating multimodal sensory processing is to create difficult speaking or listening conditions. For example, visual speech information improves perception accuracy to a greater extent when speech is presented in noise compared to clear speech conditions (Sumby and Pollack, 1954). For speech production, one way to create a difficult speaking task is to present speakers with delayed auditory feedback (DAF).

In a DAF task, speakers produce speech while their auditory feedback is played back to them with a delay. The length of the delay varies between experiments, ranging from very brief (e.g. 25 ms: Stuart et al., 2002) to very long (e.g. 500 ms: Howell et al., 1983). The most disruptive feedback delay is 180-200 ms, beyond which point the disruptive effects start to taper off (e.g. Black, 1951; Howell and Archer, 1984). It has been noted that in order for the delay to be most disruptive, the auditory feedback must be loud enough to mask feedback via bone conduction (e.g. Lee, 1950).
Typical responses to DAF include reduced speech rate (or increased utterance duration), stuttering-like disfluencies, such as repetitions and prolongations, and increased intensity (e.g. Stuart et al., 2002; Yates, 1963). In a comprehensive investigation of the reported effects of DAF, Chesters et al. (2015) found significant effects on a range of measures: utterance duration, consonant duration, and vowel duration all increased, as did the total number of speech errors and the mean intensity. Mean pitch and pitch variation decreased, and there was a change to measures of rhythm. Disruption from DAF is amplified at fast speaking rates; Stuart et al. (2002) found significantly more speech errors during a reading passage spoken at a fast rate. Task differences have also been found; DAF-induced speech rate decreases were greater for reading than conversation (Corey and Cuddapah, 2008).

The effect DAF has on speech kinematics has also been investigated. Starting from research findings that suggest that the response to a stimulus depends on the point in the phase of activity to which it is applied, Zimmermann et al. (1988) looked at jaw movement relative to the onset and offset of the vocalic portions of DAF. With 100 ms DAF, the onset of DAF mostly occurred during the steady state voiced portion of the syllable (defined as the period between the movement offset of the CV gesture and the movement onset of the VC gesture). With 200 ms DAF, the onset of DAF mostly occurred during the closing gesture of the syllable. Despite these different onset alignments, the offset for both DAF delays occurred mostly during the closed portion of the syllable (defined as the period between the movement offset of the VC gesture and the movement onset of the CV gesture). In other words, consonant closure durations were lengthened with 200 ms DAF relative to 100 ms DAF. Sasisekaran (2012) looked at lip aperture variability and lip movement duration in response to 200 ms DAF.
As predicted, DAF resulted in greater values for both of these measures compared to unperturbed speech. This difference was also found in comparison to speech produced with gated feedback, suggesting that the effects are specific to DAF and not just any type of feedback perturbation.

Yates (1963) notes that there is a wide range of individual differences in response to DAF. Some of this can be attributed to gender differences; males tend to be more adversely affected (i.e. produce more speech errors) than females (Corey and Cuddapah, 2008). However, this isn’t the case for all measures of disruption: Chon et al. (2013) found that female participants exhibited a greater reduction in articulation rate than males, while Sasisekaran (2012) found that the duration of labial trajectories was longer for males than females. Yates (1963) also reports a possible correlation between verbal facility, or intelligibility, and the effect of DAF. Fabbro and Darò (1995) took up this question, comparing simultaneous interpreters, who have some experience with DAF, to a control group, and found that there was no difference in the number of speech errors between the normal and delayed conditions for the interpreters, but the control group made significantly more errors with DAF. The interpreters did have longer syllable durations with DAF, however, as did the control group. The authors suggest that speakers with high verbal facility may rely on auditory feedback less than speakers with lower verbal facility, and hence show a resistance to DAF.

Many researchers have noted that it is possible to counteract the effects of DAF by diverting attention away from the auditory feedback. Suggestions for how this could be achieved include focusing on articulatory movements (Lee, 1950) or bone conducted feedback (Yates, 1963), or simply “not listening” (Goldiamond et al., 1962).
Lee (1950) also noted that participants attempted to slow their speech in order to avoid disfluencies.

This ability to counteract DAF effects has been explored in the context of adaptation studies. Following extended exposure to DAF, adaptation to the delay has been reported to occur to some degree, but generally not completely. There is also the potential for adaptation to be prevented by changing delay time or intensity (Yates, 1963). Katz and Lackner (1977) looked at these issues in more detail, with the goal of verifying that adaptation effects were not simply due to stabilization that occurs at the beginning of a task. Participants showed a reduction in speech errors and utterance duration, as well as aftereffects once the DAF was removed (increased speech rate). Additionally, it was observed that the degree of adaptation depended on the degree of disruption caused by the initial exposure to DAF; the more disrupted (in terms of both speech errors and duration) the greater the adaptation that followed. Further support for adaptation to DAF was found in later studies (e.g. Attanasio, 1987; Venkatagiri, 1980).

The participants in the Katz and Lackner (1977) study reported trying to use a variety of strategies to counteract the effects of DAF, ranging from changing their speech rhythm or speech rate, to focusing on their articulators. Attanasio (1987) explored the way that different strategies affect adaptation, comparing speakers who varied in their intrinsic sensitivity to somatosensory feedback (operationalized as the ability to discriminate between different shapes based on oral feedback), and by explicitly instructing half of the participants to focus on articulations. The results showed that the amount of adaptation was influenced by intrinsic oral awareness; speakers who had high sensitivity to oral somatosensory feedback adapted more than those with low sensitivity.
However, receiving explicit instructions to focus on articulations did improve adaptation for speakers with low oral feedback sensitivity. The use of strategies led Katz and Lackner (1977) to propose that adaptation to DAF is an active process on the part of speakers, requiring “conscious strategic effort” (p. 482).

3.2.1.1 DAF with visual feedback

There is a small body of research which has investigated whether providing visual feedback would ameliorate the disruptive effects of DAF. These researchers hypothesized that, given the interaction of auditory and visual information in perception, a similar relationship may exist for production. DAF tasks were used since the visual signal would provide information that was synchronous with the actual speech output, and this could potentially counteract the effects of the asynchronous acoustic feedback. The results from this research have been mixed.

Tye-Murray (1986) was the first to investigate this. Participants read and memorized a sentence before producing it. During production they were presented with 200 ms DAF while simultaneously receiving visual feedback via a mirror. They were able to see from their neck to their mouth only, and they were specifically instructed to attend to the image of their mouth during the task. Utterance duration was measured, but no effect of visual feedback was found.

Jones and Striemer (2007) expanded this design; sentences were presented auditorily, which participants repeated with 180 ms DAF, and with the addition of noise to mask the real-time auditory feedback. The visual feedback condition was compared to a condition without visual feedback and one in which participants read the sentences. The number of speech errors per utterance was measured in addition to utterance duration. For the analysis, participants were split into two groups according to the number of speech errors they produced in the DAF condition with no visual feedback.
Those who produced more errors were classified as high-disruption, and those who produced fewer errors were classified as low-disruption. While there was no effect of visual feedback overall for either of the variables measured, the low-disruption group produced significantly fewer speech errors during production with DAF and visual feedback compared to the DAF reading condition.

In a pilot study to the Tye-Murray (1986) paper, the author had tested the effect of delayed visual feedback (2000 ms) on production. While overall there was no effect of the delay, two of the 13 participants produced errors that were clearly related to the visual feedback; for example, producing [sI] instead of [soU] during visual alignment with [It]. This effect was explored more systematically by Chesters et al. (2015), who tested both immediate and delayed visual feedback paired with 200 ms DAF. As with Jones and Striemer (2007), the sentences to be repeated were presented aurally, and masking noise was also presented throughout the experiment. Immediate visual feedback, which was provided by way of a mirror, did not reduce the effects of DAF; however, there was an effect of visual feedback in the normal auditory feedback condition: the total duration of consonant segments was increased in this condition. Delayed visual feedback, which was provided by a delayed video stream and was either 200 ms, 400 ms, or 600 ms, had no significant effect on the speech measures when paired with normal auditory feedback but did in the DAF condition. Utterance duration, total number of speech errors, and total duration of vowel segments all had increased values, and there was also a change to rhythm.
These differences held for all levels of the visual delay, except for the speech errors; speech errors increased with the 200 ms and 400 ms delays but not the 600 ms delay.

Overall then, results for the effect of visual feedback on production have been variable, both in terms of the direction of the effects and whether they are evident in combination with immediate or delayed auditory feedback.

A note on fluency and disfluency

Before moving forward, fluency and disfluency will be discussed in more detail, since the effects of DAF are discussed in these terms. While fluency can intuitively be thought of as “flowing” or “smooth” speech, it can be difficult to pinpoint specific features that contribute to this. In his review of these issues, Lickley (2014) notes that part of the problem is the different levels or domains at which fluency can be described: for example, a listener’s perception of fluency versus measurable disturbances in speech production, or fluency in planning versus fluency in performance. Equally challenging is defining disfluency. As noted by Lickley (2014, pp. 451-452), disfluencies are a normal part of speech: “While speakers may vary in the frequency of disfluencies in their speech, everyone is disfluent some of the time.” While there have been a range of proposals for what constitutes disfluency, recently there has been some consensus on disfluent features of normal speech; several features commonly identified are filled pauses, repetitions, substitutions, insertions, and deletions (Lickley, 2014). Repetitions, substitutions, and insertions can involve words (either part or whole) or phrases. Prolongations are also commonly identified as a type of disfluency, especially in the context of experimentally induced disfluency (e.g. through DAF) and developmental stuttering (e.g. Corey and Cuddapah, 2008; Stuart et al., 2002).

While speech errors may seem an obvious example of disfluency, changes in speech rate, or utterance duration, may be less so.
However, these measures are consistently found to be strong predictors of fluency for both non-native and native speech. For example, Bosker et al. (2013) found that mean length of syllables was more highly negatively correlated with fluency ratings of non-native Dutch speech than other objective measures, such as number of pauses and number of corrections. Similar results were found with articulation rate for fluency ratings of French L2 speakers with certain tasks (Préfontaine et al., 2016). Using acoustic manipulations of speech rate (syllables per second including pauses) and articulation rate (syllables per second excluding pauses), Bosker et al. (2014) found that speed is weighted similarly for native and non-native speech when judging fluency. Temporal disfluency is a global increase in duration that goes beyond duration increases due to phonological processes such as phrase-final strengthening.

3.3 Experiment 1

As described above, speakers are known to actively use strategies to counteract the disruptive effects of DAF. Visual feedback that is provided during such a task could be used as one such strategy; for example, by focusing on the movement of the lips. In fact, most of the experiments using visual feedback have presented it in a way that makes this more likely. It is an open question as to whether visual feedback can still elicit changes when the opportunities for using it as a means to counteract DAF are minimized. This can be achieved by not directing attention to visual articulations and by changing the presentation order of stimuli.

Previous experiments have often presented visual feedback in a targeted manner. As described in Section 1.1.3, Snyder et al. (2009) looked at the effect visual feedback had on developmental stuttering. While they did observe improvements in stuttering when visual feedback was presented, the effect was only found when participants actively attended to oral speech movements in the visual feedback signal.
This observation was first made during the piloting stage of the study, and consequently explicit instructions were given in the main experiment for participants to attend to the motion of the articulators. In the context of non-disordered populations, the experiments in both Tye-Murray (1986) and Jones and Striemer (2007) did not give participants instructions to focus on the mouth; however, this was implicit in the fact that just the lower half of the face was presented as visual feedback. In contrast, Chesters et al. (2015) presented participants with a view of their whole face during visual feedback conditions, although they did not state what instructions were given for this condition. In the condition with no visual feedback, participants looked at a fixation point “at an equivalent position to their mouth when viewing their mirror image” (p. 876). This suggests that there was an expectation that participants would look at their mouth when presented with visual feedback, although it is unknown whether this was because they received instructions to do so.

Previous experiments have also presented visual feedback in consistent blocks. But varying the order of presentation of stimuli, either within or across experimental blocks, is one way to assess the robustness of multimodal speech processes. Taking the within-block case, multiple conditions can be presented in a single block instead of presenting one condition per block. In their investigation of the influence of auditory and haptic information on speech perception, Fowler and Dekle (1991) ran three experiments, two of which contrasted in the order of presentation of stimuli. In their first experiment, conditions were blocked such that audio only judgments of syllable identity were made in one block and audio-haptic judgments were made in a separate block. 
This procedure was changed in their third experiment so that trials from each condition were mixed in a single block; any one block comprised a random mix of audio only, haptic only, and audio-haptic judgments. They found a reduced effect of the felt syllable on the heard syllable in the third experiment compared to the first experiment, but the effect was nonetheless significant.

Order can also be manipulated across blocks. Varying the order of experimental blocks, where each block contains stimuli from a single condition, is commonly used for counterbalancing. Rosenblum and Saldaña (1992) were interested in whether audiovisual perception was dependent on order of presentation. They varied the order of presentation of the audio, visual, and audiovisual conditions for their investigation of the phonetic similarity of congruous and incongruous audiovisual syllables. They found no effect of the order of presentation; that is, the visual influence on auditory percepts was robust to the type of stimuli that preceded it. In a later experiment investigating the importance of kinematics in visual speech information, Rosenblum and Saldaña (1996) found that the visual influence of static images on auditory perception was minimized when static images were presented after the dynamic images compared to before the dynamic images. In light of their earlier findings (Rosenblum and Saldaña, 1992), they suggested that this was evidence of a post-perceptual, or strategic, response to the static images, since the order of presentation shouldn’t affect audiovisual perception.

In the experiment presented in this chapter, both of these approaches are used to test for effects of visual feedback in a DAF task where it is less likely that active strategies will be used by speakers. The whole face was presented as feedback and participants were instructed to simply look at the monitor during the task. 
This presentation also capitalizes on the fact that speech relevant information is distributed across the entire face (Section 1.1.3). Instead of presenting each condition in a single block, the static and dynamic visual feedback conditions were randomized within a block, in a similar manner to the above experiment by Fowler and Dekle (1991). Static visual feedback was used as the baseline condition against which to compare the effects of dynamic visual feedback. Since the type of visual feedback being presented was randomized, a still image was used rather than a fixation point in order to avoid making the transition between the types of feedback jarring.

3.3.1 Hypothesis and predictions

As discussed in Section 1.1.3, the visual speech signal provides a range of speech information, from lower level articulatory information to higher level information about the temporal structure of speech. During speech production this signal is hypothesized to enhance speech output by providing information that is temporally compatible with other sources of feedback.

Given this hypothesis, visual feedback is predicted to counteract the disruptive effects of DAF by providing information that is synchronous with the non-delayed feedback (i.e. bone-conducted auditory feedback and somatosensory feedback). It is predicted to decrease both utterance duration and speech errors. As discussed in Section 3.2.1, increased utterance duration can be considered a type of disfluency, and is a common response to DAF. By reinforcing the correct speech timing, dynamic visual feedback is predicted to minimize this type of disfluency, as well as speech errors.

3.3.2 Methods

3.3.2.1 Participants

Twenty-six students (mean age: 22.2 years (sd: 7.6); 7 males) recruited from The University of New Mexico participated in the experiment. Participants were compensated for their time with either a gift-card or extra credit toward their final grade. 
All participants self-reported having normal hearing and normal or corrected-to-normal vision, and all were native English speakers.

3.3.2.2 Stimuli

The stimuli consisted of 64 short sentences (mean length: 7.8 words (sd: 1); 8.8 syllables (sd: 1); see Appendix A) taken from the Harvard sentences, a list of phonetically balanced sentences for use in speech tasks (Rothauser et al., 1969). The stimuli were recorded as per Section [...].

3.3.2.3 Procedure

The experiment, which required participants to repeat sentences in four different feedback conditions, followed the procedure outlined in Section 2.4. A within-participants design with two independent variables was used. These variables manipulated the type of auditory and visual feedback participants received during speech production: auditory feedback was either normal or delayed (by 180 ms, to elicit the maximally disruptive effects of DAF), and visual feedback was either static or dynamic. Each participant was presented subsets of the stimuli in the four conditions outlined in Table 3.1.

                                visual feedback
                                static         dynamic
auditory feedback   normal      NAF-Picture    NAF-Video
                    delayed     DAF-Picture    DAF-Video

Table 3.1: Conditions presented to each participant.

The stimuli were counterbalanced to conditions across participants using the procedure proposed by Durso (1984). This ensured that a particular stimulus appeared in each condition equally frequently, and any given participant responded to it only once. The conditions were organized into two types of mixed blocks. Normal auditory feedback blocks contained a randomized mix of NAF-Picture and NAF-Video conditions while delayed auditory feedback blocks contained a randomized mix of DAF-Picture and DAF-Video conditions. Each type of block was presented four times with each block containing a unique subset of the stimuli. The order of the blocks was counterbalanced across subjects. 
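The counterbalancing requirement described above (each stimulus appears in each condition equally often across participants, and each participant sees a given stimulus only once) can be illustrated with a simple rotation scheme. The following is a minimal sketch under stated assumptions: it is a generic Latin-square-style rotation, not necessarily the procedure of Durso (1984), and the function name and participant numbering are hypothetical.

```python
from collections import Counter

# Condition labels from Table 3.1.
CONDITIONS = ["NAF-Picture", "NAF-Video", "DAF-Picture", "DAF-Video"]


def assign_conditions(n_items, participant):
    """Return {item index: condition} for one participant (hypothetical helper).

    Rotating the condition index by the participant number means that,
    over any run of len(CONDITIONS) consecutive participants, every item
    is seen exactly once in every condition.
    """
    k = len(CONDITIONS)
    return {item: CONDITIONS[(item + participant) % k] for item in range(n_items)}


# 64 sentences, four consecutive (hypothetical) participant numbers.
assignments = [assign_conditions(64, p) for p in range(4)]

# Each participant receives the same number of items per condition.
per_condition = Counter(assignments[0].values())
print(per_condition)

# Across the four participants, item 0 is seen in all four conditions.
print(sorted(a[0] for a in assignments))
```

Under this scheme the balance across participants is exact only when the number of participants is a multiple of four; how the remaining participants were handled in the experiment is not stated in the text.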
Each participant produced 64 sentences, yielding a total of 1664 sentences across all 26 participants. Fifteen sentences were discarded due to technical errors during the recording of the experiment.

White noise was presented throughout the experiment in an effort to mask participants’ real-time auditory feedback. During the presentation of the stimuli to be repeated the signal-to-noise ratio (SNR) was 10 dB. The SNR of the auditory feedback that participants received was not constant; while the noise level was held constant the signal level varied depending on how loudly the participant spoke.

3.3.2.4 Measures

Two commonly used measures of speech disruption in DAF paradigms are utterance duration (or speech rate) and total number of speech errors (Yates, 1963), both of which are used in this experiment.

In line with the three previous studies which investigated visual feedback effects during a DAF task (Chesters et al., 2015; Jones and Striemer, 2007; Tye-Murray, 1986), utterance duration was measured. To maintain consistency in segmentation, a set of criteria was established for identifying the beginning and end of an utterance, based on an item by item comparison of seven of the participants. As an example, utterances that ended with a stop might or might not have an audible release. In the case of the audible release, the end of the utterance was marked at the end of the release once there was no longer any evidence of F2 in the burst. If there was no audible release, the utterance was marked as ending at the end of voicing.

Speech errors were coded following the disfluency transcription conventions proposed in Brugos and Shattuck-Hufnagel (2012). Table 3.2 presents a summary of the coding scheme. These conventions build on a number of proposals for annotating disfluent speech and cover all of the disfluent phenomena produced by participants in this experiment. 
An additional benefit of this coding scheme is that it was designed to be used in Praat (Boersma and Weenink, 2009), which was the program used for duration analysis. Some of the error codes were adapted to the particulars of this experiment. For example, since the task involved repeating sentences rather than producing unscripted utterances, there was the potential for words to be deleted during the repetition. These missing words were coded as an “e” error and were marked between the two words where they should have occurred. This type of error involved missing content words; for example, if clean was deleted during the repetition of The doorknob was made of bright clean brass. Any cases that could be considered a natural reduction were not marked as an error; for example, if the sentence You cannot brew tea in a cold pot was repeated as You can’t brew tea in a cold pot, the reduced modal was not coded as an error. If a speech error occurred in the word immediately following the missing word, the missing word error was coded with that speech error. An example of a complex disfluent event (i.e. one with multiple speech errors) is provided in Figure 3.1.

Some words from the stimuli sentences were consistently produced as a phonetically similar word by a large number of participants. Examples include producing “green” instead of “clean”, “meals” instead of “mules”, and “pushed” instead of “plush.” All of these cases resulted in a semantically plausible sentence. It was decided not to categorize these as production errors, since they were likely due to misperception when initially listening to the stimuli (which were embedded in noise). A list of these accepted substitutions is included in Appendix A.

Phenomenon        Symbol    Description
prolongation      pr        abnormal and/or incongruous prolongation of a segment within a word
disfluent pause   ps, psw   abnormal and/or incongruous pause between (ps) or within (psw) words
silence           s         end of a silence (whether disfluent-sounding or not)
filler            f         filled pause, filler words or segments (e.g. um, huh, mm)
error             e         mispronunciation or wrong word
cut               c         a partially completed word
restart word      rs        restarting of a segment, syllable, word, after a word has been cut off
restart phrase    %r        start of a new phrase after a previous phrase was not finished

Table 3.2: Disfluency transcription conventions from Brugos and Shattuck-Hufnagel (2012).

The most difficult type of disfluency to mark was prolongation in the DAF conditions, as these conditions were produced with an overall slower speech rate. In keeping with Brugos and Shattuck-Hufnagel’s definition of prolongation (“abnormal and/or incongruous prolongation”), prolongation was judged relative to the utterance in which it occurred in order to decide whether it was abnormal or incongruent. This allowed normal prosodic lengthening to be distinguished from disfluent prolongation. It also helped to determine if a potential prolongation was due simply to an overall slowed speech rate, or if it was an additional source of lengthening and thus incongruous with the rest of the utterance. Thus, what counted as an instance of prolongation varied from item to item and from participant to participant. Additionally, both part- and whole-word prolongations were coded. An example of a part-word prolongation is given in Figure 3.2.

[Figure 3.1 annotation tiers: target sentence “Soap can wash most dirt away”; production “soap can watch (ps) mwost dirt away”; error codes e.ps and e; time axis 0 to 3.302 s]
Figure 3.1: An example of a complex disfluent event (error code: e.ps). The speaker makes a pronunciation error (watch instead of wash) which is immediately followed by a pause.

During error coding, items were discarded if they included laughter, were incomplete, or there was a major recall error (see footnote 1). Thirty-seven items were discarded during the error coding process. 
The final number of items in the analysis was 1612.

Error coders were blind to the conditions being coded. It should be noted that it was usually obvious which auditory feedback condition a participant was performing, due to the striking effect DAF has on speech production. However, the critical comparisons for this experiment were the visual feedback conditions, and coders were not able to identify this aspect of the experimental manipulation.

Footnote 1: An example of a major recall error: For the stimulus The wreck occurred by the bank on Main Street, one participant instead produced The wreck recorded at the red street. This item was discarded.

[Figure 3.2 annotation tiers: target sentence “The fur of cats goes by many names”; production “fur of cats goes by man(an)y names”; error codes pr and rs; time axis 0 to 2.667 s]
Figure 3.2: An example of a part-word prolongation. The labial nasal in many is abnormally prolonged, with a duration of approx. 240 ms.

3.3.2.5 Inter-rater agreement

A second coder coded 13% of the utterances. These utterances were randomly selected from the data set, with the restriction that the selection include two sentences per condition per subject. The second coder was given detailed instructions about implementing the coding scheme and also went through a set of examples with the main coder. Inter-rater word-by-word agreement was Cohen’s kappa 0.662 (p < 0.001). Cohen’s kappa values between 0.61 and 0.80 represent substantial agreement beyond chance (Landis and Koch, 1977). Table 3.3 shows the cross-tabulation of error categories from each coder.

3.3.2.6 Video coding

As discussed in Section 2.4, the video recordings were coded for whether or not each participant was looking at the monitor while they produced each utterance. 452 of the 1612 items were produced while the participant was not looking at the monitor. 
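As a cross-check on the inter-rater agreement reported in this section, Cohen’s kappa can be recomputed from the coder cross-tabulation in Table 3.3. The following is a minimal sketch, not the script used in the original analysis: the function name and variable names are illustrative assumptions, and the counts are copied from Table 3.3 (rows are the second coder, columns the main coder).

```python
# Counts copied from Table 3.3; rows = second coder, columns = main coder.
# Category order used here: no error, e, ps, pr, rs, c.
TABLE_3_3 = [
    [1162, 11, 17, 14, 6, 3],
    [27, 58, 0, 0, 2, 0],
    [1, 0, 16, 0, 0, 0],
    [13, 0, 0, 17, 0, 0],
    [5, 1, 0, 1, 29, 0],
    [7, 0, 6, 1, 0, 3],
]


def cohens_kappa(table):
    """Cohen's kappa for a square rater-by-rater contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Proportion of words on which the two coders chose the same category.
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(TABLE_3_3)
print(round(kappa, 3))  # 0.662
```

The value recovered from the table agrees with the reported kappa of 0.662.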
This looking variable was included in the statistical analyses below.

                      Main Coder
                      Ø       e      ps     pr     rs     c
Second Coder   Ø      1162    11     17     14     6      3
               e      27      58     0      0      2      0
               ps     1       0      16     0      0      0
               pr     13      0      0      17     0      0
               rs     5       1      0      1      29     0
               c      7       0      6      1      0      3

Table 3.3: Cross-tabulation of error categories from each coder. (Ø = no error, e = mispronunciation or word error, ps = pause, pr = prolongation, rs = restart, c = cut)

                                          Picture        Video
Utterance duration (ms)           NAF     2235 (66)      2253 (66)
                                  DAF     3479 (76)      3539 (66)
No. speech errors per utterance   NAF     0.36 (0.06)    0.39 (0.06)
                                  DAF     1.38 (0.07)    1.41 (0.08)

Table 3.4: Means and standard errors of the means of the experimental measures for each condition.

3.3.3 Results

The means and standard errors of the means for utterance duration and total speech errors are shown in Table 3.4. The statistical results for each measure are discussed below.

[Figure 3.3: boxplots; x-axis: Type of auditory feedback (NAF, DAF); y-axis: Utterance duration (ms); legend: picture, video]
Figure 3.3: Distributions of utterance duration for each condition. The lower and upper hinges of the boxplot represent the first and third quartiles, respectively. The middle line represents the median.

3.3.3.1 Utterance duration

A summary of the utterance duration results is shown in Figure 3.3. The model construction, criticism, and comparison steps are outlined below.

A linear mixed effects analysis was performed using the general formula outlined in Section 2.6. The two independent variables were Type of Auditory Feedback and Type of Visual Feedback, and the Looking variable was also added. The maximal random effects structure was specified. Visual inspection of the residual plots of this model revealed heteroscedasticity.

The distribution of the duration measures revealed right-skewing suggestive of a log-normal distribution, so the values were log-transformed and the models were re-fit. During the model comparison process the random effects structure had to be simplified for the model that contained only the auditory feedback factor in order for the model to converge. This model and its comparison model had simplified by-item random slopes, allowing variation for the auditory feedback factor only. The random effect for participants retained the maximal specification. Visual inspection of the residual plots for these models revealed no obvious deviations from homoscedasticity.

                  Coefficient (Std. Error)   t-value
(Intercept)       7.68 (0.03)                291.89
auditory(DAF)     0.42 (0.03)                15.22
visual(video)     0.02 (0.01)                2.50
looking(no)       0.03 (0.01)                3.21

Table 3.5: Fixed effects for the model of (log) utterance duration. The significance of the auditory feedback and visual feedback factors was confirmed with model comparison.

Model comparison revealed a significant effect of auditory feedback (χ²(1) = 59.133, p < .001): utterance duration increased in the DAF conditions compared to the NAF conditions. There was also a significant effect of visual feedback (χ²(1) = 6.3754, p = .01157), with increased utterance duration when visual feedback was the video compared to the picture. Model comparison failed to find an effect of the interaction between auditory and visual feedback. Table 3.5 shows the fixed effects coefficients, standard errors, and t-values for the model with the significant main effects.

As seen in Table 3.5, the Looking control variable was a significant factor in the model. Utterance duration increased when participants were looking away from the monitor compared to when they were looking at the monitor.

3.3.3.2 Speech errors

The counts for each type of error are shown in Figure 3.4. Mispronunciations and word errors, which included incorrect or missing words and word inversions (e.g. clean bright brass instead of bright clean brass), were the most common type of error. However, for the DAF-Picture condition, errors, restarts, prolongations, and pauses had very similar counts. Other than mispronunciations and word errors, the other types of errors were much more common in the DAF conditions than the NAF conditions. This was especially the case for restarts and prolongations, two stuttering-like disfluencies commonly produced under DAF conditions.

[Figure 3.4: bar charts in panels by visual feedback (picture, video) and auditory feedback (NAF, DAF); x-axis: Type of error (e, rs, pr, ps, complex, other); y-axis: Error counts]
Figure 3.4: Counts of each type of error per condition. (e = mispronunciation or word error, rs = restart, pr = prolongation, ps = pause, complex = multiple disfluencies in a single event, other = cuts and fillers)

A summary of the total number of speech errors per utterance is shown in Figure 3.5. A generalized linear mixed effects analysis (Poisson distribution) was performed using the general formula outlined in Section 2.6. The two independent variables were Type of Auditory Feedback and Type of Visual Feedback, and the Looking control variable was also added. The maximal random effects structure was specified. The models also accounted for zero-inflation in the data.

During the model comparison process the random effects structure had to be simplified for the model that contained only the visual feedback factor in order for the model to converge. This model and its comparison model had a maximal by-participant random slope but an intercept only for item. The ratio of the error variance to mean was 1.03 for the full model, suggesting that there were no concerns of overdispersion.

[Figure 3.5: density plots in panels NAF and DAF; x-axis: Total errors per utterance (0 to 5); y-axis: density; legend: picture, video]
Figure 3.5: Density plots of speech errors for each condition. The vertical dashed lines represent the mean number of speech errors per utterance for each condition.

Model comparison revealed a significant effect of auditory feedback (D = 59.36, p < .001; see footnote 2). As expected, speech errors increased in the DAF conditions compared to the NAF conditions. Model comparison failed to find an effect of visual feedback or an interaction between the independent variables. 
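The model comparisons in this section are likelihood ratio tests, in which the difference in fit between two nested models is referred to a chi-square distribution. As a hedged illustration, the sketch below recomputes the p-values for the two utterance duration tests, for which one degree of freedom is stated in the text. It relies on the standard identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)); the function name is an assumption for illustration. The degrees of freedom for the deviance (D) tests are not stated in the text, so those tests are not reproduced here.

```python
import math


def chi2_upper_tail_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Test statistics as reported for the (log) utterance duration models.
p_visual = chi2_upper_tail_df1(6.3754)
p_auditory = chi2_upper_tail_df1(59.133)

print(f"visual feedback:   p = {p_visual:.5f}")
print(f"auditory feedback: p < .001 is {p_auditory < 0.001}")
```

The recomputed value for the visual feedback effect agrees with the reported p = .01157 to the precision given.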
Table 3.6 shows the fixed effects coefficients, standard errors, and z-values for the model with main effects but no interaction.

As seen in Table 3.6, the Looking control variable was once again significant. The total number of speech errors increased when participants were looking away from the monitor compared to when they were looking at the monitor.

                  Coefficient (Std. Error)   z-value
(Intercept)       −1.25 (0.12)               −10.57
auditory(DAF)     1.32 (0.09)                14.24
visual(video)     0.08 (0.06)                1.19
looking(no)       0.25 (0.07)                3.54

Table 3.6: Fixed effects for the model of speech errors. The significance of the auditory feedback factor was confirmed with model comparison. (NB: Since the model used a Poisson distribution, the coefficients are in expected log counts.)

Footnote 2: The likelihood ratio test for generalized linear models uses deviance (D) as a measure of model fit. Zuur et al. (2009) describe it as a maximum likelihood equivalent of the sum of squares of residuals.

3.3.3.3 Post-hoc analysis: DAF disruption

The planned statistical analysis failed to find an effect of visual feedback on the number of speech errors. The following analysis looks at the speech error results in more detail. It should be noted that this was not one of the planned comparisons for this experiment, but is presented to gain more insight into possible effects of visual feedback.

As discussed in Section 3.2.1, Jones and Striemer (2007) tested whether there was a differential effect of visual feedback based on the degree to which a participant was affected by DAF. They split participants into two groups based on the number of speech errors produced in the baseline DAF condition (i.e. no visual feedback). 
Participants who made fewer errors were categorized as “low-disruption” and participants who made more errors were categorized as “high-disruption.” They found that participants in the low disruption group produced fewer errors in the DAF condition with visual feedback compared to the DAF condition in which stimuli were read off the monitor.

This interaction was tested in the present data set. Instead of splitting participants into two groups based on their mean number of errors in the DAF-Picture condition, the mean values were added to the model as a continuous variable, which will be referred to as “disruption.” This variable is interpreted as a measure of a speaker’s verbal proficiency; the less disrupted a speaker is by DAF the more proficient they are (Chon et al., 2013).

A generalized linear mixed effects analysis (Poisson distribution) was performed, with Type of Auditory Feedback, Type of Visual Feedback, and Disruption as independent variables. The random effects structure was simplified to achieve convergence; the by-participant random slope remained maximal but the random effect for item contained an intercept only.

                               Coefficient (Std. Error)   z-value
(Intercept)                    −2.06 (0.19)               −10.93
auditory(DAF)                  1.47 (0.13)                11.48
visual(video)                  0.08 (0.17)                0.47
disruption                     0.59 (0.11)                5.38
auditory(DAF):visual(video)    −0.06 (0.17)               −0.36

Table 3.7: Fixed effects for the model of speech errors with the significant effect of DAF disruption. (NB: Since the model used a Poisson distribution, the coefficients are in expected log counts.)

Model comparison revealed a significant effect of disruption (D = 13.92, p < .001), with the number of speech errors increasing across the board for participants who were more disrupted by DAF. Model comparison failed to find an interaction between disruption and the feedback variables. 
Table 3.7 shows the fixed effects coefficients, standard errors, and z-values for the model with the significant effect of disruption, and this effect is visualized in Figure 3.6.

[Figure 3.6: line plots in panels NAF and DAF; x-axis: Disruption (mean number of errors in DAF-Picture), 0.0 to 2.0; y-axis: Predicted number of errors per utterance (log counts); legend: picture, video]
Figure 3.6: Effect of DAF disruption on speech errors. Dotted lines represent 95% confidence intervals around predicted values. (NB: The predicted values are based on the fixed effects of the model only, and not the random effects.)

3.3.4 Discussion

The results of this analysis showed an effect of dynamic visual feedback on utterance duration; however, it was in the opposite direction to the prediction. For both NAF and DAF conditions, utterance duration increased with dynamic visual feedback compared to static visual feedback. The analysis failed to find an effect of dynamic visual feedback on speech errors. A post-hoc analysis also failed to find any interaction with degree of DAF disruption.

Visual feedback was predicted to decrease utterance duration, but instead the results showed an increase in duration for the DAF-Video condition. This may be related to reports of increased segmental durations in two similar contexts to the present experiment. As discussed in Section 3.2.1, Zimmermann et al. (1988) looked at jaw movement in DAF tasks, comparing two auditory feedback delays: 100 ms and 200 ms. Despite the fact that the onset of DAF aligned with different articulatory-defined portions of the syllable, the offset of DAF most often occurred during the closed portion of the syllable for both delays. In order for this offset alignment to occur, the closed gesture must be sustained for a longer period of time for the 200 ms DAF relative to 100 ms DAF. Recall that in the present experiment, DAF was 180 ms, so closer to the 200 ms delay than the 100 ms delay in Zimmermann et al. (1988). 
This DAF offset alignment pattern was primarily found when the following syllable had a bilabial or labiodental onset, two places of articulation that are visually distinct. It is possible that these sustained closed portions of the syllable were reinforced by dynamic visual feedback, thus lengthening them further still. Prolongation is a commonly reported disfluency under DAF conditions, and this was also observed in the present experiment (Figure 3.4). Chesters et al. (2015) corroborated the prolongation results from subjective speech error coding with measures of consonant duration; they found increased consonant duration with DAF compared to NAF. Consonant duration was also increased with visual feedback, although this effect was only significant when auditory feedback was normal. If a similar effect occurred in the present experiment, this could account for the increased utterance duration in the NAF-Video condition.

Jones and Striemer (2007) reported a reduction in speech errors for a subset of their participants when visual feedback was presented during a DAF task, but this was not found in the present experiment. One of the biggest differences between these two experiments is the stimulus presentation order. As described in Section 3.3, static and dynamic visual feedback were randomized within a block in order to make it less likely that participants would develop a strategy for using the visual feedback to counteract the disruptive effects of DAF. But it may be the case that sustained exposure to visual feedback over the course of a block is necessary for visual feedback effects to be observed. The next experiment uses a similar design to the present experiment, but presents one condition per block (as in Chesters et al., 2015; Jones and Striemer, 2007; Tye-Murray, 1986). 
A condition without any visual feedback is also added, against which to compare the effects of each type of visual feedback (static and dynamic).

3.4 Experiment 2

In the previous experiment, static visual feedback was used as a baseline against which to compare the effects of dynamic visual feedback. Experiment 2 involves the same task as Experiment 1 (i.e. participants repeated sentences with or without DAF while being presented with different types of visual feedback) and examines whether there are different responses to these two types of visual signals by comparing them to speech production without visual feedback.

While speech is a time-varying phenomenon, it is possible to capture certain linguistically relevant distinctions from static visual forms (Fromkin, 1964). A variety of studies with fluent speakers suggest there are differences in the way static and dynamic visual speech information is processed in perception tasks. Several studies have documented improvements in perception tasks when dynamic (but not static) visual speech is presented with congruous auditory stimuli. For example, dynamic information improved response times when identifying speech from non-speech (Kim and Davis, 2014) and in a syllable identification task (Gilbert et al., 2012). The N1 component of the event-related brain potential is sensitive to auditory speech and can be modulated by visual speech. EEG recordings reveal decreases in latency and amplitude of N1 in response to dynamic visual information compared to other visual signals (Gilbert et al., 2012). The latency improvement was seen for any kind of motion (chewing and speech) whereas the amplitude decrease was specific to speech motion, which may be indicative of an alerting mechanism; any kind of facial motion enhances the speed of the auditory neural response, with meaningfulness of the visual information (i.e. 
speech or non-speech) subsequently reflected in the magnitude of the response.

Such studies suggest no role for static visual information; however, there is evidence that this information can influence perception. In a McGurk paradigm, still images of a face affected responses to auditory stimuli, increasing the percentage of visually influenced percepts (Rosenblum and Saldaña, 1996). The authors suggested that this effect on auditory perception was due to post-perceptual processing (i.e. strategic responding) based on the results of a follow-up experiment; they found that static images had a reduced effect on responses to auditory stimuli when they were preceded by conditions which had dynamic images. Earlier work from the authors had shown that the order of presentation of McGurk-type stimuli should not affect the influence visual information has on auditory perception (Rosenblum and Saldaña, 1992). Irwin et al. (2006) also contrasted static and dynamic images using a McGurk paradigm, with conflicting results. They used brief visual stimuli (approx. 100 ms). For the static condition the video frame that most clearly showed the consonant place of articulation was repeated three times, while for the dynamic condition three consecutive frames that spanned the consonant closure and release were presented. When the temporal alignment between the audio signal and the different types of visual signals was controlled for, there was no difference in the effect of the different types of visual signals. However, in a visual-only identification task, the brief static images were more accurately identified than the brief dynamic images. This result is unexpected, but may be due to the fact that a 100 ms image composed of a single frame is visually ‘clearer’ than a 100 ms image composed of three frames. 
Thus the result may say more about the effects of visual clarity than it does about static versus dynamic perception. Additionally, in an experiment looking at the effect of external visual signals on stuttering inhibition, static and dynamic visual information were equally effective in reducing stuttering when combined with a pure tone (Guntupalli et al., 2011).

Neuroimaging results suggest that both static and dynamic visual speech images are relevant to auditory cortical processing, although with differences in the degree of activation. In a phoneme detection task, Calvert and Campbell (2003) presented normal hearing participants with either dynamic visual images of syllables (e.g. [vu], [im]) or static images from the maximal closed or open portion of these syllables (e.g. [v], [u]). fMRI results showed that, overall, similar cortical regions were activated for both types of visual speech signals, but dynamic visual speech activated these areas to a greater degree. However, there were some differences: dynamic images activated visual motion areas, auditory cortex, and the left superior temporal sulcus to a greater extent (but static images did activate these areas to some extent; this could be seen when compared to the baseline condition); static images activated the ventral premotor cortex and intraparietal cortex to a greater degree. So, while static images can influence auditory areas, this influence is enhanced by the matching temporal structure between dynamic visual speech and auditory speech.

Studies of neurophysiologically impaired patients also support distinct processing of static and dynamic visual speech information, complementing the behavioral and neuroimaging results from normal populations. 
Campbell (1992) presented evidence of differences in the processing of static and dynamic speech information. Similarly, a subject with profound visual form agnosia who performed on par with control subjects in audiovisual, auditory only, and visual only vowel identification tasks dropped to chance performance when static images were presented (Munhall et al., 2002). Other patients with visual agnosia have exhibited similar differences in static and dynamic visual processing, although the responses to audiovisual stimuli suggest that the audiovisual integration may be compromised (de Gelder et al., 1998).

These studies show impairment to static visual processing while sparing dynamic visual processing. But there is a double dissociation, as the opposite pattern is also attested. In a well-documented case of ‘motion-blindness’, a patient with a lesion in the V5 region of the occipital lobe was able to identify speech forms from static images of the face but was unable to repeat multisyllabic forms presented visually and was unable to integrate visual information when presented with incongruous audiovisual syllables (Campbell et al., 1997). The results from behavioral and neuroimaging experiments, as well as lesion studies, thus show the importance of both static form and dynamic timing properties for visual speech processing.

3.4.1 Hypothesis and prediction

While the time-varying properties of the visual feedback are considered to be important for establishing temporal compatibility with other feedback signals, there may still be an enhancing effect of static visual feedback, albeit weaker than dynamic visual feedback. This is in line with the finding from Calvert and Campbell (2003) that static visual speech activated similar cortical regions to dynamic visual speech but to a lesser degree. Experiment 2 compares utterances produced with these two forms of visual feedback, and also makes a comparison to utterances produced without visual feedback.
In Experiment 2 the different types of visual feedback are presented in separate blocks; this consistent exposure to a single type of visual feedback is similar to the experimental procedures used in previous experiments which pair visual feedback and DAF (Chesters et al., 2015; Jones and Striemer, 2007; Tye-Murray, 1986). Presenting the visual feedback in this way, as opposed to the randomized presentation used in Experiment 1, may make it more likely to see effects of visual feedback on speech production.

As in Experiment 1, visual feedback is predicted to counteract the disruptive effects of DAF by providing information that is synchronous with the non-delayed feedback. Dynamic visual feedback, a rich source of time-varying speech information, is predicted to have the greatest impact on speech production. In terms of utterance duration, dynamic visual feedback could have a fluency-enhancing effect and thus decrease duration; although given the results of Experiment 1, it is also possible that dynamic visual feedback could increase utterance duration. In terms of speech errors, dynamic visual feedback is predicted to decrease the number of speech errors. Static visual feedback is predicted to have qualitatively similar effects but with a reduction in the magnitude of the effects.

3.4.2 Methods

3.4.2.1 Participants

Thirty students (mean age: 21.2 years (sd: 4.5); 6 males) recruited from The University of New Mexico’s Linguistics 101 course participated in the experiment. Participants were compensated for their time with extra credit toward their final grade. All participants self-reported having normal hearing and normal or corrected-to-normal vision, and all were native English speakers.
Stimuli

The stimuli consisted of 60 short sentences (mean length: 7.8 words (sd: 1); 8.7 syllables (sd: 1); see Appendix A) taken from the Harvard sentences, a list of phonetically balanced sentences for use in speech tasks (Rothauser et al., 1969). The stimuli were recorded as per Section 2.3.

                                 visual feedback
                     absent         static         dynamic
auditory   normal    NAF-NoVideo    NAF-Picture    NAF-Video
feedback   delayed   DAF-NoVideo    DAF-Picture    DAF-Video

Table 3.8: Conditions presented to each participant.

Procedure

The experiment, which required participants to repeat sentences in six different feedback conditions, followed the procedure outlined in Section 2.4. A within-participants design with two independent variables was used. These variables manipulated the type of auditory and visual feedback participants received during speech production: auditory feedback was either normal or delayed (by 180 ms), and visual feedback was either absent, static, or dynamic. This resulted in the six conditions shown in Table 3.8. Each participant produced subsets of the stimuli in the six conditions.

The stimuli were grouped into blocks of 10 sentences, with each block consisting of a single condition. Condition order and stimuli assignment were simultaneously counterbalanced using the procedure proposed by Zeelenberg and Pecher (2014). This controlled sequential and ordinal effects of condition ordering and ensured that a particular stimulus appeared in each condition equally frequently, and that any given participant responded to it only once. Each participant produced 60 sentences, yielding a total of 1800 sentences across all 30 participants. Eleven sentences were discarded due to technical errors during the recording of the experiment.

White noise was presented throughout the experiment in an effort to mask participants’ real-time auditory feedback. During the presentation of the stimuli to be repeated the signal-to-noise ratio (SNR) was 10 dB.
The SNR of the auditory feedback that participants received was not constant; while the noise level was held constant the signal level varied depending on how loudly the participant spoke.

                          Main Coder
Second Coder      Ø      e     ps     pr     rs      c
Ø              1270     18      6      8      4      1
e                 5     33      0      0      1      0
ps                0      0     25      0      0      0
pr               13      0      0     10      0      0
rs                8      0      1      0     11      0
c                 2      0      1      0      0      3

Table 3.9: Cross-tabulation of error categories from each coder. (Ø = no error, e = mispronunciation or word error, ps = pause, pr = prolongation, rs = restart, c = cut)

Measures

The measures were the same as those described earlier for Experiment 1; namely, utterance duration and speech errors. Twenty-four items were discarded based on the error coding criteria. The final number of items in the analysis was 1765.

Inter-rater agreement

A second coder coded 10% of the utterances. These utterances were randomly selected from the data set, with the restriction that the selection include one sentence per condition per participant. Inter-rater word-by-word agreement, measured with Cohen’s kappa, was 0.695 (p < 0.001). Cohen’s kappa values between 0.61 and 0.80 represent substantial agreement beyond chance (Landis and Koch, 1977). Table 3.9 shows the cross-tabulation of error categories from each coder.

Video coding

As discussed in Section 2.4, the video recordings were coded for whether or not each participant was looking at the monitor while they produced each utterance. Of the 1765 items, 333 were produced while the participant was not looking at the monitor. This looking variable was included in the statistical analyses below.

                                          No Video       Picture        Video
Utterance duration (ms)            NAF    2188 (62)      2172 (53)      2125 (56)
                                   DAF    3084 (90)      3084 (51)      3194 (78)
No. speech errors per utterance    NAF    0.39 (0.06)    0.41 (0.06)    0.29 (0.05)
                                   DAF    0.98 (0.07)    1.00 (0.06)    0.97 (0.07)

Table 3.10: Means and standard errors of the means of the experimental measures for each condition.

3.4.3 Results

The means and standard errors of the means for utterance duration and total speech errors are shown in Table 3.10.
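The relationship between the cross-tabulation in Table 3.9 and the reported agreement statistic can be illustrated with a short calculation. The following is a minimal sketch that computes Cohen's kappa directly from the tabulated counts; the function name `cohens_kappa` and the layout of the nested list (rows for the second coder, columns for the main coder) are assumptions made for illustration, and the thesis does not state which software was used for the original calculation.

```python
# Counts copied from Table 3.9 (rows: second coder, columns: main coder).
# Category order in both dimensions: no error, e, ps, pr, rs, c.
TABLE_3_9 = [
    [1270, 18, 6, 8, 4, 1],
    [5, 33, 0, 0, 1, 0],
    [0, 0, 25, 0, 0, 0],
    [13, 0, 0, 10, 0, 0],
    [8, 0, 1, 0, 11, 0],
    [2, 0, 1, 0, 0, 3],
]


def cohens_kappa(table):
    """Cohen's kappa for a square cross-tabulation of two coders."""
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of items on the diagonal.
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(TABLE_3_9)
print(f"kappa = {kappa:.3f}")
```

Applied to the counts above, this calculation gives a value that rounds to the reported 0.695.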
The statistical results for each measure are discussed below.

Utterance duration

A summary of the utterance duration results is shown in Figure 3.7. The statistical analysis involved a number of iterations of model construction before an adequate fit was found that also met the assumptions for model validity. These steps are outlined below.

A linear mixed effects analysis was performed using the general formula outlined in Section 2.6. The two independent variables were Type of Auditory Feedback and Type of Visual Feedback, and the Looking control variable was also added. The maximal random effects structure was specified. Visual inspection of the residual plots of these models revealed heteroscedasticity.

To address this the values were log-transformed, as per Experiment 1. The model failed to converge with the maximal random effects structure. The by-item random slope was simplified so that it no longer had an interaction term for the effect of type of auditory and visual feedback. Visual inspection of the residual plots revealed no obvious deviations from homoscedasticity.

[Figure 3.7: boxplots of utterance duration (ms) by type of auditory feedback (NAF, DAF), grouped by visual feedback (no video, picture, video)]

Figure 3.7: Distributions of utterance duration for each condition. The lower and upper hinges of the boxplot represent the first and third quartiles, respectively. The middle line represents the median.

During the model comparison process the random effects structure had to be further simplified for the model that contained only the visual feedback factor. This model, and its comparison model, had an intercept only for the random effect of item.

Model comparison revealed a significant effect of auditory feedback (χ²(1) = 58.934, p < .001); utterance duration increased in the DAF conditions compared to the NAF conditions. The interaction between the independent variables was also significant (χ²(2) = 8.9253, p = .01153).
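Because the duration model was fitted to log-transformed values, its fixed effects describe multiplicative rather than additive changes in duration. The sketch below back-transforms a few of the coefficients reported in Table 3.11 to illustrate this. It assumes that the natural logarithm was used (an intercept of 7.67 then corresponds to roughly 2.1 seconds, which is in line with the NAF means in Table 3.10) and that the predictors were treatment-coded with NAF and no video as reference levels; the variable names are illustrative only.

```python
import math

# Fixed effects on the log(ms) scale, copied from Table 3.11.
INTERCEPT = 7.67       # reference cell: NAF, no video, looking at the monitor
DAF = 0.32             # auditory(DAF)
VIDEO = -0.02          # visual(video)
DAF_BY_VIDEO = 0.07    # auditory(DAF):visual(video)


def to_ms(log_duration):
    """Back-transform a natural-log duration to milliseconds."""
    return math.exp(log_duration)


naf_novideo_ms = to_ms(INTERCEPT)
daf_novideo_ms = to_ms(INTERCEPT + DAF)
daf_video_ms = to_ms(INTERCEPT + DAF + VIDEO + DAF_BY_VIDEO)

# Ratio of the two DAF cells: the proportional change attributable to video under DAF.
video_effect_under_daf = daf_video_ms / daf_novideo_ms

print(f"NAF, no video: {naf_novideo_ms:.0f} ms")
print(f"DAF, no video: {daf_novideo_ms:.0f} ms")
print(f"DAF, video:    {daf_video_ms:.0f} ms")
print(f"video/no video under DAF: {video_effect_under_daf:.3f}")
```

On this reading, the combined video terms under DAF (−0.02 + 0.07 = 0.05 on the log scale) correspond to an increase in predicted duration of about 5%. These back-transformed values are model-based estimates for the reference level of the Looking variable, so they need not coincide with the raw condition means.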
Table 3.11 shows the fixed effects coefficients, standard errors, and t-values for the model with the significant interaction. The contrasts of interest for the interaction were obtained using multcomp (Hothorn et al., 2008). Post-hoc comparisons between the levels of the visual feedback variable were made for each level of the auditory feedback variable. For the DAF conditions there was a significant increase in utterance duration with dynamic visual feedback compared to no visual feedback (z = 2.956, p < .01 (adjusted p-value, single step method)) and compared to static visual feedback (z = 2.561, p = .02823 (adjusted p-value, single step method)). However, there was no significant difference in utterance duration between static visual feedback and no visual feedback, nor were there any significant differences between the types of visual feedback in the NAF conditions.

                                  Coefficient (Std. Error)    t-value
(Intercept)                        7.67 (0.02)                 311.04
auditory(DAF)                      0.32 (0.03)                  10.08
visual(picture)                   −0.01 (0.01)                  −0.62
visual(video)                     −0.02 (0.02)                  −1.95
looking(no)                       −0.00 (0.01)                  −0.32
auditory(DAF):visual(picture)      0.02 (0.03)                   0.67
auditory(DAF):visual(video)        0.07 (0.03)                   2.51

Table 3.11: Fixed effects for the model of (log) utterance duration with a significant interaction of the predictors.

Speech errors

The counts for each type of error are shown in Figure 3.8. Mispronunciations and word errors accounted for most of the errors across the conditions, and they accounted for a larger proportion of the total errors than in Experiment 1. As with the previous experiment, the stuttering-like disfluencies (i.e. restarts and prolongations) were more common in the DAF conditions than the NAF conditions.

A summary of the total number of speech errors per utterance is shown in Figure 3.9.
Since the dependent variable was a count variable, a generalized linear mixed effects analysis (Poisson distribution) was performed using the general formula outlined in Section 2.6. The two independent variables were Type of Auditory Feedback and Type of Visual Feedback, and the Looking control variable was also added. The maximal random effects structure was specified. The models also accounted for zero-inflation in the data.

[Figure 3.8: bar charts of error counts by type of error (e, rs, pr, ps, complex, other), in panels for NAF and DAF crossed with no video, picture, and video]

Figure 3.8: Counts of each type of error per condition. (e = mispronunciation or word error, rs = restart, pr = prolongation, ps = pause, complex = multiple disfluencies in a single event, other = cuts and fillers)

The full model failed to converge with the maximal random effects structure so it was simplified by removing the interaction term from the by-item random slope. The ratio of the error variance to mean was approximately 1 for this new model, suggesting that there were no concerns of overdispersion.

During the model comparison process the random effects structure had to be further simplified for the model that contained only the auditory feedback factor. This model, and its comparison model, had a by-participant random slope for the effect of the auditory feedback factor, and an intercept only for the random effect of item.

[Figure 3.9: density plots of total errors per utterance, in panels for NAF and DAF, with curves for no video, picture, and video]

Figure 3.9: Density plots of speech errors for each condition. The vertical dashed lines represent the mean number of speech errors per utterance for each condition.

Model comparison revealed a significant main effect of auditory feedback (D = 59.48, p < .001); speech errors increased in the DAF conditions compared to the NAF conditions. Model comparison failed to find an effect of visual feedback or an interaction between the independent variables.
Table 3.12 shows the fixed effects coefficients, standard errors, and z-values for the model with the significant auditory feedback effect.

As can be seen in Table 3.12, the Looking control variable was a significant factor in the model. The total number of speech errors increased when participants were looking away from the monitor compared to when they were looking at the monitor.

                    Coefficient (Std. Error)    z-value
(Intercept)         −1.02 (0.12)                 −8.55
auditory(DAF)        0.95 (0.08)                 12.26
visual(picture)      0.03 (0.07)                  0.35
visual(video)       −0.07 (0.08)                 −0.89
looking(no)          0.19 (0.08)                  2.28

Table 3.12: Fixed effects for the model of speech errors with the significant auditory feedback effect. (NB: Since the model used a Poisson distribution, the coefficients are in expected log counts.)

DAF disruption

As in Experiment 1, the speech error results were explored in more detail by testing whether changes in speech output in response to visual feedback were sensitive to the degree of DAF disruption a participant experienced.

Given that the Looking control variable was significant in the speech error model, it was decided to exclude those items from the analysis in which the participant was not looking at the monitor. This left 1432 sentences in the analysis. The disruption variable was calculated over this subset. One participant did not look at the monitor at all during the DAF-NoVideo condition, so all data points from that participant were excluded. This further reduced the number of sentences in the analysis to 1399.

A generalized linear mixed effects analysis (Poisson distribution) was performed, with Type of Auditory Feedback, Type of Visual Feedback, and Disruption as independent variables.
In order for the model to converge, the random effects structure was simplified; the interaction term was removed from the by-participant random slope and the random effect of item had an intercept only.

Model comparison revealed a significant interaction between auditory feedback, visual feedback, and disruption (D = 17.54, p < .01). Table 3.13 shows the fixed effects coefficients, standard errors, and z-values for the model with the significant interaction. The interaction is visualized in Figure 3.10, with the predicted number of errors per utterance plotted against the disruption variable.

                                             Coefficient (Std. Error)    z-value
(Intercept)                                  −1.05 (0.24)                 −4.40
auditory(DAF)                                −0.01 (0.28)                 −0.03
visual(picture)                              −0.08 (0.30)                 −0.25
visual(video)                                −0.78 (0.33)                 −2.41
disruption                                    0.18 (0.19)                  0.95
auditory(DAF):visual(picture)                 0.64 (0.39)                  1.62
auditory(DAF):visual(video)                   1.44 (0.41)                  3.54
auditory(DAF):disruption                      0.73 (0.22)                  3.27
visual(picture):disruption                   −0.00 (0.25)                 −0.01
visual(video):disruption                      0.44 (0.26)                  1.67
auditory(DAF):visual(picture):disruption     −0.47 (0.32)                 −1.48
auditory(DAF):visual(video):disruption       −1.01 (0.32)                 −3.19

Table 3.13: Fixed effects for the model of speech errors with the significant interaction between auditory feedback, visual feedback, and DAF disruption. (NB: Since the model used a Poisson distribution, the coefficients are in expected log counts.)

As can be seen in Figure 3.10 there is considerable overlap of the confidence intervals, especially in the DAF conditions. The greatest distinction can be found at the low end of the disruption range in the NAF conditions.
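The fixed effects in Table 3.13 can be combined into condition-specific predictions of the kind plotted in Figure 3.10. The following is a rough sketch under several assumptions that the text does not confirm: the predictors are treatment-coded with NAF and no video as reference levels, the disruption variable enters the model uncentered, and only the count part of the model is considered (the zero-inflation component is ignored). The dictionary keys and the function name `predicted_log_count` are hypothetical.

```python
import math

# Fixed effects in expected log counts, copied from Table 3.13.
B = {
    "intercept": -1.05,
    "daf": -0.01,
    "picture": -0.08,
    "video": -0.78,
    "disruption": 0.18,
    "daf:picture": 0.64,
    "daf:video": 1.44,
    "daf:disruption": 0.73,
    "picture:disruption": -0.00,
    "video:disruption": 0.44,
    "daf:picture:disruption": -0.47,
    "daf:video:disruption": -1.01,
}


def predicted_log_count(auditory, visual, disruption):
    """Fixed-effects prediction (log count scale) for one condition."""
    daf = 1 if auditory == "DAF" else 0
    pic = 1 if visual == "picture" else 0
    vid = 1 if visual == "video" else 0
    return (
        B["intercept"]
        + B["daf"] * daf
        + B["picture"] * pic
        + B["video"] * vid
        + B["disruption"] * disruption
        + B["daf:picture"] * daf * pic
        + B["daf:video"] * daf * vid
        + B["daf:disruption"] * daf * disruption
        + B["picture:disruption"] * pic * disruption
        + B["video:disruption"] * vid * disruption
        + B["daf:picture:disruption"] * daf * pic * disruption
        + B["daf:video:disruption"] * daf * vid * disruption
    )


# Predictions for the NAF conditions at the lowest end of the disruption range.
for visual in ("none", "picture", "video"):
    log_count = predicted_log_count("NAF", visual, disruption=0.0)
    print(f"NAF, {visual}: log count {log_count:.2f}, count {math.exp(log_count):.2f}")
```

Under these assumptions, the NAF predictions at a disruption score of zero are −1.05 (no video), −1.13 (picture), and −1.83 (video) in log counts, so the separation of the dynamic visual feedback condition at low disruption comes from the visual(video) coefficient, while the positive visual(video):disruption slope narrows that gap as disruption increases.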
For sentences produced with NAF, the model predicts that speakers who are minimally disrupted by DAF will have a reduction in speech errors with dynamic visual feedback compared to static and no visual feedback.

[Figure 3.10: predicted number of errors per utterance (log counts) plotted against disruption (mean number of errors in DAF-NoVideo), in panels for NAF and DAF, with lines for no video, picture, and video]

Figure 3.10: Interaction between DAF disruption, auditory feedback, and visual feedback. Dotted lines represent 95% confidence intervals around predicted values. (NB: The predicted values are based on the fixed effects of the model only, and not the random effects.)

3.4.4 Discussion

The results of this experiment showed an effect of dynamic visual feedback on utterance duration; however, it was in the opposite direction to the initial prediction. In the DAF conditions, utterance duration increased with dynamic visual feedback compared to the conditions where visual feedback was static or absent. While the expected increase in speech errors in the DAF conditions compared to the NAF conditions was found, the initial analysis failed to find an effect of the different types of visual feedback. However, an additional analysis found that speech errors in the NAF conditions were reduced with dynamic visual feedback for those speakers who were minimally disrupted by DAF. Static visual feedback did not have an effect on either of the dependent variables.

The increase in utterance duration in the DAF-Video condition is consistent with the results from Experiment 1, and the interpretation of those results is also applicable here. It is possible that the duration of the closed portion of syllables was increased with DAF, in line with the articulatory findings from Zimmermann et al. (1988), and that the dynamic visual feedback reinforced this, resulting in a further increase to this portion of the syllable.
While the differences were small, more prolongations were identified during error coding in the DAF conditions with visual feedback compared to no visual feedback; prolongations accounted for 22% of the total errors in DAF-Video, 20% in DAF-Picture, and 17% in DAF-NoVideo. One possible caveat with this explanation is that the prolongations coded in Experiment 1 did not follow this pattern; prolongations accounted for more of the total errors in the DAF-Picture condition (22%) than in the DAF-Video condition (18%) in Experiment 1. If the increased utterance duration is in part due to more and/or longer prolongations when DAF is paired with dynamic visual feedback, as suggested above, then we might have expected to see more prolongations in the Experiment 1 DAF-Video condition than in the DAF-Picture condition, which was the pattern observed in Experiment 2. However, it is important to keep in mind the coding criterion used for prolongations. Prolongations were defined as “abnormal and/or incongruous prolongation of a segment within a word” (Table 3.2). Thus, if an utterance was produced very slowly with many prolongations throughout, any one instance of prolongation would not have been “abnormal and/or incongruous,” and as such would not have been coded as a prolongation. The coded prolongations represent extreme cases of prolongation within a given utterance. All this is to say that even though there were fewer prolongations coded in the DAF-Video condition compared to the DAF-Picture condition for Experiment 1, there may still have been longer syllable closure durations which contributed to the overall increase in utterance duration for the DAF-Video condition. A quantitative method for identifying prolongation, or including a measure of segment durations as per Chesters et al. (2015), would clarify this issue.

The predicted reduction in speech errors with visual feedback was not observed in the main analysis.
However, a further analysis which tested whether there was a differential effect of visual feedback based on the degree of DAF disruption a speaker experienced did reveal an effect (where ‘disruption’ was the mean number of errors a speaker made in the DAF-NoVideo condition). Speakers who produced minimal speech errors with DAF showed a reduction in speech errors when dynamic visual feedback was paired with NAF. Jones and Striemer (2007) also found that visual feedback reduced speech errors for participants classified as “low-disruption,” although this was observed when visual feedback was paired with DAF, not NAF. They interpreted this to mean that low-disruption speakers were those who could largely ignore their auditory feedback and instead focus on feedback that was synchronous with their actual output (i.e. somatosensory and visual feedback).

The different results in these two experiments could be due to differences in the experimental design and analysis. Jones and Striemer (2007) included a condition where participants repeated the auditorily presented sentences while also reading them on the monitor. In the analysis of the low-disruption group in Jones and Striemer (2007), it was the difference between this condition and the visual feedback condition which was significant (when paired with DAF). They failed to find a significant difference within this subset of speakers for the conditions which would be equivalent to DAF-NoVideo and DAF-Video in the present experiment. In this respect, the results for the two experiments could be considered similar, since the analyses in this chapter also failed to find a difference here. As for the different results reported for the NAF comparisons, this could be due to the different analyses used.
Unlike the analysis presented in this chapter, which used a continuous measure of disruption, Jones and Striemer used a median split to create two categories of speakers: low- and high-disruption (where ‘disruption’ was operationalized in the same way as the experiments in this chapter). A median split treats all values in a category as being equal, whether a value is close to the median or far away, thus there is some loss of information. As can be seen in Figure 3.10, the reduction in errors in the NAF-Video condition at the lowest end of the disruption range was quite modest. If a similarly small reduction occurred in Jones and Striemer’s data, this pattern could have been obscured once all participants below the median disruption level were grouped together.

Part of the interpretation proposed by Jones and Striemer (2007) for their low-disruption results is still relevant for the results of the present experiment; namely, speakers who are minimally disrupted by DAF most likely rely less on auditory feedback than speakers who experience extensive disruption. This interpretation is also supported by the results of the DAF experiment by Fabbro and Darò (1995) with simultaneous interpreters. According to Chon et al. (2013), these minimally disrupted speakers could be considered more verbally proficient, with more accurate or stable feedforward control of speech. The results of Experiment 2 also raise the possibility of an additional requirement for visual feedback to have an effect. When dynamic visual feedback is added speech output can become more fluent, but this may depend on all sources of feedback being synchronous (i.e. in the NAF condition); the results suggest that it is not sufficient for the visual feedback to be synchronous with just the somatosensory feedback but not the auditory feedback. While this interpretation is at odds with the results from Jones and Striemer, there is some support for this in the results from Chesters et al.
(2015) showing that speech production was disrupted to a similar degree by a variety of visual feedback delays, regardless of whether there was synchrony with the auditory feedback (i.e. by also delaying the auditory feedback). In all cases there was some degree of asynchrony between the auditory, somatosensory, and visual feedback.

In considering such an interpretation of the results, it should be noted that the temporal requirements for audio and visual signal alignment have been investigated extensively in the context of speech perception. A visual influence on auditory perception does not require the two signals to be synchronous. For example, while the McGurk effect is strongest when the signals are temporally aligned (Munhall et al., 1996) there is still evidence of a (weakened) visual influence when the visual signal leads the audio signal by as much as 360 ms (Jones and Jarick, 2006). The temporal window of visual influence is typically described as asymmetric; a visual signal that leads an audio signal is tolerated more than an audio signal that leads a visual signal, although the exact range of the window may depend on the type of stimuli used (Schwartz and Savariaux, 2014).

There may be a similar tolerance for some amount of temporal asynchrony between speech feedback signals, but perhaps not to the same extent as seen in perception. There is evidence to suggest that temporal misalignments among feedback signals are not well tolerated. The clearest example of this is the disruptive effects of DAF. Additionally, the targeted delays described in the study by Cai et al. (2011) resulted in the production of longer durations that extended beyond the point of the temporal perturbation. And shadowing tasks that involve asynchronous audiovisual stimuli do not result in the accuracy improvements seen with synchronous audiovisual stimuli (Reisberg et al., 1987).
Given this, it is possible that a greater degree of temporal cohesion among multimodal signals is more important in the context of speech feedback than speech perception. This issue is returned to in Chapter 5.

The analysis of Experiment 2 suggests that static visual feedback does not elicit changes in speech production. One potential concern with the static visual feedback presented in this experiment is the fact that it was an image of a face in a non-speech posture. In the Calvert and Campbell (2003) experiment described in Section 3.4, the static visual images were from the peak of the syllable and the baseline image was a face with a neutral expression, with the target letter superimposed on the lips. For the comparison of the baseline to the static image conditions, it was noted that the static images resulted in greater activation of the face fusiform region (which is important for facial recognition), despite the fact that both images were of a stilled face. This baseline condition is similar to the static visual feedback used in this experiment, and so raises the question of the extent to which this feedback was processed as being relevant for a speech task.

Assuming that the static visual feedback was treated by speakers as being relevant for speech, there are still other possible reasons why this form of visual feedback did not affect speech output in the manner predicted (i.e. by minimizing disfluencies, but to a lesser degree than dynamic visual feedback). One difficulty in presenting static visual feedback during the production of whole utterances is that it is not possible to present static feedback which matches the auditory signal (e.g. producing [u] and seeing a still image of the face articulating [u]). The neutral expression with mouth closed that participants were presented with may have had varying effects depending on what was being articulated.
For example, this static form may have facilitated production during closed or labial sounds, reducing speech errors, but hindered production during open sounds, increasing speech errors. Alternatively, Campbell (1996) suggests that a static face paired with an auditory speech signal could be parsed as two separate events rather than a single event, thus diminishing the effects of the visual signal on speech processing. The results of this experiment suggest that the production of longer utterances is more sensitive to dynamic visual information; this contrasts with the examples described in Section 3.4 which involved static visual information influencing single syllables or short multisyllabic words.

The results presented in this chapter suggest that dynamic, but not static, visual feedback influences speech production. This effect was not consistent across the conditions and measures, however. Speech produced with DAF exhibited a further increase in utterance duration when paired with dynamic visual feedback. In contrast, dynamic visual feedback had a facilitatory effect in terms of speech errors (i.e. speech errors were reduced), but only when paired with NAF and only for those speakers who were minimally disrupted by DAF. One possible interpretation of this is the importance of temporal cohesion among feedback signals.

3.5 General discussion

The experiments presented in this chapter tested whether visual feedback can enhance speech production output during a DAF task. Experiment 1 randomized the presentation of static and dynamic visual feedback within each block in order to make it less likely that participants would respond strategically, which is common in response to DAF. Utterance duration unexpectedly increased when dynamic visual feedback was paired with both normal and delayed auditory feedback and there were no changes in the number of speech errors between the two visual feedback conditions.
The failure to find a facilitative effect of dynamic visual feedback may have been due to the random presentation of the types of feedback.

Experiment 2 grouped the different types of visual feedback into consistent blocks, to contrast with Experiment 1. Additionally, a condition with no visual feedback was added in order to more carefully compare the effects of static and dynamic visual feedback. Once again, utterance duration increased with dynamic visual feedback, but in this experiment it was only when paired with DAF. While the main analysis did not find an effect of visual feedback on speech errors, an additional analysis which included the degree of DAF disruption experienced by the participant as a predictor showed that speech errors were reduced for participants who were least disrupted by DAF when dynamic visual feedback was paired with normal auditory feedback. This was interpreted to mean that verbal proficiency is relevant for the use of visual feedback during the production of whole utterances, and also that temporal cohesion among signals may have more stringent requirements in feedback than it does in perception. The results also suggest that it is the time-varying component of visual feedback, rather than the static form, which is important in enhancing speech output. A more detailed comparison of the two experiments is presented below.

The increased duration observed when dynamic visual feedback was paired with NAF in Experiment 1 was attributed to a possible effect of segmental duration increases, following results from Chesters et al. (2015). However, this may also have been due to the greater number of pauses which occurred in Experiment 1 compared to Experiment 2, as shown in Figure 3.4 compared to Figure 3.8. The number of pauses identified during error coding for all conditions in both experiments is shown in Table 3.14.
Since there were an additional four stimuli in Experiment 1, the pause counts are represented as pauses per 100 sentences to facilitate comparison between the experiments. In Experiment 1 there were more than twice as many pauses in the dynamic visual feedback conditions, for both NAF and DAF, compared to Experiment 2. For the DAF-Picture condition there were slightly fewer than twice as many pauses in Experiment 1 as in Experiment 2, and for the NAF-Picture condition there were a similar number of pauses in the two experiments. The greater number of pauses in Experiment 1 may have contributed to the longer utterance durations when dynamic visual feedback was presented.

                       No Video    Picture    Video
Experiment 1    NAF    –           6.1        8.9
                DAF    –           26.7       32.5
Experiment 2    NAF    7.0         5.7        3.7
                DAF    15.2        15.4       14.5

Table 3.14: Number of pauses per 100 sentences for each condition in Experiment 1 and Experiment 2.

One point to note on this issue is that all pauses, which were quite long in some instances (up to 2000 ms in DAF conditions), were included in the measure of utterance duration. Duration was calculated with pauses included in order to be consistent with previous studies looking at visual feedback during a DAF task. While the previous studies using this paradigm don’t explicitly state that there were long pauses in the data,³ the reported mean durations of utterances and their constituent segments in Chesters et al. (2015) suggest that pauses on the order of 750 ms (on average) occurred in that experiment. While pauses are undoubtedly a common feature of speech produced with DAF, their inclusion in measures of utterance duration may obscure some of the more subtle effects of visual feedback. A measure such as articulation rate, which calculates syllables per second with pauses excluded (e.g.
Chon et al., 2013; Stuart et al., 2002), may be more insightful for measuring differences between visual feedback conditions, and is recommended for future research.

[3] These studies also don't report whether pauses of a certain duration were excluded from the analysis.

                                          No Video      Picture       Video
Experiment 1
Utterance duration (ms)             NAF   –             2235 (66)     2253 (66)
                                    DAF   –             3479 (76)     3539 (66)
No. speech errors per utterance     NAF   –             0.36 (0.06)   0.39 (0.06)
                                    DAF   –             1.38 (0.07)   1.41 (0.08)
Experiment 2
Utterance duration (ms)             NAF   2188 (62)     2172 (53)     2125 (56)
                                    DAF   3084 (90)     3084 (51)     3194 (78)
No. speech errors per utterance     NAF   0.39 (0.06)   0.41 (0.06)   0.29 (0.05)
                                    DAF   0.98 (0.07)   1.00 (0.06)   0.97 (0.07)

Table 3.15: Means and standard errors of the means for the dependent variables in Experiment 1 and Experiment 2.

This increase in the number of pauses in Experiment 1 compared to Experiment 2 was echoed in the main dependent variables. Overall, Experiment 1 resulted in greater disfluency, with both the utterance duration means and speech error means having higher values than in Experiment 2. The means and standard errors are reproduced in Table 3.15 for ease of comparison. The biggest differences are seen for the DAF conditions. The mean utterance duration was greater by 395 ms and 345 ms for the DAF-Picture and DAF-Video conditions, respectively, of Experiment 1 compared to Experiment 2. The mean number of speech errors per utterance was greater by 0.38 for the DAF-Picture condition, and 0.44 for the DAF-Video condition of Experiment 1 compared to Experiment 2. Longer durations were also observed for the NAF conditions of Experiment 1 compared to Experiment 2, but these differences were smaller than the differences between the DAF conditions of the two experiments.
One exception to this pattern is the NAF-Picture condition; there were fewer speech errors on average in Experiment 1 compared to Experiment 2.

One interpretation of this overall greater disfluency is that Experiment 1 was more difficult than Experiment 2, possibly due to the randomized presentation of the different types of visual feedback. If this is the case, the failure to find a facilitative effect of dynamic visual feedback in Experiment 1 may have been due to task difficulty. In perception, audiovisual integration has been shown to be somewhat sensitive to the demands of increased cognitive load. Using a McGurk paradigm, the addition of a non-speech perceptual monitoring task (in either the auditory or visual domains) has been shown to reduce the influence of visual information (Alsius et al., 2014, 2005). A more modest reduction in the McGurk effect has also been reported when cognitive load is increased by adding a working memory task (memorizing a string of numbers) (Buchan and Munhall, 2012). There are similarities between these tasks, which were designed to increase difficulty, and the conditions of Experiment 1, which involved: memorization of the stimuli to be repeated, additional auditory feedback monitoring demands introduced by the alternating blocks of delayed and non-delayed auditory feedback, and monitoring of randomly presented static and dynamic visual feedback (an unusual source of speech feedback). The conditions of Experiment 2 involved the first two of these, but each block presented just one type of visual feedback. This may have reduced the difficulty of the task enough that an effect of visual feedback could be observed.
However, Experiment 2 is still a challenging task, and this may be part of the reason why the reduction in speech errors with dynamic visual feedback was primarily found for more verbally proficient participants.

Finally, with regard to the looking variable: As discussed in Section 2.4, each item was coded for whether the participant was looking at the monitor during the repetition of the stimulus, after it was noted that many participants looked away from the monitor for a large number of items. This information was included in the analyses as a control variable. Similar procedures (or alternatives such as the use of catch trials) are not reported in previous studies (Chesters et al., 2015; Jones and Striemer, 2007; Tye-Murray, 1986). One concern with this is that the analyses of speech errors in Experiments 1 and 2, and the analysis of utterance duration in Experiment 1, showed that the looking variable was a significant factor in these models; participants produced more errors and greater durations when they were looking away from the monitor compared to when they were looking at the monitor. It is possible that some of the patterns reported in earlier work could have been influenced by trials in which participants were not looking at the visual feedback, which would contribute to some of the variability across the studies. It also makes it difficult to interpret reported effects of visual feedback if there's a chance that participants were not actually looking at the visual feedback.

Why might participants have looked away, or closed their eyes, in such high numbers during this task? One possible reason is that participants felt uncomfortable watching themselves; at the end of the experiments, many participants commented on how strange or funny it was to see themselves talk. However, this reason is unlikely, given that participants consistently looked at the monitor during the bite block task with visual feedback, which is presented in the next chapter.
Perhaps the most likely reason is the challenging nature of the task, as discussed above. This is supported by the fact that there were more sentences in which participants looked away from the monitor during Experiment 1, the more challenging of the experiments, than Experiment 2. Additionally, production was more disfluent when participants were not looking at the monitor, suggesting that they looked away when they felt particularly challenged by the task. This interpretation is consistent with work showing that gaze aversion (i.e. looking away from an interlocutor) increases in response to increased difficulty of a cognitive task, and this happens in face-to-face communication as well as video-conferencing (Doherty-Sneddon and Phelps, 2005).

The two experiments in this chapter explored the effect of visual feedback on a delayed auditory feedback (DAF) task. The results of Experiment 1 suggest that visual feedback does not enhance speech production in a DAF task which varies the presentation of visual feedback. However, when the different types of visual feedback (dynamic, static, no visual feedback) were presented in consistent blocks as in Experiment 2, there was a modest reduction in speech errors. These improvements occurred with dynamic, but not static, visual feedback, suggesting that it is the time-varying properties of visual feedback which have the potential to elicit changes in speech production. This is consistent with the hypothesis that it is the temporal properties of visual feedback that establish compatibility with other sources of speech feedback. The finding that enhanced fluency with dynamic visual feedback (found only for the most verbally proficient speakers) occurred during productions with normal, rather than delayed, auditory feedback, suggests that temporal compatibility between feedback signals may include a requirement for synchrony.
That is, feedback can enhance productions only when all signals are aligned; it may not be sufficient for only a subset of the feedback signals to be aligned (e.g. only somatosensory and visual feedback in the case of delayed auditory feedback). This issue is returned to in Chapter 5.

The experiment presented in the next chapter (Chapter 4) was designed with these issues in mind. The next experiment is a bite block task, in which participants repeat single words with and without a bite block, and with and without visual feedback. An oral perturbation was used as this difficult speaking task makes it more likely that speakers will rely on sensory feedback (Lane et al., 2005) but it does not interfere with the temporal synchrony of the feedback signals. During the experiment participants are presented with consistent blocks of visual feedback, as in Experiment 2 from the present chapter. Finally, to reduce the difficulty of the task, participants are required to repeat single words rather than whole sentences.

Chapter 4

Visual feedback during a bite block task

4.1 Overview

This chapter reports on an experiment that examined the effect visual feedback has on the production of vowels. Participants produced monosyllabic words in four conditions that varied in terms of the presence or absence of visual feedback and an oral perturbation. The acoustic analysis provides evidence that visual feedback can enhance the distinction between vowel categories and can minimize the variability within vowel categories. Optical flow analysis of video recordings revealed considerable inter-speaker variation in lower face motion; however, a subset of participants did produce greater magnitudes of motion during normal production with visual feedback.
While there was a modest positive correlation between facial motion and acoustic contrast overall, the participants who produced vowels with greater magnitudes of motion in the presence of visual feedback tended to exhibit less acoustic contrast than participants who produced vowels with smaller magnitudes of motion. Overall, these results support the hypothesis that visual feedback can enhance speech output, but they also raise questions concerning the relationship between articulation and acoustics in achieving this enhancement.

4.2 Introduction

4.2.1 The effect of visual information on phonological contrasts

Visual information has been shown to be relevant to the acquisition of phonological contrasts in two areas: altered speech production in the visually impaired suggests that visual information is involved in normal native language acquisition, and non-native speech production that is guided by visual speech information suggests that visual information can be used to learn second language phonological contrasts.

Ménard and colleagues (2015; 2009; 2013) have documented the differences between congenitally blind and sighted French speakers in the production of vowels. Participants were required to produce each of the ten French oral vowels in the carrier phrase V comme pVpa (V as in pVpa), with acoustic and articulatory measures made of the initial, sustained vowel. Euclidean distances between specific vowel pairs were measured to determine acoustic vowel contrast, with the mean of these distances representing the global between-category contrast for the vowel space. Blind speakers were found to produce less between-category acoustic contrast than sighted speakers (Ménard et al., 2009). When this measure was used to compare specific phonological feature contrasts rather than the vowel space as a whole, the same decrease in acoustic contrast was observed for blind speakers, especially for contrasts that differed in terms of both rounding and place of articulation (e.g.
/i/–/u/), rather than just one of these dimensions (e.g. /i/–/y/) (Ménard et al., 2013). The amount of dispersion within each vowel category, which "reflects the precision with which a specific goal is reached" (2013, p. 2984), was also assessed, by measuring the Euclidean distance from each token (in F1 × F2 × F3 space) to the mean of its category. Blind speakers' vowel categories had greater dispersion than sighted speakers' vowel categories (Ménard et al., 2013). Additionally, tongue position and curvature from ultrasound images and upper lip protrusion from video images were measured. Blind speakers produced smaller lip protrusion distances and greater tongue backing and tongue curvature compared to sighted speakers (Ménard et al., 2013).

The results from these studies suggest that visual speech information plays a role in the implementation of phonological contrasts, affecting both the degree of contrast achieved and the manner in which the contrast is implemented. When visual information is not available during development, acoustic vowel contrast is reduced and the individual tokens of a vowel category are less tightly clustered. Ménard et al. (2013) suggest that this may be due to greater articulatory variability; they make a comparison to sighted speakers, suggesting that access to visual information during perception would help to reduce variability by providing additional information that could be used to guide their articulatory targets. These studies also show that for blind speakers the extent of movement of visible articulations is reduced, while non-visible articulations are enhanced.

In addition to this example of the absence of visual speech information impacting vowel contrasts, visual speech information has been investigated in the context of second language (L2) acquisition. As expected, given the enhanced perceptual accuracy for audiovisual speech reported in other difficult listening conditions (e.g.
speech in noise; Sumby and Pollack, 1954), audiovisual speech, compared to audio-only, also improves the perception of L2 vowel contrasts. For example, in a speeded syllable classification task which varied in terms of presence or absence of the Catalan vowel contrast /ɛ/–/e/, only Catalan-dominant bilinguals showed sensitivity to the contrast when the stimuli were presented auditorily, but both Catalan-dominant and Spanish-dominant bilinguals were sensitive to the vowel contrast during audiovisual presentation of the stimuli (Navarra and Soto-Faraco, 2007).

Improvements in producing L2 contrasts are also observed when visual speech information is made available. Computer-assisted pronunciation systems have been used to compare the effects of presenting different types of visual speech information. Using native speaker ratings of production accuracy as a measure, Massaro et al. (2008) reported improvements in the production of the Mandarin contrast /i/–/y/ when participants completed training while seeing a front view of lip motion produced by a talking head. Similar improvements were not found when participants learned the Arabic contrast /k/–/q/ with a sagittal view of the talking head which also displayed the internal articulators. Given that this result could have been due to the different phoneme contrasts rather than the type of visual feedback, there is scope for further exploration. Katz and Mehta (2015) took up this question in the context of real-time visual feedback, presenting English-speaking participants with a visualization of their moving tongue (derived from EMA sensors) within a talking head while producing a voiced palatal stop.
While articulatory accuracy (measured as the number of successful attempts at “hitting” an articulatory target displayed in the oral cavity of the talking head with a marker displayed on the tongue) improved after training, the analysis of the spectral burst in the acoustic recordings revealed considerable inter-speaker variability.

This research suggests that visual speech information plays a role in shaping vocalic and consonantal contrasts in the context of both development and the acquisition of non-native contrasts by adults. The experiment presented in this chapter tests whether providing visual feedback can also change adult speakers' native vowel contrasts.

4.2.2 The effect of oral perturbations on vowel production

Bite block perturbations have been used in a range of research to investigate multimodal sensory processing. Early work showed that compensation for a bite block is almost immediate, and for the English vowels measured the acoustic compensation is nearly complete (Fowler and Turvey, 1980), suggesting a limited role for auditory feedback in the restructuring of the necessary articulatory dynamics. However, later work showed that immediate compensation may be much less, especially for consonants, but there is improvement over time (McFarland and Baum, 1995). Lane et al. (2005) aimed to adjudicate this issue by looking at the interaction between hearing status and oral perturbation. Participants included adult cochlear implant users (age at implantation ranged from 28 to 78) tested before and after implantation. The experiment tested the effect of the bite block with both masked and normal auditory feedback, and acoustic vowel dispersion (the “spread” of the tokens within a given category) and vowel contrast (the degree of separation between vowel categories) were measured. The results suggest that compensation for a bite block does involve use of auditory feedback.
The bite block increased vowel dispersion, but this increase was greater prior to implantation than it was when re-tested with the bite block one year after implantation, suggesting a role for long-term experience with auditory feedback in compensating for an oral perturbation.

In terms of articulatory reactions to oral perturbations, high inter-speaker variability is often reported, with compensation ranging from complete to absent (Ménard et al., 2008; Savariaux et al., 1995). In a comparison of blind and sighted speakers' compensation for a lip-tube perturbation during the production of French /u/, Ménard et al. (2015) found that blind speakers produced a greater degree of tongue curvature than sighted speakers. This finding is consistent with the study described in the previous section, showing that blind speakers tend to produce more extreme non-visible articulations than sighted speakers (Ménard et al., 2013).

In the experiment presented in this chapter, a bite block paradigm was used in combination with the presentation of visual feedback during speech production. Following Lane et al. (2005) and Ménard et al. (2013), the acoustic contrast and dispersion of vowels were analyzed. These measures are defined in more detail in the Acoustic measures section of the Methods.

Hypothesis and predictions

The general hypothesis of this dissertation is that visual feedback, which has time-varying properties that make it compatible with other sources of speech feedback (auditory and somatosensory), will enhance speech output. The visual speech signal is a rich source of information, with jaw motion (Johnson et al., 1993) and lip position (e.g. Fromkin, 1964; Montgomery and Jackson, 1983) being particularly relevant for vowels.
When this modality is not available during development, because of congenital blindness, the acoustic and articulatory properties of vowels are changed relative to speakers who have access to visual speech information (Ménard et al., 2009, 2013). Given this previous research, one way in which speech output is predicted to be enhanced is in terms of the production of vowel contrasts. Perturbing speech production with a bite block increases reliance on feedback (Lane et al., 2005), thus visual feedback is predicted to enhance speech output to a greater extent during bite block speech compared to normal (non-perturbed) speech.

Consistent with Lane et al. (2005) and Ménard et al. (2013), the following predictions were made. (A graphical summary of the predictions is presented in Figure 4.1.) Vowel contrast is predicted to be reduced during bite block production and increased with visual feedback. The visual feedback effect is predicted to be greater during bite block production compared to normal speech production. Vowel dispersion, which is used here to refer to the spread of tokens within a vowel category rather than the distance between vowel categories as in Dispersion Theory (Flemming, 1995), is predicted to be increased during bite block production and reduced with visual feedback. The visual feedback effect is predicted to be greater during bite block production compared to normal speech production. In line with the articulatory results from Ménard et al. (2013) showing articulatory differences between blind and sighted speakers during vowel production, visual feedback was predicted to result in comparable changes.

[Figure 4.1 image not reproduced. Panels: Vowel contrast, Vowel dispersion, Magnitude of motion. Horizontal axis: type of oral perturbation (no bite block, bite block). Vertical axis: magnitude of effect. Legend: no visual feedback, visual feedback.]

Figure 4.1: Predicted effects of the experimental manipulations on the variables to be measured.
The analysis of this experiment focused on visible articulatory differences since this aligns with what participants were able to see during the presentation of visual feedback. A global measure of lower face magnitude of motion was used, with the magnitude predicted to be reduced during bite block production and increased with visual feedback. Since the bite block will limit jaw opening, the visual feedback effect may be observed more during production without the bite block.

[Figure 4.2 image not reproduced. Vowel chart with the following vowel and word pairs: /u/ who'd, /ʊ/ hood, /ɑ/ hod, /æ/ had, /ɛ/ head, /ɪ/ hid, /i/ heed.]

Figure 4.2: The canonical positions in the vowel space for the General American English monophthongs used in the experiment (adapted from Ladefoged (2006)). Each vowel was produced in a monosyllabic word of the form /hVd/ (represented orthographically next to the appropriate vowel).

4.3 Methods

4.3.1 Participants

Thirteen students (mean age: 24.2 years (sd: 10.6); 5 males) recruited from The University of New Mexico's Linguistics 101 course participated in the experiment. Participants were compensated for their time with extra credit toward their final grade. All participants self-reported having normal hearing and normal or corrected-to-normal vision, and all were native English speakers.

4.3.2 Stimuli

Seven monosyllabic words of the form /hVd/ served as stimuli for participants' production, and the vowel in each word was one of seven American English monophthongs. Figure 4.2 shows the vowel positions in the vowel space.

                                    visual feedback
                                    absent          dynamic
oral perturbation   no bite block   NoBB-NoVideo    NoBB-Video
                    bite block      BB-NoVideo      BB-Video

Table 4.1: The four conditions presented to each participant.

4.3.3 Procedure

The experiment, which required participants to repeat words in four different feedback conditions, followed the procedure outlined in Section 2.4. A within-participants design with two independent variables was used.
These variables manipulated the type of oral perturbation and visual feedback participants received during speech production, with the variables being either absent or present. When the oral perturbation was absent, participants produced speech as they would normally. For the oral perturbation, participants produced the stimuli while holding a bite block between their upper and lower right molars. The bite block was a disposable pair of wooden chopsticks, approximately 8 mm in diameter. An example of this configuration is shown in Figure 4.3. Visual feedback consisted of a real-time video of the utterance being produced by the participant. When visual feedback was absent participants looked at a fixation point on the monitor. Each participant produced repetitions of the stimuli in the four conditions outlined in Table 4.1.

The stimuli were presented multiple times in each condition: each condition contained four blocks, each of which contained two repetitions of each stimulus. Stimulus order was randomized in each block. Condition order was pseudo-randomized across participants; participants always completed the conditions without the bite block before the conditions with the bite block, but the order of the visual feedback conditions was randomized within each of these subgroups. The oral perturbation conditions were ordered in this way to avoid any possible compensation or adaptation effects which could occur when shifting from the bite block conditions to the no bite block conditions. This resulted in the four condition orders presented in Table 4.2.

Figure 4.3: Bite block configuration used in the oral perturbation conditions. Participants held the chopsticks between their upper and lower right molars while producing the stimuli. During speech production participants looked straight ahead at the visual feedback on the monitor with the chopsticks typically lying parallel to the microphone. (Note that this is not the case in this image. This video still was chosen to more clearly show the bite block; it came from a period between blocks while the participant turned to the side.)

In total each participant produced 14 words per block, and thus 56 words per condition and 224 words for the experiment, yielding a total of 2912 words across all 13 participants.[1][2] Two words were discarded due to technical errors during the recording of the experiment, leaving 2910 words in the analysis.

[1] In addition to these stimuli, a set of sibilant stimuli (seat, sheet, sort, short) were also included. In total then, each participant produced 22 words per block, and thus 88 words per condition and 352 words for the experiment.

[2] In a pilot study for this experiment, participants produced twice as many repetitions of each word. However, this proved to be too taxing for the participants during the bite block conditions. To avoid participants becoming too fatigued during the experiment, the number of repetitions per condition was reduced to two.

          Condition 1     Condition 2     Condition 3   Condition 4
Order 1   NoBB-NoVideo    NoBB-Video      BB-NoVideo    BB-Video
Order 2   NoBB-Video      NoBB-NoVideo    BB-NoVideo    BB-Video
Order 3   NoBB-NoVideo    NoBB-Video      BB-Video      BB-NoVideo
Order 4   NoBB-Video      NoBB-NoVideo    BB-Video      BB-NoVideo

Table 4.2: The four condition orders used in the experiment.

4.3.4 Measures

4.3.4.1 Acoustic measures

Two acoustic measures were used: vowel contrast and vowel dispersion. Vowel contrast refers to the distance between vowel categories, and aims to capture the spread of the vowel space. Vowel dispersion refers to the distance between individual vowels and the mean of their category, and aims to capture the spread of vowel categories.

The audio recordings were segmented, annotated, and measured using Praat (Boersma and Weenink, 2014) and FAVE (Rosenfelder et al., 2011).
FAVE (Forced Alignment and Vowel Extraction) includes two programs which enable automatic alignment of orthographically transcribed data and acoustic recordings, and automatic extraction of vowel formant measures. The command-line versions of these programs were used for the analysis of the present data set. A Praat script was used to create a TextGrid with word boundaries and phoneme labels. FAVE-align was then used to automatically add phoneme boundaries for each word. These boundaries were checked and manually corrected if required. During the manual check and correction phase, vocalic onsets and offsets were identified by the following criteria:

1) Vocalic onsets were identified by the onset of periodicity in the waveform and the onset of harmonic structure in the spectrogram. In cases where these did not align due to temporal smearing in the spectrogram, the boundary was placed at the onset of periodicity.
2) Vocalic offsets were identified by an abrupt intensity decrease in the waveform and the offset of high frequency components in the spectrogram. In some cases the offset could also be identified by the offset of periodicity.

Formant analysis was carried out using FAVE-extract as an interface for Praat. Before running FAVE-extract, formant settings were chosen for each participant. To do this, the spectrograms from a random word in each block were viewed in tandem with Praat's formant tracker, and the settings were adjusted to optimize the formant tracker's performance.[3] The two settings adjusted were the number of formants and the upper limit of the formant search range. Once these values were determined, the FAVE-extract settings were specified: Praat's formant prediction algorithm along with the pre-determined settings for each participant. Praat's algorithm was chosen as the formant prediction method over the Mahalanobis prediction method as this enabled subsequent manual checking of the automatically extracted formant values in Praat.
For each participant, formant values were extracted at three points during the vowel production (25%, 50%, and 75%) in order to capture some of the temporal dynamics.

[3] The formant tracker had the most difficulty with /u/, so each token of who'd was also checked.

Following Lane et al. (2005), the effects of the oral perturbation and visual feedback manipulations were assessed in terms of vowel dispersion and vowel contrast. Vowel dispersion refers to the spread of a vowel category, operationalized as mean Euclidean distances (d), and vowel contrast refers to the spread of the vowel space, operationalized as mean Mahalanobis distances (D_M).

Euclidean distance is the distance between two points in Euclidean space; in this case, the distance between a point and the average of all the points of that category. For example, this is the distance between each instance of [i] and the mean of all the tokens of /i/, measured in F1 × F2 space. The Euclidean distance formula is given in Equation 4.1. This calculation was applied per participant per condition at the three time points during the vowel production. That is, for each participant, and at each of the three time points, the mean of each vowel category was calculated for each condition, and the distances between this mean and each token from that category were calculated.

\[
d(V_i, V_{\bar{i}}) = \sqrt{(F1_i - F1_{\bar{i}})^2 + (F2_i - F2_{\bar{i}})^2} \tag{4.1}
\]

where
V = a vowel category
i = the i-th vowel category
\bar{i} = the mean of the i-th vowel category
F1 = the first formant
F2 = the second formant

Mahalanobis distance is similar to Euclidean distance, however the formula takes into account the variance and covariance of the category a particular token is being compared to, not just the mean of that category (Lane et al., 2005).
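To make the difference between the two distance measures concrete, the following is a minimal sketch in Python, assuming NumPy and a small set of made-up (F1, F2) values in Hz for one vowel category. The function names and the example tokens are hypothetical and are not taken from the analysis scripts used in this study; the sketch only illustrates that the Euclidean distance of Equation 4.1 depends on the category mean alone, while the Mahalanobis distance also scales by the category covariance.

```python
import numpy as np


def euclidean_to_mean(tokens):
    """Distance from each (F1, F2) token to its category mean, as in Equation 4.1."""
    tokens = np.asarray(tokens, dtype=float)
    mean = tokens.mean(axis=0)
    return np.sqrt(((tokens - mean) ** 2).sum(axis=1))


def mahalanobis_to_category(token, tokens):
    """Mahalanobis distance from one (F1, F2) token to a category distribution."""
    tokens = np.asarray(tokens, dtype=float)
    diff = np.asarray(token, dtype=float) - tokens.mean(axis=0)
    # Sample covariance of the category (assumes at least three non-collinear tokens).
    cov = np.cov(tokens, rowvar=False)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))


# Hypothetical F1/F2 values (Hz) for five tokens of one vowel category.
example_tokens = [[300, 2200], [320, 2350], [280, 2100], [310, 2150], [290, 2300]]

# Vowel dispersion for this category: mean Euclidean distance to the category mean.
dispersion = euclidean_to_mean(example_tokens).mean()

# Mahalanobis distance of the first token to the same category.
first_token_dm = mahalanobis_to_category(example_tokens[0], example_tokens)
```

Because the Mahalanobis distance divides out the spread of the category, a token that lies far from the mean in Hz can still be close in Mahalanobis terms if the category is highly variable along that direction.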
Mahalanobis distance has been used in a variety of speech studies ranging from automatic speech recognition, for mapping training data from one language to a new language (e.g. Sooful and Botha, 2001), to second language learning, for calculating the distance between L2 productions and the native language target (e.g. Kartushina et al., 2015). Multiple Mahalanobis distances were calculated for each token, as each token was compared to the distribution of each of the seven vocalic categories at three points in the vowels. Thus for each token, twenty-one distances were calculated (e.g. [i] → /i/, [i] → /ɪ/, [i] → /ɛ/, [i] → /æ/, [i] → /ɑ/, [i] → /ʊ/, [i] → /u/, at 25%, 50%, and 75% points in the vowel). The Mahalanobis distance formula is given in Equation 4.2.

\[
D_M = \sqrt{(x_i - \mu_j)^T S^{-1} (x_i - \mu_j)} \tag{4.2}
\]

where
x = a vector of F1 and F2 values
\mu = a vector of mean F1 and F2 values
i = the i-th vowel category
j = the j-th vowel category
(\cdot)^T = the transpose operation
S^{-1} = the inverse covariance matrix

These distances were used to calculate the average vowel space (AVS) distance, using a similar procedure to Lane et al. (2005). AVS was the mean of the Mahalanobis distances between vowel pairs, calculated as per the following procedure:

1. For each token, the Mahalanobis distance to the distribution of each vowel category was calculated. This resulted in seven distances for each token. A visualization of this step is shown in Figure 4.4.
2. The AVS distance for each token was calculated by averaging over the seven distances from step 1. These token-specific AVS distances were used in the statistical models.
3. The preceding two steps were repeated at each measurement point in the vowel (25%, 50%, 75%).

There are two differences between this procedure and the one used by Lane et al. (2005). The Mahalanobis distance between a token and its own category (e.g. the distance between [i] and /i/) was measured and included in the calculation of AVS for the present analysis, but not in Lane et al.'s analysis.
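Steps 1 and 2 of the token-level AVS procedure above can be sketched as follows. This is an illustrative Python sketch, assuming NumPy, with a hypothetical function name and input layout; it is not the code used for the reported analysis. It handles the tokens from a single participant, condition, and measurement point, so step 3 would correspond to calling it once per measurement point. As in the present analysis, the distance from a token to its own category is included in the average.

```python
import numpy as np


def token_avs(formants, labels):
    """Token-specific AVS: mean Mahalanobis distance to every vowel category.

    formants: (n_tokens, 2) array of F1 and F2 values for one participant,
              one condition, and one measurement point (assumed layout).
    labels:   length n_tokens sequence giving the vowel category of each token.
    """
    X = np.asarray(formants, dtype=float)
    labels = np.asarray(labels)
    categories = np.unique(labels)

    # Step 1: distance from every token to the distribution of every category.
    dists = np.empty((len(X), len(categories)))
    for k, cat in enumerate(categories):
        members = X[labels == cat]
        mu = members.mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(members, rowvar=False))
        diff = X - mu
        # Quadratic form diff^T S^-1 diff, evaluated row by row.
        dists[:, k] = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

    # Step 2: average over categories (own category included) for each token.
    return dists.mean(axis=1)
```

Keeping one AVS value per token, rather than one value per set of vowel pairs, is what preserves the number of data points available to the statistical models.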
This distance wasincluded in the calculation in order to capture all of the vowel category distributionsin the AVS measure. Lane et al. calculated pairwise Mahalanobis distances, andthen averaged over these distances to calculate AVS.4 This approach results in areduced data set for the statistical analysis, as the AVS measure is calculated for aset containing all the different vowel pairs; that is, instead of there being an AVSmeasure for each token of a vowel, there is one AVS measure for a set of vowels.This can be problematic for within-participant analyses and for repeated measures(Karlsson and van Doorn, 2012). The approach described above is an attempt toavoid this reduction in the number of data points for the statistical analysis. Articulatory measuresA measure of visible articulation, focusing on the lower face, was used for theanalysis. Motion of the lower face was measured using optical flow analysis (OFA),4The description of this calculation in Lane et al. (2005) is initially described as involving pair-wise distances, although at one point in the paper what is described is similar to the procedure used inthe present analysis: “Each repetition of a particular vowel on the ith trial is given a Mahalanobis dis-tance to the distribution of each of the other vowels. The square roots of the distances were averagedfor each group and listening condition” (2005, p. 1640).91Figure 4.4: Mahalanobis distances from a token of the vowel [i] to the distri-bution of each vowel category.a method for extracting 2D movement information from videos without the needfor pre-defined measurement locations. The basic operation of the algorithm isdescribed by (Barbosa et al., 2008b, p. 6-7):Moving images are recorded as changes in the intensity (and color)values for the pixels in the image array that are influenced by the mo-tion. 
The optical flow algorithm does not merely register the change of intensity from one image to the next for each pixel; rather it attempts to keep track of specific intensity values, corresponding to image objects as they change location within the pixel array. Thus, the algorithm assigns a motion vector consisting of a magnitude and a direction to each pixel based on where the intensity associated with that pixel in one image is located in the next image in sequence. The direction is simply the line from the first pixel to the second and the magnitude corresponds to the Euclidean distance between them. The array of motion vectors comprises the optical flow field.

Since the difference between adjacent frames involves a temporal difference, this displacement of intensity values can be represented as pixel velocity and thus the inferred movement is in terms of velocity of motion.

Figure 4.5: A screenshot from the FlowAnalyzer software with a region of interest (ROI; black rectangle) specified for a particular participant. The ROI encompassed mouth movements that occurred throughout all conditions of the experiment.

The algorithm used in the present analysis is from Horn and Schunck (1981), as implemented by Barbosa et al. (2008b) in their FlowAnalyzer software. [5] A single, continuous video recording of the experiment was made for each participant. Using FlowAnalyzer, a region of interest (ROI) was specified for each video recording. The ROI was positioned to capture the speaker’s lower face movements, particularly the mouth, throughout the experiment. As such, the top edge of the rectangular ROI was positioned near the tip of the speaker’s nose and the bottom edge was positioned at the top of the microphone (which was usually aligned with the bottom of the speaker’s chin). The outer edges of the ROI were positioned to include the maximal extent of mouth position along the transverse plane throughout the experiment.
An example ROI is shown in Figure 4.5.

[5] https://www.cefala.org/FlowAnalyzer/

In determining the ideal ROI size and position, a number of issues were taken into consideration. The main concern was whether the ROI should encompass the whole lower face or just the mouth. The difficulty with using a mouth-only ROI is that multiple ROIs must be created throughout the experiment, since the position of the speaker’s head, and thus mouth, changes between blocks and conditions. This means that an additional normalization procedure would need to be introduced in order to pool data across the multiple ROIs for each speaker. [6]

In order to see the differences between the motion measured within the lower face ROI and the mouth ROI, an informal comparison of the two was made. OFA was performed on the video recording of the first block from each of the two NoBB conditions for one participant. As can be seen in Figure 4.6, the pattern across the vowels was essentially the same for the two ROIs, but with larger magnitudes for the mouth-only ROI. Given that similar patterns of motion were extracted from the two test ROIs, the choice was made to use the lower face ROI for ease of data extraction. Large ROIs have also been successfully used in other contexts. When OFA is used to calculate correlations between multimodal speech signals, large ROIs (encompassing the whole lower face, for example) have been shown to perform on par with more targeted analyses using marker dots for capturing facial motion (Barbosa et al., 2008a). When used to measure tongue movements from ultrasound, similar results were obtained regardless of whether a wide or narrow ROI was used (Hall et al., 2015).

The output of OFA includes five vectors quantifying values and magnitudes in the horizontal and vertical directions, as well as a summed magnitude of motion (velocity) within the ROI. This summed magnitude is a single scalar value; it is the sum of the Euclidean magnitudes for each pixel in the ROI.
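The summed magnitude described above can be illustrated with a short Python sketch. It is not FlowAnalyzer's code or interface: the function name `summed_magnitude`, the nested-list representation of the flow field, and the ROI tuple layout are assumptions made purely for illustration.

```python
import math

def summed_magnitude(flow_x, flow_y, roi):
    """Sum of per-pixel Euclidean magnitudes inside a rectangular ROI.

    flow_x, flow_y: 2D lists (rows of pixels) holding the horizontal and
        vertical velocity components for one frame-step.
    roi: (top, bottom, left, right) pixel bounds; bottom and right are
        exclusive (a hypothetical convention chosen for this sketch).
    """
    top, bottom, left, right = roi
    total = 0.0
    for r in range(top, bottom):
        for c in range(left, right):
            # Euclidean magnitude of this pixel's motion vector.
            total += math.hypot(flow_x[r][c], flow_y[r][c])
    return total

# Toy 2x2 flow field with made-up values.
flow_x = [[3.0, 0.0], [0.0, 0.0]]
flow_y = [[4.0, 0.0], [0.0, 1.0]]
print(summed_magnitude(flow_x, flow_y, (0, 2, 0, 2)))  # whole image as ROI
print(summed_magnitude(flow_x, flow_y, (0, 1, 0, 2)))  # top row only
```

Because the value is a sum over pixels, it grows with ROI area as well as with the amount of motion, which is one reason values from ROIs of different sizes are not directly comparable.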
The OFA output values are calculated for each frame-step in the video sequence. Following previous research (e.g. Barbosa et al., 2008a; Fuhrman, 2014; Hall et al., 2015), the summed magnitude of motion (MM) was used in the present analysis.

Since MM values are highly correlated between adjacent frame-steps, an averaging procedure was used to derive a single MM value for each token (Hall et al., 2015). Each mean MM was calculated from the frame-steps that occurred during the onset of the word (/h/) plus the first half of the vowel. This portion of the word was chosen in order to capture the maximum extent of the opening gesture of the target vowel. A Python script [7] time-aligned the Praat TextGrid boundaries with the OFA frame-step sequences, then extracted the OFA vectors and calculated the relevant means.

[6] An alternative view is that it may not be necessary to normalize when using a mouth-only ROI. Creating multiple ROIs which only capture mouth movement is akin to using OFA as an object-tracker, and since the object changes shape the ROI will need to change accordingly (Adriano Vilela Barbosa, personal communication).

Figure 4.6: Comparison of mouth and lower face magnitudes of motion from two ROIs. Calculations were made from the first block of the NoBB-NoVideo condition (left) and the first block of the NoBB-Video condition (right) for one participant. In each of these blocks the participant produced two repetitions of each stimulus, so each point represents the mean of two repetitions.

4.4 Results

The F1xF2 vowel spaces, as measured at the 25%, 50%, and 75% points in the vowel, are shown in Figures 4.7, 4.8, and 4.9, respectively.
The statistical results for the acoustic and articulatory measures are discussed below.

[7] The original Python script was written by Michael McAuliffe and I modified the script for this experiment.

Figure 4.7: Vowel space as measured at the 25% point in the vowels.

Figure 4.8: Vowel space as measured at the 50% point in the vowels.

Figure 4.9: Vowel space as measured at the 75% point in the vowels.

4.4.1 Vowel contrast

Vowel contrast refers to the spread of the vowel space, operationalized as the average vowel space (AVS) (the mean Mahalanobis distance between categories). A summary of the AVS distance results is displayed in Figure 4.10. The statistical analysis involved a number of iterations of model construction before an adequate fit was found that also met the assumptions for model validity. These steps are outlined below.

A linear mixed effects analysis was performed using the general formula outlined in Section 2.6. The two independent variables were type of oral perturbation and type of visual feedback, and the maximal random effects structure was specified. Visual inspection of the residual plots of these models revealed clear heteroscedasticity.

Since the raw data followed a gamma distribution (i.e. the values were continuous, positive, and there was right-skewing), generalized linear mixed effects models were tested, specifying that the data be fit to a gamma distribution.
Models were constructed and tested using three different R packages: lme4, glmmADMB (Skaug et al., 2014), and MASS (using glmmPQL) (Venables and Ripley, 2002). There were problems with all of these models, either in terms of failure to converge, which was not resolved by simplifying the random effects structure, or heteroscedasticity and non-normality of the residuals. It is possible that these issues were due to implementation problems inherent to the packages. [8]

[8] The possibility of implementation problems for gamma GLMMs is noted at http://glmm.wikidot.com/faq.

Figure 4.10: Means and standard errors for AVS distance for each condition at three points in the vowel.

It is generally preferable to fit data to the appropriate distribution rather than transforming data and fitting them to a normal distribution. However, given the problems encountered with implementing a gamma model, data transformation was tested instead. For an underlying gamma distribution a cube root transformation is required to normalize the distribution. The cube root of each AVS distance was calculated and these values were used as the dependent variable in the models.

A linear mixed effects model was once again constructed, with the default fixed effects structure. The maximal random effects structure was used, although this differed between the models constructed for the different points in the vowels. For the models of vowel contrast at the 25% and 50% points, random effects included intercepts for participants and phonemes, as well as by-participant and by-phoneme random slopes for the effect of the interaction between type of oral perturbation and type of visual feedback. Visual inspection of the residual plots for these models revealed no obvious deviations from homoscedasticity.

The models for the 75% point in the vowel failed to converge with this random effects structure.
To address this, the random effect for phoneme was simplified to include the intercept as well as by-phoneme random slopes for the effect of type of oral perturbation and type of visual feedback, minus the interaction term. While this model converged, visual inspection of the residual plots revealed deviations from homoscedasticity. The distribution of the cube-root AVS distances was checked to see if this could be driving the problem. The distribution still exhibited some right-skewing, suggesting that the underlying distribution was actually lognormal rather than gamma. [9] Log AVS distances were calculated for this measurement point, and the linear mixed effects models were re-fit with these distances as the dependent variable. The residual plots for these models were improved relative to those for the models using the cube root dependent variable, so these models were used to evaluate the effect of the experimental variables on AVS distance at the 75% point in the vowel. [10]

[9] The lognormal and gamma distributions are continuous probability distributions which take only positive real numbers. A simple procedure for determining which of these distributions a variable follows is to take the logarithm of the variable; if the variable is from a lognormal distribution the logarithm will be normally distributed, while if it is from a gamma distribution the logarithm will exhibit left-skewing.

[10] While the residual plots were improved by using log AVS distances instead of cube root AVS distances, the p-values obtained from the subsequent likelihood ratio tests were very similar for both sets of models.

Table 4.3 gives a summary of the fixed and random effects structures and data transformations used for each statistical model.

Point   DV          Fixed effects structure                                Random effects structure
25%     ∛AVS        type of oral perturbation * type of visual feedback    (1 + oral * visual | participant) + (1 + oral * visual | phoneme)
50%     ∛AVS        type of oral perturbation * type of visual feedback    (1 + oral * visual | participant) + (1 + oral * visual | phoneme)
75%     log(AVS)    type of oral perturbation * type of visual feedback    (1 + oral * visual | participant) + (1 + oral + visual | phoneme)

Table 4.3: Summary of the dependent variables and effects structures used in the statistical models of vowel contrast.

The models for each measurement point in the vowel are considered in turn. Model comparison at the 25% time point revealed a significant effect of visual feedback (χ²(1) = 4.1681, p < .05); the presence of visual feedback increased the AVS distance. The interaction at this time point was not significant. Table 4.4 shows the fixed effects coefficients, standard errors, and t-values for the model with the significant visual feedback effect.

                   Coefficient (Std. Error)    t-value
(Intercept)        2.63 (0.08)                 34.85
oral(biteblock)    −0.09 (0.06)                −1.53
visual(video)      0.07 (0.03)                 2.11

Table 4.4: Fixed effects for the model of (cube root) AVS (25%) with the significant visual feedback effect.

Model comparison at the 50% time point revealed that both the effect of visual feedback and the interaction between type of oral perturbation and type of visual feedback were close to significance (visual feedback effect: χ²(1) = 3.5346, p = .0601; interaction: χ²(1) = 2.9687, p = .08489). There was an overall tendency for the AVS to increase with visual feedback, and this tendency was stronger during normal production than during bite block production. Table 4.5 shows the fixed effects coefficients, standard errors, and t-values for the model with the interaction.

                                 Coefficient (Std. Error)    t-value
(Intercept)                      2.65 (0.09)                 30.61
oral(biteblock)                  −0.03 (0.05)                −0.52
visual(video)                    0.10 (0.04)                 2.71
oral(biteblock):visual(video)    −0.10 (0.06)                −1.73

Table 4.5: Fixed effects for the model of (cube root) AVS (50%) with a near-significant interaction of the predictors.

Model comparison at the 75% time point failed to find any significant effects.

4.4.2 Vowel dispersion

Vowel dispersion refers to the spread of vowel categories, operationalized as the mean of the Euclidean distances between each vowel token and the mean of the vowel category. The mean Euclidean distance results for the four conditions at each time point in the vowel are shown in Figure 4.11. The statistical analysis involved a number of iterations of model construction before an adequate fit was found that also met the assumptions for model validity. These steps are outlined here.

A linear mixed effects analysis using the default formula outlined in Section 2.6 was performed to determine the relationship between Euclidean distance and the two independent variables: type of oral perturbation and type of visual feedback. Phoneme category was also added as a fixed effect, rather than a random effect. Phonemes, rather than items, were included as a random effect in the other analyses of this experiment since item and phoneme category are conflated in the stimuli (i.e. there is only one word for each phoneme category). Additionally, only a subset of the possible English vowels was used in the experiment and thus the levels of the phoneme factor were not exhausted, which is one possible criterion for including a variable as a random effect. However, the set of possible phonemes is a small one, in contrast to the set of possible participants, for example, and sampling only a small number of factor levels from a small set can be reason to include the variable as a fixed effect.
In such a situation it can also be problematic to specify the variable as a random effect, and this was the case for the analysis of Euclidean distance. These models resulted in random effects correlations that were all at 1.0 or close to 1.0, even once the slope term was simplified, suggesting that the random effects structure was overfitting the data. To address this it was decided to model phoneme category as a fixed effect, with a by-participant random slope for phoneme. This also has the advantage of allowing any differences between vowels to be observed. Visual inspection of residual plots with this model structure revealed clear heteroscedasticity and also showed that the residuals were not normally distributed.

Figure 4.11: Means and standard errors for Euclidean distance for each condition at three points in the vowel.

The procedure for fitting the data to a gamma distribution in a generalized linear mixed effects model that was outlined for the vowel contrast data above was also tested here. The same problems were encountered, so a cube root transformation of the data was performed.
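The reasoning behind choosing between a cube root and a log transformation for right-skewed data can be sketched in Python. The analysis reported here relied on visual inspection of distributions and residual plots in R; the skewness statistic below is a stand-in for that inspection, and the function names and toy values are assumptions introduced only for this illustration.

```python
import math

def skewness(values):
    """Skewness as the third standardized moment (population form).
    Positive values indicate right-skewing, negative values left-skewing."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def compare_transforms(values):
    """Skewness of strictly positive data before and after each transform."""
    return {
        "raw": skewness(values),
        "cube_root": skewness([v ** (1 / 3) for v in values]),
        "log": skewness([math.log(v) for v in values]),
    }

# Made-up, lognormal-like data: exponentials of a symmetric set of values.
toy = [math.exp(z) for z in (-2, -1, 0, 1, 2)]
for name, value in compare_transforms(toy).items():
    print(name, round(value, 3))
```

For data like the toy set, the cube root reduces but does not remove the right skew, while the log removes it; residual right skew after a cube root transformation is the kind of pattern that motivates trying a log transformation.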
The cube root of each Euclidean distance was calculated and these values were used as the dependent variable in the models. New linear mixed effects models were constructed, with maximal random effects structures, although due to convergence issues this differed between the models constructed for the different points in the vowels.

Point   DV                        Fixed effects structure                                          Random effects structure
25%     ∛(Euclidean distance)     type of oral perturbation * type of visual feedback + phoneme    (1 + oral + visual + phoneme | participant)
50%     ∛(Euclidean distance)     type of oral perturbation * type of visual feedback + phoneme    (1 + oral + visual + phoneme | participant)
75%     ∛(Euclidean distance)     type of oral perturbation * type of visual feedback + phoneme    (1 + oral * visual + phoneme | participant)

Table 4.6: Summary of the dependent variables and effects structures used in the statistical models of vowel dispersion.

In the model for the 75% point in the vowel, the random effects included intercepts for participants, as well as by-participant random slopes for the effect of the interaction between type of oral perturbation and type of visual feedback plus a random slope for phoneme. The models for the 25% and 50% points in the vowel failed to converge with this random effects structure, so they were simplified to include intercepts for participants with by-participant random slopes for the effect of phoneme, type of oral perturbation, and type of visual feedback, minus the interaction term. Visual inspection of the residual plots for all these models revealed no obvious deviations from homoscedasticity, and normal distributions of residuals.
Table 4.6 gives a summary of the fixed and random effects structures and data transformations used for each statistical model.

The models for each measurement point in the vowel are considered in turn. Model comparison at the 25% point revealed a significant effect of visual feedback (χ²(1) = 4.7673, p = .029), with Euclidean distance decreasing with visual feedback, and a significant effect of phoneme (χ²(6) = 20.069, p < .01), with the high back vowels /u, ʊ/ exhibiting the greatest vowel dispersion. [11] The effect of oral perturbation narrowly missed significance (χ²(1) = 3.6844, p = .05492). There was a tendency for the Euclidean distance to increase during bite block productions. The analysis failed to find a significant interaction at this measurement point.

[11] For the effect of phoneme, the reference level in the model was /ɑ/, the default level set by R. Phoneme was included as a fixed effect primarily to address problems encountered when it was specified as a random effect, rather than because it was a primary experimental manipulation. As such, there were no a priori predictions for vowel category comparisons, and so the default factor level was not changed. Also of note is the fact that changing the reference level does not affect the outcome of likelihood ratio testing, which was used to determine significance of the factors.

                   Coefficient (Std. Error)    t-value
(Intercept)        3.66 (0.13)                 28.47
oral(biteblock)    0.09 (0.05)                 1.99
visual(video)      −0.10 (0.04)                −2.29
phoneme(æ)         0.12 (0.07)                 1.75
phoneme(ɛ)         −0.04 (0.09)                −0.49
phoneme(ɪ)         −0.12 (0.09)                −1.27
phoneme(i)         −0.23 (0.09)                −2.65
phoneme(ʊ)         0.29 (0.09)                 3.35
phoneme(u)         0.43 (0.12)                 3.51

Table 4.7: Fixed effects for the model of (cube root) Euclidean distance (25%) with the significant visual feedback and phoneme effects.
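The step from a likelihood ratio statistic to a p-value, as used in the model comparisons above, can be sketched in Python for the one-degree-of-freedom case. This is only an illustration: the analysis itself was run in R, the function names are hypothetical, and the sketch assumes the statistic follows a chi-square distribution with 1 degree of freedom (so it does not cover the 6-df phoneme tests).

```python
import math

def lrt_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic: twice the difference in log-likelihood
    between the fuller model and the nested, reduced model."""
    return 2 * (loglik_full - loglik_reduced)

def lrt_p_value_df1(chi_square):
    """Upper-tail p-value for a chi-square statistic with 1 degree of
    freedom, using the identity P(X > x) = erfc(sqrt(x / 2))."""
    if chi_square < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(chi_square / 2))

# The two 1-df statistics reported for the 25% measurement point.
for statistic in (4.7673, 3.6844):
    print(statistic, round(lrt_p_value_df1(statistic), 4))
```

The identity holds because a chi-square variable with 1 degree of freedom is the square of a standard normal variable, which is why no statistics library is needed for this special case.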
Table 4.7 shows the fixed effects coefficients, standard errors, and t-values for the model with the significant visual feedback and phoneme effects.

Model comparison at the 50% point revealed a significant interaction between type of oral perturbation and type of visual feedback (χ²(1) = 4.185, p < .05). The contrasts for the interaction were obtained using multcomp (Hothorn et al., 2008). Post-hoc comparisons revealed that Euclidean distance was smallest during normal production with visual feedback. There was a significant decrease in Euclidean distance in the NoBB-Video condition compared to the BB-NoVideo condition (z = -2.806, p = .02477; adjusted p-value, single-step method) and the BB-Video condition (z = -3.133, p < .01; adjusted p-value, single-step method). There was also a marginally significant decrease in Euclidean distance in the NoBB-Video condition compared to the NoBB-NoVideo condition (z = -2.391, p = .07531; adjusted p-value, single-step method). In addition to the significant interaction, there was also a significant effect of phoneme (χ²(6) = 19.883, p < .01), with the high back vowels /u, ʊ/ exhibiting the greatest vowel dispersion. Table 4.8 shows the fixed effects coefficients, standard errors, and t-values for the model with the significant interaction.

                                 Coefficient (Std. Error)    t-value
(Intercept)                      3.56 (0.15)                 24.18
oral(biteblock)                  0.07 (0.06)                 1.10
visual(video)                    −0.12 (0.05)                −2.39
phoneme(æ)                       0.07 (0.08)                 0.91
phoneme(ɛ)                       −0.02 (0.11)                −0.15
phoneme(ɪ)                       −0.01 (0.12)                −0.12
phoneme(i)                       −0.18 (0.11)                −1.61
phoneme(ʊ)                       0.26 (0.08)                 3.46
phoneme(u)                       0.41 (0.11)                 3.70
oral(biteblock):visual(video)    0.13 (0.06)                 2.05

Table 4.8: Fixed effects for the model of (cube root) Euclidean distance (50%) with the significant interaction between type of oral perturbation and type of visual feedback.

Model comparison at the 75% point revealed a significant effect of phoneme (χ²(6) = 22.728, p < .001), with the front vowels /i, ɪ, ɛ/ exhibiting the least vowel dispersion.
The effect of oral perturbation was close to significance (χ²(1) = 3.1981, p = .07372); there was a tendency for Euclidean distance to increase during bite block production compared to normal production. The analysis failed to find a significant interaction at this measurement point. Table 4.9 shows the fixed effects coefficients, standard errors, and t-values for the model with the significant phoneme effect. Figure 4.12 shows the differences between the vowels at each measurement point in the vowel.

                   Coefficient (Std. Error)    t-value
(Intercept)        3.86 (0.13)                 28.73
oral(biteblock)    0.13 (0.06)                 1.93
visual(video)      −0.03 (0.04)                −0.64
phoneme(æ)         −0.24 (0.08)                −2.99
phoneme(ɛ)         −0.50 (0.09)                −5.79
phoneme(ɪ)         −0.47 (0.09)                −4.98
phoneme(i)         −0.41 (0.12)                −3.45
phoneme(ʊ)         −0.21 (0.12)                −1.76
phoneme(u)         0.05 (0.12)                 0.42

Table 4.9: Fixed effects for the model of (cube root) Euclidean distance (75%) with the significant effect of phoneme.

Figure 4.12: Means and standard errors for Euclidean distance for each phoneme in each condition at each measurement point in the vowel.

Figure 4.13: Means and standard errors for magnitudes of motion of the lower face for each condition.

4.4.3 Lower face magnitude of motion

Lower face magnitude of motion (MM) was calculated from optical flow analysis of the video recordings made during the experiment. The mean MM results for the four conditions are shown in Figure 4.13. As with the acoustic measures, the statistical analysis involved a number of iterations of model construction before an adequate fit was found that also met the assumptions for model validity.
These steps are outlined here.

A linear mixed effects analysis was performed using the default formula outlined in Section 2.6. The two independent variables were type of oral perturbation and type of visual feedback, and the maximal random effects structure was specified. Visual inspection of the residual plots of these models revealed clear heteroscedasticity and also showed that the residuals were not normally distributed.

The raw MM values exhibited considerable right-skewing. Unlike with the acoustic measures, this was not sufficiently improved by a cube root transformation. However, a log transformation did result in a more symmetrical distribution, so these values were used as the dependent variable in the model. Inspection of this new model showed that the residuals were still somewhat heteroscedastic. To address this, outliers were identified by calculating z-scores for the untransformed MM values for each participant and excluding those that were beyond ±2.5 (approx. 2.5% of all tokens, leaving 2835 data points in the analysis).

The correlations between the fixed effects parameters for this model also suggested some degree of collinearity between the experimental variables. This was addressed by sum coding the variables instead of treatment coding (as described in Section 2.6.3). A new model was fitted and model inspection confirmed that the residuals were homoscedastic and normally distributed, and also showed reduced fixed effects correlations.

One final problem was the high random effects correlations for the phoneme random variable, suggesting that the model was overfitting the data. The significance of the slope parameters was assessed by likelihood ratio tests between models with and without the various slopes. This process revealed that, for the phoneme random effect, only a random slope for the oral perturbation was required.
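The outlier exclusion step described above (z-scores on the untransformed MM values, computed within participant, with a ±2.5 cutoff, followed by a log transformation) can be sketched in Python. This is not the script used for the analysis: the function name, the (participant, MM) tuple format, the use of the sample standard deviation, and the toy data are all assumptions made for illustration.

```python
import math
import statistics
from collections import defaultdict

def exclude_outliers(tokens, threshold=2.5):
    """Drop tokens whose raw MM z-score, computed within participant, lies
    beyond +/- threshold. Returns (participant, mm, log_mm) tuples."""
    by_participant = defaultdict(list)
    for participant, mm in tokens:
        by_participant[participant].append(mm)

    # Per-participant mean and sample SD (the SD variant is an assumption).
    summary = {}
    for participant, values in by_participant.items():
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[participant] = (statistics.mean(values), sd)

    kept = []
    for participant, mm in tokens:
        mean, sd = summary[participant]
        z = (mm - mean) / sd if sd > 0 else 0.0
        if abs(z) <= threshold:
            kept.append((participant, mm, math.log(mm)))
    return kept

# Made-up MM values for one hypothetical participant, with one extreme token.
toy = [("P01", v) for v in [0.4, 0.5, 0.6] * 6] + [("P01", 5.0)]
kept = exclude_outliers(toy)
print(len(toy), len(kept))
```

Computing the z-scores within participant, rather than over the pooled data, keeps a speaker with generally large movements from having all of their tokens flagged.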
Subsequent models used this reduced random effects structure for the phoneme random variable alongside the maximal random effects structure for the participant variable.

Model comparison revealed that the effect of oral perturbation was close to significance (χ²(1) = 3.1591, p = .0755), with a tendency for MM to decrease during bite block production compared to normal production. The analysis failed to find an effect of visual feedback or an interaction between oral perturbation and visual feedback. Table 4.10 shows the fixed effects coefficients, standard errors, and t-values for the model.

                     Coefficient (Std. Error)    t-value
(Intercept)          −0.80 (0.05)                −17.25
NoBB vs. BB          −0.09 (0.05)                −1.84
NoVideo vs. Video    −0.01 (0.02)                −0.72

Table 4.10: Fixed effects for the model of (log) magnitude of motion. The effect of oral perturbation was close to significance.

Closer inspection of the OFA results reveals considerable interspeaker variation. The mean MM for each vowel in each condition and for each participant is shown in Figure 4.15 and Figure 4.16, highlighting the variation across participants in response to both the bite block and visual feedback. A summary of the mean MM patterns is shown in Figure 4.14 and a description of the patterns follows.

These figures show that in response to visual feedback, for both normal productions and bite block productions, the different possibilities for mean MM changes are all realized; there are examples of increases and decreases, as well as no change.

(Figure 4.14 panels: Articulatory reaction (P07, P08, P11, P13); Articulatory reaction (P06); No articulatory reaction (P01, P02, P03, P04, P05, P09, P10, P12). Rows: high vowels, non-high vowels.)

Figure 4.14: Different patterns of magnitudes of motion of the lower face in response to visual feedback.
The magnitudes were pooled for each group of participants, with a separation between high vowels (/i ɪ ʊ u/) and non-high vowels (/ɛ æ ɑ/).

Five participants (P06, P07, P08, P11, and P13) exhibited quite large changes in mean MM in response to visual feedback, although participant P06’s responses patterned in a different way from the other four participants. The biggest changes were observed with the non-high vowels.

During normal production there was a tendency for MM to increase with visual feedback. Examples of this can be seen for participants P07 [æ], P08 [æ], P11 [æ], and P13 [ɛ, æ, ɑ]. Participant P02 [ɑ] also followed this pattern. Two notable exceptions can be seen from participants P06 [ɑ] and P12 [æ], who showed decreased MM.

During bite block production there was a tendency for MM to decrease with visual feedback. Examples of this can be seen for participant P07 [ɛ, ɑ]. Participant P01 [ɛ] also followed this pattern.

For the high vowels the changes in MM with visual feedback were more variable and of a smaller magnitude. There are high vowel examples which follow the pattern described for the non-high vowels: an MM increase with visual feedback during normal production (P13 [i]) and an MM decrease with visual feedback during bite block production (P07 [i], P08 [u]). However, the opposite pattern also occurred. For example, participant P06 [ʊ] exhibited decreased MM with visual feedback during normal production and participant P13 [ʊ] exhibited increased MM with visual feedback during bite block production.

Figure 4.15: Magnitudes of motion of the lower face for each participant’s production in each condition. The magnitude is shown for each of the three non-high vowels.
(NoBB = normal production, BB = bite block production)

Figure 4.16: Magnitudes of motion of the lower face for each participant’s production in each condition. The magnitude is shown for each of the four high vowels. (NoBB = normal production, BB = bite block production)

Given that there were between-participant differences in lower face motion in response to visual feedback, it is possible that the enhanced acoustic contrast with visual feedback reported in Section 4.4.1 was driven by those participants who had the greatest lower face motion. Correlations between mean MM and mean AVS were calculated to investigate this possible relationship. This was not part of the planned analysis, but is presented to gain further insight into the possible effects of visual feedback. AVS distances from the 25% and 50% points in the vowels were used, as visual feedback had a significant or near-significant effect at these points. These measurement points also most closely match the temporal span over which mean MM was calculated (i.e. the portion of the word up to the mid-point of the vowel). For each participant, the mean of each measure from each condition was calculated, with high and non-high vowels calculated separately in order to capture some of the differences observed in Figure 4.14. The correlation between mean MM and mean AVS was calculated for the group of participants who had an articulatory reaction to visual feedback (P07, P08, P11, P13) and the group of participants who did not have an articulatory reaction to visual feedback (P01, P02, P03, P04, P05, P09, P10, P12). Note that participant P06 was not included.
While this participant did have an articulatory reaction to visual feedback, it was in the opposite direction to the other participants (MM decreased instead of increased with visual feedback during normal production). It was decided to exclude this participant rather than calculate correlations for such a small subset of the data. The data are shown in Figure 4.17.

Spearman’s correlations were calculated since the data were not normally distributed. There was a significant correlation between mean MM and mean AVS at 25% for both the no articulatory reaction group (rs = .37, p < .01) and the articulatory reaction group (rs = .38, p < .05). There was also a significant correlation between mean MM and mean AVS at 50% for both the no articulatory reaction group (rs = .42, p < .001) and the articulatory reaction group (rs = .57, p < .001). For both groups, as mean MM increased so too did the mean AVS. This suggests that the increased acoustic contrast with visual feedback observed in Section 4.4.1 was not solely due to the participants who produced greater lower face motion. In fact, Figure 4.17 suggests that it was the participants who didn’t have an articulatory reaction to visual feedback who tended to produce greater acoustic contrast.

Figure 4.17: Correlations between mean MM and mean AVS at the 25% point in the vowel (top) and between mean MM and mean AVS at the 50% point in the vowel (bottom). Within each figure the data are grouped in terms of articulatory reaction to visual feedback. Dotted lines represent the median of each variable.
Scatterplot smoother lines include 95% confidence intervals.

4.5 Discussion

The results showed clear effects of visual feedback on the acoustic measures, as well as more subtle effects on the articulatory measures.

Visual feedback enhanced vowel contrast at the beginning of the vowel (25% point), and there was a tendency for visual feedback to enhance contrast at the vowel midpoint during normal production more than bite block production. Visual feedback also reduced vowel dispersion. At the beginning of the vowel this occurred for both normal and bite block production. The interaction between type of oral perturbation and type of visual feedback at the vowel midpoint showed that dispersion was minimized during normal production with visual feedback in comparison to both of the bite block conditions, as well as in comparison to normal production with no visual feedback, although this latter contrast was only marginally significant. No effect of visual feedback was observed at the end of the vowel. Overall the visual feedback effect on the acoustic measures is consistent with the predictions from Section 4.2.3; visual feedback enhanced contrast and reduced dispersion. However, the initial prediction was that this effect would be greatest during bite block production, but this was not borne out in the results. Instead, there tended to be a greater effect of visual feedback during normal production.

These results complement the findings reported by Ménard and colleagues (2009; 2013) comparing blind and sighted speakers’ productions of French vowel contrasts (Section 4.2.1). These studies used a measure of average vowel space similar to the one used in the present experiment, but based on Euclidean distances rather than Mahalanobis distances, and with the inclusion of F3 information to better capture French rounding contrasts.
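The difference between the two distance metrics mentioned here can be illustrated with a minimal sketch. It assumes that average vowel space is the mean of the pairwise distances between vowel category centroids in F1/F2 space, and that the Mahalanobis version scales those distances by a pooled within-category covariance; the exact computation used in this dissertation and in the cited studies may differ. The function names and the formant values are hypothetical and serve only as an illustration.

```python
import math
from itertools import combinations


def centroid(tokens):
    """Mean (F1, F2) of a list of tokens."""
    n = len(tokens)
    return (sum(t[0] for t in tokens) / n, sum(t[1] for t in tokens) / n)


def pooled_covariance(groups):
    """Pooled within-category 2x2 covariance, returned as (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    dof = 0
    for tokens in groups.values():
        mx, my = centroid(tokens)
        for x, y in tokens:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
        dof += len(tokens) - 1
    return sxx / dof, sxy / dof, syy / dof


def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


def mahalanobis(a, b, cov):
    """sqrt(d^T S^-1 d) for a 2x2 covariance S given as (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = a[0] - b[0], a[1] - b[1]
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


def average_vowel_space(groups, metric="euclidean"):
    """Mean pairwise distance between vowel category centroids (assumed definition)."""
    cents = {v: centroid(tokens) for v, tokens in groups.items()}
    cov = pooled_covariance(groups)
    dists = []
    for v1, v2 in combinations(sorted(cents), 2):
        if metric == "euclidean":
            dists.append(euclidean(cents[v1], cents[v2]))
        else:
            dists.append(mahalanobis(cents[v1], cents[v2], cov))
    return sum(dists) / len(dists)


if __name__ == "__main__":
    # Hypothetical (F1, F2) tokens in Hz; not data from the experiment.
    tokens = {
        "i": [(300, 2300), (320, 2250), (290, 2400)],
        "ae": [(700, 1800), (720, 1750), (680, 1900)],
        "u": [(330, 1100), (350, 1300), (310, 900)],
    }
    print("Euclidean AVS:", average_vowel_space(tokens, "euclidean"))
    print("Mahalanobis AVS:", average_vowel_space(tokens, "mahalanobis"))
```

Under these assumptions the Euclidean measure depends only on where the category centroids lie, whereas the Mahalanobis measure also shrinks as within-category variability grows, so the two need not rank conditions in the same way.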
Sighted speakers were found to produce greater acoustic contrast than blind speakers; in the present experiment, acoustic contrast was greatest when visual feedback was present compared to when it was absent. In terms of within-category vowel dispersion, Ménard et al. (2013) found that sighted speakers’ productions were less dispersed than blind speakers’ productions, which is also similar to the present results which found reduced vowel dispersion in the presence of visual feedback. The results from these studies and the experiment presented in this chapter support a role for visual information during speech production. Vowels produced with visual feedback are more acoustically contrastive relative to productions without visual feedback, just as vowels produced by speakers with normal vision are more acoustically contrastive than vowels produced by speakers with congenital visual deprivation.

The effect of visual feedback was not consistent across the measurement points during the vowels. The results from Section 4.4.1 and Section 4.4.2 have been re-plotted in Figure 4.18 to show this more clearly and to consider how the relationship between contrast and dispersion changes throughout the vowels. Vowel durations (footnote 12) were also checked, and a linear mixed effects model with a maximal random effects structure showed that there was no effect of the type of visual feedback or oral perturbation on vowel duration, nor was there an interaction between the two independent variables.

In terms of vowel contrast, the AVS measure was greatest at the 25% and 50% points (usually slightly greater at the midpoint) and lowest at the 75% point. A visual inspection of the plots suggests that this pattern is weakest for vowels produced without an oral perturbation (top left cell of Figure 4.18) and without visual feedback; instead the AVS remained relatively stable across the time points of the vowels.
Figure 4.18 also highlights that visual feedback had a bigger effect on vowel contrast than the bite block did, as supported by the statistical analysis. In terms of vowel dispersion, the general pattern was for dispersion to be greatest at the beginning of the vowel and then decrease and stabilize by the middle of the vowel. This was the case for vowels in the BB-NoVideo, NoBB-NoVideo, and NoBB-Video conditions. Dispersion for the BB-Video condition stayed at a fairly consistent level across the vowels.

Footnote 12: Mean vowel durations in milliseconds for the four conditions were as follows (standard deviations in parentheses): NoBB-NoVideo 272 (54); NoBB-Video 268 (57); BB-NoVideo 265 (59); BB-Video 265 (61).

[Figure 4.18 plot: vowel contrast (AVS) (top) and vowel dispersion (Euclidean distance) (bottom) at the 25%, 50% and 75% points in the vowel; panels: normal production, bite block production; legend: no video, video.]

Figure 4.18: Results from Section 4.4.1 and Section 4.4.2 re-plotted to show how the relationship between vowel contrast (top) and vowel dispersion (bottom) changes throughout the vowel.

In their study looking at the role auditory feedback plays in compensating for a bite block, Lane et al. (2005) describe a complementary relationship between contrast and dispersion: “high dispersion leads to low contrast (and conversely)” (p. 1638). When the acoustic goal region is large (i.e. highly dispersed), “the trajectory, guided by least effort [...], passes through the most proximal parts of the goal regions, thereby reducing its travel and hence vowel separation” (p. 1638). The general pattern observed in the results of this experiment is for contrast to be at its maximum and dispersion to be at its minimum in the middle of the vowel, which aligns with the proposed complementarity. This pattern is what one would expect for the ‘steady state’ portion of the vowel, and it was the midpoint that was also measured by Lane et al.
In the present experiment, this relationship between contrast and dispersion was (roughly) maintained throughout the vowel for the condition involving normal production and no visual feedback, that is, the condition that is most similar to typical speech. However, there was a considerable decrease in vowel contrast at the 75% measurement point for the other conditions, but this was not accompanied by a large increase in dispersion. While this measurement point was intended to primarily capture the end of the steady state portion of the vowels, it is possible that it sometimes captured formant transitions into the following consonant, especially for vowels that have more dynamic trajectories such as /æ/ and /ʊ/.

This possibility will first be considered for the contrast between the presence and absence of visual feedback during normal production (top left cell of Figure 4.18). Vowel contrast increased with visual feedback at the first two measurement points but decreased at the third measurement point. If the effects of transitioning into the following consonant counteracted the visual feedback effect then this decrease in contrast would be expected; all vowels were followed by /d/, in which case F2 for the front vowels would be expected to lower during the transition and F2 for the back vowels would be expected to rise, the net result being a reduction in contrast. One way to test this account would be to look at the complete vowel trajectory. If correct, then we would expect to see more dynamic trajectories with visual feedback compared to without visual feedback.

Comparing bite block productions (top right cell of Figure 4.18) to normal production without visual feedback, there is little difference in contrast at the first two measurement points, but a large decrease in contrast at the end of the vowels produced with a bite block.
In this case it is possible that the influence of the following consonant may start sooner during bite block productions as the speaker attempts to make a constriction in the presence of the bite block, thus the contrast is even more reduced than at the comparable point in the normal productions. Bite block studies do suggest that consonants are more affected by the perturbation than vowels. For example, McFarland and Baum (1995) found that participants were unable to completely compensate for a bite block during productions of /t p k/. However, the consonants were produced in the syllable onset rather than coda, and a different type of bite block was used (one that did not protrude from the mouth). Since these studies typically take measurements at the midpoint of the vowel (e.g. Lane et al., 2005; Ménard et al., 2007) (and sometimes the first glottal pulse, e.g. McFarland and Baum, 1995), it is difficult to know how the decreased contrast at the end of vowels produced with a bite block compares to previous work.

Phoneme category was included as a fixed effect rather than random effect in the statistical analysis of vowel dispersion, due to overfitting problems. The analysis showed that phoneme was a significant factor at each measurement point in the vowel, with the most prominent pattern being that dispersion was greatest for the high back vowels /u, ʊ/ (Figure 4.12). This is unexpected given acoustic descriptions of American English vowels (e.g. Bradlow, 1995; Heald and Nusbaum, 2015; Hillenbrand et al., 1995), which do not typically show the high back vowels to be substantially more variable than the other vowels. One possibility is that the high dispersion is due to some feature of the variety of American English spoken by the participants, eleven of whom were from New Mexico (mostly from Albuquerque). In this variety /u/-fronting is common and some speakers have been reported to neutralize the contrast between /u/ and /ʊ/ before an /l/ (Labov et al., 2006).
There may be more general instability in this region of the vowel space, leading to the high dispersion values reported here.

The articulatory measures revealed considerable interspeaker variation. Overall, there was no effect of visual feedback on the magnitude of motion of the lower face. For a subset of the participants, however, magnitude of motion increased with visual feedback during normal production. This effect was strongest for the non-high vowels, which is what one would expect since overall orofacial motion is much smaller for high vowels due to the low magnitude of jaw-related facial deformation. One participant produced the opposite pattern; the magnitude of motion of the lower face decreased with visual feedback during normal production, and this was observed with both non-high and high vowels. While the initial prediction that magnitude of motion would be reduced during bite block production was supported, the predicted increase in magnitude of motion with visual feedback was only partially supported.

One issue to consider is the extent to which the failure to find an effect of the experimental variables in the main analysis is due to the precision, or imprecision, of the MM measure. As discussed in Section, optical flow was calculated within an ROI that included the whole lower face. An informal comparison was made to a smaller ROI which just captured mouth movement. This showed that the pattern of movement for the two ROIs was very similar, but the magnitudes were different; the (larger) lower face ROI had a smaller magnitude of motion. The lower face ROI was chosen for ease of data extraction. While the comparison of the two ROIs confirmed that the larger ROI captured movement patterns that were representative of mouth movement, it is possible that some contrasts were obscured due to the smaller magnitudes of motion extracted from this ROI.
Additionally, since a single mean value was calculated to represent lower face motion, there was a loss of information regarding temporal dynamics of the movement. Future work in this area could expand the analysis to include temporal dynamics. One way to do this would be to represent the change in summed magnitude over time with coefficients of the discrete cosine transform, an approach that has been used for modeling the temporal dynamics of vowel formants (Watson and Harrington, 1999). Applying this type of analysis to both acoustic and magnitude of motion data could provide insight into how the relationship between the two changes over time.

A first approximation of the relationship between the acoustic and articulatory results was presented in Section 4.4.3. This was a post-hoc part of the analysis presented for exploratory purposes. Positive correlations between lower face magnitude of motion and acoustic contrast were found for the group of participants who had an articulatory reaction to visual feedback as well as the group of participants who did not. As such, the enhanced acoustic contrast found with visual feedback must not be due solely to those participants who produced greater lower face motion with visual feedback. The correlations in Figure 4.17 also suggest that the group of participants who did not have an articulatory reaction to visual feedback tended to produce greater acoustic contrast compared to the participants who did have an articulatory reaction to visual feedback. Further research is required to specifically test whether speakers have a preference for responding with greater acoustic or articulatory contrast in the presence of visual feedback. It would also be beneficial to include simultaneous recordings of tongue and lip movements to see if similar trade-offs are made in the presence of visual feedback to those reported by Ménard et al.
(2013) for blind and sighted speakers.

In the present experiment, visual feedback tended to have the greatest effect on vowels during normal production rather than bite block production, contrary to the initial prediction. This may have been due to the physical limitations introduced by the bite block, which may have limited the range of articulatory movement available to speakers in such a way that the changes made in response to visual feedback during normal production were no longer possible. While speakers are able to produce extreme tongue positions with a bite block in place, lip position compensation does not always occur (e.g. for /i/: Gay et al., 1981). The bite block used in the present experiment (chopsticks that protruded from the side of the mouth) may have impeded both tongue and lip positioning more than the bite blocks used in other experiments, which are typically a small block that is held between the molars or pre-molars. However, Figure 4.15 and Figure 4.16 show that some participants produced variation in lower face magnitude of motion during bite block production depending on whether visual feedback was available. For most of these cases, the magnitudes decreased when visual feedback was available. This suggests that the bite block did not substantially limit movement, at least not for all participants. An alternative explanation for the lack of visual feedback effect during bite block production appeals to the novelty of the task. Not only did participants have to watch themselves speak, but they did so with an object protruding from the mouth. The strangeness of this may have led to active inhibition of changes to their speech production, especially changes that were visually discernible. This issue of visual feedback differences with perturbed and unperturbed speech will be returned to in Chapter 5 in the context of possible limiting factors in multimodal feedback integration.

In this experiment visual feedback was shown to affect the production of vowels.
This was particularly clear for the acoustic measures; when visual feedback was available vowel contrast increased and vowel dispersion decreased. The articulatory measures were more variable; only a subset of participants produced greater magnitudes of lower face motion when visual feedback was available. Overall, these findings are in line with the hypothesis that visual feedback can enhance speech production.

Chapter 5

Discussion

5.1 Overview

This final chapter presents a summary of the experimental findings reported in Chapter 3 and Chapter 4, and relates these to theoretical issues concerning the multimodal nature of speech production. The present research is also relevant to questions concerning the link between perception and production, and a proposal for future research in this area is outlined.

5.2 Summary of experimental results

The experiments reported in this dissertation tested speakers’ ability to incorporate visual speech feedback. While visual feedback is an atypical source of speech feedback, it is temporally compatible with auditory and somatosensory feedback since the time-varying properties of these signals are all generated by the same act of speaking. Given this compatibility, visual feedback was predicted to enhance speech output, providing reinforcement of the typical feedback signals especially during difficult speaking conditions. Real-time visual feedback, of the sort one would see when looking in a mirror, was presented during two perturbation tasks: one which delayed the auditory feedback and one which introduced an oral perturbation with a bite block.

The two experiments in Chapter 3 compared the effects that different types of visual feedback (dynamic and static) had when paired with normal and delayed auditory feedback.
In the first experiment, the different visual feedback conditions were randomized within a block in order to make it less likely that participants would actively use the visual feedback as a strategy to counteract the disruptive effects of delayed auditory feedback (DAF). On the basis of previous research, dynamic visual feedback was predicted to enhance the production of whole utterances when paired with DAF, decreasing utterance duration and the number of speech errors. The results showed an increase in utterance duration with dynamic visual feedback but no significant change in the number of speech errors, suggesting that sustained exposure to visual feedback is required before speech enhancement is observed.

The second experiment in Chapter 3 presented the different types of visual feedback (dynamic, static, no visual feedback) in consistent blocks. In doing so, it tested whether the predicted visual feedback effects would be observed when visual feedback was presented consistently, and it also allowed a more careful consideration of the properties of visual feedback that may drive speech production changes. As in the first experiment, utterance duration increased with dynamic visual feedback. One possible explanation for this effect, which was in the opposite direction to the initial prediction, is that durational increases over specific segments contributed to the overall durational increases. Previous research has reported that the closed portion of the syllable (i.e. the period between the movement offset of the VC gesture and movement onset of the CV gesture) is lengthened in response to DAF, particularly for visible places of articulation (i.e. labial, labiodental); this effect could be reinforced by dynamic visual feedback, resulting in even longer durations.
Unlike the first experiment, a small reduction in speech errors was observed; there was an overall reduction in speech errors with dynamic visual feedback, but this was only significant for those participants who were minimally disrupted by DAF. Interestingly, this effect occurred when dynamic visual feedback was paired with normal, but not delayed, auditory feedback. This finding suggests that temporal cohesion among multimodal feedback signals may be important in speech production. There was no effect of static visual feedback on either of the measures, suggesting that it is instead the time-varying properties of visual feedback which are important for enhancing speech output.

Chapter 4 presented the results of a bite block experiment, comparing the effects of the presence versus absence of visual feedback when speech was produced with a bite block in place or without any oral obstructions. Participants produced /hVd/ words (heed, hid, head, had, hod, hood, who’d) in each condition, and changes in acoustic contrast and dispersion, as well as lower face magnitude of motion, were measured. Acoustic speech output was enhanced in the predicted direction: vowel contrast increased and vowel dispersion decreased during productions with visual feedback. This effect was greatest at the beginning of vowels and tended to be stronger during productions without the bite block. There was considerable inter-speaker variation for the lower face motion results. A subset of the participants tended to produce greater magnitudes of motion with visual feedback, especially for the non-high vowels.
In addition to a modest positive correlation between lower face motion and acoustic contrast, it was also noted that the subset of participants who produced greater magnitudes of motion with visual feedback tended to produce less acoustic contrast than the participants who produced smaller magnitudes of motion.

Overall these results support the hypothesis that visual feedback can enhance speech production. The fluency enhancement observed in Experiment 2 of the DAF experiments was more limited (i.e. it was only significant when additional predictors were added to the statistical model, and only for normal auditory feedback) than the acoustic contrast enhancement observed in the bite block experiment. This may be due in part to the different stimuli used in the experiments: participants repeated whole sentences in the DAF experiments and monosyllabic words in the bite block experiment. Limiting speech production to simple strings, such as a CVC syllable, instead of structurally complex sentences, can make it easier to observe the effects of experimental manipulations since there are fewer linguistic factors muddying the waters, so to speak. However, data from productions of whole utterances provide an important reminder that speech is a highly complex process, and in this context the effects of experimental manipulations will often play out in more subtle, and potentially even different, ways. For example, in the second DAF experiment, the visual feedback both enhanced, in terms of speech error reduction, and hampered, in terms of durational increases, speech production.

It is true that the analyses presented in this dissertation show that the effect visual feedback has on speech production is, overall, quite small. This is perhaps unsurprising given that visual feedback is not typically available during speech.
However, the results are reassuringly consistent with a general pattern that is emerging across different populations and methodological frameworks; namely, that visual speech information enhances not only perception, but also production. Visually stimulated speech production is more accurate (Reisberg et al., 1987; Scarbel et al., 2014), and more fluent in the case of aphasic patients (Fridriksson et al., 2015, 2012) and people who stutter (Kalinowski et al., 2000). When presented as feedback of one’s own productions, visual speech information improves fluency for stutterers (Snyder et al., 2009) and non-stutterers (Jones and Striemer, 2007). Similar fluency enhancing results were reported in Chapter 3, and the new finding that visual feedback also increases vowel contrast and reduces vowel dispersion was reported in Chapter 4. There is also evidence that when speakers are deprived of visual speech information during development, due to congenital blindness, the opposite patterns are found; in comparison to sighted speakers’ productions, vowel contrast is reduced and vowel dispersion is increased (Ménard et al., 2009, 2013). In addition, visible articulations exhibit smaller movements and non-visible articulations exhibit larger movements for blind compared to sighted speakers (Ménard et al., 2013). Finally, a recent brain imaging study shows that visually stimulated production recruits additional neural pathways beyond the typical auditory-motor pathway, providing a possible neural basis for these behavioral changes (Venezia et al., 2016).

Visual feedback can influence speech production, and the results presented in this dissertation have implications for theoretical issues in speech production research.
Two such issues, targets of production and multimodal feedback integration, are addressed in the next section.

5.3 Multimodal speech production

5.3.1 Targets of production

When a speech sound is produced, the speaker is presumed to be attempting to achieve a target, which is typically represented in some task space or coordinate frame. Research tends to have the goal of placing the target either in articulatory space (e.g. Browman and Goldstein, 1992) or in auditory/acoustic space (e.g. Stevens, 1989). Compensation for altered acoustic feedback (e.g. Houde and Jordan, 1998) and the trading relations between different articulations with the same acoustic consequences (e.g. Guenther et al., 1999) are argued to be evidence for auditory/acoustic targets. Compensation for altered somatosensory feedback (e.g. Tremblay et al., 2003) and articulators functioning as coordinative structures (e.g. Kelso et al., 1984) are argued to be evidence for articulatory targets. However, given the complexity of the speech processing system, it is unlikely that placing targets in only one of these task spaces could adequately capture the whole system. Models such as the perception for action control theory (PACT) (Schwartz et al., 2012) propose a middle ground between these two task spaces and aim to codify the relation between perception and production; in PACT, speech targets are described as “perceptually-shaped gestures.” This move towards more flexible models of speech production raises the question of whether there is a visual component to speech targets.

Recent work with aphasic and non-clinical populations provides support for visual speech targets and suggests that these targets may also involve dedicated neural pathways in the speech motor control architecture. Aphasic patients demonstrate increased fluency when speaking in time with an audiovisually presented recording of a speaker, compared to an audio-only presentation (Fridriksson et al., 2012).
This effect, referred to as speech entrainment, is observed in those patients who still have cortical motor areas and auditory-motor interface areas relatively intact (Fridriksson et al., 2015). The existence of a neural route for mapping visual targets to motor programs was proposed as an alternative to the auditory-motor pathway, based on the fact that enhanced fluency was only observed during audiovisual speech entrainment. Venezia et al. (2016) provided evidence for this pathway in healthy speakers. As described in Section 1.1.3, a comparison was made between the covert production of CV syllable strings that were produced in response to either audio-only, visual-only, or audiovisual stimuli. Neural regions involved in audio-only stimulated productions were activated to a greater extent by visual-only and audiovisually stimulated productions, and these latter productions also involved additional sensorimotor brain regions.

Of the additional regions activated, Venezia et al. (2016) note that the left posterior middle temporal gyrus is of particular interest. Not only did this region activate during covert rehearsal following visual and audiovisual stimuli, activation was also found during passive perception, especially for visual and audiovisual inputs. They propose that this region may be the site of “visual speech targets for production (i.e., high-level sensory representations of visual speech gestures)” (p. 204).

Visual speech gestures (specifically, the magnitude of motion of the lower face) were measured in the bite block experiment in Chapter 4. On the face of it, the results from this experiment offer only limited support for visual targets; speech produced with visual feedback only resulted in changes to the magnitude of lower face motion for five of the thirteen participants.
Of these five, four participants tended to produce non-high vowels with greater magnitudes when visual feedback was available, and one participant produced vowels with smaller magnitudes when visual feedback was available. But the acoustic and articulatory results as a whole raise a number of issues that would need to be taken into account in a model of speech motor control that combines speech targets from multiple task spaces, as proposed by Venezia et al. (2016, p. 197):

    “To be specific, we assume that the noted behavioral increases in speech output during or following exposure to audiovisual speech reflect the activation of a complementary set of visual speech targets (i.e., the visual patterns a talker is trying to produce) that combine with auditory speech targets to facilitate speech motor control.”

This issue is addressed in the next section in the context of multimodal feedback integration.

5.3.2 Multimodal feedback integration

The response to feedback and feedback manipulations has been used as evidence for the task space of targets, as described above. Feedback can not only be used to detect and correct perturbations, it can also be used to update forward models of motor control (Hickok, 2014). Venezia et al. (2016) suggest that a dedicated visuomotor pathway for speech motor control could be relevant for feedback: “One possibility is that feedback from visual speech is used to tune internal vocal tract control circuits in a similar fashion to auditory speech” (p. 205). In this section, this proposal for visual feedback and its integration with other modalities is explored in the context of Hickok and colleagues’ (Hickok, 2012, 2014; Hickok et al., 2011) State Feedback Control (SFC) model of speech processing. While Venezia et al. don’t specifically situate their proposal for visual targets within this model, the assumptions of this model are implicit in their discussion.
Implications for this proposal are discussed in light of the results presented in this dissertation, specifically the fact that the effects of visual feedback tended to be most clearly seen during unperturbed speech.

The SFC model integrates theoretical perspectives and experimental findings from speech perception, speech production, psycholinguistics, and clinical domains, with the aim of marrying auditory- and motor-centric models. The (Hierarchical) State Feedback Control (SFC) model (Hickok, 2012, 2014; Hickok et al., 2011) is an elaboration of previous work (e.g. Hickok and Poeppel’s (2004) dual stream hypothesis, and Ventura et al.’s (2009) evidence for the role of efference copies in speech motor control), providing a detailed proposal for the sensory-motor integration processes of the dorsal stream (Hickok and Poeppel, 2004).

In the SFC model, speech production begins with an intention to speak, which provides parallel input to motor and auditory phonological systems. This input activates two types of representation: a motor plan and associated sensory targets. System output comes from an articulatory controller that generates motor commands for the vocal tract, as well as sending a copy of the motor commands to an internal model of the vocal tract. This internal model is an estimate of the state of the vocal tract. The estimate is then transformed into a prediction of the sensory consequences of the motor command. The prediction is involved in two subsequent functions: fast internal monitoring and slow external monitoring. Internal monitoring considers whether the motor command that has been initiated will have the intended sensory consequences, while external monitoring considers whether the actual sensory consequences match the predicted sensory consequences.
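The two monitoring functions just described can be made concrete with a toy sketch. It is a generic predict-compare-correct loop over a single scalar "vocal tract state", not an implementation of the SFC model itself; the function name, the gain, the identity mapping from state to sensory prediction, and the additive feedback shift are all assumptions introduced purely for illustration.

```python
def simulate_feedback_loop(target, steps=60, gain=0.5, feedback_shift=0.0):
    """Toy scalar control loop (illustrative assumptions throughout).

    target         -- intended sensory consequence
    feedback_shift -- constant perturbation added to the sensory feedback
    Returns (actual_state, estimated_state) after the final step.
    """
    estimate = 0.0  # internal model: estimated vocal tract state
    actual = 0.0    # simulated true vocal tract state

    for _ in range(steps):
        # Internal monitoring: will the current plan reach the target?
        # The sensory prediction is assumed to equal the state estimate.
        internal_error = target - estimate
        command = gain * internal_error

        # The command drives the vocal tract, and a copy of it updates
        # the internal model (the efference copy).
        actual += command
        estimate += command

        # External monitoring: does the (possibly perturbed) feedback
        # match what the internal model predicted?
        observed = actual + feedback_shift
        external_error = observed - estimate

        # The mismatch corrects the internal estimate, which in turn
        # changes the next command.
        estimate += gain * external_error

    return actual, estimate


if __name__ == "__main__":
    print(simulate_feedback_loop(1.0))
    print(simulate_feedback_loop(1.0, feedback_shift=0.2))
```

In this toy version a constant feedback shift is eventually compensated in full, because the controller trusts the feedback completely; that is a property of the sketch and not a claim about speakers, whose compensation for altered feedback is typically partial.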
Error signals can be generated by either of these monitoring loops, providing corrective feedback to the motor controller via the internal model.

The most recent descriptions of this model include this control system at two levels: the higher level involves auditory targets at the syllable level and the lower level involves somatosensory targets at the phoneme level (Hierarchical State Feedback Control (HSFC); Hickok, 2012, 2014). However, this division is not absolute, as phoneme and syllable targets may be distributed across the two levels, and have different weightings, depending on the phoneme or syllable. For example, sibilants have clear auditory and somatosensory targets. These two levels are presumed to interact, with one possible function of this interaction being to fine tune forward predictions generated at one level with information from the other level. For example, Hickok (2012) suggests that information about the articulatory phase from the somatosensory level might enable the auditory prediction to be more accurate. The results of this dissertation suggest that visual feedback may have some influence on these different levels of the control system. For example, visual feedback enhanced acoustic vowel contrast and reduced acoustic vowel dispersion (Chapter 4); this could be interpreted as visual feedback contributing to the production of more accurate acoustic/auditory targets. How might this influence of visual feedback be implemented?

Venezia et al. (2016) propose that the sensorimotor integration of visual speech involves separate auditory-motor and visuo-motor speech pathways, and thus complementary sets of targets across the different modalities.
This proposal is consistent with their finding that neural regions in addition to the auditory-motor network were activated in response to visually stimulated production, and with findings that aphasic patients with damage to their auditory-motor network (restricted to the inferior frontal gyrus) show improved fluency when shadowing audiovisual speech compared to audio-only speech (Fridriksson et al., 2015, 2012). Venezia et al. consider, and reject, two other mechanisms for integrating visual and auditory speech signals; these are in line with proposals for visual speech integration which involve visual information increasing activation of the auditory-motor pathway (e.g. Calvert et al., 2000). The first possibility is that auditory and visual speech are first integrated, and this integrated representation serves as input to the dorsal stream, which is involved in the sensory-motor integration processes described in the SFC model. The second possibility is that visual speech is integrated directly in the dorsal stream (i.e. in the sensory-motor network rather than in the sensory system). These two possibilities are considered unlikely based on Venezia et al.’s results; while some brain regions responded to all types of input (audio-only, visual-only, audiovisual; although most strongly for the latter two), there were regions that showed greater activation during rehearsal when stimulated by visual-only or audiovisual signals, but not by audio-only inputs. These regions included the bilateral pre-central sulci and left central sulcus, the caudate nucleus, the inferior frontal gyrus, and the middle temporal gyrus.

Recall that while Venezia et al.’s (2016) proposal was based on results from visually stimulated speech production (specifically, the covert repetition of (audio)visual presentations of another speaker producing CV syllable strings), they suggest that the proposed visuo-motor pathway could be relevant for visual speech as feedback.
The results from this dissertation do provide further evidence of “behavioral increases in speech output during or following exposure to audiovisual speech” (Venezia et al., 2016, p. 197); the presence of visual feedback was associated with modest improvements in fluency (Chapter 3) and enhanced vowel contrasts (Chapter 4). One interesting aspect of these results is that the visual feedback effects were most often observed during unperturbed speech; that is, when auditory feedback was not delayed and there was no bite block. This has implications for the coordination of different feedback signals, in terms of temporal alignment requirements and the relative weight of feedback.

In the second DAF experiment (Section 3.4) the predicted reduction in speech errors occurred, but only when dynamic visual feedback was paired with normal auditory feedback, and not with DAF as had been predicted. A possible explanation for this is that temporal cohesion among the different types of feedback may be necessary in order for speech output to be enhanced.

A number of studies suggest that there is sensitivity to timing information during the process of detecting mismatches between predicted and reafferent feedback. The act of producing speech has been shown to result in a suppressed auditory cortical response, which is hypothesized to be due to an accurate match between the predicted and actual sensory consequences of speaking (Heinks-Maldonado et al., 2006). The production of faster, more rhythmically complex vowel sequences results in less suppression than single sustained vowels, indicative of a poorer match between the predicted and reafferent feedback, possibly due to timing discrepancies: “the auditory feedback predictions became more dynamic and more difficult to keep in temporal registry with the incoming auditory feedback” (Ventura et al., 2009, p. 5). Behroozmand et al.
(2011) confirmed that auditory-evoked responses are sensitive to timing mismatches caused by altering pitch feedback at different onset delays. Error detection was enhanced after the immediate onset of production, perhaps to accommodate delays in sensory feedback transmission. Later work showed that ERP responses are enhanced (i.e. the mismatch detection is more sensitive) when pitch perturbations occur earlier in the vocalization (Behroozmand et al., 2016). Sensitivity to variability in the production of vowels is also seen at the onset of vocalization; within the first 50 ms of the vowel, productions that are acoustically farther from the center of the vowel category’s distribution result in less suppression of the auditory-evoked response (i.e. an error response) (Niziolek et al., 2013). This error response is also correlated with a “corrective” change; these peripheral vowels become acoustically closer to the center of the vowel distribution by the middle of the vowel.

This work suggests that prediction errors are sensitive to timing mismatches, particularly at the beginning of an utterance and when more complex speech is used. While this work has only considered auditory feedback, it is likely that it would apply to other speech feedback channels, especially given that there is evidence for these temporal patterns of suppression response in other domains. For example, while self-produced touches result in cortical suppression in the cerebellum and reports of diminished ‘tickliness’, once a delay is introduced there is less suppression and participants progressively rate the touch as more ‘tickly’ (Blakemore et al., 2001, 2000). In perception, visual speech can be thought of as an anchor that improves a perceiver’s ability to align to a multisensory signal (Vatikiotis-Bateson and Munhall, 2015).
In the context of feedback, the presence of multiple signals may also have an anchoring effect stemming from the fact that all sources of feedback are generated from the same event of speaking. The combination of this effect and the sensitivity to timing mismatches between predicted and reafferent feedback within a given feedback modality may lead to less tolerance of timing mismatches between multiple feedback signals. In the context of the HSFC model, this temporal alignment would be important in order for feedback from the different levels to be able to fine-tune predictions. In the context of visual feedback, it is the time-varying properties of the signal that establish its compatibility with other sources of feedback, adding a further need for temporal cohesion.

The results from the bite block experiment (Chapter 4) also showed that visual feedback tended to have a stronger effect (in terms of enhanced acoustic contrast and diminished vowel dispersion) on unperturbed speech. The possibility that this was due to physical limitations introduced by the bite block was discussed in Section 4.5, although this possibility was considered unlikely. An alternative account is presented here.

In the Venezia et al. (2016) model there are complementary sets of auditory and visual targets, and the visual targets are more strongly activated when speech is produced in the context of visual stimulation. The bite block results suggest that there is more that needs to be taken into account; while visual targets may have been activated with visual feedback, when the bite block was in place the speech-enhancing effects of visual feedback were diminished. One way to think of this is in terms of different weightings; somatosensory feedback became more heavily weighted than visual feedback when there was an oral perturbation to counteract. Given that one rarely sees one’s mouth, it is perhaps unsurprising that somatosensory feedback is prioritized in this particular context.
Haggard and de Boer (2014, p. 470) compare this lack of oral visual experience to the contributions made by somatosensory and visual feedback during manual tasks: “The somatosensory innervation of the hand, although very rich, normally remains subservient to vision. [...] In contrast, within the mouth, somatosensation rules.”

However, the more extensive experience with somatosensory feedback is not enough to block the effects of visual feedback. And in the context of speech, it is not clear that any one feedback modality can be said to “rule” the others. Recent work suggests that individuals have modality preferences, for both feedback received during production (e.g. Lametti et al., 2012) and speech signals during perception (e.g. Gick et al., 2008). For example, using a perturbation paradigm, Lametti et al. (2012) simultaneously manipulated both auditory feedback (downward shift of F1) and somatosensory feedback (mechanical altering of jaw displacement). A negative correlation between the amount of compensation for each perturbation was found; participants who compensated more for the somatosensory perturbation compensated less for the acoustic perturbation. Some participants were also observed to only compensate for one type of perturbation. Related to this are the results from Chapter 3 and from Jones and Striemer (2007) showing that the visual feedback effect depends on how disrupted a person is by DAF (i.e. to what degree auditory feedback is preferred).
Additionally, the correlations between lower face magnitude of motion and acoustic contrast in Chapter 4 suggest that some speakers respond to visual feedback by enhancing acoustic contrast and others respond by enhancing part of their articulations; these different responses could be driven by an individual’s modality preference.

5.3.3 Summary

Contrary to the initial predictions of this dissertation research, visual feedback tended to have a stronger effect on unperturbed speech than perturbed speech. While it is the case that speakers can use visual information in difficult listening conditions (e.g. Navarra and Soto-Faraco, 2007; Sumby and Pollack, 1954) and speaking conditions (e.g. Fridriksson et al., 2012; Kalinowski et al., 2000), there are additional factors that need to be considered for the integration of feedback signals during production. The demands of counteracting an oral perturbation may have the effect of increasing the weighting of somatosensory feedback, thus reducing visual feedback effects. Delaying auditory feedback may interfere with temporal cohesion among the feedback signals, with timing errors overriding visual feedback effects. In a model of speech processing such as Hickok’s (2012; 2014) HSFC model, in which there would be forward control systems for auditory, somatosensory, and visual feedback, these factors are relevant to the manner in which the different levels would interact. More broadly, the implication of these results for speech targets is that domains of representation need not be rigidly defined. There is evidence for both sensorimotor and auditory/acoustic targets, and growing evidence for visual targets as well.

5.4 Future work

This research into the effects of visual feedback on production is related to more general questions concerning the relation between production and perception.
This relation can be between a speaker and a listener; for example, Pickering and Garrod’s (2004) interactive alignment account relies on a tight coupling between production and perception to facilitate dialogue. But often this relation is conceptualized as being relevant to individual speaker-listeners, as in motor theories of speech perception, which hypothesize that what is being perceived is speech gestures (e.g. Liberman and Mattingly, 1985). Perkell and colleagues (2004a; 2004b) assume the opposite hypothesis; namely, that perception drives production. They investigated this issue by looking at the relationship between speakers’ perceptual acuity for vowel and sibilant contrasts and their production distinctness for these contrasts. Those speakers who had higher perceptual discrimination scores were found to also produce greater contrasts, measured in terms of tongue body movements and F1xF2 acoustic space for the vowels, and in terms of linguo-dental contact in the sublingual cavity and acoustic center of gravity for the sibilants.

Two studies involving visual information complicate this account. In a vowel discrimination task, Ménard et al. (2009) found that blind participants had higher discrimination scores than sighted participants for vowel contrasts, but sighted speakers produced a more contrastive vowel space than blind speakers. Given Perkell et al.’s (2004a) proposal, one would expect the blind speakers’ vowel space to be more contrastive in response to their greater perceptual acuity. Ménard et al. (2009) suggest that the effects of visual deprivation on production override the effects of perceptual acuity. Alternatively, it is possible that blind speakers do have greater perceptual acuity, but along dimensions not tested by the experimental manipulations.

Findings from a study comparing speech-reading abilities of oneself versus others also raise questions about Perkell et al.’s (2004a) account.
In line with research showing sensitivity to self-generated biological motion (e.g. Knoblich and Flach, 2001), Tye-Murray et al. (2013) demonstrated that perceivers are better at speech-reading videos of their own speech compared to the speech of others. Participants were pre-recorded reading a large list of sentences, a subset of which was then used as stimuli in a speech-reading test. They were significantly better at speech-reading themselves than other people, independent of their general lipreading ability. As part of the analysis, Tye-Murray et al. (2013) noted that there was no correlation between participants’ general speech-reading ability and how well they were speech-read by other participants. If perception drives production, we would expect to see a positive correlation here: the better a person is at speech-reading (due to high perceptual acuity), the better they are at being speech-read by others (due to high production distinctness). Admittedly, the goal of Tye-Murray et al.’s study was not to test this perception-production link, so it is possible that the finding is due simply to the particular experimental setup, which was testing another question. However, this finding, along with that from Ménard et al. (2009), suggests that there are more factors to be considered in describing the perception-production link in the context of audiovisual speech processing.

Thus the general question for future research concerns whether there is a relation between how sensitive a perceiver is to visual speech information and how visually contrastive they are as a speaker.
The fact that the results reported in Chapter 4 showed that some speakers produced greater magnitudes of lower face motion in response to visual feedback suggests that there may be a positive correlation between visual perception and production, at least for some speakers.

The investigation of visual feedback can contribute a new perspective on old questions like the relation between production and perception. Exploring the relation between speaker and listener in the visual domain, in addition to the auditory and articulatory domains, is particularly germane to this question given the highly visible nature of much face-to-face communication. But visual information is also relevant to the link between production and perception for individual speaker-listeners. Not only is there well established work showing the importance of visual speech information for perception, but there is a growing body of research, of which this dissertation is a part, showing that visual information, including visual feedback of oneself, can also affect speech production.

Bibliography

Adler-Bock, M., Bernhardt, B. M., Gick, B., and Bacsfalvi, P. (2007). The use of ultrasound in remediation of North American English /r/ in 2 adolescents. American Journal of Speech-Language Pathology, 16(2):128–139.
Alsius, A., Möttönen, R., Sams, M. E., Soto-Faraco, S., and Tiippana, K. (2014). Effect of attentional load on audiovisual speech perception: evidence from ERPs. Language Sciences, 5:727.
Alsius, A., Navarra, J., Campbell, R., and Soto-Faraco, S. (2005). Audiovisual Integration of Speech Falters under High Attention Demands. Current Biology, 15(9):839–843.
Attanasio, J. S. (1987). Relationships between oral sensory feedback skills and adaptation to delayed auditory feedback. Journal of Communication Disorders, 20(5):391–402.
Baayen, R. H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press, New York.
Baayen, R. H., Davidson, D. J., and Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4):390–412.
Barbosa, A. V., Yehia, H. C., and Vatikiotis-Bateson, E. (2008a). Linguistically valid movement behavior measured non-invasively. In Göcke, R., Lucey, P., and Lucey, S., editors, Proceedings of the International Conference on Auditory-Visual Speech Processing – AVSP 2008, pages 173–177, Tangalooma, Australia.
Barbosa, A. V., Yehia, H. C., and Vatikiotis-Bateson, E. (2008b). Temporal Characterization of Auditory-Visual Coupling in Speech. In Proceedings of Meetings on Acoustics, volume 1, pages 1–14.
Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255–278.
Bates, D. M., Maechler, M., Bolker, B., and Walker, S. (2014). lme4: Linear mixed-effects models using Eigen and S4.
Behroozmand, R., Liu, H., and Larson, C. R. (2011). Time-dependent neural processing of auditory feedback during voice pitch error detection. Journal of Cognitive Neuroscience, 23(5):1205–1217.
Behroozmand, R., Sangtian, S., Korzyukov, O., and Larson, C. R. (2016). A temporal predictive code for voice motor control: Evidence from ERP and behavioral responses to pitch-shifted auditory feedback. Brain Research.
Bertelson, P., Vroomen, J., and de Gelder, B. (2003). Visual Recalibration of Auditory Speech Identification: A McGurk Aftereffect. Psychological Science, 14(6):592–597.
Black, J. W. (1951). The effect of delayed sidetone upon vocal rate and intensity. Journal of Speech and Hearing Disorders, 16:56–60.
Blakemore, S. J., Frith, C. D., and Wolpert, D. M. (2001). The cerebellum is involved in predicting the sensory consequences of action. Neuroreport, 12(9):1879–1884.
Blakemore, S.-J., Wolpert, D. M., and Frith, C. (2000). Why can’t you tickle yourself? NeuroReport, 11(11):R11–R16.
Boersma, P. and Weenink, D. (2009). Praat: Doing phonetics by computer.
Boersma, P. and Weenink, D. (2014). Praat: Doing phonetics by computer.
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H., and White, J.-S. S. (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3):127–135.
Borden, G. J. (1979). An interpretation of research on feedback interruption in speech. Brain and Language, 7(3):307–319.
Bosker, H. R., Pinget, A.-F., Quené, H., Sanders, T., and de Jong, N. H. (2013). What makes speech sound fluent? The contributions of pauses, speed and repairs. Language Testing, 30(2):159–175.
Bosker, H. R., Quené, H., Sanders, T., and de Jong, N. H. (2014). The Perception of Fluency in Native and Nonnative Speech. Language Learning, 64(3):579–614.
Bradlow, A. R. (1995). A comparative acoustic study of English and Spanish vowels. The Journal of the Acoustical Society of America, 97(3):1916–1924.
Browman, C. and Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49:155–180.
Brugos, A. and Shattuck-Hufnagel, S. (2012). A proposal for labelling prosodic disfluencies in ToBI (poster). Stuttgart, Germany.
Buchan, J. N. and Munhall, K. G. (2012). The Effect of a Concurrent Working Memory Task and Temporal Offsets on the Integration of Auditory and Visual Speech Information. Seeing & Perceiving, 25(1):87–106.
Burnett, T. A., Freedland, M. B., Larson, C. R., and Hain, T. C. (1998). Voice F0 responses to manipulations in pitch feedback. The Journal of the Acoustical Society of America, 103(6):3153–3161.
Cai, S., Ghosh, S. S., Guenther, F. H., and Perkell, J. S. (2011).
Focal manipulations of formant trajectories reveal a role of auditory feedback in the online control of both within-syllable and between-syllable speech timing. The Journal of Neuroscience, 31(45):16483–90.
Calvert, G. A. and Campbell, R. (2003). Reading speech from still and moving faces: the neural substrates of visible speech. Journal of Cognitive Neuroscience, 15(1):57–70.
Calvert, G. A., Campbell, R., and Brammer, M. J. (2000). Evidence from functional magnetic resonance imaging of crossmodal binding in the human heteromodal cortex. Current Biology, 10(11):649–657.
Campbell, R. (1992). The Neuropsychology of Lipreading. Philosophical Transactions: Biological Sciences, 335(1273):39–45.
Campbell, R. (1996). Seeing speech in space and time: Psychological and neurological findings. In Proceedings of the 4th International Conference on Spoken Language Processing, Philadelphia.
Campbell, R., Zihl, J., Massaro, D., Munhall, K., and Cohen, M. M. (1997). Speechreading in the akinetopsic patient, L.M. Brain, 120(10):1793–1803.
Cattaneo, L. and Pavesi, G. (2014). The facial motor system. Neuroscience & Biobehavioral Reviews, 38:135–159.
Chesters, J., Baghai-Ravary, L., and Möttönen, R. (2015). The effects of delayed auditory and visual feedback on speech production. The Journal of the Acoustical Society of America, 137(2):873–883.
Chon, H., Kraft, S. J., Zhang, J., Loucks, T., and Ambrose, N. G. (2013). Individual Variability in Delayed Auditory Feedback Effects on Speech Fluency and Rate in Normally Fluent Adults. Journal of Speech Language and Hearing Research, 56(2):489–504.
Corey, D. M. and Cuddapah, V. A. (2008). Delayed auditory feedback effects during reading and conversation tasks: Gender differences in fluent adults. Journal of Fluency Disorders, 33(4):291–305.
de Bot, K. (1984).
Visual Feedback of Intonation I: Effectiveness and induced practice behavior. Language & Speech, 26(4):331–350.
de Gelder, B., Vroomen, J., and Bachoud-Levi, A.-C. (1998). Impaired speechreading and audio-visual speech integration in prosopagnosia. In Campbell, R., Dodd, B., and Burnham, D., editors, Hearing by Eye II: Advances in Psychology of Speechreading and Audio-Visual Speech, pages 195–207. Psychology Press, Hove.
Desmurget, M. and Grafton, S. (2000). Forward modeling allows feedback control for fast reaching movements. Trends in Cognitive Sciences, 4(11):423–431.
Diehl, R. L. and Kluender, K. R. (1989). On the objects of perception. Ecological Psychology, 1(2):121–144.
Doherty-Sneddon, G. and Phelps, F. G. (2005). Gaze aversion: A response to cognitive or social difficulty? Memory & Cognition, 33(4):727–733.
Durso, F. T. (1984). A subroutine for counterbalanced assignment of stimuli to conditions. Behavior Research Methods, Instruments, & Computers, 16(5):471–472.
Fabbro, F. and Darò, V. (1995). Delayed auditory feedback in polyglot simultaneous interpreters. Brain and Language, 48:309–319.
Feng, Y., Gracco, V. L., and Max, L. (2011). Integration of auditory and somatosensory error signals in the neural control of speech movements. Journal of Neurophysiology, 106(2):667–679.
Flemming, E. S. (1995). Auditory representations in phonology. PhD thesis, UCLA.
Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14:3–28.
Fowler, C. A. and Dekle, D. J. (1991). Listening with eye and hand: Cross-modal contributions to speech perception. Journal of Experimental Psychology: Human Perception and Performance, 17(3):816–828.
Fowler, C. A., Rubin, P., Remez, R. E., and Turvey, M. T. (1980). Implications for speech production of a general theory of action.
In Butterworth, B., editor, Language production, pages 373–420. Academic Press.
Fowler, C. A. and Turvey, M. T. (1980). Immediate Compensation in Bite-Block Speech. Phonetica, 37(5-6):306–326.
Fridriksson, J., Basilakos, A., Hickok, G., Bonilha, L., and Rorden, C. (2015). Speech entrainment compensates for Broca’s area damage. Cortex, 69:68–75.
Fridriksson, J., Hubbard, H. I., Hudspeth, S. G., Holland, A. L., Bonilha, L., Fromm, D., and Rorden, C. (2012). Speech entrainment enables patients with Broca’s aphasia to produce fluent speech. Brain, 135(12):3815–3829.
Fromkin, V. (1964). Lip Positions in American English Vowels. Language and Speech, 7(4):215–225.
Fuhrman, R. (2014). Vocal Effort and Within-Speaker Coordination in Speech Production: Effects on Postural Control. PhD thesis, The University of British Columbia.
Gay, T., Lindblom, B., and Lubker, J. (1981). Production of bite-block vowels: Acoustic equivalence by selective compensation. The Journal of the Acoustical Society of America, 69(3):802–810.
Gick, B. and Derrick, D. (2009). Aero-tactile integration in speech perception. Nature, 462(7272):502–504.
Gick, B., Jóhannsdóttir, K. M., Gibraiel, D., and Mühlbauer, J. (2008). Tactile enhancement of auditory and visual speech perception in untrained perceivers. The Journal of the Acoustical Society of America, 123(4):EL72–EL76.
Gilbert, J. L., Lansing, C. R., and Garnsey, S. M. (2012). Seeing facial motion affects auditory processing in noise. Attention, Perception, & Psychophysics, 74(8):1761–1781.
Goldiamond, I., Atkinson, C. J., and Bilger, R. C. (1962). Stabilization of Behavior and Prolonged Exposure to Delayed Auditory Feedback. Science, 135(3502):437–438.
Green, K. P., Kuhl, P. K., Meltzoff, A. N., and Stevens, E. B. (1991).
Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50(6):524–536.
Guenther, F. H., Espy-Wilson, C. Y., Boyce, S. E., Matthies, M. L., Zandipour, M., and Perkell, J. S. (1999). Articulatory tradeoffs reduce acoustic variability during American English /r/ production. Journal of the Acoustical Society of America, 105(5):2854–2865.
Guntupalli, V. K., Nanjundeswaran, C., Kalinowski, J., and Dayalu, V. N. (2011). The effect of static and dynamic visual gestures on stuttering inhibition. Neuroscience Letters, 492(1):39–42.
Haggard, P. and de Boer, L. (2014). Oral somatosensory awareness. Neuroscience & Biobehavioral Reviews, 47:469–484.
Hall, K. C., Allen, C., McMullin, K., Letawsky, V., and Turner, A. (2015). Measuring magnitude of tongue movement for vowel height and backness. In Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow, UK.
Heald, S. L. M. and Nusbaum, H. C. (2015). Variability in Vowel Production within and between Days. PLoS ONE, 10(9).
Heinks-Maldonado, T. H., Nagarajan, S. S., and Houde, J. F. (2006). Magnetoencephalographic evidence for a precise forward model in speech production. Neuroreport, 17(13):1375–1379.
Hickok, G. (2009). Eight Problems for the Mirror Neuron Theory of Action Understanding in Monkeys and Humans. Journal of Cognitive Neuroscience, 21(7):1229–1243.
Hickok, G. (2012). Computational neuroanatomy of speech production. Nature Reviews Neuroscience, 13(2):135–145.
Hickok, G. (2014). The architecture of speech production and the role of the phoneme in speech processing. Language Cognition and Neuroscience, 29(1):2–20.
Hickok, G., Houde, J., and Rong, F. (2011). Sensorimotor Integration in Speech Processing: Computational Basis and Neural Organization.
Neuron, 69(3):407–422.
Hickok, G. and Poeppel, D. (2004). Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition, 92(1–2):67–99.
Hillenbrand, J., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5):3099–3111.
Hood, L. J. (1998). An overview of neural function and feedback control in human communication. Journal of Communication Disorders, 31(6):461–470.
Horn, B. K. P. and Schunck, B. G. (1981). Determining Optical Flow. Artificial Intelligence, 17:185–203.
Hothorn, T., Bretz, F., and Westfall, P. (2008). Simultaneous Inference in General Parametric Models. Biometric Journal, 50(3):346–363.
Houde, J. F. and Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279:1213–1216.
Howell, P. and Archer, A. (1984). Susceptibility to the effects of delayed auditory feedback. Perception & Psychophysics, 36(3):296–302.
Howell, P., Powell, D. J., and Khan, I. (1983). Amplitude contour of the delayed signal and interference in delayed auditory feedback tasks. Journal of Experimental Psychology: Human Perception and Performance, 9(5):772–784.
Hudock, D., Dayalu, V. N., Saltuklaroglu, T., Stuart, A., Zhang, J., and Kalinowski, J. (2011). Stuttering inhibition via visual feedback at normal and fast speech rates. International Journal of Language & Communication Disorders, 46(2):169–178.
Irwin, J. R., Whalen, D. H., and Fowler, C. A. (2006). A sex difference in visual influence on heard speech. Perception & Psychophysics, 68(4):582–592.
Ito, T. and Ostry, D. J. (2010). Somatosensory Contribution to Motor Learning Due to Facial Skin Deformation. Journal of Neurophysiology, 104(3):1230–1238.
Johnson, K., Ladefoged, P., and Lindau, M. (1993).
Individual differences in vowel production. The Journal of the Acoustical Society of America, 94(2):701–714.
Jones, J. A. and Jarick, M. (2006). Multisensory integration of speech signals: the relationship between space and time. Experimental Brain Research, 174(3):588–594.
Jones, J. A. and Munhall, K. G. (1997). The effects of separating auditory and visual sources on audiovisual integration of speech. Canadian Acoustics, 25(4):13–19.
Jones, J. A. and Striemer, D. (2007). Speech disruption during delayed auditory feedback with simultaneous visual feedback. The Journal of the Acoustical Society of America, 122(4):EL135–EL141.
Kalinowski, J., Stuart, A., Rastatter, M. P., Snyder, G., and Dayalu, V. (2000). Inducement of fluent speech in persons who stutter via visual choral speech. Neuroscience Letters, 281(2-3):198–200.
Karlsson, F. and van Doorn, J. (2012). Vowel formant dispersion as a measure of articulation proficiency. The Journal of the Acoustical Society of America, 132(4):2633–2641.
Kartushina, N., Hervais-Adelman, A., Frauenfelder, U. H., and Golestani, N. (2015). The effect of phonetic production training with visual feedback on the perception and production of foreign speech sounds. The Journal of the Acoustical Society of America, 138(2):817–832.
Katseff, S., Houde, J., and Johnson, K. (2012). Partial Compensation for Altered Auditory Feedback: A Tradeoff with Somatosensory Feedback? Language and Speech, 55(2):295–308.
Katz, D. I. and Lackner, J. R. (1977). Adaptation to delayed auditory feedback. Perception & Psychophysics, 22(5):476–486.
Katz, W. F. and Mehta, S. (2015). Visual feedback of tongue movement for novel speech sound learning. Frontiers in Human Neuroscience, 9:612.
Kelso, J. A. S., Vatikiotis-Bateson, E., Tuller, B., and Fowler, C. A.
(1984). Functionally specific articulatory cooperation following jaw perturbations during speech: Evidence for coordinative structures. Journal of Experimental Psychology: Human Perception and Performance, 10(6):812–832. → pages 8, 127
Kim, J. and Davis, C. (2014). How visual timing and form information affect speech and non-speech processing. Brain and Language, 137:86–90. → pages 54
Kleeman, A. (2016). Seeing Double. The New Yorker. → pages 14
Knoblich, G. and Flach, R. (2001). Predicting the Effects of Actions: Interactions of Perception and Action. Psychological Science, 12(6):467–472. → pages 135
Labov, W., Ash, S., and Boberg, C. (2006). The Atlas of North American English: Phonetics, Phonology and Sound Change. Mouton de Gruyter, New York. → pages 120
Ladefoged, P. (2006). A Course in Phonetics. Harcourt Brace College Publishers, Fort Worth, 5th edition. → pages x, 85
Lametti, D. R., Nasir, S. M., and Ostry, D. J. (2012). Sensory Preference in Speech Production Revealed by Simultaneous Alteration of Auditory and Somatosensory Feedback. The Journal of Neuroscience, 32(27):9351–9358. → pages 3, 133
Landis, J. R. and Koch, G. G. (1977). The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1):159–174. → pages 44, 59
Lane, H., Denny, M., Guenther, F. H., Matthies, M. L., Ménard, L., Perkell, J. S., Stockmann, E., Tiede, M., Vick, J., and Zandipour, M. (2005). Effects of bite blocks and hearing status on vowel production. The Journal of the Acoustical Society of America, 118(3):1636–1646. → pages 3, 78, 82, 83, 89, 90, 91, 117, 118, 119
Lee, B. S. (1950). Effects of Delayed Speech Feedback. The Journal of the Acoustical Society of America, 22(6):824–826. → pages 31, 33
Liberman, A. M. and Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21(1):1–36. → pages 15, 135
Lickley, E. (2014). Fluency and disfluency. In Redford, M., editor, Handbook of Speech Production. Wiley-Blackwell, Oxford. → pages 35, 36
Loucks, T. M. J. and De Nil, L. F.
(2006). Oral kinesthetic deficit in adults who stutter: A target-accuracy study. Journal of Motor Behavior, 38(3):238–246. → pages 2
Massaro, D. W., Bigler, S., Chen, T. H., Perlman, M., and Ouni, S. (2008). Pronunciation training: the role of eye and ear. In Proceedings of Interspeech, pages 2623–2626. → pages 81
McFarland, D. H. and Baum, S. R. (1995). Incomplete compensation to articulatory perturbation. The Journal of the Acoustical Society of America, 97(3):1865–1873. → pages 3, 82, 119
McFarland, D. H., Baum, S. R., and Chabot, C. (1996). Speech compensation to structural modifications of the oral cavity. The Journal of the Acoustical Society of America, 100(2):1093–1104. → pages 3
McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264(5588):746–748. → pages 1, 10
Ménard, L. (2015). Multimodal Speech Production. In Redford, M. A., editor, The Handbook of Speech Production, pages 200–221. John Wiley & Sons, Inc., UK, 1st edition. → pages 80
Ménard, L., Dupont, S., Baum, S. R., and Aubin, J. (2009). Production and perception of French vowels by congenitally blind adults and sighted adults. Journal of the Acoustical Society of America, 126(3):1406–1414. → pages 12, 80, 83, 116, 126, 135, 136
Ménard, L., Leclerc, A., Brisebois, A., Aubin, J., and Brasseur, A. (2008). Production and perception of French vowels by blind and sighted speakers. In Sock, R., Fuchs, S., and Laprie, Y., editors, International Speech Production Seminar 2008 Proceedings, pages 197–200, Strasbourg, France. → pages 83
Ménard, L., Polak, M., Denny, M., Burton, E., Lane, H., Matthies, M. L., Marrone, N., Perkell, J. S., Tiede, M., and Vick, J. (2007). Interactions of speaking condition and auditory feedback on vowel production in postlingually deaf adults with cochlear implants. The Journal of the Acoustical Society of America, 121(6):3790–3801. → pages 119
Ménard, L., Toupin, C., Baum, S. R., Drouin, S., Aubin, J., and Tiede, M.
(2013). Acoustic and articulatory analysis of French vowels produced by congenitally blind adults and sighted adults. The Journal of the Acoustical Society of America, 134(4):2975–2987. → pages 12, 80, 81, 83, 84, 116, 121, 126
Ménard, L., Turgeon, C., Trudeau-Fisette, P., and Bellavance-Courtemanche, M. (2015). Effects of blindness on production–perception relationships: Compensation strategies for a lip-tube perturbation of the French [u]. Clinical Linguistics & Phonetics, 0(0):1–22. → pages 83
Miller, A. J. (2002). Oral and Pharyngeal Reflexes in the Mammalian Nervous System: Their Diverse Range in Complexity and the Pivotal Role of the Tongue. Critical Reviews in Oral Biology & Medicine, 13(5):409–425. → pages 3
Montgomery, A. A. and Jackson, P. L. (1983). Physical characteristics of the lips underlying vowel lipreading performance. The Journal of the Acoustical Society of America, 73(6):2134–2144. → pages 83
Munhall, K. G., Gribble, P., Sacco, L., and Ward, M. (1996). Temporal constraints on the McGurk effect. Perception & Psychophysics, 58(3):351–362. → pages 10, 70
Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., and Vatikiotis-Bateson, E. (2004a). Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychological Science, 15(2):133–137. → pages 10
Munhall, K. G., Kroos, C., Jozan, G., and Vatikiotis-Bateson, E. (2004b). Spatial frequency requirements for audiovisual speech perception. Perception & Psychophysics, 66(4):574–583. → pages 10
Munhall, K. G., Servos, P., Santi, A., and Goodale, M. A. (2002). Dynamic visual speech perception in a patient with visual form agnosia. Neuroreport, 13(14):1793–1796. → pages 56
Namasivayam, A. K., van Lieshout, P., McIlroy, W. E., and De Nil, L. (2009). Sensory feedback dependence hypothesis in persons who stutter. Human Movement Science, 28(6):688–707. → pages 3
Nasir, S. M. and Ostry, D. J. (2008). Speech motor learning in profoundly deaf adults. Nature Neuroscience, 11(10):1217–1222.
→ pages 3
Navarra, J. and Soto-Faraco, S. (2007). Hearing lips in a second language: visual articulatory information enables the perception of second language sounds. Psychological Research, 71(1):4–12. → pages 1, 81, 134
Niziolek, C. A., Nagarajan, S. S., and Houde, J. F. (2013). What Does Motor Efference Copy Represent? Evidence from Speech Production. The Journal of Neuroscience, 33(41):16110–16116. → pages 132
Oller, D. K. and Eilers, R. E. (1988). The Role of Audition in Infant Babbling. Child Development, 59(2):441–449. → pages 6
Perkell, J. S. (2012). Movement goals and feedback and feedforward control mechanisms in speech production. Journal of Neurolinguistics, 25(5):382–407. → pages 5, 7
Perkell, J. S., Guenther, F. H., Lane, H., Matthies, M. L., Stockmann, E., Tiede, M., and Zandipour, M. (2004a). The distinctness of speakers’ productions of vowel contrasts is related to their discrimination of the contrasts. Journal of the Acoustical Society of America, 116(4, Part 1):2338–2344. → pages 135
Perkell, J. S., Matthies, M. L., Tiede, M., Lane, H., Zandipour, M., Marrone, N., Stockmann, E., and Guenther, F. H. (2004b). The distinctness of speakers’ /s/-/sh/ contrast is related to their auditory discrimination and use of an articulatory saturation effect. Journal of Speech, Language and Hearing Research, 47(6):1259–1269. → pages 135
Pickering, M. J. and Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27:169–226. → pages 134
Préfontaine, Y., Kormos, J., and Johnson, D. E. (2016). How do utterance measures predict raters’ perceptions of fluency in French as a second language? Language Testing, 33(1):53–73. → pages 36
Purcell, D. W. and Munhall, K. G. (2006). Compensation following real-time manipulation of formants in isolated vowels. Journal of the Acoustical Society of America, 119(4):2288–2297. → pages 5
R Core Team (2014). R: A Language and Environment for Statistical Computing.
→ pages 23
Reisberg, D., McLean, J., and Goldfield, A. (1987). Easy to hear but hard to understand: A lip-reading advantage with intact auditory stimuli. In Dodd, B. and Campbell, R., editors, Hearing by Eye: The Psychology of Lip-Reading, pages 97–114. Lawrence Erlbaum Associates, London, UK. → pages 2, 12, 16, 71, 126
Rizzolatti, G., Fogassi, L., and Gallese, V. (2001). Neurophysiological mechanisms underlying the understanding and imitation of action. Nature Reviews Neuroscience, 2(9):661–670. → pages 15
Rosenblum, L. D. and Saldaña, H. M. (1992). Discrimination Tests of Visually Influenced Syllables. Perception & Psychophysics, 52(4):461–473. → pages 38, 55
Rosenblum, L. D. and Saldaña, H. M. (1996). An audiovisual test of kinematic primitives for visual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 22(2):318–331. → pages 38, 54
Rosenfelder, I., Fruehwald, J., Evanini, K., and Jiahong, Y. (2011). FAVE (Forced Alignment and Vowel Extraction). → pages 88
Rothauser, E., Chapman, W., Guttman, N., Nordby, K., Silbiger, H., Urbanek, G., and Weinstock, M. (1969). IEEE Recommended Practice for Speech Quality Measurements. IEEE Transactions on Audio and Electroacoustics, 17(3):225–246. → pages 39, 57
Sams, M., Möttönen, R., and Sihvonen, T. (2005). Seeing and hearing others and oneself talk. Cognitive Brain Research, 23(2–3):429–435. → pages 11
Sasisekaran, J. (2012). Effects of delayed auditory feedback on speech kinematics in fluent speakers. Perceptual and Motor Skills, 115(3):845–864. → pages 32
Savariaux, C., Perrier, P., and Orliaguet, J. P. (1995). Compensation strategies for the perturbation of the rounded vowel [u] using a lip tube: A study of the control space in speech production. The Journal of the Acoustical Society of America, 98(5):2428–2442. → pages 83
Scarbel, L., Beautemps, D., Schwartz, J.-L., and Sato, M. (2014). The shadow of a doubt? Evidence for perceptuo-motor linkage during auditory and audiovisual close-shadowing.
Frontiers in Psychology, 5:568. → pages 2, 13, 16, 126
Schwartz, J.-L., Basirat, A., Ménard, L., and Sato, M. (2012). The Perception-for-Action-Control Theory (PACT): A perceptuo-motor theory of speech perception. Journal of Neurolinguistics, 25(5):336–354. → pages 127
Schwartz, J.-L. and Savariaux, C. (2014). No, There Is No 150 ms Lead of Visual Speech on Auditory Speech, but a Range of Audiovisual Asynchronies Varying from Small Audio Lead to Large Audio Lag. PLoS Computational Biology, 10(7):e1003743. → pages 71
Skaug, H., Fournier, D., Bolker, B., Magnusson, A., and Nielsen, A. (2014). Generalized Linear Mixed Models using AD Model Builder. → pages 23, 98
Smith, C. R. (1975). Residual hearing and speech production in deaf children. Journal of Speech and Hearing Research, 18(4):795–811. → pages 6
Snyder, G. J., Hough, M. S., Blanchet, P., Ivy, L. J., and Waddell, D. (2009). The effects of self-generated synchronous and asynchronous visual speech feedback on overt stuttering frequency. Journal of Communication Disorders, 42(3):235–244. → pages 11, 12, 17, 36, 126
Sober, S. J. and Sabes, P. N. (2003). Multisensory integration during motor planning. Journal of Neuroscience, 23(18):6982–6992. → pages 16
Sooful, J. J. and Botha, E. C. (2001). An acoustic distance measure for automatic cross-language phoneme mapping. In Proceedings of the Twelfth Annual Symposium of the Pattern Recognition Association of South Africa. → pages 90
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17:3–45. → pages 127
Stuart, A., Kalinowski, J., Rastatter, M. P., and Lynch, K. (2002). Effect of delayed auditory feedback on normal speakers at two speech rates. The Journal of the Acoustical Society of America, 111(5):2237. → pages 31, 36, 74
Sumby, W. H. and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26(2):212–215. → pages 1, 9, 31, 81, 134
Tiede, M. K., Ito, T., and Ostry, D. J. (2006).
Compensatory response to unexpected jaw perturbation triggered by formant transitions during speech. In Proceedings of the Seventh International Seminar on Speech Production, Ubatuba, Brazil. → pages 5
Tourville, J. A., Reilly, K. J., and Guenther, F. H. (2008). Neural mechanisms underlying auditory feedback control of speech. NeuroImage, 39(3):1429–1443. → pages 5
Tremblay, S., Shiller, D. M., and Ostry, D. J. (2003). Somatosensory basis of speech production. Nature, 423(6942):866–869. → pages 1, 3, 8, 127
Tye-Murray, N. (1986). Visual feedback during speech production. The Journal of the Acoustical Society of America, 79(4):1169. → pages 11, 17, 34, 37, 41, 53, 57, 76
Tye-Murray, N., Spehar, B. P., Myerson, J., Hale, S., and Sommers, M. S. (2013). Reading your own lips: Common-coding theory and visual speech perception. Psychonomic Bulletin & Review, 20(1):115–119. → pages 135, 136
Vatikiotis-Bateson, E. and Munhall, K. (2015). Audiovisual speech processing: Something doesn’t add up. In Redford, M., editor, Handbook of Speech Production. Wiley-Blackwell, Oxford. → pages 10, 132
Vatikiotis-Bateson, E. and Munhall, K. G. (2012). Time-varying coordination in multisensory speech processing. In Stein, B. E., editor, The New Handbook of Multisensory Processing, pages 421–433. The MIT Press, Cambridge, MA. → pages 16
Vatikiotis-Bateson, E., Munhall, K. G., Kasahara, Y., Garcia, F., and Yehia, H. (1996). Characterizing audiovisual information during speech. In Bunnell, H. and Idsardi, W., editors, ICSLP 96 - Fourth International Conference on Spoken Language Processing, Proceedings, Vols 1-4, pages 1485–1488, New York. IEEE. → pages 9
Vatikiotis-Bateson, E. and Yehia, H. C. (2002). Speaking mode variability in multimodal speech production. IEEE Transactions on Neural Networks, 13(4):894–899. → pages 9
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, New York, fourth edition. → pages 98
Venezia, J.
H., Fillmore, P., Matchin, W., Lisette Isenberg, A., Hickok, G., and Fridriksson, J. (2016). Perception drives production across sensory modalities: A network for sensorimotor integration of visual speech. NeuroImage, 126:196–207. → pages 12, 13, 16, 126, 127, 128, 129, 130, 131, 133
Venkatagiri, H. S. (1980). The core disruption under delayed auditory feedback: Evidence from adaptation study. Journal of Communication Disorders, 13(5):365–371. → pages 33
Ventura, M. I., Nagarajan, S. S., and Houde, J. F. (2009). Speech target modulates speaking induced suppression in auditory cortex. BMC Neuroscience, 10:58. → pages 6, 129, 131
Watson, C. I. and Harrington, J. (1999). Acoustic evidence for dynamic formant trajectories in Australian English vowels. The Journal of the Acoustical Society of America, 106(1):458–468. → pages 121
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. → pages 23
Yates, A. J. (1963). Delayed auditory feedback. Psychological Bulletin, 60(3):213–232. → pages 8, 31, 32, 33, 40
Yehia, H., Rubin, P., and Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26(1–2):23–43. → pages 9
Yehia, H. C., Kuratate, T., and Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics, 30(3):555–568. → pages 9
Zeelenberg, R. and Pecher, D. (2014). A method for simultaneously counterbalancing condition order and assignment of stimulus materials to conditions. Behavior Research Methods. → pages 58
Zimmermann, G., Brown, C., Kelso, J. A. S., Hurtig, R., and Forrest, K. (1988). The association between acoustic and articulatory events in a delayed auditory-feedback paradigm. Journal of Phonetics, 16(4):437–451. → pages 7, 32, 52, 53, 68
Zuur, A. F., Ieno, E. N., Walker, N. J., Saveliev, A. A., and Smith, G. M. (2009). Mixed Effects Models and Extensions in Ecology with R. Statistics for Biology and Health. Springer, New York.
→ pages 24, 49

Appendix A
Stimuli

Sentence | Accepted Substitutions
A cup of sugar makes sweet fudge | “made” for “makes”
The doorknob was made of bright clean brass | “clear” for “clean”, “green” for “clean”, “glass” for “brass”
We need grain to keep our mules healthy | “meals” for “mules”
The plush chair leaned against the wall | “pushed” for “plush”
Bathe and relax in the cool green grass | “cold” for “cool”, “glass” for “grass”
Take two shares as a fair profit | “pick” for “take”
North winds bring colds and fevers | “cold” for “colds”
A gray mare walked before the colt | “great” for “gray”, “cold” for “colt”
Cap the jar with a tight brass cover | “glass” for “brass”, “bright” for “tight”, and “tap” for “cap”
The odor of spring makes young hearts jump | “order” for “odor”
They sliced the sausage thin with a knife
Take the winding path to reach the lake
Note closely the size of the gas tank
Wipe the grease off his dirty face
Mend the coat before you go out
The stray cat gave birth to kittens
The young girl gave no clear response
The meal was cooked before the bell rang
The frosty air passed through the coat
A saw is a tool used for making boards
The wagon moved on well oiled wheels
March the soldiers past the next hill
Place a rosebush near the porch steps
Both lost their lives in the raging storm
The cement had dried when he moved it
The fly made its way along the wall
Do that with a wooden stick
Live wires should be kept covered
The large house had hot water taps
It is hard to erase blue or red ink
The wreck occurred by the bank on Main Street
Fill the ink jar with sticky glue
Pack the records in a neat thin case
That move means the game is over
Glass will clink when struck by metal
It takes a lot of help to finish these
Mark the spot with a sign painted red
The fur of cats goes by many names
He asks no person to vouch for him
Go now and come here later
Soap can wash most dirt away
The bloom of the rose lasts a few days
Bottles hold four kinds of rum
He wheeled the bike past the winding road
Drop the ashes on the worn old rug
The desk and both chairs were painted tan
The way to save money is not to spend much
Shut the hatch before the waves push it in
Crack the walnut with your sharp side teeth
He offered proof in the form of a large chart
Send the stuff in a thick paper bag
A quart of milk is water for the most part
They told wild tales to frighten him
The three story house was built of stone
The poor boy missed the boat again
Be sure to set the lamp firmly in the hole
Pick a card and slip it under the pack
The first part of the plan needs changing
The mail comes in three batches per day
You cannot brew tea in a cold pot
* The crooked maze failed to fool the mouse
* A sash of gold silk will trim her dress
* Breakfast buns are fine with a hot drink
* Throw out the used paper cup and plate
^ The crunch of feet in the snow was the only sound
^ In the rear of the ground floor was a large passage
^ The man wore a feather in his felt hat
^ Hang tinsel from both branches
^ A round mat will cover the dull spot
^ Boards will warp unless kept dry

Table A.1: Stimuli used in the DAF experiments (Chapter 3). Alternate words which were accepted as substitutions during speech error coding are listed. (Sentences marked with an asterisk were not used in Experiment 2. Sentences marked with a caret were practice sentences.)

Word | Example Sentence
heed | The girl did not heed their warning.
she | She likes to go bowling.
hod | A hod is a tool used by builders.
sort | They need to sort the blocks.
who’d | Who’d have thought he’d win?
sheet | I put a fresh sheet on the bed.
hid | The boy hid from the bullies.
sore | My legs are sore after exercising.
hood | The hood of the car was hot.
seat | Take a seat by the window.
had | I wish I had some money.
shore | We walked along the shore.
head | His head was blocking my view.
see | Can you see the balloon?
short | The man was too short.
^ fought | The soldiers fought in the war.
^ gas | We need gas for the car.
^ heard | I heard you talking.
^ hide | The boy will hide the chocolate.
^ how’d | How’d you do in the exam?
^ leash | Put the leash on the dog.
^ mat | Please wipe your feet on the mat.
^ wet | We got wet during the storm.
^ hayed | The grass was hayed to make cattle feed.
^ hoyed | A passing stranger hoyed me.

Table A.2: Stimuli used in the bite block experiment (Chapter 4). The word list and example sentences were shown to participants prior to the experiment. (Words marked with a caret were practice words.)

